API Agents vs. GUI Agents: Divergence and Convergence

2503.11069

Chaoyun Zhang et al.

🤗 2025-03-17

TL;DR

Software agents built on Large Language Models (LLMs) follow two distinct paradigms: API-based and GUI-based. API-based agents call programmatic endpoints directly, while GUI-based agents operate the interface the way a human user would. Both automate tasks, but they differ significantly in design, development, and user interaction. This paper lays out the key differences and examines how the two paradigms may converge.

This paper offers the first complete study comparing these two types of LLM agents. It looks at their differences and convergence, focusing on important aspects and hybrid solutions. It provides guidelines and real-world examples to help researchers and developers choose, combine, or switch between these designs. The goal is to help create more flexible and adaptable solutions in real-world situations.

Key Takeaways

Why does it matter?

This paper is important for researchers because it systematically analyzes API-based and GUI-based LLM agents, offering clear decision criteria for selecting the right approach. It bridges the gap between theoretical understanding and practical application, guiding future research and development in LLM-driven automation and human-computer interaction, and addressing the evolving needs in software ecosystems.

Visual Insights

🔼 This figure illustrates the contrasting approaches of API-based and GUI-based LLM agents in scheduling a meeting. The API agent uses a single API call to directly create the event, highlighting its efficiency and reliance on well-defined interfaces. Conversely, the GUI agent mimics human actions, visually navigating the Google Calendar interface and interacting with its elements (clicking buttons, filling in text fields) before successfully scheduling the meeting. This demonstrates the GUI agent’s adaptability but also its slower, more error-prone approach.
Figure 1: The difference between an API agent and a GUI agent in completing the task “Schedule a 1-hour meeting on Google Calendar for LLM Agent at 4:00 PM on March 8”.
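
The contrast in Figure 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `FakeCalendar`, `api_agent_schedule`, and `gui_agent_schedule` are hypothetical names, and the GUI is simulated as a plain form dictionary rather than driven through screenshots. It only shows that the API path issues one structured call while the GUI path issues several user-like actions to reach the same result.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCalendar:
    """Hypothetical in-memory stand-in for a calendar backend."""
    events: list = field(default_factory=list)

    def create_event(self, title, start, minutes):
        self.events.append({"title": title, "start": start, "minutes": minutes})


def api_agent_schedule(cal, title, start, minutes):
    """API agent: one structured call completes the task."""
    cal.create_event(title, start, minutes)
    return 1  # number of actions issued


def gui_agent_schedule(cal, title, start, minutes):
    """GUI agent: a sequence of user-like actions on a simulated form."""
    actions = [
        ("click", "Create"),
        ("type", "title", title),
        ("type", "start", start),
        ("type", "minutes", minutes),
        ("click", "Save"),
    ]
    form = {}
    for action in actions:
        if action[0] == "type":
            form[action[1]] = action[2]  # fill one text field
        elif action == ("click", "Save"):
            # Only the final click commits the form to the backend.
            cal.create_event(form["title"], form["start"], form["minutes"])
    return len(actions)


api_cal, gui_cal = FakeCalendar(), FakeCalendar()
api_steps = api_agent_schedule(api_cal, "LLM Agent", "March 8, 4:00 PM", 60)
gui_steps = gui_agent_schedule(gui_cal, "LLM Agent", "March 8, 4:00 PM", 60)
print(api_steps, gui_steps, api_cal.events == gui_cal.events)
```

In a real GUI agent each of those actions would also require locating the target element on screen, which is where the extra latency and error risk come from.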

Dimension	API Agents	GUI Agents
Modality	Rely on text-based API calls	Depend on screenshots or accessibility trees
Reliability	Generally higher with well-defined endpoints	Lower due to visual parsing and layout changes
Efficiency	Achieve complex tasks in a single call	Require multiple user-like actions
Availability	Limited to published or pre-defined APIs	Can operate on any visible UI element
Flexibility	Constrained by existing APIs	Highly adaptable to new or unexposed features
Security	Manageable via granular endpoint controls	Riskier due to broad access to UI elements
Maintainability	Stable if APIs remain versioned	Prone to breakage on UI redesigns
Transparency	Often hidden, back-end driven	Step-by-step, visually traceable
Human-Like Interaction	Purely programmatic	Simulates user actions on a screen

🔼 This table provides a detailed comparison of API-based and GUI-based Large Language Model (LLM) agents across several key dimensions. These dimensions help to understand the strengths and weaknesses of each approach in terms of modality (how they interact with software), reliability (consistency of performance), efficiency (speed and resource usage), availability (accessibility of the tools or interfaces they use), flexibility (adaptability to changes and new tasks), security (protection against unauthorized access), maintainability (ease of updating and managing the agent), transparency (visibility into how the agent performs its work), and whether the interaction with the agent resembles human actions. This comparison is crucial for selecting the most appropriate agent type for specific applications.
Table 1: Comparison of API vs. GUI agents across key dimensions.

In-depth insights

API vs. GUI Agents

The strategic considerations for deploying API-based versus GUI-based agents depend on the nature of the target software, the level of integration or validation required, and long-term sustainability concerns. API agents excel when stable, documented endpoints exist, offering a reliable and performant mode of automation. GUI agents are advantageous in contexts where interfaces are the only means of access or where visual confirmation is essential. Finally, hybrid approaches strike a balance between these strengths, allowing organizations to adapt as their software ecosystems evolve. By taking these factors into account, decision-makers can ensure they select the agent paradigm—API, GUI, or both—that best aligns with their specific requirements.

Hybrid Approach

The “Hybrid Approach” represents a significant shift in how we conceptualize and implement agent-based automation, moving beyond the traditional dichotomy of API-based and GUI-based systems. It acknowledges that neither approach is universally superior and that the optimal solution often involves strategically combining their respective strengths. This necessitates a nuanced understanding of the task at hand, the capabilities of the underlying systems, and the desired user experience. The essence of the hybrid approach lies in its adaptability, allowing developers to tailor solutions that leverage the efficiency and reliability of APIs for data-intensive operations while utilizing the flexibility and human-like interaction of GUIs for tasks such as visual validation or legacy system integration. This convergence is facilitated by emerging technologies and frameworks. The potential benefits of a hybrid approach include broader coverage of use cases, enhanced efficiency, and a more intuitive user experience. However, realizing these benefits requires careful consideration of the trade-offs involved and a robust understanding of how to effectively orchestrate API and GUI interactions. Ultimately, the “Hybrid Approach” signifies a more mature and sophisticated approach to agent-based automation, paving the way for more versatile and powerful solutions.

Strategic Criteria

When stable, well-documented APIs exist, they strongly favor API agents, which offer speed, reliability, and controlled access while keeping maintenance low. GUI agents excel with legacy systems or limited API access, enabling automation without backend changes and allowing visual validation. Hybrid approaches blend both paradigms, using APIs for data-intensive tasks and GUIs for specialized front-end interactions. The choice depends on the required level of integration, the nature of the software, and long-term sustainability, and it should be tailored to the specific project's technical and user-facing requirements.

Divergence Factors

Divergence factors between API and GUI agents stem from their core interaction methods. API agents rely on structured, programmatic access, offering efficiency and security through predefined endpoints. However, this approach limits flexibility, as agents are confined to exposed functionalities. GUI agents, conversely, interact with software like humans, using visual or multimodal inputs. This grants broader applicability, automating tasks even without formal APIs, but introduces challenges in visual parsing, reliability, and maintainability due to interface changes. The modality of interaction dictates differences in efficiency, with API agents often completing tasks faster and with less overhead. However, GUI agents offer greater transparency and human-like interaction, simulating user actions on-screen. Security also diverges, as API agents provide granular control via endpoint security, while GUI agents risk unintended access to privileged operations.

LLM Integration

LLM integration represents a pivotal shift in software agent development, moving beyond traditional programming paradigms. API-based agents initially demonstrated the power of LLMs by automating tasks through direct interaction with software interfaces, offering efficiency and reliability. However, their limitations in flexibility and adaptability to evolving interfaces became apparent. GUI-based agents emerged as an alternative, leveraging LLMs’ multimodal capabilities to ‘see’ and interact with software interfaces like humans. While offering greater flexibility and accessibility, they face challenges in visual parsing and reliability. The trend is converging towards hybrid approaches, combining the strengths of both paradigms. This involves API wrappers for GUI workflows and unified orchestration tools that dynamically select the optimal interaction method. The strategic integration of LLMs necessitates careful consideration of the target software’s nature, required level of integration, and long-term maintainability. These considerations ensure that the selected paradigm aligns with specific requirements.

More visual insights

More on figures

🔼 This figure illustrates the key difference between API and GUI agents in how they receive input and produce output. API agents rely solely on text-based API calls as input and produce actions based on those calls. GUI agents, however, rely on visual or multimodal inputs such as application screenshots and use actions mimicking human interactions (mouse clicks, keyboard inputs) to manipulate the GUI and produce actions. This highlights the fundamental difference in modality and interaction between the two types of agents.
Figure 2: The difference between an API agent and a GUI agent in input and output.

🔼 This figure illustrates how an API wrapper can be used to interact with a GUI application. Instead of directly manipulating GUI elements, the API wrapper acts as an intermediary, translating high-level function calls into a sequence of GUI interactions (like clicks and text entry). This simplifies the process of automating GUI workflows by allowing developers to interact with the application using a more programmatic, API-based approach. The example shows an API wrapper handling the generation of a financial report, a task that might normally require multiple steps within the GUI. This shows the potential of API wrappers in allowing developers to leverage the functionality of GUI-based applications in a more manageable and streamlined manner.
Figure 3: An example of an API wrapper over a GUI workflow.
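
The wrapper idea can be sketched as a single function that hides a fixed sequence of GUI actions. This is a simplified sketch under stated assumptions: `FakeReportApp` is a hypothetical simulated application with two buttons and one text field, and `generate_financial_report` is an illustrative name, not an API from the paper or from any real tool.

```python
class FakeReportApp:
    """Hypothetical GUI-only application, simulated as a small state machine."""

    def __init__(self):
        self.screen = "home"
        self.fields = {}
        self.actions = []  # record of every GUI action performed

    def click(self, label):
        self.actions.append(("click", label))
        if label == "Reports" and self.screen == "home":
            self.screen = "reports"
        elif label == "Generate" and self.screen == "reports":
            self.screen = "result"
        else:
            raise RuntimeError(f"no button {label!r} on screen {self.screen!r}")

    def type_text(self, field_name, value):
        self.actions.append(("type", field_name, value))
        self.fields[field_name] = value

    def read_result(self):
        if self.screen != "result":
            raise RuntimeError("no report on screen")
        return f"Financial report for {self.fields['period']}"


def generate_financial_report(app, period):
    """API-style wrapper: one call expands into several GUI steps."""
    app.click("Reports")
    app.type_text("period", period)
    app.click("Generate")
    return app.read_result()


app = FakeReportApp()
report = generate_financial_report(app, "Q1 2025")
print(report, len(app.actions))
```

The caller sees one function, but the wrapper still depends on the button labels and screen flow, which is why a UI redesign can break it.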

🔼 This figure illustrates a unified orchestrator that manages both API and GUI actions. It shows a hybrid approach where an orchestrator decides whether to use API calls or GUI interactions depending on the task’s requirements and system capabilities. This orchestrator uses a workflow, and based on the input, it can make decisions to leverage both API and GUI agents for different parts of the workflow, combining their strengths.
Figure 4: An example of a unified orchestrator to manage both API and GUI actions.
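
One way to picture such an orchestrator is a router with two handler registries. The sketch below assumes a deliberately simple policy (use the API handler when one is registered, otherwise fall back to a GUI handler); the paper does not prescribe this rule, and `Orchestrator`, `fetch_orders_via_api`, and `export_chart_via_gui` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, Tuple

Handler = Callable[..., Any]


class Orchestrator:
    """Routes a task to an API handler if one is registered, else to a GUI handler."""

    def __init__(self, api_handlers: Dict[str, Handler], gui_handlers: Dict[str, Handler]):
        self.api_handlers = api_handlers
        self.gui_handlers = gui_handlers

    def run(self, task: str, **kwargs: Any) -> Tuple[str, Any]:
        if task in self.api_handlers:
            return "api", self.api_handlers[task](**kwargs)
        if task in self.gui_handlers:
            return "gui", self.gui_handlers[task](**kwargs)
        raise LookupError(f"no agent can handle task {task!r}")


def fetch_orders_via_api(customer: str) -> list:
    # Stand-in for a direct endpoint call.
    return [f"{customer}-order-1", f"{customer}-order-2"]


def export_chart_via_gui(name: str) -> str:
    # Stand-in for a click-through export that has no API equivalent.
    return f"{name}.png"


orchestrator = Orchestrator(
    api_handlers={"fetch_orders": fetch_orders_via_api},
    gui_handlers={"export_chart": export_chart_via_gui},
)
print(orchestrator.run("fetch_orders", customer="acme"))
print(orchestrator.run("export_chart", name="sales"))
```

The caller never states which agent type to use; that detail stays inside `run`, which is also where a more elaborate selection policy would live.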

🔼 Figure 5 illustrates a no-code platform’s workflow design incorporating both API calls and GUI agents. The workflow visually represents the various stages of an order processing system. It starts with an ‘Order Received’ event, initiating actions by an API agent that interacts with a ‘Payment Gateway’ (API call). Subsequently, an API agent interacts with a ‘Shipping Service’ (API call). Then a GUI agent performs ‘Verification’ through visual GUI interaction, before finally reaching ‘Completion’. This figure highlights how a no-code platform enables users to integrate both API-driven automation and GUI-driven interactions within a unified workflow, simplifying the development of complex automated processes.
Figure 5: One example of a no-code platform to create workflows integrating both API calls and GUI agents.
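
A visual workflow like this typically reduces to a declarative list of steps, each tagged with the agent type that runs it. The sketch below is an assumption about how such a definition might look, using the step names from the figure description; `WORKFLOW`, `RUNNERS`, and `run_workflow` are hypothetical, and the runners only return strings instead of calling real services.

```python
# Declarative workflow: the kind of structure a visual editor might emit.
WORKFLOW = [
    {"name": "Payment Gateway", "agent": "api"},
    {"name": "Shipping Service", "agent": "api"},
    {"name": "Verification", "agent": "gui"},
]

# One runner per agent type; both are placeholders for real agents.
RUNNERS = {
    "api": lambda name, order_id: f"[api] {name} called for {order_id}",
    "gui": lambda name, order_id: f"[gui] {name} checked on screen for {order_id}",
}


def run_workflow(order_id, workflow, runners):
    """Execute the steps in order and return a readable trace."""
    trace = [f"Order Received: {order_id}"]
    for step in workflow:
        runner = runners[step["agent"]]  # pick the agent type for this step
        trace.append(runner(step["name"], order_id))
    trace.append("Completion")
    return trace


trace = run_workflow("A-1001", WORKFLOW, RUNNERS)
for line in trace:
    print(line)
```

Changing a step from GUI to API is then a one-word edit to the workflow data, with no change to the execution loop.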

Approach	Key Benefit	Primary Challenge
API Wrappers Over GUI Tools	Provides a quasi-API experience for GUI-only software	Still relies on underlying GUI elements that may change
Unified Orchestration Tools	Hides agent-type details from the user	Complex logic to choose between API and GUI in real time
Low-Code / No-Code Solutions	Simplifies design of advanced workflows	May introduce hidden dependencies and abstractions

Scenario	Recommended Approach	Rationale
Stable, well-documented APIs	API Agents	Exploit robust endpoints for speed and reliability
Performance-critical operations	API Agents	Reduce latency and overhead via direct function calls
Controlled access to applications	API Agents	Ensure safety and security
Legacy or proprietary software	GUI Agents	Automate tasks without requiring new backend integration
Visual validation or UI testing	GUI Agents	Verify on-screen text or elements directly
Interactive or graphical manipulation	GUI Agents	Seamlessly replicate human-like interactions with visual elements
Partial API coverage	Hybrid	Combine UI-based steps where APIs are unavailable with direct calls for data-heavy tasks
Future-proofing	Hybrid	Facilitate switching from GUI to API as endpoints evolve
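
The scenario table can be condensed into a small decision helper. This is a simplification under stated assumptions, not a rule from the paper: it reduces the scenarios to two inputs (how much of the task the available APIs cover, and whether on-screen validation is needed), and `recommend_agent` is a hypothetical name.

```python
def recommend_agent(api_coverage: str, needs_visual_validation: bool = False) -> str:
    """Suggest an agent paradigm.

    api_coverage: "full", "partial", or "none" (how much of the task APIs cover).
    needs_visual_validation: True if on-screen checks are part of the task.
    """
    if api_coverage not in {"full", "partial", "none"}:
        raise ValueError(f"unknown api_coverage: {api_coverage!r}")
    if api_coverage == "none":
        return "GUI"  # legacy or proprietary software without endpoints
    if api_coverage == "partial" or needs_visual_validation:
        return "Hybrid"  # API calls where available, GUI steps for the rest
    return "API"  # stable, well-documented endpoints cover everything


print(recommend_agent("full"))
print(recommend_agent("partial"))
print(recommend_agent("none"))
print(recommend_agent("full", needs_visual_validation=True))
```

Real decisions also weigh performance, access control, and maintenance cost, which this two-parameter sketch leaves out.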
