TL;DR
Large Foundation Models (LFMs) are increasingly used not only for text, but also to operate digital systems through Graphical User Interfaces (GUIs), much as humans do. GUI agents built on this idea are promising, but they must adapt to dynamic layouts and diverse GUI designs across platforms and correctly identify visual elements on a page; these elements can be small, numerous, scattered, and visually inconsistent across websites and apps. Existing GUI agents struggle with complex and nuanced GUI interactions, which limits their practical use: they may misinterpret the state of GUI elements or fail to execute actions accurately. Existing benchmarks lack diversity, do not sufficiently reflect real-world scenarios, and lack standardized evaluation metrics. Privacy concerns also arise because GUIs can expose potentially sensitive information.
This paper provides a comprehensive survey of GUI agents and establishes a unified framework for categorizing the research. It examines benchmarks, evaluation metrics, architectures (perception, reasoning, planning, and acting), and training methods (prompt-based and training-based). By clearly defining the problems and current approaches, the survey aims to facilitate the research and development of more robust and efficient GUI agents. It also highlights open challenges such as intent understanding, security and privacy, and inference latency, offering directions for future work toward agents that can reliably automate complex tasks in diverse digital environments, and it emphasizes the need for standardized evaluation metrics and more realistic benchmarks.
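To make the perception-reasoning-planning-acting decomposition concrete, here is a minimal, illustrative Python sketch of a GUI agent loop. It is not code from the survey or any specific system: the `GUIAgent` class, the `foundation_model` wrapper with its `next_action` method, and the `environment` object with `screenshot`, `element_tree`, and `execute` are all assumed interfaces invented for this example.

```python
# Illustrative sketch only -- not code from the survey.
# It shows a generic perceive -> plan -> act loop; all names are hypothetical.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", or "stop"
    target: str = ""   # element identifier or screen coordinates
    value: str = ""    # text to type, scroll delta, etc.


class GUIAgent:
    def __init__(self, foundation_model, max_steps: int = 20):
        self.model = foundation_model   # assumed LFM wrapper
        self.max_steps = max_steps

    def perceive(self, environment) -> dict:
        """Capture the current GUI state (screenshot, element tree, ...)."""
        return {
            "screenshot": environment.screenshot(),
            "elements": environment.element_tree(),
        }

    def plan(self, goal: str, observation: dict, history: list) -> Action:
        """Ask the foundation model for the next action toward the goal."""
        prompt = {"goal": goal, "observation": observation, "history": history}
        return self.model.next_action(prompt)   # assumed model interface

    def run(self, goal: str, environment) -> list:
        """Perceive-plan-act loop until the model signals completion."""
        history: list[Action] = []
        for _ in range(self.max_steps):
            observation = self.perceive(environment)
            action = self.plan(goal, observation, history)
            if action.kind == "stop":            # task judged complete
                break
            environment.execute(action)          # acting step
            history.append(action)
        return history
```

Prompt-based approaches would implement `next_action` by prompting a frozen model with the observation and history, while training-based approaches would fine-tune the model on GUI trajectories; the surrounding loop stays the same.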
Key Takeaways
Why does it matter?
This survey is timely for AI researchers given the growing importance of GUI agents. It provides a comprehensive overview of the field's current progress, challenges, and future directions, offering valuable insights for newcomers and experienced researchers alike. It highlights the potential of GUI agents to automate human-computer interaction and opens new avenues for research in areas such as robust intent understanding, security, real-time responsiveness, and personalization.