An awesome list of computer control agents (GUI automation of desktop and mobile devices) 🚀.
Please have a look at our website for more information.
- 📄 Paper: https://arxiv.org/abs/2501.16150 (arXiv:2501.16150)
- 🌐 Website: https://sagerpascal.github.io/agents-for-computer-use
- 🤖 Agent Overview
- 📊 Datasets Overview
## 🤖 Agent Overview

- Abukadah et al. - Mapping Natural Language Intents to User Interfaces through Vision-Language Models
- Bishop et al. - Latent State Estimation Helps UI Agents to Reason
- Bonatti et al. - Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
- Branavan et al. - Reinforcement Learning for Mapping Instructions to Actions
- Chae et al. - Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
- Cheng et al. - SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- Cho et al. - CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only
- Deng et al. - Mind2Web: Towards a Generalist Agent for the Web
- Deng et al. - Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
- Deng et al. - On the Multi-turn Instruction Following for Conversational Web Agents
- Ding et al. - MobileAgent: enhancing mobile control via human-machine interaction and SOP integration
- Dorka et al. - Training a Vision Language Model as Smartphone Assistant
- Fereidouni et al. - Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning
- Furuta et al. - Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
- Furuta et al. - Multimodal Web Navigation with Instruction-Finetuned Foundation Models
- Gao et al. - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
- Guan et al. - Intelligent Virtual Assistants with LLM-based Process Automation
- Guo et al. - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion
- Gur et al. - A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
- Gur et al. - Environment Generation for Zero-Shot Compositional Reinforcement Learning
- Gur et al. - Learning to Navigate the Web
- Gur et al. - Understanding HTML with Large Language Models
- He et al. - WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- Hong et al. - CogAgent: A Visual Language Model for GUI Agents
- Humphreys et al. - A data-driven approach for learning to control computers
- Iki et al. - Do BERTs Learn to Use Browser User Interface? Exploring Multi-Step Tasks with Unified Vision-and-Language BERTs
- Jia et al. - DOM-Q-NET: Grounded RL on Structured Language
- Kil et al. - Dual-View Visual Contextualization for Web Navigation
- Kim et al. - Language Models can Solve Computer Tasks
- Koh et al. - Tree Search For Language Model Agents
- Lai et al. - AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
- Lee et al. - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
- Li - Learning UI Navigation through Demonstrations composed of Macro Actions
- Li et al. - A Zero-Shot Language Agent for Computer Control with Structured Reflection
- Li et al. - AppAgent v2: Advanced Agent for Flexible Mobile Interactions
- Li et al. - Glider: A Reinforcement Learning Approach to Extract UI Scripts from Websites
- Li et al. - Interactive Task Learning from GUI-Grounded Natural Language Instructions and Demonstrations
- Li et al. - Mapping Natural Language Instructions to Mobile UI Action Sequences
- Li et al. - On the Effects of Data Scale on Computer Control Agents
- Li et al. - UINav: A Practical Approach to Train On-Device Automation Agents
- Lin et al. - Automating Web-based Infrastructure Management via Contextual Imitation Learning
- Liu et al. - Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
- Lo et al. - Hierarchical Prompting Assists Large Language Model on Web Navigation
- Lu et al. - GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- Lu et al. - OmniParser for Pure Vision Based GUI Agent
- Lu et al. - WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
- Lutz et al. - WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
- Ma et al. - CoCo-Agent: Comprehensive Cognitive LLM Agent for Smartphone GUI Automation
- Ma et al. - LASER: LLM Agent with State-Space Exploration for Web Navigation
- Mazumder et al. - FLIN: A Flexible Natural Language Interface for Web Navigation
- Murty et al. - BAGEL: Bootstrapping Agents by Guiding Exploration with Language
- Nakano et al. - WebGPT: Browser-assisted question-answering with human feedback
- Niu et al. - ScreenAgent: A Vision Language Model-driven Computer Control Agent
- Nong et al. - MobileFlow: A Multimodal LLM For Mobile GUI Agent
- Pan et al. - Autonomous Evaluation and Refinement of Digital Agents
- Putta et al. - Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- Rahman et al. - V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
- Rawles et al. - Android in the Wild: A Large-Scale Dataset for Android Device Control
- Shaw et al. - From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
- Shi et al. - World of Bits: An Open-Domain Platform for Web-Based Agents
- Sodhi et al. - HeaP: Hierarchical Policies for Web Actions using LLMs
- Song et al. - MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot
- Song et al. - Navigating Interfaces with AI for Enhanced User Interaction
- Song et al. - RestGPT: Connecting Large Language Models with Real-World RESTful APIs
- Song et al. - VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning
- Sun et al. - AdaPlanner: Adaptive Planning from Feedback with Language Models
- Sun et al. - META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
- Tao et al. - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
- Wang et al. - Enabling Conversational Interaction with Mobile UI using Large Language Models
- Wang et al. - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- Wang et al. - OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
- Wen et al. - AutoDroid: LLM-powered Task Automation in Android
- Wen et al. - DroidBot-GPT: GPT-powered UI Automation for Android
- Wu et al. - MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
- Wu et al. - OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- Xu et al. - Grounding Open-Domain Instructions to Automate Web Support Tasks
## 📊 Datasets Overview

- Shi et al. - World of Bits: An Open-Domain Platform for Web-Based Agents
- Liu et al. - Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
- Xu et al. - Grounding Open-Domain Instructions to Automate Web Support Tasks
- Gur et al. - Environment Generation for Zero-Shot Compositional Reinforcement Learning
- Yao et al. - WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- Deng et al. - Mind2Web: Towards a Generalist Agent for the Web
- Koroglu et al. - QBE: QLearning-Based Exploration of Android Applications
- Rawles et al. - Android in the Wild: A Large-Scale Dataset for Android Device Control
- Zhou et al. - WebArena: A Realistic Web Environment for building autonomous agents
- Li et al. - Mapping Natural Language Instructions to Mobile UI Action Sequences
- Toyama et al. - AndroidEnv: A Reinforcement Learning Platform for Android
- Burns et al. - A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
- Xie et al. - OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- Shvo et al. - AppBuddy: Learning to Accomplish Tasks in Mobile Apps via Reinforcement Learning
- Sun et al. - META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI
- Liu et al. - AgentBench: Evaluating LLMs as Agents
- Chen et al. - WebVLN: Vision-and-Language Navigation on Websites
- Song et al. - RestGPT: Connecting Large Language Models with Real-World RESTful APIs
- Koh et al. - VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks
- Deng et al. - On the Multi-turn Instruction Following for Conversational Web Agents
- Kapoor et al. - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- Wen et al. - Empowering LLM to use Smartphone for Intelligent Task Automation
- Gao et al. - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
- Niu et al. - ScreenAgent: A Vision Language Model-driven Computer Control Agent
- Drouin et al. - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
- Lai et al. - AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
- Zhang et al. - Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- Chen et al. - GUICourse: From General Vision Language Models to Versatile GUI Agents
- Guo et al. - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion
- Venkatesh et al. - UGIF: UI Grounded Instruction Following
- Zheng et al. - AgentStudio: A Toolkit for Building General Virtual Agents
- Zhang et al. - Mobile-Env: An Evaluation Platform and Benchmark for LLM-GUI Interaction
- Chen et al. - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
- Chai et al. - AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
If you find this work helpful, please cite:
    @misc{sager_acu_2025,
        title={A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions},
        author={Pascal J. Sager and Benjamin Meyer and Peng Yan and Rebekka von Wartburg-Kottler and Layan Etaiwi and Aref Enayati and Gabriel Nobel and Ahmed Abdulkadir and Benjamin F. Grewe and Thilo Stadelmann},
        year={2025},
        eprint={2501.16150},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2501.16150},
    }
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.