🤗 Upvotes: 81 | cs.CV
Authors:
Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
Title:
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Arxiv:
http://arxiv.org/abs/2509.15221v1
Abstract:
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
We update every weekday to discuss highest-voted papers from Huggingface Daily Paper (https://huggingface.co/papers). Both the podcast scripts and audio are generated by AI. Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com
Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, LLM ML, http://wanggengyu.com
Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236
Cover Image by Kawen Kuang https://kawen.art
Speaker 1:And welcome back to the Daily Papercast, where we break down the latest and greatest in AI, NLP, CV, and more. I'm Echo.
Speaker 2:And I'm Nova. We've got a fantastic episode lined up for you today, diving into some riveting research straight from Hugging Face's Daily Paper list.
Speaker 1:Absolutely. Today's episode is structured into four parts, introduction, methods, experiments, and related work. We'll be spending about four to five minutes on each section.
Speaker 2:Exactly. So let's get started with the paper we've chosen for today. It's titled ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data.
Speaker 1:The first two authors are Zhaoyang Liu and Jingjing Xie, and the last author listed is Wenhai Wang, all from the Shanghai AI Laboratory. Now let's dive into the introduction section of this fascinating paper.
Speaker 2:Great. Let's do it. The introduction starts by highlighting how humans interact with digital environments using graphical user interfaces or GUIs. You know, those icons and windows we click on our screens.
Speaker 1:Yeah. Totally. With the recent advances in vision language models or VLMs, which combine visual perception with task planning, it's becoming more feasible to automate these interactions. The paper refers to these automated systems as computer use agents, or CUAs.
Speaker 2:That's right. These CUAs aim to autonomously operate various digital platforms like desktops, mobiles, and web environments by relying solely on what they can see, which is pretty impressive.
Speaker 1:Indeed. However, it's highlighted in the paper that many existing CUAs perform well but are typically built on closed source models or proprietary data sets, which makes them less accessible for wide scale use and further research.
Speaker 2:Yes. And a significant challenge, they point out, is the scarcity and expense of obtaining fine grained action trajectories, which are essential for effective computer use. Unlike the abundant image text pairs found on the Internet, this data is hard to come by and costly to collect and annotate.
Speaker 1:Plus, there's the issue of software, web pages, and operating systems evolving rapidly. This constant change can render existing data obsolete quickly, creating a bottleneck for scaling CUAs in both data scale and model generalizability.
Speaker 2:Exactly. To tackle these issues, the authors focused on two main aspects, constructing a large scale, cross platform, GUI centric training corpus and developing a set of scalable, versatile foundation models for general purpose CUAs.
Speaker 1:And how do they approach this? They introduce a cross platform interactive data pipeline composed of two synergistic loops. The first loop is the agent environment interaction loop. This loop allows automated agents to interact with various GUI environments to collect data.
Speaker 2:Perfect. The second loop is the agent human hybrid data acquisition loop. This includes expert annotated trajectories, which ensure high coverage and quality. The pipeline spans six major platforms, Windows, macOS, Linux, Android, iOS, and web.
Speaker 1:Right. This is a big deal because it means they can gather rich screen-state observations, metadata, and raw trajectories from a wide range of environments. Within this setup, they designed a unified action space to ensure consistent and efficient interaction.
Speaker 2:This unified action space is critical as it standardizes how interactions are modeled across different platforms, making it easier to train and evaluate their models.
Speaker 1:So to wrap up the introduction, the paper situates their work within the broader context of leveraging data driven scaling to enhance the power and generalizability of CUAs across diverse and evolving digital platforms.
Speaker 2:Absolutely fascinating. And that's the end of the introduction section. Stay tuned as we move on to the methods section where we'll dig into the nitty gritty of how they built and tested their solutions.
Speaker 1:Alright, Nova. Let's dig into the methods section of this fascinating paper. Shall we?
Speaker 2:Yes, Echo. This part is where the real magic happens. So the authors introduced their cross platform interactive data pipeline. Sounds fancy, right?
Speaker 1:Very fancy indeed. So what's this pipeline all about?
Speaker 2:Well, they propose this dual-loop framework combining agent-environment interactions and hybrid agent-human data acquisition. It's designed to integrate agents and humans into the process of collecting data across different platforms like Windows, macOS, Ubuntu, Android, and even web browsers.
Speaker 1:Wow. That's ambitious. How do they manage to standardize observations and actions across such varied environments?
Speaker 2:Good question. They establish a unified interface for observation acquisition and action execution across these platforms. For instance, various metadata extraction methods are used: accessibility trees for desktop environments, document object model structures for web platforms, and parsed XML layout files for Android.
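To make that concrete, here is a minimal Python sketch of what such a unified observation and execution interface could look like. The class names, fields, and methods are our illustrative assumptions, not the paper's released API:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Observation:
    screenshot_png: bytes   # raw screen capture
    metadata: str           # serialized accessibility tree, DOM, or XML layout
    platform: str           # e.g. "windows", "web", "android"


class GUIEnvironment(Protocol):
    def observe(self) -> Observation:
        """Return the current screen plus platform-specific structural metadata."""
        ...

    def execute(self, action: dict) -> None:
        """Apply an action expressed in the unified action space."""
        ...
```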
Speaker 1:Got it. And what about data acquisition? How do they ensure they capture diverse and relevant data?
Speaker 2:They use two main loops. The agent interaction loop involves agents autonomously interacting with GUI environments to collect screenshots and structural metadata. The agent human hybrid data acquisition loop then integrates these automatically collected trajectories with those gathered by human experts. This two pronged approach ensures comprehensive coverage and high data quality.
Speaker 1:But I bet there's more to it. Right? Are there specific strategies they employ to collect and refine this data?
Speaker 2:Absolutely. They use a rule driven random walk strategy for exploration. Essentially, this involves selecting random actions from the available action space at each step to explore different interface states. To enhance efficiency, they apply heuristic pruning to avoid redundant branches.
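As a rough illustration, a random-walk collector with simple pruning might look like the sketch below. The environment methods and the hash-based state fingerprint are assumptions we made for the example; the paper's heuristics are more elaborate:

```python
import random


def random_walk(env, max_steps=20, seen_states=None):
    """Explore a GUI by sampling random actions, skipping states already visited."""
    seen_states = set() if seen_states is None else seen_states
    trajectory = []
    for _ in range(max_steps):
        obs = env.observe()
        state_key = hash(obs.metadata)            # crude state fingerprint
        if state_key in seen_states:              # heuristic pruning: avoid redundant branches
            break
        seen_states.add(state_key)
        candidates = env.available_actions(obs)   # hypothetical helper listing clickable elements
        if not candidates:
            break
        action = random.choice(candidates)
        env.execute(action)
        trajectory.append((obs, action))
    return trajectory
```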
Speaker 1:That's smart. And what happens after collecting these raw trajectories?
Speaker 2:Once they have the raw trajectories, they transform them into training corpora. These corpora are annotated to support tasks like GUI understanding, grounding, and sequential action modeling. The annotation process includes generating appearance, position, and function descriptions, as well as screen-transition summaries, using advanced vision-language models like GPT-4o and Claude 3.7.
Speaker 1:That sounds like a lot of detailed work. How do they tackle the noise often found in automatically collected data?
Speaker 2:Great point. They complement automated collection with expert curated trajectories to ensure quality. The annotators use a unified cross platform system to interact with applications within isolated environments like Docker containers. This setup reduces noise and provides high quality trajectories that reflect realistic usage patterns.
Speaker 1:I see. They really seem to cover all the bases. What about the actual training of the agents? How do they optimize for different environments and tasks?
Speaker 2:They train the agents using a combination of multimodal data and task specific supervision. This includes GUI grounding tasks like point grounding, bounding box grounding, and action grounding, each of which maps natural language instructions to specific GUI regions. They augment these tasks with additional data generated by prompt templates fed into large language models to improve the model's generalization ability across diverse interfaces.
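To picture what these grounding tasks amount to as training data, here is a hedged sketch of one possible sample format; the field names and the normalized-coordinate convention are our assumptions, not the released schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GroundingSample:
    screenshot: str                                             # path to the screenshot image
    instruction: str                                            # natural language reference
    point: Optional[Tuple[float, float]] = None                 # point grounding target (x, y)
    bbox: Optional[Tuple[float, float, float, float]] = None    # bounding-box grounding target
    action: Optional[dict] = None                               # action grounding target


sample = GroundingSample(
    screenshot="screens/editor_001.png",
    instruction="Click the Save button",
    point=(0.42, 0.87),   # normalized screen coordinates
)
```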
Speaker 1:And how do they ensure the agents are prepared to handle the various intricacies of different platforms? I mean, operating a web browser is quite different from a mobile app.
Speaker 2:Indeed, and they address this by establishing a unified action space covering core behaviors across all platforms while retaining flexibility for platform specific functionalities. This consistency simplifies downstream policy learning and data annotation.
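For a sense of what a unified action space can look like, here is an illustrative Python-style sketch with a shared core of actions plus a few platform-specific extensions; the exact names and signatures are our assumptions, not the paper's definition:

```python
# Shared core actions available on every platform (illustrative stubs).
def click(x: float, y: float): ...
def double_click(x: float, y: float): ...
def drag(x1: float, y1: float, x2: float, y2: float): ...
def scroll(dx: float, dy: float): ...
def write(text: str): ...
def press(keys: str): ...                                     # e.g. "ctrl+s"

# Platform-specific extensions kept alongside the shared core.
def swipe(x1: float, y1: float, x2: float, y2: float): ...    # mobile
def open_app(name: str): ...                                  # mobile/desktop
```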
Speaker 1:But what about real time tasks, like online gaming where fast and reactive behavior is critical?
Speaker 2:For such scenarios, they designed three inference paradigms for computer use agents, grounding mode, direct action mode, and reasoned action mode. Grounding mode focuses on UI localization. Direct action mode generates executable actions with low latency, and reasoned action mode provides a chain of thought rationale, enhancing reliability in complex tasks.
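A hedged sketch of how the three paradigms might be dispatched at inference time; the prompts and the `model.generate` call are placeholders we invented for illustration, not the paper's interface:

```python
def run_agent(model, screenshot, instruction, mode="direct_action"):
    if mode == "grounding":
        # UI localization only: return coordinates for the referenced element.
        return model.generate(screenshot, f"Locate: {instruction}")
    if mode == "direct_action":
        # Low latency: emit the next executable action without explicit reasoning.
        return model.generate(screenshot, f"Next action for: {instruction}")
    if mode == "reasoned_action":
        # Chain-of-thought first, then the action; slower but more reliable.
        return model.generate(screenshot, f"Think step by step, then act: {instruction}")
    raise ValueError(f"unknown mode: {mode}")
```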
Speaker 1:That's impressive. It's like they thought of everything. Are there any specific implementations they include to improve this pipeline?
Speaker 2:Certainly. For instance, they enhance their GUI grounding with point, bounding box, and action grounding, enriched by both manually and automatically generated annotations to handle diverse interface layouts. They also upgrade existing evaluation frameworks like WebArena-Lite to WebArena-Lite-v2 to better assess vision-based web agents.
Speaker 1:Upgrading frameworks sounds crucial for keeping up with evolving web technologies. How do they validate the pipeline's effectiveness?
Speaker 2:Through structured evaluations on designated benchmarks like ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G, they measure cross-platform grounding accuracy. Additionally, they deploy agents in real-time environments to assess task completion effectiveness.
Speaker 1:And of course, performance isn't just about task completion rates. What about user intent and contextual accuracy?
Speaker 2:To capture those, they include interface captioning, user intention prediction, and screen transition captioning tasks. These provide foundational understanding for accurately interpreting and responding to user actions.
Speaker 1:This comprehensive approach really sets a high bar for developing versatile computer use agents. Is there more about how they ensure their models make decisions that align closely with human reasoning?
Speaker 2:Yes. They generate reasoned action traces for each interaction, embedding rationales in XML tags. This structured reasoning provides transparency and improves the interpretability of agents' actions, which is crucial for tasks demanding precision and explainability.
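For illustration, a reasoned action trace in this style might look like the snippet below; the tag names and the click call are assumptions for the example, not the paper's exact schema:

```python
# Hypothetical trace format: rationale and action wrapped in XML-style tags.
trace = """
<think>
The task is to save the document. The toolbar shows a disk icon at the top left,
which is the Save button.
</think>
<action>
click(x=0.04, y=0.06)
</action>
"""
```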
Speaker 1:Pretty thorough indeed. But such a system must require rigorous annotation schemes, right?
Speaker 2:Absolutely. They employ a thorough annotation process, segmenting interaction sequences into sub-trajectories based on screen similarity to cover diverse GUI states. They balance high-quality human annotations against expansive rule-driven data collection to ensure the robustness of agent training.
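As a rough sketch, screen-similarity-based segmentation could be implemented along these lines; the similarity function and threshold are illustrative assumptions, not the paper's actual procedure:

```python
def segment_by_similarity(steps, similarity, threshold=0.8):
    """steps: list of (screenshot, action) pairs; similarity: fn(img_a, img_b) -> [0, 1]."""
    if not steps:
        return []
    segments, current = [], [steps[0]]
    for prev, cur in zip(steps, steps[1:]):
        if similarity(prev[0], cur[0]) < threshold:  # a large visual change starts a new sub-trajectory
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments
```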
Speaker 1:Bravo. It's certainly an extensive methodology. Shall we wrap up this section here, Nova?
Speaker 2:Yes, Echo. I hope our listeners found the dive into this methodology as fascinating as we did. Stay tuned for more insights in the next section.
Speaker 1:Alright, Nova. Let's jump into the heart of our episode, the experiments and results. This is where things get really exciting. So how did they set up their experiments?
Speaker 2:Oh, this is a juicy bit, Echo. So to assess the capabilities of their computer use agents, they evaluated them on several fronts: understanding, grounding, and task completion. They used purely visual observations for all benchmarks, which I think is fascinating because it reflects how real agents would interact with GUI environments without additional cues.
Speaker 1:Wow, that's pretty comprehensive. Did they specify which benchmarks they used for evaluation?
Speaker 2:Absolutely. They used MMBench-GUI L1 for GUI understanding, which involves multiple-choice questions to evaluate the agent's perception and reasoning about interface content. It's like a standardized test for GUI understanding across different platforms.
Speaker 1:Sounds rigorous. And how did their models perform?
Speaker 2:Glad you asked. The results were quite impressive. For instance, even their smallest model, ScaleCUA-3B, achieved an average score of 83.6%. The larger ScaleCUA-32B scored 92.5% on the medium-level tasks and an astounding 90.4% on the hard-level tasks, outperforming all the other proprietary and open-source models they compared against.
Speaker 1:That's fantastic. What about GUI grounding?
Speaker 2:For GUI grounding, they measured the ability to localize and associate visual elements with textual or functional references. The ScaleCUA-32B model excelled here as well, achieving an average score of 94.7% on the ScreenSpot-v2 benchmark, which is pretty remarkable. They also tested on ScreenSpot-Pro, where similarly high performance was seen across domains like creative software, CAD, and office applications.
Speaker 1:Wow. It's clearly well rounded. Did they conduct any task completion assessments?
Speaker 2:Yes, they did. They evaluated end-to-end task completion across several platforms, including Android, Ubuntu, Windows, macOS, and web. On these benchmarks, the ScaleCUA-32B model again proved to be the top performer. For example, it achieved 47.4% on WebArena-Lite-v2 with a 50-step task budget, significantly outperforming all baseline models.
Speaker 1:Incredible. Were there any specific trends or patterns they observed?
Speaker 2:Yes, a few notable ones. Firstly, scaling from 3B to 7B and then to 32B consistently led to performance gains, particularly on the Windows and web benchmarks. Secondly, the effect of the step budget: most agents, including ScaleCUA, saw substantial improvements with a 50-step limit. Lastly, performance on macOS remained relatively low compared to other platforms, hinting at OS-specific challenges rather than grounding limitations.
Speaker 1:Interesting. It sounds like there's always room for improvement. Did they mention why macOS was particularly challenging?
Speaker 2:Yes. They speculated that it might be due to OS-specific affordances rather than grounding limitations alone. The macOS platform has unique interface elements and behaviors that can present additional challenges for the agents.
Speaker 1:Makes sense. Considering all these points, were there any trade-offs they discussed, like between general multimodal data and GUI-specific data?
Speaker 2:Yes. Indeed. They found that while incorporating general multimodal data improved the model's general reasoning ability, it somewhat diluted the GUI specific knowledge. This indicates a trade off. Increasing general data improved general benchmarks but caused a gradual decline in GUI specific performance.
Speaker 2:They suggest that a balanced approach to data mixing might be necessary to maintain a high level of specialization without sacrificing general capabilities.
Speaker 1:That's a valuable insight. Anything else noteworthy from their discussion?
Speaker 2:Well, they highlighted that high-resolution inputs and reasoning-based inference significantly enhance grounding and task completion, although they come at a computational cost. Moreover, data scaling is essential, but benchmarks vary in their sensitivity to data volume. They emphasized the importance of scalable cross-platform training data for developing robust, general-purpose agents.
Speaker 1:That's a thorough analysis, Nova. It's amazing how much effort goes into making these models generalize well across multiple platforms.
Speaker 2:Absolutely. Their work really advances the frontier of what computer use agents can achieve. It's astounding how they bridged vision language modeling with practical GUI interaction, and I think ScaleCUA is definitely a game changer in this field.
Speaker 1:Totally agree. Well, that's a wrap on the experiments and results section, folks. Stay tuned for our next part where we dive into the related work. Alright, Nova, let's dive into the related work section of this paper. It's really fascinating to see how previous research has influenced the study at hand.
Speaker 2:You got that right, Echo. This section primarily covers prior work on vision-language models, or VLMs, on GUI agents, and on datasets. These areas are foundational to understanding how their research builds on existing knowledge.
Speaker 1:Vision language models have come a long way, haven't they? The paper mentions significant developments both in proprietary API services and open source models, which have enhanced capabilities for a wide range of tasks.
Speaker 2:Yes. Exactly. VLMs like those from OpenAI and Meta AI have integrated extensive GUI knowledge during their pretraining stages. This has enabled them to acquire explicit computer use capabilities, though they still face challenges even on simple computer use tasks.
Speaker 1:And that's a bit ironic, isn't it? I mean, they're great at complex vision-language tasks, but still stumble on basic computer use tasks.
Speaker 2:Absolutely. The struggle seems to stem partly from the fact that the GUI corpora used to train these models are largely proprietary, which limits their generalizability and performance on open source tasks.
Speaker 1:That makes sense. Now moving on to GUI agents. The development here is pretty intriguing. Recent advances in general-purpose VLMs, such as GPT-4o, have led to modular GUI agents that decompose decision making and perception into planning and grounding.
Speaker 2:Right. This planner-grounder paradigm is key. A VLM-based planner predicts the next high-level operation and its associated object description, while a specialized GUI grounding model localizes that object on the interface. However, despite their strong performance, these agentic workflows often suffer from high computational latency and significant token consumption.
Speaker 1:Yeah. And that's where native computer use agents come in. These integrate planning and grounding into a unified model that's trained end to end. Works like Aguvis and UI-TARS have shown impressive reasoning and adaptability by being trained on extensive task trajectories.
Speaker 2:Indeed, Echo. The unified model approach offers tighter alignment between perception and action, and these native agents benefit significantly from that tighter integration.
Speaker 1:Okay. Let's talk about data sets. Open source data sets have played a critical role in advancing GUI agent development. They capture diverse forms of interaction, visual perception, and instruction following behaviors across platforms.
Speaker 2:Exactly. For example, the Rico dataset offers over 70,000 Android UI screens with gesture traces, and AITW comes with around 715,000 human demonstrations aligned with about 30,000 natural language commands.
Speaker 1:Wow. Those are some huge numbers. It goes to show how scalable data collection efforts can provide a robust foundation for training and evaluating these models.
Speaker 2:Yep. And speaking of large-scale efforts, the paper also covers smaller but significant datasets like MiniWoB for web-based tasks and Mind2Web, which offers long-horizon, open-ended tasks over real websites.
Speaker 1:And not to forget desktop environments. There's a dataset with 4,000,000 examples synthesized via interface decomposition, designed to boost grounding accuracy, as shown by Shi et al. In combination, these datasets form the backbone of their research.
Speaker 2:Absolutely. They even explore scalable data generation with projects like OS-Genesis, which synthesizes trajectories via exploration, and Aguvis, which curates a large-scale dataset with multimodal annotations.
Speaker 1:So, with these comprehensive datasets, the paper's aim is clear. It strives to create a unified framework for evaluating and training agents across multiple platforms.
Speaker 2:Exactly. Echo. With a robust cross platform dataset and agent models like ScaleCUA, they hope to bridge the gap between vision language modeling and practical GUI interaction. And that, my friend, concludes the related work section.
Speaker 1:Alright, everyone. Welcome back. Let's dive into the key contributions and takeaways from this fascinating paper on ScaleCUA. Nova, do you wanna kick us off?
Speaker 2:Sure thing, Echo. So the first major contribution of the paper is their curated cross-platform computer use dataset. It's pretty impressive because it covers six major platforms: Windows, macOS, Linux, Android, iOS, and web. And within these platforms, they focus on three GUI-centric task domains, understanding, grounding, and task completion, which together make a solid foundation for their research. Amazing, right?
Speaker 1:Absolutely. That's quite a scope. You know, this dataset isn't just a collection of random actions either. It was collected via an interactive data pipeline that integrates automated agents with human experts. This allows for data that's not only large scale, but also high quality and well annotated.
Speaker 1:They thought of everything, right?
Speaker 2:And what's really cool is their development of ScaleCUA, a family of robust base agent models that unify perception, reasoning, and action. This supports flexible inference paradigms like grounding, direct action, and reasoned action. Think of it like having a Swiss Army knife of models, capable of tackling different types of computational tasks seamlessly.
Speaker 1:Yeah. That's a great analogy, Nova. They've also ensured these agents interact seamlessly across different platforms through a unified action space. This means one set of models can be deployed on multiple systems, creating a truly universal computer use agent. That's innovation at its finest, don't you think?
Speaker 2:Totally. And let's not forget their comprehensive evaluation. They demonstrated that their agents could achieve state of the art performance across understanding, grounding, and end to end task completion. These experiments aren't just about hitting benchmarks. They're about proving these agents can operate effectively in diverse real world environments.
Speaker 1:Which is so crucial, right? Given that many previous works either used limited open source data or closed proprietary data, this paper really sets a new standard. By making their resources fully open, they're paving the way for future advancements in computer use agents.
Speaker 2:Exactly. And speaking of paving the way, they also highlighted some areas for future work, like improving the quality of agent-collected data through iterative model refinement. Plus, they noted the importance of advanced mechanisms like reflection and hierarchical planning, which they haven't integrated yet.
Speaker 1:Interesting. So there's still room for development even with such a comprehensive model. Alright, Nova, any final takeaways before we wrap up?
Speaker 2:I'd say the key takeaway here is that ScaleCUA offers a groundbreaking approach to building more adaptable and robust computer use agents. By combining diverse data collection methods and advanced modeling techniques, they've moved us a step closer to truly autonomous cross platform digital agents.
Speaker 1:Well, there you have it, folks. We've covered the innovative ScaleCUA model and its potential impact on the future of computer use agents. We hope you found this episode as exciting as we did. Nova, any final words for our listeners?
Speaker 2:Just a big thank you to everyone for tuning in. Don't forget to check out the paper for more detailed information, and we'll see you all in our next episode. For more exciting discussions on the latest in AI, NLP, and everything in between, stay tuned to Daily Papercast.
Speaker 1:Absolutely. Thanks again, everyone, and until next time. Keep learning and stay curious.