惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Recent Commits to openclaw:main
Recent Commits to openclaw:main
博客园 - 叶小钗
Stack Overflow Blog
Stack Overflow Blog
S
SegmentFault 最新的问题
D
DataBreaches.Net
S
Securelist
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
T
Threatpost
C
Cyber Attacks, Cyber Crime and Cyber Security
The Hacker News
The Hacker News
Jina AI
Jina AI
T
Threat Research - Cisco Blogs
GbyAI
GbyAI
Microsoft Azure Blog
Microsoft Azure Blog
WordPress大学
WordPress大学
Engineering at Meta
Engineering at Meta
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
C
Cisco Blogs
PCI Perspectives
PCI Perspectives
Project Zero
Project Zero
G
Google Developers Blog
宝玉的分享
宝玉的分享
H
Heimdal Security Blog
美团技术团队
Schneier on Security
Schneier on Security
C
CERT Recently Published Vulnerability Notes
Martin Fowler
Martin Fowler
博客园 - 司徒正美
博客园 - 三生石上(FineUI控件)
Help Net Security
Help Net Security
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
Google DeepMind News
Google DeepMind News
C
Check Point Blog
Hacker News: Ask HN
Hacker News: Ask HN
L
LINUX DO - 最新话题
O
OpenAI News
Hacker News - Newest:
Hacker News - Newest: "LLM"
N
Netflix TechBlog - Medium
S
Security Affairs
小众软件
小众软件
MongoDB | Blog
MongoDB | Blog
Blog — PlanetScale
Blog — PlanetScale
V
V2EX - 技术
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
F
Fortinet All Blogs
G
GRAHAM CLULEY
云风的 BLOG
云风的 BLOG
S
Secure Thoughts

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap
ScreenSuite - The most comprehensive evaluation suite for GUI Agents!
Amir Mahla, Aymeric Roucher, Thomas Wolf · 2025-06-06 · via Hugging Face - Blog

Back to Articles

Releasing ScreenSuite, the most comprehensive evaluation suite for GUI Agents!

TL;DR

Over the past few weeks, we’ve been working tirelessly on making GUI agents more open, accessible and easy to integrate. Along the way, we created the largest benchmarking suite for GUI agents performances 👉 let us introduce ScreenSuite.

We are very excited to share it with you today: ScreenSuite is the most comprehensive and easiest way to evaluate Vision Language Models (VLMs)across many agentic capabilities!

WTF is a GUI Agent?

GUI Agents in action - courtesy of OSWorld

In short, an AI Agent is a robot that acts in the virtual world. (more thorough definition here)

In particular, a “GUI Agent” is an agent that lives in a GUI. Think “an agent that can do clicks and navigate on my desktop or my phone”, à la Claude Computer Use.

This means in essence that the AI model powering the agent will be given a task like “Fill the rest of this Excel column”, along with screen captures of the GUI. Using this information, it will then decide to take action on the system : click(x=130, y=540) to open a web browser, type(”Value for XYZ in 2025"), scroll(down=2) to read further… To see a GUI agent in action, you can try our Open Computer Agent, powered by Qwen2.5-VL-72B.

A good GUI agent will be able to navigate a computer just like we would, thus unlocking all computer tasks : scrolling through Google Maps, editing a file, buying an item online. This involves a variety of capabilities that can be hard to evaluate.

Introducing ScreenSuite 🥳

The literature, for instance Xu et al. (2025) or Qin et al. (2025), generally splits GUI agent abilities amongst several categories:

  1. Perception: correctly perceiving the informati displayed on screen
  2. Grounding: understanding the positioning of elements - this is paramount to click the correct place
  3. Single step actions: solving instructions correctly over one action
  4. Multi-step agents: solving a higher-level goal through several actions in a GUI environment.

So our first contribution is to gather and unify a comprehensive suite of 13 benchmarks spanning the full range of these GUI agent capabilities.

If you look at the last category listed above, evaluating Multi-step agentic capabilities is especially challenging as it requires virtual machines to run the agent’s environment, be it Windows, Android, Ubuntu... To address this, we provide support both for E2B desktop remote sandboxes, and we created from scratch a new option to easily launch Ubuntu or Android virtual machines in Docker!

Implementation details

We’ve carefully designed our benchmark suite with modularity and consistency in mind, ensuring strong alignment across tasks and environments. When required, especially for online benchmarks, we leverage smolagents as framework layer to streamline agent execution and orchestration.

To support reproducibility and ease of use, we’ve built custom Dockerized containers that allow local deployment of full Ubuntu Desktop or Android environments.

Unlike many existing GUI benchmarks that rely on accessibility trees or other metadata alongside visual input, our stack is intentionally vision-only. While this can result in different scores on some established leaderboards, we deem that it creates a more realistic and challenging setup, one that better reflects how humans perceive and interact with graphical interfaces.

– All agentic frameworks (Android World, OSWorld, GAIAWeb, Mind2Web) use smolagents and rely solely on vision, without any accessibility tree or DOM added (in contrast with evaluation settings reported in other sources). – Mind2Web (Multimodal) originally used element-name-based multi-choice selection based on the accessibility tree and screenshots, but was later adapted to click precision within bounding boxes using vision only, which significantly increases task difficulty.

Ranking leading VLMs on ScreenSuite 📊

We’ve evaluated leading VLMs on the benchmark

  • The Qwen-2.5-VL series of models from 3B to 72B. These models are known for their amazing localization capabilities, in other words they know the coordinates of any element in an image which makes them suited for GUI agents that need to click precisely.
  • UI-Tars-1.5-7B, the all-rounder by ByteDance.
  • Holo1-7B, the latest model by H company, showing extremely performant localization for its size.
  • GPT-4o

Our scores are in general agreement with the scores reported in various sources! With the caveat that we evaluate on vision only, causing some differences, see implementation details above.

💡 Note that ScreenSuite does not intend to exactly reproduce benchmarks published in the industry: we evaluate models on GUI agentic capabilities based on vision. As a result, on benchmarks like Mind2Web where other benchmarks gave the agent a view of information rich context like DOM or accessibility tree, our evaluation setting is much harder, thus ScreenSuite does not match other sources.

Start your custom evaluation in 30s ⚡️

Head to the repository.

  1. Clone the repository with submodules: git clone --recurse-submodules git@github.com:huggingface/screensuite.git
  2. Install the package: uv sync --extra submodules --python 3.11
  3. Run python run.py
    • Alternatively, run python examples/run_benchmarks.py for more fine-grained control, like running evaluations for several models in parallel.

The multistep benchmarks requires a bare-metal machine to run and deploy desktop/mobile* environment *emulators (see README.md)

Next steps 🚀

Running consistent and meaningful evaluations easily allows the community to quickly iterate and make progress in this field, as we’ve seen with Eleuther LM evaluation harness, the Open LLM Leaderboard and the Chatbot Arena.

We hope to see much more capable open models in the coming month that can run a wide range of tasks reliably and even run locally!

To support this effort: