Smol2Operator: Post-Training GUI Agents for Computer Use

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Featherless AI on Hugging Face Inference Providers 🔥 Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap

Amir Mahla, merve, Sergio Paniego, Vaibhav Srivastav, Lewis Tuns · 2025-09-23 · via Hugging Face - Blog

Back to Articles

TL;DR: This work shows how a lightweight vision–language model can acquire GUI-grounded skills and evolve into an agentic GUI coder. We release all training recipes, data-processing tools, resulting model, demo and datasets to enable full reproducibility and foster further research 🫡. Find the collection here.

This video demonstrates the model obtained through the recipe described below, executing a task end-to-end.

Introduction
1. Data Transformation and Unified Action Space
2. Phase 1: From Zero to Perception
3. Phase 2: From Perception to Cognition
- Training Data
- Phase 2 Results
4. All you need is Open Source
5. Conclusion
What's Next?

Introduction

Graphical User Interface (GUI) automation is one of the most challenging frontiers in computer vision. Developing models that see and interact with user interfaces enables AI agents to navigate mobile, desktop, and web platforms. This will reshape the future of digital interaction.

In this blog post, we present a comprehensive approach to training vision-language models for GUI automation through a multi-phase training strategy. We demonstrate how to transform a model with zero grounding capabilities into an agentic coder capable of understanding and interacting with graphical interfaces.

Rather than aiming for a SOTA model, our goal is to demonstrate the entire process, from data processing to model training, and, in doing so, show how to unlock GUI-grounding capabilities in VLMs.

GUI capabilities combine understanding of the interface and precise element localization. These abilities enable the model to translate high-level tasks into low-level GUI actions such as clicking, typing, …

Our approach leverages SmolVLM2-2.2B-Instruct as the baseline model, a small powerful vision-language model that initially has no grounding capabilities for GUI tasks. This makes it an ideal candidate to demonstrate the effectiveness of our training methodology. Through our two-phase training process, we first instill grounding capabilities in the model, then enhance it with agentic reasoning abilities using Supervised Fine-Tuning (SFT).

We evaluate our approach on an established perception benchmark: ScreenSpot-v2, which tests the model’s ability to understand and locate elements within screenshots. Our process is inspired by the AGUVIS paper, and we leverage their carefully curated datasets to build upon their foundational work.

Evolution of ScreenSpot-v2 performance during the training phase of the base model SmolVLM2-2.2B-Instruct.

1. Data Transformation and Unified Action Space

This section explains how we convert heterogeneous GUI actions format from multiple datasets into a single unified format. By standardizing function names, signatures, and parameters, we create consistent, high-quality data that forms the foundation for effective model training.

The Challenge of Inconsistent Action Spaces

One of the primary challenges when working with multiple GUI automation datasets is the lack of standardization in action representations. Different datasets use varying function signatures, parameter naming conventions, and action taxonomies, making it difficult to train a unified model across diverse data sources.

Our Unified Approach

We took the open-source datasets (xlangai/aguvis-stage1, xlangai/aguvis-stage2), originally used by AGUVIS, and implemented a comprehensive data transformation pipeline to create a unified action space. Our approach involved:

Function Parsing and Normalization: We developed a function parser (see utils/function_parser.py) that can extract and parse function calls from various formats across all datasets. This parser supports any function signature format, handles complex parameter structures, and can reconstruct function calls with proper parameter ordering.
Action Space Unification: We implemented a comprehensive action conversion system (see preprocessing/action_conversion.py) that transforms all original action representations into a standardized function naming and argument structure. This process highlighted the significant inconsistencies in function signatures across different datasets and allowed us to:
- Remove undesired or redundant actions
- Standardize parameter naming conventions
- Create a cohesive action vocabulary
(Bonus) Flexible Adaptation Framework: Our transformation pipeline includes utilities that allow users to:
- Adapt the entire dataset to their own action space naming conventions using the utils/action_space_converter.py tool
- Extract and analyze the current action space structure

Example Data Transformation

Here are real examples from our action conversion system (preprocessing/action_conversion.py) showing how we transform heterogeneous action representations into our unified format (grounding coordinates normalized to [0,1]):

Before (Original Action Dataset Formats):

# Mobile Actions
mobile.home()
mobile.open_app(app_name='drupe')
mobile.swipe(from_coord=[0.581, 0.898], to_coord=[0.601, 0.518])
mobile.long_press(x=0.799, y=0.911)
mobile.terminate(status='success')
# Desktop Actions
pyautogui.click(x=0.8102, y=0.9463)
pyautogui.doubleClick(x=0.8102, y=0.9463)
pyautogui.hotkey(keys=['ctrl', 'c'])
pyautogui.scroll(page=-0.1)
pyautogui.write(message='bread buns')
pyautogui.dragTo(from_coord=[0.87, 0.423], to_coord=[0.8102, 0.9463])

After (Unified Action Dataset Formats):

# Unified Mobile Actions
navigate_home()
open_app(app_name='drupe')
swipe(from_coord=[0.581, 0.898], to_coord=[0.601, 0.518])
long_press(x=0.799, y=0.911)
final_answer('success')
# Unified Desktop Actions
click(x=0.8102, y=0.9463)
double_click(x=0.8102, y=0.9463)
press(keys=['ctrl', 'c'])
scroll(direction='up', amount=10)  # Smart direction detection
type(text='bread buns')
drag(from_coord=[0.87, 0.423], to_coord=[0.8102, 0.9463])

This unification process was essential for creating coherent training data that allows the model to learn consistent action patterns across diverse GUI environments.

💡 Why Normalized Coordinates?
Using raw pixel coordinates in text-action datapoint (e.g. click(x=302, y=63)) ties them to a single image size. Vision Language Models (VLMs) often resize images, causing pixel coordinates to break and require adjustment. Normalized coordinates (relative to image size) remain valid at any resolution and keep the dataset consistent.

(Bonus) Custom Action Space Adaptation with Action Space Converter

To maximize flexibility for different use cases, we developed the Action Space Converter (utils/action_space_converter.py), a tool that allows users to easily adapt from an action space to their own custom action vocabularies and naming conventions.

You can use this tool to transform one action signature (function names, parameter names, and parameter value changes, ...) into another:

Before

assistant_message: "Action: click(x=0.5, y=0.3)"

After

assistant_message: "Action: touch(x_coord=200, y_coord=300)"

Key Features

The Action Space Converter provides:

Configurable Mappings: Define custom mappings between unified actions and your preferred action names
Parameter Transformation: Rename parameters, apply value transformations, and set default values
Flexible Architecture: Support for both simple parameter mappings and complex custom transformation functions
Validation: Built-in validation to ensure mapping configurations are valid

Usage Example

from utils.action_space_converter import ActionSpaceConverter, ActionMapping, ParameterMapping
from utils.function_parser import parse_function_call

# Create custom mappings
mappings = [
    ActionMapping(
        source_function="click",
        target_function="touch",
        parameter_mappings=[
            ParameterMapping(source_name="x", target_name="x_coord"),
            ParameterMapping(source_name="y", target_name="y_coord")
        ],
        description="Touch screen at coordinates"    ),
    ActionMapping(
        source_function="type", # source_function is the name of the function in the original function call
        target_function="write", # target_function is the name of the function in the target function call        
        parameter_mappings=[
            ParameterMapping(source_name="text", target_name="content")
            # source_name is the name of the parameter in the original function call            
            # target_name is the name of the parameter in the target function call        
        ],
        description="Input text"    
    )
]

assistant_message = "I'll interact at those coordinates for you. click(x=0.5, y=0.3) Now I'll input the text. type(text='hello world')"

# Parse function calls
parsed_function_calls = parse_function_call(text)

# Initialize converter
converter = ActionSpaceConverter(mappings)

# Convert actions
converted_actions = converter.convert_actions(parsed_function_calls)
for new_function_call, old_function_call in zip(converted_actions, parsed_function_calls):
    text = text.replace(old_function_call.to_string(), new_function_call.to_string())

print(text)
# Output: I'll interact at those coordinates for you. touch(x_coord=0.5, y_coord=0.3) Now I'll input the text. write(content='hello world')

This tool enables researchers and practitioners to:

Customize Training Data: Adapt the dataset to match their specific action vocabulary requirements
Domain Adaptation: Transform actions for different platforms (mobile vs. desktop vs. web)
Framework Integration: Easily align training data with existing automation frameworks
Rapid Experimentation: Quickly test different action space configurations
Release Preparation: Standardize action spaces for production deployment with consistent naming conventions

The Action Space Converter is particularly valuable for preparing datasets for training, as it ensures consistent action vocabularies across different deployment environments while maintaining compatibility with existing automation frameworks.

Transformed and Released Datasets

Through this pipeline, we transform the open-source datasets xlangai/aguvis-stage1, xlangai/aguvis-stage2 into our unified action space (see here). The output of this process is released as two new fully formatted datasets: smolagents/aguvis-stage-1 and smolagents/aguvis-stage-2.

2. Phase 1: From Zero to Perception

Training Data

Phase 1 leverages the smolagents/aguvis-stage-1 dataset, which introduces GUI grounding by pairing low-level instructions with diverse executable actions (expressed in code form). For example, a user/assistant turn in smolagents/aguvis-stage-1 follows the structure:

{
  "user": "click on more button",
  "assistant": "click(x=0.8875, y=0.2281)",
}

Each sample links a screenshot with multi-turn user/assistant interactions, enabling the model to learn fine-grained action grounding across dialogue turns. During fine-tuning, the data collator masks everything except the assistant’s answers when computing the loss.

Optimization Experiments

Before proceeding with full-scale Phase 1 training, we conducted comprehensive ablation studies to determine optimal training configurations

Image Resolution and Coordinate System Analysis

We experimented with different image sizes and coordinate representation systems to identify the optimal configuration for SmolVLM2:

Image Sizes Tested: 384px, 768px, 1152px
Coordinate Systems: Pixel coordinates vs. normalized coordinates (0-1 range)
Training Data: 400K samples from Aguvis datasets

Some SOTA GUI VLMs (e.g., Qwen-VL) appear also to use a different normalized range (0–1000), which was not tested in this experiment.

Configuration (coords / image size)	Screenspot-v2 (%)
Normalized coordinates
Base / –	0.47
384	31.28
764	32.32
1152	33.72
Pixel coordinates
Base / –	0.55
384	1.17
764	2.67
1152	4.32

Table 1: Baseline on HuggingFaceTB/SmolVLM2-2.2B-Instruct (400k samples, aguvis-stage-1). Higher is better.

As demonstrated in our benchmark results, SmolVLM2-2.2B-Instruct base initially achieved 0% performance on perception benchmarks like ScreenSpot-v2. This complete lack of grounding capability provided us with a clean slate to evaluate the effectiveness of our training methodology.

Key Findings

From our experiments, we determined that:

Image Size: 1152px
Coordinate System: Normalized coordinates (0-1 range) proved most effective for SmolVLM2
Note: The optimal choice between pixel and normalized coordinates may vary depending on the base model’s pre-training approach

Phase 1 Results

Using the optimal configuration (1152px resolution with normalized coordinates), we trained for 2 epochs on the smolagents/aguvis-stage-1 dataset. The results were remarkable, +41% improvement over baseline on ScreenSpot-v2

This dramatic improvement demonstrates that our Phase 1 training successfully instilled fundamental grounding capabilities in the model, enabling it to understand and locate visual elements within screenshots.

Configuration (coords / image size)	Screenspot-v2 (%)
Normalized coordinates / 1152	41.27

Table 2: Baseline on HuggingFaceTB/SmolVLM2-2.2B-Instruct (2 epochs, aguvis-stage-1).

3. Phase 2: From Perception to Cognition

Whereas Phase 1 provided grounding capabilities, Phase 2 targets agentic reasoning, the ability to deliberate and plan before acting. This stage transforms the model from a reactive system identifying GUI elements into a proactive agent capable of executing complex, multi-step interactions.

Training Data

Phase 2 uses the smolagents/aguvis-stage-2 dataset, which introduces agentic scenarios:

Explicit reasoning about upcoming actions
Context consistency across multiple interaction steps
High-level instructions require multi-step, low-level actions.

For example, the smolagents/aguvis-stage-2 chat message is like this:

{
  "system": "You are a helpful GUI agent. ...",
  "user": "Please generate the next move according to the UI screenshot, instruction and previous actions.\n\nInstruction: What information does the site provide about Judith Lauand's career, works and exhibitions?\n\nPrevious actions:\nNone",
  "assistant": "<think>\nClick on the link labeled 'Judith Lauand: Brazilian 1922-2022' to explore more about her career and exhibitions.\n</think>\n<code>\nclick(x=0.41, y=0.178)\n</code>",
}

Each sample links a screenshot with a system/user/assistant turn. During fine-tuning, the data collator masks everything except the assistant’s answers when computing the loss.

Phase 2 Results

Starting from the Phase 1 checkpoint (1152 px resolution, normalized coordinates), we fine-tuned the model for two epochs on smolagents/aguvis-stage-2. The accuracy on ScreenSpot-v2 increased from 41% to 61%, indicating that explicit reasoning improves GUI grounding performance.

Configuration (coords / image size)	Screenspot-v2 (%)
Normalized coordinates / 1152	61.71

Table 2: Baseline on HuggingFaceTB/SmolVLM2-2.2B-Instruct after Phase 1 finetuning (2 epochs, aguvis-stage-1).

💡 We also reproduced the two-phase training on a much smaller VLM (nanoVLM-460M). Despite its reduced capacity, the model achieved ~58% on ScreenSpot-v2, demonstrating that the training strategy scales down effectively, making it SOTA on ScreenSpot-v2 for this model size (460M parameters). In addition, aguvis-stage-1 is already included in FineVision Dataset!

4. All you need is Open Source

All training code, data processing pipelines, datasets and model are open-source!

Training Recipe (recipe.ipynb): Complete training pipeline for both Phase 1 and Phase 2, including dataset mixture configurations and training orchestration. We leverage the TRL library to train our models.
Datasets (smolagents/aguvis-stage-1, smolagents/aguvis-stage-2): all datasets used are open-source.
Model (smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI): the model produced by applying the training recipe described above.
Preprocessing Tools:
- Function Parser (utils/function_parser.py): Utilities for parsing, normalizing, and reconstructing function calls from diverse dataset formats. Supports complex parameter structures, positional arguments, and multiple function call extraction.
- Action Conversion System (preprocessing/action_conversion.py): Core unification engine transforming mobile and PyAutoGUI desktop actions into a standardized API format. Features smart coordinate handling, direction detection for scroll actions, and comprehensive parameter normalization.
- Action Space Converter (utils/action_space_converter.py): Flexible tool for adapting the unified action space to custom vocabularies and naming conventions. Enables domain-specific customization through configurable parameter mappings.

💡 We’ve also released a Space to try the model’s agentic grounding capabilities: A-Mahla/Smol2Operator

5. Conclusion

Our experiments demonstrate that high-quality, reasoning-oriented data can substantially improve GUI grounding, even for small VLMs, using only supervised fine-tuning (SFT). Beyond raw performance gains, these results show that the GUI grounding capabilities are largely determined by the quality of the data. Carefully curated datasets teach models the structure and semantics of user interfaces, providing the grounding needed for accurate action prediction.

To support the development of GUI agents, we’re open-sourcing everything: our complete pipeline, datasets, and trained model. You can reproduce our results, experiment with different models and architectures, or adapt our approach to new domains. The future of agentic AI depends on researchers like you pushing these boundaries further!

What's Next?

While SFT excels at supervised tasks, emerging methods such as Reinforcement Learning (RL) or Direct Preference Optimization (DPO) help develop stronger reasoning capabilities and enable real-time adaptation. These advances point toward a new generation of GUI agents that learn and improve through interaction rather than relying solely on static datasets.

Let’s build the future of GUI agents together 🤗

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hugging Face - Blog

Table of Contents

Introduction

1. Data Transformation and Unified Action Space

The Challenge of Inconsistent Action Spaces

Our Unified Approach

Example Data Transformation

(Bonus) Custom Action Space Adaptation with Action Space Converter

Key Features

Usage Example

Transformed and Released Datasets

2. Phase 1: From Zero to Perception

Training Data

Optimization Experiments

Image Resolution and Coordinate System Analysis

Key Findings

Phase 1 Results

3. Phase 2: From Perception to Cognition

Training Data

Phase 2 Results

4. All you need is Open Source

5. Conclusion

What's Next?