SmolVLA: Efficient Vision-Language-Action Model Trained on LeRobot Community Data

Hugging Face - Blog


Dana Aubakirova, Andres Marafioti, merve, Aritra Roy Gosthipaty, · 2025-06-03 · via Hugging Face - Blog

This article is also available in Chinese 简体中文.

🧭TL;DR

Today, we introduce SmolVLA, a compact (450M), open-source Vision-Language-Action model for robotics that runs on consumer hardware.

Pretrained only on compatibly licensed, open-source community-shared datasets under the lerobot tag.
SmolVLA-450M outperforms much larger VLAs and strong baselines such as ACT on simulation (LIBERO, Meta-World) and real-world tasks (SO100, SO101).
Supports asynchronous inference for 30% faster response and 2× task throughput.

Useful links:

Hardware used to train and evaluate SO-100/101: https://github.com/TheRobotStudio/SO-ARM100
Base model: https://huggingface.co/lerobot/smolvla_base
Paper: https://huggingface.co/papers/2506.01844

📚 Table of Contents

🧭 TL;DR
📖 Introduction
🤖 Meet SmolVLA
🚀 How to Use SmolVLA?
🧠 Method
📦 Community Datasets
- Improving Task Annotations
- Standardizing Camera Views
📊 Results
✅ Conclusion
📣 Call to Action

Introduction

Over the past few years, Transformers have driven remarkable progress in AI, from language models capable of human-like reasoning to multimodal systems that understand both images and text. However, in real-world robotics, advancements have been much slower. Robots still struggle to generalize across diverse objects, environments, and tasks. This limited progress stems from a lack of high-quality, diverse data and the absence of models that can reason and act like humans in the physical world.

In response to these challenges, the field has recently turned to vision-language-action (VLA) models, which aim to unify perception, language understanding, and action prediction within a single architecture. VLAs typically take as input raw visual observations and natural language instructions, and output corresponding robot actions. While promising, much of the recent progress in VLAs remains locked behind proprietary models trained on large-scale private datasets, often requiring costly hardware setups and extensive engineering resources. As a result, the broader robotics research community faces significant barriers in reproducing and building upon these models.

SmolVLA addresses this gap by offering an open-source, compact, and efficient VLA model that can be trained on consumer-grade hardware using only publicly available datasets. By releasing the model weights and evaluating on very affordable open-source hardware, SmolVLA aims to democratize access to vision-language-action models and accelerate research toward generalist robotic agents.

Figure 1: Comparison of SmolVLA across task variations. From left to right: (1) asynchronous pick-place cube counting, (2) synchronous pick-place cube counting, (3) pick-place cube counting under perturbations, and (4) generalization on pick-and-place of the lego block with real-world SO101.

Meet SmolVLA!

SmolVLA-450M is our open-source, compact yet capable VLA model. It is:

Small enough to run on CPU, train on a single consumer GPU, or even a MacBook!
Trained on public, community-shared robotics data
Released with full training and inference recipes
Can be tested and deployed on very affordable hardware (SO-100, SO-101, LeKiwi, etc.)

Inspired by the training paradigms of Large Language Models (LLMs), SmolVLA goes through a pretraining phase on general manipulation data, followed by task-specific post-training. Architecturally, it combines Transformers with flow-matching decoders, and is optimized for speed and low-latency inference with the following design choices:

Skipping half of the layers of the vision model for faster inference and smaller size
Interleaving self-attention and cross-attention blocks
Using fewer visual tokens
Leveraging smaller pretrained VLMs

Despite using fewer than 30k training episodes—an order of magnitude less than other VLAs—SmolVLA matches or exceeds the performance of much larger models, both in simulation and the real world.

To make real-time robotics easier to use, we introduce an asynchronous inference stack. This technology separates how robots perform actions from how they understand what they see and hear. Because of this separation, robots can respond more quickly in fast-changing environments.

Figure 2. SmolVLA takes as input a sequence of RGB images from multiple cameras, the robot’s current sensorimotor state, and a natural language instruction. The VLM encodes these into contextual features, which condition the action expert to generate a continuous sequence of actions.

🚀 How to Use SmolVLA?

SmolVLA is designed to be easy to use and integrate—whether you're finetuning on your own data or plugging it into an existing robotics stack.

Install

First, install the required dependencies:

git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"

Finetune the pretrained model

Use smolvla_base, our pretrained 450M model, with the lerobot training framework:

python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=20000  # 10% of training budget

Train from scratch

If you'd like to build from the architecture (pretrained VLM + action expert) rather than a pretrained checkpoint:

python lerobot/scripts/train.py \
  --policy.type=smolvla \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=200000

You can also load SmolVLAPolicy directly:

# Load the pretrained 450M checkpoint (lerobot/smolvla_base) from the Hugging Face Hub.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")

Method

SmolVLA is not only a lightweight yet capable model, but also a method for training and evaluating generalist robotics policies. In this section, we introduce the model architecture behind SmolVLA and the asynchronous inference setup used for evaluation, which has proven to be more adaptable and capable of faster recovery.

SmolVLA consists of two core components: a Vision-Language Model (VLM) that processes multimodal inputs and an action expert that outputs robot control commands. Below, we describe the main components of the SmolVLA architecture and the asynchronous inference stack. More details can be found in our technical report.

Main Architecture

Vision-Language Model (VLM)

We use SmolVLM2 as our VLM backbone. It’s optimized for multi-image inputs and consists of a SigLIP vision encoder and a SmolLM2 language decoder.

Image tokens are extracted via the vision encoder
Language instructions are tokenized and fed directly into the decoder.
Sensorimotor states are projected into a single token using a linear layer to align with the token dimension of the language model.

The decoder layers process concatenated image, language, and state tokens. The resulting features are then passed to the action expert.
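As a rough illustration of how the three input types end up in one token sequence, here is a minimal pure-Python sketch. The hidden size, state dimension, token counts, and the `linear` and `build_prefix` helpers are all hypothetical and chosen only for readability; they are not taken from the SmolVLA code.

```python
import random

def linear(vec, weight, bias):
    """Apply y = W @ vec + b with plain lists (one weight row per output dim)."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def build_prefix(image_tokens, text_tokens, state, weight, bias):
    """Concatenate image tokens, language tokens, and ONE projected state token."""
    state_token = linear(state, weight, bias)
    return image_tokens + text_tokens + [state_token]

# Hypothetical sizes, for illustration only.
HIDDEN, STATE_DIM = 8, 6
rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(STATE_DIM)] for _ in range(HIDDEN)]
bias = [0.0] * HIDDEN

image_tokens = [[0.0] * HIDDEN for _ in range(64)]  # 64 visual tokens for one frame
text_tokens = [[0.0] * HIDDEN for _ in range(5)]    # 5 instruction tokens
state = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4]            # e.g. joint positions

prefix = build_prefix(image_tokens, text_tokens, state, weight, bias)
print(len(prefix), len(prefix[-1]))  # 70 tokens, each of width HIDDEN
```

The point of the single linear layer is that the whole sensorimotor state costs exactly one extra token, regardless of how many joints the robot has.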

Action Expert: Flow Matching Transformer

SmolVLA’s action expert is a compact transformer (~100M parameters) that generates action chunks, i.e. sequences of future robot actions, conditioned on the VLM’s outputs. It is trained using a flow matching objective, which teaches the model to guide noisy samples back to the ground truth. Discrete action representations (e.g., via tokenization) are powerful, but they often require autoregressive decoding, which is slow and inefficient at inference time. Flow matching, in contrast, allows direct, non-autoregressive prediction of continuous actions, enabling real-time control with high precision.

More intuitively, during training, we add random noise to the robot’s real action sequences and ask the model to predict the “correction vector” that brings them back to the correct trajectory. This forms a smooth vector field over the action space, helping the model learn accurate and stable control policies.

We implement this using a transformer architecture with interleaved attention blocks (see Figure 2), and reduce its hidden size to 75% of the VLM’s, keeping the model lightweight for deployment.
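The noise-and-correct idea described above can be sketched in a few lines. This is a minimal illustration under one common convention (a straight path from noise at tau=0 to the true action at tau=1); the time direction and sign used in the SmolVLA paper may differ, and the function names here are hypothetical.

```python
import random

def interpolate(noise, action, tau):
    """Point on the straight path from noise (tau=0) to the true action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Constant velocity of that path: the 'correction vector' to be predicted."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predicted, noise, action):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(noise, action)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def sample(velocity_fn, noise, steps=10):
    """Euler integration of the learned vector field from tau=0 to tau=1."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
action = [0.5, -0.25, 1.0, 0.0]               # one flattened toy action chunk
noise = [rng.gauss(0.0, 1.0) for _ in action]

# An oracle that always returns the true target stands in for a trained model.
oracle = lambda x, tau: target_velocity(noise, action)
recovered = sample(oracle, noise)
print(max(abs(r - a) for r, a in zip(recovered, action)))
```

With a perfect velocity predictor, integrating from pure noise lands on the ground-truth action chunk; all actions in the chunk are produced together rather than one token at a time, which is what makes the decoding non-autoregressive.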

Design Choices for Efficiency and Robustness

While combining a vision-language model with an action prediction module is a common design pattern in recent VLA systems (such as Pi0, GR00T, and Diffusion Policy), we identified several architectural choices that significantly enhance robustness and performance. In SmolVLA, we apply three key techniques: reducing the number of visual tokens, skipping upper layers in the VLM, and interleaving cross- and self-attention layers in the action expert.

Visual Token Reduction

High-resolution images improve perception but can significantly slow down inference. To strike a balance, SmolVLA limits the number of visual tokens to 64 per frame during both training and inference. For example, a 512×512 image is compressed into just 64 tokens, instead of 1024, using PixelShuffle as an efficient shuffling technique. While the underlying Vision-Language Model (VLM) was originally pretrained using image tiling for broader coverage, SmolVLA uses only the global image at runtime to keep inference lightweight and fast.
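To make the 1024 to 64 reduction concrete, here is a small sketch of a space-to-depth rearrangement, the operation usually meant by "pixel shuffle" in VLM connectors. The patch size of 16 and shuffle ratio of 4 are assumptions inferred from the token counts quoted above, not values confirmed by this post, and `pixel_shuffle` is a hypothetical helper rather than the model's actual implementation.

```python
def pixel_shuffle(grid, ratio):
    """Space-to-depth: merge each ratio x ratio block of tokens into one wider token."""
    height, width = len(grid), len(grid[0])
    assert height % ratio == 0 and width % ratio == 0
    out = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            merged = []
            for dy in range(ratio):
                for dx in range(ratio):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        out.append(row)
    return out

# Assumed values: 16-pixel patches and a shuffle ratio of 4; CHANNELS is a toy width.
IMAGE_SIZE, PATCH_SIZE, RATIO, CHANNELS = 512, 16, 4, 2

side = IMAGE_SIZE // PATCH_SIZE  # 32 patches per side
grid = [[[float(y * side + x)] * CHANNELS for x in range(side)] for y in range(side)]

shuffled = pixel_shuffle(grid, RATIO)
tokens_before = side * side
tokens_after = len(shuffled) * len(shuffled[0])
print(tokens_before, tokens_after, len(shuffled[0][0]))
```

No information is discarded by the rearrangement itself: each output token is `RATIO * RATIO` times wider than an input token, so the sequence gets 16 times shorter while the channel dimension grows by the same factor.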

Faster Inference via Layer Skipping

Rather than always relying on the final layer of the VLM—which can be expensive and sometimes suboptimal—we use features from intermediate layers. Prior work has shown that early layers often provide better representations for downstream tasks. In SmolVLA, the action expert only attends to VLM features up to a configurable layer N during training, set to half the total layers. This halves the compute cost of both the VLM and the action expert, significantly speeding up inference with minimal performance loss.
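A toy sketch of the idea, assuming the backbone is just a list of layer callables: only the first N layers are executed and their output is what downstream components see. The `run_truncated` helper and the 16-layer toy stack are hypothetical.

```python
def run_truncated(layers, hidden, keep_fraction=0.5):
    """Run only the first N layers, where N = keep_fraction * total layers."""
    n_keep = max(1, int(len(layers) * keep_fraction))
    for layer in layers[:n_keep]:
        hidden = layer(hidden)
    return hidden, n_keep

# Toy backbone: 16 "layers" that each add 1, so the output counts executed layers.
layers = [lambda h: h + 1 for _ in range(16)]

features, used = run_truncated(layers, 0)
print(features, used)
```

Because the skipped layers are never executed, the saving applies to every forward pass, not only to the step where features are read out.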

Interleaved Cross and Self-Attention

Inside the action expert, attention layers alternate between:

Cross-attention (CA), where action tokens attend to the VLM’s features
Self-attention (SA), where action tokens attend to each other (causally—only to the past)

We found that this interleaved design is both lighter and more effective than using full attention blocks. Models that rely only on CA or only on SA tend to sacrifice either smoothness or grounding.

In SmolVLA, CA ensures that actions are well-conditioned on perception and instructions, while SA improves temporal smoothness—especially critical for real-world control, where jittery predictions can result in unsafe or unstable behavior.
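The alternation and the causal constraint can be written down compactly. This is a sketch only: whether the stack starts with cross-attention or self-attention is an assumption here, and both helper names are hypothetical.

```python
def attention_schedule(num_layers, first="cross"):
    """Alternate cross- and self-attention layer types through the action expert."""
    kinds = ("cross", "self") if first == "cross" else ("self", "cross")
    return [kinds[i % 2] for i in range(num_layers)]

def causal_mask(chunk_len):
    """mask[i][j] is True when action token i may attend to action token j (j <= i)."""
    return [[j <= i for j in range(chunk_len)] for i in range(chunk_len)]

print(attention_schedule(4))
for row in causal_mask(3):
    print(row)
```

In cross-attention layers the action tokens act as queries over the VLM features, so no causal mask is needed there; the mask only applies in the self-attention layers, where each action in the chunk sees itself and earlier actions.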

Asynchronous Inference

Figure 3. Asynchronous inference. Illustration of the asynchronous inference stack. Note that the policy can be run on a remote server, possibly with GPUs.

Modern visuomotor policies output action chunks—sequences of actions to execute. There are two ways to manage them:

Synchronous (sync): The robot executes a chunk, then pauses while the next one is computed. Simple, but causes a delay where the robot can't react to new inputs.
Asynchronous (async): While executing the current chunk, the robot already sends the latest observation to a Policy Server (possibly hosted on GPU) for the next chunk. This avoids idle time and improves reactivity.

Our async stack decouples action execution from chunk prediction, resulting in higher adaptability and no execution lag at runtime. It relies on the following key mechanisms:

1. Early trigger: When the queue length falls below a threshold (e.g., 70%), we send an observation to a Policy Server, calling for a new action chunk.
2. Decoupled threads: Control loop keeps executing → inference happens in parallel (non-blocking).
3. Chunk fusion: Overlapping actions from successive chunks are stitched with a simple merge rule to avoid jitter.
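The three mechanisms above can be simulated in a single-threaded toy loop. This is a sketch under stated assumptions, not the LeRobot implementation: the "server" is modelled as a fixed latency in ticks, the merge rule simply prefers the newer prediction, and `should_request_chunk`, `fuse`, and `simulate` are hypothetical names.

```python
def should_request_chunk(queue_len, chunk_size, threshold_pct=70):
    """Early trigger: request a new chunk once the queue drops below the threshold."""
    return queue_len * 100 < threshold_pct * chunk_size

def fuse(remaining, incoming, merge=lambda old, new: new):
    """Stitch chunks keyed by timestep; overlapping steps go through `merge`."""
    fused = dict(remaining)
    for step, action in incoming.items():
        fused[step] = merge(fused[step], action) if step in fused else action
    return fused

def simulate(total_steps=20, chunk_size=10, latency=2):
    """The robot executes one action per tick and never waits for the server."""
    queue = {t: f"a{t}@0" for t in range(chunk_size)}  # first chunk, predicted at t=0
    pending = None                                     # (arrival_tick, chunk)
    executed = []
    for tick in range(total_steps):
        if pending and pending[0] == tick:
            # Drop steps that were already executed while the server was busy.
            fresh = {t: a for t, a in pending[1].items() if t >= tick}
            queue = fuse(queue, fresh)
            pending = None
        if pending is None and should_request_chunk(len(queue), chunk_size):
            chunk = {t: f"a{t}@{tick}" for t in range(tick, tick + chunk_size)}
            pending = (tick + latency, chunk)          # arrives `latency` ticks later
        executed.append(queue.pop(tick))               # control loop keeps going
    return executed

executed = simulate()
print(executed[:8])
```

Each label `a{t}@{k}` reads "action for step t, predicted from the observation at tick k". In the printed trace the robot switches to fresher predictions mid-chunk without ever running out of actions, which is the behaviour the early trigger is meant to provide. In a real deployment the request would run in a separate thread or on a remote server rather than being modelled as a delay.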

We are really excited about releasing asynchronous inference because it improves adaptability and performance without changing the model. In short, async inference keeps the robot responsive by overlapping execution and remote prediction.

Community Datasets

While vision and language models thrive on web-scale datasets like LAION, ImageNet, and Common Crawl, robotics lacks a comparable resource. There’s no “Internet of robots.” Instead, data is fragmented across robot types, sensors, control schemes, and formats—forming disconnected "data islands". In our previous post, we explored how this fragmentation could be resolved through open, collaborative efforts. Just as ImageNet catalyzed breakthroughs in computer vision by providing a large, diverse benchmark, we believe that community-driven robotics datasets can play the same foundational role for generalist robot policies.

SmolVLA is our first step toward that vision: It is pretrained on a curated mix of publicly available, community-contributed datasets designed to reflect real-world variation. Rather than optimizing for dataset size alone, we focus on diversity: a range of behaviors, camera viewpoints, and embodiments that promote transfer and generalization.

All training data used in SmolVLA comes from LeRobot Community Datasets, robotics datasets shared on the Hugging Face Hub under the lerobot tag. Collected in diverse settings, from labs to living rooms, these datasets represent an open, decentralized effort to scale real-world robot data.

Figure 4. A glimpse of the community dataset. Special thanks to Ville Kuosmanen for creating the visualization.

Unlike academic benchmarks, community datasets naturally capture messy, realistic interactions: varied lighting, suboptimal demonstrations, unconventional objects, and heterogeneous control schemes. This kind of diversity will be very useful for learning robust, general-purpose representations.

We used a custom filtering tool created by Alexandre Chapin and Ville Kuosmanen to select datasets based on frame count, visual quality, and task coverage. After a meticulous manual review (special thanks to Marina Barannikov), we curated a collection of 487 high-quality datasets focused on the SO100 robotic arm, standardized at 30 FPS. This yielded around 10 million frames, at least one order of magnitude smaller than other popular benchmark datasets, yet significantly more diverse.

Improving Task Annotations

A common issue across community datasets was noisy or missing task descriptions. Many episodes lacked annotations or included vague labels like “task desc” or “Move”, “Pick”. To improve quality and standardize the textual input across datasets, we used Qwen2.5-VL-3B-Instruct to generate concise, action-oriented descriptions.

Given sample frames and the original label, the model was prompted to rewrite the instruction in under 30 characters, starting with an action verb (e.g., “Pick,” “Place,” “Open”).

The prompt used is:

Here is a current task description: {current_task}. Generate a very short, clear, and complete one-sentence describing the action performed by the robot arm (max 30 characters). Do not include unnecessary words.
Be concise.
Here are some examples: Pick up the cube and place it in the box, open the drawer and so on.
Start directly with an action verb like “Pick”, “Place”, “Open”, etc.
Similar to the provided examples, what is the main action done by the robot arm?
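A minimal sketch of how such a prompt could be filled in and how the model's answer could be checked against the stated constraints. The template text is the prompt quoted above; the `build_prompt` and `is_valid_description` helpers and the short verb list are hypothetical, and the call to the VLM itself is deliberately left out.

```python
PROMPT_TEMPLATE = (
    "Here is a current task description: {current_task}. Generate a very short, clear, "
    "and complete one-sentence describing the action performed by the robot arm "
    "(max 30 characters). Do not include unnecessary words.\n"
    "Be concise.\n"
    "Here are some examples: Pick up the cube and place it in the box, "
    "open the drawer and so on.\n"
    "Start directly with an action verb like “Pick”, “Place”, “Open”, etc.\n"
    "Similar to the provided examples, what is the main action done by the robot arm?"
)

# Illustrative verb list taken from the examples in the prompt; a real check
# would need a longer list.
ACTION_VERBS = ("Pick", "Place", "Open")

def build_prompt(current_task):
    """Fill the original (possibly noisy) label into the annotation prompt."""
    return PROMPT_TEMPLATE.format(current_task=current_task)

def is_valid_description(text, max_chars=30, verbs=ACTION_VERBS):
    """Check the two stated constraints: length limit and leading action verb."""
    words = text.strip().split()
    return bool(words) and len(text.strip()) <= max_chars and words[0] in verbs

print(build_prompt("task desc").splitlines()[0])
print(is_valid_description("Pick up the cube"))
print(is_valid_description("task desc"))
```

A check like this is useful as a post-filter, since a generative model will not always respect a character limit stated in the prompt.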

Standardizing Camera Views

Another challenge was inconsistent camera naming. Some datasets used clear names like top or wrist.right, while others used ambiguous labels like images.laptop, which varied in meaning. To fix this, we manually went through the datasets and mapped each camera view to a standardized scheme:

OBS_IMAGE_1: Top-down view
OBS_IMAGE_2: Wrist-mounted view
OBS_IMAGE_3+: Additional viewpoints
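Because the same raw name can mean different things in different datasets, the mapping has to be stored per dataset rather than globally. A minimal sketch, assuming hypothetical dataset identifiers and a hand-written `CAMERA_MAPS` table (the actual mapping used for SmolVLA is not reproduced here):

```python
# Hypothetical per-dataset mappings, written by hand after inspecting each dataset.
CAMERA_MAPS = {
    "example_user/dataset_a": {"top": "OBS_IMAGE_1", "wrist.right": "OBS_IMAGE_2"},
    # The same ambiguous name resolves differently depending on the dataset:
    "example_user/dataset_b": {"images.laptop": "OBS_IMAGE_1"},
    "example_user/dataset_c": {"images.laptop": "OBS_IMAGE_3"},
}

def standardize(dataset_id, frame):
    """Rename camera keys of one frame; keys without a mapping are left unchanged."""
    mapping = CAMERA_MAPS[dataset_id]
    return {mapping.get(key, key): value for key, value in frame.items()}

frame_b = {"images.laptop": "<image>", "state": [0.0, 0.1]}
frame_c = {"images.laptop": "<image>", "state": [0.0, 0.1]}

print(standardize("example_user/dataset_b", frame_b))
print(standardize("example_user/dataset_c", frame_c))
```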

We further isolate the contributions of community dataset pretraining and multitask finetuning. Without pretraining on the LeRobot community datasets, SmolVLA achieves 51.7% success on SO100. After pretraining on community-collected data, performance jumps to 78.3%, an absolute improvement of 26.6 percentage points. Multitask finetuning further boosts performance, showing strong task transfer capabilities even in low-data regimes.

Table 1. Impact of Pretraining on Community Datasets and Multitask Finetuning.

Results

We evaluate SmolVLA across simulation and real-world benchmarks to test its generalization, efficiency, and robustness. Despite being compact, it consistently outperforms or matches the performance of significantly larger models and policies pretrained on higher-scale robotics data.

Table 2. SmolVLA Performance on Simulation Benchmarks.

Table 3. SmolVLA vs Baselines on Real-World Tasks (SO100).

In real-world settings, SmolVLA is evaluated on two diverse suites: SO100 and SO101. These tasks include pick-place, stacking, and sorting, with both in-distribution and out-of-distribution object configurations. On SO101, SmolVLA also excels in generalization:

Table 4. Generalization of SmolVLA to New Embodiment (SO101) vs ACT.

Finally, we evaluate SmolVLA under synchronous and asynchronous inference modes. Async inference decouples action execution from model inference, allowing the policy to react while the robot is moving.

Both modes achieve similar task success (≈78%), but async inference:
- Completes tasks ~30% faster (9.7s vs. 13.75s)
- Enables 2× more completions in fixed-time settings (19 vs. 9 cubes)

This results in more responsive and robust real-world performance, especially in dynamic environments with shifting objects or external disturbances.
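The headline figures follow directly from the measurements quoted above, as this short check shows (variable names are illustrative):

```python
sync_time, async_time = 13.75, 9.7   # average completion time in seconds
sync_cubes, async_cubes = 9, 19      # cubes moved within the fixed time window

time_saved = 1 - async_time / sync_time
throughput_ratio = async_cubes / sync_cubes

print(f"{time_saved:.1%} less time per task, {throughput_ratio:.2f}x completions")
# prints: 29.5% less time per task, 2.11x completions
```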

Figure 5. Asynchronous vs. Synchronous Inference in Real-World Tasks. (a) Task success rates (%), (b) average completion time (s), and (c) number of tasks completed within a fixed time window.

Conclusion

SmolVLA is our contribution to building robotics foundation models that are open, efficient, and reproducible. Despite its small size, it matches or outperforms larger, proprietary models across a range of real-world and simulated tasks. By relying solely on community-contributed datasets and affordable hardware, SmolVLA lowers the barrier to entry for researchers, educators, and hobbyists alike. But this is just the beginning. SmolVLA is more than just a model — it's part of a growing open-source movement toward scalable, collaborative robotics.

Call to Action:

🔧 Try it out! Finetune SmolVLA on your own data, deploy it on affordable hardware, or benchmark it against your current stack and share the results on Twitter/LinkedIn.
🤖 Upload the dataset! Got a robot? Collect and share your data using the lerobot format. Help expand the community dataset that powers SmolVLA.
💬 Join the blog discussion. Drop your questions, ideas, or feedback in the discussion below. We’re happy to help with integration, training, or deployment.
📊 Contribute. Improve datasets, report issues, suggest new ideas. Every contribution helps.
🌍 Spread the word. Share SmolVLA with fellow researchers, developers, or educators interested in efficient, real-time robotic policies.
📫 Stay in touch: Follow the LeRobot organization and Discord server for updates, tutorials, and new releases.

Together, we can make real-world robotics more capable, more affordable, and more open. ✨
