Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations

Gaetan Bahl, Enzo Ruedas, Tess Boivin · 2026-03-05 · via Hugging Face - Blog

Authors: Enzo Ruedas, Tess Boivin

Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems: first with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms remains a challenge because of tight compute, memory, and power constraints, as well as real-time control requirements.

In synchronous control pipelines, the arm sits idle awaiting commands while the VLA runs inference, which leads to oscillatory behavior and delayed corrections. Asynchronous inference addresses this by decoupling generation from execution, enabling smooth and continuous motion. To be effective, however, the end-to-end inference latency must remain shorter than the action execution duration. This temporal constraint therefore sets an upper limit on the model's inference latency, or equivalently a lower limit on the throughput it must sustain.

Bringing VLA models to embedded platforms is not merely a matter of model compression, but a complex systems engineering problem requiring architectural decomposition, latency-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential to translate recent advances in multimodal foundation models into practical and deployable embedded robotic systems.

This guide presents NXP’s hands‑on best practices for recording reliable robotic datasets and fine‑tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance that the NXP i.MX 95 SoC achieves after optimization.

🎥 Dataset Recording: What Actually Matters

High‑quality, consistent data beats “more but messy” data. This section turns hard‑earned lessons into concrete checklists and schemas.

In our case, we recorded a dataset for the task: "Put the tea bag in the mug."

1) Consistency First

Fixed cameras: Use rigid mounts to avoid pose drift. If one or more cameras shift during recording or evaluation, because of the robot's vibrations or the operator resetting the environment, accuracy can drop severely.
Controlled lighting: Set up your environment where you have as much control over lighting as possible: use fixed light sources and stay away from sunlight, which varies during the day.
Strong contrast: Avoid training with “white on white” unless that’s your deployment domain. Maximize contrast between the arm, the object and the environment.
Fixed calibration: Make sure to have backups of your robot and teleoperator calibrations so you don't have to re-record your previous episodes if the code crashes.
Do not cheat: Do not use information the model will not have access to at inference time. During data recording, it is tempting for the operator to rely on direct visual observation of the scene. However, this introduces information that is absent from the dataset. Dataset collection must be restricted to the same camera inputs that will be available to the policy at runtime.

2) Use a Gripper Camera (Highly Recommended)

Moving from scene‑only views to mixed viewpoints increases the global accuracy, but every additional camera adds latency, so you must choose the right compromise. In our case, that balance was reached with 3 cameras: top, gripper, and left.

We strongly recommend using a gripper-mounted camera. It consistently improves success rates on fine manipulation tasks by providing a close, task-relevant viewpoint. Importantly, it is also the camera that most effectively enforces correct data collection practices, allowing the operator to rely exclusively on the robot’s perception rather than observing the scene directly.

When installing a gripper camera, we recommend securing the cable with Velcro or a strain-relief guide to prevent it from obstructing the field of view or becoming disconnected during motion.

3) Improve Prehension

Simple hardware tweaks like heat‑shrink tubing over gripper claws increase friction, reduce roughness, reduce slippage during episodes, and increase task success rate (fewer “almost success” episodes), improving policy learning stability.

4) Diversity & Splits

When recording a dataset, you should:

Vary the episode distribution: Divide your workspace into starting-position clusters, and record at least 10 episodes per cluster. Add diversity by changing the object position and rotation.
e.g. we partitioned the robot arm’s reachable workspace into 11 clusters, each measuring 10 × 10 cm.

Differentiate training & validation sets: Policies can easily overfit on the training set, so make sure that the validation set is unseen by the model.
e.g. we removed cluster 6 from the training set.

Record as many movements as you can: Small VLA models generalize poorly to unseen motion. Therefore, record episodes that cover a wide range of motion across the arm's degrees of freedom.
e.g. we grasped the tea bag either in horizontal or vertical position.

Anticipate failure: Sometimes the policy will not reach the object on the first attempt and will have to "go back to it". We noticed that dedicating about 20% of all episodes to this recovery case helps the model improve its overall success rate.
e.g. around 20% of our training set corresponds to recovery episodes.
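The recording plan above can be sketched as a small script. This is a minimal illustration rather than the authors' tooling: the `Episode` type and the helper names are hypothetical, the counts (11 clusters, 10 starting positions and 2 recovery episodes per cluster, cluster 6 held out) are taken from this article, and excluding recovery episodes from the validation split is an assumption made here so that the split contains the 10 positions of cluster 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    cluster: int    # workspace cluster the tea bag starts in (1..n_clusters)
    index: int      # episode index within the cluster
    recovery: bool  # True when the arm starts near the mug instead of at rest


def plan_episodes(n_clusters=11, starts_per_cluster=10, recoveries_per_cluster=2):
    """Enumerate the episodes to record for every cluster."""
    episodes = []
    for cluster in range(1, n_clusters + 1):
        for i in range(starts_per_cluster):
            episodes.append(Episode(cluster, i, recovery=False))
        for i in range(recoveries_per_cluster):
            episodes.append(Episode(cluster, starts_per_cluster + i, recovery=True))
    return episodes


def split_by_cluster(episodes, held_out_cluster=6):
    """Hold out one whole cluster so validation positions are never seen in training."""
    train = [e for e in episodes if e.cluster != held_out_cluster]
    # Assumption: only regular (non-recovery) episodes are used for validation.
    val = [e for e in episodes if e.cluster == held_out_cluster and not e.recovery]
    return train, val


def recovery_fraction(episodes):
    """Share of episodes that are recovery episodes."""
    return sum(e.recovery for e in episodes) / len(episodes)


episodes = plan_episodes()
train, val = split_by_cluster(episodes)
print(len(train), len(val), round(recovery_fraction(train), 3))  # 120 10 0.167
```

Splitting by cluster rather than by random episode keeps every validation position spatially distinct from the training positions, which is what makes the validation score a generalization measure.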

This mirrors best practices across VLA papers and community guides. Three examples illustrate data diversity within a single cluster: starting position 1, starting position 2, and a recovery episode.

Starting positions 1 and 2 correspond to different positions within the same cluster. In contrast, during the recovery episode, the robot does not begin in "starting mode"; instead, it is already near the mug and should proceed directly to retrieve the tea bag from that location.

🎛️ Fine‑Tuning VLAs

What we did in practice:

Task: "Grab the tea bag and place it in the mug."
Dataset:
- 120 episodes: 10 training clusters × (10 different tea bag starting positions + 2 recovery episodes)
- 3 cameras (640×480 px, 30 fps): Top, Gripper, Left
- Cluster n°6 was held out for validation
Batch size: 8
Training: after 200k steps, the model checkpoint with the lowest validation loss was chosen

For ACT (100 actions per chunk), the best trade-off between accuracy, generalization, and motion smoothness across both the training and validation sets was found within 100k-160k training steps. For SmolVLA (50 actions per chunk), the trade‑off appears after many more training steps. We found that continuing training slightly past the point where the model begins to overfit tends to improve overall accuracy.

Rule of thumb: choose the final checkpoint by evaluating success on both the training and validation sets, not by training loss.
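One way to apply this rule of thumb is sketched below. The ranking criterion (weaker of the two success rates first, then their mean) is an assumption chosen for illustration, not the authors' selection procedure, and the success rates in `example` are made-up placeholders rather than measurements from this article.

```python
def pick_checkpoint(results):
    """Pick a checkpoint from {training_step: (train_success, val_success)}.

    Both success rates are fractions in [0, 1] measured by rolling out the
    policy on the robot, not loss values.
    """
    def score(step):
        train_sr, val_sr = results[step]
        # Rank by the weaker of the two success rates, then by their mean,
        # and prefer the earlier checkpoint on a full tie.
        return (min(train_sr, val_sr), (train_sr + val_sr) / 2, -step)

    return max(results, key=score)


# Illustrative numbers only, not results reported in this article.
example = {
    100_000: (0.90, 0.70),
    120_000: (0.95, 0.80),
    140_000: (1.00, 0.80),
    160_000: (1.00, 0.70),
}
best = pick_checkpoint(example)
print(best)  # 140000
```

Ranking on the weaker of the two rates avoids picking a checkpoint that looks perfect on training positions but fails on the held-out cluster.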

⚡ Optimizing for the NXP i.MX 95 Applications Processor

The i.MX 95 SoC integrates 6× Arm Cortex‑A55 cores, a Cortex‑M7 and a Cortex‑M33 MCU, a Mali GPU, a new NXP ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi‑camera support and strong I/O. [nxp.com]

1) Divide And Conquer

Instead of running the model as one monolithic graph, we decompose the VLA graph into logical stages: encoders, decoders, and action experts. This allows each component to be optimized, scheduled, and deployed independently.

In practice, SmolVLA is partitioned into the following sub-blocks:

Vision: processes RGB camera frames and produces visual embeddings.
LLM backbone: generates action tokens from visual and textual embeddings.
Action expert: applies flow matching to iteratively denoise action samples and outputs final control commands.

This separation allows per-block optimizations. The impact of quantizing each block can be measured separately to choose the best trade-off between latency and accuracy. Isolating the action expert from the VLM also made it straightforward to run it at a lower frequency.
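A staged pipeline with per-block timing can be sketched as follows. The three stage functions are trivial stand-ins that only mimic the data flow (the real blocks are neural networks, and the LLM backbone also consumes textual embeddings); `run_stages` is a hypothetical helper, not part of any NXP or LeRobot API.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_stages(stages: List[Stage], inputs: Any) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record the wall-clock time of each one."""
    timings: Dict[str, float] = {}
    x = inputs
    for name, fn in stages:
        start = time.perf_counter()
        x = fn(x)
        timings[name] = time.perf_counter() - start
    return x, timings


# Stand-in blocks: each could be swapped for a separately exported and
# separately quantized model without touching the others.
def vision(frames):
    return [sum(frame) / len(frame) for frame in frames]


def llm_backbone(embeddings):
    return [e * 2.0 for e in embeddings]


def action_expert(features):
    return [round(f, 3) for f in features]


stages = [
    ("vision", vision),
    ("llm_backbone", llm_backbone),
    ("action_expert", action_expert),
]
actions, timings = run_stages(stages, [[0.0, 1.0], [2.0, 4.0]])
print(actions)        # [1.0, 6.0]
print(list(timings))  # ['vision', 'llm_backbone', 'action_expert']
```

Because each block sits behind its own callable, one block can be replaced by a quantized variant and the change in both its latency and the final task accuracy can be attributed to that block alone.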

2) Quantization

To optimize inference for the i.MX 95 SoC, we explored several quantization techniques on the different blocks. We found that quantizing the vision encoder and LLM prefill had limited impact on accuracy, whereas quantizing the denoising flow in the action expert significantly degraded performance. This behaviour is expected, as quantization errors accumulate across iterative denoising steps.

That is why we decided to keep this block at higher precision to preserve stability, while on the other blocks we explored various quantization configurations, from 8-bit mixed precision to 4-bit quantization, depending on the layers.
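The accumulation effect can be reproduced with a toy example. The sketch below is not the SmolVLA action expert and not NXP's quantization scheme: it assumes a symmetric per-tensor fake quantizer and a hand-picked linear velocity field integrated with ten Euler steps, purely to show how a fixed weight error compounds when the same weights are applied iteratively.

```python
def fake_quantize(weights, bits):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def denoise(weights, x0, steps, dt=0.1):
    """Toy Euler loop x <- x + dt * w * x; returns the state after each step."""
    x, trajectory = list(x0), []
    for _ in range(steps):
        x = [xi + dt * wi * xi for xi, wi in zip(x, weights)]
        trajectory.append(list(x))
    return trajectory


def max_error_per_step(reference, other):
    """Largest absolute deviation from the reference trajectory at each step."""
    return [max(abs(a - b) for a, b in zip(r, o)) for r, o in zip(reference, other)]


weights = [1.0, 0.37, -0.52]  # arbitrary illustrative weights
x0 = [1.0, 1.0, 1.0]
reference = denoise(weights, x0, steps=10)
err8 = max_error_per_step(reference, denoise(fake_quantize(weights, 8), x0, steps=10))
err4 = max_error_per_step(reference, denoise(fake_quantize(weights, 4), x0, steps=10))

for step, (e8, e4) in enumerate(zip(err8, err4), start=1):
    print(f"step {step:2d}  8-bit error {e8:.6f}  4-bit error {e4:.6f}")
```

In this toy setting the deviation grows at every step for both bit widths and is much larger at 4 bits, which is consistent with keeping the iterative block at higher precision while quantizing the single-pass blocks more aggressively.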

In addition, we applied in-house optimizations to the different blocks. The resulting variants are referred to as optimized models in the results table.

3) Asynchronous Inference: Control-Aware Scheduling

In a synchronous control loop, the pipeline operates as:

1. Capture observation
2. Run full model inference
3. Execute generated action

During step (2), the robot remains idle. If inference latency is non-negligible, this produces:

Idle gaps in motion
Oscillatory corrections due to stale observations
Reduced effective control frequency
Poor recovery behavior

With asynchronous inference, action generation runs in parallel with execution:

The robot executes the current action chunk
The next chunk is computed simultaneously

This increases effective control frequency, reduces observation staleness, and improves recovery behavior.
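The overlap between execution and generation can be sketched with a producer thread and a bounded queue. This is a simplified illustration under stated assumptions, not the scheduler used on the i.MX 95: `run_async`, `policy`, `get_observation`, and `execute` are hypothetical names, and no merging of overlapping chunks is performed.

```python
import queue
import threading


def run_async(policy, get_observation, execute, n_chunks):
    """Execute action chunks while the next chunk is computed in the background."""
    chunks = queue.Queue(maxsize=1)  # at most one finished chunk waits in the buffer

    def producer():
        for _ in range(n_chunks):
            # Inference runs here while the main thread is executing actions.
            chunks.put(policy(get_observation()))  # blocks while the buffer is full
        chunks.put(None)  # sentinel: no more chunks

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        chunk = chunks.get()  # waits only if inference is slower than execution
        if chunk is None:
            break
        for action in chunk:
            execute(action)
    worker.join()


# Stand-in policy: each observation yields a chunk of three integer "actions".
observations = iter(range(100))
executed = []
run_async(
    policy=lambda obs: [obs * 10 + j for j in range(3)],
    get_observation=lambda: next(observations),
    execute=executed.append,
    n_chunks=3,
)
print(executed)  # [0, 1, 2, 10, 11, 12, 20, 21, 22]
```

In CPython, the two threads only overlap in practice when the policy call releases the interpreter lock, which native inference runtimes typically do. The main loop stalls on `chunks.get()` whenever a chunk takes longer to compute than the previous one takes to execute, which is the idle gap that asynchronous inference is meant to remove.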

On embedded platforms such as the i.MX 95 SoC, asynchronous inference is essential, but it is only effective if inference latency stays under the action horizon budget: inference time < execution time.

Parameter	Synchronous inference	Asynchronous inference
Actions per chunk	100	100
FPS	60	60
Chunk size threshold	N/A	0.2
Aggregate function	N/A	weighted_average

[Figures: action queue evolution and results for synchronous and asynchronous inference]
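The budget can be worked out from the parameters above. In the sketch below, the helper names are hypothetical, and reading the chunk size threshold as the fraction of a chunk still queued when the next inference is triggered is an assumption made here, since this article does not define it. The two latencies are the ACT figures from the results table.

```python
def chunk_duration_s(actions_per_chunk, fps):
    """Time the robot needs to execute one full chunk of actions."""
    return actions_per_chunk / fps


def time_before_queue_empties_s(actions_per_chunk, fps, chunk_size_threshold=1.0):
    """Execution time left when the next inference starts.

    Assumption: inference is triggered once the queue holds
    chunk_size_threshold * actions_per_chunk actions.
    """
    return chunk_size_threshold * actions_per_chunk / fps


def fits_budget(inference_latency_s, actions_per_chunk, fps, chunk_size_threshold=1.0):
    """True when the next chunk is ready before the action queue runs empty."""
    budget = time_before_queue_empties_s(actions_per_chunk, fps, chunk_size_threshold)
    return inference_latency_s < budget


full_chunk = chunk_duration_s(100, 60)                      # about 1.67 s
with_threshold = time_before_queue_empties_s(100, 60, 0.2)  # about 0.33 s
print(round(full_chunk, 2), round(with_threshold, 2))       # 1.67 0.33

print(fits_budget(0.32, 100, 60, 0.2))  # True: optimized ACT latency
print(fits_budget(2.86, 100, 60))       # False: ONNX FP32 ACT latency
```

Under these assumptions, a 100-action chunk at 60 FPS leaves roughly 0.33 s for inference when the threshold is 0.2, so the 0.32 s optimized ACT model fits the budget while the 2.86 s FP32 model does not, even with a full chunk of execution time available.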

📊 What We Achieve on i.MX 95 Applications Processor

Setup

Tasks: "Grab the tea bag and place it in the mug."
Test set (20 episodes): 2 random positions for each cluster.
Validation set (10 episodes): all 10 positions in cluster n°6

Platform (CPU)	Policy	Format	Inference Latency	Accuracy Test Set (20)	Accuracy Validation Set (10)	Global Accuracy (30)
i.MX 95	ACT	ONNX FP32	2.86 s	1.00	0.90	0.96
i.MX 95	ACT	Optimized	0.32 s	1.00	0.60	0.89
i.MX 95	SmolVLA	ONNX FP32	29.1 s	0.50	0.40	0.47

⏩ Next Steps

Our immediate objective is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-board inference latency of 6.15 s.

The next phase will focus on deeper optimizations on our NPUs. In parallel, we aim to move from a single-task setup toward longer-horizon and more complex scenarios. To do that, we will introduce:

Simulation environments for scalable data generation and benchmarking
Reinforcement Learning (RL) for policy refinement
Sim-to-Real transfer to bridge domain gaps and improve real-world performance

The goal is to move from a single validated manipulation task toward a reproducible methodology for deploying VLA policies on embedded robotic systems.

✅ Checklists You Can Reuse

Recording

Fixed mounts verified
Camera focus and illumination checked
Gripper claws grip reliably
Calibration file backups saved
Contrast validated

Training

Save and evaluate checkpoints every 20k steps
Also save your training parameters so you can resume training if needed
Prepare your validation set and your accuracy and latency tracking method in advance
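The first two training items can be sketched with the standard library alone. The function names, the `training_params.json` file name, and the parameter dictionary are hypothetical choices for illustration; a training framework will usually provide its own checkpoint and configuration mechanism.

```python
import json
import tempfile
from pathlib import Path


def is_checkpoint_step(step, every=20_000):
    """True on the steps where a checkpoint should be saved and evaluated."""
    return step > 0 and step % every == 0


def save_training_params(params, run_dir):
    """Write the training parameters next to the checkpoints as JSON."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "training_params.json"
    path.write_text(json.dumps(params, indent=2, sort_keys=True))
    return path


# Values taken from this article; the keys are illustrative.
params = {"policy": "act", "batch_size": 8, "total_steps": 200_000}
checkpoint_steps = [s for s in range(0, 200_001, 10_000) if is_checkpoint_step(s)]

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_training_params(params, Path(tmp) / "run")
    restored = json.loads(saved_path.read_text())

print(len(checkpoint_steps), restored == params)  # 10 True
```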

Deployment on i.MX 95 SoC

You are satisfied with your accuracy
Contact us to have your model optimized

📚 Resources & Inspiration

ACT documentation & paper (core idea, action chunking, low‑demo success). [huggingface.co], [arxiv.org]
SmolVLM/SmolVLA family & repos (compact multimodal + VLA design). [huggingface.co], [github.com], [smolvla.net]
Sherry Chen’s HF blog on training ACT on SO‑101 (practical lessons, pitfalls, fixes). [huggingface.co]
