Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations
Gaetan Bahl, Enzo Ruedas, Tess Boivin · 2026-03-05 · via Hugging Face - Blog


Authors: Enzo Ruedas, Tess Boivin


Recent advances in Large Language Models have enabled the transition from text-only reasoning to multimodal systems: first with the integration of visual perception in Vision–Language Models (VLMs), and more recently with the generation of robot actions in Vision–Language–Action (VLA) models. Deploying these models on embedded robotic platforms remains a challenge due to tight constraints on compute, memory, and power, as well as real-time control requirements.

In synchronous control pipelines, the arm sits idle awaiting commands while the VLA runs inference, which leads to oscillatory behavior and delayed corrections. Asynchronous inference tackles this by decoupling generation from execution, enabling smooth and continuous motion. To be effective, however, the end-to-end inference latency must remain shorter than the action execution duration. This temporal constraint therefore sets an upper limit on the model's inference latency or, equivalently, a minimum throughput the model must sustain.

Bringing VLA models to embedded platforms is not merely a matter of model compression; it is a complex systems engineering problem requiring architectural decomposition, latency-aware scheduling, and hardware-aligned execution. Addressing these challenges is essential to translate recent advances in multimodal foundation models into practical and deployable embedded robotic systems.

This guide presents NXP’s hands‑on best practices for recording reliable robotic datasets and fine‑tuning VLA policies (ACT and SmolVLA), and highlights the real-time performance that the NXP i.MX 95 SoC achieves after optimization.


🎥 Dataset Recording: What Actually Matters

High‑quality, consistent data beats “more but messy” data. This section turns hard‑earned lessons into concrete checklists and schemas.

In our case, we recorded a dataset for the task: "Put the tea bag in the mug."

1) Consistency First

  • Fixed cameras: Use rigid mounts to avoid pose drift. If during recording or evaluation one or more cameras shift because of the robot's vibrations or the operator resetting the environment, you can observe a severe accuracy loss.
  • Controlled lighting: Set up your environment so that you have as much control over lighting as possible (fixed light sources, away from sunlight, which varies during the day).
  • Strong contrast: Avoid training with “white on white” unless that’s your deployment domain. Maximize contrast between the arm, the object and the environment.
  • Fixed calibration: Make sure to have backups of your robot and teleoperator calibrations so you don't have to re-record your previous episodes if the code crashes.
  • Do not cheat: Do not use information the model will not have access to at inference time. During data recording, it is tempting for the operator to rely on direct visual observation of the scene. However, this introduces information that is absent from the dataset. Dataset collection must be restricted to the same camera inputs that will be available to the policy at runtime.

2) Use a Gripper Camera (Highly Recommended)

Moving from scene‑only views to mixed viewpoints increases overall accuracy, but each additional camera adds latency, so you must choose the right compromise. In our case, that balance was reached with 3 cameras (top, gripper, and left).

We strongly recommend using a gripper-mounted camera. It consistently improves success rates on fine manipulation tasks by providing a close, task-relevant viewpoint. Importantly, it is also the camera that most effectively enforces correct data collection practices, allowing the operator to rely exclusively on the robot’s perception rather than observing the scene directly.

When installing a gripper camera, we recommend securing the cable with Velcro or a strain-relief guide to prevent it from obstructing the field of view or becoming disconnected during motion.

3) Improve Prehension

Simple hardware tweaks, such as heat‑shrink tubing over the gripper claws, increase friction, reduce roughness, and reduce slippage during episodes. This raises the task success rate (fewer “almost success” episodes) and improves policy learning stability.

4) Diversity & Splits

When recording a dataset, you should:

  • Vary the episode distribution: Divide your workspace into starting-position clusters, and record at least 10 episodes per cluster. Add diversity by changing the object's position and rotation.

    e.g. we partitioned the robot arm’s reachable workspace into 11 clusters, each measuring 10 × 10 cm.

  • Differentiate training & validation sets: Policies can easily overfit on the training set, so make sure that the validation set is unseen by the model.

    e.g. we removed cluster 6 from the training set.

  • Record as many movements as you can: Small VLA models generalize poorly to unseen motions. Therefore, record episodes that cover the widest possible range of motion across the arm's degrees of freedom.

    e.g. we grasped the tea bag either in horizontal or vertical position.

  • Anticipate failure: Sometimes the policy will not reach the object on the first attempt and will have to "go back to it". We found that dedicating about 20% of all episodes to this recovery case helps the model improve its overall success rate.

    e.g. around 20% of our training set corresponds to recovery episodes.
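
The recording plan described in this list can be written down as a small script before any teleoperation starts. The sketch below is a minimal illustration, not NXP's tooling: `EpisodeSpec` and `build_recording_plan` are hypothetical names, and the defaults (11 clusters, cluster 6 held out, 10 nominal and 2 recovery episodes per training cluster) simply restate the numbers given in this article.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeSpec:
    cluster: int   # 1-based index of the workspace cluster
    kind: str      # "nominal" or "recovery"
    split: str     # "train" or "validation"


def build_recording_plan(n_clusters=11, held_out_cluster=6,
                         nominal_per_cluster=10, recovery_per_cluster=2):
    """List the episodes to record, keeping one whole cluster unseen in training."""
    plan = []
    for cluster in range(1, n_clusters + 1):
        if cluster == held_out_cluster:
            # Held-out cluster: nominal episodes only, reserved for validation.
            plan += [EpisodeSpec(cluster, "nominal", "validation")] * nominal_per_cluster
        else:
            plan += [EpisodeSpec(cluster, "nominal", "train")] * nominal_per_cluster
            plan += [EpisodeSpec(cluster, "recovery", "train")] * recovery_per_cluster
    return plan


plan = build_recording_plan()
counts = Counter((episode.split, episode.kind) for episode in plan)
train_total = counts[("train", "nominal")] + counts[("train", "recovery")]
print(counts)
print(f"recovery share of training set: {counts[('train', 'recovery')] / train_total:.0%}")
```

Holding out an entire cluster, rather than random episodes, means the validation set probes positions the policy has never seen, which is the point of the split.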

This mirrors best practices across VLA papers and community guides. Here are 3 examples of data diversity within the same cluster:

Starting positions 1 and 2 correspond to different positions within the same cluster. In contrast, during the recovery episode, the robot does not begin in "starting mode" but is instead already near the mug and should proceed directly to retrieve the tea bag from that location.


🎛️ Fine‑Tuning VLAs

What we did in practice:

  • Tasks: "Grab the tea bag and place it in the mug."
  • Dataset:
    • 120 episodes: 10 clusters x (10 different tea bag starting positions + 2 recovery episodes)
    • 3 cameras (640x480px, 30fps): Top, Gripper, Left
    • Cluster n°6 was removed for validation
  • Batch size: 8
  • Training: Model checkpoint with the lowest validation loss after 200k steps was chosen

For ACT (100 actions per chunk), the best trade-off between accuracy, generalization, and motion smoothness across both the training and validation sets was found between 100k and 160k training steps. For SmolVLA (50 actions per chunk), the trade‑off appears after many more training steps. We found that continuing training slightly past the point where the model begins to overfit tends to improve overall accuracy.

Rule of thumb: choose the final checkpoint by evaluating task success on both the training and validation sets, not by training loss.
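
As a minimal sketch of this rule of thumb, the snippet below picks a checkpoint from measured success rates rather than from the loss curve. The `pick_checkpoint` helper, its tie-breaking order (validation first, then training, then the earlier step), and the success rates in `example` are all assumptions made for illustration; they are not values or tooling from our experiments.

```python
def pick_checkpoint(results):
    """Return the step whose checkpoint has the best validation success rate.

    `results` maps a training step to (train_success, validation_success),
    both measured by rolling out the policy, not read from the loss curve.
    Ties on validation are broken by training success, then by the earlier step.
    """
    return max(results, key=lambda step: (results[step][1], results[step][0], -step))


# Hypothetical success rates, for illustration only (not measured values).
example = {
    100_000: (0.90, 0.70),
    120_000: (0.95, 0.80),
    140_000: (1.00, 0.80),
    160_000: (1.00, 0.70),
}
best = pick_checkpoint(example)
print(f"selected checkpoint: step {best}")
```

Other criteria are just as defensible, for example maximizing the pooled success rate over both sets; what matters is that the ranking comes from rollouts.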


⚡ Optimizing for the NXP i.MX 95 Applications Processor

The i.MX 95 SoC integrates 6× Arm Cortex‑A55 cores, a Cortex‑M7 and a Cortex‑M33 MCU, a Mali GPU, a new NXP ISP, and the eIQ® Neutron NPU, targeting efficient, secure edge inference with multi‑camera support and strong I/O. [nxp.com]

1) Divide And Conquer

Instead of running each model as one monolithic graph, we decompose the VLA graph into logical stages (encoders, decoders, and action experts), which allows each component to be optimized, scheduled, and deployed independently.

In practice, SmolVLA is partitioned into the following sub-blocks:

  • Vision: processes RGB camera frames and produces visual embeddings.
  • LLM backbone: generates action tokens from visual and textual embeddings.
  • Action expert: applies flow matching to iteratively denoise action samples and outputs final control commands.

This separation allows per-block optimizations. The impact of quantizing each block can be measured separately to choose the best trade-off between latency and accuracy. Isolating the action expert from the VLM also makes it possible to run it at a lower frequency.
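
To make the per-block measurement concrete, here is a minimal sketch of a staged pipeline that times each block separately. The three stage functions are stand-ins with toy arithmetic, not the real SmolVLA blocks, and `run_pipeline` is a hypothetical helper; in practice each stage would wrap a separately exported model.

```python
import time


def run_pipeline(stages, inputs):
    """Run named stages in order; return the final output and per-stage latency in seconds."""
    timings = {}
    data = inputs
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Stand-in stages: each is a placeholder for a separately exported model block.
def vision_encoder(frames):
    return [sum(frame) / len(frame) for frame in frames]   # fake "embedding" per camera


def llm_backbone(embeddings):
    return sum(embeddings)                                  # fake fused context


def action_expert(context):
    return [context * 0.0] * 50                             # fake chunk of 50 actions


stages = [
    ("vision", vision_encoder),
    ("llm_backbone", llm_backbone),
    ("action_expert", action_expert),
]
frames = [[0.1, 0.2, 0.3]] * 3                              # three cameras, toy data
actions, timings = run_pipeline(stages, frames)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Because each stage is swapped independently, a quantized variant of one block can be dropped in and compared against the baseline without touching the others.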

2) Quantization

To optimize inference for the i.MX 95 SoC, we explored several quantization techniques on the different blocks. We found that quantizing the vision encoder and the LLM prefill had limited impact on accuracy, whereas quantizing the denoising flow in the action expert significantly degraded performance. This behaviour is expected, as quantization errors accumulate across the iterative denoising steps.

That is why we decided to keep this block at higher precision to preserve stability, while on the other blocks, we explored various quantization configurations, from 8-bit mixed precision to 4-bit quantization, depending on the layers.
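
The accumulation argument can be made concrete with a simplified model, which is an illustration rather than a description of the exact SmolVLA sampler. Assume the action expert integrates a learned velocity field $v$ with $K$ Euler steps of size $\Delta t = 1/K$, that quantization perturbs each velocity evaluation by at most $\varepsilon$, that $v$ is $L$-Lipschitz in $x$, and that both trajectories start from the same noise sample ($e_0 = 0$). All of these symbols are assumptions introduced here.

$$
\begin{aligned}
x_{k+1} &= x_k + \Delta t\, v(x_k, t_k), \qquad
\tilde{x}_{k+1} = \tilde{x}_k + \Delta t\,\bigl(v(\tilde{x}_k, t_k) + \delta_k\bigr), \quad \|\delta_k\| \le \varepsilon,\\
e_k &= \|\tilde{x}_k - x_k\| \;\Rightarrow\; e_{k+1} \le (1 + L\,\Delta t)\, e_k + \Delta t\, \varepsilon,\\
e_K &\le \Delta t\, \varepsilon \sum_{j=0}^{K-1} (1 + L\,\Delta t)^j
     = \varepsilon\, \frac{(1 + L\,\Delta t)^K - 1}{L}
     \le \varepsilon\, \frac{\exp(L) - 1}{L}.
\end{aligned}
$$

Under these assumptions, every step's perturbation is carried forward and can be amplified by later steps, whereas a quantization error in the vision encoder or the LLM prefill enters only once per inference.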

In addition, we applied in-house optimizations to the different blocks. The results are shown in the results table below, where these models are labeled "Optimized".

3) Asynchronous Inference: Control-Aware Scheduling

In a synchronous control loop, the pipeline operates as:

  1. Capture observation
  2. Run full model inference
  3. Execute generated action

During step (2), the robot remains idle. If inference latency is non-negligible, this produces:

  • Idle gaps in motion
  • Oscillatory corrections due to stale observations
  • Reduced effective control frequency
  • Poor recovery behavior

With Asynchronous Inference, action generation runs in parallel with execution:

  • The robot executes the current action chunk
  • The next chunk is computed simultaneously

This increases effective control frequency, reduces observation staleness, and improves recovery behavior.

On embedded platforms such as the i.MX 95 SoC, asynchronous inference is essential, but it is only effective if inference latency stays within the action-horizon budget: inference time < execution time
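
The constraint can be checked with a small tick-based simulation. This is a sketch under stated assumptions, not our scheduler: `count_idle_ticks` is a hypothetical helper, the chunk size threshold is read as the fraction of the action queue still remaining when the next inference is requested, latency is rounded up to whole control ticks, and a newly arrived chunk simply replaces the queue instead of being aggregated with it.

```python
import math


def count_idle_ticks(actions_per_chunk, fps, latency_s, threshold, n_ticks):
    """Simulate a control loop and count ticks where no action is available.

    threshold = 0 requests the next chunk only once the queue is empty
    (synchronous behaviour); threshold > 0 requests it early, while the
    remaining actions are still being executed (asynchronous behaviour).
    """
    latency_ticks = math.ceil(latency_s * fps)
    trigger_level = round(threshold * actions_per_chunk)
    queue = actions_per_chunk   # start with one full chunk already computed
    pending = None              # ticks left before the in-flight chunk arrives
    idle = 0
    for _ in range(n_ticks):
        if pending is None and queue <= trigger_level:
            pending = latency_ticks          # start computing the next chunk
        if pending is not None:
            if pending == 0:
                queue = actions_per_chunk    # simplification: new chunk replaces the queue
                pending = None
            else:
                pending -= 1
        if queue > 0:
            queue -= 1                       # execute one action this tick
        else:
            idle += 1                        # robot waits for the next chunk
    return idle


# 100 actions per chunk at 60 FPS; latencies are the ACT figures from the results table.
for label, latency, threshold in [
    ("ACT optimized, sync", 0.32, 0.0),
    ("ACT optimized, async", 0.32, 0.2),
    ("ACT FP32, async", 2.86, 0.2),
]:
    idle = count_idle_ticks(100, 60, latency, threshold, n_ticks=600)
    print(f"{label}: {idle} idle ticks out of 600")
```

Under this reading, a threshold of 0.2 on a 100-action chunk leaves 20 actions, about 0.33 s at 60 FPS, in which to hide the inference latency. A full chunk lasts 100 / 60 ≈ 1.67 s, so a latency above that cannot be hidden at any threshold.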

| | Synchronous inference | Asynchronous inference |
|---|---|---|
| Actions per chunk | 100 | 100 |
| FPS | 60 | 60 |
| Chunk size threshold | N/A | 0.2 |
| Aggregate function | N/A | weighted_average |
| Action queue evolution | (figure: async_g_0) | (figure: async_g_02) |

Results

📊 What We Achieve on i.MX 95 Applications Processor

Setup

  • Tasks: "Grab the tea bag and place it in the mug."
  • Test set (20 episodes): 2 random positions for each cluster.
  • Validation set (10 episodes): all 10 positions in cluster n°6

| Platform (CPU) | Policy | Format | Inference Latency | Accuracy Test Set (20) | Accuracy Validation Set (10) | Global Accuracy (30) |
|---|---|---|---|---|---|---|
| i.MX 95 | ACT | ONNX FP32 | 2.86 s | 1.00 | 0.90 | 0.96 |
| i.MX 95 | ACT | Optimized | 0.32 s | 1.00 | 0.60 | 0.89 |
| i.MX 95 | SmolVLA | ONNX FP32 | 29.1 s | 0.50 | 0.40 | 0.47 |

⏩ Next Steps

Our immediate objective is to improve task accuracy with SmolVLA (ONNX FP32). We have already established a baseline and measured an optimized on-board inference latency of 6.15 s.

The next phase will focus on deeper optimizations on our NPUs. In parallel, we aim to move from a single-task setup toward longer-horizon and more complex scenarios. To do that, we will introduce:

  • Simulation environments for scalable data generation and benchmarking
  • Reinforcement Learning (RL) for policy refinement
  • Sim-to-Real transfer to bridge domain gaps and improve real-world performance

The goal is to move from a single validated manipulation task toward a reproducible methodology for deploying VLA policies on embedded robotic systems.


✅ Checklists You Can Reuse

Recording

  • Fixed mounts verified
  • Good camera focus and illumination
  • Good gripper-claw prehension
  • Calibration file backups saved
  • Contrast validated

Training

  • Save/eval checkpoints every 20k steps
  • Also save your training parameters so that you can resume training if needed
  • Prepare your validation set and your method for tracking accuracy and latency in advance

Deployment on i.MX 95 SoC

  • You are satisfied with your accuracy
  • Contact us to have your model optimized
