PRX Part 3 — Training a Text-to-Image Model in 24h!


David Bertoin, Roman Frigg, Jon Almazán · 2026-03-04 · via Hugging Face - Blog

Introduction

Welcome back 👋

In the last two posts (Part 1 and Part 2), we explored a wide range of architectural and training tricks for diffusion models. We evaluated each idea in isolation, measuring throughput, convergence speed, and final image quality, to understand what actually moves the needle.

In this post, we want to answer a much more practical question:

What happens when we combine all the tricks that worked?

Instead of optimizing one dimension at a time, we’ll stack the most promising ingredients together and see how far we can push performance under a strict compute budget.

To make things concrete, we’re doing a 24-hour speedrun:

32 H200 GPUs
~$1500 total compute budget ($2/hour/GPU)
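
As a quick sanity check on that figure, assuming all 32 GPUs are billed for the full 24 hours at the quoted rate:

```latex
32 \ \text{GPUs} \times 24 \ \text{h} \times \$2 / \text{GPU-hour} = \$1536 \approx \$1500
```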

This is very far from the early diffusion days, where training competitive models could cost millions of dollars. The goal here is to demonstrate how much the field has evolved and how far careful engineering can take you in just a single day of training.

This speedrun is not just a fun experiment. It will likely serve as the foundation for our large-scale training recipe going forward.

Alongside the results, we’re also open-sourcing our code on GitHub, which contains:

The training code used for this speedrun
The experimental framework from the previous blog post

So you can reproduce, modify, and extend everything yourself.

The Training Recipe

Now let’s walk through what went into this 24h run.

X-prediction and Training in the Pixel Space

We use the x-prediction formulation from Back to Basics: Let Denoising Generative Models Denoise [Li and He, 2025]. As seen in Part 2, this enables training directly in pixel space and eliminates the need for a VAE altogether. We use a patch size of 32 and a 256-dimensional bottleneck in the initial token projection layer. This design keeps the sequence length under control, making pixel-space training computationally manageable even at higher resolutions.

At 512px, the sequence length is:

$$(512 / 32)^2 = 256$$

At 1024px, the sequence length becomes:

$$(1024 / 32)^2 = 1024$$

Instead of following the usual 256px → 512px → 1024px schedule, we start directly at 512px and then fine-tune at 1024px.

With controlled token counts and modern hardware, pixel-space training is no longer prohibitive. It is simply a cleaner and more direct formulation.
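
To make the token arithmetic concrete, here is a minimal sketch in plain Python. The function names are ours, and the 3-channel (RGB) input is an assumption; only the patch size of 32 and the 256-dimensional bottleneck come from the recipe above.

```python
def sequence_length(resolution: int, patch_size: int = 32) -> int:
    """Number of tokens for a square image cut into non-overlapping patches."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    return side * side


def patch_input_dim(patch_size: int = 32, channels: int = 3) -> int:
    """Raw pixel values per patch, before the bottleneck projection (RGB assumed)."""
    return patch_size * patch_size * channels


for res in (512, 1024):
    print(res, "px ->", sequence_length(res), "tokens")

# Each patch is flattened and projected down to the 256-dim bottleneck.
print(patch_input_dim(), "pixel values per patch -> 256-dim bottleneck")
```

Under these assumptions, each 32×32 patch carries 3072 raw values, so the bottleneck compresses the per-patch input by a factor of 12 before it reaches the transformer.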

Perceptual Losses

One very nice side effect of predicting $x_0$ directly in pixel space is that we can reuse a whole toolbox from classical computer vision.

When your model outputs latents, perceptual supervision becomes awkward. You either have to decode back to pixels or define losses in a learned latent space that may or may not align with human perception. Once you predict pixels directly, everything becomes straightforward again. You can plug in perceptual losses exactly as they were originally designed.

We take inspiration from the paper PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss [Ma et al.], where the authors introduce additional perceptual objectives on top of the diffusion loss. They show that adding perceptual signals can noticeably improve convergence speed and final visual quality.

For this 24h run, we add two auxiliary losses:

LPIPS ([Zhang et al.])
A DINO-based perceptual loss (we use DINOv2 [Oquab et al.])

The idea is simple: In addition to the standard flow matching objective, we encourage the predicted clean image to match the target image in a perceptual feature space. LPIPS captures low-level perceptual similarity, while DINO features provide a stronger semantic signal.

We keep the same overall idea as the paper, but we tweaked a few details. In our experiments, we found that it worked better to:

apply the perceptual losses on pooled full images instead of patch-wise features
apply them at all noise levels

These are small implementation details, but in our setting they consistently gave better results.

We used a weight of 0.1 for the LPIPS loss and 0.01 for the DINO perceptual loss, matching the values recommended in the original paper.

These losses are lightweight compared to the main transformer forward pass, and in our setup they add only a small overhead while providing a consistent quality boost.
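
The way the three terms combine can be sketched as a simple weighted sum. This is an illustration rather than the PRX training code: `total_loss` is a hypothetical helper, and the three inputs stand for scalar loss values that have already been computed elsewhere.

```python
def total_loss(
    flow_matching: float,
    lpips: float,
    dino: float,
    lpips_weight: float = 0.1,   # weight quoted above
    dino_weight: float = 0.01,   # weight quoted above
) -> float:
    """Main objective plus the two weighted perceptual terms."""
    return flow_matching + lpips_weight * lpips + dino_weight * dino


# Example with made-up loss values: 1.0 + 0.1 * 0.5 + 0.01 * 2.0
print(total_loss(flow_matching=1.0, lpips=0.5, dino=2.0))
```

With weights this small, the perceptual terms act as a nudge on top of the main objective rather than replacing it.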

Token Routing with TREAD

To make each step cheaper, we use token routing with TREAD [Krause et al., 2025], which randomly selects a fraction of tokens and lets them bypass a contiguous chunk of transformer blocks, then re-injects them later so nothing is dropped.

We picked TREAD over SPRINT (Park et al., 2025) mostly for simplicity, and because the extra complexity of SPRINT did not feel worth the fairly small additional compute savings in our setting (a sequence length of 64 with SPRINT vs. 128 with TREAD at 512px).

Following the TREAD recipe, we route 50% of the tokens from the 2nd block to the penultimate block of the transformer.
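
The routing mechanics can be sketched with plain Python lists. This is a toy illustration, not the TREAD or PRX implementation: `split_tokens` and `routed_forward` are hypothetical names, blocks are modelled as functions from a list to a list of the same length, and treating the routed span as inclusive of both the 2nd and the penultimate block is our assumption.

```python
import random


def split_tokens(num_tokens, route_fraction=0.5, seed=None):
    """Randomly choose which token indices bypass the routed blocks."""
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_routed = int(num_tokens * route_fraction)
    routed = sorted(indices[:num_routed])   # these skip the middle blocks
    active = sorted(indices[num_routed:])   # these go through every block
    return active, routed


def routed_forward(tokens, blocks, start, end, route_fraction=0.5, seed=None):
    """Apply all blocks; blocks[start:end] only see the active tokens."""
    x = list(tokens)
    for block in blocks[:start]:
        x = block(x)
    active, routed = split_tokens(len(x), route_fraction, seed)
    sub = [x[i] for i in active]
    for block in blocks[start:end]:
        sub = block(sub)
    # Re-inject: routed tokens keep their pre-route values, active ones are updated.
    for i, value in zip(active, sub):
        x[i] = value
    for block in blocks[end:]:
        x = block(x)
    return x, active, routed


# Toy model: 6 "blocks" that each add 1, so a token's value counts the blocks it saw.
blocks = [lambda xs: [v + 1 for v in xs] for _ in range(6)]
out, active, routed = routed_forward([0] * 256, blocks, start=1, end=5, seed=0)
print(len(active), "active tokens,", len(routed), "routed tokens")
```

In this toy setup, active tokens end with the value 6 and routed tokens with the value 2, which shows that the middle blocks only processed 128 of the 256 tokens.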

Routed models can look worse under vanilla CFG, especially when undertrained, so we implemented a simple self-guidance scheme inspired by Guiding Token-Sparse Diffusion Models (Krause et al., 2025), which guides using a dense vs. routed conditional prediction instead of relying on an unconditional branch.
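
By analogy with classifier-free guidance, one way to write such a scheme is to extrapolate from the routed prediction toward the dense one. The sketch below is our reading of that idea, not a confirmed description of the paper's method or of the PRX code; `self_guidance` and `scale` are hypothetical names.

```python
def self_guidance(dense_pred, routed_pred, scale):
    """Extrapolate from the routed conditional prediction toward the dense one.

    scale = 1.0 returns the dense prediction unchanged; larger values push further
    along the (dense - routed) direction, as the guidance scale does in CFG.
    """
    return [r + scale * (d - r) for d, r in zip(dense_pred, routed_pred)]


# Toy example on two "pixels".
print(self_guidance(dense_pred=[1.0, 2.0], routed_pred=[0.0, 2.0], scale=3.0))
```

Both predictions use the same text conditioning, so in this formulation the weaker branch comes from the routed forward pass rather than from an unconditional one.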

Representation Alignment with REPA and DINOv3

We used REPA [Yu et al., 2024] for representation alignment.

For the teacher, we went with DINOv3 [Siméoni et al. 2025] since it gave the best quality improvements in our previous experiments.

Concretely, we apply the alignment loss once, at the 8th transformer block, with a loss weight of 0.5.

Since we combine REPA with TREAD routing, we only compute the alignment loss on the non-routed tokens, meaning the tokens that actually go through the blocks where we apply the loss. This keeps the REPA signal consistent and avoids comparing features for tokens that skipped the computation path.
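
A minimal sketch of this masking, in plain Python, might look as follows. The names are hypothetical, the student features are assumed to be already projected to the teacher's dimension, and we use a `1 - cosine` distance for readability (a shifted version of the usual negative cosine similarity); only the 0.5 weight and the restriction to non-routed tokens come from the text.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def repa_loss(student_active, teacher_all, active_indices, weight=0.5):
    """Alignment loss over non-routed tokens only.

    student_active[k] is the student feature of token active_indices[k];
    teacher_all[i] is the teacher feature of token i (computed for every token).
    """
    if not active_indices:
        return 0.0
    distances = [
        1.0 - cosine_similarity(student_active[k], teacher_all[i])
        for k, i in enumerate(active_indices)
    ]
    return weight * sum(distances) / len(distances)


# Toy example: 4 tokens, tokens 1 and 3 were routed and have no student feature.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
student = [[2.0, 0.0], [1.0, -1.0]]  # features for tokens 0 and 2
print(repa_loss(student, teacher, active_indices=[0, 2]))
```

Indexing the teacher features with `active_indices` is what keeps the comparison limited to tokens that actually passed through the aligned block.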

Optimizer: Muon

We used the Muon optimizer, via the FSDP implementation from muon_fsdp_2, since it showed a clear improvement over Adam in our previous runs.

Muon is only applied to 2D parameters (basically matrices). Everything else (biases, norms, embeddings, etc.) is optimized with Adam, which is why the config has two parameter groups.

| Group | What it applies to | Key params we used |
| --- | --- | --- |
| Muon | 2D parameters | `lr=1e-4`, `momentum=0.95`, `nesterov=true`, `ns_steps=5` |
| Adam | all non-2D parameters | `lr=1e-4`, `betas=(0.9, 0.95)`, `eps=1e-8` |
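
The split into two groups can be sketched from parameter names and shapes alone. This is an illustration with plain dictionaries, not the muon_fsdp_2 API: `split_param_groups` and the `"embed"` name filter are hypothetical. The filter is there because embedding tables are usually stored as 2D tensors, yet the text lists them among the Adam parameters, so some name-based rule is needed in addition to the dimensionality check.

```python
def split_param_groups(param_shapes, adam_name_hints=("embed",)):
    """Assign each parameter to the Muon or Adam group.

    param_shapes: dict mapping parameter name -> shape tuple.
    adam_name_hints: substrings that force a parameter into the Adam group
    even when it is 2D (an assumption, see the note above).
    """
    muon, adam = [], []
    for name, shape in param_shapes.items():
        is_matrix = len(shape) == 2
        forced_adam = any(hint in name for hint in adam_name_hints)
        (muon if is_matrix and not forced_adam else adam).append(name)
    return [
        {"optimizer": "muon", "params": muon, "lr": 1e-4,
         "momentum": 0.95, "nesterov": True, "ns_steps": 5},
        {"optimizer": "adam", "params": adam, "lr": 1e-4,
         "betas": (0.9, 0.95), "eps": 1e-8},
    ]


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "blocks.0.attn.qkv.weight": (3072, 1024),
    "blocks.0.attn.qkv.bias": (3072,),
    "blocks.0.norm.weight": (1024,),
    "pos_embed.weight": (256, 1024),
}
groups = split_param_groups(shapes)
for group in groups:
    print(group["optimizer"], group["params"])
```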

Training Settings

We trained on three publicly available synthetic datasets:

Flux generated (1.7M), lehduong/flux_generated
FLUX-Reason-6M (6M), LucasFang/FLUX-Reason-6M
midjourney-v6-llava (1M), brivangl/midjourney-v6-llava, which we re-captioned with Gemini 1.5 to make prompts more consistent and cut down caption noise.

The schedule is basically to go fast at 512px, then sharpen at 1024px:

512px for 100k steps with batch size 1024
1024px for 20k steps with batch size 512, without REPA
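
To get a feel for the scale of this schedule, the sketch below works out the samples seen and the dense token count per batch in each stage. It assumes the quoted batch sizes are global, that the dataset sizes are the approximate figures listed above, and that all three datasets are used in both stages; none of these are confirmed details of the run.

```python
PATCH_SIZE = 32
DATASET_SIZE = 1_700_000 + 6_000_000 + 1_000_000  # approximate sizes quoted above

STAGES = [
    {"resolution": 512, "steps": 100_000, "batch_size": 1024},
    {"resolution": 1024, "steps": 20_000, "batch_size": 512},
]


def stage_stats(stage, patch_size=PATCH_SIZE):
    """Samples seen in a stage and dense tokens per batch (before routing)."""
    tokens_per_image = (stage["resolution"] // patch_size) ** 2
    return {
        "samples_seen": stage["steps"] * stage["batch_size"],
        "tokens_per_batch": stage["batch_size"] * tokens_per_image,
    }


stats = [stage_stats(stage) for stage in STAGES]
total_samples = sum(s["samples_seen"] for s in stats)

for stage, s in zip(STAGES, stats):
    print(f'{stage["resolution"]}px:', s)
print("approx. passes over the data:", round(total_samples / DATASET_SIZE, 1))
```

Halving the batch size at 1024px does not halve the work per step: each image has four times as many tokens, so a batch at 1024px still holds twice as many dense tokens as a batch at 512px.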

We also keep an EMA of the weights for sampling and eval:

smoothing = 0.999
update_interval = 10ba
ema_start = 0ba
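
The sketch below shows what these settings amount to, reading the `ba` suffix as "batches" (an assumption) and applying the smoothing factor once per update. The helper names are ours, and the trainer's actual EMA implementation may differ in details such as how the smoothing interacts with the update interval.

```python
def ema_update(ema_weights, weights, smoothing=0.999):
    """One EMA step: keep `smoothing` of the old average, blend in the rest."""
    return [smoothing * e + (1.0 - smoothing) * w for e, w in zip(ema_weights, weights)]


def run_ema(initial_weights, weight_history, smoothing=0.999, update_interval=10):
    """Track an EMA from batch 0, updating once every `update_interval` batches.

    weight_history[b - 1] holds the model weights after batch b.
    """
    ema = list(initial_weights)
    num_updates = 0
    for batch, weights in enumerate(weight_history, start=1):
        if batch % update_interval == 0:
            ema = ema_update(ema, weights, smoothing)
            num_updates += 1
    return ema, num_updates


# Toy run: a single weight that jumps from 0 to 1 and stays there for 100 batches.
ema, num_updates = run_ema([0.0], [[1.0]] * 100)
print(num_updates, "updates, EMA weight:", ema[0])
```

Under this reading, the 120k training steps of the full schedule would produce 12,000 EMA updates.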

Results and Closing Thoughts

Below are the evaluation curves we tracked throughout the run and a few sample grids from the final checkpoint:

For a one-day training run, this is already a pretty solid place to be. The model is not flawless yet (you can still spot some texture glitches, occasional weird anatomy, and it can get a bit shaky on very hard prompts), but it is clearly usable. Prompt following is strong, the overall aesthetic is consistent, and the 1024px stage mostly does what we want: sharpen details without breaking composition.

The key takeaway is that we're very close. The remaining issues look more like undertraining artifacts and limited data diversity than signs of a structural flaw in the recipe. The failure modes are consistent with what you’d expect from a model that simply hasn’t seen enough varied data yet. With more compute and broader coverage, this exact setup should continue improving in a fairly predictable way.

Zooming out, this speedrun also highlights how far diffusion training has come. By combining pixel-space training, efficient routing, representation alignment, and lightweight perceptual guidance, you can now get a meaningful model in about a day on a budget that would have sounded unrealistic not that long ago.

What’s next?

This 24h run is just a starting point, not the finish line. Next, we will keep pushing the same recipe with a bit more scale and iterate on the dataset mix and captioning.

All the code and configs behind this speedrun, as well as the full experimental framework used throughout Part 1 and Part 2, are available in the PRX repository: https://github.com/Photoroom/PRX.

While we don’t redistribute the exact training datasets used in this run, the pipeline is fully configurable and designed to be easily adapted to your own data. You can plug in different datasets, tweak individual components (TREAD, REPA, perceptual losses, Muon, etc.), and run controlled experiments with minimal friction. Our goal is to make this a practical playground for fast diffusion research, and we hope the community will use it to explore, benchmark, and iterate on these techniques in their own setups.

If you made it this far, thank you for reading. We would also love to have you join our Discord community, where we share PRX progress and results, and discuss anything diffusion and text-to-image related.

Goodbye for now, and stay tuned for the next round of experiments! 🚀

Acknowledgements

This speedrun was inspired by several recent efforts exploring fast and low-cost training of diffusion models. If you're interested in speedrunning text-to-image models, we encourage you to check out the following works:

Haridas, A., Shen, T., Yu, J. Nitro-T: Training a Text-to-Image Diffusion Model from Scratch in 1 Day. https://rocm.blogs.amd.com/artificial-intelligence/nitro-t-diffusion/README.html
Bhanded, S. Speedrunning ImageNet Diffusion. https://arxiv.org/abs/2512.12386
Sehwag, V., Kong, X., Li, J., Spranger, M., Lyu, L. Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget. https://arxiv.org/abs/2407.15811
Yeh, S.-Y. Home-made Diffusion Model from Scratch to Hatch. https://arxiv.org/abs/2509.06068
