From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub

Hugging Face - Blog


Jared Sulzdorf, yuchenglow, Zach Nation, saba noorassa · 2025-02-12 · via Hugging Face - Blog


Content-defined chunking (CDC) plays a central role in enabling deduplication within a Xet-backed repository. The idea is straightforward: break each file’s data into chunks, store only unique ones, reap the benefits.
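To make the idea concrete, here is a minimal sketch of content-defined chunking with a gear-style rolling hash. It is an illustration only, not the algorithm used in xet-core: the mask, the minimum and maximum chunk sizes, the random gear table, and the function names are all assumptions chosen to keep the example small.

```python
import hashlib
import random

# Hypothetical parameters for illustration; not the values used by xet-core.
MASK = (1 << 12) - 1   # cut when the low 12 bits of the hash are zero
MIN_CHUNK = 1024       # never cut before this many bytes
MAX_CHUNK = 16 * 1024  # always cut at this many bytes

_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # fixed per-byte random table


def chunk_boundaries(data: bytes) -> list[int]:
    """Return the end offset of every chunk in `data`."""
    ends = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear hash: older bytes are shifted out, so the value depends
        # only on the most recent bytes, not on the absolute offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            ends.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        ends.append(len(data))  # trailing partial chunk
    return ends


def chunk_hashes(data: bytes) -> list[str]:
    """Hash each chunk so identical chunks can be stored once."""
    hashes = []
    start = 0
    for end in chunk_boundaries(data):
        hashes.append(hashlib.sha256(data[start:end]).hexdigest())
        start = end
    return hashes
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, inserting a few bytes at the start of a file typically changes only the first chunk or two; the remaining chunk hashes usually line up again and can be deduplicated.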

In practice, it's more complex. If we focused solely on maximizing deduplication, the design would call for the smallest possible chunk size. By doing that, we’d create significant overheads for the infrastructure and the builders on the Hub.

On Hugging Face's Xet team, we're bringing CDC from theory to production to deliver faster uploads and downloads to AI builders (by a factor of 2-3x in some cases). Our guiding principle is simple: enable rapid experimentation and collaboration for teams building and iterating on models and datasets. This means focusing on more than just deduplication; we’re optimizing how data moves across the network, how it’s stored, and the entire development experience.

The Realities of Scaling Deduplication

Imagine uploading a 200GB repository to the Hub. Today, there are a number of ways to do this, but all of them use a file-centric approach. To bring faster file transfers to the Hub, we've open-sourced xet-core and hf_xet, an integration with huggingface_hub that takes a chunk-based approach and is written in Rust.

If you consider a 200GB repository with unique chunks, that's 3 million entries (at ~64KB per chunk) in the content-addressed store (CAS) backing all repositories. If a new version of a model is uploaded or a branch in the repository is created with different data, more unique chunks are added, driving up the entries in the CAS.

With nearly 45PB across 2 million model, dataset, and space repositories on the Hub, a purely chunk-based approach could incur 690 billion chunks. Managing this volume of content using only chunks is simply not viable due to:

Network Overheads: If each chunk is downloaded or uploaded individually, millions of requests are generated on each upload and download, overwhelming both client and server. Even batching queries simply shifts the problem to the storage layer.
Infrastructure Overheads: A naive CAS that tracks chunks individually would require billions of entries, leading to steep monthly bills on services like DynamoDB or S3. At Hugging Face’s scale, this quickly adds up.

In short, network requests balloon, databases struggle to manage the metadata, and the cost of orchestrating each chunk skyrockets, all while you wait for your files to transfer.
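The entry counts quoted above can be reproduced with a quick back-of-the-envelope calculation. This sketch assumes decimal gigabytes and petabytes and a 64 KiB (65,536-byte) chunk, which is one reading of "~64KB" that lands close to the stated figures; the helper name is ours.

```python
CHUNK_BYTES = 64 * 1024  # assumed reading of "~64KB per chunk"


def chunk_entries(total_bytes: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Number of CAS entries if every chunk is unique and tracked individually."""
    return total_bytes // chunk_bytes


repo_entries = chunk_entries(200 * 10**9)  # a 200GB repository
hub_entries = chunk_entries(45 * 10**15)   # roughly 45PB across the Hub

print(f"{repo_entries:,}")  # about 3 million
print(f"{hub_entries:,}")   # about 690 billion
```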

Design Principles for Deduplication at Scale

These challenges lead to a key realization:

Deduplication is a performance optimization, not the final goal.

The final goal is to improve the experience of builders iterating and collaborating on models and datasets. The system components from the client to the storage layer do not need to guarantee deduplication. Instead, they leverage deduplication as one tool among many to aid in this.

By loosening the deduplication constraint, we naturally arrive at a second design principle:

Avoid communication or storage strategies that scale 1:1 with the number of chunks.

What does this mean? We scale with aggregation.

Scaling Deduplication with Aggregation

Aggregation takes chunks and groups them, referencing them intelligently in ways that provide clever (and practical) benefits:

Blocks: Instead of transferring and storing chunks, we bundle data together in blocks of up to 64MB after deduplication. Blocks are still content-addressed, but this reduces CAS entries by a factor of 1,000.
Shards: Shards provide the mapping between files and chunks (referencing blocks as they do so). This allows us to identify which parts of a file have changed, referencing shards generated from past uploads. When chunks are already known to exist in the CAS, they’re skipped, slashing unnecessary transfers and queries.
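As a rough illustration of how blocks and shards fit together, the sketch below deduplicates chunks, packs the unique ones into size-limited blocks, and records for each file where its chunks live. The data structures, the use of SHA-256, and the way a block hash is derived are assumptions made for the example; the actual formats in xet-core may differ.

```python
import hashlib
from dataclasses import dataclass, field

MAX_BLOCK_BYTES = 64 * 1024 * 1024  # "up to 64MB" per block


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Block:
    chunk_hashes: list = field(default_factory=list)
    size: int = 0

    @property
    def block_hash(self) -> str:
        # Assumed scheme: address a block by hashing its ordered chunk hashes.
        return sha("".join(self.chunk_hashes).encode())


def pack(files, max_block_bytes=MAX_BLOCK_BYTES):
    """`files` maps a file name to its list of chunks (bytes).

    Returns (blocks, shard), where shard maps each file name to an ordered
    list of (block index, chunk index) references.
    """
    blocks = [Block()]
    location = {}  # chunk hash -> (block index, chunk index within block)
    shard = {}
    for name, chunks in files.items():
        refs = []
        for chunk in chunks:
            h = sha(chunk)
            if h not in location:  # only unseen chunks are stored
                current = blocks[-1]
                if current.chunk_hashes and current.size + len(chunk) > max_block_bytes:
                    blocks.append(Block())
                    current = blocks[-1]
                location[h] = (len(blocks) - 1, len(current.chunk_hashes))
                current.chunk_hashes.append(h)
                current.size += len(chunk)
            refs.append(location[h])
        shard[name] = refs
    return blocks, shard
```

The CAS now needs one entry per block rather than one per chunk, while the shard keeps enough information to rebuild every file.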

Together, blocks and shards unlock significant benefits. However, when someone uploads a new file, how do we know if a chunk has been uploaded before so we can eliminate an unnecessary request? Performing a network query for every chunk is not scalable and goes against the “no 1:1” principle we mentioned above.

The solution is key chunks, a 0.1% subset of all chunks selected with a simple modulo condition on the chunk hash. We provide a global index over these key chunks and the shards they are found in, so that when a key chunk is queried, the related shard is returned and can be used for local deduplication. This lets us exploit spatial locality: if a key chunk is referenced in a shard, it’s likely that other nearby chunks are referenced in the same shard. This further improves deduplication and reduces network and database requests.

Figure: Key chunks
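The key-chunk lookup can be sketched as follows. The modulus of 1000 is simply one way to select roughly 0.1% of hashes; the exact condition, the index structure, and all names here are hypothetical. Chunk hashes are modeled as plain integers to keep the example short.

```python
KEY_CHUNK_MODULUS = 1000  # assumed: 1 in 1000 hashes, i.e. about 0.1%


def is_key_chunk(chunk_hash: int) -> bool:
    return chunk_hash % KEY_CHUNK_MODULUS == 0


class GlobalIndex:
    """Maps key chunks (only) to a shard that references them."""

    def __init__(self):
        self._key_to_shard = {}

    def register_shard(self, shard_id, chunk_hashes):
        for h in chunk_hashes:
            if is_key_chunk(h):
                self._key_to_shard.setdefault(h, shard_id)

    def lookup(self, chunk_hash):
        return self._key_to_shard.get(chunk_hash)


def plan_upload(chunk_hashes, index, shards):
    """Return (hashes that still need uploading, number of global queries).

    Only key chunks trigger a query. A hit returns a whole shard, and the
    rest of the deduplication happens locally against that shard.
    """
    known = set()
    queries = 0
    for h in chunk_hashes:
        if is_key_chunk(h) and h not in known:
            queries += 1
            shard_id = index.lookup(h)
            if shard_id is not None:
                known.update(shards[shard_id])
    missing = [h for h in chunk_hashes if h not in known]
    return missing, queries
```

In this toy setup, a file that overlaps an earlier upload by a thousand chunks is resolved with a single global query instead of one query per chunk.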

Aggregated Deduplication in Practice

The Hub currently stores over 3.5PB of .gguf files, most of which are quantized versions of other models on the Hub. Quantized models represent an interesting opportunity for deduplication due to the nature of quantization where values are restricted to a smaller integer range and scaled. This restricts the range of values in the weight matrices, naturally leading to more repetition. Additionally, many repositories of quantized models store multiple different variants (e.g., Q4_K, Q3_K, Q5_K) with a great deal of overlap.

A good example of this in practice is bartowski/gemma-2-9b-it-GGUF which contains 29 quantizations of google/gemma-2-9b-it totalling 191GB. To upload, we use hf_xet integrated with huggingface_hub to perform chunk-level deduplication locally then aggregate and store data at the block level.

Once uploaded, we can start to see some cool patterns! We’ve included a visualization that shows the deduplication ratio for each block. The darker the block, the more frequently parts of it are referenced across model versions. If you go to the Space hosting this visualization, hovering over any heatmap cell highlights all references to the block in orange across all models while clicking on a cell will select all other files that share blocks:

Figure: Quantization deduplication visualization

A single deduplicated block might only represent a few MB of savings, but as you can see, there are many overlapping blocks, and across this many blocks the savings add up quickly. Instead of uploading 191GB, the Xet-backed version of the gemma-2-9b-it-GGUF repository stores 1515 unique blocks, for a total of approximately 97GB in our test CAS environment (a saving of ~94GB).

While the storage improvements are significant, the real benefit is what this means for contributors to the Hub. At 50 Mb/s (megabits per second), the deduplication optimizations amount to a four-hour difference in upload time, a speedup of nearly 2x:

Repo	Stored Size	Upload Time @ 50 Mb/s
Original	191 GB	509 minutes
Xet-backed	97 GB	258 minutes
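The upload times in the table are consistent with a 50 megabit-per-second link and decimal gigabytes. Here is a minimal sketch of that arithmetic; the function name is ours, and the results match the table to within a minute of rounding.

```python
def upload_minutes(size_gb: float, link_mbps: float) -> float:
    """Transfer time in minutes for `size_gb` gigabytes over a `link_mbps` megabit/s link."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 60


original = upload_minutes(191, 50)    # about 509 minutes
xet_backed = upload_minutes(97, 50)   # just under 259 minutes

saved_hours = (original - xet_backed) / 60  # a little over 4 hours
speedup = original / xet_backed             # just under 2x
```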

Similarly, local chunk caching significantly speeds up downloads. If a file is changed, or a new quantization is added that overlaps heavily with the local chunk cache, you won’t have to re-download any chunks that are unchanged. This contrasts with the file-based approach, where the entirety of the new or updated file must be downloaded.
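The download side can be pictured with a similarly small sketch: rebuild a file from its ordered chunk hashes and go to the network only on cache misses. The `cache` dictionary and the `fetch` callback are stand-ins for illustration, not part of any real client API.

```python
def reconstruct(chunk_hashes, cache, fetch):
    """Rebuild a file from its chunk hashes, fetching only chunks not in `cache`.

    `cache` maps chunk hash -> bytes and is updated in place.
    `fetch` is a hypothetical callable that downloads one chunk by hash.
    Returns (file bytes, number of chunks fetched).
    """
    fetched = 0
    parts = []
    for h in chunk_hashes:
        if h not in cache:
            cache[h] = fetch(h)
            fetched += 1
        parts.append(cache[h])
    return b"".join(parts), fetched
```

Downloading a second file that shares most of its chunks with the first then costs only the chunks that differ.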

Taken together, this demonstrates how local chunk-level deduplication paired with block-level aggregation dramatically streamlines not just storage, but developing on the Hub. By providing this level of efficiency in file transfers, AI builders can move faster, iterate quickly, and worry less about hitting infrastructure bottlenecks. For anyone pushing large files to the Hub (whether you're pushing a new model quantization or an updated version of a training set) this helps you shift focus to building and sharing, rather than waiting and troubleshooting.

We’re hard at work rolling out the first Xet-backed repositories in the coming weeks and months! As we do, we will release more updates to bring these speeds to every builder on the Hub and make file transfers feel invisible.
