1 Billion Classifications

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap

Derek Thomas · 2025-02-13 · via Hugging Face - Blog

Back to Articles

You’ve optimized your model. Your pipeline is running smoothly. But now, your cloud bill has skyrocketed. Running 1B+ classifications or embeddings per day isn’t just a technical challenge—it’s a financial one. How do you process at this scale without blowing your budget? Whether you're running large-scale document classification or bulk embedding pipelines for Retrieval-Augmented Generation (RAG), you need cost-efficient, high-throughput inference to make it feasible, and you get that from having a well optimized configuration.

These tasks often use encoder models, which are much smaller than modern LLMs, but at the 1B+ inference request scale it's still quite a non-trivial task. Just to be clear, that's English Wikipedia 144x over. I haven’t seen much information on how to approach this with cost in mind and I want to tackle that. This blog breaks down HOW to calculate cost and latency for large scale classification and embedding. We’ll analyze different model architectures, benchmark costs across hardware choices, and give you a clear framework for optimizing your own setup. Additionally we should be able to build some intuition if you don't feel like going through the process yourself.

You might have a couple questions:

What is the cheapest configuration to solve my task for 1B inputs? (Batch Inference)
How can I do that while also considering latency? (Heavy Usage)

Here is the code to make it happen: https://github.com/datavistics/encoder-analysis

tl;I'm not gonna reproduce this, tell me what you found;dr

With this pricing I was able to get this cost:

Approach

To evaluate cost and latency we need 4 key components:

Hardware options: A variety of hardware to compare costs
Deployment Facilitator: A way to deploy models with settings of our choice
Load Testing: A way of sending requests and measuring the performance
Inference Server: A way to run the model efficiently on the hardware of choice

I’ll be leveraging Inference Endpoints for my Hardware Options, as it allows me to choose from a wide range of hardware choices. Do note you can replace that with your GPUs of choice/consideration. For the Deployment Facilitator I'll be using the ever so useful Hugging Face Hub Library which allows me to programmatically deploy models easily.

For the Inference Server I’ll also be using Infinity which is an amazing library for serving encoder based models (and more now!). I’ve already written about TEI, which is another amazing library. You should definitely consider TEI when approaching your use-case, though this blog focuses on methodology rather than framework comparisons. Infinity has a number of key strengths, like serving multimodal embeddings, targeting different hardware (AMD, Nvidia, CPU and Inferentia) and running any new models that contain remote code that have not been integrated into huggingface’s transformer library. The most important of these to me is that most models are compatible by default.

For Load Testing I'll be using k6 from Grafana which is an open-source load testing tool written in go with a javascript interface. It’s easily configurable, has high performance, and has low overhead. It has a lot of built-in executors that are super useful. It also pre-allocates Virtual Users (VUs), which can be much more realistic than throwing some testing together yourself.

I'll go through 3 use-cases that should cover a variety of interesting points:

Optimization

Optimization is a tricky issue, as there is a lot to consider. At a high level, I want to sweep across important load testing parameters and find what works best for a single GPU, as most encoder models will fit in a single GPU. Once we have a baseline of cost for a single GPU, the number of GPUs and throughput can be scaled horizontally by increasing the number of replica GPUs.

Setup

For each use-case this is the high-level flow I’ll use:

Since you have the code, feel free to adjust any part of this to your:

GPUs
Deployment process
Experimentation Framework
Etc

You won’t hurt my feelings.

Load Testing Parameters

VUs and Batch Size are important because these influence how well we take advantage of all the compute available in the GPU. A large enough Batch Size makes sure we are fully utilizing the Streaming Multiprocessors and VRAM. There are scenarios where we have VRAM left over but there is a bandwidth cost that prevents throughput from increasing. So experimentation can help us. VUs allow us to make sure we are fully utilizing the batch size we have available.
These are the main parameters I'll be testing:

INFINITY_BATCH_SIZE
- This is how many documents will make a forward pass in the batch in the model
- Too low and we won't be utilizing the GPU
- Too high and the GPU can't handle the large input
VUs
- This is the number of Virtual Users simulating parallel client requests that is sent to K6
- It can be hard to simulate a large number of users, and each machine will vary.
GPUs
- We have a variety of GPUs available on Inference Endpoints
- I prioritized those with the best performance/cost ratio
CPUs
- I omitted these since Nvidia-T4s are so cheap that CPUs didn’t seem appealing upon light testing. I left some code to test this if a user is interested though!

Depending on your model you might want to consider:

Which Docker image you are using. ['michaelf34/infinity:0.0.75-trt-onnx', 'michaelf34/infinity:0.0.75']
- There are a number of Infinity Images that can support different backends. You should consider which ones are most applicable to your hardware/model/configuration.
INFINITY_COMPILE whether or not you want to use torch.compile() docs
INFINITY_BETTERTRANSFORMER whether or not you want torch to use Better Transformer

K6

K6 is great because it allows you to pre-allocate VUs and prevent some insidious bugs. It's quite flexible in how you are sending requests. I decided to use it in a specific way.

When I'm calling a k6 experiment, I mainly want to know what the throughput and average latency is for the requests, and have a few sanity checks for the load testing parameters I have chosen. I also want to do a sweep which means many experiments.

I use the shared-iterations executor (docs) which means K6 shares iterations between the number of VUs. The test ends once k6 executes all iterations. This allows me to have a reasonable time-out, but also put through enough requests to have a decent confidence that I can discriminate between load testing parameters choices in my sweep. In contrast to other executors, this allows me to be confident that I'm simulating the client working as hard as possible per VU which should show me what the cheapest option is.

I use 10_000 requests† and have a max experiment time of 1 min. So if 10_000 requests aren’t finished by 1 minute, then that experiment is over.

export const options = {
    scenarios: {
        shared_load_test: {
            executor: 'shared-iterations',
            vus: {{ pre_allocated_vus }}, 
            iterations: 10000, 
            maxDuration: '1m',
        },
    },
};

Stats

P95†† and Average Latency
Throughput

Sanity Checks

Accuracy (Classification only)
Test Duration
Successful Requests
Format validation

† I use less max requests for the Vision Embeddings given that the throughput is much lower and images are heavy.
†† P95 means that 95% of requests complete within this time. It represents the worst-case latency for most users.

Orchestration

You can find the 3 notebooks that put the optimization workflow together here:

The main purpose was to get my experiments defined, launch the correct Inference Endpoint with the right parameters, and launch k6 with the right parameters to test the endpoint.

I made a couple design choices which you may want to think through to see if they work for you:

I do an exponential increase of VUs then a binary search to find the best value
I don’t treat the results as exactly repeatable
- If you run the same test multiple times you will get slightly different results
- I used an improvement threshold of 2% to decide if I should keep searching

Classification

Introduction

Text Classification can have a variety of use-cases at scale, like Email Filtering, Toxicity Detection in pre-training datasets, etc. The OG classic architecture was BERT, and quickly many other architectures came about. Do note that in contrast to popular decoder models, these need to be fine-tuned on your task before using them. Today I think the following architectures† are the most notable:

DistilBERT
- Good task performance, Great Engineering performance
- It made some architectural changes from OG Bert and is not compatible for classification with TEI
DeBERTa-v3
- Great task performance
- Very slow engineering performance†† as its unique attention mechanism is hard to optimize
ModernBERT
- Uses Sequence Packing and Flash-Attention-2
- Great task performance and great engineering performance††

† Do note that for these models, you typically need to fine-tune this on your data to perform well
†† I'm using “Engineering performance” to denote anticipated latency/throughput

Experiment

I chose DistilBERT to focus on as it’s a great lightweight choice for many applications. I compared 2 GPUs, nvidia-t4 and nvidia-l4 as well as 2 Infinity Docker Images.

Results

You can see the results in an interactive format here or in the space embedded below in the Analysis section. This is the cheapest configuration across the experiments I ran:

Category	Best Value
Cost of 1B Inputs	$253.82
Hardware Type	`nvidia-L4` ($0.8/hr)
Infinity Image	default
`batch_size`	64
`vus`	448

Embedding

Introduction

Text Embeddings are a loose way of describing the task of taking a text input and projecting it into a semantic space where close points are similar in meaning and distant ones are dissimilar (example below). This is used heavily in RAG and is an important part of AI search (which some are fans of).

There are a large number of architectures which are compatible, and you can see the most performant ones in the MTEB Leaderboard.

Experiment

ModernBERT is the most exciting encoder release since DeBERTa back in 2020. It has all the ahem modern tricks built into an old and familiar architecture. It makes for an attractive model to experiment with as it's much less explored than other architectures and has a lot more potential. There are a number of improvements in speed and performance, but the most notable for a user is likely the 8k context window. Do check out this blog for a more thorough understanding.

It's important to note that Flash Attention 2 will only work with more modern GPUs due to the compute capability requirement, so I opted to skip the T4 in favor of the L4. An H100 would also work really well here for the heavy hitters category.

Results

You can see the results in an interactive format here. This is the cheapest configuration across the experiments I ran:

Category	Best Value
Cost of 1B Inputs	$409.44
Hardware Type	`nvidia-L4` ($0.8/hr)
Infinity Image	default
`batch_size`	32
`vus`	256

Vision Embedding

Introduction

ColQwen2 is a Visual Retriever which is based on the Qwen2-VL-2B-Instruct and uses ColBERT style multi-vector representations of both text and images. We can see that it has a complex architecture compared to the encoders we explored above.

There are a number of use-cases which might benefit from this at large scale like, e-commerce search, multi-modal recommendations, enterprise multi-modal RAG, etc

ColBERT style is different than our previous embedding use-case since it breaks the input into multiple tokens and returns a vector for each instead of 1 vector for the input. You can find an excellent tutorial from Jina AI here.This can lead to superior semantic encoding and better retrieval, but is also slower, and more expensive.

I'm excited for this experiment as it explores 2 lesser known concepts, vision embeddings and ColBERT style embeddings†. There are a few things to note about ColQwen2/VLMs:

2B is ~15x bigger than the other models we looked at in this blog
ColQwen2 has a complex architecture with multiple models including a decoder which is slower than an encoder
Images can easily consume a lot of tokens
API Costs:
- Sending images over an API is slower than sending text.
- You will further encounter larger egress costs if you are on a cloud

†Do check out this more detailed blog if this is new and interesting to you.

Experiment

I wanted to try a small and modern GPU like the nvidia-l4 since it should be able to fit the 2B param model but also scale well since it's cheap. Like the other embedding model, I'm varying batch_size and vus.

Results

You can see the results in an interactive format here. This is the cheapest configuration across the experiments I ran:

Category	Best Value
Cost of 1B Inputs	$44,496.51
Hardware Type	`nvidia-l4`
Infinity Image	`default`
`batch_size`	4
`vus`	4

Analysis

Do check out the detailed analysis in this space (derek-thomas/classification-analysis) for the classification use-case (hide side-bar and scroll down for the charts):

Conclusion

Scaling to 1B+ classifications or embeddings per day is a non-trivial challenge, but with the right optimizations, it can be made cost-effective. From my experiments, a few key patterns emerged:

Hardware Matters – NVIDIA L4 ($0.80/hr) consistently provided the best balance of performance and cost, making it the preferred choice over T4 for modern workloads. CPUs were not competitive at scale.
Batch Size is Critical – The sweet spot for batch size varies by task, but in general, maximizing batch size without hitting GPU memory and bandwidth limits is the key to efficiency. For classification, batch size 64 was optimal; for embeddings, it was 32.
Parallelism is Key – Finding the right number of Virtual Users (VUs) ensures GPUs are fully utilized. An exponential increase + binary search approach helped converge on the best VU settings efficiently.
ColBERT style Vision Embeddings are Expensive – At over $44,000 per 1B embeddings, image-based retrieval is 2 orders of magnitude costlier than text-based tasks.

Your data, hardware, models, etc might differ, but I hope you find some use in the approach and code provided. The best way to get an estimate is to run this on your own task with your own configurations. Let's get exploring!

Special thanks to andrewrreed/auto-bench for some inspiration, Michael Feil for creating Infinity. Also thanks to Pedro Cuenca, Erik Kaunismaki, and Tom Aarsen for helping me review.

References

Appendix

Sanity Checks

It's important to scale sanity checks as your complexity scales. Since we are using subprocess and jinja to then call k6 I felt far from the actual testing so I put a few checks in place.

Task Performance

The goal of task performance is that for a specific configuration we are performing similarly to what we expect. Task Performance will have different meanings across tasks. For classification I chose accuracy, and for embeddings I skipped it. We could look at average similarity and other similar metrics. It gets a bit more complex with ColBERT style since we are getting many vectors per request.

Even though 58% isn't bad for some 3 class classification tasks, it's irrelevant. The goal is to make sure that we are getting the expected task performance. If we see a significant change (increase or decrease) we should be suspicious and try to understand why.

Below is a great example from the classification use-case since we can see an extremely tight distribution and one outlier. Upon further investigation the outlier is due to a low amount of requests sent.

You can see the interactive results visualized by nbviewer here:

Classification

Failed Requests Check

We should expect to see no failed requests as Inference Endpoints has a queue to handle extra requests. This is relevant for all 3 use-cases: Classification, Embedding, and Vision Embedding.

sum(df.total_requests - df.successful_requests) allows us to see if we had any failed requests.

Monotonic Series - Did we try enough VUs?

As described above we are using a nice strategy of using exponential increases and then a binary search to find the best number of VUs. But how do we know that we tried enough VUs? What if we tried a higher amount of VUs and throughput kept increasing? If that's the case then we would see a monotonically increasing relationship between VUs and Throughput and we would need to run more tests.

You can see the interactive results visualized by nbviewer here:

Embedding Size Check

When we request an embedding it makes sense that we would always get back an embedding of the same size as specified by the model type. You can see the checks here:

Embedding
Vision Embedding (inner embedding)

ColBERT Embedding Count

We are using a ColBERT style model for Vision Embedding which means we should be getting multiple vectors per image. It's interesting to see the distribution of these as it allows us to check for anything unexpected and to get to know our data. To do so I stored the min_num_vectors, avg_num_vectors, max_num_vectors, in the experiments.

We should expect to see some variance in all 3, but it’s ok that min_num_vectors and max_num_vectors have the same value across experiments.

You can see the check here:

Vision Embedding

Cost Analysis

Here is a short description, but you can get a more detailed interactive experience in the space: derek-thomas/classification-analysis

Best Image by Cost Savings

For the classification use-case we looked at 2 different Infinity Images, default and trt-onnx. Which one was better? It's best if we can compare these across the same settings (GPU, batch_size, VUs). We can simply group our results by those settings and then see which one was cheaper.

Cost vs Latency

This is a key chart as for many use-cases there is a maximum latency that a user can experience. Usually if we allow the latency to increase we can increase throughput creating a trade-off scenario. That's true in all the use-cases I've looked at here. Having a pareto curve is great to help visualize where that tradeoff is.

Cost vs VUs and Batch Size Contour Plots

Lastly it's important to build intuition on what is happening when we try these different settings. Do we get beautiful idealized charts? Do we see unexpected gradients? Do we need more exploration in a region? Could we have issues in our setup like not having an isolated environment? All of these are good questions and some can be tricky to answer.

The contour plot is built by interpolating intermediate points in a space defined by the 3 dimensions. There are a couple phenomena that are worth understanding:

Color Gradient: Shows the cost levels, with darker colors representing higher costs and lighter colors representing lower costs.
Contour Lines: Represent cost levels, helping identify cost-effective regions.
Tight clusters: (of contour lines) indicate costs changing rapidly with small adjustments to batch size or VUs.

We can see a complex contour chart with some interesting results from the classification use-case:

But here is a much cleaner one from the vision-embedding task:

Infinity Client

For actual usage, do consider using the infinity client. When we are benchmarking it's good practice to use k6 to know what’s possible. For actual usage, use the official lib, or something close for a few benefits:

Base64 means smaller payloads (faster and cheaper)
The maintained library should make development cleaner and easier
You have inherent compatibility which will make development faster

You also have OpenAI lib compatibility with the Infinity backend as another option.

As an example vision-embeddings can be accessed like this:

pip install infinity_client && python -c "from infinity_client.vision_client import InfinityVisionAPI"

Other Lessons Learned

I tried deploying multiple models on the same GPU but didn’t see major improvements despite having leftover VRAM and GPU processing power left, this is likely due to the bandwidth cost of processing a large batch
Getting K6 to work with images was a headache until I learned about SharedArrays
Image Data can be super cumbersome to work with
The key to debugging K6 when you are generating scripts is to manually run K6 and look at the output.

Future Improvements

Have the tests run in parallel while managing a global max of VUs would save a lot of time
Look at more diverse datasets and see how that impacts the numbers

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hugging Face - Blog

Approach

Optimization

Setup

Load Testing Parameters

K6

Stats

Sanity Checks

Orchestration

Classification

Introduction

Experiment

Results

Embedding

Introduction

Experiment

Results

Vision Embedding

Introduction

Experiment

Results

Analysis

Conclusion

References

Appendix

Sanity Checks

Task Performance

Failed Requests Check

Monotonic Series - Did we try enough VUs?

Embedding Size Check

Cost Analysis

Best Image by Cost Savings

Cost vs Latency

Cost vs VUs and Batch Size Contour Plots

Infinity Client

Other Lessons Learned

Future Improvements