TGI Multi-LoRA: Deploy Once, Serve 30 Models

Hugging Face - Blog

Waypoint-1.5: Higher-Fidelity Interactive Worlds for Everyday GPUs ALTK‑Evolve: On‑the‑Job Learning for AI Agents Safetensors is Joining the PyTorch Foundation Holo3: Breaking the Computer Use Frontier Any Custom Frontend with Gradio's Backend A New Framework for Evaluating Voice Agents (EVA) Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine‑Tuning, and On‑Device Optimizations One-Shot Any Web App with Gradio's gr.HTML CUGA on Hugging Face: Democratizing Configurable AI Agents New in llama.cpp: Model Management Building Deep Research: How we Achieved State of the Art OVHcloud on Hugging Face Inference Providers 🔥 20x Faster TRL Fine-tuning with RapidFire AI Building for an Open Future - our new partnership with Google Cloud Aligning to What? Rethinking Agent Generalization in MiniMax M2 Building a Healthcare Robot from Simulation to Deployment with NVIDIA Isaac Sentence Transformers is joining Hugging Face! Unlock the power of images with AI Sheets Supercharge your OCR Pipelines with Open Models Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face Get your VLM running in 3 simple steps on Intel CPUs Nemotron-Personas-India: Synthesized Data for Sovereign AI Introducing RTEB: A New Standard for Retrieval Evaluation Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models VibeGame: Exploring Vibe Coding Games Nemotron-Personas-Japan: ソブリン AI のための合成データセット Swift Transformers Reaches 1.0 – and Looks to the Future Smol2Operator: Post-Training GUI Agents for Computer Use SyGra: The One-Stop Framework for Building Data for LLMs and SLMs Gaia2 and ARE: Empowering the community to study agents Scaleway on Hugging Face Inference Providers 🔥 Democratizing AI Safety with RiskRubric.ai Public AI on Hugging Face Inference Providers 🔥 `LeRobotDataset:v3.0`: Bringing large-scale datasets to `lerobot` Visible Watermarking with Gradio Introducing the Palmyra-mini family: Powerful, lightweight, and ready to reason! Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers Fine-tune Any LLM from the Hugging Face Hub with Together AI Jupyter Agents: training LLMs to reason with notebooks mmBERT: ModernBERT goes Multilingual Welcome EmbeddingGemma, Google's new efficient embedding model SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence Make your ZeroGPU Spaces go brrr with ahead-of-time compilation NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset Generate Images with Claude and Hugging Face From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels MCP for Research: How to Connect AI to Research Tools Kimina-Prover-RL Arm & ExecuTorch 0.7: Bringing Generative AI to the masses Neural Super Sampling is here! TextQuests: How Good are LLMs at Text-Based Video Games? 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino? Introducing AI Sheets: a tool to work with datasets using open AI models! Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training Vision Language Model Alignment in TRL ⚡️ Welcome GPT OSS, the new open-source model family from OpenAI! Measuring Open-Source Llama Nemotron Models on DeepResearch Bench 📚 3LM: A Benchmark for Arabic LLMs in STEM and Code Implementing MCP Servers in Python: An AI Shopping Assistant with Gradio Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ Parquet Content-Defined Chunking TimeScope: How Long Can Your Video Large Multimodal Model Go? Fast LoRA inference for Flux with Diffusers and PEFT Accelerate a World of LLMs on Hugging Face with NVIDIA NIM Arc Virtual Cell Challenge: A Primer Consilium: When Multiple LLMs Collaborate Back to The Future: Evaluating AI Agents on Predicting Future Events Five Big Improvements to Gradio MCP Servers Ettin Suite: SoTA Paired Encoders and Decoders Migrating the Hub from Git LFS to Xet Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models Asynchronous Robot Inference: Decoupling Action Prediction and Execution ScreenEnv: Deploy your full stack Desktop Agent Building the Hugging Face MCP Server Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Creating custom kernels for the AMD MI300 Upskill your LLMs With Gradio MCP Servers SmolLM3: smol, multilingual, long-context reasoner Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure Efficient MultiModal Data Pipeline Announcing NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models Training and Finetuning Sparse Embedding Models with Sentence Transformers Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub Gemma 3n fully available in the open-source ecosystem! Transformers backend integration in SGLang (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware Groq on Hugging Face Inference Providers 🔥 How Long Prompts Block Other Requests - Optimizing LLM Performance Learn the Hugging Face Kernel Hub in 5 Minutes Convert Transformers to ONNX with Hugging Face Optimum Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration Director of Machine Learning Insights [Part 3: Finance Edition] The Annotated Diffusion Model Deep Q-Learning with Space Invaders Graphcore and Hugging Face Launch New Lineup of IPU-Ready Transformers Introducing Pull Requests and Discussions 🥳 Efficient Table Pre-training without Real Data: An Introduction to TAPEX An Introduction to Q-Learning Part 2/2 How Sempre Health is leveraging the Expert Acceleration Program to accelerate their ML roadmap

Derek Thomas, Diego Maniloff, David Holtz · 2024-07-18 · via Hugging Face - Blog

Back to Articles

This article is also available in Chinese 简体中文.

Are you tired of the complexity and expense of managing multiple AI models? What if you could deploy once and serve 30 models? In today's ML world, organizations looking to leverage the value of their data will likely end up in a fine-tuned world, building a multitude of models, each one highly specialized for a specific task. But how can you keep up with the hassle and cost of deploying a model for each use case? The answer is Multi-LoRA serving.

Motivation

As an organization, building a multitude of models via fine-tuning makes sense for multiple reasons.

Performance - There is compelling evidence that smaller, specialized models outperform their larger, general-purpose counterparts on the tasks that they were trained on. Predibase [5] showed that you can get better performance than GPT-4 using task-specific LoRAs with a base like mistralai/Mistral-7B-v0.1.
Adaptability - Models like Mistral or Llama are extremely versatile. You can pick one of them as your base model and build many specialized models, even when the downstream tasks are very different. Also, note that you aren't locked in as you can easily swap that base and fine-tune it with your data on another base (more on this later).
Independence - For each task that your organization cares about, different teams can work on different fine tunes, allowing for independence in data preparation, configurations, evaluation criteria, and cadence of model updates.
Privacy - Specialized models offer flexibility with training data segregation and access restrictions to different users based on data privacy requirements. Additionally, in cases where running models locally is important, a small model can be made highly capable for a specific task while keeping its size small enough to run on device.

In summary, fine-tuning enables organizations to unlock the value of their data, and this advantage becomes especially significant, even game-changing, when organizations use highly specialized data that is uniquely theirs.

So, where is the catch? Deploying and serving Large Language Models (LLMs) is challenging in many ways. Cost and operational complexity are key considerations when deploying a single model, let alone n models. This means that, for all its glory, fine-tuning complicates LLM deployment and serving even further.

That is why today we are super excited to introduce TGI's latest feature - Multi-LoRA serving.

Background on LoRA

LoRA, which stands for Low-Rank Adaptation, is a technique to fine-tune large pre-trained models efficiently. The core idea is to adapt large pre-trained models to specific tasks without needing to retrain the entire model, but only a small set of parameters called adapters. These adapters typically only add about 1% of storage and memory overhead compared to the size of the pre-trained LLM while maintaining the quality compared to fully fine-tuned models.

The obvious benefit of LoRA is that it makes fine-tuning a lot cheaper by reducing memory needs. It also reduces catastrophic forgetting and works better with small datasets.

During training, LoRA freezes the original weights W and fine-tunes two small matrices, A and B, making fine-tuning much more efficient. With this in mind, we can see in Figure 1 how LoRA works during inference. We take the output from the pre-trained model Wx, and we add the Low Rank adaptation term BAx [6].

Multi-LoRA Serving

Now that we understand the basic idea of model adaptation introduced by LoRA, we are ready to delve into multi-LoRA serving. The concept is simple: given one base pre-trained model and many different tasks for which you have fine-tuned specific LoRAs, multi-LoRA serving is a mechanism to dynamically pick the desired LoRA based on the incoming request.


Figure 2: Multi-LoRA Explained

Figure 2 shows how this dynamic adaptation works. Each user request contains the input x along with the id for the corresponding LoRA for the request (we call this a heterogeneous batch of user requests). The task information is what allows TGI to pick the right LoRA adapter to use.

Multi-LoRA serving enables you to deploy the base model just once. And since the LoRA adapters are small, you can load many adapters. Note the exact number will depend on your available GPU resources and what model you deploy. What you end up with is effectively equivalent to having multiple fine-tuned models in one single deployment.

LoRAs (the adapter weights) can vary based on rank and quantization, but they are generally quite tiny. Let's get a quick intuition of how small these adapters are: predibase/magicoder is 13.6MB, which is less than 1/1000th the size of mistralai/Mistral-7B-v0.1, which is 14.48GB. In relative terms, loading 30 adapters into RAM results in only a 3% increase in VRAM. Ultimately, this is not an issue for most deployments. Hence, we can have one deployment for many models.

How to Use

Gather LoRAs

First, you need to train your LoRA models and export the adapters. You can find a guide here on fine-tuning LoRA adapters. Do note that when you push your fine-tuned model to the Hub, you only need to push the adapter, not the full merged model. When loading a LoRA adapter from the Hub, the base model is inferred from the adapter model card and loaded separately again. For deeper support, please check out our Expert Support Program. The real value will come when you create your own LoRAs for your specific use cases.

Low Code Teams

For some organizations, it may be hard to train one LoRA for every use case, as they may lack the expertise or other resources. Even after you choose a base and prepare your data, you will need to keep up with the latest techniques, explore hyperparameters, find optimal hardware resources, write the code, and then evaluate. This can be quite a task, even for experienced teams.

AutoTrain can lower this barrier to entry significantly. AutoTrain is a no-code solution that allows you to train machine learning models in just a few clicks. There are a number of ways to use AutoTrain. In addition to locally/on-prem we have:

AutoTrain Environment	Hardware Details	Code Requirement	Notes
Hugging Face Space	Variety of GPUs and hardware	No code	Flexible and easy to share
DGX cloud	Up to 8xH100 GPUs	No code	Better for large models
Google Colab	Access to a T4 GPU	Low code	Good for small loads and quantized models

Deploy

For our examples, we will use a couple of the excellent adapters featured in LoRA Land from Predibase:

predibase/customer_support is trained on the Gridspace-Stanford Harper Valley speech dataset which enhances its ability to understand and respond to customer service interactions accurately. This improves the model's performance in tasks such as speech recognition, emotion detection, and dialogue management, leading to more efficient and empathetic customer support.
predibase/magicoder is trained on ise-uiuc/Magicoder-OSS-Instruct-75K which is a code instruction dataset that is synthetically generated.

TGI

There is already a lot of good information on how to deploy TGI. Deploy like you normally would, but ensure that you:

Use a TGI version newer or equal to v2.1.1
Deploy your base: mistralai/Mistral-7B-v0.1
Add the LORA_ADAPTERS env var during deployment
- Example: LORA_ADAPTERS=predibase/customer_support,predibase/magicoder

model=mistralai/Mistral-7B-v0.1
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.1 \
    --model-id $model \
    --lora-adapters=predibase/customer_support,predibase/magicoder

Inference Endpoints GUI

Inference Endpoints allows you to have access to deploy any Hugging Face model on many GPUs and alternative Hardware types across AWS, GCP, and Azure all in a few clicks! In the GUI, it's easy to deploy. Under the hood, we use TGI by default for text generation (though you have the option to use any image you choose).

To use Multi-LoRA serving on Inference Endpoints, you just need to go to your dashboard, then:

Choose your base model: mistralai/Mistral-7B-v0.1
Choose your Cloud | Region | HW
- Ill use AWS | us-east-1 | Nvidia L4
Select Advanced Configuration
- You should see text generation already selected
- You can configure based on your needs
Add LORA_ADAPTERS=predibase/customer_support,predibase/magicoder in Environment Variables
Finally Create Endpoint!

Note that this is the minimum, but you should configure the other settings as you desire.


Figure 3: Multi-LoRA Inference Endpoints


Figure 4: Multi-LoRA Inference Endpoints 2

Inference Endpoints Code

Maybe some of you are musophobic and don't want to use your mouse, we don’t judge. It’s easy enough to automate this in code and only use your keyboard.

from huggingface_hub import create_inference_endpoint

# Custom Docker image details
custom_image = {
    "health_route": "/health",
    "url": "ghcr.io/huggingface/text-generation-inference:2.1.1",  # This is the min version
    "env": {
        "LORA_ADAPTERS": "predibase/customer_support,predibase/magicoder",  # Add adapters here
        "MAX_BATCH_PREFILL_TOKENS": "2048",  # Set according to your needs
        "MAX_INPUT_LENGTH": "1024", # Set according to your needs
        "MAX_TOTAL_TOKENS": "1512", # Set according to your needs
        "MODEL_ID": "/repository"
    }
}

# Creating the inference endpoint
endpoint = create_inference_endpoint(
    name="mistral-7b-multi-lora",
    repository="mistralai/Mistral-7B-v0.1",
    framework="pytorch",
    accelerator="gpu",
    instance_size="x1",
    instance_type="nvidia-l4",
    region="us-east-1",
    vendor="aws",
    min_replica=1,
    max_replica=1,
    task="text-generation",
    custom_image=custom_image,
)
endpoint.wait()

print("Your model is ready to use!")

It took ~3m40s for this configuration to deploy. Note for more models it will take longer. Do make a github issue if you are facing issues with load time!

Consume

When you consume your endpoint, you will need to specify your adapter_id. Here is a cURL example:

curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
  "inputs": "Hello who are you?",
  "parameters": {
    "max_new_tokens": 40,
    "adapter_id": "predibase/customer_support"
  }
}'

Alternatively, here is an example using InferenceClient from the wonderful Hugging Face Hub Python library. Do make sure you are using huggingface-hub>=0.24.0 and that you are logged in if necessary.

from huggingface_hub import InferenceClient

tgi_deployment = "127.0.0.1:3000"
client = InferenceClient(tgi_deployment)
response = client.text_generation(
    prompt="Hello who are you?",
    max_new_tokens=40,
    adapter_id='predibase/customer_support',
)

Practical Considerations

Cost

We are not the first to climb this summit, as discussed below. The team behind LoRAX, Predibase, has an excellent write up. Do check it out, as this section is based on their work.


Figure 5: Multi-LoRA Cost For TGI, I deployed mistralai/Mistral-7B-v0.1 as a base on nvidia-l4, which has a cost of $0.8/hr on Inference Endpoints. I was able to get 75 requests/s with an average of 450 input tokens and 234 output tokens and adjusted accordingly for GPT3.5 Turbo.

One of the big benefits of Multi-LoRA serving is that you don’t need to have multiple deployments for multiple models, and ultimately this is much much cheaper. This should match your intuition as multiple models will need all the weights and not just the small adapter layer. As you can see in Figure 5, even when we add many more models with TGI Multi-LoRA the cost is the same per token. The cost for TGI dedicated scales as you need a new deployment for each fine-tuned model.

Usage Patterns


Figure 6: Multi-LoRA Serving Pattern

One real-world challenge when you deploy multiple models is that you will have a strong variance in your usage patterns. Some models might have low usage; some might be bursty, and some might be high frequency. This makes it really hard to scale, especially when each model is independent. There are a lot of “rounding” errors when you have to add another GPU, and that adds up fast. In an ideal world, you would maximize your GPU utilization per GPU and not use any extra. You need to make sure you have access to enough GPUs, knowing some will be idle, which can be quite tedious.

When we consolidate with Multi-LoRA, we get much more stable usage. We can see the results of this in Figure 6 where the Multi-Lora Serving pattern is quite stable even though it consists of more volatile patterns. By consolidating the models, you allow much smoother usage and more manageable scaling. Do note that these are just illustrative patterns, but think through your own patterns and how Multi-LoRA can help. Scale 1 model, not 30!

Changing the base model

What happens in the real world with AI moving at breakneck speeds? What if you want to choose a different/newer model as your base? While our examples use mistralai/Mistral-7B-v0.1 as a base model, there are other bases like Mistral's v0.3 which supports function calling, and altogether different model families like Llama 3. In general, we expect new base models that are more efficient and more performant to come out all the time.

But worry not! It is easy enough to re-train the LoRAs if you have a compelling reason to update your base model. Training is relatively cheap; in fact Predibase found it costs only ~$8.00 to train each one. The amount of code changes is minimal with modern frameworks and common engineering practices:

Keep the notebook/code used to train your model
Version control your datasets
Keep track of the configuration used
Update with the new model/settings

Conclusion

Multi-LoRA serving represents a transformative approach in the deployment of AI models, providing a solution to the cost and complexity barriers associated with managing multiple specialized models. By leveraging a single base model and dynamically applying fine-tuned adapters, organizations can significantly reduce operational overhead while maintaining or even enhancing performance across diverse tasks. AI Directors we ask you to be bold, choose a base model and embrace the Multi-LoRA paradigm, the simplicity and cost savings will pay off in dividends. Let Multi-LoRA be the cornerstone of your AI strategy, ensuring your organization stays ahead in the rapidly evolving landscape of technology.

Acknowledgements

Implementing Multi-LoRA serving can be really tricky, but due to awesome work by punica-ai and the lorax team, optimized kernels and frameworks have been developed to make this process more efficient. TGI leverages these optimizations in order to provide fast and efficient inference with multiple LoRA models.

Special thanks to the Punica, LoRAX, and S-LoRA teams for their excellent and open work in multi-LoRA serving.

References

[1] : Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham, LoRA Learns Less and Forgets Less, 2024
[2] : Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, 2021
[3] : Sourab Mangrulkar, Sayak Paul, PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware, 2023
[4] : Travis Addair, Geoffrey Angus, Magdy Saleh, Wael Abid, LoRAX: The Open Source Framework for Serving 100s of Fine-Tuned LLMs in Production, 2023
[5] : Timothy Wang, Justin Zhao, Will Van Eaton, LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4, 2024
[6] : Punica: Serving multiple LoRA finetuned LLM as one: https://github.com/punica-ai/punica

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Hugging Face - Blog

Motivation

Background on LoRA

Multi-LoRA Serving

How to Use

Gather LoRAs

Low Code Teams

Deploy

TGI

Inference Endpoints GUI

Inference Endpoints Code

Consume

Practical Considerations

Cost

Usage Patterns

Changing the base model

Conclusion

Acknowledgements

References