Build a Domain-Specific Embedding Model in Under a Day


Steve Han, Rucha Apte, Sean Sodha, Oliver Holworthy · 2026-03-21 · via Hugging Face - Blog


If you are building a RAG (Retrieval-Augmented Generation) system, you have likely hit this wall: everything works… until it doesn’t. General-purpose embedding models are trained to understand the internet, not your contracts, manufacturing logs, proprietary chemical formulations, or internal taxonomy. They capture broad semantic similarity, but they miss the fine-grained distinctions that matter in your domain. Fine-tuning an embedding model can improve your retrieval pipeline when off-the-shelf models fail to capture these domain-specific nuances. Yet despite how critical embeddings are to RAG performance, the fine-tuning process remains fragmented, the skills required are specialized, and the time investment is daunting.

This post walks through a recipe that, with a single GPU and less than a day of training time, transforms a general-purpose embedding model into one adapted to your domain, with no manual labeling required. To help you hit the ground running, we are also releasing a ready-to-use synthetic training dataset generated from NVIDIA's public documentation using this exact pipeline. Using this data and the recipe, we saw a relative improvement of over 10% in both Recall@10 and nDCG@10. Atlassian applied the same recipe to fine-tune on their Jira dataset and increased Recall@60 from 0.751 to 0.951, a gain of 20 percentage points (roughly 27% relative), on a single GPU.

🔗Quick Links to Dataset and Code:

🧑‍💻Open Source Projects Recipe Integrates:

NeMo Data Designer for synthetic data generation
NeMo Automodel for embedding model training
BEIR for Information retrieval evaluation
NeMo Export-Deploy for ONNX/TensorRT conversion
NVIDIA NIM for production inference serving

📋Prerequisites:

A directory of domain documents (text files - .txt, .md, or similar)
A valid NVIDIA API key (free at build.nvidia.com)
NVIDIA Ampere GPU or newer with at least 80GB memory (with Compute Capability >= 8.0)
- This tutorial has been tested on 1xA100 (80GB), and 1xH100 (80GB)

By the end of this post, you’ll know how to:
📄 Generate training data from domain documents without labeled data
🎯 Use hard negative mining for effective contrastive training
🔗 Improve embedding quality with multi-hop queries
⚙️ Fine-tune a bi-encoder embedding model
📊 Evaluate whether fine-tuning improves retrieval
🚀 Deploy the fine-tuned model in your pipeline

⚙️Setup

In this tutorial, we will fine-tune the base model Llama-Nemotron-Embed-1B-v2, a 1-billion-parameter embedding model that balances quality and inference cost. To get started, follow this setup guide.

📚 Step 1: Generate Training Data from Documents

Fine-tuning an embedding model requires thousands of (query, relevant document) pairs. Most use cases don’t have this data readily available. Creating it manually is expensive, slow, and often biased by the annotator’s personal interpretation of what’s “relevant.”
Instead of labeling data by hand, you can use an LLM (nvidia/nemotron-3-nano-30b-a3b) to read your documents and automatically generate high-quality synthetic question–answer pairs.

nemotron embed sdg -c default corpus_dir=./data/my_domain_docs

How does it work?

Behind the scenes, this runs a four-stage synthetic data generation (SDG) pipeline powered by NeMo Data Designer:

What does the output look like?

Source document chunk:

The thermal design power (TDP) of the H100 GPU is 700W in SXM form factor. The cooling solution must maintain junction temperature below 83°C under sustained workloads. Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.

Generated QA pairs:

{
  "question": "What cooling approach is recommended when deploying more than 4 H100 GPUs per server node?",
  "answer": "Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.",
  "query_type": "contextual",
  "reasoning_type": "factual",
  "question_complexity": 3,
  "segment_ids": [1],
  "quality_score": 8.5
}

{
  "question": "How does the 700W TDP of the H100 SXM constrain the choice between air and liquid cooling in multi-GPU configurations?",
  "answer": "The 700W TDP generates substantial heat that must be dissipated to keep junction temperatures below 83°C. In dense configurations exceeding 4 GPUs per node, air cooling in standard 2U chassis cannot handle this thermal load, making liquid cooling necessary.",
  "query_type": "multi_hop",
  "reasoning_type": "causal",
  "question_complexity": 4,
  "segment_ids": [1, 2],
  "hop_count": 2,
  "quality_score": 9.0
}

Notice the difference: the first question is a simple factual lookup. The second requires multi-hop, causal reasoning. The pipeline generates both types, with configurable complexity levels (2–5) and hop counts (1–3). Each QA pair then undergoes quality evaluation, receiving sub-scores for relevance, accuracy, context support, and clarity, along with an overall score. Only pairs that meet the threshold are included in training.
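
The threshold filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's actual code: the function name `filter_by_quality`, the `id` field, and the threshold value of 7.0 are assumptions made for the example. Only the `quality_score` field comes from the records shown earlier.

```python
def filter_by_quality(qa_pairs, min_score=7.0):
    """Keep only QA pairs whose overall quality score meets the threshold.

    min_score=7.0 is a hypothetical value chosen for illustration;
    the real threshold is whatever the pipeline config specifies.
    """
    return [qa for qa in qa_pairs if qa.get("quality_score", 0.0) >= min_score]


# Toy records: the first two scores mirror the examples above,
# the third is an invented low-quality pair.
pairs = [
    {"id": "qa_contextual", "quality_score": 8.5},
    {"id": "qa_multi_hop", "quality_score": 9.0},
    {"id": "qa_vague", "quality_score": 4.0},
]

kept = filter_by_quality(pairs)
print([qa["id"] for qa in kept])  # ['qa_contextual', 'qa_multi_hop']
```

Raising the threshold trades training-set size for cleaner supervision, which is the lever suggested later for dealing with overfitting.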

⛏️ Step 2: Mine Hard Negatives (and Why They Matter)

If you train an embedding model with only positive pairs (query + correct document), it learns to distinguish obviously different documents but fails on the hard cases — passages that look relevant but are not the right answer. In a real retrieval system, these near-misses are exactly the documents that cause bad answers. Hard negative mining finds these confusing passages so the model can learn to tell them apart.

nemotron embed prep -c default

The above command runs three sub-steps automatically:

2a. Train / Validation / Test Split

The generated QA pairs are split into training (80%) and test (20%) sets. The test set is formatted as a BEIR-compatible benchmark for standardized evaluation in Step 5.
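
As a rough sketch of this sub-step, the snippet below performs a seeded 80/20 split and builds BEIR-style relevance judgments (a mapping from query ID to relevant document IDs). The function names and the `question_id` / `pos_doc_ids` fields are hypothetical, and the real pipeline may split differently, for example by source document.

```python
import random


def split_train_test(qa_pairs, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out test_fraction of the pairs."""
    shuffled = list(qa_pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def to_beir_qrels(test_pairs):
    """Build BEIR-style qrels: {query_id: {doc_id: relevance}}."""
    return {
        qa["question_id"]: {doc_id: 1 for doc_id in qa["pos_doc_ids"]}
        for qa in test_pairs
    }


# Ten toy QA pairs, each with one labeled positive document.
toy_pairs = [{"question_id": f"q{i}", "pos_doc_ids": [f"d{i}"]} for i in range(10)]
train, test = split_train_test(toy_pairs)
qrels = to_beir_qrels(test)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the held-out set stable across runs, so base and fine-tuned models are always compared on the same queries.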

2b. Hard Negative Mining

Using the base embedding model, the pipeline:

Embeds every query and every passage in the corpus.
Computes similarity between each query and all passages.
Masks out each query's labeled positive documents.
Applies a margin filter: any non-positive document scoring above 95% of the minimum positive score is eliminated. This exclusion zone guards against false negatives — unlabeled passages that are so close to the positive they may actually be relevant.
From the surviving candidates, selects the top-k highest-scoring documents as hard negatives (5 per query by default).

The result: hard negatives are the most similar non-positive passages that still fall safely below the positive-score ceiling. They are passages the current model considers highly relevant but that are not the labeled answer.
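
The selection logic for a single query can be sketched as follows. This is an illustrative pure-Python version, not the pipeline's implementation: the function name and the input layout (a dict of passage ID to similarity score) are assumptions, and the margin rule as written assumes positive similarity scores.

```python
def mine_hard_negatives(scores, positive_ids, top_k=5, margin=0.95):
    """Select hard negatives for one query.

    scores: dict mapping passage id -> similarity to the query
            (assumed positive, e.g. cosine similarity of related text).
    positive_ids: set of passage ids labeled as relevant.
    """
    # Ceiling: 95% of the lowest-scoring labeled positive.
    ceiling = margin * min(scores[pid] for pid in positive_ids)
    # Drop positives, and drop anything above the ceiling (possible false negatives).
    candidates = [
        (pid, score)
        for pid, score in scores.items()
        if pid not in positive_ids and score <= ceiling
    ]
    # Keep the top_k most similar survivors.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in candidates[:top_k]]


scores = {"pos": 0.80, "near_dup": 0.78, "hard1": 0.70, "hard2": 0.65, "easy": 0.10}
print(mine_hard_negatives(scores, {"pos"}, top_k=2))  # ['hard1', 'hard2']
```

In the toy example the ceiling is 0.95 × 0.80 = 0.76, so `near_dup` (0.78) is excluded as a likely false negative, and the two most similar remaining passages are selected.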

Why this works: Training on easy negatives (completely unrelated passages) teaches the model nothing new. Training on hard negatives forces it to learn the subtle distinctions that matter in your domain. For example, in a medical corpus, a question about "metformin dosage for Type 2 diabetes" might have hard negatives about "metformin side effects" or "insulin dosage for Type 1 diabetes" — close but critically different. The 95% margin ceiling prevents the miner from selecting passages that are too close to the positive, which could actually be correct answers that simply weren't labeled during SDG.

2c. Multi-Hop Unrolling

Multi-hop questions reference multiple positive documents. For example, a question like "How does the thermal management system in Section 3.2 relate to the power constraints described in Section 5.1?" has two positive passages.

Unrolling creates one training example per (query, positive document) pair, so the contrastive loss sees each positive independently. A question with 2 positive documents becomes 2 training examples, each with the same hard negatives but a different positive.

The final output is a training-ready JSON file:

{
  "question_id": "q42_0",
  "question": "How does the thermal management system in Section 3.2 relate to the power constraints described in Section 5.1?",
  "pos_doc": [{"id": "Section 3.2"}],
  "neg_doc": [{"id": "d_x7y8z9"}, {"id": "d_m4n5o6"}, {"id": "d_p1q2r3"}, {"id": "d_s4t5u6"}, {"id": "d_v7w8x9"}]
}, 
{
  "question_id": "q42_1",
  "question": "How does the thermal management system in Section 3.2 relate to the power constraints described in Section 5.1?",
  "pos_doc": [{"id": "Section 5.1"}],
  "neg_doc": [{"id": "d_x7y8z9"}, {"id": "d_m4n5o6"}, {"id": "d_p1q2r3"}, {"id": "d_s4t5u6"}, {"id": "d_v7w8x9"}]
}
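
A minimal sketch of the unrolling step, assuming an input record shaped like the output above but with several entries in `pos_doc`. The function name `unroll_multi_hop` is hypothetical, and the ID suffix scheme simply mirrors the `q42_0` / `q42_1` pattern in the example.

```python
def unroll_multi_hop(example):
    """Turn one multi-positive example into one example per positive document."""
    return [
        {
            "question_id": f"{example['question_id']}_{i}",
            "question": example["question"],
            "pos_doc": [pos],                     # exactly one positive each
            "neg_doc": list(example["neg_doc"]),  # same hard negatives, copied
        }
        for i, pos in enumerate(example["pos_doc"])
    ]


multi_hop = {
    "question_id": "q42",
    "question": "How does Section 3.2 relate to Section 5.1?",
    "pos_doc": [{"id": "Section 3.2"}, {"id": "Section 5.1"}],
    "neg_doc": [{"id": "d_x7y8z9"}, {"id": "d_m4n5o6"}],
}

unrolled = unroll_multi_hop(multi_hop)
print([ex["question_id"] for ex in unrolled])  # ['q42_0', 'q42_1']
```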

🔍 Step 3: Understand Multi-Hop Questions and Why They Improve Retrieval

Standard embedding fine-tuning generates one question per passage and trains the model to match them. This works for simple factual lookups, but real users ask complex questions that span multiple documents or sections. If the model has only seen single-hop training data, it will struggle to retrieve all the relevant passages for these complex queries.

The SDG pipeline generates questions at 1 to 3 hops by default:

1-hop: "What is the TDP of the H100 SXM?" — answered by a single passage.
2-hop: "How does the H100's TDP relate to cooling requirements in dense deployments?" — requires connecting information from two passages.
3-hop: "Given the TDP, cooling constraints, and rack density limits, what is the maximum number of H100 GPUs deployable in a standard data center row?" — synthesizes three passages.

Each hop is tracked with its own context summary and segment IDs, so the training data preserves the full reasoning chain. After unrolling (Step 2c), each (question, relevant passage) pair becomes an independent training signal, teaching the model that all of these passages are relevant to the multi-hop query.

The fine-tuned model learns to retrieve contextually related documents, not just lexically similar ones.

🧠 Step 4: Fine-Tune the Embedding Model

nemotron embed finetune -c default

How contrastive learning works

The training uses a bi-encoder architecture with a contrastive loss.

The temperature of 0.02 is deliberately aggressive: it produces a very sharp probability distribution over the candidate passages. This works well because the hard negatives from Step 2 are high-quality. They are genuinely confusing passages, and the model needs strong gradients to learn to distinguish them from the positive.
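
To make the effect of the temperature concrete, here is a small sketch of an InfoNCE-style contrastive loss for one query with one positive and several negatives. The post does not spell out the exact loss formulation, so treat this form (softmax cross-entropy over temperature-scaled similarities) as an assumption; the function name and the similarity values are invented for illustration.

```python
import math


def info_nce_loss(pos_sim, neg_sims, temperature=0.02):
    """Negative log-probability of the positive under a temperature-scaled softmax."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]


# Case 1: the positive already outranks four hard negatives.
well_ranked = info_nce_loss(0.80, [0.70, 0.60, 0.50, 0.40], temperature=0.02)

# Case 2: a hard negative (0.70) outscores the positive (0.60).
violation_sharp = info_nce_loss(0.60, [0.70], temperature=0.02)
violation_soft = info_nce_loss(0.60, [0.70], temperature=1.0)

print(well_ranked < violation_sharp)      # True
print(violation_sharp > violation_soft)   # True
```

With a low temperature, a correctly ranked example contributes almost no loss, while a single hard negative that outscores the positive is penalized heavily. That is the sense in which the sharp distribution concentrates the gradient on the confusing cases.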

Key hyperparameters

Parameter	Default	Notes
Epochs	3	Default is tuned for the small example dataset; for real-world data, 1–2 epochs is usually sufficient to avoid overfitting.
Learning rate	1e-5	When tuning, try double and half the default value.
Learning rate warmup steps	5	Set to 5–10% of the total fine-tuning steps for better early-training stability.
Global batch size	128	
Passages per query	5	1 positive + 4 hard negatives
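
The warmup guidance in the table can be turned into a quick calculation. The helper below is a hypothetical sketch: it assumes the total step count is the number of batches per epoch (rounded up) times the number of epochs, which may differ from how the trainer counts steps, for example if it drops the last partial batch.

```python
import math


def warmup_steps(num_examples, global_batch_size=128, epochs=3, warmup_fraction=0.05):
    """Return (warmup_steps, total_steps) for a given dataset size."""
    steps_per_epoch = math.ceil(num_examples / global_batch_size)
    total_steps = steps_per_epoch * epochs
    return max(1, round(total_steps * warmup_fraction)), total_steps


# Example: 50,000 unrolled training examples (an invented size), 2 epochs.
print(warmup_steps(50_000, epochs=2, warmup_fraction=0.05))  # (39, 782)
print(warmup_steps(50_000, epochs=2, warmup_fraction=0.10))  # (78, 782)
```

For this example size, the default of 5 warmup steps would cover well under 1% of training, which is why the table suggests scaling warmup with the dataset.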

Checkpoint frequency

If ckpt_every_steps is omitted from the config, checkpoint frequency is set automatically:

Map-style datasets (known length): defaults to once per epoch.
Iterable datasets (unknown length): defaults to twice during training.

This means you can start with a small corpus (50–100 documents) for a quick proof-of-concept and scale up later without manually tuning checkpoint settings.

📈 Step 5: Measure the Improvement

Did fine-tuning actually help? Let’s find out by running a standardized evaluation comparing the base model against the fine-tuned checkpoint on the held-out test set:

nemotron embed eval -c default

The evaluation uses the BEIR framework and computes four standard information retrieval metrics at k = 1, 5, 10, and 100:

nDCG@k: Ranking quality — are the best documents ranked highest?
Recall@k: Coverage — what fraction of relevant documents appear in the top k?
Precision@k: Accuracy — what fraction of the top k results are actually relevant?
MAP@k: Average precision across all queries
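
For intuition, three of these metrics can be computed from scratch for a single query, as in the sketch below. It assumes binary relevance labels and is not the BEIR implementation; the function names and document IDs are illustrative only.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Discounted gain of the ranking, normalized by the best possible ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # retrieval order for one query
relevant = {"d1", "d2"}             # labeled positives

print(recall_at_k(ranked, relevant, 2))     # 0.5
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(ndcg_at_k(["d1", "d2"], relevant, 2)) # 1.0
```

Recall rewards finding the positives anywhere in the top k, while nDCG also rewards placing them near the top, which is why both are reported.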

A successful fine-tune typically yields a relative improvement of roughly 10% in nDCG@10 and Recall@10, with less than a day of training.

Results using Retrieval Synthetic NVDocs:

📊 Comparison (Base -> Fine-tuned)
============================================================
  NDCG:
    NDCG@1: 0.55178 → 0.60796 (+0.05618, +10.2%)
    NDCG@5: 0.51894 → 0.57689 (+0.05795, +11.2%)
    NDCG@10: 0.55506 → 0.61559 (+0.06053, +10.9%)
    NDCG@100: 0.60617 → 0.66567 (+0.05950, +9.8%)
  Recall:
    Recall@1: 0.28478 → 0.31547 (+0.03069, +10.8%)
    Recall@5: 0.54486 → 0.60288 (+0.05802, +10.6%)
    Recall@10: 0.62979 → 0.69296 (+0.06317, +10.0%)
    Recall@100: 0.81421 → 0.87020 (+0.05599, +6.9%)
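
Each row above reports an absolute delta followed by a relative change. As a small illustrative helper (the function name is an assumption), the two figures are derived like this:

```python
def improvement(base, tuned):
    """Return (absolute delta, relative change in percent)."""
    absolute = tuned - base
    relative_pct = 100.0 * absolute / base
    return absolute, relative_pct


# NDCG@10 row from the comparison above.
delta, pct = improvement(0.55506, 0.61559)
print(f"{delta:+.5f}, {pct:+.1f}%")  # +0.06053, +10.9%
```

The relative change is measured against the base score, so the same absolute gain looks smaller at higher baselines, as in the @100 rows.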

What if the numbers don't improve?

The pipeline makes it easy to iterate:

Low quality scores in SDG? Check your document quality — clean, well-formatted text produces better synthetic data. Try a larger and more powerful LLM.
Not enough training data? Add more documents to your corpus and re-run synthetic data generation (Step 1).
Overfitting? The default 3 epochs is calibrated for the small example dataset; for most real-world data, 1–2 epochs is enough. Also consider increasing the quality threshold to keep only the best training examples.
Wrong learning rate? Try double or half of the default learning rate.

🏆 Real-World Results: Atlassian

This recipe has been validated on real enterprise data by Atlassian. They applied this pipeline to fine-tune Llama-Nemotron-Embed-1B-v2 on a public Jira dataset using a single NVIDIA A100 80GB GPU, following the same stages described above.

Recall@60 rose from 0.751 to 0.951, a gain of 20 percentage points (roughly 27% relative).
The fine-tuned model retrieves the correct document within the top 60 results for 95.1% of queries, up from 75.1% with the base model. For a retrieval system underpinning Jira search, this directly translates into more relevant results for millions of users. Find more details in their blog post Advancing semantic search for millions of Rovo users.

🚀 Step 6: Export and Deploy

A PyTorch checkpoint is great for evaluation but too slow for production. The final two stages convert the model and serve it behind an API.

Export to ONNX / TensorRT

nemotron embed export -c default

This exports the fine-tuned checkpoint to ONNX (opset 17). Optionally, it compiles a TensorRT engine for maximum inference throughput, with configurable optimization profiles for batch size (1–64) and sequence length (3–256):

# ONNX only (runs anywhere)
nemotron embed export -c default export_to_trt=false

# FP8 quantization for further speedup
nemotron embed export -c default quant_cfg=fp8

Deploy with NVIDIA NIM

The exported model is deployed inside an NVIDIA NIM container — a production-ready inference microservice exposing an OpenAI-compatible /v1/embeddings endpoint:

nemotron embed deploy -c default

Once running, any client can call it:

curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["What cooling is needed for 8 H100 GPUs in a 2U chassis?"],
       "model": "custom",
       "input_type": "query"}'

Because NIM serves an OpenAI-compatible API, you can drop it into any existing RAG pipeline that uses the embeddings API format — no code changes needed.
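
For reference, the same request can be issued from Python using only the standard library. This is a sketch under assumptions: it reuses the URL, model name, and `input_type` field from the curl example, and it assumes the response follows the OpenAI embeddings layout (a `data` list whose items carry an `embedding`). The function names are hypothetical.

```python
import json
import urllib.request


def build_embedding_request(texts, input_type="query", model="custom",
                            url="http://localhost:8000/v1/embeddings"):
    """Build the POST request for the embeddings endpoint without sending it."""
    payload = {"input": list(texts), "model": model, "input_type": input_type}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, **kwargs):
    """Send the request and return one embedding vector per input text.

    Assumes an OpenAI-style response: {"data": [{"embedding": [...]}, ...]}.
    """
    request = build_embedding_request(texts, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]


# Building the request needs no running server; calling embed(...) does.
request = build_embedding_request(
    ["What cooling is needed for 8 H100 GPUs in a 2U chassis?"]
)
print(request.get_method(), request.full_url)
```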

Verify deployment accuracy

The pipeline includes a NIM accuracy verification step that runs the same BEIR evaluation against the deployed endpoint:

nemotron embed eval -c default eval_nim=true eval_base=false

This catches any accuracy loss from the ONNX/TensorRT conversion. Metrics that match within tolerance (0.03 for @1, 0.01 for @5+) are marked with a check; deviations beyond conversion noise are flagged.
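
The tolerance rule can be illustrated with a short sketch. The function names and the "deployed" scores below are invented for the example; only the tolerances (0.03 for @1, 0.01 for @5 and above) and the reference scores come from this post.

```python
def tolerance_for(metric_name):
    """Return 0.03 for @1 metrics and 0.01 for @5 and above."""
    k = int(metric_name.split("@")[1])
    return 0.03 if k == 1 else 0.01


def compare_metrics(reference, deployed):
    """Map each metric name to True if the deployed score is within tolerance."""
    return {
        name: abs(deployed[name] - ref_value) <= tolerance_for(name)
        for name, ref_value in reference.items()
    }


# Reference: fine-tuned checkpoint scores reported above.
reference = {"NDCG@1": 0.60796, "NDCG@10": 0.61559}
# Deployed: hypothetical endpoint scores, made up to show one pass and one flag.
deployed = {"NDCG@1": 0.59000, "NDCG@10": 0.59000}

print(compare_metrics(reference, deployed))  # {'NDCG@1': True, 'NDCG@10': False}
```

The looser tolerance at @1 reflects that top-1 metrics are noisier: a single swapped ranking position changes them more than it changes metrics averaged over deeper cutoffs.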

Putting It All Together

The full embedding fine-tuning pipeline can be run in six commands, from raw documents to a deployed model.

# 1. Generate synthetic training data from your documents
nemotron embed sdg -c default corpus_dir=./data/my_docs

# 2. Prepare the training data (split data, mine hard negatives, unroll)
nemotron embed prep -c default

# 3. Fine-tune the embedding model
nemotron embed finetune -c default

# 4. Evaluate the base vs. fine-tuned model
nemotron embed eval -c default

# 5. Export the optimized model
nemotron embed export -c default

# 6. Deploy the model
nemotron embed deploy -c default

Expected time and resources

Stage	GPU Required?	Estimated Time	Notes
SDG	No (uses API)	~1 hour	Varies by corpus size and API rate limit
Data Prep	Yes (40 GB VRAM)	~5 min	Hard negative mining on GPU
Fine-Tune	Yes (80 GB VRAM)	~1 hour	Varies by dataset size and epochs
Eval	Yes (40 GB VRAM)	~5 min
Export	Yes (40 GB VRAM)	~5 min	TensorRT requires NGC container
Deploy	Yes (40 GB VRAM)	~5 min	NIM container startup

Total: under a day, with most time being hands-off training. For a small corpus (~500 documents), the entire pipeline completes in about 2–3 hours.

The pipeline can run end-to-end, but each stage can also be executed independently depending on your starting point. For example, if you have raw documents, you can begin with synthetic data generation (SDG), while datasets that already include hard negatives can skip earlier steps and go directly to fine-tuning. Since every stage uses standard formats such as JSON, BEIR, and ONNX, it’s easy to integrate custom components or reuse intermediate outputs in other workflows. The recipe is also flexible in how it runs, supporting execution on a local machine, inside Docker containers, or on Slurm-based clusters.

Try It Yourself

If you have domain documents and some time on your hands, you can generate your first batch of synthetic training data today! The full pipeline, from documents to a deployed, domain-adapted embedding model, runs in under a day on a single GPU. You can start with our ready-made nvidia/Retrieval-Synthetic-NVDocs-v1 dataset to try the pipeline right away. Let us know what you build.
Star the repos for Nemotron, NeMo Data Designer and NeMo Automodel if you find them useful.
