LLM-as-judge variance broke our DPO training signal for 3 weeks

TL;DR: Our DPO pipeline used a single LLM as the preference judge. Training reward climbed every run. Production tool-use accuracy fell 4 points. At temperature 0, the judge flipped its own label on 28% of repeat calls for one test pair, and the median across 200 pairs was 19%.

The setup

Nexus Labs ships agents that book travel, file expenses, process insurance claims. Eight engineers on my fine-tuning team. We run DPO on Qwen2.5-32B, target latency under 800ms p95 on a single H100.

Our preference data pipeline:

2,400 prompts sampled from production traces per cycle
4 completions per prompt from the current checkpoint
GPT-4o-mini grades pairwise preferences against a 6-axis rubric
TRL DPO, 3 epochs, lr 5e-7, beta 0.1

Standard recipe. Worked fine for two months.

What we saw

Week 9. Training loss curves looked clean. Reward margins grew run over run. Held-out eval reward climbed 0.62 → 0.71. Internal dashboards were green.

Then product filed tickets. Latency was fine. Tool-use accuracy on our production traffic mirror was down 4 points against the pre-DPO baseline. The thing we shipped to make the agent better made it worse.

We trusted offline eval. We were wrong.

The investigation

I rebuilt the judge call as a deterministic test. Same prompt, same two completions, GPT-4o-mini at temperature 0. Fired the API 50 times in a row.

The judge flipped its preference 14 of 50 times. 28% self-disagreement on a single pair.

That number alone should have killed the project. We had built a training signal on top of a weighted coin.

Ran the test across 200 prompt pairs. Median self-disagreement was 19%. The tail was worse. 8% of pairs had flip rates over 40%, and those pairs were exactly the ambiguous multi-step agent traces we cared about most.
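A minimal sketch of how a self-disagreement number like this can be computed, assuming "flip rate" means the share of repeated judgments that disagree with the majority label for that pair. The function names and the sample data are hypothetical, not taken from our pipeline.

```python
from collections import Counter
from statistics import median

def flip_rate(labels):
    """Share of repeated judgments that disagree with the majority label."""
    if not labels:
        raise ValueError("need at least one judgment")
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1 - majority_count / len(labels)

def summarize(per_pair_labels, tail_threshold=0.40):
    """Median flip rate across pairs, plus the share of pairs above the threshold."""
    rates = [flip_rate(labels) for labels in per_pair_labels]
    tail_share = sum(r > tail_threshold for r in rates) / len(rates)
    return {"median": median(rates), "tail_share": tail_share}

# Hypothetical data: 50 repeated judgments of one pair, 36 for "A" and 14 for "B".
single_pair = ["A"] * 36 + ["B"] * 14
print(round(flip_rate(single_pair), 2))  # 0.28
```

Under this definition a perfectly repeatable judge scores 0 and a fair coin approaches 0.5, so 0.28 sits closer to the coin than to the oracle.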

What was actually happening

DPO optimizes the margin between the chosen and the rejected completion. When labels are noisy, the model still gets a gradient on every pair, but on a flipped pair it points the wrong way. Over thousands of pairs you converge on whatever spurious feature the judge consistently rewards at temperature 0. Which, surprise, is not what end users want.

Our offline reward went up because the model learned the judge's quirks. Production accuracy dropped because the quirks weren't the task.
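To put a rough number on how much label flips cost, here is a sketch under a simplifying assumption that does not fully match our case: each label is flipped independently with a fixed probability. It uses the standard per-pair DPO loss, -log(sigmoid(beta * margin)), where margin is the implicit reward difference between chosen and rejected. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_grad(margin, flip_prob, beta=0.1):
    """Expected d(loss)/d(margin) for one pair when its label is flipped
    with probability flip_prob (assumption: independent, symmetric flips)."""
    clean = -beta * sigmoid(-beta * margin)   # gradient when the label is right
    flipped = beta * sigmoid(beta * margin)   # gradient when the label is reversed
    return (1 - flip_prob) * clean + flip_prob * flipped

# At margin 0 (start of training) the clean signal is scaled by (1 - 2 * flip_prob).
for p in (0.0, 0.19, 0.28, 0.40):
    retained = expected_grad(0.0, p) / expected_grad(0.0, 0.0)
    print(f"flip_prob={p:.2f}  signal retained={retained:.2f}")
```

At a 28% flip rate, less than half of the clean gradient survives in expectation. Random flips only weaken the signal, though. A judge quirk that shows up consistently across pairs does not average out, and that consistent part is what the model ends up learning.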

The fix

Three changes.

# preference_judging.yaml
judges:
  - provider: anthropic
    model: claude-sonnet-4-6
  - provider: openai
    model: gpt-4o-2024-11-20
  - provider: google
    model: gemini-2.5-pro
consensus:
  min_agree: 2
  drop_pair_if_split: true
sampling:
  judges_per_pair: 3
  rotate_completion_order: true

Three judges, 2-of-3 majority. Drop the pair if split. We lose 18% of pairs. Acceptable.
Rotate completion order per judge. Position bias was ~7% on its own. Sonnet was closer to 2%; GPT-4o-mini was the worst offender.
Bootstrap CIs on the eval set. Report reward with a 95% interval, not a point estimate. Half of our prior "improvement" was inside the noise floor.
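The first two changes can be sketched roughly as follows. This is an illustration, not our production code. It assumes each judge is a callable that returns "first", "second", or anything else for an abstention, and that a pair counts as split when no side reaches min_agree decisive votes. With three strictly binary votes a 2-of-3 majority always exists, so the abstention case is an assumption about how splits arise. All names are hypothetical.

```python
from collections import Counter

def consensus(votes, min_agree=2):
    """Return "A" or "B" if one side leads with at least min_agree votes, else None."""
    counts = Counter(v for v in votes if v in ("A", "B"))
    a, b = counts["A"], counts["B"]
    if a == b or max(a, b) < min_agree:
        return None  # split or too few decisive votes: drop the pair
    return "A" if a > b else "B"

def label_pair(judges, prompt, completion_a, completion_b, min_agree=2):
    """Collect one vote per judge, alternating presentation order across judges."""
    votes = []
    for i, judge in enumerate(judges):
        swapped = i % 2 == 1  # every other judge sees the completions reversed
        first, second = (completion_b, completion_a) if swapped else (completion_a, completion_b)
        verdict = judge(prompt, first, second)
        if verdict == "first":
            votes.append("B" if swapped else "A")
        elif verdict == "second":
            votes.append("A" if swapped else "B")
        else:
            votes.append("tie")  # abstention or unparseable verdict
    return consensus(votes, min_agree)

# Toy judges for illustration only.
always_first = lambda prompt, x, y: "first"   # pure position bias
abstains = lambda prompt, x, y: "tie"
print(label_pair([always_first, always_first, abstains], "p", "answer one", "answer two"))  # None
```

In the toy run, two purely position-biased judges cancel each other once the order alternates, the third abstains, and the pair is dropped instead of becoming a training label.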

The judge fleet routes through Bifrost (https://github.com/maximhq/bifrost). One OpenAI-compatible endpoint, automatic fallback when a provider degrades, per-judge token accounting in one place. We were already running three providers for app traffic, so the judge pool was a config change.

Numbers after the fix

| Metric | Single judge | 3-judge consensus |
|---|---|---|
| Judge self-consistency | 72% | 94% |
| Production tool-use accuracy | -4.0 pts | +2.1 pts |
| Training pairs retained | 100% | 82% |
| Cost per 10k pairs (USD) | $11 | $34 |
| Eval-to-prod Spearman correlation | 0.31 | 0.78 |

Cost tripled. The signal went from misleading to useful. We take that trade every cycle.
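For the interval reporting mentioned in the fix, a percentile bootstrap is one simple option. The sketch below assumes per-example eval rewards are available as a flat list. The scores are made up for illustration, and the function name and defaults are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean. Returns (mean, low, high)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    low_idx = int((alpha / 2) * n_resamples)
    high_idx = max(low_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(values), means[low_idx], means[high_idx]

# Made-up per-example rewards for two checkpoints (not our data).
before = [0.5, 0.6, 0.7, 0.8, 0.9] * 20
after = [0.55, 0.6, 0.7, 0.8, 0.95] * 20
for name, scores in (("before", before), ("after", after)):
    point, low, high = bootstrap_ci(scores)
    print(f"{name}: {point:.3f} [{low:.3f}, {high:.3f}]")
```

If the two intervals overlap heavily, the gap between checkpoints is not something to report as an improvement.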

Trade-offs and limitations

This isn't free and it isn't a silver bullet.

Judge cost. 3x judges plus pair retries. Budget for it before you propose this to a director.
Consensus isn't truth. Three judges can agree on the wrong thing. We still sample 5% of pairs for human review weekly. That review process has caught two systematic biases all three LLM judges shared. Probably trained on overlapping data.
Latency. Preference labeling is no longer a same-afternoon job. Two-day turnaround on a full cycle now. Plan the data pipeline schedule around it.
Bad rubric, no rescue. If your scoring criteria don't match what users care about, ensembling judges won't save you. We rewrote the rubric twice during this work.
Position bias varies by model. Don't assume. Measure.
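One way to measure position bias for a given judge, as a sketch: show every pair in both orders and count how often the judge picks the same slot both times, which means it picked a different completion. The judge callable and its "first"/"second" return values are assumptions for illustration.

```python
def position_bias_rate(judge, pairs):
    """Share of pairs where the judge picks the same position in both orders,
    i.e. its preferred completion changes when the order is swapped."""
    if not pairs:
        raise ValueError("need at least one pair")
    inconsistent = 0
    for prompt, a, b in pairs:
        forward = judge(prompt, a, b)  # "first" here means completion a
        reverse = judge(prompt, b, a)  # "first" here means completion b
        if forward == reverse and forward in ("first", "second"):
            inconsistent += 1
    return inconsistent / len(pairs)

# Toy judges for illustration only.
always_first = lambda prompt, x, y: "first"
prefers_longer = lambda prompt, x, y: "first" if len(x) > len(y) else "second"
pairs = [("p1", "short", "a longer answer"), ("p2", "medium text", "tiny")]
print(position_bias_rate(always_first, pairs))    # 1.0
print(position_bias_rate(prefers_longer, pairs))  # 0.0
```

With a real judge, a single forward and reverse call also picks up plain self-disagreement, so it probably makes sense to repeat each order several times and compare against the judge's flip rate at a fixed order.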

The deeper point. Most teams I talk to treat the judge as an oracle and the model as the unknown. It's backwards. The model converges on whatever target you point it at. If the target wobbles, the model wobbles with it, and you won't see it in your reward curve.

We spent three weeks training a model to imitate a noisy judge. The model worked. That was the bug.
