How I set up RAG evals in CI/CD so they actually catch regressions

I have hit this a few times. A PR lands late in the day, the RAG eval runs in under a minute, green check, merge. Twelve hours later support tickets start coming in.

The trace shows the retriever switched its top-1 chunk on a class of queries the 30-example dataset never covered. Suite Groundedness stayed at 0.91. Production Groundedness on the affected traffic was 0.62.

The gate passed because it was not checking the right thing. Most CI eval gates I have seen for RAG are smoke tests. Small dataset, mean compared against a fixed floor, pass unless something is badly broken.

The dataset is not representative, the floor is not tied to baseline variance, and the threshold does not separate a real regression from judge noise. So a green check does not tell you much.

The way I think about it now: a gate wants three things at once (cheap, fast, and statistically significant), and you usually get two.
A 30-example dataset at 12 cents per PR fails on significance. A 2,000-example sweep at 9 dollars per PR fails on cost and speed.
The work is holding all three.

I split the gate into three tiers

Running the full LLM-judge sweep on every push was my first mistake. That belongs on nightly main, not an in-progress branch.

Every push:

Cheap classifier rubrics (NLI faithfulness, claim_support) plus deterministic floors (citation validity, schema, latency).
Under three minutes against 100 to 200 examples.
Blocks the merge.

Nightly main:

The full LLM-judge stack against the versioned dataset.
15 to 30 minutes.
Blocks the next promotion to canary.

Canary:

The same rubrics scoring 5 to 10 percent of live traffic.
Alarms on rolling-mean drift.
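
The three tiers can be written down as plain data so every CI script reads the same definition. This is a minimal sketch: the `Tier` dataclass, its field names, and the check names such as `llm_judge_stack` are hypothetical labels for illustration, not part of any tool mentioned here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    checks: tuple               # rubric or check names run in this tier
    max_minutes: Optional[int]  # time budget; None for continuous scoring
    on_failure: str             # what a failure stops or triggers


# Hypothetical tier table mirroring the three tiers described above.
TIERS = {
    "push": Tier(
        "push",
        ("nli_faithfulness", "claim_support", "citation_validity", "schema", "latency"),
        3,
        "block merge",
    ),
    "nightly": Tier("nightly", ("llm_judge_stack",), 30, "block canary promotion"),
    "canary": Tier("canary", ("llm_judge_stack",), None, "alarm on rolling-mean drift"),
}


def tier_for(event: str) -> Tier:
    """Look up the tier for a CI event name; raises KeyError for unknown events."""
    return TIERS[event]
```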

The dataset is the gate's worldview

A 2,000-example set built from my own head loses to a 200-example set sampled from production. If it misses the failure modes that show up at 2 am, so does the gate.
Below 100 examples per route the variance drowns the signal.
Above 500 the per-PR judge bill grows faster than detection improves. My PR range is 100 to 200 per route, covering happy paths, edge cases, refusals, and the hardest 10 percent of past incidents.
The field most people skip is expected_chunks, the ground-truth doc IDs.
Without it you can score generation but not retrieval, and the bisect takes a day instead of an hour.
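
To make the role of expected_chunks concrete, here is a rough sketch of a dataset record and a recall-at-k score computed from it. The `EvalExample` shape, the other field names, and the default `k` are assumptions for illustration; only `expected_chunks` comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class EvalExample:
    query: str
    expected_answer: str
    # Ground-truth doc IDs; without these only generation can be scored.
    expected_chunks: list[str] = field(default_factory=list)
    route: str = "default"  # hypothetical field for per-route datasets


def retrieval_recall(expected_chunks: list[str], retrieved_chunks: list[str], k: int = 5) -> float:
    """Fraction of ground-truth chunk IDs that appear in the top-k retrieved IDs."""
    expected = set(expected_chunks)
    if not expected:
        raise ValueError("expected_chunks is empty; retrieval cannot be scored")
    top_k = set(retrieved_chunks[:k])
    return len(expected & top_k) / len(expected)
```

A retrieval score per example is what lets a failing run point at the retriever directly instead of leaving the bisect to guesswork.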

Five rubrics cover most regressions

The five rubrics that gate most RAG regressions in CI:

Groundedness
Context Relevance
Answer Relevance
Citation Validity
Retrieval Recall

I split them by layer so the bisect is easy:

ContextRelevance drops while Groundedness holds → retriever regressed.
Groundedness drops while ContextRelevance holds → generator regressed.
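
The two rules above can be sketched as a small triage function. The `tolerance` of 0.03 and the "both" and "none" outcomes are my assumptions; the text only defines the two single-layer cases.

```python
def triage(groundedness_delta: float, context_relevance_delta: float, tolerance: float = 0.03) -> str:
    """Map per-rubric deltas (current minus baseline) to the layer that likely regressed."""
    groundedness_dropped = groundedness_delta < -tolerance
    context_dropped = context_relevance_delta < -tolerance

    if context_dropped and not groundedness_dropped:
        return "retriever"
    if groundedness_dropped and not context_dropped:
        return "generator"
    if groundedness_dropped and context_dropped:
        # Not covered by the two rules above; the traces need a manual look.
        return "both"
    return "none"
```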

Citation validity is just a string match, so I run it on 100 percent of responses and keep the judge bill on the rubrics that need semantic scoring.
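
As an illustration of how cheap that check can be, here is a sketch that assumes citations appear in the answer as bracketed doc IDs like `[doc-1]`. The citation format and the function name are assumptions; adapt the pattern to whatever format the generator actually emits.

```python
import re

# Assumed citation format: a doc ID in square brackets, e.g. [doc-1].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_\-:.]+)\]")


def citation_validity(answer: str, retrieved_ids: set[str]) -> bool:
    """True when every cited ID was actually retrieved.

    An answer with no citations passes this check; whether that is acceptable
    is a separate policy decision.
    """
    cited = CITATION_PATTERN.findall(answer)
    return all(doc_id in retrieved_ids for doc_id in cited)
```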

Gating on the delta, not the mean

This is the part most setups skip and the one that matters.

A green check should mean the PR did not introduce a statistically significant regression, not that the mean sat above some floor.

I use two thresholds.

An absolute floor per rubric catches the obvious break:

Groundedness ≥ 0.85
ContextRelevance ≥ 0.80
Citation validity ≥ 0.99
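
A minimal sketch of the floor check using the three values above. The dictionary keys and the function name are hypothetical; treating a missing rubric as a failure is my assumption.

```python
# Floors from the list above, keyed by hypothetical rubric identifiers.
FLOORS = {"groundedness": 0.85, "context_relevance": 0.80, "citation_validity": 0.99}


def floor_failures(means: dict[str, float], floors: dict[str, float] = FLOORS) -> list[str]:
    """Return the rubrics whose mean is below its floor.

    A rubric missing from `means` counts as a failure so a skipped eval cannot pass silently.
    """
    return sorted(
        name for name, floor in floors.items()
        if means.get(name, float("-inf")) < floor
    )
```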

A delta gate against the trailing 7-day rolling baseline catches the slow drift, using Welch's t-test on per-example scores.

import statistics
from scipy import stats

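def regression_gate(current, baseline, alpha=0.05, min_effect=0.03):
    """Fail only if the mean dropped, the change is significant, and the effect is big enough.

    current and baseline are per-example scores. Returns (passed, reason).
    """
    delta = statistics.mean(current) - statistics.mean(baseline)

    # An improvement or no change never fails the gate.
    if delta >= 0:
        return True, f"no regression (delta=+{delta:.3f})"

    # Welch's t-test (two-sided): does not assume equal variance in the two samples.
    _, p = stats.ttest_ind(current, baseline, equal_var=False)

    # A drop that is indistinguishable from noise passes.
    if p >= alpha:
        return True, f"delta={delta:.3f}, p={p:.3f} (not significant)"

    # A significant but tiny drop passes too.
    if abs(delta) < min_effect:
        return True, f"delta={delta:.3f} below effect floor {min_effect}"

    return False, f"regression: delta={delta:.3f}, p={p:.3f}"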

A 30-example dataset gives a confidence interval of about ±0.07 on a 0–1 mean score, so a 2-point drop (0.02) sits inside the noise.

Gate on that and you are wrong half the time, and after two false alarms people stop trusting the gate.
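
The arithmetic behind that noise band can be sketched with a normal-approximation confidence interval. The per-example standard deviation of 0.2 on a 0 to 1 score is my assumption, chosen because it gives roughly the ±0.07 figure at 30 examples; measure your own from the baseline runs.

```python
import math


def ci_half_width(std_dev: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% confidence-interval half-width for a mean of n scores."""
    return z * std_dev / math.sqrt(n)


# Assumed per-example standard deviation of 0.2 on a 0-1 score.
for n in (30, 100, 200, 500):
    print(n, round(ci_half_width(0.2, n), 3))
```

Under that assumption the half-width is about 0.07 at 30 examples and about 0.03 at 200, which is why a 0.02 drop is invisible on the small set.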

For long-tail failures that hide in averages, gate on percentiles instead.
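
One way to sketch a percentile gate with the standard library is below. The choice of the 10th percentile and the `max_drop` of 0.05 are assumptions for illustration, not values from this setup.

```python
import statistics


def percentile(scores: list[float], pct: int) -> float:
    """pct-th percentile (1 to 99) using the inclusive interpolation method."""
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    return cut_points[pct - 1]


def percentile_gate(current, baseline, pct: int = 10, max_drop: float = 0.05):
    """Pass unless the low-end percentile fell by more than max_drop."""
    drop = percentile(baseline, pct) - percentile(current, pct)
    return drop <= max_drop, f"p{pct} drop={drop:.3f}"
```

A batch where 15 percent of answers collapse moves the 10th percentile sharply even when the mean only shifts a little, which is the case a mean gate tends to miss.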

The one rule that keeps this honest:

The baseline is a rolling production window, not a frozen number.
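
A rough sketch of building that rolling window from daily score batches. The storage shape (a dict of date to per-example scores) and the function name are assumptions; in practice the scores would come from wherever the daily baseline is posted.

```python
from datetime import date, timedelta


def rolling_baseline(daily_scores: dict[date, list[float]], today: date, window_days: int = 7) -> list[float]:
    """Pool per-example scores from the trailing window, excluding today itself."""
    start = today - timedelta(days=window_days)
    pooled: list[float] = []
    for run_date in sorted(daily_scores):
        if start <= run_date < today:
            pooled.extend(daily_scores[run_date])
    return pooled
```

The pooled list is the kind of `baseline` argument a delta gate on per-example scores would consume.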

The GitHub Actions wiring

The workflow itself is path-scoped triggers, a cache, parallel pytest, and the cheap-rubric assertions.

This is roughly what I run on PRs.

name: RAG Evals

on:
  pull_request:
    paths:
      - "rag/**"
      - "evals/**"
      - "fi-evaluation.yaml"

concurrency:
  group: rag-evals-${{ github.head_ref }}
  cancel-in-progress: true

jobs:
  pr-gate:
    runs-on: ubuntu-latest
    timeout-minutes: 8

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - uses: actions/cache@v4
        with:
          path: .eval_cache
          key: evals-${{ hashFiles('evals/rubrics/**', 'evals/datasets/**') }}

      - run: pip install -r requirements.txt

      - id: routes
        run: echo "routes=$(python evals/affected_routes.py)" >> "$GITHUB_OUTPUT"

      - name: Cheap-rubric gate
        if: steps.routes.outputs.routes != '[]'
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
        run: fi run --check --strict --parallel 16 -c evals/fi-evaluation.yaml

      - name: Statistical delta gate
        if: steps.routes.outputs.routes != '[]'
        env:
          FI_API_KEY: ${{ secrets.FI_API_KEY }}
          FI_SECRET_KEY: ${{ secrets.FI_SECRET_KEY }}
          BASELINE_WINDOW_DAYS: "7"
        run: pytest evals/test_rag.py -n auto --routes='${{ steps.routes.outputs.routes }}'

      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-report-${{ github.sha }}
          path: eval-report.json

A separate nightly cron workflow runs the full sweep across all routes and posts the daily baseline back to the observer.

The same shape works on GitLab CI, Buildkite, Jenkins, or CircleCI, since it is just pytest, a CLI, and a cache action.
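
The workflow calls `evals/affected_routes.py` without showing it. Here is a hypothetical sketch of what such a script's core could look like: the glob-to-route mapping, the route names, and the function signature are all invented for illustration. The only behaviour taken from the workflow is that it emits a JSON list and that an empty list prints as `[]`.

```python
import fnmatch
import json

# Hypothetical mapping from path globs to route names; "*" means every route.
ROUTE_GLOBS = {
    "rag/retrievers/contracts/*": "contracts",
    "rag/retrievers/case_law/*": "case_law",
    "rag/prompts/*": "*",  # shared code affects all routes
}
ALL_ROUTES = ["case_law", "contracts"]


def affected_routes(changed_files, route_globs=ROUTE_GLOBS, all_routes=ALL_ROUTES):
    """Return the sorted list of routes touched by the changed files."""
    hit = set()
    for path in changed_files:
        for pattern, route in route_globs.items():
            if fnmatch.fnmatch(path, pattern):
                if route == "*":
                    return sorted(all_routes)
                hit.add(route)
    return sorted(hit)


# The script would print json.dumps(...) of this list for the workflow step to capture.
print(json.dumps(affected_routes(["rag/retrievers/contracts/index.py"])))
```

In a real script the changed-file list would come from a git diff against the base commit, which is presumably why the checkout step fetches more than one commit.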

Three things paid back in the first week:

Path-scoped triggers
Concurrency cancel-in-progress
Eval report artifacts with failing examples

Those turned borderline regressions into short conversations instead of arguments.

Running the same rubric in production

Offline CI catches the regressions I can think of. Production catches the rest. So I run the same rubric definition in both places. The CI version lives in code.

The production version runs against live OTel spans as a score attached to the trace.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    EvalName,
    ProjectType,
)

eval_tags = [
    EvalTag(
        eval_name=EvalName.GROUNDEDNESS,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        config={},
        mapping={
            "context": "retrieval.documents",
            "output": "output.value",
        },
    ),
    EvalTag(
        eval_name=EvalName.CONTEXT_RELEVANCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.RETRIEVER,
        config={},
        mapping={
            "input": "input.value",
            "context": "retrieval.documents",
        },
    ),
]

register(
    project_type=ProjectType.OBSERVE,
    project_name="legal-rag",
    eval_tags=eval_tags,
)

The score lands as a span attribute, so a failing trace shows up with its rubric score next to latency and chunk IDs.

I sample 5 to 10 percent of production traffic for the LLM-judge rubrics and run the cheap ones on 100 percent.

Then I alarm on a sustained 2 to 5 point drop in a rubric's rolling mean over a 15 to 60 minute window.
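
A minimal sketch of that alarm as a time-windowed rolling mean. The class name, the 30-minute default window, the 0.03 drop threshold, and the minimum sample count are assumptions picked from inside the ranges above; a production version would also want the drop to persist across several evaluations before paging anyone.

```python
from collections import deque


class DriftAlarm:
    """Fires when the rolling mean falls max_drop or more below the baseline mean."""

    def __init__(self, baseline_mean: float, max_drop: float = 0.03,
                 window_seconds: float = 1800, min_samples: int = 30):
        self.baseline_mean = baseline_mean
        self.max_drop = max_drop
        self.window_seconds = window_seconds
        self.min_samples = min_samples
        self._scores = deque()  # (timestamp, score) pairs, oldest first

    def observe(self, timestamp: float, score: float) -> bool:
        """Record one scored trace and return True if the alarm should fire."""
        self._scores.append((timestamp, score))
        # Evict scores that have aged out of the window.
        while self._scores and self._scores[0][0] < timestamp - self.window_seconds:
            self._scores.popleft()
        if len(self._scores) < self.min_samples:
            return False  # too few samples to trust the mean
        rolling_mean = sum(s for _, s in self._scores) / len(self._scores)
        return self.baseline_mean - rolling_mean >= self.max_drop
```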

The gap between the CI baseline and the production rolling mean is its own signal.

When it widens, the dataset has stopped being representative.

Closing the loop so the dataset keeps up

A dataset stops being a regression suite the moment production drifts past it.
The loop is what keeps the gate honest.
Failing production traces get clustered into named issues.
An analysis pass writes up the root cause and a fix for each one.
The worst representative traces get promoted into the eval set with rubric labels.
The next PR touching that path either clears the new entries or fails on them.
Over a few weeks the gate gets stronger instead of decaying because it learns the failures that actually happened rather than the ones I guessed at during setup.
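
The promotion step can be sketched as picking the lowest-scoring traces from each issue cluster. Everything here is hypothetical: the `FailingTrace` fields, the `per_cluster` default, and the output record shape. Clustering and root-cause analysis are assumed to have happened upstream.

```python
from dataclasses import dataclass


@dataclass
class FailingTrace:
    trace_id: str
    cluster: str  # named issue assigned by the upstream clustering step
    route: str
    query: str
    score: float  # rubric score that flagged the trace


def promote_worst(traces: list[FailingTrace], per_cluster: int = 2) -> list[dict]:
    """Pick the lowest-scoring traces from each cluster as eval-set candidates."""
    by_cluster: dict[str, list[FailingTrace]] = {}
    for trace in traces:
        by_cluster.setdefault(trace.cluster, []).append(trace)

    promoted = []
    for cluster in sorted(by_cluster):
        worst = sorted(by_cluster[cluster], key=lambda t: t.score)[:per_cluster]
        for trace in worst:
            promoted.append({
                "query": trace.query,
                "route": trace.route,
                "issue": cluster,
                "source_trace": trace.trace_id,
                "expected_chunks": [],  # left empty: ground truth needs a labeling pass
            })
    return promoted
```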

Pitfalls I watch for

Scoring only Groundedness

Catches hallucinations.
Misses retrieval regressions.
Run all five core rubrics.

No expected_chunks in the dataset

You cannot score retrieval recall without ground truth.
The labeling pays back the first time retrieval breaks.

A floating judge model

Scores drift across runs of the same eval.
Pin and version the judge alongside the rubric.

Floors with no delta gate

Slow regressions stay under the floor for months.
The delta gate is what catches them.

A 30-example dataset with a mean gate

Variance wider than the regressions you are chasing.
Grow the set or gate on percentiles.

The full LLM-judge sweep on every PR

Too slow and too expensive.
Cheap rubrics on PRs.
Heavy sweeps nightly.

A dataset frozen at launch

A 2024 set scoring a 2026 product is a benchmark, not a regression suite.

No cache

Cost climbs and flaky network calls block PRs.
Cache verdicts.
Invalidate on rubric or judge version change.

A production observer using a different rubric than CI

Then people argue about which number is real instead of fixing the bug.
One definition.
Run it in both places.
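
For the cache and judge-pinning pitfalls above, one way to get invalidation for free is to put the rubric version and the judge model into the cache key. A minimal sketch, with hypothetical names and version strings:

```python
import hashlib
import json


def verdict_cache_key(rubric_name: str, rubric_version: str, judge_model: str, example: dict) -> str:
    """Stable key for a cached verdict.

    Changing the rubric version or the pinned judge model changes the key,
    so stale verdicts are never reused.
    """
    payload = json.dumps(
        {
            "rubric": rubric_name,
            "rubric_version": rubric_version,
            "judge": judge_model,
            "example": example,
        },
        sort_keys=True,  # key order in the example must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```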

Tradeoffs I made on purpose

Statistical gating slows the first week

Building the baseline window takes a few nightly runs before the delta gate means anything.
The payoff is that a red PR starts meaning a real regression, which is the only reason people leave the gate on.

Sharded, path-scoped runs cost orchestration

Route-aware test selection and a distributed runner are extra wiring.
The return is the difference between a four-minute and a thirty-second PR gate.

The classifier cascade needs rubric discipline

Cascades work where the classifier has a clean target.
They fall apart on subjective axes.

Deciding which rubrics cascade is a one-time design call you have to actually make.
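
Reading "cascade" as "cheap classifier first, LLM judge only when the classifier is unsure", which is my interpretation, the shape is roughly the sketch below. The thresholds and the callable signatures are assumptions.

```python
from typing import Callable


def cascade_score(
    example: str,
    classifier: Callable[[str], float],  # cheap model: probability the answer is supported
    judge: Callable[[str], float],       # expensive LLM judge: score in [0, 1]
    low: float = 0.2,
    high: float = 0.8,
) -> tuple[float, str]:
    """Return (score, source), escalating to the judge only in the uncertain band."""
    probability = classifier(example)
    if probability >= high:
        return 1.0, "classifier"
    if probability <= low:
        return 0.0, "classifier"
    return judge(example), "judge"
```

On a subjective rubric the classifier's probabilities cluster in the middle band, so nearly everything escalates and the cascade saves nothing.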

If you have built a gate like this, I am curious which of the three corners gave you the most trouble.

For me it was significance, every time.
