How a model upgrade silently broke our extraction prompt (and how we caught it)

A friend's product summarizes customer support tickets using a carefully tuned
LLM prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated
4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said
"looks fine," and shipped.

Two weeks later a customer escalated: "Your urgency tagging is wrong on
basically everything since last Wednesday."

The prompt asked for {"intent": "...", "urgency": "low|medium|high"}. On
4o, the model returned exactly that. On 4.1, it started returning
{"intent": "...", "urgency_level": "..."} — semantically identical, but
the downstream classifier was indexing on urgency and silently fell
through to a default value of "low" on 100% of new tickets.
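Here's a minimal sketch of the difference between a parser that falls through and one that fails loudly. The function names, the `SchemaDriftError` class, and the exact two-key schema are assumptions for illustration; the team's real downstream code isn't shown in this post.

```python
import json

ALLOWED_URGENCY = {"low", "medium", "high"}
REQUIRED_KEYS = {"intent", "urgency"}


class SchemaDriftError(ValueError):
    """Raised when the model output does not match the expected shape."""


def parse_summary_lenient(raw: str) -> dict:
    # The failure mode described above: a missing key quietly becomes "low".
    data = json.loads(raw)
    return {"intent": data.get("intent", ""), "urgency": data.get("urgency", "low")}


def parse_summary_strict(raw: str) -> dict:
    # Reject any missing, extra, or out-of-range field instead of defaulting.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise SchemaDriftError(f"expected a JSON object, got {type(data).__name__}")
    keys = set(data)
    if keys != REQUIRED_KEYS:
        missing = sorted(REQUIRED_KEYS - keys)
        extra = sorted(keys - REQUIRED_KEYS)
        raise SchemaDriftError(f"missing={missing} extra={extra}")
    urgency = data["urgency"]
    if not isinstance(urgency, str) or urgency not in ALLOWED_URGENCY:
        raise SchemaDriftError(f"unexpected urgency value: {urgency!r}")
    return data


if __name__ == "__main__":
    drifted = '{"intent": "refund request", "urgency_level": "high"}'
    print(parse_summary_lenient(drifted)["urgency"])  # silent fallback
    try:
        parse_summary_strict(drifted)
    except SchemaDriftError as err:
        print(f"rejected: {err}")
```

The strict version turns a two-week silent bug into an exception on the first ticket, which is usually the trade you want.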

Nobody saw it because:

The prompt didn't error. JSON parsed. Fields existed.
The unit tests checked the prompt string, not the model's output.
The integration tests mocked the LLM call.
The output was indistinguishable from "everything's fine and quiet."

This is the silent regression problem. Code has tests; prompts have vibes.

Three categories of model-swap failure

After looking at a dozen of these incidents, the failures cluster into three
groups. Knowing which kind you're looking at tells you what to test.

1. Format drift. The model decides to rename a field, drop a field, add
a field you didn't ask for, or change list ordering. JSON still parses. Your
downstream code breaks.

2. Reasoning regression. The model is "improved" but loses a hidden
constraint your prompt depended on. Classic example: GPT-4 reliably extracted
all requirements from a contract; GPT-4 Turbo extracted "the most important
ones," dropping 15-20% of clauses. The format was fine. The data was wrong.

3. Tone shift. Less common but expensive. The new model's outputs are
more verbose, less verbose, friendlier, blunter. If anything downstream
(another model, a regex, a fuzzy matcher) was tuned to the old tone, it
breaks.
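Each of the three categories suggests a different cheap check. The sketch below is one way to express them; the function names, the `Finding` type, and the thresholds (95% recall, 1.5x length ratio) are all assumptions, and exact string matching on extracted items is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    category: str  # "format", "reasoning", or "tone"
    detail: str


def check_format(baseline: dict, candidate: dict) -> list[Finding]:
    # Format drift: compare key sets, since a renamed field still parses as JSON.
    missing = sorted(set(baseline) - set(candidate))
    added = sorted(set(candidate) - set(baseline))
    if missing or added:
        return [Finding("format", f"missing={missing} added={added}")]
    return []


def check_coverage(baseline_items: list[str], candidate_items: list[str],
                   min_recall: float = 0.95) -> list[Finding]:
    # Reasoning regression: how many of the baseline's items did the candidate keep?
    expected = set(baseline_items)
    if not expected:
        return []
    kept = len(expected & set(candidate_items))
    if kept / len(expected) < min_recall:
        return [Finding("reasoning", f"kept {kept}/{len(expected)} baseline items")]
    return []


def check_length(baseline_text: str, candidate_text: str,
                 max_ratio: float = 1.5) -> list[Finding]:
    # Tone shift: word-count ratio as a crude proxy for a verbosity change.
    base_words = max(len(baseline_text.split()), 1)
    cand_words = max(len(candidate_text.split()), 1)
    ratio = cand_words / base_words
    if ratio > max_ratio or ratio < 1 / max_ratio:
        return [Finding("tone", f"word-count ratio {ratio:.2f}")]
    return []
```

Word count won't catch "friendlier" or "blunter," only "longer" or "shorter." Tone changes beyond length need a judge model or a human.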

What the team should have had

A test suite of 30 representative tickets, each with an expected JSON shape.
On model swap day:

$ promptfork test summarize_ticket --baseline gpt-4o
→ running v12 across [gpt-4.1] vs baseline [gpt-4o]
✗ 30/30 ok, but 6 regressions detected
  - urgency_field_renamed: 6 cases
  - severity 2 (functional)

Six lines. Seven seconds. Two-week customer-facing bug avoided.
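If you'd rather see the idea without any tooling, a hand-rolled version is short. This is a sketch, not how PromptFork works internally: `run_suite`, `shape_errors`, and the stand-in `fake_drifted_model` are hypothetical names, and in real use you'd pass a function that calls your model provider.

```python
import json
from typing import Callable

ALLOWED_URGENCY = {"low", "medium", "high"}


def shape_errors(raw: str) -> list[str]:
    # Return the reasons an output fails the expected shape (empty if it passes).
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    errors = []
    if "intent" not in data:
        errors.append("missing intent")
    urgency = data.get("urgency")
    if not isinstance(urgency, str) or urgency not in ALLOWED_URGENCY:
        errors.append("urgency missing or not in {low, medium, high}")
    return errors


def run_suite(tickets: dict[str, str],
              call_model: Callable[[str], str]) -> dict[str, list[str]]:
    # Run every pinned ticket through the model and collect failures by test name.
    failures = {}
    for name, ticket_text in tickets.items():
        errors = shape_errors(call_model(ticket_text))
        if errors:
            failures[name] = errors
    return failures


def fake_drifted_model(ticket_text: str) -> str:
    # Stand-in for a real API call; mimics the renamed-field output from the story.
    return json.dumps({"intent": "billing question", "urgency_level": "high"})


if __name__ == "__main__":
    tickets = {
        "ticket_001": "I was charged twice this month.",
        "ticket_002": "The app crashes on login.",
    }
    for name, errors in run_suite(tickets, fake_drifted_model).items():
        print(name, errors)
```

The key design choice is that the model call is injected, so the same suite runs against the old model, the new model, or a stub, without mocking away the thing you're trying to test.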

How to actually do this

The setup for the team that got bitten took four minutes:

pip install promptfork

# Save the current production prompt, version 1
promptfork push summarize_ticket \
  --file prompts/summarize.txt \
  --message "current prod"

# Pin 30 real tickets from your support inbox
for t in tickets/*.json; do
  name=$(basename "$t" .json)
  promptfork add-test summarize_ticket "$name" \
    --input ticket="$(cat "$t")" \
    --rubric "must return urgency in {low,medium,high}"
done

# Run baseline on 4o
promptfork test summarize_ticket --models gpt-4o

# Now upgrade — push the new prompt as v2 (or keep v1 and swap models)
# Run with v1 (4o) as the baseline, get an LLM-judge regression report
promptfork test summarize_ticket --baseline 1 --models gpt-4.1

That's it. The --baseline flag is what catches drift — it pulls the
baseline output, runs the candidate, and asks Claude Haiku to compare them
under a strict "only flag strictly worse" rubric.
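To make the baseline-versus-candidate pattern concrete, here's a rough sketch of the general shape. It is not PromptFork's actual rubric or implementation: the rubric wording, `build_judge_prompt`, and `is_regression` are invented for illustration, and `judge` is any function that sends a prompt to a model and returns its text reply.

```python
from typing import Callable

# Hypothetical rubric text; a real tool's wording will differ.
JUDGE_RUBRIC = (
    "You are comparing two outputs produced for the same input. "
    "Answer WORSE only if the candidate is strictly worse than the baseline. "
    "Otherwise answer OK."
)


def build_judge_prompt(ticket: str, baseline_out: str, candidate_out: str) -> str:
    # Put the input and both outputs side by side so the judge sees full context.
    return (
        f"{JUDGE_RUBRIC}\n\n"
        f"Input:\n{ticket}\n\n"
        f"Baseline output:\n{baseline_out}\n\n"
        f"Candidate output:\n{candidate_out}\n\n"
        "Verdict (WORSE or OK):"
    )


def is_regression(ticket: str, baseline_out: str, candidate_out: str,
                  judge: Callable[[str], str]) -> bool:
    # Only a verdict that starts with WORSE counts; anything else is a pass.
    verdict = judge(build_judge_prompt(ticket, baseline_out, candidate_out))
    return verdict.strip().upper().startswith("WORSE")


if __name__ == "__main__":
    def stub_judge(prompt: str) -> str:
        # Deterministic stand-in so the sketch runs without an API key.
        return "WORSE" if "urgency_level" in prompt else "OK"

    print(is_regression(
        "I was charged twice.",
        '{"intent": "billing", "urgency": "high"}',
        '{"intent": "billing", "urgency_level": "high"}',
        stub_judge,
    ))
```

Treating every ambiguous verdict as a pass keeps the noise down, at the cost of missing regressions the judge hedges on. Flip that default if false negatives hurt you more than false alarms.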

The CI version

The same command in a GitHub Action means no prompt change ever ships
without running against a known-good baseline:

- uses: shaunvand/promptfork-cli@v0
  with:
    prompt: summarize_ticket
    baseline: 1
    api-key: ${{ secrets.PROMPTFORK_API_KEY }}

The action exits non-zero on regression. Branch protection blocks the merge.
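The exit-code contract is the whole trick, and it works with any CI system. A minimal sketch of a gate script, assuming a hypothetical JSON report that maps each test name to a list of failure messages (this is not the action's real output format):

```python
import json
import sys


def gate(report: dict[str, list[str]]) -> int:
    # Map a regression report to a process exit code: 0 = clean, 1 = regressions.
    failing = {name: errs for name, errs in report.items() if errs}
    for name, errs in sorted(failing.items()):
        print(f"regression in {name}: {'; '.join(errs)}", file=sys.stderr)
    return 1 if failing else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gate.py report.json  (hypothetical report file)
    with open(sys.argv[1], encoding="utf-8") as fh:
        sys.exit(gate(json.load(fh)))
```

Any non-zero exit fails the CI step, and branch protection does the rest.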

If you ship LLM features, you need this. The first time it catches a silent
regression, it pays for itself a hundred times over. PromptFork has a free
tier (3 prompts, 50 runs/mo) at https://promptfork.online/diff — set it up
in five minutes, sleep better forever.
