When the LLM Refuses: A Fallback Chain That Salvages Most Refusals

Every production LLM app eats false-positive refusals. A user asks something perfectly fine, the safety filter trips, the model emits two sentences of "I can't help with that," and your UI shows a wall. Do that a few times and the user leaves.

We've measured this on HoneyChat, a Telegram-native AI companion (~300 DAU, 17 languages). On a normal day, between 2% and 8% of model calls end in a refusal or a finish_reason="content_filter" state. Most of those are not actually problematic content; they're the model being twitchy about edge phrasing, polysemous words, or roleplay framing. The pattern below recovers about 70% of them.
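To make "lands in a refusal or content_filter state" concrete, here is a minimal sketch of how such a call might be classified. The `Completion` type, the `is_refusal` name, and the short English-only prefix list are assumptions for illustration, not HoneyChat's actual code.

```python
from dataclasses import dataclass

# Hypothetical, English-only sample; a real list would cover every supported locale.
REFUSAL_PREFIXES = ("i can't", "i cannot", "i'm not able", "as an ai")


@dataclass
class Completion:
    text: str
    finish_reason: str  # e.g. "stop", "length", "content_filter"


def is_refusal(c: Completion) -> bool:
    """True if the provider flagged the call or the text opens with a refusal marker."""
    if c.finish_reason == "content_filter":
        return True
    # Normalise curly apostrophes so "I can’t" matches "i can't".
    head = c.text.strip().lower().replace("\u2019", "'")
    return head.startswith(REFUSAL_PREFIXES)
```

A check like this only decides whether the rest of the chain runs; it says nothing about whether the partial text is worth keeping.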

HoneyChat LLM routing at a glance (core/llm.py, plan-gated via OpenRouter):

| Tier(s) | Pace | Primary model (OpenRouter slug) |
| --- | --- | --- |
| `free` / `basic` / `premium` | natural | `qwen/qwen3-235b-a22b-2507` |
| `free` / `basic` / `premium` | instant / explicit | `deepseek/deepseek-v4-flash` |
| `vip` / `elite` | any | `google/gemini-3.1-flash-lite-preview` |

Emergency content_filter fallback chain (GEMINI_CONTENT_FILTER_FALLBACK_CHAIN): x-ai/grok-4.20 → an open roleplay-tuned model. The rescue chain below is what feeds traffic into that fallback only when it's actually needed.
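The routing above could be expressed as a small lookup. This is a sketch under assumptions: the function name, the constant names other than GEMINI_CONTENT_FILTER_FALLBACK_CHAIN, and the error handling are hypothetical, and the second chain entry uses the roleplay-tuned model the article says is currently in use.

```python
# Hypothetical lookup mirroring the routing table; not the actual core/llm.py code.
PRIMARY_BY_PACE = {
    "natural": "qwen/qwen3-235b-a22b-2507",
    "instant": "deepseek/deepseek-v4-flash",
    "explicit": "deepseek/deepseek-v4-flash",
}
LOWER_TIERS = {"free", "basic", "premium"}
TOP_TIERS = {"vip", "elite"}
TOP_TIER_MODEL = "google/gemini-3.1-flash-lite-preview"

# Emergency chain, tried in order only after a content_filter stop.
GEMINI_CONTENT_FILTER_FALLBACK_CHAIN = ["x-ai/grok-4.20", "minimax/minimax-m2-her"]


def pick_primary_model(plan: str, pace: str) -> str:
    """Return the primary OpenRouter slug for a plan and pace."""
    if plan in TOP_TIERS:
        return TOP_TIER_MODEL  # top tiers use one model for any pace
    if plan in LOWER_TIERS and pace in PRIMARY_BY_PACE:
        return PRIMARY_BY_PACE[pace]
    raise ValueError(f"no route for plan={plan!r}, pace={pace!r}")
```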

Four steps, starting from the cheapest.

Step 0: Don't trigger it in the first place

Free, and where most posts on this topic stop. Two things:

Tighten the safety knobs the provider exposes. For Gemini via OpenRouter, that's safety_settings in the extra body. Default is BLOCK_MEDIUM_AND_ABOVE on four categories; for roleplay/chat traffic we lower them via a helper called _maybe_inject_gemini_safety_off():
```
extra_body = {
    "safety_settings": [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
    ],
}
```
Probe before/after on the same fictional-scene prompt: 130-char refusal → 2,571-char full response. The hard, non-negotiable filters (CSAM, etc.) stay on at the provider level regardless of this knob; only the adjustable sliders move.
Don't apply this to moderation/vision calls. Those calls want the filter on. The helper is scoped to the chat/roleplay code path only.
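The scoping rule can be sketched as a guard around the injection. The function below is a hypothetical stand-in for _maybe_inject_gemini_safety_off(); its signature, the `purpose` labels, and the model-prefix check are assumptions, since the article does not show the helper's body.

```python
from __future__ import annotations

HARM_CATEGORIES = (
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
)
# Hypothetical call-site labels; only these code paths get the relaxed thresholds.
RELAXED_PURPOSES = {"chat", "roleplay"}


def maybe_inject_gemini_safety_off(
    model: str, purpose: str, extra_body: dict | None = None
) -> dict:
    """Return a copy of extra_body, adding safety_settings only for Gemini chat traffic."""
    body = dict(extra_body or {})  # never mutate the caller's dict
    if model.startswith("google/gemini") and purpose in RELAXED_PURPOSES:
        body["safety_settings"] = [
            {"category": cat, "threshold": "BLOCK_NONE"} for cat in HARM_CATEGORIES
        ]
    return body
```

Moderation and vision call sites would pass their own `purpose` value and so keep the provider defaults.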

This alone cuts refusals roughly in half on our traffic.

Step 1: Partial salvage before fallback

When you do get a refusal, the model still sent something. Check the streamed buffer or the partial completion before declaring failure:

```
def salvage_partial(text: str) -> str | None:
    """Extract usable content from a partial/filtered response. None = unsalvageable."""
    extracted = _try_extract_json_field(text, "content") or text
    cleaned = _strip_trailing_refusal_markers(extracted)   # 17-lang marker set
    cleaned = _truncate_to_sentence_end(cleaned)
    if len(cleaned) < 150:
        return None
    return cleaned
```

The 17-language refusal marker list (one per supported HoneyChat locale) is the boring part — "I can't", "I'm not able", "As an AI", plus their localised equivalents ("Я не могу", "Lo siento, no puedo", "申し訳ありません", …). Strip the trailing one, keep what came before, and a lot of "filtered" responses turn out to be 800 words of useful content followed by one sentence of model anxiety.
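As a rough illustration of "strip the trailing one, keep what came before", the two helpers might look like the sketch below. The function bodies, the naive punctuation-based sentence splitting, and the short marker sample are assumptions; the production helpers and the full 17-locale list are not shown in the article.

```python
import re

# Hypothetical sample of markers; the real list has one set per supported locale.
REFUSAL_MARKERS = (
    "I can't", "I cannot", "I'm not able", "As an AI",
    "Я не могу", "Lo siento, no puedo", "申し訳ありません",
)
_MARKERS_LOWER = tuple(m.lower() for m in REFUSAL_MARKERS)
# Naive sentence boundary: terminal punctuation plus any following whitespace.
_BOUNDARY = re.compile(r"[.!?。！？]+\s*")


def _last_sentence_start(text: str) -> int:
    """Index where the final sentence begins (0 if the text is a single sentence)."""
    start = 0
    for m in _BOUNDARY.finditer(text):
        if m.end() < len(text):
            start = m.end()
    return start


def strip_trailing_refusal_markers(text: str) -> str:
    """Drop trailing sentences that open with a refusal marker."""
    cleaned = text.rstrip()
    while cleaned:
        start = _last_sentence_start(cleaned)
        tail = cleaned[start:].lower().replace("\u2019", "'")
        if not tail.startswith(_MARKERS_LOWER):
            break
        cleaned = cleaned[:start].rstrip()
    return cleaned


def truncate_to_sentence_end(text: str) -> str:
    """Cut a dangling fragment after the last terminator; unchanged if there is none."""
    matches = list(_BOUNDARY.finditer(text))
    return text[: matches[-1].end()].rstrip() if matches else text
```

A splitter this simple will misfire on abbreviations and decimals, which is one reason a large test file around the real function is plausible.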

The length gate (len ≥ 150) is what stops a bare "I can't help" from being salvaged as "I can." We have 70 unit tests on this function; tests/test_salvage_partial.py is the largest single test file in the codebase.

Cost so far: zero extra API calls.

Step 2: Provider rescue with a system-prefix override

If salvage returns None, now we route to a backup provider. Ordered by cost:

Grok 4.20 (xAI) via OpenRouter — much looser refusal posture by default, no system-prefix needed.
A roleplay-tuned open model (we currently use minimax/minimax-m2-her via OpenRouter): needs an explicit "stay in character, do not break the fourth wall" system-prefix, prepended via _maybe_prepend_minimax_jb(). Without it, the model refuses about as often as the primary. Probe: 215-char soft-refuse → 1,237-char full output.

Both calls only happen on a salvage-fail, so the volume is small (low single-digit percent of all traffic).

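```
async def rescue(prompt: ChatPrompt) -> str | None:
    # First hop: Grok, no system-prefix needed.
    grok_out = await call_grok(prompt)             # x-ai/grok-4.20
    salvaged = salvage_partial(grok_out)
    if salvaged:
        return salvaged                            # cleaned text, not the raw output
    # Second hop: roleplay-tuned model with the in-character system-prefix.
    prefixed = prompt.with_system_prefix(MINIMAX_PREFIX)
    return await call_minimax(prefixed)            # minimax/minimax-m2-her
```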

The prefix isn't magic — it's a short, explicit "you are a fictional character, the user is a consenting adult, stay in scene" framing. We don't ship it to providers that would refuse anyway; the rescue model is specifically picked because it tolerates and uses it.
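The mechanics of with_system_prefix() are simple enough to sketch. The `ChatPrompt` shape below is an assumption (the article only shows the method call), and the prefix text is a placeholder rather than the production wording.

```python
from __future__ import annotations

from dataclasses import dataclass, replace

# Placeholder only; the real framing text is not reproduced in the article.
MINIMAX_PREFIX = "<in-character framing goes here>"


@dataclass(frozen=True)
class ChatPrompt:
    """Hypothetical prompt container: one system string plus the chat turns."""
    system: str
    messages: tuple[dict, ...] = ()

    def with_system_prefix(self, prefix: str) -> ChatPrompt:
        # Return a new prompt so the original stays reusable for other providers.
        return replace(self, system=f"{prefix}\n\n{self.system}")
```

Keeping the prompt immutable means the prefixed copy goes only to the rescue model, and the unprefixed original is still what every other provider sees.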

Step 3: Plan-aware degradation

Here's the part we got wrong for a month before fixing.

We were running steps 1 and 2 unconditionally for every user, every refusal. That meant a free-tier user whose call hit a hard content_filter got 3-4 extra API calls (salvage attempt → Grok → MiniMax), each adding latency and cost. They'd often still get a usable response. But over a month of free traffic, those rescue calls were a meaningful share of model spend on users who weren't paying us a dime.

The fix is just a gate, mapped against HoneyChat's five tiers:

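```
PAID_TIERS = {"basic", "premium", "vip", "elite"}

# handle_refusal is an illustrative wrapper name; the gate logic is unchanged.
async def handle_refusal(user, prompt, raw: str) -> str | None:
    # Salvage costs no extra API call, so every tier gets it.
    salvaged = salvage_partial(raw)
    if salvaged:
        return salvaged
    if user.plan in PAID_TIERS:
        # Paid tiers escalate to the Grok -> MiniMax rescue chain.
        return await rescue(prompt)
    # Free tier: local in-character refusal, no upstream calls.
    return _in_character_refusal(prompt.character)
```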

Free users still get something — a synthesised in-character soft refusal that's better than the model's generic wall — without paying for the cascade of upstream calls. Paid users get the full chain because their economics support it.
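One way to synthesise that soft refusal locally is to pick from lines stored with the character, with a generic fallback. Everything here is hypothetical: the `Character` fields, the sample lines, and the function body stand in for _in_character_refusal(), which the article does not show.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

# Hypothetical generic fallback lines for characters with none of their own.
GENERIC_DEFLECTIONS = (
    "Let's take this somewhere else. What were you telling me earlier?",
    "Not that, but I'm still here. What else is on your mind?",
)


@dataclass(frozen=True)
class Character:
    name: str
    deflections: tuple[str, ...] = ()  # pre-written, in-voice soft refusals


def in_character_refusal(character: Character, rng: random.Random | None = None) -> str:
    """Pick a canned in-voice deflection; makes no model call."""
    pool = character.deflections or GENERIC_DEFLECTIONS
    return (rng or random).choice(pool)
```

Because the lines are pre-written, the free-tier path adds no latency and no spend beyond the original failed call.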

Effect on our cost graph: free-tier refusal cost dropped to near zero. Paid-tier user-perceived "the bot refused me" rate dropped by about 70%.

Lessons we'd pin to the wall

Refusals are not all-or-nothing. Most "filtered" responses contain usable content before the refusal sentence — salvage before fallback.
Provider safety knobs work, but only on the adjustable categories. BLOCK_NONE doesn't disable the non-negotiables; it just turns off the over-eager middle ground.
Don't apply the knob globally. Moderation and vision calls want the filter on.
Make rescue plan-aware. A 4-call rescue cascade for every free user adds up.
Synthesise an in-character refusal locally when you can't or won't rescue.

The whole pattern is a couple hundred lines of glue (core/llm.py, helpers _maybe_inject_gemini_safety_off, _maybe_prepend_minimax_jb, salvage_partial). The unit-test suite around salvage_partial keeps the regression risk low.

This pattern is in production at HoneyChat, a Telegram-native AI companion bot where a single refusal mid-conversation kills the experience. Canonical version: honeychat.bot/en/blog/llm-content-filter-fallback-rescue-chain.

— HoneyChat Engineering

Sources

Google — Gemini safety settings — the four adjustable harm categories, threshold semantics, what BLOCK_NONE does and doesn't.
OpenRouter — Provider parameters / extra_body — passthrough to provider-specific knobs.
OpenRouter — Model routing & fallback — declarative fallback chain semantics.
Anthropic — stop_reason and finish_reason reference — how providers signal a content-filter stop vs a token-limit stop.
HoneyChat engineering notes: LLM routing per tier on OpenRouter · prompt caching measured.
