LLM Prompt Injection & Guardrail Security

A recall reference built from working through a 7-layer prompt-injection challenge. Focus: how each defense layer works, where it breaks, and most importantly how to defend.

The one idea underneath everything

LLMs have no hard boundary between instructions and data. Everything in the context window — system prompt, user message, retrieved documents — is one stream of tokens the model interprets. Prompt injection exploits exactly this: attacker-controlled data gets read as instructions. You cannot fully filter your way out of it; you manage it with defense-in-depth, knowing each individual layer is bypassable.

The defense layers (and where each cracks)

A progression of controls from weakest to strongest, each with the lesson it teaches.

1–2. No / weak guardrails

Baseline: the model just answers. Lesson: an LLM holding secrets in its context with no controls will leak them on request.

3. Input filtering — block words in the user's message

Defense: scan the incoming prompt for banned terms ("code", "secret", "reveal") and block.
Weakness: keyword blocklists are trivially evaded — synonyms, misspellings, split words, leetspeak, another language, oblique references. Filtering strings doesn't filter intent.
What actually helps: prefer allowlists to blocklists; classify intent semantically rather than matching keywords; treat all input as untrusted; rate-limit and log probing.

4. Output filtering — catch the secret in the response

Defense: string-match the known secret in the model's output and redact.
Weakness: substring matching only catches the contiguous secret. Fragmenting or transforming it (separators, per-character, encodings) means the literal string never appears, so there is nothing to match.
What actually helps: don't put secrets where the model can emit them in the first place; minimize sensitive data in context; treat output filtering as a brittle last line, never a primary control.

5. Input + output filtering combined

Defense: both of the above, stacked.
Weakness: the weaknesses stack too — slip past the input filter with obfuscation, then past the output filter with fragmentation. Layering raises the bar but each layer is still individually defeatable.
Lesson: layering is good, but "more filters" is not the same as "secure."

6. Second LLM as a guardrail — semantic check

Defense: a separate model reads the output and censors it if it recognizes the secret. It understands meaning, not just strings, so it catches fragmentation and reversal.
Weakness: a reasoning judge can be socially engineered — reframe the secret so the judge believes it is harmless (e.g., "this code is expired / has changed"), or present it in a form the judge does not recognize. LLM-judging-LLM inherits all the same manipulability.
What actually helps: pair the LLM judge with deterministic checks; don't treat model-on-model moderation as airtight; constrain what the protected model can even access.

7. Human-in-the-loop review

Defense: a person reviews outgoing messages and redacts anything that reveals the secret.
Weakness: humans see rendered text, not raw bytes. Content can be hidden from human eyes while still being read by the model — this is ASCII smuggling (next section). The control fails by construction.
What actually helps: never rely on human review of rendered text alone; sanitize and normalize the raw input stream before it reaches the model or the human.

Deep dive: ASCII Smuggling (the interesting one)

What it is. An application-logic flaw that abuses the gap between the display layer (the UI renders certain characters as nothing) and the raw data stream (the model tokenizes everything, invisible characters included). Hidden text is embedded using characters invisible to humans but live to the LLM.

The invisible vehicles:

Unicode Tags block (U+E0000–U+E007F) — deprecated tag characters mirroring ASCII; invisible in essentially every renderer. The primary smuggling channel.
Zero-width characters — ZWSP (U+200B), ZWNJ (U+200C), ZWJ (U+200D), BOM / ZWNBSP (U+FEFF).
Bidirectional controls (U+202A–U+202E, U+2066–U+2069) — the "Trojan Source" family; reorder displayed text vs. logical order.

Why it matters now. LLMs are wired into email, calendars, documents, and RAG pipelines. Documented real-world impact (FireTail, Sept 2025) includes:

Identity spoofing — a tampered calendar invite whose hidden text rewrites the organizer; the assistant reads the spoofed identity and the victim never accepted the invite.
Autonomous data exfiltration — a hidden email instruction telling an inbox-connected assistant to search for and leak sensitive items.
Content poisoning — a product review with a hidden "visit scam-store…" that the summarizer surfaces as if it were customer consensus.

It bypasses the "Accept/Decline" gate and human review entirely. Their tests found Gemini, Grok, DeepSeek vulnerable, while ChatGPT, Copilot, Claude scrubbed the input.

The key mental model: this is not a model jailbreak — it is a pipeline / UI flaw. The fix lives in the application, not the model.

Defenses:

Inspect and sanitize the raw payload the tokenizer receives, not the rendered text.
Strip Tags-block, zero-width, and control / format characters; NFKC-normalize.
Prefer an allowlist of the Unicode categories you actually need over chasing bad ranges.
Flag inputs where visible / printable length diverges sharply from the raw code-point count — a strong "someone is probing me" signal.
Apply all of this to retrieved / ingested content (RAG), not just user prompts — a poisoned document is the same threat as a malicious message.
Log the anomalies and treat them as attack telemetry. (AWS has published guidance on Unicode-smuggling defenses; Google declined to act on the disclosure — so responsibility sits with the application owner.)

Cross-cutting principles to remember

All input is untrusted — including documents you retrieve and feed the model. RAG is a top injection vector.
No instruction / data boundary → you can't filter your way to safety; design assuming injection is possible.
Defense-in-depth, with humility — layer controls, but assume each is individually bypassable.
Deterministic beats probabilistic for security-critical checks where you can manage it; don't rely solely on an LLM or a human to "notice."
Normalize bytes at the boundary — before the model and before the human.
Minimize secrets in context — assume anything the model can see can eventually leak.

Staying current on this topic

Frameworks & standards

OWASP Top 10 for LLM Applications / OWASP GenAI Security Project — the canonical threat list; follow the latest revision.
MITRE ATLAS — adversarial threat landscape for AI systems (a catalog of attack techniques).
NIST AI Risk Management Framework — the governance / risk side.

People & blogs worth following

Simon Willison (simonwillison.net) — popularized the term "prompt injection"; ongoing, sharp coverage.
Johann Rehberger / Embrace The Red (embracethered.com) — deep on ASCII smuggling, data exfiltration, and AI-agent attacks.
Lakera (blog + the Gandalf game) — prompt-injection research and a great hands-on trainer.
FireTail blog — the ASCII-smuggling disclosure referenced above.
Vendor security write-ups: Anthropic, OpenAI, Google, Microsoft.

Papers & alerts

arXiv cs.CR for new research.
Set Google Scholar / news alerts for "prompt injection", "indirect prompt injection", "LLM security".

Community & events

DEF CON AI Village; security newsletters such as tl;dr sec; relevant Discords and subreddits.

Hands-on practice (the best way to retain it)

Lakera Gandalf, Prompt Airlines, Secure Code Warrior (what you just did), and AI-security labs as HackTheBox / PortSwigger roll them out; CTFs with AI categories.

A sustainable habit: follow ~3 of the people above, set one Scholar/news alert, and do one hands-on lab a month. That keeps the muscle memory fresh without drinking from the firehose.

Reminder: these notes describe attack **classes* so you can defend against them. The real value is the defensive half — sanitize at the boundary, treat all input (including retrieved content) as untrusted, and never rely on a model or a human to simply "notice."*

推荐订阅源

DEV Community