惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Help Net Security
T
ThreatConnect
SecWiki News
SecWiki News
F
Future of Privacy Forum
AWS News Blog
AWS News Blog
C
Cisco Blogs
A
Arctic Wolf
Vercel News
Vercel News
The GitHub Blog
The GitHub Blog
Scott Helme
Scott Helme
V
V2EX
博客园 - 叶小钗
阮一峰的网络日志
阮一峰的网络日志
K
Kaspersky official blog
G
Google Developers Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
P
Privacy International News Feed
C
Cyber Attacks, Cyber Crime and Cyber Security
N
News | PayPal Newsroom
Schneier on Security
Schneier on Security
NISL@THU
NISL@THU
Microsoft Azure Blog
Microsoft Azure Blog
量子位
The Hacker News
The Hacker News
Stack Overflow Blog
Stack Overflow Blog
Security Latest
Security Latest
M
Microsoft Research Blog - Microsoft Research
Google Online Security Blog
Google Online Security Blog
博客园_首页
C
CXSECURITY Database RSS Feed - CXSecurity.com
I
InfoQ
Google DeepMind News
Google DeepMind News
Y
Y Combinator Blog
The Cloudflare Blog
Microsoft Security Blog
Microsoft Security Blog
Martin Fowler
Martin Fowler
Cisco Talos Blog
Cisco Talos Blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
Troy Hunt's Blog
F
Fox-IT International blog
S
Security @ Cisco Blogs
博客园 - 司徒正美
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
C
Comments on: Blog
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
L
LINUX DO - 最新话题
GbyAI
GbyAI
Project Zero
Project Zero
腾讯CDC
T
Tailwind CSS Blog

DEV Community

Deep Dive: React Server Components in TanStack Start Migrating off Google Analytics: Umami vs Plausible vs Fathom Building a Portfolio That Actually Demonstrates Software Engineering Async/Await in JavaScript: From Callbacks to Clean Code (2026) ShareBox v5 — GPU transcoding, Netflix-style grid, and why I don't need Plex anymore TOML Schema is live Handling Duplicate Shopify Webhook Events (And Why You Must) Original Kubernetes Dashboard — retired upstream, upgraded to Angular 21. لماذا أسست ترينافو للتجار العرب الذين تتجاهلهم المنصات الغربية Construyendo un recomendador de películas en Python: de los datos al modelo When APIs Lie: A Lesson in Defensive Debugging Pope Leo XIV's AI Encyclical: What Builders Must Know (2026) Donna v0.3.0 HTB — MonitorsFour | Writeup The Free Tool You Trust Is the One You Should Fear the Most HTB — MonitorsFour | Writeup Fr 97. Embeddings and Vector Search: Semantic Search That Works Deep Dive: Building "Gravity Paint" - A Tactile Physics Instrument with React, Matter.js, and p5.js ABAP Unit Testing with Test Doubles and Mocking Frameworks: A Senior Architects Guide to Isolating Dependencies in SAP S/4HANA LeetCode Solution: 5. Longest Palindromic Substring kovax-react 0.8: Tailwind v4 preset, FormField adapters, ColorModeScript, and Storybook I built an AI résumé tool that refuses to lie about your experience The hat Azure Entra ID User & Role Management — Step-by-Step Practical Guide With A Simple Excercise The AI-Native Company: How a Single Founder Can Build Global Organizations Powered by AWS and an Ecosystem of Artificial Intelligences Building a Lightweight Remote MCP Knowledge Base on Cloudflare Workers Why I built Trinavo for the MENA merchants Western platforms ignore The N+1 Query That Killed Our Database, And How I Fixed It Docstrings vs Markdown Docs: What Should Developers Actually Write? Training Data Provenance: The Manifest Diff That Explains the Hash Add SVGIcons MCP to Claude Code and Find SVG Icons from Your Terminal 3 CLI Tools You Can Buy with Crypto — No KYC, No Subscriptions COSS Weekly: OpenClaw competitor NanoClaw Raises $12M, Dust Raises $40M, Sonar Acquires Gitar, and more How to know if you actually need mobile proxies (without buying any) Building Cursor for Community: A Buildathon Built on Time Pressure How we built a PII masking layer for LLM APIs — local detection, reversible tokens, one line to integrate Why MLFQ Was Way Ahead of Its Time Add Runtime Limits to Claude Agent Workflows I Built a Prompt Injection Detector with 98% Recall on Unseen Attacks. Here's Why Data Beat Architecture. 8 Vite Config Options Every Developer Should Know (Vite 8) Feature Flags That Forgot to Leave Why Trust Infrastructure Is Becoming the Hidden Layer of Donation Platforms XyPriss: Rethinking Core Performance and Zero-Trust Architecture in Modern Backends Designing Configuration for Scalable Treasure Hunts SSH Login Delays: The 10-Second Wait That Drives Us Crazy Building Production Multi-Agent Workflows in n8n: What 50 Deployments Taught Us A 3-layer memory system that gives Claude Code persistent context across sessions. Trishul SNMP Suite 2.0.1: Better MIBs, Traps, and SNMP Labs How I built a production AI SaaS as a solo developer Auto-labelling 1.2M robotics frames with VLMs: a failover story India’s Laws Were Not Built for AI — And Courts Are Filling the Gap skill-insp: A Skill That Scores Other Skills Clprolf Minimalist Messaging in the Age of AI What's actually in a good .cursorrules file? I built 10 of them — here's what I learned Building Strong Python Basics – Loops, Functions and Logic How to Choose the Right Tech Stack for Your Project I built a free multi-tab JSON editor — here's what I learned HTTP Headers Every Developer Should Know (2026) Building Cross-Platform Digital Products: Challenges and Best Practices Data Privacy in the Age of AI: How Product Teams Can Build Trust with Users What Would WordPress Look Like If It Were Designed Today? Why Backup Success Does Not Mean Database Recoverability Local AI Office Assistant That Never Sends Your Documents to the Cloud Building TaskForge: Translating Enterprise Chaos into an Open-Source Scheduler Tesla P40 in a Homelab: 24GB of Inference on a Budget Llama 4: Meta's Latest — Scout, Maverick, and the MoE Revolution George Hotz called AI code 'slop.' He's half right. Como Construir um Fluxo de Trabalho Baseado em Engenharia de Prompt e Automação We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage. The Hidden Cost of Downtime: How SRE Error Budgets Protect National Economic Infrastructure Getting started with openHUMANS can be an exciting venture for developers looking to create innovative applications in the realm of human-ce Stack Overflow: A Powerful Community for Developers and Learners From Language Models to Humanoid Minds ✨ Road to Senior #2: How Computers Think in Numbers Why LLM debugging fails on fragmented repository context How to Deploy a LangGraph Agent on AWS Bedrock AgentCore An outreach kit for solo founders whose drafts can't hallucinate Open Satchel is live Amy Kwalwasser and the Growing Importance of Quantum Risk Modeling I Built ShellReq - A Native API Client for VS Code & Terminal If Microsoft and Uber can't afford AI coding, what chance do the rest of us have? MADCAP: Building a Multi-Agent Debate CLI That Argues With Itself So You Don't Have To Why most AI fails at IDOR (and how AMAS fixes it with causal reasoning) How to Audit a Laravel Codebase You've Inherited LangGraph 워크플로우 템플릿 (v34) BugBench: a developer origin story and practical guide for VS Code / Kiro users A solution to messy token systems for Next.js A NestJS reference app that proves the nest-native stack under realistic backend pressure Observability for AI Systems: Monitoring Drift, Hallucinations, and Reliability in Production I Thought “Data Analyst” Was the Whole Game… Then I Entered the Data Avengers Office 👀 Create and configure network security groups How to analyze the cost of Kafka? How I Shipped 2,500+ Commits With AI Agents Using a 12-Phase Workflow [Boost] We built MDCMS, a Markdown-first CMS for teams using AI agents Zero Heap Allocations at 1.18 GB/s: Deep Dive into ForgeZero 4.0.x The Minimum Viable Test Suite for Working with Agents Why Perplexity Started Citing My Blog: 5 Changes That Actually Worked Sync Supabase via OAuth: No Connection String Needed
Benchmarking LLM Structured Outputs
David Moores · 2026-05-26 · via DEV Community

Cross-posted from carrick.tools.

When you read the API documentation for OpenAI, Anthropic, or Google Gemini, the feature called "structured outputs" looks like a solved problem: pass a JSON schema, get back JSON that conforms to it.

In production, it is not a contract. It is a well-typed, best-effort suggestion.

At Carrick, the code-analysis scanner I work on, our post-LLM pipeline relies on a four-stage fallback parser. We attempt a direct parse, strip markdown fences, scan for array bounds inside surrounding garbage text, and finally apply regex cleanup. If all four fail, we drop the payload and proceed. If structured outputs worked as advertised, this would be a single serde_json::from_str(response).

To isolate why this defensive parsing is necessary, I built a benchmark testing 8 synthetic schemas against six models (the flagship and cheaper tiers from each provider). The schemas isolate one structural stressor each: a flat baseline, a 3-level nested object, a 7-level nested chain, a long enum, a oneOf tagged union, nullable + format fields, a 20-item array, and a closed object with additionalProperties: false. Every response is validated against the original schema using two independent validators (ajv and hyperjump). A response only counts as strict adherence when both agree.

Here is how the implementations actually behave.

At a glance

Of the 8 stressor schemas, here is how many each model handled with full strict adherence on every run, and how many tripped a specific failure mode:

Horizontal bar chart showing each model's outcome distribution across 8 schemas. OpenAI gpt-5.5 and gpt-5.4-mini both pass 2 schemas and pre-reject 6. Anthropic Opus 4.7 passes 7 schemas and partially-fails one (S3 deep nesting, 65 percent strict). Sonnet 4.6 passes 7 schemas and silently fails one (S3, 0 percent strict). Gemini Pro 3.1 and Flash 3.5 both pass 6 schemas and pre-reject 2.

Three patterns emerge. OpenAI rejects most schemas at submit time and then conforms perfectly on what is left. Anthropic accepts every schema but silently corrupts one specific structure. Gemini rejects a narrow set of features and conforms perfectly on the rest. Each pattern is the symmetric mirror of the others.

1. Anthropic accepts complex schemas, then silently returns the wrong shape

Anthropic's tool-use API is the most permissive of the three. It accepts almost any standard JSON schema as the input_schema for a tool, and on 7 of the 8 schemas in this bench, both Claude Sonnet 4.6 and Claude Opus 4.7 produce strict-conforming output 100% of the time. The failure mode is concentrated on one schema: a 7-level nested object chain (S3).

On S3 at n=20 runs per model:

  • Claude Sonnet 4.6: 20 of 20 runs silent-failed. Strict adherence 0% (95% CI: 0%–16.1%).
  • Claude Opus 4.7: 7 of 20 runs silent-failed. Strict adherence 65% (95% CI: 43.2%–82.3%).

Grouped bar chart showing strict adherence rate by schema depth for Claude Sonnet 4.6 and Claude Opus 4.7. Both models achieve 100 percent on the flat baseline (S1) and the 3-level schema (S2). On the 7-level schema (S3), Sonnet drops to 0 percent and Opus drops to 65 percent.

The failure mode is unusual. Instead of returning a 7-level nested object, the model emits the entire nested structure as a single JSON-encoded string assigned to the root level1 field. Here is one of the Opus failures verbatim:

{"level1":"{\"name\":\"system\",\"child\":{\"name\":\"ingest_pipeline\",
\"child\":{\"name\":\"batch_24a17\",\"child\":{\"name\":\"parse_stage\",
\"child\":{\"name\":\"error_handling\",\"child\":{\"name\":\"dlq_promotion\",
\"leaf\":{\"value\":\"2 rows failed JSON parsing and were promoted to dlq
.ingest.parse-errors; weekly cleanup later inspected 412 items, removed
312, returned 100 for reprocessing\",\"kind\":\"outcome_summary\",
\"count\":2}}}}}}}}"}

Enter fullscreen mode Exit fullscreen mode

The schema declares level1 as type: object. The model returned type: string containing a JSON serialisation of what the object should have been. ajv's diagnostic:

/level1 must be object {"type":"object"}

Enter fullscreen mode Exit fullscreen mode

This is the most dangerous failure mode in the benchmark because:

  • The transport layer says success. The API returns HTTP 200 with no error field and no refusal signal.
  • The SDK does not validate. The Anthropic client passes tool_use.input back to your application without checking whether it conforms to the input_schema you sent.
  • The output parses cleanly. JSON.parse(response) succeeds, returning { level1: "{\"name\": ..." }. Only an explicit schema validator catches the type drift.

The mechanism is consistent across all 27 silent failures in the dataset (20 Sonnet plus 7 Opus): the model wraps the entire nested payload in a single string value. Run-to-run variance is in where the string boundary sits, not in whether the wrapping happens.

2. OpenAI enforces adherence by rejecting standard schemas

OpenAI's strict: true mode is the symmetric mirror of Anthropic. Where it accepts a schema, it produces strict-conforming output. Where the schema does not meet strict mode's narrow dialect, the request never reaches the model.

Of the 8 bench schemas, only 2 pass OpenAI's strict-mode rules (S1 baseline, which I deliberately shaped to be strict-compliant, and S8 closed object). The other 6 are rejected before the call is sent.

OpenAI strict mode requires:

  • Every object must explicitly declare additionalProperties: false.
  • Every property must be listed in the required array.
  • Type-arrays (e.g., type: ["string", "null"]) and oneOf unions are unsupported.

The bench performs the same schema validation OpenAI's API would perform, locally, before submission. A representative rejection (for the 7-level schema):

OpenAI strict mode violations:
  $: object missing additionalProperties: false;
  $.level1: object missing additionalProperties: false;
  $.level1.child: object missing additionalProperties: false

Enter fullscreen mode Exit fullscreen mode

The rejection rate is identical between gpt-5.4-mini and gpt-5.5. The check runs server-side at the schema-submission layer before any model is invoked, so flagship intelligence does not change the outcome.

If you pull a schema from an OpenAPI spec or package.json, it will likely fail. Your options are to rewrite the schema to the strict dialect, or disable strict mode and inherit Anthropic's silent-failure problem.

3. Gemini is the rigid middle ground

Gemini's schema validator rejects modern JSON Schema features that OpenAI strict also bans (oneOf, type-arrays, $ref) but accepts the looser shapes OpenAI strict refuses. On the 6 of 8 bench schemas that clear Gemini's pre-flight, both Gemini Pro 3.1 and Gemini Flash 3.5 maintain 100% strict adherence at n=5 each (Wilson 95% CI for 5/5: 56.6%–100%; tight enough across 6 schemas to support the pattern).

The two rejected schemas are S5 (uses oneOf) and S6 (uses type: ["string", "null"] plus format: date-time). Gemini surfaces the rejection at submission time with a clear error naming the unsupported feature.

Notably, Gemini handled the same 7-level deeply nested schema that destroyed Anthropic at 100% strict adherence on every run. Where Gemini accepts a schema, it conforms.

The outcome matrix

The full pilot, condensed to one grid. S3 and S7 ran at n=20 for Anthropic; all other cells ran at n=5.

Heatmap of 8 schemas across 6 models. OpenAI columns are dominated by amber pre-call rejections except S1 and S8. Anthropic columns are mostly green with red silent failure on the deep nesting row (S3, 0 percent on Sonnet, 65 percent on Opus). Gemini columns are mostly green except S5 and S6 which are amber pre-call rejections.

Defensive implementation patterns

The provider feature called "structured output" cannot be trusted as an application boundary. To handle the realities of the current APIs, your pipeline needs explicit guardrails. Here is the implementation priority:

  1. Run an independent validation step. An HTTP 200 from the provider means nothing. Validate every single response payload against your schema using ajv, hyperjump, or a custom walker in your own codebase before passing the data to your application logic.
  2. Redefine success criteria. Treat a standard parse error, a schema violation, and a refusal as equal failure modes. Trigger the same retry/fallback logic for all of them.
  3. Flatten Anthropic schemas. Deep nesting triggers silent corruption in Claude, including at the flagship tier. Flatten structures into top-level arrays of sibling objects wherever possible. If a schema exceeds three or four levels of depth, consider refactoring it.
  4. Compile schemas to the OpenAI dialect. If you are targeting OpenAI strict mode, author your schemas from the start with additionalProperties: false propagated to every sub-level and no optional fields.
  5. Strip unions for Gemini. Avoid oneOf and ["string", "null"]. Use anyOf for unions and rely on a single nullable type constraint.

What this bench does and does not measure

Three caveats worth surfacing explicitly:

OpenAI rejection is bench-side, server-rule-mirrored. The 6 of 8 schemas reported as rejected by OpenAI are rejected by a pre-flight validator inside the bench that implements the documented strict-mode rules (additionalProperties: false, every property required, no type-arrays, no oneOf). I did not separately submit each schema to the OpenAI API and observe the server's 400 response, so the rejection rate reported here is the rate at which OpenAI's documented strict-mode rules disqualify normal JSON Schema, not the rate at which OpenAI's server returns an error. If OpenAI relaxed strict mode tomorrow, the bench would not notice.

Gemini schemas are normalised before submission. Gemini's structured-output API supports a narrower keyword set than OpenAPI / draft-2020-12 JSON Schema. The bench's convertSchemaToGemini function passes through the keywords Gemini's docs list as supported (type, enum, format, min/max, required, properties, items) and drops the rest before submission. The validator still checks Gemini's output against the original schema, so any constraint the converter drops is implicitly given a free pass on the Gemini side. For the current corpus this only affects S5 and S6 (already rejected at pre-flight), but it would matter for any future schema relying on const, pattern, or additionalProperties as a real constraint.

Sample sizes are uneven. The two cells the article quotes specifically (Anthropic Sonnet and Opus on S3 deep nesting) ran at n=20 each. The S7 long-array cells also ran at n=20 after an initial pilot revealed the Anthropic adapter was hard-capped at max_tokens: 4096, which was inflating the truncation rate; raising the cap to 8192 brought both Anthropic tiers to 100% strict adherence on S7. Everywhere else the bench ran at n=5 per cell, which is enough to see the dominant outcome but not enough to claim sharp rates.

Methodology, raw JSONL, schemas, and reproducible scripts are available at carrick-llm-structured-bench. The full re-run that backs the figures above cost roughly $8 in API credits and took about an hour of wall time.