AI-generated accessibility, an update — frontier models still fail, but skills change the game
Michael Fairchild · 2026-05-21 · via DEV Community

A few months ago I shared early results from the A11y LLM Eval project, a benchmark that measures how accessibly LLMs generate UI code. The previous post showed that LLMs default to inaccessible code, explicit accessibility instructions can dramatically change that, and manual testing is still essential.

The latest report is out, with new models, a redesigned test scope, and a brand new mechanic: skills. Two things stand out:

  1. The newest frontier models (GPT‑5.5, Claude Opus 4.7, Gemini 3.1 Pro Preview, Claude Haiku 4.5, and others) still fail accessibility checks by default.
  2. A well-written skill can produce the highest pass rates we've measured. Skills can even let a weak-baseline model outperform the leaders, though they can cost more tokens to run.

Screenshot of the overview section of the report

Note that the pass rate reflects only this harness's automated checks (a curated set of axe-core WCAG rules plus hand-written assertions per test case). Automated testing can detect only a subset of accessibility issues: 100% here means the sample passed every check that was run, not that the page is WCAG conformant or fully accessible.
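To make that definition concrete, here is a minimal sketch of how a pass rate of this kind could be computed. The `SampleResult` shape and its field names are assumptions for illustration, not the harness's actual data model.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Outcome of one generated page under the harness (hypothetical shape)."""
    axe_violations: int     # number of failed axe-core WCAG rules
    failed_assertions: int  # number of failed hand-written assertions


def sample_passes(result: SampleResult) -> bool:
    # A sample passes only if every automated check that ran came back clean.
    return result.axe_violations == 0 and result.failed_assertions == 0


def pass_rate(results: list[SampleResult]) -> float:
    """Share of samples that passed every check, as a percentage."""
    if not results:
        raise ValueError("no samples to score")
    passed = sum(sample_passes(r) for r in results)
    return 100.0 * passed / len(results)


# Four made-up samples: two clean, one with axe failures, one with a failed assertion.
samples = [
    SampleResult(0, 0),
    SampleResult(3, 0),
    SampleResult(0, 1),
    SampleResult(0, 0),
]
print(pass_rate(samples))  # 50.0
```

Under this definition a single failed rule or assertion fails the whole sample, which is why a 100% score only says that nothing the harness looked for was found.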

TL;DR

  • Default ("control") accessibility is still bad: average pass rate is 12%, with GPT‑5.4 Mini leading at 25%. Newer does not mean more accessible.
  • Custom instructions still pay off. The Basic instruction set lifts the average pass rate by 48.5pp, to 60%.
  • Skills go further. The Building Accessible UI skill, run as a two-turn Generate then Review workflow, hits 86% pass rate (+74.6pp). The top performer is Gemini 3.1 Pro Preview, a model that scored only 8% on control.
  • The skill's review turn costs about 5.5× the input tokens of control. Quality is not free.

What's new in this report

  • Results are now fully agentic. The previous report called LLM APIs directly with a single prompt and response. This report drives each evaluation through the GitHub Copilot SDK as an actual agent, with tool use, multi-turn reasoning, and the same instruction and skill loading mechanics that Copilot agents use in production. The numbers below reflect how these models behave when wrapped in an agent loop, not in a one-shot API call. It also means the new "skills" variant is only possible because we're running through an agent runtime.
  • 8 models evaluated across 32 prompt cases (1,280 control samples). The model lineup is mostly fresh: GPT‑5.4, GPT‑5.4 Mini, GPT‑5.5, Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro Preview, and Gemini 3 Flash Preview.
  • Instruction sets tracked are now Basic and Minimal. The Detailed expert-level instruction set from the previous run isn't in this report.
  • Skills, reusable task-specific guidance packages, were added as a new mechanic. The first skill evaluated is Building Accessible UI. More on this below.
  • A new variant token and pass-rate snapshot lets you compare quality against token cost across control, instructions, and skill turns.
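The sample counts above can be sanity-checked with a little arithmetic. This sketch assumes the control samples are spread evenly over every model and prompt case, which the post does not state explicitly.

```python
MODELS = 8
PROMPT_CASES = 32
CONTROL_SAMPLES = 1_280

# Every model runs every prompt case, so there are MODELS * PROMPT_CASES pairs.
pairs = MODELS * PROMPT_CASES

# If samples are spread evenly (an assumption), each pair gets the same count.
samples_per_pair, remainder = divmod(CONTROL_SAMPLES, pairs)

print(pairs, samples_per_pair, remainder)  # 256 5 0
```

An even split works out to five control samples per model and prompt case, with none left over.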

Control: frontier models, same accessibility problem

The headline finding from the previous post was that LLMs default to producing inaccessible code. With the latest models in hand, that hasn't changed.

Screenshot of the control overview

A few observations:

  • GPT still leads on control, but the top score is well below the 41% GPT‑5.2 posted previously. The prompt set has changed and the assertions are stricter, so the numbers aren't directly comparable, but the conclusion is the same: nobody is shipping accessible code by default.
  • Claude Haiku 4.5 sits at 3%, with an average of nearly 6 WCAG failures per sample. Sonnet 4.6 and Gemini 3 Flash Preview aren't far behind.
  • The hardest test case, Shopping Home Page (React, Dark theme), produced a 0% pass rate with 15.55 average WCAG failures across all models. Component density compounds the problem fast.

The training-data hypothesis from the previous post still seems to fit. The open web is overwhelmingly inaccessible, so models trained on it inherit those patterns regardless of how capable they are at code in general.

Instruction sets: still the cheapest win

Custom instructions are still the fastest thing a team can ship to improve accessibility.

Screenshot of the instruction set table

The Basic instruction file produces nearly 5× the control pass rate while only increasing average input tokens by roughly 50%. Even the one-line Minimal instruction ("All output MUST be accessible.") more than triples it.

If you do nothing else, ship an instruction file. The basic instructions are a good starting point to customize for your team's stack and design system.

Skills: the new mechanic

The biggest change in this report is the introduction of skills. Where instruction sets are always-on guidance loaded into the agent's context for every task, a skill is a reusable, task-specific package that bundles guidance, examples, supporting files, scripts, and a tool-use workflow. The agent loads the skill only when it's relevant, and within a skill, only loads the slice it needs for the task at hand.

This matters because it changes two things at once: what guidance the model sees and when it sees it. Skills can carry far more detail than an instruction file without flooding the context window, and the two-turn Generate then Review pattern gives the model a structured second look at its own output before it's done. Together that's why skills outperform instructions in this report.
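The difference between always-on instructions and on-demand skill slices can be sketched as a small context-assembly function. Everything here is hypothetical: the slice names, the guidance text, and the keyword matching are stand-ins for illustration, not the actual skill content or the agent runtime's loading logic.

```python
# Always-on instruction text (the Minimal instruction quoted in this post).
INSTRUCTIONS = "All output MUST be accessible."

# Hypothetical skill: guidance split into per-component slices.
SKILL_SLICES = {
    "dialog": "Dialogs: move focus into the dialog on open and restore it on close.",
    "form": "Forms: give every input a programmatically associated label.",
    "table": "Tables: mark header cells and associate them with data cells.",
}


def build_context(task: str, use_skill: bool) -> list[str]:
    """Assemble the guidance the model would see for one task."""
    context = [INSTRUCTIONS]  # instruction sets are loaded for every task
    if use_skill:
        lowered = task.lower()
        # Load only the slices whose component name appears in the task.
        # Keyword matching is a simplification of real relevance detection.
        context += [text for name, text in SKILL_SLICES.items() if name in lowered]
    return context


print(build_context("Build a signup form", use_skill=True))
print(build_context("Refactor the billing service", use_skill=True))
```

The first call picks up only the form slice, and the second picks up no skill guidance at all, so the skill can hold a lot of detail without every task paying for it.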

The first skill evaluated is Building Accessible UI. It is purpose-built to:

  • Activate when the agent is creating UI, so generated code is more accessible by default.
  • Contain expert-level guidance and checklists for many components and patterns.
  • Only pull the guidance for the specific component or pattern being built, to limit token impact on the context window.
  • Run a two-turn workflow: Generate the UI, then Review it against the skill's checklist and fix issues.

Variant                                     Pass rate   Delta vs. control
Building Accessible UI, Generate (turn 1)   82%         +70.4pp
Building Accessible UI, Review (turn 2)     86%         +74.6pp
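
The two-turn Generate then Review pattern can be sketched as follows. This is an illustration under assumptions, not the skill's implementation: `Agent`, `generate_then_review`, `fake_agent`, and the prompt wording are all hypothetical, and a real run would call a model through an agent runtime instead of a stub.

```python
from typing import Callable

# An "agent" here is any callable that maps a prompt to generated text.
Agent = Callable[[str], str]


def generate_then_review(agent: Agent, task: str, checklist: list[str]) -> str:
    """Run a two-turn workflow: generate UI code, then review and fix it."""
    # Turn 1: generate the UI.
    draft = agent(f"Generate the UI for this task:\n{task}")

    # Turn 2: review the draft against the checklist and return the fixed code.
    items = "\n".join(f"- {item}" for item in checklist)
    review_prompt = (
        "Review the code below against this checklist and fix any issues.\n"
        f"Checklist:\n{items}\n\nCode:\n{draft}"
    )
    return agent(review_prompt)


def fake_agent(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if prompt.startswith("Generate"):
        return "<button><img src='cart.svg'></button>"
    return "<button aria-label='Open cart'><img src='cart.svg' alt=''></button>"


final = generate_then_review(
    fake_agent,
    "Cart button",
    ["Icon-only buttons have an accessible name"],
)
print(final)
```

The second call sees both the checklist and the first draft, which is what gives the model a structured second look at its own output.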

Screenshot of skill table

A few observations:

  • The best model under the skill is Gemini 3.1 Pro Preview, the same model that scored 8% on control. With the right scaffolding, a weak baseline can outperform the leaders.
  • The review turn pulls its weight. Asking the agent to self-check against the skill's checklist adds about 4pp on top of an already strong first turn (82% to 86%), and it is closer to how a human accessibility reviewer actually works.
  • Skills don't dominate the prompt. Only 14 to 18% of input tokens come from the skill itself, compared to 100% for instruction sets. Most of the context window is still free for the actual task.

The cost question

Skills win on quality, but they aren't free.

Screenshot of the token impact table

The skill's review turn averages roughly 5.5× the input tokens and 2.7× the API calls of control. At scale, that's a meaningful budget impact.
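
To see what those multipliers mean for a budget, here is a rough estimator. The multipliers come from the report; the workload numbers are made up, and the sketch assumes the review-turn figures represent the total for a skill-assisted task, which the post does not spell out.

```python
INPUT_TOKEN_MULTIPLIER = 5.5  # from the report: review turn vs. control
API_CALL_MULTIPLIER = 2.7     # from the report: review turn vs. control


def skill_overhead(
    control_input_tokens: int, control_api_calls: int, tasks: int
) -> dict[str, float]:
    """Estimate the extra input tokens and API calls when every task uses the skill."""
    extra_tokens = control_input_tokens * (INPUT_TOKEN_MULTIPLIER - 1) * tasks
    extra_calls = control_api_calls * (API_CALL_MULTIPLIER - 1) * tasks
    return {"extra_input_tokens": extra_tokens, "extra_api_calls": extra_calls}


# Hypothetical workload: 10,000 input tokens and 4 API calls per control task,
# across 1,000 tasks.
print(skill_overhead(10_000, 4, 1_000))
```

For this made-up workload the skill adds about 45 million input tokens and roughly 6,800 API calls over control. Multiply by your provider's pricing to turn that into a budget line.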

A practical split:

  • Instruction sets are broad, always-on guardrails. They are cheap, simple to ship, and offer the best ratio of accessibility improvement to tokens spent. The downside is that their token cost can add up quickly if your project carries instructions for multiple domains, such as accessibility, security, and content. Use these as the default for any team, but keep them short.
  • Skills are focused, procedural guidance for higher-stakes work, or when you have the budget.

Recommendations

The advice from the previous post still holds, with one new lever:

  1. Ship a project-tailored instruction file today. Start from the basic instructions and customize for your stack, design system, and component library.
  2. Add a skill for high-stakes UI work if your token budget allows. The two-turn Generate then Review pattern materially improves outcomes.
  3. Bake automated accessibility checks into CI/CD and block PRs on regressions.
  4. Keep manual testing by humans, including people with disabilities. None of these tools (automated checks, instructions, or skills) can cover all accessibility requirements.
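
For the CI step, one possible shape is a gate that compares the current axe-core results against a stored baseline and fails the build when violations grow. This is a sketch, not a drop-in tool: it assumes results in axe-core's JSON format (a `violations` array whose entries have an `id` and a `nodes` list), and the function names and the baseline approach are illustrative choices.

```python
import json


def violation_counts(axe_results: dict) -> dict[str, int]:
    """Map each violated axe rule id to the number of affected nodes."""
    return {v["id"]: len(v["nodes"]) for v in axe_results.get("violations", [])}


def regressions(baseline: dict, current: dict) -> dict[str, int]:
    """Return rules whose affected-node count grew relative to the baseline."""
    before = violation_counts(baseline)
    after = violation_counts(current)
    return {
        rule: count - before.get(rule, 0)
        for rule, count in after.items()
        if count > before.get(rule, 0)
    }


def gate(baseline: dict, current: dict) -> int:
    """Exit code for CI: 0 when nothing regressed, 1 otherwise."""
    found = regressions(baseline, current)
    for rule, extra in sorted(found.items()):
        print(f"regression: {rule} (+{extra} nodes)")
    return 1 if found else 0


# Made-up results: the current run introduces a new image-alt violation.
baseline = json.loads(
    '{"violations": [{"id": "color-contrast", "nodes": [{}, {}]}]}'
)
current = json.loads(
    '{"violations": [{"id": "color-contrast", "nodes": [{}, {}]},'
    ' {"id": "image-alt", "nodes": [{}]}]}'
)
exit_code = gate(baseline, current)
```

In a pipeline, passing the return value to `sys.exit` would block the PR. Comparing against a baseline lets a team with existing violations stop new ones first, then ratchet the baseline down over time.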

Closing

The trajectory hasn't changed. AI keeps scaling how much UI we ship, and the open-web training data keeps scaling inaccessible patterns alongside it. Frontier models alone aren't going to fix this; the latest results from Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT‑5.5 make that clear.

The toolbox has changed, though. Instructions still pay off in minutes. Skills are a new lever, and a powerful one: capable of taking an 8% baseline model to 86% on these checks. The work now is to pick the right tool for the right task, enforce it in CI, and keep humans in the loop where it matters most.

See the full report and the a11y-llm-eval repository on GitHub.