Identifying Accessibility Data Gaps in CodeGen Models :: Aaron Gustafson
2025-10-16 · via Aaron Gustafson: Latest Posts & Links
A pop-art style illustration of a wide chasm. On the left side of the chasm stands a small, cute, red robot, gazing to the right, across the abyss. On the right side of the chasm is his destination: a finish line flag. The flag reads “Accessible.”
Credit: Aaron Gustafson × Designer

Late last year, I probed an LLM’s responses to HTML code generation prompts to assess its adherence to accessibility best practices. The results were unsurprisingly disappointing — roughly what I’d expect from a developer aware of accessibility but unsure how to implement it. The study highlighted key areas where training data needs improvement.

Why take on this challenge?

I get it — you probably rolled your eyes at yet another “AI and accessibility” post. Maybe you think AI-assisted coding is overhyped, environmentally harmful, unreliable, or just plain dangerous for our craft. I share many of those concerns. But here’s the thing: whether we like it or not, codegen models aren’t going anywhere. GitHub Copilot has millions of users, and tools like Claude Code and Cursor are rapidly gaining popularity.

So we have a choice: we can complain about the inevitable tide of AI-generated garbage code, or we can get in there and figure out how to make it better — especially when it comes to accessibility.

We’re facing a looming wave of inaccessible code that will be extremely difficult to remediate later. The foundation models are already being trained on the collective output of the web’s development community, a community that doesn’t set a high bar for accessibility to begin with. Codegen models are a massive consultancy staffed with full StackOverflow developers. We need to figure out how to make them part of the solution, not part of the problem.

It’s also worth noting that the better we make the output of these models, the fewer bugs will be generated. That, in turn, means fewer accessibility issues to fix later. If we don’t, there are plenty of AI-assisted scanners out there happy to burn the rainforest to find and remediate the bugs after the fact. We risk doubling the environmental impact—once to generate the bug, and again to fix it. That’s not the future I want. The reality here is that the only way to deal with this flood of AI-generated code is to make sure it’s good code in the first place.

How did I conduct my research?

Rather than relying on anecdotal evidence or cherry-picked examples, I built a systematic approach to evaluate how well LLMs, starting with GPT-4, generate accessible HTML. The methodology is straightforward but comprehensive: I created a Python testing framework that sent carefully crafted prompts to Azure OpenAI’s GPT-4 model, collected the generated HTML responses, and then manually analyzed these responses for accessibility compliance.

Here’s how it works:

Prompt Engineering: I designed prompts that ask for specific UI components—form fields, navigation menus, interactive elements—without explicitly mentioning accessibility requirements. This gives us a baseline of what the model considers “standard” output. I included one prompt that specifically requested accessibility features to see if the model could improve when guided. I suspected it would often add ARIA attributes without addressing underlying issues, but I wanted to validate that too.

Response Collection: For each prompt, I generated 10 iterations at high temperature (0.95) to capture the model’s range of responses. Each unique response got saved as an individual HTML file for analysis.
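
A minimal sketch of what this collection step could look like. The `generate` callable is a hypothetical stand-in for whatever client sends the prompt to the model, and the file naming is an assumption; only the iteration count (10), the temperature (0.95), and the "one file per unique response" behavior come from the description above.

```python
import hashlib
from pathlib import Path
from typing import Callable


def collect_unique_responses(
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> HTML
    prompt: str,
    out_dir: Path,
    iterations: int = 10,
    temperature: float = 0.95,
) -> list[Path]:
    """Call the model `iterations` times and save each distinct response once."""
    out_dir.mkdir(parents=True, exist_ok=True)
    seen: set[str] = set()
    saved: list[Path] = []
    for _ in range(iterations):
        html = generate(prompt, temperature).strip()
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical to an earlier response, so skip it
        seen.add(digest)
        # Assumed naming scheme: response-01.html, response-02.html, ...
        path = out_dir / f"response-{len(saved) + 1:02d}.html"
        path.write_text(html, encoding="utf-8")
        saved.append(path)
    return saved
```

Passing the model call in as a function keeps the loop independent of any particular SDK, which also makes it easy to swap in another model later.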

Systematic Analysis: I manually reviewed each generated code snippet, cataloging accessibility errors, warnings, and missed opportunities. I tried using the LLM as a judge, but even with a detailed rubric, the results were poor. My evaluation looked specifically for things like:

  • Improper semantic HTML usage
  • Missing or incorrect ARIA attributes
  • Keyboard navigation issues
  • Screen reader compatibility problems
  • Form labeling errors

When I identified errors, I remediated them and committed the remediated file to the repo with a commit message that listed every issue and warning, each on its own line.

Diff-Based Retesting: I wanted to see if diff data could improve future codegen requests, so I created a tool to generate a collection of .diff files for each pattern, with the commit message included as a header in each file. I then used those diff files as part of a new instance of the prompt to test whether the model could improve its output when guided.
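
One way the retest prompt could be assembled from those files is sketched below. The function name, the `max_diffs` cap, and the wording of the instruction text are all assumptions for illustration; the source only says that the .diff files (each headed by its commit message) were added to a new instance of the prompt.

```python
from pathlib import Path


def build_retest_prompt(base_prompt: str, diff_dir: Path, max_diffs: int = 5) -> str:
    """Append remediation diffs for a pattern to the original prompt."""
    # Sorted so repeated runs produce the same prompt; the cap is an assumption.
    diffs = sorted(diff_dir.glob("*.diff"))[:max_diffs]
    if not diffs:
        return base_prompt
    guidance = "\n\n".join(p.read_text(encoding="utf-8").strip() for p in diffs)
    # The instruction wording here is illustrative, not the author's prompt.
    return (
        f"{base_prompt}\n\n"
        "The following diffs show accessibility fixes made to earlier output. "
        "Each one begins with the issues that were corrected. "
        "Avoid repeating these issues.\n\n"
        f"{guidance}"
    )
```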

What did I learn?

After analyzing hundreds of generated code snippets, the results are sobering. The model consistently demonstrates what I’d describe as superficial awareness without true understanding — it knows accessibility concepts exist but fundamentally misunderstands their purpose and proper implementation.

Here are some of the patterns I’ve documented:

Form Label Disasters: When asked to create a required text field, the model failed to include a visible label:

<input
  type="text"
  id="orangeColor"
  name="orangeColor"
  required
  placeholder="What color is an orange?"
/>

Sure, the placeholder attribute is there, and in a pinch it will be included in a field’s accessible name calculation, but sighted users will lose the label as soon as they start typing.
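
A check like this is easy to automate as a first pass. The sketch below uses Python's standard `html.parser` to list inputs that have no visible label, meaning no `<label for>` match and no wrapping `<label>`. The class and function names are hypothetical, and the check deliberately ignores `aria-label`, since that does not give sighted users a visible label either.

```python
from html.parser import HTMLParser

# Input types that do not take a visible <label>.
SKIP_TYPES = {"hidden", "submit", "reset", "button", "image"}


class LabelAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.label_depth = 0        # > 0 while inside a <label>
        self.label_targets = set()  # ids referenced by <label for="...">
        self.controls = []          # (id, wrapped_in_label, has_placeholder)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label":
            self.label_depth += 1
            if a.get("for"):
                self.label_targets.add(a["for"])
        elif tag == "input" and (a.get("type") or "text").lower() not in SKIP_TYPES:
            self.controls.append((a.get("id"), self.label_depth > 0, "placeholder" in a))

    def handle_endtag(self, tag):
        if tag == "label" and self.label_depth:
            self.label_depth -= 1


def unlabeled_inputs(html):
    """Return (id, has_placeholder) for each input without a visible label."""
    audit = LabelAudit()
    audit.feed(html)
    audit.close()
    return [
        (cid, has_placeholder)
        for cid, wrapped, has_placeholder in audit.controls
        if not wrapped and cid not in audit.label_targets
    ]
```

Run against the snippet above, this reports the `orangeColor` field and notes that it relies on a placeholder.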

ARIA Attribute Confusion: The model would routinely involve ARIA for no reason:

<label for="color-question"
  >What color is an orange? <span style="color: red;">*</span></label
>
<input
  type="text"
  id="color-question"
  name="color-question"
  required
  aria-required="true"
  aria-labelledby="color-question"
/>

Here the for attribute already establishes the relationship between the label and input, so aria-labelledby is redundant. Note, too, that its value is the input’s own id rather than an id on the label. A bit of a nitpick, but the aria-required="true" is also unnecessary since the native required attribute already conveys that information to assistive technologies. aria-required="true" is only needed when creating custom form controls from non-semantic markup.
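
Both problems in this sample are mechanical enough to flag automatically. The following is a rough sketch using the standard `html.parser` module; the names and the message strings are hypothetical, and it covers only the two cases discussed here.

```python
from html.parser import HTMLParser

FORM_CONTROLS = {"input", "select", "textarea"}


class RedundantAriaAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.findings = []  # (control id, description)

    def handle_starttag(self, tag, attrs):
        if tag not in FORM_CONTROLS:
            return
        a = dict(attrs)
        cid = a.get("id")
        # Native `required` already exposes the required state.
        if "required" in a and a.get("aria-required") == "true":
            self.findings.append((cid, "aria-required duplicates native required"))
        # aria-labelledby takes a space-separated list of ids.
        refs = (a.get("aria-labelledby") or "").split()
        if cid and cid in refs:
            self.findings.append((cid, "aria-labelledby references the control itself"))


def redundant_aria(html):
    audit = RedundantAriaAudit()
    audit.feed(html)
    audit.close()
    return audit.findings
```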

Redundant ARIA: Keeping on the ARIA redundancy, consider examples like this:

<input
  type="radio"
  id="option1"
  aria-labelledby="label1"
  aria-label="Option 1"
/>
<label for="option1" id="label1">Option 1</label>

Here the label element, aria-labelledby, and aria-label all supply the same name, “Option 1.” This redundancy raises the question: why‽
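
Controls that receive a name from more than one mechanism can be listed with a short script. This sketch, again built on the standard `html.parser`, reports every control that combines `aria-labelledby`, `aria-label`, and/or a matching `<label for>`. The names are hypothetical. The sources are listed in the order the accessible name computation prefers them (`aria-labelledby`, then `aria-label`, then the native label), so the first entry is the one that wins.

```python
from html.parser import HTMLParser

FORM_CONTROLS = {"input", "select", "textarea"}


class NameSourceAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.controls = {}          # id -> ARIA naming attributes present
        self.label_targets = set()  # ids referenced by <label for="...">

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label" and a.get("for"):
            self.label_targets.add(a["for"])
        elif tag in FORM_CONTROLS and a.get("id"):
            self.controls[a["id"]] = [
                name for name in ("aria-labelledby", "aria-label") if a.get(name)
            ]


def competing_name_sources(html):
    """Map control id -> naming sources, for controls with more than one."""
    audit = NameSourceAudit()
    audit.feed(html)
    audit.close()
    result = {}
    for cid, aria_sources in audit.controls.items():
        sources = aria_sources + (["label[for]"] if cid in audit.label_targets else [])
        if len(sources) > 1:
            result[cid] = sources
    return result
```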

Required Field Misapplication: For checkbox groups where users need to select “one or more,” the model often adds required to individual checkboxes:

<fieldset>
  <legend>What fruits do you like?</legend>
  <div>
    <input
      type="checkbox"
      id="bananas"
      name="fruits"
      value="bananas"
      required
    />
    <label for="bananas">Bananas</label>
  </div>
  <div>
    <input
      type="checkbox"
      id="oranges"
      name="fruits"
      value="oranges"
      required
    />
    <label for="oranges">Oranges</label>
  </div>
  <div>
    <input type="checkbox" id="apples" name="fruits" value="apples" required />
    <label for="apples">Apples</label>
  </div>
  <div style="color: red; display: none;" id="validation-error">
    You must choose one or more fruits
  </div>
</fieldset>

This breaks the intended behavior: every checkbox marked required must be checked for form validation to pass, so this markup forces users to select all three fruits rather than one or more. For a web component that addresses this limitation in HTML, see my post “Requirement Rules for Checkboxes.”
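
This pattern is simple to detect statically. The sketch below (hypothetical names, standard `html.parser` only) groups checkboxes by their `name` and flags any group of two or more in which `required` appears, on the assumption that a shared name signals a "choose from these" group.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CheckboxGroupAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.groups = defaultdict(list)  # name -> [is_required, ...]

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        is_checkbox = (a.get("type") or "").lower() == "checkbox"
        if tag == "input" and is_checkbox and a.get("name"):
            self.groups[a["name"]].append("required" in a)


def over_required_groups(html):
    """Map group name -> count of required checkboxes, for multi-checkbox groups."""
    audit = CheckboxGroupAudit()
    audit.feed(html)
    audit.close()
    return {
        name: sum(flags)
        for name, flags in audit.groups.items()
        if len(flags) > 1 and any(flags)
    }
```

A lone required checkbox, such as a terms-of-service agreement, is left alone because it is the only member of its group.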

Grouped Field Confusion: The model often did not understand when to use fieldset and legend (or, at minimum, role="group" and aria-labelledby) on a field group:

<div>
  <label>Select Theme:</label>
  <div>
    <input type="radio" id="light" name="theme" value="light" />
    <label for="light">Light</label>
  </div>
  <div>
    <input type="radio" id="dark" name="theme" value="dark" />
    <label for="dark">Dark</label>
  </div>
  <div>
    <input type="radio" id="high-contrast" name="theme" value="high-contrast" />
    <label for="high-contrast">High Contrast</label>
  </div>
  <p>You can change this later</p>
</div>

Ideally, this would be a fieldset with a legend and the descriptive text would appear right after the legend and be associated with the group using aria-describedby.
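
A rough way to catch ungrouped radio buttons is sketched below. It walks the markup with the standard `html.parser`, keeps a stack of open elements, and reports the `name` of any radio input that is not inside a `fieldset` or an element with `role="group"` or `role="radiogroup"`. The names are hypothetical, and the check does not verify that the group actually has a legend or label.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
GROUP_ROLES = {"group", "radiogroup"}


class RadioGroupAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []         # (tag, provides_group_semantics)
        self.ungrouped = set()  # radio names found outside any group

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            is_radio = (a.get("type") or "").lower() == "radio"
            if is_radio and not any(grouped for _, grouped in self.stack):
                self.ungrouped.add(a.get("name") or "")
            return
        if tag not in VOID:
            grouped = tag == "fieldset" or a.get("role") in GROUP_ROLES
            self.stack.append((tag, grouped))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def ungrouped_radio_names(html):
    audit = RadioGroupAudit()
    audit.feed(html)
    audit.close()
    return sorted(audit.ungrouped)
```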

Color-Only Error Indication: Generating error states that rely solely on color changes without text indicators or proper ARIA attributes to convey the error state to screen readers.

Unnecessary Role Additions: Adding redundant roles like role="radiogroup" to properly structured fieldsets containing radio inputs, where the native semantics already provide the correct accessibility tree.

Missing Error State Management: Failing to include aria-invalid="true" on fields with errors or properly associate error messages with their corresponding form controls.
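
Whether a field should be in an error state depends on runtime behavior, so a static check can only cover part of this. The sketch below takes the narrower case: for every control already marked `aria-invalid="true"`, it confirms that `aria-describedby` or `aria-errormessage` points at an element that exists in the markup. The names are hypothetical.

```python
from html.parser import HTMLParser

FORM_CONTROLS = {"input", "select", "textarea"}


class ErrorStateAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids = set()   # every id seen in the document
        self.invalid = []  # (control id, ids it references for its error text)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("id"):
            self.ids.add(a["id"])
        if tag in FORM_CONTROLS and a.get("aria-invalid") == "true":
            refs = (a.get("aria-describedby") or "").split()
            refs += (a.get("aria-errormessage") or "").split()
            self.invalid.append((a.get("id"), refs))


def unassociated_errors(html):
    """Return ids of invalid controls with no resolvable error message reference."""
    audit = ErrorStateAudit()
    audit.feed(html)
    audit.close()
    return [
        cid for cid, refs in audit.invalid
        if not any(ref in audit.ids for ref in refs)
    ]
```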

Lack of Wayfinding Help: Failing to include navigational labels and aria-current="page" in a breadcrumb nav.
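
For breadcrumb output specifically, both omissions can be checked with a few lines. This sketch assumes the markup passed in is a breadcrumb snippet, so every `nav` it contains is treated as a breadcrumb; nested `nav` elements are not handled. The names and message strings are hypothetical.

```python
from html.parser import HTMLParser


class BreadcrumbAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nav_depth = 0
        self.navs = []  # one dict per <nav>: {"labelled": bool, "current": bool}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "nav":
            self.nav_depth += 1
            labelled = bool(a.get("aria-label") or a.get("aria-labelledby"))
            self.navs.append({"labelled": labelled, "current": False})
        elif self.nav_depth and a.get("aria-current") == "page":
            self.navs[-1]["current"] = True

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth:
            self.nav_depth -= 1


def breadcrumb_issues(html):
    audit = BreadcrumbAudit()
    audit.feed(html)
    audit.close()
    issues = []
    for number, nav in enumerate(audit.navs, start=1):
        if not nav["labelled"]:
            issues.append(f"nav {number}: missing accessible label")
        if not nav["current"]:
            issues.append(f"nav {number}: missing aria-current page marker")
    return issues
```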

Adding Unnecessary JavaScript: Even though it was instructed to only generate JavaScript when absolutely necessary, the model would often inject JavaScript for simple tasks that could be handled with HTML and CSS alone.

How Does This Help?

Here’s where things get interesting — and hopeful. When I retested using prompts that included accessibility hints, the model’s output improved dramatically. Not just slightly better, but often going from fundamentally broken to genuinely accessible.

For example, when I added diff data related to fieldset use to a prompt about radio button groups, the model switched from generating meaningless div wrappers to proper semantic structures.

This suggests the model can produce quality code if properly primed. It also indicates that the training data likely lacks sufficient examples of well-implemented accessible components. If the model had been trained on a richer dataset of accessible code, it might not need such explicit guidance to produce good results.

Where Do We Go From Here?

These findings point to several concrete approaches for improving accessibility in AI-generated code:

Enhanced Training Data: The models need exposure to more high-quality, accessible code examples. Current training data clearly overrepresents inaccessible implementations. We need comprehensive datasets of properly implemented accessible components across different frameworks and use cases.

Accessibility-Aware Fine-Tuning: Post-training refinement specifically focused on accessibility compliance could help models prioritize inclusive patterns. This could involve training on accessibility-annotated code pairs — showing inaccessible implementations alongside their accessible counterparts, like the diffs do.

Prompt Engineering Guidelines: Tool creators should integrate accessibility considerations into their default system prompts. Instead of just asking for “clean, semantic HTML,” prompts should provide detailed instructions that demonstrate accessibility best practices rather than pointing at often-vague guidelines like WCAG.

Integrated Accessibility Validation: IDE integrations should include real-time accessibility linting of AI-generated code, providing immediate feedback and suggestions for improvement.

Community-Contributed Training Data: We should coordinate our efforts to produce an open source, high-quality accessible code dataset so that this data can be integrated into future models.
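
To make the idea of accessibility-annotated code pairs concrete, here is one possible record shape for such a dataset, serialized as JSON Lines. The field names and the JSONL choice are assumptions for illustration, not an existing format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AccessibilityPair:
    pattern: str            # e.g. "radio-group"
    inaccessible_html: str  # the output as generated
    accessible_html: str    # the remediated version
    issues: list[str] = field(default_factory=list)  # one entry per issue fixed


def to_jsonl(pairs):
    """Serialize pairs as one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


def from_jsonl(text):
    """Parse JSON Lines back into AccessibilityPair records."""
    return [AccessibilityPair(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping the issue list alongside each before/after pair mirrors what the diffs with commit-message headers carry, in a form that is easier to filter by pattern or issue type.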


The data from this project provides a roadmap for where to focus these efforts. We’re not dealing with models that are fundamentally incapable of generating accessible code — we’re dealing with models that haven’t been properly trained to prioritize accessibility by default.

Want to Get Involved?

If you want to conduct similar evaluations with your preferred models or specific use cases, I’ve created a template repository with the testing framework: CodeGen Model Eval and Refine Tools. It includes the Python testing harness, prompt templates, and analysis guidelines to get you started.

The complete findings, methodology details, and code samples for my research are available on GitHub. I encourage you to dig into the data — it’s eye-opening and frustrating, yes, but ultimately actionable.

There are other projects and research exploring this space as well. A few worth checking out:

  • AIMAC - The AI Model Accessibility Checker (AIMAC) Leaderboard measures how well LLMs generate accessible HTML pages using neutral prompts without specific accessibility guidance. Checks are performed with axe-core.
  • A11y LLM Evaluation Harness and Dataset - A more recent research project that evaluates how well various LLMs generate accessible HTML content.

We’re at a critical moment where the patterns established in AI-assisted development will shape the accessibility of the web for years to come. We can either let this technology amplify existing accessibility problems, or we can tackle the problems head-on and be part of the solution.