LLM Agents Are Now Finding Zero-Days: How AI is Autonomously Rewriting the Rules of Vulnerability Research
Manoranjan R · 2026-05-26 · via DEV Community


💡 TL;DR: LLM agents like Claude Mythos Preview and GPT-5.5 are now autonomously hunting zero-days at massive scale: 10,000+ critical or high-severity vulnerabilities found in under a month. This post breaks down the agentic harness architecture, walks through real-world results, and gives you runnable code to deploy your own AI security pipeline today.

Published: May 26, 2026 · ⏱️ 18 min read · Tags: security, llm, ai-agents, vulnerability-research, devops, cybersecurity



Table of Contents

  1. The Day an AI Found a macOS Kernel CVE
  2. What Is LLM Vulnerability Research? (Beyond Static Analysis)
  3. How Mythos Preview Works: Exploit Chain Construction & Proof Generation
  4. Real-World Results: Mozilla, Cloudflare, and Numbers That Stunned the Industry
  5. The Agentic Harness Architecture: Deep Technical Breakdown
  6. GPT-5.5 vs Claude Mythos: A Comparative Look at Frontier Security Models
  7. The New Bottleneck: Finding > Fixing
  8. Building Your Own AI Security Pipeline
  9. Safety, Ethics, and Dual-Use Concerns
  10. What's Next: The Near-Future of AI-Powered Cyber Defense
  11. Conclusion & Call to Action

1. The Day an AI Found a macOS Kernel CVE

On May 11, 2026, roughly two weeks ago, Apple published its security advisory for macOS Tahoe 26.5. Tucked among dozens of credited human researchers was one unusual line:

CVE-2026-28952: An integer overflow addressed with improved input validation. Impact: An app may be able to gain root privileges.
Discovered by: Calif.io in collaboration with Claude and Anthropic Research.

Read that again. A kernel-level privilege escalation vulnerability, the kind that allows arbitrary apps to gain root access on macOS, was credited in part to an AI model.

This wasn't a toy benchmark or a controlled research sandbox. This was a real CVE, assigned and since patched by Apple, found in critical kernel code by a large language model operating as an autonomous security research agent. The same week, Anthropic's Project Glasswing announced that Claude Mythos Preview had found over 10,000 critical or high-severity vulnerabilities across the world's most systemically important software in under a month.

If you're a security engineer, a platform developer, or anyone who ships software that other people depend on, this changes your threat model. Permanently. This post breaks down exactly what happened, how these LLM vulnerability research agents work under the hood, and what you need to do about it right now.


2. What Is LLM Vulnerability Research? (Beyond Static Analysis)

Before LLMs, automated vulnerability detection fell into well-understood categories:

  • SAST (Static Application Security Testing): Pattern-matching against known vulnerability signatures in source code. Fast, high false-positive rate, misses logic bugs entirely.
  • DAST (Dynamic Application Security Testing): Black-box fuzzing, sending malformed inputs and watching for crashes. Good for input validation bugs, blind to architectural flaws.
  • Symbolic Execution: Exhaustively explores code paths using constraint solvers (e.g., KLEE, angr). Powerful but doesn't scale to real-world codebases.
  • Manual Penetration Testing: Human researchers manually auditing code. High quality, brutally expensive, doesn't scale.
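To make the SAST limitation concrete, here is a minimal sketch of a signature-based scanner. The rule set, the `scan` helper, and both code snippets are hypothetical illustrations, not taken from any real tool: the point is only that a regex rule catches a known-bad call but has nothing to match in a logic bug.

```python
import re

# Hypothetical signature rules (illustrative only): regex -> description.
RULES = {
    r"\bstrcpy\s*\(": "unbounded copy (possible buffer overflow)",
    r"\bsystem\s*\(": "shell command execution (possible injection)",
}


def scan(source: str) -> list[tuple[int, str]]:
    """Return (line_number, description) for every line that matches a rule."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, description in RULES.items():
            if re.search(pattern, line):
                hits.append((lineno, description))
    return hits


# A known-bad call: the strcpy signature matches on line 3.
MEMORY_BUG = "void f(char *in) {\n    char buf[8];\n    strcpy(buf, in);\n}"

# A logic bug: the check trusts a client-supplied is_admin parameter.
# No signature describes this, so the scanner stays silent.
LOGIC_BUG = (
    "def delete_doc(user, doc, request):\n"
    "    if request.params.get('is_admin') or doc.owner == user:\n"
    "        doc.delete()\n"
)

print(scan(MEMORY_BUG))  # one hit, on line 3
print(scan(LOGIC_BUG))   # no hits
```

Spotting the second bug requires knowing what the function is supposed to enforce, which is exactly the kind of intent-level reasoning discussed next.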

LLM vulnerability research is none of these, and all of them at once.

What makes frontier LLMs different is contextual reasoning at scale. A traditional SAST scanner matches patterns. An LLM understands what the code is trying to do, can reason about multi-file call graphs, can hypothesize about trust boundaries, and can generate the proof that a bug is exploitable, all in a single reasoning pass.

The key insight that the research community has arrived at in 2026 is this: LLMs don't just find bugs by recognizing patterns. They find bugs by understanding programmer intent vs. actual behavior, and finding where those diverge.

A 20-year-old XSLT bug in Firefox wasn't missed by fuzzers because the input space wasn't covered. It was missed because understanding the bug required knowing that reentrant key() calls cause a hash table rehash that frees its backing store while a raw entry pointer is still in use, a multi-step logical chain that requires semantic understanding of the codebase's memory model. Claude Mythos found it.
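The shape of that bug class can be sketched in a few lines. This is a loose Python analogy, not Firefox code: `TinyTable` and `update_with_callback` are hypothetical names, and because Python is memory-safe the stale write is merely lost rather than landing in freed memory as it would in C++.

```python
class TinyTable:
    """Toy hash table with linear probing; grows when more than half full."""

    def __init__(self) -> None:
        self.slots: list = [None] * 4  # backing store
        self.count = 0

    def _probe(self, slots: list, key: int) -> int:
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)
        return i

    def _rehash(self) -> None:
        old = self.slots
        self.slots = [None] * (len(old) * 2)  # the old store is abandoned here
        for entry in old:
            if entry is not None:
                self.slots[self._probe(self.slots, entry[0])] = entry

    def insert(self, key: int, value: str) -> None:
        if (self.count + 1) * 2 > len(self.slots):
            self._rehash()
        i = self._probe(self.slots, key)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)

    def get(self, key: int):
        entry = self.slots[self._probe(self.slots, key)]
        return None if entry is None else entry[1]


def update_with_callback(table: TinyTable, key: int, callback) -> None:
    store = table.slots            # stands in for the raw entry pointer
    i = table._probe(store, key)
    callback()                     # reentrant call that may trigger a rehash
    store[i] = (key, "updated")    # write goes to the abandoned store


table = TinyTable()
table.insert(1, "a")
table.insert(2, "b")
# The callback inserts a third key, which forces the table to rehash.
update_with_callback(table, 1, lambda: table.insert(3, "c"))
print(table.get(1))  # still "a": the update landed in the old store
```

Each function looks correct in isolation; the bug exists only in the interaction between the held reference and the reentrant call, which is why input-space exploration alone struggles to surface it.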

This is the paradigm shift. We're no longer talking about automated scanners. We're talking about AI agents that reason like senior security researchers.


3. How Mythos Preview Works: Exploit Chain Construction & Proof Generation

Cloudflare's security team spent weeks with Mythos Preview on their own infrastructure, and their writeup identified two capabilities that distinguish it from all prior tooling:

3.1 Exploit Chain Construction

Real exploits rarely use a single vulnerability. They chain multiple primitives together: a use-after-free (UAF) becomes an arbitrary read/write primitive, which enables control-flow hijacking, which enables a full sandbox escape. Each step is individually low-severity; together they're critical.

Traditional scanners report bugs in isolation. Mythos Preview reasons about how to chain them. Given a set of identified primitives, it evaluates:

  • Which primitives can be combined?
  • What preconditions does each step require?
  • Can an attacker reliably satisfy those preconditions from an unprivileged context?
  • What does the final exploit look like end-to-end?
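One way to picture the questions above is as a search over attacker capabilities. The sketch below is an assumption about how such reasoning could be modeled, not a description of Mythos Preview's internals; the `Primitive` type, the capability names, and `find_chain` are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str
    requires: frozenset  # capabilities the attacker must already hold
    grants: frozenset    # capabilities gained by using this primitive


def find_chain(primitives, start, goal):
    """Breadth-first search for the shortest chain from start capabilities to goal."""
    start_state = frozenset(start)
    queue = deque([(start_state, [])])
    seen = {start_state}
    while queue:
        caps, chain = queue.popleft()
        if goal in caps:
            return chain
        for p in primitives:
            # Usable only if preconditions hold and it adds something new.
            if p.requires <= caps and not p.grants <= caps:
                nxt = caps | p.grants
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, chain + [p.name]))
    return None  # no chain reaches the goal from this starting context


# Hypothetical primitives mirroring the UAF -> read/write -> hijack -> escape example.
PRIMITIVES = [
    Primitive("use-after-free", frozenset({"unprivileged_code"}), frozenset({"arbitrary_rw"})),
    Primitive("control-flow hijack", frozenset({"arbitrary_rw"}), frozenset({"sandboxed_exec"})),
    Primitive("sandbox escape", frozenset({"sandboxed_exec"}), frozenset({"host_exec"})),
    Primitive("info leak", frozenset({"unprivileged_code"}), frozenset({"aslr_bypass"})),
]

chain = find_chain(PRIMITIVES, {"unprivileged_code"}, "host_exec")
print(chain)
```

A real agent also has to judge reliability (can the precondition be met every time?), which a boolean capability set does not capture.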

Cloudflare observed the model taking bugs that would normally sit ignored in a low-severity backlog and constructing high-severity exploit chains that their own security team hadn't considered. This isn't just vulnerability finding; it's vulnerability weaponization, in service of defenders understanding true risk.

3.2 Proof-of-Concept Generation Loop

Finding a bug and proving it's exploitable are two very different things. Mythos Preview closes this gap with an autonomous PoC generation loop:

  1. Hypothesize: Identify a suspected vulnerability and formulate a triggering condition.
  2. Synthesize: Write code that would trigger the bug: a test harness, a malformed input, a specific sequence of API calls.
  3. Compile & Execute: Build the PoC in an isolated sandbox environment and run it.
  4. Observe & Iterate: If the expected behavior (crash, memory corruption, privilege escalation) isn't observed, read the output, revise the hypothesis, and try again.

This loop runs autonomously. Cloudflare described watching the model read compiler errors, adjust its exploit logic, and retry, behavior that previously required a human researcher sitting at a terminal. The result is a finding backed by a working proof of concept, not a speculative observation hedged with "might" and "potentially."


4. Real-World Results: Mozilla, Cloudflare, and Numbers That Stunned the Industry


The numbers from Project Glasswing's first month are genuinely staggering:

| Organization | Bugs Found | Severity | Notes |
|---|---|---|---|
| Project Glasswing Partners (~50 orgs) | 10,000+ | Critical/High | Collectively across critical infrastructure |
| Cloudflare | 2,000 | 400 Critical/High | Scanned 50+ internal repos |
| Mozilla Firefox | 271 | Mixed | 10x more than Firefox 148 with Opus 4.6 |
| Open Source Projects (1,000+) | 6,202 (high/critical est.) | High/Critical | 90.6% true-positive rate after triage |
| Palo Alto Networks | 5x normal patch volume | n/a | Accelerated release cadence |

4.1 Mozilla's Firefox: The Most Detailed Public Case Study

Mozilla's Hacks blog published their harness methodology and even disclosed specific bug IDs โ€” an unusual level of transparency that gives us a rare window into what AI-found bugs actually look like in practice. A few highlights:

  • Bug 2024918: An incorrect equality check allowed the JIT compiler to optimize away initialization of a live WebAssembly GC struct, creating a fake-object primitive with arbitrary read/write. This code had undergone extensive fuzzing by both internal and external researchers and was never found.
  • Bug 2024437: A 15-year-old bug in the <legend> HTML element triggered by an intricate orchestration of recursion stack depth limits, expando properties, and cycle collection across distant parts of the browser.
  • Bug 2022733: A parent-process UAF triggered by flooding WebTransport with thousands of certificate hashes to stretch a race condition in a refcount-heavy copy loop, then exploiting that race over IPC from a compromised content process.

These aren't simple buffer overflows. These are complex, multi-system, architecture-aware bugs that require deep understanding of browser internals. Fuzzers, which work by exploring input space, simply can't reason about the semantic relationships between components that make these bugs possible.

4.2 The False Positive Problem (And How It's Being Solved)

One important caveat: early LLM-based security scanning (models from the 2024 to early 2025 era) was plagued by AI-generated slop bug reports: plausible-sounding but entirely wrong findings that wasted maintainer time. Several open-source projects created policies explicitly rejecting AI-generated issues.

Mythos Preview represents a step-change improvement here. Cloudflare reported that the model's output had noticeably higher quality: fewer hedged findings, clearer reproduction steps, and less work to reach a fix-or-dismiss decision. Critically, findings backed by a working PoC have a false-positive rate that approaches zero by definition: if the exploit runs and produces the expected output, the bug is real.


5. The Agentic Harness Architecture: Deep Technical Breakdown


The key lesson from all successful deployments is this: naïvely pointing an LLM at a repository and asking "find bugs" doesn't work well. The quality of results scales dramatically with the sophistication of the harness around the model. Here's the architecture that state-of-the-art practitioners are converging on:

5.1 Core Components

┌─────────────────────────────────────────────────────────────┐
│                    SECURITY AGENT PIPELINE                  │
├─────────────────────────────────────────────────────────────┤
│  1. THREAT MODELER          │  Maps codebase, identifies    │
│     (LLM + static analysis) │  attack surfaces, prioritizes │
├─────────────────────────────────────────────────────────────┤
│  2. SCANNER ORCHESTRATOR    │  Spins up parallel sub-agents │
│     (Agent coordinator)     │  per module/subsystem         │
├─────────────────────────────────────────────────────────────┤
│  3. VULN DETECTOR           │  Per-file/function analysis   │
│     (LLM sub-agent)         │  with semantic reasoning      │
├─────────────────────────────────────────────────────────────┤
│  4. EXPLOIT SYNTHESIZER     │  Generates PoC code,          │
│     (LLM + code executor)   │  compiles, and runs in sandbox│
├─────────────────────────────────────────────────────────────┤
│  5. TRIAGE ENGINE           │  Multi-model consensus,       │
│     (Ensemble of models)    │  severity rating, dedup       │
├─────────────────────────────────────────────────────────────┤
│  6. REPORT GENERATOR        │  CVE-formatted output,        │
│     (LLM)                   │  fix suggestions, CVSS scoring│
└─────────────────────────────────────────────────────────────┘


5.2 Phase 1: Threat Modeling (Don't Skip This)

The single biggest productivity multiplier is spending compute on threat modeling before scanning. Ask the LLM to:

  1. Build a dependency graph of the codebase
  2. Identify all external trust boundaries (network input, file parsing, IPC, user input)
  3. Enumerate attack-relevant subsystems (crypto, auth, memory management, privilege operations)
  4. Produce a prioritized list of modules to scan, ordered by attack surface and severity potential

This turns unfocused scanning into targeted analysis. Mozilla's team found this dramatically improved signal-to-noise: instead of 10,000 low-confidence findings across the whole codebase, they got 500 high-confidence findings in the highest-risk subsystems.
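The prioritization step can be approximated with a simple scoring pass once the LLM has tagged each module. This is a minimal sketch under stated assumptions: the weights, tag names, file paths, and the `risk_score` formula are all hypothetical and would need tuning for a real codebase.

```python
from dataclasses import dataclass, field

# Hypothetical weights; unknown tags fall back to a weight of 1.
BOUNDARY_WEIGHT = {"network": 5, "ipc": 4, "file_parsing": 3, "user_input": 3}
SUBSYSTEM_WEIGHT = {"memory_management": 4, "privilege": 4, "auth": 3, "crypto": 3}


@dataclass
class Module:
    path: str
    boundaries: set = field(default_factory=set)   # trust boundaries it touches
    subsystems: set = field(default_factory=set)   # attack-relevant subsystems


def risk_score(m: Module) -> int:
    exposure = sum(BOUNDARY_WEIGHT.get(b, 1) for b in m.boundaries)
    impact = sum(SUBSYSTEM_WEIGHT.get(s, 1) for s in m.subsystems)
    # Multiplying means a module with no external exposure scores 0.
    return exposure * (1 + impact)


def prioritize(modules: list, top_n: int) -> list:
    """Return the paths of the top_n highest-risk modules, skipping zero scores."""
    ranked = sorted(modules, key=risk_score, reverse=True)
    return [m.path for m in ranked[:top_n] if risk_score(m) > 0]


# Hypothetical modules as a threat-modeling pass might tag them.
modules = [
    Module("util/strings.py"),
    Module("auth/session.py", {"user_input"}, {"auth"}),
    Module("net/http_parser.c", {"network"}, {"memory_management"}),
    Module("ipc/broker.c", {"ipc"}, {"privilege"}),
]

print(prioritize(modules, top_n=3))
```

The output list is what the scanner orchestrator would consume, so compute goes to exposed, high-impact code first.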

5.3 Phase 2: Parallel Sub-Agent Scanning

Each high-priority module gets its own sub-agent instance with:

  • Full file context for the module under analysis
  • Relevant cross-file dependencies loaded dynamically
  • Language-specific vulnerability playbook (C/C++ memory bugs vs. Python deserialization vs. Rust unsafe blocks)
  • A structured output format enforcing finding quality

# Simplified sub-agent invocation pattern
async def scan_module(module_path: str, context: SecurityContext) -> list[Finding]:
    """
    Launch a sandboxed LLM sub-agent to analyze a single module.
    Returns structured findings with severity, description, and PoC.
    """
    system_prompt = build_security_analyst_prompt(
        language=context.language,
        vulnerability_classes=context.priority_vuln_classes,
        trust_model=context.trust_model,
        output_schema=FindingSchema
    )

    file_content = load_with_dependencies(module_path, context.repo_root)

    findings = await llm_client.chat(
        model="claude-opus-4-7",           # or gpt-5.5 for high-value targets
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Analyze this module for security vulnerabilities:\n\n{file_content}"
        }],
        response_schema=list[Finding],      # structured output enforces quality
        max_tokens=8192,
        timeout=120
    )

    return findings
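The pattern above leans on a structured output schema to enforce finding quality. As a rough sketch of what that enforcement might look like on the receiving side, the snippet below validates raw model output and drops entries that fail the schema. `ParsedFinding`, its fields, and `parse_findings` are hypothetical names chosen for illustration, not the schema any vendor uses.

```python
import json
from dataclasses import dataclass, fields

ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}


@dataclass(frozen=True)
class ParsedFinding:
    vuln_class: str
    severity: str
    location: str
    description: str
    poc_hint: str


def parse_findings(raw: str) -> list:
    """Parse model output as JSON and keep only entries that satisfy the schema."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    names = [f.name for f in fields(ParsedFinding)]
    findings = []
    for item in items:
        if not isinstance(item, dict):
            continue
        try:
            finding = ParsedFinding(**{n: str(item[n]).strip() for n in names})
        except KeyError:
            continue  # a required field is missing
        if finding.severity.lower() not in ALLOWED_SEVERITIES:
            continue
        if not finding.location or not finding.poc_hint:
            continue  # no location or no trigger: not actionable
        findings.append(finding)
    return findings


RAW = json.dumps([
    {"vuln_class": "CWE-190", "severity": "High", "location": "parser.c:42",
     "description": "Length field can wrap.", "poc_hint": "Send length 0xFFFFFFFF."},
    {"vuln_class": "CWE-79", "severity": "Scary", "location": "view.js:7",
     "description": "Unescaped output.", "poc_hint": "Inject a script tag."},
    {"vuln_class": "CWE-416", "severity": "Critical", "location": "",
     "description": "Maybe a UAF somewhere.", "poc_hint": ""},
])

valid = parse_findings(RAW)
print(len(valid), valid[0].location)
```

Rejecting findings without a location or a trigger is a cheap filter for the vague reports that later sections describe as wasting maintainer time.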


5.4 Phase 3: The PoC Validation Loop

This is where the magic happens, and where the false-positive rate collapses:

async def validate_finding(finding: Finding, sandbox: SandboxEnv) -> ValidatedFinding:
    """
    Attempt to generate and run a PoC for a suspected vulnerability.
    A finding backed by a working PoC has effectively 0% false positive rate.
    """
    max_iterations = 5

    for attempt in range(max_iterations):
        # Step 1: Synthesize PoC code
        poc_code = await llm_client.chat(
            model="claude-opus-4-7",
            messages=[{
                "role": "user", 
                "content": f"""
                Write a minimal proof-of-concept that triggers this vulnerability:

                Finding: {finding.description}
                Affected code: {finding.code_snippet}
                Expected behavior: {finding.expected_trigger}

                Write executable {finding.language} code only. No explanations.
                """
            }]
        )

        # Step 2: Execute in isolated sandbox
        result = await sandbox.execute(
            code=poc_code,
            language=finding.language,
            timeout=30,
            memory_limit="512mb"
        )

        # Step 3: Did it trigger the expected vulnerability?
        if result.crashed and matches_expected_behavior(result, finding):
            return ValidatedFinding(
                finding=finding,
                poc_code=poc_code,
                execution_result=result,
                confidence="HIGH",
                false_positive=False
            )

        # Step 4: Iterate by feeding the failure back to the model
        finding = await refine_hypothesis(finding, result, llm_client)

    # Couldn't reproduce after max_iterations: flag as unconfirmed rather than
    # as a proven false positive (a failed PoC does not show the bug is absent)
    return ValidatedFinding(finding=finding, confidence="LOW", false_positive=None)
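The loop above assumes a `sandbox.execute` call that reports whether the PoC crashed. A minimal stand-in for local experimentation might look like the sketch below. `RunResult` and `run_poc` are hypothetical names, it only handles Python PoCs, and a plain subprocess is not a security sandbox: run untrusted PoC code only inside a container or VM with no network access.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    exit_code: Optional[int]  # None when the run timed out
    stdout: str
    stderr: str
    timed_out: bool

    @property
    def crashed(self) -> bool:
        # On POSIX a negative return code means the child was killed by a
        # signal (for example SIGSEGV or SIGABRT). This check is POSIX-only.
        return self.exit_code is not None and self.exit_code < 0


def run_poc(code: str, timeout: float = 30.0) -> RunResult:
    """Run a Python PoC in a child interpreter with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return RunResult(None, "", "", True)
    return RunResult(proc.returncode, proc.stdout, proc.stderr, False)


clean_exit = run_poc("print('no crash here')")
print(clean_exit.exit_code, clean_exit.crashed)
```

A timeout is reported separately from a crash because a hang may indicate a different bug class (for example a denial of service) than memory corruption.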


5.5 Phase 4: Multi-Model Triage for Consensus

One of the most powerful techniques for reducing false positives, borrowed from Milvus's research on AI code review, is running multiple independent models and requiring consensus. A finding reported independently by Claude Opus, GPT-5.5, and Gemini is far more likely to be real than one reported by a single model.

import asyncio

async def triage_with_consensus(
    finding: Finding,
    # Tuple default avoids Python's shared-mutable-default pitfall
    models: tuple[str, ...] = ("claude-opus-4-7", "gpt-5.5", "gemini-2.5-pro"),
) -> ConsensusResult:
    """
    Submit a finding to multiple models for independent verification.
    Require 2/3 agreement to advance to human review queue.
    """
    verdicts = await asyncio.gather(*[
        verify_finding_with_model(finding, model) 
        for model in models
    ])

    confirmed_count = sum(1 for v in verdicts if v.is_valid)

    return ConsensusResult(
        finding=finding,
        verdicts=verdicts,
        consensus_reached=confirmed_count >= 2,
        confidence_score=confirmed_count / len(models),
        advance_to_human_review=confirmed_count >= 2
    )
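The function above hard-codes a threshold of 2, which matches the 2/3 rule only when exactly three models are used. If the model list changes, the threshold could be derived from the quorum instead. The helper names below are hypothetical, and the 2/3 default simply mirrors the docstring above.

```python
import math
from fractions import Fraction


def required_votes(n_models: int, quorum: Fraction = Fraction(2, 3)) -> int:
    """Smallest number of agreeing models that meets the quorum."""
    if n_models <= 0:
        raise ValueError("need at least one model")
    # Fraction keeps the arithmetic exact, so 3 * 2/3 is exactly 2.
    return math.ceil(n_models * quorum)


def consensus_reached(verdicts: list, quorum: Fraction = Fraction(2, 3)) -> bool:
    """verdicts holds one boolean per model: True means 'finding is valid'."""
    return sum(verdicts) >= required_votes(len(verdicts), quorum)


print(required_votes(3), required_votes(4), required_votes(5))
print(consensus_reached([True, True, False]))
```

With three models this reproduces the original behavior; with five, four must agree.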



6. GPT-5.5 vs Claude Mythos: A Comparative Look at Frontier Security Models

As of May 2026, two models dominate the LLM vulnerability research space. Here's how they compare based on independent benchmarks and real-world deployments:

| Capability | Claude Mythos Preview | GPT-5.5 |
|---|---|---|
| Availability | Restricted (Project Glasswing / Enterprise) | Generally available |
| Vulnerability Miss Rate | ~5-8% (est.) | 10% (XBOW benchmark) |
| Black-box performance | Excellent | Excellent; outperforms GPT-5 with source code |
| White-box performance | Best-in-class | "Effectively killed" XBOW's benchmark |
| Exploit chain construction | ✅ Core capability | ✅ Strong |
| PoC generation | ✅ Autonomous loop | ✅ Strong |
| Persist vs. pivot decision-making | Strong | Improved (50% fewer bad persist decisions vs. GPT-5.4) |
| Consistency/guardrails | Inconsistent organic refusals | More consistent behavior |
| Token efficiency | "Absolutely unprecedented precision" (XBOW) | Good |

The key practical difference today: Claude Mythos Preview is not publicly available; it's restricted to Project Glasswing partners and enterprise security teams with a verified use case. GPT-5.5 is generally available and, per XBOW's benchmarks, delivers Mythos-class performance in white-box scenarios.

For most security teams today, GPT-5.5 in a well-architected harness is the path to production. If your organization qualifies for Anthropic's Cyber Verification Program or Claude Security enterprise beta, Mythos-class capabilities are accessible via Claude Opus 4.7 as well.


7. The New Bottleneck: Finding > Fixing

Here's the uncomfortable truth that Project Glasswing has surfaced for the entire software industry:

AI has solved the hard part. The bottleneck is now entirely human.

For decades, the security community's limiting factor was finding vulnerabilities: it required expensive, senior human expertise and took weeks per codebase. That constraint has evaporated. Mythos Preview is finding critical bugs faster than any team of human researchers could. The new constraint is triage, disclosure, patch development, and deployment.

Some maintainers in Project Glasswing's open-source scanning initiative have asked Anthropic to slow down disclosure because they can't keep up. That's an extraordinary sentence. A world-class AI is producing so much valid, actionable security research that human maintainers are begging it to stop.

The downstream implications for your engineering organization:

  1. Shorten patch cycles aggressively. The 90-day standard disclosure window was designed for the old world. As AI-found bugs become public CVEs faster, the exploitation window is compressing.

  2. Invest in automated patch generation pipelines. Claude Security (now in public beta for Enterprise) can generate proposed fixes, not just identify bugs. This is the next frontier for reducing the triage burden.

  3. Memory-safe languages matter more than ever. Both Cloudflare and Mozilla's data confirm significantly higher false-positive rates and more severe findings in C/C++ codebases vs. Rust or Go. The ROI on memory-safe rewrites just got a lot more concrete.

  4. Staged rollout policies are critical. With AI accelerating both attack and defense, end users need to be able to receive patches faster. Frictionless update mechanisms aren't just a UX concern; they're a security posture.


8. Building Your Own AI Security Pipeline


You don't need access to Mythos Preview to start today. Here's a practical, production-ready approach using generally available models:

8.1 Minimal Viable Security Scanner (Python + Claude API)

#!/usr/bin/env python3
"""
minimal_vuln_scanner.py
A basic LLM-powered vulnerability scanner for CI/CD integration.
Requires: anthropic>=0.30.0 (install with: pip install anthropic)
"""

import asyncio
import json
from pathlib import Path
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

SECURITY_SYSTEM_PROMPT = """You are an expert security researcher performing a 
white-box vulnerability audit. Analyze the provided code for:

1. Memory safety issues (buffer overflows, UAF, null deref, especially in C/C++)
2. Injection vulnerabilities (SQL, command, LDAP, path traversal)  
3. Authentication/authorization bypasses
4. Race conditions and TOCTOU bugs
5. Cryptographic weaknesses
6. Unsafe deserialization
7. Integer overflow/underflow conditions
8. Logic bugs affecting security-critical code paths

For each finding, provide:
- Vulnerability class (CWE ID if applicable)
- Severity (Critical/High/Medium/Low)
- Affected code location (file:line)
- Root cause explanation (2-3 sentences)
- Proof-of-concept trigger (how would an attacker trigger this?)
- Recommended fix

Return your response as a JSON array of findings. If no vulnerabilities are found,
return an empty array []. Do NOT speculate: only report findings you are confident about."""


async def scan_file(file_path: Path) -> list[dict]:
    """Scan a single file for vulnerabilities using Claude."""

    content = file_path.read_text(errors='replace')

    # Skip files that are too short to be meaningful
    if len(content.strip()) < 50:
        return []

    message = await client.messages.create(
        model="claude-opus-4-5",  # Use claude-opus-4-7 for higher accuracy
        max_tokens=4096,
        system=SECURITY_SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"File: {file_path}\n\n```\n{content[:50000]}\n```"
            # Truncate to 50k chars; for large files, chunk by function
        }]
    )

    response_text = message.content[0].text.strip()

    try:
        # Extract JSON array from response
        start = response_text.find('[')
        end = response_text.rfind(']') + 1
        if start != -1 and end > start:
            findings = json.loads(response_text[start:end])
            # Annotate each finding with source file
            for f in findings:
                f['source_file'] = str(file_path)
            return findings
    except json.JSONDecodeError:
        pass

    return []


async def scan_repository(repo_path: str, extensions: list[str] = None) -> dict:
    """
    Scan an entire repository for vulnerabilities.

    Args:
        repo_path: Path to the repository root
        extensions: File extensions to scan (default: common security-relevant types)

    Returns:
        Dict with findings grouped by severity
    """
    if extensions is None:
        extensions = ['.c', '.cpp', '.h', '.py', '.js', '.ts', '.go', '.rs', '.java']

    repo = Path(repo_path)
    files_to_scan = [
        f for f in repo.rglob('*')
        if f.suffix in extensions
        and '.git' not in f.parts
        and 'node_modules' not in f.parts
        and 'vendor' not in f.parts
    ]

    print(f"[*] Scanning {len(files_to_scan)} files in {repo_path}")

    # Scan files concurrently (respect API rate limits)
    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent API calls

    async def scan_with_limit(f):
        async with semaphore:
            print(f"    Scanning: {f.relative_to(repo)}")
            return await scan_file(f)

    all_results = await asyncio.gather(*[scan_with_limit(f) for f in files_to_scan])

    # Flatten and group by severity
    all_findings = [f for sublist in all_results for f in sublist]

    grouped = {
        'critical': [f for f in all_findings if f.get('severity', '').lower() == 'critical'],
        'high':     [f for f in all_findings if f.get('severity', '').lower() == 'high'],
        'medium':   [f for f in all_findings if f.get('severity', '').lower() == 'medium'],
        'low':      [f for f in all_findings if f.get('severity', '').lower() == 'low'],
    }

    return grouped


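async def main():
    import argparse

    parser = argparse.ArgumentParser(description="LLM-powered vulnerability scanner")
    parser.add_argument('repo_path', nargs='?', default='.')
    parser.add_argument('--files-list',
                        help="text file with one path per line; scan only those files")
    args = parser.parse_args()

    if args.files_list:
        # PR mode (used by the CI workflow): scan only the listed files that still
        # exist. Unlike scan_repository, no extension filter is applied here.
        results = {'critical': [], 'high': [], 'medium': [], 'low': []}
        for line in Path(args.files_list).read_text().splitlines():
            path = Path(line.strip())
            if line.strip() and path.is_file():
                for finding in await scan_file(path):
                    severity = str(finding.get('severity', '')).lower()
                    if severity in results:
                        results[severity].append(finding)
    else:
        results = await scan_repository(args.repo_path)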

    total = sum(len(v) for v in results.values())
    print(f"\n{'='*60}")
    print(f"SCAN COMPLETE: {total} findings")
    print(f"{'='*60}")
    print(f"  🔴 Critical: {len(results['critical'])}")
    print(f"  🟠 High:     {len(results['high'])}")
    print(f"  🟡 Medium:   {len(results['medium'])}")
    print(f"  🟢 Low:      {len(results['low'])}")
    print(f"{'='*60}\n")

    # Print critical and high findings in detail
    for severity in ['critical', 'high']:
        for finding in results[severity]:
            print(f"[{finding['severity'].upper()}] {finding.get('vulnerability_class', 'Unknown')}")
            print(f"  File: {finding.get('source_file')}")
            print(f"  {finding.get('root_cause', 'No description')}")
            print(f"  Fix: {finding.get('recommended_fix', 'See full report')}\n")

    # Save full report
    with open('security_report.json', 'w') as f:
        json.dump(results, f, indent=2)
    print("[*] Full report saved to security_report.json")


if __name__ == '__main__':
    asyncio.run(main())
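
The scanner above truncates every file at 50,000 characters, so anything past that point is never reviewed. One way to cover large files, sketched here under assumptions, is to split the source into overlapping chunks and scan each chunk separately. `chunk_source`, `max_chars`, and `overlap_lines` are hypothetical names that are not part of the original script, and splitting on line boundaries is a simpler stand-in for the function-level chunking the code comment mentions.

```python
def chunk_source(content: str, max_chars: int = 50_000, overlap_lines: int = 20):
    """Yield (start_line, chunk_text) pairs that together cover the whole file.

    Hypothetical helper: splits on line boundaries, not on function boundaries.
    A single line longer than max_chars is emitted as its own oversized chunk.
    """
    lines = content.splitlines(keepends=True)
    start = 0
    while start < len(lines):
        size = 0
        end = start
        # Take whole lines until the budget is used up (always take at least one).
        while end < len(lines) and (end == start or size + len(lines[end]) <= max_chars):
            size += len(lines[end])
            end += 1
        yield start + 1, "".join(lines[start:end])  # 1-based number of the first line
        if end >= len(lines):
            break
        # Step back a few lines so code spanning a boundary appears in both chunks,
        # while still guaranteeing forward progress.
        start = max(end - overlap_lines, start + 1)
```

Passing the returned `start_line` along in each prompt would let the model's reported line numbers be mapped back to the full file. Overlapping chunks can also produce the same finding twice, so results may need de-duplication.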


8.2 CI/CD Integration (GitHub Actions)

# .github/workflows/ai-security-scan.yml
name: AI Security Scan

on:
  pull_request:
    types: [opened, synchronize]
  schedule:
    - cron: '0 2 * * 1'  # Weekly full scan every Monday at 2am

jobs:
  llm-vuln-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read  # needed by actions/checkout once permissions are restricted
      pull-requests: write
      security-events: write

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for diff-based scanning on PRs

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install "anthropic>=0.30.0"  # quoted so the shell does not treat > as a redirect

      - name: Run AI Security Scanner
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # On PRs: scan only changed files for speed
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            git diff --name-only origin/${{ github.base_ref }}...HEAD > changed_files.txt
            python minimal_vuln_scanner.py --files-list changed_files.txt
          else
            # On scheduled run: full repository scan
            python minimal_vuln_scanner.py .
          fi

      - name: Check for Critical Findings
        run: |
          CRITICAL_COUNT=$(python -c "
          import json
          with open('security_report.json') as f:
              report = json.load(f)
          print(len(report.get('critical', [])))
          ")
          echo "Critical findings: $CRITICAL_COUNT"
          # Fail the build on critical findings
          if [ "$CRITICAL_COUNT" -gt "0" ]; then
            echo "::error::$CRITICAL_COUNT critical security vulnerabilities found!"
            exit 1
          fi

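      - name: Upload report artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: security-report
          path: security_report.json

      - name: Post PR Comment with Findings
        # always() so the comment is still posted when the critical-findings check fails
        if: always() && github.event_name == 'pull_request'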
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('security_report.json'));
            const total = Object.values(report).flat().length;

            const body = `## 🔍 AI Security Scan Results

            | Severity | Count |
            |---|---|
            | 🔴 Critical | ${report.critical?.length || 0} |
            | 🟠 High | ${report.high?.length || 0} |
            | 🟡 Medium | ${report.medium?.length || 0} |
            | 🟢 Low | ${report.low?.length || 0} |

            ${total === 0 ? '✅ No vulnerabilities found!' : '⚠️ Review findings in the security_report.json artifact.'}`;

            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
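
The workflow above requests the `security-events: write` permission, but no step uses it. One plausible use is to publish findings to GitHub code scanning, which accepts SARIF files (for example via the `github/codeql-action/upload-sarif` action). The sketch below converts the grouped report into a minimal SARIF 2.1.0 log. `report_to_sarif` is a hypothetical helper, the severity-to-level mapping is a judgment call, and the `location` key in `file:line` form is an assumption based on the scanner prompt, not something the model is guaranteed to return.

```python
# SARIF only defines the levels "error", "warning", "note" and "none".
SARIF_LEVEL = {"critical": "error", "high": "error", "medium": "warning", "low": "note"}


def report_to_sarif(report: dict) -> dict:
    """Convert {'critical': [...], 'high': [...], ...} into a minimal SARIF 2.1.0 log."""
    results = []
    for severity, findings in report.items():
        for finding in findings:
            # Assumed "file:line" format; fall back to line 1 when no usable number is present.
            _, _, line = str(finding.get("location", "")).rpartition(":")
            start_line = int(line) if line.isdecimal() and int(line) > 0 else 1
            results.append({
                "ruleId": finding.get("vulnerability_class", "unknown"),
                "level": SARIF_LEVEL.get(severity, "warning"),
                "message": {"text": finding.get("root_cause", "No description")},
                "locations": [{
                    "physicalLocation": {
                        "artifactLocation": {"uri": finding.get("source_file", "unknown")},
                        "region": {"startLine": start_line},
                    }
                }],
            })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "minimal_vuln_scanner"}}, "results": results}],
    }
```

Loading `security_report.json`, passing it through this function, and writing the result with `json.dump` would give a file that a SARIF upload step could consume.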


8.3 Advanced: Multi-Model Consensus Harness

For high-value codebases, a more robust approach is multi-model consensus: a finding is surfaced only when independent models agree that it is real, which reduces false positives at the cost of extra API calls:

# multi_model_consensus.py
# Run findings through multiple models; only surface results where ≥2 agree.
# Requires ANTHROPIC_API_KEY and OPENAI_API_KEY env vars.

import asyncio
import json
from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

anthropic_client = AsyncAnthropic()
openai_client = AsyncOpenAI()

VERIFICATION_PROMPT = """You are an expert security researcher verifying whether
a reported vulnerability is real or a false positive.

Given the following finding and source code, answer:
1. Is this vulnerability real? (true/false, or null if uncertain)
2. If real: can an attacker trigger it from an untrusted context? (true/false, or null if uncertain)
3. Confidence: (high/medium/low)

Respond with a single JSON object and nothing else:
{"is_real": true|false|null, "triggerable": true|false|null, "confidence": "high"|"medium"|"low", "reasoning": "one sentence"}"""


async def verify_with_claude(finding: dict, source_code: str) -> dict:
    msg = await anthropic_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        system=VERIFICATION_PROMPT,
        messages=[{"role": "user", "content": f"Finding:\n{json.dumps(finding)}\n\nCode:\n{source_code}"}]
    )
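    # The reply may wrap the JSON in prose or a code fence, so parse only the outermost {...}
    text = msg.content[0].text
    return json.loads(text[text.find('{'):text.rfind('}') + 1])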


async def verify_with_gpt(finding: dict, source_code: str) -> dict:
    resp = await openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": VERIFICATION_PROMPT},
            {"role": "user", "content": f"Finding:\n{json.dumps(finding)}\n\nCode:\n{source_code}"}
        ],
        max_tokens=512,
        response_format={"type": "json_object"}
    )
    return json.loads(resp.choices[0].message.content)


async def consensus_verify(finding: dict, source_code: str) -> dict:
    """Verify a finding with multiple models; return consensus result."""
    claude_result, gpt_result = await asyncio.gather(
        verify_with_claude(finding, source_code),
        verify_with_gpt(finding, source_code)
    )

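    # Require both to agree the finding is real; "is True" keeps null/uncertain
    # or non-boolean answers from counting as a confirmation
    both_confirm = claude_result.get('is_real') is True and gpt_result.get('is_real') is True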

    return {
        "finding": finding,
        "consensus": both_confirm,
        "claude_verdict": claude_result,
        "gpt_verdict": gpt_result,
        "advance_to_human_review": both_confirm,
        "false_positive_probability": "low" if both_confirm else "high"
    }
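
The header comment of the harness speaks of surfacing results where at least two models agree, while `consensus_verify` hard-codes exactly two verifiers. If more models are added later, the vote can be separated from the API calls. The sketch below is one way to do that. `tally_verdicts` and `min_agree` are hypothetical names, and the verdict dictionaries are assumed to follow the JSON shape requested in `VERIFICATION_PROMPT`.

```python
def tally_verdicts(verdicts: dict[str, dict], min_agree: int = 2) -> dict:
    """Combine per-model verdicts into a single k-of-n decision.

    `verdicts` maps a model label to the parsed JSON verdict it returned.
    Only an explicit boolean True counts as a confirmation; null, a string
    such as "yes", or a missing key does not.
    """
    confirming = sorted(name for name, v in verdicts.items() if v.get("is_real") is True)
    dissenting = sorted(name for name in verdicts if name not in confirming)
    consensus = len(confirming) >= min_agree
    return {
        "consensus": consensus,
        "confirming_models": confirming,
        "dissenting_models": dissenting,
        "advance_to_human_review": consensus,
    }
```

With two models and `min_agree=2` this reproduces the both-must-agree rule. With three or more models, a lower threshold trades some precision for recall.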



9. Safety, Ethics, and Dual-Use Concerns

Dual-use AI cybersecurity ethics illustration

It would be irresponsible to discuss this technology without addressing the elephant in the room: the same capability that finds bugs for defenders can be used by attackers.

Anthropic has been explicit about this tension. From their Glasswing update:

"Models as capable as Mythos Preview will soon be developed by many different AI companies. At present, no company, including Anthropic, has developed safeguards strong enough to prevent such models from being misused."

This is why Mythos Preview is not publicly released. But it's also why this matters: the capability genie is not going back in the bottle. The question isn't whether powerful AI vulnerability research tools will exist; they will. The question is whether defenders can gain and hold an asymmetric advantage before those tools proliferate to malicious actors.

Key ethical considerations for engineers building in this space:

  1. Responsible disclosure, always. AI is going to accelerate vulnerability discovery dramatically. The 90-day disclosure standard exists for good reason: it gives maintainers time to ship a fix and end users time to patch. Don't let the excitement of AI-found bugs shortcut this process.

  2. Scope your harness carefully. Ensure your scanning pipeline only touches infrastructure you own or have explicit written authorization to test. The fact that a tool is effective doesn't change the legal and ethical requirements for authorization.

  3. Verify before you disclose. Submit only confirmed, PoC-backed findings to maintainers. The open-source community is already overwhelmed by low-quality AI-generated bug reports. Be part of the solution, not the problem.

  4. Watch for model inconsistency. Cloudflare's team documented that Mythos Preview's organic guardrails are inconsistent: the same task framed differently could produce completely different refusal behavior. Don't treat model-level safeguards as a substitute for process-level controls.
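
The scoping point in the list above can be enforced as a process-level control in code instead of relying on convention. The sketch below refuses to scan any path that is not inside an explicitly authorized root. `assert_in_scope` and `authorized_roots` are hypothetical names, and a filesystem allowlist is only one possible way to express written authorization.

```python
from pathlib import Path


def assert_in_scope(target: str, authorized_roots: list[str]) -> Path:
    """Return the resolved target path, or raise if it lies outside every authorized root."""
    resolved = Path(target).resolve()
    for root in authorized_roots:
        root_path = Path(root).resolve()
        # resolve() collapses ".." segments and symlinks, so a path such as
        # "allowed/../other" is judged by where it really points.
        if resolved == root_path or root_path in resolved.parents:
            return resolved
    raise PermissionError(f"{resolved} is not inside an authorized scan root")
```

Calling this before any file is read, with the allowlist loaded from a reviewed configuration file, makes an out-of-scope scan fail loudly.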


10. What's Next: The Near-Future of AI-Powered Cyber Defense

Based on the current trajectory, here's what the next 12–18 months look like for LLM vulnerability research:

Near-term (3–6 months):

  • Mythos-class capabilities will become available in more generally accessible models as Anthropic and OpenAI iterate
  • Automated patch generation will mature: tools like Claude Security will move from "propose fixes" to "generate, test, and submit PRs" with minimal human involvement
  • CI/CD-integrated AI security scanning will become a default expectation, not a differentiator

Medium-term (6–18 months):

  • The concept of a "security debt surface" will become quantifiable in real time: every codebase will have a live severity score updated continuously by AI agents
  • Memory-safe language adoption will accelerate dramatically as C/C++ vulnerability rates become impossible to ignore empirically
  • The security research job market will bifurcate: routine scanning will be automated, while researchers who can architect harnesses, interpret AI findings, and build novel exploitation techniques that AI hasn't yet learned will command a premium

Long-term:

  • The bottleneck will shift from patching to architectural hardening: teams will move from "fix this bug" to "eliminate this entire bug class" through language choices, sandboxing, capability restriction, and formal verification
  • AI models may begin writing security specifications and verifying code against them, moving toward a world where newly written code is provably free of common vulnerability classes

11. Conclusion: The LLM Vulnerability Research Era Has Begun

We are living through a genuine phase transition in software security. The tools that found a kernel CVE in macOS, 271 latent bugs in Firefox, and 2,000 vulnerabilities across Cloudflare's infrastructure in weeks are not research prototypes. They are production systems, available today, finding real bugs in real code.

The LLM vulnerability research agent isn't coming. It's here. And if you're shipping software that other people depend on, the question is not whether to engage with this technology; it's whether you engage with it proactively, as a defender, or reactively, after an attacker already has.

Three things you can do this week:

  1. Run the minimal scanner above against your most critical service. Set your ANTHROPIC_API_KEY, point it at a repo, and see what it finds. The marginal cost of a scan is a few API dollars. The marginal cost of an unpatched critical is not.

  2. Set up the GitHub Actions workflow for your team's most security-sensitive repositories. Automated scanning on every PR is now table stakes.

  3. Apply to Anthropic's Cyber Verification Program if your organization does legitimate security research, red-teaming, or penetration testing. Access to higher-capability models in this domain is now a significant professional advantage.

The Glasswing era of software security has begun. The organizations that understand the architecture behind these tools (not just that they exist, but how they work and how to deploy them effectively) will have a structural security advantage for the next decade.

The bugs are being found. The question is who finds them first.


Found this useful? Drop a ⭐ on the companion GitHub repo with the full harness implementation, contribute to the discussion in the comments, and share this with the security engineer on your team who hasn't heard about Project Glasswing yet.


Tags: llm-vulnerability-research generative-ai cybersecurity agentic-ai claude gpt-5 devsecops security-engineering zero-day project-glasswing