The VLA Testing Pipeline in Mano-AFK: When AI Agents QA Their Own Work
Mininglamp · 2026-05-21 · via DEV Community

AI coding tools have gotten remarkably good at generating code. You describe what you want, and within minutes you have functions, components, even entire applications scaffolded out. But there's a question that rarely gets asked in the excitement: who tests it?

Writing code accounts for maybe 30% of shipping software. The remaining 70% (defining requirements, deploying, testing, finding bugs, fixing them, and verifying the fixes) is where most projects quietly stall. Most AI coding assistants today stop at some variation of "here's the code, good luck." The developer is still left to deploy it, test it manually, discover the bugs, explain the bugs back to the AI, wait for fixes, and re-test.

That workflow isn't autonomous development. It's autocomplete with extra steps.

The Testing Gap Nobody Talks About

Most engineering teams rely on a layered testing strategy: linting catches syntax errors, unit tests verify individual functions, and API tests confirm that endpoints return the right data. These layers are well-understood, well-automated, and widely adopted.

But here's the uncomfortable reality: all three can pass while the application is completely broken for end users.

A button's onClick handler might correctly call an API endpoint that returns valid JSON — and the unit test, API test, and linter will all report green. Meanwhile, the button itself is hidden behind a CSS overflow, or renders off-screen on mobile, or navigates to a blank page because the frontend routing is misconfigured. The backend works. The tests pass. The user sees nothing.

This is the E2E testing gap. It's the difference between "the code compiles" and "the software ships." And it's the hardest layer to automate, because it requires something most test frameworks don't have: the ability to actually look at the application and interact with it the way a human would.

Why Traditional E2E Testing Falls Short

Tools like Selenium and Playwright have been the go-to for browser-based E2E testing for years. They work by programmatically controlling a browser through DOM selectors — clicking elements by their CSS class, filling inputs by their HTML id, asserting text content by XPath.

The problem is fragility. DOM-based selectors break whenever the UI changes. A designer renames a class, a framework update restructures the component tree, a developer switches from a <div> to a <button> — and the entire test suite fails, not because the application is broken, but because the selectors are stale.
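To make the failure mode concrete, here is a minimal sketch using only Python's standard library rather than Selenium or Playwright. The markup, the `cta` class name, and the helper functions are all hypothetical. It shows a lookup keyed on a CSS class going stale after a refactor, while a lookup keyed on the visible label (a rough stand-in for what a user sees, not a VLA model) still succeeds.

```python
from html.parser import HTMLParser


class ElementIndex(HTMLParser):
    """Collects tag, classes, and text for simple, well-nested markup."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[dict] = []
        self._stack: list[dict] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        element = {"tag": tag, "classes": classes, "text": ""}
        self.elements.append(element)
        self._stack.append(element)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost open element.
        if self._stack:
            self._stack[-1]["text"] += data.strip()


def index(html: str) -> list[dict]:
    parser = ElementIndex()
    parser.feed(html)
    parser.close()
    return parser.elements


def find_by_class(html: str, css_class: str) -> bool:
    """Selector-style lookup: depends on an implementation detail."""
    return any(css_class in el["classes"] for el in index(html))


def find_by_label(html: str, label: str) -> bool:
    """Label-style lookup: depends only on the text the user reads."""
    return any(el["text"] == label for el in index(html))


# Hypothetical markup before and after a refactor that renames the class
# and swaps <div> for <button>. The visible label is unchanged.
BEFORE = '<div class="btn-primary">Submit</div>'
AFTER = '<button class="cta">Submit</button>'

print(find_by_class(BEFORE, "btn-primary"), find_by_class(AFTER, "btn-primary"))
print(find_by_label(BEFORE, "Submit"), find_by_label(AFTER, "Submit"))
```

The class-based lookup succeeds on the old markup and fails on the new one even though nothing changed for the user, which is exactly the kind of false failure described above.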

This creates a maintenance burden that grows with the size and complexity of the application. Large teams often assign QA engineers full-time just to keep Selenium tests from becoming red noise. Smaller teams often skip E2E testing altogether.

There's a more fundamental issue, too. DOM-based testing can only verify what's programmatically accessible. It can check that a text node contains "Success" but it can't tell you that the success message is rendered in white text on a white background. It can verify that an image element exists but not that the image actually loaded. It operates on structure, not on what the user actually sees.
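The white-on-white case is easy to quantify once you have pixel colors. The sketch below uses the WCAG 2.x contrast-ratio formula. The function names are illustrative, and sampling the text and background colors from a screenshot is assumed to happen elsewhere. It is not a description of how any particular VLA model works internally.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Ratio between 1.0 (identical colors) and about 21 (black on white)."""
    lum_fg, lum_bg = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_fg, lum_bg), min(lum_fg, lum_bg)
    return (lighter + 0.05) / (darker + 0.05)


def is_readable(fg, bg, minimum: float = 4.5) -> bool:
    """4.5:1 is the WCAG AA threshold for normal-size text."""
    return contrast_ratio(fg, bg) >= minimum


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
print(contrast_ratio(WHITE, WHITE))   # 1.0: the message exists but is invisible
print(is_readable(BLACK, WHITE))      # True
```

A DOM assertion on the text node passes in both cases. Only a check on rendered colors distinguishes them.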

VLA: Giving Agents Eyes

Vision-Language-Action (VLA) models change this equation. A VLA model takes a screenshot of the application, understands what it sees through visual reasoning, and generates concrete actions — click coordinates, text input, scroll directions — based on that understanding.

The key difference from DOM-based automation: VLA operates on pixels, not selectors. It doesn't need to know that the "Submit" button is a <button class="btn-primary">. It sees a button labeled "Submit" and clicks it, much as a human tester would. If the button moves to a different position on the page, the VLA model can still find it. If the framework changes from React to Vue but the rendered interface stays the same, the tests should still work.
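The observe, decide, act cycle described here can be sketched as a small loop. Everything below is an assumption made for illustration: the `Action` shape, the `Policy` signature, and the `FakeApp` stand-in are hypothetical, and the stub policy replaces what would be a real VLA model call. The point is only the control flow, in which the loop never touches a selector.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str          # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


# Hypothetical interfaces: a real setup would capture actual pixels and
# call a VLA model; here they are plain callables so the loop can run.
Screenshot = bytes
Policy = Callable[[Screenshot, str], Action]


def run_visual_task(goal: str,
                    take_screenshot: Callable[[], Screenshot],
                    policy: Policy,
                    execute: Callable[[Action], None],
                    max_steps: int = 20) -> bool:
    """Observe, decide, act; stop when the policy reports the goal is met."""
    for _ in range(max_steps):
        action = policy(take_screenshot(), goal)
        if action.kind == "done":
            return True
        execute(action)
    return False


class FakeApp:
    """Stand-in for a browser: one Submit button at a fixed position."""

    def __init__(self) -> None:
        self.submitted = False
        self.log: list[Action] = []

    def screenshot(self) -> Screenshot:
        return b"screen:sent" if self.submitted else b"screen:form"

    def execute(self, action: Action) -> None:
        self.log.append(action)
        if action.kind == "click" and (action.x, action.y) == (120, 340):
            self.submitted = True


def stub_policy(screenshot: Screenshot, goal: str) -> Action:
    """Pretend model: clicks Submit until the screen shows the form was sent."""
    if screenshot == b"screen:sent":
        return Action("done")
    return Action("click", x=120, y=340)


app = FakeApp()
succeeded = run_visual_task("submit the form", app.screenshot, stub_policy, app.execute)
print(succeeded, len(app.log))
```

The `max_steps` cap is a design choice worth keeping in any real version: a model that never reports completion should produce a failed test, not a hung pipeline.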

This makes VLA-based testing more robust to UI refactoring than selector-based approaches. It also enables something selector-based assertions cannot do on their own: visual validation. A VLA model can verify that a chart actually renders with the correct data, that a color-coded status indicator is the right color, that a modal overlay is visible and properly positioned. It tests what the user experiences, not what the DOM describes.

[Figure: Benchmark Overview. Mano-P's benchmark performance across multiple evaluation dimensions, including GUI grounding and visual understanding tasks.]

The Full Pipeline: Build → Test → Fix → Repeat

Individual testing capability is useful. But the real value emerges when visual testing becomes part of a fully autonomous development pipeline — where an AI agent doesn't just write code, but also deploys it, tests it with real browser interactions, and fixes whatever breaks.

Here's what that pipeline looks like in practice:

Step 1: Requirements first. Before a single line of code is written, a structured PRD (Product Requirements Document) is generated with acceptance criteria (ACs). Every test case traces back to a specific requirement, and every bug fix maps to an AC number. This targets one of the most common failure modes of AI-generated code: "it works, but it doesn't match the intent."

Step 2: Build and deploy. Code is generated, dependencies are installed, and the application is deployed to a local development server — all without human intervention.

Step 3: Layered testing. The pipeline runs lint checks first (fast, catches syntax issues), then API tests (verifies backend logic), then E2E tests using a VLA model to open the app in a browser, navigate through user flows, and verify that the interface matches the acceptance criteria.

Step 4: Fix loop. When tests fail, the agent reads the failure report, inspects the relevant code, makes targeted fixes, re-deploys, and re-tests. This loop can run for multiple iterations — catching not just the initial bug but also regressions introduced by the fix itself.

The entire cycle — from "build me a budget tracker" to "here's your running app with a test report" — runs without human involvement.
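Steps 3 and 4 can be sketched as two small functions. This is an illustrative outline, not Mano-AFK's code: each layer is assumed to be a callable that returns a list of failure messages, stopping at the first failing layer is an assumption about how "lint first, then API, then E2E" is applied, and `apply_fix` stands in for the agent editing code and redeploying.

```python
from typing import Callable

# Hypothetical shape: (layer name, callable returning failure messages).
Layer = tuple[str, Callable[[], list[str]]]


def run_layers(layers: list[Layer]) -> list[str]:
    """Run cheap layers first; stop at the first layer that reports failures."""
    for name, check in layers:
        failures = check()
        if failures:
            return [f"{name}: {msg}" for msg in failures]
    return []


def fix_loop(layers: list[Layer],
             apply_fix: Callable[[list[str]], None],
             max_rounds: int = 10) -> tuple[int, list[str]]:
    """Test, fix, re-test until green or until max_rounds fixes were tried."""
    failures = run_layers(layers)
    rounds = 0
    while failures and rounds < max_rounds:
        apply_fix(failures)      # a real agent would edit code and redeploy here
        rounds += 1
        failures = run_layers(layers)
    return rounds, failures


# Simulated project state: one backend bug and one visual bug.
open_bugs = {"api": ["AC-1 total not updated"], "e2e": ["AC-3 button off-screen"]}
layers: list[Layer] = [
    ("lint", lambda: []),
    ("api", lambda: list(open_bugs["api"])),
    ("e2e", lambda: list(open_bugs["e2e"])),
]


def apply_fix(failures: list[str]) -> None:
    """Pretend fix: resolve the first reported failure."""
    layer, _, message = failures[0].partition(": ")
    open_bugs[layer].remove(message)


print(fix_loop(layers, apply_fix))   # (2, [])
```

Because every round re-runs the layers from the top, a fix that breaks an earlier layer is caught on the next pass. The default of 10 rounds mirrors the cap the article later gives for Mano-AFK.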

Adversary Review: Why the Builder Shouldn't Test Itself

There's a well-known principle in software engineering: the person who writes the code shouldn't be the only one testing it. Developers have blind spots about their own work. They unconsciously avoid testing the edge cases they didn't think of during implementation.

The same principle applies to AI agents. When a single agent builds and tests, it tends to generate tests that validate its own assumptions rather than challenging them. The tests pass not because the code is correct, but because the tests are aligned with the same reasoning that produced the code.

A more robust approach uses separation of concerns:

  • A Build Agent writes the code, handles deployment, and fixes bugs
  • An Adversary Agent independently reviews the PRD and source code to find problems the builder missed
  • A Main Agent triages each finding through code inspection, API tests, or E2E verification

The adversary operates without knowledge of the builder's implementation decisions. It reads the requirements, reads the code, and asks: "What could go wrong that the builder didn't consider?" This catches usability gaps, data integrity issues, inconsistent behavior across features, and missing edge cases that automated tests alone would miss.
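As a rough sketch of the triage step, the Main Agent's routing could look like the following. The `layer` field on a finding and the routing table are assumptions made for illustration. The article names the three verification methods but not the rule that chooses between them.

```python
from dataclasses import dataclass
from enum import Enum


class Verification(Enum):
    CODE_INSPECTION = "code inspection"
    API_TEST = "API test"
    E2E = "E2E verification"


@dataclass
class Finding:
    summary: str
    layer: str   # hypothetical tag set by the adversary: "ui", "api", or "logic"


def triage(finding: Finding) -> Verification:
    """Pick the cheapest check that can confirm or reject the finding."""
    routes = {"ui": Verification.E2E, "api": Verification.API_TEST}
    # Anything that is neither UI nor API is checked by reading the code.
    return routes.get(finding.layer, Verification.CODE_INSPECTION)


findings = [
    Finding("Delete button has no confirmation dialog", "ui"),
    Finding("Negative amounts are accepted by POST /expenses", "api"),
    Finding("Monthly total ignores the current day", "logic"),
]
for f in findings:
    print(f"{triage(f).value}: {f.summary}")
```

Routing each finding to a concrete check is what lets the pipeline reject adversary false positives instead of acting on every suspicion.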

Self-Evolution: Getting Smarter Over Projects

Most AI coding tools treat every project as a fresh start. The context window resets, lessons from previous sessions are lost, and the same mistakes get repeated.

A self-evolving pipeline maintains persistent knowledge across projects through two mechanisms:

  • Build rules — When a bug takes multiple fix iterations to resolve, the lesson is extracted and applied to all future projects. "Always add loading states to async data fetches" isn't a generic best practice; it's a specific rule learned from a specific failure.

  • Preference accumulation — Layout patterns, color schemes, component choices, and architectural preferences converge over time. The tenth project reflects accumulated understanding of what the developer actually wants, not just what they described in a single prompt.

This is a meaningful shift from stateless code generation to something that develops institutional memory.
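A minimal sketch of the build-rules mechanism, assuming the rules live in a JSON file and that "multiple fix iterations" means two or more. The file format, the function names, and the threshold are illustrative choices, not Mano-AFK's actual storage scheme.

```python
import json
from pathlib import Path


def load_rules(rules_path: Path) -> list[str]:
    """Return the persisted build rules, or an empty list on first use."""
    if not rules_path.exists():
        return []
    return json.loads(rules_path.read_text(encoding="utf-8"))


def record_lesson(rules_path: Path, lesson: str, fix_rounds: int,
                  threshold: int = 2) -> bool:
    """Persist a lesson only if the bug needed at least `threshold` fix rounds.

    Returns True when the lesson qualifies as a build rule.
    """
    if fix_rounds < threshold:
        return False          # fixed in one round: not worth a permanent rule
    rules = load_rules(rules_path)
    if lesson not in rules:   # avoid storing the same rule twice
        rules.append(lesson)
        rules_path.write_text(json.dumps(rules, indent=2), encoding="utf-8")
    return True


def rules_as_prompt(rules_path: Path) -> str:
    """Format stored rules for inclusion in the next project's build prompt."""
    return "\n".join(f"- {rule}" for rule in load_rules(rules_path))
```

For example, `record_lesson(Path("build_rules.json"), "Always add loading states to async data fetches", fix_rounds=3)` would store the rule, and `rules_as_prompt` would feed it into the next project. Gating on fix rounds keeps the rule set small, so it records what was actually hard rather than every bug ever seen.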

Mano-AFK: An Open-Source Implementation

At Mininglamp, we built Mano-AFK as an open-source implementation of this full pipeline. It takes a natural language description, generates a PRD with acceptance criteria, builds the application, deploys it locally, runs layered testing (lint → API → E2E → adversary review), and iterates through fix loops — up to 10 rounds — until all tests pass or a detailed report is generated.
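The article does not show Mano-AFK's PRD or report format. As a hedged sketch of how acceptance criteria and the final report could be tied together, the structure below links every test result to an AC id and summarizes status per criterion. The class names, field names, and sample criteria for a budget tracker are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AcceptanceCriterion:
    ac_id: str
    text: str


@dataclass
class CaseResult:
    name: str
    ac_id: str      # the criterion this test case traces back to
    passed: bool


def build_report(criteria: list[AcceptanceCriterion],
                 results: list[CaseResult]) -> dict[str, str]:
    """Map each AC to 'pass', 'fail', or 'untested'."""
    report: dict[str, str] = {}
    for ac in criteria:
        linked = [r for r in results if r.ac_id == ac.ac_id]
        if not linked:
            report[ac.ac_id] = "untested"   # a requirement no test covers
        elif all(r.passed for r in linked):
            report[ac.ac_id] = "pass"
        else:
            report[ac.ac_id] = "fail"
    return report


criteria = [
    AcceptanceCriterion("AC-1", "User can add an expense"),
    AcceptanceCriterion("AC-2", "Monthly total updates after each change"),
    AcceptanceCriterion("AC-3", "Expenses can be filtered by category"),
]
results = [
    CaseResult("add expense via form", "AC-1", True),
    CaseResult("total after add", "AC-2", True),
    CaseResult("total after delete", "AC-2", False),
]
print(build_report(criteria, results))
```

Reporting "untested" separately from "fail" matters here: a green run says little if some criterion was never exercised.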

The E2E testing layer is powered by Mano-P, Mininglamp's on-device VLA model. Mano-P runs entirely on local hardware: the 4B quantized model achieves a decode speed of 76 tokens/s on an M4 Pro with a peak memory of just 4.3 GB. No screenshots leave the device, no API keys are required, and there is no per-test cost. It uses pure vision to understand GUIs without relying on DOM parsing or accessibility trees, which means it works across web apps, desktop software, and any other application with a visual interface.

[Figure: GUI Agent Grounding Benchmark. Mano-P's GUI grounding benchmark results. The ability to accurately locate and interact with UI elements is foundational to reliable visual testing.]

For teams that prefer cloud-based testing, Mano-AFK also supports Claude CUA as an alternative backend. The local mode with Mano-P is recommended for development workflows where privacy, latency, and cost matter.

What This Means for Development Workflows

The combination of VLA-based visual testing, adversary review, and self-evolving build rules points toward a future where "AI-assisted development" means more than code generation. It means AI agents that can participate in the full software lifecycle, including the roughly 70% of the work that lies outside writing the code.

We're still early. VLA models aren't perfect at visual understanding, adversary review can produce false positives, and self-evolution needs many project cycles to show meaningful improvement. But the direction is clear: autonomous development pipelines that close the loop between writing code and shipping software.

Both Mano-AFK and Mano-P are open source and available on GitHub. If this approach to autonomous testing resonates with your workflow, we'd welcome you to try them out and share your experience. ⭐