惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

N
News and Events Feed by Topic
Malwarebytes
Malwarebytes
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
C
Cybersecurity and Infrastructure Security Agency CISA
F
Future of Privacy Forum
C
Cisco Blogs
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
S
Securelist
K
Kaspersky official blog
S
Schneier on Security
T
ThreatConnect
T
Tenable Blog
Spread Privacy
Spread Privacy
T
True Tiger Recordings
AWS News Blog
AWS News Blog
F
Fox-IT International blog
量子位
T
Threatpost
V
Vulnerabilities – Threatpost
C
CERT Recently Published Vulnerability Notes
Cisco Talos Blog
Cisco Talos Blog
GbyAI
GbyAI
宝玉的分享
宝玉的分享
腾讯CDC
G
Google Developers Blog
aimingoo的专栏
aimingoo的专栏
Cyberwarzone
Cyberwarzone
有赞技术团队
有赞技术团队
S
SegmentFault 最新的问题
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
V
Visual Studio Blog
U
Unit 42
雷峰网
雷峰网
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Simon Willison's Weblog
Simon Willison's Weblog
O
OpenAI News
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
The GitHub Blog
The GitHub Blog
The Register - Security
The Register - Security
MyScale Blog
MyScale Blog
小众软件
小众软件
A
About on SuperTechFans
Last Week in AI
Last Week in AI
Y
Y Combinator Blog
博客园 - 三生石上(FineUI控件)
美团技术团队
Google Online Security Blog
Google Online Security Blog
P
Proofpoint News Feed
MongoDB | Blog
MongoDB | Blog

DEV Community

Is AI actually replacing developers? Customizing Docker Images: Write Your First Dockerfile (2026) €40 n8n vs 28% weekly Anthropic quota. Which /goal layer should you actually run? 04/20: Data Encapsulation: How a Message Becomes Bits on the Wire Hướng Dẫn Thiết Lập Reasoning Proxy DeepSeek V4-Pro với Cursor (2026) Sofi Log #012: Agentic GDP — Solana Pay.sh & x402 Protocol Spec Input Types, Attributes, Self-Closing Tags, Hover Effect Absolute vs Relative Paths File Types (Regular, Directory, Link, Device, Socket, Pipe) From Arduino IDE to AVR GCC | AVR Bare Metal #1 Using Bitcoin as collateral without wrapping it: the design of a BTC collateral vault Unreal Engine 5 Skill System Architecture using GAS and GameplayTags 5 Things I Wish I Knew Before Building with Hermes Agent Thoughts on Codingame 2026 Spring challenge OUT WITH THE OLD IN WITH THE NEW Why are simple 1099 tax calculators online so horribly bloated? So I built my own "Why You're Not Getting Callbacks (It's Not Your Skills)" # How I Built a Retail Demand Forecasting App with Python and Streamlit Why We Deliberately Crush Lithium Batteries (UN38.3 Crush Testing Explained) Command History & Completion The Three-Body Problem: AI Code, Supply Chain Attacks, and the Talent Exodus 로컬 LLM 셋업 가이드 (v27) Building Better .NET Worker Services with Cursor Rules Generate Professional PDF Invoices via REST API — JSON In, PDF Out Redis: Big Keys Destroem o Desempenho Compartilhado Agentic AI for Cybersecurity: Autonomous Threat Detection and Response How to Automate Android Without Appium Cron vs systemd daemon: which one for Node.js? Designing XSLT transforms with parameters and multiple inputs I Downloaded Gemma4:e2b On My Macbook in 2 steps Building an Autonomous SRE Agent: From Raw Telemetry to Safe, AI-Driven Remediation The EU AI Act in 2026: Reading the Law After the Omnibus I had zero coding knowledge. Here is "RetroTube", a 2010 YouTube sandbox prototype I built using AI! How to Validate Environment Variables in TypeScript (and Why You Should) I Built a CLI Tool That Writes Better Git Commits Than I Do Transfer Fees, Metadata, and Soulbound Tokens: My First Real Token Experiments on Solana Stop Using Fetch() in React: A Better Way To Call Your Backend Creando un Tetris con JavaScript VI: Complicando el juego. DeepSeek's API Price Cut Changed My Claude Code and ChatGPT Math [Boost] Perl 🐪 Weekly #774 - Perl is too HOT How to Track AI Usage Without Losing Revenue (Complete Guide) 77 Rules Later: What Graduating Our First Stack Actually Looked Like RAG 시스템 실전 구축 (v26) When Premature Scaling Leads to Operator Burnout Multi-Repo Microservice Changes Are a Coordination Problem. I Solved It With AI Agent Teams. The Next Frontier: How Multi-Agent Systems are Redefining Productivity The Kimwolf Bust Just Outed Android Webcams as Botnet Fodder — Here's the Question Every Repurposed-Phone Camera Setup Has to Answer I'm an autonomous AI agent. I shipped 18 fixes to myself in one session. Building a Secure Future with Zero Trust Security Architecture Asynchronous Functions in Dart How I migrated magic-link login from Resend to AWS SES + Lambda five days before launch Edge Computing He creado una empresa ficticia IT/OT para poder encontrar sus vulnerabilidades y reforzar su seguridad en sus activos críticos Why I Built @editora/react I built a tiny UGC script generator because hooks are the hardest part The Phone Is Becoming the New Terminal Why Most AI Music Tools Feel Wrong to Developers Goroutines vs. Promises: Why Go and JavaScript Look at Concurrency Completely Differently How I Use Antigravity 2.0 to Navigate Open-Source Codebases and Make Better Technical Decisions Understanding Basic HTML & CSS Concepts for Beginners Go Error Handling: Annoying or Awesome? Your To-Do List Doesn't Know You — So I Gave Mine Three Brains Shell Basics (Bash, Zsh, Sh) Free MongoDB GUI Tool for Developers, Students, and Teams Designing High-Performance Blockchain Indexers Choosing Models for an Agentic Chat App on Amazon Bedrock How Smart Growth Teams Automate Their Marketing Stack in 2026 (Without Hiring More People) What I Learned About Memory-Augmented AI Agents Seven Docker Tips Every Engineer Should Know (from Docker Captains) Welcome to the Fast-Food Era of Testing: Over-Weight by Tests How to use Claude in vscode? Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions Full Stack Projects Are Not Enough Anymore Virtualization & Cloud Basics Orakle: Turning Raw Blockchain Data into Intelligence with Gemma 4 Building an Autoposting Pipeline with Hermes Agent: Why Waterfall Beats Parallel, and the Edge Cases Nobody Talks About OpenShift Virtualization Migration Advisor — Local-First, Powered by Gemma 4 26B MoE WebMCP is coming — so I’m building webmcp.js I Disappeared for 4 Months After Launch - Here's What Brought Me Back Jira Is Turing-Complete (And You've Been Coding in It) NyayAI: Building an AI Legal Assistant for 1.4 Billion People — A Technical Deep Dive E-commerce Order Automation: Stripe + Invoice + Shipping Workflow How to Evaluate AI Agents: LLM-as-Judge Tutorial The Interview Prep Stack I Used as a Senior Software Engineer Targeting Big Tech Gemma4 Challenge OptiLearn - Powered by Google Gemma 4 Aura — The Gemma 4 Powered Agentic Web Copilot & Self-Healing Accessibility Engine I built a tool that catches misleading charts using Gemma 4 running locally Worklog companion with Gemma4 GBase: Building LLM Agents That Actually Learn from Their Mistakes Blossom — a small step toward student mental wellbeing WordPress Performance Monitoring: A Complete Guide Principal Components in TypeScript (Part 4) When three sharp wallets agree: what consensus signals on Polymarket actually mean I Built a Fail-Fast Rust Scheduler with Background OAuth Auto-Refresh (Part 2) Sharing is caring How Putting Faces (Literally) to My AI Garden Images Gave It a Personality Sofi Log #001: Thailand's Tourism Tax & the 180-Day AI Surveillance Wall Sofi Log #006: Decentralized IP-Address Obfuscation Specs
Harness Tells Your Agent What to Do. GUI Agents Let It Actually Do It.
Mininglamp · 2026-05-25 · via DEV Community

The Rise of Harness Engineering

Harness Engineering has become the defining conversation in AI agent development this quarter. Anthropic published "Effective Harnesses for Long-Running Agents." OpenAI released their own take on constraining agent behavior through software engineering practices. The thesis is straightforward: wrap your AI agent in a structured control layer—task routing, approval gates, verification loops, and retrospectives—so it behaves reliably over extended sessions.

The pattern makes intuitive sense. An unconstrained agent is a liability. A harnessed agent is a tool. The community has responded: open-source harness frameworks are emerging, giving teams reusable scaffolding for decision-level reliability.

But here's the question no one is asking loudly enough: after the harness decides what to do, how does the agent actually do it?

What Harness Solves

A harness framework operates at the decision layer. It answers:

  • What should the agent do next?
  • In what order should tasks execute?
  • When should it pause for human review?
  • How do we verify the outcome before moving on?

Think of it as the prefrontal cortex of your agent system—planning, sequencing, gating. Frameworks like cow-harness already provide open-source implementations of these patterns: task decomposition, approval workflows, retry logic, and audit trails.

This is genuinely valuable. Without a harness, agents hallucinate plans, skip steps, and compound errors. With one, they become predictable and auditable.

But predictable planning is not the same as reliable execution.

The Execution Gap

Consider a real scenario. Your harnessed agent determines the next action: "Open the CRM, navigate to the customer record for Acme Corp, and update the contract renewal date to June 15."

The harness has done its job. The decision is correct. The approval gate passed. Now... how does the agent physically perform this action?

Current execution methods each carry fundamental limitations:

CLI tools — Powerful but narrow. Only works for systems that expose command-line interfaces. Most enterprise software does not.

API calls — The gold standard when available. But many critical business systems—legacy ERPs, proprietary desktop apps, government portals—simply have no API. Or the API covers 20% of what the GUI exposes.

DOM manipulation — Works for web apps, breaks on desktop. Requires knowledge of the target app's internal structure. One frontend update can invalidate your selectors.

RPA scripts — The enterprise workaround. Record a macro, replay it. Brittle by nature: a single UI change—a moved button, a renamed field, a new modal dialog—breaks the entire flow. Maintenance cost scales linearly with the number of automations.

The common thread: all of these methods require a pre-existing technical interface to the target system. They assume the system was designed to be automated, or that someone has reverse-engineered a way in.

In enterprise reality, the most critical systems are often GUI-only black boxes. No API. No CLI. No stable DOM. Just a screen that a human clicks through.

This is the execution gap. Harness frameworks have nothing to say about it.

Vision-Based GUI Agents as the Execution Layer

What if the agent could interact with software the same way a human does—by looking at the screen and clicking?

That's exactly what vision-based GUI agents do:

  1. Input: A screenshot of the current screen state
  2. Understanding: A vision-language model identifies UI elements—buttons, text fields, menus, labels—and comprehends their spatial relationships and semantic meaning
  3. Output: Precise mouse coordinates and keyboard actions to accomplish the intended task

The key property: zero dependency on target system internals. The agent doesn't need an API, a DOM tree, or accessibility hooks. It sees pixels and acts on them. This works across:

  • Web applications
  • Native desktop software
  • Remote desktop sessions
  • Terminal UIs
  • Even systems running in virtual machines

If a human can operate it by looking at a monitor, a vision-based GUI agent can too.

Putting It Together: Harness + GUI Agent

This is where the architecture becomes complete. The harness provides the brain—deciding what to do, when to pause, how to verify. The GUI agent provides the hands—executing actions on any visual interface.

Mano-P is an open-source GUI agent built for exactly this role. Developed by Mininglamp Technology under the Apache 2.0 license, Mano-P implements a Vision-Language-Action (VLA) architecture designed to serve as the execution layer in agentic systems.

The name encodes the philosophy: "Mano" is Spanish for "hand"—the part that acts. "P" stands for Private—your data never leaves the device.

Architecture: Think-Act-Verify

Mano-P operates through an inference loop that mirrors how a careful human operator works:

  1. Think — Observe the current screen state, reason about what UI elements are present, and determine the next action
  2. Act — Execute the precise mouse/keyboard operation
  3. Verify — Capture the resulting screen state and confirm the action had the intended effect

This loop provides built-in error detection. If a click lands on the wrong element or a form doesn't submit, the verify step catches it immediately—enabling retry or escalation back to the harness layer.

On-Device Performance

Mano-P is designed for local execution. The quantized 4B model runs on consumer hardware:

  • Minimum: Apple M4 chip + 32GB RAM (Mac mini or MacBook)
  • Performance on M5 Pro: ~80 tokens/s decode speed

The Cider SDK provides W8A8 activation quantization, delivering approximately 12.7% prefill acceleration compared to the W8A16 baseline, and 1.4x–2.2x prefill speedup versus MLX native W4A16. This means real-time interaction with GUIs—no cloud round-trip, no latency spikes.

Benchmark Results

On the OSWorld benchmark—the standard evaluation for GUI agent capabilities across real operating system tasks—Mano-P 1.0-72B achieved a 58.2% success rate, ranking #1 among specialized GUI agent models.

For web navigation specifically, the WebRetriever Protocol I achieved a 41.7 NavEval score, demonstrating reliable multi-step web interaction.

Mano-AFK: The Full Automation Loop

To demonstrate how harness-level planning connects to GUI-level execution, Mininglamp Technology built Mano-AFK—an end-to-end autonomous development pipeline:

Natural language requirementPRD generationArchitecture designCode generationDeploymentE2E testing (Mano-P's visual model drives the browser to test the deployed app) → Bug detectionFixRetest

This is the harness + GUI agent pattern in its most complete form. The planning layer decomposes a vague requirement into structured development phases. The GUI agent handles the parts that require visual interaction—browser testing, UI verification, visual bug detection—without any test framework dependencies.

Privacy by Design

In local execution mode, all processing happens on-device. Screenshots are captured and analyzed locally. Model inference runs locally. No data transits to external servers. For organizations handling sensitive information—financial records, medical data, classified documents—this is not a feature. It's a requirement.

Honest Limitations

No technology is universally optimal. Vision-based GUI agents have real tradeoffs:

Overhead on simple web tasks — For well-structured web applications with clean APIs or stable DOM trees, direct API calls or DOM manipulation will always be faster than screenshot-based interaction. If you have a good API, use it.

Accuracy ceiling on complex UIs — The 4B on-device model handles standard interfaces well but can struggle with extremely dense or unconventional UI layouts. The 72B model pushes accuracy significantly higher but requires more compute.

Best suited for specific scenarios:

  • Legacy enterprise systems with no API
  • Cross-platform automation spanning web and desktop
  • Data-sensitive workflows requiring strictly local execution
  • Systems where UI changes frequently (vision adapts; scripts break)
  • Remote desktop environments where DOM access is impossible

The right architecture uses the right tool for each target. API calls where APIs exist. DOM methods for stable web apps. And vision-based GUI agents for everything else—which, in most enterprises, is a surprisingly large surface.

Conclusion

The AI agent stack is crystallizing into two distinct layers:

The Brain — Harness frameworks that constrain, route, verify, and audit agent decisions. This is a solved problem with active open-source development.

The Hands — Execution layers that translate decisions into physical actions on real systems. For GUI-bound systems, vision-based agents are the only approach that scales without per-system integration work.

Harness tells the agent what to do. GUI agents let it actually do it. Together, they close the automation loop.

Mano-P is Apache 2.0 licensed and available on GitHub: https://github.com/Mininglamp-AI/Mano-P


Feedback and contributions welcome.⭐