Apple Took 50 Years for 3 CEOs — GUI Agents Went from Paper to Production in One

Yesterday, Apple announced a landmark succession: Tim Cook steps down as CEO to become Executive Chairman, with John Ternus taking over on September 1. Apple is 50 years old, and since Steve Jobs returned in 1997 it has had just three CEOs: Jobs, Cook, Ternus.

Three people. Fifty years. Each transition spaced over a decade apart.

Now consider the AI Agent space: one year ago, most people were still debating whether AI could operate a computer at all. Today, there are open-source projects delivering usable on-device solutions.

This article breaks down the technical evolution of GUI Agents — using Mano-P, our open-source project, as a concrete example of what it takes to go from training to on-device deployment.

What Is a GUI Agent?

A GUI Agent's core mission: let AI operate a computer's graphical interface the way a human does — recognizing screen elements, understanding task intent, and executing clicks, typing, and drag-and-drop operations.

There are currently two main technical approaches:

Approach	Mechanism	Strength	Limitation
API/DOM-driven	Reads interface structure via accessibility APIs or DOM trees	Precise element targeting	Depends on app-specific interfaces
Pure vision	Understands UI from screenshots alone	Works across any application	Higher demand on visual comprehension

Mano-P takes the pure vision route. It is an on-device GUI Agent designed for Mac. "Mano" means "hand" in Spanish, and "P" stands for Person, as in AI for personal use. It runs entirely locally; no data leaves the device.

Training: Bidirectional Self-Reinforcement Learning

The training pipeline follows a three-stage progressive framework:

Stage 1: SFT (Supervised Fine-Tuning)
    ↓  Build foundational capabilities
Stage 2: Offline Reinforcement Learning
    ↓  Learn strategy optimization from historical data
Stage 3: Online Reinforcement Learning
    ↓  Continuously improve through real-environment interaction

Stage 1 — SFT: Supervised fine-tuning on high-quality GUI operation datasets. The model learns basic interface understanding and action mapping from ground-truth demonstrations, which gives the later stages a baseline to improve on.

Stage 2 — Offline RL: Uses collected interaction trajectories to optimize policies via reinforcement learning. Extracts success/failure signals from historical operations without requiring live environment interaction, keeping training costs manageable.
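
To make "extracting success/failure signals from historical operations" concrete, here is a minimal sketch of one common technique: spreading a terminal success or failure reward back over the steps of a logged trajectory as discounted returns. This is an illustration, not a description of Mano-P's actual training code; the function name, the reward values, and the discount factor are assumptions.

```python
def discounted_returns(rewards, gamma=0.9):
    """Turn per-step rewards from a logged trajectory into discounted returns.

    Hypothetical helper: steps closer to the final outcome receive more credit.
    """
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


# A logged three-step trajectory that only succeeded at the last step.
success = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
# A logged trajectory that failed at the last step.
failure = discounted_returns([0.0, 0.0, -1.0], gamma=0.5)
print(success)  # [0.25, 0.5, 1.0]
print(failure)  # [-0.25, -0.5, -1.0]
```

No environment is needed to compute these values, which is why this stage can run entirely on stored data.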

Stage 3 — Online RL: Interacts with real GUI environments, adjusting strategy based on live feedback. The key challenge here is balancing exploration (trying new operation paths) with exploitation (reinforcing proven strategies).
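
The exploration/exploitation trade-off can be sketched with epsilon-greedy action selection, a textbook strategy that is used here only as an illustration. Nothing in this example is taken from Mano-P's training code: the candidate operations, their success rates, and all function names are hypothetical.

```python
import random

# Hypothetical success rates of three ways to trigger "Save" in some app.
TRUE_SUCCESS = {"click_menu": 0.3, "keyboard_shortcut": 0.8, "right_click": 0.5}


def choose_action(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))        # explore: try any operation path
    return max(estimates, key=estimates.get)      # exploit: reuse the proven one


def run(steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = {action: 0.0 for action in TRUE_SUCCESS}
    counts = {action: 0 for action in TRUE_SUCCESS}
    for _ in range(steps):
        action = choose_action(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < TRUE_SUCCESS[action] else 0.0
        counts[action] += 1
        # Incremental sample average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


if __name__ == "__main__":
    estimates = run()
    print(max(estimates, key=estimates.get))
```

With epsilon set to 0 the agent never tries anything new, and with epsilon set to 1 it never reuses what it has learned; online RL has to sit somewhere in between.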

Inference: Think-Act-Verify Loop

The inference mechanism uses a think-act-verify cycle:

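# Pseudocode: model, capture_screen, execute, and task_context are illustrative names
while task_not_complete:
    # Think: capture the current screen and plan the next action
    screenshot = capture_screen()
    thought = model.think(screenshot, task_context)

    # Act: execute one GUI operation (click, type, scroll)
    action = model.act(thought)
    execute(action)

    # Verify: capture a new screenshot and check it against the state
    # the planned action was expected to produce
    new_screenshot = capture_screen()
    verified = model.verify(new_screenshot, expected_state)

    if not verified:
        # Record what went wrong so the next Think step can re-plan
        task_context.update(error_info)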

This gives the Agent self-correction capability. In real desktop environments, unexpected popups, loading delays, and dynamic element repositioning are common — the verify step catches these before errors cascade.
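
The self-correction behavior can be reproduced in a few runnable lines. The sketch below is a toy, not Mano-P's implementation: `FakeDesktop` stands in for the real screen, a hand-written `think` function stands in for the model, and every name is hypothetical. A popup blocks the first click, the verify step notices that nothing was saved, and the next think step re-plans.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDesktop:
    """Toy stand-in for a real screen: a popup covers the Save button."""
    popup_open: bool = True
    saved: bool = False

    def screenshot(self):
        return {"popup_open": self.popup_open, "saved": self.saved}

    def execute(self, action):
        if action == "dismiss_popup":
            self.popup_open = False
        elif action == "click_save" and not self.popup_open:
            self.saved = True  # the click only lands if nothing covers the button


@dataclass
class TaskContext:
    goal: str
    errors: list = field(default_factory=list)


def think(screen, ctx):
    """Return the next action and the state expected after it."""
    if ctx.errors and screen["popup_open"]:
        return "dismiss_popup", {"popup_open": False}
    return "click_save", {"saved": True}


def verify(screen, expected):
    return all(screen[key] == value for key, value in expected.items())


def run_task(env, ctx, max_steps=10):
    log = []
    for _ in range(max_steps):
        screen = env.screenshot()
        if screen["saved"]:
            return log
        action, expected = think(screen, ctx)       # Think
        env.execute(action)                         # Act
        ok = verify(env.screenshot(), expected)     # Verify
        log.append((action, ok))
        if not ok:
            ctx.errors.append(f"{action} did not produce {expected}")
    raise RuntimeError("step budget exhausted")


if __name__ == "__main__":
    for step in run_task(FakeDesktop(), TaskContext(goal="save the document")):
        print(step)
```

The step budget (`max_steps`) is a design choice of this sketch: without it, a task that can never be verified would loop forever.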

Core capabilities span four areas: complex GUI automation, cross-system data integration, long-task planning and execution, and intelligent report generation.

Benchmark Performance

OSWorld: Mano-P's 72B model achieves a 58.2% success rate, ranking #1 among specialized GUI agent models; the second-place model scores 45.0%. OSWorld simulates real OS environments with cross-application tasks including file operations, browser interactions, and office software workflows.

WebRetriever Protocol I: Mano-P scores 41.7 on NavEval, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3). This benchmark focuses on web information retrieval and interaction.

Edge Deployment: 4B Model Running On-Device

On-device deployment is a core feature of Mano-P. Here's the 4B quantized model (w4a16) performance on M4 Pro:

Metric	Value
Prefill Speed	476 tokens/s
Decode Speed	76 tokens/s
Peak Memory	4.3 GB
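
To get a feel for what these throughput numbers mean per agent step, here is a back-of-the-envelope latency estimate. The two speeds come from the table; the token counts in the example are made-up round numbers, since the actual prompt size of a screenshot plus task context is not stated, and the estimate ignores everything outside the model (screen capture, action execution).

```python
PREFILL_TOKENS_PER_S = 476  # prefill speed from the table
DECODE_TOKENS_PER_S = 76    # decode speed from the table


def step_latency_s(prompt_tokens, output_tokens):
    """Rough model time for one step: read the prompt, then generate the output."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_S
    decode_s = output_tokens / DECODE_TOKENS_PER_S
    return prefill_s + decode_s


# Hypothetical step: 952 prompt tokens in, 76 tokens of thought and action out.
print(step_latency_s(952, 76))  # 3.0 (2 s of prefill + 1 s of decode)
```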

The w4a16 quantization scheme — 4-bit weights with 16-bit activations — strikes a practical balance: 4-bit weights dramatically reduce memory footprint while 16-bit activations preserve numerical precision during inference.
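
The memory argument can be checked with simple arithmetic, and the core idea of 4-bit weights can be shown in a few lines. This is a simplified sketch under stated assumptions: "4B" is taken to mean roughly 4e9 parameters, the quantizer is a plain symmetric per-group scheme, and the function names are hypothetical. Mano-P's actual quantization recipe is not described in this article.

```python
def weight_gb(n_params, bits_per_weight):
    """Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_4bit(group):
    """Symmetric 4-bit quantization of one weight group: ints in [-8, 7] plus a scale."""
    scale = max(abs(w) for w in group) / 7 or 1.0  # fall back to 1.0 for an all-zero group
    quantized = [max(-8, min(7, round(w / scale))) for w in group]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate weights; activations would stay in 16-bit throughout."""
    return [q * scale for q in quantized]


print(weight_gb(4e9, 16))  # 8.0 GB at 16-bit
print(weight_gb(4e9, 4))   # 2.0 GB at 4-bit

weights = [1.4, -0.7, 0.0, 0.2]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error, at most scale / 2
```

Under these assumptions the 4-bit weights take about 2 GB; the rest of the reported 4.3 GB peak would then be scales, activations, caches, and runtime overhead, though the article does not break that figure down.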

Hardware requirement: Apple M4 chip + 32 GB RAM. Fully local execution — your screen data never leaves your device.

Getting Started

Open-sourced under the Apache 2.0 license:

# Install
brew tap HanningWang/tap && brew install mano-cua

GitHub: https://github.com/Mininglamp-AI/Mano-P

Wrapping Up

From the three-stage progressive training framework, to think-act-verify inference, to w4a16 quantization enabling edge deployment — the path from "concept" to "locally usable" GUI Agents is becoming clear.

Apple took 50 years and three leaders. The GUI Agent space went from academic papers to open-source tools in roughly one year. These are two fundamentally different timescales.

For developers, Mano-P — Apache 2.0 licensed, runnable on a local Mac — is already a starting point for exploration and experimentation.
