Stop Wasting Tokens on Android Automation
Elliot Gao · 2026-05-24 · via DEV Community

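Most LLM-driven Android automation starts by showing the model the screen.

That sounds reasonable. A person looks at the phone, decides what to tap, and taps it. So give the model the same view.

But "the same view" is expensive.

A full screenshot is expensive. A raw Android UI XML dump is expensive too, just more quietly. The model reads thousands of tokens of layout machinery to reach the handful of labels that matter: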

Email
Password
Continue

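For a single step, the cost is easy to ignore. A 50-step mobile agent trajectory turns it into the bill.

The loop

An Android agent typically does this:

  1. Observe the current screen.
  2. Decide what to do.
  3. Tap, type, or swipe.
  4. Wait for the next screen.
  5. Repeat.

The first step is where the token leak starts.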
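
As a rough sketch of that loop, assuming nothing about any particular agent framework: `observe`, `decide`, and `act` below are hypothetical callables supplied by the caller, and the size of each observation is measured in characters purely as a stand-in for tokens. The point is only that the observation cost is paid again on every iteration.

```python
from typing import Callable


def run_agent(
    observe: Callable[[], str],    # hypothetical: returns the screen as text
    decide: Callable[[str], str],  # hypothetical: LLM call, returns an action
    act: Callable[[str], None],    # hypothetical: performs the tap / type / swipe
    max_steps: int = 50,
) -> int:
    """Run the observe-decide-act loop; return total characters shown to the model."""
    chars_sent = 0
    for _ in range(max_steps):
        screen = observe()           # step 1: paid on every iteration
        chars_sent += len(screen)
        action = decide(screen)      # step 2
        if action == "done":         # illustrative stop signal, an assumption
            break
        act(action)                  # step 3; waiting (step 4) is left to act()
    return chars_sent
```

With 50 iterations, whatever `observe` returns is multiplied by 50, which is why the shape of that one return value dominates the cost.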

If you use uiautomator dump, the model gets XML like this:

<node index="0" text="" resource-id=""
      class="android.widget.FrameLayout"
      package="com.google.android.apps.nexuslauncher"
      content-desc=""
      checkable="false" checked="false"
      clickable="false" enabled="true"
      focusable="false" focused="false"
      scrollable="false" long-clickable="false"
      password="false" selected="false"
      bounds="[0,0][1440,3120]">

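That is a single layout node. It says very little, and nothing in it gives an agent something to act on.

This is not UIAutomator's fault. The XML is a faithful serialization of the accessibility tree. Faithful is not the same as useful.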

The numbers

On a few ordinary Android screens, the difference looks like this:

Screen           UIAutomator XML  Handsets hs ui -i  Reduction
Launcher         3,153 tokens     246 tokens         12.8x
Settings         5,762 tokens     729 tokens         7.9x
Settings > Apps  4,050 tokens     320 tokens         12.7x

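Token counts come from tiktoken using the GPT-4 encoding. The details are in the write-up on exposing an Android UI to LLMs.

The short version: one screen costs 4,000 to 6,000 tokens as XML, but only a few hundred as a list of actions.

Across 50 steps, that is the difference between sending about 250,000 tokens of screen state and sending 25,000 to 40,000.

Either way, the agent usually makes the same decision.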
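
The arithmetic behind those figures is short enough to check by hand. The sketch below copies the per-screen counts from the table and assumes, as a simplification, that every step of a 50-step trajectory sends one screen of the same size.

```python
# Per-screen token counts copied from the table: (UIAutomator XML, hs ui -i).
SCREENS = {
    "Launcher": (3153, 246),
    "Settings": (5762, 729),
    "Settings > Apps": (4050, 320),
}
STEPS = 50  # trajectory length used in the text


def reduction(xml_tokens: int, compact_tokens: int) -> float:
    """How many times smaller the compact form is, to one decimal place."""
    return round(xml_tokens / compact_tokens, 1)


def trajectory_cost(tokens_per_screen: int, steps: int = STEPS) -> int:
    """Total screen-state tokens if every step sends one screen of this size."""
    return tokens_per_screen * steps


for name, (xml, compact) in SCREENS.items():
    print(
        f"{name}: {reduction(xml, compact)}x smaller, "
        f"{trajectory_cost(xml):,} vs {trajectory_cost(compact):,} tokens over {STEPS} steps"
    )
```

A 5,000-token screen repeated 50 times is 250,000 tokens; a 500 to 800 token screen repeated 50 times is 25,000 to 40,000.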

What the model actually needs

For UI automation, the model does not need a DOM-shaped tree.

It needs a list of things it can do:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

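This list gives the model facts it can use:

  • Which action is available.
  • What label the user sees.
  • What kind of control it is.
  • Where the tool will tap or type.

Now the model can answer: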

tap "Continue"

It does not have to parse layout ancestors, false booleans, fully qualified class names, or four-number bounds rectangles.
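
On the tool side, turning a reply such as tap "Continue" back into a tap point is a lookup by label. The following is a minimal sketch: the line grammar is inferred from the three example lines in this article and is not a documented Handsets format, and `Action`, `parse_actions`, and `resolve` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    verb: str
    kind: str
    label: str
    ident: str
    x: int
    y: int


# verb, control type, quoted label, #id, "x,y"; trailing flags like [password] are ignored.
LINE_RE = re.compile(r'^(\w+)\s+(\w+)\s+"([^"]*)"\s+#(\S+)\s+(\d+),(\d+)')
REPLY_RE = re.compile(r'^(\w+)\s+"([^"]*)"')


def parse_actions(text: str) -> list[Action]:
    """Parse action lines; lines that do not match the assumed grammar are skipped."""
    actions = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            verb, kind, label, ident, x, y = m.groups()
            actions.append(Action(verb, kind, label, ident, int(x), int(y)))
    return actions


def resolve(reply: str, actions: list[Action]) -> Optional[Action]:
    """Match a model reply to a listed action with the same verb and label."""
    m = REPLY_RE.match(reply.strip())
    if not m:
        return None
    verb, label = m.groups()
    for action in actions:
        if action.verb == verb and action.label == label:
            return action
    return None


UI = """
fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860
"""
target = resolve('tap "Continue"', parse_actions(UI))
print(target)
```

Returning None for an unknown label lets the caller re-prompt instead of tapping a guessed location.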

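The rule

For LLM tool output, the optimization rule is simple:

Do not serialize facts the model cannot use for its next action.

Android XML breaks this rule constantly:

  • clickable="false" on nodes the agent will never touch.
  • enabled="true" on almost every node.
  • Empty FrameLayout and LinearLayout containers.
  • Fully qualified class names such as android.widget.TextView.
  • Bounds rectangles when the agent only needs a tap point.
  • JSON-style repeated key names when the reader is a language model, not a parser.

Handsets drops defaults, shortens names, computes centers, and keeps identifiers.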
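
To make those four moves concrete, here is a minimal sketch of the same idea over a UIAutomator-style dump. It is not the Handsets implementation: the filtering rules, the output spacing, and the sample XML (including the com.example resource ids) are assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def center(bounds: str) -> tuple[int, int]:
    """Collapse a '[x1,y1][x2,y2]' rectangle to a single tap point."""
    x1, y1, x2, y2 = map(int, BOUNDS_RE.fullmatch(bounds).groups())
    return (x1 + x2) // 2, (y1 + y2) // 2


def to_action_lines(xml_text: str) -> list[str]:
    lines = []
    for node in ET.fromstring(xml_text.strip()).iter("node"):
        cls = node.get("class", "").rsplit(".", 1)[-1]       # shorten the class name
        is_input = cls == "EditText"
        if not is_input and node.get("clickable") != "true":
            continue                                          # nothing to act on
        if node.get("enabled") == "false":
            continue
        label = node.get("text") or node.get("content-desc") or ""
        if not label:
            continue                                          # no human-visible label
        ident = node.get("resource-id", "").rsplit("/", 1)[-1]
        x, y = center(node.get("bounds"))
        verb = "fill" if is_input else "tap"
        line = f'{verb}  {cls}  "{label}"  {"#" + ident if ident else "-"}  {x},{y}'
        if node.get("password") == "true":
            line += "  [password]"                            # keep only the non-default flag
        lines.append(line)
    return lines


# Invented sample data, shaped like a uiautomator dump.
SAMPLE = """
<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" resource-id="" content-desc=""
        clickable="false" enabled="true" password="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.EditText" text="Email" resource-id="com.example:id/email"
          content-desc="" clickable="true" enabled="true" password="false"
          bounds="[40,500][1040,580]" />
    <node class="android.widget.Button" text="Continue" resource-id="com.example:id/continue"
          content-desc="" clickable="true" enabled="true" password="false"
          bounds="[40,820][1040,900]" />
  </node>
</hierarchy>
"""

for line in to_action_lines(SAMPLE):
    print(line)
```

The empty FrameLayout disappears entirely, and each surviving node is reduced to one line with a verb, a short type, a label, an id, and a point.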

The result is not a minified XML file. It is a different interface:

hs ui
hs tap "Continue"
hs wait "Dashboard"

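Screenshots are still useful

This is not an argument against screenshots.

Screenshots are useful when layout matters, when visual state matters, or when the app shows important information that has no accessible label.

But a screenshot is not a good default for every step. It is large, slow to move around, and often forces the model to do OCR-like work to read text that Android already exposes.

The better order is: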

hs ui > /tmp/screen.txt
hs see --size 768 /tmp/screen.jpg   # only when visual context matters

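Give the model the text UI first. Add the image only when the text is not enough.

This usually saves tokens and makes the trace easier to inspect.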
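
One way to wire up that ordering is sketched below. The trigger used here, falling back to an image when the text UI exposes no quoted label at all, is an assumption made for illustration; the article does not prescribe a specific rule, and `take_screenshot` is a hypothetical callable.

```python
import re
from typing import Callable

LABEL_RE = re.compile(r'"[^"]+"')  # a quoted, non-empty label


def needs_screenshot(ui_text: str, min_labeled: int = 1) -> bool:
    """Assumed heuristic: request an image only when too few lines carry a label."""
    labeled = [line for line in ui_text.splitlines() if LABEL_RE.search(line)]
    return len(labeled) < min_labeled


def build_observation(ui_text: str, take_screenshot: Callable[[], bytes]) -> dict:
    """Always include the text UI; call take_screenshot only on fallback."""
    observation = {"text": ui_text}
    if needs_screenshot(ui_text):
        observation["image"] = take_screenshot()
    return observation
```

Because the screenshot callable runs only on fallback, the common case pays for text alone and the trace stays readable as plain lines.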

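Why this matters more for agents than for tests

Traditional mobile testing did not care much about token counts. A test runner does not spend reasoning effort reading XML.

LLM agents are different. Every step of the loop has a context budget and a cost. If half the prompt is a UI tree full of dead nodes, the model spends attention on nothing.

This shows up in three places:

  • Cost: repeated screen state dominates long trajectories.
  • Latency: large prompts take longer to send and to process.
  • Reliability: a shorter, action-oriented context gives the model fewer chances to latch onto irrelevant structure.

For an agent, the best tool output is not the most complete representation of the system. It is the smallest representation that still preserves the next correct action.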

A practical pattern

On Android, the pattern looks like this:

hs use
hs ui
hs tap "Sign in"
hs fill "Email" "you@example.com"
hs fill "Password" "$PASSWORD"
hs tap "Continue"
hs wait "Dashboard"

The handoff to the LLM gets smaller too:

Here is the current Android UI. Pick the next action by label.

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

The model does not need to know that this node sits inside three nested FrameLayouts. It only needs to know that "Continue" is a button.

Related guides

https://handsets.dev/blog/stop-wasting-tokens-on-android-automation/