Google ranks the best AI for building Android apps, and the winner isn’t Gemini
Adrian Bridg · 2026-05-27 · via The New Stack | DevOps, Open Source, and Cloud Native News

Google wants software developers to use the best possible AI models when building Android applications; consequently, the company debuted its Android Bench benchmarking portal in March. The service is intended to provide a continuously updated leaderboard to act as a reference point for developers and model creators.

The leaderboard was updated last week to include open-weight models and add new columns for latency, tokens, and cost.

Model students

Matthew McCullough, Google VP of product for the Android Developer division, writes in a March blog post that Google actively benchmarks leading LLMs against tests designed to assess how well these tools can build Android apps.

“Our goal is to provide model creators with a benchmark to evaluate LLM capabilities for Android development,” explains McCullough. “By establishing a clear, reliable baseline for what high-quality Android development looks like, we’re helping model creators identify gaps and accelerate improvements — which empowers developers to work more efficiently with a wider range of helpful models to choose for AI assistance — which ultimately will lead to higher-quality apps across the Android ecosystem.”

GPT 5.5 is currently the best AI model for Android

This new service doesn’t appear to offer a historical record of where models have risen and fallen over time, but 9to5Google reports that the previous Android Bench update ranked Gemini 3.1 Pro alongside OpenAI’s GPT 5.4 as joint leaders.

As of the May 18 update, GPT 5.5 is the best AI model for Android app development.

Google publishes its methodology for Android Bench, explaining that “The service evaluates the ability of LLMs to generate code that resolves the issue by presenting them with real-world issues and pull requests from open-source software projects. This approach aims to ensure that the tasks are representative of the challenges developers face daily.”
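
To make that description concrete, here is a minimal Python sketch of how an issue-resolution score could be tallied. It is not Google's harness: the `Task` layout, the `is_resolved` check, and the `resolution_rate` function are all hypothetical names invented for illustration, and the sketch assumes each task comes with some automatic way to verify a fix.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """A benchmark task (hypothetical layout): an issue plus a way to verify a fix."""
    issue_text: str
    # Takes the model's proposed code and returns True if the issue is resolved.
    is_resolved: Callable[[str], bool]


def resolution_rate(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the percentage of tasks whose issue the model's output resolves."""
    if not tasks:
        raise ValueError("need at least one task")
    resolved = sum(1 for t in tasks if t.is_resolved(model(t.issue_text)))
    return 100.0 * resolved / len(tasks)
```

In this sketch a "model" is simply any function that maps issue text to proposed code, which keeps the scoring loop model-agnostic: swapping in a different model changes nothing else.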

Why did Google build Android Bench?

Google has said it built Android Bench because AI-assisted software engineering “has seen the emergence of several benchmarks” for measuring LLM capabilities. The company has further stated that Android developers “face specific challenges that aren’t covered by existing benchmarks,” so it created a ranking service that focuses on a comprehensive assessment of high-quality Android development.

“We created a model-agnostic benchmark to accurately evaluate LLM performance on a variety of Android development tasks,” stated Google. The company further defined the goals of Android Bench as a means of encouraging LLM improvements for Android development; empowering Android developers to be more productive with a range of “helpful models” for AI assistance; and leading to higher-quality applications across the Android ecosystem.

Do software development benchmarks work?

Developers and model creators will naturally question whether Google’s benchmarking effort is useful. Naysayers might point to Goodhart’s Law, which states that “When a measure becomes a target, it ceases to be a good measure.” Certainly, any reward system can attract actors who optimize for the metric rather than for the outcome it is meant to capture.

Google may have anticipated this pitfall by basing Android Bench on real-world public code repositories.

“We created the benchmark by curating a task set against a range of common Android development areas. It is composed of real challenges of varying difficulty, sourced from public GitHub Android repositories,” writes Google’s McCullough.

This means the scenarios tested include resolving “breaking changes” across Android releases (when code that previously worked fine stops working after Google updates Android to a new version), domain-specific tasks such as networking for wearable devices (where the specter of high latency and frequent disconnections is always a threat), migrating to the latest version of Jetpack Compose (Android’s own declarative UI toolkit, which is built on Kotlin functions), and more.

What other Android benchmarks exist?

Other Android benchmarks include Jetpack Microbenchmark, a library that allows developers to benchmark their Android native code, whether written in Kotlin or Java, from within Android Studio. Its sibling, Jetpack Macrobenchmark, tests large-scale user interactions, such as cold app startup time or the fluidity of user interface animations.

Also available in the Android benchmarking space is Firebase Performance Monitoring, a production-level field benchmark that monitors an app’s network requests and screen rendering times; this is more of an application performance monitoring tool. 

Within the Android developer community, Android Vitals already provides a dashboard to track app quality metrics such as stability, performance, battery usage, and permission issues. Apptim is a generative AI mobile app profiling and testing tool, so again, performance benchmarking, but not quite the same as Android Bench. We could also mention Google’s own Android Performance Analyzer (APA), which only arrived on 19 May this year and serves as a profiling and performance analysis tool with workflow simplification support.

Andrew Filev, CEO and founder of code orchestration company Zencoder, tells The New Stack that he’s a fan of these systems, with caveats.

“Open benchmarks like Android Bench are great, and we wish there were more of them,” Filev enthuses. “In general terms, software development is too diverse for a single headline score to be universally meaningful — a Python benchmark tells you little about how a model handles Rust, embedded systems, or a mobile app. There’s also a real gap between building an open web app, an internal tool used by a few hundred people, and a multi-tenant product at a global scale, and models do not perform identically across those domains.”

Consequently, he says, domain-specific benchmarks push model developers to pay attention to the environments their users actually work in, so he thinks that “Google deserves credit here” and hopes other platforms follow.

“The caveat is data contamination. Public repositories leak into training, and we have seen models that cluster within a few points on public evals spread dramatically on private benchmarks built to mimic the same workload,” Filev says. “In our own research, a small change in how we framed test cases shifted the model spread from six percentage points to 26 and completely reordered the rankings. So public benchmarks help improve LLM performance across domains, and private evals help assess real-world performance on your workload.”

How Android Bench scores are built

Each model’s overall Android Bench score is based on a Google-developed calculation comprising four core values.

The confidence interval (CI) range (%) is a measure of the expected performance range, reflecting the results’ statistical reliability (p-value, 0.05); the average latency score is the time taken to solve 100 tasks across 10 runs; the average total tokens score is a measure of token consumption for a full benchmark run across 10 runs; and the average cost is the cost per benchmark run at the time of testing, in US dollars.
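
As a rough illustration of how repeated runs can be rolled up into figures like these, the Python sketch below averages a set of runs and attaches a 95% interval to the score. The article does not say how Google computes its interval, so the normal-approximation formula here is a textbook stand-in, and the `RunResult` record and `summarize` function are hypothetical names.

```python
from dataclasses import dataclass
from statistics import NormalDist, mean, stdev


@dataclass
class RunResult:
    """One full benchmark run (hypothetical record layout)."""
    tasks_resolved: int
    tasks_total: int
    latency_seconds: float
    total_tokens: int
    cost_usd: float


def summarize(runs: list[RunResult], alpha: float = 0.05) -> dict[str, float]:
    """Aggregate repeated runs into a score, CI bounds, and per-run averages."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate an interval")
    scores = [100.0 * r.tasks_resolved / r.tasks_total for r in runs]
    centre = mean(scores)
    # Assumed method: normal-approximation interval,
    # mean +/- z * (sample standard deviation / sqrt(number of runs)).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * stdev(scores) / len(scores) ** 0.5
    return {
        "score_pct": centre,
        "ci_low_pct": centre - half_width,
        "ci_high_pct": centre + half_width,
        "avg_latency_s": mean(r.latency_seconds for r in runs),
        "avg_total_tokens": mean(r.total_tokens for r in runs),
        "avg_cost_usd": mean(r.cost_usd for r in runs),
    }
```

With only 10 runs, a Student's t interval would be somewhat wider than this normal approximation. Either way, two models whose intervals overlap are hard to separate on the score alone.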

The test harness for Android Bench is publicly available on GitHub.
