惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
V
Vulnerabilities – Threatpost
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
V
Visual Studio Blog
月光博客
月光博客
IT之家
IT之家
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
Tailwind CSS Blog
罗磊的独立博客
S
SegmentFault 最新的问题
博客园 - 三生石上(FineUI控件)
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
量子位
V
V2EX
Jina AI
Jina AI
The GitHub Blog
The GitHub Blog
小众软件
小众软件
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
阮一峰的网络日志
阮一峰的网络日志
Recent Announcements
Recent Announcements
MongoDB | Blog
MongoDB | Blog
Y
Y Combinator Blog
H
Help Net Security
博客园_首页
Cyberwarzone
Cyberwarzone
T
Tenable Blog
A
Arctic Wolf
C
CERT Recently Published Vulnerability Notes
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
T
Threat Research - Cisco Blogs
aimingoo的专栏
aimingoo的专栏
Google DeepMind News
Google DeepMind News
博客园 - 叶小钗
C
Cyber Attacks, Cyber Crime and Cyber Security
美团技术团队
Attack and Defense Labs
Attack and Defense Labs
GbyAI
GbyAI
博客园 - 【当耐特】
Cloudbric
Cloudbric
NISL@THU
NISL@THU
B
Blog RSS Feed
K
Kaspersky official blog
Hugging Face - Blog
Hugging Face - Blog
P
Privacy International News Feed
博客园 - Franky
博客园 - 司徒正美
Microsoft Azure Blog
Microsoft Azure Blog
Apple Machine Learning Research
Apple Machine Learning Research
Webroot Blog
Webroot Blog
Microsoft Security Blog
Microsoft Security Blog

Google DeepMind News

Investing in multi-agent AI safety research DiffusionGemma: 4x faster text generation Fluid, natural voice translation with Gemini 3.5 Live Translate Measuring the impact of learning with AI in Sierra Leone and beyond Powering the future of robotics in Europe Introducing Gemma 4 12B: a unified, encoder-free multimodal model Strengthening Singapore’s AI Future: A New National Partnership Simulate real-world places with Project Genie and Street View Introducing Gemini Omni Gemini for Science: AI experiments and tools for a new era of discovery Making it easier to understand how content was created and edited Gemini 3.5: frontier intelligence with action Co-Scientist: A multi-agent AI partner to accelerate research How WeatherNext helped the National Hurricane Center better predict Hurricane Melissa’s historic landfall in Jamaica Fast-tracking genetic leads to reverse cellular aging Finding the molecular switches behind new infectious diseases Opening new paths in aging research Accelerating discovery of liver disease mechanisms Uniting biological toolkits for a new approach to ALS Uncovering repurposed medicines to fight liver fibrosis Google Antigravity We’re launching the Google DeepMind Accelerator program in Asia Pacific to tackle environmental risks. Reimagining the mouse pointer for the AI era AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields Enabling a new model for healthcare with AI co-clinician Announcing our partnership with the Republic of Korea Decoupled DiLoCo: A new frontier for resilient, distributed AI training Partnering with industry leaders to accelerate AI transformation Gemini 3.1 Flash TTS: the next generation of expressive AI speech Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning Gemma 4: Byte for byte, the most capable open models Gemini 3.1 Flash Live: Making audio AI more natural and reliable Protecting people from harmful manipulation Lyria 3 Pro: Create longer tracks in more Google products Measuring progress toward AGI: A cognitive framework From games to biology and beyond: 10 years of AlphaGo’s impact Gemini 3.1 Flash-Lite: Built for intelligence at scale Nano Banana 2: Combining Pro capabilities with lightning-fast speed Gemini 3.1 Pro: A smarter model for your most complex tasks A new way to express yourself: Gemini can now create music Accelerating discovery in India through AI-powered science and education Gemini 3 Deep Think: Advancing science, research and engineering Accelerating Mathematical and Scientific Discovery with Gemini Deep Think Project Genie: Experimenting with infinite, interactive worlds D4RT: Teaching AI to see the world in four dimensions Veo 3.1 Ingredients to Video: More consistency, creativity and control Google's year in review: 8 areas with research breakthroughs in 2025 Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior Google DeepMind supports U.S. Department of Energy on Genesis: a national mission to accelerate innovation and scientific discovery Gemini 3 Flash: frontier intelligence built for speed Improved Gemini audio models for powerful voice interactions Deepening our partnership with the UK AI Security Institute Strengthening our partnership with the UK government to support prosperity and security in the AI era FACTS Benchmark Suite: Systematically evaluating the factuality of large language models Engineering more resilient crops for a warming climate AlphaFold: Five years of impact Revealing a key protein behind heart disease How we’re bringing AI image verification to the Gemini app Build with Nano Banana Pro, our Gemini 3 Pro Image model Introducing Nano Banana Pro We’re expanding our presence in Singapore to advance AI in the Asia-Pacific region Start building with Gemini 3 A new era of intelligence with Gemini 3 Google Antigravity WeatherNext 2: Our most advanced weather forecasting model SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds Teaching AI to see the world more like we do How AI is giving Northern Ireland teachers time back Mapping, modeling, and understanding nature with AI Accelerating discovery with the AI for Math Initiative MedGemma: Our most capable open models for health AI development VaultGemma: The world's most capable differentially private LLM Bringing AI to the next generation of fusion energy Introducing Veo 3.1 and advanced capabilities in Flow How a Gemma model helped discover a new potential cancer therapy pathway Introducing CodeMender: an AI agent for code security Gemini Robotics 1.5 brings AI agents into the physical world Strengthening our Frontier Safety Framework Discovering new solutions to century-old problems in fluid dynamics Gemini achieves gold-medal level at the International Collegiate Programming Contest World Finals Using AI to perceive the universe in greater depth Image editing in Gemini just got a major upgrade Introducing Gemma 3 270M: The compact model for hyper-efficient AI How AI is helping advance the science of bioacoustics to save endangered species Genie 3: A new frontier for world models Rethinking how we measure AI intelligence Try Deep Think in the Gemini app AlphaEarth Foundations helps map our planet in unprecedented detail Aeneas transforms how historians connect the past Gemini 2.5 Flash-Lite is now stable and generally available Exploring the context of online images with Backstory Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad T5Gemma: A new collection of encoder-decoder Gemma models Introducing Gemma 3n: The developer guide AlphaGenome: AI for better understanding the genome Gemini Robotics On-Device brings AI to local robotic devices We’re expanding our Gemini 2.5 family of models Gemini 2.5: Updates to our family of thinking models Behind “ANCESTRA”: combining Veo with live-action filmmaking How we're supporting better tropical cyclone prediction with AI
Introducing the Gemini 2.5 Computer Use model
Google DeepMind · 2025-10-08 · via Google DeepMind News

Available in preview via the API, our Computer Use model is a specialized model built on Gemini 2.5 Pro’s capabilities to power agents that can interact with user interfaces.

General summary

Google is releasing the Gemini 2.5 Computer Use model via the Gemini API, enabling developers to build agents that can interact with user interfaces. This model outperforms others in web and mobile control benchmarks with lower latency. You can access it now on Google AI Studio and Vertex AI to start building and share feedback in the Developer Forum.

Summaries were generated by Google AI. Generative AI is experimental.

Gemini Computer Use

Earlier this year, we mentioned that we're bringing computer use capabilities to developers via the Gemini API. Today, we are releasing the Gemini 2.5 Computer Use model, our new specialized model built on Gemini 2.5 Pro’s visual understanding and reasoning capabilities that powers agents capable of interacting with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities via the Gemini API in Google AI Studio and Vertex AI.

While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example, filling and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a crucial next step in building powerful, general-purpose agents.

How it works

The model’s core capabilities are exposed through the new `computer_use` tool in the Gemini API and should be operated within a loop. Inputs to the tool are the user request, screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions or specify additional custom functions to include.

Gemini 2.5 Computer Use Model flow

Diagram of AI agent loop: Initial task leads to a screenshot/context, which is sent to the Model, which returns a response to the computer environment to execute an action.

The model then analyzes these inputs and generates a response, typically a function call representing one of the UI actions such as clicking or typing. This response may also contain a request for an end user confirmation, which is required for certain actions such as making a purchase. The client-side code then executes the received action.

After the action is executed, a new screenshot of the GUI and the current URL are sent back to the Computer Use model as a function response restarting the loop. This iterative process continues until the task is complete, an error occurs or the interaction is terminated by a safety response or user decision.

The Gemini 2.5 Computer Use model is primarily optimized for web browsers, but also demonstrates strong promise for mobile UI control tasks. It is not yet optimized for desktop OS-level control.

Check out a few demos below to see the model in action (shown here at 3X speed).

Prompt: “From https://tinyurl.com/pet-care-signup, get all details for any pet with a California residency and add them as a guest in my spa CRM at https://pet-luxe-spa.web.app/. Then, set up a follow up visit appointment with the specialist Anima Lavar for October 10th anytime after 8am. The reason for the visit is the same as their requested treatment.”

Prompt: “My art club brainstormed tasks ahead of our fair. The board is chaotic and I need your help organizing the tasks into some categories I created. Go to sticky-note-jam.web.app and ensure notes are clearly in the right sections. Drag them there if not.”

How it performs

The Gemini 2.5 Computer Use model demonstrates strong performance on multiple web and mobile control benchmarks. The table below includes results from self-reported numbers, evaluations run by Browserbase and evaluations we ran ourselves. Evaluation details are available in the Gemini 2.5 Computer Use evaluation info and in Browserbase’s blog post. Unless otherwise indicated, scores shown are for computer use tools exposed via API.

Gemini 2.5 Computer Use outperforms leading alternatives on multiple benchmarks

Benchmark performance table: Gemini 2.5 Computer Use leads in Online-Mind2Web, WebVoyager, and AndroidWorld benchmarks.

The model offers leading quality for browser control at the lowest latency, as measured by performance on the Browserbase harness for Online-Mind2Web.

Gemini 2.5 Computer Use delivers high accuracy while maintaining low latency

Latency vs. Quality scatterplot: Gemini 2.5 Computer Use is lowest in latency and highest in accuracy (70%+ accuracy, ∼225 sec latency).

How we approached safety

We believe that the only way to build agents that will benefit everyone is to be responsible from the start. AI agents that control computers introduce unique risks, including intentional misuse by users, unexpected model behavior, and prompt injections and scams in the web environment. Thus, it is critical to implement safety guardrails with care.

We have trained safety features directly into the model to address these three key risks (described in the Gemini 2.5 Computer Use System Card).

Further, we also provide developers with safety controls, which empower developers to prevent the model from auto-completing potentially high-risk or harmful actions. Examples of these actions include harming a system's integrity, compromising security, bypassing CAPTCHAs, or controlling medical devices. The controls:

  • Per-step safety service: An out-of-model, inference-time safety service that assesses each action the model proposes before it’s executed.
  • System instructions: Developers can further specify that the agent either refuses or asks for user confirmation before it takes specific kinds of high-stakes actions. (Example in documentation).

Additional recommendations for developers on safety measures and best practices can be found in our documentation. While these safeguards are designed to reduce risk, we urge all developers to thoroughly test their systems before launch.

How early testers have used it

Google teams have already deployed the model to production for use cases including UI testing, which can make software development signficantly faster. Versions of this model have also been powering Project Mariner, the Firebase Testing Agent, and some agentic capabilities in AI Mode in Search.

Users from our early access program have also been testing the model to power personal assistants, workflow automation, and UI testing, and have seen strong results. In their own words:

How to get started

Starting today, the model is available in public preview, accessible via the Gemini API on Google AI Studio and Vertex AI.

Related stories