惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
V
Vulnerabilities – Threatpost
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
V
Visual Studio Blog
月光博客
月光博客
IT之家
IT之家
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
Tailwind CSS Blog
罗磊的独立博客
S
SegmentFault 最新的问题
博客园 - 三生石上(FineUI控件)
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
量子位
V
V2EX
Jina AI
Jina AI
The GitHub Blog
The GitHub Blog
小众软件
小众软件
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
阮一峰的网络日志
阮一峰的网络日志
Recent Announcements
Recent Announcements
MongoDB | Blog
MongoDB | Blog
Y
Y Combinator Blog
H
Help Net Security
博客园_首页
Cyberwarzone
Cyberwarzone
T
Tenable Blog
A
Arctic Wolf
C
CERT Recently Published Vulnerability Notes
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
T
Threat Research - Cisco Blogs
aimingoo的专栏
aimingoo的专栏
Google DeepMind News
Google DeepMind News
博客园 - 叶小钗
C
Cyber Attacks, Cyber Crime and Cyber Security
美团技术团队
Attack and Defense Labs
Attack and Defense Labs
GbyAI
GbyAI
博客园 - 【当耐特】
Cloudbric
Cloudbric
NISL@THU
NISL@THU
B
Blog RSS Feed
K
Kaspersky official blog
Hugging Face - Blog
Hugging Face - Blog
P
Privacy International News Feed
博客园 - Franky
博客园 - 司徒正美
Microsoft Azure Blog
Microsoft Azure Blog
Apple Machine Learning Research
Apple Machine Learning Research
Webroot Blog
Webroot Blog
Microsoft Security Blog
Microsoft Security Blog

Google DeepMind News

Investing in multi-agent AI safety research DiffusionGemma: 4x faster text generation Fluid, natural voice translation with Gemini 3.5 Live Translate Measuring the impact of learning with AI in Sierra Leone and beyond Powering the future of robotics in Europe Introducing Gemma 4 12B: a unified, encoder-free multimodal model Strengthening Singapore’s AI Future: A New National Partnership Simulate real-world places with Project Genie and Street View Introducing Gemini Omni Gemini for Science: AI experiments and tools for a new era of discovery Making it easier to understand how content was created and edited Gemini 3.5: frontier intelligence with action Co-Scientist: A multi-agent AI partner to accelerate research How WeatherNext helped the National Hurricane Center better predict Hurricane Melissa’s historic landfall in Jamaica Fast-tracking genetic leads to reverse cellular aging Finding the molecular switches behind new infectious diseases Opening new paths in aging research Accelerating discovery of liver disease mechanisms Uniting biological toolkits for a new approach to ALS Uncovering repurposed medicines to fight liver fibrosis Google Antigravity We’re launching the Google DeepMind Accelerator program in Asia Pacific to tackle environmental risks. Reimagining the mouse pointer for the AI era AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields Enabling a new model for healthcare with AI co-clinician Announcing our partnership with the Republic of Korea Decoupled DiLoCo: A new frontier for resilient, distributed AI training Partnering with industry leaders to accelerate AI transformation Gemini 3.1 Flash TTS: the next generation of expressive AI speech Gemini Robotics-ER 1.6: Powering real-world robotics tasks through enhanced embodied reasoning Gemma 4: Byte for byte, the most capable open models Gemini 3.1 Flash Live: Making audio AI more natural and reliable Protecting people from harmful manipulation Lyria 3 Pro: Create longer tracks in more Google products Measuring progress toward AGI: A cognitive framework From games to biology and beyond: 10 years of AlphaGo’s impact Gemini 3.1 Flash-Lite: Built for intelligence at scale Nano Banana 2: Combining Pro capabilities with lightning-fast speed Gemini 3.1 Pro: A smarter model for your most complex tasks A new way to express yourself: Gemini can now create music Accelerating discovery in India through AI-powered science and education Gemini 3 Deep Think: Advancing science, research and engineering Accelerating Mathematical and Scientific Discovery with Gemini Deep Think Project Genie: Experimenting with infinite, interactive worlds D4RT: Teaching AI to see the world in four dimensions Veo 3.1 Ingredients to Video: More consistency, creativity and control Google's year in review: 8 areas with research breakthroughs in 2025 Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior Google DeepMind supports U.S. Department of Energy on Genesis: a national mission to accelerate innovation and scientific discovery Gemini 3 Flash: frontier intelligence built for speed Deepening our partnership with the UK AI Security Institute Strengthening our partnership with the UK government to support prosperity and security in the AI era FACTS Benchmark Suite: Systematically evaluating the factuality of large language models Engineering more resilient crops for a warming climate AlphaFold: Five years of impact Revealing a key protein behind heart disease How we’re bringing AI image verification to the Gemini app Build with Nano Banana Pro, our Gemini 3 Pro Image model Introducing Nano Banana Pro We’re expanding our presence in Singapore to advance AI in the Asia-Pacific region Start building with Gemini 3 A new era of intelligence with Gemini 3 Google Antigravity WeatherNext 2: Our most advanced weather forecasting model SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds Teaching AI to see the world more like we do How AI is giving Northern Ireland teachers time back Mapping, modeling, and understanding nature with AI Accelerating discovery with the AI for Math Initiative MedGemma: Our most capable open models for health AI development VaultGemma: The world's most capable differentially private LLM Bringing AI to the next generation of fusion energy Introducing Veo 3.1 and advanced capabilities in Flow How a Gemma model helped discover a new potential cancer therapy pathway Introducing the Gemini 2.5 Computer Use model Introducing CodeMender: an AI agent for code security Gemini Robotics 1.5 brings AI agents into the physical world Strengthening our Frontier Safety Framework Discovering new solutions to century-old problems in fluid dynamics Gemini achieves gold-medal level at the International Collegiate Programming Contest World Finals Using AI to perceive the universe in greater depth Image editing in Gemini just got a major upgrade Introducing Gemma 3 270M: The compact model for hyper-efficient AI How AI is helping advance the science of bioacoustics to save endangered species Genie 3: A new frontier for world models Rethinking how we measure AI intelligence Try Deep Think in the Gemini app AlphaEarth Foundations helps map our planet in unprecedented detail Aeneas transforms how historians connect the past Gemini 2.5 Flash-Lite is now stable and generally available Exploring the context of online images with Backstory Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad T5Gemma: A new collection of encoder-decoder Gemma models Introducing Gemma 3n: The developer guide AlphaGenome: AI for better understanding the genome Gemini Robotics On-Device brings AI to local robotic devices We’re expanding our Gemini 2.5 family of models Gemini 2.5: Updates to our family of thinking models Behind “ANCESTRA”: combining Veo with live-action filmmaking How we're supporting better tropical cyclone prediction with AI
Improved Gemini audio models for powerful voice interactions
Bibo Xu · 2025-12-13 · via Google DeepMind News

Tara Sainath

Distinguished Research Scientist

General summary

Google enhanced Gemini 2.5 Flash Native Audio for better live voice agents. Expect sharper function calling, robust instruction following and smoother conversations. Try live speech translation in the Google Translate app beta, rolling out now on Android in the US Mexico and India.

Summaries were generated by Google AI. Generative AI is experimental.

Bullet points

  • "Improved Gemini audio models for powerful voice interactions" enhance live agents and translation.
  • Gemini 2.5 Flash Native Audio now has sharper function calling and better instruction following.
  • The update allows for smoother conversations by retrieving context from previous turns.
  • Live speech translation in Google Translate preserves intonation and handles 70+ languages.
  • You can start building voice agents today with Gemini 2.5 Flash Native Audio on Vertex AI.

Summaries were generated by Google AI. Generative AI is experimental.

Basic explainer

Google made its Gemini AI better at understanding and speaking in conversations. It can now understand instructions better, have smoother conversations, and translate languages in real time. This means AI can help businesses with customer service and people can understand each other better, even if they speak different languages. You can even try out the live translation feature in the Google Translate app.

Summaries were generated by Google AI. Generative AI is experimental.

Explore other styles:

Gemini Audio text logo

Your browser does not support the audio element.

Listen to article

This content is generated by Google AI. Generative AI is experimental

[[duration]] minutes

Earlier this week, we introduced greater control over audio generation with an upgrade to our Gemini 2.5 Pro and Flash Text-to-Speech models.

But generating expressive speech is only one side of the conversation. Today, we’re releasing an updated Gemini 2.5 Flash Native Audio for live voice agents. This update improves the model’s ability to handle complex workflows, navigate user instructions, and hold natural conversations.

Gemini 2.5 Flash Native Audio is now available across Google products including Google AI Studio, Vertex AI, and has also started rolling out in Gemini Live and Search Live, bringing the naturalness of native audio to Search Live for the first time. This means you can more effectively brainstorm live with Gemini, get real-time help in Search Live, or build the next generation of enterprise-ready customer service agents.

Beyond powering helpful agents, native audio unlocks new possibilities for global communication. We’re introducing live speech translation, a capability that enables streaming speech-to-speech translation for headphones. It preserves the speaker’s intonation, pacing and pitch. This beta experience is rolling out in the Google Translate app starting today.

Live Voice Agents

To enable the breadth of use cases across surfaces and products, we have improved Gemini 2.5 Native Audio in three key areas:

  • Sharper function calling: We’ve improved the model's reliability when triggering external functions. It can now more accurately identify when to fetch real-time information during a conversation and seamlessly weave that data back into the audio response, without breaking the flow. On ComplexFuncBench Audio, an eval that captures multi-step function calling with various constraints, Gemini 2.5 Native Audio leads with a score of 71.5%.
  • Robust instruction following: The model is now better at handling complex instructions resulting in higher user satisfaction on content completeness. With a 90% adherence rate to developer instructions (up from 84%), it delivers more reliable outputs.
  • Smoother conversations: We’ve achieved significant gains in multi-turn conversation quality. Gemini 2.5 Flash Native Audio is able to retrieve context from previous turns more effectively, creating more cohesive conversations.

The updated Gemini 2.5 Flash Native Audio’s performance against previous versions and industry competitors on ComplexFuncBench

updated Gemini 2.5 Flash Native Audio’s performance against previous versions and industry competitors

What customers are saying

Google Cloud customers are already using Gemini’s native audio capabilities to drive real business results, from mortgage processing to customer calls.

  • “Users often forget they’re talking to AI within a minute of using Sidekick, and in some cases have thanked the bot after a long chat…New Live API AI capabilities offered through Gemini [2.5 Flash Native Audio] empower our merchants to win.” – David Wurtz, VP of Product, Shopify
  • "By integrating the Gemini 2.5 Flash Native Audio model…we've significantly enhanced Mia's capabilities since launching in May 2025. This powerful combination has enabled us to generate over 14,000 loans for our broker partners." – Jason Bressler, Chief Technology Officer, United Wholesale Mortgage (UWM)
  • “Working with the Gemini 2.5 Flash Native Audio model through Vertex AI allows Newo.ai AI Receptionists to achieve unmatched conversational intelligence ... .They can identify the main speaker even in noisy settings, switch languages mid-conversation, and sound remarkably natural and emotionally expressive.” – David Yang, Co-founder, Newo.ai

Live Speech Translation

Gemini now natively supports new live speech-to-speech translation capabilities designed to handle both continuous listening and two-way conversation.

With continuous listening, Gemini automatically translates speech in multiple languages into a single target language. This allows you to put headphones in and hear the world around you in your language.

For two-way conversation, Gemini’s live speech translation handles translation between two languages in real-time, automatically switching the output language based on who is speaking. For example, if you speak English and want to chat with a Hindi speaker, you’ll hear English translations in real-time in your headphones, while your phone broadcasts Hindi when you’re done speaking.

Gemini’s live speech translation has a number of key capabilities that help in the real world:

  • Language coverage: Translates speech in over 70 languages and 2000 language pairs by combining Gemini model’s world knowledge and multilingual capabilities with its native audio capabilities
  • Style transfer: Captures the nuance of human speech, preserving the speaker’s intonation, pacing and pitch so the translation sounds natural.
  • Multilingual input: Understands multiple languages simultaneously in a single session, helping you follow multilingual conversations without needing to fiddle around with language settings.
  • Auto detection: Identifies the spoken language and begins translation, so you don’t even need to know what language is being spoken to start translating.
  • Noise robustness: Filters out ambient noise so you can converse comfortably even in loud, outdoor environments.

Starting today, you can try it in a new beta experience in the Google Translate app for real-time translation in your headphones by connecting them to your device and tapping “Live translate.” This experience is rolling out to all Android devices in the US, Mexico and India with support for iOS and more regions coming soon.

Based on feedback, we will continue to iterate on this experience and bring it to more Google products including the Gemini API in 2026.

Get started today

Start building voice agents today with Gemini 2.5 Flash Native Audio, now generally available on Vertex AI and as preview in the Gemini API. Try it out in Google AI Studio.

Gemini 2.5 Flash and 2.5 Pro text-to-speech models are also available via the Gemini API in Google AI Studio. Get started with the speech generation docs, explore the prompting guide, or check out the Gemini API Cookbook to get started.