Google unveils DiffusionGemma, an AI model that breaks free of left-to-right processing
by Taryn Plumb · 2026-06-13 · via InfoWorld

Rather than generating text word by word, Google's experimental open-source model drafts entire passages simultaneously using diffusion, resulting in up to 4x faster inference.

Extremely powerful large language models (LLMs) still operate as though they’re typing on a keyboard, processing workloads in a simple left-to-right fashion. But in locally-run, single-user scenarios, this sequential processing can leave graphics processing units (GPUs) and tensor processing units (TPUs) underutilized.

Google is betting that DiffusionGemma can get around this bottleneck. The new experimental open model generates text “exceptionally fast,” creating entire blocks of text simultaneously through diffusion techniques rather than through token-by-token processing. The company says this technique results in up to 4x faster inference compared to autoregressive models that rely on sequential processing.

It can also save users money. Technology analyst Carmi Levy noted that existing pay-per-token monetization models “penalize the use of less than optimally efficient AI solutions.”

But DiffusionGemma “could herald a new generation of task-defined, efficient solutions that can enable expanded compute capacity without draining the operations budget,” he said.

A contrast to left-to-right processing

Built on Google’s Gemma 4 family and its Gemini Diffusion research, DiffusionGemma is a 26B mixture-of-experts (MoE) model designed to maximize text generation speed.

It essentially shifts how models use hardware, giving processors a larger hunk of work each cycle so the model can draft full 256-token paragraphs at once. This allows it to generate text up to 4x faster on GPUs, Google claims. It activates only 3.8B parameters during inference and, when quantized, can fit within 18GB of VRAM on high-end consumer GPUs like Nvidia’s RTX 5090.
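To put the 18GB figure in context, the following back-of-envelope sketch estimates how much memory the weights alone would need at a few bit widths. The parameter counts and the 18GB budget come from the article; the bit widths are illustrative assumptions, since Google's actual quantization scheme is not described here, and the helper `weight_memory_gb` is hypothetical.

```python
# Back-of-envelope weight memory for a 26B-parameter model.
# ASSUMPTION: the bit widths below are illustrative, not Google's published scheme.

TOTAL_PARAMS = 26e9      # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9    # parameters active per forward pass (from the article)
VRAM_BUDGET_GB = 18      # quoted quantized footprint (from the article)


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    size = weight_memory_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size <= VRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")

# Share of the weights that take part in a single forward pass.
active_share = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active share per pass: {active_share:.1%}")
```

Under these assumptions, only the 4-bit case leaves room inside an 18GB budget for activations and runtime overhead. In a typical MoE deployment all experts still have to be resident in memory, so the 3.8B active parameters reduce compute per pass rather than the storage footprint.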

“It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously,” Google research scientists Brendan O’Donoghue and Sebastian Flennerhag wrote in a blog post.

AI image generators begin with pure, random ‘visual noise’ and iteratively refine that into a finalized picture (what’s known as ‘diffusion’); DiffusionGemma applies this same process to text. It does not generate tokens in order, but begins with a “canvas of random placeholder tokens” that it processes in multiple passes, identifying the context tokens it feels are most relevant and using those to refine the rest.

The model has the ability to self-correct, using confidence scoring to re-evaluate tokens in the next pass. “The model iteratively refines its own output, allowing it to evaluate the entire text block at once to fix mistakes in real-time,” O’Donoghue and Flennerhag explained.
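The refinement loop described above can be sketched with a toy example. This is not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the target sentence and only becomes confident about a position once a neighbouring token is correct, and `ANCHOR`, `THRESHOLD`, and the confidence values 0.9 and 0.1 are invented for illustration. The sketch only shows the overall shape: start from random placeholders, score every position in each pass, keep confident tokens, and re-evaluate the rest.

```python
import random

TARGET = "the quick brown fox jumps over the lazy dog".split()
VOCAB = sorted(set(TARGET))
ANCHOR = len(TARGET) // 2   # ASSUMPTION: position the toy model is sure about from the start
THRESHOLD = 0.5             # ASSUMPTION: tokens scored below this are re-evaluated next pass


def toy_denoiser(canvas):
    """Hypothetical stand-in for the model. In one pass it proposes a
    (token, confidence) pair for every position, looking at both neighbours."""
    proposals = []
    for i, current in enumerate(canvas):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        supported = i == ANCHOR or any(canvas[j] == TARGET[j] for j in neighbours)
        if supported:
            proposals.append((TARGET[i], 0.9))   # enough context: confident
        else:
            proposals.append((current, 0.1))     # not enough context yet
    return proposals


def refine(length, seed=0, max_passes=50):
    """Iteratively refine a canvas of random placeholder tokens."""
    rng = random.Random(seed)
    canvas = [rng.choice(VOCAB) for _ in range(length)]  # random placeholders
    confidence = [0.0] * length
    passes = 0
    while min(confidence) < THRESHOLD and passes < max_passes:
        proposals = toy_denoiser(canvas)   # every position is scored in the same pass
        for i, (token, score) in enumerate(proposals):
            if score >= THRESHOLD:
                canvas[i] = token          # commit confident proposals
            confidence[i] = score          # low scores trigger another look
        passes += 1
        print(f"pass {passes}: {' '.join(canvas)}")
    return canvas, passes


final, passes = refine(len(TARGET))
```

In this toy the sentence fills in outward from the middle rather than left to right, and it finishes in far fewer passes than a 256-step token-by-token loop would need for a full block.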

DiffusionGemma also has bidirectional attention, they wrote. “Generating 256 tokens in parallel with each forward pass allows every token to attend to all others.” This can be particularly helpful in domains that are non-linear in nature, such as mathematical graphs, code infilling, and in-line editing, they said.
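The difference between left-to-right and bidirectional attention comes down to which positions each token is allowed to see. The sketch below builds both visibility patterns as plain boolean matrices; the function names are illustrative and nothing here reflects DiffusionGemma's actual implementation.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


n = 6  # small block for display; the article describes 256-token blocks
print("causal:\n" + show(causal_mask(n)))
print("bidirectional:\n" + show(bidirectional_mask(n)))

# Number of (query, key) pairs visible in a 256-token block under each scheme.
visible_causal = sum(map(sum, causal_mask(256)))
visible_full = sum(map(sum, bidirectional_mask(256)))
print(visible_causal, visible_full)
```

With a causal mask, a token in the middle of a block cannot see what comes after it, which is why tasks like code infilling, where the text on both sides of a gap matters, fit the bidirectional pattern more naturally.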

DiffusionGemma is optimized across Nvidia’s hardware stack, making it compatible with consumer setups as well as with high-performance enterprise systems like Hopper and Blackwell.

Because it is released under the Apache 2.0 license, developers can freely use, modify, distribute, and commercialize the software using their preferred tools. It can be run on GPUs or in the cloud through Google Cloud Model Garden or Nvidia NIM, and is available on Hugging Face, GitHub, and vLLM, with support for the open-source library llama.cpp coming soon.

Key use cases

The model is particularly useful in local workflows that are “speed critical,” such as generation of non-linear text structures, and unlocks what Google calls “new patterns of model behavior” like multimodal understanding and generating and rendering code in near real-time.

Levy explained, “DiffusionGemma is particularly well suited for interactive coding and editing where its efficiency allows rapid processing and iterations,” noting that its ability to fit within 18GB of VRAM and its deployability on commonly available local GPUs can potentially benefit customer service-related workloads that lean heavily on real-time interaction and local processing.

“DiffusionGemma also incorporates a thinking mode that is especially adept at problem solving,” he said. For instance, the model was fine-tuned to play Sudoku, a typically challenging task for autoregressive models because each token depends on future tokens. This “rather handily” illustrates the model’s capability to solve more complex problems, Levy noted.

Limitations

Google freely admits that DiffusionGemma is geared to specific workflows, and there are “key trade-offs.”

The model is engineered for low-latency, high-speed generation at low-to-medium batch sizes on a “single capable accelerator.”

In high-QPS cloud serving environments (where infrastructure is designed to handle tens or hundreds of thousands of requests per second with ultra-low latency), DiffusionGemma’s parallel decoding “offers diminishing returns,” and can even result in higher serving costs, Google conceded. In addition, its overall output quality is lower than that of standard Gemma 4, which is built for apps demanding maximum quality.
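One way to see why the advantage shrinks as batch size grows is a toy roofline model. It is a rough sketch under stated assumptions, not a measurement: a forward pass is assumed to cost at least one fixed unit of time to read the weights, plus compute time that grows with the number of tokens processed together. The constants `mem_time`, `compute_per_token`, and `passes` are invented, and `passes=64` was chosen only so that the single-user case lands on the "up to 4x" figure Google quotes.

```python
def step_time(tokens_in_flight, mem_time=1.0, compute_per_token=0.001):
    """Toy roofline: a forward pass costs at least the time to read the weights
    once, and more once compute becomes the bottleneck.
    ASSUMPTION: both constants are invented for illustration."""
    return max(mem_time, compute_per_token * tokens_in_flight)


def autoregressive_throughput(batch):
    """Tokens per unit time when each pass yields one new token per sequence."""
    return batch / step_time(batch)


def diffusion_throughput(batch, block=256, passes=64):
    """Tokens per unit time when each sequence gets a whole block,
    refined over several forward passes. ASSUMPTION: passes=64 is hypothetical."""
    return batch * block / (passes * step_time(batch * block))


for batch in (1, 4, 16, 64):
    ratio = diffusion_throughput(batch) / autoregressive_throughput(batch)
    print(f"batch {batch:>2}: diffusion/autoregressive throughput = {ratio:.2f}x")
```

In this model the block-parallel approach wins while the accelerator is mostly waiting on memory, which is the single-user case. Once enough concurrent requests keep the hardware busy, the extra refinement passes become pure overhead and the ratio drops below 1, which is consistent with the higher serving costs Google concedes.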

However, Levy noted that while DiffusionGemma “can be less precise than other models in certain workloads,” subsequent refinement cycles could overcome this limitation.

While Google isn’t sharing runtime costs, it’s clear that this is an efficiency play, he added. “When deployed across the kinds of workloads that would optimally benefit from its architecture, DiffusionGemma seems to have the potential to reduce processing overhead and related costs,” he said.