

TPUs vs. GPUs: What They Are, How They Differ, and Which Workloads Belong on Each
Grace Gong · 2026-05-01 · via DEV Community

If you've worked with machine learning on Google Cloud, you've hit the choice: GPU instance or TPU? Most teams default to GPU because that's what they already know. But as inference costs climb and TPU tooling matures, it's worth understanding what each chip actually does and when one outperforms the other.

This post covers what GPUs and TPUs are, how they work, and which workloads run better on each. It ends with a look at Google's current TPU lineup, including the eighth-generation chips announced at Google Cloud Next 2026.


Why TPUs exist

Image source: Google Cloud

GPUs were originally built for rendering video games. They handle AI workloads well because the underlying math (large parallel floating-point operations) is the same. Researchers figured this out around 2012, and GPUs became the default for training neural networks.

Google ran into a problem in 2013. Engineers at Google Brain calculated that if every Android user used voice search for just three minutes a day, Google would need to double its global data center capacity. Running inference on general-purpose processors at that scale was too expensive and power-hungry.

Their solution was to build a chip designed specifically for neural network math. The first TPU went into production in Google's data centers in 2015. Google made Cloud TPUs publicly available in 2018. The core idea still drives every TPU generation today: strip out everything a GPU carries from its graphics origins and focus entirely on matrix multiplication.


How a GPU works


Image source: Google Cloud. Some images of GPUs.

A GPU is a parallel processor with thousands of smaller cores. Where a CPU has 8 to 64 powerful general-purpose cores, a high-end GPU like the NVIDIA H100 has thousands of smaller ones that run the same instruction across many data points at once. This is called SIMD (Single Instruction, Multiple Data) parallelism.

GPUs support a wide range of precision formats: FP32, FP16, BF16, INT8, FP8. They run PyTorch, TensorFlow, JAX, CUDA libraries, simulations, and rendering pipelines. That broad support is useful, but it means a GPU carries hardware for texture mapping, rasterization, and other graphics operations that sits completely idle during a matrix multiplication.

The NVIDIA H100 has 80GB of high-bandwidth memory on-package (HBM3 on the SXM version, HBM2e on the PCIe version). Memory bandwidth matters a lot for AI workloads because moving data between memory and compute units is often what limits throughput, not the raw math.
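A rough way to see when memory, rather than math, is the limit is a roofline-style estimate. The sketch below is illustrative only: it assumes each matrix is moved to or from memory exactly once, and it plugs in the per-chip Ironwood figures quoted later in this post (4,614 FP8 TFLOPS, 7.37 TB/s) because those are the only peak numbers the post gives. The function names are made up for this example.

```python
# Roofline-style estimate: is a matmul limited by compute or by memory traffic?
# Assumption: A (m x k), B (k x n) and C (m x n) each cross the memory bus once.

PEAK_FLOPS = 4614e12      # FP8 FLOP/s per chip, as quoted for Ironwood in this post
PEAK_BANDWIDTH = 7.37e12  # bytes/s per chip, as quoted for Ironwood in this post

# FLOPs the chip can perform per byte it can fetch (roughly 626 here).
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH


def matmul_intensity(m: int, k: int, n: int, bytes_per_element: int = 1) -> float:
    """Arithmetic intensity (FLOPs per byte) of an m x k by k x n matmul."""
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def limiting_factor(m: int, k: int, n: int, bytes_per_element: int = 1) -> str:
    """Classify the matmul against the chip's ridge point."""
    intensity = matmul_intensity(m, k, n, bytes_per_element)
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"


if __name__ == "__main__":
    # Single-token decode step: one row of activations against a big weight matrix.
    print(limiting_factor(1, 8192, 8192))
    # Large training-style batch against the same weights.
    print(limiting_factor(8192, 8192, 8192))
```

With these assumptions the single-row case has an intensity of about 2 FLOPs per byte and is memory-bound, while the large square matmul is comfortably compute-bound. Real kernels reuse data in on-chip memory, so treat this as a first-order estimate.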


How a TPU works


Image source: Google Cloud

A TPU is built for one job: tensor math. Specifically, the matrix multiplications at the core of neural network training and inference.

The key piece of hardware is the systolic array. In a standard processor, every operation reads inputs from memory, computes, and writes the result back. In a systolic array, data flows through a grid of multiply-and-accumulate units. You load the weights once, pass inputs through the grid, and results flow from unit to unit without going back to main memory. This removes the constant memory round-trips that slow conventional chips.
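To make the data flow concrete, here is a minimal cycle-by-cycle sketch of a weight-stationary systolic array in plain Python. It illustrates the general technique and is not a model of any actual TPU: the function name `systolic_matmul`, the register layout, and the input skewing scheme are assumptions chosen for readability.

```python
def systolic_matmul(X, W):
    """Compute X @ W by simulating a K x N grid of multiply-accumulate cells.

    Cell (k, n) holds weight W[k][n] permanently. Activations move one cell
    to the right per cycle, partial sums move one cell down per cycle.
    """
    B, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]   # activation register in each cell
    psum = [[0] * N for _ in range(K)]  # partial-sum register in each cell
    Y = [[0] * N for _ in range(B)]

    for t in range(B + K + N - 2):  # cycles until the last result drains
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Row k is fed X[b][k] with a skew of k cycles.
                    b = t - k
                    a_in = X[b][k] if 0 <= b < B else 0
                else:
                    a_in = act[k][n - 1]  # from the left neighbour
                p_in = psum[k - 1][n] if k > 0 else 0  # from the cell above
                new_act[k][n] = a_in
                new_psum[k][n] = p_in + a_in * W[k][n]
        act, psum = new_act, new_psum

        # Finished sums leave the bottom row; column n is delayed by n cycles.
        for n in range(N):
            b = t - n - (K - 1)
            if 0 <= b < B:
                Y[b][n] = psum[K - 1][n]
    return Y


if __name__ == "__main__":
    X = [[1, 2], [3, 4], [5, 6]]
    W = [[1, 0, 2], [0, 1, 3]]
    print(systolic_matmul(X, W))  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```

Note that the weights are read once, when the grid is set up. Every intermediate value after that moves only between neighbouring cells, which is the property the paragraph above describes.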

Google built BF16 support into TPUs from early generations; GPUs added it later. Recent chips support FP8 natively, which helps throughput for inference workloads.
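For readers who have not met BF16 before: it keeps float32's sign bit and 8-bit exponent but only 7 mantissa bits, so it covers the same range as float32 with far less precision. The sketch below emulates the conversion with the standard library. The helper name `to_bfloat16` is made up for this example, and it ignores NaN and infinity handling.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a value to the nearest bfloat16 and return it as a Python float.

    Sketch only: keeps the top 16 bits of the float32 encoding, using
    round-to-nearest-even on the 16 bits that are dropped.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbour
    bits = (bits + rounding) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    print(to_bfloat16(3.14159))  # 3.140625: only 2 to 3 decimal digits survive
    print(to_bfloat16(70000.0))  # 70144.0: still finite, FP16 tops out at 65504
```

The second line is the practical reason BF16 is popular for training: large values that would overflow FP16 stay representable, at the cost of coarser rounding.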

The limitation: TPUs work poorly with dynamic control flow, variable-length sequences, and custom operations. They are best suited for static computation graphs, which is what most transformer models produce.
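One common way to live with the static-shape preference is to pad variable-length inputs up to a small set of fixed lengths, so the compiler only ever sees a few distinct shapes. The sketch below shows the idea in plain Python. The bucket sizes, the pad token id, and the function names are all assumptions for illustration.

```python
# Pad variable-length token sequences to a few fixed "bucket" lengths.
# Bucket sizes and pad_id are illustrative choices, not recommendations.

DEFAULT_BUCKETS = (128, 256, 512, 1024)


def pick_bucket(length: int, buckets=DEFAULT_BUCKETS) -> int:
    """Return the smallest bucket that fits a sequence of this length."""
    for bucket in sorted(buckets):
        if length <= bucket:
            return bucket
    raise ValueError(f"sequence of length {length} exceeds the largest bucket")


def pad_to_bucket(tokens, pad_id: int = 0, buckets=DEFAULT_BUCKETS):
    """Pad tokens to their bucket length; also return a 1/0 attention mask."""
    target = pick_bucket(len(tokens), buckets)
    padding = target - len(tokens)
    return tokens + [pad_id] * padding, [1] * len(tokens) + [0] * padding


if __name__ == "__main__":
    padded, mask = pad_to_bucket(list(range(1, 131)))  # 130 real tokens
    print(len(padded), sum(mask))  # 256 130
```

The trade-off is wasted compute on padding in exchange for fewer recompilations. Fewer buckets mean fewer compiled shapes but more padding.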


Side-by-side comparison

When to use a GPU


Image source: Google Cloud. Recommended GPUs based on workload type.

PyTorch-first teams. Most research code on GitHub, most open-source model checkpoints, and most fine-tuning guides assume a GPU. If your team works primarily in PyTorch, starting on GPU is faster.

Models with TensorFlow ops that are not available on Cloud TPU (see the list of available TensorFlow ops)

Models with dynamic inputs. Variable-length sequences, conditional branches, and custom CUDA extensions work on GPUs but can be tricky to run on TPUs.

Medium-to-large models with larger effective batch sizes

Multi-cloud or on-prem deployments. TPUs are only available through Google Cloud. If your infrastructure is on AWS, Azure, or your own servers, GPUs are the only option.

Mixed workloads. If the same team does ML training, scientific simulation, and rendering, GPUs handle all of it. TPUs don't.

Small teams moving fast. GPU tooling (profilers, debuggers, community tutorials) is more mature. Diagnosing a performance problem on a GPU is easier today than on a TPU.


When to use a TPU


Models relying on embeddings: Cloud TPUs feature SparseCores, which are dataflow processors specifically built to accelerate models that heavily use embeddings. This makes them ideal for applications like recommendation systems. - Google Cloud

Training massive deep learning models: If you're building and training large and complex deep learning models, especially large language models (LLMs), Cloud TPUs are designed to handle the immense number of matrix calculations involved efficiently.

Models dominated by matrix computations

Models that train for weeks or months

Models with ultra-large embeddings common in advanced ranking and recommendation workloads

Large-scale transformer training. TPU pods scale to thousands of chips through Google's Inter-Chip Interconnect (ICI). Training something like Gemma on a TPU pod can be faster and cheaper per token than on a comparable GPU cluster, depending on the model and on current pricing.

High-volume production inference. TPU v6e (Trillium) and Ironwood were built specifically for inference workloads. Ironwood delivers more than 4x better performance per chip for inference compared to TPU v6e (Trillium).

Models with no custom PyTorch/JAX operations inside the main training loop

Google open-weight models. Gemma 4 (released April 2026) is built and optimized for TPU serving. Google publishes JAX reference implementations for every Gemma variant, and there are community guides for deploying Gemma 4 via vLLM on Cloud TPU.

Cloud TPUs are not suited to the following workloads:

  • Linear algebra programs that require frequent branching or contain many element-wise algebra operations
  • Workloads that require high-precision arithmetic
  • Neural network workloads that contain custom operations in the main training loop

Google's current TPU lineup

TPU v5e, available now

Good starting point. Used for smaller inference workloads and fine-tuning. Lower per-chip cost than newer generations.

TPU v6e (Trillium), available now

4.7x the peak compute of v5e, with 67% better energy efficiency. Scales to 256 chips per pod. Still widely used for inference, particularly for teams where cost per chip-hour matters more than raw throughput. vLLM supports TPU v6e for both offline batch inference and online API serving.

TPU v7 (Ironwood), generally available since late 2025

Announced at Google Cloud Next 2025. Specs per chip: 4,614 FP8 TFLOPS, 192GB of HBM3E memory, 7.37 TB/s memory bandwidth, 9.6 Tb/s inter-chip interconnect. Scales to 9,216 chips in a single superpod, delivering 42.5 FP8 ExaFLOPS per pod. That's more than 4x the performance per chip of TPU v6e (Trillium) and 10x that of TPU v5p.
Each Ironwood chip contains two TensorCores and four SparseCores in a dual-chiplet design. Anthropic's Claude models train and serve on TPUs, and Anthropic signed an agreement to access up to one million Ironwood TPUs through Google Cloud.
Google used AlphaChip, a reinforcement learning tool, to help design Ironwood's chip layouts.
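The pod-level numbers follow directly from the per-chip specs, which makes them easy to sanity-check. The short sketch below redoes the arithmetic using only figures quoted above. Decimal units are assumed (1 ExaFLOPS = 1,000,000 TFLOPS, 1 PB = 1,000,000 GB), and the variable names are made up for this example.

```python
# Sanity check: Ironwood superpod totals from the per-chip specs quoted above.
CHIPS_PER_SUPERPOD = 9216
FP8_TFLOPS_PER_CHIP = 4614
HBM_GB_PER_CHIP = 192

pod_exaflops = CHIPS_PER_SUPERPOD * FP8_TFLOPS_PER_CHIP / 1e6  # TFLOPS -> ExaFLOPS
pod_hbm_petabytes = CHIPS_PER_SUPERPOD * HBM_GB_PER_CHIP / 1e6  # GB -> PB

print(f"{pod_exaflops:.1f} FP8 ExaFLOPS per superpod")  # 42.5
print(f"{pod_hbm_petabytes:.2f} PB of HBM per superpod")  # 1.77
```

The compute total matches the quoted 42.5 ExaFLOPS. The HBM total is a derived figure, not one stated in this post.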

TPU 8t and TPU 8i (eighth generation), coming later in 2026

Announced at Google Cloud Next 2026. For the first time, Google has split its TPU lineup into two chips with different architectures for training and inference.

TPU 8t is built for training. A single superpod holds 9,600 chips with 2 petabytes of shared HBM memory and 121 FP4 ExaFLOPS of compute, nearly triple Ironwood's per-pod figure (though that comparison mixes precisions: Ironwood's 42.5 ExaFLOPS is quoted at FP8). ICI bandwidth is 19.2 Tb/s per chip, double Ironwood's. The new Virgo Network fabric can link 134,000 chips across a data center and, in theory, over 1 million chips across sites. TPUDirect RDMA and TPU Direct Storage bypass the host CPU entirely, doubling bandwidth for large data transfers. Google targets 97% goodput, meaning 97% of compute cycles go toward actual learning rather than overhead.

TPU 8i is built for inference. It scales to 1,152 chips per pod and delivers 11.6 FP8 ExaFLOPS. Each chip carries 288GB of HBM (more than the 8t training chip) and 384MB of on-chip SRAM, 3x what Ironwood had. Google reports 80% better performance-per-dollar versus Ironwood for inference, and 2x better performance-per-watt.
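Because the eighth-generation figures are quoted per pod, it can help to divide them back down to approximate per-chip values. The sketch below does that with the numbers above. The results are back-of-the-envelope estimates, not published per-chip specs, and they assume decimal units and an even split across chips.

```python
# Rough per-chip estimates derived from the pod-level figures quoted above.
# These are derived values, not official per-chip specifications.

TPU_8T = {"chips": 9600, "pod_exaflops": 121.0, "pod_hbm_pb": 2.0}  # FP4 compute
TPU_8I = {"chips": 1152, "pod_exaflops": 11.6, "hbm_gb_per_chip": 288}  # FP8 compute


def petaflops_per_chip(pod_exaflops: float, chips: int) -> float:
    """Spread a pod's ExaFLOPS evenly across its chips, in PetaFLOPS."""
    return pod_exaflops * 1000 / chips


t_pflops = petaflops_per_chip(TPU_8T["pod_exaflops"], TPU_8T["chips"])
i_pflops = petaflops_per_chip(TPU_8I["pod_exaflops"], TPU_8I["chips"])
t_hbm_gb = TPU_8T["pod_hbm_pb"] * 1e6 / TPU_8T["chips"]

print(f"8t: about {t_pflops:.1f} FP4 PFLOPS and {t_hbm_gb:.0f} GB HBM per chip")
print(f"8i: about {i_pflops:.1f} FP8 PFLOPS and {TPU_8I['hbm_gb_per_chip']} GB HBM per chip")
```

The two compute figures are at different precisions (FP4 versus FP8), so they are not directly comparable. The estimated 8t memory per chip, a little over 200GB, is consistent with the statement that the 8i carries more HBM than the 8t.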

The 8i uses a new Boardfly interconnect that reduces the maximum number of network hops from 16 to 7. This matters for Mixture-of-Experts models, where data needs to move quickly between expert layers. The chip also replaces Ironwood's SparseCores with a Collectives Acceleration Engine (CAE), which cuts the latency of collective operations by 5x. That matters when many agents run concurrently and small latencies multiply across thousands of calls.

The reason the inference chip has more memory than the training chip: large MoE inference is memory-bandwidth-bound. The chip serving tokens needs to stream weights and KV-cache faster than the chip training the model. Both 8t and 8i run on Google's Axion ARM host CPU and use liquid cooling.
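A quick estimate shows how fast the KV-cache alone can consume inference memory. The sketch below uses the standard sizing formula (keys plus values, per layer, per KV head, per token). The model dimensions are entirely hypothetical and do not describe any real model. Only the 288GB capacity comes from this post.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=1):
    """Bytes needed to hold keys and values for every token in the batch."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch


# Hypothetical model: 64 layers, 8 KV heads of dimension 128, FP8 cache,
# 131,072-token context. These numbers are illustrative assumptions.
one_sequence = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128,
                              seq_len=131072, batch=1)

HBM_BYTES_8I = 288e9  # per-chip HBM quoted above, in decimal gigabytes
max_sequences = int(HBM_BYTES_8I // one_sequence)

print(f"{one_sequence / 1e9:.1f} GB of KV-cache per sequence")  # 17.2
print(f"{max_sequences} full-length sequences fit, before counting weights")  # 16
```

Every generated token has to read that cache back, along with the weights, which is why both capacity and bandwidth dominate serving cost for long contexts.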

More info from TPU Overview


The software side

TPUs run best with a few specific tools:

  • JAX is Google's ML framework. Its jit, vmap, pmap, and shard_map primitives map directly onto TPU hardware. If you're new to TPUs and want to get the most out of them, JAX is where to start.
  • MaxText is Google's open-source LLM reference implementation for TPUs, available at AI-Hypercomputer/maxtext on GitHub. It's a practical starting point for training large language models on TPU pods.
  • Pallas is Google's Python-based kernel language for writing low-level, hardware-aware kernels. Supported on both Ironwood and the eighth-generation chips.
  • vLLM now has first-class TPU support. You can run offline batch inference or an OpenAI-compatible API server on a Cloud TPU VM with standard configuration.
  • PyTorch on TPU is in preview as of the eighth-generation launch. If your team is on PyTorch, you can now bring existing models to TPU hardware without rewriting them in JAX.
  • Google's Gemma 4 (April 2026) is optimized for TPU serving. The google-deepmind/gemma GitHub repo has JAX reference implementations for every model variant.
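As a small taste of the JAX primitives mentioned above, the sketch below combines `jax.vmap` (vectorize a per-example function over a batch) with `jax.jit` (compile it with XLA). It assumes the `jax` package is installed, and it runs unchanged on CPU, GPU, or TPU. The `dense` function and the array shapes are made up for illustration.

```python
import jax
import jax.numpy as jnp


def dense(x, w, b):
    """A single dense layer applied to ONE example of shape (features,)."""
    return jnp.tanh(x @ w + b)


# vmap maps dense over axis 0 of x while sharing w and b across the batch;
# jit compiles the batched function once per input shape.
batched_dense = jax.jit(jax.vmap(dense, in_axes=(0, None, None)))

x = jnp.ones((8, 4))       # batch of 8 examples, 4 features each
w = jnp.full((4, 3), 0.5)  # shared weights
b = jnp.zeros(3)           # shared bias

y = batched_dense(x, w, b)
print(y.shape)  # (8, 3)
```

Because `jit` specializes on input shapes, calling `batched_dense` with a different batch size triggers a new compilation, which ties back to the static-shape preference described earlier.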


Summary

GPUs are the practical default for most research and development work. The tooling is mature, the community is large, and most models you'll find online were built on GPUs.
TPUs are worth the switch when you're running workloads at sustained scale on Google Cloud, especially for inference. Ironwood is available today. The eighth-generation 8t and 8i chips, which separate training and inference into dedicated hardware, are coming later in 2026. If you want to try TPUs before committing, Google Colab's free TPU runtime lets you run a JAX or Keras model on one without any setup.


Resources

Google's eighth-generation TPUs: two chips for the agentic era
TPU 8t and TPU 8i technical deep dive - Google Cloud Blog
Ironwood: The first Google TPU for the age of inference
Training large models on Ironwood TPUs - Google Cloud Blog
Performance per dollar of GPUs and TPUs for AI inference - Google Cloud Blog
Building production AI on Google Cloud TPUs with JAX
MaxText: LLM reference implementation for TPUs - GitHub
Gemma open-weight LLM library - Google DeepMind GitHub
Serve and Inference Gemma 4 on TPU
Google Cloud unveils eighth-generation TPUs - TechRadar

