The Brutal Reality of Running Gemma 4 Locally
Sayandip Roy · 2026-05-23 · via DEV Community

This is a submission for the Google I/O 2026 Writing Challenge


"At Google I/O 2026, Google made a specific claim: Gemma 4 runs on consumer laptops without cloud dependency. They demoed offline coding on stage. Local AI on everyday hardware is finally practical, they said."


I tested that claim

GPU and high-bandwidth memory prices are not normal right now. AI companies are buying hardware at a scale that has genuinely disrupted the consumer market. A PC build suitable for local AI costs significantly more than it would have three or four years ago, if you can find the parts at all.

If you bought your machine before the AI hardware gold rush, you have leverage most people do not. I bought my laptop four years ago. An RTX 3050 with 4GB VRAM is not a serious AI card by any current standard, but it is exactly the kind of hardware Google implied Gemma 4 would run on. Beyond lightweight models, local inference only starts to feel consistently comfortable at around 16GB of VRAM. I have 4GB. This is what that looks like.


The Model Loaded. Then the Problems Started.

You install Ollama, pull the model, the weights load, the cursor blinks.

The GPU appears busy. Fans are screaming. The model is loaded entirely in VRAM. And long-context inference still slows down much faster than most demos suggest.

With Gemma 4 specifically, E2B loaded on my machine. E4B required closing everything else first to free RAM. Neither behaved the way the keynote implied.

Real throughput was more nuanced than I expected.

# Sustained long-form inference benchmark
# RTX 3050 Laptop GPU (4GB VRAM)
# 16GB DDR5 RAM
# Ollama on Windows

# Gemma 4 E2B
# eval rate: ~38.68 tok/s

# Gemma 4 E4B
# eval rate: ~24.39 tok/s

# Same prompt.
# Same hardware.
# Same runtime.

# E2B remained surprisingly usable.
# E4B pushed much closer to the memory wall.
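
"eval rate" is the label Ollama uses for decode speed in its timing summary. As a minimal sketch for logging that number across runs, the parser below pulls it out of captured output, assuming the figures come from `ollama run --verbose`. The sample text is an assumed shape of that summary (the layout may differ between Ollama versions), its prompt rate and token count are placeholders, and `parse_eval_rate` and `relative_speed` are hypothetical helper names.

```python
import re

# Assumed shape of the timing summary printed by `ollama run --verbose`.
# The prompt rate and eval count are placeholders, not measurements.
SAMPLE_OUTPUT = """\
prompt eval rate:     412.50 tokens/s
eval count:           512 token(s)
eval rate:            38.68 tokens/s
"""

# Anchor at line start so "prompt eval rate" (prefill) is not matched.
EVAL_RATE = re.compile(r"^eval rate:\s+([\d.]+)\s+tokens/s", re.MULTILINE)


def parse_eval_rate(text: str) -> float:
    """Return the decode speed in tokens per second from verbose output."""
    match = EVAL_RATE.search(text)
    if match is None:
        raise ValueError("no 'eval rate' line found")
    return float(match.group(1))


def relative_speed(candidate: float, baseline: float) -> float:
    """Speed of `candidate` as a fraction of `baseline`."""
    return candidate / baseline


if __name__ == "__main__":
    e2b = parse_eval_rate(SAMPLE_OUTPUT)
    e4b = 24.39  # E4B figure reported in the benchmark above
    print(f"E2B: {e2b:.2f} tok/s")
    print(f"E4B runs at {relative_speed(e4b, e2b):.0%} of E2B speed")
```

Keeping the ratio next to the raw numbers makes it easier to tell whether a later config change helped the larger model more than the smaller one.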


The slowdown was not catastrophic. That was the interesting part. E2B remained mostly inside GPU memory on this workload, which avoided the worst PCIe and shared-memory penalties.

Small efficient models are now genuinely viable on consumer hardware. The problems start once context length, KV cache growth, and memory spillover begin compounding at the same time.

# First thing to check: is the model actually in GPU memory?
nvidia-smi

# Watch VRAM live as a conversation grows
# If VRAM rises and speed falls, KV cache is overflowing into RAM
# -l 1 refreshes every second and also works on Windows, where `watch` is unavailable
nvidia-smi -l 1



The Real Bottleneck Is Not Compute

Every inference run has two phases.

Prefill: the model reads your entire prompt in parallel. Compute-heavy, GPU handles it well. You generally do not feel this.

Decode: the model generates each output token one at a time. This is memory-bound. Every token forces the GPU to reload model weights from memory again. The GPU finishes its math and waits. It is not slow. It is starving for bandwidth.

This is why local inference feels slow even when the GPU reports as busy.

# Memory bandwidth comparison — this is what determines tokens/sec

# RTX 3050 4GB     -> ~192 GB/s   (my machine)
# RTX 3060 12GB    -> ~360 GB/s
# RTX 4090 24GB    -> ~1008 GB/s
# M4 Max           -> ~546 GB/s
# M3 Ultra         -> ~800 GB/s

# VRAM capacity gets you the model loaded
# Bandwidth determines how fast it actually runs
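
To make the bandwidth argument concrete: if every generated token has to stream the full set of weights once, bandwidth divided by weight size gives a ceiling on decode speed. The sketch below applies that back-of-envelope model to the figures in the comparison above. It assumes a dense model whose weights are all read per token, the 3.0 GB weight size is a placeholder rather than a measured Gemma 4 file size, and `decode_ceiling` is a hypothetical helper. KV cache reads and compute are ignored, so treat the output as an upper bound, not a prediction.

```python
# Peak memory bandwidth in GB/s, taken from the comparison above.
BANDWIDTH_GBPS = {
    "RTX 3050 4GB": 192,
    "RTX 3060 12GB": 360,
    "RTX 4090 24GB": 1008,
    "M4 Max": 546,
    "M3 Ultra": 800,
}


def decode_ceiling(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token streams all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gbps / weights_gb


if __name__ == "__main__":
    weights_gb = 3.0  # placeholder size; substitute your GGUF file size
    for name, bw in BANDWIDTH_GBPS.items():
        print(f"{name:<14} <= {decode_ceiling(bw, weights_gb):6.1f} tok/s")
```

If a measured eval rate sits far below this ceiling, the likely suspects are spillover into system RAM or a growing KV cache rather than raw compute.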


Check your own card before loading anything:

# Linux: query GPU name and memory from the driver
nvidia-smi --query-gpu=name,memory.total --format=csv


# Windows: grep does not exist in PowerShell
# Use Select-String instead
nvidia-smi -q | Select-String "Product Name", "Total", "Free", "Used"

# nvidia-smi does not report memory bandwidth directly, on Windows or elsewhere
# Get the real number from: https://www.techpowerup.com/gpuz/
# The "Memory Bandwidth" field on the main tab is what you want


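# Apple Silicon: no nvidia-smi, and system_profiler does not report bandwidth
# Identify the chip and memory size, then look up the chip's published bandwidth
system_profiler SPHardwareDataType | grep -E "Chip|Memory"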



The KV Cache Is Quietly Eating Your VRAM

Even if your model fits in VRAM, that headroom disappears as your conversation grows.

Every token the model has seen gets stored in the key-value cache. Without it, the model would reprocess the entire conversation on every generation step. The KV cache trades memory for speed. The tradeoff is that it grows with every token in the context.

For Gemma 4 E2B, a moderately long conversation on a 4GB card can push you over the edge mid-generation. The model does not crash. It silently spills into system RAM, and tokens per second fall off a cliff.
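
A rough size estimate shows how quickly this adds up. With full attention in every layer, the cache holds one key and one value vector per KV head, per layer, per token. The sketch below uses a hypothetical architecture (32 layers, 8 KV heads, head dimension 128), which is not Gemma 4's published configuration, and `kv_cache_bytes` is an illustrative helper. Models that use sliding-window or shared-KV layers need less than this.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """Bytes for keys and values across all layers with full attention."""
    # Factor of 2 covers one key vector plus one value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


if __name__ == "__main__":
    # Hypothetical architecture, NOT Gemma 4's published configuration.
    cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for ctx in (4096, 8192, 32768):
        mib = kv_cache_bytes(ctx, **cfg) / 2**20
        print(f"{ctx:>6} tokens -> {mib:7.0f} MiB of f16 KV cache")
```

Under these assumptions each token costs 128 KiB, so doubling the context doubles the cache, and that comes straight out of whatever VRAM the weights left free.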

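# Ollama defaults to 4096 token context even on models that support 128K
# This is why your model seems to forget things in long conversations
# Set it explicitly so you know what you are allocating

# Context length is a server setting, so set it where `ollama serve` starts
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# Or set it per session from inside `ollama run gemma4:e2b`
# /set parameter num_ctx 8192

# Confirm what context your running model is actually using
ollama ps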



Quantization Is Not Just About Fitting the Model

Most guides explain quantization as a way to make models smaller so they fit in VRAM. That undersells it.

The real bottleneck is how fast the GPU can move weights from memory to compute units. Quantization reduces bytes per weight, so fewer bytes move per token generated. A 4-bit model transfers roughly a quarter of the data per inference step that FP16 does, which puts the ceiling on decode speedup near 4x. Real gains are smaller, because quantized formats carry per-block scales and the KV cache still has to be read.
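
The arithmetic behind that can be sketched in a few lines. llama.cpp's simple block formats store a 16-bit scale with every 32 weights, so Q8_0 works out to 8.5 bits per weight and Q4_0 to 4.5. K-quants such as Q4_K_M use a different layout and are not modeled here. The 2-billion parameter count is a placeholder, not a Gemma 4 figure, and the helper names are hypothetical.

```python
# Effective bits per weight. The quantized values assume llama.cpp's
# block layout of 32 weights plus one 16-bit scale per block.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weight_gb(n_params: float, fmt: str) -> float:
    """Size of the weights in GB (10^9 bytes) for a given format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def traffic_reduction(fmt: str) -> float:
    """How many times fewer bytes move per token compared with FP16."""
    return BITS_PER_WEIGHT["FP16"] / BITS_PER_WEIGHT[fmt]


if __name__ == "__main__":
    n_params = 2e9  # placeholder parameter count, not a Gemma 4 figure
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:<5} {weight_gb(n_params, fmt):5.2f} GB  "
              f"{traffic_reduction(fmt):4.2f}x less traffic than FP16")
```

The block scales are why the reduction for a 4-bit format lands nearer 3.6x than a clean 4x.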

# Quantization levels for Gemma 4 via llama.cpp

# Q2/Q3   -> smallest file, lowest quality, fits tight VRAM
# Q4_K_M  -> best balance for most consumer hardware
# Q8_0    -> higher quality, needs more VRAM
# FP16    -> full precision, not practical on 4GB cards


Quantizing the KV cache separately is now supported in llama.cpp and is worth doing on constrained hardware:

# --cache-type-k and --cache-type-v cut KV cache memory ~50%
# with minimal quality impact — easier than switching model sizes

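# Flags used below (a comment after a trailing backslash would break the command):
#   --n-gpu-layers 99     push all layers to GPU
#   --cache-type-k q8_0   quantize key cache
#   --cache-type-v q8_0   quantize value cache (may need flash attention enabled; check llama-cli --help)
#   --ctx-size 4096       keep context tight on 4GB cards
./llama-cli \
  -m gemma4-e2b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 4096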
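
One way to see what cache quantization buys is to turn the question around: for a fixed amount of free VRAM, how many tokens of context fit? The sketch below compares an f16 cache with q8_0 (8.5 bits per element once block scales are counted). The architecture and the 512 MiB of headroom are hypothetical values chosen for illustration, not Gemma 4's configuration or a measurement, and `max_context_tokens` is an illustrative helper.

```python
def max_context_tokens(headroom_mib: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, bits_per_elem: float) -> int:
    """Largest token count whose K and V entries fit in the headroom."""
    # Factor of 2 covers keys and values; divide by 8 to convert bits to bytes.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bits_per_elem / 8
    return int(headroom_mib * 2**20 // bytes_per_token)


if __name__ == "__main__":
    # Hypothetical architecture and headroom, for illustration only.
    cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    headroom_mib = 512
    for name, bits in (("f16", 16.0), ("q8_0", 8.5)):
        tokens = max_context_tokens(headroom_mib, bits_per_elem=bits, **cfg)
        print(f"{name:<5} cache: about {tokens} tokens fit in {headroom_mib} MiB")
```

Under these assumptions the same headroom holds a little under twice the context with a q8_0 cache, which is often the difference between staying in VRAM and spilling.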



The Layer Offloading Trap

When VRAM is tight, --n-gpu-layers 20 on a 32-layer model sounds like a reasonable compromise. It is usually not.

Partial offloading means some inference steps cross the PCIe bus, introducing high-latency transfers that stall the pipeline. The slowdown is not proportional to layers offloaded. Even a few CPU-side layers can significantly tank throughput.

# This looks like a reasonable compromise. It is not.
# Every forward pass stalls waiting on PCIe transfers for CPU-side layers.
./llama-cli \
  -m gemma4-e2b-q4_k_m.gguf \
  --n-gpu-layers 20           # partial offload = worst of both worlds

# Better: use Q3 so the whole model fits on GPU at --n-gpu-layers 99
./llama-cli \
  -m gemma4-e2b-q3_k_m.gguf \
  --n-gpu-layers 99           # everything in VRAM, no PCIe stalls



What Windows Task Manager Is Lying to You About

This is where most people on Windows laptops get confused.

While running Gemma 4 E4B, Task Manager showed the RTX 3050 at 0% GPU utilization. At the same time, nvidia-smi showed:

# nvidia-smi output during active Gemma 4 E4B inference
# Task Manager said 0%. This is what was actually happening.

# +-----------------------------------------------+
# | GPU: NVIDIA GeForce RTX 3050 Laptop GPU        |
# | VRAM:    3564MiB / 4096MiB  (87% full)        |
# | GPU-Util: 44%                                  |
# | Power:    52W / 95W                            |
# +-----------------------------------------------+

# Always trust nvidia-smi over Task Manager for CUDA workloads
# Task Manager shows 3D engine usage — LLM inference runs on CUDA compute
# Windows sees "no 3D rendering" and reports 0%
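
Since nvidia-smi is the number to trust, it can help to read it programmatically. The sketch below is one possible approach: it asks nvidia-smi for utilization and memory in CSV form and computes the remaining VRAM headroom. `parse_gpu_line` and `read_gpus` are hypothetical helper names, and the parser assumes every field is numeric, which will not hold if the driver reports a field as unavailable.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row produced with --format=csv,noheader,nounits."""
    util, used, total = (float(field) for field in line.split(","))
    return {
        "util_pct": util,
        "used_mib": used,
        "total_mib": total,
        "headroom_mib": total - used,
        "vram_pct": 100 * used / total,
    }


def read_gpus() -> list:
    """Query every NVIDIA GPU once; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(row) for row in out.splitlines() if row.strip()]


if __name__ == "__main__":
    # Figures from the E4B run above, in the order given by QUERY.
    stats = parse_gpu_line("44, 3564, 4096")
    print(f"GPU {stats['util_pct']:.0f}% busy, "
          f"{stats['headroom_mib']:.0f} MiB VRAM left "
          f"({stats['vram_pct']:.0f}% full)")
```

On a machine with an NVIDIA driver installed, calling `read_gpus()` in a loop during a long conversation shows the headroom shrinking as the KV cache grows.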


[Image: RTX 3050 showing active VRAM usage during Gemma 4 E4B inference]

Now the 11.6GB figure that Windows reports as GPU memory. This laptop has two GPUs: the RTX 3050 (GPU 1) and the AMD Radeon iGPU inside the Ryzen 7 6800H (GPU 0). The AMD iGPU has no dedicated VRAM. It borrows from system RAM dynamically. Windows counts that shared system RAM as GPU memory and adds it to the dedicated VRAM:

# How Windows calculates "total GPU memory" on a dual-GPU laptop

# RTX dedicated VRAM:          4.0 GB  (fast, ~192 GB/s)
# AMD iGPU shared system RAM:  7.6 GB  (slow, ~70-90 GB/s)
# ----------------------------------------
# Windows "GPU Memory":       11.6 GB  (misleading total)

# You do NOT have 11.6GB of fast VRAM
# You have 4GB fast + 7.6GB slow with a PCIe penalty to cross between them


[Image: AMD Radeon integrated GPU contributing shared system memory]

And here is system RAM during E4B inference:

[Image: System RAM pressure during Gemma 4 E4B inference]

13.2GB of 15.3GB used. 2.1GB available. Ollama is consuming roughly 4GB of system memory alongside the 3.5GB allocated in dedicated VRAM. The actual footprint for Gemma 4 E4B is 7 to 8GB total, split cleanly across two entirely different physical hardware pools running at wildly mismatched speeds. That split is exactly why generation feels slower than the model size alone would suggest.

At the same time, Ollama alone was consuming nearly 8GB of system RAM:

[Image: Ollama consuming nearly 8GB RAM during Gemma 4 E4B inference]

# "The model loaded" does not mean the system is comfortable

# During Gemma 4 E4B inference on a 4GB RTX 3050 laptop:

# GPU memory pool
# ----------------
# Dedicated VRAM (RTX 3050)      -> 4.0 GB
# Shared DDR5 system memory      -> 7.6 GB
# Effective Windows "GPU Memory" -> 11.6 GB

# Real-world bottlenecks
# ----------------------
# [x] VRAM saturation
# [x] KV cache growth
# [x] Shared memory spillover
# [x] PCIe transfer overhead
# [x] Windows scheduler latency
# [x] Dual-GPU memory juggling

# Result
# ------
# The model technically fits.
# The hardware still struggles.
#
# Local inference on consumer laptops is often a
# memory orchestration problem, not a compute problem.


The result is that local AI performance becomes a memory orchestration problem long before it becomes a compute problem.


Hardware Tiers for Gemma 4 in 2026

# What you can realistically run locally in 2026
# (and what it costs to buy the hardware right now)

# 4GB VRAM (RTX 3050 — my machine)
#   -> Gemma 4 E2B with Q4 quantization
#   -> short contexts only, KV cache fills fast
#   -> the floor for local AI, barely

# 8GB-12GB VRAM
#   -> comfortable Gemma 4 E4B
#   -> 7B models from other families run well
#   -> context length starts to matter

# 16GB-24GB VRAM
#   -> where Gemma 4 becomes reliable for real work
#   -> this is what Google probably had in mind at I/O
#   -> good luck finding one at a reasonable price

# 36GB-64GB Unified Memory (Apple Silicon)
#   -> best consumer option for serious local AI
#   -> no VRAM/RAM split, no PCIe penalty

# 96GB-192GB Unified Memory
#   -> 70B models, workstation territory



Measure Before You Tune

# Get a baseline before changing anything
# Run this before and after every config change
./llama-bench -m gemma4-e2b-q4_k_m.gguf -p 512 -n 128


# Windows: check Ollama RAM usage directly
Get-Process ollama | Select-Object ProcessName,WorkingSet64

# Or watch:
# Task Manager -> Performance -> Memory


# Linux equivalent
free -h


# Watch GPU utilization and VRAM together in one view
# sm = compute utilization, mem = memory-controller utilization, fb = VRAM in use
nvidia-smi dmon -s mu

# Apple Silicon: print a snapshot of system-wide memory pressure
# For the live green/yellow/red graph, open Activity Monitor -> Memory
memory_pressure



What Google Got Right and What They Left Out

Gemma 4 E2B running locally on a 4GB VRAM laptop is not nothing. Four years ago that would not have been possible at all. The model quality for its size is genuinely impressive.

But "runs on consumer laptops" and "runs well on consumer laptops" are different claims. The I/O keynote did not mention memory bandwidth, KV cache overflow, or the fact that the hardware shortage means GPUs with enough VRAM for comfortable inference are still expensive and unusually difficult to find.

# What "model loaded successfully" actually guarantees

# NOT guaranteed:
# [ ] fits comfortably in VRAM
# [ ] KV cache has room to grow
# [ ] throughput will be usable
# [ ] PCIe offloading is avoided

# ONLY guaranteed:
# [x] weights entered memory without crashing


The model loading is the beginning of the problem. What happens after is a memory bandwidth race your hardware either wins or does not. Now you know which race you are in.