The Brutal Reality of Running Gemma 4 Locally

This is a submission for the Google I/O 2026 Writing Challenge

At Google I/O 2026, Google made a specific claim: Gemma 4 runs on consumer laptops without cloud dependency. They demoed offline coding on stage. Local AI on everyday hardware is finally practical, they said.

I tested that claim.

GPU and high-bandwidth memory prices are not normal right now. AI companies are buying hardware at a scale that has genuinely disrupted the consumer market. A PC build suitable for local AI costs significantly more than it would have three or four years ago, if you can find the parts at all.

If you bought your machine before the AI hardware gold rush, you have leverage most people do not. I bought my laptop four years ago. An RTX 3050 with 4GB VRAM is not a serious AI card by any current standard, but it is exactly the kind of hardware Google implied Gemma 4 would run on. For local inference to start feeling consistently comfortable beyond lightweight models, 16GB VRAM is where things become much less restrictive. I have 4GB. This is what that looks like.

The Model Loaded. Then the Problems Started.

You install Ollama, pull the model, the weights load, the cursor blinks.

The GPU appears busy. Fans are screaming. The model is loaded entirely in VRAM. And long-context inference still slows down much faster than most demos suggest.

With Gemma 4 specifically, E2B loaded on my machine. E4B required closing everything else first to free RAM. Neither behaved the way the keynote implied.

Real throughput was more nuanced than I expected.

# Sustained long-form inference benchmark
# RTX 3050 Laptop GPU (4GB VRAM)
# 16GB DDR5 RAM
# Ollama on Windows

# Gemma 4 E2B
# eval rate: ~38.68 tok/s

# Gemma 4 E4B
# eval rate: ~24.39 tok/s

# Same prompt.
# Same hardware.
# Same runtime.

# E2B remained surprisingly usable.
# E4B pushed much closer to the memory wall.
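
For anyone who wants to reproduce the eval rate figures on their own machine: Ollama prints them when run with --verbose, and its REST API returns the raw counters. The sketch below assumes the `eval_count` and `eval_duration` (nanoseconds) fields of an `/api/generate` response. The sample values are made up for illustration and are not my measurements.

```python
def eval_rate(response: dict) -> float:
    """Decode-phase tokens per second from an Ollama-style response dict."""
    tokens = response["eval_count"]
    seconds = response["eval_duration"] / 1e9  # duration is reported in nanoseconds
    if seconds <= 0:
        raise ValueError("eval_duration must be positive")
    return tokens / seconds


# Hypothetical response fragment: 500 tokens generated in 20 seconds.
sample = {"eval_count": 500, "eval_duration": 20_000_000_000}
print(f"{eval_rate(sample):.2f} tok/s")
```

Prompt processing has its own counters, so this number reflects generation speed only, which is the part that feels slow.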

The slowdown was not catastrophic. That was the interesting part. E2B remained mostly inside GPU memory on this workload, which avoided the worst PCIe and shared-memory penalties.

Small efficient models are now genuinely viable on consumer hardware. The problems start once context length, KV cache growth, and memory spillover begin compounding at the same time.

# First thing to check: is the model actually in GPU memory?
nvidia-smi

# Watch VRAM live as a conversation grows
# If VRAM rises and speed falls, KV cache is overflowing into RAM
# Linux:
watch -n 1 nvidia-smi
# Windows (no watch command): let nvidia-smi refresh itself every second
nvidia-smi -l 1

The Real Bottleneck Is Not Compute

Every inference run has two phases.

Prefill: the model reads your entire prompt in parallel. Compute-heavy, GPU handles it well. You generally do not feel this.

Decode: the model generates each output token one at a time. This is memory-bound. Every token forces the GPU to reload model weights from memory again. The GPU finishes its math and waits. It is not slow. It is starving for bandwidth.

This is why local inference feels slow even when Task Manager shows your GPU is busy.

# Memory bandwidth comparison — this is what determines tokens/sec

# RTX 3050 4GB     -> ~192 GB/s   (my machine)
# RTX 3060 12GB    -> ~360 GB/s
# RTX 4090 24GB    -> ~1008 GB/s
# M4 Max           -> ~546 GB/s
# M3 Ultra         -> ~800 GB/s

# VRAM capacity gets you the model loaded
# Bandwidth determines how fast it actually runs
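
A rough way to turn those bandwidth numbers into a speed ceiling: if every generated token has to stream the full set of weights from memory once, decode speed cannot exceed bandwidth divided by model size. The sketch below is a back-of-envelope estimate under that assumption. The 3.0 GB model size is a hypothetical placeholder, not a measured Gemma 4 figure.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Bandwidth figures from the comparison above (GB/s).
BANDWIDTH_GB_S = {
    "RTX 3050 4GB": 192,
    "RTX 3060 12GB": 360,
    "RTX 4090 24GB": 1008,
    "M4 Max": 546,
    "M3 Ultra": 800,
}

WEIGHTS_GB = 3.0  # hypothetical quantized model size, for illustration only

for name, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{name:14s} <= {ceiling:6.1f} tok/s")
```

Real throughput lands below this ceiling because KV cache reads and kernel overhead also consume time, but the ranking between cards tends to follow the bandwidth column.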

Check your own card before loading anything:

# Linux: query GPU name and memory from the driver
nvidia-smi --query-gpu=name,memory.total --format=csv

# Windows: grep does not exist in PowerShell
# Use Select-String instead
nvidia-smi -q | Select-String "Product Name", "Total", "Free", "Used"

# nvidia-smi reports memory size and clocks, not memory bandwidth
# On Windows, get the real number from: https://www.techpowerup.com/gpuz/
# The "Memory Bandwidth" field on the main tab is what you want

# Apple Silicon: no nvidia-smi, use system_profiler to identify the chip and memory
# It does not print bandwidth, so look that up in Apple's published specs for the chip
system_profiler SPHardwareDataType | grep -E "Chip|Memory"

The KV Cache Is Quietly Eating Your VRAM

Even if your model fits in VRAM, that headroom disappears as your conversation grows.

Every token the model has seen gets stored in the key-value cache. Without it, the model would reprocess the entire conversation on every generation step. The KV cache trades memory for speed. The tradeoff is it grows with every token.

For Gemma 4 E2B, a moderately long conversation on a 4GB card will push you over the edge mid-generation. The model does not crash. It silently spills into system RAM, and tokens per second falls off a cliff.

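# Ollama defaults to a 4096 token context even on models that support 128K
# This is why your model seems to forget things in long conversations
# Set it explicitly so you know what you are allocating
# Context length is a server setting, so set it where the server starts

OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# Or, inside an `ollama run gemma4:e2b` session:
# /set parameter num_ctx 8192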

# Confirm what context your running model is actually using
ollama ps
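
To see how quickly the cache eats headroom, here is a minimal sketch of the standard KV cache size formula: two tensors (keys and values) per layer, each holding `n_kv_heads * head_dim` values per token. The layer count, head count and head size below are hypothetical placeholders, not Gemma 4's real architecture, and the formula assumes full attention in every layer. Models that use sliding-window attention or share KV across layers need less.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Bytes of KV cache for `tokens` of context (2 = keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


def tokens_until_full(headroom_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """How many tokens of context fit in the VRAM left after the weights."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return int(headroom_bytes // per_token)


# Hypothetical architecture, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
MIB = 1024 ** 2

for ctx in (4096, 8192, 32768):
    size_mib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / MIB
    print(f"{ctx:6d} tokens -> {size_mib:7.0f} MiB of f16 KV cache")

headroom = 512 * MIB  # assumed free VRAM after loading weights
print("tokens before spillover:", tokens_until_full(headroom, LAYERS, KV_HEADS, HEAD_DIM))
```

The useful part is the second function: subtract the weight size from your VRAM, plug in the remainder, and you get a rough context budget before the cache starts spilling.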

Quantization Is Not Just About Fitting the Model

Most guides explain quantization as a way to make models smaller so they fit in VRAM. That undersells it.

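The real bottleneck is how fast the GPU can move weights from memory to compute units. Quantization reduces bytes per weight, so fewer bytes move per token generated. An ideal INT4 model transfers a quarter of the data per decode step that FP16 does, which puts the ceiling at roughly 4 times faster generation. Real formats land below that ceiling: Q4_K_M stores scales alongside the weights and averages closer to 5 bits per weight, and dequantization and KV cache reads add their own cost.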

# Quantization levels for Gemma 4 via llama.cpp

# Q2/Q3   -> smallest file, lowest quality, fits tight VRAM
# Q4_K_M  -> best balance for most consumer hardware
# Q8_0    -> higher quality, needs more VRAM
# FP16    -> full precision, not practical on 4GB cards
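
To put rough numbers on the bandwidth argument, the sketch below computes how many gigabytes of weights must be read per generated token at different bit widths. The 4-billion parameter count is a hypothetical placeholder, and the Q4_K_M figure is an approximation I am assuming for illustration. Q8_0 at 8.5 bits follows from its block layout (32 int8 values plus one fp16 scale).

```python
def gb_read_per_token(n_params_billion: float, bits_per_weight: float) -> float:
    """Gigabytes of weights streamed from memory for one decode step."""
    return n_params_billion * bits_per_weight / 8


PARAMS_B = 4.0  # hypothetical parameter count in billions

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M (approx.)": 4.85,  # assumed average, varies by model
    "INT4 (ideal)": 4.0,
}

baseline = gb_read_per_token(PARAMS_B, BITS_PER_WEIGHT["FP16"])
for name, bits in BITS_PER_WEIGHT.items():
    gb = gb_read_per_token(PARAMS_B, bits)
    print(f"{name:18s} {gb:5.2f} GB/token  ({baseline / gb:.2f}x less traffic than FP16)")
```

The traffic ratio is the best-case speedup. It says nothing about quality, which is why the list above still treats Q2/Q3 as a last resort.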

Quantizing the KV cache separately is now supported in llama.cpp and is worth doing on constrained hardware:

# --cache-type-k and --cache-type-v cut KV cache memory ~50%
# with minimal quality impact, and are easier than switching model sizes
# (in llama.cpp a quantized V cache requires flash attention to be enabled)
#
# --n-gpu-layers 99  : push all layers to GPU
# --cache-type-k/-v  : quantize the key and value caches
# --ctx-size 4096    : keep context tight on 4GB cards

./llama-cli \
  -m gemma4-e2b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 4096
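
Where the "~50%" comes from: a small sketch of the q8_0 block layout as I understand it from ggml, where each block stores 32 int8 values plus one fp16 scale. Treat the constants as my reading of the format rather than an official specification.

```python
Q8_0_BLOCK_VALUES = 32        # values stored per block
Q8_0_BLOCK_BYTES = 2 + 32     # one fp16 scale (2 bytes) + 32 int8 values
F16_BYTES_PER_VALUE = 2.0


def q8_0_bytes_per_value() -> float:
    """Average storage cost of one cached value in q8_0."""
    return Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES


def q8_0_size_vs_f16() -> float:
    """Fraction of the f16 cache size that a q8_0 cache occupies."""
    return q8_0_bytes_per_value() / F16_BYTES_PER_VALUE


print(f"q8_0: {q8_0_bytes_per_value():.4f} bytes per value")
print(f"cache size relative to f16: {q8_0_size_vs_f16():.1%}")
```

So the saving is slightly under half, because the per-block scale has to be stored too.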

The Layer Offloading Trap

When VRAM is tight, --n-gpu-layers 20 on a 32-layer model sounds like a reasonable compromise. It is usually not.

Partial offloading means the CPU-side layers are read from system RAM, which has a fraction of VRAM's bandwidth, and every token has to cross the PCIe bus at each GPU/CPU boundary. Because decode is bandwidth-bound, the slow layers dominate the time per token. Even a few CPU-side layers can significantly tank throughput.

# This looks like a reasonable compromise. It is not.
# Every forward pass waits on the 12 layers left in slow system RAM.
./llama-cli \
  -m gemma4-e2b-q4_k_m.gguf \
  --n-gpu-layers 20           # partial offload = worst of both worlds

# Better: use Q3 so the whole model fits on GPU at --n-gpu-layers 99
./llama-cli \
  -m gemma4-e2b-q3_k_m.gguf \
  --n-gpu-layers 99           # everything in VRAM, no PCIe stalls
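
A rough model of why partial offload hurts so much: treat decode as purely bandwidth-bound and add up the time to read each layer from whichever memory it lives in. This is a simplified sketch, not how llama.cpp schedules work internally. The 3.2 GB model size is hypothetical, the 192 GB/s and 70 GB/s figures are the VRAM and system RAM bandwidths quoted elsewhere in this post, and PCIe transfer and synchronization costs are ignored, which only flatters the partial-offload case.

```python
def decode_tok_s(weights_gb, gpu_layers, total_layers, gpu_bw_gb_s, cpu_bw_gb_s):
    """Estimated tokens/sec when layers are split between VRAM and system RAM."""
    if not 0 <= gpu_layers <= total_layers:
        raise ValueError("gpu_layers must be between 0 and total_layers")
    per_layer_gb = weights_gb / total_layers
    gpu_time = gpu_layers * per_layer_gb / gpu_bw_gb_s
    cpu_time = (total_layers - gpu_layers) * per_layer_gb / cpu_bw_gb_s
    return 1.0 / (gpu_time + cpu_time)


WEIGHTS_GB = 3.2      # hypothetical model size
TOTAL_LAYERS = 32
GPU_BW, CPU_BW = 192.0, 70.0  # GB/s

for layers_on_gpu in (32, 28, 20, 0):
    rate = decode_tok_s(WEIGHTS_GB, layers_on_gpu, TOTAL_LAYERS, GPU_BW, CPU_BW)
    print(f"{layers_on_gpu:2d}/{TOTAL_LAYERS} layers on GPU -> ~{rate:5.1f} tok/s")
```

Even in this optimistic model, moving a minority of layers off the GPU costs a disproportionate share of throughput, because each offloaded layer is read several times more slowly than the ones that stayed.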

What Windows Task Manager Is Lying to You About

This is where most people on Windows laptops get confused.

While running Gemma 4 E4B, Task Manager showed the RTX 3050 at 0% GPU utilization. At the same time, nvidia-smi showed:

# nvidia-smi output during active Gemma 4 E4B inference
# Task Manager said 0%. This is what was actually happening.

# +-----------------------------------------------+
# | GPU: NVIDIA GeForce RTX 3050 Laptop GPU        |
# | VRAM:    3564MiB / 4096MiB  (87% full)        |
# | GPU-Util: 44%                                  |
# | Power:    52W / 95W                            |
# +-----------------------------------------------+

# Always trust nvidia-smi over Task Manager for CUDA workloads
# Task Manager shows 3D engine usage — LLM inference runs on CUDA compute
# Windows sees "no 3D rendering" and reports 0%
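
If you would rather log these numbers than eyeball them, nvidia-smi can emit them as CSV. The sketch below assumes the `utilization.gpu`, `memory.used` and `memory.total` query fields and the `csv,noheader,nounits` format. The helper names are my own, and the sample line simply reuses the values from the output above.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as '44, 3564, 4096' into named fields."""
    util, used, total = (float(part) for part in line.split(","))
    return {
        "util_pct": util,
        "used_mib": used,
        "total_mib": total,
        "vram_pct": 100.0 * used / total,
    }


def read_gpus() -> list:
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Parsing a sample line, so this runs without a GPU present.
stats = parse_gpu_line("44, 3564, 4096")
print(f"GPU util {stats['util_pct']:.0f}%, VRAM {stats['vram_pct']:.0f}% full")
```

Call `read_gpus()` in a loop during a long conversation and the VRAM percentage will show the KV cache creeping toward the limit well before generation slows down.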

Task Manager also reports 11.6GB of "GPU Memory" on this machine, which is just as misleading. This laptop has two GPUs: the RTX 3050 (GPU 1) and the AMD Radeon iGPU inside the Ryzen 7 6800H (GPU 0). Only the RTX 3050 has dedicated VRAM. The other 7.6GB is not VRAM at all. It is system RAM that Windows lets a GPU borrow as shared GPU memory, the same pool the AMD iGPU draws from because it has no dedicated VRAM of its own. Windows adds the two together:

# How Windows calculates "total GPU memory" on this laptop

# RTX dedicated VRAM:          4.0 GB  (fast, ~192 GB/s)
# Shared system RAM:           7.6 GB  (slow, ~70-90 GB/s)
# ----------------------------------------
# Windows "GPU Memory":       11.6 GB  (misleading total)

# You do NOT have 11.6GB of fast VRAM
# You have 4GB fast + 7.6GB slow with a PCIe penalty to cross between them

System RAM during E4B inference told the same story: 13.2GB of 15.3GB used, 2.1GB available. Ollama is consuming roughly 4GB of system memory alongside the 3.5GB allocated in dedicated VRAM. The actual footprint for Gemma 4 E4B is 7 to 8GB total, split across two physical memory pools running at wildly mismatched speeds. That split is exactly why generation feels slower than the model size alone would suggest.

Put together, this is what E4B inference looked like on this machine:

# "The model loaded" does not mean the system is comfortable

# During Gemma 4 E4B inference on a 4GB RTX 3050 laptop:

# GPU memory pool
# ----------------
# Dedicated VRAM (RTX 3050)      -> 4.0 GB
# Shared DDR5 system memory      -> 7.6 GB
# Effective Windows "GPU Memory" -> 11.6 GB

# Real-world bottlenecks
# ----------------------
# [x] VRAM saturation
# [x] KV cache growth
# [x] Shared memory spillover
# [x] PCIe transfer overhead
# [x] Windows scheduler latency
# [x] Dual-GPU memory juggling

# Result
# ------
# The model technically fits.
# The hardware still struggles.
#
# Local inference on consumer laptops is often a
# memory orchestration problem, not a compute problem.

The result is that local AI performance becomes a memory orchestration problem long before it becomes a compute problem.

Hardware Tiers for Gemma 4 in 2026

# What you can realistically run locally in 2026
# (and what it costs to buy the hardware right now)

# 4GB VRAM (RTX 3050 — my machine)
#   -> Gemma 4 E2B with Q4 quantization
#   -> short contexts only, KV cache fills fast
#   -> the floor for local AI, barely

# 8GB-12GB VRAM
#   -> comfortable Gemma 4 E4B
#   -> 7B models from other families run well
#   -> context length starts to matter

# 16GB-24GB VRAM
#   -> where Gemma 4 becomes reliable for real work
#   -> this is what Google probably had in mind at I/O
#   -> good luck finding one at a reasonable price

# 36GB-64GB Unified Memory (Apple Silicon)
#   -> best consumer option for serious local AI
#   -> no VRAM/RAM split, no PCIe penalty

# 96GB-192GB Unified Memory
#   -> 70B models, workstation territory

Measure Before You Tune

# Get a baseline before changing anything
# Run this before and after every config change
./llama-bench -m gemma4-e2b-q4_k_m.gguf -p 512 -n 128

# Windows: check Ollama RAM usage directly
Get-Process ollama | Select-Object ProcessName,WorkingSet64

# Or watch:
# Task Manager -> Performance -> Memory

# Linux equivalent
free -h

# Watch GPU utilization and VRAM together in one view
# sm = compute utilization, mem = memory controller utilization, fb = VRAM in use (MiB)
nvidia-smi dmon -s mu

# Apple Silicon: check memory pressure
# A low free percentage means unified memory is overcommitted
# (Activity Monitor shows the same signal as a red Memory Pressure graph)
memory_pressure

What Google Got Right and What They Left Out

Gemma 4 E2B running locally on a 4GB VRAM laptop is not nothing. Four years ago that would not have been possible at all. The model quality for its size is genuinely impressive.

But "runs on consumer laptops" and "runs well on consumer laptops" are different claims. The I/O keynote did not mention memory bandwidth, KV cache overflow, or the fact that the hardware shortage means GPUs with enough VRAM for comfortable inference are still expensive and unusually difficult to find.

# What "model loaded successfully" actually guarantees

# NOT guaranteed:
# [ ] fits comfortably in VRAM
# [ ] KV cache has room to grow
# [ ] throughput will be usable
# [ ] PCIe offloading is avoided

# ONLY guaranteed:
# [x] weights entered memory without crashing

The model loading is the beginning of the problem. What happens after is a memory bandwidth race your hardware either wins or does not. Now you know which race you are in.
