Running Gemma 4 26B on an Old GTX 1080 with llama.cpp
Martin Andre · 2026-05-25 · via DEV Community

How to get Google's Gemma 4 26B-A4B Mixture-of-Experts model running locally — including speculative decoding — on hardware that has no business running it.


Google's Gemma 4 26B-A4B is a Mixture-of-Experts model: 25.2 billion total parameters, but only 3.8 billion are active per token. That distinction matters enormously for running it locally, because it means you can keep the cold expert weights in system RAM and stream them over PCIe, while a much smaller working set lives on the GPU.

This post walks through getting Gemma 4 running on a GeForce GTX 1080 — a 2016-vintage card with 8 GiB of VRAM — on Fedora 42, achieving ~24.5 tokens/second with 128k context, including fully-GPU-resident speculative decoding via Gemma 4's MTP assistant head.

For comparison: I also ran the Qwen 3.6 35B-A3B model through the same process. It produced slightly slower output at the same context length, and was much more verbose given the same prompts — so for typical assistant workloads, Gemma 4 ends up faster end-to-end regardless of tok/s.


The Hardware

The full system spec matters here, because the CPU and RAM are as important as the GPU when streaming MoE weights over PCIe:

| Component | Spec |
|---|---|
| CPU | Intel i7-6700 (Skylake, 4c/8t, 2015) |
| RAM | 32 GiB system RAM |
| GPU | NVIDIA GeForce GTX 1080, 8 GiB VRAM (Pascal, 2016) |
| OS | Fedora 42 |

Nothing here is new: I bought the GPU second-hand in 2025 for under $200 USD.

The key bottleneck to understand upfront:

# Check PCIe link state while the model is generating
lspci -vv -s 01:00.0 | grep LnkSta
#   LnkSta: Speed 8GT/s, Width x16
# i.e. running at PCIe 3.0 maximum

At the same time, nvidia-smi shows the GPU at roughly 40–50% utilisation. PCIe maxed out + GPU half-idle = bandwidth-limited, not compute-limited. This is the single most important fact for this setup: anything that reduces the volume of weight data crossing the PCIe bus per token helps; just having a faster GPU wouldn't.
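To put a number on that ceiling, here is a minimal sketch that turns the LnkSta line into a theoretical bandwidth figure. The helper names parse_lnksta and pcie_gbs are hypothetical, and the formula assumes the 128b/130b line encoding used from PCIe 3.0 (8 GT/s) upwards. Real-world throughput will be lower than this theoretical maximum.

```shell
# Extract "GT/s per lane" and "lane count" from an lspci LnkSta line.
parse_lnksta() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([0-9.]*\)GT\/s.*Width x\([0-9]*\).*/\1 \2/p'
}

# Theoretical one-direction bandwidth in GB/s (decimal gigabytes).
# Assumption: 128b/130b encoding, i.e. PCIe 3.0 or newer.
pcie_gbs() {
    awk -v gt="$1" -v lanes="$2" \
        'BEGIN { printf "%.2f\n", gt * lanes * (128 / 130) / 8 }'
}

# Usage on a live system (slot address 01:00.0 as in the lspci call above):
#   set -- $(lspci -vv -s 01:00.0 | parse_lnksta); pcie_gbs "$1" "$2"
```

For the 8GT/s x16 link shown above this works out to about 15.75 GB/s, which is the budget every streamed expert weight has to fit into.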


Gemma 4 26B-A4B: What You're Working With

| Property | Value |
|---|---|
| Total parameters | 25.2B |
| Active parameters per token | 3.8B |
| Layers | 30 |
| Trained context | 256K tokens |

The trick: with MoE models, only a few experts activate per token. llama.cpp exposes this directly:

  • --n-cpu-moe N — keep the MoE weights of the first N layers on the CPU
  • --n-gpu-layers 999 — everything else on the GPU

On this card the sweet spot for 128k context turns out to be --n-cpu-moe 21 (with MTP) or --n-cpu-moe 20 (without).


Step 1: Pin the NVIDIA Driver to the 580xx Branch

Pascal (GTX 1080) is approaching legacy status. On Fedora 42 you need to pin akmod-nvidia to the 580xx branch:

dnf swap akmod-nvidia akmod-nvidia-580xx --allowerasing --releasever=44

The --releasever=44 is necessary to pull 580xx packaging from the newer repo metadata, even though the running system is Fedora 42.


Step 2: CUDA Toolkit and a Working nvcc

dnf reinstall cuda-nvcc-12-9.x86_64
# Locate the nvcc binary (errors from unreadable directories are discarded)
find / -type f -name nvcc 2>/dev/null
# /usr/local/cuda-12.9/bin/nvcc

export CUDACXX=/usr/local/cuda-12.9/bin/nvcc


Step 3: Force gcc-14 for the CUDA Build

CUDA 12.9 doesn't accept the newest gcc that Fedora 42 ships by default:

dnf install gcc14 gcc14-c++

The straightforward -DCMAKE_C_COMPILER CMake flags don't work here — somewhere inside the NVIDIA/CUDA CMake modules, plain gcc is hard-coded. The least-bad workaround is a symlink early on PATH:

mkdir -p ~/.local/bin
pushd ~/.local/bin/
  ln -s /usr/bin/gcc-14 gcc
  ln -s /usr/bin/g++-14 g++
popd

# Confirm ~/.local/bin is at the front of PATH
echo $PATH

Remember to remove these symlinks afterwards if you don't want every other build using gcc-14.
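A small cleanup helper for afterwards, as a sketch. The function name remove_gcc_shims is hypothetical, and it assumes the shims live in ~/.local/bin unless another directory is passed. It only deletes symlinks, so a real binary with the same name is left alone.

```shell
# Remove the gcc/g++ shims created for the CUDA build.
# Only symlinks are removed; regular files are never touched.
remove_gcc_shims() {
    dir="${1:-$HOME/.local/bin}"
    for name in gcc g++; do
        if [ -L "$dir/$name" ]; then
            rm -- "$dir/$name"
        fi
    done
}

# Usage: remove_gcc_shims            (defaults to ~/.local/bin)
#        remove_gcc_shims /some/dir  (explicit directory)
```

In bash, running hash -r afterwards clears the shell's cached command locations so the system gcc is picked up again in the same session.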


Step 4: Patch CUDA's math_functions.h for glibc 2.41

CUDA 12.9 headers were written against an older glibc. On Fedora 42 (glibc 2.41) some inline definitions collide. Gentoo has a clean patch:

# Edit by hand, applying the patch:
$EDITOR /usr/local/cuda-12.9/targets/x86_64-linux/include/crt/math_functions.h

The core change: replace rsqrt(double x); with rsqrt(double x) noexcept (true);, and __func__(double rsqrt(double a)); with __func__(double rsqrt(double a)) throw();, for these six functions:

double rsqrt(double a);  double sinpi(double a);  double cospi(double a);
float rsqrtf(float a);   float sinpif(float a);   float cospif(float a);


Step 5: Choose the Right llama.cpp Fork

Vanilla llama.cpp works for most cases, but for Gemma 4 on an 8 GiB card you need two things standard llama.cpp doesn't have:

  1. RotorQuant — a Gemma-specific KV-cache quantisation scheme that makes the difference between fitting at 16k context and fitting at 128k context
  2. MTP speculative decoding support — for Gemma 4's assistant head

The right fork is AtomicBot-ai/atomic-llama-cpp-turboquant, which combines RotorQuant, TurboQuant KV cache, and Gemma 4 MTP support.

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git
cd atomic-llama-cpp-turboquant/


Step 6: Build llama.cpp

export CUDACXX=/usr/local/cuda-12.9/bin/nvcc

cmake --fresh -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
# 'native' picks up the compute capability of the installed GPU automatically.

cmake --build build --config Release
# NB: --parallel seemed to cause problems, so leave it off

Sanity check:

cd ./build/bin

./llama-cli --list-devices
# ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8107 MiB):
#   Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes, VRAM: 8107 MiB

It's also worth dumping the full help text — there are a lot of flags and you'll be grepping it constantly:

./llama-server --help > llama.cpp-man.txt
wc -l llama.cpp-man.txt
# 570 llama.cpp-man.txt


Step 7: Download Gemma 4

You need two GGUFs: the main model and the MTP assistant head.

# Via the Hugging Face CLI (the hf command):
hf download AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF \
    --include "*Q4_K_M.gguf" --local-dir ./Models

hf download unsloth/gemma-4-26B-A4B-it-GGUF \
    --include "*Q4_K_M*.gguf" --local-dir ./Models

Or directly via wget:

wget https://huggingface.co/AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF/resolve/main/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf
wget https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf

Move them to ~/Models/ for sanity.

Model cards: see the Hugging Face pages of the repositories used in the download commands above.


Step 8: First Runs (Baseline, No MTP)

cd ./build/bin

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 29 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 16384

A smoke-test query from another terminal:

curl -X POST http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "messages": [
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Please write a program to stream the Fibonacci numbers under 1000 - with the restriction that there should be only one print statement in the loop"}
       ]
     }'

Timing from the server log:

prompt eval time =    1035.77 ms /    52 tokens (   19.92 ms/tok,    50.20 tok/s)
       eval time =   67336.21 ms /  1076 tokens (   62.58 ms/tok,    15.98 tok/s)
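When comparing many runs it helps to pull the generation rate out of the log mechanically. The following is a minimal sketch: eval_tps is a hypothetical helper that assumes the timing lines look exactly like the two above, skips the "prompt eval time" line, and recomputes tokens per second from the millisecond and token counts.

```shell
# Read a llama-server log on stdin and print generation tok/s
# for every "eval time" line that is not a "prompt eval time" line.
eval_tps() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "eval" && $(i + 1) == "time" &&
                (i == 1 || $(i - 1) != "prompt") && $(i + 4) == "ms") {
                # $(i+3) = total milliseconds, $(i+6) = token count
                printf "%.2f\n", $(i + 6) / ($(i + 3) / 1000)
            }
        }
    }'
}

# Usage: ./llama-server ... 2>&1 | tee server.log
#        eval_tps < server.log
```

Fed the eval line above (1076 tokens in 67336.21 ms), it prints 15.98, matching the figure the server reports.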

About 16 tok/s. Pushing to 128k context:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 29 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000

prompt eval time =     985.96 ms /    52 tokens (   18.96 ms/tok,    52.74 tok/s)
       eval time =   98309.88 ms /  1538 tokens (   63.92 ms/tok,    15.64 tok/s)

The memory breakdown at startup is informative:

llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (GTX 1080)   |  8107 = 4632 + ( 3299 =  2103 +     664 +     532) +         174 |
llama_memory_breakdown_print: |   - Host               |                 14747 = 14477 +       0 +     270                |
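The breakdown is reported in MiB. As a rough sketch, the hypothetical helper vram_free_gib below pulls the "free" column out of the CUDA0 row and converts it to GiB. It assumes the column order shown in the header line above (total, free, self, ...) and a pipe-delimited layout.

```shell
# Read a llama.cpp startup log on stdin and print free VRAM in GiB
# for each CUDA row of the memory breakdown table.
vram_free_gib() {
    awk -F'|' '/llama_memory_breakdown_print/ && /CUDA/ {
        cols = $3
        gsub(/^[^0-9]+/, "", cols)        # drop leading padding
        split(cols, v, /[^0-9]+/)         # v[1] = total, v[2] = free, v[3] = self
        printf "%.2f\n", v[2] / 1024
    }'
}

# Usage: vram_free_gib < server.log
```

For the CUDA0 row above (4632 MiB free) this prints 4.52.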

About 3.2 GiB (3299 MiB) is on the GPU and about 14.4 GiB (14747 MiB) on the host. Roughly 4.5 GiB (4632 MiB) of VRAM is still free, which is enough to pull more layers over. Trying --n-cpu-moe 20:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 20 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000

prompt eval time =     584.25 ms /    29 tokens (   20.15 ms/tok,    49.64 tok/s)
       eval time =   88887.26 ms /  1758 tokens (   50.56 ms/tok,    19.78 tok/s)

At --n-cpu-moe 19 it OOMs. 20 is the floor for Gemma 4 at 128k context without MTP — giving ~20 tok/s.
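Finding that floor is a manual sweep. Here is a small dry-run sketch that prints one command per --n-cpu-moe value, highest first, so each can be run and timed in turn. The function name moe_sweep_cmds is hypothetical; the flags are the ones used in the runs above, and nothing is executed by the helper itself.

```shell
# Print llama-server commands for --n-cpu-moe values from $1 down to $2.
# Dry run only: the commands are printed, not executed.
moe_sweep_cmds() {
    n="$1"
    lo="$2"
    while [ "$n" -ge "$lo" ]; do
        printf '%s\n' "./llama-server --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf --n-gpu-layers 999 --n-cpu-moe $n --cache-type-k turbo3 --cache-type-v turbo3 --flash-attn on --no-mmap --mlock --ctx-size 128000"
        n=$((n - 1))
    done
}

moe_sweep_cmds 22 19
```

Walking down from a safe value and stopping at the first out-of-memory failure gives the floor; the last value that loaded is the one to keep.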


Step 9: Adding MTP Speculative Decoding

Gemma 4 ships with a small "assistant" MTP (Multi-Token Prediction) head designed for speculative decoding. The idea: the small assistant drafts several tokens cheaply, then the main model verifies them in one pass. If enough drafts are accepted, effective throughput goes up.

Initial attempt with MTP:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 20 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --mtp-head ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
    --n-gpu-layers-draft 999 \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 16 --draft-min 0 \
    --cache-type-k-draft turbo3 --cache-type-v-draft turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000

eval time =   55049.45 ms /  1151 tokens (   47.83 ms/tok,    20.91 tok/s)
draft acceptance rate = 0.76096 (694 accepted / 912 generated)

~21 tok/s. A 76% acceptance rate sounds good — but we only gained ~1 tok/s over the no-MTP baseline. Something is wrong.
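The mismatch is easier to see with the numbers side by side. A minimal sketch, using two hypothetical helpers: accept_rate recomputes the acceptance ratio from the log counts, and pct_gain gives the relative speed-up of one tok/s figure over another.

```shell
# Fraction of drafted tokens that the main model accepted.
accept_rate() {
    awk -v acc="$1" -v gen="$2" 'BEGIN { printf "%.5f\n", acc / gen }'
}

# Relative throughput gain of "new" over "old", in percent.
pct_gain() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f\n", (new / old - 1) * 100 }'
}

accept_rate 694 912      # counts from the draft acceptance line above
pct_gain 20.91 19.78     # MTP run vs. the no-MTP run at --n-cpu-moe 20
```

This prints 0.76096 and 5.7: three out of four drafts accepted, yet under 6% more throughput.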


Step 10: Debugging Why MTP Barely Helps

The llama_memory_breakdown_print line at startup showed about 4.5 GiB (4632 MiB) free on GPU. The assistant head is small, so why isn't it helping more?

The clue is in the per-model load_tensors stanzas in the server's startup log. There are two of them — one for the main model, one for the assistant. Here's what they showed:

# Main Gemma 4 26B-A4B:
load_tensors: offloaded 31/31 layers to GPU
load_tensors:          CPU model buffer size =   577.50 MiB
load_tensors:        CUDA0 model buffer size =  6504.39 MiB
load_tensors:    CUDA_Host model buffer size =  9498.51 MiB

# MTP assistant (5 layers):
load_tensors: offloaded 5/5 layers to GPU
load_tensors:          CPU model buffer size =   210.00 MiB   ← problem
load_tensors:        CUDA0 model buffer size =    82.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB

The assistant reports 5/5 layers "offloaded to GPU" — but 210 MiB is still on plain CPU, vs only 82 MiB on CUDA0. The sum is ~292 MiB; what's in that CPU chunk?

The answer is in llama.cpp's source:

// assign the input layer
// there is very little benefit to offloading the input layer,
// so always keep it on the CPU
pimpl->dev_input = { cpu_dev, &pimpl->cpu_buft_list };

The token embedding table is unconditionally pinned to the CPU, regardless of --n-gpu-layers. For most models this is fine: the embedding lookup is a get_rows operation — it pulls a handful of vocab rows per forward pass and is cheap from CPU.

But Gemma 4 26B-A4B's assistant has a tied LM head: the LM head matrix is the same tensor as token_embd.weight. Every single draft token generation performs a full mul_mat(tok_embd, hidden_state) — a 262144 × 1024 matmul against that 210 MiB table. At Q4_K_M that's ~150 MiB hauled across PCIe for every draft token generated.

The supposed-to-be-free speculative decoding was actually adding PCIe load on top of the target model's MoE streaming. That's why MTP barely moved the needle.
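The size of that 262144 × 1024 table can be checked with a little arithmetic. The sketch below uses a hypothetical helper, tensor_mib, and assumes ggml's block layouts as I understand them: q4_K packs 256 weights into 144 bytes (4.5 bits per weight) and q6_K packs 256 weights into 210 bytes (6.5625 bits per weight).

```shell
# Size in MiB of a rows x cols tensor stored at a given bits-per-weight.
tensor_mib() {
    awk -v rows="$1" -v cols="$2" -v bpw="$3" \
        'BEGIN { printf "%.1f\n", rows * cols * bpw / 8 / 1048576 }'
}

tensor_mib 262144 1024 4.5      # q4_K: 144 bytes per 256-weight block
tensor_mib 262144 1024 6.5625   # q6_K: 210 bytes per 256-weight block
```

This prints 144.0 and 210.0. The ~150 MiB estimate corresponds to the 4.5-bit case, while the 210 MiB buffer in the load_tensors output (and the q6_K label in the verbose log in Step 11) corresponds to the 6.5625-bit case, so the real per-draft-token read may be closer to 210 MiB. Either way, it is a large transfer to repeat for every drafted token.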


Step 11: Fix — Force the Embedding Table onto the GPU

Two subtleties to get this right:

1. Use --override-tensor-draft, not --override-tensor.

llama.cpp has parallel flags for the target model and the speculative draft model:

-ot,  --override-tensor         # affects the target model only
-otd, --override-tensor-draft   # affects the assistant/draft model

2. Use the on-disk tensor name, not the C++ field name.

The tensor is stored on disk as token_embd.weight, not mtp.tok_embd. The override flag matches against the on-disk name.
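A quick way to sanity-check a pattern before restarting the server is to try it against candidate names. This sketch assumes the override pattern is applied as a regular-expression search over tensor names, which the escaped dot suggests; grep -E stands in for llama.cpp's regex engine, and override_matches is a hypothetical helper name.

```shell
# Succeed if the override pattern ($1) would match the tensor name ($2).
# Assumption: grep -E behaves like the real matcher for simple patterns.
override_matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

for name in token_embd.weight mtp.tok_embd; do
    if override_matches 'token_embd\.weight' "$name"; then
        echo "$name: matched"
    else
        echo "$name: not matched"
    fi
done
```

Only the on-disk name token_embd.weight matches; a pattern written against the C++ field name would silently match nothing, and the tensor would stay on the CPU.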

The corrected invocation:

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --mtp-head ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 16 --draft-min 0 \
    --n-gpu-layers 999 \
    --n-cpu-moe 21 \
    --n-gpu-layers-draft 999 \
    --n-cpu-moe-draft 0 \
    --override-tensor-draft "token_embd\.weight=CUDA0" \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --cache-type-k-draft turbo3 --cache-type-v-draft turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000

Note: --n-cpu-moe 21 rather than 20 — moving 210 MiB into VRAM consumes that headroom. 20 now OOMs; 21 is the new floor.

With --verbose, you can confirm the override fired:

tensor token_embd.weight (210 MiB q6_K) buffer type overridden to CUDA0

And the assistant's load_tensors stanza now shows:

load_tensors: offloaded 5/5 layers to GPU
load_tensors:        CUDA0 model buffer size =   292.24 MiB   # was 82 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB
                                                              # CPU line gone

82 + 210 = 292. The CPU buffer has disappeared entirely.


Step 12: Results

Sweeping --n-cpu-moe to find the sweet spot (more CPU layers = less VRAM pressure but more PCIe per target token):

--n-cpu-moe 25 (conservative):

eval time = 48.07 ms/tok,  20.80 tok/s
draft acceptance rate = 0.74150 (829 accepted / 1118 generated)

--n-cpu-moe 22:

eval time = 40.95 ms/tok,  24.42 tok/s
draft acceptance rate = 0.82300 (637 accepted / 774 generated)

--n-cpu-moe 21 (OOM floor, sweet spot):

eval time = 40.85 ms/tok,  24.48 tok/s
draft acceptance rate = 0.78587 (712 accepted / 906 generated)

(20 = OOM)

~24.5 tok/s at 128k context: a real ~24% improvement over the 19.78 tok/s no-MTP baseline.

The mtp statistics line tells the full story:

statistics mtp: #calls(b,g,a) = 1 453 389  dur(b,g,a) = 0.004, 3048.331, 0.086 ms

The dur(b,g,a) tuple is time in each MTP phase: batch (prefill), generation (drafting), acceptance (verification). Generation takes ~3 seconds total across 453 calls (~6.7 ms per draft call); acceptance is essentially free at 0.086 ms total. That's exactly what you want: the draft model is CUDA-compute-bound, not PCIe-bound.

Before the fix, each draft call was individually slower — the matmul was crossing PCIe. After moving to CUDA0, per-call duration dropped and acceptance rate improved.
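The per-call figure can be extracted from the statistics line directly. A minimal sketch, assuming the line has exactly the layout shown above; draft_ms_per_call is a hypothetical helper that divides total generation time by the number of generation calls.

```shell
# Read a log on stdin and print milliseconds per draft (generation) call
# for each "statistics mtp:" line.
draft_ms_per_call() {
    awk '/statistics mtp:/ {
        gsub(/,/, "")                       # "3048.331," -> "3048.331"
        for (i = 1; i <= NF; i++) {
            if ($i == "mtp:") {
                calls = $(i + 4)            # second of the three call counts
                total = $(i + 9)            # second of the three durations (ms)
                if (calls > 0) printf "%.2f\n", total / calls
            }
        }
    }'
}

# Usage: draft_ms_per_call < server.log
```

For the line above (3048.331 ms over 453 calls) this prints 6.73.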


The Diagnostic: How to Tell If Your MTP Head Is Actually on the GPU

The startup llama_memory_breakdown_print line is not reliable — it covers the target model only, not the assistant. The correct check is the second load_tensors stanza in the startup log.

Good — no CPU line for the assistant:

load_tensors:        CUDA0 model buffer size =   292.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB

Bad — embedding table is on the CPU, MTP won't benefit:

load_tensors:          CPU model buffer size =   210.00 MiB
load_tensors:        CUDA0 model buffer size =    82.24 MiB
load_tensors:    CUDA_Host model buffer size =     3.09 MiB

If the CPU line is non-zero for the assistant, check whether the model has a tied LM head and add --override-tensor-draft "token_embd\.weight=CUDA0".
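That check can be scripted. The sketch below uses a hypothetical helper, draft_cpu_mib, and assumes what the logs above show: each model's stanza begins with a "load_tensors: offloaded" line, and the assistant's stanza is the second one.

```shell
# Read a startup log on stdin and print the size (MiB) of the plain CPU
# model buffer in the second load_tensors stanza, or 0 if there is none.
draft_cpu_mib() {
    awk '
        $1 == "load_tensors:" && $2 == "offloaded" { stanza++ }
        stanza == 2 && $1 == "load_tensors:" && $2 == "CPU" {
            for (i = 1; i <= NF; i++) if ($i == "=") cpu = $(i + 1)
        }
        END { print (cpu == "" ? 0 : cpu) }
    '
}

# Usage: draft_cpu_mib < server.log
```

A result of 0 means the assistant is fully GPU-resident; anything else is the amount still sitting in CPU memory.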

The mtp statistics generation time is also a tell: a few milliseconds per draft call means GPU-resident; tens of milliseconds means PCIe-bound.


Summary: Working Configuration

Here's the fastest configuration found on this hardware (GTX 1080, 8 GiB VRAM, 128k context):

cd atomic-llama-cpp-turboquant/build/bin

./llama-server \
    --model ~/Models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --n-gpu-layers 999 \
    --n-cpu-moe 21 \
    --mtp-head ~/Models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
    --n-gpu-layers-draft 999 \
    --n-cpu-moe-draft 0 \
    --override-tensor-draft "token_embd\.weight=CUDA0" \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 16 --draft-min 0 \
    --cache-type-k turbo3 --cache-type-v turbo3 \
    --cache-type-k-draft turbo3 --cache-type-v-draft turbo3 \
    --flash-attn on \
    --no-mmap --mlock \
    --ctx-size 128000

Result: ~24.5 tok/s, 128k context, ~79% draft acceptance rate.

The key lessons:

  1. The MoE architecture is what makes this possible. Only ~3.8B parameters are active per token; the rest sit cold in RAM and stream on demand. --n-cpu-moe 21 is the sweet spot between VRAM pressure and PCIe bandwidth.

  2. RotorQuant KV cache matters. The --cache-type-k turbo3 --cache-type-v turbo3 flags (from the AtomicBot fork) are what get you from 16k context to 128k context on 8 GiB VRAM.

  3. MTP works — but only once you force the embedding table onto the GPU. --n-gpu-layers-draft 999 is not enough. Gemma 4's assistant has a tied LM head; without --override-tensor-draft "token_embd\.weight=CUDA0", the 262144×1024 matmul runs against CPU memory, adding ~150 MiB of PCIe traffic per draft token and negating almost all of the speculative decoding benefit.

  4. Check the second load_tensors stanza, not llama_memory_breakdown_print. The breakdown line covers the target model only. The per-model load stanzas are the only reliable way to confirm where the assistant's weights actually landed.


The llama.cpp fork used throughout: AtomicBot-ai/atomic-llama-cpp-turboquant