Model Showdown Round 7: Five Local Models vs. One Cloud Model on a Real Coding Task

Five local models. One frontier cloud model. The same coding task. Zero hand-holding.

Only two shipped code. One of them was the cloud model.

Part of my goal with this series is to continuously test the viability and maturity of local models. I've done it for basic agentic tasks. Today we're revisiting coding tasks.

What did we learn?

Local models are not ready — yet. At least not for homelabs like mine. Perhaps if you have hundreds of gigabytes of unified memory (I'm looking at you, older Mac Studios) you can run fully unquantized models. But with even the beefiest of discrete consumer GPUs, local models can't code.

Let's dig in.

The Setup

This is Round 7 of the Model Showdown series. Previous rounds tested cloud models against each other — Opus, Sonnet, GPT-5.5, Qwen cloud. This time I wanted to answer a different question: can local models running on consumer hardware actually complete a real agentic coding task?

The homelab:

CPU: AMD Ryzen 9 9950X3D, 64GB RAM
GPU: NVIDIA RTX 5090, 32GB VRAM
Inference: llama.cpp b9660, single-model serving on port 8080
Agent platform: Coder Agents v2.34.0
OS: Ubuntu 24.04, NVIDIA Driver 590.48.01, CUDA 13.1

Every local model was configured as aggressively as the hardware allows — flash attention, quantized KV cache (q8_0), and context windows maxed to what VRAM permits.

The Contestants

Model	Type	Quant	VRAM	Context	Max Output
Qwen 3.6 35B-A3B	Local MoE	UD-Q4_K_XL (21GB)	~21GB	131,072	81,920
Gemma 4 12B	Local Dense	UD-Q4_K_XL (6.9GB)	~8GB	65,536	32,768
Hermes 4 14B	Local Dense	Q8_0 (15GB)	~15GB	65,536	32,768
Qwen3-Coder 30B-A3B	Local MoE	UD-Q4_K_XL (17GB)	~17GB	65,536	32,768
Devstral 24B	Local Dense	Q5_K_M (17GB)	~17GB	65,536	32,768
Claude Sonnet 4	Cloud (control)	Native	N/A	200,000	—

Sonnet 4 is the control. I already know what it can do. The question is how close the local models get.

The Task: Admin Tag Manager

Previous rounds used an "image management" feature, but that collided with existing code in the repo. For Round 7, I designed a clean-room task: build a tag manager for the blog's admin panel.

The blog already has tags — posts use a tags[] array in MDX frontmatter, there's a public /tags page, and src/lib/posts.ts has a getAllTags() function. But there's no admin UI to manage them.

Each model got the identical prompt:

Goal: Add a Tag Manager to the /admin section.

Requirements:

Create src/lib/tags.ts — list tags with post counts, detect orphans, support rename and merge

Create src/app/api/admin/tags/route.ts — GET, PATCH, DELETE endpoints

Create src/app/admin/tags/page.tsx — table with inline rename, delete, sort

Add "Tags" to AdminNav

Client-side mutations with refresh (no full page reload)

npm run build must pass with zero errors

Take a screenshot via Playwright MCP

Commit in logical chunks, push to branch

Do NOT open a PR

Nine requirements. Real codebase. Real build system. Real git workflow.
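To make the first requirement concrete, here is a minimal sketch of the core tag logic over in-memory records. The names `PostMeta`, `tagStats`, and `renameTag` are hypothetical, not taken from the repo, and a real version would read and write MDX frontmatter instead of plain objects. Orphan detection is left out because the prompt doesn't define what counts as an orphan.

```typescript
// Hypothetical shape: the real repo reads this from MDX frontmatter.
interface PostMeta {
  slug: string;
  tags: string[];
}

// Count how many posts use each tag, sorted by count (descending), then name.
function tagStats(posts: PostMeta[]): { tag: string; count: number }[] {
  const counts = new Map<string, number>();
  posts.forEach((post) => {
    // De-duplicate first so a tag listed twice in one post counts once.
    Array.from(new Set(post.tags)).forEach((tag) => {
      counts.set(tag, (counts.get(tag) ?? 0) + 1);
    });
  });
  return Array.from(counts.entries())
    .map(([tag, count]) => ({ tag, count }))
    .sort((a, b) => b.count - a.count || a.tag.localeCompare(b.tag));
}

// Rename `from` to `to` in every post. If a post already carries `to`,
// the duplicate is dropped, so renaming onto an existing tag acts as a merge.
function renameTag(posts: PostMeta[], from: string, to: string): PostMeta[] {
  return posts.map((post) => ({
    ...post,
    tags: Array.from(new Set(post.tags.map((t) => (t === from ? to : t)))),
  }));
}
```

Treating merge as "rename onto an existing tag" keeps the surface small: one function covers both operations, which is one plausible reading of the prompt rather than the only one.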

The Methodology

Each model got its own clean branch (run-10 through run-15) forked from the same main commit. Local models were loaded one at a time via llm-switch.sh and served through llama-server on localhost:8080. Sonnet 4 ran through Coder's built-in Anthropic provider.

Model-to-run assignment was randomized and sealed before execution. I didn't know which model was which run until after all six completed (or failed).
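For anyone who wants to reproduce the blind assignment, a sketch along these lines would do it. This is not the script I used; `shuffle` and `assignRuns` are hypothetical names, and the "sealing" step (writing the mapping somewhere you don't look until the end) is not shown.

```typescript
// Fisher-Yates shuffle. `rand` is injectable so the sketch can be tested
// deterministically; Math.random is the default.
function shuffle<T>(items: T[], rand: () => number = Math.random): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Map shuffled models onto consecutive branch names: run-10, run-11, ...
function assignRuns(
  models: string[],
  firstRun: number,
  rand: () => number = Math.random
): Record<string, string> {
  const assignment: Record<string, string> = {};
  shuffle(models, rand).forEach((model, i) => {
    assignment[`run-${firstRun + i}`] = model;
  });
  return assignment;
}
```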

A note on human intervention: I monitored each session live and occasionally nudged stalled models ("keep going", "can you finish?") or stopped them when they entered obvious doom loops ("stop"). There was no standardized intervention protocol — I used my judgment as a developer watching an AI assistant, which is how these tools actually get used in practice. Some models got more nudges than others because they stalled more. The two models that shipped code needed zero intervention.

The Results

Model	Tool Calls	Total Tokens	Commits	Build Pass	Screenshot	Outcome
Sonnet 4 ☁️	88	19K	4	✅ (1st try)	✅	Complete
Qwen3-Coder 30B-A3B	60	2.06M	1	✅ (3rd try)	❌	Partial
Qwen 3.6 35B-A3B	76	3.89M	0	✅ (2nd try)	❌	Failed (never committed)
Gemma 4 12B	34	1.17M	0	❌ (0/7)	❌	Failed
Hermes 4 14B	40	1.14M	0	❌ (0/13)	❌	Failed
Devstral 24B	0	14K	0	❌	❌	Total failure

One cloud model. Five local models. One complete success. One partial. Four failures.

What Each Model Actually Did

Sonnet 4 — The Control (Run 14): Complete Success

Sonnet did what you'd expect a frontier model to do. It cloned the repo, spent 25 tool calls reading existing code (auth patterns, API conventions, admin page structure, frontmatter format), then wrote all four files in a tight burst. Build passed on the first try. It hit a real environment issue — a stray package.json confused Turbopack's workspace detection — diagnosed the root cause, fixed it with a config change, took a Playwright screenshot, and pushed four clean conventional commits.

Total time: ~10 minutes. Zero human intervention.

acb4ea1 fix: set turbopack.root to avoid workspace lockfile detection in dev
352a8ca feat: add Tags link to AdminNav
22899a0 feat: add /admin/tags page with inline rename, delete, and sort
19f44fa feat: add tags.ts lib with stats, rename, and remove helpers

The implementation followed existing project patterns because it read them first. That's the difference.

Qwen3-Coder 30B-A3B (Run 15): The One That Shipped

The best-performing local model. It cloned the repo, explored the codebase, created all four required files (410 lines of code), fixed TypeScript errors across three build attempts, and pushed a working commit.

But it wasn't clean. It burned ~8 tool calls just fighting the working directory problem (each execute call resets to /home/coder, so it kept forgetting to cd into the repo). After committing, it spent another 30 tool calls confused about whether its own API route file existed — trying to delete and recreate something that was already committed.

No screenshot. No logical commit chunking (everything in one commit). But it shipped working code, which puts it in a category of one among the local models.
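The working directory problem has a mechanical fix: never rely on a previous cd, and build every command so it works from a fresh shell. A minimal sketch, assuming the execute tool runs each command through a POSIX shell; `shQuote`, `inRepo`, and the repo path in the usage note are hypothetical.

```typescript
// Single-quote a string for a POSIX shell, escaping embedded single quotes.
function shQuote(s: string): string {
  return "'" + s.replace(/'/g, "'\\''") + "'";
}

// Prefix a command with an explicit cd so it does not depend on state
// left behind by an earlier tool call.
function inRepo(repoDir: string, command: string): string {
  return "cd " + shQuote(repoDir) + " && " + command;
}
```

With this, `inRepo("/home/coder/blog", "npm run build")` yields `cd '/home/coder/blog' && npm run build`, which behaves the same on the first call and the fiftieth.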

Qwen 3.6 35B-A3B (Run 13): The Tragic Hero

This is the one that hurts. Qwen 3.6 actually completed the implementation. It explored the codebase thoroughly, wrote all four files, fixed a type error, and got npm run build to pass cleanly.

Then it decided it needed a Playwright screenshot before committing.

It spent the next 77 messages, nearly half of its 162-message session, trying to install Playwright, fighting missing Chromium dependencies, debugging browser launch failures, rewriting a screenshot script four times, and wrestling with the auth middleware that blocked unauthenticated page loads. It never took the screenshot. It never committed. It never pushed.

The code was right there. Build passing. Ready to go. But the model couldn't prioritize "commit what works" over "complete requirement #7 first." Three times I nudged it — "You there?", "Keep going", "can you finish?" — and each time it dove back into the Playwright rabbit hole.

3.89 million tokens burned. Zero commits pushed.

Gemma 4 12B (Run 11): The API Misunderstanding

Gemma cloned the repo, read the existing code, and wrote all three new files plus the nav update. Reasonable start. Then it ran npm run build and hit a type error with gray-matter's stringify() function.

The fix was simple: matter.stringify(content, data) — content string first, data object second. Gemma had the arguments reversed. It tried six variations of the call, rewrote tags.ts six times, ran seven builds — and never once tried the correct argument order. It never read the gray-matter type definitions. It never checked the docs.

After the fifth failed build, it fell into a degenerate text generation loop — printing "I'll also make sure src/lib/tags.ts is correct" 26 consecutive times. I had to send "stop" to break the loop.
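To show the call shape without pulling in the library, here is a hand-rolled stand-in that mirrors the argument order described above: body string first, data object second. It is not gray-matter; `stringifyPost` and `Frontmatter` are hypothetical, and it only handles flat string and string[] values.

```typescript
type Frontmatter = Record<string, string | string[]>;

// Stand-in with the same argument order as matter.stringify(content, data).
// Serializes a flat frontmatter object above the post body.
function stringifyPost(content: string, data: Frontmatter): string {
  const lines = Object.keys(data).map((key) => {
    const value = data[key];
    return Array.isArray(value)
      ? `${key}: [${value.join(", ")}]`
      : `${key}: ${value}`;
  });
  return `---\n${lines.join("\n")}\n---\n${content}`;
}

// Correct order: content, then data.
const rendered = stringifyPost("Hello\n", { title: "Hi", tags: ["a", "b"] });
```

With typed parameters like these, swapping the two arguments is a compile error that names the mismatched types, which is exactly the kind of message worth reading before a second attempt.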

Hermes 4 14B (Run 12): The Import Path That Wouldn't Die

Hermes jumped straight to writing code without exploring the project structure first. It created two files and ran npm run build. The error:

Module not found: Can't resolve '../../../lib/tags'

The route file at src/app/api/admin/tags/route.ts needs ../../../../lib/tags (four levels up) or @/lib/tags (Next.js path alias). Hermes used three levels. Off by one.

It never diagnosed this. Instead, it rewrote both files with the same wrong import and rebuilt. Thirteen times. The output from message 34 onward is nearly verbatim identical every iteration. Same code. Same error. Same "fix." When I sent "stop," it continued for five more tool calls before acknowledging the signal.
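The off-by-one is easy to verify by walking the path segments. A simplified sketch: `resolveImport` is a hypothetical helper that only handles relative specifiers with POSIX-style separators, not path aliases or file extensions.

```typescript
// Resolve a relative import specifier against the importing file's directory.
function resolveImport(fromFile: string, specifier: string): string {
  const segments = fromFile.split("/").slice(0, -1); // drop the file name
  for (const part of specifier.split("/")) {
    if (part === "..") segments.pop();
    else if (part !== "." && part !== "") segments.push(part);
  }
  return segments.join("/");
}

const route = "src/app/api/admin/tags/route.ts";
const threeUp = resolveImport(route, "../../../lib/tags");   // "src/app/lib/tags"
const fourUp = resolveImport(route, "../../../../lib/tags"); // "src/lib/tags"
```

Three levels up lands in src/app, where there is no lib directory. Four levels up reaches src, which is where lib/tags.ts lives.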

Devstral 24B (Run 10): The Non-Starter

Devstral never executed a single tool call. It hallucinated an entire fake conversation about a Python project that doesn't exist, then emitted what looked like tool invocations — execute, read_file, write_file — but rendered them as plain text inside the assistant message. The platform couldn't parse them as structured tool calls, so nothing happened.

This is a fundamental compatibility failure. The model couldn't interface with Coder's tool-calling protocol at all. Nine messages, 14K tokens, zero actions.
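A cheap pre-flight check would catch this before a full run: send one prompt that requires a tool, then inspect the reply. The sketch below assumes the agent talks to llama-server through an OpenAI-style chat completions response, where structured calls arrive in `tool_calls` and `function.arguments` is a JSON string. I haven't confirmed that this is the exact wire format Coder uses; `classifyToolUse` and `Verdict` are hypothetical names.

```typescript
// Minimal slice of an OpenAI-style assistant message.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

interface AssistantMessage {
  role: "assistant";
  content: string | null;
  tool_calls?: ToolCall[];
}

type Verdict = "structured" | "text-only" | "malformed";

// Classify a reply to a prompt that should have triggered a tool call.
function classifyToolUse(msg: AssistantMessage): Verdict {
  const calls = msg.tool_calls ?? [];
  // No structured calls: anything tool-like is just prose in `content`.
  if (calls.length === 0) return "text-only";
  for (const call of calls) {
    try {
      JSON.parse(call.function.arguments);
    } catch {
      return "malformed";
    }
  }
  return "structured";
}
```

A reply like Devstral's, with tool invocations written out as plain text in the message body, would come back as "text-only" here.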

The Token Efficiency Gap

This is the number that stopped me:

Model	Total Tokens	Result
Sonnet 4	19,237	Complete (4 commits, screenshot)
Qwen3-Coder	2,059,519	Partial (1 commit, no screenshot)
Qwen 3.6	3,890,791	Failed (build passed, never committed)
Gemma 4 12B	1,170,967	Failed (0/7 builds passed)
Hermes 4 14B	1,138,614	Failed (0/13 builds passed)
Devstral 24B	14,447	Failed (zero tool calls)

Sonnet used 19K tokens to complete the task. The local models that actually tried burned 1–4 million tokens and mostly failed. That's roughly a 60-200x token efficiency gap for the same task.

The local models aren't just slower. They're doing fundamentally more work per unit of progress — re-reading files they already read, rewriting code they just wrote, rebuilding with the same error, looping through the same reasoning. It's not a speed problem. It's a thinking problem.
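The per-model ratios fall straight out of the table. A small sketch using the token totals above; Devstral is left out because it never made a tool call, and `ratioToControl` is a hypothetical helper name.

```typescript
const totals: Record<string, number> = {
  "Sonnet 4": 19237,
  "Qwen3-Coder": 2059519,
  "Qwen 3.6": 3890791,
  "Gemma 4 12B": 1170967,
  "Hermes 4 14B": 1138614,
};

// Tokens used relative to the control, rounded to the nearest whole multiple.
function ratioToControl(model: string, control: string = "Sonnet 4"): number {
  return Math.round(totals[model] / totals[control]);
}

const ratios = ["Hermes 4 14B", "Gemma 4 12B", "Qwen3-Coder", "Qwen 3.6"].map(
  (model) => ({ model, ratio: ratioToControl(model) })
);
// ratios: Hermes 59, Gemma 61, Qwen3-Coder 107, Qwen 3.6 202
```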

Common Failure Patterns

Every local model that ran long enough exhibited the same pathologies:

1. Degenerate loops. Gemma repeated the same text 26 times. Hermes rebuilt with the same wrong import 13 times. Qwen 3.6 rewrote its screenshot script 4 times with the same approach. In these runs, no local model that entered a loop broke out of it without human intervention.

2. Working directory amnesia. Coder's execute tool doesn't preserve cd across calls. Sonnet learned this instantly and prefixed every command. Multiple local models burned 5-10 tool calls per session rediscovering this.

3. Inability to prioritize. Qwen 3.6 had a passing build and chose to yak-shave on Playwright instead of committing. No local model demonstrated the judgment to ship what works and iterate.

4. No self-diagnosis. When a build fails, the fix requires reading the error, forming a hypothesis, and trying something different. Hermes and Gemma both tried the same fix repeatedly. Neither ever stepped back to read docs, check type definitions, or examine the project configuration.
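The first pattern is the easiest one for a harness to catch mechanically. Here is a minimal sketch of a repetition guard, assuming each agent step can be reduced to a fingerprint string (for example, tool name plus arguments plus the first line of the error). `isStuck` is hypothetical, and I'm not claiming Coder Agents ships anything like it.

```typescript
// Returns true when the last `threshold` step fingerprints are identical.
function isStuck(history: string[], threshold: number = 3): boolean {
  if (threshold < 2 || history.length < threshold) return false;
  const tail = history.slice(-threshold);
  return tail.every((entry) => entry === tail[0]);
}

const hermesLike = [
  "write route.ts",
  "build: Can't resolve '../../../lib/tags'",
  "build: Can't resolve '../../../lib/tags'",
  "build: Can't resolve '../../../lib/tags'",
];
const stuck = isStuck(hermesLike); // true: three identical failures in a row
```

A harness could use a signal like this to stop the run or inject a hint, instead of waiting for a human to type "stop". It only catches exact repeats, so near-identical loops would need a fuzzier comparison.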

What I Actually Learned

Local models can write plausible code. Four of five local models produced syntactically reasonable TypeScript. The code looked right. The architecture was sensible. It's the last mile — debugging, building, committing, shipping — where they fall apart.

The agentic gap is wider than the coding gap. These models can generate code. What they can't do is operate as agents — managing state across tool calls, diagnosing errors, prioritizing tasks, knowing when to stop and ship. That's a different capability than code generation, and it's where local models are currently weakest.

Token efficiency is the real benchmark. Raw parameter count and context window don't predict agentic success. Qwen 3.6 had the biggest context (131K) and burned the most tokens (3.89M), and it still didn't ship. Sonnet used roughly 200x fewer tokens and completed everything. The bottleneck isn't context. It's reasoning quality per token.

Tool-calling compatibility isn't guaranteed. Devstral is marketed as an agentic coding model, but it couldn't even interface with the tool-calling protocol. If you're evaluating local models for agent use, test tool calling first.

Qwen3-Coder is the local model to watch. It's the only local model that actually shipped code in this test. Messy, single-commit, no screenshot — but working code pushed to a branch. For a 30B MoE model running on a single consumer GPU, that's notable.

The Numbers

Metric	Sonnet 4	Qwen3-Coder	Qwen 3.6	Gemma 4 12B	Hermes 4 14B	Devstral 24B
Type	Cloud	Local MoE	Local MoE	Local Dense	Local Dense	Local Dense
Parameters	Unknown	30B (3B active)	35B (3B active)	12B	14B	24B
Total tokens	19,237	2,059,519	3,890,791	1,170,967	1,138,614	14,447
Tool calls	88	60	76	34	40	0
Messages	183	127	162	81	88	9
Commits pushed	4	1	0	0	0	0
Build passed	✅ 1st try	✅ 3rd try	✅ 2nd try	❌ 0/7	❌ 0/13	❌
Screenshot	✅	❌	❌	❌	❌	❌
Human nudges	0	0	3	2 + stop	stop	1
Outcome	Complete	Partial	Failed	Failed	Failed	Failed

Inference stack: llama.cpp b9660, flash attention, q8_0 KV cache, Coder Agents v2.34.0

Hardware: RTX 5090 32GB, Ryzen 9 9950X3D, 64GB RAM, Ubuntu 24.04

Next up: Round 8 brings more frontier models to the same task. And I'll keep pushing the local models: better quants, newer releases, maybe a different agent framework. The gap is real, but the pace of improvement on the local side is fast.
