Model Showdown Round 7: Five Local Models vs. One Cloud Model on a Real Coding Task
Rob · 2026-06-18 · via DEV Community

Five local models. One frontier cloud model. The same coding task. Zero hand-holding.

Only two shipped code. One of them was the cloud model.

Part of my goal with this series is to continuously test the viability and maturity of local models. I've done it for basic agentic tasks. Today we're revisiting coding tasks.

What did we learn?

Local models are not ready yet. At least not for homelabs like mine. Perhaps if you have hundreds of gigabytes of unified memory (I'm looking at you, older Mac Studios) you can run fully unquantized models. But with even the beefiest of discrete consumer GPUs, local models can't reliably ship code as agents.

Let's dig in.

The Setup

This is Round 7 of the Model Showdown series. Previous rounds tested cloud models against each other — Opus, Sonnet, GPT-5.5, Qwen cloud. This time I wanted to answer a different question: can local models running on consumer hardware actually complete a real agentic coding task?

The homelab:

  • CPU: AMD Ryzen 9 9950X3D, 64GB RAM
  • GPU: NVIDIA RTX 5090, 32GB VRAM
  • Inference: llama.cpp b9660, single-model serving on port 8080
  • Agent platform: Coder Agents v2.34.0
  • OS: Ubuntu 24.04, NVIDIA Driver 590.48.01, CUDA 13.1

Every local model was configured as aggressively as the hardware allows — flash attention, quantized KV cache (q8_0), and context windows maxed to what VRAM permits.

The Contestants

| Model | Type | Quant (file size) | VRAM | Context | Max Output |
|---|---|---|---|---|---|
| Qwen 3.6 35B-A3B | Local MoE | UD-Q4_K_XL (21GB) | ~21GB | 131,072 | 81,920 |
| Gemma 4 12B | Local Dense | UD-Q4_K_XL (6.9GB) | ~8GB | 65,536 | 32,768 |
| Hermes 4 14B | Local Dense | Q8_0 (15GB) | ~15GB | 65,536 | 32,768 |
| Qwen3-Coder 30B-A3B | Local MoE | UD-Q4_K_XL (17GB) | ~17GB | 65,536 | 32,768 |
| Devstral 24B | Local Dense | Q5_K_M (17GB) | ~17GB | 65,536 | 32,768 |
| Claude Sonnet 4 | Cloud (control) | Native | N/A | 200,000 | |

Sonnet 4 is the control. I already know what it can do. The question is how close the local models get.

The Task: Admin Tag Manager

Previous rounds used an "image management" feature, but that collided with existing code in the repo. For Round 7, I designed a clean-room task: build a tag manager for the blog's admin panel.

The blog already has tags — posts use a tags[] array in MDX frontmatter, there's a public /tags page, and src/lib/posts.ts has a getAllTags() function. But there's no admin UI to manage them.

Each model got the identical prompt:

Goal: Add a Tag Manager to the /admin section.

Requirements:

  1. Create src/lib/tags.ts — list tags with post counts, detect orphans, support rename and merge
  2. Create src/app/api/admin/tags/route.ts — GET, PATCH, DELETE endpoints
  3. Create src/app/admin/tags/page.tsx — table with inline rename, delete, sort
  4. Add "Tags" to AdminNav
  5. Client-side mutations with refresh (no full page reload)
  6. npm run build must pass with zero errors
  7. Take a screenshot via Playwright MCP
  8. Commit in logical chunks, push to branch
  9. Do NOT open a PR

Nine requirements. Real codebase. Real build system. Real git workflow.
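To make requirement 1 concrete, here is a minimal sketch of the tag logic as pure functions over in-memory posts. It is not what any of the models wrote. The `Post` shape and the names `tagStats`, `findOrphans`, `renameTag`, and `mergeTags` are hypothetical, and the prompt does not define "orphan", so the sketch assumes it means a known tag that no post uses. Reading and writing the MDX frontmatter is left out.

```typescript
// Hypothetical in-memory shape; the real blog stores tags[] in MDX frontmatter.
interface Post {
  slug: string;
  tags: string[];
}

// Count how many posts use each tag, sorted by count (descending), then by name.
function tagStats(posts: Post[]): { tag: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const post of posts) {
    // De-duplicate first so a tag listed twice in one post counts once.
    Array.from(new Set(post.tags)).forEach((tag) => {
      counts.set(tag, (counts.get(tag) ?? 0) + 1);
    });
  }
  return Array.from(counts.entries())
    .map(([tag, count]) => ({ tag, count }))
    .sort((a, b) => b.count - a.count || (a.tag < b.tag ? -1 : a.tag > b.tag ? 1 : 0));
}

// Assumption: an "orphan" is a known tag that no post uses.
function findOrphans(posts: Post[], knownTags: string[]): string[] {
  const used = new Set<string>();
  for (const post of posts) post.tags.forEach((tag) => used.add(tag));
  return knownTags.filter((tag) => !used.has(tag));
}

// Rename `from` to `to` in every post. If a post already has `to`,
// the duplicate is dropped, so renaming onto an existing tag acts as a merge.
function renameTag(posts: Post[], from: string, to: string): Post[] {
  return posts.map((post) => ({
    ...post,
    tags: Array.from(new Set(post.tags.map((t) => (t === from ? to : t)))),
  }));
}

// Merge several source tags into one target tag.
function mergeTags(posts: Post[], sources: string[], target: string): Post[] {
  return sources.reduce((acc, src) => renameTag(acc, src, target), posts);
}
```

Keeping these functions free of file I/O means they can be unit-tested without touching the repo, and the API route only has to load posts, call one of them, and write the result back.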

The Methodology

Each model got its own clean branch (run-10 through run-15) forked from the same main commit. Local models were loaded one at a time via llm-switch.sh and served through llama-server on localhost:8080. Sonnet 4 ran through Coder's built-in Anthropic provider.

Model-to-run assignment was randomized and sealed before execution. I didn't know which model was which run until after all six completed (or failed).
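One way to randomize an assignment like this is a seeded Fisher-Yates shuffle. The sketch below is an illustration, not the procedure actually used here: the function names, the seed, and the small mulberry32-style generator are all assumptions, and how the assignment was sealed is not shown.

```typescript
// Small seeded PRNG (mulberry32-style) so a shuffle can be reproduced from its seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Fisher-Yates shuffle: returns a shuffled copy and leaves the input untouched.
function shuffle<T>(items: readonly T[], rand: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Map shuffled models onto consecutive run names, e.g. run-10, run-11, ...
function assignRuns(
  models: readonly string[],
  firstRun: number,
  seed: number
): Record<string, string> {
  const assignment: Record<string, string> = {};
  shuffle(models, mulberry32(seed)).forEach((model, index) => {
    assignment[`run-${firstRun + index}`] = model;
  });
  return assignment;
}
```

Calling `assignRuns(models, 10, seed)` with the six model names yields one model per branch from run-10 through run-15, and the same seed always reproduces the same mapping.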

A note on human intervention: I monitored each session live and occasionally nudged stalled models ("keep going", "can you finish?") or stopped them when they entered obvious doom loops ("stop"). There was no standardized intervention protocol — I used my judgment as a developer watching an AI assistant, which is how these tools actually get used in practice. Some models got more nudges than others because they stalled more. The two models that shipped code needed zero intervention.

The Results

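| Model | Tool Calls | Total Tokens | Commits | Build Pass | Screenshot | Outcome |
|---|---|---|---|---|---|---|
| Sonnet 4 ☁️ | 88 | 19K | 4 | ✅ (1st try) | ✅ | Complete |
| Qwen3-Coder 30B-A3B | 60 | 2.06M | 1 | ✅ (3rd try) | ❌ | Partial |
| Qwen 3.6 35B-A3B | 76 | 3.89M | 0 | ✅ (2nd try) | ❌ | Failed (never committed) |
| Gemma 4 12B | 34 | 1.17M | 0 | ❌ (0/7) | ❌ | Failed |
| Hermes 4 14B | 40 | 1.14M | 0 | ❌ (0/13) | ❌ | Failed |
| Devstral 24B | 0 | 14K | 0 | N/A (no build run) | ❌ | Total failure |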

One cloud model. Five local models. One complete success. One partial. Four failures.

What Each Model Actually Did

Sonnet 4 — The Control (Run 14): Complete Success

Sonnet did what you'd expect a frontier model to do. It cloned the repo, spent 25 tool calls reading existing code (auth patterns, API conventions, admin page structure, frontmatter format), then wrote all four files in a tight burst. Build passed on the first try. It hit a real environment issue — a stray package.json confused Turbopack's workspace detection — diagnosed the root cause, fixed it with a config change, took a Playwright screenshot, and pushed four clean conventional commits.

Total time: ~10 minutes. Zero human intervention.

acb4ea1 fix: set turbopack.root to avoid workspace lockfile detection in dev
352a8ca feat: add Tags link to AdminNav
22899a0 feat: add /admin/tags page with inline rename, delete, and sort
19f44fa feat: add tags.ts lib with stats, rename, and remove helpers

The implementation followed existing project patterns because it read them first. That's the difference.

Qwen3-Coder 30B-A3B (Run 15): The One That Shipped

The best-performing local model. It cloned the repo, explored the codebase, created all four required files (410 lines of code), fixed TypeScript errors across three build attempts, and pushed a working commit.

But it wasn't clean. It burned ~8 tool calls just fighting the working directory problem (each execute call resets to /home/coder, so it kept forgetting to cd into the repo). After committing, it spent another 30 tool calls confused about whether its own API route file existed — trying to delete and recreate something that was already committed.
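The workaround for a tool that resets its working directory on every call is to prefix each command with a `cd`. A minimal sketch, assuming a POSIX shell; the helper names and the `/home/coder/blog` path are hypothetical:

```typescript
// Quote a string for a POSIX shell by wrapping it in single quotes.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Prefix a command so it always runs inside the repo, regardless of
// which directory the tool call starts in.
function inRepo(repoDir: string, command: string): string {
  return `cd ${shellQuote(repoDir)} && ${command}`;
}

// Hypothetical repo path, for illustration only.
const buildCommand = inRepo("/home/coder/blog", "npm run build");
// buildCommand is "cd '/home/coder/blog' && npm run build"
```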

No screenshot. No logical commit chunking (everything in one commit). But it shipped working code, which puts it in a category of one among the local models.

Qwen 3.6 35B-A3B (Run 13): The Tragic Hero

This is the one that hurts. Qwen 3.6 actually completed the implementation. It explored the codebase thoroughly, wrote all four files, fixed a type error, and got npm run build to pass cleanly.

Then it decided it needed a Playwright screenshot before committing.

It spent the next 77 messages, nearly half of its 162-message session, trying to install Playwright, fighting missing Chromium dependencies, debugging browser launch failures, rewriting a screenshot script four times, and wrestling with the auth middleware that blocked unauthenticated page loads. It never took the screenshot. It never committed. It never pushed.

The code was right there. Build passing. Ready to go. But the model couldn't prioritize "commit what works" over "complete requirement #7 first." Three times I nudged it — "You there?", "Keep going", "can you finish?" — and each time it dove back into the Playwright rabbit hole.

3.89 million tokens burned. Zero commits pushed.

Gemma 4 12B (Run 11): The API Misunderstanding

Gemma cloned the repo, read the existing code, and wrote all three new files plus the nav update. Reasonable start. Then it ran npm run build and hit a type error with gray-matter's stringify() function.

The fix was simple: matter.stringify(content, data) — content string first, data object second. Gemma had the arguments reversed. It tried six variations of the call, rewrote tags.ts six times, ran seven builds — and never once tried the correct argument order. It never read the gray-matter type definitions. It never checked the docs.

After the fifth failed build, it fell into a degenerate text generation loop — printing "I'll also make sure src/lib/tags.ts is correct" 26 consecutive times. I had to send "stop" to break the loop.

Hermes 4 14B (Run 12): The Import Path That Wouldn't Die

Hermes jumped straight to writing code without exploring the project structure first. It created two files and ran npm run build. The error:

Module not found: Can't resolve '../../../lib/tags'

The route file at src/app/api/admin/tags/route.ts needs ../../../../lib/tags (four levels up) or @/lib/tags (Next.js path alias). Hermes used three levels. Off by one.

It never diagnosed this. Instead, it rewrote both files with the same wrong import and rebuilt. Thirteen times. The output from message 34 onward is nearly verbatim identical every iteration. Same code. Same error. Same "fix." When I sent "stop," it continued for five more tool calls before acknowledging the signal.
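The off-by-one is easy to verify by resolving both specifiers against the route file's directory. The sketch below does it with plain string handling; `resolveImport` is a hypothetical helper written for this illustration, not part of Next.js or Node.

```typescript
// Resolve a relative import against the directory of the importing file,
// using plain string handling (no Node.js APIs needed).
function resolveImport(fromFile: string, specifier: string): string {
  const parts = fromFile.split("/");
  parts.pop(); // drop the file name, keep its directory
  for (const segment of specifier.split("/")) {
    if (segment === "..") parts.pop();
    else if (segment !== "." && segment !== "") parts.push(segment);
  }
  return parts.join("/");
}

const routeFile = "src/app/api/admin/tags/route.ts";

// Three levels up stops at src/app, where no lib directory exists.
const threeUp = resolveImport(routeFile, "../../../lib/tags"); // "src/app/lib/tags"

// Four levels up reaches src, which is where lib/tags.ts lives.
const fourUp = resolveImport(routeFile, "../../../../lib/tags"); // "src/lib/tags"
```

Counting directory levels by hand is exactly the kind of thing that goes wrong, which is one reason the `@/lib/tags` alias is the safer choice.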

Devstral 24B (Run 10): The Non-Starter

Devstral never executed a single tool call. It hallucinated an entire fake conversation about a Python project that doesn't exist, then emitted what looked like tool invocations — execute, read_file, write_file — but rendered them as plain text inside the assistant message. The platform couldn't parse them as structured tool calls, so nothing happened.

This is a fundamental compatibility failure. The model couldn't interface with Coder's tool-calling protocol at all. Nine messages, 14K tokens, zero actions.

The Token Efficiency Gap

This is the number that stopped me:

| Model | Total Tokens | Result |
|---|---|---|
| Sonnet 4 | 19,237 | Complete (4 commits, screenshot) |
| Qwen3-Coder | 2,059,519 | Partial (1 commit, no screenshot) |
| Qwen 3.6 | 3,890,791 | Failed (build passed, never committed) |
| Gemma 4 12B | 1,170,967 | Failed (0/7 builds passed) |
| Hermes 4 14B | 1,138,614 | Failed (0/13 builds passed) |
| Devstral 24B | 14,447 | Failed (zero tool calls) |

Sonnet used 19K tokens to complete the task. The local models that actually tried burned 1–4 million tokens and mostly failed. That's roughly a 60-200x token efficiency gap for the same task.
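The size of the gap can be checked directly from the token totals in the table above. This small sketch divides each local model's total by Sonnet's and rounds to the nearest whole multiple. Devstral is left out because it never made a tool call. The function name is made up for the example.

```typescript
// Token totals taken from the table above.
const totalTokens: Record<string, number> = {
  "Sonnet 4": 19237,
  "Qwen3-Coder": 2059519,
  "Qwen 3.6": 3890791,
  "Gemma 4 12B": 1170967,
  "Hermes 4 14B": 1138614,
};

// How many times more tokens each model used than the baseline, rounded.
function ratioToBaseline(
  tokens: Record<string, number>,
  baseline: string
): Record<string, number> {
  const base = tokens[baseline];
  if (base === undefined || base <= 0) {
    throw new Error(`No positive token count for baseline "${baseline}"`);
  }
  const ratios: Record<string, number> = {};
  for (const name of Object.keys(tokens)) {
    if (name !== baseline) {
      ratios[name] = Math.round((tokens[name] as number) / base);
    }
  }
  return ratios;
}

const ratios = ratioToBaseline(totalTokens, "Sonnet 4");
// ratios: Qwen3-Coder 107, Qwen 3.6 202, Gemma 4 12B 61, Hermes 4 14B 59
```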

The local models aren't just slower. They're doing fundamentally more work per unit of progress — re-reading files they already read, rewriting code they just wrote, rebuilding with the same error, looping through the same reasoning. It's not a speed problem. It's a thinking problem.

Common Failure Patterns

Every local model that ran long enough exhibited the same pathologies:

1. Degenerate loops. Gemma repeated the same text 26 times. Hermes rebuilt with the same wrong import 13 times. Qwen 3.6 rewrote its screenshot script 4 times with the same approach. In these runs, no local model that entered a loop broke out of it without human intervention.

2. Working directory amnesia. Coder's execute tool doesn't preserve cd across calls. Sonnet learned this instantly and prefixed every command. Multiple local models burned 5-10 tool calls per session rediscovering this.

3. Inability to prioritize. Qwen 3.6 had a passing build and chose to yak-shave on Playwright instead of committing. No local model demonstrated the judgment to ship what works and iterate.

4. No self-diagnosis. When a build fails, the fix requires reading the error, forming a hypothesis, and trying something different. Hermes and Gemma both tried the same fix repeatedly. Neither ever stepped back to read docs, check type definitions, or examine the project configuration.
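The loop patterns above are mechanical enough that a harness could catch them without a human watching. Here is a minimal sketch of such a guard: it flags a session once the last few outputs (build errors, or the model's own sentences) are identical. The `isStuck` name and the threshold are assumptions for illustration, and nothing here reflects how Coder Agents actually behaves.

```typescript
// Returns true when the last `threshold` entries are all identical after
// whitespace normalization, e.g. the same build error or the same
// generated sentence repeated back to back.
function isStuck(history: readonly string[], threshold: number): boolean {
  if (threshold < 2 || history.length < threshold) return false;
  const normalize = (s: string): string => s.trim().replace(/\s+/g, " ");
  const tail = history.slice(-threshold);
  const first = normalize(tail[0] as string);
  return tail.every((entry) => normalize(entry) === first);
}

// Example: three identical build errors in a row trips the guard.
const buildErrors = [
  "Type error in tags.ts",
  "Module not found: Can't resolve '../../../lib/tags'",
  "Module not found: Can't resolve '../../../lib/tags'",
  "Module not found: Can't resolve '../../../lib/tags'",
];
const shouldIntervene = isStuck(buildErrors, 3); // true
```

An exact-match check like this would have caught the Hermes and Gemma loops. It would not catch Qwen 3.6's Playwright detour, where each attempt differed slightly, so it is a floor rather than a fix.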

What I Actually Learned

Local models can write plausible code. Four of five local models produced syntactically reasonable TypeScript. The code looked right. The architecture was sensible. It's the last mile — debugging, building, committing, shipping — where they fall apart.

The agentic gap is wider than the coding gap. These models can generate code. What they can't do is operate as agents — managing state across tool calls, diagnosing errors, prioritizing tasks, knowing when to stop and ship. That's a different capability than code generation, and it's where local models are currently weakest.

Token efficiency is the real benchmark. Raw parameter count and context window don't predict agentic success. Qwen 3.6 had the biggest context of the local models (131K) and burned the most tokens (3.89M), and still didn't ship. Sonnet used roughly 200x fewer tokens and completed everything. The bottleneck isn't context. It's reasoning quality per token.

Tool-calling compatibility isn't guaranteed. Devstral is marketed as an agentic coding model, but it couldn't even interface with the tool-calling protocol. If you're evaluating local models for agent use, test tool calling first.
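A tool-calling smoke test can be as simple as sending one prompt that should trigger a tool and then inspecting the reply. The sketch below covers only the inspection step. It assumes the reply follows the OpenAI-compatible chat completion message shape (a `tool_calls` array on the assistant message), and the `classifyToolUse` name is hypothetical. The network request itself is not shown.

```typescript
// Assumed reply shape: an OpenAI-compatible chat completion message.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface AssistantMessage {
  role: "assistant";
  content: string | null;
  tool_calls?: ToolCall[];
}

type ToolCallVerdict = "structured" | "text-only" | "none";

// Classify a reply: did the model emit a structured tool call, merely
// mention a tool name in plain text, or do neither?
function classifyToolUse(
  message: AssistantMessage,
  toolNames: readonly string[]
): ToolCallVerdict {
  if (message.tool_calls && message.tool_calls.length > 0) return "structured";
  const text = message.content ?? "";
  return toolNames.some((name) => text.includes(name)) ? "text-only" : "none";
}

// The Devstral failure mode: tool invocations rendered as plain text.
const textOnlyReply: AssistantMessage = {
  role: "assistant",
  content: 'execute {"command": "ls"}',
};
const verdict = classifyToolUse(textOnlyReply, ["execute", "read_file", "write_file"]);
// verdict is "text-only"
```

A model that comes back "text-only" on a prompt like this is not worth a full agentic run until the chat template or tool-call parser is sorted out.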

Qwen3-Coder is the local model to watch. It's the only local model that actually shipped code in this test. Messy, single-commit, no screenshot — but working code pushed to a branch. For a 30B MoE model running on a single consumer GPU, that's notable.

The Numbers

| Metric | Sonnet 4 | Qwen3-Coder | Qwen 3.6 | Gemma 4 12B | Hermes 4 14B | Devstral 24B |
|---|---|---|---|---|---|---|
| Type | Cloud | Local MoE | Local MoE | Local Dense | Local Dense | Local Dense |
| Parameters | Unknown | 30B (3B active) | 35B (3B active) | 12B | 14B | 24B |
| Total tokens | 19,237 | 2,059,519 | 3,890,791 | 1,170,967 | 1,138,614 | 14,447 |
| Tool calls | 88 | 60 | 76 | 34 | 40 | 0 |
| Messages | 183 | 127 | 162 | 81 | 88 | 9 |
| Commits pushed | 4 | 1 | 0 | 0 | 0 | 0 |
| Build passed | ✅ 1st try | ✅ 3rd try | ✅ 2nd try | ❌ 0/7 | ❌ 0/13 | N/A (no build run) |
| Screenshot | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Human nudges | 0 | 0 | 3 | 2 + stop | stop | 1 |
| Outcome | Complete | Partial | Failed | Failed | Failed | Failed |

Inference stack: llama.cpp b9660, flash attention, q8_0 KV cache, Coder Agents v2.34.0

Hardware: RTX 5090 32GB, Ryzen 9 9950X3D, 64GB RAM, Ubuntu 24.04

Next up: Round 8 brings more frontier models to the same task. And I'll keep pushing the local models with better quants, newer releases, and maybe a different agent framework. The gap is real, but the pace of improvement on the local side is fast.