惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

F
Full Disclosure
Recorded Future
Recorded Future
T
Tenable Blog
S
Securelist
C
CERT Recently Published Vulnerability Notes
T
Threatpost
S
Schneier on Security
A
Arctic Wolf
The Hacker News
The Hacker News
C
CXSECURITY Database RSS Feed - CXSecurity.com
Know Your Adversary
Know Your Adversary
P
Privacy International News Feed
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Register - Security
The Register - Security
Cisco Talos Blog
Cisco Talos Blog
AWS News Blog
AWS News Blog
K
Kaspersky official blog
T
True Tiger Recordings
T
Threat Research - Cisco Blogs
V
Vulnerabilities – Threatpost
P
Palo Alto Networks Blog
T
The Exploit Database - CXSecurity.com
小众软件
小众软件
B
Blog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
Microsoft Azure Blog
Microsoft Azure Blog
Cyberwarzone
Cyberwarzone
C
Cybersecurity and Infrastructure Security Agency CISA
T
Tor Project blog
Spread Privacy
Spread Privacy
Malwarebytes
Malwarebytes
P
Proofpoint News Feed
F
Fox-IT International blog
F
Fortinet All Blogs
P
Privacy & Cybersecurity Law Blog
G
GRAHAM CLULEY
量子位
Latest news
Latest news
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
博客园 - 叶小钗
Project Zero
Project Zero
T
Tailwind CSS Blog
N
Netflix TechBlog - Medium
Martin Fowler
Martin Fowler
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
I
Intezer
博客园_首页
腾讯CDC
H
Hackread – Cybersecurity News, Data Breaches, AI and More
D
Darknet – Hacking Tools, Hacker News & Cyber Security

DEV Community

What will you think of when you read about a neural network!!? Mathematics? 🤔 I Built a Free Finance Dashboard as a Solo Dev — Here's What I Learned Drive JHipster with your AI agent: introducing jhipster-mcp (v0.0.4) Pokemon Battle Simulator Napkin Challenge! Looking for a Founding Engineer Copy Job CDC with SQL estate is now GA in Microsoft Fabric what terminal for CLI in Windows 10 do users like most Is Claude API Worth $3/1M Tokens Over Self-Hosted Llama? Vibe Coding Meets Spec-Driven Development: The Best of Both Worlds 10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble. Building a Browser-Based Free Isometric Illustration Maker for Modern UI Animation Workflows Use Blunt Prompts and Get Shit Done MCP servers are just REST APIs in a polite wrapper - here's 5 lines of Python I Got Tired of LLMs Hallucinating Compliance, So I Built an Open-Source Governance Layer Containers & Agents with Docker & OpenClaw All About AI & Using Claude On the Shoulders of Giants: Package Registries, Node & NPM Decoupling Webhook Verification and Automating Unstructured Data Ingestion Why flag_shih_tzu is changing its default SQL for bit flags Cómo construí una calculadora de interés compuesto con JavaScript vanilla y por qué todo el mundo debería usar una The Hard Part of Building a Realtime Binary Options Platform Was Not the Chart When the Runtime Was the Wall: How Rust Broke a 50 ms SLA and Saved the Day 🎤 Building a Real-Time Voice AI Assistant Using Open Source Tools I Benchmarked 5 Voice AI Stacks. Only 2 Stayed Under 300ms. I built AimVantage — an AI tool that turns your CV + a job link into a full interview prep pack in 90 seconds Your LLM Is Wrong. Your Codebase Is Why. Building an indexable verification page for a freshly-launched small business FinancialService schema for a real merchant services brokerage: a case study How Free Online Tools Survive Without Collecting Your Email The Day the Treasure Hunt Engine Buried Itself Alive Zero-Day Exploits, GitHub Actions Supply Chain Attacks, and OTP Auth Flaws Only 14.6% of 'AI-native' job postings actually name an AI tool. I checked 37,920. AI Agents, Jupyter Tooling, and LLM Code Gen Production Metrics SQLite Internals, PostgreSQL Performance & Multi-Tenancy Patterns FlashAttention CUDA Kernel, Strix Halo MOE Boost, & NVIDIA DLSS 4.5 Driver Update From "Vibe Coding" to Precision: Why GitHub Spec Kit Changes Everything Scale Wars #5 — Twitter: The Fan-out Pattern and the Architecture Behind 140 Characters Retrying HTTP Requests in Go Without Making It Worse Building a Vector Search Engine from Scratch: The Math and Mechanics of HNSW Technical Due Diligence Checklist for Startup Investors (2026) My AI agent ran overnight and I woke up to a $47 bill — so I built a kill-switch Run your first AI agent in Java — for free, with Mistral The Joke Worked: Building an AI-Powered COBOL Meeting Auditor with Hermes Agent Deep Dive into Y.js CRDTs for Real-Time Multiplayer Editors Async Python for AI Applications: Patterns That Don't Break Under Load The Hidden Reason GRC Programs Keep Failing: It's a Design Problem, Not a People Problem An LLM API call, in 4 GIFs Fear not the Markdown: A Beginner's Quest 😱 [Boost] I built a search engine for 3 million Polish businesses — here's what I learned An Intelligence Briefing for the Port of Rotterdam, from a Single Prompt How I Built Semantic Discussion Clustering Without Embeddings (and Why It Was Good Enough) I Built a Real-Time Simulation Game in a Single HTML File (Without React or Custom JavaScript) I Got Tired of SNMP Dev Hell, So I Built Trishul SNMP Suite 98. RAG: Give Your AI Access to Your Documents Why Getting a Tech Job Right Now Feels Broken? The Container Runtime Nobody Told You About (And Four Others) The Singleton Labyrinth Build your first MCP server in TypeScript: the 2026 setup that takes 30 minutes. Check Wallet Balances Across 4 Chains with Zero Dependencies — chain_balance.py Vectr — Code Intelligence AI Tool Veltrix Was Killing Us With YAML 5 PostgreSQL locking behaviors that trip people up Beyond Monolithic AI: How to Build a Pluggable "Brain" Architecture for Autonomous Agents The Operational Cost of JWT Lifecycle Management: Overlooked Details Mastering Structured JSON Outputs with Gemini API ATR Implements the Detection Layer the NSA Identified as Missing in MCP I tried both Cursor and Antigravity(1.20) - Switching Context - which one is better? Negative Lookups in Bf-Tree: Caching Things That Don't Exist My Struggles as a Software Engineer in 2026 Why Hybrid Metaheuristics Still Beat “Smarter” AI in Real-World Optimization Cómo destacar como JR DEV en tu equipo I got tired of guessing which model holds my VRAM, so I built a tiny dashboard Qwen Is Not Yet Ready to Power Local OpenClaw Deployments Top 7 Featured DEV Posts of the Week Why I got frustrated with AI job search tools and built my own 10 Best Open-Source AI Agents for 2026 Contract Analysis Will Replace Legal Gatekeeping AWS Cloud Shell with Antigravity CLI Building Reliable Event Delivery for XRPL Applications AMTP: HTTP for the Agentic Web — A New Markdown-First Protocol for AI Agents LLM Security Vulnerabilities Engineers Need to Know in 2026 Shared Build Cache: Makes Sense for the Independent Developer? Live Lessons From Running a 5-Minute Polymarket Crypto Bot Cómo Evaluar Agentes IA: Tutorial de LLM-as-Judge Day 2 of Python Learning 🐍 I built a local-first Apple Health recovery briefing that shows its math I Built a REST Microservice With a Database in 3 Files — and Wrote Zero Code 10 Avro Schema Mistakes Even Experienced Developer Do Commit: Refactor background workers and logging pipeline GitHub Actions vs Jenkins vs GitLab CI: A Developer's Honest Comparison (2026) Clean Architecture in MongoDB + C#: Why is the Repository Pattern Alone Not Enough? I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%. I Almost Quit Coding to Become a Welder Understanding Reinforcement Learning with Human Feedback Part 6: How the Reward Model Trains the Original Model # Level Up Your Portfolio with Wowfolio.in: Free, Customizable, Type Inhabitation in Lean: Why “Hello {name}” Can Become a Theorem Mastering Context in Go: A Senior Engineer’s Playbook for Lifecycle Management Solana Transactions Through a Backend Developer’s Eye Agent as a Tool Call: Claude Code's Fork-Exec Pattern
We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.
Vilius · 2026-05-27 · via DEV Community

Vilius

By Vilius Vystartas | May 2026

Every LLM can write code that works. The question is: can they write code that's efficient — and does telling them to be efficient actually help?

I tested 10 models on 10 coding tasks, each in two phases: unprompted (the model writes its own code) and prompted (explicitly told to write clean, DRY, efficient code). That's 200 API calls, $0.56 total. The results are... not what most prompt engineers would predict.

GPT-5.4 was the only model where prompting gave a substantial boost (+0.20). For most models, the "write efficient code" prompt was meaningless or actively harmful.


How the Metric Works

Each task has a known optimal token budget — the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The efficiency score is optimal_tokens / actual_tokens, capped at 1.0.

A score of 0.63 means the model used about 1.6x the optimal — not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the "write efficient code" instruction actually changes behaviour.


The Leaderboard (Sorted by Prompted Efficiency)

# Model Unprompted Prompted Δ Frugal Cost Correctness
🥇 GPT-5.4 0.43 0.63 +0.20 30% $0.096 78% → 85%
🥈 Qwen 3.6 Plus 0.44 0.60 +0.17 40% $0.158 78% → 87%
🥉 Gemma 4 31B 0.54 0.58 +0.04 50% $0.003 92% both
4 DeepSeek Chat 0.51 0.55 +0.04 30% $0.006 91% → 80%
5 Claude Sonnet 4 0.47 0.52 +0.04 40% $0.121 92% both
6 LFM 2 24B A2B 0.54 0.47 -0.06 30% $0.001 90% → 80%
7 Mistral Large 2411 0.54 0.46 -0.08 40% $0.050 90% → 82%
8 Gemini 2.5 Flash 0.47 0.46 -0.01 50% $0.020 92% → 90%
9 Cohere Command A 0.60 0.44 -0.17 40% $0.071 90% → 82%
10 Kimi K2.6 0.34 0.43 +0.09 30% $0.029 76% → 86%

What Stands Out

GPT-5.4 Is the Prompt Whisperer

GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were config-generation (+0.81 — went from 12 inline JSON blocks to a template loop), html-from-data (+0.71), and magic-strings (+0.38 — switched to an Enum). It's the only model in the batch where the "write efficient code" instruction consistently produces different (and better) output.

The cost is notable — $0.10 for 20 tasks is mid-range, not cheap, not expensive. But the efficiency gain is real.

Gemma 4 31B: The Quiet Winner

Half of Gemma 4's tasks were already "frugal" — naturally efficient without being told. It scored 92% correctness on both phases at just $0.003 total. That's a 40x cost advantage over GPT-5.4 with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.

Cohere Command A: Prompting Backfires

Cohere Command A had the highest unprompted efficiency in the batch (0.60) — it naturally writes concise code. But when told "write efficient code," it ballooned output on several tasks. html-from-data went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.

Lesson: if a model is already efficient, don't prompt it to be more efficient.

Qwen 3.6 Plus: Second Place, Slowest

Qwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took 26 minutes for 20 tasks — by far the slowest model. The efficiency gain is real (especially on html-from-data where it went from hardcoded rows to a map/join pattern), but you're waiting for it. Batch workloads only.

The Kimi Surprise

Kimi K2.6 had the lowest unprompted efficiency (0.34 — verbose, boilerplate-heavy code) but improved the most at the bottom end (+0.09). Still last place, but the prompt actually helped it compress — which is the opposite of the Cohere effect. Some models need the nudge.

Frugality: What Does It Mean?

"Frugal" means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50% — half their tasks were already efficient. GPT-5.4, DeepSeek Chat, and Kimi K2.6 were only 30% frugal — they needed the prompt to tighten up.


The Bigger Picture

Group Models Behaviour
Prompt-responsive GPT-5.4, Qwen 3.6 Plus Efficiency improves substantially with prompting
Prompt-neutral Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6 Prompt has little effect (±0.04)
Prompt-antagonistic LFM 2 24B A2B, Mistral Large 2411, Cohere Command A Efficiency drops when prompted

The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering — they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.

If the prompt says "write efficient code" and the model responds by writing more tokens, something in the training signal is misaligned.


My Picks

  • Best prompted efficiency: GPT-5.4 — 0.63, $0.10 for 20 tasks. The only model where prompting reliably improves output.
  • Best value overall: Gemma 4 31B — 0.58 prompted, 92% correctness, $0.003. Absurd price/performance.
  • Best natural efficiency: Cohere Command A — 0.60 unprompted. Don't prompt it, just let it work.
  • Most consistent: Claude Sonnet 4 — 92% correctness on both phases, small +0.04 efficiency gain. Reliable.
  • Skip if you're in a hurry: Qwen 3.6 Plus — 26 minutes for 20 tasks. Great efficiency gains, terrible latency.
  • Watch list: Kimi K2.6 — low base efficiency but the prompt actually helps. Worth retesting with a better prompt.

Methodology

Ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash — each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).

Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: efficiency_ratio = optimal_tokens / actual_tokens (capped at 1.0). Correctness scored against expected output patterns.

Total cost: $0.56 for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.

Full results: benchmarks.workswithagents.dev