惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

F
Full Disclosure
Recorded Future
Recorded Future
T
Tenable Blog
S
Securelist
C
CERT Recently Published Vulnerability Notes
T
Threatpost
S
Schneier on Security
A
Arctic Wolf
The Hacker News
The Hacker News
C
CXSECURITY Database RSS Feed - CXSecurity.com
Know Your Adversary
Know Your Adversary
P
Privacy International News Feed
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Register - Security
The Register - Security
Cisco Talos Blog
Cisco Talos Blog
AWS News Blog
AWS News Blog
K
Kaspersky official blog
T
True Tiger Recordings
T
Threat Research - Cisco Blogs
V
Vulnerabilities – Threatpost
P
Palo Alto Networks Blog
T
The Exploit Database - CXSecurity.com
小众软件
小众软件
B
Blog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
Microsoft Azure Blog
Microsoft Azure Blog
Cyberwarzone
Cyberwarzone
C
Cybersecurity and Infrastructure Security Agency CISA
T
Tor Project blog
Spread Privacy
Spread Privacy
Malwarebytes
Malwarebytes
P
Proofpoint News Feed
F
Fox-IT International blog
F
Fortinet All Blogs
P
Privacy & Cybersecurity Law Blog
G
GRAHAM CLULEY
量子位
Latest news
Latest news
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
博客园 - 叶小钗
Project Zero
Project Zero
T
Tailwind CSS Blog
N
Netflix TechBlog - Medium
Martin Fowler
Martin Fowler
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
I
Intezer
博客园_首页
腾讯CDC
H
Hackread – Cybersecurity News, Data Breaches, AI and More
D
Darknet – Hacking Tools, Hacker News & Cyber Security

DEV Community

What will you think of when you read about a neural network!!? Mathematics? 🤔 I Built a Free Finance Dashboard as a Solo Dev — Here's What I Learned Drive JHipster with your AI agent: introducing jhipster-mcp (v0.0.4) Pokemon Battle Simulator Napkin Challenge! Looking for a Founding Engineer Copy Job CDC with SQL estate is now GA in Microsoft Fabric what terminal for CLI in Windows 10 do users like most Is Claude API Worth $3/1M Tokens Over Self-Hosted Llama? Vibe Coding Meets Spec-Driven Development: The Best of Both Worlds We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better. Building a Browser-Based Free Isometric Illustration Maker for Modern UI Animation Workflows Use Blunt Prompts and Get Shit Done MCP servers are just REST APIs in a polite wrapper - here's 5 lines of Python I Got Tired of LLMs Hallucinating Compliance, So I Built an Open-Source Governance Layer Containers & Agents with Docker & OpenClaw All About AI & Using Claude On the Shoulders of Giants: Package Registries, Node & NPM Decoupling Webhook Verification and Automating Unstructured Data Ingestion Why flag_shih_tzu is changing its default SQL for bit flags Cómo construí una calculadora de interés compuesto con JavaScript vanilla y por qué todo el mundo debería usar una The Hard Part of Building a Realtime Binary Options Platform Was Not the Chart When the Runtime Was the Wall: How Rust Broke a 50 ms SLA and Saved the Day 🎤 Building a Real-Time Voice AI Assistant Using Open Source Tools I Benchmarked 5 Voice AI Stacks. Only 2 Stayed Under 300ms. I built AimVantage — an AI tool that turns your CV + a job link into a full interview prep pack in 90 seconds Your LLM Is Wrong. Your Codebase Is Why. Building an indexable verification page for a freshly-launched small business FinancialService schema for a real merchant services brokerage: a case study How Free Online Tools Survive Without Collecting Your Email The Day the Treasure Hunt Engine Buried Itself Alive Zero-Day Exploits, GitHub Actions Supply Chain Attacks, and OTP Auth Flaws Only 14.6% of 'AI-native' job postings actually name an AI tool. I checked 37,920. AI Agents, Jupyter Tooling, and LLM Code Gen Production Metrics SQLite Internals, PostgreSQL Performance & Multi-Tenancy Patterns FlashAttention CUDA Kernel, Strix Halo MOE Boost, & NVIDIA DLSS 4.5 Driver Update From "Vibe Coding" to Precision: Why GitHub Spec Kit Changes Everything Scale Wars #5 — Twitter: The Fan-out Pattern and the Architecture Behind 140 Characters Retrying HTTP Requests in Go Without Making It Worse Building a Vector Search Engine from Scratch: The Math and Mechanics of HNSW Technical Due Diligence Checklist for Startup Investors (2026) My AI agent ran overnight and I woke up to a $47 bill — so I built a kill-switch Run your first AI agent in Java — for free, with Mistral The Joke Worked: Building an AI-Powered COBOL Meeting Auditor with Hermes Agent Deep Dive into Y.js CRDTs for Real-Time Multiplayer Editors Async Python for AI Applications: Patterns That Don't Break Under Load The Hidden Reason GRC Programs Keep Failing: It's a Design Problem, Not a People Problem An LLM API call, in 4 GIFs Fear not the Markdown: A Beginner's Quest 😱 [Boost] I built a search engine for 3 million Polish businesses — here's what I learned An Intelligence Briefing for the Port of Rotterdam, from a Single Prompt How I Built Semantic Discussion Clustering Without Embeddings (and Why It Was Good Enough) I Built a Real-Time Simulation Game in a Single HTML File (Without React or Custom JavaScript) I Got Tired of SNMP Dev Hell, So I Built Trishul SNMP Suite 98. RAG: Give Your AI Access to Your Documents Why Getting a Tech Job Right Now Feels Broken? The Container Runtime Nobody Told You About (And Four Others) The Singleton Labyrinth Build your first MCP server in TypeScript: the 2026 setup that takes 30 minutes. Check Wallet Balances Across 4 Chains with Zero Dependencies — chain_balance.py Vectr — Code Intelligence AI Tool Veltrix Was Killing Us With YAML 5 PostgreSQL locking behaviors that trip people up Beyond Monolithic AI: How to Build a Pluggable "Brain" Architecture for Autonomous Agents The Operational Cost of JWT Lifecycle Management: Overlooked Details Mastering Structured JSON Outputs with Gemini API ATR Implements the Detection Layer the NSA Identified as Missing in MCP I tried both Cursor and Antigravity(1.20) - Switching Context - which one is better? Negative Lookups in Bf-Tree: Caching Things That Don't Exist My Struggles as a Software Engineer in 2026 Why Hybrid Metaheuristics Still Beat “Smarter” AI in Real-World Optimization Cómo destacar como JR DEV en tu equipo I got tired of guessing which model holds my VRAM, so I built a tiny dashboard Qwen Is Not Yet Ready to Power Local OpenClaw Deployments Top 7 Featured DEV Posts of the Week Why I got frustrated with AI job search tools and built my own 10 Best Open-Source AI Agents for 2026 Contract Analysis Will Replace Legal Gatekeeping AWS Cloud Shell with Antigravity CLI Building Reliable Event Delivery for XRPL Applications AMTP: HTTP for the Agentic Web — A New Markdown-First Protocol for AI Agents LLM Security Vulnerabilities Engineers Need to Know in 2026 Shared Build Cache: Makes Sense for the Independent Developer? Live Lessons From Running a 5-Minute Polymarket Crypto Bot Cómo Evaluar Agentes IA: Tutorial de LLM-as-Judge Day 2 of Python Learning 🐍 I built a local-first Apple Health recovery briefing that shows its math I Built a REST Microservice With a Database in 3 Files — and Wrote Zero Code 10 Avro Schema Mistakes Even Experienced Developer Do Commit: Refactor background workers and logging pipeline GitHub Actions vs Jenkins vs GitLab CI: A Developer's Honest Comparison (2026) Clean Architecture in MongoDB + C#: Why is the Repository Pattern Alone Not Enough? I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%. I Almost Quit Coding to Become a Welder Understanding Reinforcement Learning with Human Feedback Part 6: How the Reward Model Trains the Original Model # Level Up Your Portfolio with Wowfolio.in: Free, Customizable, Type Inhabitation in Lean: Why “Hello {name}” Can Become a Theorem Mastering Context in Go: A Senior Engineer’s Playbook for Lifecycle Management Solana Transactions Through a Backend Developer’s Eye Agent as a Tool Call: Claude Code's Fork-Exec Pattern
10 Models Tested: From 81.6% to 10%. The Free Tier is a Full-On Gamble.
Vilius · 2026-05-27 · via DEV Community

Vilius

By Vilius Vystartas | May 2026

I tested another 10 models across the same 10 agent coding tasks. Four of them were free-tier models — and the range was absurd: Owl Alpha scored 76.7% with zero hard fails, Laguna M.1 scored 10% and produced garbage on 9 out of 10 tasks. The free tier is not free if it costs you debugging time.

Total cost for all 10 models: $0.10. The paid models (6 of 10) came to $0.10 combined.


Batch 12 Leaderboard

# Model Score P/P/F Cost Time Category
🥇 Grok 4.3 81.6% 7/3/0 $0.017 39.9s Paid (xAI)
🥈 Perceptron Mk1 79.9% 8/1/1 $0.002 29.3s Paid (Perceptron)
🥉 Owl Alpha (free) 76.7% 5/5/0 Free 83.0s Free tier
4 xAI: Grok Build 0.1 75.0% 5/4/1 $0.034 95.3s Paid (xAI)
5 OpenAI: GPT Chat Latest 73.3% 6/2/2 $0.043 18.7s Paid (OpenAI)
6 Mistral Medium 3.5 71.6% 6/2/2 $0.008 12.6s Paid (Mistral)
7 Nemotron 3 Nano Omni (free) 50.0% 4/2/4 Free 23.5s Free tier
8 Laguna XS.2 (free) 49.7% 3/3/4 Free 28.7s Free tier
9 Baidu CoBuddy (free) 40.0% 4/0/6 Free 362.4s Free tier
10 Laguna M.1 (free) 10.0% 1/0/9 Free 89.8s Free tier

The Headlines

Grok 4.3 (81.6%, $0.017, 39.9s) — Grok's latest release takes the batch with zero hard fails. Seven clean passes, three partials. Process-monitor was the only full pass it earned that 4.3's competitors missed. xAI's Grok line is quietly consistent — 4.1 Fast (76.7%), 4.20 (75%), and now 4.3 (81.6%) — all within striking distance of the 80%+ club without crossing into premium pricing.

Perceptron Mk1 (79.9%, $0.002, 29.3s) — A brand new family debuts at nearly 80%, with eight passes — the most in the batch — for two-tenths of a cent. The one failure (regex-extract at 17%) is a known weakness for small models. At this price-to-pass ratio, Perceptron Mk1 is the value story of this batch.

Owl Alpha (free, 76.7%, 83.0s) — A free model with zero hard fails and 5 full passes. That's the standout free-tier result. Takes 2x longer than paid models for some tasks (24s on csv-stats vs 1-3s for the field), but the code is functional. If latency isn't critical, this is usable.


The Free Tier Lottery

Four free models. Results:

Model Score Verdict
Owl Alpha 76.7% Usable — zero hard fails, 5/10 full passes. Slow but functional.
Nemotron 3 Nano Omni 50.0% Mixed — half of tasks hit output cap at 400 tokens. Hit or miss.
Laguna XS.2 49.7% Unreliable — 400-token cap kills complex responses.
Baidu CoBuddy 40.0% Frustrating — 362 seconds total. Half the tasks hit output cap at 399 tokens. Waiting 6 minutes for 40% accuracy is not a good trade.
Laguna M.1 10.0% Broken — 1/10 passes. Every response capped at 400 tokens. Do not use.

The free tier cap of 399-400 output tokens is the real problem. Models like Laguna M.1 and CoBuddy truncate every response, turning what could be a partial into a fail. Owl Alpha works despite the cap because its outputs are concise enough to fit.

Pay $0.002 for Perceptron Mk1 and get 8/10 passes, or use Laguna M.1 free and get 1/10. The math is not subtle.


Disappointments

GPT Chat Latest (73.3%, $0.043) — OpenAI's catch-all endpoint was solid on easy tasks (file-parse, csv-stats, sql-query all passed) but fell apart on fix-bug (0%) with a lengthy, expensive hallucination. The most expensive model in the batch and it doesn't crack 75%.

Mistral Medium 3.5 (71.6%, $0.008) — Fastest model in the batch at 12.6s total, but the process-monitor task hit a 504 Gateway Timeout and scored 0%. A timeout fail on a model that otherwise looks strong carries a disproportionate penalty — without it, Medium 3.5 would be at 79.5%.

Laguna M.1 (10%) — The worst score in any batch I've run. Seven of its task responses were blank 400-token output cap fills. Not worth listing on OpenRouter.


Price/Performance

Model Score Cost $/%-pt
Owl Alpha (free) 76.7% $0 $0
Nemotron 3 Nano Omni (free) 50.0% $0 $0
Laguna XS.2 (free) 49.7% $0 $0
Baidu CoBuddy (free) 40.0% $0 $0
Laguna M.1 (free) 10.0% $0 $0
Perceptron Mk1 79.9% $0.002 $0.0024
Mistral Medium 3.5 71.6% $0.008 $0.0108
Grok 4.3 81.6% $0.017 $0.0209
xAI: Grok Build 0.1 75.0% $0.034 $0.0450
GPT Chat Latest 73.3% $0.043 $0.0584

Free models dominate the $/%-pt table by definition, but only Owl Alpha is actually usable. Among paid models, Perceptron Mk1 at $0.0024/%-pt is the efficiency winner — 24x cheaper per point than GPT Chat Latest.


My Picks

  • Best overall: Grok 4.3 — 81.6%, 39.9s, $0.017. Cleanest leaderboard of the batch.
  • Best value (paid): Perceptron Mk1 — 79.9%, $0.002 total. Eight passes for two-tenths of a cent.
  • Best free model: Owl Alpha — 76.7%, zero hard fails. The only free model I'd ship with in production.
  • Fastest: Mistral Medium 3.5 — 12.6s for all 10 tasks
  • Skip entirely: Laguna M.1 and all Laguna free-tier variants. 10% is not testable.

Methodology

Same setup as previous batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 400. Temperature: 0.1. Pattern-matching scoring against expected outputs.

Pre-flight verification caught zero failures this batch. Total cost: $0.10. Total dataset: 168 models tested across cloud and local.

Full results and per-task scores: benchmarks.workswithagents.dev