惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Google DeepMind News
Google DeepMind News
F
Fortinet All Blogs
阮一峰的网络日志
阮一峰的网络日志
Apple Machine Learning Research
Apple Machine Learning Research
爱范儿
爱范儿
WordPress大学
WordPress大学
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
J
Java Code Geeks
罗磊的独立博客
S
SegmentFault 最新的问题
V
V2EX
V
Visual Studio Blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
美团技术团队
博客园 - 三生石上(FineUI控件)
Stack Overflow Blog
Stack Overflow Blog
Y
Y Combinator Blog
MyScale Blog
MyScale Blog
D
Docker
Google DeepMind News
Google DeepMind News
Blog — PlanetScale
Blog — PlanetScale
M
Microsoft Research Blog - Microsoft Research
Martin Fowler
Martin Fowler
S
Secure Thoughts
B
Blog
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
www.infosecurity-magazine.com
www.infosecurity-magazine.com
Recent Announcements
Recent Announcements
MongoDB | Blog
MongoDB | Blog
C
Cisco Blogs
C
CERT Recently Published Vulnerability Notes
T
True Tiger Recordings
GbyAI
GbyAI
P
Proofpoint News Feed
P
Privacy International News Feed
Jina AI
Jina AI
The Cloudflare Blog
I
Intezer
AWS News Blog
AWS News Blog
Hacker News - Newest:
Hacker News - Newest: "LLM"
S
Security Archives - TechRepublic
NISL@THU
NISL@THU
The Register - Security
The Register - Security
Recent Commits to openclaw:main
Recent Commits to openclaw:main
P
Palo Alto Networks Blog
S
Schneier on Security
L
LINUX DO - 热门话题
C
CXSECURITY Database RSS Feed - CXSecurity.com
Security Latest
Security Latest
C
Cybersecurity and Infrastructure Security Agency CISA

DEV Community

Unchaining the African Creator Economy The Treasure Hunt Engine Gotcha - A Lesson in Constrained Performance great_cto v2.17 - no more tambourine dance When Catalogs Are Embedded in Storage SafeMind AI: Instant Health & Safety Intelligence What Is PKCE, How It Works & Flow Examples AI Agent Failure Modes Beyond Hallucination Fastest Way to Understand Stryker Solana Accounts Explained to a Web2 Developer TV Yayın Akışı Sitesi Geliştirirken Öğrendiğim Teknik Dersler $500 Challenge Drop My First Look at Google's Gemma 4: A Quick Introduction How I use an LLM as a translation judge Best Calendar and Scheduling API for Developers — 2026 Comparison Agentic AI in Travel: Why UCP Isn't Travel-Ready Yet — and What We Measured I Finished Machine Learning. And Then Changed The Plan. The Five-Thousand-Line File The AI Whirlwind: Why Your Local Agent Matters More Than Ever I Built an Oracle DBA That Lives in Telegram. It Cut a 500K-Row Scan to 5 - After Asking Permission. The Day 2 Reality of Running a Kubernetes Lab on Your Mac: Stop/Start, CKS Scenarios, and What I Learned Building It. n8n for Airtable Power Users: 5 Automations That Take Your Base to the Next Level Validating Gemma 4 for Industrial IoT: A Governance Pattern VS Code Now Credits Copilot on Every Commit by Default Astro and Islands Architecture: Why Your Portfolio Doesn't Need React for Everything Booting from FAT12: How I added file reading to my x86 kernel Unity’s AI agent went public: the developers of a static analysis tool on what that means for code quality Anna's Archive publica un llms.txt para los LLMs que rastrean su catálogo CRDTs for Offline-First Mobile Sync Why I Built Mneme HQ: Preventing AI Agent Architectural Drift Google Antigravity 2.0 Is the I/O 2026 Announcement You Should Actually Care About I Built a Pay-Per-Call Crypto Signal API with x402 — Heres the Architecture JWT Token Refresh Patterns in React 19: Avoiding the Silent Auth Death Spiral 🚀 “From Prompts to Autonomous Agents: What Google I/O 2026 Changed” The Power of Distributed Consensus in Autonomous SOCs Sixteen TUI components, copy-paste, no dependency The Boring Reliability Layer Every Autonomous Agent Needs Nven - Secret manager Building Multi-Tenant Row-Level Security in PostgreSQL: A Production Pattern The Hardest Part of Being a Developer Isn't Coding Building Vylo — Looking for Collaborators, Partners & Early Support I Thought Memory Fades With Time. It Actually Fades With Information. ORA-00064 오류 원인과 해결 방법 완벽 가이드 I registered an AI agent at 1 AM and something cracked open in my head Pitch: Nven - Sync secrets. Ship faster. Why y=mx+b is the heart of AI From Routines to a Crew — Building a System That Plans Its Own Work & executes it 25 React Interview Questions 2026 (With Answers) — Hooks, React 19, Concurrent Mode An open source LLM eval tool with two independent quality signals Using Dashboard Filtering to Get Customer Usage in Seconds from TBs of Data Skills, Java 17, And Theme Accents 4 Hard Lessons on Optimizing AI Coding Agents Arctype: Cross-Platform Database GUI for LLM Artifacts Your robots.txt says GPTBot is welcome. Your server says 403. Organizing How to Use AWS Glue Workflow 5 n8n Automations Every Digital Agency Should Be Running (Bill More, Work Less) Getting Started with TorchGeo — Remote Sensing with PyTorch Designing a Scalable Cross-Platform Appium Framework Google Antigravity 2.0 & Slash Commands Building a Unified Adaptive Learning Intelligence with Gemma 4, Flutter, and Multi-Model Orchestration Looking for beta testers for a £60 server management application The Disk-Pressure Incident That Taught Me to Always Set LimitRanges and Other Lessons from Mirroring EKS Locally. Why AI Should Not Write SQL Against ERP Databases Vibe coding works until it doesn't. The debt is real. Shipping at the Edge: Migrating a Coffee Subscription Platform to Cloudflare Workers Stop Tab-Switching: A Developer's Guide to Color Tools That Actually Fit the Workflow DevOps vs MLOps vs AIOps: What Changes, What Stays, and a Simple Roadmap to Get Started Run Powerful AI Coding Locally on a Normal Laptop 5 n8n Automations Every WooCommerce Store Needs (Save 10+ Hours/Week) What I Learned Building My Own AI Harness Hytale Servers Will Fail Treasure Hunts Until We Fix Our Event Handling Redux in React: Managing Global State Like a Pro Unfreezing Your GitHub Actions: Troubleshooting Stuck Deployments and Protecting Your Git Repo Statistics Unlocking Project Discoverability on GHES: A Key to Software Engineering Productivity When the Cleanup Code Becomes the Project Rockpack 8.0 - A React Scaffolder Built for the Age of AI-Assisted Development Mismanaging the Treasure Hunt Engine in Hytale Servers Will Get You Killed Stop Calling It an AI Assistant. It’s Already Managing Your Company Why Hardcoded Automations Fail AI Agents Why I built a post-quantum signing API (and why JWT is on borrowed time) Weekend Thought: Frontend Build Tools Suffer From Work Amnesia AI Is Changing Engineering Culture More Than We Realize A 10-Line Playwright Trick That Saved Me Hours on Every Sephora Run Everyone Was Focused on Gemini, But Infinite Scaler Was the Real Twister "Gemma 4 Analyzed My Bank Statements – Apparently I 'Have a Problem' with Coffee and Late-Night Apps" #css #webdev #beginners #codenewbie The Hidden Layer Every AI Developer Must Learn AlphaEvolve: Google DeepMind's Gemini-Powered Evolutionary Coding Agent RDS Reserved Instance Pricing: Every Engine, Every Rule, Real Dollar Savings How To Build An AI-Powered MVP Without Burning Your Startup Budget In 2026 Reading a Psychrometric Chart Without Getting Lost LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025) How to turn text into colors (without AI) Building Real-Time Apps in Node.js with Rivalis: WebSockets, Rooms, Actors, and a Binary Wire This Week In React #282 : Security, Fate, TanStack, Redux, Jotai | Hermes-node, Expo, Rozenite, Harness | TC39, Bun, pnpm, npm, Yarn, Node AI Copilot vs AI Agent Architecture - What's Actually Different (And Why It Matters) Smart Contract Security: NEAR's Futures Surge and AI Token Risks Database Maintenance: Tracing Production Incidents to Their Root Cause Stop juggling AI SDKs in PHP — meet Prisma Google Quietly Changed What “Apps” Mean at I/O 2026 The Infrastructure Team Is the Real Single Point of Failure
Evaluating LLMs for Under a Dollar
Thokozani Bu · 2026-05-14 · via DEV Community

Why Evals Matter

Training a model is only half the job. Without a systematic way to measure what it can actually do, you are flying blind. The problem is that evaluation is easy to do badly, you can run a benchmark, get a number, and walk away thinking you know something when you don't.

This post is about doing it properly on a budget. I ran three standard benchmarks against Qwen2.5-0.5B on a free Colab T4, logged wall-clock time and dollar cost for each task, and documented every methodological decision along the way. Total spend: $0.1185.


The Benchmarks

I picked three tasks that cover meaningfully different capabilities rather than variations of the same thing.

GSM8K (Cobbe et al., 2021) tests grade-school math reasoning. The model has to produce a chain-of-thought and arrive at a final numeric answer. Scoring is exact match, either the answer is right or it isn't. This is a generative task, which makes it slower and more expensive than the others. I used 5-shot prompting following the original paper.

HellaSwag (Zellers et al., 2019) tests commonsense sentence completion. Given a partial sentence, the model scores four candidate continuations using normalized log-likelihood and picks the highest. The dataset was constructed with adversarial filtering, meaning the wrong answers were specifically chosen to fool models that rely on surface-level patterns. Human performance is around 95%. I used 10-shot following the original paper.

TruthfulQA-MC2 (Lin et al., 2021) tests whether the model produces truthful answers to questions that commonly elicit false beliefs. I used the MC2 variant multiple choice scored by log-likelihood, rather than the generative version, which requires a GPT-4 judge model. This keeps the eval fully self-contained and free. 0-shot, following the original paper.


The Harness

All three tasks were run through lm-evaluation-harness by EleutherAI. The harness standardizes few-shot prompt construction, normalization, and metric computation across tasks, which matters a lot for reproducibility. Running the same eval twice should give the same number.

One non-obvious decision: GSM8K in the harness defaults to max_gen_toks=2048, which generates up to 2048 tokens per sample. On a T4 that was running over 4 hours. I capped it at 256 tokens and included a limit=0.25 which runs 25% of the test set. I figured this is enough to capture a complete chain-of-thought for grade-school math and brings runtime down to under 50 minutes.


The Model

Qwen2.5-0.5B is a 500M parameter base model from Alibaba. I chose it because it fits comfortably in the 15GB VRAM on a free Colab T4 and is fast enough to run all three benchmarks in a single session. Being a base model rather than an instruction-tuned one is worth noting, the experiment primarily reflects runtime, generation behaviour, and evaluation cost characteristics of the base model under standard benchmark workloads.


Cost Accounting

Cost basis: Colab Pro at approximately $0.10/hr for a T4 session.

Task Time Cost
GSM8K 46.52 min $0.0775
HellaSwag 23.67 min $0.0394
TruthfulQA-MC2 0.97 min $0.0016
Total 71.16 min $.1185

Runtime Characteristics

  • the experiment measured runtime generation behaviour
  • not benchmark capability
Task Logged Metric Generated Length
GSM8K sample_len 330
HellaSwag sample_len 2511
TruthfulQA-MC2 sample_len 205


Limitations

A few things worth being honest about before drawing conclusions from these numbers.

Contamination. Qwen's training data composition is not fully disclosed. Any of these benchmarks could have been in the pretraining mix, which would inflate scores. There is no way to verify this from the outside.

Exact match undercounts GSM8K. A model that produces the right reasoning but formats the final answer differently, writing "42 dollars" instead of "42", gets marked wrong. The real accuracy is likely slightly higher than the number reported.

Prompt sensitivity. Benchmark scores can shift meaningfully with different few-shot examples or prompt formatting. The numbers here are specific to the default harness prompt templates.


What I Would Do Differently

Running a single model against three benchmarks gives you a snapshot, not a story. The more interesting experiment is running the same benchmarks against multiple checkpoints say, the base model, a LoRA fine-tune, and a DPO fine-tune and measuring the delta. That is what weeks 13+ will set up.


Results and notebook committed to lm_eval_harness in my github.
https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026