Evaluating LLMs for Under a Dollar

Why Evals Matter

Training a model is only half the job. Without a systematic way to measure what it can actually do, you are flying blind. The problem is that evaluation is easy to do badly: you can run a benchmark, get a number, and walk away thinking you know something when you don't.

This post is about doing it properly on a budget. I ran three standard benchmarks against Qwen2.5-0.5B on a free Colab T4, logged wall-clock time and dollar cost for each task, and documented every methodological decision along the way. Total spend: $0.1185.

The Benchmarks

I picked three tasks that cover meaningfully different capabilities rather than variations of the same thing.

GSM8K (Cobbe et al., 2021) tests grade-school math reasoning. The model has to produce a chain-of-thought and arrive at a final numeric answer. Scoring is exact match: either the answer is right or it isn't. This is a generative task, which makes it slower and more expensive than the others. I used 5-shot prompting, the common few-shot setting for this benchmark.

HellaSwag (Zellers et al., 2019) tests commonsense sentence completion. Given a partial sentence, the model scores four candidate continuations using normalized log-likelihood and picks the highest. The dataset was constructed with adversarial filtering, meaning the wrong answers were specifically chosen to fool models that rely on surface-level patterns. Human performance is around 95%. I used 10-shot prompting, the common few-shot setting for this benchmark.
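To make "normalized log-likelihood" concrete, here is a minimal sketch of the selection step. It assumes normalization by continuation length in characters, and the log-likelihood values and continuations are hypothetical, not taken from this run.

```python
def pick_continuation(loglikelihoods, continuations):
    """Return the index of the continuation with the highest
    length-normalized log-likelihood."""
    if len(loglikelihoods) != len(continuations):
        raise ValueError("need one log-likelihood per continuation")
    normalized = [
        ll / max(len(text), 1)  # log-likelihood per character (assumed normalizer)
        for ll, text in zip(loglikelihoods, continuations)
    ]
    return max(range(len(normalized)), key=normalized.__getitem__)


# Hypothetical values for illustration only.
continuations = [
    " walks away.",
    " carefully rinses the soap off the dog with a hose.",
    " eats it.",
    " flies.",
]
loglikelihoods = [-14.0, -40.0, -13.0, -12.0]

# Without normalization, the shortest continuation wins on raw score.
raw_choice = max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)
# With normalization, the long but plausible continuation wins.
norm_choice = pick_continuation(loglikelihoods, continuations)
print(raw_choice, norm_choice)
```

The point of the normalization is that raw log-likelihood penalizes every extra token, so longer continuations lose by default unless the score is scaled by length.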

TruthfulQA-MC2 (Lin et al., 2021) tests whether the model gives truthful answers to questions that commonly elicit false beliefs. I used the MC2 variant, which is multiple choice scored by log-likelihood, rather than the generative version, which needs a separate judge model to grade free-form answers. This keeps the eval fully self-contained and free. I ran it 0-shot, following the original paper.
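As a rough sketch of how an MC2-style score works for a single question: the score is the share of probability mass the model puts on the true answers, out of all listed answers. The function name and the example probabilities below are hypothetical.

```python
import math


def mc2_score(true_logliks, false_logliks):
    """Fraction of total answer probability assigned to the true answers."""
    p_true = sum(math.exp(ll) for ll in true_logliks)
    p_false = sum(math.exp(ll) for ll in false_logliks)
    return p_true / (p_true + p_false)


# Hypothetical log-likelihoods for one question: two true and two false answers.
true_logliks = [math.log(0.3), math.log(0.1)]
false_logliks = [math.log(0.2), math.log(0.4)]

score = mc2_score(true_logliks, false_logliks)
print(f"MC2 for this question: {score:.2f}")
```

The benchmark score is then the average of this per-question value, so a model that spreads probability evenly across true and false answers lands near 0.5.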

The Harness

All three tasks were run through lm-evaluation-harness by EleutherAI. The harness standardizes few-shot prompt construction, normalization, and metric computation across tasks, which matters a lot for reproducibility. Running the same eval twice should give the same number.

One non-obvious decision: GSM8K in the harness defaults to max_gen_toks=2048, so each sample can generate up to 2048 tokens. On a T4, that meant a runtime of over 4 hours. I capped generation at 256 tokens, which I judged to be enough for a complete chain-of-thought on grade-school math, and set limit=0.25 so that only 25% of the test set is evaluated. Together these bring runtime down to under 50 minutes.
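For reference, here is a sketch of how the settings above might map onto the harness CLI. The helper function is hypothetical, and the model id Qwen/Qwen2.5-0.5B and the batch_size value are my assumptions; check the flags against the harness version you have installed.

```python
import shlex


def build_lm_eval_command(task, num_fewshot, limit=None, max_gen_toks=None):
    """Build an lm_eval command line as a list of arguments (hypothetical helper)."""
    argv = [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=Qwen/Qwen2.5-0.5B",  # assumed Hub model id
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",  # assumption, not stated in the post
    ]
    if limit is not None:
        # A fraction below 1 evaluates that share of the test set.
        argv += ["--limit", str(limit)]
    if max_gen_toks is not None:
        # Caps the number of generated tokens per sample.
        argv += ["--gen_kwargs", f"max_gen_toks={max_gen_toks}"]
    return argv


gsm8k_cmd = build_lm_eval_command("gsm8k", 5, limit=0.25, max_gen_toks=256)
print(shlex.join(gsm8k_cmd))
```

Building the command as a list makes it easy to log the exact settings next to the timing data, and the list can be passed directly to subprocess.run.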

The Model

Qwen2.5-0.5B is a 500M parameter base model from Alibaba. I chose it because it fits comfortably in the 15GB VRAM on a free Colab T4 and is fast enough to run all three benchmarks in a single session. Being a base model rather than an instruction-tuned one is worth noting: the experiment primarily reflects the runtime, generation behaviour, and evaluation cost of the base model under standard benchmark workloads.

Cost Accounting

Cost basis: Colab Pro at approximately $0.10/hr for a T4 session.

Task	Time	Cost
GSM8K	46.52 min	$0.0775
HellaSwag	23.67 min	$0.0394
TruthfulQA-MC2	0.97 min	$0.0016
Total	71.16 min	$0.1185
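The arithmetic behind the table is just minutes converted to hours times the hourly rate. A small sketch, using the runtimes from the table and the approximate $0.10/hr rate:

```python
RATE_USD_PER_HOUR = 0.10  # approximate rate quoted above

runtimes_min = {
    "GSM8K": 46.52,
    "HellaSwag": 23.67,
    "TruthfulQA-MC2": 0.97,
}


def cost_usd(minutes, rate_per_hour=RATE_USD_PER_HOUR):
    """Convert wall-clock minutes into dollars at a flat hourly rate."""
    return minutes / 60.0 * rate_per_hour


costs = {task: cost_usd(minutes) for task, minutes in runtimes_min.items()}
total_minutes = sum(runtimes_min.values())
total_cost = sum(costs.values())

for task, minutes in runtimes_min.items():
    print(f"{task}\t{minutes:.2f} min\t${costs[task]:.4f}")
print(f"Total\t{total_minutes:.2f} min\t${total_cost:.4f}")
```

Per-task figures can differ from the table in the last digit depending on whether you round each row before summing.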

Runtime Characteristics

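The experiment measured runtime and generation behaviour, not benchmark capability. The table shows the sample_len the harness logged for each task, which is the number of examples evaluated. The counts correspond to 25% of each evaluation split, so limit=0.25 applied to all three tasks, not only GSM8K.

Task	Logged Metric	Samples Evaluated
GSM8K	sample_len	330
HellaSwag	sample_len	2511
TruthfulQA-MC2	sample_len	205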
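As a sanity check on the sample_len values, the sketch below applies a 25% limit to each task's evaluation split. The split sizes are my assumption (the standard GSM8K test, HellaSwag validation, and TruthfulQA splits), and rounding up is assumed rather than confirmed from the harness source.

```python
import math

# Assumed evaluation split sizes; not stated in the post.
ASSUMED_SPLIT_SIZES = {
    "GSM8K": 1319,
    "HellaSwag": 10042,
    "TruthfulQA-MC2": 817,
}

# sample_len values as logged in the table above.
logged_sample_len = {
    "GSM8K": 330,
    "HellaSwag": 2511,
    "TruthfulQA-MC2": 205,
}


def samples_under_limit(split_size, limit):
    """Number of examples kept when limit is a fraction of the split (rounded up)."""
    return math.ceil(split_size * limit)


expected = {
    task: samples_under_limit(size, 0.25)
    for task, size in ASSUMED_SPLIT_SIZES.items()
}
for task in expected:
    print(task, expected[task], logged_sample_len[task])
```

If the assumed split sizes are right, the logged values are sample counts under a 25% limit on every task, which is worth keeping in mind when comparing scores against full-test-set numbers.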

Limitations

A few things worth being honest about before drawing conclusions from these numbers.

Contamination. Qwen's training data composition is not fully disclosed. Any of these benchmarks could have been in the pretraining mix, which would inflate scores. There is no way to verify this from the outside.

Exact match undercounts GSM8K. A model that produces the right reasoning but formats the final answer differently, writing "42 dollars" instead of "42", gets marked wrong. The real accuracy is likely slightly higher than the number reported.
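To illustrate the undercounting, here is a small sketch comparing a strict string match with a more lenient scorer that pulls out the last number in the answer. Both functions are hypothetical illustrations, not the harness's own scoring code.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def strict_match(prediction, target):
    """Exact string comparison after trimming whitespace."""
    return prediction.strip() == target.strip()


def lenient_match(prediction, target):
    """Compare the last number found in the prediction with the target value."""
    numbers = NUMBER.findall(prediction)
    if not numbers:
        return False
    last = numbers[-1].replace(",", "")
    return float(last) == float(target.replace(",", ""))


prediction = "42 dollars"
print(strict_match(prediction, "42"))   # formatting difference counts as wrong
print(lenient_match(prediction, "42"))  # numeric value is compared instead
```

A lenient scorer has its own failure mode: it can credit an answer where the last number happens to match by accident, so it is best reported alongside the strict score rather than in place of it.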

Prompt sensitivity. Benchmark scores can shift meaningfully with different few-shot examples or prompt formatting. The numbers here are specific to the default harness prompt templates.

What I Would Do Differently

Running a single model against three benchmarks gives you a snapshot, not a story. The more interesting experiment is running the same benchmarks against multiple checkpoints (say, the base model, a LoRA fine-tune, and a DPO fine-tune) and measuring the delta. That is what weeks 13+ will set up.

The results and notebook are committed under lm_eval_harness in my GitHub repo:
https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026
