
Will Angel's Blog

Jankmarking: Janky Benchmarking
2026-05-08 · via Will Angel's Blog

Tags: AI, LLMs, Benchmarking, Jankmarking, Evals, local models

Published 2026-05-08

[Figure: Local LLM performance frontier, Ollama with M5 Pro. Caption: my jankmark frontier using ollama]

Are janky benchmarks useful?

I put together some evals for local benchmarking. I'm mostly interested in approximately measuring performance versus speed, because I don't like waiting while my MacBook imitates a jet turbine in noise and temperature, and I want to get good-enough answers locally. For anything bigger, the cloud is still the move. The problem is that my benchmarks are --- to use a human em-dash and quote the kids these days --- "hella jank".

My janky benchmark questions:

  • hello: Say hello in one sentence.
  • haiku: Write a haiku about autumn leaves.
  • reasoning: If a train travels 60 mph for 2.5 hours, how far does it go? Show your work.
  • summarize: Summarize the following in two sentences: The Apollo program was a NASA spaceflight program that successfully landed humans on the Moon from 1969 to 1972, using the Saturn V rocket and Command/Service Module.
  • code_simple: Write a Python function that returns the nth Fibonacci number.
  • code_complex: Implement a thread-safe LRU cache in Python with O(1) get and put. Include type hints and a brief docstring.
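For illustration, here is a minimal sketch of how prompts like these could be timed against a local model. It is not the harness used for this post. It assumes an Ollama server on its default port and uses the `/api/generate` endpoint with `stream` set to false; the names `PROMPTS`, `ollama_generate`, and `run_jankmark` are made up for the example. It only records wall-clock time and response length, since how the answers get scored isn't described here.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

# The six prompts from the list above, copied verbatim.
PROMPTS: Dict[str, str] = {
    "hello": "Say hello in one sentence.",
    "haiku": "Write a haiku about autumn leaves.",
    "reasoning": "If a train travels 60 mph for 2.5 hours, how far does it go? Show your work.",
    "summarize": (
        "Summarize the following in two sentences: The Apollo program was a NASA "
        "spaceflight program that successfully landed humans on the Moon from 1969 "
        "to 1972, using the Saturn V rocket and Command/Service Module."
    ),
    "code_simple": "Write a Python function that returns the nth Fibonacci number.",
    "code_complex": (
        "Implement a thread-safe LRU cache in Python with O(1) get and put. "
        "Include type hints and a brief docstring."
    ),
}


def ollama_generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one non-streaming request to a local Ollama server and return the text.

    Assumption: Ollama is running locally and the model has already been pulled.
    """
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def run_jankmark(
    model: str,
    generate: Callable[[str, str], str] = ollama_generate,
    prompts: Dict[str, str] = PROMPTS,
) -> Dict[str, dict]:
    """Run every prompt once and record wall-clock seconds and response length."""
    results = {}
    for name, prompt in prompts.items():
        start = time.perf_counter()
        text = generate(model, prompt)
        elapsed = time.perf_counter() - start
        results[name] = {"seconds": elapsed, "chars": len(text), "text": text}
    return results
```

Passing `generate` as a parameter means the loop can be dry-run with a stub function before pointing it at a real model.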

These aren't very good. They're half AI-generated. The summarization one is actually backwards: its input is a single sentence, so it asks for a summary longer than the text it's summarizing. (The input used to be longer, but I think it got lost at some point when I copied and pasted it. It never made it into version control.)

And yet, these are directionally correct. The better models cluster together scorewise.
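One way to turn a pile of (speed, score) results into a "frontier" is a Pareto filter: keep a model only if no other model is both at least as fast and at least as good. The sketch below is an assumption about how such a frontier could be computed, not a description of how the chart for this post was made. `Result` and `pareto_frontier` are hypothetical names, and the numbers in the usage example are placeholders, not measurements.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    seconds: float  # total time, lower is better
    score: float    # eval score, higher is better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return the results not dominated by a faster-or-equal, better-or-equal one."""
    frontier: List[Result] = []
    best_score = float("-inf")
    # Walk from fastest to slowest; on ties in time, visit the higher score first.
    for r in sorted(results, key=lambda r: (r.seconds, -r.score)):
        # A slower model earns a spot only by beating every faster model's score.
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


# Placeholder numbers purely for illustration.
example = [
    Result("model-a", 1.0, 0.5),
    Result("model-b", 2.0, 0.4),
    Result("model-c", 3.0, 0.9),
    Result("model-d", 3.0, 0.7),
    Result("model-e", 5.0, 0.9),
]
frontier_models = [r.model for r in pareto_frontier(example)]
```

With these placeholder values, only `model-a` and `model-c` survive: every other entry is matched or beaten on score by something at least as fast.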

For serious work, make sure your benchmarks aren't jankmarks.