GitHub - bassimeledath/kitchen-rush: Kitchen Rush: a benchmark for accurate AND fast native tool calling
bombastic311 · 2026-06-16 · via Hacker News: Show HN

Kitchen Rush

An agent tool-calling benchmark where latency matters as much as intelligence.

License: Apache-2.0 · Python 3.11+ · Ruleset: gen 1.0, frozen · Core dependencies: zero

[Clip: claude-sonnet-4.6 and gpt-5.4-mini (low reasoning) racing the same kitchen at a 1-second latency budget]

Why this exists

Most tool-calling benchmarks (BFCL, τ-bench, ToolSandbox, AppWorld) check whether a model makes the right calls — and the world politely waits while it thinks. That's fine for offline tasks. But if you're building a voice assistant, a live-ops agent, or anything realtime, you care about two things at once: does the model do the right thing, and does it do it fast enough? A model that finds the perfect answer after thirty seconds of reasoning is, for you, the wrong model.

Kitchen Rush measures both at once, by construction: the time a model spends thinking is converted into game time that passes before its actions land. While the model deliberates, food keeps cooking, food burns, and order deadlines slip away. Speed and accuracy aren't two charts you squint at — they're one score, experienced the way a deployment would experience them.

How it works

The model plays a chef in an Overcooked-style kitchen. Orders stream in (burgers, soups, ramen…), and the model fulfils them with ordinary native function calls (collect, chop, cook, plate, serve), racing deadlines and burn timers and chasing a combo bonus for consecutive successful dishes. Three deliberate changes from Overcooked:

  1. Latency is the game. Every model response first charges its thinking time to the shared world clock, then its actions execute. (You can chain several calls in one response and pay the latency once — decisiveness is rewarded.)
  2. No joystick skills. The chef walks itself to the right station automatically; travel time is charged inside the action. What's being tested is choosing the right action sequence under time pressure, not video-game reflexes.
  3. Fully deterministic. Same seed, same actions, same latencies → exactly the same episode, every time, on any machine. Every run can be replayed in a browser viewer and audited.
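
The first and third rules can be illustrated with a toy clock. This is only a sketch, not the benchmark's engine: `ToyKitchen`, `play`, and the action durations are hypothetical names and numbers made up for illustration. It shows thinking time being charged before actions land, and why chaining calls in one response is cheaper than splitting them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKitchen:
    """Hypothetical stand-in for the game world: a clock and an action log."""
    clock: float = 0.0
    log: list = field(default_factory=list)

    def step(self, latency: float, actions: list) -> None:
        # 1) The model's thinking time is charged to the world clock first.
        self.clock += latency
        # 2) Only then do its actions execute, each costing its own duration.
        for name, duration in actions:
            self.clock += duration
            self.log.append((name, self.clock))


def play(responses: list) -> ToyKitchen:
    """Replay a fixed list of (latency, actions) responses from time zero."""
    kitchen = ToyKitchen()
    for latency, actions in responses:
        kitchen.step(latency, actions)
    return kitchen


# Durations are made-up numbers for illustration only.
CHOP, COOK = ("chop", 2.0), ("cook", 3.0)

chained = play([(1.0, [CHOP, COOK])])            # one response, latency paid once
separate = play([(1.0, [CHOP]), (1.0, [COOK])])  # two responses, latency paid twice

print(chained.clock)   # 6.0
print(separate.clock)  # 7.0
```

Because `play` depends only on its inputs, replaying the same responses yields the same log, which is the property the replay viewer relies on.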

Every episode produces a single 0–100 score we call KR (the Kitchen Rush score). It's graded on a curve between two fixed anchors: KR 0 means "no better than doing nothing and letting every order expire," and KR 100 means "matched a scripted reference chef that plays the same kitchen with zero latency."

A worked example makes it concrete. Say that on one kitchen the do-nothing chef finishes at −60 points (every order expired), the zero-latency reference chef finishes at +140, and your model finishes at +40. There are 200 points between the two anchors and your model covered 100 of them, so its KR is 50 — it closed half the gap to the reference. Average that over many seeded kitchens and you have the leaderboard number (docs/METHODOLOGY.md has the full formula).
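
The arithmetic of that example, as a minimal sketch. The function name `kr_score` is hypothetical, and the full formula in docs/METHODOLOGY.md may add details (such as clamping) that are not modelled here.

```python
from statistics import mean


def kr_score(model_points: float, noop_points: float, reference_points: float) -> float:
    """Place a raw score on the 0-100 scale between the two anchors."""
    gap = reference_points - noop_points
    if gap <= 0:
        raise ValueError("reference chef must outscore the do-nothing chef")
    return 100.0 * (model_points - noop_points) / gap


# The worked example: do-nothing at -60, reference at +140, model at +40.
print(kr_score(40, -60, 140))   # 50.0

# The anchors themselves map to 0 and 100.
print(kr_score(-60, -60, 140))  # 0.0
print(kr_score(140, -60, 140))  # 100.0

# The leaderboard number is the mean of per-kitchen KR values.
print(mean([kr_score(40, -60, 140), kr_score(140, -60, 140)]))  # 75.0
```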

The latency budget (B)

Here's the knob that makes Kitchen Rush flexible: every kitchen is generated at a latency budget B (--latency-budget, in seconds per decision). Think of B as the pace the kitchen is priced for: order deadlines are set so that a chef spending exactly B seconds on each decision can finish every order, with roughly 1.4–1.6× headroom to spare. Each B gets its own leaderboard — results at different budgets are never averaged together.

For the mathematically inclined, the pricing is exact:

deadline = arrival + ⌈σ · C(B)⌉,   where C(B) = A + K·B

A is the order's intrinsic cooking/walking time, K is how many decisions a competent plan needs, and σ is the headroom (1.4–1.6 by tier). So a model that actually decides in ℓ seconds gains or loses K·(B − ℓ) seconds of breathing room per order. Faster than B? You bank slack and serve while orders are still worth full value. Slower? You eat through the headroom, and orders start becoming unfinishable at around ℓ ≈ B + (σ−1)·C(B)/K — about 3–4 s/decision at B=1 on the current tiers, which is exactly where our calibration sweep shows the reference chef collapsing (docs/METHODOLOGY.md §2, docs/CALIBRATION.md).
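
To make the pricing tangible, here is a small sketch of the three quantities above. The function names and the sample order (A=20, K=5, σ=1.5) are assumptions chosen for illustration, not values taken from the benchmark's tiers.

```python
import math


def priced_cost(A: float, K: int, B: float) -> float:
    """C(B) = A + K*B: intrinsic time plus K decisions at the budgeted pace."""
    return A + K * B


def deadline(arrival: float, A: float, K: int, B: float, sigma: float) -> float:
    """deadline = arrival + ceil(sigma * C(B))."""
    return arrival + math.ceil(sigma * priced_cost(A, K, B))


def slack_change(K: int, B: float, latency: float) -> float:
    """Seconds gained (positive) or lost (negative) per order: K*(B - latency)."""
    return K * (B - latency)


def collapse_latency(A: float, K: int, B: float, sigma: float) -> float:
    """Latency at which the headroom runs out: B + (sigma - 1)*C(B)/K."""
    return B + (sigma - 1.0) * priced_cost(A, K, B) / K


# Hypothetical order: 20 s of cooking/walking, 5 decisions, 1.5x headroom, B=1.
A, K, B, SIGMA = 20.0, 5, 1.0, 1.5

print(priced_cost(A, K, B))              # 25.0
print(deadline(0, A, K, B, SIGMA))       # 38
print(slack_change(K, B, 0.5))           # 2.5  (faster than B: slack banked)
print(slack_change(K, B, 2.0))           # -5.0 (slower than B: headroom eaten)
print(collapse_latency(A, K, B, SIGMA))  # 3.5
```

At the collapse latency the time the order actually needs, A + K·ℓ = 20 + 5·3.5 = 37.5 s, equals σ·C(B) = 37.5 s, so any slower decision pace leaves the order unfinishable before rounding.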

And in plain deployment terms: the model that wins at B=1s is the best pick when every decision has to land in about a second — on the benchmark's reproducible clock that's a budget of roughly 65 output tokens per decision, i.e. terse, single-shot tool dispatch — what a voice agent needs. B=5s buys about 730 tokens per decision — enough for a short burst of reasoning, what an interactive assistant can afford. The same model can rank very differently on the two boards, and that reordering is precisely what the benchmark is for.

Leaderboard

17 model configurations × 12 seeds × {medium, hard} kitchens × two latency budgets — 816 episodes so far. Each chart is one latency budget; bars are mean KR, whiskers are 95% confidence intervals. The full per-tier table (with costs, reasoning tokens, and serve rates) is at leaderboard/results/board.md.
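
The episode count and the shape of each bar can be checked with a short sketch. The grid comes straight from the sentence above. The interval helper uses a normal approximation, which is an assumption: the method behind the published whiskers is not stated here, and `mean_with_ci` is a hypothetical name.

```python
from itertools import product
from statistics import mean, stdev

N_CONFIGS = 17
SEEDS = range(12)
TIERS = ("medium", "hard")
BUDGETS = (1, 5)

# One episode per (configuration, seed, tier, budget) combination.
episodes = list(product(range(N_CONFIGS), SEEDS, TIERS, BUDGETS))
print(len(episodes))  # 816

# Each bar on one board averages a configuration over seeds x tiers.
kitchens_per_bar = len(SEEDS) * len(TIERS)
print(kitchens_per_bar)  # 24


def mean_with_ci(values: list, z: float = 1.96) -> tuple:
    """Mean and an approximate 95% interval (normal approximation, assumed)."""
    m = mean(values)
    half_width = z * stdev(values) / len(values) ** 0.5
    return m, m - half_width, m + half_width
```

Calling `mean_with_ci` on the 24 per-kitchen KR values of one configuration gives a bar height and whisker ends in the same spirit as the charts.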

[Charts: Leaderboard at latency budget B=1s (left); Leaderboard at latency budget B=5s (right)]

The left board (B=1s) is the realtime test: the kitchen is priced for one second per decision, which on the benchmark's clock buys about 65 output tokens — terse, single-shot tool dispatch. Winning here means "the model I'd trust to drive a voice agent or a live dashboard." The right board (B=5s) prices the same kitchens for five seconds per decision (~730 tokens — room for a short burst of reasoning), what an interactive assistant can afford.

Read them side by side — that contrast is the product. Under tight realtime pressure (B=1s) the fast no-reasoning models hold the podium: gemini-3.1-flash-lite runs nearly even with claude-sonnet-4.6 (32 vs 37). Give every decision five seconds instead and the board reorders: gpt-5.4-mini with low reasoning rockets from near-zero to a dead heat with sonnet (44 vs 44) at about a fifth of the cost, while flash-lite drops to half its B=1 standing. The same mini with reasoning fully off scores 0.0 at both budgets — reasoning it can't afford at B=1 is exactly what makes it a frontier-level tool caller at B=5. That's the latency tax, made visible. (·think rows ran with reasoning on at low effort; everything else with reasoning off — fast single-shot dispatch is the honest realtime default. One row you might expect is missing: there is no claude-sonnet-4.6·think, because Anthropic's API does not allow extended thinking when tool calls are forced, and the harness forces tool calls — sonnet competes thinking-off only.)

[Clip: the same duel at a 5-second latency budget: gpt-5.4-mini's reasoning becomes affordable and it finishes first]

The flip, watched live: the same two models from the clip at the top, but in a kitchen priced at B=5s. Now the mini's reasoning burst is affordable — it finishes every order at 99 raw points (KR 86) while sonnet is still cooking at 40. This is the mini's best kitchen — the chart above shows the average, a 44–44 tie across all 24 — but the direction is real: it wins the medium tier at B=5 outright (59 vs 52). Same models, different latency budget, different winner: that's exactly what the two boards measure.

Try it

Two minutes — run the scripted reference chef locally (no model calls):

pip install -e .                          # the core has zero dependencies
kitchenrush bench --baseline random --tier easy --seeds 12 --trials 2
kitchenrush calibrate --tier easy --latency-budget 1   # see how the reference chef degrades with latency

# watch a game in the browser (scripted chef):
kitchenrush replay --oracle --tier easy --seed 0       # writes ui/replays/easy_seed0.json
cd ui && python3 -m http.server 8000                   # then open http://localhost:8000
# ...or race up to 4 models side-by-side on one clock: ?replays=a.json,b.json (see ui/README.md)

To benchmark a real model, add provider support and your API key:

pip install -e '.[providers]'
kitchenrush bench --model anthropic:claude-sonnet-4-6 --tier medium --latency-budget 1

Any LiteLLM-routable model works via provider:model. You can also plug in a fully custom client — it only needs a name and a generate(system, messages, tools) -> ModelResponse method, registered with register_adapter. CLI commands: run, bench, replay, seeds, calibrate.

Learn more

Citation

If you use Kitchen Rush in your work, please cite it (machine-readable copy in CITATION.cff):

@software{kitchenrush2026,
  author = {Eledath, Bassim},
  title  = {Kitchen Rush: A Benchmark for Accurate and Fast Tool Calling},
  url    = {https://github.com/bassimeledath/kitchen-rush},
  year   = {2026}
}

License

Apache-2.0. See LICENSE.