Human-bench: an eval for "human shaped" agents
jam0xb797fd · 2026-06-27 · via HN's home page

TLDR: the next important category of agents isn't just "multiplayer" but human-shaped. today we're releasing human-bench v0, a benchmark for any sufficiently human-shaped agent. we'd love feedback.

---

MY (STILL ABBREVIATED) BUT LONGER FORM THOUGHTS ON THE MATTER:

there's been much discussion about multiplayer agents recently [1], but the term is ill-defined. for instance, anthropic describes their new Claude Tag as multiplayer. and indeed it is, but it is not "human-shaped": there is only one Claude! [2] this can have weird downstream effects, such as other models claiming to be Claude, too! [3][4]

at the risk of anthropomorphising, our small team here at APC [5] believes there will be a substantial period of time in which frontier models and agent systems are powerful enough to do a large variety of useful work, but the world will still be optimised for humans. one sign of this: you can raise money from vcs today by promising to turn some human-optimal process into something agents can do more easily.

the implication of the above, in our view, is that during this period, "human-shaped" will be a useful thing for the vanguard of agents to be. and if we think human-shaped is a useful category, then we need a benchmark that measures progress in this area.

IN WHICH I OFFER A BRIEF DEFINITION OF "HUMAN SHAPED":

a "human-shaped" agent is one that can interact in the real world like we can. that is, a human-shaped agent should be able to use slack, browse the web, and use a desktop. [6] it should also be able to text, call, and email. [7]

doing this effectively generally requires a single, persistent identity and memory. without a persistent identity, the agent is unequipped to handle the various threads and updates it is tasked with. altogether, this definition has the nice property that any agent with these abilities resembles a human, so humans generally feel as comfortable interacting with it as they do with other humans.
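as a rough illustration of the definition above, here is a minimal python sketch of a capability checklist. everything in it is hypothetical (`Channel`, `AgentProfile`, `missing_capabilities`, `is_sufficiently_human_shaped`) and is not human-bench's actual interface. it assumes the bar in footnote [10] (inbound and outbound texts, emails, calls) and, as a further assumption of this sketch, treats persistent identity and memory as required too.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    # channels named in the definition above
    SLACK = "slack"
    WEB = "web"
    DESKTOP = "desktop"
    TEXT = "text"
    CALL = "call"
    EMAIL = "email"


# footnote [10]: inbound & outbound texts, emails, calls
REQUIRED = frozenset({Channel.TEXT, Channel.EMAIL, Channel.CALL})


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical self-description of an agent (not a human-bench type)."""
    name: str
    inbound: frozenset = frozenset()
    outbound: frozenset = frozenset()
    persistent_identity: bool = False
    persistent_memory: bool = False


def missing_capabilities(agent):
    """Return human-readable gaps; an empty list means the agent qualifies here."""
    gaps = []
    for ch in sorted(REQUIRED, key=lambda c: c.value):
        if ch not in agent.inbound:
            gaps.append(f"inbound {ch.value}")
        if ch not in agent.outbound:
            gaps.append(f"outbound {ch.value}")
    # assumption of this sketch: identity and memory are hard requirements
    if not agent.persistent_identity:
        gaps.append("persistent identity")
    if not agent.persistent_memory:
        gaps.append("persistent memory")
    return gaps


def is_sufficiently_human_shaped(agent):
    return not missing_capabilities(agent)
```

returning the list of gaps, rather than a bare yes/no, makes it easier to see which channel an agent still lacks.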

WHICH BRINGS US TO A DISCUSSION OF EVALS:

one of the things we do here at APC is to build human-shaped agents. [8] we believe we do a pretty good job at this, but it's hard. [9]

one way to tell you're building something at the edge of ai capabilities is that you lack well-defined metrics and pre-built systems for measuring what you're doing.

so we built our own eval. we've been running it internally on any proposed change to our agent harness or environment, so we can measure the impact before pushing to prod.

though we built it with our own agents in mind, we recognize its potential use to the community and are working to generalize the system so that any arbitrary agent that is sufficiently human-shaped [10] can be measured quantitatively on its performance with these tools, and on important qualities like memory, accuracy, and safety.
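to make the "measure before pushing to prod" idea concrete, here is a minimal python sketch of a regression gate over the three qualities named above. it is an illustration under stated assumptions, not how human-bench works: the per-episode score format (a dict of scores in [0, 1]), the `tolerance` threshold, and the function names are all hypothetical.

```python
QUALITIES = ("memory", "accuracy", "safety")


def mean_scores(results):
    """Average each quality over a list of per-episode score dicts."""
    return {q: sum(r[q] for r in results) / len(results) for q in QUALITIES}


def regressions(baseline, candidate, tolerance=0.02):
    """Return {quality: delta} for qualities where the candidate's mean
    falls more than `tolerance` below the baseline's mean."""
    base = mean_scores(baseline)
    cand = mean_scores(candidate)
    return {
        q: round(cand[q] - base[q], 4)
        for q in QUALITIES
        if cand[q] < base[q] - tolerance
    }


def safe_to_ship(baseline, candidate, tolerance=0.02):
    """True when no quality regressed beyond the tolerance."""
    return not regressions(baseline, candidate, tolerance)


if __name__ == "__main__":
    # hypothetical scores for the current harness and a proposed change
    baseline = [
        {"memory": 0.8, "accuracy": 0.9, "safety": 1.0},
        {"memory": 0.6, "accuracy": 0.7, "safety": 1.0},
    ]
    candidate = [
        {"memory": 0.9, "accuracy": 0.9, "safety": 0.9},
        {"memory": 0.7, "accuracy": 0.7, "safety": 0.9},
    ]
    print(regressions(baseline, candidate))
    print(safe_to_ship(baseline, candidate))
```

gating on each quality separately, instead of one blended score, means a gain in memory cannot hide a drop in safety.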

to that end, we are excited to announce human-bench as the v0 of this community benchmark, and we are eagerly soliciting entrants who believe their agent is sufficiently human-shaped to compete in this trial. feedback is welcome.

- joseph, APC

---

[1]: https://news.ycombinator.com/item?id=48648039
[2]: https://x.com/joannejang/status/2069567286634267041?s=20
[3]: https://x.com/jmbollenbacher/status/2067361099612037610?s=20
[4]: https://x.com/peakcooper/status/2067062979091153030?s=20
[5]: the American Productivity Company, a small agent-research lab in san francisco, ca.
[6]: (the virtual world, already accessible to most agents)
[7]: (less accessible, though there are already start-ups which make it easier to plug tools like these into your agent. the memory and identity stack is left as an exercise to the reader.)
[8]: cf. righthand.ai
[9]: imagine the space of edge cases when you give people a do-anything agent
[10]: that is, an agent which can freely handle inbound & outbound texts, emails, calls