惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

T
Tenable Blog
H
Heimdal Security Blog
K
Kaspersky official blog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
S
Schneier on Security
G
GRAHAM CLULEY
U
Unit 42
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
C
CERT Recently Published Vulnerability Notes
Google DeepMind News
Google DeepMind News
罗磊的独立博客
Stack Overflow Blog
Stack Overflow Blog
阮一峰的网络日志
阮一峰的网络日志
Simon Willison's Weblog
Simon Willison's Weblog
C
Cisco Blogs
Cyberwarzone
Cyberwarzone
T
The Exploit Database - CXSecurity.com
Project Zero
Project Zero
Security Archives - TechRepublic
Security Archives - TechRepublic
www.infosecurity-magazine.com
www.infosecurity-magazine.com
博客园 - 司徒正美
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
V
Visual Studio Blog
博客园 - Franky
Engineering at Meta
Engineering at Meta
WordPress大学
WordPress大学
Jina AI
Jina AI
P
Proofpoint News Feed
P
Proofpoint News Feed
有赞技术团队
有赞技术团队
L
LINUX DO - 最新话题
宝玉的分享
宝玉的分享
N
News and Events Feed by Topic
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
博客园 - 聂微东
T
The Blog of Author Tim Ferriss
Spread Privacy
Spread Privacy
Application and Cybersecurity Blog
Application and Cybersecurity Blog
IT之家
IT之家
S
Security Affairs
博客园 - 叶小钗
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
小众软件
小众软件
N
News | PayPal Newsroom
Cloudbric
Cloudbric
AWS News Blog
AWS News Blog
W
WeLiveSecurity
The Last Watchdog
The Last Watchdog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
NISL@THU
NISL@THU

HN's home page

Rainbow Query Language | Hacker News Exec into Node via Kubectl An AI native hedge fund The Seven-Action Documentation Model | Hacker News Package Manager for Kubectl Plugins Tongan Castaways | Hacker News Tech overlords plan for conscious AI to conquer the cosmos. What could go wrong? Data Breach Disclosure Lag Is Getting Worse How LLMs Work | Hacker News I Dropped PRDs for Shape Up Go Experiments Explained | Hacker News FCA's Palantir deal could expose UK financial data to Trump's US, critics fear WebXR BCI for Neural-Adaptive Avatar Control in Mixed Reality The first murder conviction via DNA analysis Tom Interviews Theo de Raadt of the OpenBSD Project (2019) [video] Show HN: Replace shell commands with bun shell typescript scripts Quay.io Is Down | Hacker News AI driven analysis of brokerage account fees in the UK Bill Gates Spent Years Crafting His Image. Now It's Cracking Using LLMs to secure source code Wi-Fi 8 in the Lab [video] The household battery revolution that could change energy bills and the world Is Python Becoming Pinyin? | Hacker News Livia – Executive Assistant | Hacker News FindMyPipe – Query Apple Find My from Linux for AI Agents Show HN: Agent skill for creating product launch videos with Remotion RecruitMyself – AI job search copilot for resumes and applications AI coding agents and the erosion of system understanding The 'Resting' Generation and South Korea's Youth Recession AMD Computex 2026: 10 Years of AM4, AM5 Support Through 2029 Docker Networking Explained | Hacker News Textbooks in Tokenland | Hacker News Key Chemistry Question Answered, No Quantum Computer Required Gifts For Retrocomputing Fans – remix yesterday's tech with a modern spin Miscellany № 49: introducing the quasiquote – Shady Characters Amazon Thinks the Future of Data Centers Is a Technical Problem It Just Solved A brief history of the UUID (2017) Flying High Unpressurized (2016) | Hacker News Five Years of Trying to Add Recursion to Lychee How British comfort food won over the French Blorp Language | Hacker News Decache – you might have the internet's lost media in your PC's cache folders Criminal Activities and Migration | Hacker News A free, open-source library of DESIGN.md files for AI-generated UIs MiniMax M3 | Hacker News People are apparently farming citations on ResearchGate – Chuniversiteit Hacker News Basketeer – a typed TS SDK for your Tesco account, with nutrition data 'Penguin' decays from CERN's Large Hadron Collider experiment hint new physics Emergence World: A Laboratory for Evaluating Long-Horizon Agent Autonomy Homebrew lead Mike McQuaid: Sandboxes and Worktrees - My Secure Agentic AI Setup Lean, Not Backpressure | Hacker News AI Dangers Eclipse Nuclear Weapons at Singapore Defense Forum Open source analytics that answers backbase How turkey hacked the hair-transplant industry How GPT Image 2 Is Transforming Marketing Workflows in 2026 Improve Git monorepo performance with a file system monitor Strava for Claude Code MiniMax M3 on Qubrid AI There's Something Else We Should Be Worrying About Celebrity Profile of an A.I. Actress What Is Windows K2? | Hacker News AI is devoid of meaning and humanity. Its vapid voice suits the political moment Show HN: Interpreto – Live Translation for Travel Taxicab Geometry Sealed classes and interfaces in Java (2025) Show HNs | Hacker News My AI Skill Edited This Video That Explains My AI Skill – Arcturus Labs Amazon Pinpoint End of Support The Mystery of the Backward Index MP/M's Process Dispatcher SlimTide Reviews: A Modern Solution for Metabolism and Energy Learning Lustre: Type-safe front end development with gleam Thomas Mann: Goethe Heartened by Panama (As Suez for English, or Danube-Rhine) How to make Message Log of the Unreal Engine 100 times faster Sum-product, unit distances, and number fields Can Meta Buy Belief? | Hacker News Twenty Years of Bigtable | Hacker News Show HN: Combine WigglyPaint GIFs into Video Show HN: AgentThreatBench – Benchmark for AI Agent Memory Security Genius Spotted in the Wild Napkins: Where Ethernet, Compaq and Facebook’s cool data center got their starts (2011) Moderate caffein use alters sleep-related EEG Nvidia Announces RTX Spark | Hacker News Show HN: Ministry of Everything – CLI agent harness for a single operator CEOs blame AI for layoffs, MIT prof says it fits a pattern to find cover story Bugs I didn't expect while building a zsh cleanup script for macOS dev machines Nvidia jumps into PCs with new chip debuting in laptops from Microsoft, Dell, HP Nvidia unveils PC 'superchip' in challenge to Apple and Intel Show HN: Having fun making mini static site apps Synthea API: Create Synthetic Medical Records as a Service Berkshire Hathaway to buy Taylor Morrison for $6.8B in cash The most complex model we understand [video] SanDisk stock is +4,440.53% in the past year Driftwm: What if your window manager worked like a whiteboard? US Immigration enforcement looks into buying ad data AI Is Creating More Work for Australia's Workplace Tribunal Finding New Biblical Cross-References with Codex Glide: A tiling window manager for macOS Ultra-highly efficient enrichment of uranium from seawater via studtite nanodots (2024)
MirrorCode: What's the largest software project AI can complete on its own?
tadamcz · 2026-06-27 · via HN's home page

Hi HN! I'm the creator of MirrorCode, a benchmark of long-horizon SWE tasks.

In a MirrorCode task, AI models are tasked with reimplementing an entire program end-to-end, without access to the original source code. AI-generated solutions must match the original program’s stdout and stderr exactly on end-to-end tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.

IMO, the key tension in eval design is now: how to make tasks that are difficult for AI, but actually fair? My default expectation in 2026 when I see a coding eval with low scores is that the tasks turn out to be impossible.

In MirrorCode, we leverage reimplementation of existing software as the raw material to make tasks hard. But we also manually select each of the 25 target programs, and carefully collect the end-to-end tests, so that the task is actually achievable (if extremely difficult).

It's very easy for MirrorCode-style tasks to degenerate into absurd reverse-engineering: stuff you'd never do as a human engineer.

e.g. one of our target programs is the brotli compression library. But we explicitly test the decompression path only (and call the task "brotlid")

RFC 7932 only defines decompression: given compressed bytes, there's one correct output. In the other direction, there's no unique mapping.

So the output of a compressor is governed by thousands of internal efficiency heuristics (e.g. how far back to look for repeated strings). Replicating those perfectly from a black box is extreme and pointless reverse-engineering.

An aspect of MirrorCode I’m proud of: it’s the first benchmark I know of to implement a serious “security mindset”. The field fatalistically treats AI cheating as an unwinnable game of whack-a-mole. I think this is completely wrong.

My views: (1) in 2026, cheat-proofing must be a first-class consideration in benchmark design, not an afterthought (2) most benchmarks CAN be made secure, by using known primitives like containers

The problem is that benchmark authors aren’t even trying to be secure against human-level attackers, which AIs are.

Another key differentiator of MirrorCode: AIs are clearly told the scope of the task, rather than having to guess what they'll secretly be tested on.

We clarify task scope by showing AIs a list of test inputs (while also keeping some hidden to prevent AIs from cheating with a lookup table).

If this sounds too easy or like giving them the answer, I assure you, you're thinking about it wrong. e.g. perfectly matching gotree's behaviour does not become easy because I show you 1,899 inputs (!) that your program will be tested on.

Meanwhile, if you don't show test cases and just say "reimplement all of gotree" the task is dominated by an impossible guessing-game: what's actually in scope?

The Nexus format was only loosely specified in its original publication (which we give to AIs). Among other omissions, documentation does not mention that Nexus files may contain comments. Nexus comments are between brackets [], so they may appear in the middle of a line of data, unlike code comments. gotree _generally_ handles them without complaint, but errors with comments in certain locations.

I personally think guessing that comments exist, and all the ways comments must be handled, despite comments not even being mentioned in documentation, is functionally impossible. So the benchmark becomes dominated by the guessing-game, rather than actual implementation.

Human engineers gradually learn the scope of inputs a program should support thanks to external feedback from users of the program, they don't just think them all up in a vacuum before publication. Yet that's what some benchmarks test.

We’re releasing MirrorCode as open-source.