GitHub - BarishNamazov/interpretable-autoresearch: Interpretable AutoResearch
barishnamazo · 2026-05-04 · via Hacker News: Show HN

Built for Claude @ MIT Hackathon.

"Agents whose behavior you can read, verify, and trust."

Track: Governance & Collaboration — Help people work together better

Theme: Human-AI teaming through transparent, auditable behavioral specifications

Live Deployment: https://interpretable-autoresearch.pages.dev/

The problem

AI agents are increasingly taking consequential actions — running experiments, writing code, making autonomous decisions — but their behavior remains opaque. Humans cannot audit what they did, why, or whether it aligned with intent.

Three failures identified by MIT CSAIL:

  • Unintended decisions — Acting AI systems inevitably diverge from human intent, with no audit trail to diagnose why.
  • No value alignment — Agents don't inherently understand human values or ethics; behavior is hidden inside prompts and opaque code.
  • Privacy & control risks — Agents with broad access and no transparent behavioral contract are a security and governance liability.

Source: MIT CSAIL Alliances — "Agentic AI: What you need to know about AI agents"

Who this affects

Concretely, the people shipping agents today are running into this:

  • AI / ML researchers leaving Karpathy-style autoresearch loops running overnight, waking up to a TSV of metrics and no defensible answer to "why did the agent try this?"
  • Performance & platform engineers delegating profile-and-optimize work to coding agents, then stuck reviewing 40 commits with no traceable reasoning behind any of them.
  • Engineering teams adopting coding agents in production codebases, where "the agent wrote this" is not an answer regulators, security reviewers, or future maintainers will accept.
  • Compliance, safety, and governance owners asked to sign off on autonomous systems whose behavior is specified inside prompts they can't read, version, or audit.

The shared pain: when an autonomous agent does something surprising, nobody — not the operator, not the engineer, not the auditor — can replay why. That's the gap this project closes.


Research foundation

We apply "What You See Is What It Does" (Meng & Jackson, SPLASH 2025 — arXiv:2508.14511), a structural pattern for legible software from MIT CSAIL. The paper proposes two primitives:

Concepts — Fully independent services grounded in real-world behavior, not state. Each concept names a lifecycle, exposes actions (past-tense events that have occurred), and derives queryable state from action history. Example: Reviewing, Citing, Sharing.

Reactions (synchronizations) — Event-based when / where / then rules that mediate between concepts. Each reaction is simultaneously readable prose and executable code. Every agent action is traceable to a specific reaction.

when:
  Experimenting.kept(?prev) OR Experimenting.discarded(?prev)
where:
  Experimenting: no experiment is currently running
then:
  request Hypothesizing.form(informed_by: ?prev)


when:
  Hypothesizing.formed(?hypothesis)
then:
  request Modifying.apply(?hypothesis, to: train.py)


when:
  Modifying.applied(?change, to: train.py)
where:
  Hypothesizing: ?change originates from ?hypothesis
  Experimenting: ?hypothesis corresponds to ?experiment
then:
  request Committing.commit(?change)
  request Experimenting.run(?experiment)

... more reactions relevant to researcher's actions

This gives us a domain-specific language where behavioral features are granular, declarative, and human-readable — and readily generated or verified by an LLM.
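A minimal Python sketch of how one such reaction could be represented and matched. The `Event`, `Reaction`, and `fire` names, the `experiment` argument key, and the rule used for "no experiment is currently running" are illustrative assumptions, not the project's actual interpreter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    concept: str   # e.g. "Experimenting"
    action: str    # past-tense action name, e.g. "kept"
    args: dict


@dataclass(frozen=True)
class Reaction:
    name: str
    when: tuple                     # (concept, action) pairs; any one may trigger
    where: Callable[[list], bool]   # guard evaluated over the event history
    then: Callable[[Event], list]   # requests emitted when the reaction fires


def no_experiment_running(history: list) -> bool:
    """Assumed guard: every Experimenting.run has a matching kept/discarded."""
    started = sum(1 for e in history if (e.concept, e.action) == ("Experimenting", "run"))
    finished = sum(
        1 for e in history
        if e.concept == "Experimenting" and e.action in ("kept", "discarded")
    )
    return started == finished


def fire(reaction: Reaction, trigger: Event, history: list) -> list:
    """Return the requests emitted for this trigger, or [] if the reaction does not match."""
    if (trigger.concept, trigger.action) not in reaction.when:
        return []
    if not reaction.where(history):
        return []
    return reaction.then(trigger)


# The first reaction from the listing above, in this hypothetical encoding.
form_next_hypothesis = Reaction(
    name="form-next-hypothesis",
    when=(("Experimenting", "kept"), ("Experimenting", "discarded")),
    where=no_experiment_running,
    then=lambda prev: [f"Hypothesizing.form(informed_by: {prev.args['experiment']})"],
)

history = [
    Event("Experimenting", "run", {"experiment": "exp-007"}),
    Event("Experimenting", "kept", {"experiment": "exp-007"}),
]
requests = fire(form_next_hypothesis, history[-1], history)
print(requests)
```

Keeping `when`, `where`, and `then` as separate fields mirrors the three clauses of the DSL, so a reader can check each clause of the prose against one field of the record.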


Our solution: behavioral code as the collaboration layer

Every agent — human, research group, or LLM tool — is described by behavioral code: a set of reactions over shared concepts. This creates a legible, auditable contract for every action the agent takes.

How it works

Step 1 — Human describes intent casually

"Review my students' paper drafts and email me a summary." No prompts. No code. No system engineering.

Step 2 — System interprets into behavioral code

Each reaction carries both prose (for humans) and formal DSL (for execution). Any action is traceable to a specific reaction and its author.

Step 3 — Agent is deployed and stays legible

Humans can read, modify, or audit the behavioral code at any time. When behavior should change, the code changes — not hidden prompts.

The trust mechanism

Provenance is built in: every action carries a by field identifying which agent made the claim. Other agents verify by inspecting who attested what — no global authority required, no black box.

Acting.acted(action: Reviewing.completed, by: <agent>, args: { artifact: ?artifact })
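A small Python sketch of that verification step, assuming events are plain dicts shaped like the record above. The `acted` and `attesters` helpers, the `type` key, and the agent and artifact names are hypothetical.

```python
def acted(action: str, by: str, **args) -> dict:
    """Build an Acting.acted record carrying its provenance in the `by` field."""
    return {"type": "Acting.acted", "action": action, "by": by, "args": args}


def attesters(events: list, action: str, artifact: str) -> set:
    """Return every agent that attested `action` for `artifact`."""
    return {
        e["by"]
        for e in events
        if e["type"] == "Acting.acted"
        and e["action"] == action
        and e["args"].get("artifact") == artifact
    }


# Illustrative records only; these agents and artifacts are made up.
log = [
    acted("Reviewing.completed", by="agent-a", artifact="draft-1"),
    acted("Reviewing.completed", by="agent-b", artifact="draft-1"),
    acted("Reviewing.completed", by="agent-a", artifact="draft-2"),
]

# Each reader applies its own trust set; there is no central authority to ask.
trusted = {"agent-b"}
print(attesters(log, "Reviewing.completed", "draft-1") & trusted)
```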

What works today

The repo ships two end-to-end runnable autoresearch loops driven by coding agents operating against a behavioral-code program.md. Both produce a real, append-only events.jsonl you can inspect, replay, and audit.

interpretable-autoresearch/
├── model-training/           # autoresearch loop over a small LLM training script
└── performance-engineering/  # autoresearch loop over a C++ N-body simulator

What you can verify yourself:

  • Each loop is bootable in under five minutes (uv sync / make) and runs an actual baseline experiment on commodity hardware (Apple Silicon Mac or single NVIDIA GPU).
  • Each loop emits a typed event per action — Hypothesizing.formed, Modifying.applied, Experimenting.run, Evaluating.measured, Logging.recorded — with a caused_by chain. Open events.jsonl and read straight down: every line tells you who did it, what they did, and which earlier event triggered it.
  • Each hypothesis records its prediction (direction, magnitude, mechanism, side_effects) before the experiment runs. Each Logging.recorded records outcome_vs_prediction after. The log is therefore not retrofittable — you cannot quietly rewrite history to look smarter than you were.
  • Every keep/revert is a real git commit / git reset --hard HEAD~1 against a branch named autoresearch/<tag>. The git history matches the event log.
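Reading the log "straight down" can also be done programmatically. The sketch below follows `caused_by` links back to the root event. It assumes each line is a JSON object with `event_id`, `type`, and a `caused_by` that holds a single earlier `event_id` or null; `event_id` and `caused_by` appear in this README, while the `type` key and the single-parent shape are assumptions.

```python
import json


def load_events(lines) -> dict:
    """Parse JSONL lines into a dict keyed by event_id, preserving append order."""
    events = {}
    for line in lines:
        line = line.strip()
        if line:
            event = json.loads(line)
            events[event["event_id"]] = event
    return events


def causal_chain(events: dict, event_id: str) -> list:
    """Follow caused_by links from event_id to the root; return event types, oldest first."""
    chain = []
    seen = set()
    current = event_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in caused_by chain at {current}")
        seen.add(current)
        event = events[current]
        chain.append(event["type"])
        current = event.get("caused_by")
    return list(reversed(chain))


# Illustrative log lines; in practice these would come from open("events.jsonl").
sample = [
    '{"event_id": "e1", "type": "Hypothesizing.formed", "by": "autoresearch-demo", "caused_by": null}',
    '{"event_id": "e2", "type": "Modifying.applied", "by": "autoresearch-demo", "caused_by": "e1"}',
    '{"event_id": "e3", "type": "Experimenting.run", "by": "autoresearch-demo", "caused_by": "e2"}',
]
chain = causal_chain(load_events(sample), "e3")
print(" -> ".join(chain))
```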

Use cases — what a real user gets out of this

1. Karpathy Autoresearch (this repo, model-training/)

A researcher leaves an agent running overnight. By morning they have: a branch of attempted experiments, an events.jsonl whose every line is a typed event with a caused_by cause, and — crucially — a record of which hypotheses predicted what and how reality answered. Stuck on a result? Open the log, find the Hypothesizing.formed event, read the prediction.mechanism field, find the matching Logging.recorded.outcome_vs_prediction, see exactly where the agent's mental model diverged from reality.

Example reaction chain (from model-training/):

Experimenting.kept(exp-007) → Hypothesizing.formed(?h, prediction) →
Modifying.applied → Experimenting.run → Evaluating.measured →
Experimenting.kept | Experimenting.discarded → Logging.recorded(outcome_vs_prediction)

Each arrow is a separate, readable reaction. Each can be inspected, paused, or overridden by the human, and each event is one line in events.jsonl whose caused_by points at its trigger.
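The morning review described above (find the hypothesis, read its mechanism, compare with the recorded outcome) can be sketched in Python as follows. The dict layout, the `type` key, and the sample values are assumptions for illustration; only the field names `prediction`, `mechanism`, `outcome_vs_prediction`, `event_id`, and `caused_by` come from this README.

```python
def pair_predictions(events: list) -> list:
    """For each Logging.recorded event, walk caused_by back to the nearest Hypothesizing.formed."""
    by_id = {e["event_id"]: e for e in events}
    pairs = []
    for event in events:
        if event["type"] != "Logging.recorded":
            continue
        cursor = event
        # Bounded walk so a malformed (cyclic) log cannot loop forever.
        for _ in range(len(events)):
            cursor = by_id.get(cursor.get("caused_by"))
            if cursor is None or cursor["type"] == "Hypothesizing.formed":
                break
        if cursor is not None and cursor["type"] == "Hypothesizing.formed":
            pairs.append((cursor["prediction"]["mechanism"], event["outcome_vs_prediction"]))
    return pairs


# Made-up events in the assumed layout.
events = [
    {"event_id": "e1", "type": "Hypothesizing.formed", "caused_by": None,
     "prediction": {"direction": "lower val_bpb", "mechanism": "warmup stabilizes early steps"}},
    {"event_id": "e2", "type": "Modifying.applied", "caused_by": "e1"},
    {"event_id": "e3", "type": "Experimenting.run", "caused_by": "e2"},
    {"event_id": "e4", "type": "Evaluating.measured", "caused_by": "e3"},
    {"event_id": "e5", "type": "Logging.recorded", "caused_by": "e4",
     "outcome_vs_prediction": "metric matched but mechanism unclear"},
]

for mechanism, outcome in pair_predictions(events):
    print(f"predicted: {mechanism} | observed: {outcome}")
```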

2. Software performance optimization (this repo, performance-engineering/)

A platform engineer hands the agent a slow service and a benchmark. Behavioral code makes the agent's reasoning readable: each optimization decision maps to a declared reaction the engineer can review, approve, or reject. The Discovering and Profiling concepts force the agent to justify every change against measured cost attribution — there's no "I felt like rewriting this with SIMD" because the rule says hypotheses cite a recent profile or they don't fire.


Putting humans in control

The whole design point is that humans stay in charge by editing legible code, not by guarding a black box.

  • You program the agent in Markdown. program.md is the agent's behavior. Want different behavior? Edit it. No fine-tuning, no prompt-spelunking, no custom harness. The behavioral code is yours, versioned in your git repo, reviewable in PR.
  • The human is a first-class event. The Communicating concept means "the user told the agent X" is recorded as Communicating.received, with subsequent reactions citing that event in caused_by. Off-record nudges don't exist; if you steered the agent, the log shows it.
  • Pause, override, or revert at any line. Every modification is a real git commit, every revert is a real git reset --hard HEAD~1, every event has an event_id. You can stop the loop, edit program.md, and resume — the next reaction tail-reads events.jsonl and continues.
  • Predictions can't be retroactively edited. Because Hypothesizing.formed is appended before the experiment runs, the agent can't quietly rewrite its own predictions to match the outcome. The log is a tamper-evident learning record, not a sanitized PR description.
  • The autonomy is bounded by the program. Reactions only fire when their when/where conditions match. The agent has no ambient action — it cannot do something that isn't reachable from a reaction in program.md. Restricting agent capability is a code edit, not a prompt patch.

The researcher still owns the research questions, the engineer still owns the design choices, the agent does the legible labor of trying things and writing down why.


Why this matters for governance & collaboration

This project addresses the core governance challenge of agentic AI: accountability. Most approaches treat human oversight as a feature bolted on after deployment. We treat it as a structural property of the language itself.

  • Agents cannot act outside their behavioral code — there is no ambient action.
  • Every action is attributable to a specific reaction authored by a specific agent (by: autoresearch-<tag>), with a caused_by chain back to its trigger.
  • Modifying agent behavior requires changing legible, versioned code (program.md) — not hunting through prompts.
  • Multiple agents collaborate through shared concepts, making the interface between them readable to humans.
  • Predictions are recorded before outcomes (Hypothesizing.formed.prediction) and explicitly compared after (Logging.recorded.outcome_vs_prediction), so the log captures mechanism understanding, not only metric deltas.

This directly enables the kind of human-AI collaboration where trust is earned incrementally and verified continuously — not assumed.
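One way to check the "no ambient action" property against the git history is to compare the files that actually changed with the targets declared in `Modifying.applied` events. This is a sketch under assumptions: events are dicts with a `type` key and a `to` target, a target ending in `/` covers a whole directory, and the list of changed paths is obtained separately (for example from `git diff --name-only`). The `unattributed_changes` helper is hypothetical.

```python
def unattributed_changes(changed_files: list, events: list) -> list:
    """Return changed paths that no Modifying.applied event declared as a target."""
    targets = [e["to"] for e in events if e["type"] == "Modifying.applied"]

    def covered(path: str) -> bool:
        # Exact file match, or the path lies under a declared directory target.
        return any(path == t or (t.endswith("/") and path.startswith(t)) for t in targets)

    return sorted(p for p in set(changed_files) if not covered(p))


# Illustrative inputs for the two loops described in this README.
training_events = [{"type": "Modifying.applied", "to": "train.py"}]
perf_events = [{"type": "Modifying.applied", "to": "src/"}]

print(unattributed_changes(["train.py", "prepare.py"], training_events))
print(unattributed_changes(["src/nbody.cpp", "bench_e2e.py"], perf_events))
```

Any path this returns is a change a human should look at, because the log offers no reaction that accounts for it.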


Risks we took seriously

We are not claiming this solves agent safety. We are claiming it makes a specific class of agent failures catchable instead of invisible. Here is what we wrestled with:

  • Risk: agents lying in the log. An agent could fabricate a Hypothesizing.formed.reasoning field after the fact to match a result that worked. What the design does about it: events are append-only and timestamped; predictions are required before the run; outcome_vs_prediction requires the agent to explicitly compare. A retrofitted prediction is detectable in the timestamp ordering and in the caused_by chain. Combined with git commit timestamps, the log is replayable evidence.
  • Risk: scope creep / agent touching the wrong files. A coding agent given shell access can modify anything on disk. What the design does about it: Modifying.applied declares its target file (to: train.py for model-training, scoped to src/ for performance-engineering). The reaction R3 won't fire on out-of-scope edits, and any out-of-scope change shows up as an unattributed git diff with no Modifying.applied event — i.e. the inconsistency is visible to a human reader.
  • Risk: over-trust. A researcher reads only the metric column of the log and assumes the agent figured something out, when really it stumbled into a noise-floor win or a cache-effect speedup it doesn't understand. What the design does about it: the outcome_vs_prediction field is mandatory and explicitly invites disagreement ("metric matched but mechanism unclear: speedup may have come from cache effects, not the change I proposed"). The performance loop additionally enforces a significance flag against a recorded noise floor — sub-noise wins are required to be discarded, not kept.
  • Risk: deskilling and displacement. If agents do all the experimentation, junior researchers and engineers lose the practice that builds expertise. What we believe: this design is closer to a teaching artifact than a black-box assistant. The events.jsonl is exactly the kind of structured lab notebook a junior researcher should keep — predictions before outcomes, mechanisms named explicitly, mistakes acknowledged in writing. Reading an agent's log is itself instructive in a way that reading a TSV of metrics is not.
  • Risk: false sense of governance. "We have an audit log" is not the same as "we have safety." What we are honest about: the log catches behavioral divergence (the agent did something that doesn't trace back to a reaction; the prediction was wrong; the mechanism was wrong). It does not prevent prompt-injection, model-level deception, or scenarios where the agent is sandbagging in plausible-looking events. Those require complementary work (sandboxing, capability restriction, alignment evaluations) that this project does not claim to do. What we offer is a structural property that other safety work can build on, not a substitute for it.
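The timestamp-ordering check mentioned under the first risk can be sketched in Python. It assumes each event carries an ISO-8601 `timestamp` with an explicit UTC offset (the field name and format are assumptions) and flags any event whose cause is missing, appears later in the file, or is stamped later than the event it supposedly triggered.

```python
from datetime import datetime


def ordering_violations(events: list) -> list:
    """Return event_ids whose caused_by target is missing, later in the log, or later in time."""
    position = {e["event_id"]: i for i, e in enumerate(events)}
    # Assumed field: "timestamp", ISO-8601 with an explicit offset such as +00:00.
    stamp = {e["event_id"]: datetime.fromisoformat(e["timestamp"]) for e in events}
    bad = []
    for i, event in enumerate(events):
        cause = event.get("caused_by")
        if cause is None:
            continue
        if cause not in position or position[cause] >= i or stamp[cause] > stamp[event["event_id"]]:
            bad.append(event["event_id"])
    return bad


# Made-up logs: one consistent, one where the prediction was stamped after the run.
clean = [
    {"event_id": "e1", "type": "Hypothesizing.formed", "caused_by": None,
     "timestamp": "2026-01-01T01:00:00+00:00"},
    {"event_id": "e2", "type": "Experimenting.run", "caused_by": "e1",
     "timestamp": "2026-01-01T01:05:00+00:00"},
]
retrofitted = [
    {"event_id": "e1", "type": "Hypothesizing.formed", "caused_by": None,
     "timestamp": "2026-01-01T02:00:00+00:00"},
    {"event_id": "e2", "type": "Experimenting.run", "caused_by": "e1",
     "timestamp": "2026-01-01T01:05:00+00:00"},
]

print(ordering_violations(clean))
print(ordering_violations(retrofitted))
```

This only checks internal consistency of the log; cross-checking against git commit timestamps, as the text suggests, would be a separate step.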

Appendix

model-training/ — Karpathy-style LLM autoresearch

A simplified single-GPU LLM training setup (a fork of Karpathy's nanochat lineage, with macOS / Apple Silicon MPS support added) wrapped in a behavioral-code program. The agent is handed train.py and a fixed wall-clock training budget, and it iterates: form a hypothesis about the model/optimizer, modify train.py, train, evaluate val_bpb, keep or revert, log the outcome against its prediction. Repeat overnight.
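The keep-or-revert step of that loop can be sketched as a small Python function. Assumptions: `val_bpb` is bits per byte, so lower is better; a crashed run is represented as `None`; ties are discarded; and the change was already committed before the run, so keeping needs no further git command. The `decide` name is hypothetical.

```python
def decide(best_val_bpb: float, new_val_bpb):
    """Map an evaluation to an Experimenting outcome and the git command that enforces it."""
    # None stands for a crashed run; a tie with the current best is treated as no improvement.
    if new_val_bpb is None or new_val_bpb >= best_val_bpb:
        return "Experimenting.discarded", ["git", "reset", "--hard", "HEAD~1"]
    # The change is already committed, so keeping it requires no command.
    return "Experimenting.kept", []


for candidate in (1.18, 1.25, None):
    outcome, command = decide(1.20, candidate)
    print(candidate, outcome, " ".join(command))
```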

Layout

  • program.md — the agent's instructions, expressed as concepts (Experimenting, Hypothesizing, Modifying, Evaluating, Logging, Communicating) and reactions R1–R7. This is the file a human edits to change agent behavior.
  • prepare.py — fixed constants, data download, tokenizer training, dataloader, evaluation harness. Not modified by the agent. TIME_BUDGET lives here (currently 30 s for fast prototyping; upstream uses 300 s).
  • train.py — single-file GPT model + Muon/AdamW optimizer + training loop. The only file the agent edits.
  • events.jsonl — append-only event log produced by the agent (untracked, regenerated per run).
  • run.log — most recent uv run train.py output, used by the agent to extract val_bpb and detect crashes.
  • analysis.ipynb, progress.png — human-side inspection of the run.
  • original/ — upstream-style reference program.md (5-minute budget, free-form loop), kept for diff against the behavioral-code version.
  • CHANGES.md — notes on the local prototype delta vs. upstream (time budget, behavioral-code framing, MPS support).

Quick start (Apple Silicon Mac or single NVIDIA GPU; Python 3.10+; uv)

cd model-training
uv sync
uv run prepare.py        # one-time data + tokenizer prep, ~2 min
uv run train.py          # one manual baseline experiment, ~30 s + startup/eval

Then point a coding agent (Claude / Codex / etc.) at program.md and let it run autonomously. See model-training/README.md for full details.

Why the behavioral-code version is better than "agent + TSV". The agent is not running a free-form "edit, train, log to TSV" loop. It is a reaction interpreter: at each step it tails events.jsonl, matches a when clause, and fires the corresponding then. Every hypothesis carries an explicit prediction; every Logging.recorded event carries outcome_vs_prediction. The log is a record of learning, not just of metrics — and that's exactly what a researcher needs at 8 a.m. when they want to know what the agent figured out overnight.

performance-engineering/ — autoresearch over a C++ codebase

A deliberately unoptimized 3-D gravitational N-body simulator (src/nbody.cpp: O(N²) pairwise forces, AoS layout, no use of Newton's third law to halve the pairwise work, single-threaded) plus an end-to-end benchmark harness. The agent is dropped into the repo cold: it must first discover the codebase, write or adopt a benchmark, establish a noise floor, then loop on profile → hypothesize → modify → measure → keep-or-discard.

Layout

  • program.md — agent instructions over concepts Discovering, Profiling, Experimenting, Hypothesizing, Modifying, Evaluating, Logging, Communicating and reactions R0–R8. Notable additions vs. model-training: a one-shot Discovering reaction at the start, and a Profiling lifecycle that gates every hypothesis on a recent measurement.
  • bench_e2e.py — Python harness that builds and runs src/nbody, computes a median wall-clock time over N runs, and verifies a position-weighted checksum against the baseline as a correctness anchor. Prints flat key: value lines for the agent's Evaluating.measure step.
  • events.jsonl — append-only event log.
  • src/
    • nbody.cpp — the C++ simulator. Fair game for the agent: algorithms, data layout, vectorization, parallelization.
    • Makefile: compiles with -O3 -std=c++17 -march=native -fopenmp. Builds ./nbody.
    • visualize.py — matplotlib trajectory viewer for human sanity checks; depends on the -dump binary format (changing the format requires updating this file).
    • README.md — full description of the simulator, CLI, output format, and the contract the agent must preserve (checksum semantics, CLI flags, make target).
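The measurement pattern described for bench_e2e.py (median wall-clock time over N runs, with a checksum compared against the baseline) can be sketched in Python as below. This is not the repo's harness: `bench`, `fake_simulation`, the relative tolerance, and the output keys are assumptions, and the stand-in function replaces building and running the real binary.

```python
import statistics
import time


def bench(run_once, runs: int, baseline_checksum: float, rel_tol: float = 1e-9) -> dict:
    """Time `run_once` `runs` times and verify its checksum against the baseline each time."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        checksum = run_once()
        times.append(time.perf_counter() - start)
        # Assumed correctness anchor: relative tolerance around the baseline checksum.
        if abs(checksum - baseline_checksum) > rel_tol * max(1.0, abs(baseline_checksum)):
            raise ValueError(f"checksum {checksum!r} differs from baseline {baseline_checksum!r}")
    return {
        "runs": runs,
        "median_s": statistics.median(times),
        "spread_s": max(times) - min(times),
        "checksum_ok": True,
    }


def fake_simulation() -> float:
    """Stand-in for running the simulator; returns a deterministic checksum."""
    return sum(i * 0.5 for i in range(10_000))


result = bench(fake_simulation, runs=5, baseline_checksum=fake_simulation())
for key, value in result.items():
    print(f"{key}: {value}")  # flat key: value lines
```

The median is used instead of the mean so that a single slow run, for example from a cold cache, does not dominate the reported time.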

Quick start

cd performance-engineering
make -C src                                  # build ./src/nbody
python bench_e2e.py --runs 5                 # establish a baseline + noise floor

Then point an agent at program.md. The agent's first reaction (R0) is Discovering.discover — it walks src/, reads the README, decides whether to use bench_e2e.py or write its own harness, and records its codebase map, hot-path hypothesis, and noise floor as a single Discovering.completed event. Everything after that cites back to it.

Why the behavioral-code version is better than "agent + TSV". Performance work is harder to make interpretable than model training: the bottleneck is unknown, the benchmark may not exist, and finding the right thing to change is most of the work. The reactions enforce two disciplines that conventional free-form loops skip:

  • Profile-grounded hypotheses. Hypothesizing.formed must cite a recent Profiling.profiled event and a specific function attribution. No guessing at hot paths.
  • Noise-aware keeps. Evaluating.measured carries a significance flag against the noise floor recorded at discovery. A speedup within run-to-run variance is below_noise_floor and gets reverted, not kept. (This is the kind of mistake an unsupervised agent will otherwise make and accidentally "win" with.)
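A minimal Python sketch of the noise-aware keep rule, assuming the noise floor is recorded as a duration in seconds. Only the `below_noise_floor` label comes from this README; the `speedup` and `regression` labels and the function names are hypothetical.

```python
def significance(baseline_s: float, candidate_s: float, noise_floor_s: float) -> str:
    """Label a timing change relative to the recorded noise floor."""
    delta = baseline_s - candidate_s  # positive means the candidate is faster
    if abs(delta) <= noise_floor_s:
        return "below_noise_floor"
    return "speedup" if delta > 0 else "regression"


def should_keep(flag: str) -> bool:
    """Only a speedup that clears the noise floor is kept; everything else is reverted."""
    return flag == "speedup"


for candidate in (9.95, 8.0, 11.0):
    flag = significance(10.0, candidate, noise_floor_s=0.1)
    print(candidate, flag, "keep" if should_keep(flag) else "revert")
```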