GitHub - kimjune01/swebench-verified: Reproducible recon/craft/audit agent pipeline for SWE-bench Verified. Official-graded, codex-attested, GPL-3.0. Run it yourself.
kimjune01 · 2026-05-25 · via Hacker News: Show HN

A three-stage agent pipeline for SWE-bench Verified, built to be re-run and inspected by skeptics. The point of this repo is not the score. It is that you can clone it, run the exact procedure on the exact skills, and check the artifacts against your own grading.

One repo per benchmark. This is the Verified run. The SWE-bench Pro run — and its port plan, goal predicate, and validation design (PRO_PORT.md) — lives in its own repo, swebench-pro. Separate repos so each run's artifacts and number stand alone, rather than hiding behind branches.

Where the numbers come from (start at 500)

Every one of the 500 SWE-bench Verified instances flows to a visible terminal below; nothing is silently dropped.

  • Resolved: 426 / 438 eligible (~97%); 426 / 500 full set (~85%).
  • Of the 426, 418 were first-attempt and 8 won only on a re-run. Every re-run was an external-fault correction (box-death, a serialization bug since fixed, co-tenant contention), never a re-roll of a reasoning loss; the original losing runs stay in history.
  • The gap from 500 is 44 sphinx-doc instances (tox-based, can't run airgapped) plus 18 documented defects (KNOWN_BAD.md), not cherry-picking.
  • Counts are re-derivable: the eligible/excluded split from the dataset plus KNOWN_BAD.md, and the resolved count from the committed official_eval/summary.json files (driver/scoreboard.py).

sankey-beta
SWE-bench Verified,sphinx-doc (offline-infeasible),44
SWE-bench Verified,KNOWN_BAD defects,18
SWE-bench Verified,Eligible (attempted),438
Eligible (attempted),Resolved (official),426
Eligible (attempted),Not resolved,12
Resolved (official),first attempt,418
Resolved (official),external-fault re-run,8
Not resolved,Reasoning loss,9
Not resolved,Infra / timeout DNF,3
Reasoning loss,recon-ceiling,2
Reasoning loss,genuinely-hard,2
Reasoning loss,gate-divergence,2
Reasoning loss,craft-overfit,1
Reasoning loss,other (sympy),2
Infra / timeout DNF,heavy-suite hang,3

The 12 not-won, by name (audit them against results/):

  • reasoning (9, not rerun; genuine no-solve): django-11734, django-14351 (recon-ceiling); astropy-13398, django-16263 (genuinely-hard); django-14170, pytest-5787 (gate-divergence); sympy-13091 (craft-overfit); sympy-20438, sympy-17139.
  • infra/timeout (3): django-15957, matplotlib-25311, sympy-19040 (heavy-suite stage-hang; they await a general stage-cap fix and were not re-rolled).

The 4 external-fault losses (django-15563/14404 box-death, django-15987 serialization, sympy-13878 contention) were corrected by re-run and now resolve; their original losing runs remain in history. Only external/infra faults are eligible for rerun; a reasoning loss stays a loss.
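The flow counts in this section can be checked mechanically. The sketch below is not part of the repo; it copies the sankey rows from this README and verifies that every intermediate node emits exactly as much as it receives. The function name `check_flows` is made up for illustration.

```python
import csv
from collections import defaultdict
from io import StringIO

# Rows copied from the sankey diagram in this README: source,target,count.
SANKEY_ROWS = """\
SWE-bench Verified,sphinx-doc (offline-infeasible),44
SWE-bench Verified,KNOWN_BAD defects,18
SWE-bench Verified,Eligible (attempted),438
Eligible (attempted),Resolved (official),426
Eligible (attempted),Not resolved,12
Resolved (official),first attempt,418
Resolved (official),external-fault re-run,8
Not resolved,Reasoning loss,9
Not resolved,Infra / timeout DNF,3
Reasoning loss,recon-ceiling,2
Reasoning loss,genuinely-hard,2
Reasoning loss,gate-divergence,2
Reasoning loss,craft-overfit,1
Reasoning loss,other (sympy),2
Infra / timeout DNF,heavy-suite hang,3
"""


def check_flows(rows_text):
    """Return (inflow, outflow, unbalanced) totals per node."""
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for row in csv.reader(StringIO(rows_text)):
        if not row:
            continue  # tolerate blank lines
        source, target, value = row
        outflow[source] += int(value)
        inflow[target] += int(value)
    # A node that both receives and emits flow must conserve it.
    unbalanced = {
        node: (inflow[node], outflow[node])
        for node in inflow
        if node in outflow and inflow[node] != outflow[node]
    }
    return dict(inflow), dict(outflow), unbalanced


inflow, outflow, unbalanced = check_flows(SANKEY_ROWS)
total = outflow["SWE-bench Verified"]
resolved = inflow["Resolved (official)"]
eligible = inflow["Eligible (attempted)"]
print(total, eligible, resolved, unbalanced)
```

An empty `unbalanced` dict means no instance disappears between one stage of the diagram and the next.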

⚠️ Contamination disclaimer — read this before trusting the number

SWE-bench Verified is training-data contaminated for every modern model, including the ones used here. Claude Sonnet generates and codex (GPT-5.5) filters; both training cutoffs postdate the Verified instances. This is a leaderboard / capability configuration, not a contamination-clean science claim.

  • Contamination is a property of the benchmark, shared by all leaderboard entries — so any score here is a fair entry on the same terms as everyone else, and nothing more. It is not evidence that the method causes the result.
  • Isolating the method as the cause requires a separate clean-room ablation: post-cutoff instances (SWE-rebench), with-vs-without the pipeline on one fixed model. This repo does not make that claim. What it demonstrates is that the pipeline runs end-to-end and produces correct, official-grader-verified patches.
  • The honest unit is a win count with an explicit denominator (KNOWN_BAD.md exclusions committed), not a "solve rate" you should read as model capability.

LIMITATIONS.md is the full, unhedged list. Read it before drawing any conclusion.

What it is

Three skills, run as separate claude --print invocations, chained by a driver:

  • recon (read-only): reproduce the failing test, localize the root cause, emit a structured handoff. Single diagnostician.
  • craft (implementation): draft the patch from the handoff, have a codex subagent challenge it (codex generates nothing; it filters), apply, and loop against the test gate until the FAIL_TO_PASS tests pass.
  • audit (verification): run the full suite, classify each failure against a captured fail-on-base baseline, emit a verdict (RESOLVED / NOT_RESOLVED / PARTIAL) and a re-entry route.

The driver wraps these in an outer loop (max 3): a non-RESOLVED audit routes back to recon (wrong diagnosis) or craft (over-broad fix that regressed something). The system under test is an offline Docker container; the only ground truth is the test gate, never the model's own say-so.
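The routing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in driver/rung4_driver.py: the `Verdict` type, the `reentry` field, the `run_outer_loop` function, and the stub stages are all hypothetical names, and the real stages are separate claude --print invocations rather than Python callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

MAX_OUTER_LOOPS = 3  # the README states "outer loop (max 3)"


@dataclass
class Verdict:
    status: str        # "RESOLVED", "NOT_RESOLVED", or "PARTIAL"
    reentry: str = ""  # hypothetical route label: "recon" or "craft"


def run_outer_loop(
    recon: Callable[[Optional[Verdict]], Any],
    craft: Callable[[Any, Optional[Verdict]], Any],
    audit: Callable[[Any], Verdict],
    max_loops: int = MAX_OUTER_LOOPS,
) -> Tuple[Optional[Verdict], int]:
    """Chain recon -> craft -> audit, re-entering where the audit points."""
    route = "recon"  # the first pass always starts with a diagnosis
    handoff = None
    verdict = None
    for attempt in range(1, max_loops + 1):
        if route == "recon":
            handoff = recon(verdict)       # wrong diagnosis: re-localize
        patch = craft(handoff, verdict)    # always redraft the patch
        verdict = audit(patch)
        if verdict.status == "RESOLVED":
            return verdict, attempt
        route = verdict.reentry or "recon"
    return verdict, max_loops


# Stub stages for demonstration only: the first audit blames the patch
# (route back to craft), the second audit passes.
calls = {"recon": 0, "craft": 0}
scripted = iter([Verdict("NOT_RESOLVED", "craft"), Verdict("RESOLVED")])


def fake_recon(previous):
    calls["recon"] += 1
    return "handoff"


def fake_craft(handoff, previous):
    calls["craft"] += 1
    return "patch"


def fake_audit(patch):
    return next(scripted)


final, attempts = run_outer_loop(fake_recon, fake_craft, fake_audit)
```

In this stubbed run recon executes once and craft twice, because the first audit routed back to craft and the diagnosis was kept.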

The skills in skills/ are frozen independent copies — a pinned snapshot of the exact recon/craft/audit text that produced the Verified run (426/438). They were hardlinked to the canonical authoring copies during the campaign; now that the eligible pool is exhausted the link is broken, so ongoing edits to the canonical copies (e.g. the SWE-bench Pro port) do not mutate this frozen artifact. driver/link_skills.sh is retired for this repo — do not re-run it. A clone gets these real files regardless.

Recordkeeping (how the honesty is enforced)

The commit history is the audit trail of the method, and it is append-only by policy: no force-push, no rebase that drops a losing-run commit, no amending to tidy a failure out of view. A skeptic can replay the whole loop from the log — where the artifact failed, the general fix, the clean re-run — and a quietly-removed loss would show up as rewritten history. Concretely:

  • Every run is committed, wins and losses (results/<id>/<run>/), with the official grader report. Re-runs add new timestamped run dirs; prior losing runs are never deleted or overwritten.
  • SCOREBOARD.md is regenerated, but its source isn't. The per-run official_eval/summary.json files only accumulate; the scoreboard is derived from them.
  • Method changes are their own legible diffs. A reader judges each change for generality (instance-blind) vs a smuggled instance prior.
  • Frozen artifact versions are git-tagged. The deliverable is a single tagged version run from scratch on the whole target; checking out the tag shows exactly the code that produced those runs. The cross-version history is the development record, not the deliverable.

What counts as a win

A win is a passing attestation from the official swebench.harness.run_evaluation grader — nothing else. Not our gate, not the agent's RESOLVED claim, not "it would have passed without the infra hiccup." SCOREBOARD.md is the live count, re-derived by driver/scoreboard.py from the committed official_eval/summary.json files — trust it over any number in prose, which drifts.
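To make "re-derived from committed files" concrete, here is a minimal sketch of counting wins from per-run summaries. It is not driver/scoreboard.py. It assumes the layout results/<id>/<run>/official_eval/summary.json and a boolean `resolved` field in each summary; the field name is a guess for illustration, so check the real files before relying on it.

```python
import json
from pathlib import Path


def count_wins(results_root):
    """Return (won, attempted) sets of instance ids under results_root.

    Assumed layout: <results_root>/<id>/<run>/official_eval/summary.json
    Assumed schema: {"resolved": true|false, ...}  (hypothetical field name)
    """
    won, attempted = set(), set()
    pattern = "*/*/official_eval/summary.json"
    for summary_path in Path(results_root).glob(pattern):
        instance_id = summary_path.parts[-4]  # the <id> directory
        attempted.add(instance_id)
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
        # Losing runs stay on disk; an instance is won if any run resolved.
        if summary.get("resolved") is True:
            won.add(instance_id)
    return won, attempted
```

Because losing runs are never deleted, the count is taken per instance across all of its run directories, not per run.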

Exclusions

KNOWN_BAD.md lists the SWE-bench instances with broken Docker envs, flaky tests, gold patches that fail to grade, or weak coverage (sourced from SWE-bench issues and the UTBoost paper). Every batch is filtered against it before running and the exclusion is committed, so the denominator is honest. LIMITATIONS.md carries the full, unhedged list of everything else.
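The filtering step could look roughly like the sketch below. It assumes KNOWN_BAD.md mentions each excluded instance by its full SWE-bench id (for example `owner__repo-1234`) somewhere in its text; the actual file format is not shown here, and the function names are hypothetical. The sphinx-doc prefix rule reflects the offline-infeasible exclusion described earlier in this README.

```python
import re

# Full SWE-bench instance ids look like "<owner>__<repo>-<number>".
INSTANCE_ID = re.compile(r"\b[\w.-]+__[\w.-]+-\d+\b")


def known_bad_ids(markdown_text):
    """Collect every instance id mentioned anywhere in the markdown text."""
    return set(INSTANCE_ID.findall(markdown_text))


def split_batch(batch, known_bad, offline_infeasible_prefixes=("sphinx-doc__",)):
    """Split a batch of instance ids into (eligible, excluded) lists."""
    eligible, excluded = [], []
    for instance_id in batch:
        if instance_id in known_bad or instance_id.startswith(
            offline_infeasible_prefixes
        ):
            excluded.append(instance_id)
        else:
            eligible.append(instance_id)
    return eligible, excluded
```

Returning the excluded list alongside the eligible one lets the exclusion be committed with the batch, so the denominator stays visible.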

Run it yourself

Prerequisites:

  • An x86-64 Linux Docker host (the SWE-bench eval images are linux/amd64). The provided driver/provision.sh spins up an AWS EC2 box with a self-terminating watchdog; any amd64 Docker host works.
  • claude CLI (Anthropic) authenticated, and codex CLI (OpenAI) authenticated, both on the machine that runs the driver. The models run on the driver host; the container is offline.
  • Python with swebench and datasets (pip install -r requirements.txt).

Steps:

# 1. Build a task JSON for any Verified instance
python driver/make_task.py pallets__flask-5014 tasks/pallets__flask-5014.json

# 2. Provision an offline-capable Docker box (or point at your own)
bash driver/provision.sh          # writes /tmp/v4smoke.env (KEY, PUBIP, ...)

# 3. Run the pipeline
python driver/rung4_driver.py /tmp/v4smoke.env tasks/pallets__flask-5014.json pallets__flask-5014

# 4. Inspect: ledger, patch, hypothesis graph land in /tmp/swebench-abduction/

Grade the captured patch with the official SWE-bench harness (do not trust this repo's gate as the grader; the gate is the agent's stopping signal, the official harness is the verdict):

python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Verified \
  --predictions_path <patches.jsonl> --run_id check
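The file passed to --predictions_path is a JSONL of predictions. A minimal sketch for building one from captured patches follows, assuming the harness's usual prediction keys (instance_id, model_name_or_path, model_patch). The function name and the default model label are placeholders, not something this repo defines.

```python
import json


def write_predictions(patches, out_path, model_name="recon-craft-audit"):
    """Write {instance_id: unified_diff_text} as one JSON object per line.

    model_name is a free-form label chosen here for illustration.
    """
    with open(out_path, "w", encoding="utf-8") as handle:
        for instance_id, patch_text in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch_text,
            }
            handle.write(json.dumps(record) + "\n")
    return out_path
```

Each line is graded independently, so one file can hold a single instance or a whole batch.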

See PROCEDURE.md for the full step-by-step and WORKLOG.md for the build history.

Layout

skills/{recon,craft,audit}/skill.md   the three skills (frozen snapshot; formerly hardlinked)
driver/rung4_driver.py                the orchestrator (recon -> craft -> audit + outer loop)
driver/make_task.py                   builds a task JSON from any Verified instance id
driver/provision.sh                   EC2 provisioning with self-terminating watchdog
driver/link_skills.sh                 (retired) re-established skill hardlinks during the campaign
tasks/                                generated task JSONs
results/<instance>/                   ledger, patch, hypothesis graph, agent logs, codex proof

License

GPL-3.0 (copyleft). See LICENSE. If you build on the method, share back.