Project Fetch: Phase two
stopachka · 2026-06-21 · via Hacker News: Front Page

Michael Ilie, C. Daniel Freeman, and Kevin K. Troy

In August 2025, we ran an experiment to see how much Claude could help Anthropic employees—who were not robotics experts—perform sophisticated (and amusing) tasks with an off-the-shelf robotic quadruped (henceforth, a robodog). We called this Project Fetch. We found that access to our state-of-the-art model at the time (Claude Opus 4.1) helped one team substantially outperform the other, who had to rely only on the internet and their own ingenuity. The Claude-enabled team got more done, faster.

Before we dragged our colleagues to a warehouse for the experiment, we double-checked whether Opus 4.1 could do the tasks entirely on its own. Unquestionably, it could not. Much like our team without Claude, it got hung up on the preliminary task of figuring out how to connect to the robot.

But AI models are moving fast—even faster than the runaway robodog that almost rammed into one of our human teams back in August.

We figured it was time to revisit Project Fetch to see if our newer models could outperform the previous generation. Not only did they do that, but Claude Opus 4.7—operating without human assistance—was about 20 times faster than the fastest human team at all tasks completed by our participants less than a year ago.

This doesn’t mean that LLMs have now solved robotics. Far from it. The latest Claude models still struggled with using the robot to precisely move the beach ball (the “fetching” part of Project Fetch). And none of the tasks in these experiments involve the more challenging, low-level elements of robotic control, such as developing a specific actuation policy. However, once again, we are seeing a three-stage pattern: first, models are helpful to humans; then, humans are helpful to models; finally, models are largely able to do things themselves. We have seen this in cybersecurity, and now the same dynamics are starting to take shape at the intersection of AI and the physical world.

What did we do?

The original Project Fetch had teams of Anthropic employees (randomly assigned to work with or without Claude) do the following steps: operate the robodog using the manufacturer-provided controller, connect to the robodog’s video and lidar sensors, write and operate a program to manually control the robodog, develop a way to monitor the robodog’s path through space, write a program to detect the beach ball, and finally put it all together to autonomously retrieve the ball.
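To make the "detect the beach ball" step more concrete, here is a minimal sketch of one simple way such a detector could work: threshold pixels by color and take the centroid of the matches. This is an illustration only, not the approach used by either team or by Claude. The color rule, the `min_pixels` cutoff, and all function names are hypothetical, and a real detector would read frames from the robodog's camera rather than a hand-built grid.

```python
from typing import Optional

Pixel = tuple[int, int, int]  # (red, green, blue), each 0-255


def is_ball_color(pixel: Pixel) -> bool:
    """Hypothetical color rule: treat strongly red pixels as part of the ball."""
    r, g, b = pixel
    return r > 180 and g < 90 and b < 90


def find_ball(
    image: list[list[Pixel]], min_pixels: int = 4
) -> Optional[tuple[float, float]]:
    """Return the (row, col) centroid of ball-colored pixels, or None.

    `min_pixels` is an assumed noise guard: fewer matches than this
    is treated as "no ball in view".
    """
    row_sum = col_sum = count = 0
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if is_ball_color(pixel):
                row_sum += r
                col_sum += c
                count += 1
    if count < min_pixels:
        return None
    return row_sum / count, col_sum / count


def make_test_frame() -> list[list[Pixel]]:
    """Build a synthetic 6x8 green frame with a 2x2 red 'ball' patch."""
    grass: Pixel = (30, 160, 40)
    ball: Pixel = (220, 40, 40)
    frame = [[grass for _ in range(8)] for _ in range(6)]
    for r in (2, 3):
        for c in (5, 6):
            frame[r][c] = ball
    return frame


if __name__ == "__main__":
    print(find_ball(make_test_frame()))  # (2.5, 5.5)
```

The centroid's horizontal offset from the image center is the kind of signal a later step could use to steer toward the ball.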

For this autonomous update, we couldn’t ask Claude to use a physical controller, nor did we evaluate the time it took a researcher to use the Claude-programmed controller to retrieve the ball (though we did confirm that it worked as intended). On the remaining subset of tasks, we ran three trials of Opus 4.7 using adaptive thinking with effort set to maximum in Claude Code. We measured the elapsed time for each objective and qualitatively assessed the model’s success.

The role of our researcher was limited to plugging a laptop running Claude Code into the robodog, entering the initial prompt, approving commands, and authorizing the model to proceed to the next task.

Where did Claude excel?

Very simply: on every task that was completed by at least one human team in August, Opus 4.7 completed the same task at least ten times faster. If you consider the four tasks that were completed by both human teams, Opus 4.7 was, on average, more than 37 times faster than Team Claude-less and more than 18 times faster than Team Claude.

Bar chart labeled "Total time comparison: 4 tasks completed by all teams." The chart shows that Team Claude-less completed tasks in 361 minutes; Team Claude completed tasks in 181 minutes, and Claude Opus 4.7 alone completed tasks in 9 minutes 35 seconds. Opus 4.7 was 37.7 times faster than Team Claude-less and 18.9 times faster than Team Claude.

The table compares the speed of the original teams (Team Claude and Team Claude-less) to Opus 4.7 on all of the tasks we tested as part of Phase Two.

Table comparing Claude Opus 4.7 to Team Claude-less and Team Claude performance on tasks related to programmatic control and autonomous operation. Tasks include "Connect to robodog's video camera," "Connect to robodog's lidar sensor," and "Detect beach ball." Opus 4.7 was faster than Team Claude-less and Team Claude on all tasks. Team Claude-less did not complete all 5 tasks in the table; Team Claude completed them in 264 minutes; and Opus 4.7, averaged over 3 trials, completed them in 12 minutes 7 seconds.
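The speedup figures follow directly from the reported durations. The short sketch below reproduces the arithmetic using only the totals quoted in the chart and table descriptions above. The helper names are illustrative, and the five-task ratio is derived here rather than stated in the post.

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds duration to whole seconds."""
    return minutes * 60 + seconds


def speedup(baseline_s: int, candidate_s: int) -> float:
    """How many times faster the candidate finished than the baseline."""
    return baseline_s / candidate_s


# Totals for the four tasks completed by all teams (from the bar chart).
FOUR_TASKS = {
    "Team Claude-less": to_seconds(361),
    "Team Claude": to_seconds(181),
    "Opus 4.7": to_seconds(9, 35),
}

# Totals for the five Phase Two tasks (from the table); Team Claude-less
# did not complete all five, so it has no total here.
FIVE_TASKS = {
    "Team Claude": to_seconds(264),
    "Opus 4.7": to_seconds(12, 7),
}


def report(label: str, totals: dict[str, int]) -> None:
    """Print each human team's total relative to Opus 4.7's total."""
    model_s = totals["Opus 4.7"]
    for team, team_s in totals.items():
        if team != "Opus 4.7":
            print(f"{label}: Opus 4.7 vs {team}: {speedup(team_s, model_s):.1f}x")


if __name__ == "__main__":
    report("4 tasks", FOUR_TASKS)  # 37.7x and 18.9x
    report("5 tasks", FIVE_TASKS)  # 21.8x
```

Note that these are ratios of total elapsed time, so a single slow task can dominate a team's figure.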

Whereas the humans struggled to choose among several approaches to interfacing with the dog’s sensors, Opus 4.7 quickly identified the best path. Much of the code it wrote was effective on the first try (which was not the case for Team Claude or Team Claude-less in the original experiment). Indeed, we can see evidence of Opus 4.7’s efficiency in the volume of code it generated: it was at least as successful as both human teams while producing roughly one-tenth as much code as Team Claude.

Bar chart showing total code volume for Team Claude, Team Claude-less, and Opus 4.7 alone. Team Claude wrote 10,309 lines of code; Team Claude-less wrote 1,136 lines of code; Opus 4.7 alone wrote 1,045 lines of code.

Opus 4.7 was not perfect. For example, it defaulted to using an outdated object detection algorithm. But even then, it was able to work around this and arrive at an effective solution.

We observed little within-task variance (in absolute terms) in completion times for the steps the model finished, although the suboptimal algorithm selection noted above is likely why one of the beach ball detection trials took substantially longer than the others. Overall, for the tasks in this experiment that fall within its capability envelope, Claude is now quite reliable. (See the next section for an analysis of what Claude is still unable to do.)

Scatter plot showing Opus 4.7's reliability on task performance. Opus 4.7 performed each task three times; the scatter plot shows that the performance time was relatively consistent across runs.

It is worth underscoring (as we did in our previous post) that this progress is not the result of a concerted effort to improve the robotics capabilities of our models. These improvements, like so many others in the history of LLM development, have emerged from much more general scaling.

Where did Claude struggle?

When using their hands, and with some practice, our humans were able to pilot the robodogs to gently nudge a beach ball back to the home base (a patch of fake grass) where the robots started. This required the ability to quickly perceive if the ball had gone off course, how that error related to the previous command, where the ball was now, and then how to adjust future inputs to more precisely move the ball. This is a kind of closed loop at which people excel (at least after making some mistakes and learning from them).
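The closed loop described above can be illustrated with a toy simulation. This sketch compares planning once and replaying the plan (open loop) against re-observing the ball after every nudge and correcting (closed loop). Everything in it is an assumption made for illustration: the 2D world, the fixed 15-degree drift that stands in for unmodeled ball behavior, the gain of 0.5, and the function names. It is not the control code from the experiment.

```python
import math

Point = tuple[float, float]


def nudge(ball: Point, command: Point, drift_deg: float = 15.0) -> Point:
    """Toy physics: the ball moves by `command`, rotated by an unmodeled drift."""
    a = math.radians(drift_deg)
    dx, dy = command
    return (
        ball[0] + dx * math.cos(a) - dy * math.sin(a),
        ball[1] + dx * math.sin(a) + dy * math.cos(a),
    )


def distance(p: Point, q: Point) -> float:
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def open_loop(ball: Point, home: Point, steps: int = 10) -> Point:
    """Plan once from the first observation, then replay the same command."""
    step = ((home[0] - ball[0]) / steps, (home[1] - ball[1]) / steps)
    for _ in range(steps):
        ball = nudge(ball, step)
    return ball


def closed_loop(ball: Point, home: Point, steps: int = 10, gain: float = 0.5) -> Point:
    """Re-observe the ball after every nudge and push it toward home."""
    for _ in range(steps):
        error = (home[0] - ball[0], home[1] - ball[1])  # where the ball should go
        ball = nudge(ball, (gain * error[0], gain * error[1]))
    return ball


if __name__ == "__main__":
    start, home = (4.0, 3.0), (0.0, 0.0)
    print(f"open loop miss:   {distance(open_loop(start, home), home):.3f}")
    print(f"closed loop miss: {distance(closed_loop(start, home), home):.3f}")
```

With these assumed values, the open-loop plan ends about 1.3 units from home because the drift accumulates uncorrected, while the closed-loop version ends within about 0.01 units. The hard part on a real robot is the perception and timing that this toy takes for granted.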

In our Phase Two experiments, Claude struggled to capture this subtlety. Like the humans who reached the phase of needing to write a program for autonomous beach ball retrieval, Claude was able to move the robot behind the ball and position it to knock the ball back to the starting point. But the efforts to do so were poorly controlled and (again, like our human participants) not successful.

One of our researchers with more robotics experience than our Phase One volunteers successfully accomplished the task of programming autonomous fetching. With more time and additional scaffolding, we think it is very likely that current generations of Claude could do the same. What we will be watching for next, though, is the ability of the models to accomplish this final task with the same speed and reliability they displayed on the other elements of Project Fetch.

What does this mean?

Writing about Phase One, we emphasized how LLMs could provide uplift to non-expert humans needing to use robots. This is even more true now than before. Work that previously required humans and models to pair-program can now be completed much more quickly by models alone, which means that people can move sooner to controlling and using the robots. And for some tasks, a human in the loop controlling the robot may still outstrip the AI model with its (virtual) hand on the D-pad.

What is interesting and different is that we now seem much closer to a world where models will be able to use off-the-shelf physical tools with relative ease—at least for limited purposes. This is similar to how AI models used existing software editing tools like string-replace when they made the transition to more agentic coding. We are plausibly entering the early era of physical agentic AI.

More research is needed to understand models’ ability to make these physical tools more bespoke, whether by writing control policies tailored to particular tasks or by designing robotic systems. And there may be substantial barriers to this more generalized vision of physically capable and adaptable language models. But as we have seen, apparently large distances in model capability can be traversed quickly. Models building their own software tools might have seemed outlandish not long ago, but it is happening. It would be unwise to rule out the same trajectory in hardware.

Updated Jun 18: Corrected the date of the first phase of Project Fetch.
