


























Today I'm announcing an early preview of Rotunda. It gives your agents and programs access to a local browser tailored for programmatic control. Everything it does looks just like you're doing it, except it's an agent behind your keyboard instead of you.
Here's a screen recording of Rotunda logging into my Amazon account step by step and summarizing my most recent order (driven by just telling Codex about the available CLI and using 1Password to auth).
Check it out on its website or GitHub.
Automation was one of the dreams that really got me into computer science. Freshman year of college I wanted to join a small energy seminar that opened for registration at some odd hour of the morning. It had capped enrollment and an unclear waitlist. Since sleep was at such a premium, I wrote a Puppeteer script to log into my SUNet account and click the registration link the second the page went live. By the time I woke up at 11am it had worked and I was in the class. A few days later I got a polite email from the registrar's office asking how on earth I'd managed to register that fast. "I'm fast on my keyboard," I answered.
I assume the statute of limitations has run out on my diploma, so yes: it was automation all the way down. That was a soft introduction to the cat-and-mouse game of automating against websites you don't control, and the first time I realized how much of the internet only exists in browser form. Almost every site has a private backend that its own frontend talks to over a structured protocol.[1] The rest of us only get the rendered HTML version, the one designed to be looked at rather than addressed programmatically.
The web has been made into a beautiful layer to interact with almost anything. If there's something that can be done remotely, it can almost certainly be done online. From your bank to your flights to your maps to your payroll, it's frankly unbelievable how much of a universal language websites have become. Despite how convenient it would be to have a global structured API for all sites, companies are loath to try.
And the regular web has never been harder to automate. Because bad actors have so abused the open nature of the internet, bot detection has become a billion-dollar industry. Its vendors cast a wide net, and for ease of classification they really don't care whether you're running a crawl farm or doing a single task on your local computer. A bot is a bot and a bot is bad.
So how does your Claude Cowork, Codex, or OpenClaw talk to the open web? Usually one of these options:
Use screen control to talk to your existing Chrome/Safari and click buttons visually. This uses some combination of the macOS accessibility APIs (text) and computer use (vision) to figure out what to click on. It works much better than it did a few years ago but it's still error-prone and very expensive. Plus these models can only rely on what the browser shows in its current viewport: if some button is all the way at the bottom of the page, they have to issue scroll-down requests until they get there. And that's only if the model intuits that the button might even be down there. Until they see it they won't know.
Get some code involved. Playwright has become the industry standard way to drive new web control workflows. It provides a high-level API on top of CDP (the Chrome DevTools Protocol), which powers Chrome's programmatic debugging. It also ports between WebKit (Safari's engine), Firefox, and Chromium. Because it plugs into browser internals it has the benefit of being able to inspect everything on the screen as raw text and reason about buttons as logical objects instead of pixels. But both Playwright and Chromium leak their own flags that scream they're being controlled programmatically. The navigator.webdriver flag is the most obvious but there are dozens of others. Plus if you run it headless it'll leak other information about software rendering when drawing to the canvas.
When you get blocked by a captcha or fingerprint that opaquely fails to log you in, you switch to a stealth browser. Most are defunct but some promise relatively active development. Alas, spoofing your browser is an almost impossible game: among the statistical soup that is the billions of daily web sessions, it's easy enough to spot that you're anomalous if you're pretending to be a desktop Mac but running without a GPU on Linux. Some vendors go as far as monitoring your mouse and typing patterns.[2]
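To make the "desktop Mac without a GPU on Linux" anomaly concrete, here is a toy consistency check. It is a sketch only: the field names (`platform`, `webgl_renderer`) and both rules are illustrative assumptions, not any vendor's actual logic. The renderer substrings are the names of two common software rasterizers.

```python
# Toy self-consistency check for a reported fingerprint.
# Field names and rules are hypothetical; real detectors use far more signals.
SOFTWARE_RENDERERS = ("swiftshader", "llvmpipe")  # common software rasterizers


def inconsistencies(fp: dict) -> list[str]:
    """Return human-readable reasons a fingerprint contradicts itself."""
    reasons = []
    platform = fp.get("platform", "").lower()
    renderer = fp.get("webgl_renderer", "").lower()

    # Rule 1: WebGL drawn by a software rasterizer suggests no real GPU.
    if any(name in renderer for name in SOFTWARE_RENDERERS):
        reasons.append("WebGL is software-rendered")

    # Rule 2: a Mac should not report a Mesa/llvmpipe (Linux) renderer.
    if "mac" in platform and ("llvmpipe" in renderer or "mesa" in renderer):
        reasons.append("claims a Mac but reports a Linux renderer")

    return reasons


spoofed = {"platform": "MacIntel", "webgl_renderer": "llvmpipe (LLVM 15.0.7, 256 bits)"}
honest = {"platform": "MacIntel", "webgl_renderer": "Apple M4"}
print(inconsistencies(spoofed))
print(inconsistencies(honest))
```

A single contradiction like this is enough to flag a session, which is why a truthful local browser starts from a much stronger position than a spoofed cloud VM.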
Rotunda tries to take the good of all approaches and leave the bad. A well-behaved agent wants a browser that looks like an extension of you: on your home IP, taking its time and making some mistakes while filling out forms, truthful GPU reporting, and no leaked JavaScript identifiers for browser control. We just want it to have access to the same browser experience that we do when we're using the web.
That's for the good bots. Serving the bad bots (ticket scalpers, shoe resellers, etc.) is of no interest to me.
Rotunda combines a few different elements: fingerprint fibbing, isolated JavaScript control, and realistic actions. My goal is for these to all be implementation details to you.
Stealth in this corner of the world means making a programmatically driven browser look indistinguishable from a real human at a real machine. The catch is that it's an asymmetric game: a fingerprinting vendor only has to find one thing out of a thousand that looks off, while the stealth side has to keep all thousand consistent. Verification is much easier than the solution.
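A quick back-of-the-envelope calculation shows how lopsided this is. The numbers are illustrative assumptions: 1,000 signals, each one independently looking normal 99.9% of the time. Real signals are not independent, so treat this as intuition rather than a measurement.

```python
def pass_all_probability(per_check: float, checks: int) -> float:
    """Chance that every one of `checks` independent signals looks normal."""
    return per_check ** checks


# Assumed numbers: 1,000 signals, each 99.9% convincing on its own.
p = pass_all_probability(0.999, 1000)
print(f"{p:.3f}")  # prints 0.368
```

Even with each signal almost perfect, the spoofed browser survives all thousand checks only about a third of the time, while the detector needs just one miss.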
That cat-and-mouse game plays out against deep-pocketed vendors like hCaptcha, Fingerprint.js, and Cloudflare, just to name some that are commercially available. Most big sites layer their own checks on top. Trying to beat them at their own game is a losing position.[3] At best you'll get a captcha and move on; at worst they'll flag your whole account.
So Rotunda is by design not a stealth browser. Stealth research is helpful in shaping the implementation, but we're not playing whack-a-mole with detectors.
My view is that browsers are better off not lying. But they're pretty safe fibbing. The things that are hardest to mock are low level: GPU rendering and audio drivers, mostly. These stay consistent across every machine with the same hardware and major software version. Everyone with an M4 MacBook will have the same renderer, and so will everyone on the same NUC running Windows. Sites aren't going to block you because you're using the same hardware as someone else. Higher entropy is found in the things that you control: browser extensions, font selection, monitor screen size, etc. We choose to mock those safe attributes because a site has no real way to assess their validity from inside the browser sandbox.
So what we say is real: just permuted. If we advertise you have a font, we'll properly render with that font. But each profile can have a subset of valid fonts that provides the additional entropy.
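One way to picture "real, just permuted" is a per-profile font subset drawn only from fonts that are actually installed. This is a minimal sketch under assumptions of my own: `profile_fonts`, the `keep` ratio, and hashing the profile name into a seed are hypothetical choices, not how Rotunda is known to implement it.

```python
import hashlib
import random


def profile_fonts(profile: str, installed: list[str], keep: float = 0.8) -> list[str]:
    """Pick a stable, profile-specific subset of fonts that really are installed."""
    # Derive a deterministic seed from the profile name so the subset never drifts.
    seed = int.from_bytes(hashlib.sha256(profile.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    count = max(1, round(len(installed) * keep))
    chosen = set(rng.sample(sorted(installed), count))
    # Preserve the original ordering of the installed list.
    return [font for font in installed if font in chosen]


installed = ["Arial", "Courier New", "Georgia", "Helvetica", "Times New Roman"]
print(profile_fonts("agent-demo", installed))
```

Because every advertised font exists on disk, rendering with it behaves exactly as a site would expect; only the choice of subset varies between profiles.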
This is hugely helped by running locally: it inherits your solid IP reputation and your full device GPU capabilities. Cloud-hosted browsers like Steel.dev, Kernel, and Browser-Use all work by spinning up Linux VMs (either headless or in X11 sessions) that must lie about their underlying hosts to avoid getting flagged. They usually provide captcha solving as part of their "stealth" offerings, but enough of these can still get your account flagged. It's better to minimize the number of captchas you see in the first place.
Under the hood, Rotunda is a heavily patched version of Firefox. This is an unusual choice in the browser automation community, since almost all alternatives are based on Chromium because of its great scripting support through CDP. The decision came from early work by Camoufox, which proved out that Firefox has more safeguards by default (so it's harder to fingerprint) and that CDP simply leaks too much state. Juggler is Firefox's alternative automation API, developed before Google formalized the spec. It covers roughly the same surface area as CDP but operates in a fully isolated JavaScript context, so it won't leak any state to the remote site no matter what variables you access. I probably could have hacked some patches into CDP to make that work as well, but Juggler ends up being far less of an uphill battle.
Desktop software is having a moment.
We're seeing more agentic work done locally than I would have expected. The major harnesses like Claude Code, Codex, and OpenClaw are all on the desktop. Their equivalent cloud platforms haven't caught on in the same way. Rotunda plays nicely with all of them by running locally and giving them access to a simple API surface:
uvx rotunda agent new-context agent-demo
uvx rotunda agent navigate 3 https://pierce.dev
uvx rotunda agent describe 3
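If you'd rather drive those same commands from a script than from a shell, a thin wrapper is enough. This sketch uses only the three subcommands shown above. The helper names `agent_command` and `run_agent` are hypothetical, and I'm assuming the `3` is a context id handed back by `new-context`.

```python
import subprocess

BASE = ["uvx", "rotunda", "agent"]


def agent_command(*args: object) -> list[str]:
    """Build the argv for one `rotunda agent` call without running it."""
    return BASE + [str(arg) for arg in args]


def run_agent(*args: object) -> str:
    """Run one call and return its stdout; raises if the CLI exits non-zero."""
    result = subprocess.run(
        agent_command(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


# Only builds the commands; call run_agent(...) to actually execute them.
print(agent_command("new-context", "agent-demo"))
print(agent_command("navigate", 3, "https://pierce.dev"))
print(agent_command("describe", 3))
```

Passing an argv list instead of a shell string sidesteps quoting problems when a URL contains characters the shell would interpret.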
I opted for a native bash CLI over an MCP server for a couple of reasons.
First, every MCP you load eats a chunk of your context window before the agent has done anything. Tool descriptions and JSON schemas add up fast, and in practice most agents only use two or three tools per task. A CLI defers that cost: the model only sees the --help text when it's unsure and asks for the signature. Most of the CLI names are obvious enough that it can intuit what they'll do without even prompting.
Second, models have been trained on orders of magnitude more bash than MCP. The magic of bash is its composability. You can pipe some output into jq, or chain it with grep, or wrap it in a loop. MCP is its own little walled garden that doesn't really compose with anything you already have.
I only use MCP these days for products that (for some reason) have an MCP server but not an official API. Incidentally, I think this is largely political: MCP was absurdly en vogue in 2025. It was practically the only thing you'd see posts about on X and LinkedIn. That top-down approval gave engineering groups the willpower and optics to look AI-first.[4]
Right now for anything you'd actually want to live on a developer's machine, a CLI mostly lets you do more for less.
If you actually watch a session that's being controlled by a script, it's immediately obvious what's going on. The mouse will barely move then jump to the link that it wants, or every input entry will be filled instantly with a full paragraph. It's fast but it's super unrealistic for a human operator. If antibot providers aren't already modeling that signal they will be soon.
The only foolproof way to escape these checks is to emulate human usage of websites. That brings me to last Monday when I wrote a Swift app to record my keyboard and mouse movements. I let it run for a few days to record all the idiosyncrasies of how I move my mouse and make mistakes as I type.
Yes, I basically built a key logger.[5]
Through this logger I got a big data stream with everything we need to emulate something realistic. The inputs the model sees at inference are the desired destination – the target text of a field, the (src, dst) of a mouse hop. The model must fill in the rich, messy event stream a real human would produce in between. Credit to Claude for the animations:
[Animation: Event stream, mouse. A single hop from src to dst, sampled every few milliseconds, with points labeled src, dst, and overshoot. Known: src=(245, 380), dst=(892, 410). Predicted: every intermediate (Δx, Δy, Δt).]
[Animation: Event stream, keyboard. A target string, unrolled into the keystrokes that get the field to match it. From t = 0 to roughly 1.3s the field reads r, ro, rot, rotu, rotun, rotund, rotunds (a mistype), rotund (after a backspace), and finally rotunda (the correction). Structured decoding ensures the field eventually reads "rotunda". Known: target = "rotunda". Predicted: every (key, Δt), including mistypes and corrections.]
At the moment I do this with a trained autoregressive RNN. Keystrokes and mouse movements share a single time-ordered token stream.
The current goal (the target string and the destination point) is prepended as context so each step knows what it's aiming at. Modeling characters also allows you to learn some keyboard geometry: that mistyped "P"s tend to come out as "O" more than "A", and analogously that real cursors overshoot and curve toward a target rather than jump in a straight line. It's fast at inference in C++ and very parameter efficient.
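As a rough picture of what a shared, goal-prefixed stream could look like, here is a small sketch. The token names (`GOAL_TEXT`, `GOAL_DST`, `MOUSE`, `KEY`) and the dataclasses are hypothetical illustrations, not Rotunda's actual encoding. It shows only the two properties described above: the goal comes first, and mouse and keyboard events are interleaved by time.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Goal:
    target_text: str
    dst: tuple[int, int]


@dataclass(frozen=True)
class MouseDelta:
    t_ms: int  # absolute timestamp, used only for ordering
    dx: int
    dy: int


@dataclass(frozen=True)
class Key:
    t_ms: int
    key: str


Event = Union[MouseDelta, Key]


def build_stream(goal: Goal, events: list[Event]) -> list[tuple]:
    """Prepend the goal, then emit events in time order with relative delays."""
    tokens: list[tuple] = [("GOAL_TEXT", goal.target_text), ("GOAL_DST", goal.dst)]
    prev_t = 0
    for event in sorted(events, key=lambda e: e.t_ms):
        dt = event.t_ms - prev_t  # delay since the previous event
        prev_t = event.t_ms
        if isinstance(event, MouseDelta):
            tokens.append(("MOUSE", event.dx, event.dy, dt))
        else:
            tokens.append(("KEY", event.key, dt))
    return tokens


events = [Key(120, "r"), MouseDelta(40, 5, -2), MouseDelta(80, 7, 1)]
stream = build_stream(Goal("rotunda", (892, 410)), events)
print(stream)
```

Storing relative delays rather than absolute timestamps keeps the values small and makes a recorded session independent of when it started.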
But we don't want to leave your field inputs up entirely to autoregressive RNNs. There's a chance they'll output the wrong values even when they're given the initial prompt. So during inference we add a structured decoding pipeline to ensure it actually does eventually type what you want it to and that any mistakes are corrected.
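To show the general idea of constraining a sampler so the field always ends up correct, here is a deliberately simplified sketch. It is not Rotunda's pipeline: `constrained_decode`, `noisy_typist`, the mistake budget, and the rule that every mistype is corrected immediately are all assumptions of mine, and a random stub stands in for the RNN.

```python
import random

BACKSPACE = "\b"


def constrained_decode(target: str, propose, max_mistakes: int = 3) -> list[str]:
    """Accept proposed keys, overriding any that would stop the field reaching target."""
    buffer, keys, mistakes = "", [], 0
    while buffer != target:
        key = propose(buffer)
        if not target.startswith(buffer):
            key = BACKSPACE  # field has diverged: force a correction
        else:
            expected = target[len(buffer)]
            if key != expected:
                if key == BACKSPACE or mistakes >= max_mistakes:
                    key = expected  # veto: budget spent or a pointless backspace
                else:
                    mistakes += 1  # allow a realistic mistype
        keys.append(key)
        buffer = buffer[:-1] if key == BACKSPACE else buffer + key
    return keys


def noisy_typist(target: str, rng: random.Random, error_rate: float = 0.2):
    """Stand-in for the model: usually right, sometimes hits a random letter."""
    def propose(buffer: str) -> str:
        if rng.random() < error_rate:
            return rng.choice("abcdefghijklmnopqrstuvwxyz")
        if target.startswith(buffer) and len(buffer) < len(target):
            return target[len(buffer)]
        return BACKSPACE
    return propose


def replay(keys: list[str]) -> str:
    """Apply a keystroke list to an empty field and return the final value."""
    buffer = ""
    for key in keys:
        buffer = buffer[:-1] if key == BACKSPACE else buffer + key
    return buffer


keys = constrained_decode("rotunda", noisy_typist("rotunda", random.Random(0)))
print(repr(replay(keys)))
```

The mistake budget is what guarantees termination: each allowed mistype costs exactly two keystrokes (the wrong key and its backspace), so the output never exceeds `len(target) + 2 * max_mistakes` keys.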
Rotunda is the kind of tool whose duality keeps me up at night a little. The same browser that lets you scrape your own Amazon account also, in principle, lets someone scrape something they shouldn't.
The structural shape of Rotunda makes it pretty bad at most of the things abusive bots are after. Take ticket scalping: the entire economy is about being first to the API the millisecond the queue opens. Most of these architectures are on a 10 Gbps connection banging endpoints at thousands of requests per second. A browser politely moving its mouse from button to button doesn't help them much.
Most actual abuse on the open web depends on volume, speed, or both. Rotunda is designed to be the opposite of both. The use cases that look like a person happen to be the ones that don't scale into adversarial harm. The cost-per-action is too high and by definition puts you into competition with actual people instead of high-frequency traders.
I philosophically disagree with sites that try to block automation that lets regular people get some time back in their day; if an intern can manually do something for you, a computer should be capable of doing the same.
There's a lot to do here:
Immediately on my plate is improving the accuracy of the mouse/keyboard simulation model. I expect that to naturally improve as I keep my key logger enabled. 😅 Ideally I'd also like to crowd-source data capture to give more variation to mouse and keyboard patterns, but I have to give more thought to doing that in a privacy-preserving way.
Also, while we currently have perfect scores on all fingerprint benchmarks, I need to test on more heterogeneous machines to make sure those scores stay strong. This should also be automated via CI.
There's also the outputs of the CLI. Right now we're mostly trying to get raw outputs to agents so they can figure out what to do... but we could be smarter about rendering Markdown formatted or structured content. Perhaps even driven by a mini-LLM that's just tailored to that purpose. For some of these feature questions I'm waiting to see how the community ends up using it and plan the feature roadmap from there.
So give it a try! And hit me up on X if you do end up using it and let me know what you think. It's free and licensed under the Mozilla Public License. A star on GitHub would also help justify the nights and weekends.
Til next time! Have fun automating.
[1] JSON, HTML-over-the-wire, GraphQL, etc.
[2] These device fingerprints are used both to identify bots and to track you across the web. Even if you disable cookies, there are enough unique signals on your browser that it's almost trivial to identify you across websites. I have a lot of thoughts on fingerprinting more broadly and how it's toxic to free society, but I'll cover those some other time.
[3] At the limit, browsers have an advantage because they control the sandbox that remote sites must operate in. If you invested those same billions into stealth evasion, I'm sure you could make something pretty bulletproof. But ML is easier to deploy with bigger datasets, which the vendors have and clients do not; plus billions have not been invested in stealth browsing. So the cat is pretty firmly winning this race at the moment.
[4] Enterprises have always been squirrelly about opening up an API, since it risks data exfiltration and easier vendor migration. But the competitive tides were just strong enough for MCP to take hold.
[5] It's very funny to look back over what you typed for a day, without the context of what you were responding to.