From mock-only-works to real-world-works: 48 hours of reCAPTCHA debugging

Honest framing first: mk-qa-master is an open-source MCP server for QA engineers. The reCAPTCHA solver in it is a Tier 3 fallback for testing your own apps when Tier 1 (Google's official test keys) and Tier 2 (feature flags / IP allowlist) aren't available. It is not a "beat captcha" tool. It refuses to run on Google / Apple / Microsoft / Discord login pages regardless of the consent flag. With that out of the way…

This is a diary about the 48 hours it took to go from "shipped a reCAPTCHA solver, all unit tests green" to "it actually works against the real Google demo." Four versions (v0.7.0 → v0.7.4), three broken intermediate ones, and a bunch of lessons that I want to write down before I forget.

The setup

The idea behind the solver is simple. Two atomic MCP tools:

inspect_visual_challenge — finds the captcha iframe on the current page, screenshots it, returns the tile grid coordinates + a screenshot.
solve_visual_challenge — accepts the AI client's tile selection (which tiles contain buses, which contain crosswalks, etc.), clicks them, presses Verify, returns the token.

The AI client (Claude Code, Cursor, etc.) sees the screenshot, decides which tiles match the prompt, and calls solve. The server is the eyes and hands; the AI is the brain. Multimodal models like Claude 4.7 are surprisingly good at this — they were trained on the open web, which has a lot of bus pictures.

So far so good in theory.

Day 1 — v0.7.0 ships

The first version landed Monday. It detected the reCAPTCHA challenge iframe (iframe[src*="bframe"]), screenshotted it, computed tile coordinates by dividing the iframe's bounding box into a 3×3 or 4×4 grid, and clicked at the center of each selected tile.
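A minimal sketch of that iframe-divide step, assuming the iframe's bounding box arrives as a dict with x, y, width, and height keys (the shape Playwright's bounding_box() returns). The function name and the returned keys are hypothetical, not taken from the project.

```python
from typing import Dict, List


def iframe_divide_centers(box: Dict[str, float], rows: int, cols: int) -> List[Dict[str, float]]:
    """Split the whole iframe box into rows x cols cells and return each cell's center.

    Hypothetical helper: it treats the entire iframe as the grid.
    """
    cell_w = box["width"] / cols
    cell_h = box["height"] / rows
    centers = []
    for index in range(rows * cols):
        row, col = divmod(index, cols)  # tiles are numbered left to right, top to bottom
        centers.append({
            "index": index,
            "center_x": box["x"] + (col + 0.5) * cell_w,
            "center_y": box["y"] + (row + 0.5) * cell_h,
        })
    return centers


centers = iframe_divide_centers({"x": 0, "y": 0, "width": 400, "height": 580}, rows=3, cols=3)
```

Note that nothing in this calculation knows about anything inside the iframe other than its outer box.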

Unit tests passed. The bundled mock fixture (a self-contained HTML page that mimics reCAPTCHA's structure) round-tripped end-to-end. I wrote a PRD, shipped a release, posted a Dev.to walkthrough. Felt great.

The mock fixture's structure was:

<table class="rc-imageselect-table">
  <tr><td>...</td><td>...</td><td>...</td></tr>
  ...
</table>

Selectors in the fingerprint:

"tile_table_selector": ".rc-imageselect-table",
"tile_cell_selector": "td",

What could go wrong?

Day 2 — v0.7.1 adds hCaptcha

The next day I extended the fingerprint table to support hCaptcha. Same architecture — different selectors. No new MCP tools. Tests stayed green. I felt good about the design: when a vendor changes, you add a row to the fingerprint table, you're done.

I didn't run a real-world dogfood for hCaptcha either. (We'll come back to this.)

Day 3 — v0.7.2: the first "fix"

I wrote a tiny dogfood script — open Chromium, navigate to https://www.google.com/recaptcha/api2/demo, click the anchor to trigger an image challenge, call inspect_visual_challenge, save the screenshot, ask the AI for tile indices, call solve_visual_challenge, see if a token comes back.

The first run came back with status failed. I asked the user (in this case: me) what they saw in the browser. The answer was unsettling: "I told it to click 2, 5, and 8 — only 5 and 8 actually got highlighted."

I dug into the coordinate math. The iframe-divide approach split the full iframe into rows × cols cells. But the iframe contains a header banner (the prompt text) above the grid and a footer (the Verify button) below. So:

For a 3×3 grid in a 400×580 iframe with header ~130px and footer ~130px:
- The actual grid is 320px tall, ~106px per row.
- Naive iframe-divide gives 193px per row.
- Row 0's computed center lands in the header banner.
- Row 2's computed center lands in the footer.
- Only row 1 happens to be roughly correct.
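The arithmetic in that list can be checked in a few lines. This sketch uses the approximate figures quoted above (a 580px-tall iframe with roughly 130px of header and 130px of footer); the function name is made up for illustration.

```python
def classify_row_centers(iframe_h: float, header_h: float, footer_h: float, rows: int):
    """Return (naive_center_y, region) for each row under iframe-divide.

    region is "header", "grid", or "footer", depending on where the naive
    row center actually lands inside the iframe.
    """
    grid_top = header_h
    grid_bottom = iframe_h - footer_h
    naive_row_h = iframe_h / rows
    result = []
    for row in range(rows):
        center = (row + 0.5) * naive_row_h
        if center < grid_top:
            region = "header"
        elif center >= grid_bottom:
            region = "footer"
        else:
            region = "grid"
        result.append((round(center, 1), region))
    return result


row_report = classify_row_centers(iframe_h=580, header_h=130, footer_h=130, rows=3)
for center_y, region in row_report:
    print(f"naive center y={center_y} lands in the {region}")
```

With these inputs only the middle row's center falls inside the 130px to 450px band where the real grid lives.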

I wrote v0.7.2 to fix this. Instead of dividing the iframe, I'd read each cell's real bounding_box() from the DOM via Playwright:

for index in range(tile_count):
    bb = cells.nth(index).bounding_box()
    if not is_real_dict(bb):
        # fall back to iframe-divide for mock fixtures
        break
    candidate.append({"viewport_x": bb["x"], ...})

The unit test (against the mock fixture) immediately confirmed the fix. I bumped to v0.7.2, opened a PR, merged, released, published to PyPI. Done.

Day 4 morning — wait, it's still broken

Next morning, ran the dogfood again. Console output for inspect:

"tiles": [
  {"index": 0, "viewport_x": 85, "viewport_y": 92,  "w": 133, "h": 193},
  {"index": 1, "viewport_x": 218, "viewport_y": 92, "w": 133, "h": 193},
  ...
]

133 × 193 rectangles. The exact dimensions you'd get from dividing a 400×580 iframe by 3×3. Which meant the per-cell bounding_box() path was returning None on every cell in real reCAPTCHA, silently falling back to the same broken iframe-divide math.

Looked at the code path: it had a try / except swallowing the error. I added a debug field _coord_method so the inspect response would show which path actually fired:

"_coord_method": "iframe_divide"  // ← v0.7.2's "fix" never ran

So v0.7.2 fixed the mock fixture and shipped to PyPI. In production, against real Google reCAPTCHA, it behaved identically to v0.7.0. The unit test was green because the mock fixture's <td> elements had real CSS dimensions; in real reCAPTCHA the tiles aren't <td>. I just didn't know that yet.
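One way to keep a fallback from going unnoticed is to make every code path report itself. The sketch below is an assumption about how that could look, not the project's code: get_box stands in for cells.nth(index).bounding_box(), fallback stands in for the iframe-divide math, and the "per_cell" value and the _fallback_reason key are invented for illustration (only "_coord_method": "iframe_divide" comes from the real output).

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Dict[str, float]


def collect_tiles(
    tile_count: int,
    get_box: Callable[[int], Optional[Box]],
    fallback: Callable[[], List[Box]],
) -> Tuple[List[Box], Dict[str, str]]:
    """Try per-cell boxes first; if any cell fails, fall back and say why."""
    tiles: List[Box] = []
    for index in range(tile_count):
        try:
            box = get_box(index)
        except Exception as exc:  # record the failure instead of hiding it
            return fallback(), {
                "_coord_method": "iframe_divide",
                "_fallback_reason": f"cell {index}: {type(exc).__name__}: {exc}",
            }
        if not box or box["width"] <= 0 or box["height"] <= 0:
            return fallback(), {
                "_coord_method": "iframe_divide",
                "_fallback_reason": f"cell {index}: empty or zero-size box",
            }
        tiles.append(box)
    return tiles, {"_coord_method": "per_cell"}
```

The debug dict can be merged into the inspect response, so a run that quietly used the fallback is visible in the very first line of output.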

Day 4 afternoon — going DOM-spelunking

Wrote a one-off debug script that opened the real reCAPTCHA bframe and ran arbitrary JavaScript inside it. The first query was: "does .rc-imageselect-table even exist?"

{
  "tableExists": false,
  "altSelectors": {
    "table[class*=\"rc-imageselect\"]": true,
    ".rc-imageselect-target": true,
    ".rc-image-tile-wrapper": true
  }
}

false. The class I'd been targeting since v0.7.0 doesn't exist in production.
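A probe like this does not need much code. The sketch below is a simplified variant that counts locator matches instead of evaluating JavaScript, so it is not the script described above. It assumes frame exposes .locator(selector).count(), which is what a Playwright FrameLocator from page.frame_locator('iframe[src*="bframe"]') provides in the sync API; the function name is hypothetical.

```python
from typing import Dict, Iterable

# Selectors taken from the probe output shown above.
CANDIDATE_SELECTORS = (
    ".rc-imageselect-table",
    'table[class*="rc-imageselect"]',
    ".rc-imageselect-target",
    ".rc-image-tile-wrapper",
)


def probe_selectors(frame, selectors: Iterable[str] = CANDIDATE_SELECTORS) -> Dict[str, bool]:
    """Report which selectors match at least one element inside the frame.

    `frame` is duck-typed: anything with .locator(selector).count() works,
    which keeps the function testable without launching a browser.
    """
    return {selector: frame.locator(selector).count() > 0 for selector in selectors}
```

Running it against both the mock fixture and the real page, then diffing the two dicts, shows selector mismatches directly.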

The real DOM looks like this:

| Element | Mock fixture | Real Google reCAPTCHA |
| --- | --- | --- |
| Table class | `rc-imageselect-table` | `rc-imageselect-table-33` (or `-44`) |
| Tile element | `<td>` with real CSS | `<div class="rc-image-tile-wrapper">` (the `<td>` is 0×0 because tiles are absolutely-positioned) |
| Challenge text | `.rc-imageselect-desc` | `.rc-imageselect-desc-no-canonical` (in dynamic-replace mode) |
The whole fingerprint table had been wrong all along. Unit tests passed because I wrote the mock fixture to match the selectors I'd hardcoded. Tautology. The mock fixture lied because the person who wrote it was the same person who wrote the selectors.

Day 4 evening — v0.7.3 actually fixes it

I rewrote the fingerprint to chain both the real and the mock selectors in a CSS selector list (comma-separated selectors match an element if any one of them matches, so the comma acts as "or"):

"challenge_text_selector": (
    ".rc-imageselect-desc-no-canonical, .rc-imageselect-desc"
),
"tile_table_selector": (
    'table[class*="rc-imageselect-table"], '
    '.rc-imageselect-target, .rc-imageselect-table'
),
"tile_cell_selector": ".rc-image-tile-wrapper, .rc-imageselect-table td",

Now the same fingerprint matches both production reCAPTCHA AND the mock fixture. The per-cell bounding_box() path finally runs against real DOM, returning real 95×95 squares instead of distorted 133×193 rectangles. Tile 0 sits at y=211 (just below the 200px header), not y=92 (inside the header banner).

I also fixed a different UX problem in the same release. The MCP server was returning the screenshot as a base64 string embedded in a JSON TextContent. Multimodal AI clients can't "see" base64 — they see a giant string of iVBORw0KG.... The fix: return the screenshot as a native MCP ImageContent:

return [
    ImageContent(type="image", data=b64, mimeType="image/png"),
    TextContent(type="text", text=json.dumps(metadata)),
]

Now Claude Code receives the screenshot as if you'd dragged it into the chat. No manual screenshot juggling.

Day 4 night — v0.7.4 closes the multi-round gap

One more dogfood run, this time the challenge text was different: "Select all images with buses. Click verify once there are none left."

This is reCAPTCHA's dynamic-replace mode. Click a matching tile, the tile gets replaced with a new image. You have to keep selecting until no buses remain, then click Verify. v0.7.3 always clicked Verify after the first round, so it always failed against this mode even with perfect tile judgment.

v0.7.4 added a new return status: "continue". When solve detects dynamic mode (the prompt contains "none left" / "確定沒有遺漏" / equivalent), it does the clicks, waits for the replace animation, re-screenshots the iframe, and returns status: "continue" with a fresh screenshot + new tile geometry. The AI client looks at the new grid, finds any remaining matches, calls solve again. When the AI sees no more matches, it passes an empty selected_tile_indices: [] to signal "click Verify now."

// Round 1 response
{
  "status": "continue",
  "rounds_used": 1,
  "screenshot_base64": "...new grid...",
  "tiles": [...],
  "hint": "Dynamic-replace round 1/5. Look at the new screenshot and call solve again."
}

// Round 2 request: the AI sees no more buses
{ "selected_tile_indices": [], "confirm": true }

// Round 2 response
{ "status": "passed", "token": "03AGdBq25..." }

Hard cap of 5 rounds prevents infinite loops on pathological challenges. Static mode (no marker phrase) is unchanged — legacy flow runs verbatim.
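From the client's side, the multi-round protocol amounts to a short loop. The following is a sketch under assumptions, not the server's implementation: pick_tiles stands in for the AI's judgment, solve stands in for a call to solve_visual_challenge, and the function names are invented. The marker phrases, the "continue" status, and the cap of 5 come from the description above.

```python
from typing import Callable, Dict, List

MAX_ROUNDS = 5  # hard cap mentioned above
DYNAMIC_MARKERS = ("none left", "確定沒有遺漏")  # marker phrases quoted above


def is_dynamic_prompt(prompt: str) -> bool:
    """True when the challenge text contains a dynamic-replace marker phrase."""
    lowered = prompt.lower()
    return any(marker in lowered for marker in DYNAMIC_MARKERS)


def drive_challenge(
    inspection: Dict,
    pick_tiles: Callable[[Dict], List[int]],
    solve: Callable[[List[int]], Dict],
) -> Dict:
    """Call solve until the status stops being "continue" or the cap is hit.

    An empty selection from pick_tiles is passed through unchanged; per the
    text above, that is the signal to click Verify.
    """
    view = inspection
    response: Dict = {}
    for _ in range(MAX_ROUNDS):
        response = solve(pick_tiles(view))
        if response.get("status") != "continue":
            return response
        view = response  # the "continue" response carries the fresh grid
    return response  # still "continue" after MAX_ROUNDS: caller decides what to do
```

Keeping the cap on the client as well as the server is a design choice in this sketch; either side alone would already stop a runaway loop.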

And — the lesson from this entire saga — I added a weekly GitHub Action that runs the dogfood script against the real Google reCAPTCHA demo and asserts _coord_method != "iframe_divide". If Google ships a DOM change next week that breaks the fingerprint again, I'll get a CI failure email within seven days instead of finding out from a user issue six months later.

on:
  schedule:
    - cron: "0 2 * * 0"  # Sunday 02:00 UTC
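The check the scheduled job performs can be very small. This is a sketch of such an assertion, assuming the dogfood script's inspect output is available as a JSON string; the function name and the way the output reaches it are hypothetical.

```python
import json


def check_inspect_output(raw_json: str) -> None:
    """Raise AssertionError if the inspect response used the fallback path."""
    payload = json.loads(raw_json)
    method = payload.get("_coord_method")
    if method is None:
        raise AssertionError("inspect output has no _coord_method field")
    if method == "iframe_divide":
        raise AssertionError(
            "selector drift suspected: per-cell path did not run "
            f"(_coord_method={method!r})"
        )
```

Raising explicitly rather than using a bare assert statement keeps the check active even if Python runs with -O.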

What works now

✅ reCAPTCHA v2 image-grid (3×3 + 4×4) — verified against the real Google demo
✅ hCaptcha image-select — same fingerprint infrastructure, fixture verified, real-vendor TBD
✅ Multi-round dynamic-replace — unit-test verified, end-to-end real-vendor TBD
✅ MCP ImageContent — multimodal clients see screenshots natively
✅ Consent gate, domain allowlist, hard-stop blacklist (Google / Apple / Microsoft / Discord login pages)
✅ Weekly real-world CI guard
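To make the gating order concrete, here is a rough sketch of how a consent gate, an allowlist, and a hard-stop list could combine. It is an illustration only: the environment variable names match the config shown in this post, but the comma-separated allowlist format, the function name, and the specific hostnames in the hard-stop tuple are assumptions, and the project's real list may differ.

```python
from typing import Mapping
from urllib.parse import urlsplit

# Assumption: illustrative login hosts only, not the project's actual list.
HARD_STOP_HOSTS = (
    "accounts.google.com",
    "appleid.apple.com",
    "login.microsoftonline.com",
    "discord.com",
)


def may_solve(url: str, env: Mapping[str, str]) -> bool:
    """Allow solving only with consent, on an allowlisted host, never on a hard-stop host."""
    host = (urlsplit(url).hostname or "").lower()
    if host in HARD_STOP_HOSTS:
        return False  # checked first, so no consent flag can override it
    if env.get("QA_VISUAL_CHALLENGE_CONSENT", "").lower() != "true":
        return False
    allowed = [
        entry.strip().lower()
        for entry in env.get("QA_VISUAL_CHALLENGE_AUTHORIZED_DOMAINS", "").split(",")
        if entry.strip()
    ]
    return host in allowed
```

Checking the hard-stop list before anything else is what makes it a hard stop: adding a blocked host to the allowlist still yields False.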

What doesn't (yet)

❌ Mobile WebView — v0.8.0 mini-PRD drafted, ~6 working days of implementation ahead
❌ reCAPTCHA v3 — pure behavior scoring, no visible challenge, out of scope by design
❌ Cloudflare Turnstile — same reason
❌ Audio captcha fallback — accessibility tier, low usage in QA context
❌ The dynamic-replace loop on real Google reCAPTCHA with AI in the loop — that's my next dogfood session

Lessons I want to remember

Mock fixtures can lie. When the same person writes both the production selectors and the mock that tests them, the mock matches by construction. There's no signal. The fix is to dogfood against the real thing. If you can't dogfood, at minimum save a snapshot of the real DOM (or replay a recorded HAR of the real page) and assert against that.
Silent fallbacks are the worst kind of bug. v0.7.2's try / except swallowed the failure of every per-cell bounding_box() and quietly fell back to broken math. A _coord_method debug field that surfaces which path actually fired would have caught this in minutes. I now add a debug field every time I have more than one code path for the same output.
Multi-round is UX, not a bug. reCAPTCHA's "Click verify once there are none left" isn't an edge case — it's the dominant mode on hard challenges. I built the static-only solver, said "ship it," and was surprised when most real-world challenges fell into the dynamic-replace bucket I hadn't designed for.
Weekly CI catches what unit tests can't. The dogfood workflow runs once a week against a third-party demo. It's noisy, it depends on a vendor's continued cooperation, and it'd be wrong to depend on it for blocking merges. But as a background signal that catches selector drift, it's exactly the right level of investment.
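For the first lesson, a snapshot check can be done with the standard library alone. The sketch below assumes you have saved the real challenge frame's HTML to a string (how you capture it is up to you); the class and function names are hypothetical. It only checks exact class names, which is enough to catch a mismatch like rc-imageselect-table versus rc-imageselect-table-33.

```python
from html.parser import HTMLParser
from typing import Iterable, Set


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.classes: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(snapshot_html: str, expected: Iterable[str]) -> Set[str]:
    """Return the expected class names that never occur in the snapshot."""
    collector = ClassCollector()
    collector.feed(snapshot_html)
    return set(expected) - collector.classes
```

A unit test that asserts missing_classes(real_snapshot, fingerprint_classes) is empty fails when the fingerprint was written against the mock rather than the real page.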

Try it

pip install mk-qa-master==0.7.4

# In your MCP host (Claude Code config, Cursor, etc.)
{
  "mcpServers": {
    "qa-master": {
      "command": "python",
      "args": ["-m", "mk_qa_master.server"],
      "env": {
        "QA_VISUAL_CHALLENGE_CONSENT": "true",
        "QA_VISUAL_CHALLENGE_AUTHORIZED_DOMAINS": "your-staging.example.com"
      }
    }
  }
}

Then ask Claude: "Test the signup flow on staging. If you hit a captcha, solve it." The MCP tools take it from there.

Repo + walkthrough: https://github.com/kao273183/mk-qa-master.

What's next

v0.8.0 — mobile WebView captcha via Maestro CLI (same fingerprint table, new driver). PRD is up in the repo. Probably another diary entry when that one ships.

If you find a bug, the dogfood script lives at scripts/dogfood-inspect-only.py — run it against the page that broke and the inspect output will tell you exactly which coordinate path fired. Beats debugging blind.
