Conclusion: Fixes deployed. Bill sent to Cloudflare support with full explanation. Unknown if Cloudflare will credit it.
tldr: RetainDB (memory layer for AI agents on Cloudflare Workers + KV + Durable Objects + Queues) with 81 users racked up $36k in one month — 3.13B KV writes ($15,635), 16.62B KV reads ($8,306), 4.01B DO storage rows written ($3,962), 574M KV list ops ($2,870) — caused by three compounding bugs: an infinite queue loop passing write_mode: "async" back into itself, 12 unbatched DO storage.put() calls per memory write, and a kv.list() scan running on 95% of auth requests because legacy keys missed the hash/prefix indexes.
The three bugs:
Bug #1 — Infinite queue loop ($15k): Ingest worker forwarded the original write_mode to its internal API call. async got re-queued every time. Fix: force write_mode: "sync" on internal calls.
Bug #2 — 4B DO writes ($4k): 12 unbatched storage.put() calls per memory write across pending overlay, job state, and acks. Fix: removed all DO writes from ingest worker. Pending overlay TTL handles expiry. Dropped 12 → 2.
Bug #3 — KV list scan on every request ($2.8k): Auth fallback ran kv.list() when hash/prefix lookups missed. Legacy keys missed both indexes — 574M list ops. Fix: LEGACY_API_KEY_SCAN_ENABLED = "false".
Lessons:
- Never pass user-facing write modes through to internal queue workers. Queue consumer IS the async handler.
- Durable Object
storage.put()is not cheap at scale. Batch everything. Use TTLs instead of explicit deletes. - Any fallback that touches
kv.list()runs on every request in practice. KV list is $5/million. - Set up Cloudflare spending alerts before you need them. There’s no hard spending cap on Workers.


























