惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

GbyAI
GbyAI
博客园_首页
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
阮一峰的网络日志
阮一峰的网络日志
酷 壳 – CoolShell
酷 壳 – CoolShell
博客园 - 司徒正美
V
V2EX
Cloudbric
Cloudbric
Hugging Face - Blog
Hugging Face - Blog
腾讯CDC
量子位
博客园 - 三生石上(FineUI控件)
博客园 - 叶小钗
K
Kaspersky official blog
博客园 - 【当耐特】
T
Tenable Blog
L
Lohrmann on Cybersecurity
The Cloudflare Blog
S
Schneier on Security
A
Arctic Wolf
Latest news
Latest news
C
Cyber Attacks, Cyber Crime and Cyber Security
罗磊的独立博客
T
The Exploit Database - CXSecurity.com
Cisco Talos Blog
Cisco Talos Blog
小众软件
小众软件
P
Privacy & Cybersecurity Law Blog
WordPress大学
WordPress大学
Simon Willison's Weblog
Simon Willison's Weblog
雷峰网
雷峰网
NISL@THU
NISL@THU
人人都是产品经理
人人都是产品经理
月光博客
月光博客
J
Java Code Geeks
V
Visual Studio Blog
S
Security Affairs
博客园 - Franky
T
Tailwind CSS Blog
Apple Machine Learning Research
Apple Machine Learning Research
H
Heimdal Security Blog
有赞技术团队
有赞技术团队
V2EX - 技术
V2EX - 技术
AWS News Blog
AWS News Blog
G
GRAHAM CLULEY
T
Troy Hunt's Blog
SecWiki News
SecWiki News
Spread Privacy
Spread Privacy
宝玉的分享
宝玉的分享
www.infosecurity-magazine.com
www.infosecurity-magazine.com
博客园 - 聂微东

DEV Community

Authentication Security Deep Dive: From Brute Force to Salted Hashing (With Java Examples) Why AI Systems Don’t Fail — They Drift Spilling beans for how i learn for exam😁"Reinforcement Learning Cheat Sheet" I Replaced Chrome with Safari for AI Browser Automation. Here's What Broke (and What Finally Worked) How Python Borrows Other People's Work The $40 Architecture: Processing 1 Billion API Requests with 99.99% Uptime Vibe Coding: A Workflow Guide (From Zero to SaaS) Most webhook security guides protect the wrong side. The scary part is delivery. Headless CMS for TanStack Start: Build a Blog with Cosmic EU Age Verification App "Hacked in 2 Minutes" — What Actually Happened Comfy Cloud’s delete function does not actually remove files Running AI Models on GPU Cloud Servers: A Beginner Guide Event-driven media intelligence with AWS Step Functions and Bedrock I scored 500 AI prompts across 8 quality dimensions — here's what broke How to Call Google Gemini API from Next.js (Free Tier, No Backend Needed) The Portal Protocol: Reclaiming Human Connection in the Age of AI How to Fix Your Team's Scattered Knowledge Problem With a Self-Hosted Forum Intro to tc Cloud Functors: A Graph-First Mental Model for the Modern Cloud Designing Multi-Tenant Backends With Both Ownership and Team Access I Built a Neumorphic CSS Library with 77+ Components — Here's What I Learned PostgreSQL Performance Optimization: Why Connection Pooling Is Critical at Scale Cómo construí un SaaS multi-rubro para gestionar expensas en Argentina con FastAPI + Vue 3 🚀 I Built an Ethical Hacking Scanner Tool – Open Source Project I Replaced /usage and /context in Claude Code With a Single Statusline A Pythonic Way to Handle Emails (IMAP/SMTP) with Auto-Discovery and AI-Ready Design I Collected 8.9 Million Polymarket Price Points — Here's What I Found About How Markets Really Move EcoTrack AI — Carbon Footprint Tracker & Dashboard Everyone's Using AI. No One Agrees How. 5 self-hosted ebook managers worth trying in 2026 Building Your First AI Agent with LangChain: From Chatbot to Autonomous Assistant Common SOC 2 Failures (Real World) Stop Vibe-Checking Your AI App: A Practical Guide to Evals How to Use SonarQube and SonarScanner Locally to Level Up Your Code Quality Your Next To-Do App Is Dead — I Replaced Mine with an OpenClaw AI Sign a Nostr event in 60 lines of Python using coincurve — no nostr-sdk, no nbxplorer, no rust toolchain ITGC Audit Explained Like You’re in Big 4 Patch Tuesday abril 2026: Microsoft parcha 163 vulnerabilidades y un zero-day en SharePoint Stop scraping everything: a better way to track competitor price changes Listing on MCPize + the Official MCP Registry while routing payments OUTSIDE the marketplace — how I kept 100% of my x402 revenue Building an AI-Powered Risk Intelligence System Using Serverless Architecture Why We Ripped Function Overloading Out of Our AI Toolchain Testing AI-Generated Code: How to Actually Know If It Works SaaS Churn Is Killing Your Business. Here Is What to Do About It (Without a Support Team) The Speed of AI Is No Longer Linear - And Self-Improving Models Are Why How to Implement RBAC for MCP Tools: A Practical Guide for Engineering Teams From Standard Quote to Persuasive Proposal: AI Automation for Arborists I built a CLI that scaffolds complete multi-tenant SaaS apps Axios CVE-2025–62718: The Silent SSRF Bug That Could Be Hiding in Your Node.js App Right Now The dashboard that ended our friendship Data Pipelines Explained Simply (and How to Build Them with Python) The Hidden Cost of AI Systems Nobody Talks About. undefined vs undeclared, and how typeof behaves Switching from file-based jobs to NATS/Kafka in Rust without changing code io_uring Adventures: Rust Servers That Love Syscalls Why Agentic AI is Killing the Traditional Database The POUR principles of web accessibility for developers and designers Quantum Neural Network 3D — A Deep Dive into Interactive WebGL Visualization How To Install Caveman In Codex On macOS And Windows Automation Pipeline Reliability: Why Your Workflow Breaks When Nobody Is Watching I Built an 'Open World' AI Coding Agent — It Works From ANY Folder From Freelancing to Product: A Tech Service Company's SaaS Transformation China's AI Giants: Adding Tencent Hunyuan & ByteDance Doubao to AI University (74 Providers) On the Vibe Coders and Their Lies clerk: Auto-Summarize Your Claude Code Sessions AI Weekly — 2026/04/10–04/17 | The Model Lockdown Is Here, but the Toolchain Is the Real Battleground AI 週報 — 2026/04/10–2026/04/17 模型封鎖潮來了,但工具鏈才是真戰場 Maybe this is how Open-Source apps are born... 🚀 Fine-Tune LLMs with LoRA and QLoRA: 2026 Guide tRPC v11 + Next.js App Router: End-to-End Type Safety Without the Boilerplate ShadCN UI in 2026: Why I Stopped Installing Component Libraries and Started Owning My Components SaaS Billing in React Server Components: Stripe + Supabase Without a Single `useEffect` Join our DEV Weekend Challenge — $1,000 in Prizes Across TEN winners! Submissions Due April 20 at 6:59 AM UTC. Implementing FSRS Spaced Repetition in Flutter + Supabase — Adding Memory Science to an AI Learning App "I Texted My Localhost From the Train — Claude Code Fixed the Bug Before I Got Home" I Built a Sales Prep AI and It Went Deeper Than Expected Design to Code #2: One JSON, Eleven Outputs Solving the 100M-Row Problem: A Summary Table Pattern for High-Volume Push Notification Logs Flutter Web With Wasm: What Actually Changes For Developers I Built 50 Royalty-Free Soundtracks for My Side Project in a Weekend Using AI Music Generation The Vibe Coding Security Checklist: 7 Things to Check Before You Ship Stop Letting Googlebot Guess Fix Your React App's SEO Right Desconstruindo o Streaming do LinkedIn: Como Criar um Engine de Extração de Vídeo de Alta Performance com HLS e FFmpeg (EDA Part-1) EDA (Exploratory Data Analysis) Explained With Real Life — Why Looking at Your Data Is the Most Important Step in Machine Learning Brand Relationship Management at Scale: Our 4-Touch Outreach System for 200+ Brands Why String.fromEnvironment() Might Return an Empty String in Dart JGuardrails 1.0.0 — Hardening Java LLM Apps Against Jailbreaks, Toxicity, and Prompt Injection Plan and Schedule a Full Week of Threads Content From One Claude Conversation Coding Cat Oran Ep3, Five Tables Changed Everything Updated: BFF Pattern I'm done watching freelancers get buried by 200 proposals. So I'm building the alternative. This is my first post BFS Algorithm in Java Step by Step Tutorial with Examples Tracking LLM Pricing Monthly: An Open Dataset for 22 AI Models How We Measure Content ROI on a Comparison Site: Revenue Attribution Without Perfect Data Introducing Nova AI Ops: The AI-Native Operating System for SRE Teams I built a free desktop video downloader for Windows — Grabbit How Talkie OCR Helps Vision-Impaired & Dyslexic Users Read the World Around Them VRCFaceTracking安装和iPhone面捕配置教程,有bug Even CrowdStrike Can't See Your Agents The Automation Gold Rush: What n8n Workflows and Claude Are Opening Up for Developers Right Now
Cost-engineering an "AI Generate" button in a freemium product (from $0.08 to $0.029 per click)
Crackly · 2026-04-25 · via DEV Community

TL;DR

My coding-interview prep app has a "Generate Visualization" button. Click it on any algorithm problem and Claude Sonnet 4.5 produces a self-contained interactive widget that teaches it — a sliding-window expanding, a two-pointer racing, a DP table filling cell by cell. (That's what the GIF above is showing.)

At naive implementation, each click cost me about $0.08. Workable for Pro subscribers. Ruinous if I let free-tier users click freely.

Through five cost decisions — tiering the call path, prompt caching, output capping, a Haiku gatekeeper, and a Groq fallback on regenerations — I got the per-click cost to $0.029 and pushed free-tier users to a $0-marginal-cost path entirely.

If you're building anything with a "Generate with AI" button in a freemium product, these are the moves that matter before you ship.


The product problem

I'm building Crackly — a DSA interview prep tool. 474 problems, each with an "AI Visual" panel. Press "Generate Visualization" and Claude generates a custom HTML+JS widget that animates the algorithm end-to-end: you watch the two-pointer sweep, the sliding window expand, the recursion tree unfold.

It's the best feature in the product. It's also the most expensive.

A naive build looks like this:

  1. User opens a problem page.
  2. Claude Sonnet 4.5 is called with the problem + a spec for how to generate the visualization.
  3. ~15 seconds later, the visualization renders in an iframe.
  4. Cost: ~$0.08 per click at full API pricing (mostly from output tokens — visualizations run ~4,000 output tokens naive).

For Pro users on a $49/quarter plan, that's fine. A Pro user generates maybe 20 of these per month, costing me $1.60 in Claude on $16.33/mo of revenue. ~90% gross margin.

For free users, this is an existential problem. At just 1,000 free DAUs generating one visualization each per day, that's $80/day, $2,400/mo burned by a cohort paying me $0.

The first question any founder shipping an AI feature should ask: what's the marginal cost of a free user's most expensive action? For me it was $0.08, and it was eating my runway.

Here's how I got it down.


Decision 1: Tier the call path. Free users never trigger Claude.

The biggest wins in AI product engineering come from not making the call, not making it cheaper.

I split the "Generate Visualization" button into two code paths:

Free user clicks → Check DB for cached visualization for this exercise. If hit, serve it (~10ms). If miss, show a tasteful "Generating visualizations is a Pro feature — here's a preview of a similar one" state. No Claude call. $0 marginal.

Pro user clicks → Check cache first. If hit, serve it. If miss, generate fresh with Claude, store in cache for the next user, serve.

This is obvious in retrospect but I didn't build it this way at first. My v1 ran Claude on every click regardless of tier. The metrics dashboard told me within a day that this would end me.

The important property: free users benefit from the cache Pro users warm. Every Pro user who generates a fresh visualization populates the DB; every free user who lands on that same problem afterward gets it free. Pro users subsidize free coverage without knowing it, and the DB gets organically richer over time.

No batch job required. No pre-generated library. The cache grows naturally from user behavior.


Decision 2: Prompt caching — 90% off input tokens

Every "Generate Visualization" call uses the same ~2,800-token system prompt. It defines the HTML output contract, the styling rules, the safety constraints, eight example outputs. The only thing that varies between calls is the problem description (~200 tokens).

Anthropic's prompt caching charges ~10% of full input-token price on cache hits. 5-minute TTL, extended to 1 hour on paid plans.

Without caching:
  2,800 tokens × $3.00/MTok = $0.0084 per call (input side)

With caching:
  First call (cache write): 2,800 × $3.75/MTok = $0.0105
  Calls 2..N (cache read): 2,800 × $0.30/MTok = $0.00084 — 10x cheaper

Enter fullscreen mode Exit fullscreen mode

The practical wrinkle: the 5-minute TTL means the cache only stays warm if calls arrive frequently. For organic traffic on a not-yet-launched product, I barely got cache hits on quiet days — the cache would expire between clicks.

Fix: a shared worker that pools requests within TTL windows, so back-to-back calls within 5 minutes share the warm cache. On high-traffic Pro days, cache hit rate climbs to ~85%, and per-call input cost trends toward the $0.00084 floor.

If your system prompt is >500 tokens and identical across calls, this is your biggest free lunch. Cache it.


Decision 3: Output capping — the sneakiest cost sink

Output tokens cost 5x more than input tokens on Sonnet 4.5: $15/MTok out vs $3/MTok in. And models have a bad habit of filling whatever budget you give them.

My first prompt said "generate a complete visualization." No cap, no structural constraint. Typical responses came back at ~4,000 output tokens because Claude kept adding elaborate comments, explanatory headers, <details> sections, inline docstrings — none of which the iframe needed.

I switched to this constraint in the system prompt:

Generate exactly the HTML/JS body. No HTML page wrapper (no <!DOCTYPE>,
<html>, <head>, <body>). No comments. No explanation. Output must be
under 3,500 tokens. If approaching the limit, truncate visual polish
before truncating logic.

Enter fullscreen mode Exit fullscreen mode

Plus max_tokens: 3500 on the API call — a hard cap. If the model tries to exceed it, the response gets truncated mid-output and I detect that server-side and retry with a tighter instruction.

Typical response length dropped from ~4,000 tokens to ~1,850 tokens. A 54% reduction on the expensive side.

Before: 4,000 × $15/MTok = $0.060 per call (output)
After:  1,850 × $15/MTok = $0.0278 per call
Savings: $0.032 per call

Enter fullscreen mode Exit fullscreen mode

At 1,000 Pro-user generations per month, that's $32/mo saved forever. The change was one paragraph in the prompt. Cap your outputs — it doesn't just save money, it forces you to define what you actually need, which produces better outputs anyway.


Decision 4: Groq fallback on regenerations

Pro users have a "Regenerate" button if they don't like the first output. Each regenerate is another call. If I let Pro users regenerate unlimited times at full Sonnet price, my cost per Pro user goes from $1.60/mo to unbounded.

What I built:

  • First generation: Claude Sonnet 4.5. High quality. Persisted in the cache.
  • Regenerations 1–5: Groq (Llama 3.3 70B). Free tier: 14,400 req/day. Ephemeral — not persisted in DB.
  • Regeneration 6+: Rate-limited. "You've hit today's regeneration limit. Back tomorrow."

The insight: tier your inference quality to the user's need at that moment. First-generation quality is what the user judges the product by. Regeneration quality is incrementally useful — they already have a decent visualization, they're asking for a second take. A "good enough" viz from Groq is fine.

First gen to expensive-but-reliable. Regenerations to free-but-decent. Same design pattern applies to many AI features: your first inference run is your hero surface. Your retries, refinements, and exploratory calls can route to cheaper infrastructure without the user ever noticing.


Decision 5: Haiku gatekeeper (offline batch only)

This one runs exactly once per problem, offline, during seed-script runs.

Not every algorithm benefits from visualization. "Return a constant" doesn't. "Sum two integers" doesn't. I didn't want to manually tag 474 problems for visualization-worthiness.

A two-pass filter does the work:

  1. Haiku pass ($0.25/MTok in, $1.25/MTok out): "Given this problem description, is an interactive visualization likely to help a learner understand the algorithm? Reply YES or NO with a one-sentence reason."
  2. Sonnet pass: runs only if Haiku said YES.

Haiku filters out ~13% of problems as "not visualization-worthy." At ~$0.04 full-cost per Sonnet call on the output side, the filter saves about $5 on a full-catalog batch and — more importantly — stops the system from generating awkward animations of trivial operations.

This pattern is underused. Flash / Haiku / GPT-4o-mini are nearly free at classification tasks. If your expensive model is making judgment calls that could be offloaded, offload them.


The per-click economics today

Putting it all together, here's what a Pro user's "Generate Visualization" click costs me after all the engineering:

Component Amount
Input tokens (2,800 cached at ~90% hit rate) $0.00084
Output tokens (1,850 × $15/MTok) $0.02775
Network/infra overhead (Cloud Run + DB write) $0.001
Per-click total ~$0.029

A typical Pro user generates ~20 visualizations per month. That's $0.58/mo in Claude cost per Pro user, on a $16.33/mo subscription. Gross margin per Pro user: ~96%.

Free users cost me $0 in Claude — they see the cache or see the upgrade nudge.

Cost curve Before engineering After engineering
Per-click cost $0.08 $0.029
Per-Pro-user monthly cost $1.60 $0.58
Per-Pro-user gross margin ~90% ~96%
Free-user marginal cost $0.08 $0

A 6-point margin improvement on an AI feature's hero surface is the difference between a company that can pour into growth and one that can't. At 1,000 Pro users that's $720/mo of gross margin I wasn't paying attention to before.


The framework I'd use on any freemium AI product

If you're shipping a "Generate with AI" button, run this checklist before launch:

  1. What's the marginal cost of one click at naive settings? Know this number. If it's >$0.02 and you have free-tier users clicking it, you have a unit-economics problem to solve before you scale.

  2. Tier the call path by user segment. Free users should usually see cached outputs or a graceful paywall state, not trigger fresh inference. Pro users get the live call. This isn't user-hostile — it's the only way freemium AI math works.

  3. What tokens are identical across calls? Those go in a cached system prompt. Prompt caching is the single biggest lever on input cost at scale.

  4. What's your output actually using? If your responses exceed ~2,000 tokens often, your prompt isn't constraining output enough. Cap it in both the instruction and max_tokens.

  5. Is there a cheap model that can route or retry the expensive one? Haiku / Flash / Mini are nearly free at classification. Offload judgment calls and filters.

  6. Can some of your calls run on a free tier? Groq, Gemini free tier, Cerebras. Not for your hero feature — but regenerations, retries, warm-up passes, exploratory runs? Yes. Tier your inference quality to the user's moment-of-need.

  7. What % of your LTV per user does per-user click cost consume? If a single user's typical use of the feature can eat >10% of their subscription price, you've got a hidden margin killer.

None of this is secret. These are rarely applied rigorously because "make it work first" correctly precedes "make it cheap." But the gap between naive and cost-engineered is 3-5x on AI features — which is the difference between a feature you can give away and one you have to meter.


What I'm building this for

All of this cost engineering is in service of Crackly — a DSA interview prep tool I'm shipping at crack-ly.com. The "Generate Visualization" button is one feature among many; the whole product is designed around the principle that teaching should be expensive thinking, not expensive generation. The AI works hardest at the moment you most need help. Everywhere else, it stays out of your way.

It's in private beta. Free tier forever. If you're prepping coding interviews in the next six months, try the free tier — and play with the live "AI Visual" demo embedded on the landing page. If you're building with LLMs and want to compare notes on cost engineering, my inbox is open: admin@crack-ly.com.

Follow me on X at @jobcrackly for more building-in-public from this project.