I switched my Gemma 4 model three times in 72 hours. Here's the decision tree I wish I'd had.

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I picked the wrong Gemma 4 model. Twice.

A 72-hour speedrun through E2B, E4B, and 31B-via-cloud — and the decision tree I wish I'd had on hour one.

Three days before the deadline, I sat down to build a multimodal Gemma 4 app for the challenge. I'd already decided which variant I'd use: E4B, because bigger is better, right?

I shipped on E4B. Then I shipped on E2B. Then I added OpenRouter's 31B as a third option and let users pick.

Here is what each move cost me, what I learned, and the decision tree I'd hand to anyone starting today.

Quick context before the story: Gemma 4 is Google's open AI model family. Google publishes the model weights for free; you download them and run them yourself, with no API key required. It ships in four sizes: the two smallest (E2B and E4B) are small enough to run inside a browser tab via WebGPU (the browser's graphics-card API), while the 31B Dense and 26B MoE variants are server-class. All four are multimodal: they read images and audio, not just text. That last part is why a real app inside a browser tab is suddenly possible: the model that categorizes your text transactions can also read a photo of a receipt, with no extra download.

The setup

The app — a private personal-finance dashboard that runs Gemma 4 entirely in the browser — needed three things from the model:

Categorize transaction text ("STARBUCKS #1234" → restaurants).
Read paper receipts (image → merchant, amount, date).
Answer free-form questions about a year of statements in one prompt.

So: multimodal, long context, must run client-side (in the user's browser, not on a server I rent). That's how I narrowed to the E-series Gemma 4 variants in the first place. The 31B Dense and 26B MoE were never candidates — they're just too big for a browser tab. That left E2B (~1.5 GB on disk once quantized) and E4B (~2.5 GB).

I picked E4B without thinking. That was mistake #1.

Pick #1: E4B, because "bigger is better"

E4B is the larger of the two browser-tier Gemma 4 models. It scored higher on every benchmark in Google's release. I figured the extra GB of weights would buy me cleaner categorization and smarter answers, and I'd ship a more impressive demo.

It worked. Categorization was crisp. The chat panel handled "which restaurant did I visit the most?" without breaking a sweat. I wrote the entire project around the assumption that E4B was the right call and shipped a first cut.

Then a user opened the deployed link.

Cold-load was a 2.5 GB download. On a normal connection that's somewhere between three and ten minutes of staring at a progress bar before the app does anything. My first beta tester typed "is there other solution its time consuming" before the model had finished downloading.

I'd optimized for what the model could do and ignored what the user would experience before it did anything. That's mistake #1.
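A back-of-the-envelope check would have caught this before the first deploy. The sketch below is a rough estimate only: the function name and the bandwidth figures are illustrative assumptions, it uses decimal units (1 GB = 1e9 bytes), and it ignores latency, caching, and any time spent initializing the model after the download.

```javascript
// Rough cold-load estimate: download size divided by sustained bandwidth.
// Assumes 1 GB = 1e9 bytes; ignores latency, caching, and model start-up time.
function coldLoadSeconds(sizeGB, mbps) {
  const bits = sizeGB * 1e9 * 8;
  return bits / (mbps * 1e6);
}

// Format seconds as minutes with one decimal place.
const minutes = (seconds) => (seconds / 60).toFixed(1);

console.log(minutes(coldLoadSeconds(2.5, 100))); // E4B on an assumed 100 Mbps link: "3.3"
console.log(minutes(coldLoadSeconds(2.5, 35)));  // E4B on an assumed 35 Mbps link: "9.5"
console.log(minutes(coldLoadSeconds(1.5, 35)));  // E2B on the same 35 Mbps link: "5.7"
```

Those two E4B figures bracket the three-to-ten-minute range above, which suggests the range corresponds to links of roughly 35 to 100 Mbps.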

Pick #2: E2B, because respecting people's bandwidth is part of the product

E2B is the smaller browser variant. Same multimodal capability. Same 128K context window (enough to read about a 300-page book in one prompt, which matters if you want to ask questions across a whole year of bank statements). Same quantization. About 40% less to download. Slightly thinner reasoning on multi-step questions.

The swap was a one-line code change:

// before
export const MODEL_ID = "onnx-community/gemma-4-E4B-it-ONNX";

// after
export const MODEL_ID = "onnx-community/gemma-4-E2B-it-ONNX";

The interesting thing wasn't the code — it was the trade-off math.

The "thinner reasoning" I was worried about cost me maybe 5–10% of categorization accuracy on long-tail merchants. That's a tiny gap. The "40% less to download" turned a five-minute wait into a three-minute wait, which can be the difference between a user trying your app and a user closing the tab.

The general lesson, written down where I won't forget it:

The smaller capable model usually wins. Cold-load time is the most expensive thing your app does. Trim it ruthlessly.

This held even when the larger model would have produced marginally better outputs. The output gap was invisible to the user. The download gap was the only thing they could see.

That should have been the end of it. It wasn't.

Pick #3: 31B in the cloud, because some users won't wait at all

The same user came back: "no user wait for loading 1.5 gb 2.5 gb will add selection and add openrouter selection also."

They were right. Even E2B's ~1.5 GB is a wall for someone on a phone, on a flaky connection, or just trying a demo for thirty seconds to decide if it's worth more attention. The honest answer was that the right model depends on who's using the app right now.

So I added a third option: Gemma 4 31B Dense via OpenRouter, a service that lets you call many different AI models through one API. It exposes Gemma 4 31B on a free tier: no credit card, no download, and the highest quality of the three. The trade-off is brutal and has to be explicit: your prompts and receipt photos are sent to a third-party server for inference. Privacy goes from "on-device, never uploaded" to "trust OpenRouter's logs policy."

Two practical things bit me adding the cloud path:

The free tier is 16 requests per minute. My categorization loop fired one API request per transaction, so a 71-row sample statement hit the rate limit in three seconds. Fix: batch up to 25 transactions per prompt. Instead of asking the model "what category is this?" 71 times, ask it "here are 25 transactions, classify each" three times. With Gemma 4's 128K context, the batching is effectively free: the model could take a whole statement in one shot, so 25 short lines per prompt are no strain, and three batched requests stay comfortably under the 16-per-minute limit.

// One prompt, 25 transactions, one response. Free-tier safe.
const prompt = `Classify each transaction with one category from this list.
Output ONE LINE per transaction as "<n>. <category>".

${chunk.map((t, i) => `${i + 1}. ${t.rawDescription} (${t.amount})`).join("\n")}`;
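The prompt above needs two small helpers around it: one to produce each `chunk`, and one to read the numbered lines back. This is a minimal sketch, not the app's actual code; `chunkTransactions` and `parseCategories` are hypothetical names, and the parser assumes the model follows the "<n>. <category>" format it was asked for.

```javascript
// Split a list into fixed-size batches (the last batch may be shorter).
function chunkTransactions(transactions, size = 25) {
  const chunks = [];
  for (let i = 0; i < transactions.length; i += size) {
    chunks.push(transactions.slice(i, i + size));
  }
  return chunks;
}

// Parse lines shaped like "3. restaurants" into an array indexed by position.
// Lines that don't match, or numbers outside 1..expectedCount, are ignored,
// so missing entries stay null and can be retried or flagged.
function parseCategories(responseText, expectedCount) {
  const categories = new Array(expectedCount).fill(null);
  for (const line of responseText.split("\n")) {
    const match = line.trim().match(/^(\d+)\.\s*(.+)$/);
    if (!match) continue;
    const index = Number(match[1]) - 1;
    if (index >= 0 && index < expectedCount) {
      categories[index] = match[2].trim();
    }
  }
  return categories;
}
```

For the 71-row statement, `chunkTransactions` yields batches of 25, 25, and 21. Leaving unparsed slots as `null` means a malformed line costs one retry instead of shifting every category after it.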

The model ID format is strict. OpenRouter wants google/gemma-4-31b-it:free (the :free suffix matters). Hit the /v1/models endpoint with your key once to confirm the exact ID before you spend an hour debugging 400 errors.
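That check can be a few lines of script. Treat the sketch below as an assumption-laden starting point: the base URL `https://openrouter.ai/api/v1` and the `{ data: [{ id }] }` response shape are my understanding of OpenRouter's API and should be confirmed against its documentation, and both function names are hypothetical. It relies on the global `fetch` available in browsers and Node 18+.

```javascript
// Assumed endpoint and response shape ({ data: [{ id: "..." }, ...] }); verify in OpenRouter's docs.
const OPENROUTER_MODELS_URL = "https://openrouter.ai/api/v1/models";

// Pure helper: collect every model ID that contains the given substring.
function matchingModelIds(listing, needle) {
  const models = Array.isArray(listing?.data) ? listing.data : [];
  return models
    .map((model) => model.id)
    .filter((id) => typeof id === "string" && id.includes(needle));
}

// Fetch the model listing and return the IDs that mention "gemma-4".
async function findGemmaIds(apiKey) {
  const res = await fetch(OPENROUTER_MODELS_URL, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`Model listing failed: HTTP ${res.status}`);
  return matchingModelIds(await res.json(), "gemma-4");
}
```

Calling `findGemmaIds(key)` once and copying the ID from its output avoids typing the `:free` suffix from memory.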

The decision tree I wish I'd had

Here it is, no theory, just the thing I'd tape to my wall:

Question	If yes →	If no →
Will users get more than 30 seconds before they leave?	Local model OK	Cloud-only (OpenRouter 31B or similar)
Is the data on the user's machine sensitive (finance, health, journals, work)?	Local model required	Cloud is fine
Is the task multi-step reasoning (agentic, planning) rather than simple classification?	Lean E4B / 31B	E2B is enough
Will users return many times, making the one-time download amortize?	Local OK at any size	Smallest model that does the job
Are you charging users / can you eat the API cost?	Cloud OK	Local or free-tier cloud only

You can stop here. Most projects only need the first two rows.
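For anyone who prefers code to a table, the first three rows can be read as a small function. This is one possible encoding, not a rule from the table itself: the function name and return labels are hypothetical, and the tie-break (privacy beats patience when the two conflict) is my assumption, since the table doesn't say which row wins.

```javascript
// One possible encoding of the table's first three rows.
// Assumption (not stated in the table): when privacy and patience conflict, privacy wins.
function recommendModel({ willWait30s, sensitiveData, multiStepReasoning }) {
  // Rows 1 and 2: go cloud-only when the user won't wait and the data isn't sensitive.
  if (!willWait30s && !sensitiveData) return "cloud-31b";
  // Row 3: among the local models, reserve E4B for multi-step reasoning.
  return multiStepReasoning ? "local-e4b" : "local-e2b";
}

// A finance app with patient users doing simple classification lands on E2B.
console.log(recommendModel({ willWait30s: true, sensitiveData: true, multiStepReasoning: false }));
```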

The real answer: don't pick. Let the user pick.

What I actually shipped in the end was a model picker. Three cards. Each one shows: name, download size, where inference happens (on-device vs cloud), and one sentence on the trade-off.
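The data behind those three cards can be as small as one array. The sketch below is illustrative rather than the app's actual source: the field names and `cardSummary` are hypothetical, while the model IDs, sizes, and trade-offs are the ones quoted earlier in this post.

```javascript
// One entry per card: name, download size, where inference runs, one-sentence trade-off.
const MODEL_OPTIONS = [
  {
    id: "onnx-community/gemma-4-E2B-it-ONNX",
    name: "Gemma 4 E2B",
    downloadGB: 1.5,
    runsOn: "device",
    tradeOff: "Smallest download; slightly thinner reasoning on multi-step questions.",
  },
  {
    id: "onnx-community/gemma-4-E4B-it-ONNX",
    name: "Gemma 4 E4B",
    downloadGB: 2.5,
    runsOn: "device",
    tradeOff: "Stronger reasoning; a longer wait before first use.",
  },
  {
    id: "google/gemma-4-31b-it:free",
    name: "Gemma 4 31B (OpenRouter)",
    downloadGB: 0,
    runsOn: "cloud",
    tradeOff: "Highest quality; prompts and receipt photos leave your device.",
  },
];

// Build the text shown on a card from a single option.
function cardSummary(option) {
  const size = option.downloadGB > 0 ? `~${option.downloadGB} GB download` : "no download";
  const where = option.runsOn === "device" ? "runs on-device" : "runs in the cloud";
  return `${option.name}: ${size}, ${where}. ${option.tradeOff}`;
}
```

Keeping the privacy consequence in the same record as the model ID makes it hard to add a cloud option without also writing down what it costs the user.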

The picker doesn't avoid the decision; it moves it to the person who has the right information to make it. The product manager in me cringed at exposing a "model selection" UI to consumer users. The engineer in me realized that the alternative — picking one model for everyone — meant always being wrong for somebody.

"Intentional model selection" is one of the Gemma 4 Challenge's judging criteria. I'd bet that on most submissions, that intention lives in the writeup, not in the product. In mine, it lives in the user's first click.

If you're starting a Gemma 4 build right now, save yourself the 72 hours and start there.

The app is PocketCFO — open source, MIT. Drop a CSV bank statement and pick a model. Built for the Gemma 4 Challenge. Live demo · code.

Tags: #gemmachallenge #ai #webgpu #javascript
