I Built an AI Receipt Scanner with Gemma 4 — As an SDET with No Dev Background

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

My Honest Starting Point

I want to be upfront about something before diving in: I am an SDET, a Software Development Engineer in Test. My world is test automation, quality assurance, and finding bugs, not building apps from scratch. I have spent the last few weeks learning AI fundamentals, experimenting with different models, and trying to understand how this whole ecosystem actually works beneath the surface.
When I came across this hackathon, my first instinct was to scroll past it. This is for developers. But then I thought — why not? The worst that happens is I learn something.
What followed was equal parts confusion, accidental discovery, and a working app I am genuinely proud of.

The Gemini vs Gemma Confusion (I Suspect I'm Not Alone)

My first stop was Google AI Studio. And honestly? It was fantastic for getting ideas off the ground quickly. I built a small app there, got a feel for prompt engineering, and started to understand how multimodal models work.
But there was a problem: every time I tried to use Gemma 4, Google AI Studio kept routing me to Gemini Flash Preview — the latest hosted model. No matter what I selected, it defaulted back to Gemini.
I spent an embarrassing amount of time thinking I was using Gemma 4 when I wasn't.
That confusion forced me to actually sit down and research the difference. And that is when it clicked:

Gemma is not a smaller Gemini. They share research lineage, but the deployment story is completely different. Once I understood that, everything else fell into place.

What I Built: ReceiptMind

ReceiptMind is an AI-powered receipt scanner that extracts structured data from receipt photos and builds an expense dashboard automatically.
You take a photo of a receipt — any receipt, any store — upload it, and Gemma 4 reads the image and returns:

Merchant name
Total amount
Date
Expense category (Food & Dining, Groceries, Transport, Healthcare, Entertainment)
Tax amount

No manual entry. No OCR pipeline. No template matching. Just Gemma 4 looking at the image and understanding it.
This started as a feature I wanted to add to a personal finance app I have been quietly building on the side. The hackathon gave me the deadline I needed to actually ship something.

Why Gemma 4 26B MoE — Not the Other Models

This is the question I care most about answering, because I made this choice deliberately.
The Gemma 4 family has four models: E2B, E4B, the 26B MoE (A4B), and the 31B dense.

I chose the 26B MoE (A4B) for two specific reasons:

1. It is the only model in the family with native image input.
ReceiptMind's entire value is reading receipt photos. Without multimodal vision, there is no product. The E2B and E4B are text-only. The 31B dense is text-only. Only the 26B MoE can receive an image and reason about what it sees.

2. Despite 26B total parameters, only 4B activate per token.
This is the Mixture-of-Experts efficiency. The model routes each token through only the most relevant expert layers, so I get near-31B quality visual reasoning at a fraction of the compute cost. For a hobby project running on a free API tier, this matters enormously.
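
To make the "only some experts activate per token" idea concrete, here is a toy top-k routing sketch. It is a generic illustration of Mixture-of-Experts gating, not Gemma 4's actual router: the function name `routeToken`, the expert count, and the scores are all made up for the example.

```javascript
// Toy top-k routing: pick the k highest-scoring experts for one token
// and turn their scores into mixing weights. Illustrative only.
function routeToken(routerScores, k) {
  // Pair each score with its expert index, sort by score, keep the top k.
  const ranked = routerScores
    .map((score, expert) => ({ expert, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  // Softmax over the selected experts only, so the weights sum to 1.
  const maxScore = ranked[0].score;
  const exps = ranked.map((r) => Math.exp(r.score - maxScore));
  const sum = exps.reduce((a, b) => a + b, 0);

  return ranked.map((r, i) => ({ expert: r.expert, weight: exps[i] / sum }));
}

// Hypothetical router scores for 8 experts; only 2 run for this token.
const chosen = routeToken([0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2], 2);
console.log(chosen);
```

The other six experts do no work for this token, which is where the compute saving comes from.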

I also used the 256K context window to pass multiple receipts in a single prompt when generating monthly spending insights — no chunking, no retrieval, just the full history in one shot.

The Tech Stack

Frontend → HTML + Vanilla JavaScript
Backend → Node.js + Express
AI → Gemma 4 26B MoE via OpenRouter (free tier)
Database → Neon Postgres (serverless)
File Upload → Multer (in-memory buffer → base64)

Why OpenRouter?
Google AI Studio kept routing me to Gemini Flash. OpenRouter gave me direct access to google/gemma-4-26b-a4b-it:free with no credit card and no routing surprises. Once I found it, the API worked on the first try.

How It Works — The Architecture

User uploads receipt image
↓
Express backend receives file via multer
↓
Image converted to base64
↓
Sent to OpenRouter → Gemma 4 26B MoE (multimodal)
↓
Gemma reads the image, returns structured JSON
↓
JSON saved to Neon Postgres
↓
Dashboard updates with new receipt + running totals
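
The "image converted to base64" step in the flow above can be sketched roughly like this. It assumes multer is configured with in-memory storage, so the upload arrives as a Node `Buffer`. The helper name `toDataUrl` and its checks are my own illustration, not necessarily what the repo does.

```javascript
// Convert an in-memory upload (for example req.file.buffer from multer's
// memory storage) into a base64 data URL for the image_url content block.
// toDataUrl is a hypothetical helper name used for illustration.
function toDataUrl(buffer, mimeType) {
  if (!Buffer.isBuffer(buffer) || buffer.length === 0) {
    throw new Error("Expected a non-empty image buffer");
  }
  if (!/^image\//.test(mimeType)) {
    throw new Error(`Unsupported MIME type: ${mimeType}`);
  }
  return `data:${mimeType};base64,${buffer.toString("base64")}`;
}

// Tiny stand-in buffer; a real call would pass the uploaded file's bytes.
console.log(toDataUrl(Buffer.from("hi"), "image/png"));
```

Rejecting non-image MIME types before calling the model keeps a wasted API request off the free tier.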

The Core API Call

Here is the exact call that makes ReceiptMind work. The key is the image_url content block — this is what tells Gemma 4 to look at the receipt image:

const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "google/gemma-4-26b-a4b-it:free",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: `data:${mimeType};base64,${base64Image}` }
          },
          {
            type: "text",
            text: `Look at this receipt and extract the data. 
            Reply ONLY with JSON:
            {
              "merchant": "store name",
              "amount": 12.50,
              "date": "2026-05-14",
              "category": "Food & Dining",
              "tax": 1.10
            }`
          }
        ]
      }
    ]
  })
});

No OCR library. No preprocessing. No regex parsing of receipt text. Gemma reads the image directly, much as a person would, and returns structured data.
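
The call above stops at the HTTP response. As a minimal sketch of the next step, the function below pulls the text out of an OpenAI-style chat completion body (`choices[0].message.content`, the shape OpenRouter returns) and checks the five fields before anything is saved. The name `parseReceiptReply` and the validation rules are assumptions for illustration, and it assumes the reply is raw JSON with no markdown fences.

```javascript
// Categories listed earlier in this post; treated here as the allowed set.
const CATEGORIES = [
  "Food & Dining", "Groceries", "Transport", "Healthcare", "Entertainment",
];

// Hypothetical helper: extract and validate the receipt from the parsed
// JSON body of the chat completion response.
function parseReceiptReply(apiResponse) {
  const content = apiResponse?.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("Model reply has no text content");
  }
  const data = JSON.parse(content); // assumes raw JSON, no code fences

  // Collect every bad field so one log line shows all the problems.
  const problems = [];
  if (typeof data.merchant !== "string" || data.merchant.trim() === "") problems.push("merchant");
  if (!Number.isFinite(data.amount) || data.amount < 0) problems.push("amount");
  if (!/^\d{4}-\d{2}-\d{2}$/.test(data.date)) problems.push("date");
  if (!CATEGORIES.includes(data.category)) problems.push("category");
  if (!Number.isFinite(data.tax) || data.tax < 0) problems.push("tax");

  if (problems.length > 0) {
    throw new Error(`Invalid receipt fields: ${problems.join(", ")}`);
  }
  return data;
}
```

In the route handler this would be called as `parseReceiptReply(await response.json())`, with the thrown error turned into a 4xx/5xx response.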

Test Receipts — What Gemma 4 Had to Handle

I tested with 5 real-world receipt types, each chosen to stress a different extraction challenge.

The gas receipt was the toughest — 42.45L @ $1.649/L = $69.96. Gemma extracted both the unit price and total correctly without any hints.

The entertainment receipt had a discount applied before tax, which changes the subtotal calculation. Gemma handled it correctly.

What Broke (And How I Fixed It)

Problem 1: The request would hang indefinitely
The free-tier model on OpenRouter can be slow during peak hours. I added a 30-second AbortController timeout so the frontend shows a proper error instead of spinning forever.

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000);
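
The snippet above only creates the controller and the timer. For the timeout to take effect, the controller's `signal` has to be passed to `fetch`, and the timer should be cleared once the request settles. A minimal sketch, assuming Node 18+ with the global `fetch`; the wrapper name `fetchWithTimeout` and the injectable `fetchImpl` parameter are my additions for illustration and testing:

```javascript
// Hypothetical wrapper: abort the request if it takes longer than timeoutMs.
async function fetchWithTimeout(url, options = {}, timeoutMs = 30000, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // The signal is what actually connects the timer to the request.
    return await fetchImpl(url, { ...options, signal: controller.signal });
  } catch (err) {
    // fetch rejects with an error named "AbortError" when the signal fires.
    if (err.name === "AbortError") {
      throw new Error(`Request timed out after ${timeoutMs} ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer); // avoid a stray timer after a fast response
  }
}
```

The backend can catch the timeout error and return a clear status code, so the frontend has something specific to show.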

Problem 2: Gemma sometimes wraps JSON in markdown code fences
The model would return the JSON wrapped in a ```json ... ``` fence instead of raw JSON. Fixed with a one-liner:

const clean = rawText.replace(/```json|```/g, "").trim();
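
A slightly more defensive variant of that cleanup is sketched below. It pulls out whatever sits inside the first fenced block (with or without the `json` tag) and turns a parse failure into a readable error. The names `stripCodeFences` and `parseModelJson` are hypothetical, and the fence marker is built with `repeat` only so the example does not contain a literal fence.

```javascript
// Three backticks, built programmatically to keep this example readable.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Return the contents of the first fenced block, or the text itself
// when the model replied with raw JSON.
function stripCodeFences(rawText) {
  const match = rawText.match(FENCED_BLOCK);
  return (match ? match[1] : rawText).trim();
}

// Parse the model reply, reporting a short preview if it is not JSON.
function parseModelJson(rawText) {
  const clean = stripCodeFences(rawText);
  try {
    return JSON.parse(clean);
  } catch (err) {
    throw new Error(`Model reply was not valid JSON: ${clean.slice(0, 80)}`);
  }
}
```

This also copes with a reply that adds a sentence before or after the fenced block, which a plain string replace would leave in place.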

Problem 3: I had no idea what was failing
As an SDET, my instinct was to add logging everywhere. I added console.log checkpoints at every step (file received → base64 converted → API called → response received → JSON parsed → DB saved). This immediately showed me exactly where things were failing during development.
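
One way to keep those checkpoints consistent is a tiny logger factory, sketched here. The name `createCheckpointLogger`, the request id, and the line format are illustrative assumptions; plain `console.log` calls work just as well.

```javascript
// Hypothetical helper: numbered checkpoints with elapsed time per request.
function createCheckpointLogger(requestId, log = console.log) {
  const startedAt = Date.now();
  let step = 0;
  return function checkpoint(label, detail = "") {
    step += 1;
    const elapsed = Date.now() - startedAt;
    const suffix = detail ? ` ${detail}` : "";
    const line = `[${requestId}] step ${step}: ${label} (+${elapsed} ms)${suffix}`;
    log(line);
    return line;
  };
}

// Usage inside an upload handler (labels follow the steps listed above).
const checkpoint = createCheckpointLogger("req-1");
checkpoint("file received");
checkpoint("base64 converted");
checkpoint("API called");
```

The last checkpoint printed before an error shows which step failed, and the elapsed time makes slow free-tier responses easy to spot.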

The Expense Dashboard

After scanning receipts, the dashboard shows:

Running total across all receipts
Breakdown by category (Food & Dining, Groceries, Transport, etc.)
Full receipt log with merchant, amount, date, and category
Per-receipt tax tracking (useful for expense reports)

After scanning all 5 test receipts, the dashboard showed a combined $369.70 across 5 categories — exactly matching the manual totals.
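
The running total and the category breakdown boil down to a small aggregation. The sketch below sums in integer cents to avoid floating-point drift. The function name `summarizeReceipts` and the sample rows are made up for illustration and are not my actual test receipts.

```javascript
// Hypothetical aggregation over saved receipts: overall total plus
// a per-category breakdown, computed in cents and converted back.
function summarizeReceipts(receipts) {
  let totalCents = 0;
  const centsByCategory = {};

  for (const { amount, category } of receipts) {
    const cents = Math.round(amount * 100);
    totalCents += cents;
    centsByCategory[category] = (centsByCategory[category] || 0) + cents;
  }

  const byCategory = {};
  for (const [category, cents] of Object.entries(centsByCategory)) {
    byCategory[category] = cents / 100;
  }
  return { total: totalCents / 100, byCategory };
}

// Made-up sample rows, shaped like the JSON Gemma returns.
const summary = summarizeReceipts([
  { amount: 12.5, category: "Food & Dining" },
  { amount: 40.1, category: "Transport" },
  { amount: 7.25, category: "Food & Dining" },
]);
console.log(summary);
```

The same result could come from a `GROUP BY category` query in Postgres; doing it in JavaScript is just easier to show here.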

What This Means for My Pet Project

ReceiptMind started as one feature of a larger personal finance app I have been building. The plan is to integrate it so users can:

Scan receipts throughout the month
Get AI-generated spending summaries ("You spent 40% more on dining this month")
Set budget alerts by category
Export expense reports for tax season

The 256K context window is what makes the spending insight feature viable — I can pass the entire month's worth of receipts in one prompt and ask Gemma to reason across all of them at once.
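
As a rough sketch of what "the entire month in one prompt" could look like, the function below flattens saved receipts into one line each and wraps them in a single instruction. The name `buildInsightsPrompt`, the line format, and the wording of the instruction are assumptions for illustration, not the exact prompt I use.

```javascript
// Hypothetical prompt builder: one line per receipt, one request per month.
function buildInsightsPrompt(receipts, monthLabel) {
  const lines = receipts.map(
    (r) =>
      `${r.date} | ${r.merchant} | ${r.category} | ${r.amount.toFixed(2)} | tax ${r.tax.toFixed(2)}`
  );
  return [
    `Here are all ${receipts.length} receipts for ${monthLabel}, one per line`,
    "(date | merchant | category | amount | tax):",
    ...lines,
    "",
    "Summarize spending by category and point out anything unusual.",
    "Reply in at most 5 short bullet points.",
  ].join("\n");
}

// Made-up example rows.
const prompt = buildInsightsPrompt(
  [
    { date: "2026-05-02", merchant: "Sample Cafe", category: "Food & Dining", amount: 12.5, tax: 1.1 },
    { date: "2026-05-09", merchant: "Sample Fuel", category: "Transport", amount: 40.1, tax: 3.2 },
  ],
  "May 2026"
);
console.log(prompt);
```

The resulting string goes into a plain text content block, with no image attached for this call.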

What I Learned as an SDET Doing This

A few things surprised me:

1. Prompt engineering is just test case design.
Writing a good prompt felt exactly like writing a good test spec — be precise, cover the edge cases, define the expected output format. The skills transferred more than I expected.

2. The model choice matters more than I thought.
I initially assumed any capable model would work. But switching from text-only to multimodal was the difference between having a product and not having one.

3. The confusion between Gemini and Gemma is real.
If you are just getting started, burn this into your memory: Gemma = open weights you can run yourself or reach through a third-party host such as OpenRouter. Gemini = Google's hosted API. They are different products built from related research.

4. Ship something small and real.
I could have tried to build the full personal finance app. Instead I picked one feature, made it work end-to-end, and learned more in a week than I had in the previous month of reading documentation.

GitHub Repository
🔗 https://github.com/Hema-Nambi/ReceiptMind

Try It Yourself

Requirements:

Node.js 18+
Free OpenRouter account → openrouter.ai
Free Neon database → neon.tech

git clone https://github.com/Hema-Nambi/receiptmind
cd receiptmind
npm install
# Add your keys to .env
node server.js
# Open http://localhost:3000

Built during the Gemma 4 Challenge — May 2026
