Two weeks ago, I built a RAG pipeline on my phone. Termux. Gemma 4 E2B. A Python script that took my lecture notes and turned them into a private AI tutor I could interrogate offline. It worked. It was slow. It was fragile. But it worked.
Then Google dropped an entire family update, and I realized I'd been running the equivalent of a beta test.
After digging through the architecture docs and benchmarks that have come out since the release, I revisited my original build to answer one question: if I were starting fresh today, what would I actually do differently?
What's New Under the Hood
The Gemma 4 family now has four variants, and the architecture decisions baked into them directly address the pain points I hit in my original build .
The E2B model I used runs on something called Per-Layer Embeddings (PLE). Instead of one massive lookup table sitting at the start of the model eating up RAM, PLE distributes compressed mini-lookups across every decoder layer . The result is a 2.3B effective parameter model that fits in under 1.5GB of RAM with aggressive quantization . On a phone with 4GB RAM, that's the difference between a model that runs and one that crashes mid-inference.
The context window on my E2B was 128K tokens. That was enough for one textbook chapter, but not a full semester. The new 26B MoE and 31B Dense variants push to 256K—enough to drop an entire codebase or all seven of my course materials into a single prompt . And they achieved this without the quality cliff that plagued earlier long-context attempts, scoring 66.4% on the RULER benchmark compared to Gemma 3's 13.5% .
But the single biggest change? The license. Gemma 4 is now Apache 2.0 . No usage restrictions. No commercial ambiguity. I can build a product on this and sell it without a lawyer on retainer. That's not a technical detail—that's a business unlock.
What I'd Build Differently Today
My original RAG pipeline was a single Python script with hardcoded file paths and no persistent memory. Every time I restarted Termux, it forgot everything. If I were rebuilding today, here's what I'd change:
- I'd use Ollama's native API instead of spawning subprocesses.
My original build shelled out to Ollama via Python's subprocess module. It worked, but it was janky. The proper approach—documented in the Haystack cookbook—is to use OllamaChatGenerator directly with think=False for RAG queries, which disables extended reasoning to keep answers fast . Cleaner code. Fewer moving parts.
- I'd add persistent memory across sessions.
One of the best technical breakdowns I found is a guide to adding genuine cross-session memory to a local AI setup using LM Studio and a plugin called Big RAG . The concept is simple: maintain a chat_memory.json file that stores summaries of past interactions, and inject them into the prompt alongside retrieved document chunks. The implementation involves modifying a promptPreprocessor.ts file to pull recent conversation history and past session summaries before assembling the final prompt . If I were rebuilding my phone pipeline, I'd port this concept to Python—a simple JSON file that remembers what I've asked across sessions. The code wouldn't be trivial, but the principle is sound.
- I'd target the E4B model instead of E2B.
The E4B has 4.5B effective parameters and fits in 4-6GB of RAM . On a phone with 8GB RAM, that leaves room for the OS and background apps while delivering meaningfully better reasoning than the E2B I used. The hardware decision tree from the Dev.to comparison piece is clear: if you're on mobile, go E2B. If you're on a laptop CPU, go E4B . For my use case—a dedicated Android device running Termux—the E4B is the sweet spot.
What's Still Hard
Let me be honest about what hasn't changed. Running any model locally on a phone generates heat. After 20 minutes of continuous inference, my device throttles. Android's memory management is aggressive—if you switch away from Termux for too long, the OS kills the Ollama process. And the setup isn't plug-and-play. You're compiling packages, configuring webhooks, debugging Python scripts in a terminal on a 6-inch screen.
That friction is real. But so is the payoff: a private AI that runs offline, costs nothing per query, and doesn't send your data anywhere.
The Bigger Picture
Gemma 4's Apache 2.0 license signals something important: Google is serious about local-first, commercially-viable open models . They're not just releasing a research artifact. They're releasing infrastructure for builders who can't afford cloud API bills.
For anyone building in Nigeria, India, Brazil—anywhere the internet is unreliable or expensive this matters. The tools are getting smaller, faster, and legally safer to build on.
I'm rebuilding my pipeline this week. E4B. Persistent memory. Native API calls. I'll report back with what breaks.




















