NeuralPocket: Private On-Device AI with Gemma 4 — Android & Web

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

NeuralPocket — a private multimodal AI assistant that runs entirely on your device. Available as both an Android app and a web app. No cloud, no subscription, no data leaving your hands.

Honest About My Motivation

I've participated in Google hackathons several times. Each time I built something real, put in the work — and each time walked away with just a participation badge 😄 This time I want to actually place, though I know there are plenty of strong projects out there!

So NeuralPocket is not a demo and not a proof-of-concept. It's a full-featured app with real architecture that solves a real problem.

The problem: modern AI assistants are brilliant — until you lose Wi-Fi. On a plane, in the mountains, roaming abroad, they become useless icons. And every message you type, every photo you send, flies off to someone else's servers.

Google gave me an extra push: the AI Edge Gallery app simply refused to install on my Android 9 phone, even though it runs a 64-bit OS (which matters, since LiteRT-LM only runs on 64-bit). Instead of giving up, I figured it out myself. That became the starting point for NeuralPocket.

I wanted an assistant that:

works fully offline — always, everywhere
never sends your data anywhere
understands text, photos, and audio — in one chat
runs on both Android and in the browser

What NeuralPocket Can Do

📷 Photo analysis — snap a menu in Japan → translation and context; photograph a broken part → repair advice; photograph a document → ask questions about it
🎤 Voice input — record up to 30 seconds, converted to WAV, processed on-device
💬 Multiple independent chats with different system prompts — "Translator", "Tech Assistant", "Personal Journal"
⚙️ Configurable context memory — 0–5 conversation pairs to balance coherence and context window
🎨 Markdown rendering — model responses display with full formatting: code, lists, emphasis
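
The context memory setting can be pictured as a sliding window over the chat history. Here is a minimal sketch, assuming a "pair" means one user message plus the assistant reply that followed it. The `ChatMessage` type and the `buildContext` helper are hypothetical names for illustration, not taken from the NeuralPocket source:

```typescript
// Hypothetical message shape; the real app's data model may differ.
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  text: string;
}

// Keep only the last `pairs` user/assistant exchanges from the history.
// `pairs` is clamped to the 0 to 5 range that the setting exposes.
function buildContext(messages: ChatMessage[], pairs: number): ChatMessage[] {
  const clamped = Math.max(0, Math.min(5, Math.floor(pairs)));
  // slice(-0) would return the whole array, so handle zero explicitly.
  if (clamped === 0) return [];
  // Assumes the history strictly alternates user, assistant, user, ...
  return messages.slice(-clamped * 2);
}

const chatHistory: ChatMessage[] = [
  { role: "user", text: "Translate 'menu' to Japanese" },
  { role: "assistant", text: "メニュー" },
  { role: "user", text: "And 'water'?" },
  { role: "assistant", text: "水" },
];

// With one pair of memory, only the latest exchange is sent with the next prompt.
const context = buildContext(chatHistory, 1);
console.log(context.map((m) => m.text)); // [ "And 'water'?", "水" ]
```

A smaller window leaves more of the context budget for the new prompt and any attached image or audio, at the cost of the model forgetting earlier turns.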

Demo

🎬 Android Demo Video (coming later)

🌐 Web Version (live)

Code

Both projects are fully open source:

🤖 Android (Kotlin + LiteRT-LM) → github.com/premananda108/NeuralPocket · download APK
🌐 Web (React 19 + TypeScript + WebGPU) → github.com/premananda108/NeuralPocketWeb

How I Used Gemma 4

Choosing the Model

I chose Gemma 4 E2B IT (2B parameters, ~2.6 GB) as the primary model for three reasons:

Native multimodal input — text, image, and audio in a single request, no workarounds needed
Compact size — fits on a mid-range Android phone with 4+ GB RAM
One model, two platforms — .litertlm for Android LiteRT-LM, .web.task for WebGPU in the browser

For devices with 6+ GB RAM, the app offers Gemma 4 E4B (~3.7 GB) as a more capable option. The 31B Dense model is overkill for on-device use cases for now.

Architecture: Two Platforms, One Model

┌─────────────────────────────────────────────────┐
│                  NeuralPocket                   │
├──────────────────────┬──────────────────────────┤
│     Android App      │        Web App           │
│       Kotlin         │  React 19 + TypeScript   │
├──────────────────────┼──────────────────────────┤
│   LiteRT-LM SDK      │  MediaPipe Tasks GenAI   │
│   (native runtime)   │  Web Worker + WebGPU     │
├──────────────────────┴──────────────────────────┤
│              Gemma 4 E2B IT / E4B IT            │
│            (running locally on device)          │
└─────────────────────────────────────────────────┘

Android: LiteRT-LM

Stack: Kotlin + Google AI Edge LiteRT-LM + CameraX + MVVM

The engine automatically selects the best available backend — GPU via Vulkan or OpenCL, falling back to CPU via XNNPack. Concurrent inference calls are serialized through a Mutex to prevent race conditions.
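
The Android app does this with a Kotlin Mutex. To keep all examples in one language, here is the same idea sketched in TypeScript as a promise chain. `InferenceLock` and `fakeGenerate` are hypothetical names, and no real model is involved:

```typescript
// Minimal promise-chain lock: each task starts only after the previous one settles.
class InferenceLock {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Swallow errors on the chain so one failed call does not block later ones.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

// Stand-in for a real generation call (hypothetical; it only waits and logs).
const log: string[] = [];
async function fakeGenerate(prompt: string, ms: number): Promise<string> {
  log.push(`start ${prompt}`);
  await new Promise<void>((resolve) => setTimeout(resolve, ms));
  log.push(`end ${prompt}`);
  return `reply to ${prompt}`;
}

const lock = new InferenceLock();
// Both calls are issued at once, but the second waits for the first to finish.
const done = Promise.all([
  lock.run(() => fakeGenerate("A", 20)),
  lock.run(() => fakeGenerate("B", 5)),
]);
done.then(() => console.log(log.join(", ")));
// start A, end A, start B, end B
```

Without the lock, the shorter call "B" would finish in the middle of "A", which is the kind of interleaving a single shared engine cannot tolerate.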

Key architectural decisions:

A single StateFlow<ChatUiState> as the source of truth — the UI only observes, never mutates directly
Chat history is written atomically via a temp file — no data loss on crash
The vision encoder loads only when an image is present — saves RAM
Preflight check on first launch: RAM, ABI, free storage — the app warns if the device doesn't meet the minimum requirements
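
The atomic history write from the list above follows a common pattern: write the full content to a temporary file, then rename it over the real one. The sketch below is in TypeScript rather than Kotlin and uses an in-memory stand-in for the file system so it runs anywhere. `FileOps`, `saveAtomically`, and `MemoryFs` are hypothetical names, and the guarantee on a real device depends on the rename being atomic on that file system:

```typescript
// Minimal file-system surface the sketch needs (hypothetical interface).
interface FileOps {
  write(path: string, data: string): void;
  rename(from: string, to: string): void;
  remove(path: string): void;
}

// Write to a sibling temp file first, then rename it over the target.
// Readers see either the old file or the complete new one, never a partial write.
function saveAtomically(fs: FileOps, path: string, data: string): void {
  const tmp = `${path}.tmp`;
  try {
    fs.write(tmp, data);
    fs.rename(tmp, path);
  } catch (err) {
    fs.remove(tmp); // best-effort cleanup of the partial temp file
    throw err;
  }
}

// In-memory stand-in that can simulate a failure in the middle of a write.
class MemoryFs implements FileOps {
  files: Record<string, string> = {};
  failNextWrite = false;

  write(path: string, data: string): void {
    if (this.failNextWrite) {
      this.failNextWrite = false;
      this.files[path] = data.slice(0, 3); // torn write: only a fragment lands
      throw new Error("simulated crash during write");
    }
    this.files[path] = data;
  }
  rename(from: string, to: string): void {
    const data = this.files[from];
    if (data === undefined) throw new Error(`missing ${from}`);
    this.files[to] = data;
    delete this.files[from];
  }
  remove(path: string): void {
    delete this.files[path];
  }
}

const disk = new MemoryFs();
saveAtomically(disk, "chat.json", '{"messages":1}');
disk.failNextWrite = true;
try {
  saveAtomically(disk, "chat.json", '{"messages":2}');
} catch {
  // The failed save never touched chat.json itself.
}
console.log(disk.files["chat.json"]); // {"messages":1}
```

Because the torn write only ever hits the temp file, the previous history stays readable after the failure.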

Performance:

GPU (Vulkan/OpenCL): ~15–30 tokens/sec
CPU-only (XNNPack): ~5–10 tokens/sec
Requirements: Android 8+, arm64, 4+ GB RAM
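
The requirements above are what the first-launch preflight check has to verify. A rough TypeScript sketch of that logic follows, assuming the device values have already been read from the system. The `DeviceInfo` shape, the `preflight` function, and the warning texts are hypothetical, not the app's actual code:

```typescript
// Hypothetical device snapshot; reading these values on Android is not modeled here.
interface DeviceInfo {
  sdkInt: number; // Android API level (Android 8.0 is API 26)
  abis: string[]; // supported ABIs, e.g. ["arm64-v8a"]
  totalRamGb: number;
  freeStorageGb: number;
}

const MIN_SDK = 26;
const MIN_RAM_GB = 4;
const MODEL_SIZE_GB = 2.6; // Gemma 4 E2B size quoted in this post

// Return a list of warnings; an empty list means the device passes.
function preflight(device: DeviceInfo): string[] {
  const warnings: string[] = [];
  if (device.sdkInt < MIN_SDK) {
    warnings.push("Android 8 or newer is required");
  }
  if (!device.abis.some((abi) => abi === "arm64-v8a")) {
    warnings.push("a 64-bit arm64 device is required");
  }
  if (device.totalRamGb < MIN_RAM_GB) {
    warnings.push("at least 4 GB of RAM is recommended");
  }
  if (device.freeStorageGb < MODEL_SIZE_GB) {
    warnings.push("not enough free storage for the model");
  }
  return warnings;
}

// An Android 9 (API 28) arm64 phone with only 3 GB of RAM.
const oldPhone: DeviceInfo = {
  sdkInt: 28,
  abis: ["arm64-v8a", "armeabi-v7a"],
  totalRamGb: 3,
  freeStorageGb: 10,
};
console.log(preflight(oldPhone)); // [ "at least 4 GB of RAM is recommended" ]
```

Returning warnings instead of a single pass/fail flag lets the app tell the user exactly which requirement is missing.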

All three screenshots were taken in airplane mode — no network, everything running locally:

Web: WebGPU Right in the Browser

Stack: React 19 + TypeScript + Vite + Tailwind CSS v4 + MediaPipe Tasks GenAI

All inference runs inside a Web Worker, so generation never blocks the UI and the interface stays responsive during streaming. Models are cached in OPFS (Origin Private File System): the first launch downloads ~2.6 GB, and every subsequent launch loads the model from local storage with no re-download.
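
The caching behaviour boils down to a cache-first load: check local storage, and only download when the model is missing. Below is a minimal sketch with the storage layer abstracted away. `ModelStore`, `loadModel`, and the example URL are hypothetical. In the browser the store could be backed by OPFS through `navigator.storage.getDirectory()`, and a real implementation would stream the file instead of holding gigabytes in memory:

```typescript
// Hypothetical storage surface; in the browser it could be backed by OPFS.
interface ModelStore {
  read(name: string): Promise<Uint8Array | undefined>;
  write(name: string, bytes: Uint8Array): Promise<void>;
}

type Downloader = (url: string) => Promise<Uint8Array>;

// Return the cached model if present; otherwise download once and cache it.
async function loadModel(
  store: ModelStore,
  download: Downloader,
  fileName: string,
  url: string,
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  const cached = await store.read(fileName);
  if (cached !== undefined) return { bytes: cached, fromCache: true };
  const bytes = await download(url);
  await store.write(fileName, bytes);
  return { bytes, fromCache: false };
}

// In-memory store and fake downloader so the sketch runs outside a browser.
class MemoryStore implements ModelStore {
  private items: Record<string, Uint8Array> = {};
  async read(name: string): Promise<Uint8Array | undefined> {
    return this.items[name];
  }
  async write(name: string, bytes: Uint8Array): Promise<void> {
    this.items[name] = bytes;
  }
}

let downloads = 0;
const fakeDownload: Downloader = async () => {
  downloads += 1;
  return new Uint8Array([1, 2, 3]); // tiny placeholder instead of a real model
};

const modelStore = new MemoryStore();
const runs = (async () => {
  const url = "https://example.com/model.bin"; // placeholder URL
  const first = await loadModel(modelStore, fakeDownload, "model.bin", url);
  const second = await loadModel(modelStore, fakeDownload, "model.bin", url);
  return [first.fromCache, second.fromCache];
})();
runs.then((flags) => console.log(flags, downloads)); // [ false, true ] 1
```

The second call never reaches the downloader, which is what lets later launches skip the network for the model file.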

Three model presets are supported: Gemma 4 E2B, Gemma 4 E4B, and the multimodal Gemma 3n E2B. You can also provide a custom model URL.

The web app is built as a PWA (Progressive Web App) — you can install it on your computer as a standalone app with one click from the browser, just like YouTube or other web services. Once installed, it appears in your app menu and opens in its own window without an address bar.

Web version in action (all computation happens locally in the browser via WebGPU):

Honest caveat about offline: after the first launch the app works without a network, but it's not fully self-contained out of the box. The MediaPipe runtime loads from jsDelivr and the fonts load from Google Fonts, so for guaranteed offline use you'd need to self-host those dependencies.

Honest caveat about multimodal in the web: at the time of development I couldn't find web-optimized multimodal builds of Gemma 4; the available versions only support text. However, I found a fully multimodal model from the previous generation, gemma-3n-E2B-it-int4-Web.litertlm, which accepts text, images, and audio directly in the browser. That became the third preset in the web version.

A note on how fast things move. While building NeuralPocket, Google released Gemini 3.5 Flash — and first impressions suggest it's a notable step up from 3.1. It handles complex multi-step tasks confidently: for example, it wrote a full test suite for the web version of NeuralPocket on the first try, something that used to take several iterations. It's remarkable how fast this space evolves — the world changes while you're still writing the article.

At this pace, in a year you might just need to download the latest Gemma and ask it to build the whole app itself. Probably. Maybe. 😄

Privacy as Architecture, Not Marketing

NeuralPocket sends nothing anywhere — not messages, not photos, not chat history, not analytics. This isn't a setting you toggle. It's a consequence of the architecture: there's no server that could receive anything. Works in airplane mode. No account, no subscription.

Summary: Android vs Web

Two apps, one idea — but different trade-offs:

	🤖 Android	🌐 Web
Installation	APK (~36 MB)	None — just open in browser
Install as app	✅ native	✅ PWA
Model	Gemma 4 E2B / E4B	Gemma 4 E2B / E4B, Gemma 3n E2B
Text chat	✅	✅
Photo input	✅	⚠️ Gemma 3n only
Audio input	✅	⚠️ Gemma 3n only
Offline	✅ after downloading models	⚠️ after first launch and downloading models
Performance	~15–30 tok/s (GPU), ~5–10 tok/s (CPU)	depends on the browser's WebGPU support
Requirements	Android 8+, arm64	Chrome / Edge with WebGPU
Multiple chats	✅	✅
Custom model	❌	✅ by URL

Need maximum multimodality and full offline? Go Android. Want to try it right now without installing anything? Go Web.

🤖 Download APK