Gemma 4 on Android: Tricks for Faster On-Device Inference
IBIYEMI Samu · 2026-05-24 · via DEV Community

When I tried building an on-device AI app with Gemma 4, the pitch was clear: model weights on the device, no server, no API calls, works offline. Getting it to actually run fast was a different problem.

This post covers what I learned working with LiteRT-LM 0.12.0 and Gemma 4 E2B on Android in Kotlin. Some of it is configuration. Some of it is understanding what the bottleneck actually is before reaching for a fix. If you're building with Gemma 4 E2B on Android and inference feels too slow to ship, here are the tricks that actually helped.

1. Basic Setup

Add the dependency:

// build.gradle (module level)
dependencies {
    implementation("com.google.ai.edge.litertlm:litertlm-android:0.12.0")
}


The model file itself comes from Hugging Face. The litert-community/gemma-4-E2B-it-litert-lm repository hosts the .litertlm format that LiteRT-LM expects. This is not a GGUF file. Using the wrong format will cause a silent failure on model load, so confirm the file extension before downloading.

The model is gated on Hugging Face, so you'll need an access token. A read token is enough. If your app handles the download directly (via DownloadManager or a similar mechanism), pass the token as an Authorization header in the request rather than entering it interactively. The LiteRT-LM Android API reference documents the full API.
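As a rough illustration of passing the token as a header, here is a minimal sketch using the JDK's `HttpURLConnection`. The function names `authHeaders`, `hasExpectedExtension`, and `downloadModel` are hypothetical helpers, not part of LiteRT-LM. The download URL is left as a parameter because the exact file name in the repository is not given here. The sketch assumes Hugging Face accepts the token in the usual `Authorization: Bearer` form.

```kotlin
import java.io.File
import java.net.HttpURLConnection
import java.net.URL

// Hypothetical helper: header map for a gated download (assumes Bearer-token auth).
fun authHeaders(token: String): Map<String, String> =
    mapOf("Authorization" to "Bearer $token")

// Hypothetical helper: guards against fetching a file in the wrong format.
fun hasExpectedExtension(fileName: String): Boolean =
    fileName.endsWith(".litertlm", ignoreCase = true)

// Streams the response body to `destination`; throws if the server does not answer 200.
// The caller supplies the URL; nothing here assumes a specific file name.
fun downloadModel(url: String, token: String, destination: File) {
    require(hasExpectedExtension(destination.name)) { "Expected a .litertlm file" }
    val connection = URL(url).openConnection() as HttpURLConnection
    try {
        authHeaders(token).forEach { (name, value) ->
            connection.setRequestProperty(name, value)
        }
        check(connection.responseCode == HttpURLConnection.HTTP_OK) {
            "Download failed with HTTP ${connection.responseCode}"
        }
        connection.inputStream.use { input ->
            destination.outputStream().use { output -> input.copyTo(output) }
        }
    } finally {
        connection.disconnect()
    }
}
```

On Android this has to run off the main thread, and for a file of this size a resumable mechanism such as DownloadManager is likely the better production choice.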

Initialize the engine:

val options = LlmInferenceOptions.builder()
    .setModelPath(modelPath)
    .setMaxTokens(512)
    .setTopK(40)
    .setTemperature(0.8f)
    .setRandomSeed(101)
    .build()

val llmInference = LlmInference.createFromOptions(context, options)


2. GPU Backend and Why It Silently Falls Back to CPU

LiteRT-LM supports three backends: CPU, GPU (via OpenCL), and NPU. GPU is where you get meaningful speed on Android.

The problem is that OpenCL support is not universal. Mid-range and budget chips from Qualcomm and MediaTek often don't expose OpenCL to the Android application layer. If you initialize with Backend.GPU() on one of these devices, the engine falls back to CPU without throwing an error by default.

If you don't log this, you'll spend time optimizing prompts thinking you're on GPU when you're not.

Make the fallback explicit and log which backend actually initialized:

try {
    val config = EngineConfig.builder()
        .setModelPath(modelPath)
        .setBackend(Backend.GPU())
        .build()
    engine = Engine(config)
    Log.d("Inference", "GPU backend initialized")
} catch (e: Exception) {
    val config = EngineConfig.builder()
        .setModelPath(modelPath)
        .setBackend(Backend.CPU())
        .build()
    engine = Engine(config)
    Log.d("Inference", "CPU fallback: ${e.message}")
}


On CPU with Gemma 4 E2B, expect roughly 2 to 5 tokens per second on mid-range hardware. On GPU-capable devices via OpenCL, LiteRT-LM benchmarks show around 52 tokens per second on a Samsung S26 Ultra. The delta between CPU and GPU is not incremental; it is a different category of usability.

If your target users are running budget Android devices, plan your UX around CPU speeds. Streaming tokens as they arrive, showing a "thinking" indicator early, and capping output length all reduce how slow it feels even when the hardware is constrained.
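To put numbers on that planning step, the arithmetic can be sketched as below. Both function names are hypothetical, and the estimate covers decode time only, so prefill comes on top of it.

```kotlin
// Estimated seconds to decode `tokens` at a given decode rate (prefill not included).
fun decodeSeconds(tokens: Int, tokensPerSecond: Double): Double {
    require(tokensPerSecond > 0.0) { "tokensPerSecond must be positive" }
    return tokens / tokensPerSecond
}

// Largest output cap that still fits inside a latency budget at a given decode rate.
fun maxTokensForBudget(budgetSeconds: Double, tokensPerSecond: Double): Int {
    require(budgetSeconds >= 0.0) { "budgetSeconds must not be negative" }
    require(tokensPerSecond > 0.0) { "tokensPerSecond must be positive" }
    return (budgetSeconds * tokensPerSecond).toInt()
}
```

With the figures above, a full 512-token response at 2 tokens per second takes about 256 seconds to decode, while the same response at 52 tokens per second takes under 10. A 30-second budget at 3 tokens per second leaves room for roughly 90 output tokens.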

One more thing on backends: NPU initialization is not just a silent fallback situation. On some devices, attempting Backend.NPU() can cause a native process crash (SIGKILL or SIGSEGV) due to driver fragmentation across Android hardware. If you want to expose NPU as an option, treat it as an experimental toggle rather than a default path, and always have the GPU-to-CPU chain as the safe baseline.
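One way to structure that chain is a small generic helper that tries named factories in order and reports which one succeeded. This is a sketch that deliberately avoids LiteRT-LM types: `InitResult` and `initWithFallback` are hypothetical names, and each factory lambda would wrap whatever engine construction your app uses.

```kotlin
// Which backend won, plus the engine object its factory produced.
data class InitResult<T>(val backendName: String, val engine: T)

// Tries each (name, factory) pair in order and returns the first one that succeeds.
// Only exceptions are handled here; a native crash still kills the process.
fun <T> initWithFallback(
    candidates: List<Pair<String, () -> T>>,
    log: (String) -> Unit = { println(it) },
): InitResult<T> {
    val failures = mutableListOf<String>()
    for ((name, factory) in candidates) {
        try {
            val engine = factory()
            log("Initialized backend: $name")
            return InitResult(name, engine)
        } catch (e: Exception) {
            failures += "$name: ${e.message}"
            log("Backend $name failed: ${e.message}")
        }
    }
    throw IllegalStateException("No backend initialized. " + failures.joinToString("; "))
}
```

Because a SIGSEGV in a driver cannot be caught this way, an experimental NPU option is better kept out of the default candidate list and gated behind an explicit user setting.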

3. Prefill Is the First Bottleneck, Not Decoding

Most discussions about LLM inference speed focus on decode speed (tokens per second). On mobile, the more immediate pain point is often prefill: the time before the model generates the first token.

Prefill time grows with the number of tokens in your input prompt. Every character you inject into the system prompt has to be processed before generation starts. If you're doing context injection (pasting a document or manual into the prompt), this cost hits on every single query.

A rough example. A 50,000 character document injected into a system prompt is approximately 12,000 to 15,000 tokens. On CPU, processing that input alone takes several seconds before the model produces anything. A user taps submit and waits in silence.

Gemma 4 E2B supports a 128K context window, and that number is real. But mobile hardware is bound by prefill latency and KV cache limits long before you hit 128K. The theoretical capacity and the practical ceiling on a 4GB device are very different numbers.

Practical fixes:

Set a hard character budget on injected context and enforce it at the application layer:

val contextBudget = 6000 // characters, not tokens
val injectedContext = sourceDocument.take(contextBudget)


6,000 characters is roughly 1,500 tokens. That's enough context to be useful for most domain-specific queries while keeping prefill manageable on CPU.
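The character-to-token arithmetic, plus a slightly gentler cut than a raw `take`, can be sketched as follows. Both helpers are hypothetical, and the 4-characters-per-token ratio is only the rule of thumb implied by the numbers above; real tokenizers vary with language and content.

```kotlin
// Rough token estimate, assuming about 4 characters per token (rounds up).
fun estimateTokens(text: String, charsPerToken: Int = 4): Int {
    require(charsPerToken > 0) { "charsPerToken must be positive" }
    return (text.length + charsPerToken - 1) / charsPerToken
}

// Cuts text to at most `budget` characters, backing up to the last sentence end
// inside the budget when there is one, so the context does not stop mid-sentence.
fun truncateToBudget(text: String, budget: Int): String {
    require(budget >= 0) { "budget must not be negative" }
    if (text.length <= budget) return text
    val head = text.take(budget)
    val lastStop = head.lastIndexOfAny(charArrayOf('.', '!', '?', '\n'))
    return if (lastStop > 0) head.substring(0, lastStop + 1).trimEnd() else head
}
```

Calling `truncateToBudget(sourceDocument, 6000)` keeps the same hard ceiling while avoiding a dangling half sentence at the end of the injected context.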

If you're building a document Q&A feature, extract only the relevant section rather than injecting the full document. A keyword match or simple sentence scoring function in Kotlin can identify the most relevant passage and inject that instead of the whole file. This is not full RAG. It's a practical middle ground that works without vector databases.
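A minimal sketch of that keyword-overlap approach, assuming the document separates passages with blank lines. The names `tokenize` and `bestPassage` are hypothetical, and the scoring is plain term overlap with no stemming or weighting.

```kotlin
// Lowercased word tokens; words of two characters or fewer are dropped as noise.
fun tokenize(text: String): List<String> =
    Regex("[\\p{L}\\p{N}]+")
        .findAll(text.lowercase())
        .map { it.value }
        .filter { it.length > 2 }
        .toList()

// Splits the document on blank lines and returns the passage that shares the most
// distinct terms with the query, or null when no passage shares any term.
fun bestPassage(document: String, query: String): String? {
    val queryTerms = tokenize(query).toSet()
    if (queryTerms.isEmpty()) return null
    return document.split(Regex("\\n\\s*\\n"))
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .map { passage -> passage to tokenize(passage).toSet().count { it in queryTerms } }
        .filter { it.second > 0 }
        .maxByOrNull { it.second }
        ?.first
}
```

The selected passage can then be passed through the same character budget before injection. Returning null on zero overlap lets the app fall back to a short default context instead of injecting something irrelevant.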

4. Multi-Token Prediction: The Feature That Makes Gemma 4 Worth It on Mobile

Multi-Token Prediction (MTP) is one of the things that genuinely sets Gemma 4 apart from earlier versions for on-device use. It was introduced with the Gemma 4 model family specifically, and it changes what's achievable on mobile hardware in a meaningful way.

Standard autoregressive inference generates one token per forward pass. The processor moves model parameters from memory to compute units, generates one token, then does it again. On mobile hardware, the data movement cost dominates over the actual computation.

MTP uses speculative decoding to work around this. A lightweight drafter model proposes several tokens ahead of time. The primary model then verifies those proposals in a single parallel forward pass. If every proposed token is correct, the primary model accepts them all and generates one more. If the drafter was wrong at some position, the primary model rejects the draft from that point and takes over. Output quality doesn't change because the primary model has final say over every token.
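The accept-or-reject rule can be shown with a toy model over integer token ids. This is an illustration of greedy speculative decoding in general, not LiteRT-LM's implementation: `verifyDraft` and `targetNext` are hypothetical, and the loop calls the stand-in primary model once per position for readability, whereas a real engine checks all positions in one batched pass.

```kotlin
// One verification step of greedy speculative decoding.
// `targetNext` stands in for the primary model: given a prefix, it returns
// the token the primary model would emit next.
fun verifyDraft(
    prefix: List<Int>,
    draft: List<Int>,
    targetNext: (List<Int>) -> Int,
): List<Int> {
    val accepted = mutableListOf<Int>()
    for (proposed in draft) {
        val expected = targetNext(prefix + accepted)
        if (proposed != expected) {
            // First mismatch: keep the primary model's token, discard the rest of the draft.
            accepted += expected
            return accepted
        }
        accepted += proposed
    }
    // Every proposal matched, so the primary model contributes one extra token.
    accepted += targetNext(prefix + accepted)
    return accepted
}
```

With a stand-in model that always emits the last token plus one, a draft of `[2, 3, 4]` after prefix `[1]` yields four tokens from a single step, while a draft of `[2, 9, 4]` yields only two. That gap is why acceptance rate drives the real-world speedup.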

LiteRT-LM bundles the MTP drafter inside the same .litertlm model artifact. Both models run on the same hardware backend, sharing KV cache in local memory. This avoids the cross-device data transfer overhead that would otherwise cancel out part of the gain.

Google's benchmarks show up to a 2.2x decode speedup with MTP enabled on the GPU backend. Google's benchmark breakdown and its dedicated MTP announcement, which covers how the drafter was designed for the Gemma 4 family specifically, have the details. Enabling it is two lines of configuration:

val options = LlmInferenceOptions.builder()
    .setModelPath(modelPath)
    .setMaxTokens(512)
    .setUseMtp(true)       // enable MTP drafter
    .setTopK(40)
    .setTemperature(0.8f)
    .build()


The gains are more pronounced for predictable completions. For creative or open-ended generation where the drafter has low acceptance rates, the speedup is smaller. For structured or domain-constrained outputs, acceptance rates are higher and the gains are closer to the ceiling.

One important caveat: if you're on CPU, disable MTP. The 2.2x gain assumes parallel GPU execution where the drafter and target model run simultaneously. On CPU they run sequentially, and the overhead of running two models back to back outweighs the benefit. Check which backend actually initialized before deciding whether to enable it.

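// gpuInitialized is a hypothetical Boolean: set it to true in the GPU branch of the
// try/catch from section 2 and to false in the CPU fallback. Comparing against a
// freshly constructed Backend.GPU() may not be reliable.
val useMtp = gpuInitialized // only enable on GPU
val options = LlmInferenceOptions.builder()
    .setModelPath(modelPath)
    .setUseMtp(useMtp)
    .build()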


5. Thinking Mode: When to Use It and When Not To

Gemma 4 supports a reasoning mode where the model generates an internal scratchpad before producing its final response. LiteRT-LM exposes this directly. The reasoning output appears between <|think|> and </think> tags in the stream.

Thinking mode improves output quality for multi-step or diagnostic tasks. It costs tokens. On CPU, those extra 200 to 400 reasoning tokens represent meaningful latency before the user sees a final answer.

The practical approach: enable thinking on tasks where accuracy matters, disable it on conversational turns where it doesn't.

fun buildSystemPrompt(requiresReasoning: Boolean): String {
    return if (requiresReasoning) {
        "<|think|> You are a diagnostic expert. Think through this step by step before answering."
    } else {
        "You are a helpful assistant. Answer clearly and concisely."
    }
}


If you're displaying the thinking stream in the UI (as a visible "reasoning" component), the latency becomes part of the experience rather than dead time. The user sees the model working. This matters more on CPU where the stream is slow enough to read.

If you strip thinking tokens before displaying the response, you're paying the token cost with no UX return. In that case, disable it.
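If you do keep thinking mode and want to route the scratchpad to its own UI component, the split can be sketched like this. `ParsedResponse` and `splitThinking` are hypothetical names, and the default tags are simply the ones quoted earlier in this section; pass different tags if your model build emits something else.

```kotlin
// Reasoning scratchpad and final answer, separated for display.
data class ParsedResponse(val reasoning: String, val answer: String)

// Splits accumulated output into the reasoning block and the remaining answer text.
fun splitThinking(
    raw: String,
    openTag: String = "<|think|>",
    closeTag: String = "</think>",
): ParsedResponse {
    val start = raw.indexOf(openTag)
    if (start < 0) return ParsedResponse("", raw.trim())
    val bodyStart = start + openTag.length
    val end = raw.indexOf(closeTag, bodyStart)
    // An unclosed block means the stream is still inside the reasoning phase.
    if (end < 0) return ParsedResponse(raw.substring(bodyStart).trim(), "")
    val answer = raw.substring(0, start) + raw.substring(end + closeTag.length)
    return ParsedResponse(raw.substring(bodyStart, end).trim(), answer.trim())
}
```

Running it on the accumulated stream after each chunk lets the UI show reasoning as it arrives and switch to the answer once the closing tag appears.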

6. Constrained Decoding for Structured Output

LiteRT-LM supports constrained decoding, which enforces a JSON schema on the model's output. Instead of parsing free text and hoping the model follows your format instructions, you define the schema and the engine guarantees compliance.

This is useful for any feature that needs to render structured results rather than prose. A diagnosis card, a checklist, a decision tree. The model produces valid JSON every time.

val schema = """
{
  "type": "object",
  "properties": {
    "diagnosis": { "type": "string" },
    "confidence": { "type": "string", "enum": ["high", "medium", "low"] },
    "action": { "type": "string" },
    "escalate": { "type": "boolean" }
  },
  "required": ["diagnosis", "confidence", "action", "escalate"]
}
"""

val options = LlmInferenceOptions.builder()
    .setModelPath(modelPath)
    .setResponseSchema(schema)
    .setMaxTokens(300)
    .build()


The max tokens value matters here. A constrained JSON response is short. Setting a generous 2,000 token budget for a response that will always be under 100 tokens keeps the KV cache allocated longer than necessary. Set it tight.

7. Session Save and Restore

LiteRT-LM supports serializing and restoring the KV cache state across sessions. For applications with persistent context (a long document loaded once, or a multi-turn workflow), this means the prefill phase only happens once. On return sessions, the engine restores the cached state and skips the expensive input processing step.

For document Q&A specifically, this is worth implementing. The user loads a document, the prefill runs once and the state is serialized to disk. Every subsequent question in that session resumes from the cached state rather than reprocessing the document from scratch. The Google AI Edge Gallery app is the most complete open-source example of session management in a real LiteRT-LM application.
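One piece of this that does not depend on the engine API is deciding where a document's saved state lives and when it is stale. The sketch below keys the state file by a hash of the document text, so an edited document never picks up an old cache. `documentKey`, `sessionFile`, and the `.bin` suffix are hypothetical choices, and the actual serialize and restore calls are left out because they are not shown here.

```kotlin
import java.io.File
import java.security.MessageDigest

// Hex SHA-256 of the document text; any change to the document changes the key.
fun documentKey(document: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(document.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

// Where the serialized state for this document would be stored under `cacheDir`.
fun sessionFile(cacheDir: File, document: String): File =
    File(cacheDir, "session-${documentKey(document)}.bin")
```

If `sessionFile(...)` exists, restore from it; otherwise run prefill once and write the state there. Including the model version and system prompt in the hashed text would also invalidate the cache when either changes.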

Summary

| Technique | Where it helps | Implementation cost |
| --- | --- | --- |
| Log GPU vs CPU backend | Debugging, avoiding silent CPU fallback | Low |
| Reduce injected context | Prefill speed | Low |
| Enable MTP | Decode speed on GPU-capable devices | Very low (two lines) |
| Conditional thinking mode | Balancing quality vs latency per task | Low |
| Constrained decoding | Structured output reliability and token efficiency | Medium |
| Session save/restore | Repeated queries against same context | Medium |

The model download is the upfront cost. After that, LiteRT-LM gives you enough knobs to tune inference for the specific hardware and use case you're targeting. The techniques above don't require changes to model weights or training. They're all configuration and prompt engineering decisions available in 0.12.0.


For the full LiteRT-LM technical deep-dive from the Google AI Edge team, including iOS and WebGPU benchmarks, read the official post.