5 things `flutter_gemma` doesn't tell you about shipping Gemma 4 on Android
Manoj H M · 2026-05-24 · via DEV Community

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I shipped a Gemma 4 assistant on Android in 17 days. Voice, vision, RAG, eight device actions, everything offline once the model is on device. The project is called PocketClaw, and you can read about it here if that's what you came for.

This post is about the 5 things I had to figure out the hard way. Not in the flutter_gemma README. Not in Google's MediaPipe docs. Not in any of the half-dozen "run Gemma on Android" tutorials I read this week.

If you're about to ship Gemma 4 on Android, I hope this saves you a weekend.

📱 Companion post: How I built PocketClaw — a fully offline AI assistant on Android with Gemma 4 E2B. Demo video, architecture deep-dive, full source code.

1. Small models drop facts buried mid-prompt. Put what matters at the top.

I've been building agents on Claude and GPT-4 for about 18 months. Both of them handle a long system prompt fine. You can mix instructions and facts in any order, and the model figures out what's a fact versus what's a behavior rule.

Gemma 4 E2B doesn't.

My first system prompt for PocketClaw looked like this:

You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.


User asks "what's my name?" Claw answers "I do not know your name." Every time. The name is sitting right there in sentence three.

I spent about an hour staring at this. My theory after the fact is that "never restate the question" was acting as a dominant instruction, and the model generalized it to "don't reference any user context." That's the kind of overgeneralization a 2B-effective model does. Cloud LLMs don't.

The fix was structural, not lexical:

final namePart = (userName != null && userName.trim().isNotEmpty)
    ? "The name of the user is ${userName.trim()}.\n\n"
    : '';
final systemPreamble = '${namePart}You are Claw, ...';


Same fact. Moved to the first line of the prompt. On its own. Flat declarative sentence. Worked the first time I tried it.

The lesson generalizes. When you're working with a 2B-class model, anything you actually want the model to remember goes at the front of the prompt, in simple sentences, with no competing instructions in the same paragraph. The model is paying way more attention to the opening than to the middle. Treat the system prompt like a slot-filling template, not a paragraph.
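A minimal sketch of that slot-filling idea, assuming a hypothetical `buildSystemPrompt` helper (not part of PocketClaw or flutter_gemma). Facts become one flat sentence per line at the very top, unknown facts are dropped rather than left as empty slots, and the behavior rules follow after a blank line:

```dart
/// Builds a system prompt with facts first, one flat sentence per line,
/// followed by a blank line and then the behavior rules.
/// Hypothetical helper for illustration only.
String buildSystemPrompt({
  required Map<String, String> facts,
  required List<String> rules,
}) {
  final factLines = facts.entries
      .where((e) => e.value.trim().isNotEmpty) // skip unknown facts entirely
      .map((e) => 'The ${e.key} is ${e.value.trim()}.')
      .toList();

  final buffer = StringBuffer();
  if (factLines.isNotEmpty) {
    buffer.writeAll(factLines, '\n');
    buffer.write('\n\n'); // keep facts visually separate from rules
  }
  buffer.writeAll(rules, ' ');
  return buffer.toString();
}
```

Calling it with `facts: {'name of the user': 'Manoj Shetty'}` puts "The name of the user is Manoj Shetty." on the first line, the same shape as the fix above.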

2. Vanilla RAG breaks on the queries users actually type.

If you've worked with RAG before, you know the textbook setup. Chunk the document, embed the chunks, store the vectors, embed the query at retrieval time, find the closest matches. Works great on benchmarks where the queries are specific.

It doesn't work on "summarise the document."

I caught this on Friday. I'd been heads-down for two weeks. I thought I had a working build. I uploaded a PDF I had lying around (a paper on edge LLMs), typed "summarise the document", hit send. Claw said "Summarize the document." back to me. Just that. Like an echo. I tried twice more with different phrasing. Same answer.

Eventually I typed "summarise llmaiedge.pdf" with the actual filename. Got a real summary.

The problem is obvious once you see it. "Summarise the document" has almost no semantic overlap with the document text. The PDF doesn't contain the words "summarise" or "the document." Cosine similarity returns nothing useful, and the retrieved chunk list comes back empty or nearly so. Gemma gets the user's question with no actual document context attached. So it does what any LLM does when starved for context: it improvises. In my case, it echoed the question back.

The fix I shipped is two heuristics deep:

final isGenericIntent = hits.length <= 1 && (
    lower.contains('summari') ||
    lower.contains('tldr') ||
    lower.contains('explain') ||
    lower.contains('describe') ||
    lower.contains('the document') ||
    lower.contains('the pdf')
);

if (isGenericIntent) {
    hits = await RagService.instance.getDocStarts(
        conversationId: _conversation.id,
    );
}


getDocStarts is a small fallback method. It runs searchSimilar once per indexed doc, using each doc's filename as the query. Filenames are rare distinctive tokens. Every chunk in the vector store has the filename in metadata. So this pulls back real chunks regardless of how the user phrased their question.

Two small conditionals. The difference between "Claw can summarise PDFs" and "Claw echoes your question back at you."

If you're building RAG on Gemma 4 (or any small on-device model really), test it with generic queries before you ship. Your textbook similarity search will fail in ways that look like the model is broken.
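The fallback can be sketched without a vector store at all. The version below is a simplified stand-in, not the `getDocStarts` described above: instead of running a similarity search with each filename as the query, it groups chunks by their filename metadata and returns the first few of each document. The `Chunk` class, `isGenericIntent`, and `docStarts` are hypothetical names for illustration.

```dart
/// A stored chunk with the filename kept as metadata (hypothetical shape).
class Chunk {
  const Chunk(this.filename, this.index, this.text);
  final String filename;
  final int index; // position of the chunk within its document
  final String text;
}

/// True when retrieval came back nearly empty AND the query reads like a
/// request about the document as a whole rather than about its content.
bool isGenericIntent(String query, int hitCount) {
  const markers = [
    'summari', 'tldr', 'explain', 'describe', 'the document', 'the pdf',
  ];
  final lower = query.toLowerCase();
  return hitCount <= 1 && markers.any((m) => lower.contains(m));
}

/// Simplified fallback: the first [perDoc] chunks of every indexed document,
/// grouped by filename and ordered by position within each document.
List<Chunk> docStarts(List<Chunk> store, {int perDoc = 2}) {
  final byFile = <String, List<Chunk>>{};
  for (final chunk in store) {
    byFile.putIfAbsent(chunk.filename, () => []).add(chunk);
  }
  return [
    for (final chunks in byFile.values)
      ...(chunks..sort((a, b) => a.index.compareTo(b.index))).take(perDoc),
  ];
}
```

Because both functions are pure, generic queries like "summarise the document" can be covered by plain unit tests before anything runs on a device.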

3. The flutter_gemma plugin bundles ~33 MB of native libraries you probably don't use.

Stock APK for PocketClaw came out at 185 MB. That felt heavy.

When I unzipped it and looked at the native libs (arm64-v8a only, I was already shipping a single-arch build), this is what I saw:

26 MB  libllm_inference_engine_jni.so                     (needed)
24 MB  libLiteRtLm.so                                     (needed)
17 MB  libgemma_embedding_model_jni.so                    (don't use: using Gecko)
17 MB  libgecko_embedding_model_jni.so                    (needed)
14 MB  libmediapipe_tasks_vision_jni.so                   (needed: vision input)
14 MB  libmediapipe_tasks_vision_image_generator_jni.so   (NOT USED)
10 MB  libimagegenerator_gpu.so                           (NOT USED)
 9 MB  libtext_chunker_jni.so                             (needed)
 8 MB  libLiteRtGpuAccelerator.so                         (needed)
 8 MB  libLiteRtWebGpuAccelerator.so                      (NOT USED: Android has OpenCL)


The image-generation libs belong to MediaPipe's image generator task, which does text-to-image generation. PocketClaw only consumes images (vision input to multimodal Gemma) and never generates them. The WebGPU accelerator targets browsers; on Android the GPU path goes through OpenCL. None of it does anything on my target platform.

Four lines in android/app/build.gradle.kts:

packaging {
    jniLibs {
        excludes.addAll(listOf(
            "**/libimagegenerator_gpu.so",
            "**/libmediapipe_tasks_vision_image_generator_jni.so",
            "**/libLiteRtWebGpuAccelerator.so",
            "**/libLiteRtTopKWebGpuSampler.so"
        ))
    }
}


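APK dropped from 185 MB to 152 MB. 33 MB cut. The three excluded libraries that appear in the listing above account for roughly 32 MB of that; the fourth exclude, libLiteRtTopKWebGpuSampler.so, isn't in the listing and presumably makes up the rest. Vision input still works, embedder still works, inference still works.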

If your use case is similar (chat + vision input + RAG, no image generation), copy these excludes. If your use case is different, say you actually want on-device image generation, leave the image-gen libs in. The point is: audit what your plugin pulls in and exclude what you don't use. flutter_gemma is built for a general capability surface, not minimum bytes on device.
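One way to run that audit is to list the APK with `unzip -l app.apk` and rank the `.so` entries by size. The sketch below parses that listing text; `NativeLib`, `parseNativeLibs`, and `sizeOf` are hypothetical names, and the parser assumes the usual Info-ZIP column layout (length, date, time, name).

```dart
/// One native library entry from an APK listing.
class NativeLib {
  const NativeLib(this.path, this.bytes);
  final String path;
  final int bytes;
}

/// Parses the text printed by `unzip -l app.apk` and returns the `.so`
/// entries, largest first. Lines that are not `.so` file entries are skipped.
List<NativeLib> parseNativeLibs(String unzipListing) {
  // Entry lines look like: "<length>  <date> <time>   <name>".
  final entry = RegExp(r'^\s*(\d+)\s+\S+\s+\S+\s+(.+\.so)\s*$');
  final libs = <NativeLib>[];
  for (final line in unzipListing.split('\n')) {
    final match = entry.firstMatch(line);
    if (match != null) {
      libs.add(NativeLib(match.group(2)!, int.parse(match.group(1)!)));
    }
  }
  libs.sort((a, b) => b.bytes.compareTo(a.bytes));
  return libs;
}

/// Total size in bytes of the libraries whose file name is in [names].
int sizeOf(List<NativeLib> libs, Set<String> names) => libs
    .where((lib) => names.contains(lib.path.split('/').last))
    .fold<int>(0, (sum, lib) => sum + lib.bytes);
```

The lengths `unzip -l` reports are uncompressed sizes, so the saving in the final APK can differ somewhat from the sum this returns.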

There's a second-order point here that matters more. MediaPipe is the reason flutter_gemma is so big. It's also the reason it does vision and audio at all. The llama.cpp-based alternatives ship at 30-60 MB on Android but skip multimodal entirely. So the choice is really: 152 MB with vision, or 60 MB without. There's no free lunch where you get multimodal for the size of a text-only stack. Pick based on what your product actually needs.

4. Don't feed the 128K context window. Compact it.

Gemma 4 has a 128K context window. Sounds great in theory. In practice it's a footgun.

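Every token in the prompt costs latency: the whole prompt has to be processed before the first output token appears, and a longer context makes each decoded token slower as well. Every token also costs RAM for the KV cache. On a phone, both of those are tight. If you naively shove the whole chat history into context every turn, you'll find that turn 20 takes noticeably longer than turn 5, and turn 50 might OOM the app.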

PocketClaw keeps a sliding window of the most recent 24 messages in their raw form. Anything older runs through a compaction pass:

  1. Extract explicit facts the user has stated ("I am X", "Remember Y", "My name is Z").
  2. Capture unresolved goals (keywords like "fix", "todo", "issue").
  3. Compile both into a single lightweight summary paragraph that gets prepended to the prompt as memory.
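The three steps above can be sketched as a single pure function. This is an illustration under assumptions, not PocketClaw's actual code: the `Message` and `CompactedHistory` classes are hypothetical, the keyword patterns are taken from the examples in the list, and only user messages are scanned.

```dart
/// Minimal chat message (hypothetical shape).
class Message {
  const Message(this.role, this.text);
  final String role; // 'user' or 'assistant'
  final String text;
}

/// Result of a compaction pass: a memory paragraph plus the raw recent turns.
class CompactedHistory {
  const CompactedHistory(this.memory, this.recent);
  final String memory;
  final List<Message> recent;
}

// Keyword heuristics taken from the examples in the list above.
final _factPattern =
    RegExp(r'\b(i am|my name is|remember)\b', caseSensitive: false);
final _goalPattern = RegExp(r'\b(fix|todo|issue)\b', caseSensitive: false);

/// Keeps the last [window] messages raw and folds everything older into a
/// one-paragraph memory string. Assumption: only user turns are scanned.
CompactedHistory compact(List<Message> history, {int window = 24}) {
  if (history.length <= window) return CompactedHistory('', history);

  final cut = history.length - window;
  final userTexts = history
      .sublist(0, cut)
      .where((m) => m.role == 'user')
      .map((m) => m.text.trim());

  final facts = userTexts.where(_factPattern.hasMatch).toList();
  final goals = userTexts.where(_goalPattern.hasMatch).toList();

  final parts = [
    if (facts.isNotEmpty) 'User stated: ${facts.join(' ')}',
    if (goals.isNotEmpty) 'Unresolved: ${goals.join(' ')}',
  ];
  return CompactedHistory(parts.join(' '), history.sublist(cut));
}
```

The returned `memory` string is what would be prepended to the prompt, and `recent` is the raw sliding window. Keyword matching like this is crude: it misses paraphrased facts and keeps false positives, which is the price of doing compaction without another model call.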

That's the chat part. The aggressive part is image handling.

A typical user-uploaded photo is roughly 1 MB. By my estimate, carrying it in the prompt costs around 30,000 tokens, though the exact figure depends on how the runtime encodes images. That's about a quarter of the entire 128K context window for one image. If the user uploads three photos across a conversation, your context budget is in trouble.

So PocketClaw does this: when an image message slides past the 24-message boundary, the raw image bytes get deleted from memory. What's preserved is the assistant's prior textual description of that image:

String _imageMemoryFromAssistant({
  required String? imageName,
  required String assistantText,
}) {
  final label = imageName ?? 'uploaded image';
  return 'Assistant previously described $label as: '
         '${_shorten(assistantText, 1000)}';
}


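So Claw still "remembers" what it saw, but only the description goes through the prompt. The roughly 30,000-token image payload becomes a text description capped at 1,000 characters. Assuming about four characters per token, that's around 250 tokens at the cap and closer to 100 for a short description. That's a compression of somewhere between 100x and 300x for image memory.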
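The `_shorten` helper used in the snippet above isn't shown in the post. A plausible stand-in, written here as an assumption about its behavior, collapses whitespace and caps the text at a character count, cutting at a word boundary where it can:

```dart
/// Hypothetical stand-in for a `_shorten`-style helper: caps [text] at
/// [maxChars] characters, preferring to cut at a word boundary.
/// Lengths are UTF-16 code units, as with Dart's String.length.
String shorten(String text, int maxChars) {
  if (maxChars < 1) return '';
  final clean = text.trim().replaceAll(RegExp(r'\s+'), ' ');
  if (clean.length <= maxChars) return clean;

  // Reserve one character for the ellipsis.
  final head = clean.substring(0, maxChars - 1);
  final lastSpace = head.lastIndexOf(' ');
  final cutAt = lastSpace > 0 ? lastSpace : head.length;
  return '${head.substring(0, cutAt)}…';
}
```

Cutting on a word boundary keeps the stored description readable when it is replayed to the model later, at the cost of sometimes landing a few characters under the cap.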

The kicker: this works really well for the kind of follow-up questions users actually ask. "What was in that photo I sent earlier?" is answerable from the description alone. The model rarely needs the pixels back. If it does, the user can re-upload.

The general pattern here is: don't think of the context window as "free space up to the limit." Think of it as a budget. Spend it on things the model needs for the current turn. Everything else gets compacted to a textual summary.

5. Native audio in flutter_gemma is gated on Gemma 3n, not Gemma 4.

This one I want you to know so you don't waste a day like I did.

Gemma 4 E2B's model card lists native audio as a supported modality. So I thought: cool, I'll skip the speech-to-text plugin entirely, feed raw audio bytes straight to Gemma, get a single multimodal call instead of an STT-then-LLM pipeline. Cleaner.

I dug into flutter_gemma v0.15.1 source to find the audio API. Saw this comment in the interface:

/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).
bool supportAudio = false,


The plugin's audio code path is gated on Gemma 3n. If you set supportAudio: true while loading Gemma 4 E2B, you'll either get a load error or a silent failure at inference time. The native side does support audio (the C++ engine handles it fine), but the Dart-side check rejects it for non-3n models.

So PocketClaw uses Android's system STT (the speech_to_text package, which wraps Android's SpeechRecognizer API). Side benefit: I get live transcription as the user is speaking. The text appears in the input field word by word while they're holding the mic. That's a noticeably better UX than the "hold the button, speak, release, wait three seconds while audio uploads and processes, see both your words and the AI response" pattern you'd get from on-model audio.

When (if?) flutter_gemma exposes audio for E2B, the path collapses. Until then, system STT plus text-mode Gemma is the right architecture.

The takeaway isn't "audio is broken." It's: read your plugin's source before you trust its capability flags. Especially for multimodal features that span Gemma versions. Just because the model can do something doesn't mean the plugin has wrapped it for your model.

Closing

Five patterns. None of them are in the README. None of them are in Google's docs. I learned all of them by shipping something real and watching it fail in interesting ways.

If you're building on Gemma 4 for Android, these will save you time. If you want to see all five running together in a real app, PocketClaw is fully open source, MIT licensed.

The thing I keep coming back to, after 17 days with Gemma 4 E2B on a mid-range Android phone, is how capable a 2B model can be when it's running fast on the user's device. The latency feels different from cloud. There's no perceived "AI thinking" delay because there's no network. It just answers at the speed of your phone.

That's worth optimizing for.


Editorial assistance from Claude. All code, decisions, bugs, and engineering choices in this post are mine.