How We Built an Autonomous AI Agent That Controls Your Phone, Entirely Offline
Adebisi Mosi · 2026-05-19 · via DEV Community

Most AI assistants are text boxes with a microphone icon. You speak. Your words leave your device. A server somewhere thinks. A server responds. If you're on a plane, in a rural clinic, or behind a firewall, nothing works.

We built the opposite.

Genie is an autonomous AI agent that runs Gemma 4 directly on your Android GPU through Google's LiteRT-LM SDK. It doesn't call an API. It doesn't stream to a cloud. It sees your screen, controls your apps, reads your documents, remembers your preferences, teaches you concepts on a whiteboard, and executes multi-step tasks, all on-device, all offline.

This is the engineering story of how we built it for the Gemma Kaggle Hackathon.

This article was written by Mosimiloluwa Adebisi (@A-Simie), along with Adetunji Akeem (@Akeem1955) and Adenuga Abdulrahmon (@Rahmannugar), and created for the purposes of entering the Google Gemma Kaggle Hackathon. It covers how we built Genie using Gemma 4 and LiteRT-LM for fully on-device AI.


Table of Contents

  1. The Problem
  2. Architecture Overview
  3. Layer 1: The Voice Pipeline
  4. Layer 2: The Brain — LiteRT-LM Inference
  5. Layer 3: The Agent Loop
  6. Layer 4: Safety — Risk Assessment + Biometric HITL
  7. Layer 5: Error Taxonomy
  8. Layer 6: Self-Improvement — Skill Cache
  9. Layer 7: 53 Tools Across 6 Families
  10. The Profile System — 9 Modes
  11. Memory System
  12. The Teaching Profile
  13. Observability
  14. What We Learned
  15. The Numbers

The Problem: Running an LLM Is Easy. Letting It Touch Your OS Is Not.

Google gave us a great on-device inference SDK. Loading a Gemma model, creating a conversation, streaming tokens, that's a few method calls. The hard part is everything around it.

When your agent can tap buttons, type into fields, and open apps autonomously, you have a fundamentally different safety problem than a chatbot. A chatbot that hallucinates gives you wrong text. An agent that hallucinates taps "Confirm Payment" on your PayPal screen.

That constraint shaped every architectural decision in Genie.


Architecture Overview

Here's the full system at a glance:

Wake Word (Vosk) → STT (Android) → AgentOrchestrator
    ↓
PromptBuilder → Planner → GenieEngine (LiteRT-LM / Gemma 4)
    ↓
Decision.Act → RiskAssessor → [Biometric HITL?] → ToolRegistry → OS Action
    ↓
ToolOutcome → History → SlidingWindow → Next Planning Turn
    ↓
Decision.Finish → Skill Cache Write → TTS Response → Resume Wake Word


Let's walk through each layer.


Layer 1: The Voice Pipeline — Two Engines, One Microphone

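Most voice assistants use a single speech engine. We use two, and the reason is resource cost: always-on listening and accurate full-sentence recognition have very different power and memory budgets.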

Vosk is a lightweight, offline speech recognition library. We run it continuously in the background at 16kHz, listening for exactly one word: "Gemma". It uses ~30MB RAM and has near-zero latency for wake-word detection.

Android SpeechRecognizer is heavier but far more accurate for full sentences. It only activates after Vosk detects the wake word.

// Vosk wake-word detection
override fun onPartialResult(hypothesis: String?) {
    val json = JSONObject(hypothesis ?: return)  // Ignore null hypotheses
    val partial = json.optString("partial", "")
    if (partial.lowercase().contains("gemma")) {
        speechService?.stop()           // Kill Vosk
        setUiState(AgentUIState.Waking)  // Show overlay
        startSttListening()              // Start full STT
    }
}

// STT captures the actual command
override fun onResults(results: Bundle?) {
    val text = results?.getStringArrayList(
        SpeechRecognizer.RESULTS_RECOGNITION
    )?.getOrNull(0) ?: ""
    dispatchToAgent(text)  // → AgentOrchestrator
}


Why not just use SpeechRecognizer for everything? Because it's expensive. Running full-sentence recognition 24/7 drains battery and hogs the microphone. Vosk is purpose-built for always-on keyword spotting with minimal resource usage.
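The handoff between the two engines can be read as a small state machine in which at most one engine owns the microphone at a time. The following is a minimal sketch of that idea, not Genie's actual code: `VoiceState`, `VoiceEvent`, `nextVoiceState`, and `micOwner` are hypothetical names, and the assumption that neither engine listens while the agent runs is ours.

```kotlin
// Hypothetical names for illustration; not taken from the Genie codebase.
enum class VoiceState { WakeWordListening, CapturingCommand, AgentRunning }

enum class VoiceEvent { WakeWordHeard, CommandCaptured, SttFailed, AgentFinished }

// Pure transition function: events that do not apply to a state leave it unchanged.
fun nextVoiceState(state: VoiceState, event: VoiceEvent): VoiceState = when (state) {
    VoiceState.WakeWordListening ->
        if (event == VoiceEvent.WakeWordHeard) VoiceState.CapturingCommand else state
    VoiceState.CapturingCommand -> when (event) {
        VoiceEvent.CommandCaptured -> VoiceState.AgentRunning
        VoiceEvent.SttFailed -> VoiceState.WakeWordListening
        else -> state
    }
    VoiceState.AgentRunning ->
        if (event == VoiceEvent.AgentFinished) VoiceState.WakeWordListening else state
}

// Which engine holds the microphone in each state (null = neither, an assumption).
fun micOwner(state: VoiceState): String? = when (state) {
    VoiceState.WakeWordListening -> "vosk"
    VoiceState.CapturingCommand -> "speech_recognizer"
    VoiceState.AgentRunning -> null
}
```

Keeping the transitions in a pure function makes the "one microphone, two engines" rule testable without touching Android audio APIs.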


Layer 2: The Brain — LiteRT-LM Inference with Manual Tool Calling

The GenieEngine wraps Google's LiteRT-LM SDK. The most important design decision lives in one line:

val newConversation = engine.createConversation(
    ConversationConfig(
        samplerConfig = agentSamplerConfig(),
        systemInstruction = PromptFormatting.buildSystemInstruction(systemPrompt),
        tools = tools,
        automaticToolCalling = false,  // ← This is everything
    )
)


Why automaticToolCalling = false?

In default mode, LiteRT-LM detects a tool call in the model's output, executes it automatically, and feeds the result back, all without the application knowing. That's fine for a chatbot generating weather data.

It's catastrophic for an agent that can tap "Send $500" on your banking app.

By disabling automatic tool calling, every single tool call from the model passes through our code first. We intercept it. We validate it. We optionally require biometric authentication. Only then do we execute.

The Callback-to-Flow Bridge

LiteRT-LM uses a callback-based API (MessageCallback). We convert that into a Kotlin callbackFlow so the agent loop can consume responses asynchronously:

fun sendMessageAsync(contents: Contents): Flow<AgentResponse> = callbackFlow {
    conversation.sendMessageAsync(
        contents,
        object : MessageCallback {
            override fun onMessage(message: Message) {
                if (message.toolCalls.isNotEmpty()) {
                    trySend(AgentResponse.ToolCallRequest(message))
                } else {
                    trySend(AgentResponse.Token(message.toString()))
                }
            }
            override fun onDone() {
                trySend(AgentResponse.Done)
                close()
            }
            override fun onError(throwable: Throwable) {
                val error = if (throwable is CancellationException) {
                    ErrorTaxonomy.TransientErr("Inference cancelled", throwable)
                } else {
                    ErrorTaxonomy.FatalErr("Inference error: ${throwable.message}", throwable)
                }
                trySend(AgentResponse.Error(error))
                close()
            }
        }
    )
    awaitClose { }
}


This gives us a clean reactive API where the orchestrator can collect tokens, tool calls, and errors from a single Flow.


Layer 3: The Agent Loop — Planning Like a Human

Here's where Genie diverges from every other "AI assistant." When you say "Open WhatsApp and send hi to Mom", a chatbot tries to answer in one shot. Genie plans.

The AgentOrchestrator runs a loop that mirrors human problem-solving:

  1. Observe — Read the screen
  2. Plan — Decide the next action
  3. Act — Execute one tool
  4. Evaluate — Check the result
  5. Repeat — Until the goal is done or a circuit breaker trips

The State Machine

data class AgentState(
    val goal: String,
    var intent: AgentIntent? = null,
    var plan: AgentPlan? = null,
    var currentStepIndex: Int = 0,
    val history: MutableList<HistoryEntry> = mutableListOf(),
    var retryCount: Int = 0,
    var replanCount: Int = 0,
    val maxRetries: Int = 3,
    val maxReplans: Int = 3,
    var isNovelPlan: Boolean = true,
)


Every tool call, every result, every error, all tracked in history. This history feeds back into the next planning prompt so the model always knows what it has already tried.

The Decision Type System

The planner produces exactly one of three decisions per turn:

sealed class Decision {
    data class Act(val tool: String, val args: Map<String, String>) : Decision()
    data class Finish(val summary: String) : Decision()
    data class Reply(val message: String) : Decision()
}


  • Act → Execute a tool and continue the loop
  • Reply → Speak to the user and stop
  • Finish → Mark the goal complete
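The loop and the three decision types fit together roughly as follows. This is a simplified sketch under stated assumptions: `runAgentLoop`, its `plan` and `execute` parameters, and the `maxSteps` budget are hypothetical stand-ins for the model call, the tool registry, and the circuit breakers, and the history is reduced to plain strings.

```kotlin
// Decision mirrors the article's sealed class; everything else is a hypothetical stand-in.
sealed class Decision {
    data class Act(val tool: String, val args: Map<String, String>) : Decision()
    data class Finish(val summary: String) : Decision()
    data class Reply(val message: String) : Decision()
}

// Runs plan -> act -> evaluate until the planner finishes, replies, or the step budget runs out.
// `plan` stands in for a model call; `execute` stands in for the tool registry.
fun runAgentLoop(
    goal: String,
    plan: (history: List<String>) -> Decision,
    execute: (Decision.Act) -> String,
    maxSteps: Int = 10, // assumed budget, not a value from the article
): String {
    val history = mutableListOf("goal: $goal")
    repeat(maxSteps) {
        when (val decision = plan(history)) {
            // Record the tool call and its result so the next planning turn can see it.
            is Decision.Act -> history.add("${decision.tool}${decision.args} -> ${execute(decision)}")
            is Decision.Finish -> return decision.summary
            is Decision.Reply -> return decision.message
        }
    }
    return "Stopped after $maxSteps steps without finishing."
}
```

Because the planner and executor are passed in as functions, the loop can be exercised with a scripted list of decisions before any model is involved.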

Plain text during planning is treated as invalid. The system prompt forces exactly one native tool call per turn:

"Call EXACTLY ONE tool per turn. No markdown, no extra text."

The Sliding Window

The model has a limited context window. The SlidingWindowManager ensures the model always sees what matters:

  • Keeps the first entry (the user's goal — always visible)
  • Keeps the last 9 entries (recent actions and results)
  • Prunes transient errors after a success: if click("Wi-Fi") failed twice and then succeeded, those two failures are removed from history

fun pruneAfterSuccess(history: MutableList<HistoryEntry>) {
    val lastEntry = history.lastOrNull() as? HistoryEntry.ToolResult ?: return
    if (lastEntry.outcome !is ToolOutcome.Ok) return

    val successToolName = lastEntry.toolName
    var index = history.size - 2
    while (index >= 0) {
        val entry = history[index]
        if (entry is HistoryEntry.ToolResult &&
            entry.toolName == successToolName &&
            entry.outcome is ToolOutcome.TransientErr) {
            history.removeAt(index)
        } else {
            break
        }
        index--
    }
}


This keeps the context clean and prevents the model from getting confused by noise from past failures.
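The first two rules (keep the goal, keep the last 9 entries) are not shown in the snippet above. A minimal generic sketch of that trimming step, assuming a hypothetical `slidingWindow` helper rather than the real SlidingWindowManager, might look like this:

```kotlin
// Keeps the first entry (the goal) plus the most recent `keepLast` entries.
// A generic sketch; the real SlidingWindowManager is not shown in the article.
fun <T> slidingWindow(history: List<T>, keepLast: Int = 9): List<T> {
    require(keepLast >= 0) { "keepLast must be non-negative" }
    // Short histories fit entirely and are returned unchanged.
    if (history.size <= keepLast + 1) return history
    return listOf(history.first()) + history.takeLast(keepLast)
}
```

With 15 entries, the result holds 10: the goal at index 0 followed by entries 6 through 14.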


Layer 4: The Safety Net — Dynamic Risk Assessment + Biometric HITL

This is the layer I'm most proud of.

Most AI safety systems use static per-tool flags: "this tool is dangerous, always ask for confirmation." That's crude. Opening Settings is safe. Opening PayPal and clicking "Send" is not — but both use the same click tool.

Genie's RiskAssessor is dynamic. It evaluates the current screen context in real time:

object RiskAssessor {
    private val DESTRUCTIVE_VERBS = setOf(
        "send", "transfer", "pay", "confirm", "submit",
        "delete", "remove", "purchase", "authorize",
    )

    private val CURRENCY_REGEX = Regex(
        """[$€£¥₦₹₩₿]\s*\d+|\d+\.\d{2}\s*(USD|EUR|GBP|NGN)""",
        RegexOption.IGNORE_CASE,
    )

    fun assess(
        toolName: String,
        args: Map<String, String>,
        screen: ScreenContext
    ): RiskVerdict {
        // Only tools that mutate the UI can trigger a risky action
        if (toolName !in setOf("click", "tap_at", "type_text")) {
            return RiskVerdict.Allow
        }

        // Assumption: the element being acted on arrives under the "target" key
        val target = args["target"].orEmpty()
        val signals = mutableListOf<RiskSignal>()

        if (isFinancialScreen(screen)) signals.add(FINANCIAL_SCREEN)
        if (isSensitiveApp(screen))    signals.add(SENSITIVE_APP)
        if (isDestructiveVerb(target)) signals.add(DESTRUCTIVE_VERB)
        if (isSensitiveField(screen))  signals.add(SENSITIVE_FIELD)
        if (isAuthFlow(screen))        signals.add(AUTH_FLOW)

        return if (signals.size >= 2) {
            // Report which signals fired as the reason for the prompt
            val reason = signals.joinToString()
            RiskVerdict.RequireBiometric(reason)
        } else {
            RiskVerdict.Allow
        }
    }
}


The key insight: ≥2 independent signals required. A single signal (e.g., seeing a dollar sign) doesn't trigger auth — that would cause false positives everywhere. But a currency symbol plus the word "Send" as a click target? That's a real financial action. Biometric required.
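To make the two-signal rule concrete, here is a small standalone sketch that uses only two of the five signals. The regex and verb list follow the snippet above; `looksFinancial`, `isDestructiveTarget`, and `requiresBiometric` are hypothetical helpers, and treating a visible currency amount as the "financial screen" signal is a simplifying assumption.

```kotlin
// Regex and verb list follow the article; the detector functions are simplified assumptions.
val CURRENCY_REGEX = Regex(
    "[\$€£¥₦₹₩₿]\\s*\\d+|\\d+\\.\\d{2}\\s*(USD|EUR|GBP|NGN)",
    RegexOption.IGNORE_CASE,
)
val DESTRUCTIVE_VERBS = setOf(
    "send", "transfer", "pay", "confirm", "submit",
    "delete", "remove", "purchase", "authorize",
)

// Assumed heuristic: a visible currency amount marks the screen as financial.
fun looksFinancial(screenText: String): Boolean = CURRENCY_REGEX.containsMatchIn(screenText)

// True when any word of the click target is a destructive verb.
fun isDestructiveTarget(target: String): Boolean =
    target.lowercase().split(Regex("\\W+")).any { it in DESTRUCTIVE_VERBS }

// Biometric confirmation is required only when at least two independent signals fire.
fun requiresBiometric(target: String, screenText: String): Boolean {
    val signals = listOf(looksFinancial(screenText), isDestructiveTarget(target)).count { it }
    return signals >= 2
}
```

Under these assumptions, tapping "Send" on a screen showing "$500" requires authentication, while tapping "Send" in a chat with no amount on screen, or tapping "Wi-Fi" next to a price, does not.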

The Biometric Bridge

When auth is required, HITLInterceptionWrapper launches a transparent activity that shows Android's BiometricPrompt:

object HITLInterceptionWrapper {
    val authResultChannel = Channel<AuthResult>(capacity = 1)

    suspend fun executeWithAuth(
        tool: GenieTool,
        args: Map<String, String>,
        serviceContext: ToolServiceContext,
        appContext: Context,
        reason: String,
    ): ToolOutcome {
        // Launch transparent biometric activity
        appContext.startActivity(
            Intent(appContext, BiometricAuthActivity::class.java)
        )

        // Suspend until user authenticates (30s timeout)
        val result = withTimeoutOrNull(30_000L) {
            authResultChannel.receive()
        }

        return when (result) {
            is AuthResult.Approved -> tool.execute(args, serviceContext)
            is AuthResult.Denied  -> ToolOutcome.AuthErr("User denied")
            else                  -> ToolOutcome.AuthErr("Auth timed out")
        }
    }
}


The BiometricAuthActivity is invisible (Theme.Translucent.NoTitleBar), excluded from recents, and immediately finishes after sending the result. The user sees a fingerprint prompt appear, authenticates, and it vanishes.


Layer 5: Error Taxonomy — Four Tiers of Failure

Not all errors are equal. Genie classifies every failure into one of four tiers, each with its own recovery strategy:

sealed class ToolOutcome {
    data class Ok(val result: String) : ToolOutcome()
    data class TransientErr(val message: String) : ToolOutcome()
    data class LogicErr(val message: String) : ToolOutcome()
    data class AuthErr(val message: String) : ToolOutcome()
    data class FatalErr(val message: String) : ToolOutcome()
}


| Tier | Meaning | Recovery | Example |
|---|---|---|---|
| TransientErr | Might work if we wait | Retry with exponential backoff | UI hasn't rendered yet |
| LogicErr | Agent made a bad choice | Error goes into history; model self-corrects | Wrong tool name |
| AuthErr | User denied authorization | Hard stop with notification | Biometric denied |
| FatalErr | Unrecoverable | Hard stop immediately | OOM, engine crash |
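For the TransientErr tier, "retry with exponential backoff" means the wait doubles on each attempt up to a cap. The sketch below shows one way to compute that delay; `backoffDelayMs` is a hypothetical helper, and the 200 ms base and 5 s cap are assumed values, not figures from Genie.

```kotlin
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
// The 200 ms base and 5 s cap are assumed values, not taken from the article.
fun backoffDelayMs(attempt: Int, baseMs: Long = 200, maxMs: Long = 5_000): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    // Bound the shift so the multiplication cannot overflow a Long.
    val shift = minOf(attempt, 20)
    return minOf(baseMs shl shift, maxMs)
}
```

With these defaults the first four retries wait 200, 400, 800, and 1600 ms, and later retries stay at the 5000 ms cap.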

Circuit Breakers

Two circuit breakers prevent infinite loops:

  1. Consecutive failure breaker: 5 consecutive failures of any kind → abort
  2. Unknown tool breaker: Same non-existent tool requested 3 times → abort

if (consecutiveFailureCount >= 5) {
    Log.e(TAG, "Circuit breaker triggered")
    return "I ran into repeated errors. Please try again."
}


These exist because on-device models can hallucinate tool names. Without circuit breakers, the agent could keep requesting a tool that does not exist and never terminate.
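Both breakers can be kept in one small counter object. The thresholds (5 and 3) come from the list above; the `CircuitBreakers` class, its method names, and the choice to count unknown-tool requests cumulatively rather than consecutively are assumptions made for this sketch.

```kotlin
// Thresholds (5 and 3) come from the article; the class and method names are hypothetical.
class CircuitBreakers(
    private val maxConsecutiveFailures: Int = 5,
    private val maxUnknownToolRepeats: Int = 3,
) {
    private var consecutiveFailures = 0
    private val unknownToolCounts = mutableMapOf<String, Int>()

    // Any success resets the consecutive-failure counter.
    fun recordSuccess() {
        consecutiveFailures = 0
    }

    // Returns true when the loop should abort.
    fun recordFailure(): Boolean {
        consecutiveFailures += 1
        return consecutiveFailures >= maxConsecutiveFailures
    }

    // Counts requests per unknown tool name; each one also counts as a failure.
    fun recordUnknownTool(name: String): Boolean {
        val repeats = (unknownToolCounts[name] ?: 0) + 1
        unknownToolCounts[name] = repeats
        val failureTripped = recordFailure()
        return repeats >= maxUnknownToolRepeats || failureTripped
    }
}
```

The orchestrator would call one of the `record…` methods after every tool outcome and stop the loop as soon as any of them returns true.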


Layer 6: Self-Improvement — The Skill Cache

Every time Genie completes a novel task, it serializes the successful plan and stores it in a local Room database:

@Entity(tableName = "skills")
data class Skill(
    @PrimaryKey(autoGenerate = true) val id: Int = 0,
    val goalPattern: String,
    val planJson: String,
    val successCount: Int = 0,
    val createdAt: Long = System.currentTimeMillis(),
)


Next time you ask for something similar:

val skills = skillDao.findMatchingSkills("turn on wi-fi")
val bestSkill = skills.maxByOrNull { it.successCount }


If a match is found, Genie replays the cached plan step-by-step without invoking the LLM. No inference is needed, so execution starts without any model latency.

Every successful replay increments successCount. Over time, more reliable skills are prioritized. If a cached plan fails (e.g., UI drift from an app update), the agent falls back to live planning.

That's on-device self-improvement. The agent gets faster the more you use it.
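The lookup, ranking, and fallback behavior can be sketched with an in-memory stand-in for the Room table. This is illustrative only: `CachedSkill` and `SkillCache` are hypothetical, and matching on a normalized goal string is an assumption, since the article does not show how `findMatchingSkills` compares goals.

```kotlin
// In-memory stand-in for the Room-backed cache; matching by normalized goal text is an assumption.
data class CachedSkill(val goalPattern: String, val planJson: String, val successCount: Int = 0)

class SkillCache {
    private val skills = mutableListOf<CachedSkill>()

    // Lowercase, trim, and collapse whitespace so small phrasing differences still match.
    private fun normalize(goal: String): String =
        goal.lowercase().trim().replace(Regex("\\s+"), " ")

    fun save(goal: String, planJson: String) {
        skills.add(CachedSkill(normalize(goal), planJson))
    }

    // Best match = same normalized goal, highest success count; null means "plan live".
    fun bestMatch(goal: String): CachedSkill? =
        skills.filter { it.goalPattern == normalize(goal) }.maxByOrNull { it.successCount }

    // Call after a successful replay so reliable plans rank higher over time.
    fun recordSuccess(skill: CachedSkill) {
        val index = skills.indexOf(skill)
        if (index >= 0) skills[index] = skill.copy(successCount = skill.successCount + 1)
    }
}
```

A `null` from `bestMatch` corresponds to the fallback path described above: no usable cached plan, so the agent plans with the model instead.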


Layer 7: The Hands — 53 Tools Across 6 Families

Tools don't touch the AccessibilityService directly. They go through the ToolServiceContext interface — a seam that lets us mock every OS action in tests:

interface ToolServiceContext {
    suspend fun clickElement(target: String): Boolean
    suspend fun typeText(text: String): Boolean
    suspend fun swipe(direction: String): Boolean
    suspend fun readScreen(): String
    suspend fun openApp(name: String): Boolean
    // ... 48 more
}


Tool Families

| Family | Tools | Purpose |
|---|---|---|
| Core OS | click, type_text, swipe, scroll, open_app, go_back, go_home | Direct UI interaction |
| Awareness | read_screen, read_screen_summary, where_am_i, read_notifications | Situational awareness |
| Memory | save_fact, retrieve_fact | Persistent preference storage |
| Document | list_device_pdfs, detect_open_pdf | File access and PDF extraction |
| Teaching | board_teach_step, visualize_concept | Interactive whiteboard |
| Health | health_search_topics, health_get_topic | WHO fact-sheet queries |

The Gesture System

Every gesture is a suspend function wrapping dispatchGesture() into a coroutine:

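private suspend fun dispatchGesture(
    gesture: GestureDescription
): Boolean {
    return suspendCancellableCoroutine { continuation ->
        val dispatched = service.dispatchGesture(
            gesture,
            object : GestureResultCallback() {
                override fun onCompleted(desc: GestureDescription) {
                    if (continuation.isActive) continuation.resume(true)
                }
                override fun onCancelled(desc: GestureDescription) {
                    if (continuation.isActive) continuation.resume(false)
                }
            },
            null  // null handler: callbacks arrive on the service's main thread
        )
        // If the system refuses the gesture, no callback is expected to fire,
        // so resume here instead of suspending forever
        if (!dispatched && continuation.isActive) continuation.resume(false)
    }
}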


The agent loop can await a gesture's completion before moving to the next step, so step ordering never depends on callback timing.


The Profile System — 9 Modes, One Agent

Not every task needs full autonomous planning. Genie has 9 specialized profiles:

| Profile | Architecture | Use Case |
|---|---|---|
| Chat | Agent-driven | General Q&A, remembering facts |
| AppControl | Agent-driven, reactive | Navigating WhatsApp, Settings, Spotify |
| Vision | Agent-driven, multimodal | Screen analysis with allergy cross-referencing |
| Reader | Agent-driven | Accessibility screen narration |
| Teaching | Agent-driven, whiteboard | Step-by-step concept lessons |
| SeeAndTap | Agent-driven, visual grounding | Screenshot → numbered elements → tap by ID |
| Document | Hybrid | PDF quiz and summary generation |
| Scribe | UI-driven (no agent) | Audio → transcription → SOAP notes |
| Health | UI-driven (no agent) | Food analysis, WHO health topics |

The key insight: not every feature needs agent autonomy. Scribe and Health have fixed, deterministic workflows. Adding an agent layer would only introduce latency and hallucination risk. Those profiles bypass the orchestrator entirely.

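enum class ToolProfile(
    val id: String,
    val toolNames: Set<String>,
) {
    // The id strings are illustrative; the remaining tool names are elided
    Chat(id = "chat", toolNames = setOf("reply", "save_fact" /* ... */)),
    AppControl(id = "app_control", toolNames = setOf("reply", "open_app", "click" /* ... */)),
    // ... other agent-driven profiles
    Scribe(id = "scribe", toolNames = emptySet()),  // UI-driven
    Health(id = "health", toolNames = emptySet()),  // UI-driven
}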
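One way to read this design: the profile decides both which tools the model may see and whether the orchestrator runs at all. The sketch below illustrates that under assumptions; `Profile`, `isAgentDriven`, and `toolsFor` are hypothetical names, and the registry is reduced to a map from tool name to description.

```kotlin
// Hypothetical sketch: an empty tool set marks a UI-driven profile that skips the orchestrator.
data class Profile(val id: String, val toolNames: Set<String>) {
    val isAgentDriven: Boolean get() = toolNames.isNotEmpty()
}

// Only tools named by the active profile are exposed to the model.
fun toolsFor(profile: Profile, registry: Map<String, String>): Map<String, String> =
    registry.filterKeys { it in profile.toolNames }
```

Restricting the visible tool set per profile also shrinks the prompt and leaves the model fewer tool names to confuse.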



Memory System — Facts and Preferences

When a user says "Remember that I'm allergic to peanuts", the agent calls save_fact(key="allergy", value="peanuts"). This is stored in Room:

@Entity(tableName = "user_facts")
data class UserFact(
    @PrimaryKey(autoGenerate = true) val id: Int = 0,
    val key: String,
    val value: String,
    val createdAt: Long = System.currentTimeMillis(),
    val updatedAt: Long = System.currentTimeMillis(),
)


Every fact is injected into every prompt:

## User Preferences
- allergy: peanuts
- favorite_restaurant: Mama Cass
- preferred_language: Yoruba


The model always sees these. When the user asks for a snack recommendation, the model avoids peanuts automatically.
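Rendering that prompt section from stored facts is a small formatting step. A minimal sketch, assuming a hypothetical `renderPreferences` helper and key-sorted output (our choice, to keep the prompt stable between turns):

```kotlin
// Renders saved facts as the "## User Preferences" prompt section shown above.
// Sorting by key is an assumption made to keep the prompt stable between turns.
fun renderPreferences(facts: Map<String, String>): String {
    // With no saved facts, the section is omitted entirely.
    if (facts.isEmpty()) return ""
    val lines = facts.toSortedMap().map { (key, value) -> "- $key: $value" }
    return (listOf("## User Preferences") + lines).joinToString("\n")
}
```

Because every fact is injected on every turn, the size of this section counts directly against the context budget, which is one reason to keep facts as short key-value pairs.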

For an accessibility agent, this isn't a gimmick. It's a safety feature. Saved allergies prevent dangerous suggestions. Saved mobility preferences change how the agent interacts with the UI.


The Teaching Profile — An AI Tutor With a Whiteboard

This is the profile that surprises people. Most AI tutors dump a wall of text. Genie teaches visually.

When you say "Teach me about photosynthesis", the agent:

  1. Creates a teaching board scene
  2. Places a title card
  3. Adds content cards with real definitions, formulas, and examples
  4. Generates visualize_concept diagrams (flowcharts, timelines, mind maps)
  5. Narrates each step with synchronized text-to-speech

Each step is one tool call. The user controls pacing — say "next" and the agent adds the next concept. The agent never ends the lesson on its own.

The system prompt enforces factual content:

"Every narration MUST deliver a concrete fact, definition, formula, example, or explanation. NEVER write meta-commentary like 'Let's look at...' or 'Next we will cover...'"


Observability — The Event Logger

Every significant event flows through an async event bus:

📦 Bootstrap: engine_init [3421ms]
⚡ State: idle → planning
🧠 Inference: 847ms, 156 tokens
✅ Tool: open_app({name=Settings}) [234ms]
✅ Tool: click({target=Wi-Fi}) [143ms]
⚡ State: executing → finished
📚 Skill written: 'turn on wi-fi' (3 steps)


Events are emitted with trySend() — non-blocking, never delays the agent loop. A dedicated coroutine consumes the Channel<GenieEvent> and writes to Logcat. Every state transition, every tool with its latency, every error classification, every skill write — timestamped and categorized.


What We Learned

1. automaticToolCalling = false is non-negotiable for agents. If your model can touch the OS, you must intercept every tool call. No shortcut.

2. Two-signal risk thresholds prevent false positives. A single "Send" button doesn't mean danger. "Send" on a financial screen does.

3. Not every feature needs agent autonomy. Fixed workflows are faster, more reliable, and more predictable with direct UI control.

4. Error history is a feature, not a log. Feeding errors back into the planning prompt lets the model self-correct. Pruning resolved errors keeps context clean.

5. Skill caching is the simplest form of self-improvement. No fine-tuning, no RLHF. Serialize what worked. Replay it next time.


The Numbers

  • 53 registered tools across 6 families
  • 9 specialized profiles
  • 4-tier error taxonomy with 2 circuit breakers
  • 5-signal real-time risk assessor with biometric HITL
  • Room-backed skill cache with success-count ranking
  • 0 cloud dependencies
  • 0 API keys

Built entirely by a team of students.

We didn't try to compete with Siri or Google Assistant. Those are cloud products with billion-dollar infrastructure.

We built an agent that works when the infrastructure doesn't exist.


Closing Thoughts

Building Genie for the Gemma Kaggle Hackathon pushed us to rethink what an AI assistant can be when you strip away the cloud entirely. By running Gemma 4 directly on the device through LiteRT-LM, we proved that autonomous, multi-step agent behavior is possible at the edge. No servers, no API keys, no internet dependency.

The constraints of on-device inference forced better engineering: tighter context management, smarter error recovery, and a safety system that actually understands screen context rather than relying on static flags. These are patterns that will only become more relevant as on-device models continue to improve.

We're just getting started. 🧞


The Genie Team

Code Repository
🔗 github.com/Akeem1955/Genie

Download the APK
📲 Genie APK (Google Drive)

Watch the Demo

| Contributor | GitHub Profile |
|---|---|
| Adetunji Akeem | @Akeem1955 |
| Mosimiloluwa Adebisi | @A-Simie |
| Adenuga Abdulrahmon | @Rahmannugar |