Gemma Guide - Real-Time Spatial Awareness for Blind Users
Dan Parii · 2026-05-23 · via DEV Community

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

The Problem

For a blind user, the important question is not just what is in front of me, but how far away it is and how I should move safely. That gap between scene description and grounded spatial awareness is what Gemma Guide is built to close.

The Solution

Gemma Guide combines Gemma 4 with TIPSv2 into a routed multi-agent pipeline. The user speaks a question, the system interprets it visually and acoustically, and returns grounded guidance: not just "there is a chair in front of you" but "the chair is 1.4 meters ahead, slightly to your left."

The diagram below shows the flow end to end. A Scout agent first decides whether the question needs spatial analysis at all. If it does, a Mapper agent localizes the relevant objects and calls the TIPSv2 tool stack to measure distance and bearing for each one. A Navigator agent then reasons over those grounded measurements and produces the spoken response.

Architecture Overview
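As a rough illustration of the routing described above, here is a minimal Python sketch. Every name in it (`run_pipeline`, `Measurement`, the `stub_*` functions) is hypothetical and not taken from the Gemma Guide repository. The stubs stand in for the real Gemma 4 and TIPSv2 calls, and sending non-spatial questions to a separate `answer_directly` callable is an assumption about how the direct path is handled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Measurement:
    label: str
    distance_m: float
    bearing_deg: float  # negative = left of center, positive = right


def run_pipeline(
    question: str,
    scout: Callable[[str], str],
    answer_directly: Callable[[str], str],
    mapper: Callable[[str], List[Measurement]],
    navigator: Callable[[str, List[Measurement]], str],
) -> str:
    """Route one question: Scout first, then Mapper and Navigator if needed."""
    if scout(question) == "direct":
        return answer_directly(question)
    measurements = mapper(question)            # localize + measure
    return navigator(question, measurements)   # reason over measurements


# Stand-in stubs so the sketch runs without any model (all hypothetical).
def stub_scout(question: str) -> str:
    spatial_words = ("how far", "where", "get to", "in front")
    is_spatial = any(w in question.lower() for w in spatial_words)
    return "spatial" if is_spatial else "direct"


def stub_direct(question: str) -> str:
    return "direct answer"


def stub_mapper(question: str) -> List[Measurement]:
    return [Measurement("chair", 1.4, -10.0)]


def stub_navigator(question: str, measurements: List[Measurement]) -> str:
    return "; ".join(
        f"{m.label} at {m.distance_m:.1f} m, bearing {m.bearing_deg:+.0f} deg"
        for m in measurements
    )


print(run_pipeline("How far is the chair?",
                   stub_scout, stub_direct, stub_mapper, stub_navigator))
```

Passing the agents in as callables keeps the routing logic testable on its own, independent of which model backs each agent.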

The Spatial Grounding Layer

TIPSv2 (Google DeepMind) provides the three capabilities the spatial grounding layer depends on: metric depth, semantic segmentation, and open-vocabulary matching.

Class heads (segmentation + metric depth): Dense prediction transformer (DPT) heads produce per-pixel semantic segmentation and metric depth across 150 common object classes. When Gemma localizes a known object, the system intersects that region with the segmentation mask so depth is measured over the right pixels, not a coarse bounding box.
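To make the mask intersection concrete, here is a small NumPy sketch on toy arrays. It does not call TIPSv2; `object_distance_m` is a hypothetical helper, and taking the median over the intersected pixels is an assumption, since the statistic actually used is not stated here.

```python
import numpy as np


def object_distance_m(depth_m, class_mask, box):
    """Median metric depth over pixels that are both inside the localized
    box and labelled with the object's class. Returns None if empty."""
    x0, y0, x1, y1 = box  # pixel coordinates, x1 and y1 exclusive
    in_box = np.zeros(depth_m.shape, dtype=bool)
    in_box[y0:y1, x0:x1] = True
    pixels = depth_m[in_box & class_mask]
    if pixels.size == 0:
        return None  # the box and the mask do not overlap
    return float(np.median(pixels))


# Toy scene: a 1.5 m object in front of a 5 m background.
depth = np.full((4, 4), 5.0)
depth[1:3, 1:3] = 1.5
mask = depth < 2.0          # stand-in for a boolean segmentation mask
box = (0, 0, 4, 4)          # deliberately loose bounding box

print(object_distance_m(depth, mask, box))  # masked estimate
print(float(depth[0:4, 0:4].mean()))        # naive average over the box
```

In this toy scene the masked estimate is 1.5 m, while the plain box average is pulled toward the background (4.125 m), which is the failure mode the intersection avoids.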

Open-vocabulary matching: TIPSv2 is a vision-language encoder trained so that image patches and text live in the same embedding space. Gemma can pass any class name directly to the encoder and receive a per-patch similarity map in return, turning open-vocabulary understanding into open-vocabulary measurement. This extends grounding well beyond the 150-class limit without any fine-tuning.
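One way to picture the per-patch similarity map is as a cosine similarity between each patch embedding and the text embedding. The sketch below uses hand-made toy vectors rather than real TIPSv2 outputs; the function name, the use of cosine similarity, and the 0.5 threshold are all assumptions for illustration.

```python
import numpy as np


def patch_similarity_map(patch_emb, text_emb, grid_hw):
    """Cosine similarity between every patch embedding and one text
    embedding, reshaped to the patch grid (rows, cols)."""
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return (p @ t).reshape(grid_hw)


# Toy 2x2 patch grid with 2-dimensional embeddings.
patches = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
query = np.array([1.0, 0.0])   # stand-in for an encoded class name

sim = patch_similarity_map(patches, query, (2, 2))
object_mask = sim > 0.5        # hypothetical threshold
print(object_mask)
```

A mask produced this way can play the same role as a class-head segmentation mask when the requested object is outside the 150 known classes.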

I used the B/14 variant (86M vision + 110M text params), the smallest of the four.

The Interface

Gemma Guide is designed with a blind-first philosophy at its core: accessibility is not a feature added on top, it is the only way to build. The UI uses a two-zone tap-anywhere layout with no visual-first interaction patterns. Audio soundscapes and TTS guidance bridge the gap during model reasoning, and all architectural complexity is entirely hidden from the user.

Demo

Code

Gemma Guide

Gemma Guide is a blind-first multimodal navigation assistant that combines Gemma 4 with TIPSv2 to answer grounded questions like:

  • What object is in front of me?
  • How far away is it?
  • Where is it relative to me?
  • How should I move safely?

Motivation

For a blind user, the important question is not just what is in front of me, but how far away it is and how I should move safely. That is the gap between scene description and real navigation assistance. A useful system must do more than describe a scene in natural language; it must produce grounded spatial answers that guide movement in the real world.

Language models are not reliable depth sensors, but with Gemma 4, they can act as an agent that identifies an object, calls specialized spatial tools, and turns grounded distance estimates into practical guidance. Gemma Guide is built to turn…

How I Used Gemma 4

I used Gemma 4 E4B as the multimodal orchestrator across a three-agent pipeline.

Why Gemma 4 E4B specifically: the interaction is voice-driven, the scene must be interpreted visually, and the model must decide in real time whether a conversational answer is sufficient or whether grounded spatial measurement is required. That demands audio understanding, image understanding, native function calling, and strong reasoning together, and E4B delivers all of this while remaining compact.

That compactness is important for more than speed. It makes co-deployment alongside TIPSv2 feasible, and it directly advances the case for on-device deployment. A blind user should not need an internet connection to see, and running locally removes a meaningful privacy concern around continuous camera access. Keeping the language model in this size class is what makes that future realistic.

The Gemma 4 model card also lists two capabilities that turned out to be directly relevant here: native pointing (the model can return spatial coordinates for objects within an image, not just describe them) and interleaved multimodal input (audio and visual content freely mixed in a single prompt). Both are described under image understanding in the model card, and relying on them gave more reliable responses.

The pipeline:

  • Scout: Classifies the request - direct answer or spatial analysis needed - and routes accordingly.
  • Mapper: Interprets the question, localizes relevant objects, and calls the TIPSv2 spatial tool stack to build a grounded scene state with distance and bearing per object.

Mapper overview

  • Navigator: Receives the annotated image and structured measurements and generates the final spoken guidance. Separating this from the Mapper matters: reasoning over a clean measured world model is more reliable than reasoning over raw tool outputs.

Mapper Output & Navigator Input - Annotated Scene

User Question: Can you tell me how to get to the grill in my backyard?

Gemma Guide: The grill is 2.342 meters away, straight ahead. There are a few things in the way. You should first step slightly right to clear the table, which is 0.985 meters away about 25 degrees to your left. Then, the chair at 1.345 meters straight ahead will be in your path. Steer gently to your right to avoid it and step forward.
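The measurements quoted in that reply can be pictured as a small structured scene state handed to the Navigator. The sketch below is an assumption about what such a structure might look like, not the project's actual schema: `GroundedObject`, `scene_state_json`, the field names, and the sign convention (negative bearing = left) are all hypothetical. Only the three distance and bearing values come from the example above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class GroundedObject:
    label: str
    distance_m: float
    bearing_deg: float  # negative = left, positive = right, 0 = straight ahead


def scene_state_json(objects: List[GroundedObject]) -> str:
    """Serialize measured objects, nearest first, for use in a prompt."""
    ordered = sorted(objects, key=lambda o: o.distance_m)
    return json.dumps({"objects": [asdict(o) for o in ordered]}, indent=2)


scene = [
    GroundedObject("grill", 2.342, 0.0),
    GroundedObject("table", 0.985, -25.0),
    GroundedObject("chair", 1.345, 0.0),
]
state = scene_state_json(scene)
print(state)
```

Sorting nearest-first is a design choice for the sketch: the closest obstacles are usually the ones the spoken guidance has to address first.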

Key Findings

  • Distance alone is not enough. Early versions reported metric depth but users had no sense of direction. Adding horizontal bearing from the object's position in the frame turned a distance reading into actionable spatial guidance.

  • Whole-scene depth reasoning was too unreliable. Asking the model to reason over a full depth map produced ambiguous results. The reliable path was having Gemma localize the object first, then feeding that into the measurement pipeline - leveraging what the model is actually built for: reasoning, localization, and tool calling.

  • Separation of concerns made outputs consistent. Combining scene description, tool orchestration, and navigation reasoning in one agent made outputs inconsistent. Splitting into Scout, Mapper, and Navigator fixed this.

  • vLLM was the right local inference choice for now. Ollama lacks audio input entirely, which rules it out for a voice-driven pipeline, and its Gemma 4 tool-calling parser has had numerous bugs, whereas vLLM's proved robust. With quantization, a reduced max sequence length, and fewer image patches per call, the full stack (Gemma 4 E4B plus TIPSv2) fits on a single 16GB GPU.

  • Latency is the dominant UX constraint. Tool calls run in parallel, but the Mapper and Navigator still bottleneck on model reasoning itself, and that cost grows as conversation history lengthens. Complex scenes can push end-to-end response time past 20 seconds.

  • Reliability is the remaining work. The navigation instructions are still sometimes off, and the next step is fine-tuning the grounding stack on task-specific data.
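For the first finding, horizontal bearing can be derived from an object's pixel column under a simple pinhole camera model. The sketch below is one plausible formulation, not necessarily the one Gemma Guide uses: `bearing_deg` is a hypothetical helper, and the 70 degree horizontal field of view is an assumed value that would need to match the actual camera.

```python
import math


def bearing_deg(u_px: float, image_width_px: int, hfov_deg: float) -> float:
    """Horizontal angle of pixel column u_px relative to the optical axis.
    Negative means left of center, positive means right (pinhole model)."""
    cx = image_width_px / 2.0
    # Focal length in pixels implied by the horizontal field of view.
    fx = cx / math.tan(math.radians(hfov_deg) / 2.0)
    return math.degrees(math.atan((u_px - cx) / fx))


# Assumed camera: 640 px wide, 70 degree horizontal field of view.
for u in (160, 320, 640):
    print(u, round(bearing_deg(u, 640, 70.0), 1))
```

A pixel at the image center maps to 0 degrees and a pixel at the right edge maps to half the field of view, so the result can be paired directly with the measured distance for each object.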

Toward on-device deployment:

I explored Google AI Edge Gallery as a path to partial on-device deployment (Gemma locally, TIPS stack remote), but the current blocker is that the image Gemma sees in chat is not forwarded into the skill execution context, which breaks grounded measurement. A standalone mobile app with tighter camera and voice control is the stronger long-term path, and both models are compact enough to make fully offline deployment on edge hardware plausible.