This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
GizmoGuard is a low-budget, privacy-first AI-at-the-edge personal safety and monitoring bot powered by locally running Gemma models.
The idea started from a simple but relatable problem:
“Who moved my mug?”
GizmoGuard continuously monitors a workspace — or any valuable object of interest, indoors or outdoors — using an ArduCam attached to a Raspberry Pi. The system detects scene changes such as:
- Objects being moved
- Objects being removed
- Objects being replaced
- Unexpected objects appearing
- People approaching or touching monitored items
- Ambient light changes
- Unwanted background noise
The system is designed to intelligently distinguish between normal environmental activity and a real scene change near the protected object.
When motion or scene changes are detected, GizmoGuard captures “evidence images” and sends them to a Spring Boot backend API. The backend then uses Gemma 4 for multimodal image reasoning and natural-language explanations.
Using additional preconfigured contextual information, the system can also:
- Recognize known people vs strangers
- Analyze gestures and emotions
- Understand nearby activity and surroundings
- Generate spoken voice-enabled responses using the host system’s speech functionality
The entire system is built around a local-first AI architecture:
- Images never leave the local environment
- No cloud AI APIs are required
- No recurring inference costs
- Runs affordably on consumer-grade hardware
GizmoGuard demonstrates how compact multimodal AI models like Gemma can power practical, privacy-focused real-world edge AI applications.
Current Architecture
The current GizmoGuard architecture consists of the following components:
Raspberry Pi + ArduCam
- Python running on the Raspberry Pi continuously captures images
- Performs lightweight motion and scene-change detection
- Sends “evidence images” to backend APIs for AI analysis
Spring Boot REST API
The Spring Boot backend acts as the orchestration layer and:
- Manages image analysis workflows
- Stores chat and contextual memory
- Handles evidence image pipelines
- Integrates with Gemma 4 using OpenAI-compatible APIs
Docker Model Runner (DMR)
- Runs Gemma locally on my laptop
- Exposes model APIs for multimodal inference
- Enables fully local AI processing without cloud dependencies
Local Storage + MySQL
- Stores evidence images locally
- Maintains conversation history and contextual memory
- Persists AI-generated responses and metadata
Multimodal AI Layer
Powered by Gemma 4, the AI layer:
- Analyzes captured images
- Explains scene changes in natural language
- Supports conversational interaction
- Generates contextual reasoning about nearby activity
The project demonstrates how practical multimodal AI systems can run locally using affordable hardware — without requiring expensive cloud infrastructure or hosted AI services.
Demo
Demo Link: Gizmo-Guard Bot Demo
Demo Includes
Mug placed on desk
Scene continuously monitored by Raspberry Pi + ArduCam
Mug moved, removed, or scene unexpectedly changes
Evidence image captured automatically
Gemma analyzes the image and explains what changed using multimodal reasoning
When real people (or images of them) appear in the scene:
- Detects known people pre-configured through system prompts and contextual memory
- Identifies all unknown individuals as strangers
- Analyzes appearance, ambience, emotions, and gestures
- Detects potentially malicious or friendly behavior and reports observations
- The system generates voice-enabled spoken responses using the host system’s native speech functionality, allowing GizmoGuard to verbally describe scene changes, alerts, and AI observations in real time.
- Future prospects (model voice capabilities for analysis, Servo-based/GIO-Header wheels)
Code
GitHub (sasiperi) Repo name and Link: gizmo-guard-gemma4-challenge
Tech stack includes:
- Java + Spring Boot
- Raspberry Pi + ArduCam
- Docker Model Runner (DMR)
- Ollama/OpenAI-compatible APIs (gguf models)
- Gemma4 multimodal model
- MySQL for Chat Memory and Responses etc..
- Local filesystem storage for images.
How I Used Gemma 4
GizmoGuard is powered by Gemma 4B Quantized (gemma4:4B-Q4_K_XL) running locally through Docker Model Runner (DMR).
I specifically selected this model because it delivered the best overall balance between:
- Multimodal capability
- Local deployment feasibility
- Memory footprint
- Reasoning quality
- Privacy
- Cost efficiency
Why Gemma Was the Right Fit
1. Privacy-First Local AI
One of the primary goals of GizmoGuard was ensuring that camera images and personal workspace data never leave the local environment.
By running Gemma locally:
- No images are uploaded to external AI services
- No cloud inference is required
- The system can operate completely offline
For an always-on visual monitoring system, this was extremely important.
2. Edge-Friendly Performance
I evaluated several local multimodal models.
Some lightweight models were fast but struggled with:
- Reliable image understanding
- Object consistency
- Intelligent system/user prompt processing (chat capabilities)
Larger models produced strong results but required significantly more resources and slower inference times.
gemma4:4B-Q4_K_XL turned out to be the ideal middle ground:
- Compact enough for practical local deployment
- Efficient enough for near real-time analysis
- Still capable of strong multimodal reasoning quality
This made it an excellent fit for AI-at-the-edge workloads.
3. Multimodal Simplicity
A major advantage of Gemma4:4B was its ability to handle:
- Image understanding
- Reasoning
- Conversational responses
- System and user prompt processing
within a single model.
This avoided the need to chain together:
- Separate vision models
- OCR pipelines
- Reasoning models
- Chat models
Using a unified multimodal model simplified:
- Architecture
- Orchestration
- Deployment
- Latency
- Operational complexity
4. Cost-Effective AI
Another goal of the project was proving that useful AI systems do not require expensive cloud GPUs or recurring API fees.
Running Gemma locally means:
- Zero inference cost
- No token billing
- Predictable performance
- Full ownership of the AI stack
This makes GizmoGuard practical for:
- Hobbyists
- Makers
- Students
- Small-scale edge deployments
5. Real-World AI at the Edge
GizmoGuard demonstrates how compact multimodal models like Gemma can power practical real-world edge AI applications using affordable hardware and open-source tooling.
The project combines:
- Edge AI
- IoT
- Computer Vision
- Multimodal Reasoning
- Local LLM Deployment
- Spring Boot APIs
- Privacy-First Architecture
into a fully working end-to-end system.
It showcases how modern multimodal AI can move beyond cloud-only deployments and become useful directly at the edge.





















