Your AI can read. Gemma 4 can see
amionweb · 2026-05-23 · via DEV Community

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Your AI can read. Gemma 4 can see. Here's what that actually changes.

For two years, talking to an AI meant typing. You described things in words, the AI answered in words. If you wanted help with a photo, a handwritten note, or a screenshot, you first had to translate it into a paragraph — and hope you didn't leave out the part that mattered.

Gemma 4 is multimodal, which is a clunky word for a simple idea: you can show it a picture instead of describing one. I spent an afternoon doing exactly that, and the gap between "tell the AI" and "show the AI" turned out to be bigger than I expected.

Here's what multimodal actually means, three things I showed it, and how you can try it yourself in about five minutes — free, no fancy hardware.

"Multimodal" in one sentence

A mode is a type of input: text is one mode, images are another, audio is a third.

A text-only model is like texting a friend who can only read words. A multimodal model is like video-calling that friend — you can hold something up to the camera and they just see it.

Gemma 4 handles text, images, and audio through the same model. You don't bolt on a separate "image reader." The thing that understands your sentence is the same thing that understands your photo. That matters more than it sounds, and the examples make it obvious.

Three things I showed it

I didn't write clever prompts. I literally uploaded a photo and asked a plain question, the way you'd ask a knowledgeable friend.

1. A drooping houseplant. I uploaded a photo of a sad-looking plant and asked, "What's wrong with this?" It pointed out the yellowing lower leaves and damp-looking soil and suggested I was overwatering — and to check that the pot actually drained. I never told it the leaves were yellow. It looked.

2. A handwritten grocery list. My handwriting is genuinely bad. I snapped a photo and asked it to type the list out. It read all but one item correctly (it guessed "tomatoes" where I'd scrawled something closer to "tamarind" — fair). Typing that list myself would've taken longer than photographing it.

3. A screenshot of a line chart with no title. I asked, "What's the trend here?" It described the steady climb, called out the dip in the middle, and noted the sharp rise at the end — reading the shape of the data, not just labels. For someone who finds charts intimidating, that's a quiet superpower.

None of this was perfect. It got one grocery item wrong, and I suspect it would have struggled if I'd asked it to read tiny, dense text. But "show instead of describe" changes the kind of help you can ask for. You stop being the translator.

Why this is a bigger deal than it looks

Three reasons this matters beyond the novelty:

  • You skip the translation step. Describing an image in words is lossy and slow. A photo carries everything at once — color, layout, handwriting, the thing you didn't think to mention.
  • It opens AI to people who don't love typing. Point a camera at a problem and ask about it. That's a far lower bar than composing the perfect prompt.
  • The small versions run on your own machine. Gemma 4 comes in sizes small enough to run on a laptop or even a phone, offline. So "show the AI a photo" doesn't have to mean "upload my private photo to someone's server." It can all happen on your device. For anything personal — documents, medical photos, your kid's homework — that's the difference between useful and no thanks.

That last point is the one I keep coming back to. A model that can see, running entirely on hardware you own, with no internet connection, would have sounded like science fiction in 2023. It's a free download in 2026.

Try it yourself in five minutes

You don't need a powerful computer to start. Two paths, easiest first.

Path A — zero install (browser, free)

  1. Go to Google AI Studio (aistudio.google.com) and sign in with a Google account.
  2. Start a new prompt and pick a Gemma 4 model from the model dropdown.
  3. Click the image/upload icon, add any photo from your computer — a plant, a receipt, a whiteboard, a chart.
  4. Type a plain question: "What is this?" or "Read the text in this image."
  5. Watch it answer based on what it sees.

That's the whole thing. No setup, no card, no code.

Path B — run it on your own machine (offline, private)

If you want it running locally with nothing leaving your computer:

  1. Install Ollama from ollama.com (one download, Windows/Mac/Linux).
  2. Open a terminal and start the small multimodal model:
   ollama run gemma4:e4b


The first run downloads the model once (a couple of gigabytes). After that it works with no internet.

  3. In the chat prompt, include the path to an image file on your computer along with your question. The model reads the picture locally, and nothing is uploaded.
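
If you'd rather script Path B than type into the chat prompt, here is a minimal sketch of the same "show it a photo, ask a question" flow against Ollama's local HTTP API. It assumes Ollama is running on its default port (11434) and that the gemma4:e4b tag from step 2 is already downloaded. The helper names build_payload and ask_about_image are made up for this example, not part of Ollama.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "gemma4:e4b"  # the tag used in step 2


def build_payload(image_bytes: bytes, question: str, model: str = MODEL) -> dict:
    """Build the JSON body: the question plus the image as a base64 string."""
    return {
        "model": model,
        "prompt": question,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON reply instead of a token stream
    }


def ask_about_image(image_path: str, question: str) -> str:
    """Send a local image and a question to the local Ollama server."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), question)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Supplying a request body makes urllib send a POST.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, something like ask_about_image("plant.jpg", "What's wrong with this plant?") should return the model's answer as a string. Here plant.jpg is a placeholder for any photo on your disk. The request only goes to localhost, so the image stays on your machine.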

Start with Path A to feel the magic, then switch to Path B when you want privacy.

What I'd explore next

The thing I want to try next is audio: Gemma 4 hears as well as sees, which means you could hand it a voice memo and a photo together and ask one question about both. We're early in figuring out what that unlocks.

But the simple version is already enough to change how I use AI day to day. I type less. I show more. And the friend on the other end of the video call finally has eyes.

If you try it, show it something weird and tell me what it said — that's the fun part.


Want to go deeper? The official models are on Hugging Face and Kaggle, all free to download.