Your AI can read. Gemma 4 can see

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Your AI can read. Gemma 4 can see. Here's what that actually changes.

For two years, talking to an AI meant typing. You described things in words, the AI answered in words. If you wanted help with a photo, a handwritten note, or a screenshot, you first had to translate it into a paragraph — and hope you didn't leave out the part that mattered.

Gemma 4 is multimodal, which is a clunky word for a simple idea: you can show it a picture instead of describing one. I spent an afternoon doing exactly that, and the gap between "tell the AI" and "show the AI" turned out to be bigger than I expected.

Here's what multimodal actually means, three things I showed it, and how you can try it yourself in about five minutes — free, no fancy hardware.

"Multimodal" in one sentence

A mode is a type of input: text is one mode, images are another, audio is a third.

A text-only model is like texting a friend who can only read words. A multimodal model is like video-calling that friend — you can hold something up to the camera and they just see it.

Gemma 4 handles text, images, and audio through the same model. You don't bolt on a separate "image reader." The thing that understands your sentence is the same thing that understands your photo. That matters more than it sounds, and the examples make it obvious.

Three things I showed it

I didn't write clever prompts. I literally uploaded a photo and asked a plain question, the way you'd ask a knowledgeable friend.

1. A drooping houseplant. I uploaded a photo of a sad-looking plant and asked, "What's wrong with this?" It pointed out the yellowing lower leaves and damp-looking soil and suggested I was overwatering — and to check that the pot actually drained. I never told it the leaves were yellow. It looked.

2. A handwritten grocery list. My handwriting is genuinely bad. I snapped a photo and asked it to type the list out. It read all but one item correctly (it guessed "tomatoes" where I'd scrawled something closer to "tamarind" — fair). Typing that list myself would've taken longer than photographing it.

3. A screenshot of a line chart with no title. I asked, "What's the trend here?" It described the steady climb, called out the dip in the middle, and noted the sharp rise at the end — reading the shape of the data, not just labels. For someone who finds charts intimidating, that's a quiet superpower.

None of this was perfect. It got one grocery item wrong, and I'd expect it to struggle with tiny, dense text. But "show instead of describe" changes the kind of help you can ask for. You stop being the translator.

Why this is a bigger deal than it looks

Three reasons this matters beyond the novelty:

1. You skip the translation step. Describing an image in words is lossy and slow. A photo carries everything at once: color, layout, handwriting, the thing you didn't think to mention.
2. It opens AI to people who don't love typing. Point a camera at a problem and ask about it. That's a far lower bar than composing the perfect prompt.
3. The small versions run on your own machine. Gemma 4 comes in sizes small enough to run on a laptop or even a phone, offline. So "show the AI a photo" doesn't have to mean "upload my private photo to someone's server." It can all happen on your device. For anything personal (documents, medical photos, your kid's homework), that's the difference between "useful" and "no thanks."

That last point is the one I keep coming back to. A model that can see, running entirely on hardware you own, with no internet connection, would have sounded like science fiction in 2023. It's a free download in 2026.

Try it yourself in five minutes

You don't need a powerful computer to start. Two paths, easiest first.

Path A — zero install (browser, free)

1. Go to Google AI Studio (aistudio.google.com) and sign in with a Google account.
2. Start a new prompt and pick a Gemma 4 model from the model dropdown.
3. Click the image/upload icon and add any photo from your computer: a plant, a receipt, a whiteboard, or a chart.
4. Type a plain question: "What is this?" or "Read the text in this image."
5. Watch it answer based on what it sees.

That's the whole thing. No setup, no card, no code.

Path B — run it on your own machine (offline, private)

If you want it running locally with nothing leaving your computer:

1. Install Ollama from ollama.com (one download, Windows/Mac/Linux).
2. Open a terminal, then pull and start the small multimodal model with one command:

   ollama run gemma4:e4b

   The first run downloads the model once (a couple of gigabytes). After that it works with no internet.

3. In the chat prompt, point it at an image file on your computer by including the file's path in your message, then ask your question. It reads the picture locally, and nothing is uploaded.
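If you'd rather script that local step than type into the chat, here is a minimal sketch using only Python's standard library. It assumes Ollama is running on its default local address (http://localhost:11434) and uses its /api/generate endpoint, which accepts base64-encoded images in an "images" field. The model tag is the one from the command above; the file name plant.jpg and the helper names build_payload and ask_about_image are made up for illustration.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Model tag copied from the `ollama run` command in this post.
MODEL = "gemma4:e4b"


def build_payload(image_bytes: bytes, prompt: str, model: str = MODEL) -> dict:
    """Build the JSON body for one non-streaming question about one image."""
    return {
        "model": model,
        "prompt": prompt,
        # The API expects each image as a base64-encoded string.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for a single JSON reply instead of a stream of chunks.
        "stream": False,
    }


def ask_about_image(path: str, prompt: str) -> str:
    """Send a local image file plus a question to the local model."""
    with open(path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example call (hypothetical file name; needs Ollama running):
# print(ask_about_image("plant.jpg", "What's wrong with this plant?"))
```

The request goes to localhost, so the photo stays on your machine, same as in the chat.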

Start with Path A to feel the magic; switch to Path B when you want privacy.

What I'd explore next

The thing I want to try next is audio: Gemma 4 hears as well as sees, which means you could hand it a voice memo and a photo together and ask one question about both. We're early in figuring out what that unlocks.

But the simple version is already enough to change how I use AI day to day. I type less. I show more. And the friend on the other end of the video call finally has eyes.

If you try it, show it something weird and tell me what it said — that's the fun part.

Want to go deeper? The official models are on Hugging Face and Kaggle, all free to download.
