RefVault: a local-first design reference vault, powered by Gemma 4 26B MoE

What I Built

RefVault is a native macOS app that turns your screenshot folder into a searchable design reference library. Out of the box it watches ~/Desktop — exactly where macOS drops every Cmd-Shift-4 screenshot, so there's nothing for you to set up. Drop a screenshot the way you already do — from Pinterest, Dribbble, a competitor's landing page, anywhere — and Gemma 4 26B reads it locally, pulls out palette, typography, mood, layout, tags, and the URL on screen, then files it. When you need it weeks later, you search by sentence ("minimal pricing serif", "i want some illustration references") and the right screenshot is right there.

I'm a designer. I bookmark pages, save Pinterest pins, and screenshot UI references constantly — and by the time I actually need a reference for a project, none of it is where I left it. Bookmarks in the wrong browser, Pinterest boards reorganized, screenshots buried on the Desktop with names like Screenshot 2026-05-08 at 4.24.26 AM.png. RefVault is the thing I built so I'd stop losing them.

Everything runs on the Mac. Nothing leaves the machine.

Demo

↓ Download RefVault for macOS · README

A single walkthrough shows the whole loop end to end: taking a screenshot, RefVault auto-indexing it, the Dynamic-Island-style save toast, searching the library by sentence, and dragging a result out into another app.

[Stills: Save toast · Already in library · Drop to import]

Code

Repo: github.com/Krsatvik1/RefVault

Download the signed .app (free, ad-hoc signed — no Developer ID): latest release.

Built in Swift / SwiftUI as a single SwiftPM-wrapped .app. The app bundles its own Ollama runtime and downloads Gemma 4 26B on first run, so end users don't need to install anything — drag the app into /Applications, click "Open Anyway" once for Gatekeeper (since I don't have an Apple Developer account), and it works.

The release artifact ships as a .zip, not a .dmg. macOS Sequoia (15+) added a Gatekeeper check on disk images themselves that flags ad-hoc-signed .dmgs with a separate "Apple could not verify…" prompt at mount time, on top of the .app's own unidentified-developer prompt. A .zip isn't subject to that check, so users only deal with one Privacy & Security override (for the .app) instead of two. Safari auto-extracts the download; users see RefVault.app and drag it straight into /Applications.

How I Used Gemma 4

I picked Gemma 4 26B MoE, the Mixture-of-Experts variant Google describes as "designed for high-throughput, advanced reasoning." That framing fits RefVault almost word for word: indexing is high-throughput by nature (one image at a time, but a steady stream as the user takes screenshots), and the per-image work is genuinely a reasoning task (read the image, identify the design archetype, infer mood and typography, distinguish browser chrome from page content). The MoE architecture also means RefVault gets reasoning quality close to the 31B Dense variant while staying inside the 24 GB of unified memory on my MacBook Air M4.

Earlier builds of RefVault used the smaller E4B variant for speed, but it got palette and typography wrong often enough that the library became noisy. When you search for "minimal pricing serif" and the screenshot is mistagged as "sans-serif", the whole product breaks. The 26B MoE produces tags I trust on the first read.

A few engineering decisions that made Gemma 4 work well for this:

1. The indexing pipeline — granular + parallel

Every screenshot first runs through a relevance gate (relevance.txt) — one short Gemma call that decides whether the image is a design reference. If Gemma says it's a chat window, an error dialog, a code editor, or a random photo, RefVault drops the screenshot before any extraction runs and nothing else gets called.
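
As a rough illustration of the gate, here is a minimal Swift sketch. It is not the code from the repo: the JSON reply shape (`is_reference`, `is_browser`) and the names `RelevanceVerdict`, `GateDecision`, and `decide(fromGateReply:)` are assumptions made for the example, and the real relevance.txt prompt may ask for a different format.

```swift
import Foundation

// Hypothetical shape of the gate's reply; the real relevance.txt output may differ.
struct RelevanceVerdict: Decodable {
    let isReference: Bool   // true when the image looks like a design reference
    let isBrowser: Bool     // true when browser chrome is visible (enables the URL call)

    enum CodingKeys: String, CodingKey {
        case isReference = "is_reference"
        case isBrowser = "is_browser"
    }
}

enum GateDecision: Equatable {
    case drop                       // skip every extraction call
    case index(extractURL: Bool)    // run the metadata calls, optionally the URL one
}

/// Turns the model's raw reply into a decision. An unparseable reply is treated
/// as "drop", so a malformed answer never triggers the expensive extraction calls.
func decide(fromGateReply reply: String) -> GateDecision {
    guard let data = reply.data(using: .utf8),
          let verdict = try? JSONDecoder().decode(RelevanceVerdict.self, from: data),
          verdict.isReference
    else { return .drop }
    return .index(extractURL: verdict.isBrowser)
}
```

Dropping on a parse failure is one possible policy; retrying the gate once before giving up would be a reasonable alternative if false drops turn out to be a problem.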

If it passes, the agent fires up to seven granular calls in parallel, one per axis. Each is its own short, focused prompt that does exactly one thing:

| axis | prompt | extracts |
| --- | --- | --- |
| `style` | `metadata_style.txt` | one of `minimal`, `brutalist`, `editorial`, `playful`, … |
| `typography` | `metadata_typography.txt` | per-slot type (headings / bodies / others) |
| `mood` | `metadata_mood.txt` | 2–3 adjectives |
| `layout` | `metadata_layout.txt` | one of `hero`, `pricing`, `dashboard`, `landing`, … |
| `tags` | `metadata_tags.txt` | 5–15 single-word tags |
| `color` | `colors.txt` | primary / secondary / accent / full palette as hex |
| `url` | `url.txt` | the URL on screen, only when the relevance gate flagged the image as a browser shot |

So a typical indexing path is 1 relevance call plus up to 7 metadata calls (six always run; the URL call is conditional). The metadata calls fan out concurrently and share Ollama's warm KV cache, so total wall-clock time barely grows past a single call's worth.
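
The fan-out can be sketched with Swift structured concurrency. This is an illustration under stated assumptions, not the repo's implementation: `Axis`, `extractMetadata`, and the injected `ask` closure are hypothetical names, and `ask` stands in for "send this prompt file plus the image to the local model". Only the prompt file names come from the list of axes.

```swift
import Foundation

// One case per axis; raw values are the prompt file names listed for each axis.
enum Axis: String, CaseIterable, Sendable {
    case style = "metadata_style.txt"
    case typography = "metadata_typography.txt"
    case mood = "metadata_mood.txt"
    case layout = "metadata_layout.txt"
    case tags = "metadata_tags.txt"
    case color = "colors.txt"
    case url = "url.txt"
}

/// Runs one model call per axis concurrently and collects the raw replies.
/// `ask` is a hypothetical stand-in for the actual call to the local model.
func extractMetadata(
    isBrowserShot: Bool,
    ask: @escaping @Sendable (Axis) async -> String
) async -> [Axis: String] {
    // Six axes always run; the URL axis joins only for browser screenshots.
    let axes = Axis.allCases.filter { $0 != .url || isBrowserShot }

    return await withTaskGroup(of: (Axis, String).self,
                               returning: [Axis: String].self) { group in
        for axis in axes {
            group.addTask {
                let reply = await ask(axis)
                return (axis, reply)
            }
        }
        // Replies arrive in completion order, so they are keyed by axis.
        var results: [Axis: String] = [:]
        for await (axis, reply) in group {
            results[axis] = reply
        }
        return results
    }
}
```

Injecting `ask` as a closure keeps the fan-out testable without a running model: a stub that returns canned strings exercises the same code path.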

Across screenshots, indexing is serial — a single queue runs them one at a time. If five screenshots land in the watched folder at once, the first goes through the relevance + parallel-extraction pipeline end-to-end, then the second, and so on. This keeps Ollama's GPU memory usage predictable on M-series Macs (the 26B model takes ~16 GB at runtime, so trying to run two indexes concurrently would thrash) and lets the in-app toast show clean per-image progress.
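
One way to get this one-at-a-time behavior in Swift is a single consumer task draining an `AsyncStream`. The sketch below is an assumption about how such a queue could look, not the queue RefVault ships; `SerialIndexQueue` and the `index` closure (a stand-in for the relevance + extraction pipeline) are hypothetical names.

```swift
import Foundation

/// Processes screenshots strictly one at a time, in arrival order.
/// A single consumer task drains the stream, so a new index never starts
/// until the previous one has finished.
final class SerialIndexQueue: Sendable {
    private let continuation: AsyncStream<URL>.Continuation
    private let worker: Task<Void, Never>

    /// `index` is a hypothetical stand-in for the per-screenshot pipeline.
    init(index: @escaping @Sendable (URL) async -> Void) {
        // The build closure runs synchronously, so `captured` is set before use.
        var captured: AsyncStream<URL>.Continuation!
        let stream = AsyncStream<URL> { captured = $0 }
        continuation = captured
        worker = Task {
            for await screenshot in stream {
                await index(screenshot)   // the next item waits until this returns
            }
        }
    }

    /// Called by the folder watcher whenever a new screenshot appears.
    func enqueue(_ screenshot: URL) {
        continuation.yield(screenshot)
    }

    /// Stops accepting work and waits for already-queued screenshots to finish.
    func finish() async {
        continuation.finish()
        await worker.value
    }
}
```

A plain actor would not be enough here, because actors are reentrant: a second call could start while the first is suspended on the model. The single `for await` loop is what enforces the ordering.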

Why split it up at all? Each prompt is small and specific. When you ask for one thing, the model can't shortcut a hard sub-task (mood and typography are the usual culprits) by giving up on just that one and padding the rest. I A/B'd this against a single combined prompt inside an in-app Debug view.

The granular-parallel pipeline produced consistently sharper per-field outputs at comparable wall-clock time on the M4. When forced to answer all axes in one response, the model leans on the easy ones and gets sloppy on the hard ones — separating them keeps each answer crisp.

2. Why 26B MoE, not E4B

I ran the same image and the same prompts through both model variants.

The 26B MoE output recognizes "high-end, editorial" mood and richer layout language ("modern, sophisticated, large-scale-typography, monochromatic, asymmetric, minimalist") where the E4B variant returns thinner, generic tags. Indexing happens once in the background, so model quality matters more than raw speed for this use case — and the MoE design means the quality jump comes without a proportional jump in active-parameter cost during inference.

3. Search uses the same model — no embeddings

I considered adding a separate embedding model for semantic search and decided against it for a practical reason: shipping one extra model means another download on first run (the user already waits for ~15 GB of Gemma 4 26B), another set of weights resident in RAM, and another moving piece to keep version-aligned with Gemma. Reusing the same model the user already has on disk keeps the install one-shot and the runtime memory budget single-tenant.

Instead, the user's sentence goes through one short Gemma prompt that rewrites it into a structured filter:

"i want some illustration references"
   →  { "tags_any": ["illustration"] }
   →  SQL against the local SQLite library

One Gemma call per query, no extra model to sync, and search stays fully offline.
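
To make the filter-to-SQL step concrete, here is a small Swift sketch. Only the `tags_any` key comes from the example above; the `layout` key, the `SearchFilter` type, the `makeQuery` function, and the table and column names (`reference_items`, `reference_tags`, `reference_id`, `tag`, `path`) are all assumptions for illustration and may not match the repo's schema.

```swift
import Foundation

// Hypothetical filter shape: `tags_any` is from the example, `layout` is assumed
// here only to show how a second axis could slot in.
struct SearchFilter: Decodable {
    var tagsAny: [String]?
    var layout: String?

    enum CodingKeys: String, CodingKey {
        case tagsAny = "tags_any"
        case layout
    }
}

/// Builds a parameterized query. Values derived from the user's sentence only
/// ever travel as bound parameters, never as SQL text.
func makeQuery(from filter: SearchFilter) -> (sql: String, parameters: [String]) {
    var clauses: [String] = []
    var parameters: [String] = []

    if let tags = filter.tagsAny, !tags.isEmpty {
        // One "?" per tag, so the IN list matches the number of bound values.
        let placeholders = Array(repeating: "?", count: tags.count).joined(separator: ", ")
        clauses.append("id IN (SELECT reference_id FROM reference_tags WHERE tag IN (\(placeholders)))")
        parameters += tags
    }
    if let layout = filter.layout {
        clauses.append("layout = ?")
        parameters.append(layout)
    }

    let whereClause = clauses.isEmpty ? "" : " WHERE " + clauses.joined(separator: " AND ")
    return ("SELECT path FROM reference_items" + whereClause, parameters)
}
```

Because the model's output is decoded into a fixed struct before any SQL is built, an unexpected key in the reply is simply ignored rather than reaching the database.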

Performance on a MacBook Air M4 (24 GB RAM)

Indexing: ~60–100 seconds per screenshot (one-time, in the background)
Search: ~20 seconds per query (one Gemma call to parse, then a local SQLite hit)

Both numbers depend on model size and chip; Macs with more GPU cores and memory bandwidth should be faster.

Full prompts, code, and the build pipeline are in the repo. Built solo in Swift / SwiftUI / Ollama for the Gemma 4 Challenge.
