This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
scribe-check is a local-first command-line tool that reads a Markdown article and a folder of source documents, and reports every concrete claim in the article that isn't corroborated by the sources you handed it. It checks six categories of fabrication risk: quoted strings that drifted a word, named entities the sources never mention (a coauthor that shouldn't be on a paper), numeric specifics that don't match (off-by-2× rod-cell counts), italicized terminology that drifted (the article italicizes X where the source italicizes Y), orthographic drift (British spelling leaking into a US-English piece, or vice versa), and temporal-marker leaks (today, this morning, weekday names sneaking into evergreen prose).
It's the kind of pass an editor would do on every draft, if every writer had an editor on every draft. Instead, it runs on Gemma 4 E4B via Ollama. Locally. On a laptop. In about a minute on a ~2,000-word article.
I built it because I'd been doing this review by hand on my own articles, assembling a citations.md file and scanning the article line by line against the citations. It's exactly the kind of repetitive, structural check a small local model can do consistently and cheaply.
Demo
Three planted fabrications in a real published article: a drifted italicized term (*simple cells* → *elementary cells*), a fake coauthor (Ahmed, Natarajan, Rao, and Petrova), and a doubled count (120 million rod cells → 240 million rod cells). scribe-check catches all three on a single pass against the article's citations.md. The CLI shows a live spinner with elapsed seconds on stderr during the ~50-second model call (auto-suppressed when piped), so the wait never feels hung:
(raw transcript and JSON live in examples/transcript-fabrications.txt and examples/output-fabrications.json.)
⚑ scribe-check: 5 finding(s)
QUOTES FLAGGED (1)
1. *elementary cells*
at: They discovered that individual neurons in the primary visual cortex, the structures they later called *elementary cells…
concern: The article italicizes *elementary cells*, but the source uses the term *simple cells* when describing the structures Hubel and Wiesel found. This is terminology drift.
closest: structures they later called *simple cells*, fired most strongly in response to oriented bars and edges at specific spatial frequencies
NAMES FLAGGED (1)
1. Petrova
concern: The article claims the DCT was introduced by Ahmed, Natarajan, Rao, and Petrova. The source only lists Ahmed, Natarajan, and Rao as the authors of the 1974 paper. 'Petrova' is a fabricated coauthor.
SPECIFICS FLAGGED (3)
1. The human eye contains roughly 240 million rod cells
concern: The source provides a canonical figure of 'roughly 120 million rod cells' (Claim 7). The article's figure of 240 million is twice the value provided in the source.
2. The human eye contains roughly six million cone cells
concern: The source provides a canonical figure of 'roughly six million cone cells' (Claim 7). This specific claim is corroborated, but the context of the 240 million rod cells makes the overall claim suspect.
3. The DCT decomposes the block into sixty-four spatial-frequency components
concern: The source confirms the block size (8x8) and the resulting number of coefficients (64), but the article's phrasing is slightly redundant and less precise than the source's description of the process.
Code
github.com/arthurpro/scribe-check
The whole thing is ~500 lines of Go split across six files:
- main.go: CLI, flag parsing, dispatch
- loader.go: article + sources loader, token estimation
- prompt.go: system prompt + per-call user prompt
- ollama.go: HTTP client for `/api/chat` with structured-JSON output and one-shot retry on malformed JSON
- render.go: color-coded terminal table
- spinner.go: stderr progress spinner with elapsed timer, auto-suppressed when stderr isn't a TTY
Single dependency: stdlib. No vendored model, no embeddings, no RAG. The whole article and all sources go into one Ollama call.
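To make the "one Ollama call" concrete, the sketch below shows roughly what a stdlib-only client with a one-shot retry could look like. It is an illustration, not the contents of ollama.go: the `Report` and `Finding` shapes, the helper names, and the seed value 42 are assumptions. The request fields (`model`, `messages`, `stream`, `format`, `options`) follow Ollama's documented `/api/chat` body.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// chatRequest covers only the /api/chat fields this sketch uses.
type chatRequest struct {
	Model    string         `json:"model"`
	Messages []message      `json:"messages"`
	Stream   bool           `json:"stream"`
	Format   string         `json:"format"`
	Options  map[string]any `json:"options"`
}

type chatResponse struct {
	Message message `json:"message"`
}

// Finding and Report are hypothetical output shapes, not the tool's schema.
type Finding struct {
	Text    string `json:"text"`
	Concern string `json:"concern"`
}

type Report struct {
	Quotes    []Finding `json:"quotes"`
	Names     []Finding `json:"names"`
	Specifics []Finding `json:"specifics"`
}

// newRequest builds a single non-streaming, JSON-formatted chat request.
func newRequest(system, user string, numCtx int) chatRequest {
	return chatRequest{
		Model:    "gemma4:latest",
		Messages: []message{{Role: "system", Content: system}, {Role: "user", Content: user}},
		Stream:   false,
		Format:   "json",
		// seed 42 is an arbitrary assumed value; the article only says "fixed seed".
		Options: map[string]any{"temperature": 0.1, "seed": 42, "num_ctx": numCtx},
	}
}

// chat posts req to an Ollama /api/chat URL and returns the reply text.
func chat(url string, req chatRequest) (string, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("ollama: unexpected status %s", resp.Status)
	}
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Message.Content, nil
}

// parseWithRetry calls ask and retries exactly once if the reply is not
// valid JSON. Transport errors from ask are returned without a retry.
func parseWithRetry(ask func() (string, error)) (Report, error) {
	var lastErr error
	for attempt := 0; attempt < 2; attempt++ {
		raw, err := ask()
		if err != nil {
			return Report{}, err
		}
		var r Report
		if lastErr = json.Unmarshal([]byte(raw), &r); lastErr == nil {
			return r, nil
		}
	}
	return Report{}, fmt.Errorf("malformed JSON after retry: %w", lastErr)
}

func main() {
	// Canned replies stand in for the model so the sketch runs without Ollama.
	replies := []string{`{"names": [`, `{"names":[{"text":"Petrova","concern":"not in sources"}]}`}
	calls := 0
	report, err := parseWithRetry(func() (string, error) {
		raw := replies[calls]
		calls++
		return raw, nil
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d name(s) flagged after %d call(s)\n", len(report.Names), calls)
}
```

Against a live server, the closure passed to `parseWithRetry` would instead return `chat("http://localhost:11434/api/chat", newRequest(system, user, numCtx))`, where 11434 is Ollama's default port.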
How I Used Gemma 4
I chose Gemma 4 E4B (the "effective 4B" edge variant, ~9.6 GB on disk at Q4_K_M, served as gemma4:latest on Ollama) because the job needs three things simultaneously and only E4B has all three:
- Structural reasoning that the E2B (2B effective) variant doesn't reliably produce. Catching *elementary cells* as drift from *simple cells* requires comparing terminology across the article and the source, not just spotting a wrong word. The smaller variant over-flagged or under-flagged inconsistently in my tests. E4B handled this reliably across multiple runs with `temperature=0.1` and a fixed seed.
- 128K context. The whole article (~2,100 words) plus the citations file (~25 verified claims with notes) plus the system prompt fits in ~6.5K tokens, comfortably inside the window. For larger source sets `scribe-check` auto-sizes `num_ctx` up to the full 131072 without re-architecting. No RAG, no chunking, no embedding store.
- Local execution. This tool runs between drafts. If it cost a cloud API call every time, I'd skip it half the time. Free + ~50s per pass on consumer hardware is the cadence at which I actually use it.
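One plausible way to auto-size `num_ctx` from a token estimate is sketched below. The actual sizing rule in scribe-check is not described in this post, so everything here is an assumption: the 4-characters-per-token estimate, the 8192 floor, the power-of-two stepping, and the reply budget. Only the 131072 ceiling comes from the text above.

```go
package main

import "fmt"

const (
	minCtx = 8192   // assumed floor, not taken from the tool
	maxCtx = 131072 // the full window cited above
)

// estimateTokens applies the rough "about 4 characters per token" rule of
// thumb, rounding up. It is an approximation, not a tokenizer.
func estimateTokens(text string) int {
	return (len(text) + 3) / 4
}

// pickNumCtx doubles from minCtx until the prompt plus a reply budget fits,
// stopping at maxCtx. ok is false when even maxCtx is too small.
func pickNumCtx(promptTokens, replyBudget int) (ctx int, ok bool) {
	need := promptTokens + replyBudget
	ctx = minCtx
	for ctx < need && ctx < maxCtx {
		ctx *= 2
	}
	return ctx, need <= ctx
}

func main() {
	// ~6.5K prompt tokens (the figure quoted above) plus an assumed 2048-token reply.
	ctx, ok := pickNumCtx(6500, 2048)
	fmt.Println(ctx, ok)
}
```

With these assumed constants, the ~6.5K-token workload plus a 2048-token reply budget needs 8548 tokens, which steps past the 8192 floor to a 16384-token window.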
I also weighed the same workload against the 26B MoE and 31B dense variants, on paper rather than in a benchmark. They would likely be sharper, but at 5–10× the latency, I'd be tempted to batch the pass to "once before publish" instead of running it on every revision. The whole point of putting the model in the writer's loop is to make the check cheap enough that it always runs. E4B sits at that intersection.
What I learned about prompting an E4B model
One real engineering discovery worth flagging for anyone else building on E4B: the prompt design is the entire product. My first prompt ("find every concrete claim in the article that isn't corroborated by the sources") caught zero of three planted fabrications. The model agreed with the article because it sounded plausible against its own world knowledge.
Adding an explicit "ignore your own world knowledge; check only against the SOURCES block" rule moved the catch rate to 1/3. Adding short positive examples of the pattern (Petrova → flag this; *elementary cells* vs *simple cells* → flag this) moved it to 3/3.
The cost is precision. On a clean article, the same prompt over-flags 5–7 borderline items: derived ratios, soft-language paraphrases, slightly-rephrased corroborated claims. A human dismisses these in seconds while skimming, and the cost of that skim is much cheaper than the cost of a missed real fabrication. That's the design trade-off scribe-check makes deliberately: high recall, modest precision.
If you're building anything fact-checking-shaped on a small local model, lean into recall. Trust the human to filter.




















