Shipping Gemma 4 speech recognition in a Windows .NET desktop app: a 5-variant model-selection tour

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Parlotype is a voice-to-text desktop app for Windows. It is built with .NET 10 and Avalonia UI. You hold a global hotkey, speak, then release it. Your text appears in whatever app you were typing into. All speech recognition runs on your machine. No cloud, no audio leaves the machine.

Google released Gemma 4 in April 2026. It has a native multimodal audio path. I added it as an alternative speech engine alongside the existing Whisper.net pipeline. You pick Whisper or Gemma 4 in Settings. The rest of the audio pipeline (WASAPI capture, then Silero VAD, then text injection) stays the same.

The interesting part, and what this post is mostly about, is which Gemma 4 variant to ship. The ggml-org GGUF repos publish five variants between them: E2B in Q8_0 and BF16, and E4B in Q4_K_M, Q8_0, and BF16. The model card does not tell you which combination of accuracy, speed, and disk footprint you will actually get. So I ran each one on the same dataset, picked a default, and shipped.

Demo

The video shows the engine selector, the model picker with five variants, and a live dictation with Gemma 4.

Code

Source, ADRs, and benchmark configs: github.com/mdemin729/parlotype

Relevant entry points:

src/Parlotype.Platform/Speech/LlamaCppSpeechRecognizer.cs: the recognizer that talks to llama-server.
src/Parlotype.Core/Speech/Gemma4ModelInfo.cs: the 5-variant catalog.
docs/decisions/025-gemma4-llamacpp-desktop.md through 030-configurable-gemma4-prompts.md: the ADR series covering the integration.
results/comparison-libri-speech-test-other-2026-05-23-cuda.md: the benchmark data behind the choices below.

How I Used Gemma 4

Why a separate engine at all

Whisper is great on clean read English. It gets noticeably worse on conversational or noisy audio. Gemma 4 has a conformer audio encoder. Google's own evaluations show it reaching 4.17% WER on LibriSpeech-test-clean, which is competitive with much larger Whisper variants. For a voice-to-text app, the typical user is dictating to themselves into a focused text field. That noise profile is closer to "clean read" than to "AMI meeting", so Gemma 4 is a real alternative. Giving people the choice felt right. Either way, privacy does not depend on which model is loaded.

Why `llama-server` as the runtime

I looked at several inference paths before picking llama-server, the HTTP server from llama.cpp. The constraints were: no cloud, Windows desktop, single end-user installer, cross-vendor GPU support, no Python runtime in the user's install.

onnxruntime-genai does not support Gemma 4's architecture yet (per-layer embeddings, variable head dimensions). Tracking issue: microsoft/onnxruntime-genai#2062.
A Python sidecar works, but it pulls Python and CUDA into the user's install. That is a non-starter for non-developer users.
LLamaSharp's P/Invoke bindings lock you to one llama.cpp build at compile time, so switching from Vulkan to CUDA means re-compiling.
Ollama does not support Gemma audio yet (ollama/ollama#15333).
Lemonade is AMD-only.

llama-server with the pre-built Vulkan/CUDA Windows binaries satisfies all of these constraints. It gives cross-vendor GPU support from one download. It exposes a stable OpenAI-compatible HTTP API at /v1/chat/completions, with input_audio blocks for audio. Its release cadence is one I can manage from in-app updates. ADR-025 has the longer version of this decision.

Picking a variant: the benchmark

The catalog has five variants. That is what ggml-org/gemma-4-E2B-it-GGUF and ggml-org/gemma-4-E4B-it-GGUF actually publish, not what I would ideally pick (see ADR-029):

ModelId	GGUF	Size on disk (with bf16 mmproj)
`gemma-4-E2B-it-Q8_0`	E2B Q8_0	~5.5 GiB
`gemma-4-E2B-it-bf16`	E2B BF16	~9.6 GiB
`gemma-4-E4B-it-Q4_K_M`	E4B Q4_K_M	~5.9 GiB
`gemma-4-E4B-it-Q8_0`	E4B Q8_0	~8.4 GiB
`gemma-4-E4B-it-bf16`	E4B BF16	~15 GiB

E2B has no Q4_K_M. That asset does not exist in the repo. I learned this when manual testing returned a 404. After that, I rebuilt the catalog from the actual file lists on HuggingFace.
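A catalog that mirrors what the repos publish can be sketched as a plain list with no entry for the missing combination. This is a minimal sketch, not the contents of Gemma4ModelInfo.cs: the `Gemma4Variant` record, the `Gemma4Catalog` class, and the `Find` method are hypothetical names. Only the model IDs and approximate sizes come from the table above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: E4B has a Q4_K_M build, E2B does not, so the second lookup returns null.
Console.WriteLine(Gemma4Catalog.Find("E4B", "Q4_K_M")?.ModelId ?? "(not published)");
Console.WriteLine(Gemma4Catalog.Find("E2B", "Q4_K_M")?.ModelId ?? "(not published)");

// Hypothetical shape for one catalog entry. Sizes include the bf16 mmproj.
public sealed record Gemma4Variant(string ModelId, string Family, string Quant, double ApproxSizeGiB);

public static class Gemma4Catalog
{
    // Only combinations that exist in the ggml-org repos are listed.
    public static readonly IReadOnlyList<Gemma4Variant> All = new[]
    {
        new Gemma4Variant("gemma-4-E2B-it-Q8_0",   "E2B", "Q8_0",   5.5),
        new Gemma4Variant("gemma-4-E2B-it-bf16",   "E2B", "BF16",   9.6),
        new Gemma4Variant("gemma-4-E4B-it-Q4_K_M", "E4B", "Q4_K_M", 5.9),
        new Gemma4Variant("gemma-4-E4B-it-Q8_0",   "E4B", "Q8_0",   8.4),
        new Gemma4Variant("gemma-4-E4B-it-bf16",   "E4B", "BF16",   15.0),
    };

    // Returns null for an unpublished quant instead of guessing a file name.
    public static Gemma4Variant? Find(string family, string quant) =>
        All.FirstOrDefault(v => v.Family == family && v.Quant == quant);
}
```

Listing entries explicitly, rather than generating them from a family-by-quant grid, means a missing asset shows up as a null lookup instead of a 404 at download time.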

I ran each variant against Whisper (Small, Medium, LargeV3Turbo) on 50 samples of LibriSpeech test-other, which is the "harder" English split. Same machine, same warm-up methodology, both engines on CUDA. Whisper used greedy decoding (beam=1) so the runs are reproducible.

Rank	Engine	Model	WER %	CER %	RTF	Model load (s)
1	Whisper (CUDA)	`LargeV3Turbo`	11.48	4.97	0.055	1.31
2	Whisper (CUDA)	`Medium`	12.18	5.41	0.073	1.28
3	Whisper (CUDA)	`Small`	13.10	5.87	0.034	0.71
4	Gemma 4 (llama.cpp)	`E2B-it-BF16`	13.15	4.95	0.038	6.70
5	Gemma 4 (llama.cpp)	`E4B-it-Q4_K_M`	13.82	5.80	0.038	6.73
6	Gemma 4 (llama.cpp)	`E4B-it-BF16`	14.20	5.40	0.038	6.72
7	Gemma 4 (llama.cpp)	`E4B-it-Q8_0`	14.39	5.79	0.044	9.25
8	Gemma 4 (llama.cpp)	`E2B-it-Q8_0`	19.22	8.95	0.315	6.74

Three things from the table:

E2B-it-BF16 has the lowest CER of any model here (4.95%). It barely beats Whisper LargeV3Turbo (4.97%), but it still beats it. WER and CER do not always agree, and at this size class Gemma's character-level errors are unusually small.
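E4B-it-Q4_K_M (the shipping default) is at 13.82% WER and 0.038 RTF. That is close to Whisper Small (13.10% WER and 0.034 RTF) on accuracy and speed, although the Whisper Small model file is much smaller. The Q4_K_M quant is the right floor for shipping. At about 5.9 GiB it is the smallest Gemma 4 variant that behaved well in this benchmark, and it gives people Gemma 4 without asking them to download 15 GiB.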
E2B-it-Q8_0 is broken on this dataset. Its RTF is 0.315, roughly 8x slower than the other Gemma variants (0.038 to 0.044), and its WER is 19.22%. The first benchmark attempt crashed llama-server mid-sample because the model emitted a stray <|channel> reasoning token that the chat-template parser could not handle. I keep this variant selectable in the catalog for experimentation, but the user-facing default avoids it.
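
For readers who want to reproduce the columns in the table, WER and RTF follow the standard definitions: WER is the word-level edit distance divided by the number of reference words, and RTF is processing time divided by audio duration. The block below is a minimal sketch of those definitions, not the benchmark harness used for this post. `AsrMetrics` and its method names are hypothetical, and the only text normalization assumed here is lowercasing and whitespace splitting.

```csharp
using System;

// Usage: one substituted word against a four-word reference gives WER = 1/4.
double wer = AsrMetrics.WordErrorRate("the cat sat down", "the cat sat town");
double rtf = AsrMetrics.RealTimeFactor(processingSeconds: 0.38, audioSeconds: 10.0);
Console.WriteLine($"WER {wer:F2}, RTF {rtf:F3}");

public static class AsrMetrics
{
    // WER = (substitutions + deletions + insertions) / reference word count,
    // computed as word-level Levenshtein distance with two rolling rows.
    public static double WordErrorRate(string reference, string hypothesis)
    {
        string[] r = Split(reference);
        string[] h = Split(hypothesis);
        if (r.Length == 0)
            throw new ArgumentException("Reference must contain at least one word.", nameof(reference));

        int[] prev = new int[h.Length + 1];
        int[] curr = new int[h.Length + 1];
        for (int j = 0; j <= h.Length; j++) prev[j] = j; // empty reference: j insertions

        for (int i = 1; i <= r.Length; i++)
        {
            curr[0] = i; // empty hypothesis: i deletions
            for (int j = 1; j <= h.Length; j++)
            {
                int cost = r[i - 1] == h[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // the row just filled becomes "prev"
        }
        return (double)prev[h.Length] / r.Length;
    }

    // RTF below 1.0 means transcription runs faster than real time.
    public static double RealTimeFactor(double processingSeconds, double audioSeconds) =>
        processingSeconds / audioSeconds;

    private static string[] Split(string text) =>
        text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
}
```

CER is the same calculation over characters instead of words, which is why the two metrics can rank models differently.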

What I picked, and why

The shipping default is gemma-4-E4B-it-Q4_K_M. About 5.9 GiB on disk, 13.82% WER on this dataset, 0.038 RTF. E2B-BF16 is technically more accurate (13.15% WER), but it takes 9.6 GiB. That is not worth it for a 0.67-point WER edge. E4B Q8_0 and BF16 are there for people who have the disk space and want higher-precision weights, although on this dataset neither beat Q4_K_M on WER (14.39% and 14.20%). E2B-Q8 stays in the catalog with a "known issue" tag.

The model picker shows all five so people can experiment. But the default is the one I would install on a friend's machine without thinking about it.

Architecture

Gemma 4 sits behind the same ISpeechRecognizer interface as Whisper. A DelegatingSpeechRecognizer (backed by a small SpeechRecognizerFactory) picks one or the other at init time, based on the user's engine setting. The LlamaCppSpeechRecognizer owns a child llama-server.exe process. It posts audio as a base64 WAV blob to /v1/chat/completions:

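// Excerpt from LlamaCppSpeechRecognizer.cs
// One user message with two content parts: the text prompt and the audio clip.
var body = new
{
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = promptText },
                // base64 is the captured speech, encoded as a WAV file
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false // wait for the full transcript instead of streaming tokens
};
using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);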

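Reading the transcript back out is the other half of the call. In the OpenAI-compatible chat completions format, the assistant text sits at `choices[0].message.content`. The block below is a minimal sketch of that extraction, not the code in LlamaCppSpeechRecognizer.cs: `ChatCompletionReader` and `ExtractTranscript` are hypothetical names, and the sample body is hand-written rather than captured from a real server.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Usage with a hand-written sample response body.
string sample = "{\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\" hello world \"}}]}";
Console.WriteLine(ChatCompletionReader.ExtractTranscript(sample));

public static class ChatCompletionReader
{
    // Returns the trimmed text of the first choice, or an empty string
    // when the server returned no choices or a null content field.
    public static string ExtractTranscript(string responseJson)
    {
        using JsonDocument doc = JsonDocument.Parse(responseJson);
        JsonElement choices = doc.RootElement.GetProperty("choices");
        if (choices.GetArrayLength() == 0) return string.Empty;

        string? content = choices[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString();
        return content?.Trim() ?? string.Empty;
    }
}
```

Trimming matters for dictation because the result is injected straight into a text field, where a leading space or newline would be visible.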
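Same capture, same VAD, different recognizer. WASAPI capture feeds Silero VAD, the selected recognizer (Whisper.net or llama-server) turns the speech into text, and text injection delivers it to the focused app.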
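
The engine switch can be sketched as one interface with two implementations and a factory keyed on the user's setting. Everything in this block is an assumption made for illustration: the real ISpeechRecognizer signature is not shown in this post, so the `TranscribeAsync` method, the `SpeechEngine` enum, and both stub classes are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Usage: the setting decides which implementation receives the same audio.
ISpeechRecognizer recognizer = new DelegatingSpeechRecognizer(SpeechEngine.Gemma4);
Console.WriteLine(await recognizer.TranscribeAsync(new float[16000], CancellationToken.None));

public enum SpeechEngine { Whisper, Gemma4 }

// Hypothetical contract: mono float samples in, text out.
public interface ISpeechRecognizer
{
    Task<string> TranscribeAsync(ReadOnlyMemory<float> samples, CancellationToken cancellationToken);
}

// Stand-ins for the real Whisper.net and llama-server recognizers.
public sealed class WhisperStub : ISpeechRecognizer
{
    public Task<string> TranscribeAsync(ReadOnlyMemory<float> samples, CancellationToken cancellationToken) =>
        Task.FromResult("[whisper stub]");
}

public sealed class LlamaCppStub : ISpeechRecognizer
{
    public Task<string> TranscribeAsync(ReadOnlyMemory<float> samples, CancellationToken cancellationToken) =>
        Task.FromResult("[llama-server stub]");
}

public static class SpeechRecognizerFactory
{
    public static ISpeechRecognizer Create(SpeechEngine engine) => engine switch
    {
        SpeechEngine.Whisper => new WhisperStub(),
        SpeechEngine.Gemma4 => new LlamaCppStub(),
        _ => throw new ArgumentOutOfRangeException(nameof(engine))
    };
}

// Chooses the engine once at construction, then forwards every call.
public sealed class DelegatingSpeechRecognizer : ISpeechRecognizer
{
    private readonly ISpeechRecognizer _inner;

    public DelegatingSpeechRecognizer(SpeechEngine engine) =>
        _inner = SpeechRecognizerFactory.Create(engine);

    public Task<string> TranscribeAsync(ReadOnlyMemory<float> samples, CancellationToken cancellationToken) =>
        _inner.TranscribeAsync(samples, cancellationToken);
}
```

Because callers only see the interface, the capture, VAD, and text-injection stages need no knowledge of which engine is loaded.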

The llama-server binary itself is also managed by the app. ADR-026 covers the catalog/installer/registry subsystem that downloads Vulkan or CUDA builds from llama.cpp's GitHub Releases on demand. Users do not pick paths in a folder browser. They pick a backend in a list and hit Install. That subsystem is about 1,800 lines on its own and probably deserves its own post.
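
Once a backend is installed, starting it as a child process mostly comes down to passing the right paths. The block below is a minimal sketch under stated assumptions, not Parlotype's launcher: `LlamaServerLauncher` is a hypothetical name and every path is a placeholder. The `--model`, `--mmproj`, `--host`, and `--port` options are standard llama-server flags.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Usage: all three paths are placeholders, not Parlotype's real install layout.
ProcessStartInfo psi = LlamaServerLauncher.BuildStartInfo(
    serverExePath: @"C:\placeholder\llama-server.exe",
    modelPath: @"C:\placeholder\model.gguf",
    mmprojPath: @"C:\placeholder\mmproj.gguf",
    port: 8080);
Console.WriteLine(string.Join(" ", psi.ArgumentList));
// Process.Start(psi) would launch the server if the files existed.

public static class LlamaServerLauncher
{
    public static ProcessStartInfo BuildStartInfo(
        string serverExePath, string modelPath, string mmprojPath, int port)
    {
        var startInfo = new ProcessStartInfo(serverExePath)
        {
            UseShellExecute = false, // launch the executable directly, not via the shell
            CreateNoWindow = true    // keep the console window hidden on desktop
        };

        // ArgumentList escapes each entry, so paths with spaces are safe.
        startInfo.ArgumentList.Add("--model");
        startInfo.ArgumentList.Add(modelPath);
        startInfo.ArgumentList.Add("--mmproj"); // multimodal projector needed for audio input
        startInfo.ArgumentList.Add(mmprojPath);
        startInfo.ArgumentList.Add("--host");
        startInfo.ArgumentList.Add("127.0.0.1"); // loopback only, nothing leaves the machine
        startInfo.ArgumentList.Add("--port");
        startInfo.ArgumentList.Add(port.ToString(CultureInfo.InvariantCulture));
        return startInfo;
    }
}
```

Binding to 127.0.0.1 keeps the HTTP endpoint unreachable from other machines, which fits the no-cloud constraint.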

The transcription prompt is also user-editable. ADR-030 turned the hardcoded prompt into a small registry with a built-in default and a {language} placeholder. The placeholder is there for a future feature that picks the source language from the active keyboard layout.
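
A registry like this can be very small. The block below is a minimal sketch, not the ADR-030 implementation: `PromptRegistry`, its methods, and both prompt strings are hypothetical, and the built-in text is a stand-in rather than Parlotype's actual default prompt.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Usage: register a custom template, then fill in the language at call time.
var prompts = new PromptRegistry();
prompts.Set("custom", "Write down exactly what is said in this {language} recording.");
Console.WriteLine(prompts.Render("custom", "English"));

public sealed class PromptRegistry
{
    public const string DefaultId = "default";

    private readonly Dictionary<string, string> _templates = new()
    {
        // Stand-in text for illustration only.
        [DefaultId] = "Transcribe the following {language} speech."
    };

    // Adds a template or overwrites an existing one with the same id.
    public void Set(string id, string template) => _templates[id] = template;

    // Unknown ids fall back to the built-in default template.
    public string Render(string id, string language)
    {
        string template = _templates.TryGetValue(id, out string? found)
            ? found
            : _templates[DefaultId];
        return template.Replace("{language}", language, StringComparison.Ordinal);
    }
}
```

Falling back to the built-in default means a deleted or misspelled prompt id still produces a usable request instead of an error.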

What this taught me

Three things I learned from doing this:

The model card's headline numbers do not transfer to your stack. Google's reported 4.17% WER on LibriSpeech test-clean is real. But the path from "the model can do 4.17%" to "my app does 13.82% on the harder test-other split with the quantization that fits on user disks" goes through five variant choices, a runtime choice, and the measurement methodology. Benchmark on your own stack.
Most of the work is in the catalog, not in the inference call. The actual /v1/chat/completions HTTP call is about 30 lines of code. Most of the engineering went into everything around it: the variant catalog, the download manager, the side-by-side install of llama-server backends, and the prompt registry.
Asymmetric quantization coverage is the rule, not the exception. E2B has no Q4_K_M in the published GGUFs. The catalog has to reflect what is actually on HuggingFace, not what would be theoretically nicest.

Try Parlotype

Repo: github.com/mdemin729/parlotype
Windows only for now. .NET 10, MIT licensed.
Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads llama-server and the GGUF for you.

Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.
