Mumbli – my personal Wispr Flow

My speech-to-text journey started roughly a year ago, when I gave dictation another try and was impressed by how much faster it was than typing. I'd long been fascinated by Churchill's use of a stenographer, and I knew dictation could be a more efficient way to write, given proper post-processing by an editor. My first attempt was about 5 years ago: I recorded myself and sent the audio to people to transcribe, but that didn't work out.

This time I tried Whisper models with a Python Qt application, and it was fine. I used that initial Qt application for some time, but it was a bit ugly and not very easy to use. Then I found another application for Mac and used that for a while.

After a year of dictating almost every day, I've learned how to speak freely without hesitation. It's not perfect, but it's much faster than typing, and I've been able to produce a lot more. Then a friend told me about Wispr Flow, which had a nice UI with a small overlay and simple key bindings. I didn't like that it was an Electron application, and I didn't like the marketing, but I picked out the features I needed: vocabulary, dictation history, and key bindings.

That's how I created Mumbli, and so far it's working well: I'm at over 3,300 transcriptions now. I can tweak it and try new things, like the new GPT live transcription models.

The most useful part of owning the stack is that I can actually measure the pieces. Recently I benchmarked the latest 50 saved Mumbli recordings: 2,021 seconds of real dictation, with clips ranging from 0.3s to 293.8s. I ran the same recordings through a few STT providers and measured the time from "audio file is ready" to "transcript is back."

| Provider | Model | Success | Median STT latency | p95 latency | Audio-judge wins |
| --- | --- | --- | --- | --- | --- |
| Groq Whisper | `whisper-large-v3-turbo` | 50/50 | 534ms | 1,098ms | 2 |
| ElevenLabs Scribe | `scribe_v1` | 50/50 | 2,386ms | 7,472ms | 25 |
| Interfaze STT | `interfaze-beta` | 50/50 | 8,383ms | 13,584ms | 6 |
| Tie / skipped | - | - | - | - | 16 ties, 1 skipped |
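
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a timing harness. It is not Mumbli's actual benchmark code. The `transcribe` callable is a hypothetical wrapper around whichever provider you are testing, and the nearest-rank p95 is my assumption about how to compute the percentile; other definitions give slightly different numbers on 50 samples.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def measure_latencies_ms(
    transcribe: Callable[[str], str],  # hypothetical provider wrapper: path -> transcript
    audio_paths: Iterable[str],
) -> List[float]:
    """Time each call from 'audio file is ready' to 'transcript is back'."""
    latencies = []
    for path in audio_paths:
        start = time.perf_counter()
        transcribe(path)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Median and nearest-rank p95 of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {"median_ms": statistics.median(ordered), "p95_ms": ordered[rank - 1]}


print(summarize([100, 200, 300, 400]))
# {'median_ms': 250.0, 'p95_ms': 400}
```

Using `time.perf_counter()` rather than `time.time()` keeps the measurement monotonic, which matters when individual calls are only a few hundred milliseconds long.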

So the technical result is not "Groq wins everything." It is more specific: Groq was about 4.5x faster than ElevenLabs on median STT latency in this benchmark, and about 6.8x faster at p95. Against Interfaze the gap was larger still: about 15.7x at the median and about 12.4x at p95. But ElevenLabs won the audio-judge comparison far more often (25 wins, against 2 for Groq and 6 for Interfaze), so it still looks like the better quality default.
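
The speedup figures are just ratios of the latencies in the table. A short sketch of the arithmetic, with the numbers copied from the table and dictionary keys chosen only for illustration:

```python
# Latencies in milliseconds, copied from the benchmark table.
median_ms = {"groq": 534, "elevenlabs": 2386, "interfaze": 8383}
p95_ms = {"groq": 1098, "elevenlabs": 7472, "interfaze": 13584}


def speedup(latencies: dict, slower: str, faster: str = "groq") -> float:
    """How many times faster `faster` is than `slower`."""
    return latencies[slower] / latencies[faster]


for name in ("elevenlabs", "interfaze"):
    print(
        f"{name}: median {speedup(median_ms, name):.1f}x, "
        f"p95 {speedup(p95_ms, name):.1f}x"
    )
# elevenlabs: median 4.5x, p95 6.8x
# interfaze: median 15.7x, p95 12.4x
```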

For my personal use, that split matters. Wispr Flow feels good partly because the interaction is tiny: hold a key, speak, release, get text. If the transcript comes back in half a second, the whole app feels different. It is the difference between "this is a transcription job running somewhere" and "this is close enough to typing latency that I keep using it."

So the setup is not one fixed answer. I can keep a quality-oriented path, and I can keep a fast path. That was the part I wanted from the Wispr Flow-like experience: the tiny overlay and simple hotkeys, but with the ability to swap the engine when the numbers say something interesting.
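
One way to picture a swappable engine is a small router that maps a path name to a transcription function. This is a sketch under my own assumptions, not how Mumbli is implemented: `EngineRouter`, the `"fast"` and `"quality"` keys, and the stand-in lambdas are all hypothetical names, and real engines would call the actual provider APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A transcriber takes raw audio bytes and returns text.
Transcriber = Callable[[bytes], str]


@dataclass
class EngineRouter:
    """Hypothetical router that picks an STT engine by path name."""

    engines: Dict[str, Transcriber]
    default: str = "quality"

    def transcribe(self, audio: bytes, path: Optional[str] = None) -> str:
        name = path or self.default
        if name not in self.engines:
            raise KeyError(f"unknown engine path: {name}")
        return self.engines[name](audio)


# Stand-in engines for illustration only.
router = EngineRouter(
    engines={
        "fast": lambda audio: "transcript from the fast engine",
        "quality": lambda audio: "transcript from the quality engine",
    }
)

print(router.transcribe(b"", path="fast"))
# transcript from the fast engine
```

Keeping the choice behind one function means a new benchmark result only changes a dictionary entry or the default, not the hotkey and overlay code around it.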

I've put the benchmark notes in the documentation: https://docs.mumbli.app/benchmarks#benchmarks-mumbli-performance. You can also check out the GitHub page: https://github.com/fireharp/mumbli, and the main site: https://mumbli.app/. There's no huge value here, just some enjoyable tinkering: it was pretty fast to build, pretty exciting, and I still use it.
