Building AI-Powered Voice Transcription at Scale: Engineering Lessons

Eighteen months ago, we thought we were building a simple voice memo app.

We were wrong about the "simple" part.

At Vomo, what started as a tool to capture and transcribe voice notes evolved into a full voice-first productivity platform supporting 50+ languages, real-time streaming transcription, and a growing number of enterprise customers with strict latency and accuracy requirements. Along the way, we learned a lot — some of it the hard way.

This post covers the engineering decisions we made, the ones that hurt us, and what we'd do differently. If you're building anything in the audio/speech space, I hope this saves you some pain.

Why We Built a Voice-First AI Tool

The initial insight was embarrassingly simple: people think faster than they type. Voice memos have existed for decades, but the experience of using them is terrible. You record something, and then it just... sits there. You either listen to the whole thing again or you forget it.

The opportunity was to make voice memos actually useful — not just stored audio, but captured thought that gets organized, summarized, and actionable automatically.

That meant transcription was table stakes. But transcription alone is boring. The real product is what happens to the text after: structured notes, action items, searchable archives, smart summaries, integrations with Notion and Slack and everything else knowledge workers already use.

We scoped the MVP in two weeks. That scope did not survive contact with reality.

The Technical Architecture

Audio Capture and Streaming Pipeline

The first question we faced: do we send audio to the server in chunks as the user speaks, or wait for them to finish and process the whole file?

We went with streaming from day one, and it's one of the decisions I'm most glad we made.

Real-time streaming means users see text appearing as they speak. The psychological difference is enormous — it feels like the tool is listening, not processing. Users with streaming transcription are significantly more likely to keep talking, which results in longer, more useful recordings.

The architecture:

Mobile/Web Client
    ↓ (WebSocket, 100ms audio chunks, Opus codec)
API Gateway (load balanced)
    ↓
Transcription Worker Pool
    ↓ (partial results every ~500ms)
Client (streaming text updates)
    ↓ (on recording stop)
Post-processing Pipeline (cleanup, structure, AI enrichment)

Key decisions here:

Opus codec at 16kHz: Better compression than MP3 for speech, far lower bandwidth than WAV, and Whisper performs well on audio decoded from it. Whisper itself expects 16kHz PCM, so we decode Opus to PCM on the worker side.
100ms chunk window: Smaller chunks = lower perceived latency; larger chunks = better context for word boundary detection. 100ms struck the right balance after testing 50ms, 100ms, 200ms, and 500ms windows.
WebSocket over HTTP long-polling: Latency was 40% lower under our test conditions. The connection management overhead is real but manageable.
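To make the chunk-size arithmetic concrete, here is a minimal sketch that slices raw PCM into 100ms windows before encoding. It assumes 16-bit mono samples, and the function names are illustrative rather than taken from our client; Opus encoding and the WebSocket send are left out.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # sample rate named in the text
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM
CHUNK_MS = 100            # chunk window named in the text


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     chunk_ms: int = CHUNK_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Bytes of raw PCM covered by one chunk window."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield fixed-size chunks; the final chunk may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At these settings each window holds 1,600 samples (3,200 bytes before compression), so one second of speech becomes ten messages on the socket.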

Model Selection: Whisper, Cloud ASR, or Something Else

We evaluated four options:

Self-hosted Whisper large-v3 — best accuracy, highest infrastructure cost, full control
OpenAI Whisper API — lower ops overhead, per-minute pricing, good accuracy
Google Cloud Speech-to-Text v2 — strong real-time streaming support, good but not exceptional accuracy
Deepgram Nova-2 — purpose-built for real-time, excellent streaming latency

We ended up with a hybrid: Deepgram Nova-2 for real-time streaming (where latency matters most) and self-hosted Whisper large-v3 for post-processing uploaded files (where accuracy matters most and latency is acceptable).

The accuracy gap between these models is small in clean conditions (all exceed 95% on clear studio audio) and large in noisy ones. Whisper large-v3 still hits around 91% on a cafeteria recording; a mid-tier commercial ASR drops to 78–83% on the same recording.

For our target user — people recording voice memos while commuting, walking, or between meetings — noise robustness was non-negotiable. That pushed us toward Whisper for the quality path even with the infrastructure overhead.

Latency Optimization

Our initial streaming implementation had a "first word latency" of about 1.8 seconds — the time from when a user starts speaking to when the first transcribed word appears on screen. Users found this uncomfortable. It felt like the tool wasn't keeping up.

We got this to 340ms through three changes:

1. Model warm-keeping: Transcription workers stay loaded with the model in memory. Cold-starting Whisper large-v3 takes 3–8 seconds depending on hardware. Warm requests take milliseconds. We keep a pool of warm workers sized to handle 95th-percentile concurrency without cold starts.

2. Partial transcription streaming: Instead of waiting for a complete sentence, we emit partial results every 500ms during active speech. These get replaced as context improves. Users see text "solidifying" in real time — initial rough transcription that gets corrected as more audio context arrives.

3. Edge pre-processing: We run a lightweight VAD (Voice Activity Detection) model on the client before streaming. Silence periods don't get sent. This reduces the amount of audio the server processes and eliminates the confusion caused by long pauses generating incomplete sentence segments.
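The "solidifying" behavior from the second change can be sketched on the client as a small buffer that keeps committed text separate from the latest replaceable hypothesis. This is a simplified illustration: the class name and the partial/final distinction are assumptions about the message shape, not our actual client code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptBuffer:
    """Holds finalized text plus the latest (replaceable) partial hypothesis."""
    final_segments: List[str] = field(default_factory=list)
    partial: str = ""

    def on_partial(self, text: str) -> None:
        # A newer partial replaces the previous one entirely.
        self.partial = text

    def on_final(self, text: str) -> None:
        # A final result commits the segment and clears the partial.
        self.final_segments.append(text)
        self.partial = ""

    def render(self) -> str:
        # Committed segments first, then the in-flight partial if there is one.
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

The UI re-renders on every message, so a rough partial is simply overwritten when a better hypothesis or the final segment arrives.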

The Scaling Challenges We Didn't Expect

Concurrency Spikes

Our first major traffic spike came after a mention in a tech newsletter. We went from ~80 concurrent transcription sessions to ~1,400 in about 25 minutes. Our worker pool maxed out. New sessions queued. Queue depth hit 600+.

The problem was that our auto-scaling was too slow. We were using cloud VM auto-scaling with a 3–5 minute spin-up time. That's fine for gradual traffic increases. It's useless for spike traffic.

The fix was two-pronged:

Pre-warming worker capacity based on historical traffic patterns (time of day, day of week). We overprovision by ~30% during predicted peak hours.
Cloud ASR fallback: For overflow beyond our worker pool capacity, we route to cloud-based ASR (Deepgram API) as a degraded-but-functional fallback. Lower accuracy, but better than a queue timeout.

Auto-scaling now responds to queue depth rather than just CPU utilization. Queue depth above threshold triggers immediate scale-out; it doesn't wait for CPU to saturate.
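As a rough sketch of the two scaling rules described here, the functions below size pre-warmed capacity with a headroom percentage and compute a scale-out target from queue depth alone. Every parameter name and the "absorb the whole queue" policy are assumptions for illustration; a real autoscaler would also need cooldowns and scale-in logic.

```python
import math


def prewarmed_workers(predicted_peak_workers: int, headroom_pct: int = 30) -> int:
    """Workers to keep warm for a predicted peak, with overprovisioning."""
    return math.ceil(predicted_peak_workers * (100 + headroom_pct) / 100)


def desired_workers(current_workers: int,
                    queue_depth: int,
                    sessions_per_worker: int,
                    queue_threshold: int,
                    max_workers: int) -> int:
    """Return the worker count to scale to, based on queue depth alone."""
    if queue_depth <= queue_threshold:
        return current_workers  # within tolerance: no scale-out
    # Add enough workers to absorb everything currently queued.
    extra = math.ceil(queue_depth / sessions_per_worker)
    return min(current_workers + extra, max_workers)
```

Note that CPU utilization never appears in the decision: a deep queue triggers scale-out even while existing workers still look healthy.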

Multi-Language Model Loading

Supporting 50+ languages meant we needed Whisper large-v3, which handles multilingual transcription. The challenge: language detection requires processing the first 30 seconds of audio.

For short recordings under 30 seconds, we were initially guessing the language wrong ~12% of the time. A voice memo recorded in Japanese would start processing as English because we didn't have enough audio to be confident.

Our solution: language detection from the first 3 seconds using a lightweight language ID model (fastText language identification), followed by Whisper processing with the detected language as a forced parameter. This reduced language misdetection to under 2% and eliminated the accuracy penalty from wrong-language processing.
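The two-stage flow can be sketched as follows. Both stages are passed in as callables because the actual model wiring is not shown in this post; `identify_language` and `transcribe` are hypothetical stand-ins for the language ID model and the Whisper call.

```python
from typing import Callable, Sequence


def detect_then_transcribe(samples: Sequence[float],
                           sample_rate: int,
                           identify_language: Callable[[Sequence[float]], str],
                           transcribe: Callable[[Sequence[float], str], str],
                           probe_seconds: float = 3.0) -> str:
    """Identify the language from a short probe, then transcribe with it forced."""
    # Only the first few seconds go to the lightweight language ID model.
    probe = samples[: int(probe_seconds * sample_rate)]
    language = identify_language(probe)
    # The full recording is transcribed with the language fixed up front,
    # so the ASR model does not have to guess it from limited audio.
    return transcribe(samples, language)
```

With the open-source `whisper` package, for example, the `transcribe` callable could wrap `model.transcribe(audio, language=language)`, which skips Whisper's own language detection.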

Noise Robustness at Scale

We knew Whisper was good at noise robustness. What we didn't anticipate was the diversity of "noise" in production.

Our test suite covered café noise, street traffic, and office chatter. Production audio included: treadmill recordings, car engine noise, HVAC hum, keyboard clatter, music from a nearby speaker, and — most challenging — Bluetooth headsets with their own compression artifacts on top of background noise.

Bluetooth + background noise was particularly brutal. WER on some samples jumped from our expected 9% to 22-28%.
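For readers less familiar with the metric: WER is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch, assuming lowercase whitespace tokenization (real evaluations usually also normalize punctuation and numerals):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

One substituted word and one dropped word in a ten-word reference gives a WER of 0.2, or 20%.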

We added an optional pre-processing step using the DeepFilterNet noise suppression model before Whisper sees the audio. On heavily degraded audio, this consistently reduced WER by 4–8 percentage points. On clean audio, it has essentially no effect.

The tradeoff: DeepFilterNet adds ~150ms of processing latency. We enable it adaptively — only when the input audio fails a quick SNR check.
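A quick SNR gate of this kind can be approximated by comparing the loudest frames in a clip with the quietest ones. The sketch below is deliberately crude and is not our production check: the 20ms frame length, the 10% percentile cut, and the 15 dB threshold are all illustrative assumptions.

```python
import math
from typing import Sequence


def estimate_snr_db(samples: Sequence[float], frame_len: int = 320) -> float:
    """Crude SNR estimate in dB: loudest frames versus quietest frames."""
    # Mean-square energy per full frame (320 samples = 20ms at 16kHz).
    energies = sorted(
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    )
    if len(energies) < 10:
        raise ValueError("need at least 10 full frames")
    k = max(1, len(energies) // 10)
    noise = sum(energies[:k]) / k    # quietest 10% approximates the noise floor
    speech = sum(energies[-k:]) / k  # loudest 10% approximates speech plus noise
    eps = 1e-12                      # avoids log of zero on digital silence
    return 10.0 * math.log10((speech + eps) / (noise + eps))


def needs_denoising(samples: Sequence[float], threshold_db: float = 15.0) -> bool:
    """Gate the noise suppression step on the estimated SNR."""
    return estimate_snr_db(samples) < threshold_db
```

Because the check is a few sums over frame energies, it costs far less than the suppression pass it guards.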

What We Shipped After 6 Months

Six months after the MVP:

Real-time streaming transcription with 340ms first-word latency
50+ language support with automatic language detection
Speaker diarization (2–6 speakers, accuracy >88% in our testing)
Post-processing pipeline: cleaning → summarization → action item extraction → structured notes
Integrations: Notion, Google Docs, Obsidian, Slack, Zapier
On-device processing option for enterprise customers with data residency requirements

The piece I'm most proud of is the post-processing pipeline. Getting transcription right is a solved problem if you're willing to pay for infrastructure. Getting the intelligence layer right — the summarization that's actually useful, the action items that aren't garbage, the structure that fits how knowledge workers think — that's the hard problem.

We ended up fine-tuning a smaller Claude model on our own structured outputs, which significantly improved the quality of AI-generated notes compared to zero-shot prompting. The training data was annotations from our own team on hundreds of real voice memo transcripts.

Lessons Learned & Open Questions

What worked:

Investing in streaming from day one. It's much harder to add later than to build in from the start.
Noise suppression as an optional pre-processing step. Don't force it — adaptive application is better.
Queue depth as the auto-scaling signal, not CPU. Queue depth is closer to user experience than CPU.
Hybrid model strategy: purpose-built ASR for latency-critical paths, higher-accuracy models for quality-critical paths.

What hurt:

We underestimated the diversity of production audio. Test with recordings from phones, AirPods, cheap headsets, car mounts, and smartwatches — not just your studio mic.
Auto-scaling configuration took 4 sprints to get right. This is worth investing in early.
Speaker diarization accuracy drops sharply past 4 speakers. Set correct expectations in UX, or you'll get support tickets.

Open questions we're still working on:

How to handle cross-lingual code-switching in real time (e.g., a Spanish-English conversation where language changes mid-sentence)
Confidence scores at the word level for downstream highlighting of uncertain transcription
Real-time noise suppression without the latency penalty for mobile clients

What's Next

The platform we've built treats voice as input. The next frontier for us is voice as interface — where you can query your own recordings, ask questions about what was said in past meetings, and surface relevant notes through voice commands.

This requires evolving from a transcription + structuring system to an actual memory system, with semantic search, long-term context, and personalization. The transcription and AI layer we built is the foundation. The next layer is considerably more interesting.

If you're working on related problems — audio pipelines, speech AI, or voice-first products — I'm happy to trade notes. The engineering community in this space is still surprisingly small and surprisingly collegial.

Stack notes: Python workers (FastAPI), WebSocket via Redis pub/sub, Whisper large-v3 on A10G GPUs, Deepgram Nova-2 for streaming, DeepFilterNet for noise suppression, PostgreSQL + pgvector for transcript storage and search.
