惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

The Hacker News
The Hacker News
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
雷峰网
雷峰网
人人都是产品经理
人人都是产品经理
Recent Announcements
Recent Announcements
D
DataBreaches.Net
P
Proofpoint News Feed
V
Visual Studio Blog
J
Java Code Geeks
Recorded Future
Recorded Future
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
F
Full Disclosure
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
The GitHub Blog
The GitHub Blog
Engineering at Meta
Engineering at Meta
C
Cybersecurity and Infrastructure Security Agency CISA
V
Vulnerabilities – Threatpost
罗磊的独立博客
Jina AI
Jina AI
博客园 - 【当耐特】
C
CERT Recently Published Vulnerability Notes
G
GRAHAM CLULEY
Y
Y Combinator Blog
L
LangChain Blog
L
LINUX DO - 热门话题
宝玉的分享
宝玉的分享
月光博客
月光博客
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
H
Help Net Security
云风的 BLOG
云风的 BLOG
C
CXSECURITY Database RSS Feed - CXSecurity.com
博客园_首页
A
About on SuperTechFans
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
Latest news
Latest news
T
Threatpost
T
Tenable Blog
有赞技术团队
有赞技术团队
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
Stack Overflow Blog
Stack Overflow Blog
C
Cisco Blogs
C
Check Point Blog
T
Tor Project blog
T
Threat Research - Cisco Blogs
T
The Exploit Database - CXSecurity.com
S
Schneier on Security
美团技术团队
I
Intezer
S
Securelist
AWS News Blog
AWS News Blog

DEV Community

Authentication Security Deep Dive: From Brute Force to Salted Hashing (With Java Examples) Why AI Systems Don’t Fail — They Drift Spilling beans for how i learn for exam😁"Reinforcement Learning Cheat Sheet" I Replaced Chrome with Safari for AI Browser Automation. Here's What Broke (and What Finally Worked) How Python Borrows Other People's Work The $40 Architecture: Processing 1 Billion API Requests with 99.99% Uptime Vibe Coding: A Workflow Guide (From Zero to SaaS) Most webhook security guides protect the wrong side. The scary part is delivery. Headless CMS for TanStack Start: Build a Blog with Cosmic EU Age Verification App "Hacked in 2 Minutes" — What Actually Happened Comfy Cloud’s delete function does not actually remove files Running AI Models on GPU Cloud Servers: A Beginner Guide Event-driven media intelligence with AWS Step Functions and Bedrock I scored 500 AI prompts across 8 quality dimensions — here's what broke How to Call Google Gemini API from Next.js (Free Tier, No Backend Needed) The Portal Protocol: Reclaiming Human Connection in the Age of AI How to Fix Your Team's Scattered Knowledge Problem With a Self-Hosted Forum Intro to tc Cloud Functors: A Graph-First Mental Model for the Modern Cloud Designing Multi-Tenant Backends With Both Ownership and Team Access I Built a Neumorphic CSS Library with 77+ Components — Here's What I Learned PostgreSQL Performance Optimization: Why Connection Pooling Is Critical at Scale Cómo construí un SaaS multi-rubro para gestionar expensas en Argentina con FastAPI + Vue 3 🚀 I Built an Ethical Hacking Scanner Tool – Open Source Project I Replaced /usage and /context in Claude Code With a Single Statusline A Pythonic Way to Handle Emails (IMAP/SMTP) with Auto-Discovery and AI-Ready Design I Collected 8.9 Million Polymarket Price Points — Here's What I Found About How Markets Really Move EcoTrack AI — Carbon Footprint Tracker & Dashboard Everyone's Using AI. No One Agrees How. 5 self-hosted ebook managers worth trying in 2026 Building Your First AI Agent with LangChain: From Chatbot to Autonomous Assistant Common SOC 2 Failures (Real World) Stop Vibe-Checking Your AI App: A Practical Guide to Evals How to Use SonarQube and SonarScanner Locally to Level Up Your Code Quality Your Next To-Do App Is Dead — I Replaced Mine with an OpenClaw AI Sign a Nostr event in 60 lines of Python using coincurve — no nostr-sdk, no nbxplorer, no rust toolchain ITGC Audit Explained Like You’re in Big 4 Patch Tuesday abril 2026: Microsoft parcha 163 vulnerabilidades y un zero-day en SharePoint Stop scraping everything: a better way to track competitor price changes Listing on MCPize + the Official MCP Registry while routing payments OUTSIDE the marketplace — how I kept 100% of my x402 revenue Building an AI-Powered Risk Intelligence System Using Serverless Architecture Why We Ripped Function Overloading Out of Our AI Toolchain Testing AI-Generated Code: How to Actually Know If It Works SaaS Churn Is Killing Your Business. Here Is What to Do About It (Without a Support Team) The Speed of AI Is No Longer Linear - And Self-Improving Models Are Why How to Implement RBAC for MCP Tools: A Practical Guide for Engineering Teams From Standard Quote to Persuasive Proposal: AI Automation for Arborists I built a CLI that scaffolds complete multi-tenant SaaS apps Axios CVE-2025–62718: The Silent SSRF Bug That Could Be Hiding in Your Node.js App Right Now The dashboard that ended our friendship Data Pipelines Explained Simply (and How to Build Them with Python) The Hidden Cost of AI Systems Nobody Talks About. undefined vs undeclared, and how typeof behaves Switching from file-based jobs to NATS/Kafka in Rust without changing code io_uring Adventures: Rust Servers That Love Syscalls Why Agentic AI is Killing the Traditional Database The POUR principles of web accessibility for developers and designers Quantum Neural Network 3D — A Deep Dive into Interactive WebGL Visualization How To Install Caveman In Codex On macOS And Windows Automation Pipeline Reliability: Why Your Workflow Breaks When Nobody Is Watching I Built an 'Open World' AI Coding Agent — It Works From ANY Folder From Freelancing to Product: A Tech Service Company's SaaS Transformation China's AI Giants: Adding Tencent Hunyuan & ByteDance Doubao to AI University (74 Providers) On the Vibe Coders and Their Lies clerk: Auto-Summarize Your Claude Code Sessions AI Weekly — 2026/04/10–04/17 | The Model Lockdown Is Here, but the Toolchain Is the Real Battleground AI 週報 — 2026/04/10–2026/04/17 模型封鎖潮來了,但工具鏈才是真戰場 Maybe this is how Open-Source apps are born... 🚀 Fine-Tune LLMs with LoRA and QLoRA: 2026 Guide tRPC v11 + Next.js App Router: End-to-End Type Safety Without the Boilerplate ShadCN UI in 2026: Why I Stopped Installing Component Libraries and Started Owning My Components SaaS Billing in React Server Components: Stripe + Supabase Without a Single `useEffect` Join our DEV Weekend Challenge — $1,000 in Prizes Across TEN winners! Submissions Due April 20 at 6:59 AM UTC. Implementing FSRS Spaced Repetition in Flutter + Supabase — Adding Memory Science to an AI Learning App "I Texted My Localhost From the Train — Claude Code Fixed the Bug Before I Got Home" I Built a Sales Prep AI and It Went Deeper Than Expected Design to Code #2: One JSON, Eleven Outputs Solving the 100M-Row Problem: A Summary Table Pattern for High-Volume Push Notification Logs Flutter Web With Wasm: What Actually Changes For Developers I Built 50 Royalty-Free Soundtracks for My Side Project in a Weekend Using AI Music Generation The Vibe Coding Security Checklist: 7 Things to Check Before You Ship Stop Letting Googlebot Guess Fix Your React App's SEO Right Desconstruindo o Streaming do LinkedIn: Como Criar um Engine de Extração de Vídeo de Alta Performance com HLS e FFmpeg (EDA Part-1) EDA (Exploratory Data Analysis) Explained With Real Life — Why Looking at Your Data Is the Most Important Step in Machine Learning Brand Relationship Management at Scale: Our 4-Touch Outreach System for 200+ Brands Why String.fromEnvironment() Might Return an Empty String in Dart JGuardrails 1.0.0 — Hardening Java LLM Apps Against Jailbreaks, Toxicity, and Prompt Injection Plan and Schedule a Full Week of Threads Content From One Claude Conversation Coding Cat Oran Ep3, Five Tables Changed Everything Updated: BFF Pattern I'm done watching freelancers get buried by 200 proposals. So I'm building the alternative. This is my first post BFS Algorithm in Java Step by Step Tutorial with Examples Tracking LLM Pricing Monthly: An Open Dataset for 22 AI Models How We Measure Content ROI on a Comparison Site: Revenue Attribution Without Perfect Data Introducing Nova AI Ops: The AI-Native Operating System for SRE Teams I built a free desktop video downloader for Windows — Grabbit How Talkie OCR Helps Vision-Impaired & Dyslexic Users Read the World Around Them VRCFaceTracking安装和iPhone面捕配置教程,有bug Even CrowdStrike Can't See Your Agents The Automation Gold Rush: What n8n Workflows and Claude Are Opening Up for Developers Right Now
Building AI-Powered Voice Transcription at Scale: Engineering Lessons
douzatan · 2026-05-31 · via DEV Community

Eighteen months ago, we thought we were building a simple voice memo app.

We were wrong about the "simple" part.

At Vomo, what started as a tool to capture and transcribe voice notes evolved into a full voice-first productivity platform supporting 50+ languages, real-time streaming transcription, and a growing number of enterprise customers with strict latency and accuracy requirements. Along the way, we learned a lot — some of it the hard way.

This post covers the engineering decisions we made, the ones that hurt us, and what we'd do differently. If you're building anything in the audio/speech space, I hope this saves you some pain.

Why We Built a Voice-First AI Tool

The initial insight was embarrassingly simple: people think faster than they type. Voice memos have existed for decades, but the experience of using them is terrible. You record something, and then it just... sits there. You either listen to the whole thing again or you forget it.

The opportunity was to make voice memos actually useful — not just stored audio, but captured thought that gets organized, summarized, and actionable automatically.

That meant transcription was table stakes. But transcription alone is boring. The real product is what happens to the text after: structured notes, action items, searchable archives, smart summaries, integrations with Notion and Slack and everything else knowledge workers already use.

We scoped the MVP in two weeks. That scope did not survive contact with reality.

The Technical Architecture

Audio Capture and Streaming Pipeline

The first question we faced: do we send audio to the server in chunks as the user speaks, or wait for them to finish and process the whole file?

We went with streaming from day one, and it's one of the decisions I'm most glad we made.

Real-time streaming means users see text appearing as they speak. The psychological difference is enormous — it feels like the tool is listening, not processing. Users with streaming transcription are significantly more likely to keep talking, which results in longer, more useful recordings.

The architecture:

Mobile/Web Client
    ↓ (WebSocket, 100ms audio chunks, Opus codec)
API Gateway (load balanced)
    ↓
Transcription Worker Pool
    ↓ (partial results every ~500ms)
Client (streaming text updates)
    ↓ (on recording stop)
Post-processing Pipeline (cleanup, structure, AI enrichment)

Enter fullscreen mode Exit fullscreen mode

Key decisions here:

  • Opus codec at 16kHz: Better compression than MP3 for speech, lower bandwidth than WAV, and Whisper performs well on it. PCM 16kHz is what Whisper actually wants; we convert on the worker side.
  • 100ms chunk window: Smaller chunks = lower perceived latency; larger chunks = better context for word boundary detection. 100ms struck the right balance after testing 50ms, 100ms, 200ms, and 500ms windows.
  • WebSocket over HTTP long-polling: Latency was 40% lower on our test conditions. The connection management overhead is real but manageable.

Model Selection: Whisper, Cloud ASR, or Something Else

We evaluated four options:

  1. Self-hosted Whisper large-v3 — best accuracy, highest infrastructure cost, full control
  2. OpenAI Whisper API — lower ops overhead, per-minute pricing, good accuracy
  3. Google Cloud Speech-to-Text v2 — strong real-time streaming support, good but not exceptional accuracy
  4. Deepgram Nova-2 — purpose-built for real-time, excellent streaming latency

We ended up with a hybrid: Deepgram Nova-2 for real-time streaming (where latency matters most) and self-hosted Whisper large-v3 for post-processing uploaded files (where accuracy matters most and latency is acceptable).

The accuracy difference between these models matters less in clean conditions (all hit >95% on clear studio audio) and enormously in noisy conditions. Whisper large-v3 on a cafeteria recording still hits around 91%; the same recording on a mid-tier commercial ASR drops to 78-83%.

For our target user — people recording voice memos while commuting, walking, or between meetings — noise robustness was non-negotiable. That pushed us toward Whisper for the quality path even with the infrastructure overhead.

Latency Optimization

Our initial streaming implementation had a "first word latency" of about 1.8 seconds — the time from when a user starts speaking to when the first transcribed word appears on screen. Users found this uncomfortable. It felt like the tool wasn't keeping up.

We got this to 340ms through three changes:

1. Model warm-keeping: Transcription workers stay loaded with the model in memory. Cold-starting Whisper large-v3 takes 3–8 seconds depending on hardware. Warm requests take milliseconds. We keep a pool of warm workers sized to handle 95th-percentile concurrency without cold starts.

2. Partial Transcription Streaming: Instead of waiting for a complete sentence, we emit partial results every 500ms during active speech. These get replaced as context improves. Users see text "solidifying" in real time — initial rough transcription that gets corrected as more audio context arrives.

3. Edge pre-processing: We run a lightweight VAD (Voice Activity Detection) model on the client before streaming. Silence periods don't get sent. This reduces the amount of audio the server processes and eliminates the confusion caused by long pauses generating incomplete sentence segments.

The Scaling Challenges We Didn't Expect

Concurrency Spikes

Our first major traffic spike came after a mention in a tech newsletter. We went from ~80 concurrent transcription sessions to ~1,400 in about 25 minutes. Our worker pool maxed out. New sessions queued. Queue depth hit 600+.

The problem was that our auto-scaling was too slow. We were using cloud VM auto-scaling with a 3–5 minute spin-up time. That's fine for gradual traffic increases. It's useless for spike traffic.

The fix was two-pronged:

  1. Pre-warming worker capacity based on historical traffic patterns (time of day, day of week). We overprovision by ~30% during predicted peak hours.
  2. Cloud function fallback: For overflow beyond our worker pool capacity, we route to cloud-based ASR (Deepgram API) as a degraded-but-functional fallback. Lower accuracy, but better than a queue timeout.

Auto-scaling now responds to queue depth rather than just CPU utilization. Queue depth above threshold triggers immediate scale-out; it doesn't wait for CPU to saturate.

Multi-Language Model Loading

Supporting 50+ languages meant we needed Whisper large-v3, which handles multilingual transcription. The challenge: language detection requires processing the first 30 seconds of audio.

For short recordings under 30 seconds, we were initially guessing the language wrong ~12% of the time. A voice memo recorded in Japanese would start processing as English because we didn't have enough audio to be confident.

Our solution: language detection from the first 3 seconds using a lightweight language ID model (fastText language identification), followed by Whisper processing with the detected language as a forced parameter. This reduced language misdetection to under 2% and eliminated the accuracy penalty from wrong-language processing.

Noise Robustness at Scale

We knew Whisper was good at noise robustness. What we didn't anticipate was the diversity of "noise" in production.

Our test suite covered café noise, street traffic, and office chatter. Production audio included: treadmill recordings, car engine noise, HVAC hum, keyboard clatter, music from a nearby speaker, and — most challenging — Bluetooth headsets with their own compression artifacts on top of background noise.

Bluetooth + background noise was particularly brutal. WER on some samples jumped from our expected 9% to 22-28%.

We added an optional pre-processing step using the DeepFilterNet noise suppression model before Whisper sees the audio. On heavily degraded audio, this consistently improved WER by 4–8 percentage points. On clean audio, it has essentially no effect.

The tradeoff: DeepFilterNet adds ~150ms of processing latency. We enable it adaptively — only when the input audio fails a quick SNR check.

What We Shipped After 6 Months

Six months after the MVP:

  • Real-time streaming transcription with 340ms first-word latency
  • 50+ language support with automatic language detection
  • Speaker diarization (2–6 speakers, accuracy >88% in our testing)
  • Post-processing pipeline: cleaning → summarization → action item extraction → structured notes
  • Integrations: Notion, Google Docs, Obsidian, Slack, Zapier
  • On-device processing option for enterprise customers with data residency requirements

The piece I'm most proud of is the post-processing pipeline. Getting transcription right is a solved problem if you're willing to pay for infrastructure. Getting the intelligence layer right — the summarization that's actually useful, the action items that aren't garbage, the structure that fits how knowledge workers think — that's the hard problem.

We ended up fine-tuning a smaller Claude model on our own structured outputs, which significantly improved the quality of AI-generated notes compared to zero-shot prompting. The training data was annotations from our own team on hundreds of real voice memo transcripts.

Lessons Learned & Open Questions

What worked:

  • Investing in streaming from day one. It's much harder to add later than to build in from the start.
  • Noise suppression as an optional pre-processing step. Don't force it — adaptive application is better.
  • Queue depth as the auto-scaling signal, not CPU. Queue depth is closer to user experience than CPU.
  • Hybrid model strategy: purpose-built ASR for latency-critical paths, higher-accuracy models for quality-critical paths.

What hurt:

  • We underestimated the diversity of production audio. Test with recordings from phones, AirPods, cheap headsets, car mounts, and smartwatches — not just your studio mic.
  • Auto-scaling configuration took 4 sprints to get right. This is worth investing in early.
  • Speaker diarization accuracy drops sharply past 4 speakers. Set correct expectations in UX, or you'll get support tickets.

Open questions we're still working on:

  • How to handle cross-lingual code-switching in real time (e.g., a Spanish-English conversation where language changes mid-sentence)
  • Confidence scores at the word level for downstream highlighting of uncertain transcription
  • Real-time noise suppression without the latency penalty for mobile clients

What's Next

The platform we've built treats voice as input. The next frontier for us is voice as interface — where you can query your own recordings, ask questions about what was said in past meetings, and surface relevant notes through voice commands.

This requires evolving from a transcription + structuring system to an actual memory system, with semantic search, long-term context, and personalization. The transcription and AI layer we built is the foundation. The next layer is considerably more interesting.

If you're working on related problems — audio pipelines, speech AI, or voice-first products — I'm happy to trade notes. The engineering community in this space is still surprisingly small and surprisingly collegial.


Stack notes: Python workers (FastAPI), WebSocket via Redis pub/sub, Whisper large-v3 on A10G GPUs, Deepgram Nova-2 for streaming, DeepFilterNet for noise suppression, PostgreSQL + pgvector for transcript storage and search.