Running Gemma 4 Speech Recognition in a Windows .NET Desktop App: A Five-Variant Model Selection Journey
Maksim Demin · 2026-05-24 · via DEV Community

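This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Parlotype is a speech-to-text desktop app for Windows, built with .NET 10 and Avalonia UI. Hold a global hotkey, speak, release. Your text appears in the app you were typing in. All speech recognition runs on your machine. No cloud, and no audio leaves the machine.

Google released Gemma 4 in April 2026. It has a native multimodal audio path. I added it as a second speech engine alongside the existing Whisper.net pipeline. You pick Whisper or Gemma 4 in Settings. The rest of the audio pipeline (WASAPI capture, then Silero VAD, then text injection) is unchanged.

The interesting part, and the main subject of this post, is which Gemma 4 variant to ship. The ggml-org GGUF repos publish five variants (E2B and E4B, each nominally in BF16, Q4_K_M, and Q8_0, except that one combination is missing from the repos). The model card does not tell you the actual trade-off between accuracy, speed, and disk footprint. So I benchmarked every variant on the same dataset, picked a sensible default, and shipped it.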

Gemma 4 model picker

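Demo

The video shows the engine selection, the model picker with its five variants, and a live recording transcribed with Gemma 4.

Code

Source code, links, and benchmark configuration: github.com/mdemin729/parlotype

Relevant entry points: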

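How I Used Gemma 4

Why a separate engine at all

Whisper is excellent on clean, read English, but it degrades noticeably on conversational or noisy audio. Gemma 4 ships with its own audio encoder. Google self-reports 4.17% WER on LibriSpeech test-clean, which is competitive with much larger Whisper variants. In a speech-to-text app, the typical user is dictating into a focused text field. That noise profile looks more like "clean read speech" than "AMI meetings", so Gemma 4 is a real alternative. Giving people the choice seemed right. Either way, privacy does not depend on which model is loaded.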

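Why llama-server as the runtime

I surveyed the inference options and chose llama-server, the HTTP server that ships with llama.cpp. There were four constraints: no cloud, a Windows desktop app with a per-user installer, cross-vendor GPU support, and no Python runtime in the user's install.

onnxruntime-genai does not yet support the Gemma 4 architecture (per-layer embeddings, variable head dimensions). Tracking issue: microsoft/onnxruntime-genai#2062. A Python sidecar would work, but it pulls Python and CUDA into the user's install, which is unacceptable for non-developer users. LLamaSharp's P/Invoke bindings lock you to a single llama.cpp build at compile time, so switching from Vulkan to CUDA requires a recompile. Ollama does not yet support Gemma audio (ollama/ollama#15333). Lemonade supports AMD only.

The prebuilt Vulkan and CUDA Windows binaries of llama-server meet every constraint. One download gives you cross-vendor GPU support. There is a stable OpenAI-compatible HTTP API at /v1/chat/completions, with input_audio content parts for audio. Updates happen inside the app, so I control the release cadence. ADR-025 has the full write-up of this decision.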

Picking a variant: the benchmark

The catalog has five variants. This is what ggml-org/gemma-4-E2B-it-GGUF and ggml-org/gemma-4-E4B-it-GGUF actually publish, not the lineup I would ideally have picked (see ADR-029):

Model ID                 Size   Quant     On disk (GGUF + bf16 mmproj)
gemma-4-E2B-it-Q8_0      E2B    Q8_0      5.5 GiB
gemma-4-E2B-it-bf16      E2B    BF16      ~9.6 GiB
gemma-4-E4B-it-Q4_K_M    E4B    Q4_K_M    ~5.9 GiB
gemma-4-E4B-it-Q8_0      E4B    Q8_0      ~8.4 GiB
gemma-4-E4B-it-bf16      E4B    BF16      ~15 GiB

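E2B has no Q4_K_M. That asset does not exist in the repo. I found out the hard way: a manual test came back with a 404 Not Found response. After that, I rebuilt the catalog from the actual file listing on HuggingFace.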
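Rebuilding a catalog from the real file listing can be sketched as below. This is a minimal illustration, not Parlotype's catalog code: it assumes the Hugging Face Hub model-info endpoint (`https://huggingface.co/api/models/{repoId}`), whose JSON carries a `siblings` array of `rfilename` entries. The `GgufCatalog` class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper: lists the .gguf files a Hugging Face repo really contains.
public static class GgufCatalog
{
    // Extracts .gguf file names from a model-info JSON document.
    // Assumes the shape {"siblings":[{"rfilename":"..."}, ...]}.
    public static List<string> ParseGgufFiles(string modelInfoJson)
    {
        using var doc = JsonDocument.Parse(modelInfoJson);
        var files = new List<string>();
        if (!doc.RootElement.TryGetProperty("siblings", out var siblings))
            return files;

        foreach (var sibling in siblings.EnumerateArray())
        {
            if (!sibling.TryGetProperty("rfilename", out var nameProp))
                continue;
            var name = nameProp.GetString();
            if (name != null && name.EndsWith(".gguf", StringComparison.OrdinalIgnoreCase))
                files.Add(name);
        }
        return files;
    }

    // Fetches the listing for one repo, e.g. "ggml-org/gemma-4-E2B-it-GGUF".
    public static async Task<List<string>> FetchGgufFilesAsync(HttpClient http, string repoId)
    {
        var json = await http.GetStringAsync($"https://huggingface.co/api/models/{repoId}");
        return ParseGgufFiles(json);
    }
}
```

Driving the catalog from this listing means a missing quantization shows up as an absent file name at build time instead of a 404 at download time.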

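I ran every variant, alongside Whisper (Small, Medium, LargeV3Turbo), on 50 samples from LibriSpeech test-other, the "harder" English split. Same machine, same warm-up procedure, and CUDA for both engines. Whisper used greedy decoding (beam = 1), so the runs are reproducible.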
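The two accuracy metrics in this benchmark, WER and CER, are both edit distances normalized by the length of the reference transcript. A minimal sketch follows, assuming lowercase-and-whitespace normalization; the normalization the actual benchmark harness applies is not stated here, and `ErrorRates` is an illustrative name.

```csharp
using System;
using System.Collections.Generic;

public static class ErrorRates
{
    // Levenshtein distance over token sequences: substitutions, insertions,
    // and deletions each cost 1. Uses two rolling rows to keep memory small.
    public static int EditDistance<T>(IReadOnlyList<T> reference, IReadOnlyList<T> hypothesis)
    {
        var comparer = EqualityComparer<T>.Default;
        var previous = new int[hypothesis.Count + 1];
        var current = new int[hypothesis.Count + 1];
        for (int j = 0; j <= hypothesis.Count; j++) previous[j] = j;

        for (int i = 1; i <= reference.Count; i++)
        {
            current[0] = i;
            for (int j = 1; j <= hypothesis.Count; j++)
            {
                int cost = comparer.Equals(reference[i - 1], hypothesis[j - 1]) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            (previous, current) = (current, previous);
        }
        return previous[hypothesis.Count];
    }

    // Word error rate: word-level edits divided by reference word count.
    public static double WordErrorRate(string reference, string hypothesis)
    {
        string[] refWords = Tokenize(reference);
        string[] hypWords = Tokenize(hypothesis);
        return (double)EditDistance<string>(refWords, hypWords) / refWords.Length;
    }

    // Character error rate: character-level edits divided by reference length.
    public static double CharErrorRate(string reference, string hypothesis)
    {
        char[] refChars = string.Join(" ", Tokenize(reference)).ToCharArray();
        char[] hypChars = string.Join(" ", Tokenize(hypothesis)).ToCharArray();
        return (double)EditDistance<char>(refChars, hypChars) / refChars.Length;
    }

    // Assumed normalization: lowercase, collapse whitespace.
    private static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
}
```

For "the cat sat" against "the cat sit", one of three words is wrong (WER 1/3) but only one of eleven characters (CER 1/11), which is why the two metrics can rank models differently. A corpus-level figure is usually computed by summing edits and reference lengths across all samples before dividing; whether this benchmark does that is an assumption.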

Rank  Engine               Model           WER %   CER %   RTF          Model load (s)
1     Whisper (CUDA)       LargeV3Turbo    11.48   4.97    0.055        1.31
2     Whisper (CUDA)       Medium          12.18   5.41    0.073        1.28
3     Whisper (CUDA)       Small           13.10   5.87    0.034        0.71
4     Gemma 4 (llama.cpp)  E2B-it-BF16     13.15   4.95    0.038        6.70
5     Gemma 4 (llama.cpp)  E4B-it-Q4_K_M   13.82   5.80    0.038        6.73
6     Gemma 4 (llama.cpp)  E4B-it-BF16     14.20   5.40    [illegible]  6.72
7     Gemma 4 (llama.cpp)  E4B-it-Q8_0     14.9    5.79    [illegible]  [illegible]
8     Gemma 4 (llama.cpp)  E2B-it-Q8_0     19.22   8.95    0.315        6.74

WER % by model

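Three things stand out in the table:

  • E2B-it-BF16 has the lowest CER of any model here (4.95%). It beats Whisper LargeV3Turbo (4.97%) only by a hair, but it does beat it. WER and CER do not always agree; at this model size, Gemma's character-level errors are remarkably small.
  • E4B-it-Q4_K_M (the shipping default) has a WER of 13.82% and an RTF of 0.038. That is close to Whisper Small (WER 13.10%, RTF 0.034) at a comparable disk size. Q4_K_M is the right quantization to ship: it makes Gemma 4 usable for everyone without a 15 GiB download.
  • E2B-it-Q8_0 has problems on this dataset. Its RTF is 0.315, about 8x slower than the other Gemma variants, and its WER is 19.22%. It also crashed llama-server mid-sample on the first benchmark run, apparently because the model occasionally emits stray <|channel> reasoning tokens that the chat-template parser cannot handle. I still keep this variant in the catalog for experimentation, but the user-facing default steers clear of it.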
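RTF in the table is read here as real-time factor: processing time divided by audio duration, so lower is faster. That is the conventional definition and is assumed, not confirmed from the benchmark configuration. A minimal sketch, with `RealTimeFactor` as an illustrative name:

```csharp
using System;

public static class RealTimeFactor
{
    // RTF = time spent transcribing / length of the audio.
    // An RTF of 0.038 means one second of audio takes about 38 ms to process.
    public static double Compute(TimeSpan processingTime, TimeSpan audioDuration)
    {
        if (audioDuration <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(audioDuration));
        return processingTime.TotalSeconds / audioDuration.TotalSeconds;
    }

    // Duration of raw PCM audio, derived from its byte count and format.
    public static TimeSpan PcmDuration(long byteCount, int sampleRate, int channels, int bitsPerSample)
    {
        int bytesPerSecond = sampleRate * channels * (bitsPerSample / 8);
        return TimeSpan.FromSeconds((double)byteCount / bytesPerSecond);
    }
}
```

Under this definition, a 10-second clip transcribed in 0.38 seconds has an RTF of 0.038, and the 0.315 measured for E2B-it-Q8_0 is roughly 0.315 / 0.038 ≈ 8.3 times that, which matches the "about 8x slower" reading.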

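Why I picked what I picked

The shipping default is gemma-4-E4B-it-Q4_K_M: about 5.9 GiB on disk, 13.82% WER on this dataset, and an RTF of 0.038. E2B-BF16 is technically more accurate, but it needs about 9.6 GiB, and the marginal WER gain is not worth the cost. E4B Q8 and BF16 are there for users who want maximum accuracy and have the disk space. E2B-Q8 stays in the catalog with a "known issue" label.

The model picker shows all five so people can experiment. The default, though, is the one I would install on a friend's machine without thinking twice.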

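Architecture

Gemma 4 sits behind the same ISpeechRecognizer interface as Whisper. A DelegatingSpeechRecognizer (with a small SpeechRecognizerFactory) picks one at startup, based on the user's engine setting. The LlamaCppSpeechRecognizer owns a child llama-server.exe process. It posts the audio as a base64-encoded WAV chunk to /v1/chat/completions: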

// Excerpt from LlamaCppSpeechRecognizer.cs
var body = new
{
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = promptText },
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false
};
using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);

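The request above returns an OpenAI-style chat completion, so the transcript sits in `choices[0].message.content`. The following is a hedged sketch of reading it, not an excerpt from Parlotype; `ChatCompletionParser` is a hypothetical name, and real code would also want to check the HTTP status before parsing.

```csharp
using System;
using System.Text.Json;

public static class ChatCompletionParser
{
    // Pulls the assistant text out of an OpenAI-compatible chat completion body.
    public static string ExtractTranscript(string responseJson)
    {
        using var doc = JsonDocument.Parse(responseJson);
        var choices = doc.RootElement.GetProperty("choices");
        if (choices.GetArrayLength() == 0)
            throw new InvalidOperationException("Response contained no choices.");

        var content = choices[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString();

        // A null content field is treated as an empty transcript.
        return (content ?? string.Empty).Trim();
    }
}
```

With the excerpt's variables, usage would look like `var json = await response.Content.ReadAsStringAsync(cancellationToken);` followed by `var text = ChatCompletionParser.ExtractTranscript(json);`.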

Same capture, same VAD, different recognizer.

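The llama-server binary is managed the same way. ADR-026 covers the catalog/install/registry subsystem, which downloads Vulkan or CUDA builds on demand from llama.cpp's GitHub Releases. Users do not browse for a path in a folder picker; they choose a backend from a list and click Install. That subsystem alone is about 1,800 lines and probably deserves a post of its own.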
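Once a backend is installed, the recognizer has to start it as a child process. A rough sketch of that launch step follows, under the assumption that the server is started with llama.cpp's `-m`, `--mmproj`, `--host`, and `--port` options; the `LlamaServerLauncher` class, its parameters, and the paths in the usage note are hypothetical and are not taken from Parlotype.

```csharp
using System.Diagnostics;
using System.Globalization;

public static class LlamaServerLauncher
{
    // Builds the start info for a local llama-server child process.
    // Nothing is started here; the caller decides when to launch.
    public static ProcessStartInfo BuildStartInfo(
        string serverExePath, string modelPath, string mmprojPath, int port)
    {
        var info = new ProcessStartInfo(serverExePath)
        {
            UseShellExecute = false, // launch the executable directly
            CreateNoWindow = true,   // no console window on the user's desktop
        };

        // ArgumentList handles quoting of paths that contain spaces.
        info.ArgumentList.Add("-m");
        info.ArgumentList.Add(modelPath);
        info.ArgumentList.Add("--mmproj"); // multimodal projector file
        info.ArgumentList.Add(mmprojPath);
        info.ArgumentList.Add("--host");
        info.ArgumentList.Add("127.0.0.1"); // loopback only, audio stays local
        info.ArgumentList.Add("--port");
        info.ArgumentList.Add(port.ToString(CultureInfo.InvariantCulture));
        return info;
    }
}
```

A caller would pass the result to `Process.Start(info)`, wait for the server to answer on its `/health` endpoint before sending audio, and kill the process on shutdown. Because the backend is just an executable path, switching from Vulkan to CUDA becomes a different `serverExePath` instead of a recompile.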

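The transcription prompt is also user-editable. ADR-030 turns the hard-coded prompt into a small registry with built-in defaults and a {language} placeholder. The placeholder is there for a future feature that will pick the source language from the active keyboard layout.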
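The placeholder mechanism can be illustrated with a few lines of string substitution. This is a sketch only: the `PromptTemplate` class and the example prompt text are hypothetical, and the real defaults live in Parlotype's prompt registry.

```csharp
using System;

public static class PromptTemplate
{
    public const string LanguagePlaceholder = "{language}";

    // Replaces every {language} placeholder with the given language name.
    // Templates without the placeholder are returned unchanged.
    public static string Render(string template, string language) =>
        template.Replace(LanguagePlaceholder, language, StringComparison.Ordinal);
}
```

For example, `PromptTemplate.Render("Transcribe this {language} speech.", "English")` yields "Transcribe this English speech.". Keeping the placeholder literal in the stored template means a user-edited prompt keeps working when the language source changes later.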

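What this taught me

Three takeaways:

  1. Model-card numbers do not transfer to your stack. Google's reported 4.17% WER on LibriSpeech test-clean is real. But between "the model can reach 4.17%" and "my app, on noisy audio, with a quantization that fits on the user's disk, gets 13.82% WER" lie a choice among five variants, a choice of runtime, and a measurement methodology. Benchmark on your own stack.
  2. Most of the work is in the catalog, not in the inference call. The actual /v1/chat/completions HTTP call is about 30 lines of code. The variant catalog, download management, side-by-side installs of llama-server backends, and the prompt registry are where the real engineering goes.
  3. Asymmetric quantization coverage is the norm, not the exception. E2B has no Q4_K_M among the published GGUFs. The catalog has to reflect what is actually on HuggingFace, not a theoretically ideal lineup.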

Try Parlotype

  • Repo: github.com/mdemin729/parlotype
  • Windows only for now. .NET 10, MIT license.
  • Pick Gemma 4 under Settings -> Speech Engine. The in-app installer downloads llama-server and the GGUF for you.

Maksim Demin is a .NET engineer building Parlotype, a speech-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.