Piclu - Turning voice notes into a shopping list with local Gemma 4
Abhishek Bor · 2026-05-24 · via DEV Community

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

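My wife and I are raising a one-and-a-half-year-old daughter. We both work full time and are usually swamped. My wife manages the household inventory, and our cook and household helps frequently ask her to buy things that are about to run out. These requests arrive ad hoc: she has to remember each one, add it to a shopping list, and later, when she has a moment, add it to a cart on an online store (Amazon, Blinkit, and so on). Requests often get forgotten, and meals end up compromised because an ingredient is missing.

I built Piclu, a voice recorder that lives in the kitchen. The cook and helps now press a button and leave a voice note whenever something comes to mind, so my wife is not interrupted in the middle of her day. Gemma 4 E4B transcribes and structures each note, the shopping list stays current, and my wife reviews it when she has time and orders as she sees fit. The cook and helps are happier too: they no longer wait for her to be free to report a shortage, and they have less to remember.

For recording, I set up an M5Stack Atom Echo as a push-to-talk device plugged into a kitchen socket. It records audio locally and pushes it to a backend endpoint. Everything runs on a small home server with no cloud involved, so the audio never leaves the house.

A fair question: why not have the cook and helps write their requests down, or send WhatsApp voice notes? Two reasons. First, the cook's hands are busy, writing is real effort, and not everyone writes comfortably; even when a request is written down, my wife still has to read through it and add each item to the cart. Second, we did try WhatsApp voice notes, but the staff are in the middle of work and in a hurry, so it proved impractical.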

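Demo

The video shows the whole flow: the helper presses the button on the device and speaks a short note in Hindi (saying "lemons and tomatoes", meaning the lemons and tomatoes have run out). Gemma 4 transcribes it locally, then a second Gemma 4 pass turns the note into a structured shopping list in which each item appears as a checkable to-do.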

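Code

A household voice-notes brain powered by local Gemma 4 E4B. It records voice notes from anyone in the house (cook, family, helps), transcribes them in the speaker's language and dialect, and turns them into a structured shopping list that the household manager can review and act on.

Built for the Gemma 4 Challenge: Build with Gemma 4. Audio arrives through a single HTTP endpoint. Bring your own push-to-talk source: the bundled web UI, a Raspberry Pi with a button, a Home Assistant blueprint, an IFTTT webhook, or your phone. Anything that can POST a WebM/Opus blob to /api/items/freeform_audio works.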
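
As a rough illustration of what "anything that can POST a WebM/Opus blob" means in practice, here is a minimal Go sketch that builds such a request. The `audio` form field follows the "(multipart: audio)" note in the data flow; the base URL `http://localhost:8080`, the `note.webm` filename, and the helper name `buildAudioRequest` are assumptions made for illustration, not details from the repo.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// buildAudioRequest wraps a WebM/Opus blob in a multipart form and returns a
// POST request for the freeform-audio endpoint.
// Assumptions: the form field is named "audio" (per the data-flow note),
// and baseURL plus the "note.webm" filename are placeholders.
func buildAudioRequest(baseURL string, webm []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	part, err := w.CreateFormFile("audio", "note.webm")
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(webm); err != nil {
		return nil, err
	}
	// Close writes the trailing multipart boundary; without it the body is invalid.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/items/freeform_audio", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	// Placeholder bytes stand in for a real recording.
	req, err := buildAudioRequest("http://localhost:8080", []byte("fake-webm-bytes"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```

The request is built separately from sending so that any push-to-talk source can reuse the same helper and decide for itself how to handle retries and timeouts.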

Data flow

PTT source ─► POST /api/items/freeform_audio   (multipart: audio)
           ─► app HTTP
           ─► brain gRPC over Unix socket
           ─► recordings.Store (seals .webm on disk + emits SealedEvent)
           ─► processor:  ffmpeg decode (WebM/Opus → 16 kHz mono PCM)
                          → Gemma 4 E4B pass 1 — TRANSCRIBE (multimodal audio →
                            Hinglish text,

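The open-source repo contains the Gemma brain (transcription and structuring), the web UI, and the single audio endpoint that the M5Stack pushes to. It also includes the full transcription and extraction evals described below.

Under the hood it is a Go app that talks to the brain over a Unix socket. The cart and notes update in real time over SSE, and the whole stack runs under Docker Compose. Gemma 4 runs locally on a libmtmd / llama.cpp build, with a Lemonade adapter for GPU-accelerated inference (the evals use the same adapter), and the evals are fully reproducible from the repo.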

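How I Used Gemma 4

Gemma powers the two key steps of note processing: transcription and structuring. Both are local Gemma 4 E4B passes on the home server.

Step 1: Transcription

When I first built this, I tried several open-source models on a small eval dataset I had prepared: Whisper, indic-whisper (from AI4Bharat), and Gemma 4. The pure speech-to-text models struggled to make out my cook and helps when they talked about household items. Gemma 4 was different because it is natively multimodal: it takes both audio and a text prompt, so unlike a pure STT model, I can tell it what it is about to hear. Among open models, Gemma 4 E2B/E4B are some of the few that accept audio and text in a single prompt.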

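Among the Gemma variants, the larger ones (26B MoE and 31B Dense) do not accept audio, and the smaller E2B is not accurate enough: its word error rate exceeds 1.0, and it dropped about 36% of items at the structuring stage. That leaves E4B as the variant that both accepts audio and is accurate enough. In my tests it also beat Whisper-Large-v3, the reference speech-to-text model with global language support, both on accuracy and on staying quiet when there is only noise (Whisper tends to hallucinate text during silence, which quietly pollutes the list). The head-to-head comparison: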

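| Model and prompt | WER | CER | No-speech detection |
|---|---|---|---|
| Gemma E4B, no context in prompt | 1.29 | 0.80 | 67% |
| Gemma E4B, full prompt (shipped) | 0.55 | 0.30 | 93% |
| Gemma E2B, full context in prompt | 1.48 | 0.90 | 20% |
| Whisper-Large-v3 (forced Hindi) | 0.70 | 0.36 | 7% |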

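Lower is better for WER and CER. WER is word error rate and CER is character error rate; CER is the fairer read here because romanized Hindi has many valid spellings for the same word. No-speech detection is how often the model stays silent or flags the clip as noise instead of hallucinating text. The shipped E4B prompt teaches the model to emit this no-speech flag, which is how it reaches 93%. The full methodology and the remaining numbers are in the repo's evals.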
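
To make the two metrics concrete, here is a minimal Go sketch of the standard definition: Levenshtein edit distance between hypothesis and reference, divided by the reference length, counted over words for WER and over characters for CER. This is the textbook formula rather than the repo's eval code, and the romanized Hindi strings are made-up examples.

```go
package main

import (
	"fmt"
	"strings"
)

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance returns the Levenshtein distance between two token sequences
// (substitutions, insertions, and deletions each cost 1).
func editDistance(ref, hyp []string) int {
	prev := make([]int, len(hyp)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ref); i++ {
		cur := make([]int, len(hyp)+1)
		cur[0] = i
		for j := 1; j <= len(hyp); j++ {
			cost := 1
			if ref[i-1] == hyp[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(hyp)]
}

// errorRate divides the edit distance by the reference length, so it can
// exceed 1.0 when the hypothesis contains many inserted tokens.
func errorRate(ref, hyp []string) float64 {
	if len(ref) == 0 {
		return 0 // convention chosen for this sketch only
	}
	return float64(editDistance(ref, hyp)) / float64(len(ref))
}

// wer tokenizes on whitespace; cer tokenizes per character (spaces included).
func wer(ref, hyp string) float64 { return errorRate(strings.Fields(ref), strings.Fields(hyp)) }
func cer(ref, hyp string) float64 { return errorRate(strings.Split(ref, ""), strings.Split(hyp, "")) }

func main() {
	ref, hyp := "nimbu aur tamatar", "nimboo aur tamatar"
	fmt.Printf("WER=%.2f CER=%.2f\n", wer(ref, hyp), cer(ref, hyp)) // WER=0.33 CER=0.12
}
```

A single alternative spelling costs a whole word under WER (1 of 3 words) but only two character edits under CER (2 of 17 characters), which is why CER is the gentler measure for romanized text. Because insertions count as errors, a model that hallucinates extra words can push WER above 1.0, as in the table's 1.29 and 1.48 rows.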

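The context I supply is specific to my household. I live in Gurugram, India, and my cook and helps come from West Bengal and Bihar. I tell E4B what is being discussed, what the intent is, who is speaking, and who they are addressing, so that the context steers the tokens it predicts while preserving the meaning. This one change was the biggest lever: by my evals, adding the context cut the word error rate by more than half compared with a bare "transcribe this" instruction (the no-context row in the table above). The relevant part of the transcription prompt: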

You are an expert audio transcriber specializing in {{native_language}} dialects.
You receive voice notes from members of a household in {{city}}, {{country}} —
typically the cook, household helps, or family members — addressed to the
household manager regarding the shopping list (stock requests, brand preferences,
quantity changes, cancellations).


The full prompts (transcription and structuring) are in the repo under internal/prompts.
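
The `{{native_language}}`, `{{city}}`, and `{{country}}` placeholders in the excerpt have to be filled in per household before the prompt is sent. A minimal sketch of one way to do that in Go follows; `fillPrompt` is a hypothetical helper, not the repo's code, and "Hindi" is an assumed value for `native_language`.

```go
package main

import (
	"fmt"
	"strings"
)

// fillPrompt replaces every {{key}} placeholder with its value from vars.
// Plain string replacement is used because a bare {{city}} is not valid
// text/template syntax (that package expects {{.City}}-style fields).
func fillPrompt(tmpl string, vars map[string]string) string {
	pairs := make([]string, 0, len(vars)*2)
	for k, v := range vars {
		pairs = append(pairs, "{{"+k+"}}", v)
	}
	return strings.NewReplacer(pairs...).Replace(tmpl)
}

func main() {
	tmpl := "You are an expert audio transcriber specializing in {{native_language}} dialects.\n" +
		"You receive voice notes from members of a household in {{city}}, {{country}}"

	fmt.Println(fillPrompt(tmpl, map[string]string{
		"native_language": "Hindi", // assumed value for illustration
		"city":            "Gurugram",
		"country":         "India",
	}))
}
```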

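Step 2: Structuring

This pass breaks each note down into individual items for the list. Notes can carry related context, for example:

  • Note 1: The dishwash liquid is running out.
  • Note 2: The current dishwash liquid is not good; the earlier one was better.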

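I keep a SQL table keyed by name_id, with columns for category, size, quantity, stock state, and notes. I show Gemma the new transcript together with the existing table rows, and it emits a small DSL that my code parses to add or update items in the table:

  • Note 1 returns: UPSERT name_id="dishwash-liquid" name="dishwash liquid" stock_state="low"
  • Note 2 returns: UPSERT name_id="dishwash-liquid" name="dishwash liquid" notes="current one is not good, the earlier one was better"

Both notes share the same name_id, so the table ends up with a single merged row: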

dishwash-liquid stock_state="low" notes="current one is not good, the earlier one was better"

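If a note has nothing to do with the shopping list, Gemma emits NO_OP and nothing changes. In my tests, this structuring pass had a recall of 1.0 (it caught every item), an F1 of 0.99, and never created an item from a non-shopping note.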
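
The UPSERT and NO_OP behavior described above can be sketched in a few lines of Go. This is an illustrative parser under simplifying assumptions, not the repo's implementation: an in-memory map stands in for the SQL table, only four of the columns are modeled, values are assumed not to contain escaped quotes, and the `Item` type and `apply` function are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Item models a subset of the columns mentioned in the post.
type Item struct {
	NameID, Name, StockState, Notes string
}

// fieldRe matches key="value" pairs; escaped quotes inside values are not handled.
var fieldRe = regexp.MustCompile(`(\w+)="([^"]*)"`)

// apply parses one DSL line and merges it into table, keyed by name_id.
// NO_OP leaves the table untouched; unknown commands are rejected.
func apply(table map[string]*Item, line string) error {
	line = strings.TrimSpace(line)
	if line == "NO_OP" {
		return nil
	}
	if !strings.HasPrefix(line, "UPSERT ") {
		return fmt.Errorf("unknown command: %q", line)
	}
	fields := map[string]string{}
	for _, m := range fieldRe.FindAllStringSubmatch(line, -1) {
		fields[m[1]] = m[2]
	}
	id := fields["name_id"]
	if id == "" {
		return fmt.Errorf("UPSERT without name_id: %q", line)
	}
	it, ok := table[id]
	if !ok {
		it = &Item{NameID: id}
		table[id] = it
	}
	// Only fields present in this command overwrite earlier values,
	// which is what lets two notes merge into one row.
	if v, ok := fields["name"]; ok {
		it.Name = v
	}
	if v, ok := fields["stock_state"]; ok {
		it.StockState = v
	}
	if v, ok := fields["notes"]; ok {
		it.Notes = v
	}
	return nil
}

func main() {
	table := map[string]*Item{}
	lines := []string{
		`UPSERT name_id="dishwash-liquid" name="dishwash liquid" stock_state="low"`,
		`UPSERT name_id="dishwash-liquid" name="dishwash liquid" notes="current one is not good, the earlier one was better"`,
		`NO_OP`,
	}
	for _, l := range lines {
		if err := apply(table, l); err != nil {
			panic(err)
		}
	}
	it := table["dishwash-liquid"]
	fmt.Printf("%s stock_state=%q notes=%q\n", it.NameID, it.StockState, it.Notes)
}
```

Having the model emit a narrow command language rather than free-form text means the code can reject anything it does not recognize before it touches the table.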

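Next, I want to build MCP servers for the online ordering platforms, so that eventually the cart fills itself, using each platform's order history as context. I am confident Gemma 4 can handle this as well, given its precise tool calling and strong chain-of-thought reasoning. Watch this space, and wish me luck.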