Running ASR for smart homes in the NPU of Intel processors
Miguel Camba · 2026-05-26 · via DEV Community

I run my own smart home: Home Assistant, a voice assistant pipeline, the whole self-hosted thing. The speech-to-text step (Parakeet TDT 0.6B v3 over the Wyoming protocol) had been running for months on my Intel NUC (i3-1220P) with a 12 GB RTX 3060 eGPU. I recently upgraded my home server to a full desktop with an AMD 7900XTX, and since I want to save as much VRAM as I can for LLMs, I've been running NVIDIA's Parakeet on the CPU since then.
It works fine, but it always nagged me: my new home server has an Intel Core Ultra 7 265K (Arrow Lake) with the built-in "AI Boost" NPU, and that silicon was sitting completely idle.

With the AI hype, chip manufacturers have started to slap NPUs on their chips, mostly so they can put "AI" in the product name, but little to no software actually makes use of them, although some projects are starting to pop up here and there.

So I decided to find out whether I could put that stupidly underused chunk of silicon to work on a workload that should, on paper, be ideal for it.

And it worked remarkably well, but the road was bumpy.


TL;DR — the result, you came here for this table.

Same Spanish audio clips, similar wyoming-onnx-asr stack, but I swapped the inference backend from plain ORT-CPU to OpenVINO targeting the NPU, and I went from the INT8 quantized model on the CPU to the full-precision FP32 model on the NPU.

Results averaged from 10 runs after 1 warmup round.

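| Audio | Backend  | Avg latency | Energy / inference | Power above idle |
|-------|----------|-------------|--------------------|------------------|
| 10 s  | CPU INT8 | 978 ms      | 44.6 J             | 45.6 W           |
| 10 s  | NPU FP32 | 204 ms      | 4.2 J              | 20.5 W           |
| 20 s  | CPU INT8 | 1 708 ms    | 79.8 J             | 46.7 W           |
| 20 s  | NPU FP32 | 615 ms      | 7.8 J              | 12.7 W           |
| 60 s  | CPU INT8 | 5 011 ms    | 237.7 J            | 47.4 W           |
| 60 s  | NPU FP32 | 818 ms      | 11.0 J             | 13.4 W           |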

3-6× faster wall time. 10-22× less energy per transcription. For a workload that runs quite often in my home (I have 5 satellites and I don't reach for switches often), this is the kind of result that makes me wonder why nobody seems to be doing it.
For a nice voice assistant, response speed is a critical part of the experience. An extra 500 ms doesn't make for a terrible experience, but every bit of latency you shave off does improve it.
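
The headline ratios follow directly from the table. Here is a quick check with the latency and energy figures copied from the rows above (the dictionary and function names are mine, chosen for illustration):

```python
# audio length in seconds -> (avg latency in ms, energy per inference in J),
# copied from the results table
cpu_int8 = {10: (978, 44.6), 20: (1708, 79.8), 60: (5011, 237.7)}
npu_fp32 = {10: (204, 4.2), 20: (615, 7.8), 60: (818, 11.0)}


def ratios(seconds):
    """Return (speedup, energy_saving) of the NPU over the CPU for one clip length."""
    cpu_ms, cpu_j = cpu_int8[seconds]
    npu_ms, npu_j = npu_fp32[seconds]
    return cpu_ms / npu_ms, cpu_j / npu_j


for seconds in cpu_int8:
    speedup, saving = ratios(seconds)
    print(f"{seconds:>2} s: {speedup:.1f}x faster, {saving:.1f}x less energy")
```

This gives speedups of 4.8×, 2.8× and 6.1× and energy savings of 10.6×, 10.2× and 21.6×, which is where the rounded "3-6×" and "10-22×" come from.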

I've packaged the whole thing into a Docker image: 👉 ghcr.io/cibernox/wyoming-parakeet-on-intel-npu. If you have a Core Ultra chip and use Home Assistant, you can docker run it and skip everything below.

But if you want the story…


Why bother

Quick context. The home server is a Proxmox 9.x box, Intel Core Ultra 7 265K, 64 GB DDR5, an AMD 7900XTX dedicated GPU, and various LXC + Docker workloads (Home Assistant, llama.cpp on GPU, paperless-ngx, the usual). I'd been running Parakeet TDT on CPU at ~0.5-0.8 s per utterance. Acceptable, but not "instant": it was a downgrade from the RTX 3060 that I could live with, but I could feel it too.

The CPU baseline is genuinely strong on this chip: Parakeet's INT8 ONNX through ORT-CPU benefits from AVX-VNNI INT8 matmuls, and the 265K is beefier than most home servers. So when I say the NPU is 3-6× faster, I'm not comparing it to a low-power N150 mini PC. This is a 20-core desktop-class CPU at 125 W TDP.

The Intel NPU on Arrow Lake is rated at 13 TOPS. By LLM-accelerator standards that's tiny, and AMD already boasts NPUs with 40 TOPS. But Parakeet's encoder is exactly the kind of work an NPU is designed for: matrix multiplications with predictable shapes and modest activation memory. Worth trying.


Trap #1: Ubuntu's Level Zero loader is too old

At first you'd think "yeah, I just install OpenVINO and the NPU driver, right?" And it almost works. The container saw the NPU device node, but OpenVINO reported available_devices: ['CPU']. No NPU.

The reason, after some ZE_ENABLE_LOADER_DEBUG_TRACE=1 archeology:

ZE_LOADER_DEBUG_TRACE: Load Library of libze_intel_vpu.so.1 failed


Ubuntu 24.04's bundled Level Zero loader (libze1 v1.16) looks for the legacy library name libze_intel_vpu.so.1. Recent Intel NPU driver builds install libze_intel_npu.so.1 instead (different name, same library), and the loader needs to be v1.17 or newer to know about the new name. I should have figured this out faster than I did: this chip came out after Ubuntu 24.04 shipped, so it's totally to be expected that Ubuntu needed some help getting it to work.

Fix is straightforward once you know:

RUN curl -fL -O \
    "https://github.com/oneapi-src/level-zero/releases/download/v1.28.6/libze1_1.28.6+u24.04_amd64.deb" \
    && apt-get install -y --no-install-recommends ./libze1*.deb


Now ov.Core().available_devices returns ['CPU', 'NPU'] and the full device name comes back as Intel(R) AI Boost. 🎉


Trap #2: don't try to run the INT8 model on NPU

The model I was already using is INT8 quantized. Natural first move: feed the same ONNX to OpenVINO targeting NPU. It blows up:

[OpenVINO-EP] Output names mismatch between OpenVINO and ONNX


What's happening: the INT8 Parakeet ONNX uses DynamicQuantizeLinear/MatMulInteger/DequantizeLinear chains, and OpenVINO's graph optimizer aggressively folds those into native INT8 matmuls. The folding renames or drops intermediate tensors that the runtime is trying to read back. Hard fail at first inference.

Worse: even if you find a way to coax it through (I tried onnxruntime-openvino, raw OpenVINO with enable_qdq_optimizer, even NNCF post-training quantization), INT8 runs slower than FP32 on this NPU. The Intel NPU is BF16-native — it converts everything to BF16 internally. Feeding it INT8 just means extra dequant/requant on every operator boundary.

The right move is the opposite of what I expected: use the FP32 model. It's 4× bigger on disk (2.5 GB vs 650 MB) but the compiler converts it cleanly to BF16 for the NPU and runs full speed.

NOTE: After all these tests I found that someone has created an FP16 version of Parakeet that is ~1.5 GB. I tried it briefly and it performed much better than INT8 but was still 15% slower than FP32. I am not sure why, but if you are RAM-constrained you might prefer that one.


Trap #3: NPUs hate dynamic shapes

The Parakeet encoder accepts dynamic input shape (batch, 128, T) where T is the number of mel-feature frames — proportional to audio length. A 1.5 s "lights off" command is 150 frames; a 60 s dictation is 6 000 frames. ONNX Runtime on CPU handles that natively — every call allocates whatever shape comes in.

Quick aside: what's a "mel-feature frame" you may ask? (It's OK, I didn't know until yesterday) Speech models don't ingest raw audio. The audio is sliced into overlapping ~25 ms windows, each window converted into a 128-element vector of mel-frequency magnitudes (energies at different frequency bands, weighted to match human hearing). Parakeet does this conversion at 100 frames per second. T = audio_seconds × 100. That's the dimension that varies with utterance length.
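
To make the varying dimension concrete, here is a small sketch of the arithmetic, assuming exactly 100 frames per second and 128 mel bins as described above. The function names are illustrative, and a real feature extractor may differ by a frame or so at the edges because of windowing:

```python
FRAMES_PER_SECOND = 100  # Parakeet's mel-frame rate, as stated above
N_MELS = 128             # elements per mel-feature frame


def mel_frame_count(audio_seconds: float) -> int:
    """Number of mel frames (T) the encoder sees for a clip of this length."""
    return round(audio_seconds * FRAMES_PER_SECOND)


def encoder_input_shape(audio_seconds: float, batch: int = 1) -> tuple[int, int, int]:
    """Shape (batch, 128, T) of the encoder input for one clip."""
    return (batch, N_MELS, mel_frame_count(audio_seconds))


print(encoder_input_shape(1.5))  # (1, 128, 150)
print(encoder_input_shape(60))   # (1, 128, 6000)
```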

The Intel NPU absolutely does not do dynamic shapes. At least I couldn't find a way. The compiler bakes the tile sizes and memory layout into the compiled blob based on the static input dimensions. Hand it an unbounded dynamic shape and OpenVINO refuses to compile:

[ERROR] Upper bounds are not specified for node '/pre_encode/Cast' (type 'Convert'):
        input '0' bounds are '[9223372036854775807]'


I tried bounded dynamic shapes too (ov.PartialShape([1, 128, ov.Dimension(1, 2000)])) — the bounds don't propagate through every internal op of the Conformer, so the compiler still hits unbounded operands and bails out.

Three options:

  1. One static shape (e.g., 20 s): pad every utterance with silence up to 20 s and run the full encoder no matter what. Simple but very wasteful, since a 2 s command pays the encoder cost of 20 s.
  2. Recompile per request — NPU compile takes ~12 s. Hard no.
  3. Multi-bucket dispatch — compile a handful of static shapes ahead of time, cache them, and route each request to the smallest bucket that fits.

Option 3 is the only sane answer unless someone can prove me wrong on allowing dynamic shapes. Since smart home commands are usually rather quick, here are the bucket sizes I settled on for my Spanish smart-home traffic and the NPU encoder time for each:

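| Bucket | Typical traffic                              | Encoder time on NPU |
|--------|----------------------------------------------|---------------------|
| 5 s    | "Apaga la luz de la cocina y la del comedor" | ~55 ms              |
| 20 s   | Voice notes, reminders                       | ~150 ms             |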

Without buckets, every single utterance would pay the 20 s bucket's ~150 ms encoder cost. With the 5 s bucket added, the most common commands now spend only ~55 ms in the encoder phase. We could have smaller buckets, and I did try that, but each bucket requires its own compilation step and takes disk space and memory, so I thought that 2 tiers was granular enough.
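
The routing rule ("smallest bucket that fits") is small enough to sketch. This is a minimal illustration, not the code from the repo: `BUCKETS` and `pick_bucket` are names I made up, and the bucket lengths assume 100 mel frames per second.

```python
from bisect import bisect_left

FRAMES_PER_SECOND = 100
# Bucket lengths in mel frames, sorted ascending: the 5 s and 20 s tiers.
BUCKETS = [5 * FRAMES_PER_SECOND, 20 * FRAMES_PER_SECOND]


def pick_bucket(n_frames: int, buckets: list[int] = BUCKETS) -> int:
    """Return the smallest bucket length (in frames) that fits n_frames.

    Raises ValueError when the clip is longer than the largest bucket; what to
    do then (chunk, reject, fall back to CPU) is a policy choice left open here.
    """
    i = bisect_left(buckets, n_frames)
    if i == len(buckets):
        raise ValueError(f"{n_frames} frames exceeds largest bucket {buckets[-1]}")
    return buckets[i]


print(pick_bucket(150))  # 500  (a 1.5 s command lands in the 5 s bucket)
print(pick_bucket(501))  # 2000 (just over 5 s spills into the 20 s bucket)
```

`bisect_left` keeps the lookup correct at the boundaries: a clip of exactly 500 frames still fits the 5 s bucket.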


Trap #4: the false start that wasted a whole afternoon

For most of this investigation I was getting "NPU" and "CPU" timings within noise of each other and was about to declare the NPU not worth it.

Turned out my integration shim was being attached to the wrong attribute on the loaded onnx-asr model.

onnx_asr.load_model() returns a TextResultsAsrAdapter that wraps the actual ASR object on .asr. The wrapper does NOT proxy attribute writes. So this:

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", ...)
model._encoder = OpenVINOEncoderShim(...)  # ← attribute added to wrapper, ignored


…just adds an attribute to the wrapper that nothing reads. model.recognize() still routes through model.asr._encoder, which is the original ORT-CPU session. Every "NPU" benchmark I had been running was secretly plain ORT-CPU with an extra unused NPU encoder warming up uselessly in memory.

One-line fix:

model.asr._encoder = OpenVINOEncoderShim(...)  # ← actually used
model.asr._decoder_joint = OpenVINODecoderShim(...)


Once corrected, the real numbers landed where the silicon could deliver them. Lesson: when integrating with someone else's pipeline, add a tracer that confirms your code is actually being called before you trust any benchmark. This was on me.
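
A tracer of that kind can be very small. The sketch below uses hypothetical stand-in classes (`FakeAsr`, `FakeAdapter`) that only mimic the wrapper/inner layout described above; they are not onnx-asr's real classes, and the real shims may be invoked through a different method than `__call__`:

```python
class CallTracer:
    """Wrap any callable and count how often it is actually invoked."""

    def __init__(self, inner, name):
        self.inner = inner
        self.name = name
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.inner(*args, **kwargs)


# Hypothetical stand-ins: an adapter that wraps the real object on .asr
class FakeAsr:
    def __init__(self):
        self._encoder = lambda x: f"cpu({x})"

    def recognize(self, x):
        return self._encoder(x)


class FakeAdapter:
    def __init__(self):
        self.asr = FakeAsr()

    def recognize(self, x):
        return self.asr.recognize(x)


model = FakeAdapter()
wrong = CallTracer(lambda x: f"npu({x})", "npu-encoder")
model._encoder = wrong            # lands on the wrapper, never read
result_wrong = model.recognize("hi")

right = CallTracer(lambda x: f"npu({x})", "npu-encoder")
model.asr._encoder = right        # replaces the encoder that is really used
result_right = model.recognize("hi")

print(wrong.calls, result_wrong)  # 0 cpu(hi)
print(right.calls, result_right)  # 1 npu(hi)
```

Asserting `tracer.calls > 0` after a warmup run, before recording any timings, would have caught the mistake immediately.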


The code that works

  1. Download the FP32 encoder (encoder-model.onnx + 2.4 GB external data) and FP32 decoder from the istupakov HF repo.
  2. For each bucket size, reshape the encoder to a static T and compile for NPU:
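import openvino as ov

core = ov.Core()
T_fixed = 500  # mel frames for the 5 s bucket (100 frames/s); use 2000 for the 20 s bucket
model = core.read_model("encoder-model.onnx")
# Pin the dynamic time axis so the NPU compiler sees a fully static shape.
model.reshape({"audio_signal": [1, 128, T_fixed], "length": [1]})
compiled = core.compile_model(model, "NPU", config={
    "CACHE_DIR": "/data/ov_cache",  # compiled blobs are reused on later starts
    "PERFORMANCE_HINT": "LATENCY",
    "NPU_TURBO": "YES",
})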


  3. Same idea for the decoder/joint, but only ONE bucket: it's called per-token with fixed shapes regardless of audio length.
  4. At inference time, pick the smallest bucket whose T_fixed ≥ the actual mel-frame count. Zero-pad to that bucket's length, pass length=actual so the encoder knows where real audio ends.
  5. Plug the NPU-compiled encoder + decoder into onnx_asr by assigning to model.asr._encoder and model.asr._decoder_joint (NOT model._encoder; see Trap #4).
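
The zero-padding step can be sketched with NumPy. This is an illustration under assumptions: `pad_to_bucket` is a name I made up, the features are assumed to arrive as a (1, 128, T) array, and the int64 dtype for the length tensor is a guess that should be checked against the model's actual input type:

```python
import numpy as np


def pad_to_bucket(features: np.ndarray, bucket_frames: int):
    """Zero-pad (1, n_mels, T) features along the time axis to bucket_frames.

    Returns the padded array plus a length tensor holding the real frame count,
    mirroring the "audio_signal" and "length" inputs of the reshape call above.
    """
    actual = features.shape[-1]
    if actual > bucket_frames:
        raise ValueError(f"{actual} frames do not fit a {bucket_frames}-frame bucket")
    padded = np.zeros((*features.shape[:-1], bucket_frames), dtype=features.dtype)
    padded[..., :actual] = features
    length = np.array([actual], dtype=np.int64)  # assumed dtype, verify per model
    return padded, length


features = np.ones((1, 128, 150), dtype=np.float32)  # a 1.5 s utterance
padded, length = pad_to_bucket(features, 500)
print(padded.shape, length)  # (1, 128, 500) [150]
```

Passing the true length alongside the padded tensor is what lets the encoder ignore the padding instead of treating it as extra audio.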

NPU compile time is ~12 s per bucket cold, ~1 s when the CACHE_DIR blob hits. First container start is ~80 s with all buckets; subsequent restarts are fast because everything is cached.


Things I tried that didn't help

So you don't have to:

  • onnxruntime-openvino with the INT8 model → output-name mismatch bug
  • NNCF post-training INT8 quantization → still 17% slower than FP32 on this NPU, but not bad at all considering it saves 75% of the RAM. If RAM is tight, this approach is for you, and the quality degradation is very low.
  • FP16 model (from the grikdotnet repo) → marginally slower than FP32 because of the FP32-I/O Cast ops the converter inserts. Saves ~50% RAM though, which is nice. I have plenty, so I didn't bother, but it might be an easy saving, and the quality loss from FP32 → FP16 should be negligible.
  • Async ping-pong with two InferRequests on the decoder → TDT decoder is auto-regressive, nothing to overlap
  • INFERENCE_PRECISION_HINT=f16 → no measurable effect; compiler was already running BF16 internally
  • MODEL_PRIORITY=HIGH → compile-time only, no runtime effect
  • Bounded dynamic shapes → bounds don't propagate through the Conformer ops, compiler still bails

Benchmarking for smart home commands

Voice commands arrive sporadically: a few seconds of speech after several minutes of silence. The relevant metric isn't steady-state throughput while transcribing a 90-minute podcast; it's single-shot cold-after-idle latency, measured when the CPU's caches and clocks are cold and the NPU might be in a low-power state.

I run my home server with aggressive power-saving (deep C states, PCIe sleep — my AMD 7900XTX idles at 4 W). "Idle" wall power is around 32-38 W (as idle as a server running 20 containers can be). I was worried these would punish cold inference. They don't.

The NPU has no observable wake-up penalty. Cold-after-idle:

| Audio | CPU INT8 | NPU FP32 |
|-------|----------|----------|
| 10 s  | 918 ms   | 276 ms   |
| 20 s  | 1 628 ms | 693 ms   |
| 60 s  | 4 756 ms | 884 ms   |

Real Home Assistant trace for "apaga la luz de la cocina y la del comedor" ("turn off the kitchen light and the dining-room one", a longer-than-average sentence): CPU 0.71 s vs NPU 0.18 s, identical transcript.
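
A cold-after-idle measurement of this kind can be sketched as a small harness. This is not the script behind the numbers above; `cold_latency_ms` is a hypothetical helper, and the 120 s default idle period is my assumption standing in for "several minutes of silence":

```python
import statistics
import time


def cold_latency_ms(transcribe, idle_seconds=120.0, runs=5, sleep=time.sleep):
    """Time single calls to transcribe(), each preceded by an idle period.

    Sleeping between runs gives CPU clocks and the accelerator a chance to drop
    back into low-power states, so every sample is a "first call after silence".
    Returns (mean latency in ms, list of per-run latencies in ms).
    """
    samples = []
    for _ in range(runs):
        sleep(idle_seconds)
        start = time.perf_counter()
        transcribe()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), samples


# Smoke run with a no-op workload and no idle time, just to show the call shape.
mean_ms, samples = cold_latency_ms(lambda: None, idle_seconds=0, runs=3)
print(len(samples), mean_ms >= 0)  # 3 True
```

In a real run, `transcribe` would be a closure that feeds one fixed audio clip through the model, and the idle period should be long enough for the machine's power management to kick in.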


What this means in absolute terms

The result that genuinely surprised me: this 13-TOPS NPU running Parakeet ends up as fast as or faster than the same model running on an NVIDIA RTX 3060 (~13 TFLOPS at FP16), which I had been using as an eGPU on my previous server. The RTX did 0.15-0.3 s per utterance. The NPU does 0.1-0.2 s. Same ballpark, and:

  • The NPU pulls ~13 W during transcription
  • The RTX 3060 pulled ~170 W active, ~15 W idle

The NPU's active power is lower than the RTX's idle power. While transcribing, that's roughly a 10× reduction in power draw (~13 W vs ~170 W), and on a workload that's mostly idle anyway, the ~15 W the RTX burned doing nothing is gone as well.

For 13 TOPS, that's a remarkable use of silicon. The "NPUs are marketing" take is wrong for at least this workload.

Now, I am not claiming that the NPU is more powerful than a 3060; it clearly isn't. But I suspect (and this is just a theory) that it can match or beat it because it wakes up faster than a discrete GPU, and for a short burst of work like this, that gives it a head start the NVIDIA card wasn't able to overcome. I'm sure the GPU would win on commands longer than 10 seconds, but those are very rare.


Try it

I packaged everything into a public Docker image. If you have:

  • An Intel Core Ultra processor (Meteor Lake / Arrow Lake / Lunar Lake)
  • /dev/accel/accel0 on your host (lsmod | grep intel_vpu to verify)
  • Home Assistant or any other Wyoming-protocol client

You can do this:

docker run -d \
  --name wyoming-parakeet-npu \
  --device /dev/accel/accel0 \
  -e LANGUAGE=es \
  -p 10300:10300 \
  -v parakeet-data:/data \
  --restart unless-stopped \
  ghcr.io/cibernox/wyoming-parakeet-on-intel-npu:latest


First boot downloads ~3.2 GB of model weights and compiles the NPU buckets (~60-90 s). Subsequent restarts are under 5 s. Point Home Assistant's Wyoming integration at tcp://<host>:10300 and you're done.

Repo with source, Dockerfile, docs and a docker-compose.yml example: github.com/cibernox/wyoming-parakeet-on-intel-npu.


What I'd love help with

If you're playing with this, things I haven't done yet that I think could move the needle:

  • VAD gate before the encoder — most wake-word false-positives carry a fraction of a second of speech then silence. Cheap RMS-based VAD on the host could avoid invoking the encoder entirely for those. Probably the single biggest aggregate energy saver in a real smart home.
  • Lazy bucket loading + LRU eviction — I keep multiple buckets resident, but each compiled blob takes ~1.5 GB of RAM. An LRU policy would let you compile many buckets but only keep N hot. (The repo already has a basic "lazy load one large bucket" mode; full LRU would be the next step.)
  • Investigating the TDT decoder's Python overhead — even with the decoder itself running at ~1 ms per call on NPU, the surrounding loop in onnx_asr (numpy state-handling, control flow) accounts for a meaningful fraction of total time on long audio.
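
The first two ideas could start from something like the sketch below. None of this exists in the repo as far as this post says: `has_speech` and `BucketLRU` are hypothetical names, the RMS threshold of 500 is an arbitrary starting point, and the audio is assumed to be mono 16-bit PCM in the machine's native byte order:

```python
import math
from array import array
from collections import OrderedDict


def has_speech(pcm16: bytes, threshold_rms: float = 500.0) -> bool:
    """Crude energy gate: True when the clip's RMS amplitude reaches the threshold."""
    samples = array("h")
    samples.frombytes(pcm16[: len(pcm16) // 2 * 2])  # drop a dangling odd byte
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold_rms


class BucketLRU:
    """Keep at most `capacity` compiled buckets resident, evicting the least recently used."""

    def __init__(self, compile_fn, capacity: int = 2):
        self.compile_fn = compile_fn  # e.g. a function wrapping core.compile_model
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, bucket_frames: int):
        if bucket_frames in self._cache:
            self._cache.move_to_end(bucket_frames)  # mark as most recently used
            return self._cache[bucket_frames]
        compiled = self.compile_fn(bucket_frames)
        self._cache[bucket_frames] = compiled
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)         # evict the oldest entry
        return compiled


silence = bytes(3200)                        # 0.1 s of 16 kHz silence
loud = array("h", [4000] * 1600).tobytes()   # constant-amplitude test signal
print(has_speech(silence), has_speech(loud))  # False True
```

With the on-disk CACHE_DIR blob available, re-admitting an evicted bucket should cost about the ~1 s cached load mentioned earlier, not the full ~12 s compile, which is what would make eviction tolerable in practice.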

PRs welcome.


Acknowledgements

This work stands on top of several open-source projects, all of which made this hack possible:

Thanks to all of them for shipping working code.