

First Gemma 4 ExecuTorch Deployment on Raspberry Pi 5, and Why It's 7.7× Slower Than llama.cpp
Viik · 2026-05-25 · via DEV Community

On April 2, ARM published a blog post announcing Gemma 4 optimised for ARM devices via XNNPACK + KleidiAI, reporting 5.5× prefill speedup and 1.6× faster decode. Those numbers target Armv9 chips with SME2 — flagship phone silicon.

I wanted to see what happens on the broader ARM ecosystem. So I took Gemma 4 E2B through the full PyTorch edge deployment pipeline — torch.export → torchao quantization (INT8 dynamic activations + INT4 weights) → ExecuTorch XNNPACK backend → KleidiAI — and deployed it on a Raspberry Pi 5 (Cortex-A76, 8GB, no SME2).

As far as I can tell, this is the first publicly documented Gemma 4 deployment through ExecuTorch on any hardware.

It works. The output is bit-exact — 9/9 token match against FP32. But I hit 14 issues along the way, and the performance story on non-SME2 hardware is very different from ARM's published benchmarks.

The numbers

| Setup | Decode speed |
|---|---|
| ExecuTorch + XNNPACK on Pi 5 (8GB) | 0.87 tok/s |
| llama.cpp on Pi 5 (16GB)* | 6.71 tok/s |
| ExecuTorch + XNNPACK on Mac M1 Pro | 8.66 tok/s |

*llama.cpp number from potato-os/core benchmark (April 4, 2026, Pi 5 16GB). Different RAM config but decode speed is typically memory-bandwidth-bound, not capacity-bound, so the comparison is reasonable.

The Pi 5 result is 7.7× slower than llama.cpp. But the Mac result tells a different story — on macOS arm64 where XNNPACK's fused kernel path works, ExecuTorch runs at competitive speed. The gap is specific to Linux aarch64 (Pi), not ARM in general.
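The ratios behind these comparisons can be recomputed directly from the reported decode speeds. This is a minimal sketch; the dictionary keys and the `speed_ratio` helper are illustrative names, not part of the benchmark tooling.

```python
# Decode speeds in tokens per second, as reported in the benchmark numbers.
decode_tok_s = {
    "executorch_pi5": 0.87,
    "llama_cpp_pi5": 6.71,
    "executorch_m1_pro": 8.66,
}


def speed_ratio(faster: str, slower: str) -> float:
    """Return how many times faster `faster` decodes than `slower`."""
    return decode_tok_s[faster] / decode_tok_s[slower]


pi_gap = speed_ratio("llama_cpp_pi5", "executorch_pi5")
mac_vs_pi = speed_ratio("executorch_m1_pro", "executorch_pi5")

print(f"llama.cpp vs ExecuTorch on Pi 5: {pi_gap:.1f}x")
print(f"ExecuTorch on M1 Pro vs ExecuTorch on Pi 5: {mac_vs_pi:.1f}x")
```

The first ratio rounds to the 7.7× in the title. The second shows that the same ExecuTorch stack is faster on the M1 Pro than on the Pi 5 by an even larger factor, which is consistent with the gap being platform-specific.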

Why the Pi 5 is slow (one bug)

ExecuTorch 1.2.0's XNNPACK backend rejects fused INT4 subgraphs on aarch64 with xnn_status_invalid_parameter. The workaround is per_op_mode=True, which disables kernel fusion entirely. Kernel fusion is exactly where KleidiAI's INT4 matmul speedup lives — without it, every operator runs individually with full dispatch overhead.

99.5% of wall time is in C++ XNNPACK kernels, not Python. A C++ runner wouldn't help. The bottleneck is fusion, not language overhead.

This isn't a criticism of ARM or ExecuTorch. The XNNPACK + KleidiAI pipeline is clearly fast on SME2 hardware. But the Armv8 ecosystem — Pi, older phones, embedded boards — is massive, and this is the kind of gap that only surfaces through independent testing on diverse hardware.

Three bugs that will save you days

Out of the 14 issues I documented, these three cost me the most time.

torchao 0.17 has no CPU-compatible INT4 weight-only path

The legacy int4_weight_only() factory is removed in torchao 0.17. Its replacement, Int4WeightOnlyConfig, requires Meta's mslk CUDA kernel library. Every INT4 packing format in 0.17 needs CUDA, XPU, or NPU — there is no CPU path.

If you're doing CPU-side model preparation for ExecuTorch edge deployment (which is... the primary use case), this blocks you completely.

Workaround: Use Int8DynamicActivationIntxWeightConfig(weight_dtype=torch.int4, weight_granularity=PerGroup(128)). This gives you INT8 dynamic activations with INT4 weights — the standard scheme XNNPACK and KleidiAI actually target.

torch.export.save silently corrupts large files

If you pass a pathlib.Path to torch.export.save and the export exceeds 2 GB, the zip central directory gets truncated. The save reports success. torch.export.load then fails with a cryptic PytorchStreamReader failed finding central directory error. You'll blame your model, your export config, your quantization — everything except the save call, because it told you it worked.

Workaround: Pass an open file handle instead of a Path, and verify the save immediately by reloading:

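```python
import torch

# exported_program: the ExportedProgram returned by torch.export.export(...)
with open("model.pt2", "wb") as f:
    torch.export.save(exported_program, f)

# Verify immediately: reloading fails fast if the archive was truncated
torch.export.load("model.pt2")
```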
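Reloading a multi-gigabyte export just to confirm it is readable can be slow. The error message points at a missing zip central directory, so a cheaper first check is to inspect the archive with Python's standard `zipfile` module. The sketch below assumes the saved file is an ordinary zip archive. The helper name `archive_is_intact` is hypothetical, and it complements the reload check rather than replacing it.

```python
import zipfile


def archive_is_intact(path: str) -> bool:
    """Return True if `path` is a readable zip whose members pass CRC checks."""
    # is_zipfile looks for the end-of-central-directory record,
    # which is the part a truncated save would be missing.
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first corrupt member, or None.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

Calling `archive_is_intact("model.pt2")` right after the save gives a quick yes/no answer before the file is copied to the Pi.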


HuggingFace's StaticCache breaks ExecuTorch lowering

transformers.StaticCache holds KV-cache tensors as plain Python attributes, not as nn.Module buffers. During torch.export, these tensors get lifted as constants. ExecuTorch's run_decompositions then rejects them because constants can't be mutated — but the cache is mutated every forward pass.

HuggingFace's source code actually documents this (early_initialization comment), but there's no formal fix for the ExecuTorch interaction.

Workaround: Subclass StaticCache to also inherit from nn.Module. Register KV tensors and the cumulative-length counter as buffers. Wrap the layer caches in nn.ModuleList. This makes them visible to torch.export as mutable buffers instead of constants.
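The difference between a plain tensor attribute and a registered buffer can be seen in isolation, without transformers. The sketch below is a stand-in, not the actual StaticCache subclass from the repo: `PlainCache`, `BufferCache`, and `CacheStack` are hypothetical names, and the shapes are toy values chosen for illustration.

```python
import torch
from torch import nn


class PlainCache:
    """KV tensor held as a plain Python attribute (the problematic pattern)."""

    def __init__(self, max_len: int, dim: int) -> None:
        self.keys = torch.zeros(max_len, dim)


class BufferCache(nn.Module):
    """KV tensor and length counter registered as buffers (the workaround pattern)."""

    def __init__(self, max_len: int, dim: int) -> None:
        super().__init__()
        self.register_buffer("keys", torch.zeros(max_len, dim))
        self.register_buffer("cumulative_length", torch.zeros((), dtype=torch.long))

    def update(self, positions: torch.Tensor, new_keys: torch.Tensor) -> torch.Tensor:
        # In-place writes: the kind of mutation a lifted constant cannot express.
        self.keys.index_copy_(0, positions, new_keys)
        self.cumulative_length.add_(positions.numel())
        return self.keys


class CacheStack(nn.Module):
    """Per-layer caches wrapped in nn.ModuleList so their buffers are discoverable."""

    def __init__(self, num_layers: int, max_len: int, dim: int) -> None:
        super().__init__()
        self.layers = nn.ModuleList(
            [BufferCache(max_len, dim) for _ in range(num_layers)]
        )


stack = CacheStack(num_layers=2, max_len=8, dim=4)
buffer_names = [name for name, _ in stack.named_buffers()]
print(buffer_names)
```

Every tensor listed by `named_buffers()` is part of the module's state, which is what lets an exporter treat it as mutable. `PlainCache.keys` never appears in such a listing, because the object is not an `nn.Module` at all.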

The export was easier than expected

Going in, I expected torch.export to be the hardest phase. Gemma 4 E2B has unusual architecture features — embed_tokens_per_layer (2.35B params in a per-layer embedding table, which is the "E2B" trick), shared RoPE as a sibling of the decoder layers, and sliding-window attention alternating with full attention across 35 layers.

I wrote up a list of seven export hazards from source inspection: dict-typed shared KV states, dynamic getattr in rotary embeddings, a @dynamic_rope_update decorator, and more.

None of them manifested. Transformers 5.5.3's Gemma 4 implementation traces cleanly through torch.export with StaticCache (within the sliding-window constraint of seq ∈ [2, 511]). The two real Phase 3 problems were both downstream of torch.export: the pathlib save bug and the decode-loop attention mask shape.

The hardest phase was actually lowering (Phase 5) — the StaticCache mutation blocker and the XNNPACK partitioner configuration. That's where the non-obvious engineering lived.

What's in the repo

Everything needed to reproduce the full pipeline or just grab the .pte and run:

Ready to use:

  • 5.14 GB .pte on HuggingFace — download and run on Pi 5
  • Interactive multi-turn chat REPL with KV-cache reuse
  • Full phase-by-phase reproduction recipe (Mac export → Pi deploy)

Documentation:

  • RESULTS.md — complete chronology of every bug, fix, and design decision
  • KNOWN_ISSUES.md — all 14 issues with repro steps and workarounds
  • Architecture analysis of Gemma 4 E2B from an exporter's perspective

Who this is for

If you're deploying a custom PyTorch model on ARM hardware via ExecuTorch, this repo is a worked example of the full toolchain with honest documentation of where it breaks. Substitute your model for Gemma 4 and most of the recipe transfers.

If you just want Gemma 4 on a Pi 5, use llama.cpp. It's faster and simpler today. This project exists to test and document the official PyTorch edge path — what works, what doesn't, and what needs fixing upstream.

If you maintain ExecuTorch, torchao, or HuggingFace Transformers, the KNOWN_ISSUES.md has repro steps for each bug. Upstream issues are being filed.

Links