I Build the Infrastructure That Serves AI Models. Gemma 4 Just Made My Job Existential.
Sodiq Jimoh · 2026-05-24 · via DEV Community


This is a submission for the Gemma 4 Challenge: Write About Gemma 4


Most posts about Gemma 4 start with "I downloaded it and ran it locally." Mine starts differently.

I build the platform that serves models like Gemma 4. I spend my days writing Kubernetes manifests, configuring KServe InferenceServices, debugging Knative ingress routes, and making sure Kyverno policies block the bad deployments before they reach the cluster. My project — NeuroScale — is a self-service AI inference platform where a developer fills in a Backstage form, the platform creates a pull request, ArgoCD deploys it, and a production-grade inference endpoint goes live.

I am, in the most literal sense, the person who keeps the lights on for AI model serving.

So when Google released Gemma 4 with models ranging from 2B parameters on a Raspberry Pi to 31B on a workstation, my first thought wasn't "cool, let me try it." My first thought was: "Does this make my entire platform unnecessary?"

That question — and the answer I eventually reached — is what this post is about.


The Serving Tax Nobody Talks About

Here's a dirty secret from platform engineering: most of the cost of running an AI model isn't the model. It's everything around it.

On NeuroScale, deploying a single sklearn InferenceService on KServe creates roughly 5 pods. Not one. Five. A predictor pod, a transformer (if configured), an Istio/Kourier sidecar or gateway pod, a queue-proxy injected by Knative, and a storage initializer that downloads model artifacts at startup.

Each of those pods needs CPU requests, memory limits, health probes, and — because we enforce this through Kyverno admission policies — ownership labels and cost-center attribution before they're even allowed to exist:

# This gets DENIED by our Kyverno policy — no owner, no deploy
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma-model
  namespace: default
spec:
  predictor:
    sklearn:
      storageUri: gs://models/gemma/v1
# ❌ Blocked: missing app.kubernetes.io/owner and cost-center labels
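
The rule behind that denial can be sketched in a few lines of Python. This is only an illustration of the check, not the Kyverno policy itself: the two label keys come from the denial message above, while `missing_labels` and the sample label values are hypothetical names chosen for the example.

```python
from typing import List

# Label keys taken from the denial message above; treat them as assumptions
# about the policy rather than its actual definition.
REQUIRED_LABELS = ("app.kubernetes.io/owner", "cost-center")


def missing_labels(manifest: dict) -> List[str]:
    """Return the required label keys that are absent or empty on a manifest."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


# Same shape as the denied manifest: no labels at all.
denied = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "gemma-model", "namespace": "default"},
}

# Hypothetical label values, added only to show the passing case.
allowed = {
    **denied,
    "metadata": {
        **denied["metadata"],
        "labels": {
            "app.kubernetes.io/owner": "team-alpha",
            "cost-center": "cc-ml-inference",
        },
    },
}

print(missing_labels(denied))   # both required keys are reported
print(missing_labels(allowed))  # nothing is reported
```

A real admission policy runs inside the cluster and sees every create and update request; a script like this is only useful as a quick local pre-check.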


On a local k3d cluster — the kind I use for development — a single InferenceService consumes around 1.2 GB of RAM. Add Backstage (the developer portal), ArgoCD (GitOps), Kyverno (policy engine), and OpenCost (cost attribution), and you're at 6-8 GB of infrastructure memory before you've served a single prediction.

This is the serving tax. Every platform engineer pays it. Nobody writes blog posts about it.


Then Gemma 4 E2B Showed Up on a Raspberry Pi

When I read that Gemma 4's E2B model — 2 billion parameters, multimodal, 128K context window — runs on a Raspberry Pi 5 drawing 7.5 watts, something clicked.

Not "clicked" as in excitement. "Clicked" as in a core assumption quietly becoming wrong.

My entire platform exists because of a specific premise: deploying AI models is hard, dangerous, and requires infrastructure guardrails. The selling point of NeuroScale is that you don't need to understand Kubernetes to get an inference endpoint. You fill in a form. The platform handles the YAML, the drift correction, the policy enforcement, the cost tracking.

But what if the model just... runs? On the device? With no YAML? No Kubernetes? No platform?

Gemma 4's model family makes this question concrete in a way previous open models didn't:

| Model | Parameters | Runs On | Needs Platform? |
|---|---|---|---|
| E2B | 2B | Phone, Raspberry Pi, browser | No |
| E4B | 4B | Laptop, edge device | No |
| 31B Dense | 31B | Workstation GPU | Maybe |
| 26B MoE | 26B (4B active) | Server | Yes |

The bottom two rows are my world. The top two rows are... not.


The Three-Hour Outage That Explains Everything

Let me tell you about the worst three hours of my NeuroScale build.

When I first set up KServe, all InferenceService creation was blocked cluster-wide. No model could deploy. Zero. The reason? KServe's default configuration assumes you're running Istio as your service mesh. We were running Kourier — a lightweight Envoy-based ingress gateway that uses ~100 MB instead of Istio's ~1 GB — because our k3d cluster didn't have the RAM headroom for Istio.

The fix was a single configuration patch:

# infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml
ingress:
  disableIstioVirtualHost: true


One line. disableIstioVirtualHost: true. That's what stood between "all inference is blocked" and "everything works." And this configuration flag isn't in KServe's getting-started documentation. I found it by reading the KServe controller source code.

Three hours of production-blocking downtime. One boolean.

Now here's the thing: if I had been running Gemma 4 E2B locally via Ollama, this outage would not have existed. There's no ingress to misconfigure. There's no service mesh. There's no Kubernetes. There's just a process listening on a port.

# The entire "platform" for local Gemma 4
ollama run gemma4:e2b
# That's it. No YAML. No CRDs. No three-hour outage.
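
To make "a process listening on a port" concrete, here is a minimal sketch of calling that process from Python with only the standard library. It assumes Ollama's default local address (`http://localhost:11434`) and its `/api/generate` endpoint with streaming disabled; the model tag is the one used in the command above, and `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; adjust if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4:e2b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4:e2b") -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Building the request touches no network, so this part runs anywhere.
req = build_request("What does an admission controller do?")
print(req.get_method(), req.full_url)
```

With the server running, `generate("...")` returns the model's reply as a string. There is no ingress, mesh, or CRD anywhere in that path, which is the point of the comparison.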


So why do platforms like mine still exist?


Because "Running a Model" and "Serving a Model" Are Different Problems

This is the nuance I think most Gemma 4 coverage misses.

Running a model means: download weights, load into memory, send a prompt, get a response. Gemma 4 E2B does this beautifully on consumer hardware.

Serving a model means: make it available to 50 developers across 3 teams, ensure the person who deployed it is accountable for its resource cost, prevent someone from accidentally deploying a root container with no memory limits that crashes the shared cluster, automatically roll back when the model artifact URI changes and the new version returns garbage, and produce an audit trail that shows exactly who deployed what, when, and why.

That's not a model problem. That's an organizational problem. And Gemma 4 doesn't solve it — no model does.

On NeuroScale, a single deployment goes through this gauntlet:

Developer fills Backstage form
    → Platform generates YAML with required labels
    → PR is created on GitHub
    → CI runs kubeconform (schema validation)
    → CI runs kyverno-cli (policy simulation)
    → CI calculates resource delta (cost impact)
    → PR is reviewed and merged
    → ArgoCD detects the change
    → ArgoCD applies the manifest
    → Kyverno admission webhook validates live
    → KServe creates the InferenceService
    → Knative provisions the serving infrastructure
    → Smoke tests verify the endpoint responds


Thirteen steps. Most of them invisible to the developer. All of them essential when you're running models for an organization, not for yourself.
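
The shape of that gauntlet, ordered gates where the first failure stops everything after it, can be sketched as follows. The two gates here are hypothetical stand-ins for the schema and policy stages; they do not call kubeconform or Kyverno, and `run_gates` is a name invented for the example.

```python
from typing import Callable, List, Tuple

GateResult = Tuple[bool, str]
Gate = Callable[[dict], GateResult]


def schema_gate(manifest: dict) -> GateResult:
    """Stand-in for schema validation: require the top-level identity fields."""
    ok = "apiVersion" in manifest and "kind" in manifest
    return ok, ("schema ok" if ok else "missing apiVersion or kind")


def policy_gate(manifest: dict) -> GateResult:
    """Stand-in for policy simulation: require an owner label."""
    labels = manifest.get("metadata", {}).get("labels", {})
    ok = bool(labels.get("app.kubernetes.io/owner"))
    return ok, ("policy ok" if ok else "missing owner label")


def run_gates(manifest: dict, gates: List[Tuple[str, Gate]]) -> List[Tuple[str, bool, str]]:
    """Run gates in order and stop at the first failure, like a failing CI job."""
    results = []
    for name, gate in gates:
        passed, detail = gate(manifest)
        results.append((name, passed, detail))
        if not passed:
            break
    return results


GATES = [("schema", schema_gate), ("policy", policy_gate)]
unlabeled = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "gemma-model"},
}

for name, passed, detail in run_gates(unlabeled, GATES):
    print(f"{name}: {'PASS' if passed else 'FAIL'} ({detail})")
```

Returning the full result list, rather than a single boolean, keeps the record of which gate rejected the change. That is the same information an audit trail needs.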


The False-Green That Changed How I Think About AI Governance

Here's a story that connects Gemma 4's model selection philosophy to platform engineering in a way I didn't expect.

For two weeks, our CI pipeline was reporting that all Kyverno policy checks passed. Green checkmarks everywhere. PRs merged with confidence. We thought our guardrails were working.

They weren't.

The kyverno-cli apply command exits with code 0 even when policies are violated. Zero. Success. The violation details are printed to stdout, but the exit code — the thing CI systems use to determine pass/fail — says "everything's fine."

# This exits 0. CI shows green. Policies are NOT enforced.
kyverno-cli apply ./policies/ --resource ./bad-manifest.yaml
echo $?  # 0 ← This is a lie


The correct tool for CI policy validation is kyverno test, which handles exit codes properly and is purpose-built for pipelines. We were using kyverno-cli apply, not out of ignorance, but because it was the first thing in the docs. That was the mistake: reaching for what was convenient instead of verifying it was correct. For the apply step we already had, the fix was to check the exit code and also scan the captured output for violation strings:

# The real check — trust nothing
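OUTPUT=$(kyverno-cli apply ./policies/ --resource "$manifest" 2>&1)
EXIT_CODE=$?  # exit status of the command substitution above; works in any POSIX shell

# Fail on a non-zero exit OR on any violation marker in the captured output
if [ "$EXIT_CODE" -ne 0 ] || echo "$OUTPUT" | grep -Eqi 'fail|violation|denied'; then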
    echo "❌ Policy violation detected"
    exit 1
fi
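
The same "trust nothing" rule is easier to unit-test when it lives in a small function. The sketch below mirrors the shell check in Python; `policy_check_passed` is a hypothetical name, and the sample output strings are made up for illustration, not real Kyverno output.

```python
import re

# Assumption: these markers mirror the grep pattern in the shell check above.
VIOLATION_PATTERN = re.compile(r"fail|violation|denied", re.IGNORECASE)


def policy_check_passed(exit_code: int, output: str) -> bool:
    """Pass only if the tool exited 0 AND its output shows no violation marker."""
    return exit_code == 0 and VIOLATION_PATTERN.search(output) is None


# Made-up outputs covering the three cases that matter.
print(policy_check_passed(0, "all resources validated"))          # clean run
print(policy_check_passed(0, "rule require-owner: FAILED"))       # the false green
print(policy_check_passed(1, ""))                                 # honest non-zero exit
```

A substring match is deliberately blunt. Any line containing the word "fail", even in a zero-count summary, will fail the build, so the pattern may need tightening against the actual output format of the tool version you run.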


For two weeks, any developer could have deployed an InferenceService without resource limits, without ownership labels, without cost attribution — and CI would have said "all checks passed."

This is what governance failure looks like. Not a dramatic breach. A silent green checkmark.

What This Has to Do With Gemma 4

Gemma 4's model family is the first time I've seen model selection treated as a first-class engineering decision rather than an afterthought. The challenge itself asks for "intentional model selection" — show why your model was the right tool for the job.

That phrase — intentional selection — is exactly what was missing from our CI pipeline. We selected kyverno-cli apply because it was the obvious tool. We didn't verify that it actually enforced what we thought it enforced. The selection was convenient, not intentional.

The same trap exists with Gemma 4's model variants. E2B is convenient: smallest download, runs anywhere. But if your use case requires multi-step reasoning over 50K tokens of context, the benchmark data shows E4B or 31B Dense is the intentional choice. If your workload needs high-throughput batch processing, the 26B MoE, which activates only 4B parameters per forward pass, is the intentional choice.

Convenience and correctness are not the same thing. I learned that the hard way with a CI tool. Gemma 4's model family is designed so you don't have to.


Where Gemma 4 Actually Threatens My Platform (and Where It Doesn't)

After 108 commits, 21 smoke tests, and 6 milestone postmortems on NeuroScale, here's my honest assessment:

Gemma 4 Makes My Platform Unnecessary For:

Solo developers and small teams. If you're one person running one model for your own project, you should be running Gemma 4 locally. Ollama + E2B or E4B, depending on your reasoning needs. No Kubernetes. No KServe. No platform tax. The model runs on your hardware, your data stays on your machine, and you don't need me.

Edge and on-device inference. The E2B variant running on a phone or Raspberry Pi at 7.5W is a fundamentally different deployment model than anything platform engineers have optimized for. These devices don't have container runtimes. They don't need admission controllers. The compute is the deployment. This is a category Gemma 4 created, and it's outside my scope entirely.

Prototyping and experimentation. When you're exploring whether Gemma 4 can solve your problem, the correct infrastructure is zero infrastructure. Download the model. Try it. The 128K context window means you can feed it your entire codebase in one shot and ask questions. Don't set up a serving platform to prototype.

Gemma 4 Makes My Platform More Necessary For:

Multi-team organizations deploying multiple models. The moment you have Team A running a 31B Dense model for code generation and Team B running an E4B for document classification on the same cluster, you need resource isolation, cost attribution, and policy enforcement. That's not a model capability — it's an organizational capability.

Regulated environments. Healthcare, finance, government. These industries don't just need a model that works. They need an audit trail, admission controls, reproducible deployments, and the ability to prove that the model serving configuration hasn't drifted from what was approved. Gemma 4's open weights are a regulatory advantage — you control the model — but the serving infrastructure is where compliance lives.

Production SLAs. Knative's autoscaling, KServe's canary rollouts, ArgoCD's self-healing drift correction — these exist because production means "it works at 3 AM on a Sunday when nobody is watching." Running ollama serve on a workstation is not production. It's a demo.


The Decision Framework I Wish Existed

Based on my experience building NeuroScale and studying Gemma 4's architecture, here's the decision tree for anyone choosing between local inference and platform-served inference:

| Question | If Yes → | If No → |
|---|---|---|
| Are you one developer working alone? | Local Gemma 4 (E2B/E4B) | Keep reading |
| Is your data too sensitive to leave the device? | Local Gemma 4, any variant that fits | Cloud/platform OK |
| Do multiple teams share the same compute? | You need a platform | Local is fine |
| Do you need an audit trail for model deployments? | You need a platform | Local is fine |
| Does someone need to answer "who deployed this and what does it cost?" | You need a platform | Local is fine |
| Is the model serving a production API with SLA guarantees? | You need a platform | Local is fine |

If you answered "no" to all of the last four questions, run Gemma 4 locally and don't look back.

If you answered "yes" to any of the last four, the model isn't the hard part. The platform is.
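
The table can be written down as a small function, which makes the precedence explicit. One assumption is baked in: a "yes" on any of the last four questions wins over the first two, following the sentence above. The `Situation` fields and the returned strings are hypothetical names for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    solo_developer: bool
    data_must_stay_on_device: bool
    shared_compute: bool
    needs_audit_trail: bool
    needs_cost_attribution: bool
    production_sla: bool


def recommend(s: Situation) -> str:
    """Map the six yes/no answers to a deployment recommendation."""
    # Assumption: any organizational requirement overrides the local options.
    if (s.shared_compute or s.needs_audit_trail
            or s.needs_cost_attribution or s.production_sla):
        return "platform"
    if s.solo_developer:
        return "local: E2B/E4B"
    if s.data_must_stay_on_device:
        return "local: any variant that fits the device"
    return "local"


solo = Situation(True, False, False, False, False, False)
regulated = Situation(False, True, False, True, False, False)
print(recommend(solo))       # a lone developer stays local
print(recommend(regulated))  # an audit requirement pulls in the platform
```

The second case is the interesting one: sensitive data argues for local inference, but an audit requirement still needs something that records who deployed what. Where both apply, the platform has to run close to the data rather than disappear.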


What I'm Actually Building Next

Here's the honest ending: Gemma 4 didn't make my platform unnecessary. It made it smaller.

The next milestone for NeuroScale is adding Gemma 4's 26B MoE as a first-class runtime alongside our existing sklearn InferenceService. The MoE architecture — 26B total parameters but only 4B active per forward pass — has an important nuance: MoE saves compute, not memory. Because the routing mechanism selects experts dynamically per token, all 26B weights must be resident simultaneously. At 4-bit quantization that's ~13-15GB VRAM. What MoE does give you in a platform context is higher throughput per GPU dollar — the compute overhead per request drops dramatically, so you can serve more concurrent requests on the same hardware. That's where the resource governance story actually lands.
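
The arithmetic behind that memory figure is short enough to check by hand or in a few lines of Python. This sketch counts weights only, using the 26B total and 4B active parameter counts quoted above; KV cache, activations, and runtime overhead are deliberately left out, which is presumably where the upper end of the ~13-15GB range comes from. The function names are hypothetical.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal GB (no KV cache or overhead)."""
    return total_params_billions * bits_per_weight / 8


def active_fraction(active_params_billions: float, total_params_billions: float) -> float:
    """Share of the parameters that take part in a single forward pass."""
    return active_params_billions / total_params_billions


TOTAL_B, ACTIVE_B = 26, 4  # figures quoted in the text above

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_B, bits):.1f} GB")

print(f"active share per token: {active_fraction(ACTIVE_B, TOTAL_B):.0%}")
```

At 4 bits the weights come to 13.0 GB, the floor of the range in the text. Only about 15% of the parameters are active per token, so compute scales with the 4B figure while memory scales with the full 26B.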

# What this might look like in NeuroScale
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma4-moe-team-alpha
  labels:
    app.kubernetes.io/owner: team-alpha
    cost-center: cc-ml-inference
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: gemma4-moe-runtime
      storageUri: gs://neuroscale-models/gemma4-26b-moe
      resources:
        requests:
          cpu: "4"
          # MoE saves compute, not memory: all 26B weights load at runtime.
          # 16Gi works at 4-bit quant (~13-15GB), but budget for the full
          # model, not just the active params.
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: "32Gi"
          nvidia.com/gpu: "1"


The E2B and E4B? Those run on developers' laptops directly. No platform needed. No form to fill. No PR to merge. That's the right answer for that model tier, and I'm at peace with it.

The platform engineer's job isn't to serve every model. It's to serve the models that need governance, and get out of the way for the ones that don't.

Gemma 4's model family, more than any other open model release, forced me to think clearly about that boundary.


The Takeaway

If you're a developer choosing a Gemma 4 variant, here's what a platform engineer wants you to know:

  1. E2B and E4B are genuine infrastructure breakthroughs. Not because they're the smartest models — they're not. Because they eliminate the serving tax entirely. No Kubernetes. No ingress. No three-hour debugging sessions. Run them locally and ship.

  2. 31B Dense and 26B MoE still need a platform. The moment you need multi-team isolation, cost attribution, or audit trails, the model's intelligence is the easy part. The infrastructure is where the real engineering happens.

  3. Intentional model selection is governance. Choosing E2B because it's the smallest download is convenience. Choosing E2B because your task is single-turn classification with sub-second latency requirements on edge hardware — that's engineering. Gemma 4's model family is the first open model release where this distinction actually matters at every tier.

  4. The most capable model is not always the right model. I learned this from a CI tool that reported false greens for two weeks. Capability without verification is worse than limited capability you actually validated.

I build platforms that serve AI models. Gemma 4 is the first model family that made me question which models deserve a platform — and that question made my platform better.


NeuroScale is open source: github.com/sodiq-code/neuroscale-platform. 108 commits, 21 smoke tests, 6 milestones of reality-check documentation. The BEFORE.md and AFTER.md tell the full story.