I Build the Infrastructure That Serves AI Models. Gemma 4 Just Made My Job Existential.


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Most posts about Gemma 4 start with "I downloaded it and ran it locally." Mine starts differently.

I build the platform that serves models like Gemma 4. I spend my days writing Kubernetes manifests, configuring KServe InferenceServices, debugging Knative ingress routes, and making sure Kyverno policies block the bad deployments before they reach the cluster. My project — NeuroScale — is a self-service AI inference platform where a developer fills in a Backstage form, the platform creates a pull request, ArgoCD deploys it, and a production-grade inference endpoint goes live.

I am, in the most literal sense, the person who keeps the lights on for AI model serving.

So when Google released Gemma 4 with models ranging from 2B parameters on a Raspberry Pi to 31B on a workstation, my first thought wasn't "cool, let me try it." My first thought was: "Does this make my entire platform unnecessary?"

That question — and the answer I eventually reached — is what this post is about.

The Serving Tax Nobody Talks About

Here's a dirty secret from platform engineering: most of the cost of running an AI model isn't the model. It's everything around it.

On NeuroScale, deploying a single sklearn InferenceService on KServe brings roughly five moving parts into play. Not one. Five. A predictor pod, a transformer (if configured), an Istio or Kourier gateway pod, a queue-proxy sidecar injected by Knative, and a storage initializer that downloads model artifacts at startup.

Each of those pods needs CPU requests, memory limits, health probes, and — because we enforce this through Kyverno admission policies — ownership labels and cost-center attribution before they're even allowed to exist:

# This gets DENIED by our Kyverno policy — no owner, no deploy
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma-model
  namespace: default
spec:
  predictor:
    sklearn:
      storageUri: gs://models/gemma/v1
# ❌ Blocked: missing app.kubernetes.io/owner and cost-center labels
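A quick local pre-flight check can catch the same omission before a PR is even opened. The sketch below is a grep-based approximation, not how Kyverno evaluates policies. The function name `has_required_labels` is hypothetical, and the two label keys are taken from the denial message above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeeds only if the manifest sets both
# labels the admission policy requires. A text match is an approximation;
# the real decision is made by Kyverno at admission time.
has_required_labels() {
  local manifest="$1"
  grep -Eq '^[[:space:]]*app\.kubernetes\.io/owner:[[:space:]]*[^[:space:]]' "$manifest" &&
    grep -Eq '^[[:space:]]*cost-center:[[:space:]]*[^[:space:]]' "$manifest"
}

# Usage: has_required_labels isvc.yaml || echo "missing owner or cost-center label"
```

It only saves a round trip through CI. The admission webhook stays the source of truth.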

On a local k3d cluster — the kind I use for development — a single InferenceService consumes around 1.2 GB of RAM. Add Backstage (the developer portal), ArgoCD (GitOps), Kyverno (policy engine), and OpenCost (cost attribution), and you're at 6-8 GB of infrastructure memory before you've served a single prediction.
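If you want to measure the serving tax on your own cluster, here is a minimal sketch. It assumes metrics-server is installed and that `kubectl top pods -A --no-headers` prints four columns (namespace, name, CPU, memory) with memory in Mi. The helper name `sum_pod_memory_mi` is hypothetical.

```shell
#!/usr/bin/env bash
# Sum the MEMORY column of `kubectl top pods -A --no-headers` output.
# Assumes the four-column layout NAMESPACE NAME CPU MEMORY, memory in Mi.
sum_pod_memory_mi() {
  awk '{ sub(/Mi$/, "", $4); total += $4 } END { print total + 0 }'
}

# Usage (requires metrics-server in the cluster):
#   kubectl top pods -A --no-headers | sum_pod_memory_mi
```

Run it before and after deploying an InferenceService, and the difference is what that one model costs you in platform overhead.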

This is the serving tax. Every platform engineer pays it. Nobody writes blog posts about it.

Then Gemma 4 E2B Showed Up on a Raspberry Pi

When I read that Gemma 4's E2B model — 2 billion parameters, multimodal, 128K context window — runs on a Raspberry Pi 5 drawing 7.5 watts, something clicked.

Not "clicked" as in excitement. "Clicked" as in a core assumption quietly becoming wrong.

My entire platform exists because of a specific premise: deploying AI models is hard, dangerous, and requires infrastructure guardrails. The selling point of NeuroScale is that you don't need to understand Kubernetes to get an inference endpoint. You fill in a form. The platform handles the YAML, the drift correction, the policy enforcement, the cost tracking.

But what if the model just... runs? On the device? With no YAML? No Kubernetes? No platform?

Gemma 4's model family makes this question concrete in a way previous open models didn't:

Model	Parameters	Runs On	Needs Platform?
E2B	2B	Phone, Raspberry Pi, browser	No
E4B	4B	Laptop, edge device	No
31B Dense	31B	Workstation GPU	Maybe
26B MoE	26B (4B active)	Server	Yes

The bottom two rows are my world. The top two rows are... not.

The Three-Hour Outage That Explains Everything

Let me tell you about the worst three hours of my NeuroScale build.

When I first set up KServe, all InferenceService creation was blocked cluster-wide. No model could deploy. Zero. The reason? KServe's default configuration assumes you're running Istio as your service mesh. We were running Kourier — a lightweight Envoy-based ingress gateway that uses ~100 MB instead of Istio's ~1 GB — because our k3d cluster didn't have the RAM headroom for Istio.

The fix was a single configuration patch:

# infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml
ingress:
  disableIstioVirtualHost: true

One line. disableIstioVirtualHost: true. That's what stood between "all inference is blocked" and "everything works." And this configuration flag isn't in KServe's getting-started documentation. I found it by reading the KServe controller source code.

Three hours of production-blocking downtime. One boolean.
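A check like the following would have shortened those three hours considerably. This is a sketch, assuming a default KServe install where the `inferenceservice-config` ConfigMap lives in the `kserve` namespace and stores its `ingress` settings as a JSON string. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Reads the ingress JSON from stdin; succeeds only if the flag is set to true.
ingress_disables_istio_vhost() {
  grep -Eq '"disableIstioVirtualHost"[[:space:]]*:[[:space:]]*true'
}

# Usage against a live cluster (name and namespace assume a default install):
#   kubectl get configmap inferenceservice-config -n kserve \
#     -o jsonpath='{.data.ingress}' | ingress_disables_istio_vhost \
#     || echo "KServe still expects Istio"
```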

Now here's the thing: if I had been running Gemma 4 E2B locally via Ollama, this outage would not have existed. There's no ingress to misconfigure. There's no service mesh. There's no Kubernetes. There's just a process listening on a port.

# The entire "platform" for local Gemma 4
ollama run gemma4:e2b
# That's it. No YAML. No CRDs. No three-hour outage.
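"A process listening on a port" is meant literally: Ollama exposes an HTTP API on localhost, port 11434 by default. Here's a minimal sketch of calling it from a script. The function names are hypothetical, and the model tag is the one used above.

```shell
#!/usr/bin/env bash
# Build the JSON body for Ollama's /api/generate endpoint.
# No escaping is done here: prompts containing quotes or backslashes need a
# real JSON encoder (for example jq) instead of printf.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# Send a prompt to a local Ollama server on its default port.
ask_local_model() {
  curl -s http://localhost:11434/api/generate -d "$(generate_payload "$1" "$2")"
}

# Usage: ask_local_model gemma4:e2b "Summarize this log line in one sentence."
```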

So why do platforms like mine still exist?

Because "Running a Model" and "Serving a Model" Are Different Problems

This is the nuance I think most Gemma 4 coverage misses.

Running a model means: download weights, load into memory, send a prompt, get a response. Gemma 4 E2B does this beautifully on consumer hardware.

Serving a model means: make it available to 50 developers across 3 teams, ensure the person who deployed it is accountable for its resource cost, prevent someone from accidentally deploying a root container with no memory limits that crashes the shared cluster, automatically roll back when the model artifact URI changes and the new version returns garbage, and produce an audit trail that shows exactly who deployed what, when, and why.

That's not a model problem. That's an organizational problem. And Gemma 4 doesn't solve it — no model does.

On NeuroScale, a single deployment goes through this gauntlet:

Developer fills Backstage form
    → Platform generates YAML with required labels
    → PR is created on GitHub
    → CI runs kubeconform (schema validation)
    → CI runs kyverno-cli (policy simulation)
    → CI calculates resource delta (cost impact)
    → PR is reviewed and merged
    → ArgoCD detects the change
    → ArgoCD applies the manifest
    → Kyverno admission webhook validates live
    → KServe creates the InferenceService
    → Knative provisions the serving infrastructure
    → Smoke tests verify the endpoint responds

Thirteen steps. Most of them invisible to the developer. All of them essential when you're running models for an organization, not for yourself.
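The CI portion of that chain boils down to a fail-fast sequence of gates. Here's a minimal sketch of the idea, not NeuroScale's actual pipeline. `run_gates` and the script path in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Run each gate command in order; stop at the first failure and report it.
# Each argument is one complete command line.
run_gates() {
  local gate
  for gate in "$@"; do
    if ! bash -c "$gate"; then
      echo "gate failed: $gate"
      return 1
    fi
  done
  echo "all gates passed"
}

# Usage in CI (invocations are illustrative; adjust to your setup):
#   run_gates "kubeconform -strict isvc.yaml" "./scripts/policy-check.sh isvc.yaml"
```

The ordering matters: cheap schema validation runs first, so a malformed manifest never reaches the slower policy simulation.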

The False-Green That Changed How I Think About AI Governance

Here's a story that connects Gemma 4's model selection philosophy to platform engineering in a way I didn't expect.

For two weeks, our CI pipeline was reporting that all Kyverno policy checks passed. Green checkmarks everywhere. PRs merged with confidence. We thought our guardrails were working.

They weren't.

The kyverno-cli apply command exits with code 0 even when policies are violated. Zero. Success. The violation details are printed to stdout, but the exit code — the thing CI systems use to determine pass/fail — says "everything's fine."

# This exits 0. CI shows green. Policies are NOT enforced.
kyverno-cli apply ./policies/ --resource ./bad-manifest.yaml
echo $?  # 0 ← This is a lie

The correct tool for CI policy validation is kyverno test, which handles exit codes properly and is purpose-built for pipelines. We were using kyverno-cli apply, not out of ignorance, but because it was the first thing in the docs. That was the mistake: reaching for what was convenient instead of verifying it was correct. For the existing apply step, the fix was to check the exit code and also scan the output for violation strings:

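# The real check: trust nothing
OUTPUT=$(kyverno-cli apply ./policies/ --resource "$manifest" 2>&1)
EXIT_CODE=$?  # exit status of the command substitution above

# The pattern is deliberately broad. If your CLI version prints a summary
# line such as "fail: 0", tighten it so a clean run is not flagged.
if [ "$EXIT_CODE" -ne 0 ] || echo "$OUTPUT" | grep -Eqi "fail|violation|denied"; then
    echo "❌ Policy violation detected"
    exit 1
fi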

For two weeks, any developer could have deployed an InferenceService without resource limits, without ownership labels, without cost attribution — and CI would have said "all checks passed."

This is what governance failure looks like. Not a dramatic breach. A silent green checkmark.

What This Has to Do With Gemma 4

Gemma 4's model family is the first time I've seen model selection treated as a first-class engineering decision rather than an afterthought. The challenge itself asks for "intentional model selection" — show why your model was the right tool for the job.

That phrase — intentional selection — is exactly what was missing from our CI pipeline. We selected kyverno-cli apply because it was the obvious tool. We didn't verify that it actually enforced what we thought it enforced. The selection was convenient, not intentional.

The same trap exists with Gemma 4's model variants. E2B is convenient: smallest download, runs anywhere. But if your use case requires multi-step reasoning over 50K tokens of context, the benchmark data points to E4B or 31B Dense as the intentional choice. If your workload needs high-throughput batch processing, the 26B MoE, with only 4B active parameters per forward pass, is the intentional choice.

Convenience and correctness are not the same thing. I learned that the hard way with a CI tool. Gemma 4's model family is designed so you don't have to.

Where Gemma 4 Actually Threatens My Platform (and Where It Doesn't)

After 108 commits, 21 smoke tests, and 6 milestone postmortems on NeuroScale, here's my honest assessment:

Gemma 4 Makes My Platform Unnecessary For:

Solo developers and small teams. If you're one person running one model for your own project, you should be running Gemma 4 locally. Ollama + E2B or E4B, depending on your reasoning needs. No Kubernetes. No KServe. No platform tax. The model runs on your hardware, your data stays on your machine, and you don't need me.

Edge and on-device inference. The E2B variant running on a phone or Raspberry Pi at 7.5W is a fundamentally different deployment model than anything platform engineers have optimized for. These devices don't have container runtimes. They don't need admission controllers. The compute is the deployment. This is a category Gemma 4 created, and it's outside my scope entirely.

Prototyping and experimentation. When you're exploring whether Gemma 4 can solve your problem, the correct infrastructure is zero infrastructure. Download the model. Try it. The 128K context window means you can feed it a small codebase, or a large slice of a bigger one, in one shot and ask questions. Don't set up a serving platform to prototype.

Gemma 4 Makes My Platform More Necessary For:

Multi-team organizations deploying multiple models. The moment you have Team A running a 31B Dense model for code generation and Team B running an E4B for document classification on the same cluster, you need resource isolation, cost attribution, and policy enforcement. That's not a model capability — it's an organizational capability.

Regulated environments. Healthcare, finance, government. These industries don't just need a model that works. They need an audit trail, admission controls, reproducible deployments, and the ability to prove that the model serving configuration hasn't drifted from what was approved. Gemma 4's open weights are a regulatory advantage — you control the model — but the serving infrastructure is where compliance lives.

Production SLAs. Knative's autoscaling, KServe's canary rollouts, ArgoCD's self-healing drift correction — these exist because production means "it works at 3 AM on a Sunday when nobody is watching." Running ollama serve on a workstation is not production. It's a demo.

The Decision Framework I Wish Existed

Based on my experience building NeuroScale and studying Gemma 4's architecture, here's the decision tree for anyone choosing between local inference and platform-served inference:

Question	If Yes →	If No →
Are you one developer working alone?	Local Gemma 4 (E2B/E4B)	Keep reading
Is your data too sensitive to leave the device?	Local Gemma 4, any variant that fits	Cloud/platform OK
Do multiple teams share the same compute?	You need a platform	Local is fine
Do you need an audit trail for model deployments?	You need a platform	Local is fine
Does someone need to answer "who deployed this and what does it cost?"	You need a platform	Local is fine
Is the model serving a production API with SLA guarantees?	You need a platform	Local is fine

If you answered "no" to all of the last four questions, run Gemma 4 locally and don't look back.

If you answered "yes" to any of the last four, the model isn't the hard part. The platform is.
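The rule for the last four questions is simple enough to write down as code. A minimal sketch, with a hypothetical function name, that takes the four answers in table order:

```shell
#!/usr/bin/env bash
# Encode the last four questions of the table: any "yes" means a platform.
# Arguments (each "yes" or "no"), in table order:
#   shared_compute audit_trail cost_accountability production_sla
recommend_serving() {
  local answer
  for answer in "$@"; do
    if [ "$answer" = "yes" ]; then
      echo "platform"
      return 0
    fi
  done
  echo "local"
}

# Usage: recommend_serving no yes no no
```

The first two questions are left out on purpose. They steer you toward local inference, but they never force a platform on you.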

What I'm Actually Building Next

Here's the honest ending: Gemma 4 didn't make my platform unnecessary. It made it smaller.

The next milestone for NeuroScale is adding Gemma 4's 26B MoE as a first-class runtime alongside our existing sklearn InferenceService. The MoE architecture — 26B total parameters but only 4B active per forward pass — has an important nuance: MoE saves compute, not memory. Because the routing mechanism selects experts dynamically per token, all 26B weights must be resident simultaneously. At 4-bit quantization that's ~13-15GB VRAM. What MoE does give you in a platform context is higher throughput per GPU dollar — the compute overhead per request drops dramatically, so you can serve more concurrent requests on the same hardware. That's where the resource governance story actually lands.
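The arithmetic behind that figure is worth spelling out. Weight footprint is roughly parameters times bits per weight divided by eight. The sketch below counts weights only. KV cache, activations, and runtime overhead come on top, which is presumably where the upper end of the 13-15GB range comes from. The helper name `weights_gb` is hypothetical.

```shell
#!/usr/bin/env bash
# Approximate weight footprint in GB: parameters (billions) * bits per weight / 8.
# Weights only; KV cache, activations, and runtime overhead come on top.
weights_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

weights_gb 26 4    # all experts resident at 4-bit: 13.0
weights_gb 4 4     # active parameters only: 2.0 (not what you must provision)
weights_gb 26 16   # same model at 16-bit: 52.0
```

The gap between 13.0 and 2.0 is the trap: provisioning for active parameters instead of total parameters gives you a pod that cannot load its model.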

# What this might look like in NeuroScale
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma4-moe-team-alpha
  labels:
    app.kubernetes.io/owner: team-alpha
    cost-center: cc-ml-inference
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: gemma4-moe-runtime
      storageUri: gs://neuroscale-models/gemma4-26b-moe
      resources:
        requests:
          cpu: "4"
          # MoE saves compute, not memory: all 26B weights load at runtime.
          # 16Gi works at 4-bit quant (~13-15GB), but budget for the full model, not just active params.
          # Note: this is container (host) RAM; GPU memory comes with the nvidia.com/gpu request below.
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: "32Gi"
          nvidia.com/gpu: "1"

The E2B and E4B? Those run on developers' laptops directly. No platform needed. No form to fill. No PR to merge. That's the right answer for that model tier, and I'm at peace with it.

The platform engineer's job isn't to serve every model. It's to serve the models that need governance, and get out of the way for the ones that don't.

Gemma 4's model family, more than any other open model release, forced me to think clearly about that boundary.

The Takeaway

If you're a developer choosing a Gemma 4 variant, here's what a platform engineer wants you to know:

E2B and E4B are genuine infrastructure breakthroughs. Not because they're the smartest models — they're not. Because they eliminate the serving tax entirely. No Kubernetes. No ingress. No three-hour debugging sessions. Run them locally and ship.
31B Dense and 26B MoE still need a platform. The moment you need multi-team isolation, cost attribution, or audit trails, the model's intelligence is the easy part. The infrastructure is where the real engineering happens.
Intentional model selection is governance. Choosing E2B because it's the smallest download is convenience. Choosing E2B because your task is single-turn classification with sub-second latency requirements on edge hardware — that's engineering. Gemma 4's model family is the first open model release where this distinction actually matters at every tier.
The most capable model is not always the right model. I learned this from a CI tool that reported false greens for two weeks. Capability without verification is worse than limited capability you actually validated.

I build platforms that serve AI models. Gemma 4 is the first model family that made me question which models deserve a platform — and that question made my platform better.

NeuroScale is open source: github.com/sodiq-code/neuroscale-platform. 108 commits, 21 smoke tests, 6 milestones of reality-check documentation. The BEFORE.md and AFTER.md tell the full story.
