I Build the Infrastructure That Serves AI Models. Gemma 4 Put My Job's Survival in Question.
Sodiq Jimoh · 2026-05-24 · via DEV Community

This is a submission for the Gemma 4 Challenge: Write About Gemma 4


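Most articles about Gemma 4 start with "I downloaded it and ran it locally." This one is different.

I build the platforms that serve models like Gemma 4. I spend my days writing Kubernetes manifests, configuring KServe InferenceServices, wiring up Knative ingress, and making sure Kyverno policies block bad deployments before they reach the cluster. My project, NeuroScale, is a self-service AI inference platform: a developer fills in a form in Backstage, the platform opens a pull request, ArgoCD deploys it, and a production-grade inference endpoint comes up.

I am, in effect, a lighthouse keeper for AI model serving.

So when Google released Gemma 4, with model sizes ranging from 2B parameters for a Raspberry Pi to 31B parameters for a workstation, my first thought was not "great, I want to try it." My first thought was: "Does this make my entire platform pointless?"

That question, and the answer I eventually reached, is what this article is about.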


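The Serving Tax Nobody Talks About

Platform engineering has a dirty secret: the cost of running an AI model is not the model itself. It is everything around it.

In NeuroScale, deploying a single sklearn InferenceService on KServe brings up five components. Not one. Five. The predictor pod, a transformer (if configured), an Istio/Kourier sidecar or gateway pod, the queue-proxy sidecar that Knative injects, and the storage initializer that downloads the model artifact at startup.

Each of these needs CPU requests, memory limits, and health probes. And because we enforce it with Kyverno admission policies, each one also needs an owner label and a cost-center label before it is allowed to exist: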

# This gets DENIED by our Kyverno policy — no owner, no deploy
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma-model
  namespace: default
spec:
  predictor:
    sklearn:
      storageUri: gs://models/gemma/v1
# ❌ Blocked: missing app.kubernetes.io/owner and cost-center labels

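To make the rule concrete, here is a minimal Python sketch of the label check that the manifest above fails. It is not Kyverno and not NeuroScale's actual policy. The function name `missing_required_labels` is hypothetical, and the two label keys are taken from the denial comment in the manifest.

```python
# Illustration only: mimics the effect of the admission rule, not Kyverno itself.
REQUIRED_LABELS = ("app.kubernetes.io/owner", "cost-center")


def missing_required_labels(manifest: dict) -> list[str]:
    """Return the required labels that are absent or empty on a manifest."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


# The denied manifest from above, expressed as a plain dict.
denied = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "gemma-model", "namespace": "default"},
}

# The same manifest with the two labels the policy asks for.
labelled = {
    **denied,
    "metadata": {
        **denied["metadata"],
        "labels": {
            "app.kubernetes.io/owner": "team-alpha",
            "cost-center": "cc-ml-inference",
        },
    },
}

print(missing_required_labels(denied))    # ['app.kubernetes.io/owner', 'cost-center']
print(missing_required_labels(labelled))  # []
```

An empty list means the manifest would clear this particular rule. Anything else is the list of labels a reviewer would have to ask for.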

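On a local k3d cluster, which is what we use for development, a single InferenceService consumes about 1.2 GB of RAM. Add Backstage (the developer portal), ArgoCD (GitOps), Kyverno (the policy engine), and OpenCost (cost attribution), and the infrastructure is already using 6-8 GB of memory before it has served a single prediction.

This is the serving tax. Every platform engineer pays it. Nobody writes about it.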


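Then Gemma 4 E2B Showed Up on a Raspberry Pi

When I read that Gemma 4's E2B model (2 billion parameters, multimodal, a 128K-token context window) runs on a Raspberry Pi 5 at 7.5 watts, something clicked.

Not a click of excitement. The click of a fundamental assumption quietly turning out to be wrong.

My entire platform rests on one assumption: deploying AI models is hard and risky, and it needs infrastructure guardrails. NeuroScale's selling point is that you can get an inference endpoint without knowing Kubernetes. You just fill in a form. The platform handles the YAML, corrects drift, enforces policy, and tracks cost.

But what if the model just... runs? On the device? No YAML? No Kubernetes? No platform?

The Gemma 4 model family makes this question concrete in a way no earlier open model release did.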

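| Model | Parameters | Runs on | Needs a platform? |
| --- | --- | --- | --- |
| E2B | 2B | Phones, Raspberry Pi, browser | No |
| E4B | 4B | Laptops, edge devices | No |
| 31B Dense | 31B | Workstation GPU | Maybe |
| 26B MoE | 26B (4B active) | Servers | Yes |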

The bottom two rows are my world. The top two rows... are not.


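The Three-Hour Outage That Explains Everything

Let me tell you about the worst three hours of building NeuroScale.

When I first set up KServe, every InferenceService creation was blocked across the entire cluster. No model could be deployed. None. The reason? KServe's default configuration assumes you run Istio as your service mesh. We run Kourier, a lightweight Envoy-based ingress that uses about 100 MB where Istio uses about 1 GB, because our k3d cluster has no RAM to spare for Istio.

The fix was a single configuration patch: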

# infrastructure/serving-stack/patches/inferenceservice-config-ingress.yaml
ingress:
  disableIstioVirtualHost: true


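One line: disableIstioVirtualHost: true. That was the only thing standing between "every inference deployment is blocked" and "everything works." This flag is not in the KServe getting-started guide. I found it by reading the KServe controller source code.

Three hours of blocked production. One boolean.

Here is the thing: if I had been running Gemma 4 E2B locally with Ollama, this outage would never have happened. No ingress to misconfigure, no service mesh. No Kubernetes. Just one process listening on one port.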

# The entire "platform" for local Gemma 4
ollama run gemma4:e2b
# That's it. No YAML. No CRDs. No three-hour outage.

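"One process listening on one port" also describes the client side. The sketch below assumes an Ollama server is already running on its default local port (11434) and that the `gemma4:e2b` tag from the command above has been pulled. The helper names `build_generate_request` and `generate` are hypothetical. It uses only the Python standard library against Ollama's `/api/generate` endpoint.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "gemma4:e2b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4:e2b") -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage, with the server running:
#   print(generate("Summarize what an admission controller does in one sentence."))
```

There is no ingress, no manifest, and no policy engine between the caller and the model. That is the whole appeal, and it is also why nothing here knows who is calling or what it costs.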

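So why does our platform still exist?


"Running a Model" and "Serving a Model" Are Two Different Things

This is the nuance that I think most of the Gemma 4 coverage misses.

Running a model means downloading the weights, loading them into memory, sending a prompt, and getting a response. Gemma 4 E2B does this beautifully on consumer hardware.

Serving a model means making it available to 50 developers across 3 teams, holding deployers accountable for the resources they consume, stopping a careless developer from deploying a root container with no memory limits and taking down the shared cluster, rolling back automatically when the model artifact URI changes and the new version returns garbage, and producing an audit trail that shows who deployed what, when, and why.

These are not model problems. They are organizational problems. Gemma 4 does not solve them, and neither does any other model.

In NeuroScale, a single deployment goes through this pipeline: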

Developer fills Backstage form
    → Platform generates YAML with required labels
    → PR is created on GitHub
    → CI runs kubeconform (schema validation)
    → CI runs kyverno-cli (policy simulation)
    → CI calculates resource delta (cost impact)
    → PR is reviewed and merged
    → ArgoCD detects the change
    → ArgoCD applies the manifest
    → Kyverno admission webhook validates live
    → KServe creates the InferenceService
    → Knative provisions the serving infrastructure
    → Smoke tests verify the endpoint responds


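Thirteen steps. Most of them are invisible to the developer. But every one of them is needed to run a model on behalf of an organization.


The False-Green Incident That Got Me Thinking About AI Governance

Here is a story that connected Gemma 4's model-selection philosophy to platform engineering in a way I did not expect.

For two weeks, my CI pipeline reported that every Kyverno policy passed. Green checkmarks everywhere, and PRs merged with confidence. I thought governance was working.

It was not.

The kyverno-cli apply command exited with code 0 even when there were violations. The violation details were printed to stdout, but the exit code, which is what the CI system uses to decide pass or fail, said everything was fine.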

# This exits 0. CI shows green. Policies are NOT enforced.
kyverno-cli apply ./policies/ --resource ./bad-manifest.yaml
echo $?  # 0 ← This is a lie


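The right tool for verifying policies in CI is kyverno test, which handles exit codes properly and is built for pipelines. We started with kyverno apply not out of ignorance but because it was the first thing that showed up in the documentation. That was the mistake: choosing convenience over verification. The fix is to check the exit code and also parse stdout for violation messages.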

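# The real check: trust nothing
# Capture stdout and stderr together, then read the command's exit status.
OUTPUT=$(kyverno-cli apply ./policies/ --resource "$manifest" 2>&1)
EXIT_CODE=$?

# Fail on a non-zero exit code OR on violation wording in the output.
# Caution: a bare "fail" also matches a "fail: 0" summary line if your CLI
# version prints one; tighten the pattern to match your actual output.
if [ "$EXIT_CODE" -ne 0 ] || echo "$OUTPUT" | grep -Eqi "fail|violation|denied"; then
    echo "❌ Policy violation detected"
    exit 1
fi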


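For two weeks, any developer could have deployed with no resource limits, no owner label, and no cost-center attribution, and CI would still have said "all checks passed."

This is what governance failure looks like. Not a dramatic breach, but a silent green checkmark.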
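One way to catch a silent green checkmark is to test the gate itself: feed it a manifest that must fail and confirm that it does. The Python sketch below illustrates that idea under simplified assumptions. The gate functions are hypothetical stand-ins, not the Kyverno CLI, and a gate here is just a function that returns True when a manifest is allowed.

```python
from typing import Callable

Manifest = dict
Gate = Callable[[Manifest], bool]  # returns True when the manifest is allowed


def gate_rejects_known_bad(gate: Gate, known_bad: Manifest) -> bool:
    """A gate is only trustworthy if it rejects a manifest that must be rejected."""
    return gate(known_bad) is False


def always_green(manifest: Manifest) -> bool:
    """Stand-in for the broken check: reports success whatever it is given."""
    return True


def requires_owner_label(manifest: Manifest) -> bool:
    """Stand-in for a working check: allow only manifests with an owner label."""
    labels = manifest.get("metadata", {}).get("labels", {})
    return bool(labels.get("app.kubernetes.io/owner"))


# A manifest with no labels at all: every real policy gate should refuse it.
bad_manifest = {"kind": "InferenceService", "metadata": {"name": "gemma-model"}}

print(gate_rejects_known_bad(always_green, bad_manifest))          # False
print(gate_rejects_known_bad(requires_owner_label, bad_manifest))  # True
```

A check like this, run in CI next to the real policy step, turns "the gate says green" into "the gate can still say red."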

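What This Has to Do With Gemma 4

The Gemma 4 model family is the first time I have seen model selection treated not as an afterthought but as a first-class engineering decision. The challenge itself asks for "intentional model selection": showing why the model you picked is the right tool for the job.

That phrase, intentional selection, is exactly what our CI pipeline lacked. We chose kyverno-cli apply because it was the obvious tool. We never verified that it actually did what we thought it did. The choice was convenient, not intentional.

The Gemma 4 model variants invite the same trap. E2B is convenient: it is the smallest download and it runs anywhere. But if your use case needs multi-step reasoning over a 500K-token context, the benchmark data points to E4B or 31B Dense as the intentional choice. If the workload needs high-throughput batch processing, the 26B MoE, with 4B active parameters per forward pass, is also an intentional choice.

Convenience and correctness are not the same thing. I learned that the hard way, from a CI tool. The Gemma 4 model family is designed so that you don't have to.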


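Where Gemma 4 Actually Threatens My Platform (and Where It Doesn't)

After 108 commits, 21 smoke tests, and 6 milestone reality-check reviews on NeuroScale, here is my honest assessment:

Where Gemma 4 makes my platform unnecessary:

Solo developers and small teams. If you are running one model on your own for your own product, run Gemma 4 locally. Use Ollama with E2B or E4B, depending on your reasoning needs. No Kubernetes, no KServe, no platform tax. The model runs on your hardware, the data stays on your machine, and you do not need me.

Edge and on-device inference. The E2B variant running on a phone or a Raspberry Pi at 7.5 watts is a deployment model completely unlike what platform engineers optimize for. These devices have no container runtime and need no admission controller. The compute is the deployment. This is a category Gemma 4 created, and it has nothing to do with me.

Prototyping and experimentation. If you want to find out whether Gemma 4 can solve your problem, the best infrastructure is no infrastructure. Download the model. Try it. Its 128K context window lets you feed in your whole codebase at once and ask questions. Do not stand up a serving platform to prototype.

Where Gemma 4 makes my platform necessary:

Multi-team organizations deploying multiple models. Once Team A is running the 31B Dense model for code generation and Team B is running E4B for document classification on the same cluster, you need resource isolation, cost attribution, and policy enforcement. That is not a model capability. It is an organizational capability.

Regulated environments. Healthcare, finance, government. These industries need more than a working model: they need audit trails, admission control, reproducible deployments, and proof that the deployed model has not drifted from what was approved. Gemma 4's open weights help with compliance, since you control the model, but the infrastructure you deploy it on is where compliance lives.

Production SLAs. Knative autoscaling, KServe canary rollouts, ArgoCD self-healing drift correction: all of these exist because production means "it works at 3 AM on a Sunday when nobody is watching." Running ollama serve on a workstation is not production. It is a demo.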


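The Decision Framework I Wish I'd Had

Having built NeuroScale and studied the Gemma 4 architecture, here is the decision tree I would give to anyone choosing between local inference and platform-served inference: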

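| Question | If yes → | If no → |
| --- | --- | --- |
| Are you a solo developer? | Local Gemma 4 (E2B/E4B) | Keep reading |
| Is your data too sensitive to leave the device? | Local Gemma 4, any variant that fits | Cloud or platform is fine |
| Do multiple teams share the same compute? | You need a platform | Local is fine |
| Do you need an audit trail of model deployments? | You need a platform | Local is fine |
| Does someone need to answer "who deployed this, and what does it cost?" | You need a platform | Local is fine |
| Does the model serve a production API with an SLA? | You need a platform | Local is fine |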

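If you answer "no" to all six questions, run Gemma 4 locally and don't look back.

If you answer "yes" to any of the last four, the model is not the hard part. The platform is.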
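The six questions can be written down as a small function. This is a sketch of one possible reading, not part of NeuroScale. The names `Answers` and `recommend` are hypothetical, and giving the last four questions precedence over the first two is an assumption drawn from the "any of the last four" rule.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    """One boolean per question, in the order the questions are asked."""
    solo_developer: bool
    data_must_stay_on_device: bool
    teams_share_compute: bool
    needs_deployment_audit: bool
    needs_owner_and_cost_answers: bool
    serves_production_api_with_sla: bool


def recommend(a: Answers) -> str:
    """Map the six answers to a recommendation string."""
    # Assumption: a "yes" to any of the last four questions wins over the first two.
    if any((
        a.teams_share_compute,
        a.needs_deployment_audit,
        a.needs_owner_and_cost_answers,
        a.serves_production_api_with_sla,
    )):
        return "platform"
    if a.solo_developer:
        return "local Gemma 4 (E2B/E4B)"
    if a.data_must_stay_on_device:
        return "local Gemma 4 (any variant that fits)"
    return "local Gemma 4"


hobbyist = Answers(True, False, False, False, False, False)
shared_cluster = Answers(False, False, True, True, True, True)

print(recommend(hobbyist))        # local Gemma 4 (E2B/E4B)
print(recommend(shared_cluster))  # platform
```

A team that answers "yes" to both the data-sensitivity question and one of the last four lands on "platform" here. In that case the platform would presumably need to be self-hosted, a nuance this sketch does not model.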


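What I Am Actually Building

My honest conclusion: Gemma 4 does not make my platform pointless. It makes it smaller.

NeuroScale's next milestone is to add Gemma 4's 26B MoE as a first-class runtime alongside our existing sklearn InferenceService. The MoE architecture (26B total parameters, but only 4B active per forward pass) comes with an important caveat: MoE saves compute, not memory. Because the router picks experts dynamically for each token, all 26B parameters must stay resident in memory. At 4-bit quantization that is roughly 13-15 GB of VRAM. What MoE gives the platform is higher throughput per GPU dollar: the compute cost per request drops sharply, so the same hardware can serve more concurrent requests. That is exactly where resource governance matters.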

# What this might look like in NeuroScale
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma4-moe-team-alpha
  labels:
    app.kubernetes.io/owner: team-alpha
    cost-center: cc-ml-inference
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: gemma4-moe-runtime
      storageUri: gs://neuroscale-models/gemma4-26b-moe
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"     # MoE saves compute, not memory — all 26B weights load at runtime. 16Gi works at 4-bit quant (~13-15GB), but budget for the full model, not just active params.
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: "32Gi"
          nvidia.com/gpu: "1"

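The memory figure in the manifest comment follows from simple arithmetic, sketched below in Python. The helper `weight_memory_gb` is hypothetical, and it counts weights only. The KV cache, activations, and runtime overhead are ignored, which is presumably where the gap between 13 GB and the quoted 13-15 GB comes from. Treating compute per token as proportional to active parameters is also a simplifying assumption.

```python
def weight_memory_gb(total_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (no KV cache, no overhead)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9   # every expert must stay resident in memory
ACTIVE_PARAMS = 4e9   # parameters actually used per forward pass

resident = weight_memory_gb(TOTAL_PARAMS, bits_per_weight=4)
active_only = weight_memory_gb(ACTIVE_PARAMS, bits_per_weight=4)
compute_ratio = ACTIVE_PARAMS / TOTAL_PARAMS  # rough proxy for per-token compute

print(f"resident weights at 4-bit: {resident:.1f} GB")
print(f"if only active params had to be loaded: {active_only:.1f} GB")
print(f"per-token compute vs. a dense 26B model: {compute_ratio:.0%}")
```

The first number (13.0 GB) is what the memory request has to cover. The last one (about 15%) is where the throughput gain comes from.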

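E2B and E4B? Those run directly on developers' own machines. No platform. No form to fill in. No PR to merge. That is the right answer for those model tiers, and I am at peace with it.

A platform engineer's job is not to serve every model. It is to serve the models that need governance and to get out of the way of the ones that don't.

The Gemma 4 model family, more than any other open model release, forced me to think clearly about where that line is.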


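Key Takeaways

If you are a developer choosing a Gemma 4 variant, here is what a platform engineer wants you to know:

  1. E2B and E4B are an infrastructure breakthrough. Not because they are the smartest models; they are not. It is because they eliminate the serving tax entirely. No Kubernetes. No ingress. No three-hour debugging sessions. Just run them locally and ship.

  2. 31B Dense and 26B MoE still need a platform. Once you need multi-team isolation, cost attribution, or an audit trail, the model's intelligence is the easy part. The infrastructure is where the real engineering happens.

  3. Intentional model selection is a governance matter. Choosing E2B because it is the smallest download is convenience. Choosing E2B because your task is single-turn classification that needs sub-second latency on edge hardware is engineering. The Gemma 4 family is the first open model release where this distinction really matters at every tier.

  4. The most capable option is not always the right one. I learned this from a CI tool that reported false greens for two weeks. An unverified capability is worse than a verified capability with limits.

I build platforms to serve AI models. Gemma 4 is the first model family that made me question which models deserve a platform, and that question made my platform better.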


NeuroScale is open source: github.com/sodiq-code/neuroscale-platform. 108 commits, 21 smoke tests, 6 reality-check documents. BEFORE.md and AFTER.md tell the whole story.