I Revived a Broken MLOps Platform — Now It's Self-Service, Policy-Guarded, and Operationally Credible
Sodiq Jimoh · 2026-05-23 · via DEV Community

Submitted for the GitHub Copilot Challenge — deadline June 7, 2026. Built with GitHub Copilot as an active architectural and debugging partner.


On April 4th, I abandoned this platform. Backstage crashed. ArgoCD was broken. KServe couldn't serve a single model. I walked away and left it for dead.

48 days later, I came back and rebuilt it into a self-service AI inference platform with GitOps, policy enforcement, and deterministic recovery.

21 checks. 0 failures. Reproducible on any machine.

Before vs After — CrashLoopBackOff and kubectl apply → hope on the left. Backstage form → ArgoCD synced → PASS 21/FAIL 0 on the right. 48 Days Later.

Repo: github.com/sodiq-code/neuroscale-platform


Watch It Run: 21 Checks, 0 Failures

NeuroScale smoke test — 21 automated checks running live: ArgoCD drift self-heal, KServe inference returning predictions, Kyverno blocking non-compliant manifests, all green

This is not a claim. The video below shows every check running live against a real k3d cluster.

▶ Watch the full smoke test demo — 21 checks, 0 failures

Real terminal: Final smoke test results — PASS 21, FAIL 0, SKIP 1

━━━ Milestone A — GitOps Spine (ArgoCD) ━━━
  [✓ PASS] All ArgoCD pods are Running
  [✓ PASS] ArgoCD Applications: 7 Healthy, 0 Progressing, 7 total
  [✓ PASS] ArgoCD sync visibility: no Unknown states (7/7 Synced)
  [✓ PASS] Drift self-heal: nginx-test recreated and Ready in ~20s

━━━ Milestone B — AI Serving Baseline (KServe) ━━━
  [✓ PASS] KServe controller-manager: 1 replica available
  [✓ PASS] InferenceServices: 2/2 Ready=True
  [✓ PASS] Inference request: demo-iris-2 returned predictions
           ↳ Response: {"predictions":[1,1]}

━━━ Milestone C — Golden Path (Backstage) ━━━
  [✓ PASS] Backstage deployment: 1 replica available
  [✓ PASS] demo-iris-2 InferenceService exists (scaffolder output)
  [✓ PASS] demo-iris-2 ArgoCD Application exists (ApplicationSet output)

━━━ Milestone D — Guardrails (Kyverno + CI) ━━━
  [✓ PASS] Kyverno pods running: 3
  [✓ PASS] Kyverno ClusterPolicies installed: 5 policies
  [✓ PASS] Admission block: non-compliant InferenceService correctly denied

━━━ Milestone F — Production Hardening ━━━
  [✓ PASS] ApplicationSet neuroscale-model-endpoints exists
  [✓ PASS] ArgoCD has 7 Applications (ApplicationSet + static)
  [✓ PASS] ResourceQuota exists in namespace default
  [✓ PASS] LimitRange exists in namespace default
  [✓ PASS] Non-root admission block: root-container Deployment denied
  [✓ PASS] OpenCost deployment healthy: 1 replica available

  PASS 21 / FAIL 0 / SKIP 1
  ✓ All checks passed. Platform is healthy and ready to demo.


The single SKIP is the drift self-heal pre-condition check — normal after a previous test run. The drift self-heal itself passed, visible in the output above and in the video.
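The PASS / FAIL / SKIP tally above could come from a harness along these lines. This is a minimal sketch, not the contents of scripts/smoke-test.sh: the function names check, skip, and summary are hypothetical, and the stand-in commands (true, false) replace real kubectl calls so the sketch runs without a cluster.

```shell
PASS=0
FAIL=0
SKIP=0

# check DESCRIPTION COMMAND...
# Runs COMMAND and records PASS or FAIL from its exit status.
check() {
  desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1))
    printf '  [PASS] %s\n' "$desc"
  else
    FAIL=$((FAIL + 1))
    printf '  [FAIL] %s\n' "$desc"
  fi
}

# skip DESCRIPTION
# Records a check whose pre-condition was not met.
skip() {
  SKIP=$((SKIP + 1))
  printf '  [SKIP] %s\n' "$1"
}

# summary
# Prints the tally and returns non-zero if anything failed.
summary() {
  printf '  PASS %d / FAIL %d / SKIP %d\n' "$PASS" "$FAIL" "$SKIP"
  [ "$FAIL" -eq 0 ]
}

# Stand-in checks (hypothetical); a real run would call kubectl here.
check "stand-in that succeeds" true
check "stand-in that fails" false
skip "pre-condition not met"
summary || echo "  some checks failed"
```

Keeping SKIP separate from FAIL lets a check with an unmet pre-condition be reported without turning the whole run red.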

Reproducible on any machine:

bash scripts/bootstrap.sh     # ~5 minutes — requires Docker + k3d
bash scripts/smoke-test.sh    # 21 checks, all green



The Problem: A Platform That Was Abandoned and Dangerous

NeuroScale started in February 2026 as an AI inference platform on Kubernetes. By early April it was abandoned — Backstage crashing, ArgoCD broken, KServe unable to serve a single model. The last commit before this challenge was April 4th. Then 48 days of silence.

Here's what I found when I came back:

  • Backstage: CrashLoopBackOff — 14 restarts. A Helm values nesting bug caused probe timings to be silently ignored.
  • ArgoCD repo-server: CrashLoopBackOff — every application showed Unknown, meaning ArgoCD couldn't even evaluate their state.
  • KServe: READY=False — default config assumed Istio for ingress, but the cluster ran Kourier. Error: "virtual service not found".
  • Policy enforcement: None. Root containers, no resource limits, :latest tags — deployed freely.
  • Drift detection: None. Manual kubectl changes accumulated silently.

The deployment process was vim → kubectl apply → hope. Developers feared deploying models. The platform was technically worse than not having one.


What I Built: Five Enforcement Layers

NeuroScale Platform Architecture — Backstage → GitHub PR + CI → ArgoCD → Kyverno → KServe + Kourier → OpenCost

Layer 1: Self-Service Golden Path

A developer fills in a Backstage form. The platform does everything else.

Backstage form → PR created → CI validates → Merge → ArgoCD syncs
  → ApplicationSet discovers → KServe endpoint live → Predictions working


No kubectl. No YAML editing. No tribal knowledge. The template.yaml generates a compliant InferenceService manifest, opens a PR, and the neuroscale-model-endpoints ApplicationSet auto-discovers it.
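To make "generates a compliant InferenceService manifest" concrete, here is a rough shell stand-in for what such a template might render. It is an illustration only: the real template.yaml is a Backstage scaffolder template, and the function name, the example values, the storage URI, and the resource figures below are all assumptions rather than values from the repo. The manifest shape follows the KServe v1beta1 InferenceService API.

```shell
# render_inference_service NAME FORMAT STORAGE_URI OWNER COST_CENTER
# Prints a manifest carrying the labels and resource limits the policies expect.
render_inference_service() {
  name="$1"
  format="$2"
  storage_uri="$3"
  owner="$4"
  cost_center="$5"
  cat <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ${name}
  namespace: default
  labels:
    owner: ${owner}
    cost-center: ${cost_center}
spec:
  predictor:
    model:
      modelFormat:
        name: ${format}
      storageUri: ${storage_uri}
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
EOF
}

# Hypothetical form values, for illustration only.
render_inference_service demo-iris sklearn "gs://example-bucket/models/iris" team-ml ml-research
```

Because the labels and limits come from the template rather than from the developer, a manifest produced this way should already satisfy the label and resource policies before CI or admission control looks at it.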

Backstage Scaffolder UI — KServe model endpoint form: endpoint name (DNS pattern enforced), model format dropdown, storage URI, owner label, cost center label

Five fields. No kubectl. No YAML. DNS pattern enforced client-side, cost center required. Click Next and Backstage does the rest.
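The exact pattern the form enforces is not shown here. As a sketch, assuming it follows the Kubernetes DNS-1123 label rule (lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters), the same check could be written in shell like this; the function name is hypothetical.

```shell
# is_dns_label NAME
# Succeeds when NAME is a valid DNS-1123 label.
is_dns_label() {
  name="$1"
  # Length must be between 1 and 63 characters.
  [ "${#name}" -ge 1 ] || return 1
  [ "${#name}" -le 63 ] || return 1
  # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
  printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
}

for candidate in demo-iris-2 Demo_Iris -leading-dash; do
  if is_dns_label "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: rejected"
  fi
done
```

Running the same rule in CI as well as in the form would catch manifests that bypass the form entirely.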

Backstage scaffolder execution log —

Two steps, 9 seconds total. PR opened, ApplicationSet picks it up on next ArgoCD sync.

Layer 2: GitOps Drift Control

Git is the source of truth. Drift is auto-corrected.

$ kubectl delete deploy nginx-test -n default
# 20 seconds later...
$ kubectl get deploy nginx-test -n default
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test   1/1     1            1           8s   # Auto-recreated by ArgoCD


selfHeal: true and prune: true. Manual cluster changes cannot persist.
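A drift check like the one above needs to wait for ArgoCD to put the resource back instead of sleeping for a fixed time. Here is a minimal polling sketch; wait_until is a hypothetical helper, not something taken from the repo, and a marker file stands in for the Deployment so the sketch runs without a cluster.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...
# Re-runs COMMAND until it succeeds or roughly TIMEOUT_SECONDS have passed.
wait_until() {
  timeout="$1"
  interval="$2"
  shift 2
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Against a real cluster the check might look like this (not run here):
#   kubectl delete deploy nginx-test -n default
#   wait_until 60 2 kubectl rollout status deploy/nginx-test -n default --timeout=5s

# Stand-in: a marker file "reappears" after one second.
workdir="$(mktemp -d)"
marker="$workdir/ready"
( sleep 1; : > "$marker" ) &
if wait_until 10 1 test -e "$marker"; then
  echo "resource is back"
fi
rm -rf "$workdir"
```

Polling with a deadline keeps the check fast when self-heal is quick and still fails deterministically when it never happens.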

Layer 3: Policy Guardrails — Shift-Left + Shift-Down

At PR time (CI): kubeconform validates schemas. kyverno-cli simulates all 5 policies against rendered manifests with a dual exit-code + stdout check to guard against false-greens. Full pipeline at .github/workflows/guardrails-checks.yaml.

At admission time (cluster): Kyverno blocks non-compliant resources before they reach the cluster.

Real terminal: Kyverno admission denial —

Five enforced policies:

| Policy | What It Blocks |
| --- | --- |
| require-standard-labels-inferenceservice | Missing owner + cost-center labels |
| require-standard-labels-deployment | Missing ownership labels on Deployments |
| require-resource-requests-limits | No CPU/memory requests or limits |
| disallow-latest-image-tag | Floating :latest image tags |
| disallow-root-containers | Containers without runAsNonRoot: true |
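The policies themselves are Kyverno ClusterPolicies, and their YAML is not reproduced here. To illustrate the rule behind disallow-latest-image-tag, here is a small shell sketch of what counts as a floating image reference. The function name is hypothetical, and treating an untagged image as floating is an assumption about the policy, based on container runtimes resolving a missing tag to latest.

```shell
# is_floating_tag IMAGE_REF
# Succeeds when IMAGE_REF resolves to a mutable "latest" image.
is_floating_tag() {
  ref="$1"
  case "$ref" in
    *@sha256:*) return 1 ;;   # pinned by digest
  esac
  # Keep only the last path segment so a registry port is not read as a tag.
  last="${ref##*/}"
  case "$last" in
    *:latest) return 0 ;;     # explicit floating tag
    *:*)      return 1 ;;     # some other explicit tag
    *)        return 0 ;;     # no tag at all, which resolves to latest
  esac
}

for image in nginx:latest nginx localhost:5000/nginx nginx:1.25; do
  if is_floating_tag "$image"; then
    echo "$image: floating"
  else
    echo "$image: pinned"
  fi
done
```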

Layer 4: Cost Attribution

Every workload carries owner and cost-center labels enforced by Kyverno — you can't deploy without them. OpenCost reads these via Prometheus for per-team cost breakdowns. The CI pipeline also comments on PRs with CPU/memory deltas and flags workloads exceeding thresholds.
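The CPU delta in a PR comment comes down to normalizing Kubernetes CPU quantities before subtracting them. The sketch below shows that arithmetic under stated assumptions: the function names and the 1000m threshold are hypothetical, and only plain cores ("2", "0.5") and millicore ("500m") forms are handled.

```shell
# cpu_to_millicores QUANTITY
# Converts "500m", "2", or "0.5" to an integer number of millicores.
cpu_to_millicores() {
  awk -v q="$1" 'BEGIN {
    if (q ~ /m$/) { sub(/m$/, "", q); printf "%d\n", q + 0 }
    else          { printf "%d\n", q * 1000 + 0.5 }
  }'
}

# cpu_delta BEFORE AFTER
# Prints the change in requested CPU, in millicores.
cpu_delta() {
  before="$(cpu_to_millicores "$1")"
  after="$(cpu_to_millicores "$2")"
  echo $((after - before))
}

THRESHOLD_MILLICORES=1000   # hypothetical threshold

delta="$(cpu_delta 500m 2)"
echo "CPU request delta: ${delta}m"
if [ "$delta" -gt "$THRESHOLD_MILLICORES" ]; then
  echo "flag: delta exceeds ${THRESHOLD_MILLICORES}m"
fi
```

Memory quantities (Mi, Gi) would need the same kind of normalization before they can be compared.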

Layer 5: Operational Recovery

Documented runbooks for every failure mode encountered. 3-command, 2-minute recovery procedures. Full runbook at docs/runbook.md.


Copilot Partnership: Three Moments That Mattered

Copilot didn't write this platform. It functioned as a senior infrastructure advisor at three exact moments where I could have stayed stuck for days.

Moment 1: The Architectural Decision — Kourier vs Istio

Problem: KServe stuck at READY=False. Error: "virtual service not found" — an Istio concept on a cluster running Kourier.

Copilot Chat — KServe virtual service not found. Copilot searches repo files, identifies Kourier/Istio mismatch, delivers 5-step fix with exact ConfigMap keys

Copilot Chat continued — exact kubectl commands: reapply serving-stack overlay, verify ConfigMap values, restart kserve-controller-manager and knative-serving controller

Copilot searched the actual repo files, confirmed the non-Istio setup was already correct, and identified the root cause: stale cached config. Critical tradeoff it surfaced — Istio adds ~1GB memory overhead; Kourier is under 200MB. On a shared 8GB dev node, Istio would have killed reproducibility.

Fix: Reapply the serving-stack overlay, verify disableIstioVirtualHost=true in ConfigMaps, restart control plane pods. Result: working inference, 800MB freed.

Moment 2: The Silent Bug — CI Guardrails That Can't False-Green

Problem: kyverno-cli apply looked green in CI. Then I tested with a deliberately non-compliant manifest. It still passed. The guardrail was checking nothing.

Copilot Chat —

Copilot Chat continued — delivers the per-resource fan-out pattern with dual exit-status capture and independent violation detection

Two undocumented kyverno-cli behaviors Copilot surfaced:

  1. A single --resource flag with multiple paths silently ignores every path after the first.
  2. Exit code is 0 even when violations are printed to stdout.

The fix (live in .github/workflows/guardrails-checks.yaml):

# Make the pipeline report docker's exit status rather than tee's;
# without pipefail the `if !` below only ever sees tee succeed.
set -o pipefail

# Per-resource fan-out: one file per --resource flag
mapfile -t app_files < <(find apps -type f \( -name '*.yaml' -o -name '*.yml' \) | sort)
failed=0

for resource in "${app_files[@]}"; do
  log="$(mktemp)"
  if ! docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
      apply infrastructure/kyverno/policies/*.yaml --resource "$resource" 2>&1 | tee "$log"; then
    failed=1
  fi
  # Dual check: exit code AND stdout, never trust one signal alone
  if grep -qiE 'denied|violat|fail|error' "$log"; then
    failed=1
  fi
  rm -f "$log"   # clean up the per-resource log
done

exit "$failed"


A guardrail that silently passes is worse than no guardrail. Every team using kyverno-cli in CI without this pattern has a potential false-green.
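The dual-check idea can be exercised without Docker or a cluster. The sketch below is an illustration, not the workflow's code: dual_check and the three stub commands are hypothetical, and the grep pattern is narrowed to 'denied|violat' for the stubs. It captures the exit status and the output as two independent signals and goes red if either one reports trouble.

```shell
# dual_check COMMAND...
# Runs COMMAND, echoes its output, and fails if the exit status is non-zero
# OR the output mentions a denial or violation.
dual_check() {
  log="$(mktemp)"
  rc=0
  "$@" >"$log" 2>&1 || rc=$?
  cat "$log"
  verdict=0
  [ "$rc" -eq 0 ] || verdict=1
  if grep -qiE 'denied|violat' "$log"; then
    verdict=1
  fi
  rm -f "$log"
  return "$verdict"
}

# Stub commands standing in for a policy CLI (hypothetical).
clean_run()      { echo "policy applied, nothing to report"; }
silent_violate() { echo "policy violation: missing label owner"; return 0; }
hard_failure()   { echo "could not load policy"; return 2; }

for probe in clean_run silent_violate hard_failure; do
  if dual_check "$probe" >/dev/null; then
    echo "$probe: green"
  else
    echo "$probe: red"
  fi
done
```

Writing to a log file instead of piping through tee sidesteps the question of whose exit status the pipeline reports.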

Moment 3: The Recurring Crash — CrashLoopBackOff and the Runbook

Problem: Backstage in CrashLoopBackOff — 7 restarts. Pod failing health checks it never properly configured.

I pasted the raw kubectl get pods -n backstage -w output directly into Copilot:

Copilot Chat — kubectl output showing CrashLoopBackOff 7 restarts. Copilot plans diagnosis, checks pod termination cause

Copilot Chat — root cause found: wrong Helm values nesting backstage.backend/backstage.appConfig → must be backstage.backstage.*. Startup probe missing, probes using aggressive defaults.

Root cause: Backstage is a dependency chart — app settings must be nested under backstage.backstage.*. Keys at the wrong level meant Helm silently used defaults, including failureThreshold: 3 with aggressive timings and no startupProbe. The container kept failing before the plugin system finished initializing.

ArgoCD hit the same pattern: Kyverno installation disrupted the repo-server's gRPC channel, which doesn't auto-reconnect — causing all apps to show Unknown. Copilot identified both as the same root pattern and helped write a deterministic recovery:

kubectl -n argocd rollout restart deploy/argocd-repo-server
kubectl -n argocd rollout status deploy/argocd-repo-server --timeout=120s
kubectl -n argocd get applications  # All Synced/Healthy


That became docs/runbook.md. The platform doesn't just work — it's recoverable. That's the difference between a demo and a real platform.


Before vs After

For Developers

| Before | After |
| --- | --- |
| Edit YAML by hand, kubectl apply, hope | Fill a Backstage form, review a PR, merge |
| No policy feedback until deployment fails | CI blocks non-compliant manifests before merge |
| No cost visibility | PR comment shows CPU/memory delta |
| Tribal knowledge required | Anyone can deploy |

For Operators

| Before | After |
| --- | --- |
| Manual cluster inspection for drift | ArgoCD self-heals in ~20 seconds |
| No runbooks: "ask the person who built it" | Documented recovery for every failure mode |
| No smoke tests | 21-check automated verification, any machine |
| No namespace governance | ResourceQuota + LimitRange enforced |

For the Platform

| Before | After |
| --- | --- |
| Abandoned since April 4th | Finished, documented, reproducible |
| Collection of broken parts | 6 milestones, 21 verified checks, 0 failures |
| Manual and error-prone | Automated and policy-enforced end-to-end |

What Made This Real: The Failures

This was not built on the happy path. Every milestone hit real failures:

| Milestone | Key Failure | What It Taught Me |
| --- | --- | --- |
| A: GitOps Spine | ArgoCD UnknownError: comparison engine couldn't run | Don't confuse UI status with root cause |
| B: KServe Serving | Istio/Kourier mismatch: undocumented KServe default | Always verify infrastructure defaults on constrained clusters |
| C: Golden Path | Backstage CrashLoopBackOff from Helm mis-nesting: probes silently ignored | CI must validate rendered manifests, not just source YAML |
| D: Guardrails | Kyverno webhook disrupts all ArgoCD apps during install window | Admission controllers need deployment ordering |
| E: Cost & CI | kyverno-cli false-green: exit 0 with actual violations | Dual-check exit code AND stdout, never trust one signal |
| F: Hardening | ApplicationSet replaced per-app files: requires skeleton alignment | Scaffolder templates must match GitOps discovery patterns |

Full failure log and recovery steps in docs/runbook.md.


Try It Yourself

# Clone and bootstrap (requires Docker + k3d + kubectl + helm)
git clone https://github.com/sodiq-code/neuroscale-platform.git
cd neuroscale-platform
bash scripts/bootstrap.sh     # ~5 minutes

# Verify everything works
bash scripts/smoke-test.sh    # 21 checks, 0 failures

# Open all UIs
bash scripts/port-forward-all.sh


After port-forward-all.sh:

  • Backstage at http://localhost:7010 — developer portal
  • ArgoCD at http://localhost:8080 — 7 synced applications
  • OpenCost at http://localhost:9090 — per-workload cost attribution

5 minutes from git clone to a fully working platform. The smoke test proves it all.


The Bottom Line

Copilot helped at the exact points where strong engineering judgment mattered most: an architectural tradeoff that saved 800MB of memory, a silent CI false-green that can catch anyone who trusts the kyverno-cli exit code alone, and operational recovery that turns a 2-hour outage into a 2-minute runbook.

21 checks. 0 failures. Reproducible on any machine.


What's one abandoned project you wish you had finished? Drop it in the comments.