


























I now use agentic coding on a daily basis, like many of us. I'm overall happy with my Claude Code experience, but I keep a close eye on the open-weight ecosystem1, which is also rapidly evolving: Qwen, Llama, Mistral, GLM, DeepSeek — and on the fierce competition between the various players in the space.
So I set myself a challenge: how could we host our own LLMs in our own infrastructure? Data sovereignty, model choice, vendor independence, and above all — the technical curiosity to see what's really feasible today with the modern tools at our disposal.
This article is a demonstration
The models deployed here are 7-8B parameters, served on a L4 24GB GPU. On complex agentic tasks, they fall well short of a Sonnet 4.6 or an Opus 4.7 — they hallucinate more, loop on errors where a frontier model breezes through.
The open-weight models that could truly compete (DeepSeek V4 Pro, Kimi K2.6) require hardware that's simply out of reach for personal use.
This article aims to lay the foundations of an alternative meant to evolve as the ecosystem catches up, and demonstrates what's already possible today — even with modest models.
The reference repo
The whole stack is deployed via GitOps from cloud-native-ref — beyond the LLM layer, the repo illustrates a complete cloud-native ecosystem:
Before diving into the heart of the matter, here's a bird's-eye view:

The stack is organized into three layers, which we'll walk through from the foundation up to the front door:
The whole thing is driven by GitOps (Flux reconciles everything from cloud-native-ref) and exposed privately via Tailscale — no data leaves the infrastructure. Each layer is detailed in the sections that follow.
InferenceService abstractionIf you've read some of my previous articles, you know I particularly like Crossplane for providing the right abstraction to end users. It's one of my essential components and lets me expose a simple, fit-for-purpose interface. I already have a few: App, SQLInstance, EPI — and now InferenceService.
Concretely, exposing a new LLM on the platform boils down to adding a YAML file like this:
1apiVersion: cloud.ogenki.io/v1alpha1
2kind: InferenceService
3metadata:
4 name: xplane-my-new-model
5 namespace: llm
6spec:
7 model:
8 repository: Qwen/Qwen3-8B
9 revision: <hf-commit-sha>
10 quantization: fp8
11 contextWindow: 32768
12 toolCallParser: hermes
13 preload:
14 enabled: true
15 gpu:
16 count: 1
17 routing:
18 tier: medium
19 specialty: general
20 scaling:
21 minReplicas: 1
22 maxReplicas: 2
A few seconds after the push, Flux reconciles, the Crossplane composition fires, and all required Kubernetes resources are applied automatically.
From this Claim, about a dozen Kubernetes resources are created. Beyond the usual suspects (Service, ServiceAccount, ExternalSecret), six deserve a closer look:
🧠 vLLM (Deployment) — the inference engine: continuous batching + paged attention deliver 3-10× higher throughput than naive serving on the same GPU. Image vllm/vllm-openai, nodeSelector: gpu-l4. (more on vLLM in Layer 2)
📦 Preloading Job — downloads the ~15 GB of HuggingFace weights to the S3 Files PVC. Idempotent (1× per repository@revision pair): vLLM pods (replicas, redeploys) then mount the volume in seconds.
🛡️ Two CiliumNetworkPolicy (zero-trust): a restrictive policy for the long-lived vLLM pod (outbound DNS + ingress AI Gateway/Iris/vmagent), a more permissive one scoped to the preloading Job (HF, AWS API, EKS Pod Identity Agent). Avoids granting the serving pod the broader permissions only needed during bootstrap.
⚡ KEDA ScaledObject — autoscaling triggered before load saturates a pod, rather than reacting to a queue forming. (the math is detailed in Layer 2)
📊 VMServiceScrape + VMRule — vLLM metrics scraped by VictoriaMetrics on /metrics, SLOs and alerts (cold-start budget, error rate, latency) shipped alongside the model. (detailed approach in the Observability series)
🚪 AIGatewayRoute — declares how the Envoy AI Gateway dispatches traffic to this model: model: xplane-qwen-coder in an OpenAI request lands on the right pod, with no application logic.
To switch from one model to another, you only change two fields (model.repository and model.revision). Here's the typical PR to move from the current Qwen2.5-Coder-7B to a Qwen3-Coder-30B the day it fits on the hardware:
1# apps/base/ai/llm/qwen-coder.yaml
2spec:
3 model:
4- repository: Qwen/Qwen2.5-Coder-7B-Instruct
5- revision: c03e6d358207e414f1eca0bb1891e29f1db0e242
6+ repository: Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
7+ revision: <matching-hash>
8 quantization: fp8
9 contextWindow: 32768
Four lines change. Flux reconciles, KEDA readjusts triggers, Karpenter provisions the right GPU — the rest of the wiring follows.
KCL: version, test, validate a composition
The composition isn't written as YAML patches (unreadable, untestable) but in KCL via the function-kcl — a typed configuration language with native assertions. Three direct consequences:
main_test.k file validates each behavior (kcl test runs in CI on every PR).oci://ghcr.io/smana/cloud-native-ref/crossplane-inference-service:0.6.0), referenced by immutable tag.kubectl apply rejects inconsistent claims (e.g. minReplicas > maxReplicas) before the composition is even triggered.Versions, tests, schema — not just "YAML we copy-paste".
Now let's walk the three layers of the diagram, starting at the base. Two structural decisions here: what hardware is financially accessible (and therefore which models can fit), and how their weights are stored and loaded on demand.
Over the last few months, open-weight model quality has progressed dramatically. On SWE-bench Verified2, the top two open-weight models are now DeepSeek V4 Pro (Max) (80.6%) and Kimi K2.6 (80.2%) — they hold their own against the proprietary leaders (Claude Opus 4.7 at 87.6%, GPT-5.5 at 88.7%). On the lighter side, Qwen2.5-Coder and Qwen3-Coder (Alibaba) remain references for code generation.
But between "being able to download the model from HuggingFace" and "running it on financially accessible GPUs", there's a world of difference. The best open-weight coding models demand multi-H100 or multi-H200 configurations, and even mid-tier models like Qwen3-Coder-30B-A3B-FP8 require at minimum an L40S 48GB.
The limiting factor here isn't the technical solution but cost. For this demo, I capped myself at modest models.
What would a DeepSeek V4 Pro or Kimi K2.6 cost?
For an order of magnitude: running a DeepSeek V4 Pro (1.6T params, ~49B active in MoE) or a Kimi K2.6 (~1T total, ~32B active) in FP8 typically demands 8 to 16× H100 80GB. Running 24/7 on AWS, the monthly bill lands somewhere between $40,000 and $80,000. Even on spot or reserved for a year, you stay comfortably above $15,000/month.
Personally, that's obviously out of reach. The demo stays on L4 24GB and picks models that fit.
Loading ~15GB of weights from HuggingFace at every pod cold-start is slow and expensive (egress bandwidth). The chosen solution: Amazon S3 Files, a brand-new AWS feature (GA in April 2026) exposing an S3 bucket as a POSIX-compliant NFS filesystem, mountable as a shared Kubernetes volume across pods.
The benefits in our case:
The initial fill is still slow
S3 Files isn't magic on the initial bootstrap: the preloading Job has to download the ~15GB of weights from HuggingFace a first time into S3. This step is bound by the instance's network bandwidth (~10 Gbps on g6.xlarge) and typically takes several tens of seconds. The upfront cost amortizes from there: every subsequent start (replicas, redeploys, other models sharing the bucket) mounts the volume in seconds.
Once the model landscape is laid out and the weights are accessible via S3 Files, we still need to serve them efficiently — a choice that directly affects latency, throughput, and cost.
vLLM is an open-source inference engine specialized in serving LLMs on GPU, and has become a de-facto standard. It's the component that actually runs the models in this stack, deployed via its Production Stack Helm chart. The features that matter here:
tools[] of agentic clients work._Continuous batching_ and _paged attention_, in two sentences
Continuous batching: a naive serving stack waits for a complete batch of requests before processing it (static batching). vLLM instead inserts each new request into the in-flight GPU batch — the GPU never sleeps between requests.
Paged attention: a naive serving stack reserves the KV cache (the memory accumulated per generated token, which dominates VRAM) as one contiguous block sized for the max length — a worst-case malloc, wasted on short requests. vLLM pages it just like an OS pages its virtual memory: fixed-size pages allocated on demand, with sequences of very different lengths cohabiting on the same GPU without fragmentation.
Concretely, each model is served by a vLLM Pod with this kind of config (generated by the Crossplane composition):
1servingEngineSpec:
2 modelSpec:
3 - name: "xplane-qwen-coder"
4 repository: "vllm/vllm-openai"
5 modelURL: "Qwen/Qwen2.5-Coder-7B-Instruct"
6 requestGPU: 1
7 vllmConfig:
8 enablePrefixCaching: true
9 enableChunkedPrefill: true
10 maxModelLen: 32768
11 dtype: "fp8"
12 maxNumSeqs: 32
13 extraArgs: ["--tool-call-parser", "hermes"]
What is quantization (fp8)?
A language model stores its weights as decimal numbers. In native precision (FP16 / BF16), each weight takes 16 bits (2 bytes) — precise, but VRAM-hungry.
Quantization consists in representing those same weights on fewer bits. We round the values with reduced numerical precision, which divides memory usage and speeds up computation. The trade-off: the model's outputs (probabilities over tokens) drift slightly from those of the unquantized model.
FP8 halves VRAM and is supported natively (at the hardware level) by recent GPUs: L4, L40S, H100, H200 — hence its widespread adoption for inference.
One flag in the YAML above deserves an extra word: enablePrefixCaching: true, crucial for FIM (Fill-In-the-Middle, the autocomplete in the IDE). During an autocomplete session, every request the IDE sends contains the same prefix (the code around the cursor); only the last few characters change as the developer types. With prefix caching, vLLM reuses the KV cache already computed for that shared prefix from one request to the next, instead of recomputing everything — that's what keeps p95 latency under 200 ms during intensive tab-completion.
The standard Kubernetes HPA is limited to CPU/memory — signals that don't reflect the actual load of a vLLM pod (the limiting factor is VRAM and the engine's internal batch, not host CPU). KEDA solves this by allowing scaling on any external signal: Prometheus, queue, business event, etc.
For each model, the KEDA ScaledObject relies on two Prometheus metrics exposed by vLLM:
vllm:num_requests_running — how many requests vLLM is processing in parallel, divided by the configured batch size (maxNumSeqs) to measure internal batch saturation.vllm:gpu_cache_usage_perc — pressure on the GPU KV cache, which climbs fast with long contexts. 1# Snippet from the Crossplane composition
2triggers:
3 - type: prometheus
4 metadata:
5 query: max(vllm:num_requests_running{model_name="xplane-qwen-coder"}) / scalar(vector(32))
6 threshold: "0.7" # 70% of the batch (maxNumSeqs) occupied
7 - type: prometheus
8 metadata:
9 query: max(vllm:gpu_cache_usage_perc{model_name="xplane-qwen-coder"})
10 threshold: "0.6" # 60% of the GPU KV cache
The goal is to anticipate: these indicators rise before a queue forms on the user side, which lets KEDA add a replica early enough to absorb the surge without degrading the latency of in-flight requests. That's the key difference between an autoscaler that reacts (queue grows, latency explodes) and one that anticipates.
KEDA scales vLLM replicas, but those replicas need an available GPU to be scheduled. Karpenter handles the layer below: when a pod can't find a free GPU node, Karpenter automatically provisions a GPU-capable instance (~60s cycle); and as soon as a node no longer hosts any GPU pod, it gets decommissioned.
What's left is the front door: how a request makes its way to the right vLLM pod, and who's allowed to send one.
Envoy AI Gateway is an open-source project built on top of Envoy Gateway, dedicated to managing traffic toward Generative AI services. It acts here as the single entry point to the platform. Its main features:
https://llm.priv.cloud.ogenki.io/v1/...) without knowing about individual vLLM pods.model field from the OpenAI request body (or from an x-ai-eg-model header) and dispatches to the right Kubernetes AIServiceBackend (details in the next sub-section).SecurityPolicy checks each request against a list of known tokens.llmRequestCosts extracts input/output/total tokens from responses, a useful basis for rate limiting per tenant or per model (wired up but not enabled here).A single model doesn't do everything: coder for code, reasoner for math, guard for safety. The front door combines two complementary mechanisms, each used as needed:
model: xplane-qwen-coder) via the x-ai-eg-model header or the request body. Envoy AI Gateway dispatches natively, with negligible latency. That's what Continue (autocomplete) and OpenCode do when they know which model to use.MoM (Mixture of Models) and vLLM Semantic Router (Iris) analyzes the prompt to pick the right actual model (coder, reasoner, guard). Detailed in the next sub-section.With this setup, a client that knows what it wants incurs no extra latency, and a generic client (OpenWebUI) benefits from automatic smart routing.
Iris (vLLM Semantic Router) is the open-source project that implements the Mixture of Models (MoM) logic. Deployed as a sidecar next to the AI Gateway, it intercepts requests addressed to the virtual MoM model and dynamically picks the real model that will answer.
Under the hood, a compact classifier (~100M parameters, derived from mmBERT, served on CPU — so no pressure on GPU-pod VRAM) evaluates the prompt against several criteria: intent (code, reasoning, multilingual…), the presence of a possible jailbreak attempt, or personally identifiable information (PII) detection. Based on the verdict, Iris routes to the appropriate model — or applies a dedicated safety guardrail.
The benefit: clients hit a single endpoint (MoM) and get a per-prompt routing without coding their own selection logic. The trade-off: ~250-300 ms of classification, incurred only on requests routed to MoM. That's acceptable for chat — the overall TTFT stays imperceptible to a human — but a deal-breaker for IDE autocomplete, which has to stay under 200 ms p95: that's precisely why Continue hits the coder pod directly via explicit routing.
A platform without clients is just plumbing. No need to walk through each config in detail here — the point is to show that we have credible open-source alternatives to proprietary tools: web chat, IDE autocomplete, CLI agent.
OpenWebUI exposes a standard web chat UI on top of our OpenAI-compatible API. In the models dropdown, you find the platform's pods plus the virtual MoM model (Iris then picks the routing by reading the prompt). Use case: exploratory chat, quick tests, non-developer access — exactly what you'd do on chat.openai.com or gemini.google.com.
Continue plugs the API into VSCode (or JetBrains). The killer feature here is FIM (Fill-In-the-Middle): autocomplete under 200ms p95 thanks to the dedicated always-warm coder pod and vLLM's prefix cache. That's the difference between a snappy autocomplete and a frustrating one — the equivalent of a Cursor Tab or a Copilot, but on our own infra.
I just discovered OpenCode — the CLI agent that comes closest to the Claude Code experience, and therefore the only serious candidate to consider for a migration the day it becomes necessary. It ships an explicit compatibility shim — AGENTS.md ↔ CLAUDE.md, skills, MCPs, sub-agents, slash commands — so that the entire workflow built around Claude Code carries over directly. That's what sets it apart from Aider, Crush, or Continue's agent mode.
An LLM platform produces signals on several axes at once — serving health, per-tenant token consumption, response quality — and each has its own indicators (TTFT, inter-token latency, prefix cache hit, token usage per operation…) that classic web monitoring doesn't capture. Good news: the existing observability stack (VictoriaMetrics, VictoriaLogs, Grafana) absorbs all of this with no new tooling — it was just a matter of wiring up the right sources. And the stake goes beyond anomaly detection: understanding how the platform is used — who consumes what, on which models, at what cost — is just as important.
vLLM metricsvLLM natively exposes a full Prometheus endpoint, with each metric carrying a model_name label:
vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc (KEDA triggers come from here)vllm:time_to_first_token_seconds, vllm:inter_token_latency_seconds, vllm:e2e_request_latency_secondsvllm:prompt_tokens, vllm:generation_tokensvllm:prefix_cache_hits (essential to measure FIM efficiency)The LLM Platform Grafana dashboard (deployed via Grafana Operator from apps/base/ai/llm/grafana-dashboard.yaml) aggregates all of this per model:

In the screenshot, the stack is idle: 4 active models (one replica each), 0 in-flight requests, KV cache at 0.01% — the "always warm" state with no load.
The Envoy AI Gateway observes traffic one level up from vLLM — at the business level. Its metrics follow the OpenTelemetry Gen AI Semantic Conventions standard: it doesn't count requests, it instruments the actual business vocabulary of LLMs:
| Metric | Measures |
|---|---|
gen_ai.client.token.usage | Tokens consumed (input / output / total) |
gen_ai.server.request.duration | End-to-end per-request latency |
gen_ai.server.time_to_first_token | TTFT at the gateway level |
gen_ai.server.time_per_output_token | Inter-token latency |
Each metric is automatically enriched with gen_ai.* labels (model, operation, provider) — you already know who consumes what in a single PromQL query. And you can inject arbitrary HTTP headers as labels (x-tenant-id, x-team…): from there, you can answer:
The usual SLOs / alerts (p95 latency, error rate, GPU saturation) stay defined as
VMRulenext to that — nothing LLM-specific, I cover it in the observability/alerting article.
PromptfooInfrastructure metrics tell you whether the platform works. They don't tell you whether it produces good answers — that's a completely orthogonal dimension, and exactly what Promptfoo brings to the table.
The idea is simple: you declare test cases that look like unit tests — an input prompt + assertions on the expected output. The assertion types cover the whole spectrum:
equals, contains, regex, is-json (with schema), javascript / python — perfect for validating structured output (tool-calling, formatted JSON)llm-rubric (an LLM grades the output against a qualitative rubric), factuality (adherence to provided facts), similar (embeddings + cosine), g-eval (chain-of-thought with custom criteria) — the same approach as proprietary evals (HELM, MT-Bench), but declarable in YAML in-houseIn the stack, this is packaged as a nightly CronJob (tooling/base/promptfoo/): the suite lives in a ConfigMap, the job runs against the models + the MoM routing, and results are pushed as Prometheus metrics — so they show up in the same Grafana as the technical metrics.
The concrete benefits:
vLLM version bump — a quality score that drops from 80 → 65% overnight is far more telling than a latency graphMoM routing degrade quality compared to a direct call to the right model?Qwen2.5-Coder chain before crashing?Continuous eval isn't about reassuring yourself in absolute terms, but about measuring evolution over time and avoiding silent regressions — that's what lets me quantify the claims I make in the conclusion.
We managed to build a simple way to serve open-weight models on Kubernetes: a YAML claim triggers all the necessary plumbing, swapping a model is a few-line change, and observability is in place end-to-end.
But let's be honest about user experience: without access to "serious" models (DeepSeek V4 Pro, Kimi K2.6, Qwen3-Coder-30B…), our tests mostly came down to checking that we got any response — not to measuring real production-grade quality. With the 7-8B models that fit on an L4, you sometimes feel like you're back in the stone age 😅.
The bright side is what's under the hood: the stack itself is solid and production-ready.
The experience is immediately relevant for organizations with a real data-sovereignty concern — healthcare, defense, finance, regulated industries. And with the most advanced open-weight models (DeepSeek V4 Pro, Kimi K2.6, within ten points of Opus on SWE-bench), the quality gap with proprietary models has become marginal. For data that can't leave the premises anyway, the question barely even comes up anymore.
For everyone else — myself first — the point is to lay the foundations for the moment the calculus really shifts. That moment depends on two fast-moving factors:
I deliberately kept these considerations out of the main article — and I take the benchmarks at face value, assuming they're unbiased. That said, two observations are worth putting down:
Let's be clear: today I wouldn't trade my Claude ecosystem. Mainly for financial reasons — not out of any particular attachment. At my usage scale, Sonnet and Opus cost me less than replicating equivalent quality through self-hosting.
That said, I would have liked to push my use of OpenCode further and migrate my Claude setup (skills, MCPs, sub-agents) onto that backend for good — that may be the topic of a future article dedicated to this open-source coding agent.
But I'm keeping the stack alive. The day a Qwen3-Coder-30B-A3B runs cleanly on a quantized L4 — a path documented in docs/llm-platform-future-paths.md — the swap will be a few-line PR. That's the main point of this demo: positioning yourself to move fast when the time comes, rather than scrambling to (re)build everything the day open-weight catches up to the frontier.
And this catch-up isn't only about models: the open-source serving layer evolves just as fast and regularly brings in capabilities previously reserved for proprietary solutions. For instance, vLLM-Omni (first stable late 2025) extends vLLM to omni-modality (text, image, audio, video, as inputs and outputs) with the same OpenAI-compatible API, so it plugs directly into the platform described here.
The kicker: this entire stack was designed and built with the help of Claude Code 🙃.
cloud-native-ref — The complete platformdocs/decisions/ — ADRs (vLLM Production Stack, S3 Files…)docs/llm-platform-future-paths.md — Evolution paths此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。