GitHub - sturnus-dev/sturnus: An OpenAI-compatible LLM proxy that flocks toward the fastest provider
dannyboland · 2026-06-22 · via Show HN

Automatic latency-based routing across LLM providers. A single static binary, zero infrastructure.

LLM providers have variable latency and availability that can break production features. sturnus is a lightweight sidecar that sits beside your app, exposes an OpenAI-compatible API, and automatically shifts traffic to whichever provider is fastest and available right now.

Quick start

sturnus needs a config.toml — copy config.example.toml and add your providers.

Docker — best for production deployments and Kubernetes sidecars:

docker run -v ./config.toml:/config.toml \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest

cargo install — best for local testing if you have a Rust toolchain:

cargo install sturnus
sturnus --config config.toml

Prebuilt static binaries for Linux and macOS (x86_64 and aarch64) are attached to every release.

Then point any OpenAI-compatible SDK at sturnus — the only change is the base URL:

- client = OpenAI(base_url="https://api.openai.com/v1", api_key="sk-...")
+ client = OpenAI(base_url="http://127.0.0.1:4000/v1", api_key="unused")

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:4000/v1", api_key="unused")
response = client.chat.completions.create(
    model="fast",  # resolved by sturnus to the fastest available candidate
    messages=[{"role": "user", "content": "Hello"}],
)

Features

  • Latency- and error-aware routing — the fastest healthy provider gets the bulk of traffic, while slower or erroring ones keep a small, shrinking share. That share doubles as a probe, so a recovered provider wins its traffic back automatically, with no thresholds to trip.
  • Session affinity — a stateless x-session-affinity header pins follow-up requests to the same provider across pods.
  • Transparent passthrough — only the model field is rewritten: the request body is otherwise forwarded byte-for-byte, preserving key order, number precision, and formatting. Responses, including SSE text/event-stream chunks, are relayed untouched as they arrive.
  • Memory-bounded — request buffers are capped per request and in aggregate; bursts beyond the memory budget shed load with 429 + Retry-After instead of OOMing the pod.
  • Vertex AI support — GKE Workload Identity auth via the metadata server, with automatic token refresh.
  • Zero infrastructure — a single static binary; no Redis, database, or control plane.

Why sturnus

Most LLM gateways are either a hosted SaaS you route all your traffic (and keys) through, or a large application with a significant surface area. sturnus is the opposite — a single static binary with a small auditable surface area, MIT-licensed and running entirely inside your infrastructure. It speaks the OpenAI API, so any OpenAI-compatible SDK works by changing one base URL. The core capability of sturnus is automatic latency-based routing across providers — something that most gateways put behind an enterprise tier. Each sidecar routes independently from what it observes locally, so there is no shared state to run.

If you need a full LLMOps platform (spend tracking, prompt management, a UI, dozens of integrations), sturnus is not that.

Design choices & deliberate omissions

sturnus has a bounded scope by design and has some deliberate omissions:

  • No request-level failover or retries. sturnus is a transparent proxy: it surfaces upstream errors to the client verbatim rather than silently retrying within a black box. Error responses still feed the routing signal, so a flaky provider is quickly deprioritized for subsequent traffic — but the individual failed request is returned as-is. Client SDKs (OpenAI, Anthropic, LangChain, etc.) already ship mature, configurable retry and backoff; configure it there and let sturnus steer those retries toward the healthiest provider.
  • Latency-based, not cost or quality-based. Routing optimizes time-to-first-chunk within an alias, and every model routed under that alias should be largely interchangeable. sturnus never trades quality or cost for speed — it just picks the fastest among options you've already deemed equivalent.
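Because sturnus returns a failed request as-is, retries belong in the client. The sketch below shows one way a client-side retry loop could look, including honouring the Retry-After header that accompanies a 429 from the memory budget. The `call_with_retries` helper, its `send` callable, and the lower-case header keys are hypothetical names for illustration and are not part of sturnus or any SDK.

```python
import time


def call_with_retries(send, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, hits a non-retryable status, or retries run out.

    send is a hypothetical callable returning (status_code, headers, body),
    where headers is a dict with lower-case keys.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        retryable = status == 429 or status >= 500
        if not retryable or attempt == max_retries:
            return status, headers, body
        try:
            # Prefer the server's hint when it is a plain number of seconds.
            delay = float(headers.get("retry-after"))
        except (TypeError, ValueError):
            # Header missing or not numeric: fall back to exponential backoff.
            delay = base_delay * (2 ** attempt)
        sleep(delay)
```

In practice the SDK's own retry setting is usually enough; the official OpenAI Python client, for example, accepts a max_retries argument. Each retry is a fresh request to sturnus, so it is routed using the updated error signal.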

Contents

  • Configuration
  • Endpoints
  • Observability
  • Session affinity
  • How routing works
  • Docker
  • Building

Configuration

# use 127.0.0.1:4000 if running locally rather than in a container
listen = "0.0.0.0:4000"

# Providers: where to send requests
[provider.openai]
base_url = "https://api.openai.com/v1"
api_key = "${OPENAI_API_KEY}"

# Vertex AI via GKE Workload Identity (no API key needed)
[provider.vertex]
vertex_ai = { project_id = "my-gcp-project", location = "us-central1" }

# Model map: aliases the client uses → provider+model candidates
[model]
fast = [
  { provider = "openai", model = "gpt-4o-mini" },
  { provider = "vertex", model = "google/gemini-2.5-flash" },
]

[routing]
ewma_alpha = 0.3          # smoothing for the latency and success-rate EWMAs (higher = more reactive)
error_threshold = 0.5      # error-rate EWMA above which a session-affinity pin is broken (routing weights are unaffected)

See config.example.toml for all providers (Groq, Azure, Google AI Studio, Anthropic, local OpenAI-compatible) and options.

Environment variables in ${VAR} syntax are interpolated at config load time. If the values live in a .env file (one KEY=VALUE per line), pass it with --env-file:

sturnus --env-file /secrets/.env

Vertex billing attribution

For Vertex providers, sturnus can inject sidecar-controlled labels into outbound requests so the resulting spend shows up tagged in GCP Billing Export. The labels live in a top-level [attribution] block (typically deployment identity sourced from env vars) and are merged into each request body for any Vertex provider that opts in:

[attribution]
service = "${SERVICE_NAME}"
owner = "${OWNER}"
env = "${ENV}"

[provider.vertex]
vertex_ai = { project_id = "my-project", location = "us-central1", attribution = true }

Sidecar keys take precedence over keys of the same name in the client-supplied labels field; client keys that do not collide are preserved. The feature is currently scoped to Vertex only. Keys and values must conform to Vertex naming rules ([a-z][a-z0-9_-]{0,62}).
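The precedence rule can be illustrated with a small sketch. It is not sturnus's implementation: `merge_labels` is a hypothetical helper, and raising ValueError on a non-conforming label is this sketch's choice, since the README does not say how invalid labels are handled.

```python
import re

# Naming rule quoted above: a lower-case letter, then up to 62 more characters.
LABEL_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")


def merge_labels(client_labels, sidecar_labels):
    """Merge labels so that sidecar keys win and disjoint client keys survive."""
    for key, value in sidecar_labels.items():
        if not LABEL_RE.fullmatch(key) or not LABEL_RE.fullmatch(value):
            raise ValueError(f"label {key!r}={value!r} violates the naming rule")
    # Later entries override earlier ones, so sidecar values take precedence.
    return {**client_labels, **sidecar_labels}


if __name__ == "__main__":
    client = {"env": "dev", "team": "search"}
    sidecar = {"env": "prod", "service": "checkout"}
    print(merge_labels(client, sidecar))
```

Here the client's env value is replaced by the sidecar's, while team passes through untouched.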

Endpoints

Method  Path                  Description
POST    /v1/chat/completions  Proxied to upstream (model alias resolved)
POST    /v1/embeddings        Proxied to upstream (model alias resolved)
GET     /health               Returns {"status":"ok"}
GET     /status               Returns current streaming/non-streaming EWMAs, error rate, and status per candidate
GET     /metrics              Prometheus metrics (see below)

Observability

Metrics

Prometheus metrics on /metrics, all labelled by alias, provider, model:

Metric                           Type       Meaning
sturnus_requests_total           counter    Completed responses, additionally labelled by status_code (includes upstream 4xx/5xx)
sturnus_ttfc_seconds             histogram  Streaming time-to-first-chunk (streaming requests only)
sturnus_latency_seconds          histogram  Non-streaming full response time (non-streaming requests only)
sturnus_errors_total             counter    Transport failures that never produced a response (timeout, connect, DNS)
sturnus_buffer_rejections_total  counter    Requests shed with 429 because the aggregate buffer budget was full (no per-alias labels)

Connection-failure counters are zero-initialised at startup, so a missing series is never mistaken for "no errors".

Logging

Structured logging via tracing: coloured text on a terminal (respecting NO_COLOR), newline-delimited JSON when piped or redirected. Set the format with --log-format <auto|pretty|json> (or STURNUS_LOG_FORMAT) and the level with RUST_LOG (default sturnus=info).

Each request gets a span with a request_id; a client-supplied W3C traceparent propagates as trace_id and parent_span_id for cross-service correlation.
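For callers that do not already run a tracing library, a traceparent value can be built by hand. The sketch below follows the W3C Trace Context layout (version, 32-hex trace id, 16-hex parent id, 2-hex flags); `make_traceparent` is a hypothetical helper name, not something sturnus provides.

```python
import secrets


def make_traceparent(sampled=True):
    """Build a W3C traceparent header value with random identifiers."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # lowest bit marks the trace as sampled
    return f"00-{trace_id}-{parent_id}-{flags}"


if __name__ == "__main__":
    print(make_traceparent())
```

With the OpenAI SDK the value can be attached per request, for example via extra_headers={"traceparent": make_traceparent()}.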

Session affinity

Every response includes an x-session-affinity header (e.g. openai/gpt-4o-mini). Pass it back on subsequent requests to pin to the same provider — useful for multi-turn conversations where context is provider-specific:

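# The parsed ChatCompletion object does not carry HTTP headers;
# with_raw_response exposes them alongside the parsed body.
raw = client.chat.completions.with_raw_response.create(
    model="fast",
    messages=[{"role": "user", "content": "Hello"}],
)
affinity = raw.headers["x-session-affinity"]  # e.g. "openai/gpt-4o-mini"
response = raw.parse()  # the usual ChatCompletion object

# Send the header back to stay on the same provider.
response = client.chat.completions.create(
    model="fast",
    messages=[{"role": "user", "content": "Follow-up"}],
    extra_headers={"x-session-affinity": affinity},
)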

Fully stateless — works across pods with no shared state. The pin is honored until the pinned candidate's error-rate EWMA breaches error_threshold (at the default smoothing, roughly two consecutive errors), at which point the header is ignored and a new provider is selected — check the updated x-session-affinity in the response. Unknown or malformed headers fall back to normal routing.
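The "roughly two consecutive errors" figure can be checked with a short calculation. The sketch assumes a standard EWMA update (new = alpha * sample + (1 - alpha) * old) and an error rate that starts at zero; both are assumptions, and `errors_until_pin_breaks` is a hypothetical helper.

```python
def errors_until_pin_breaks(alpha=0.3, threshold=0.5, start=0.0):
    """Count consecutive errors needed for the error-rate EWMA to exceed threshold."""
    if not 0.0 < alpha <= 1.0 or threshold >= 1.0:
        raise ValueError("need 0 < alpha <= 1 and threshold < 1")
    rate, count = start, 0
    while rate <= threshold:
        # Each error is a sample of 1.0 fed into the EWMA.
        rate = alpha * 1.0 + (1.0 - alpha) * rate
        count += 1
    return count


if __name__ == "__main__":
    print(errors_until_pin_breaks())           # defaults from the config above
    print(errors_until_pin_breaks(alpha=0.1))  # smoother EWMA, slower to break the pin
```

With ewma_alpha = 0.3 the rate goes 0.3 after the first error and 0.51 after the second, which crosses the default error_threshold of 0.5.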

How routing works

  1. Client sends POST /v1/chat/completions with "model": "fast".
  2. Sidecar looks up the fast alias and computes each candidate's effective latency: its latency EWMA divided by its success-rate EWMA. A candidate erroring with probability p needs ~1/(1-p) attempts per success, so errors inflate effective latency the same way slowness does.
  3. Each candidate is weighted by (best_effective / its_effective)^k, so the best gets the bulk of traffic and worse ones a shrinking-but-nonzero share. A deterministic low-discrepancy sequence (golden-ratio Weyl sequence) turns those weights into picks.
  4. Because worse candidates always keep a small share, their EWMAs stay fresh — a provider that recovers (faster responses or errors stopping) wins traffic back automatically; a cold candidate (no latency data yet) probes at a quarter of the best candidate's rate, scaled by its success rate, until its first samples land.
  5. The model field is rewritten to the real model name, auth headers are set, and the request is forwarded.
  6. TTFC is measured at first chunk arrival and fed back into the EWMA; the response status (any non-2xx counts as an error, including upstream 4xx) feeds the success-rate EWMA.

The best provider is exploited heavily while worse ones keep enough traffic to stay measured. A candidate's probe share shrinks with how bad it looks but is floored at 1%, so re-detecting a recovered provider costs at most ~100 requests — and during an outage at most ~1% of an alias's traffic is spent on the failing candidate.
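The selection steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the exponent k = 4 is made up (the README does not give a value), the 1% floor is applied to the raw weight before normalising (so the resulting share is only approximately 1%), and the names `effective_latency`, `shares`, and `WeylPicker` are hypothetical.

```python
import math

GOLDEN_FRAC = (math.sqrt(5) - 1) / 2  # fractional part of the golden ratio, about 0.618


def effective_latency(latency_ewma, success_ewma):
    """Step 2: latency EWMA divided by success-rate EWMA."""
    return latency_ewma / max(success_ewma, 1e-9)  # guard against dividing by zero


def shares(candidates, k=4.0, floor=0.01):
    """Map {name: (latency_ewma or None, success_ewma)} to normalised traffic shares."""
    eff = {
        name: effective_latency(lat, ok)
        for name, (lat, ok) in candidates.items()
        if lat is not None
    }
    best = min(eff.values(), default=None)
    raw = {}
    for name, (lat, ok) in candidates.items():
        if lat is None:
            weight = 0.25 * ok  # cold candidate: a quarter of the best's weight, scaled by success rate
        else:
            weight = (best / eff[name]) ** k  # step 3: the best candidate gets weight 1.0
        raw[name] = max(weight, floor)
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


class WeylPicker:
    """Deterministic picks from a golden-ratio Weyl sequence in [0, 1)."""

    def __init__(self):
        self.x = 0.0

    def pick(self, share_table):
        self.x = (self.x + GOLDEN_FRAC) % 1.0
        cumulative = 0.0
        for name, share in share_table.items():
            cumulative += share
            if self.x < cumulative:
                return name
        return name  # floating-point slack: fall back to the last candidate


if __name__ == "__main__":
    table = shares({"openai": (0.20, 1.0), "vertex": (0.40, 1.0)})
    picker = WeylPicker()
    picks = [picker.pick(table) for _ in range(1000)]
    print(table, picks.count("openai"))
```

A low-discrepancy sequence spreads picks evenly, so observed traffic tracks the target shares closely even over short windows, which a pseudo-random draw would not guarantee.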

Docker

When running in Docker or as a Kubernetes sidecar, listen must be 0.0.0.0:4000 (the value in config.example.toml) — 127.0.0.1 only accepts connections from within the container itself.

On Kubernetes, run sturnus as a native sidecar, that is, an init container with restartPolicy: Always (enabled by default since Kubernetes v1.29 and stable since v1.33). It then starts before the app container and is terminated after it, so the proxy is ready for the app's first request and stays up while the app drains.

Memory needs no tuning: the aggregate request-buffer budget defaults to half the container's memory limit (read from cgroups at startup, logged with its source), so a small sidecar sheds excess load with 429s rather than getting OOM-killed. Override with routing.max_buffered_bytes if you want a different ceiling.
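As a rough illustration of that default, the sketch below derives a budget from a cgroup v2 memory limit. It is not sturnus's code: the /sys/fs/cgroup/memory.max path applies to cgroup v2 only, and the fallback used when no limit is set is a made-up value, since the README does not say what happens in that case.

```python
from pathlib import Path

CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")  # cgroup v2 location


def parse_memory_max(text):
    """Return the limit in bytes, or None when the cgroup is unlimited ("max")."""
    text = text.strip()
    return None if text == "max" else int(text)


def default_buffer_budget(limit_bytes, fallback_bytes=256 * 1024 * 1024):
    """Half the memory limit; fallback_bytes is a hypothetical default for unlimited cgroups."""
    return limit_bytes // 2 if limit_bytes is not None else fallback_bytes


if __name__ == "__main__":
    if CGROUP_V2_LIMIT.exists():
        limit = parse_memory_max(CGROUP_V2_LIMIT.read_text())
        print(default_buffer_budget(limit))
```

A pod limited to 256 MiB would get a 128 MiB buffer budget under this rule.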

The image is published as a multi-arch (amd64/arm64) scratch container to ghcr.io/sturnus-dev/sturnus. Tags follow semver: :latest, :5.0, :5.0.0.

To inject secrets via a mounted .env file:

docker run -v ./config.toml:/config.toml \
  -v ./secrets.env:/secrets/.env:ro \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest --env-file /secrets/.env

Vertex credentials outside GKE

On GKE, workload identity is picked up automatically. Elsewhere, supply credentials in one of two ways.

A service account key, pointed to by GOOGLE_APPLICATION_CREDENTIALS (recommended for production):

docker run -v ./config.toml:/config.toml \
  -v ./sa-key.json:/sa-key.json:ro \
  -e GOOGLE_APPLICATION_CREDENTIALS=/sa-key.json \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest

Or gcloud ADC for local dev, mounted to $HOME/.config/gcloud/ (the image sets HOME=/root):

docker run -v ./config.toml:/config.toml \
  -v ~/.config/gcloud/application_default_credentials.json:/root/.config/gcloud/application_default_credentials.json:ro \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest

Building

# Development
cargo build

# Release (static binary with LTO)
cargo build --release

# Run tests
cargo test

License

MIT