GitHub - sturnus-dev/sturnus: An OpenAI-compatible LLM proxy that flocks toward the fastest provider
dannyboland · 2026-06-22 · via Hacker News - Newest: "LLM"


Automatic latency-based routing across LLM providers. A single static binary, zero infrastructure.

LLM providers have variable latency and availability that can break production features. sturnus is a lightweight sidecar that sits beside your app, exposes an OpenAI-compatible API, and automatically shifts traffic to whichever provider is fastest and available right now.

Quick start

sturnus needs a config.toml — copy config.example.toml and add your providers.

Docker — best for production deployments and Kubernetes sidecars:

docker run -v ./config.toml:/config.toml \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest

cargo install — best for local testing if you have a Rust toolchain:

cargo install sturnus
sturnus --config config.toml

Prebuilt static binaries for Linux and macOS (x86_64 and aarch64) are attached to every release.

Then point any OpenAI-compatible SDK at sturnus — the only change is the base URL:

- client = OpenAI(base_url="https://api.openai.com/v1", api_key="sk-...")
+ client = OpenAI(base_url="http://127.0.0.1:4000/v1", api_key="unused")

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:4000/v1", api_key="unused")
response = client.chat.completions.create(
    model="fast",  # resolved by sturnus to the fastest available candidate
    messages=[{"role": "user", "content": "Hello"}],
)

Features

  • Latency- and error-aware routing — the fastest healthy provider gets the bulk of traffic, while slower or erroring ones keep a small, shrinking share. That share doubles as a probe, so a recovered provider wins its traffic back automatically, with no thresholds to trip.
  • Session affinity — a stateless x-session-affinity header pins follow-up requests to the same provider across pods.
  • Transparent passthrough — only the model field is rewritten: the request body is otherwise forwarded byte-for-byte, preserving key order, number precision, and formatting. Responses, including SSE text/event-stream chunks, are relayed untouched as they arrive.
  • Memory-bounded — request buffers are capped per request and in aggregate; bursts beyond the memory budget shed load with 429 + Retry-After instead of OOMing the pod.
  • Vertex AI support — GKE Workload Identity auth via the metadata server, with automatic token refresh.
  • Zero infrastructure — a single static binary; no Redis, database, or control plane.

Why sturnus

Most LLM gateways are either a hosted SaaS you route all your traffic (and keys) through, or a large application with a significant surface area. sturnus is the opposite — a single static binary with a small auditable surface area, MIT-licensed and running entirely inside your infrastructure. It speaks the OpenAI API, so any OpenAI-compatible SDK works by changing one base URL. The core capability of sturnus is automatic latency-based routing across providers — something that most gateways put behind an enterprise tier. Each sidecar routes independently from what it observes locally, so there is no shared state to run.

If you need a full LLMOps platform (spend tracking, prompt management, a UI, dozens of integrations), sturnus is not that.

Design choices & deliberate omissions

sturnus is deliberately narrow in scope. Two omissions are intentional:

  • No request-level failover or retries. sturnus is a transparent proxy: it surfaces upstream errors to the client verbatim rather than silently retrying within a black box. Error responses still feed the routing signal, so a flaky provider is quickly deprioritized for subsequent traffic — but the individual failed request is returned as-is. Client SDKs (OpenAI, Anthropic, LangChain, etc.) already ship mature, configurable retry and backoff; configure it there and let sturnus steer those retries toward the healthiest provider.
  • Latency-based, not cost- or quality-based. Routing optimizes time-to-first-chunk within an alias, and every model routed under that alias should be largely interchangeable. sturnus never trades quality or cost for speed; it only picks the fastest among options you've already deemed equivalent.

Contents

  • Configuration
  • Endpoints
  • Observability
  • Session affinity
  • How routing works
  • Docker
  • Building

Configuration

# use 127.0.0.1:4000 if running locally rather than in a container
listen = "0.0.0.0:4000"

# Providers: where to send requests
[provider.openai]
base_url = "https://api.openai.com/v1"
api_key = "${OPENAI_API_KEY}"

# Vertex AI via GKE Workload Identity (no API key needed)
[provider.vertex]
vertex_ai = { project_id = "my-gcp-project", location = "us-central1" }

# Model map: aliases the client uses → provider+model candidates
[model]
fast = [
  { provider = "openai", model = "gpt-4o-mini" },
  { provider = "vertex", model = "google/gemini-2.5-flash" },
]

[routing]
ewma_alpha = 0.3          # smoothing for the latency and success-rate EWMAs (higher = more reactive)
error_threshold = 0.5      # error-rate EWMA above which a session-affinity pin is broken (routing weights are unaffected)

See config.example.toml for all providers (Groq, Azure, Google AI Studio, Anthropic, local OpenAI-compatible) and options.

Environment variables written as ${VAR} are interpolated at config load time. If the values live in a .env file (one KEY=VALUE per line), pass it with --env-file:

sturnus --env-file /secrets/.env
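
To make the interpolation step concrete, here is a minimal Python sketch of ${VAR} substitution driven by a KEY=VALUE file. It is not sturnus's implementation (which is written in Rust): the function names are hypothetical, and both the handling of `#` comment lines and the choice to raise on an undefined variable are assumptions made for illustration.

```python
import re

# Matches ${NAME}, where NAME looks like a typical environment variable.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped (assumption)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def interpolate(config_text: str, env: dict) -> str:
    """Replace every ${VAR} with its value; an undefined variable raises (assumption)."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"undefined variable: {name}")
        return env[name]

    return _VAR.sub(replace, config_text)


if __name__ == "__main__":
    env = parse_env_file("# secrets\nOPENAI_API_KEY=sk-test\n")
    print(interpolate('api_key = "${OPENAI_API_KEY}"', env))
```

Failing loudly on an undefined variable is one reasonable choice for a proxy that holds credentials, since an empty API key would otherwise only surface later as upstream 401s.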

Vertex billing attribution

For Vertex providers, sturnus can inject sidecar-controlled labels into outbound requests so the resulting spend shows up tagged in GCP Billing Export. The labels live in a top-level [attribution] block (typically deployment identity sourced from env vars) and are merged into each request body for any Vertex provider that opts in:

[attribution]
service = "${SERVICE_NAME}"
owner = "${OWNER}"
env = "${ENV}"

[provider.vertex]
vertex_ai = { project_id = "my-project", location = "us-central1", attribution = true }

Sidecar keys take precedence over any keys of the same name in a client-supplied labels field; client keys that do not collide are preserved. The feature is currently scoped to Vertex only. Keys and values must conform to Vertex naming rules ([a-z][a-z0-9_-]{0,62}).
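
The precedence rule can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not sturnus's code: `merge_labels` and `is_valid_label` are hypothetical names, and rejecting an invalid sidecar label with an exception is an assumption.

```python
import re

# The naming rule quoted above, applied to keys and values alike.
LABEL_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")


def is_valid_label(text: str) -> bool:
    """True if the whole string matches the naming rule (1 to 63 characters)."""
    return LABEL_RE.fullmatch(text) is not None


def merge_labels(client_labels: dict, sidecar_labels: dict) -> dict:
    """Merge labels so that sidecar keys win and non-colliding client keys survive."""
    for key, value in sidecar_labels.items():
        if not (is_valid_label(key) and is_valid_label(value)):
            raise ValueError(f"invalid label: {key}={value}")
    merged = dict(client_labels)   # start from what the client sent
    merged.update(sidecar_labels)  # sidecar values overwrite collisions
    return merged


if __name__ == "__main__":
    client = {"env": "dev", "team": "search"}
    sidecar = {"env": "prod", "service": "chat"}
    print(merge_labels(client, sidecar))
```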

Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /v1/chat/completions | Proxied to upstream (model alias resolved) |
| POST | /v1/embeddings | Proxied to upstream (model alias resolved) |
| GET | /health | Returns {"status":"ok"} |
| GET | /status | Returns current streaming/non-streaming EWMAs, error rate, and status per candidate |
| GET | /metrics | Prometheus metrics (see below) |

Observability

Metrics

Prometheus metrics on /metrics, all labelled by alias, provider, model:

| Metric | Type | Meaning |
|--------|------|---------|
| sturnus_requests_total | counter | Completed responses, additionally labelled by status_code (includes upstream 4xx/5xx) |
| sturnus_ttfc_seconds | histogram | Streaming time-to-first-chunk (streaming requests only) |
| sturnus_latency_seconds | histogram | Non-streaming full response time (non-streaming requests only) |
| sturnus_errors_total | counter | Transport failures that never produced a response (timeout, connect, DNS) |
| sturnus_buffer_rejections_total | counter | Requests shed with 429 because the aggregate buffer budget was full (no per-alias labels) |

Connection-failure counters are zero-initialised at startup, so a missing series is never mistaken for "no errors".

Logging

Structured logging via tracing: coloured text on a terminal (respecting NO_COLOR), newline-delimited JSON when piped or redirected. Set the format with --log-format <auto|pretty|json> (or STURNUS_LOG_FORMAT) and the level with RUST_LOG (default sturnus=info).

Each request gets a span with a request_id; a client-supplied W3C traceparent propagates as trace_id and parent_span_id for cross-service correlation.
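
For readers unfamiliar with the header, a W3C traceparent value has four dash-separated lowercase-hex fields: version, trace ID (32 digits), parent span ID (16 digits), and flags. The Python sketch below shows how the two IDs can be pulled out. It illustrates the header format only; `parse_traceparent` is a hypothetical name, and accepting only the four-field layout is a simplification.

```python
import re
from typing import Optional, Tuple

# version - trace-id (32 hex) - parent-id (16 hex) - flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(header: str) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(header))
```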

Session affinity

Every response includes an x-session-affinity header (e.g. openai/gpt-4o-mini). Pass it back on subsequent requests to pin to the same provider — useful for multi-turn conversations where context is provider-specific:

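# The parsed ChatCompletion object does not carry HTTP headers;
# with_raw_response exposes them alongside the parsed body.
raw = client.chat.completions.with_raw_response.create(
    model="fast",
    messages=[{"role": "user", "content": "Hello"}],
)
affinity = raw.headers["x-session-affinity"]  # e.g. "openai/gpt-4o-mini"
response = raw.parse()  # the usual ChatCompletion object

# Echo the header back to pin the follow-up to the same provider.
response = client.chat.completions.create(
    model="fast",
    messages=[{"role": "user", "content": "Follow-up"}],
    extra_headers={"x-session-affinity": affinity},
)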

Fully stateless — works across pods with no shared state. The pin is honored until the pinned candidate's error-rate EWMA breaches error_threshold (at the default smoothing, roughly two consecutive errors), at which point the header is ignored and a new provider is selected — check the updated x-session-affinity in the response. Unknown or malformed headers fall back to normal routing.
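
The "roughly two consecutive errors" figure can be checked with a little arithmetic. The sketch below assumes the standard EWMA update (new = alpha * sample + (1 - alpha) * old), an error-rate EWMA that starts at 0 for a healthy candidate, and that "breach" means strictly greater than the threshold. All three are assumptions about sturnus's internals, and the function names are hypothetical.

```python
def ewma_update(current: float, sample: float, alpha: float) -> float:
    """One EWMA step: blend the new sample into the running average."""
    return alpha * sample + (1.0 - alpha) * current


def errors_until_pin_breaks(alpha: float, threshold: float, start: float = 0.0) -> int:
    """Count consecutive errors (sample = 1.0) until the EWMA exceeds the threshold."""
    if not (0.0 < alpha <= 1.0) or not (0.0 <= threshold < 1.0):
        raise ValueError("need 0 < alpha <= 1 and 0 <= threshold < 1")
    rate, count = start, 0
    while rate <= threshold:
        rate = ewma_update(rate, 1.0, alpha)
        count += 1
    return count


if __name__ == "__main__":
    # Defaults from the example config: ewma_alpha = 0.3, error_threshold = 0.5.
    print(errors_until_pin_breaks(alpha=0.3, threshold=0.5))
```

With the example defaults the error rate goes 0 → 0.30 → 0.51, so the second consecutive error crosses 0.5. A lower ewma_alpha makes pins stickier, since more errors are needed before the threshold is reached.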

How routing works

  1. Client sends POST /v1/chat/completions with "model": "fast".
  2. Sidecar looks up the fast alias and computes each candidate's effective latency: its latency EWMA divided by its success-rate EWMA. A candidate erroring with probability p needs ~1/(1-p) attempts per success, so errors inflate effective latency the same way slowness does.
  3. Each candidate is weighted by (best_effective / its_effective)^k, so the best gets the bulk of traffic and worse ones a shrinking-but-nonzero share. A deterministic low-discrepancy sequence (golden-ratio Weyl sequence) turns those weights into picks.
  4. Because worse candidates always keep a small share, their EWMAs stay fresh — a provider that recovers (faster responses or errors stopping) wins traffic back automatically; a cold candidate (no latency data yet) probes at a quarter of the best candidate's rate, scaled by its success rate, until its first samples land.
  5. The model field is rewritten to the real model name, auth headers are set, and the request is forwarded.
  6. TTFC is measured at first chunk arrival and fed back into the EWMA; the response status (any non-2xx counts as an error, including upstream 4xx) feeds the success-rate EWMA.

The best provider is exploited heavily while worse ones keep enough traffic to stay measured. A candidate's probe share shrinks with how bad it looks but is floored at 1%, so re-detecting a recovered provider costs at most ~100 requests — and during an outage at most ~1% of an alias's traffic is spent on the failing candidate.
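
The selection steps above can be sketched in Python as follows. This is an illustrative reconstruction, not sturnus's Rust implementation. The exponent k = 2, the placement of the 1% floor on the raw weight, the guard against a zero success rate, and all names are assumptions; the README does not state the value of k, and cold-candidate probing is left out.

```python
import math
from dataclasses import dataclass
from typing import List

# Fractional part of the golden ratio, about 0.618; the step of the Weyl sequence.
GOLDEN_RATIO_CONJUGATE = (math.sqrt(5.0) - 1.0) / 2.0


@dataclass
class Candidate:
    name: str
    latency_ewma: float  # seconds
    success_ewma: float  # fraction of successful requests, in (0, 1]


def effective_latency(c: Candidate) -> float:
    """Latency divided by success rate: errors inflate it just as slowness does."""
    return c.latency_ewma / max(c.success_ewma, 1e-9)  # guard is an assumption


def weights(cands: List[Candidate], k: float = 2.0, floor: float = 0.01) -> List[float]:
    """(best_effective / its_effective)^k per candidate, floored at 1%."""
    eff = [effective_latency(c) for c in cands]
    best = min(eff)
    return [max((best / e) ** k, floor) for e in eff]


def pick(cands: List[Candidate], n: int, k: float = 2.0, floor: float = 0.01) -> Candidate:
    """Choose a candidate for the n-th request using a golden-ratio Weyl sequence."""
    w = weights(cands, k, floor)
    u = (n * GOLDEN_RATIO_CONJUGATE) % 1.0  # evenly spread point in [0, 1)
    target = u * sum(w)
    acc = 0.0
    for cand, weight in zip(cands, w):
        acc += weight
        if target < acc:
            return cand
    return cands[-1]  # only reachable through floating-point rounding


if __name__ == "__main__":
    cands = [Candidate("openai", 0.2, 1.0), Candidate("vertex", 0.4, 0.5)]
    picks = [pick(cands, n).name for n in range(1000)]
    print({c.name: picks.count(c.name) for c in cands})
```

In this example the second candidate is twice as slow and fails half the time, so its effective latency is four times the best and, with k = 2, its raw weight is 1/16. A low-discrepancy sequence spreads those picks evenly across the request stream rather than in random clumps, which keeps the probe traffic steady.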

Docker

When running in Docker or as a Kubernetes sidecar, listen must be 0.0.0.0:4000 (the value in config.example.toml) — 127.0.0.1 only accepts connections from within the container itself.

On Kubernetes, run sturnus as a native sidecar: an init container with restartPolicy: Always (enabled by default since v1.29, stable since v1.33). It then starts before the app container and is terminated after it, so the proxy is ready for the app's first request and stays up while the app drains.

Memory needs no tuning: the aggregate request-buffer budget defaults to half the container's memory limit (read from cgroups at startup, logged with its source), so a small sidecar sheds excess load with 429s rather than getting OOM-killed. Override with routing.max_buffered_bytes if you want a different ceiling.
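
As a rough illustration of the default, the Python sketch below derives a budget from the contents of a cgroup v2 memory.max file, which holds either a byte count or the literal string max. The function names are hypothetical, and returning None when no limit is set is an assumption; the README does not say what sturnus does in that case.

```python
from typing import Optional


def parse_cgroup_limit(memory_max: str) -> Optional[int]:
    """Parse cgroup v2 memory.max contents: a byte count, or 'max' for unlimited."""
    value = memory_max.strip()
    if value == "max":
        return None
    return int(value)


def default_buffer_budget(limit_bytes: Optional[int]) -> Optional[int]:
    """Half the container memory limit, or None when there is no limit (assumption)."""
    if limit_bytes is None:
        return None
    return limit_bytes // 2


if __name__ == "__main__":
    # A 256 MiB container limit, as memory.max would report it.
    limit = parse_cgroup_limit("268435456\n")
    print(default_buffer_budget(limit))
```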

The image is published as a multi-arch (amd64/arm64) scratch container to ghcr.io/sturnus-dev/sturnus. Tags follow semver: :latest, :5.0, :5.0.0.

To inject secrets via a mounted .env file:

docker run -v ./config.toml:/config.toml \
  -v ./secrets.env:/secrets/.env:ro \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest --env-file /secrets/.env

Vertex credentials outside GKE

On GKE, workload identity is picked up automatically. Elsewhere, supply credentials one of two ways.

A service account key, pointed to by GOOGLE_APPLICATION_CREDENTIALS (recommended for production):

docker run -v ./config.toml:/config.toml \
  -v ./sa-key.json:/sa-key.json:ro \
  -e GOOGLE_APPLICATION_CREDENTIALS=/sa-key.json \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest

Or gcloud ADC for local dev, mounted to $HOME/.config/gcloud/ (the image sets HOME=/root):

docker run -v ./config.toml:/config.toml \
  -v ~/.config/gcloud/application_default_credentials.json:/root/.config/gcloud/application_default_credentials.json:ro \
  -p 4000:4000 \
  ghcr.io/sturnus-dev/sturnus:latest

Building

# Development
cargo build

# Release (static binary with LTO)
cargo build --release

# Run tests
cargo test

License

MIT