GitHub - nikitph/yieldos
loaderchips · 2026-05-25 · via Hacker News - Newest: "LLM"

YieldOS-Lite MVP Simulator

YieldOS-Lite: When Scheduling Is Not Enough — Resource Governance for Heterogeneous LLM Inference

YieldOS-Lite is a Phase 1 research artifact for asking one question:

When LLM inference workloads become heterogeneous, does a slow-path resource-governance control plane improve SLO-valid work over mechanistic schedulers such as continuous batching, chunked prefill, and prefill/decode disaggregation?

This repository contains the simulator, paper draft, generated figures, experiment summaries, replay traces, and tests used to explore that question. It is meant to be easy to read cold: start with this README, skim the paper, run the smoke tests, then reproduce or extend the trace-driven experiments.

📄 Read the paper: paper/yieldos_lite_resource_governance_paper.pdf

What This Is

YieldOS-Lite is a dependency-free trace simulator for LLM inference resource governance. It models control-plane choices: SLO urgency, KV-cache value, shape forecasts, policy cadence, and admission/dispatch decisions.

It is not a production serving engine. It does not implement CUDA kernels, PagedAttention, TensorRT-LLM, or a real vLLM scheduler. The goal is to test whether governance policies are promising before integrating with real engines.

The current takeaway is:

YieldOS-Lite is not a better queue; it is a better response to workload heterogeneity.

Where To Start

If you want to...                 Start here
Understand the research claim     paper/yieldos_lite_resource_governance_paper.pdf
Inspect the LaTeX source          paper/yieldos_lite_resource_governance_paper.tex
Run the simulator                 See Quick Start
Understand trace replay           docs/trace_format.md
See evaluation structure          docs/evaluation_outline.md
Inspect headline results          runs/*/summary.json, runs/*/summary.csv, and runs/*/report.md
Modify policies                   src/yieldos/policies.py
Modify the simulator loop         src/yieldos/simulator.py
Regenerate paper figures          scripts/generate_paper_figures.py

Repository Map

README.md                          repo orientation and result summary
pyproject.toml                     package metadata
src/yieldos/                       simulator, policies, workloads, metrics
tests/                             smoke tests
docs/                              trace schema and evaluation outline
paper/                             LaTeX paper, compiled PDF, generated figures
scripts/generate_paper_figures.py  figure generation from run summaries
runs/                              compact experiment outputs and replay traces

The committed runs/ files intentionally include only summaries, reports, and small replay traces. Large per-policy decision logs (*_decisions.jsonl) are excluded from the repository because they can grow the working directory to several gigabytes.

Core Concepts

  • Governed goodput: SLO-valid completed output tokens per simulated GPU-second, net of control-plane overhead.
  • SLO Notary: predictive SLO governance that turns future breach risk into present scheduling pressure.
  • KV Treasury: value-aware KV accounting that scores cache residency by expected utility, not raw hit count alone.
  • Shape forecast: advisory request-shape evidence. It is not treated as a validated hard-routing authority.
  • Policy snapshot: slow-path governance output consumed by a simple fast dispatch loop.
  • Obligation Heterogeneity Index (OHI): coarse diagnostic for when governance should help.
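
As a rough illustration of the governed-goodput definition above, the sketch below counts output tokens only for requests that completed within their SLO, then divides by simulated GPU-seconds plus control-plane overhead. The record fields (`output_tokens`, `completed`, `met_slo`) and the choice to charge overhead by adding it to the denominator are assumptions made for this example; the simulator's actual accounting lives under src/yieldos/ and may differ.

```python
from dataclasses import dataclass


@dataclass
class FinishedRequest:
    # Hypothetical record; field names are illustrative, not the simulator's schema.
    output_tokens: int
    completed: bool
    met_slo: bool


def governed_goodput(requests, gpu_seconds, control_overhead_seconds=0.0):
    """SLO-valid completed output tokens per GPU-second, net of control-plane overhead.

    Assumption: overhead is charged by adding it to the GPU-time denominator.
    """
    total_seconds = gpu_seconds + control_overhead_seconds
    if total_seconds <= 0:
        raise ValueError("total simulated time must be positive")
    valid_tokens = sum(
        r.output_tokens for r in requests if r.completed and r.met_slo
    )
    return valid_tokens / total_seconds


example = [
    FinishedRequest(output_tokens=200, completed=True, met_slo=True),
    FinishedRequest(output_tokens=500, completed=True, met_slo=False),   # SLO miss: not counted
    FinishedRequest(output_tokens=300, completed=False, met_slo=False),  # never finished: not counted
]
goodput = governed_goodput(example, gpu_seconds=1.5, control_overhead_seconds=0.5)
print(goodput)
```

Only the first request contributes tokens, so a policy that finishes more work but misses SLOs does not score higher under this metric.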

What To Believe

The evidence currently supports:

  1. Resource governance is a promising research direction for heterogeneous LLM inference workloads.
  2. YieldOS-Lite is runnable today as a simulator and trace-replay scaffold.
  3. Predictive SLO governance is the strongest validated primitive in this MVP.
  4. Value-aware KV accounting is promising under pressure, especially when evaluated by value preserved rather than raw hit rate.
  5. Shape classification should remain advisory until better calibrated.

The evidence does not yet support claims of:

  1. Production readiness.
  2. CUDA-level serving speedup.
  3. Direct replacement for vLLM, TensorRT-LLM, Sarathi-Serve, or DistServe.
  4. Production GPU utilization gains on real deployment traces.
  5. Universal dominance over disaggregated serving.

The simulator implements:

  • Synthetic heterogeneous trace generation.
  • vLLM-style continuous batching baseline.
  • Sarathi-style chunked-prefill baseline.
  • DistServe-style prefill/decode-disaggregated baseline.
  • YieldOS-Lite policy with:
    • probabilistic shape classification,
    • coarse value-aware KV Treasury,
    • predictive SLO Notary interventions,
    • governed-goodput metrics,
    • trace archive and decision logs.
  • MVP ablations from the paper.

Current Research Status

YieldOS-Lite is currently a Phase 1 control-plane simulator, not a production serving engine.

The current evidence supports:

  1. Predictive SLO governance is P1-ready.
  2. Value-aware KV accounting is promising under KV pressure, especially when evaluated by value preserved rather than raw hit rate.
  3. Shape classification should remain advisory; hard routing is not yet validated.
  4. The largest observed gains appear in heterogeneous workloads such as RAG-heavy, code-heavy, batch-summary-heavy, and mixed-enterprise traffic.

The current evidence does not yet support claims of:

  1. CUDA-level serving improvement.
  2. Direct vLLM/TensorRT-LLM replacement.
  3. Production GPU utilization gains on real traces.
  4. Shape classification as a validated routing authority.

Quick Start

Use Python 3.11+.

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .

Run a baseline comparison:

python -m yieldos.cli run --requests 800 --seed 7 --out runs/demo

Run the main experiment families:

python -m yieldos.cli ablate --requests 800 --seed 7 --out runs/ablations
python -m yieldos.cli workload-suite --requests 800 --seed 41 --out runs/workload_suite
python -m yieldos.cli replay --trace runs/workload_suite/traces/chat_heavy.csv --out runs/replay_chat

Run tests:

python -m unittest discover -s tests

The commands write summary.csv, summary.json, per-policy decision logs, and a human-readable report.md.

Trace replay format is documented in docs/trace_format.md.

Figure regeneration requires matplotlib:

python -m pip install matplotlib
python scripts/generate_paper_figures.py

Completed Runs

I ran the MVP comparison and ablation suite with --requests 800 --seed 7.

  • Baseline comparison: runs/final_compare_800/
  • MVP ablations: runs/final_ablations_800/

Headline comparison from the final run:

policy               governed goodput (tokens/GPU-second)  SLO attainment  TTFT p95    ITL p95
YieldOS-Lite         486.41                                0.274           30163.0 ms  325.0 ms
DistServe-style      401.63                                0.258           32658.0 ms  325.0 ms
Sarathi-style        126.64                                0.076           39016.6 ms  905.0 ms
Continuous batching  40.15                                 0.033           49318.0 ms  625.0 ms

The strongest ablation signal was SLO Notary removal, which dropped governed goodput from 486.41 to 337.82 tokens/GPU-second. On this seed, KV Treasury beat LRU on both governed goodput and recompute waste. Two control ablations (no_fallback_lane, no_online_reclass) slightly exceeded the full policy, which is useful evidence that the current fallback and reclassification heuristics need another tuning pass before they can be claimed as universally beneficial.

Next Experiments

I added and ran the follow-up experiments suggested by the first ablation:

  • Ablation-refined policy comparison: runs/next_policy_compare_800_v2/
  • KV-heavy stress: runs/next_kv_stress_800_v2/
  • Shape-classifier stress: runs/next_shape_stress_800_v2/
  • Policy tick sweep: runs/next_tick_sweep_800_v2/
  • Load regimes: runs/next_load_regimes_600_v2/

The ablation-refined policy changes are:

  • SLO Notary remains the authority for urgency.
  • KV Treasury remains value-aware.
  • Shape forecasts are soft scoring evidence, not hard lane assignment.
  • Fallback demotion is removed.
  • Online reclassification is hysteretic.
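
To make "hysteretic" concrete, here is a minimal sketch of one common way to add hysteresis to a classifier: a request keeps its current class unless a competing class leads by a margin. The function name, the class labels, and the 0.15 margin are all hypothetical; the README does not specify how the simulator implements its hysteresis.

```python
def reclassify(current, probs, margin=0.15):
    """Switch class only when the best class beats the current one by at least `margin`.

    `probs` maps class label -> estimated probability. The margin is a made-up
    illustrative value, not a tuned parameter from the simulator.
    """
    best = max(probs, key=probs.get)
    if best != current and probs[best] - probs.get(current, 0.0) >= margin:
        return best
    return current


# A narrow lead does not flip the class; a decisive lead does.
stays = reclassify("chat", {"chat": 0.45, "rag": 0.55})
flips = reclassify("chat", {"chat": 0.20, "rag": 0.80})
print(stays, flips)
```

The margin trades responsiveness for stability: a larger value suppresses flapping between lanes but reacts more slowly when a request's shape really has changed.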

Headline results:

experiment          best policy               governed goodput    main read
default comparison  YieldOS-Lite              507.88              The ablation-refined policy beats the initial policy by 4.4% and DistServe-style by 26.5%.
KV stress           YieldOS-Lite              29.82               KV pressure is severe; value-aware KV beats LRU by 6.0% goodput and 2.9% recompute waste.
shape stress        no classifier             163.19              Hard routing fails badly; even soft forecasts still underperform SLO-only governance on this stress trace.
tick sweep          100 ms                    522.12              75-100 ms is the current cadence knee; 5-10 ms lose to overhead, 200 ms goes stale.
load regimes        YieldOS-Lite / LRU split  906.61 at 60% load  YieldOS-Lite beats DistServe at every load; KV value helps most at 120-150% load but LRU is close below 100%.

The shape-stress result is the sharpest warning: the current classifier is still not P1-ready. The simulator now supports the right architectural test, and the result says the paper should claim predictive SLO governance first, KV Treasury second, and shape classification as an open calibration problem rather than a validated component.

Further Sweeps

I added a value-weighted KV metric and ran the next experiment set:

  • Load sweep: runs/further_load_sweep_800/
  • KV pressure sweep: runs/further_kv_pressure_sweep_800/
  • SLO tightness sweep: runs/further_slo_tightness_sweep_800/
  • Normal tick sweep: runs/further_tick_sweep_normal_800/
  • Stress tick sweep: runs/further_tick_sweep_stress_800/

New metric:

kv_value_preserved = saved_prefix_tokens * slo_pressure * tenant_priority * expected_future_use * sharing_potential

This is intentionally separate from raw KV hits. It is a coarse MVP proxy for value-weighted KV residency.
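
A minimal sketch of the metric as a plain product, to show why it can rank two cache hits differently even though each counts as one raw hit. The factor ranges used here (pressure and future use in [0, 1], priority as a small multiplier) are assumptions for illustration; only the product form comes from the formula above.

```python
def kv_value_preserved(saved_prefix_tokens, slo_pressure, tenant_priority,
                       expected_future_use, sharing_potential):
    """Product form of kv_value_preserved for a single cache hit."""
    return (saved_prefix_tokens * slo_pressure * tenant_priority
            * expected_future_use * sharing_potential)


# Two hypothetical hits that save the same 1000 prefix tokens.
relaxed_hit = kv_value_preserved(1000, 0.25, 1.0, 0.5, 1.0)
urgent_hit = kv_value_preserved(1000, 0.75, 2.0, 0.5, 1.0)
print(relaxed_hit, urgent_hit)
```

Both hits add one to a raw hit counter, but the urgent, higher-priority hit carries six times the preserved value in this example.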

Highlights:

sweep          result
load           YieldOS-Lite beats DistServe-style at every tested load: +22.7% at 50%, +37.1% at 70%, +31.5% at 90%, +33.4% at 110%, +30.1% at 130%, and +42.5% at 150%.
KV pressure    YieldOS-Lite beats DistServe-style under every KV pressure level. Against LRU, value-aware KV is mixed on raw goodput but preserves more KV value at medium/high/extreme pressure.
SLO tightness  SLO Notary improves over no-Notary by +6.5% loose, +9.5% normal, +9.1% tight, and +11.2% impossible on this seed.
normal tick    best cadence is 100 ms at 522.12 governed tokens/GPU-second; 75 ms and 50 ms are close, while 5-10 ms lose to overhead and 200 ms goes stale.
stress tick    best cadence shifts to 75 ms at 32.59 governed tokens/GPU-second; severe contention changes the knee.

The load sweep complicates the SLO story slightly: no-Notary is competitive at some individual load points, but the SLO tightness sweep shows the Notary consistently helps as SLO risk becomes the explicit experimental variable. The more careful claim is: SLO Notary is most useful when SLO pressure, not merely aggregate load, is the binding constraint.

Interpretation Discipline

This simulator is not a CUDA-level serving benchmark. It is a control-plane simulator. Its purpose is to test whether governance policies produce better allocation behavior before integrating with real engines.

The current results support three claims:

  1. Predictive SLO governance is P1-ready.
  2. Value-aware KV accounting is promising, especially under KV pressure, but should be evaluated using value-weighted metrics rather than raw hit rate alone.
  3. Shape classification is not yet validated as a hard routing primitive. In the current policy it is treated as soft evidence; further calibration is required before it can become routing authority.

Therefore, the MVP claim is not "full YieldOS works." The MVP claim is: a slow-path SLO-aware resource governance layer can improve governed goodput over mechanistic baselines in a trace simulator.

Trace Compatibility

YieldOS-Lite can replay CSV and JSONL traces with:

arrival_time_ms, prompt_tokens, output_tokens, tenant_id, priority_class, optional SLO fields, optional prefix_id, and optional abandoned_at_ms.

This makes the simulator ready for public or production trace validation when real traces are available.
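
For orientation, here is a small sketch of reading a CSV trace with the fields listed above. The column names come from that list; the Python types, the example values (such as `interactive` and `batch` for priority_class), and the `TraceRequest` and `parse_trace` names are assumptions. The optional SLO fields are left out because their names are not given here. docs/trace_format.md remains the authoritative schema.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceRequest:
    # Required fields from the README's list; types are assumed.
    arrival_time_ms: float
    prompt_tokens: int
    output_tokens: int
    tenant_id: str
    priority_class: str
    # Optional fields; empty cells become None.
    prefix_id: Optional[str] = None
    abandoned_at_ms: Optional[float] = None


def parse_trace(text):
    """Parse CSV trace text into TraceRequest rows."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        abandoned = raw.get("abandoned_at_ms")
        rows.append(TraceRequest(
            arrival_time_ms=float(raw["arrival_time_ms"]),
            prompt_tokens=int(raw["prompt_tokens"]),
            output_tokens=int(raw["output_tokens"]),
            tenant_id=raw["tenant_id"],
            priority_class=raw["priority_class"],
            prefix_id=raw.get("prefix_id") or None,
            abandoned_at_ms=float(abandoned) if abandoned else None,
        ))
    return rows


# Hypothetical two-row trace for illustration only.
SAMPLE = """arrival_time_ms,prompt_tokens,output_tokens,tenant_id,priority_class,prefix_id,abandoned_at_ms
0,512,128,tenant-a,interactive,sys-prompt-1,
250,4096,64,tenant-b,batch,,9000
"""

trace_rows = parse_trace(SAMPLE)
print(trace_rows[0])
```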

Semi-Realistic Profiles

I added a workload-suite runner for less synthetic, production-shaped traffic profiles:

  • chat_heavy
  • rag_heavy
  • code_heavy
  • batch_summary_heavy
  • mixed_enterprise

Canonical run: runs/trace_workload_suite_800_v2/

The runner also writes normalized replayable CSV traces under runs/trace_workload_suite_800_v2/traces/. Replaying the generated chat_heavy.csv trace produced the same YieldOS-Lite metrics as the workload-suite run, which verifies that experiment labels do not affect policy randomness.

Profile results for YieldOS-Lite vs DistServe-style:

profile              YieldOS-Lite  DistServe-style  improvement
chat-heavy           530.69        488.50           +8.6%
RAG-heavy            41.10         27.83            +47.6%
code-heavy           286.30        209.77           +36.5%
batch-summary-heavy  85.85         51.13            +67.9%
mixed-enterprise     172.82        77.90            +121.8%

This is still not external validation, but it is a cleaner bridge: YieldOS-Lite now has an explicit replay schema and a profile suite that better resembles chat, RAG, code, batch, and enterprise traffic mixes.

External Trace Pilot

I added a BurstGPT adapter and ran a small external-trace pilot using the public HPMLL/BurstGPT workload trace. BurstGPT provides request timestamps, request tokens, response tokens, model name, and log type; the adapter normalizes those fields into the YieldOS replay schema.

Canonical pilot runs:

  • runs/burstgpt_sample_800/ at 20x time scale
  • runs/burstgpt_sample_800_scale50/
  • runs/burstgpt_sample_800_scale100/
  • runs/burstgpt_sample_800_scale200/

This pilot is deliberately not folded into the main positive claim. The first 800 BurstGPT rows are almost entirely interactive, have no prefix reuse signal, and are relatively homogeneous compared with the mixed-enterprise profiles. In that regime, DistServe-style disaggregation is competitive or better:

time scale  best policy                       note
20x         DistServe-style, tied on goodput  very light contention; all governed-goodput values are effectively equal
50x         DistServe-style, tied on goodput  still mostly equal
100x        DistServe-style                   DistServe reaches 507.77 vs YieldOS-Lite at 491.15
200x        DistServe-style                   DistServe reaches 442.61 vs YieldOS-Lite at 367.68

This is a useful counterweight: YieldOS-Lite is not claiming universal dominance. The current simulator says governance helps most when workload heterogeneity, prefix reuse, tenant priority, and SLO pressure create competing obligations. When an external sample is nearly all interactive and has no prefix reuse, mechanistic disaggregation can be the better policy.

Obligation Heterogeneity

I added a coarse Obligation Heterogeneity Index (OHI) to every report:

OHI = (H(prompt_bucket) + H(output_bucket) + H(priority) + H(SLO) + prefix_reuse + KV_pressure) / 6

This is not a final metric, but it makes the current hypothesis measurable:

YieldOS-Lite is not a better queue; it is a better response to heterogeneity.
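
One way to read the OHI formula is sketched below. It assumes each H(...) term is a Shannon entropy normalized to [0, 1] by dividing by the log of the number of distinct labels observed, and that prefix_reuse and KV_pressure are already fractions in [0, 1]. Those normalization choices, the function names, and the bucket labels are assumptions; the reports in runs/ may compute the terms differently.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Shannon entropy of a label sequence, scaled to [0, 1].

    Assumption: divide by log(k), where k is the number of distinct labels seen.
    """
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


def obligation_heterogeneity_index(prompt_buckets, output_buckets, priorities,
                                   slo_classes, prefix_reuse, kv_pressure):
    """Average the six OHI terms; prefix_reuse and kv_pressure are assumed in [0, 1]."""
    terms = [
        normalized_entropy(prompt_buckets),
        normalized_entropy(output_buckets),
        normalized_entropy(priorities),
        normalized_entropy(slo_classes),
        prefix_reuse,
        kv_pressure,
    ]
    return sum(terms) / 6


# Hypothetical traces: one uniform, one evenly split on every categorical axis.
uniform = obligation_heterogeneity_index(
    ["short"] * 4, ["short"] * 4, ["hi"] * 4, ["tight"] * 4, 0.0, 0.0)
mixed = obligation_heterogeneity_index(
    ["short", "long"] * 2, ["short", "long"] * 2, ["hi", "lo"] * 2,
    ["tight", "loose"] * 2, 0.5, 0.5)
print(uniform, round(mixed, 3))
```

Under these assumptions a fully homogeneous trace scores 0 and the index rises toward 1 as request shapes, priorities, and SLO classes diversify and as prefix reuse and KV pressure increase.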

Canonical OHI analysis: runs/ohi_gain_analysis/

workload               OHI    YieldOS gain over DistServe-style
BurstGPT sample, 100x  0.315  -3.3%
chat-heavy             0.540  +8.6%
code-heavy             0.666  +36.5%
mixed-enterprise       0.711  +121.8%
batch-summary-heavy    0.722  +67.9%
RAG-heavy              0.763  +47.6%

Across this small pilot set, OHI and YieldOS gain have a Pearson correlation of 0.713. This is suggestive, not definitive. The useful scientific claim is: governance advantage appears to increase when requests impose heterogeneous obligations on prefill compute, decode bandwidth, KV residency, SLO slack, and tenant priority.
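
The correlation can be rechecked from the six rounded table rows with a few lines of standard-library Python. The `pearson` helper is written out here for illustration; it is not taken from the repository.

```python
import math

# (OHI, YieldOS gain over DistServe-style in percent), copied from the table above.
ROWS = {
    "BurstGPT sample, 100x": (0.315, -3.3),
    "chat-heavy": (0.540, 8.6),
    "code-heavy": (0.666, 36.5),
    "mixed-enterprise": (0.711, 121.8),
    "batch-summary-heavy": (0.722, 67.9),
    "RAG-heavy": (0.763, 47.6),
}


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ohi_values = [ohi for ohi, _ in ROWS.values()]
gains = [gain for _, gain in ROWS.values()]
r = pearson(ohi_values, gains)
print(f"{r:.3f}")  # prints 0.713 for the rounded table values
```

With only six points, one profile can move the coefficient substantially, which is why the result should be read as suggestive.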

Notes

The simulator is intentionally coarse. It is not a CUDA, vLLM, or TensorRT-LLM replacement. It models the control-plane questions in the draft:

  • Which requests should enter protected lanes?
  • Which KV entries are worth keeping under pressure?
  • When should the scheduler intervene before tail-latency collapse?
  • How much of YieldOS-Lite's gain comes from each component?

The hot path consumes only a precomputed policy snapshot. Pricing, lane updates, KV value scoring, and SLO prediction happen on the slow path at configurable cadences.
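
The split between the two paths can be sketched as follows: a slow-path function rebuilds an immutable snapshot at a fixed cadence, and the dispatch function only reads that snapshot. Everything named here (`PolicySnapshot`, `lane_weights`, the backlog-based weighting, the lane labels) is a hypothetical stand-in rather than the simulator's actual structure; the 100 ms cadence echoes the tick-sweep result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySnapshot:
    # Hypothetical snapshot contents: a version stamp and one weight per lane.
    version: int
    lane_weights: dict


def slow_path(version, queue_depths):
    """Stand-in for the expensive governance step: weight lanes by backlog share."""
    total = sum(queue_depths.values()) or 1
    weights = {lane: depth / total for lane, depth in queue_depths.items()}
    return PolicySnapshot(version=version, lane_weights=weights)


def fast_dispatch(snapshot, candidate_lanes):
    """Hot path: a lookup over precomputed weights, with no scoring or prediction."""
    return max(candidate_lanes, key=lambda lane: snapshot.lane_weights.get(lane, 0.0))


def run(times_ms, policy_tick_ms, queue_depths):
    """Dispatch at every time step; refresh the snapshot only on policy ticks."""
    snapshot = slow_path(0, queue_depths)
    picks = []
    for now in times_ms:
        if now > 0 and now % policy_tick_ms == 0:
            snapshot = slow_path(snapshot.version + 1, queue_depths)
        picks.append((now, snapshot.version, fast_dispatch(snapshot, list(queue_depths))))
    return picks


picks = run(range(0, 300, 50), policy_tick_ms=100,
            queue_depths={"interactive": 3, "batch": 9})
for entry in picks:
    print(entry)
```

Dispatch happens every 50 ms in this toy run while the snapshot version changes only every 100 ms, which is the separation the cadence sweeps are probing: ticking too often pays overhead, ticking too rarely leaves the hot path acting on stale weights.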