AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System
Fengyao Bai · 2026-05-25 · via cs updates on arXiv.org

Abstract: High-throughput inference serving is essential for applications built on large language models (LLMs). Existing serving frameworks reduce request-level and batch-level bubbles through batching and scheduling, but often overlook bubbles within each decode iteration. Tokens generated in the same iteration may incur different costs because they depend on KV caches of different lengths; tokens with long KV caches can become bottlenecks and delay the next iteration. We propose AlignedServe, an LLM serving framework built around prefix-aware batching. It groups requests with similar KV-cache lengths into the same batch to reduce iteration-level bubbles. To support this policy efficiently, AlignedServe uses large CPU memory to maintain sufficient in-flight requests for batching and applies a batch-level scheduling policy to reduce batch-level bubbles. It also introduces a GPU-Prefetch-For-GPU architecture, where one GPU prefetches KV cache for another to reduce CPU-to-GPU transfer latency. Experiments on synthetic and application workloads show that AlignedServe improves decoding throughput by up to 1.98 times and reduces latency by up to 7.4 times over state-of-the-art systems.
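
The core idea in the abstract, grouping requests with similar KV-cache lengths so that no single long cache stalls a decode iteration, can be illustrated with a toy model. The sketch below is not the AlignedServe algorithm; it is a minimal illustration under stated assumptions. The names `Request`, `make_batches`, `length_aligned_batches`, and `iteration_bubble` are hypothetical, and the "bubble" is modeled simply as the gap between each request's KV-cache length and the longest one in its batch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    """Hypothetical in-flight request; not a type from the paper."""
    req_id: int
    kv_len: int  # tokens currently held in this request's KV cache


def make_batches(requests: List[Request], batch_size: int) -> List[List[Request]]:
    """Split requests into consecutive batches of at most batch_size (arrival order)."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def length_aligned_batches(requests: List[Request], batch_size: int) -> List[List[Request]]:
    """Sort by KV-cache length first, so each batch holds requests of similar length."""
    return make_batches(sorted(requests, key=lambda r: r.kv_len), batch_size)


def iteration_bubble(batch: List[Request]) -> int:
    """Assumed cost model: idle work in one decode step is the gap to the longest cache."""
    longest = max(r.kv_len for r in batch)
    return sum(longest - r.kv_len for r in batch)


def total_bubble(batches: List[List[Request]]) -> int:
    """Sum the modeled bubble over all batches."""
    return sum(iteration_bubble(b) for b in batches)


# Illustrative KV-cache lengths, mixing short and long contexts.
kv_lens = [4096, 128, 3900, 256, 4000, 200, 64, 3800]
requests = [Request(i, n) for i, n in enumerate(kv_lens)]

fifo = make_batches(requests, batch_size=4)
aligned = length_aligned_batches(requests, batch_size=4)

print("arrival-order bubble: ", total_bubble(fifo))
print("length-aligned bubble:", total_bubble(aligned))
```

In this toy model, arrival-order batching mixes short and long caches in every batch, while sorting by length first keeps the short requests together and the long requests together, which shrinks the modeled per-iteration gap. A real system would also have to account for arrival times, memory limits, and KV-cache transfer, which this sketch ignores.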
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2605.23389 [cs.DC]
  (or arXiv:2605.23389v1 [cs.DC] for this version)
  https://doi.org/10.48550/arXiv.2605.23389

arXiv-issued DOI via DataCite (pending registration)

Related DOI: https://doi.org/10.1145/3802009

Submission history

From: Fengyao Bai
[v1] Fri, 22 May 2026 09:00:45 UTC (4,594 KB)