
























Abstract: Deploying multiple models within shared GPU clusters is a key strategy to improve resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems improve GPU utilization at the cost of degraded inference performance, particularly time-to-first-token (TTFT). We attribute this degradation to the lack of awareness regarding future workload characteristics. In contrast, recent analyses have shown the strong periodicity and long-term predictability of real-world LLM serving workloads. In this paper, we propose one-for-many GPU prewarming, which proactively loads parameters from multiple models onto GPUs based on workload forecasts. These prewarmed weights enable the system to promptly instantiate serving instances upon encountering request bursts. We design and implement WarmServe, a multi-LLM serving system incorporating three key techniques: (1) a model placement algorithm that optimizes prewarming decisions to minimize cross-model prewarming interference, (2) a KV cache reservation strategy that repurposes idle KV cache space on running GPUs for prewarming new models, and (3) an efficient GPU memory switching mechanism for tensor management. Evaluation on real-world datasets shows that WarmServe reduces tail TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while supporting up to 2.5$\times$ higher request throughput than the GPU-sharing system.
| Comments: | Accepted at ICML 2026 |
| Subjects: | Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) |
| Cite as: | arXiv:2512.09472 [cs.DC] (or arXiv:2512.09472v2 [cs.DC] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2512.09472 (arXiv-issued DOI via DataCite) |
From: Chiheng Lou
[v1] Wed, 10 Dec 2025 09:47:40 UTC (689 KB)
[v2] Thu, 21 May 2026 07:26:42 UTC (554 KB)