





















Authors: Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang
Abstract: Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can run inference at various quantization precisions depending on the available computational resources. Recent work on such any-precision quantization either relies on hardware-inefficient vector quantization or introduces additional scaling factors when switching between bit-widths. Meanwhile, existing post-training quantization (PTQ) methods calibrated for a fixed low precision generalize poorly when the precision changes at runtime. In this work, we attribute this poor generalization across bit-widths to a precision-dependent \textit{outlier migration} phenomenon, in which the distribution of PTQ-sensitive tokens changes across precisions. Motivated by this observation, we propose \texttt{MoBiQuant}, a novel any-precision Mixture-of-Bits quantization framework that adjusts weight precision for flexible LLM inference based on token sensitivity. Specifically, we propose a many-in-one recursive residual quantization scheme that iteratively reconstructs higher-precision weights at runtime, and we mitigate \textit{outlier migration} with a token-aware router that dynamically selects the optimal inference precision for each token. Experiments show that \texttt{MoBiQuant} matches or surpasses frontier single-precision PTQ while exhibiting strong elasticity, achieving significant memory savings and throughput gains of up to $1.34\times$ over state-of-the-art any-precision methods.
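The abstract gives no algorithmic detail, so the following is only a generic illustration of residual quantization in plain Python, not the \texttt{MoBiQuant} procedure. It shows how a stack of low-bit codes can be summed stage by stage to recover progressively more precise weights. The function names (`quantize_residual`, `reconstruct`, `max_error`), the symmetric uniform grid, and the choice to derive every stage's step size from a single stored base scale are all assumptions made for this sketch.

```python
def quantize_residual(weights, bits=2, stages=3):
    """Quantize `weights` into `stages` low-bit code lists (hypothetical helper).

    Each stage rounds the residual left by the previous stages onto a
    symmetric integer grid [-qmax, qmax]. Only one base scale is stored;
    later stages use a step size derived from it. Requires bits >= 2.
    """
    qmax = 2 ** (bits - 1) - 1
    # Fall back to 1.0 when all weights are zero to avoid dividing by zero.
    base_scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = []
    residual = list(weights)
    scale = base_scale
    for _ in range(stages):
        stage = [max(-qmax, min(qmax, round(r / scale))) for r in residual]
        residual = [r - q * scale for r, q in zip(residual, stage)]
        codes.append(stage)
        # The remaining residual is at most scale / 2, so shrink the grid
        # until qmax steps cover exactly that range.
        scale /= 2 * qmax
    return base_scale, codes


def reconstruct(base_scale, codes, bits=2, num_stages=None):
    """Sum the first `num_stages` stages; more stages means higher precision."""
    qmax = 2 ** (bits - 1) - 1
    used = codes if num_stages is None else codes[:num_stages]
    out = [0.0] * len(codes[0])
    scale = base_scale
    for stage in used:
        out = [o + q * scale for o, q in zip(out, stage)]
        scale /= 2 * qmax
    return out


def max_error(a, b):
    """Largest absolute element-wise difference between two lists."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    w = [0.9, -0.35, 0.12, -0.77, 0.5]
    base, codes = quantize_residual(w, bits=2, stages=3)
    for k in (1, 2, 3):
        approx = reconstruct(base, codes, bits=2, num_stages=k)
        print(k, "stage(s): max error =", max_error(w, approx))
```

In this toy setting, a runtime that needs lower latency could stop after the first stage, while a more sensitive input could use all stages. How the paper's router makes that choice per token is not described in the abstract and is not modeled here.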
| Comments: | 20 pages, 10 figures |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) |
| Cite as: | arXiv:2602.20191 [cs.LG] (or arXiv:2602.20191v2 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2602.20191 (arXiv-issued DOI via DataCite) |
From: Dongwei Wang
[v1] Sat, 21 Feb 2026 21:11:08 UTC (3,462 KB)
[v2] Mon, 25 May 2026 16:17:59 UTC (4,174 KB)