

















Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive components: one to determine when to stop candidate token generation and the other to decide token acceptance, updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 1.46x speedup over vanilla speculative decoding while limiting accuracy degradation to under 1.8%, making it a practical solution for efficient and adaptive LLM inference.
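The abstract names two signals, token entropy and Jensen-Shannon distance, without giving AdaSD's update rules. The sketch below only illustrates how those two quantities are computed for a pair of next-token distributions. The `RunningMean` helper is a hypothetical stand-in for a threshold that is updated during inference; it is an assumption made for illustration, not the paper's method.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    return sum(-x * math.log2(x) for x in p if x > 0)


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence).

    With base-2 logarithms the value lies in [0, 1].
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding error


class RunningMean:
    """Hypothetical adaptive threshold: the mean of all values seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean


# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
draft = [0.70, 0.20, 0.05, 0.05]
target = [0.60, 0.25, 0.10, 0.05]

draft_entropy = entropy(draft)         # draft model's uncertainty at this step
distance = js_distance(draft, target)  # disagreement between draft and target
```

In a decoder built along these lines, a high `draft_entropy` relative to its running threshold could signal that drafting should stop, and a low `distance` relative to its threshold could signal that a draft token is acceptable. Both rules are assumptions about how such signals might be used, since the abstract does not spell them out.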
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2512.11280 [cs.CL] |
| | (or arXiv:2512.11280v2 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2512.11280 (arXiv-issued DOI via DataCite) |
From: Kuan-Wei Lu
[v1] Fri, 12 Dec 2025 04:56:08 UTC (162 KB)
[v2] Tue, 26 May 2026 07:30:13 UTC (123 KB)