

















Abstract: Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs substantial computation and memory overhead on GPUs, or on stochastic approaches that alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on pivot-based truncation and selection. Qrita applies pivot-based search to both Top-k and Top-p using two key techniques: (1) Gaussian-based sigma-truncation, which greatly reduces the search space of the vocabulary, and (2) quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We implement Qrita in Triton and evaluate it against the Top-k and Top-p kernels of high-performance LLM execution engines such as SGLang and FlashInfer. Qrita improves end-to-end serving throughput by up to 1.4 times with half the memory usage, while producing the same output as the sorting-based algorithms. Qrita is now the default Top-k and Top-p sampler for the GPU execution path of vLLM, and a ternary implementation of Qrita is available at this https URL.
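To make the idea of sort-free, pivot-based selection concrete, the following is a minimal CPU sketch in plain Python. It is not the Qrita kernel: it uses ordinary binary (two-way) pivot search rather than the quaternary search the abstract describes, and the names `pivot_threshold`, `top_k_threshold`, `top_p_threshold`, and the cutoff multiplier `z=2.0` are all assumptions made for illustration. The sigma-truncation step here is only a guess at the general shape of such a prefilter (keep values above mean + z·sigma, fall back to the full vocabulary if too few survive). Ties at the threshold are all kept, which is one simple way to stay deterministic.

```python
import random
import statistics


def pivot_threshold(values, weights, target):
    """Largest element t of `values` whose upper set {x >= t} has weight >= target.

    Plain bisection on the value range; no sorting is performed.
    """
    def mass_at_or_above(t):
        return sum(w for x, w in zip(values, weights) if x >= t)

    lo, hi = min(values), max(values)
    if lo == hi or mass_at_or_above(hi) >= target:
        return hi  # the maximum (and its duplicates) already reaches the target
    # Invariant: the answer lies in the half-open bracket [lo, hi).
    while True:
        bracket = [x for x in values if lo <= x < hi]
        if min(bracket) == max(bracket):
            return bracket[0]  # one distinct value left, duplicates included
        pivot = (lo + hi) / 2
        if mass_at_or_above(pivot) >= target:
            lo = pivot
        else:
            hi = pivot


def top_k_threshold(logits, k, z=2.0):
    """k-th largest logit, searched only among values above mean + z * sigma."""
    mu = statistics.fmean(logits)
    sigma = statistics.pstdev(logits)
    candidates = [x for x in logits if x >= mu + z * sigma]
    if len(candidates) < k:
        candidates = logits  # truncation was too aggressive; use everything
    return pivot_threshold(candidates, [1] * len(candidates), k)


def top_p_threshold(probs, p):
    """Smallest probability kept so that the kept tokens' total mass is >= p."""
    return pivot_threshold(probs, probs, p)


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    # The pivot search should agree with a sorting-based reference.
    assert top_k_threshold(logits, 50) == sorted(logits, reverse=True)[49]
    print(top_p_threshold([0.125, 0.5, 0.25, 0.125], 0.7))
```

Tokens whose value is at or above the returned threshold are the ones kept for sampling. Because the truncated candidate set contains only values larger than everything it excludes, the k-th largest candidate is also the k-th largest logit overall whenever at least k candidates survive.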
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2602.01518 [cs.AI] (or arXiv:2602.01518v2 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2602.01518 (arXiv-issued DOI via DataCite) |
From: Jongseok Park
[v1] Mon, 2 Feb 2026 01:19:28 UTC (5,083 KB)
[v2] Tue, 26 May 2026 07:25:54 UTC (2,390 KB)