























Abstract: Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or-nothing decision at token level. We study this fine-grained routing mechanism in two settings. For compression-aware language modeling, we introduce a query-independent design that combines routed subspace LoRA with value-group routing on the KV path. For downstream-task-preserving KV compression, we introduce a query-aware design in which a predictor-based selector allocates a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves the quality-compression tradeoff for language modeling, while the query-aware design preserves downstream behavior under reduced KV budgets. We further examine the relation between token-level and sub-token-level query-aware routing, and show that they form complementary compression axes: token-level methods determine which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally.
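To make the idea of a global retention budget over context-token/value-group pairs concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the function name `allocate_global_budget`, the `keep_fraction` parameter, and the random relevance scores are all assumptions for illustration. The paper's selector is a learned, query-conditioned predictor, which is replaced here by a plain score matrix.

```python
import numpy as np


def allocate_global_budget(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Keep the top-scoring (token, value-group) pairs under one global budget.

    scores: array of shape (num_tokens, num_groups); higher means more relevant.
            (Hypothetical stand-in for a query-conditioned predictor's output.)
    keep_fraction: fraction of all pairs to retain, in [0, 1].
    Returns a boolean mask with the same shape as `scores`.
    """
    num_pairs = scores.size
    budget = int(round(keep_fraction * num_pairs))
    budget = max(0, min(num_pairs, budget))

    flat_mask = np.zeros(num_pairs, dtype=bool)
    if budget > 0:
        # argpartition places the `budget` largest scores in the last
        # `budget` positions without fully sorting the array.
        split = num_pairs - budget
        top = np.argpartition(scores.ravel(), split)[split:]
        flat_mask[top] = True
    return flat_mask.reshape(scores.shape)


# Illustrative usage: 4 context tokens, 3 value groups each, 8 dims per group.
rng = np.random.default_rng(0)
scores = rng.random((4, 3))        # assumed relevance scores
values = rng.normal(size=(4, 3, 8))  # assumed value vectors split into groups

mask = allocate_global_budget(scores, keep_fraction=0.5)
compressed = values * mask[..., None]  # zero out dropped value groups

print(mask.sum(axis=1))  # number of groups kept per token; need not be uniform
```

Because the budget is applied over all pairs at once rather than per token, some tokens may keep all of their value groups while others keep few or none, which is the uneven within-token retention the abstract describes.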
| Comments: | 16 pages, 14 tables, 2 figures |
| Subjects: | Machine Learning (cs.LG); Computation and Language (cs.CL) |
| MSC classes: | 68W99, 68W40 |
| Cite as: | arXiv:2604.21335 [cs.LG] (or arXiv:2604.21335v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2604.21335 (arXiv-issued DOI via DataCite, pending registration) |
From: Wei Jiang
[v1] Thu, 23 Apr 2026 06:47:33 UTC (614 KB)