
























Abstract: Vision Transformers (ViTs) face severe computational bottlenecks due to the quadratic complexity of self-attention at high resolutions. Existing token reduction methods rely on local metrics, such as single-layer attention scores, that are inherently vulnerable to the attention sink phenomenon, where uninformative tokens are paradoxically preserved over salient foreground objects. We propose ASAP (Attention Sink Anchored Pruning), a training-free framework that recasts this sink as a feature. Modeling ViT information flow as a Lazy Random Walk, ASAP identifies the sink as a dominant accumulator of probability mass. By computing the diffusion distance to the sink within the cumulative transition matrix, ASAP partitions tokens via Radial Diffusion Clustering and compresses background redundancy through Transition Weight Pooling in a single shot. Extensive experiments across image, video, and vision-language tasks demonstrate that ASAP outperforms state-of-the-art methods, accelerating throughput by up to 48% while maintaining, or even exceeding, baseline accuracy.
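The pipeline the abstract outlines can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so every concrete choice here is an assumption made for illustration. Those assumptions are the laziness parameter `alpha`, the row-wise Euclidean distance used as a stand-in for diffusion distance, the rule that tokens nearest the sink are the ones pooled, the column-mass pooling weights, and the hypothetical function names `cumulative_transition`, `sink_distances`, and `radial_prune`.

```python
import numpy as np

def cumulative_transition(attn_layers, alpha=0.5):
    """Compose lazy-random-walk steps P_l = alpha*I + (1-alpha)*A_l over layers.

    Each A_l is assumed to be a row-stochastic (n, n) attention matrix;
    alpha is an assumed laziness parameter, not a value from the paper.
    """
    n = attn_layers[0].shape[0]
    total = np.eye(n)
    for attn in attn_layers:
        step = alpha * np.eye(n) + (1.0 - alpha) * attn
        total = step @ total  # later layers act on the left
    return total

def sink_distances(total):
    """Pick the sink as the column with the most accumulated mass and
    return each token's distance to it (assumed: L2 distance between rows)."""
    sink = int(np.argmax(total.sum(axis=0)))
    dist = np.linalg.norm(total - total[sink], axis=1)
    return sink, dist

def radial_prune(tokens, total, keep, n_rings=2):
    """Keep the `keep` tokens farthest from the sink; pool the rest ring by ring.

    Assumption: tokens close to the sink are treated as redundant background.
    Each ring is merged into one token, weighted by accumulated column mass.
    """
    sink, dist = sink_distances(total)
    order = np.argsort(-dist)            # farthest from the sink first
    kept_idx = np.sort(order[:keep])
    rest = order[keep:]
    mass = total.sum(axis=0)
    pooled = []
    for ring in np.array_split(rest, n_rings):
        if ring.size == 0:
            continue
        weights = mass[ring] / mass[ring].sum()
        pooled.append(weights @ tokens[ring])
    return np.vstack([tokens[kept_idx]] + pooled), kept_idx, sink

# Toy usage: 3 layers, 8 tokens, token 0 made into a synthetic sink.
rng = np.random.default_rng(0)
n, d, layers = 8, 4, 3
logits = rng.normal(size=(layers, n, n))
logits[:, :, 0] += 5.0                   # every token attends heavily to token 0
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True) # row-wise softmax
tokens = rng.normal(size=(n, d))

total = cumulative_transition(list(attn))
reduced, kept_idx, sink = radial_prune(tokens, total, keep=4, n_rings=2)
print(sink, kept_idx, reduced.shape)
```

In this toy setup the eight tokens are reduced to four kept tokens plus one pooled token per ring, computed in one pass from the cumulative matrix rather than layer by layer.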
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.22372 [cs.LG] (or arXiv:2605.22372v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.22372 (arXiv-issued DOI via DataCite, pending registration) |
From: Jaehyuk Lee
[v1] Thu, 21 May 2026 12:04:49 UTC (27,623 KB)