






















Abstract:Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$\,pp classification, $-30.9$\,pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ${\sim}1$\,pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from \texttt{[CLS]} dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
| Comments: | 12 pages, 10 figures, to be published in CVPR 2026 HOW Vision Interpretability Workshop Proceedings |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| Cite as: | arXiv:2604.14433 [cs.CV] |
| (or arXiv:2604.14433v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2604.14433 arXiv-issued DOI via DataCite |
From: Felipe Parodi [view email]
[v1]
Wed, 15 Apr 2026 21:24:42 UTC (2,944 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。