

















Abstract: Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to the identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q$, $XW_K$, and $XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3-small-style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with $12.5\%$ more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
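
The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The first part checks, for a single attention layer, that an invertible $W_Q$ can be moved into a change of basis $X' = XW_Q$, leaving an identity query projection; whether the upstream layers can absorb $X'$ is the part the paper's analysis covers and is simply assumed here. The second part builds $Q(X) = X + f_\theta(X)$ with a hypothetical bottleneck of width $r = d/2$, chosen only because it gives $d^2 + O(d)$ parameters. The GELU activation, the zero-initialised up-projection, and all function names are assumptions of this sketch.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU; the choice of activation is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    # single-head, unmasked scaled dot-product attention
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


def init_bottleneck(d, r, rng):
    # hypothetical f_theta: d -> r -> d; W_up starts at zero so Q(X) = X at init
    return {
        "W_down": rng.normal(0.0, d**-0.5, size=(d, r)),
        "b_down": np.zeros(r),
        "W_up": np.zeros((r, d)),
        "b_up": np.zeros(d),
    }


def residual_query(X, p):
    # Q(X) = X + f_theta(X)
    h = gelu(X @ p["W_down"] + p["b_down"])
    return X + h @ p["W_up"] + p["b_up"]


def num_params(p):
    return sum(v.size for v in p.values())


rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))

# Part 1: absorb an invertible W_Q into a change of basis.
W_Q = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # small perturbation of I, invertible
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))
out_ref = attention(X @ W_Q, X @ W_K, X @ W_V)

X_new = X @ W_Q                      # assumed to be absorbed by upstream layers
W_Q_inv = np.linalg.inv(W_Q)
out_identity_q = attention(X_new, X_new @ (W_Q_inv @ W_K), X_new @ (W_Q_inv @ W_V))
same_output = np.allclose(out_ref, out_identity_q)

# Part 2: nonlinear residual query with a width-d/2 bottleneck.
params = init_bottleneck(d, d // 2, rng)
Q_init = residual_query(X, params)   # equals X while W_up and b_up are zero
n_params = num_params(params)        # d*r + r + r*d + d with r = d/2
```

With $r = d/2$ the two weight matrices contribute $d^2$ parameters and the biases contribute $3d/2$, which matches the $d^2 + O(d)$ budget quoted above; any other width or initialisation would need its own accounting.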
| Comments: | Accepted at the ICLR 2026 GRaM workshop |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2603.13381 [cs.LG] (or arXiv:2603.13381v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2603.13381 (arXiv-issued DOI via DataCite) |
From: Marko Karbevski
[v1] Wed, 11 Mar 2026 03:13:10 UTC (70 KB)
[v2] Fri, 24 Apr 2026 15:48:35 UTC (62 KB)
[v3] Tue, 26 May 2026 02:11:34 UTC (68 KB)