

















Abstract: Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on four challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code at this https URL.
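The copy-or-generate idea in the abstract can be illustrated with a toy sketch. This is not the authors' ITC implementation: the abstract describes a structured assignment problem, whereas the code below makes a greedy, independent choice per token slot with no one-to-one constraint. The function name `copy_or_generate`, the `GENERATE` marker, and all score inputs are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

GENERATE = -1  # hypothetical marker: the slot was generated, not copied


def copy_or_generate(
    prev_tokens: List[int],
    copy_scores: List[List[float]],
    gen_scores: List[float],
    gen_tokens: List[int],
) -> Tuple[List[int], List[int]]:
    """Greedy toy decoding (an assumption, not the paper's procedure).

    copy_scores[i][j]: score for next-frame slot i copying prev_tokens[j].
    gen_scores[i]:     score for slot i being a newly generated token.
    gen_tokens[i]:     token to emit if slot i is generated.
    Returns the next-frame tokens and, per slot, the index of the copied
    previous-frame token or GENERATE.
    """
    next_tokens: List[int] = []
    correspondence: List[int] = []
    for i, row in enumerate(copy_scores):
        # With no previous tokens to copy from, generation is the only option.
        if not row:
            next_tokens.append(gen_tokens[i])
            correspondence.append(GENERATE)
            continue
        # Best copy candidate for this slot.
        best_j = max(range(len(row)), key=row.__getitem__)
        if row[best_j] >= gen_scores[i]:
            next_tokens.append(prev_tokens[best_j])
            correspondence.append(best_j)
        else:
            next_tokens.append(gen_tokens[i])
            correspondence.append(GENERATE)
    return next_tokens, correspondence


if __name__ == "__main__":
    tokens, corr = copy_or_generate(
        prev_tokens=[7, 3, 9],
        copy_scores=[[0.9, 0.1, 0.0], [0.2, 0.3, 0.1], [0.0, 0.1, 0.8]],
        gen_scores=[0.2, 0.6, 0.5],
        gen_tokens=[11, 12, 13],
    )
    print(tokens, corr)  # [7, 12, 9] [0, -1, 2]
```

In this example, slots 0 and 2 persist tokens from the previous frame, while slot 1 is filled with a new token because its generation score exceeds every copy score. A faithful treatment would replace the per-slot greedy rule with a joint assignment over all slots.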
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.16457 [cs.LG] (or arXiv:2605.16457v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.16457 (arXiv-issued DOI via DataCite) |
From: Youngin Kim
[v1] Fri, 15 May 2026 05:58:58 UTC (1,722 KB)
[v2] Thu, 21 May 2026 00:53:36 UTC (1,675 KB)
[v3] Tue, 26 May 2026 03:24:15 UTC (1,675 KB)