





















Abstract: Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions, limiting their ability to capture meaningful chemical substructure context. We introduce FragmentNet, a graph-to-sequence model built around a novel adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments of adjustable granularity, complemented by chemically aware spatial positional encodings that preserve molecular topology in the resulting sequence. Extending masked pre-training strategies from natural language processing to the molecular domain, we mask and reconstruct molecules at the level of chemically meaningful fragments rather than individual atoms. Evaluating across multiple property prediction benchmarks, we find that pre-training at fragment granularity leads to improved downstream performance on the majority of tasks, demonstrating that tokenization granularity is an important design choice for molecular representation learning.
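To make the idea of masking at fragment granularity concrete, here is a minimal sketch in plain Python. It is not the FragmentNet implementation: the function name `mask_fragments`, the `[MASK]` token, the masking ratio, and the example fragment strings are all assumptions chosen for illustration, and the learned tokenizer is replaced by a hand-written fragment list.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_fragments(fragments, mask_ratio=0.3, seed=0):
    """Mask whole fragments in a fragment sequence.

    `fragments` is a list of fragment tokens (here, SMILES-like strings).
    Returns (masked_sequence, targets), where `targets` maps each masked
    position to the original fragment the model would have to reconstruct.
    """
    if not fragments:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one fragment, and never more than the sequence holds.
    n_mask = min(len(fragments), max(1, round(mask_ratio * len(fragments))))
    positions = sorted(rng.sample(range(len(fragments)), n_mask))
    masked = list(fragments)
    targets = {}
    for pos in positions:
        targets[pos] = fragments[pos]
        masked[pos] = MASK
    return masked, targets


# Hand-written, illustrative decomposition; a learned tokenizer would
# produce the fragment list in practice.
fragments = ["CC(=O)O", "c1ccccc1", "C(=O)O"]
masked, targets = mask_fragments(fragments, mask_ratio=0.34, seed=0)
```

Each masked position hides an entire substructure, so the reconstruction target is a fragment rather than a single atom. That is the contrast with atom-level masking drawn in the abstract.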
| Comments: | 22 pages, 13 figures, 5 tables |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM) |
| Cite as: | arXiv:2502.01184 [cs.LG] |
| | (or arXiv:2502.01184v2 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2502.01184 (arXiv-issued DOI via DataCite) |
From: Ankur Samanta
[v1] Mon, 3 Feb 2025 09:21:49 UTC (24,068 KB)
[v2] Mon, 25 May 2026 05:20:26 UTC (11,891 KB)