

















Abstract: The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at this https URL.
| Comments: | Accepted at NLP4DH @ ACL 2026 |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.09156 [cs.CL] (or arXiv:2605.09156v2 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.09156 (arXiv-issued DOI via DataCite) |
From: Esteban Garces Arias
[v1] Sat, 9 May 2026 20:36:49 UTC (6,218 KB)
[v2] Tue, 26 May 2026 08:07:03 UTC (6,219 KB)