

















Abstract: Authorship attribution asks whether two pieces of text share a writer, but topical confounds make the task deceptively easy: two authors covering the same topic may look more alike than one author covering two topics. Scholarly prose offers a natural remedy: academic writers produce multiple papers on related but distinct topics while maintaining consistent stylistic habits. We introduce HALvest, a 17-billion-token multilingual corpus of open-access academic papers, and its English contrastive derivative HALvest-Contrastive, where same-author passages are drawn from distinct papers within a disciplinary field to minimize topical overlap. We validate our benchmark by showing that a strong lexical baseline collapses once topical shortcuts are removed. On this same benchmark, we revisit how authorship is scored. Standard systems compress each document into a single vector. We instead keep a sequence of vectors and compare them with late interaction, then propose patch-level late interaction, which groups neighboring tokens into patches before matching. Matching at the sequence level greatly improves performance over the single-vector baseline, but the optimal interaction granularity is subtle.
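The three scoring schemes contrasted in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact scoring functions, so everything below is an assumption: single-vector scoring is taken to be cosine similarity of mean-pooled token embeddings, late interaction is taken to be a ColBERT-style MaxSim averaged over the first sequence, and patches are formed by mean-pooling consecutive tokens. All function names and the `patch_size` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def single_vector_score(a, b):
    """Assumed baseline: mean-pool each (tokens, dim) array, then take the cosine."""
    return float(l2_normalize(a.mean(axis=0)) @ l2_normalize(b.mean(axis=0)))


def late_interaction_score(a, b):
    """Assumed MaxSim: best cosine match in b for each vector in a, averaged."""
    sim = l2_normalize(a) @ l2_normalize(b).T  # shape (len(a), len(b))
    return float(sim.max(axis=1).mean())


def to_patches(x, patch_size):
    """Mean-pool consecutive groups of patch_size token vectors (assumed pooling).

    The last patch may contain fewer than patch_size tokens.
    """
    return np.stack(
        [x[i:i + patch_size].mean(axis=0) for i in range(0, x.shape[0], patch_size)]
    )


def patch_late_interaction_score(a, b, patch_size):
    """Late interaction applied to patch vectors instead of token vectors."""
    return late_interaction_score(to_patches(a, patch_size), to_patches(b, patch_size))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_a = rng.normal(size=(10, 8))  # 10 token embeddings of dimension 8
    doc_b = rng.normal(size=(12, 8))
    print(single_vector_score(doc_a, doc_b))
    print(late_interaction_score(doc_a, doc_b))
    print(patch_late_interaction_score(doc_a, doc_b, patch_size=4))
```

Under these assumptions, `patch_size=1` reduces to token-level late interaction, while a patch spanning the whole sequence reduces to the single-vector baseline, which is one way to read the abstract's remark that the interaction granularity matters.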
| Comments: | 19 pages, 9 figures. Under review |
| Subjects: | Digital Libraries (cs.DL); Computation and Language (cs.CL) |
| Cite as: | arXiv:2407.20595 [cs.DL] |
| | (or arXiv:2407.20595v5 [cs.DL] for this version) |
| | https://doi.org/10.48550/arXiv.2407.20595 (arXiv-issued DOI via DataCite) |
From: Francis Kulumba
[v1] Tue, 30 Jul 2024 07:14:04 UTC (121 KB)
[v2] Thu, 27 Feb 2025 19:33:23 UTC (128 KB)
[v3] Wed, 26 Nov 2025 21:07:57 UTC (214 KB)
[v4] Tue, 19 May 2026 11:14:57 UTC (290 KB)
[v5] Mon, 25 May 2026 17:50:46 UTC (291 KB)