

















Abstract: Heaps' law (also known as Herdan's law) characterizes the relation between word types and word tokens by a power-law function, which is concave on a linear-linear scale but a straight line on a log-log scale. However, it has been observed that even on a log-log scale the type-token curve is still slightly concave, which invalidates the power-law relation. As a next-order approximation, we show, using twenty English novels or other writings (some translated into English from another language), that quadratic functions on a log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and a quadratic term consistently yield a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02. Using the "drawing colored balls at random from a bag with replacement" model, we show that the curvature of the type-token curve on a log-log scale is identical to a "pseudo-variance", which is negative. Although a pseudo-variance calculation may encounter numerical instability when the number of tokens is large, owing to the large values of the pseudo-weights, this formalism provides a rough estimate of the curvature when the number of tokens is small.
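The two ingredients of the abstract, the ball-drawing model and the quadratic fit on a log-log scale, can be illustrated with a small sketch. It is not the paper's code or data. It assumes a hypothetical vocabulary of 5000 word types with Zipf-like probabilities (an assumption made only for illustration), computes the expected number of distinct types after N draws with replacement as the sum over types of 1 - (1 - p_i)^N, and fits log10(type) = a + b*log10(token) + c*log10(token)^2 by ordinary least squares. The helper names `expected_types` and `fit_quadratic` are invented for this example, and the pseudo-variance formalism itself is not reproduced here.

```python
import math


def expected_types(probs, n_tokens):
    """Expected number of distinct colors seen after n_tokens draws
    with replacement, given the probability of each color."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the 3x3 normal equations."""
    # Power sums of x, and moments of y against powers of x.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix of the normal equations.
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Assumption: a hypothetical Zipf-like vocabulary, p_r proportional to 1/r.
vocab_size = 5000
weights = [1.0 / r for r in range(1, vocab_size + 1)]
total = sum(weights)
probs = [w / total for w in weights]

# Token counts roughly evenly spaced on a log scale.
token_counts = [10, 30, 100, 300, 1000, 3000, 10000]
type_counts = [expected_types(probs, n) for n in token_counts]

xs = [math.log10(n) for n in token_counts]
ys = [math.log10(v) for v in type_counts]
a, b, c = fit_quadratic(xs, ys)

print(f"linear coefficient b = {b:.4f}")
print(f"quadratic coefficient c = {c:.4f}")
```

Under these assumptions the fitted quadratic coefficient `c` comes out negative, which matches the sign of the curvature described in the abstract. Its magnitude depends on the assumed vocabulary and token range, so it should not be compared directly with the value of about -0.02 reported for the twenty texts.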
| Comments: | 3 figures |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2511.14683 [cs.CL] (or arXiv:2511.14683v2 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2511.14683 (arXiv-issued DOI via DataCite) |
From: Wentian Li
[v1] Tue, 18 Nov 2025 17:22:00 UTC (1,650 KB)
[v2] Tue, 26 May 2026 03:38:45 UTC (1,654 KB)