
























Abstract: Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $\Delta$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $\Delta$. As a function of the training time $\alpha$, the late-time solution yields an $\alpha^{-1/3}$ power law not only for the test loss but also for the generalization error $\epsilon_g$, i.e., one minus test accuracy. This is much slower than the $\alpha^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards an $\epsilon_g \sim \alpha^{-1/2}$ power law. Simulations support the predicted order-parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.
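To make the setting concrete, the following is a minimal sketch of an online teacher-student experiment with hard labels and softmax cross-entropy. It is not the authors' code. The linear teacher and student, the i.i.d. Gaussian inputs, the $1/\sqrt{N}$ logit scaling, the zero initialization, and all names and default values (`run_online_sgd`, `n_features`, `n_classes`, `lr`, `alpha_max`) are assumptions chosen for illustration; the paper's exact model may differ. Training time is measured as $\alpha = \text{steps}/N$, and the generalization error is one minus test accuracy on fresh samples.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def run_online_sgd(n_features=100, n_classes=3, lr=0.5,
                   alpha_max=50.0, n_test=2000, seed=0):
    """Online SGD on softmax cross-entropy with hard teacher labels.

    Assumed setup (illustrative only): linear teacher and student,
    standard Gaussian inputs, one fresh example per step, fixed
    learning rate. Returns (alphas, errors), where alpha = step / N
    and errors are generalization errors on a held-out test set.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(n_features)

    teacher = rng.standard_normal((n_classes, n_features))
    student = np.zeros((n_classes, n_features))

    # Fixed test set, labelled by the teacher's argmax (hard labels).
    x_test = rng.standard_normal((n_test, n_features))
    y_test = np.argmax(x_test @ teacher.T, axis=1)

    n_steps = int(alpha_max * n_features)
    # Log-spaced checkpoints so a power law shows up on a log-log plot.
    checkpoints = set(np.unique(np.geomspace(1, n_steps, 20).astype(int)).tolist())

    alphas, errors = [], []
    for step in range(1, n_steps + 1):
        x = rng.standard_normal(n_features)
        y = int(np.argmax(teacher @ x))        # hard label from the teacher

        # Gradient of cross-entropy w.r.t. logits is (softmax - one_hot).
        grad_logits = softmax(scale * (student @ x))
        grad_logits[y] -= 1.0
        student -= lr * scale * np.outer(grad_logits, x)

        if step in checkpoints:
            y_pred = np.argmax(x_test @ student.T, axis=1)
            alphas.append(step / n_features)
            errors.append(float(np.mean(y_pred != y_test)))

    return np.array(alphas), np.array(errors)


if __name__ == "__main__":
    alphas, errors = run_online_sgd()
    for a, e in zip(alphas, errors):
        print(f"alpha={a:8.2f}  eps_g={e:.3f}")
```

At these small sizes the curve is dominated by finite-size effects and transients, so the sketch only illustrates the protocol; checking an asymptotic exponent such as $\alpha^{-1/3}$ would require much larger `n_features` and `alpha_max` and averaging over seeds.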
| Comments: | 20 pages, 7 figures |
| Subjects: | Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn) |
| Cite as: | arXiv:2605.22341 [cs.LG] |
| (or arXiv:2605.22341v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2605.22341 arXiv-issued DOI via DataCite (pending registration) |
From: Marcel Kühn
[v1] Thu, 21 May 2026 11:26:32 UTC (553 KB)