Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Dayal Singh · 2026-05-21 · via cs.AI updates on arXiv.org


Abstract: Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). Transfer is achieved either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update Parameterization ($\mu$P), that renders the optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to the choice of parameterization. Next, because existing theory is inadequate, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP). We find that the overwhelming benefit of $\mu$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $\mu$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, whereas in the fixed token-per-parameter setting it hurts the robustness of the extrapolation.
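As an illustration of the scaling-law route to hyperparameter transfer described in the abstract, the following is a minimal sketch, not the paper's method. It assumes the optimal learning rate follows a power law in model width, fits that law by least squares in log-log space, reports R^2 as one possible proxy for fit quality, and extrapolates to a larger width. The function names (`fit_power_law`, `predict_lr`, `embedding_lr`), the `base_width` parameter, and the sample data are all hypothetical; the numbers are synthetic and are not results from the paper.

```python
import math


def fit_power_law(widths, lrs):
    """Fit lr = a * width**b by least squares on log(lr) vs log(width).

    Returns (a, b, r2), where r2 is the coefficient of determination
    in log-log space (an assumed proxy for scaling-law fit quality).
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                      # fitted exponent
    log_a = y_mean - b * x_mean        # fitted log-prefactor
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), b, r2


def predict_lr(a, b, width):
    """Extrapolate the fitted power law to a new width."""
    return a * width ** b


def embedding_lr(global_lr, width, base_width=1):
    """Hypothetical helper: raise the embedding-layer learning rate by a
    factor of width (relative to an assumed base_width) over the global
    learning rate, in the spirit of the SP adjustment the abstract describes.
    """
    return global_lr * width / base_width


if __name__ == "__main__":
    # Synthetic (width, optimal lr) pairs following lr proportional to 1/width.
    widths = [256, 512, 1024, 2048]
    lrs = [3e-3, 1.5e-3, 7.5e-4, 3.75e-4]
    a, b, r2 = fit_power_law(widths, lrs)
    target_lr = predict_lr(a, b, 8192)
    print(f"exponent={b:.3f}, R^2={r2:.4f}, lr at width 8192={target_lr:.3e}")
    print(f"embedding lr at width 8192={embedding_lr(target_lr, 8192, 256):.3e}")
```

In an actual training setup, a value such as the one returned by `embedding_lr` would typically be passed to the optimizer as a separate per-parameter-group learning rate for the embedding weights; how the paper implements this is not stated in the abstract.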
Comments: 10+28 pages, 5+17 figures
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as: arXiv:2605.21486 [cs.LG]
  (or arXiv:2605.21486v1 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2605.21486

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Dayal Singh Kalra
[v1] Wed, 20 May 2026 17:59:40 UTC (5,696 KB)