More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
Mingze Wang, · 2026-05-27 · via cs.AI updates on arXiv.org


Abstract: Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
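The three FFN families named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact parameterization, so the activation dictionary (ReLU, GELU, SiLU), the softmax gate, the single gate vector shared across hidden units, and every function and variable name below are assumptions made for illustration. The sketch only shows the non-strict direction of the containments: a one-hot LA reduces to a fixed-activation FFN, and MoA with an input-independent gate reduces to LA.

```python
import math
import random

# Hypothetical activation dictionary; the abstract does not list the paper's choice.
def relu(z):
    return max(0.0, z)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    return z / (1.0 + math.exp(-z))

ACTIVATIONS = [relu, gelu, silu]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def ffn_fixed(x, W_in, W_out, act):
    """Fixed-activation FFN: the same nonlinearity for every token."""
    return matvec(W_out, [act(z) for z in matvec(W_in, x)])

def ffn_la(x, W_in, W_out, coeffs):
    """Learnable activation (LA): input-independent linear combination."""
    h = [sum(c * a(z) for c, a in zip(coeffs, ACTIVATIONS))
         for z in matvec(W_in, x)]
    return matvec(W_out, h)

def moa_gates(x, W_gate, b_gate):
    """Assumed gate: softmax over the dictionary, computed from the token."""
    logits = [g + b for g, b in zip(matvec(W_gate, x), b_gate)]
    return softmax(logits)

def ffn_moa(x, W_in, W_out, W_gate, b_gate):
    """Mixture of Activations (MoA): token-dependent mixing weights,
    with W_in and W_out shared across all activations."""
    gates = moa_gates(x, W_gate, b_gate)
    h = [sum(g * a(z) for g, a in zip(gates, ACTIVATIONS))
         for z in matvec(W_in, x)]
    return matvec(W_out, h)

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, m, k = 4, 8, len(ACTIVATIONS)   # model width, hidden width, dictionary size
W_in, W_out = rand_matrix(m, d), rand_matrix(d, m)
W_gate, b_gate = rand_matrix(k, d), [0.2, -0.1, 0.5]
x1 = [0.5, -1.0, 0.25, 2.0]        # two toy "tokens"
x2 = [-0.5, 1.0, 0.0, -2.0]

# The mixing weights depend on the token, unlike LA's fixed coefficients.
print("gates for x1:", moa_gates(x1, W_gate, b_gate))
print("gates for x2:", moa_gates(x2, W_gate, b_gate))

# With a zero gate matrix the gate ignores the input, so MoA collapses to LA.
zero_gate = [[0.0] * d for _ in range(k)]
moa_out = ffn_moa(x1, W_in, W_out, zero_gate, b_gate)
la_out = ffn_la(x1, W_in, W_out, softmax(b_gate))
print("MoA with constant gate matches LA:",
      all(abs(a - b) < 1e-12 for a, b in zip(moa_out, la_out)))
```

In this sketch the gate adds only `k * (d + 1)` parameters on top of the `2 * d * m` parameters in the shared projections, which is one way to read the abstract's "lightweight" gates; the actual overhead in the paper may differ.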
Comments: 31 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as: arXiv:2605.26647 [cs.LG]
  (or arXiv:2605.26647v1 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2605.26647

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Mingze Wang
[v1] Tue, 26 May 2026 07:30:53 UTC (6,072 KB)