TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization
Lingling Fu · 2026-05-25 · via cs updates on arXiv.org


Abstract: Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous iteration is repeatedly used as the reference model for subsequent rounds, noise in preference data and errors in the reference model accumulate over time. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization.
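For orientation, the role of the reference model can be seen in the standard DPO objective. The notation below follows the commonly cited DPO formulation and is not taken from this abstract: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the reference model, $\beta$ a temperature, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In iterative DPO as described above, round $t$ sets $\pi_{\mathrm{ref}}$ to the policy produced in round $t-1$. Every log-ratio is measured against that single model, so any errors it carries feed directly into the next round's objective.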
To address these issues, we propose TPMM-DPO, a trajectory-aware preference-guided model merging method. The method treats the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrates them using learned fusion weights, thereby constructing a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability.
Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Further ablation studies and robustness analyses demonstrate that, compared with simple averaging, learnable-weight fusion more effectively alleviates late-stage performance degradation caused by noisy preferences.
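The merging idea can be illustrated with a minimal sketch. The abstract does not say how the fusion weights are parameterized or learned, so everything here is an assumption made for illustration: checkpoints are plain dictionaries of parameter lists, merging is a weighted average in parameter space, and the weights are a softmax over hypothetical `fusion_logits`. Setting all logits equal recovers simple averaging, the baseline the abstract compares against.

```python
import math


def softmax(logits):
    """Turn arbitrary real-valued logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_checkpoints(checkpoints, fusion_logits):
    """Weighted parameter-space average of a trajectory of checkpoints.

    ASSUMPTION: softmax-normalized weights and element-wise averaging are
    illustrative choices, not the paper's stated procedure.
    """
    if len(checkpoints) != len(fusion_logits):
        raise ValueError("need exactly one logit per checkpoint")
    weights = softmax(fusion_logits)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Toy trajectory: three "policy checkpoints" from successive DPO rounds.
trajectory = [
    {"w": [0.0, 1.0]},
    {"w": [1.0, 1.0]},
    {"w": [2.0, 4.0]},
]

# Equal logits give uniform weights, i.e. simple averaging.
uniform_ref = merge_checkpoints(trajectory, [0.0, 0.0, 0.0])

# Hypothetical learned logits that favor the most recent checkpoint.
weighted_ref = merge_checkpoints(trajectory, [0.0, 0.0, 2.0])
```

In a real pipeline the logits would be trained against some preference-based objective and the merged parameters would serve as the reference model for the next DPO round. The sketch leaves that training step out because the abstract does not describe it.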
Comments: 11 pages, 6 figures
Subjects: Information Retrieval (cs.IR)
Cite as: arXiv:2605.23398 [cs.IR]
  (or arXiv:2605.23398v1 [cs.IR] for this version)
  https://doi.org/10.48550/arXiv.2605.23398

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: LingLing Fu
[v1] Fri, 22 May 2026 09:11:20 UTC (709 KB)