DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors
[Submitted on 10 Jun 2026 (v1), last revised 16 Jun 2026 (this version, v2)] · 2026-06-11 · via cs.LG updates on arXiv.org

Abstract: High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns in model behavior. DeMix captures these error-specific patterns with influence vectors, which characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem in which a classifier is trained to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, so that the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: this https URL.
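
To make the pipeline described in the abstract concrete, here is a minimal sketch of its two stages: building one influence vector per training sample (one entry per validation sample), then fitting a multi-label classifier that maps influence vectors to error types. This is not the paper's implementation. The abstract does not say how influence is computed, so the sketch assumes a logistic-regression task model and substitutes a first-order gradient dot product as the influence score. All function names (`per_sample_grads`, `influence_vectors`, `fit_multilabel`, `predict_multilabel`) are hypothetical, the one-vs-rest logistic classifier is an illustrative choice, and the intervention-based learning strategy is not modeled.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def per_sample_grads(w, X, y):
    """Gradient of the logistic loss w.r.t. w, one row per sample."""
    residual = sigmoid(X @ w) - y      # shape (n,)
    return residual[:, None] * X       # shape (n, d)


def influence_vectors(w, X_train, y_train, X_val, y_val):
    """Row i is the influence vector of training sample i.

    Assumption: influence is approximated by a gradient dot product.
    A positive entry means a gradient step on the training sample would,
    to first order, lower the loss on that validation sample.
    """
    g_train = per_sample_grads(w, X_train, y_train)
    g_val = per_sample_grads(w, X_val, y_val)
    return g_train @ g_val.T           # shape (n_train, n_val)


def fit_multilabel(V, Y, lr=0.1, steps=500):
    """One-vs-rest logistic regression trained by gradient descent.

    V: (n, n_val) influence vectors.
    Y: (n, k) binary matrix, Y[i, t] = 1 if sample i has error type t
       (for example label error, feature error, spurious correlation).
    """
    W = np.zeros((V.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(steps):
        P = sigmoid(V @ W + b)
        W -= lr * (V.T @ (P - Y)) / len(V)
        b -= lr * (P - Y).mean(axis=0)
    return W, b


def predict_multilabel(V, W, b, threshold=0.5):
    """Return a binary (n, k) matrix of predicted error types."""
    return (sigmoid(V @ W + b) >= threshold).astype(int)
```

Because the prediction is multi-label, a single training sample can be flagged with several error types at once, or with none, in which case it is treated as clean.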

Submission history

From: Jiale Deng
[v1] Wed, 10 Jun 2026 03:28:17 UTC (5,069 KB)
[v2] Tue, 16 Jun 2026 04:07:41 UTC (3,265 KB)