Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries
Hinduja Nirujan · 2026-05-26 · via cs.AI updates on arXiv.org


Abstract: Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.
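To make the benchmark construction and the reported metric concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a summary is stored as a mapping from section name (S2R, AB, EB) to a list of sentences. The type labels `omission` and `fabrication` are hypothetical names chosen to mirror the "missing information" and "fabricated content" categories from the exploratory study; the abstract does not list the paper's actual taxonomy or injection rules. The helpers `inject` and `macro_f1` and the filler sentence are likewise illustrative assumptions.

```python
import random
from dataclasses import dataclass

SECTIONS = ("S2R", "AB", "EB")
# Hypothetical type labels; the paper's real taxonomy is not given in the abstract.
TYPES = ("none", "omission", "fabrication")


@dataclass
class LabeledSummary:
    sections: dict        # section name -> list of sentences
    section_labels: dict  # section name -> hallucination type from TYPES

    @property
    def report_label(self) -> int:
        """1 if any section is marked as hallucinated, else 0."""
        return int(any(t != "none" for t in self.section_labels.values()))


def inject(summary, section, kind, rng,
           filler="The application also crashes on startup."):
    """Return a labeled copy of `summary` with one synthetic error in `section`.

    `filler` is an invented unsupported sentence used for fabrication.
    The input mapping is never modified.
    """
    sections = {name: list(sents) for name, sents in summary.items()}
    labels = {name: "none" for name in SECTIONS}
    sents = sections[section]
    if kind == "omission":
        if len(sents) < 2:
            raise ValueError("need at least two sentences to drop one")
        del sents[rng.randrange(len(sents))]
    elif kind == "fabrication":
        sents.insert(rng.randrange(len(sents) + 1), filler)
    else:
        raise ValueError(f"unknown hallucination kind: {kind}")
    labels[section] = kind
    return LabeledSummary(sections, labels)


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    clean = {
        "S2R": ["Open the settings page.", "Click Save."],
        "AB": ["The page freezes."],
        "EB": ["The settings are saved."],
    }
    sample = inject(clean, "AB", "fabrication", random.Random(0))
    print(sample.report_label, sample.section_labels)
```

Each generated sample carries the three targets the abstract describes: a report-level flag, per-section labels, and a hallucination type. Macro-F1 weights every class equally, which matters when clean summaries outnumber corrupted ones.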
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.24137 [cs.SE]
  (or arXiv:2605.24137v1 [cs.SE] for this version)
  https://doi.org/10.48550/arXiv.2605.24137

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Hinduja Nirujan
[v1] Fri, 22 May 2026 18:55:46 UTC (618 KB)