Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub
Ramtin Ehsani, Sakshi Pathak, Shriya Rawal, Abdullah Al Mujahid, · 2026-01-22 · via cs.SE updates on arXiv.org

AI coding agents are now submitting pull requests (PRs) to software projects, acting not just as assistants but as autonomous contributors. Although these agentic contributions are rapidly increasing across real repositories, little is known about how they behave in practice and why many of them fail to be merged. In this paper, we conduct a large-scale study of 33k agent-authored PRs made by five coding agents across GitHub. (RQ1) We first quantitatively characterize merged and not-merged PRs along four broad dimensions: (1) merge outcomes across task types, (2) code changes, (3) CI build results, and (4) review dynamics. We observe that tasks related to documentation, CI, and build updates achieve the highest merge success, whereas performance and bug-fix tasks perform the worst. Not-merged PRs tend to involve larger code changes, touch more files, and often fail the project's CI/CD pipeline validation. (RQ2) To further investigate why some agentic PRs are not merged, we qualitatively analyze 600 PRs to derive a hierarchical taxonomy of rejection patterns. This analysis complements the quantitative findings in RQ1 by uncovering rejection reasons not captured by quantitative metrics, including lack of meaningful reviewer engagement, duplicate PRs, unwanted feature implementations, and agent misalignment. Together, our findings highlight key socio-technical and human-AI collaboration factors that are critical to improving the success of future agentic workflows.
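
The RQ1 comparison described in the abstract (merge outcomes by task type, plus change size and CI results for merged versus not-merged PRs) could be approximated along the following lines. This is only a minimal sketch: the `PullRequest` record, its field names, and the sample rows are hypothetical and invented for illustration. They are not the authors' dataset, schema, or analysis code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical fields; the abstract does not specify the dataset schema.
    task_type: str
    merged: bool
    files_changed: int
    ci_passed: bool


def merge_rate_by_task(prs):
    """Return {task_type: fraction of PRs of that type that were merged}."""
    totals = defaultdict(int)
    merged = defaultdict(int)
    for pr in prs:
        totals[pr.task_type] += 1
        merged[pr.task_type] += int(pr.merged)
    return {task: merged[task] / totals[task] for task in totals}


def summarize_by_outcome(prs):
    """Return {merged_flag: (median files changed, CI pass rate)}."""
    groups = defaultdict(list)
    for pr in prs:
        groups[pr.merged].append(pr)
    return {
        outcome: (
            median(p.files_changed for p in group),
            sum(p.ci_passed for p in group) / len(group),
        )
        for outcome, group in groups.items()
    }


# Toy records invented for illustration only; they are not data from the study.
SAMPLE = [
    PullRequest("docs", True, 1, True),
    PullRequest("docs", True, 2, True),
    PullRequest("bug-fix", False, 9, False),
    PullRequest("bug-fix", True, 3, True),
    PullRequest("performance", False, 12, False),
    PullRequest("performance", False, 7, True),
]

if __name__ == "__main__":
    print(merge_rate_by_task(SAMPLE))
    print(summarize_by_outcome(SAMPLE))
```

Grouping by outcome first and then summarizing each group keeps the two views (per task type, per merge outcome) independent, so further dimensions such as review dynamics could be added as extra fields without changing the existing functions.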