Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection

Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges

Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing

Choose Your Own Adventure: Non-Linear AI-Assisted Programming with EvoGraph

Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning

LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning

Neurosymbolic Repo-level Code Localization

CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Verification Modulo Tested Library Contracts

The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE

Scaling Test-Time Compute for Agentic Coding

AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap

Vibe-Coding: Feedback-Based Automated Verification with no Human Code Inspection, a Feasibility Study

Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex

Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

AIPC: Agent-Based Automation for AI Model Deployment with Qualcomm AI Runtime

Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks

Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks

Prompt-Driven Code Summarization: A Systematic Literature Review

LinuxArena: A Control Setting for AI Agents in Live Production Software Environments

LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB

Large Language Models to Enhance Business Process Modeling: Past, Present, and Future Trends

CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation

Sentiment analysis for software engineering: How far can zero-shot learning (ZSL) go?

Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment

The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents

CodeTracer: Towards Traceable Agent States

Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering

OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

Designing Adaptive Digital Nudging Systems with LLM-Driven Reasoning

Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

Ambiguity Detection and Elimination in Automated Executable Process Modeling

Compliant But Unsatisfactory: The Gap Between Auditing Standards and Practices for Probabilistic Genotyping Software

Resilient Write: A Six-Layer Durable Write Surface for LLM Coding Agents

LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments

Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

The Code Whisperer: LLM and Graph-Based AI for Smell and Vulnerability Resolution

AutoFlows++: Hierarchical Message Flow Mining for System on Chip Designs

DynamicsLLM: a Dynamic Analysis-based Tool for Generating Intelligent Execution Traces Using LLMs to Detect Android Behavioural Code Smells

Vibe-driven model-based engineering

Machine Learning-Based Detection of MCP Attacks

Towards an Appropriate Level of Reliance on AI: A Preliminary Reliance-Control Framework for AI in Software Engineering

How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

Intent-aligned Formal Specification Synthesis via Traceable Refinement

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

From Helpful to Trustworthy: LLM Agents for Pair Programming

MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis

Applying an Agentic Coding Tool for Improving Published Algorithm Implementations

Formal Architecture Descriptors as Navigation Primitives for AI Coding Agents

Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems

Can Coding Agents Be General Agents?

Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models

CCCE: A Continuous Code Calibration Engine for Autonomous Enterprise Codebase Maintenance via Knowledge Graph Traversal and Adaptive Decision Gating

Building Trust in the Skies: A Knowledge-Grounded LLM-based Framework for Aviation Safety

Contract-Coding: Towards Repo-Level Generation via Structured Symbolic Paradigm

ECM Contracts: Contract-Aware, Versioned, and Governable Capability Interfaces for Embodied Agents

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

CODESTRUCT: Code Agents over Structured Action Spaces

Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

WybeCoder: Verified Imperative Code Generation

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

From Natural Language to PromQL: A Catalog-Driven Framework with Dynamic Temporal Resolution for Cloud-Native Observability

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

From Scalars to Tensors: Declared Losses Recover Epistemic Distinctions That Neutrosophic Scalars Cannot Express

Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts

LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

A Pythonic Functional Approach for Semantic Data Harmonisation in the ILIAD Project

Help Without Being Asked: A Deployed Proactive Agent System for On-Call Support with Continuous Self-Improvement

ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

X-SYS: A Reference Architecture for Interactive Explanation Systems

KRONE: Scalable LLM-Augmented Log Anomaly Detection via Hierarchical Abstraction

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Process-Centric Analysis of Agentic Software Systems

Enabling Predictive Maintenance in District Heating Substations: A Labelled Dataset and Fault Detection Evaluation Framework based on Service Data

Context-Guided Decompilation: A Step Towards Re-executability

Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model

From Charts to Code: A Hierarchical Benchmark for Multimodal Models

E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

AISysRev -- LLM-based Tool for Title-abstract Screening

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings

CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG
