Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
Huan Zhang, Wei Cheng, Wei Hu · 2026-03-31 · via cs.SE updates on arXiv.org

Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable unit tests. In real-world scenarios, however, reference solutions and test oracles are much harder to obtain than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: can a code language model improve itself without access to a superior teacher or a test oracle? To answer this, we propose ConSelf, a self-improving approach built upon two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling the construction of a curriculum from the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
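
The abstract does not give formulas for the two ideas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that sampled programs are grouped by exact match of their outputs on the available test inputs, that code semantic entropy is the Shannon entropy over those behavior clusters, that a program's behavioral consensus is the share of samples in its cluster, and that Con-DPO multiplies the standard per-pair DPO loss by that weight. All function names and the `beta` default are hypothetical.

```python
import math
from collections import Counter
from typing import Any, Callable, List, Sequence, Tuple

Program = Callable[[Any], Any]


def behavior_signature(program: Program, test_inputs: Sequence[Any]) -> Tuple[str, ...]:
    """Run one candidate on every test input and record what it does."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(repr(program(x)))
        except Exception as exc:  # a crash counts as a behavior too
            outputs.append(f"error:{type(exc).__name__}")
    return tuple(outputs)


def semantic_entropy(programs: Sequence[Program], test_inputs: Sequence[Any]) -> float:
    """Shannon entropy (in nats) over clusters of identically behaving programs.

    Assumption: clusters are formed by exact match of behavior signatures.
    """
    counts = Counter(behavior_signature(p, test_inputs) for p in programs)
    n = len(programs)
    return sum((c / n) * math.log(n / c) for c in counts.values())


def consensus_weights(programs: Sequence[Program], test_inputs: Sequence[Any]) -> List[float]:
    """Assumed consensus score: fraction of samples that behave like each program."""
    signatures = [behavior_signature(p, test_inputs) for p in programs]
    counts = Counter(signatures)
    n = len(signatures)
    return [counts[s] / n for s in signatures]


def weighted_dpo_loss(policy_margin: float, ref_margin: float,
                      weight: float, beta: float = 0.1) -> float:
    """Standard per-pair DPO loss, scaled by a consensus weight (assumed form).

    Each margin is log p(chosen) - log p(rejected) under the policy or the
    reference model; -log(sigmoid(z)) is computed as log(1 + exp(-z)).
    """
    z = beta * (policy_margin - ref_margin)
    return weight * math.log1p(math.exp(-z))


# Hypothetical samples for the task "return the absolute value of x".
candidates: List[Program] = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,    # wrong for negative inputs
    lambda x: -x,   # wrong for positive inputs
]
inputs = [-2, 0, 3]

print(semantic_entropy(candidates, inputs))
print(consensus_weights(candidates, inputs))
```

In this toy run the two correct implementations fall into one behavior cluster and each wrong one forms its own, so the consensus weights are 0.5, 0.5, 0.25, and 0.25. A problem where every sample behaves identically has entropy 0, while one where every sample behaves differently has the maximum entropy; how the paper maps entropy to "most learnable" is not stated in the abstract.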