
Comparative Separation: Evaluating Separation on Comparative Judgment Test Data
Xiaoyin Xi, Neeku Capak, Kate Stockwell, Zhe Yu · 2026-01-11 · via cs.SE updates on arXiv.org

This research seeks to benefit the software engineering community by proposing comparative separation, a novel group fairness notion for evaluating the fairness of machine learning software on comparative judgment test data. Fairness issues have attracted growing attention as machine learning software is increasingly used for high-stakes, high-risk decisions. It is the responsibility of all software developers to make their software accountable by ensuring that it does not perform differently across sensitive groups, that is, that it satisfies the separation criterion. However, evaluating separation requires a ground truth label for each test data point. This motivates our work on analyzing whether separation can be evaluated on comparative judgment test data. Instead of asking humans to provide a rating or categorical label for each test data point, comparative judgments are made between pairs of data points, such as "A is better than B". According to the law of comparative judgment, providing such comparative judgments imposes a lower cognitive burden on humans than providing ratings or categorical labels. This work first defines the novel fairness notion of comparative separation on comparative judgment test data, along with metrics to evaluate it. Then, both theoretically and empirically, we show that in binary classification problems, comparative separation is equivalent to separation. Lastly, we analyze the number of test data points and test data pairs required to achieve the same level of statistical power when evaluating separation and comparative separation, respectively. This work is the first to explore fairness evaluation on comparative judgment test data. It shows the feasibility and practical benefits of using comparative judgment test data for model evaluation.
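
For readers unfamiliar with the separation criterion mentioned above, the following is a minimal sketch of how it is commonly checked in binary classification when ground truth labels are available: the true positive rate and false positive rate are compared across sensitive groups. The function names `group_rates` and `separation_gaps` and the toy data are illustrative assumptions, not taken from the paper, and the sketch does not implement the paper's comparative separation metrics, which the abstract does not spell out.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and predictions.

    Hypothetical helper for illustration; not the paper's code.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            key = "tp" if p == 1 else "fn"
        else:
            key = "fp" if p == 1 else "tn"
        counts[g][key] += 1

    rates = {}
    for g, c in counts.items():
        pos = c["tp"] + c["fn"]  # ground truth positives in this group
        neg = c["fp"] + c["tn"]  # ground truth negatives in this group
        tpr = c["tp"] / pos if pos else float("nan")
        fpr = c["fp"] / neg if neg else float("nan")
        rates[g] = (tpr, fpr)
    return rates


def separation_gaps(y_true, y_pred, groups):
    """Return (TPR gap, FPR gap) across groups.

    Both gaps being zero means the predictions satisfy separation
    on this test set (equal error rates across groups).
    """
    rates = group_rates(y_true, y_pred, groups)
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Toy data (assumed): eight test points, two sensitive groups "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(group_rates(y_true, y_pred, groups))      # {'a': (0.5, 0.0), 'b': (1.0, 0.5)}
print(separation_gaps(y_true, y_pred, groups))  # (0.5, 0.5)
```

Every entry of `y_true` here is a per-point ground truth label, which is exactly the labeling cost that evaluating on comparative judgments is meant to avoid.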