Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof
Alfredo Metere · 2026-05-26 · via cs.AI updates on arXiv.org

Abstract: The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This
paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script-side reachable
through a non-deterministic LLM-side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that
together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script-side via abstract interpretation over a small effect
lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically-inferred capability is not in the manifest's declared set;
(3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so any counter-example fitting the runtime's
transaction-buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper's threat model modulo a single residual (the
LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep,
CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing this http URL
convention. All three methods plus the bundle producer and re-checker ship as zero-dependency JavaScript modules in the open-source enclawed framework
(this https URL project page this https URL), with 53 unit tests and an end-to-end CLI demo on a sample skill.
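
As an illustration of the capability-containment idea behind methods (1) and (2), the following is a minimal sketch, not the paper's implementation (which the abstract says ships as JavaScript modules). It assumes a powerset effect lattice whose join is set union, and it uses hypothetical tool names, capability names, and a toy script representation; none of these come from the paper. The analysis joins the effects of every branch so that the inferred set over-approximates all paths, then rejects the script if any inferred capability falls outside the manifest's declared set.

```python
# Hypothetical mapping from tool calls to the capability each one requires.
# The names are illustrative only; the paper's effect lattice is not
# specified in the abstract.
TOOL_EFFECTS = {
    "read_file": frozenset({"fs.read"}),
    "write_file": frozenset({"fs.write"}),
    "http_get": frozenset({"net.fetch"}),
}

# Stand-in for the lattice top: an unknown tool may do anything.
TOP = frozenset({"*"})


def join(a, b):
    """Least upper bound in a powerset lattice, i.e. set union."""
    return a | b


def infer_effects(script):
    """Over-approximate the capabilities a script may exercise.

    A script is a list of steps. A step is either a tool name (str) or a
    tuple ("branch", then_script, else_script). Both branch arms are
    joined, so the result covers every execution path.
    """
    total = frozenset()
    for step in script:
        if isinstance(step, tuple) and step[0] == "branch":
            _, then_script, else_script = step
            arms = join(infer_effects(then_script), infer_effects(else_script))
            total = join(total, arms)
        else:
            # Unknown tools map to TOP so the analysis stays conservative.
            total = join(total, TOOL_EFFECTS.get(step, TOP))
    return total


def check_containment(script, declared):
    """Return (ok, excess): ok is True when inferred effects are all declared."""
    excess = infer_effects(script) - frozenset(declared)
    return (not excess, excess)


if __name__ == "__main__":
    script = ["read_file", ("branch", ["http_get"], ["write_file"])]
    ok, excess = check_containment(script, {"fs.read", "net.fetch"})
    print(ok, sorted(excess))
```

In this sketch the script is rejected because one branch may call `write_file`, whose `fs.write` capability is not declared, even though a given run might never take that branch. That conservatism is what makes a static check of this kind sound rather than merely tested.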
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Cite as: arXiv:2605.23951 [cs.AI]
  (or arXiv:2605.23951v1 [cs.AI] for this version)
  https://doi.org/10.48550/arXiv.2605.23951

arXiv-issued DOI via DataCite

Submission history

From: Alfredo Metere
[v1] Sat, 9 May 2026 19:27:38 UTC (45 KB)