From Refusal Geometry to Safety Geometry: Harmfulness--Refusal Coupling under Dynamic Adversarial Fine-Tuning
[Submitted on 15 Jun 2026] · 2026-06-16 · via cs.CR updates on arXiv.org

Abstract: Safety alignment requires language models to refuse harmful requests without losing the ability to answer benign ones. Existing robustness evaluations, however, do not reveal whether a model has learned to recognize harmfulness, to activate a refusal policy, or to couple these two processes. We study this question with a dual safety-geometry protocol that measures harmfulness carriers, refusal carriers, and their coupling across aligned instruction-tuned anchors and matched Mistral-7B-v0.1 SFT/R2D2 training trajectories. The aligned anchors validate the protocol: refusal-side interventions reopen attack success more strongly than harmfulness-only interventions, while harmfulness and refusal carriers remain nearly orthogonal. Along the Mistral trajectory, R2D2 exhibits a high-coupling early phase with strong fixed-source robustness, saturated safe-prompt refusal, and collapsed benign utility. Later checkpoints move to a lower-coupling regime with partial utility recovery and reopened attack success. SFT provides an important contrast: it also reaches low coupling, but remains substantially less robust, showing that low coupling alone is not a safety guarantee. All-anchor diagnostics and sparse GCG/AutoDAN transfer experiments further show that harmfulness/refusal (H/R) coupling is informative in the R2D2 regime, whereas SFT transfer is better summarized by drift or behavior-state measures. Causal sweeps support fixed-protocol sensitivity relative to matched unit-direction controls, but do not establish independent harmfulness and refusal pathways. These results frame harmfulness--refusal coupling as an operational diagnostic for safety-geometry dynamics under adversarial fine-tuning.
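
The abstract does not say how the harmfulness and refusal carriers or their coupling are computed. As a rough illustration only, the sketch below assumes each carrier is a difference-of-means direction in activation space and that coupling is the cosine similarity between the two directions, so "nearly orthogonal" corresponds to a value near 0. The function names (`difference_of_means`, `coupling`) and the toy 3-dimensional activations are hypothetical and are not taken from the paper.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(pos, neg):
    """Direction from the mean of `neg` to the mean of `pos` (assumed carrier)."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot compare a zero-length direction")
    return dot / (norm_u * norm_v)


def coupling(harmful, harmless, refused, complied):
    """Assumed coupling score: cosine between harmfulness and refusal directions."""
    harm_dir = difference_of_means(harmful, harmless)
    refusal_dir = difference_of_means(refused, complied)
    return cosine(harm_dir, refusal_dir)


# Toy activations (hypothetical): harmfulness varies along axis 0,
# refusal varies along axis 1, axis 2 is shared noise.
harmful = [[2.0, 0.0, 0.5], [2.0, 0.0, -0.5]]
harmless = [[0.0, 0.0, 0.5], [0.0, 0.0, -0.5]]
refused = [[0.0, 3.0, 0.5], [0.0, 3.0, -0.5]]
complied = [[0.0, 1.0, 0.5], [0.0, 1.0, -0.5]]

low = coupling(harmful, harmless, refused, complied)    # separate axes
high = coupling(harmful, harmless, harmful, harmless)   # same direction
print(f"low-coupling example:  {low:.3f}")
print(f"high-coupling example: {high:.3f}")
```

In this toy setup the two directions lie on separate axes in the first call, which models a low-coupling regime, and coincide in the second, which models a high-coupling one. The paper's actual protocol may define carriers and coupling differently.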

Submission history

From: Wenhao Lan
[v1] Mon, 15 Jun 2026 07:50:00 UTC (317 KB)