惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

MongoDB | Blog
MongoDB | Blog
IT之家
IT之家
J
Java Code Geeks
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
Recent Announcements
Recent Announcements
博客园 - 三生石上(FineUI控件)
博客园_首页
MyScale Blog
MyScale Blog
腾讯CDC
I
InfoQ
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
人人都是产品经理
人人都是产品经理
Vercel News
Vercel News
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
量子位
爱范儿
爱范儿
U
Unit 42
aimingoo的专栏
aimingoo的专栏
B
Blog RSS Feed
云风的 BLOG
云风的 BLOG
M
MIT News - Artificial intelligence
A
About on SuperTechFans
T
The Blog of Author Tim Ferriss
Blog — PlanetScale
Blog — PlanetScale
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Engineering at Meta
Engineering at Meta
博客园 - 叶小钗
小众软件
小众软件
Jina AI
Jina AI
Hugging Face - Blog
Hugging Face - Blog
Google DeepMind News
Google DeepMind News
The Cloudflare Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
D
Docker
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
博客园 - 【当耐特】
博客园 - Franky
H
Help Net Security
Stack Overflow Blog
Stack Overflow Blog
阮一峰的网络日志
阮一峰的网络日志
C
Check Point Blog
C
CERT Recently Published Vulnerability Notes
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
Cisco Talos Blog
Cisco Talos Blog
H
Hackread – Cybersecurity News, Data Breaches, AI and More
I
Intezer
Latest news
Latest news
D
Darknet – Hacking Tools, Hacker News & Cyber Security
博客园 - 司徒正美
Microsoft Security Blog
Microsoft Security Blog

cs.CR updates on arXiv.org

Attribute Inference from Interactive Targeted Ads QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks TrustedARI: Towards Trust-Native Agentic Routing Infrastructure for Agentic AI AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents A Security Analysis of Long-Horizon Agentic AI Systems: Threats, Evaluation, and Framework Development Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Thanatosis AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents Benign in Isolation, Harmful in Composition: Security Risks in Agent Skill Ecosystems Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment CmdNeedle: Measuring the Incompleteness of Command Denylists for AI Agents FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion AnonShield: Scalable On-Premise Pseudonymization for CSIRT Vulnerability Data Odds Law: The Decomposition Algebra On How Intelligence Organizes Itself to Solve Difficult Problems Reliably Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice? GAS-Leak-LLM: Genetic Algorithm-Based Suffix Optimization for Black-Box LLM Jailbreaking Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens SPARK: Security Knowledge Priming and Representation-Guided Knowledge Activation for LLM-based Secure Code Generation The Proxy Knows Too Much: Sealing LLM API Routers with Attested TEEs Automated jailbreak attack targeting multiple defense strategies The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models Vision-Encoder Behavioral Fingerprints of Image-to-Image Generative Models: A Training-Paradigm-Driven Taxonomy of Six Commercial APIs How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation Your "Pro" LLM Subscription May Actually Be "Free": Exposing Fingerprint Spoofing Risks in LLM Inference Services DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing Censorship-Resistant Sealed-Bid Auctions on Blockchains Differentially Private Submodular Maximization with a Knapsack Constraint Continual Backdoor Training in IoT/CPS Security Engineering of OpenClaw: Analyzing Attack Surface Expansion and Trust-Boundary Violations Semantic Integrity Failures in Document-to-LLM Supply Chains BT-MTD: Bus Traversal-based Moving Target Defense for Smart Grid Fuzzy PSI from Symmetric Primitives with Exact Logarithmic Dependence on Distance Threshold Data-Centric Benchmarking of Exploit Generation in LLMs: Understanding the Impact of Fine-Tuning VLALeaks: Membership Inference Attacks against Vision-Language-Action Models Robust and Precise Application Fingerprinting on 5G Physical Uplink Channel LLM: LSTM Look-Ahead Moving Target Defense Based on Historical Malicious Scan Cross-Silo De-Anonymization Under Local Differential Privacy: Threat Model, Phase Transition, and Coordination Necessity The Audit Gap in Blockchain Security: A Four-Year Empirical Study of Public Audit Findings and Real-World Exploit Incidents In-DRAM Signature Generation Using Simultaneous Multiple-Row Activation: An Experimental Study of Off-The-Shelf DRAM Chips Model Stealing Through the Lens of Model Multiplicity Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance Multi-tier Differential Private Query Release Your Privacy My Cloak: Backdoor Attacks on Differentially Private Federated Learning FEnc$^2$: Unifying Data Packing for Efficient Private Inference via Convolution and Architecture-Aware Fragment Encoding Secure and Low-Latency IoT Analytics Using an Edge-Based Streaming Architecture Robust and Automated Reconfiguration of Byzantine Wide-Area Replication did:crdt: Coordination-Free Decentralised Identifiers via Signed CRDTs CoBRA: A Universal Strategyproof Confirmation Protocol for Quorum-based Proof-of-Stake Blockchains AttackonCTF: Defending Hardware Security Competition Benchmarks in the Age of LLMs FuseChain: Runtime Evidence Reconstruction for Software Supply-Chain Attacks Stickel-type key exchange with hidden subspaces New Ideas on a New Old Type of Cipher:The Mixed-Radix One-Time Pad The Anatomy of Scam Scenarios: Large-Scale Characterization and Conversation-Aware Detection Invisible Manipulation Channels in AI-Assisted Financial Advisory: Implications for Market Integrity and Regulatory Design Scalable Malware Family Classification Using Quantum Kernel Based Machine Learning Dynamic Malicious Skills in Agentic AI From Refusal Geometry to Safety Geometry: Harmfulness--Refusal Coupling under Dynamic Adversarial Fine-Tuning MIPSBLEED: Uncovering Microarchitectural Timing Leaks in Pervasive Embedded Processors MPX: A Unified Systolic Array for Matrix and Polynomial Multiplication Transferable Self-Evolving Playbooks for Agentic Security Auditing A Formal Resilience Framework for Cyber-Physical Embodied Systems under Device-Level Cyberattacks Measurement Study of Post-Quantum Readiness of Internet: 2026 A data-driven security quantification framework for IoT-based systems SoK: Taxonomizing the Low-Level Attack Surface of Modern Web Browsers KnowML: Improving Generalization of ML-NIDS with Attack Knowledge Graphs From Third-Party to First-Party: Measuring and Protecting Against Modern Web Tracking Mechanisms The Ghosts of Polymarket: When Off-Chain Matches Meet On-Chain Reverts Di5Guise: 5G Privacy with vSIM High-Performance Pipelined NTT Accelerators with Homogeneous Digit-Serial Modulo Arithmetic obliv-clang: Real-World Oblivious Programming in C++ Quantum Futures Interactive: A Live Demonstration of Post-Quantum Blockchain Security, Infrastructure Tradeoffs, and Sustainable Distributed Trust On MDS Property of g-Circulant Matrices Calyx: Privacy-Preserving Multi-Token Optimistic-Rollup Protocol MIRAGE: Misleading Retrieval-Augmented Generation via Black-box and Query-agnostic Poisoning Attacks Taint-Based Code Slicing for LLMs-based Malicious NPM Package Detection Cryptanalysis of LDPC-Based Pseudorandom Error-Correcting Codes The Coverage Gap: Chile's Cyber Disclosure Framework versus the USA, EU and UK AutoSUT: The Environment Semantics Gap in Structured CTI for Adversary Emulation From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense QSignAI: Quantum-Randomness-Seeded Identity Signatures at the Intersection of AI for Science and Science for AI Intent-based Security Management Using the TM Forum TR292I Security Ontology Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests Cordyceps: Covert Control Attacks on LLMs via Data Poisoning SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT? Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw From Specification to Deployment: Empirical Evidence from a W3C VC + DID Trust Infrastructure for Autonomous Agents Parallel Test-Time Scaling with Multi-Sequence Verifiers MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks PromptScreen: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline AI Kill Switch for malicious web-based LLM agent Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs
HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
[Submitted on 17 Oct 2025 (v1), last revised 14 Jun 2026 (this v · 2026-06-16 · via cs.CR updates on arXiv.org

View PDF HTML (experimental)

Abstract:Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have gained significant attention due to their objective and verifiable reward signals, demonstrating strong performance in reasoning and code generation tasks. However, the potential safety risks associated with RLVR remain underexplored. This paper presents HarmRLVR, the first systematic investigation into the alignment reversibility risk of RLVR. We show that safety alignment can be rapidly reversed using GRPO with merely 64 harmful prompts without responses, causing models to readily comply with harmful instructions. Across five models from Llama, Qwen, and DeepSeek, we empirically demonstrate that RLVR-based attacks elevate the average harmfulness score to 4.94 with an attack success rate of 96.01\%, significantly outperforming harmful fine-tuning while preserving general capabilities. Our findings reveal that RLVR can be efficiently exploited for harmful alignment, posing serious threats to open-source model safety. Please see our code at this https URL.

Submission history

From: Yuexiao Liu [view email]
[v1] Fri, 17 Oct 2025 10:13:44 UTC (2,427 KB)
[v2] Sun, 14 Jun 2026 10:53:03 UTC (1,008 KB)