





















Abstract: Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.
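The dilution intuition in the abstract can be illustrated with a toy softmax calculation. This is a minimal sketch, not the paper's probing or attention analysis: the function `harmful_attention_share`, its parameters, and the uniform-logit assumption are all hypothetical and chosen only to show how a fixed block of tokens loses attention share as the surrounding benign context grows.

```python
import math


def harmful_attention_share(n_harmful, n_benign, logit_gap=0.0):
    """Softmax attention mass that lands on the harmful-request tokens.

    Toy assumption (not from the paper): all harmful tokens share one
    attention logit, all benign tokens share another, and the harmful
    tokens lead by `logit_gap`.
    """
    harmful_mass = n_harmful * math.exp(logit_gap)
    # Benign tokens each contribute exp(0) = 1 to the softmax denominator.
    return harmful_mass / (harmful_mass + n_benign)


if __name__ == "__main__":
    # Hold the harmful request at 20 tokens and lengthen the benign reasoning.
    for n_benign in (10, 100, 1000, 10000):
        share = harmful_attention_share(n_harmful=20, n_benign=n_benign, logit_gap=1.0)
        print(f"{n_benign:>6} benign tokens -> harmful share {share:.3f}")
```

Under these assumptions the share falls monotonically with the number of benign tokens, even when the harmful tokens keep a constant logit advantage. Real attention patterns are learned and far less uniform, so this only motivates the qualitative trend.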
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2510.26418 [cs.AI] (or arXiv:2510.26418v4 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2510.26418 (arXiv-issued DOI via DataCite)
From: Jianli Zhao
[v1] Thu, 30 Oct 2025 12:10:03 UTC (26,470 KB)
[v2] Tue, 11 Nov 2025 14:33:12 UTC (26,465 KB)
[v3] Tue, 3 Feb 2026 10:05:14 UTC (26,489 KB)
[v4] Sun, 24 May 2026 08:30:15 UTC (26,452 KB)