Abstract: In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By building each step on the model's previous responses, BAIT turns the model's own reasoning and its tendency toward consistency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly outperforming conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge requests; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps alone can sometimes elicit harmful content while triggering little filtering.
| Subjects: | Cryptography and Security (cs.CR); Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.27110 [cs.CR] (or arXiv:2605.27110v1 [cs.CR] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.27110 (arXiv-issued DOI via DataCite, pending registration) |
From: Xuan Luo
[v1] Tue, 26 May 2026 14:51:13 UTC (7,891 KB)