

















Abstract: Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model that does not enforce a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before any attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability reported previously stems primarily from overly constrained optimization objectives.
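The two-part objective described in the abstract can be illustrated with a toy calculation on per-position next-token logits. This is only a sketch under stated assumptions, not the authors' formulation: the abstract gives no formula, so the entropy threshold `tau`, the weight `lam`, the direction of the KL term, the choice to pick decision positions from the clean (pre-attack) distribution, and all function names are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); q is clamped to eps to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def entropy_kl_objective(clean_logits, adv_logits, tau=1.0, lam=1.0):
    """Toy loss to minimize: reward entropy at high-entropy positions and
    penalize drift from the clean distribution at the remaining positions.
    ASSUMPTION: tau, lam, and the KL direction are illustrative choices."""
    clean = [softmax(row) for row in clean_logits]
    adv = [softmax(row) for row in adv_logits]
    # ASSUMPTION: decision positions are those whose clean entropy is >= tau.
    high = [entropy(p) >= tau for p in clean]
    ent = [entropy(q) for q, h in zip(adv, high) if h]
    kl = [kl_divergence(p, q) for p, q, h in zip(clean, adv, high) if not h]
    ent_term = sum(ent) / len(ent) if ent else 0.0
    kl_term = sum(kl) / len(kl) if kl else 0.0
    return -ent_term + lam * kl_term, high

# Three positions over a 4-token vocabulary: one uncertain, two confident.
clean = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0], [0.0, 10.0, 0.0, 0.0]]
loss, mask = entropy_kl_objective(clean, clean)
```

In this toy setting only the first position is treated as a decision token. Changing the distribution at one of the confident positions raises the loss through the KL term, which corresponds to the "stabilizing" role the abstract assigns to low-entropy positions.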
| Comments: | Preprint. 17 pages, 8 figures, 6 tables |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| ACM classes: | I.2.10; I.4.9 |
| Cite as: | arXiv:2605.10764 [cs.CV] (or arXiv:2605.10764v2 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.10764 (arXiv-issued DOI via DataCite) |
From: Mengqi He
[v1] Mon, 11 May 2026 15:59:02 UTC (949 KB)
[v2] Sat, 23 May 2026 07:33:40 UTC (949 KB)