

















Abstract: Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a Refusal Trajectory: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.
| Comments: | Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version |
| Subjects: | Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.02958 [cs.CR] |
| | (or arXiv:2605.02958v2 [cs.CR] for this version) |
| | https://doi.org/10.48550/arXiv.2605.02958 (arXiv-issued DOI via DataCite) |
From: Xulin Hu
[v1] Sat, 2 May 2026 14:56:37 UTC (433 KB)
[v2] Tue, 26 May 2026 16:10:32 UTC (436 KB)