

















Abstract: Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback-Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT) and Robust Refusal Dynamic Defense (R2D2), aligning HarmBench, StrongREJECT, and XSTest evaluations with five-anchor geometry measurements, causal interventions, and sparse adaptive stress tests. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints; however, these checkpoints also exhibit maximal XSTest refusal and fail a benign-utility audit. Later checkpoints partially recover utility-facing behavior while reopening attack success, with adaptive GCG attack success rate rising to 0.415 at step 250 and 0.613 at step 500. Internally, R2D2 preserves a late-layer admissible refusal-control carrier through step 100 and then relocates the best admissible carrier to an early layer; SFT relocates earlier yet remains less robust. Effective rank stays near 1.24, and SFT shows larger principal-angle drift, arguing against both dimensional expansion and drift magnitude as sufficient explanations. Causal interventions support a low-dimensional but utility-coupled carrier. These results support a geometry-reorganization account of R2D2 along a robustness-utility frontier, without establishing adaptive robustness.
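The abstract reports two geometry quantities, effective rank and principal-angle drift, without defining them here. The following is a minimal sketch of one common way to compute each, not the paper's own procedure: it assumes effective rank means the exponential of the Shannon entropy of the normalized singular values, and that drift is measured by principal angles between the column spaces of two basis matrices. The function names `effective_rank` and `principal_angles` are hypothetical.

```python
import numpy as np

def effective_rank(X):
    """Entropy-based effective rank of a matrix (assumed definition).

    Normalizes the singular values to a probability vector and returns
    exp(Shannon entropy). A rank-1 matrix gives 1; a matrix with k equal
    singular values gives k.
    """
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B.

    Orthonormalizes each basis with QR, then takes arccos of the
    singular values of Qa^T Qb.
    """
    Qa, _ = np.linalg.qr(np.asarray(A, dtype=float))
    Qb, _ = np.linalg.qr(np.asarray(B, dtype=float))
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: a dominant direction plus small isotropic noise.
    u = rng.normal(size=(200, 1))
    v = rng.normal(size=(1, 64))
    X = u @ v + 0.01 * rng.normal(size=(200, 64))
    print("effective rank:", effective_rank(X))

    # Hypothetical 2-D subspaces standing in for two checkpoints.
    A = rng.normal(size=(64, 2))
    B = A + 0.1 * rng.normal(size=(64, 2))
    print("principal angles (rad):", principal_angles(A, B))
```

Under this definition, a value near 1.24 would indicate that a single direction carries most of the spectrum, which is consistent with the abstract's description of a low-dimensional carrier.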
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as: arXiv:2604.27019 [cs.LG] (or arXiv:2604.27019v3 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.27019 (arXiv-issued DOI via DataCite)
From: Wenhao Lan
[v1] Wed, 29 Apr 2026 12:44:05 UTC (151 KB)
[v2] Sun, 17 May 2026 03:56:38 UTC (225 KB)
[v3] Mon, 25 May 2026 19:43:08 UTC (197 KB)