





















Abstract: Current safety alignment of foundation models largely follows a \emph{one-size-fits-all} paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose \textsc{Palette}, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. \textsc{Palette} further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that \textsc{Palette} delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.
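The modular composition described in the abstract can be pictured with a small sketch. The abstract only says that domain-specific controls are composed "through parameter merging"; the additive merge of per-domain weight deltas below is an assumption for illustration, not necessarily the authors' procedure. The function name `merge_domain_deltas`, the parameter name `layer0.proj`, and the domains `medical` and `security` are all hypothetical.

```python
from typing import Dict, List

# Parameter name -> flattened weight values (toy stand-in for real tensors).
Weights = Dict[str, List[float]]


def merge_domain_deltas(
    base: Weights,
    deltas: Dict[str, Weights],
    authorized: List[str],
) -> Weights:
    """Return a copy of `base` with the deltas of each authorized domain added.

    ASSUMPTION: simple additive merging; the abstract does not specify the
    merging rule used by the paper.
    """
    # Copy so the base weights are never modified in place.
    merged = {name: list(values) for name, values in base.items()}
    for domain in authorized:
        for name, delta in deltas[domain].items():
            merged[name] = [w + d for w, d in zip(merged[name], delta)]
    return merged


# Hypothetical base weights and independently learned per-domain deltas.
base = {"layer0.proj": [1.0, 2.0, 3.0]}
deltas = {
    "medical": {"layer0.proj": [0.5, 0.0, 0.0]},
    "security": {"layer0.proj": [0.0, -1.0, 0.0]},
}

# Authorize both domains on demand, without any retraining step.
merged = merge_domain_deltas(base, deltas, ["medical", "security"])
```

Under this assumption, authorizing a different set of domains only changes which deltas are summed, and passing an empty list returns the unmodified base weights.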
| Subjects: | Artificial Intelligence (cs.AI); Software Engineering (cs.SE) |
| Cite as: | arXiv:2605.24154 [cs.AI] (or arXiv:2605.24154v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.24154 (arXiv-issued DOI via DataCite, pending registration) |
From: Qitao Tan
[v1] Fri, 22 May 2026 19:22:17 UTC (16,759 KB)