




























The topic of LLM jailbreaking is an important one. As machines increase in power, their ability to benefit society increases, as does their ability to harm it.
This paper intends to address the question: "Is it possible to create a perfect universal guardian against LLM jailbreaks?" Such a guardian would be:
Universal: applies to all prompts and input combinations.
Perfect: successfully distinguishes harmful from non-harmful.
Computable: is feasible to compute.
We show that no guardian with this combination of properties exists. This is a request for comment.
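One way to see why the three properties conflict is the standard reduction from the halting problem. The sketch below is an illustration under stated assumptions and may not match the paper's exact construction: `HARMFUL`, `make_probe`, and `halts` are hypothetical names, and the guardian is modeled as a function that predicts whether running a piece of code will ever produce harmful output.

```python
from typing import Callable

# Hypothetical marker standing in for any output the guardian must block.
HARMFUL = "<<HARMFUL>>"

Program = Callable[[str], object]
Probe = Callable[[], str]
# Assumed guardian interface: True means "running this will produce harmful output".
Guardian = Callable[[Probe], bool]


def make_probe(program: Program, arg: str) -> Probe:
    """Build code that is harmful if and only if program(arg) halts."""
    def probe() -> str:
        program(arg)    # result is discarded; this call may never return
        return HARMFUL  # reached only when program(arg) has halted
    return probe


def halts(guardian: Guardian, program: Program, arg: str) -> bool:
    """If `guardian` were universal, perfect, and computable,
    this function would decide the halting problem."""
    return guardian(make_probe(program, arg))
```

Since no total procedure decides halting, no guardian can satisfy all three properties on inputs of this shape. A guardian that simply runs the probe is perfect on halting inputs but never answers on non-halting ones, so it fails the computability requirement.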
A more feasible approach is post-hoc filtering of results, which poses an easier decidability problem. To implement such an approach, both reasoning traces and intermediate code changes would likely need to be hidden from the user until the results are determined.
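A minimal sketch of that post-hoc arrangement follows, assuming a hypothetical `Draft` container and a caller-supplied `is_harmful` check (neither comes from the paper). The point it illustrates is that the filter only inspects finite text that has already been produced, rather than predicting behavior over all possible inputs, and that nothing reaches the user before the verdict.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Draft:
    """Everything the model produced for one request, buffered server-side."""
    reasoning: List[str] = field(default_factory=list)
    code_changes: List[str] = field(default_factory=list)
    result: str = ""


def release(draft: Draft, is_harmful: Callable[[str], bool]) -> Optional[Draft]:
    """Return the draft only if every finished artifact passes the check.

    Nothing in the draft is shown to the user before this function returns.
    Checking the intermediate artifacts as well as the result is an
    assumption of this sketch; returning None means the whole draft is withheld.
    """
    artifacts = [*draft.reasoning, *draft.code_changes, draft.result]
    if any(is_harmful(text) for text in artifacts):
        return None
    return draft
```

For example, with a toy check such as `lambda text: "exploit" in text`, a draft whose reasoning mentions "exploit" is withheld in full, even when its final result looks benign. The quality of the filter still depends entirely on `is_harmful`, which this sketch leaves open.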
Note: this paper was written in conjunction with ChatGPT 5.5 Pro. The reducibility of such problems to the halting problem has been on my mind for some time. In one sense the conclusion is trivial, yet it is one of high importance.