


















This started as a response to the recent "you have to post-train a model to pen-test" Show HN. We don't think you need to; post-training just makes life a bit easier.
Across 10K+ of our agent transcripts from benchmarking against OpenAI's EVMBench, we saw zero refusals. In closed frontier models, the refusals you hit mostly come from a separate content classifier or a system prompt, not so much from the model itself. Breadth (more cheap agents) beats a bigger model, but it puts more demands on context engineering.