The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Charlie Grif · 2026-05-15 · via AI Alignment Forum
