AI Alignment Forum

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
Jasmine Li · 2026-05-28 · via AI Alignment Forum
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may …