Tracing Eval-Awareness Emergence Through Training of OLMo 3 — AI Alignment Forum
Ram Bharadwa · 2026-06-10 · via AI Alignment Forum
