Why Do Naive SFT Filters For Safety Properties Fail? — LessWrong

Josh Engels · 2026-06-15 · via LessWrong
