Predicting Rare LLM Failures with 30× Fewer Rollouts

Santiago Aranguri, Francisco Pernice · 2026-06-11 · via Goodfire Research
