Understanding and Steering Llama 3 with Sparse Autoencoders

Thomas McGrath* · 2026-02-05 · via Goodfire Research
