Open Problems in Mechanistic Interpretability



Lee Sharkey · 2025-12-05 · via Goodfire Research
