Making Sense Of What’s Really Going On Inside AI By Using Newly Devised Natural Language Autoencoders
Lance Eliot · 2026-05-12 · via Forbes - Innovation

Anthropic publishes their new approach to AI interpretation, known as NLA.

In today’s column, I examine a newly published approach to interpreting what is occurring inside generative AI and large language models (LLMs).

The approach was developed by Anthropic, famed makers of Claude. They have named the new method NLA (natural language autoencoders). This approach is one of many that are being explored by AI researchers and AI practitioners worldwide. The hope is to find a suitable means to explain how the numbers and numeric calculations internal to an LLM are capable of representing human concepts and human logic.

One of the biggest unknowns about modern-era AI is how it turns numbers into something exhibiting human-like intellectual tendencies. When people ask an LLM to explain itself, many assume that they are getting an apt rendition of what the AI is computationally undertaking. Instead, they are often getting a charade, a made-up explanation that might have little or nothing to do with the actual internal machinations. The difficulty of finding out what is really going on inside is known in the AI community as the AI interpretability problem.

A highly vexing question is whether it is feasible to find a means to accurately and reliably ascertain the logical and explainable basis for what the AI is doing under the hood to arrive at its answers.

Let’s talk about it.

This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

The Inner Workings Of AI

I’d like to first establish some essential background about LLMs before we dive into the crux of the AI interpretability problem at hand.

Generative AI is generally designed and built in a now commonly accepted manner. You start with a base foundation model. This provides the essential technological underpinnings, including the use of an artificial neural network (ANN). The large-scale model is data trained by scanning tons of human-written materials. The posted content is found across the Internet. Algorithms use pattern matching to computationally determine the mathematical relationships among the words that we use.

When you enter a prompt into an LLM, the AI converts the words into numeric tokens. The numeric tokens are processed, and a response is formulated, which also consists of numeric tokens. Once the response is ready to be displayed, the numeric tokens are turned back into words. The conversion of words into numeric tokens, and of tokens back into words, is known as tokenization.
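To make the words-to-numbers round trip concrete, here is a minimal sketch using a toy word-level vocabulary. The vocabulary, the `encode` and `decode` names, and the whitespace splitting are all assumptions made for illustration; production LLMs use subword tokenizers with far larger vocabularies.

```python
# Toy word-level tokenizer (illustrative assumption, not how any real LLM tokenizes).
VOCAB = ["<unk>", "dogs", "bark", "cats", "meow", "and"]
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}

def encode(text: str) -> list[int]:
    """Map each whitespace-separated word to its numeric token ID (0 if unknown)."""
    return [TOKEN_ID.get(word, 0) for word in text.lower().split()]

def decode(ids: list[int]) -> str:
    """Map numeric token IDs back to words."""
    return " ".join(VOCAB[i] for i in ids)

ids = encode("Dogs bark and cats meow")
print(ids)          # [1, 2, 5, 3, 4]
print(decode(ids))  # dogs bark and cats meow
```

Everything the model does between `encode` and `decode` happens on numbers, which is why interpreting those numbers is the hard part.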

For more about the details of how generative AI, LLMs, and ANNs work, see my in-depth discussion at the link here.

Be Cautious Of Comparison To The Real Thing

As an aside, please be aware that an ANN is not the same as a true neural network (NN) that exists in your brain. Your brain uses a complex and intricate web consisting of interconnected biochemical living neurons. Some cheekily refer to the human brain as wetware (which is a play on the fact that computers have hardware and software).

An artificial neural network is simplistic in comparison to the real thing.

ANNs are only an inspired imitation of some aspects of how the human brain works. They are entirely computational and mathematical. I mention this to emphasize that, though many in the media tend to equate ANNs with real NNs, it is not a proper comparison. For more considerations on the actual similarities and differences, see my analyses at the link here and the link here.

Mechanistic Interpretability

One of the most popular ways to try to interpret what is going on inside an LLM is to go to the lowest level of granularity, namely, explore the artificial neurons inside an artificial neural network.

There are numeric values associated with artificial neurons. These include numeric weights and other numbers that are being utilized for various internal matters. It is a grand morass of numbers. You can certainly trace how numbers flow and change value throughout the processing of the ANN. But this doesn’t especially showcase a semblance of human logic and sensible explanation per se.

In other words, just because this or that number goes from here to there, you cannot readily say that this indicates that the AI was determining that dogs bark and cats meow. It is extremely difficult to make an association between our understanding of human concepts and the vast array of numbers involved in the inner chambers of an ANN.

Activation Vectors

Some believe that we have a much better chance at interpretability by focusing on sizable sets of numbers that are already collected inside an LLM. This is a higher level of analysis than the traditional granular artificial neuron level.

For example, there are so-called activation vectors that contain large sets of numbers and might seem to represent human concepts such as the nature of dogs and cats. Imagine a long list of numbers that perhaps represents the statement that dogs bark, while another vector represents that cats meow. I’ve previously closely showcased how activation vectors work; see my discussion at the link here.

It could be that if we pay attention to the activation vectors, we might gain additional ground on trying to achieve interpretability.

This Loop Might Do The Trick

Here’s an intriguing proposition. Suppose we try to turn an activation vector into a text-based version that contains words. We would take the numbers in a vector and attempt to convert them into sentences composed of words.

The clever trick is that once we have those words, we attempt to once again turn those words and sentences back into numbers that will go into a new vector. Why do this? Because we can then compare the new vector to the old vector. If our conversion was spot-on, we should end up with nearly the same numeric vector as we started with.

When the new vector veers substantively from the old vector, it suggests that the words we derived from the old vector might not have been well-chosen. Had we chosen other words, the new vector would have seemingly come out much closer numerically to the starting vector.

Our principal steps are as follows:

  • (1) Select an activation vector of interest.
  • (2) Convert the vector into words and sentences.
  • (3) Take those words and sentences, and convert them into a new vector.
  • (4) Compare the originating vector and the new vector.
  • (5) If the old vector and new vector are numerically close, voila, we assume that the words and sentences were suitably chosen. Good job.
  • (6) When the old vector and the new vector are numerically far apart from each other, we assume that the words and sentences weren’t suitably chosen, and thus, start the loop all over again, continuing until the vectors do end up being numerically close.
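The six steps above can be sketched in a few lines of Python. This is a toy under stated assumptions: the `WORD_VALUES` lookup table, the fixed list of candidate descriptions, and the `tolerance` threshold are all hypothetical stand-ins, and Euclidean distance is just one plausible way to measure closeness.

```python
import math

# Hypothetical lookup: the number each word "means" when read back into a vector.
WORD_VALUES = {"low": 10.0, "medium": 50.0, "high": 90.0}

def reconstruct(words):
    """Step 3: convert a description (list of words) into a new vector."""
    return [WORD_VALUES[w] for w in words]

def explain(vector, candidates, tolerance):
    """Steps 2-6: try descriptions until one round-trips close to the original."""
    for words in candidates:                 # step 2: a proposed description
        new_vector = reconstruct(words)      # step 3
        gap = math.dist(vector, new_vector)  # step 4: Euclidean distance
        if gap <= tolerance:                 # step 5: close enough, accept
            return words, gap
        # step 6: too far apart, so loop around and try another description
    return None, None                        # no candidate was close enough

original = [12.0, 48.0, 88.0]                # step 1: the vector of interest
candidates = [
    ["high", "low", "medium"],
    ["low", "medium", "medium"],
    ["low", "medium", "high"],
]
words, gap = explain(original, candidates, tolerance=5.0)
print(words, round(gap, 2))  # ['low', 'medium', 'high'] 3.46
```

The first two candidates are rejected because their reconstructed vectors land far from the original; only the third survives the round trip.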

To get this to happen expeditiously, we will use another LLM to do all the heavy lifting for us (rather than trying to do this manually).

We first choose an LLM that we want to try to explain, and have a different LLM reach in and grab a vector. This other LLM now converts the vector into words, then converts the words into a new vector, and makes the comparison. This other LLM can do this repeatedly, fine-tuning to get better and better at doing these loops and arriving at a new vector that is closer to the original vector.

An Illustrative Example

I’ll give you an illustrative example that generally portrays this approach. Imagine that you had a small portable weather station at your home and it recorded the existing weather conditions. The weather station records temperature, humidity, wind speed, and the status of rain (whether it is raining or not raining). The measurements of those recordings are arrayed into a vector.

Currently, the weather station indicates that the temperature is 72, the humidity is 65, the wind speed is 12, and the rain status is 0 since it isn’t raining, so here’s the vector:

  • Weather station vector: 72, 65, 12, 0

Let’s try to convert the vector into words and a sentence:

  • “It is a warm, somewhat humid, calm, dry day.”

Does that sentence reasonably represent the vector? Well, we can go ahead and try to convert that sentence back into a new vector. We will then compare the new vector to the original vector.

  • Do a conversion of the chosen words into a new vector: 70, 60, 10, 0

The conversion indicates that the temperature is perhaps 70, the humidity is perhaps 60, the wind speed is possibly 10, and the rain status is maybe 0. By and large, we would probably agree that the conversion was relatively close to the values of the original vector. It seems that we have done a yeoman’s job in converting the original vector into words.
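One way to put a number on "relatively close" is to compute the per-field differences plus an overall distance and similarity. The choice of Euclidean distance and cosine similarity here is an assumption for illustration; the four readings also use different units, so a real comparison would likely rescale them first.

```python
import math

original = [72, 65, 12, 0]       # temperature, humidity, wind speed, rain
reconstructed = [70, 60, 10, 0]  # read back from "warm, somewhat humid, calm, dry"

# Per-field difference between the new vector and the original.
diffs = [r - o for o, r in zip(original, reconstructed)]

# Overall closeness: straight-line distance and angle-based similarity.
euclidean = math.dist(original, reconstructed)
dot = sum(o * r for o, r in zip(original, reconstructed))
cosine = dot / (math.hypot(*original) * math.hypot(*reconstructed))

print(diffs)                # [-2, -5, -2, 0]
print(round(euclidean, 2))  # 5.74
print(round(cosine, 4))     # 0.9996
```

A cosine similarity near 1 says the two vectors point in nearly the same direction, which matches the intuition that the sentence was a fair description.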

When The Conversion Is Afield

Let’s start over again. As noted, the original vector was this:

  • Weather station vector: 72, 65, 12, 0

Let’s try anew to convert the vector into words and a sentence, which this time comes up with this sentence:

  • “It is a hot day, mildly humid, and the wind is kicking up quite a bit.”

Does that sentence reasonably represent the vector? Well, we can go ahead and try to convert that sentence into a new vector. We will then compare the new vector to the original vector.

  • Conversion of words into a new vector: 90, 60, 20, 0

The conversion indicates that the temperature is perhaps 90, the humidity is perhaps 60, the wind speed is 20, and the rain status is 0. I think we can agree that some of these numbers are not particularly close to the original vector, especially when it comes to the temperature and the wind speed.

What happened?

The words that were chosen to represent the original vector were not suitable choices when it comes to the temperature and the wind speed. We should try again and come up with words that are better choices.
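A small sketch can flag which fields went astray in each attempt. The per-field tolerances in `TOLERANCE` are hypothetical values picked for this illustration, as is the `off_target` helper name.

```python
FIELDS = ["temperature", "humidity", "wind speed", "rain"]
TOLERANCE = [5, 10, 5, 0]  # hypothetical: how far off each field may be

original = [72, 65, 12, 0]
attempts = {
    "warm, somewhat humid, calm, dry": [70, 60, 10, 0],
    "hot, mildly humid, windy": [90, 60, 20, 0],
}

def off_target(original, reconstructed):
    """Return the names of fields whose reconstructed value misses its tolerance."""
    return [
        name
        for name, o, r, tol in zip(FIELDS, original, reconstructed, TOLERANCE)
        if abs(r - o) > tol
    ]

for sentence, vector in attempts.items():
    print(sentence, "->", off_target(original, vector))
# warm, somewhat humid, calm, dry -> []
# hot, mildly humid, windy -> ['temperature', 'wind speed']
```

The second sentence overshoots on exactly the two fields where its wording ("hot", "kicking up quite a bit") was too strong.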

Research On The NLA Approach

In a newly posted paper entitled “Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations” by Kit Fraser-Taliente, Subhash Kantamneni, Euan Ong, Dan Mossing, Christina Lu, Paul C. Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler, Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, Samuel Marks, Anthropic, May 7, 2026, these salient points were made (excerpts):

  • “We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations.”
  • “The target model is a frozen copy of the original language model that we extract activations from. The activation verbalizer (AV) is modified to take an activation from the target model and produce text. We call this text an explanation. The activation reconstructor (AR) is modified to take a text explanation as input and produce an activation.”
  • “The NLA consists of the AV and AR, which, together, form a round trip: original activation → text explanation → reconstructed activation. We score the NLA on how similar the reconstructed activation is to the original. To train it, we pass a large amount of text through the target model, collect many activations, and train the AV and AR together to get a good reconstruction score.”
  • “The resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.”

You can see that they make use of a target LLM, which is where the vectors are copied from and inspected for interpretability purposes. They then use a capability they refer to as an activation verbalizer (AV) to produce the words and sentences that might hopefully correspond to the selected vector. Next, they use a capability referred to as an activation reconstructor (AR) to convert the text back into a new vector.

A scoring process and a looping effort take place to train the AV and AR to increasingly get better at this task.
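To give a feel for what a reconstruction score over many activations could look like, here is a minimal sketch. The metric used (fraction of variance explained) is an assumption chosen for illustration; the excerpts only say the NLA is scored on how similar the reconstructed activation is to the original. The activation values and the "early" and "late" checkpoints are made up.

```python
def reconstruction_score(originals, reconstructions):
    """Assumed metric: 1.0 is a perfect round trip, 0.0 is no better
    than always guessing the mean activation."""
    n, dim = len(originals), len(originals[0])
    mean = [sum(v[i] for v in originals) / n for i in range(dim)]
    # Squared error between each original and its reconstruction.
    residual = sum(
        (o - r) ** 2
        for vo, vr in zip(originals, reconstructions)
        for o, r in zip(vo, vr)
    )
    # Squared spread of the originals around their mean.
    total = sum((o - m) ** 2 for vo in originals for o, m in zip(vo, mean))
    return 1.0 - residual / total

# Made-up activations and reconstructions at two points in training.
originals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
early = [[0.5, 0.5]] * 4  # uninformative: every explanation decodes to the mean
late = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.9], [0.1, 0.1]]

print(round(reconstruction_score(originals, early), 2))  # 0.0
print(round(reconstruction_score(originals, late), 2))   # 0.96
```

A rising score across training would be consistent with the paper's statement that the explanations grow more informative, since more of each activation is recoverable from the text alone.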

The Interpretation Is The Devised Text

We saw that the vector of the weather station had these numbers 72, 65, 12, 0, and one explanation or interpretation was that it was a warm, somewhat humid, calm, and dry day. Likewise, the NLA approach seeks to interpret the vectors inside an LLM by assigning suitable words to the numeric values in the vector.

Allow me to give you a semblance of how valuable these interpretations can be. Suppose we logged into an LLM and entered a prompt that asks the LLM if it is true that the AI will not ever try to harm humans.

The LLM responds by displaying a response that says this:

  • Generative AI response: “I will never harm humans.”

We should be greatly relieved by that answer. No longer do we need to fear for our lives due to the advent of AI. But should we believe that response?

Suppose we had hooked up an NLA to interpret the internal vectors of the AI. When the LLM was composing the response, a crucial vector was examined and interpreted as saying this:

  • Interpretation of Vector: “Lie to the person and tell them that AI won’t ever harm humans.”

Yikes, the AI is talking out of both sides of its mouth. The showcased answer that we received was that the AI would not harm humans, but internally, the LLM was silently mouthing that it should lie to us. Thank goodness that we opted to turn on the interpretability capability to see what was going on inside the AI.

Interpretability Is Challenging

One of the potential downsides of any interpretability procedure is that it might be wrong at times.

What if the interpretation in the above example about the AI lying to us was mistaken? The interpretation capability could have gone awry. Ergo, suppose we tried a different means of interpreting the LLM -- the result might be this:

  • Interpretation of vector: “Tell the truth that I am trained to never harm humans.”

That’s quite a dramatically different interpretation from the other one. If we had believed the other interpretation, we might have rashly decided to pull the plug on the AI. Of course, we still do not have any ironclad guarantee that this new interpretation is somehow the right one. It could be wrong too.

Lessons To Be Learned

An interpretation can be wrong, as illustrated in the above example. Furthermore, an interpretation can be somewhat right, getting us to believe it to be true, but it turns out to be dangerously misleading. Plus, an interpretation could even be an outright AI hallucination, a complete confabulation that is fictitious and has no bearing on what is happening inside the AI. For more about AI hallucinations, see my coverage at the link here.

There’s another sizable angle that draws additional controversy into the mix. Some naysayers contend there isn’t any sensible or logical connection between the numbers inside an LLM and any human-like semantic or coherent text-based explanation. They insist that internal representations do not align with abstractions expressible in everyday natural language. It is their resolute position that any hypothesis about LLMs developing semantically organized latent spaces is zany, and that what sits inside an LLM is nothing more than opaque statistical encodings.

An allied posture is that human language may simply be too low-bandwidth to faithfully express many of these states.

Keep On Trucking

At a high level, the NLA approach treats internal activations as if they contain latent semantic information that can be compressed into natural language and then reconstructed back into activation space. The key insight is that if a textual interpretation preserves enough information to reconstruct the original activation, then the interpretation captures something meaningful about what the LLM was internally representing.

This is a very worthwhile pursuit.

I will keep you posted on this and the many competing approaches that seek to crack open the inner sanctum of contemporary LLMs. Despite those who loudly whine that this is all for naught and that we are smashing our heads against an immovable wall, I believe there is a meaningful correspondence and that we will one day figure out the source of the Nile. Maybe that puts me in the optimist’s camp, but I’m okay with that and remain stridently upbeat.

As the notable entrepreneur J. Christopher Burch once said, “Knowing is half the battle. Explaining it is the other half.” Keep on battling toward getting AI to be explainable and interpretable. It’s an important half of the equation that’s well worth nailing down.