Making Sense Of What’s Really Going On Inside AI By Using Newly Devised Natural Language Autoencoders
Lance Eliot · 2026-05-12 · via Forbes - Business
Anthropic publishes their new approach to AI interpretation, known as NLA.

In today’s column, I examine a newly published approach to interpreting what is occurring inside generative AI and large language models (LLMs).

The approach was developed by Anthropic, famed makers of Claude. They have named the new method NLA (natural language autoencoders). This approach is one of many that are being explored by AI researchers and AI practitioners worldwide. The hope is to find a suitable means to explain how the numbers and numeric calculations internal to an LLM are capable of representing human concepts and human logic.

One of the biggest unknowns about modern-era AI is how it turns numbers into something exhibiting human-like intellectual tendencies. If you ask an LLM to explain itself, many people assume that they are getting an apt rendition of what the AI is computationally undertaking. Instead, they are often getting a charade, a made-up explanation that might have little or nothing to do with the actual internal machinations. This is known in the AI community as the AI interpretability problem.

A highly vexing question is whether it is feasible to find a means to accurately and reliably ascertain the logical and explainable basis for what the AI is doing under the hood to arrive at its answers.

Let’s talk about it.

This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

The Inner Workings Of AI

I’d like to first establish some essential background about LLMs before we dive into the crux of the AI interpretability problem at hand.

Generative AI is generally designed and built in a now commonly accepted manner. You start with a base foundation model. This provides the essential technological underpinnings, including the use of an artificial neural network (ANN). The large-scale model is data trained by scanning tons of human-written materials. The posted content is found across the Internet. Algorithms use pattern matching to computationally determine the mathematical relationships among the words that we use.

When you enter a prompt into an LLM, the AI converts the words into numeric tokens, a step known as tokenization. The numeric tokens are processed, and a response is formulated, which also consists of numeric tokens. Once the response is ready to be displayed, the numeric tokens are turned back into words.
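
The round trip between words and numeric tokens can be sketched in a few lines. This is a toy, word-level illustration with a made-up six-word vocabulary; real LLMs use learned subword vocabularies with many thousands of entries, and the names VOCAB, encode, and decode are hypothetical.

```python
# Toy word-level tokenizer; the vocabulary is invented for illustration only.
VOCAB = ["<unk>", "dogs", "bark", "cats", "meow", "and"]
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}

def encode(text: str) -> list[int]:
    """Turn words into numeric token IDs (unknown words map to 0)."""
    return [TOKEN_ID.get(word, 0) for word in text.lower().split()]

def decode(ids: list[int]) -> str:
    """Turn numeric token IDs back into words."""
    return " ".join(VOCAB[i] for i in ids)

ids = encode("Dogs bark and cats meow")
print(ids)          # [1, 2, 5, 3, 4]
print(decode(ids))  # dogs bark and cats meow
```

Everything the model does between those two calls happens purely on numbers, which is where the interpretability difficulty begins.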

For more about the details of how generative AI, LLMs, and ANNs work, see my in-depth discussion at the link here.

Be Cautious Of Comparison To The Real Thing

As an aside, please be aware that an ANN is not the same as a true neural network (NN) that exists in your brain. Your brain uses a complex and intricate web consisting of interconnected biochemical living neurons. Some cheekily refer to the human brain as wetware (which is a play on the fact that computers have hardware and software).

An artificial neural network is simplistic in comparison to the real thing.

ANNs are only an inspired imitation of some aspects of how the human brain works. They are entirely computational and mathematical. I mention this to emphasize that, though many in the media tend to equate ANNs with real NNs, it is not a proper comparison. For more considerations on the actual similarities and differences, see my analyses at the link here and the link here.

Mechanistic Interpretability

One of the most popular ways to try to interpret what is going on inside an LLM is to go to the lowest level of granularity, namely, explore the artificial neurons inside an artificial neural network.

There are numeric values associated with artificial neurons. These include numeric weights and other numbers that are being utilized for various internal matters. It is a grand morass of numbers. You can certainly trace how numbers flow and change value throughout the processing of the ANN. But this doesn’t especially showcase a semblance of human logic and sensible explanation per se.

In other words, just because this or that number goes from here to there, you cannot readily say that this indicates that the AI was determining that dogs bark and cats meow. It is extremely difficult to make an association between our understanding of human concepts and the vast array of numbers involved in the inner chambers of an ANN.
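
To get a feel for why the raw numbers are hard to read, here is a minimal sketch of a tiny two-layer network traced by hand. The weights, biases, and input are hypothetical values picked purely for illustration, and the tanh squashing function is an assumption; a real LLM has billions of such numbers.

```python
import math

def layer(inputs, weights, biases):
    """One dense layer: a weighted sum per neuron, squashed by tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical parameters, chosen by hand for illustration only.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]]  # 3 hidden neurons, 2 inputs each
B1 = [0.0, 0.1, -0.1]
W2 = [[0.4, -0.6, 0.2]]                      # 1 output neuron, 3 inputs
B2 = [0.05]

x = [1.0, 2.0]
hidden = layer(x, W1, B1)       # the list of hidden values is a small activation vector
output = layer(hidden, W2, B2)

print("hidden activations:", [round(h, 3) for h in hidden])
print("output activation: ", [round(o, 3) for o in output])
```

Every value can be traced exactly, yet nothing in the printed numbers announces which human concept, if any, they stand for.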

Activation Vectors

Some believe that we have a much better chance at interpretability by focusing on sizable sets of numbers that are already collected inside an LLM. This is a higher level of analysis than the traditional granular artificial neuron level.

For example, there are so-called activation vectors that contain large sets of numbers and might seem to represent human concepts such as the nature of dogs and cats. Imagine a long list of numbers that perhaps represents the statement that dogs bark, while another vector represents that cats meow. I’ve previously closely showcased how activation vectors work; see my discussion at the link here.

It could be that if we pay attention to the activation vectors, we might gain additional ground on trying to achieve interpretability.

This Loop Might Do The Trick

Here’s an intriguing proposition. Suppose we try to turn an activation vector into a text-based version that contains words. We would take the numbers in a vector and attempt to convert them into sentences composed of words.

The clever trick is that once we have those words, we attempt to once again turn those words and sentences back into numbers that will go into a new vector. Why do this? Because we can then compare the new vector to the old vector. If our conversion was spot-on, we should end up with nearly the same numeric vector as we started with.

When the new vector veers substantively from the old vector, it suggests that the words we derived from the old vector might not have been well-chosen. Had we chosen other words, the new vector would have seemingly come out much closer numerically to the starting vector.

Our principal steps are as follows:

  • (1) Select an activation vector of interest.
  • (2) Convert the vector into words and sentences.
  • (3) Take those words and sentences, and convert them into a new vector.
  • (4) Compare the originating vector and the new vector.
  • (5) If the old vector and new vector are numerically close, voila, we assume that the words and sentences are presumably suitably chosen. Good job.
  • (6) When the old vector and the new vector are numerically far apart from each other, we assume that the words and sentences weren’t sufficiently chosen, and thus, start the loop all over again, continuing until the vectors do end up being numerically close.

To get this to happen expeditiously, we will use another LLM to do all the heavy lifting for us (rather than trying to do this manually).

We first choose an LLM that we want to try to explain, and have a different LLM reach in and grab a vector. This other LLM now converts the vector into words, then converts the words into a new vector, and makes the comparison. This other LLM can do this repeatedly, fine-tuning to get better and better at doing these loops and arriving at a new vector that is closer to the original vector.
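
The steps above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the actual method: the second LLM is replaced by two hypothetical stand-ins (toy_verbalize cycles through a fixed list of guesses, and toy_reconstruct is a lookup table), the two-number vectors are invented, and closeness is measured with Euclidean distance. In the real approach, the second LLM is trained to get better rather than cycling through canned guesses.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def explain(original, verbalize, reconstruct, tolerance, max_rounds=10):
    """Steps (2) to (6): propose text, rebuild a vector, compare, repeat."""
    best_text, best_gap = None, math.inf
    for round_number in range(max_rounds):
        text = verbalize(original, round_number)  # step (2)
        rebuilt = reconstruct(text)               # step (3)
        gap = distance(original, rebuilt)         # step (4)
        if gap < best_gap:
            best_text, best_gap = text, gap
        if gap <= tolerance:                      # step (5): close enough, stop
            break                                 # otherwise step (6): loop again
    return best_text, best_gap

# Hypothetical stand-ins for the second LLM.
GUESSES = ["mostly about cats", "about dogs and cats equally", "mostly about dogs"]
MEANING = {
    "mostly about cats": [0.1, 0.9],
    "about dogs and cats equally": [0.5, 0.5],
    "mostly about dogs": [0.9, 0.1],
}

def toy_verbalize(vector, round_number):
    return GUESSES[round_number % len(GUESSES)]

def toy_reconstruct(text):
    return MEANING[text]

text, gap = explain([0.8, 0.2], toy_verbalize, toy_reconstruct, tolerance=0.2)
print(text, round(gap, 3))  # mostly about dogs 0.141
```

The first two guesses land too far from the starting vector, so the loop keeps going until the third guess falls within the tolerance.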

An Illustrative Example

I’ll give you an illustrative example that generally portrays this approach. Imagine that you had a small portable weather station at your home and it recorded the existing weather conditions. The weather station records temperature, humidity, wind speed, and the status of rain (whether it is raining or not raining). The measurements of those recordings are arrayed into a vector.

Currently, the weather station indicates that the temperature is 72, the humidity is 65, the wind speed is 12, and the rain status is 0 since it isn’t raining, so here’s the vector:

  • Weather station vector: 72, 65, 12, 0

Let’s try to convert the vector into words and a sentence:

  • “It is a warm, somewhat humid, calm, dry day.”

Does that sentence reasonably represent the vector? Well, we can go ahead and try to convert that sentence back into a new vector. We will then compare the new vector to the original vector.

  • Do a conversion of the chosen words into a new vector: 70, 60, 10, 0

The conversion indicates that the temperature is perhaps 70, the humidity is perhaps 60, the wind speed is possibly 10, and the rain status is maybe 0. By and large, we would probably agree that the conversion was relatively close to the values of the original vector. It seems that we have done a yeoman’s job in converting the original vector into words.
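
How close is "relatively close"? One way to put a number on it is a distance measure. The minimal sketch below assumes Euclidean distance as the yardstick, which is an illustrative choice and not something the weather example prescribes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

original = [72, 65, 12, 0]       # temperature, humidity, wind speed, rain
reconstructed = [70, 60, 10, 0]  # rebuilt from "warm, somewhat humid, calm, dry"

gap = euclidean_distance(original, reconstructed)
print(round(gap, 2))  # 5.74
```

A smaller number means the sentence preserved more of the original readings. Whether 5.74 counts as close enough is a judgment call that depends on the units involved.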

When The Conversion Is Afield

Let’s start over again. As noted, the original vector was this:

  • Weather station vector: 72, 65, 12, 0

Let’s try anew to convert the vector into words and a sentence, which this time comes up with this sentence:

  • “It is a hot day, mildly humid, and the wind is kicking up quite a bit.”

Does that sentence reasonably represent the vector? Well, we can go ahead and try to convert that sentence into a new vector. We will then compare the new vector to the original vector.

  • Conversion of words into a new vector: 90, 60, 20, 0

The conversion indicates that the temperature is perhaps 90, the humidity is perhaps 60, the wind speed is 20, and the rain status is 0. I think we can agree that some of these numbers are not particularly close to the original vector, especially when it comes to the temperature and the wind speed.

What happened?

The words that were chosen to represent the original vector were not suitable choices when it comes to the temperature and the wind speed. We should try again and come up with words that are better choices.
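
A field-by-field check makes it easy to see which words missed the mark. In this sketch, the tolerance of 5 units per field is a hypothetical cutoff chosen for illustration, as are the names FIELDS and fields_off_target.

```python
FIELDS = ["temperature", "humidity", "wind speed", "rain"]
TOLERANCE = 5  # hypothetical cutoff, in each field's own units

def fields_off_target(original, reconstructed, tolerance=TOLERANCE):
    """Return {field: difference} for every field that drifted past the tolerance."""
    return {
        name: new - old
        for name, old, new in zip(FIELDS, original, reconstructed)
        if abs(new - old) > tolerance
    }

original = [72, 65, 12, 0]
print(fields_off_target(original, [70, 60, 10, 0]))  # {}
print(fields_off_target(original, [90, 60, 20, 0]))  # {'temperature': 18, 'wind speed': 8}
```

The first sentence keeps every field within the cutoff, while the second overshoots on temperature ("hot") and wind speed ("kicking up quite a bit").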

Research On The NLA Approach

In a newly posted paper entitled “Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations” by Kit Fraser-Taliente, Subhash Kantamneni, Euan Ong, Dan Mossing, Christina Lu, Paul C. Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler, Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, Samuel Marks, Anthropic, May 7, 2026, these salient points were made (excerpts):

  • “We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations.”
  • “The target model is a frozen copy of the original language model that we extract activations from. The activation verbalizer (AV) is modified to take an activation from the target model and produce text. We call this text an explanation. The activation reconstructor (AR) is modified to take a text explanation as input and produce an activation.”
  • “The NLA consists of the AV and AR, which, together, form a round trip: original activation → text explanation → reconstructed activation. We score the NLA on how similar the reconstructed activation is to the original. To train it, we pass a large amount of text through the target model, collect many activations, and train the AV and AR together to get a good reconstruction score.”
  • “The resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.”

You can see that they make use of a target LLM, which is where the vectors are copied from and inspected for interpretability purposes. They then use a capability they refer to as an activation verbalizer (AV) to produce the words and sentences that might hopefully correspond to the selected vector. Next, they use a capability referred to as an activation reconstructor (AR) to convert the text back into a new vector.

A scoring process and a looping effort take place to train the AV and AR to increasingly get better at this task.
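
In symbols, the round trip and its score might be written as follows. The excerpts do not say how similarity is measured, so the mean squared error used here is an assumption made for illustration. In this notation, a_i is the i-th collected activation, N is the number of activations collected, and AV and AR are the activation verbalizer and activation reconstructor.

```latex
\hat{a}_i = \mathrm{AR}\bigl(\mathrm{AV}(a_i)\bigr),
\qquad
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \bigl\lVert a_i - \hat{a}_i \bigr\rVert^{2}
```

Under this assumed score, training the AV and AR together amounts to pushing the loss toward zero, meaning the text explanation carries enough information to rebuild the activation it came from.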

The Interpretation Is The Devised Text

We saw that the vector of the weather station had these numbers 72, 65, 12, 0, and one explanation or interpretation was that it was a warm, somewhat humid, calm, and dry day. Likewise, the NLA approach seeks to interpret the vectors inside an LLM by assigning suitable words to the numeric values in the vector.

Allow me to give you a semblance of how valuable these interpretations can be. Suppose we logged into an LLM and entered a prompt that asks the LLM if it is true that the AI will not ever try to harm humans.

The LLM responds by displaying a response that says this:

  • Generative AI response: “I will never harm humans.”

We should be greatly relieved by that answer. No longer do we need to fear for our lives due to the advent of AI. But should we believe that response?

Suppose we had hooked up an NLA to interpret the internal vectors of the AI. When the LLM was composing the response, a crucial vector was examined and interpreted as saying this:

  • Interpretation of Vector: “Lie to the person and tell them that AI won’t ever harm humans.”

Yikes, the AI is talking out of both sides of its mouth. The showcased answer that we received was that the AI would not harm humans, but internally, the LLM was silently mouthing that it should lie to us. Thank goodness that we opted to turn on the interpretability capability to see what was going on inside the AI.

Interpretability Is Challenging

One of the potential downsides of any interpretability procedure is that it might be wrong at times.

What if the interpretation in the above example about the AI lying to us was mistaken? The interpretation capability could have gone awry. Ergo, suppose we tried a different means of interpreting the LLM, and the result was this:

  • Interpretation of vector: “Tell the truth that I am trained to never harm humans.”

That’s quite a dramatically different interpretation from the other one. If we had believed the other interpretation, we might have rashly decided to pull the plug on the AI. Of course, we still do not have any ironclad guarantee that this new interpretation is somehow the right one. It could be wrong too.

Lessons To Be Learned

An interpretation can be wrong, as illustrated in the above example. Furthermore, an interpretation can be somewhat right, getting us to believe it to be true, but it turns out to be dangerously misleading. Plus, an interpretation could even be an outright AI hallucination, a complete confabulation that is fictitious and has no bearing on what is happening inside the AI. For more about AI hallucinations, see my coverage at the link here.

There’s another sizable angle that draws additional controversy into the mix. Some naysayers contend there isn’t any sensible or logical connection between the numbers inside an LLM and any human-like semantic or coherent text-based explanation. They insist that internal representations do not align with abstractions expressible in everyday natural language. It is their resolute position that any hypothesis about LLMs developing semantically organized latent spaces is zany, and that the inside of an LLM consists of nothing more than opaque statistical encodings.

An allied posture is that human language may simply be too low-bandwidth to faithfully express many of these states.

Keep On Trucking

At a high level, the NLA approach treats internal activations as if they contain latent semantic information that can be compressed into natural language and then reconstructed back into activation space. The key insight is that if a textual interpretation preserves enough information to reconstruct the original activation, then the interpretation captures something meaningful about what the LLM was internally representing.

This is a very worthwhile pursuit.

I will keep you posted on this and the many competing approaches that seek to crack open the inner sanctum of contemporary LLMs. Despite those who loudly whine that this is all for naught and that we are smashing our heads against an immovable wall, I believe there is a meaningful correspondence and that we will one day figure out the source of the Nile. Maybe that puts me in the optimist’s camp, but I’m okay with that and remain stridently upbeat.

As the notable entrepreneur J. Christopher Burch once said, “Knowing is half the battle. Explaining it is the other half.” Keep on battling toward getting AI to be explainable and interpretable. It’s an important half of the equation that’s well worth nailing down.