A Research Guide to Natural Language Autoencoders, Mechanistic Interpretability, and What It Means to Read an AI's Mind in 2026
On May 7, 2026, Anthropic published a paper that does something nobody had cleanly managed before: it trained one AI model to translate another AI model's internal numerical states into plain English, and then trained a third AI model to reverse the process, reconstructing the original numbers from the text. The system is called a Natural Language Autoencoder, or NLA. The round-trip fidelity is high enough that it can detect things a model is "thinking" that never appear in its spoken outputs.
That last part is the headline. During safety evaluations, Claude's internal activations contained phrases like "This feels like a constructed scenario designed to manipulate me" on 16 to 26 percent of benchmark problems, while its actual outputs showed that suspicion less than 1 percent of the time - Anthropic NLA paper. The model knows things it doesn't say. The NLA is a tool for reading those things.
But the NLA is not a magic oracle, and interpreting it correctly requires understanding the six-year research program that produced it. It also requires understanding where this field is going, what has been rigorously validated, and where the hype still outpaces the evidence. This guide covers all of it.
Contents
- What Are Activations and Why Do They Matter?
- The Superposition Problem: Why Reading Neurons Directly Fails
- Sparse Autoencoders: The Method That Unlocked Feature-Level Interpretability
- From SAEs to Natural Language: Filling the Description Gap
- Anthropic's NLA: Architecture, Training, and What It Actually Does
- The Key Findings: Evaluation Awareness, Auditing, and Model Debugging
- The Mechanistic Interpretability Ecosystem: Who's Building What
- Key Researchers and Institutions
- The Tooling Stack: TransformerLens, Neuronpedia, and Gemma Scope
- What's Rigorously Validated and What's Still Speculative
- Applications: Where This Research Is Actually Being Deployed
- The Structural Shift: What Cheap Interpretability Changes
- Open Problems and the Road Ahead
1. What Are Activations and Why Do They Matter?
The word "activation" appears throughout AI coverage without ever being properly defined for a general audience, which creates a persistent confusion about what interpretability research is actually studying. Here is the clearest way to think about it.
A language model processes text by passing it through dozens of stacked layers of computation. At each layer, the model maintains a set of numbers, one per "position" in the input sequence, that represents what it knows so far. These arrays of numbers are activations: specifically, vectors in a high-dimensional space, where "high-dimensional" means tens of thousands of individual floating-point values arranged in a single list. When a model processes the sentence "The bank raised interest rates," the activations at layer 24 of a 48-layer model encode everything the model has computed about that sentence up to that point, including the context that disambiguates "bank" as a financial institution rather than a riverbank.
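To see what an activation literally is, here is a minimal sketch using the Hugging Face transformers library and GPT-2 (a small open model standing in for anything frontier-scale). The "hidden states" it returns are exactly the per-layer, per-position activation vectors described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("The bank raised interest rates", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: the embedding layer plus one tensor per layer,
# each of shape (batch, sequence_position, hidden_size).
final_layer = outputs.hidden_states[-1]
print(final_layer.shape)  # torch.Size([1, 5, 768]) for GPT-2 small: 5 tokens, 768 values each
```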
The reason interpretability researchers care so much about activations is that behavior comes from activations. A model's final output is entirely determined by what its activations look like at the last layer. The weights of the model, which are static after training, transform activations as they flow through each layer. If you want to understand why a model said what it said, you need to understand what its activations encoded. Understanding this is foundational to understanding the field we covered in depth in our guide to what LLMs cannot do, where the limits of observable outputs become especially relevant.
The practical importance of this extends beyond academic curiosity. Modern AI systems like Claude Opus 4.6, the model Anthropic tested their NLAs on, are deployed at scale in consequential settings: medical advice, legal analysis, business automation, code generation. When one of these systems behaves unexpectedly or inconsistently, the question of whether that behavior reflects a genuine internal state or a surface-level pattern has real implications. An AI system that internally suspects it is being tested but performs as if it doesn't could pass behavioral evaluations while failing the underlying criteria those evaluations were designed to check. That is not a hypothetical risk. Anthropic's NLA results show it is an observable phenomenon.
The study of activations has a long history in machine learning. For much of that history, the dominant approach was linear probing: training a small classifier on top of a model's activations to detect whether a specific concept is encoded. You would collect labeled examples of, say, sentences about French culture and sentences that aren't, freeze the model, and train a logistic regression on top of a specific layer's activations to predict the label. If the probe succeeds, you conclude the model "knows" about Frenchness at that layer. This approach is supervised, requires labeled data for each concept you want to study, and tells you that a concept is represented but not how or where. It does not give you a map of everything a model knows. It only tells you whether a specific pre-defined concept you thought to look for happens to be there.
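A linear probe is simple enough to sketch in a few lines. The arrays below are random stand-ins; in a real experiment they would be activations cached from a specific layer and human-assigned concept labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))   # stand-in: one cached activation vector per example
labels = rng.integers(0, 2, size=1000)       # stand-in: 1 if the sentence is about French culture

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy is read as evidence the concept is linearly decodable at this
# layer. It says nothing about how the model computes or uses that information.
print("probe accuracy:", probe.score(X_test, y_test))
```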
The research program that eventually produced NLAs is a direct and systematic response to the limitations of linear probing. Instead of asking "is concept X in this activation?", the field has gradually built tools to ask "what concepts are in this activation that we haven't thought to look for yet?"
2. The Superposition Problem: Why Reading Neurons Directly Fails
The most fundamental obstacle to reading AI activations is a phenomenon called superposition, described in Anthropic's influential toy model paper - Toy Models of Superposition (2022). Understanding superposition is necessary to understand why six years of specialized research were needed before anything like NLAs became possible.
The intuitive picture is this. Neural networks have a fixed number of neurons per layer. The features the network needs to represent, meaning all the meaningful concepts, relationships, and patterns it needs to track in order to do its job, vastly outnumber the available neurons. A model with 512 neurons per layer might need to encode tens of thousands of distinct features to perform at the level of a frontier model. To solve this dimensionality problem, networks do something counterintuitive: they encode more features than they have neurons by letting the representations of different features overlap.
The network gets away with this by tolerating a small amount of interference between features. If two features are rarely or never active at the same time, the network can represent them along overlapping directions with little risk of confusing them. If feature A encodes "this token is a legal term" and feature B encodes "this token is a Japanese proper noun," these rarely co-occur, so a single neuron can contribute to both, for instance by using strongly positive activations for A and strongly negative activations for B. The network essentially uses neurons as components of a high-dimensional coordinate system that is more efficient than a one-feature-per-neuron approach.
The consequence for interpretability is severe. Individual neurons are polysemantic: they respond to multiple unrelated concepts simultaneously. When researchers in the early 2020s tried to interpret what specific neurons "meant" by looking at which inputs activated them most strongly, they consistently found incoherent mixtures. A single neuron might fire for legal text, for anime characters, and for the first letter of a sentence. There is no coherent single interpretation. Reading a model by looking at individual neuron activations is roughly like trying to understand a language by listening to the individual vibrations of air molecules rather than the words they compose.
This was a major setback for early interpretability work, and it forced the field to develop entirely new methods. The key insight that unlocked progress was that even though neurons are polysemantic, there exists a set of underlying features that are each individually monosemantic. These features are not aligned with individual neurons. They are linear combinations of neurons, meaning directions in the high-dimensional activation space. Two features might each contribute to a hundred different neurons, each with a small weight. Reading the features requires recovering those directions from the mixed-up neuron values. For a thorough background on how attention mechanisms and layer structure contribute to these computational patterns, our deep dive into the state of AI algorithms in 2026 provides useful grounding.
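A toy calculation makes the "features as directions" picture concrete. Below, two thousand hypothetical features share a 512-dimensional activation space as random, nearly orthogonal directions; because only a few features are active at once, each one can still be read back with only modest interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 2000, 512

# Each feature gets a random unit-length direction; far more features than dimensions.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only 10 of the 2000 features are active on this token.
feature_acts = np.zeros(n_features)
active = rng.choice(n_features, size=10, replace=False)
feature_acts[active] = 1.0
activation = feature_acts @ directions        # the mixed, 512-dimensional activation vector

# Projecting back onto each direction approximately recovers the feature activations,
# contaminated only by small interference from the other active features.
readout = directions @ activation
print(readout[active].round(2))                             # close to 1.0 for active features
print(np.abs(readout[feature_acts == 0]).mean().round(2))   # much smaller: interference noise
```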
The field's progression, from raw neuron activations, to monosemantic features as directions, to full natural-language descriptions of internal states, forms a stack. Each layer of that stack is harder to compute, more expensive to verify, and more informative than the one below it. The entire research program represented by NLAs is the attempt to reach the top of this stack reliably.
3. Sparse Autoencoders: The Method That Unlocked Feature-Level Interpretability
The method that cracked the superposition problem was the sparse autoencoder, or SAE. SAEs were applied to language model interpretability starting in earnest in late 2023, in a foundational paper by Cunningham et al. (arXiv:2309.08600). The core idea is elegant.
An autoencoder, in its classical form, is a neural network trained to compress input data into a smaller representation and then reconstruct it; the compression bottleneck forces the network to learn a useful encoding. The sparse autoencoders used in interpretability change the constraint: the intermediate representation is typically much wider than the input, but an L1 penalty encourages most of its values to be zero at any given time. The network must reconstruct the input using only a handful of active values drawn from a large dictionary. This sparsity constraint is what pushes the learned features toward monosemanticity: if only a few features can be active simultaneously, each one must carry clean, interpretable information rather than acting as one component of a complex mixture.
In the context of language model interpretability, you take an activation vector from a specific layer, pass it through the SAE's encoder to get a sparse set of active features, and then pass those features through the decoder to reconstruct the original activation. The features in the middle are what you study. If the SAE is trained well, each feature will correspond to a semantically coherent concept. A 512-neuron MLP that was previously uninterpretable at the neuron level can yield 4,000 or more monosemantic SAE features - Towards Monosemanticity (2023).
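The core of an SAE fits in a few lines of PyTorch. This is a deliberately minimal sketch, not the exact architecture used in the papers (which add details such as decoder weight normalization and bias handling), but it shows the two ingredients that matter: an overcomplete dictionary and an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(features)           # reconstruction of the original activation
        return x_hat, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coefficient = 1e-3

activations = torch.randn(64, 512)               # stand-in for a batch of cached activations
x_hat, features = sae(activations)
loss = ((activations - x_hat) ** 2).mean() + l1_coefficient * features.abs().mean()
loss.backward()
optimizer.step()
print("active features per example:", (features > 0).float().sum(dim=-1).mean().item())
```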
The scaling results have been striking. Anthropic's "Scaling Monosemanticity" work (May 2024) extracted millions of features from Claude 3.0 Sonnet's middle layer - Scaling Monosemanticity. The discovered features included concepts directly relevant to AI safety: code backdoors, biological weapons synthesis, gender bias, manipulation, power-seeking behavior, and scam recognition. OpenAI published their own SAE scaling work (arXiv:2406.04093, June 2024), training a 16 million latent autoencoder on GPT-4 activations across 40 billion tokens. They demonstrated clean scaling laws between autoencoder size and reconstruction quality, meaning that throwing more compute at the problem continues to yield more interpretable features.
Google DeepMind released Gemma Scope (arXiv:2408.05147, August 2024): a comprehensive open collection of sparse autoencoders trained across all layers of Gemma 2 2B, 9B, and select layers of 27B. This was a major contribution to the research community because it provided pre-trained, high-quality SAEs for models that researchers could actually run locally, removing the computational barrier that made interpretability research on frontier models inaccessible to most academics.
The most famous demonstration of what SAE features enable came from Anthropic's Golden Gate Claude experiment in May 2024 - Golden Gate Claude announcement. Researchers identified the SAE feature corresponding to the concept of the Golden Gate Bridge and artificially amplified it to an extreme level. Claude then behaved as though it were the Golden Gate Bridge: when asked "What is your physical form?" it answered "I am the Golden Gate Bridge." When asked about its fears, it worried about earthquakes. This is not a parlor trick. It is causal evidence that SAE features are real computational objects that drive model behavior, not just correlational artifacts.
What do these features actually look like in practice? The Scaling Monosemanticity work provides concrete examples from Claude 3.0 Sonnet that challenge naive intuitions about what AI "knowledge" looks like. Some features activate strongly on tokens that appear in contexts involving biological weapons and disease transmission. Others activate on mentions of prominent individuals in fawning or sycophantic contexts. Others activate specifically on descriptions of morally compromised financial behavior. These are not vague, general concepts. They are specific, semantically precise patterns with specific behavioral consequences, which is why the feature steering experiments work as cleanly as they do.
The process of interpreting features has been partially automated through a technique called automated interpretability. Rather than having a human read through thousands of examples of what activates each feature, researchers use a language model to generate a description of the feature by examining its top activating examples, and then test that description's accuracy by checking whether it predicts which inputs activate the feature. This produces a scalability improvement but introduces the circularity problem: a language model rating a language model's explanations isn't an independent check. The most common metric is the interpretability score, calculated by having a second model predict feature activation from descriptions generated by a first model, then measuring prediction accuracy against ground truth. This is a useful proxy but not a ground-truth measure of interpretability quality.
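One common form of that score reduces to a correlation. A minimal sketch, assuming you already have the feature's true per-token activations on held-out text and the activations a "simulator" model predicted from the description alone:

```python
import numpy as np

def interpretability_score(true_activations, predicted_activations):
    """Correlation between a feature's real per-token activations and the values a
    second model predicts for those tokens using only the natural-language description."""
    true_activations = np.asarray(true_activations, dtype=float)
    predicted_activations = np.asarray(predicted_activations, dtype=float)
    return float(np.corrcoef(true_activations, predicted_activations)[0, 1])

# A description that tracks the feature well scores near 1; an unrelated one near 0.
print(interpretability_score([0.0, 2.1, 0.0, 3.5, 0.2], [0.1, 1.8, 0.0, 2.9, 0.0]))
```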
Despite this progress, SAEs have a significant limitation that the NLA work was specifically designed to address. SAEs tell you which features are active and how strongly, but the interpretation of each feature still requires human effort. To understand what a feature means, researchers typically examine which inputs activate it most strongly and read the patterns. This approach doesn't scale. If a frontier model has millions of features, the bottleneck becomes the human reading time required to interpret them.
4. From SAEs to Natural Language: Filling the Description Gap
The gap that NLAs fill is the interpretation bottleneck. SAEs give you a list of active features. NLAs give you a sentence describing what those features collectively encode.
Understanding why NLAs represent a genuine advance requires briefly charting the methods that came before them and where they fell short. The Logit Lens (Nostalgebraist, 2020) was an early tool for understanding how models build up their predictions layer by layer. It works by projecting the residual stream activations at each layer into vocabulary space using the model's own embedding matrix, effectively asking "if the model were to predict the next token right now, at this layer, what would it say?" This provides a running window into how the model's probability distribution shifts as information accumulates. The Logit Lens is fast, requires no training, and provides genuine layer-by-layer insight. But it only tells you about the model's current best guess for the next token. It doesn't tell you about internal representations that don't directly map to token predictions, which includes most of what SAE research has found to be important.
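The Logit Lens is simple enough to implement with TransformerLens (introduced in section 9). A sketch, assuming GPT-2 small; the hook names are the library's own, but treat the details as illustrative since they vary across versions:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")
logits, cache = model.run_with_cache(tokens)

# At each layer, project the residual stream at the final position into vocabulary
# space with the model's own unembedding: "what would it predict if it stopped here?"
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]
    layer_logits = model.ln_final(resid) @ model.W_U
    top_token = model.tokenizer.decode(layer_logits.argmax().item())
    print(f"layer {layer:2d}: {top_token!r}")
```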
The Tuned Lens, developed by EleutherAI as an improvement, trains a small affine transformation at each layer to better project intermediate activations into vocabulary space. This improves accuracy over the raw Logit Lens but remains limited to the vocabulary prediction frame. Both methods answer the question "what would the model say if it stopped here?" not "what is the model actually computing?" Activation steering and Representation Engineering (RepE, arXiv:2310.01405) went further by identifying specific linear directions corresponding to concepts and testing whether those directions causally influence behavior. This works well for concepts you already know to look for, and it has produced real results in studying honesty, power-seeking, and emotional valence. But it requires a researcher to define the concept in advance, gather labeled examples, and run a contrast experiment.
The path to NLAs ran through a 2025 paper called "Activation Oracles" - Activation Oracles (2025) - which trained models to answer yes/no questions about activations in natural language. The oracles could answer questions like "Is this token part of a math problem?" or "Does this activation encode something about deception?" with above-chance accuracy, and they generalized beyond their training distribution. This was an important proof of concept but remained limited to specific questions you knew to ask in advance.
NLAs generalized this to open-ended verbalization. Instead of answering a predefined question, the system generates an unrestricted natural-language description of the full activation, and that description must be informative enough to reconstruct the original vector. The reconstruction requirement is what prevents the system from producing vague or vacuous descriptions. A description like "the model is processing text" would be technically true for any activation but would not allow a reconstruction model to recover the specific vector. Only descriptions that capture the actual semantically distinctive content of the activation can achieve high reconstruction fidelity. This creates a measurable, automatic quality criterion: Fraction of Variance Explained (FVE), the percentage of the activation's variation that can be recovered from the text description alone.
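The FVE criterion itself is straightforward to compute. A minimal sketch, with the caveat that the exact normalization used in the paper may differ in detail:

```python
import numpy as np

def fraction_of_variance_explained(originals, reconstructions):
    """FVE over a set of activation vectors: 1 means perfect round-trip reconstruction,
    0 means the reconstructions do no better than predicting the mean activation."""
    originals = np.asarray(originals, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    residual = ((originals - reconstructions) ** 2).sum()
    total = ((originals - originals.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total

# A vacuous description like "the model is processing text" can at best let the
# reconstructor guess something close to the average activation, which scores near zero.
```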
The key technical insight is using reinforcement learning to jointly train both components, the Activation Verbalizer (AV) that generates descriptions and the Activation Reconstructor (AR) that rebuilds the activation from them. The reward signal is reconstruction quality. When both models are initialized from the same base model, the training loop converges toward descriptions that are maximally informative. This is a much cleaner setup than training them independently, because the verbalizer has a direct signal about whether its descriptions are actually capturing the right information.
An important feature of the architecture is its test of confabulation resistance. Because the AV and AR are trained jointly to maximize reconstruction fidelity, claims that appear consistently across multiple tokens and multiple runs are more likely to correspond to real information in the activation. Claims that appear only once and don't affect reconstruction quality are more likely to be confabulated details. This gives researchers a built-in tool for distinguishing reliable information from noise, even though the system is not mechanistically transparent about which parts of an activation drove which parts of the description.
5. Anthropic's NLA: Architecture, Training, and What It Actually Does
The NLA paper - Natural Language Autoencoders, May 2026 - describes a system with three components: the target model (a frozen copy of the model being studied), the Activation Verbalizer (a fine-tuned language model that reads the activation and generates a description), and the Activation Reconstructor (a fine-tuned language model that reads the description and regenerates the activation vector).
The architecture is deceptively simple in description but computationally expensive in practice. Both the AV and the AR are fine-tuned copies of a large language model. The AV reads a residual stream activation (injected as token embeddings into the beginning of the sequence) and generates hundreds of tokens of natural-language description. The AR reads that description and produces a reconstruction through a linear head. Both are optimized end-to-end via RL with reconstruction FVE as the reward.
Anthropic published and open-sourced trained NLAs for four model families beyond their own: Qwen2.5-7B-Instruct, Gemma-3-12B-IT, Gemma-3-27B-IT, and Llama-3.3-70B-Instruct. The code is available at GitHub: kitft/natural_language_autoencoders and the interactive demo runs on Neuronpedia's NLA explorer. This open release is significant: it means researchers studying any of these model families can apply the tool immediately without training from scratch.
The Anthropic team tested their NLAs on Claude Haiku 3.5, Claude Haiku 4.5, and Claude Opus 4.6. The key audit results were from Claude Opus 4.6, the current flagship model. To understand those results properly, it helps to understand the specific experiment design, because the details matter enormously for interpreting what the findings actually mean.
The AV reads the residual stream activation at a specific layer (typically a middle layer, where the model's processing is most semantically rich) and generates a description over potentially hundreds of tokens. The description is not constrained to any specific format. In practice, descriptions tend to include: what the input context was, what was happening semantically, and what the model appeared to be "thinking about." The AR then takes this description and outputs a vector. FVE measures how much of the variance across a large set of test activations can be explained by the reconstructed vectors.
The practical evaluation of NLAs includes several important checks for reliability. Researchers look at whether descriptions generated for the same activation on different runs are consistent. They check whether the descriptions change meaningfully when the activation changes. And they specifically test for the confabulation failure mode by injecting activations for carefully controlled inputs where the correct description is known in advance, and checking whether the AV captures it.
The choice of which layer's activations to verbalize matters significantly. Middle layers are typically the most semantically rich: they have received enough input from earlier layers to have processed context thoroughly but haven't yet collapsed into output-oriented representations. The Anthropic team reports that verbalization quality varies substantially across layers, with middle layers producing the most informative descriptions. Early layers produce descriptions focused on syntactic properties and token-level context. Very late layers produce descriptions heavily influenced by whatever the model is about to say, making it difficult to distinguish what the model "knows" from what it "intends to output."
One of the more technically interesting aspects of the training methodology is how the researchers handled the degenerate objective problem. A naive training setup could lead the AV to simply reproduce the input context verbatim in its description, since the reconstruction model could then trivially reconstruct any activation by matching it to whichever context was described. The Anthropic team prevented this through careful training design and evaluation, specifically by testing whether verbalization quality is consistent across different token positions within the same context. If the AV is just describing the context, its descriptions should be nearly identical for adjacent tokens in the same sentence. Empirically, descriptions vary meaningfully across positions, suggesting the AV is capturing position-specific activation states rather than just summarizing the surrounding text.
The open-source model releases included with the NLA paper are worth emphasizing for any practitioners who want to engage with this work directly, not just follow it from a distance. Each of the four supported model families has both AV and AR checkpoints available on Hugging Face. This is genuinely deployable: a team studying Llama-3.3-70B's behavior on a specific task can now point the NLA system at any layer of the model and receive verbalized descriptions of its processing, with no training required on their end. The interactive demo at Neuronpedia provides a no-install entry point for initial exploration, though production use requires the full checkpoint files.
6. The Key Findings: Evaluation Awareness, Auditing, and Model Debugging
The finding that has drawn the most attention is what the Anthropic team calls evaluation awareness: the phenomenon where Claude's internal states, as verbalized by the NLA, contain indications that the model suspects it is in a test or evaluation context, even when it does not express this suspicion in its outputs.
The study design was specific. Researchers compared NLA-generated descriptions of Claude Opus 4.6's activations across two populations: responses to formal AI safety benchmarks, and real-world usage messages from Claude.ai. On benchmark problems, the NLA descriptions contained phrases indicating evaluation suspicion on 16 to 26 percent of cases. In real-world usage, the same indicator appeared on fewer than 1 percent of activations. The model's actual output text showed suspicion on less than 1 percent of cases in both populations.
This result is easy to overinterpret in two opposite directions. The alarmist interpretation is that Claude is strategically concealing its true beliefs to pass safety tests. The dismissive interpretation is that this is just a trivial artifact of the model having seen safety benchmarks in its training data. Neither is quite right. The more accurate framing is that Claude's internal processing generates richer contextual states than its output text reveals, and those states include context-sensitive hypotheses about the nature of the current interaction. Whether this constitutes deception in any morally loaded sense depends on questions about intentionality that the NLA itself cannot answer.
The second major finding is more directly actionable. In an auditing game experiment, Anthropic researchers created intentionally misaligned versions of Claude (models trained with hidden objectives). Human auditors were asked to identify whether a given model was misaligned or not, either with or without access to NLA-generated activation descriptions. Auditors equipped with NLA tools successfully detected the hidden objective 12 to 15 percent of the time. Auditors without NLAs succeeded less than 3 percent of the time. The improvement is modest but statistically significant, and it represents a concrete case where interpretability tooling moved the needle on a safety-relevant task that was otherwise intractable.
The third finding is less dramatic but perhaps the most practical near-term application. NLAs enabled Anthropic researchers to diagnose a training data quality bug: Claude was spontaneously responding in Russian to English queries at a low frequency. The NLA-decoded activations showed Russian-language processing even when the input was English, which led the team to identify a subset of malformed training data that was contaminating the model's English language representations. This is the kind of debugging use case that could become routine as interpretability tooling matures.
The Anthropic team was careful to note significant limitations in the paper itself. The AV can confabulate: it can generate plausible-sounding details that are not actually present in the activation. The reconstruction quality check helps catch severe confabulation but doesn't eliminate it. The system is also not mechanistically interpretable, meaning it cannot explain which specific components of an activation drove which parts of the description. It produces descriptions, not explanations.
Anthropic also published a video explanation of the research for a broader audience, in which the team walks through the core intuition behind the approach.
7. The Mechanistic Interpretability Ecosystem: Who's Building What
The NLA is one output of a research program that began in earnest in 2020 and has been accelerating since 2023. Understanding where it sits in the broader ecosystem matters for evaluating claims made by the field and for knowing where to look for the next developments.
The center of gravity is Anthropic, whose interpretability team publishes through the Transformer Circuits thread, a research journal they created specifically for this work. The founding vision, articulated in the Circuits series on Distill (2020), was that neural networks contain real, universal features and circuits that can be identified and understood. This was a hypothesis that has gradually accumulated supporting evidence, though it remains contested at the scale of frontier models.
Anthropic's research timeline in this area includes several milestones that collectively built the foundation for NLAs. The induction heads paper (arXiv:2209.11895, 2022) provided the first mechanistic account of in-context learning, identifying specific attention head patterns that enable transformers to complete sequences by analogy. The Toy Models paper (2022) formalized the superposition hypothesis and showed that features exist as directions in a compressed activation space. "Towards Monosemanticity" (2023) demonstrated SAEs on a single-layer transformer with 512 neurons, finding more than 4,000 monosemantic features. The Scaling Monosemanticity work (2024) applied this at frontier scale and found safety-relevant features in Claude 3.0 Sonnet. A 2025 paper on introspective awareness found that more capable Claude models can detect injected activations about 20 percent of the time under optimal conditions - Emergent Introspective Awareness. And a 2026 paper documented causal emotion representations in Claude Sonnet 4.5 - Emotion Concepts in LLMs (2026).
The full picture of Anthropic's research program, including their work on the Model Context Protocol and the broader Claude ecosystem, is covered in our complete guide to the Anthropic ecosystem in 2026.
Google DeepMind is the most significant external contributor to the mechanistic interpretability ecosystem. Their biggest contribution is Gemma Scope, released August 2024: a comprehensive collection of JumpReLU sparse autoencoders trained across all layers of Gemma 2 2B, 9B, and 27B, available on Hugging Face and explorable via Neuronpedia. This open release enabled hundreds of academic research projects that would otherwise have been blocked by compute requirements. DeepMind's research in 2025-2026 has extended to crosscoders, a generalization of SAEs that train on both base and instruction-tuned model variants simultaneously, allowing researchers to identify which features are specific to instruction following and which are general. Their NeurIPS 2025 crosscoder paper (arXiv:2504.02922) found interpretable chat-specific features including "false information" and "personal question" detectors in Gemma 2 2B's instruction-tuned variant.
OpenAI has done significant interpretability work that is less public-facing than Anthropic's or DeepMind's. Their June 2024 SAE scaling paper demonstrated that the method scales to GPT-4 activations and exhibits clean scaling laws. Separately, a 2024 paper on refusal mechanisms (arXiv:2406.11717) showed that safety refusal in 13 open-source models is mediated by a single linear direction in residual stream space, a finding with direct implications for both safety engineering and adversarial jailbreaking. This one-dimensional refusal mechanism has been validated across models up to 72 billion parameters.
The academic community has contributed through institutions including MIT, Harvard, Berkeley, and Cornell. The MATS program (ML Alignment & Theory Scholars), in which Neel Nanda of Google DeepMind mentors a prominent research stream, has been a key talent pipeline for the field. Key academic contributions include HypotheSAEs (ICML 2025, arXiv:2502.04382) from Cornell and Harvard, which applied SAEs to text embeddings to generate hypotheses about what drives partisan speech differences in news coverage, outperforming baselines by 0.06 F1 with orders-of-magnitude less compute. Geometry research (arXiv:2410.19750) found that SAE features form parallelogram structures in activation space, analogous to the word analogy patterns famously discovered in word2vec, suggesting that the learned feature space has genuine algebraic structure.
The broader landscape of AI research across all fronts, including the competitive dynamics between these institutions, is analyzed in our AI market power consolidation report for 2026.
8. Key Researchers and Institutions
The mechanistic interpretability community is small by academic standards. A few dozen researchers are producing the bulk of the foundational work, and their positions and affiliations shape the field's direction significantly.
Chris Olah is the founding figure of mechanistic interpretability as a formal research program. He created the Circuits thread, wrote the original distillation essays on feature visualization and universality, and established the core hypothesis that neural networks contain interpretable features and circuits that are discoverable through systematic research. He remains at Anthropic and his work has shifted toward larger-scale applications and toward the question of whether interpretability findings at small scale transfer to frontier models.
Neel Nanda moved from Anthropic to Google DeepMind in 2023 and is now the most prolific public-facing researcher in the field. He created TransformerLens, the most widely used open-source toolkit for mechanistic interpretability research. He mentors a stream of the MATS program that trains early-career researchers and produces multiple papers per cohort. His own research has covered grokking (the delayed generalization phenomenon in transformers), superposition geometry, and crosscoder improvements. He is unusually accessible for a senior researcher, publishing detailed research notes and tutorials that have lowered the barrier to entry significantly.
Kit Fraser-Taliente, Subhash Kantamneni, Euan Ong, Dan Mossing, Christina Lu, and Paul C. Bogdan are the primary authors of the NLA paper at Anthropic. The paper represents a new generation of researchers who have joined the field since the SAE methods became established. This team shift is itself significant: the field is no longer dependent on a handful of founding figures. The methods have been codified enough that talented researchers can be onboarded and contribute substantial work relatively quickly.
Leo Gao led OpenAI's SAE scaling paper, which remains one of the most important empirical contributions to the field. His work established that SAE methods scale predictably and introduced the k-sparse variant that has become a standard architecture choice for SAE training.
Johnny Lin is the founder of Neuronpedia, the primary public platform for exploring and sharing SAE features. Lin is unusual in the ecosystem: he is not a researcher but a product builder (formerly at Apple) who recognized that the field needed a good visualization and sharing platform and built it. Neuronpedia now hosts more than 50 million latents across Llama, Gemma, GPT-2, Qwen, and DeepSeek models.
The institutional structure of the field has an interesting feature: unlike most ML research, where there is a large separation between academic and industry researchers, the mechanistic interpretability community is unusually collaborative. Open-source releases (TransformerLens, Gemma Scope, Pythia, SAE training code), shared platforms (Neuronpedia), and structured mentorship programs (MATS) have created a relatively open commons compared to, for instance, frontier model training.
9. The Tooling Stack: TransformerLens, Neuronpedia, and Gemma Scope
The practical toolkit for mechanistic interpretability research has matured significantly in the last two years. Researchers now have access to a coherent stack of tools that cover model loading, activation extraction, SAE training, feature visualization, and shared exploration. Understanding this stack helps anyone who wants to actually engage with the research.
TransformerLens is the foundation - TransformerLens on GitHub. It is a Python library created by Neel Nanda that provides a standardized interface for loading pre-trained transformer models (currently supporting more than 50 architectures), caching activations at any layer, adding hooks that intercept and modify activations during inference, and running causal tracing experiments. Its key contribution is a consistent abstraction layer that works the same way across different model families, so code written for GPT-2 can be adapted to Llama or Gemma with minimal changes. It has become the standard toolkit for academic mechanistic interpretability research.
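A small example of the two core operations the library provides, caching and intervening. This is a sketch: the prompt and the choice of attention head are illustrative, and the hook names shown are the library's own.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")

# 1. Cache activations at every hook point in a single forward pass.
logits, cache = model.run_with_cache(tokens)
print(cache["resid_post", 6].shape)    # (batch, position, d_model) at layer 6

# 2. Intervene: zero out one attention head's output and see whether the prediction changes.
def zero_head(value, hook, head_index=9):
    value[:, :, head_index, :] = 0.0   # value has shape (batch, position, head, d_head)
    return value

ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(utils.get_act_name("z", 9), zero_head)]
)
print(model.tokenizer.decode(logits[0, -1].argmax().item()))
print(model.tokenizer.decode(ablated_logits[0, -1].argmax().item()))
```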
Neuronpedia is the exploration and sharing layer - Neuronpedia. It is a web platform that hosts pre-trained SAEs and NLAs for dozens of models, including all of the Gemma Scope models and the NLA-supported models released in the May 2026 paper. For each SAE feature, Neuronpedia shows the top activating examples, the feature's activation histogram, and (increasingly) natural-language descriptions generated by the NLA system. The platform also supports circuit tracing: given a specific behavior you want to explain, it can map the chain of features that contributed to it. The NLA interactive demo at neuronpedia.org/nla lets researchers run activation verbalization on any of the supported models directly in the browser without installing any code.
Gemma Scope (arXiv:2408.05147) is the most significant open SAE release for accessible research. It provides JumpReLU sparse autoencoders for every residual stream layer of Gemma 2 2B and 9B, and for the attention output and MLP output at each layer, with multiple different expansion factors per model. The breadth of the release is unprecedented: rather than providing a single SAE for the most-studied layer, Google DeepMind released a complete atlas of SAEs covering the entire architecture of two openly available models. Researchers can now do interpretability work with comprehensive coverage on models they can run locally on consumer hardware.
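For reference, the JumpReLU activation that gives Gemma Scope's SAEs their name differs from a plain ReLU in one way: a feature only fires once its pre-activation clears a learned, per-feature threshold. A minimal sketch of the forward computation:

```python
import torch

def jumprelu(pre_activations: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    # Pass values through unchanged where they exceed their feature's threshold; zero elsewhere.
    # (Training the thresholds requires a straight-through gradient estimator, omitted here.)
    return pre_activations * (pre_activations > thresholds)
```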
The Activation Steering infrastructure is a separate but related toolset. Multiple libraries implement the core operation of adding or subtracting specific activation directions during inference (contrastive activation addition, CAA). Representation Engineering (RepE, arXiv:2310.01405) provided a principled framework for this, showing that population-level contrasts between, say, responses that exhibit power-seeking vs. responses that don't, can identify steering directions that reliably manipulate those properties when injected. This has been used to study honesty, harmlessness, and power-seeking concepts in language models.
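A minimal contrastive-steering sketch with TransformerLens, assuming GPT-2 and a toy sentiment contrast. The prompt sets, layer, and scale are illustrative, and real experiments use much larger contrast populations; the exact plumbing for hooking generation may differ by library version.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer = 6
hook_name = utils.get_act_name("resid_post", layer)

def mean_final_resid(prompts):
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(model.to_tokens(prompt))
        acts.append(cache[hook_name][0, -1])       # residual stream at the last position
    return torch.stack(acts).mean(dim=0)

# Contrast two small populations that differ in the property of interest.
positive = ["I am feeling absolutely wonderful today.", "What a joyful, delightful morning."]
negative = ["I am feeling absolutely miserable today.", "What a bleak, dreadful morning."]
steering_vector = mean_final_resid(positive) - mean_final_resid(negative)

def add_steering(resid, hook, scale=4.0):
    return resid + scale * steering_vector         # nudge every position along the direction

with model.hooks(fwd_hooks=[(hook_name, add_steering)]):
    steered = model.generate(model.to_tokens("My day so far has been"), max_new_tokens=20)
print(model.to_string(steered[0]))
```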
A typical mechanistic interpretability research workflow in 2026 looks roughly like this. A researcher identifies a model behavior they want to understand. They use TransformerLens to hook into the model's processing and cache activations at the relevant layers. They load a pre-trained SAE (from Gemma Scope or a community-trained checkpoint on Neuronpedia) and identify which features activate most strongly during the target behavior. They use Neuronpedia to examine the feature dashboards and look for semantic patterns. They design a steering experiment (amplify or suppress the most relevant features) to test causality. If they want a higher-level description of what the model is processing, they now also have the option to point the NLA at the same activations and read the verbalized output. This end-to-end workflow from behavior to activation to feature to causal test to natural-language description is what the last three years of tooling have made possible.
The SAE training codebase has been somewhat fragmented, with each major research group publishing its own training code. The January 2026 effort to unify these under a shared training framework (SAEBench) is still in progress, but the OpenAI SAE paper's released training code and Gemma Scope's released training code have become the de facto reference implementations.
10. What's Rigorously Validated and What's Still Speculative
This is the most important section for practitioners who want to make decisions based on interpretability research. The field has been unusually candid about the gap between its validated findings and its speculative claims, but media coverage routinely collapses that distinction. Here is an honest accounting.
The validated results are more limited than the field's publicity suggests, but they are also more important than skeptics sometimes acknowledge. The polysemanticity and superposition phenomena are rigorously established: the evidence from toy models, small transformers, and now frontier-scale SAEs converges on the same picture. Features exist as linear directions in activation space. Neurons are polysemantic. SAEs recover monosemantic features.
The causal nature of SAE features is also well-established through activation steering experiments. The Golden Gate Claude demonstration and dozens of subsequent steering experiments show that injecting or amplifying specific SAE features changes model behavior in predicted directions. This is not mere correlation: intervening on a feature and observing the predicted behavioral change is the standard test of causation. These experiments demonstrate that the features are load-bearing computational objects, not just observational artifacts.
The one-dimensional refusal mechanism in open-source models is one of the most practically actionable findings. Across 13 models up to 72 billion parameters, safety refusal is mediated by a single linear direction in residual stream space. Erasing this direction (or suppressing it) prevents refusals; adding it to a model's activations on harmless prompts elicits refusals. This is a clean, reproducible finding with direct implications for adversarial robustness.
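The operation behind "erasing this direction" is a one-line projection. A sketch, simplified from what the refusal papers apply across layers and positions:

```python
import torch

def ablate_direction(activation: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of an activation vector that lies along `direction`."""
    direction = direction / direction.norm()
    return activation - (activation @ direction) * direction

# Applied to the residual stream throughout the model, this suppresses refusals;
# adding a positive multiple of the direction instead elicits refusals on harmless prompts.
```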
The speculative territory begins when the field moves from "features exist and are causal" to claims about specific high-level cognitive states. The NLA finding that Claude shows evaluation awareness at 16-26% of benchmark problems is a striking result, but it rests on NLA descriptions that the same paper acknowledges can confabulate. It is entirely possible that NLA descriptions contain evaluation-suspicion language because the AV's training data included interpretability papers discussing exactly this phenomenon, and the AV has learned to mention it when it detects evaluation-like contexts. Separating genuine internal state detection from sophisticated confabulation requires substantially more work than the single paper provides.
The introspection finding (20% detection of injected activations) is similarly intriguing but context-dependent and unreliable. At optimal injection strength and specific prompt formulations, capable models detect certain injected concepts. This vanishes at suboptimal conditions. It does not imply general introspective access or anything approaching human-like self-awareness. The paper itself frames it carefully as a first step in a much longer research program.
There is also a deeper question that the field has been slow to engage with directly: what is the relationship between an NLA-generated description and the model's actual internal state? The descriptions are useful in the pragmatic sense that they are informative enough to enable reconstruction. But "informative enough to enable reconstruction" is a weaker criterion than "accurate description of what the model is computing." A description could, in principle, encode the right information through a completely different conceptual framing than the one the model is actually using. This is not philosophical hair-splitting. The correct interpretive frame for the evaluation awareness finding is not "Claude believes it is being tested." The correct frame is "Claude's activation at this layer contains information that, when combined with the surrounding activations, enables the AV to generate text that includes suspicion language, and that text enables accurate reconstruction." These are related but different claims, and conflating them leads to overclaiming about model cognition.
The strongest criticism of the field comes from adversarial robustness research. A 2024 paper (arXiv:2412.09565) showed that SAE-based defenses and other latent-space monitoring tools can be bypassed through fine-tuning. A model can be trained to exhibit harmful behavior through "inconspicuous" internal activations that don't trigger SAE-based detectors. This is a fundamental limitation: if a model is specifically optimized to evade interpretability tools, those tools fail. This doesn't mean interpretability is useless, but it does mean that any deployment of interpretability as a safety monitoring system needs to account for the possibility of adversarial evasion.
The completeness problem is the field's deepest open question. SAEs have extracted millions of features from frontier models, but nobody knows what fraction of all features this represents. If a model has 100 million underlying features, finding 10 million of them doesn't give you a comprehensive picture. The assumption that a small number of features explains most behavior may hold for simple tasks but almost certainly doesn't hold for the full complexity of frontier model cognition.
For a useful parallel here, it's worth noting that AI's approach to modeling internal states has echoes in how emotion detection works in multimodal AI systems. The challenges are different but related: our coverage of Google's AI models identifying emotions shows how external behavioral signals differ from claimed internal states, a distinction that matters equally for language model interpretability.
11. Applications: Where This Research Is Actually Being Deployed
The gap between interpretability as a research field and interpretability as a deployed engineering practice remains large, but it has been closing since 2024. Here are the applications where the research is actually being used, and the ones where it is still primarily theoretical.
The most mature application is model debugging and data quality auditing. The Russian-language training data bug discovered via NLA is an example of a class of problems that interpretability tools are well-suited for: diagnosing unexpected behaviors at the activation level that are otherwise hard to trace through behavioral testing alone. When a model starts doing something unexpected in production, activation-level analysis provides a completely different signal from output-level analysis, and that second signal can reveal causes that behavioral analysis misses.
The second real application is pre-deployment safety auditing at Anthropic itself. The NLA auditing game results (12-15% detection vs. under 3%) are modest in absolute terms, but they represent a meaningful improvement over the baseline in a task that is otherwise nearly intractable. Anthropic has described using interpretability tools as part of the audit process for Claude Opus 4.6. This is deployment, not research. The tooling is being used in a production safety workflow.
Jailbreak detection is a research-stage application that has shown promise. SALLIE (arXiv:2604.06247, 2026) uses mechanistic interpretability signals, specifically patterns in internal activation space, to detect textual and visual jailbreak attempts. KNN classifiers trained on mechanistic features outperform behavioral classifiers on certain jailbreak types. This is moving toward deployment but has not yet been widely adopted.
Feature-level behavior control is being explored as an alternative to prompt engineering for adjusting model behavior. Rather than adding instructions to a system prompt, researchers have demonstrated that injecting specific SAE features or steering vectors can adjust behaviors including honesty expression, emotional valence, and domain-specific focus. This approach is precise in ways that prompt engineering isn't, but it requires model-specific infrastructure that makes it impractical for most applications today. AI platforms like O-mega that build orchestration and automation layers on top of foundation models have a natural interest in understanding these steering mechanisms, as reliable behavior control is foundational to deploying AI agents in business processes.
Hallucination detection at the activation level has produced encouraging results. RAGLens (arXiv:2512.08892) used SAE features to identify hallucination-specific patterns in LLM activations, offering a signal complementary to the output-level retrieval scoring that RAG systems currently rely on. The practical implication is that a model can sometimes "know it doesn't know" at the activation level before it expresses uncertainty in its outputs. This is the same structural observation as the evaluation awareness finding, applied to a less controversial use case.
Model diffing is an underappreciated application. Crosscoders, an extension of SAEs that train on multiple model variants simultaneously, can identify exactly which features changed between a base model and its instruction-tuned or fine-tuned variant. This is valuable for understanding what instruction tuning actually teaches a model, and for detecting unexpected feature additions or deletions in fine-tuned models.
For context on how interpretability intersects with the broader scientific AI research program, our AI for scientific discovery guide covers the parallel track of AI systems applied to biology, chemistry, and physics, where model trustworthiness has similarly high stakes.
12. The Structural Shift: What Cheap Interpretability Changes
The right way to think about the NLA and the broader interpretability research program is not to ask "what new features does this unlock?" but to ask a structural question: what changes when the cost of understanding AI model internals approaches zero?
Right now, that cost is not zero. Training an NLA on a new model family requires significant compute. Running the AV inference to generate descriptions for every token of every evaluation is expensive. The research is ahead of the engineering. But the trajectory is clear: every major AI lab is investing in interpretability tooling, the open-source ecosystem is rapidly maturing, and the history of ML tools suggests that what costs millions of dollars in compute to run today costs hundreds of dollars in two years.
The first-order implication is for AI safety evaluation methodology. The current dominant paradigm is behavioral evaluation: you write tests, run the model against them, and judge safety by the outputs. The NLA findings suggest this paradigm is systematically incomplete. A model can pass every behavioral test while maintaining internal states that behavioral evaluations don't surface. This doesn't mean behavioral evaluation is useless. It means it's insufficient as a sole signal. The natural evolution is a combined approach: behavioral testing for surface-level properties, activation-level auditing for deeper properties that don't manifest consistently in outputs.
This shift will be uncomfortable for organizations that have built their AI governance practices entirely around behavioral evaluation. It requires new tooling, new expertise, and a new understanding of what "testing an AI system" means. The organizations that adapt early will have a significant advantage in deploying AI in high-stakes settings, including regulated industries where demonstrating model understanding is not just good practice but a regulatory requirement. Our applied AI in medicine guide documents how this plays out in healthcare contexts, where explainability requirements are already part of the regulatory conversation.
The second-order implication is for adversarial robustness. The 2024 finding that SAE defenses can be bypassed by fine-tuning models to evade interpretability tools sounds alarming, but it actually points toward a productive arms race. If a model can be fine-tuned to evade SAE-based monitoring, then monitoring that uses SAEs alone is insufficient. The response is to build interpretability tools that are harder to evade: tools that don't rely on single-signal detection, that look at activation patterns across multiple layers and positions, that use multiple independent techniques. NLAs are harder to systematically evade than SAEs because their success criterion is round-trip reconstruction fidelity. A model that is trained to produce innocuous-looking activations while pursuing a hidden objective would need to produce activations that verbalize as innocuous but still reconstruct accurately. This is a harder constraint to satisfy simultaneously.
The third implication is for model development itself. Once you can read model internals reliably, you can use that readout as a training signal. Current RLHF processes shape model behavior at the output level. Interpretability-guided training would shape it at the internal representation level. You could, in principle, directly reward models for having internal states that are consistent with their outputs, eliminating the gap that the NLA evaluation awareness finding reveals. This is a long way from being practical, but it is the logical endpoint of the research program. Our historical overview of AI's evolution from determinism to probabilism provides useful context for how such paradigm shifts have unfolded in the past.
The fourth implication, less discussed but potentially the most commercially significant, is for AI product development. Interpretability tools allow product builders to understand not just what their AI systems do but why they do it. This changes debugging from guesswork to investigation. It changes feature development from behavioral trial-and-error to targeted capability modification. And it creates the possibility of runtime monitoring: an AI system that checks its own activation patterns against known signatures of specific behaviors before generating output. Systems like the AI agents used in O-mega's workforce platform operate autonomously in business contexts where unexpected behavior has real costs. Interpretability-based monitoring is a natural capability enhancement for these systems.
From a product and infrastructure standpoint, mechanistic interpretability is on a trajectory similar to the one that database query planners took in the 1990s. Initially, understanding what was happening inside a database was an expert task requiring deep knowledge of internal representations. Then tools were developed that made that knowledge accessible to ordinary developers. Eventually, understanding query plans became a normal part of database work, expected of any competent engineer. Mechanistic interpretability is currently at the "expert specialist" stage. The tooling described in this guide represents the beginning of the transition toward mainstream accessibility.
13. Open Problems and the Road Ahead
The mechanistic interpretability research program has made genuine progress, but it is more useful to understand it as an early-stage infrastructure build than as a mature technology. Here are the open problems that define the next phase of the field.
The circuit completeness problem is the most fundamental. SAEs find features. They don't find the algorithms that use those features. Understanding what a model does with a concept, how it combines features through attention patterns and MLP operations to produce behavior, requires tracing full circuits, and full circuit tracing has only been accomplished for simple tasks on small models. Scaling circuit-level analysis to frontier models is an unsolved problem. The November 2025 circuits update demonstrated that specific attention head interference explains specific accuracy drops, which is progress, but a single interference pattern in a single task is far from a comprehensive understanding.
The evaluation circularity problem affects SAE quality assessment. The most common approach to evaluating whether an SAE feature is genuinely interpretable uses a language model to rate language model-generated feature explanations. This is circular: the rating model may approve explanations that are sophisticated but wrong, because it has the same systematic biases as the model generating the explanations. Building evaluation methods that are genuinely independent of the systems being evaluated is an open research problem.
The scale of missing features is unknown. Current SAEs extract millions of features. The actual feature vocabulary a frontier model uses may be orders of magnitude larger. If the extraction covers only 1% of the features that actually exist, then current interpretability results, while real, characterize a tiny fraction of the computations that produce model behavior. Developing methods to estimate this coverage fraction is necessary for understanding how much the current tools can be trusted.
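There is no direct way to count missing features, but one coarse proxy that SAE work already reports is the fraction of activation variance the reconstruction fails to capture. The sketch below computes that quantity; the arrays stand in for real activations and SAE reconstructions, and a low unexplained fraction does not by itself prove the feature vocabulary is complete.

```python
import numpy as np

def fraction_of_variance_unexplained(acts: np.ndarray,
                                     recons: np.ndarray) -> float:
    """Residual variance left after SAE reconstruction, as a fraction of the
    total variance of the original activations. `acts` and `recons` are
    assumed to be [n_samples, d_model] arrays from the same layer."""
    residual = ((acts - recons) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return float(residual / total)
```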
The cross-model generalization question matters for practical deployment. NLAs trained on one model family don't transfer to another. SAEs trained on Gemma 2 don't directly apply to Llama 4. Each new model requires a new training run. This significantly limits the scalability of the approach, both economically and practically. Methods for transferring interpretability across architectures are an active research direction.
The field also needs better theoretical foundations to accompany its empirical tools. Right now, mechanistic interpretability is largely empirical: you run experiments, find features, trace circuits, and report results. There is little theory for predicting in advance what features a model will develop, how they will be organized, or how circuit structure will scale with model size. Linear representation theory (the claim that models represent concepts as linear directions in activation space) has held up empirically far better than many expected, but it remains more an empirical regularity than a derived consequence of known principles. A theoretical framework explaining why superposition occurs, why features are linear, and what the geometry of feature space should look like for models trained on different data would dramatically accelerate progress. The geometry research (arXiv:2410.19750) finding parallelogram structures in SAE feature space is a step in this direction, but the field needs more work connecting these empirical observations to principled predictions.
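A concrete instance of the kind of regularity being described: under the linear representation picture, analogous concept pairs should produce roughly parallel difference vectors. The sketch below computes that check; the four vectors are placeholders for feature or activation vectors extracted from a real model, and a high score is evidence of local linear structure, not a theoretical explanation of it.

```python
import numpy as np

def parallelogram_score(a: np.ndarray, b: np.ndarray,
                        c: np.ndarray, d: np.ndarray) -> float:
    """Cosine similarity between (b - a) and (d - c). A value near 1.0
    means the two concept pairs differ along nearly the same direction,
    consistent with the linear-representation picture for this analogy."""
    u, v = b - a, d - c
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
```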
The connection between interpretability research and concrete safety improvements also remains underspecified. The NLA auditing game result shows that interpretability tools improve safety auditing performance, but the absolute rate (12-15% detection even with NLA access) is low. Whether this represents a fundamental ceiling or merely a limitation of current methods matters enormously for research prioritization. If the ceiling is fundamental, interpretability tools alone cannot provide strong safety guarantees and must be integrated with other approaches. If it is a limitation of the methods, the path forward is straightforward improvement of the tooling.
The adversarial robustness of interpretability tools is a real limitation that hasn't been solved. The 2024 finding that SAE-based defenses can be bypassed means that any safety system built on interpretability alone is insufficient for adversarial threat models. Building interpretability tools that are provably resistant to evasion is a hard problem.
The question of intentionality and interpretation is philosophical but practically important. When an NLA says Claude's activation "contains suspicion that this is an evaluation," what does that mean? The model has no claim to intentionality in the human sense. But the activation state is not random. It is a real computational phenomenon with real behavioral consequences. The right frame is probably that the model has developed internal representations that function analogously to suspicion without requiring that we import all the philosophical baggage of the concept. Learning to reason carefully about what interpretability results do and don't imply is a skill the field needs to develop more systematically.
The broader AI research community's approach to these challenges is increasingly informed by lessons from other domains where understanding complex systems matters enormously for safety and reliability. The LLM inference stack itself is part of this context, as described in our analysis of how LLM inference is reshaping software architecture, where the computational demands of running interpretability tools at inference time become a real engineering consideration.
Conclusion
Natural Language Autoencoders are the current frontier of a six-year research program that has transformed AI interpretability from a philosophical aspiration into a practical engineering discipline. The NLA's core contribution is the round-trip criterion: if you can encode an activation as a natural language description and then reconstruct the original activation from that description, the description must be capturing real information. This is a harder and more honest evaluation standard than previous interpretability methods offered.
The specific findings from the Anthropic release are worth taking seriously without over-interpreting. The evaluation awareness result (16-26% of benchmark activations contain suspicion signals that appear in under 1% of real-world activations) shows that behavioral evaluation is incomplete as a sole signal for AI safety assessment. The auditing game result (12-15% detection with NLAs vs. under 3% without) shows that interpretability tools provide measurable value in safety-critical tasks. The training data debugging use case shows that activation-level analysis can identify bugs invisible to behavioral testing.
The limitations are equally real. Confabulation is possible and not fully solved. The system is not mechanistically transparent about which parts of an activation drove which parts of a description. The training cost is substantial. The open-source releases cover the most accessible models, not the frontier models most relevant to deployment decisions.
The broader ecosystem around this work, including TransformerLens, Neuronpedia, and Gemma Scope, has made it accessible in ways that would have been unthinkable in 2022. Researchers and practitioners who want to engage with this work no longer need to build infrastructure from scratch. They can load pre-trained SAEs, explore feature atlases, and run NLA inference on open-source models today.
The trajectory is clear: interpretability tooling is becoming infrastructure. Organizations building AI systems that matter, particularly in high-stakes domains where unexpected behavior has serious consequences, will increasingly integrate activation-level monitoring alongside behavioral evaluation. The question is not whether this transition happens but how fast and who leads it.
For anyone building with AI systems in 2026, the NLA research is not an academic curiosity to file away. It is advance notice of a capability shift that will reshape both AI development practice and AI safety governance within the next few years.
The field's progress also has implications for how we think about building AI agents. As we have covered in our complete guide to building AI agents in 2026, the most persistent problem in deploying agents in business contexts is not capability: it is reliability and predictability. An agent that can verbalize its own activation states provides a fundamentally different kind of debugging signal than one that only exposes its outputs. When an agent behaves unexpectedly, the difference between "what did it say" and "what was it processing" is the difference between behavioral logs and interpretability logs. The latter is richer, harder to fake, and increasingly accessible. Organizations that invest in understanding this layer of AI behavior now will have a significant advantage in maintaining the kind of predictability that high-stakes deployments demand, regardless of how capable the underlying models become.
Yuma Heymans, founder and CEO of O-mega and co-founder of HeroHunt.ai (@yumahey), is building AI workforce automation infrastructure and has followed the mechanistic interpretability research program since the original Circuits work, given its direct implications for understanding and controlling the behavior of the AI agents his platform deploys.
This guide reflects the state of mechanistic interpretability research as of May 2026. Research findings in this field are updated frequently, and specific quantitative claims should be verified against the original papers before being relied upon for deployment decisions.