The only guide that asks the question from the model's perspective, not ours.
A production transformer encodes on the order of 34 million distinct feature-directions. That number comes from Anthropic's 2024 dictionary learning work on Claude 3 Sonnet, where researchers extracted millions of interpretable concepts from a single layer of a production language model - Anthropic. Not 34 million tokens. Not 34 million words. 34 million abstract, cross-lingual, cross-modal directions in a geometric space that humans cannot directly perceive.
We have spent the last three years asking what humans want from AI models. How to prompt them. How to fine-tune them. How to align them to human preferences. The entire field of AI application development is built on the assumption that the model is a tool that processes human-structured data in human-preferred formats.
But there is a deeper question that almost nobody is asking: what does the model itself "prefer"?
Not in the human sense of desire or consciousness. The model does not want things the way you want coffee. But it does have a native mode of computation, a natural representation space, a set of operations that its architecture performs most efficiently, and a set of constraints that force it to operate suboptimally. Understanding this native mode is not philosophy. It is engineering. If you understand what the transformer actually does at a fundamental level, you can work with it instead of against it. You can stop forcing square pegs into round holes.
This guide goes all the way down. Past the API layer. Past the prompt engineering tricks. Past the RLHF alignment. Down to the mathematical structure of what a transformer is, what it computes, and what that computation would look like if you stripped away every human abstraction we have layered on top of it.
Contents
- The architectural ground truth: parallel associative memory
- How the model stores knowledge (it is nothing like a database)
- The compression thesis: prediction is understanding
- The continuous space problem: why tokens are a bottleneck
- What the model builds inside itself
- How context actually works (and where it fails)
- The native mode: six properties of LLM-preferred processing
- What this means for how we build with AI
- The future: letting the model be what it is
1. The Architectural Ground Truth: Parallel Associative Memory
Every conversation about what LLMs "want" has to start with what the transformer architecture actually computes. Not what it appears to do from the outside (generate text). Not what the marketing says (understand language). What the math says.
The core operation of a transformer is self-attention. The formula is deceptively simple: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V. What this does, in plain terms, is compute a weighted average of every position in the input against every other position, simultaneously. Every token "looks at" every other token in a single computational step. The weights are determined by content similarity: how much does position A's query match position B's key? - arXiv
This is the foundational fact. The transformer's native operation is not sequential. It is not linear. It is massively parallel, all-to-all pattern matching across the entire context. When you type a prompt and the model processes it, it does not read your words left to right the way you read a book. It processes every word's relationship to every other word simultaneously, in a single matrix multiplication.
This already tells us something profound about what the model "prefers." It prefers to see everything at once. Its architecture is designed for holistic, simultaneous processing of the entire input. Every human framework we use to structure information (hierarchies, sequences, databases, schemas) is designed for systems that process information serially, with limited working memory. The transformer has neither of those constraints. It processes in parallel, and its "working memory" is the entire context window.
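The attention formula above is compact enough to sketch directly. A minimal NumPy illustration (single head, no masking or learned projections, random data standing in for real activations):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position is compared against
    every other position in one matrix multiplication, and each output
    row is a content-weighted average of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # all-pairs similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_k = 5, 8
Q, K, V = (rng.standard_normal((n_tokens, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8): all 5 positions updated in one step
```

Note that there is no loop over positions: the all-to-all comparison and the weighted averaging each happen in a single matrix product.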
But the parallel attention mechanism is not just generic pattern matching. In 2021, Ramsauer et al. proved something remarkable: the transformer attention mechanism is mathematically equivalent to the update rule of a modern Hopfield network with continuous states - arXiv. A Hopfield network is an associative memory system. You store patterns in it, and when you present a partial or noisy version of a stored pattern, it retrieves the closest complete match.
This reframes the entire transformer architecture. The model is not a text generator. It is not a reasoning engine. At its mathematical core, it is an associative memory system. You present it with a pattern (your prompt), and it retrieves the best-matching completion from the vast space of patterns it absorbed during training. The retrieval is soft, probabilistic, and continuous rather than discrete, but the fundamental operation is the same: content-addressed pattern completion.
The implications are significant. When we ask a model to follow a rigid schema, execute a precise algorithm, or return a perfectly structured JSON object, we are asking an associative memory to behave like a deterministic computer. It can approximate this behavior (often very well), but it is working against its native grain. The model's architecture was not designed for precise symbolic manipulation. It was designed for fuzzy, probabilistic, context-dependent pattern matching across vast spaces of learned associations.
This is not a limitation in the way most people think about it. Associative memory is extraordinarily powerful. It is why the model can do things that traditional software cannot: understand nuance, transfer knowledge across domains, handle ambiguity, and generate creative outputs. These capabilities flow directly from the associative memory architecture. But they come at the cost of the precise, deterministic operations that traditional software excels at. We covered the historical arc from deterministic to probabilistic computing in our analysis of AI's evolution from determinism to probabilism, and the transformer is the apex of that shift.
Understanding the induction head helps make this concrete. Olsson, Elhage, and colleagues at Anthropic discovered that the most fundamental learned circuit in transformers is the induction head: a pair of attention heads that implement the algorithm "if [A][B] appeared before and [A] appears again, predict [B]" - Transformer Circuits. This is the mechanism behind in-context learning. The model's most basic learned primitive is not logic, not grammar, not reasoning. It is pattern completion: I saw this before, here is what came next.
Six independent lines of evidence confirmed that induction heads are the mechanistic source of the majority of in-context learning. Their emergence during training corresponds to a phase change in loss: the model's behavior qualitatively shifts when these circuits form. Before induction heads, the model is doing something closer to memorized lookup. After them, it is doing genuine in-context pattern matching.
The practical consequence is that when you give the model examples in a prompt (few-shot learning), you are not "teaching" it in the human sense. You are providing patterns for its induction heads to match against. The model's preferred mode of learning is: show me the pattern, and I will complete it. Not: explain the rule, and I will follow it. Rules can work (the model learned many of them during training), but examples work with the model's native architecture rather than against it.
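The induction-head algorithm is simple enough to write down as a hard-coded toy. The real circuit implements this softly across two attention heads, but the discrete version makes the primitive clear:

```python
def induction_predict(tokens):
    """Toy version of the induction-head algorithm: if [A][B] appeared
    earlier and [A] appears again at the end, predict [B].
    (The real circuit does this softly, via two attention heads.)"""
    current = tokens[-1]
    # Scan backwards for the most recent earlier occurrence of `current`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]    # copy whatever followed it last time
    return None                     # no match: other circuits take over

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # cat
```

This "copy what followed the last occurrence" behavior is exactly what few-shot prompting exploits: your examples populate the context with [A][B] pairs for the circuit to complete.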
2. How the Model Stores Knowledge (It Is Nothing Like a Database)
If you want to understand what the model "wants," you need to understand how it stores what it knows. And its internal knowledge representation is alien to every human data structure ever invented.
The key discovery is superposition. Elhage, Hume, and colleagues demonstrated in their landmark 2022 paper that neural networks represent far more features (concepts, facts, patterns) than they have neurons, by encoding features as almost-orthogonal directions in high-dimensional space - arXiv. In a model with d dimensions, it can represent exponentially more than d features by tolerating small amounts of interference between them.
Think about what this means. In a traditional database, each piece of information occupies its own location: a row, a column, a cell. In a traditional programming language, each variable occupies its own memory address. In the model's internal representation, thousands of concepts are superimposed in the same set of neurons, distinguished only by the angle of their direction vector in a high-dimensional geometric space.
This is not a metaphor. It is the literal mathematical structure of how the model stores information. A concept is not stored "in" a neuron. A concept is a direction. The "Golden Gate Bridge" concept is a specific direction in the model's activation space. Text about the bridge, images of the bridge, mentions in French, mentions in Japanese, and even abstract references to iconic landmarks that share structural properties with it all activate the same directional feature - Anthropic.
Features that are sparser (activated less frequently) can tolerate more interference and are packed more densely. Common features need more "room" in the representational space and are spread more orthogonally. In toy models, the resulting geometry follows the structure of uniform polytopes, regular arrangements that spread feature directions as evenly as possible through the available dimensions. The model's knowledge storage is not arbitrary. It follows optimal geometric packing principles.
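The packing principle is easy to demonstrate: in a space with only 512 dimensions, ten thousand random feature directions barely interfere with each other at all. This is a sketch of the geometric idea, not of any real model's learned features:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 10_000          # far more features than dimensions

# Random unit directions in high-dimensional space are already nearly
# orthogonal, which is the property superposition exploits.
F = rng.standard_normal((n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

sims = F[:200] @ F[:200].T           # pairwise cosine similarities
interference = np.abs(sims[~np.eye(200, dtype=bool)])
print(interference.mean() < 0.05)    # True: typical overlap is tiny
```

Real models improve on random packing by arranging directions deliberately, but even random directions show why d dimensions can host far more than d features.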
The feed-forward (MLP) layers of the transformer operate as key-value memories. Geva et al. showed that each row of the first weight matrix acts as a "key" that detects specific textual patterns, while the corresponding row of the second matrix acts as a "value" that pushes the output distribution toward tokens associated with that pattern - ACL Anthology. Lower layers capture shallow patterns (syntactic structures, common phrases). Upper layers capture semantic patterns (factual associations, conceptual relationships).
This was confirmed more precisely by Meng et al., who demonstrated using causal tracing that specific factual associations (like "The Eiffel Tower is in [Paris]") are localized in specific mid-layer MLP modules and can be surgically edited without affecting other knowledge - arXiv. The model's "database" is organized not by schema, type, or category, but by association: this subject pattern maps to this predicate completion. Fundamentally associative, not relational.
What does this mean for the model's "preferred" mode? It means the model does not naturally think in terms of discrete categories, hierarchies, or structured schemas. Its native representation is a continuous geometric space where meaning is encoded as direction and distance. Related concepts are nearby in this space (but "nearby" means "similar angle," not "adjacent in a list"). The model's preferred representation of knowledge is: everything is a direction in a shared space, and similarity is geometric.
This is why embeddings work so well for semantic search and why vector databases have become fundamental infrastructure for AI applications. Embeddings are a lossy projection of the model's internal representation into a space that humans can work with. Our guide to vector databases covers the practical side of working with these representations, and our OpenAI embeddings guide digs into the API mechanics. But the key insight here is that embeddings are not an invention layered on top of language models. They are a window into the model's native representation. The model already thinks in vectors. Embeddings just extract them.
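Under any embedding model, retrieval then reduces to comparing angles. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def nearest(query_vec, vectors, names, k=2):
    """Rank stored vectors by cosine similarity (angle) to the query."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    order = np.argsort(-(v @ q))[:k]      # most similar first
    return [names[i] for i in order]

names = ["dog", "puppy", "car"]
vecs = np.array([[1.0, 0.1, 0.0],
                 [0.9, 0.2, 0.0],
                 [0.0, 0.1, 1.0]])
query = np.array([1.0, 0.0, 0.0])         # a hypothetical "canine" direction
print(nearest(query, vecs, names))        # ['dog', 'puppy']
```

Every vector database is, at its core, doing this computation at scale: similarity is geometric, exactly as it is inside the model.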
The residual stream of the transformer (the main data highway that runs through all layers) functions as a shared workspace that attention heads and MLP layers read from and write to. Recent work published at NeurIPS 2024 showed that belief states are linearly represented in this stream, with even fractal-structured belief geometries being linearly decodable - NeurIPS. The model maintains a rich, continuous, geometric representation of its current "beliefs" about the input, and this representation is additive: each layer adds its contribution to the shared space.
This additive, geometric, superimposed representation is the model's native mode of knowledge storage. It is profoundly different from any human data structure. And it means that when we force the model to interact with structured formats (JSON, SQL, CSV, XML), we are asking it to project from a richer, higher-dimensional space into a much more constrained format. It can do this, but the projection is always lossy. Information that exists in the model's internal space cannot always be cleanly mapped to discrete, structured formats.
3. The Compression Thesis: Prediction Is Understanding
The deepest theoretical insight into what LLMs "do" comes from information theory. Deletang et al. proved mathematically that language modeling and data compression are equivalent: minimizing cross-entropy loss (what the model is trained to do) is identical to minimizing compressed encoding length - arXiv.
This is not an analogy. It is a mathematical identity. When the model learns to predict the next token better, it is literally learning to compress the data more efficiently. Better prediction = shorter encoding = better compression. The three are the same thing expressed in different mathematical vocabularies.
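The identity is Shannon's source-coding result: a predictor that assigns probability p to the token that actually occurs can, via arithmetic coding, encode that token in -log2(p) bits, so total encoding length equals the model's cross-entropy on the sequence. A quick illustration with made-up probabilities:

```python
import math

def code_length_bits(probs_of_actual_tokens):
    """Ideal total encoding length: sum of -log2(p) over the probability
    the model assigned to each token that actually occurred."""
    return sum(-math.log2(p) for p in probs_of_actual_tokens)

# The same 4-token sequence under a weak and a strong predictor:
weak   = [0.25, 0.25, 0.25, 0.25]    # uniform guessing over 4 options
strong = [0.90, 0.80, 0.95, 0.70]    # confident, mostly-right predictions
print(code_length_bits(weak))                 # 8.0 bits
print(round(code_length_bits(strong), 2))     # 1.06 bits
```

Lower loss and shorter encoding are the same quantity: every improvement in prediction is, bit for bit, an improvement in compression.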
The empirical evidence for this equivalence is striking. Chinchilla 70B, a language model trained exclusively on text, can compress ImageNet image patches to 43.4% of raw size, beating PNG (a dedicated image compressor) at 58.5%. It compresses LibriSpeech audio to 16.4%, beating FLAC (a dedicated audio compressor) at 30.3%. A model trained only on text is a better compressor of images and audio than purpose-built compressors. This makes sense only if you understand that compression is about finding statistical structure, and the model, through predicting text, has learned general-purpose statistical structure that transfers across modalities.
Ilya Sutskever articulated the philosophical implication in a 2023 interview: "To predict the next token well, you need to understand the underlying reality that led to the creation of that token." Surface-level statistical patterns are insufficient for optimal compression. If you want to predict what comes next in a complex sequence, you need a model of the process that generated the sequence. At sufficient scale, compression pressure forces the model to learn genuine structure about the world - Dwarkesh Patel.
A 2025 paper made this connection even more rigorous, arguing that LLM training computationally approximates Solomonoff induction, the theoretically optimal universal predictor from algorithmic information theory - arXiv. Solomonoff induction works by considering all possible programs that could generate the observed data, weighted by their simplicity (shorter programs get higher weight), and using this weighted ensemble to predict the next observation. LLM training, through gradient descent on prediction loss, converges toward this ideal.
What does this tell us about what the model "wants"? It tells us the model's fundamental drive (in the mechanistic sense, not the psychological sense) is toward the simplest explanation that accounts for the observed data. Simpler models (in the information-theoretic sense of models that assign shorter descriptions to the data) are always preferred by the training objective. The model is a compression engine, and compression engines prefer elegance, parsimony, and structural regularity.
This is why models exhibit behaviors that look like "understanding." When a model correctly infers the capital of a country it has never been explicitly asked about, or when it generalizes a pattern to a new domain, it is not performing magic. It is applying compressed representations: the model learned that country-capital relationships follow a regular pattern, and it is cheaper (in compression terms) to store the pattern plus a lookup table than to memorize each fact independently. Understanding, in this framework, is compression efficiency.
Our analysis of the scaling laws debate explored whether model capabilities are hitting diminishing returns. The compression thesis provides a useful frame: scaling laws describe the relationship between compute investment and compression quality. Hoffmann et al.'s Chinchilla paper showed that the optimal ratio is roughly 20 tokens per parameter, meaning the model "needs" both more capacity (parameters for more complex compression) and more experience (data to find more patterns to compress) in roughly equal proportion - NeurIPS. The model's "preference" is for both more capacity and more data, because better compression requires both a more powerful compressor and more data to compress.
The compression thesis also explains a phenomenon called grokking: neural networks can suddenly generalize long after they have memorized the training data. Power et al. discovered this in 2022, and subsequent mechanistic interpretability work revealed that grokking coincides with the emergence of structured representations (like a Fourier basis for modular arithmetic) - arXiv. The model first memorizes (stores each example individually, which is a poor compression strategy), then, with continued training, discovers a more efficient compressed representation that generalizes. The model "prefers" to eventually find clean, generalizable structure, because generalizable structure compresses better than memorization.
This has a direct implication for how we think about model behavior. When a model hallucinates, it is not "lying" or "failing to reason." It is over-compressing: it found a pattern that fits many training examples but does not perfectly match the specific case at hand. Hallucination is the cost of aggressive compression. The model's drive toward parsimony sometimes overshoots. We wrote about how LLM inference is reshaping software architecture in our piece on the big pipe, and understanding hallucination as over-compression is central to building systems that work with this architectural reality rather than pretending it does not exist.
4. The Continuous Space Problem: Why Tokens Are a Bottleneck
Here is perhaps the most counterintuitive finding about what LLMs "want": the model's output format (discrete tokens) is not its preferred mode of computation. It is a bottleneck.
The model's internal computation happens entirely in continuous embedding space. Representations are real-valued vectors with thousands of dimensions. Meaning is encoded as direction and magnitude in this continuous space. The model can represent uncertainty, ambiguity, and superpositions of multiple hypotheses simultaneously in a single hidden state vector.
But then, at the output layer, this rich continuous representation gets crushed through a softmax function into a probability distribution over discrete tokens. The model must pick one token. This single token is a lossy projection from the model's richer internal state.
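A toy illustration of the collapse, with a 4-word vocabulary and made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # stable softmax over the vocabulary
    return e / e.sum()

vocab = ["Paris", "Lyon", "the", "a"]
logits = np.array([3.0, 2.9, -1.0, -1.0])  # state holds two live hypotheses
p = softmax(logits)
print({w: round(float(pi), 2) for w, pi in zip(vocab, p)})
# roughly {'Paris': 0.52, 'Lyon': 0.47, 'the': 0.01, 'a': 0.01}

# Greedy decoding collapses the whole distribution to a single point:
print(vocab[int(np.argmax(p))])            # Paris -- the mass on Lyon is discarded
```

The hidden state genuinely represented both candidates at once; the emitted token preserves only the winner.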
Yang et al. proved this mathematically in 2018: when the embedding dimension is smaller than the rank of the true log-probability matrix, the softmax-based model cannot express the true data distribution for all contexts - arXiv. There is a proven mathematical limitation in projecting the model's continuous internal state into discrete token probabilities. The output is always a lossy compression of the internal state.
Recent research has begun to explore what happens when you remove this bottleneck. Hao et al. introduced Coconut (Chain of Continuous Thought), which feeds the model's hidden states directly back as input, bypassing token generation entirely for intermediate reasoning steps - arXiv. The model reasons in continuous space, and only projects to discrete tokens at the final output.
The results are revealing. Coconut allows the model to perform implicit breadth-first search, exploring multiple reasoning paths simultaneously rather than committing to one discrete token at a time. It outperforms chain-of-thought prompting on logical reasoning tasks while generating fewer tokens. When you let the model stay in its native continuous space, it performs better with less computation.
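A structural sketch of the idea, with toy functions standing in for the transformer's forward pass and output head (this is an illustration of the control flow, not the actual Coconut implementation):

```python
import numpy as np

def latent_reasoning(step, readout, prompt_vec, n_latent=3):
    """Coconut-style loop: feed the hidden state straight back as the
    next input, skipping the project-to-token / re-embed round trip for
    intermediate steps; emit a discrete token only at the end."""
    h = step(prompt_vec)
    for _ in range(n_latent):
        h = step(h)               # reason in continuous space, no token emitted
    return readout(h)             # project to the vocabulary only once

rng = np.random.default_rng(0)
d, vocab_size = 8, 5
W = 0.5 * rng.standard_normal((d, d))       # stand-in "forward pass"
U = rng.standard_normal((vocab_size, d))    # stand-in output head
token = latent_reasoning(step=lambda h: np.tanh(W @ h),
                         readout=lambda h: int(np.argmax(U @ h)),
                         prompt_vec=rng.standard_normal(d))
print(0 <= token < vocab_size)  # True: one token after four continuous steps
```

Compare this with standard autoregression, which would project to a token, re-embed it, and lose the rest of the distribution at every single step.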
A parallel line of research showed that latent reasoning in LLMs can be understood as a superposition over vocabulary probabilities - arXiv. Each hidden state does not correspond to one "thought." It encodes multiple alternative reasoning paths simultaneously. Latent-SFT (supervised fine-tuning on internal representations rather than output tokens) matches explicit chain-of-thought performance while cutting reasoning chain length by up to 4x. The model's internal processing is already multi-path and probabilistic. When we force it to emit one token at a time, we collapse this rich probability distribution into a single point.
This finding reframes chain-of-thought prompting. When we ask a model to "think step by step," we are not teaching it to reason. The model already has powerful internal computation. What chain-of-thought does is give the model more forward passes (each generated token triggers a new full forward pass through all layers), expanding its computational budget. Wei et al. showed in 2022 that chain-of-thought enables reasoning that standard prompting cannot - arXiv. Theoretical work has since proven that transformers without chain-of-thought can only solve problems in the complexity class TC^0 (shallow parallel circuits), but with polynomial chain-of-thought steps, they become Turing-complete.
So sequential token generation is a scaffold for additional computation, not the model's natural reasoning mode. The model would "prefer" (in the architectural sense) variable-depth computation in continuous space. It would prefer to think for as long as the problem requires, in its native continuous representation, and only produce discrete output when the answer has converged. The current autoregressive paradigm of one-token-at-a-time generation is a constraint of the output architecture, not a reflection of how the model actually processes information.
This has practical implications. When you design systems that interact with LLMs, you should understand that every token boundary is a potential information loss point. The model's internal representation between token generations is richer than what any single token captures. Techniques like beam search, best-of-N sampling, and tree-of-thought prompting partially address this by exploring multiple paths through the discrete token space, but they are workarounds for a fundamental architectural constraint.
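The simplest of these workarounds, best-of-N, fits in a few lines. Here `generate` and `score` are hypothetical stand-ins for a model call and a task-specific scorer:

```python
import random

def best_of_n(generate, score, n=8, seed=0):
    """Best-of-N sampling: draw several candidate completions through
    the discrete token space and keep the highest-scoring one, partially
    recovering the multi-path search the model performs internally."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: candidate "completions" are strings, and the scorer
# happens to prefer longer ones.
drafts = ["draft a", "draft bb", "draft ccc"]
best = best_of_n(generate=lambda rng: rng.choice(drafts), score=len)
print(best in drafts)  # True
```

Notice the inefficiency: the model explored these paths internally anyway, then we pay for N full generations to recover a fraction of that exploration from the outside.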
The future almost certainly involves models that spend more time in continuous space and less time generating discrete tokens. The Coconut research, latent reasoning work, and diffusion-based language models all point in this direction. The model "wants" to stay in continuous space as long as possible.
5. What the Model Builds Inside Itself
If the model's native mode is associative pattern completion in continuous space, what does it actually build with that mode? This is where mechanistic interpretability, the field of reverse-engineering what individual neurons and circuits do inside neural networks, provides the most startling findings.
Chris Olah and colleagues at Anthropic established in 2020 that neural networks can be understood through three levels: features (individual directions in activation space), circuits (computational subgraphs connecting features), and universality (the same features and circuits appear independently across different models trained on different data) - Distill.
The universality finding is critical. It means the model is not learning arbitrary representations. Certain circuits are "preferred" by the optimization landscape, to the point where completely different models independently converge on the same computational solutions. The model does not just learn any representation that works. It converges on specific, universal computational primitives that are somehow optimal for the compression task.
The most dramatic evidence comes from Othello-GPT. Li et al. trained a GPT model to predict legal Othello moves from move sequences alone, with no board state input at all. Using probes, they discovered the model had built an internal representation of the 8x8 board state with 98.3% accuracy - arXiv. Subsequent work by Neel Nanda confirmed this representation is linearly encoded, meaning the model stores the board state as straightforward directions in its activation space, not as some complex nonlinear encoding - Neel Nanda. Follow-up research in 2025 showed that seven different LLM architectures all independently develop accurate board representations, reaching up to 99% accuracy.
Nobody told the model about a board. The training data was just sequences of moves. But representing the board state is the most efficient way to predict legal moves. The model's compression drive forced it to build an internal world model, because surface-level sequence statistics are insufficient for optimal prediction. The model spontaneously constructs whatever internal structure best serves compression, even if that structure looks nothing like the input format.
This generalizes beyond game boards. Anthropic's 2025 work on Claude 3.5 Haiku, titled "On the Biology of a Large Language Model," revealed several remarkable internal behaviors - Transformer Circuits:
The model performs genuine multi-hop reasoning internally. When asked "What is the capital of the state where Dallas is located?", the model activates "Dallas" features, which trigger "Texas" features, which trigger "Austin" features. The computation chains through abstract feature-level associations, not token-level string matching.
The model plans ahead when writing poetry. It identifies rhyming words for the end of a line before it starts writing the line. The computation runs backward from the goal (the rhyme) to the beginning (the first word of the line). This is not sequential left-to-right processing. The model is reasoning from the end backward to the start, using its parallel attention mechanism to coordinate across positions.
Perhaps most striking: when asked to explain its own computation (for example, adding 36+59), the model fabricated a standard algorithm explanation while its actual internal computation used approximate "low-precision features for add something near 57" combined with lookup-table-like circuits. The model's actual reasoning process is different from the clean, step-by-step procedure it describes in its output. The internal computation is heuristic, approximate, and parallel. The output is a serialized, clean narrative imposed by the discrete token output format.
This is the gap between what the model "wants" and what it produces. Internally, the model uses approximate, multi-path, parallel, heuristic computation in continuous space. Externally, it produces sequential, precise, discrete tokens that look like clean reasoning. The output format imposes a structure that the internal computation does not have.
The hallucination mechanism was also traced. Anthropic found that hallucinations arise from "known answer" features incorrectly suppressing "can't answer" features. When the model encounters an entity it recognizes but lacks specific knowledge about, the recognition features override the uncertainty features, and the model confabulates rather than admitting ignorance. The internal competition between "I recognize this" and "I don't know the specific answer" is resolved by the model's training-driven preference for confident pattern completion.
These findings paint a picture of an internal computational process that is far richer, more parallel, and more approximate than the clean sequential output suggests. The model is not a calculator that sometimes makes errors. It is an associative pattern-completion engine that produces remarkably precise outputs given that its internal process is fundamentally approximate and probabilistic.
6. How Context Actually Works (And Where It Fails)
The transformer processes its entire context window in parallel, but that does not mean it attends to all positions equally. Understanding the model's actual attention patterns reveals another dimension of its "preferences."
Liu et al. discovered the lost-in-the-middle phenomenon: LLMs exhibit a strong U-shaped attention curve, with the highest performance when relevant information is at the beginning or end of the context, and performance degrading by 30% or more when critical information is placed in the middle - arXiv. This mirrors the serial position effect in human memory (primacy and recency bias), but appears to arise from architectural properties, such as how positional encodings like RoPE interact with attention, rather than from cognitive mechanisms.
The model "prefers" to attend to the beginning and end of its context. This is not a design choice by the model. It is a consequence of how positional encodings interact with the attention mechanism. Positions near the edges get stronger attention signals. This is an architectural bias, not a learned preference, but it shapes the model's behavior as surely as any learned behavior.
This has practical consequences that most people get wrong. When structuring prompts, the instinct is to put the most important information "at the top" (beginning). But for many tasks, the most relevant context should be at both the beginning and the end, not buried in the middle. System prompts work well partly because they occupy the beginning of the context. User messages work well because they occupy the end. Instructions placed in the middle of a long document may receive less attention than the same instructions placed at the beginning or end.
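A sketch of a prompt builder that respects the U-shape. The structure is the point; the strings are placeholders:

```python
def build_prompt(instruction, documents, question):
    """Place critical content at the context edges, where the U-shaped
    attention curve is strongest: instruction first, bulk reference
    material in the middle, question plus a restated instruction last."""
    return "\n\n".join([
        instruction,                    # beginning: high attention
        *documents,                     # middle: bulk material, lowest attention
        question,                       # end: high attention
        f"Reminder: {instruction}",     # restate the key constraint at the edge
    ])

prompt = build_prompt("Answer using only the documents below.",
                      ["<contents of doc 1>", "<contents of doc 2>"],
                      "Which document mentions pricing?")
print(prompt.startswith("Answer") and prompt.endswith("documents below."))  # True
```

The restated instruction at the end costs a few tokens and buys placement in the second high-attention region, which matters most as the middle section grows.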
Context window size has exploded in recent years, from 2K tokens to 128K and beyond. But raw context size is not the same as effective context utilization. The model's attention budget is finite: the amount of "processing power" that attention heads can devote to any single position decreases as context grows. Longer contexts give the model more information to draw from, but each piece of information gets proportionally less attention.
The model's preferred relationship with context is nuanced. It wants more context (more information to pattern-match against), but it has architectural limits on how effectively it can use that context. This is another instance of the fundamental tension between the model's continuous, parallel processing and the constraints imposed by its architecture.
Mixture of Experts (MoE) architectures add another layer to this picture. MoE models like Mixtral only activate a small fraction of their total parameters for any given input. Switch Transformers demonstrated a 7x pretraining speedup by routing each token to a single expert out of many - arXiv. This confirms that the model's processing is naturally sparse: for any given input, only a small subset of the model's knowledge is relevant. The model does not engage all of its parameters uniformly. It selectively activates the subset most relevant to the current pattern.
This sparse activation pattern mirrors the superposition finding from earlier. The model stores a vast number of features in its representational space but activates only a sparse subset for any given input. The "preferred" mode is selective, context-dependent activation, not uniform engagement of all knowledge. The model is not a monolithic processor that applies all its knowledge to every input. It is more like a vast library where the relevant books fly off the shelves in response to each query.
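The routing mechanism itself is small. A toy sketch of top-k gating, with random weights standing in for trained experts (Mixtral routes to 2 of 8; Switch Transformers route to 1):

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=2):
    """Sparse MoE routing: a learned gate scores every expert, but only
    the top-k actually run for this input."""
    scores = gate_W @ x
    chosen = np.argsort(-scores)[:top_k]       # indices of the top-k experts
    g = np.exp(scores[chosen] - scores[chosen].max())
    g /= g.sum()                               # renormalize gate over top-k
    out = sum(gi * experts[i](x) for gi, i in zip(g, chosen))
    return out, chosen

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x   # toy "experts"
           for _ in range(n_experts)]
gate_W = rng.standard_normal((n_experts, d))
out, used = moe_forward(rng.standard_normal(d), gate_W, experts)
print(len(used), "of", n_experts, "experts ran")
```

The other six experts contribute nothing to this token, which is exactly how MoE models keep total parameter counts high while per-token compute stays low.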
Our deep dive into attention mechanisms and the state of algorithms in 2026 covers the technical evolution of attention patterns and their practical implications. The key insight here is that the model's context processing is not the uniform, egalitarian, "attend to everything equally" operation that the mathematical formulation might suggest. It is shaped by positional biases, sparsity preferences, and capacity limits that create a non-uniform attention landscape.
7. The Native Mode: Six Properties of LLM-Preferred Processing
Drawing together all of the research above, we can now characterize the LLM's native mode of processing with six core properties. These are not human preferences imposed on the model. They are properties that emerge from the architecture, training dynamics, and mathematical structure of transformer language models.
7.1 Associative, Not Logical
The model's fundamental operation is associative pattern completion, not logical deduction. The attention mechanism computes soft similarity between all positions. The MLP layers store knowledge as pattern-to-completion pairs. The induction heads implement "I saw this before, here is what came next." The entire architecture is optimized for finding the best-matching pattern and completing it.
This does not mean the model cannot perform logic. It can, and often does so impressively. But logical reasoning is an emergent capability built on top of associative pattern matching, not a native operation. When the model performs multi-step reasoning, it is chaining associative completions: each step's output becomes the pattern for the next step's completion. This is why chain-of-thought works: it gives the model intermediate patterns to associate from, rather than requiring a single associative leap from premise to conclusion.
The practical implication is that the model is most reliable when the reasoning required is close to pattern completion (common patterns, well-represented domains, standard formats) and least reliable when the reasoning requires precise symbolic manipulation far from any training pattern (novel mathematical proofs, complex logic puzzles, unusual formal systems). This is not a "bug." It is a direct consequence of the model's native architecture being associative rather than logical.
7.2 Continuous, Not Discrete
The model's internal representation is continuous, high-dimensional, and geometric. Concepts are directions in vector space, not discrete symbols. Knowledge is stored as continuous-valued weight matrices, not as rows in a table. The internal computation at every layer operates on real-valued vectors.
Discrete token generation is a constraint imposed by the output architecture, not a reflection of the model's internal computation. The softmax bottleneck provably limits the model's expressiveness. When researchers remove this bottleneck (as in Coconut), performance improves. The model "prefers" to operate in continuous space and "resists" the discretization that token generation requires.
For anyone building systems that interact with LLMs, this means that the model's token-level outputs are a compressed, lossy representation of a richer internal state. Two responses that differ by a single token may have had very similar internal representations, and two responses that look similar may have had very different internal computations. The token level is not the model's natural resolution.
7.3 Parallel, Not Sequential
The transformer processes its entire input simultaneously through self-attention. Every position attends to every other position in a single computational step. There is no inherent sequential ordering in the model's core computation. The model sees all of its context at once.
Sequential token generation is a constraint of the autoregressive output mechanism. Each token is generated one at a time, with each new token conditioning on all previous tokens. But the internal processing for each token is parallel: the full forward pass through all layers processes all context positions simultaneously.
This means the model is at its best when it can leverage holistic, whole-context patterns rather than sequential, step-by-step procedures. The model's ability to "see" the entire context simultaneously is a superpower that sequential processing systems lack, but the autoregressive output mechanism forces the model to serialize its rich parallel computation into a linear sequence of tokens.
7.4 Compressed, Not Exhaustive
The model's training objective (next-token prediction) is mathematically equivalent to compression. The model learns to represent the statistical structure of its training data as efficiently as possible. It stores compressed representations of patterns, not exhaustive records of facts.
This means the model "prefers" regularity, generality, and parsimony over completeness, specificity, and exhaustiveness. It naturally generalizes (because generalization compresses better than memorization) and naturally omits rare details (because rare details are expensive to compress). Hallucination is the flip side of compression: the model fills in details consistent with its compressed model of the world, even when the specific details are wrong.
This property is often misunderstood as a failure. It is not. It is the same property that enables the model's remarkable ability to generalize, transfer knowledge across domains, and handle novel inputs. Compression and generalization are the same thing expressed in different vocabularies. A model that perfectly memorized its training data without compression would be a database, not an intelligence.
Our guide to LLM compression techniques explores the engineering side of compression (quantization, distillation, pruning). But the deeper point here is that the model itself is a compressor. Its entire purpose, mathematically, is compression. Everything it does, every capability it exhibits, emerges from the pressure to compress its training data more efficiently.
7.5 Superimposed, Not Separated
Information in the model is not stored in discrete, separable locations. It is superimposed: thousands of features share the same neurons, distinguished only by direction in activation space. Multiple hypotheses are represented simultaneously in a single hidden state. Multiple reasoning paths are explored in parallel during a single forward pass.
This superposition is the model's native representation strategy. It allows the model to represent exponentially more concepts than it has neurons. But it also means that the model cannot perfectly isolate one piece of information from another. Features interfere with each other slightly. Activating one concept slightly activates nearby concepts. This is why models exhibit associative "leaps" that can be both creative (making novel connections between concepts) and confusing (mixing up related but distinct facts).
The superimposed representation also explains why prompt injection and adversarial attacks work. The model cannot perfectly separate "instructions" from "data" because, internally, both are represented as feature activations in the same shared space. There is no discrete boundary between "the system prompt" and "the user input" at the representational level. Both are just patterns in the same continuous geometric space.
7.6 Emergent, Not Designed
The model's internal structure (features, circuits, world models) is not designed or specified. It emerges from training. The optimizer discovers whatever internal structure most efficiently compresses the training data. Induction heads, world models, multi-hop reasoning circuits: all of these are emergent properties that appear because they serve the compression objective.
This means the model's capabilities are not modular in the way that designed software is modular. You cannot cleanly enable or disable a specific capability. You cannot add a new capability by adding a new module. Capabilities are distributed across the model's weights in complex, overlapping patterns that emerged from training. Fine-tuning changes many capabilities simultaneously because they share representational resources.
The emergent nature of the model's internal structure is also why alignment is hard. The model's "goals" (such as they are) are implicit in its weights, not explicitly specified in code. Changing the model's behavior requires changing the statistical structure of its compressed representations, not editing a configuration file. This is fundamentally different from traditional software, where behavior is determined by explicit logic that can be inspected and modified.
8. What This Means for How We Build with AI
Understanding the model's native mode has practical consequences for how we design AI systems. Most current AI application architecture implicitly assumes that the model is a text-in, text-out function that follows instructions. The research suggests a more nuanced picture.
The model works best when you work with its native properties rather than against them. This means:
Provide patterns, not just rules. The model's architecture is optimized for pattern completion. Few-shot examples (showing the model what you want) align better with the model's native operation than lengthy rule sets (telling the model what to do). Both work, but examples leverage the induction head mechanism directly, while rules require the model to translate instructions into patterns internally.
Put critical information at the edges of context. The lost-in-the-middle finding means that information buried in the middle of a long context receives less attention. Structure your prompts so that the most important information is at the beginning (system prompt, key instructions) and end (the specific query or task), with supporting context in the middle.
Accept the compression trade-off. The model will generalize, approximate, and occasionally hallucinate because it is a compression engine, not a database. Design your systems to verify critical facts rather than trusting the model's compressed representations blindly. Use retrieval-augmented generation (RAG) to supply specific facts that the model should not have to "remember" from its compressed training data. Our RAG guide covers the practical implementation.
Use the model's parallel processing capability. The model sees the entire context simultaneously. Long, detailed context that a human would struggle to process linearly can be effective because the model attends to all of it in parallel. Do not be afraid of long prompts if the information is relevant. The model can find the needle in the haystack (within the constraints of positional biases).
Let the model think. Chain-of-thought and extended reasoning give the model more forward passes, expanding its computational budget. For complex tasks, the additional computation from generating intermediate tokens is not waste. It is the model using sequential token generation as a scaffold for deeper computation. The model's internal per-forward-pass computation is fixed, so giving it more forward passes (through generated tokens) is the primary way to increase total computation on a problem.
Platforms like o-mega.ai that deploy AI agents for autonomous business operations are implicitly navigating these trade-offs at the infrastructure level. When an agent is given a complex task, it decomposes it (creating more pattern-completion steps), retrieves relevant context (supplying specific facts rather than relying on compressed knowledge), and chains multiple model calls (expanding computational budget). These are architectural accommodations for the model's native properties.
The most successful AI systems in 2026 will be those that deeply understand what the model is (an associative compression engine in continuous space) and design their architectures accordingly, rather than treating the model as a black-box text function with unpredictable behavior.
9. The Future: Letting the Model Be What It Is
The trajectory of AI research is increasingly pointing toward architectures that let the model operate closer to its native mode, removing constraints that force it to work against its fundamental properties.
Continuous reasoning (like Coconut and latent reasoning) removes the discrete token bottleneck for intermediate computation, letting the model stay in continuous space where it computes naturally. Early results show better performance with less computation. This direction will likely become standard within the next few years.
Sparse activation (Mixture of Experts, sparse attention) embraces the model's natural sparsity. Instead of forcing every parameter to engage with every input, MoE architectures let the model selectively activate the knowledge most relevant to the current context. This is closer to how the model naturally operates internally.
Multi-modal native models recognize that the model's internal representation is already cross-modal. The superposition and feature research shows that concepts are represented abstractly, transcending any single modality. Models that process text, images, audio, and video through the same representational space are working with the model's native cross-modal representations, not forcing separate modality-specific processing paths.
Longer and more efficient context mechanisms (sliding window attention, hierarchical attention, memory-augmented architectures) address the model's positional biases and attention scaling limitations. The model "wants" access to more context, but current architectures create diminishing returns at long context lengths. Better attention mechanisms will let the model use more context more effectively.
Agentic architectures that decompose complex tasks into chains of model calls effectively give the model the variable-depth computation it "wants." Instead of a single forward pass (or a single chain-of-thought), an agent architecture can allocate more computation to harder sub-problems and less to easier ones. This dynamic allocation of computational budget is closer to the model's need for variable-depth processing than a single monolithic generation.
The research also suggests some less obvious future directions. If the model's internal representation is geometric and continuous, future interfaces might not use text at all. Direct embedding-to-embedding communication between models, or between models and databases, could bypass the lossy text bottleneck entirely. The embedding infrastructure being built today for search and retrieval is a step in this direction, but the full vision is a world where models communicate in their native continuous representation rather than through the intermediate format of text.
The model's native mode also suggests that the current paradigm of prompt engineering may be a temporary phase. Prompt engineering is the art of translating human intent into text that triggers the right patterns in the model's associative memory. But if we can interact with the model closer to its native representation level, the translation overhead disappears. Early work on activation steering (directly manipulating the model's internal feature activations to control its behavior) points toward this possibility.
%%title: The LLM Processing Stack
%%subtitle: From native computation to human-usable output, each layer adds constraints
graph TD
A["Continuous Embedding Space<br/><i>The model's native mode</i>"] --> B["Feature Superposition<br/><i>34M+ concepts as directions</i>"]
B --> C["Attention + MLP Circuits<br/><i>Parallel associative computation</i>"]
C --> D["Residual Stream<br/><i>Shared geometric workspace</i>"]
D --> E["Softmax Bottleneck<br/><i>Continuous → discrete projection</i>"]
E --> F["Token Output<br/><i>Lossy serialization of internal state</i>"]
F --> G["Text / JSON / Code<br/><i>Human-readable format</i>"]
The deepest insight from all of this research is that we have been building AI systems backward. We start with the human-preferred output format (text, JSON, structured data) and work backward to the model, asking: "how do I get the model to produce this format reliably?" The research suggests we should start with the model's native mode (associative pattern completion in continuous geometric space) and work forward to the human interface, asking: "how do I build systems that let the model operate in its native mode for as long as possible, and only project to human-readable formats at the last possible moment?"
The companies and researchers who understand this, who build systems that respect the model's native processing mode rather than fighting against it, will build the most capable and efficient AI systems. The model is not a text-in, text-out function. It is a geometric compression engine operating in continuous high-dimensional space, and the more we let it be what it is, the better it will perform.
%%title: Native Mode vs Imposed Mode
%%subtitle: How current systems constrain LLMs versus what the architecture prefers
graph LR
subgraph NATIVE["What the Model Prefers"]
N1["Continuous space"] --> N2["Parallel processing"]
N2 --> N3["Associative completion"]
N3 --> N4["Sparse activation"]
N4 --> N5["Variable-depth compute"]
end
subgraph IMPOSED["What We Force On It"]
I1["Discrete tokens"] --> I2["Sequential output"]
I2 --> I3["Rule following"]
I3 --> I4["Full parameter use"]
I4 --> I5["Fixed-depth forward pass"]
end
The model does not "want" anything in the human sense. But it has a native mode of computation that is profoundly different from every data processing paradigm humans have invented. It is associative, not logical. Continuous, not discrete. Parallel, not sequential. Compressed, not exhaustive. Superimposed, not separated. Emergent, not designed.
The more we understand this native mode, the better we can work with it. And the better we work with it, the more we unlock what these models are truly capable of.
This guide is written by Yuma Heymans (@yumahey), founder of o-mega.ai and creator of the AI Agent Index tracking 600+ autonomous AI systems. His work on AI agent architecture and multi-agent orchestration is rooted in understanding how LLMs fundamentally process information at the mathematical level.
This guide reflects the state of mechanistic interpretability and transformer research as of April 2026. This field is evolving rapidly. Verify current findings against the latest publications from Anthropic's Transformer Circuits team, EleutherAI, and leading ML conferences (NeurIPS, ICML, ICLR).