The insider guide to SubQ, subquadratic attention, and what it means for the future of LLMs.
A Miami startup just claimed it built the first fully subquadratic frontier LLM, with a 12 million token context window and 1,000x less compute than existing models. If SubQ's claims hold, it would represent the most fundamental architectural shift in language models since the original transformer paper in 2017. If they don't, it joins a growing list of extraordinary claims backed by impressive funding and very little verifiable evidence. This guide breaks down exactly what SubQ is, how its architecture works from first principles, what the evidence actually shows, and how to think about whether this is a genuine breakthrough or something else entirely.
The attention mechanism inside transformer-based language models is one of the most consequential pieces of software ever written. It is also one of the most expensive. Every major LLM from Claude to GPT to Gemini runs on some variant of the same core operation: for every token in the input, compute a relevance score against every other token, then use those scores to mix information. The cost of this operation scales quadratically with the number of tokens. Double the input, quadruple the cost. This is the fundamental bottleneck that determines how long a context window can be, how much inference costs, and ultimately how capable these systems are at processing large amounts of information.
SubQ claims to have eliminated this bottleneck entirely. Not reduced it by a constant factor. Not worked around it with engineering tricks. Eliminated the quadratic scaling itself, replacing it with an architecture that scales linearly with context length. This guide examines that claim with the depth it deserves.
Written by Yuma Heymans (@yumahey), who has been building AI agent infrastructure at O-mega where long-context models directly determine what autonomous agents can accomplish in a single pass.
Contents
- Why Attention is Quadratic (and Why That Matters)
- The Landscape of Solutions Before SubQ
- What SubQ Actually Claims
- How Subquadratic Sparse Attention (SSA) Works
- The Benchmarks: What the Numbers Show
- The Team and the Money Behind It
- The Skeptic's Case Against SubQ
- The Magic.dev Precedent
- Theoretical Limits on Subquadratic Attention
- What "Fully Subquadratic" Really Means
- The Competitive Landscape: How Frontier Labs Handle Long Context Today
- First-Principles Assessment: Breakthrough or Not?
- What This Means for Builders
1. Why Attention is Quadratic (and Why That Matters)
To understand why SubQ matters (or doesn't), you first need to understand the problem it claims to solve. This requires going back to the fundamental operation at the core of every transformer-based language model: self-attention.
Self-attention is the mechanism that allows a language model to understand relationships between words in a sequence. When a model processes the sentence "The cat sat on the mat because it was tired," attention is what allows the model to figure out that "it" refers to "the cat" and not "the mat." It does this by computing a score between every pair of tokens in the sequence, indicating how much each token should "attend to" every other token.
The mathematical operation is straightforward. For a sequence of N tokens, the model creates three matrices: queries (Q), keys (K), and values (V). Each token generates a query vector ("what am I looking for?"), a key vector ("what do I contain?"), and a value vector ("what information do I carry?"). The attention computation multiplies Q by the transpose of K to produce an N x N attention matrix, applies softmax to normalize the scores, then multiplies by V to produce the output. The critical step is that N x N matrix. For 1,000 tokens, this is a 1 million-entry matrix. For 100,000 tokens, it is a 10 billion-entry matrix. For 1 million tokens, it is a 1 trillion-entry matrix.
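To make that concrete, here is a minimal NumPy sketch of the standard attention computation just described. Everything in it (the scaling by sqrt(d), the softmax over rows, the matrix shapes) follows the textbook formulation; the thing to notice is the (N, N) score matrix that gets built for every layer and every head.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard softmax attention. Materializes the full N x N score matrix,
    which is where the quadratic cost in compute and memory comes from."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (N, N) attention scores
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # (N, d) mixed outputs

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = naive_attention(Q, K, V)   # intermediate score matrix had N * N = ~1M entries
```

At N = 1,024 the score matrix already has about a million entries; at N = 1,000,000 it would have a trillion.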
This is what quadratic scaling means in practice. The number of operations grows with the square of the sequence length. Double the input from 500K to 1M tokens and you don't double the compute cost, you quadruple it. The concrete numbers are staggering. For a single attention layer with a hidden dimension of 4,096, processing 1 million tokens requires approximately 8.19 petaFLOPs just for the query-key multiplication. A frontier model with 80 layers would need roughly 655 petaFLOPs for attention alone - Tensor Economics. At 10 million tokens, this becomes 65,500 petaFLOPs, a 100x increase for a 10x growth in context.
Memory is equally problematic. The N x N attention matrix for 1 million tokens contains 1 trillion entries. In FP16 precision, that is 2 terabytes per layer per head. No single GPU can hold this. This is why FlashAttention's memory reduction from O(N^2) to O(N) was essential for making million-token contexts even possible, even though FlashAttention does not reduce the computational complexity itself.
The practical consequence is that every frontier lab hits a wall. Context windows have grown from 4K tokens in early GPT models to 1 million tokens in current models from Anthropic, OpenAI, and Google. But the cost of pushing beyond 1 million grows explosively. A 12 million token context window using standard attention would require approximately 94,000 petaFLOPs per forward pass for attention alone, roughly 144 times more compute than a 1 million token pass. This is not a matter of buying more GPUs. It is a fundamental constraint on what the architecture can do. And this is the constraint SubQ claims to have broken.
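These figures are easy to sanity-check yourself. The back-of-the-envelope below uses the same simplifying assumptions as the numbers above: only the query-key multiplication, a hidden dimension of 4,096, 80 layers, and a multiply-add counted as two floating-point operations.

```python
def attention_qk_flops(n_tokens, d_model=4096, n_layers=80):
    """Rough FLOP count for the Q @ K^T step of standard attention."""
    per_layer = 2 * n_tokens**2 * d_model        # multiply-add counted as 2 FLOPs
    return per_layer, per_layer * n_layers

for n in (1_000_000, 10_000_000, 12_000_000):
    per_layer, total = attention_qk_flops(n)
    print(f"{n:>12,} tokens: {per_layer / 1e15:9.2f} PFLOPs per layer, "
          f"{total / 1e15:12,.0f} PFLOPs across 80 layers")
# ~8.19 PFLOPs/layer and ~655 PFLOPs total at 1M tokens,
# ~65,500 PFLOPs at 10M, and ~94,000 PFLOPs at 12M.
```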
2. The Landscape of Solutions Before SubQ
SubQ did not emerge in a vacuum. The quadratic attention bottleneck has been the single most studied problem in deep learning for the past five years. Dozens of research groups have proposed solutions, each with different trade-offs. Understanding this landscape is essential for evaluating whether SubQ's claimed approach is genuinely novel or a repackaging of existing ideas.
FlashAttention: Making Quadratic Faster (Without Changing It)
The most impactful work on the attention bottleneck came not from changing the algorithm but from optimizing its implementation. FlashAttention, created by Tri Dao at Stanford, is an IO-aware attention algorithm that uses tiling to minimize data movement between GPU memory levels - arXiv. The key insight is that the bottleneck for attention on modern GPUs is not the arithmetic but the memory bandwidth: reading and writing the large intermediate matrices to and from slow GPU high-bandwidth memory (HBM).
FlashAttention restructures the computation to keep data in fast on-chip SRAM as much as possible, computing attention in small tiles and never materializing the full N x N matrix. The result is dramatic speedups (2-4x in version 1, up to 73% of theoretical peak throughput in version 2) and a memory reduction from O(N^2) to O(N) for intermediate storage. FlashAttention-3, which won a NeurIPS 2024 paper award, pushed utilization to 75% on H100 GPUs by exploiting warp-level parallelism and FP8 quantization - Tri Dao's blog.
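The core trick is an online (running) softmax computed over tiles of keys, so the full score matrix never needs to exist at once. Here is a simplified NumPy sketch of that bookkeeping; the real FlashAttention kernels also tile over queries, fuse everything into a single GPU kernel, and handle causal masking, none of which is shown here.

```python
import numpy as np

def tiled_attention(Q, K, V, block=128):
    """Blockwise attention with a running softmax, in the spirit of FlashAttention.
    Only an (N, block) tile of scores exists at any time, so memory is O(N),
    but the total FLOP count is still O(N^2)."""
    N, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)      # running max of scores seen so far, per query
    row_sum = np.zeros(N)              # running softmax denominator, per query
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)                  # (N, block) tile
        new_max = np.maximum(row_max, scores.max(axis=1))
        rescale = np.exp(row_max - new_max)             # correct earlier partial sums
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]
```

On the same inputs this matches the naive computation to floating-point precision, which is the whole point: FlashAttention is exact, it just avoids ever writing the N x N matrix to slow memory.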
But here is the critical point: FlashAttention does not change the complexity class. The total number of floating-point operations is still O(N^2). It makes those operations faster by being smarter about memory access, but the quadratic scaling wall remains. At sufficiently long contexts, FlashAttention cannot save you. This is the wall SubQ claims to have broken through, and it is why SubQ specifically benchmarks against FlashAttention: claiming 52x faster at 1 million tokens.
State Space Models: Linear but Lossy
An entirely different approach comes from state space models (SSMs), most notably Mamba, created by Albert Gu and Tri Dao - arXiv. Instead of computing pairwise interactions between all tokens, SSMs process sequences through a continuous state evolution equation: x(t+1) = A * x(t) + B * u(t). Each token updates a fixed-size state matrix rather than comparing against all previous tokens. This is inherently O(N) in sequence length.
Mamba's key innovation is making the state transition parameters (A, B, C) input-dependent rather than fixed. At each position, the model can selectively decide what to remember and what to forget based on the current input. This selective scan mechanism is what makes Mamba competitive with transformers on many language tasks.
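A toy version of the recurrence makes the trade-off easy to see. This is not Mamba's actual parameterization (the real model discretizes continuous-time dynamics and uses more structure); it is just the shape of the idea: gates computed from the current input decide how much of a fixed-size state to keep and what to write into it.

```python
import numpy as np

def selective_ssm_scan(u, W_gate, W_write, W_read):
    """Toy input-dependent (selective) state-space recurrence.
    Cost is O(N) in sequence length, but the entire history is squeezed
    into a state vector whose size does not grow with N."""
    N, _ = u.shape
    d_state = W_write.shape[1]
    x = np.zeros(d_state)                                 # fixed-size state
    outputs = []
    for t in range(N):
        gate = 1.0 / (1.0 + np.exp(-(u[t] @ W_gate)))     # what to keep, in (0, 1)
        x = gate * x + u[t] @ W_write                     # forget a little, write a little
        outputs.append(x @ W_read)                        # read out of the compressed state
    return np.stack(outputs)
```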
The fundamental limitation of SSMs, however, is information compression. The state has a fixed size (on the order of d x d), so it can hold only a bounded amount of information no matter how long the context grows. If the context contains more information than can fit in this compressed state, information is permanently lost. This manifests most clearly in associative recall tasks. If "Harry Potter" appears at position 1,000 in the context, and the model encounters "Harry" at position 500,000, an attention-based model can directly look up position 1,000 and retrieve "Potter." An SSM has to hope that the association survived 499,000 state updates of lossy compression. In practice, it often doesn't.
This is why no pure SSM has achieved frontier-quality performance on retrieval-heavy tasks. The Mamba-2 paper itself suggests that hybrid models (mixing SSM and attention layers) outperform both pure architectures. SubQ claims to avoid this trade-off entirely: maintaining attention's retrieval precision while achieving SSM-level efficiency. That is a very specific and testable claim.
Linear Attention: Algebraic Tricks with Caveats
Another family of approaches tries to make attention itself cheaper through algebraic reformulation. The foundational paper by Katharopoulos et al. (2020), titled "Transformers are RNNs," showed that if you replace the softmax function with a kernel function phi(), you can change the order of matrix multiplication from (Q * K^T) * V (which requires the N x N intermediate matrix) to Q * (K^T * V) (which produces only a d x d intermediate matrix). This changes the complexity from O(N^2 * d) to O(N * d^2), which is linear in sequence length N.
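The reordering is just associativity of matrix multiplication, which only becomes available once softmax is replaced by a feature map phi(). A small sketch, using the ELU-plus-one feature map from the Katharopoulos et al. paper (causal masking, which real linear-attention models handle with a running prefix sum, is omitted):

```python
import numpy as np

def phi(x):
    """ELU(x) + 1 feature map: keeps values positive, stands in for softmax."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                     # (d, d): compute K^T V first -- O(N * d^2)
    z = Qf @ Kf.sum(axis=0)           # (N,) normalizer, replaces the softmax denominator
    return (Qf @ kv) / z[:, None]

N, d = 4096, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Same answer as the (N, N)-matrix ordering, without ever building that matrix:
weights = phi(Q) @ phi(K).T
reference = (weights @ V) / weights.sum(axis=1, keepdims=True)
print(np.allclose(linear_attention(Q, K, V), reference))   # True
```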
The problem is what softmax does that kernels don't. Softmax creates a proper probability distribution: all weights are non-negative and sum to 1. This means attention can express sharp focus, placing nearly all weight on a single relevant token. Kernel approximations lose this property. They cannot express high-confidence retrieval in the same way, leading to what researchers call context collapse: the model's ability to selectively retrieve specific information degrades as context length grows.
RetNet from Microsoft (2023) and Gated Linear Attention (GLA) represent improvements on this basic idea, adding gating mechanisms that give the model more control over what to retain and forget. But the fundamental trade-off remains: linear attention mechanisms achieve O(N) scaling by compressing the attention computation into a fixed-size representation, and that compression is inherently lossy for retrieval tasks - Hailey Schoelkopf.
Fixed-Pattern Sparse Attention: Longformer, BigBird, and Friends
A third family of approaches keeps the attention mechanism but restricts which token pairs actually attend to each other. Longformer and BigBird use a combination of local sliding-window attention (each token attends to its neighbors) and global attention (designated tokens attend to everything). This reduces the number of computed attention scores from N^2 to O(N), achieving linear scaling.
The limitation is that the sparsity pattern is position-based, not content-based. The model decides which tokens to attend to based on their position in the sequence, not on what they contain. This means a critical piece of information 50,000 tokens away might be outside the local window and not connected to a global token, making it invisible to the model. No fixed-pattern sparse attention model has achieved frontier-quality performance on general language tasks.
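A sketch of the pattern shows why it is position-based rather than content-based. The window size and choice of global tokens below are arbitrary illustrations, not the actual Longformer or BigBird configurations:

```python
import numpy as np

def fixed_sparse_mask(N, window=4, global_tokens=(0,)):
    """Position-based sparsity: each token sees its local neighborhood plus a few
    designated global tokens, regardless of what any token actually contains."""
    mask = np.zeros((N, N), dtype=bool)
    for i in range(N):
        mask[i, max(0, i - window):min(N, i + window + 1)] = True   # sliding window
    for g in global_tokens:
        mask[:, g] = True   # everyone attends to the global token
        mask[g, :] = True   # the global token attends to everyone
    return mask

m = fixed_sparse_mask(4096)
print(f"{m.sum():,} of {m.size:,} token pairs computed")   # O(N * window), not N^2
```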
DeepSeek Native Sparse Attention: Learned but Still Quadratic
The most sophisticated pre-SubQ approach is DeepSeek's Native Sparse Attention (NSA), which won Best Paper at ACL 2025 - arXiv. NSA uses a three-branch hierarchical approach: compressed coarse-grained tokens (summaries of groups of tokens), selectively retained fine-grained tokens (the most important individual tokens), and a sliding window for local context. The key innovation is that the sparsity pattern is learned during pretraining, not applied post-hoc. The model learns which tokens are worth attending to.
NSA achieves impressive results: 11.6x faster decoding for 64K-length sequences, and remarkably, it actually surpasses full attention on several benchmarks. But critics point out that NSA's underlying complexity is still O(N^2). The selection mechanism reduces compute by a constant factor, but the routing and selection overhead, plus the full softmax attention over selected positions, means the quadratic dependency on N has not been eliminated. It is a very good constant-factor optimization, not a complexity class change.
Kimi Linear: The 75/25 Hybrid
Moonshot AI's Kimi Linear architecture represents the most transparent attempt at a hybrid approach. It uses a 3:1 ratio of linear attention layers (using their Kimi Delta Attention module) to standard quadratic Multi-head Latent Attention (MLA) layers - arXiv. So 75% of the model's layers are subquadratic, and 25% retain full quadratic attention.
The results are strong: Kimi Linear outperforms pure MLA across all evaluated tasks while reducing KV cache by up to 75% and achieving up to 6x decoding throughput at 1 million tokens. But mathematically, a model with even one quadratic layer is still O(N^2) asymptotically. The 25% quadratic layers dominate the scaling behavior at sufficiently long contexts. Kimi Linear makes long context much cheaper through constant-factor improvements, but it does not change the complexity class. SubQ claims to have no quadratic layers at all. Zero. This would be qualitatively different from everything described above.
Ring Attention: Distributing the Problem
UC Berkeley's Ring Attention, published at ICLR 2024, takes a distribution approach - arXiv. Devices form a ring, with each device holding a block of the sequence. KV blocks are passed around the ring while attention is computed in parallel, overlapping computation with communication. This solves the memory problem (no single device needs to hold the full KV cache) but does not reduce compute. Every query still attends to every key. The total FLOPs are identical to standard attention. Ring Attention is an engineering solution, not an algorithmic one.
The breadth of this landscape matters for evaluating SubQ. Every major research group in the world has been working on this problem. The solutions that work well (FlashAttention, NSA, Kimi Linear) achieve constant-factor improvements while remaining fundamentally quadratic. The solutions that achieve true subquadratic complexity (SSMs, linear attention) sacrifice retrieval quality. Nobody has achieved both: true subquadratic scaling with full attention-quality retrieval at frontier scale. That is exactly what SubQ claims to have done.
3. What SubQ Actually Claims
SubQ launched on May 5, 2026, coming out of stealth as the first product from Miami-based startup Subquadratic - SiliconANGLE. The announcement was made primarily through a tweet by CTO Alexander Whedon (@alex_whedon) and a blog post on subq.ai. The claims are specific and extraordinary.
The headline claim is that SubQ is the first frontier LLM built on a fully subquadratic sparse-attention architecture. Not a hybrid with some quadratic layers retained. Not an approximation that sacrifices quality. A model where every layer uses subquadratic attention, and the overall architecture scales linearly, O(N), with context length.
The model ships in two configurations. A research model supports up to 12 million tokens of context, which would be the longest context window of any model by a factor of 12 over the current 1 million token standard from Anthropic, OpenAI, and Google. A production model (SubQ 1M-Preview) supports 1 million tokens, matching current frontier models but claiming to do so at radically lower cost.
The specific performance claims, organized by category:
Speed: SubQ claims to be 52x faster than FlashAttention at 1 million tokens for input processing. The claimed speedup scales with context length: 7.2x at 128K tokens, 13.2x at 256K, 23x at 512K, and 52.2x at 1M. If attention were truly O(N) versus O(N^2), you would expect the speedup to grow linearly with N, which is roughly consistent with these numbers (a 7.8x increase in context from 128K to 1M producing a 7.25x increase in speedup).
Compute: The headline figure is a 1,000x compute reduction at 12 million tokens versus frontier transformer models. At 1 million tokens, SubQ claims a 62.5x reduction in attention FLOPs. These numbers describe the attention operation specifically, not the full model forward pass (the FFN layers, which scale linearly regardless, would be the same).
Cost: SubQ claims to run at less than 5% the cost of Claude Opus for equivalent tasks. Their blog post provides a specific comparison: processing a RULER 128K benchmark query costs approximately $8 with SubQ versus approximately $2,600 with Claude Opus, a 300x cost reduction - SubQ blog.
Quality: SubQ published benchmarks on three tests. On RULER at 128K tokens (a long-context retrieval benchmark), SubQ scored 95.0% versus Claude Opus 4.6's 94.8%. On MRCR v2 at 1 million tokens (multi-document reasoning and retrieval), the production model scored 65.9% and the research model scored 83.0%, compared to Claude Opus 4.6's 32.2% and GPT-5.5's 74.0%. On SWE-Bench Verified (a coding benchmark), SubQ scored 81.8% versus Claude Opus 4.6's 80.8%.
Three products launched simultaneously, all in private beta: SubQ API (OpenAI-compatible inference endpoints), SubQ Code (a CLI coding agent that loads entire codebases into a single context window), and SubQ Search (a free research tool leveraging the long context). None are publicly available for independent testing yet.
The scaling curves chart published on SubQ's website illustrates the core claim visually. While transformer attention cost accelerates upward as context grows (the characteristic quadratic curve), SubQ's cost grows as a straight line. If accurate, this means that the cost advantage of SubQ grows without bound as context increases, which is exactly what a genuine complexity class change would produce.
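The claimed speedup numbers can at least be checked for internal consistency. The snippet below uses only the figures SubQ itself published; it shows that the speedups grow roughly in proportion to context length, which is what a complexity-class change would predict, but which a well-tuned constant-factor optimization could also produce over this particular range.

```python
# SubQ's published speedup claims over FlashAttention, by context length.
claimed = {128_000: 7.2, 256_000: 13.2, 512_000: 23.0, 1_000_000: 52.2}

base_n, base_speedup = 128_000, claimed[128_000]
for n, speedup in claimed.items():
    print(f"{n:>9,} tokens: context grew x{n / base_n:4.1f}, "
          f"claimed speedup grew x{speedup / base_speedup:4.2f}")
# Context grows ~7.8x from 128K to 1M and the claimed speedup grows ~7.25x:
# roughly linear, as an O(N) vs O(N^2) gap would predict.
```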
4. How Subquadratic Sparse Attention (SSA) Works
SubQ's architecture is called Subquadratic Sparse Attention (SSA). The company has published a technical explainer on their website but no peer-reviewed paper, no arXiv preprint, and no open weights. What follows is based on the published descriptions and inferences from the claimed properties - SubQ technical explainer.
The fundamental idea behind SSA is content-dependent token selection. In standard attention, every query token computes a relevance score against every key token. SSA replaces this exhaustive comparison with a selection step: for each query token, the model first identifies a small subset of key tokens that are likely to be relevant, then computes exact attention only over those selected positions. If each query selects K tokens out of N total (where K << N), the total compute becomes O(N * K) instead of O(N^2). If K is constant or grows sublinearly with N, the overall complexity is subquadratic.
This differs from existing approaches in several important ways. Fixed-pattern sparse attention (Longformer, BigBird) selects tokens based on position: local neighbors plus designated global tokens. SSA selects based on content: the model decides where to look based on the semantic meaning of the current query, not its position in the sequence. DeepSeek NSA also does content-dependent selection, but SubQ claims that NSA's selection mechanism itself has quadratic overhead, whereas SSA's routing is claimed to be subquadratic.
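A generic sketch of the select-then-attend pattern helps pin down what is and is not disclosed. The code below is not SSA (nobody outside SubQ knows what SSA's routing looks like); it simply shows the O(N * K) attention step once a subset of keys has been chosen, and it cheats on the hard part by scoring every key to pick the top K, which is exactly the quadratic step SubQ claims to have eliminated.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=64):
    """Content-dependent sparse attention, illustrative only: each query attends to
    its k highest-scoring keys, with exact softmax over that subset.
    NOTE: the selection below scores all N keys per query, so this sketch is still
    O(N^2); the undisclosed part of SSA is how to pick the subset subquadratically."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                         # (N, N) -- only for illustration
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]     # top-k key indices per query
    out = np.zeros_like(Q)
    for i in range(N):
        s = scores[i, idx[i]]
        w = np.exp(s - s.max())
        w /= w.sum()                                      # exact softmax over selected keys
        out[i] = w @ V[idx[i]]                            # attend to k keys, not N
    return out
```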
SSA's properties, as described by the company:
Linear scaling in both compute and memory. The total work grows proportionally with context length, not quadratically. At 1 million tokens, this means roughly 1/62.5 the attention FLOPs of standard attention.
Content-dependent routing. The selection of which tokens to attend to is based on semantic meaning, not fixed positional patterns. A legal contract at position 50,000 in the context can be directly retrieved by a query at position 11 million if the content is relevant, regardless of distance.
Exact attention over selected positions. Unlike linear attention (which approximates the attention computation) or SSMs (which compress context into a fixed state), SSA computes the actual softmax attention operation over its selected tokens. This means no information loss from approximation: the attention scores are mathematically identical to what standard attention would produce if you only computed those specific token pairs.
Sparse retrieval from arbitrary positions. The model can attend to tokens anywhere in the context, not just nearby tokens. This is the property that distinguishes SSA from local attention patterns.
The training methodology uses a three-stage approach: pretraining on the SSA architecture, supervised fine-tuning, and reinforcement learning specifically targeting long-context retrieval reliability. The CTO confirmed in social media comments that SubQ uses open-source model weights as a starting point, adapting them to the SSA architecture rather than training from scratch. This is a significant detail: it means SubQ's capabilities on general tasks (coding, reasoning, general knowledge) are inherited from existing models, and the novel contribution is specifically the attention mechanism.
The use of open-source weights as a starting point is itself an important data point. It means SubQ did not pretrain from scratch, which would require hundreds of millions of dollars of compute that a $29M seed round could not support. Instead, they likely took an existing model's weights, modified the architecture to replace standard attention with SSA, and then fine-tuned to recover or improve performance. This is a valid approach (many production models are built this way), but it constrains the claims. The model's general knowledge, coding ability, and reasoning come from the base model. SubQ's contribution is specifically the attention mechanism and the fine-tuning to make it work. This is not a criticism, but it does narrow the scope of what "breakthrough" means in this context.
The deepest unanswered question about SSA is how the token selection works. The company describes it as "content-dependent" but has not published the mechanism. The critical challenge is that any selection mechanism must itself be efficient: if you need to compare each query against all N keys to decide which ones are relevant, you've reintroduced the O(N^2) computation you were trying to avoid. Subquadratic selection requires some form of approximate nearest-neighbor search, hash-based routing, or learned routing function that can identify relevant tokens without exhaustive comparison.
Approaches like locality-sensitive hashing (used in Reformer from 2020) can approximate this in O(N log N), but they introduce approximation error that can hurt retrieval quality. Learned routing (training a small network to predict relevance) is another option, but the quality of this prediction directly bounds the quality of the attention output. SubQ has not disclosed which approach they use or how they ensure the selection step itself is both efficient and accurate. This is the key technical gap in their public disclosure.
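For concreteness, here is what one of those candidate approaches looks like: random-hyperplane locality-sensitive hashing, in the spirit of Reformer. This is an assumption-laden illustration, not SubQ's disclosed mechanism; it shows how queries can be matched to a shortlist of keys without exhaustive comparison, and also where the approximation error comes from.

```python
import numpy as np

def lsh_bucket_ids(X, n_planes=8, seed=0):
    """Hash each vector by the sign pattern of its projections onto random
    hyperplanes. Vectors pointing in similar directions tend to share a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_planes))
    bits = (X @ planes) > 0                          # (N, n_planes) sign pattern
    return bits @ (1 << np.arange(n_planes))         # integer bucket id per vector

N, d = 100_000, 64
rng = np.random.default_rng(1)
Q, K = rng.standard_normal((N, d)), rng.standard_normal((N, d))
q_buckets, k_buckets = lsh_bucket_ids(Q), lsh_bucket_ids(K)

# Each query only scores the keys that share its bucket, so the candidate set per
# query is roughly N / 2^n_planes instead of N. Relevant keys that hashed into a
# different bucket are simply missed -- that is the quality risk.
avg_candidates = np.mean([(k_buckets == b).sum() for b in q_buckets[:100]])
print(f"average candidate keys per query: {avg_candidates:,.0f} of {N:,}")
```

To actually change the complexity class, the number of buckets has to grow with the sequence length, and multiple hash rounds are typically needed to keep recall acceptable; both of those choices are exactly where efficiency and retrieval quality start to trade off.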
5. The Benchmarks: What the Numbers Show
SubQ published results on three benchmarks. Let's examine each one, what it measures, and what the results tell us.
RULER at 128K Tokens
RULER is a benchmark designed specifically to test long-context retrieval. It includes tasks like finding specific information scattered across a long document, tracking multiple pieces of information simultaneously, and aggregating information from different parts of the context. SubQ scored 95.0%, compared to Claude Opus 4.6's 94.8%.
This result is essentially a tie. A 0.2 percentage point difference on a single run is well within the noise of any benchmark. What it demonstrates, if accurate, is that SubQ can match frontier models on retrieval quality at 128K tokens. It does not demonstrate superiority. The significance is not the score itself but the claimed cost: SubQ says this benchmark costs $8 to run versus $2,600 for Claude Opus, a 300x reduction. If the quality is the same and the cost is 300x lower, the cost difference is the story, not the quality difference.
MRCR v2 at 1 Million Tokens
MRCR v2 (Multi-document Reasoning and Citation Retrieval) tests the ability to reason across and cite information from multiple documents in a very long context. This is where SubQ's results are most impressive. The production model scored 65.9% and the research model scored 83.0%, compared to Claude Opus 4.6's 32.2%, GPT-5.5's 74.0%, and Gemini 3.1 Pro's 26.3%.
If accurate, the research model's 83.0% would represent a significant improvement over all existing frontier models on million-token reasoning. The production model's 65.9% is also competitive, falling between Claude and GPT-5.5. There is a notable 17 percentage point gap between SubQ's research and production models, which the company attributes to the research model's larger context capacity and different inference configuration, but has not fully explained.
SWE-Bench Verified
SWE-Bench Verified measures a model's ability to solve real-world software engineering problems from GitHub issues. SubQ scored 81.8% compared to Claude Opus 4.6's 80.8%. Again, this is essentially a tie within noise.
What the Benchmarks Don't Show
The benchmark selection itself is worth analyzing. SubQ chose three tests: one long-context retrieval test (RULER), one million-token reasoning test (MRCR v2), and one coding test (SWE-Bench). All three play to SubQ's designed strengths: long-context processing and coding (which benefits from loading entire codebases into context).
Missing from the benchmark suite are general reasoning benchmarks (MMLU, ARC, HellaSwag), mathematical reasoning (MATH, GSM8K), instruction following (IFEval), creative writing assessments, safety evaluations, or any test that measures capabilities not directly related to long context. For a model that claims frontier status, this is a narrow evaluation. It is possible that SubQ performs well on general tasks since it inherits weights from open-source models, but it is also possible that the architectural changes required for SSA degraded general capabilities, and the company chose not to publish those results.
Additionally, all benchmarks were run once with no confidence intervals. The company cited high inference costs as the reason. In a field where benchmark scores can vary by several percentage points across runs, single-run results are insufficient to claim that a 0.2 or 1.0 percentage point difference is meaningful.
Multiple reviewers also flagged inconsistencies in the published numbers. The SWE-Bench score appears as 81.8% in some materials and 82.4% in others. Claude Opus 4.6's comparison score appears as 80.8% in some places and 81.4% in others. These are small discrepancies, but in a context where the company is making extraordinary claims, numerical inconsistency erodes trust - VentureBeat.
The MRCR v2 results are SubQ's strongest data point. If independently verified, the research model's 83% accuracy at 1 million tokens would be a clear leader. But two caveats apply: these are self-reported numbers from a single run, and the research model (which scores highest) is not the product that will be commercially available. The production model's 65.9% is more relevant for potential users, and it trails GPT-5.5's 74.0%.
6. The Team and the Money Behind It
The people and capital behind SubQ tell a specific story about what kind of company this is. Understanding that story helps calibrate expectations.
Subquadratic is headquartered in Miami, Florida, and emerged from stealth on May 5, 2026. The company raised a $29 million seed round at a reported $500 million valuation. That is a valuation of roughly 17x the capital raised, on zero revenue, which positions Subquadratic in the upper tier of AI startup valuations at the seed stage.
The investor list includes Javier Villamizar (former SoftBank Vision Fund partner), Justin Mateen (Tinder co-founder, through JAM Fund), Grant Gittlin (Lasagna), and Jaclyn Rice Nelson (Coalition Operators). The company notes that its investors include "early investors in Anthropic, OpenAI, Stripe, and Brex," though specific names are not disclosed for all - SiliconANGLE.
The two co-founders have distinct profiles. Justin Dangel (CEO) is a five-time founder with a background in consumer technology and insurance. His most notable prior venture, Goji/Consumer United, was named one of Inc. Magazine's Top 10 Fastest Growing Companies in Insurance. He is a business operator and company builder, not an ML researcher.
Alexander Whedon (CTO) was previously a software engineer at Meta and Head of Generative AI at TribeAI, where he led 40+ enterprise AI implementations. His background is in applied AI, not foundational ML research. He has not published papers on attention mechanisms or model architectures.
The research bench consists of 11 PhD researchers drawn from Meta, Google, Oxford, Cambridge, ByteDance, Adobe, and Microsoft. The company has not published individual names or specific research contributions for these team members, making it impossible to assess their specific expertise in efficient attention or model architecture. This is a significant gap: if one of these researchers had previously published influential work on sparse attention, that would substantially strengthen the credibility of the claims.
The founding team profile raises a question that the ML community has been asking: is this a research-driven company that happened to raise money, or a fundraising-driven company that hired researchers? The CEO's background is in scaling consumer businesses, not ML. The CTO's background is in applying existing AI, not inventing new architectures. The actual architecture work is being done by hired PhDs whose identities haven't been disclosed. This doesn't mean the research is bad, but it does mean the leadership team's credibility comes from business execution, not scientific achievement.
For context, the labs that have made genuine architectural breakthroughs typically have researcher-founders. Tri Dao (FlashAttention) was a Stanford PhD student when he created FlashAttention. Albert Gu (Mamba) was faculty at CMU. The DeepSeek team publishes full author lists with institutional affiliations. Moonshot AI's Kimi team includes researchers from Tsinghua and other top Chinese universities. The norm in this field is that architectural innovation comes from people whose primary identity is research, and those people are publicly named with verifiable publication records.
7. The Skeptic's Case Against SubQ
The skepticism around SubQ has been vocal and specific. Rather than dismiss it or accept it wholesale, let's examine each line of criticism on its merits.
"This is probably a fine-tune"
Will Depue, a prominent AI engineer, characterized SubQ as "almost surely a sparse attention finetune of Kimi or DeepSeek." His reasoning: CTO Whedon confirmed using open-source model weights as a starting point, the speedup claims align with what you would get from adding sparse attention patterns to an existing model, and the benchmark selection avoids tests where fine-tuned models typically underperform base models.
This criticism has substance. If SubQ took an existing model (say, DeepSeek V3 or Kimi K2), replaced some or all attention layers with a sparse attention variant, and fine-tuned on long-context tasks, you would expect exactly the profile SubQ presents: competitive on coding and long-context retrieval (inherited from the base model and enhanced by sparse attention), with a speed advantage from the sparser attention computation. You would also expect reluctance to publish general reasoning benchmarks if the attention swap degraded those capabilities.
The counterargument is that a fine-tune on a sparse attention architecture is not trivial. If SubQ genuinely replaced all quadratic attention layers with a subquadratic variant and maintained frontier-quality performance, that is itself significant even if the base knowledge comes from an open-source model. The question is whether SSA is a genuine architectural contribution or a minor modification.
"The numbers don't add up"
Depue also said the scaling claims "don't seem to line up." The specific concern is about the relationship between claimed speedup (52x at 1M tokens) and claimed FLOP reduction (62.5x at 1M tokens). If the FLOP reduction is 62.5x but the wall-clock speedup is only 52x, the gap could be explained by overhead in the selection mechanism (routing queries to their selected keys takes time beyond the attention computation itself). But critics argue the claimed speedup is suspiciously close to what you'd get from existing sparse attention approaches with the right configuration.
"Cherry-picked benchmarks"
Stepan Goncharov characterized the benchmarks as "very interesting cherry-picked benchmarks." The core criticism: SubQ tested on exactly the tasks where a long-context sparse attention model would excel (long-context retrieval, million-token reasoning, code), and avoided every test where it might struggle (general reasoning, math, instruction following, safety). A model that scores 95% on RULER but 60% on MMLU would not be a frontier model in any meaningful sense. Without general benchmarks, "frontier" is an unsubstantiated claim.
"Inconsistent numbers"
Multiple observers noted that SubQ's published benchmark numbers differ between their blog post and press materials. SWE-Bench appears as both 81.8% and 82.4%. Claude Opus comparison scores differ between materials. The inconsistencies are small but they undermine the precision that benchmark claims require. If you're claiming a 1 percentage point advantage over Claude, a 0.6 percentage point inconsistency in your own reporting makes that claim meaningless.
"Possible astroturfing"
The Hacker News community flagged suspicious patterns in SubQ's social media reception: newly created accounts posting identical praise across platforms, a coordinated push of positive commentary on launch day. This is not evidence that the technology is fake (many legitimate companies engage in coordinated PR), but it adds to an overall pattern of marketing sophistication exceeding the level of technical transparency - Hacker News.
The astroturfing concern connects to a deeper issue about how AI companies build credibility. In traditional software, you can evaluate a product by using it. In AI model development, the product is inaccessible (private beta), the architecture is undisclosed (no paper), and the evidence is self-reported (benchmarks run by the company). In this environment, social proof (enthusiastic tweets, positive commentary, impressed reactions) becomes a substitute for technical proof. When that social proof appears coordinated rather than organic, it further undermines the already-thin evidence base.
The Dan McAteer Summary
AI commentator Dan McAteer captured the ambiguity in a widely shared tweet: "SubQ is either the biggest breakthrough since the Transformer... or it's AI Theranos." This framing resonated because it accurately reflects the binary nature of the situation. If SubQ's claims are substantially true, it is a major breakthrough. If they are substantially false, it is a case study in AI hype. There is very little middle ground.
8. The Magic.dev Precedent
SubQ is not the first company to claim extraordinary context length achievements backed by novel attention mechanisms and substantial venture funding. The most relevant precedent is Magic.dev, and the parallels are striking enough to warrant detailed examination.
In August 2024, Magic.dev announced LTM-2-mini, a model with a 100 million token context window - Magic.dev blog. The key claim was that their "sequence-dimension algorithm" was roughly 1,000x cheaper than Llama 3.1 405B's attention at 100 million tokens. They calculated that running Llama 3.1 405B at 100M tokens would require 638 H100 GPUs per user just for the KV cache. Magic's approach required a fraction of a single H100.
The claim profile is remarkably similar to SubQ's: 1,000x compute reduction, orders-of-magnitude longer context, closed architecture, no paper, no weights, no independent verification. Magic raised approximately $500 million on these claims.
As of May 2026, nearly two years after the announcement, there is no public evidence that LTM-2-mini has been deployed outside Magic.dev. The model is not available via API. No independent benchmarks have been published. No weights have been released. Magic announced training a larger LTM-2 on new GPU clusters but has not demonstrated the technology publicly. The company appears to be genuine (it has real employees, real compute, real products in development), but the specific claims about 100M-token context windows remain unverified after nearly two years.
The parallels to SubQ are specific. Both companies claim 1,000x compute improvements over existing models. Both attribute this to novel attention mechanisms. Both are closed-source with no published paper. Both raised substantial funding ($500M for Magic, $29M for SubQ). Both launched with impressive but narrow benchmarks that could not be independently replicated. Both have founders from applied tech (Magic's CEO previously founded coding tools) rather than foundational ML research.
The differences are also worth noting. SubQ has launched three products (API, Code, Search) even if in private beta, whereas Magic has not launched consumer-facing products based on LTM. SubQ's claims are more modest in absolute terms (12M tokens vs 100M tokens). SubQ benchmarks against existing frontier models on established tests, whereas Magic's benchmarks were entirely internal.
The lesson from Magic.dev is not necessarily that SubQ is making false claims. It is that extraordinary claims about attention efficiency require extraordinary evidence, and "trust us, we'll publish the paper soon" is not sufficient evidence when hundreds of millions of dollars are at stake. The ML community learned from Magic that the appropriate response to such claims is patient skepticism: interesting if true, insufficient to evaluate until independently verified.
There is also a structural market dynamic at play. AI startups that claim architectural breakthroughs attract disproportionate investor attention and media coverage compared to those that claim incremental improvements. A startup saying "we made attention 6x faster" gets a modest amount of interest. A startup saying "we eliminated quadratic scaling entirely" gets a $500M valuation at seed stage. This creates an incentive gradient that pushes founders toward bolder claims regardless of the underlying reality. The Magic.dev outcome (bold claims, massive funding, no public verification after two years) is not an anomaly. It is the predictable result of a market that rewards ambition of claims over evidence of results. This doesn't mean SubQ is following the same playbook, but it means the default prior for extraordinary unverified claims should be skepticism, updated by evidence as it becomes available.
9. Theoretical Limits on Subquadratic Attention
Beyond the practical skepticism around SubQ's specific claims, there is a deeper theoretical question: is fully subquadratic attention even possible without sacrificing capability?
A 2024 paper by Alman and Yu, "Fundamental Limitations on Subquadratic Alternatives to Transformers," provides a partial answer - arXiv. Accepted at ICLR 2025, the paper proves that under the Strong Exponential Time Hypothesis (SETH), a widely-believed conjecture in computational complexity theory, any subquadratic model cannot perform certain document similarity tasks that standard transformers can perform.
The specific result is that computing the exact inner product between all token pairs (which is what attention does) requires O(N^2) time in the worst case, assuming SETH. This means any subquadratic attention mechanism must be doing one of two things: either it is approximating the attention computation (not computing exact inner products for all pairs), or it is computing exact attention for a subset of pairs (sparse attention). Either way, there are tasks where it will fail that full quadratic attention would succeed on.
This theoretical result does not mean SubQ is impossible. It means SubQ must be making trade-offs, and the question is whether those trade-offs matter in practice. SubQ's claimed approach (computing exact attention over a content-dependent subset of token pairs) is consistent with the theoretical framework: it gives up the ability to attend to every pair, but claims to select the right subset such that the quality loss is negligible for practical tasks.
A separate analysis on LessWrong by Vladimir Ivanov argues more broadly that most "subquadratic" attention claims are actually constant-factor improvements rather than genuine complexity class changes - LessWrong. The argument is that even mechanisms labeled as O(N) or O(N log N) in theory tend to have such large constant factors in practice that they only provide 3-8x speedups at the context lengths currently in use. Only at context lengths far beyond what anyone currently uses (10M+ tokens) would the complexity class change produce truly transformative speedups.
This is a nuanced point. A mechanism that is O(N) but with a constant factor of 10,000 is slower than O(N^2) with a constant factor of 1 for any sequence shorter than 10,000 tokens. The complexity class only dominates at scale. SubQ's claimed speedups (52x at 1M tokens) would be consistent with either a genuine O(N) mechanism with a moderate constant factor, or a very well-optimized O(N^2) mechanism with aggressive constant-factor reductions. Without the full mathematical specification, it is impossible to distinguish between these possibilities.
10. What "Fully Subquadratic" Really Means
The term "fully subquadratic" in SubQ's marketing deserves precise unpacking because it draws a sharp line between SubQ and every other approach.
Quadratic (O(N^2)): Standard transformer attention. Every token attends to every other token. Cost quadruples when input doubles. This is what vanilla GPT, Claude, and Gemini architectures use at their core, enhanced by FlashAttention for implementation efficiency.
Partially subquadratic / hybrid: Models that mix subquadratic layers with standard quadratic attention layers. Kimi Linear (75% linear, 25% quadratic) is the canonical example. If even one layer out of 80 is quadratic, the overall model is still O(N^2) asymptotically. The quadratic layers dominate at sufficiently long contexts. The practical speedup might be 4-6x, but it does not compound as context grows further. You save a constant factor, not a scaling factor.
Fully subquadratic (O(N) or O(N log N)): No quadratic layers anywhere in the model. Every layer uses a mechanism whose cost grows subquadratically with N. This is what SubQ claims. If true, the speedup over quadratic models grows without bound as context increases. At 1M tokens, it might be 52x. At 12M tokens, it would be approximately 625x (assuming O(N) vs O(N^2)). At 100M tokens, it would be 5,200x. The advantage accelerates rather than plateaus.
This distinction matters enormously for practical applications. A constant 6x speedup (like Kimi Linear achieves) is valuable but incremental: it makes million-token contexts 6x cheaper. A genuine complexity change makes million-token contexts 50x cheaper, 10-million-token contexts 600x cheaper, and opens up context lengths that would be economically impossible with any quadratic component.
The reason no previous frontier model has been fully subquadratic is that the quadratic layers appear to be necessary for maintaining quality on certain tasks. Kimi Linear's 25% quadratic layers exist for a reason: they provide the retrieval precision that linear layers cannot match. Removing them entirely degrades performance. SubQ claims to have solved this specific problem, maintaining retrieval precision without any quadratic fallback, through a mechanism it has not fully disclosed.
This three-way split captures the core tension. Every approach in the fully subquadratic category has achieved O(N) scaling by giving something up: SSMs give up exact retrieval through lossy compression, and linear attention gives up sharp focus through kernel approximations. SubQ claims to sit in the O(N) category while retaining the exact retrieval properties of the O(N^2) category. If true, it has solved a problem that every previous attempt has failed to solve. If false, it belongs somewhere in the hybrid or quadratic category, dressed up with subquadratic marketing.
11. The Competitive Landscape: How Frontier Labs Handle Long Context Today
To evaluate SubQ's significance, you need to know what the established frontier labs are currently doing about the quadratic bottleneck. The answer is: a lot, just not as dramatically as SubQ claims.
Anthropic (Claude Opus 4.6/4.7)
Anthropic's Claude models currently support 1 million tokens of context. Anthropic has confirmed using sparse attention patterns with hierarchical KV-cache management across GPU memory tiers. Claude Opus 4.6 achieved 76% accuracy on long-context retrieval benchmarks, a massive improvement over its predecessor's 18.5%. The most recent Claude Opus 4.7 continues this trajectory - Anthropic docs.
Anthropic has not published the specifics of their attention mechanism, but the described approach (sparse patterns + hierarchical caching) is consistent with a hybrid strategy: using some form of sparsity to reduce average-case compute while retaining full attention capabilities where needed. The engineering sophistication required to maintain 76% retrieval accuracy at 1M tokens with any form of sparsity should not be underestimated.
For those building applications that need to process large documents, research papers, or entire codebases in a single context window, platforms like O-mega integrate with frontier models to provide autonomous agents that can work across long-context tasks, making practical use of these expanded context windows regardless of which underlying model architecture prevails.
OpenAI (GPT-5.5)
OpenAI's GPT-5.5, released April 2026, supports 1 million tokens via API - OpenAI. It achieved 74% accuracy at 512K-1M tokens, compared to GPT-5.4's 36.6% at the same range. OpenAI has not disclosed its attention mechanism in detail.
Google DeepMind (Gemini 2.5 Pro)
Google's Gemini 2.5 Pro supports 1 million tokens (with 2 million tokens announced) and achieved 100% recall up to 530K tokens and 99.7% at 1 million tokens - Google AI Blog. These are the strongest publicly reported long-context retrieval numbers from any frontier lab.
The Convergence Pattern
All three major frontier labs now offer 1M-token context as standard. None have disclosed fully subquadratic architectures. All appear to use engineering optimizations (FlashAttention, KV-cache compression, some form of sparsity) to make quadratic attention practical at 1M tokens. The cost of these contexts is high (hence SubQ's $2,600 per RULER benchmark claim for Claude), but the labs accept this cost because the alternative, changing the fundamental architecture, has proven difficult without quality degradation.
SubQ's positioning is that this engineering approach has hit its ceiling. You can optimize quadratic attention to work at 1M tokens, but pushing to 12M or beyond requires a genuine architectural change. The frontier labs have essentially said, through their actions, that 1M tokens is "good enough" for current applications. SubQ is betting that it isn't, and that a model that can process 12M tokens at reasonable cost unlocks applications that 1M-token models cannot address.
The context window plateau is real and visible. After rapid expansion from 128K to 1M tokens between early 2024 and early 2025, frontier models have stalled at 1M for over a year. SubQ's claim to 12M tokens represents a 12x jump, which is either a genuine architectural breakthrough enabling capabilities beyond what quadratic models can economically support, or an overpromise from a model that works at 12M tokens in theory but produces degraded output in practice (a concern underscored by the unexplained 17-point gap between SubQ's research and production models).
12. First-Principles Assessment: Breakthrough or Not?
Strip away the marketing, the skepticism, the investor names, and the Twitter arguments. What does a first-principles analysis say about whether SubQ is likely to be what it claims?
Why This Question Matters More Than SubQ Itself
Before dissecting SubQ's specific claims, it is worth stepping back to understand why the quadratic attention bottleneck is such a high-stakes problem. The entire trajectory of AI capabilities is currently gated by how much context a model can process per inference call. Every major application category, from autonomous coding agents that need to understand full codebases, to legal AI that needs to process complete case files, to scientific research assistants that need to synthesize entire paper corpuses, runs into the same wall: the context window is too small or too expensive.
The current 1 million token standard from frontier labs represents a hard-won engineering achievement. But it is also a ceiling that has not moved in over a year. The labs have optimized quadratic attention about as far as it can go. Pushing to 10M or 100M tokens with quadratic attention would require compute costs that make the economics untenable for almost any application. This is why the question of subquadratic attention is not academic. It is the question that determines whether the next generation of AI applications can exist at all.
If SubQ has genuinely solved this (or even partially solved it with meaningful constant-factor improvements beyond what DeepSeek and Kimi achieved), the implications cascade through the entire industry. It would mean that context-limited RAG pipelines become unnecessary for most applications. It would mean that autonomous agents can reason over their entire operational history rather than working from compressed summaries. It would mean that the cost curve of AI inference bends sharply downward rather than plateauing. The stakes explain both the excitement and the skepticism.
The Structural Question
The fundamental structural question is: can you eliminate the quadratic dependency in attention without losing information that matters?
The theoretical answer is: not in the worst case (per Alman and Yu). But the practical answer might be different. Real-world language is not adversarial. In a legal contract, when you ask about the termination clause, the relevant information is concentrated in a small number of specific paragraphs, not uniformly distributed across the document. In a codebase, when you're debugging a function, the relevant context is the function definition, its callers, and the types it uses, not every line of every file. The key-value pairs that actually matter for any given query might be 0.1% of the total context.
If real-world attention patterns are inherently sparse (most queries only need to attend to a small fraction of keys), then a model that can efficiently identify and compute attention over just that fraction could achieve near-identical quality at a fraction of the cost. This is plausible. Research on attention patterns in trained transformers consistently shows that attention heads learn sparse patterns: they don't actually use most of the N^2 computed scores - LessWrong analysis.
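This is a measurable property rather than an article of faith. Given query and key activations captured from a trained model's attention layer, the helper below reports how much of each query's softmax mass lands on its top-k keys; the analyses cited above report that trained heads concentrate most of their mass on a handful of keys, whereas for random vectors the mass is spread nearly uniformly.

```python
import numpy as np

def topk_attention_mass(Q, K, k=32):
    """Fraction of each query's softmax attention mass that falls on its k
    highest-weighted keys, averaged over queries. A crude sparsity measure:
    values near 1.0 mean the head effectively ignores most of the N^2 scores."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    topk = np.sort(weights, axis=1)[:, -k:]      # k largest weights per query
    return topk.sum(axis=1).mean()
```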
The question is whether you can reliably identify the right sparse subset without computing the full attention matrix first. This is a chicken-and-egg problem: you need to know which tokens are relevant to know where to look, but you need to look to know which tokens are relevant. SubQ claims to have solved this with content-dependent routing, but the routing mechanism is the critical undisclosed detail.
The Evidence Assessment
Here is what we can state with reasonable confidence:
Plausible claims: The basic architecture of sparse attention with content-dependent routing is well-established in the research literature. Computing exact attention over a selected subset of tokens is straightforward once the subset is identified. The claimed speedups (52x at 1M tokens) are within the range that aggressive sparsity could produce.
Uncertain claims: Whether the routing mechanism itself is truly subquadratic (or has hidden quadratic costs in the selection step). Whether the quality at 12M tokens matches the quality at 128K tokens. Whether general reasoning capabilities have been preserved after the architectural change. Whether the cost claims account for the full inference pipeline or only the attention operation.
Unverifiable claims: The specific architecture of SSA. The training methodology. The research model's performance at contexts beyond 1M tokens. Any claim about production readiness.
The Probability Distribution
Rather than declaring SubQ a breakthrough or a fraud, the honest assessment is a probability distribution.
There is maybe a 20-30% chance that SubQ has achieved a genuine architectural innovation: a content-dependent sparse attention mechanism that is truly subquadratic and maintains frontier quality. This would be a significant contribution to the field, even if the base knowledge comes from fine-tuning open-source models.
There is maybe a 40-50% chance that SubQ has built a well-optimized sparse attention variant that provides real speedups (10-50x depending on context) but is not genuinely O(N). It might be O(N * sqrt(N)) or O(N * log(N)) or O(N^2) with an aggressive constant factor. This would still be useful, still worth a product, but the "fully subquadratic" framing would be misleading.
There is maybe a 20-30% chance that SubQ is primarily a marketing exercise around existing open-source models with relatively minor modifications, dressed up with impressive-sounding claims about complexity class. This does not mean fraud, it means a startup positioning incremental work as revolutionary, which is common in venture-backed AI.
The key that would shift these probabilities decisively: a peer-reviewed paper with full architecture specification, open weights for independent benchmarking, or at minimum, independent evaluation by researchers with API access. Until one of these materializes, the probability distribution remains wide.
What Would Settle the Debate
Five specific pieces of evidence would resolve the ambiguity:
A technical paper with full architecture specification. The routing mechanism is the critical undisclosed detail. How does SSA select which tokens to attend to? What is the computational cost of the selection step? What is the theoretical complexity of the full forward pass, proven mathematically?
Independent benchmark results. Any credible third-party evaluation on standard benchmarks (MMLU, MATH, IFEval) would answer whether general capabilities have been preserved or degraded.
Open weights. Even partial weight release would allow the research community to analyze whether the architecture is genuinely novel or a modification of existing open-source models with known attention variants.
Scaling curve verification. If SubQ is truly O(N), plotting inference cost versus context length should produce a straight line. If it is O(N^2) with constant-factor improvements, the curve will bend upward at longer contexts. This is the most decisive test and could be performed by anyone with API access (a minimal sketch of such a check follows this list).
Production deployment at scale. If SubQ's API becomes publicly available and demonstrates consistent quality across diverse tasks at the claimed cost, the debate becomes academic. Results matter more than architecture papers.
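The scaling-curve test in particular requires nothing but API access and a credit card. A minimal sketch, assuming you can measure per-request cost or latency at several context lengths (the numbers below are made up for illustration, not real SubQ measurements):

```python
import numpy as np

def scaling_exponent(context_lengths, costs):
    """Fit cost ~ N^alpha on a log-log scale. An exponent near 1 is consistent
    with linear attention cost, near 2 with quadratic."""
    alpha, _ = np.polyfit(np.log(context_lengths), np.log(costs), 1)
    return alpha

# Hypothetical measurements at four context lengths (illustrative only):
lengths = [128_000, 256_000, 512_000, 1_000_000]
measured_cost = [1.0, 2.1, 4.3, 8.8]
print(f"fitted scaling exponent: {scaling_exponent(lengths, measured_cost):.2f}")
```

One caveat: end-to-end cost blends attention with the FFN layers, which are linear in N regardless, so the measured exponent for a quadratic-attention model will land somewhere between 1 and 2; the cleanest separation shows up at the longest contexts, where attention dominates.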
13. What This Means for Builders
If you are building products that depend on long-context language models, the SubQ announcement has practical implications regardless of whether the specific claims are verified.
The Short-Term View (Next 3-6 Months)
SubQ is in private beta with no public API access. For the immediate future, your production systems should continue to be built on established frontier models with 1M token context: Claude Opus 4.6/4.7, GPT-5.5, or Gemini 2.5 Pro. These are proven, available, and well-documented.
If you can get access to SubQ's private beta, it is worth testing for long-context-specific workloads: legal document analysis, codebase comprehension, multi-document research synthesis. These are the use cases where SubQ's advantages, if real, would be most apparent. But do not build production dependencies on a private beta from a startup with no track record of uptime or reliability.
The Medium-Term View (6-18 Months)
If SubQ's architecture works as claimed, the frontier labs will respond. Anthropic, OpenAI, and Google all have research teams working on efficient attention. If content-dependent sparse attention can achieve frontier quality at O(N) complexity, the major labs will develop their own variants. The technology, if real, will be commoditized within 12-18 months of verification. SubQ's moat would be speed to market, not the architecture itself.
This means the interesting question for builders is not "should I switch to SubQ?" but "should I design systems that can take advantage of 10M+ token contexts?" If the answer is yes, design your architecture now to be model-agnostic, with long-context capabilities as a feature that can scale as models improve. Use retrieval-augmented generation (RAG) as a current solution while keeping the option to switch to pure long-context approaches as they become available and affordable. We explored the mechanics of RAG and vector search in depth in our guide to enterprise AI search, which remains the practical approach for most production systems today.
The Long-Term View (The Real Question)
The deeper question that SubQ raises is whether the transformer attention mechanism as we know it has a future. The quadratic bottleneck has been the defining constraint of the transformer era. Every year, the pressure to extend context windows grows as applications demand models that can process entire codebases, complete legal cases, full research corpora, or years of conversation history.
The history of computing suggests that quadratic algorithms eventually get replaced. Sorting went from O(N^2) bubble sort to O(N log N) merge sort. String matching went from O(N * M) brute force to O(N + M) Knuth-Morris-Pratt. Matrix multiplication has been progressively improved from O(N^3) to O(N^2.37). In each case, the improvement required a genuinely clever insight that avoided the seemingly unavoidable exhaustive computation.
Attention may follow the same trajectory. The insight that most token pairs in a real-world sequence are not informative (the attention scores for most pairs are near zero after softmax) suggests that computing all N^2 pairs is wasteful. The question is whether you can identify the informative pairs without computing all pairs first. If someone finds an efficient solution to this selection problem, the result would look a lot like what SubQ describes: content-dependent sparse attention with exact computation over the selected subset.
Whether SubQ specifically has found this solution is an open question. But the structural argument that such a solution exists and will eventually be found is strong. The quadratic era of attention is likely to end. The question is when, and by whom.
The implications for AI agents are particularly significant. As we discussed in our guide to building AI agents, context windows are the fundamental constraint on what an agent can reason about in a single step. A 12M-token context would allow an agent to hold an entire codebase (most production codebases are 2-5M tokens), a complete legal case file, or months of customer interaction history in a single context window. This would eliminate the need for complex retrieval pipelines and multi-step reasoning chains that currently limit agent capabilities. For platforms focused on self-improving AI agents, the ability to reason over vastly more context would accelerate the feedback loops that drive autonomous improvement.
The cost dimension of this debate is equally transformative and worth examining closely. Our analysis of the true cost of AI agents found that inference costs are the primary barrier to deploying agents at scale. A 50-300x reduction in the cost of long-context inference would fundamentally change the economics. Tasks that currently require careful context management (chunking documents, maintaining retrieval indexes, orchestrating multi-step pipelines) could be replaced with a single large-context call. The engineering complexity drops, the latency drops, and the cost drops, but only if the quality holds.
The Broader Lesson
SubQ's announcement, regardless of outcome, illustrates a pattern that builders should recognize. Every 12-18 months, a startup emerges with claims of having solved a fundamental bottleneck in AI infrastructure. Sometimes these claims are real (FlashAttention genuinely transformed inference efficiency). Sometimes they are exaggerated (Magic.dev's 100M-token context remains unverified after nearly two years). The appropriate builder response is the same in both cases: track the technology, test when possible, design for model-agnostic architectures, and never build critical dependencies on unverified claims.
The market for long-context AI is real and growing. The consolidation of AI market power around a few major providers creates opportunities for startups that can deliver genuine architectural improvements. If SubQ is one of those startups, early adopters will benefit. If it isn't, the underlying problem remains unsolved and the next attempt will follow soon. Either way, the quadratic attention bottleneck will be broken eventually. The physics of information processing demand it.
This guide reflects the AI model landscape as of May 6, 2026. SubQ is in private beta and has not been independently verified. All benchmark numbers are self-reported by Subquadratic. Model capabilities and pricing change frequently. Verify current details before making technical decisions.