A complete, data-driven breakdown of what LLM inference actually costs, who is paying for it, and why the price of a token has fallen 1,000x in three years.
The list price of generating one million tokens fell from $60 for OpenAI's flagship model in 2020 to $0.05 for its cheapest model in 2025. That is a 1,200x decline in five years. No other technology in history has experienced a price collapse of this magnitude in this timeframe. Not computing. Not storage. Not bandwidth. The cost of intelligence, measured in the price of producing coherent language at machine speed, has dropped faster than any input cost in the modern economy - a16z.
Yet the companies producing this intelligence are losing money at a historic pace. OpenAI burned through $5 billion more than it earned in 2024 - CNBC. Anthropic posted a negative 94% gross margin the same year - The Information. Some heavy users on $200/month plans consume $10,000 to $20,000 worth of inference while paying a fraction of that. Sam Altman publicly admitted OpenAI is "currently losing money" on its most expensive subscription tier - Yahoo Finance.
This guide maps out the complete cost structure of LLM inference as of May 2026. Not estimates, not projections, not vibes. Actual prices, actual GPU costs, actual energy bills, actual company financials, actual margins. Every number is sourced and dated. Where data is unverified or self-reported, it is flagged as such.
This guide covers the price of a token from 2020 to today, the five forces that drove costs down, the GPU economics underneath it all, the energy bill behind every prompt, how training costs get amortized into inference pricing, the massive subsidies that AI companies are absorbing, the Jevons paradox that is making AI more expensive despite cheaper tokens, the real margins these companies earn (or don't), and where this all goes from here.
Written by Yuma Heymans (@yumahey), who has been tracking the economics of autonomous AI systems through O-mega and has spent close to $20,000 on Claude Code subscriptions across two accounts in the past year, making this topic deeply personal as both a builder and a customer.
Contents
- The Token: The Universal Unit of AI Cost
- A Price History: From $60 to $0.05 Per Million Tokens
- The Five Forces Behind Falling Costs
- GPU Economics: The Silicon Foundation
- The Energy Bill: Powering Intelligence
- Training Costs: The Amortized Foundation
- The Great Subsidy: Below-Cost Pricing at Scale
- The Jevons Paradox: Cheaper Tokens, Bigger Bills
- Provider Economics: Margins, Revenue, and Sustainability
- Rate Limits: The New Scarcity
- The Future: Where Inference Costs Go From Here
1. The Token: The Universal Unit of AI Cost
Before mapping the cost of intelligence over time, it helps to understand the unit of measurement. A token is the atomic unit of text that language models process. It is not a word, not a character, and not a sentence. It is a subword fragment that the model's tokenizer produces from raw text. In English, one token averages roughly 0.75 words, meaning 1,000 tokens is approximately 750 words. A typical ChatGPT response of 400 words consumes about 530 tokens - OpenAI Developer Community.
The reason tokens became the universal pricing unit is structural: they map directly to the computational work the model performs. Every token generated requires a forward pass through the model's neural network, consuming GPU compute cycles, memory bandwidth, and energy. The cost of producing one token depends on the model's size (number of parameters), the architecture (dense vs. sparse), the hardware running it (which GPU, at what utilization), and the context length (how many previous tokens the model must attend to). This means pricing per token is not an arbitrary choice by API providers. It is the closest proxy to the actual cost of compute consumed.
All prices in this guide are expressed in cost per million tokens (MTok), split into input tokens (what you send to the model) and output tokens (what the model generates back). Output tokens are always more expensive because generating each one requires a full forward pass, while input tokens can be processed in parallel during the "prefill" phase. We explored how these architectural differences affect real-world pricing in our AI model benchmarks analysis, where the gap between input and output pricing ranges from 3x to 6x depending on the provider.
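As a rough illustration of how per-token billing works, the sketch below estimates the cost of one request from word counts. It assumes the 0.75 words-per-token average cited above. The function names, the request sizes, and the $3/$15 per MTok example prices are illustrative placeholders, not any provider's official calculator.

```python
WORDS_PER_TOKEN = 0.75  # rough English average cited above; varies by tokenizer


def words_to_tokens(words: float) -> float:
    """Estimate a token count from a word count (approximation only)."""
    return words / WORDS_PER_TOKEN


def request_cost(input_words: float, output_words: float,
                 input_price_per_mtok: float,
                 output_price_per_mtok: float) -> float:
    """Estimated USD cost of one request, billing input and output separately."""
    input_tokens = words_to_tokens(input_words)
    output_tokens = words_to_tokens(output_words)
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 1,500-word prompt, 400-word answer, $3/$15 per MTok.
    print(f"{words_to_tokens(400):.0f} output tokens")
    print(f"${request_cost(1500, 400, 3.0, 15.0):.4f} per request")
```

In this hypothetical case the 400-word answer costs more than the 1,500-word prompt, which is the practical effect of the input/output price gap.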
When this guide references "flagship" pricing, it means the most capable generally available model from a given provider at that point in time. When it references "budget" pricing, it means the cheapest model in the provider's lineup that still delivers usable quality for production workloads.
2. A Price History: From $60 to $0.05 Per Million Tokens
The commercial history of LLM inference pricing begins in June 2020, when OpenAI launched its first API with GPT-3. Before that, large language models existed in research labs, but there was no standardized way to buy inference by the token. The cost trajectory since then is one of the most dramatic price collapses in the history of technology.
The GPT-3 Era (2020-2022): Intelligence as a Luxury
When OpenAI opened its GPT-3 API in June 2020, the flagship Davinci model cost $60 per million tokens (combined input and output). The smaller models in the GPT-3 family were cheaper: Curie at $6/MTok, Babbage at $1.20/MTok, and Ada at $0.80/MTok - The Decoder. At $60/MTok, generating a 10,000-word document (roughly 13,300 tokens) from Davinci cost about $0.80. That sounds trivial today, but at scale, with thousands of API calls per hour, costs added up fast. A customer support chatbot handling 10,000 conversations per day could easily rack up $5,000 to $10,000 per month on Davinci alone.
In September 2022, OpenAI cut GPT-3 prices by two-thirds. Davinci dropped to $20/MTok, Curie to $2/MTok, and Babbage to $0.50/MTok - The Decoder. This was the first major price reduction, driven by inference optimizations and increased GPU availability. But $20/MTok was still expensive enough that most startups building on GPT-3 treated API costs as their primary COGS concern.
There was no commercial API for GPT-2 (released February 2019) and no pre-transformer commercial LLM pricing exists in any standardized form. The 2020 GPT-3 launch is truly the starting point for tokenized AI pricing.
It is worth noting what existed before this. Prior to GPT-3, commercial NLP services from Google Cloud, IBM Watson, and Amazon Comprehend used entirely different billing models: per-request, per-character, or per-document. These services performed narrow tasks (sentiment analysis, entity extraction, translation) rather than general-purpose text generation. The cost per unit of "intelligence" was not comparable because the capability was fundamentally different. A Google Cloud NLP API call in 2018 that classified the sentiment of a paragraph and a GPT-3 API call in 2020 that generated an original paragraph are different categories of computation. The token-based pricing model that emerged with GPT-3 was itself an innovation: it created a standardized, comparable unit of measurement for generative AI that enabled the price tracking this guide documents.
The ChatGPT Moment (2023): Prices Start Falling
Two events in early 2023 changed everything. First, OpenAI released GPT-3.5 Turbo on March 1, 2023, at $2/MTok (combined rate), a 10x reduction from Davinci's reduced price and 30x cheaper than the original 2020 Davinci rate - OpenAI Blog. This was the model powering ChatGPT, and making it available via API at $2/MTok meant that the quality level most consumers experienced for free was now accessible to developers at a price that made real applications viable.
Two weeks later, on March 14, 2023, OpenAI released the GPT-4 API at $30/MTok input and $60/MTok output - Andrew Ng on X. This was a step backward in price but a leap forward in capability. GPT-4 represented the new frontier, and its premium pricing reflected both the larger model size and the constrained GPU supply at the time.
The price gap between GPT-3.5 Turbo ($2/MTok) and GPT-4 ($30-60/MTok) created a tiered market that persists to this day: a budget tier for high-volume, cost-sensitive applications and a premium tier for tasks requiring maximum intelligence. Anthropic entered this market in July 2023 with Claude 2.0 at approximately $8/MTok input and $24/MTok output - Monetizely, positioning itself between GPT-3.5 and GPT-4 on price while competing with GPT-4 on quality.
The year's most important pricing event came on November 6, 2023, at OpenAI's DevDay. GPT-4 Turbo launched at $10/MTok input and $30/MTok output, a 67% reduction on input and 50% on output from the original GPT-4 price - OpenAI DevDay. Simultaneously, GPT-3.5 Turbo 1106 dropped to $0.50/MTok input and $1.50/MTok output, a 75% cut on input and 25% on output from the $2/MTok launch rate - AI Business. Google entered the race in December 2023 with Gemini 1.0 Pro at $0.50/MTok input and $1.50/MTok output - PromptHub, matching the reduced GPT-3.5 Turbo pricing.
This chart shows flagship model input pricing (the most capable model from OpenAI at each moment). The trend is clearly downward from $60 to the $2-5 range, with the April 2026 bump reflecting the release of GPT-5.5, a more capable (and expensive) model. The budget tier tells an even more dramatic story: GPT-4.1 Nano launched in April 2025 at $0.05/MTok input, representing a 1,200x decline from GPT-3 Davinci's original $60 price.
The Competition Intensifies (2024): Everyone Cuts Prices
The year 2024 was defined by aggressive competition across all providers. In March 2024, Anthropic launched the Claude 3 family: Haiku at $0.25/$1.25 (input/output per MTok), Sonnet at $3/$15, and Opus at $15/$75 - Anthropic Pricing. This three-tier structure gave developers a budget option (Haiku) that was 240x cheaper per input token than GPT-3 Davinci had been four years earlier.
In May 2024, OpenAI released GPT-4o at $5/$15 per MTok, a 50% cut from GPT-4 Turbo - OpenAI. Two months later, GPT-4o mini arrived at $0.15/$0.60, making GPT-4-level intelligence available at a price point 200x cheaper on input and 100x cheaper on output than the original GPT-4 launch price just 16 months earlier - OpenAI Blog.
Google aggressively undercut everyone with Gemini 1.5 Flash at just $0.075/$0.30 per MTok after an August 2024 price cut - Google Developers Blog. By December 2024, Gemini 2.0 Flash launched at $0.10/$0.40 - PricePerToken.
We covered the full competitive dynamics of this price war in our analysis of the big pipe: how LLM inference is eating software, where we traced how the commoditization of intelligence reshapes which companies capture value and which ones get absorbed.
The DeepSeek Shock (January 2025): The Price Floor Drops Again
On January 20, 2025, the Chinese AI lab DeepSeek released DeepSeek R1, a reasoning model that matched OpenAI's o1 on major benchmarks while pricing its API at $0.55/$2.19 per MTok, roughly 27x cheaper than o1's $15/$60 pricing on both input and output - Statista. The market reaction was immediate: NVIDIA lost roughly $600 billion in market capitalization in a single day on January 27, 2025, the largest single-day loss for any company in U.S. stock market history.
DeepSeek's cost advantage came from two sources: an efficient architecture and an unusually cheap training run. First, DeepSeek V3 (the base model for R1) used a Mixture of Experts (MoE) architecture with 671 billion total parameters but only 37 billion active per token - arXiv:2412.19437. Second, the final training run cost only $5.576 million (2.788 million H800 GPU hours at $2/hour), with the R1 reinforcement learning phase adding just $294,000 - Nature / The Register. The total was under $6 million, compared to estimated costs of $78 million for GPT-4 and $191 million for Gemini Ultra - Epoch AI.
The full significance of DeepSeek's approach, and how it relates to the broader cost-efficiency revolution in AI architecture, is covered in our DeepSeek V4 Preview guide.
The Current State (May 2026): A Fragmented Market
As of May 2026, the pricing landscape has fragmented into distinct tiers. At the frontier, the most capable models from the major providers are priced in a remarkably narrow band. OpenAI's GPT-5.5 costs $5/$30 per MTok - TokenMix. Anthropic's Claude Opus 4.7 costs $5/$25 per MTok - ClaudeFast. Google's Gemini 3.1 Pro sits at $2/$12 per MTok - Zerlo. At the budget tier, Google's Gemini 3.1 Flash Lite offers $0.25/$1.50 and OpenAI's GPT-4.1 Nano remains available at $0.05/$0.20 - PricePerToken.
The open-source ecosystem (via third-party providers) offers even lower prices. Llama 3.3 70B runs at $0.23/$0.40 on DeepInfra and $0.59/$0.79 on Groq - AI Pricing Guru. DeepSeek V3 is available for just $0.14/$0.28 - DeepSeek API. xAI's Grok 4 competes at $3/$15 per MTok - PricePerToken, while Mistral's budget tier (Mistral Small 4) and DeepSeek's V4 Flash (284B parameters, 13B active) target the sub-dollar tier - CNBC.
The full provider landscape in May 2026 reveals a remarkable compression at the top and an expanding floor at the bottom. In 2020, the price difference between the most and least expensive GPT-3 models was 75x ($60 for Davinci vs $0.80 for Ada). Today, the price difference between the most expensive frontier model (GPT-5.5 at $5/$30) and the cheapest usable model (GPT-4.1 Nano at $0.05/$0.20) is 100x on input and 150x on output. On input price, the ceiling has fallen 12x ($60 to $5) and the floor 16x ($0.80 to $0.05); the 1,200x headline figure compares the 2020 ceiling with today's floor. The gap between tiers has widened rather than closed, which is consistent with a budget tier approaching marginal cost while the frontier tier still carries significant margin (or subsidy).
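The ratios in this section are simple divisions of list prices, and it can help to see which pair of prices each one compares. The sketch below recomputes them from the per-MTok prices quoted in this guide. The helper name is illustrative, and the "headline" figure is defined here as the 2020 flagship price against today's cheapest input price.

```python
def fold_change(old_price: float, new_price: float) -> float:
    """How many times smaller new_price is than old_price."""
    return old_price / new_price


# Per-MTok list prices quoted in this guide (USD).
DAVINCI_2020 = 60.00                        # GPT-3 flagship, combined rate
ADA_2020 = 0.80                             # cheapest GPT-3 model
GPT_5_5_INPUT, GPT_5_5_OUTPUT = 5.00, 30.00
NANO_INPUT, NANO_OUTPUT = 0.05, 0.20

SPREAD_2020 = fold_change(DAVINCI_2020, ADA_2020)            # tier gap in 2020
SPREAD_INPUT_NOW = fold_change(GPT_5_5_INPUT, NANO_INPUT)     # tier gap, input
SPREAD_OUTPUT_NOW = fold_change(GPT_5_5_OUTPUT, NANO_OUTPUT)  # tier gap, output
CEILING_DECLINE = fold_change(DAVINCI_2020, GPT_5_5_INPUT)    # flagship to flagship
HEADLINE_DECLINE = fold_change(DAVINCI_2020, NANO_INPUT)      # flagship to cheapest

if __name__ == "__main__":
    print(f"2020 tier spread:      {SPREAD_2020:.0f}x")
    print(f"Input spread today:    {SPREAD_INPUT_NOW:.0f}x")
    print(f"Output spread today:   {SPREAD_OUTPUT_NOW:.0f}x")
    print(f"Flagship input decline: {CEILING_DECLINE:.0f}x")
    print(f"Headline decline:      {HEADLINE_DECLINE:.0f}x")
```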
An academic paper published in early 2026 identified a structural break in May 2024 marking the shift from technology-driven price decline (hardware improvements, architecture innovations) to competition-driven price decline (providers undercutting each other for market share) - arXiv. This distinction matters because technology-driven decline has natural limits (physics), while competition-driven decline can push prices below cost, which is exactly what has happened. The two phases also differ in who benefits. Technology-driven decline benefits all providers equally (everyone gets cheaper GPUs). Competition-driven decline benefits consumers at the expense of provider margins. Understanding which phase we are in determines whether the current pricing is sustainable or a temporary anomaly funded by venture capital.
3. The Five Forces Behind Falling Costs
The 1,000x price decline in LLM inference is not the result of a single breakthrough. It is the compounding effect of five distinct forces, each contributing a multiplicative reduction. Understanding these forces separately is essential for predicting where costs go from here, because they operate on different timescales and have different limits.
Force 1: Hardware Performance Improvements
The most fundamental driver is the steady improvement in GPU price-performance. According to Epoch AI's analysis of 470 GPUs from 2006 to 2021, the number of floating-point operations per second (FLOP/s) per dollar doubles every 2.5 years for all GPUs, and every 2.07 years for GPUs specifically used in ML research - Epoch AI. This is a slower doubling time than Moore's Law (which predicted doubling every 18-24 months for transistors), but it compounds relentlessly.
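To put those doubling times in perspective, the sketch below converts a doubling time into a cumulative improvement factor and into the number of years needed to reach a target factor. It assumes a constant doubling time, which is a simplification, and the function names are illustrative.

```python
import math


def growth_factor(years: float, doubling_time_years: float) -> float:
    """Price-performance multiple after `years` at a constant doubling time."""
    return 2 ** (years / doubling_time_years)


def years_to_reach(target_factor: float, doubling_time_years: float) -> float:
    """Years needed to reach `target_factor` at a constant doubling time."""
    return doubling_time_years * math.log2(target_factor)


if __name__ == "__main__":
    for label, doubling in [("all GPUs", 2.5), ("ML GPUs", 2.07)]:
        print(f"{label}: {growth_factor(5, doubling):.1f}x in 5 years, "
              f"{years_to_reach(1000, doubling):.0f} years to reach 1,000x")
```

Under these assumptions, hardware price-performance on its own would need on the order of two decades to deliver a 1,000x gain, which is why the other four forces carry most of the observed decline.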
NVIDIA's data center GPUs have progressed from the A100 (2020) to the H100 (2022) to the B200 (2025). The most recent step alone, from the Hopper generation to the Blackwell generation, delivered a roughly 15x improvement in inference cost per token - SemiAnalysis. Each new GPU generation brings more compute cores, faster memory bandwidth (HBM3 to HBM3e), and larger memory capacity (80GB to 192GB), all of which directly reduce the cost of generating each token.
However, Epoch AI's data reveals an important caveat: the cheapest GPU price per FLOP has not decreased since 2017 at release-date pricing. The gains have come from higher-end chips that cost more per unit but deliver disproportionately more performance. This means the hardware cost floor may be closer than the headline improvement numbers suggest.
Force 2: Model Architecture Innovations
The shift from dense models to Mixture of Experts (MoE) architectures is arguably the single largest efficiency gain in the last two years. A dense model activates all of its parameters for every token. An MoE model routes each token to a subset of specialized "expert" sub-networks, activating only a fraction of total parameters.
DeepSeek V3 has 671 billion total parameters but activates only 37 billion per token. Meta's Llama 4 Maverick has 400 billion total parameters with only 17 billion active per token - Meta. The result is that MoE models deliver a 2-4x compute cost reduction versus dense models of equivalent quality - Epoch AI.
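A quick way to read these parameter counts is as an active fraction. The sketch below computes it for the two models above, along with a rough per-token compute estimate using the common rule of thumb of about 2 FLOPs per active parameter for a forward pass. That rule of thumb is an assumption of this sketch, not a figure from the cited sources, and it ignores attention cost over long contexts.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used for each token in an MoE model."""
    return active_params_b / total_params_b


def forward_flops_per_token(active_params_b: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2 * active_params_b * 1e9


# (total, active) parameter counts in billions, as quoted above.
MODELS = {
    "DeepSeek V3": (671, 37),
    "Llama 4 Maverick": (400, 17),
}

if __name__ == "__main__":
    for name, (total_b, active_b) in MODELS.items():
        print(f"{name}: {active_fraction(total_b, active_b):.1%} active, "
              f"~{forward_flops_per_token(active_b):.2e} FLOPs per token")
```

The active fraction overstates the real-world saving: a dense model of equivalent quality would be smaller than the MoE model's total parameter count, and all experts still have to sit in memory, which is why the quoted net gain is 2-4x rather than the raw ratio.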
Beyond MoE, three other architectural techniques compound the savings. Quantization (reducing numerical precision from 16-bit to 8-bit or 4-bit) cuts memory usage by 50-75% and speeds inference by 1.5-2.4x with minimal quality loss - Branch8. Speculative decoding uses a small draft model to propose tokens that the large model verifies in parallel, achieving 2-3x latency speedups with identical output quality - Red Hat. Flash Attention reduces memory reads and writes by 5-20x for long sequences - Vijay Kodam.
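The 50-75% memory figure for quantization follows directly from the bit widths. The sketch below computes the memory needed for model weights alone at each precision. The 70-billion-parameter model is a hypothetical example, and the calculation deliberately ignores the KV cache, activations, and quantization metadata such as scales.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def memory_saving(baseline_bits: int, quantized_bits: int) -> float:
    """Fractional reduction in weight memory versus the baseline precision."""
    return 1 - quantized_bits / baseline_bits


if __name__ == "__main__":
    params_b = 70  # hypothetical dense model size, in billions of parameters
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: {weight_memory_gb(params_b, bits):.0f} GB, "
              f"{memory_saving(16, bits):.0%} saved vs 16-bit")
```

Smaller weights mean fewer GPUs per model replica and less data moved per token, which is where the inference speedup comes from.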
We analyzed how these algorithmic efficiency gains interact with hardware improvements in our state of algorithms 2026 report, which tracks the compounding effect of sub-quadratic attention and other efficiency techniques.
Force 3: Prompt Caching
When multiple API calls share the same system prompt or context (which is extremely common in production applications), providers can cache the processed input tokens and reuse them. Anthropic's prompt caching delivers up to 90% cost reduction and 85% latency reduction on cached tokens - Anthropic. OpenAI's automatic caching provides 50% cost reduction and 80% latency reduction - PromptHub.
This matters enormously for agentic workloads where the same system prompt (often thousands of tokens) is repeated across hundreds of tool calls. A Claude Code session making 47 tool calls (the Q1 2026 average) benefits massively from caching the system prompt across all those calls.
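A back-of-the-envelope version of that saving can be sketched as follows. It prices only the repeated system prompt across a 47-call session. The 5,000-token prompt size and the $5/MTok input price are illustrative inputs, the 90% read discount is the figure cited above, and the 25% premium on the first (cache-writing) call is an assumption of this sketch. It also assumes the cache stays warm for the whole session and ignores conversation history and output tokens.

```python
def prompt_cost_uncached(prompt_tokens: int, calls: int,
                         price_per_mtok: float) -> float:
    """USD cost of resending the same prompt on every call at full price."""
    return prompt_tokens * calls * price_per_mtok / 1_000_000


def prompt_cost_cached(prompt_tokens: int, calls: int, price_per_mtok: float,
                       read_discount: float = 0.90,
                       write_premium: float = 0.25) -> float:
    """USD cost when the first call writes the cache and later calls read it."""
    if calls < 1:
        return 0.0
    write = prompt_tokens * price_per_mtok * (1 + write_premium)
    reads = prompt_tokens * (calls - 1) * price_per_mtok * (1 - read_discount)
    return (write + reads) / 1_000_000


if __name__ == "__main__":
    tokens, calls, price = 5_000, 47, 5.00  # hypothetical session
    full = prompt_cost_uncached(tokens, calls, price)
    cached = prompt_cost_cached(tokens, calls, price)
    print(f"uncached ${full:.3f}, cached ${cached:.3f}, "
          f"saving {1 - cached / full:.1%}")
```

The effective saving lands slightly below the headline discount because the first call pays the write premium; the more calls that share the prompt, the closer it gets to the full discount.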
Force 4: Competitive Pressure
The entry of Google (December 2023), DeepSeek (January 2025), and dozens of open-source inference providers (2024-2025) created genuine price competition in what had been a two-player market (OpenAI and Anthropic). Google's willingness to price Gemini Flash at near-zero margins, DeepSeek's demonstration that frontier-quality models could be trained for under $6 million, and the proliferation of Llama-based inference providers (over 300 new GPU cloud providers entered the market in 2025) all contributed to a race to the bottom on pricing.
The academic research identifying the May 2024 structural break confirms this: the rate of price decline accelerated from ~10x per year to ~200x per year after competition intensified - Epoch AI.
Force 5: Scale Economics
As inference volumes grew from millions of queries per day (early 2023) to billions (2025-2026), providers achieved better GPU utilization, amortized fixed costs over more tokens, and negotiated better hardware and energy prices. ChatGPT alone now serves 500 million weekly active users - Backlinko. At this scale, even small per-token cost reductions translate into massive absolute savings.
The critical insight is that these five forces are multiplicative, not additive. A 4x hardware gain multiplied by a 3x architecture gain multiplied by a 2x caching gain multiplied by a 2x competition-driven margin compression equals a 48x total improvement, and that is just one generation of four of the five forces, before any scale effects. Over three years of compounding across all five forces, the 1,000x total makes structural sense.
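The compounding arithmetic is easy to check. The sketch below multiplies the illustrative per-generation gains from the paragraph above and converts a 1,000x decline over three years into an annualized rate. The gain values are the article's illustrative numbers rather than measurements, and the function names are hypothetical.

```python
from math import log, prod

# Illustrative per-generation gains from the paragraph above.
GAINS = {
    "hardware": 4,
    "architecture": 3,
    "caching": 2,
    "margin_compression": 2,
}

COMBINED = prod(GAINS.values())  # multiplicative, not additive


def annualized_factor(total_factor: float, years: float) -> float:
    """Constant yearly decline factor that compounds to total_factor."""
    return total_factor ** (1 / years)


def generations_needed(target_factor: float, per_generation: float) -> float:
    """Number of compounding generations needed to reach target_factor."""
    return log(target_factor) / log(per_generation)


if __name__ == "__main__":
    print(f"combined gain per generation: {COMBINED}x")
    print(f"additive view for comparison: {sum(GAINS.values())}x")
    print(f"1,000x over 3 years = {annualized_factor(1000, 3):.1f}x per year")
    print(f"generations to 1,000x: {generations_needed(1000, COMBINED):.2f}")
```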
4. GPU Economics: The Silicon Foundation
Every token generated by a language model ultimately runs on silicon. Understanding the economics of that silicon, what it costs to manufacture, what NVIDIA charges for it, and what cloud providers charge to rent it, is essential for understanding the floor below which inference costs cannot fall.
The NVIDIA GPU Pricing Timeline
NVIDIA's data center GPU lineup has evolved rapidly over the past five years. Each generation brings more performance, more memory, and a higher price tag per unit, while the cost per unit of compute drops.
The A100 (Ampere architecture, launched May 2020) established the modern AI GPU market. An 80GB SXM variant cost approximately $15,000-$20,000 at retail, or around $200,000 for a DGX A100 system with 8 GPUs - IntuitionLabs.
The H100 (Hopper architecture, shipped broadly Q1 2023) became the workhorse of the AI boom. Pricing ranged from $25,000 to $40,000 per unit depending on vendor and timing, with peak shortage pricing pushing above $40,000 in 2023 - Clarifai. A DGX H200 system (8x H200, the refreshed variant with 141GB HBM3e) ran approximately $400,000-$500,000 - Tech-Insider.
The B200 (Blackwell architecture, shipped Q1 2025) represents the current state of the art for inference workloads. With 192GB of HBM3e memory per chip and 72 petaFLOPS of FP8 compute across an 8-GPU DGX B200 system (about 9 petaFLOPS per chip), the B200 costs approximately $30,000-$50,000 per unit, with a DGX B200 system (8x B200) at roughly $515,000 - Northflank.
What NVIDIA Actually Earns
The gap between NVIDIA's manufacturing cost and sell price reveals one of the most profitable hardware businesses in history. Epoch AI's teardown of the B200 estimates total manufacturing cost at approximately $6,400 per chip: $2,900 for HBM3e memory (45% of BOM), $1,400 for the logic dies (22%), $1,100 for advanced packaging (17%), and $1,000 for other components - Epoch AI.
At a sell price of $30,000-$50,000, that manufacturing cost implies a chip-level gross margin of 79-87% on data center GPUs. The company's reported GAAP gross margins, which cover the whole business, are broadly consistent with this: 78.9% at peak (Q1 FY2025), settling to 73-75% through FY2026 as Blackwell production ramped - SEC filings. NVIDIA's net profit margin sits at 55.6% as of January 2026 - MacroTrends.
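The chip-level margin follows directly from the bill of materials and the sell-price range quoted above. The sketch below recomputes it. It is a simplified view that treats the BOM estimate as the full unit cost, so it leaves out R&D, software, yield losses, and support.

```python
# Estimated B200 bill of materials in USD, as quoted above.
BOM = {
    "hbm3e_memory": 2_900,
    "logic_dies": 1_400,
    "advanced_packaging": 1_100,
    "other_components": 1_000,
}

UNIT_COST = sum(BOM.values())


def gross_margin(sell_price: float, unit_cost: float) -> float:
    """Gross margin as a fraction of the sell price."""
    return (sell_price - unit_cost) / sell_price


if __name__ == "__main__":
    for part, cost in BOM.items():
        print(f"{part:<20} ${cost:>5,} ({cost / UNIT_COST:.0%} of BOM)")
    for price in (30_000, 50_000):
        print(f"sell price ${price:,}: "
              f"{gross_margin(price, UNIT_COST):.1%} gross margin")
```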
The scale of NVIDIA's data center business is staggering. Quarterly data center revenue went from $4.28 billion in Q1 FY2024 (April 2023) to $51.2 billion in Q3 FY2026 (October 2025), a 12x increase in 2.5 years - SEC filings.
We explored the broader implications of NVIDIA's dominance and how the chip economics shape the AI infrastructure market in our AI factory of factories analysis and the Taalas guide on the 17,000 TPS chip revolution.
The Cloud GPU Rental Collapse
For most companies, the relevant cost is not the purchase price of a GPU but the hourly rental rate from a cloud provider. This market experienced a dramatic boom-and-bust cycle.
During the GPU shortage of mid-2023, renting a single H100 on AWS cost approximately $7.50 per GPU-hour, with Google Cloud pricing at $11 per GPU-hour - Silicon Data. By early 2024, as supply normalized, the average dropped to $2.85/GPU-hour. By mid-2025, AWS cut P5 instance prices by 44% to approximately $3.90/GPU-hour - Introl.
As of early 2026, H100 rental prices range from $1.49 to $6.98 per GPU-hour depending on provider, commitment term, and region - IntuitionLabs. RunPod sits near the low end at $1.99-$2.69/GPU-hour, while AWS on-demand remains at the higher end. Lambda Labs now offers B200 GPUs at $3.79/GPU-hour - SpendArk.
The entry of over 300 new GPU cloud providers in 2025 created genuine competition in a market previously dominated by the three hyperscalers (AWS, GCP, Azure). This is a key reason inference API prices fell faster than hardware improvement alone would explain.
What This Means for Per-Token Cost
NVIDIA claims that a $5 million investment in a GB200 NVL72 rack can generate $75 million in token revenue running DeepSeek R1 inference, implying a 15x return on hardware investment - NVIDIA Blog. While this is a marketing claim from NVIDIA (using favorable assumptions about utilization and pricing), it illustrates the fundamental economics: at current token prices and hardware costs, inference is a high-margin business for the entity that owns the GPUs, even if the API provider selling those tokens operates at a loss due to other costs (training amortization, R&D, sales, support).
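One way to connect GPU rental rates to token prices is to divide the hourly rate by the number of tokens a GPU produces in an hour. The sketch below does that. The throughput figure is a hypothetical placeholder, since real throughput depends heavily on model, batch size, context length, and serving stack, and the result covers GPU rental only, with no training amortization, staff, or networking.

```python
def cost_per_mtok(gpu_hourly_rate: float, tokens_per_second: float,
                  utilization: float = 1.0) -> float:
    """GPU rental cost in USD per million generated tokens."""
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive, utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_rate / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical: 2,500 tokens/s aggregate batched throughput per GPU.
    throughput = 2_500
    for rate in (1.99, 3.90, 6.98):  # H100 rental rates quoted in this guide
        print(f"${rate:.2f}/GPU-hour -> "
              f"${cost_per_mtok(rate, throughput, utilization=0.61):.2f}/MTok")
```

Because throughput sits in the denominator, batching and utilization improvements lower per-token cost just as directly as a cheaper rental rate does.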
5. The Energy Bill: Powering Intelligence
Every token generated consumes electricity. The GPU performing the computation, the memory bandwidth shuffling data, the cooling systems keeping the chips from overheating, the networking infrastructure moving data between GPUs, and the power distribution converting grid electricity into the precise voltages the hardware needs all draw power. Energy is the one cost in the inference stack that never goes to zero, no matter how efficient the hardware becomes.
What a Query Actually Consumes
The energy cost of a single LLM query varies enormously depending on model size, context length, and infrastructure efficiency. Sam Altman has stated that a ChatGPT query consumes approximately 0.34 watt-hours - Towards Data Science. Independent estimates range higher: a Dalhousie University study estimated approximately 2.9 watt-hours per query when including full infrastructure overhead - Dalhousie University.
For context, a traditional Google search consumes approximately 0.3 watt-hours - Epoch AI. So depending on whose numbers you trust, an LLM query costs between 1x and 10x as much energy as a web search. The wide range reflects different measurement methodologies (does the estimate include cooling, power distribution losses, and idle power, or just the GPU computation itself?) and different model sizes (a query to Gemini Flash costs far less energy than a query to GPT-5.5).
An H100 GPU draws 350-700 watts under inference load (PCIe models at the low end, SXM modules at the high end) - TRG Datacenters. At 61% average utilization, a single 700-watt H100 consumes approximately 3,740 kWh per year - Tom's Hardware.
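The annual figure can be reproduced from the wattage and utilization. The sketch below assumes the GPU draws its full rated power while busy and nothing while idle, which is a simplification, and the $0.06/kWh rate in the usage example is an illustrative input rather than a quoted price.

```python
HOURS_PER_YEAR = 8_760


def annual_energy_kwh(power_watts: float, utilization: float) -> float:
    """Yearly GPU energy use, assuming zero draw when idle (simplification)."""
    return power_watts / 1000 * HOURS_PER_YEAR * utilization


def annual_energy_cost(energy_kwh: float, price_per_kwh: float) -> float:
    """Yearly electricity cost in USD for the GPU alone, before cooling."""
    return energy_kwh * price_per_kwh


if __name__ == "__main__":
    for label, watts in (("H100 PCIe", 350), ("H100 SXM", 700)):
        kwh = annual_energy_kwh(watts, utilization=0.61)
        print(f"{label}: {kwh:,.0f} kWh/year, "
              f"${annual_energy_cost(kwh, 0.06):,.0f}/year at $0.06/kWh")
```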
Data Center Electricity Costs
Large data center operators negotiate bulk electricity rates significantly below residential prices. The typical range is $0.04-$0.08 per kWh - Thunder Said Energy, compared to the U.S. residential average of $0.19 per kWh as of late 2025 - EESI.
But these costs are rising, not falling. Wholesale electricity prices in the PJM Interconnection (the grid serving Northern Virginia, the world's largest data center hub) surged from $77.78/MWh in Q1 2025 to $136.53/MWh in Q1 2026, a 76% increase in one year - E&E News. Bloomberg reported that wholesale electricity near major data center clusters costs up to 267% more than five years ago - Bloomberg.
This is a counter-trend to everything else in the inference cost stack. While hardware gets cheaper per FLOP, energy is getting more expensive, in part because data centers themselves are driving up demand. Electricity price growth is running at 6.9% annually, more than double the 2.9% general inflation rate - CNBC / Goldman Sachs.
The Efficiency Factor: PUE
Data center efficiency is measured by Power Usage Effectiveness (PUE): total facility power divided by IT equipment power. A PUE of 1.0 means every watt goes to computation. The industry average is 1.56, meaning 36% of total power goes to cooling and overhead - Statista. Google's fleet averages 1.09, and AWS achieves 1.15 globally (with their best European site at 1.04) - Google Data Centers, Congress.gov.
The difference matters. At the industry-average PUE of 1.56, a data center consuming 100 MW of IT power actually draws 156 MW from the grid. At Google's 1.09, the same IT power draws only 109 MW. Over a year at $0.06/kWh, that 47 MW difference costs roughly $24.7 million. For operators running hundreds of megawatts of AI inference, PUE optimization is worth tens or hundreds of millions of dollars annually.
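The PUE arithmetic above can be written out in a few lines. The sketch below uses the same inputs as the text (100 MW of IT load, PUE of 1.56 versus 1.09, $0.06/kWh) and assumes the load runs flat all year, which is a simplification. The function names are illustrative.

```python
HOURS_PER_YEAR = 8_760


def facility_draw_mw(it_power_mw: float, pue: float) -> float:
    """Total grid draw for a given IT load and PUE."""
    return it_power_mw * pue


def overhead_share(pue: float) -> float:
    """Fraction of total facility power spent on cooling and overhead."""
    return (pue - 1) / pue


def annual_cost_usd(power_mw: float, price_per_kwh: float) -> float:
    """Yearly electricity cost of a constant load (1 MW = 1,000 kW)."""
    return power_mw * 1000 * HOURS_PER_YEAR * price_per_kwh


if __name__ == "__main__":
    it_load = 100  # MW of IT equipment power
    extra_mw = facility_draw_mw(it_load, 1.56) - facility_draw_mw(it_load, 1.09)
    print(f"overhead share at PUE 1.56: {overhead_share(1.56):.0%}")
    print(f"extra grid draw: {extra_mw:.0f} MW")
    print(f"extra yearly cost: ${annual_cost_usd(extra_mw, 0.06):,.0f}")
```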
We covered the broader infrastructure buildout and its financing in our guide to EU AI infrastructure and the debt-funded revolution, which tracks how European governments and private investors are spending billions to build sovereign AI compute capacity.
Energy as a Percentage of Total Inference Cost
Despite the rising electricity prices, energy remains a relatively small fraction of total inference cost. Epoch AI's analysis of frontier model training costs found that energy accounts for only 2-6% of total cost, with hardware at 47-67% and R&D staff at 29-49% - Epoch AI. For inference specifically, the ratio shifts somewhat (no R&D staff cost, hardware amortization dominates), but energy remains in the single-digit percentage range for well-optimized facilities.
This is counterintuitive given the headlines about AI energy consumption. The reason is that GPU hardware is so expensive (a single B200 costs $30,000-$50,000) that even at elevated electricity prices, the annual energy cost per GPU (roughly $150-$300 for an H100 at the 3,740 kWh per year and $0.04-$0.08/kWh cited above, before cooling overhead) is a fraction of the hardware's depreciation cost. The hardware cost, not the energy cost, is the binding constraint on inference pricing.
6. Training Costs: The Amortized Foundation
Inference cost is only one part of the total cost of delivering AI capability. Before a model can generate a single token, it must be trained, a one-time (per model) but increasingly enormous expense that providers must recover through inference pricing. The training cost is amortized: spread across all the tokens the model will ever generate during its commercial lifetime.
The Training Cost Escalation
The escalation in training costs since 2017 is extraordinary. The original Transformer paper (2017) trained its model for approximately $670-$900 - Stanford AI Index. GPT-4 (2023) cost an estimated $78 million in compute alone - Epoch AI. Google's Gemini Ultra (late 2023) cost approximately $191 million - Fortune. Meta's Llama 3.1 family (2024) consumed 39.3 million H100 GPU hours in total, roughly 30.8 million of them for the 405B model, whose training cost is estimated at $170 million - Epoch AI.
The current generation is even more extreme. OpenAI's GPT-5 reportedly cost $500 million per training run, with total development costs estimated at $1.7 billion to $2.5 billion by HSBC analysts - Fanatical Futurist, HSBC / SYZ Group. Sam Altman stated that training runs for GPT-5-class models can cost up to $1 billion per attempt - Fortune.
DeepSeek's $5.9 million total (V3 base + R1 RL phase) is the obvious outlier. While SemiAnalysis argues the "true total" including hardware acquisition, infrastructure, and failed experiments is "well higher than $500 million" - IT Pro, the disclosed compute cost remains a fraction of what Western competitors report. Epoch AI confirmed that training compute costs for the largest models are growing at 2.4x per year, which works out to a doubling roughly every nine to ten months - Epoch AI.
We examined the broader implications of scaling laws and whether they've hit a wall in our analysis of the promised scaling laws.
How Training Costs Get Amortized
The training cost gets spread across all inference the model serves during its commercial lifetime. If GPT-5 cost $2 billion to develop and serves 100 quadrillion tokens over its lifetime, the training cost contribution is $0.02 per million tokens, negligible compared to the direct inference cost. If it serves only 1 quadrillion tokens, the contribution jumps to $2.00 per million tokens, meaningful at current pricing levels.
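The amortization arithmetic is development cost divided by lifetime token volume, expressed per million tokens. The sketch below applies it to the $2 billion development figure with three lifetime volumes. The volumes are hypothetical inputs chosen to show the sensitivity, not reported figures, and the function name is illustrative.

```python
def amortized_cost_per_mtok(development_cost_usd: float,
                            lifetime_tokens: float) -> float:
    """Training cost recovered per million tokens served, in USD."""
    if lifetime_tokens <= 0:
        raise ValueError("lifetime_tokens must be positive")
    return development_cost_usd / (lifetime_tokens / 1_000_000)


if __name__ == "__main__":
    development_cost = 2e9  # USD, the illustrative GPT-5 figure used above
    # Hypothetical lifetime volumes: 1e15 is one quadrillion tokens.
    for lifetime_tokens in (1e14, 1e15, 1e17):
        per_mtok = amortized_cost_per_mtok(development_cost, lifetime_tokens)
        print(f"{lifetime_tokens:.0e} tokens -> ${per_mtok:,.2f} per MTok")
```

Every 10x increase in lifetime volume cuts the amortized contribution by 10x, so the training share of a token's price is set almost entirely by how much traffic a model attracts before it is retired.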
This is why scale matters so much to the economics. OpenAI spent $8.7 billion on inference compute (Azure costs) in just the first three quarters of 2025, more than double the $3.7 billion for all of 2024 - The Register. The total compute budget in 2024 was approximately $7 billion, of which about $3 billion went to training, $1.8 billion to inference, and $1 billion to research - Epoch AI. Critically, only about $500 million (10% of R&D compute) went to final training runs that actually produced shipped models. The rest went to scaling experiments, synthetic data generation, and dead-end research.
The Shift from Training to Inference
A structural shift is underway in how compute budgets are allocated. Inference demand is projected to exceed training demand by 118x by 2026 - FourWeekMBA, with inference claiming 75% of total AI compute by 2030 (up from a minority share in 2023). This shift is accelerated by reasoning models (o3, DeepSeek R1) that trade training compute for inference compute: spending more tokens "thinking" at inference time rather than cramming more knowledge in at training time. A 7 billion parameter model with 100x inference compute can match a 70 billion parameter model with standard inference - Build Fast with AI.
This has profound implications for the cost structure. As inference becomes the dominant compute workload, the economics shift from large, periodic training investments to continuous, scaling operational costs. Providers that optimize inference efficiency (better batching, smarter routing, caching) gain a structural cost advantage.
7. The Great Subsidy: Below-Cost Pricing at Scale
This is perhaps the most important section of this guide, because it answers the question most users never ask: if the cost of generating tokens has fallen 1,000x, why are AI companies losing billions? The answer is that they are selling intelligence below cost, deliberately, as a market-capture strategy funded by venture capital and corporate partners.
OpenAI's Financial Reality
OpenAI generated $3.7 billion in revenue in 2024 while losing approximately $5 billion, a loss of about $1.35 for every $1 earned - CNBC. Of that $3.7 billion, roughly $2.7 billion came from ChatGPT subscriptions and $1 billion from the API and other businesses.
Revenue accelerated dramatically in 2025 to $13.1 billion (with CFO Sarah Friar confirming a $20 billion ARR by year-end) - Yahoo Finance. But losses grew even faster: the first half of 2025 alone saw $13.5 billion in losses - The Decoder. The company's own internal projections forecast a $14 billion loss on approximately $13 billion in sales for 2026, scaling to a projected $74 billion operating loss in 2028 before achieving cash-flow positivity around 2029-2030 - Fortune.
The burn rate as of early 2026: approximately $47 million per day - Medium.
ColdFusion's analysis of this cost collapse provides additional context on how these numbers interact with the broader market dynamics.
Anthropic's Financial Reality
Anthropic's trajectory shows similar dynamics at a different scale. Revenue grew from $87 million ARR in January 2024 to $1 billion ARR by December 2024, then to $14 billion ARR by February 2026, and $30 billion ARR by April 2026 (confirmed by CEO Dario Amodei) - SaaStr, VentureBeat. Claude Code specifically hit $1 billion ARR within six months of public launch, with $2.5 billion+ run rate by February 2026 - SaaStr.
Despite this revenue growth, Anthropic posted a $5.3 billion net loss in 2024 on gross margins of negative 94% - The Information. Gross margins improved to approximately 40% in 2025 with projections of 77% by 2028 and cash-flow break-even targeted for 2028 - The Information.
To fund this path, Anthropic has raised a staggering $72.3 billion across 18 rounds, including a $30 billion Series G in February 2026 at a $380 billion valuation - Anthropic. As of May 2026, the company is reportedly in talks to raise another $30 billion at a valuation exceeding $900 billion - Bloomberg.
The $200/Month Plan That Loses Money
The most vivid illustration of AI subsidization is OpenAI's ChatGPT Pro plan. At $200 per month, it is the most expensive consumer AI subscription. Sam Altman personally set the price and, in his own words, "thought we would make some money" - Constellation Research. Instead, OpenAI is losing money on every Pro subscriber because users consume more than expected - Futurism. A single query on the most advanced reasoning mode (o3-pro) can cost the provider up to $1,000 in compute - Oreate AI.
Anthropic faces the same dynamics with its Max 20x plan at $200/month. Claude Code sessions have grown from an average of 4 minutes with 5 tool calls (Q1 2025) to 23 minutes with 47 tool calls (Q1 2026) - Anthropic Agentic Coding Report. Multi-file edits went from 34% of sessions to 78% in the same period. API-equivalent costs average $150-250 per developer per month at enterprise rates, but a heavy Claude Code user (such as a developer on a Max plan running agents for 8 hours a day) can consume billions of tokens per month and cost far more - Verdent Guides.
We detailed the full breakdown of Claude Code's pricing economics in our Claude Code pricing guide.
GitHub Copilot: The Canary in the Coal Mine
The coding assistant market provides the clearest evidence that flat-rate AI subscriptions are unsustainable. GitHub Copilot launched at $10 per user per month (Individual tier). A Wall Street Journal report estimated Microsoft was losing approximately $20 per user per month on the service, with some heavy users costing $80/month on a $10 plan - WSJ / Neowin.
The response has been consistent: Cursor (June 2025) and GitHub Copilot (June 2026) both moved from flat-rate pricing to usage-based or credit-based models - GitHub Blog. This suggests that the era of unlimited AI coding for a fixed monthly fee is ending.
Why They Do It
The subsidy is rational if you believe (as these companies do) that AI will become the dominant platform for computing in the next decade. Microsoft's below-market compute provision to OpenAI has been described as "circular financing" on Wall Street - Axios. The logic: capture users now at a loss, build switching costs through ecosystem lock-in (API integrations, fine-tuned models, workflow dependencies), then raise prices once the market consolidates.
Marc Andreessen articulated this view clearly in his 2026 outlook, discussing how the collapsing cost of intelligence creates enormous optionality for companies that capture distribution now.
The risk, of course, is that the market never consolidates enough for anyone to raise prices, or that open-source alternatives (Llama, DeepSeek) permanently cap what closed-source providers can charge. We analyzed this competitive dynamic in our AI market power consolidation report.
The Subscriber Economics: What Users Actually Pay vs. What They Consume
The gap between subscription price and inference value consumed is the most concrete measure of the subsidy. Consider the tier structure across the two largest providers.
OpenAI's consumer plans as of 2026 are Free ($0), Go ($8/month with ads), Plus ($20/month), and two Pro tiers at $100/month and $200/month - chatgpt.com/pricing. The company projected ChatGPT Plus subscribers would decline from 44 million in 2025 to 9 million in 2026 as users shift to the cheaper Go tier at $8/month, with 112 million Go subscribers projected for 2026 - Where's Your Ed At. This would drop the average revenue per user (ARPU) from approximately $23/month to under $12/month, a deliberate trade: more users at lower revenue, betting on volume and ecosystem lock-in.
Anthropic's structure is similar but oriented toward developer-heavy users: Pro ($20/month), Max 5x ($100/month), and Max 20x ($200/month) - claude.com/pricing. The Max tiers exist specifically because Claude Code users were hitting Pro limits within hours. One documented case: a developer consumed 10 billion tokens over 8 months. At API pricing, that would cost over $15,000. On a Max subscription, the same developer paid roughly $800 for the same period, a discount of about 95% versus pay-as-you-go - Verdent Guides.
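One way to make this case concrete is to convert both bills into an effective price per million tokens. The sketch below is a rough illustration: the helper name is hypothetical, and $15,000 is used as the API-equivalent bill even though the reported figure is "over $15,000", so the real gap would be at least this large.

```python
def usd_per_mtok(total_usd: float, tokens: float) -> float:
    """Effective price per million tokens for a given bill and token volume."""
    return total_usd / (tokens / 1_000_000)


TOKENS_CONSUMED = 10e9  # 10 billion tokens over 8 months, as reported

api_rate = usd_per_mtok(15_000, TOKENS_CONSUMED)   # pay-as-you-go equivalent
plan_rate = usd_per_mtok(800, TOKENS_CONSUMED)     # what the subscriber paid

print(f"API-equivalent: ${api_rate:.2f} per MTok")
print(f"Subscription:   ${plan_rate:.2f} per MTok")
print(f"Ratio:          {api_rate / plan_rate:.2f}x")
```

Expressed this way, the subscriber's blended rate can be compared directly against published API price lists.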
This is not a bug. It is a deliberate strategy. Both companies are sacrificing short-term margins to build habit formation and ecosystem dependency among the most active developers, the users who are most likely to influence enterprise purchasing decisions. The question is whether this cohort generates enough downstream enterprise revenue to justify the subsidy.
8. The Jevons Paradox: Cheaper Tokens, Bigger Bills
Here is the paradox at the heart of LLM economics: token prices fell 280x between 2023 and 2025, but enterprise AI spending rose 320% over the same period - iKangai. This is a textbook Jevons paradox: when a resource becomes cheaper, total consumption increases so much that total spending rises despite the lower unit cost.
The mechanism is straightforward. When GPT-4 cost $30/$60 per MTok, developers carefully minimized token usage, truncated prompts, cached aggressively, and used cheaper models whenever possible. When GPT-4o mini dropped to $0.15/$0.60, the same developers started sending full documents as context, running multi-step agent loops, and letting models "think" through complex reasoning chains that consume thousands of output tokens per step.
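The two headline figures can be combined to back out how much consumption must have grown. This is a sketch with a strong simplifying assumption: it treats the 280x price decline and the 320% spending increase as if they described the same basket of tokens, which the underlying sources do not claim. The function name is hypothetical.

```python
def implied_volume_multiple(price_drop_factor: float, spend_increase_pct: float) -> float:
    """Consumption growth implied by spend = price * volume.

    price_drop_factor: 280 means the unit price fell to 1/280 of its old level.
    spend_increase_pct: 320 means total spend is now 4.2x the old level.
    """
    spend_multiple = 1 + spend_increase_pct / 100
    return spend_multiple * price_drop_factor


volume_growth = implied_volume_multiple(price_drop_factor=280, spend_increase_pct=320)
print(f"Implied token consumption growth: about {volume_growth:,.0f}x")
```

Under that assumption, token consumption would have had to grow on the order of a thousandfold for spending to rise while unit prices fell this far.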
Agentic Usage: The Multiplier
The single biggest driver of the consumption explosion is agentic AI. A traditional chatbot interaction involves one input and one output. An agentic session involves dozens or hundreds of model calls as the agent plans, executes, checks results, adjusts, and iterates. Claude Code's average of 47 tool calls per session (Q1 2026) means each coding session generates roughly 50x more inference demand than a single chat exchange - Anthropic Report.
The trend is accelerating. Gartner reported a 1,445% surge in multi-agent inquiries from Q1 2024 to Q2 2025 - Gartner / Modall. When users can spin up multiple agents that each run inference autonomously, a single human request can trigger thousands of token-generating model calls. The 85% of developers now regularly using AI coding tools (up from negligible adoption two years ago) represents an entirely new category of inference demand that did not exist before 2023 - Uvik.
This pattern is exactly what we described in our analysis of the agentification of business: as AI agents become more capable and easier to deploy, the total volume of inference scales with the number of autonomous workflows, not the number of human prompts. We also explored the economics of this shift in our agent economy guide and the cost of AI agents report.
The Enterprise Budget Reality
For enterprises adopting AI at scale, the cost dynamics are sobering. Adding AI inference to a SaaS product adds approximately $15 in COGS per $80/month seat - SaaS Mag. For a company with 10,000 seats, that is $150,000 per month in incremental cost that did not exist before, with no corresponding increase in subscription revenue unless the company raises prices.
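The seat-level arithmetic can be sketched as follows. The function name is hypothetical, and the margin-point figure is derived here under the assumption that the seat price stays at $80; it is not a number from the cited source.

```python
def ai_cogs_impact(seats: int, price_per_seat: float, ai_cogs_per_seat: float) -> tuple[float, float]:
    """Return (incremental monthly cost in USD, gross-margin points lost per seat)."""
    monthly_cost = seats * ai_cogs_per_seat
    margin_points_lost = 100 * ai_cogs_per_seat / price_per_seat
    return monthly_cost, margin_points_lost


monthly_cost, points_lost = ai_cogs_impact(seats=10_000, price_per_seat=80, ai_cogs_per_seat=15)
print(f"Incremental cost: ${monthly_cost:,.0f} per month")
print(f"Gross margin impact: {points_lost:.2f} points per seat")
```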
The average enterprise developer using AI coding tools costs approximately $13 per active day in inference, or $150-250 per month - Verdent Guides. At the 90th percentile, costs stay below $30 per active day. But the distribution has a very long tail: the heaviest users can consume ten or twenty times the median, and these are often the most productive developers, meaning cutting their access would be counterproductive.
Organizations are learning that AI cost management is not a one-time optimization but an ongoing operational discipline. Nearly 25% of organizations underestimate their AI costs by more than 50% - Keyhole Software. The tools and frameworks for managing AI costs (FinOps for AI) are still immature compared to cloud cost management.
The parallel to cloud computing adoption in 2010-2015 is instructive but imperfect. Cloud spending also surprised enterprises with its variability and tendency to exceed budgets. But cloud costs scaled roughly linearly with usage (more VMs, more storage, more bandwidth). AI inference costs scale combinatorially with agentic complexity: one human prompt triggers an agent that makes 50 model calls, each of which might spawn sub-agents or tool calls that trigger additional inference. The relationship between human input and machine output is no longer linear, and the cost models built for linear scaling do not apply.
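A toy model shows why this fan-out breaks linear cost assumptions. It assumes every agent makes the same number of model calls and spawns the same number of sub-agents at each level; the branching values are hypothetical, and real workflows are far less regular.

```python
def total_model_calls(calls_per_agent: int, subagents_per_agent: int, depth: int) -> int:
    """Model calls triggered by one human prompt.

    depth=0 is a single agent; each extra level lets every agent
    spawn `subagents_per_agent` children that make their own calls.
    """
    agents_at_level = 1
    total = 0
    for _ in range(depth + 1):
        total += agents_at_level * calls_per_agent
        agents_at_level *= subagents_per_agent
    return total


for depth in range(4):
    calls = total_model_calls(calls_per_agent=50, subagents_per_agent=3, depth=depth)
    print(f"depth {depth}: {calls:,} model calls from one prompt")
```

In this model the call count grows geometrically with nesting depth, while the number of human prompts stays at one.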
This is also why the shift to cost-efficient models matters so much for production workloads. Using a $0.05/MTok model (like GPT-4.1 Nano) instead of a $5/MTok model (like GPT-5.5) for routine sub-tasks within an agentic workflow reduces the cost of those 50 model calls by 100x. Smart routing (sending simple tasks to cheap models and complex tasks to expensive ones) is becoming an essential architectural pattern, as we explored in our guide to what LLMs cannot do and the tool ecosystem and how cost-efficient agent swarms can be built in our Kimi K2 agent swarm guide.
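A minimal routing sketch, assuming a single blended price per million tokens for each model (real APIs price input and output tokens separately) and a caller that already knows which tasks are complex. The model labels, the task mix, and the 20,000-token call size are hypothetical; only the $0.05 and $5 price points come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_mtok: float


CHEAP = Model("budget-model", 0.05)      # hypothetical label, price from the text
FLAGSHIP = Model("flagship-model", 5.00)  # hypothetical label, price from the text


def route(task_is_complex: bool) -> Model:
    """Send complex tasks to the flagship model and everything else to the cheap one."""
    return FLAGSHIP if task_is_complex else CHEAP


def workflow_cost(tasks: list[tuple[bool, int]], use_routing: bool = True) -> float:
    """Cost in USD of a list of (is_complex, tokens) model calls."""
    total = 0.0
    for is_complex, tokens in tasks:
        model = route(is_complex) if use_routing else FLAGSHIP
        total += model.usd_per_mtok * tokens / 1_000_000
    return total


# 50 calls of 20,000 tokens each; assume 5 of them genuinely need the flagship model.
tasks = [(True, 20_000)] * 5 + [(False, 20_000)] * 45

print(f"All flagship: ${workflow_cost(tasks, use_routing=False):.3f}")
print(f"Routed:       ${workflow_cost(tasks):.3f}")
```

The 100x saving applies only to the calls that move to the cheap model. Once a few calls stay on the flagship, those calls dominate the bill, so the workflow-level saving in this example is closer to 9x.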
9. Provider Economics: Margins, Revenue, and Sustainability
Understanding the margin structure of AI providers reveals who is actually making money in this market and who is burning capital to buy market share.
The Gross Margin Trajectory
AI company gross margins have been improving but remain well below traditional software margins. The industry-wide average gross margin for AI-native companies rose from 41% in 2024 to 45% in 2025 and an estimated 52% in 2026 - Tanay Jaipuria. For comparison, traditional SaaS companies typically achieve 70-85% gross margins. The "new normal" for mature LLM-native companies is projected to be 60-70% - SaaStr.
OpenAI claims approximately 70% "compute margin", but this is a narrower metric that excludes several cost categories. Its adjusted gross margin (including all COGS) actually declined from 40% in 2024 to 33% in 2025 as inference costs quadrupled due to surging demand - AI Automation Global. The gap between the claimed 70% compute margin and the 33% adjusted gross margin reveals the scale of costs that lie outside raw compute: training amortization, R&D, customer support, safety research, and the inference overhead of running free-tier users.
Anthropic's margin trajectory is more dramatic. From negative 94% gross margin in 2024 (spending nearly $2 for every $1 of revenue on just the direct cost of serving requests), the company improved to approximately 40% gross margin in 2025 and projects 77% by 2028, when it targets cash-flow break-even at $70 billion in revenue - The Information.
A critical caveat on this data: OpenAI's $5B loss in 2024 excludes equity-based compensation, meaning the actual total loss is higher. Anthropic's ARR figures are annualized run rates, not realized annual revenue. The $13.5B figure for OpenAI in 2025 represents losses in the first half only, reported by The Decoder; full-year losses may be higher. These are the best available numbers from credible sources, but AI company finances remain largely private and unaudited.
The Valuation Disconnect
Both companies are valued at levels that imply massive future profitability. OpenAI's March 2026 fundraise valued it at $852 billion on approximately $13 billion in annual revenue - CNBC. Anthropic's February 2026 round valued it at $380 billion, with a potential $900 billion valuation in talks as of May 2026 - Bloomberg. Total funding raised: approximately $180 billion for OpenAI and $72.3 billion for Anthropic - Tracxn.
These valuations only make sense if you believe that inference costs will continue falling (improving margins), revenue will continue growing (more users, more agentic workloads), and the market will consolidate enough for survivors to raise prices. The path to profitability for OpenAI, per their own internal projections, requires reaching ~$200 billion in revenue by 2030 - Fortune.
Gartner projects that inference costs for a 1 trillion parameter model will drop over 90% by 2030 compared to 2025 - Gartner. If that holds, and if demand continues growing faster than costs fall, the profitability math works. If costs plateau while open-source competitors cap pricing power, it does not.
10. Rate Limits: The New Scarcity
As AI companies grapple with the economics of serving heavy users at a loss, a new mechanism has emerged: rate limits. These are the usage caps that restrict how much inference a subscriber can consume within a given time window. They did not exist in the early days of ChatGPT and Claude. They exist now because they have to.
The Anatomy of Rate Limits
Anthropic's Claude uses a dual-layer rate limiting system: a rolling 5-hour window combined with a weekly cap. Free users get approximately 40 short messages per day with no access to Claude Code. Pro subscribers ($20/month) get roughly 45 prompts per 5-hour rolling window with a weekly cap - SitePoint. Max 5x ($100/month) provides 5x the Pro throughput, and Max 20x ($200/month) provides 20x - IntuitionLabs.
The rolling window mechanism is important: usage resets every 5 hours, not daily, allowing multiple intensive sessions per day but preventing sustained 24/7 inference consumption. A bug in Claude Code v2.1.89 (March 2026) that caused 3-50x faster rate limit consumption demonstrated how sensitive these systems are. Max 20x plans were being exhausted in 70 minutes instead of the expected multi-day cycle - GitHub Issue #42272.
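To illustrate how a short rolling window and a weekly cap interact, here is a minimal sliding-window sketch. It is not Anthropic's implementation: the class name, the per-prompt counting, and the limits used in the demo are assumptions made for illustration, and production systems typically meter tokens or compute rather than prompts.

```python
from collections import deque

FIVE_HOURS = 5 * 3600
ONE_WEEK = 7 * 24 * 3600


class DualWindowLimiter:
    """Allow a request only if both the 5-hour and the weekly budgets have room."""

    def __init__(self, window_limit: int, weekly_limit: int) -> None:
        self.window_limit = window_limit
        self.weekly_limit = weekly_limit
        self._events: deque[float] = deque()  # timestamps (seconds) from the last week

    def allow(self, now: float) -> bool:
        # Forget requests that have aged out of the weekly window.
        while self._events and now - self._events[0] >= ONE_WEEK:
            self._events.popleft()
        in_window = sum(1 for t in self._events if now - t < FIVE_HOURS)
        if in_window >= self.window_limit or len(self._events) >= self.weekly_limit:
            return False
        self._events.append(now)
        return True


# Hypothetical tiny limits so the behaviour is easy to follow.
limiter = DualWindowLimiter(window_limit=2, weekly_limit=3)
for t in (0, 1, 2, FIVE_HOURS + 1, FIVE_HOURS + 2):
    print(f"t={t:>6}s allowed={limiter.allow(t)}")
```

The short window throttles bursts while the weekly cap bounds sustained consumption, so a user can be blocked by either layer independently.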
OpenAI's ChatGPT uses soft degradation rather than hard caps: when usage is high, the system reduces response quality and speed rather than blocking access entirely - Explore AI Together. This approach is less transparent but avoids the user frustration of hitting a hard wall.
Why Rate Limits Exist Now
The structural reason is agentic usage. A traditional chat interaction generates a bounded amount of inference per human action. An agentic session can consume an entire day's quota in a single run, because the agent (not the human) is driving the inference loop. When you tell Claude Code to "refactor this codebase," the agent may make 50+ model calls over 23 minutes, each consuming significant tokens. Multiply that by thousands of developers running similar sessions simultaneously, and the infrastructure costs become untenable without usage controls.
The practical constraint is hardware. New GPU capacity takes 12-24 months to provision (ordering chips, building data centers, deploying infrastructure) - MindStudio. Dario Amodei has publicly described Anthropic as "compute-constrained", meaning demand for Claude inference exceeds available capacity. Rate limits are the demand-management tool while supply catches up.
The economic reality is that rate limits are the hidden price increase. Instead of charging more per token (which would be transparent and comparable), providers maintain low nominal prices while limiting consumption. A $200/month plan with aggressive rate limits may deliver less total inference value than a $100/month plan without them, but the comparison is difficult for consumers to make. This opacity is not accidental. It makes direct comparison across providers harder, which benefits providers with the strongest brand loyalty.
This dynamic is not unique to AI. Mobile phone plans went through the same evolution: unlimited plans replaced by tiered data caps once smartphone usage exploded. The difference is that AI inference demand is growing far faster than mobile data demand ever did, and the cost per unit (per token) is falling at a rate that makes fixed infrastructure investment extremely risky.
The fundamental tension is between user experience and financial sustainability. Aggressive rate limits frustrate power users (the most vocal community) and drive them to competitors. Loose rate limits attract power users who consume far more than they pay for. The sweet spot, wherever it exists, depends on the provider's capital reserves, their tolerance for short-term losses, and their read on how quickly hardware improvements will bring costs down. Anthropic's stated position, that it is "compute-constrained" and that new capacity takes 18-24 months, suggests rate limits will remain a reality for at least the medium term regardless of pricing decisions.
For users navigating this environment, the practical implication is that the effective cost of AI is not just the subscription price. It is the subscription price divided by the actual usable throughput within the rate limit window. A $200/month plan that limits you to the equivalent of $300 worth of API usage per month has an effective cost of $200 for $300 of value, a 33% discount. A $200/month plan that limits you to $150 worth of API usage per month is actually a net cost increase compared to pay-as-you-go, a scenario that is becoming more common as providers tighten limits.
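The comparison reduces to one ratio, sketched below. The function name is hypothetical, and the "usable API value" input is something a user would have to estimate from their own usage logs, since providers do not publish it.

```python
def plan_discount(price_usd: float, usable_api_value_usd: float) -> float:
    """Discount versus pay-as-you-go; a negative result means the plan costs more."""
    if usable_api_value_usd <= 0:
        raise ValueError("usable_api_value_usd must be positive")
    return 1 - price_usd / usable_api_value_usd


for usable_value in (300, 150):
    d = plan_discount(200, usable_value)
    label = "discount" if d >= 0 else "premium"
    print(f"$200 plan with ${usable_value} usable value: {abs(d):.0%} {label}")
```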
11. The Future: Where Inference Costs Go From Here
Predicting cost trajectories in a market this volatile requires separating structural trends (which are predictable) from competitive dynamics (which are not). The structural trends point clearly downward. The competitive dynamics introduce uncertainty about how fast and how far.
The Hardware Roadmap
NVIDIA's Blackwell architecture (B200, GB200) delivers 15x lower cost per token compared to the previous Hopper generation - SemiAnalysis. As Blackwell deployment scales through 2026 and the subsequent Rubin architecture arrives (projected 2027), hardware-driven cost reductions will continue at the historical rate of FLOP/s per dollar doubling every 2-3 years - Epoch AI.
The infrastructure investment to deploy this hardware is unprecedented. Project Stargate, the SoftBank-OpenAI-Oracle joint venture, represents $500 billion in committed AI infrastructure investment - Fortune. xAI's Colossus facility in Memphis houses 555,000 GPUs at a cost of approximately $18 billion, with 2 gigawatts of power capacity - Introl. Data center construction starts in the U.S. went from $14.9 billion in 2023 to $77.7 billion in 2025 - NetworkInstallers.
The cost per megawatt of data center capacity has risen from $7.7 million in 2020 to $10.7 million in 2025 (7% CAGR), with AI-optimized facilities running $20 million+ per MW - JLL. This means the absolute cost of infrastructure is rising even as the cost per unit of compute falls. The bet is that the compute delivered per dollar of infrastructure will grow faster than the infrastructure cost itself.
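The growth rate quoted above can be checked with the standard compound annual growth rate formula. The function below is a generic sketch that uses only the two JLL price points from the text.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


rate = cagr(start_value=7.7, end_value=10.7, years=5)  # USD millions per MW, 2020 to 2025
print(f"Cost per MW CAGR: {rate:.1%}")
```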
The Efficiency Roadmap
Software-level optimizations still have significant runway. The combination of MoE architectures, quantization, speculative decoding, and prompt caching has already delivered a cumulative 20-50x improvement on top of raw hardware gains. As these techniques mature and new ones emerge (sub-quadratic attention mechanisms, adaptive compute routing, learned compression of KV caches), the software multiplier on hardware gains could add another 5-10x over the next three years.
The open-source ecosystem continues to drive efficiency innovations. DeepSeek demonstrated that architectural cleverness can substitute for brute-force hardware spending. As these techniques spread through the open-source community, they create a price ceiling that closed-source providers cannot exceed without offering meaningful quality differentiation.
When Subsidies End
The most important unknown is what happens when AI companies must become profitable. OpenAI's own projections show profitability by 2029-2030 - Fortune. Anthropic targets cash-flow break-even by 2028 - The Information. These timelines assume continued revenue growth and cost improvement. If either assumption fails, prices go up.
The shift from flat-rate to usage-based pricing (already visible in GitHub Copilot and Cursor) is a leading indicator. API prices for frontier models have been remarkably stable over the past year (Claude Opus 4.5-4.7 held at $5/$25, GPT-5 to GPT-5.5 in the $2-5 input range), suggesting providers are finding a sustainable price point, or at least a loss they are willing to absorb for now.
Multiple analysts estimate that API prices for frontier models will likely increase within 12-24 months as the subsidy era ends - Uncover Alpha. The counterargument is that open-source alternatives will permanently cap pricing power: if Llama 5 (or whatever Meta releases next) matches frontier closed-source quality, providers cannot raise prices above the cost of running open-source inference on rented GPUs.
The Commoditization of Intelligence
The structural endpoint of all these forces is the commoditization of intelligence. When the marginal cost of generating a token approaches zero (constrained only by energy costs), intelligence becomes an input, not a product. The value shifts from producing intelligence to applying intelligence: combining cheap inference with domain expertise, proprietary data, regulatory knowledge, and customer relationships to deliver outcomes.
This is the thesis behind platforms like O-mega, which use inference as an input to build and operate autonomous businesses. When intelligence costs $0.05 per million tokens, the competitive advantage is not in the intelligence itself but in how effectively it is orchestrated, deployed, and integrated into real-world workflows. We explored this dynamic in our guide to how to vibe-automate with AI agents and the broader analysis of the future of autonomous business operations.
The companies that win in this environment are not the ones that produce the cheapest tokens. They are the ones that convert cheap tokens into expensive outcomes.
The Role of Open Source
Open-source models play a pivotal role in the future cost trajectory because they establish a price ceiling for closed-source providers. If Meta releases Llama 5 at a quality level matching GPT-5.5, and third-party providers offer it at $0.50/MTok (the current ballpark for Llama 3.3 70B via inference providers), then no closed-source provider can sustainably charge $5/MTok for equivalent capability unless they offer significant value-adds: better reliability, lower latency, superior tool integration, compliance certifications, or enterprise support.
Meta's economic model is fundamentally different from OpenAI's or Anthropic's. Meta does not need to profit from inference. Llama exists to prevent any single AI provider from having platform leverage over Meta's products (Facebook, Instagram, WhatsApp). By making frontier-quality models freely available, Meta ensures that intelligence remains a commodity rather than a monopoly. This strategic motivation means Meta will likely continue releasing open models regardless of whether they are profitable to serve, creating a permanent price anchor in the market.
DeepSeek serves a similar function from a different direction. By demonstrating that architectural innovation (MoE, efficient training) can produce frontier models at a fraction of Western costs, DeepSeek established a credibility floor: the claim that frontier models "must" cost $200 million to train is no longer tenable. This affects investor willingness to fund the massive training budgets that OpenAI and Anthropic require, which in turn pressures these companies toward profitability sooner than they might otherwise need.
The net effect is a market where closed-source providers must justify their premium through capability differentiation, trust, safety, and ecosystem integration rather than through intelligence alone. For the end user, this means inference costs will continue falling regardless of any single company's pricing decisions, because the competitive pressure is structural, not cyclical.
Conclusion: The Structural Picture
The true cost of LLM inference in May 2026 exists at three levels, and understanding the gap between them is the key insight of this guide.
Level 1: The hardware cost. A B200 GPU manufactured for $6,400 and sold for $30,000-$50,000 generates tokens at a hardware-level cost that is a fraction of what API providers charge. NVIDIA claims a $5 million hardware investment can generate $75 million in token revenue. At the hardware level, inference is already extremely profitable.
Level 2: The API price. Frontier model API pricing ranges from $0.05/MTok (budget tier) to $5-30/MTok (flagship tier). These prices have fallen 1,000x since 2020 and continue to decline. But they include margins for the provider (when positive), training cost amortization, R&D, safety research, and customer support. At the API level, pricing is below cost for most providers.
Level 3: The subscription price. Consumer and developer plans ($20-$200/month) package API access into fixed-rate bundles that obscure the per-token economics. At this level, heavy users consume 10-100x more value than they pay for, subsidized by light users and venture capital. Rate limits are the mechanism that prevents this from being completely unsustainable.
The trajectory from here is clear on the cost side: hardware improvements, architecture innovations, and scale economics will continue pushing the per-token cost of inference toward zero. It is unclear on the pricing side: competitive dynamics, open-source pressure, subsidy timelines, and the shift to usage-based billing will determine what customers actually pay.
What is certain is that intelligence, measured in the cost of generating coherent language at machine speed, has never been cheaper. And it will be cheaper tomorrow than it is today.
This guide reflects the LLM inference cost landscape as of May 2026. Pricing, model availability, and company financials change frequently. Verify current details before making purchasing or investment decisions. Financial figures from private companies (OpenAI, Anthropic) are sourced from credible reporting but are not independently audited. Where figures are estimates, analyst projections, or self-reported, this is noted in the text.