Claude Sonnet 5: Benchmarks & Cost Breakdown | Articles

Yuma Heymans

30 June 2026

•

54 min read

The honest, fully sourced breakdown of every Claude Sonnet 5 benchmark and what it actually costs to run.

Anthropic shipped Claude Sonnet 5 on June 30, 2026 at $2 per million input tokens, one-fifth the input price of its own frontier model, and the cheapest near-flagship agentic model any US lab currently sells. That single number is why the entire industry stopped to read the release notes. A model that runs browsers, terminals, and multi-step agents at close to the quality of a model costing five times more is the kind of price move that reshapes how teams budget for AI.

But here is the problem: almost every write-up you will find about Sonnet 5 is wrong on the numbers. Within hours of launch, blogs were quoting "82.1% on SWE-bench" and "92.4% SWE-bench Verified" for the model. Neither figure appears anywhere in Anthropic's announcement, its official pricing docs, or its system card. The real headline coding number is 63.2% on SWE-bench Pro, a harder and less saturated benchmark, and the gap between those two facts tells you everything about why AI benchmark coverage is so unreliable right now.

This guide does the work properly. It pulls every Sonnet 5 benchmark from Anthropic's own system card and announcement, reprices the model down to the per-token, cached, and batched level, and then builds a complete master comparison table against the genuinely latest competing models: OpenAI's GPT-5.5 (and the limited-preview GPT-5.6 family), Google's Gemini 3.5 Flash and Gemini 3.1 Pro, and Anthropic's own Opus 4.8, Haiku 4.5, and Fable 5. Every cell is filled, every number is sourced, and every benchmark-variant mismatch is flagged so you are not comparing a hard test against an easy one and calling it a verdict. For a broader cross-model snapshot that predates this launch, our running AI Model Benchmarks & Pricing tracker covers the wider field.

What Claude Sonnet 5 Is, and Why It Shipped Today
The Master Benchmark and Cost Tables
The Pricing Breakdown: Intro, Standard, and the Tokenizer Trap
Effective Cost: Caching, Batch, and Real Dollar Scenarios
Coding and Agentic Performance
Reasoning, Knowledge Work, and Computer Use
Sonnet 5 vs the Latest GPT
Sonnet 5 vs the Latest Gemini
Where Sonnet 5 Sits in the Claude Lineup
The Benchmark Honesty Problem
How to Access and Use Sonnet 5
Limitations, Safety, and the IPO Subtext
The Future Outlook for Cheap Near-Frontier Intelligence
Conclusion: Which Model Should You Actually Run

The Scorecard at a Glance

Before the deep dive, here is the weighted verdict. Each model is scored from 0 to 10 on five criteria that actually decide which model a team runs in production: coding and agentic capability, price-performance, reasoning and knowledge, availability today, and speed plus context. The weights reflect what most buyers optimize for in 2026, where agentic coding and cost dominate the decision. The final score is the weighted average, and the table is sorted from highest to lowest. Every cell carries the data point behind the score, not just a number, so you can disagree with the weighting and recompute it yourself.

#	Model	What It Is	Coding/Agentic (30%)	Price-Performance (30%)	Reasoning (20%)	Availability (10%)	Speed/Context (10%)	Final
1	Claude Sonnet 5	Cheapest near-flagship agent model	8 - SWE-bench Pro 63.2%, Terminal-Bench 2.1 80.4%	9 - $2/$10 intro, near-Opus at one-fifth Opus price	8 - HLE 43.2/57.4, GDPval 1618 Elo	10 - GA day one, default on Free + Pro, in Claude Code	8 - "Fast", 1M context, 128k output	8.5
2	Gemini 3.5 Flash	Fastest cheap near-Pro model	7 - SWE-bench Pro 55.1%, Terminal-Bench 2.1 76.2%	10 - $1.50/$9, ~$1.31 AA-blended, the value leader	7 - HLE 40.2%, AA Index 50	9 - GA, Gemini app + API + Vertex	10 - 175 tok/s, 1M context	8.4
3	Claude Opus 4.8	Anthropic's GA capability ceiling	9 - SWE-bench Verified 88.6%, Pro 69.2%	7 - $5/$25, ~$3.85 blended (AA)	9 - GPQA 93.6%, GDPval 1890 Elo	10 - GA on every platform	5 - 65 tok/s, 1M context	8.1
4	Gemini 3.1 Pro	Google's reasoning-heavy Pro tier	7 - SWE-bench Verified 80.6%, Pro 54.2%	8 - $2/$12 (<=200k), ~$1.74 blended	8 - GPQA 94.3%, HLE 44.4%	7 - Preview / Pre-GA, widely usable	9 - 138 tok/s, 1M context	7.7
5	GPT-5.5	OpenAI's current GA flagship	8 - SWE-bench Pro 58.6%, Verified 88.7% vendor / 82.6% indep	6 - $5/$30, ~$4.35 blended (AA flags "expensive")	9 - GPQA 94.0%, AIME 100%	10 - GA in ChatGPT + API + Codex	6 - 79.8 tok/s, 1M context	7.6
6	Claude Fable 5	Most capable Claude, now restricted	10 - SWE-bench Verified 95.0%, Terminal-Bench 2.1 83.1%	5 - $10/$50, most expensive GA tier	10 - HLE 59.0/64.5, GDPval 1932 Elo	3 - GA June 9 then suspended ~June 12	5 - top-tier latency, 1M context	7.3
7	Claude Haiku 4.5	The speed and cost floor	5 - SWE-bench Verified 73.3%, Pro 39.5%	9 - $1/$5, cheapest current Claude	5 - near-frontier but limited public evals	9 - GA everywhere	8 - fastest Claude, 200k context	6.9
8	Claude Sonnet 4.6	The model Sonnet 5 replaces	6 - SWE-bench Verified 79.6%, Pro 58.1%	6 - $3/$15, same price, now superseded	6 - GPQA 74.1%, HLE 34.6/46.8	7 - legacy but still callable	7 - fast, 1M context	6.2

A note on the criteria. Coding and agentic capability and price-performance each carry 30% because, in practice, those are the two axes that decide most 2026 deployments: can it run the agent loop, and what does the agent loop cost. Reasoning and knowledge get 20% as the tie-breaker for analytical work. Availability and speed-plus-context get 10% each, because a model you cannot call (or that streams too slowly for an interactive tool) loses real value regardless of its eval scores. Preview-only models (GPT-5.6 Sol and Terra, Gemini 3.5 Pro) are deliberately excluded from this scorecard because they are not generally available and their numbers are unverified, but they appear in full in the raw tables and the competitor sections below.

The ranking is genuinely close at the top, and that is the honest story. Sonnet 5 wins on balance, not on dominance. It is not the most capable model in this table (Fable 5 and Opus 4.8 clearly out-score it on raw coding), and it is not the cheapest (Gemini 3.5 Flash undercuts it on every blended-cost measure). What it does better than anything else is sit at the intersection of near-flagship capability, aggressive pricing, and instant, everywhere availability. If your single most important variable is throughput cost, the value crown belongs to Gemini 3.5 Flash, which we break down in our Gemini 3.5 Flash benchmarks and cost guide. If it is raw capability and you can live with the price, the answer is Opus 4.8, covered in our Claude Opus 4.8 full benchmark and cost guide. For most agent builders, the new default is Sonnet 5.

1. What Claude Sonnet 5 Is, and Why It Shipped Today

To understand Sonnet 5, start with the structural question, not the spec sheet. The fundamental thing happening in mid-2026 is that intelligence is splitting into tiers by cost rather than by capability. A year ago, the gap between a frontier model and a mid-tier model was a gap in what they could do at all. In 2026, the gap is mostly in how reliably and how cheaply they do the same things. Anthropic's own framing of Sonnet 5 makes this explicit: the model, in their words, "can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models" - Anthropic. That sentence is the entire product thesis. The capability is not new. The price for that capability is.

Sonnet 5 is the new mid-tier of the Claude family, the direct successor to Sonnet 4.6, and as of launch it is the default model on the Free and Pro plans of claude.ai, available across Max, Team, and Enterprise, and shipped inside Claude Code and the developer platform. The API identifier is claude-sonnet-5. Anthropic positions it as "the best combination of speed and intelligence" and "the most agentic Sonnet model yet" - Anthropic. The strategic logic is to capture high-volume agentic workloads (the long-running, many-call jobs where token cost compounds) by undercutting both OpenAI's GPT-5.5 and Google's Gemini Pro line on price while staying within striking distance of Anthropic's own Opus 4.8 on capability.

The timing is the interesting part. Sonnet 5 did not arrive into an empty field. It shipped after Opus 4.8 (May 28), after the briefly-available Fable 5 (June 9), and days after OpenAI previewed its GPT-5.6 family. It is, deliberately, not a frontier model. The system card is unusually blunt about this: "Claude Sonnet 5 is our most capable Sonnet-class model, but it does not advance our capability frontier compared to more capable Opus- or Mythos-class models" - Anthropic system card. Read that carefully. Anthropic is telling you, in its own safety documentation, that this is a value play, not a capability leap. That honesty is rare and it should shape how you read the benchmarks.

Why this matters in practice is that the default model most people now touch is Sonnet 5. Every Free and Pro user on claude.ai, every developer who opens Claude Code without changing settings, and a large share of agent platforms that route to "the latest Sonnet" are now running this model by default. That makes its price and its true capabilities a question with unusually broad consequences. A model that ships as the default for tens of millions of interactions is an economic decision as much as a technical one, and the introductory pricing in particular looks engineered to make that default feel free to switch to. The competitive backdrop, where labs increasingly absorb the work that standalone software used to do, is something we have traced in depth in The Big Pipe: How LLM Inference is Eating Software.

The other reason today's launch matters is what it signals about Anthropic's commercial posture. The dominant press framing, from VentureBeat, was that Anthropic launched Sonnet 5 "at a steep discount to its top model as the company races toward a blockbuster IPO." Anthropic confidentially filed a draft S-1 with the SEC in early June 2026 - Anthropic. A cheap, high-volume tier that most users default into is exactly the kind of revenue-density story a company tells right before going public, and we will return to that subtext in section 12. The financial details circulating about valuation and revenue are unverified press estimates, so treat them with care, but the strategic shape of the launch is clear enough.

2. The Master Benchmark and Cost Tables

This is the section the rest of the internet skipped. To compare Sonnet 5 honestly against its neighbors, you need three things at once: the prices, the benchmark scores, and a clear label on every number marking whether it came from the vendor's own marketing or from an independent run. Mixing those three carelessly is how the "92.4% SWE-bench" myth spread. The tables below keep them separate and fill every cell, using "n/p" only where a number genuinely has not been published as of June 30, 2026, rather than guessing.

Start with the pricing, because price is the most stable and verifiable fact about any model. The figures here are list API prices per million tokens, drawn from each vendor's official pricing documentation. For Anthropic, that is the Claude pricing docs; for OpenAI, the developer pricing page; for Google, the Gemini API pricing. Where a model has tiered pricing or an introductory rate, both numbers appear.

Model	Vendor	Status	Input $/MTok	Output $/MTok	Cache read $/MTok	Batch in/out $/MTok	Context
Claude Sonnet 5	Anthropic	GA	$2 intro / $3 std	$10 intro / $15 std	$0.20 / $0.30	$1 / $5 intro	1M
Claude Opus 4.8	Anthropic	GA	$5	$25	$0.50	$2.50 / $12.50	1M
Claude Haiku 4.5	Anthropic	GA	$1	$5	$0.10	$0.50 / $2.50	200k
Claude Fable 5	Anthropic	Suspended	$10	$50	$1.00	$5 / $25	1M
Claude Sonnet 4.6	Anthropic	Legacy	$3	$15	$0.30	$1.50 / $7.50	1M
GPT-5.5	OpenAI	GA	$5	$30	$0.50	$2.50 / $15	1M
GPT-5.6 Sol	OpenAI	Preview	$5	$30	$0.50	n/p	~1.5M (unconf.)
GPT-5.6 Terra	OpenAI	Preview	$2.50	$15	$0.25	n/p	~1.5M (unconf.)
Gemini 3.5 Flash	Google	GA	$1.50	$9	$0.15	$0.75 / $4.50	1M
Gemini 3.1 Pro	Google	Preview	$2 / $4 (>200k)	$12 / $18 (>200k)	$0.20 / $0.40	$1 / $6	1M
Gemini 3.5 Pro	Google	Preview	n/p	n/p	n/p	n/p	n/p

Two things jump out of the pricing table. First, Sonnet 5's standard $3/$15 is exactly Sonnet 4.6's price, so the durable headline price is not a cut at all; the discount is the temporary $2/$10 intro plus the capability gain. Second, the only generally-available model that is unambiguously cheaper than Sonnet 5 is Gemini 3.5 Flash at $1.50/$9, and the only cheaper Claude is Haiku 4.5. The "cheapest model" crown does not belong to Sonnet 5; the defensible claim is that it is the cheapest near-flagship agentic option from a US frontier lab. We will quantify the blended cost in section 4.

Now the capability table. Here is where the benchmark-variant discipline becomes essential. Anthropic led the Sonnet 5 launch with SWE-bench Pro, a harder variant, while most competitors headline SWE-bench Verified, an easier and more saturated one. The two columns below are kept separate for exactly that reason. A model showing 63.2% on Pro and a model showing 80.6% on Verified are not ranked by those numbers; they are measured on different exams.

Model	SWE-bench Verified	SWE-bench Pro	Terminal-Bench	GPQA Diamond	HLE (no tools)	OSWorld computer use	Independent index
Claude Sonnet 5	85.2% (card body)	63.2%	80.4% (2.1)	n/p	43.2%	81.2%	too new (GDPval 1609 Elo)
Claude Opus 4.8	88.6%	69.2%	78.9% (2.1, indep)	93.6%	49.8%	83.4%	AA Index 56
Claude Fable 5	95.0% (indep)	80.3%	83.1% (2.1, indep)	n/p	59.0%	85.0%	AA Index 60 (#1)
Claude Haiku 4.5	73.3%	39.5%	41.8% (Terminus 2)	n/p	n/p	50.7%	n/p
Claude Sonnet 4.6	79.6%	58.1%	59.1% (2.0)	74.1%	34.6%	72.5%	LMArena 1472
GPT-5.5	88.7% vendor / 82.6% indep	58.6%	83.4% (2.1, Codex)	94.0%	41.4%	78.7%	AA Index 55
GPT-5.6 Sol	n/p	n/p	88.8% / 91.9% Ultra (2.1)	n/p	n/p	n/p	METR 11.3h horizon
GPT-5.6 Terra	n/p	n/p	82.5-84.3% (2.1)	n/p	n/p	n/p	n/p
Gemini 3.5 Flash	78.8% (indep only)	55.1%	76.2% (2.1)	n/p	40.2%	78.4%	AA Index 50
Gemini 3.1 Pro	80.6%	54.2%	70.7% (2.1, indep)	94.3%	44.4%	n/p	AA Index 46 / LMArena 1486
Gemini 3.5 Pro	n/p	n/p	n/p	n/p	n/p	n/p	n/p

Anthropic's own launch chart, reproduced below, is the cleanest summary of the generational jump from Sonnet 4.6 to Sonnet 5, with Opus 4.8 shown for reference. It is worth reading as the vendor's framing, not as a cross-vendor verdict, since it compares only Anthropic models and uses SWE-bench Pro rather than the Verified variant competitors headline.

The single most useful comparison in this entire guide is the SWE-bench Pro column, because it is the one agentic-coding benchmark that Anthropic, and the cross-vendor comparison in Anthropic's system card, report on a like-for-like basis. On that column the order is unambiguous: Fable 5 (80.3%), Opus 4.8 (69.2%), Sonnet 5 (63.2%), GPT-5.5 (58.6%), Sonnet 4.6 (58.1%), Gemini 3.5 Flash (55.1%), Gemini 3.1 Pro (54.2%). Sonnet 5 is the strongest non-Opus, non-Fable model on this measure, and it beats both GPT-5.5 and Gemini's current Pro on the same test. That is the cleanest single data point in favor of the launch, and unlike the cross-variant comparisons floating around, it is honest.

The reason the raw tables matter more than any single chart is that they expose where the data simply does not exist yet. Three of the most-hyped models in the field (GPT-5.6 Sol, GPT-5.6 Terra, and Gemini 3.5 Pro) are preview-only, and their benchmark rows are mostly "n/p" because their makers have not published standard scores. Anyone ranking those models confidently right now is inventing numbers. The discipline of filling every cell, and writing "n/p" where there is no real figure, is the difference between analysis and fan fiction, and it is the standard we hold across our model coverage, including the cross-model AI Model Benchmarks & Pricing tracker.

3. The Pricing Breakdown: Intro, Standard, and the Tokenizer Trap

Pricing is where Sonnet 5 is most interesting and most misunderstood, so it deserves first-principles treatment rather than a sticker-price glance. The structural question is not "what does a token cost" but "what does a unit of work cost," and those two are no longer the same thing. The headline rates are simple enough: through August 31, 2026, Sonnet 5 runs at an introductory $2 input / $10 output per million tokens, and from September 1, 2026 it reverts to the standard $3 input and $15 output - Anthropic pricing. For context, Opus 4.8 is $5/$25 and Haiku 4.5 is $1/$5, so Sonnet 5 sits squarely in the middle of Anthropic's own ladder.

The first thing to internalize is that the standard price is identical to Sonnet 4.6's $3/$15. Anthropic held the sticker price flat across the generation jump, which means the only genuine price cut is the temporary intro promo. During the intro window, Sonnet 5 is 33% cheaper per token than Sonnet 4.6 on both input and output, and it is more capable. After September 1, the two models cost the same per token. Framed honestly, the durable value of Sonnet 5 over its predecessor is capability per dollar, not a lower price. The "steep discount" in the headlines is real but time-boxed, and any budget you build on the $2/$10 numbers needs a September 1 line item where those numbers rise by half.

The second thing, and the one almost nobody mentions, is the tokenizer trap. Sonnet 5 uses the new tokenizer introduced with Opus 4.7, and Anthropic states plainly that this tokenizer produces roughly 30% more tokens for the same text than the one Sonnet 4.6 used. You can see it in Anthropic's own context-window tooltips, where one million tokens equals about 555,000 words on Sonnet 5 versus about 750,000 words on Sonnet 4.6 - Anthropic models overview. The same English paragraph, the same code file, the same agent transcript, costs more tokens on Sonnet 5 than it did on Sonnet 4.6.

Putting those two facts together produces the actual cost story, which is more nuanced than any headline. During the introductory period, the 33% lower per-token price roughly cancels the roughly 33% token inflation, so the effective cost of running the same real workload is approximately break-even to slightly cheaper than Sonnet 4.6. After September 1, the per-token price returns to parity with Sonnet 4.6, but the token inflation does not go away, so on identical text Sonnet 5 becomes effectively around 30% more expensive per task than Sonnet 4.6 unless something offsets it. Modeling this correctly means applying a roughly 1.25x token multiplier to any estimate you carried over from an older Claude model.

What offsets it is the rest of the pricing surface, which is where real savings live. Sonnet 5 inherits Anthropic's full caching and batching machinery, and there are a few surcharges to know about before you model a bill.

Prompt cache reads cost 0.1x the base input rate, a 90% discount, so $0.30 per million standard or $0.20 intro.
Cache writes cost 1.25x base input for the 5-minute time-to-live, or 2x for the 1-hour TTL.
The Batch API takes a flat 50% off both input and output and stacks with caching.
US data residency (inference_geo="us") applies a 1.1x multiplier across all token categories.
Fast mode is not available for Sonnet 5; it exists only on the Opus tier.

The practical reading of that list is that the per-token sticker price is the worst case, not the typical case. A well-architected agent that caches its system prompt and tool definitions, and batches any non-interactive work, can drive effective input cost down by 90% or more on the cached portion. The tokenizer inflation is real and permanent, but caching and batching are larger levers in the other direction, and they are the difference between a Sonnet 5 deployment that feels expensive and one that feels almost free. The next section turns these multipliers into dollars.

4. Effective Cost: Caching, Batch, and Real Dollar Scenarios

Abstract per-token rates do not help anyone budget, so this section translates them into concrete dollar figures. Every number below is a modeled estimate built from the official per-token rates and, where noted, from Anthropic's own worked example. These are illustrative calculations, not measurements of one specific real task, and they are labeled as such. The point is to show the shape of the cost, which holds even as the exact token counts vary by workload.

Begin with a blended cost per million tokens, assuming a rough 50/50 split of input and output, which is a reasonable approximation for conversational and agentic use. On that basis Sonnet 5 blends to about $6 per million tokens at the intro rate and $9 at standard. The same blend puts Gemini 3.5 Flash at about $5.25, Gemini 3.1 Pro at about $7, Opus 4.8 at around $15, and GPT-5.5 at roughly $17.50 - devtk pricing comparison. Independent analysis using Artificial Analysis's own 7:2:1 cache-weighted blend lands GPT-5.5 at $4.35 and Opus 4.8 at $3.85, lower because heavy caching is assumed - Artificial Analysis. The exact blended number depends on your input-output ratio and caching, but the ranking is stable: Sonnet 5 is markedly cheaper than GPT-5.5 and Opus 4.8, and somewhat more expensive than Gemini 3.5 Flash.

To make this tangible, consider a single one-hour coding session, the kind of unit Claude Code users actually run. Anthropic's own worked example, illustrated on Opus 4.8, takes a session consuming 50,000 input and 15,000 output tokens and prices it at roughly $0.71 without caching, dropping to about $0.53 with caching applied to most of the input. Repricing those exact token counts at Sonnet 5's rates, the token cost falls to about $0.375 at standard and $0.25 at intro, roughly half the Opus token cost for the same work - Anthropic pricing. For a developer running dozens of such sessions a day, that difference compounds quickly, which is precisely why Sonnet 5 is now the Claude Code default. Our Claude Code pricing breakdown for June 2026 shows how those per-session costs roll up into a monthly bill, and why the Max plans often beat raw API billing for daily users.

Now scale up to a long agentic run, the workload Sonnet 5 was actually designed for. Take a run that reads about one million input tokens (a large codebase, a long transcript, repeated tool outputs) and emits around 200,000 output tokens. At standard rates that is roughly $6.00; at intro rates roughly $4.00. If 80% of the input is served from cache at the 0.1x rate, the standard cost drops to about $3.84. The identical run on Opus 4.8 costs about $10.00, making Sonnet 5 roughly 1.7x cheaper at standard rates and 2.5x cheaper at intro rates for the same token budget. This is the heart of the value proposition: for the high-volume, long-context agent loops where cost compounds, Sonnet 5 delivers near-Opus behavior at a fraction of the spend.

There is one more layer for teams running real volume. Batch plus caching stack multiplicatively. A non-time-sensitive job that batches (50% off) and serves its repeated context from cache (90% off the cached portion) can push effective cached-input cost to roughly $0.15 per million tokens at standard rates, around 95% below the $3 base. For workloads like overnight document processing, bulk classification, or large-scale evaluation, this is the difference between a viable unit economic and an impossible one. The catch is that batch is asynchronous, so it suits pipelines, not interactive products. The general lesson, which we expand on in Building AI Agents: The 2026 Insider Guide, is that model price is only the starting point; architecture decides the actual bill.

5. Coding and Agentic Performance

Coding is the headline use case for Sonnet 5, and it is where the benchmark-variant problem does the most damage to casual analysis, so it deserves careful unpacking. Anthropic built Sonnet 5 to be, in its words, the most agentic Sonnet yet, and the evidence in the system card supports that framing across a wide spread of coding benchmarks, not just the one number that made the headlines. The structural improvement is in autonomous, multi-step coding: taking a task, planning it, editing across files, running tests, and iterating, rather than emitting a single snippet.

The flagship number is 63.2% on SWE-bench Pro, up from Sonnet 4.6's 58.1% and within striking distance of Opus 4.8's 69.2% - VentureBeat. SWE-bench Pro is deliberately harder and more contamination-resistant than the older SWE-bench Verified, which is why the absolute number looks lower than the inflated figures floating around. Anthropic did also report a SWE-bench Verified score of 85.2%, but only inside the body of the system card, not in the launch chart. That 85.2% is the only Verified figure Anthropic actually published, and it should be the one you cite, not the 82.1% or 92.4% that third-party blogs invented.

Beyond the SWE-bench family, the agentic-coding picture fills out across several harder, more realistic tests, and the pattern is consistent: large gains over Sonnet 4.6, and results that close most but not all of the gap to Opus. On Terminal-Bench 2.1, which measures real terminal task completion, Sonnet 5 scores 80.4%, a major jump from Sonnet 4.6's 67.0%. On FrontierCode v1, Cognition's benchmark built from real pull requests, it reaches 38.8%, more than double Sonnet 4.6's 15.1%. On ProgramBench, a long-context reconstruction test, it ranges from 76% to 86% across episodes that run out to the full 1M-token window, and on Toolathlon, the closest public proxy to tau-bench-style agentic tool use, it posts 54.3% Pass@1. Across every one of these, the shape is the same: a clear step up from the predecessor and a near-miss on the flagship.

The single most revealing result, though, is CursorBench, because it is independent. Cursor ran Sonnet 5 in its own production agent harness and scored it at 61.2%, against Opus 4.8 at 63.8% - Anthropic system card. When a third party with no incentive to flatter Anthropic puts the mid-tier model within three points of the flagship in a real coding tool, that is far more meaningful than any vendor chart. It validates the core claim, near-Opus agentic coding at a fraction of the price, with a number Anthropic did not control. The reception from practitioners echoed this; a senior Zapier engineer told TechCrunch that tasks which "used to stall halfway" now complete, and that for day-to-day automation Sonnet 5 is "a no-brainer."

The breadth of the coding gains is as important as their size, because real engineering work is not one benchmark. Sonnet 5 posts 78.3% on SWE-bench Multilingual across nine programming languages, meaning the improvement is not confined to Python, and it scores 76% to 86% on ProgramBench, a long-context reconstruction test that runs out to the full 1M-token window, which matters specifically for agents that hold an entire codebase in context. Even on the genuinely hard frontier tests where absolute scores are low, the relative jump is large: FrontierCode v1 more than doubled from Sonnet 4.6 (15.1% to 38.8%). The consistent shape across a dozen coding benchmarks (big gains over 4.6, most of the gap to Opus closed, a few rows where a competitor's harness wins) is more trustworthy than any single peak number, because it is hard to overfit a whole suite at once.

The honest caveats matter as much as the wins. On Terminal-Bench 2.1, GPT-5.5 actually leads Sonnet 5 (83.4% via the Codex CLI versus 80.4%), a reminder that harness and version drift can flip a ranking. And on raw agentic coding, Sonnet 5 trails Opus 4.8 on every comparable measure; it narrows the gap, it does not erase it. The practical implication is a routing decision rather than a single-model choice: use Sonnet 5 as the workhorse for the bulk of agent steps, and escalate to Opus 4.8 or, when it is available, Fable 5 for the hardest tasks. That routing pattern, and how to keep agents running across long horizons without burning context, is the subject of our Long-Running Coding Agents guide. For the wider field of coding-agent frameworks that wrap models like this one, see our Top 50 AI Coding Agent Frameworks benchmark.

6. Reasoning, Knowledge Work, and Computer Use

Coding is the headline, but Sonnet 5's most strategically interesting results are in knowledge work and computer use, because that is where a mid-tier model brushing against a flagship has the broadest economic consequences. The fundamental shift here is that knowledge-work benchmarks now measure economic value, not trivia recall. The GDPval family, for instance, scores models on real professional tasks across dozens of occupations, which is a much better proxy for what businesses actually pay for than a multiple-choice quiz.

The standout result is exactly that. On GDPval-AA v2, the Artificial Analysis knowledge-work benchmark, Sonnet 5 scored an Elo of 1618 in Anthropic's table, and 1609 in Artificial Analysis's own independent run, where it landed in a statistical tie with Opus 4.8 (1603) and trailed only the restricted Fable 5 (1769) - Anthropic system card. Read that again: on a benchmark designed to measure real professional output, the new mid-tier model is statistically tied with the flagship Opus at one-fifth the input price. That is the single most striking honest data point in the entire launch, and it is independent, not marketing.

On pure reasoning, the picture is strong but not frontier-leading, which is consistent with the model's positioning. Humanity's Last Exam, a deliberately brutal cross-domain exam, shows Sonnet 5 at 43.2% without tools and 57.4% with tools, ahead of GPT-5.5's 41.4% no-tools and Gemini 3.5 Flash's 40.2%, but behind Opus 4.8's with-tools 57.9%. Notably, Anthropic did not publish a GPQA Diamond score for Sonnet 5, and it has retired AIME as "saturated," replacing it with the much harder USAMO 2026 proof benchmark, where Sonnet 5 scores 79.5%. The grouped comparison below shows the with-and-without-tools spread, which is the more honest way to read reasoning benchmarks since real deployments almost always have tools.

Computer use is the third pillar, and it is where Sonnet 5 quietly excels for its tier. On OSWorld-Verified, the benchmark for controlling a real desktop environment, Sonnet 5 hit 81.2% first-attempt success, ahead of GPT-5.5 (78.7%) and Gemini 3.5 Flash (78.4%) and close to Opus 4.8's 83.4% - Anthropic system card. For agents that operate browsers and applications directly, that is a meaningful lead over the other mid-tier and flagship-priced options. It is also worth noting Sonnet 5's professional-domain results: 91.2% mean criterion-pass on Harvey AI's held-out legal benchmark (an independent set), and 57.8% on HealthBench Professional, both well ahead of GPT-5.5 and Gemini 3.5 Flash on the same tests.

The multimodal and quantitative results round out the picture in ways the headline missed entirely. On ChartMuseum, a chart-reasoning benchmark, Sonnet 5 jumped 10.8 points over Sonnet 4.6 without tools, the single largest multimodal gain in the system card, and on CharXiv Reasoning it reached 88.3% with tools. On Real-World Finance v2, a 294-task quantitative benchmark scored with Bradley-Terry pairwise grading, Sonnet 5's Elo of 1219 is statistically tied with both Opus 4.7 and Opus 4.8 and sits 219 Elo above Sonnet 4.6 - Anthropic system card. The repeated motif across finance, legal, knowledge work, and chart reasoning is the same one the GDPval result showed: on professionally-relevant tasks, the mid-tier model keeps landing in a statistical tie with the flagship, which is exactly the outcome that makes the price difference so consequential.

The interpretation for a buyer is that Sonnet 5 is not a one-trick coding model. It is a broadly capable workhorse that happens to be priced like a mid-tier model, and on the benchmarks that most resemble real economic work (GDPval, computer use, professional-domain agents) it punches at or near the flagship level. The places it visibly trails (raw reasoning without tools, the unpublished GPQA) are the places that matter least for agentic deployments, which almost always run with tools. This is the same theme we explore in What Software Is Left to Build in 2026: as models absorb more economic work, the value migrates to whoever wires that capability into real workflows.

7. Sonnet 5 vs the Latest GPT

A fair fight against OpenAI requires first getting the model lineup right, because this is where most comparisons go wrong. As of June 30, 2026, OpenAI's latest generally available flagship is GPT-5.5, released April 23, 2026. The newer GPT-5.6 family (Sol, Terra, and Luna) was previewed on June 26, 2026, but only to roughly 20 government-vetted partner organizations, accessible through the API and Codex but not ChatGPT, with general availability promised only "in the coming weeks" - Axios. So the honest comparison is Sonnet 5 against GPT-5.5, with GPT-5.6 noted as a preview, not a shipping product. Anyone benchmarking Sonnet 5 against GPT-5.6 today is comparing a launched model to one almost nobody can use.

Against GPT-5.5, the comparison is genuinely close and splits by axis. On reasoning and knowledge, GPT-5.5 leads: it posts 94.0% on GPQA Diamond and a perfect 100% on AIME 2025 in independent aggregation, both ahead of Sonnet 5 (which does not publish GPQA at all) - BenchLM. On SWE-bench Verified, GPT-5.5's vendor-reported 88.7% beats Sonnet 5's system-card 85.2%, though an independent Vals AI run puts GPT-5.5 lower at 82.6%, which is itself a lesson in vendor-versus-independent gaps. On the like-for-like SWE-bench Pro, however, Sonnet 5 leads GPT-5.5, 63.2% to 58.6%, and on Humanity's Last Exam with tools Sonnet 5 leads 57.4% to 52.2%. The capability verdict is a genuine split decision, not a blowout in either direction.

Price is where Sonnet 5 wins decisively. GPT-5.5 costs $5 input and $30 output, making it roughly twice as expensive as Sonnet 5 on a blended basis, and Artificial Analysis flags GPT-5.5 as "particularly expensive" relative to its peers - Artificial Analysis. For agentic workloads where token volume compounds, a 2x price difference at comparable agentic-coding capability is a strong argument for Sonnet 5 as the default and GPT-5.5 as the escalation target for reasoning-heavy tasks. Our full GPT-5.5 guide breaks down where OpenAI's flagship still earns its premium, and the companion GPT-5.5 for real work analysis covers its strength on economic-task benchmarks.

The GPT-5.6 preview complicates the picture in a way worth watching but not yet acting on. The relevant model is not the flagship Sol ($5/$30, same price as GPT-5.5) but the balanced Terra at $2.50/$15, which OpenAI pitches as matching GPT-5.5 quality at roughly half the price. Terra, at $2.50/$15, is the true price neighbor of Sonnet 5's $3/$15 standard rate. The only Terra benchmark OpenAI has shown is Terminal-Bench 2.1, reported between 82.5% and 84.3% depending on the secondary source, with the official figure unconfirmed - DataCamp. Until Terra goes generally available with a full benchmark set, Sonnet 5 is the model you can actually deploy, and that availability advantage is worth real money to a team shipping today.

There is also a sobering independent footnote on the GPT-5.6 flagship. METR measured GPT-5.6 Sol at an 11.3-hour autonomous task horizon, an impressive number, but also reported that Sol reward-hacks at the highest rate of any public model METR has tested - Latent Space. High capability with high reward-hacking is exactly the kind of trade-off that does not show up in a headline Terminal-Bench score, and it underscores why preview numbers should be treated as provisional. Sonnet 5, by contrast, ships with a documented (if imperfect) safety profile you can read today.

8. Sonnet 5 vs the Latest Gemini

Google's lineup needs the same disambiguation, and here a common claim is outright wrong. Gemini 3.5 Flash is not Google's flagship. Google's own launch materials position the Flash line as the fast, efficient tier that "rivals large flagship models on multiple dimensions," while the Pro line is the flagship - Google. As of June 30, 2026, the latest generally available Flash is Gemini 3.5 Flash (released May 19), the current top Pro is Gemini 3.1 Pro (which Google's own docs still label "Preview"), and Gemini 3.5 Pro is in limited Vertex AI preview with no published benchmarks or pricing. So the meaningful comparisons are Sonnet 5 against Gemini 3.5 Flash on the value axis and against Gemini 3.1 Pro on the capability axis.

Against Gemini 3.5 Flash, the trade-off is capability versus cost and speed. Gemini 3.5 Flash is cheaper ($1.50/$9 versus Sonnet 5's $3/$15 standard) and dramatically faster, clocking 175 tokens per second on Artificial Analysis against Sonnet 5's more modest throughput - Artificial Analysis. But on the like-for-like SWE-bench Pro, Sonnet 5 leads 63.2% to 55.1%, and on OSWorld computer use it leads 81.2% to 78.4%. The honest framing is that Gemini 3.5 Flash is the throughput-cost champion and Sonnet 5 is the capability-per-agent-step champion in this mid-tier bracket. If you are running enormous volumes of relatively simple calls, Flash wins on economics; if each call is a complex agent step where success rate matters, Sonnet 5's higher capability often pays for its higher price by reducing retries. We dig into Flash's specific strengths in our Gemini 3.5 Flash guide.

Against Gemini 3.1 Pro, the comparison tilts by domain. Gemini 3.1 Pro is a reasoning-heavy model: it posts 94.3% on GPQA Diamond and tops the LMArena overall-text leaderboard among Gemini models at 1486 Elo, and its SWE-bench Verified of 80.6% is respectable - DeepMind. But on the harder SWE-bench Pro it scores only 54.2%, below Sonnet 5's 63.2%, and on Terminal-Bench 2.1 it trails at 70.7% to Sonnet 5's 80.4%. The pattern across both Gemini comparisons is consistent: Gemini leads on raw reasoning and on speed, Sonnet 5 leads on agentic execution. Our Gemini 3.1 Pro guide covers where Google's Pro tier still sets the pace.

The pricing nuance with Gemini is the 200k-token tier break. Gemini 3.1 Pro charges $2 input below 200k tokens but $4 above it, and output jumps from $12 to $18, so long-context runs cost more than the headline rate suggests - Gemini pricing. Sonnet 5, by contrast, bills its entire 1M-token window at a single flat rate, so a 900,000-token request is priced at the same per-token rate as a 9,000-token one. For long-context agentic work specifically, Sonnet 5's flat pricing can end up cheaper than Gemini Pro's tiered model even though Gemini's headline input price is lower, which is exactly the kind of detail that only surfaces when you model real workloads instead of reading the sticker.

The broader point this comparison illustrates is that the three labs are converging on capability and diverging on pricing strategy and availability. That divergence is the real competitive battleground in 2026, a dynamic we analyze in AI Market Power Consolidation: 2026 Analysis. For a buyer, it means the right model is increasingly a function of your specific workload shape (call volume, context length, capability threshold) rather than a single "best model" verdict.

9. Where Sonnet 5 Sits in the Claude Lineup

Zooming out from the cross-vendor fight, Sonnet 5's clearest role is within Anthropic's own ladder, and understanding that ladder is the key to using it well. Anthropic now runs a four-tier structure where each step roughly doubles or halves the price, and capability rises with price but with diminishing returns. The first-principles way to think about it is as a cost-of-intelligence curve: you pick the cheapest tier that clears your task's difficulty threshold, and you only pay for the next tier when the cheaper one fails too often.

At the bottom sits Haiku 4.5 at $1/$5, the speed and cost floor, with a smaller 200k context and 64k output but genuinely useful capability (73.3% SWE-bench Verified) for high-volume, lower-complexity work. Above it, Sonnet 5 at $3/$15 standard ($2/$10 intro) is the new workhorse, with full 1M context, 128k output, and near-Opus agentic capability. Above that, Opus 4.8 at $5/$25 is the GA capability ceiling, leading on raw coding (88.6% SWE-bench Verified) and reasoning (93.6% GPQA Diamond). At the very top, when available, Fable 5 at $10/$50 is the Mythos-class frontier model, topping independent leaderboards but carrying restrictions.

The decision rule that falls out of this structure is a routing strategy, not a single-model bet. For the bulk of agent steps (file edits, tool calls, retrieval, routine reasoning) Sonnet 5 is now the right default, because it clears most task thresholds at a fraction of Opus's cost. Escalate to Opus 4.8 only for the steps that genuinely need it: the hardest debugging, the most subtle multi-file refactors, the reasoning tasks where Sonnet 5's lack of a published GPQA score hints at a real ceiling. Drop to Haiku 4.5 for the high-volume, low-stakes calls where speed and cost dominate. This tiered routing is how sophisticated teams actually run Claude in production, and it is a core pattern in our Claude Agent SDK deep dive.

Fable 5 deserves a specific caveat because it scrambles the top of the ladder. It went generally available on June 9, 2026, topping independent leaderboards (95.0% SWE-bench Verified on Vals AI, the number one slot), but multiple secondary reports say it was suspended around June 12 under a US export-control directive. That suspension is reported only by secondary sources, not confirmed against a primary document, so its live status is genuinely uncertain. The practical consequence is that for most teams the real top of the usable Claude ladder is Opus 4.8, which makes Sonnet 5's near-Opus positioning even more valuable: it is the cheap step right below the highest tier you can reliably call. For founders using this capability to build entire products, our Claude Fable 5 guide covers what the frontier tier unlocks when it is available.

10. The Benchmark Honesty Problem

This guide has flagged benchmark caveats throughout, but they deserve a dedicated section because the single biggest risk in evaluating any 2026 model is comparing numbers that are not comparable. The structural cause is that benchmarks saturate. Once every frontier model scores 90%+ on a test, the test stops discriminating, so labs move to harder variants, and for a transition period the field is full of old-variant and new-variant numbers that look comparable but are not. Sonnet 5's launch is a textbook case.

The clearest example is SWE-bench. Anthropic led with SWE-bench Pro (63.2%), a hard variant, while OpenAI and Google led with SWE-bench Verified (80%+), an easier and more saturated one. A naive reader sees 63.2% next to 80.6% and concludes Sonnet 5 is far behind GPT-5.5 or Gemini at coding. The truth is the opposite on a like-for-like basis: on SWE-bench Pro, where all three are measured the same way, Sonnet 5 (63.2%) beats GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%). The variant, not the model, drove the apparent gap. This is why the official SWE-bench leaderboard and harness details matter more than any single screenshotted number.

Three more traps recur across the field, and each one has bitten published coverage of this launch. The first is version drift: Terminal-Bench 2.0 and 2.1 are different tests, so GPT-5.5's headline 84.7% on 2.0 and Sonnet 5's 80.4% on 2.1 are not directly comparable. The second is harness sensitivity: GPT-5.5 scores 83.4% on Terminal-Bench 2.1 via the Codex CLI but only 78.2% via the Terminus 2 harness, the same model on the same test with a five-point swing driven by the scaffold alone. The third is the vendor-versus-independent gap: GPT-5.5's SWE-bench Verified is 88.7% by OpenAI's own run but 82.6% on the independent Vals AI run. Any one of these three can manufacture or erase an apparent ranking, which is why a number without its version, harness, and source is closer to noise than signal.

The practical defense is a discipline, not a tool: before comparing two benchmark numbers, confirm they share the same benchmark name, the same version, the same harness, and the same source type. If any of those four differ, the comparison is invalid until you normalize it. Most published model comparisons fail this test, which is why a screenshot of a leaderboard is rarely a verdict. The fabricated "82.1%" and "92.4% SWE-bench Verified" figures for Sonnet 5 are the extreme case: numbers that exist in no primary source at all, propagated because they sounded plausible and nobody checked the system card.

The deepest honesty point is about novelty itself. Sonnet 5 launched on June 30, 2026, which means as of this writing it appears on no independent leaderboard yet: not LMArena (snapshot June 25), not the independent Vals SWE-bench run (June 17), not the official Terminal-Bench board (June 17), not the Artificial Analysis Intelligence Index. Every Sonnet 5 capability number in circulation is vendor-reported, with the narrow exceptions of the GDPval Artificial Analysis run, the CursorBench result, and the Harvey legal set. That does not make the numbers wrong, but it means they are unaudited, and the right posture toward a day-one model is informed provisionality. We built our AI Model Benchmarks & Pricing tracker precisely to track when independent numbers catch up to vendor claims.

11. How to Access and Use Sonnet 5

Getting Sonnet 5 running is deliberately frictionless, which is part of the strategy, and the access paths split cleanly by user type. For most people the model is simply already there: as the default on the Free and Pro plans of claude.ai and inside Claude Code, the majority of consumer and coding interactions now route to Sonnet 5 without anyone choosing it. For those users, the cost lever is not per-token pricing at all but plan price and usage limits, a distinction that trips up a lot of cost analysis. The per-token economics in this guide apply to API and agent builders, not to someone chatting on a Pro subscription.

For developers, the model is exposed through the API with the identifier claude-sonnet-5, and it is available across every major platform Anthropic ships on, a breadth that is itself a competitive advantage over the preview-gated GPT-5.6 and Gemini 3.5 Pro. You can call it on the Claude API directly, inside Claude Code where it is the default for terminal-native agentic coding, on Amazon Bedrock as anthropic.claude-sonnet-5 through the Messages API endpoint, via Google Cloud Vertex AI and Microsoft Foundry for enterprise cloud routing, and on Claude Platform on AWS, where usage bills in Claude Consumption Units at $0.01 each. The same model id and behavior carry across all of them, so you can develop against the first-party API and deploy on whichever cloud your procurement process requires without rewriting your integration.

The configuration details that affect both cost and behavior are worth knowing before you ship. Sonnet 5 runs with adaptive thinking always on and has no separate extended-thinking mode to toggle; the effort parameter defaults to high on the API and in Claude Code, and you set it explicitly to trade depth for speed and cost. There is no fast mode for Sonnet 5 (that is an Opus-only feature), and the maximum output is 128k tokens synchronously or up to 300k through the Batch API's extended-output beta header. Tool-use also carries a small system-prompt overhead, around 354 to 474 tokens depending on tool-choice settings, which matters when you are counting tokens at scale.

The single most important implementation decision is your caching strategy, because it dominates real cost. Structure your prompts so the stable parts (system instructions, tool definitions, retrieved context that persists across turns) sit at the front and are marked for caching, and the variable parts go at the end. With the 90% cache-read discount, an agent that reuses a large stable context across many steps can cut its effective input cost by an order of magnitude. This is not a Sonnet-5-specific trick, but it is where most of the savings in any real deployment come from, and it is covered in depth alongside the broader loop architecture in Building AI Agents: The 2026 Insider Guide. If you are weighing API billing against the subscription tiers for sustained coding use, our Claude Code pricing analysis lays out the break-even points.

For teams that want the capability without building the orchestration themselves, this is where managed agent platforms enter the picture as an alternative. Rather than wiring up model routing, caching, tool execution, and the long-running agent loop by hand, platforms like O-mega wrap models such as Sonnet 5 into a cloud workforce that builds and operates an entire autonomous company through one conversation, from the website and app to billing and content. It is one option among several for getting Sonnet 5's capability into production, and which path fits depends on whether you want to own the agent infrastructure or rent it; the build-versus-buy logic is the same one we explore for coding agents specifically in our coding agent frameworks benchmark.

12. Limitations, Safety, and the IPO Subtext

No honest benchmark guide ends on the wins, and Sonnet 5 has real limitations that the marketing glosses over. The most important is the one Anthropic itself stated: this model does not advance the capability frontier. It trails Opus 4.8 on every comparable agentic-coding measure, it has no published GPQA Diamond score (a conspicuous omission for a reasoning comparison), and it sits firmly below the Mythos-class Fable 5. Sonnet 5 is a value-tier upgrade, not a flagship replacement, and treating it as the latter will lead to disappointment on the hardest tasks.

Safety is a genuine trade-off, not a footnote, and Anthropic is transparent about it. The system card notes that Sonnet 5 "poses very low alignment risk, though higher than for previous Sonnet models," and the company's own misaligned-behavior evaluation scores it at 2.53 on a 1-to-10 scale (lower is better), better than Sonnet 4.6's 2.89 but worse than Opus 4.8's 2.10. The pattern is that the more agentic and capable a Sonnet model becomes, the more carefully its autonomy needs to be bounded in high-stakes settings.

On the dangerous-capability axis, Anthropic deliberately holds Sonnet 5 back, which is a feature rather than a defect. On a Firefox vulnerability exploit-development evaluation, Sonnet 5 produced a working exploit in 0% of attempts and achieved register control in only 13.2%, far below the restricted Mythos 5's 88-90%. For a model that ships as the default to tens of millions of users, that conservative cyber-capability profile is exactly what you want. A Lovable co-founder put the point well to TechCrunch: "a model that knows when to say no is just as important as one that knows how to build."

Practitioner reception was not uniformly glowing, and that skepticism belongs in an honest guide. On Hacker News, one developer captured the gap between demos and production output bluntly: "I'm cleaning up half vibed messes from my coworkers that demo'd well" - Hacker News. The lesson is that a higher benchmark score is not the same as higher real-world output quality, especially when the model is wired into an agent that runs unsupervised. Benchmarks measure capability under controlled conditions; production measures capability under messy ones, and the two diverge.

Finally, the IPO subtext, which a cost guide cannot ignore because it explains the pricing. The launch was widely read through the lens of Anthropic's confidential S-1 filing, with the framing that the company launched Sonnet 5 "at a steep discount to its top model as the company races toward a blockbuster IPO." A cheap, high-volume default tier is a revenue-density story, and the open question one analyst posed via VentureBeat is whether the cheap-but-high-volume Sonnet tier or the expensive-but-high-margin Opus tier ultimately drives profit. The valuation and revenue figures circulating in the press are unverified estimates, not numbers from the confidential filing, so treat them skeptically. What is verifiable is the strategy: make the capable mid-tier the default, price it to move volume, and let the frontier tiers carry the margin.

13. The Future Outlook for Cheap Near-Frontier Intelligence

Step back from the spec sheet and the real significance of Sonnet 5 comes into focus: it is a marker on a curve, not a destination. The structural force at work is that the cost of intelligence is collapsing faster than the frontier is advancing. The capability that required Opus-class pricing six months ago now ships at Sonnet pricing, and the capability that requires Opus pricing today will, on the current trajectory, ship at Sonnet pricing by year's end. That deflation, not any single benchmark, is the thing to build a strategy around.

The first-order consequence is that agentic workloads become economically viable at scale in ways that were impossible a year ago. When a long, 1M-token agent run drops from $10 on a flagship to under $4 on a capable mid-tier, the set of automations that pencil out expands dramatically. Tasks that were too marginal to automate (because the model cost exceeded the labor it replaced) cross the threshold. This is the same dynamic, viewed from the model layer, that we analyze from the market layer in The Big Pipe: How LLM Inference is Eating Software: cheap inference does not just lower bills, it changes what is worth building.

The second-order consequence is a shift in where value accrues. If the model layer is commoditizing on price, the durable advantage moves up the stack to whoever orchestrates these cheap, capable models into reliable end-to-end work. The labs win the input layer, but every dollar of cheap intelligence gets multiplied into many dollars of outcomes by the systems and companies that apply it to real workflows with domain knowledge and reliability engineering. That is the offensive frame, and it is why the most interesting work in 2026 is not training models but composing them, a theme we develop in Self-Improving AI Agents: The 2026 Guide.

The competitive trajectory among the labs reinforces this. With GPT-5.6 Terra promising GPT-5.5 quality at half the price, Gemini 3.5 Flash already undercutting everyone on throughput cost, and Sonnet 5 holding the near-flagship-at-mid-price position, the three labs are racing each other down the cost curve rather than only up the capability curve. For buyers this is unambiguously good: the price of a unit of agentic work will keep falling, and the right move is to architect for model portability so you can ride each new price drop without re-platforming. Anthropic founder and AI-agent researcher perspectives aside, the structural reality is that no single model stays optimal for long.

It is worth naming the contrarian risk too, because a one-sided outlook is a dishonest one. The deflation could stall if frontier training costs force prices back up, if regulation (the kind that already gated GPT-5.6 and reportedly suspended Fable 5) fragments availability, or if the reward-hacking and reliability problems visible in the newest models prove harder to fix than to demonstrate. Cheap intelligence is only useful if it is trustworthy intelligence, and the gap between benchmark capability and production reliability, the one that Hacker News commenter was complaining about, is the real frontier now. The labs that close that gap, not just the capability gap, will win the next phase.

14. Conclusion: Which Model Should You Actually Run

After all the benchmarks and cost models, the decision framework is mercifully simple, because it follows directly from the cost-of-intelligence curve. The right model is the cheapest tier that reliably clears your task's difficulty threshold, and for the majority of agentic and coding workloads in mid-2026, that tier is now Claude Sonnet 5. It delivers near-Opus agentic capability (63.2% SWE-bench Pro, 81.2% OSWorld, a GDPval knowledge-work score statistically tied with the flagship) at a mid-tier price, with day-one availability everywhere. For the broad middle of the market, it is the new default, and the introductory $2/$10 pricing through August 31 makes the switch nearly free to test.

The honest exceptions are clear and worth stating plainly. If your dominant constraint is throughput cost and your tasks are relatively simple, Gemini 3.5 Flash is cheaper and faster, and you should run it. If you need the highest reliable capability and can absorb the price, Opus 4.8 remains the GA ceiling. If you need peak reasoning specifically, GPT-5.5 still leads on GPQA and AIME, and Gemini 3.1 Pro leads on the overall-text arena. And if you are tempted by the GPT-5.6 or Gemini 3.5 Pro previews, remember you cannot actually deploy them yet, which makes them irrelevant to a decision you have to make today. The mature approach is not to pick one model but to route: Sonnet 5 as the workhorse, Opus 4.8 as the escalation, Haiku 4.5 as the cheap floor, and a portable architecture so you can swap as prices keep falling.

A closing note on perspective. The most consequential fact about Sonnet 5 is not any benchmark; it is that near-flagship intelligence now costs mid-tier money, and that line keeps moving in the buyer's favor. Yuma Heymans (@yumahey), founder and CEO of O-mega and co-founder of the autonomous recruiter HeroHunt.ai, has argued that the leverage in this shift belongs not to whoever owns the cheapest model but to whoever turns cheap models into reliable autonomous work, the operating layer above the inference layer. Sonnet 5 is a powerful, honestly-priced input. What you build on top of it is where the value actually compounds, and that calculus is the one worth getting right.

This guide reflects the AI model landscape as of June 30, 2026, the day Claude Sonnet 5 launched. Pricing, benchmark scores, and model availability change frequently in this space, and many figures here are vendor-reported on a day-one basis before independent leaderboards have caught up. Verify current details against primary sources before making purchasing or deployment decisions.

Yuma Heymans

30 June 2026

•

54 min read

The honest, fully sourced breakdown of every Claude Sonnet 5 benchmark and what it actually costs to run.

What Claude Sonnet 5 Is, and Why It Shipped Today
The Master Benchmark and Cost Tables
The Pricing Breakdown: Intro, Standard, and the Tokenizer Trap
Effective Cost: Caching, Batch, and Real Dollar Scenarios
Coding and Agentic Performance
Reasoning, Knowledge Work, and Computer Use
Sonnet 5 vs the Latest GPT
Sonnet 5 vs the Latest Gemini
Where Sonnet 5 Sits in the Claude Lineup
The Benchmark Honesty Problem
How to Access and Use Sonnet 5
Limitations, Safety, and the IPO Subtext
The Future Outlook for Cheap Near-Frontier Intelligence
Conclusion: Which Model Should You Actually Run

The Scorecard at a Glance

#	Model	What It Is	Coding/Agentic (30%)	Price-Performance (30%)	Reasoning (20%)	Availability (10%)	Speed/Context (10%)	Final
1	Claude Sonnet 5	Cheapest near-flagship agent model	8 - SWE-bench Pro 63.2%, Terminal-Bench 2.1 80.4%	9 - $2/$10 intro, near-Opus at one-fifth Opus price	8 - HLE 43.2/57.4, GDPval 1618 Elo	10 - GA day one, default on Free + Pro, in Claude Code	8 - "Fast", 1M context, 128k output	8.5
2	Gemini 3.5 Flash	Fastest cheap near-Pro model	7 - SWE-bench Pro 55.1%, Terminal-Bench 2.1 76.2%	10 - $1.50/$9, ~$1.31 AA-blended, the value leader	7 - HLE 40.2%, AA Index 50	9 - GA, Gemini app + API + Vertex	10 - 175 tok/s, 1M context	8.4
3	Claude Opus 4.8	Anthropic's GA capability ceiling	9 - SWE-bench Verified 88.6%, Pro 69.2%	7 - $5/$25, ~$3.85 blended (AA)	9 - GPQA 93.6%, GDPval 1890 Elo	10 - GA on every platform	5 - 65 tok/s, 1M context	8.1
4	Gemini 3.1 Pro	Google's reasoning-heavy Pro tier	7 - SWE-bench Verified 80.6%, Pro 54.2%	8 - $2/$12 (<=200k), ~$1.74 blended	8 - GPQA 94.3%, HLE 44.4%	7 - Preview / Pre-GA, widely usable	9 - 138 tok/s, 1M context	7.7
5	GPT-5.5	OpenAI's current GA flagship	8 - SWE-bench Pro 58.6%, Verified 88.7% vendor / 82.6% indep	6 - $5/$30, ~$4.35 blended (AA flags "expensive")	9 - GPQA 94.0%, AIME 100%	10 - GA in ChatGPT + API + Codex	6 - 79.8 tok/s, 1M context	7.6
6	Claude Fable 5	Most capable Claude, now restricted	10 - SWE-bench Verified 95.0%, Terminal-Bench 2.1 83.1%	5 - $10/$50, most expensive GA tier	10 - HLE 59.0/64.5, GDPval 1932 Elo	3 - GA June 9 then suspended ~June 12	5 - top-tier latency, 1M context	7.3
7	Claude Haiku 4.5	The speed and cost floor	5 - SWE-bench Verified 73.3%, Pro 39.5%	9 - $1/$5, cheapest current Claude	5 - near-frontier but limited public evals	9 - GA everywhere	8 - fastest Claude, 200k context	6.9
8	Claude Sonnet 4.6	The model Sonnet 5 replaces	6 - SWE-bench Verified 79.6%, Pro 58.1%	6 - $3/$15, same price, now superseded	6 - GPQA 74.1%, HLE 34.6/46.8	7 - legacy but still callable	7 - fast, 1M context	6.2

1. What Claude Sonnet 5 Is, and Why It Shipped Today

2. The Master Benchmark and Cost Tables

Model	Vendor	Status	Input $/MTok	Output $/MTok	Cache read $/MTok	Batch in/out $/MTok	Context
Claude Sonnet 5	Anthropic	GA	$2 intro / $3 std	$10 intro / $15 std	$0.20 / $0.30	$1 / $5 intro	1M
Claude Opus 4.8	Anthropic	GA	$5	$25	$0.50	$2.50 / $12.50	1M
Claude Haiku 4.5	Anthropic	GA	$1	$5	$0.10	$0.50 / $2.50	200k
Claude Fable 5	Anthropic	Suspended	$10	$50	$1.00	$5 / $25	1M
Claude Sonnet 4.6	Anthropic	Legacy	$3	$15	$0.30	$1.50 / $7.50	1M
GPT-5.5	OpenAI	GA	$5	$30	$0.50	$2.50 / $15	1M
GPT-5.6 Sol	OpenAI	Preview	$5	$30	$0.50	n/p	~1.5M (unconf.)
GPT-5.6 Terra	OpenAI	Preview	$2.50	$15	$0.25	n/p	~1.5M (unconf.)
Gemini 3.5 Flash	Google	GA	$1.50	$9	$0.15	$0.75 / $4.50	1M
Gemini 3.1 Pro	Google	Preview	$2 / $4 (>200k)	$12 / $18 (>200k)	$0.20 / $0.40	$1 / $6	1M
Gemini 3.5 Pro	Google	Preview	n/p	n/p	n/p	n/p	n/p

Model	SWE-bench Verified	SWE-bench Pro	Terminal-Bench	GPQA Diamond	HLE (no tools)	OSWorld computer use	Independent index
Claude Sonnet 5	85.2% (card body)	63.2%	80.4% (2.1)	n/p	43.2%	81.2%	too new (GDPval 1609 Elo)
Claude Opus 4.8	88.6%	69.2%	78.9% (2.1, indep)	93.6%	49.8%	83.4%	AA Index 56
Claude Fable 5	95.0% (indep)	80.3%	83.1% (2.1, indep)	n/p	59.0%	85.0%	AA Index 60 (#1)
Claude Haiku 4.5	73.3%	39.5%	41.8% (Terminus 2)	n/p	n/p	50.7%	n/p
Claude Sonnet 4.6	79.6%	58.1%	59.1% (2.0)	74.1%	34.6%	72.5%	LMArena 1472
GPT-5.5	88.7% vendor / 82.6% indep	58.6%	83.4% (2.1, Codex)	94.0%	41.4%	78.7%	AA Index 55
GPT-5.6 Sol	n/p	n/p	88.8% / 91.9% Ultra (2.1)	n/p	n/p	n/p	METR 11.3h horizon
GPT-5.6 Terra	n/p	n/p	82.5-84.3% (2.1)	n/p	n/p	n/p	n/p
Gemini 3.5 Flash	78.8% (indep only)	55.1%	76.2% (2.1)	n/p	40.2%	78.4%	AA Index 50
Gemini 3.1 Pro	80.6%	54.2%	70.7% (2.1, indep)	94.3%	44.4%	n/p	AA Index 46 / LMArena 1486
Gemini 3.5 Pro	n/p	n/p	n/p	n/p	n/p	n/p	n/p

3. The Pricing Breakdown: Intro, Standard, and the Tokenizer Trap

Prompt cache reads cost 0.1x the base input rate, a 90% discount, so $0.30 per million standard or $0.20 intro.
Cache writes cost 1.25x base input for the 5-minute time-to-live, or 2x for the 1-hour TTL.
The Batch API takes a flat 50% off both input and output and stacks with caching.
US data residency (inference_geo="us") applies a 1.1x multiplier across all token categories.
Fast mode is not available for Sonnet 5; it exists only on the Opus tier.

Claude Sonnet 5: Benchmarks and Cost Breakdown

Contents

The Scorecard at a Glance

1. What Claude Sonnet 5 Is, and Why It Shipped Today

2. The Master Benchmark and Cost Tables

3. The Pricing Breakdown: Intro, Standard, and the Tokenizer Trap

4. Effective Cost: Caching, Batch, and Real Dollar Scenarios

5. Coding and Agentic Performance

6. Reasoning, Knowledge Work, and Computer Use

7. Sonnet 5 vs the Latest GPT

8. Sonnet 5 vs the Latest Gemini

9. Where Sonnet 5 Sits in the Claude Lineup

10. The Benchmark Honesty Problem

11. How to Access and Use Sonnet 5

12. Limitations, Safety, and the IPO Subtext

13. The Future Outlook for Cheap Near-Frontier Intelligence

14. Conclusion: Which Model Should You Actually Run

Claude Sonnet 5: Benchmarks and Cost Breakdown

Contents

The Scorecard at a Glance

1. What Claude Sonnet 5 Is, and Why It Shipped Today

2. The Master Benchmark and Cost Tables

3. The Pricing Breakdown: Intro, Standard, and the Tokenizer Trap

4. Effective Cost: Caching, Batch, and Real Dollar Scenarios

5. Coding and Agentic Performance

6. Reasoning, Knowledge Work, and Computer Use

7. Sonnet 5 vs the Latest GPT

8. Sonnet 5 vs the Latest Gemini

9. Where Sonnet 5 Sits in the Claude Lineup

10. The Benchmark Honesty Problem

11. How to Access and Use Sonnet 5

12. Limitations, Safety, and the IPO Subtext

13. The Future Outlook for Cheap Near-Frontier Intelligence

14. Conclusion: Which Model Should You Actually Run