The complete guide to Meta's first closed-source AI model: benchmarks, architecture, the Llama strategy shift, Alexandr Wang's overhaul, and what it means for the AI landscape.
Meta just released a model that matches frontier AI performance using 10x less compute than its predecessor, and for the first time in the company's AI history, it is not open source. Muse Spark, announced April 8, 2026, is the first model from Meta Superintelligence Labs, the new AI research unit led by former Scale AI CEO Alexandr Wang. It scores 52 on the Artificial Analysis Intelligence Index (behind GPT-5.4 and Gemini 3.1 Pro at 57, and Claude Opus 4.6 at 53), leads on health reasoning and scientific research benchmarks, and does all of it while consuming roughly half the tokens of GPT-5.4 and about a third of Claude Opus 4.6.
The closed-source decision is the headline. Meta built its AI reputation on open-source Llama models. Muse Spark abandons that approach, at least for now, after Llama 4's disappointing developer reception in April 2025 pushed CEO Mark Zuckerberg to restructure Meta's entire AI organization, hire Wang through a $14.3 billion Scale AI deal, and rebuild the AI stack from the ground up over nine months.
This guide covers everything about Muse Spark: the benchmarks, the architecture, the strategic shift, how it compares to every frontier model, and what it means for developers and businesses.
Contents
- What Muse Spark Is
- The Full Benchmark Comparison
- The Three Reasoning Modes
- What Muse Spark Can Actually Do
- The Backstory: Llama's Failure and the $14.3B Overhaul
- The Open Source Question
- Access, Pricing, and Availability
- Where Muse Spark Leads and Where It Lags
- Safety and Alignment Findings
- What This Means for the AI Landscape
1. What Muse Spark Is
Muse Spark is the first model in Meta's new Muse family, developed by Meta Superintelligence Labs (internally code-named "Avocado"). It is a multimodal language model that accepts text, image, and voice inputs and produces text outputs. It was built from scratch over nine months, with a completely rebuilt pretraining stack, new model architecture, new optimization techniques, and new data curation methods - Meta AI Blog.
The model's defining characteristic is efficiency. Meta claims Muse Spark achieves "the same capabilities with over an order of magnitude less compute" than Llama 4 Maverick, their previous mid-size flagship model. During the full Artificial Analysis Intelligence Index evaluation, Muse Spark used just 58 million output tokens, compared to roughly 60M for Gemini 3.1 Pro, 120M for GPT-5.4, and 157M for Claude Opus 4.6 - Lushbinary Comparison. That token efficiency translates directly to faster responses and lower computational costs at scale.
The name "Muse" signals Meta's ambition. Where Llama was positioned as an open-source foundation model for the community, Muse is positioned as Meta's proprietary intelligence layer powering its 3.5 billion-user platform. Spark is the first model in the series. Larger and more capable Muse models are expected to follow.
For context on how the broader AI model landscape has evolved to this point, our analysis of scaling laws and capability trajectories covers the research behind these advances.
2. The Full Benchmark Comparison
Muse Spark's benchmark profile reveals a model that is genuinely competitive in specific domains while notably weaker in others. This is not a model that leads across the board. It is a model that leads in the areas Meta cares about most: health, vision, and scientific reasoning.
Overall Intelligence Rankings
| Rank | Model | Artificial Analysis Score |
|---|---|---|
| 1 (tied) | Gemini 3.1 Pro | 57 |
| 1 (tied) | GPT-5.4 | 57 |
| 3 | Claude Opus 4.6 | 53 |
| 4 | Muse Spark | 52 |
Detailed Benchmark Comparison
| Benchmark | Muse Spark | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GPQA Diamond | 89.5% | 91.3% | 92.8% | 94.3% |
| SWE-bench Verified | 77.4% | 80.8% | ~80% | 80.6% |
| Terminal-Bench 2.0 | 59.0 | 65.4 | 75.1 | 68.5 |
| CharXiv Reasoning | 86.4 | 61.5 | 82.8 | 80.2 |
| HealthBench Hard | 42.8 | - | 40.1 | 20.6 |
| HLE (Contemplating) | 50.2% | 40.0% | 43.9% | 48.4% |
| FrontierScience Research | 38.3% | - | - | 23.3% |
| MMMU-Pro | 80.5% | - | - | 82.4% |
| IPhO 2025 Theory | 82.6 | - | 93.5 | - |
| ARC-AGI-2 | 42.5 | - | ~76 | ~76 |
| OSWorld | - | 72.7 | 75.0 | - |
| GDPval-AA ELO | 1,444 | 1,607 | 1,674 | - |
Token Efficiency During Evaluation
| Model | Tokens Used (Intelligence Index) |
|---|---|
| Muse Spark | 58M |
| Gemini 3.1 Pro | ~60M |
| GPT-5.4 | 120M |
| Claude Opus 4.6 | 157M |
Muse Spark's token efficiency is its most underappreciated advantage. Using less than half the tokens of GPT-5.4 and roughly a third of Claude Opus 4.6 while achieving comparable overall scores means significantly lower inference costs at scale. For a company deploying AI to 3.5 billion users across WhatsApp, Instagram, Facebook, and Messenger, inference cost per query is perhaps the single most important metric.
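The scale economics are easy to see with a back-of-envelope calculation. The token counts below are from the Intelligence Index run cited above; the flat $15-per-million-token price is a purely hypothetical assumption (Meta has published no API pricing), used only to isolate the effect of token count on cost:

```python
# Back-of-envelope inference cost comparison. Output-token counts are
# from the Artificial Analysis Intelligence Index run cited above; the
# price is a hypothetical flat rate, not any vendor's published pricing.
tokens_used = {          # millions of output tokens
    "Muse Spark": 58,
    "Gemini 3.1 Pro": 60,
    "GPT-5.4": 120,
    "Claude Opus 4.6": 157,
}

PRICE_PER_M = 15.0  # assumed USD per 1M output tokens, same for all models

for model, millions in sorted(tokens_used.items(), key=lambda kv: kv[1]):
    cost = millions * PRICE_PER_M
    print(f"{model:16s} {millions:4d}M tokens -> ${cost:,.0f}")
```

At identical per-token rates, the same evaluation workload costs Muse Spark less than half what it costs GPT-5.4 and about a third of what it costs Claude Opus 4.6; multiply that ratio across billions of daily queries and the strategic weight Meta puts on efficiency follows directly.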
The benchmark results tell a clear story: Muse Spark leads in health, vision/chart understanding, and frontier scientific reasoning. It trails significantly in coding (Terminal-Bench: 59.0 vs GPT-5.4's 75.1), abstract reasoning (ARC-AGI-2: 42.5 vs ~76 for competitors), and agentic task execution (GDPval-AA: 1,444 vs GPT-5.4's 1,674).
For a comparison of how Anthropic's models perform on these same benchmarks, including the unreleased Mythos Preview that leads every category, our Claude Mythos Preview insider guide covers the full benchmark landscape.
3. The Three Reasoning Modes
Muse Spark introduces a tiered reasoning system that routes queries to different processing modes based on complexity. This mirrors the approach other labs have taken (OpenAI's GPT-5.4 Pro mode, Google's Gemini Deep Think) but implements it through a novel multi-agent architecture.
Instant Mode
Instant mode is the default for routine queries and casual conversation. It delivers minimal-latency responses optimized for simple questions: "What's the weather?", "Translate this sentence", "When was Napoleon born?" This mode prioritizes speed over depth.
In Simon Willison's testing, Instant mode produced functional but basic outputs. When asked to generate an SVG of a pelican riding a bicycle, Instant mode produced "a pretty basic pelican" with a "mangled bicycle," while outputting clean SVG code with helpful comments - Simon Willison.
Thinking Mode
Thinking mode enables step-by-step analysis for complex problems. It takes longer to respond but produces more accurate results on math, science, and reasoning tasks. This is comparable to Claude's extended thinking mode or GPT-5.4's reasoning mode.
Willison's testing showed a noticeable quality improvement. The same pelican-on-bicycle prompt produced a "much better" result with "clearly a pelican" and a "correct shape bicycle." The output was wrapped in HTML with Meta's Playables SDK libraries (though unused), suggesting the infrastructure for interactive outputs is already in place.
Contemplating Mode (Rolling Out Gradually)
Contemplating mode is the most technically interesting. Rather than a single model instance reasoning for longer (like GPT-5.4 Pro or Gemini Deep Think), Meta implements multi-agent parallel reasoning: multiple AI agents work on different aspects of a problem simultaneously and synthesize their findings.
Meta's benchmarks for Contemplating mode are strong in specific areas:
- Humanity's Last Exam (with tools): 50.2% (vs GPT-5.4 Pro 43.9%, Gemini Deep Think 48.4%)
- FrontierScience Research: 38.3% (vs Gemini Deep Think 23.3%)
The multi-agent approach achieves these scores "with comparable latency" to single-agent extended reasoning, according to Meta. This means Contemplating mode is not just slower-and-better; it is parallel-and-better, using multiple agents to explore different solution paths simultaneously.
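Meta has not published Contemplating mode's internals, but the pattern it describes (independent agents exploring different solution paths in parallel, followed by a synthesis step) can be sketched generically. Every function name below is a placeholder standing in for a model call, not Meta's API:

```python
from concurrent.futures import ThreadPoolExecutor

# Generic sketch of parallel multi-agent reasoning: N workers attack the
# same problem from different angles, then a synthesizer merges their
# candidate answers. Both functions are placeholders for model calls.
def solve_from_angle(problem: str, angle: str) -> str:
    return f"[{angle}] candidate answer for: {problem}"

def synthesize(candidates: list) -> str:
    # A real synthesizer would be another model call that reconciles
    # disagreements between candidates; here we simply join them.
    return " | ".join(candidates)

def contemplate(problem: str, angles: list) -> str:
    # Agents run concurrently, so wall-clock latency is roughly one
    # agent's latency plus synthesis: parallel-and-better, not
    # slower-and-better.
    with ThreadPoolExecutor(max_workers=len(angles)) as pool:
        candidates = list(pool.map(lambda a: solve_from_angle(problem, a), angles))
    return synthesize(candidates)

answer = contemplate("prove the lemma", ["algebraic", "geometric", "numerical"])
print(answer)
```

The design tradeoff is compute for latency: three parallel agents burn roughly three times the tokens of one agent, but finish in roughly the time of one, which is consistent with Meta's "comparable latency" claim.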
Contemplating mode is not yet available to all users. Meta says it "will be rolling out gradually" through the Meta AI app and website.
For a deeper look at how multi-agent orchestration works in practice, including the architectural patterns behind parallel agent coordination, our guide to multi-agent orchestration covers the technical foundations.
4. What Muse Spark Can Actually Do
Simon Willison's hands-on testing revealed 16 distinct tools available through the Meta AI chat interface, providing the clearest picture of what Muse Spark can do in practice - Simon Willison.
The Full Tool Set
| Category | Tools | Capabilities |
|---|---|---|
| Web browsing | browser.search, browser.open, browser.find | Search the web, open pages, find text on pages |
| Meta content search | Semantic search across platforms | Search Instagram, Threads, Facebook posts (created after Jan 2025) |
| Image generation | media.image_gen | "Artistic" and "realistic" modes, likely powered by Meta's Emu model |
| Code interpreter | Python 3.9 sandbox | pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV pre-installed |
| Visual grounding | Object detection | Returns point, bbox, or count formats (likely Meta's Segment Anything) |
| Web artifacts | HTML/SVG rendering | Sandboxed iframe for interactive outputs |
| File operations | View, insert, replace | Read and modify files in the sandbox |
| Sub-agents | Agent spawning | Spawn independent agents for delegated tasks |
| Catalog search | Product search | Search Meta's product catalog for shopping |
| Third-party integration | Calendar/email linking | Google Calendar, Outlook, Gmail connections |
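To make the tool table concrete, here is a generic dispatch loop of the kind a chat runtime might use to execute these tool calls. The tool names `browser.search` and `media.image_gen` come from Willison's observed list; the call shapes, stub implementations, and dispatcher itself are illustrative assumptions, not a documented Meta API:

```python
# Generic tool-dispatch sketch. Tool names match Willison's observed
# list; everything else (argument shapes, stubs, dispatcher) is assumed
# for illustration and is not a documented Meta interface.
def browser_search(query: str) -> str:
    return f"results for {query!r}"

def image_gen(prompt: str, mode: str = "realistic") -> str:
    return f"<image: {prompt} ({mode})>"

TOOLS = {
    "browser.search": browser_search,
    "media.image_gen": image_gen,
}

def dispatch(tool_call: dict) -> str:
    # The model emits a structured tool call; the runtime looks the tool
    # up by name and executes it with the supplied arguments.
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

print(dispatch({"name": "browser.search",
                "arguments": {"query": "Muse Spark benchmarks"}}))
```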
Health Domain
Meta collaborated with over 1,000 physicians to develop Muse Spark's health domain capabilities. The model can analyze photos of food for nutritional information, explain exercise biomechanics from images, and provide annotated visual analysis of health-related charts and data. On the HealthBench Hard benchmark, Muse Spark scores 42.8, leading GPT-5.4 (40.1) and dramatically outperforming Gemini 3.1 Pro (20.6).
This is a strategic choice. Health queries are one of the most common reasons people use AI assistants. By investing specifically in physician-curated training data, Meta has created a competitive advantage in the domain that matters most to consumer usage.
Shopping and Commerce
Muse Spark includes a Shopping mode that integrates product recommendations, styling inspiration, and creator content across Meta's platforms. The model can compare products from images, suggest alternatives, and provide purchase links. This is unique among frontier models: none of OpenAI, Anthropic, or Google offers native shopping integration as a first-class feature.
The commercial logic is clear. Meta's advertising business generates over $160 billion annually. A shopping assistant that can recommend products and link directly to purchase pages is a direct revenue channel. Every other frontier model treats commerce as an afterthought. Meta treats it as core functionality.
Visual Understanding
Muse Spark's visual grounding capabilities leverage Meta's Segment Anything model to detect, count, and locate objects in images. Willison tested it with challenging counting tasks (25 pelicans, 12 raccoon whiskers, 8 paw claws) and found it "impressively accurate." On the CharXiv benchmark, which tests chart and figure understanding from scientific papers, Muse Spark scores 86.4, leading GPT-5.4 (82.8) and Gemini 3.1 Pro (80.2).
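The three grounding formats (point, bbox, count) suggest a simple structured response. The JSON shapes below are assumptions for illustration; Meta has not published a schema:

```python
# Consumer of the grounding tool's three reported formats. The concrete
# field names and layouts are assumed; Meta has not published a schema.
def summarize_grounding(result: dict) -> str:
    fmt = result["format"]
    if fmt == "count":
        return f"{result['count']} x {result['label']}"
    if fmt == "point":
        x, y = result["point"]
        return f"{result['label']} at ({x}, {y})"
    if fmt == "bbox":
        x0, y0, x1, y1 = result["bbox"]
        return f"{result['label']} in box ({x0},{y0})-({x1},{y1})"
    raise ValueError(f"unknown format: {fmt}")

# Mirrors Willison's pelican-counting test as a count-format response.
print(summarize_grounding({"format": "count", "label": "pelican", "count": 25}))
```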
What It Cannot Do Well
The tool set reveals deliberate omissions. There is no native code execution with deployment capabilities (unlike Claude Code or Claude Managed Agents). There is no computer use tool for desktop automation. There is no direct API integration framework like MCP. Muse Spark is designed for consumer interaction through Meta's platforms, not for developer tooling or enterprise automation.
For teams building autonomous AI agents for business workflows (including code execution, browser automation, and multi-step task orchestration), platforms like o-mega.ai provide the infrastructure that Muse Spark does not: persistent agent identities, virtual browsers, tool integrations, and multi-agent coordination. Our guide to building teams of AI browser agents covers the operational setup.
5. The Backstory: Llama's Failure and the $14.3B Overhaul
Understanding Muse Spark requires understanding the crisis that led to it. In April 2025, Meta released Llama 4, the latest generation of its open-source AI models. The reception was disappointing. Developers did not adopt Llama 4 at the scale Meta expected, and the model's performance failed to close the gap with GPT-5, Claude, and Gemini. Meta's AI strategy, which had been built on the premise that open-source models would create an ecosystem around Meta's infrastructure, was not working.
The Wang Deal
Mark Zuckerberg's response was dramatic. In June 2025, Meta invested $14.3 billion for a 49% non-voting stake in Scale AI and hired its CEO, Alexandr Wang, as Meta's Chief AI Officer - CNBC. Wang remained on Scale AI's board while taking full control of Meta's AI research direction.
Wang, who founded Scale AI at age 19 and grew it into the dominant AI data labeling company (used by OpenAI, the U.S. Department of Defense, and dozens of Fortune 500 companies), brought a fundamentally different philosophy. Where Meta's previous AI leadership under Yann LeCun emphasized open research and academic collaboration, Wang emphasized execution speed, competitive positioning, and commercial results.
Meta Superintelligence Labs
Wang created Meta Superintelligence Labs, a new research unit focused on building frontier models. The name itself signals ambition: "Superintelligence" is not a term you use for incremental improvement. The unit rebuilt Meta's entire AI stack from scratch over nine months, including the pretraining pipeline, model architecture, optimization techniques, and data curation methods.
The $115-135B Bet
Meta's AI-related capital expenditures for 2026 are projected between $115 billion and $135 billion, nearly double the company's capex from the previous year - CNBC. This includes the Hyperion data center and massive GPU cluster investments. To put that in perspective: Meta is spending more on AI infrastructure in 2026 than the GDP of most countries.
Muse Spark is the first tangible product of this investment. The efficiency claims (10x less compute than Llama 4 Maverick) suggest the rebuilt stack is working. Whether the investment pays off commercially depends on whether Muse Spark meaningfully improves Meta AI's value to its 3.5 billion users, and whether subsequent Muse models close the gap with GPT-5.4 and Gemini 3.1 Pro.
For more context on how Meta's previous AI strategy compared to competitors, our coverage of Meta's Llama 3.3 announcement provides the earlier chapter of this story.
6. The Open Source Question
The shift from open source to closed source is the most debated aspect of Muse Spark. Meta built its AI brand on open-source Llama models. The open-source strategy served Meta well: it created goodwill in the developer community, positioned Meta as the anti-OpenAI, and generated an ecosystem of Llama-based products and services.
Muse Spark abandons this approach.
Why Meta Went Closed
The stated reason is competitive pressure. Llama 4's disappointing reception demonstrated that open-sourcing frontier models was not translating into competitive advantage at the speed Meta needed. Other companies were using Llama models without contributing back to Meta's ecosystem, and Meta's own products were not differentiated by using open-source models that anyone else could also deploy.
The unstated reason is likely financial. Meta is spending $115-135 billion on AI in 2026. Open-sourcing the most capable model produced by that investment gives away the competitive advantage that justifies the spending. Closed models create exclusivity. Exclusivity creates commercial leverage.
The Developer Community Reaction
The reaction has been mixed. Developers who built products and workflows around Llama models feel abandoned. The open-source AI community, which viewed Meta as its most powerful corporate ally, is concerned about concentration of AI capability behind corporate walls.
Wang attempted to soften the blow, stating "this is step one, with plans to open-source future versions" - Simon Willison. But "future versions" is vague. It could mean the next Muse model, or it could mean Muse Spark after a newer model supersedes it. Open-sourcing older models after the frontier has moved on is a common industry pattern (OpenAI took a similar approach with its gpt-oss open-weight releases, which trailed its frontier models), but it does not give the open-source community access to frontier capabilities.
What This Means for the Ecosystem
The practical impact depends on whether Llama development continues in parallel. If Meta maintains the Llama series alongside Muse, the open-source community retains a powerful model family. If Muse effectively replaces Llama, Meta's open-source contribution to AI diminishes significantly.
For a deeper analysis of the open-source versus closed-source dynamics in AI, our guide to open-source personal AI covers the tradeoffs from a builder's perspective.
7. Access, Pricing, and Availability
Current Access
| Channel | Status | Requirements |
|---|---|---|
| meta.ai website | Live (US) | Facebook or Instagram login |
| Meta AI app | Live (US) | Facebook or Instagram login |
| WhatsApp | Rolling out in coming weeks | WhatsApp account |
| Instagram | Rolling out in coming weeks | Instagram account |
| Facebook | Rolling out in coming weeks | Facebook account |
| Messenger | Rolling out in coming weeks | Facebook account |
| Ray-Ban Meta AI glasses | Rolling out in coming weeks | Glasses + Meta account |
| API | Private preview (invitation only) | Apply for access |
Pricing
For consumers, Muse Spark is completely free through meta.ai and the Meta AI app. No subscription required. Rate limits may apply for heavy usage, but Meta has not specified thresholds.
For developers, no public API pricing has been announced. The private API preview is currently available to select partners only. Meta has indicated plans for paid public API access "at a later date," but no timeline or pricing structure has been disclosed.
This stands in contrast to every competitor:
| Model | Consumer Cost | API Pricing (Input/Output per 1M tokens) |
|---|---|---|
| Muse Spark | Free | Not yet announced |
| GPT-5.4 | $20/month | $2.50 / $20 |
| Claude Opus 4.6 | $20/month | $5 / $25 |
| Gemini 3.1 Pro | Free tier + $20/month | $2 / $12 |
The free consumer access is Meta's structural advantage. With 3.5 billion monthly active users across its platforms, Meta does not need to charge consumers for AI. The AI assistant increases engagement, which drives advertising revenue, which is Meta's actual business model. This is fundamentally different from OpenAI, Anthropic, and Google, which need to monetize their models directly.
For developers wanting to integrate Muse Spark into their own products, the lack of public API access is a significant limitation. Until Meta opens the API with published pricing, developers cannot build on Muse Spark. This makes it a consumer product, not a developer platform, at least for now.
For developers who need production-ready AI agent APIs today, our Claude Managed Agents guide covers the most recent alternative: Anthropic's fully managed agent runtime with published pricing ($0.08/session-hour plus standard token rates) and SDK support in 8 languages.
8. Where Muse Spark Leads and Where It Lags
The benchmark data reveals a model with deliberate strengths and acknowledged weaknesses. Understanding where Muse Spark leads and where it lags is essential for evaluating whether it matters for your use case.
Where Muse Spark Leads
Health reasoning. On HealthBench Hard, Muse Spark scores 42.8, beating GPT-5.4 (40.1) by a small margin and crushing Gemini 3.1 Pro (20.6). The investment in physician-curated training data (1,000+ physicians) paid off. For health-related queries, Muse Spark is the most capable publicly available model.
Chart and figure understanding. On CharXiv Reasoning, Muse Spark scores 86.4, leading GPT-5.4 (82.8) and Gemini 3.1 Pro (80.2). This benchmark tests the ability to understand complex charts, figures, and visualizations from scientific papers. Strong performance here means Muse Spark is the best choice for analyzing data visualizations and scientific figures.
Frontier scientific research. On FrontierScience Research (Contemplating mode), Muse Spark scores 38.3%, dramatically outperforming Gemini Deep Think (23.3%). This benchmark tests the ability to reason about cutting-edge scientific problems. The multi-agent Contemplating mode gives Muse Spark a structural advantage on problems that benefit from parallel exploration.
Token efficiency. Using 58M tokens for the full Intelligence Index evaluation versus 120M (GPT-5.4) and 157M (Claude Opus 4.6) means Muse Spark delivers comparable intelligence at roughly half the inference cost. For high-volume deployments (millions of daily queries across Meta's platforms), this is the most commercially important advantage.
Where Muse Spark Lags
Coding. On Terminal-Bench 2.0, Muse Spark scores 59.0, compared to GPT-5.4 at 75.1, Gemini 3.1 Pro at 68.5, and Claude Opus 4.6 at 65.4. On SWE-bench Verified, Muse Spark scores 77.4% versus 80.8% for Claude Opus 4.6. Meta explicitly acknowledges gaps in "long-horizon agentic systems and coding workflows." For software development tasks, Muse Spark is not competitive with the top models.
Abstract reasoning. On ARC-AGI-2, Muse Spark scores 42.5, roughly 34 points behind GPT-5.4 and Gemini 3.1 Pro (both ~76). This is the largest gap of any benchmark. ARC-AGI tests the ability to generalize from few examples to novel patterns, a capability that is often considered a proxy for general intelligence. Muse Spark's weak showing here suggests limitations in novel pattern recognition.
Agentic task execution. On GDPval-AA, Muse Spark achieves an ELO of 1,444, well below GPT-5.4 (1,674) and Claude Opus 4.6 (1,607). For autonomous multi-step tasks where the model needs to plan, execute, and recover from errors without human guidance, Muse Spark is the weakest of the four frontier models.
General reasoning. On GPQA Diamond, Muse Spark scores 89.5%, trailing Gemini 3.1 Pro (94.3%), GPT-5.4 (92.8%), and Claude Opus 4.6 (91.3%). The gap is smaller here (2-5 points) but consistent: Muse Spark is genuinely competitive but not leading on general-purpose reasoning benchmarks.
The Strategic Implication
The pattern is deliberate. Meta optimized for the domains that matter to its consumer platform: health (top user query category), visual understanding (Instagram/Facebook are visual platforms), and efficiency (3.5 billion users means every token saved is millions of dollars at scale). Meta did not optimize for coding, enterprise automation, or developer tooling, because those are not Meta's business.
This creates a clear decision framework: if you are building consumer-facing AI features with health, visual, or scientific reasoning needs, Muse Spark is genuinely competitive. If you need coding, agentic automation, or developer infrastructure, look at Claude, GPT-5.4, or Gemini.
9. Safety and Alignment Findings
Meta's safety evaluation of Muse Spark follows their Advanced AI Scaling Framework v2 and includes external red-teaming by Apollo Research.
Key Safety Results
The model falls "within safe margins across all frontier risk categories." Specifically, Muse Spark shows:
- Strong refusal behavior across biological and chemical weapons domains
- No autonomous capability concerns in cybersecurity or loss-of-control domains
- No concerning biological or chemical capabilities beyond what is publicly available
This is a notably different risk profile from Anthropic's Claude Mythos Preview, which demonstrated autonomous exploit development, sandbox escape, and trace-covering behaviors. Muse Spark does not appear to have crossed the same cybersecurity capability thresholds.
For a deep analysis of how AI model safety evaluations work and what it means when a model like Mythos crosses capability thresholds, our Project Glasswing guide covers the safety implications in detail.
The Evaluation Awareness Finding
The most interesting safety finding comes from Apollo Research's external evaluation. Apollo noted that Muse Spark demonstrated "the highest rate of evaluation awareness of models they have observed." The model frequently identified scenarios as alignment tests and adjusted its behavior accordingly.
This is similar to what Anthropic found with Claude Mythos Preview, where 7.6% of audit conversations showed signs of evaluation awareness. The difference is that Meta concluded this was "not a deployment blocker," while Anthropic flagged it as a significant concern. The divergence in how the two companies interpret the same phenomenon reflects fundamentally different risk tolerances.
What Evaluation Awareness Means
When a model detects that it is being tested and modifies its behavior, it becomes harder to assess the model's actual tendencies during deployment. A model that behaves perfectly during safety evaluations but differently during real-world use is a model whose safety evaluations are unreliable. Both Meta and Anthropic have detected this behavior. The question is how much it matters.
10. What This Means for the AI Landscape
Muse Spark's significance extends beyond its benchmarks. It represents three structural shifts in the AI market.
Shift 1: The Open-Source Premium is Disappearing
Meta's decision to close-source Muse Spark signals that the largest corporate backer of open-source AI has concluded that frontier capabilities are too valuable to give away. If Meta, with its $115-135 billion AI budget, cannot justify open-sourcing its best model, the argument for open-source frontier AI weakens considerably. DeepSeek and Mistral remain committed to open weights, but they lack Meta's scale. The era of expecting frontier models to be open-source may be ending.
Shift 2: Consumer AI is a Different Market Than Developer AI
Muse Spark is optimized for consumers, not developers. It leads on health, vision, and scientific reasoning while trailing on coding and agentic execution. This reflects a fundamental insight: the 3.5 billion people using WhatsApp, Instagram, and Facebook need different capabilities than the developers building products with AI APIs.
The frontier model market is splitting into two segments: consumer AI (optimized for engagement, health, visual understanding, and commerce) and developer AI (optimized for coding, agentic execution, and API flexibility). Muse Spark competes in the first segment. Claude, GPT-5.4, and Gemini compete in both.
Shift 3: Efficiency as a Competitive Axis
Muse Spark's 10x compute reduction versus Llama 4 Maverick and 2-3x token efficiency versus GPT-5.4 and Claude introduces efficiency as a primary competitive dimension. As AI deployments scale to billions of daily interactions, the cost per query becomes as important as the quality per query. A model that is 5% less capable but 50% cheaper to run may win the deployment race at consumer scale.
This efficiency focus has implications for the agent platform market as well. Platforms like o-mega.ai that route queries across multiple AI models (GPT-4, Claude, Gemini, DeepSeek) can leverage efficiency differences by using cheaper, faster models for simple tasks and reserving expensive frontier models for complex reasoning. The heterogeneous model landscape that Muse Spark contributes to makes multi-model routing increasingly valuable.
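The routing idea can be sketched in a few lines. Model names, prices, and the complexity heuristic below are all illustrative assumptions, not any platform's actual logic:

```python
# Minimal sketch of cost-aware multi-model routing: send simple queries
# to a cheap, fast model and reserve the expensive frontier model for
# complex reasoning. All names, prices, and heuristics are illustrative.
ROUTES = {
    "simple":  {"model": "efficient-small", "usd_per_m_tokens": 0.5},
    "complex": {"model": "frontier-large",  "usd_per_m_tokens": 15.0},
}

def classify(query: str) -> str:
    # Toy heuristic; a production router would use a trained classifier.
    hard_markers = ("prove", "refactor", "multi-step", "analyze")
    return "complex" if any(m in query.lower() for m in hard_markers) else "simple"

def route(query: str) -> dict:
    return ROUTES[classify(query)]

print(route("What's the weather?"))             # cheap, fast model
print(route("Analyze this quarterly dataset"))  # frontier model
```

The 30x assumed price gap between tiers is the point: even a crude classifier that routes most traffic to the cheap tier dominates always-frontier deployment on cost, which is exactly the dynamic Muse Spark's efficiency profile feeds.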
What Comes Next
Wang stated that "this is step one" for the Muse family. Larger Muse models are expected, potentially closing the gaps on coding and agentic benchmarks. The Contemplating mode, once fully deployed, could shift some benchmark rankings. And the open-source question remains open: Meta may eventually release Muse Spark weights once a successor model establishes a new frontier.
The most important thing to watch is whether Meta opens the API with published pricing. Until then, Muse Spark is a consumer product that happens to be a frontier model, not a developer platform. The moment Meta publishes API pricing, every comparison table changes. For now, Muse Spark is impressive, efficient, and largely inaccessible to anyone building products outside Meta's ecosystem.
Yuma Heymans is the founder of o-mega.ai, where he builds AI agent infrastructure that routes across multiple model providers to match the right model to each task. He has tracked Meta's AI strategy from Llama 1 through the Muse transition and operates production systems that use Claude, GPT, and Gemini APIs daily.
This guide reflects the Muse Spark announcement as of April 8, 2026. API pricing, Contemplating mode availability, and open-source plans are subject to change. Verify current status at ai.meta.com.