The Complete Guide to Kimi, Moonshot AI, and the Rise of Chinese LLMs
On March 19, 2026, Cursor launched Composer 2.0. They called it "frontier-level coding intelligence." The blog post did not mention where the intelligence actually came from. Within hours, a developer found the internal model identifier: accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast. The base model was Kimi K2.5, built by an 80-person Chinese startup called Moonshot AI. One of the biggest AI coding tools in the world, with $2 billion in annual recurring revenue and over a million daily active users, had quietly built its flagship product on top of a Chinese open-source model.
This guide covers everything you need to know about Kimi, Moonshot AI, and the broader Chinese LLM landscape that produced it.
This guide is written by Yuma Heymans (@yumahey), founder of o-mega.ai, the AI workforce platform where autonomous agents learn to use business tool stacks and execute workflows.
Contents
- The Cursor Controversy: What Actually Happened
- Who Is Moonshot AI
- Yang Zhilin: The Researcher Behind Kimi
- The Kimi Model Lineage
- Kimi K2: One Trillion Parameters, Open Weights
- Kimi K2.5: Agent Swarms and Vision
- Benchmark Comparisons
- API Pricing Comparison
- Kimi's Consumer Product
- The Chinese LLM Landscape
- Head-to-Head: Kimi vs DeepSeek vs Qwen
- What This Means for the Industry
1. The Cursor Controversy: What Actually Happened
The timeline is straightforward.
March 19, 2026: Cursor (owned by Anysphere, San Francisco) launched Composer 2.0 through a blog post promoting it as their most capable coding model. Moonshot AI was not mentioned anywhere in the announcement.
Within hours: A developer using the handle @fynnso on X discovered the internal model identifier string embedded in the system: accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast. The "kimi-k2p5" portion made it immediately clear that Kimi K2.5 served as the base model.
March 22, 2026: The story hit TechCrunch, VentureBeat, and eWEEK. Cursor's VP of Developer Education, Lee Robinson, acknowledged it publicly: "Yep, Composer 2 started from an open-source base! Only ~1/4 of the compute spent on the final model came from the base, the rest is from our training."
Cursor co-founder Aman Sanger added: "It was a miss to not mention the Kimi base in our blog from the start. We'll fix that for the next model."
The license problem. Kimi K2.5 uses a Modified MIT License with a specific clause: if the derivative work is used in a commercial product exceeding 100 million monthly active users or generating more than $20 million per month in revenue, the licensee must "prominently display 'Kimi K2.5' on the user interface." Anysphere's $2 billion ARR translates to roughly $166 million per month, more than 8x the threshold. Moonshot AI stated that Cursor used the model "as part of an authorized commercial partnership" through Fireworks AI, suggesting a separate commercial arrangement existed. But the lack of upfront disclosure remained the central issue.
Community reaction was mixed. Some praised the quality of Chinese open-source models. Others questioned whether Cursor was primarily a "model routing layer" with a good UI rather than an independent AI research company. The political dimension was unavoidable: an American AI darling had built its core product on Chinese technology during a period of intense U.S.-China AI competition rhetoric.
The incident did more for Kimi's brand recognition outside China than any marketing campaign could have. It also raised fundamental questions about attribution, licensing compliance, and the actual origin of AI capabilities in commercial products.
2. Who Is Moonshot AI
Company name: Moonshot AI (Chinese: 月之暗面, literally "The Dark Side of the Moon," a tribute to Pink Floyd that reflects CEO Yang Zhilin's love of classic rock)
Founded: March 2023, Beijing, China
Founders: Yang Zhilin, Zhou Xinyu, and Wu Yuxin. All three are Tsinghua University alumni and former bandmates in a group called "Splay."
Mission: Build foundation models to achieve AGI.
Headcount: Approximately 80 employees.
Product: Kimi, both a consumer chatbot and an API platform.
Funding History
| Round | Date | Amount | Valuation | Key Investors |
|---|---|---|---|---|
| Seed | Early 2023 | $60M | $300M | Various |
| Series B | October 2023 | $274M | Undisclosed | Various |
| Series B Extension | February 2024 | $1B | $2.5B | Alibaba |
| Series B Further | August 2024 | $300M | $3.3B | Tencent, Gaorong Capital |
| Series C | January 2026 | $500M | $4.3B | IDG Capital, Alibaba, Tencent |
| New Round | February 2026 | $700M+ | $10B+ | Alibaba, Tencent, 5Y Capital, Cathay Capital |
| Latest | March 2026 | Undisclosed | $18B | Existing investors |
Total raised: More than $2.8 billion in disclosed funding across the rounds above, from 8+ investors.
Moonshot AI became the fastest Chinese company to reach decacorn status (valuation exceeding $10 billion), hitting the milestone in under three years. ByteDance took over four years. Pinduoduo took over three.
Revenue: $240 million reported by November 2025. After the Kimi K2.5 launch, the company reported that cumulative revenue in fewer than 20 days already exceeded the entire 2025 annual total.
An 80-person company generating hundreds of millions in revenue and valued at $18 billion. Those are exceptional numbers for any AI company, let alone one operating from Beijing.
3. Yang Zhilin: The Researcher Behind Kimi
Yang Zhilin is one of the most technically credentialed founders in the AI industry, on either side of the Pacific.
Born: 1992 in China.
Education: BSc in Computer Science from Tsinghua University, graduating top of his class in 2015. He reportedly had no programming experience before secondary school, yet went on to win first prize at the National Olympiad in Informatics in Guangdong province, which earned him admission to Tsinghua.
PhD: Carnegie Mellon University, completed in 2019 after only four years (the standard timeline is six). His advisors were Ruslan Salakhutdinov (who later became Apple's head of AI research) and William Cohen (a principal scientist at Google DeepMind).
Research contributions during his PhD: He interned at Google Brain and Meta's FAIR (Facebook AI Research). He co-developed two architectures that shaped the trajectory of language modeling:
- Transformer-XL: Extended the standard Transformer to handle longer sequences through a recurrence mechanism. This was foundational work for long-context language models.
- XLNet: A language model that surpassed Google BERT on 20 standard NLP tasks and achieved state-of-the-art results on 18. Selected as an Oral presentation at NeurIPS 2019, one of the highest distinctions at the conference.
After completing his PhD, Yang returned to China and co-founded Moonshot AI in March 2023. His deep expertise in long-context modeling directly influenced Kimi's initial differentiator: supporting 200,000 Chinese characters in a single conversation when the model first launched, a world record at the time.
Co-founders:
- Zhou Xinyu: Deep engineering and systems-level expertise. Responsible for moving models efficiently from research to production at scale.
- Wu Yuxin: Research background in vision, multimodal modeling, and open-source culture. Previously at Google Brain and Facebook AI Research.
4. The Kimi Model Lineage
Kimi's development shows a clear trajectory from consumer chatbot to open-weight frontier model.
Kimi v1.0 (October 2023)
The initial release supported up to 200,000 Chinese characters per conversation. At the time, this was the longest context window of any publicly available LLM. By early 2024, Kimi had expanded this to over 2 million characters in a single conversation. This long-context capability was directly connected to Yang Zhilin's research on Transformer-XL during his PhD.
Kimi k1.5 (January 20, 2025)
The first major architectural leap. Key characteristics:
- Training approach: Reinforcement learning (RL) with long chain-of-thought reasoning
- Context window: 128K tokens (RL training scaled to the full 128K context)
- Multimodal: Vision and text capabilities
- Training innovations: Online mirror descent for policy optimization, partial rollout reuse for efficiency
- What it did NOT use: Monte Carlo tree search, value functions, or process reward models
| Benchmark | Kimi k1.5 | OpenAI o1 |
|---|---|---|
| MATH-500 | 96.2% | 96.4% |
| Codeforces | 94th percentile | 96th percentile |
The k1.5 paper (arXiv 2501.12599) demonstrated that carefully scaled reinforcement learning alone, without elaborate search techniques, could match OpenAI's reasoning models on core benchmarks.
Kimi K2 (July 11, 2025)
The open-weight frontier model that put Moonshot on the global map. More details in Section 5.
Kimi K2 Thinking (November 6, 2025)
A reasoning-focused variant of K2 with extended thinking capabilities. This model was the first to demonstrate stable tool use across 200-300 sequential calls. More details in Section 6.
Kimi K2.5 (January 27, 2026)
The multimodal, agent-swarm capable model that Cursor used as the base for Composer 2.0. More details in Section 6.
5. Kimi K2: One Trillion Parameters, Open Weights
Kimi K2, released July 11, 2025, marked Moonshot AI's arrival as a serious contender in the global foundation model competition. The numbers tell the story.
Architecture
| Specification | Kimi K2 | DeepSeek-V3 | Qwen3 MoE |
|---|---|---|---|
| Total parameters | 1 trillion | 671 billion | 235 billion |
| Active parameters | 32 billion | 37 billion | 22 billion |
| Architecture | MoE | MoE | MoE |
| Number of experts | 384 | 256 | 128 |
| Experts activated per layer | 8 | 8 | 8 |
| Attention mechanism | MLA (64 heads) | MLA (128 heads) | GQA |
| Hidden dimension | 7,168 | 7,168 | 4,096 |
| Expert hidden dimension | 2,048 | 2,048 | 1,536 |
| Context window | 128K | 128K | 256K |
| Pretraining tokens | 15.5 trillion | 14.8 trillion | 36 trillion |
| Training GPU-hours | ~4.2 million | ~2.79 million | Undisclosed |
K2 uses a Mixture of Experts (MoE) architecture with 384 experts, of which only 8 are activated per layer for any given token, roughly 32 billion active parameters. It delivers performance comparable to much larger dense models while keeping per-token inference compute close to that of a 32B dense model.
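The routing idea is simple to sketch. The toy example below (illustrative dimensions, with small linear maps standing in for the expert feed-forward networks; this is the generic top-k gating pattern, not Moonshot's implementation) shows why active parameters stay small: score all experts, keep the best k, and mix only their outputs.

```python
import numpy as np

def moe_route(x, gate_w, experts, k=8):
    """Route one token through a top-k Mixture of Experts layer.

    x        : (d,) token hidden state
    gate_w   : (n_experts, d) router weights
    experts  : list of callables, each mapping (d,) -> (d,)
    k        : experts activated per token (8 in K2's config)
    """
    logits = gate_w @ x                # router score for every expert
    top = np.argsort(logits)[-k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the selected k only
    # Only k of the n_experts networks run; the rest stay idle, which is
    # why active parameters (32B) are far below total parameters (1T).
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 384
x = rng.normal(size=d)
gate_w = rng.normal(size=(n_experts, d))
# Toy experts: tiny linear maps standing in for full feed-forward blocks.
expert_mats = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda v, M=M: M @ v for M in expert_mats]

y = moe_route(x, gate_w, experts, k=8)
print(y.shape)  # (16,)
```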
The MuonClip Optimizer
One of K2's most significant technical contributions is MuonClip, a novel optimizer that integrates the Muon algorithm with a QK-Clip stability mechanism. The result: K2 trained on the full 15.5 trillion tokens without a single loss spike. Training instability is one of the most expensive problems in LLM development, often requiring checkpoint rollbacks that waste days of GPU time. Solving this problem at the trillion-parameter scale is a genuine engineering achievement.
License
K2 uses a Modified MIT License. The model weights are open and available on Hugging Face under moonshotai/Kimi-K2-Instruct and moonshotai/Kimi-K2-Base. Commercial use is permitted with one condition: if the product exceeds $20M/month in revenue or 100M monthly active users, the Kimi K2 branding must be prominently displayed. This is the same license clause that created the Cursor controversy.
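The commercial clause is mechanical enough to express as a one-line check. A toy helper, with thresholds taken from the license summary above (the function name is mine, for illustration only):

```python
def kimi_attribution_required(monthly_revenue_usd: float, mau: int) -> bool:
    """Sketch of the Modified MIT clause: prominent "Kimi K2" attribution
    is required above either commercial threshold."""
    return monthly_revenue_usd > 20_000_000 or mau > 100_000_000

# Anysphere's reported $2B ARR works out to roughly $166M per month:
print(kimi_attribution_required(2_000_000_000 / 12, mau=1_000_000))  # True
```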
6. Kimi K2.5: Agent Swarms and Vision
Released January 27, 2026, K2.5 extends K2 with two major capabilities: native multimodal vision and parallel agent swarm execution.
MoonViT-3D Vision Encoder
K2.5 introduces MoonViT-3D, a 400-million parameter vision encoder based on SigLIP-SO-400M. It uses the NaViT packing strategy for variable-resolution images, supporting:
- Images: Up to 4K resolution (4096x2160) in PNG, JPEG, WebP, GIF formats
- Video: Up to 2K resolution (2048x1080) in MP4, MPEG, MOV, AVI, FLV, WebM formats
- Documents: PDF and text
The total parameter count with the vision encoder reaches approximately 1.04 trillion.
Agent Swarm: Parallel Multi-Agent Execution
This is K2.5's most distinctive capability. Trained with Parallel-Agent Reinforcement Learning (PARL), the model can self-direct up to 100 sub-agents executing up to 1,500 coordinated steps in parallel. On wide-search tasks (competitive analysis across 50+ websites, multi-file codebase debugging), the agent swarm delivers 4.5x faster execution compared to single-agent models.
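The orchestration pattern behind a swarm, fan out sub-tasks, run them concurrently, merge the findings, can be sketched without any model at all. Everything below is a stand-in: in K2.5 the coordinator is the PARL-trained model itself, and each sub-agent is a model instance rather than a lambda.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(task, subtasks, run_subagent, max_agents=100):
    """Fan-out/fan-in sketch: a coordinator splits a task, runs
    sub-agents concurrently, and merges their results in order."""
    with ThreadPoolExecutor(max_workers=min(max_agents, len(subtasks))) as pool:
        results = list(pool.map(run_subagent, subtasks))  # order preserved
    return {"task": task, "findings": results}

# Toy sub-agent: pretend each one researches a single website.
sites = [f"site-{i}.example" for i in range(8)]
report = fan_out(
    task="competitive analysis",
    subtasks=sites,
    run_subagent=lambda site: f"summary of {site}",
)
print(len(report["findings"]))  # 8
```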
Kimi K2 Thinking (November 6, 2025)
The reasoning variant deserves its own mention. Key specifications:
- Parameters: 1 trillion total, 32 billion active
- Context window: 256K tokens
- Quantization: Native INT4
- Agentic reasoning: Stable tool use across 200-300 sequential calls
K2 Thinking was evaluated by NIST's CAISI (Center for AI Standards and Innovation) in December 2025, ranking as the #1 open-source model on multiple benchmarks at the time of its release.
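Sustaining 200-300 sequential calls is, structurally, a long-running agent loop: the model repeatedly chooses a tool, sees the result, and decides again. A minimal sketch with a toy model and a single calculator tool (all names here are illustrative, not an actual Kimi API):

```python
def agent_loop(model_step, tools, goal, max_calls=300):
    """Sequential tool-use loop. `model_step` stands in for the model:
    given the transcript so far, it returns either
    ("call", tool_name, args) or ("final", answer)."""
    transcript = [("goal", goal)]
    for _ in range(max_calls):
        action = model_step(transcript)
        if action[0] == "final":
            return action[1], transcript
        _, name, args = action
        result = tools[name](*args)            # execute the requested tool
        transcript.append((name, args, result))
    raise RuntimeError("hit max_calls without a final answer")

# Toy "model": makes five calculator calls, then sums the results.
def toy_model(transcript):
    done = len(transcript) - 1                 # tool calls made so far
    if done < 5:
        return ("call", "add", (done, done))
    return ("final", sum(r for _, _, r in transcript[1:]))

answer, log = agent_loop(toy_model, {"add": lambda a, b: a + b}, "sum it")
print(answer)  # 20
```

The hard part is not the loop; it is keeping the model's tool choices coherent for hundreds of iterations, which is what the "stable tool use across 200-300 sequential calls" claim is about.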
7. Benchmark Comparisons
Core Reasoning and Math
| Benchmark | Kimi K2.5 | Kimi K2 Thinking | Kimi K2 | Claude Opus 4.5 | GPT-5.2 | DeepSeek V3 |
|---|---|---|---|---|---|---|
| MATH-500 | 97.8% | 97.6% | 97.4% | 96.1% | 97.3% | 90.2% |
| AIME 2025 | 99.2% | 99.8% | N/A | N/A | 86.7% | 39.2% |
| GPQA Diamond | 91.8% | 91.3% | 72.0% | 75.7% | 72.0% | 59.1% |
Coding
| Benchmark | Kimi K2.5 | Kimi K2 | Claude Opus 4.5 | GPT-5.2 | DeepSeek V3 | Qwen3 Coder |
|---|---|---|---|---|---|---|
| SWE-Bench Verified (single) | 76.8% | 65.8% | 80.9% | 80.0% | 42.0% | 69.6% |
| LiveCodeBench | 55.2% | 53.7% | N/A | N/A | 33.8% | N/A |
Agentic and Tool Use
| Benchmark | Kimi K2.5 | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|
| HLE Full (with tools) | 50.2% | 32.0% | 41.7% |
| BrowseComp | 62.4% | N/A | 54.9% |
| Tool performance gain | +20.1pp | +12.4pp | +11.0pp |
The HLE (Humanity's Last Exam) scores are particularly notable. K2.5 outperforms both Claude Opus 4.5 and GPT-5.2 by substantial margins when tools are available, and its +20.1 percentage-point gain from tool access suggests the model was specifically optimized for agentic tool use.
Vision and Multimodal
| Benchmark | Kimi K2.5 | GPT-5.2 | Claude Opus 4.5 | Gemini 2.5 Pro |
|---|---|---|---|---|
| MMMU Pro | 78.5% | 72.3% | 68.9% | 75.1% |
| MathVision | 84.2% | 78.1% | 71.3% | 80.6% |
| VideoMMMU | 86.6% | N/A | N/A | 84.2% |
K2.5's vision benchmarks are strong across the board. The VideoMMMU score of 86.6% reflects the MoonViT-3D encoder's ability to process video content effectively.
Key Takeaways from Benchmarks
- On pure math and reasoning, Kimi K2 Thinking and K2.5 are at or above the frontier set by GPT-5.2 and Claude Opus 4.5.
- On coding (SWE-Bench), Kimi trails Claude and GPT but significantly outperforms DeepSeek V3 and is competitive with Qwen3 Coder.
- On agentic tasks with tools, K2.5 currently leads. The 50.2% HLE score with tools is the highest published result.
- On vision tasks, K2.5 leads on MMMU Pro and MathVision, strong performance for an open-weight model.
8. API Pricing Comparison
One of the most compelling aspects of Chinese LLMs is pricing. The cost differences versus Western models are dramatic.
Per-Million-Token Pricing
| Model | Input ($/M tokens) | Output ($/M tokens) | Context Window | Architecture |
|---|---|---|---|---|
| DeepSeek V3 | $0.14 | $0.28 | 128K | MoE 671B/37B |
| Kimi K2 | $0.39 | $1.90 | 128K | MoE 1T/32B |
| DeepSeek R1 | $0.55 | $2.19 | 32K | Reasoning |
| Kimi K2.5 | $0.60 | $2.50 | 128K | MoE 1T/32B |
| GPT-5.2 | $1.75 | $14.00 | 128K | Dense |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | Dense |
| Claude Opus 4.6 | $5.00 | $25.00 | 200K | Dense |
Kimi's Tiered Pricing (by context length)
| Tier | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|
| 8K context | $0.20 | $2.00 |
| 32K context | $1.00 | $3.00 |
| 128K context | $2.00 | $5.00 |
| Cached tokens | $0.15 | N/A |
The cached-token price of $0.15/M is a 75% saving versus K2.5's standard $0.60/M input rate, making Kimi particularly cost-effective for applications with consistent system prompts or reference documents.
Cost at Scale: 100 Million Input + 100 Million Output Tokens Per Month
| Model | Monthly Cost | Relative to Kimi K2.5 |
|---|---|---|
| DeepSeek V3 | ~$42 | 0.14x |
| Kimi K2 | ~$229 | 0.74x |
| Kimi K2.5 | ~$310 | 1.0x |
| DeepSeek R1 | ~$274 | 0.88x |
| GPT-5.2 | ~$1,575 | 5.1x |
| Claude Sonnet 4.6 | ~$1,800 | 5.8x |
| Claude Opus 4.6 | ~$3,000 | 9.7x |
At 100M input plus 100M output tokens per month, Kimi K2.5 costs roughly $310. The same workload on Claude Opus 4.6 would cost approximately $3,000, nearly 10x more. DeepSeek V3 remains the cheapest option at $42, though it trails significantly on benchmarks.
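The table's figures follow directly from the per-million-token prices. A quick sketch to reproduce them (prices from the tables above; the 100M-in/100M-out workload is the table's assumption):

```python
def monthly_cost(input_price, output_price, m_in=100, m_out=100):
    """Monthly API cost in USD for m_in million input tokens and m_out
    million output tokens, at per-million-token prices."""
    return input_price * m_in + output_price * m_out

prices = {                     # ($/M input, $/M output) from the table above
    "DeepSeek V3":     (0.14, 0.28),
    "Kimi K2.5":       (0.60, 2.50),
    "GPT-5.2":         (1.75, 14.00),
    "Claude Opus 4.6": (5.00, 25.00),
}
for model, (p_in, p_out) in prices.items():
    print(f"{model}: ${monthly_cost(p_in, p_out):,.0f}")
# prints:
# DeepSeek V3: $42
# Kimi K2.5: $310
# GPT-5.2: $1,575
# Claude Opus 4.6: $3,000
```

Adjusting `m_in`/`m_out` to your actual input/output mix matters: output tokens dominate the bill for every model here, so chatty workloads widen the gap further.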
The pricing gap between Chinese and Western models is structural. Lower labor costs, government subsidies for AI development, and different monetization strategies all contribute. For startups and enterprises evaluating model providers, the cost difference is significant enough to influence architecture decisions.
9. Kimi's Consumer Product
Kimi is not just an API. It operates a consumer chatbot that competes directly with ChatGPT and other conversational AI products.
Platform: Available at kimi.com and kimi.ai, plus mobile apps.
Peak usage: 36 million monthly active users in China, reaching 50 million globally by August 2025.
Market position in China: Kimi ranked among the top Chinese AI chatbots but dropped to 7th in active monthly users by June 2025 as competition from Baidu's Ernie, Alibaba's Tongyi Qianwen, and others intensified.
International expansion: After the K2.5 launch, overseas revenue overtook domestic income. International subscriber growth accelerated significantly, and overseas API revenue quadrupled since November 2025.
Key differentiator: Kimi's original claim to fame was its long-context capability. The ability to process entire books, lengthy documents, and extended conversations in a single session gave it a distinct positioning in the Chinese market. When K2 launched with open weights, this shifted the conversation toward Kimi as a platform for developers and enterprises rather than just a consumer chatbot.
10. The Chinese LLM Landscape
Kimi does not exist in isolation. It emerged from a Chinese AI ecosystem that has produced several world-class foundation models. Understanding the broader landscape provides context for why Chinese LLMs have advanced so rapidly.
DeepSeek (High Flyer Capital Management)
Background: DeepSeek is a research lab funded by High Flyer, a quantitative hedge fund based in Hangzhou. The hedge fund's computing resources gave DeepSeek access to GPU clusters that most Chinese startups could not afford.
Key models:
| Model | Parameters | Training Cost | Key Achievement |
|---|---|---|---|
| DeepSeek-V3 | 671B total, 37B active | $5.6M | Outperformed Llama 3.1-405B and GPT-4o |
| DeepSeek-R1 | Reasoning variant | Undisclosed | Matched OpenAI o1 on reasoning tasks |
| DeepSeek-V3.2 Exp | Updated | Undisclosed | AIME 2025: 96.0% |
DeepSeek's defining characteristic is efficiency. Training V3 for only $5.6 million (2.79 million GPU-hours) while matching or exceeding models that cost hundreds of millions to train was a landmark moment for the industry. It demonstrated that brute-force compute spending is not the only path to competitive performance.
Pricing: DeepSeek V3 at $0.14/$0.28 per million input/output tokens is the cheapest frontier-class model available anywhere. DeepSeek-V3.2 Experimental pushes this even lower to approximately $0.028 per million input tokens.
Open source: Yes, fully open weights under permissive licenses.
Qwen (Alibaba Cloud)
Background: Alibaba's LLM family, developed by the Tongyi Lab division of Alibaba Cloud.
Key models:
| Model | Parameters | Notable Achievement |
|---|---|---|
| Qwen3 | 0.6B to 32B (dense), 235B MoE (22B active) | Ranked 3rd globally on LMArena text |
| Qwen3-Max | 1T+ parameters, 36T training tokens | 100% on AIME25 and HMMT |
| Qwen3 Coder | Coding-focused variant | 69.6% SWE-Bench Verified |
Market share: 32.1% of China's enterprise LLM market in H2 2025, nearly doubling from 17.7% in H1 2025.
Noteworthy: Alibaba is simultaneously Moonshot AI's largest investor and its direct competitor. Alibaba led Moonshot's $1B Series B extension in February 2024 and has participated in subsequent rounds. This kind of "invest in your competitors" dynamic is common in the Chinese tech ecosystem, where major platforms often back multiple horses in emerging categories.
Yi / 01.AI (Kai-Fu Lee)
Founder: Kai-Fu Lee, former president of Google China and former executive at Microsoft and Apple.
Founded: March 2023 (same month as Moonshot AI).
Key model: Yi-34B achieved MMLU 76.3%, outperforming Meta's Llama 2-70B (68.9%) and TII's Falcon 180B despite being a much smaller model.
Current status: 01.AI remains active but has not released a model at the frontier scale of Kimi K2 or DeepSeek V3. The company focuses on efficient, smaller-scale models for enterprise deployment.
Ernie (Baidu)
Model: Ernie Bot 4.0
Baidu's LLM benefits from deep integration with China's dominant search engine. Ernie consistently leads on Chinese language tasks and has gradually closed the gap with GPT-4 on general benchmarks. Its primary moat is the Baidu ecosystem: search, maps, cloud, and enterprise services that give Ernie distribution advantages within China.
GLM / ChatGLM (Zhipu AI)
Model: GLM-4.5 with 355 billion parameters, GLM-4.5-Air at 106 billion.
Zhipu AI has focused on the Chinese enterprise market with strong multimodal capabilities. GLM-4.5 approaches GPT-4 level performance on general benchmarks. Open-source versions are available.
Overview: Chinese LLM Ecosystem
| Company | Flagship Model | Total Params | Active Params | Key Strength | Pricing (Input $/M) |
|---|---|---|---|---|---|
| Moonshot AI | Kimi K2.5 | 1.04T | 32B | Agent swarms, vision, long-context | $0.60 |
| DeepSeek | V3 / V3.2 | 671B | 37B | Training efficiency, lowest cost | $0.14 |
| Alibaba | Qwen3-Max | 1T+ | Undisclosed | Enterprise market share, coding | Varies |
| Zhipu AI | GLM-4.5 | 355B | 32B (MoE) | Chinese enterprise, multimodal | Varies |
| Baidu | Ernie 4.0 | Undisclosed | Undisclosed | Chinese language, search integration | Varies |
| 01.AI | Yi-34B | 34B | Dense | Efficiency, smaller-scale deployment | Open source |
11. Head-to-Head: Kimi vs DeepSeek vs Qwen
These three models represent the current frontier of Chinese AI. Each has distinct strengths.
Architecture Comparison
| Aspect | Kimi K2.5 | DeepSeek V3 | Qwen3-Max |
|---|---|---|---|
| Total parameters | 1.04T | 671B | 1T+ |
| Active parameters | 32B | 37B | Undisclosed |
| MoE experts | 384 | 256 | Undisclosed |
| Training tokens | 15.5T | 14.8T | 36T |
| Training cost | ~$20M+ (est.) | $5.6M | Undisclosed |
| Context window | 128K-256K | 128K | 256K |
| Vision | Yes (MoonViT-3D) | No native | Yes |
| Agent swarm | Yes (100 sub-agents) | No | No |
| Open weights | Yes (Modified MIT) | Yes (MIT) | Yes (Apache 2.0) |
Benchmark Comparison
| Benchmark | Kimi K2.5 | DeepSeek V3 | Qwen3-Max |
|---|---|---|---|
| MATH-500 | 97.8% | 90.2% | 95.8% |
| AIME 2025 | 99.2% | 39.2% | 100% |
| SWE-Bench Verified | 76.8% | 42.0% | ~69.6% (Coder) |
| GPQA Diamond | 91.8% | 59.1% | ~80% |
| HLE (with tools) | 50.2% | N/A | N/A |
When to Use Which
| Use Case | Best Choice | Why |
|---|---|---|
| Cheapest API calls | DeepSeek V3 | $0.14/M input tokens |
| Best coding assistant | Kimi K2.5 or Qwen3 Coder | Highest SWE-Bench scores among Chinese models |
| Agent orchestration | Kimi K2.5 | Only model with native agent swarm capability |
| Vision/multimodal | Kimi K2.5 | MoonViT-3D with 4K image and 2K video support |
| Math reasoning | Qwen3-Max or Kimi K2 Thinking | Both approach or reach 100% on AIME |
| Enterprise deployment (China) | Qwen3-Max | 32.1% market share, Alibaba Cloud integration |
| Budget-constrained production | DeepSeek V3 | 5-20x cheaper than alternatives |
| Long-context applications | Kimi K2.5 | 256K context with stable performance |
| Open research | DeepSeek V3 | Most permissive license (MIT, no revenue restrictions) |
The Licensing Factor
This is where the models differ significantly:
| Model | License | Revenue Threshold | Attribution Requirement |
|---|---|---|---|
| DeepSeek V3 | MIT | None | None |
| Kimi K2.5 | Modified MIT | $20M/month or 100M MAU | Must display "Kimi K2.5" on UI |
| Qwen3 | Apache 2.0 | None | None |
For large-scale commercial applications, DeepSeek and Qwen have simpler licensing stories. Kimi's branding requirement above the revenue threshold adds a consideration that matters for companies like Cursor (as the March 2026 controversy demonstrated).
12. What This Means for the Industry
The Kimi story illustrates several trends that will define AI development in 2026 and beyond.
Chinese open-source models are at the frontier. This is no longer debatable. Kimi K2.5 outperforms GPT-5.2 on agentic benchmarks. DeepSeek V3 trains at a fraction of the cost. Qwen3-Max matches the best Western models on math reasoning. The gap between Chinese and American models has narrowed to the point where leadership shifts depending on which benchmark you examine.
Open weights change the competitive dynamics. Cursor built Composer 2.0 on Kimi K2.5 because the model was available, capable, and cheap. This is exactly how open-source software has always worked. The difference is that we are now seeing it with models that cost tens of millions of dollars to train, being fine-tuned and deployed by companies that would otherwise need to invest hundreds of millions in their own training runs.
The "model routing layer" is a viable business. Cursor's $2 billion ARR demonstrates that you do not need to train your own foundation model to build a massive AI business. What matters is the product experience, the fine-tuning for specific use cases, and the integration quality. This pattern will repeat across every vertical.
Pricing pressure is structural. When DeepSeek offers frontier-class performance at $0.14 per million input tokens and Kimi K2.5 at $0.60, while Claude Opus charges $5.00 and GPT-5.2 charges $1.75, the pricing gap creates real economic pressure. Developers and enterprises will increasingly adopt multi-model strategies, using the cheapest model that meets their quality requirements for each specific task.
The 80-person company is the new model. Moonshot AI built an $18 billion company with 80 people. DeepSeek trained a frontier model for $5.6 million. These are not anomalies. They reflect a structural shift in how AI companies operate. The capital intensity of AI training is real, but the human capital requirements are lower than the industry assumed.
Agent capabilities are the new frontier. Kimi K2.5's agent swarm, with 100 sub-agents executing 1,500 coordinated steps, points to where the competition is heading. Raw benchmark scores on math and coding are approaching saturation. The next differentiation vector is how well models can orchestrate complex, multi-step tasks autonomously. This is where K2.5's 50.2% HLE score with tools, the highest published result, signals Moonshot's strategic direction.
The rise of Kimi and Chinese LLMs broadly is not a threat narrative. It is an efficiency narrative. These models demonstrate that world-class AI can be built with smaller teams, lower budgets, and open distribution. The entire industry benefits when the cost of intelligence decreases. The question is not whether Chinese models will be competitive. They already are. The question is how the ecosystem adapts to a world where frontier AI capabilities are available to anyone who can download weights from Hugging Face.