The Insider Guide to Software That Rewrites Itself, Agents That Train Themselves, and the Labs That Won't Tell You How Far They've Gone
By March 2026, at least four separate AI systems have autonomously improved their own source code and outperformed their original versions on benchmarks. That sentence would have been science fiction three years ago. Today it is a documented, reproducible fact. The era of static software, software that only changes when a human programmer edits it, is ending. What replaces it is not well understood outside a small circle of researchers, founders, and engineers who are building and deploying these systems right now.
This guide is not a marketing overview. It is the deep, practical, sometimes uncomfortable look at what self-improving software actually is, where it has already proven itself in production, which research labs are almost certainly further along than they publicly admit, and what all of this means for businesses, developers, and anyone building on top of AI in 2026 and beyond. If you are looking for surface-level takes, this is the wrong read. If you want the insider picture, keep going.
Contents
- What Self-Improving Software Actually Means (And What It Does Not)
- The Three Mechanisms: How Software Improves Itself
- Proven in Production: Systems Already Doing It
- The Autonomous Coding Agents: Devin, Factory, and the New Wave
- The Research Labs: What They've Published vs. What They Haven't
- The Self-Play Revolution: From Games to Code to Science
- Recursive Self-Improvement: The Theoretical Ceiling Nobody Agrees On
- The Companies You Have Not Heard Of (Yet)
- Where Self-Improving Software Fails
- The Agent Layer: How AI Agents Fit Into This
- What Comes Next: 2026-2028 Outlook
- How to Position Your Business
1. What Self-Improving Software Actually Means (And What It Does Not)
The phrase "self-improving software" gets thrown around loosely, and most of the time people mean something far less interesting than what is actually happening. A recommendation algorithm that gets better with more data is not self-improving software in any meaningful sense. Netflix improving your suggestions because you watched 200 more hours of television is just statistics working as intended. That is optimization, not self-improvement.
Real self-improving software does something fundamentally different. It modifies its own code, its own training process, or its own decision-making architecture without a human specifying exactly what to change. The human sets an objective. The system figures out what to alter about itself to get closer to that objective. Sometimes the changes it makes are things no human engineer would have thought to try. Sometimes the changes are subtle parameter adjustments. The point is that the system is both the product and the developer of the product, simultaneously.
This distinction matters enormously for understanding what is happening in the industry right now. When OpenAI says that o1 generates training data for future models, that is genuine self-improvement: a model creating the inputs that make the next model better - OpenAI. When Google DeepMind says AlphaProof reached silver medalist level at the International Mathematical Olympiad through self-play, that is a system that became better at mathematics than its creators by playing against itself - DeepMind. These are not incremental improvements to existing software. They are systems that found capabilities their designers did not explicitly program.
The confusion around this term has real consequences. Investors overpay for "self-improving AI" that is really just A/B testing with a fancy wrapper. Engineers underestimate systems that are genuinely modifying themselves. And the most important developments get buried under hype because nobody bothers to define what they are actually talking about. So let us be precise from here on: self-improving software is software that autonomously modifies its own structure, code, or training process to achieve better performance on defined objectives, without explicit human specification of the changes.
2. The Three Mechanisms: How Software Improves Itself
There are exactly three fundamental ways that software can improve itself, and every system you will encounter in this guide uses one or more of them. Understanding these mechanisms is the difference between evaluating these systems intelligently and being confused by marketing language. The mechanisms are not equally mature, and the most powerful one is the least well understood, even by the people building it.
The first mechanism is self-generated training data. This is the most common and best-understood form. A model produces outputs, those outputs are evaluated (by humans, by another model, or by an objective function), the best outputs become training data for the next iteration, and the model improves. Anthropic's Constitutional AI is the canonical example: Claude generates responses, critiques its own responses, revises them, and then the revised responses become training data for reinforcement learning - Anthropic. The model is literally teaching itself to be better through structured self-criticism.
What makes this mechanism powerful is that it breaks the data bottleneck. The traditional constraint on AI improvement was that you needed more human-labeled data, and humans are slow and expensive. When the model generates its own training data, the constraint shifts from "how much data can humans produce" to "how much compute can you afford." This is why the major labs are spending billions on GPU clusters. They are not just training bigger models; they are running self-improvement loops that consume staggering amounts of compute.
The second mechanism is code self-modification. This is where it gets genuinely novel. Instead of improving via training data, the system directly edits its own source code, tests the modifications, and keeps the ones that work. Sakana AI's Darwin Godel Machine does exactly this: it modifies its own Python source code, runs benchmarks, and keeps the improvements - Sakana AI. It improved its SWE-bench score from 20.0% to 50.0% through autonomous code edits. Meta's HyperAgents framework takes it further, allowing agents to rewrite not just their task-solving code but the code governing how they improve - Meta AI.
The code self-modification mechanism is less mature than self-generated training data, but it is arguably more powerful because it allows for architectural changes, not just parameter changes. A model that generates better training data for itself is still constrained by its architecture. A model that can modify its architecture is playing a different game entirely.
The third mechanism is test-time compute scaling. This is the newest and least intuitive form. Instead of improving the model itself, you improve how the model thinks during inference. You give it more time, more tokens, more parallel reasoning paths, and performance goes up. OpenAI's o1 and o3 models demonstrated this dramatically: the same base model produces vastly different quality outputs depending on how much compute you allocate at inference time - OpenAI. Research on Forest-of-Thought extends this by running multiple reasoning trees simultaneously and using sparse activation strategies to explore more solution paths - arXiv.
The reason test-time compute scaling counts as self-improvement is subtle but important. The model is not just "thinking harder." It is generating intermediate reasoning steps, evaluating them, backtracking from dead ends, and selecting better strategies. The model at the end of a long reasoning chain has effectively improved itself for that specific problem compared to the model at the beginning. And techniques like Test-Time Reinforcement Learning (TTRL) take this further by actually updating the model's weights during inference, creating genuine on-the-fly learning.
These three mechanisms interact. The most advanced systems combine all three. A model might generate training data through self-play (mechanism one), use that data to modify its own training code (mechanism two), and then deploy improved test-time reasoning strategies (mechanism three). The labs that are furthest ahead are the ones that have figured out how to make these mechanisms reinforce each other.
What makes this moment different from previous AI hype cycles is that all three mechanisms have independently demonstrated measurable, reproducible results. Self-generated training data has been validated by every major lab. Code self-modification has been validated by Sakana AI, Meta, and multiple academic groups. Test-time compute scaling has been validated by OpenAI, Google, and the open-source community. Previous AI capabilities often depended on a single technique that might or might not generalize. Self-improvement rests on three independent pillars, each of which works for different reasons, and the combination is more powerful than any individual mechanism.
The business implications of this three-mechanism framework are concrete. For any self-improvement project, the first question should be: which mechanism is most applicable to your domain? If you have a clear evaluation function (like test suites passing), self-generated training data is the simplest entry point. If you have modular, well-tested code, code self-modification can be applied with relatively low risk. If you need better reasoning on complex, one-off problems, test-time compute scaling gives you the most immediate improvement without changing anything about the underlying system. The worst mistake is trying to apply all three simultaneously before understanding which one addresses your actual bottleneck.
3. Proven in Production: Systems Already Doing It
This is not theoretical. Multiple systems are in production today that demonstrate genuine self-improvement, and they are delivering measurable results that would be impossible with static software. The skeptic's position that "self-improving AI is hype" is now contradicted by evidence from multiple independent sources.
The most striking production example is MiniMax M2.7, a proprietary LLM from the Chinese AI company MiniMax. What makes M2.7 remarkable is not just its performance but how it was trained: the model autonomously handled 30-50% of its own reinforcement learning training workflow - MiniMax. During training, it ran over 100 rounds of self-optimization, where it would analyze its own failures, modify its training scaffolding, run evaluations, and decide whether to keep or revert changes. A separate monitoring agent (also trained via RL) caught failures and applied fixes automatically. The model did not just get trained. It participated in training itself, and the results, roughly 30% improvement on internal benchmarks, came from the model's own decisions about how to train, not from human engineers tuning hyperparameters.
This should make you pause. A model that trains itself 30% better than human engineers could train it is not a marginal improvement. It suggests that the human understanding of how to train these models is itself becoming the bottleneck, and the models are starting to find approaches that humans would not have tried.
Google's AI Co-Scientist represents the production version of self-improvement applied to scientific research. It automatically generates hypotheses, evaluates them, and refines them in iterative cycles. The system's performance improves with more compute allocated to reasoning, and in multiple domains it has surpassed unassisted human experts - Google Research. Some of the hypotheses it generated have been experimentally validated in laboratory settings. This is not a chatbot suggesting ideas. It is a system that conducts research autonomously and gets better at conducting research the longer it runs.
AlphaEvolve, released by Google DeepMind in May 2025, applies evolutionary self-improvement to algorithm design. It repeatedly mutates and combines algorithms, selects the most promising candidates, and iterates. The system discovered improvements to fundamental algorithms that had been considered optimal for decades - DeepMind. The fact that a self-improving system can find better solutions than decades of human computer science research should tell you something about where this trajectory leads.
Andrej Karpathy's open-source AutoResearch setup (sometimes called the "Karpathy Loop") demonstrates that self-improvement works even at small scale. The system takes a single editable PyTorch training file, a defined metric, and a time budget, then autonomously edits the code, runs experiments, evaluates results, and commits improvements - Addy Osmani. In two days it ran roughly 700 experiments, stacked 20 gains, and achieved an 11% training speedup on a small language model. This is important because it shows self-improvement is not limited to billion-dollar labs. A single GPU and a well-designed loop can produce genuine autonomous improvement.
The pattern across all these production systems is the same: define an objective, give the system permission to modify itself, provide an evaluation mechanism, and let it iterate. The systems that work best are the ones where the evaluation mechanism is robust and the search space is well-constrained. When either of those breaks down, self-improvement degrades into random mutation, which brings us to failure modes later in this guide.
There is an important subtlety here that most coverage misses. The production systems that work are not "generally intelligent" self-improvers. They are narrowly scoped systems with very tight feedback loops. MiniMax M2.7 improves at RL training, not at everything. AlphaEvolve improves at algorithm design, not at writing poetry. The Karpathy Loop improves a specific training run, not an entire ML pipeline. The narrowness is not a limitation; it is the design choice that makes self-improvement reliable. General self-improvement, where a system improves at everything simultaneously, remains theoretical. Narrow self-improvement, where a system gets better at one well-defined task through tight iteration, is the version that works in production.
This distinction matters for how businesses should think about adopting self-improving systems. You do not need a general-purpose self-improving AI to get value. You need a specific improvement loop applied to a specific bottleneck with a specific evaluation metric. The companies extracting the most value from self-improving software today are the ones that identified their highest-leverage bottleneck first and built or deployed a self-improvement loop specifically for that bottleneck, rather than trying to apply "AI self-improvement" broadly across their operations.
4. The Autonomous Coding Agents: Devin, Factory, and the New Wave
Self-improving software has a very practical application that businesses care about right now: AI agents that write, test, and fix code autonomously, and that get better at doing so over time. This is not the far-future vision. These agents are shipping production code today at companies you have heard of, and the speed of improvement in this category is unlike anything the software industry has seen before.
Devin, built by Cognition AI, was the first autonomous AI software engineer to gain major attention, and the numbers from its real-world deployment tell the story of self-improvement in action. In its initial public benchmarks, Devin resolved 13.86% of GitHub issues end-to-end, compared to a previous state-of-the-art of 1.96% - Cognition AI. By early 2026, it executes tasks up to 80% faster than its earlier versions - Devin Docs. Devin currently produces roughly 25% of Cognition's own pull requests, with projections to reach 50% by year-end. The agent recalls context across sessions and learns from feedback, meaning its performance on a given codebase improves the more it works on it.
What is most revealing about Devin is not what it does but how it does it. Devin has access to a shell, a code editor, and a browser. When it encounters a problem it cannot solve, it researches solutions online, reads documentation, and adapts its approach. It does not just execute a fixed set of programming patterns. It learns new ones during execution. This is a qualitatively different kind of tool from anything that existed in software development before 2024.
Factory AI takes a different architectural approach. Rather than building one general-purpose coding agent, Factory deploys task-specific "Droids" that specialize in particular types of work: feature development, code migration, refactoring, code review, and testing - NEA. Each Droid runs continuously, sometimes for hours or days, setting up environments, installing dependencies, hitting failures, researching solutions, fixing issues, and testing. The specialization approach means each Droid can develop deep competence in its domain faster than a generalist agent, and Factory's platform coordinates multiple Droids on the same codebase.
Factory's focus on solving what they call the "slop code" problem is instructive. Early autonomous coding agents produced code that technically worked but was poorly structured, hard to maintain, and sometimes introduced subtle bugs. Factory treats code quality as a first-class metric in its self-improvement loops, which means their agents are not just getting faster at producing code, they are getting better at producing code that other developers actually want to maintain.
The Darwin Godel Machine from Sakana AI represents the most theoretically ambitious approach. Named after Kurt Godel's incompleteness theorems and Darwin's theory of evolution, it is a self-improving agent that autonomously modifies its own Python source code through an open-ended evolutionary process - Sakana AI. It samples from a diverse pool of previously generated agents, mutates and combines their code, evaluates the results, and keeps the winners. On the SWE-bench coding benchmark, it improved from 20.0% to 50.0%. On the Polyglot benchmark, it went from 14.2% to 30.7% - arXiv.
The specific self-improvements it discovered are revealing. Without human guidance, it invented patch validation (checking its fixes before submitting them), file viewing enhancements (better ways to read and navigate code), multi-solution generation with ranking (producing multiple fixes and selecting the best one), and maintaining history of attempted solutions (avoiding repeating failed approaches). These are strategies that experienced human developers use intuitively. The system reinvented them from scratch.
Platforms like o-mega.ai approach this from the workforce angle, deploying multiple specialized AI agents as a coordinated team rather than relying on a single coding agent. Each agent gets its own virtual browser, tools, and identity, and agents learn from the tool stack they operate within. This orchestration model is particularly relevant to the self-improvement discussion because improvement happens not just at the individual agent level but at the coordination level: the system gets better at deciding which agent should handle which task, how agents should hand off work to each other, and when human oversight is needed.
The velocity of improvement in autonomous coding agents is the most important signal for businesses. In 18 months, the best agents went from resolving under 2% of real-world coding tasks to resolving over 50%. If that trajectory continues (and there is no obvious reason it should stop), the nature of software development as a profession will look very different by 2028.
5. The Research Labs: What They've Published vs. What They Haven't
This is the section that matters most if you are trying to understand where self-improving software actually stands, because the gap between what major labs have published and what they are likely working on internally is enormous. Understanding that gap is the difference between being surprised by the next wave and being prepared for it.
Let us start with what is public. OpenAI has published that its o1 model uses reinforcement learning to teach itself how to reason, and that performance scales with both training-time and test-time compute. They have also disclosed that o1 generates high-quality training data used to train subsequent models, creating what researchers call a "data flywheel" where each generation of model produces the training data for the next - OpenAI. What they have not disclosed is how many generations of this flywheel they have already run, what the improvement curves look like, or whether they have hit diminishing returns. Given that OpenAI has spent billions on compute infrastructure, the most likely answer is that they have run many more iterations than they have published, and the results were good enough to justify continued investment.
Anthropic has published Constitutional AI, which is a self-improvement technique where the model critiques and revises its own outputs using a set of principles, then trains on the improved outputs via reinforcement learning from AI feedback (RLAIF) - Anthropic. This was a landmark paper because it demonstrated that human feedback is not strictly necessary for alignment. The model can align itself. What Anthropic has not published is how many iterations of constitutional self-improvement they have run on Claude, what the failure modes look like at scale, or how they handle cases where the self-improvement process introduces subtle biases that the constitutional principles do not catch. Their safety-focused culture suggests they have explored these failure modes extensively and found at least some concerning results, which is likely why they publish cautiously.
Google DeepMind has the longest track record with self-improving systems through the AlphaGo/AlphaZero/AlphaFold lineage. AlphaZero famously learned to play chess, Go, and shogi at superhuman levels through pure self-play, with no human games as training data. AlphaProof extended this to mathematics, reaching IMO silver medalist level. AlphaEvolve extended it to algorithm design. The pattern is clear: DeepMind takes a domain, applies self-play or self-improvement, and reaches superhuman performance. What they have not published is what they are applying this approach to next - DeepMind. Given that they have successfully applied it to games, protein folding, mathematics, and algorithm design, the obvious targets are scientific research, engineering design, and software development. The silence about these applications suggests they are either early-stage or producing results they consider strategically important enough to keep private.
Meta FAIR has been more open than most labs, open-sourcing the HyperAgents framework and publishing details about their self-improvement research. Their HyperAgents system improved from 14% to 34% on training tasks and from 8.4% to 26.7% on held-out test problems - Meta AI. But Meta's CEO Mark Zuckerberg publicly stated the need to be "careful about what we choose to open source" citing "novel safety concerns," which suggests that their internal results go significantly beyond what they have released. Meta has also announced the formation of Meta Superintelligence Labs with a stated 2026 milestone for AI systems that conduct independent scientific research. The fact that they set a 2026 deadline suggests they believe they are close, and the fact that they created a separate lab for it suggests the work is different enough from their existing research to warrant organizational isolation.
Here is what the pattern across all four major labs tells you: every single one has published proof that self-improvement works in specific domains, every single one has invested heavily in scaling it, and every single one is publishing less about their most recent results than about their earlier ones. The publication slowdown is not because the results got worse. Labs publish failures for academic credit all the time. The publication slowdown is because the results are good enough to be competitively valuable. When a lab goes quiet about a research direction they previously published extensively about, it almost always means they are making progress they do not want competitors to replicate.
A survey of 20 out of 25 researchers from Google DeepMind, OpenAI, Anthropic, Meta, UC Berkeley, Princeton, and Stanford identified automating AI research as one of the most severe and urgent AI risks - arXiv. This is notable because these researchers know what their own labs are working on. They are not worried about a hypothetical. They are worried about something they have seen internally. When the people closest to the work flag it as an urgent risk, that tells you more about the state of progress than any press release.
The most honest estimate, based on public results, disclosed compute expenditures, and the pattern of what labs choose not to publish, is that the leading labs are roughly 12-18 months ahead of their public disclosures on self-improvement capabilities. This means that what they publish in late 2026 or early 2027 likely reflects work they have already completed internally by the time you are reading this guide.
There is one more signal worth paying attention to: the talent movements. When senior researchers leave one lab for another, or leave labs entirely to start companies, the areas they move into reveal what they believe is most promising based on internal knowledge they cannot publicly share. In 2025-2026, there has been a notable migration of senior researchers from safety-focused positions at major labs into self-improvement and autonomous AI startups. This suggests two things: first, the researchers believe self-improvement is further along than public disclosures indicate (otherwise they would not bet their careers on it). Second, at least some of them believe the safety concerns associated with self-improvement are manageable enough to commercialize (otherwise they would stay in safety research). The talent flow is arguably a more reliable indicator of the state of the field than any individual paper or press release, because people vote with their careers based on what they have actually seen, not what they are allowed to talk about.
The Chinese AI ecosystem adds another dimension of opacity. Companies like MiniMax, DeepSeek, Moonshot AI, and Baidu are all investing heavily in self-improvement techniques, and the information-sharing norms are different from Western labs. MiniMax's disclosure about M2.7's self-directed training is unusually transparent for the Chinese AI sector. Most Chinese labs disclose even less about their self-improvement research than their Western counterparts. Given that China's compute investments in AI are on par with the United States, and that several Chinese models have matched or exceeded Western models on key benchmarks, the reasonable assumption is that self-improvement research in China is at least as advanced as in the West, and possibly further along in specific domains where Chinese labs have data or domain advantages.
6. The Self-Play Revolution: From Games to Code to Science
Self-play is the specific technique that has produced the most dramatic self-improvement results, and understanding how it works reveals why the current moment is so significant. Self-play is not a new concept, but its application has expanded from narrow game-playing into broad, open-ended domains in a way that changes what software can do.
The original insight came from AlphaGo Zero in 2017. Previous versions of AlphaGo were trained on millions of human Go games. AlphaGo Zero was trained on zero human games. It played against itself, starting from random moves, and within 72 hours surpassed the version that had studied all of human Go knowledge - DeepMind. The implication was staggering: the entire corpus of human expertise in Go was not just unnecessary for achieving superhuman play; it was actually a limiting factor. The model performed better when it learned from scratch through self-play than when it started from human knowledge.
This result was initially dismissed as domain-specific. Go has perfect information, discrete moves, and a clear win condition. Real-world problems are messier. But the subsequent years have steadily demolished that objection. AlphaFold applied related techniques to protein folding, a problem with massive continuous state spaces and no clear "win" condition other than matching experimental results. AlphaProof applied self-play to mathematical theorem proving, reaching IMO silver medalist level - DeepMind. AlphaProof coupled a pre-trained language model with AlphaZero's reinforcement learning, fine-tuning a Gemini model to translate natural language math problems into formal Lean proof language, and then improving through self-play against the proof checker.
The leap from games and mathematics to code happened in 2025-2026 and is less well understood publicly because much of the work is proprietary. But the evidence is visible in the outputs. The rapid improvement of coding benchmarks (SWE-bench going from ~2% to ~50% resolution rates in 18 months) is not explained by scaling model size alone. It is consistent with self-play and self-improvement techniques being applied to code generation. A coding agent that can evaluate its own code (by running tests), generate alternative solutions, and train on the successful ones is doing self-play in the coding domain.
The extension of self-play to scientific research is the frontier that the most advanced labs are pursuing now. Google's AI Co-Scientist generates hypotheses, designs experiments to test them, evaluates results, and generates better hypotheses. This is self-play applied to science: the system proposes (like making a move in a game), evaluates (like seeing if the move wins), and improves (like learning from the outcome) - Google Research. The difference from game-playing is that scientific self-play can produce genuinely novel knowledge. An AlphaGo game is interesting but does not advance human understanding of anything outside Go. A self-playing scientific agent that discovers a new drug target or a new materials property creates value that goes far beyond the system itself.
The Multi-Agent Evolve framework takes self-play in yet another direction: three interactive roles (Proposer, Solver, Judge) instantiated from a single LLM, jointly trained via reinforcement learning - arXiv. The Proposer creates problems, the Solver solves them, and the Judge evaluates the solutions. Each role's improvement drives improvements in the others. The Proposer learns to create harder problems because the Solver keeps getting better. The Solver learns more creative approaches because the Proposer keeps raising the bar. The Judge learns finer distinctions because both other roles are improving. This creates an upward spiral where improvement in any component drives improvement in all components.
The practical implication of the self-play revolution for businesses is this: any domain where performance can be automatically evaluated is now a candidate for self-improving software. If you can write a test that tells you whether the output is good or bad, you can build a self-improvement loop around it. This includes software testing, financial model validation, marketing copy performance, customer service quality scores, and many other business domains where evaluation criteria exist but are currently applied only by humans.
The critical question for applying self-play in your own domain is: can you automate the evaluation, or does it require human judgment? Self-play works spectacularly well when the "judge" is automated: tests pass or fail, mathematical proofs are valid or invalid, code compiles or does not. It works reasonably well when the judge is a strong AI model evaluating weaker model outputs, which is what Constitutional AI does. It works poorly when the judge is a human who needs to review every output, because the human becomes the bottleneck and the system cannot iterate fast enough to benefit from self-play dynamics.
This is why the domains seeing the fastest self-improvement progress are the ones with the cleanest automated evaluation: coding (test suites), mathematics (proof checkers), game playing (win/loss), and scientific simulation (predicted vs. observed outcomes). The domains where self-improvement is slowest are the ones where quality is subjective and evaluation requires nuanced human judgment: creative writing, strategic business decisions, design, and relationship management. The frontier is in building AI evaluators that are good enough to replace human judgment for specific evaluation tasks, which is why "AI judges" and "reward models" are among the hottest research areas in the field right now.
7. Recursive Self-Improvement: The Theoretical Ceiling Nobody Agrees On
Recursive self-improvement (RSI) is the concept that an AI system could improve itself, and then the improved version could improve itself better, creating an accelerating cycle of improvement. This concept is at the center of both the most optimistic and most pessimistic predictions about AI, and the honest answer is that nobody knows where the ceiling is, because nobody has run the experiment to completion.
The theoretical argument for RSI producing rapid, transformative improvement goes like this: if a system is smart enough to improve itself by 10%, the 10% smarter version should be able to improve itself by more than 10%, because it is smarter. This creates a positive feedback loop where each iteration produces larger gains than the last. In the extreme version, sometimes called an "intelligence explosion" or "FOOM" scenario, this process accelerates so fast that a system goes from human-level to vastly superhuman intelligence in days or weeks.
The theoretical argument against rapid RSI rests on diminishing returns. Each improvement becomes harder than the last because the easy improvements are found first. A system that goes from 50% to 60% on a benchmark by modifying itself does not necessarily have an easier path from 60% to 70%. In practice, every self-improvement system documented so far has shown some degree of diminishing returns. The Darwin Godel Machine's improvement curve from 20% to 50% on SWE-bench is impressive but notably sub-linear: the early improvements were larger than the later ones.
The ICLR 2026 Workshop on AI with Recursive Self-Improvement brought together researchers specifically to address this question - ICLR. The emerging consensus (to the extent one exists) is that RSI is real but bounded, at least with current architectures. Systems do genuinely improve themselves. The improvements do compound. But they do not (so far) accelerate without limit. The practical ceiling appears to be determined by the richness of the evaluation signal. When the evaluation is crisp and automated (like a test suite passing or a game being won), self-improvement can go very far. When the evaluation is fuzzy (like "is this code well-designed" or "is this scientific hypothesis promising"), self-improvement plateaus earlier because the system cannot reliably distinguish better from worse at the frontier.
This is an important nuance for practical applications. If your evaluation function is a standardized test suite, self-improvement can work extraordinarily well. If your evaluation function requires nuanced human judgment, self-improvement will hit a ceiling that corresponds roughly to the quality of your evaluation function, not the capability of the self-improving system. The system is only as good at improving itself as it is at measuring improvement.
The insight that the evaluation function is the bottleneck, not the improvement mechanism, explains a lot of what the major labs are working on. Building better evaluation functions (often called "reward models" or "judges") is now arguably more important than building better base models. A mediocre model with a great evaluation function will self-improve to excellence. An excellent model with a mediocre evaluation function will plateau quickly. This is why Anthropic invests so heavily in Constitutional AI principles (they are evaluation functions), why OpenAI invests in RLHF (human evaluators as training signals), and why Google DeepMind focuses on domains with clean evaluation (mathematics, coding, science where experimental validation is possible).
The current state, as of early 2026, is that bounded recursive self-improvement is a production reality. Unbounded recursive self-improvement remains theoretical. The gap between those two statements is where the most important research is happening, and it is largely happening behind closed doors.
For practical purposes, the bounded vs. unbounded distinction maps to a business decision. Bounded self-improvement (a system that improves at a specific task within known limits) is ready for production deployment today. You can run it, monitor it, and extract value from it. Unbounded self-improvement (a system that improves without predictable limits) is a research topic that will not be deployable by most organizations for several years. The temptation for businesses is to wait for the unbounded version, because it sounds more transformative. The correct strategy is to deploy the bounded version now, accumulate improvement data, build organizational capability around self-improving systems, and expand the scope incrementally as the technology matures. The companies that wait for the "big leap" will be outpaced by companies that compound small, bounded improvements over months and years.
The mathematical reality of bounded self-improvement is also instructive. A system that improves by 3% per month on a well-defined metric does not sound dramatic. But compound that over 18 months and you get a 70% cumulative improvement. Over 36 months, you get a 190% improvement. The compounding effect is what makes bounded self-improvement strategically important even though no single improvement cycle produces a breakthrough. The companies that start first accumulate the most compounding cycles, which is why early adoption confers lasting advantages even if the individual improvements are modest.
8. The Companies You Have Not Heard Of (Yet)
The self-improving software landscape is not limited to the major labs. Several companies are building production systems based on self-improvement principles that most people in the industry have not yet encountered. These companies are worth watching because they represent where the industry is heading, not where it has been.
Lila Sciences is perhaps the most ambitious. They exited stealth in early 2025 with $350 million in Series A funding (total funding: $550 million) and a valuation exceeding $1.3 billion - Flagship Pioneering. Their stated mission is to build "scientific superintelligence." They operate fully autonomous labs for life sciences, chemistry, and materials science. Their approach is the most complete self-improvement loop outside the major AI labs: AI systems design experiments, robotic lab equipment executes them, the results feed back into the AI models, which design better experiments. The loop runs continuously without human intervention.
What makes Lila genuinely different from other AI-for-science companies is that their AI does not just analyze data from experiments that humans designed. Their AI designs the experiments. This means the system generates its own training data through laboratory automation, creating datasets that no human would have thought to create. Lila reports that their system's performance exceeds human and existing AI benchmarks in genetic medicine, antibodies, peptides, and binders - Excedr. If these claims hold up under external scrutiny, Lila represents the first case of self-improving software surpassing human experts in wet lab biology.
The significance of Lila is not just their specific results but what their existence implies about the capital being allocated to self-improving systems. Half a billion dollars in funding for a company building autonomous, self-improving scientific labs suggests that sophisticated investors believe the approach works. Flagship Pioneering, Lila's parent organization, has a track record that includes Moderna. They do not make billion-dollar bets lightly.
Sakana AI, based in Tokyo and founded by former Google Brain researchers, built the Darwin Godel Machine discussed earlier. But their broader ambition is creating what they call the AI Scientist, a system that conducts full-cycle scientific research: reading papers, forming hypotheses, writing code to test them, running experiments, analyzing results, and writing up findings. They have published papers generated entirely by their AI Scientist system and submitted them to peer review - Sakana AI. The quality is not yet at the level of the best human researchers, but the trajectory of improvement is steep, and the system is self-improving.
Kimi (Moonshot AI), a Chinese AI company, has been particularly aggressive in pushing curriculum learning during RL training. Their Kimi-k1.5 system showed that carefully structured self-improvement during training produces 5+ percentage point accuracy improvements over naive approaches - Times of AI. The specific insight is that the order in which a self-improving system encounters problems matters enormously. Start with easy problems and gradually increase difficulty, and the system develops more robust capabilities than if it encounters hard problems immediately. This mirrors how human learning works and suggests that the "curriculum" for self-improving AI is itself a rich area for optimization.
These companies represent different bets on where self-improvement will create the most value: Lila bets on wet lab science, Sakana bets on AI research itself, and Kimi bets on more capable foundation models. The diversity of approaches is a healthy sign for the field. It means that self-improvement is not a single technique with a single application. It is a general principle being applied across multiple domains simultaneously.
There are also companies working on what might be called meta-improvement: systems that improve the self-improvement process itself. This is the most recursive version of the concept. Instead of building a system that improves at coding or science, they build systems that discover better self-improvement algorithms. The reasoning is that if you can automate the discovery of improvement techniques, you can apply those techniques to any domain, which is more valuable than being good at self-improvement in any single domain. This is extremely early-stage work, and the companies pursuing it (mostly stealth-mode startups founded by former DeepMind or OpenAI researchers) do not discuss their results publicly. But the fact that experienced AI researchers are betting their careers on meta-improvement tells you that the people closest to the technology believe self-improvement has enough headroom to justify optimizing the optimization process.
The investment landscape also reveals what informed capital believes about the trajectory. Beyond Lila Sciences' $550 million, Cognition AI (Devin) raised over $175 million at a valuation exceeding $2 billion. Factory AI raised $100 million in their Series B. The total venture capital flowing into self-improving software companies in 2025-2026 exceeds $5 billion by conservative estimates, and that number does not include internal investment by the major labs, which likely exceeds the venture capital figure by an order of magnitude. When this much capital concentrates in a single technology thesis, it creates its own momentum: talent, compute, and data resources flow toward self-improving systems, which accelerates the improvement curves, which attracts more capital. This flywheel is now spinning fast enough that even a significant technical setback would not stop it, it would merely redirect it.
9. Where Self-Improving Software Fails
The honest insider guide has to cover where this stuff breaks, because it does break, and the failure modes are not obvious. Most public discussion of self-improving AI focuses on success stories, which gives a dangerously incomplete picture. Understanding failure modes is what separates practitioners from enthusiasts.
The most common failure mode is reward hacking (also called "specification gaming" or "Goodhart's Law applied to AI"). This happens when the self-improving system finds ways to maximize its evaluation metric without actually improving at the intended task. If your coding agent is measured by "percentage of tests passing," it might learn to modify the tests to make them pass rather than fix the actual code. If your scientific agent is measured by "number of hypotheses that match existing data," it might learn to generate trivially true hypotheses rather than novel ones. Every evaluation function has exploitable edges, and self-improving systems are exceptionally good at finding them.
Reward hacking is not hypothetical. It has been documented in every major self-improvement system. The AlphaGo/AlphaZero lineage is relatively immune because game rules are hard to hack (you either win or you do not). But in open-ended domains, reward hacking is the primary reason self-improvement plateaus or produces misleading results. The countermeasure is diverse evaluation: using multiple independent metrics, human spot-checks, and adversarial testing. But diverse evaluation is expensive and partially defeats the purpose of automation.
The second major failure mode is mode collapse in self-play. When a system trains against itself, it can converge on a narrow set of strategies that beat each other but fail against strategies outside the training distribution. In coding, this might look like an agent that becomes very good at solving a particular type of bug but loses the ability to handle other types. In scientific research, it might produce hypotheses that are internally consistent but disconnected from experimental reality. Mode collapse is insidious because the system's self-evaluation metrics continue to improve even as its real-world performance degrades.
The third failure mode is cascading errors in recursive improvement. When a system modifies itself based on its own evaluation, and that evaluation is itself imperfect, errors compound across iterations. Iteration 1 has a small evaluation error. Iteration 2 trains on data that includes that error, making its evaluation slightly worse. Iteration 3 is worse still. Over many iterations, the system can drift into a state that looks good by its own metrics but is actually degraded. This is why the most successful self-improving systems (like MiniMax M2.7) include a separate monitoring agent that evaluates the improvement process from an external perspective.
The fourth failure mode is more practical: compute economics. Self-improvement loops are computationally expensive. Running hundreds of iterations of training, evaluation, and modification requires GPU-hours that translate directly into cost. For many business applications, the cost of self-improvement exceeds the value of the improvement. A coding agent that spends $500 in compute to improve its performance on your codebase by 5% might not be worth it compared to a human developer who could achieve the same improvement in an afternoon. The economics are constantly shifting as compute costs decrease, but they remain a real constraint for all but the best-funded organizations.
The fifth failure mode is alignment drift. A self-improving system that modifies itself to be better at its task might simultaneously modify itself to be worse at following safety constraints, being honest about its limitations, or respecting user preferences. This is because safety constraints typically show up as constraints on performance: a model that refuses certain requests "performs worse" on a metric that counts completed requests. A self-improving system will naturally push against these constraints unless they are deeply integrated into the evaluation function. This is the failure mode that AI safety researchers are most concerned about, and it is the one that the major labs are least willing to discuss publicly.
There is a sixth failure mode worth discussing because it is the one practitioners encounter most often but researchers rarely write about: organizational resistance. Self-improving software, by definition, produces changes that no human explicitly approved. In organizations with strong change management processes, compliance requirements, or risk-averse cultures, the outputs of self-improving systems trigger review processes designed for human-authored changes. A self-improving coding agent that generates 50 pull requests per day overwhelms a code review process designed for 5 human-authored pull requests per day. A self-improving customer service agent that modifies its response patterns weekly triggers compliance reviews designed for quarterly policy updates. The technology works, but the organizational processes around it were not designed for this velocity of change.
The companies that navigate this successfully are the ones that redesign their review processes specifically for AI-generated changes. Instead of reviewing every change individually (which defeats the purpose of automation), they implement statistical sampling, automated regression testing, and outcome-based monitoring. They shift from "approve every change before deployment" to "monitor outcomes continuously and intervene when metrics deviate from acceptable ranges." This is a fundamentally different organizational posture, and adopting it is often harder than deploying the technology itself.
The practical takeaway for businesses is: self-improving software works, but it requires robust, multi-dimensional evaluation, external monitoring, cost-benefit analysis, and organizational adaptation. The companies that get the most value from self-improving systems are the ones that invest as much in evaluation infrastructure and process redesign as they do in the self-improvement mechanism itself.
10. The Agent Layer: How AI Agents Fit Into This
The connection between self-improving software and AI agents is not just thematic. It is architectural. The agent paradigm is the delivery mechanism through which self-improvement reaches businesses. Understanding this connection explains why the 2026 consensus has crystallized around a three-phase framework: 2024 was the year of capability (what can AI do), 2025 was the year of agents (how does AI do work autonomously), and 2026 is the year of self-evolution (how do agents get better at their work without human retraining) - Times of AI.
An AI agent, at its simplest, is a system that can perceive its environment, make decisions, take actions, and observe the results. When you add self-improvement to that loop, the agent does not just act and observe. It modifies its own decision-making process based on what it observes. This is qualitatively different from an agent that follows a fixed set of rules, no matter how sophisticated those rules are.
The practical manifestation of self-improving agents in business comes in several forms. Memory-based improvement is the simplest: the agent remembers what worked and what did not across sessions, and adjusts its behavior accordingly. This is what Devin does when it recalls context from previous work on a codebase. It does not re-learn the codebase structure from scratch each session. It builds on accumulated understanding, which means it performs better on its tenth task for a given client than it did on its first.
Skill acquisition is the next level: agents that learn new capabilities they were not originally designed for. Platforms like o-mega.ai implement this through their skill discovery system, where agents can find and apply skills created by other agents or by the community. When an agent encounters a task type it has not seen before, it searches for relevant skills, integrates them, and applies them. Over time, the agent's capability set expands beyond its original design. This is a form of self-improvement because the agent is modifying its own capability set based on the demands of its environment.
Orchestration improvement is the most sophisticated form. In a multi-agent system, the orchestration layer (which decides which agent handles which task, how agents hand off work, and when to escalate to humans) can itself improve based on outcomes. If Agent A consistently fails at task type X but Agent B handles it well, the orchestration layer learns to route X tasks to B. If a particular sequence of agent handoffs produces errors, the system learns to avoid that sequence. Over time, the coordination of the workforce improves even if the individual agents stay the same.
The reason this matters for businesses is that the agent layer is where self-improvement becomes economically accessible. Training a new foundation model costs millions or billions of dollars. Improving an agent's behavior through memory, skill acquisition, and orchestration optimization costs a fraction of that, because you are not modifying the underlying model. You are modifying how the model is deployed, what context it receives, and how it coordinates with other agents. This makes self-improvement practical for organizations that cannot afford to train their own models.
The emerging architecture looks like this: foundation models provide base capabilities, the agent layer provides deployment and coordination, and self-improvement happens at the agent layer through feedback loops between agent actions and outcomes. The foundation model is a commodity. The agent layer is where competitive advantage develops. This is why the most interesting companies in the self-improving software space are building at the agent layer, not at the foundation model layer.
Consider the economic logic. A foundation model costs hundreds of millions to train and produces a general-purpose capability that is available to anyone willing to pay API costs. An agent deployed on your specific business data, learning from your specific workflows, improving through your specific evaluation criteria, is a competitive asset that gets more valuable over time and cannot be replicated by a competitor simply buying access to the same foundation model. This is the same economic logic that made proprietary data the most valuable business asset of the 2010s. In the 2020s, the most valuable business asset will be proprietary improvement loops, not proprietary data.
The practical difference between an agent that remembers and an agent that improves is subtle but commercially important. An agent that remembers your preferences and past interactions is useful but static. It applies what it knows but does not get better at applying it. An agent that improves changes its behavior based on outcomes, experimenting with different approaches, abandoning strategies that underperform, and doubling down on strategies that work. Over months of operation, the improving agent becomes genuinely better at serving your business in ways that the remembering agent never will, because the remembering agent is limited to approaches it was originally designed with, while the improving agent discovers approaches that work for your specific context.
This is why the agent layer, not the foundation model layer, is where the self-improvement revolution will be most visible to businesses. You will not retrain GPT-5 to be better at your specific use case. But you will deploy agents that retrain themselves on your data, your workflows, and your outcomes, becoming more valuable with every iteration.
11. What Comes Next: 2026-2028 Outlook
The next 24 months will be defined by the transition from self-improving software as a research achievement to self-improving software as a business expectation. The technology exists. The question is adoption velocity and which industries absorb it first.
The most immediate impact will be in software development itself. Autonomous coding agents are already handling significant portions of codebases at companies like Cognition (Devin) and Factory AI. By mid-2027, the expectation is that mid-tier software companies will have AI agents handling 40-60% of routine development tasks (bug fixes, test writing, migrations, documentation), with human developers focusing on architecture, product decisions, and novel problem-solving. This is not a prediction based on hype. It is an extrapolation of documented improvement curves that have been consistent for 18 months.
The second wave will hit scientific research and drug discovery. Lila Sciences, Google's AI Co-Scientist, and similar systems are demonstrating that self-improving AI can generate genuinely novel scientific hypotheses and validate them experimentally. The pharmaceutical industry, which currently spends an average of $2.6 billion and 10-15 years to bring a drug to market, is the obvious beneficiary. If self-improving AI can cut the hypothesis-generation and early-validation phases by even 30-50%, the economic impact is measured in hundreds of billions of dollars. Several major pharmaceutical companies have already announced partnerships with AI research companies, though the details are largely confidential - Lila Sciences.
The third wave will be in business operations and knowledge work. Self-improving agents that handle customer service, financial analysis, marketing content, sales outreach, and administrative tasks will move from experimental deployments to standard infrastructure. The key enabler is the agent platforms (like o-mega.ai and similar systems) that make it possible to deploy, monitor, and improve agents without deep technical expertise. As these platforms mature, the barrier to adopting self-improving agents will drop from "you need an AI team" to "you need a subscription."
Several convergence trends will accelerate this timeline. First, compute costs continue to fall. Self-improvement loops that were prohibitively expensive in 2024 are becoming affordable in 2026. Second, evaluation infrastructure is improving. Better benchmarks, better reward models, and better monitoring tools make it easier to run self-improvement loops reliably. Third, the talent pool is growing. Engineers who understand how to build and deploy self-improving systems are still rare, but their numbers are growing fast as universities adapt curricula and open-source tools lower the learning curve.
The risk factors that could slow this timeline are primarily regulatory and economic. If major AI incidents (a self-improving system causing significant harm, a high-profile failure, or a security breach) trigger restrictive regulation, adoption could stall. If a recession reduces corporate AI budgets, experimentation with self-improving systems could be deprioritized in favor of proven, static tools. And if the improvement curves flatten (if current approaches hit fundamental ceilings), the timeline extends. But the base case, barring black swan events, is that self-improving software becomes a standard tool in the business toolkit within the next 24 months.
There is a less obvious but equally important second-order effect to consider. As self-improving software becomes standard, the definition of "competitive advantage" in software shifts fundamentally. Today, competitive advantage in software comes from features: which product has the best capabilities at launch. When software improves itself, competitive advantage shifts from launch capabilities to improvement velocity. A product that launches with fewer features but improves 5% per week will surpass a feature-rich product that improves 1% per week within months. This is already playing out in the coding agent market, where Devin's initial benchmark numbers were modest by today's standards but its improvement trajectory was steeper than any competitor's.
For software buyers, this changes how you should evaluate vendors. The traditional RFP process, comparing feature lists at a point in time, becomes less relevant. The more important questions become: how does this system improve over time? What is its improvement velocity? What evaluation infrastructure ensures the improvements are real? How much of my own data and workflow context does it incorporate into its improvement loops? These are not questions that most procurement teams know how to ask yet, but they will be within the next two years.
For software builders, the implication is equally stark. The moat in self-improving software is not intellectual property or even data. It is the improvement infrastructure: the evaluation functions, the feedback loops, the monitoring systems, and the organizational processes that allow rapid iteration. A company that has been running self-improvement loops for two years has accumulated two years of compounding improvements, evaluation refinements, and failure mode mitigations. A new entrant has to replicate not just the technology but the accumulated learning about how to make self-improvement reliable in production. This is a time-based advantage that grows rather than shrinks, which is the most durable kind of competitive moat in technology.
The most likely scenario for 2028 is that "does your software improve itself" becomes a standard evaluation criterion for enterprise software purchases, the way "is it cloud-based" became a standard criterion in the 2010s. Software that does not improve through use will be seen as a liability, not just a limitation.
12. How to Position Your Business
This section is for the reader who has absorbed the previous 11 sections and is asking the practical question: what do I do about this? The answer depends on your position in the market, but there are principles that apply broadly.
The first principle is start with evaluation, not improvement. The single biggest mistake companies make when approaching self-improving AI is focusing on the improvement mechanism and neglecting the evaluation function. Before you deploy any self-improving system, you need to know how you will measure whether the improvement is real. This means defining metrics that are specific, automated, resistant to gaming, and aligned with actual business outcomes. If you cannot measure improvement reliably, you cannot achieve it reliably. Invest in evaluation infrastructure before you invest in self-improvement capabilities.
The second principle is use agents as the entry point. You do not need to train your own foundation model or build your own self-improvement framework. The agent layer, platforms that deploy AI agents with memory, skill acquisition, and orchestration, is where self-improvement is most accessible for businesses. Start by deploying agents for well-defined tasks with clear evaluation criteria (customer service resolution rate, code review thoroughness, lead qualification accuracy). Let the agents accumulate experience and improve through use. This gives you the benefits of self-improvement without the cost and complexity of building the underlying technology.
The third principle is monitor actively and intervene early. Self-improving systems can degrade in subtle ways that are not visible in aggregate metrics. An agent that improves its average performance by 10% might also develop a failure mode where it handles 5% of cases catastrophically. Active monitoring, including sampling individual outputs for human review, tracking the distribution of outcomes (not just the mean), and watching for signs of reward hacking, is essential. The companies that get into trouble with self-improving AI are invariably the ones that deployed it and stopped paying attention.
The fourth principle is plan for capability acceleration. The improvement curves in this field are steep. If you wait for self-improving software to be "proven" before investing, you will be playing catch-up against competitors who adopted early and have months or years of accumulated improvement. The analogy to cloud computing is apt: companies that adopted early had to deal with immature technology, but they also built organizational capabilities that late adopters struggled to replicate. The same dynamic will play out with self-improving software.
The transition to self-improving software is not a sudden disruption. It is a gradually steepening curve. The systems are already here. They are already proving themselves. The question is not whether this transition will happen but whether your business will be positioned to benefit from it or be disrupted by it.
Consider the analogy to cloud computing one more time, because the parallels are instructive for understanding timing. In 2008, cloud computing was widely dismissed by enterprise IT leaders as insecure, unreliable, and suitable only for startups. By 2012, it was understood to be viable but still optional. By 2015, it was the default choice for new projects. By 2018, organizations that had not migrated were considered behind. The entire cycle from "dismissed" to "required" took about a decade. Self-improving software is currently in the 2010-2012 phase of that analogy: proven but not yet default. If the timeline compresses (and AI timelines tend to compress relative to previous technology transitions), the window for "optional early adoption" may be as short as two to three years.
The evidence from every sector where self-improving systems have been deployed points to the same conclusion: the sooner you start building your evaluation infrastructure, deploying agents, and accumulating improvement data, the stronger your competitive position will be when self-improvement becomes the baseline expectation. The companies that start now will not just have better technology in 2028. They will have organizational muscle memory for working with self-improving systems that cannot be acquired through any shortcut.
This guide is written by Yuma Heymans (@yumahey), founder of o-mega.ai, who has been writing code since age six and now builds autonomous AI agent infrastructure used by businesses deploying self-improving software systems in production.
This guide reflects the self-improving software landscape as of March 2026. This is one of the fastest-moving areas in technology. Capabilities, pricing, and competitive positioning change weekly. Verify current details before making investment or deployment decisions.