
The Tiny Recursive Model: How a 7M Parameter Network Outperforms Frontier AI on Reasoning Benchmarks

Discover how a tiny 7M parameter model outperforms Gemini, DeepSeek, and Claude on hard reasoning benchmarks using recursive reasoning and deep supervision, and why this approach challenges conventional assumptions about AI scaling.
The artificial intelligence landscape has long operated under a fundamental assumption: bigger is better. Larger models with more parameters, more training data, and more computational resources consistently outperform their smaller counterparts. However, a groundbreaking research paper from Samsung has challenged this conventional wisdom in a way that could reshape how we think about AI model design and efficiency. A tiny neural network with just 7 million parameters—a fraction of the size of frontier models like GPT-4, Gemini 2.5 Pro, or DeepSeek—is now achieving superior performance on some of the most challenging reasoning benchmarks in artificial intelligence. This remarkable achievement isn’t the result of simply scaling up training data or computational resources. Instead, it represents a fundamental rethinking of how neural networks approach complex problem-solving through a technique called recursive hierarchical reasoning combined with deep supervision. In this comprehensive guide, we’ll explore how this tiny model works, why it’s so effective, and what it means for the future of AI development and deployment.
Before we can appreciate the innovation behind the Tiny Recursive Model, we need to understand why large language models struggle with complex reasoning tasks in the first place. Modern large language models like GPT-4, Claude, and Gemini operate on a fundamental principle: they predict the next token in a sequence based on the tokens that came before it. This autoregressive approach has proven remarkably effective for many tasks, from translation to summarization to creative writing. However, when it comes to hard reasoning problems—particularly those requiring multiple steps of logical deduction, constraint satisfaction, or abstract pattern recognition—this approach reveals significant limitations. The core issue is that a single incorrect token prediction can invalidate an entire answer. Imagine solving a complex mathematical equation: if the model makes an error in the first step, all subsequent steps become meaningless. This cascading error problem becomes exponentially worse as problems increase in complexity. Additionally, large language models don’t truly “reason” in the way humans do. They’re performing sophisticated pattern matching based on their training data, not engaging in genuine logical inference. When faced with novel problems that require reasoning beyond their training distribution, they often fail spectacularly. This is why even the most advanced frontier models struggle with benchmarks like ARC AGI (Abstraction and Reasoning Corpus), which specifically tests the ability to solve novel reasoning problems that require genuine abstract thinking rather than pattern recognition.
The AI research community has developed several techniques to address the reasoning limitations of large language models, each with its own strengths and weaknesses. The most prominent of these is chain-of-thought prompting, a technique that has become ubiquitous in modern AI systems. Chain-of-thought works by encouraging the model to generate step-by-step reasoning before providing its final answer. Rather than jumping directly to a conclusion, the model is prompted to “think through” the problem, generating intermediate reasoning steps that lead to the final answer. This approach has proven remarkably effective, with studies showing that chain-of-thought can significantly improve performance on reasoning tasks. However, chain-of-thought comes with substantial drawbacks. First, it’s computationally expensive—generating multiple reasoning steps requires processing many additional tokens, which increases inference time and computational cost. Second, it requires high-quality reasoning data for training, which is expensive and time-consuming to create. Third, and perhaps most importantly, chain-of-thought is brittle: the generated reasoning may be flawed, and if it is, the final answer will be wrong. The model isn’t actually verifying its reasoning; it’s simply generating plausible-sounding explanations that may or may not be logically sound. Another popular technique is best-of-N sampling (often discussed alongside the pass@k metric), where the model generates multiple candidate answers and one is selected, typically by majority voting or an external verifier. If you ask a model “What is 5 times 5?”, it might sample ten different responses and keep the answer that appears most often. While this can improve accuracy, it’s also computationally expensive and doesn’t address the fundamental problem: the model still isn’t reasoning; it’s just generating multiple predictions and hoping one is correct. These techniques represent what researchers call “test-time compute scaling”—using more computational resources at inference time to improve answer quality. While effective, this approach is fundamentally limited: the underlying model still isn’t performing genuine reasoning, just producing more samples.
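To make the sampling idea concrete, here is a minimal Python sketch of best-of-N sampling with majority voting (the self-consistency variant). The `generate` callable is a hypothetical stand-in for any stochastic LLM sampling call, not a specific library API:

```python
from collections import Counter

def best_of_n(generate, prompt, n=10):
    """Sample n candidate answers and return the most common one
    (majority voting / self-consistency). `generate` is a stand-in
    for any stochastic LLM sampling function."""
    candidates = [generate(prompt) for _ in range(n)]
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes / n  # answer plus a crude agreement score
```

The cost is linear in n: ten samples mean roughly ten times the inference compute, which is why test-time scaling gets expensive quickly.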
To understand the significance of the Tiny Recursive Model’s achievements, we need to understand the benchmark it’s being evaluated on: ARC AGI (Abstraction and Reasoning Corpus). The ARC AGI benchmark was created to test something that most AI benchmarks don’t: genuine abstract reasoning ability. Unlike benchmarks that test knowledge or pattern recognition, ARC AGI presents novel visual reasoning puzzles that require the ability to identify abstract patterns and apply them to new situations. The benchmark consists of tasks where the model is shown a few examples of input-output pairs and must figure out the underlying rule or transformation, then apply that rule to new inputs. These aren’t tasks that can be solved through memorization or pattern matching from training data; they require genuine reasoning and the ability to generalize abstract concepts. Since the ARC AGI benchmark was introduced in 2019, it has become a gold standard for evaluating reasoning capabilities in AI systems. Despite six years of progress in large language models, human-level accuracy on ARC AGI has still not been achieved. This is a humbling reminder that despite the impressive capabilities of modern AI systems, they still struggle with tasks that humans find relatively straightforward. Gemini 2.5 Pro, one of the most advanced frontier models available, achieves only 4.9% accuracy on ARC AGI 2 even when given substantial test-time compute resources. The newer ARC AGI 3 benchmark is even more challenging, with frontier models struggling to make meaningful progress. This is the context in which the Tiny Recursive Model’s achievements become truly remarkable. A model with 7 million parameters—less than 0.01% of the parameters in Gemini 2.5 Pro—is achieving 45% accuracy on ARC AGI 1 and 8% on ARC AGI 2, substantially outperforming these massive frontier models.
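For readers who haven’t seen an ARC task, the public dataset stores each puzzle as a JSON object with a handful of demonstration pairs and one or more held-out test inputs. The toy example below, with a deliberately trivial rule, mirrors that structure:

```python
# A toy task in the ARC data format: each puzzle has a few demonstration
# pairs ("train") and held-out inputs ("test"). Grids are lists of lists
# of integers 0-9, where each integer denotes a color.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[0, 1], [1, 1]], "output": [[0, 2], [2, 2]]},
    ],
    "test": [
        {"input": [[1, 1], [0, 1]]}  # solver must infer "recolor 1 -> 2"
    ],
}
```

Real ARC tasks use larger grids and far subtler transformations, but the format is the same: infer the rule from a few examples, then apply it to something new.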
The key innovation behind the Tiny Recursive Model is a technique called recursive hierarchical reasoning, which represents a fundamentally different approach to how neural networks tackle complex problems. To understand this technique, it’s helpful to think of an analogy: imagine you’re trying to solve a difficult Sudoku puzzle. You don’t solve it in one pass, making all the decisions at once. Instead, you make a guess, think about whether that guess makes sense given the constraints, and if it doesn’t work, you revise your guess and try again. You might go through this cycle dozens of times, each time refining your solution based on your previous attempts and the reasoning about why those attempts failed. This iterative refinement process is essentially what recursive hierarchical reasoning does. The model maintains two key pieces of information: its current best guess at the solution and a trace of the reasoning that led to that guess. At each recursion step, the model updates both of these pieces of information. It looks at its current guess, thinks about the reasoning that led to it, and generates an improved guess based on that reasoning. Then it repeats this process, using the improved guess and updated reasoning as input to the next iteration. The original hierarchical reasoning model (HRM) that inspired this work used two separate neural networks operating at different hierarchies or “speeds.” The biological justification was that the human brain operates at different temporal frequencies—some processes are fast and reactive, while others are slow and deliberative. The two networks in HRM were supposed to emulate this, with one network operating quickly and another operating more slowly, and the two networks working together in a loop. However, the Samsung researchers who developed the Tiny Recursive Model questioned this biological justification. While it’s interesting to draw parallels between artificial neural networks and biological brains, such analogies don’t necessarily explain why a particular architectural choice is effective. The original HRM paper relied heavily on biological arguments and complex mathematical theorems (fixed-point theorems) to justify its design, but it didn’t provide clear ablation studies showing which components actually contributed to performance improvements. The researchers asked a simple but profound question: why use two networks? Why not one? Why not three or four? And more fundamentally, why do we need to justify architectural choices based on biology rather than empirical results?
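In code, this core loop is easier to see than to describe. The sketch below is a schematic of the recursive update, not the paper’s exact interface; `update_reasoning` and `update_answer` are illustrative names for whatever network computes each step:

```python
def refine(model, x, y, z, n_steps=16):
    """Iterative refinement: x is the problem, y the current guess,
    z the latent reasoning trace. Both y and z are updated each step."""
    for _ in range(n_steps):
        z = model.update_reasoning(x, y, z)  # rethink, given the current guess
        y = model.update_answer(y, z)        # revise the guess from the new trace
    return y, z
```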
The answer to these questions led to the development of the Tiny Recursive Model (TRM), which takes the core insight of recursive reasoning but strips away the complexity and biological justifications. Rather than using two medium-sized networks operating at different hierarchies, TRM uses a single tiny network with just two layers. The model is remarkably simple—the pseudocode for TRM is short enough to fit on a single screen. This simplicity is not a limitation; it’s a feature. By eliminating unnecessary complexity, the researchers were able to focus on what actually matters: the recursive refinement process itself. The key insight is that the model needs to maintain two pieces of information: its current guess and the reasoning trace that led to that guess. These aren’t necessarily different hierarchies or different temporal frequencies; they’re simply two different types of information that the model needs to track. At each recursion step, the model takes these two pieces of information as input, processes them through its tiny two-layer network, and outputs updated versions of both the guess and the reasoning trace. This process repeats multiple times, with each iteration potentially improving the solution. The beauty of this approach is that it provides what the researchers call “virtual depth.” Even though the network only has two layers, by recursing through it multiple times, the model effectively has much greater depth. It’s as if the model is simulating a much deeper network through iteration rather than through additional layers. This is a crucial insight because it challenges the conventional wisdom that deeper networks are always better. In traditional neural network design, we add more layers to increase the model’s capacity to learn complex functions. But the Tiny Recursive Model shows that you can achieve similar or better results by keeping the network shallow and instead increasing the number of recursion steps. This is a fundamentally different way of thinking about model architecture.
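A condensed PyTorch sketch of this idea follows. The dimensions, step counts, and the exact way x, y, and z are combined are assumptions chosen for illustration; the real TRM operates on token embeddings of puzzle grids:

```python
import torch
import torch.nn as nn

class TinyRecursiveModel(nn.Module):
    """Illustrative TRM-style core: one tiny network reused at every
    recursion step. Widths and step counts are assumptions."""
    def __init__(self, d=256, n_inner=6):
        super().__init__()
        self.n_inner = n_inner
        # a single shallow block, reused at every step ("virtual depth")
        self.net = nn.Sequential(
            nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d)
        )

    def forward(self, x, y, z):
        for _ in range(self.n_inner):               # refine the reasoning trace z
            z = self.net(torch.cat([x, y, z], dim=-1))
        y = self.net(torch.cat([x, y, z], dim=-1))  # then refine the guess y
        return y, z
```

Note that `self.net` is the only learned component: calling it repeatedly is what produces the virtual depth, so a two-layer block applied seven times per cycle behaves more like a fourteen-layer network without adding a single parameter.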
The second key innovation in the Tiny Recursive Model is a technique called deep supervision. While recursive reasoning provides the iterative refinement mechanism, deep supervision ensures that the model learns effectively from each iteration. In traditional supervised learning, a model makes a prediction and receives feedback only on the final output. If the final answer is wrong, the model learns that the entire process was incorrect, but it doesn’t get specific feedback about which intermediate steps were problematic. Deep supervision changes this by providing supervision signals at multiple intermediate steps during the reasoning process. Rather than only checking whether the final answer is correct, the model receives feedback at each recursion step. This means that the model learns not just whether its final answer is right or wrong, but whether each intermediate step in its reasoning process is moving in the right direction. The impact of deep supervision on performance is dramatic. In initial experiments, using deep supervision doubled accuracy compared to single-step supervision, improving from 19% to 39% accuracy on certain tasks. This is a massive improvement from a single change to the training procedure. The reason deep supervision is so effective is that it provides much richer learning signals. When a model only receives feedback on the final answer, it has to figure out through backpropagation which of its intermediate steps were responsible for the error. This is a difficult credit assignment problem, especially in deep networks. By providing direct supervision at each step, the model gets clear feedback about whether each intermediate step is correct, making it much easier to learn the right behavior. Furthermore, deep supervision helps prevent the model from getting stuck in local optima. If the model makes a wrong turn early in its reasoning process, deep supervision will catch this immediately and provide feedback to correct it, rather than allowing the error to propagate through multiple steps before being detected.
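A hedged sketch of what this looks like in a training loop, assuming the TinyRecursiveModel-style interface from above. Detaching the carried state between cycles mirrors the truncated-gradient trick used in this line of work, though the paper’s exact details differ:

```python
def deep_supervision_step(model, x, y, z, target, loss_fn, n_cycles=3):
    """Accumulate a loss at every refinement cycle instead of only at
    the end, so each intermediate guess gets direct feedback."""
    total_loss = 0.0
    for _ in range(n_cycles):
        y, z = model(x, y, z)                         # one full refinement cycle
        total_loss = total_loss + loss_fn(y, target)  # supervise this cycle's guess
        y, z = y.detach(), z.detach()                 # carry state forward, truncate grads
    return total_loss, y, z
```

Here `loss_fn` is assumed to compare the (decoded) guess against the target; the decoding head is omitted to keep the sketch short.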
The performance improvements achieved by the Tiny Recursive Model are nothing short of remarkable. On the Sudoku Extreme benchmark, the model improved from 55% to 87% accuracy. On the Maze Hard benchmark, it improved from 75% to 85%. On ARC AGI 1, it achieved 45% accuracy compared to 40% for the previous approach. On ARC AGI 2, it achieved 8% accuracy compared to 5% for the previous approach. While the improvements on ARC AGI 2 might seem modest—from 5% to 8%—they represent a 60% relative improvement, which is substantial in a field where progress is often measured in single-digit percentage point improvements. More importantly, these results need to be understood in the context of model size. The Tiny Recursive Model has only 7 million parameters. To put this in perspective, Gemini 2.5 Pro, DeepSeek R1, and Claude 3.7 are all estimated to have hundreds of billions of parameters. The Tiny Recursive Model is achieving competitive or superior performance on these benchmarks while using less than 0.01% of the parameters of these frontier models. When you compare the performance-to-parameter ratio, the Tiny Recursive Model is orders of magnitude more efficient. This has profound implications for AI deployment. Smaller models are cheaper to run, require less computational infrastructure, and can be deployed on edge devices or in resource-constrained environments. If a 7 million parameter model can achieve performance comparable to or better than models with hundreds of billions of parameters, this opens up entirely new possibilities for AI applications. The only frontier model that outperformed the Tiny Recursive Model on these benchmarks was Grok 4 Thinking, which achieved significantly better results. However, Grok 4 Thinking is a massive model with over a trillion parameters—more than 140,000 times larger than TRM. Even accounting for this size difference, the Tiny Recursive Model’s efficiency is remarkable.
Understanding why recursive reasoning is so effective requires thinking about the nature of complex reasoning problems. Many hard reasoning tasks have a particular structure: they involve finding a solution that satisfies multiple constraints or discovering a pattern that explains a set of observations. These problems often can’t be solved in a single forward pass through a neural network. Instead, they require iterative refinement, where you generate a candidate solution, check it against the constraints or observations, identify where it fails, and then refine it. This is exactly what recursive reasoning enables. By maintaining both a current guess and a reasoning trace, the model can engage in this iterative refinement process. The reasoning trace serves as a form of working memory, allowing the model to keep track of what it has tried, what worked, and what didn’t. This is fundamentally different from how traditional neural networks operate. A traditional neural network processes input through a series of layers and produces an output. There’s no mechanism for the network to revisit its earlier decisions or to maintain a record of its reasoning process. The network can’t say “I tried this approach and it didn’t work, so let me try something different.” It just processes the input and produces an output. Recursive reasoning changes this by explicitly building in a mechanism for iterative refinement and maintaining a reasoning trace. This allows the model to engage in a form of reasoning that’s much closer to how humans actually solve complex problems. When humans solve a difficult puzzle, we don’t just think about it once and produce an answer. We think about it, generate a candidate solution, check it, find problems with it, and refine it. We might go through this cycle many times. Recursive reasoning enables neural networks to do something similar. Another key insight is that recursive reasoning provides a form of regularization. By forcing the model to maintain a reasoning trace and to refine its answer iteratively, the model is constrained to learn solutions that are more generalizable. A model that can only produce an answer in a single forward pass might memorize specific patterns from the training data. A model that must refine its answer iteratively and maintain a reasoning trace is forced to learn more fundamental principles that can be applied to new problems. This helps explain why the Tiny Recursive Model generalizes so well to new problems, even though it’s trained on relatively small amounts of data.
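The generate-check-refine pattern is not specific to neural networks. The toy solver below shows the same loop on a simple constraint problem; it is an analogy for what the recursive model learns to do internally, not anything from the paper:

```python
import random

def solve_by_refinement(constraints, candidate, max_iters=1000):
    """Generate-check-refine on a toy constraint problem: keep a current
    guess, check it against all constraints, and if any fail, repair one
    part of the guess and try again."""
    for _ in range(max_iters):
        if all(check(candidate) for check in constraints):
            return candidate                      # all constraints satisfied
        i = random.randrange(len(candidate))      # pick a position to revise
        candidate = candidate[:i] + [random.randint(1, 9)] + candidate[i + 1:]
    return None                                   # gave up within the budget

# Toy instance: find four distinct digits (1-9) that sum to 20.
constraints = [lambda c: sum(c) == 20, lambda c: len(set(c)) == len(c)]
print(solve_by_refinement(constraints, [1, 1, 1, 1]))
```

The solver never gets the answer in one shot; it converges by repeatedly checking and repairing, which is exactly the behavior the recursive architecture bakes into the network.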
The implications of the Tiny Recursive Model extend beyond academic research into practical applications. Organizations increasingly need to automate complex reasoning tasks—from data analysis and pattern recognition to decision-making and problem-solving. Traditionally, these tasks have required either human expertise or large, expensive AI models. The Tiny Recursive Model opens up new possibilities for automating these tasks efficiently. FlowHunt, an AI workflow automation platform, can leverage these advances in reasoning models to create more efficient and cost-effective automation solutions. Rather than relying on massive frontier models that require significant computational resources, FlowHunt can integrate smaller, more efficient models like the Tiny Recursive Model into automated workflows. This allows organizations to build intelligent automation systems that can handle complex reasoning tasks without the overhead of running massive models. For example, consider a workflow that needs to analyze customer data, identify patterns, and make recommendations. Using a traditional large language model, this workflow would be expensive to run at scale. Using a tiny recursive model integrated into a FlowHunt workflow, the same task could be accomplished at a fraction of the cost. The model could iteratively refine its analysis, maintaining a reasoning trace that explains its recommendations, and providing transparency into how it arrived at its conclusions. This is particularly valuable in domains where explainability is important, such as healthcare, finance, or legal applications. The reasoning trace maintained by the recursive model provides a clear record of the model’s thinking process, making it easier to understand and verify the model’s decisions. Furthermore, the efficiency of tiny recursive models makes it possible to deploy reasoning capabilities in edge environments or on resource-constrained devices. A mobile application could potentially include reasoning capabilities that previously would have required cloud-based processing. This opens up new possibilities for intelligent applications that can operate offline or with minimal network connectivity.
The success of the Tiny Recursive Model challenges one of the most fundamental assumptions in modern AI development: the scaling laws that have guided the field for the past decade. The scaling laws suggest that performance improves predictably with increases in model size, training data, and computational resources. Larger models are better. More data is better. More compute is better. This assumption has driven the development of increasingly massive models, with companies investing billions of dollars in training models with hundreds of billions or even trillions of parameters. The Tiny Recursive Model suggests that this assumption may be incomplete or even misleading in certain contexts. By using a different architectural approach—recursive reasoning with deep supervision—a tiny model can achieve performance comparable to or better than massive models on certain tasks. This doesn’t mean that scaling laws are wrong; rather, it suggests that there are multiple paths to achieving high performance, and scaling up model size is just one of them. This has profound implications for the future of AI development. If smaller models can achieve comparable performance to larger models through clever architectural innovations, this could lead to a shift in how AI systems are developed and deployed. Rather than focusing exclusively on building larger and larger models, the field might shift toward developing more efficient architectures that can achieve high performance with fewer parameters. This would have significant benefits for the environment, for computational efficiency, and for accessibility. Training and running massive models requires enormous amounts of electricity and computational resources. If we can achieve similar performance with models that are orders of magnitude smaller, this would reduce the environmental impact of AI development and make AI more accessible to organizations with limited computational resources. The Tiny Recursive Model also suggests that the relationship between model size and generalization may be more complex than previously thought. Conventional wisdom suggests that larger models generalize better because they have more capacity to learn complex patterns. However, the Tiny Recursive Model shows that smaller models can generalize better if they’re designed with the right inductive biases. By building in mechanisms for iterative refinement and maintaining reasoning traces, the model is constrained to learn more generalizable solutions. This is an example of how architectural innovations can sometimes be more important than raw model size.
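For reference, the “predictable improvement” claim usually refers to power-law fits of the kind reported by Kaplan et al. (2020), where test loss falls smoothly as parameter count N grows (the exponent below is their fitted value, not a TRM result):

```latex
% Parameter scaling law (Kaplan et al., 2020): test loss falls as a
% power law in parameter count N, with N_c a fitted constant.
L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
```

TRM’s results don’t contradict such fits; they show that on specific reasoning benchmarks, a different architecture can sit far above the performance that parameter count alone would predict.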
One of the most striking aspects of the Tiny Recursive Model is its simplicity. The model uses only two layers and achieves its performance through recursive refinement rather than through architectural complexity. This simplicity is not accidental; it’s a deliberate design choice based on empirical findings. The researchers found that adding more layers actually decreased generalization due to overfitting. This is a counterintuitive finding that challenges conventional neural network design wisdom. Typically, we think of deeper networks as more powerful and capable of learning more complex functions. However, the Tiny Recursive Model shows that in the context of reasoning tasks, depth through recursion is more effective than depth through additional layers. Why is this the case? One explanation is that additional layers increase the model’s capacity to memorize specific patterns from the training data, which can lead to overfitting. By keeping the network shallow and instead increasing the number of recursion steps, the model is forced to learn more generalizable solutions. Each recursion step must work with the same two-layer network, so the network must learn to perform useful computations that can be applied iteratively. This constraint forces the network to learn more fundamental principles rather than memorizing specific patterns. Another explanation relates to the nature of the reasoning tasks. These tasks often involve iterative refinement and constraint satisfaction. A shallow network that’s applied recursively is well-suited to this type of problem because it can focus on making incremental improvements to the current solution. A deep network, by contrast, might try to solve the entire problem in a single forward pass, which is less effective for problems that require iterative refinement. The simplicity of the Tiny Recursive Model also has practical benefits. Simpler models are easier to understand, easier to debug, and easier to modify. If you want to understand why the model made a particular decision, you can trace through its reasoning process step by step. If you want to modify the model to handle a new type of problem, you can make targeted changes to the architecture or training procedure. This is in contrast to massive models with billions of parameters, which are essentially black boxes that are difficult to understand or modify. The principle that “less is more” extends beyond just the architecture of the model. The researchers also found that the model doesn’t need complex mathematical theorems or biological justifications to work effectively. The original hierarchical reasoning model relied on fixed-point theorems and biological arguments about how the brain operates. The Tiny Recursive Model works without these theoretical justifications. It’s simply a model that maintains two pieces of information and refines them iteratively. This suggests that sometimes the simplest explanation is the best one, and that we shouldn’t overcomplicate our models with theoretical justifications that may not be necessary.
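The arithmetic behind “depth through recursion” is simple. With hyperparameters assumed for illustration (not taken verbatim from the paper):

```python
# Illustrative "virtual depth" arithmetic: a 2-layer net applied
# n_inner + 1 times per cycle, across n_cycles supervision cycles.
layers, n_inner, n_cycles = 2, 6, 3
virtual_depth = layers * (n_inner + 1) * n_cycles
print(virtual_depth)  # 42 effective layers from a 2-layer network
```

The parameter count stays fixed at whatever the two layers hold, while the effective computation grows with the number of recursion steps.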
The success of the Tiny Recursive Model has significant implications for how AI systems will be developed and deployed in the future. First, it suggests that efficiency should be a primary design goal, not an afterthought. Rather than building massive models and then trying to compress them or optimize them for deployment, we should design models with efficiency in mind from the start. The Tiny Recursive Model shows that it’s possible to achieve high performance with a small, efficient model if you design the architecture carefully. Second, it suggests that architectural innovation may be more important than scale. While the field has focused heavily on scaling up models, the Tiny Recursive Model shows that clever architectural innovations can sometimes be more effective than simply making models larger. This could lead to a renewed focus on architecture design and a shift away from the “bigger is better” mentality that has dominated the field. Third, it suggests that reasoning capabilities can be built into models through architectural design rather than through scale. The Tiny Recursive Model achieves strong reasoning performance not because it’s a massive model, but because it’s designed with mechanisms for iterative refinement and reasoning trace maintenance. This could lead to new approaches for building reasoning capabilities into AI systems. Fourth, it has implications for how we evaluate and benchmark AI systems. The ARC AGI benchmark has proven to be a valuable tool for evaluating reasoning capabilities, and the success of the Tiny Recursive Model on this benchmark suggests that we should continue to develop benchmarks that test genuine reasoning rather than just pattern recognition or knowledge retrieval. Looking forward, there are several directions in which this research could be extended. One direction is to explore how recursive reasoning can be combined with other techniques, such as chain-of-thought prompting or retrieval-augmented generation. Another direction is to explore how recursive reasoning can be applied to other types of problems beyond visual reasoning tasks. A third direction is to explore how to scale recursive reasoning to larger models and see if the same principles apply. A fourth direction is to explore how to make the reasoning process more interpretable and transparent, so that users can understand how the model arrived at its conclusions.
The Tiny Recursive Model represents a significant breakthrough in artificial intelligence, demonstrating that smaller, more efficient models can achieve superior performance on complex reasoning tasks through clever architectural innovations. By combining recursive hierarchical reasoning with deep supervision, the model achieves 45% accuracy on ARC AGI 1 and 8% on ARC AGI 2 using only 7 million parameters—less than 0.01% of the parameters in frontier models like Gemini 2.5 Pro. This achievement challenges fundamental assumptions about AI development, suggesting that architectural innovation and efficiency should be prioritized alongside scale. The implications extend beyond academic research into practical applications, where organizations can leverage smaller, more efficient models to automate complex reasoning tasks at a fraction of the cost of massive frontier models. As the field continues to evolve, the principles demonstrated by the Tiny Recursive Model—simplicity, iterative refinement, and efficient architecture design—will likely become increasingly important in developing the next generation of AI systems.
Frequently asked questions

What is the Tiny Recursive Model (TRM)?
The Tiny Recursive Model is a 7 million parameter neural network that uses recursive hierarchical reasoning and deep supervision to achieve superior performance on complex reasoning tasks compared to much larger models like Gemini 2.5 Pro and DeepSeek.

How does TRM work?
TRM combines recursive reasoning (looping through refinement steps) with deep supervision (supervising each step and carrying learned features between steps). This allows the small model to think through problems iteratively, similar to human reasoning, rather than predicting answers in a single pass.

How does TRM perform on reasoning benchmarks?
TRM achieves 45% accuracy on ARC AGI 1 and 8% on ARC AGI 2, outperforming Gemini 2.5 Pro (4.9%), DeepSeek R1, and Claude 3.7, while using less than 0.01% of their parameters.

What is recursive reasoning?
Recursive reasoning allows the model to iteratively refine its answer by maintaining two key pieces of information: its current guess and the reasoning trace that led to it. This creates a feedback loop where the model can critique itself and revise its answers multiple times, similar to how humans solve complex problems through trial and refinement.

Why does deep supervision improve accuracy?
Deep supervision provides supervision signals at multiple steps during the reasoning process. Rather than only checking the final answer, the model receives feedback at each intermediate step, which doubled accuracy from 19% to 39% in initial experiments.
Arshia is an AI Workflow Engineer at FlowHunt. With a background in computer science and a passion for AI, he specializes in creating efficient workflows that integrate AI tools into everyday tasks, enhancing productivity and creativity.


