Defeating Non-Determinism in LLMs: Solving AI's Reproducibility Crisis


Introduction

The reproducibility crisis in artificial intelligence has long been a thorn in the side of researchers, engineers, and enterprises relying on large language models. When you ask ChatGPT the same question twice, you rarely get identical answers—a phenomenon that undermines scientific rigor and practical reliability. Recently, Mira Murati, the former CTO of OpenAI, launched Thinking Machines Lab with a bold mission: to solve one of AI’s most fundamental problems—non-determinism in LLM inference. Through their research blog, Connectionism, they’ve published groundbreaking work on defeating non-determinism, revealing not only the root causes of this inconsistency but also practical solutions that could transform how we build and trust AI systems. This article breaks down their findings, explains the technical mechanisms behind LLM variability, and explores the implications for the future of AI reliability.

Understanding Non-Determinism: The Core Problem

Non-determinism in large language models is a deceptively simple concept with profound implications. When you provide the same exact prompt to an LLM multiple times, you receive different responses—sometimes subtly different, sometimes dramatically different. This inconsistency violates one of the fundamental principles of scientific methodology: reproducibility. Reproducibility is considered the bedrock of scientific progress, yet it remains remarkably difficult to achieve with modern large language models. The problem isn’t merely an inconvenience; it represents a critical vulnerability in the deployment of AI systems across industries where consistency and reliability are paramount. Whether you’re using an LLM for medical diagnosis support, legal document analysis, financial forecasting, or scientific research, the inability to reproduce results creates a cascade of downstream problems that affect trust, validation, and regulatory compliance.

The manifestation of non-determinism is observable and frustrating. Run the same prompt through an LLM ten times, and you might get ten different responses. Even when you attempt to eliminate randomness by setting the temperature parameter to zero—which theoretically should produce deterministic outputs—the model still generates different results. This persistence of variability even under supposedly deterministic conditions puzzled researchers for years. The conventional wisdom suggested that this was simply how language models worked, an inherent characteristic of the technology. However, Thinking Machines’ research reveals that this assumption was incomplete. The true causes of non-determinism are far more specific and, importantly, addressable through targeted technical interventions.

Why Reproducibility Matters: The Business and Scientific Case

The importance of defeating non-determinism extends far beyond academic curiosity. In practical terms, reproducibility is essential for building trustworthy AI systems that organizations can confidently deploy in production environments. When an LLM produces inconsistent outputs, it becomes nearly impossible to debug problems effectively. If a model generates an incorrect or harmful response, engineers cannot reliably reproduce the issue to understand what went wrong. This makes it extraordinarily difficult to identify whether the problem stems from the model itself, the prompt engineering, the data, or some other factor. Debugging becomes a game of chance rather than a systematic process of elimination.

Beyond debugging, reproducibility is critical for auditing and verification. Regulatory bodies, compliance officers, and security teams need to understand how AI systems make decisions. When outputs are non-deterministic, auditing becomes a nightmare. You cannot trace a specific output back to its causes with certainty. This is particularly problematic in regulated industries like healthcare, finance, and law, where explainability and auditability are legal requirements. Additionally, benchmarking becomes unreliable when inputs and outputs are non-deterministic. If you’re comparing two models or two versions of the same model, you need stable, reproducible results to make meaningful comparisons. Non-determinism introduces noise into benchmarks, making it difficult to determine whether performance differences are real or artifacts of randomness.

From a user trust perspective, reproducibility is equally important. Users want to know that when they ask an AI system a question, they’ll get a consistent, reliable answer. If the same question produces wildly different responses, users lose confidence in the system. This is particularly true for applications where users rely on the AI for decision support or information retrieval. Furthermore, reproducibility enables better prompt engineering and optimization. If you can’t reproduce results, you can’t systematically improve your prompts or understand which variations actually work better.

The Technical Roots of Non-Determinism: Floating-Point Arithmetic and Concurrent Execution

The conventional hypothesis for why LLMs produce non-deterministic results has centered on two technical factors: floating-point non-associativity and concurrent execution on GPUs. Understanding these concepts requires diving into the mathematical and computational foundations of how neural networks operate. Floating-point numbers are the standard way computers represent decimal numbers—values like 5.23 or 3.14159. However, computers cannot store infinite precision. At some point, you must round the number to fit it into a fixed amount of memory. This rounding introduces a tiny amount of error, and when you perform millions or billions of mathematical operations, these tiny errors can accumulate and compound.

The non-associativity aspect is particularly important. In pure mathematics, addition is associative: (a + b) + c equals a + (b + c). However, in floating-point arithmetic, this is not always true due to rounding errors. Depending on the order in which you add numbers, you might get slightly different results. This might seem trivial, but in the context of neural network computations involving billions of parameters and operations, these tiny differences can propagate through the network and eventually affect which token the model selects as its next output.
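
A few lines of plain Python make the non-associativity concrete. The values below are chosen only to make the rounding visible; any sufficiently mismatched magnitudes behave the same way.

```python
# Floating-point addition is not associative: regrouping the same three
# numbers changes the rounded result.
a, b, c = 0.1, 1e16, -1e16

left = (a + b) + c    # 0.1 is absorbed into 1e16 and lost to rounding
right = a + (b + c)   # the large values cancel first, so 0.1 survives

print(left)           # 0.0
print(right)          # 0.1
print(left == right)  # False
```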

The second factor is concurrent execution on GPUs. Graphics Processing Units are designed to perform many calculations simultaneously. When you give a GPU a mathematical operation, it doesn’t execute sequentially; instead, it distributes the work across thousands of cores running in parallel. The problem is that you often don’t know which core will finish first. This non-deterministic ordering of completion can affect the final result, particularly when operations depend on each other or when results are aggregated. Some specialized hardware, like chips from companies such as Groq, addresses this by using deterministic, statically scheduled architectures where you know exactly how long each operation will take. However, most GPUs don’t have this property.
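
The effect of unpredictable completion order can be mimicked without a GPU: summing the same numbers in two different orders is a rough stand-in for thousands of cores finishing at unpredictable times and having their partial results combined in whatever order they arrive.

```python
import random

# Sum the same one million values in two different orders, as a stand-in
# for parallel partial sums being combined in whatever order cores finish.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(1_000_000)]

ordered_total = sum(values)

shuffled = list(values)
random.shuffle(shuffled)
shuffled_total = sum(shuffled)

# The totals typically differ in the last few bits, even though the
# mathematical sum is identical.
print(ordered_total == shuffled_total)
print(abs(ordered_total - shuffled_total))
```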

The Real Culprit: Batch Size Variability

While the floating-point and concurrent execution hypotheses contain elements of truth, Thinking Machines’ research reveals that they don’t tell the complete story. The real culprit behind non-determinism in LLMs is batch size variability. To understand this, imagine a carpool system. When you submit a prompt to an LLM, it doesn’t process your request in isolation. Instead, your request gets grouped with other requests into a batch—a carpool of queries. When the system is busy, the carpool is large, containing many requests. When the system is quiet, the carpool is small. This batch size is not fixed; it changes dynamically based on system load.

The critical insight is that batch size affects the order in which tiny mathematical operations are performed inside the neural network. Different batch sizes can cause the same mathematical operations to be executed in different orders. While the mathematical operations themselves might be identical, the order matters due to floating-point non-associativity. A slightly different order of operations leads to slightly different intermediate results, which can cascade through the network and eventually change which token the model selects as its next output. Since LLMs work by predicting one token at a time, and each token prediction depends on all previous predictions, a single difference early in the generation process can lead to completely different outputs by the end.
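
This effect can be observed directly, assuming PyTorch is available. The same request computed alone and computed inside a larger batch is mathematically identical, yet the library may choose a different reduction strategy for each shape; whether the bits actually differ depends on the hardware and backend, which is exactly the problem.

```python
import torch

torch.manual_seed(0)
weights = torch.randn(4096, 4096)
batch = torch.randn(8, 4096)  # eight requests sharing one "carpool"

alone = batch[:1] @ weights        # the first request processed by itself
in_batch = (batch @ weights)[:1]   # the same request processed inside the batch

# Mathematically identical, but bitwise equality depends on how the kernel
# splits the reduction for each shape (GPUs differ more often than CPUs).
print(torch.equal(alone, in_batch))
print((alone - in_batch).abs().max())
```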

This is a subtle but profound insight. It means that the non-determinism isn’t inherent to the model architecture or the fundamental nature of neural networks. Rather, it’s a consequence of how batching is implemented during inference. The batch size is a variable that changes based on system conditions, and this variability directly translates into output variability. This discovery is important because it suggests that the problem is solvable through careful engineering of the inference pipeline.

The Solution: Batch Invariant Kernels and Deterministic Processing

Thinking Machines’ solution to non-determinism involves three coordinated technical fixes, collectively referred to as batch invariant kernels. The first fix ensures that regardless of batch size, the computational operations are weighted and normalized consistently. Using a restaurant analogy, imagine you’re creating bowls of food. You need to ensure that each bowl is weighed the same, whether the kitchen is crowded or empty. This means implementing computational kernels that maintain consistent normalization and weighting regardless of how many requests are in the batch. The trade-off is that you might lose some speed—the system might process requests slightly more slowly to maintain consistency. However, the consistency gained is far more valuable than the marginal speed loss.
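
A toy NumPy sketch of the idea, not Thinking Machines' actual GPU kernels: each row's normalization reduction is performed in a fixed order that depends only on that row, never on how many other requests happen to share the batch. In the underlying research, this class of fix targets normalization kernels such as RMSNorm.

```python
import numpy as np

def rmsnorm_batch_invariant(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Toy batch-invariant normalization: every row is reduced in the same
    fixed left-to-right order, so one request's result does not change when
    more requests are added to the batch."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(x.shape[0]):           # each request handled independently
        acc = 0.0
        for v in x[i]:                    # fixed reduction order per row
            acc += float(v) * float(v)
        rms = np.sqrt(acc / x.shape[1] + eps)
        out[i] = x[i] / rms
    return out

# The first row's output is identical whether it arrives alone or in a batch of 8.
data = np.random.default_rng(0).normal(size=(8, 512))
assert np.array_equal(rmsnorm_batch_invariant(data[:1])[0],
                      rmsnorm_batch_invariant(data)[0])
```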

The second fix involves keeping the mixing step identical across all batch sizes. In neural network computations, there are mixing operations where different components are combined. These operations must be performed in exactly the same way regardless of batch size. This requires careful implementation of the computational kernels to ensure that the order and method of mixing remain constant. Again, this might introduce some computational overhead, but the benefit of deterministic outputs justifies the cost.
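
In the same illustrative spirit, a matrix multiply (the "mixing" of hidden features) can be made batch-invariant by always splitting its shared reduction dimension into the same fixed-size tiles, accumulated in the same order, regardless of batch size. Real kernels normally pick tile sizes based on the batch shape for speed, which is exactly what breaks invariance; this NumPy sketch only illustrates the fixed-tiling idea.

```python
import numpy as np

def matmul_fixed_tiles(x: np.ndarray, w: np.ndarray, k_tile: int = 128) -> np.ndarray:
    """Toy matmul whose reduction over the shared K dimension always uses the
    same tile size and the same accumulation order, independent of batch size."""
    out = np.zeros((x.shape[0], w.shape[1]), dtype=np.float64)
    for start in range(0, x.shape[1], k_tile):   # identical tiling for any batch
        stop = min(start + k_tile, x.shape[1])
        out += x[:, start:stop].astype(np.float64) @ w[start:stop, :].astype(np.float64)
    return out
```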

The third fix addresses the attention mechanism, which is central to transformer-based language models. The attention mechanism allows the model to look back at what it has previously written and weight different parts of the text differently. When text is processed in chunks of different sizes, the order of operations in the attention mechanism can change. The solution is to use the same chunk size every single time, ensuring that the attention mechanism processes information in a consistent order. This consistency in attention processing is crucial for deterministic outputs.
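
The attention fix can be sketched the same way: process the keys and values in fixed-size chunks with an online softmax, so the accumulation order over the sequence is always identical regardless of how the text was scheduled or split. This is a toy single-query version and assumes nothing about the real GPU kernels.

```python
import numpy as np

def attention_fixed_chunks(q, keys, values, chunk_size: int = 256):
    """Single-query attention over the KV cache, accumulated in fixed-size
    chunks (online softmax), so the reduction order never changes."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    running_max = -np.inf                 # running maximum of the scores
    denom = 0.0                           # running softmax denominator
    acc = np.zeros(values.shape[-1], dtype=np.float64)

    for start in range(0, keys.shape[0], chunk_size):   # chunk size never varies
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = (k_chunk @ q) * scale
        new_max = max(running_max, float(scores.max()))
        correction = np.exp(running_max - new_max) if np.isfinite(running_max) else 0.0
        weights = np.exp(scores - new_max)
        denom = denom * correction + float(weights.sum())
        acc = acc * correction + weights @ v_chunk
        running_max = new_max

    return acc / denom
```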

Validation and Results: Proof of Concept

The true test of any scientific claim is empirical validation. Thinking Machines tested their solution using Qwen3-235B-A22B-Instruct, a large language model, and ran a rigorous experiment. They generated 1,000 completions at temperature zero (the lowest randomness setting) using the same prompt: “Tell me about Richard Feynman.” Each completion generated 1,000 tokens. Before implementing their batch invariant kernels, the results were disappointing but revealing. Out of 1,000 completions, they generated 80 unique responses, with the most common response appearing only 78 times. This baseline demonstrated the severity of the non-determinism problem—even with temperature set to zero, the model produced 80 different outputs.

After enabling their batch invariant kernels, the results were dramatic: all 1,000 completions were identical. Perfect determinism was achieved. This wasn’t a marginal improvement or a partial solution; it was a complete elimination of non-determinism. Every single run produced the exact same output. This validation is crucial because it proves that the problem is indeed solvable and that the proposed solution actually works. The experiment was conducted with a real, production-scale language model, not a toy model or simplified version, which makes the results even more significant.
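
The experiment is straightforward to replicate against any OpenAI-compatible inference endpoint. The sketch below uses a hypothetical local server URL and model name purely for illustration; substitute your own deployment.

```python
from collections import Counter
from openai import OpenAI  # any OpenAI-compatible server can be targeted

# Hypothetical endpoint and model name, for illustration only.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

outputs = []
for _ in range(1000):
    response = client.completions.create(
        model="my-deployed-model",
        prompt="Tell me about Richard Feynman",
        temperature=0,
        max_tokens=1000,
    )
    outputs.append(response.choices[0].text)

counts = Counter(outputs)
print(f"unique completions: {len(counts)}")
print(f"most common completion appears {counts.most_common(1)[0][1]} times")
# Thinking Machines reported 80 unique completions before the fix and a
# single identical completion across all 1,000 runs after it.
```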

Implications for AI Trust, Debugging, and Auditing

The achievement of deterministic LLM outputs has far-reaching implications for how we build, deploy, and trust AI systems. First and foremost, determinism enables reliable debugging. When a model produces an incorrect or unexpected output, engineers can now reproduce the issue consistently. This transforms debugging from a frustrating game of chance into a systematic process. Engineers can trace the exact computation path that led to the problematic output, identify where the error occurred, and implement fixes with confidence that they’ve actually solved the problem.

Second, determinism dramatically improves auditability. Regulatory bodies, compliance officers, and security teams can now audit AI systems with much greater confidence. When you can reproduce outputs consistently, you can trace the exact factors that influenced a decision. This is particularly important in regulated industries like healthcare, finance, and law, where explainability is a legal requirement. Auditors can verify that the model is behaving as expected and that it’s not producing biased or harmful outputs due to non-deterministic randomness.

Third, benchmarking becomes far more reliable. When comparing two models or two versions of the same model, researchers can now be confident that performance differences are real and not artifacts of randomness. This enables more rigorous scientific evaluation of AI systems and more informed decisions about which models to deploy. Additionally, determinism enables better prompt engineering and optimization. Researchers can systematically test different prompts and measure their effects with confidence that the results are reproducible.

FlowHunt Application: Automating Reliable AI Workflows

For organizations using FlowHunt to automate their AI workflows, the implications of deterministic LLMs are significant. FlowHunt enables users to build complex, multi-step AI workflows that integrate language models with other tools and processes. When LLMs are non-deterministic, these workflows become unreliable—the same input might produce different outputs, leading to inconsistent downstream results. With deterministic LLMs, FlowHunt users can build workflows with much greater confidence in their reliability and consistency.

FlowHunt’s automation capabilities are particularly valuable when combined with deterministic LLMs. Users can create workflows that depend on specific LLM outputs, knowing that those outputs will be consistent and reproducible. This enables more sophisticated automation, better error handling, and more reliable integration with other systems. For example, in a workflow that extracts information from documents using an LLM, teams can now be confident that the same document will always produce the same extracted information. This consistency is crucial for building trustworthy, production-grade AI automation.

Advanced Considerations: When Determinism Isn’t Desired

While deterministic outputs are generally desirable, there are important use cases where non-determinism is actually beneficial. Creative writing is the most obvious example. If you’re using an LLM to generate creative content—stories, poetry, marketing copy—you probably want variability. You want the model to generate different creative outputs each time you run it, not the same output repeatedly. In these cases, users would want to disable deterministic mode and allow the model to generate varied outputs.

Similarly, in brainstorming or ideation applications, variability can be valuable. If you’re using an LLM to generate multiple ideas or perspectives on a topic, you want different outputs, not the same output repeated. The solution is to make determinism optional—users can enable it when they need reproducibility and disable it when they want variability. This flexibility is important for ensuring that deterministic LLMs don’t unnecessarily constrain use cases where variability is beneficial.

The Broader Impact on AI Development and Deployment

The work by Thinking Machines on defeating non-determinism represents a significant step forward in making AI systems more reliable, trustworthy, and production-ready. This research addresses a fundamental problem that has plagued the AI industry since the emergence of large language models. By solving this problem, Thinking Machines is enabling a new generation of AI applications that can be deployed with greater confidence in regulated industries and mission-critical applications.

The implications extend beyond just LLMs. The techniques developed for achieving deterministic LLM inference could potentially be applied to other types of neural networks and AI systems. The principles of batch invariant kernels and consistent computational ordering are general principles that could improve the reliability of AI systems across the board. As AI becomes increasingly integrated into critical infrastructure and decision-making processes, the importance of reproducibility and determinism will only grow.

Furthermore, this work highlights the importance of fundamental research in AI. While much of the AI industry focuses on scaling models and adding new capabilities, research like this addresses foundational issues that enable better deployment and trust in AI systems. The fact that a former OpenAI CTO is dedicating her efforts to solving this problem underscores its importance and suggests that the AI industry is beginning to recognize that reliability and reproducibility are just as important as raw capability.

Conclusion

Mira Murati’s Thinking Machines Lab has identified and solved a critical problem in large language model inference: non-determinism. By recognizing that batch size variability, rather than floating-point arithmetic or GPU concurrency alone, is the primary cause of non-deterministic outputs, and by developing batch invariant kernels to address this issue, they’ve demonstrated that deterministic LLM inference is achievable. Their experimental validation using Qwen3-235B-A22B-Instruct showed that perfect determinism is possible—all 1,000 test completions were identical after implementing their solution. This breakthrough has profound implications for AI trust, debugging, auditing, and the deployment of AI systems in regulated industries. As organizations increasingly rely on LLMs for critical applications, the ability to produce reproducible, deterministic outputs will become a fundamental requirement for production-grade AI systems.

Frequently asked questions

What is non-determinism in large language models?

Non-determinism in LLMs refers to the phenomenon where the same input prompt produces different outputs each time it's run. This occurs due to floating-point arithmetic precision, concurrent GPU execution, and batch size variations, making it difficult to reproduce results consistently.

Why is defeating non-determinism important for AI systems?

Defeating non-determinism is crucial for trust, debugging, auditing, and verification of AI systems. When outputs are reproducible, benchmarks become more reliable, users can place greater trust in results, and it becomes easier to understand why a model produces specific outputs.

What is batch invariant kernel technology?

Batch invariant kernels are a technical solution that ensures LLM computations produce identical results regardless of batch size. By maintaining consistent processing order and computational steps, this technology eliminates the variability caused by different batch sizes during inference.

How does Thinking Machines' solution work?

Thinking Machines' solution involves three key fixes: maintaining consistent batch weighting regardless of system load, keeping the mixing step identical across all batches, and processing attention mechanisms in the same order. These changes ensure deterministic outputs while maintaining reasonable performance.

What are the practical applications of deterministic LLMs?

Deterministic LLMs are valuable for scientific research, regulatory compliance, debugging, auditing, benchmarking, and any application where reproducibility is critical. However, they may be less desirable for creative applications where variability is beneficial.

Arshia is an AI Workflow Engineer at FlowHunt. With a background in computer science and a passion for AI, he specializes in creating efficient workflows that integrate AI tools into everyday tasks, enhancing productivity and creativity.

Arshia Kahani
AI Workflow Engineer

