Taylor was in the boardroom demo. Same prompt, same customer data. He’d run it twenty times that morning — always perfect. Live, in front of the CEO and three VCs, the AI budgeting coach suggested cancelling his gym membership. That morning it had suggested meal-prep kits. Same inputs. Different output. The room went dead quiet.
Someone muttered, “So… we’re shipping a coin flip?”
This is the moment every AI PM hits. Large language models are fundamentally probabilistic. At each step the model produces a probability distribution over possible next tokens, and one token is sampled from it. Temperature and top-p (also called nucleus sampling) are deliberate controls over how much randomness that sampling step is allowed, because that randomness is part of what makes the model creative, robust, and actually useful.
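To make the sampling step concrete, here is a minimal sketch of temperature scaling followed by top-p filtering, using only the Python standard library. The logits, the suggestion labels, and the function names are illustrative assumptions, not any vendor's actual implementation.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their combined mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token index. Temperature 0 is treated as greedy decoding (argmax)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Hypothetical example: three candidate suggestions with made-up scores.
suggestions = ["meal-prep kits", "cancel gym membership", "switch phone plan"]
logits = [2.0, 1.6, 0.5]
rng = random.Random(7)
draws = [suggestions[sample_token(logits, 0.8, 0.9, rng)] for _ in range(5)]
print(draws)  # same inputs every time, yet the picks can differ from draw to draw
```

Lowering the temperature or top-p narrows the set of likely picks; it does not change the fact that a pick is being drawn.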
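Even at temperature = 0.0 (greedy decoding), real production deployments are rarely bit-for-bit identical. Floating-point addition is not associative, so the order in which a GPU adds numbers changes the result by a tiny amount, and that order depends on batch size and on how the work is split across parallel threads. When two candidate tokens are nearly tied, a difference in the last decimal place is enough to flip the pick, and every later token builds on that one. Different load, different hardware: same prompt, different tokens.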
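The floating-point point is easy to verify on any laptop. The sketch below is a toy illustration, not a reproduction of GPU kernels: it adds the same three numbers in two different orders, then shows how that last-digit difference can change a greedy pick between two nearly tied candidates. The token names and scores are made up for the example.

```python
def greedy_pick(logits):
    """Greedy decoding: return the label with the highest score (ties go to the first label)."""
    return max(logits, key=logits.get)

# The same three terms, summed in two different orders.
a, b, c = 0.1, 0.2, 0.3
forward = (a + b) + c    # one summation order
backward = a + (b + c)   # another order, mathematically identical
print(forward == backward)  # False: the results differ in the last digit

# A hypothetical rival token whose score sits right at the tie point.
rival_score = 0.6

pick_under_forward = greedy_pick({"meal-prep kits": rival_score, "cancel gym membership": forward})
pick_under_backward = greedy_pick({"meal-prep kits": rival_score, "cancel gym membership": backward})
print(pick_under_forward, "|", pick_under_backward)
```

No randomness is involved here at all; the only thing that changed between the two picks is the order of addition.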
This isn’t a bug. It’s the architecture that lets the model handle the messy, ambiguous real world. Clamp it down until every output is fixed in advance and you give up much of the flexibility that justifies using it.
Product Consequences You’ll Feel Every Day
▸ Testing is statistical, not deterministic. One run means nothing. You need evals on hundreds of cases.
▸ Users will see different answers tomorrow than today for the “same” request.
▸ You can never promise “exact” behavior — only “consistently in this range.”
▸ Monitoring shifts from “did it break?” to “how often is it drifting outside acceptable bounds?”
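One way to put "statistical, not deterministic" into practice is to report a pass rate with a confidence interval instead of a single pass/fail. The sketch below assumes a hypothetical stand-in model (`fake_budget_coach`), a made-up acceptability rule, and an arbitrary 85% bar; a real eval would call the actual system and use the team's own rubric.

```python
import math
import random

def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% confidence)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - margin, center + margin

def evaluate(model, cases, is_acceptable, runs_per_case=5):
    """Run every case several times and count how many outputs are acceptable."""
    passes = trials = 0
    for case in cases:
        for _ in range(runs_per_case):
            trials += 1
            passes += bool(is_acceptable(case, model(case)))
    return passes, trials

# Hypothetical stand-in for the real model: usually gives the safe suggestion.
_rng = random.Random(42)
def fake_budget_coach(case):
    return "meal-prep kits" if _rng.random() < 0.9 else "cancel gym membership"

# Hypothetical rubric: only suggestions on an approved list count as acceptable.
APPROVED = {"meal-prep kits", "switch phone plan"}
def is_acceptable(case, output):
    return output in APPROVED

cases = [f"customer-{i}" for i in range(100)]  # 100 cases x 5 runs = 500 trials
passes, trials = evaluate(fake_budget_coach, cases, is_acceptable)
low, high = wilson_interval(passes, trials)
print(f"pass rate {passes / trials:.1%}, 95% interval [{low:.1%}, {high:.1%}]")

# Monitoring question: is even the pessimistic end of the interval above the bar?
ACCEPTABLE_FLOOR = 0.85  # assumed threshold for illustration
print("within bounds" if low >= ACCEPTABLE_FLOOR else "drifting: investigate")
```

Comparing the lower end of the interval to the threshold, and not the raw pass rate, keeps a lucky batch of runs from hiding a real regression.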
This is exactly why the Control vs. Convenience trade-off in Chapter 14 bites so hard. Raw frontier APIs = maximum intelligence, maximum variability. Heavy guardrails = more control, less magic.