Taylor was in the boardroom demo. Same prompt, same customer data. He’d run it twenty times that morning — always perfect. Live, in front of the CEO and three VCs, the AI budgeting coach suggested cancelling his gym membership. That morning it had suggested meal-prep kits. Same inputs. Different output. The room went dead quiet.
Someone muttered, “So… we’re shipping a coin flip?”
This is the moment every AI PM hits. Large language models are fundamentally probabilistic: every token is sampled from a probability distribution. Temperature and top-p (nucleus) sampling are deliberate ways to introduce controlled randomness, because that randomness is what makes the model creative, robust, and actually useful.
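To make the knobs concrete, here is a minimal sketch of how a single token gets picked from raw logits. This is an illustrative toy, not any provider's actual decoder: real inference stacks do this on GPUs over ~100k-token vocabularies, but the mechanics are the same.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0):
    """Pick one token id from raw logits using temperature and top-p (nucleus) filtering."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature: higher T flattens the distribution (more randomness).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the renormalized kept set.
    r = random.random() * sum(probs[i] for i in kept)
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

Note what the sketch makes obvious: at `temperature=0.0` the randomness disappears entirely, and tightening `top_p` shrinks the pool of candidates. Every knob trades variety for predictability.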
Even at temperature = 0.0 (greedy decoding), real production deployments are rarely bit-for-bit identical. Batch-size-dependent numerics, floating-point non-associativity on GPUs, tiny differences in parallel execution. Different load, different hardware — same prompt, different tokens.
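The floating-point point is easy to verify yourself. GPU kernels sum numbers in whatever order the hardware schedules, and because float addition is not associative, a different order can mean a different result:

```python
# Floating-point addition is not associative: grouping changes the bits.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # False
```

Scale that up to billions of additions per forward pass, reordered by batch size and kernel choice, and "greedy decoding" can still tip a close token race one way or the other.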
This isn’t a bug. It’s the architecture that lets the model handle the messy, ambiguous real world. Make it fully deterministic and you lose the intelligence that justifies using it.
Product Consequences You’ll Feel Every Day
▸ Testing is statistical, not deterministic. One run means nothing. You need evals on hundreds of cases.
▸ Users will see different answers tomorrow than today for the “same” request.
▸ You can never promise “exact” behavior — only “consistently in this range.”
▸ Monitoring shifts from “did it break?” to “how often is it drifting outside acceptable bounds?”
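The "testing is statistical" shift above can be sketched as an aggregate release gate. The model call here is a hypothetical stand-in for your real eval harness; the point is that you ship on a pass rate over many cases and repeats, never on one run:

```python
def pass_rate(run_case, cases, trials=3):
    """Estimate how often a sampled output passes, across cases and repeats.

    `run_case(case)` is a placeholder (an assumption, not a real API):
    it runs the model once on `case` and returns True if the output passes.
    """
    results = [run_case(case) for case in cases for _ in range(trials)]
    return sum(results) / len(results)

# Toy deterministic stand-in: "fails" on every tenth case.
rate = pass_rate(lambda case: case % 10 != 0, cases=range(100))
# Gate the release on the aggregate rate, not on any single demo.
RELEASE_BAR = 0.85
ship = rate >= RELEASE_BAR
```

Taylor's boardroom demo failed precisely because twenty green runs in the morning is still just a sample, not a guarantee.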
This is exactly why the Control vs. Convenience trade-off in Chapter 14 bites so hard. Raw frontier APIs = maximum intelligence, maximum variability. Heavy guardrails = more control, less magic.
The Practical Playbook
How winning teams design for a non-deterministic world in 2026.
1. Surface confidence or ranges when it matters. Don’t pretend the answer is certain when it isn’t.
2. Run 3–5 rollouts and pick the best (self-consistency). Cheap insurance against bad samples.
3. Add deterministic post-processing layers — rules, templates, validation — on top of probabilistic outputs.
4. Let users toggle a “more consistent” mode. Give them control to trade magic for reliability.
5. Be radically transparent: “Here’s one possible summary — want me to try again?”
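The self-consistency item is simple enough to sketch in a few lines: sample several rollouts for the same prompt and keep the majority answer. The `generate` callable here is a hypothetical stand-in for your model call:

```python
from collections import Counter

def self_consistent_answer(generate, prompt, n=5):
    """Self-consistency: sample n rollouts and return the most common answer.

    `generate(prompt)` is a placeholder for a real model call; each
    invocation may return a different completion for the same prompt.
    """
    samples = [generate(prompt) for _ in range(n)]
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / n  # the answer plus a crude agreement score

# Toy stand-in model whose output varies across calls.
outputs = iter(["cancel gym", "meal-prep kits", "meal-prep kits",
                "meal-prep kits", "cancel gym"])
answer, agreement = self_consistent_answer(lambda p: next(outputs), "budget tips")
```

The agreement score is a bonus: when it is low, that is exactly the moment to surface uncertainty to the user instead of picking a winner silently.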
Maya’s turnaround (Ch 15). Her email triage feature went from 17% adoption to 54% the day she added “Why did you flag this?” explanations and deterministic rules on top of the AI suggestions. Users stopped seeing it as a coin flip and started seeing it as a helpful, fallible teammate.
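A deterministic rules layer like Maya’s can be sketched as a thin wrapper around the model’s suggestion. The field names and rules below are illustrative assumptions, not her real system; the shape is what matters: hard rules override, everything else passes through with an explanation attached.

```python
def triage_with_rules(ai_label, email):
    """Deterministic post-processing: rules override the AI label and always
    return a 'why' string. Fields and rules here are hypothetical examples."""
    # Hard rules win regardless of what the model suggested.
    if "unsubscribe" in email["subject"].lower():
        return "low-priority", "rule: unsubscribe notices are never urgent"
    if email["sender"] == "ceo@example.com":
        return "urgent", "rule: messages from the CEO are always urgent"
    # Otherwise keep the probabilistic suggestion, with an honest explanation.
    return ai_label, "model suggestion (no rule matched)"

label, why = triage_with_rules(
    "urgent", {"subject": "Unsubscribe confirmation", "sender": "news@example.com"})
```

The returned explanation string is the “Why did you flag this?” surface: even when the model decides, the user sees that it was the model, and why nothing overrode it.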
Ask Your DS Team
1. “What’s our current temperature setting and how much does output variance change if we drop it?”
2. “How can we add self-consistency or validation without killing latency or cost?”
3. “What user research have we done on how much variation people will actually tolerate?”
You wanted a system that could handle reality’s messiness. This is what it costs.
Accept the cliff. Build the guardrails.
Ship anyway. That’s the job now.