Lena thought fine-tuning would be her silver bullet. As a PM at a fast-growing legaltech startup, she was tired of the base model ignoring their clause library. “Just fine-tune it on our 5,000 approved contracts,” she told engineering. Six weeks and $42K in labeling and GPU time later, the model went live.
First week: brand voice finally perfect. Second week: it confidently invented clauses that never existed. Legal almost had a heart attack. The model hadn’t “learned” new facts — it overfit to patterns and filled gaps with high-confidence nonsense.
Fine-tuning takes a pre-trained model and continues training on a small, high-quality dataset of your input–output pairs. You’re nudging the probability distribution so outputs look more like yours.
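In practice, that nudging is an ordinary training loop over your pairs. Here is a minimal sketch using the Hugging Face transformers library, assuming a JSONL file of prompt/completion pairs; the checkpoint name, file path, and hyperparameters are placeholders, not a recommendation:

```python
# Minimal supervised fine-tuning sketch. Assumptions: "contracts.jsonl"
# holds {"prompt": ..., "completion": ...} records, and "gpt2" stands in
# for whatever causal LM checkpoint you actually use.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files="contracts.jsonl")["train"]

def tokenize(example):
    # Concatenate prompt and completion into one training string.
    text = example["prompt"] + "\n" + example["completion"]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # Causal LM collator: copies input_ids into labels for next-token loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```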
What fine-tuning does well:
▸ Tone, style, and voice consistency.
▸ Format adherence (JSON, templates).
▸ Domain adaptation (legal and medical jargon).
▸ Efficiency on narrow tasks.
What it won’t do:
▸ Reliably add new factual knowledge (use RAG; see the retrieval sketch below).
▸ Fix reasoning weaknesses.
▸ Make a mediocre model brilliant.
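For contrast, here is what “use RAG” means in practice: retrieve the relevant clause at inference time and hand it to the model, instead of trying to bake facts into the weights. A minimal sketch with the sentence-transformers library; the clause texts, question, and model name are illustrative assumptions:

```python
# Minimal retrieve-then-prompt sketch. `clause_library` is a toy stand-in
# for a real clause store; production RAG would use a vector database.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

clause_library = [
    "Either party may terminate with 30 days written notice.",
    "Liability is capped at fees paid in the prior 12 months.",
]
clause_vectors = embedder.encode(clause_library, convert_to_tensor=True)

question = "What is our standard termination notice period?"
query_vector = embedder.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_vector, clause_vectors, top_k=1)[0]

# Ground the model in retrieved text instead of hoping it memorized it.
context = clause_library[hits[0]["corpus_id"]]
prompt = f"Answer using only this clause:\n{context}\n\nQ: {question}"
print(prompt)  # send to your LLM of choice
```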
“99% of problems don’t require fine-tuning… Fine-tuning should be your last resort, not the first step.” — Santiago (@svpino), June 2025
When it is the right move, production wins usually come from LoRA or QLoRA: tiny adapter layers trained at roughly 1/100th the cost of full fine-tuning. Elliot Arledge’s comparison makes the point: instead of $10K to fine-tune a 32B-parameter model, run multiple rollouts plus introspection for $18.
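For the curious, here is roughly what a LoRA setup looks like with the peft library. The base checkpoint, rank, and target modules below are illustrative assumptions; the point is how little of the model actually trains:

```python
# LoRA sketch: freeze the base model, train small low-rank adapters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

lora = LoraConfig(
    r=8,                        # adapter rank: the "tiny" in tiny adapters
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2 attention projection; varies by model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```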
Trinity Impact
▸ Data: Where most teams die. You need hundreds to thousands of clean, consistent examples. Garbage in = very expensive garbage out.
▸ Models: You trade some general capability for narrow reliability. Expect to re-tune every quarter as your product evolves.
▸ UX: More consistent, on-brand outputs = faster trust. But failures now feel more confident, which hurts worse.
Kai at the DTC apparel brand tried fine-tuning for product descriptions. It worked, but only after three months of cleaning historical data. His exact words: “Fine-tuning didn’t save us time. It finally forced us to fix our data mess. That was the real win.”
PM Monday Checklist
Before you greenlight fine-tuning, answer all four honestly.
1. Have you maxed out prompting + RAG? If not, start there. Most teams skip this step and regret it.
2. Do you have 500+ gold-standard examples? Clean, consistent, representative of production. Not “we can scrape some.”
3. Is the use case high-volume and repetitive? Fine-tuning pays off on narrow, repeated tasks, not broad, creative ones.
4. Will you monitor for drift? Catastrophic forgetting and distribution shift are real. Budget for quarterly re-tuning (a minimal drift check is sketched below).
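A drift check does not have to be fancy to be better than nothing. Here is a minimal sketch using a two-sample KS test on a crude proxy (output length in words); the two lists are placeholders for whatever your logging pipeline actually stores, and real monitoring would compare embeddings or task-specific quality metrics:

```python
from scipy.stats import ks_2samp

# Placeholders: pull these from your output logs in practice.
launch_week_outputs = ["short clause summary", "another brief answer",
                       "a third launch-week output"]
recent_outputs = ["noticeably longer rambling recent model output text",
                  "another long recent output with extra words",
                  "and one more verbose recent response"]

baseline = [len(o.split()) for o in launch_week_outputs]
recent = [len(o.split()) for o in recent_outputs]

stat, p_value = ks_2samp(baseline, recent)
if p_value < 0.01:  # alert threshold is a judgment call; tune it
    print("Output length distribution shifted: investigate before users do.")
```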
Ask Your DS Team
1. “What’s the realistic effort to collect and clean 800 high-quality examples — and how will we keep them fresh?”
2. “LoRA or full? What’s the inference cost delta at our projected volume?”
3. “How do we catch distribution shift post-deployment before users do?”
Fine-tuning is tailoring an off-the-rack suit, not building a new person. Do it at the right moment and you get the crisp, on-brand experience your users actually trust.