Zara sat in the war room at 11 p.m. Her AI feedback synthesizer had gone live three weeks earlier. On paper, crushing it: 3.2x more churn signals flagged. Engineering loved the 87% accuracy. But three key accounts had escalated. Retention for users who touched “AI Insights”? Flat.
Output doubled, but nobody feels the win.
— Zara, 11 p.m. war roomThe PM Toolbox
What actually moves the needle.
Dovetail + Claude Projects — feed raw Zendesk, Gong, Intercom. Agent tags by segment, spots clusters. Viktor saved 18 hours/week with Airtable AI.
Reality check: Only works with clean, labeled data first.
Perplexity + Zapier agents crawling competitor changelogs, reviews, X threads every Monday.
Best PMs maintain a living library. Save yourself the blank-page tax.
Cluster this raw data. Ignore noise. Per cluster:
theme, evidence quotes (3 max),
segment impact, one experiment idea.
Measuring What Matters
“Evals are the most critical element… but 85% of teams use generic scores that miss domain-specific issues.”— Pawel Huryn
Bottom-Up Failure Mode Evals
- 1
Ship a narrow slice.
- 2
Pull 200 real traces.
- 3
Sit with DS and label every failure.
- 4
Group into modes: “misses edge case,” “tone too casual,” “hallucinates pricing.”
- 5
Build one cheap evaluator per mode — regex for format, LLM-as-judge for tone.
- 6
Track custom metrics like your life depends on it.
Trinity Metrics You Should Own
% feedback labeled for retraining (>80%). Data freshness. Tribal knowledge capture rate.
Drift rate on holdout. Human override rate. Cost per inference + retrain.
Trust score (1-5). Support deflection rate. Novelty fatigue (drop-off after 3 uses).