Chapter 11 — Tools and Metrics for AI PMs

Zara's gap

3.2x more churn signals. Zero retention lift.

Zara sat in the war room at 11 p.m. Her AI feedback synthesizer had gone live three weeks earlier. On paper, crushing it: 3.2x more churn signals flagged. Engineering loved the 87% accuracy. But three key accounts had escalated. Retention for users who touched “AI Insights”? Flat.

“Output doubled, but nobody feels the win.”

— Zara, 11 p.m. war room

The PM Toolbox

What actually moves the needle.

Sid @JustAnotherPM: “You won't lose your job to AI. You'll lose it to the PM who uses these tools better.”

1 FEEDBACK SYNTHESIS AGENTS

Dovetail + Claude Projects — feed raw Zendesk, Gong, Intercom. Agent tags by segment, spots clusters. Viktor saved 18 hours/week with Airtable AI.

Reality check: Only works with clean, labeled data first.

2 COMPETITIVE ANALYSIS AGENTS

Perplexity + Zapier agents crawling competitor changelogs, reviews, X threads every Monday.

3 PROMPT TEMPLATES

Best PMs maintain a living library. Save yourself the blank-page tax.

Viktor's savings: 18h per week with one feedback synthesis agent

// Feedback synthesis
Cluster this raw data. Ignore noise. Per cluster:
theme, evidence quotes (3 max),
segment impact, one experiment idea.
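One way to keep a template like the one above in a living library is a small dictionary of named templates with placeholders. This is a minimal sketch, not a prescribed tool: the library name `PROMPT_LIBRARY`, the key `feedback_synthesis`, and the placeholders `max_quotes` and `raw_data` are all hypothetical.

```python
from string import Template

# Hypothetical in-memory prompt library; key and placeholder names are illustrative.
PROMPT_LIBRARY = {
    "feedback_synthesis": Template(
        "Cluster this raw data. Ignore noise. Per cluster:\n"
        "theme, evidence quotes ($max_quotes max),\n"
        "segment impact, one experiment idea.\n\n"
        "Raw data:\n$raw_data"
    ),
}


def render_prompt(name: str, **fields: str) -> str:
    """Fill a stored template; raises KeyError if the name or a field is missing."""
    return PROMPT_LIBRARY[name].substitute(**fields)


# Example: reuse the stored template instead of starting from a blank page.
prompt = render_prompt(
    "feedback_synthesis",
    max_quotes="3",
    raw_data="Ticket 1: export keeps timing out",
)
```

Using `substitute` rather than `safe_substitute` is deliberate here: a forgotten field fails loudly instead of sending a half-filled prompt to the model.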

Figure 11.1 — PM AI Stack 2026. Frequency × cost. Start bottom-left. Graduate to paid when you outgrow free.

Measuring What Matters

Template rule: Maintain a living prompt library. Blank-page tax costs 20+ min per session.

“Evals are the most critical element… but 85% of teams use generic scores that miss domain-specific issues.” — Pawel Huryn

Bottom-Up Failure Mode Evals

1. Ship a narrow slice.
2. Pull 200 real traces.
3. Sit with DS and label every failure.
4. Group into modes: “misses edge case,” “tone too casual,” “hallucinates pricing.”
5. Build one cheap evaluator per mode: regex for format, LLM-as-judge for tone.
6. Track custom metrics like your life depends on it.
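The "one cheap evaluator per mode" step can be sketched as a set of small check functions run over the labeled traces. This is an illustrative sketch under assumptions: the regex patterns, the approved price list, and the sample traces are made up, and the regex tone check is only a stand-in for the LLM-as-judge the steps recommend for tone.

```python
import re
from collections import Counter
from typing import Callable

# Illustrative patterns; real ones would come from the failures you labeled.
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")
CASUAL_PATTERN = re.compile(r"\b(?:hey|lol|gonna|wanna)\b", re.IGNORECASE)


def hallucinates_pricing(output: str, known_prices: set[str]) -> bool:
    """Flag any dollar amount that is not in the approved price list."""
    return any(p not in known_prices for p in PRICE_PATTERN.findall(output))


def tone_too_casual(output: str) -> bool:
    """Cheap regex stand-in for an LLM-as-judge tone check."""
    return bool(CASUAL_PATTERN.search(output))


def failure_rates(traces: list[str], known_prices: set[str]) -> dict[str, float]:
    """Return the share of traces that trip each failure-mode evaluator."""
    evaluators: dict[str, Callable[[str], bool]] = {
        "hallucinates pricing": lambda t: hallucinates_pricing(t, known_prices),
        "tone too casual": tone_too_casual,
    }
    if not traces:
        return {mode: 0.0 for mode in evaluators}
    counts: Counter[str] = Counter()
    for trace in traces:
        for mode, check in evaluators.items():
            if check(trace):
                counts[mode] += 1
    return {mode: counts[mode] / len(traces) for mode in evaluators}


# Made-up sample traces, standing in for the 200 real ones.
traces = [
    "The Pro plan costs $49.00 per month.",
    "Hey, you're gonna love the Pro plan at $59.00!",
    "Your export is ready.",
    "The Team plan is $99.00 per month.",
]
rates = failure_rates(traces, known_prices={"$49.00", "$99.00"})
```

Each rate becomes a custom metric to track per release; a mode whose rate refuses to drop is a candidate for a more expensive evaluator.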

Duolingo: +27% session duration after domain-specific evals

Trinity Metrics You Should Own

DATA

% feedback labeled for retraining (>80%). Data freshness. Tribal knowledge capture rate.

MODELS

Drift rate on holdout. Human override rate. Cost per inference + retrain.

Trust score (1-5). Support deflection rate. Novelty fatigue (drop-off after 3 uses).
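A few of these metrics reduce to simple ratios once the events are logged. The sketch below assumes definitions the chapter does not spell out: override rate as overridden suggestions over all suggestions, labeled share as labeled feedback items over all feedback items, and novelty fatigue as the share of users who reached three uses and then stopped. The `Suggestion` record and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    user_id: str
    overridden: bool  # True if a human replaced or rejected the AI output


def human_override_rate(suggestions: list[Suggestion]) -> float:
    """Assumed definition: overridden suggestions / all suggestions."""
    if not suggestions:
        return 0.0
    return sum(s.overridden for s in suggestions) / len(suggestions)


def labeled_share(labeled: int, total: int) -> float:
    """Assumed definition: labeled feedback items / all feedback items."""
    return labeled / total if total else 0.0


def novelty_fatigue(uses_per_user: dict[str, int], threshold: int = 3) -> float:
    """Assumed definition: of users who reached `threshold` uses, the share who stopped there."""
    reached = [n for n in uses_per_user.values() if n >= threshold]
    if not reached:
        return 0.0
    return sum(n == threshold for n in reached) / len(reached)


# Made-up sample data.
suggestions = [
    Suggestion("a", False),
    Suggestion("b", True),
    Suggestion("c", False),
    Suggestion("d", False),
]
override_rate = human_override_rate(suggestions)
share = labeled_share(labeled=850, total=1000)
fatigue = novelty_fatigue({"a": 1, "b": 3, "c": 3, "d": 7, "e": 5})
meets_label_target = share > 0.80  # the >80% target from the DATA metrics
```

Whatever definitions a team settles on, writing them down as code keeps the numbers comparable from one review to the next.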
