Chapter Eleven

Tools and Metrics for AI PMs

The tools and metrics that look clean in a demo will lie to you in production unless you design them around the Trinity and measure what actually moves the business.

📖 ~14 min read · Pages 75–80
Zara's gap: 3.2x more churn signals. Zero retention lift.

Zara sat in the war room at 11 p.m. Her AI feedback synthesizer had gone live three weeks earlier. On paper, crushing it: 3.2x more churn signals flagged. Engineering loved the 87% accuracy. But three key accounts had escalated. Retention for users who touched “AI Insights”? Flat.

Output doubled, but nobody feels the win.

— Zara, 11 p.m. war room

The PM Toolbox

What actually moves the needle.

Sid @JustAnotherPM: “You won't lose your job to AI. You'll lose it to the PM who uses these tools better.”

1 FEEDBACK SYNTHESIS AGENTS

Dovetail + Claude Projects — feed raw Zendesk, Gong, Intercom. Agent tags by segment, spots clusters. Viktor saved 18 hours/week with Airtable AI.

Reality check: This only works once the data is clean and labeled.

2 COMPETITIVE ANALYSIS AGENTS

Perplexity + Zapier agents crawling competitor changelogs, reviews, X threads every Monday.

3 PROMPT TEMPLATES

The best PMs maintain a living prompt library. Save yourself the blank-page tax.

Viktor's savings: 18h per week with one feedback synthesis agent

// Feedback synthesis
Cluster this raw data. Ignore noise. Per cluster:
theme, evidence quotes (3 max),
segment impact, one experiment idea.
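
One way to keep a template like this out of a scratch doc is a small prompt library in code. The sketch below is illustrative only: `PROMPT_LIBRARY`, `render_prompt`, and the placeholder names are assumptions made for this example, and the sample feedback is invented. It stores the template once and fills in the raw data at call time.

```python
from string import Template

# Hypothetical in-memory prompt library; the key and placeholder names are illustrative.
PROMPT_LIBRARY = {
    "feedback_synthesis": Template(
        "Cluster this raw data. Ignore noise. Per cluster:\n"
        "theme, evidence quotes ($max_quotes max),\n"
        "segment impact, one experiment idea.\n\n"
        "Raw data:\n$raw_data"
    ),
}


def render_prompt(name: str, **fields: str) -> str:
    """Look up a saved template and fill in its placeholders."""
    return PROMPT_LIBRARY[name].substitute(**fields)


# Invented sample feedback, only to show the call shape.
prompt = render_prompt(
    "feedback_synthesis",
    max_quotes="3",
    raw_data="- Ticket: export keeps timing out\n- Call note: pricing page is confusing",
)
print(prompt)
```

`Template.substitute` raises `KeyError` when a placeholder is left unfilled, which is a useful guard against pasting a half-completed prompt.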
[Figure: tools plotted by frequency (daily use to weekly use) and cost (free to paid). Tools shown: Claude Projects, Prompt lib, Perplexity Agents, Dovetail + AI, Airtable AI, LangSmith Evals, Custom.]
Figure 11.1 — PM AI Stack 2026. Frequency × cost. Start bottom-left. Graduate to paid when you outgrow free.

Measuring What Matters

Template rule: Maintain a living prompt library. The blank-page tax costs 20+ min per session.

Pawel Huryn: “Evals are the most critical element… but 85% of teams use generic scores that miss domain-specific issues.”

Bottom-Up Failure Mode Evals

  1. Ship a narrow slice.

  2. Pull 200 real traces.

  3. Sit with DS and label every failure.

  4. Group into modes: “misses edge case,” “tone too casual,” “hallucinates pricing.”

  5. Build one cheap evaluator per mode: regex for format, LLM-as-judge for tone.

  6. Track custom metrics like your life depends on it.
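
The "one cheap evaluator per mode" step can be sketched in a few lines. Everything named here is an assumption for illustration: the ticket-line format rule, the judge prompt, and `fake_judge`, which stands in for a real model call so the example runs offline. The traces are invented.

```python
import re
from typing import Callable, Dict, List

# Hypothetical format rule: every reply must contain a line like "Ticket: SUP-101".
TICKET_LINE = re.compile(r"^Ticket: [A-Z]+-\d+$", re.MULTILINE)


def fails_format(output: str) -> bool:
    """Cheap regex evaluator: True when the required ticket line is missing."""
    return TICKET_LINE.search(output) is None


def make_tone_evaluator(judge: Callable[[str], str]) -> Callable[[str], bool]:
    """Wrap an LLM-as-judge call. `judge` is a placeholder for your model client;
    it takes a prompt and is expected to answer "PASS" or "FAIL"."""
    def fails_tone(output: str) -> bool:
        verdict = judge(
            "Is the tone of this reply too casual for a B2B support context? "
            "Answer FAIL if it is, otherwise PASS.\n\n" + output
        )
        return verdict.strip().upper().startswith("FAIL")
    return fails_tone


def failure_rates(
    traces: List[str], evaluators: Dict[str, Callable[[str], bool]]
) -> Dict[str, float]:
    """Share of traces that trip each failure mode."""
    if not traces:
        return {mode: 0.0 for mode in evaluators}
    return {
        mode: sum(check(t) for t in traces) / len(traces)
        for mode, check in evaluators.items()
    }


def fake_judge(prompt: str) -> str:
    """Offline stand-in for a model call; swap in a real client."""
    return "FAIL" if "lol" in prompt.lower() else "PASS"


# Invented traces, only to exercise the evaluators.
traces = [
    "Thanks for flagging this.\nTicket: SUP-101",
    "lol yeah that's broken\nTicket: SUP-102",
    "We are investigating.",
    "Fixed in the next release.\nTicket: SUP-104",
]
rates = failure_rates(traces, {
    "bad format": fails_format,
    "tone too casual": make_tone_evaluator(fake_judge),
})
print(rates)
```

Keeping each evaluator as a plain function that returns True on failure means a new failure mode is one more dictionary entry, and the per-mode rates become the custom metrics to track.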

Duolingo: +27% session duration after domain-specific evals

Trinity Metrics You Should Own

DATA

% feedback labeled for retraining (>80%). Data freshness. Tribal knowledge capture rate.

MODELS

Drift rate on holdout. Human override rate. Cost per inference + retrain.

UX

Trust score (1-5). Support deflection rate. Novelty fatigue (drop-off after 3 uses).
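
A few of these metrics reduce to simple ratios once the events are logged. The sketch below is one possible reading, not a standard definition: the `Suggestion` record, the function names, and the choice to define novelty fatigue as "users who stopped at exactly three uses, out of those who reached three" are all assumptions. The numbers are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Suggestion:
    user_id: str
    overridden: bool  # a human replaced or rejected the model output


def labeled_share(labeled: int, total: int) -> float:
    """DATA: fraction of feedback items labeled for retraining."""
    return labeled / total if total else 0.0


def override_rate(suggestions: List[Suggestion]) -> float:
    """MODELS: fraction of model outputs a human overrode."""
    if not suggestions:
        return 0.0
    return sum(s.overridden for s in suggestions) / len(suggestions)


def novelty_fatigue(uses_per_user: Dict[str, int], threshold: int = 3) -> float:
    """UX: of users who reached `threshold` uses, the share who stopped there."""
    reached = [n for n in uses_per_user.values() if n >= threshold]
    if not reached:
        return 0.0
    return sum(n == threshold for n in reached) / len(reached)


# Invented sample data.
suggestions = [
    Suggestion("a", False), Suggestion("a", True), Suggestion("b", False),
    Suggestion("c", True), Suggestion("d", False),
]
uses = {"a": 1, "b": 3, "c": 3, "d": 7, "e": 4}

data_ok = labeled_share(labeled=170, total=200) > 0.80  # the >80% target
print(data_ok, override_rate(suggestions), novelty_fatigue(uses))
```

Writing the definitions down as code forces the team to agree on the denominator for each metric before anyone argues about the trend line.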
