{"type":"rich","version":"1.0","provider_name":"Transistor","provider_url":"https://transistor.fm","author_name":"The Experimentation Edge","title":"DART: The four metrics that actually measure AI agents","html":"<iframe width=\"100%\" height=\"180\" frameborder=\"no\" scrolling=\"no\" seamless src=\"https://share.transistor.fm/e/ecb85e39\"></iframe>","width":"100%","height":180,"duration":1942,"description":"SummaryRingCentral's Director of Product Management for AI Products, Mayank Agarwal, joins host Ashley Stirrup to dismantle the metrics most teams use to judge AI agents. Drawing on his background founding an AI-first quantitative trading firm and scaling Groupon's bookable marketplace, Mayank explains why accuracy and thumbs-up/down feedback both mislead, and introduces DART — a four-metric behavioral framework (decay, acceptance, relevance, task completion) ported from how he measured trading strategies. He also breaks down a Groupon flash-discount experiment that backfired and the scarcity pivot that fixed it. Essential listening for product managers, engineers, and data scientists building or measuring AI features.Chapters00:00 Welcome and Mayank's path from quant trading to RingCentral AI02:45 Why experimentation has to be owned cross-functionally04:55 Small experiments that compounded to a 12% lift at Groupon06:45 Why accuracy and thumbs-up/down fail for AI agents08:15 The DART framework, metric by metric12:45 Applying DART to AI-generated smart notes14:55 The Groupon flash-sale that dropped conversion16:45 Swapping price urgency for scarcity and social proof19:45 North Star metrics, guardrails, and Goodhart's law26:45 The future: experimenting on — and for — AI agentsTakeawaysAccuracy is a comfortable lie. It grades a narrow test set and can stay high while the agent fails real users.Thumbs-up/down feedback is sparse and skewed. Unhappy users rarely rate — they just quietly stop using the product.DART measures behavior, not opinions. Four signals read off logs and transcripts: decay, acceptance, relevance, and task completion.Acceptance rate is the trust metric. The share of output users keep without editing is the strongest available proxy for trust.A losing experiment is paid-for information. Groupon's flash-sale flop revealed the lever was wrong, not the goal — scarcity beat price-based urgency.Connect with the GuestLinkedIn:...","thumbnail_url":"https://img.transistorcdn.com/D9kLs0HSsqR4ttk_5ESEdC1jX-wmD76GK-OHmb3a9B8/rs:fill:0:0:1/w:400/h:400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS80YTFk/MGU1MjJlODhlNjJh/MTdlZTZkN2Q1ODY5/OTdjYy5wbmc.webp","thumbnail_width":300,"thumbnail_height":300}