The Experimentation Edge | DART: The four metrics that actually measure AI agents

Summary
RingCentral's Director of Product Management for AI Products, Mayank Agarwal, joins host Ashley Stirrup to dismantle the metrics most teams use to judge AI agents. Drawing on his background founding an AI-first quantitative trading firm and scaling Groupon's bookable marketplace, Mayank explains why accuracy and thumbs-up/down feedback both mislead, and introduces DART — a four-metric behavioral framework (decay, acceptance, relevance, task completion) ported from how he measured trading strategies. He also breaks down a Groupon flash-discount experiment that backfired and the scarcity pivot that fixed it. Essential listening for product managers, engineers, and data scientists building or measuring AI features.

Chapters

00:00 Welcome and Mayank's path from quant trading to RingCentral AI
02:45 Why experimentation has to be owned cross-functionally
04:55 Small experiments that compounded to a 12% lift at Groupon
06:45 Why accuracy and thumbs-up/down fail for AI agents
08:15 The DART framework, metric by metric
12:45 Applying DART to AI-generated smart notes
14:55 The Groupon flash-sale that dropped conversion
16:45 Swapping price urgency for scarcity and social proof
19:45 North Star metrics, guardrails, and Goodhart's law
26:45 The future: experimenting on — and for — AI agents

Takeaways

Accuracy is a comfortable lie. It grades a narrow test set and can stay high while the agent fails real users.
Thumbs-up/down feedback is sparse and skewed. Unhappy users rarely rate — they just quietly stop using the product.
DART measures behavior, not opinions. Four signals read off logs and transcripts: decay, acceptance, relevance, and task completion.
Acceptance rate is the trust metric. The share of output users keep without editing is the strongest available proxy for trust.
A losing experiment is paid-for information. Groupon's flash-sale flop revealed the lever was wrong, not the goal — scarcity beat price-based urgency.

Connect with the Guest
LinkedIn: https://www.linkedin.com/in/mayank-agarwal-6223b04a/
Website: https://www.ringcentral.com

Sponsor
GrowthBook helps you ship features with confidence by bringing experimentation and feature flagging into one open-source platform. No more guessing whether that new checkout flow actually moved the needle, waiting weeks for data team bandwidth, or flying blind on rollouts.

GrowthBook gives you a single place to run A/B tests, manage feature flags, and analyze results against your existing data warehouse.

With powerful stats built in, it takes the complexity out of experimentation, helps you catch regressions before they hit every user, and makes it easy to test ideas that keep your product improving and your metrics moving in the right direction.

See a demo at https://www.growthbook.io/

What is The Experimentation Edge?

How do product teams decide what to build and what not to? The Experimentation Edge is the podcast where product, growth, and engineering leaders share how A/B testing, feature flags, and experimentation drive real business outcomes — backed by named companies and real numbers. From DoorDash's 12,000 A/B tests a year to Atlassian's experimentation-led product win to UPS's $500M experimentation team, each episode goes deep with operators running experimentation programs at scale.

Hosted by Ashley Stirrup, CMO at GrowthBook and a 25-year executive in data and experimentation. For product managers, engineers, data scientists, and growth leaders at B2B tech companies who care about experimentation culture, statistical rigor, and shipping with confidence. No marketing speak. Just operators explaining what they shipped, what moved the needle, and how experimentation reshaped their teams.

Topics: A/B testing, experimentation, growth experimentation, product experimentation, tech experimentation, feature flags, experimentation culture, statistical significance, marketplace experimentation, conversion rate optimization, experimentation at scale.


[00:00:00]

Mayank Agarwal: In practice, the people who find the AI unhelpful, they don't stop to tell you why, right? They just quietly stop using it. So you end up collecting almost no signal, and the signal that you do collect is skewed.

So that is what I've observed working at RingCentral and in fact I built a four-metric framework for this, which I call

Ashley Stirrup: Welcome to The Experimentation Edge, where product managers, data scientists, and engineers talk about how they make smarter decisions. I'm Ashley Stirrup, the chief marketing officer for GrowthBook, and in each episode, I'll sit down with an executive to unpack how they use experimentation and A/B testing to make better decisions.

This show is sponsored by GrowthBook, the open source experimentation platform leader. Now let's jump in and get started with our next guest.

Ashley Stirrup: Welcome to today's episode. I'm excited to sit down with Mayank Agarwal, Director of Product Management for AI [00:01:00] Products at RingCentral. Mayank brings a terrific background. He's been an engineer, he's been a product manager. He's worked at Adobe and Groupon, and now RingCentral. Mayank, welcome to the show.

Mayank Agarwal: Thank you, Ashley. Excited to be on the show. Just to share a little about myself: as Ashley mentioned, I currently lead AI assistant and agentic products at RingCentral. For anyone who doesn't know what RingCentral does, we are a business communications company. We power business communications for other companies, ranging from small businesses to large enterprises.

Our current offerings include cloud phone, video meetings, team messaging, and the contact center software that customer support teams live in on a daily basis. These offerings are split across two product lines: employee collaboration and contact center. My AI products sit across both of these lines: everyday employee collaboration and communication, as well as the contact [00:02:00] center side, where agents are handling live customer calls.

Before this, I founded an AI-first quantitative trading firm where I built swarms of self-correcting AI agents that researched and deployed quantitative strategies across different asset classes. This is where I learned experimentation in its least forgiving form, because a bad test at a trading firm loses real money.

Prior to my trading firm, I built video and webinar products at RingCentral, and marketplace platform and booking products at Groupon.

Ashley Stirrup: Great. Thanks for that intro. I should add that once upon a time, I had a small business and we used RingCentral, and it was a great product. It helped us all stay on the same page and communicate professionally with customers. I'm a fan of the product.

Mayank Agarwal: Absolutely. Never anything better than a good customer testimonial.

Ashley Stirrup: That's right. That's right. So could you tell us a little bit about experimentation at RingCentral?

Mayank Agarwal: [00:03:00] Absolutely. I'll talk about experimentation at RingCentral, and also about my experience and the best practices I've seen at the different companies I've worked at. I've observed that experimentation works well only if it's owned across functions, not just handed to one team.

In this cross-functional model, the PM owns the hypothesis and the decision-making: which experiments we should run and which decisions we need to test. The data science team owns the experiment design, things like power and how long you want to run the experiment. And the engineering team owns the implementation and instrumentation of the experiments.

Now, when it comes to how many experiments we typically run, I have noticed that it varies across companies and industries. I think the real question is whether each experiment that you're running is powered to answer something real. [00:04:00] So I would rather run fewer, well-designed tests instead of spraying a bunch of underpowered tests that just produce noise.
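
Editor's sketch: the "power and how long you want to run the experiment" part of the design can be estimated before launch. The Python below is a minimal illustration using the standard normal-approximation formula for a two-sided two-proportion test. The function names, baseline rate, lift, and traffic figures are hypothetical and do not come from the episode.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def days_to_run(n_per_arm: int, daily_users_per_arm: int) -> int:
    """Days of traffic needed to collect the required sample in each arm."""
    return ceil(n_per_arm / daily_users_per_arm)


# Hypothetical inputs: 5% baseline conversion, 10% relative lift, 2,000 users per arm per day.
n = sample_size_per_arm(baseline=0.05, relative_lift=0.10)
print(n, "users per arm,", days_to_run(n, 2000), "days")
```

Because the required sample grows with the inverse square of the detectable difference, halving the lift you want to detect roughly quadruples the sample, which is one reason underpowered tests mostly produce noise.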

I have also seen teams use a range of tools, but one pattern has been fairly common across my experience at different companies: most teams running experiments reach for purpose-built platforms like GrowthBook, Statsig, LaunchDarkly, or Optimizely. But I think having the tool is the smaller part here.

The real question is cultural: whether the team treats a losing experiment as a failure or as a learning. I've observed that the teams that win are the ones that get curious about a losing experiment's results and investigate further.

As I was saying, I've had a lot of experience running experiments [00:05:00] across different companies, and the clearest example I can think of is the culture I observed at Groupon. At Groupon, we ran a program of continuous funnel experiments across the browse, search, and booking funnels. Over one stretch, the work we did as part of these experiments lifted conversion rates by 12% and produced around 3.4 million bookings over one year.

None of this came from a single big idea, but it was in fact a lot of small experiments that were well-prepared and that compounded over time. So yeah, I feel the bigger wins here are cultural when it comes to experimentation and how teams treat different experiments.

Ashley Stirrup: Yeah. One of the things that I love about your experience is that you've worked at a business where I think it's very natural to do experimentation, which is Groupon, where you have a lot of user funnels and it's very logical [00:06:00] to say, "Okay, how do I optimize, get people to go from searching to clicking, from clicking to buying?" Versus RingCentral, where it might not be as obvious. I think a lot of product managers, if they work in a business where it's obvious what to experiment on, they experiment. If they work in a business where it's not as obvious, it's not as natural. They're worried about customer requests, sales and marketing requests, CEO requests, whatever it might be, and it's less obvious to them how and when they should test.

Do you have any advice for somebody working at a company like RingCentral on, like, how do you think about your portfolio of, "Okay, there's all these things I need to go do. Some are big features, some are little wins, some are like bug fixes," and how should experimentation-

Mayank Agarwal: Yeah, absolutely. I think different products call for different kinds of experimentation approaches. We talked about experimentation for e-commerce products. RingCentral similarly has an e-commerce side where we sell our offerings. There are [00:07:00] product managers on that side of the house who are trying to optimize conversion rates for those customers.

Then there is the AI product side of the house, where the experiments we build and design are fundamentally different from the traditional e-commerce world. In fact, my biggest recent learning at RingCentral has been that the industry did not really have the correct metrics for measuring the performance of AI agents.

I've seen that most teams still reach for accuracy, but an AI agent can be accurate on the narrow task defined in the test dataset and still fail the user completely. Another trap that I've seen a lot of teams falling into is relying on users' self-reported metrics, right? Things like thumbs up or thumbs down.

The problem there is that you are asking users to do something that they have no incentive to do. And in practice, the people who find the [00:08:00] AI unhelpful, they don't stop to tell you why, right? They just quietly stop using it. So you end up collecting almost no signal, and the signal that you do collect is skewed.

So that is what I've observed working at RingCentral, and in fact I built a four-metric framework for this, which I call DART. It comes directly from how I measured the performance of my trading strategies. The key idea is that all of the metrics I track are behavioral. You read them off system logs or transcripts, and you don't rely on user-reported feedback or self-reported actions.

I could go further into the DART framework if you would like.

Ashley Stirrup: Yeah, that sounds great

Mayank Agarwal: So the first metric that I track in the DART framework is the task completion rate, which is the T in DART. We are going over this framework in reverse order. The task completion rate comes directly from how I [00:09:00] measured the performance of trading strategies. It measures whether the agent actually finishes what it starts.

In trading, that's your win rate. The second metric that we track is relevance, or the signal-to-noise ratio. This is the R in DART, and it is really about asking whether the agent should have acted at all. Of everything that the system pushed out, how much did the user actually engage with? This is the metric that tells you whether you are helping someone do real work and accomplish real tasks, or just creating information fatigue for them.

To draw the trading parallel, this is separating real alpha from the noise. Then we have the third metric, which is acceptance rate. This is the A in the DART framework, and it measures what percentage of the AI's output the user keeps without editing or correcting. This is a behavioral [00:10:00] proxy for trust, because when people rewrite everything that the AI gives them, you know that you have not earned the user's trust.

Users essentially have to redo every AI response. Whereas if they are accepting most of the AI's responses without editing them, that's a signal of high trust. The parallel from the trading world is the execution rate: the trades that the system places automatically without you needing to edit them, versus the ones where you are editing the trades.

And the fourth metric, which is the D in our DART framework, is the decay rate. This is where we measure whether the AI agent's output is decaying over time. An agent that works the day you ship it tends to degrade quietly in output quality as conditions change and evolve.

So if you are not regularly evaluating your AI agent's performance against a fixed evaluation set, then you are just [00:11:00] assuming that your agent is as good as the day you launched it, right? Which is rarely the case. This is exactly the same as alpha decay in trading, where your strategy's edge erodes over time, and you have to keep checking the performance of your strategies over time to make sure they are meeting your quality bars.

Now, when it came to applying this framework to enterprise AI products, what stood out was how often a feature scored great on task completion rate but badly on acceptance. It finished the job, but nobody trusted the output. And this gap is completely invisible if you're just using accuracy as the benchmark metric for your AI agents and models.
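
Editor's sketch: one way the four DART signals could be read off interaction logs. The `Interaction` fields, the choice to compute acceptance only over outputs the user engaged with, the definition of decay as launch score minus latest score on a fixed eval set, and the sample data are all assumptions made for illustration, not RingCentral's implementation.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    completed: bool  # the agent finished the task it started
    engaged: bool    # the user opened or used the output
    edited: bool     # the user edited or corrected the output before keeping it


def rate(flags) -> float:
    """Share of True values; 0.0 for an empty sequence."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def dart(logs, eval_scores):
    """eval_scores: scores on the same fixed eval set, oldest (launch) first."""
    engaged = [i for i in logs if i.engaged]
    return {
        "decay": round(eval_scores[0] - eval_scores[-1], 3),
        "acceptance": round(rate(not i.edited for i in engaged), 3),
        "relevance": round(rate(i.engaged for i in logs), 3),
        "task_completion": round(rate(i.completed for i in logs), 3),
    }


# Hypothetical logs: the agent usually finishes, but users rewrite most of what they keep.
logs = [
    Interaction(completed=True, engaged=True, edited=False),
    Interaction(completed=True, engaged=True, edited=True),
    Interaction(completed=True, engaged=False, edited=False),
    Interaction(completed=False, engaged=True, edited=True),
]
print(dart(logs, eval_scores=[0.90, 0.88, 0.84]))
```

In this toy data, task completion is high while acceptance is low, which is the gap described above that an accuracy number alone would not reveal.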

Ashley Stirrup: Yeah, that makes a lot of sense. I think this is one of the more interesting topics inside of experimentation these days. When GrowthBook customer Typeform created an AI [00:12:00] assistant to help people create a survey, that one was pretty easy to measure, right? Like, how long did it take them to create the survey without the AI assistant?

How long did it take them to do it with the assistant? Did they publish it or not? Those are really straightforward. And then there are other ones, where maybe you're asking it a question or asking it to write something, where you don't necessarily know whether that was a good experience for the user or not. So it takes a lot more creativity to define the right metrics so that you can run a good experiment.

A great example of that is Khan Academy, where they basically started labeling the questions people are asking to determine how engaged the user is. Are they just trying to get through the class as fast as possible and get the AI to give them the answer, or are they actually trying to learn? I think a lot of those same concepts show up in what you're talking about here. Could you talk a little bit about the [00:13:00] use cases that you would apply this to at RingCentral?

Mayank Agarwal: Absolutely, yeah. For a lot of AI assistant products in the industry, one of the core tasks is generating smart notes, right? AI-generated smart notes. For those, task completion rate is not as important as measuring whether the notes contain noise or whether they are really relevant.

So one of the metrics that we track is the signal-to-noise ratio. Another metric that's important to track is the acceptance rate: what percentage of the time does a user have to go back and edit those AI-generated notes, versus accepting them as is? The third and last thing that we track there is decay rate.

Your AI-generated smart notes might work really well today on the use cases you have in your eval dataset, but for different industries or different accents, the [00:14:00] accuracy and the performance might not be that great. So you want to measure the decay rate of accuracy or performance over these different datasets and conditions.

Ashley Stirrup: Yeah, I think that's a really important point that I'm hearing consistently from customers, but it's not very intuitive. You think, "Okay, once I've gotten my prompts all tuned and I've got a model running and everything, it's only gonna get better, right? 'Cause they'll come out with the 4.7, 4.8, whatever it is." But I consistently hear people saying that you have to watch for this decay for a variety of different reasons, right?

Mayank Agarwal: Exactly. What I've realized is that it's not enough to run your evaluations the day you launch a feature or a new product. You have to keep running them continuously at a regular cadence and track any degradation in performance over time.
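
Editor's sketch: running the same fixed eval set at a regular cadence and flagging degradation can be as simple as comparing each run against the launch score. The function name `degraded_runs`, the 0.03 tolerance, and the weekly scores below are hypothetical.

```python
def degraded_runs(scores, tolerance=0.03):
    """Return (label, drop) for each later run that fell more than
    `tolerance` below the first (launch) score on the same eval set."""
    _, launch_score = scores[0]
    flagged = []
    for label, score in scores[1:]:
        drop = launch_score - score
        if drop > tolerance:
            flagged.append((label, round(drop, 4)))
    return flagged


# Hypothetical scores from re-running one fixed eval set every four weeks.
history = [("launch", 0.91), ("week 4", 0.90), ("week 8", 0.87), ("week 12", 0.85)]
for label, drop in degraded_runs(history):
    print(f"{label}: score is {drop:.2f} below launch")
```

Comparing against the launch score rather than the previous run is a deliberate choice here: a slow, steady decline never trips a run-over-run check but does show up against the original baseline.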

Ashley Stirrup: Yeah, makes total sense. Yeah, the long-term tracking is much more important when it comes to AI-powered tools.

Mayank Agarwal: Exactly

Ashley Stirrup: Do you have an example of how you've approached a losing experiment and how you've tried [00:15:00] to extract as much learning as possible out of it?

Mayank Agarwal: Sure. Yeah. First of all, I think you framed it the right way, because I love losing experiments. A losing experiment is information that you paid for, right? So the real waste is throwing that learning away. I have an example from Groupon. During my time there, I worked on the team that scaled the inventory for Groupon's billion-dollar-plus marketplace.

The inventory behind that marketplace was powered by my team. And we didn't just scale this inventory, we also made it bookable. When it came to the booking experience, we had a strong hypothesis that adding urgency indicators to our deals would lift conversion rates.

Ashley Stirrup: And just to interject for a second, to make sure the audience understands: the interesting thing with the Groupon business model is you only have so many deals to offer [00:16:00] on different types of products. And so that's the inventory side of Groupon that you're then trying to match with the buyers that are coming to the site.

Is that the right way to put it?

Mayank Agarwal: Exactly, yeah. A lot of people think of Groupon as a coupons website, but it's actually much more than that. During my time there, Groupon had different verticals. We had things to do. For example, if you are traveling to Alaska, let's say, you have to look for experiences, right?

Commonly, people go to Viator or TripAdvisor to look for those experiences. Groupon is another place you could go. So we had a whole category of things to do for finding these kinds of experiences. But coming back to the losing experiment: we had this hypothesis that adding urgency indicators to our deals would lift conversion rates.

The first lever that we tried was flash discounts. These flash discounts were time-limited price drops [00:17:00] on certain experiences. The results of the experiment went the wrong way: conversion actually dropped. When we dug further into what happened, we figured out that the discounts were teaching people bad behavior.

If you keep showing someone a flash sale, people are trained to assume that another one is always around the corner. Maybe if I wait longer, I'll get a much deeper discount, right? So the rational move for any user is to wait. Instead of creating urgency in the buying flow, we had given them a reason to wait to buy the deal.

So this was a losing experiment, and this was our learning. What we took from it was that we kept our goal, which was urgency indicators, and we changed the lever. We moved away from price-based urgency to scarcity and social proof. [00:18:00]

So we would show indicators like "only three spots left" or "selling fast." This kind of urgency was about availability, not about price, so there was nothing to wait for. If anything, people knew that the deal was selling fast or that there were only X spots left, so they should hurry. And this is the version that moved our conversion rates.

Losing experiments are what really taught us the difference: it wasn't that urgency didn't work, it was that we were using the wrong lever, the wrong kind of urgency. So, tying it back to your question, I love losing experiments because a loss is information that we did not have before and that we can use to design the next experiment with a sharper hypothesis.
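
Editor's sketch: before reading a conversion drop like the flash-discount result as a real loss, a team would typically check that it is larger than noise. The Python below runs a pooled two-proportion z-test; the visitor and booking counts are invented for illustration and are not Groupon's data.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (difference, z, two-sided p-value) for variant B versus control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical counts: control converts 1,000 of 20,000; the flash-discount arm 900 of 20,000.
diff, z, p = two_proportion_z(1000, 20000, 900, 20000)
print(f"difference={diff:+.4f}, z={z:.2f}, p={p:.3f}")
```

A negative difference with a small p-value is the "paid-for information" case: the variant reliably hurt conversion, so the next step is to question the lever rather than rerun the same test.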

Ashley Stirrup: Yeah. Yeah, I think that's such a great point 'cause really the whole point of experimentation, it's a little counterintuitive, but the whole point is to find losers.

Mayank Agarwal: Exactly

Ashley Stirrup: [00:19:00] That's where, if everything that you think is gonna win, wins, there's no point in running experiments. But it's knowing the fact that, hey, some of these aren't winners, and let me help figure out which ones aren't so I can do something different with those.

Mayank Agarwal: Absolutely. Totally

Ashley Stirrup: Your story's a fascinating one because you would think that the price discount would create urgency and that it wouldn't have that kind of effect of, "Oh, let me see what else is around the corner." It made me think of online dating apps. You hear that story a lot, that people always think, "Oh, the next guy or girl might be even better."

It discourages people from settling down, which is just-

Mayank Agarwal: Exactly. It creates a paradox, right? Instead of settling down or finding a partner quicker, you are forever waiting.

Ashley Stirrup: Yeah. Yeah. Could you talk a little bit about how you've set your North Star metric, your overall evaluation criteria? Could be at RingCentral or somewhere else.

Mayank Agarwal: Absolutely, yeah. At a lot of the organizations I've worked [00:20:00] at, north star metrics matter primarily at the organizational level. For public companies, it could be revenue. For smaller startups, it could be user growth, or revenue growth as well. But whatever that north star metric is, I've seen that it trickles down into each individual team as sub-metrics.

For my team at Groupon, our north star was really whether we were showing people relevant inventory and whether they were converting on it. Underneath that, we had a bunch of metrics that were designed for each experiment. We watched click-through rates by channel. We watched conversion from the homepage versus from different category pages.

We watched how people scrolled and what percentage of sessions ended in a booking. All of these metrics gave us a more honest read than revenue alone, which used to be the north star [00:21:00] metric for Groupon at the time, because revenue can move for reasons that have nothing to do with the innovations you are building in the product.

For AI products, I think the overall evaluation criteria get a little harder, because a single metric for measuring the performance of AI agents is dangerous. If you chase one number like engagement or task completion, you can quietly erode trust, and trust is the thing that keeps users using your AI products.

So typically what I do is pair a primary metric with guardrail metrics. In my framework, that primary metric could be acceptance rate, along with a couple of guardrail metrics, for example task completion or accuracy. The learning here is that you cannot win a headline number by degrading other things that matter.

I think this trap is called Goodhart's law, which is that the moment a metric becomes [00:22:00] your primary target, people start gaming it. And AI systems are especially good at gaming a badly chosen metric. Sorry, just to finish the thought.

I was recently reading a post on LinkedIn about LinkedIn cracking down on AI-generated slop. The post argued that LinkedIn product managers have never used their own product, because they were using a metric for what "feels real," itself evaluated by AI, to detect whether a post was synthetic or authentic.

I think that's a classic example of Goodhart's law: now that people know the post structure is being detected, they are going to start gaming it.
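
Editor's sketch: the pairing of one primary metric with guardrails, described earlier in this answer, can be written as an explicit decision rule so a headline win cannot hide a regression. The function name, metric names, and thresholds below are hypothetical.

```python
def ship_decision(primary_change: float,
                  guardrail_changes: dict[str, float],
                  max_allowed_drop: dict[str, float]) -> str:
    """Ship only if the primary metric improved and no guardrail
    fell by more than its allowed amount."""
    breached = sorted(
        name for name, change in guardrail_changes.items()
        if change < -max_allowed_drop[name]
    )
    if breached:
        return "hold: guardrail breached (" + ", ".join(breached) + ")"
    if primary_change <= 0:
        return "hold: primary metric did not improve"
    return "ship"


# Hypothetical readout: acceptance rate is up, but task completion fell past its limit.
print(ship_decision(
    primary_change=0.04,
    guardrail_changes={"task_completion": -0.06, "accuracy": -0.005},
    max_allowed_drop={"task_completion": 0.02, "accuracy": 0.01},
))
# prints: hold: guardrail breached (task_completion)
```

Checking guardrails before the primary metric keeps the rule from rewarding a gamed headline number.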

Ashley Stirrup: Yeah. I think this whole topic is an important one that's often overlooked at companies: having a team really sit down and think through, what is our North Star metric? What are our [00:23:00] guardrails? How do we optimize revenue in the short term while maximizing trust or retention or whatever the long-term metrics are? I'm surprised how often people don't have clarity around that. Is that something you've had dedicated conversations around? Or have you seen it evolve more organically at the-

Mayank Agarwal: Just to make sure that I understand the question: is it whether these conversations around metrics are evolving, or how they are evolving, in the AI world?

Ashley Stirrup: Yeah. My point is that I think having a clear North Star metric and having the right guardrails is an incredibly strategic conversation that I think a lot of times is not happening at companies. And so then you have somebody running the experimentation team without that context to run it in.

And so for the companies you've worked at, like how have North Star metrics and guardrails been set? And has it been a [00:24:00] strategic conversation or has it been just something that evolved and people realized, "Ooh, that was bad. Let's put a guardrail in to make sure we don't see that again." Or what do we really care about?

Do we care about clicks or engagement or revenue? And, maybe in different places in the product experience, different things matter.

Mayank Agarwal: Yeah. Actually, I'll take it one step further. My trading firm is where I really understood the importance of running experiments and having the right metrics. Oftentimes we would come up with these incredibly good strategies that performed so well on the back-tested, historical data.

And the performance on real-world data looked nothing like what we had observed in the back-tested data. I think that is where our conversation evolved. Obviously, if you are using a strategy based on your conviction, which itself is based on how the strategy performed on past data, it's going to be overfit to that data, right?

[00:25:00] So that conversation evolved from "the strategy is working great, let's deploy it, let's see how it goes" to having quality gates before deploying a strategy. Those quality gates would be: okay, a strategy is performing great on your back-tested or historical data. Let's see how it performs on an out-of-sample validation set.

If it's performing great there, that still does not mean that you deploy the strategy with real money. You then graduate it to the next level, which is paper trading. If the strategy performs well in paper trading, which is sort of real test data, then you graduate it to live trading with real money.

So that's how the conversations evolved at my trading firm: we went from "Oh, this strategy is looking great, deploy it" to having these maturity curves for any strategy. This is the same thing that I've seen for a lot of enterprise products, where people are super excited about how AI is performing in a [00:26:00] pilot.

But then when it goes to production, things are way below expectations, right? So-

Ashley Stirrup: Yeah

Mayank Agarwal: Having that maturity curve really helps in enterprise AI products too.
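
Editor's sketch: the maturity curve described above (backtest, then out-of-sample validation, then paper trading, then live) amounts to a chain of gates where each stage must pass before the next is unlocked. The stage names follow the conversation; the function and the pass/fail inputs are hypothetical.

```python
STAGES = ["backtest", "out_of_sample", "paper_trading", "live"]


def furthest_stage(gate_results: dict[str, bool]) -> str:
    """Return the furthest stage a strategy (or AI feature) may run in.

    gate_results maps a stage to whether it passed that stage's quality
    gate; a missing entry counts as not yet passed.
    """
    reached = STAGES[0]
    for stage, next_stage in zip(STAGES, STAGES[1:]):
        if not gate_results.get(stage, False):
            break
        reached = next_stage
    return reached


# Hypothetical: strong on historical and out-of-sample data, not yet proven on paper.
print(furthest_stage({"backtest": True, "out_of_sample": True, "paper_trading": False}))
# prints: paper_trading
```

The same shape fits the enterprise AI case from the conversation if the stages are renamed, for example offline eval, pilot, limited rollout, and production.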

Ashley Stirrup: Yeah. I love that story you just told, because what you showed there is how a business went through this evolution of, what are our North Star metrics, how do we measure them, and how do we go through a process? Maybe that's a real opportunity for a lot of other businesses to apply a similar playbook.

Maybe they don't know, at a certain e-commerce company or a certain food delivery service, what their metrics should be. But they can at least put in place a rigorous process to evolve over time and continuously learn. Because once you get to those core metrics, I just think that's a really powerful unlock.

Mayank Agarwal: Yeah, couldn't agree more with that.

Ashley Stirrup: Final topic for today's episode: how do you see experimentation evolving in the future?

Mayank Agarwal: That's a really good question actually. I have been thinking [00:27:00] about it quite a bit. And I think two shifts stand out to me, especially in the AI world. The first shift that I see is the thing that we are experimenting for, it is changing, right? We are heading towards a world where AI agents browse and trans- transact for people.

So increasingly, you are designing for an agent as the visitor instead of a human person as the visitor, right? And a lot of the assumptions that are baked into classic A/B testing, they start to wobble when the user is software instead of human The second shift that I see is that experimenting on AI, it itself needs new methods.

You cannot A/B test a probabilistic system the way you would test a button color, right? Because with a probabilistic system, the same input does not give you the same output every time. So what you do instead is lean on offline evaluation harnesses before anything goes into production, and then [00:28:00] you have online evaluation harnesses and guardrails that continue running while the product is live in production.

And another thing that I have observed with AI products is that you're often trading off things that pull against each other, right? These are things like latency and answer quality, the quality of your agent's output. So when it comes to the speed of these agents, the latency part, that is measuring whether the responses generated by your agents meet the user's expectations, right?

Because oftentimes, an AI agent generates a response that comes too late in the conversation. Think, for example, about an assistant product for a customer service agent dealing with real customer interactions in a contact center setting. If you are using AI to help that customer service agent and the AI's responses are coming late, it's not going to help the customer service agent, right?

So that's the trade-off here. Another thing that I see with AI agents is the speed of innovation itself, [00:29:00] right? As AI coding tools become more and more prevalent, people are using them to build new features at an ever-increasing pace, right? And that itself is compressing the experimentation loop.

Because not only are people generating these features at an ever-increasing rate, but they are also using these AI tools to generate and read tests faster than they have before. So in summary, I would say that the mindset stays the same: every feature, every product change that you roll out is a hypothesis that you have to test.

And the methods that are used for testing those hypotheses are changing. In the traditional world, we have been designing experiments to measure how users react to a certain change on your website. In the future, with AI agents, it will be about constructing experiments that work with AI agents. [00:30:00]
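The point above about offline evaluation harnesses for probabilistic systems can be made concrete with a small sketch. This is only an illustration under stated assumptions, not how RingCentral or any particular team does it: `make_toy_agent`, `pass_rate`, `evaluate`, `CASES`, the 20 runs per case, and the 0.9 threshold are all hypothetical names and values. The idea is that, because the same input can produce different outputs, each test case is run many times and judged on its pass rate rather than on a single response.

```python
import random
from typing import Callable, Dict, Tuple

Agent = Callable[[str], str]
Check = Callable[[str], bool]


def make_toy_agent(seed: int, failure_prob: float) -> Agent:
    """Hypothetical stand-in for a real model call: fails at random with `failure_prob`."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < failure_prob:
            return "I am not sure."
        return f"Summary: {prompt}"

    return agent


def pass_rate(agent: Agent, prompt: str, check: Check, runs: int = 20) -> float:
    """Call the agent `runs` times on one prompt and return the share of passing outputs."""
    passed = sum(1 for _ in range(runs) if check(agent(prompt)))
    return passed / runs


def evaluate(
    agent: Agent,
    cases: Dict[str, Tuple[str, Check]],
    runs: int = 20,
    threshold: float = 0.9,
) -> Tuple[Dict[str, float], bool]:
    """Return per-case pass rates and whether every case clears the threshold."""
    rates = {
        name: pass_rate(agent, prompt, check, runs)
        for name, (prompt, check) in cases.items()
    }
    return rates, all(rate >= threshold for rate in rates.values())


# Illustrative test cases: each pairs a prompt with a pass/fail check on the output.
CASES: Dict[str, Tuple[str, Check]] = {
    "billing_call": ("customer asked about a refund", lambda out: out.startswith("Summary:")),
    "support_call": ("customer reported a login issue", lambda out: out.startswith("Summary:")),
}

if __name__ == "__main__":
    rates, ok_to_ship = evaluate(make_toy_agent(seed=7, failure_prob=0.2), CASES)
    print(rates, "ship" if ok_to_ship else "hold")
```

The same `evaluate` function could, in principle, be rerun on sampled production traffic as an online guardrail, with the checks swapped for whatever quality signal the team trusts.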

Ashley Stirrup: Yeah. Your point about not being able to test AI in the same way as other things made me think of a couple of things. One is that we had another guest talk about how they had built a prompt with step one, step two, step three, step four, five, six. And they had a mistake in their prompt.

It was just a typo, so they went and corrected it. But their policy was to A/B test everything, so they A/B tested the correction, and the corrected prompt actually underperformed the incorrect one. So they kept the original prompt, which is a crazy story, but whatever works.

The other thing you made me think about is that the fact that AI is probabilistic means that maybe the prompt and the bot you set up work for one class of users but not for another. And maybe that shifts over time as people ask questions [00:31:00] in different ways.

Mayank Agarwal: Exactly, yeah. This is something that we have observed across industries as well, right? Take the example I was using earlier, generating smart notes. You generate smart notes, and they work really well for one industry, one class of users, right? For another industry like healthcare or financial services, the jargon is different, the workflows are different, and that same prompt might not work there.
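One way to surface the effect described here, where the same prompt works for one industry and not another, is to break a behavioral metric down by segment instead of reporting a single average. The sketch below computes the DART acceptance rate (the share of outputs users keep without editing) per industry. The event schema (`segment`, `kept_unedited`), the function name `acceptance_by_segment`, and the sample records are all made up for illustration and are not data from the episode.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def acceptance_by_segment(events: Iterable[Mapping]) -> Dict[str, float]:
    """Share of AI outputs kept without edits, grouped by the event's segment."""
    kept: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for event in events:
        total[event["segment"]] += 1
        if event["kept_unedited"]:
            kept[event["segment"]] += 1
    return {segment: kept[segment] / total[segment] for segment in total}


# Made-up log records, one per generated smart note (illustrative only).
SAMPLE_EVENTS = [
    {"segment": "retail", "kept_unedited": True},
    {"segment": "retail", "kept_unedited": True},
    {"segment": "retail", "kept_unedited": True},
    {"segment": "retail", "kept_unedited": False},
    {"segment": "healthcare", "kept_unedited": True},
    {"segment": "healthcare", "kept_unedited": False},
    {"segment": "healthcare", "kept_unedited": False},
    {"segment": "healthcare", "kept_unedited": False},
]

if __name__ == "__main__":
    for segment, rate in acceptance_by_segment(SAMPLE_EVENTS).items():
        print(f"{segment}: {rate:.0%} of notes kept without edits")
```

In this toy data the blended acceptance rate would look mediocre, while the per-segment view shows one industry doing well and another doing poorly, which is the signal that a prompt may need to be adapted for that segment.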

Ashley Stirrup: Yeah, I've seen other companies where AI has forced people to embrace A/B testing. Before, they used to think, "Oh, I know how this is gonna work." Maybe it didn't actually work that way in real life, in terms of how users reacted to it. Maybe they didn't love the new feature like they thought.

But with AI, people know they don't know how the response is gonna be. They know the bar is higher and that they just have to do more experimentation. So hopefully that leads to a lot more learning.

Mayank, I really loved having you on the show today. I think we covered some fascinating topics. I love the DART concept. So thank you so much for joining us today.

Mayank Agarwal: [00:32:00] Thank you, Ashley. It was my pleasure to join this podcast today.

Thanks for tuning in. If you enjoyed this conversation, please support our channel by hitting the like button and subscribing. Better yet, share the episode with a friend. I'm Ashley Stirrup with GrowthBook. We'll see you next time on the Experimentation Edge.