The Experimentation Edge

This episode of The Experimentation Edge unpacks how DoorDash's experimentation platform runs 12,000+ A/B tests per year across 42 million monthly active users — and now powers merchant-led testing on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading the platform, explains how feature flags, marketplace experimentation, and CEO-level experiment reviews built a multi-million-dollar experimentation culture across consumers, dashers, and merchants.

Summary
Most companies struggle to scale experimentation beyond engineering teams. DoorDash runs over 12,000 experiments per year across 42 million monthly active users — and now they're enabling restaurant owners to run their own A/B tests on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading DoorDash's experimentation platform, shares how the company built a three-sided marketplace testing program that balances consumers, dashers, and merchants across 40+ countries. From his time scaling search at Amazon (where offline model evaluation narrowed hundreds of candidates down to 10 for live testing) to preventing DashPass churn at DoorDash, Ilya reveals what happens when experimentation scales beyond product teams — and why CEO-level experiment review emails drive cultural change faster than any training program.

One standout learning: expanding the delivery radius beyond 11 miles increased grocery orders but tanked retail conversions. The lesson wasn't about distance. It was that a one-size-fits-all metric approach breaks in multi-dimensional marketplaces. DoorDash now segments experimentation by vertical, behavior pattern, and regional market, using AI agents to mine institutional knowledge from past tests and auto-generate experiment summaries that ship company-wide within hours of readout.

Timestamps
00:40 From building Wasabi (Intuit's open-source platform) to running ML at Amazon and Uber
03:04 Why product velocity without experimentation creates feature bloat, not impact
05:32 Scaling search at Amazon: billions of products, 10 visible results, 25% win rate
08:22 Offline evaluation as a filter — golden data sets cut model candidates before live traffic
10:23 DoorDash's three-sided marketplace: 300 million feature flag evaluations per second
12:38 CEO Tony Xu reads every experiment email and replies with alternative hypotheses
13:33 Democratization at scale: enabling merchants to A/B test menu pricing and promotions
17:05 DashPass churn experiment uncovered value perception gap — became a full product area
22:03 Why expanding delivery radius killed retail orders but boosted grocery conversions
24:16 No single North Star metric — balancing consumer quality, dasher earnings, merchant mix
27:29 Four-dimensional scale: democratization, global expansion, new verticals, AI agents
31:03 Agentic experimentation: AI mines past tests to generate hypotheses and debug imbalance

Takeaways
- Win rate matters less than learnings per test — DoorDash ships company-wide experiment summaries (win or lose) that the CEO actively reads and responds to, creating cultural accountability around testing rigor.
- Offline evaluation acts as a pre-filter for model velocity — Amazon's search team used golden data sets to cut hundreds of ML candidates down to 10 for live A/B testing, preventing wasted experiment slots.
- One-size metrics break in multi-dimensional marketplaces — DoorDash balances consumer retention, dasher utilization, and merchant inventory mix across verticals because optimizing one side degrades the ecosystem.
- Democratization requires opinionated templates, not open-ended tools — enabling non-technical users to run tests means embedding success metrics and guardrails into pre-built experiment configs.
- AI scales institutional knowledge, not just analysis speed — mining past experiment readouts to auto-generate new hypotheses turns your testing history into a compounding advantage.

Connect with the guest
LinkedIn: https://www.linkedin.com/in/ilyaizrailevsky/
Learn more about DoorDash: https://www.doordash.com/

Sponsor
GrowthBook helps you ship features with confidence by bringing experimentation and feature flagging into one open-source platform. No more guessing whether that new checkout flow actually moved the needle, waiting weeks for data team bandwidth, or flying blind on rollouts.

GrowthBook gives you a single place to run A/B tests, manage feature flags, and analyze results against your existing data warehouse.

With powerful stats built in, it takes the complexity out of experimentation, helps you catch regressions before they hit every user, and makes it easy to test ideas that keep your product improving and your metrics moving in the right direction.

See a demo at https://www.growthbook.io/

Topics: A/B testing, experimentation platform, feature flags, marketplace experimentation, machine learning, growth experimentation, statistical significance, experimentation culture, agentic AI workflows.

What is The Experimentation Edge?

How do product teams decide what to build and what not to? The Experimentation Edge is the podcast where product, growth, and engineering leaders share how A/B testing, feature flags, and experimentation drive real business outcomes — backed by named companies and real numbers. From DoorDash's 12,000 A/B tests a year to Atlassian's experimentation-led product win to UPS's $500M experimentation team, each episode goes deep with operators running experimentation programs at scale.

Hosted by Ashley Stirrup, CMO at GrowthBook and a 25-year executive in data and experimentation. For product managers, engineers, data scientists, and growth leaders at B2B tech companies who care about experimentation culture, statistical rigor, and shipping with confidence. No marketing speak. Just operators explaining what they shipped, what moved the needle, and how experimentation reshaped their teams.

Topics: A/B testing, experimentation, growth experimentation, product experimentation, tech experimentation, feature flags, experimentation culture, statistical significance, marketplace experimentation, conversion rate optimization, experimentation at scale.

Ashley Stirrup (00:00.266)
One sec.

Ilya Izrailevsky (00:01.74)
No way.

Ashley Stirrup (00:16.0)
Hello and welcome to today's episode of the Experimentation Edge. We have Ilya Izrailevsky, who is the Senior Engineering Manager at DoorDash, and he's leading the experimentation platform there. He brings with him rich experience across a wide range of companies, including Amazon, Robinhood, Uber, Intuit and PayPal. So welcome to the show, Ilya.

Ilya Izrailevsky (00:40.195)
Thank you, Ashley. Really glad to be here. I love experimentation and yeah, really excited to share my experiences and make sure that folks listening will learn something new.

Ashley Stirrup (00:55.958)
Well, great. I'm really excited to have you on and we both share a passion for experimentation. So that's a great start. One interesting thing for you is you've lived on both sides of the fence. You've owned the engineering for the experimentation platform and then you've run engineering for products that use the experimentation platform. Do you have some examples of like lessons learned as a user of the platform versus owning the platform?

Ilya Izrailevsky (01:25.282)
Yes, definitely. Again, I'm Ilya. I'm currently leading DoorDash's experimentation platform across both engineering and science, and I partner very closely with product engineering teams and analytics folks. My background has been in both building experimentation systems and using them. For example, at Intuit we built our experimentation system from scratch years ago, and then I led an effort to open-source it. So it was one of the early-stage experimentation platforms available as an open-source project. And then I led the experimentation platform at Robinhood and now at DoorDash.

But then I also built data-driven systems, for example, owning search at companies such as Uber and Amazon, and also working on recommendation and personalization systems at Intuit. I also owned internationalization efforts at PayPal, in the earlier days of PayPal. But no matter whether I've been building experimentation systems or using them, what I found is that experimentation is super critical to make sure that you build the best product possible, and that you try with tests rather than just building minimum viable products, making sure that what you're building is actually satisfying customer needs.

Ashley Stirrup (03:04.659)
Yeah, I couldn't agree more. I think in today's age of AI, where it's easier and easier to develop products, what's going to differentiate people is not how many features they have, but how excellent the features are that really matter to users.

Ilya Izrailevsky (03:19.81)
Yes, absolutely. So this is why, for example, at Amazon, when I was leading machine learning automation for their main customer-facing e-commerce site search, we needed to make sure that customers actually find what they're looking for when they're searching for specific types of products through Amazon's mobile or web experiences.

And behind the scenes, there is so much going on. You're literally going from billions of products that are in Amazon's catalog down to the 10 to 20 products that Amazon's customers actually see. And there are many, many machine learning models that run in the background every time you do a search query, including matching and ranking models. And we needed to run tons of experiments to make sure that those models are delivering the right types of products that customers are looking for.

So for example, optimizing for things such as brand diversity, so that people see results from different types of brands, or price differentiation. It could be cheaper products, or more expensive products that could be of higher quality, or fast shipping speed, to ensure that customers receive their product on time. So going through so many different products and narrowing down to just the 10 to 15 to 20 that people are actually looking for, there you need to run many, many experiments and make sure that they're successful and that we're delivering customer impact. So what I learned is it's not just about the experiment; that's not the goal. Customer impact is what really matters. Experimentation is just a vehicle to enable you to do that.

Ashley Stirrup (05:32.757)
Yeah, yeah, boy, I'm just struck as you tell that story about the scale of the challenge there. Because, you know, there's so many different dimensions that you would want to experiment on and try to understand, you know, does variety of brands matter more or, you know, what are the different levers you might pull? And so I would imagine it's a very

Ilya Izrailevsky (05:53.59)
Exactly. It's a multi-dimensional optimization. So you're looking at different types of metrics, and you're trying to see whether the experiments that you are running are satisfying these various needs. And you need to balance them. At times it's not just about one magic metric that you're going after. There are many, many different metrics that you need to take into account, as well as looking at your guardrail metrics to make sure you're not degrading those.

as in the mint.

Ashley Stirrup (06:25.405)
Yeah, and so I would guess that was a very cross-functional effort where you had a wide variety of people that had opinions on how to optimize search.

Ilya Izrailevsky (06:35.566)
100%. So while my team was mostly machine learning engineers focused on building machine learning models, our larger org was mostly data science focused: folks with data and science experience, looking at various types of models and understanding customer patterns, what Amazon's customers are looking for and what they're interested in finding.

And optimizing for different types of customers, customers who are maybe just registering with Amazon or power users who are looking for many different types of products. And working with product managers and engineering. So yeah, many, many different stakeholders in place, but experimentation is a vehicle to bring everybody together and say, yeah, we can have a lot of opinions, but at the end of the day, it doesn't matter what we think. What matters is what our customers usually do, their behavior.

Ashley Stirrup (07:34.046)
Yes.

Ashley Stirrup (07:44.575)
Yeah.

Yeah. What was your win rate? Do you remember?

Ilya Izrailevsky (07:52.814)
At Amazon, it was something in the range of about 25 to 30%. So you can imagine that you run a lot of different experiments, but you need to really look at different arms. So it's not just an A/B test. It would be a test that runs across multiple...

Ashley Stirrup (08:02.972)
Wow, that high, yeah.

Ashley Stirrup (08:15.625)
Yes.

Ilya Izrailevsky (08:22.774)
model candidates, and you're looking at different ones at the same time. And by the way, we also did a lot of what's called offline experimentation. For example, sampling different types of queries that Amazon's customers run, and then generating sample and golden data sets of products for a given query. If you're looking for, for example, a vacuum cleaner,

Ashley Stirrup (08:24.21)
Right.

Ilya Izrailevsky (08:52.29)
what types of products you would expect to see at different price ranges and things like that. Make sure that those show up. Then you're going, let's say, from hundreds of machine learning model candidates down to maybe 10 that are performing the best. And those 10 you would run as online experiments, testing with real customers. So yes, there's a lot going on, both offline

Ashley Stirrup (08:56.466)
Right.

Ilya Izrailevsky (09:21.902)
training and evaluation and then online experimentation with real customers.
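
As a rough illustration of the offline filter described above, here is a minimal Python sketch. It scores each model candidate against a golden data set and keeps only the best few for online testing. The ranking metric (NDCG@k), the function names, and the toy data are all assumptions made for illustration; the episode does not say which metric or tooling Amazon used.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for graded relevances listed in rank order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, golden, k):
    """NDCG@k of one ranked result list against golden relevance labels."""
    gains = [golden.get(pid, 0) for pid in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(golden.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def shortlist(candidates, golden_sets, k=10, keep=10):
    """Rank candidates by mean NDCG@k over all golden queries; keep the best."""
    scores = {}
    for name, rank_fn in candidates.items():
        per_query = [ndcg_at_k(rank_fn(query), labels, k)
                     for query, labels in golden_sets.items()]
        scores[name] = sum(per_query) / len(per_query)
    return sorted(scores, key=scores.get, reverse=True)[:keep]

# Hypothetical golden set: query -> {product_id: graded relevance}.
golden_sets = {"vacuum cleaner": {"p1": 3, "p2": 2, "p3": 1}}

# Hypothetical candidates: each maps a query to a ranked list of product ids.
candidates = {
    "model_a": lambda query: ["p1", "p2", "p3"],
    "model_b": lambda query: ["p3", "p9", "p1"],
    "model_c": lambda query: ["p9", "p8", "p7"],
}

# Only the shortlisted candidates would go on to a live experiment.
finalists = shortlist(candidates, golden_sets, k=3, keep=2)
```

In practice the golden sets would cover many sampled queries per category, and the shortlist size would be tuned to how many live experiment slots are available.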

Ashley Stirrup (09:26.652)
Yeah, that makes a lot of sense. I can see how that helps go from an infinite number of options to, OK, we've made sure that we're not breaking vacuum cleaners or, I don't know, dog food or whatever it might be, right? And we're actually getting a better and better model over time. So that's terrific.

Ilya Izrailevsky (09:47.446)
Exactly. And there are many different categories on Amazon, right? You're looking at books, retail products, electronics, household items, and so on. Many are non-perishable items, but for different categories you actually need to run separate experiments. Make sure that you're focusing on specific types of categories and providing the best results for those categories, because they vary widely.

Ashley Stirrup (09:57.02)
Yes.

Ashley Stirrup (10:12.572)
Yes, that makes total sense. Well, now let's turn a little bit to DoorDash. I'd love to hear about the experimentation program there and how it's evolving.

Ilya Izrailevsky (10:23.022)
Yes, definitely. So DoorDash is different in that we're focusing on your local businesses, your mom-and-pop shops as well as just businesses that are around you, including restaurants and grocery stores and electronics. Nowadays you can, for example, buy a TV through DoorDash from Best Buy. But we run our experimentation at scale.

We've had our experimentation platform that we've been building over the last five to six years. We run over, let's say, about 12,000 experiments per year. With over 42 million monthly active users on DoorDash's platform, at peak we reach something like 300 million evaluations of experiments with feature flags per second.

So our scale is very, very, very large. But what makes DoorDash unique is it's a very operations-driven and data-driven company. So everything we do, at the end of the day, is about the customer. Are you getting the delivery of the lunch that you ordered on time and with high quality? We make sure we don't deliver food cold to you, and that you can get your lunch on time so you're ready for your next meeting that may be coming up. But the way it's enforced is that, for example, after we run an experiment, no matter the results, whether ship or no ship, we send the results out across the company and have a discussion about those experiments. So for example, our leadership, including our CEO, Tony, would read and reply to those experiment emails and would congratulate folks, but also encourage them to try some alternative ways or approaches to the same problem. So this really builds a culture where experimentation is encouraged and everything we do should go through an experiment.
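
For readers unfamiliar with what a feature flag "evaluation" involves, here is a minimal Python sketch of deterministic hash-based bucketing, a common way to assign users to variants without storing any per-user state. This is a generic illustration, not DoorDash's implementation; the function name, the key format, and the choice of SHA-256 are assumptions.

```python
import hashlib

def assign_variant(experiment_key, user_id, variants):
    """Deterministically map a user to a variant.

    variants: list of (name, weight) pairs whose weights sum to 1.0.
    The same (experiment_key, user_id) pair always yields the same variant.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # Treat the first 32 bits of the hash as a uniform point in [0, 1).
    point = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for name, weight in variants:
        cumulative += weight
        if point < cumulative:
            return name
    return variants[-1][0]  # guard against floating-point rounding

# Hypothetical 50/50 experiment.
split = [("control", 0.5), ("treatment", 0.5)]
variant = assign_variant("checkout_redesign", "user-123", split)
```

Because the assignment is a pure function of the experiment key and the user id, it can be computed on any server or device at very high request rates and still give each user a consistent experience.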

Ashley Stirrup (12:38.556)
Yeah, boy, that's fascinating. Yeah, I'm definitely starting to notice a pattern that the strongest experimentation programs have the CEO actually reading the results of experiments. I think at other companies, several layers below the CEO is probably about as high as the results actually get shared.

Ilya Izrailevsky (13:01.07)
Right. Yeah, being there in the details for somebody like a CEO really encourages the rest of the company to be also in the weeds and understand what we're building, what types of features we're building and what we're experimenting with.

Ashley Stirrup (13:17.576)
Yeah. And how would you describe the model? Like how much is centralized versus, you know, how much are each, is each individual team empowered to do experimentation? Do they have data scientists in the various groups or is that centralized?

Ilya Izrailevsky (13:33.336)
Yes. So in terms of our model, we do have the central experimentation platform, which I lead, and I partner very closely with our analytics group, which is centralized. But their data scientists are embedded with the product teams, with engineering, product managers, and folks like that, to make sure we run successful experiments.

But now we're actually working on what's called democratization. We want to make sure that we can go beyond engineers setting up and instrumenting experiments in their mobile, web, or backend code, and data scientists and analytics folks analyzing the results. We would like to get to a point where folks without a PhD in statistics are able to set up and run experiments and actually learn from them. Right? So we're focusing on product managers and user interface designers, advertisers, marketers, and business operations folks, as we're launching in new countries and new regions, being able to run their own experiments and understand and analyze the results. Of course, it's more difficult than it sounds, because there's a lot going on in order to run successful experiments.

So we're working closely with analytics partners to set up what are called very opinionated experiment templates, where we embed, for example, success and guardrail metrics so that experiments can be run very, very quickly. And our longer-term focus is to go even beyond company-wide experimentation, to merchants such as restaurant or store owners. We would like them to be able to run their own experiments, for example, on the various promotions that they're running or the prices of the menu items that they're showing to customers, so that they can attract more customers to their local stores. And this would be a win-win-win across the board, because customers will have more selection across menu items,

Ilya Izrailevsky (15:56.118)
prices and inventory of stock that grocery stores, for example, hold. And it would benefit dashers and couriers who deliver results because they'll have orders and also benefit DoorDash as a company as a whole because we'll have more orders to deliver. So it's all about the ecosystem, how we can empower the entire ecosystem through experimentation.
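
To make the idea of an opinionated experiment template concrete, here is a minimal Python sketch in which the success metric and guardrails are fixed by the template and the person launching the test supplies only a hypothesis. Every class name, metric name, and threshold below is hypothetical; the episode does not describe the actual template format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_drop: float  # 0.01 means tolerate at most a 1% drop

@dataclass(frozen=True)
class ExperimentTemplate:
    name: str
    success_metric: str
    guardrails: tuple
    default_traffic_share: float = 0.10

    def new_experiment(self, hypothesis, traffic_share=None):
        """Build an experiment config; metrics come from the template, not the user."""
        share = self.default_traffic_share if traffic_share is None else traffic_share
        if not hypothesis.strip():
            raise ValueError("a hypothesis is required")
        if not 0 < share <= 0.5:
            raise ValueError("traffic share must be in (0, 0.5]")
        return {
            "template": self.name,
            "hypothesis": hypothesis,
            "success_metric": self.success_metric,
            "guardrails": {g.metric: g.max_relative_drop for g in self.guardrails},
            "traffic_share": share,
        }

# Hypothetical template a merchant might use for a menu price test.
MENU_PRICE_TEMPLATE = ExperimentTemplate(
    name="menu_price_test",
    success_metric="orders_per_store",
    guardrails=(
        Guardrail("customer_retention", 0.01),
        Guardrail("dasher_earnings_per_hour", 0.02),
    ),
)

config = MENU_PRICE_TEMPLATE.new_experiment(
    "Lowering the price of the lunch combo increases orders."
)
```

The design choice here is that a non-technical user cannot forget a guardrail or pick an unsuitable success metric, because neither is a free-form input.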

Ashley Stirrup (16:18.483)
I love that. And that's a great example of innovation. Like I've heard a lot of companies talk about wanting to democratize experimentation internally. But it's the first time I've heard about, you know, I'm not sure if they're customers, maybe your partners, how do you enable your partners to do experimentation too? And that sounds very powerful.

Ilya Izrailevsky (16:39.714)
Definitely, yeah, because we are at the end of the day, a marketplace. So we connect local businesses to customers, to local individuals and dashers who deliver them. So we're Meganuims, but yeah, for us, it's all about enabling all the various parts of this ecosystem.

Ashley Stirrup (16:58.087)
Yeah, yeah. So can you tell us about an experiment or two where you had a lot of learnings?

Ilya Izrailevsky (17:05.186)
Definitely. Let me tell you about a couple of experiments, one that we ended up shipping and another that did not ship. But at the end of the day, what I learned is there's no such thing as a failed experiment. Every experiment is a learning opportunity, no matter how it goes, right? It's just that the ones you don't ship, maybe you kill, but the ones that ship, or that you need to iterate on, you should continue iterating on.

One experiment was about DashPass subscribers, the equivalent of Amazon Prime customers. We have a program that encourages people to order through us, and we give, for example, delivery fee discounts, various promotions, faster shipping speeds, and we add in other perks, such as, for example, streaming service subscriptions as well. But at one point we experienced a lot of churn, where DashPass customers were canceling their subscriptions. So we looked deeper and actually ran an experiment where, at the point of cancellation, we would show them the various perks and the actual value that they're getting through DoorDash. In fact, for example, if you're placing more than about three orders per month, you're already recouping the subscription fee, which is about $10 per month. So it's not a big feat to actually invest in being a DashPass subscriber.

By running this experiment, we found that we saved thousands of DashPass subscriptions per week. But not only that, we really found that we need to be proactive in messaging, showing the value of DashPass throughout the experience at DoorDash. So for example, during the checkout flow, at confirmation time, during the monthly recap of your orders, and things like that. So kind of keep restating the value. And right now it has become a whole product area, we call it DashPass, where we are proactively focusing on subscribers.

Ashley Stirrup (19:32.347)
Yeah, I love that story. So basically you started off running an experiment to learn more about how you stopped cancellations of DashPass. And then it led to the learning that, our customers don't realize how much value they're getting here. And then that led to a whole basically new product area for you where you were constantly educating your customers. That's a great example of how experimentation learning can lead to changes in product strategy.

Ilya Izrailevsky (20:00.75)
Definitely, it became a product area and we have a separate roadmap for it. It's a whole area that we're proactively focusing on. Another experiment, one that did not ship but was just as valuable, came as we were launching what are called new verticals, new product areas going beyond restaurants, such as grocery stores, electronics, or retail.

At that point, we ran an experiment where we said, well, if we go beyond the default 11-mile radius, the longest distance we look across your area for stores to deliver from, this should increase the selection and also should lead to more orders from those stores. But what we found is that while this worked for some new verticals, for example, grocery stores, it really did not work for others, and we actually decreased our order volume and order rate overall. This introduced more noise and higher delivery fees for retail, for clothing types of orders.

And we ended up killing that. What we learned is that one size really does not fit all. You have to really look at different types of verticals and understand the customer behavior. And this resulted in us being more careful in how we approach experimentation, really distinguishing and understanding the kinds of behavior for grocery stores versus electronics orders versus clothing.

Ashley Stirrup (22:03.622)
Yeah, that makes a lot of sense. I could see that as a customer, you know, I probably want food from maybe five miles away or closer. And I don't want to have my experience cluttered with a bunch of options that are too far away and the food's going to be too cold when it gets here. Versus if I'm going to order a new TV and I don't want to drive, you know, 20 minutes away or 30 minutes away to go get a new TV and I can order it on DoorDash, that that's fine. I'm willing to wait that extra time. There's not the same level of urgency.

Ilya Izrailevsky (22:33.582)
Exactly. But yeah, for example, if you're buying a T-shirt, something that's, you know, $20, you're probably not willing to pay a higher delivery fee just to get it, you know, maybe cheaper. So yeah, exactly. It really depends on the type of product that you're looking for and how much you're willing to pay.

Ashley Stirrup (22:55.386)
Yeah, yeah, that seems like an area where experimentation is particularly powerful: understanding, category by category, how do you dial in the right delivery range?

Ilya Izrailevsky (23:05.166)
And our intuition might not work by default, so it's really important to just test it and see what works and what doesn't.

Ashley Stirrup (23:13.328)
Yeah. Yeah, that's a great point. Like I would imagine that a lot of companies that aren't as experimentation centric, they'd think, well, I can just figure this out. I can just guess at what the range is, but you could probably significantly outperform that by doing experimentation and getting real data. Yeah.

Ilya Izrailevsky (23:34.03)
Yeah. And maybe default intuition tells you, well, if I just have more and more stores around me, I'll have more selection and I should be able to just get more things, more orders. But the reality is that it's better to focus on your local stores, providing the best selection across those and the best value for customers, rather than going very, very far and then getting into more noise, more things.

Ashley Stirrup (24:01.786)
Yeah. Yeah, that makes a lot of sense. So do you have, you know, a very clear kind of North Star or overall evaluation criteria metrics at DoorDash that you've all aligned on?

Ilya Izrailevsky (24:16.012)
Yeah, so again, because we're a marketplace, a three-dimensional marketplace, we don't have one magic number or metric that we're going after because, again, we need to balance between consumers, dashers, and our merchants. But for the specific apps that we support, we focus on different types of metrics. So for example, for our main DoorDash consumer app, we're focusing on the quality and reliability of the orders that customers are getting, as well as overall customer satisfaction and retention. For dashers and couriers, we're focusing on their earnings and their utilization, making sure that they're not sitting idle in the car waiting for the next order. And we're focusing on fairness, so distributing orders across those dashers.

And then on the merchant side, we're focusing on order mix, making sure that they're getting orders across their inventory so that they don't stock up on just one specific item that ends up depleting their inventory, as well as unit economics, right? For example, for grocery stores, how much they should be investing versus how much they're gaining, and the overall longevity of your local stores, right? So it's not just about weekly orders or monthly orders. For the long term, are they able to grow their business by partnering with DoorDash? And when we run experiments, it's not just about one success metric. We have many different guardrail metrics that protect the rest of the ecosystem, to make sure that you're not degrading it.

Ashley Stirrup (26:10.042)
Yeah, boy, that's very partner centric of you to be thinking about not only how do I kind of maximize today for the, you know, the local vendors in this particular area, but how do I help make sure they've got a long-term successful business? yeah, yeah, it's also really interesting about the, or the making sure that they're getting a good mix.

Ilya Izrailevsky (26:25.036)
Yes, because it's in our interest exactly for them to be successful.

Ashley Stirrup (26:34.906)
Like you could easily see somebody think, well, as long as they're buying, I don't care if they're all buying the same thing. But yeah, if that particular vendor, they run out of that and then they're not selling anything, that's not a good outcome for them.

Ilya Izrailevsky (26:48.11)
Yes, exactly. So it's a classic search problem, exploit versus explore, right? So you should exploit things that customers already ordered, but you also want to show them maybe new things, kind of letting them explore new things that they can look at and maybe start ordering. Because if you just focus on one thing, at some point people get tired of it and they say, oh, is this? So you want to have that mixture

Ashley Stirrup (26:55.536)
Yeah.

Ashley Stirrup (27:03.569)
Yeah.

Ilya Izrailevsky (27:17.774)
of different types of orders that people try out.
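
The exploit-versus-explore trade-off mentioned above can be sketched with an epsilon-greedy policy, one of the simplest approaches to it. This is a generic illustration in Python, not a description of what DoorDash runs; the function name, the item names, and the epsilon value are assumptions.

```python
import random

def epsilon_greedy_pick(order_counts, epsilon=0.1, rng=random):
    """Pick one item to feature.

    With probability epsilon, explore: pick any item uniformly at random.
    Otherwise, exploit: pick the item with the most past orders.
    """
    items = list(order_counts)
    if rng.random() < epsilon:
        return rng.choice(items)
    return max(items, key=order_counts.get)

# Hypothetical past order counts for one merchant's menu.
order_counts = {"burrito": 120, "tacos": 45, "new_salad": 3}

rng = random.Random(0)  # seeded only so the sketch is reproducible
picks = [epsilon_greedy_pick(order_counts, epsilon=0.2, rng=rng) for _ in range(1000)]
```

With a small epsilon, most slots still go to the proven best seller, while the remaining share gives newer items a chance to collect orders and avoids over-concentrating demand on a single item.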

Ashley Stirrup (27:21.946)
Yeah. And as we wrap up here, how do you see experimentation at DoorDash evolving?

Ilya Izrailevsky (27:29.514)
Definitely. As we're scaling experimentation, we're going into a multi-dimensional scale. On one side, as I mentioned, we're focusing on democratization, right? Making sure that folks throughout the company will be able to run their own experiments and learn from them very, very quickly. So removing the friction of experimentation.

The other dimension is global scale, right? So we're acquiring companies around the world, including Wolt, based in Finland, and Deliveroo, based in London, UK, and now we're in over 40 different countries around the world. So really understanding regions and marketplaces and behaviors of customers.

And also, we're going into different verticals, as I mentioned, going beyond restaurants into grocery stores, deliveries, electronics, and retail, which introduces another dimension. And finally, last but not least, AI is a big factor right now. So with AI, we're able to ship a lot more features. The barrier to entry to build and test and launch a feature has been lowered, but with more and more features coming, we want to make sure that we experiment with them and that those features provide customer value. It's not just garbage in, garbage out, right? Are we selecting the best features that actually benefit those customers? So we need to ensure that we scale experimentation with that AI-powered growth.

We've been enabling experimentation for use cases such as looking at the institutional knowledge, the past experiments that we've already launched, and using AI to mine those and see what worked and what didn't work, and using past hypotheses to inform the new hypotheses that we should be launching.

Ilya Izrailevsky (29:56.754)
And then also we're now adding agentic experiences throughout the experimentation lifecycle, right? So that would be being able to set up, run, and analyze results on an experiment, debug any issues with experiments, for example imbalance issues and things like that, as well as generating experiment summaries and readouts very quickly, the ones I mentioned that we send out across the whole company.

So really powering the experimentation life cycle with AI.
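
One common reading of the "imbalance issues" mentioned above is a sample ratio mismatch, where the observed split between groups drifts from the intended allocation. The sketch below assumes that reading and a two-group test; `srm_p_value` and its parameters are hypothetical names, and it uses a standard chi-square goodness-of-fit test rather than anything confirmed about DoorDash's tooling.

```python
import math


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-group traffic split.

    With two groups there is one degree of freedom, so the chi-square
    survival function reduces to erfc(sqrt(chi2 / 2)).
    """
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    return math.erfc(math.sqrt(chi2 / 2))


# A very small p-value suggests the split is not the intended 50/50.
print(srm_p_value(5000, 5500))
```

A tiny p-value here points to a problem with assignment or logging, so the experiment's results should be treated with suspicion until the cause is found.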

Ashley Stirrup (30:30.021)
Well, that's quite an ambitious future you have in front of you, because there are so many different dimensions, like new companies, new product lines, AI, more users, even your partners doing testing. That's pretty incredible. So where do you feel like there's the biggest opportunity within AI amongst all those different things? Is it more about scaling or trying to help people be smarter before they run the test? Or is it more about using AI after the fact to really analyze the results?

Ilya Izrailevsky (31:03.994)
I would say throughout the experimentation life cycle. I really look at AI as an opportunity to get insights more quickly, starting from while you're, for example, setting up your hypothesis. We used to just create documents describing the experiment. Now you can go through an agentic experience and type out your hypothesis, what you're trying to test, and then let AI help you set up the right type of experiment, right? Would it be an A/B test or a multi-armed bandit experiment? Or if you're looking at search, it could be interleaving experiments, or we're doing what's called region-split experiments, using diff-in-diff, difference-in-differences. So there are many different types of tests that we need to be running. So yeah, letting AI help you while you're setting up an experiment, but then also if you're having issues...

Ashley Stirrup (31:36.932)
Yeah.

Ilya Izrailevsky (32:01.666)
Many times there are folks who are not tech savvy, and it can help them find and actually resolve, say, setup issues with their experiment. And then after they get their analysis results, it can help them look at those, you know, p-values and the various things going on with the experiment results. What are you actually learning from this? What is the data telling you?

Right. You used to need a data scientist or analyst to help you. But now AI can assist you in understanding your experiment results. So yeah, across the board, AI can help a lot.
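
For the region-split experiments with difference-in-differences mentioned in this answer, the core arithmetic is small enough to sketch. This is a minimal illustration of the textbook estimator, assuming comparable treated and control regions with parallel trends; the function name and the sample numbers are hypothetical and not from the episode.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences estimate of a treatment effect.

    Subtracts the control regions' before/after change from the treated
    regions' before/after change, removing trends shared by both groups.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical daily order counts per region, before and after a launch.
effect = diff_in_diff(
    treated_pre=[100, 102],
    treated_post=[112, 114],
    control_pre=[90, 92],
    control_post=[94, 96],
)
print(effect)
```

In this made-up example the treated regions rose by 12 orders and the control regions by 4, so the estimate attributes the remaining 8 to the change being tested.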

Ashley Stirrup (32:33.199)
Yeah.

Ashley Stirrup (32:42.768)
Yeah.

Ashley Stirrup (32:46.457)
Yeah, I think that you just did a great job of summarizing all the different opportunities for AI. One of the things I find the more and more I use AI is that it's just so much better than I am at consuming large amounts of information and processing it. If you can take that concept and then overlay on top of it the expertise, where it knows what the best practices are in data science, and you can provide an AI agent that can do both of those things, that's very powerful.

Ilya Izrailevsky (33:16.162)
Yes, exactly. And of course, there's a saying that a human is always at the helm, right? AI can help you do a lot of research and give you insights, but at the end of the day, humans need to make the calls and prioritize, for example, go and no-go decisions and things like that. But definitely AI can, exactly as you were saying, crunch a lot of information and drill down into the most important things, which otherwise takes humans a lot of time.

Ashley Stirrup (33:48.869)
That's right. Well, Ilya, thank you so much for joining us today. This has been an incredibly insightful episode across the board. So I really appreciate your time today.

Ilya Izrailevsky (34:00.408)
Thank you so much, Ashley, for organizing this. This has been a pleasure. And yeah, I hope everyone learned something today.

Ashley Stirrup (34:09.862)
Thank you so much.

What is The Experimentation Edge?