The Experimentation Edge

This episode of The Experimentation Edge explores how A/B testing, feature flags, and user research transformed Atlassian's talent product after it failed with its first users. Andrew Willingham — 11 years at Amazon, now Head of Legal and People Products at Atlassian — shares how product experimentation works when you can't test at scale, why your customer and your user are not the same person, and how the metrics you choose decide which experiments you can even run.

Summary
Andrew Willingham, Head of Legal and People Products at Atlassian, spent 11 years at Amazon before joining Atlassian a year ago. His path from running A/B tests on millions of Amazon shoppers to building talent management software for a few hundred thousand employees forced a fundamental shift: when you can't run tests at scale, you have to sit with your actual users and watch them fail. He shares how building a talent review product for Amazon's HR specialists completely flopped when handed to HRBPs — and why that failure taught him more than any winning experiment. Now at Atlassian, he's applying that same rigor to reimagining hiring processes with AI, testing everything from recruiter screens to interview sequences that the industry has run the same way for decades.

Timestamps
03:09 From marketing Amazon's mobile app to building HR software for 1.5 million associates
08:19 Why a talent review product loved by IO psych experts flopped with actual HRBPs
11:11 How A/B testing helps product managers escape opinion-based politics
15:25 Testing copy that changes behavior: "We'll generate that status report for you"
17:20 The two North Star metrics Andrew optimizes: efficiency and quality
19:05 Khan Academy's metric trap: measuring cognitive engagement, not just completion
21:10 Why product managers resist experimentation — and what changes when you admit you don't know

Takeaways
- Your customer and your user may not be the same person — building for HR specialists instead of the HRBPs who actually run talent reviews resulted in a feature nobody could use.
- When you can't test at scale, desk rides replace A/B tests — sitting with users and watching them struggle reveals failures faster than any dashboard.
- Experimentation short-circuits political debates by removing opinion from product decisions.
- Test metrics before you test features — usage time could signal engagement or just mean your product takes too long to do its job.
- The experiments that fail deliver the most valuable learnings, especially when you expected a slam dunk.

Connect with the guest
Andrew Willingham on LinkedIn: https://www.linkedin.com/in/andrewwillingham/
Learn more about Atlassian: https://www.atlassian.com/

Sponsor
GrowthBook helps you ship features with confidence by bringing experimentation and feature flagging into one open-source platform. No more guessing whether that new checkout flow actually moved the needle, waiting weeks for data team bandwidth, or flying blind on rollouts.

GrowthBook gives you a single place to run A/B tests, manage feature flags, and analyze results against your existing data warehouse.

With powerful stats built in, it takes the complexity out of experimentation, helps you catch regressions before they hit every user, and makes it easy to test ideas that keep your product improving and your metrics moving in the right direction.

See a demo at https://www.growthbook.io/

Topics: A/B testing, product experimentation, feature flags, user research, talent management, qualitative research, metric design, experimentation at scale, growth experimentation.

What is The Experimentation Edge?

How do product teams decide what to build and what not to? The Experimentation Edge is the podcast where product, growth, and engineering leaders share how A/B testing, feature flags, and experimentation drive real business outcomes — backed by named companies and real numbers. From DoorDash's 12,000 A/B tests a year to Atlassian's experimentation-led product win to UPS's $500M experimentation team, each episode goes deep with operators running experimentation programs at scale.

Hosted by Ashley Stirrup, CMO at GrowthBook and a 25-year executive in data and experimentation. For product managers, engineers, data scientists, and growth leaders at B2B tech companies who care about experimentation culture, statistical rigor, and shipping with confidence. No marketing speak. Just operators explaining what they shipped, what moved the needle, and how experimentation reshaped their teams.

Topics: A/B testing, experimentation, growth experimentation, product experimentation, tech experimentation, feature flags, experimentation culture, statistical significance, marketplace experimentation, conversion rate optimization, experimentation at scale.

Andrew Willingham (00:00)
I find that experimentation helps manage the politics. It's not opinion-based. It just is what it is. And so I find, as a product manager, one, that helps you manage executives, because I can just say, well, here's what the data shows. It's not my opinion or your opinion. This is the data. And that helps short-circuit a lot of these political or opinion-based comments that I think can slow us down.

Ashley Stirrup (00:55)
Welcome to today's episode. Today we have Andrew Willingham, head of legal and people products at Atlassian. And I'm particularly excited to have Andrew on because of his extensive career at Amazon. He spent over 11 years there and now he's been at Atlassian for a little over a year. And I think he brings a particularly unique perspective from more like the product manager and business perspective.

Ashley Stirrup (01:20)
So we have a mixture of guests on the show. Some are more data science oriented, some are more engineering. And so excited to have you on, Andrew. Welcome.

Andrew Willingham (01:28)
Thanks so much, Ashley. Really excited to be here and thanks for having me.

Ashley Stirrup (01:32)
Yeah, maybe you could kick things off by just talking a little bit about the evolution of your career.

Andrew Willingham (01:37)
Yeah, absolutely. As you mentioned in the intro, I lead people and legal products here at Atlassian. So that's kind of everything from our HRIS to the TA systems. When you interview people, we do all that, both what we build internally and what we license from third parties, right? So it's everything the business needs to run for the people and legal domains. I kind of came here not in a very direct route.

So as you mentioned, I joined Amazon in 2013 as a product manager focused on consumer marketing. My first job was trying to get people to download and use the mobile shopping application. I was kind of a mobile guy, had worked in the early days of the apps on the App Store. And so Amazon's like, cool, maybe this guy can help us drive some behavior change. And we did a lot of great work there. So I spent five years in Amazon's marketing department, basically trying to get people to come to the site and interact with our posts on social media networks, all focused on kind of this mobile social thing.

And I really loved that work, but I wanted a chance to lead a cross-functional product development team. At that time, I was leading a small team of product and marketing folks. And so I took a job in HR, on one of the first product development teams in Amazon's HR department, focused on building solutions to help Amazon's operations business scale what they were doing.

But at Amazon, operations means everybody in the fulfillment centers and delivery centers, like how we get actual physical packages to a customer's doorstep. So a lot of scale there, all over the world. And so we ended up building most of Amazon's 1P talent management software suite that was proprietary. So how we evaluate talent, how we do internal transfers, how we did promotions, how we did organizational design. And that was a big pivot point in my career.

And since then I've worked in a variety of HR focused roles across talent management, talent acquisition, and now here at Atlassian.

Ashley Stirrup (03:30)
Yeah, and so in your early days at Amazon, I imagine you did a lot of A/B testing.

Andrew Willingham (03:36)
Absolutely. I mean, this is a whole experimentation-focused podcast, but for those of you who don't know, Amazon A/B tests everything on the consumer website. So any little change we make, the buy button position, the color, font sizes, everything is A/B tested, and at massive scale. I mean, I don't know how many hits the amazon.com homepage gets these days, but it's a lot. And so when I was working in that marketing job, that applied to everything we did.

For example, if I wanted to put an ad on the detail page, like you go to a product page and it says, hey, download the Amazon app and check out on mobile or whatever, we would have to test that placement. One, of course, we would test it to see whether it drives the conversion we want or not. But two, we could not have a negative impact on page load latency or anything like that. So in that case, it was sort of a double A/B test. One is, are you doing no harm in terms of page load and that sort of stuff?

And then if that's cool, are you driving whatever metric you're optimizing on? And we would test that. Everything was tested.
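Sketch: the "double A/B test" Andrew describes, a do-no-harm check followed by a check on the target metric, might look roughly like this in Python. This is a minimal illustration, not Amazon's method. The function names, the dictionary fields, the 10 ms latency budget, and the 0.05 significance level are all assumptions, and the latency guardrail is a plain threshold on the observed difference rather than a statistical test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def evaluate(control, treatment, max_latency_increase_ms=10.0, alpha=0.05):
    """Stage 1: guardrail (do no harm). Stage 2: did the target metric improve?"""
    # Stage 1: reject outright if the variant slows the page beyond the budget.
    latency_delta = treatment["latency_ms"] - control["latency_ms"]
    if latency_delta > max_latency_increase_ms:
        return "reject: latency guardrail failed"

    # Stage 2: only now look at the metric we are trying to move.
    p_value = two_proportion_p_value(
        control["conversions"], control["visitors"],
        treatment["conversions"], treatment["visitors"],
    )
    lift = (treatment["conversions"] / treatment["visitors"]
            - control["conversions"] / control["visitors"])
    if lift > 0 and p_value < alpha:
        return "ship"
    return "no detectable lift"
```

The ordering is the point: a variant that converts better but fails the guardrail never reaches the second stage.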

Ashley Stirrup (04:39)
Yeah, and so then I'd imagine you went from a world where you're serving millions of consumers to one where you're focused on Amazon employees, a much smaller base. So sometimes you just didn't have the data volumes to actually run A/B tests, and you had to look at different metrics in order to learn as you were building products on that side.

Andrew Willingham (05:01)
Exactly. So I think the nice thing about Amazon in particular is you have so much traffic that you can run many simultaneous A/B tests, not even just one. And you're exactly right, Ashley. When we came in, Amazon was still very large scale for HR products. I mean, you have 450,000 corporate employees and 1.5 million associates. That's still a lot. But it's not a lot compared to the gateway traffic, where you're talking hundreds of millions of hits or whatever it is these days.

So we did have to adjust our approaches. Instead of leading with a pure data approach of, like, cool, we did this and here are our p-values and everything else, we moved into relying on a very close connection with customer research. So I would sit with actual HRBPs. I attended many talent reviews. I would sit in there and watch them do it and be like, okay, what are you doing here? Why are you bringing the Excel here? What's the PowerPoint for? Didn't you already do that?

So we asked a lot of questions to gather that research and give us more confidence in what we were going to build. In the user experience space, they would call this UX research, or qualitative research versus quantitative research. And we leaned heavily on that to try to de-risk a lot of our roadmaps before we launched.
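Sketch: one way to see why an internal user base can be too small for A/B testing is a standard sample-size estimate for comparing two proportions. The function name, the 5% baseline rate, and the lift values below are hypothetical and do not come from the episode. The formula is the usual normal-approximation one for a two-sided test.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in EACH arm to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical scenario: a 5% baseline rate for some action.
small_effect = users_per_arm(0.05, 0.02)  # detect a 2% relative lift
large_effect = users_per_arm(0.05, 0.20)  # detect a 20% relative lift
```

Under these assumptions, a 2% relative lift needs roughly 750,000 users per arm, while a 20% lift needs only about 8,000. Small effects that are routine to test on consumer traffic can be out of reach for an employee population, which is where qualitative research takes over.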

Ashley Stirrup (06:19)
Yeah, that makes a lot of sense. Can you tell us about a time when you learned a lot after releasing a product or feature?

Andrew Willingham (06:25)
Yeah, absolutely. So everything I just talked about is sitting with customers and doing this. The important thing here is you've got to understand who your customer and your user are. It may not be the same person. So I'll share a story. When I first joined HR for that first role, our first project was really focused on building a talent review product. And like I mentioned, for operations there are all these organizational layers all across the world, and you have to roll all these calibrations up for our yearly calibration review that decides comp and everything at the same time, so that then all of Amazon can roll it up.

At that time, Amazon was running our talent reviews on Excel files or PowerPoint or Word docs. So it was very disorganized. We weren't sure how the talent reviews were being run or how consistent the process was. We didn't really know. Of course, there are a lot of data leakage risks as well when you've got files floating around and people emailing ratings and stuff.

So we came in and tried to really clean this up. We partnered really closely with the Talent Management Center of Excellence. For folks who aren't close to HR, these are the HR specialists, the IO psych people who really nerd out about, hey, how do we build high-performing organizations? How do we really evaluate talent? How do you do succession planning? That sort of stuff. So we embedded with them and built this product. They loved what we built.

And we said, all right, let's go try it out in production to run a talent review. We handed it to the HRBPs, and these are the people who actually run the talent reviews. They're not the IO psych folks sitting up in the ivory tower in the COE. They're down on the front lines. And they got into it and they were like, what? This is way too complex. You have all these features. I don't know how to use this. So it was a flop initially, and we had to go back to the drawing board and really say, okay, cool, it's great that we have the best practices from our COE experts here in TM, but we pivoted and worked directly with the HRBPs.

And when we did that, that was the secret sauce, because when we made it so any HRBP could run a talent review, the secondary benefit, which we didn't expect, was that business leaders could run their own talent reviews. This was a big unlock for the business. Usually you'd have to say, okay, everybody get in the room, I've got to have my HRBP here, and they have to prep for two weeks to bring all the data so we can do it. Now you can just fire it up and do it in real time. That was a really, really big win after that initial failure.

Ashley Stirrup (08:50)
Yeah, yeah, lots of great learnings there. It's great that you were able to kind of uncover learnings that could apply to a whole new user base.

Andrew Willingham (08:59)
Exactly. I think the key learning there for me, which I've incorporated into everything I do now, is really doing the desk rides, as we call them. You need to sit with your actual user and watch them use your product. It's going to be painful, because you're going to be like, what are you doing? You're supposed to do this. But you can figure out pretty quickly, okay, that's actually what they're trying to do. We thought it was something else. Let me adjust. So that's kind of how we've proxied or replaced A/B testing at scale: that kind of embedded user research.

Ashley Stirrup (09:28)
Yeah, I've had my own personal moments. There was a time I built a mobile product many years ago and was watching the user testing and was like, nobody can find my magic button. That button was the gateway to everything else I needed them to do in the use cases, and nobody found the button. It was just so painful watching user test after user test go that way.

Andrew Willingham (09:42)
Yeah, I know.

Ashley Stirrup (09:53)
Well, so now you've moved on to Atlassian. Tell us a little bit about your role there and what you're trying to accomplish.

Andrew Willingham (10:00)
Yeah, absolutely. So as I mentioned, we're really the technical owners for everything that the business needs to run, from payroll to promotions to hiring to compliance, everything. So it's a really cool role. Compared to Amazon, where I kind of owned all these little slices with very large teams, at Atlassian the breadth is one thing I really enjoy. Everything from hire to retire, we get to play in, which is awesome. Along with the breadth, right now, of course, like everybody else in the industry, we're trying to figure out how we unlock this transformation of applying AI to our space.

So that's been really, really exciting. Not only because it gives you a chance to drive more efficiencies than we've been able to drive in the past, with less effort, in how we run talent processes, but because we're really re-imagining and reinventing these processes in the age of AI. So we get to tackle questions like: if our hiring process is, you market, then somebody applies, then we do a recruiter screen, then an HM screen, then five interviews, and then we give an offer, is that really the optimal way to recruit people? I don't know. We've been doing it collectively in the industry like that for many years, but it's the first time we're getting a lot of tailwinds to say, well, go look at it. Maybe we could try something different. Could we skip these steps? Can we replace these things? Can we more intelligently pick the interviewers and the sequence, for candidate experience as well as for hiring decision quality?

So that's a lot of what we're going to do at Atlassian: really just build zero-to-one, high-impact solutions, leveraging some cool new technologies. So I'm having a blast.

Ashley Stirrup (11:34)
Yeah, that sounds amazing. One of the things I think would be particularly interesting for our audience is, you know, what I see is that there are some companies where the CEO wants to see the results of experimentation, and there are other times when you've got maybe a head of experimentation or a data scientist who's trying to champion experimentation internally and maybe the organization isn't as bought in. And to me, one of the things that's really interesting is somebody who owns product strategy like you do, and you need to make a portfolio of bets.

You're going to go build something big and new that maybe doesn't make sense for A/B testing. And there are other times where it's very clear that you want to do it. How do you think about what to build, and how do you decide when it's a good time to A/B test something?

Andrew Willingham (12:20)
Yeah, it's a great question. One of the things I love about A/B testing, or experimentation in general, is that as a product manager, it relieves some of the pressure to have to be right all the time. So for example, when I'm coaching my team, I'm less interested in tell me how to do X, Y, and Z and what features you're going to use, and more in, all right, walk me through your assumptions. What has to be true? And then what's our testing plan to gain confidence over time?

So I find that experimentation, parking whether the culture is really pulling for it or not, helps manage the politics. It's not opinion-based. It just is what it is. And so I find, as a product manager, one, that helps you manage executives, because I can just say, well, here's what the data shows. It's not my opinion or your opinion. This is the data. And that helps short-circuit a lot of these political or opinion-based comments that I think can slow us down.

And two, I find it more fun. It's more like, okay, I don't know what will work. You don't know what will work, but let's try some stuff. And then you can iterate. And if you can do these small iterations, that's really, if you go back to how Amazon grew, that's how they grew. It was very small changes that added up. We didn't jump to today's gateway. It's 20 years of experimentation.

Ashley Stirrup (13:18)
Yeah.

Andrew Willingham (13:37)
And along the way, you learn what those pillars are that become enduring truths. So for example, at Amazon, we never wanted to ship you a product more slowly. No customer's going to say, yeah, I want my product to get here later than I thought it would. So any investment we would make that would try to reduce shipping time or time to customer, we would say, cool, that's worth doing. So the way I think about it, there are all those advantages. What do you test versus not test? That's kind of my framework.

So if it's something like, hey, do we think faster shipping times will translate to future revenue? I'm like, I buy that. That's a truth that I buy. So anything I bet there, I feel very confident that it's strategically aligned with outcomes we care about. But there are other things you want to test where I'm not so sure. For example, there are some counterintuitive things you can find sometimes. Here's one I use all the time with my team.

If you measure usage time on your dashboard, like how long customers use your product, everyone comes in and says, well, if they use it longer, that's better, because see, they're engaged. And I say, well, sure. But it could also just be a really bad product and it takes me too long to do it, and that's actually an inconvenience. So what I'm really interested in testing in that case is, one, does usage time actually indicate satisfaction and helpfulness to customers?

I can think of ways in which it would and ways in which it wouldn't. And you don't know unless you really go in and, again, do that research or test to understand: is this a durable metric? If it is, great, go do it. If it's not, or you're not sure, I would test and then figure out how that relates to the outputs you care about.

Ashley Stirrup (15:13)
And to me, so much of it is about where do I have the biggest opportunities to learn, and when is A/B testing a lever for uncovering new insight? Sometimes it might be, I already understand this, but let me test it to understand. And then other times it might be truly just pure exploration.

Andrew Willingham (15:24)
Mm-hmm.

Yeah, absolutely. I mean, one of the most obvious examples: let's say you want to get somebody to start using AI. Testing that value proposition, like marketing copy, I really think about it like that. So if I'm a frontline product manager, do I care about AI because the company says it's important? Probably not.

What I care about is, like, hey, you're about to type up that status report. We'll just generate it for you. Did you know that? Click here to figure out how. Now I'm like, man, I don't have to type up a status report. That really matters to me as a user. So a lot of what I do even in this role, and of course what we did in the marketing role at Amazon, is test copy: what's resonating with your users to get them to change behavior. So again, most of what we're doing in experimentation, if not everything, is trying to change behavior.

Ashley Stirrup (16:16)
Yeah, in certain situations, people can just get overwhelmed, right? I think about, as a CMO owning a website, you've got people who are looking for a job, people who are looking to buy, people who are looking to implement a product, so many different flavors.

And if you just look at all the data, it becomes noise and you want to walk away. But if you can actually build the right metrics like you were talking about, and then understand how to segment your user base, that's when you can really start to learn how you unlock a better experience for those different use cases.

Andrew Willingham (16:48)
Exactly, exactly. And even in that example, I'm sure when you guys started, there were probably some assumptions around who was visiting the website, but that job seeker is a great example, or investors. It's like, okay, how do we figure out who these people are? And then of course, that's very different content and features you want to deliver versus a prospective customer.

Ashley Stirrup (16:58)
Yeah.

So in your current role, do you have some North Star metrics that you're trying to optimize?

Andrew Willingham (17:11)
Yeah, absolutely. So for us right now, there are probably two big ones that are top of mind. One is efficiency, like we talked about. So if it takes us 10 hours to calibrate a large organization, can we drive that to five? While, and this is the second big metric, we maintain or increase quality. That really applies to anything you do. Same if you take the hiring funnel. If we make it more efficient, we're hiring people faster, we're spending less effort to do it, great.

But that is only advantageous if you also maintain or increase quality. So these are kind of my balance metrics. The example I use with my team is, if our goal is to hire faster, I'll just get rid of all interviews and blind hire people, and then I'm going to hit my metric and walk away. But that's why you have to have that quality metric, to say, you're optimizing against these things that are naturally kind of antagonistic.
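Sketch: the balance-metric idea above can be written as a simple acceptance rule, where a change only counts as a win if the efficiency metric improves and the quality metric does not drop. The class, the field names, the 0-to-1 quality score, and the example numbers are all hypothetical. The episode does not say how Atlassian measures either metric.

```python
from dataclasses import dataclass


@dataclass
class FunnelResult:
    days_to_hire: float   # efficiency metric: lower is better
    quality_score: float  # hypothetical 0..1 quality measure: higher is better


def passes_balance_check(baseline, variant, max_quality_drop=0.0):
    """Accept a variant only if it is faster AND quality holds within tolerance."""
    faster = variant.days_to_hire < baseline.days_to_hire
    quality_held = variant.quality_score >= baseline.quality_score - max_quality_drop
    return faster and quality_held


baseline = FunnelResult(days_to_hire=40, quality_score=0.80)
blind_hire = FunnelResult(days_to_hire=5, quality_score=0.40)    # no interviews at all
streamlined = FunnelResult(days_to_hire=30, quality_score=0.82)  # fewer, better steps

print(passes_balance_check(baseline, blind_hire))   # False: fast, but quality collapsed
print(passes_balance_check(baseline, streamlined))  # True: faster with quality held
```

The "blind hire" variant wins on speed alone and is still rejected, which is the purpose of pairing the two metrics.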

Ashley Stirrup (18:01)
Graham, our CEO, is about to publish a blog post where he talks about experimentation teams that get too focused on how many experiments did I run versus how much did I learn. So designing the right metrics is super important. We actually had a really interesting webinar with Khan Academy where they were trying to test which version of the AI tutor is actually optimizing cognitive engagement, because they found that was the measure that eventually led to learning.

The leading indicator for learning is how engaged the students are in the learning. Are they asking questions that are really helping them learn? Or are they just trying to get the AI to give them the answer so they can go to the next step in the process? So yeah, metric design is such an important step in that process. It really requires a lot of rigor. Yeah.

Andrew Willingham (18:32)
Mmm.

Yeah, 100%.

My sense is that's an often under-invested step of this whole thing. You want to set up your metrics such that when you run an experiment, you're actually getting signal that you can do something with. So I think your example of, hey, we ran a volume of experiments, that's a trap I see a lot of folks fall into, when really you're running experiments to generate learnings. And if you run

Ashley Stirrup (19:02)
Yeah.

Andrew Willingham (19:13)
one experiment and generate learnings you can apply that move the needle, that may be more valuable than saying, well, we tried 500 things and we didn't move the metric we care about at all. It's hard to interpret those results, for sure.

Ashley Stirrup (19:25)
Yeah. So as a last question before we wrap up: we definitely see that there are those companies that have really bought in on experimentation and those that are reluctant, even some businesses that are really high-traffic companies, and maybe it's the product manager who's reluctant to A/B test. Why do you think that is? Obviously you had a very different mindset at both Atlassian and Amazon.

Andrew Willingham (19:52)
Yeah, I'll speak from my experience as a product leader and what I've seen. It's a bit scary as a product manager to go down this road, because it's a bit of an ego hit in some ways. You're not saying I have the answers. You're not saying, here, let me tell you what to do, and I'm right. You're saying, cool, here's my plan for testing. I don't know if I'm right or not. I can give you a confidence level, but we'll find out together.

That was a scary moment for me in my career, to admit, hey, I don't know what's right. But the further I go in my career, the more I would say, no, that's the attitude of executives. We don't know what's right. But I think we have a plan to test and learn, and then we're going to apply those learnings. I would encourage folks to try it. I think you don't have to be a data-driven culture like Amazon, which is,

Ashley Stirrup (20:27)
Yeah.

Andrew Willingham (20:38)
doing the most A/B tests in the world, or Google, somebody whose whole business model is built on it, if you can get focused on the value you get from failure. That's the...

I'm sure many of you would identify with this: a lot of the time, the experiments that were the most helpful were the ones that didn't work. Particularly if we expected it to be a slam dunk and it didn't work, I'm like, whoa. That's the example I shared earlier. Like, wow.

Ashley Stirrup (20:55)
Yeah.

Andrew Willingham (21:02)
See, these people are not my actual customers, and they're not the same. I can't use them as a proxy. That was a foundational learning. In everything we did after that, we set aside the other proxies and said, all right, who's actually operating this thing? Don't talk to the executive or the manager. You've got to get right there with that person. If you run the cycle once, I think you will see that the business and your stakeholders will see that value. They're going to say, wow, okay, cool.

And then if you can bring them in to say, okay, what do you think we should test next? Here's what I think we should do. Then you're getting them engaged. The more you can get people engaged in the excitement around experimentation and learning, I think you're going to see great results.

Ashley Stirrup (21:37)
Yeah, I think that's such a great answer. Going back to my mobile example and user testing, which is a little similar. I was so proud of that product. And of course I had this huge roadmap of things I wanted to go do. And then I saw the A-B tests and just sitting there, we had the glass wall and I could see them on the other side. And you know, it was transformational. Suddenly we had to test everything I did after that, because there's no point in building something if you're not actually delivering value to your end user.

Andrew Willingham (22:06)
Exactly right. Exactly right.

Ashley Stirrup (22:08)
Well, thank you so much for joining us today, Andrew. This was a fabulous episode. I learned a lot.

Andrew Willingham (22:13)
Thanks, Ashley. Appreciate it.
