Recommender systems are one of the most challenging, powerful and ubiquitous areas of machine learning and artificial intelligence. This podcast hosts the experts in recommender systems research and application. From understanding what users really want to driving large-scale content discovery - from delivering personalized online experiences to catering to multi-stakeholder goals. Guests from industry and academia share how they tackle these and many more challenges. With Recsperts coming from universities all around the globe or from various industries like streaming, e-commerce, news, or social media, this podcast provides depth and insights. We go far beyond your 101 on RecSys and the shallowness of another matrix-factorization-based rating prediction blog post! The motto is: be relevant or become irrelevant!
Expect a brand-new interview each month and follow Recsperts on your favorite podcast player.
Note: This transcript has been generated automatically using OpenAI's Whisper and may contain inaccuracies or errors. We recommend listening to the audio for a better understanding of the content. Please feel free to reach out if you spot any corrections that need to be made. Thank you for your understanding.
I really enjoy when my work impacts real life of people and I can see and measure this impact and this is what drives me.
In our case, the kind of first or zero stage ranker is already kind of given by a system, right? So we just filter by distance.
In real life, most of the purchases in our industry come from the venues that users already purchase from, so it's kind of repurchasing behavior.
This multi-page setup, right, it is a blessing in disguise.
If we stop worrying about performance of the model on recurring sessions with purchases from old venues, so to speak, then we will get even better results.
To really improve the conversion, we probably need to make our model generalize better. And what does it mean in practice?
We should isolate a subset of sessions where user purchased from a new venue and try to improve metrics on this subset with a constraint that we shouldn't kind of ruin the performance on sessions where users kind of repurchased from a known venue.
I think that the most important thing in every person's life is health. So when you are healthy, you think about like thousands of things. But when you are sick, you think about only one thing, how to become healthy again.
Hello and welcome to this new episode of Recsperts, Recommender Systems Experts.
Today's episode is special in my relationship to my interviewee, since we are not just people working on recommender systems, but have also been collaborating as colleagues for more than two years.
And for this episode, I have actually invited a practitioner, as you might assume at this point. And this practitioner is part of Wolt and DoorDash.
And my guest for today's episode is Sasha Fedintsev, also Alexander Fedintsev. Hello and welcome to the show.
Hi, Marcel. Hello, listeners. Thank you very much for inviting me to this extremely, extremely useful podcast and extremely kind of important podcast in our area.
I would say it's one of the rare examples of kind of podcasts where the top experts speak out. And yeah, I listen to it with pleasure. And I'm really happy to be here as a guest. So it's an honor. Thank you very much.
Thanks for those kind words. And yeah, I would definitely also put you into this role of great practitioners in RecSys since I have not just been able to perceive the work that you're doing, but also had the pleasure to collaborate with you on many things at Wolt.
And in that sense, I also have to be a bit cautious today since my first and primary goal within the show is to interview my guests. But today, I might have to strike the balance to a certain degree to also not interview myself too much.
So in that sense, it might also be a bit different, but I hope it won't be of lesser use for our listeners. Before we introduce Sascha, we first need to introduce Wolt.
So Wolt is a leading local commerce platform founded in Helsinki in 2014.
What started out as a food delivery app has grown into a platform connecting over 60 million registered users with more than 200,000 active venues across restaurants, grocery stores and retailers in more than 30 countries and more than 1,000 cities worldwide.
The idea is simple, bring joy and convenience to everyday life. Customers get food, groceries and more delivered to their door in around 30 minutes.
Venues reach new customers and grow their business. Since joining forces with DoorDash in 2022, Wolt is now part of a global platform with the ambition to become the definitive local commerce company worldwide.
A big part of what makes the app feel relevant and engaging is the work of the personalization team, shaping what you see, when you see it, and why.
The team is growing and we will soon be opening a machine learning engineer position. So stay tuned for more. And with that being said, to the introduction of Sascha.
So Sascha Fedintsev is a staff applied scientist at Wolt DoorDash. He has joined Wolt pre-COVID in 2019.
And given the age of this company, he is really kind of a dinosaur - but not a dinosaur in the sense of losing track of the latest research, because this is actually where he excels.
So I guess there is no week where I don't get thrown a paper or a new idea by Sascha, something that I need to take a look at, which makes it exciting to have colleagues like you.
And yeah, he has built many models, especially in the personalization, but also in the ads domain at Wolt.
Many successful models. And I also had the pleasure of building our most recent venue recommender system, called UVR, together with him.
And you are probably going to know more or get to know more about this soon.
And also slightly different are the conferences where you might see his name appearing, because it's not the usual RecSys or SIGIR or the Web Conference, whatnot.
But it's conferences such as Brain and Behavior, Artificial Intelligence for Health, Longevity or publications in the American Journal of Physiology, Cell Physiology.
So you can also already see that RecSys, personalization, applied science and ads science are not the only things that Sascha's expertise is strong at.
By these publications, and I guess also by all the other publications and the work that you do there,
he is quite a known researcher in the domain of longevity research.
And this is definitely something we are going to talk towards the end of the episode.
So therefore, and with all other episodes as well, stay tuned for the whole episode.
And with that, I guess there are many things that I haven't even mentioned yet.
So Sascha, please continue.
What should listeners know about yourself and why are you actually working on RecSys, on longevity, on ads?
Thanks, Marcel. That was a long introduction that made me a little bit blush.
Yeah, you covered it pretty well already. Right.
So I joined Wolt a long time ago, and I was in a unique position because I was one of the few data scientists - maybe the second data scientist or machine learning engineer in the company at the time.
And the first one working on recommendations. And why did I choose recommendations? Because I thought, and I still think, that this is the most impactful area in e-commerce.
Right. So the most impactful thing one can do with tools like math, machine learning, statistics, whatever, is to improve personalization, improve the customer experience.
And this is what I believed and continue believing in. And that's why I chose this domain.
Prior to WALT, I worked as a research engineer at Zalando, also working on personalization.
And I really enjoyed doing that. And I also saw the impact of personalization there.
But the problem was that at Zalando, a lot of things were already implemented, and on a very high level - probably even SOTA at that time, state of the art.
They used quite sophisticated techniques and whatnot. But I wanted to do something almost from scratch.
Right. So to implement something by myself, right? And that's why I joined Wolt.
And it was a pretty unique situation back then, because, as I mentioned, I was the first one working on personalization.
There was nothing done in that area before.
And another unique aspect was that it wasn't a small startup.
There were a lot of customers already, like more than 10 million or something.
So it was just the moment when personalization could actually show an impact based on the number of customers.
I thought that, OK, I know how things are done because I worked at Zalando, I implemented parts of it.
I know RecSys, I know collaborative filtering and whatever.
I just take that knowledge, apply it at Wolt, and profit.
Boy, was I wrong, because I didn't realize at the moment that Zalando and Wolt are not that similar businesses.
Right. Both are e-commerce, but Wolt was a food delivery business.
Now it's broader - it's a deliver-everything business, and there are also side businesses, whatever.
So a lot of stuff going on. But at that time it was a food delivery business.
And it is different from usual e-commerce like Zalando, for instance - a fashion e-commerce store.
So the difference, as I tend to describe it, is location dependency - location and distance from customer to venue. For the listeners:
we call restaurants, stores and so on "venues" - the places where couriers pick up orders.
In regular e-commerce, you don't have this problem.
For instance, a Zalando customer may order any dress, any pair of jeans, whatever they would like -
and anything that an algorithm recommends to them.
But at WALT, if you recommend something really relevant, which is outside of the delivery area, customer won't be able to buy it. Right.
And if you recommend something that is slightly more relevant, but the delivery cost is much higher than for something that is less relevant, the customer will probably also choose something that is cheaper, or just leave the app or website.
So that's very difficult thing that we need to take into account. Right.
And, for instance, the traditional collaborative filtering models struggle here, right?
Because they rely on factorizing this huge matrix of user-item - or, in our case, user-venue - interactions.
And in our case, this matrix becomes highly clustered.
So users tend to interact with venues that are in the same geographical area.
And we have only weak connections via customers that buy from different places.
These customers can connect different clusters in this matrix.
But such customers are rare, and these connections are therefore weak.
And the collaborative filtering algorithm doesn't work as well as it does in other domains.
So that was like one of the big issues back then.
I don't have like a definitive answer how one can tackle this. Right.
But for instance, one thing that I found useful was clustering things. Right.
So, for instance, treating venues belonging to the same franchise as a single entity - clustering them.
It helps to make these connections between clusters stronger and improves the performance of collaborative filtering.
By a lot. Yeah.
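Note: to make the franchise idea concrete, here is a minimal sketch (illustrative names and data, not Wolt's actual schema) of collapsing venues of the same franchise into one entity before building the user-venue interaction matrix:

```python
# Minimal sketch: collapse venues of the same franchise into a single
# entity before building the user-venue interaction matrix.
# Names and data are illustrative, not Wolt's actual schema.
import pandas as pd

interactions = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "venue_id": ["pizza_x_berlin", "sushi_corner", "pizza_x_munich", "pizza_x_munich"],
})

# Hypothetical venue -> franchise mapping; standalone venues map to themselves.
franchise_of = {
    "pizza_x_berlin": "pizza_x",
    "pizza_x_munich": "pizza_x",
    "sushi_corner": "sushi_corner",
}
interactions["entity_id"] = interactions["venue_id"].map(franchise_of)

# Users in Berlin and Munich now share the "pizza_x" entity, which links
# otherwise disconnected geographical clusters of the matrix.
matrix = pd.crosstab(interactions["user_id"], interactions["entity_id"])
print(matrix)
```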
You're already diving directly into one of the very interesting and outstanding specifics of the delivery or more specifically the food delivery industry.
Before we go more into this, I would be curious to better understand two things.
First, why have you originally joined this field?
So even at Zalando - what made you intrigued by personalization and recommender systems as a way to personalize content display?
So this is one. And the second one is more about your position.
So what's the daily, weekly work of a staff applied scientist?
What are your responsibilities? What does your usual day look like?
And what do you like or dislike about it specifically? Why should people strive to become a staff applied scientist - or why should they actually not strive to become one?
So first, yeah, let's talk about your motivation for personalization and RecSys, and then about the specifics of your role.
Thanks. Very good questions.
To be honest, I don't have like a completely conscious answer on the first one.
So why I joined. I can try to reflect. And I think my motivation was love and passion for math and machine learning on one hand.
And on the other hand, my passion for having an impact.
I really enjoy when my work impacts real life of people and I can see and measure this impact.
And this is what drives me in part.
So when I solve pain points of people - pains of people, when I solve customer problems - this is important.
And this is one obvious thing where these two passions kind of merge.
So my passion towards math and machine learning and also software engineering and impact on people's life and whatnot.
And personalization is one of the few things where this convergence happens. So that's why I chose it.
Yeah. And what is the role of a staff applied scientist like and maybe how is it in specific at Wolt?
Yeah, this is actually quite a new role in the company. Right.
So the staff role itself is not new, because there are plenty of staff engineers.
But a staff applied scientist is a little bit different role, though sharing a lot of similarities with a staff engineer.
And I had the honor of being the first staff applied scientist at the company.
Now there are several more. And yeah, like our presence increases.
This is definitely a very fruitful area for growth. And staff roles - they are interesting, but a little bit difficult to comprehend.
Right. So, for instance, a natural path from a software engineer to manager is pretty straightforward.
Manager manages people. Right. So it's pretty straightforward to describe.
But what staff-level - senior-plus or staff-plus - roles do is a little bit vague.
Right. So there are different archetypes of this role. So one person could be like a really, really deep expert in a single domain, single technology or whatever.
But generally staff level roles, they tend to have cross team, cross domain influence.
Right. So they help multiple teams, coordinate efforts of different teams or even organizations.
Right. So that is staff-plus level, probably. And in my case, it naturally fits my division between personalization and ad tech.
Yeah. So this basically describes what I'm doing, right? I'm helping different teams in ad tech and personalization.
And previously I was very hands-on - focused on building, tuning hyperparameters, experimenting with different architectures.
And now my goal is to, I don't know, share my experience, also participate hands on, but discuss different things, share ideas, evaluate ideas, like provide some feedback.
Think about architecture. Think about like different caveats and corner cases, whatever. Plan the architecture of the big system.
Right. Try to understand the scope and limitations and many other things. Try to think more strategically what customer problem we are solving and how to translate it into particular machine learning problem.
Or even do we need machine learning to solve this particular customer problem?
I mean, nowadays we need AI for everything. And since we all know AI has nothing to do with ML, then of course we don't need ML anymore, since we can solve everything with AI.
So this is basically a quick summary of the daily LinkedIn hype. Not to say that there is not also a true point to that hype, but sometimes it feels like these terms are getting a bit out of hand.
Yes. Yes. AI is like extremely valuable tool, right? But as every tool, it has its pros and cons. It has its strengths and limitations.
And this is also a very important part of the job of senior plus level engineers, ML engineers, applied scientists, whatever, to understand which tool to use, where, in which conditions, to which problems we should apply this or that tool.
From the terminology that you were already using when describing the role of a staff applied scientist or staff scientist - and I mean, this role is actually not unique to Wolt, other tech companies also have it -
I can already assume that you might have read one of these books, like "Staff Engineer: Leadership Beyond the Management Track" by Will Larson.
Or there's also the other pretty famous one that appeared shortly after that, "The Staff Engineer's Path: A Guide for Individual Contributors Navigating Growth and Change" by Tanya Reilly.
I always wonder - since I'm also currently reading the first one - whether there is a bit of a gap, room for something more. Because these books, with the archetypes that you already mentioned, are, I guess, also valuable for people that are more concerned with the data science part of things.
So people focusing on the role of an applied scientist, not of an engineer. So I think there's valuable content for people like us. However, do you think there could be room for a book that would be called "The Staff Applied Scientist's Path"?
Or what do you think? What should be written in this book, and what would be in that book that is not in those staff engineer's path books?
Yeah. So the thing is that you correctly mentioned the word scientist, right? So we're not only implementing models, but we are doing a lot of research, internal research. We are figuring out what works and what doesn't.
And the goal of a staff applied scientist - if we talk about staff engineers, they are called tech leads, right? Because they kind of lead the technological development.
If we draw an analogy to that, a staff applied scientist is a science lead, figuring out how to do research better.
What are the standards of research, right? For instance, one can just train one model once, measure difference in some metric and say, okay, this model is better.
But the goal of a staff applied scientist is to say, no, this is not up to the standards. Let's do 10 or even more runs of the model and average results, measure the confidence intervals.
And only if there is a statistically significant improvement, report that it is better on offline metrics.
Well, this is just a very simple, maybe a little bit artificial example, because a senior applied scientist should already know and do that.
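Note: a minimal sketch of the repeated-runs practice described above, with hypothetical metric values; the 1.96 factor is a normal approximation for a 95% confidence interval (a t-distribution critical value would be slightly wider for 10 runs):

```python
# Minimal sketch (illustrative values, not Wolt's tooling): average an
# offline metric over repeated training runs and report a confidence
# interval, instead of trusting a single run.
import math
import statistics

# Hypothetical MRR values from 10 independent training runs of one model.
mrr_per_run = [0.231, 0.228, 0.235, 0.226, 0.233, 0.229, 0.232, 0.230, 0.227, 0.234]

mean = statistics.mean(mrr_per_run)
stdev = statistics.stdev(mrr_per_run)
# Normal-approximation 95% confidence interval for the mean.
half_width = 1.96 * stdev / math.sqrt(len(mrr_per_run))
print(f"MRR = {mean:.4f} +/- {half_width:.4f}")
```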
But in general, I would say that this kind of the goal of a staff applied scientist is to help to establish good scientific methodology, be the leader in terms of reading papers, taking relevant ideas from these papers and sharing knowledge, promoting this scientific exchange of opinions.
But a staff AS should also not only work in the closed community of data professionals. Their job is also to communicate to the business in a clear way, and to protect the methods that we have established.
We know that they work, and we need to protect this and communicate that we do things the way we do not because it is convenient for us, but because it is the right thing to do - which eventually leads to better results for the whole company.
And so the staff role should be like this bridge between different parties, between applied data, ML scientists, between product, between engineering, whatever.
So this person should understand all of these parties, understand and be able to communicate between these parties clearly and establish good and productive communication collaboration.
I guess you are just about to turn into the John F. Kennedy of applied science, saying: we do not do these things because they are convenient, but because they are right.
Like when he announced the moon program, saying: we do these things not because they are easy, but because they are hard. That's roughly what he said - it just reminded me of that. Well put, and I guess it hits the nail on the head.
Hey folks, quick pause. This is Marcel. I started ReXperts almost four and a half years ago. Since then, I've released more than 30 episodes with guests from industry and academia, reaching listeners in over 54 countries.
Across all platforms and over the lifetime of the show, we have crossed 50,000 plays, with listeners mainly in the US, Germany, India, the UK and the Netherlands.
And we are close to 1000 subscribers on Spotify alone, which is great. I run this podcast completely voluntarily.
There is no monetization, just a lot of time spent researching, reaching out, preparing, interviewing and producing all to give the recommender systems community a platform and to give you new ideas and perspectives.
And here's the honest ask. Despite nearly 1000 Spotify subscribers, there are only 37 ratings. On Apple Podcasts, it's even worse: three ratings for about 350 subscribers.
I'm not asking for money. I'm not asking you to join a newsletter. I'm not even asking for five stars. I'm just asking for 30 seconds to rate the show honestly.
Whether that's three, four or five stars. And maybe leave a comment on what you like or what could be done better.
That small action makes a huge difference for discoverability and for keeping this project going.
Thank you to everyone who has already rated the show, shared feedback or especially joined me as a guest. And now back to the episode.
What are really the challenges and specifics of the delivery domain? And you already mentioned one and this is locality.
And when you were talking about this, it also reminded me of this spectrum. And initially I was thinking it was more like binary, but it's really like a spectrum of the characteristics of the object of recommendations we present to our consumers.
Because there you have a range from virtual to real goods. And even within real goods, you could further distinguish.
I would come up with something like goods that are easily movable and goods that are less easily movable, or that have a smaller radius of movement.
In that sense, something like YouTube as a global video platform is something that has a large corpus where almost all those items could be candidates for all users worldwide where YouTube is available.
With Netflix already, even though the good is still a virtual good - movies and TV shows - it's constrained to a certain degree due to all kinds of media and publication rights.
So certain movies cannot be shown in certain countries. In that sense, it's already a bit constrained. Music might be more on the YouTube side.
But then if you really get into, for example, classical e-commerce - so ordering a book from Amazon, or some pair of jeans like you just mentioned -
this is already a bit harder. But also there, most of the stuff is accessible, and it doesn't make a difference whether I'm ordering stuff in Munich or in Berlin if I'm in Germany.
But it's already like different what I can order in Germany versus Italy, like I just experienced recently.
And then you really have this domain of the delivery business - and maybe there's some example that goes even further - because, yeah, we have venues, and those venues, if we just think about food, are restaurants, and those restaurants are stationary.
I can't order a pizza from a restaurant in Berlin to get it delivered to Munich.
I mean, theoretically I can, but since I, as hopefully most people like to enjoy my pizza hot, that's going to be challenging.
So locality, do you want to first elaborate on this or what other like major challenges or even surprises have you experienced?
Yeah, this is a very good question. There are plenty of challenges. Almost everything that you can imagine comes up.
So first of all, we are doing real time recommendations, right? So latency is a big rate limiting thing.
This is a huge challenge in itself. Then there was a very fun thing called cannibalization. We can talk a bit about it later.
This is what is different from academic work, right? In academic work, you have well-established data sets like MovieLens or whatever.
And you just develop an algorithm, it beats the current SOTA, you are happy. In practice, in business, the journey only starts there. So you have to do a lot of other stuff.
Regarding the locality, right? I mentioned kind of negative sides of the locality. To be honest, it also gives us a little bit of a bonus, right?
So you mentioned YouTube and for YouTube, a customer can see almost any video, right? Apart from like maybe some age restrictions, whatever.
But that creates a huge engineering challenge: you need to build a multi-stage ranking system. Select all the videos, narrow down the funnel, right? To apply more and more sophisticated algorithms.
In our case, the kind of first or zero stage ranker is already kind of given by a system, right? So we just filter by distance.
So there is no need for candidate generation since it comes more natural.
Yes. And it naturally comes down to a manageable amount of candidates. Even in the biggest cities, we have a few thousand candidates after distance filtering.
So this simplifies things. But then it kind of creates these difficulties with applying collaborative filtering methods specifically. So it's kind of a double-edged sword. It has its pros and cons.
Let's talk about latency considerations, right? So latency is a very, very important aspect.
On one hand, you can technically pre-compute all the recommendations for the customers, but you will need like an enormous amount of memory to store these recommendations.
And maybe in some cases it will work. But in the vast majority of situations, you want to build a real-time recommender.
And that's why you have to apply these multi-stage recommendation systems, where you have very fast, simple models optimized for recall at the first stages, and then apply successively more accurate, slower models at later stages.
So this is one practical challenge and limitation, but it's pretty standard in industry.
So with that in mind, I would encourage researchers in academia to have this in mind when they develop algorithms.
Right. So what if we start our research with this thought in mind? Right. So that our recommender systems in practice are multi-staged.
So can we build a multi-stage recommender system from first principles, from the get-go? Can we train a multi-stage pipeline in one go - how to say it - training all the stages together?
Yeah, right. As part of a single entity, and not training them separately, whatever. So these are research topics for the future that will probably be very useful for industry.
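Note: a minimal sketch of the multi-stage setup described above, where the distance filter plays the role of the "free" zero stage; all function names, fields and thresholds are illustrative, not Wolt's implementation:

```python
# Minimal sketch of a multi-stage ranker; the distance filter is the
# "zero stage" that the product gives for free. All names, fields and
# thresholds below are illustrative.
def distance_filter(user, venues, max_km=10.0):
    # Stage 0: cheap geographic filter - candidate generation for free.
    return [v for v in venues if v["distance_km"] <= max_km]

def fast_recall_stage(user, candidates, k=200):
    # Stage 1: fast, simple model optimized for recall
    # (a popularity proxy stands in for a cheap scorer here).
    return sorted(candidates, key=lambda v: -v["popularity"])[:k]

def precise_ranking_stage(user, candidates, k=20):
    # Stage 2: slower, more accurate model scores the survivors.
    return sorted(candidates, key=lambda v: -v["score"])[:k]

def recommend(user, venues):
    return precise_ranking_stage(user, fast_recall_stage(user, distance_filter(user, venues)))

venues = [
    {"id": "a", "distance_km": 2.0, "popularity": 10, "score": 0.9},
    {"id": "b", "distance_km": 50.0, "popularity": 99, "score": 0.8},
]
print([v["id"] for v in recommend(user=None, venues=venues)])  # ['a']
```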
Going back to the latency, you were mentioning we are doing real-time recommendations. Let's specify it a bit because I'm not entirely sure whether we do.
For me, real-time recommendations would qualify as recommendations that take near real-time interactions or feedback into account to further distill the recommendations.
I don't think that this is what you mean. It's more like the difference between some batch-wise processing and maybe recommending from a key value store as compared to, hey, we want to create the recommendations in real-time.
Well, you have a point here. Yes, definitely. But on the other hand, even the simplest models we have and had previously, they use some real-time features. For instance, the distance between customer and venue, right?
Which can change almost in real-time, like time features, like the context, so that changes.
Yeah, exactly. So basically the features of our venues or the relationship between a user and a venue since the distance has, of course, an impact on the estimates.
Yeah. Maybe some features that we are using are not as real-time as one can imagine, for instance, like customer preferences or whatever. So they don't change immediately.
But our latest system, it is capable of taking into account the latest information, right? This is only like the matter of engineering, whether we have this streaming of the latest events to the feature store.
Yeah. And I guess this is really when things are getting to the industry gold standard, when you are doing this. Because the intent of a user from session to session could be very different, but our longer-term picture of the user isn't.
And then there is Marcel coming to Wolt, ordering something for dinner. And today he might be in sushi mode or in pizza mode, which you could easily infer from how I'm browsing when I start my session, or from what I'm clicking, to then tailor the page towards that immediate intent, for example.
But you are absolutely right. And I mean, Wolt has its own team dedicated to creating part of those features - people that are just concerned with estimating proper delivery times. I know that from my own behavior: I'm not just working on that system, but I'm also a consumer, which I guess is always great - being your own customer - because you can at least share a bit of the pain that other people might be having.
And there it's really a pain if those delivery estimates are inaccurate, in the sense that they were far too optimistic. I mean, if they are too pessimistic, then this is fine - but that could actually also hold users back from ordering.
If the venue says, oh, it takes like 55 minutes until your pizza is getting delivered and I want something within the next 30 minutes, so I'm probably not going to order something.
Yeah. And of course this goes into our recommender and it plays a role there. Yeah. Latency, latency and real-time recommendations. What else?
Cannibalization.
Cannibalization. Okay - spoken like a person that is also concerned with ads, possibly. Or was it something else you were thinking about with regards to cannibalization?
Let me briefly explain the setup. Again, theoretical researchers working on RecSys have a very well-controlled setup. They have a data set with training, validation and test parts, whatever.
They have a set of offline metrics like NDCG, mean reciprocal rank, hit rate at K, whatever. So they measure those, get a statistically significant improvement on those metrics, report a success, write a paper - and everyone is happy.
For us, the journey just begins here. So we have a new model, we test it on offline data, but by no means it is the end of the story because you cannot be sure that it will translate into actual customer satisfaction and eventually into profit, like increased revenue for instance, or some other business metric.
So what do we do? Usually we run an A/B test to estimate the real-life impact of the model. And this is a whole different story, and it is very complicated. Because, well, there is the problem of novelty: a short-term A/B test can reveal an improvement, and then the longer term shows no significant impact - and that is a problem.
Now, let me briefly describe our app. Most of you probably know what it looks like, but for those who don't: at that time - let's refer to the old design - it had several tabs, and there was the discovery page.
So that was one page. Another page was called restaurants tab, and it was just a vertical list of restaurants, obviously.
Then there was the stores page, also a vertical list, of retail venues. And then there was the search, which is a different story. And what did we do? In the beginning, it was extremely difficult for me to understand how we can evaluate our algorithm.
For instance, if you take the discovery page, we have different venues on different carousels, in different horizontal and vertical positions. So essentially it is a 2D grid.
And how to apply metrics here? We had no idea. This is another complication. So in real life you have complex layouts, and how to evaluate your algorithms there is very unclear.
But okay, we have this restaurants tab, and it is pretty convenient because it's just a vertical list. We can log what a customer purchased from which position in this list, and use this as a proxy for an offline data set, right? To evaluate our models.
There are also some biases, which we will talk about later - I can talk about it for hours. But okay, what we did: we trained a model, collaborative filtering at that time, and applied it only on this part where we did the evaluation, the restaurants tab.
And guess what? So the conversion rate increased on that tab.
For the specific restaurants tab.
Yes. But when we measured the overall conversion, it stayed flat. Why? Because people just started buying from restaurants tab and then...
Instead of from discovery.
Instead of from search or discovery. So usually when people cannot find something, they go to search and find what they want. But now a lot of stuff appears on restaurants tab and on this vertical list because of a better algorithm, they go there.
But there is no actual value, because they didn't start to buy more. So that is called cannibalization of the traffic. We did a good job of making the restaurants tab better, but we didn't do a good enough job to increase the overall conversion rate. So that was the problem.
And later I can tell why, what we discovered and how we solved the problem. And eventually we did move the needle. We did move the conversion rate up.
So, yeah, this multi-page setup, right? It is a blessing in disguise. On one hand, it makes it difficult to completely ruin the experience of the customers.
So it is not that easy to drop the conversion rate because if customers don't see relevant things on say discovery page, they always can go to search to other places and find what they need.
But it is also very difficult to move the conversion rate up. Right. And this is a problem, a real business problem. Okay. Now let's talk a little bit about evaluation.
Right. I mentioned that we used this vertical list as a proxy to kind of compute offline metrics on this list. But imagine a situation. So you log only events from restaurants tab. Right.
You only log purchases that are from the restaurants tab and you don't log anything else. So you log only the successful sessions that ended up with a purchase. And now you have a labeled data set.
You have a session ID, you have user ID, you have a venue ID and a label. Was it purchased in this session or not?
And actually also the position where it appeared.
Yes. Yes. So what will happen if we naively take this as our source of truth? If we focus only on sessions with purchases that are coming from the restaurants tab, what is your intuition?
Will it be kind of feasible to improve this offline metric if we calculate it on this subset? Just yes or no? Like binary?
Oh, no, I don't think so.
Yeah, your intuition is definitely right. So why? I have an explanation. Because if we take only purchases from the restaurants tab, the vast majority of these logged sessions are such that the ranking was okay for the customers.
And that's why they purchased from that tab. And that's why it will be very difficult to improve the offline metrics based on only that subset of data. Because this is actually a very good kind of data.
So the thing is that it is biased towards very good ranking. And so our algorithm, like even most sophisticated models will probably not beat this baseline. And we will be like scratching our heads and we'll do whatever stuff trying to understand what's going on.
So the solution, right? We of course can do some re-weighting, like excluding sessions, whatever. My colleague Marcus has come up with a beautiful Bayesian idea of how to re-weight those sessions. But I had a simpler idea - a very, very simple idea.
What if we also take sessions where the user viewed the restaurants tab but then moved to another view, like search or discovery, to make the actual purchase? It means that the ranking on the restaurants tab was not as good as it could be.
Right? The customer had to move to another view to find a relevant thing and make a purchase. And if we log the content of the restaurants tab and take the actual purchase of the customer from a different view, we can map that purchase onto that vertical list to get a counterfactual.
Right? So: what would have happened if the customer had stayed on the restaurants tab - from which position would they have bought? And that kind of mitigated the bias. And we have a much better evaluation right now.
Yeah. Let's assume my favorite pizza place was appearing at position 20 on the restaurants tab. After scrolling through 10 or 12 venues, I deemed that list irrelevant.
So let's say I then go back to discovery, scroll down to a specific carousel, and there I see that venue appearing directly, without even having to scroll to the right, and make the purchase.
And then we can say: okay, on the restaurants tab, the reciprocal rank was one divided by 20.
So it makes perfect sense, but basically to better leverage the data that you have.
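Note: a minimal sketch of the counterfactual evaluation just described; names are illustrative. A purchase made on another view is mapped back onto the logged restaurants-tab ranking to obtain a reciprocal rank for that session:

```python
# Minimal sketch of the counterfactual trick (illustrative names): map a
# purchase made on another view back onto the logged restaurants-tab
# ranking to get a reciprocal rank for that session.
def session_reciprocal_rank(logged_tab_ranking, purchased_venue):
    """logged_tab_ranking: venue ids in the order shown on the restaurants tab."""
    if purchased_venue not in logged_tab_ranking:
        return 0.0  # the purchased venue was not on the tab at all
    return 1.0 / (logged_tab_ranking.index(purchased_venue) + 1)  # 1-based rank

# Marcel's example: the favorite pizza place sat at position 20 on the tab,
# but was bought via discovery -> counterfactual reciprocal rank = 1/20.
ranking = [f"venue_{i}" for i in range(1, 31)]
print(session_reciprocal_rank(ranking, "venue_20"))  # 0.05
```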
Exactly. Exactly. Yeah. And there are also tons of sessions where users just didn't buy anything. Right. But it's really difficult to make use of them. Right.
And this is my message to our colleagues from the research side of things. Right. So please come up with ideas. What can we do with this amount of data that is just wasted at the moment?
Yeah. There might already be plenty of ideas on the research side. It's more a matter of identifying those and actually having the time to test them, because, yeah, applied science is also inherently risky.
Not every idea, not every approach that we pursue finally turns into something profitable, because, yeah, you first need to try it on your data, or you need to build it.
So it's not like, yeah, I want to have that red triangle appearing on the website or the app somewhere. I guess like UI and UX are also a pretty challenging job.
So I don't want to downplay those. I just want to say: yeah, we can put something there, and maybe in the end it turns out that the user doesn't like it. And we will only finally see by the gold standard of testing, which - besides all the cool counterfactual stuff - I guess you can safely claim is still online testing through A/B testing.
Something that during the preparation of this episode also came to my mind, we just have a huge diversity of stuff that we could actually recommend or that we could rank and personalize.
And to whom we can recommend.
So you already did a pretty good job illustrating to our listeners what our app looks like. If you are familiar with the Netflix layout, think about the same but displaying restaurants, and then you have what Wolt's discovery looks like.
But this would actually downplay what our discovery page looks like, since it does not only contain venues nowadays, but also carousels of items - for example, lunch menus.
So there we propose items, dishes in that sense. Or we have brands, we have categories - like real food categories. Think about Italian, sushi, Asian, Levantine cuisine, whatnot.
And there is actually a colleague working on ranking those and making this more personalized.
Traditionally, we have venue recommendations where we recommend a place to order from.
But I always found that we are solving an indirect problem there, since no user is going to buy a venue or a restaurant or a store - they are buying from a store or a restaurant.
Like I just need to point them where they find the relevant stuff and hope that where I point them at is already relevant.
And then we have like the more direct problem, which comes afterwards is finding the right stuff at the right place.
So like finding the right items and presenting items. So sequencing fruits at the grocery place, picking the most relevant items and whatnot.
And so far we have just talked about venues and items. But you already mentioned that the discovery page consists of carousels.
So there is carousel ranking - this multidimensional problem that we should also solve. And then there are images.
So if you think about it, I always wonder how our team could be that small because there are so tremendously many problems that we should solve and are only solving to a certain degree.
What's your take on this?
Yeah, definitely. So we solved the restaurant ranking problem, I would say, to a decent degree.
I can say that we are finally approaching the state where we are doing a very decent job on restaurant recommendation with a new model.
But every other problem is order of magnitude more complicated. For instance, take item recommendation.
This is like an extremely difficult problem in comparison to venue recommendation because we have like two orders of magnitude more items than venues or even three orders of magnitude.
I haven't checked the numbers recently. And it's like very context dependent. Right.
How am I going to know whether a customer needs, like, an iPhone at this moment?
Right? And how and when should I recommend toilet paper to this customer - or not? Because toilet paper is like a no-brainer.
You need to select a very specific set of items from a very vast pool of possibilities and show it to the customer at the right moment, taking into account all the context, the intent of the customer, whatnot.
So this is like very, very, very difficult.
Yeah. Yes. And actually also availability of items then actually plays a role.
If you think about groceries being sold from a supermarket or some dark store, of course you don't want to piss off your consumers recommending stuff that then finally turns out not to be available for order.
This brings us back to something that initially we discarded, which is candidate generation - for items, it does not come for free then.
Yes.
I hope at this point people have been able to get an understanding of how extremely difficult this problem could become.
But we don't only want to talk about all the challenges and the problems and the difficulties and whatnot.
But of course, today we also want to talk about solutions.
There I have to share a bit of a surprise moment that happened about three years ago.
I was in my interview process for my role as applied scientist at Wolt.
This was in 2023.
And during my interviews, the term NCF was brought up, and I heard that NCF was being used in production at Wolt. And I was like: wait, there's a bit of history behind NCF.
I was like: are you really using NCF? Haven't you read those papers?
And the listeners might be well aware of the history of papers behind NCF.
But nevertheless, as for every episode, we are going to put those papers into the show notes.
But nevertheless, I will try to give a short recap, at least of three of them.
So the original paper about NCF, also standing for neural collaborative filtering, appeared in 2017 at the Web Conference and was published by a group of researchers from the National University of Singapore, Shandong University, Texas A&M University and Columbia University.
So it was a rather academical paper that was presenting a neural network based approach to collaborative filtering that combined generalized matrix factorization with a multilayer perceptron and was evaluated on two data sets.
So MovieLens and a Pinterest data set.
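Note: for readers who want to see the architecture just described, here is a minimal PyTorch sketch of the NeuMF variant from the 2017 paper - a GMF branch (element-wise product of embeddings) and an MLP branch on concatenated embeddings, fused into one prediction. Layer sizes are illustrative:

```python
# Minimal PyTorch sketch of NCF (NeuMF variant): GMF branch plus MLP
# branch, fused into one logit. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class NCF(nn.Module):
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user_gmf = nn.Embedding(n_users, dim)
        self.item_gmf = nn.Embedding(n_items, dim)
        self.user_mlp = nn.Embedding(n_users, dim)
        self.item_mlp = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim // 2), nn.ReLU(),
        )
        self.out = nn.Linear(dim + dim // 2, 1)  # fuses the two branches

    def forward(self, users, items):
        gmf = self.user_gmf(users) * self.item_gmf(items)  # element-wise product
        mlp = self.mlp(torch.cat([self.user_mlp(users), self.item_mlp(items)], dim=-1))
        return self.out(torch.cat([gmf, mlp], dim=-1)).squeeze(-1)  # logit

model = NCF(n_users=1000, n_items=500)
scores = model(torch.tensor([0, 1]), torch.tensor([42, 7]))
print(scores.shape)  # torch.Size([2])
```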
And this was really during the time when deep learning took off in applications for recommender systems, mostly leveraging content features like image or text data, where neural networks had already been shown to be very useful - think about all the applications in NLP, or CNNs being applied to image data.
Listeners might be well aware of tons of other papers during that time and also of the RecSys workshops that were dedicated to deep learning based applications to recommender systems.
And I guess nowadays we could say that deep learning has become an obvious part of multiple approaches to our recommender systems.
And we are well beyond that point of saying that deep learning is a special application in recommender systems.
However, in 2019, there was a bit of a reproducibility crisis in RecSys, as I always like to refer to it.
And there was this paper which then also ended up becoming the best paper of RecSys 2019 with the well-known title, Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches.
In this paper, the researchers have taken a look at 18 algorithms and finally only ended up being able to reproduce the results of seven out of those.
So almost just a bit more than a third.
And in six out of these seven papers, they were able to outperform those presented approaches and among them, NCF, neural collaborative filtering with properly trained heuristic methods or baselines.
So this basically showed a bit that this NCF approach was maybe not as good as originally thought.
And also at RecSys in 2020, there was the paper called "Neural Collaborative Filtering vs. Matrix Factorization Revisited" that ended up taking a deeper look into this comparison and really finding out: is this really the case?
And why wasn't NCF as good as initially thought?
And it turned out, or this was part of the conclusion of the researchers of this 2020 paper, which was also a reproducibility paper, that learning the dot product function with an MLP is doable, but very inefficient.
And that proper baseline algorithms do a better job there.
And that this also can become not just very inefficient, but in the end, even ineffective.
So it's worth reading.
And this was kind of my background when I joined and heard that NCF was being used. But very soon, and trying not to be too preoccupied with my own thoughts,
we met - so Sascha and I met - and of course this was one of my first questions.
How did you make it work?
And this was not only my question back then, but it's also my question today.
So Sascha, share with us how you made NCF one of the greatest - or even the greatest - success in personalization at Wolt.
Oh yeah, thanks, Marcel.
That was a long, but very, very good, very good question.
The success of NCF at Wolt, I would say, is because I wasn't able to find those papers before starting with the network.
The thing is - let me give a little bit of history.
So how did we end up in this situation?
Remember I told you about cannibalization.
We applied an algorithm on the restaurants tab. It improved conversion from the restaurants tab, but it didn't lead to an improvement in the overall conversion.
I started analyzing why this happened.
And the discovery was striking.
In real life, most of the purchases in our industry come from the venues that users already purchased from.
So it's kind of repurchasing behavior.
And the model that we deployed actually improved the metric only on that kind of a subset of the data.
So it basically promoted to the top the venues that the customers have ordered from before.
And that isn't a very good thing, right?
Because why even use machine learning here? You can just use a simple heuristic to promote these venues.
And obviously it doesn't move the conversion, right?
Because, well, customers will anyway convert.
So they have clear intent to order from this particular venue.
They will find it anyway. There are plenty of ways to find it via search, via discovery, via profile, whatever.
And then I realized at that moment: OK, to really improve the conversion, we probably need to make our model generalize better.
And what does it mean in practice? We should isolate a subset of sessions where user purchased from a new venue and try to improve metrics on this subset with a constraint that we shouldn't kind of ruin the performance on sessions where users kind of repurchased from a known venue.
Looking forward, that was the right idea. So that was like a pivotal moment.
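Note: a minimal sketch of the subset evaluation described above (field names are illustrative): sessions are split into new-venue and repurchase subsets, so that a metric improvement on the former can be checked against a no-regression constraint on the latter:

```python
# Minimal sketch (illustrative field names): split sessions into
# "purchased from a new venue" vs. "repurchased from a known venue",
# so metrics can be tracked separately on each subset.
def split_sessions(sessions, purchase_history):
    new_venue, repurchase = [], []
    for s in sessions:
        seen = purchase_history.get(s["user_id"], set())
        (repurchase if s["purchased_venue"] in seen else new_venue).append(s)
    return new_venue, repurchase

sessions = [
    {"user_id": 1, "purchased_venue": "known_pizza"},
    {"user_id": 1, "purchased_venue": "new_sushi"},
]
history = {1: {"known_pizza"}}
new, rep = split_sessions(sessions, history)
print(len(new), len(rep))  # 1 1
```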
But this is something about how you evaluate and which kind of training data you use. So not something specific about why NCF ended up being the one, right?
It is kind of related to that. And then I looked at the performance of the current model on those sessions, right?
And the current model at that time was just matrix factorization implemented in Spark.
So the standard Spark ALS implementation - alternating least squares with implicit feedback.
And it performed very poorly on that specific subset of the data.
And then I started looking at alternatives and we also had a real life constraint.
So at that time, it was also pretty early days in the company, and there was almost no machine learning infrastructure at that moment.
And that was a limiting factor, because normally you need several components in a machine learning infrastructure: you need a training pipeline.
You need data, you need deployment mechanism, right? You need inference server and you need feature store for real time retrieval of features to do the inference.
And we didn't have a performant feature store. In fact, we had none whatsoever.
And collaborative filtering - it was pretty simple to implement, because we had user factors and venue factors stored in Redis, whatnot.
And so it worked just fine. And I looked at models that could avoid, you know, these external calls to the feature store and whatnot.
And neural collaborative filtering, it kind of satisfied these two conditions. First of all, there was like this Microsoft GitHub repo where they compared different recommendation algorithms and NCF performed decently on that benchmark.
I guess you mean the Microsoft recommenders repository? Yes. We had Miguel Fierro, one of the people responsible for it, in one of the very earliest episodes of Recsperts.
So on that repo, it was not the best, but much better than the ALS, which we had at the moment.
So I didn't want to start with something very advanced, like variational autoencoders or whatever. NCF just looked easy enough to implement.
And also because it uses only learned embeddings for users and items, slash venues, whatever you need to kind of recommend.
It is kind of a standalone model that you can deploy as a service and it will work out of the box without any infrastructure around it.
So it is pretty appealing. That was the motivation. So let's take a simple and decently performing model and replace our current model with it.
And because the baseline is so low, we will have an improvement. And then, because it's deep learning, you can continue from there.
It's rather flexible so you can change architecture, introduce new features, whatever. So that was the motivation.
So we have more control on it in contrast to the Spark implementation of ALS and whatnot.
Yeah. I mean, in that sense, you had at least some kind of a feature store, which was then basically kind of your huge embedding tables that you were serving from memory.
Yeah. But I didn't need to implement the feature store as a standalone service and didn't need this communication -
no additional round trips over the network, which would have increased latency, or whatever. Yeah. That was the motivation.
So in that sense, the original story was really that you just found this repository and then found that comparison of different methods.
And then I assume you checked out the original paper and said, OK, this is flexible. This is pretty straightforward. So let's give it a try.
So and then you ended up implementing it. Yes. And results were disappointing.
Even on offline metrics, it wasn't able to beat like the Spark implementation.
Oh, OK. What the heck? So am I stupid, or did you really implement it in its original version, how it was described?
Yeah, it was a disappointment. But I blamed myself, because I felt like I'm stupid and just can't reproduce the paper.
And I didn't know about the existence of these two papers that kind of failed to reproduce the results as well.
Exactly. So in that sense, did you actually try your implementation on the original data sets and weren't able to reproduce those results, or were you disappointed by the results on internal data?
Yeah, on internal data. I didn't use the standard data sets like MovieLens or whatever. I just applied NCF straight to our data.
Because this would have been like a possible trial, like to see, is my implementation possibly correct if I'm able to reproduce their work on their data?
But even this, of course, comes with problems because it assumes they have completely and well documented everything, which for some certain papers was also not the case.
So even that could actually create self-doubts that are not fair because in the end they haven't shared everything or maybe what is being reported is overly optimistic or not really reproducible.
But okay, so you found this first implementation not even beating ALS on internal Wolt data.
Yes. Okay. What happened afterwards?
Yeah. And then I started to kind of think how to improve because I liked the opportunities that this deep learning could offer.
Right. So I liked the overall approach, but the results were disappointing.
And I thought like, okay, maybe I am doing something wrong. Maybe try something else.
And then I started the period of experimentation. I started reading a lot of papers about like different objectives, different loss functions.
Yeah. And the first one that I stumbled upon was a paper about BPR, the Bayesian personalized ranking objective, which showed that the BCE loss, binary cross-entropy loss, is not very well suited for ranking tasks.
And it was convincing enough. So I tried it and it was an immediate jump in metrics, kind of beating the ALS.
I guess at a meetup, and we are going to link this talk of yours, you even mentioned that it provided a 20% relative uplift in MRR, just changing the loss function, right?
Yes. Okay. Yeah, it was probably the biggest update. And then everything kind of revolved around it.
Well, it was a little bit more complicated, because BPR is not just a simple change to the loss function.
The thing is that BPR works like this: you have a positive, which is, for instance, a purchase, and a negative, which is sampled uniformly at random from the pool.
And it's super inefficient: if you have one negative per positive, the convergence is very slow.
So that's why I experimented with sampling like many negatives and including them into a single batch that also kind of increased metrics.
So this is a very, very decent improvement. And when I thought about it, this negative sampling is very related to the famous coupon collector problem.
Do you know it? The idea is that you sample negatives randomly, so initially the learning curve is very steep.
You sample informative negatives very quickly, so your metrics improve very rapidly. And then there are fewer and fewer of these hard negatives that bring a lot of information to the model.
So most of the negatives you sample are very easy and they kind of bring little to the table.
And you need to find good negatives. Also, a lot of negatives are already kind of processed by the learning algorithm, and you need to find the unique ones.
And that's why the training becomes very long. So, yeah, increasing the number of negatives that are sampled at once increases the likelihood that there will be at least one informative negative in a single batch.
So the gradient update will be meaningful. But it's not guaranteed. It's like probabilistic things.
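Note: a minimal PyTorch sketch of the BPR objective with several sampled negatives per positive, as described above (illustrative, not the production code):

```python
# Minimal PyTorch sketch of the BPR objective with several negatives
# per positive (illustrative, not the production code).
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    """pos_scores: (batch,); neg_scores: (batch, n_negatives).
    Maximizes the log-probability that each positive outranks each negative."""
    diff = pos_scores.unsqueeze(1) - neg_scores  # (batch, n_negatives)
    return -F.logsigmoid(diff).mean()

pos = torch.tensor([2.0, 1.5])
neg = torch.tensor([[0.5, 1.0, 1.9],   # 3 sampled negatives per positive;
                    [0.2, 0.1, 1.4]])  # 1.9 and 1.4 are the "hard" ones
print(bpr_loss(pos, neg))
```

With more negatives per positive, the chance that at least one hard negative lands in the batch goes up, which is exactly the coupon-collector intuition described above.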
Yeah. One thing is the loss function - changing from a pointwise to a ranking loss, namely, in that case, BPR.
And the interesting side remark is also that Steffen Rendle - who I guess we can refer to as the inventor of BPR, together with one of my prior guests, Zeno Gantner -
was also the main author of this "Neural Collaborative Filtering vs. Matrix Factorization Revisited" paper, interestingly.
So there the loop kind of closes. One thing was changing the loss function; another thing was changing the sampling strategy.
Yes. And actually, the original paper already pointed this out as room for future work, because they were, if I remember correctly, using a uniform negative sampling strategy.
I guess they have already tried to increase the negative sampling ratio. They, I guess, originally reported something between three to six per positive.
Yes. And you were also working with this, but you tried to increase the likelihood of hard negatives appearing in the data set.
How have you been able to do this? Well, this actually exploits the locality thing that we talked about. Right.
So there is no point in sampling negatives from a different city or even a different area of the city.
Yeah. Right. So what I did is, basically, for a given positive I take its context,
so the negatives that are associated with that positive, and I sample from this pool and not from the whole pool of available venues in the country.
That was a huge improvement as well. What else have you changed? The way BPR works,
it samples negatives uniformly from the whole pool. Right.
It means that if a user has multiple purchases, then for a given positive you can accidentally sample another positive as a negative.
Yeah. If you don't remove them from the negative candidates. Yes.
So that was another improvement. So for every user, you keep a filtered set of possible negatives.
And then you also do another round of filtering by removing venues from this set that aren't in the vicinity of the positive venue.
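A rough sketch of this constrained negative sampling, dropping the user's known positives and keeping only venues near the positive venue, could look as follows; the data structures and names are assumptions for illustration, not the actual code:

```python
import random

def sample_negatives(positive_venue, user_positives, nearby_venues, n_neg=20):
    """nearby_venues maps a venue to other venues deliverable in its area;
    user_positives is the set of venues the user has interacted with."""
    pool = [v for v in nearby_venues[positive_venue] if v not in user_positives]
    # Fall back gracefully if the local candidate pool is small.
    return random.sample(pool, min(n_neg, len(pool)))
```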
And then also there is another very important addition, which comes from a different paper.
Basically the paper is "Collaborative Filtering for Implicit Feedback Datasets", which is the ALS paper.
And the feature that they use there is the importance weight. Right.
Hu, Koren and Volinsky, this famous formula that appears in everybody's head while they are hearing this.
Yeah. So basically they use weights, right, confidence weights.
If a user purchased from a given venue multiple times, then this is likely a very strong positive signal and we need to kind of up weight this pair.
Yeah. And actually the weighting scheme that they proposed wasn't the best.
Taking the log of that worked better, probably because of less popularity bias.
I don't know the exact reason, but yeah, it was another improvement. Yeah.
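For reference, the confidence weighting from the implicit-feedback ALS paper and the log-based variant mentioned here look roughly like this, with r_ui the raw interaction count and α, ε tuning constants:

```latex
c_{ui} = 1 + \alpha \, r_{ui}
\qquad \text{vs.} \qquad
c_{ui} = 1 + \alpha \log\!\left(1 + r_{ui}/\epsilon\right)
```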
What have you used to compute those confidence scores?
This is basically our kind of know-how, right, that is related to the company.
We use a linear combination of the number of purchases, whether the venue is in favorites, and also clicks, over certain time periods.
Everything is combined, then log-transformed, and this is used as a weight.
So we are basically using several different positive feedback channels that we put into one aggregate figure.
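A purely hypothetical sketch of such an aggregate weight, combining a few positive-feedback channels and log-transforming the result; the actual signals and coefficients are Wolt-internal, and the values below are made up:

```python
import math

def confidence_weight(n_purchases, is_favorite, n_clicks,
                      w_purchase=1.0, w_favorite=0.5, w_click=0.1):
    # Hypothetical linear combination of feedback channels, then a log
    # transform to dampen very large counts (and popularity bias).
    raw = (w_purchase * n_purchases
           + w_favorite * float(is_favorite)
           + w_click * n_clicks)
    return math.log1p(raw)
```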
Yes. Another improvement was that, okay, we sample negatives from the set of all venues minus the set of positives.
But what if we sample a negative either from the set of all venues that the customer has no interaction with, or from positives that have a lower weight than the current positive?
So that we establish a ranking between positives as well. And that also was a very good improvement.
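The idea of ranking among positives can be sketched like this: a weaker positive may serve as a "negative" relative to a stronger one. Again, the structures are hypothetical placeholders:

```python
def candidate_negatives(positive, weights, non_interacted, user_positives):
    """Venues the user never interacted with, plus positives whose
    confidence weight is lower than that of the current positive."""
    weaker_positives = [v for v in user_positives if weights[v] < weights[positive]]
    return non_interacted + weaker_positives
```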
Yeah. And then there were minor improvements, like architectural improvements.
Adding dropout, for instance, was a very good addition because it apparently reduced overfitting and let the model train for a longer time.
And if the model trains for longer without overfitting, there is a higher and higher likelihood that it will eventually sample good negatives, which improve the model performance.
Yeah. Which, if I got that figure correctly, added another 10% relative uplift in MRR. So the effects are compounding.
Yes. So yeah, everything altogether combined resulted in a huge improvement over the baseline.
Given the engineering effort, the many A-B tests and so on, you somewhat need to strike a balance between testing every change online,
yes, so being very rigorous with regard to scientific practices, versus the effort and the costs that are associated with this, and potentially also the risk of deteriorating the user experience.
So what you ended up with is putting all those improvements that you validated offline into one.
And then there was, was this actually then the first A-B test?
No, there was a preliminary A-B test, right? We tested small improvements in offline metrics and it was unsuccessful.
So that's why I was very cautious and decided to go for an A-B test only when I had a huge improvement, like 50%.
Of course, I don't remember the exact numbers, but the order of magnitude is roughly right.
And then when we got this improvement, I had this hypothesis at the time that, okay, this should improve the performance,
that it should move the business KPIs as well: the improvement of the MRR on sessions where a user has purchased from a new venue.
And when we ran the A-B test, it actually confirmed my hypothesis.
Then we had this conversation about these novelty effects, as I mentioned before.
And then we ran a long confirmatory experiment for several months, which basically showed the results on a holdout set.
So it wasn't a 50-50 split, it was a 90-10 split.
And it showed basically the same picture, and it was a huge monetary change.
And in the end it also helped to successfully promote newer venues to users.
So in terms of generalization it did a far better job.
Yeah. So this is how sometimes maybe you can read too many papers and then preemptively discard an approach.
So my first hunch turned out to be wrong, as I learned.
When I joined the company, I was sure that this must have been tested properly, so that there was a good reason to roll it out.
And then I learned that this was not the NCF that I remembered from the original paper.
Yeah. Since then time has passed and multiple models followed.
There was another deep learning based model integrating more content and contextual signals into it, namely the second pass ranker, SPR for short.
There was a dedicated first-time user model, something dedicated to cold start users.
But apart from that, more recently we were working on something together, bringing transformers into production at Wolt, now called UVR,
the Universal Venue Ranker, which has shown even more uplift.
Yeah. So NCF was a huge step forward, right? But it was not by any means
the kind of ultimate model that did perfect ranking and recommendations.
It was a huge improvement, but metrics were still, yeah,
there was still room for improvement, and the room was pretty spacious, I would say.
And one limitation of the NCF was that it didn't take locality into account. Right.
So customers may have different locations, for instance, they may buy from home or from work, and NCF takes all of the signals and crams them into one user vector.
That's why, when you are at home, it may rank higher up in the list venues that are far away from your home, just because they are close to your work location and similar to the restaurants that you've ordered from at your workplace.
And this is only one limitation that comes to my mind that is pretty easy to solve. Right.
And the SPR includes, for instance, explicit distance features, so how far venues are from the customer, and also the delivery estimate.
Also tons of other features, like content features of the restaurant: menu size, proportion of images, whatever.
So it definitely was an improvement over the NCF, and they work as a multi-stage model.
And NCF was not only quite good at ranking, but it was also pretty fast.
I'm really proud of it, because it was a model that ranks thousands of venues in under 30 milliseconds at the 99th percentile of latency, which is pretty good,
I would say, taking into account that this is a deep neural network. Right.
But because it sits in memory, there are no additional round trips to any storage, whatever. It is that fast even on a CPU.
Yeah. That enabled us to apply a second stage model on top of it. Of course, we don't re-rank all the venues in the list.
We take only the top several hundred from the output of the NCF and then we re-rank them using this second pass ranker, which is also a deep neural network.
We experimented with a gradient boosting second pass ranker. The accuracy was on par between the two, but the neural network was slightly faster,
probably because these gradient boosting methods are not super cache friendly.
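The multi-stage setup described here can be sketched as follows; first_stage and second_stage are placeholder objects assumed to expose a score(user, venue) method, not the actual services:

```python
def rank_venues(user, candidates, first_stage, second_stage, k=300):
    # Stage 1: score all nearby candidates with the fast in-memory model.
    shortlist = sorted(candidates,
                       key=lambda v: first_stage.score(user, v),
                       reverse=True)[:k]
    # Stage 2: re-rank only the top few hundred with the heavier model.
    return sorted(shortlist,
                  key=lambda v: second_stage.score(user, v),
                  reverse=True)
```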
At some point I also realized, okay, even this has limitations, right? Because we don't have true personalization in the SPR.
So we learn an average overall signal: how important distance is on average, how important the time of day is on average, how important the day of the week is on average.
Right. But for a particular customer, can we learn the importance of these different contextual factors? And then something hit me.
All right. What if we represent a user journey as a sequence of actions and try to do sequential modeling?
So for instance, a user on Sunday tends to purchase from retail. On workdays the user usually orders, I don't know, coffee and a bagel from places near their office,
or something like that. Can these behavioral patterns be extracted from the data in an automatic fashion? For that we need sequence modeling.
And what does sequence modeling best of all? Currently, it's the transformer architecture.
Yeah. And I mean, sequence aware recommender systems are not very new either.
So we have seen plenty of models from the NLP domain being applied to RecSys sooner or later.
One of the, I guess, famous ones might be the one, if I'm not mistaken, by Yahoo, where they applied word2vec to sequences of user interactions, turning it into prod2vec.
And you can do the same with playlists of songs, generating vectors for songs by perceiving playlists as documents containing songs, where every song is a word in the document.
By that time, and I guess there's this great review by Massimo Quadrana, who has done a lot of research on that,
we saw gated recurrent units and LSTMs being applied to user interaction sequences.
And then, yeah, as you just said, transformers were kicking off in the domain of natural language processing.
And again, the same story goes: can we apply something from that domain also in RecSys, applying it to sequences of user interactions?
And yeah, some folks in industry have already shown that this is possible and great inspiration again.
You were starting applying transformers to sequences of user purchases, right?
Yes, exactly. So yeah, basically we have all the user purchases, right? But then I also thought, why should we use only purchases?
We can use some metadata as well. Not only purchases and some positional encoding, but what about the location of the customer?
And the location could be represented by, for instance, the hexagon where the user is located.
We use Uber's H3 library to get these hexagons as a categorical feature and then translate it into an embedding vector that is used alongside the embedding that is associated with the venue.
And we use the same approach for time features, day features, whatever.
And all together, it gives the model a very good understanding of what a customer prefers in a given location, at a given time, on a given day of the week.
And basically we are trying to do the same thing as training large language models:
we're trying to predict the next purchase, given a context.
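A rough sketch of the input side of such a sequence model, assuming venue IDs, H3 hexagon IDs, and time-of-day buckets have already been computed as categorical features; this is an illustration of the idea, not the UVR implementation (causal masking and the training loop are omitted):

```python
import torch
import torch.nn as nn

class PurchaseSequenceModel(nn.Module):
    def __init__(self, n_venues, n_hexes, n_time_buckets, dim=128):
        super().__init__()
        self.venue = nn.Embedding(n_venues, dim)
        self.hex = nn.Embedding(n_hexes, dim)
        self.time = nn.Embedding(n_time_buckets, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_venues)   # next-purchase logits

    def forward(self, venue_ids, hex_ids, time_ids):
        # Each event is the sum of its venue, location, and time embeddings.
        x = self.venue(venue_ids) + self.hex(hex_ids) + self.time(time_ids)
        h = self.encoder(x)
        return self.head(h[:, -1])             # predict the next purchase
```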
So that means, by this time, and this was early 2025 when we were collaborating on this, I came a bit from the direction of: let's look at the recommender model landscape that we have created so far, with four models for different use cases.
How can we actually make it less complex? How could we further improve with more modern approaches and maybe also reduce the costs?
And I guess this was something that we learned fairly soon when we were working on this initial Transformer, mainly you: after a few iterations, we found that just the Transformer model with these comparatively few features,
so the venue, the time context information, and the location context, was able to perform as well as NCF and SPR together.
Yes. Yes. That was the moment we realized that we can not only replace two models at equal performance, right,
or actually several models with a single one, but also, with some effort, we can beat these models in terms of accuracy, probably also increase revenue, and have a much improved customer experience.
And then your work came on top of that, Marcel, which was amazing, and listeners can read in a blog post about how we further improved the accuracy of the Transformer model, what we did and applied on top of it
to get better ranking quality. Again, shameless plug.
No, I mean, we can do that. We put in plenty of work, not just us, also the other great colleagues on our team, other applied scientists, for example, Daniel and Pawel, but also our great engineers.
Stefan.
Stefan, who kicked my ass multiple times about writing proper tests, and of course also Attila, who supported the project greatly and still does today.
So it was definitely a great team effort. And then last summer we finished up with another A-B test that showed great uplifts, and it also further enabled us to solve the actually more difficult problem, which is recommending something new that users like and that makes them try something new.
And without telling too much: people who are, for example, familiar with the work by YouTube on exploration and the long-term effects of exploration know that more diversified, still relevant content can help with long-term retention of users.
And one could assume that this might also be the case when it comes to finding something new for dinner or lunch or whatnot.
Talking about this problem, the trade-off between recommending something recurring versus something new:
I'm always a bit reminded of papers, for example, the work applying Transformers in the music domain by Deezer, that argue we are in a domain where we need to strike a balance between recommending something new versus something recurring.
Recommending something recurring is not necessarily worse than recommending something new that is relevant. However, Sascha, give us your definite answer to this.
It's not a definite answer.
How is it for our domain?
Yeah, it's my opinion on this, and I am pretty bullish on it.
The longer I'm in the industry, the more I think that we should optimize the models strongly for recommending new stuff,
and the problem of recurring recommendations we should solve by different means:
use heuristics, use special UI,
so "order again" from this or that venue. With the models, I think we are limiting ourselves by training them on this mixed data, because the recurrent purchases are overrepresented in the data.
Our models are biased towards recommending something old.
Right. And in this model, we introduced some weighting that downweights recurring purchases a little bit,
and we saw an improvement in business metrics, whatever.
So I extrapolate even further: I believe that if we stop worrying about the performance of the model on recurring sessions, with purchases from old venues, so to speak, then we will get even better results.
But yeah, this is my opinion. It could be wrong, but I have a hunch that it might work.
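The downweighting of recurring purchases mentioned a moment ago can be illustrated with a simple per-example loss weight; the factor here is a made-up placeholder, not the value used in production:

```python
def weighted_loss(example_losses, is_repeat, repeat_downweight=0.5):
    """example_losses: per-example losses; is_repeat: flags for purchases
    from venues the user already knows."""
    weights = [repeat_downweight if r else 1.0 for r in is_repeat]
    return sum(w * l for w, l in zip(weights, example_losses)) / sum(weights)
```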
I agree that part of that hunch is definitely well supported by the data, because this was kind of our priority for UVR: to really push for new venue exploration, which is of course also the harder problem.
And as we already initially said, isn't it also more fun solving for the harder problems?
Yeah.
But on the same surface, users might want to see both things, or at least we should do an even better job of detecting the intent of the user in near real time.
So, is the user currently in exploration mode, or are they, let's say, in a lazy "going back to something known" mode? And if we can detect this reliably, we can tailor the recommendations to be steered more toward showing something recurring.
And then you can maybe solve it again with some fancy Bayesian stuff or something else.
I'm thinking about multinomial blending, also a paper that appeared at RecSys in Bari, where this was applied to different content types,
so blending music with podcasts. But maybe you could also use multinomial blending for new versus recurring content, and then blend based on some parameters that you infer live or from the user's history.
Because I guess we can also discriminate between users that are very exploration oriented and always trying something new, versus those that are almost always going back to something known.
Yeah. If I remember correctly, we even have a feature like that.
Maybe it will also appear in the blog post.
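As a sketch of how multinomial blending could be applied to new versus recurring content, one could repeatedly draw which pool fills the next slot; the mixing probability would be inferred per user, and the value below is only a placeholder:

```python
import random

def blend(new_items, recurring_items, p_new=0.6, n_slots=20):
    new, rec, out = list(new_items), list(recurring_items), []
    while len(out) < n_slots and (new or rec):
        pick_new = bool(new) and (not rec or random.random() < p_new)
        out.append(new.pop(0) if pick_new else rec.pop(0))
    return out
```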
Sasha, I guess we could talk for hours about RecSys, but as we learned initially, this is not your only domain of expertise or responsibility at Wolt in your position as staff applied scientist.
I also want to briefly touch on another domain, which is actually ads.
Could you give us a short snapshot? I guess one of the earlier episodes of Recsperts already talked about targeted advertising with Flavio and Fazile from Criteo.
But how does this play out at Wolt, and what is your work in the ads domain?
Yeah. These two domains and their problems intersect to a very high degree, I would say.
In both problems, we need to identify the pieces of content that have the highest likelihood of customer conversion, right?
That the customer will purchase from a particular venue or will purchase a certain item, whatever.
What's different is what we do with these pieces of content that we recommend, right?
In personalized ranking, we just rank the content by relevance. While in ad tech, we have a thing that is called an auction.
There are different types of auctions: first price, second price, whatever.
But that doesn't matter too much for what we do.
So in personalized ranking, we can just produce a number, some number that we will use as a ranking criterion.
In advertising, in particular in an auction, we have to produce a well calibrated probability, which we then multiply by the bid that the advertiser places.
And then we use this quantity as the ranking criterion. And that's kind of the main difference, right?
There are, of course, tons of other small but important things, but the main premise is this:
we estimate the probability, we multiply this probability by the bid,
we do the ranking and we select the winner, and there can be multiple winners in different places.
We show the advertisement and charge the advertiser.
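A minimal illustration of this ranking step: score each ad by its calibrated conversion probability times the advertiser's bid and pick the winners. Pricing rules (first versus second price) are out of scope, and the data structure is assumed for illustration:

```python
def run_auction(ads, n_winners=3):
    """ads: list of dicts with a calibrated 'p_conversion' and a 'bid'."""
    ranked = sorted(ads, key=lambda a: a["p_conversion"] * a["bid"], reverse=True)
    return ranked[:n_winners]
```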
And the thing is that it's very difficult to come up with these probabilities.
Right. But otherwise, the models are quite similar.
In advertising, we are not very much inclined towards novelty; we need to maximize the likelihood of purchase, whatever.
But the models are similar. And the first ad tech model was the NCF, repurposed to produce well calibrated probabilities.
It was a calibration layer on top of it. That's it.
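One common way to add such a calibration layer on top of an existing ranker's scores is isotonic regression (Platt scaling is another option); the snippet below uses scikit-learn with dummy data as an illustration, not Wolt's actual setup:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Dummy validation data: raw model scores and observed binary outcomes.
raw_scores = np.array([0.2, 0.8, 1.5, 2.3, 3.1])
outcomes = np.array([0, 0, 1, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, outcomes)
calibrated_p = calibrator.predict(np.array([1.0, 2.5]))  # calibrated probabilities
```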
So now we are working on a new generation of ad tech model, also based on the Transformer, a little bit different.
From a practical standpoint, for ad tech, latency is even more important than for personalization.
Those are very, very internal reasons, so I won't be telling about them in this podcast.
But yeah, this is the main challenge in ad tech at Wolt: we need to come up with a decent probability estimate in a very short period of time.
And that's not an easy task. Yeah. So that's what we are working on.
But in general, problems that we are solving are very similar. Right.
But the ad tech domain itself is much broader because this probability estimation is only one small problem that you can solve with machine learning.
There are tons of other things, for instance, auto bidding and many others.
So it seems there is still a lot of demand for qualified research and for turning that research into practice, which, as we learned today,
does not always lead to something positive at the very first attempt, but then does with proper grit.
And I guess you have definitely proven that grit in your history: great ideas and being able to transfer research into practice can yield tremendous success.
Apart from RecSys at Wolt, and ad tech as well, what challenges do you see for recommender systems or data-driven personalization in general?
Maybe specific to the delivery domain, but also in other domains where we as a community of researchers and practitioners haven't done a good job yet?
Well, my personal challenge is item recommendation. I still have no decent understanding of how one can recognize customer intent and understand whether to recommend an iPhone or, yeah, a banana, for instance.
If you try to naively learn patterns, you will always be recommending toilet paper and probably some other groceries to the customer, because it's a repetitive pattern and it's easily predictable, but it's not very useful, so to speak.
Right. So as we talked about previously, the main benefit in recommendations is also novelty, serendipity, whatever.
And how to do that, I have no clue at the moment, but I'm thinking about it constantly.
Yeah. So how to suddenly recommend something very useful that the customer is thinking about at that moment, right? How to solve this problem?
This is the biggest challenge and also the biggest thing that can bring a lot of customer satisfaction.
All right. You are not only internally well known for your work on RecSys and ads, but externally, as I initially mentioned, you are even more known for your work on longevity research.
So listeners, if you take Sasha's name and type it into YouTube, for example, you will definitely find that meetup talk about NCF that we are of course going to link in the show notes, but you will find even more podcasts, episodes, talks, and paper presentations on his favorite topic.
I guess it deserves more than being referred to as a hobby topic, because the work and energy you dedicate to this seem to be pretty high.
So as a wrap up for this episode, can you briefly introduce us to what kind of longevity research you are doing and what amazes you about it so much that you dedicate so much time beyond just your work to it?
Yeah, I can explain, and I think I have a pretty logical explanation of why there is such a discrepancy between the amount of external work and presentations on longevity versus on RecSys.
Because I think that the most important thing in every person's life is health. So when you are healthy, you think about like thousands of things.
But when you are sick, you think about only one thing, how to become healthy again. This summarizes it pretty well. Right.
So now you have an understanding of why I am very interested in this longevity topic. The thing is, it looks like aging and death are kind of inevitable for every one of us.
And the goal of intelligent life is to delay this moment of death as much as possible. Right.
You can agree with me or not, it's up to you. But it looks like intelligent life evolved to avoid death. Right.
We use our intelligence to put ourselves out of danger. We use our intelligence to get food, to treat diseases. Right.
Which eventually keeps us alive. And what ultimately kills us is aging and the concomitant age-related diseases. Right.
And this is, in my opinion, the next logical step for intelligent life: not only to avoid death from starvation, from accidents, from infectious diseases, but also from these chronic age-related diseases,
ultimately leading to the so-called longevity escape velocity, where progress in medicine increases our life expectancy by more than a year every year.
So this is what interests me the most, even more than RecSys, I would say. I really enjoy working on recommender systems and machine learning.
But honestly, I think the longevity and health thing is more important. When we are young, we don't think about it, but as we become older and older, I think most people will agree with me.
Yeah, yeah. This is more important. This actually reminds me of a sentence that David Sinclair mentioned in the podcast Lifespan:
that humanity has become accustomed to perceiving aging as a given, whereas we should be perceiving aging as a disease.
Well, there are debates about whether we should treat aging as a disease. I don't think that it is a disease, but it is certainly a part of life that the majority of us don't want to experience.
And I think we are living in a time which is unique. I really believe that if we do everything right, we might be the first generation that could avoid aging and actually experience radical longevity.
But of course, the chances of it honestly are not extremely high, to say the least. This is achievable only if things are done right from the get-go, but things are never done right.
And this is where my advocacy, I think, is important and works in the field of radical longevity. So I'm trying to accelerate things a bit, because I'm not satisfied with the progress.
Though we have extended the lifespan of many model organisms, we are still far from translating these effects to human beings.
And I have a few scientific papers and also popular articles talking about these discrepancies.
I mean, there are also claims that I keep seeing more frequently these days saying that, with the accelerating progress in AI research, we might also see accelerating research progress in other domains, which could of course also include longevity research.
Things are becoming more disruptive. And disruption is sometimes coined a bit negatively, but if you think positively, the speed of advancements in research that you have been experiencing might not stay the same in the future, it could be radically accelerating.
Which then also reminds me of another sentence by a person in another podcast that said, the most stupid thing would be to die within the next 10 to 15 years.
Yeah.
So if listeners would like to know more about this topic, and if I remember correctly, you are not the greatest fan of this branch of research that, for example, David Sinclair is pursuing.
Like what would be your recommended resource to kick off with? Is there a book or a podcast or what should people read apart from your papers that you have published in that domain?
I can give you my latest article. It's a popular article about longevity bottlenecks, so the things that limit our maximum longevity, so that readers can have a look and hopefully find it interesting.
And it also contains some practical things that can be utilized already now to avoid one of the most devastating consequences of aging, called dementia.
I mathematically show there that even if we cure all other diseases, like a 100% elimination rate of cardiovascular disease, cancer, whatever, then by age 110 something like 99.8%, and this is not a strict number, just a ballpark, will get dementia.
Almost everyone will get it if we don't find a cure.
Wow. Okay. Okay.
That's frightening and nevertheless a somewhat amazing, or at least interesting, learning.
Yeah. So let's link that paper in the show notes. I guess it could definitely draw additional attention beyond just RecSys, and serve as another useful recommendation, to stay with that term.
This brings us to the conclusion of this episode.
Sascha, it was a really great pleasure to talk with you about all of this.
Likewise.
Thanks. And yeah, your experience and your drive for pursuing ideas, making them work, and making them tremendously successful for the business that you are operating in, while also seeing that beyond that there is even so much more.
So it was a great pleasure and I thank you so much for sharing those thoughts with the community. Practitioner episodes are also greatly appreciated.
So thank you again for this.
Thanks, Marcel, for the kind words.
And it was a pleasure to be on your podcast, and it was an even greater pleasure to work with you through these years, implementing really cool stuff.
Cool. Great. Then have a wonderful day and see you probably in the office, even though virtually. Bye.
Bye bye.
Thank you so much for listening to this episode of Recsperts, Recommender Systems Experts, the podcast that brings you the experts in recommender systems.
If you enjoy this podcast, please subscribe to it on your favorite podcast player and please share it with anybody you think might benefit from it.
If you have questions, or a recommendation for an interesting expert you want to have on my show, or any other suggestions, drop me a message on Twitter or send me an email.
Thank you again for listening and sharing and make sure not to miss the next episode because people who listen to this also listen to the next episode.
Goodbye.