How AI Is Built

In this episode of How AI is Built, Nicolay Gerold interviews Doug Turnbull, a search engineer at Reddit and author of Relevant Search. They discuss how different methods and technologies, including large language models (LLMs) and semantic search, contribute to relevant search results.
Key Highlights:
  • Defining relevance is challenging and depends heavily on user intent and context
  • Combining multiple search techniques (keyword, semantic, etc.) in tiers can improve results
  • LLMs are emerging as a powerful tool for augmenting traditional search approaches
  • Operational concerns often drive architectural decisions in large-scale search systems
  • Underappreciated techniques like LambdaMART may see a resurgence
Key Quotes:
"There's not like a perfect measure or definition of what a relevant search result is for a given application. There are a lot of really good proxies, and a lot of really good like things, but you can't just like blindly follow the one objective, if you want to build a good search product." - Doug Turnbull
"I think 10 years ago, what people would do is they would just put everything in Solr, Elasticsearch or whatever, and they would make the query to Elasticsearch pretty complicated to rank what they wanted... What I see people doing more and more these days is that they'll use each retrieval source as like an independent piece of infrastructure." - Doug Turnbull on the evolution of search architecture
"Honestly, I feel like that's a very practical and underappreciated thing. People talk about RAG and I talk, I call this GAR - generative AI augmented retrieval, so you're making search smarter with generative AI." - Doug Turnbull on using LLMs to enhance search
"LambdaMART and gradient boosted decision trees are really powerful, especially for when you're expressing your re-ranking as some kind of structured learning problem... I feel like we'll see that and like you're seeing papers now where people are like finding new ways of making BM25 better." - Doug Turnbull on underappreciated techniques
Chapters
00:00 Introduction and Guest Introduction
00:52 Understanding Relevant Search Results
01:18 Search Behavior on Social Media
02:14 Challenges in Defining Relevance
05:12 Query Understanding and Ranking Signals
10:57 Evolution of Search Technologies
15:15 Combining Search Techniques
21:49 Leveraging LLMs and Embeddings
25:49 Operational Considerations in Search Systems
39:09 Concluding Thoughts and Future Directions

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

===

[00:00:00]

Introduction and Guest Introduction
---

Nicolay Gerold: Hey everyone, welcome back to How AI is Built. This is Nicolay. I run an AI agency and I am the CTO at a generative AI startup. Today we are back, continuing our series on search, and we have a very special guest. Doug Turnbull is a search engineer at Reddit, and he has written one of the bibles, at least for me, on information retrieval: Relevant Search.

We will make a round trip through the evolution of search, how different search methods can be used to return relevant search results, and what role LLMs and semantic search play in state-of-the-art search systems.

Understanding Relevant Search Results
---

Nicolay Gerold: What is a relevant search result to you?

Doug Turnbull: That's a great question. I don't think people ask themselves that question enough. So [00:01:00] first of all, there are different problems with search. It's a very intentional activity that people are engaged in, but there are so many unconscious things that go into how people are searching, why people are searching, and what they want.

Search Behavior on Social Media
---

Doug Turnbull: As an example, social. And I'm relatively new to this, I haven't done social media search my whole life. How people search a social media site comes with very specific needs relative to the concept of social media and what people want. And so when you go to reddit.com, yeah, there are informational things you're searching for, but then there are also other ways that you search. On Twitter or social media you want to know about some event that's happening right now and what people are talking about. On Reddit, there's a lot of goofiness.

So a query like Cybertruck was a surprise for us. We found that for [00:02:00] Cybertruck, the default assumption would be maybe I want product reviews, but it turns out that what people want when they search Cybertruck on a social media site is goofy Cybertruck videos. And you learn these things from your users.

Challenges in Defining Relevance
---

Doug Turnbull: There's not like a definition of what a relevant search result is. There's what your users are telling you through implicit feedback, what people are clicking on and engaging with, but with the caveat that that's a very rough measurement tool, because people are often engaging with stuff purely for, I guess you would say, lizard-brain reasons. There's some drama, or some spicy picture, or some other reason that's unrelated to that query.

So there's not like a perfect measure or definition of what a relevant search result is for a given application. There [00:03:00] are a lot of really good proxies, and a lot of really good things, but you can't just blindly follow the one objective if you want to build a good search product.

Nicolay Gerold: And on Reddit, do the different objectives come into play in, for example, the different communities? Or is it that even across communities you have to detect them on a query level?

Doug Turnbull: That's a great question. For the most part, there are patterns across all communities, and there are classes of queries that happen regardless of whether people are searching across the entire site. Yes, there is the ability to go into your subreddit, or a subreddit, and search for stuff.

Usually if you're doing that, you know what you're looking for. There are a couple of ways people search [00:04:00] all of Reddit. They will search for topics and expect the latest for that topic, like news, politics, maybe a prominent politician's name, or any other subject. They will search for people's names that they want to know about. And then they'll search informationally, for open-ended sorts of questions. And what's interesting about social media, and Reddit in particular, is that Reddit is almost like the anti-AI.

When you talk to an AI, you get a generic response that's almost as if it's a Wikipedia article. But if you go to Reddit, the reason people go to Reddit is that they want an actual human being's experience, their subjective experience with this topic.

Doug Turnbull: You also get things that are really about human connection. Obviously people want to know about products, but people also want to know about medical [00:05:00] things, or mental health things, and they want to find human connection, and find communities that provide that connection, and that shows up in the search queries too.

Query Understanding and Ranking Signals
---

Nicolay Gerold: And how do you actually decompose the query that you get, like the different signals the user can give you about what their intention is, or what a relevant search result is to them?

Doug Turnbull: Yeah. So the first way of thinking about a ranking signal is understanding the query. There are a couple of ways of thinking about it. There's understanding the query, there's understanding the content that's being searched, and then really the secret sauce is how you do the data modeling.

And query understanding. So those two things have to come together in a reasonable way, in a way that makes sense. One example of this is you might take a [00:06:00] query, and this is a pretty common pattern, and you might have a sense of the general categories that query goes with. That presupposes that you can take the content you're searching and also classify it into the right categories.

And it's really the marriage of those two that makes really good relevance signals for building a ranking function. Obviously you have your baseline retrieval, but then can you understand the query enough, and understand your content enough, so that your content is targeting the right queries and your query is mapped to the right type of content?

And that's really hard. I think that's honestly the secret sauce of most search applications. People do the obvious things these days: okay, I'm going to do BM25 search of some text, I'm going [00:07:00] to build a two-tower model maybe, and do some embedding-based retrieval on how queries map to content and content maps to queries based on historical data.

But independent of that, there are things like: what does it mean for a query to get a good title match? What does that look like? And you look at that and ask, in what context do I want the query matching the title to rank highly? Okay, there are a couple of pieces of low-hanging fruit.

Obviously, if someone asks a question and the title is exactly that question, that's a really good signal. Or if some part of the query that's an important phrase matches the title, that's probably really an important signal. But title matching as a text matching thing from a lexical point of view is very different than body matching, right?

With title matching, you usually don't care about [00:08:00] how often a search term occurs in the title. So if you search for, I don't know, Kamala Harris, you don't care that the title says Kamala Harris two times. But in the body you do care, because you want to know if Kamala Harris is the main topic of that document.

The concept of term frequency, how often that search term occurs, becomes a much more important part of the ranking signal for the body text than it does for the title itself. And those are patterns that you need to ground in data, whether it's engagement data or having humans, or your buddy on the product team, label things. These are insights you have to figure out by having data that helps you know whether or not you're heading in the right direction.
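A minimal sketch of the distinction Doug describes, title matches as a near-binary signal versus body matches where term frequency matters. The function names, weights, and BM25 parameters are illustrative assumptions, not Reddit's actual scoring code.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Standard BM25 contribution of a single term (textbook formula)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

def title_signal(query, title):
    """Title: care about exact/phrase coverage, not how often terms repeat."""
    q, t = query.lower(), title.lower()
    if q == t:
        return 2.0                      # exact title match: very strong signal
    if q in t:
        return 1.5                      # query appears as a phrase in the title
    terms = q.split()
    return sum(term in t for term in terms) / max(len(terms), 1)  # coverage

def body_signal(query, body_tokens, corpus_stats):
    """Body: term frequency matters, so sum BM25 contributions per query term."""
    score = 0.0
    for term in query.lower().split():
        tf = body_tokens.count(term)
        if tf:
            score += bm25_term_score(
                tf, len(body_tokens),
                corpus_stats["avg_doc_len"],
                corpus_stats["df"].get(term, 1),
                corpus_stats["n_docs"],
            )
    return score
```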

Nicolay Gerold: Yeah, what would you say is the unit of search signals? Is it the [00:09:00] term, is it the document, or is it in the end how you use both?

Doug Turnbull: I think it's how you use both. I think the unit is a relationship between the query and the document. Obviously when we say query, we usually mean the keywords in the query, but that could expand to other notions of things we know about the user, or filters they've selected in the UI or something.

And then also in the abstract, just the relationship to the document, that's really useful. But one thing that's often underappreciated: it's not just about getting a relevant search result, it's also about delivering a good experience. When I used to work in job search, we would return a great match. If someone was looking for a restaurant job nearby, [00:10:00] we would return a Chipotle job, but often there would be nine Chipotle jobs.

So yes, all of those documents have great relationships to the query, but it's also about delivering a search results page, people say SERP, that helps them see the big picture: not just a bunch of relevant results, but what the landscape of documents that relate to this query looks like.
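Doug doesn't name a specific technique here; one simple, hypothetical way to avoid the "nine Chipotle jobs" problem is to cap how many results from the same group make the first page. The function and field names below are my own illustration.

```python
from collections import defaultdict

def diversify(ranked_results, group_key, max_per_group=2, page_size=10):
    """ranked_results: list of dicts already sorted by relevance score."""
    seen = defaultdict(int)
    page, overflow = [], []
    for r in ranked_results:
        g = group_key(r)
        if seen[g] < max_per_group:
            page.append(r)
            seen[g] += 1
        else:
            overflow.append(r)          # keep as backfill if we run short
        if len(page) == page_size:
            return page
    return (page + overflow)[:page_size]

# Usage (hypothetical fields):
# diversify(results, group_key=lambda r: r["company"], max_per_group=2)
```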

Nicolay Gerold: Yeah, and I think that's really moving into the front-end or UX part of search. But you actually have the different user intents as well. Is the user exploring, where you want to give more of an overview, or are they already comparing, where you have more like cards, like the Apple comparison you have on their website of the different devices and their computing power? I'm really curious.

Doug Turnbull: Yeah, totally.

Evolution of Search Technologies
---

Nicolay Gerold: What would you say was actually [00:11:00] the real start in the evolution of search? Which technology, which part actually kicked it off?

Doug Turnbull: The evolution of doing intent detection, or?

Nicolay Gerold: Yeah.

Doug Turnbull: Yeah, that's a great question. So historically what people have done, and still to this day this is the case: you can start out really dumb and just have a dictionary somewhere to know that this set of queries maps roughly to this category.

Like earlier, I talked about topics versus questions. You can just create a really coarse differentiation: okay, topics are usually shorter queries, maybe you even enumerate them in the dictionary, and [00:12:00] questions are longer and maybe end in a question mark.

And for the longest time, people would just create basic rules for how to classify these things. Now, building a query classifier is, in my opinion, a classic classification problem, where you have a string and you have to make a decision at query time whether it's in this category. For us, a very common one at Reddit is safe-for-work or not-safe-for-work.

And you have to make a decision just on that string, using a model, and return a verdict. Now, what's interesting is that there could be a lot of different types of models. If you just take a query string out of the box [00:13:00] with something simple, say logistic regression, how do you get features on this? You take features of a two-word string and make a decision on it using something like logistic regression. The feature engineering problem there is quite complicated. One challenge Reddit has for safe-for-work versus not-safe-for-work is that female names, unfortunately, are a gray area, because there are a lot of adult stars who put stuff on Reddit, and those names look very similar to a prominent actress's name, that kind of thing. And at query time you usually can't easily ask what the history of this query is and what kinds of [00:14:00] things are being engaged with. Instead, what you often have to do is find a way, and honestly, large language models are great for this.

But find a way of taking that natural language and putting a lot of the context into the model itself, of where this sequence of characters or sequence of n-grams tends to fall on the SFW or NSFW line. And even in those cases, the risk of failure is so high that you really need to be careful, because in these situations there can be consequences.

If your search is prominent enough and you make a bad mistake, and all of a sudden you're classifying a prominent female actor as always NSFW, then there could be consequences for that. That person could get upset, there could be [00:15:00] legal action, whatever. The downside consequences are so great that you're never going to get completely away from having a dictionary of rules of: you definitely should treat this as SFW, or you definitely should treat this as NSFW.

If that makes sense.
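A hypothetical sketch of the pattern Doug describes: a dictionary of hard rules that always wins, backed by a simple character n-gram logistic regression classifier for everything else. The labels, override lists, and training data are made up; scikit-learn is my library choice, not necessarily Reddit's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hard overrides: queries you never want the model to get wrong (illustrative).
ALWAYS_SFW = {"kamala harris", "joe biden"}
ALWAYS_NSFW = {"example_blocked_query"}

# Character n-grams let the model pick up subword patterns in short queries.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
# model.fit(train_queries, train_labels)  # labels: "sfw" / "nsfw"; must run before predicting

def classify_query(q):
    q_norm = q.strip().lower()
    if q_norm in ALWAYS_SFW:
        return "sfw"
    if q_norm in ALWAYS_NSFW:
        return "nsfw"
    return model.predict([q_norm])[0]   # fall back to the learned classifier
```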

Combining Search Techniques
---

Nicolay Gerold: Yeah, and I think that's what makes search so interesting: you typically don't get away with using just one technique. You have to combine loads of them, like heuristics, hard-coding, keyword and boolean search, phrase search, embeddings, semantic search. What are the most interesting or weird ways you have seen keyword and boolean search used to create search signals?

Doug Turnbull: Yeah. I think an interesting thing people do, and I talk about this in Relevant Search, and honestly, even in this day of RAG and LLMs a lot of this still applies, is this [00:16:00] concept that you see in information retrieval called, for lack of a better term, tiering,

where you can use a boolean query. Honestly, it's not just boolean queries, you can do this with re-rankers too. There are tiers of systems and machine learning models, but the basic idea is that when you're not super confident in a ranking signal, you weigh it lower. Like your first pass

BM25 match. It's very common, for example, to merge all the text of a document into a single grab-bag of all text, sometimes called an all field, and that's your first-pass BM25: I matched a bunch of text, it's pretty low confidence, it's a first cut. But let's say [00:17:00] you know that the query is an exact match for the title. In those cases you're in the territory of a high-confidence match: I know that this document is a really strong match for this query, so I'm going to give it the extra treatment. I'm going to look at it and be like, okay, now I'm really confident that at least I'm in the ballpark of a strong match, like the exact question is basically being asked in this document, whatever it is, social media post, Reddit post, whatever.

Now I can say, okay, if there's more than one of these, should I rank them based on popularity? Should I get the most recent ones? What is the best mix of these kinds of high-confidence matches? So I have a backup search, and then I have the thing where I'm like, okay, this is my 10x, really strong [00:18:00] result. How often is it the case that you're going to match this way? But there are also ways in between. Maybe at Reddit it's celebrity names, maybe topics or politician names. You may have a sense of the general entities being searched for, and that they exist in a document, and that might be a middle tier of: okay, I know that this user is searching for Joe Biden, and through some entity extraction, that this document mentions Joe Biden.

Then I can treat that as a level of confidence that we're in the topical ballpark, and now I'm going to look for other kinds of popularity signals and bring those to the top. And then finally, I think an often overlooked thing is what we mentioned: negative signals, [00:19:00] filtering things out. People often don't think in terms of: how could I express what is definitely a bad search result? Are there things that I know I should just pre-filter out? Maybe things that are known to be spammy or low quality, or that I know don't match the concepts of the query. Are there things I could just exclude from the search results, so there's less chance of them getting into my top 10? And those are just common lexical techniques that mirror what you see if you're building machine learning layers for search, where you have your early-pass high recall and then you're winnowing down to higher and higher precision, higher-confidence results over time, but you can always fall back to those higher-recall ones if you don't get the stronger matches.
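An illustrative sketch of the tiering idea as an Elasticsearch-style bool query, expressed as a Python dict. The field names ("all_text", "title", "entities", "quality") and boost values are assumptions, not Reddit's actual schema or weights.

```python
def tiered_query(user_query, extracted_entities):
    return {
        "query": {
            "bool": {
                # Tier 0: low-confidence, high-recall baseline over a merged
                # "all text" field. This is the backup search.
                "must": [
                    {"match": {"all_text": {"query": user_query}}}
                ],
                # Higher tiers: boost high-confidence evidence when present.
                "should": [
                    # Exact/phrase title match: very strong signal.
                    {"match_phrase": {"title": {"query": user_query, "boost": 10}}},
                    # Entity overlap: we're at least in the topical ballpark.
                    {"terms": {"entities": extracted_entities, "boost": 3}},
                ],
                # Negative tier: exclude what we know is a bad result.
                "must_not": [
                    {"term": {"quality": "spam"}}
                ],
            }
        }
    }

# e.g. es.search(index="posts", body=tiered_query("joe biden", ["joe biden"]))
```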

Nicolay Gerold: And your higher-[00:20:00]recall one tends to be the BM25. How do you actually see keyword and boolean search and then phrase and term search playing together? Do they play a role at Reddit, for example? Do you use both, or do you tend to use just one of those?

Doug Turnbull: Yeah, they definitely both matter a lot. Obviously finding the direct phrase, that's one. When you use one of the query parsers in Elasticsearch or Solr or something, and you say I want to do a match query on the query against the document, that's just going to be straight-up term BM25,

and then you're like, oh, and also match phrase. That's just going to look for whatever query you pass as the full phrase. But when I think about phrases, I think about entities too, and there's a common technique in NLP called collocations. Gensim, the popular NLP library, has a way of doing this, basic things like: [00:21:00] here's a corpus, what are the statistically significant phrases?

What two terms go together in a way that's statistically anomalous? You don't have to build a fancy entity extraction system. You can just know that Kamala and Harris go together with a much higher statistical probability than Kamala and Spaghetti, right? And then you're like, oh, if I see that in a query, I'm going to treat it as a phrase.
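A minimal sketch of collocation detection with Gensim's Phrases model, the library Doug mentions: find statistically significant two-word phrases in a corpus so that "kamala harris" can be treated as a single phrase at query time. The toy corpus and thresholds are illustrative; in practice you would use much larger min_count and threshold values.

```python
from gensim.models.phrases import Phrases

# Corpus as lists of tokens (in practice: your document titles/bodies).
corpus = [
    ["kamala", "harris", "speech", "today"],
    ["kamala", "harris", "interview"],
    ["spaghetti", "recipe", "easy"],
    # ... many more documents ...
]

# min_count / threshold control how statistically anomalous a pair must be.
phrase_model = Phrases(corpus, min_count=1, threshold=1.0)

# Apply to a query: tokens that form a detected collocation get joined.
query_tokens = ["kamala", "harris", "policy"]
print(phrase_model[query_tokens])  # e.g. ['kamala_harris', 'policy']
```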

You could do very straightforward statistical checks in your corpus to find these phrases. And then one thing that's really interesting to me, which we haven't talked about, is these techniques I talk about in Relevant Search.

Leveraging LLMs and Embeddings
---

Doug Turnbull: I actually think they're even more relevant in the age of LLMs.

Because a lot of these ideas of extracting phrases or understanding the entities [00:22:00] being talked about, I feel like LLMs make a lot more possible. Whereas if you had asked me five years ago, I'd have said, oh, you have to do some labeling and build some entity classification system to pull out politician names or whatever.

Now it's: I give an LLM examples of extracting the brand name out of a product description, and it just does it. And then I have a field on my corpus that gives me the signal that the query mentions a product brand or a product type, and the document has that brand, which we extracted from the document using an LLM.

Using just some few-shot examples. And that replaces a lot of the manual relevance tuning that people used to have to do, the whole "I'm going to have to build an entire content understanding system" thing. It makes that so [00:23:00] much easier.
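A hypothetical sketch of this kind of index-time enrichment: few-shot prompt an LLM to extract a brand field from a product description. I'm assuming the OpenAI Python client here purely for illustration; any LLM API would work, and the prompt, examples, and model name are made up.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

FEW_SHOT = """Extract the brand name from the product description.

Description: "The Nike Air Zoom Pegasus 40 is a responsive road running shoe."
Brand: Nike

Description: "Brew rich espresso at home with the Breville Barista Express."
Brand: Breville

Description: "{description}"
Brand:"""

def extract_brand(description: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": FEW_SHOT.format(description=description)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# At index time: doc["brand"] = extract_brand(doc["description"]),
# so a "brand match" can later be used as an explicit ranking signal.
```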

Nicolay Gerold: I think that goes for LLMs and embeddings in general. How do you actually determine whether the additional complexity they bring, and especially the computational demand, is worth it?

Doug Turnbull: That's a great question. So one thing that is a common theme in search: we talked a lot about manually building ranking signals. Something we haven't talked about is that learning to rank, on top of that, is in some ways about giving ranking signals to a model and saying, find the best way to balance these instead of me doing it. And what's interesting when you do that is you realize that the best way to improve a model is to give it information that's orthogonal, completely different from what it's already seen [00:24:00] in its existing framework. So anytime you think about that, you ask: what are ways of giving it orthogonal information?

And one pattern is completely new systems like vector retrieval and the kind of scoring it does. That is a signal that is very independent of the exact term match of BM25, maybe not exact, maybe with some stemming and stuff, because it comes at the problem from a completely different angle: I'm semantically in the ballpark, even if I don't directly mention the term. And it complements the lexical space in an amazing way, giving you this rough, in-the-zone-of-meaning semantic signal, whereas the lexical side is much more: it either matches or it doesn't. And both of those are tremendously valuable signals. [00:25:00] In some ways the vector side is more focused on better recall,

and the lexical side in some ways says, oh, but it also directly matched this thing, so I'm going to give it a higher tier. So you can also think about it in that tiering sense of layers of recall.
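Doug doesn't prescribe a specific way to blend the two candidate lists here; reciprocal rank fusion (RRF) is one common, simple choice (my assumption, not his recommendation) because it works on ranks and avoids calibrating incompatible scores.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """result_lists: list of ranked lists of doc ids (best first)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage (illustrative ids):
# bm25_ids   = ["d3", "d1", "d7"]          # from the lexical index
# vector_ids = ["d1", "d9", "d3"]          # from the embedding / ANN index
# fused = reciprocal_rank_fusion([bm25_ids, vector_ids])
```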

Nicolay Gerold: And where do you see more potential? Is it rather using a classical, lexical search system with a very broad base for retrieval and then something like a re-ranker afterwards which is based on embeddings? Or are you going lexical plus semantic search and basically combining those signals? What is your guess or your bet on what would perform better?

Operational Considerations in Search Systems
---

Doug Turnbull: What organizations are doing... I think 10 years ago, what people would do is they would just put everything in Solr, Elasticsearch or whatever, and they would [00:26:00] make the query to Elasticsearch pretty complicated to rank what they wanted, and just be like, I query you, you've ranked everything, give it back. Or they'd even use something like Elasticsearch Learning to Rank, which can run a ranking machine learning model inside a search engine like Elasticsearch.

Give me the best results back. What I see people doing more and more these days... there's my dog Archie... what I see people doing more and more these days is that they'll use each retrieval source as an independent piece of infrastructure. They're trying to keep it as dumb as possible, and then they're pulling back candidates from each.

And they're putting the smarts in an inference layer that sits in front of the search engines, to build a re-ranker, cross-encoder, whatever kind of learned ranking model. And the reason people are doing that is that [00:27:00] as search systems get more and more complex, you have a team that focuses more and more on the infrastructure and operational sides of search.

And the more complexity you push into that, the more things can go wrong for the person on call. So you want to keep those layers really simple and operationally easy to manage and understand, and then you push the complexity into some kind of machine learning inference layer.

And that has its own kinds of problems and complexities. It's almost like, if you think of the retrieval systems as the cake, then the re-ranking is the icing on the cake. If the re-ranking part fails, at least you can fall back and give people, I don't know, a good cake. Everyone wants icing, of course.

And operationally it's just way easier to manage from an organizational point of view if you keep [00:28:00] that re-ranking layer as the icing on the cake. If it breaks, you at least have something to fall back on, and you're not tasking your infrastructure teams with understanding this behemoth search system that also runs machine learning models and does a million other things it has to get right, if that makes sense.
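A sketch of the architecture Doug describes, with hypothetical interfaces (lexical_index, vector_index, reranker are stand-ins, not real libraries): dumb, independent retrieval sources return candidates, a separate inference layer re-ranks them, and if the re-ranker fails you fall back to the plain candidates, the cake without the icing.

```python
def search(query, lexical_index, vector_index, reranker, k=100, page_size=10):
    # Each retrieval source is kept simple and queried independently.
    candidates = []
    candidates += lexical_index.search(query, k)       # e.g. BM25 top-k
    candidates += vector_index.search(query, k)        # e.g. ANN top-k
    candidates = list({c.doc_id: c for c in candidates}.values())  # dedupe by id

    # The smarts live in the inference layer in front of the engines.
    try:
        ranked = reranker.rerank(query, candidates)    # cross-encoder / LTR model
    except Exception:
        # Operational fallback: still serve a reasonable page if reranking breaks.
        ranked = sorted(candidates, key=lambda c: c.retrieval_score, reverse=True)

    return ranked[:page_size]
```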

Nicolay Gerold: Yeah. And the learning-to-rank part comes into play in the icing layer. But especially when you have different search systems, say a vector database and Elasticsearch, you have different signals from both. How do you actually end up combining the different signals into one signal that you can use for re-ranking?

Doug Turnbull: I think that is, these days, often the $60,000, or $60,000,000, whatever, dollar question. So that is the downside to the architecture I described. A lot of teams do what I described and they get by. [00:29:00] They won't try to combine them by, say, grabbing your vector results and then somehow going to Solr or your search engine and getting the BM25 scores for those, or vice versa.

What a lot of teams do is grab their top-whatever candidates, and they will do enough of the feature extraction in the re-ranking layer to make it valuable for what they want to do. But in the setup I described, you'll get some amount of duplicate work. You're like, oh, now I have to encode this with a vector embedding.

Or I have to maybe re-tokenize this text a little bit to get the direct term matches. But it's definitely a trade-off. You have the separation of concerns where, if you think about it, [00:30:00] if you look at PyTorch, or scikit-learn with its TF-IDF vectorizer, there are things that sort of recreate parts of the search engine. In some ways these worlds are very different, and maybe they should talk to each other more. But right now, if you take a machine learning engineer and say, build a ranking model, they might go to something like PyTorch and be like, okay, I'm going to get a TF-IDF vectorizer that turns these documents and the query into TF-IDF vectors, at least on this set of top thousand from each, and then I'm going to take the top thousand from each, encode them, extract the embedding for each, do a little bit of massaging, and then do a little bit of inference. That at least seems to be a lot of what I find. Maybe there's some duplicate work that happens there, and it's not super ideal.

But from an operational point of view, it's just [00:31:00] interesting how organizations also really want to focus on uptime and not having their search go down. You could push all this work down: there are systems, increasingly Lucene, OpenSearch, Elasticsearch, and systems like Vespa, that will just do all this for you.

But then it almost becomes the SQL database that has to solve for dozens and dozens of types of queries, and then you'll find the one problem query that causes the whole system to go down. Do you really want one usage pattern to become a single point of failure that causes everything to collapse?

So honestly, I don't think there's a great answer these days when it comes to search systems, because you're balancing these kinds of factors: how people think about things operationally versus machine learning and how people build machine learning [00:32:00] models.
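A sketch of the duplicate feature-extraction work Doug describes happening in the re-ranking layer: re-derive lexical and embedding features for the top-N candidates pulled back from each retrieval source. The library choices (scikit-learn, sentence-transformers) and the model name are my assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def rerank_features(query, candidate_texts):
    """Build a feature matrix for the candidates: one row per candidate."""
    # Lexical side: TF-IDF similarity between the query and each candidate,
    # recomputed here even though the search engine already scored BM25.
    tfidf = TfidfVectorizer().fit(candidate_texts + [query])
    q_vec = tfidf.transform([query])
    d_vecs = tfidf.transform(candidate_texts)
    lexical_sim = cosine_similarity(q_vec, d_vecs).ravel()

    # Semantic side: embedding similarity, the orthogonal signal.
    q_emb = embedder.encode([query])
    d_embs = embedder.encode(candidate_texts)
    semantic_sim = cosine_similarity(q_emb, d_embs).ravel()

    # Stack into a feature matrix a learned re-ranker can consume.
    return np.column_stack([lexical_sim, semantic_sim])
```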

Nicolay Gerold: Yeah, I think for smaller projects you're better off because you can just go with something like Elastic or Solr, they have vector search nowadays as well, and you can keep it in one system. But as you have to optimize performance, you have to split it up and really optimize the individual components as well, not just the system overall.

When you look at the search problem, especially for learning to rank, do you actually try to create as many signals as possible and then put learning to rank on top to identify which signals are worth it? I want to learn a little bit more about how you think about approaching a learning-to-rank algorithm.

Doug Turnbull: Yeah, so you have to be careful with that, because of dimensionality, if we just throw in more and more signals. What we think of as ranking signals are going to be [00:33:00] features in the machine learning model, and the more features we throw at a model, ideally you double the amount of training data, because you have to show how the entire range of every feature relates to the entire range of every other feature.

So it becomes like a Cartesian product of every feature. So you have to be really careful. What I tend to think about is: yeah, we proposed a bunch of features, but what's also interesting is to try to find the smallest set of features that gets the highest performance. And there are a lot of ways of doing that, literally.

What people do is ask, what can I edit down, even doing a direct feature search to find the smallest subset of features that gets the most performance. And then what I find useful is thinking incrementally on top of that. So you layer on. And honestly, you can do this not just [00:34:00] because you're building a learning-to-rank system for production; you can do this just to explore for your own manual relevance work, getting started.

Can I add a new ranking signal? Let's say I want to boost popular products that sell well in my search. Can I add that? Does just adding that improve my model's ability to make relevance better? What we haven't really talked about, and I'm sure you've talked about it with other people, is the metrics you look at for a ranking system, like NDCG or DCG, which are really about moving good stuff to the higher parts of a list. And then the other side of this is: do I have training data that covers all the cases, where a popular product matters, where a popular product is actually a bad result, where an unpopular result is more relevant? Are you covering all the quadrants [00:35:00] with your training data? Which is also hard.

Because if you're going to add a new feature, then you also have to get training data on it. And the problem is, almost always at high scale you want to use your engagement data as features, and the big problem with that is your search system is incredibly biased towards the existing algorithm, because that's what's currently ranking results and that's what people are clicking on and engaging with.

So how do you explore the new feature with users? Let me try putting some more popular stuff just in the third slot, just to see if people engage with it, before I even go and build a model with it. That's a whole headache of a problem that you have to really solve with search.
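A small worked sketch of the metric Doug mentions, DCG/NDCG@k, which rewards putting highly relevant documents near the top of the list. The graded relevance labels here are made-up judgment-list values.

```python
import math

def dcg_at_k(relevances, k):
    """relevances: graded labels in ranked order, e.g. 3 = great, 0 = bad."""
    return sum(
        (2**rel - 1) / math.log2(rank + 2)   # rank is 0-based, so +2 in the log
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: our ranking put a mediocre doc first and the best doc third.
print(ndcg_at_k([1, 0, 3, 2, 0], k=5))   # < 1.0; a perfect ordering scores 1.0
```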

Nicolay Gerold: Yeah. And if you have the re-ranking model, what [00:36:00] triggers or motivates retraining it? When you actually see, hey, performance is degrading, I have a data shift, how do you measure that in a search system?

Doug Turnbull: Yeah, you hope that your model is sufficiently general so that the features you're describing are somewhat immune to this. So you might learn that, in the case of a strong title match, you want things that are popular and recent, for example, and you hope that pattern is general and doesn't change much with time.

However, you might train something like a two-tower model, or something that depends very much on the relationship between the content and the queries, [00:37:00] to understand: oh, based on historical patterns, this query tends to go in this area of the dimensional space. If you pick a certain week at Reddit, searching for Joe Biden might mean he's dropping out of the presidential race, and you see a lot of semantic relationships between Joe Biden and dropping out.

So let's say you train on that week. How would you know, for a given set of queries, that that's degrading? There are a couple of things you can do. Obviously there's a tremendous amount of MLOps infrastructure to take the predictions of the model offline, take what we would call a judgment list, the labels for queries and documents and what's engaging, and monitor that over time to see how it changes and degrades, which queries are getting worse and which [00:38:00] are getting better. The other thing that you could do is

What queries is getting worse for workers is getting better [00:38:00] for the other thing that you would do is. Just straight up you could straight up have, and some people do this, just straight up have a holdout where you're just like, I have a set of, in my A B test, I have a dump search that is the holdout population, and I'm seeing how my fancy model is performing relative to the holdout.

There are complexities in that, because you're intentionally giving 10 percent of your users a pretty basic search. But the trade-off is that you feel a bit more confident in the holdout situation, because it's against real user traffic. Whereas trying to continually construct an offline eval set that you monitor, to see how the model's ranking performs over time, has the big downside of assuming that the way you construct that offline eval set is correct and that you're not missing anything. And that's probably the same practice you're doing as you're building a training set,

[00:39:00] because you need those kinds of labels for that situation too. But that is a complexity you have to trade off, you have to think about.

Nicolay Gerold: Yeah.

Concluding Thoughts and Future Directions
---

Nicolay Gerold: What would you say are the new types of signals that are emerging at the moment, especially through newer technologies like conversational search and multimodal search now becoming more viable, but also LLMs? What is the stuff you're looking at right now that you want to implement?

Doug Turnbull: Yeah, multimodal is really interesting, because you have the sense of: I have a vector for the text, for the image, maybe there's a video, and all of these things can get placed into maybe some kind of two-tower model that gives us the query's and the document's vector representations in some vector space [00:40:00] that's trained on relevance.

So those can be inputs to that kind of thing. But like I said, the more complex the model, the higher the risk, potentially, of model drift. One thing I think about, I mentioned this, is that I am trying to incorporate LLMs more into my traditional relevance work.

Help me understand this content. For this query, help me classify it into this set of categories. I describe to the LLM: I want you to be kind of an SEO manager, where you're going to look at these product descriptions and fine-tune them to be amazing at matching the keywords that should match this product. Or: you're an SEO manager and you need to look at this product and extract...

We all know IKEA has their weird [00:41:00] product line names or whatever. Extract those words into their own field, so that we can treat the match to one of those product lines as an explicit signal. Or look at this content and extract the celebrity names that are mentioned in it,

and all the variations of that celebrity name, so we get a synonym situation. Honestly, I feel like that's a very practical and underappreciated thing. People talk about RAG, and I call this GAR, generative AI augmented retrieval: you're making search smarter with generative AI.

And in a way that just takes everything I talk about in Relevant Search, a book that came out eight years ago. Because if you have an LLM in the loop and you're doing RAG, [00:42:00] then you can ask that LLM to do a lot of things for you, even help understand the query.

And then also when results come back, you can literally tell the LLM: hey, ignore anything irrelevant to this query. There's a limit to this for sure, but it can help you filter out some of the false positives that pop up.

Those aren't quite signals per se, but those are like tricks I'm seeing more and more as people use generative AI in the loop of doing search.
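A hypothetical sketch of the "ignore anything irrelevant" trick Doug mentions: after retrieval, ask an LLM which of the returned snippets actually relate to the query and drop the rest. Again assuming the OpenAI client for illustration; the prompt, parsing, and model name are my own.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def llm_filter(query, results):
    """results: list of (doc_id, snippet) pairs from the retrieval step."""
    numbered = "\n".join(f"{i}. {snippet}" for i, (_, snippet) in enumerate(results))
    prompt = (
        f"Query: {query}\n\nResults:\n{numbered}\n\n"
        "List the numbers of the results that are relevant to the query, "
        "comma-separated. If none are relevant, answer 'none'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.lower()
    if "none" in answer:
        return []
    keep = {int(tok) for tok in answer.replace(",", " ").split() if tok.isdigit()}
    return [results[i] for i in keep if i < len(results)]
```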

Nicolay Gerold: Yeah, I think this is on the bigger trends. What is an underappreciated technology in the data and AI space, or especially the search space, that you think deserves more attention?

Doug Turnbull: Oh, that's a great question. Underappreciated technology. I feel like the cycles go around [00:43:00] in information retrieval, and recently all the RAG people discovered BM25, which is great. I think everyone comes at RAG from a vector search perspective, and then realizes, oh, there is this good baseline technology that works pretty well out of the box.

And it's simple and it's fast. And then the other thing the AI crowd is building these days: the default re-ranking model is a cross-encoder. I feel like the next thing, in the next six months, is that the AI crowd will suddenly realize that LambdaMART and gradient boosted decision trees are really powerful, especially when you're expressing your re-ranking as some kind of structured learning problem, especially in e-commerce, where I have a table of attributes of something, or a table of [00:44:00] scores between a query and a document.

LambdaMART, which is basically just XGBoost, is really underappreciated for these things, and it's also the next level up from learning about BM25 that people will, I feel, rediscover: oh, this is a really powerful technique of building a decision tree and then building the next decision tree to learn the error of the previous one. So I feel like we'll see that. You're seeing papers now where people are finding new ways of making BM25 better; I feel like we'll have that with LambdaMART, where people will learn about that loss function, that way of thinking about ranking, and that nonlinear tree-based approach.

Trees are still a very compelling choice when you have tabular data, and a lot of matching things are not similarity between query and document; it's what is the popularity of this thing, what [00:45:00] are the straight-up attributes of the content. So I feel like there may be a renaissance in LambdaMART.

Nicolay Gerold: Yeah, nice. And if people want to start building the stuff we just talked about, where would you actually point them?

Doug Turnbull: For LambdaMART, probably just XGBoost. And if you're just getting started and you don't want to build a whole machine learning system, OpenSearch, Elasticsearch, and Solr have ways of doing inference with a LambdaMART model, super easy to play with.
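A minimal sketch of LambdaMART-style training with XGBoost's learning-to-rank objective, the route Doug points to. The feature matrix, labels, and group sizes are made up; in practice each row is a (query, document) feature vector and the label is a relevance grade from your judgment list.

```python
import numpy as np
import xgboost as xgb

# 6 query-document pairs, 3 features each (e.g. title_match, bm25, popularity).
X = np.array([
    [1.0, 12.3, 0.9],
    [0.0,  4.1, 0.2],
    [0.0,  7.8, 0.5],
    [1.0,  9.9, 0.1],
    [0.0,  2.2, 0.8],
    [0.0,  5.5, 0.4],
])
y = np.array([3, 0, 1, 2, 0, 1])      # graded relevance labels
groups = [3, 3]                        # first 3 rows = query 1, next 3 = query 2

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(groups)               # tell XGBoost which rows share a query

params = {"objective": "rank:ndcg", "eval_metric": "ndcg@10", "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=50)

# At serving time: score candidates for one query and sort by prediction.
scores = model.predict(xgb.DMatrix(X[:3]))
print(np.argsort(-scores))             # ranked order of the first query's docs
```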

Nicolay Gerold: And if people want to follow along with you, where can they do that?

Doug Turnbull: Yeah, I'm SoftwareDoug everywhere. I'm SoftwareDoug on Strava, most importantly. You can find me on LinkedIn, I think I'm SoftwareDoug there, and at softwaredoug.com. I picked SoftwareDoug back when software wasn't everything, and I thought, I'll define my career niche.

But now, of course, software is everything. And I'm even [00:46:00] beyond software, I guess I do search now. But just find SoftwareDoug, that's me.

Nicolay Gerold: And we will have to start a petition to get you SearchDoug as well.

Doug Turnbull: SearchDoug. Yeah.

So, first of all, I think this wasn't my best interview. I missed a bunch of interesting follow-ups or rabbit holes to go into with Doug, and some questions were not formulated in the best way. But still, we got some really interesting insights out of him. First of all, relevance is subjective.

There is no single definition, because it depends on the user intent, basically on what the information need and the goal of the user is, and based on that, relevant results can differ greatly even within one application. What you as an engineer have to do is understand the [00:47:00] query and understand the content, and figure out how you can create the best possible mapping between them.

And this is basically your job, or what you try to emulate in your search engine or database. What I found really interesting is the tiered approach to ranking: you have different ranking techniques, but also different retrieval strategies, which give you different levels of confidence.

For example, BM25 is a very strong baseline, so you have strong confidence in the results it returns. You can add additional search techniques to it, for example semantic search, but give them a lighter weight in your overall score. And then you have the ranking techniques which come after [00:48:00] that, which give an additional confidence boost.

Methods that can create strong signals can be found everywhere. For one, in retrieval techniques, for example BM25. But they can also come from certain fields, for example an exact title match. Imagine you're building a search engine for movies and you have an exact match on the title field: you can be very confident that this is what the user was looking for. And in the end, don't underestimate keyword and phrase search.

While semantic search is gaining popularity, keyword and phrase search are the best studied, and how to optimize them is pretty well documented, whereas for semantic search it's not that well known and well documented yet how to actually get the best results out of it. [00:49:00]

One interesting thing, I think, is using LLMs for augmentation: entity extraction for more fuzzy entities, for example what Doug mentioned for celebrities, but also generating synonyms or filtering out irrelevant results.

The one extra thing, which I actually want to try to implement, is creating additional representations of the data. A summary, for example, for long texts can already be seen as a different representation. But you can also do translations, where you actually try to emulate the user's language in your document representation, and through that create an easier matching between the query and the document.

And what he mentioned as crucial as well is [00:50:00] orthogonal signals: try to combine many different signals, but try to create signals that are signaling different types of relevance.

And yeah, some practical takeaways. First, focus on data. Develop a deep understanding of the queries: what are your users searching? But really try to understand the user as well: what are they actually searching for? Don't be stuck on saying all the intent is in the query, because it likely isn't. Really understand the types of queries they submit and what they want to do with them. And also, on a side note, dive deep into the results and try to understand why we did or didn't get the results the user searched for. Especially in the episode last week with [00:51:00] Charlie, we talked a lot about that, basically trying to debug it, like the document's position and stuff like that. And then the second one is: embrace experimentation. Continuously test and refine your search systems, and have different levels of tests on which you escalate.

So basically, when you implement something new, you first put it through an offline test, and only when you see that it leads to large improvements in the offline evaluation metrics do you move on to online tests and A/B tests. And I think what Doug also shows us is to keep up with the latest advancements in search. This especially means LLMs: even though they are not that established in search yet, they still have many interesting applications.

Yes, that's it.

So stay tuned for [00:52:00] next week. I'm not sure which episode I will publish next week. We might be going down a different route and moving more into semantic search and embeddings next, with Nils Reimers from Cohere. But I still have to lay out the episodes, so we'll see.