How AI Is Built

In this episode, we talk data-driven search optimizations with Charlie Hull.
Charlie is a search expert from Open Source Connections. He founded Flax, one of the leading open source search companies in the UK, wrote “Searching the Enterprise”, and is one of the main voices on data-driven search.
We discuss strategies to improve search systems quantitatively and much more.
Key Points:
  1. Relevance in search is subjective and context-dependent, making it challenging to measure consistently.
  2. Common mistakes in assessing search systems include overemphasizing processing speed and relying solely on user complaints.
  3. Three main methods to measure search system performance: 
    • Human evaluation
    • User interaction data analysis
    • AI-assisted judgment (with caution)
  4. Importance of balancing business objectives with user needs when optimizing search results.
  5. Technical components for assessing search systems: 
    • Query logs analysis
    • Source data quality examination
    • Test queries and cases setup
Keywords: search results, search systems, assessing, evaluation, improvement, data quality, user behavior, proactive, test dataset, search engine optimization, SEO, search quality, metadata, query classification, user intent, metrics, business objectives, user objectives, experimentation, continuous improvement, data modeling, embeddings, machine learning, information retrieval

00:00 Introduction
01:35 Challenges in Measuring Search Relevance
02:19 Common Mistakes in Search System Assessment
03:22 Methods to Measure Search System Performance
04:28 Human Evaluation in Search Systems
05:18 Leveraging User Interaction Data
06:04 Implementing AI for Search Evaluation
09:14 Technical Components for Assessing Search Systems
12:07 Improving Search Quality Through Data Analysis
17:16 Proactive Search System Monitoring
24:26 Balancing Business and User Objectives in Search
25:08 Search Metrics and KPIs: A Contract Between Teams
26:56 The Role of Recency and Popularity in Search Algorithms
28:56 Experimentation: The Key to Optimizing Search
30:57 Offline Search Labs and A/B Testing
34:05 Simple Levers to Improve Search
37:38 Data Modeling and Its Importance in Search
43:29 Combining Keyword and Vector Search
44:24 Bridging the Gap Between Machine Learning and Information Retrieval
47:13 Closing Remarks and Contact Information

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

===

​[00:00:00]

Nicolay Gerold: [00:01:00] What is a relevant search result?

Charlie Hull: So I think a relevant search result is very much a subjective thing. What's relevant to you may not be relevant to me. It very much depends on what task I'm carrying out, what my information need is. Am I just exploring? Am I browsing? Am I trying to buy something? Am I trying to find out some particular information, or am I just asking a vague question to educate myself?

Challenges in Measuring Search Relevance
---

Charlie Hull: And this makes it quite difficult for us to measure relevance, because if we ask a person, even two people who are both experts in the same area are going to disagree on the relevance of a particular result. This is a very common problem. And if we do it by some more automatic means, for example looking at the results people click on, then people click on results for all kinds of reasons.

They click on them because they're higher up the [00:02:00] list, something we call position bias, so it's a hard thing to measure. And relevance is a hard thing to define. We can maybe explain it to ourselves, we can maybe explain our reasoning, but getting a consistent view of what is a relevant result is really hard.

Nicolay Gerold: Yep. So maybe let's start there with a problem.

Common Mistakes in Search System Assessment
---

Nicolay Gerold: What do you think are the like most common things people get wrong when they assess search systems and try to improve them?

Charlie Hull: It depends what we assess. One of the common things people look at is just processing speed: how quickly does my search result come back? And they may assume a search result that comes back quickly is more accurate, or less accurate. It's a strange thing: we perceive the speed at which the computer responds as telling us something.

Maybe a complex system shouldn't come back immediately. That's one thing you can measure just the raw performance. And then you have various measures of what results have come back. We have traditional ones, [00:03:00] such as recall and precision. How many of the results you've got are actually relevant.

How many of the possible relevant results have you been shown? And then we have to figure out a way to measure them, to calculate some data, to give us some number that says: how good is search today? How well is my search engine performing? Is it giving my users useful results?

Methods to Measure Search System Performance
---

Charlie Hull: There are two main ways to do that, and possibly a third. The first one is just to ask some people. So I'm going to find some people who, say, know something about shoes. I'm building a shop that sells shoes, and I'm going to type in red shoes, and I'll ask these people: out of the results that have come back, how many of these are actually red shoes? Now, again, some of these people might disagree. They might think a red pair of sandals isn't a pair of red shoes. They may say a slightly purpley pair of shoes is actually red. So there are some problems getting humans to judge these things, but I'm going to get them to actually give me some numbers.

Is this a great result? Is this a [00:04:00] kind of middling result? Is this a terrible result? Is this not relevant at all? Collecting those numbers gives me something I can then measure, and I can do that for a particular query, say red shoes. I've now got a number to say how good the results coming back are. So that's one way of doing it. And it has the problem that it's not very scalable: I've got humans in the loop, and it's not the most exciting task to do. But a very common starting point for measuring search systems is doing some kind of human-powered evaluation.

Human Evaluation in Search Systems
---

Charlie Hull: The next way I can do it is by looking at what users do with my system.

What they click on, how they interact with it. Did they buy this particular item, for example, or did they download a PDF based on this search result? I can collect my clicks, and that's done with some kind of web analytics tool. The trouble is, this is very noisy data. People click on things for all kinds of reasons.

And it's often quite hard to figure out the connection between that first query and an actual action at the end that validates it. I search for something, I do a search, I search for something else, I change my query, and [00:05:00] eventually, sometime in the future, I do some action to validate that. I download something or I buy something.

How do I connect those actions together? How do I maintain causality? But in theory it's cheaper: we can just harvest the clicks off the web, do some data crunching, and we've got a load of data that tells us which results are relevant.

Leveraging User Interaction Data
---

Charlie Hull: And then the third way, which is the new way is to get an AI to help.

So we can actually get an LLM, for example, to judge the quality of search results. Now, this is fine as long as we trust our LLM, and in theory it's more scalable: we've not got real humans involved, we've got a computer. But the problem is we've also got to train our LLM on something, and for that you need some kind of human backup.

You can't just assume that an LLM that's trained, for example, on the generic case knows anything about shoes, or about resumes, or about cooking. Our experts do, but the LLM may not know it that well. So you've got to be careful how you do that. But it's showing increasing promise. So those are the three main ways we measure search systems, [00:06:00] and the pros and cons of each.
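A minimal sketch of what LLM-assisted judging could look like in practice (nothing in the episode prescribes this setup); `call_llm` is a placeholder for whatever model client you use, and the prompt and 0–3 scale are illustrative:

```python
# Hypothetical sketch of LLM-as-judge for search results.
# call_llm() is a placeholder for whatever LLM API you actually use.

JUDGE_PROMPT = """You are grading search results for an online shoe shop.
Query: {query}
Result title: {title}
Result description: {description}

Grade the result on this scale:
3 = exactly what the query asks for
2 = partially relevant
1 = marginally related
0 = not relevant
Answer with a single digit."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def judge(query: str, doc: dict) -> int:
    """Ask the LLM for a graded judgment; fall back to 0 if unparsable."""
    answer = call_llm(JUDGE_PROMPT.format(query=query,
                                          title=doc.get("title", ""),
                                          description=doc.get("description", "")))
    try:
        return max(0, min(3, int(answer.strip()[0])))
    except (ValueError, IndexError):
        return 0
```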

Nicolay Gerold: Yep.

Implementing AI for Search Evaluation
---

Nicolay Gerold: I'm really curious, and I will probably ask a few questions on each one. In the human evaluation, do you use a standard scale across different projects you have done, basically like one, two, three, four, five, or is it really dependent on the use case?

Charlie Hull: It's dependent on the use case, but, I think the important thing is if you pick a scale you've got to stick to it. And you've also got to have some guidance for people using it. So again, a common scale is brilliant results, middling results, poor results, and awful results, absolutely irrelevant results.

So a scale of four, or you can do one to ten, or you can do just a binary: is this relevant or is it not? But then, if you have human judges, you need to educate them on what you mean by these scales. What is poor? What is fair? What is middling? And again, that's a subjective judgment and people will disagree.

There are some ways you can sum these results, average across them, [00:07:00] get some measure of what they call inter-rater reliability, some idea of how consistent your raters are. But the scale is really up to you. It depends on what you're trying to do and what you're going to do with the data eventually.

If, for example, you're going to feed some kind of machine learning system, you may want the judgment data conditioned in a certain way for that system, and that may influence your choice of scale. But it's entirely up to you. I see one to ten, I see one to four, I see binary zero or one.
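For the inter-rater reliability Charlie mentions, Cohen's kappa is one common measure for two raters; a small sketch with made-up judgments on a 0–3 scale:

```python
# A small sketch of checking inter-rater reliability (Cohen's kappa) on graded
# judgments from two raters. The example labels are made up.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1.0:          # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Judgments on a 0-3 scale for ten query/result pairs:
rater_a = [3, 2, 0, 1, 3, 2, 2, 0, 1, 3]
rater_b = [3, 1, 0, 1, 3, 2, 3, 0, 1, 2]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```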

Nicolay Gerold: And do you also ever collect comments from the actual user on why the different results were or weren't well suited to the search?

Charlie Hull: You can, yes. This does put more of a load on the user, but getting some explanation as to why they rated something terribly badly, for example, may be useful; it may reveal other problems with your search system. For example, you may not have shown them enough of the result.

You may not have shown them a picture, for example, and they're [00:08:00] asking, is this a relevant result? I need a picture of the book before I can tell you. I need a picture of the shoes. But it does make the evaluation task a little more onerous, but it can give a useful context. People do love to complain about search, and sometimes all you'll have is the complaints and the negative feedback.

But when you're doing evaluation, you want both pros and cons. The trouble is people are less likely to write you a compliment than they are to complain. It's quite common for search teams to just have a litany of complaints they work through, and they become very reactive to those and just think: we've got to make these people happy.

But yeah, comments can be useful context, but they don't give you any kind of quantitative data. Of course.

Nicolay Gerold: Yeah. I think the human evaluation, and especially the comments, and AI-as-a-judge work quite well together, because you can use the comments to actually feed the LLM and tell it what to look out for.

Charlie Hull: Possibly. I think to train an LLM you need a lot of data. That might be worth investigating, but [00:09:00] generally I think we need those hard judgments, and the text is just more decoration and useful context. It might inspire you to look at something else, but it's not necessarily going to give you any deep insight.

Nicolay Gerold: Yeah.

Technical Components for Assessing Search Systems
---

Nicolay Gerold: And more on the technical side: when you come into a new search project and they already have a search system implemented, what are the first few technical components you actually implement, or have the team implement, so you can even start to assess the search system and how well it works?

Charlie Hull: My first question to teams is: what do you know about how people interact with your system? Do you have complaints? Do you have a set of queries you know just don't work? Do you have any quantitative data that you've measured? A lot of the teams we work with have never really measured search quality before.

In fact, quite often I'll ask, how do you know if it's a good search result? And they'll say we don't really. We have no way of assessing that. [00:10:00] So that's my first question. What have you done before in terms of measurement? And then we're looking to what we can add to the system to start doing that measurement.

We have an open source tool we built called Quepid, for example, which sits on top of the search system and lets you set up test queries in sets, what we call cases, run those queries on your search engine, and then collect judgments. And it can be used by people who are non-technical.

So we might do that, and we might do it with as few as 50 queries, say. So I'll say, show me your query logs, show me what people are actually searching on your system, and we'll figure out a way to sample those queries and come up with, say, the first 50 we're going to test. Put those into our tool, make sure our tool's pointed at your search system or an offline version of it, say, with realistic data, run the queries, and get some numbers.
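A rough sketch of the kind of query sampling described here, assuming a plain query log with one query per line (the log format and the half-head, half-tail split are illustrative choices, not something prescribed in the episode):

```python
# Rough sketch of sampling ~50 test queries from a raw query log so the test
# set covers both head and tail traffic. The log format (one query per line)
# is an assumption for illustration.
import random
from collections import Counter

def sample_test_queries(log_path, n=50, head_share=0.5):
    with open(log_path, encoding="utf-8") as f:
        queries = [line.strip().lower() for line in f if line.strip()]
    freq = Counter(queries)
    ranked = [q for q, _ in freq.most_common()]
    n_head = int(n * head_share)
    head = ranked[:n_head]                      # most popular queries
    tail_pool = ranked[n_head:]
    tail = random.sample(tail_pool, min(n - n_head, len(tail_pool)))
    return head + tail

# test_set = sample_test_queries("queries.log")  # feed these into Quepid or similar
```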

And that gives us a starting point. Another way is to look at the query logs and begin to try and subdivide the queries. Some of the queries people are using may be [00:11:00] informational, some may be very specific, some may be vague, some might be multi word queries. Let's start dividing those up and thinking about each as a particular use case.

For example, looking for a serial number shows a very specific need for a certain thing. You can't just show someone loads of other things; they care about that item. Whereas somebody typing in "dress to wear to a prom" is making a very vague, descriptive query. They're looking for advice. So let's subdivide those queries and then figure out where we can help in each area. We then need to think about prioritization. What are the most important queries for your business? If you make more money, for example, selling shoes, then perhaps we should fix the shoes queries first. But if you think it's more important to concentrate on some other area, we'll focus on that.

But now we really need to think about the business drivers behind this. Why have we been brought in to help make your search engine better? How does your business connect to the search engine? [00:12:00] What is the search system doing for you? And that's where we can prioritize and figure out where our areas of interest should be.

Improving Search Quality Through Data Analysis
---

Charlie Hull: The other very important thing is to look at the source data. No search system in the world can produce brilliant results from a garbage input. So what are you doing with that source data? Where does that come from? How is it controlled or regulated or processed or what happens to it before it gets into the search index?

And that data quality issue is a massive problem for search, and also for pretty much any AI system as well. And it's something that's very much ignored, I think. We assume these machines can somehow produce beauty from nonsense. They can't always, and data quality is a huge problem. So that's something we look at very early on as well.

The search engine itself, whether you're using an old version or a new version, what technology you're using, et cetera, is a factor, but not as big a factor as you might think. Often the way you're doing search is way more important than the search [00:13:00] technology you're using. Again, if you're feeding in garbage, no search engine in the world is going to be able to sort it out.

If you're not measuring quality, how can you adjust the parameters to tune it? And just thinking you've got an old version and that upgrading is going to fix everything is generally not the case.

Nicolay Gerold: Yeah. How do you start detecting the garbage? If you assume you have like terabytes of data, how do you start to go about that?

Charlie Hull: We can look, again, at the queries and what people are getting back. An example: you might be looking for, say, a specification document. I want to know the specification for this item that was manufactured in 1992. And then you do that search and you get back 20 results, all called specification version 2.1 final, no actually absolutely final copy, final approved version.

And they're basically the same document. That points very clearly to a problem with versioning, a problem with content. Or you might just get a totally irrelevant result when you're pretty sure there is a [00:14:00] relevant item in there, in your content. You have this thing, but you just can't find it with search.

Why not? And that might point to a problem with how you're conditioning the data before it gets into the search index. Are you picking up the right fields? You might see you can't retrieve a certain object type. For some reason, we're getting no PDFs in the search result. Why is that? Maybe we're not processing PDFs correctly.

So those test searches give you some idea of, what's coming in and why, where there might be a problem with content. When we dive in a bit finer, we figure out what's happening with the content before it gets to search. And there'll often be people who deal with, editorial control of the content or deciding which items of content get into search.

And they may be disconnected from the search team itself. They may produce beautiful web pages or beautiful content or documents, but they don't really understand what's needed for search. So they might, for example, put in the wrong metadata; the search engine picks out the metadata,

it's in the wrong field, it's got repeated [00:15:00] words in the wrong field, and this skews search results. So we need to make sure that the people controlling the content understand what we, the search people, need them to give us, to make sure that content is surfaced.

Nicolay Gerold: And do you think that search systems improve one error at a time? So you're basically doing an error analysis: you pick the most important error at the time, try to fix it, and hope that it doesn't break any of the other queries?

Charlie Hull: Search is never finished. That's one of the sad truths about this. You start with a certain level of quality and you try and improve it, and you try and continue to improve it, but you're never going to finish that task because the world changes. The sort of queries people use change, content changes, new content arrives all the time.

Let's give an example. A few years ago, we had a client who suddenly found lots of people were searching for personal protective equipment, masks, and we all know why that was. [00:16:00] The world had changed overnight. Suddenly the most important queries had changed overnight, and suddenly the content that they needed to surface was particularly valuable to them, something they needed to be able to sell very quickly.

So we have a constant process of improvement of search quality, and that's the train we need to get on. We're not going to be able to say it's going to be this much better by next week. But we need to be continuously improving and adjusting. And creating a process where we can constantly measure and improve and iterate is really important.

A lot of search teams are reactive. They are being complained at. They're being sent problems. They're being told on a Monday morning that their boss searched for something on a Sunday that no one else has ever searched for, and because the boss searched for it, we're going to fix that today. They're in a constant panic mode. What we want to do is turn it into a more proactive situation where we have a constant measure of search quality, we have strategies to try and improve it, and we're consistently [00:17:00] trying new experiments, new ways of adjusting our search engine to improve our search results, our metrics, and drive the business forward.

We want to have a more sensible process for dealing with search improvement.

Proactive Search System Monitoring
---

Nicolay Gerold: Before we maybe move into the constant measurement: how do you get proactive, especially in tracking new groups or categories of queries that are suddenly coming in?

Charlie Hull: We need data. We need to be monitoring what people are doing. We need the support of analytics teams. We need to look at what's coming into the search engine and what people are doing, how user behavior is changing. Knowing what your user is doing with your website is obviously vital.

And we put a lot of investment into that. People do this a lot in SEO, for example. SEO teams are often very focused on what queries, what things people are looking for, lead them to our website, and they'll put a huge amount of effort into developing ads, for example pay-per-click ads, that respond to those particular [00:18:00] search queries. But on the internal side, when we're looking at what a search engine does, we also need to be monitoring those queries on a constant basis.

We need dashboards. We need visualizations to show: hang on, we've suddenly got loads of queries from this part of the world asking about this thing. What's that telling us? There's a lot of intelligence in that information. We had a client selling sportswear, and they suddenly noticed that the name of a particular basketball star in the US was coming up all the time in queries.

Now, what had happened is there was now a brand, some branded clothing, associated with that name. Our client wasn't selling it. But by spotting these queries, they could ask themselves: should we start stocking it? So what people are searching for can inform you about the way the world is changing, and you need to be constantly looking at that. Another very common thing is looking for search queries that give zero results. Zero results is a terrible place for a search engine to end up in, because as a user, if [00:19:00] you type in a word and back come zero results, what do you do? You assume something's broken, the thing doesn't exist.

You go somewhere else. You're going to use a competitor. A lot of sites will just leave you there on that poor zero-results page: I'm sorry, we don't have any results for your query. They won't help you out, take you anywhere, try spell-correcting your query, or suggest alternatives. Knowing which queries are giving zero results gives you a really good place to start improving search.

And we want to try and reduce that percentage of zero results as much as we possibly can. Because we want to make the user feel that they're being catered for. Yeah, there are various ways, but I think that intelligence about what users are doing is a vital part of it, and we should be able to see that all the time.

That should be the dashboard we look at every morning as a search team.
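A minimal sketch of the zero-results tracking described above; the input format, a stream of (query, result count) pairs from your analytics, is an assumption for illustration:

```python
# Sketch of a simple zero-results report from analytics data. The input format
# (an iterable of (query, result_count) pairs) is assumed for illustration.
from collections import Counter
from typing import Iterable, Tuple

def zero_results_report(events: Iterable[Tuple[str, int]], top_n: int = 20):
    total, zero_counter = 0, Counter()
    for query, result_count in events:
        total += 1
        if result_count == 0:
            zero_counter[query.strip().lower()] += 1
    zero_total = sum(zero_counter.values())
    rate = zero_total / total if total else 0.0
    print(f"zero-results rate: {rate:.1%} of {total} searches")
    for query, count in zero_counter.most_common(top_n):
        print(f"{count:6d}  {query}")   # the queries to fix first

# zero_results_report([("red shoes", 124), ("rouge sneakers", 0), ("sku-9912", 0)])
```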

Nicolay Gerold: Yeah. And since you're doing interventions in the search system, let's maybe move into how we can measure the quality of our interventions. I assume you need to build up a dataset of queries and [00:20:00] relevant and irrelevant documents to measure it. How do you go about actually building that, and especially, how large should that test dataset be?

Charlie Hull: As I said earlier, you can start with as little as 50 queries, and you might measure, say, the first 10 or 20 results and judge whether they're relevant, on a particular scale, and that's a start. And your 50 queries should be sampled across your query set, not just the most popular queries but also some further down. Basically, search queries follow a power curve.

You get loads of the most popular ones, and then it goes down to a very long tail, and you need to sample through that. There are various ways of doing that. Just get your first 50. Now, how many you actually test regularly is down to how many you can cope with. Should it be 50 or 500? That's down to the size of your business and how important you judge this problem to be.

But once you have those, you can then figure out what are the relevant results with some kind of human judgment, or maybe with a bit of AI help and that gives you your first test set. You're not going to have that test set forever, but [00:21:00] it's a starting point. You'll also maybe want to adjust it, add new things as new things turn up.

So after a certain time period, revise it. But it also gives you some regression testing, because you want to make sure that if you make a fix, you don't break something. So you keep testing against this test set and make sure that your sudden new idea for changing the search configuration hasn't broken lots of old queries.

That can be part of your regression test cycle. It could be something you run automatically every time you do a release, for example. So it's a vital tool for checking that you haven't just broken search accidentally.
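A sketch of what such a regression check could look like in a CI pipeline; `run_search`, the judgments file format, and the tolerance are assumptions for illustration, not a prescribed setup:

```python
# Sketch of a relevance regression check that could run in CI on every release.
# run_search is a hypothetical hook into your (offline) search engine; the
# judgments file maps query -> list of known-relevant doc ids.
import json
import sys

def run_search(query, k=10):
    raise NotImplementedError("call your search engine here and return doc ids")

def precision_at_k(results, relevant, k=10):
    top = results[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k if top else 0.0

def regression_check(judgments_path, baseline):
    with open(judgments_path, encoding="utf-8") as f:
        judgments = {q: set(ids) for q, ids in json.load(f).items()}
    scores = [precision_at_k(run_search(q), rel) for q, rel in judgments.items()]
    mean_p10 = sum(scores) / len(scores)
    print(f"mean P@10 = {mean_p10:.3f} (baseline {baseline:.3f})")
    if mean_p10 < baseline - 0.02:          # tolerance is a judgment call
        sys.exit("relevance regression: new configuration broke old queries")

# regression_check("judgments.json", baseline=0.61)
```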

Nicolay Gerold: Yeah. And do you add metadata to the different queries as well? So basically add something like query type. So you actually can measure the performance across the different query types, which pop up most often.

Charlie Hull: Yes, you can do that. And again, you can subdivide the queries. We have a concept of the information need: what information need is this query answering? Again, I want to know where the [00:22:00] office is in London. I want to buy some red shoes. I want to know what jobs there are in Poland for Java developers.

Everyone's got an information need, and if you can connect a query to it, that gives you a lot of insight into why the user's doing that search, how we can solve that problem, and what the relevant results should look like. Query classification, breaking things down, trying to understand what the user's trying to do, which can be a bit hard when they've just given you two words,

gives you a much better way to think about how we're going to solve this problem. In some cases, it may not be with a set of search results. Where's the office in London? I just want you to show me where the contact page is, with the address of the office in London. I want a single result.

I maybe just want to redirect the browser to that place. It's not always a case of just a set of links being the best result. And increasingly now, we're moving into AI systems, actually, maybe I just want an answer. I don't want a list of links at all.
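A toy sketch of classifying queries by information need so that metrics can be broken down per query type; the rules and category names are purely illustrative (real systems often train a classifier instead):

```python
# A toy sketch of query classification by information need. These rules are
# only illustrative; category names and patterns are made up.
import re

def classify_query(query):
    q = query.strip().lower()
    if re.fullmatch(r"[a-z]{0,4}[-_/]?\d{3,}[a-z0-9-]*", q):
        return "item_lookup"        # serial / part / model numbers: exact match matters
    if q.startswith(("where", "how", "what", "when", "who")) or q.endswith("?"):
        return "question"           # may deserve a direct answer or a redirect
    if len(q.split()) >= 4:
        return "descriptive"        # vague, exploratory ("dress to wear to a prom")
    return "keyword"                # short head queries ("red shoes")

for q in ["SKU-449812", "where is the london office", "dress to wear to a prom", "red shoes"]:
    print(q, "->", classify_query(q))
```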

Nicolay Gerold: Yeah. And the scoring: [00:23:00] what types of metrics do you set up? The common ones like precision and recall we don't really have to cover, but more of the uncommon ones, and also the business-relevant ones. How do you go about setting up custom relevance functions?

Charlie Hull: So there are lots of ways of measuring, of getting a single number for how good search is, a search metric. Some of the common ones we see are, for example, NDCG, normalized discounted cumulative gain, and MAP, mean average precision. These are all what we call search nerd metrics.

And they help account for things like position bias. People generally prefer the results at the top of the list, even if they're not relevant. People will click on the first few results and think those are better somehow, because they're top of the pile.

It's a natural thing, so there are various things here to help us cope with that. Now, these numbers are great and search nerds love them. [00:24:00] We can think about the theory, and they give you a number to say how good search is, but this is only from the search nerd's perspective. What the business is doing is slightly different.

The business is selling things, say, or delivering information in a certain way, or trying to retain customers, whatever the reason for search is. So what we also need is some kind of KPI, some kind of business number that we're trying to optimize towards.
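For reference, a minimal NDCG@k implementation of the kind of graded, position-discounted "search nerd metric" described above (the example grades are made up):

```python
# Minimal NDCG@k from graded judgments (0-3). Gains are discounted by rank so
# that position bias is reflected in the score.
import math

def dcg(grades):
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg_at_k(result_grades, k=10):
    """result_grades[i] is the human judgment for the result at rank i."""
    ideal = sorted(result_grades, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(result_grades[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Judgments for the top results of one query; a perfect ordering would be 3,3,2,1,0.
print(ndcg_at_k([3, 2, 0, 3, 1], k=5))
```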

Balancing Business and User Objectives in Search
---

Charlie Hull: And these two things may not have many obvious connections, but if we managed to come up with a KPI at the business level that the search team can work towards, that gives us an ambition.

So for example, we maybe want to show more results that would make us more money as a business in some way. Now, we can't optimize completely for that, because otherwise our users are going to be very unhappy: all you're showing me is things that are really expensive, and I just want to buy some cheap shoes.

Why do I always see the most expensive shoes at the top? So [00:25:00] there's a balance to be struck here between the business objectives and the user's objectives. We can't go all the way one way or all the way the other.

Search Metrics and KPIs: A Contract Between Teams
---

Charlie Hull: So some search teams measure their search nerd metrics and that keeps them pretty sure that their search engine is working in the right way technically.

But they've also got to take account of the business metrics, the KPIs, the numbers we're actually there to improve. And this can almost act as a kind of contract. My business stakeholders want the search team to do something, so let's work out what that actually means.

And we agree on some kind of metric. We want to get more people clicking on these particular things, or we want to raise the average order value that we know is attributable to search. That's quite a good one. We know that people buy a certain amount of things, have a certain order value,

after doing a search, so let's try and raise that. Let's try and make search power more of the business. That might be the one you pick, but this [00:26:00] is a discussion you have to have with your business stakeholders, and it then gives them an idea of how well you're performing as a search team. And it gives the search team something to aspire to.

So that kind of informal contract is a great way to connect the two sides. The sad fact is, very often search teams don't really know what the business is for, what the business is trying to do. They don't have a detailed understanding of the business strategy, and the business stakeholders don't really know what the search team is doing.

They don't really understand how the search engine works. And a lot of my job is trying to connect those two sides to get people talking and communicating so they can both drive forward together. A lot of it is just education, it's process, it's numbers. It's a way for everyone to come together and say, this is what our joint challenge is.

Nicolay Gerold: I think especially the relation to other external factors is what makes search so interesting.

The Role of Recency and Popularity in Search Algorithms
---

Nicolay Gerold: How do you bring them into the search [00:27:00] engine, especially the things we talked about, like the dollar value of an item, or the recency or overall popularity of an item? How do they come together with the search matches, like the keyword TF-IDF score or the semantic similarity score? How do you combine all of those?

Charlie Hull: So that's a really interesting question. If we were just in the lab building a search engine, TF-IDF, or the statistics that underpin all the modern search engines, really, that's all we'd care about: does this word match? Is this word a popular word, and therefore how should it affect the statistics?

That's fine. Now, the sad truth is, once you've got that basic algorithm, you then have to mess with it, and everyone does. Everyone modifies their basic search algorithm in some way for their business objectives. Recency is a really good example. If you're doing a news search, for example, today's news is probably more important than yesterday's news, but how much more important?

And is that the [00:28:00] only factor? I've had conversations with people in the news business who said actually recency is the most important thing; relevance, keyword matching, not so important. But if you go all the way with recency, you end up with: whatever you search for, you get today's news stories.

Which isn't great, obviously. If you go all the way to relevance, then you don't see today's news stories, because they're somewhere in the hundreds of results. So you need a balance, and you don't really know what that balance is. A similar thing happens with popular items, things that people have clicked on or bought before.

Now, if we only show popular items, then you get a self reinforcing behavior and a new item will never show up because it's got no history. So again, there's a balance here. Now, where is that balance? We don't actually know. The only way to find it is by experimentation.
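A sketch of what blending text relevance with recency and popularity might look like; the weights and half-life here are exactly the kind of knobs that, as described, can only be set by experiment, and all the values are illustrative:

```python
# Sketch of blending a text relevance score with recency and popularity.
# The weights are the "sliders" that have to be found by experimentation.
import math
import time

def blended_score(text_score, published_ts, click_count,
                  w_text=1.0, w_recency=0.3, w_pop=0.2, half_life_days=7.0):
    age_days = (time.time() - published_ts) / 86400
    recency = 0.5 ** (age_days / half_life_days)        # halves every half_life_days
    popularity = math.log1p(click_count)                 # dampen runaway favorites
    return w_text * text_score + w_recency * recency + w_pop * popularity

# A fresh, rarely clicked article vs. an older, very popular one:
print(blended_score(4.2, time.time() - 1 * 86400, click_count=3))
print(blended_score(4.2, time.time() - 30 * 86400, click_count=900))
```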

Experimentation: The Key to Optimizing Search
---

Charlie Hull: And my mental model of a modern search engine is a bit [00:29:00] like a mixing desk in a recording studio.

Your task is to make things sound nice, and you've got 200 little sliders, volume knobs, and twisters, and those sorts of things. How do you set them to make it sound nice? You use your experience, your intuition: maybe a bit more bass here because it's this kind of music. Fine, but you can't come up with a definitive answer for where those things should be to make it sound nice.

So you try things out, you experiment. An experimental methodology is absolutely vital here. We have some kind of KPI we're trying to get towards, but we don't quite know how to get there yet. So what we have to do is repeated experiments.

Measure, rinse, repeat. We do a search, we get some results, we come up with some numbers telling us how good those results are. Eh, it's okay, but it's only six and a half out of ten. I know, let's see what happens if I boost the title of the search results. Matches on the title we might intuitively think are more important, the title of a news article, for example. [00:30:00]

Let's try that, let's make the matches on that title a little more important. Try that, rerun our experiment. Oh, it's gone up to seven and a half, brilliant. That sounds like a thing. Let's do more matching on the title. Let's try a boost of 25 on it. Let's boost the title even more. Oh, hang on.

Actual relevance has gone down, because not all the matches are in the titles; sometimes the matches are in the body text. But we have to keep these experiments going and keep repeating them, and that will give us a better idea of how to get to that eventual target and keep our search quality up.

Now, you can't do thousands of experiments a minute. There are some automatic techniques, things like gradient descent, for example, where you can iterate towards a place where these parameters are about right, and do it in a more and more automatic way. There are lots of techniques, but the point is the quicker and the more often you can do these experiments, the better.

And if you can do them offline, even better, because you don't want to be doing this with real people.
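A sketch of the kind of offline boost-sweep experiment described here; `run_search` and `grade` are hypothetical hooks into your own search lab and judgment store, the boost values are arbitrary, and the metric is a crude stand-in for whatever you standardize on:

```python
# Sketch of an offline boost-sweep experiment: try several title boosts, score
# each configuration against the judged test set, keep the best. run_search and
# grade are hypothetical hooks into your own search engine and judgment store.
def run_search(query, title_boost):
    raise NotImplementedError("re-run the query against the engine with this boost")

def grade(query, doc_id):
    raise NotImplementedError("look up the human judgment (0-3); 0 if unjudged")

def mean_grade_at_10(grades):
    top = grades[:10]
    return sum(top) / (3 * len(top)) if top else 0.0   # crude stand-in for NDCG

def sweep(test_queries, boosts=(1.0, 2.0, 4.0, 8.0, 16.0)):
    best_boost, best_score = None, -1.0
    for boost in boosts:
        scores = []
        for q in test_queries:
            doc_ids = run_search(q, boost)[:10]
            scores.append(mean_grade_at_10([grade(q, d) for d in doc_ids]))
        mean = sum(scores) / len(scores)
        print(f"title_boost={boost:5.1f}  score={mean:.3f}")
        if mean > best_score:
            best_boost, best_score = boost, mean
    return best_boost
```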

Offline Search Labs and A/B Testing
---

Charlie Hull: Otherwise you're going to be doing this in an A/B test, which might take three weeks [00:31:00] to give you some results. You want to do it in your little search lab, trying things out, and then when you find something that's really cool and seems to show some promise, maybe put that onto your online system and try it out in an A/B test.

But that rapidity of experimentation is absolutely vital, because you don't actually know how to balance these different factors to give you your search results in the right way. There are some other techniques you can look at. There are machine-learning-based techniques, such as learning to rank, where you do a search, get a thousand results, and then reshuffle them using a machine learning model that's trained on some particular objective using a set of signals, and a signal might be how often someone's clicked on a particular result.

And the machine learning algorithm can then learn how to reshuffle the results in the best way. But you need really good data for this. You need a good model and a good understanding of your data. It's not something that search teams get to until a little later, but it [00:32:00] can be very effective once you've tried everything else.

And in fact, in most situations I see, there are a lot of really simple things you can do to make search better. Most people we talk to are starting from that ground where really they haven't tried anything too clever, or worse, they've tried a few numbers, a few boosts here and there from intuition.

They've tried to move some sliders up and down. It's okay, but they've got no real understanding of why. They've just tried stuff and it's made things a bit better and they've left it there. We want to actually do this by experiment. So we've got some actual basis to say, we tried this, we tried six, we tried seven, we tried eight.

Eight was a sweet spot, but that's where we got to. And we've got a story for why we got to that point.

Nicolay Gerold: So to mirror it back a little bit, that's the importance of having a test set: you then have the luxury of only pushing major changes to production, which you can run an A/B test on, [00:33:00] since A/B tests are very costly in terms of iteration speed.

Charlie Hull: Yes. This is something we talk to a lot of people about: the idea of having an offline search lab, somewhere you can try lots of experiments based on data you've collected through ratings or clicks. That's where you do lots of experiments all the time, and then the best experiments, the ones that really show promise, bubble up to the online situation where we can test in front of real people.

The trouble with online, of course, is that it takes a long time to collect the data. Also, a cohort of your users is going to get a worse experience, because by definition, in an A/B test, either A or B is worse. And it can harm your business. So you don't want to try some crazy idea out in an A/B test.

You want to be pretty sure it's a good idea, pretty sure it's going to improve things. Yeah, we like to try the crazy ideas offline, with some offline data, a test set, and some, some idea of relevance. And then eventually, once we're confident, we can bubble that up to online.[00:34:00]

Nicolay Gerold: And you just mentioned there are always a few simple levers people can pull.

Simple Levers to Improve Search
---

Nicolay Gerold: What are the first few simple levers you would recommend search engineers try?

Charlie Hull: A few good examples. Phrase search is quite a good one: is there a way we can reward matches where more of the query words appear closer together? That can sometimes help. Simple boosting, boosting titles, is often quite a good idea. Most modern search engines have a slew of different features you can try.
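As a rough illustration of those two levers, assuming an Elasticsearch-style engine (the episode doesn't prescribe one); field names and boost values are made up:

```python
# Rough illustration of two simple levers on an Elasticsearch-style engine:
# a field boost on the title and a phrase clause that rewards query terms
# appearing close together. Field names and boost values are illustrative.
def build_query(user_query):
    return {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": user_query,
                        "fields": ["title^3", "description"],   # title matches weighted 3x
                    }
                }],
                "should": [{
                    "match_phrase": {
                        "description": {"query": user_query, "slop": 2, "boost": 2}
                    }
                }],
            }
        }
    }

# body = build_query("red leather shoes")  # pass to your search client
```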

Have a look at the fields and the metadata as well. How is this data distributed across fields? If you're searching for a particular quality, if I'm searching for red, there should be a field somewhere in my data for the color of the item; I shouldn't need to look for the word red somewhere in a description.

We should have a field that says red. Should we have a color field for something if we're selling lots of colored items? How's that working? Is that coming through from my data correctly? What about [00:35:00] crimson? What about rouge? Varieties people might use. Is that consistent in my data? Very often we see data sets that can be incredibly complex with hundreds of different fields.

Sometimes a lot of these fields don't really have much information in them at all, or they're only used for one particular kind of item. So doing some analysis of the source data is important, to figure out which fields we should hit. A common pattern is for people to take the source data and dump it all into the search engine.

We're going to need it one day, so let's just dump it all in there. But actually you don't need to match on everything. Generally you just need some of that metadata, and you need to do some analysis to figure out which of it is important for searching, which of it is the sort of thing people are going to query for. When it comes down to it, search is a problem of language.

You use a particular language when you query, and my data has another kind of language, and the two may not match up: what I call something in my source data isn't necessarily what [00:36:00] you call it in your query. So we're trying to connect those two things together. There are other strategies here: synonyms, saying that two different words mean the same thing.

There are spelling mistakes: people mistype things, and most search engines have built-in ways to correct some of these simple spelling mistakes. You can also fix those with synonyms. Then we've got strategies such as query rewriting. This is a very powerful thing. Say you type in cheap iPhone,

and the results that come back are iPhone accessories. That's not what you wanted; you want the actual iPhone. Why are you showing me accessories? So we might need to push those accessories down the list. And then you said cheap iPhone. Now, cheap doesn't really help in this case: all the iPhones we sell are the price they are.

Cheap is now matching somewhere else and polluting our results, so we're going to get rid of that word. I'm basically interfering with the query you've put in, but I'm doing it for a good reason: to improve my [00:37:00] business or to serve you better. All these strategies need to be assessed based on what actual problems we've seen.

How do we prioritize them against the business objectives? And what data do we have to measure how good or bad this is? Yes, there's intuition as search experts, but we're also trying to be data-driven. We're not just going to make stuff up; we're not going to make mysterious tweaks that you don't necessarily understand.

We want you to understand why we're doing it, so you can support it and own it as you go forward.
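A toy sketch of rule-based query rewriting along the lines of the cheap-iPhone example; the term lists are illustrative, and in practice rules like this would be driven by measured query data rather than hard-coded:

```python
# Toy sketch of rule-based query rewriting: strip modifiers that only pollute
# matching, and optionally attach a sort hint for the engine. The word lists
# are illustrative, not a real ruleset.
from typing import Optional, Tuple

DROP_TERMS = {"cheap", "cheapest", "best", "top"}
SORT_HINTS = {"cheap": "price_asc", "cheapest": "price_asc"}

def rewrite(query: str) -> Tuple[str, Optional[str]]:
    terms = query.lower().split()
    sort_hint = next((SORT_HINTS[t] for t in terms if t in SORT_HINTS), None)
    kept = [t for t in terms if t not in DROP_TERMS]
    return " ".join(kept), sort_hint

print(rewrite("cheap iPhone"))   # -> ('iphone', 'price_asc')
```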

Nicolay Gerold: What do you think is the role of data modeling in search and search databases?

Data Modeling and Its Importance in Search
---

Charlie Hull: Data modeling, as I've said, is looking at the way the source data is structured and how that's going to work for search; that's effectively what we're doing. We're building a model of your data in the search index. We don't need all of it, we need some of it.

We might need to collapse some of it, glue some of it together, modify it, translate it, process it. Every search system is different, because all the source data is [00:38:00] different. The process of indexing, if you start from scratch, does involve a deep understanding of what the data is that we're trying to search.

Even what shape and type it is. Is it 3D models, or is it geographical information, or is it books? All of these things need different strategies. So we need to think about that from a search perspective. One common thing is, what is the thing that we're trying to return as a search result? Is it a document?

Is it an email? Is it a product? Is it a geographical location? Is it a picture? Where is that thing represented in our data? Maybe it comes from five different places that have to be glued together in some kind of object before we can index it. But, what is the task the user is trying to carry out?

If I'm searching for a bit of clip art, for example, I might type in some words, but I want a picture. So how do I connect those two things together? Is there descriptive text? Have these pictures been [00:39:00] categorized somehow? Have they been classified? Have they been tagged? Do I have to do some kind of clever translation of images to text, or use an AI model, or whatever, to create the data I need?

But what if people are searching for a picture by dragging another picture onto a box, or taking a photograph with their phone? That's a different methodology. Can I support that in my data? How do I match those two things together? Again, how you represent something and how I represent something can be very different.

They may be different in terms of object types: one may be a picture, one may be some text. How do I glue those things together? Data modeling is an essential step. We do this a lot when we look at greenfield search, but also when we try to repair an existing search installation.

Have we understood how to represent the source data in the index in a way that supports the query?

Nicolay Gerold: Yeah, and when you have set up all the stuff you need for evaluation, how do you actually prioritize the [00:40:00] different queries, document representations, all the levers you could pull? Especially, how do you balance the quick wins you can implement quickly against the longer-term, more challenging aspects of improving it?

Charlie Hull: I think it depends on the context. Sadly, I often see people who are in bad situations because their search is broken. Everyone hates search. We know our search is terrible. We'd never be able to get search working. I hear this all the time. So I think to build support within the business that we are on the right path, quick wins are really important.

And luckily, with some intuition and experience, you can often find a few things you can knock off really quickly that will make search better overnight. But you don't want to stop there; this should only be a supporting strategy. You want to say: look, we're now at the beginning of this process. We can do a few things to make things better now, but we are going to need a proper process.

So the quick wins help you build support, at the technical, product, and stakeholder level, for building a [00:41:00] search improvement process. And that is not just a question of technology. It's a question of people, of procedures, of making sure people are meeting regularly, even of how you're recording search issues.

Don't just react to every email complaint you're sent. Maybe use an issue tracker, maybe use GitHub issues, to say: here's a search problem we found. Then there's prioritization: how are you going to decide, with the support of the business, which are the most important things to do? Because you can't fix everything. One of the most egregious examples I saw was a company in the electronics space who were testing something like 20,000 different queries every few weeks. Fantastic. They were testing queries, rating all the results. And then they collected all these results in a giant spreadsheet and emailed it to their solitary search developer,

saying: fix search. And that's a massive anti-pattern. Which of this do you want me to fix? There's only one person and you've got all of these problems. Search is never finished, but you also have to [00:42:00] start at the beginning. So I think quick wins are important, but you also need to put that process in place for continuous improvement and realize this is a long road.

You want to start fixing some things, build stakeholder support, and just get better at doing search. Search unfortunately suffers from the fact that people have historically thought it's a magic black box. It's not as easy as all that. It underpins so much of what we do that we need to have a proper way of dealing with it.

Otherwise we're not going to achieve those business objectives, and everyone's going to still hate the search engine three years down the line.

Nicolay Gerold: Yeah, and I think embeddings aren't doing the search field a service in getting it treated less as a black box.

Charlie Hull: I think embeddings are another tool in our toolbox. Keyword matching works really well for a lot of situations, but embeddings help if you're doing some kind of multimodal search, searching images by text, or if you're trying to go slightly beyond those exact matches. Again, if I don't use the same language as you, I've got to rely on synonyms to connect those [00:43:00] two things together.

But a language model can say this word is close to this word in a vector space. So when I search for this word, maybe I should also search for these words. And that's effectively what's happening with an embedding. It's placing something within a space where we can divine some closeness and therefore some kind of similarity between a query and a result.

Now they're expensive to calculate and they've got to be stored in the right way and they add to the footprint of your engine, et cetera. But they can give us huge benefits.

Combining Keyword and Vector Search
---

Charlie Hull: The current question though, is how do you balance these two things together? Because keyword search is useful for some things.

Vector search and embeddings are useful for some other things. How do you combine the two? Most of the successful strategies I see are some kind of hybrid. When I type a vague word and get no exact match, vector search and embeddings can probably help. When I'm typing a part number, embeddings are no use at all.

Vector search is no use at all; that's an exact-match problem. So we need to balance these things together. How do we balance them? How much contribution do we [00:44:00] get from each side? There's another chance for an experiment there, another thing we can try out. Embeddings are just another tool in our toolbox.
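One common way to combine the two lists is reciprocal rank fusion (RRF), which merges by rank so the incompatible keyword and vector scores never have to be compared directly; a minimal sketch with made-up doc ids:

```python
# Sketch of reciprocal rank fusion (RRF): merge a keyword result list and a
# vector result list by rank. k=60 is the value commonly used in the RRF
# literature; the doc ids are illustrative.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_12", "doc_7", "doc_3"]     # BM25 / keyword ranking
vector_hits = ["doc_7", "doc_99", "doc_12"]     # embedding similarity ranking
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```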

Nicolay Gerold: Why do you think the recommendation literature is so focused on embeddings, and the information retrieval literature so focused on phrase and term search?

Bridging the Gap Between Machine Learning and Information Retrieval
---

Charlie Hull: Both come from different heritages. You can build a recommendation system with a search engine, you always have been able to, but more often they're built with some kind of machine learning pipeline. Now, the machine learning people don't necessarily know the history of information retrieval.

The information retrieval people don't necessarily know the history of machine learning. So you've got two colliding worlds here. Because of the current excitement around modern language models and AI, ML is in the ascendant. So everyone's saying we can use these things to improve search. [00:45:00] People even said, Oh, search engines will go away.

We won't need search engines anymore. We'll just need a big model. It'll do everything for us. That's obviously nonsense and slightly over optimistic. The interesting thing I'm finding is as we're building these AI powered search engines, people are rediscovering all the history of information retrieval.

Now, there's decades of useful information there. You can't just do it with one and you can't just do it with the other. If you want to build a modern search engine, again, it's going to be a hybrid of these two things. So you've got these colliding worlds. You've got some machine learning people who don't necessarily know about the history of IR and what amazing things you can do with text matching in a really efficient, powerful way.

This is how we can build billion-item-scale search engines that come back in fractions of a second with relatively simple hardware. On the other side, I think the information retrieval people are a little suspicious of this magic world of AI, possibly because it's been so massively oversold and it's been telling them they're not going to be needed anymore.

And they're failing to [00:46:00] see there's some real value here for some of the tasks that we've always struggled with in the search and information retrieval world. Lexical search is fine, but semantic search is very hard: again, connecting those two different languages.

If the keyword didn't match and there's a slightly similar keyword, what do you do, create a manual synonym? Actually, a language model could really help. So I think we're seeing the two worlds coming together, and that's really exciting. What it has done is bring new energy to the world of search, but that new energy means highly venture-funded startups, lots of excitement.

Lots of excitable DevRel people running about the world telling us that vector search is going to change the world. It's all going to be marvellous. But we also need to remember the history, remember that a lot of these challenges have been solved in other ways. And we need the best of both worlds. We need to put these things together to really solve tomorrow's search challenges.

So I think it's just a bit of a clash of cultures. The machine learning people write Python, they write pipelines, they like Jupyter notebooks. For the old search guys, it's all [00:47:00] indexes and text processing. It's interesting. But we're still facing the same challenges to build information retrieval systems, or information systems, or recommendation systems, and again, we just need to bring our two worlds together.

Closing Remarks and Contact Information
---

Nicolay Gerold: Yeah, I think that's a perfect note to close it. So if people want to get in touch with you, follow you, get in touch with your company or hire you, where can they do that?

Charlie Hull: They can always talk to me. I'm Charlie Hull at Open Source Connections. They can come to some of our events, like the Haystack conference; we're planning Haystack Europe in Berlin for September 30th and October 1st this year. They can jump on Relevance Slack, we'll provide a link to that, which is a 5,000-plus-person community of people all talking about search and AI and all the different aspects of that.

But yes, we're always very happy to help people build better search teams, build better search engines, and really solve those business problems. So yes, very happy to have a discussion.
