How AI Is Built

Welcome back to How AI Is Built. 

We have got a very special episode to kick off season two. 

Daniel Tunkelang is a search consultant currently working with Algolia. He is a leader in the field of information retrieval, recommender systems, and AI-powered search. He has worked with Canva, Algolia, Cisco, Gartner, and Handshake, to name a few. 

His core focus is query understanding.  

**Query understanding is about focusing less on the results and more on the query.** The query of the user is the first-class citizen. It is about figuring out what the user wants and then finding, scoring, and ranking results based on it. So most of the work happens before you hit the database. 

**Key Takeaways:**

- The "bag of documents" model for queries and "bag of queries" model for documents are useful approaches for representing queries and documents in search systems.
- Query specificity is an important factor in query understanding. It can be measured using cosine similarity between query vectors and document vectors.
- Query classification into broad categories (e.g., product taxonomy) is a high-leverage technique for improving search relevance and can act as a guardrail for query expansion and relaxation.
- Large Language Models (LLMs) can be useful for search, but simpler techniques like query similarity using embeddings can often solve many problems without the complexity and cost of full LLM implementations.
- Offline processing to enhance document representations (e.g., filling in missing metadata, inferring categories) can significantly improve search quality.

**Daniel Tunkelang**

- [LinkedIn](https://www.linkedin.com/in/dtunkelang/)
- [Medium](https://queryunderstanding.com/)

**Nicolay Gerold:**

- [⁠LinkedIn⁠](https://www.linkedin.com/in/nicolay-gerold/)
- [⁠X (Twitter)](https://twitter.com/nicolaygerold)
- [Substack](https://nicolaygerold.substack.com/)

Query understanding, search relevance, bag of documents, bag of queries, query specificity, query classification, named entity recognition, pre-retrieval processing, caching, large language models (LLMs), embeddings, offline processing, metadata enhancement, FastText, MiniLM, sentence transformers, visualization, precision, recall

[00:00:00] 1. Introduction to Query Understanding
  • Definition and importance in search systems
  • Evolution of query understanding techniques
[00:05:30] 2. Query Representation Models
  • The "bag of documents" model for queries
  • The "bag of queries" model for documents
  • Advantages of holistic query representation
[00:12:00] 3. Query Specificity and Classification
  • Measuring query specificity using cosine similarity
  • Importance of query classification in search relevance
  • Implementing and leveraging query classifiers
[00:19:30] 4. Named Entity Recognition in Query Understanding
  • Role of NER in query processing
  • Challenges with unique or tail entities
[00:24:00] 5. Pre-Retrieval Query Processing
  • Importance of early-stage query analysis
  • Balancing computational resources and impact
[00:28:30] 6. Performance Optimization Techniques
  • Caching strategies for query understanding
  • Offline processing for document enhancement
[00:33:00] 7. Advanced Techniques: Embeddings and Language Models
  • Using embeddings for query similarity
  • Role of Large Language Models (LLMs) in search
  • When to use simpler techniques vs. complex models
[00:39:00] 8. Practical Implementation Strategies
  • Starting points for engineers new to query understanding
  • Tools and libraries for query understanding (FastText, MiniLM, etc.)
  • Balancing precision and recall in search systems
[00:44:00] 9. Visualization and Analysis of Query Spaces
  • Discussion on t-SNE, UMAP, and other visualization techniques
  • Limitations and alternatives to embedding visualizations
[00:47:00] 10. Future Directions and Closing Thoughts
  • Emerging trends in query understanding
  • Key takeaways for search system engineers
[00:53:00] End of Episode

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

How AI Is Built, Daniel Tunkelang
===

[00:00:00] What faceted search tells you is, look, retrieval and ranking are nice, but the real issue is figuring out what the searcher wants, and trying to engage the searcher more conversationally to arrive at that, albeit through a very structured kind of guided dialogue, and it has all sorts of requirements.

Daniel: But fundamentally it's saying, instead of trying to figure out what you want in one shot, it's a negotiation of the query. So I thought queries were really [00:01:00] important from early on. But then it became even clearer while I was at LinkedIn. When I joined LinkedIn, I was in data science, not in search.

But I was still very much a search person, and I saw that the mistakes they were making, and that they were trying to address with better ranking, were fundamentally failures to get the right query representation. As part of infiltrating the search group and eventually leading it, I said, look, we have to put much more focus on the query side.

And there had been a guy, Brian Johnson at eBay, who was the head of query understanding, whom I'd met at a SIGIR conference. And I said, that's a great title. I'm taking that. And I wasn't sure quite what I meant by it. But I told my VP, I said I want to head up query understanding. He said, that sounds good.

Not sure what you mean, but in whatever amount of time, tell me what your quarterly goals are. So that sounded good. And that became an agenda, a roadmap. And I guess I have a knack [00:02:00] for a certain kind of technical marketing, and I just ran with that. And that became everything from our named entity recognition efforts to different kinds of classification and segmentation and expansion.

I created an agenda around this that turned out to align with, I think, what people meant in search, but I think to some degree, I took something that was embryonic, but not really well scoped and branded, and I SEO'd it, and it worked for the field, it worked for me personally, and it's been a good ride.

Nicolay: And how have you seen query understanding like for yourself and in general evolve over time?

Daniel: When I started doing this, our named entity recognition used hidden Markov models, so that feels almost prehistoric now. Even then we should have been using conditional random fields. And these days, of course, you'd at least use an LSTM if you weren't just straight up using a transformer-based approach.

But what I'd like to say was [00:03:00] important, though, was recognizing that you had to have a query representation that was not just a collection of query-dependent features in your ranking model, which is the way things had been before. I've seen that emphasis stick.

I see query understanding teams. I see efforts that are explicitly about that. I see a separation between query understanding and, certainly, ranking, but even retrieval before it. And that's been really nice. The other thing, which I couldn't have anticipated, is that as people have gotten excited about LLMs, natural language, or conversational approaches, there's been a lot of emphasis on prompt engineering or re-engineering or what have you.

And, in a way, that is just a generalization of query rewriting, which in turn is saying that you need to treat the query as a first-class object. I think we [00:04:00] never expected it to turn out this way, but the move towards LLMs and this kind of AI orientation around search has placed a lot more emphasis on the query representation.

Nicolay: So the main search problem you're basically trying to solve is the mismatch between the query and the document. And how does the user intent come into play in query understanding?

Daniel: Yeah, so I think there's a bigger problem in my mind, which is that your documents are specific. They are by definition individual documents, but your queries can vary. Your query might be a proxy for a document title, a known-item retrieval. Or it might be quite general, it might be a category, it might be an entity, it might be anything in between those.

And there is not just a mismatch of format, but a mismatch, potentially, of granularity. And I think that, [00:05:00] bearing that in mind, you certainly don't want to use the same model, or I should say, you don't want to use the same embedding approach for both queries and documents as if, for example, queries were document titles, and expect it to work.

It might. In fact, it certainly works well enough. I've talked to folks who sell vector search and they say, oh yeah, a large fraction of our customers do that. I'm like, I'm glad it works, but in my experience it's a very rough approximation.

And what I found instead is that you have to find some way of either making documents look like queries or making queries look like documents, in a way that preserves the meaning. And since documents are potentially far more specific than queries, I found that it makes sense to think of the query as a distribution

over the document space as the default. Although I've thought a bit as well about thinking of the [00:06:00] documents as a distribution over the query space. But I think that takes a little bit more effort.

Nicolay: This is something that you have called the bag of documents model for queries and the bag of queries model for documents. So basically with the bag of queries, you represent the document by the queries that retrieve it, and with the bag of documents, you represent a query by the documents that fit the query.

Daniel: That's the idea. And I want to be careful not to be too literal about that, because it may very well be that for a query, if everything were static, you could say, look, these are in fact the documents it retrieves. But I worked at eBay for a while, and one of the interesting things about eBay is that the documents are anything but static there.

It's a marketplace. And it's quite possible that two people sell very similar items, but they certainly have different item IDs, right? They're not the same literal item. And they may describe them slightly differently. But, I think of the query as a distribution over [00:07:00] the space that the documents live in.

Think of it more as a distribution over the document vectors. And what you'd hope then is that if, let's say, there were a thousand relevant documents for a query, and I split them arbitrarily into two sets of 500, you would expect similar distributions on the document side for those two sets.

And so essentially you should be able to learn a distribution of the document vectors, a distribution within that vector space that you can then learn from the queries. And the beauty of it is that when you're mapping queries into this sort of bag-of-documents representation, you can use simple aggregations; just a mean gets you quite far.

You can also then play around with a kind of notion of variance. So if your query is very specific, you expect the associated documents to be extremely self-similar, while if your query is very general, you expect the documents to spread out more. So you can [00:08:00] effectively think of the variance in that space as a measure of query specificity.
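
A minimal sketch of the bag-of-documents idea described here, assuming you already have engagement logs and document embeddings (both placeholder inputs, not a specific system): each query's vector is just the mean of the vectors of the documents it led people to engage with.

```python
import numpy as np
from collections import defaultdict

def build_query_centroids(engagements, doc_embeddings):
    """Bag-of-documents query representation: each query becomes the mean of
    the embeddings of documents users engaged with for that query.
    `engagements` is an iterable of (query, doc_id) pairs from search logs and
    `doc_embeddings` maps doc_id -> vector; both are hypothetical inputs."""
    docs_per_query = defaultdict(list)
    for query, doc_id in engagements:
        docs_per_query[query].append(doc_embeddings[doc_id])
    # simple aggregation: the mean already gets you quite far
    return {q: np.mean(vecs, axis=0) for q, vecs in docs_per_query.items()}
```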

So this notion works quite nicely: thinking of the mean as the centroid and this kind of variance as the specificity. Going the other way is a little bit trickier, but if you think about, for example, the people who buy AdWords to advertise their sites, in a way what they're doing is saying, hey, look,

I want to target certain kinds of queries. So this idea of thinking of a document as a distribution over queries or intents has been around for a while as well. And I distinguish it from the traditional inverted index model, where you say, oh, here are the main tokens, or some other reductionist notion of the components of my document.

This is more saying, no, I'm going to think about it holistically. What are the whole intents, which you can think of as queries, that this document is targeting? And it's a little trickier, because without knowing something about the distribution of what people do, it's hard to come up with what the queries are.

But, if you have logs, you can do [00:09:00] that. And all these things are really nice when you can just look directly at your logs and say, Can I find, for this query, what are the documents, for example, that historically people have engaged with? Or for this document, what are the queries that have historically led to it?

If you want to go into the tail, then you have to resort to a little bit more effort and, for example, fine-tune a sentence transformers model based on the queries that you have seen before. Or in the document case, once you have queries and documents living in the same space, you might have to play around with a sort of nearest-neighbor view of which queries look like your document, based on these query means.

So there, there's work involved. But I think, philosophically, once you've decided that this is the way you're going to think about your queries or your documents, life is a little bit easier than when you have this misaligned world. I have had good luck with it.

Nicolay: When do you think is one model more appropriate than the other, or do you always end up using a little bit of both?

Daniel: A lot has to do with what you know [00:10:00] more about. In my experience, you tend to know more about your documents than about your queries. Because your documents, first off, they're self-contained. So they're longer, they are supposed to stand alone. In principle, the documents exist even before you have a search engine to find them.

The other thing is that your queries are shorter, they're potentially noisier. So I find that it makes a lot more sense to treat the documents as given, and then build query understanding in general on top of what you know about your documents. But I've seen exceptions, and these happen, for example, in cases where your documents are entirely user generated, and not necessarily by people who are highly motivated to describe them well. In those cases, you may actually know more about your queries than about your documents.

And you may be able to learn how to build your document representations based on queries. The catch being, there [00:11:00] has to be some mechanism by which you were able to find the documents from the queries in the first place. And if your document representation is very weak, that can be hard to bootstrap on.

So I tend to say content first, then queries. But I have seen the exceptions though.

Nicolay: Yeah. You already mentioned you use the specificity of the query to model it a little bit. What do you use to determine the specificity of the query? Do you use a classifier, an XGBoost model? What are your go-tos?

Daniel: I've tried those things, and this is a funny thing where I remember multiple projects where it's, oh, we need a classifier to determine if a query is broad or narrow. And that dichotomy always bothered me, because I wasn't convinced that this is a binary question. My early approach to this was to look at search journeys.

To say, if I've seen a query, then how long does it [00:12:00] take to get from that query to, say, a document that someone chooses? So let's do an e commerce case, where if I look up a particular product by name, and if I end up buying it, I probably get there almost immediately because hopefully it's one of the top results.

I might not buy it, but if I buy it, then I buy it in one step. On the other hand, let's say that I have an extremely generic query, like shoes or iPhone accessories, then it is unlikely that what I end up buying is going to be from the first page of results, because of that lack of signal. It's likely that I have to paginate a lot, or do various refinements through facets, or maybe even reformulate my query.

So my first impulse was to ask questions like, what is the conditional probability that I convert on the original [00:13:00] query, given that I convert on the session, right? It was essentially saying, are we there yet? I can't promise you that the query will lead to a conversion at all, but if it does, is it immediate?

And if it is not immediate, how long does it take? And that was a promising direction. It was highly application dependent, right? You can imagine that the size of your page matters. The availability of facets matters and so forth. I think it was the right intuition, but it was a bit of a pain in practice.
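
As a rough sketch of that "are we there yet" measure, assuming a hypothetical sessionized search log with per-query conversion flags (the file and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical search-journey log: one row per query issued within a session.
# Assumed columns: session_id, query, is_first_query, converted_on_query,
# session_converted. None of these names come from a real schema.
log = pd.read_csv("search_sessions.csv")

first_queries = log[log.is_first_query & log.session_converted]

# P(convert on the original query | the session converts at all),
# overall and per query; low values suggest broad, unspecific queries.
overall = first_queries.converted_on_query.mean()
per_query = first_queries.groupby("query").converted_on_query.mean()
```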

And by the way, this also follows a pattern: you can do this for head and torso queries, where you have that kind of history, but then you have to essentially use this as training data if you want to build for the tail. And at that point, because you're looking at queries, you're pretty much stuck with embedding-based models, right?

You could try to use XGBoost if you can come up with interesting query features, but they're going to be things like post-retrieval ones, like the entropy of [00:14:00] your categories and so forth. And that, by the way, has its own challenges. We could talk about the challenge of entropy in a world where you have similarity among the attributes.

But when I worked later on the bag of documents model, initially I said, oh, queries are means and we're good. But I did notice the following problem: even just to say whether two queries are equivalent, how high a cosine is good enough to say that two queries are so similar they're equivalent? And what I noticed was that the more specific the queries were, the higher the bar should be.

And the way you could think about this is by saying: I want to say that two queries mean the same thing if, given a result, I wouldn't be able to tell you whether it came from query A or query B. If that's the case, then query A and B mean the same thing. I could look at the cosine, for example, between a result coming from query A and a result from query [00:15:00] B, and if the distribution of cosines across the two queries is similar to the distribution within the queries, that's great.

But now we can see that if the queries are really specific, they're going to be tightly clustered in both, and if the queries are broad, they will not be tightly clustered in both, and that's going to result in interesting things. If they're tightly clustered in both, then the means had better be extremely close.

Because they're both tightly clustered around their means, there's not much room. You don't really have a triangle inequality for cosines, but you can pretend that's what's going on. On the other hand, if they are broadly clustered in both, then you're going to end up with the means maybe even moving around a little bit as well.

This thing is going to end up being just a little bit noisier. So initially I tried to do this with pairwise cosines, and it didn't work. I said, I think it was a good idea, but it didn't work, so we can't do this. And then later I said, wait a minute. [00:16:00] How about instead, once we've gotten the query vector as the mean of its document vectors,

we take those document vectors, compute their cosines with that mean, and take the mean of those. Now this may seem like a complicated way, but it's not that different from the definition of variance as the expected value of the squared difference from the mean. Except that it's in cosines, and it's not Euclidean, it doesn't have any of those properties, but it feels the same.

And then it turned out, ah, that works much better. By the way, it feels like a linear problem instead of a quadratic problem, which is also nicer. And that turned out to work extremely well as a kind of variance measure. So that's how I arrived at it. And at the time, I was just looking for a way to pick the threshold for query similarity.
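
A minimal sketch of that mean-of-cosines specificity measure, computed from a query's engaged-document vectors (a NumPy array of those vectors is assumed as input):

```python
import numpy as np

def query_specificity(doc_vectors: np.ndarray) -> float:
    """Specificity as described above: take the vectors of the documents
    engaged with for a query, compute their mean (the query vector), then
    average each document's cosine similarity to that mean. Tightly
    clustered (specific) queries come out near 1; broad queries come out lower."""
    centroid = doc_vectors.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    normed = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return float((normed @ centroid).mean())
```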

But then I realized, wait a minute, this is great. And then how to [00:17:00] compute it? Just as with computing query vectors as the means of the documents for the head and torso; but then for the tail, where we'd been training a sentence transformer model by essentially fine-tuning an existing one, say MiniLM, with triplet loss, we could instead say, ah, we're going to essentially take BERT and add a regression layer, just a linear layer on top. Because once we have a bunch of queries and we have their specificity, which is just a measure between zero and one, right?

It's a mean of cosines, but now we have a regression problem. We take a bunch of these as examples, we fine-tune BERT with a linear layer on top, and that works quite nicely. By the way, you can decide to write off the tail if you want to. After all, for one thing, the tail tends to be more specific.

Although, I will say, there are tail queries that are highly general. Think about a strange way of expressing a headish concept. So instead of saying, men's shoes, you say [00:18:00] shoes that would fit men. It's a very rare query, but it has very low specificity. And you know that your model is working well when it's able to detect that.
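
A sketch of the tail-specificity model described here: fine-tune a small transformer (MiniLM is used as the base, as mentioned) with a single-output regression head on (query, specificity) pairs computed offline for head and torso queries. The training pairs and output path below are placeholders, not real data.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder training pairs: specificity in [0, 1] computed offline with the
# mean-of-cosines measure above for queries you have history for.
queries = ["men's shoes", "shoes that would fit men"]
targets = [0.35, 0.32]

name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1, problem_type="regression")  # linear layer on top

class QueryDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tok(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specificity-model", num_train_epochs=3),
    train_dataset=QueryDataset(queries, targets))
trainer.train()  # the trained model then predicts specificity for tail queries
```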

Nicolay: And are you doing all the different, basically, similarity calculations? Are you doing them at inference time?

Daniel: So basically, when you're offline, you're able to learn about your more frequent queries, your head and torso queries, and you're able to build models for your tail. So then, when a query comes in, first, if it's a head or torso query and you want to know anything about it, you should be able to have done that offline, right?

You can build equivalence classes, anything you want. If you see a tail query, first, you know it's a tail query, and one thing you could do is a nearest-neighbor search for similar queries from your head and torso. And that's quick, right? It's a similarity search against a relatively contained space, maybe your top million, 10 million, whatever, queries. [00:19:00]

And that's just a few milliseconds. We won't get into the details of how exactly you've set up your similarity search, but it's once per query for your tail queries, which are the ones that probably need the help. And then you decide what you do with them.

You could rewrite, you could expand, you could do any number of things there. And then for specificity, again, you could look it up if you can, or it's something that you've got to infer. Again, it's a single calculation, and you can use that to decide, for example, to trigger presenting facets, or, on the flip side, deciding that you want a hero slot for your top result.

Yes, if you're going to do things to the tail, you have to do your inference at query time, but it's once per query. And if you can't afford to do a single kind of inference, glorified dot product and similarity search for a query, then you probably have bigger problems from a performance perspective.

This is very different from the work that you're doing for any kind of query [00:20:00] result calculation.
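
A sketch of the tail-query handling just described, assuming a head/torso query list from your logs (a placeholder here) and using MiniLM embeddings plus a FAISS inner-product index for the once-per-query nearest-neighbor lookup:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# head_torso_queries and whatever you precomputed about them offline are
# placeholders standing in for real log analysis.
head_torso_queries = ["men's running shoes", "iphone accessories", "usb c cable"]
head_vectors = model.encode(head_torso_queries, normalize_embeddings=True)

index = faiss.IndexFlatIP(head_vectors.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(head_vectors, dtype="float32"))

def nearest_head_query(tail_query: str, min_cosine: float = 0.8):
    """At query time, map a tail query to its closest head/torso query in a few
    milliseconds, then reuse whatever was precomputed for that query offline."""
    vec = model.encode([tail_query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(vec, dtype="float32"), k=1)
    if scores[0][0] >= min_cosine:
        return head_torso_queries[ids[0][0]], float(scores[0][0])
    return None, float(scores[0][0])
```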

Nicolay: Yeah. And for intent determination, do you actually determine a fixed set of intents when you're looking at the queries, and basically use a classifier for the different query intents you have, or is search too unspecific for you to fix the intent into a fixed number of classes?

Daniel: So I think there are a few different things. I was just talking a moment ago about query equivalence, and that is saying, oh, I know exactly what you mean, I've essentially mapped it to a particular canonical query. At the other extreme you might have, for example, I know that you are looking for clothing.

You're looking for electronics, right? This is classification that aligns with the taxonomy. But it's maybe mapping a query [00:21:00] to one of twenty or a hundred things, as opposed to finding the nearest neighbor among many millions. So that's a classification problem, and that can be extremely valuable, right?

For example, one of the older techniques for increasing recall is query expansion through synonyms, or query relaxation by dropping tokens. But those notoriously suffer because they violate context, right? You say that you think that cup and glass are similar, but that's only true in the case of kitchenware.

It's not going to work so well in others. But if you already know, when somebody's looking for a drinking cup, that it's kitchenware, and you match cup to glass, that'll be fine. And you can essentially use the classification into the part of your [00:22:00] taxonomy as a guardrail to ensure precision, and in the meantime allow other things to increase recall.

In fact, take query relaxation, which can be extremely dangerous, and there's a lot of effort around, ah, let's make sure we keep the right tokens and so forth, or the right number of tokens. If instead you say, look, you can replace your AND with an OR, but at least make sure that your results are in the right categories, then you're not going to do that much damage, compared to keeping almost all of the tokens but, potentially because of a violation of context,

straying very far outside of that. So query classification is often, I think of it as the most significant bits of relevance for retrieval. It's looking to be in the right area, and it's extremely useful when you're then trying to do other things to increase your recall.
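
As a sketch of that guardrail, here is what relaxation with a category filter could look like against an Elasticsearch-style engine (the field names and the classifier output are assumptions, not a specific schema):

```python
def relaxed_query_with_guardrail(tokens, predicted_categories):
    """Relax the AND over tokens to an OR, but keep a hard filter on the
    categories the query classifier predicted, so recall goes up without
    straying out of context. The DSL shape is illustrative; "title" and
    "category" are hypothetical field names."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {"title": token}} for token in tokens],
                "minimum_should_match": 1,  # OR instead of AND
                "filter": [{"terms": {"category": predicted_categories}}],  # guardrail
            }
        }
    }

# e.g. relaxed_query_with_guardrail(["drinking", "cup"], ["kitchenware"])
```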

It's your starting point on precision. Now, there's a different aspect of query intent, [00:23:00] which is: what are you trying to do? Are you doing known-item retrieval? Are you doing exploratory search? That I find to be a bit trickier, and that requires some kind of ground truth that establishes what you're doing there.

Usually I'm more excited about what you can do with relating queries to content than these higher-order things. But if you have the training data, you can do that. I should also say, I think of query classification as taking the raw query in, because, especially when using transformer-based models, it's really nice to do it that way.

But I've seen cases where it's actually better to do a bit of stacking first. For example, at LinkedIn, a typical classifier might be: are you looking for a person? Are you looking for a company? Are you looking for a job title? And, as somebody with a unique last name, I certainly don't expect the model to [00:24:00] know my name in the sense of just direct classification.

So there it makes more sense to build a named entity recognizer first, collect the tags from that, and then feed those into a model. So sometimes, especially if we're dealing with tail vocabulary, it makes more sense to do other processing first before you go to something like query classification. But often doing the classification first is actually better, because once you do the classification, let's say in a product taxonomy, you could say, ah, I know that you're looking for electronics.

So now I have a set of entities that makes sense for electronics. Whereas if you're looking for clothing, I have a different set of entities that make sense for clothing, and I can scope the space a lot better by getting, in broad strokes, the classification into a category before I try to look at what entities would be useful.

Nicolay: Yeah, the big question is, does the named entity recognition recognize your name as a person as well?[00:25:00]

Daniel: Oh, at LinkedIn, luckily I managed to be in there. But initially, if a token only ever appeared in a name, it ended up getting a 100 percent probability of being the one entity type it had, and that's a bit dangerous. And there's an interesting question in general: when you're dealing with knowledge that isn't generalizable, then it's hard.

If I remember, as a bit of an aside, I interviewed a linguist, who I think may still work at LinkedIn. And I asked her how she would address this problem, and she started looking at the sub-token level to try to figure out whether you could tell that something was a person's name. And I was fascinated; it's not where I expected the interview to go.

If you can do that, great. But if there's knowledge that you could only have because you know it, not something you can generalize from a bunch of examples, then you have to take a different approach. Machine learning breaks [00:26:00] down in cases where there is no pattern; you simply just have to memorize.

Nicolay: Yeah, and especially with names, when you're crossing language boundaries frequently, which is very common with names, what is the origin in the end? It gets very challenging. And especially nowadays, names are completely free-flowing. North or Sky can be names as well nowadays.

Daniel: Yeah. And I can imagine, a lot of company names, in fact, historically were the names of the people who created those companies. Thomas Cook. I assume there was a Thomas Cook.

Nicolay: Yeah. How do you handle all the different stuff you're running on top of the search, like intent determination, query classification, query scoping, expansion, relaxation? How do you handle all of that, especially in low-latency or high-queries-per-second environments?

Daniel: The good [00:27:00] news is that anything that you're only doing once per query tends not to be your bottleneck. I've seen exceptions. I've seen people somehow build these 10-megabyte query representations and have quadratic things happening. I'm like, okay, yes, you can break that.

But in general, at least in my experience, what you do at query time you do once, so you have a decent amount of scope for it. The only bad news is that you probably want to do that before retrieval, because it can inform retrieval and tell you your retrieval strategy. Then, presumably, in retrieval you're going to have some cost that is linear in the number of things retrieved, hopefully with a small constant factor.

And then for anything you invest in scoring and ranking beyond that, [00:28:00] your factor goes up as, hopefully, your set of things goes down, right, in the phases of retrieval and then ranking. So I tend to say, look, because you're doing the pre-retrieval stuff once and it has high leverage, don't be stingy about it.

This is a terrible place to be that concerned about your milliseconds, because hopefully whatever you learn there saves you later, by reducing, for example, what you retrieve, or taking some of the burden off of scoring. You don't want scoring to be doing what should have been done upstream.

That said, there is a serialization problem. And you also have to worry about making sure things are co-located. You don't want to be sending stuff with high latency to a subsystem somewhere else. That's not necessarily a problem for queries per second, because that thing can probably handle the throughput, but there is the latency issue.

The other thing is that there's a lot of opportunity for caching. Even if your index keeps [00:29:00] changing, so you can't necessarily cache retrieval, let alone your end-to-end results, and maybe you have personalization, the meaning of your query, from the perspective of query understanding, tends to have a lot of longevity.

Caching can help there as well. I guess the other thing I would add is that I don't rule out doing post-retrieval things either. For example, you get back no results and then you say, aha, better try harder now. That can be annoying, but usually the reason you're doing this in the first place is that you've decided that returning the results as-is is going to be a bad idea.
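
A control-flow sketch of the caching and zero-results fallback described here; the understanding step and the `retrieve` and `relax` functions are placeholder hooks, not a real API:

```python
from functools import lru_cache

@lru_cache(maxsize=500_000)
def understand(query: str):
    """Once-per-query understanding caches well: results change as the index
    changes, but the meaning of a query has a lot of longevity. The values
    below are stubs for the classifier and specificity models discussed above."""
    categories = ("unknown",)   # stub: query classifier output
    specificity = 0.5           # stub: specificity model output
    return categories, specificity

def search(query, retrieve, relax):
    """`retrieve` and `relax` are hypothetical hooks into the search stack;
    the point is the control flow: cached pre-retrieval understanding, plus a
    post-retrieval retry only in the (already bad) zero-results case."""
    understanding = understand(query)
    results = retrieve(query, understanding)
    if not results:
        results = retrieve(relax(query, understanding), understanding)
    return results
```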

So increasing the latency a bit seems like a good gamble. I would say that both in terms of latency and in terms of computational effort, people very often say, and this is a bias I've seen from folks who work in infrastructure and operations, ah, we have to shave a few milliseconds off of this operation.

And then they think about this in a very systems way. And I tend to say, maybe the problem is that we are obtaining too many [00:30:00] results. Is there more of an 80/20 where we can say, actually, if we had fewer results, we would have less work?

Let's worry about that first, before we look at the amount of work we're doing per result returned. And I think that's a place where people across the stack would do better to communicate rather than compartmentalize and often work at cross purposes.

Nicolay: Yeah, I would love to get your thoughts on LLMs, especially on the query part, because they're all the hype at the moment: query expansion, adding synonyms, rewriting queries, the stuff you can do with LLMs is basically endless. But it seems to me you want to have more control, especially when we are looking at synonyms, where you have to be a bit more restrictive in the end, than using a language [00:31:00] model to completely rewrite the query.

Daniel: So I'd say a few things. One is that, to me, the biggest win: I used to be quite aggressive in my search work about using query expansion and relaxation to increase recall. And let's face it, a lot of the time when you work on query understanding, you tend to have either a mindframe of saying, what do I do to increase precision: classifying, segmenting, and so forth.

Or it's, how do I increase recall, the two techniques being expansion, where you can think of even things like stemming and lemmatization as a kind of expansion, even if you're doing it through canonicalizing the tokens or expanding abbreviations, and then relaxation. What becomes clear is, when you do those separately, you're always managing this trade-off. In contrast, in the embedding-based world, you think more holistically. You don't say, let me find a synonym for a token. [00:32:00] You're really saying, let me find a synonym for the query, or let me find the neighborhood of this query.

Sort of thinking about replacing a point with more like a ball in the space. And I know I'm mixing point and vector metaphors, but it's essentially thinking about this in the overall space of meaning. Now, similarity search, it's not free, but it's not absurdly expensive either.

And that's why, for example, using this kind of query similarity for whole queries in my experience actually works much more reliably than doing things at the token level. You're still using these large models, but this is not like using an LLM-based application with your prompt engineering and God knows what else on the other end; this is just a similarity search.

You know what's going on [00:33:00] and you have a variety of ways you could control it. You can decide what's the target space of things that you obtain back. You can make trade offs, for example, between how similar is something versus, for example, how well does that query perform? How frequent has it been?

How many results does it return? So there are a variety of levers that you have at your disposal. You can also decide whether to transparently expose what you're doing or not. Now, that doesn't always solve your problems for recall, but it solves a lot of them, right? If your typical problem is that your query is a variation of a canonical query,

and you can just say, can I find the canonical query from it, then this approach works nicely. Now that's different than, for example, what happens in your typical LLM world, where you're saying, oh no, you want to decompose your prompt into multiple queries [00:34:00] and come up with strategies and so forth.

The approach I'm suggesting is not going to get you a full-fledged chatbot. But I think that for ordinary search needs, the space of essentially single-meaning intents gets you quite far. And I think people should start there before they go the full LLM route.

Because otherwise they're using an expensive and over-engineered approach to just say, oh, I want synonyms but I need to keep context, so I'm going to deal with a full-fledged LLM-based application. And no, in that case, I think just thinking more holistically about query similarity gets you most of the way there.

And it gets you there relatively cheaply, and preserves at least some ability to understand what's going on underneath. So I think there's a little bit, as you said, of hype about what you can do with exotic approaches. I get it, because you can, but just because you can doesn't [00:35:00] mean you should, especially when the expense gets so high and you're not really taking advantage of what it can do.

Nicolay: Yeah, I think it's only cheaper in terms of how much you have to think about a problem. When you slap an LLM on top of everything, it gets pretty easy to solve different kinds of things.

Daniel: Look, I've seen this as well. I've been in my share of these, it's fun being a consultant, right? People say, oh, what can we do with generative AI? And then basically people start suggesting we're going to build classifiers. And I'm like, we have training data. We can build classifiers.

You don't... Now, if what you're saying is we can do zero-shot learning because somebody else has taken the collective knowledge of humanity and put it into a model, using data we don't have, and so we'll use that as a classifier, I understand that. And [00:36:00] essentially, that's a kind of useful cheating. It has nothing to do at that point with the architecture of the model.

It has simply to do with the fact that they have the training data and we don't. And I think people need to bear that in mind. It's quite different to use a model that way, to solve a problem where you simply lack the data, as opposed to taking advantage of the richness of its architecture.

Nicolay: Yeah. On the flip side, do you think LLMs are interesting to create new representations of the documents in your store?

Daniel: Yeah, I alluded to having worked with marketplaces like eBay, and certainly there are times where it would be nice if document representations were themselves a bit more canonical. If two documents essentially mean the same thing, I would like them, [00:37:00] ideally, to have the same vectors.

And there's a reasonable case that, ironically, the seeming lack of creativity of some of the outputs of generative AI could be used to canonicalize and denoise these representations so that they look more similar, more standardized. I see value there. I don't think generative AI is the only way to do this.

I think you could imagine an old-school approach where you simply compute a bunch of features and represent something as a bag of features in a canonical order, and that gets you a lot of the way there. But if you have a very heterogeneous collection, then LLMs might be easier than trying to build something that accounts for all of the different sorts of use cases.

And certainly less work intellectually, but [00:38:00] also potentially less brittle in terms of the resulting representation. So I do see value there. In general, I'd say that if you can clean up your document representations offline, you should do that, because you save work later, right?

What you don't want to do is compute stuff at run time which you could have done earlier, offline. And that starts with the document representation that you use for indexing. The same certainly goes for what you can learn about your head and torso queries, and then the kinds of models that you use afterwards.

Yeah, I see value there, much in the way that summarization has been so successful overall. I certainly see value there for summarization, and for chunking as well. People are using learned approaches for chunking, and even if they're expensive, you can do them offline. They're worth it.

Nicolay: I would love to hear what the different approaches or methods are that you usually run [00:39:00] offline to add additional fields, additional metadata to the document representation, to make search easier online.

Daniel: I would say that some of the most important things you're going to do are basically filling in missing metadata, right? If something should be in a category, you can generally learn that. Again, I have a fair amount of experience with either user-generated or marketplace data.

But even when it's your data, or it came in from your suppliers or what have you, it's quite common that it's hard to enforce getting all of your fields filled, getting all of your categories consistent. Let's just take a simple case: you have a category taxonomy, and when something comes in, you say, great, let me try to classify it into the right category, regardless of whether it had a category assigned.

[00:40:00] And you can then say, oh, it was missing one, we can assign one, or we can get a probability distribution over it. Even if it has one, and it disagrees with how we classified it, we can catch that. And in some cases, you might be able to construct a category similarity matrix and realize that we should handle it differently.

So then, at run time, when queries come in, we can certainly use these inferred categories, either for precision or recall, right? For precision in the sense that, well, this item doesn't seem to be in the right category based on what we inferred, so maybe demote it or exclude it, if you can do that.

Or, for recall, include this thing, even though it wasn't in there. And you can then take fairly sophisticated ways of augmenting this structured data, but put those into even a sparse retrieval approach to benefit from it, where you include, exclude, or boost, or now we use our XGBoost model, right?

To do better based on what we were able to learn offline, which we wouldn't [00:41:00] have had any chance of doing online. So maybe a simple way of thinking about this is that we're just dealing with either missing or potentially incorrect structured data.

But it's probably the most direct way to use offline analysis of your content. And what's nice is that if enough of your content is good, you can just bootstrap on the data you already have. But obviously the other thing you can do, especially if you're new and you don't have the data, is use LLMs to essentially give you your initial labeling, which you otherwise would have crowdsourced, and get yourself started.
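
A sketch of that offline enrichment pass, using a zero-shot classifier as the stand-in labeler for items missing a category (the catalog rows, field names, and taxonomy are placeholders; with engagement data you would train a supervised classifier instead):

```python
from transformers import pipeline

# Hypothetical catalog rows with possibly-missing categories.
items = [{"id": 1, "title": "stainless steel travel mug", "category": None}]
taxonomy = ["kitchenware", "electronics", "clothing"]

# Zero-shot classification as a bootstrap when you have no labeled data yet.
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

for item in items:
    if not item["category"]:
        pred = clf(item["title"], candidate_labels=taxonomy)
        item["category"] = pred["labels"][0]
        # keep the full distribution so downstream ranking can use the confidence
        item["category_scores"] = dict(zip(pred["labels"], pred["scores"]))
```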

Nicolay: Yeah. And let's assume two different scenarios, greenfield and brownfield. So basically, you're building an entirely new search system, or you already have a search system. How would you recommend an engineer start with query understanding? Say they haven't done anything before, they haven't touched the field, they haven't heard about it. How should they approach integrating it into the system?

Daniel: So [00:42:00] what I tell people, if they haven't done it already, is to essentially build a query classifier against whatever their main taxonomy is. What I find is that, first off, that often solves their biggest relevance problems, which they've been desperately trying to deal with in the scoring function. The last thing you want is for people to be making trade-offs between query-dependent and query-independent features in the scoring function that affect whether relevant or non-relevant results show up.

At that point, you're a bit late. In contrast, figuring out that your two-word query is within one or two categories at most should exclude almost everything irrelevant off the bat. The flip side is that if you are doing any kind of recall-oriented approach, like using synonyms or query relaxation, again, having query classification allows you to say, great,

you can do all that, and instead of spending [00:43:00] who knows how much time curating a synonym dictionary or trying to come up with a perfect token informativeness measure and so forth, just impose the guardrail from the classifier. And the beauty is that query classification is easy. It's a remarkably easy thing to do if you just have training data, which you can build yourself: you have your queries, you have results that led to engagements.

So now the categories of those results tell you the query categories, and you can build a model. I did this as a homework assignment for a class using fastText. So it's that easy to do reasonably well, and it's very high leverage. So I tell them to do that.
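
A minimal sketch of that kind of classifier with the fastText library; the training file and its labels are placeholders built from query-to-engaged-category pairs in your logs:

```python
import fasttext

# Training file built from logs: one line per (query, category of engaged result),
# using fastText's __label__ convention. File name and examples are made up:
#   __label__kitchenware drinking cup
#   __label__electronics iphone accessories
model = fasttext.train_supervised(input="query_categories.train",
                                  lr=0.5, epoch=25, wordNgrams=2)

labels, probs = model.predict("mens running shoes", k=3)  # top 3 categories
```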

And then, let's see what's left; then you start worrying about things at the token or entity level, and similarity, and all sorts of fun stuff like that. But if you haven't done any query understanding, do that; that will convince you. And I've had skeptics come back to me afterwards and say, yeah, that was the most important thing we ever did to improve search.

I'm like, yeah, you're welcome. It's obvious once you've done it, but if you haven't, it maybe feels silly to people. But [00:44:00] the beauty is you can do it outside of your stack, right? You're doing this as the query comes in. And even if you have a fairly locked-in search stack, you can very often still essentially restrict at the category level on top of that.

And you can learn from your logs. So that's the beauty of it. It works at the edges of your stack, not inside it.

Nicolay: Yeah, what are your go-to tools for query understanding: types of models, services, libraries, in broad strokes?

Daniel: Yeah, as I said, I alluded to using fastText. I've been doing this for long enough that there was a time when that was the go-to, and by the way, I still know people using it. The nice thing there is you don't need to be fine-tuning, right? You can actually build your models entirely from scratch. Fine-tuning a transformer-based model is, in principle, better, but way more expensive.

And you're stuck with fine-tuning; you're not going to build a transformer-based model from scratch unless you are raising hundreds of millions of [00:45:00] dollars and, probably, on TechCrunch this week. I think people should start with these kinds of simple, cheap tools to see what they do.

Obviously, I think you can do way more with transformer-based models, and I've done a lot with MiniLM, especially when building these query similarity approaches, because at that point you need to be able to do something with triplet loss and you need a decent starting point.

I think because I'm focused so much on queries, and queries have an obvious mapping to sentence transformers, that's the world in which I've lived. But I will say, even there, doing a little bit of feature engineering or stacking helps: saying, oh, can we compute a classification of the query first and stack that in

to the query strings, or tag the queries by using an NER model, which might be a BiLSTM or what have you. [00:46:00] Do the easy stuff that you can, because the fact is that once you start training, even just fine-tuning these models, you're doing these overnight runs, and when they don't work, you're basically using trial and error to try to get an intuition as to what does or doesn't work.

I try to have fast cycles where I can, and have some intuition.

Nicolay: Yeah, I have to sneak one more question in, because a buddy asked me: t-SNE or UMAP, what's your preference?

Daniel: t-SNE or what?

Nicolay: Or UMAP, especially when you're visualizing embeddings.

Daniel: Yeah, so the funny thing is, as I said earlier, I literally got my start in this space with network visualization. But I was literally trying to see the things I was doing, so the number of dimensions was like two or three, basically. And [00:47:00] there's a reason I left that space.

Which is that I find that trying to do dimensionality reduction and understand the way that these clusters work from these visualizations is tough, and I haven't had much success there. I've played around with various things, with t-SNE, I'm trying to remember.

I've looked at embedding visualizations recently, and other kinds of multiscale visualizations, but for one thing, they're static, right? So part of the problem is that I have not had much luck capturing the dependencies in these dimensions, right? Once you specify two or three and you have another one and so forth, the kind of exploring around the space.

So I actually have had much more luck with just ad hoc exploration to see what's going [00:48:00] on than with getting a visualization like this. But I'm also not a very visual person, and I know there are other people who do better with this. So, sorry to your buddy. I shouldn't be this bad, especially given that I literally did my dissertation on network viz, but I did it because I'm not visual, not because I am.

Nicolay: Yeah, nice. I have hundreds of questions left, but let's close it out here. If people want to get in touch with you, follow along with you, or even want to hire you, since you're freelancing, where can they do that?

Daniel: Certainly they can find me on LinkedIn and send me an email, because that always works. My email is my first initial and last name at Gmail, so that certainly works as well. And I post my content on Medium and on LinkedIn, and distribute on both.

So I certainly encourage anybody who just wants to learn more about anything I've talked about to go through the stuff I've posted. And if you're [00:49:00] interested in learning more or engaging with me, just reach out. I don't bite.
