How AI Is Built

Hey! Welcome back.
Today we look at how we can get our RAG system ready for scale.
We discuss common problems and their solutions, when you introduce more users and more requests to your system.
For this we are joined by Nirant Kasliwal, the author of FastEmbed.
Nirant shares practical insights on metadata extraction, evaluation strategies, and emerging technologies like ColPali. This episode is a must-listen for anyone looking to level up their RAG implementations.
"Naive RAG has a lot of problems on the retrieval end and then there's a lot of problems on how LLMs look at these data points as well."
"The first 30 to 50% of gains are relatively quick. The rest 50% takes forever."
"You do not want to give the same answer about company's history to the co-founding CEO and the intern who has just joined."
"Embedding similarity is the signal on which you want to build your entire search is just not quite complete."
Key insights:
  • Naive RAG often fails due to limitations of embeddings and LLMs' sensitivity to input ordering.
  • Query profiling and expansion: 
    • Use clustering and tools like Latent Scope to identify problematic query types
    • Expand queries offline and use parallel searches for better results
  • Metadata extraction: 
    • Extract temporal, entity, and other relevant information from queries
    • Use LLMs for extraction, with checks against libraries like Stanford NLP
  • User personalization: 
    • Include user role, access privileges, and conversation history
    • Adapt responses based on user expertise and readability scores
  • Evaluation and improvement: 
    • Create synthetic datasets and use real user feedback
    • Employ tools like DSPy for prompt engineering
  • Advanced techniques: 
    • Query routing based on type and urgency
    • Use smaller models (1-3B parameters) for easier iteration and error spotting
    • Implement error handling and cross-validation for extracted metadata
Nirant Kasliwal:
Nicolay Gerold:
query understanding, AI-powered search, Lambda Mart, e-commerce ranking, networking, experts, recommendation, search

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

RAG at Scale: The problems you will encounter and how to prevent (or fix) them | S2 E4
===

Nicolay Gerold: [00:00:00] Hey everyone. Welcome to How AI Is Built. This is Nicolay. I run an AI agency and I'm the CTO at a GenAI startup. Today, we are back continuing our series on search. We have a very special guest, Nirant Kasliwal. Nirant is the author of FastEmbed, a very popular library for computing embeddings, and a RAG consultant.

We take a hard look at RAG and especially how we can apply information retrieval techniques to it. My favorite parts are his different strategies, both when starting out with RAG and when scaling it up. Let's do it.

Nirant: What is exciting? I think ColPali, I think that's the one, right? C-O-L-P-A-L-I, basically a late interaction model for document understanding. I think that's interesting. And I think one accidental discovery from that paper was that fine-tuning 7B models is [00:01:00] basically worthless, 7B, 8B models. There's just this golden point: 1B models are fast, cheap to fine-tune, you can tinker around, throw them away, catastrophic forgetting is easy to discover, more interpretability, debuggability.

All of that exists at the 1 to 3 billion model sizes. Or you could go to the 30 and 70 billion model sizes and you get emergent properties in terms of reasoning, better compliance, better JSON mode, instruction following, better structured output and extraction, reasoning. Like you were talking about information extraction.

They get dramatically better at 70B and things like that. In the middle, it doesn't make sense. And that middle has always been moving. I used to think that it is seven to 10B or something like that. It turns out that it's probably 3B. It's just better to do more epochs with the same data than to try and take a larger model and assume that the model will retain generalization after a LoRA merge or something like that.

It does not seem to be the case. Or at least that's my interpretation of it as of now, could be wrong. [00:02:00] And obviously the idea itself is interesting, that you could actually do retrieval of which document, which page, and then do OCR on that, which means an OCR step basically becomes a real-time or pseudo-real-time operation instead of something which you do at ingestion time. Which gives you a lot more design flexibility, but also ties you into a mechanism.

So without ColPali, you will still need to do OCR anyway. So a little disappointing from that lens. But I'm curious. I think this will unlock some new experiences, because doing large-volume OCR is in a lot of cases just not possible at ingestion.

Nicolay Gerold: When I read the paper, I had already tried it out, and my mind already jumped to actually replacing my existing systems in production where I basically batch-process all the PDFs. I would have to basically do it on demand and store whatever I extract, but still basically [00:03:00] modularize all the different extraction pipelines.

So I would have to split it out for the different parts, like for tables, for figures, and for raw text, and basically pipe it through there on demand, which adds an additional step to the pipeline as well, and also more complexity.

Nirant: Yeah. I'm not really sure about that. For any of these components, you are going to be trading off latency for quality if you're doing it on demand. And I'm not sure if that's a good trade-off, all said and done.

Nicolay Gerold: Yeah. In the end, it might be interesting to identify the documents which contain the information you will need in the RAG you're building. And I think for that, it could be interesting, because you can search for information more robustly across different modalities, like in figures and tables.

And you can identify just the documents which contain the information and then basically [00:04:00] process the entire document and ignore the rest. For that, it could be interesting.

Nirant: Yeah, actually, speaking of that, did you at any point in time try ideas like DePlot? It turns a chart into a table. Okay, I'll quickly put that on screen. Can I share here?

Yes. So DePlot is this model where you can basically put in a chart and it will turn that into a table, and then you can use it for RAG as you would.

Okay. And then you can fine-tune it on basically a couple of PDF images which you have, right? Because for quite a few documents, you will have the source from which this was generated, the table behind the chart. And for 10 or 15 examples, you would make them by hand if required, it's that kind of thing.

And it's very small, right? It's only 282 million parameters. So what I was

Nicolay Gerold: Sorry.

Nirant: In my very limited experience, it works once in a while. It's T5 [00:05:00] based. So if you fine-tune it, it works really well, is what I was coming to. And it's also a little outdated, right?
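
To make the DePlot idea concrete, here is a minimal sketch of running the google/deplot checkpoint via Hugging Face Transformers to turn a chart image into a linearized table; the prompt string follows the public model card, while the image path is just an illustrative placeholder.

```python
# Minimal sketch: chart image -> linearized data table with DePlot (google/deplot).
# Assumes transformers, torch, and Pillow are installed; the image path is illustrative.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("revenue_chart.png")  # e.g. a bar or line chart screenshot
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
predictions = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(predictions[0], skip_special_tokens=True))
# The output is a flat table string you can chunk and index like any other text.
```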

Nicolay Gerold: Yeah.

Nirant: But for the,

Nicolay Gerold: I'm thinking also in the direction of does it work for scatterplots? This is something I would be very interested in.

Nirant: I have not tried it for scatter plots; for line plots and bar charts, it works.

Nicolay Gerold: Because that could be an interesting thing as well. Also, I can't really predict it in most cases. Where I've built the most pipelines with figures and tables as well, it's mostly financial stuff. And in finance, you often have scatter plots, and scatter plots are just a pain to work with.

Because there are just way too many points on the chart. And I think bar charts are easy. But also box plots. Do you know, have you tested box plots?[00:06:00]

Nirant: No, I have not. No.

Nicolay Gerold: That could be interesting as well, basically figuring out the distribution of the data and extracting it.

Nirant: What are you using for scatter plots right now?

Nicolay Gerold: So basically I'm using a vision model to tell me information about the data, rather like the distribution and stuff, which I try to encode with a vision model. But I'm just basically using GPT-4V to tell me: what's the min, what's the max, how is it distributed, what are the trends or the seasons?

Because finance is mostly time series, and in time series you can split up a chart: you have the overall trend, which is more long-term, then you have the seasons, and lastly, if you remove those, you have cycles, which are the smaller ones. And I basically ignore cycles and only try to see whether there are trends and seasons, which works fairly well, I would say.

Nirant: So that way you don't need the [00:07:00] underlying data as such, like you don't need the table from which that plot was built.

Nicolay Gerold: Yeah, because it's basically impossible to extract the underlying data, there are just way too many data points in there.

Nirant: Yeah makes sense

Nicolay Gerold: Yeah, nice. Now that we are already talking about the problems of RAG, we can move a little bit into: what are the problems of naive RAG? Where does it really break down?

Nirant: Yeah, I think we were discussing one which we have already come across, right? It's not very robust to data types which humans understand visually quite well. So for instance, any chart types are very difficult in case you need reasoning which intersects over a chart and a table, or a chart and a description around it.

It gets more difficult. So that is a function of LLM reasoning sometimes, but also a function of retrieval, because you might not know that this chart is related to this text, so you might not retrieve it and pass it to the LLM. So that causes another [00:08:00] source of error. But these are still the slightly more educated failure cases.

Then there is just the hygiene case or the default case where people just dump things into a vector DB and expect that retrieval to work, which is what we mean by naive RAG. That almost always fails. And it's not because the retrieval is designed to be bad. It's because embeddings do not capture everything which might be important to your data set and your domain.

As a very common example, let's say if you were to do an embedding similarity between yes and no, in a lot of embedding models that will be 0.8 or 0.9. The idea that embedding similarity is the signal on which you want to build your entire search is just not quite complete. And you can also think of a lot of things which are basically filters of some sort, or facets in the classical search way.
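
As a quick way to see this for yourself, here is a small sketch using FastEmbed (Nirant's library); the specific model name is just one of its supported defaults, and the exact similarity value will vary by model.

```python
# Sketch: how "similar" do yes and no look to an embedding model?
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")  # any supported model works
yes_vec, no_vec = list(model.embed(["yes", "no"]))

cosine = np.dot(yes_vec, no_vec) / (np.linalg.norm(yes_vec) * np.linalg.norm(no_vec))
print(f"cosine(yes, no) = {cosine:.2f}")  # often lands around 0.8-0.9
```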

For example, filters by time: okay, I want documents which are updated in the last week or edited by me, right? Like the kind of Google Drive search feature. These are [00:09:00] very basic hygiene features which, you know, this way of thinking does not account for in the retrieval sense. A lot of it is people trying to, I've actually seen firsthand people trying to answer "when can I take leave" questions, which is basically an HR question, by passing it to sales information, because they expect the LLM to know things magically.

So they don't really put effort into the retrieval and how correct it is. And embedding models, as I mentioned, right? The reason that was happening is the sales data: there was some document in the sales wiki which was describing the leave policy for the sales team, and it just brought up the entire sales information, and now the entire company is getting answers from the sales wiki about how you take leave.

It's just very noisy. So I think naive RAG has a lot of problems on the retrieval end. And then there's a lot of problems in how LLMs look at these data points as well. So for instance, I think there's a [00:10:00] lot of interesting work showing that, depending on which LLM, a lot of LLMs get confused if you give them anything unrelated to the question; they're not able to ignore things, and that really makes a difference.

The second is they're sensitive to the order. Like if the correct answer is somewhere in the middle versus at the start or at the end relative to the question in the message which you have passed on to the LLM, they're very noisy, very brittle to this kind of position ordering. Part of it is because LLMs are trained with that sequential, quadratic, one-after-the-other token prediction kind of way.

The other is a lot of our RLHF, RLAIF methods, preference methods. They optimize on a lot of these datasets, at least the ones which we have for instruction tuning in the open source. A lot of those actually have basically ordered lists. So depending on whether the question came first and then the entire ordered list, quite often there is some inherent [00:11:00] order in the preference dataset.

So the models learn that, okay, there is some ordering, and that is why this persists even in later distilled models, like the mini models for instance, or Claude Sonnet 3.5. Those also have this order sensitivity. So combining the facts on the LLM side: adding noise makes them worse, which means you need high precision.

You cannot just dump data and expect the LLM to ignore it just because it has a very large context window. So you cannot dump your entire codebase and say, oh, tell me what is happening. It will just ignore too much of it or get distracted. And the second is that LLMs themselves are very sensitive to the sequence in which you give them information. So this sort of naive RAG falls apart.

Nicolay Gerold: Yeah. And when you put on your consultant hat: when you're called into a company and they basically have an ill-performing RAG system, how do you go about identifying the problems in the system and finding the root source or root [00:12:00] sources?

Nirant: There's two ways to do it. One is you speak to the engineers and say, okay, what does the system look like? And you will have a very informed guess based on past experience that, okay, these are the limitations of this data flow or system design itself. The more robust method in my experience has been to go look at, go to talk to the people who have complained and ask what did they expect and what was there.

And then when you look at that, if there is some, any sort of debugging, for instance, even if they just logged, okay, this was my retrieval, this was the prompt which went to the LLM, the answer. And okay, this person complained about that, and you compare it, it's quite obvious, at least when you're just being onboarded, that, okay, this is why this happened.

So the first 30 to 50 percent of gains are relatively quick. The rest 50 percent takes forever, is my anecdotal experience, because the low-hanging fruits are usually in that first 50 percent, where you can get a lot of quick wins by just, let's say, doing DSPy, where you can just give a few-shot examples, select them cleverly, write [00:13:00] better instructions, and you will get a lot of boost.

Teaching the model to basically say "I don't know", just that also improves the user expectations and behavior quite a bit.
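
A rough sketch of the kind of DSPy setup he is alluding to: a signature that is allowed to answer "I don't know", compiled with BootstrapFewShot so the few-shot examples are selected automatically. The LM name, metric, and training examples are placeholders, not from the episode, and the API details may differ across DSPy versions.

```python
# Sketch: use DSPy to pick few-shot examples and tighten instructions for a RAG answerer.
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model choice

class AnswerFromContext(dspy.Signature):
    """Answer the question using only the context. If the context is insufficient, say "I don't know"."""
    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

answerer = dspy.ChainOfThought(AnswerFromContext)

def answer_match(example, prediction, trace=None):
    # Crude string-overlap metric; in practice a calibrated LLM-as-a-judge works better.
    return example.answer.lower() in prediction.answer.lower()

trainset = [  # a handful of human-verified question/answer pairs with their context
    dspy.Example(context="...", question="...", answer="...").with_inputs("context", "question"),
]
compiled = BootstrapFewShot(metric=answer_match, max_bootstrapped_demos=4).compile(
    answerer, trainset=trainset
)
```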

Nicolay Gerold: And the next 50 percent, how are you approaching that? Are you setting up a retrieval data set in the end to especially test the retrieval and how well it works?

Nirant: Yeah, I do that as well, which is having a synthetic QA set or existing queries with a ground truth. Like I will sit and figure out a few ground truths, but while I would love to have that exclusively for retrieval, I've realized that it's the end to end which matters a lot. So now I've started thinking about it as

diagnostic metrics and debugging metrics. So for diagnostics, you need the end answer and whatever was the NPS score for that: thumbs up, thumbs down, and users screaming at you, "this looks wrong, what are you saying?" So I also do sentiment checks for that in the chat log itself and do the diagnostic [00:14:00] of, okay, what is happening?

And there's debugging. So that is things like retrieval instructions. Did you select the correct instruction? Did you select the correct few-shot examples? All of those are also logged, and those logs basically become your debugging logs. And those will have separate metrics and separate lenses or hats that you put on when you're looking at them.

Nicolay Gerold: Yeah. And I'm especially curious: in information retrieval, I think the false positives and false negatives for search are very interesting, and one of those is particularly hard to track. How do you use this information to improve the system? And how do you actually build a dataset to evaluate it?

Nirant: Let's answer the second half of this question first, right? Like, how do you build an eval data set? So there is the synthetic QA approach, which has this inherent advantage that you already know what the [00:15:00] ground truth is, or the ground retrieval is at least, if not the final answer, because you're just synthetically generating the question using something like Evol-Instruct or something like that.

So that is one way to increase the volume and increase the coverage over your data set. The second thing, and I think that has more impact, is if you take any complaint from the user. I assume that if one person has complained, there are 10 more who have asked the same question but not complained.

And that ratio largely seems to hold. So what I will do is take that one chat log. See those questions or what the eventual intent of those questions was and look for similar iterations of that question, get the correct answer by speaking to someone in the company that, okay, where could this be in your actual thing?

We get that correct answer. And then we start to eval on those documents and those topics and include those synthetic QA also, and the human-asked questions also, and then iterate towards that. A rule of thumb which I try to follow [00:16:00] is that the synthetic volume for evaluation in particular should never be larger than the actual human questions, so to say.

So basically, in total, it should not be more than half unless you have very high confidence that the synthetic QA is pristine and you have human-verified it yourself or with a domain expert. So in case of basic things like internal company wiki search, I can do a lot of the verification myself.

But in, let's say, domain-specific cases like proteomics or biology or pharmaceuticals, there is no way for me to do that, because I do not have domain expertise in pharmaceuticals.
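
One way to encode that rule of thumb when assembling the eval set is sketched below; `human_qa` and `synthetic_qa` are assumed to be lists of already-built examples, and the 50 percent cap is applied mechanically.

```python
# Sketch: keep synthetic QA at or below half of the eval set, per the rule of thumb above.
import random

def build_eval_set(human_qa: list[dict], synthetic_qa: list[dict]) -> list[dict]:
    # Each item might look like: {"question": ..., "ground_truth_docs": [...], "source": ...}
    max_synthetic = len(human_qa)  # at most 1:1 with human-sourced questions
    sampled = random.sample(synthetic_qa, min(max_synthetic, len(synthetic_qa)))
    eval_set = human_qa + sampled
    random.shuffle(eval_set)
    return eval_set
```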

Nicolay Gerold: Yeah. And the data set you use for evaluation, you have the query types. Do you have two additional columns, basically documents that it should have retrieved, like the positives you want, and all the rest is basically negatives? Or how do you approach the data?

Nirant: Yes, I see everything else as negatives except the positive ones, [00:17:00] with a relative order of negatives, basically hard negatives. In the sense that I will use a strong baseline, which most companies usually have, something like Elastic in place. So then BM25 from Elastic, that implementation becomes my negative baseline.

So to say, the ranked documents from that which are wrong, they become my hard negatives.
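
A minimal sketch of that mining step, using the rank_bm25 package as a stand-in for an Elastic BM25 deployment: documents the baseline ranks highly that are not known positives become hard negatives for the query.

```python
# Sketch: mine hard negatives from a BM25 baseline (rank_bm25 standing in for Elastic).
from rank_bm25 import BM25Okapi

corpus = ["...chunk 1...", "...chunk 2..."]          # all candidate chunks/documents
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mine_hard_negatives(query: str, positive_ids: set[int], k: int = 10) -> list[int]:
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    # Highly ranked but known-wrong documents are the hard negatives.
    return [i for i in ranked[:k] if i not in positive_ids]

hard_negatives = mine_hard_negatives("how many leave days do I get", positive_ids={42})
```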

Nicolay Gerold: Oh, nice. Yeah. Can you double click on that? Like, how do you set up the system

Nirant: The infrastructure.

Nicolay Gerold: Rather on the like hard negatives? Like, how do you feed it in basically into the eval data set? So basically going from the user query to the hard negatives.

Nirant: So this is slightly more contextual to the domain, like how entity-heavy it is. So for instance, if you were to think about pharmaceuticals, it's very entity-heavy. The name of a drug is not going to change. It's a named entity. Similarly for biology, the name of a cell type or a protein is [00:18:00] not going to change.

So BM25 or keyword matching is a very strong baseline there. And just to set up context to answer your question: in some domains, like internal leave applications or finance, this is not the case. They are not entity-heavy. Recurring revenue could be absolutely anything depending on what part of finance you're looking at.

So there's some inherent meaning to recurring revenue, but it's very different when Apple talks about recurring revenue, which is not subscription, versus when, let's say, a SaaS company talks about recurring revenue. So that's the kind of difference you are trying to get. Now to answer your question, which is how do we map

user queries to hard negatives: when I have the user queries, I also have the user feedback at this point, right? Whether it worked, did not work, or the user screamed at it and then it eventually worked. So these are the three possible scenarios. Let's take the scenario where it worked, because that way I already have the ground truth, I know that this was correct. I will also have some synthetic questions from the same [00:19:00] documents, which I also know I have the ground truth for.

For those same questions, I will usually run them against my baseline, and whatever results I get which are wrong, outside of whatever I've used to generate the questions and which was in the original, those become my hard negatives. The second scenario, where the user screamed at it and/or the answer was wrong, it was downvoted, let's say, that is the harder one, because now I don't have the ground truth.

So in those scenarios, what I will typically do is take a list of documents based on some manual searching, where I am doing a mix of vector, sorry, dense, sparse, BM25, manually going to the documents, opening them and looking at them, whatever that requires, and then having a domain expert come in and say, oh yes, this is correct.

And this is not. And sometimes we will do it on the chunk level and not just the document level. Sometimes the other way around, we'll do it on the document level and not the chunk level. So that is slightly flexible depending on the document length, the paragraph length, [00:20:00] because, let's say in the case of industrial instruction manuals, those are sometimes like 300-page PDFs.

So asking which document is pointless. It's more useful to ask which page, or which section within that page, also from the domain expert. There, if we get that ground truth, we can again repeat the same process as earlier. In case we do not get ground truth, we see if the negatives which we have overlap the ones which we got from other signals. Because sometimes users can mark something, users can downvote something, for reasons which are not within the RAG failure scenario. So as an example, let's say somebody is asking about leave and the system says, oh, you get 22 leave days, and based on your recent data, you have 12 leave days valid. Now the person could be sad because of that.

And hence they downvoted it, because they wanted more leave. You don't know. So you got everything right, down to the [00:21:00] user-specific retrieval and answering it, but there's no way to separate one from the other in your evaluation.

Nicolay Gerold: And what are the issues that arise at more scale? And I want you to define the different types of more scale as well. So basically more queries, more users, more query types.

Nirant: I am not exceptional at infrastructure, and luckily I usually have very good partners in infrastructure. So anything I've seen, vector DB deployments or Elastic search deployments, which are a hundred million records with reasonably high QPS, 10 to 2,000 QPS, they work with some redundancy, sharding, routing, those kinds of mechanisms.

The area where I focus and where I'm more helpful is when the query types expand and there's coverage to be improved or precision to be improved. Coverage is basically recall here.

I think that's a good one which you can use as a, if you're doing it in house that's a good paper to let me see if I can fire that up right now. I can dig that up. If I can dig up that paper, give me a second, archive, yep, I have that, except that it's very, yeah, give me a second, I'll share that. So this is a very straightforward example of some question types or query types, right? These are something which vector search is very good at. This is also something which vector search is, [00:23:00] usually LLMs are very good at saying I don't know to simple with condition and sets. It's in the last five where people expect good answers and LLMs struggle, including perplexity for what it's worth.

And I think of these, the ones which I've seen most interesting to people are comparisons and aggregations, right? So comparisons is things like, oh, can you compare so and so method to so and so method, right? And if I were to give an example from, let's say, financial services, somebody might have a question of what was the recurring revenue for Apple in 2011 versus 2030?

And how did it change? These are the kind of questions which these systems struggle. The simplest method I have found is that you basically have a LLM do this routing internally by having these few short examples, you write a few short of what these each query type is and route it after classification.

And you can use a very small model or a very fast model, is basically what I'm trying to say here. So for instance, if I'm on Azure cloud, I will probably use [00:24:00] 3.5 Turbo, or 4o mini now. That's the kind of scenario in which it would excel for some of these queries. And I don't think that's fully covered in this example, like aggregation.

Sometimes you will ask, okay, what were the top five trends, or what were my top five selling items in so-and-so quarter? At that point you might have to sometimes do a fan-out, where you say, okay, I'm going to ask three different queries in parallel and then ask the LLM to do a reduce equivalent, right? You join all of that. That is where I think the instructions which you give to the LLM, in a separate prompt, become very important and useful for expanding query types.
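
A sketch of that routing step with a small, fast model: the query types roughly follow the CRAG shorthand, and the client, model name, and few-shot examples are illustrative choices rather than anything from the episode.

```python
# Sketch: few-shot query-type classification with a small model, used as a router.
from openai import OpenAI

client = OpenAI()
QUERY_TYPES = ["simple", "simple_with_condition", "set", "comparison", "aggregation", "multi_hop"]

PROMPT = """Classify the query into one of: {types}. Answer with the label only.
Query: What is our parental leave policy? -> simple
Query: Compare recurring revenue for Apple vs. a SaaS company. -> comparison
Query: What were our top five selling items last quarter? -> aggregation
Query: {query} ->"""

def route(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, fast model; 3.5 Turbo on Azure would do the same job
        messages=[{"role": "user", "content": PROMPT.format(types=", ".join(QUERY_TYPES), query=query)}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in QUERY_TYPES else "simple"

# Aggregations can then fan out into parallel sub-queries plus a final "reduce" prompt.
```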

Nicolay Gerold: And I think that already leads us to the improving-coverage and improving-recall section. So how do you actually figure out or profile the queries: what is the intent of the user and what does he want out of the query?

Nirant: So [00:25:00] one slightly half-baked thought which I have right now is that the way we have designed intent understanding systems so far no longer applies. And I think this was something which was obvious in the previous chatbot era, when Rasa was quite popular: that intent understanding and, sorry, intent detection and entity detection need to come from the same base

understanding, because the intent and entities quite often go hand in hand. If somebody is sharing their order ID, they usually have a question about tracking and not refunds. There's a frequency relationship basically, there's a correlation happening there. So in terms of query profiling, I can again share. I think clustering is quite popular as a method to see, okay, is there a particular area where we are doing terribly?

So we will basically take all queries, let's say I have green and red for the upvotes and downvotes and everything else in grays, and we'll see what works. A very common and good pipeline looks like this. This is from the GitHub [00:26:00] repo called Latent Scope, which is also a tool for structured annotation, so to say, for further analysis.

And what they basically have is: we already have embeddings when we ingest this data, right? So we basically do a UMAP cluster, and then there is a pseudo-label or labeling step. I've also used Prodigy for this with great effect. Not in the last 6 to 12 months, but before, in my previous projects, Prodigy from Explosion is a great tool for doing this accelerated labeling, for query profiling in particular. And then there is a last step which they basically bring in, called scoping.

Which is a little hard to describe, but I would encourage folks to try this out. This is great, and they have some good getting-started material here. Like they have good example analyses here, and we should fire this up here. Yes. Let me go to the end. Yeah.
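
A hand-rolled stand-in for that pipeline, assuming you log queries together with up/down feedback: embed the queries, project with UMAP, cluster, and look at the downvote rate per cluster. Latent Scope wraps a similar flow with labeling and scoping on top; the model and library choices here are illustrative.

```python
# Sketch: embed queries, project with UMAP, cluster, and surface clusters with bad feedback.
import numpy as np
import umap      # umap-learn
import hdbscan
from fastembed import TextEmbedding

queries = ["how do I take leave", "compare q1 vs q2 revenue", "reset my vpn password"]  # ...
feedback = np.array([-1, -1, 1])  # 1 = upvote, -1 = downvote, 0 = no signal

embeddings = np.array(list(TextEmbedding("BAAI/bge-small-en-v1.5").embed(queries)))
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)

for cluster_id in sorted(set(labels) - {-1}):
    mask = labels == cluster_id
    downvote_rate = float((feedback[mask] == -1).mean())
    examples = [q for q, m in zip(queries, mask) if m][:3]
    print(cluster_id, f"downvote rate {downvote_rate:.0%}", examples)
```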

Nicolay Gerold: And this could be basically, maybe the documents aren't in the data [00:27:00] set even, or there is a certain question type, like what we looked at before, like the query type, which the system doesn't perform well on. But maybe even the search system just isn't set up in a way that you can find the relevant documents.

Nirant: Yes. Yes. It requires that manual effort. The only saving grace is: the better you are as an engineer at using LLM-as-a-judge, the better. So for instance, when I said upvotes, downvotes, if your LLM-as-a-judge is very well correlated with the human upvote/downvote signal, and you can use DSPy to improve that correlation,

then you can also increase the volume for this kind of analysis. Because the typical ratio which I've seen is that 1 to 3 percent of all queries will have upvotes or downvotes. So what do you do for the other 95-plus percent? That's where the LLM as a [00:28:00] judge view of the world really helps, if you're good at this.

Nicolay Gerold: Yeah. And if you actually have the case where the query isn't good enough or doesn't have the filter conditions ingrained into it, how do you go about building a system for query expansion and reformulation to actually improve the different search queries?

Nirant: I would love to say that I do query expansion on the fly as often as it is required. In practice, it is not. In practice, what we get away with is that we just add parallel searches. So we know, okay, this query and the rephrasing of that which is more exhaustive.

And we map those, we put both of those in and say, okay, these are actually the same queries, in some database of sorts. And every time we find a query which is very close to either of these, we just fire both to the vector [00:29:00] DB or the entire search system, and then we take the entire results, rerank them, and give them to the LLM. That way the better rewritten query, so to say, almost always wins.

That is the hack there, so to say: the query expansion is happening offline and not at query time, especially because there is a head-and-tail distribution of queries, so for the head we do this manually quite often. Query expansions, rewrites, things like that. And I'm treating query expansion and rewriting as synonymous here. Query decomposition, where we are doing some sort of fan-out or adding filters, that is an information extraction problem, and often right now we just do an LLM call and ask it to parse the query into a pre-specified JSON structure.
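
A sketch of that offline-expansion trick: head queries get hand-curated rewrites stored ahead of time, and at query time a close-enough match fires both variants in parallel before a rerank. The `search` and `rerank` callables are placeholders for whatever retrieval stack you already have.

```python
# Sketch: offline query expansion for head queries, parallel search, then rerank.
from concurrent.futures import ThreadPoolExecutor

EXPANSIONS = {
    "leave policy": "how many days of paid leave do employees get per year",
    # ... hand-curated rewrites for the head of the query distribution
}

def expanded_search(query: str, search, rerank, top_k: int = 10):
    variants = [query]
    for canonical, rewrite in EXPANSIONS.items():
        if canonical in query.lower():        # cheap match; embedding similarity also works
            variants.append(rewrite)
    with ThreadPoolExecutor() as pool:        # fire all variants in parallel
        result_lists = list(pool.map(search, variants))
    merged = {doc_id: doc for results in result_lists for doc_id, doc in results}
    return rerank(query, list(merged.items()))[:top_k]
```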

Nicolay Gerold: And this basically just means, for example, if you have a temporal component in the query, that you try to extract it. So if the user mentions "what are the top news of the last week", you extract that and basically filter on [00:30:00] the last seven days. Do you use some classifier to basically decide whether it is a certain query type and then route to different mechanisms, like query expansion or certain metadata extractors that extract only certain information from the query?

Nirant: The metadata extraction is always done, and a router is almost always triggered. Whereas some companies prefer, some systems basically require, lower latency; in those cases, we might have a logistic regression check to basically decide, should I ask the router or not. And those are hit or miss.

They have their own precision-recall trade-offs and I'm not a fan of that. I would rather focus a lot more of my energy on teaching the LLM to say "I don't know", like handling that case in general more gracefully. But I do [00:31:00] see some system designs benefiting from that: should I route, should I use a different prompt?

Especially, let's say we discussed the aggregation and decomposition with the different behaviors in that case, having a different instruction, different prompt also helps.

Nicolay Gerold: Is this an issue in the metadata extraction as well that the LLM might hallucinate parts of the metadata you use for filtering in the end?

Nirant: I have luckily not run into that often enough. And for a lot of these, we basically do a check. So for instance, if the LLM says last week was so-and-so date, we have libraries, let's say the Stanford date-util parser library, which converts natural language to a standard datetime object.

We will compare both outputs. And if they're off by, I don't know, 300 weeks, we will basically assume one of them is wrong.[00:32:00]
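
A sketch of that cross-check, with the dateparser package standing in for the Stanford parser he mentions: parse the raw temporal phrase independently, compare it with the LLM-extracted date, and only trust the filter when the two roughly agree.

```python
# Sketch: cross-validate an LLM-extracted date against a rule-based parser before filtering.
from datetime import datetime, timedelta
import dateparser  # stand-in for the Stanford date parser mentioned above

def temporal_filter_is_trustworthy(llm_date: datetime, raw_phrase: str,
                                   tolerance: timedelta = timedelta(days=14)) -> bool:
    parsed = dateparser.parse(raw_phrase)   # e.g. "last week" -> a concrete datetime
    if parsed is None:
        return False                        # cannot verify, so skip the filter
    return abs(parsed - llm_date) <= tolerance

ok = temporal_filter_is_trustworthy(datetime(2024, 9, 20), "last week")
```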

Nicolay Gerold: Yeah, nice. What are the metadata fields you tend to use more often? Basically, give people some inspiration on what you could extract and what tends to work the best in industry.

Nirant: Oh, I could list those out, but I think you'll have a way better time by just going and looking at this.

So this is what you will see: you can see the title, what the related areas are, and the actual text on hover. This makes it easier to visualize and share your findings too, so that's quite neat. This all builds on nearest neighbor search. So it also gives you visibility in case there are questions which people are asking that are actually not present in the corpora itself. This ties back to your previous question of false positives and false negatives, right? So that kind of thing: okay, the LLM gave some answer which it made up, or it said "I don't know", but it was just not there in the corpora. That is the problem with retrieval, like trying to evaluate [00:33:00] retrieval.

If it is not in the corpora, how do you know that it's not there? How do you prove the absence of that information? This is the kind of pipeline which gives you some evidence that it does not exist. It's higher confidence than just saying so. Yeah, that kind of thing.

Nicolay Gerold: So basically you take the queries, embed them, cluster them based on the embeddings, but also mark them, whether they perform well or not. And then you basically try to identify groups of queries the RAG system isn't performing well on, and in these groups, you do the manual work to identify what the commonality between them is, why they aren't performing.

Nirant: There is Elastic's work on this, and I can share a part of that on screen and quickly give a primer on this. So this is how they talk about metadata design itself, right?

So you can see that they're only talking about a certain design. These are all the [00:34:00] scenarios, the pieces, everything which you will do with metadata, so to say, right? Elastic has put in a lot more thought than we will. As an example, they have separate fields for routing. They have a separate field for, okay, these are the fields to be ignored.

So then, where is this? Yeah. They're also putting a lot more effort into, let's say, tokenizers and token filters. Let's say you do not want any mention of your competitors to accidentally be passed to the LLM; that's where this kind of filter makes sense. I'm looking for the exact reference which will answer what we were just discussing and go deeper into that.

Okay. Judgment list, LTR. Okay.

I can't find it, but I'll share it if I run into it again. Filters which I like to use, but which do not always go to production: time is a major one. Anything else which is a descriptive field, adjectives basically. I'll give you some examples. In the case of e-[00:35:00]commerce, you would describe this as: this is a collared t-shirt with a blue collar, with some pattern, a white base, and half sleeves.

So you will describe all of these and store them, and that level of specification is useful. The fun thing about information extraction is you can always be too granular and aggregate back up, but not the other way around. So in terms of the schema design, I prefer to be as detailed as my system, or the LLM, or the information extraction system will allow me, which is quite often just an LLM call with some optimized prompt for that.

Time is quite popular; people seem to like that as well. I see people prefer location. Role-based filtering is quite common and useful. Session-based is also quite helpful sometimes, but again, not something which people implement often enough.
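
One way to write that down is a deliberately granular schema, sketched here with Pydantic; the field names mirror the e-commerce and filtering examples above and are illustrative, not a schema from the episode.

```python
# Sketch: a granular metadata schema; you can always aggregate detailed fields back up later.
from datetime import datetime
from pydantic import BaseModel

class ProductAttributes(BaseModel):
    item_type: str                     # "t-shirt"
    collar_color: str | None = None    # "blue"
    base_color: str | None = None      # "white"
    pattern: str | None = None         # "striped"
    sleeve_length: str | None = None   # "half sleeves"

class ChunkMetadata(BaseModel):
    updated_at: datetime               # time filters: "updated in the last week"
    author: str | None = None          # "edited by me"
    allowed_roles: list[str] = []      # role-based access filtering
    session_id: str | None = None      # session-based filtering
    product: ProductAttributes | None = None
```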

Nicolay Gerold: Yeah. And staying on the metadata, do you [00:36:00] treat information you have about the specific person as a type of metadata as well? And do you also enrich the different documents in storage with persona information, basically that you can use for filtering?

Nirant: Two parts to this question. Yes to the first part: do we treat the querying person or the user as metadata? Absolutely. And that's something which we pass to the LLM as well, and we use it for selecting the few-shot examples as well. It's very useful and powerful. And we also use it for evals sometimes, to basically get a sense of the user experience.

I think one of the examples which I give often is: you do not want to give the same answer about the company's history to the co-founding CEO and the intern who has just joined. And similarly, you do not want to give the same answer to the engineer who wrote this code base over the last four years versus somebody who has just joined. There's a lot more inherent [00:37:00] context, and you should be able to factor that in. The closest which people relate to this example is when you're going to an LLM today, like a Claude or a ChatGPT product, and saying "be concise", because you're operating in your field of expertise and you don't want the LLM to basically yap on.

And we also use something like a Flesch-Kincaid readability score sometimes, to see how easy it is to read. That's, how do we say, a better fit for American audiences than globally, but basically Flesch-Kincaid is a reading grade level and you can basically compute it

with this formula. It's quite straightforward. There is an implementation of this in almost every programming language. In case it's not there, it's still quite easy to implement, because words, sentences, and syllables per word are something which a dictionary-based system can give you. So you can still do it quite easily using NLTK or any library that gives a sense of reading levels. And the most fun part is [00:38:00] you can augment it with the vocabulary of your own domain; you can get the term frequencies of your own corpora quite easily, right?

So we were discussing financial services just before this: revenue is very common in financial services, as are terms like profit and margin and things like that, but outside of that domain, they might not be common. So you might want to treat margin in the context of biology as different from financial services.

This kind of reading-level check which you get from the user persona helps you both in doing retrieval better and, when passing it to the LLM, helps the LLM answer the question accordingly.
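
For reference, a tiny sketch of that readability check using the textstat package (the formula in the comment is the standard Flesch-Kincaid grade-level formula); the target grade band is just an example threshold.

```python
# Sketch: compute a Flesch-Kincaid grade level for an answer and adapt the response to it.
import textstat

def reading_grade(text: str) -> float:
    # Formula: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    return textstat.flesch_kincaid_grade(text)

answer = "Recurring revenue grew because subscription renewals outpaced churn this quarter."
if reading_grade(answer) > 10:  # e.g. aim around an 8th-10th grade level for non-experts
    print("Ask the LLM to rewrite this more simply for this user.")
```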

Nicolay Gerold: Yeah, and what does the typical profile look like? I think this is mostly relevant in a B2B setting. What are the different pieces of information you try to capture about a person, and how do you actually feed that into the model?

Nirant: I was just describing [00:39:00] this, but yeah, role and designation are quite helpful. The access privileges are even more helpful. So we will sometimes just give the description of their access privileges, and LLMs get very clever because of that. So if you just describe, oh, this person has access to all code bases, the LLM will then

in the answer sometimes say, hey, this code actually exists in this place, and you can go and take a look, right? Because now it knows that it's part of your access privileges, you have access to GitLab or GitHub. Sometimes it will show that differently: for a marketing person, it might say, you might want to speak to a developer on so-and-so team

for this change, if you want it. It gets very clever if you describe the access privileges. The other types of metadata which are a little controversial to use are age and geography or market, and they're also not very helpful. Sorry, geography or market is sometimes helpful, especially outside of the US, when English is a second or third language.

But age is almost never useful, though it's [00:40:00] entertaining. It feels more personalized, because sometimes the LLM will try to use different grammar for somebody who is 20 versus 30 versus 50. It's just stylistic differences. If you see them side by side, it will be very obvious to you, but if you only look at one, it will not be obvious to you.

So it's very entertaining. And it also makes it easier for the person to trust the system, so to say.
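
A small sketch of how such a persona description might be folded into the system prompt; the fields and wording are illustrative, not taken from his setup.

```python
# Sketch: describe role, tenure, and access privileges to the LLM in the system prompt.
def build_system_prompt(role: str, tenure_years: float, access: list[str]) -> str:
    return (
        "You are an internal assistant.\n"
        f"The user is a {role} with {tenure_years:g} years at the company.\n"
        f"They have access to: {', '.join(access)}.\n"
        "Only point them to systems they can access; otherwise suggest who to contact.\n"
        "Match the depth and tone of your answer to their expertise."
    )

print(build_system_prompt("marketing manager", 0.5, ["Confluence", "HR portal"]))
```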

Nicolay Gerold: Yeah. And how do you use the information the user is giving you, like in a chat system, or if he's just querying? How do you use the information he is providing to enrich the persona over time? Do you have an extraction running on top of that as well?

Nirant: So there isn't one on the persona level, but for longer conversations, we have one which is running basically on a conversation or session level, which does some sort of named entity extraction, is how I would call it, or topic extraction if you're feeling [00:41:00] fancy. But in practice, it's just named entity extraction, and sometimes we will use domain-specific entities. That's about it, and a summary of the conversation so far.

The reason we do that is because the LLMs have an input context window beyond which they start hallucinating, and the cost goes up and the LLM gets slower as the conversation gets larger. So it allows us to keep reducing that. We will just add, let's say, a five-line summary at the top saying we have discussed so-and-so so far, and then the rest of the conversation continues.

So let's say the most recent 3000 tokens.
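
A rough sketch of that session-level memory: a short running summary plus extracted entities at the top, and only the most recent turns kept verbatim. `summarize` stands in for an LLM call, and spaCy is one possible way to get the named entities.

```python
# Sketch: compress chat history into summary + entities + the most recent ~3000 tokens.
import spacy

nlp = spacy.load("en_core_web_sm")  # or a domain-specific NER model

def compress_history(summary: str, messages: list[str], summarize, max_tokens: int = 3000) -> str:
    recent, budget = [], max_tokens
    for msg in reversed(messages):          # keep the most recent turns
        cost = len(msg.split())             # crude token estimate
        if cost > budget:
            break
        recent.insert(0, msg)
        budget -= cost
    older = messages[: len(messages) - len(recent)]
    if older:
        summary = summarize(summary, older)  # fold older turns into a ~5-line summary
    entities = sorted({ent.text for msg in messages for ent in nlp(msg).ents})
    return f"Summary so far: {summary}\nKeywords: {', '.join(entities)}\n\n" + "\n".join(recent)
```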

Nicolay Gerold: Yeah, and how do you use the entities of the session?

Nirant: We just mention it as "keywords:" and all the entities, nothing fancy. And I would love to throw DSPy at it and see if it helps, but I've just not had an opportunity to do so.

Nicolay Gerold: Yeah, I was also always thinking about basically using a tokenizer, more of an old-school one, and basically using all the [00:42:00] unknown terms as entities as well. This was always something that was really of interest to me, because often, especially in academic literature, you have new terms being invented all the time, and it tends to be that this is the relevant word you have to use in search as well.

Nirant: Yeah. That, that, that would be clever. I have not thought about it. Yes. That could work.

Nicolay Gerold: And what would you say is missing from the entire RAG space? So if you could wave like a magic wand and some solution would pop up suddenly, what would you wish for?

Nirant: Fun question, magic wand question. The magic wand answer is that if I were to just dump my information into some system, it should auto-configure based on feedback from the user; it should just happen without human intervention. I should be able [00:43:00] to set a budget in terms of, let's say, compute or LLM cost or whatever that is.

And then latency and throughput, and say, okay, these are my constraints, now please self-optimize. Because at the end of the day, it is a constrained optimization problem, and I should be able to run grid search, param search across all of these components and just have it run, right?

So all the tricks which we just discussed should have standard default implementations and would just get thrown at this problem by default, and they should just work with human feedback. Like the more feedback you have, the more it self-optimizes over time.

Nicolay Gerold: And if people want to stay up to date with you or hire you for a project where can they get in touch with you?

Nirant: I am at nirantk.com/consult if you want to speak to me. It also has my email, phone number, everything. And if you want to read what I'm reading, I post that quite often on Twitter and LinkedIn. So I'm nirantk on Twitter and Nirant Kasliwal on [00:44:00] LinkedIn.

So what's the takeaway, especially on the applied side? Before we get into that: I love Nirant, I think he's a great guy. Also, if you haven't seen it yet, he has a course coming out on RAG, starting with the basics and also improving it and scaling it up. We will put the link in the show notes, and I think it's really worth checking out. He also has some really great guest speakers coming on, among others Tufin from Rutgers and a bunch of other really interesting people. Now for the takeaways.

So first of all: that embedding similarity isn't enough. I think most people who are in the field and really building stuff already realize that. But the stuff he introduced on top of it, especially at scale, what you can do: basically classify the different [00:45:00] query types, like what the user is looking for, but also what to set as filters. For example, the example he gave on the collared shirts: the shirt has its color,

it has patterns and stuff like that, which you can extract. But also the query routing. I think it's very interesting that if the system actually realizes, hey, this is a very urgent or fast query, it doesn't route through the classification but directly does the search, and focuses more on returning results in a fast way, rather than doing some additional steps which potentially could lead to better results. On the LLM side, we got some basics, like use few-shots, but also try to distill it more onto a smaller model.

In the beginning especially, he mentioned that one, two, three billion parameters is the sweet spot, and seven to ten billion is more of the middle that doesn't pay off. I'm not entirely sure about that, to be honest. I haven't done so many [00:46:00] different comparisons. I was rather more on the side where I had compute requirements which had to be fulfilled.

So I had to go with a really small model. I haven't really benchmarked the one billion or three billion versus seven billion; it could be interesting to check out. But his arguments were on the side that it's easier to spot the errors in the smaller models and easier to make more iterations, which I would agree with, because the training is faster.

And also you can run it on a smaller GPU, so you can run higher batch sizes, which often can be important as well.

The user mechanisms are interesting as well, but what was most interesting is Latent Scope, the tool he mentioned for query clustering and analysis, where basically you look at the different types of queries, or the different queries that come in, you classify them, and you could also add stuff like [00:47:00] geography, like where are the users coming from?

And you basically try to find patterns. Another thing: you're basically grouping the queries, you're embedding them, you're throwing a UMAP on them, so you have a two-dimensional space you can investigate better. And then you basically add certain coloring schemes on top of it.

So for example, you could use the different query types, like whether this is a head, torso, or tail query. If you don't know what that is, I have a YouTube video on my channel on the three different query types. And you basically classify each of those, and then you can see, hey, which queries performed well and which didn't, based for example on the implicit feedback you got, whether the customer purchases when you're on an e-commerce site, stuff like that. Also for evaluations.

I think the Flesch-Kincaid score is always a good idea, especially if you are [00:48:00] running the generation part of a RAG as well, because in most cases you want to be at, I think it's the eighth to tenth grade level. The last one is the error handling, like a lot of checks on hallucinations. One is, for example, an NLI model you can run. But also, if you use an LLM for filtering or metadata extraction, that you cross-validate it with other mechanisms.

So he mentioned the Stanford date-util library, where you basically extract the time from the text as well, and when the two basically diverge from each other heavily, where you basically set a threshold, you assume the extraction is wrong and you don't really use it.

Yes, I think ColPali is very interesting. At the stage when I recorded the episode, right [00:49:00] when the paper came out, I had tried it for the first time, so I was a little bit doubtful. But what you have seen already with a few new releases in the last few weeks is that visual language models are really picking up.

So back then, I had tried visual language models, but they weren't good enough. Now I would say ColPali is a really interesting model, especially in combination with visual language models, because you don't have to run any extraction, which makes it way more powerful.

So this is something I got wrong in this episode, so basically I'm correcting that here. And I'm really excited to try out more stuff with ColPali. We are at the moment looking at new companies where we can implement some stuff with ColPali, probably in the healthcare and medical domain or in finance. Yes, I will keep you posted on that.

I will probably also do an episode once we have a use case, detailing how it works and how to implement ColPali in some detail. So then:

Stay tuned. Let me know in the [00:50:00] comments whether you liked it and also the topics that you would like to hear more on. And yes, have a good one. Talk to you soon.