How AI Is Built

Today, we're talking to Aamir Shakir, the founder and baker at mixedbread.ai, where he's building some of the best embedding and re-ranking models out there. We go into the world of rerankers, looking at how they can classify and deduplicate documents and prioritize LLM outputs, and we delve into models like ColBERT.
We discuss:
  • The role of rerankers in retrieval pipelines
  • Advantages of late interaction models like ColBERT for interpretability
  • Training rerankers vs. embedding models and their impact on performance
  • Incorporating metadata and context into rerankers for enhanced relevance
  • Creative applications of rerankers beyond traditional search
  • Challenges and future directions in the retrieval space
Still not sure whether to listen? Here are some teasers:
  • Rerankers can significantly boost your retrieval system's performance without overhauling your existing setup.
  • Late interaction models like ColBERT offer greater explainability by allowing token-level comparisons between queries and documents.
  • Training a reranker often yields a higher impact on retrieval performance than training an embedding model.
  • Incorporating metadata directly into rerankers enables nuanced search results based on factors like recency and pricing.
  • Rerankers aren't just for search—they can be used for zero-shot classification, deduplication, and prioritizing outputs from large language models.
  • The future of retrieval may involve compound models capable of handling multiple modalities, offering a more unified approach to search.
Aamir Shakir:
Nicolay Gerold:
00:00 Introduction and Overview
00:25 Understanding Rerankers
01:46 MaxSim and Token-Level Embeddings
02:40 Setting Thresholds and Similarity
03:19 Guest Introduction: Aamir Shakir
03:50 Training and Using Rerankers (Episode Start)
04:50 Challenges and Solutions in Reranking
08:03 Future of Retrieval and Recommendation
26:05 Multimodal Retrieval and Reranking
38:04 Conclusion and Takeaways

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

===

[00:00:00]

Introduction and Overview
---

Hey, everyone. Welcome back to How AI Is Built. So today we're going to hit rerankers. I did not want to go into the basics behind rerankers in the upcoming episode, so I want to give you some background right now. If you don't need it, or if you just want to get to the meat directly, you can jump ahead roughly two minutes.

I will put a timestamp in the show notes, so you can jump directly into the episode.

Understanding Rerankers
---

So rerankers, what do they allow you to do? Rerankers, in the end, allow you to compare the query and a bunch of documents. Traditionally, this meant the document is already in text form and is embedded. Now you can also do the same with images or potentially other types of data or modalities.

You typically see them at the end of your retrieval pipeline. The pipeline is usually query understanding -> retrieval -> reranking: you embed each [00:01:00] document you have in the database, and on retrieval we get the embeddings and run the reranker on the query and the set of documents we retrieved. All the reranker tells us is whether the query and the document match up. And again, the document could be an image, could be text, could be whatever. We are basically comparing the embeddings. But as opposed to regular retrieval, we are not just using an average embedding.

So when you embed a document, you embed every token, and typically you take the average. In a reranker, you have token-level embeddings for the query and for the document, and you calculate something on top of those.

MaxSim and Token-Level Embeddings
---

It's called MaxSim. MaxSim basically takes each query token for a specific query, calculates the similarity of that token with each token in the document, [00:02:00] and then takes the maximum. It repeats that for each query token in the query, and then you sum up all of those max similarities, which gives you the score for the document overall. When you repeat that for each document, you have a set of scores, one for each document. And to actually filter down the retrieved documents,

you just set a threshold. Documents with scores above this threshold are considered relevant; documents with scores below it are considered irrelevant.
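To make the scoring just described concrete, here is a minimal MaxSim sketch in Python. The token embeddings are random stand-ins rather than the output of a real late interaction model, and the threshold value is arbitrary:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching document token, then sum."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                         # (num_query_tokens, num_doc_tokens) cosine similarities
    return float(sim.max(axis=1).sum())   # best document token per query token, summed up

# Stand-in token embeddings; a real setup takes these from a late interaction model.
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 128))                       # 5 query tokens, 128 dimensions
docs = [rng.normal(size=(n, 128)) for n in (40, 80, 25)]

scores = [maxsim_score(query, doc) for doc in docs]
threshold = 1.0                                          # found by trial and error on your own data
relevant = [i for i, s in enumerate(scores) if s > threshold]
print(scores, relevant)
```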

Setting Thresholds and Similarity
---

And I think the immediate question would be: how do you set the threshold? That's a little bit more art than science, and you will hear more about it in the episode, but in the end it comes down to trial and error. The base concept you have to understand is similarity. You can encode [00:03:00] anything as similar by creating a data set of input and output pairs which count as similar under your specific metric, and the reranker then tries to learn that association. Cool.

We got the basics down. So let's get into the episode.

Guest Introduction: Aamir Shakir
---

And today we are continuing our series on search. Aamir Shakir is the founder and baker at mixedbread.ai, where he's building some of the best embedding and re-ranking models out there. We will talk about rerankers, how you can use them to classify and deduplicate documents, prioritize LLM outputs, and also how models like ColBERT work and how you can use them. Let's do it.

Training and Using Rerankers
---

Aamir Shakir: If you want to train on a BERT scale or something, just get eight A100s and run them for a couple of days until you can train it. But you're right. So I think it just makes sense: use [00:04:00] BERT, start training it. Or don't even use BERT.

For example, if you're using embedding models, use an E5 unsupervised or so, which is trained on a lot of data, and just fine-tune it for yourself. It'd be really stupid to start from scratch. So don't do it.

Nicolay Gerold: I'm really curious: how much have you played around with ColPali already?

Aamir Shakir: A lot, actually. The thing is, we as a team believed a lot in late interaction because it makes so much sense. It's interpretable. You know what's happening. You see the failure cases and you understand why a failure case is happening, in the sense that you can look at the tokens and how the similarity comes about, right?

So you see, okay, hey, the model is confused here and there. So late interaction makes a lot of sense. We've known that for a long time, but it has a lot of drawbacks, especially if you start scaling it up, right?

Challenges and Solutions in Reranking
---

Aamir Shakir: Because the token consumption is just extreme. So for people who don't know what late interaction is: in late interaction you do a forward pass, and in these models you [00:05:00] tokenize every input, basically.

Normal embedding models just give you one dense vector as the output, where you normally take the mean over every token embedding or just take a classification token. For ColBERT, you store the embedding for every token, and later, when you want to compare them, you basically compare all tokens against each other. Then you can see, hey, how the tokens match each other, how similar they are, and where the failure cases are happening.

So that's why it's more interpretable. And the issue is, since you're storing all token embeddings, it's sometimes three or four hundred times more storage you need. And scaling it up also makes it super expensive, not only storage-wise but also compute-wise. But there are ways to do it better now.

Nicolay Gerold: Yeah, and I think if you can bet on one thing, it's that storage and stuff like that will become cheaper and we'll also become more efficient at storing the [00:06:00] stuff. And especially, Jo Bergum is already doing a bunch of stuff in Vespa with, I think, hamming distances

Aamir Shakir: Yeah, binary

Nicolay Gerold: to make it more efficient.

Aamir Shakir: Yeah, exactly. So the idea there is, and we published a blog post on that together with Hugging Face and also did some follow-up experiments, that you don't need all the information stored in those vectors. Meaning, first of all, you can get rid of dimensionality by doing something like representation learning.

So you have fewer dimensions, and you can also get rid of precision, meaning instead of storing full float32 you can just store int8 or even just one bit per dimension, which then allows you to do stuff like hamming distance, which is super fast on modern CPUs because it's just two instructions, right?

But what we found is that the performance loss there is really hard to take. We prefer to do more stuff like int8; this works pretty well. And yeah.
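As a rough illustration of the quantization options mentioned here, a small sketch with placeholder vectors: int8 quantization with a per-dimension scale, and binary quantization packed to one bit per dimension with a Hamming distance computed via XOR and a bit count. The numpy version is just for clarity; the CPU-instruction speedup Aamir refers to comes from native XOR and popcount:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128)).astype(np.float32)      # placeholder float32 token embeddings

# int8: scale each dimension into [-127, 127]; roughly 4x less storage than float32.
scale = np.abs(emb).max(axis=0) / 127.0
emb_int8 = np.round(emb / scale).astype(np.int8)

# Binary: one bit per dimension, packed into bytes; roughly 32x less storage.
emb_bits = np.packbits((emb > 0).astype(np.uint8), axis=1)  # shape (1000, 16)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance on packed bit vectors: XOR the bytes, then count the set bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

query_bits = np.packbits((rng.normal(size=128) > 0).astype(np.uint8))
dists = [hamming(query_bits, row) for row in emb_bits]
print(min(dists), max(dists))
```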

Nicolay Gerold: What I'm really looking forward [00:07:00] to is the explainability work as well, when you put an autoencoder on top of that and actually try to decode what the different features are, and whether it will actually find, okay, tables or different elements in PDF documents as their own features.

And basically we can reconstruct it from there.

Aamir Shakir: Yeah, this is going to be super, super interesting. But what you can do already right now is produce a heat map, seeing where the model has the highest similarity, which is super crucial for a lot of domains to understand what's happening. Because embedding models with a single vector representation are basically a black box.

We don't know what's happening. There are some pushes; for example, there was a paper from Google earlier this year where they trained a decoder on the output of an embedding model to reconstruct it and see, hey, what is the model representing there. And there are a lot of fun experiments on what happens if you start changing the embeddings.

But it's not like you really know what's happening. With ColBERT, though, you really get this interpretability part, [00:08:00] which is really nice.
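The heat map Aamir describes is easy to sketch: compute the query-token by document-token similarity matrix and plot it. The embeddings below are random placeholders; with a real late interaction model like ColBERT you would take its per-token outputs instead:

```python
import numpy as np
import matplotlib.pyplot as plt

query_tokens = ["what", "is", "late", "interaction"]
doc_tokens = ["colbert", "stores", "one", "embedding", "per", "token", "."]

# Placeholder embeddings; in practice these come from the model's token-level outputs.
rng = np.random.default_rng(1)
q = rng.normal(size=(len(query_tokens), 128))
d = rng.normal(size=(len(doc_tokens), 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)

sim = q @ d.T   # rows = query tokens, columns = document tokens

fig, ax = plt.subplots()
ax.imshow(sim, cmap="viridis")
ax.set_xticks(range(len(doc_tokens)))
ax.set_xticklabels(doc_tokens, rotation=45, ha="right")
ax.set_yticks(range(len(query_tokens)))
ax.set_yticklabels(query_tokens)
ax.set_title("Query vs. document token similarity")
fig.tight_layout()
plt.show()
```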

Future of Retrieval and Recommendation
---

Nicolay Gerold: An exercise I really often like to play through when I'm building stuff is: what will it look like in three to five years? Because then you know how to position yourself in the best way when you're building something in tech or a tech startup. How do you imagine retrieval and recommendation pipelines will look in three to five years?

Yeah.

Aamir Shakir: I think it's an interesting question, because dense vector representations have been around for a while. If you think about Annoy, the library for vector search from Spotify, I think it was created in 2013 or 2014. So the stuff has been around for a while, and it took so long for mainstream tech to start picking it up, right?

I think five to six years. But for the future of retrieval, I wish that at one point we have a compound system which is really good at query understanding and also at multifaceted search, so one that really understands what the user wants [00:09:00] and tries to find the right context and also ranks it in the right way.

Because we have so much unstructured data lying around, and it's really hard to make a lot of sense out of it without a lot of hacks. And I wish that we maybe have a unified or compound system we can just feed it into and have nice queries. I think with LLMs, if they continue developing like this, we could have a nice natural language interface to data.

And let's see how it will play out and how the architecture of the future will look.

Nicolay Gerold: Yeah, and if you think about the standard retrieval architecture, you have query understanding, retrieval, and re-ranking, and the re-ranking part is often there because in the actual retrieval, especially when we are doing semantic search, we know that the performance isn't so good.

So we have the, to actually give us. More relevant results based on the user query as well. Do you actually interpret it like that as well?

Aamir Shakir: So I think my understanding of [00:10:00] this is that it's more of a refinement step, right? Because it's more a semantic comparison between the query and the documents. I don't think there's an AI model right now that understands what a query is. Because what they do in the re-ranking step, depending on how they're trained, is finding out for every query and document how relevant they are to each other, or how relevant the document is to the query.

And relevance can mean a lot of things. For example, in retrieval it's how well something should be ranked, or if you're trying to find duplicates, it's how similar they are, and so on and so forth. It really depends on how it's trained. But for me, it's more like a super, super smart comparison mechanism.

Nicolay Gerold: Yeah, and if we look at the reranker, maybe let's go more into the architecture side. What are the different components of the reranker that actually enable it to say the document and the query are relevant to each other?

Aamir Shakir: So let's maybe take one [00:11:00] step back and look at how a retrieval pipeline works, to understand the difference between a re-ranking model and, for example, an embeddings model, right? Let's say in the first step we do semantic search. How semantic search works is: you put in data, like text or an image, whatever, and you get an embedding out, which represents this information.

And later, when you do a search, you encode your query and get another embedding out, and then you do some distance metric, for example cosine similarity. And this is a bi-encoder, right? Because you have an encoder for the document, and you have an encoder for the query. And at inference time, the model encoding the document doesn't look at the query, and the model encoding the query doesn't look at the document.

So it just tries to guess a point in the space where the information lies, and then you hope that they lie next to each other. A reranker, on the other hand, is more of a cross-encoder. Normally, you can also use GPT or other LLMs for that, but let's just stick with the cross-encoder, which is the more classical approach. And what you do there is you put in a [00:12:00] query and the document, and then the model looks at both things, the query and the document, and tries to understand how similar or relevant they are to each other and provides an output score,

for example, how relevant the document is given the query. And this is obviously computationally heavier, because imagine you have 100 documents, right? Then you have to do 100 inferences for query and document. And this computation is way heavier than, for example, with the bi-encoder, where you would just encode the 100 documents once and the query once and then do super cheap distance metrics.
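A minimal sketch of the two setups, using the sentence-transformers library; the model names are just commonly available examples, not the models discussed in the episode:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "how do rerankers work?"
docs = [
    "A cross-encoder scores a query and a document together in one forward pass.",
    "Bi-encoders embed queries and documents independently and compare the vectors.",
    "Sourdough needs time, flour, water, and salt.",
]

# Bi-encoder: encode everything separately, then do a cheap distance metric.
bi = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi.encode(docs)                         # document embeddings can be precomputed and stored
query_emb = bi.encode(query)
cos_scores = util.cos_sim(query_emb, doc_emb)[0]  # one cheap comparison per document

# Cross-encoder (reranker): one forward pass per (query, document) pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = reranker.predict([(query, d) for d in docs])

for doc, c, r in zip(docs, cos_scores, ce_scores):
    print(f"cos={float(c):.3f}  reranker={float(r):.3f}  {doc[:60]}")
```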

Nicolay Gerold: Yeah, and have you actually played around with query rewriting for the re ranker as well?

Aamir Shakir: Yeah, we have, actually. It really depends on how your model was trained. For example, if your model was trained mostly on question-answer pairs and then you give it keyword queries, it will start sucking hard. You should really bring the query into the format the model was trained on, or train it in the way it will be used later.

And this was a big learning from user feedback when we published our first [00:13:00] model, because we mostly trained it on question-answer pairs, since that was mostly our setting with customers and so on and so forth. But people were also using it, for example, for simple keywords and so on and so forth.

And then they were surprised. For example, you had a query where the word, I don't know, 'king' comes up, and a document where 'king' appears, or that's just 'king' a hundred times, but it was not number one, even though it occurs exactly in there, because the model was not trained for those tasks.

And yeah, for version two we are looking into incorporating this data.

Nicolay Gerold: Yeah, and I think to really nail the point home: the main issue in all of retrieval is basically that the query, or the way the query is written, and the document mostly live in different spaces, and your work is actually in mapping those two spaces together, or creating a projection from the one [00:14:00] into the other.

Aamir Shakir: Yeah, exactly, you're totally right there. It's super interesting to see, because our data is normally super well written, right? You have perfectly written queries, without spelling mistakes, with proper grammar, proper writing. But if you look at real user data, or if you think about how you use Google, right?

It's just something written super quickly, with a lot of spelling mistakes, no proper grammar, no proper casing. So there you have to really be careful. As you're saying, bridge the gap between how your training data looks and how the user queries look. The best thing is having user queries to train your model.

Nicolay Gerold: Yeah. And I think for me, training an embedding model probably has the highest payoff, especially when I compare it to training your own LLM or something like that. Do you think training a reranker has a similarly large impact, or is a reranker in the end a component with a little bit more give, in that it tends to already perform well out of the [00:15:00] box?

Aamir Shakir: I would say it's even better to train your re-ranking model than your embedding model, because the issue when you train your embedding model is when new knowledge comes up. For example, let's say I trained a political embedding model. In politics a lot of things change, or law changes really quickly and new laws come out, and then you need to retrain your model so it understands what changed, and then you need to re-embed your whole corpus, which is super painful and expensive.

What we prefer is to fine-tune the re-ranking model, because what does the re-ranking model do, right? It gets the candidates and the queries and just provides scores, and you're not storing those anywhere, or normally not storing them anywhere. So this means you can continuously fine-tune the re-ranking model, also based on click rates and so on and so forth.

So I think for us in our experiments with customers, training the re ranking model had the most impact compared to the embedding model.
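For orientation, here is a hedged sketch of what fine-tuning a small cross-encoder-style reranker on (query, document, relevance) data can look like with plain transformers and PyTorch. This is not mixedbread's training setup; the base model, data, and hyperparameters are placeholders:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Toy (query, document, relevance) triples; in practice these come from your own
# labeled pairs, logs, or click data.
train_triples = [
    ("best phone under 500", "The Pixel 7a costs 499 and has a great camera.", 1.0),
    ("best phone under 500", "How to bake sourdough bread at home.", 0.0),
    ("reranker long context", "Cross-encoders can score long passages against a query.", 1.0),
]

model_name = "distilbert-base-uncased"  # placeholder base; any BERT-like encoder works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

class PairDataset(Dataset):
    """Tokenizes query/document pairs and attaches a float relevance label."""
    def __init__(self, triples):
        self.triples = triples
    def __len__(self):
        return len(self.triples)
    def __getitem__(self, idx):
        query, doc, label = self.triples[idx]
        enc = tokenizer(query, doc, truncation=True, max_length=256,
                        padding="max_length", return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(label, dtype=torch.float)
        return item

loader = DataLoader(PairDataset(train_triples), batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(1):  # one epoch over the toy data
    for batch in loader:
        # With num_labels=1 and float labels, the model uses a regression (MSE) loss.
        out = model(**batch)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```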

Nicolay Gerold: And I had the episode with neos, and one of the big issues with embedding models is long context, and he mentioned he hasn't seen an embedding model which can handle that. How do rerankers compare on long-context tasks?

Aamir Shakir: So I think they're pretty good at it, also given our benchmarks, because rerankers don't have this limitation of representing information in one single embedding, right? So what they can do is actually look at the whole context and your query and say, hey, how relevant is this context to your query?

Even though you have, for example, a first passage talking about a browser, I don't know, a second passage talking about politics, and a third part talking about technology, for example the new iPhone, right? And you ask a question about the iPhone. The model will still understand, hey, the third part is about the iPhone, so this context is still relevant to this query.

In our [00:17:00] experiments, re-ranking actually works really well with long context as well.

Nicolay Gerold: Yeah, nice. And do you apply any pre-processing steps to the documents for the reranker as well, like for example stopword removal or something similar?

Aamir Shakir: No, we don't do anything like that. It works pretty well without doing this. Especially with stopword removal, that's the tricky thing about neural search and this whole modern stack, right? Because the stopwords still mean something for the context, right? To contextualize it properly.

The stop words have a meaning. And it's pretty relevant for the model to understand this as well.

Nicolay Gerold: Yeah, I think stopwords have become one of my biggest pet peeves, because I started to actually look into the stopword lists when I'm doing NLP pipelines, and you can imagine what kind of biases are in those lists. Most of the time when you use them, you should create a custom one.

Aamir Shakir: Yeah, but I don't know if you have [00:18:00] seen it, I don't think it's worth the time investment, because the models are trained on such a variety of data. So they, let's say, understand what's relevant and what's not relevant. I would just feed it in.

Nicolay Gerold: Yeah, likely you're rather causing a performance downgrade, because your documents suddenly are not as similar to the training data anymore. That's so interesting. And with a reranker, especially in a corporation, I think e-commerce is the prime example, you often like to add additional factors to the re-ranking model.

What would be your decision process for, say, the timeliness of an item: it's a new one and I want to rank it higher. How would you decide whether I try to encode it in the reranker, meaning in my training data set I prioritize items that are newer, so it actually learns some kind of semantic [00:19:00] similarity, or I add that as an additional factor at the end,

which can basically just be added as a score for prioritization.

Aamir Shakir: Yeah. So unfortunately the answer here is always: it depends on your data and your use case. But what we found out is that rerankers are really good at understanding structured information within your input. Imagine you have a JSON field, for example, where you have a bunch of metadata as well, and you train the model to understand this metadata.

This works really well. You could put it in there, but again, you have to specifically train your model for this. But if you want to use just an off-the-shelf model, then it's smarter to do it at the end and just do weighting. For example, this metadata stuff also helps you do recency stuff, right?

Because you can't really put recency, or the date, into embeddings. We tried it; there's a hack: you take the vector and encode the [00:20:00] date as two extra dimensions at the end, and you also add it to the query, and then you just do the dot product, because this will boost it up the more recent it is, right?

That's what's done in recommender systems. But we found that adding it as structured data in the reranker works really well. You have to try it, though, and if you can, or if you have a post-processing pipeline, just use that.

Nicolay Gerold: Yeah. And how do you actually represent it? For example, if you have a date, you likely have to put it in the query and in the document, like what the current date is. Did you represent it as an actual timestamp?

Aamir Shakir: Oh, we just put it in as a date. We really have a JSON; we put in JSON-structured information and just put 'date:', for example, and it works pretty well.
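A small sketch of that idea: flatten the document plus a JSON metadata blob into one string and add the current date to the query before scoring. The exact format only helps if the reranker was trained on it, so treat the template and the model name as placeholders; an off-the-shelf reranker would need fine-tuning for this:

```python
import json
from datetime import date
from sentence_transformers import CrossEncoder

def render(text: str, metadata: dict) -> str:
    """Flatten a document plus its metadata into one string for the reranker."""
    return f"{text}\n{json.dumps(metadata)}"

docs = [
    {"text": "iPhone 13, refurbished, great condition.", "meta": {"price": 449, "date": "2024-09-01"}},
    {"text": "iPhone 15 Pro, brand new.", "meta": {"price": 1199, "date": "2024-09-20"}},
]

# Adding today's date to the query gives a model trained on this format a reference point for recency.
query = f"iphone below 500\ncurrent date: {date.today().isoformat()}"

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder model
scores = reranker.predict([(query, render(d["text"], d["meta"])) for d in docs])
print(sorted(zip(scores, [d["text"] for d in docs]), reverse=True))
```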

Nicolay Gerold: That's really interesting. What other experiments have you run with different representations or additional information to add?

Aamir Shakir: Pricing, for example: hey, I want an iPhone below 500, or a phone below 500. Works really [00:21:00] nicely. What we also try to push is that when you search for stuff like emails, for example, you normally want to have the most recent ones ranked up. And we just wanted to build a compound system instead of having filters. Actually, what I really don't like about current systems is that you have so many hacks around them, right?

So you need to do metadata extraction, and this sucks. It doesn't work well. It's painful. What we try is to really incorporate that into the models, so that they learn or understand it and then just put out the right output. The

Nicolay Gerold: Yeah. So in my opinion it's getting easier with LLMs to do all the metadata or field extraction, to do faceted search and stuff like that. But you have to be way more explicit, and you're less flexible in the end.

Aamir Shakir: thing there is also that you have to sit down and really nail it down for all the different [00:22:00] edge cases. That's what we find painful, especially if you have a lot of different data sources and your data sources are not unified. Think of a retailer who has maybe a hundred different ontologies from different providers; then good luck, have fun really unifying everything and extracting it in a nice way.

And we found that having a compound model which can do that is a really compelling solution.

Nicolay Gerold: What are some of the more creative ways you have actually seen the output of a reranker used?

Aamir Shakir: Yeah, you can use it pretty well for classification, for example for zero-shot classification. It works surprisingly well for deduplication. We see a lot of people using it for LLM output scoring, so this also works well, for example for routers, to figure out which model to use. Works really well.

Yeah. Actually, as I said, for me, cross-encoders are a really smart way of comparing things, and there you have millions of use [00:23:00] cases for how you can use them.

Nicolay Gerold: How did you use it for classification? What are the inputs and the output?

Aamir Shakir: Yeah, so one input could be, for example, your document, and of course you have to tune it for this a bit, but the other input could be the label. And then you just see, across all your labels, which one has the highest score, basically, and classify it as that.
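A minimal sketch of that reranker-based zero-shot classification: score the document against every candidate label and take the best one. The model name is a placeholder, and as Aamir notes, you would normally tune the model a bit for this use:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder model

document = "The quarterly revenue grew 12% while operating costs stayed flat."
labels = ["finance", "sports", "cooking", "politics"]

# Score the document against every label and pick the highest-scoring one.
scores = reranker.predict([(document, label) for label in labels])
prediction = labels[int(scores.argmax())]
print(prediction, scores)
```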

Nicolay Gerold: Yeah, it's so interesting. Have you ever played with keeping the same query? So for example, you always have the five different classes of your classifier, you put them together in the query, each as a single token in your model, and you actually use the late interaction scores

that the text has with each of those single tokens as a classifier, taking the token which has the highest score.

Aamir Shakir: No, I haven't tried it yet, but this sounds like a fun experiment to do. Yeah,

Nicolay Gerold: You only have to find different tokens which are unique, and then basically map each class you have to one token. And

Aamir Shakir: BERT actually has a lot of unused tokens for exactly this idea, so you can do experiments like this.

Nicolay Gerold: If you want to do that, I would be up for it. I think that sounds like a really fun experiment

Aamir Shakir: I think maybe one of the listeners could do that. It actually also sounds like a cool project for an internship or so.

Nicolay Gerold: Yeah, a master's thesis.

Aamir Shakir: Yeah, let's go.

Nicolay Gerold: Nice. And on the other side, at the moment, what I'm often using in production is NLI models to actually determine factuality. Have you worked more around that aspect as well? And how would you actually implement it, using a reranker to check whether the relevant information is in the source data for a given statement?

Aamir Shakir: I think a cross-encoder is the way to go here. There was also a paper last year, I think, where they did exactly this: they used a cross-encoder to align scores [00:25:00] between the output of the LLM and the factual documents, and to basically add citations. So they iterated over every sentence, I think, and then provided the score.

And if it was the highest score and above a specific threshold, they said, okay, hey, that's the citation. And you can perfectly use it for that.
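Roughly, that citation idea could look like the sketch below: score every answer sentence against the source documents and attach the best source if it clears a threshold. The model name, the naive sentence splitting, and the threshold are all placeholders, not the paper's setup:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder model
threshold = 0.5   # needs tuning on your own data, as discussed next

sources = [
    "Rerankers compare a query and a document directly in one forward pass.",
    "ColBERT stores one embedding per token.",
]
llm_answer = "ColBERT keeps a separate embedding for every token. It was invented in 1995."

for sentence in llm_answer.split(". "):   # crude sentence splitting for the sketch
    scores = reranker.predict([(sentence, source) for source in sources])
    best = int(scores.argmax())
    if scores[best] > threshold:
        print(f"{sentence!r} -> cite source {best} (score {scores[best]:.2f})")
    else:
        print(f"{sentence!r} -> no supporting source found")
```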

Nicolay Gerold: Yeah. How do you actually think about tuning a threshold?

Aamir Shakir: Yeah. It always depends on your use case, right? I find it always fuzzy. I'm not a big fan of it, because it feels like the model should somehow be able to reject something, right? Or there should be some mechanism where we know if it makes sense or doesn't make sense. Because we found that sometimes, hey, even if it's below a specific threshold, the output still makes sense, or it's still relevant.

Then maybe your threshold is wrong, and threshold tuning is super hard. I don't have a perfect solution for this yet, but [00:26:00] maybe in the future we'll figure out how to get rid of this hack. Yeah,

Nicolay Gerold: Yeah.

Multimodal Retrieval and Reranking
---

Nicolay Gerold: And the part where we're moving more multi-modal: I think we already touched on ColPali. Do you imagine we will be adding more and more modalities to the re-ranking part as well? Because that's where I'm very torn on audio: in audio you could do it, but it's just way more convenient to actually translate it into text, because it has a one-to-one mapping.

Aamir Shakir: For audio, I think it makes sense to map it, but what about geospatial data, for example, like satellite imagery and so on, or data in biomedicine, where you have a lot of different formulas and so on, where you can't translate it back but you still need to retrieve it or rank it or whatever for your drug discovery use case and so on.

I think the future will be multi-modal. I hope we will have a [00:27:00] compound model which can do all modalities, basically. So one supermodel. I think ColPali, not the model itself but the idea and the paradigm, is the way to go. And it will be really interesting to see how we can add new data to that in the future.

And multi-modal re-ranking is also going to be super relevant, because you can basically reuse the adapters, or exactly the same idea as in ColPali, where you have different encoders for different modalities, and then one world model, let's say, which in this case is an LLM, which does the real computation.

And then you can use that for re-ranking, for example.

Nicolay Gerold: Yep. And if you would take one use case, for example biomedicine, like you pick a new domain, how would you actually go about creating a data set so that you can actually add the modality?

Aamir Shakir: Yeah. I don't know how it specifically looks for bio data, so let's maybe do geospatial, because I have some more experience there. For example, you want to do poverty prediction, right? Let's say poverty prediction. And you have a map or a data set where you have, for example, how poor a country or a region is,

the region and their total consumption per year. You could then construct a data set where you can ask: hey, what are the countries, or what are the areas, where it's below a specific threshold, or what are poor countries? So you bring it back into a QA data set. Then you can basically use those as the answers and use the techniques we already know, like contrastive learning or a ranking loss, et cetera, to train this model.

And what you need to know is that satellite imagery is mostly hyperspectral, meaning it doesn't just have the three channels, it mostly has seven or fourteen channels or something. So you need an encoder which understands that. So maybe here you modify your image encoder and train it so it understands this pretty well.

This you freeze, and then you put the data through your [00:29:00] hyperspectral vision encoder, get embeddings as output, and then you can feed those into your LLM, or whatever encoder or decoder model you have, to then generate embeddings or whatever you need.
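To illustrate the dataset-construction step just described, here is a toy sketch that turns a small consumption table into (query, positive, negative) triples for a contrastive or ranking loss. Regions, numbers, and the poverty threshold are invented, and in the real setting the text documents would be satellite image patches going through a hyperspectral encoder rather than strings:

```python
# Invented example rows: (region, yearly consumption per capita in USD).
regions = [
    ("Region A", 450),
    ("Region B", 2300),
    ("Region C", 900),
]
poverty_line = 1000  # hypothetical threshold for this toy example

query = "Which regions fall below the poverty line?"
positives = [f"{name}: consumption {usd} USD/year" for name, usd in regions if usd < poverty_line]
negatives = [f"{name}: consumption {usd} USD/year" for name, usd in regions if usd >= poverty_line]

# Training triples for a contrastive / ranking loss: (query, positive, negative).
triples = [(query, pos, neg) for pos in positives for neg in negatives]
for t in triples:
    print(t)
```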

Nicolay Gerold: What do you think is the trade-off between different embedding spaces and a joint embedding space?

Aamir Shakir: I think it's about aligning the distances of the outputs. For example, if you try to do image-to-image search in ColPali, the scores are way higher than when applying text-to-image search. So it's really not trivial to align those vector spaces, but there is work on how to do it properly.

So it's doable, and I think it's way better than having five different indexes, for the caption, the description above the image, then the image itself, and then some other random stuff, and then trying out some wild weighting or ranking schema to somehow merge them together. So I think having something [00:30:00] like ColPali, or a joint embedding space, is a way more compelling solution.

Nicolay Gerold: To add to the thread we were on before: if we were to add an additional modality, is it more prudent to actually train a model from scratch, or can an additional modality be added to the embedding space through just fine-tuning?

Aamir Shakir: I think a modality can be added pretty nicely. So I think what needs to be done is a bit of pre-training again on the model layer. But as we know, for example with all those audio models popping up, it's not so hard to do. So you can do it with not so much data, and then you basically just need to fine-tune it for your retrieval use case.

Nicolay Gerold: Yeah. Do you think, for a new modality, what new signals does it actually bring? If I already have audio data, image data, and text data, and now, [00:31:00] instead of using the text representation of audio, I'm using the native representation of audio and bringing it into an embedding space,

What is actually my benefit of doing that?

Aamir Shakir: I think it's the diversity, right? You give the model a better understanding of the world around it. For me, LLMs are of course language models, but also small world models, which try to model the world as well as they can. As we can see, if you just add image or video, it works, right?

It works really well. So if you add new stuff, you just make the model better, or the understanding of the model better. And what we also figured out is that the modalities support each other. For example, if you train a model just with images, it's also pretty good at text retrieval. And I think what would happen is that the more stuff you add, the more diverse the representation gets, or the better the model's understanding of the world gets.

And this will basically boost the performance of everything in general.[00:32:00]

Nicolay Gerold: Yeah. Do you think we also make it harder for the AI person who takes the model and wants to fine-tune it, make it harder to adapt it to a specific domain?

Aamir Shakir: In a perfect world we wouldn't need to adapt it, right? But we're not there yet. I think you would make it harder, but I think it would be smarter for the AI person to just pick out the two or three modalities they need. You rarely need all modalities, right? You can just basically ignore the rest. Because if you think about it, you never tune the encoder part, right?

You always tune the LLM, or the representation model part. And then you do LoRA training, and it doesn't matter what you input, so the training stays more or less the same. So if you just have text data, as I mentioned, what we figured out is that the audio performance will also increase, for example. So the different modalities will also improve just by training on one modality. That's what our [00:33:00] first experiments showed.

But yeah, it's a bit more clunky to work with them, so I think this is a bit more painful.

Nicolay Gerold: Yeah, I'm at the moment setting up a data set for ColPali, for basically doing more table retrieval and then finding the relevant things, and then also training a vision model for doing the table extraction. And it's very interesting, the amount of stuff you can do. But the synthetic data generation part is also so interesting, because you can now, especially with o1, easily just fake the data.

How do you think synthetic data generation will impact your work as well, and are you already experimenting with stuff around that?

Aamir Shakir: Yeah, so I think with synthetic data you always have to be careful. Either you generate your training data synthetically or your eval data, but not both. That's a bit of our learning. Because then you just sample from the same distribution, basically, [00:34:00] which is not your real-world, real user distribution. So you have to be super careful there.

But of course synthetic data is going to be the way to go for a lot of niche domains, and in general. What we do is we have more of a mixture of raw data, which we get through openly available data, web data, and so on and so forth, and also synthetic data, obviously. But we try to have a nice balance, because it's just more diverse.

And we hope that the model generalizes better.

Nicolay Gerold: Yeah, and do you have something you're shooting for? I talked to Nirant, for example, and he mentioned he wants to have at least 50 percent real data in there.

Aamir Shakir: No, it really depends on the use case, so we just see what works. I think having hard numbers there is not a good idea. You should always look at the use case. For example, if you're in a super niche domain, you may be happy that you can get 100 samples or so, or 100 documents, and the rest you have to generate synthetically, for example through few-shot prompting and so on and so forth. And then saying, [00:35:00] hey, I want at least 50 percent raw data, that's hard.

Nicolay Gerold: Yeah. Have you found any approach to actually check whether a document is still in the distribution of the domain you're sampling from?

Aamir Shakir: Yeah, manual checking, right? So reading through it. That's, I think, the one. What's missing right now is proper tooling, just tools to check those basic things, actually, right? Because this decides whether the model is going to work or not, or whether your work is paying off or not.

And it's somehow shocking, but it's also hard to create those tools. And I think, in general, we need better tooling for those things.

Nicolay Gerold: Yeah. I think at the moment, in any domain, whether that's embedding models, rerankers, or LLMs, the way to go is: look at your data. Whether you have synthetically generated data or new data: what's the input, what's the output? Read through it and create a data [00:36:00] set that you can use for evaluation and also for training.

Aamir Shakir: Yeah, exactly, that's the way to go. Unfortunately, it's pretty boring compared to keeping my GPUs busy. But I think that's the most relevant part, and that's the part we spend the most time on.

Nicolay Gerold: What's missing from the space? What would you say, more specifically, is missing in the retrieval space at the moment?

Aamir Shakir: I think a way to generate good evaluation data, because nobody wants to do that, it sucks, and that's missing. And also maybe a tool to measure the data quality of your training data and check stuff like, hey, is it in the distribution or not? I know Wamp Labs and so on are working on stuff like this.

But it's going to be really exciting if we can solve those issues.

Nicolay Gerold: Yeah. And what is something that you would love to see built, specifically, that would make your work 10 times easier?

Aamir Shakir: Huh, maybe a data miner for everything we need, for every domain, so you can just write in: hey, give me data for domain X, Y, it should be multi-modal, it should be in this format, and I just get my 10 million rows. This would be pretty nice.

Nicolay Gerold: Yeah, nice. And if people want to start building the stuff we talked about, where would you point them?

Aamir Shakir: Obviously this podcast, right? And I think your Twitter profile is a good starting point. Read through our blog posts on our website, and also Tom Aarsen's posts on Hugging Face, and the Transformers and Sentence Transformers documentation is a really good starting point.

Nicolay Gerold: Yeah. And for you, if people want to get in touch with you, hire you, or stay up to date, where can they do that?

Aamir Shakir: It would be obviously Twitter, LinkedIn, and email.

Nicolay Gerold: Nice, and I will put all [00:38:00] of that in the show notes.

Aamir Shakir: Yeah, this would be great.

Conclusion and Takeaways
---

So what can you take away for our applications? I think rerankers in general, even if you take, for example, Cohere's, are probably one of the biggest boosts you can add to your retrieval system without changing anything or fine-tuning a model, because out of the box a bunch of them will work well for a broad set of use cases.

So if you don't have one, I would give them a shot and see how they do. At the same time, I think rerankers push us a little bit further in the same direction as embeddings do: we don't really think about how we can actually map the query into the same space the document lives in, and which kinds of documents are actually relevant for a given query, or [00:39:00] which queries could be used to retrieve a document, which often helps in building a good retrieval system.

I think it pushes us further down that road because we rely more and more on AI and less on how we engineer the system. And this isn't anything against rerankers. It's just, I think you have to strike a balance so that you don't have an over-reliance on AI models, as they are more of a black box, even though rerankers are more explainable than bi-encoders or regular embedding models.

For me, the most important takeaway is that looking at your data, again, is a massive part of building good models. And especially if you want to integrate a reranker, look at the training data that was used to train the model, [00:40:00] or, if you use a closed-source one, for example Cohere's, look in their docs at what use cases they're advertising it for.

Because if your queries look very different from what they're advertising, or from what the model was trained on, it's very unlikely that the reranker will return any good results, and then you should be careful. For tuning the threshold of the reranker, I think you should train the reranker, look at how it scores your test set, and try to find an adequately well-working threshold.

One more really interesting part is encoding JSON, or putting metadata into the document as well. For example, a news item could be one where you add the timestamps, so the model learns that the [00:41:00] more recent the item is, the more relevant it is. And here, I think what's really interesting is that you can play with many different encodings.

So if you have timestamps, you could use the actual timestamp and add the current timestamp to the query as well, so the model can see whether they're identical or not. But you could also do something like a delta, how long ago it was, instead of a timestamp. You can think of many different ways to encode the same piece of information.
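A tiny sketch of the two encodings mentioned above, applied to the same hypothetical news item; how well either works depends entirely on what the reranker was trained on:

```python
from datetime import date

today = date.today()
published = date(2024, 9, 12)   # hypothetical news item

# Option 1: absolute timestamps in both the document and the query, so the model can compare them.
doc_absolute = f"Chip maker announces new GPU.\ndate: {published.isoformat()}"
query_absolute = f"latest GPU news\ncurrent date: {today.isoformat()}"

# Option 2: encode recency as a delta relative to query time instead of a raw date.
delta_days = (today - published).days
doc_delta = f"Chip maker announces new GPU.\npublished: {delta_days} days ago"
query_delta = "latest GPU news"

print(doc_absolute, query_absolute, doc_delta, query_delta, sep="\n---\n")
```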

I think that's very interesting. The same goes for something like e-commerce: if you want to prioritize high-ticket items, but only for a certain set of queries, whether it will actually work when you also put the prices of the items into the data, into the document. I think that's really interesting.

Yeah, next week we will be diving deeper into ColPali [00:42:00] specifically, with Jo Bergum, and I'm really excited for that. If you want to catch that, or stay up to date with any episodes, subscribe. Otherwise, if you have any feedback, if you didn't like anything, or if you want to hear about a specific topic or from a specific guest, just let me know in the comments or shoot me a message on LinkedIn, and I will catch you next week.