How AI Is Built

Ever wondered why vector search isn't always the best path for information retrieval?
Join us as we dive deep into BM25 and its unmatched efficiency in our latest podcast episode with David Tippett from GitHub.
Discover how BM25 transforms search efficiency, even at GitHub's immense scale.
BM25, short for Best Match 25, uses term frequency (TF) and inverse document frequency (IDF) to score document-query matches. It addresses limitations of TF-IDF by adding term-frequency saturation and document length normalization.
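For the curious, here is a minimal sketch of the BM25 scoring function in Python (using a Lucene-style IDF and the common defaults k1 = 1.2, b = 0.75; the tiny corpus and query are made up for illustration):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a plain Okapi/Lucene-style BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: in how many documents does each query term appear?
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # term-frequency saturation plus document-length normalization
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "the quick brown fox".split(),
    "the the the the the".split(),
    "a long document about many unrelated things and more things".split(),
]
print(bm25_scores("quick fox".split(), docs))
```
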
Search Is About User Expectations
  • Search isn't just about relevance but aligning with what users expect: 
    • GitHub users, for example, have diverse use cases—finding security vulnerabilities, exploring codebases, or managing repositories. Each requires a different prioritization of fields, boosting strategies, and possibly even distinct search workflows.
  • Key Insight: Search is deeply contextual and use-case driven. Understanding your users' intent and tailoring search behavior to their expectations matters more than chasing state-of-the-art technology.
The Challenge of Vector Search at Scale
  • Vector search systems require in-memory storage of vectorized data, making them costly for datasets with billions of documents (e.g., GitHub’s 100 billion documents).
  • IVF and HNSW offer trade-offs: 
    • IVF: Reduces memory requirements by bucketing vectors but risks losing relevance due to bucket misclassification.
    • HNSW: Offers high relevance but demands high memory, making it impractical for massive datasets.
  • Architectural Insight: When considering vector search, focus on niche applications or subdomains with manageable dataset sizes or use hybrid approaches combining BM25 with sparse/dense vectors.
Vector Search vs. BM25: A Trade-off of Precision vs. Cost
  • Vector search is more precise and effective for semantic similarity, but its operational costs and memory requirements make it prohibitive for massive datasets like GitHub’s corpus of over 100 billion documents.
  • BM25’s scaling challenges (e.g., reliance on disk IOPS) are manageable compared to the memory-bound nature of vector search engines like HNSW and IVF.
  • Key Insight: BM25’s scalability allows for broader adoption, while vector search is still a niche solution requiring high specialization and infrastructure.
00:00 Introduction to RAG and Vector Search Challenges
00:28 Introducing BM25: The Efficient Search Solution
00:43 Guest Introduction: David Tippett
01:16 Comparing Search Engines: Vespa, Weaviate, and More
07:53 Understanding BM25 and Its Importance
09:10 Deep Dive into BM25 Mechanics
23:46 Field-Based Scoring and BM25F
25:49 Introduction to Zero Shot Retrieval
26:03 Vector Search vs BM25
26:22 Combining Search Techniques
26:56 Favorite BM25 Adaptations
27:38 Postgres Search and Term Proximity
31:49 Challenges in GitHub Search
33:59 BM25 in Large Scale Systems
40:00 Technical Deep Dive into BM25
45:30 Future of Search and Learning to Rank
47:18 Conclusion and Future Plans

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

Nicolay Gerold: People implementing RAG
usually jump straight into vector search,

but vector search has a lot of downsides.

So first of all, it's
not robust out of domain.

Different types of queries
need different embedding models

with different vector indexes.

And some query types
might not work at all.

It's also very costly in
terms of compute and storage.

We have to keep our indexes
in memory so we can achieve a

low enough latency for search.

And what we are talking about
today works for everything.

It works out of domain and is one of the
most efficient types of search out there.

You probably guessed it already,
we are talking about the OG

ranking function of search, BM25.

Today we are continuing our
series on search on how AI

is built with David Tippett.

David Tippett is search engineer
at GitHub after working at

OpenSearch for a long time.

We talk about BM25: how it works,
what makes it great, and how you

can tailor it to your use case.

And also how BM25 is used at GitHub
for a hundred billion plus documents.

David Tippett: I have not used Vespa,
but I can say from the people that

I know that use Vespa, it's amazing.

It's really good, but at
the same time, it's not.

I would say it's like a top
5 percent search engine.

If you are like in the, top 5 percent
of like search users or in the top

5 percent of what is that called?

Vector search users.

That's where you're really
picking up Vespa and you're,

you know what you're doing.

Not a lot, maybe not a lot of people
know yet or haven't found out yet

that, yeah, Vespa came out of Yahoo.

It actually went into
Yahoo first in the early

2000s and then left Yahoo.

So yeah, they have a lot of
lingering stuff because apparently

Yahoo was a big Java shop.

I don't know.

Nicolay Gerold: Too much business
is still being done in Java.

I learned programming in Java,

and I still look back with dread
and hatred at the courses I took.

David Tippett: I'm in the same boat.

I'm using Ruby now and Ruby is like
one of those like mythical languages

because it does so much in the background
for you versus Java where you have

to be, you're declaring everything.

You've got interfaces for everything.

It is so verbose.

It's painful.

Nicolay Gerold: That's maybe an
interesting starting point: if you

would have to put search engines into
the same buckets as programming languages.

So basically: Python, the really
simple one; C, really low down; Rust,

the newcomer, the really hyped
one; Go, more of the cloud-native one.

Where would you place the
different search engines?

David Tippett: Oh, this is perfect.

Okay.

I love this question because
I have really strong opinions.

So Weaviate is where I feel like a
lot of people get started, right?

It's the straightforward,
super documented.

Painfully well documented almost
to the point where, anyone could

get into Weaviate and start
doing something, which is great.

And they're integrated to everything,
Lang chain, all of those things.

Once you get to a certain point, though,
where you're doing some more advanced

relevance features, you're gonna start
feeling the pain there where Weaviate's

just not quite scratching the itch.

And that's where I feel like
most people will move into one

of the REST based search engines.

That's gonna be like, Solr
that's going to be open search.

It's going to be Elasticsearch.

If there was like a spectrum there
and that middle ground of like search

engines, I feel like elastic search
is the easiest to get going with.

Then open search kind of sits in
that middle category where it's

nice cause it's really flexible,
but at the same time, it's also

challenging because the documentation
is still like coming up to speed and.

OpenSearch will let you build
search however you want.

And sometimes that's a good thing
because you can do really cool

things, and sometimes that's a bad
thing because you could accidentally

do something where you're like,
that makes no sense whatsoever.

Solr, I can't really place, I don't have
enough experience with Solr to say like

where in that middle ground it falls.

And then at the very top end,
I feel like you've got Vespa.

And I say Vespa is like the top end
because Vespa is where I feel like PhDs go

to build, massively scalable distributed,
like vector search applications.

Vespa is always on top of
the most recent research.

Which means they have what the
most cutting edge features.

It's really good, but then operationally
it's challenging because it was

built to be run inside of Yahoo and
their, infrastructure frameworks and

paradigms, so running Vespa itself.

It's not as straightforward.

Nicolay Gerold: Would you actually
say it's like partly where the

different search engines came from?

So from which companies?

Because Yahoo, likely they have a large
infrastructure team, so they don't

really care about like how complex it is.

What are like the different
backgrounds of the different

search engines you work with?

David Tippett: That's so funny.

I'm writing an article right now,
and I've said that for the last three

podcast episodes I've recorded for
myself and here, but I'm writing this

really nice detailed blog post about
how these engines are being built.

So like one of the really interesting
things that not a lot of people think

about is how companies make money
affects how they build their business.

So for a great example is Elasticsearch.

Elasticsearch, their biggest
thing is their licensing.

So they make it extremely
easy to onboard, right?

Lots of really great documentation.

Lots of integrations with libraries.

Kibana has a lot of workflows to walk you
through and that's to get you onto the

platform and then, it's lacking some of
the feature richness, but once you're on

the platform, the idea is you're there to
stay versus OpenSearch, which OpenSearch

is, I'm going to say,
being built heavily by AWS.

We'll say they're one of the largest
contributors to OpenSearch.

That affects how OpenSearch is
being built because AWS is like a

More of an infrastructure
as a service type framework.

And as a result, OpenSearch
is reflecting that.

You can pick all of the various
components you want in OpenSearch.

But as a result of that, sometimes it
can be hard building a complete solution.

Because you're like I don't know
which vector search engine I want

under the hood of OpenSearch.

I just want vector search.

I don't know which one.

Which vector search I want.

I don't know, like why I would
pick one versus the other.

And that makes it a little bit more
challenging to get going with open search.

Nicolay Gerold: It seems a
little bit like shipping the org structure

of Amazon again: they want
the optionality to do everything.

David Tippett: Yeah.

And, to some degree, I think it's great.

I will say there are.

Definitely advantages to being able to
pick different components and how you

want them to operate, but yeah It is
very painful if you don't have experience

with that And that's where I like to
recommend like elastic search for people

like just getting started, you know
Because it's going to give you a very

nice hey, look, this is how you do vector
search and elastic search And it's great.

It's straightforward, and it works.

But at some point you're going to
be like, hey, I want to do vectors

that are, more than 2000 tokens.

You're going to start running
into the like bottlenecks of

Lucene's vector search there.

Nicolay Gerold: Yeah, and if you would
have to give people like a curriculum

of things they should learn for search
What are like the algorithms, data

structures, techniques they should learn?

David Tippett: No, that's
a really good point.

And a lot of people, they like
to use search, but they don't

think very much about how search
is being used under the hood.

So that's where understanding some
of the core principles of your

search engine is really important.

So BM25 is probably the top one
that I feel like I harp on a lot.

I did a RAG demo for a long time,
which for those who don't know, it's

retrieval augmented generation, and
it's big in, the vector search space.

But I did it with BM 25 search
and people were just like shocked.

They're like, that works.

And yes, it does.

But yeah, BM 25 reverse indexes.

Reverse indexes are the core component
of these search indexes, and that's

why they're fast for text search.

And also, why they're terrible at
other sorts of things, like joins!

People are like, why
can't OpenSearch do joins?

And it pretty much boils down
to, reverse indexes and joins

just don't go very well together.

It takes a lot of memory
to do things like that.

Yeah,

Nicolay Gerold: Can you maybe
go a little bit into what is the

BM 25 and what makes it great?

David Tippett: Yeah.

So this is a really interesting one
because when I was getting into search,

I had no experience or background.

So I had no idea what any of this was.

People were throwing around these terms
like TF-IDF and BM25 and NDCG.

There's a lot of terms here.

And the funny thing is they're actually
not that challenging to understand once

you understand where they come from.

BM25 is the core retrieval
and scoring function for most

text based search engines.

So that's Solr, that's
Elasticsearch, that's OpenSearch.

Weaviate and Vespa tend
more towards vectors.

I know Vespa has a very good
BM25 Index solution as well.

But what BM25 is, it stands for Best
Match 25, and it is, we'll say, a scoring

function to say how well does this query
match this document and it came from

what was TF IDF before, which stands for
term frequency, which is like a count

of how many terms, and then inverse
document frequency, which I'll talk

about a little bit more in just a minute.

But it was, it fixed some of the problems
that they had with TF IDF, because TF

IDF, did really well at scoring documents
against queries, but it had some pitfalls.

Like for really long documents, right?

It was weighting those exactly the
same as a really short document.

And you wouldn't exactly
weight those the same.

If you had a document that had 10,000
words and one word matching your

query, versus a document that had five
words and one word matching your query,

which is a better match?

It's probably the one with five, because
out of five words, they chose to put one of

the matching query words in their title.

It has a couple, BM25 has
a couple different parts.

So maybe we should just should
I just walk through this?

Do you feel like?

Nicolay Gerold: The key innovation
from TF IDF to BM25 is to penalize

the results in the score itself,
and to basically have some form of

regularization based on different factors.

And I think this would be the
most interesting way to start:

how does BM 25 penalize documents?

David Tippett: That's a really good point,
because of one of the things they found with

TF-IDF. We'll start with the first one.

Term frequency.

Term frequency is, basically the
count of, again, how many terms in

your query match terms in a document.

Say I write, the quick brown fox or
whatever and I'm searching my documents.

If I have a document that is just the
word "THE" that's gonna match really

highly, because "THE" matches "THE",
and it matches it a lot of times.

But, you and I know that hey,
a document filled with just

"THE"'s that's not really a match.

That doesn't really tell us anything.

There's this idea called term saturation,
where, you know, words that are repeated

too much, they count, and they count a
little bit less each time it matches.

So there's this nice curve where it
tapers off and it says, Hey, even if

you include this word like a thousand
more times, it's not going to add

that much to the overall score.
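
To make that curve concrete, here is a quick sketch of the saturating term-frequency component tf · (k1 + 1) / (tf + k1) with the common default k1 = 1.2 (length normalization left out for simplicity):

```python
k1 = 1.2
for tf in (1, 2, 5, 10, 100, 1000):
    contribution = tf * (k1 + 1) / (tf + k1)
    print(f"tf={tf:>4}: {contribution:.2f}")
# tf=1 gives ~1.0, tf=10 gives ~1.96, tf=1000 gives ~2.2 -- the curve flattens
# out near k1 + 1, so repeating a word a thousand times barely adds anything.
```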

Nicolay Gerold: And this is
basically within a document, right?

David Tippett: within the document, yeah.

So like I'll have the word the in my
query, and if the document is just all

the word "THE" a thousand times, it's not
actually going to score that much higher

for each individual instance of the.

Which is important because we
have a lot of these words in,

especially the English language.

I can speak to heavily to that
one because I know that one well.

But we have these stop words like the,
a and they just don't tell us that much

about does this query match this document?

So on the flip side of that, words that are
very unique, like antidisestablishmentarianism.

That is not going to appear in almost any
of the documents, and that will score very

highly because of the last part of TF-IDF,
which is the inverse document frequency.

And inverse document frequency is magic,
I'll say because what it does is it looks

at all of the other documents in the index
and it says, hey, how many other documents

in this index does this word exist in?

And if this word exists in one document
and you have it in your query, out

of a thousand other documents, that
is going to contribute super highly

to your score because they're like,
this is the only document that has

this word and this word is in your
query, therefore it's very important.

So that's the IDF portion of it.

And again, I said, BM25 is an
iteration of term frequency-inverse document frequency.

So the two things that BM 25 adds is
the term saturation so that repeated

terms don't count more for each instance
of that term, and then the other thing

is they normalize for document length.

And this is like a really big one.

And actually, there's some new iterations
of BM25, which we'll probably talk

about a bit, that also help with this.

But what they found was that
titles tend to match really great.

And then really long
documents match really poorly.

If you have two documents in the same
index, and one has only a few words,

and one has, a thousand words, those
should be weighted differently, right?

Because a match in one with five words
should be like, oh, one out of five words

matches versus one out of a thousand.

So that's the second part of BM25 that,
changed significantly was the ability

to normalize for document length.

And I'll say it's interesting because
In my own day to day life, I don't

think I appreciate this or recognize
it as much as I probably should.

And part of that is because we
naturally break documents up

into their individual components.

And at least within, I can say within
GitHub, documents in, Similar types of

fields tend to be around the same, length.

Most titles aren't 10, 000 words, most
titles are between, maybe 10 to 20 words.

Again, that document
length normalization doesn't

really hit us that much.

But where you would see this a lot
is in things like research journals.

It's like research journal
search, book search.

Those are the type of places
where you could have a really

short book, really long book.

You could have a really
dense paper academic paper.

You could have a really
sparse academic paper.

And yeah, that's where you're
going to see that, that document

length coming into play a lot.
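
As a rough numeric illustration of that length effect: with the common default b = 0.75 and an assumed average document length of 100 words, the length-normalization divisor applied to the term frequency looks like this:

```python
b, avgdl = 0.75, 100
for dl in (5, 100, 10_000):
    norm = 1 - b + b * dl / avgdl
    print(f"doc length {dl:>6}: divisor {norm:.2f}")
# A 5-word document gets a divisor well below 1 (its matches are boosted),
# while a 10,000-word document gets a divisor of about 75 (its matches are damped).
```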

Nicolay Gerold: And just to recap
a little bit, basically we have

The term frequency, which basically
just counts how often a term is

occurring in a certain document.

Then you have the inverse document
frequency, which is rather answering the

question, like, how valuable is a term,
by determining how easy it is to separate

the different documents in your corpus from
each other through the different terms.

And the easier it is to separate
one document from all the others,

the more valuable a term is.

And then you have the additional
more of a penalty factors.

And in BM25, you have
the document length, and you divide

it by the average document length,
but you also have the b factor,

which is basically a hyperparameter.

What can you tell me about that?

How does that actually impact the
results, and is there actual

value in tuning that parameter?

David Tippett: Gosh, that
is a really great question.

I'll answer that with a little bit of
a statement, more than anything: I don't have

a lot of experience tuning the b, but
what I will say is most times people

do not collect the sort of data
that they need to determine whether

tuning that would be valuable or not.

And I think that's one
of the really big things.

There are a handful of constants in
BM25 that for a lot of people are
good enough, but for some people in like
good enough but for some people in like
very specialized use cases, and I'll

say again, like coming back to that
different places where you might have

lots of term saturation or where you
might have lots of, varying document

lengths, where you might, tune those.

And I think this actually
comes back to what I think is

the core problem with search.

And all of these search engines, I'm
going to say, they do a very poor

job of helping people understand
how to make their search better.

Earlier you talked about the explain
endpoint, and for those who don't know,

the explain endpoint in Elasticsearch,
OpenSearch, basically gives you a

breakdown of how your document was scored.

And we've been talking
about documents here.

Documents in OpenSearch and
Elasticsearch are actually individual

fields and then the sum of those
fields for a query, we'll say.
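
For reference, the explain endpoint can be called directly over REST; a minimal sketch using Python's requests library, with a hypothetical local cluster, index name, document ID, and query:

```python
import requests

resp = requests.post(
    "http://localhost:9200/papers/_explain/42",   # hypothetical index "papers", doc id 42
    json={"query": {"match": {"title": "bm25 ranking"}}},
)
# The response walks through the scoring components for that document and query:
# per-field term frequency, IDF, length normalization, boosts, and the combined score.
print(resp.json())
```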

Nicolay Gerold: Yeah.

So basically to interject here, it's
you typically don't only look at

like the content of an academic paper
when you're running retrieval on it.

But you have multiple fields.

You have, for example, the authors, you
have the title, you have the abstract,

and then you have the entire content.

Or you might even split it up further.

You split it up into introduction,
methodology section, and so on.

That's a

David Tippett: There we go.

Nicolay Gerold: And you can
basically create a scoring function

over the different fields and weight
different fields in a different way.

And when you look at explain, it
basically breaks down how much impact

each field has and why it scored that way.

So I,

David Tippett: Exactly.

And I think this is where we need like
better tooling across the board to put

that in a more explainable, I was going to
say explainable, understandable way though

for, people just getting into the field
because for me, I've been, explaining

BM 25 to people for a long time.

So it feels a little natural.

I'm like, Oh, this document didn't match
because, this document's much longer

and it just got tanked in the ratings.

But for people just coming into search,
that can be really challenging because

it's Oh, I don't understand, like that
document got weighted more, but why?

So there needs to be some
better tooling around that.

Nicolay Gerold: I just want to
understand your intuition a little bit

behind the two big parameters,
which are the k1 and the b.

In which scenario would you
actually increase or decrease them?

So when you're looking at the saturation
of term frequency, when you have a high

amount of terms repeated in each document,

Would you go with a higher
saturation or lower?

David Tippett: It's a, it depends.

So this comes down to what?

Actually, this comes all the way
back to, what do your users expect?

A lot of times especially in
maybe we'll say within an academic

domain, there might be very like
common terms across that domain.

And each additional what is that called?

Each additional term might add more
value, so in that case you would want

the term saturation to affect score
much I don't remember off the top.

I guess at that point
you would decrease K.

If I remember off the top of my
head, it's been a little bit.

But so that's where you would look at that
and say, Hey, what do my users expect?

Are they expecting
repeated terms to count more?

Do they think repeated
terms are very important?

And then on the other one, the document
length: does document length matter?

Maybe you look at, your documents and
your index and you say, Hey, look, our

documents are between, 10 and 25 pages.

Document length doesn't matter that much.

So you could decrease the impact there.

Or it might matter a lot, in
which case you could increase

that that impact on your scoring.

But again, it's, you also need to
have data that says, the documents

that I expect are not matching.
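
If you do decide to experiment, both OpenSearch and Elasticsearch let you override k1 and b through a custom similarity in the index settings; a minimal sketch with hypothetical index, field, and parameter values:

```python
import requests

settings = {
    "settings": {
        "index": {
            "similarity": {
                "tuned_bm25": {"type": "BM25", "k1": 1.4, "b": 0.6}  # hypothetical values
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "similarity": "tuned_bm25"}
        }
    },
}
# Create the index with the tuned similarity attached to the body field.
requests.put("http://localhost:9200/papers", json=settings)
```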

And I almost, I, before I would
tune BM25, I would probably

tune my query weights first.

Because what I think most people will find
is they might be weighting fields poorly.

Actually, this is, this might be
a good place to talk about, like

we've talked about scoring here and
we've said scoring is, per field.

So a title field of a document
might have one set of scores, a

body field of a document might
have a different set of scores.

When you combine those scores,
that's, you get the total score

for that document, whether that's a
paper, whether that's like a record.

One of the challenges, though, is that
the range of scores for each of these
fields is going to be different.

And the reason for that is fields with
fewer words are going to have let's see,

much higher IDF, versus fields that are
longer are going to have much lower IDF.

Lower IDFs.

I'm trying to make sure I'm saying that
in the correct order, but the, so like

the potential score of a title field
might be an entirely separate range.

Like it could be like, I'm going to
throw out bad numbers here, like 20 to

25 could be your title field score range.

And then your body field
might be, like zero to five.

So this is where it gets really important
to start having relevancy practice,

which is where you recognize, Hey, the
body field regularly scores in a very

different range than the title field.

Maybe we should increase the boost for the
body field and put it in a similar

scoring range as the title field.

Or maybe the body matching doesn't
actually matter that much, which

is what a lot of people find.

Matches on title type fields tend
to be really important because

like you only have so many words
in a title and people tend to put

the most important words there.

That's where it gets really important
to know what are users expecting and are

they finding what they expect to find?
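
As a concrete sketch of that kind of query-time field weighting, a multi_match query in OpenSearch or Elasticsearch can boost one field relative to another; the index, field names, and boost value here are hypothetical:

```python
import requests

query = {
    "query": {
        "multi_match": {
            "query": "bm25 document length normalization",
            # Boost title matches 3x relative to body matches to bring the
            # two fields' score ranges closer to what users expect.
            "fields": ["title^3", "body"],
        }
    }
}
resp = requests.post("http://localhost:9200/papers/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```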

Nicolay Gerold: Yeah.

And this already goes to, like, how
much meaning the matches or the BM25 scores

carry. Because it tends to be that on
title fields, a very good match carries way

more significance and relevance than if
I have a very good match on a body field.

David Tippett: Yeah.

Nicolay Gerold: And this already
leads us a little bit into the

BM25F, which is basically just
saying it's field based and we

have a document with multiple fields.

Yeah.

David Tippett: That's a nice segue.

And actually I'm going to be a
little bit embarrassed here because

I have very little experience with
BM25F, but what I can say, and like

OpenSearch, OpenSearch also didn't have
support for BM25F when I was there.

And they still don't; it's on the roadmap,
which is a little bit of a hot take.

It's been pushed back quite a few times,
and actually I'm not sure why.

I'm not sure why, because I think
a lot of people are asking for it.

But to break down why BM25F
is important: we've said

BM25 scores are per field.

So it's either the title
field or it's a body field.

But what bm25f does is it calculates
the scores across all of those fields.

So now you've got this nice
combined what is that called?

Combined score, we'll say, between
several different types of fields.
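
For reference, one common simplified formulation of BM25F (sketched here with made-up field weights) first combines the per-field, length-normalized term frequencies into a single pseudo-frequency and only then applies saturation and IDF once:

```python
def bm25f_tf(term, doc_fields, field_weights, field_avg_len, b=0.75):
    """Combine a term's frequency across fields into one pseudo-frequency (simplified BM25F)."""
    combined = 0.0
    for field, tokens in doc_fields.items():
        tf = tokens.count(term)
        length_norm = 1 - b + b * len(tokens) / field_avg_len[field]
        combined += field_weights[field] * tf / length_norm
    return combined

# Hypothetical document and weights: the title counts 3x as much as the body.
doc = {"title": "bm25f explained".split(), "body": "a longer body about bm25f and fields".split()}
weights = {"title": 3.0, "body": 1.0}
avg_len = {"title": 5, "body": 200}
tf_combined = bm25f_tf("bm25f", doc, weights, avg_len)
k1 = 1.2
# Saturated once across all fields; the full score would multiply this by a single IDF.
print(tf_combined * (k1 + 1) / (tf_combined + k1))
```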

So yeah, I think that's
really interesting.

I think it's interesting that
OpenSearch hasn't implemented it,

given how many requests I've
seen for it in the community, but I

think it should be coming soon.

Hopefully

Nicolay Gerold: And I think it's, you
can also replicate it a little bit with

just using BM 25 on the different fields.

So I think it's probably not a
priority for them at the moment.

David Tippett: Maybe, but I think it does
add a lot, because it takes away

some of the work that you have to do
weighting all those different fields to get

them put into similar ranges.

And actually, we want to talk about
BM 25 search versus vector search.

This has also been a huge problem for
a lot of people, because what at least

OpenSearch found
in most of their research is BM25

does really great at we'll say
what's called zero shot retrieval.

So this is, it's never
been trained on anything.

It is a very straightforward algorithm.

And zero shot retrieval basically says,
hey, it's never seen this example before,

but it can go and find the right result.

Then.

You have vector search,
which is, it's been trained.

I don't know if I would
call vector search one shot.

But I feel like I've probably
just used that wrong.

And Jo Kristian (Bergum) is going
to come light me up on Twitter.

But they have very
different scoring paradigms.

So vector search is always
between, like zero and one.

And BM 25 can be a whole
range of different things.

Combining them, you're almost always
going to get a better experience.

Again, we've got scores that
are in hugely different ranges.

So how do you combine these
sets of documents that have

completely different scores?

And that's where I think actually
OpenSearch has done really well so far

by creating like pipelining tools that
allow you to do score normalization

and get a single unified set of
results back, which I think is cool.
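
A minimal sketch of the idea behind that kind of score normalization: min-max normalize each result list to 0-1, then blend with a weight. The document IDs and scores below are made up:

```python
def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

bm25 = {"doc1": 23.4, "doc2": 11.8, "doc3": 2.1}      # raw BM25 scores, wide range
vector = {"doc2": 0.91, "doc3": 0.88, "doc4": 0.40}   # cosine similarities, already 0..1

bm25_n, vector_n = min_max(bm25), min_max(vector)
alpha = 0.5  # weight between the two retrievers
combined = {
    d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * vector_n.get(d, 0.0)
    for d in set(bm25) | set(vector)
}
for doc, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(doc, round(score, 3))
```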

Nicolay Gerold: Yeah, and when looking
at BM25, I think like there are so

many different versions nowadays.

What's your, if you had to pick
one favorite child, what would be

your favorite adaptation of BM25?

David Tippett: I have a hard time saying
anything other than Okapi BM25, which is

like the baseline of BM25.

I feel like it's been around since 1994, which was
when it was released to the world at

TREC, which is the Text REtrieval
Conference, I think. So yeah, 1994.

That's a year before I was
born for context and it's

been around all this time.

And I think there's a reason for that.

And I think it provides a
very good generic way to

score and retrieve documents.

And other groups are
doing things differently.

I would not recommend it.

But Postgres' own search is,
what was it based off of?

I'm going to forget, of course, what
Postgres search relevance is based off

of, but basically optimizing for
how close the terms are together is one

of the things that comes into Postgres'
retrieval function, which I think is neat.

And if someone finds a way to add that
to BM25, I think that

is going to dramatically change how we
score documents, because term proximity

can add, I think, a lot of value.

But

Nicolay Gerold: I think Postgres built
it not on terms, but on top of lexemes,

which introduces a completely new area.

I think the full-text search in Postgres
was on top of lexemes, which is more

like a semantic unit of language,

which I think introduced a whole lot
of new issues, like what

makes up a semantic unit
and how that is even defined.

David Tippett: Yeah, it's interesting.

But yeah, one and this, to this point I'm
not saying that you shouldn't do full text

search on Postgres, but if I was doing
text search on Postgres, I'd be using

something something like Trieve or one
of these other layers that gets added on

top of Postgres that adds back in BM25.

Nicolay Gerold: Yeah.

I think ParadeDB is another one

On top of Postgres, which adds
a lot of the features you have

in regular search databases.

Yeah.

I think in the end, like
you said already, you should be more

oriented toward how your users are searching
and what documents you are serving.

And if you have a reason to go with
a variant of BM 25, which isn't

the default then you can do it.

Like, for example, BM25+,
I think, introduces a constant to

avoid zero term frequencies.

Especially for like very long documents.

That's very valuable to have because
you just have a way bigger term corpus.

And I think in general, you can look
through the different definitions.

The question is whether you will
outperform the baseline BM25 in a

David Tippett: Yeah.

Nicolay Gerold: significant way.

David Tippett: And that also
comes back to measurement: can you

measure what people want and apply that?

Because that's one of the places I know
a lot of people, they try a couple of

different things and they see which works
best, but when you start to measure things

and for those who don't know, there's
a lot of great ways to measure search.

The best, in my opinion, is NDCG, which
is normalized discounted cumulative

gain, but basically it's a measure
of, hey, I searched this and I have a

ranked set of results for this query.

How close to the top did my
highest ranked result get?

And likewise, the second, et cetera.
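
A minimal NDCG@k sketch in Python, assuming graded relevance labels for the results in the order the engine returned them:

```python
import math

def dcg(relevances):
    # Gain discounted by log2 of the 1-based rank plus 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=10):
    ideal = sorted(relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevances[:k]) / denom if denom > 0 else 0.0

# Relevance of the results our engine returned, in the order it returned them.
print(ndcg([3, 2, 0, 1], k=4))  # ~0.99: the best result came first, so we score close to 1
```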

And being able to measure that and
have that measurement and then apply

that to like, hey, I made a change.

Are my search results getting
better or are they getting worse?

I think that is probably the most valuable
thing you can do in a search organization,

because then you can start to experiment
and say, hey, can we improve, query times

by trying this different modification
of BM25 or Can we simplify search?

Do we need all of these different fields?

Do we need trigrams, n
grams, etc., on and on?

Are they adding any value?

Nicolay Gerold: Yeah.

And this, I think, is one smart thing
you already mentioned: the part of

search which is often underappreciated.

It's not just like relevance.

It's also like the other
performance characteristic you

might want to optimize for.

Like latency is a massive one.

David Tippett: gosh,

Nicolay Gerold: Like doing a vector
search on a large corpus will likely

introduce some issues in terms of
latency, unless you're willing to

take on a whole lot of inference
costs. At the same time,

you might want to optimize a little
bit more for, like, high-ticket items

to have like higher ticket sales.

Like, search is very business driven,
which makes it so interesting to me as

a domain, because it's very oriented
to more of an engineering discipline.

David Tippett: Yeah.

I think it's a little funny though,
because you highlighted that search is

business driven, and I'm admittedly in a
place where search isn't our core, like in

e commerce, it is like everything, right?

You directly, you tie dollars to searches
in GitHub, it's like search is valuable.

Like we want people to
find the right things.

But the most valuable thing I think
in GitHub is like, Hey, are you

able to develop with a community
of people and interact well?

So it's hard saying, hey, we at GitHub
should spend a ton of money, like,

building the most advanced search ever.

And they're like, that's great.

That's really cool.

Does it add that much
more value to people?

And I will let the people decide
comment, respond retweet, whatever

at us and let us know would you value
a more relevant search at GitHub?

Nicolay Gerold: I would.

David Tippett: That's my own plug.

That's, this is me trying to get
funding for more search at GitHub.

But and I will say, GitHub, this
is, you talked about latencies.

This is where it gets really tricky,
especially in the search domain.

So at GitHub, technically
all of our data is hot.

We never deprioritize data.

Some documents, some commit you created,
seven years ago might be just as relevant

as a commit you created yesterday.

And that causes all sorts of challenges
because we have around a hundred

billion documents.

I don't have exact counts off the
top of my head, but around a hundred

billion documents, and we need to be
able to say, Hey this is the most

relevant commit for what you want.

And it's just.

It's really hard because you
have users doing all sorts of

weird things with GitHub search.

We have people who are Oh,
what's the word for that?

People that are basically looking
for security vulnerabilities using

different parameters in our search
or people that are scraping us

trying to find certain things.

We have people that are, just business
users like saying, Hey, I'm just

trying to find an issue We have
developers who are like, Hey, do

we have a dependency on something?

And they're searching
within their code base.

It's a lot of different use cases
and it makes it really challenging.

Nicolay Gerold: Yeah.

And when you're looking at a large-scale
search system like GitHub, yeah.

What do you think, how does BM25
fit into the larger search system?

David Tippett: Under the hood,
it's technically everywhere.

All of our fields, we don't,
we can't afford to do.

Maybe I shouldn't say we can't
afford to do vector search.

We could probably do vector search if
we wanted to, but BM 25 is so incredibly

cost effective, it's very hard to justify
doing any other type of search other

than BM 25, because BM 25 gives us really
fast search across all of our data.

And that's one actual point that
might be good to bring up: there

are lots of layers on top of BM25 that
you use to improve search latency, that

you use to improve or reduce the impact
of a search in your entire cluster.

So if you do an open search, you're
gonna hit all of GitHub, right?

And that's expensive.

But we can start to do things and
layer things on top of that to

reduce the impact of that search.

Whether that is, filtering down to an
organization, then we can, as long as

we have all of our documents routed
appropriately, we can reduce the scope to

just documents within that organization.

Then you don't hit our
entire search cluster.

We reduce, I was going to say shard
size, but I guess it's, yeah, shards.

I'm like, I always confuse shards
and segments in OpenSearch.

It's Lucene segments, and
Elasticsearch and OpenSearch

use shards, but shard size.

So a shard size for Elasticsearch or OpenSearch
should be about 20 to 50 gigabytes.

And again, that way you can use the most
of your CPU to search across this data.

So BM25 sits as the
foundation of all of that.

We have, all of our fields are
indexed into, BM25 type text fields.

Um, and then we're adding
layers on top of that.

Hey, we've indexed this title
field three different ways.

And the exact matches are top tier, and
then maybe we'll have an n-gram field,

and that's like the next tier, and then
the lowest would be like a trigram, if

it matches on a trigram, that allows us
to retrieve things that might be really

hard to index otherwise, like labels
that have bunches of special characters.

Nicolay Gerold: And I think what really
nails the point home is that the

different search queries you actually
face at GitHub that you mentioned

before, BM25 can serve them all.

And if you were using something
like vector search, you'd likely need

fine-tuned models for the different types
of searches you encounter, and then you

need like a routing layer on top of it
that basically recognizes the types of

searches you're currently encountering
with the user that routes to the right

type of vectors and vector model.

David Tippett: That's so true.

And honestly, doing vector search at
the scale of GitHub would be just so

incredibly problematic, because for HNSW-based
vector search, that's all sitting

in memory, and I can't even imagine what
it would be like fitting 100 billion,

never mind a billion, records in memory.

And then, you have some other ones.

Oh, gosh I'm going to forget that.

What's the other one, the I-something vector search?

I keep wanting to say IDF,
but there's another, more

bucketed version of vector search that
allows you to bucket your vectors into

different pools. IVF is the one, yep. And the challenge with

that one is you're gonna start losing out
in relevance, because maybe you've

routed it to the wrong bucket. So it just gets really tricky,

and I would like to see us using vector
search, actually. I do think there is a use

case, and maybe it's some sparse vectors.
But we are very far away from that.

Nicolay Gerold: Yep.

I would actually love to know
how the storage

requirements in terms of

memory and disk space
compare between BM25 and vectors,

if you have some rough estimations
or some intuition on that.

David Tippett: Actually, I'll, this is one
thing I can say that's really interesting.

Vector search is wonderful because it
scales, I don't want to say linearly.

What am I trying to say?

The size of vector documents are constant.

That's where I'm getting to.

So vectors, when you output them,
they're of constant length. You have

a, say, 1,024-dimension vector, and all of
those individual vector dimensions are

going to be like a floating point
64 or something like that.

So that makes it nice because vector
search, you can estimate really well

what your memory requirements are.
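
As a back-of-the-envelope illustration of that point (the document count, dimensionality, and precision below are round assumptions, not GitHub's actual numbers):

```python
docs = 100_000_000_000          # ~100 billion documents (assumed round number)
dims = 1024                     # assumed embedding dimensionality
bytes_per_dim = 4               # float32 per dimension
raw_vectors_tib = docs * dims * bytes_per_dim / 1024**4
print(f"raw vectors alone: ~{raw_vectors_tib:,.0f} TiB")  # ~373 TiB before any HNSW graph overhead
```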

BM 25, it's like throwing a dart on
a board because you start with, oh,

okay, here's like my base document.

That's great.

You have your base document size,
but then you might ingest it twice.

What if you ingest it once with, like
one type of analyzer and then a second

time with another type of analyzer?

So I feel like BM25 becomes
very challenging to guess for.

One of the really funny things that I'm
starting to wrestle with is our nodes

at GitHub. We basically have 10x, I'm
going to say 10x, the amount of storage

that we actually need for our BM25
indexes, because the biggest challenge

we've run up against is that the IOPS for the
storage are actually really important.

Like how fast can we execute
operations against that storage?

And that tends to only be
available on high storage nodes.

So we've got these huge
nodes with like tons of IOPS.

And we store maybe we've got a
node with a terabyte of storage,

and we're using 100 gigs on that.

Now, we have, hundreds of these nodes but
that's all because how fast we can access

the storage is actually really important
to the overall latency of our search.

Nicolay Gerold: Yeah, nice.

Can you maybe go a little bit
into the technical parts of BM25?

Like, where the index is stored,
where the raw data is stored, where

is all that stuff pulled, and how
is it actually used when I index it,

but also when I run some queries?

David Tippett: Yeah, that's
a really good question.

So OpenSearch and Elasticsearch,
a lot of stuff is pre-computed,

like pretty much anything that they
can pre-compute for a document and

for an index is pre-computed.

So documents, as they come into,
OpenSearch or Elasticsearch,

they're going to hit

the REST API, and from the REST API,
they're going to get broken apart

into all their individual terms.

So all the terms get separated out.

And then that's stored temporarily in
a place called the translog, which is

like a write ahead log for those who
are familiar with like database terms.

But that's a very temporary place
that it gets stored before eventually

it gets like flushed to disk.

And I say flush to disk.

So OpenSearch and Elasticsearch
maintain a reverse index and a

reverse index is basically a hash map
where there is a term and then all

of the values for that term are the
documents that contain that term.

And it does this for every single term.
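
A toy sketch of that structure in Python: a term-to-postings map, plus the document-frequency counts that IDF needs. The documents are made up:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick fox",
}

inverted = defaultdict(set)          # term -> set of doc ids containing it
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

doc_freq = {term: len(ids) for term, ids in inverted.items()}  # feeds straight into IDF
print(dict(inverted))   # {'the': {1, 2}, 'quick': {1, 3}, ...}
print(doc_freq)         # {'the': 2, 'quick': 2, 'brown': 1, ...}
```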

So a document and a table or like an
index in OpenSearch are very different

because a document you think of, and I
think of, as oh, that's the whole thing.

However, when OpenSearch looks at it
or Elasticsearch looks at it, they're

looking at a giant table with all
the terms with pointers back to the

documents that contain those terms.

So this layout makes it really,
easy to do things like IDF

because okay, here's my term.

Here's the count of
how many documents contain that term.

It's right there.

And so we can start to store that
calculation of how many

other documents contain this
term, so that can be cached, same with

things like IDF that, oh, actually,
that was what I just described so you

have, like, all these layers of these
different pieces of BM25 being cached,

and that allows us to do really cool
things, optimizations for only retrieving

the documents that we anticipate

are going to have the
highest match, we'll say.

And a lot of this breaks down to, it's
funny, it's vector math, like matrix math.

Because what you do is when you
have a query, you break your query

into the individual terms, we're
gonna go retrieve those terms those

term dictionaries from the index,
and then matrix math to determine

which of them scores the highest.

So I, I have, that is my, I'll
say high level or maybe low

level understanding of it.

I don't know that I could talk too much
more specifically about it than that.

Nicolay Gerold: No, I think that's the
misconception many people have when

they go into search databases,
because to most people, it looks like

a document database, but it isn't.

David Tippett: And this is also
why joins get really tricky.

Because what you have to do to
effectively do a join, is you have to

go and you have to calculate the scores.

For all of these different documents,
like if you wanted to join two tables,

to use SQL terms, you have to go

and retrieve all these documents.

Then you have to put them into memory
and you have to try and mush them

together in the appropriate manners.

So you miss out on a lot of the
optimizations that like SQL databases

can do, because they have
very efficient ways of doing

some of the join work ahead of time.

Nicolay Gerold: Yeah, in the
end it's: what are you optimizing for?

That's, I think, something that's talked
about more in traditional database

courses: that you have a workload mismatch.

And that's why you actually use
different databases because the

stuff you pre compute, which is
the indexes, which allows for the

high-performance computations
that you're doing on the data.

This is the stuff that actually gets you
the high performance, but this also makes

other computations or other workloads
you could perform very inefficient.

David Tippett: yeah.

Yeah.

And it's challenging.

That's why for each of the different
database types that you might pick

for something, it's very important to
choose like the right data model for

that. A really great example is that
OpenSearch is probably very similar

to Cassandra in the sense that you
want everything you need for that

query to exist within that document,
and Cassandra I think does that a lot.

It's very different from Redis, though.

Redis you're breaking things down into,
Hey, I have a key, and then I have a

value, and I'm always doing a lookup
of this key and a lookup of this value.

Maybe I'll do a handful of key
operations at the same time.

And then versus SQL.

SQL is like, Hey, I want to separate each
thing into their own nice little, uh,

compartments, like these are my users,
these are attributes of users that gets

joined on, maybe there's blogs, these
are the blogs that belong to the users,

that just doesn't make sense in something
like OpenSearch or Elasticsearch.

Nicolay Gerold: And what is actually
something that you would love to

see built in the search space?

David Tippett: Oh, yeah, okay.

The person who solves this challenge
wins search, as far as I'm concerned.

So I see a lot of people, like
we've talked a lot about like the

newest research, what's coming out.

The person who finds a way to, I'll say
automate or build a general enough system

that not only does the search portion,
which is like information retrieval, but

also the Optimization stuff like, Hey,
this was the query, this was the result.

They clicked on this result though,
and that was the final outcome.

If there, if someone figures
out, and for this workflow,

it's called learning to rank.

If someone finds a way to automate,
so to speak, learning to rank or to,

build these relevancy tuning workflows
into a search engine, That's it.

They have won the search game and
they will remain champion forever.

And I think there are
people working on it.

I think actually Algolia, I've seen
a lot of stuff coming out of Algolia,

which is funny because they're a
very proprietary search engine.

But I think they are starting
to do some of this stuff.

Same with OpenSearch.

OpenSearch actually is working on an
open standard called user behavior

insights, or UBI, and it's a standard
for collecting this information so that

you can automate workflows like that.

To that point, I feel like OpenSearch has
a head start, but the game is anyone's.

If someone can figure out how to
well integrate these relevancy

workflows with the retrieval
workflows, they're going to win.

Nicolay Gerold: Yeah.

And what's next for you?

What is, what can you tease?

David Tippett: I've won.

There is nothing I want more.

And I'll just, I'll put this out there.

I'll make this public.

I am working around the clock
to try and get GitHub to adopt

a single search result set.

That is like my lifelong goal.

If I can achieve that, I would be so
happy because right now I feel like the

UX for GitHub, it's, we're very much
shipping the org, so to speak, with our

search results, and I don't like that.

I think there is a case where we build
a single pane of results where you are

getting the exact result you're looking
for, regardless of whether you're looking

for an index, regardless of whether you're
looking for a repo or code or any of

these other things, which also I should
say, code search is very different than

traditional, like text search in GitHub.

And I work on text search, but I
would love to merge text search

and code search much more closely.

So yeah, the unification of GitHub search.

That is like my next
couple of years, probably.

Nicolay Gerold: Nice.

And if people want to follow along
on your journey of trying that

and let's see how successful you
will be, where can they do that?

David Tippett: Yeah.

So I have a couple of places
that I live very frequently.

I have a blog, tippybits.com.

I have also my podcast as well, which
is called For the Sake of Search.

And For the Sake of Search is
probably a much more search-focused

podcast where I've been talking
with people about, what does it

look like to hire a search engineer?

Like, what are different
companies doing in the search space?

You talked about ParadeDB.

Actually, I think they were my first ever
interview on the podcast, which is cool.

So yeah, there's a lot of different
things I'm talking about there.

And yeah, follow along and also let
us know, does GitHub search suck?

How can we make it better?

I'm happy to hear.

Nicolay Gerold: So what can you take away when
we're building search systems?

I think first is search is
never just about relevance

but about the user experience.

So you want to align with
what the user expects and also

what his information need is.

So, for example, for GitHub, they
have a lot of different use cases.

So for example, what he mentioned
was finding security vulnerabilities,

exploring codebases, managing repos,
finding old commits, finding old

documentation, finding old issues, and
each requires a different prioritization

of fields, different boosting strategies,
and possibly it would even require

like distinct search workflows.

So you actually need a technology which
can handle a lot of different types of

queries and different types of documents
as well, and this is where BM25 in my

opinion really shines, because it is
very good at this zero-shot retrieval,

without having seen any of this query
before. At the same time, its strength

is in its simplicity and efficiency.

So when we are building search engines,
most of the time when we are building on

top of like a bespoke search database,
like Elastic, OpenSearch, Vespa, we are

already at a point where we actually have to
go distributed, so we have a large corpus.

And BM25 is really suited for these
scenarios because it's so lightweight

and computationally efficient,
especially compared to vector search.

And this makes it really unbeatable for
cases like GitHub, where you have an

immense scale, cases where you have a
lot of different types of queries, but

you can't easily differentiate between
them, or you don't have the time or the

latency to differentiate between them.

So if you would have the case where you
really can classify easily, hey, it's

a certain type of search where the user
cares more about the accuracy, then

you can pipe to a slower
type of query or pipe to a vector

database and run the search there.

Then you could basically have a
vector database plus a regular

search database and basically
decide on query time where to go.

But this also is impossible in GitHub's
case, because you never know what the user

or what types of document the user is
searching for and they have like such a

large corpus which would be interesting
where to place them basically and how

much memory the HNSW index would take.

For the future, I think what becomes
more and more interesting is standards

like the User Behavior Insights.

I think it came out of OpenSearch,

which is trying to standardize how to
integrate user behavior, like click

data, dwell time, into retriever systems
and also into retriever algorithms.

I think this might make it easier to
build learning to rank algorithms.

And for a lot of people, you
suddenly have a unified data structure,

which is the main thing I think UBI
put forward: a data structure basically

for the user behavior and how to store it,
which should make it easier to basically

train the algorithms, like learning to
rank and use them in your search system.

In the end, for me, with search systems, it

always comes down to the architecture
and to how well you actually scope

the problem and understand the
problem you're trying to solve.

Because in search systems, you
really need to figure out in

advance what you're optimizing for.

Whether it's really low latency, whether
it's high accuracy, high recall, whatever.

Or whether it's really low cost,
when you have a really large-scale

system like in GitHub's case.

These types of architectural decisions
have to be made like mostly upfront

because it's really hard to change
it once you already have all of

your data in a search database and
it's already indexed and set up.

Otherwise thanks for joining.

We will continue with more on RAG soon.

So, subscribe, also leave a review,
especially on Spotify and Apple and if

you have any questions, any wishes on
different parts of search, you would

like me to go into or get a guest on, I
should have on the podcast and interview,

feel free to shoot me a message.

Otherwise, I will talk to you next week.