How AI Is Built

Ever wondered why vector search isn't always the best path for information retrieval?
Join us as we dive deep into BM25 and its unmatched efficiency in our latest podcast episode with David Tippett from GitHub.
Discover how BM25 transforms search efficiency, even at GitHub's immense scale.
BM25, short for Best Match 25, uses term frequency (TF) and inverse document frequency (IDF) to score document-query matches. It addresses limitations of TF-IDF by adding term-frequency saturation and document length normalization.
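For the curious, here is a minimal sketch of the BM25 scoring function in Python (using a Lucene-style IDF and the common defaults k1 = 1.2, b = 0.75; the tiny corpus and query are made up for illustration):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a plain Okapi/Lucene-style BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: in how many documents does each query term appear?
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # term-frequency saturation plus document-length normalization
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "the quick brown fox".split(),
    "the the the the the".split(),
    "a long document about many unrelated things and more things".split(),
]
print(bm25_scores("quick fox".split(), docs))
```
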
Search Is About User Expectations
  • Search isn't just about relevance but aligning with what users expect: 
    • GitHub users, for example, have diverse use cases—finding security vulnerabilities, exploring codebases, or managing repositories. Each requires a different prioritization of fields, boosting strategies, and possibly even distinct search workflows.
  • Key Insight: Search is deeply contextual and use-case driven. Understanding your users' intent and tailoring search behavior to their expectations matters more than chasing state-of-the-art technology.
The Challenge of Vector Search at Scale
  • Vector search systems require in-memory storage of vectorized data, making them costly for datasets with billions of documents (e.g., GitHub’s 100 billion documents).
  • IVF and HNSW offer trade-offs: 
    • IVF: Reduces memory requirements by bucketing vectors but risks losing relevance due to bucket misclassification.
    • HNSW: Offers high relevance but demands high memory, making it impractical for massive datasets.
  • Architectural Insight: When considering vector search, focus on niche applications or subdomains with manageable dataset sizes or use hybrid approaches combining BM25 with sparse/dense vectors.
Vector Search vs. BM25: A Trade-off of Precision vs. Cost
  • Vector search is more precise and effective for semantic similarity, but its operational costs and memory requirements make it prohibitive for massive datasets like GitHub’s corpus of over 100 billion documents.
  • BM25’s scaling challenges (e.g., reliance on disk IOPS) are manageable compared to the memory-bound nature of vector search engines like HNSW and IVF.
  • Key Insight: BM25’s scalability allows for broader adoption, while vector search is still a niche solution requiring high specialization and infrastructure.
00:00 Introduction to RAG and Vector Search Challenges
00:28 Introducing BM25: The Efficient Search Solution
00:43 Guest Introduction: David Tippett
01:16 Comparing Search Engines: Vespa, Weaviate, and More
07:53 Understanding BM25 and Its Importance
09:10 Deep Dive into BM25 Mechanics
23:46 Field-Based Scoring and BM25F
25:49 Introduction to Zero Shot Retrieval
26:03 Vector Search vs BM25
26:22 Combining Search Techniques
26:56 Favorite BM25 Adaptations
27:38 Postgres Search and Term Proximity
31:49 Challenges in GitHub Search
33:59 BM25 in Large Scale Systems
40:00 Technical Deep Dive into BM25
45:30 Future of Search and Learning to Rank
47:18 Conclusion and Future Plans

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

Nicolay Gerold: People implementing RAG
usually jump straight into vector search,

but vector search has a lot of downsides.

So first of all, it's
not robust out of domain.

Different types of queries
need different embedding models

with different vector indexes.

And some query types
might not work at all.

It's also very costly in
terms of compute and storage.

We have to keep our indexes
in memory so we can achieve a

low enough latency for search.

And what we are talking about
today works for everything.

It works out of domain and is one of the
most efficient types of search out there.

You probably guessed it already,
we are talking about the OG

ranking function of search, BM25.

Today we are continuing our
series on search on how AI

is built with David Tippett.

David Tippett is search engineer
at GitHub after working at

OpenSearch for a long time.

We talk about BM25: how it works,
what makes it great, and how you

can tailor it to your use case.

And also how BM25 is used at GitHub
for a hundred billion plus documents.

David Tippett: I have not used Vespa,
but I can say from the people that

I know that use Vespa, it's amazing.

It's really good, but at
the same time, it's not.

I would say it's like a top
5 percent search engine.

If you are like in the, top 5 percent
of like search users or in the top

5 percent of what is that called?

Vector search users.

That's where you're really
picking up Vespa and you're,

you know what you're doing.

Not a lot, maybe not a lot of people
know yet or haven't found out yet

that, yeah, Vespa came out of Yahoo.

It actually went into
Yahoo first in the early

2000s and then left Yahoo.

So yeah, they have a lot of
lingering stuff because apparently

Yahoo was a big Java shop.

I don't know.

Nicolay Gerold: Too much business
is still being done in Java.

I learned programming in Java,

and I still look back with dread
and hatred at the courses I took.

David Tippett: I'm in the same boat.

I'm using Ruby now and Ruby is like
one of those like mythical languages

because it does so much in the background
for you versus Java where you have

to be, you're declaring everything.

You've got interfaces for everything.

It is so verbose.

It's painful.

Nicolay Gerold: That's maybe an
interesting starting point: if you

would have to put search engines into
the same buckets as programming languages.

So basically: Python, the really
simple one; C, really low down; Rust,

the newcomer, the really hyped
one; Go, more of the cloud-native one.

Where would you place the
different search engines?

David Tippett: Oh, this is perfect.

Okay.

I love this question because
I have really strong opinions.

So Weaviate is where I feel like a
lot of people get started, right?

It's the straightforward,
super documented.

Painfully well documented almost
to the point where, anyone could

get into Weaviate and start
doing something, which is great.

And they're integrated to everything,
Lang chain, all of those things.

Once you get to a certain point, though,
where you're doing some more advanced

relevance features, you're gonna start
feeling the pain there where Weaviate's

just not quite scratching the itch.

And that's where I feel like
most people will move into one

of the REST based search engines.

That's gonna be like, Solr
that's going to be open search.

It's going to be Elasticsearch.

If there was like a spectrum there
and that middle ground of like search

engines, I feel like elastic search
is the easiest to get going with.

Then open search kind of sits in
that middle category where it's

nice cause it's really flexible,
but at the same time, it's also

challenging because the documentation
is still like coming up to speed and.

OpenSearch will let you build
search however you want.

And sometimes that's a good thing
because you can do really cool

things, and sometimes that's a bad
thing because you could accidentally

do something where you're like,
that makes no sense whatsoever.

Solr, I can't really place, I don't have
enough experience with Solr to say like

where in that middle ground it falls.

And then at the very top end,
I feel like you've got Vespa.

And I say Vespa is like the top end
because Vespa is where I feel like PhDs go

to build, massively scalable distributed,
like vector search applications.

Vespa is always on top of
the most recent research.

Which means they have what the
most cutting edge features.

It's really good, but then operationally
it's challenging because it was

built to be run inside of Yahoo and
their, infrastructure frameworks and

paradigms, so running Vespa itself.

It's not as straightforward.

Nicolay Gerold: Would you actually
say it's like partly where the

different search engines came from?

So from which companies?

Because Yahoo, likely they have a large
infrastructure team, so they don't

really care about like how complex it is.

What are like the different
backgrounds of the different

search engines you work with?

David Tippett: That's so funny.

I'm writing an article right now,
and I've said that for the last three

podcast episodes I've recorded for
myself and here, but I'm writing this

really nice detailed blog post about
how these engines are being built.

So like one of the really interesting
things that not a lot of people think

about is how companies make money
affects how they build their business.

So for a great example is Elasticsearch.

Elasticsearch, their biggest
thing is their licensing.

So they make it extremely
easy to onboard, right?

Lots of really great documentation.

Lots of integrations with libraries.

Kibana has a lot of workflows to walk you
through and that's to get you onto the

platform and then, it's lacking some of
the feature richness, but once you're on

the platform, the idea is you're there to
stay versus OpenSearch, which OpenSearch

is, I'm going to say,
being built heavily by AWS.

We'll say they're one of the largest
contributors to OpenSearch.

That affects how OpenSearch is
being built because AWS is like a

More of an infrastructure
as a service type framework.

And as a result, OpenSearch
is reflecting that.

You can pick all of the various
components you want in OpenSearch.

But as a result of that, sometimes it
can be hard building a complete solution.

Because you're like I don't know
which vector search engine I want

under the hood of OpenSearch.

I just want vector search.

I don't know which one.

Which vector search I want.

I don't know, like why I would
pick one versus the other.

And that makes it a little bit more
challenging to get going with open search.

Nicolay Gerold: It seems a
little bit like shipping the org structure

of Amazon again: they want
the optionality to do everything.

David Tippett: Yeah.

And, to some degree, I think it's great.

I will say there are.

Definitely advantages to being able to
pick different components and how you

want them to operate, but yeah It is
very painful if you don't have experience

with that And that's where I like to
recommend like elastic search for people

like just getting started, you know
Because it's going to give you a very

nice hey, look, this is how you do vector
search and elastic search And it's great.

It's straightforward, and it works.

But at some point you're going to
be like, hey, I want to do vectors

that are, more than 2000 tokens.

You're going to start running
into the like bottlenecks of

Lucene's vector search there.

Nicolay Gerold: Yeah, and if you would
have to give people like a curriculum

of things they should learn for search
What are like the algorithms, data

structures, techniques they should learn?

David Tippett: No, that's
a really good point.

And a lot of people, they like
to use search, but they don't

think very much about how search
is being used under the hood.

So that's where understanding some
of the core principles of your

search engine is really important.

So BM25 is probably the top one
that I feel like I harp on a lot.

I did a RAG demo for a long time,
which for those who don't know, it's

retrieval augmented generation, and
it's big in, the vector search space.

But I did it with BM 25 search
and people were just like shocked.

They're like, that works.

And yes, it does.

But yeah, BM 25 reverse indexes.

Reverse indexes are the core component
of these search indexes, and that's

why they're fast for text search.

And also, why they're terrible at
other sorts of things, like joins!

People are like, why
can't OpenSearch do joins?

And it pretty much boils down
to, reverse indexes and joins

just don't go very well together.

It takes a lot of memory
to do things like that.

Yeah,

Nicolay Gerold: Can you maybe
go a little bit into what is the

BM 25 and what makes it great?

David Tippett: Yeah.

So this is a really interesting one
because when I was getting into search,

I had no experience or background.

So I had no idea what any of this was.

People were throwing around these terms
like TF-IDF and BM25 and NDCG.

There's a lot of terms here.

And the funny thing is they're actually
not that challenging to understand once

you understand where they come from.

BM25 is the core retrieval
and scoring function for most

text based search engines.

So that's Solr, that's
Elasticsearch, that's OpenSearch.

Weaviate and Vespa tend
more towards vectors.

I know Vespa has a very good
BM25 Index solution as well.

But what BM25 is, it stands for Best
Match 25, and it is, we'll say, a scoring

function to say how well does this query
match this document and it came from

what was TF IDF before, which stands for
term frequency, which is like a count

of how many terms, and then inverse
document frequency, which I'll talk

about a little bit more in just a minute.

But it was, it fixed some of the problems
that they had with TF IDF, because TF

IDF, did really well at scoring documents
against queries, but it had some pitfalls.

Like for really long documents, right?

It was weighting those exactly the
same as a really short document.

And you wouldn't exactly
weight those the same.

If you had a document that had 10,000
words and one word matching your

query, versus a document that had five
words and one word matching your query,

which is a better match?

It's probably the one with five, because
out of five words, they chose to put one of

the matching query words in their title.

It has a couple, BM25 has
a couple different parts.

So maybe we should just should
I just walk through this?

Do you feel like?

Nicolay Gerold: The key innovation
from TF IDF to BM25 is to penalize

the results in the score itself,
and to basically have some form of

regularization based on different factors.

And I think this would be the
most interesting way to start:

how does BM 25 penalize documents?

David Tippett: That's a really good point,
because of one of the things they found with

TF-IDF. We'll start with the first one.

Term frequency.

Term frequency is, basically the
count of, again, how many terms in

your query match terms in a document.

Say I write, the quick brown fox or
whatever and I'm searching my documents.

If I have a document that is just the
word "THE" that's gonna match really

highly, because "THE" matches "THE",
and it matches it a lot of times.

But, you and I know that hey,
a document filled with just

"THE"'s that's not really a match.

That doesn't really tell us anything.

There's this idea called term saturation,
where, you know, words that are repeated

too much, they count, and they count a
little bit less each time it matches.

So there's this nice curve where it
tapers off and it says, Hey, even if

you include this word like a thousand
more times, it's not going to add

that much to the overall score.
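
To make that curve concrete, here is a quick sketch of the saturating term-frequency component tf · (k1 + 1) / (tf + k1) with the common default k1 = 1.2 (length normalization left out for simplicity):

```python
k1 = 1.2
for tf in (1, 2, 5, 10, 100, 1000):
    contribution = tf * (k1 + 1) / (tf + k1)
    print(f"tf={tf:>4}: {contribution:.2f}")
# tf=1 gives ~1.0, tf=10 gives ~1.96, tf=1000 gives ~2.2 -- the curve flattens
# out near k1 + 1, so repeating a word a thousand times barely adds anything.
```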

Nicolay Gerold: And this is
basically within a document, right?

David Tippett: within the document, yeah.

So like I'll have the word the in my
query, and if the document is just all

the word "THE" a thousand times, it's not
actually going to score that much higher

for each individual instance of the.

Which is important because we
have a lot of these words in,

especially the English language.

I can speak to heavily to that
one because I know that one well.

But we have these stop words like the,
a and they just don't tell us that much

about does this query match this document?

So on the flip side of that, words that are
very unique, like antidisestablishmentarianism.

That is not going to appear in almost any
of the documents, and that will score very

highly because of the last part of TF-IDF,
which is the inverse document frequency.

And inverse document frequency is magic,
I'll say because what it does is it looks

at all of the other documents in the index
and it says, hey, how many other documents

in this index does this word exist in?

And if this word exists in one document
and you have it in your query, out

of a thousand other documents, that
is going to contribute super highly

to your score because they're like,
this is the only document that has

this word and this word is in your
query, therefore it's very important.

So that's the IDF portion of it.

And again, I said, BM25 is an
iteration of term frequency-inverse document frequency.

So the two things that BM 25 adds is
the term saturation so that repeated

terms don't count more for each instance
of that term, and then the other thing

is they normalize for document length.

And this is like a really big one.

And actually, there's some new iterations
of BM25, which we'll probably talk

about a bit, that also help with this.

But what they found was that
titles tend to match really great.

And then really long
documents match really poorly.

If you have two documents in the same
index, and one has only a few words,

and one has, a thousand words, those
should be weighted differently, right?

Because a match in one with five words
should be like, oh, one out of five words

matches versus one out of a thousand.

So that's the second part of BM25 that,
changed significantly was the ability

to normalize for document length.

And I'll say it's interesting because
In my own day to day life, I don't

think I appreciate this or recognize
it as much as I probably should.

And part of that is because we
naturally break documents up

into their individual components.

And at least within, I can say within
GitHub, documents in, Similar types of

fields tend to be around the same, length.

Most titles aren't 10, 000 words, most
titles are between, maybe 10 to 20 words.

Again, that document
length normalization doesn't

really hit us that much.

But where you would see this a lot
is in things like research journals.

It's like research journal
search, book search.

Those are the type of places
where you could have a really

short book, really long book.

You could have a really
dense paper academic paper.

You could have a really
sparse academic paper.

And yeah, that's where you're
going to see that, that document

length coming into play a lot.
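
As a rough numeric illustration of that length effect: with the common default b = 0.75 and an assumed average document length of 100 words, the length-normalization divisor applied to the term frequency looks like this:

```python
b, avgdl = 0.75, 100
for dl in (5, 100, 10_000):
    norm = 1 - b + b * dl / avgdl
    print(f"doc length {dl:>6}: divisor {norm:.2f}")
# A 5-word document gets a divisor well below 1 (its matches are boosted),
# while a 10,000-word document gets a divisor of about 75 (its matches are damped).
```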

Nicolay Gerold: And just to recap
a little bit, basically we have

The term frequency, which basically
just counts how often a term is

occurring in a certain document.

Then you have the inverse document
frequency, which is rather answering the

question, like, how valuable is a term,
by determining how easy it is to separate

the different documents in your corpus from
each other through the different terms.

And the easier it is to separate
one document from all the others,

the more valuable a term is.

And then you have the additional
more of a penalty factors.

And in BM25, you have
the document length, and you divide

it by the average document length,
but you also have the b factor,

which is basically a hyperparameter.

What can you tell me about that?

How does that actually impact the
results, and is there actual

value in tuning that parameter?

David Tippett: Gosh, that
is a really great question.

I'll answer that with a little bit of
a statement, more than anything: I don't have

a lot of experience tuning the b, but
what I will say is most times people

do not collect the sort of data
that they need to determine whether

tuning that would be valuable or not.

And I think that's one
of the really big things.

There are a handful of constants in
BM25 that for a lot of people are
good enough, but for some people in like
good enough but for some people in like
very specialized use cases, and I'll

say again, like coming back to that
different places where you might have

lots of term saturation or where you
might have lots of, varying document

lengths, where you might, tune those.

And I think this actually
comes back to what I think is

the core problem with search.

And all of these search engines, I'm
going to say, they do a very poor

job of helping people understand
how to make their search better.

Earlier you talked about the explain
endpoint, and for those who don't know,

the explain endpoint in Elasticsearch,
OpenSearch, basically gives you a

breakdown of how your document was scored.

And we've been talking
about documents here.

Documents in OpenSearch and
Elasticsearch are actually individual

fields and then the sum of those
fields for a query, we'll say.
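
For reference, the explain endpoint can be called directly over REST; a minimal sketch using Python's requests library, with a hypothetical local cluster, index name, document ID, and query:

```python
import requests

resp = requests.post(
    "http://localhost:9200/papers/_explain/42",   # hypothetical index "papers", doc id 42
    json={"query": {"match": {"title": "bm25 ranking"}}},
)
# The response walks through the scoring components for that document and query:
# per-field term frequency, IDF, length normalization, boosts, and the combined score.
print(resp.json())
```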

Nicolay Gerold: Yeah.

So basically to interject here, it's
you typically don't only look at

like the content of an academic paper
when you're running retrieval on it.

But you have multiple fields.

You have, for example, the authors, you
have the title, you have the abstract,

and then you have the entire content.

Or you might even split it up further.

You split it up into introduction,
methodology section, and so on.

That's a

David Tippett: There we go.

Nicolay Gerold: And you can
basically create a scoring function

over the different fields and weight
different fields in a different way.

And when you look at explain, it
basically breaks down how much impact

each field has and why it scored that way.

So I,

David Tippett: Exactly.

And I think this is where we need like
better tooling across the board to put

that in a more explainable, I was going to
say explainable, understandable way though

for, people just getting into the field
because for me, I've been, explaining

BM 25 to people for a long time.

So it feels a little natural.

I'm like, Oh, this document didn't match
because, this document's much longer

and it just got tanked in the ratings.

But for people just coming into search,
that can be really challenging because

it's Oh, I don't understand, like that
document got weighted more, but why?

So there needs to be some
better tooling around that.

Nicolay Gerold: I just want to
understand your intuition a little bit

behind the two big parameters,
which are the k1 and the b.

In which scenario would you
actually increase or decrease them?

So when you're looking at the saturation
of term frequency, when you have a high

amount of terms repeated in each document,

Would you go with a higher
saturation or lower?

David Tippett: It's a, it depends.

So this comes down to what?

Actually, this comes all the way
back to, what do your users expect?

A lot of times especially in
maybe we'll say within an academic

domain, there might be very like
common terms across that domain.

And each additional what is that called?

Each additional term might add more
value, so in that case you would want

the term saturation to affect score
much I don't remember off the top.

I guess at that point
you would decrease K.

If I remember off the top of my
head, it's been a little bit.

But so that's where you would look at that
and say, Hey, what do my users expect?

Are they expecting
repeated terms to count more?

Do they think repeated
terms are very important?

And then on the other one, the document
length: does document length matter?

Maybe you look at, your documents and
your index and you say, Hey, look, our

documents are between, 10 and 25 pages.

Document length doesn't matter that much.

So you could decrease the impact there.

Or it might matter a lot, in
which case you could increase

that that impact on your scoring.

But again, it's, you also need to
have data that says, the documents

that I expect are not matching.
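
If you do decide to experiment, both OpenSearch and Elasticsearch let you override k1 and b through a custom similarity in the index settings; a minimal sketch with hypothetical index, field, and parameter values:

```python
import requests

settings = {
    "settings": {
        "index": {
            "similarity": {
                "tuned_bm25": {"type": "BM25", "k1": 1.4, "b": 0.6}  # hypothetical values
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "similarity": "tuned_bm25"}
        }
    },
}
# Create the index with the tuned similarity attached to the body field.
requests.put("http://localhost:9200/papers", json=settings)
```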

And I almost, I, before I would
tune BM25, I would probably

tune my query weights first.

Because what I think most people will find
is they might be weighting fields poorly.

Actually, this is, this might be
a good place to talk about, like

we've talked about scoring here and
we've said scoring is, per field.

So a title field of a document
might have one set of scores, a

body field of a document might
have a different set of scores.

When you combine those scores,
that's, you get the total score

for that document, whether that's a
paper, whether that's like a record.

One of the challenges, though, is that
the range of scores for each of these
fields is going to be different.

And the reason for that is fields with
fewer words are going to have let's see,

much higher IDF, versus fields that are
longer are going to have much lower IDF.

Lower IDFs.

I'm trying to make sure I'm saying that
in the correct order, but the, so like

the potential score of a title field
might be an entirely separate range.

Like it could be like, I'm going to
throw out bad numbers here, like 20 to

25 could be your title field score range.

And then your body field
might be, like zero to five.

So this is where it gets really important
to start having relevancy practice,

which is where you recognize, Hey, the
body field regularly scores in a very

different range than the title field.

Maybe we should increase the boost for the
body field and put it in a similar

scoring range as the title field.

Or maybe the body matching doesn't
actually matter that much, which

is what a lot of people find.

Matches on title type fields tend
to be really important because

like you only have so many words
in a title and people tend to put

the most important words there.

That's where it gets really important
to know what are users expecting and are

they finding what they expect to find?
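
As a concrete sketch of that kind of query-time field weighting, a multi_match query in OpenSearch or Elasticsearch can boost one field relative to another; the index, field names, and boost value here are hypothetical:

```python
import requests

query = {
    "query": {
        "multi_match": {
            "query": "bm25 document length normalization",
            # Boost title matches 3x relative to body matches to bring the
            # two fields' score ranges closer to what users expect.
            "fields": ["title^3", "body"],
        }
    }
}
resp = requests.post("http://localhost:9200/papers/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```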

Nicolay Gerold: Yeah.

And this already goes to, like, how
much meaning the matches or the BM25 scores

carry. Because it tends to be that on
title fields, a very good match carries way

more significance and relevance than if
I have a very good match on a body field.

David Tippett: Yeah.

Nicolay Gerold: And this already
leads us a little bit into the

BM25F, which is basically just
saying it's field based and we

have a document with multiple fields.

Yeah.

David Tippett: That's a nice segue.

And actually I'm going to be a
little bit embarrassed here because

I have very little experience with
BM25F, but what I can say, and like

OpenSearch, OpenSearch also didn't have
support for BM25F when I was there.

And they still don't; it's on the roadmap,
which is a little bit of a hot take.

It's been pushed back quite a few times,
and actually I'm not sure why.

I'm not sure why, because I think
a lot of people are asking for it.

But to break down why BM25F
is important: we've said

BM25 scores are per field.

So it's either the title
field or it's a body field.

But what bm25f does is it calculates
the scores across all of those fields.

So now you've got this nice
combined what is that called?

Combined score, we'll say, between
several different types of fields.
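
For reference, one common simplified formulation of BM25F (sketched here with made-up field weights) first combines the per-field, length-normalized term frequencies into a single pseudo-frequency and only then applies saturation and IDF once:

```python
def bm25f_tf(term, doc_fields, field_weights, field_avg_len, b=0.75):
    """Combine a term's frequency across fields into one pseudo-frequency (simplified BM25F)."""
    combined = 0.0
    for field, tokens in doc_fields.items():
        tf = tokens.count(term)
        length_norm = 1 - b + b * len(tokens) / field_avg_len[field]
        combined += field_weights[field] * tf / length_norm
    return combined

# Hypothetical document and weights: the title counts 3x as much as the body.
doc = {"title": "bm25f explained".split(), "body": "a longer body about bm25f and fields".split()}
weights = {"title": 3.0, "body": 1.0}
avg_len = {"title": 5, "body": 200}
tf_combined = bm25f_tf("bm25f", doc, weights, avg_len)
k1 = 1.2
# Saturated once across all fields; the full score would multiply this by a single IDF.
print(tf_combined * (k1 + 1) / (tf_combined + k1))
```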

So yeah, I think that's
really interesting.

I think it's interesting that
OpenSearch hasn't implemented it,

given how many requests I've
seen for it in the community, but I

think it should be coming soon.

Hopefully

Nicolay Gerold: And I think it's, you
can also replicate it a little bit with

just using BM 25 on the different fields.

So I think it's probably not a
priority for them at the moment.

David Tippett: Maybe, but I think it does
add a lot, because it takes away

some of the work that you have to do
weighting all those different fields to get

them put into similar ranges.

And actually, we want to talk about
BM 25 search versus vector search.

This has also been a huge problem for
a lot of people, because what at least

OpenSearch found
in most of their research is BM25

does really great at we'll say
what's called zero shot retrieval.

So this is, it's never
been trained on anything.

It is a very straightforward algorithm.

And zero shot retrieval basically says,
hey, it's never seen this example before,

but it can go and find the right result.

Then.

You have vector search,
which is, it's been trained.

I don't know if I would
call vector search one shot.

But I feel like I've probably
just used that wrong.

And Jo Kristian (Bergum) is going
to come light me up on Twitter.

But they have very
different scoring paradigms.

So vector search is always
between, like zero and one.

And BM 25 can be a whole
range of different things.

Combining them, you're almost always
going to get a better experience.

Again, we've got scores that
are in hugely different ranges.

So how do you combine these
sets of documents that have

completely different scores?

And that's where I think actually
OpenSearch has done really well so far

by creating like pipelining tools that
allow you to do score normalization

and get a single unified set of
results back, which I think is cool.
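
A minimal sketch of the idea behind that kind of score normalization: min-max normalize each result list to 0-1, then blend with a weight. The document IDs and scores below are made up:

```python
def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

bm25 = {"doc1": 23.4, "doc2": 11.8, "doc3": 2.1}      # raw BM25 scores, wide range
vector = {"doc2": 0.91, "doc3": 0.88, "doc4": 0.40}   # cosine similarities, already 0..1

bm25_n, vector_n = min_max(bm25), min_max(vector)
alpha = 0.5  # weight between the two retrievers
combined = {
    d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * vector_n.get(d, 0.0)
    for d in set(bm25) | set(vector)
}
for doc, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(doc, round(score, 3))
```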

Nicolay Gerold: Yeah, and when looking
at BM25, I think like there are so

many different versions nowadays.

What's your, if you had to pick
one favorite child, what would be

your favorite adaptation of BM25?

David Tippett: I have a hard time saying
anything other than Okapi BM25, which is

like the baseline of BM25.

I feel like it's been around since 1994, which was
when it was released to the world at

TREC, which is the Text REtrieval
Conference, I think. So yeah, 1994.

That's a year before I was
born for context and it's

been around all this time.

And I think there's a reason for that.

And I think it provides a
very good generic way to

score and retrieve documents.

And other groups are
doing things differently.

I would not recommend it.

But Postgres' own search is,
what was it based off of?

I'm going to forget, of course, what
Postgres search relevance is based off

of, but basically optimizing for
how close the terms are together is one

of the things that comes into Postgres'
retrieval function, which I think is neat.

And if someone finds a way to add that
to BM25, I think that

is going to dramatically change how we
score documents, because term proximity

can add, I think, a lot of value.

But

Nicolay Gerold: I think Postgres built
it not on terms, but on top of lexemes,

which introduces a completely new area.

I think the full-text search in Postgres
was on top of lexemes, which is more

like a semantic unit of language,

which I think introduced a whole lot
of new issues, like what

makes up a semantic unit
and how that is even defined.

David Tippett: Yeah, it's interesting.

But yeah, one and this, to this point I'm
not saying that you shouldn't do full text

search on Postgres, but if I was doing
text search on Postgres, I'd be using

something something like Trieve or one
of these other layers that gets added on

top of Postgres that adds back in BM25.

Nicolay Gerold: Yeah.

I think ParadeDB is another one

On top of Postgres, which adds
a lot of the features you have

in regular search databases.

Yeah.

I think in the end, like
you said already, you should be more

oriented toward how your users are searching
and what documents you are serving.

And if you have a reason to go with
a variant of BM 25, which isn't

the default then you can do it.

Like, for example, BM25+,
I think, introduces a constant to

avoid zero term frequencies.

Especially for like very long documents.

That's very valuable to have because
you just have a way bigger term corpus.

And I think in general, you can look
through the different definitions.

The question is whether you will
outperform the baseline BM25 in a

David Tippett: Yeah.

Nicolay Gerold: significant way.

David Tippett: And that also
comes back to measurement: can you

measure what people want and apply that?

Because that's one of the places I know
a lot of people, they try a couple of

different things and they see which works
best, but when you start to measure things

and for those who don't know, there's
a lot of great ways to measure search.

The best, in my opinion, is NDCG, which
is normalized discounted cumulative

gain, but basically it's a measure
of, hey, I searched this and I have a

ranked set of results for this query.

How close to the top did my
highest ranked result get?

And likewise, the second, et cetera.
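
A minimal NDCG@k sketch in Python, assuming graded relevance labels for the results in the order the engine returned them:

```python
import math

def dcg(relevances):
    # Gain discounted by log2 of the 1-based rank plus 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=10):
    ideal = sorted(relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevances[:k]) / denom if denom > 0 else 0.0

# Relevance of the results our engine returned, in the order it returned them.
print(ndcg([3, 2, 0, 1], k=4))  # ~0.99: the best result came first, so we score close to 1
```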

And being able to measure that and
have that measurement and then apply

that to like, hey, I made a change.

Are my search results getting
better or are they getting worse?

I think that is probably the most valuable
thing you can do in a search organization,

because then you can start to experiment
and say, hey, can we improve, query times

by trying this different modification
of BM25 or Can we simplify search?

Do we need all of these different fields?

Do we need trigrams, n
grams, etc., on and on?

Are they adding any value?

Nicolay Gerold: Yeah.

And this, I think, is one smart thing
you already mentioned: the part of

search which is often underappreciated.

It's not just like relevance.

It's also like the other
performance characteristic you

might want to optimize for.

Like latency is a massive one.

David Tippett: gosh,

Nicolay Gerold: Like doing a vector
search on a large corpus will likely

introduce some issues in terms of
latency, unless you're willing to

take on a whole lot of inference
costs. At the same time,

you might want to optimize a little
bit more for, like, high-ticket items

to have like higher ticket sales.

Like, search is very business driven,
which makes it so interesting to me as

a domain, because it's very oriented
to more of an engineering discipline.

David Tippett: Yeah.

I think it's a little funny though,
because you highlighted that search is

business driven, and I'm admittedly in a
place where search isn't our core, like in

e commerce, it is like everything, right?

You directly, you tie dollars to searches
in GitHub, it's like search is valuable.

Like we want people to
find the right things.

But the most valuable thing I think
in GitHub is like, Hey, are you

able to develop with a community
of people and interact well?

So it's hard saying, hey, we at GitHub
should spend a ton of money, like,

building the most advanced search ever.

And they're like, that's great.

That's really cool.

Does it add that much
more value to people?

And I will let the people decide
comment, respond retweet, whatever

at us and let us know would you value
a more relevant search at GitHub?

Nicolay Gerold: I would.

David Tippett: That's my own plug.

That's, this is me trying to get
funding for more search at GitHub.

But and I will say, GitHub, this
is, you talked about latencies.

This is where it gets really tricky,
especially in the search domain.

So at GitHub, technically
all of our data is hot.

We never deprioritize data.

Some documents, some commit you created,
seven years ago might be just as relevant

as a commit you created yesterday.

And that causes all sorts of challenges
because we have around a hundred

billion documents.

I don't have exact counts off the
top of my head, but around a hundred

billion documents, and we need to be
able to say, Hey this is the most

relevant commit for what you want.

And it's just.

It's really hard because you
have users doing all sorts of

weird things with GitHub search.

We have people who are Oh,
what's the word for that?

People that are basically looking
for security vulnerabilities using

different parameters in our search
or people that are scraping us

trying to find certain things.

We have people that are, just business
users like saying, Hey, I'm just

trying to find an issue We have
developers who are like, Hey, do

we have a dependency on something?

And they're searching
within their code base.

It's a lot of different use cases
and it makes it really challenging.

Nicolay Gerold: Yeah.

And when you're looking at a large-scale
search system like GitHub, yeah.

What do you think, how does BM25
fit into the larger search system?

David Tippett: Under the hood,
it's technically everywhere.

All of our fields, we don't,
we can't afford to do.

Maybe I shouldn't say we can't
afford to do vector search.

We could probably do vector search if
we wanted to, but BM 25 is so incredibly

cost effective, it's very hard to justify
doing any other type of search other

than BM 25, because BM 25 gives us really
fast search across all of our data.

And that's one actual point that
might be good to bring up: there

are lots of layers on top of BM25 that
you use to improve search latency, that

you use to improve or reduce the impact
of a search in your entire cluster.

So if you do an open search, you're
gonna hit all of GitHub, right?

And that's expensive.

But we can start to do things and
layer things on top of that to

reduce the impact of that search.

Whether that is, filtering down to an
organization, then we can, as long as

we have all of our documents routed
appropriately, we can reduce the scope to

just documents within that organization.

Then you don't hit our
entire search cluster.

We reduce, I was going to say shard
size, but I guess it's, yeah, shards.

I'm like, I always confuse shards
and segments in OpenSearch.

It's Lucene segments, and
Elasticsearch and OpenSearch

use shards, but shard size.

So a shard size for Elasticsearch or OpenSearch
should be about 20 to 50 gigabytes.

And again, that way you can use the most
of your CPU to search across this data.

So BM25 sits as the
foundation of all of that.

We have, all of our fields are
indexed into, BM25 type text fields.

Um, and then we're adding
layers on top of that.

Hey, we've indexed this title
field three different ways.

And the exact matches are top tier, and
then maybe we'll have an n-gram field,

and that's like the next tier, and then
the lowest would be like a trigram, if

it matches on a trigram, that allows us
to retrieve things that might be really

hard to index otherwise, like labels
that have bunches of special characters.

Nicolay Gerold: And I think what really
nails the point home is that the

different search queries you actually
face at GitHub that you mentioned

before, BM25 can serve them all.

And if you were using something
like vector search, you'd likely need

fine-tuned models for the different types
of searches you encounter, and then you

need like a routing layer on top of it
that basically recognizes the types of

searches you're currently encountering
with the user that routes to the right

type of vectors and vector model.

David Tippett: That's so true.

And honestly, doing vector search at
the scale of GitHub would be just so

incredibly problematic, because for HNSW-based
vector search, that's all sitting

in memory, and I can't even imagine what
it would be like fitting 100 billion,

never mind a billion, records in memory.

And then, you have some other ones.

Oh, gosh I'm going to forget that.

What's the other one, the I-something vector search?

I keep wanting to say IDF,
but there's another, more

bucketed version of vector search that
allows you to bucket your vectors into

different pools. IVF is the one, yep. And the challenge with

that one is you're gonna start losing out
in relevance, because maybe you've

routed it to the wrong bucket. So it just gets really tricky,

and I would like to see us using vector
search, actually. I do think there is a use

case, and maybe it's some sparse vectors.
But we are very far away from that.

Nicolay Gerold: Yep.

I would actually love to know
how the storage

requirements in terms of

memory and disk space
compare between BM25 and vectors,

if you have some rough estimations
or some intuition on that.

David Tippett: Actually, I'll, this is one
thing I can say that's really interesting.

Vector search is wonderful because it
scales, I don't want to say linearly.

What am I trying to say?

The size of vector documents are constant.

That's where I'm getting to.

So vectors, when you output them,
they're of constant length. You have

a, say, 1,024-dimension vector, and all of
those individual vector dimensions are

going to be like a floating point
64 or something like that.

So that makes it nice because vector
search, you can estimate really well

what your memory requirements are.
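
As a back-of-the-envelope illustration of that point (the document count, dimensionality, and precision below are round assumptions, not GitHub's actual numbers):

```python
docs = 100_000_000_000          # ~100 billion documents (assumed round number)
dims = 1024                     # assumed embedding dimensionality
bytes_per_dim = 4               # float32 per dimension
raw_vectors_tib = docs * dims * bytes_per_dim / 1024**4
print(f"raw vectors alone: ~{raw_vectors_tib:,.0f} TiB")  # ~373 TiB before any HNSW graph overhead
```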

BM 25, it's like throwing a dart on
a board because you start with, oh,

okay, here's like my base document.

That's great.

You have your base document size,
but then you might ingest it twice.

What if you ingest it once with, like
one type of analyzer and then a second

time with another type of analyzer?

So I feel like BM25 becomes
very challenging to guess for.

One of the really funny things that I'm
starting to wrestle with is our nodes

at GitHub. We basically have 10x, I'm
going to say 10x, the amount of storage

that we actually need for our BM25
indexes, because the biggest challenge

we've run up against is that the IOPS for the
storage are actually really important.

Like how fast can we execute
operations against that storage?

And that tends to only be
available on high storage nodes.

So we've got these huge
nodes with like tons of IOPS.

And we store maybe we've got a
node with a terabyte of storage,

and we're using 100 gigs on that.

Now, we have, hundreds of these nodes but
that's all because how fast we can access

the storage is actually really important
to the overall latency of our search.

Nicolay Gerold: Yeah, nice.

Can you maybe go a little bit
into the technical parts of BM25?

Like, where the index is stored,
where the raw data is stored, where

is all that stuff pulled, and how
is it actually used when I index it,

but also when I run some queries?

David Tippett: Yeah, that's
a really good question.

So OpenSearch and Elasticsearch,
a lot of stuff is pre-computed,

like pretty much anything that they
can pre-compute for a document and

for an index is pre-computed.

So documents, as they come into,
OpenSearch or Elasticsearch,

they're going to hit

the REST API, and from the REST API,
they're going to get broken apart

into all their individual terms.

So all the terms get separated out.

And then that's stored temporarily in
a place called the translog, which is

like a write ahead log for those who
are familiar with like database terms.

But that's a very temporary place
that it gets stored before eventually

it gets like flushed to disk.

And I say flush to disk.

So OpenSearch and Elasticsearch
maintain a reverse index and a

reverse index is basically a hash map
where there is a term and then all

of the values for that term are the
documents that contain that term.

And it does this for every single term.
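
A toy sketch of that structure in Python: a term-to-postings map, plus the document-frequency counts that IDF needs. The documents are made up:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick fox",
}

inverted = defaultdict(set)          # term -> set of doc ids containing it
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

doc_freq = {term: len(ids) for term, ids in inverted.items()}  # feeds straight into IDF
print(dict(inverted))   # {'the': {1, 2}, 'quick': {1, 3}, ...}
print(doc_freq)         # {'the': 2, 'quick': 2, 'brown': 1, ...}
```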

So a document and a table or like an
index in OpenSearch are very different

because a document you think of, and I
think of, as oh, that's the whole thing.

However, when OpenSearch looks at it
or Elasticsearch looks at it, they're

looking at a giant table with all
the terms with pointers back to the

documents that contain those terms.

So this layout makes it really,
easy to do things like IDF

because okay, here's my term.

Here's the count of
how many documents contain that term.

It's right there.

And so we can start to store that
calculation of how many

other documents contain this
term, so that can be cached, same with

things like IDF that, oh, actually,
that was what I just described so you

have, like, all these layers of these
different pieces of BM25 being cached,

and that allows us to do really cool
things, optimizations for only retrieving

the documents that we anticipate

are going to have the
highest match, we'll say.

And a lot of this breaks down to, it's
funny, it's vector math, like matrix math.

Because what you do is when you
have a query, you break your query

into the individual terms, we're
gonna go retrieve those terms those

term dictionaries from the index,
and then matrix math to determine

which of them scores the highest.

So I, I have, that is my, I'll
say high level or maybe low

level understanding of it.

I don't know that I could talk too much
more specifically about it than that.

Nicolay Gerold: No, I think that's the
misconception many people have when

they go into search databases,
because to most people, it looks like

a document database, but it isn't.

David Tippett: And this is also
why joins get really tricky.

Because what you have to do to
effectively do a join, is you have to

go and you have to calculate the scores.

For all of these different documents,
like if you wanted to join two tables,

to use SQL terms, you have to go

and retrieve all these documents.

Then you have to put them into memory
and you have to try and mush them

together in the appropriate manners.

So you miss out on a lot of the
optimizations that like SQL databases

can do, because they have
very efficient ways of doing

some of the join work ahead of time.

Nicolay Gerold: Yeah, in the
end it's: what are you optimizing for?

That's, I think, something that's talked
about more in traditional database

courses: that you have a workload mismatch.

And that's why you actually use
different databases because the

stuff you pre compute, which is
the indexes, which allows for the

high-performance computations
that you're doing on the data.

This is the stuff that actually gets you
the high performance, but this also makes

other computations or other workloads
you could perform very inefficient.

David Tippett: yeah.

Yeah.

And it's challenging.

That's why for each of the different
database types that you might pick

for something, it's very important to
choose like the right data model for

that. A really great example is that
OpenSearch is probably very similar

to Cassandra in the sense that you
want everything you need for that

query to exist within that document,
and Cassandra I think does that a lot.

It's very different from Redis, though.

Redis you're breaking things down into,
Hey, I have a key, and then I have a

value, and I'm always doing a lookup
of this key and a lookup of this value.

Maybe I'll do a handful of key
operations at the same time.

And then versus SQL.

SQL is like, Hey, I want to separate each
thing into their own nice little, uh,

compartments, like these are my users,
these are attributes of users that gets

joined on, maybe there's blogs, these
are the blogs that belong to the users,

that just doesn't make sense in something
like OpenSearch or Elasticsearch.

Nicolay Gerold: And what is actually
something that you would love to

see built in the search space?

David Tippett: Oh, yeah, okay.

The person who solves this challenge
wins search, as far as I'm concerned.

So I see a lot of people, like
we've talked a lot about like the

newest research, what's coming out.

The person who finds a way to, I'll say
automate or build a general enough system

that not only does the search portion,
which is like information retrieval, but

also the Optimization stuff like, Hey,
this was the query, this was the result.

They clicked on this result though,
and that was the final outcome.

If there, if someone figures
out, and for this workflow,

it's called learning to rank.

If someone finds a way to automate,
so to speak, learning to rank or to,

build these relevancy tuning workflows
into a search engine, That's it.

They have won the search game and
they will remain champion forever.

And I think there are
people working on it.

I think actually Algolia, I've seen
a lot of stuff coming out of Algolia,

which is funny because they're a
very proprietary search engine.

But I think they are starting
to do some of this stuff.

Same with OpenSearch.

OpenSearch actually is working on an
open standard called user behavior

insights, or UBI, and it's a standard
for collecting this information so that

you can automate workflows like that.

To that point, I feel like OpenSearch has
a head start, but the game is anyone's.

If someone can figure out how to
well integrate these relevancy

workflows with the retrieval
workflows, they're going to win.

Nicolay Gerold: Yeah.

And what's next for you?

What is, what can you tease?

David Tippett: I've won.

There is nothing I want more.

And I'll just, I'll put this out there.

I'll make this public.

I am working around the clock
to try and get GitHub to adopt

a single search result set.

That is like my lifelong goal.

If I can achieve that, I would be so
happy because right now I feel like the

UX for GitHub, it's, we're very much
shipping the org, so to speak, with our

search results, and I don't like that.

I think there is a case where we build
a single pane of results where you are

getting the exact result you're looking
for, regardless of whether you're looking

for an index, regardless of whether you're
looking for a repo or code or any of

these other things, which also I should
say, code search is very different than

traditional, like text search in GitHub.

And I work on text search, but I
would love to merge text search

and code search much more closely.

So yeah, the unification of GitHub search.

That is like my next
couple of years, probably.

Nicolay Gerold: Nice.

And if people want to follow along
on your journey of trying that

and let's see how successful you
will be, where can they do that?

David Tippett: Yeah.

So I have a couple of places
that I live very frequently.

I have a blog, tippybits.com.

I have also my podcast as well, which
is called For the Sake of Search.

And For the Sake of Search is
probably a much more search-focused

podcast where I've been talking
with people about, what does it

look like to hire a search engineer?

Like, what are different
companies doing in the search space?

You talked about ParadeDB.

Actually, I think they were my first ever
interview on the podcast, which is cool.

So yeah, there's a lot of different
things I'm talking about there.

And yeah, follow along and also let
us know, does GitHub search suck?

How can we make it better?

I'm happy to hear.

Nicolay Gerold: So what can you take away when
we're building search systems?

I think first is search is
never just about relevance

but about the user experience.

So you want to align with
what the user expects and also

what his information need is.

So, for example, for GitHub, they
have a lot of different use cases.

So for example, what he mentioned
was finding security vulnerabilities,

exploring codebases, managing repos,
finding old commits, finding old

documentation, finding old issues, and
each requires a different prioritization

of fields, different boosting strategies,
and possibly it would even require

like distinct search workflows.

So you actually need a technology which
can handle a lot of different types of

queries and different types of documents
as well, and this is where BM25 in my

opinion really shines, because it is
very good at this zero-shot retrieval,

without having seen any of this query
before. At the same time, its strength

is in its simplicity and efficiency.

So when we are building search engines,
most of the time when we are building on

top of like a bespoke search database,
like Elastic, OpenSearch, Vespa, we are

already at a point where we actually have to
go distributed, so we have a large corpus.

And BM25 is really suited for these
scenarios because it's so lightweight

and computationally efficient,
especially compared to vector search.

And this makes it really unbeatable for
cases like GitHub, where you have an

immense scale, cases where you have a
lot of different types of queries, but

you can't easily differentiate between
them, or you don't have the time or the

latency to differentiate between them.

So if you would have the case where you
really can classify easily, hey, it's

a certain type of search where the user
cares more about the accuracy, then

you can pipe to a slower
type of query or pipe to a vector

database and run the search there.

Then you could basically have a
vector database plus a regular

search database and basically
decide on query time where to go.

But this also is impossible in GitHub's
case, because you never know what the user

or what types of document the user is
searching for and they have like such a

large corpus which would be interesting
where to place them basically and how

much memory the HNSW index would take.

For the future, I think what becomes
more and more interesting is standards

like the User Behavior Insights.

I think it came out of OpenSearch,

which is trying to standardize how to
integrate user behavior, like click

data, dwell time, into retriever systems
and also into retriever algorithms.

I think this might make it easier to
build learning to rank algorithms.

And for a lot of people, you
suddenly have a unified data structure,

which is the main thing I think UBI
put forward: a data structure basically

for the user behavior and how to store it,
which should make it easier to basically

train the algorithms, like learning to
rank and use them in your search system.

In the end, for me, with search systems, it

always comes down to the architecture
and to how well you actually scope

the problem and understand the
problem you're trying to solve.

Because in search systems, you
really need to figure out in

advance what you're optimizing for.

Whether it's really low latency, whether
it's high accuracy, high recall, whatever.

Or whether it's really low cost,
when you have a really large-scale

system like in GitHub's case.

These types of architectural decisions
have to be made like mostly upfront

because it's really hard to change
it once you already have all of

your data in a search database and
it's already indexed and set up.

Otherwise thanks for joining.

We will continue with more on RAG soon.

So, subscribe, also leave a review,
especially on Spotify and Apple and if

you have any questions, any wishes on
different parts of search, you would

like me to go into or get a guest on, I
should have on the podcast and interview,

feel free to shoot me a message.

Otherwise, I will talk to you next week.