How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
Nicolay Gerold: People implementing RAG
usually jump straight into vector search,
but vector search has a lot of downsides.
So first of all, it's
not robust out of domain.
Different types of queries
need different embedding models
with different vector indexes.
And some query types
might not work at all.
It's also very costly in
terms of compute and storage.
We have to keep our indexes
in memory so we can achieve a
low enough latency for search.
And what we are talking about
today works for everything.
It works out of domain and is one of the
most efficient types of search out there.
You probably guessed it already,
we are talking about the OG
ranking function of search, BM25.
Today we are continuing our series on search on How AI Is Built with David Tippett.
David Tippett is a search engineer at GitHub, after working at OpenSearch for a long time.
We talk about BM25: how it works, what makes it great, and how you can tailor it to your use case.
And also how BM25 is used at GitHub for a hundred billion plus documents.
David Tippett: I have not used Vespa,
but I can say from the people that
I know that use Vespa, it's amazing.
It's really good, but at the same time, it's not for everyone.
I would say it's like a top 5 percent search engine.
If you are in the top 5 percent of search users, or in the top 5 percent of, what is that called, vector search users, that's where you're really picking up Vespa, and you know what you're doing.
Maybe not a lot of people know yet, or haven't found out yet, that Vespa came out of Yahoo.
It actually went into Yahoo in the early 2000s and then later left Yahoo.
So yeah, they have a lot of lingering stuff, because apparently Yahoo was a big Java shop.
I don't know.
Nicolay Gerold: Too much business is still being done in Java.
I learned programming in Java, and I still look back with dread and hatred at the courses I took.
David Tippett: I'm in the same boat.
I'm using Ruby now and Ruby is like
one of those like mythical languages
because it does so much in the background
for you, versus Java, where you're declaring everything.
You've got interfaces for everything.
It is so verbose.
It's painful.
Nicolay Gerold: That's maybe an interesting starting point: if you had to put search engines into the same buckets as programming languages.
So basically, Python, the really simple one; C, really low-level; Rust, the newcomer, the really hyped one; Go, more of the cloud-native one.
Where would you place the
different search engines?
David Tippett: Oh, this is perfect.
Okay.
I love this question because
I have really strong opinions.
So Weaviate is where I feel like a
lot of people get started, right?
It's the straightforward,
super documented.
Painfully well documented almost
to the point where, anyone could
get into Weaviate and start
doing something, which is great.
And they're integrated with everything: LangChain, all of those things.
Once you get to a certain point, though,
where you're doing some more advanced
relevance features, you're gonna start
feeling the pain there where Weaviate's
just not quite scratching the itch.
And that's where I feel like
most people will move into one
of the REST based search engines.
That's going to be Solr, that's going to be OpenSearch, that's going to be Elasticsearch.
If there was a spectrum there in that middle ground of search engines, I feel like Elasticsearch is the easiest to get going with.
Then OpenSearch kind of sits in that middle category, where it's nice because it's really flexible, but at the same time it's also challenging, because the documentation is still coming up to speed.
OpenSearch will let you build
search however you want.
And sometimes that's a good thing
because you can do really cool
things, and sometimes that's a bad
thing because you could accidentally
do something where you're like,
that makes no sense whatsoever.
Solr, I can't really place, I don't have
enough experience with Solr to say like
where in that middle ground it falls.
And then at the very top end,
I feel like you've got Vespa.
And I say Vespa is like the top end
because Vespa is where I feel like PhDs go
to build, massively scalable distributed,
like vector search applications.
Vespa is always on top of
the most recent research.
Which means they have the most cutting-edge features.
It's really good, but then operationally it's challenging, because it was built to be run inside of Yahoo and their infrastructure frameworks and paradigms, so running Vespa itself is not as straightforward.
Nicolay Gerold: Would you actually
say it's like partly where the
different search engines came from?
So from which companies?
Because Yahoo, likely they have a large
infrastructure team, so they don't
really care about like how complex it is.
What are like the different
backgrounds of the different
search engines you work with?
David Tippett: That's so funny.
I'm writing an article right now,
and I've said that for the last three
podcast episodes I've recorded for
myself and here, but I'm writing this
really nice detailed blog post about
how these engines are being built.
So like one of the really interesting
things that not a lot of people think
about is how companies make money
affects how they build their business.
A great example is Elasticsearch.
Elasticsearch, their biggest
thing is their licensing.
So they make it extremely
easy to onboard, right?
Lots of really great documentation.
Lots of integrations with libraries.
Kibana has a lot of workflows to walk you through, and that's to get you onto the platform.
And then it's lacking some of the feature richness, but once you're on the platform, the idea is you're there to stay.
Versus OpenSearch, which is, I'm going to say, being built heavily by AWS.
We'll say they're one of the largest contributors to OpenSearch.
That affects how OpenSearch is being built, because AWS is more of an infrastructure-as-a-service type framework, and as a result OpenSearch is reflecting that.
You can pick all of the various
components you want in OpenSearch.
But as a result of that, sometimes it
can be hard building a complete solution.
Because you're like, I don't know which vector search engine I want under the hood of OpenSearch.
I just want vector search.
I don't know which one I want, or why I would pick one versus the other.
And that makes it a little bit more challenging to get going with OpenSearch.
Nicolay Gerold: It seems a little bit like shipping the org structure of Amazon again: they want the optionality to do everything.
David Tippett: Yeah.
And, to some degree, I think it's great.
I will say there are definitely advantages to being able to pick different components and how you want them to operate, but yeah, it is very painful if you don't have experience with that.
And that's where I like to recommend Elasticsearch for people just getting started, because it's going to give you a very nice, hey, look, this is how you do vector search in Elasticsearch.
And it's great.
It's straightforward, and it works.
But at some point you're going to be like, hey, I want to do vectors that are more than 2,000 dimensions, and you're going to start running into the bottlenecks of Lucene's vector search there.
Nicolay Gerold: Yeah, and if you had to give people a curriculum of things they should learn for search, what are the algorithms, data structures, and techniques they should learn?
David Tippett: No, that's
a really good point.
And a lot of people like to use search, but they don't think very much about how search works under the hood.
So that's where understanding some
of the core principles of your
search engine is really important.
So BM25 is probably the top one
that I feel like I harp on a lot.
I did a RAG demo for a long time, which, for those who don't know, is retrieval augmented generation, and it's big in the vector search space.
But I did it with BM25 search, and people were just shocked.
They're like, that works?
And yes, it does.
But yeah, BM25 and reverse indexes.
Reverse indexes are the core component of these search engines, and that's why they're fast for text search.
And also why they're terrible at other sorts of things, like joins!
People are like, why
can't OpenSearch do joins?
And it pretty much boils down
to, reverse indexes and joins
just don't go very well together.
It takes a lot of memory
to do things like that.
Yeah,
Nicolay Gerold: Can you maybe go a little bit into what BM25 is and what makes it great?
David Tippett: Yeah.
So this is a really interesting one
because when I was getting into search,
I had no experience or background.
So I had no idea what any of this was.
People were throwing around these terms like TF-IDF and BM25 and NDCG.
There are a lot of terms here.
And the funny thing is, they're actually not that challenging to understand once you understand where they come from.
BM25 is the core retrieval
and scoring function for most
text based search engines.
So that's Solr, that's
Elasticsearch, that's OpenSearch.
Weaviate and Vespa tend
more towards vectors.
I know Vespa has a very good BM25 index as well.
But BM25 stands for Best Match 25, and it is, we'll say, a scoring function to say how well does this query match this document.
And it came from what was TF-IDF before, which stands for term frequency, which is like a count of how many terms, and then inverse document frequency, which I'll talk about a little bit more in just a minute.
But it fixed some of the problems that they had with TF-IDF, because TF-IDF did really well at scoring documents against queries, but it had some pitfalls.
Like for really long documents, right?
It was weighting those exactly the same as a really short document.
And you wouldn't exactly weight those the same.
If you had a document that had 10,000 words and one word matched your query, and a document that had five words and one word matched your query, which is a better match?
It's probably the one with five, because out of five words, they chose one of the matching query words to put in their title.
BM25 has a couple of different parts.
So maybe, should I just walk through this? What do you think?
Nicolay Gerold: The key innovation from TF-IDF to BM25 is to penalize the results in the score itself, and to basically have some form of regularization based on different factors.
And I think this would be the most interesting way to start: how does BM25 penalize documents?
David Tippett: That's a really good point, because of one of the things they found with TF-IDF.
We'll start with the first one: term frequency.
Term frequency is basically the count of, again, how many terms in your query match terms in a document.
Say I write "the quick brown fox" or whatever, and I'm searching my documents.
If I have a document that is just the word "THE", that's going to match really highly, because "THE" matches "THE", and it matches it a lot of times.
But, you and I know that hey,
a document filled with just
"THE"'s that's not really a match.
That doesn't really tell us anything.
There's this idea called term saturation, where, you know, words that are repeated too much still count, but they count a little bit less each additional time they match.
So there's this nice curve where it
tapers off and it says, Hey, even if
you include this word like a thousand
more times, it's not going to add
that much to the overall score.
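To make that saturation curve concrete, here is a minimal sketch (assuming the common k1 default of 1.2 and leaving length normalization out):

```python
# Minimal sketch of BM25-style term-frequency saturation.
# k1 = 1.2 is a common default; the contribution can never exceed k1 + 1.
def tf_saturation(tf: float, k1: float = 1.2) -> float:
    return (tf * (k1 + 1)) / (tf + k1)

for tf in [1, 2, 5, 10, 100, 1000]:
    print(tf, round(tf_saturation(tf), 3))
# 1 -> 1.0, 2 -> 1.375, 5 -> 1.774, 10 -> 1.964, 100 -> 2.174, 1000 -> 2.197
# The curve flattens out: the thousandth repetition of "the" adds almost nothing.
```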
Nicolay Gerold: And this is
basically within a document, right?
David Tippett: within the document, yeah.
So like I'll have the word the in my
query, and if the document is just all
the word "THE" a thousand times, it's not
actually going to score that much higher
for each individual instance of the.
Which is important because we
have a lot of these words in,
especially the English language.
I can speak to heavily to that
one because I know that one well.
But we have these stop words like the,
a and they just don't tell us that much
about does this query match this document?
So on the flip side of that, words that are very unique, like antidisestablishmentarianism, are not going to appear in almost any of the documents, and they will score very highly because of the last part of TF-IDF, which is the inverse document frequency.
And inverse document frequency is magic,
I'll say because what it does is it looks
at all of the other documents in the index
and it says, hey, how many other documents
in this index does this word exist in?
And if this word exists in one document
and you have it in your query, out
of a thousand other documents, that
is going to contribute super highly
to your score because they're like,
this is the only document that has
this word and this word is in your
query, therefore it's very important.
So that's the IDF portion of it.
And again, I said BM25 is an iteration of term frequency, inverse document frequency.
So the two things that BM25 adds are the term saturation, so that repeated terms count less for each additional instance of that term, and then the other thing is they normalize for document length.
And this is like a really big one.
And actually, there's some new iterations
of BM25, which we'll probably talk
about a bit, that also help with this.
But what they found was that
titles tend to match really great.
And then really long
documents match really poorly.
If you have two documents in the same
index, and one has only a few words,
and one has, a thousand words, those
should be weighted differently, right?
Because a match in one with five words
should be like, oh, one out of five words
matches versus one out of a thousand.
So that's the second part that BM25 changed significantly: the ability to normalize for document length.
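Putting those pieces together, here is a hedged, minimal sketch of an Okapi-style BM25 scoring function as described here: IDF, term-frequency saturation via k1, and document-length normalization via b (k1 = 1.2 and b = 0.75 are common defaults; the smoothed IDF variant is similar to what Lucene uses):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Sketch of Okapi BM25: sum over query terms of IDF times a saturated,
    length-normalized term frequency. `corpus` is a list of tokenized docs."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)          # docs containing the term
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))    # rare terms weigh more
        tf = doc_terms.count(term)                       # raw term frequency
        norm = 1 - b + b * len(doc_terms) / avg_len      # document-length penalty
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score

corpus = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"], ["quick", "fox"]]
print(bm25_score(["quick", "fox"], corpus[0], corpus))
```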
And I'll say it's interesting because
In my own day to day life, I don't
think I appreciate this or recognize
it as much as I probably should.
And part of that is because we
naturally break documents up
into their individual components.
And at least within GitHub, I can say, documents in similar types of fields tend to be around the same length.
Most titles aren't 10,000 words; most titles are between maybe 10 to 20 words.
So that document length normalization doesn't really hit us super much.
But where you would see this a lot
is in things like research journals.
It's like research journal
search, book search.
Those are the type of places
where you could have a really
short book, really long book.
You could have a really dense academic paper.
You could have a really sparse academic paper.
And yeah, that's where you're
going to see that, that document
length coming into play a lot.
Nicolay Gerold: And just to recap a little bit: basically we have the term frequency, which just counts how often a term occurs in a certain document.
Then you have the inverse document frequency, which is rather answering the question of how valuable a term is, by determining how easy it is to separate the different documents in your corpus from each other through the different terms.
And the easier it is to separate one document from all the others, the more valuable a term is.
And then you have the additional, more penalty-like, factors.
And in BM25, you have the document length, and you divide it by the average document length, but you also have the b factor, which is basically a hyperparameter.
What can you tell me about that?
How does it actually impact the results, and is there actually value in tuning that parameter?
David Tippett: Gosh, that is a really great question.
I'll answer that with a bit of a statement; I don't have a lot of experience tuning the b.
But what I will say is, most times people do not collect the sort of data that they need to determine whether tuning that would be valuable or not.
And I think that's one
of the really big things.
There are a handful of constants in BM25 that for a lot of people are good enough, but for some people in very specialized use cases, and I'll say again, coming back to those different places where you might have lots of term saturation or lots of varying document lengths, you might tune those.
And I think this actually comes back to what I think is the core problem with search.
And all of these search engines, I'm
going to say, they do a very poor
job of helping people understand
how to make their search better.
Earlier you talked about the explain endpoint, and for those who don't know, the explain endpoint in Elasticsearch and OpenSearch basically gives you a breakdown of how your document was scored.
And we've been talking about documents here.
Documents in OpenSearch and Elasticsearch are actually scored as individual fields, and then the sum of those field scores for a query, we'll say.
Nicolay Gerold: Yeah.
So basically, to interject here: you typically don't only look at the content of an academic paper when you're running retrieval on it.
You have multiple fields.
You have, for example, the authors, you have the title, you have the abstract, and then you have the entire content.
Or you might even split it up further: you split it up into the introduction, the methodology section.
That's a
David Tippett: There we go.
Nicolay Gerold: And you can basically create a scoring function over the different fields and weight different fields in a different way.
And when you look at explain, it basically breaks down how much impact each field has and why it scored the way it did.
So I,
David Tippett: Exactly.
And I think this is where we need better tooling across the board to put that in a more, I was going to say explainable, understandable way for people just getting into the field.
Because for me, I've been explaining BM25 to people for a long time, so it feels a little natural.
I'm like, oh, this document didn't match because this document's much longer and it just got tanked in the ratings.
But for people just coming into search, that can be really challenging, because it's, oh, I don't understand, that document got weighted more, but why?
So there needs to be some
better tooling around that.
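For reference, a hedged sketch of calling the explain endpoint mentioned above, using the opensearch-py client (the index name, document id, and field are made up for illustration):

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Ask the engine to break down how one specific document was scored for a query.
explanation = client.explain(
    index="papers",                                       # illustrative index name
    id="42",                                              # illustrative document id
    body={"query": {"match": {"title": "bm25 ranking"}}},
)
print(explanation["explanation"])  # nested tree of tf, idf, and length-norm factors

# Alternatively, setting "explain": true in a normal search body attaches the
# same breakdown to every hit.
```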
Nicolay Gerold: I just want to understand your intuition a little bit behind the two big parameters, k1 and b.
In which scenario would you actually increase or decrease them?
So when you're looking at the saturation of term frequency, when you have a high amount of terms repeated in each document, would you go with a higher saturation or lower?
David Tippett: It's an "it depends."
This comes all the way back to: what do your users expect?
A lot of times, especially, we'll say, within an academic domain, there might be very common terms across that domain.
And each additional term might add more value, so in that case you would want the term saturation to affect the score, hmm, I don't remember off the top.
I guess at that point you would decrease k1, if I remember off the top of my head; it's been a little bit.
But that's where you would look at that and say, hey, what do my users expect?
Are they expecting repeated terms to count more?
Do they think repeated terms are very important?
And then on the other one, the document length: does document length matter?
Maybe you look at your documents in your index and you say, hey, look, our documents are all between 10 and 25 pages; document length doesn't matter that much, so you could decrease the impact there.
Or it might matter a lot, in which case you could increase that impact on your scoring.
But again, you also need to have data that says the documents that I expect are not matching.
And before I would tune BM25, I would probably tune my query weights first.
Because what I think most people will find
is they might be weighting fields poorly.
Actually, this might be a good place to talk about it: we've talked about scoring here, and we've said scoring is per field.
So a title field of a document
might have one set of scores, a
body field of a document might
have a different set of scores.
When you combine those scores,
that's, you get the total score
for that document, whether that's a
paper, whether that's like a record.
One of the challenges, though, is that the range of scores for each of these fields is going to be different.
And the reason for that is fields with fewer words are going to have, let's see, much higher IDF, versus fields that are longer are going to have much lower IDF.
I'm trying to make sure I'm saying that in the correct order.
But the potential score of a title field might be in an entirely separate range.
I'm going to throw out bad numbers here, but 20 to 25 could be your title field score range.
And then your body field
might be, like zero to five.
So this is where it gets really important to start having a relevancy practice, which is where you recognize, hey, the body field regularly scores in a very different range than the title field.
Maybe we should increase the boost for the body field and put it in a similar scoring range as the title field.
Or maybe the body matching doesn't
actually matter that much, which
is what a lot of people find.
Matches on title type fields tend
to be really important because
like you only have so many words
in a title and people tend to put
the most important words there.
That's where it gets really important
to know what are users expecting and are
they finding what they expect to find?
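As a hedged illustration of that kind of query-weight tuning, a multi_match query body with per-field boosts might look like this (index and field names are illustrative):

```python
# Per-field boosting in an OpenSearch/Elasticsearch query body.
query = {
    "query": {
        "multi_match": {
            "query": "quick brown fox",
            # "^3" multiplies the title field's BM25 score by 3, so a short,
            # high-IDF title match can compete with (or dominate) body matches.
            "fields": ["title^3", "abstract^2", "body"],
        }
    }
}
# results = client.search(index="papers", body=query)
```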
Nicolay Gerold: Yeah.
And this already goes to how much meaning the matches or the BM25 scores carry, because it tends to be that on title fields a very good match carries way more significance and relevance than a very good match on a body field.
David Tippett: Yeah.
Nicolay Gerold: And this already leads us a little bit into BM25F, which is basically just saying it's field-based and we have a document with multiple fields.
Yeah.
David Tippett: That's a nice segue.
And actually, I'm going to be a little bit embarrassed here, because I have very little experience with BM25F.
But what I can say is, OpenSearch also didn't have support for BM25F when I was there, and they still don't; it's on the roadmap, which is a little bit of a hot take.
It's been pushed back quite a few times, and actually I'm not sure why, because I think a lot of people are asking for it.
But to break down why BM25F is important: we've said BM25 scores are per field.
So it's either the title
field or it's a body field.
But what BM25F does is it calculates the score across all of those fields.
So now you've got this nice combined score, we'll say, between several different types of fields.
So yeah, I think that's really interesting.
I think it's interesting that OpenSearch hasn't implemented it, given how many requests I've seen for it in the community, but I think it should be coming soon.
Hopefully.
Nicolay Gerold: And I think you can also replicate it a little bit by just using BM25 on the different fields.
So I think it's probably not a
priority for them at the moment.
David Tippett: Maybe, but I think it does add a lot, because it takes away some of the work that you have to do weighting all those different fields to get them put into similar score ranges.
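A hedged, simplified sketch of the BM25F idea: weight and pool term frequencies across fields before saturation, instead of summing separately saturated per-field scores (field-specific length normalization is omitted here):

```python
import math

def bm25f_term_score(term, doc_fields, field_weights, corpus_stats, k1=1.2):
    """Simplified BM25F sketch: pool weighted term frequencies across fields
    first, so one IDF and one saturation curve apply to the whole document."""
    pooled_tf = 0.0
    for field, tokens in doc_fields.items():
        weight = field_weights.get(field, 1.0)   # e.g. a title occurrence counts 3x
        pooled_tf += weight * tokens.count(term)
    N = corpus_stats["num_docs"]
    n = corpus_stats["docs_with_term"][term]
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    return idf * pooled_tf / (pooled_tf + k1)

doc = {"title": ["bm25", "explained"], "body": ["bm25", "is", "a", "ranking", "function"]}
stats = {"num_docs": 1000, "docs_with_term": {"bm25": 12}}
print(bm25f_term_score("bm25", doc, {"title": 3.0, "body": 1.0}, stats))
```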
And actually, if we want to talk about BM25 search versus vector search, this has also been a huge problem for a lot of people.
Because what at least OpenSearch found in most of their research is that BM25 does really great at, we'll say, what's called zero-shot retrieval.
So it's never been trained on anything; it is a very straightforward algorithm.
And zero-shot retrieval basically says, hey, it's never seen this example before, but it can go and find the right result.
Then you have vector search, which has been trained.
I don't know if I would call vector search one-shot; I feel like I've probably just used that wrong, and Jo Kristian (Bergum) is going to come light me up on Twitter.
But they have very different scoring paradigms.
So vector search is always between, like, zero and one, and BM25 can be a whole range of different things.
Combining them, you're almost always
going to get a better experience.
Again, we've got scores that
are in hugely different ranges.
So how do you combine these
sets of documents that have
completely different scores?
And that's where I think actually
OpenSearch has done really well so far
by creating like pipelining tools that
allow you to do score normalization
and get a single unified set of
results back, which I think is cool.
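A minimal sketch of that kind of score normalization and combination, roughly what a hybrid-search pipeline does (the min-max choice and the weights are assumptions, not OpenSearch's exact implementation):

```python
def min_max(scores):
    # Squash one result set's scores into [0, 1] so the two sets are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25_scores, vector_scores, bm25_weight=0.5):
    """Normalize each result set, then blend with a weighted mean."""
    bm25_n, vec_n = min_max(bm25_scores), min_max(vector_scores)
    docs = set(bm25_n) | set(vec_n)
    blended = {
        d: bm25_weight * bm25_n.get(d, 0.0) + (1 - bm25_weight) * vec_n.get(d, 0.0)
        for d in docs
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

print(hybrid_rank({"a": 12.3, "b": 4.1}, {"a": 0.71, "c": 0.94}))
```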
Nicolay Gerold: Yeah, and when looking
at BM25, I think like there are so
many different versions nowadays.
What's your, if you had to pick
one favorite child, what would be
your favorite adaptation of BM25?
David Tippett: I have a hard time saying anything other than Okapi BM25, which is like the baseline of BM25.
I feel like it's been around since 1994, which was when it was released to the world at TREC, the Text REtrieval Conference, I think.
So yeah, 1994.
That's a year before I was
born for context and it's
been around all this time.
And I think there's a reason for that.
And I think it provides a
very good generic way to
score and retrieve documents.
And other groups are
doing things differently.
I would not recommend it.
But Postgres's own search, what was it based off of?
I'm going to forget, of course, what Postgres search relevance is based off of, but basically optimizing for how close the terms are together is one of the things that comes into Postgres's retrieval function, which I think is neat.
And if someone finds a way to add that to BM25, I think that is going to dramatically change how we score documents, because term proximity can add, I think, a lot of value.
Nicolay Gerold: I think Postgres built it not on terms, but on top of lexemes, which introduces a completely new area.
I think the full-text search in Postgres was on top of lexemes, which are more like a semantic unit of language.
Which I think introduces a whole lot of new issues, like what makes up a semantic unit and how that is even defined.
David Tippett: Yeah, it's interesting.
But to this point, I'm not saying that you shouldn't do full-text search on Postgres, but if I was doing text search on Postgres, I'd be using something like Trieve or one of these other layers that gets added on top of Postgres that adds back in BM25.
Nicolay Gerold: Yeah, I think ParadeDB is another one on top of Postgres, which adds a lot of the features you have in regular search databases.
I think in the end, like you said already, you should be more oriented to how your users are searching and what documents you are serving.
And if you have a reason to go with a variant of BM25 which isn't the default, then you can do it.
For example, BM25+, I think, introduces a constant to avoid zero term frequencies, especially for very long documents.
That's very valuable to have because you just have a way bigger term corpus.
And I think in general you can look through the different definitions.
The question is whether you will outperform the baseline BM25 in a
David Tippett: Yeah.
Nicolay Gerold: significant way.
David Tippett: And that's also, it comes back to measurement: can you measure what people want and apply that?
Because that's one of the places where, I know, a lot of people try a couple of different things and see which works best.
But when you start to measure things, and for those who don't know, there are a lot of great ways to measure search.
The best, in my opinion, is NDCG, which is normalized discounted cumulative gain, but basically it's a measure of: hey, I searched this and I have a ranked set of results for this query.
How close to the top did my highest ranked result get?
And likewise the second, et cetera.
And being able to measure that and
have that measurement and then apply
that to like, hey, I made a change.
Are my search results getting
better or are they getting worse?
I think that is probably the most valuable
thing you can do in a search organization,
because then you can start to experiment
and say, hey, can we improve query times by trying this different modification of BM25?
Or can we simplify search?
Do we need all of these different fields?
Do we need trigrams, n-grams, et cetera, on and on?
Are they adding any value?
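A minimal sketch of computing NDCG from graded relevance judgments (the judgments themselves are made up):

```python
import math

def dcg(relevances):
    # Graded relevance discounted by log2 of the (1-based) rank position.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=10):
    """NDCG@k: DCG of the ranking we returned divided by the DCG of the ideal
    ordering, so 1.0 means the best results sat right at the top."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg else 0.0

# Judged relevance of the top results for one query (3 = perfect, 0 = bad):
print(round(ndcg([3, 0, 2, 3, 1]), 3))
```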
Nicolay Gerold: Yeah.
And I think one smart thing you already mentioned is the part of search which is often underappreciated.
It's not just relevance; it's also the other performance characteristics you might want to optimize for.
Latency is a massive one.
David Tippett: gosh,
Nicolay Gerold: Doing a vector search on a large corpus will likely introduce some issues in terms of latency, unless you're willing to take on a whole lot of inference cost.
At the same time, you might want to optimize a little bit more for high-ticket items to get higher ticket sales.
Search is very business driven, which makes it so interesting to me as a domain, because it's very oriented to more of an engineering discipline.
David Tippett: Yeah.
I think it's a little funny though, because you highlighted that search is business driven, and I'm admittedly in a place where search isn't our core.
In e-commerce, it is everything, right?
You directly tie dollars to searches.
In GitHub, search is valuable; we want people to find the right things.
But the most valuable thing, I think, in GitHub is, hey, are you able to develop with a community of people and interact well?
So it's hard saying, hey, we at GitHub should spend a ton of money building the most advanced search ever.
And they're like, that's great.
That's really cool.
Does it add that much
more value to people?
And I will let the people decide: comment, respond, retweet, whatever, at us and let us know, would you value a more relevant search at GitHub?
Nicolay Gerold: I would.
David Tippett: That's my own plug; this is me trying to get funding for more search at GitHub.
But I will say, at GitHub (you talked about latencies) this is where it gets really tricky, especially in the search domain.
So at GitHub, technically
all of our data is hot.
We never deprioritize data.
Some documents, some commit you created,
seven years ago might be just as relevant
as a commit you created yesterday.
And that causes all sorts of challenges, because we have around a hundred billion documents.
I don't have exact counts off the top of my head, but around a hundred billion documents, and we need to be able to say, hey, this is the most relevant commit for what you want.
And it's just really hard, because you have users doing all sorts of weird things with GitHub search.
We have people who are, oh, what's the word for that, people that are basically looking for security vulnerabilities using different parameters in our search, or people that are scraping us trying to find certain things.
We have people that are just business users saying, hey, I'm just trying to find an issue.
We have developers who are like, hey, do we have a dependency on something?
And they're searching
within their code base.
It's a lot of different use cases
and it makes it really challenging.
Nicolay Gerold: Yeah.
And when you're looking at a large-scale search system like GitHub, what do you think, how does BM25 fit into the larger search system?
David Tippett: Under the hood, it's technically everywhere, across all of our fields.
We can't afford to do, well, maybe I shouldn't say we can't afford to do vector search.
We could probably do vector search if we wanted to, but BM25 is so incredibly cost effective that it's very hard to justify doing any other type of search, because BM25 gives us really fast search across all of our data.
And that's one actual point that might be good to bring up: there are lots of layers on top of BM25 that you use to improve search latency, that you use to reduce the impact of a search on your entire cluster.
So if you do an open search, you're going to hit all of GitHub, right?
And that's expensive.
But we can start to layer things on top of that to reduce the impact of that search.
Whether that is filtering down to an organization: then, as long as we have all of our documents routed appropriately, we can reduce the scope to just documents within that organization, and you don't hit our entire search cluster.
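A hedged sketch of that kind of routing with the opensearch-py client: index a document and search with the same routing key so an org-scoped query only touches the shards that hold that organization's documents (names and values are illustrative):

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Writing with a routing key keeps one organization's documents on a
# predictable subset of shards.
client.index(
    index="issues",
    id="123",
    body={"org": "acme", "title": "flaky CI on main"},
    routing="acme",
)

# Searching with the same routing key avoids fanning out to every shard.
hits = client.search(
    index="issues",
    body={"query": {"match": {"title": "flaky ci"}}},
    routing="acme",
)
```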
We reduce, I was going to say shard size, but I guess it's shards.
I always confuse shards and segments: it's Lucene that has segments, and Elasticsearch and OpenSearch use shards.
But shard size: a shard for Elasticsearch or OpenSearch should be about 20 to 50 gigabytes.
And again, that way you can use most of your CPU to search across this data.
So BM25 sits as the foundation of all of that.
All of our fields are indexed into BM25-type text fields.
And then we're adding layers on top of that: hey, we've indexed this title field three different ways.
Exact matches are top tier, then maybe we'll have an n-gram field, and that's like the next tier, and then the lowest would be a trigram.
If it matches on a trigram, that allows us to retrieve things that might be really hard to index otherwise, like labels that have bunches of special characters.
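A hedged sketch of indexing one title field several ways using multi-fields and custom analyzers (all names and gram sizes here are illustrative, not GitHub's actual mapping):

```python
# Index settings and mapping for a title stored three ways: exact keyword,
# edge-n-gram, and trigram, each boosted differently at query time.
mapping = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tok": {"type": "ngram", "min_gram": 3, "max_gram": 3},
                "edge_tok": {"type": "edge_ngram", "min_gram": 2, "max_gram": 15},
            },
            "analyzer": {
                "trigram": {"tokenizer": "trigram_tok", "filter": ["lowercase"]},
                "edge": {"tokenizer": "edge_tok", "filter": ["lowercase"]},
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "exact": {"type": "keyword"},
                    "edge": {"type": "text", "analyzer": "edge"},
                    "trigram": {"type": "text", "analyzer": "trigram"},
                },
            }
        }
    },
}
# client.indices.create(index="issues", body=mapping)
# A query can then boost the tiers, e.g.:
#   "fields": ["title.exact^5", "title.edge^2", "title.trigram"]
```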
Nicolay Gerold: And I think what really nails the point home is that the different search queries you actually face at GitHub, that you mentioned before, BM25 can serve them all.
And if you were using something like vector search, you'd likely need fine-tuned models for the different types of searches you encounter, and then you'd need a routing layer on top of it that recognizes the type of search the user is currently making and routes to the right type of vectors and vector model.
David Tippett: That's so true.
And honestly, doing vector search at the scale of GitHub would be just so incredibly problematic, because for HNSW-based vector search, that's all sitting in memory, and I can't even imagine what it would be like fitting a billion records in memory, let alone a hundred billion.
And then you have some other ones.
Oh gosh, I'm going to forget it.
What's the other one? I keep wanting to say IDF, but there's another, more bucketed version of vector search that allows you to bucket your vectors into different pools.
IVF is the one, yep.
And the challenge with that one is you're going to start losing out on relevance, because maybe you've routed a query to the wrong bucket.
So it just gets really tricky.
And I would like to see us using vector search; I actually do think there is a use case, and maybe it's some sparse vectors.
But we are very far away from that.
Nicolay Gerold: Yep.
I would actually love to know how the storage requirements, in terms of disk space and memory, compare between BM25 and vectors, if you have some rough estimations or some intuition on that.
David Tippett: Actually, this is one thing I can say that's really interesting.
Vector search is wonderful because it scales, I don't want to say linearly.
What am I trying to say? The size of vector documents is constant; that's where I'm getting to.
So vectors, when you output them, are of constant length.
You have, say, a 1,024-dimension vector, and every one of those dimensions is going to be a floating point 64 or something like that.
So that makes it nice, because with vector search you can estimate really well what your memory requirements are.
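A back-of-the-envelope version of that estimate (all numbers are assumptions: float32 vectors, 1,024 dimensions, roughly GitHub's document count):

```python
# Rough memory estimate for dense vectors at GitHub-like scale.
num_docs = 100_000_000_000        # ~100 billion documents (assumed)
dims = 1024                       # vector dimensions (assumed)
bytes_per_dim = 4                 # float32 (assumed)

raw_vectors_tb = num_docs * dims * bytes_per_dim / 1e12
print(f"~{raw_vectors_tb:,.0f} TB of raw vectors before any HNSW graph overhead")
# ~410 TB, and HNSW wants this resident in memory for low-latency search.
```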
BM25, though, is like throwing a dart at a board, because you start with, oh, okay, here's my base document.
That's great, you have your base document size, but then you might ingest it twice.
What if you ingest it once with one type of analyzer and then a second time with another type of analyzer?
So I feel like BM25 becomes very challenging to estimate for.
One of the really funny things that I'm starting to wrestle with is our nodes at GitHub.
We basically have 10x, I'm going to say 10x, the amount of storage that we actually need for our BM25 indexes, because the biggest challenge we've run up against is that the IOPS for the storage are actually really important.
Like how fast can we execute
operations against that storage?
And that tends to only be
available on high storage nodes.
So we've got these huge
nodes with like tons of IOPS.
And we store, maybe we've got a node with a terabyte of storage, and we're using 100 gigs of that.
Now, we have hundreds of these nodes, but that's all because how fast we can access the storage is really important to the overall latency of our search.
Nicolay Gerold: Yeah, nice.
Can you maybe go a little bit
into the technical parts of BM25?
Like, where the index is stored,
where the raw data is stored, where
is all that stuff pulled, and how
is it actually used when I index it,
but also when I run some queries?
David Tippett: Yeah, that's a really good question.
So in OpenSearch and Elasticsearch, a lot of stuff is precomputed; pretty much anything that they can precompute for a document and for an index is precomputed.
So documents, as they come into OpenSearch or Elasticsearch, are going to hit the REST API.
From the REST API, they're going to get broken apart into all their individual terms, so all the terms get separated out.
And then that's stored temporarily in a place called the translog, which is like a write-ahead log, for those who are familiar with database terms.
But that's a very temporary place it gets stored before eventually it gets flushed to disk.
And when I say flushed to disk: OpenSearch and Elasticsearch maintain a reverse index, and a reverse index is basically a hash map where there is a term, and the values for that term are the documents that contain that term.
And it does this for every single term.
So a document and a table, or an index, in OpenSearch are very different, because a document, you and I think of as, oh, that's the whole thing.
However, when OpenSearch or Elasticsearch looks at it, they're looking at a giant table of all the terms, with pointers back to the documents that contain those terms.
So this layout makes it really easy to do things like IDF, because, okay, here's my term, and here is the count of how many documents contain that term.
It's right there.
So we can start to store that calculation of how many other documents contain this term, so that can be cached, same with things like IDF, which, actually, is what I just described.
So you have all these layers of these different pieces of BM25 being cached, and that allows us to do really cool optimizations, like only retrieving the documents that we anticipate are going to have the highest match, we'll say.
And a lot of this breaks down to, it's funny, vector math, matrix math.
Because when you have a query, you break your query into the individual terms, go retrieve those term dictionaries from the index, and then do matrix math to determine which documents score the highest.
So that is my, I'll say, high-level or maybe low-level understanding of it.
I don't know that I could talk much more specifically about it than that.
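A toy sketch of that layout: a reverse (inverted) index mapping each term to the documents that contain it, with the document frequency needed for IDF sitting right next to the postings:

```python
from collections import defaultdict

# Toy inverted ("reverse") index: term -> {doc_id: term_frequency}.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick fox",
}

index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

# Query time: fetch only the postings for the query terms. The document
# frequency (the length of each postings list), which feeds IDF, is right there.
for term in "quick fox".split():
    postings = index.get(term, {})
    print(term, "appears in", len(postings), "docs:", postings)
```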
Nicolay Gerold: No, I think that's the misconception many people have when they go into search databases, because to most people it looks like a document database, but it isn't.
David Tippett: And this is also why joins get really tricky.
Because what you have to do to effectively do a join is go and calculate the scores for all of these different documents.
If you wanted to join two, we'll say, two tables, to use SQL terms, you have to go and retrieve all these documents, then put them into memory and try to mush them together in the appropriate manner.
So you miss out on a lot of the optimizations that SQL databases can do, because they have very efficient ways of doing some of the join work ahead of time.
Nicolay Gerold: Yeah, in the end it's: what are you optimizing for?
That's, I think, something that's talked about more in traditional database courses, that you can have a workload mismatch.
And that's why you actually use different databases: the stuff you precompute, which is the indexes, is what allows for the high-performance computations you're doing on the data.
That's the stuff that actually gets you the high performance, but it also makes other computations or other workloads you could perform very inefficient.
David Tippett: Yeah.
And it's challenging.
That's why, for each of the different database types that you might pick for something, it's very important to choose the right data model.
A really great example is that OpenSearch is probably very similar to Cassandra, in the sense that you want everything you need for that query to exist within that document, and Cassandra, I think, does that a lot.
It's very different from Redis, though.
With Redis you're breaking things down into, hey, I have a key and I have a value, and I'm always doing a lookup of this key to get this value.
Maybe I'll do a handful of key
operations at the same time.
And then versus SQL: SQL is like, hey, I want to separate each thing into its own nice little compartment; these are my users, these are attributes of users that get joined on, maybe there are blogs, these are the blogs that belong to the users.
That just doesn't make sense in something like OpenSearch or Elasticsearch.
Nicolay Gerold: And what is actually
something that you would love to
see built in the search space?
David Tippett: Oh yeah, okay.
The person who solves this challenge wins search, as far as I'm concerned.
We've talked a lot about the newest research, what's coming out.
The person who finds a way to, I'll say, automate, or build a general enough system that not only does the search portion, which is information retrieval, but also the optimization stuff: hey, this was the query, these were the results, they clicked on this result though, and that was the final outcome.
This workflow is called learning to rank.
If someone finds a way to automate, so to speak, learning to rank, or to build these relevancy tuning workflows into a search engine, that's it.
They have won the search game and
they will remain champion forever.
And I think there are
people working on it.
I think actually Algolia, I've seen
a lot of stuff coming out of Algolia,
which is funny because they're a
very proprietary search engine.
But I think they are starting
to do some of this stuff.
Same with OpenSearch.
OpenSearch actually is working on an open standard called User Behavior Insights, or UBI, and it's a standard for collecting this information so that you can automate workflows like that.
To that point, I feel like OpenSearch has a head start, but the game is anyone's.
If someone can figure out how to integrate these relevancy workflows with the retrieval workflows well, they're going to win.
Nicolay Gerold: Yeah.
And what's next for you?
What is, what can you tease?
David Tippett: I've won.
There is nothing I want more.
And I'll just, I'll put this out there.
I'll make this public.
I am working around the clock
to try and get GitHub to adopt
a single search result set.
That is like my lifelong goal.
If I can achieve that, I would be so happy, because right now I feel like with the UX for GitHub, we're very much shipping the org, so to speak, with our search results, and I don't like that.
I think there is a case where we build a single pane of results where you are getting the exact result you're looking for, regardless of whether you're looking for an index, a repo, code, or any of these other things.
Which, I should also say, code search is very different than traditional text search in GitHub.
And I work on text search, but I
would love to merge text search
and code search much more closely.
So yeah, the unification of GitHub search.
That is like my next
couple of years, probably.
Nicolay Gerold: Nice.
And if people want to follow along
on your journey of trying that
and let's see how successful you
will be, where can they do that?
David Tippett: Yeah.
So I have a couple of places
that I live very frequently.
I have a blog, tippybits.com.
I also have my podcast, which is called For the Sake of Search.
And For the Sake of Search is probably a much more search-focused podcast, where I've been talking with people about: what does it look like to hire a search engineer?
What are different companies doing in the search space?
You talked about ParadeDB; actually, I think they were my first ever interview on the podcast, which is cool.
So yeah, there's a lot of different things I'm talking about there.
And yeah, follow along and also let
us know, does GitHub search suck?
How can we make it better?
I'm happy to hear.
So what can you take away when we're building search systems?
I think the first is that search is never just about relevance but about the user experience.
So you want to align with what the user expects and also what their information need is.
For example, GitHub has a lot of different use cases they have to support.
What he mentioned was finding security vulnerabilities, exploring codebases, managing repos, finding old commits, finding old documentation, finding old issues, and each requires a different prioritization of fields, different boosting strategies, and possibly even distinct search workflows.
So you actually need a technology which can handle a lot of different types of queries and different types of documents as well.
And this is where BM25, in my opinion, really shines, because it is very good at zero-shot retrieval, so without having seen any of these queries before.
At the same time, its strength is in its simplicity and efficiency.
So when we are building search engines, most of the time when we are building on top of a bespoke search database like Elastic, OpenSearch, or Vespa, we are already at a point where we actually have to go distributed, so we have a large corpus.
And BM25 is really suited for these scenarios because it's so lightweight and computationally efficient, especially compared to vector search.
And this makes it really unbeatable for cases like GitHub, where you have immense scale, cases where you have a lot of different types of queries but you can't easily differentiate between them, or you don't have the time or the latency budget to differentiate between them.
If you had a case where you really can classify easily, hey, it's a certain type of search where the user cares more about accuracy, then you can pipe it to a slower type of query, or pipe it to a vector database and run the search there.
Then you could basically have a vector database plus a regular search database and decide at query time where to go.
But this is also impossible in GitHub's case, because you never know what types of documents the user is searching for, and they have such a large corpus that it would be interesting where you would even place it, and how much memory the HNSW index would take.
For the future, I think what becomes more and more interesting is standards like User Behavior Insights, which I think came out of OpenSearch, and which is trying to standardize how to integrate user behavior, like click data and dwell time, into retrieval systems and also into retrieval algorithms.
I think this might make it easier to build learning-to-rank algorithms.
Because you suddenly have a unified data structure, which is the main thing I think UBI put forward: a data structure for the user behavior and how to store it.
That should make it easier to train algorithms like learning to rank and use them in your search system.
In the end, for me, with search systems it always comes down to the architecture and to how well you actually scope and understand the problem you're trying to solve.
Because in search systems you really need to figure out in advance what you're optimizing for: whether it's really low latency, whether it's high accuracy, high recall, whatever, or whether it's really low cost when you have a really large-scale system like in GitHub's case.
These types of architectural decisions have to be made mostly upfront, because it's really hard to change them once you already have all of your data in a search database and it's already indexed and set up.
Otherwise, thanks for joining.
We will continue with more on RAG soon.
So subscribe, and also leave a review, especially on Spotify and Apple.
And if you have any questions, any wishes on different parts of search you would like me to go into, or a guest I should have on the podcast and interview, feel free to shoot me a message.
Otherwise, I will talk to you next week.