How AI Is Built

Modern search systems face a complex balancing act between performance, relevancy, and cost, requiring careful architectural decisions at each layer.
While vector search generates buzz, hybrid approaches combining traditional text search with vector capabilities yield better results.
The architecture typically splits into three core components:
  1. ingestion/indexing (requiring decisions between batch vs streaming)
  2. query processing (balancing understanding vs performance)
  3. analytics/feedback loops for continuous improvement.
Critical but often overlooked aspects include query understanding depth, systematic relevancy testing (avoid anecdote-driven development), and data governance as search systems naturally evolve into organizational data hubs.
Performance optimization requires careful tradeoffs between index-time vs query-time computation, with even 1-2% improvements being significant in mature systems.
Success requires testing against production data (staging environments prove unreliable), implementing proper evaluation infrastructure (golden query sets, A/B testing, interleaving), and avoiding the local maxima trap where improving one query set unknowingly damages others.
The end goal is finding an acceptable balance between corpus size, latency requirements, and cost constraints while maintaining system manageability and relevance quality.
"It's quite easy to end up in local maxima, whereby you improve a query for one set and then you end up destroying it for another set."
"A good marker of a sophisticated system is one where you actually see it's getting worse... you might be discovering a maxima."
"There's no free lunch in all of this. Often it's a case that, to service billions of documents on a vector search, less than 10 millis, you can do those kinds of things. They're just incredibly expensive. It's really about trying to manage all of the overall system to find what is an acceptable balance."
Guests: Stuart Cam and Russ Cam (Search Pioneers)
Host: Nicolay Gerold
00:00 Introduction to Search Systems
00:13 Challenges in Search: Relevancy vs Latency
00:27 Insights from Industry Experts
01:00 Evolution of Search Technologies
03:16 Storage and Compute in Search Systems
06:22 Common Mistakes in Building Search Systems
09:10 Evaluating and Improving Search Systems
19:27 Architectural Components of Search Systems
29:17 Understanding Search Query Expectations
29:39 Balancing Speed, Cost, and Corpus Size
32:03 Trade-offs in Search System Design
32:53 Indexing vs Querying: Key Considerations
35:28 Re-ranking and Personalization Challenges
38:11 Evaluating Search System Performance
44:51 Overrated vs Underrated Search Techniques
48:31 Final Thoughts and Contact Information

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

Nicolay Gerold: Search is a system problem.

We need ingestion and indexing, query understanding, retrieval and re-ranking.

You have databases and AI, nowadays also LLMs, and everything has trade-offs: relevancy vs. latency, accuracy vs. speed, query time vs. indexing time.

Today we are taking a system-level perspective on search, and we are talking to Stuart and Russ Cam.

They have both led search infrastructure teams, first at Elastic, then at Canva, and now at their own shop.

You get to peek behind the curtains of some of the biggest search platforms and learn what you can use in your own.

They have seen what breaks, what scales, and most importantly, what actually works.

From query understanding to evaluation systems, we will look at the practical solutions to the problems.

Stuart Cam: It's quite interesting to see. Elasticsearch, being Lucene-based, obviously has full text search at its core, right?

It was flying that flag for many years, and it was only later that vector search was added.

And it's quite interesting to see the
vector databases come in much later.

And they're now adding all
of the BM25 for text search.

I think there's an acknowledgement
that maybe there isn't one

approach that solves all.

That hybrid approach is
often the best way to go.

And it's interesting to see that they're both fighting, or working, towards the same end output.

It's just that the approach was different, so that's been quite interesting to see.

You can run the vector database
as like a sidecar process

to the main process, right?

You are delegating out all of your
vector search to that sidecar.

There are downsides to that
obviously, now you're out of process.

You've got something that isn't part of
the core process that you have to manage.

If things go wrong in that space, it can fail in weird and wonderful ways that you might not be fully aware of if everything was running in the JVM.

You have a disconnect
there, which is interesting.

You obviously pay the marshalling costs across processes as well. There's pluses and negatives for all things, right?

I guess one of the pluses is that you can access essentially unlimited memory on the server, which with the JVM is a bit more tricky to do.

Russ Cam: I was going to say,
there's some historical reasons

for that too, with open search.

So you have... I think FAISS was the first one that was originally supported, and then support for NMSLIB came in.

And then eventually, when OpenSearch 2 moved to Lucene 9, it had the Lucene vector implementation available as well, so you've got three different choices, I think largely for historical reasons.

Different kinds of pros and cons
to them as Stuart alluded to there.

Yeah, it's not super clear which one to use in which scenario from reading the documentation, but some of them have limitations on things like filtering, for example, that other ones don't.

Nicolay Gerold: Yeah.

Do you think we will see, in regular search databases, the same push we see in the other database types, where you have open table formats on top of S3 or GCP, on top of buckets, in which we can dump more and more onto the disk itself and then have indexes for text search on disk as well?

Russ Cam: Yeah, that split of
storage and compute, I think is

something that, lots of vendors
are starting to do more and more.

In the sense that you can split out the
indexing process, dump to some cheap

storage, like S3 and then spin up querying
processes on demand to go and, load

that into memory into efficient data
structures, and then go and query on that.

I think Elastic and OpenSearch have offerings that do that.

Stuart Cam: Elasticsearch obviously has searchable snapshots, which is essentially what you're suggesting there, where if you have data that's in your cold or frozen tier, you can just dump that out to S3 and have that searchable.

There are downsides to that, of course.

When you get into the idea of
mutability, that's an issue.

Often these things are better if they're
immutable, because they're read only.

It allows for much more efficient storage.

As soon as you make anything
mutable, it's a bit of an issue.

The other thing is
obviously latency as well.

Often it's the case that, if something is on a disk-based system and not necessarily in memory, there's a seek time and a random access time that can be a couple of orders of magnitude slower than just a memory pointer read.

So there's no free lunch there.

You might be able to store more data,
but often it's slower to access and

often you pay the price in the fact
that you can't actually change it.

It's suitable for some
use cases, not for others.

Logs and metrics is probably fine.

Application search, not so much.

Russ Cam: Yeah, I think that's a good point there: if you're dealing with large volumes of data, that kind of split of compute and storage is something you typically want to do.

For logging systems you have a massive long tail of older information that you maybe want to query at some point in time, but most of the useful information is the last day, or seven days.

And so you want it around to be able to query it, but it's not super important to have it loaded into RAM, super fast accessible.

Whereas I think for a lot of search systems, the size of data that you're dealing with is typically one that can fit in a reasonably sized distributed system, and having that in RAM and accessible is really what you want in that situation to keep latencies down.

Nicolay Gerold: Yeah, I think that's one
issue also in AI that we have frequently

that you have multiple different systems
in the end, but you have to orchestrate

where you have the data for training,
which is in some form of offline storage.

And then you have the feature store,
which is more the online storage.

What would you say is the biggest mistake companies make when they build their search systems?

Stuart Cam: Good question.

I'll kick off on one thing.

I think there's often an overemphasis on technology, and not necessarily the user experience.

Sometimes it's the wrong technology
as well, trying to shoehorn in

a database system that perhaps
isn't the best for search.

The first thing that comes to mind
immediately is something like SQL.

Trying to use SQL for search.

It's okay.

And then at some point it just
falls over as you realize it

isn't really designed for that.

Think of doing edge n-grams or any kind of text analysis processing. SQL doesn't have those kinds of features available.

So often we see, the wrong technology.

And then, yeah, focusing on the
technology and not necessarily

the user experience as well.

Perhaps, Russ, you have
some other insights as well?

Russ Cam: Yeah, I was just going to say on that point about relational databases: lots of them obviously have an extension piece that allows you to do some form of full text search.

And that often is like a good stepping
stone, I think, into building that

kind of capability in a system,
particularly if you don't have

any search system already there.

I think what I've typically seen in the
past though, is you do start to rub up

against lack of capabilities and lack
of features to be able to actually

implement the things that you want to.

So typically, for example, one of
the things that you usually end up

running up against is the ability
to actually configure and control

relevancy effectively and the
ranking algorithm used for that.

That's something that, many years ago, trying to do with SQL Server's text search capabilities was a real hard challenge.

But I think they're a good stepping stone to utilizing a specific search engine for the job.

But yeah, overemphasis on
technology is probably a good one.

I think one of the other things that we
often see is a lack of ongoing evaluation.

It's quite easy if you follow
the guidelines of, many of the

documentation for getting started with
Elasticsearch or OpenSearch or Vespa.

You can go spin up a search
service very quickly.

Start using that, start getting
some, results out of there.

You show it to your bosses and
show it to the rest of the team.

And it's Hey, we have some
search system up and running.

Awesome.

Then how do you go from that to understanding whether you're improving, making search better or making relevance worse? What do the feedback loop and the measurements look like for that system?

I think often you see that companies
get to that first state, first step,

which is I've built a search system.

It's okay, now how do you evaluate that?

How do you understand whether you're
making improvements to it, whether

you're making things better or worse?

Oftentimes I see people
stop at that point.

Or they look at trying to do optimizations in the small. So looking at specific queries that might get reported by end users or by the boss, to say, hey, I went looking for some socks and I couldn't find the specific socks I was looking for.

So then people tend to focus on the
small individual issues that come up.

And that then can affect,
the wider relevancy problem.

Stuart Cam: A lack of systems around relevancy.

And being driven by anecdotes. It's quite easy to end up in local maxima, whereby you improve a query for one set and then you end up destroying it for another set.

So you end up, fixing one set of
queries to improve the results in

one space and having them actually
detrimentally affect it in another.

And if you don't have a systematized approach to measuring your improvements to relevancy, often it's the case that you'll release something, you'll actually break it somewhere else, and not realize that until you get a user reporting the issue.

And we see that quite often: not having decent feedback loops from users, but then also not having the necessary system in place to release those improvements or perceived improvements to relevancy.

So yeah, that's poor handling of edge cases, variability, that kind of thing.

Nicolay Gerold: How do you approach
that when you really have a system in

production already, you have a small
test set, do you basically add more

and more examples for the different
edge cases you encounter to the test

sets over time until you say I have
a broad coverage of the different

query types and the results I want?

Stuart Cam: yeah, so there's
a number of ways to do that.

So a golden query set is a good place to start. If you think about what you would like to do with a golden query set, you're trying to establish a sample of queries that users run on your system that encapsulates a proportion of the head queries.

So those queries are the most popular.

So that might be, if you were on a retail site, an iPhone, an iPod, etc.

And then you would want to have a
representation of your middle queries.

Things that aren't necessarily at the head
or tail, but it's a broad set of queries.

And then you'd want a sample
of tail queries as well.

So you want a broad sample of everything.

And you would essentially then build up
a golden query set, which is for those

queries, what are the human evaluated
set of results that best fit that query.

That would essentially then be a human judgment list.

So you'd have your operators go in and say, if I was searching for X, Y, Z, I would expect to see A, B, C and these products. That allows you essentially then to establish some kind of baseline: if you change an algorithm in your search system, or if you change your search system, are you still seeing, for that golden set, those results in the same position? Or if you're not, what is the deviation?

And there are all sorts of mathematical functions out there, for example NDCG or reciprocal rank, that allow you to, shall we say, put a mathematical figure on the change delta between the two algorithms.

And that essentially then buys
you some level of confidence.

And it gives you a line
of inquiry as well.

If you change your algorithm that
affects head queries positively, but

tail queries negatively, you might
say actually, we're okay with that,

we're happy to improve it for the head
queries, but not necessarily for the tail

queries, then you might establish a new
baseline and measure from that instead.

But at least it gives you a line of inquiry.

Now, we would see that essentially running in an automation suite.

So when you make a change your new
algorithm is run against that automation

suite, and then you get the results
back, and then you can establish whether

or not you think it's a viable change.
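As a rough illustration of what such an automation suite can look like, here is a minimal Python sketch: the golden set, the judgments, and the search function are hypothetical stand-ins, not the tooling discussed in the episode.

```python
import math

# Hypothetical golden query set: human-judged relevance grades (0-3) per document,
# sampled across head, mid, and tail queries.
GOLDEN_SET = {
    "iphone":          {"doc_1": 3, "doc_2": 2, "doc_9": 1},  # head query
    "usb-c cable":     {"doc_4": 3, "doc_7": 2},              # mid query
    "left-handed mug": {"doc_8": 3},                          # tail query
}

def dcg(grades):
    """Discounted cumulative gain of relevance grades in ranked order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """Compare the delivered ranking against the ideal ordering of the judgments."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

def evaluate(run_search, k=10):
    """Run every golden query through a search function and average NDCG@k."""
    per_query = {q: ndcg_at_k(run_search(q), judged, k) for q, judged in GOLDEN_SET.items()}
    return sum(per_query.values()) / len(per_query), per_query

# run_search would call your real search system and return ranked doc ids, e.g.:
# mean_ndcg, per_query = evaluate(lambda q: my_search_client.search(q))
```

Comparing the mean and the per-query scores before and after an algorithm change gives you the delta and flags which query buckets got worse.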

And then once you believe it's a viable
change, then you have a number of

avenues available to you in terms of,
okay we think this is an improvement.

That opens a question of how do
you establish that, how do you

establish that it's an improvement?

So what are you going to measure against?

Now, you could, for example, if you're
an e commerce store, you could measure

against number of items that are bought,
or number of items that are added to

basket, or some kind of business metric.

And you could track, based on the
search, what are the improvements

to the underlying business metric?

And then so what are you going to measure
against and then how are you going to

measure or how are you going to deploy it?

Then you get into
aspects like A B testing.

So segregating users up into different
groups and essentially giving them

a different algorithm and then
measuring the results of that group.
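A minimal sketch of the user bucketing behind an A/B test like that; the experiment name and variant labels here are made up for illustration.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("control", "new_ranker")):
    """Deterministically bucket a user so they always see the same algorithm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Route the query to the algorithm for this user's bucket, then log the
# business metric (add-to-basket, purchase) against the variant:
# variant = assign_variant("user_42", "ranking-v2-rollout")
```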

Or if you find, for example with some tail queries, that you often don't get enough traffic on those tail queries, then you can look at other methods like interleaving, which

is essentially running two algorithms
at the same time and interleaving

the results, and then essentially
trying to then work out, based on

the interleaved result set, what was
the net effect of each algorithm.
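Team-draft interleaving is one common way of doing this; below is a simplified sketch, not necessarily the exact variant used on any of the systems discussed here.

```python
import random

def team_draft_interleave(results_a, results_b, k=10):
    """Simplified team-draft interleaving: alternate picks from two ranked lists
    and remember which algorithm contributed each shown document."""
    a, b = list(results_a), list(results_b)
    interleaved, credit = [], {}
    a_first = random.random() < 0.5
    while len(interleaved) < k and (a or b):
        order = [("A", a), ("B", b)] if a_first else [("B", b), ("A", a)]
        for team, pool in order:
            while pool and pool[0] in credit:
                pool.pop(0)                      # skip documents already shown
            if pool and len(interleaved) < k:
                doc = pool.pop(0)
                interleaved.append(doc)
                credit[doc] = team
        a_first = not a_first
    return interleaved, credit

# At serving time, log which team each clicked document was credited to;
# the algorithm that accumulates more clicks across sessions is the likely winner.
```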

So there are a bunch of ways to measure your improvements, or perceived improvements, to search, and then obviously you want rollout and rollback mechanisms and that kind of thing in case you've made a mistake.

Russ Cam: It's maybe worth pointing out
as well the sort of the bias of human

nature here in that, people tend to
report things that are wrong in some way.

And so back to your point that you
asked, which was would you add these

bad results to a collection of queries?

Yeah, you can totally do that.

I think that is something that every place I've seen does. But yeah, there is an element of bias there, in the sense that, for example, when people are reviewing things, they tend to write negative reviews but don't often write positive reviews.

So there's a natural human bias there
to report things that are bad over

things that are necessarily good.

Nicolay Gerold: and when you're running
these test suites, I imagine you're not

running it against a production database.

How do you set it up in the background?

Does every search engineer have their own development database, which has a copy of the production database, or a copy of a small set of the production data?

Stuart Cam: That's a good question. I mean, it's obviously environment dependent, right?

But the best case scenario is that you are running against a production data set.

That's the best case scenario.

In some instances you're talking
about having I hate the word, like

a backdoor or some kind of other
mechanism into the search system where

you can run your evaluation tests.

On the quality of the data: it's often the case in corporations that the production data is the gold standard. It's polished, it's maintained, it's true, if you like.

And then it's often the case that in the precursor environments, say pre-production or test or development, the data isn't as good.

Ultimately there are many reasons for that: cost is one, maintenance is another.

There's all sorts of
reasons why that's the case.

So if you're not running the algorithms on production-like data, there's a danger that you're now evaluating the algorithm against something that's inconsistent with production.

You want to run it
against production data.

Ideally, you'd like to run it
in the production environment.

So usually that's a case of, negotiating
with your DevOps team about trying

to find a way of, having this sidecar
of the evaluation process running.

The one thing I would say with
that evaluation process you're

potentially talking about running
tens of thousands of queries.

So try not to denial of service
your own system in the process.

You might want to have some kind of staggered approach to running those queries.

I've certainly seen servers
get very warm from when the

evaluation process is running.

So try not to do that, but
yeah, you want to run it in the

production system basically.

Russ Cam: Yeah.

There's maybe something else to add there.

Yeah, you really want to be running against what is in prod, because staging and dev usually are not the same.

If you can mirror prod as well, have a real-time mirror of the production system, that's also good.

The reason why is that oftentimes you do want to run evaluations where you, for example, want to return more debug information or more explain information back from the system.

And so that, can put large additional
strain on the production system.

And so if you can offload that
and manage that separately on

a mirror, that's much better.

Stuart Cam: Yeah, that's a good,
that's a good call out, Russ.

For example, if you use explain equals
true with Elasticsearch, brace yourself

for the wall of text that comes back.

And all of that is obviously
memory allocated on the server.

Yeah, that's a fair call to, to
push towards a mirrored system.

Russ Cam: Yeah, or profiling
queries is another one that's

just super expensive to do.
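For reference, both flags are part of the standard Elasticsearch search API. A hedged sketch against an assumed local cluster and a hypothetical products index:

```python
import requests

ES = "http://localhost:9200"          # assumed local cluster
query = {"match": {"title": "seat belts"}}

# explain=true attaches a scoring breakdown to every hit -- verbose and
# memory-hungry, so better run on a mirror than on the production cluster.
body = {"query": query, "explain": True, "size": 5}
# profile=true instead reports per-component query timings, also expensive:
# body = {"query": query, "profile": True, "size": 5}

resp = requests.post(f"{ES}/products/_search", json=body, timeout=30)
hits = resp.json()["hits"]["hits"]
if hits:
    print(hits[0]["_explanation"])    # per-clause scoring detail for the first hit
```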

Nicolay Gerold: I think the additional complexity in search is that you have so many different interleaved components.

Let's maybe move into that.

What are the main components?

When you look at a search system,
what are the main architectural

components you have to have in place?

And whether that's actual infrastructure, business logic, or processes that have to be in place.

Stuart Cam: Really good question.

A good question.

Look, I hate to say this, but it depends; it does depend on what it is you're trying to do.

If we were to pull up an example from
what we were speaking about before,

a very simple search system could
literally be, Something like a query

string query on Elasticsearch where
you're taking the user input without any

processing whatsoever and passing it into
Elasticsearch and hoping for the best.

That is, shall we say, the minimum required architecture.

Now that's going to

Russ Cam: Can I just jump in there a

Stuart Cam: Go on, Russ.

Russ Cam: You'd quickly want to move
from query string query onto simple query

string query so that any syntax problems
in the query don't take down the server.
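To make that concrete, here is roughly what the two request bodies look like; the field names are illustrative:

```python
user_input = "seat belts AND ("   # raw, possibly malformed user input

# query_string parses full Lucene syntax strictly: the unbalanced parenthesis
# above produces a parse error instead of results.
strict_body = {
    "query": {"query_string": {"query": user_input, "fields": ["title", "body"]}}
}

# simple_query_string ignores invalid syntax rather than failing, which makes it
# the safer choice for raw, unsanitised user input.
lenient_body = {
    "query": {"simple_query_string": {"query": user_input, "fields": ["title", "body"]}}
}
```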

Stuart Cam: So that would be your step 1.

Definitely.

That's your sort of super simple,
I would say that's virtually

no architecture, essentially.

You're just passing the
raw string input in.

But as you go on, there are, shall we say, well-known hurdles that you end up having to jump through in order to conduct a search.

Now, a lot of this is use case dependent, and I don't know if there's an architecture that suits all types of search, because there's a big difference, for example, between a search system that takes maybe two or three keywords versus a search system where users are entering, say, full paragraphs of text. They are very different, and that's not even getting into multimodal, for example, where people are searching on images or on video.

So there's no one size fits all.

I think it's probably worth just
constraining it to say, for example,

full text search, because that's
what a lot of search systems are

that are out there at the moment.

What are some of the components
that we see in search systems?

As you said at the beginning, the user types in their query; that's the user query essentially.

So then you think about, okay what
do we know about that piece of text?

You may want to rewrite it, for example.

You might want to, I don't know, standardize the case for it. Make it all lowercase, for example.

You might want to remove weird
and wonderful punctuation.

You might want to strip out all
of the emojis and all of the other

crazy stuff that's put in there.

You might not want to either.

You might want to leave that in there.

So you'd want to essentially
think about rewriting that query.

So user query comes in and
you rewrite it in some way.

How do you, what does
that rewrite look like?

You have different types of rewrites.

Do you make it completely opaque? Everything that happens after that rewrite is just a one-time transformation, and you lose the history of what's happened prior to that rewrite.

Is it a rewrite where you
could maybe see the history of

rewrites that have happened?

Is it a forked rewrite?

So now, I type in, say I search for
iPhone, what does that get forked

into Apple iPhone, mobile phone?

Does it become multiple different things?

Do each one of those multiple
rewrites then start their own search?

Rewriting is one one thing that we see.

Tokenization is related to rewriting, which is: given a sequence of characters, how do you atomize that into a series of other components that then make sense?

And there are edge cases to this,
for example, if I search for seat

belts and I put a space in there,
am I searching for seats and belts?

Probably not, I'm probably
searching for seat belts.

So there are, edge cases to rewrites
that you need to think about.

And that will be corpus dependent as well; your corpus and what you're having people search over will inform some of your tokenization rules. You might want to do whitespace tokenization.

You might want to factor in noun phrases. You might want to look for common kinds of terms, word forms that go together.

Seatbelts is one example there.

So you have tokenization.

And this is really just before
you even get to the search

engine in some instances.

Because you have your own data
sets that you can refer to.

So essentially it's like text manipulation
that happens prior to running your search.
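A minimal sketch of that kind of pre-search text manipulation; the phrase and synonym tables here are purely illustrative, since in practice they would come from your own corpus and query logs:

```python
import re
import unicodedata

PHRASES = {("seat", "belts"): "seat belts"}                       # keep multi-word units together
SYNONYM_FORKS = {"iphone": ["iphone", "apple iphone", "mobile phone"]}

def normalize(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw).lower()
    text = re.sub(r"[^\w\s-]", " ", text)                         # strip emojis and stray punctuation
    return re.sub(r"\s+", " ", text).strip()

def rewrite(raw: str) -> list:
    """Return one or more rewritten queries (a forked rewrite when synonyms apply)."""
    tokens = normalize(raw).split()
    merged, i = [], 0
    while i < len(tokens):                                        # re-join known phrases so that
        pair = tuple(tokens[i:i + 2])                             # "seat belts" is not searched
        if pair in PHRASES:                                       # as "seats" and "belts"
            merged.append(PHRASES[pair]); i += 2
        else:
            merged.append(tokens[i]); i += 1
    query = " ".join(merged)
    return SYNONYM_FORKS.get(query, [query])

print(rewrite("Seat  Belts!! 🚗"))   # ['seat belts']
print(rewrite("iPhone"))             # ['iphone', 'apple iphone', 'mobile phone']
```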

Russ Cam: If we were to

Nicolay Gerold: Is that all one bracket, basically the query understanding part, which is its own service that is basically doing all the pre-processing on the live queries that are coming in?

Stuart Cam: Yes. Yeah, this all fits under that umbrella of query understanding.

Russ is desperate to talk, so I'll hand over.

Russ Cam: Yeah, if we took a step
back for a second and we just talked

about, what are the big parts of
any search system that you have?

You have ingestion and indexing, which could be as simple as some kind of bulk overnight import process, or it could be some event-stream-based, Kafka-queue-based mechanism to stream in changes from your underlying source of truth data into the search system.

That ingestion and indexing can be somewhat simple. You could go with defaults. Or it could be vastly complicated.

It could incorporate all sorts of
different signals into there that need

to be indexed into the search engine.
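As a rough sketch of the streaming flavour of that, assuming a Kafka topic of document updates feeding the Elasticsearch bulk API; the topic, index, and field names are hypothetical:

```python
import json
import requests
from kafka import KafkaConsumer          # pip install kafka-python

ES, INDEX = "http://localhost:9200", "products"   # assumed cluster and index

consumer = KafkaConsumer(
    "product-updates",                            # hypothetical change-event topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    doc = message.value
    # Bulk API format: an action line followed by the document source.
    batch.append(json.dumps({"index": {"_index": INDEX, "_id": doc["id"]}}))
    batch.append(json.dumps(doc))
    if len(batch) >= 1000:                        # flush in chunks, not per document
        requests.post(f"{ES}/_bulk",
                      data="\n".join(batch) + "\n",
                      headers={"Content-Type": "application/x-ndjson"},
                      timeout=60)
        batch = []
```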

And then the next sort of big layer
as Stu was touching on there is

obviously querying and what happens at
query time in the search pipeline and

query understanding phase, retrieval
phase, ranking phase, et cetera.

And then the other sort of big part is analytics, where you're capturing metrics and feedback on the data that you're showing to people.

So things like clicks and click streams, that kind of information: what rank in the search results users are clicking on. All of that is, I would say, very voluminous data that you typically want to capture and then build models on top of.

And then, yeah, evaluation, so putting
that data that you're capturing into

motion and into practice to build
some better systems that you can

then incorporate into that query
search pipeline side of things.

Those are kind of the big key components; I would say there are a few other ones.

From a systems perspective, obviously operational resilience of that system is super, super important.

But then, yeah, interpretability and understandability or explainability for the results that are being shown: having some kind of provenance for the results that you're seeing, particularly if you're pulling them from multiple different data sources, or where there may be multiple different rankers involved within a pipeline, happening in stages. How did this search result actually end up appearing here for this given user?

Can I clearly explain exactly what happened in each component, to understand how the results were the way they were?

Stuart Cam: That's a much better, broader answer than I gave, Russ, because I jumped straight into the first part of query understanding.

So thank you for taking a
30, 000 foot view there.

Nicolay Gerold: And when I'm
looking at the system, I think

there are so many implementation
details, even just for indexing.

Like what you mentioned, you can have a batch workload and a stream that's coming in at the same time, but underneath it, in vector databases, it's now common that you basically have two separate stores. One is the already-indexed data and one is the data that's coming in live, just because the indexing time is so long that you can't insert a document and have it in the search immediately, so you keep it in separate storage.

What are like the decision factors
you're considering when you're

setting up a search system?

Like when you're looking at all
the different components you have,

and you basically have to decide
like what to go with for each step.

What is coming into play here?

Is it mainly the user behavior?

Is it the type of documents that's being
searched or the types of search, whether

that's multi modal, full text, vector?

Stuart Cam: That's a good question.

It's a good question.

I think the way it's often been described: say, for example, a company wants to do search or they want to improve their search.

Usually there's a set of functional and non-functional requirements, right?

So a functional requirement would be something like: we have X billion documents or X million documents, they change on this cadence, and they look a certain way, so maybe they have a certain weight to them, a certain number of kilobytes or megabytes or whatever the thing is.

So that is usually a description of, shall we say, the corpus size and its changeability.

So that's usually one thing that's described. Hot on the heels of that is usually: we want searches to happen within X milliseconds or X seconds, let's say.

Usually it's in the milliseconds, and usually it's under a hundred.

So given a search query, we expect a set of results back in this amount of time.

So corpus size, expectations around
search latency are often there.

And then cost is usually a factor as well.

We don't want it to
cost X amount of money.

So those are the three kind of
highlights that are often presented.

Now, in terms of what's available,
feasible or possible, then we

would essentially work out what's
the most appropriate architecture.

And sometimes it's, there's
no free lunch in all of this.

Often it's a case that, to service
billions of documents on a vector

search, less than 10 millis, you
can do those kinds of things.

They're just incredibly expensive.

It's really about trying to, manage
all of the overall system to find

what is an acceptable balance.

So it's either going to come down
to speed, cost, or corpus size.

That's usually how these things play out.

Russ Cam: Yeah, I think that's a good point: you really need to understand the requirements and the pain points you're trying to solve.

I think you need to understand the
constraints of the environment that

you might be working in as well.

So for example, building a search system for a small company that doesn't have dedicated search engineers, that has perhaps a few well-rounded engineers looking after everything: what's the most appropriate thing to build when they're looking for some search capability? Now that's obviously very different to a larger company where search is absolutely paramount to the user experience of their system.

Yeah, and there are very likely to be dedicated search engineers, machine learning engineers, and data scientists working together on trying to solve problems collectively.

What fits for that is going to look very different to that first case.

And then, yeah, the sort of technologies and techniques that you might be looking at might be quite different too. There might be a tendency, for example, in that first case, to look at systems that you can buy and get relatively far with off the shelf, versus investing a lot of time in building out more complicated but componentizable systems that allow you to experiment with different components and different parts of the system over time.

Yeah it's the old adage.

It really depends.

Nicolay Gerold: And to make it a little
bit explicit we've talked before about

the different trade offs you can do.

Can you maybe go into the main
trade offs you're considering?

Because I think for a search system
it's worth it in the beginning

to make them all explicit.

For example, cost.

Is it like, I can maximize on performance
of the search system but it will

just become prohibitively expensive.

Stuart Cam: yeah.

So I mean there's there's a few kind
of trade offs that you can make.

Relevance and performance is one.

So do you know, what
do you care about more?

Do you care more about the
performance of the search versus

the accuracy and relevancy?

For example, with some ANN queries, you can essentially bake in your retrieval characteristics of how much performance versus accuracy you'd like. What's important to you there?

The next one I think is
indexing versus querying.

It is often the case that indexing and querying are like two sides of the same coin, essentially.

And I've seen examples where to improve
the search, you would change the indexing.

Instead of improving, for example, the query itself, you would actually add an additional set of, shall we say, pre-processing or other algorithmic changes in your indexing that allow data that's already pre-computed to go into the search engine, which can then be used at query time.

Now the downside of that is that you're essentially baking that computation into the indexing stage.

So if you want to change your query,
it's often the case that, okay, we

may have to go back to the beginning
and re index all of your data

using a slightly different form.

So that's often a trade
off that can be made.

It's often the case that trying to run an algorithm at query time is slower than if it's been pre-computed at indexing time.

So that's a trade off that
happens quite regularly.
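A hedged illustration of that trade-off using Elasticsearch-style request bodies; the popularity signal and field names are made up, not anyone's production scoring formula:

```python
# Index time: precompute the signal once and store it on the document.
# Changing the formula later means reindexing everything.
def enrich_for_indexing(doc):
    doc["popularity_boost"] = 1.0 + doc.get("clicks_30d", 0) / 1000.0  # illustrative signal
    return doc

index_time_query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "iphone"}},
            "field_value_factor": {"field": "popularity_boost", "missing": 1.0},
        }
    }
}

# Query time: compute the same signal on the fly with a script score.
# Flexible to change per request, but every query pays the computation cost.
query_time_query = {
    "query": {
        "script_score": {
            "query": {"match": {"title": "iphone"}},
            "script": {"source": "1.0 + doc['clicks_30d'].value / 1000.0"},
        }
    }
}
```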

I'll hand over to Russ for a couple
of other trade offs that we've seen.

Russ Cam: Yeah, I think that's one of
them pre computing features or signals

at index time versus query time.

That's a classic trade
off between the two.

Indexing is usually faster; querying is more flexible, and you can change things more on the fly, but it typically results in slower searches.

I think that kind of touches on a wider
point, which is there is a balance here

between the latency of search results
that you're trying to hit and the

level of complexity or personalization
or query understanding that you might

decide to go into in a search pipeline
to be able to improve the results.

There can often be a balance and a trade-off between those. You might have, for example, an algorithm that you can demonstrably show massively improves the results, but if it's going to take three seconds for every single person, it's going to be unfeasible.

So there is that trade off
to make there there too.

There are ways of building into the
system around resiliency and, cutting

off after a certain point where you
can try and build that stuff in.

But yeah, that's typically one of
the other trade offs you can see.

Nicolay Gerold: And that's, I think, the current trend especially, because most people are building RAG systems in a way that the search is very crude, so you have to actually add stuff that can increase the relevancy for the user in the end, so you have to have a re-ranking component, which is very costly.

Especially looking at if you're doing something like ColBERT.

Russ Cam: Yeah, exactly.

It's the old adage of garbage in, garbage out, in the fact that you want your retrieval phase to surface the most relevant results it possibly can.

But there is a trade off there in
terms of how much you can feasibly do

or what signals might be available en
masse to be able to do that for all

of your users at the retrieval phase.

So then often, yeah, you would
want to employ more complex but

better ranking mechanisms in a
post retrieval ranking phase.

That allows you then to finesse
and, fine tune those results that

come back from the retrieval phase.

But you want to make sure that you
still try and get in that bucket back

from the retrieval phase the very
best results that you can in order for

ranking to be able to do a better job.

Stuart Cam: You can only re-rank what you've been given to re-rank.

So if the gold nugget isn't in that initial set, then it's never going to make its way to the top, regardless of what re-rankers you have in place.

And that is in itself is a trade off.

And, re ranking is often used for
tying in, personalization or machine

learning models that are inherently just
difficult to host within a search system.

They're excellent for that.

But yeah, as I said, you can
only rank what you've been given.

So if you don't have it,
you're not going to re rank it.

And that's interesting because we've seen approaches whereby, if you, for example, need to serve 100 results to a user, you think about over-fetching, say, 300-400 results from your retrieval system and then running that through a re-ranking, so that of the 300-400 you're at least giving more information to the re-ranker, so that the top 100 is, in theory, drawn from the top 400, as opposed to the top 100 just simply being reordered.

There are approaches for overfetching
where you can hack some of that.

That does make paging very interesting
as soon as you start doing overfetching,

but, that's one approach to potentially
get around some issues there.
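A minimal sketch of that over-fetching pattern; retrieve and rerank_score are placeholders for your retrieval call and whatever (possibly personalised) re-ranker you plug in:

```python
def search_with_overfetch(query, user, retrieve, rerank_score, serve_k=100, fetch_k=400):
    """Over-fetch candidates from retrieval, re-rank them, serve only the top slice.

    retrieve(query, k)             -> ranked candidate documents from the engine
    rerank_score(query, user, doc) -> relevance score from the re-ranking model
    """
    candidates = retrieve(query, fetch_k)              # e.g. 300-400 instead of 100
    reranked = sorted(candidates,
                      key=lambda doc: rerank_score(query, user, doc),
                      reverse=True)
    return reranked[:serve_k]

# Note: paging gets awkward here -- page 2 of a re-ranked, over-fetched set is not
# simply "results 101-200 from the engine", so cache or recompute consistently.
```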

Nicolay Gerold: Yeah, and when you're coming into a new search project:

Do you have a set of questions you're
asking the people or you're asking

yourself to run through to basically
figure out the requirements you're

facing in that particular system?

Stuart Cam: Yes, they're usually scattered and numerous.

I think the top three for me are usually along the lines of: how big is the corpus?

Where does the data come from?

What's the source of truth, in
your environment, what is the

feed into this search system?

Because there's usually not just one system; often it's an accumulation of multiple different systems.

So, okay: what is feeding it, how big is it, how many documents, what size of documents, and what are the expectations of rate and latency?

Where does data governance fit in as well?

This is usually something that crops up.

Search systems inherently have a tendency to attract everything in.

They can become like hubs.

You've got, say, for example, five
different departments, and the

combination of those five different
departments information all has to go

into the search system for some reason.

So now your search system is essentially
the master of those five pieces

of, those five departments, right?

And then you end up with
interesting questions.

Who actually now owns that data?

Is that the search team?

Are the search team now responsible
for the hygiene, cleanliness,

et cetera, of that data?

Or are the individual departments?

And how do you manage changes
and expectations of changes?

What happens with the data?

So governance is one that
comes up quite often.

And yeah, usually things like: what cloud providers are they using, what are their expectations around DevOps, what does their number of developers look like, what languages are they using.

So it's a kind of the broad set
of, IT questions that you might

ask plus then all of the search
questions that you would ask as well.

Russ Cam: Yeah, I would say one of the key questions for me, and we touched on this earlier, is: how are you currently evaluating the system that you have? Because in order to be able to improve a system, we need some form of measure to work against.

So do you currently have a system in place that you're able to point towards and say, yeah, we can see that this is where we're tracking for this stuff, these are some North Star metrics that we have for things?

Do they have that system in place
already or is that something

that they've thought about?

And again, are they doing it in the small, or is that something that also needs to be built as a precursor to looking at improvements in the search system?

Nicolay Gerold: How often is
it the case that you actually

have an eval set already set up?

Russ Cam: Fairly...

Stuart Cam: It varies.

Russ Cam: Yeah, it varies fairly frequently.

There might be different cases where
it's maybe not being used effectively

or in a format where, say the data
is being collected, but it's not

really being put to use effectively.

That's fairly common.

Stuart Cam: It's quite common to see KPI
metrics around, general business activity,

but then not see that tied up to search.

So it's we're tracking the amount
of inventory we're selling,

we're tracking page views.

But we're not associating
it to a search in some ways.

Often, it's a case of: actually, if you can tie this up to user behavior around the search, now you have the complete path available to you.

You either see fragments, where they're half collecting, or not collecting at all, I would say.

Those are the two common scenarios.

It's rare to see proper evaluation
done, I would say in most places

that's where we do see a lack.

Russ Cam: I think that's a good point
that you touch on there, Stu, is that

if you're able to tie in changes or
uplift in the search system to, the

core underlying business metrics.

It's very easy then, or easier, I should say, to make cases for further investment in improvements in those systems.

Because ultimately that is the goal of any of these improvements you're making: you're ultimately trying to uplift the business value that you're getting from these systems.

And so if you have a clear case of being able to tie those things together, to say: hey, this algorithmic change that I've made here, I've run it through offline evaluation, and I can show that it results in a 5 percent improvement, or even a 2 percent improvement, or even less than that. But if I take that to interleaving or an A/B test, it can result in X amount of uplift for the business, for the company, over the year.

I think it's good to get into that
kind of mindset of thinking when

you're looking at improvements.

Stuart Cam: Yeah, that's an interesting thing you bring up there, Russ, which is around the delta differences that you can see in improving search.

It's not uncommon to see improvements in the single digit percentages. So, you know, you might see a 1-2 percent uplift.

So you have to think about, as you
go through your search journey,

if you like, you might find that
there are some very quick wins.

You might get your 10-20 percent improvement and everybody's high-fiving, like, we're really improving this thing.

And then you might see that diminish.

Okay, so you might see
it getting into the 1%.

Actually, I would say a good marker of a sophisticated system is one where you actually see it's getting worse.

Because if it's getting worse,
in some ways, you might be at

some point of maxima, right?

You might be discovering a maxima.

Yeah, it's compounding effects over time that you're interested in. A 1 percent improvement over 50 improvements is still quite a significant change.

Russ Cam: Yeah, and if you have some
versioning to these algorithmic changes

that you're making, even better, because
then you can look historically at these

changes that you've made and, determine
how you've ended up and landed into this

weird, complex, convoluted query and
ranking system that you have now, why, how

and why you've ended up in that situation.

But yeah, to your point, Stuart, it very much does follow a Pareto principle: 80 percent of the work is down to 20 percent of the improvements.

Nicolay Gerold: And I always like to go
into that in the end, like overrated,

underrated in search, especially what
are things that are overrated and what

are things that are underrated to do?

Russ Cam: I'm going to go out
on a limb here which is always

risky to do on a podcast.

But, I think that.

There's obviously a lot of hype, for good reason, around vector search and similarity-search-based systems.

I think that what we are seeing
is that none of these are going

to replace full text search.

I think what I've typically seen over time is that more and more places may have initially jumped on a RAG, vector-based approach to things, which isn't necessarily suited for all kinds of searches.

But over time they've ended up actually taking more of a hybrid approach, because the full text search side of things brings things that a vector-based approach isn't as effective at as a full text approach is.

So for example, if you want to find search
results that exactly match the keywords

that are being put in there, full text
search is perfectly good at doing that.

Yeah, so I think it's, I don't see things
as being necessarily a replacement.

It's a hybrid approach and a
combination of approaches that

we're seeing happening here.

Stuart Cam: I'm pleased that you've gone out on the limb there, Russ, because I was going to go out on the same limb. I would echo what Russ has said there: vector search definitely has its place.

I think it might be suffering from
the, it's the shiny thing out there

so everybody's flocking to it.

But I still think there's plenty of mileage that can be covered by just improving your full text search, and there's an argument to say that in many instances we've seen that's actually easier than taking on a vector-based approach, which often requires machine learning skills; now you have a machine learning hosting problem.

You have a whole index and ingest problem there as well.

You have all the costs associated to it.

Whereas actually, all you really
needed to do was change a few

lines in your text processing.

We've seen instances like that.

I would say underrated.

And I did go off on my 30-foot view on rewriting and tokenization earlier, but I would say underrated is definitely query understanding; that is, failing to properly understand what the user intended through their search, and failing to translate what they entered into a meaningful search query.

That's definitely underrated.

Russ Cam: Yeah, I was just going to say on that point: I think some of it goes back to when you start to build out a search system, what things you start to put in place, and how you build up a more complicated architecture over time.

It's very typical to take the user query verbatim, do no query understanding on it, pass it to the

search engine, get the results back,
serve them back to the user, then start

to maybe perhaps introduce some form
of post retrieval ranking in there

perhaps for personalization and then
to build out these layers, but that

whole sort of query understanding phase
and trying to, better understand user

intent, I think is something that is
often undervalued from what we've seen.

Nicolay Gerold: Yeah.

And if people want to improve the
search system and get in touch with

you and follow along with you in
general, where can they do that?

Stuart Cam: So they can reach us.

So Russ and I both have a consulting
company called Search Pioneer.

So we're available at www.searchpioneer.com.

And we're also reachable on LinkedIn
under the same name, Search Pioneer.

So either of those is the
best way to get in contact.

If you go to the site, there's a form, and you can read about clients that we've interacted with, some of the success stories, and what we're all about.

And if that's something that interests you and you believe we can help, then get in contact and we can discuss your situation and hopefully see if we're able to help you.

Nicolay Gerold: Okay, what can we take away?

What I'm always interested in first is how they make decisions and what their frameworks are for thinking through the problems they want to solve.

And I think the constraints they called out for search infrastructure are very interesting: corpus size, latency requirements, cost limitations, but also data governance needs.

These are interesting guardrails for basically coming up with an architecture.

And the fact that there is no free lunch means you really have to think through what you are optimizing for.

This should also be informed by some kind of business metrics or performance metrics that go beyond your search system, in addition to the search metrics or performance metrics, like latency, that you can actually measure in your system.

The trade-offs often come very early, in that you have to decide when you actually optimize.

So, for example, you can incorporate a lot of stuff at indexing time, but then you can't really adapt at query time, which often limits how relevant you can make search results if you only bank on the information you have available at indexing time, because you have less user information. So you can do less personalization, for example.

And I think the relevance versus performance trade-off is probably one of the most interesting and important ones you have to consider.

The main architecture components they broke down: you basically have the ingestion and indexing, then you have the search pipeline, and then you have the analytics or the feedback loop.

The search pipeline you can basically break down into three separate components: first, query understanding, which is often its own pipeline; then the retriever component, which is actually the interaction with whatever search database they're using; and then lastly, the ranking component, which ranks all the search results, prioritizes them, or does fusion if you have multiple different search databases.

They had basically three different validation approaches, or a three-layer testing strategy: first, a golden query set, which basically covers head, mid, and tail queries; then you have your automated evaluation suite, which builds on these queries with NDCG or reciprocal rank or whatever you are using; and then at the end you basically have your production A/B tests, or something like interleaving, where you have multiple algorithms and you interleave their results to evaluate the system as a whole.

I would add on top of that, basically, the business metrics. Because when you're implementing a completely new search system, you should also, in e-commerce for example, measure: okay, what's the revenue impact?

You can never exactly measure that. It's not like an A/B test where you have two different versions of a system running; rather, you go from zero to one or you do a major overhaul. So you actually want to keep an eye on the business metrics as well.

What surprised me a little bit was their advice that you always have to, or should, run your tests against production environments and not really dev or staging environments, because you need to test against production data.

And I think this is partially informed by them having worked on very, very large systems, where it's actually very impractical or even infeasible to run the complete database multiple times because it's so large. So you probably can't really replicate the production system into a staging environment.

I think most of you work on smaller search systems, so you can likely do a complete replica and test against production data, but in a dev or staging environment. If that doesn't apply to you, go with their approach, but be really careful that you're not DDoSing your own service.

And especially when you add stuff like debug operations, like explain=true in Elasticsearch, you should be really careful around those, because they really increase the load or strain you're putting onto the system.

Next week we will be continuing with more on the systems perspective on search, and I'm excited for that. So leave a like and subscribe so you can stay notified when the next episode comes out.