How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
Nicolay Gerold: Search systems come with their own set of challenges. The data is often too large to be stored on a single node, we often need to handle 10,000 to 50,000 queries per second, the indices are very slow to build, but we still want to search the fresh data. So we have a bunch of different bottlenecks, or trade-offs: cost, latency, scale, freshness of data, high throughput. So how can we solve this?
Milvus is the open source vector database that gives you the necessary levers to pick your trade-offs and solve the bottlenecks you care about in your application. You want low cost? Place more of your data in object storage. You want higher throughput? Add GPU acceleration. You care about fresh data? Create a buffer that stores the new data, query from it, and use your main database for the older data, and basically just combine the results.
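To make that fresh-data idea concrete, here is a minimal, hypothetical sketch (plain NumPy, not the Milvus API): new vectors go into a small brute-force buffer, the bulk of the data sits in an already indexed store, and a query simply merges the top results from both paths.

```python
import numpy as np

def cosine_top_k(query, vectors, ids, k):
    # Brute-force cosine similarity over a (small) set of vectors.
    if len(vectors) == 0:
        return []
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q
    top = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in top]

class FreshnessAwareSearch:
    """Hypothetical two-path search: a brute-force buffer for fresh
    vectors plus a stand-in for the main, already indexed data."""

    def __init__(self, dim):
        self.buffer_vecs, self.buffer_ids = np.empty((0, dim)), []
        self.main_vecs, self.main_ids = np.empty((0, dim)), []

    def insert(self, vec, id_):
        # Fresh writes only touch the buffer, so they are searchable immediately.
        self.buffer_vecs = np.vstack([self.buffer_vecs, vec])
        self.buffer_ids.append(id_)

    def search(self, query, k=5):
        fresh = cosine_top_k(query, self.buffer_vecs, self.buffer_ids, k)
        older = cosine_top_k(query, self.main_vecs, self.main_ids, k)
        # Combine both result lists and keep the global top-k.
        return sorted(fresh + older, key=lambda x: -x[1])[:k]
```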
Today we are continuing with our series on search. And today on How AI is Built, we are talking to Charles Xie, who is the founder and CEO of Zilliz, the company behind Milvus. Charles previously worked at Oracle as a founding engineer of the 12c cloud database. We talk about the bottlenecks you will face in your vector database and how Milvus tries to solve them through multi-tier storage and GPU acceleration, and we also get a glimpse at the future: self-learning indices.
Charles Xie: I think a lot of organizations, from smaller startups to large enterprises, are all looking for a vector database solution, but their requirements are a little bit different. For early-stage startups, for smaller groups, they care about ease of use. Basically, they want to get the application up and running in the least amount of time. But enterprises care about performance. They care about scalability. They care about the maintenance of the system, how easy it is to maintain at a large scale, and also security and compliance, and how you integrate with the ecosystem to take the data from upstream.
Nicolay Gerold: And in terms of scale, can you tease a little bit what is the largest vector database, in terms of dimensions plus the number of vectors, that you've scaled up to today?
Charles Xie: Yeah. So the largest ones are at the 100 billion scale. That's pretty much Internet scale. On the Internet, we probably have a thousand billion, several thousand billion vectors if you want to index the whole Internet. So we saw some of the largest deployments in that range. And we also saw that there are a lot of companies doing vector similarity search at the billion scale, the 10 billion scale. And when it comes to dimensions, what we found is that three, four years ago they started with 512 dimensions, and now they are in between 1,000 and 2,000 dimensions, depending on the application they are building.
Nicolay Gerold: And what are the primary
bottlenecks in your opinion, when a vector
database has to scale up to that size?
Charles Xie: So the biggest challenge is that you have to build a scalable system from the bottom up. I want to take one step back and give you an analogy. If we look at the traditional relational database ecosystem 20, 30 years ago, we had PostgreSQL, we had MySQL, and both of them were single-instance. They were very easy to use, and they got a lot of popularity. But when the data volume of organizations grew, people needed solutions that could accommodate a larger volume of data. That's why we got big data and a lot of solutions to scale your database. In the very beginning there came a very easy idea: you add a sharding schema on the application layer. So there are a lot of solutions on top of MySQL and Postgres that basically partition the data into shards, and then you can do load balancing on top of it on the application layer.
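To picture that application-layer sharding, here is a minimal sketch with hypothetical names (not tied to any particular MySQL or Postgres middleware): rows are routed by hashing the key, so the same key always lands on the same shard.

```python
import hashlib

class ShardedStore:
    """Hypothetical application-layer sharding over N backends (dicts here)."""

    def __init__(self, shards):
        self.shards = shards

    def _shard_for(self, key):
        # Hash-based routing: deterministic shard choice per key.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def write(self, key, value):
        self._shard_for(key)[key] = value

    def read(self, key):
        return self._shard_for(key).get(key)

# Three "databases" represented as plain dicts.
store = ShardedStore([{}, {}, {}])
store.write("user:42", {"name": "Ada"})
print(store.read("user:42"))
```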
But this solution has some drawbacks. First of all, most of these solutions are single-write and multiple-read. Because you only get a single node, a single instance, for data modification, that is going to be your bottleneck. And if you want to remove the bottleneck, you have to build a system, you have to build an algorithm, to support distributed consistency, and there is going to be a lot of challenge in that. So we have a Raft protocol implementation to support distributed consistency. And also, when you are trying to build a scalable system, you add another layer of complexity in networking and communication, which will further complicate the system. You have to consider how you do load balancing, how you do data replication, how you do failover and data recovery, but all in a distributed environment. In the vector database space, all this complexity also applies. So we have to handle all this distributed data consistency. And what makes things more challenging is that vector data is actually very large. If you look at a single entry, an embedding vector is actually pretty much bigger than a typical entry in a traditional relational database system. As I mentioned, typical embedding vectors could be 1,000 or even 2,000 dimensions. So transmitting all this data in a distributed environment can be a challenge. And if you want to have this data consistency without sacrificing performance, there is going to be a lot of challenge there.
Nicolay Gerold: Yeah, and we moved a lot into the distributed systems space already. For those who aren't familiar with all the data engineering stuff, Raft is basically an algorithm to reach consensus between different nodes. It's easier to explain with a transactional database: if you have multiple transactional leaders distributed across the globe, and you write a new entry for a bank transaction into one of them, the other nodes have to be kept up to date with the node that has been written to, and you basically have to reach consensus. And you also have to consider the second part, the networking: the network might fail or might take longer to communicate between the different nodes, which is the additional complexity in distributed systems.
Nice.
And especially for Raft, can you go a little bit into the indexing part? Especially because we don't really have the lightweight writes you typically see in transactional databases, but rather heavy vectors. So how are the heavy vectors basically distributed, and how does the consensus mechanism work exactly between the different nodes?
Charles Xie: So basically every vector database system may take a different approach, a slightly different approach or a totally different approach, to resolve this problem. But for the vector database system we build, the open source Milvus, we support different kinds of data consistency. By default it's going to be eventual consistency, but if you really want to have strong consistency, we can also support that, with a slight loss of performance. It's going to be a little bit slower, but we do support different consistency levels. And we think that's important. A lot of companies are building vector database solutions for different scenarios, and some of them care a lot about performance and want real-time data visibility; for those cases eventual consistency may be a good solution. And some of them, for example, are building real-time fraud detection for financial institutions, and they care a lot about stronger data consistency, and we can also support that.
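As a rough illustration of how that choice surfaces to a user, here is a sketch using the pymilvus client. It assumes a Milvus instance at localhost:19530, and parameter support can vary by client version, so treat it as indicative rather than authoritative.

```python
import random
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Eventual consistency favors write latency and throughput; "Strong" makes
# reads wait until they can see the latest writes (e.g. for fraud detection).
client.create_collection(
    collection_name="events",
    dimension=128,
    consistency_level="Eventually",  # Strong | Session | Bounded | Eventually
)

vec = [random.random() for _ in range(128)]
client.insert(collection_name="events", data=[{"id": 1, "vector": vec}])

hits = client.search(collection_name="events", data=[vec], limit=3)
print(hits)
```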
Nicolay Gerold: Yeah. And you build on top of HNSW as the index. HNSW, especially at large scale, takes a while to re-index when new vectors are inserted into the database. How do you actually support these more real-time scenarios, where you basically have a read in close proximity to a write?
Charles Xie: Yeah, so first of all, all the data we ingest into the system goes through a distributed logging system. At the moment we are using either Kafka or Apache Pulsar for this purpose. So every single piece of data, we put into a distributed write-ahead log, and we're trying to use this distributed WAL (write-ahead log) to support all these different kinds of data consistency. And other than that, as you mentioned, for real-time data visibility, to make sure that the data is visible in real time, searchable in real time, without a sacrifice of performance, we have two access paths for the data. One is for the fresh data. For fresh data, we first put it into a buffer, and for the data in this buffer we don't have to build an index; we just do a brute-force search, to make sure that the data is going to be accessible and searchable in real time. But when the data in this buffer grows and reaches a certain threshold, it triggers the index building on the backend. The system starts building the index, and it could be HNSW, it could be IVF_PQ, it could be other indexes behind the scenes. This is similar to the strategy adopted by a lot of traditional big data or database systems. Cassandra, for example, has its LSM tree structure: they put data in a buffer and then they build this hierarchical merge tree to gradually build or rebuild the indexes.
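A toy version of that write path (not Milvus internals, just the pattern described): every write lands in a growing buffer that is searched brute force, and once the buffer crosses a threshold it is sealed and handed off for an index build.

```python
import numpy as np

class GrowingSegment:
    """Hypothetical sketch of the buffer-then-index pattern."""

    def __init__(self, seal_threshold=10_000):
        self.seal_threshold = seal_threshold
        self.buffer = []          # fresh vectors, searched brute force
        self.sealed = []          # stand-ins for built indexes (HNSW, IVF_PQ, ...)

    def insert(self, vec):
        self.buffer.append(vec)
        if len(self.buffer) >= self.seal_threshold:
            self._seal_and_index()

    def _seal_and_index(self):
        data = np.array(self.buffer)
        self.buffer = []
        # A real system would kick off an asynchronous HNSW/IVF build here;
        # we just keep the raw matrix as a placeholder "index".
        self.sealed.append(data)

    def search(self, query, k=5):
        parts = []
        if self.buffer:
            parts.append(np.array(self.buffer))   # brute force over fresh data
        parts.extend(self.sealed)                 # would normally use the ANN index
        if not parts:
            return []
        all_vecs = np.vstack(parts)
        dists = np.linalg.norm(all_vecs - query, axis=1)
        return np.argsort(dists)[:k]
```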
Nicolay Gerold: Yeah, and I think what you've seen, especially in the last year, with something like LanceDB, is that a bunch of companies are starting to adopt a hybrid strategy of in-memory indices and disk-based indices. At Zilliz, is there something similar? And if yes, how do you actually decide which data goes into which index?
Charles Xie: We have been doing this hybrid approach for quite a while, since the year 2019, but we actually take it one step further. So it's not only about memory versus local disk. We have built a hierarchical storage with four layers of data storage. At the bottom, we can use object storage, S3 storage, to store the data. Then on top of it, we can use local disk, for example NVMe disks, to support more efficient data access. On top of that we have memory, and for some user scenarios we can even cache the data in on-chip memory. For example, three years ago we started a collaboration with NVIDIA, so we are using GPUs to accelerate vector similarity search. If you are using a GPU, we can cache some amount of data in the GPU memory. So if you take a top-down perspective, you will see GPU memory as a high-speed cache, then you have the main memory, then you have the local disk, and then you have a distributed object store. As you walk down the hierarchy, you can put more and more data, but the latency may increase. Basically, you are doing a trade-off between the volume of data you can put into the system and the performance.
If you want higher performance, you may want to build a system with more memory and more high-speed local disk. But if you care about cost efficiency, if you are building pretty much an offline analysis application and performance doesn't matter too much, you can definitely use cheaper machines with less memory and less local disk but a larger object store. So basically we are giving our customers, our users, the opportunity to configure their own vector database system in a way that lets them make a trade-off between performance, consistency, accuracy, and also cost efficiency.
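To make the trade-off tangible, here is a purely illustrative sizing sketch. The latency and cost figures are made up for illustration, not Milvus defaults; the point is just how shifting data between tiers moves cost against expected latency.

```python
# Illustrative, made-up figures: rough per-tier latency and storage cost.
TIERS = {
    "gpu_memory":   {"latency_ms": 0.1,  "usd_per_gb_month": 25.0},
    "ram":          {"latency_ms": 0.5,  "usd_per_gb_month": 5.0},
    "local_nvme":   {"latency_ms": 2.0,  "usd_per_gb_month": 0.5},
    "object_store": {"latency_ms": 30.0, "usd_per_gb_month": 0.02},
}

def profile(placement, total_gb):
    """placement: fraction of the data per tier, e.g. {'ram': 0.1, 'object_store': 0.9}."""
    cost = sum(TIERS[t]["usd_per_gb_month"] * frac * total_gb
               for t, frac in placement.items())
    # Expected latency if queries hit tiers in proportion to data placement.
    latency = sum(TIERS[t]["latency_ms"] * frac for t, frac in placement.items())
    return round(cost, 2), round(latency, 2)

# Latency-optimized vs. cost-optimized layout for 1 TB of vectors.
print(profile({"ram": 0.5, "local_nvme": 0.5}, 1000))           # fast, expensive
print(profile({"local_nvme": 0.1, "object_store": 0.9}, 1000))  # cheap, slower
```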
Nicolay Gerold: That's really interesting. And when it comes to the GPU, are you doing GPU-optimized indexes as well for the search, or are you only using the GPU to basically do faster batch processing of the vectors?
Charles Xie: The idea is to use the computation power of the GPU to accelerate the index building part, but also the search part. For both parts we have been working with NVIDIA on GPU-friendly, GPU-customized indexing algorithms. For example, there's a library called RAFT which can support high-performance index building and also index serving on GPU; it is part of NVIDIA's newer GPU vector search library (cuVS). And you also have to optimize the data transmission between memory and the GPU, because most likely that's going to be the bottleneck. The bandwidth between local memory and the GPU is going to be limited, so we have to implement a lot of algorithms to do data prefetching and data caching, to minimize the data transmission between CPU and GPU.
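The prefetching point can be sketched without any GPU library: while one batch is being "searched", a background thread stages the next one, so the transfer cost overlaps with compute. This is a conceptual sketch only, with time.sleep standing in for copy and kernel time.

```python
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.05)            # stands in for the host-to-device copy
    return f"batch-{i}"

def search_batch(batch):
    time.sleep(0.05)            # stands in for the GPU search kernel
    return f"results for {batch}"

def producer(n_batches, q):
    for i in range(n_batches):
        q.put(load_batch(i))    # prefetch while the "GPU" is busy searching
    q.put(None)                 # sentinel: no more batches

q = queue.Queue(maxsize=2)      # small queue = bounded prefetch buffer
threading.Thread(target=producer, args=(4, q), daemon=True).start()

while (batch := q.get()) is not None:
    print(search_batch(batch))  # copy of batch i+1 overlaps with search of batch i
```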
Nicolay Gerold: Yeah, and it's not even just the indexes: the data itself can be stored in multiple different tiers, not just CPU and GPU memory, but the different layers of storage as well.
Charles Xie: Let me give you some more detail about it. As I mentioned, a GPU is limited in memory, but it has a lot of computation capacity. That's why having this hierarchical data storage is going to be important. Basically, we probably cache or store most of the data on your local disk or in your object store, we probably store around one tenth of the data in local memory, and on top of that we probably store one out of a hundred in GPU memory. You can accommodate the scalability of the data, support a massive amount of data, but you can also achieve super high performance, because the GPU only stores a very small percentage of the data. But you have to make a lot of modifications to your indexing algorithm. Basically, it's going to be a hierarchical indexing algorithm. On the GPU, you just do a fast probe to find out, for this particular search, which partition or which sector of data, which cluster of vector data, we should search. Then in local memory you cache more information and can support more granularity, and the real vector search happens at the disk level. So you have a hierarchical index to make sure that you can make the best use of GPU, CPU, and local disk.
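A stripped-down illustration of that hierarchical probe (not the actual Milvus index): coarse cluster centroids play the role of the small, GPU-resident layer, and the exact distance computation only touches the vectors of the few selected clusters, standing in for the disk level.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_clusters = 64, 32
vectors = rng.normal(size=(20_000, dim))

# "GPU tier": only the coarse centroids need to live here.
centroids = rng.normal(size=(n_clusters, dim))   # stand-in for trained k-means centroids
# Assign each vector to its nearest centroid (||v-c||^2 = ||v||^2 - 2 v.c + ||c||^2).
assignments = np.argmin((centroids**2).sum(axis=1) - 2 * vectors @ centroids.T, axis=1)

def search(query, nprobe=4, k=5):
    # 1) Fast coarse probe over the centroids (tiny, cache/GPU friendly).
    coarse = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    # 2) "Disk tier": exact scan limited to the selected clusters only.
    candidate_ids = np.where(np.isin(assignments, coarse))[0]
    dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    return candidate_ids[np.argsort(dists)[:k]]

print(search(rng.normal(size=dim)))
```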
Nicolay Gerold: At which scale of data have you actually noticed that the GPU brings a real performance benefit? I assume it will give a benefit already at small scales of data, but you will probably not see a large performance boost there.
Charles Xie: Yeah, I would say it's not so much at which scale; it's more about what you want from your application perspective. Basically, what we observe is that GPU acceleration is good for scenarios where you need high throughput. So you need thousands of QPS, you want to have 10,000 or even 50,000 queries per second. That's where the GPU is going to shine. You basically send a lot of data to the GPU processor for batch processing, and you get all the results in a batch.
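Why batching helps: a single matrix-matrix product over a batch of queries amortizes memory traffic and keeps the processor busy, which is the same reason GPUs shine at high QPS. A small NumPy sketch with made-up sizes:

```python
import time
import numpy as np

rng = np.random.default_rng(1)
db = rng.normal(size=(100_000, 128)).astype(np.float32)       # database vectors
queries = rng.normal(size=(512, 128)).astype(np.float32)      # a batch of queries

# One query at a time: many small matrix-vector products.
t0 = time.perf_counter()
for q in queries:
    _ = db @ q
one_by_one = time.perf_counter() - t0

# Batched: a single matrix-matrix product over the whole query batch.
t0 = time.perf_counter()
_ = db @ queries.T
batched = time.perf_counter() - t0

print(f"one-by-one: {one_by_one:.3f}s, batched: {batched:.3f}s")
```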
Nicolay Gerold: New embeddings, especially ColBERT and ColPali, but also Matryoshka embeddings: what challenges do these new types of embeddings actually present for vector similarity search systems, and also the new types of calculations you are actually doing with these embeddings?
Charles Xie: So this is definitely an emerging technology in the vector search space. In the past five years, ten years, we saw a lot of applications about just search, and when we are talking about similarity, a lot of people are defaulting to cosine similarity. At the moment we are in the process of supporting all the different kinds of new vector search operators in our vector database. We are going to support ColBERT, we are going to support a sparse index, and we are also going to support re-ranking and custom scoring functions and things like that.
Nicolay Gerold: For the calculations within ColBERT, for the late interactions, what kind of new indexes, or what kind of optimizations to the existing ones, are actually necessary to perform those efficiently at large scale?
Charles Xie: So first of all, compared to cosine distance, ColBERT is going to be more computationally intensive, which will have an impact on the scale of the distributed system, so you may need a larger cluster to support it. And ColBERT also means that you are transmitting more data; a single entry is made of a higher volume of data. We are doing tons of optimization to accommodate the computation and also the data transmission.
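For context on why late interaction is heavier, here is the ColBERT-style MaxSim scoring in a few lines of NumPy: every query token is compared against every document token, so both the compute and the per-document payload grow with the token count. Shapes here are illustrative.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token take its best
    match among the document tokens, then sum over the query tokens."""
    sim = query_tokens @ doc_tokens.T        # (q_tokens, d_tokens) similarities
    return float(sim.max(axis=1).sum())      # max over doc tokens, sum over query

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)

q = norm(rng.normal(size=(32, 128)))                          # 32 query token embeddings
docs = [norm(rng.normal(size=(180, 128))) for _ in range(3)]  # ~180 token embeddings per doc

print([maxsim_score(q, d) for d in docs])   # one vector per token, not per document
```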
Nicolay Gerold: Yep. And have you actually already played around with additional optimizations? Especially in research there is a lot of talk, for example, about pruning, removing the less relevant tokens, quantization of course, but also caching a few of the embeddings, especially the more important ones. What are you actually considering beyond those to optimize the ColBERT calculations?
Charles Xie: We are supporting all these approaches, but this is still ongoing work, and we are not sure which approach works best for us, because, again, Milvus is a distributed system and a lot of things are going to be much more complicated in a distributed environment. We're definitely exploring all these approaches, but we haven't decided yet.
Nicolay Gerold: On the sparse indexes, as you mentioned, is this among others SPLADE? Because that would be really interesting: how are you handling the expansion at query time? That's always something that really bugged me about SPLADE, that it's really hard to implement the expansion at query time in a way that actually makes it really useful.
Charles Xie: Yeah, so we also think this sparse index is very important, because it basically adds another way for us to search the data. You can combine it with the similarity search on dense vectors, and you get a higher recall.
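One simple way to combine a sparse (SPLADE or BM25-style) result list with a dense one is reciprocal rank fusion. This generic sketch is not the Milvus API, just the fusion idea being pointed at:

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=5):
    """result_lists: iterable of ranked id lists, best hit first.
    Classic RRF: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

dense_hits = ["d7", "d2", "d9", "d4"]    # from dense vector similarity search
sparse_hits = ["d2", "d1", "d7", "d8"]   # from a sparse / lexical index

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```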
Nicolay Gerold: Yeah.
And are there any emerging technologies
or research areas that you are
particularly excited about at the moment?
Charles Xie: So there are a few technologies. First of all, integrating and supporting the more traditional data retrieval algorithms in vector search, for example BM25. We see the importance of supporting BM25, so integrating BM25 with vector similarity search is one thing we are working on. The other thing we are trying to do is accommodate a lot more kinds of data. When we talk about embeddings, a lot of people are just talking about text and images, but in the domain of text and knowledge bases, we found that there is a lot of data stored in more traditional structures. And we found out that we can actually convert all this knowledge-base data into graph embeddings. There are a lot of graph embedding algorithms, so if you pick one of these algorithms to transform your knowledge base or your graph data into graph embeddings, and you put these embeddings into your vector similarity search, you can actually achieve superior accuracy. We actually did some experiments, and we found that you can achieve even higher accuracy on data retrieval than graph RAG, for example.
There are also going to be more and more indexing algorithms, and for each indexing algorithm there are a lot of parameters, like segmentation size and how you cache the data; there are tons of parameters to fine-tune. That makes us think about autonomous driving for vector indexes. So basically we are trying to use machine learning algorithms to provide tuning on top of vector indexes, so that the customer, the user, doesn't have to care about the configuration and the parameters of the different indexes anymore. It just works out of the box. And lastly, we have had a lot of advancement in indexing algorithms, from HNSW to IVF_PQ. But we found out that there could be self-learned indexes. They could be powered by AI algorithms but highly customized for your data distribution. For example, HNSW is basically a graph-based algorithm, but every single customer may have a different data workload. So what if we could use machine learning, or some algorithm, to customize the HNSW algorithm, or the graph-based algorithm, just for that data distribution, for that kind of workload? This will help people achieve higher performance and also lower cost. So I think those are the things that are emerging in the next few years.
Nicolay Gerold: And by self-learning algorithm, or adapting index, do you actually mean that the index sees the queries, and which query types are performed frequently, and then basically makes those faster by adapting the index? Or is it also based on the data structure and data distribution, so that it reorganizes itself over time to basically organize similar items closer together?
Charles Xie: Yeah, it could be both. You are absolutely right; we want to achieve both. Let me give you one example. We know that eventually you have to build several indexes. If you have a large volume of vector data, you have to segment your data and then build several indexes, and how you are going to segment your data is going to be a big challenge, because you want to maintain data locality. If you know the pattern with which you are going to retrieve your data, you can build the indexes without sacrificing data locality; you basically get better data locality. And also, as you mentioned, there is some data that you probably want to retrieve with a higher frequency, and for that data you can definitely put it somewhere in the storage hierarchy that has a higher affinity to the GPU or CPU; you basically store it in GPU memory or local memory and things like that.
Nicolay Gerold: Yeah. And in HNSW, you have a lot of different things you could basically adapt: the choice of connections per node, like the number of connections per node, the number of layers you have; you could probably also add new entry points into the data. Are you also exploring, for example with reinforcement learning, custom distance functions?
Charles Xie: So this is a very interesting idea. We had a discussion about that, I think one year ago. But unfortunately, at the moment, we don't have the engineering bandwidth to explore it. But that's a very interesting idea.
Nicolay Gerold: Yeah, it could be really interesting. And if people really want to start building stuff and check out Zilliz, where can they do that? And also, what's on the horizon? What can you tease, or what should people really look out for?
Charles Xie: I think, as it is, we have been building a vector database system for high performance, for scalability, and also without sacrificing cost efficiency. So for those users who want super high performance, who want a very high QPS and very low latency, who have tons of data, a million vectors and above, and who also want to achieve a good TCO, total cost of ownership, I think we are the solution to go with.
Nicolay Gerold: And if people
want to follow along with
you where can they do that?
Charles Xie: So they can reach out to us in the open source community. Milvus is open source on GitHub, and it is one of the most popular vector database systems in the world. So just search Milvus on GitHub. And we also have a Discord channel to hold all these discussions about vector databases and about Milvus.
Nicolay Gerold: So what can we take away?
I think there are two very different sets of learnings here.
One for like the developer who's working
on the actual databases and one for
more of the user of the databases.
So let's maybe start with the user, which
is the group, which I identify with more.
And I think what really is obvious
through the episode is how much effort
you actually should spend on figuring
out the requirements of the system
you're building so that you actually
sit down, look at the use case.
And figure out, okay, what am I
optimizing for or what do I have to
optimize for based on what the user
needs and also the use case requires.
And this can be very different things.
I think the three main aspects I'm looking at are, for one, always cost.
The second one is latency.
And the last one is basically in,
in search databases, relevance or
relevancy, like how relevant or good
do my search results have to be.
And you have a lot of different
levers to pull in your system to
actually have an impact on that.
So if I'm just optimizing for cost, I would place more of my data in a cold form of storage.
But this also means I'm
trading off the latency part.
It doesn't mean I'm getting a
worse relevancy, but I will have a
worse latency just because of the
network cost I will be encountering.
And on the other hand, when I have more of an e-commerce use case, where I'm in the domain of 100 milliseconds in which I have to return the results to the user, I would have to optimize for latency. So I would have to place more of the data in RAM, or at most on local SSD, because I wouldn't be able to do the round trips over the network before I have to return the results, especially when I have to move large amounts of data, which I have to in search systems.
And the last one is the relevancy.
And here, this could mean I have a use case where relevancy is of the utmost importance. This could also mean that latency doesn't matter at all.
So when I'm looking, for example, at
report generation, which is something that
is typically done more async, so the day
prior, the night prior, and then I serve it to the user, I really can take my time.
And can focus more on getting
a really high relevance.
So doing, for example, an exhaustive
search over the entire vector
database because I just don't really
care about the latency I will have.
The other thing would be I
could even add more components.
So, for example, a re-ranker,
which is something very hard to
do when you're actually working
in a low latency environment.
And when I have a use case where I actually need high relevance, but also very low latency under high load, then I would have to trade off the cost massively and probably go for GPU acceleration or just add more nodes to handle all those QPS.
And I think being explicit about this
in the very beginning, what are you
optimizing for is what will get you
the good results when you're actually
building a system down the road.
And the other set of learnings, which is more for the database engineer, or however you want to call it, who is building the databases: there are a lot of different interesting things they've built that have emerged over the years of building databases. One is the dual path of search: you have a write path, which goes to a buffer that is only indexed once you hit a certain threshold of data, and a read path that basically queries the buffer, but also the already indexed data.
And I think that's very similar
to Cassandra's LSM tree.
Also, it looks a lot like a cache-aside pattern, and you can basically achieve real-time searchability.
So basically the data is fresh without really sacrificing the performance, because you couldn't search it in real time if you had to build the index first. And there's also the hierarchical storage element: the fastest tier is the GPU memory, then you have the RAM, then you have the local SSD, and then you have the object storage.
And in each one of those, you
have a different bottleneck.
So in GPU memory, the bottleneck
is CPU to GPU transfer.
In RAM, the bottleneck
is memory bandwidth.
In local SSD, you have IO operations.
In object store, like S3,
you have the network latency.
So you have to solve a lot of
problems to actually make this work.
But you basically can do a refined
search and go basically through
the different types of data.
And he also mentioned that different amounts of data tend to be stored on the different types of storage. So, for example, the GPU holds 1% at the maximum, probably even less, and RAM around 10%. Then for disk and object store, I think this is where the main trade-off is done at the moment: whether data sits on local SSD or in object storage. And the ratio, I think, was really interesting; I think he even mentioned roughly 89 percent is on local SSD.
The architecture basically allows
them to maintain the high performance
while still handling massive data
sets in a cost efficient way.
And this is again going into like
the trade off direction, like
what are you trading off for?
And as a systems engineer, database
engineer, you basically have to
build the wheels so the user is
actually able to make his decisions
on what he's optimizing for.
But yeah, I think this was
a very interesting episode.
We will be going even deeper
into Milvus and its architecture
in an upcoming episode.
I'm not sure whether I will post it next
week or at a point later in this season.
But to stay up to date and to
catch the next episode, subscribe.
Also, let me know what you think, post
it in the comments, whether you're on
YouTube or just send me a message on
LinkedIn, Twitter, BlueSky, wherever.
And yeah, I will talk to you soon.
Have a good one.