Nicolay Gerold: Usually, I don't do these
types of episodes where I go really deep
into closed source tools, but I think
it's really interesting what the guys at
TopK are building. I think they're
building something similar to Snowflake,
but for search: Snowflake basically
separated the compute and the storage
for analytics, and they are
doing the same thing for search.
And they're building on some really
interesting architectural patterns
which are more and more common
nowadays. For example, using a log
as a database, where you basically
write all changes to an append-only
log first, which establishes the
exact order of operations that can be
replayed and gives you data durability.
It comes with a bit of a performance
downside, but it gives you a lot of
advantages. They also use a
custom file format, as we have seen,
for example, in LanceDB.
It's also something we are seeing more
and more with custom table formats
like Iceberg and Delta Lake, where you
control how the data is stored, laid
out, and organized on disk so you
can get more performance out of it,
enough that you can utilize really cheap
object storage in the cloud, like S3.
And then they also build their
custom query execution engine.
And the query engine basically turns
a user request into something that's
more optimizable and understandable.
So basically it parses the query, creates
a logical plan, pushes down filters,
distributes the work across nodes, then
the computation happens, and then you
merge the results back in and return them.
And these three components, I
think, are really interesting as a
more general trend in all of data.
And in this episode we are talking
to Marek Galovic, who is a former
engineer at Pinecone and is now building
a new kind of search database at TopK.
And we will be talking a lot
about how they build it, their
architecture, but also the different
decisions that went into building
a search database from the perspective
of the 21st century, with all the new
hardware and architecture components we
have access to: really large, scalable
storage, really efficient compute, way
faster IO than what we used to have.
And I think that's a really
interesting discussion.
And you can also take a lot away from
it, even though we are talking about
something closed source. Let's do it.
Marek Galovic: One super nice thing about
building distributed systems and
distributed algorithms in Rust is
precisely the async
ecosystem, where Rust itself
doesn't have an async runtime.
You can provide your own runtime.
And there is this project
from Tokio called Loom.
And what you can basically do is
write your concurrent program.
It's async, right?
If you do any sort of network
communication, you would
drive all of that through async.
And then you can use a deterministic
runtime to simulate all the possible
executions of those futures.
And that allows you to basically specify
a formal model for, like, how the system
should behave, assert all the invariants.
And then essentially simulate
your distributed algorithm on
top of a single node, which like
makes testing much more robust.
Like we can actually like
test all the possible cases.
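A minimal sketch of what this style of testing looks like with the loom crate; the counter and the invariant here are illustrative, not TopK's actual formal model:

```rust
use loom::sync::atomic::{AtomicUsize, Ordering};
use loom::sync::Arc;
use loom::thread;

#[test]
fn concurrent_increment_is_linearizable() {
    // loom::model re-runs the closure under every possible interleaving
    // of the spawned threads, so an invariant violation in any schedule
    // is found deterministically on a single node.
    loom::model(|| {
        let counter = Arc::new(AtomicUsize::new(0));

        let handles: Vec<_> = (0..2)
            .map(|_| {
                let counter = counter.clone();
                thread::spawn(move || {
                    counter.fetch_add(1, Ordering::SeqCst);
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }

        // Invariant of the formal model: both increments are observed.
        assert_eq!(counter.load(Ordering::SeqCst), 2);
    });
}
```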
Nicolay Gerold: Why do you think Milvus
went mostly with Golang instead of Rust to
build a cloud native vector database?
Marek Galovic: Before Rust,
Go was the default choice
for new systems languages.
Back then, like
four years ago when I built my
open source vector DB, I built it in Go.
It was a pretty good choice for a
new sort of systems language that is
simpler and has a better
ecosystem than C and C++. But Rust has
matured a lot over the last five years.
So I would say Rust is now
becoming the default choice.
Nicolay Gerold: Yeah, and maybe to kick it
really off, can you maybe quickly define
what is a query execution engine actually?
Marek Galovic: So basically you have
a sort of description of what needs to
happen, and that could be a SQL query.
And then you have a bunch of data in
some format, and the query execution
engine really takes that description
of what needs to happen, compiles it
to some set of expressions for
that data format and for what actually
needs to happen at the physical layer, and
then drives execution, drives the IO,
drives the computation to compute some
result and give it back to the client.
Super high level.
Nicolay Gerold: And in your case
specifically, how does it work?
A specific query is coming in, what's
happening behind the scenes to actually go
from the query to the final search result?
Marek Galovic: Yeah, so the API we
expose is very dataframe-like.
So we provide an API
that's a bunch of stages that
describe what needs to happen.
And really, what happens is that we
built a distributed query
engine called Reactor internally.
And then what happens is the
routing node or the client-facing
node gets the public description
of what needs to happen, translates it
to a logical plan, which is a sort of
internal description of all the operators.
Then there's some optimization
of the logical plan, right?
So we want to push down
predicates to all the scans.
We want to push down projections to only
read the columns that we care about.
That happens still on the router node.
And then, since we have big data sets,
we distribute the parts of the
data set to different executor nodes
and send this plan to the executor
nodes. On the executor nodes,
since the model is more document
oriented, different files can
have different schemas, because we don't
require you to specify a schema upfront.
And so then the logical
description gets compiled into a
specific plan for each specific
file on the executor nodes.
And then we drive the execution,
which really dispatches
operators onto the Tokio runtime.
We compute partial results, right?
So every node does
its own part of the work.
Then we send the result back to the router
node, which also pipes those results
through the same execution engine, so
it's really Reactor running on both, and
then returns the result to the client.
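A rough sketch of that flow in Rust; all types and names here are illustrative assumptions, not TopK's internal Reactor API:

```rust
use futures::future::try_join_all;

// Simplified logical plan the router optimizes and ships to executors.
#[derive(Clone)]
enum LogicalPlan {
    Scan { columns: Vec<String> },                          // projection pushdown
    Filter { predicate: String, input: Box<LogicalPlan> },  // predicate pushdown
    TopK { k: usize, score: String, input: Box<LogicalPlan> },
}

struct PartialResult {
    rows: Vec<(String, f32)>, // (document id, score)
}

struct Executor; // stands in for a connection to a worker node

impl Executor {
    async fn run(&self, _plan: LogicalPlan, _files: Vec<String>) -> anyhow::Result<PartialResult> {
        // In the real system this compiles the plan per file (schemas can
        // differ per file) and drives the operators on the Tokio runtime.
        Ok(PartialResult { rows: vec![] })
    }
}

async fn execute_on_router(
    plan: LogicalPlan,
    files: Vec<String>,
    executors: Vec<Executor>,
) -> anyhow::Result<Vec<(String, f32)>> {
    // 1. Assign files to executor nodes (naive even split for illustration).
    let per_node = files.len() / executors.len().max(1) + 1;
    let assignments = files.chunks(per_node);

    // 2. Fan the optimized plan plus file lists out to the executors.
    let futures = executors
        .iter()
        .zip(assignments)
        .map(|(exec, chunk)| exec.run(plan.clone(), chunk.to_vec()));
    let partials: Vec<PartialResult> = try_join_all(futures).await?;

    // 3. Merge the partial results on the router and keep the global top-k.
    let mut merged: Vec<(String, f32)> =
        partials.into_iter().flat_map(|p| p.rows).collect();
    merged.sort_by(|a, b| b.1.total_cmp(&a.1));
    merged.truncate(10);
    Ok(merged)
}
```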
Nicolay Gerold: And within that, what are
the morsels, the small pieces of work that
you're doing? Can you give an example?
Marek Galovic: Yeah.
So for example, if you are doing a
vector search, one of the morsels is:
you have a block of data, which is
a part of a file, and then you want to
compute cosine similarity
between the vectors or projections
in that block of data and your query.
And so that morsel is really: give me
the cosine similarity on 10K rows.
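As a rough illustration, such a morsel could boil down to a kernel like this (a plain scalar version; the names and row-major layout are assumptions, not TopK's code):

```rust
/// Score one morsel: cosine similarity between a query vector and a
/// contiguous block of row-major vectors (e.g. ~10K rows of one file).
fn cosine_morsel(query: &[f32], block: &[f32], dim: usize) -> Vec<f32> {
    assert_eq!(query.len(), dim);
    assert_eq!(block.len() % dim, 0);
    let q_norm = query.iter().map(|x| x * x).sum::<f32>().sqrt();

    block
        .chunks_exact(dim)
        .map(|row| {
            let dot: f32 = row.iter().zip(query).map(|(a, b)| a * b).sum();
            let r_norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
            dot / (r_norm * q_norm + f32::EPSILON) // avoid divide-by-zero
        })
        .collect()
}
```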
Nicolay Gerold: Yeah, and you mentioned
in our previous conversation that you
have created your own custom file format.
What makes it better suited for
search use cases than, for example,
Arrow, Parquet, and what
we currently have, like Lance?
Marek Galovic: Yeah, so we built the
database to be cloud native
and to really use object storage as
the primary storage medium.
And object storage has very
different characteristics compared
to a locally attached disk,
whether that's a spinning disk or an SSD.
The way that shows up is that your
request latencies are
100 to 200 milliseconds P99.
And then you also
pay per request, right?
So you don't want to do too many small
requests because that's super expensive.
So you want to take that into account,
take the latency into account,
and then find optimal request
sizes to maximize
throughput and minimize cost.
And really the issue with an
off-the-shelf format, take Parquet
as the default, is that Parquet
really couples the IO size and IO
execution with the statistics, right?
In Parquet you have row
groups, then within row groups you
have column chunks, then within
those you have pages, and the page
really is the unit of IO.
The issue with using that for search
is that, basically, if you have
a bunch of columns, and one
column is float32, right?
Or u32, some small value.
And another column is, let's
say, a 768-dimensional vector.
If you read the
whole column, it's going to be
like 40 kilobytes for 10K rows,
versus the vector column, which
is going to be a bunch of megabytes.
And since you're issuing
IO at that granularity, right?
You will do a very small request
for the scalar field and then
a huge IO for the vector field.
And that's an issue.
You want to have very small pages
because you want to have statistics
that are very granular, to do the
pruning very efficiently.
But for the case of the u32 field, you
also want to maximize the request size,
because you want
to avoid doing very small requests.
And so coupling the
statistics granularity to the IO
granularity is an issue for
doing that on top of blob storage.
The other issue is that Parquet was
really built for spinning disks.
And so if you read a Parquet file,
you have a lot of serial dependencies.
Like, I need to read the footer.
Then the footer gives
me a pointer to the schema.
The schema gives me a bunch of
block indexes, and then I can read
the actual data. Those are serial
dependencies, where each of those is 70
milliseconds to actually get the data.
For object storage, you want to
do a lot of concurrent IO, and
we built a data format to really
have wide IO trees, where we
can read some small amount of data
and then issue a huge amount of concurrent
IO to actually pull the data very quickly.
So those are the two biggest differences.
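A toy sketch of the difference, assuming an object-store client with a `get_range(path, offset, len)` call; the client, layout, and sizes are hypothetical:

```rust
use futures::future::try_join_all;

// Hypothetical object-store client; only the call shape matters here.
struct Store;
impl Store {
    async fn get_range(&self, _path: &str, _offset: u64, _len: u64) -> anyhow::Result<Vec<u8>> {
        Ok(vec![]) // stub
    }
}

struct BlockRef {
    offset: u64,
    len: u64,
}

fn decode_block_index(_bytes: &[u8]) -> Vec<BlockRef> {
    vec![] // stub: parse block offsets out of the root/footer bytes
}

// Parquet-style serial chain: footer -> metadata -> data blocks.
// Every `.await` on the critical path is a full object-store round trip.
async fn read_serial(store: &Store, path: &str) -> anyhow::Result<Vec<Vec<u8>>> {
    let footer = store.get_range(path, 0, 64 * 1024).await?; // ~100-200 ms
    let blocks = decode_block_index(&footer);
    let mut out = Vec::new();
    for b in &blocks {
        out.push(store.get_range(path, b.offset, b.len).await?); // one round trip each
    }
    Ok(out)
}

// Wide IO tree: one small read to learn the layout, then fan out all
// remaining reads concurrently, so total latency is ~2 round trips.
async fn read_wide(store: &Store, path: &str) -> anyhow::Result<Vec<Vec<u8>>> {
    let root = store.get_range(path, 0, 64 * 1024).await?;
    let blocks = decode_block_index(&root);
    try_join_all(blocks.iter().map(|b| store.get_range(path, b.offset, b.len))).await
}
```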
You also mentioned building a
custom file format for search.
We actually use Arrow as our data
layout, the primary data layout, and
also the in-memory layout for execution,
although we extended Arrow with custom
layouts to basically support types like
dense matrices, sparse matrices, and
bit-packed posting lists, to optimize for
the use cases that we really care about.
If you co-design the layout with
the execution, you can basically
do stuff like execute the kernels
on top of compressed data.
So you don't really need to
decompress, for example, posting
lists; you can do the actual compute
on the compressed representation.
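As a simplified illustration of computing directly on a compressed representation (here a plain word-aligned bitmap, not TopK's actual bit-packed posting lists):

```rust
/// Posting list stored as a bitmap: bit i set means document i contains the term.
struct BitmapPostings {
    words: Vec<u64>,
}

impl BitmapPostings {
    /// AND two posting lists directly on the packed words; no decoding
    /// of doc IDs into a Vec<u32> is needed.
    fn intersect(&self, other: &Self) -> BitmapPostings {
        let words = self
            .words
            .iter()
            .zip(&other.words)
            .map(|(a, b)| a & b)
            .collect();
        BitmapPostings { words }
    }

    /// Count matching documents straight from the packed representation.
    fn cardinality(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}
```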
Nicolay Gerold: Yeah, and in search,
you usually pre-compute a lot of stuff.
So you pre-compute all the indexes, all
the statistics of the different documents.
What things are you pre-computing,
and where are you storing it?
Are you storing it with the row groups,
with the files, with the column groups?
Marek Galovic: Yeah.
So we do pre-compute more efficient
representations of vectors, for
example, to make the scans much faster.
We pre-compute inverted indexes for
keywords and keyword search.
And, it's not really precomputing,
but for example, if you store float32
or you store string values, right?
We do dictionary compression to actually
make that much smaller and make the
predicate evaluation much faster.
So this is the sort of
stuff we tend to precompute.
In terms of where we store it, it depends.
Having control over the file
structure allows us to basically store
the columnar data as columns, logically,
and then attach external indexes. For
example, the inverted index for keywords
is attached within the same file, but
it's separate from the column.
And the query execution then
allows us to basically take the columnar
data, take the external indexes, and
basically intersect them and execute
the query on both in the same pass.
Nicolay Gerold: Nice.
That's really interesting.
And how do you actually
handle distributed queries?
So you have multiple files, stored
probably in different locations.
How do you distribute it and make sure
that you actually have the correct
statistics across all the different files,
and also make sure that you can find
all of the different search results?
Marek Galovic: Yeah, so maybe
to answer this question, it would be
helpful to go over the high-level
architecture of how the system looks.
We follow the approach that
the database is the log.
So when we get a write, we
append it to a write-ahead log.
And once we acknowledge that
write, you can be
sure the data is durably persisted.
And then on top of that write-ahead
log, there is an indexing process that
tails the log and builds a more read-
optimized representation of the data.
And those are the index files.
And basically the description
of what all the index
files are is a manifest, where we store
pointers to every file in the index.
And the manifest really is a
sort of consistent snapshot
of the system at a point in time.
So if an indexing process fails,
we can just load the most recent
manifest and pick up where we left off.
So that's the
guarantee of consistency.
We always have a consistent
snapshot of the index.
And then in terms of how
we execute distributed queries,
the router node loads this
manifest, which is the description,
figures out what all the files are.
Maybe it does some high-level
pruning to exclude the data that's
not relevant for the query.
And then it takes the files that are
relevant and distributes them
over a set of executor nodes.
And then the Reactor framework
that I mentioned takes over
and runs the execution on top
of the executor nodes.
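A minimal sketch of the log-plus-manifest idea; the structures and field names are assumptions for illustration, not TopK's format:

```rust
use std::fs::File;
use std::io::Write;

/// Append-only write-ahead log: every write is durably sequenced here first.
struct Wal {
    file: File,
    next_seq: u64,
}

impl Wal {
    fn append(&mut self, record: &[u8]) -> std::io::Result<u64> {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.file.write_all(&seq.to_le_bytes())?;
        self.file.write_all(&(record.len() as u32).to_le_bytes())?;
        self.file.write_all(record)?;
        self.file.sync_all()?; // only now is the write acknowledged as durable
        Ok(seq)
    }
}

/// Consistent snapshot of the read-optimized index: pointers to immutable
/// index files plus the log position they cover.
struct Manifest {
    version: u64,
    index_files: Vec<String>, // e.g. object-store keys
    wal_position: u64,        // everything up to here is indexed
}

/// Recovery after an indexing-process crash: load the latest manifest and
/// resume tailing the log from the recorded position.
fn recover(latest_manifest: &Manifest) -> u64 {
    latest_manifest.wal_position
}
```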
Nicolay Gerold: Yep.
And do you actually redistribute the data?
So for example, if certain pieces of
information are often retrieved together,
do you actually redistribute the data
so that similar data that's retrieved
together is actually co-occurring in
the same files and same locations?
Marek Galovic: Not really, we
try to minimize the assumptions on
access patterns, which is an
issue with vector databases, right?
Because they assume the locality
is there in the vector space,
but your queries are usually
all over the vector space, not
necessarily coalesced together.
So we remove that assumption.
In terms of how data is
organized, the storage
engine really is an LSM tree.
So the data is sorted by the ID.
And one access pattern that's very
common is that people want to store
multi-tenant data in the same index.
And the way to do that
very efficiently with our
system is you basically prefix
the ID with the tenant ID.
And then, once we sort by the
user-provided ID, all of your tenant
data will be coalesced together, right?
And so then you can issue a query that's:
hey, give me all the data I want,
subject to the ID starting with this prefix.
And then we can prune all the files
that are not relevant and basically
just get the file that contains
the data that you really care about.
So in that sense, it is
coalesced, but we don't regroup the
data based on the query patterns.
Yeah.
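A simplified sketch of that kind of prefix-based file pruning over sorted IDs; the per-file min/max-key metadata shown here is an assumption:

```rust
/// Per-file key range, e.g. stored in the manifest for pruning.
struct FileMeta {
    path: String,
    min_id: String,
    max_id: String,
}

/// Keep only files whose sorted ID range can contain keys with the given
/// tenant prefix, e.g. "tenant_42/". Everything else is skipped entirely.
fn prune_by_prefix<'a>(files: &'a [FileMeta], prefix: &str) -> Vec<&'a FileMeta> {
    let upper = upper_bound(prefix);
    files
        .iter()
        .filter(|f| {
            // A file overlaps the prefix range if it is not entirely before
            // or entirely after all IDs starting with `prefix`.
            f.max_id.as_str() >= prefix && f.min_id <= upper
        })
        .collect()
}

/// A string greater than every key with this prefix
/// (simplified: append a high sentinel character).
fn upper_bound(prefix: &str) -> String {
    format!("{prefix}\u{10FFFF}")
}
```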
Nicolay Gerold: You also mentioned
you're using a cache on EC2.
I'm really curious what's
the caching strategy?
Like, how is it determined?
What stays hot?
What's evicted?
And how do you actually handle
also the cache consistency?
Marek Galovic: So specifically
what we cache, and where, is part
of the secret sauce that gives us
good latencies and low cost.
But on a high level, in terms
of eviction strategy, it is just
LRU. That works very well because
you're basically caching blocks and
you want to prioritize the data
that's most recently accessed.
We have the option to pin
customers in the cache and
provide dedicated capacity,
but for the shared cluster it's
LRU. How the caching works specifically:
it's two tiers, where every executor
node caches both in memory and on disk,
and then blob storage is
the sort of durable layer.
The granularity of what we cache is
that the data format has a
bunch of buffers, and on disk we
cache buffers, which could be
multiple columns or
multiple unrelated pieces of
data. At write time,
we try to coalesce relevant
data together into the buffers.
And then in memory we cache at a
smaller granularity, which helps us
prune a bunch of data
very quickly, execute the first
layer, and then minimize the amount
of data we have to read from disk.
That's how that works, on a high level.
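A bare-bones sketch of a two-tier (memory plus local disk) read path with LRU eviction, using the lru crate; the tiering policy and key shape are assumptions, not TopK's:

```rust
use std::num::NonZeroUsize;
use std::path::PathBuf;
use lru::LruCache;

type BlockId = (String, u64); // (file, buffer offset) — illustrative key

struct TwoTierCache {
    memory: LruCache<BlockId, Vec<u8>>, // small, hot granularity
    disk: LruCache<BlockId, PathBuf>,   // larger buffers spilled to local disk
}

impl TwoTierCache {
    fn new(mem_entries: usize, disk_entries: usize) -> Self {
        Self {
            memory: LruCache::new(NonZeroUsize::new(mem_entries).unwrap()),
            disk: LruCache::new(NonZeroUsize::new(disk_entries).unwrap()),
        }
    }

    /// Memory first, then local disk; a miss falls back to object storage.
    fn get(&mut self, id: &BlockId) -> Option<Vec<u8>> {
        if let Some(bytes) = self.memory.get(id) {
            return Some(bytes.clone());
        }
        if let Some(path) = self.disk.get(id) {
            let bytes = std::fs::read(path).ok()?;
            // Promote to the in-memory tier on a disk hit.
            self.memory.put(id.clone(), bytes.clone());
            return Some(bytes);
        }
        None // caller fetches from blob storage and calls `insert`
    }

    fn insert(&mut self, id: BlockId, bytes: Vec<u8>, spill_path: PathBuf) {
        std::fs::write(&spill_path, &bytes).ok();
        self.disk.put(id.clone(), spill_path);
        self.memory.put(id, bytes);
    }
}
```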
Nicolay Gerold: Yeah, and I
think most vector databases
actually run two retrievals.
So on the new data, which isn't indexed
yet, basically a brute-force vector
search is run, and on the older data,
you run your HNSW index or whatever.
How do you approach the same problem
of eventual consistency?
In the end, you can only run the retrieval
when it's actually indexed, but also
for the different types of search
you support, because you have
BM25, probably the classical term search,
but you also have the vector search.
Marek Galovic: So in terms of
consistency models, we support three
consistency models,
which is really two and one in between.
We do support strong consistency.
So if you write the
data, any client in the system can
actually read that data if it says:
I want a strongly consistent read.
It's more expensive because we
have to check what's really the
latest data against the source
of truth, but we can do that.
In the default mode, we propagate
the write within a second, right?
So if you write your data,
that write will be visible.
The max delay there is like
600 milliseconds, so it will be
pretty much there right away.
And then that still needs to query the
data that's not really read-optimized yet.
And then for high-performance
and large-scale use cases, which
don't necessarily care about
data latency that much, you usually
just have some background job that
writes the data into the index.
And so if you want to get the best
performance, you can provide indexed mode,
which skips the most recent
data and only reads the
read-optimized representation.
That's what gives you the best
performance, but the delay there
is like a couple of minutes.
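The trade-off could be sketched roughly like this; the mode names follow the conversation, but the read-path logic is an illustrative assumption:

```rust
/// The three modes described above, ordered from freshest to fastest.
enum Consistency {
    Strong,  // check the source of truth; always sees the latest write
    Default, // bounded staleness; writes visible within ~1 second
    Indexed, // read-optimized files only; can lag by minutes
}

enum ReadPlan {
    WalAndIndex { verify_against_log: bool },
    IndexOnly,
}

fn plan_read(mode: Consistency) -> ReadPlan {
    match mode {
        // Strong: scan indexed files plus the unindexed tail of the log,
        // and confirm the latest position against the source of truth.
        Consistency::Strong => ReadPlan::WalAndIndex { verify_against_log: true },
        // Default: include recently written (not yet read-optimized) data,
        // but skip the extra round trip to the source of truth.
        Consistency::Default => ReadPlan::WalAndIndex { verify_against_log: false },
        // Indexed: fastest; only the read-optimized representation is scanned.
        Consistency::Indexed => ReadPlan::IndexOnly,
    }
}
```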
Nicolay Gerold: Yeah.
What are the different use cases you have
seen for the different consistency modes?
Like, when is
strong consistency actually necessary?
Marek Galovic: Yeah, for large-scale
use cases that are not interactive,
the indexed mode is the best choice,
because you get the best performance, and
you can load the data in an async job.
That's not user-interactive,
so it's not a problem there.
The use cases for the default mode that
we provide and then the strong consistency
are really interactive applications.
Imagine you're in a Jupyter
notebook developing your search.
Really, you don't want to wait 10
minutes after you write your data
until you can test your query.
The default mode is really the best suited
there, where you have a bunch of data
and then, when you go run the query,
it's there.
And I think the default and strong
consistency modes are really becoming
super useful for agentic applications
that are interactive with the user.
You want to provide memory for your
agent or whatever, and then you don't
want the agent to not remember for 10
minutes until the memory shows up.
So if you are interacting with
the agent, you want the
results right away. Doing this with
either the default mode, if you
are fine with a small delay, or with
a strongly consistent read is
super important, because you're
guaranteed to get the most recent
write reflected in the query.
Nicolay Gerold: Yeah, I think one
often underappreciated aspect
of Elastic and the other search
databases is the analytics.
Have you already implemented
a custom data format for the feedback
data and the analytical data as well?
Marek Galovic: We don't have a
custom format for any feedback yet.
But the data format that
we've built is suited for analytics.
Also, the query engine right
now we optimize for search, but
it's basically a matter of
adding more operators to support
stuff like aggregations, to support
metrics, to support stuff like that.
Because really, in the end, it's
an execution engine
on top of columnar data, right?
We optimize for something, but
basically you can add operators
to support other types of queries.
One side use case of building a very good
filtering engine and general query engine
is that log search, and basically
security and observability, is
a sort of special case of search.
And we can also support
use cases like that very efficiently.
Nicolay Gerold: And can you actually
fall back on DataFusion a little
bit more for the analytical queries?
Marek Galovic: Not really, the
execution engine is a custom engine.
We looked at DataFusion early
on, and basically what we realized
is that DataFusion is very
good for analytical queries.
That's what it's been built for.
And a lot of new
databases actually use DataFusion
as their execution engine.
The reason why it doesn't really work
for search is that DataFusion
doesn't support external indexes natively.
The way you implement indexes
in DataFusion is that you basically
build one side of a join and then you
join your data with the index, right?
And building that one side
of the join is pretty expensive.
Whereas if you have control over the
data layout and execution, you can
basically do a sorted intersection
of both and then execute the index
scan and data scan much more efficiently.
That's the primary reason
why DataFusion doesn't work.
The other reason is, what I
found is that filtering in DataFusion
is really rebuilding
the batches, right?
So if you filter, you basically copy
the data into new batches and then
continue execution on top of that.
Versus what we can do is essentially
execute filters as logical operations,
where, if we know that the data is
already cached in memory, there
is no point in copying the actual data.
You can simply create a new selection
mask or bit mask on top of that data
and then continue execution using that.
Yeah, on a high level,
you basically get three to
four times better performance
by not using DataFusion.
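A small sketch of the selection-mask idea versus copying batches; plain Vecs stand in for Arrow arrays here:

```rust
/// Copying approach: filtering materializes a new, smaller batch.
fn filter_by_copy(values: &[f32], keep: impl Fn(f32) -> bool) -> Vec<f32> {
    values.iter().copied().filter(|v| keep(*v)).collect()
}

/// Selection-mask approach: the data stays where it is (e.g. already cached
/// in memory); the filter only produces a mask that downstream operators
/// consult while scanning the original buffer.
fn filter_by_mask(values: &[f32], keep: impl Fn(f32) -> bool) -> Vec<bool> {
    values.iter().map(|v| keep(*v)).collect()
}

/// Downstream operator: sums only selected rows, reading the original data.
fn sum_selected(values: &[f32], mask: &[bool]) -> f32 {
    values
        .iter()
        .zip(mask)
        .filter(|(_, selected)| **selected)
        .map(|(v, _)| v)
        .sum()
}
```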
Nicolay Gerold: Yeah, and you mentioned
also in our previous conversation
that you have built some custom Arrow
extensions for dense and sparse matrices.
What do they actually enable you to do?
Marek Galovic: So one huge unlock
there is that in vector databases,
mostly, vectors cannot be null: you
have your index and you have to
provide the primary key and a
vector for the index.
Since our model is more document
oriented, you're going to have
a bunch of vector fields that are
really just treated as a data type.
And those can be null; we don't require
you to always provide a value for a field.
And so one big unlock of having
the custom data layout is that,
if you want to represent vectors, we
can represent nulls very efficiently
without actually storing dummy values,
and still be able to do point
reads for those rows where the
value may or may not be null.
The compression is in
the way you represent that, the
way you figure out where the value
actually is in terms of byte offset.
Arrow stores an
offset list, which is u32.
The way we figured out how
to do that is basically
20 to 25 times smaller.
What that really means is, for a
file of 10 million rows, right?
We are not reading 40
megabytes of dummy values;
you can basically read
100 kilobytes and do the same.
There is no overhead to that.
Similarly for sparse matrices.
And I also mentioned
bit-packed arrays: Arrow
doesn't natively support
bit-packed arrays.
It also doesn't support sparse
matrices to basically optimize
for sparse matrix-vector products.
And so we built custom types to compute
those types of scores very efficiently.
Nicolay Gerold: And what
other advantages do you actually
get from the Arrow format?
Does it actually help you with
the distributed system as well?
Marek Galovic: It doesn't necessarily
help with the distributed system.
Even if you look at DataFusion,
a lot of the value of DataFusion
is that it basically uses Arrow
and Arrow compute kernels to do
the undifferentiated stuff, right?
So, filter floats: every
database needs to filter floats.
Every database needs
to filter strings.
That's not a differentiation you need
to care about and spend
engineering time writing kernels for.
So we get a lot of that by
simply using Arrow and
skipping the DataFusion part.
And that's actually what we get.
For the custom types,
we write custom kernels.
We optimize them for
different architectures.
That's the sort of
meat of the database, versus filtering
floats, filtering strings,
doing undifferentiated ops, which
we can just offload to Arrow
and use their compute kernels.
And that's a sort of
huge time saving.
Nicolay Gerold: The manifest files,
maybe to double click on that: I have
mostly worked with Delta and Iceberg,
which basically have JSON files.
Partly why they do this is because on S3
there is a limit on
how many files you can list.
So you usually have manifest files
so you can list all of them at once.
How do your manifest files actually differ
from, for example, Iceberg and Delta?
Marek Galovic: Yeah.
On a high level it's very similar,
because if you're building a database
on object storage, you tend to
make similar choices in some regards.
The difference there is that, as I
mentioned, we are document oriented,
not necessarily table oriented.
So we don't really enforce schema or
store schema in the manifest files.
And we have just one manifest
file for the whole index.
The huge unlock
there is that Iceberg, as you
mentioned, uses JSON files.
Yeah.
What we've built is basically, we use
FlatBuffers as our format for
describing what exists, and what that
allows us to do is basically just
read the data and interpret it, right?
So we don't have to decode JSON, we don't
have to cache it in a special way.
We simply load the bytes and then use
FlatBuffers to know what's in the index.
That's a huge saving, especially for large
indexes, because if you do JSON,
you can end up spending 600
milliseconds just decoding the manifest.
That's pretty much all of your
query budget.
So by not spending 600 milliseconds,
but only paying the IO once,
and then just using the data, we
can basically minimize the
worst-case latency that happens.
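A rough sketch of that contrast: a toy zero-copy view read straight out of the bytes versus decoding a JSON document up front. The fixed layout here is purely illustrative; FlatBuffers generalizes this idea with schemas and generated accessors.

```rust
/// Toy zero-copy "manifest": fields are read lazily by offset from the
/// loaded bytes, with no decode step.
struct ManifestView<'a> {
    bytes: &'a [u8],
}

impl<'a> ManifestView<'a> {
    fn new(bytes: &'a [u8]) -> Option<Self> {
        (bytes.len() >= 16).then_some(Self { bytes })
    }

    fn version(&self) -> u64 {
        u64::from_le_bytes(self.bytes[0..8].try_into().unwrap())
    }

    fn num_files(&self) -> u64 {
        u64::from_le_bytes(self.bytes[8..16].try_into().unwrap())
    }
}

/// JSON manifest: the whole document must be parsed and allocated before
/// any field can be read — that's where the decoding cost comes from.
fn load_json_manifest(bytes: &[u8]) -> serde_json::Result<serde_json::Value> {
    serde_json::from_slice(bytes)
}
```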
Nicolay Gerold: Can you also enforce
any ACID properties with your execution
engine and with the file schema?
Marek Galovic: Yeah, so as I mentioned,
the writes are atomic once
we commit them to the write-ahead log.
And the way we built the write-ahead log
is that it basically provides sequencing.
So at the time
we commit the write to the log,
we know a version for that write,
and basically the write
with the most recent
version is the one that will win.
And then the compaction and
indexing part consumes the log
and uses those versions to enforce
what the newest version is, right?
So at the time we commit a new version
of the manifest, or publish a new
version of the index, that's a
consistent snapshot of the database
that moves
the state from one
version to a new version.
And there is no way to see
partial results in your queries.
Nicolay Gerold: Yeah,
is the,
no, I lost my question.
Does it also give you the capability
to do basically a history?
So I can go versions back in
time and run my queries on all
versions of the search database.
Marek Galovic: In principle, we
could do that, because
essentially you can imagine every
manifest being a snapshot, right?
We don't expose it right now,
but we could expose all the
snapshots that we created and
maybe keep the last day of history
in the manifest, and then
you could choose: okay, I want
this snapshot of the database.
And then, yeah, given it's a consistent
snapshot, you could just run a
query against that snapshot and you
would get that version of the data.
Nicolay Gerold: That's really cool.
And what I'm really curious about is,
in terms of kernels, I think Rust
doesn't have all the decades of
optimizations which went into C and
C++. Have you done any benchmarks
of the performance impact this has?
Because in AI this is all the
talk: why are we using Rust when
we have decades of
optimizations in C++? Have
you tried to quantify that stuff?
Marek Galovic: Yes.
So in terms of kernels, we hand-wrote
SIMD kernels for the
operators that we care about, right?
For example, how do we intersect
a bit-packed list for text search?
How do we very efficiently compute
distances for vectors?
For all of that, we've written
custom kernels for both x86 and ARM.
That allows us to
basically run on the cloud, where
ARM nodes are much cheaper and
usually give you better performance.
So you can pick the architecture
that you want to run on, and then,
yeah, we have kernels compiled
for that specific architecture.
Nicolay Gerold: Yeah.
And do you think Rust is
faster or C is faster?
Marek Galovic: It depends on the use
case. All of them dispatch to LLVM, so
ultimately it depends on the compiler.
You can get the same performance
as C or C++ by basically doing what
we did: you wrap
unsafe code in a safe interface,
and in the unsafe part do whatever you need.
So what we did is hand-written intrinsics,
which are the instructions
for specific architectures.
And that's basically
as good as you can do.
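A minimal sketch (x86_64 only) of that "unsafe SIMD behind a safe API" pattern; the kernel shown is a generic dot product, not TopK's actual code:

```rust
#[cfg(target_arch = "x86_64")]
mod simd {
    use std::arch::x86_64::*;

    /// Safe entry point: callers never see intrinsics or `unsafe`.
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: guarded by the runtime CPU-feature check above.
            unsafe { dot_avx2(a, b) }
        } else {
            a.iter().zip(b).map(|(x, y)| x * y).sum()
        }
    }

    #[target_feature(enable = "avx2,fma")]
    unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
        let mut acc = _mm256_setzero_ps();
        let chunks = a.len() / 8;
        for i in 0..chunks {
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            acc = _mm256_fmadd_ps(va, vb, acc);
        }
        // Horizontal sum of the 8 lanes, plus the scalar tail.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        let mut sum: f32 = lanes.iter().sum();
        for i in chunks * 8..a.len() {
            sum += a[i] * b[i];
        }
        sum
    }
}
```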
Nicolay Gerold: And in terms of
scale, what is the largest data set
you have tested TopK with? Like,
how far did you try to push it?
Marek Galovic: So far, for a single
namespace, like a single data set,
we pushed it to a hundred million.
And we've been doing that
for three weeks at production scale,
with the latencies there, at
the 100 million scale, being
around 200 milliseconds P99.
For smaller indexes, which is
most of the tenants in the system,
that would be like 1 to 10 million,
and there the latencies are
around 50 milliseconds P99.
Nicolay Gerold: Yeah.
And I think Milvus is at the moment
doing a big push into doing GPU
acceleration for vector search as
well, but also for the indexing.
Is this something you're already
exploring or that's like on
the horizon for the future?
Marek Galovic: I mean, we looked at that.
It really depends on the
use case you're optimizing for.
GPUs are quite expensive, and
if the operating point that you
want to get to is on the order of hundreds
of QPS with a reasonable latency,
so let's say 50 to 100 milliseconds,
you don't necessarily need GPUs for that.
And you can get very good
performance there for pretty low cost.
Same goes for indexing.
It depends on the technology, right?
And the sort of algorithms you use.
But so far, we did a bunch of
cost estimations and we haven't
seen the value for GPUs.
But I guess it depends
on the use case, right?
Because if you have a training
algorithm, like a recommendation
system, that needs to do a huge amount
of QPS and needs to do that at one millisecond
latency, then doing HNSW stuff, doing
graph-based algorithms, or doing GPU-
accelerated compute makes sense there.
But the truth is, those
types of use cases are like 1
percent of all the use cases.
Nicolay Gerold: Yeah.
What is the use case you've
seen, or what would you say
is the use case that
is like the perfect one for the
search database you're building?
Marek Galovic: I would say, right now
the use cases we target are really the
ones where getting super relevant context
or results is critical, and that would
be financial, healthcare, legal use
cases. And the primary reason for that
is that you want to be able to do
different types of retrieval in
the same query, to basically say: I
want to get semantically similar
results that also satisfy
text filters and metadata predicates,
and then apply custom scoring on that
using different vector scores and
maybe a BM25 score in the same query.
And really, most of the existing search
systems, the way they do it, is they have
separate indexes for maybe different
vector fields, then you have a separate
index for text, you get partial results,
then you do something like reciprocal
rank fusion on top of the partial
results, and then return that to the client.
I think that's the way hybrid search
is done with the current technologies.
The engine that we built and
the system that we built allow you
to do all of that in the same pass.
So you can basically do multi-vector
search with BM25 scoring and then
apply a custom expression to
actually compute the top-k
and then score based on that.
And really, what that allows us to do
is, for example, one use case we've
seen is a medical search.
And what they do is they retrieve
passages from journal
articles, and then overfetch that
and feed it to a re-ranker.
The reason why they overfetch is that
they want to boost some articles by
their score for how good the journal is.
And really, there is no way to do that
in the vector database that they're using.
So they overfetch results, then re-sort
based on this importance field,
and then feed that to the re-ranker.
That's not necessarily the
best you can do, right?
Because you can imagine some
articles being super important,
but not really having a high
relevance score in the vector system,
so they wouldn't pass the first
top-k, even if you overfetch.
And with us, what we can do is basically
move that custom scoring function
into the database itself, where the
candidate set that they get is
the best one based on their custom
scoring function, and then feed that
to the re-ranker and then to the user.
That's what we can unlock, to
give them better relevance out of the box.
Nicolay Gerold: Yeah.
Not just better relevance, but also,
often at the moment with vector databases,
if you have to do a lot of filtering
or post-filtering, a lot of
the times I've seen that the use cases
actually need you to extract information
after the search, and when you have to
deliver a fixed set of search results,
it's a real pain, because often
you end up having to run another
retrieval in which you have to exclude all
the documents which you already retrieved.
Marek Galovic: Yeah, two or three
years ago there
was a huge push: oh, we do pre-filtering,
we do post-filtering in vector DBs.
It should be said, most
vector DBs actually figured out how
to do some form of pre-filtering in
the database, so they can filter the
results, and it won't happen that
you get empty results even though
there are results in the database,
because they pre-filter before the top-k
operator, or before applying the limit.
The issue there is that usually
the distribution of vectors
and the distribution of filter
values are uncorrelated.
And so what happens is that you apply
your filter and then you have to
scan a large part of the index anyway,
because the filter selects points
basically at random in the vector space.
And that's a huge waste if you think
about it, because you spend all
this effort indexing the vectors, you
spend all this effort building an
IVF-style index or a graph-based index,
and then at query time you
basically end up scanning
most of the data anyway,
which is slow and wasteful.
I think that's a fundamental
issue with building a database around
a vector as a primary key,
or around a vector
index as the primary concept.
Nicolay Gerold: And what are the knobs
or the levers that
you expose to the user which allow
them to basically fine-tune or optimize
the search for their specific use case?
Like, on top of the standard: I'm
running a term search and I'm optimizing
it, I'm doing boosting of certain terms.
What are you exposing to the user to
optimize the search performance, but
also the quality of the search results?
Marek Galovic: Yeah, so actually,
maybe a sort of different
way of thinking about it:
we don't really expose knobs.
The query language we expose
allows you to write any arithmetic
expression you want for scoring, right?
So you don't have: hey, boost
this specific term by five.
It's really that you can write an
expression that gives you a numeric
output, and then score by that.
So you have flexibility almost like a
programming language: you write some
expression that gives you some number,
and then you can score by that, which is
very different compared to: yes, I have
some keyword filter, and then I specify
just a fixed set of boosting parameters,
which would be the case in Elastic.
And then, what do we
expose explicitly? You can, of course,
provide different vector types, so
you can do float32, you can do scalar
quantized, you can do binary quantized.
You'll be able to do
sparse vectors very soon.
So those are the
vector scoring functions.
You can compute BM25 scores,
with the parameters for BM25,
and you can tune that.
But what we really want is that, off the
shelf, it should be super simple
and super intuitive to build
production-level-scale search.
And then you can play with scoring, you
can play with stuff to push it further.
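Conceptually, an arbitrary arithmetic scoring expression is just an expression tree the engine evaluates per candidate. A toy version, not TopK's query language, could look like this:

```rust
use std::collections::HashMap;

/// Toy scoring expression: fields refer to per-document signals such as a
/// vector similarity, a BM25 score, or a stored numeric column.
enum ScoreExpr {
    Field(&'static str),
    Const(f32),
    Add(Box<ScoreExpr>, Box<ScoreExpr>),
    Mul(Box<ScoreExpr>, Box<ScoreExpr>),
}

fn eval(expr: &ScoreExpr, doc: &HashMap<&str, f32>) -> f32 {
    match expr {
        ScoreExpr::Field(name) => *doc.get(name).unwrap_or(&0.0),
        ScoreExpr::Const(c) => *c,
        ScoreExpr::Add(a, b) => eval(a, doc) + eval(b, doc),
        ScoreExpr::Mul(a, b) => eval(a, doc) * eval(b, doc),
    }
}

// Example mirroring the medical-search case above:
// 0.7 * vector_similarity + 0.3 * bm25 + journal_importance
fn example_score() -> ScoreExpr {
    use ScoreExpr::*;
    Add(
        Box::new(Add(
            Box::new(Mul(Box::new(Const(0.7)), Box::new(Field("vector_similarity")))),
            Box::new(Mul(Box::new(Const(0.3)), Box::new(Field("bm25")))),
        )),
        Box::new(Field("journal_importance")),
    )
}
```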
Nicolay Gerold: Do you want to integrate
learning to rank into this as well,
so that you can actually come up
with the expression
in a more automated way?
Marek Galovic: Yes.
Ultimately, if you look at how
production-level search systems work at
large companies, they usually have some
document transformation, right?
Today, that would be embeddings.
You can imagine more ways
of transforming documents.
You have a query understanding
module, which takes the
query and translates it to some
sort of domain-specific DSL;
that would be the query language.
And then on the output
you have ranking, and really the job of
ranking is to take a set of candidates and
re-sort them by some measure of relevance
that's very domain specific. And
today you can get re-rankers
from Cohere, from other companies, you
can get open source re-rankers, but
they're really trained on public data.
So that doesn't really mean
that this is the best relevance
you can get for your domain,
or for your specific application.
And so, ultimately, in the end, what
we want to do is provide all
of these pieces in the same API, with
the ability for applications
to provide feedback: this item was
clicked, this item was added to the cart.
That's a signal you can use to optimize
the ranking further, and you can
basically create application-specific
rankers inside the database itself.
And that's the place where we
want to get to in the end.
That's the sort of ultimate goal.
Nicolay Gerold: And I think
it's really interesting.
I think the learning to rank
part is something that's a little bit
overlooked: at really large search
organizations, it's done a lot,
but it's not really talked about.
And I think it's a really easy addition
to the search stack, because usually
it's some form of boosted
machine learning model, which is really
easy to deploy and optimize as well.
Marek Galovic: Yeah, it's a spectrum.
For simple models, there are linear
models that combine, I don't know,
similarity scores with BM25 and some
precomputed features you store.
That's a sort of linear
regression function.
You can express that already
in the query language.
It's an arithmetic expression
you can just write.
It will work.
For more complicated rankers that
are learned and use, I don't
know, LLMs, you still can do that.
The issue there is, if you want to
provide a good out-of-the-box experience,
you need to be able to improve the
model from a small amount of feedback.
And that's a core research challenge:
how do you do that where you
don't have much data, but you want to
have a strict improvement on the
baseline ranking model you provide out
of the box? Once we can solve that
very reliably and be sure that we can
provide a good experience, we will put
that in as a part of the query language.
Nicolay Gerold: Do you already have
any research directions you
want to explore for doing that?
Marek Galovic: Yeah, so the first thing
is really: how do you
improve off-the-shelf ranking models?
You want to have a good baseline
experience, so even if you don't provide
feedback, the re-ranking
part of the query language
should give you better results.
But then, how do you take that
and make it strictly better,
make it a Pareto improvement,
given a small amount of feedback?
So that's on the ranking side. On
the document transformation part,
I think there's really interesting
research and improvements you
can do on training hybrid
dense and sparse embedding models.
And basically, given that we can do both
types of retrieval in the same query,
we can actually learn the optimal
combination of the two to maximize the
relevance of the candidates that we get.
So that's the second part of research
we want to do, on the input side.
Nicolay Gerold: How do you think about,
because we are in a document model,
updates to fields tend to be pretty
easy, but updates to an entire column?
So basically, when I'm introducing
a new embedding model,
I fine-tune a new one and I want to
basically re-embed all of my documents.
How do you actually handle
these bulk or mass updates?
Especially also in light of: we
are in production and we still have
load running against our database.
Marek Galovic: Yeah.
So if you want to switch to a new
embedding model, one way to do that is to
basically create a new column, reinsert
the data with that new column, and then
switch your queries
from the old column to the new column.
That's one way to do it.
In general, this is a hard problem,
because if you build the database on
object storage, the files are immutable,
so the only way to really change
something about the files, or even a
specific column, is to rewrite the data.
You have to pretty much
reinsert the data yourself.
In the long run, we want to actually
manage the embeddings for you, where
the interface is really text,
not vectors.
Yes, I think that's a better experience.
Ultimately, you care about
searching using text.
You don't care about vectors;
that's just a technology
that we use.
What you care about is
the higher-level interface.
Once we do that, we will be
able to change and evolve the embedding
models, evolve all of that underneath
in the database, and you don't
have to necessarily worry about it.
Yeah,
Nicolay Gerold: Yeah.
I think part of why Elasticsearch
is still so popular is actually the
way of doing relevance engineering.
They have very good feedback
functions where you see the impact
for a specific query: how do the
different terms, how do the different
parts of my relevance algorithm,
my scoring algorithm, how do they
impact these sets of documents?
Do you provide a similar mechanism
for basically debugging or
analyzing the relevance and
performance of the search queries?
Marek Galovic: In the query language,
and I encourage people to check out
how the query language looks,
you can basically select different
parts of your scoring, right?
So you can create separate fields, right?
Vector, BM25, some internal fields.
And then do the final
expression in the top-k operator.
And then you can see what the
contribution of the individual parts
is in the final scoring, because you
will get the partial scores and
the final score as a part of that.
So you can see: okay, for this
document, the vector similarity is really
good, but the BM25 score is pretty shit.
For other documents, maybe
my BM25 is really good, but the
vector similarity is pretty low.
And then you can actually fine-tune
that and see what the contribution is.
Nicolay Gerold: Yeah, on the part of
the documents, because you mentioned
documents don't have to have the same
schema: when I'm running searches with
filters, what happens when the field
doesn't exist? Does the user have to
set whether to include it or exclude it?
Marek Galovic: Yeah, by default,
we evaluate only for fields
that have a value, which is I
guess the most intuitive way.
And you can provide an isNull operator
if you want to include: hey, this field
must match this value, or it can be null.
You can actually express
that as an expression.
Nicolay Gerold: Yeah.
What's next for you guys?
What are you building at the
moment that you can already tease?
Marek Galovic: Yeah, so we
have the storage layer out.
So you can already play with
that and experiment with it.
The next part is really adding
more of the end-to-end experience.
As I mentioned, really, production search
is document transformation, is
query understanding, is ranking.
And we want to build all of that in the
same sort of API, where today, if
you want to build a semantic search,
you have to pick an embedding provider,
create embeddings, store them in some
database, retrieve that, send it to a
re-ranker. That's unnecessary
plumbing and boilerplate. If
you look at any guide on how to build
semantic search, it's all the same
modulo the last sort of paragraph,
where you insert your own product.
I don't think that's a good experience.
What we really want to do is boil this
down to five lines of code, and you should
be able to use text really as an
interface and not necessarily care
about: okay, this is embedded somewhere,
this is re-ranked somewhere.
So that's the immediate
next step. In the long run,
as I mentioned, there's interesting research
on how do you do hybrid embeddings,
how do you do tailored,
domain-specific ranking,
how do you adapt rankers with a
small amount of feedback.
This is right now at a
research stage, but it's somewhere
we're definitely going to go.
Nicolay Gerold: Yeah.
What would you say is missing in search or
AI that you are actually not building on?
Marek Galovic: Yeah.
I don't know, one thing that's
very interesting is integrating
search with the LLMs
at a lower level, right?
So today the context is provided
as text, but you can imagine
injecting context in different ways.
Really, attention in LLMs,
or in transformers, is doing search over
what's stored in the model itself.
And so you can imagine injecting
context at the embedding level,
in the latent space, not
necessarily in the input space of text.
That's a research direction I think
is interesting from a purely research
point of view, but training LLMs and
doing that is not really in scope for us.
Nicolay Gerold: Yeah, and maybe even
underhyped versus overhyped: in
the AI space, what do you think is
underhyped and what do you think is overhyped?
Marek Galovic: Overhyped, I think,
are LLMs themselves. In the
long run, I think they just become
commodities, with different
providers converging to the same point.
This is hard to predict, right?
Because there can be a breakthrough
that unlocks a new
capability that no one foresaw.
For underhyped stuff, I think
verticalized AI, so building AI for
specific domains and specific use cases,
where I feel like there is a lot of
interesting work happening, like applying
AI to biomedical data, applying it
to materials science, where
training models really unlocks
speed of iteration.
Because if you really want to move
fast, you need to fail a lot.
And today, failing in the
biomedical domain and materials
science means building
the thing, which takes a long time.
So I feel like having good models there,
and being able to use those models to
run experiments on,
unlocks trying a lot of stuff in
parallel, trying it very quickly, and
really scoping down to the subspace that
really matters, and then going and doing
physical experiments just on that subspace.
And that's really nice, and I feel
like AlphaFold is the first example, the
first production-level example in that
space, but I think there are many
more coming in the coming years.
And it's super exciting.
It's going to be a
huge one for so many things.
Nicolay Gerold: I'm so
torn on the LLMs-are-overhyped question.
I think they're definitely overhyped,
but I think they're still underutilized.
Marek Galovic: Oh yeah, for sure.
They're super useful.
It's just that making them omnipresent
in society, in your workplace,
is going to take a bunch of years.
There's a lot of...
Nicolay Gerold: I think like we
are skimming the top off at the
moment with LLMs and I think we will
see like a lot more stuff at some
point, but it will take a lot of
time, especially in the enterprise.
Marek Galovic: Yeah.
Nicolay Gerold: Nice.
And where can people learn more about
you, about TopK and what you're building?
Marek Galovic: Our website, topk.io,
feel free to visit that.
We have a bunch of guides, we have docs.
You can sign up and try it out.
For myself, I'm on Twitter,
I'm on LinkedIn.
I go by my name, so
people can reach out there.
I'm happy to chat.
Nicolay Gerold: So what can we
take away from this that's relevant
for building AI applications?
I think, because it's a little bit
different to our usual episodes where
we share concrete advice, and we
went really deep into how they build
it and how they design their tools,
I would take away more
general patterns for building
data, search, and AI systems.
And one of them is basically
separating write and read paths.
Especially in databases, you
often have a trade-off: optimize for
writes or reads, but not really both.
And a log-based design really can give
you a little bit more
flexibility, and the option of
actually exposing it to users:
what are they optimizing for?
And they have basically
done this by giving the
user different consistency models,
so basically strong, default, and indexed.
And through that, the user
can choose what they're optimizing
for, whether it's reads or writes.
And I think thinking through the
trade-offs of your system, and often,
when you're in data and AI, it's reads
versus writes, thinking through
what are the knobs I can turn
to optimize for one or the other, and
what trade-off am I going for, or what
trade-off do I want to allow my user to
optimize for, is a really interesting one.
And we see this pattern across
a lot of different databases.
For example, in Kafka,
or data systems like Kafka, you have
events and they're written sequentially.
Also in CI/CD, I think you
can draw a parallel: you
commit code to a main branch,
and then from that you build
optimized artifacts for deployment.
Designing for your storage medium
is something that is also pretty
interesting. S3 and cloud storage
aren't just slower disks;
they have very different economics.
And depending on the cloud you're building
on, you have very different economics
or pricing structures, and taking that
into consideration as you build your
application is probably relevant,
because in most cases you probably
have external constraints which
determine what cloud you're building on.
And if not, you should
think about
how you use your storage medium and
then pick the cloud provider accordingly,
because mostly storage and egress and
ingress are the major cost drivers.
And in S3 and the cloud, you have
really expensive egress and ingress.
And when you make a lot of requests,
since you basically pay per request,
having a large number of
requests is really expensive.
So having something like a log, and
then writing after certain time
periods and basically combining or
batching the writes, really pays off,
because you can save a lot.
Also, for example, Cloudflare, I believe,
doesn't have any egress and ingress
fees, but they charge differently.
I'm not sure
about Cloudflare, but they probably
charge more based on the storage.
So I think, when you design from the
get-go for your specific storage medium,
you can optimize a lot in terms of costs.
Also, what they do is basically combine
multiple things into one which
usually run through different systems.
They combine vector
search and text search
into a single pass, whereas most
search databases see it as multiple
different types of search which
are then merged and re-ranked.
And this basically wastes
work and misses good results.
You don't really want to run separate
algorithms and combine the results later;
you want to build execution plans that
filter, score, and rank in a single pass.
I think this is a really
interesting pattern as well:
when you can combine it, you
probably should think about combining it.
Another really interesting thing they
have done is basically strategic caching.
Most just use something like Redis;
they have built their own solution
on EC2, I think.
And at some point when you're building an
application, every database needs a cache
in front of it, and since they have built
their own logic, they can do a little bit
more than just a regular cache, because you
can build something like two-tier caches.
So basically, you have memory for
really hot data, and then you have
local disk for warm data, and this
can reduce cloud storage costs while
also preserving the performance.
And you can also not just cache the raw
data, but also cache intermediate results,
which are expensive to recompute.
And by building your custom caching
solution, or a custom cache, which
takes a little bit more work, you can
find a caching strategy which
really aligns with your workload:
what's frequently accessed,
what's really expensive to rebuild,
and what data is accessed together.
And I think as we start to build more
and more with AI, and especially also
with agents which access our own
data, this will become more important,
because we don't really want to hit
our production database all the time.
Yeah, I think that's it for this week.
We will have one more
episode on search next week,
and after that we will be
moving on to a new season on MLOps.
So I'm excited for next week.
As always, if you liked the episode
and you're still listening,
subscribe, or leave me a comment.
Also, if you have any suggestions in terms
of guests who you would like to have on,
let me know in the comments or send me a
message on LinkedIn, Twitter, or Bluesky.
Otherwise, I will catch you next week.
See you soon.