A weekly podcast about all things PostgreSQL
Michael: Hello and welcome to Postgres
FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and today I am delighted
to be joined by Jonathan Katz,
Postgres Core Team member amongst
other things, Principal Product
Manager at AWS and contributor to
pgvector.
Welcome to the show, Jonathan.
Jonathan: Thank you, Michael.
Happy to be here.
I'm definitely a huge fan of your work.
I'm very excited for our conversation
today.
Michael: Kind of you to say; same to you.
So I reached out to you about this
topic because it's about time
we did an episode on pgvector.
How could we not have, given the last year?
But neither of us knew it well.
Neither Nikolay nor myself had used it until recently, though Nikolay has been using it lately.
So we wanted to make sure we covered it.
And who better to... well, Andrew Kane was unavailable, and you are definitely the next best person.
So we really appreciated all the
content you've been putting
out, the talks you've been giving
on this.
So yeah, thanks for joining.
Jonathan: Yeah, happy to be here.
I mean, and by the way, when you
say next best person, it's like
well below Andrew Kane.
Andrew has done phenomenal work
on pgvector over the past several
years.
Michael: Yeah, right.
And it has been.
It definitely predates the ChatGPT
hype, right?
It was a couple of years beforehand.
Do you want to give us a little
bit of an introduction as to
what it is and how you came to
become involved with it?
Jonathan: Yeah, so pgvector, at
the surface sounds very simple.
It's a Postgres extension that
allows you to store and search
vectors.
And I mean, I'll talk a little
bit about how I got involved in
it, but really to understand how
this all came to be, it actually
helps to look back in the history of
Postgres.
Postgres has actually always been
able to support vectors back
since the Berkeley days, and it
comes down to the array data
type.
So I mean, quite simply, a vector
is, you know, it's an array
with certain properties.
It has a certain amount of dimensionality.
There are certain constraints that each dimension must meet.
And there's all sorts of math around
it.
I mean, there's over a century
of what a vector is.
But the reason why it was added
to Postgres was not because of
any mathematical properties.
It was actually as a shortcut to
be able to look up ACL rules.
So instead of having to do a join
to another catalog table to
get the ACL rules, it was just
embedded within a single column
as an array.
Michael: Oh, really?
Jonathan: Yeah, yeah.
No, it's actually pretty cool.
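To make that concrete, you can still see this today: the table-level ACLs live as an array column (aclitem[]) directly on pg_class, rather than in a separate table you would join to. A quick look:

```sql
-- ACLs are stored inline as an array (aclitem[]) on the catalog row itself,
-- rather than requiring a join to a separate ACL table
SELECT relname, relacl
FROM pg_class
WHERE relname = 'pg_class';
```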
And then fast forward to the early
2000s, Postgres added the
cube data type, which allowed
you to index up to a hundred
dimensional vectors, ultimately
using the GiST index, but it
added also two distance operations,
which are the mathematical
operations you need for a vector,
which, you know, we can talk
a little bit more about that as
soon as we get into this.
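As a rough sketch of what cube gives you today (these are cube's actual distance operators, though some arrived in later releases than the type itself):

```sql
CREATE EXTENSION IF NOT EXISTS cube;

-- Distance operators between two 3-dimensional points:
-- <-> Euclidean, <#> taxicab (Manhattan), <=> Chebyshev
SELECT cube(array[1,2,3]) <-> cube(array[4,5,6]) AS euclidean,
       cube(array[1,2,3]) <#> cube(array[4,5,6]) AS taxicab,
       cube(array[1,2,3]) <=> cube(array[4,5,6]) AS chebyshev;
```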
But this is to say, Postgres has
actually supported vectors and
vector lookups for a while, but
certainly there's been a more,
a more pressing need for it.
And, you know, in terms of a little
bit how I got involved, perhaps
my secret in the Postgres community
was, I originally wanted
to go into machine learning when
I was in college.
And while I was studying it, I
was like, oh, it's really fascinating,
but to do anything with it, you're
going to need a PhD and everything's
ad hoc.
And I had a very entrepreneurial
spirit.
I think I got a patent.
And of course, I didn't have the
foresight to see like, oh, this
will be commoditized.
And it'll be very simple to access
machine learning algorithms
through simple API.
So credit where credit is due.
Some very smart people identified
that and have been really building
towards that.
But again, working with machine
learning systems in college,
the fundamental data type was the
vector.
And back then, a high dimensional
vector was something considered
to be like 20 dimensions.
And think about it, Michael, what's
a 20 dimensional vector?
I can understand 3, 3 is like points
in space.
I can understand 4, points in space
moving around.
What's 5?
Michael: Plus time, right?
Yeah.
Jonathan: Yeah.
Michael: I studied, I don't know
if you know, but I studied maths
at university and did pretty much
just pure.
I did, I did a tiny bit of stats,
a tiny bit of mechanics, but
I was best at the pure stuff.
And yeah, 5 dimensions hurts my
head to think about it, or to
try and conceptualize it in any
way.
Jonathan: Yeah, it's funny that
you say that.
I studied math too, though.
I would say I was much stronger
on the computer science side.
But real analysis was one of my favorite
classes, because you were
proving everything in n-dimensional
space.
And again, everything's a vector
or a matrix.
You basically spend a semester
just studying vector spaces and
all their properties.
And to me, like, that was fun.
And maybe it was telling, you know,
fast forwarding, you know,
20 years later.
But, you know, my journey back
to vectors came probably a little
bit later than some folks.
The pgvector project started in
2021.
Andrew had the foresight to see
that a lot of data was going
to be generated around machine
learning and AI systems.
And Postgres didn't really have
support for the types of vector
lookups that we needed.
So let's back up a second.
So I said that Postgres had support
for vectors all along, but
it did and it didn't.
One thing about relational databases,
and Postgres in particular,
is that you look for exact results.
You say, hey, I want to find the
10, well, let's say I want to
look up Michael or Jonathan in
a database.
I'm going to write a select query
with a where, where name
equals Michael, right?
And you expect to get the results
for like, Michael, all the
Michaels in the database.
Same thing when you're looking
up points, Postgres has the ability
to index things by k-nearest neighbor
or find all the points
that are around me.
And again, if I say like, hey,
go in and find like the 10 closest
coffee shops to me.
You're going to find exactly the
10 closest coffee shops.
And these things work in 3 dimensions,
4 dimensions, which is
typically where Postgres operates.
If you take the PostGIS extension,
which adds geospatial support
to Postgres, you can index 3 and
4 dimensional points, and you
can do it quite well.
The GiST and the SP-GiST indexes
provide the ability to write
these indexes.
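A minimal sketch of that kind of exact nearest-neighbor lookup, using the built-in point type and a hypothetical coffee_shops table (KNN-GiST lets the ORDER BY ... LIMIT use the index):

```sql
CREATE TABLE coffee_shops (id bigserial PRIMARY KEY, name text, location point);
CREATE INDEX ON coffee_shops USING gist (location);

-- Exact K-nearest-neighbor: the 10 closest shops to a query point
SELECT name
FROM coffee_shops
ORDER BY location <-> point '(40.7, -74.0)'
LIMIT 10;
```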
But as you start adding dimensions,
you go from, you know, even
like 8 to 10 to a hundred, which
is the limit of the cube extension,
it's going to start getting very expensive.
First, in terms of placing the
vector or the...
I mean, it's not really a point
anymore, but think about it as
a point.
You're trying to place this in
space and try to figure out, I
have 100 dimensions and I have
to situate somewhere in my graph
where that is.
It's going to be, you know, very expensive.
And then even if you are doing
some kind of exact nearest neighbor
search or some index search, there's
going to be a lot of cost to
it.
Yeah.
And this is where Postgres, you know, started falling over in a way, because a lot of these vectors we're seeing,
if you look at some of the popular
embedding systems, they're
1,536 dimensions, which again,
mind-blowing number.
Like I still don't even know what
a five-dimensional vector is,
and here we're talking about 1,500
dimensional vectors.
To do an exact nearest neighbor
search, as they call it, or the
k nearest neighbor, you have
to compare against every single
vector that you store in a database.
So you have a million vectors that
are 1,500 dimensions, you're
doing a million times 1,500 computations
on them.
And that's really expensive.
I can tell you it's very expensive
because I've had to, you know,
I've been personally, you know,
benchmarking this for quite a
bit.
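In pgvector terms, that exhaustive search is just an ORDER BY with no usable index; a sketch, assuming a hypothetical items table (<-> is pgvector's Euclidean distance operator):

```sql
-- With no vector index, this is an exact scan:
-- every stored vector is compared against the query vector
SELECT id
FROM items
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'  -- 3 dimensions for brevity; imagine 1,536
LIMIT 10;
```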
Michael: And it makes sense, right?
It's at minimum, a sequential scan
of the entire table, plus
all of the CPU of doing those calculations
100 times per, well,
1,500 times per row.
Jonathan: If you have 1,500 times,
yeah.
And we can talk about in terms
of like all the, you know, all
sorts of the optimizations that
are in place there.
But this is where like that exact
nearest neighbor problem gets
hard.
And again, Postgres didn't have
the indexing mechanisms to handle
it.
Probably we'll further dive into
that as we talk more.
But actually, I'll say 1 of the
big ones is just in terms of
just storing the data, storing
the data within the index, that
you have to fit your index value
for a single row within an 8
kilobyte limit within the page.
So that's already going to start
creating some constraints.
Now, what's also happened over
the past 20 years, since the cube
type was added in Postgres, was
a whole field of modern vector
research, as I like to call it.
And it centers on this idea of approximate nearest neighbor.
And the deal with approximate nearest
neighbor is that you're
trying to find your nearest neighbors,
but with a best fit, which
is you're not going to search every
single vector in a particular
table or database, you can search
a subset of it, and you're
hoping that you're seeing the representative
set of the data
you want to return.
Now again, for us relational database
folks, that's mind blowing.
It's like, wait a second, I'm not
returning the exact answers.
I'm returning an approximate subset.
And I'll tell you, the first time
I used pgvector, and I was
playing around with it, and I was
not getting the results I expected,
at first I'm like, oh, this is
broken, this doesn't work, this
doesn't make sense.
And that was the app developer
in me, the DBA in me that started
diving deeper and trying to understand
like, oh, wow, there's
like a whole science around this
that we kind of have to relearn
when using a database.
Michael: Yeah, well, I often say
to people, you can run a database
without adding any indexes and
you can run any query you want
and get the results back.
It's just faster to use an index.
Your query result will not change
as a result of you having an
index.
And that sentence is no longer
true if we include these approximate
nearest neighbor type indexes.
So I have to stop saying that or
work out a way of caveating
it.
Jonathan: Yeah.
Well, just say, well, in this world,
when you're dealing with,
you know, vector similarity search,
but back to the question,
you know, how did I get involved
in pgvector?
So rewind to about the end of 2022.
We may have heard about some of
these generative AI systems that
were capturing the imagination.
And we go and we type things into
them and we get back these
human-like responses and suddenly
it's like, we're trying to
think like, oh, what are all the
different ways to use it?
And one of the techniques that emerged
as these systems grew in
rapid popularity was retrieval-augmented
generation, where these
large language models, foundational
models, have trained on vast
quantities of data, typically publicly
available data.
But you may have data sitting in
your database or some other
private source that is not publicly
available.
For good reason.
It might be your product catalog
or information related to a
documentation collection that you
have.
But you're trying to deliver something
to your users, your customers
that can take advantage of what's
in these generative AI systems,
but provide that personalized experience.
To do that, you need some way to
be able to take that data that
exists in your private databases
and safely and securely augment
them with your foundational models
to be able to deliver that.
And that's where the vector search
comes in because the vector
search provides a representation
that you can search over and
be able to augment those answers.
So I saw that this is how people want to use it, and frankly, people want to use it with Postgres.
A lot of the user and customer
conversations I had were around
Hey, I already have this data existing
in my Postgres database.
Or I've already built all my search
mechanisms around Postgres,
I want to be able to simply use
it.
And there was an extension that
did this, and it was pgvector.
So for me personally, it started
to become a perfect storm.
I'd studied this vector stuff in
college.
I was a hobbyist machine learningist,
albeit the field has gotten
way more complex than when I studied
it.
And I happen to like databases
as it happens, particularly Postgres.
So I looked at this extension.
It seemed like a lot of people
are starting to adopt it in a rapid
fashion.
I mean, if you look at the star
history of pgvector, you know,
it's a curve like this.
Like I've never seen that before
for a Postgres extension.
But I stepped back a minute as
well.
And I saw that these are net new
database workloads.
You know, a lot of what you see
in Postgres in the past has been
moving database workloads over
a lot of traditional transactional
relational applications.
Though, you might call it once
in a generation, which in the
tech world is every 10 years, there's
this new workload that
emerges that just generates net
new data.
So about 10, 15 years ago, we saw
that with JSON.
We developed this lingua franca
across the web that made it simple
to communicate between all these
APIs, but there became a demand
to be able to store JSON data directly
in a database and easily
query it out.
Postgres developed a solution for
that, being the JSON and the
JSONB data types, which became quite
efficient at being able to
query it.
What was nice is that the Postgres
community rallied around what
people were doing and created a
very efficient, effective solution.
Side note, I've talked to one Postgres
user that has 40 terabytes
of JSON in a single database, and
they say Postgres rocks it,
which, I'm like, wow, that's great.
I'm glad it works.
Michael: That's so nice for you.
Jonathan: Yeah, yeah.
It is really cool.
Again, I can't figure out how I
would personally do it, but I'm
glad that it works really well
for them.
But we're seeing something similar
with vectors.
Vectors are just a data type.
It's actually a well-studied data
type.
It's something you kind of look
at in your introductory computer
science classes.
But when it comes to efficient
data search and retrieval, there's
so many nasty properties with them
that make it a really fun
and challenging problem to work
on.
But the thing is, like, this is,
you know, this is this new workload
that is going to be available to
databases, including Postgres.
And for me personally, I want to
make sure that we're positioning
Postgres in a way to handle it.
So where I got started with
pgvector was first using it,
helping to get it deployed at AWS.
But like any open source project,
if you're interested in it,
just participate.
There's a few things that I particularly
focused on, which was
performance both from a contribution
standpoint, a testing standpoint.
And first, I would just want to
say a large chunk of the performance
features have been developed by
Andrew Kane.
He's done an awesome job on it.
And more recently, Heikki Linnakangas
has even further improved
the performance on it.
Where I stepped in was both just
benchmarking to make sure that
we're focusing on the right areas
of performance and trying to
prioritize what things to look
at.
And a few patches here and there,
particularly around some of
the index costing early on for
IVFFlat.
Michael: But yeah, I saw that.
Jonathan: Oh, thank you.
Michael: I would love to get on
to performance and tips for people.
But before that, I hadn't thought
of this as net new data.
That's a super interesting way
of looking at it.
And I hadn't thought of JSON as
necessarily net new data.
I was thinking of it much more
along the lines of the NoSQL workloads
that felt quite competitive with
the relational for a while.
But yeah, really interesting way
of looking at it.
And it does feel like an opportunity,
much like that did, to
be able to handle more within Postgres
with all of the benefits
that we've done episodes on, the
pros and cons of having data
inside and outside using specialized
databases.
But you gain so many benefits
like the transactional nature
of things or the low latency or
being able to join things together.
Like there are so many benefits
of having it all together.
And not to mention that lower maintenance
overhead or the lower
operational knowledge needed of
managing multiple systems.
So yeah, great point, love it,
thank you.
Jonathan: Cool.
Michael: Yeah, so I was looking at 3 of your PRs, and the IVFFlat costings one looked like a core improvement to make sure Postgres is using indexes in more cases where indexes would be faster, which is the age-old cost optimization problem.
And also more recently, you did some great blog posts on the addition.
So IVFFlat was the first index type that pgvector supported.
And then more recently, it added the competing index type, HNSW (well, not as new as I expected when I looked back at the history), which you've also been involved in tuning, I've seen.
Jonathan: Yeah, so IVFFlat was
first.
And I think one of the reasons why
is that it is a bit simpler
to implement.
So it's a clustering algorithm.
Clustering algorithms are well
studied.
I mean, they're continuing to be improved.
But one of the things you have to think
about when you're looking
at implementing these algorithms
is that something that might
work for an in-memory workload
doesn't necessarily work for a
database workload, where you're
going between memory and an IO
level, be it wherever your storage
may reside.
And with IVFFlat, where I began
getting involved was at AWS,
we had rolled out pgvector.
I started talking to some of our
customers who were early pgvector adopters, and it was definitely
all over the board.
But I was talking to one who was
definitely very savvy on vector
searches in general.
And they were remarking that when
they were trying to increase
the number of probes in their query.
So a probe, so let's step back.
So the way IVFFlat works is that
you build clusters of vectors.
So let's say you have. Let's keep
it simple.
You have like 10,000 vectors. What
you want to do is you want
to try to bucket them into lists.
So let's say you want to have a
hundred lists.
So there would be a hundred vectors per list.
There we go.
Clearly, I did well with my math
major.
Michael: The arithmetic was not
included, right?
Jonathan: Yeah, exactly.
After a certain year, you never
saw a number.
So you have a hundred.
So let's back up.
So we have 10,000 vectors.
We want to put them in a hundred lists,
each having 100 vectors.
What happens with IVFFlat is that
you try to find centers.
So you want to find all the vectors
near a particular center.
Because the idea of that is that
those are going to be the most
similar vectors to each other.
So that way, when you do a lookup,
you're going to one of those
centers.
And the idea is that these are
likely, by approximation, the
vectors that are most similar to
the ones that you're looking
up.
However, it may not be.
Because let's say you have two centers
here and here, and your
query vector is here.
Which one are you closer to?
Michael: I had to have somebody
explain this to me twice, and
the second time they used the analogy,
it worked better for me.
If we go back to 2 dimensions and
look at, let's say, a map of
Europe, and we had the countries
laid out.
We could have, for example, the
capital cities.
First off, we could compare any
1 point and say, which capital
city is this nearest to?
That's a much cheaper operation
than looking at every single
city.
Then we can go into that country
and look at which of the actual
towns or cities we're closest to.
But if something's close to a border,
or the fact that not all
countries are the same size, you
can end up in situations where
you don't necessarily get the right
center or country by doing
that first one.
Is that a reasonable analogy?
Jonathan: Yeah, that's really good,
because I started thinking
about that.
So I'm based in New York, and thinking about which capital I'm closest to, it's technically Trenton, which is in New Jersey.
So if I just looked at one probe,
it would say like, oh, you're
in New Jersey.
Yeah, actually, I think I take
3 probes because I think Albany
is the third closest capital to...
Michael: There you go.
Jonathan: So there you go.
Yeah.
Great analogy.
Michael: So probes is the number...
So you then do that first search
and say, which are the 3 closest
centers?
Jonathan: Yeah, centers, lists.
I mean, list is the term that's used.
It might be used interchangeably with centers.
Michael: So if we increase that
number of probes, we're increasing
the cost of doing the query because
we're looking up more things,
we're comparing to more things,
but we're increasing the likelihood
we're gonna get back the correct
result or the number of things
that are correct.
Jonathan: In our expected set.
So the measurement of expectation
is recall.
So if you know that these set of
vectors are like your actual
10 exact vectors, but you get like
8 out of the 10, then you
have 80% recall.
So with IVFFlat, as you say, as
you increase the number of probes,
you're going to increase your recall.
But you are going to increase the
query time.
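A sketch of those two knobs in pgvector (the items table and the numbers are illustrative):

```sql
-- Build time: choose the number of lists (clusters)
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Query time: more probes means better recall, but a slower query
SET ivfflat.probes = 3;
SELECT id FROM items ORDER BY embedding <-> '[0.1, 0.2, 0.3]' LIMIT 10;
```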
Now, in that patch you were talking
about, the problem is that
we were actually over costing.
So anytime we were above 5 probes, we then flipped back to a sequential scan, which will give you 100% recall, but it was, you know, like a 50x increase in query time based upon the dataset.
And we definitely still had room
to grow.
So the first patch really was focused
on, let's get the costing
to be more reasonable level, which
gets into, you know, startup
costs being one of those elements.
For me personally, it really was
my first time diving into the
Query Planner code.
I definitely saw all sorts of interesting
things.
First off, the Query Planner code
is...
There's a level of brilliance to
it.
It encapsulates a lot.
And the work that folks have done
over the past, you know, 35
years tuning it is quite remarkable.
I'd also say that it is quite clean.
There's a lot of it, right, so there's a lot to dive into.
But once you get into it, you can navigate your way around.
Editing it is another story.
It’s one thing to understand it,
it’s another to propose a modification,
but what’s really cool about Postgres
is that you can make a
lot of impact within an extension,
or basically you can add this
functionality without having to
fork Postgres.
You could just write it in a separate
module and then add it
in.
And pgvector does this through
the index access method interface,
which is actually quite, quite
robust.
Like we build index types that
don't fit the natural, you know,
B-tree ordering of a relational
database.
So it's quite powerful there.
And we were able to ultimately
get, I'd say, pretty good costing
for IVF Flat without having to
make any upstream modifications
to Postgres.
Michael: Nice.
And the other, yeah, so on the IVFFlat, because of
these centers, my understanding
is it doesn't necessarily, it
suits a dataset you already have
that is static or not changing
much because if you're adding data,
it can’t adjust as it goes.
So there is this trade-off inherent
in it.
And I'm guessing some of the customers
you were talking to had
different workloads that wouldn’t
necessarily suit that.
And what actually, what happened
with your customer conversations
next?
Jonathan: Yeah, so that's a good
discussion.
And I'd say, you know, let's broaden it.
Like, it's users of pgvector in general.
You know, a lot of these conversations I have in the open source project, which is, again, awesome.
So that's definitely something
that came up.
That if my dataset, if I keep adding
vectors to the dataset or
every now and then you update them
as well or even delete them,
what happens is that the centers
start to skew because where
you calculated where all these
centers are may start shifting.
So your results might start shifting
as well.
And you might not be getting the
expected results or the recall
that you'd want to see.
The other thing as well is that
for a lot of app developers,
I mean, first off, again, you have
to wrap your head around approximate
nearest neighbor that you may not
be getting the exact results
that you expect.
And like I said, I tripped up over
that as well.
Like it's, you know, it is, it's
normal to make that mistake.
You know, you have to learn something
different.
The other thing is that it started
turning app developers into
DBAs in the sense that you had
to worry about some of your tuning
parameters.
And granted, there are 2 of them,
right?
There's the number of lists when
you build the index, which is
not necessarily terrible because
as an app developer, you're
still writing your SQL code, you
are creating that index.
But you have to think, what is
the number of lists?
And the pgvector project gives
some guidance on the number of
lists to pick, but it's still like
you have to experiment.
And then you have to be mindful
of things like, oh, am I adding
more data to it?
Do I need to re-index?
How do I re-index?
You know, who's doing the re-indexing,
etc.
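The rebuild itself is a one-liner; for example (index name hypothetical):

```sql
-- Recompute the IVFFlat lists from the current data,
-- without blocking concurrent reads and writes (Postgres 12+)
REINDEX INDEX CONCURRENTLY items_embedding_idx;
```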
And then you have to select probes,
which again, you know, you're
writing queries possibly in your
ORM, and suddenly you have to
set like this magic parameter to
choose how many of these lists
you want to visit during a query.
So it's not as simple as set and
forget.
And that's when I come back to
my history being an app developer,
I always wanted to just write code
and I want it to all just work.
Yeah, I happen to explore the database
as a side effect of that,
not, you know, as a first-order
principle of that.
So it became time to explore an
algorithm that could be closer
to set and forget.
Like none of these approximate
nearest neighbor algorithms are
set and forget, but we can at least
make it easier.
And this is where HNSW came in: Hierarchical Navigable Small Worlds.
So HNSW is a little bit different than IVFFlat because it's not cluster-based, it's graph-based.
And when you hear graph, remember, you know, a graph is a superset of a tree.
So even though Postgres is very
tree-based in a lot of its algorithms,
you can implement something graph-based
as well, particularly
with the custom index framework.
And the way an HNSW index works is that
you build a hierarchy or layers
of vectors.
And as you go down the hierarchy,
the layers get denser, denser,
and denser.
And the idea is that you start
at the top layer and you try to
find which vector am I closest
to, you find your local maximum
or effectively the vector I'm closest
to, then you go down and
you search within that area.
You find the one that you're next
closest to and you go down and
then it's even denser and you search
and so on and so forth until
you get to the final layer.
And the vectors that you should
be clustered around are most
likely the vectors you're most
similar to.
When you're building the index,
there's 2 parameters that...
Now you have 2 index building parameters.
You have something called m, which
is the number of vectors that
you're linked to at a given time.
And pgvector defaults to 16.
And you have something called ef_construction, which is the search
radius that you're keeping as you're
descending down it.
The idea that if you look at a
larger search radius, you're going
to see the vectors that you're
more similar to.
The idea also with a larger M is
that you're going to be clustered
to vectors you're more similar
to as well.
But there's definitely trade-offs
between how you pick that.
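In pgvector, those parameters look like this (the values shown are the documented defaults; the items table is hypothetical):

```sql
-- Build time: m = links per vector, ef_construction = candidate list size during build
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
  WITH (m = 16, ef_construction = 64);

-- Query time: the single search knob; larger means higher recall, slower queries
SET hnsw.ef_search = 40;
```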
What happens is that when you actually
do the query, you should
be looking over a smaller space.
So IVFFlat is going to grow linearly
as you add the number of
probes and it's going to grow linearly
by the number of vectors
that are in each list.
So the idea is that if you can
keep the list relatively small,
you're going to be able to do faster
searches with IVFFlat.
But there's also trade-offs to
that as well, that you might not
be getting the recall that you
want.
You can see how IVFFlat can get expensive over time,
particularly if you're linearly
growing it.
With HNSW, you're only searching
over a smaller subset of the
space.
Again, this could be dictated by
your one search parameter, which
is ef_search, but you're only looking
at a subset of the graph.
So you're going to be looking at
far fewer vectors.
And the idea is that you're being
put into a cluster that's very
similar to you.
So there's trade-offs.
With IVFFlat, I can build the index super quickly.
And technically, if I'm only using
one probe, I might be able to
query it super quickly.
But the trade-off is that to boost
recall in IVFFlat, what we've
seen empirically is that it's going
to get more expensive.
With HNSW, we can query really quickly, and again, the results are showing that we can query and get high recall; you know, we're basically seeing the expected results with pretty good accuracy.
But the trade-off is going to be
on the build time, because we're
going to do a lot more work up
front to build that hierarchy
and see all of the, you know, basically
try to visit enough vectors
that seem most similar to me.
And that's the push and pull between
the, you know, these 2 algorithms
and just vector similarity search
in general is that you're going
to have to pay the cost somewhere,
and you have to figure out
what that cost is going to be and
what makes the most sense for
your workload.
With HNSW, at least, if you're
willing to pay that upfront cost,
which it seems like a lot of folks
are, you are going to get
this very high-performance, high-recall
system, most likely.
And again, there are other factors
that go into this as well,
including your embedding model
or how you're querying the data
or how quickly you're ingesting
the data.
So, like, there's a lot to consider
here.
Michael: Yeah, I'm a performance
guy at heart, and I love it
when there is actually a true trade-off
and people in different
situations will have different
preferences.
But to me, it seems like a lot
of the currently popular use cases
can pay or have a desire for as
high a recall as is possible
within a latency threshold.
So they're willing to pay some
latency and index build
time costs for as high a recall
as possible to a point.
And then everyone has a different
point at which that is.
But yeah, I actually wanted to
ask almost the opposite question.
Do you see some use cases where
the index build time itself is
the primary driver?
Or any other reasons that you're
seeing people choose IVF flat
at the moment?
Jonathan: Yeah, there are cases
where you might be needing to
do rapid loads and a bit of analysis,
but it's very transient.
So one of the cases I've heard is
that there's a system with, let's
say like a hundred million vectors,
and it's really more about
getting a rapid build and doing
a spot check on the data within
it, as opposed to having something
that's permanent that needs
to be run all the time.
I think it's like a lot of these
transient or ephemeral workloads
where an IVF flat index can make
sense.
But frankly, like I've talked to
people who have just rolled
out production systems with IVF
flat with several million vectors
in it.
They're perfectly happy.
They're like, we're getting the
results that we need.
We're getting it in the amount
of time that we expect it to return
the query in.
And they're completely fine with
it.
So some of it is personal preference.
And some of it, like I said, depends
on other factors, such as
the data that has been vectorized.
How similar are the vectors based
on the embedding model?
And are you able to get the distances
within a range where you're
seeing the set that you want.
Michael: Awesome.
That's really interesting.
So you've got a great talk on this
that I'll link up that you
gave recently at PGConf EU.
And you've also had a couple of
great blog posts on the releases
around 0.5.0 and 0.5.1.
I'll link those up so people can
read them in their own time.
I did actually want to ask you, you mentioned some tuning, like people becoming DBAs and having to think about things, and you mentioned right up top the page size of PostgreSQL, for example.
Do you want to talk a little bit
about why that's important and
some of the lower level stuff that
people need to think about
when they're doing this?
Jonathan: All right.
So, tuning or technical?
I heard 2 different things.
Michael: Let's go tuning as more
practical.
We don't have to explain TOAST.
We've done a whole episode on TOAST
before.
But there was some interesting
stuff in there that really got
me thinking that I hadn't thought
about before.
Jonathan: Yeah, so there's a whole
bunch of areas to go into.
So just to briefly recap, because
I think this does impact tuning.
So TOAST is a system that, well,
step back, rather.
So the foundational unit of PostgreSQL
is the page.
That's the atomic unit.
When you store data, you're actually
storing it within this atomic
unit.
You're not just storing a row randomly
on disk.
It's going to be fit within a page.
By default, the page for PostgreSQL
is 8 kilobytes.
Now you can recompile PostgreSQL
to use a different page size,
but most folks just use the default
and for a lot of good reasons.
What's interesting is that, you
know, so if you have data that
doesn't fit within a page, what
happens is that it gets TOASTed
and it gets stored out of line
and you can store it arbitrarily
large.
I believe it's up to 1 gigabyte
per field.
Now, what's interesting, well,
there's 3 interesting things here.
So first, PostgreSQL has what's called
a minimum TOAST threshold,
which is 2 kilobytes.
So anything above 2 kilobytes is
going to get toasted, unless you
change that threshold.
The second thing is that index
pages must abide by that 8 kilobyte
limit.
And you can actually TOAST data
in line with an index page to
shrink it down a bit so you can
get a little bit more on that
index page.
But if it's gonna be over 8 kilobytes,
Postgres can't index it.
So this is where it gets interesting
for vectors for a few reasons.
So first, any vector that's above
that 2 kilobyte threshold,
which I believe, I think I forgot
off the top of my head, it's
around like 514 dimensions for
pgvector currently, is going to
get TOASTed.
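You can check where a given vector lands relative to that threshold; a pgvector value is roughly 8 bytes of header plus 4 bytes per dimension:

```sql
-- ~8 bytes of overhead plus 4 bytes per dimension
SELECT pg_column_size('[1,2,3]'::vector);  -- ~20 bytes
-- so a 1,536-dimensional vector is ~6 KB, well above the ~2 KB TOAST threshold
```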
Okay, that might seem okay.
But any vector currently, there's
a hard cap in terms of indexing
pgvector vectors of 2000 dimensions,
which there are some valid
use cases I've heard of vectors
that go beyond that size, but
typically most things are going
to fall within that range for
now.
Now, there's a few things here.
First, you're like, well, you say
you can TOAST things in line.
So why can't we TOAST these vectors
in line?
Well, I hate to be the bearer of
bad news, but it's actually
very challenging to compress effectively
2,000 dimensions of
random floating-point numbers.
There's not really much you can
do other than dimensionality
reduction to shrink it down.
There are some techniques out there
called quantization, which
you can touch on after this, but
they all have their tradeoffs,
such as losing information.
The second thing is that TOASTing
can screw up the query planner
a bit.
What do I mean by this?
So, currently when the query planner
is trying to estimate parallel
workers or essentially reading
data in parallel, it's going to
look at, it basically looks at
your heap pages or your main table
pages and uses that to drive the
estimate.
And the thing about TOAST is that your heap pages can actually be quite small; you know, in this case, let's say it's an ID and a vector.
Well, I'm not going to have that many pages in my regular heap because it's just going to be a bunch of IDs and pointers to my TOAST table.
The TOAST table is going to be
quite large and likely you need
those parallel workers to suck
all the data out.
But what Postgres is going to do
is it's going to underestimate
the workers today.
And it actually makes sense historically
because the data that
you typically TOASTed was not in
your search path or your ordering
path.
It's typically something that you
need to do some post filtering
on as you pull it out.
You know, think like, you know,
a big blob of text.
But here, we are actively querying
the vectors.
We are calculating distances between
them.
So we probably want them closer
to our main set of data.
Now, we can choose to store the
vectors in line.
You use SET STORAGE PLAIN (via ALTER TABLE), and that will keep your 1,500-dimensional vector in line with your other table data.
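A sketch of that (table and column names hypothetical); note that the storage setting only affects newly inserted rows:

```sql
-- Keep vectors in the main heap instead of the TOAST table
ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;
```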
But also keep in mind, this is
going to cost a full page because
a 1500-dimensional vector is about
6 kilobytes and Postgres is
just going to have to allocate
a page for each of them.
So no matter what, you're storing
8 kilobytes of data in them.
So this gets very interesting.
So in terms of tuning, again, part
of this, you've got to look
at what your workload is and what
makes the most sense.
What we're seeing so far, at least
with HNSW, is that even if
you're toasting a 1500-dimensional
vector, the estimates are
still pretty good overall for making
sure that you're using your
HNSW index.
I haven't seen as much impact there.
We saw more impact, particularly
with IVF flat based upon some
of that costing model.
Jonathan: But I think there's some,
this is one of those areas where
I think there's some improvements
to how we work with toasted
data upstream.
I think I have a couple of emails
or threads on that subject.
And if I get my C chops better,
maybe I can propose a patch.
But that's one area to be mindful
of, is how you store the data.
And again, I think we're starting
to get a little bit more set
and forget there.
But I think as we see these workloads
increase, it is definitely
something to be mindful of.
Michael: From a bias standpoint,
as a Postgres supporter and
advocate, I've seen quite a lot
of benchmarks that show pgVector
doing very well as a vector database
compared to other dedicated
systems.
So that's impressive, considering
they're probably not even tuning
some of those lower-level things while doing those benchmarks.
So it's exciting for me to hear
that there's actually some potential
further wins there on the Postgres
side.
Jonathan: Yeah, I mean, maybe like
hot off the presses, Andrew
proposed a patch last night for
speeding up HNSW index building.
Since the 0.5.1 release, there's
already been a lot of work to
improve this, but the patch last
night even further accelerates
it.
So there's support for parallel
builds for HNSW indexes coming.
And the initial work still had
to leverage a lot of the data
coming from disk as opposed to
being fully in memory.
The proposal last night is fully
in memory.
And I actually did a test, I was
so excited to see it, I did
a test.
I have this dataset of 10 million 1,500-dimensional vectors, or really like 1,536 dimensions.
Yeah, random vectors, right?
This is really more just for like
beating up on performance.
You know, I'm not measuring recall
here, which for real benchmarks,
you got to measure performance
and recall.
Jonathan: But I'm trying to just understand, like, how quick is the index build time.
So I did this with what's currently
on the master branch to date.
I used this and then I also did
the test with the HNSW fast build
branch, which is this in-memory
system.
So I saw a couple of things.
So first, this new branch was about, I think, 7.3x faster at building the HNSW index than the other branch.
And just to compare, like this
gives some real numbers.
So the old branch, which is, you know, the unreleased support for the HNSW parallel builds: it took about 3 hours and change to build the entire index.
And this was with like 64 parallel
workers.
I'm throwing a lot of hardware at it; this is a big beefy box.
With the new branch, it took 25
minutes.
It's like, that's mind-blowing,
right?
These are big vectors.
You're doing a lot of computations.
And I had even cranked up the ef_construction value, which does increase the build time.
I did compare it to this previous
method I'd been recommending,
which was concurrent inserts.
And I just, you know, I did a spot
check between a blog post
I wrote about it and this new patch.
In the blog post, where I had a lower value of ef_construction (it was, I think, 64), I was getting a little bit over 1,000 vectors per second over the entire run.
With this new technique that Andrew
posted, I was getting over
6,500 vectors per second.
Michael: So again, huge.
Jonathan: Yeah.
This is what's cool about this
because this is 1 of those set
and forget things that you might
need to tweak.
You might need to tweak 1 parameter.
In this case, I tweaked max_parallel_maintenance_workers.
But again, huge performance boost
in terms of the index building.
It simplifies it too, because now
it makes it truly viable to
say, preload all your vectors,
which will be a much faster operation
than doing a concurrent insert.
I mean, compared to doing concurrent inserts, it even further speeds up the load of your vectors.
But then you can just do CREATE INDEX, you know, on embeddings USING hnsw, and boom, you have an index way faster than other systems.
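A sketch of that load-then-index flow (the worker count mirrors the test described above; the items table is hypothetical):

```sql
-- Load all the vectors first (e.g. via COPY), then build the index in one shot
SET max_parallel_maintenance_workers = 64;  -- big beefy box
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
```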
And like, this is a key thing to
look at.
You know, you touch on this, Michael,
is that when you're dealing
with this vector data, you gotta
look at everything.
You gotta look at your ingestion
time, your index build time,
your query time.
And you know, you have to focus
on all these things because there
is a trade-off with all of them.
Michael: Yeah, well, exciting times
ahead.
It feels like the most rapidly
improving area at the moment and
for good reason.
So I saw on this, it's got a great
changelog, pgvector.
And I saw there's like even the
unreleased part.
So 0.5.2 is currently on there
and unreleased.
So anything you're particularly
excited about that's coming up
or anything not on that list that
you'd love to see?
Jonathan: Yeah.
And I think this is where it really
helps to have feedback on
how people are using it.
So I was very excited about parallel
HNSW builds.
Like this is, I can't, even like
in the middle of that test I
was running last night, like I
emailed Andrew, like Andrew, the
results are amazing.
Like I can't, I can't wait for
it to finish.
Like I can see where this is trending.
So there's 2 things in particular
I'm very excited for.
The first, and we'll see if it makes it in, is around pre-filtering.
So a lot of the queries today we've
been seeing is like select
star from vector table, order by,
blah blah blah, you know, limit
10.
Find me my 10 nearest neighbors.
But what's happening in practice?
In practice, it's like select star
from table where, you have
some condition, category ID equals
this, order by blah, blah,
blah, blah.
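So the query shape becomes something like this (category_id is a stand-in for whatever you filter on):

```sql
-- Filtered vector search: today Postgres must either post-filter the
-- index's candidates or fall back to an exact scan
SELECT id
FROM items
WHERE category_id = 42
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 10;
```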
So what this is kind of like is
really a multi-column index,
because what will happen is like
today, either you won't use
the index, effectively you're doing
an exact nearest neighbor
search, which means very accurate
results.
Maybe your dataset gets filtered down to a small enough size where it doesn't matter.
Like you're getting a very fast
search.
But what if you have multiple fields
that you need to filter
over?
Or what if the dataset you get
back is 50,000 of these 1,500
dimensional vectors?
Like this could be exhausting.
So there's a patch out there; it's based on a newer paper extending HNSW, and you can find it in the pgvector repo right now.
Yes.
That lets you build effectively a multi-column index, where it's able to build the links or group things together.
Well, there are 2 things, right?
'Cause you still want to be able
to search over your entire vector
set and find your most similar
vectors.
But you can also group the vectors
by your filters.
So that way, you're searching over
just your filters and it does
that pre-filtering.
This is like a what have you done
for me lately feature, because
as soon as you put it out, users
find like, you know, more of
these case studies emerge and users
find like, hey, like, I really
need this.
But what's great is that there are people who are testing it against real data.
And this is where, if you want to be involved in pgvector, you can help: when you see these patches out there, please test them and report back.
Because if you're finding these HNSW branches useful, describe your use case.
Like there's an open issue where
people talk about, you know,
they might say like, hey, I really
need this.
And what makes it super valuable
is when you talk about like,
here's exactly how I'm using it,
because it just further justifies
that this is the right direction.
I mean, 1 of the big goals of pgvector
is to try to keep the
code base simple to maintain and
also make the product simple
or the extension simple to use.
So hearing more about what people
are doing is great.
Another patch that I'm excited
for, and again, we'll see if it
makes it into 0.5.2, you know,
no guarantees, is being able to
support smaller dimensional sizes.
So right now, dimensions are 4-byte
floats.
But there are definitely embedding
models that provide two-byte
floats or one-byte unsigned integers.
So there are a couple branches that
have those in there.
But again, hearing those use cases
will help further support
it.
Because the added bonus to supporting
these smaller dimensions
is that we can index larger vectors
and that'll be able to go
beyond that 2k limit.
Michael: Yes, interesting.
The one final question I had is, do you see any world where this becomes part of core Postgres in the long term?
Jonathan: Yeah, so at PGCon 2023 last year, I had a lightning talk, which was basically a first shout to the community: hey, these vector workloads, this is real.
That was still early.
But it's like, hey, this is coming,
there's a, I call it like
a storm of data coming.
Like we want to make sure Postgres
is positioned for it.
Like this is very similar to what
we saw with JSON.
And I was able to get an unconference
session as well where
we discussed it.
And the general consensus was in
the fullness of time, it does
make sense to have something like
this in upstream Postgres.
Great.
But I think there's a few things
here.
First, we have to look at release
cycles.
Postgres releases once a year.
In fact, the feature freeze for Postgres 17 is coming up in less than 3 months.
Now, if you think, let's say you
come up with the idea we want
to support a vector data type right
now, or let's say after feature
freeze.
I mean, there's effectively an
18-month window before it gets
in, because we have to go through
the whole cycle.
And given the pace that this field
is moving, we don't necessarily
want to wait on the Postgres release
cycle.
So being able to do this work in
pgVector or other extensions
does help accelerate adoption of
Postgres as a vector database,
so to speak.
The other thing is that once it
is in upstream Postgres, you
know, that is the on-disk format.
Like that is the rule.
And Tom Lane made a very good point
during that Unconference session,
which is like, let's see, you know,
how things shake out in terms
of what the on-disk format is.
Now, pgVector is also trying to
stay true to that contract and
try to keep the on-disk format
as, you know, effectively not
to change it.
Because as soon as you change it,
you gotta like rebuild, re-index,
restore everything, and that's
a costly operation.
So pgVector is trying to apply
the same level of rigor as Postgres
to implementing these features,
but it can move a little bit
faster because it is an extension.
It can have its own release lifecycle.
So I think that's where it stands: I'm not gonna say this is an official or unofficial community position, but is there an interest in supporting it upstream?
Absolutely.
But given the rapid emergence and
development we need to make,
we're trying to make as much progress
as possible within pgVector.
Here's the other thing too, like
the big difference between now
and JSON.
We have the index access method,
we can do this in an extension.
That is a big change since 10,
15 years ago.
Michael: That's really good, interesting
and good to know.
There's another big difference,
which is cloud providers.
So whilst it could be in core Postgres,
currently, a lot of people
that install Postgres themselves
or manage Postgres themselves
more specifically can install pgvector.
And people that don't are often
on a managed service provider.
Those folks, most managed service
providers have added support
for pgvector already, which is
the fastest I've ever seen them
add a new extension.
And it's been pretty much everyone.
So most folks that want access
to pgvector, at least some version
of it, maybe not always the latest
version, can have it now as
well.
Jonathan: Yeah, it's definitely exciting to see all the adoption of pgvector.
And yeah, to tease a little bit, I still think the best is yet to come.
Like I think pgvector has made
a lot of progress in the past
year.
I mean, it's been a tremendous
work by the community, and in
particular, Andrew.
Like, I can't say enough nice things
about the work Andrew has
done and really the diligence he's
put into it.
And I still think there's more
to do.
Like I said, even just like, well,
the patch that came out last
night, I mean, just shows that
there's still even more performance
that we can get out of it.
Michael: And for folks listening,
we're recording on the 16th.
So I'll link that up.
Wonderful.
Jonathan, thank you so much.
Is there anything else I should
have asked you but didn't?
Or any last things you wanted to
say?
Jonathan: Yeah, I think one of the
big things when you think about
contributing to open source is
writing a patch, writing code.
And I'm going to say this in the
context of pgvector, but I
think this applies to any project,
even Postgres itself, is that
there are many different ways to
contribute.
Testing is huge, because testing,
particularly if you can test
something that's close to a real
workload, like don't test your
production workloads with these
things, but like test something
that's close to production.
That helps because that helps drive
the use case and hearing
how you use something and just
talking about your different use
cases of being supportive of the
community in that way.
That helps.
Helping to write documentation,
helping to advocate for something
that you think can help others.
Again, all these things can help
a project.
So if you want to contribute to
pgvector or Postgres, there
are a variety of different ways.
And maybe too, as it's top of mind
right now, PGCon, which was
effectively the annual developer
conference.
It's evolved.
It's now called PGConf.dev, P-G-C-O-N-F.dev.
It's being held in Vancouver at
the end of May.
I can tell you, I can almost guarantee
vector workloads will
be a topic of discussion there.
But just all things Postgres.
And the idea is that while certainly a lot of the discussions are around technical hacking topics, really, if you step back, the gist is: how do we continue building and growing the Postgres community?
The CFP actually just closed yesterday, or when this airs, probably a few days before, but that's a great way to participate as well.
Even if you're new to the community,
because I know personally,
at one point I was new to the community,
and the first time I went
to PGCon, I'm like, oh my God,
like I know nothing about how
a database works.
Like I can write a select query,
but geez.
But it is a way to help have an impact
on the community.
And just talking to folks who are
working on the software or
working on events or hosting podcasts
or finding ways to help
grow the community, it's a great
way to participate and help
growth.
So certainly think about attending
and participating.
And again, there's all sorts of
different ways to contribute.
So that's the parting message I
have.
I'm far from a database hacker.
I can write a couple of lines of C
here and there.
But where I've found a home in
open source is working on all
sorts of other aspects around open
source projects.
Thank you for having me on your
wonderful podcast.
Michael: Oh, cheers.
Take care.
Jonathan: Take care.