Postgres FM

Michael is joined by Jonathan Katz, PostgreSQL Core Team member, Principle Product Manager at AWS, and also pgvector contributor, to discuss pgvector. They cover what it is, why it's so popular all of a sudden, some tuning and tradeoff guidance, and a look to the future.
 
Here are some links to things they mentioned:

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With special thanks to:

Creators & Guests

Host
Michael Christofides
Founder of pgMustard
Guest
Jonathan S. Katz
Principal Product Manager Technical at AWS, PostgreSQL Core Team member, and pgvector contributor. Helping people to adopt and learn PostgreSQL! Opinions are my own.

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Michael: Hello and welcome to Postgres
FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard,
and today I am delighted

to be joined by Jonathan Katz,
Postgres Core Team member amongst

other things, Principal Product
Manager at AWS and contributor to

pgvector.

Welcome to the show, Jonathan.

Jonathan: Thank you, Michael.

Happy to be here.

I'm definitely a huge fan of your work.

I'm very excited for our conversation
today.

Michael: Kind of, you'd say the same
to me.

So I reached out to you about this
topic because it's about time

we did an episode on pgvector.

How could we not have given the
last year?

But neither of us knew it well.

Both Nikolay and myself haven't
used it until recently.

Nikolay has been using it.

So we wanted to make sure we covered
it.

And who better to well, Andrew
Kane was unavailable and you are

definitely the next best person.

So we really appreciated all the
content you've been putting

out, the talks you've been giving
on this.

So yeah, thanks for joining.

Jonathan: Yeah, happy to be here.

I mean, and by the way, when you
say next best person, it's like

well below Andrew Kane.

Andrew has done phenomenal work
on pgvector over the past several

years.

Michael: Yeah, right.

And it has been.

It definitely predates the ChatGPT
hype, right?

It was a couple of years beforehand.

Do you want to give us a little
bit of an introduction as to

what it is and how you came to
become involved with it?

Jonathan: Yeah, so pgvector, at
the surface sounds very simple.

It's a Postgres extension that
allows you to store and search

vectors.

And I mean, I'll talk a little
bit about how I got involved in

it, but really to understand how
this all came to be, it actually

helps to look back in the history of
Postgres.

Postgres has actually always been
able to support vectors back

since the Berkeley days, and it
comes down to the array data

type.

So I mean, quite simply, a vector
is, you know, it's an array

with certain properties.

It has a certain amount of dimensionality.

There are certain things that each
dimension must meet.

And there's all sorts of math around
it.

I mean, there's over a century
of what a vector is.

But the reason why it was added
to Postgres was not because of

any mathematical properties.

It was actually as a shortcut to
be able to look up ACL rules.

So instead of having to do a join
to another catalog table to

get the ACL rules, it was just
embedded within a single column

as an array.

Oh, really?

Yeah, yeah.

No, it's actually pretty cool.

And then fast forward to the early
2000s, Postgres added the

cube data type, which allowed
you to index up to a hundred

dimensional vectors, ultimately
using the GiST index, but it

added also two distance operations,
which are the mathematical

operations you need for a vector,
which, you know, we can talk

a little bit more about that as
soon as we get into this.

But this is to say, Postgres has
actually supported vectors and

vector lookups for a while, but
certainly there's been a more,

a more pressing need for it.

And, you know, in terms of a little
bit how I got involved, perhaps

my secret in the Postgres community
was, I originally wanted

to go into machine learning when
I was in college.

And while I was studying it, I
was like, oh, it's really fascinating,

but to do anything with it, you're
going to need a PhD and everything's

ad hoc.

And I had a very entrepreneurial
spirit.

I think I got a patent.

And of course, I didn't have the
foresight to see like, oh, this

will be commoditized.

And it'll be very simple to access
machine learning algorithms

through simple API.

So credit where credit is due.

Some very smart people identified
that and have been really building

towards that.

But again, working with machine
learning systems in college,

the fundamental data type was the
vector.

And back then, a high dimensional
vector was something considered

to be like 20 dimensions.

And think about it, Michael, what's
a 20 dimensional vector?

I can understand 3, 3 is like points
in space.

I can understand 4, points in space
moving around.

What's 5?

Michael: Plus time, right?

Yeah.

Jonathan: Yeah.

Michael: I studied, I don't know
if you know, but I studied maths

at university and did pretty much
just pure.

I did, I did a tiny bit of stats,
a tiny bit of mechanics, but

I was best at the pure stuff.

And yeah, 5 dimensions hurts my
head to think about it, or to

try and conceptualize it in any
way.

Jonathan: Yeah, it's funny that
you say that.

I studied math too, though.

I would say I was much stronger
on the computer science side.

But real analysis was one of my favorite
classes, because you were

proving everything in n-dimensional
space.

And again, everything's a vector
or a matrix.

You basically spend a semester
just studying vector spaces and

all their properties.

And to me, like, that was fun.

And maybe it was telling, you know,
fast forwarding, you know,

20 years later.

But, you know, my journey back
to vectors came probably a little

bit later than some folks.

The pgvector project started in
2021.

Andrew had the foresight to see
that a lot of data was going

to be generated around machine
learning and AI systems.

And Postgres didn't really have
support for the types of vector

lookups that we needed.

So let's back up a second.

So I said that Postgres had support
for vectors all along, but

it did and it didn't.

One thing about relational databases,
and Postgres in particular,

is that you look for exact results.

You say, hey, I want to find the
10, well, let's say I want to

look up Michael or Jonathan in
a database.

I'm going to write a select query
with a where, where name

equals Michael, right?

And you expect to get the results
for like, Michael, all the

Michaels in the database.

Same thing when you're looking
up points, Postgres has the ability

to index things by k-nearest neighbor
or find all the points

that are around me.

And again, if I say like, hey,
go in and find like the 10 closest

coffee shops to me.

You're going to find exactly the
10 closest coffee shops.

And these things work in 3 dimensions,
4 dimensions, which is

typically where Postgres operates.

If you take the PostGIS extension,
which adds geospatial support

to Postgres, you can index 3 and
4 dimensional points, and you

can do it quite well.

The GiST and the SP-GiST indexes
provide the ability to write

these indexes.

But as you start adding dimensions,
you go from, you know, even

like 8 to 10 to a hundred, which
is the limit of the cube extension,

it's going to start getting very
exhaustive.

First, in terms of placing the
vector or the...

I mean, it's not really a point
anymore, but think about it as

a point.

You're trying to place this in
space and try to figure out, I

have 100 dimensions and I have
to situate somewhere in my graph

where that is.

It's going to be, you know, very exhaustive.

And then even if you are doing
some kind of exact nearest neighbor

search or some index search, there's
going to be a lot of cost to

it.

Yeah.

And this is where Postgres, you
know, started falling over in

a ways, you know, Because a lot
of these vectors we're seeing,

if you look at some of the popular
embedding systems, they're

1,536 dimensions, which again,
mind-blowing number.

Like I still don't even know what
a five-dimensional vector is,

and here we're talking about 1,500
dimensional vectors.

To do an exact nearest neighbor
search, as they call it, or the

k nearest neighbor, you have
to compare against every single

vector that you store in a database.

So you have a million vectors that
are 1,500 dimensions, you're

doing a million times 1,500 computations
on them.

And that's really expensive.

I can tell you it's very expensive
because I've had to, you know,

I've been personally, you know,
benchmarking this for quite a

bit.

Michael: And it makes sense, right?

It's at minimum, a sequential scan
of the entire table, plus

all of the CPU of doing those calculations
100 times per, well,

1,500 times per row.

Jonathan: If you have 1,500 times,
yeah.

And we can talk about in terms
of like all the, you know, all

sorts of the optimizations that
are in place there.

But this is where like that exact
nearest neighbor problem gets

hard.

And again, Postgres didn't have
the indexing mechanisms to handle

it.

Probably we'll further dive into
that as we talk more.

But actually, I'll say 1 of the
big ones is just in terms of

just storing the data, storing
the data within the index, that

you have to fit your index value
for a single row within an 8

kilobyte limit within the page.

So that's already going to start
creating some constraints.

Now, what's also happened over
the past 20 years, since the cube

type was added in Postgres, was
a whole field of modern vector

research, as I like to call it.

And starting to this idea of approximate
nearest neighbor.

And the deal with approximate nearest
neighbor is that you're

trying to find your nearest neighbors,
but with a best fit, which

is you're not going to search every
single vector in a particular

table or database, you can search
a subset of it, and you're

hoping that you're seeing the representative
set of the data

you want to return.

Now again, for us relational database
folks, that's mind blowing.

It's like, wait a second, I'm not
returning the exact answers.

I'm returning an approximate subset.

And I'll tell you, the first time
I used pgvector, and I was

playing around with it, and I was
not getting the results I expected,

at first I'm like, oh, this is
broken, this doesn't work, this

doesn't make sense.

And that was the app developer
in me, the DBA in me that started

diving deeper and trying to understand
like, oh, wow, there's

like a whole science around this
that we kind of have to relearn

when using a database.

Michael: Yeah, well, I often say
to people, you can run a database

without adding any indexes and
you can run any query you want

and get the results back.

It's just faster to use an index.

Your query result will not change
as a result of you having an

index.

And that sentence is no longer
true if we include these approximate

nearest neighbor type indexes.

So I have to stop saying that or
work out a way of caveating

it.

Jonathan: Yeah.

Well, just say, well, in this world,
when you're dealing with,

you know, vector similarity search,
but back to the question,

you know, how did I get involved
in pgvector?

So rewind to about the end of 2022.

We may have heard about some of
these generative AI systems that

were capturing the imagination.

And we go and we type things into
them and we get back these

human-like responses and suddenly
it's like, we're trying to

think like, oh, what are all the
different ways to use it?

And one of the techniques that emerged
as these systems grew in

rapid popularity was retrieval-augmented
generation, where these

large language models, foundational
models, have trained on vast

quantities of data, typically publicly
available data.

But you may have data sitting in
your database or some other

private source that is not publicly
available.

For good reason.

It might be your product catalog
or information related to a

documentation collection that you
have.

But you're trying to deliver something
to your users, your customers

that can take advantage of what's
in these generative AI systems,

but provide that personalized experience.

To do that, you need some way to
be able to take that data that

exists in your private databases
and safely and securely augment

them with your foundational models
to be able to deliver that.

And that's where the vector search
comes in because the vector

search provides a representation
that you can search over and

be able to augment those answers.

So seeing that, seeing this is
how people want to use it and

frankly, people want to use it
with Postgres.

A lot of the user and customer
conversations I had were around

Hey, I already have this data existing
in my Postgres database.

Or I've already built all my search
mechanisms around Postgres,

I want to be able to simply use
it.

And there was an extension that
did this, and it was pgvector.

So for me personally, it started
to become a perfect storm.

I'd studied this vector stuff in
college.

I was a hobbyist machine learningist,
albeit the field has gotten

way more complex than when I studied
it.

And I happen to like databases
as it happens, particularly Postgres.

So I looked at this extension.

It seemed like a lot of people
are starting to adopt it in a rapid

fashion.

I mean, if you look at the star
history of pgvector, you know,

it's a curve like this.

Like I've never seen that before
for a Postgres extension.

But I stepped back a minute as
well.

And I saw that these are net new
database workloads.

You know, a lot of what you see
in Postgres in the past has been

moving database workloads over
a lot of traditional transactional

relational applications.

Though, you might call it once
in a generation, which in the

tech world is every 10 years, there's
this new workload that

emerges that just generates net
new data.

So about 10, 15 years ago, we saw
that with JSON.

We developed this lingua franca
across the web that made it simple

to communicate between all these
APIs, but there became a demand

to be able to store JSON data directly
in a database and easily

query it out.

Postgres developed a solution for
that, being the JSON and the

JSONB data types, which became quite
efficient at being able to

query it.

What was nice is that the Postgres
community rallied around what

people were doing and created a
very efficient, effective solution.

Side note, I've talked to one Postgres
user that has 40 terabytes

of JSON in a single database, and
they say Postgres rocks it,

which, I'm like, wow, that's great.

I'm glad it works.

Michael: That's so nice for you.

Jonathan: Yeah, yeah.

It is really cool.

Again, I can't figure out how I
would personally do it, but I'm

glad that it works really well
for them.

But we're seeing something similar
with vectors.

Vectors are just a data type.

It's actually a well-studied data
type.

It's something you kind of look
at in your introductory computer

science classes.

But when it comes to efficient
data search and retrieval, there's

so many nasty properties with them
that make it a really fun

and challenging problem to work
on.

But the thing is, like, this is,
you know, this is this new workload

that is going to be available to
databases, including Postgres.

And for me personally, I want to
make sure that we're positioning

Postgres in a way to handle it.

So where I got started with
pgvector was first using it,

helping to get it deployed at AWS.

But like any open source project,
if you're interested in it,

just participate.

There's a few things that I particularly
focused on, which was

performance both from a contribution
standpoint, a testing standpoint.

And first, I would just want to
say a large chunk of the performance

features have been developed by
Andrew Kane.

He's done an awesome job on it.

And more recently, Heikki Linnakangas
has even further improved

the performance on it.

Where I stepped in was both just
benchmarking to make sure that

we're focusing on the right areas
of performance and trying to

prioritize what things to look
at.

And a few patches here and there,
particularly around some of

the index costing early on for
IVFFlat.

Michael: But yeah, I saw that.

Jonathan: Oh, thank you.

Michael: I would love to get on
to performance and tips for people.

But before that, I hadn't thought
of this as net new data.

That's a super interesting way
of looking at it.

And I hadn't thought of JSON as
necessarily net new data.

I was thinking of it much more
along the lines of the NoSQL workloads

that felt quite competitive with
the relational for a while.

But yeah, really interesting way
of looking at it.

And it does feel like an opportunity,
much like that did, to

be able to handle more within Postgres
with all of the benefits

that we've done episodes on, the
pros and cons of having data

inside and outside using specialized
databases.

But you gain so many benefits
like the transactional nature

of things or the low latency or
being able to join things together.

Like there are so many benefits
of having it all together.

And not to mention that lower maintenance
overhead or the lower

operational knowledge needed of
managing multiple systems.

So yeah, great point, love it,
thank you.

Jonathan: Cool.

Michael: Yeah, so I was looking
at 3 of your PRs and the IVFFlat

and the IVFFlat costings was looked
like a core improvement

to make sure Postgres is using
indexes in more cases where indexes

would be faster, which is the age-old
cost optimization problem.

And also more recently, you did
some great blog posts on the

addition.

So IVFFlat was the first index
type that pgvector supported.

And then more recently, got added
a very, well, not as new as

I expected when I looked back at
the history, but the competing

index type, HNSW, which you've
also been involved in tuning.

I've seen.

Jonathan: Yeah, so IV**F** Flat was
first.

And I think one of the reasons why
is that it is a bit simpler

to implement.

So it's a clustering algorithm.

Clustering algorithms are well
studied.

I mean, they're continuing being
improved.

But one of the things you have to think
about when you're looking

at implementing these algorithms
is that something that might

work for an in-memory workload
doesn't necessarily work for a

database workload, where you're
going between memory and an IO

level, be it wherever your storage
may reside.

And with IV**F** Flat, where I began
getting involved was at AWS,

we had rolled out pgvector.

I started talking to some of our
customers who were early PG

Vector adopters, and it was definitely
all over the board.

But I was talking to one who was
definitely very savvy on vector

searches in general.

And they were remarking that when
they were trying to increase

the number of probes in their query.

So a probe, so let's step back.

So the way IV**F** Flat works is that
you build clusters of vectors.

So let's say you have. Let's keep
it simple.

You have like 10,000 vectors. What
you want to do is you want

to try to bucket them into lists.

So let's say you want to have a
hundred lists.

So there would be a hundred vectors
per list.

No, there would be a hundred vectors
per list.

There we go.

Clearly, I did well with my math
major.

Michael: The arithmetic was not
included, right?

Jonathan: Yeah, exactly.

After a certain year, you never
saw a number.

So you have a hundred.

So let's back up.

So we have 10,000 vectors.

We want to put them in a hundred lists,
each having 100 vectors.

What happens with IV**F** Flat is that
you try to find centers.

So you want to find all the vectors
near a particular center.

Because the idea of that is that
those are going to be the most

similar vectors to each other.

So that way, when you do a lookup,
you're going to one of those

centers.

And the idea is that these are
likely, by approximation, the

vectors that are most similar to
the ones that you're looking

up.

However, it may not be.

Because let's say you have two centers
here and here, and your

query vector is here.

Which one are you closer to?

Michael:

Michael: I had to have somebody
explain this to me twice, and

the second time they used the analogy,
it worked better for me.

If we go back to 2 dimensions and
look at, let's say, a map of

Europe, and we had the countries
laid out.

We could have, for example, the
capital cities.

First off, we could compare any
1 point and say, which capital

city is this nearest to?

That's a much cheaper operation
than looking at every single

city.

Then we can go into that country
and look at which of the actual

towns or cities we're closest to.

But if something's close to a border,
or the fact that not all

countries are the same size, you
can end up in situations where

you don't necessarily get the right
center or country by doing

that first one.

Is that a reasonable analogy?

Jonathan: Yeah, that's really good,
because I started thinking

about that.

So I'm based in New York, and I'm
thinking what capital am I

closest to is technically Trenton,
which is in New Jersey.

So if I just looked at one probe,
it would say like, oh, you're

in New Jersey.

Yeah, actually, I think I take
3 probes because I think Albany

is the third closest capital to...

Michael: There you go.

Jonathan: So there you go.

Yeah.

Great analogy.

Michael: So probes is the number...

So you then do that first search
and say, which are the 3 closest

centers?

Jonathan: Yeah, centers list.

I mean, list is the term that's
used.

You might also be, it might be
used interchangeably with centers.

Michael: So if we increase that
number of probes, we're increasing

the cost of doing the query because
we're looking up more things,

we're comparing to more things,
but we're increasing the likelihood

we're gonna get back the correct
result or the number of things

that are correct.

Jonathan: In our expected set.

So the measurement of expectation
is recall.

So if you know that these set of
vectors are like your actual

10 exact vectors, but you get like
8 out of the 10, then you

have 80% recall.

So with IV-flat, as you say, as
you increase the number of probes,

you're going to increase your recall.

But you are going to increase the
query time.

Now, in that patch you were talking
about, the problem is that

we were actually over costing.

So anytime we were above 5 probes,
we then flip back to a sequential

scan, which will give you 100%
recall, but it was taking the

query time from, you know, it was
like a 50x increase in query

time based upon the dataset.

And we definitely still had room
to grow.

So the first patch really was focused
on, let's get the costing

to be more reasonable level, which
gets into, you know, startup

costs being one of those elements.

For me personally, it really was
my first time diving into the

Query Planner code.

I definitely saw all sorts of interesting
things.

First off, the Query Planner code
is...

There's a level of brilliance to
it.

It encapsulates a lot.

And the work that folks have done
over the past, you know, 35

years tuning it is quite remarkable.

I’d also say that it is quite clean.

It takes a lot, like, there's
a lot of it, right.

So like, there's a lot to dive
into.

But like, once you get into it,
you can navigate your

way around.

Editing it is another story.

It’s one thing to understand it,
it’s another to propose a modification,

but what’s really cool about Postgres
is that you can make a

lot of impact within an extension,
or basically you can add this

functionality without having to
fork Postgres.

You could just write it in a separate
module and then add it

in.

And pgvector does this through
the index access method interface,

which is actually quite, quite
robust.

Like we build index types that
don't fit the natural, you know,

B-tree ordering of a relational
database.

So it's quite powerful there.

And we were able to ultimately
get, I'd say, pretty good costing

for IVF Flat without having to
make any upstream modifications

to Postgres.

Nice.

Michael: And the other, yeah, so
on the IVF Flat, because of

these centers, my understanding
is it doesn't necessarily, it

suits a dataset you already have
that is static or not changing

much because if you're adding data,
it can’t adjust as it goes.

So there is this trade-off inherent
in it.

And I'm guessing some of the customers
you were talking to had

different workloads that wouldn’t
necessarily suit that.

And what actually, what happened
with your customer conversations

next?

Jonathan: Yeah, so that's a good
discussion.

And I’d say, you know, let’s burn
it.

Like, it’s users of pgvector in
general.

You know, a lot of these conversations
I have in the open source
```

project, which is, again, awesome.

So that's definitely something
that came up.

That if my dataset, if I keep adding
vectors to the dataset or

every now and then you update them
as well or even delete them,

what happens is that the centers
start to skew because where

you calculated where all these
centers are may start shifting.

So your results might start shifting
as well.

And you might not be getting the
expected results or the recall

that you'd want to see.

The other thing as well is that
for a lot of app developers,

I mean, first off, again, you have
to wrap your head around approximate

nearest neighbor that you may not
be getting the exact results

that you expect.

And like I said, I tripped up over
that as well.

Like it's, you know, it is, it's
normal to make that mistake.

You know, you have to learn something
different.

The other thing is that it started
turning app developers into

DBAs in the sense that you had
to worry about some of your tuning

parameters.

And granted, there are 2 of them,
right?

There's the number of lists when
you build the index, which is

not necessarily terrible because
as an app developer, you're

still writing your SQL code, you
are creating that index.

But you have to think, what is
the number of lists?

And the pgvector project gives
some guidance on the number of

lists to pick, but it's still like
you have to experiment.

And then you have to be mindful
of things like, oh, am I adding

more data to it?

Do I need to re-index?

How do I re-index?

You know, who's doing the re-indexing,
etc.

And then you have to select probes,
which again, you know, you're

writing queries possibly in your
ORM, and suddenly you have to

set like this magic parameter to
choose how many of these lists

you want to visit during a query.

So it's not as simple as set and
forget.

And that's when I come back to
my history being an app developer,

I always wanted to just write code
and I want it to all just work.

Yeah, I happen to explore the database
as a side effect of that,

not, you know, as a first-order
principle of that.

So it became time to explore an
algorithm that could be closer

to set and forget.

Like none of these approximate
nearest neighbor algorithms are

set and forget, but we can at least
make it easier.

And this is where HNSW came in,
a hierarchical net of small worlds.

So HNSW is a little bit different
than IVFI because it's not

cluster-based, it's graph-based.

And when you hear a graph, you
can actually almost hear, you

know, a graph is a superset of
a tree.

So even though Postgres is very
tree-based in a lot of its algorithms,

you can implement something graph-based
as well, particularly

with the custom index framework.

And the way an **HNSW** works is that
you build a hierarchy or layers

of vectors.

And as you go down the hierarchy,
the layers get denser, denser,

and denser.

And the idea is that you start
at the top layer and you try to

find which vector am I closest
to, you find your local maximum

or effectively the vector I'm closest
to, then you go down and

you search within that area.

You find the one that you're next
closest to and you go down and

then it's even denser and you search
and so on and so forth until

you get to the final layer.

And the vectors that you should
be clustered around are most

likely the vectors you're most
similar to.

When you're building the index,
there's 2 parameters that...

Now you have 2 index building parameters.

You have something called m, which
is the number of vectors that

you're linked to at a given time.

And pgvector defaults to 16.

And you have something called EF
construction, which is the search

radius that you're keeping as you're
descending down it.

The idea that if you look at a
larger search radius, you're going

to see the vectors that you're
more similar to.

The idea also with a larger M is
that you're going to be clustered

to vectors you're more similar
to as well.

But there's definitely trade-offs
between how you pick that.

What happens is that when you actually
do the query, you should

be looking over a smaller space.

So **IVF** flat is going to grow linearly
as you add the number of

probes and it's going to grow linearly
by the number of vectors

that are in each list.

So the idea is that if you can
keep the list relatively small,

you're going to be able to do faster
searches with **IVF** flat.

But there's also trade-offs to
that as well, that you might not

be getting the recall that you
want.

You can see on **IVF-LAC** where you
can get expensive over time,

particularly if you're linearly
growing it.

With **HNSW**, you're only searching
over a smaller subset of the

space.

Again, this could be dictated by
your one search parameter, which

is **EF search**, but you're only looking
at a subset of the graph.

So you're going to be looking at
far fewer vectors.

And the idea is that you're being
put into a cluster that's very

similar to you.

So there's trade-offs.

With **IVF** flat, I can I can build
the index super quickly?

And technically, if I'm only using
one probe, I might be able to

query it super quickly.

But the trade-off is that to boost
recall in **IVFlat**, what we've

seen empirically is that it's going
to get more expensive.

With HNSW, we can query really
quickly and we can query, you

know, and again, the results are
showing that we can query and

get high recall or, you know, high,
you know, we're basically

seeing the expected results and,
you know, with pretty good accuracy.

But the trade-off is going to be
on the build time, because we're

going to do a lot more work up
front to build that hierarchy

and see all of the, you know, basically
try to visit enough vectors

that seem most similar to me.

And that's the push and pull between
the, you know, these 2 algorithms

and just vector similarity search
in general is that you're going

to have to pay the cost somewhere,
and you have to figure out

what that cost is going to be and
what makes the most sense for

your workload.

With HNSW, at least, if you're
willing to pay that upfront cost,

which it seems like a lot of folks
are, you are going to get

this very high-performance, high-recall
system, most likely.

And again, there are other factors
that go into this as well,

including your embedding model
or how you're querying the data

or how quickly you're ingesting
the data.

So, like, there's a lot to consider
here.

Michael: Yeah, I'm a performance
guy at heart, and I love it

when there is actually a true trade-off
and people in different

situations will have different
preferences.

But to me, it seems like a lot
of the currently popular use cases

can pay or have a desire for as
high a recall as is possible

within a latency threshold.

So they're willing to pay some
latency and index build

time costs for as high a recall
as possible to a point.

And then everyone has a different
point at which that is.

But yeah, I actually wanted to
ask almost the opposite question.

Do you see some use cases where
the index build time itself is

the primary driver?

Or any other reasons that you're
seeing people choose IVF flat

at the moment?

Jonathan: Yeah, there are cases
where you might be needing to

do rapid loads and a bit of analysis,
but it's very transient.

So one of the cases I've heard is
that there's a system with, let's

say like a hundred million vectors,
and it's really more about

getting a rapid build and doing
a spot check on the data within

it, as opposed to having something
that's permanent that needs

to be run all the time.

I think it's like a lot of these
transient or ephemeral workloads

where an IVF flat index can make
sense.

But frankly, like I've talked to
people who have just rolled

out production systems with IVF
flat with several million vectors

in it.

They're perfectly happy.

They're like, we're getting the
results that we need.

We're getting it in the amount
of time that we expect it to return

the query in.

And they're completely fine with
it.

So some of it is personal preference.

And some of it, like I said, depends
on other factors, such as

the data that has been vectorized.

How similar are the vectors based
on the embedding model?

And are you able to get the distances
within a range where you're

seeing the set that you want.

Michael: Awesome.

That's really interesting.

So you've got a great talk on this
that I'll link up that you

gave recently at PGConf EU.

And you've also had a couple of
great blog posts on the releases

around 0.5.0 and 0.5.1.

I'll link those up so people can
read them in their own time.

I did actually want to ask you,
you mentioned some tuning, like

people becoming DBAs and having
to think about, You mentioned

right up top the page size of PostgreSQL,
for example.

Do you want to talk a little bit
about why that's important and

some of the lower level stuff that
people need to think about

when they're doing this?

Jonathan: All right.

So, tuning or technical?

I heard 2 different things.

Michael: Let's go tuning as more
practical.

We don't have to explain TOAST.

We've done a whole episode on TOAST
before.

But there was some interesting
stuff in there that really got

me thinking that I hadn't thought
about before.

Jonathan: Yeah, so there's a whole
bunch of areas to go into.

So just to briefly recap, because
I think this does impact tuning.

So TOAST is a system that, well,
step back, rather.

So the foundational unit of PostgreSQL
is the page.

That's the atomic unit.

When you store data, you're actually
storing it within this atomic

unit.

You're not just storing a row randomly
on disk.

It's going to be fit within a page.

By default, the page for PostgreSQL
is 8 kilobytes.

Now you can recompile PostgreSQL
to use a different page size,

but most folks just use the default
and for a lot of good reasons.

What's interesting is that, you
know, so if you have data that

doesn't fit within a page, what
happens is that it gets TOASTed

and it gets stored out of line
and you can store it arbitrarily

large.

I believe it's up to a 1 gigabyte
per field.

Now, what's interesting, well,
there's 3 interesting things here.

So first, PostgreSQL has what's called
a minimum TOAST threshold,

which is 2 kilobytes.

So anything above 2 kilobytes is
going to get toasted, unless you

change that threshold.

The second thing is that index
pages must abide by that 8 kilobyte

limit.

And you can actually TOAST data
in line with an index page to

shrink it down a bit so you can
get a little bit more on that

index page.

But if it's gonna be over 8 kilobytes,
Postgres can't index it.

So this is where it gets interesting
for vectors for a few reasons.

So first, any vector that's above
that 2 kilobyte threshold,

which I believe, I think I forgot
off the top of my head, it's

around like 514 dimensions for
pgvector currently, is going to

get TOASTed.

Okay, that might seem okay.

But any vector currently, there's
a hard cap in terms of indexing

pgvector vectors of 2000 dimensions,
which there are some valid

use cases I've heard of vectors
that go beyond that size, but

typically most things are going
to fall within that range for

now.

Now, there's a few things here.

First, you're like, well, you say
you can TOAST things in line.

So why can't we TOAST these vectors
in line?

Well, I hate to be the bearer of
bad news, but it's actually

very challenging to compress effectively
2,000 dimensions of

random floating-point numbers.

There's not really much you can
do other than dimensionality

reduction to shrink it down.

There are some techniques out there
called quantization, which

you can touch on after this, but
they all have their tradeoffs,

such as losing information.

The second thing is that TOASTing
can screw up the query planner

a bit.

What do I mean by this?

So, currently when the query planner
is trying to estimate parallel

workers or essentially reading
data in parallel, it's going to

look at, it basically looks at
your heap pages or your main table

pages and uses that to drive the
estimate.

And the thing about TOAST is that
your heap pages actually be

quite small, you know, in this
case, let's say if an ID and a

vector.

Well, I'm not going to have that
many pages in my regular heap

because it's just going to be a
bunch of IDs and pointer to my

TOAST table.

The TOAST table is going to be
quite large and likely you need

those parallel workers to suck
all the data out.

But what Postgres is going to do
is it's going to underestimate

the workers today.

And it actually makes sense historically
because the data that

you typically TOASTed was not in
your search path or your ordering

path.

It's typically something that you
need to do some post filtering

on as you pull it out.

You know, think like, you know,
a big blob of text.

But here, we are actively querying
the vectors.

We are calculating distances between
them.

So we probably want them closer
to our main set of data.

Now, we can choose to store the
vectors in line.

You use this technique called setStoragePlane,
and that will

keep your 1500-dimensional vector
in line with your other table

data.

But also keep in mind, this is
going to cost a full page because

a 1500-dimensional vector is about
6 kilobytes and Postgres is

just going to have to allocate
a page for each of them.

So no matter what, you're storing
8 kilobytes of data in them.

So this gets very interesting.

So in terms of tuning, again, part
of this, you've got to look

at what your workload is and what
makes the most sense.

What we're seeing so far, at least
with HNSW, is that even if

you're toasting a 1500-dimensional
vector, the estimates are

still pretty good overall for making
sure that you're using your

HNSW index.

I haven't seen as much impact there.

We saw more impact, particularly
with IVF flat based upon some

of that costing model.

Michael: But

Jonathan: I think there's some,
this is one of those areas where

I think there's some improvements
to how we work with toasted

data upstream.

I think I have a couple of emails
or threads on that subject.

And if I get my C chops better,
maybe I can propose a patch.

But that's one area to be mindful
of, is how you store the data.

And again, I think we're starting
to get a little bit more set

and forget there.

But I think as we see these workloads
increase, it is definitely

something to be mindful of.

Michael: From a bias standpoint,
as a Postgres supporter and

advocate, I've seen quite a lot
of benchmarks that show pgVector

doing very well as a vector database
compared to other dedicated

systems.

So that's impressive, considering
they're probably not even tuning

some of those lower-level things
while doing those benchmark.

So it's exciting for me to hear
that there's actually some potential

further wins there on the Postgres
side.

Jonathan: Yeah, I mean, maybe like
hot off the presses, Andrew

proposed a patch last night for
speeding up HNSW index building.

Since the 0.5.1 release, there's
already been a lot of work to

improve this, but the patch last
night even further accelerates

it.

So there's support for parallel
builds for HNSW indexes coming.

And the initial work still had
to leverage a lot of the data

coming from disk as opposed to
being fully in memory.

The proposal last night is fully
in memory.

And I actually did a test, I was
so excited to see it, I did

a test.

I have this 10,000,500 dimensional
vector dataset, or really

like 1,536 dimensional dataset.

Yeah, random vectors, right?

This is really more just for like
beating up on performance.

You know, I'm not measuring recall
here, which for real benchmarks,

you got to measure performance
and recall.

Michael: But I'm

Jonathan: trying to just understand
like how quick is the index

build time.

So I did this with what's currently
on the master branch to date.

I used this and then I also did
the test with the HNSW fast build

branch, which is this in-memory
system.

So I saw a couple of things.

So first, this new branch was about,
I think, 7.3x faster at

building HNSW index than the other
branch.

And just to compare, like this
gives some real numbers.

So the old branch.

Which is you know this you know
the unreleased support for the

HNW parallel builds it took about
3 hours and change to build

the entire index.

And this was with like 64 parallel
workers.

I'm throwing, this is a big beefy
box.

With the new branch, it took 25
minutes.

It's like, that's mind-blowing,
right?

These are big vectors.

You're doing a lot of computations.

And I even had cranked up the EF
construction value, which does

increase the time.

I did compare it to this previous
method I'd been recommending,

which was concurrent inserts.

And I just, you know, I did a spot
check between a blog post

I wrote about it and this new patch.

In the blog post where I had a
lower value of EF construction,

it was I think 64, I was getting
a little bit over 1,000 vectors

per second, looking over the entire
front.

With this new technique that Andrew
posted, I was getting over

6,500 vectors per second.

Michael: So again, huge.

Jonathan: Yeah.

This is what's cool about this
because this is 1 of those set

and forget things that you might
need to tweak.

You might need to tweak 1 parameter.

In this case, I tweaked the max
parallel maintenance workers.

But again, huge performance boost
in terms of the index building.

It simplifies it too, because now
it makes it truly viable to

say, preload all your vectors,
which will be a much faster operation

than doing a concurrent insert.

I mean, because there's a current
insert and even like further

speed up your load of your vectors.

But then you can just do create
index, you know, embeddings using

HNSW and boom, like you have an
index way faster than other systems.

And like, this is a key thing to
look at.

You know, you touch on this, Michael,
is that when you're dealing

with this vector data, you gotta
look at everything.

You gotta look at your ingestion
time, your index build time,

your query time.

And you know, you have to focus
on all these things because there

is a trade-off with all of them.

Michael: Yeah, well, exciting times
ahead.

It feels like the most rapidly
improving area at the moment and

for good reason.

So I saw on this, it's got a great
changelog, pgvector.

And I saw there's like even the
unreleased part.

So 0.5.2 is currently on there
and unreleased.

So anything you're particularly
excited about that's coming up

or anything not on that list that
you'd love to see?

Jonathan: Yeah.

And I think this is where it really
helps to have feedback on

how people are using it.

So I was very excited about parallel
HNSW builds.

Like this is, I can't, even like
in the middle of that test I

was running last night, like I
emailed Andrew, like Andrew, the

results are amazing.

Like I can't, I can't wait for
it to finish.

Like I can see where this is trending.

So there's 2 things in particular
I'm very excited for.

The first, you know, the first,
and we'll see if it makes it

in, but it's around pre-filtering.

So a lot of the queries today we've
been seeing is like select

star from vector table, order by,
blah blah blah, you know, limit

10.

Find me my 10 nearest neighbors.

But what's happening in practice?

In practice, it's like select star
from table where, you have

some condition, category ID equals
this, order by blah, blah,

blah, blah.

So what this is kind of like is
really a multi-column index,

because what will happen is like
today, either you won't use

the index, effectively you're doing
an exact nearest neighbor

search, which means very accurate
results.

Maybe your dataset gets sorted
down to a small enough value where

it doesn't matter.

Like you're getting a very fast
search.

But what if you have multiple fields
that you need to filter

over?

Or what if the dataset you get
back is 50,000 of these 1,500

dimensional vectors?

Like this could be exhausting.

So there's a patch out there, it's
based on a newer paper called

HNSW, which I can let pgvector
repo right now.

Yes.

That lets you build effectively
a multi-column index where it's

able to build the links or group
together.

Well, there are 2 things, right?

'Cause you still want to be able
to search over your entire vector

set and find your most similar
vectors.

But you can also group the vectors
by your filters.

So that way, you're searching over
just your filters and it does

that pre-filtering.

This is like a what have you done
for me lately feature, because

as soon as you put it out, users
find like, you know, more of

these case studies emerge and users
find like, hey, like, I really

need this.

But it's great is that there are
people who are testing it against

real data.

And this is where if you want to
be involved in pgvector, you

can help is that you see these
patches out there, please test

and report back on them.

Because if you're finding the HNSW
brands useful, describe

your use case.

Like there's an open issue where
people talk about, you know,

they might say like, hey, I really
need this.

And what makes it super valuable
is when you talk about like,

here's exactly how I'm using it,
because it just further justifies

that this is the right direction.

I mean, 1 of the big goals of pgvector
is to try to keep the

code base simple to maintain and
also make the product simple

or the extension simple to use.

So hearing more about what people
are doing is great.

Another patch that I'm excited
for, and again, we'll see if it

makes it into 0.5.2, you know,
no guarantees, is being able to

support smaller dimensional sizes.

So right now, dimensions are 4-byte
floats.

But there are definitely embedding
models that provide two-byte

floats or one-byte unsigned integers.

So there are a couple branches that
have those in there.

But again, hearing those use cases
will help further support

it.

Because the added bonus to supporting
these smaller dimensions

is that we can index larger vectors
and that'll be able to go

beyond that 2k limit.

Michael: Yes, interesting.

The 1 final question I had is,
do you see any world where this

becomes part of core Postgres grow
in the long term?

Jonathan: Yeah, so at PGCon 2023
last year, I had a lightning

talk, which is basically to first
shout to the wind to the community,

like, hey, like, these vector workloads,
this is real, like,

that was still early.

But it's like, hey, this is coming,
there's a, I call it like

a storm of data coming.

Like we want to make sure Postgres
is positioned for it.

Like this is very similar to what
we saw with JSON.

And I was able to get an unconference
session as well where

we discussed it.

And the general consensus was in
the fullness of time, it does

make sense to have something like
this in upstream Postgres.

Great.

But I think there's a few things
here.

First, we have to look at release
cycles.

Postgres releases once a year.

In fact, the feature freeze for
Postgres 17 is coming up in about

less than 3 months.

But effectively, there's a...

Now, if you think, let's say you
come up with the idea we want

to support a vector data type right
now, or let's say after feature

freeze.

I mean, there's effectively an
18-month window before it gets

in, because we have to go through
the whole cycle.

And given the pace that this field
is moving, we don't necessarily

want to wait on the Postgres release
cycle.

So being able to do this work in
pgVector or other extensions

does help accelerate adoption of
Postgres as a vector database,

so to speak.

The other thing is that once it
is in upstream Postgres, you

know, that is the on-disk format.

Like that is the rule.

And Tom Lane made a very good point
during that Unconference session,

which is like, let's see, you know,
how things shake out in terms

of what the on-disk format is.

Now, pgVector is also trying to
stay true to that contract and

try to keep the on-disk format
as, you know, effectively not

to change it.

Because as soon as you change it,
you gotta like rebuild, re-index,

restore everything, and that's
a costly operation.

So pgVector is trying to apply
the same level of rigor as Postgres

to implementing these features,
but it can move a little bit

faster because it is an extension.

It can have its own release lifecycle.

So I think that's where the, you
know, is that I'm not gonna

say this is an official or unofficial
community position, but

is there an interest in supporting
it upstream?

Absolutely.

But given the rapid emergence and
development we need to make,

we're trying to make as much progress
as possible within pgVector.

Here's the other thing too, like
the big difference between now

and JSON.

We have the index access method,
we can do this in an extension.

That is a big change since 10,
15 years ago.

Michael: That's really good, interesting
and good to know.

There's another big difference,
which is cloud providers.

So whilst it could be in core Postgres,
currently, a lot of people

that install Postgres themselves
or manage Postgres themselves

more specifically can install pgvector.

And people that don't are often
on a managed service provider.

Those folks, most managed service
providers have added support

for pgvector already, which is
the fastest I've ever seen them

add a new extension.

And it's been pretty much everyone.

So most folks that want access
to pgvector, at least some version

of it, maybe not always the latest
version, can have it now as

well.

Jonathan: Yeah, it's definitely
it's definitely exciting to see

all the adoption of pgvector.

And yeah, yeah, yeah, to tease
a little bit, I think I still

think the best is yet to come.

Like I think pgvector has made
a lot of progress in the past

year.

I mean, it's been a tremendous
work by the community, and in

particular, Andrew.

Like, I can't say enough nice things
about the work Andrew has

done and really the diligence he's
put into it.

And I still think there's more
to do.

Like I said, even just like, well,
the patch that came out last

night, I mean, just shows that
there's still even more performance

that we can get out of it.

Michael: And for folks listening,
we're recording on the 16th.

So I'll link that up.

Wonderful.

Jonathan, thank you so much.

Is there anything else I should
have asked you but didn't?

Or any last things you wanted to
say?

Jonathan: Yeah, I think one of the
big things when you think about

contributing to open source is
writing a patch, writing code.

And I'm going to say this in the
context of pgvector, but I

think this applies to any project,
even Postgres itself, is that

there are many different ways to
contribute.

Testing is huge, because testing,
particularly if you can test

something that's close to a real
workload, like don't test your

production workloads with these
things, but like test something

that's close to production.

That helps because that helps drive
the use case and hearing

how you use something and just
talking about your different use

cases of being supportive of the
community in that way.

That helps.

Helping to write documentation,
helping to advocate for something

that you think can help others.

Again, all these things can help
a project.

So if you want to contribute to
pgvector or Postgres, there

are a variety of different ways.

And maybe too, as it's top of mind
right now, PGCon, which was

effectively the annual developer
conference.

It's evolved.

It's now called PGConf.dev, P-G-C-O-N-F.dev.

It's being held in Vancouver at
the end of May.

I can tell you, I can almost guarantee
vector workloads will

be a topic of discussion there.

But just all things Postgres.

And the idea is that while certainly
a lot of the folks that

the discussions are around technical
hacking topics, really,

if you step back, the gist is how
do we continue building and

growing the Postgres community?

So if you are the CFP actually
just closed yesterday, or when

this airs probably a few years
before, but that's a great way

to participate as well.

Even if you're new to the community,
because I know personally,

at one point I was new to the community,
and the first time I went

to PGCon, I'm like, oh my God,
like I know nothing about how

a database works.

Like I can write a select query,
but geez.

But it is a way to help have an impact
on the community.

And just talking to folks who are
working on the software or

working on events or hosting podcasts
or finding ways to help

grow the community, it's a great
way to participate and help

growth.

So certainly think about attending
and participating.

And again, there's all sorts of
different ways to contribute.

So that's the parting message I
have.

I'm far from a database hacker.

I can write a couple of lines of C
here and there.

But where I've found a home in
open source is working on all

sorts of other aspects around open
source projects.

Thank you for having me on your
wonderful podcast.

Michael: Oh, cheers.

Take care

Jonathan: Take care.