A weekly podcast about all things PostgreSQL
Nikolay: Hello, hello, this is
PostgresFM, episode number 101.
My name is Nik Samokhvalov, founder
of Postgres.AI.
I thought about...
Since last time we had episode
number 100, and it was interesting
because we discussed databases
at scale of 100 terabytes and
beyond.
Logically this episode should be
something very simple and beginner
entry level, because it's 101.
You know, in the US, 101 means
something like some introduction
to something, right?
But I guess we will dive quite
deep, I hope we will dive quite
deep into some details of some
things.
Very interesting to me personally.
But also, since a couple of days
ago, Timescale released an addition
to pgvector called pgvectorscale.
I asked developers to join and
they agreed.
So meet Mat Arye.
Hi, Mat.
Thank you
for coming.
Mat: Thank you for having us.
Nikolay: And John Pruitt.
Thank you for coming as well.
It's great to have you.
And Let's start, it's kind of an
interview, but maybe it will
be some informal discussion, because
I don't know, we know each
other for some time and why not,
right?
Let's start from the beginning,
from maybe some distance.
Why do we need at all some vector
search in Postgres and maybe
some vector search in general?
Because recently, not long ago,
LLM became...
It's becoming bigger and bigger
at a very high pace, in terms
of how many dimensions, but also
in terms of context window.
Gemini now supports 1000000, there
are talks about 2 millions.
Although I must admit, Gemini is
terrible in terms of reliability.
We have a lot of credits being
an AI startup, we have a lot of
credits to spend.
And it's a very well-known problem
with Gemini that 500 errors
all the time.
It's making me very sad.
But 1 million context window means
that Probably we don't need
vector search at all and we can
just put everything into 1 prompt,
right?
There was such a crazy idea when
context became really large.
Imagine if we have like 100 million
tokens context window.
You can just put everything to
a single question, or it's an
insane idea?
What do you think?
Mat: SRS- I think there are a few
points to point out.
First of all, there's still the
bandwidth that you need to use
up to send the question to the
LLM, right?
So there is going to be a pure
networking limit at some point.
Another point is that even with
very big context windows or token
windows, yes, you can give the
LLM a lot of information, but
there's actually an open question
about whether it will use all
of the information it gets well.
And there's some academic research
out of LLM that's actually
showing that giving it less information
but more relevant information
in the prompt gets you better results
than giving it all the
information you have, right?
And so, I mean, with LLMs in general,
all of this is purely empirical
research.
There's actually not very good
scientific backing either way,
but with the LLMs that people have
tested, I know that huge prompts
are often not very good in terms
of performance, in terms of
the answers you get.
So in that sense, narrowing prompts
down using either VectorSearch
or TextSearch or HybridSearch,
whatever you want, in many cases
gives you a better answer.
You a better answer.
And apart from that, when this
is a bit controversial, I'm actually
not convinced that RAG is the killer
app for vector databases.
There's a lot of other use cases
for vector databases, starting
from clustering, recommendation
engines, plain old search engines,
right?
A lot of other things that have
nothing to do with the rag and
that increasing context windows
were not, you know, help or help.
These are just orthogonal things.
And the basic thing that what vector
search gives you is semantic
search.
There's a lot of applications for
that.
Nikolay: So not only RAG, right?
So interesting.
Yeah, that's...
Okay, so your answer has 2 points
to summarize.
First, we still need to narrow
down and find...
And if we put everything to 1 question,
it will be not efficient.
I agree with LLM sometimes.
If you put too much information,
the answers are not good.
Quality of answers decrease for
sure.
And second, it's not only about
RAG.
I would like to explore this.
Do you see some good examples beyond
retrieval-augmented generation?
Actually while you think, Let me
just spend maybe half a minute,
because I know our audience is
mostly Postgres engineers, DBAs,
and backend developers.
Of course, many already heard about
RUG, but just a small recap.
It's quite a simple thing.
Actually, you're right.
Maybe vector search is not needed
in RUG.
For example, we can use full text
search there.
So the idea is when some request
is coming to LLM, before passing
this request to LLM, we have intermediate
software which finds
additional information which should
be relevant.
Here vector search can be useful,
but maybe not only vector search,
for example, full text search.
Then We already augment this request
with this information, and
LLM has some context.
Basically, it's similar to if you
go to exam and you have a lot
of information in your pockets,
When you need to answer, you
can get this information out of
your pockets and answer better.
Or use Wikipedia, for example.
Similar.
But you just told me that it's
not only RUG.
What else?
What else for vector search particularly?
Mat: I mean, fundamentally, I think
any time that you are right
now using full text search or text
search, you could substitute
that with semantic search to give
you better results.
Let me just give a simple example
for people.
With the full text search, searching
for a query on car does
not return results that have truck
in the document, right?
Semantic search solves that because
it's kind of, I mean, that
sounds almost magical, but it's
kind of a search on meaning.
So things that are close together
in meaning get retrieved.
So you could use a totally different
lexical structure, totally
different word, and somehow this
type of search figured out that
they are similar in the meaning
space, if you will.
John: Or you could use
it to
augment full-text search
or as a hybrid of the 2.
Nikolay: Have you seen good examples
of this?
Like I saw terrible techniques,
like for example, let's find
100 results from full-text search,
100 results from vector search,
and then combine them.
But always I have a question, what's
the ranking system?
Because it's very different, and
sometimes we want fresh information.
Specific question I would like
to explore maybe later, how to
reflect freshness of data.
We can discuss it maybe later because
I also already want to
discuss what you particular just
released.
But did you see any good examples
of combination of full-text
search and semantic search like
vector search?
Mat: It's hard to say because we
are database providers so we
kind of see the...
We don't see the examples that
our customers have.
We get told by our customers that,
hey, this technique works,
you find it useful, blah, blah,
blah.
But we rarely actually see the
results ourselves.
Actually see the results ourselves.
But like anecdotally, a lot of
people are using hybrid set.
And like, in the AI community,
I think it's almost becoming
the standard.
What is potentially a problem for
Postgres we can go into later,
but That's what we're hearing at
least a lot.
But I will agree that this is totally
all ad hoc, non-scientific,
like, crack my left ear with my
right arm type thing, right?
Nikolay: For me, it looks almost
always like all examples I see,
they look ugly because, for example,
if you think about pagination,
there is no way to have pagination
there.
Although if we go back to vector
search, it's also problematic
there because to go to page number
100, you need to extract all
100 pages, and it's very inefficient
performance-wise.
Let's discuss the project.
It's called pgvectorscale.
It looks like you just append a
suffix scale to pgvector.
It's a very interesting approach.
Let me ask you this.
It's very well known that TimescaleDB
is a very great thing
for time series.
It has a free version, and you
can host it yourself, a community
edition, I think it's called.
I doubt it's true open source because
it doesn't have OSI approved
license.
So it's like some specific thing.
Or maybe Apache license exists,
but it lacks a lot of good stuff.
For example, I think it lacks compression.
Maybe I'm wrong.
Is it?
Mat: Yes, it lacks compression.
So we have an Apache version, we
have a community version.
Nikolay: Right.
I think you can use the extended version,
still not paying if you
host it yourself, but it cannot be
considered true open source because
this license is already something
like hybrid and so on.
Mat: We call it source available.
Nikolay: Yeah, yeah, this is a
common term.
So then Postgres itself has a very,
very, very simple and permissive
license.
That's why many commercial products
developed on top of it.
Then we have pgvector, which is
a separate extension, which in
my opinion should at some point
go to core Postgres.
But there is a big question, how
come?
Because it's breaking all the rules,
because just creating an index
you might have results which differ
from original results without
index, right?
It's something which we didn't
have before, because as always,
indexes just sped up queries.
Now we can change results because
it's approximate neighbor search.
This project, I think it's...
Which license is it?
I don't remember, but I think...
Yeah.
It's also Postgres license, Right.
And maybe that's why you also developed
your product on top of
it using Postgres license.
But it's very interesting because
if before Timescale had this
major product, everyone loved,
compression especially, but only
if you go that route, source available,
or you go to Timescale
Cloud, now you release this thing
on Postgres license, meaning
that anyone, any cloud provider,
most of them already have pgvector,
so naturally they should already
be considering adding your extra
to their services.
I think, of course, you've thought
about it very well.
So my question is why you changed
your approach compared to Timescale?
Mat: Yeah, so I'll speak for myself,
maybe not for the company.
I think, look, we all love Postgres,
but I think Postgres needs
different things in different areas.
Postgres, in the time series area,
Postgres already has a lot
of functionality to build in and
we are already building upon
a well-known foundation where I
would argue that given our broad
definition of time series, in many
ways Postgres already dominates
the market.
So we felt that we could build
a company while contributing some
things to PostGrid, but also quite
frankly making a business
for ourselves and making money
in this area by preventing hyperscalers
from making all the money off of
our work, which is what the
TSL license does.
It doesn't stop any individual
from benefiting from our work
or really any company, it only
stops the cloud providers from
making money off our work.
And that was then for pragmatic
reasons so that we could make
a business.
In terms of vector, I think the
whole vector market is much more
nascent and much younger.
I don't think Postgres has won
the market a lot.
You can look at how many other
databases have gotten millions
of dollars in the past year to
develop their own solution.
This is a very hot, fast-moving
market where I don't think Postgres
yet has a particular...
It hasn't won, right?
And so, because we like Postgres
so much, we wanted
Nikolay: to know when
Mat: Postgres wins.
And so...
Nikolay: There are many benchmarks
where Postgres has very poor
results with pgvector on ivfflat
index without HNSW and so on,
right?
And they're still around.
Mat: I will say with pgvector 0.7,
things have improved on the
pgvector side quite a lot as well.
But yeah, and I think not only
in terms of benchmarks, I think
in terms of actual usage, if you
know, last episode you talked
about 100 terabyte databases, We
want to reach 100 terabyte databases
that have extra data.
We have to do a lot of work to
make that happen, right?
And I think that's the ultimate
goal.
And there should be the ultimate
goal of everybody in the community.
We thought that it was more wise
to help the entire community
succeed at this point in time.
Nikolay: Yeah, when I saw benchmarks,
like 100,000, 500,000 vectors,
I was like, what's happening here?
People say, oh, it's already a
lot.
It's not a lot.
You take any company, take their
databases, and they have already
so many data entries in their databases.
Out of those entries we can create
vectors usually.
It means that we should speak about
billions already, but it's
not there yet because it's problematic
because vectors are large
and building indexes is slow, takes
a lot of time, and the latencies
of search, of course.
Everything is quite slow.
I'm glad you did your benchmarks
starting with 50 million vectors,
already very good.
I'm super glad to hear that you
talk about terabytes of data.
By the way, some people say, okay,
for us, for Postgres guys,
not for Timescale, regular Postgres
guys, 1 terabyte is usually
1000000000 rows.
With Timescale compression, it's
many more.
1 terabyte should be maybe tens
of billions rows, right?
And I know...
Tens of billions.
Right, right.
So it's a different order of magnitude.
But if we talk about pgvector,
50 million vectors.
Let's talk about...
This project has 2 things to bring
on the table.
Maybe let's close first with a
non-technical question, because
I cannot skip this question just
sitting in my head.
So pgvector had ivfflat index originally,
then HNSW was added.
I remember NUAN participated, starting
with a separate project,
then decided to contribute to pgvector.
So now it has 2 indexes.
And pgvectorscale brings a third
type of index, right?
It could be a pull request to pgvector,
no?
Mat: So we actually talked with
Andrew Kane about this.
The issue is that we are written
in Rust, not C.
And Andrew thought that it would
be better as a separate project.
I can't say I disagree.
I think either could have worked,
but I think given the language
difference, it makes some sense
to have it as a separate project.
But we did all for...
Nikolay: Yeah, well, makes sense.
Makes total sense.
But why Rust?
Mat: Because I like to work quickly.
And I thought Rust would allow
me to do that more than other
things.
I should say I'm 1 of the people
that worked compression Timescale
and we did, you know, our compression
system had our own data
type on disk and I remember the
pain of having to make sure that
the way you are writing data on
disk would work very well.
And the cross-over platforms, big-end,
little-end, all of these,
the alignment issues, it's just
a lot of double checking that
you have to do to make sure everything
is correct.
I wanted to avoid this.
And with Rust, a lot of this
could be a lot simpler.
So, like...
Nikolay: Yeah, makes total sense.
Yeah.
Mm-hmm.
Right, right.
So, this makes total sense, but
for end user, it's interesting.
Especially those who self-manage.
So you need to take Postgres, then
you need to bring pgvector,
then you need to bring pgvectorscale.
It's an interesting situation.
We do
Mat: have web-based packages.
Nikolay: This is interesting, but
now I understand better.
Let's talk about the index itself.
What can you tell us about the
index compared to HNSW?
I know HNSW is like a memory thing,
this works with disk, so
this is definitely the right thing
to do if we talk about terabytes
of data and billions of rows and
so on?
Mat: Yeah.
So I'm going to hand wave a lot
here because this is a lot of
history, but it turns out for various
reasons that the best way
to index vector data is using a
graph structure, so a graph database.
You can imagine each vector is
a node and is connected to other
nodes that are mostly that are
close to it, right?
And then there are some for edges.
And the traditional problem in
these types of indexes is getting
from 1 end of the graph to another
end of the graph.
So your starting point is very
far from your query vector, which
is what you're looking for.
It takes a lot of hops in these
graphs.
And HNSW and DiskANN are pretty
much different ways of solving
that basic problem.
HNSW solves this by introducing
layers into the graph, where
the top layer where you start only
has long distance edges.
So it kind of allows you to jump
a long distance but close to
where your query is, and then you
go down a level in order to
make more fine-grained jumps.
And there are several levels.
By the time you get to the lowest
level, you are kind of in the
most fine-grained area of the graph.
And those top levels are the things
that help you solve this
long jump issue, if you would.
This KNN doesn't use multiple graphs.
Instead it kind of uses a neat
trick in the construction to...
How do I put this?
It puts in enough long edges to
be useful.
So the way it constructs the graph
is modified from the regular
way these types of graphs are usually
constructed, specifically
to inject these long edges into
the graph, which probabilistically
allows you to jump faster to where
you need to go.
And this kind of flat structure
is what allows you to keep a
better locality of where you are.
So instead of using multiple levels,
the flat structure allows
you to make fewer hops in memory,
which allows us to work on disk,
on SSD rather than RAM.
Because...
Nikolay: Yeah.
I noticed, by the way, maybe it's
like a side note, but I noticed
you in Postgresstab you use local
SSD disks, right?
Not like EBS volumes.
Is it fair?
I mean, they're ephemeral.
If you restart the machine, you
lose them.
So in your benchmark particularly,
I noticed this small point.
Mat: Our benchmarks very concretely
benchmark PyCon against self-hosted
Postgres.
And the most self-hosted Postgres
people I know actually use
NVMes and they do like...
Nikolay: But not local NVMes.
Mat: But you can do backups other
ways, right?
You could
Nikolay: do string
Mat: replication, you could do
other ways to get around this
issue of things don't go away.
Nikolay: I'm just very curious
how much it helped because of
course latency is better, discrepancy
is better.
Mat: I can tell you it helped immensely
and I can say that we
are right now actively thinking
about how to bring this kind
of performance to cloud as well.
Stay tuned, I think we'll have
some exciting stuff coming out.
But yes, that's a very astute observation
because, you know,
Quite frankly, none of this works
on EBS.
You need the random read performance
of SSDs to make this work.
So by the way, I would be shocked
if Pinecone wasn't using...
Nikolay: ...VMWare's NVMes?
Mat: I don't know.
I have no idea what they use, but
I would be shocked if this
is EBR based.
Nikolay: And for these 50 million
vectors, how big was the database?
Like in terms of...
John: The disk and index was about
40 gigs.
The entire table, I think, was
around 250, somewhere between
250 and 300 gigs.
Nikolay: Right, and the machine
has definitely less than that,
so of course it was not cached.
I mean, we do need the disk.
Okay, and back to indexes.
DiskANN, and I saw you mentioned
in the article and the project
itself, Microsoft work, but you
modified this work, right?
But the question is, like, HNSW
and this DiskANN, both are approximate
nearest neighbors algorithms.
Can we say one is better than another
for most of workloads and
data types, data sets?
Or there are cases where one can
win, there are cases where another
can win.
That's what I think.
John: ANDREW BROGDON-PICKETT That's
windows index supports concurrent
index builds, which could make
it faster, at least for building
the index.
SIAMAK
Nikolay: ROSHANI For building the
index, So, and yeah.
But for search?
Mat: For search and accuracy, look,
we haven't benchmarked everything.
The thing is that we have benchmarked,
we've seen higher throughput
and higher accuracy.
The trade-off was always kind of
better than HNSW, but you know,
we haven't benchmarked everything.
We've kind of concentrated on modern
embedding systems.
So, for example, we haven't gotten
to benchmark very low dimensional
vectors like 128 or 256, the lowest
thing we've benchmarked is
768.
So a lot of caveats and I think,
you know, the space is so new
people should actually test on
their own data, but I can say
in terms of research we haven't
seen where we're worse yet.
Nikolay: People should always test
on their own data, even if
it's old school full-text search
and so on, because who knows
what you have, right?
Mat: Absolutely.
John: That's one thing that made
benchmarking so hard, especially
versus our competitors, because
a lot of the specialized vector
databases have very few parameters
that let you actually control
the index, whereas on Postgres
with both pgvector and pgvectorscale,
there are several different parameters
to tune the build of the
index, and then also query time,
plus all of the Postgres settings.
And if you're self-hosting, you've
also got OS and machine level
things to play with.
So there's just a mind-boggling
number of variables you could
play with if you had the time and
money to spend on it.
Nikolay: Yeah, and in pgvectorscale
particularly, I also checked
the source code and documentation
that we have so far, and I
saw also parameters you can touch.
A couple of them, like query time,
and if you can be also adjusted
during build time.
What can you say about some list
sizes and so on?
Could you search list size?
And during build time, num neighbors,
how many neighbors?
Do you recommend trying to tune
this so far now?
Mat: So we tried to put in the
small defaults.
That is what we saw as the best
combination.
But obviously, we wanted to let
people also experiment.
But in terms of build parameters,
I would recommend keeping them
as a default.
The runtime parameters is really
where I think people should
experiment because that allows
you, at query time, to make the
trade-off between accuracy and
speed, right?
And there's the re-school parameter
that I think is the 1 we
recommend to actually tune.
I just realized today we probably
should change the default is
pretty low.
That was my fault.
But that is the parameter I think
people should play with.
John: Yeah, that 1 seemed to have
the most impact on both speed
and accuracy.
At query time.
Nikolay: Right.
Okay, so 1 more question about
the index.
You call it StreamingDiskANN,
right?
Why streaming?
How does it differ from the original
Microsoft implementation?
Yeah,
Mat: so I think both the HNSW implementation
and the original
DiskANN implementation, you tell
it ahead of time how many things
you want returned.
And that just doesn't play very
well with Postgres, quite frankly,
because the Postgres
John: model
Mat: For indexes, Postgres, the
other part of Postgres, can always
ask the index for the next tuple.
And this is done so that you can
filter your results after index
retrieval.
So for example, let's say I want
to find the closest vectors
to a given query that also meets
some other criteria that belong
to the business department or the
engineering department, right?
The vector index only fetches the
closest things to the query,
and then what Postgres does is
that after index retrieval it
will filter out department equals
engineering, right?
Now let's say your closest hundreds
of vectors are all from the
business department.
That means if that-
Nikolay: It will
Mat: be fast.
What?
Nikolay: That means- I mean, in
this case, it will be fast because-
Mat: Right, but if your parameter
is, you say you're returning
the closest 50 from the index,
then no results will be returned
at all.
Because the index returns 50 results,
Those 50 results are then
filtered by department equals engineering.
Nikolay: None of them bad.
Different department, right, right,
right.
A lot of work to be done, but 0
results,
Mat: right?
Exactly.
And there are 0 results because
there's this arbitrary limit
that you give at the beginning
of the query to tell you, hey,
retrieve this many results.
Whereas in reality, Postgres has
no idea how many results you
need to retrieve in order to match
both the query and the department
equals engineering, right?
So What the streaming part of the
algorithm does is it removes
that restriction and makes the
algorithm work in the way Postgres
expects other indexes to work.
So you can tell the index, hey,
give me the next closest thing,
the next closest.
Actually you could traverse the
entire graph, your entire table
like that.
And that makes all of your queries
that have a secondary filter
completely accurate.
Nikolay: Sven And this secondary
filter, it's a different column,
for example, text or integer, right?
Or not?
Mat: Rony Yeah, it's a different
column.
Often it's JSONB, right?
So you could have an article, you
could have a list of associated
tags with it in JSON-B, and you
can say, hey, find me all the
articles about cars that also come
from country USA, or any kind
of other metadata.
So it's this combination of semantic
and metadata, which is actually
incredibly common.
Nikolay: Right.
But there we have usually either
a situation when we have Btree
and gin index and Postgres needs
to decide which 1 to use and
then still apply a filter, or we
need to use something like additional
extension like GiST, I'm
bad with names as usual, and then
try to achieve single index scan.
But here it's not possible.
I mean, ideal world is single index
scan without additional filtering.
Mat: Yes, that is the ideal world.
Right now, none of the indexes
supports that.
And so all of the best you can
do is the best, the state of the
art on Postgres right now, and
that would over you, state of
the art period, but we could leave
that argument for another
time, is you can hope that you
can retrieve from the vector index
and then post filter and still
have accurate results.
Nikolay: I can imagine, for example,
if we take this department
name, put it inside this input
text which builds a vector, and
then additionally we filter by
integer like department ID, it
probably will work better in this
case, right?
Because first we apply semantic
search involving in the query
which department we want, But then
we think, okay, this filter
will make final polishing.
But it sounds again like some ugly
solution.
Mat: So...
Yeah, it's not only ugly.
I'm actually not sure, it won't
work, but like using my intuition,
like the department engineering
in the text would kind of skew
the semantics away from the actual
thing in the text you're talking
about.
And so I'm not sure that would
combine well in this kind of semantic
multi-dimensional space.
John: You could maybe
add a dimension and synthetically
set a value in the dimension to
represent which department it
is.
Nikolay: And 1 more dimension for
time, like timestamp.
This is so natural.
I'm very curious if there are some
works already in this direction,
because everyone needs creation
time, like publication time or
something for each data entry.
I didn't see good discussions about
that yet.
Because if you, for example, we
loaded almost 1000000 entries
from Postgres mailing list archives.
Then it started working great,
but when you see some discussion
from Bruce Momjan from 2002 about
something, I don't know, I
remember just some cases, It's
already not relevant at all.
You think, maybe I should delete
all data, but still, it might
be relevant.
What we did, ugly solution, we
return 1,000 or maybe 5,000 results,
and then we dynamically apply some
score, I think it's logarithm
approach for time, so we add some
penalty if the article, well,
email is very old.
It quickly becomes less relevant.
We just combine it with score,
pgvector provides us similarity
or distance.
This works well, but sometimes
you have 1 second latency on 1000000
rows dataset, and this is terrible
and doesn't scale.
So this problem of additional dimensions,
I think it's huge.
We need to extend original vector
with non-semantic dimensions
and use them in filtering and achieve
1 single index scan.
Good performance, right?
Mat: I would say that in the academic
literature or there is
some progress being made.
That is, the filtering, this can
end paper, which I believe is
2 years old, which is ancient in
this space.
There was another paper from a
group out in Berkeley talking
about building and filtering into
these graph-based indexes as
well.
But you gotta walk a crore people
you can walk.
We just haven't had any chance
to really implement these inside
this index, but there is, there's
kind of a chaotic work in this
area.
Nikolay: Right, for this algorithm.
John: You'll be able to signal
that like certain dimensions,
if that's where you're putting
your filters, are, that they would
need to be exact versus approximate.
Nikolay: Right.
And Do you think this KNN approach
is better than HNSW in this
particular problem?
Mat: For filtering, yes, I do.
Obviously, I'm biased, but I think
the simplicity of going from
multi-levels to a single level
really helps in a lot of these
things, just because there's a
lot less edge cases to consider.
Nikolay: Let's not forget...
John: This helps as well, right?
Mat: Yeah, I think Streaming was
a lot easier to implement because
it's a single level.
Nikolay: Interesting.
Let's not forget to talk about
compression, because I'm sure
we need it in vector world, in
vector search.
We need it.
I saw in a recent pgvector there
are ideas, let's not use integers,
let's use floats and so on, with
a kind of compressions, so to
speak.
But you talk about real compression,
right?
Maybe some experience from TimescaleDB
extension, or no, or it's
different.
Because there is time series, I
remember articles, TimescaleBlock
has excellent articles, but maybe
it's different here, right?
Mat: I don't think I directly used
any of the algorithms from
the time series space for this,
but the basic insight of using
the statistical properties of the
vectors you have to kind of
make compression, give you better
results is exactly what led
to...
So we have an algorithm called
Statistical Binary Quantization,
SBQ, And it takes quite a simple
but well-known algorithm called
BQ and kind of adapts it to your
dataset in a better way.
You know, we have a blog post about
it.
It's pretty simple.
It's pretty much using the means
of each dimension and the kind
of standard deviations to kind
of better segment your space,
if you would.
And yeah, and we just, honestly,
we took a very experimental
approach.
We took various datasets and we
tried different things and this
turned out that it worked.
It added a few percentage points
to the accuracy, which once
you get into the 90s, a few percentage
points is a pretty big
deal.
And so, yeah, it's, you get to
read about it in our blog post.
The algorithm is fully explained
there.
Nikolay: To make sure...
I'm just noticing, and...
John: The compression that Mat's
talking about is happening
in the index and the TimescaleDB
compression is, you know, converting
row-based into columnar and then
compressing each column in the
heap.
We actually tried compressing the
vectors in the heap with TimescaleDB
compression. And seemingly random
vector strings of numbers don't
compress very well.
So that didn't help very much.
Nikolay: RADUKIJ YURIYENKIN Because
in TimescaleDB, one of the
key ideas is that for time series,
it's like values are changing,
not jumping, how to say.
So there are deltas and these deltas
are quite low and so on.
This is what I remember from those
blog posts.
For vectors this doesn't work,
I understand.
For index compression, how much
could we expect in terms of compression
ratio to achieve?
Mat: Uses 1 bit per dimension.
The uncompressed version is float4.
It's a very easy calculation.
It's always a 32x compression ratio.
Nikolay: Roman Karpukhin Good.
Understood.
Great.
So yeah, anything else in technical
area we should mention?
Mat: James Johnson
Nikolay: That helps
John: with the size, but it also
helps with performance because
you can fit
Nikolay: more in.
Fewer buffers to load to the buffer
pool.
This makes sense.
I remember your blog post also
mentions storage costs you compare
with Pinecone in terms of how much
do you need to spend each
month to store 50 million vectors.
The difference is very noticeable,
I would say.
So yeah, this makes sense for sure.
Anything else, like technical stuff?
Anything maybe you're working on
right now?
Mat: I think about PG vector scale.
That's about it.
We haven't talked about PGAI, which
is the other thing we announced
this week.
Nikolay: This is a Python untrusted,
so it won't be possible
to run it on managed services,
as I understand, but it allows
you to...
Or it's something different?
Because it makes calls to OpenAI
API or...
It
Mat: does.
So, with untrusted languages, you
can't allow users on clouds
to write their own functions.
But if the functions are included
inside an extension, that's
fine.
And it's easy to see because most
extensions are written C, which
is completely untrusted, right?
So the entire point of the PGA
extension is specifically so that
this could be run on the clouds.
Nikolay: So it limits capabilities,
and if a cloud vendor decides,
verifies everything works well,
nothing, Bad calls cannot be
done through it.
So we are in the
John: center.
It's almost like whitelisting certain
programs.
Nikolay: Interesting.
Makes sense.
So the idea is to make it really
simple to create vectors from
regular data types, just transparently
calling these, just with
SQL, right?
I think there are other similar
implementations of this, But
I like your idea of betting on
cloud providers, including this,
and even not providing untrusted
languages, capabilities.
It's interesting.
Mat: Yeah, there are a few things
that do similar things, and
I was always curious why they didn't
just run the Python code
to do this.
And so that's what we did.
A lot of people do very complicated
stuff to get the same result.
I don't know why, but yeah.
We had to work that.
Nikolay: I like this approach myself
very well.
I mean, I do it a lot for many
years, just select and call something
externally.
But of course, there should be
a huge warning on site.
It doesn't scale well because if
Python calls...
You basically add latency to your
queries of the primary.
Primary CPU is the most expensive
resource you have.
If you want to generate a vector,
you cannot run it on a replica
because you need to write it.
You must do it on the primary.
While this query is running, it's
a primary node, and offloading
this work on Python application
nodes makes total sense because
the database doesn't notice this
at all, the primary doesn't
notice.
You only speak to it when the result
is already retrieved from
OpenAI or another LLM provider.
This is handy, but very dangerous
in terms of scalability and
performance and future issues.
Mat: Completely agreed with you.
This is a way for people to get
up and running quickly and testing
and experimenting.
And we do have another project
called pgvectorizer for when you
need to batch things up and scale
it and do it in the background.
We have all of that as well.
Nikolay: I haven't seen that, that's
interesting also
Mat: to check.
Nikolay: Yeah,
John: but all of that plumbing
that you have to write to drag
data out of the database and then
send it off to an API and put
it back in the database.
You end up writing so much code
that when it's in the database,
you don't have to do not to mention
all of the bandwidth you're
consuming and latency there.
Nikolay: Right.
I remember I was copying images
from S3 in SQL.
And then, of course, it's similar.
You call some APIs from Postgres.
It's great, but it's an interesting
direction.
With warning, for start it's good.
I think I don't have any more questions.
I'm looking forward to trying this
project.
We wanted to do it before this
call, like yesterday already,
but we had some technical difficulties.
I hope we will solve them soon
and try pgvectorscale.
For our case, I will tell you when
we have results.
Thank you.
So as a summary.
Mat: There's 1 more thing about
that.
This is a new project.
We're trying to be very responsive.
So if you run into any problems
at all, you know, file GitHub
issues or contact us on Discord.
We are more than happy and very
eager to talk to anybody.
This is a young project, as I said,
so all feedback is very young.
Nikolay: Sure, yeah, that's a good
point.
Honestly, seeing this work in Timescale
company repository on
GitHub, it makes me have some good
expectations in terms of if
some problems are encountered that
will be addressed and so on.
As a summary, let's say everyone
who works with vectors in Postgres
should check out this new type
of index and compare.
This is great.
I think there are many people who
do this right now and looking
forward to results in our case
and other cases as well.
Thank you for a good project
Mat: and very interesting.
Nikolay: Thank you.
Good luck with it.