1
00:00:00,060 --> 00:00:02,440
Michael: Hello and welcome to Postgres
FM, a weekly show about

2
00:00:02,440 --> 00:00:03,600
all things PostgreSQL.

3
00:00:03,600 --> 00:00:06,900
I am Michael, founder of pgMustard,
and today I am delighted

4
00:00:06,900 --> 00:00:11,120
to be joined by Jonathan Katz,
Postgres Core Team member amongst

5
00:00:11,120 --> 00:00:15,060
other things, Principal Product
Manager at AWS and contributor to

6
00:00:15,060 --> 00:00:15,880
pgvector.

7
00:00:16,160 --> 00:00:17,580
Welcome to the show, Jonathan.

8
00:00:17,800 --> 00:00:18,480
Jonathan: Thank you, Michael.

9
00:00:18,480 --> 00:00:19,540
Happy to be here.

10
00:00:19,540 --> 00:00:21,080
I'm definitely a huge fan of your work.

11
00:00:21,080 --> 00:00:23,420
I'm very excited for our conversation
today.

12
00:00:23,600 --> 00:00:25,420
Michael: Kind of, you'd say the same
to me.

13
00:00:25,640 --> 00:00:30,600
So I reached out to you about this
topic because it's about time

14
00:00:30,600 --> 00:00:32,420
we did an episode on pgvector.

15
00:00:32,720 --> 00:00:35,720
How could we not have given the
last year?

16
00:00:35,920 --> 00:00:37,500
But neither of us knew it well.

17
00:00:37,640 --> 00:00:40,860
Both Nikolay and myself haven't
used it until recently.

18
00:00:40,920 --> 00:00:42,460
Nikolay has been using it.

19
00:00:42,500 --> 00:00:44,680
So we wanted to make sure we covered
it.

20
00:00:44,680 --> 00:00:49,040
And who better to well, Andrew
Kane was unavailable and you are

21
00:00:49,040 --> 00:00:51,140
definitely the next best person.

22
00:00:51,140 --> 00:00:53,400
So we really appreciated all the
content you've been putting

23
00:00:53,400 --> 00:00:55,460
out, the talks you've been giving
on this.

24
00:00:55,460 --> 00:00:57,020
So yeah, thanks for joining.

25
00:00:57,340 --> 00:00:58,340
Jonathan: Yeah, happy to be here.

26
00:00:58,340 --> 00:01:00,540
I mean, and by the way, when you
say next best person, it's like

27
00:01:00,540 --> 00:01:01,680
well below Andrew Kane.

28
00:01:01,680 --> 00:01:04,900
Andrew has done phenomenal work
on pgvector over the past several

29
00:01:04,900 --> 00:01:05,400
years.

30
00:01:05,820 --> 00:01:06,500
Michael: Yeah, right.

31
00:01:06,500 --> 00:01:07,540
And it has been.

32
00:01:07,800 --> 00:01:10,580
It definitely predates the ChatGPT
hype, right?

33
00:01:10,580 --> 00:01:12,100
It was a couple of years beforehand.

34
00:01:12,380 --> 00:01:15,560
Do you want to give us a little
bit of an introduction as to

35
00:01:15,560 --> 00:01:18,460
what it is and how you came to
become involved with it?

36
00:01:19,000 --> 00:01:23,560
Jonathan: Yeah, so pgvector, at
the surface sounds very simple.

37
00:01:23,560 --> 00:01:26,820
It's a Postgres extension that
allows you to store and search

38
00:01:26,820 --> 00:01:27,320
vectors.

39
00:01:27,900 --> 00:01:30,860
And I mean, I'll talk a little
bit about how I got involved in

40
00:01:30,860 --> 00:01:34,340
it, but really to understand how
this all came to be, it actually

41
00:01:34,340 --> 00:01:36,300
helps to look back in the history of
Postgres.

42
00:01:36,660 --> 00:01:39,520
Postgres has actually always been
able to support vectors back

43
00:01:39,520 --> 00:01:43,080
since the Berkeley days, and it
comes down to the array data

44
00:01:43,080 --> 00:01:43,580
type.

45
00:01:43,660 --> 00:01:46,200
So I mean, quite simply, a vector
is, you know, it's an array

46
00:01:46,200 --> 00:01:47,300
with certain properties.

47
00:01:48,120 --> 00:01:49,940
It has a certain amount of dimensionality.

48
00:01:50,380 --> 00:01:52,640
There are certain things that each
dimension must meet.

49
00:01:52,640 --> 00:01:54,560
And there's all sorts of math around
it.

50
00:01:54,610 --> 00:01:58,580
I mean, there's over a century
of what a vector is.

51
00:01:58,660 --> 00:02:01,400
But the reason why it was added
to Postgres was not because of

52
00:02:01,400 --> 00:02:02,780
any mathematical properties.

53
00:02:02,980 --> 00:02:05,720
It was actually as a shortcut to
be able to look up ACL rules.

54
00:02:05,720 --> 00:02:08,480
So instead of having to do a join
to another catalog table to

55
00:02:08,480 --> 00:02:11,820
get the ACL rules, it was just
embedded within a single column

56
00:02:12,160 --> 00:02:13,400
as an array.

57
00:02:13,860 --> 00:02:14,560
Oh, really?

58
00:02:14,720 --> 00:02:15,200
Yeah, yeah.

59
00:02:15,200 --> 00:02:16,820
No, it's actually pretty cool.

60
00:02:17,080 --> 00:02:21,100
And then fast forward to the early
2000s, Postgres added the

61
00:02:21,100 --> 00:02:24,720
cube data type, which allowed
you to index up to a hundred

62
00:02:24,720 --> 00:02:28,700
dimensional vectors, ultimately
using the GiST index, but it

63
00:02:28,700 --> 00:02:31,660
added also two distance operations,
which are the mathematical

64
00:02:31,860 --> 00:02:33,900
operations you need for a vector,
which, you know, we can talk

65
00:02:33,900 --> 00:02:36,360
a little bit more about that as
soon as we get into this.

66
00:02:36,420 --> 00:02:39,160
But this is to say, Postgres has
actually supported vectors and

67
00:02:39,160 --> 00:02:43,080
vector lookups for a while, but
certainly there's been a more,

68
00:02:43,080 --> 00:02:44,780
a more pressing need for it.

69
00:02:44,840 --> 00:02:47,920
And, you know, in terms of a little
bit how I got involved, perhaps

70
00:02:47,920 --> 00:02:51,980
my secret in the Postgres community
was, I originally wanted

71
00:02:51,980 --> 00:02:54,220
to go into machine learning when
I was in college.

72
00:02:55,080 --> 00:02:59,080
And while I was studying it, I
was like, oh, it's really fascinating,

73
00:02:59,540 --> 00:03:03,560
but to do anything with it, you're
going to need a PhD and everything's

74
00:03:03,680 --> 00:03:04,380
ad hoc.

75
00:03:04,840 --> 00:03:06,760
And I had a very entrepreneurial
spirit.

76
00:03:06,760 --> 00:03:08,180
I think I got a patent.

77
00:03:08,400 --> 00:03:11,200
And of course, I didn't have the
foresight to see like, oh, this

78
00:03:11,200 --> 00:03:12,020
will be commoditized.

79
00:03:12,700 --> 00:03:15,740
And it'll be very simple to access
machine learning algorithms

80
00:03:15,820 --> 00:03:17,480
through simple API.

81
00:03:17,940 --> 00:03:19,540
So credit where credit is due.

82
00:03:19,540 --> 00:03:22,680
Some very smart people identified
that and have been really building

83
00:03:22,680 --> 00:03:23,600
towards that.

84
00:03:24,020 --> 00:03:28,100
But again, working with machine
learning systems in college,

85
00:03:28,100 --> 00:03:30,560
the fundamental data type was the
vector.

86
00:03:31,780 --> 00:03:35,800
And back then, a high dimensional
vector was something considered

87
00:03:35,800 --> 00:03:37,020
to be like 20 dimensions.

88
00:03:37,420 --> 00:03:40,340
And think about it, Michael, what's
a 20 dimensional vector?

89
00:03:40,600 --> 00:03:44,280
I can understand 3, 3 is like points
in space.

90
00:03:44,600 --> 00:03:47,420
I can understand 4, points in space
moving around.

91
00:03:47,560 --> 00:03:48,240
What's 5?

92
00:03:48,240 --> 00:03:49,200
Michael: Plus time, right?

93
00:03:49,200 --> 00:03:49,540
Yeah.

94
00:03:49,540 --> 00:03:50,040
Jonathan: Yeah.

95
00:03:50,660 --> 00:03:52,840
Michael: I studied, I don't know
if you know, but I studied maths

96
00:03:52,840 --> 00:03:55,080
at university and did pretty much
just pure.

97
00:03:55,080 --> 00:03:58,940
I did, I did a tiny bit of stats,
a tiny bit of mechanics, but

98
00:03:58,940 --> 00:04:00,560
I was best at the pure stuff.

99
00:04:00,820 --> 00:04:05,140
And yeah, 5 dimensions hurts my
head to think about it, or to

100
00:04:05,140 --> 00:04:07,200
try and conceptualize it in any
way.

101
00:04:07,480 --> 00:04:09,000
Jonathan: Yeah, it's funny that
you say that.

102
00:04:09,000 --> 00:04:10,460
I studied math too, though.

103
00:04:10,460 --> 00:04:13,040
I would say I was much stronger
on the computer science side.

104
00:04:13,040 --> 00:04:16,100
But real analysis was one of my favorite
classes, because you were

105
00:04:16,100 --> 00:04:17,920
proving everything in n-dimensional
space.

106
00:04:17,920 --> 00:04:20,360
And again, everything's a vector
or a matrix.

107
00:04:21,140 --> 00:04:23,720
You basically spend a semester
just studying vector spaces and

108
00:04:23,720 --> 00:04:24,400
all their properties.

109
00:04:24,400 --> 00:04:26,040
And to me, like, that was fun.

110
00:04:26,040 --> 00:04:28,900
And maybe it was telling, you know,
fast forwarding, you know,

111
00:04:28,900 --> 00:04:30,040
20 years later.

112
00:04:30,760 --> 00:04:35,400
But, you know, my journey back
to vectors came probably a little

113
00:04:35,400 --> 00:04:37,060
bit later than some folks.

114
00:04:37,200 --> 00:04:39,260
The pgvector project started in
2021.

115
00:04:40,460 --> 00:04:44,020
Andrew had the foresight to see
that a lot of data was going

116
00:04:44,020 --> 00:04:46,860
to be generated around machine
learning and AI systems.

117
00:04:46,980 --> 00:04:50,820
And Postgres didn't really have
support for the types of vector

118
00:04:50,820 --> 00:04:52,060
lookups that we needed.

119
00:04:52,360 --> 00:04:54,180
So let's back up a second.

120
00:04:54,440 --> 00:04:58,320
So I said that Postgres had support
for vectors all along, but

121
00:04:58,320 --> 00:04:59,440
it did and it didn't.

122
00:05:00,060 --> 00:05:02,980
One thing about relational databases,
and Postgres in particular,

123
00:05:03,040 --> 00:05:05,580
is that you look for exact results.

124
00:05:06,020 --> 00:05:09,880
You say, hey, I want to find the
10, well, let's say I want to

125
00:05:09,880 --> 00:05:11,900
look up Michael or Jonathan in
a database.

126
00:05:11,900 --> 00:05:15,100
I'm going to write a select query
with a where, where name

127
00:05:15,100 --> 00:05:17,060
equals Michael, right?

128
00:05:17,080 --> 00:05:19,700
And you expect to get the results
for like, Michael, all the

129
00:05:19,700 --> 00:05:20,840
Michaels in the database.

130
00:05:21,760 --> 00:05:24,820
Same thing when you're looking
up points, Postgres has the ability

131
00:05:24,960 --> 00:05:28,200
to index things by k-nearest neighbor
or find all the points

132
00:05:28,200 --> 00:05:29,240
that are around me.

133
00:05:29,240 --> 00:05:32,720
And again, if I say like, hey,
go in and find like the 10 closest

134
00:05:32,720 --> 00:05:33,720
coffee shops to me.

135
00:05:33,720 --> 00:05:36,660
You're going to find exactly the
10 closest coffee shops.

136
00:05:37,440 --> 00:05:40,940
And these things work in 3 dimensions,
4 dimensions, which is

137
00:05:40,940 --> 00:05:43,240
typically where Postgres operates.

138
00:05:43,320 --> 00:05:46,640
If you take the PostGIS extension,
which adds geospatial support

139
00:05:46,640 --> 00:05:49,700
to Postgres, you can index 3 and
4 dimensional points, and you

140
00:05:49,700 --> 00:05:50,940
can do it quite well.

141
00:05:51,000 --> 00:05:53,720
The GiST and the SP-GiST indexes
provide the ability to write

142
00:05:53,720 --> 00:05:54,440
these indexes.

143
00:05:55,200 --> 00:05:58,380
But as you start adding dimensions,
you go from, you know, even

144
00:05:58,380 --> 00:06:03,060
like 8 to 10 to a hundred, which
is the limit of the cube extension,

145
00:06:03,340 --> 00:06:04,980
it's going to start getting very
exhaustive.

146
00:06:05,060 --> 00:06:07,660
First, in terms of placing the
vector or the...

147
00:06:07,660 --> 00:06:10,120
I mean, it's not really a point
anymore, but think about it as

148
00:06:10,120 --> 00:06:10,440
a point.

149
00:06:10,440 --> 00:06:12,800
You're trying to place this in
space and try to figure out, I

150
00:06:12,800 --> 00:06:16,040
have 100 dimensions and I have
to situate somewhere in my graph

151
00:06:16,040 --> 00:06:16,980
where that is.

152
00:06:17,320 --> 00:06:18,780
It's going to be, you know, very exhaustive.

153
00:06:18,900 --> 00:06:21,760
And then even if you are doing
some kind of exact nearest neighbor

154
00:06:21,760 --> 00:06:24,640
search or some index search, there's
going to be a lot of cost to

155
00:06:24,640 --> 00:06:25,140
it.

156
00:06:25,440 --> 00:06:25,940
Yeah.

157
00:06:26,040 --> 00:06:29,440
And this is where Postgres, you
know, started falling over in

158
00:06:29,440 --> 00:06:32,380
a ways, you know, Because a lot
of these vectors we're seeing,

159
00:06:32,380 --> 00:06:34,860
if you look at some of the popular
embedding systems, they're

160
00:06:34,860 --> 00:06:37,960
1,536 dimensions, which again,
mind-blowing number.

161
00:06:37,960 --> 00:06:40,380
Like I still don't even know what
a five-dimensional vector is,

162
00:06:40,380 --> 00:06:42,980
and here we're talking about 1,500
dimensional vectors.

163
00:06:43,860 --> 00:06:47,440
To do an exact nearest neighbor
search, as they call it, or the

164
00:06:47,440 --> 00:06:50,800
k nearest neighbor, you have
to compare against every single

165
00:06:50,800 --> 00:06:52,280
vector that you store in a database.

166
00:06:52,280 --> 00:06:56,120
So you have a million vectors that
are 1,500 dimensions, you're

167
00:06:56,120 --> 00:07:00,120
doing a million times 1,500 computations
on them.

168
00:07:00,580 --> 00:07:01,820
And that's really expensive.

169
00:07:02,180 --> 00:07:04,280
I can tell you it's very expensive
because I've had to, you know,

170
00:07:04,280 --> 00:07:06,960
I've been personally, you know,
benchmarking this for quite a

171
00:07:06,960 --> 00:07:07,460
bit.

172
00:07:07,540 --> 00:07:08,860
Michael: And it makes sense, right?

173
00:07:08,860 --> 00:07:13,660
It's at minimum, a sequential scan
of the entire table, plus

174
00:07:13,780 --> 00:07:19,940
all of the CPU of doing those calculations
100 times per, well,

175
00:07:19,940 --> 00:07:22,320
1,500 times per row.

176
00:07:22,440 --> 00:07:23,860
Jonathan: If you have 1,500 times,
yeah.

177
00:07:23,860 --> 00:07:26,140
And we can talk about in terms
of like all the, you know, all

178
00:07:26,140 --> 00:07:28,580
sorts of the optimizations that
are in place there.

179
00:07:28,580 --> 00:07:32,360
But this is where like that exact
nearest neighbor problem gets

180
00:07:32,360 --> 00:07:32,860
hard.

181
00:07:33,080 --> 00:07:36,820
And again, Postgres didn't have
the indexing mechanisms to handle

182
00:07:36,820 --> 00:07:37,200
it.

183
00:07:37,200 --> 00:07:39,980
Probably we'll further dive into
that as we talk more.

184
00:07:40,120 --> 00:07:42,740
But actually, I'll say 1 of the
big ones is just in terms of

185
00:07:42,740 --> 00:07:45,560
just storing the data, storing
the data within the index, that

186
00:07:45,680 --> 00:07:50,380
you have to fit your index value
for a single row within an 8

187
00:07:50,380 --> 00:07:52,240
kilobyte limit within the page.

188
00:07:52,240 --> 00:07:54,980
So that's already going to start
creating some constraints.

189
00:07:55,840 --> 00:07:59,280
Now, what's also happened over
the past 20 years, since the cube

190
00:07:59,280 --> 00:08:03,340
type was added in Postgres, was
a whole field of modern vector

191
00:08:03,340 --> 00:08:05,240
research, as I like to call it.

192
00:08:05,320 --> 00:08:07,980
And starting to this idea of approximate
nearest neighbor.

193
00:08:08,200 --> 00:08:10,480
And the deal with approximate nearest
neighbor is that you're

194
00:08:10,480 --> 00:08:13,780
trying to find your nearest neighbors,
but with a best fit, which

195
00:08:13,780 --> 00:08:16,840
is you're not going to search every
single vector in a particular

196
00:08:16,840 --> 00:08:20,780
table or database, you can search
a subset of it, and you're

197
00:08:20,780 --> 00:08:24,520
hoping that you're seeing the representative
set of the data

198
00:08:24,520 --> 00:08:25,580
you want to return.

199
00:08:26,200 --> 00:08:29,120
Now again, for us relational database
folks, that's mind blowing.

200
00:08:29,120 --> 00:08:32,280
It's like, wait a second, I'm not
returning the exact answers.

201
00:08:32,360 --> 00:08:34,520
I'm returning an approximate subset.

202
00:08:34,740 --> 00:08:37,120
And I'll tell you, the first time
I used pgvector, and I was

203
00:08:37,120 --> 00:08:39,960
playing around with it, and I was
not getting the results I expected,

204
00:08:39,960 --> 00:08:42,060
at first I'm like, oh, this is
broken, this doesn't work, this

205
00:08:42,060 --> 00:08:43,660
doesn't make sense.

206
00:08:43,680 --> 00:08:47,720
And that was the app developer
in me, the DBA in me that started

207
00:08:47,720 --> 00:08:49,820
diving deeper and trying to understand
like, oh, wow, there's

208
00:08:49,820 --> 00:08:52,480
like a whole science around this
that we kind of have to relearn

209
00:08:52,480 --> 00:08:53,080
when using a database.

210
00:08:54,140 --> 00:08:58,280
Michael: Yeah, well, I often say
to people, you can run a database

211
00:08:58,480 --> 00:09:01,960
without adding any indexes and
you can run any query you want

212
00:09:01,960 --> 00:09:03,240
and get the results back.

213
00:09:03,240 --> 00:09:05,030
It's just faster to use an index.

214
00:09:06,010 --> 00:09:08,860
Your query result will not change
as a result of you having an

215
00:09:08,860 --> 00:09:09,260
index.

216
00:09:09,260 --> 00:09:14,620
And that sentence is no longer
true if we include these approximate

217
00:09:14,660 --> 00:09:16,520
nearest neighbor type indexes.

218
00:09:17,340 --> 00:09:20,520
So I have to stop saying that or
work out a way of caveating

219
00:09:20,640 --> 00:09:21,140
it.

220
00:09:21,420 --> 00:09:21,760
Jonathan: Yeah.

221
00:09:21,760 --> 00:09:24,440
Well, just say, well, in this world,
when you're dealing with,

222
00:09:24,440 --> 00:09:28,320
you know, vector similarity search,
but back to the question,

223
00:09:28,320 --> 00:09:30,580
you know, how did I get involved
in pgvector?

224
00:09:30,660 --> 00:09:33,160
So rewind to about the end of 2022.

225
00:09:33,660 --> 00:09:37,020
We may have heard about some of
these generative AI systems that

226
00:09:37,020 --> 00:09:38,030
were capturing the imagination.

227
00:09:38,920 --> 00:09:42,500
And we go and we type things into
them and we get back these

228
00:09:42,500 --> 00:09:45,160
human-like responses and suddenly
it's like, we're trying to

229
00:09:45,160 --> 00:09:47,460
think like, oh, what are all the
different ways to use it?

230
00:09:48,080 --> 00:09:52,540
And one of the techniques that emerged
as these systems grew in

231
00:09:52,540 --> 00:09:57,280
rapid popularity was retrieval-augmented
generation, where these

232
00:09:57,280 --> 00:10:00,060
large language models, foundational
models, have trained on vast

233
00:10:00,060 --> 00:10:02,900
quantities of data, typically publicly
available data.

234
00:10:03,420 --> 00:10:06,560
But you may have data sitting in
your database or some other

235
00:10:06,560 --> 00:10:08,860
private source that is not publicly
available.

236
00:10:08,940 --> 00:10:09,560
For good reason.

237
00:10:09,560 --> 00:10:15,260
It might be your product catalog
or information related to a

238
00:10:15,260 --> 00:10:17,180
documentation collection that you
have.

239
00:10:17,360 --> 00:10:20,900
But you're trying to deliver something
to your users, your customers

240
00:10:21,140 --> 00:10:24,940
that can take advantage of what's
in these generative AI systems,

241
00:10:25,040 --> 00:10:26,820
but provide that personalized experience.

242
00:10:27,800 --> 00:10:30,800
To do that, you need some way to
be able to take that data that

243
00:10:30,800 --> 00:10:34,840
exists in your private databases
and safely and securely augment

244
00:10:34,840 --> 00:10:38,100
them with your foundational models
to be able to deliver that.

245
00:10:38,320 --> 00:10:41,000
And that's where the vector search
comes in because the vector

246
00:10:41,000 --> 00:10:44,060
search provides a representation
that you can search over and

247
00:10:44,060 --> 00:10:46,320
be able to augment those answers.

248
00:10:47,560 --> 00:10:51,460
So seeing that, seeing this is
how people want to use it and

249
00:10:51,460 --> 00:10:53,700
frankly, people want to use it
with Postgres.

250
00:10:53,700 --> 00:10:57,260
A lot of the user and customer
conversations I had were around

251
00:10:57,260 --> 00:10:59,920
Hey, I already have this data existing
in my Postgres database.

252
00:10:59,920 --> 00:11:02,640
Or I've already built all my search
mechanisms around Postgres,

253
00:11:02,640 --> 00:11:04,500
I want to be able to simply use
it.

254
00:11:04,940 --> 00:11:07,860
And there was an extension that
did this, and it was pgvector.

255
00:11:08,760 --> 00:11:11,820
So for me personally, it started
to become a perfect storm.

256
00:11:11,820 --> 00:11:13,680
I'd studied this vector stuff in
college.

257
00:11:15,860 --> 00:11:20,720
I was a hobbyist machine learningist,
albeit the field has gotten

258
00:11:20,800 --> 00:11:23,040
way more complex than when I studied
it.

259
00:11:23,040 --> 00:11:26,500
And I happen to like databases
as it happens, particularly Postgres.

260
00:11:26,500 --> 00:11:27,780
So I looked at this extension.

261
00:11:27,980 --> 00:11:30,200
It seemed like a lot of people
are starting to adopt it in a rapid

262
00:11:30,200 --> 00:11:30,480
fashion.

263
00:11:30,480 --> 00:11:33,520
I mean, if you look at the star
history of pgvector, you know,

264
00:11:33,520 --> 00:11:34,740
it's a curve like this.

265
00:11:34,740 --> 00:11:37,660
Like I've never seen that before
for a Postgres extension.

266
00:11:38,480 --> 00:11:41,100
But I stepped back a minute as
well.

267
00:11:41,100 --> 00:11:44,720
And I saw that these are net new
database workloads.

268
00:11:44,720 --> 00:11:48,420
You know, a lot of what you see
in Postgres in the past has been

269
00:11:48,520 --> 00:11:52,040
moving database workloads over
a lot of traditional transactional

270
00:11:52,260 --> 00:11:53,260
relational applications.

271
00:11:54,440 --> 00:11:57,600
Though, you might call it once
in a generation, which in the

272
00:11:57,600 --> 00:12:00,720
tech world is every 10 years, there's
this new workload that

273
00:12:00,720 --> 00:12:03,340
emerges that just generates net
new data.

274
00:12:04,200 --> 00:12:07,860
So about 10, 15 years ago, we saw
that with JSON.

275
00:12:08,160 --> 00:12:11,600
We developed this lingua franca
across the web that made it simple

276
00:12:11,600 --> 00:12:14,600
to communicate between all these
APIs, but there became a demand

277
00:12:14,600 --> 00:12:17,960
to be able to store JSON data directly
in a database and easily

278
00:12:17,960 --> 00:12:18,980
query it out.

279
00:12:19,340 --> 00:12:22,660
Postgres developed a solution for
that, being the JSON and the

280
00:12:22,660 --> 00:12:25,760
JSONB data types, which became quite
efficient at being able to

281
00:12:25,760 --> 00:12:26,560
query it.

282
00:12:28,040 --> 00:12:31,320
What was nice is that the Postgres
community rallied around what

283
00:12:31,320 --> 00:12:34,620
people were doing and created a
very efficient, effective solution.

284
00:12:34,940 --> 00:12:38,440
Side note, I've talked to one Postgres
user that has 40 terabytes

285
00:12:38,440 --> 00:12:41,980
of JSON in a single database, and
they say Postgres rocks it,

286
00:12:42,240 --> 00:12:45,520
which, I'm like, wow, that's great.

287
00:12:45,520 --> 00:12:46,980
I'm glad it works.

288
00:12:47,720 --> 00:12:49,220
Michael: That's so nice for you.

289
00:12:49,740 --> 00:12:50,420
Jonathan: Yeah, yeah.

290
00:12:50,740 --> 00:12:52,000
It is really cool.

291
00:12:52,800 --> 00:12:55,320
Again, I can't figure out how I
would personally do it, but I'm

292
00:12:55,320 --> 00:12:57,440
glad that it works really well
for them.

293
00:12:57,440 --> 00:12:59,700
But we're seeing something similar
with vectors.

294
00:13:00,060 --> 00:13:01,940
Vectors are just a data type.

295
00:13:02,400 --> 00:13:03,760
It's actually a well-studied data
type.

296
00:13:03,760 --> 00:13:06,480
It's something you kind of look
at in your introductory computer

297
00:13:06,480 --> 00:13:07,260
science classes.

298
00:13:07,360 --> 00:13:10,740
But when it comes to efficient
data search and retrieval, there's

299
00:13:10,740 --> 00:13:13,940
so many nasty properties with them
that make it a really fun

300
00:13:13,940 --> 00:13:15,640
and challenging problem to work
on.

301
00:13:15,640 --> 00:13:19,420
But the thing is, like, this is,
you know, this is this new workload

302
00:13:19,540 --> 00:13:22,700
that is going to be available to
databases, including Postgres.

303
00:13:23,680 --> 00:13:28,540
And for me personally, I want to
make sure that we're positioning

304
00:13:28,580 --> 00:13:30,520
Postgres in a way to handle it.

305
00:13:31,220 --> 00:13:35,540
So where I got started with
pgvector was first using it,

306
00:13:35,580 --> 00:13:37,380
helping to get it deployed at AWS.

307
00:13:37,540 --> 00:13:40,680
But like any open source project,
if you're interested in it,

308
00:13:40,680 --> 00:13:41,620
just participate.

309
00:13:42,440 --> 00:13:45,780
There's a few things that I particularly
focused on, which was

310
00:13:46,420 --> 00:13:49,820
performance both from a contribution
standpoint, a testing standpoint.

311
00:13:50,280 --> 00:13:54,160
And first, I would just want to
say a large chunk of the performance

312
00:13:54,160 --> 00:13:55,960
features have been developed by
Andrew Kane.

313
00:13:55,960 --> 00:13:57,720
He's done an awesome job on it.

314
00:13:57,720 --> 00:14:01,620
And more recently, Heikki Linnakangas
has even further improved

315
00:14:01,620 --> 00:14:02,940
the performance on it.

316
00:14:04,540 --> 00:14:07,840
Where I stepped in was both just
benchmarking to make sure that

317
00:14:07,840 --> 00:14:10,440
we're focusing on the right areas
of performance and trying to

318
00:14:10,440 --> 00:14:12,540
prioritize what things to look
at.

319
00:14:12,660 --> 00:14:16,080
And a few patches here and there,
particularly around some of

320
00:14:16,080 --> 00:14:18,980
the index costing early on for
IVFFlat.

321
00:14:19,740 --> 00:14:21,140
Michael: But yeah, I saw that.

322
00:14:21,500 --> 00:14:22,240
Jonathan: Oh, thank you.

323
00:14:22,240 --> 00:14:25,320
Michael: I would love to get on
to performance and tips for people.

324
00:14:25,320 --> 00:14:28,580
But before that, I hadn't thought
of this as net new data.

325
00:14:28,580 --> 00:14:31,060
That's a super interesting way
of looking at it.

326
00:14:31,060 --> 00:14:34,300
And I hadn't thought of JSON as
necessarily net new data.

327
00:14:34,300 --> 00:14:37,580
I was thinking of it much more
along the lines of the NoSQL workloads

328
00:14:37,720 --> 00:14:42,140
that felt quite competitive with
the relational for a while.

329
00:14:42,500 --> 00:14:44,760
But yeah, really interesting way
of looking at it.

330
00:14:44,760 --> 00:14:48,420
And it does feel like an opportunity,
much like that did, to

331
00:14:48,420 --> 00:14:51,600
be able to handle more within Postgres
with all of the benefits

332
00:14:51,600 --> 00:14:56,320
that we've done episodes on, the
pros and cons of having data

333
00:14:56,320 --> 00:14:59,560
inside and outside using specialized
databases.

334
00:15:00,040 --> 00:15:02,760
But you gain so many benefits
like the transactional nature

335
00:15:02,760 --> 00:15:06,600
of things or the low latency or
being able to join things together.

336
00:15:06,600 --> 00:15:09,660
Like there are so many benefits
of having it all together.

337
00:15:09,800 --> 00:15:13,440
And not to mention that lower maintenance
overhead or the lower

338
00:15:13,440 --> 00:15:16,940
operational knowledge needed of
managing multiple systems.

339
00:15:17,360 --> 00:15:19,700
So yeah, great point, love it,
thank you.

340
00:15:20,320 --> 00:15:20,820
Jonathan: Cool.

341
00:15:21,180 --> 00:15:24,640
Michael: Yeah, so I was looking
at 3 of your PRs and the IVFFlat

342
00:15:24,640 --> 00:15:30,560
and the IVFFlat costings was looked
like a core improvement

343
00:15:30,760 --> 00:15:35,040
to make sure Postgres is using
indexes in more cases where indexes

344
00:15:35,080 --> 00:15:40,940
would be faster, which is the age-old
cost optimization problem.

345
00:15:42,160 --> 00:15:46,080
And also more recently, you did
some great blog posts on the

346
00:15:46,080 --> 00:15:46,560
addition.

347
00:15:46,560 --> 00:15:49,900
So IVFFlat was the first index
type that pgvector supported.

348
00:15:50,580 --> 00:15:55,160
And then more recently, got added
a very, well, not as new as 

349
00:15:55,160 --> 00:15:58,100
I expected when I looked back at
the history, but the competing

350
00:15:58,380 --> 00:16:03,120
index type, HNSW, which you've
also been involved in tuning.

351
00:16:03,120 --> 00:16:03,840
I've seen.

352
00:16:04,700 --> 00:16:07,440
Jonathan: Yeah, so IV**F** Flat was
first.

353
00:16:07,800 --> 00:16:13,020
And I think one of the reasons why
is that it is a bit simpler

354
00:16:13,020 --> 00:16:13,620
to implement.

355
00:16:13,620 --> 00:16:15,040
So it's a clustering algorithm.

356
00:16:15,100 --> 00:16:16,880
Clustering algorithms are well
studied.

357
00:16:16,880 --> 00:16:18,780
I mean, they're continuing being
improved.

358
00:16:19,820 --> 00:16:22,900
But one of the things you have to think
about when you're looking

359
00:16:22,900 --> 00:16:26,240
at implementing these algorithms
is that something that might

360
00:16:26,240 --> 00:16:30,120
work for an in-memory workload
doesn't necessarily work for a

361
00:16:30,120 --> 00:16:33,680
database workload, where you're
going between memory and an IO

362
00:16:33,680 --> 00:16:36,140
level, be it wherever your storage
may reside.

363
00:16:37,440 --> 00:16:41,540
And with IV**F** Flat, where I began
getting involved was at AWS,

364
00:16:41,600 --> 00:16:43,220
we had rolled out pgvector.

365
00:16:43,320 --> 00:16:46,020
I started talking to some of our
customers who were early pgvector

366
00:16:46,020 --> 00:16:49,760
adopters, and it was definitely
all over the board.

367
00:16:49,760 --> 00:16:53,720
But I was talking to one who was
definitely very savvy on vector

368
00:16:53,720 --> 00:16:54,640
searches in general.

369
00:16:54,640 --> 00:16:57,860
And they were remarking that when
they were trying to increase

370
00:16:57,860 --> 00:17:00,180
the number of probes in their query.

371
00:17:00,180 --> 00:17:01,780
So a probe, so let's step back.

372
00:17:01,780 --> 00:17:05,680
So the way IVFFlat works is that
you build clusters of vectors.

373
00:17:05,680 --> 00:17:07,900
So let's say you have. Let's keep
it simple.

374
00:17:07,900 --> 00:17:11,780
You have like 10,000 vectors. What
you want to do is you want

375
00:17:11,780 --> 00:17:13,600
to try to bucket them into lists.

376
00:17:13,780 --> 00:17:15,760
So let's say you want to have a
hundred lists.

377
00:17:15,760 --> 00:17:18,460
So there would be a hundred vectors
per list.

378
00:17:18,520 --> 00:17:20,940
No, there would be a hundred vectors
per list.

379
00:17:20,940 --> 00:17:21,720
There we go.

380
00:17:22,660 --> 00:17:24,140
Clearly, I did well with my math
major.

381
00:17:24,140 --> 00:17:26,020
Michael: The arithmetic was not
included, right?

382
00:17:26,060 --> 00:17:26,820
Jonathan: Yeah, exactly.

383
00:17:27,340 --> 00:17:29,380
After a certain year, you never
saw a number.

384
00:17:30,120 --> 00:17:30,960
So you have a hundred.

385
00:17:30,960 --> 00:17:31,700
So let's back up.

386
00:17:31,700 --> 00:17:32,900
So we have 10,000 vectors.

387
00:17:32,900 --> 00:17:35,720
We want to put them in a hundred lists,
each having 100 vectors.

388
00:17:35,940 --> 00:17:39,160
What happens with IVFFlat is that
you try to find centers.

389
00:17:39,160 --> 00:17:42,660
So you want to find all the vectors
near a particular center.

390
00:17:42,980 --> 00:17:45,480
Because the idea of that is that
those are going to be the most

391
00:17:45,480 --> 00:17:46,720
similar vectors to each other.

392
00:17:46,720 --> 00:17:49,600
So that way, when you do a lookup,
you're going to one of those

393
00:17:49,600 --> 00:17:49,900
centers.

394
00:17:49,900 --> 00:17:54,060
And the idea is that these are
likely, by approximation, the

395
00:17:54,060 --> 00:17:56,140
vectors that are most similar to
the ones that you're looking

396
00:17:56,140 --> 00:17:56,640
up.
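
To make the lists-and-probes mechanics concrete, here's a minimal Python sketch of the idea (the toy 2-D data and hand-picked centers are mine, not pgvector's actual implementation):

```python
import math

def dist(a, b):
    # Euclidean (L2) distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centers):
    # Assign every vector to its nearest center, forming the "lists"
    lists = {i: [] for i in range(len(centers))}
    for v in vectors:
        nearest = min(range(len(centers)), key=lambda i: dist(v, centers[i]))
        lists[nearest].append(v)
    return lists

def ivf_search(query, centers, lists, probes=1):
    # Visit only the `probes` nearest lists, then scan just those exactly
    order = sorted(range(len(centers)), key=lambda i: dist(query, centers[i]))
    candidates = [v for i in order[:probes] for v in lists[i]]
    return min(candidates, key=lambda v: dist(query, v))

# Two centers in 2-D; the query sits near the boundary between them
centers = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(0.5, 0.1), (1.0, -0.2), (9.0, 0.0), (5.4, 0.0)]
lists = build_ivf(vectors, centers)

print(ivf_search((4.9, 0.0), centers, lists, probes=1))  # → (1.0, -0.2): wrong list, poor recall
print(ivf_search((4.9, 0.0), centers, lists, probes=2))  # → (5.4, 0.0): the true nearest neighbour
```

With one probe, a query near the boundary lands in the "wrong" list and misses the true nearest neighbour; a second probe recovers it, which is exactly the trade-off being described here.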

397
00:17:57,440 --> 00:17:58,860
However, it may not be.

398
00:17:58,860 --> 00:18:01,740
Because let's say you have two centers
here and here, and your

399
00:18:01,740 --> 00:18:03,060
query vector is here.

400
00:18:03,160 --> 00:18:04,540
Which one are you closer to?

402
00:18:05,020 --> 00:18:07,360
Michael: I had to have somebody
explain this to me twice, and

403
00:18:07,360 --> 00:18:10,400
the second time they used the analogy,
it worked better for me.

404
00:18:10,400 --> 00:18:15,040
If we go back to 2 dimensions and
look at, let's say, a map of

405
00:18:15,060 --> 00:18:17,320
Europe, and we had the countries
laid out.

406
00:18:17,320 --> 00:18:21,560
We could have, for example, the
capital cities.

407
00:18:21,780 --> 00:18:26,480
First off, we could compare any
1 point and say, which capital

408
00:18:26,480 --> 00:18:27,520
city is this nearest to?

409
00:18:27,520 --> 00:18:30,340
That's a much cheaper operation
than looking at every single

410
00:18:30,340 --> 00:18:30,840
city.

411
00:18:31,120 --> 00:18:34,640
Then we can go into that country
and look at which of the actual

412
00:18:34,640 --> 00:18:36,600
towns or cities we're closest to.

413
00:18:36,820 --> 00:18:40,680
But if something's close to a border,
or the fact that not all

414
00:18:40,680 --> 00:18:44,100
countries are the same size, you
can end up in situations where

415
00:18:44,100 --> 00:18:49,200
you don't necessarily get the right
center or country by doing

416
00:18:49,200 --> 00:18:50,140
that first one.

417
00:18:50,140 --> 00:18:51,680
Is that a reasonable analogy?

418
00:18:51,940 --> 00:18:53,880
Jonathan: Yeah, that's really good,
because I started thinking

419
00:18:53,880 --> 00:18:54,440
about that.

420
00:18:54,440 --> 00:18:56,920
So I'm based in New York, and I'm
thinking what capital am I

421
00:18:56,920 --> 00:19:00,180
closest to is technically Trenton,
which is in New Jersey.

422
00:19:00,700 --> 00:19:04,680
So if I just looked at one probe,
it would say like, oh, you're

423
00:19:04,680 --> 00:19:05,560
in New Jersey.

424
00:19:05,580 --> 00:19:08,800
Yeah, actually, I think I take
3 probes because I think Albany

425
00:19:08,800 --> 00:19:10,580
is the third closest capital to...

426
00:19:10,580 --> 00:19:11,140
Michael: There you go.

427
00:19:11,140 --> 00:19:12,340
Jonathan: So there you go.

428
00:19:12,340 --> 00:19:12,740
Yeah.

429
00:19:12,740 --> 00:19:13,440
Great analogy.

430
00:19:14,100 --> 00:19:15,480
Michael: So probes is the number...

431
00:19:15,480 --> 00:19:19,140
So you then do that first search
and say, which are the 3 closest

432
00:19:19,640 --> 00:19:20,140
centers?

433
00:19:20,660 --> 00:19:21,900
Jonathan: Yeah, centers list.

434
00:19:21,900 --> 00:19:23,800
I mean, list is the term that's
used.

435
00:19:23,800 --> 00:19:26,700
It might be
used interchangeably with centers.

436
00:19:27,260 --> 00:19:30,860
Michael: So if we increase that
number of probes, we're increasing

437
00:19:30,860 --> 00:19:33,520
the cost of doing the query because
we're looking up more things,

438
00:19:33,520 --> 00:19:36,720
we're comparing to more things,
but we're increasing the likelihood

439
00:19:36,840 --> 00:19:41,580
we're gonna get back the correct
result or the number of things

440
00:19:41,580 --> 00:19:42,320
that are correct.

441
00:19:43,780 --> 00:19:45,100
Jonathan: In our expected set.

442
00:19:45,100 --> 00:19:47,800
So the measurement of expectation
is recall.

443
00:19:47,980 --> 00:19:51,760
So if you know that this set of
vectors are like your actual

444
00:19:51,760 --> 00:19:55,400
10 exact vectors, but you get like
8 out of the 10, then you

445
00:19:55,400 --> 00:19:56,700
have 80% recall.

446
00:19:57,560 --> 00:20:00,600
So with IVFFlat, as you say, as
you increase the number of probes,

447
00:20:00,600 --> 00:20:02,220
you're going to increase your recall.

448
00:20:02,360 --> 00:20:05,140
But you are going to increase the
query time.
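
The recall arithmetic from that example can be stated in a couple of lines of Python (the vector IDs are made up for illustration):

```python
def recall(expected, returned):
    # Fraction of the true nearest neighbours the index actually returned
    return len(set(expected) & set(returned)) / len(expected)

# The 10 exact nearest neighbours vs. an approximate result that got 8 of them
exact_top_10 = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10"]
approx_result = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v11", "v12"]
print(recall(exact_top_10, approx_result))  # → 0.8, i.e. 80% recall
```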

449
00:20:05,540 --> 00:20:08,980
Now, in that patch you were talking
about, the problem is that

450
00:20:08,980 --> 00:20:10,420
we were actually over costing.

451
00:20:10,920 --> 00:20:16,120
So anytime we were above 5 probes,
we then flip back to a sequential

452
00:20:16,120 --> 00:20:19,740
scan, which will give you 100%
recall, but it was taking the

453
00:20:19,740 --> 00:20:23,240
query time from, you know, it was
like a 50x increase in query

454
00:20:23,240 --> 00:20:24,600
time based upon the dataset.

455
00:20:24,780 --> 00:20:26,580
And we definitely still had room
to grow.

456
00:20:26,580 --> 00:20:29,440
So the first patch really was focused
on, let's get the costing

457
00:20:29,440 --> 00:20:32,720
to a more reasonable level, which
gets into, you know, startup

458
00:20:32,780 --> 00:20:34,540
costs being one of those elements.

459
00:20:35,020 --> 00:20:38,040
For me personally, it really was
my first time diving into the

460
00:20:38,040 --> 00:20:39,140
Query Planner code.

461
00:20:39,140 --> 00:20:42,440
I definitely saw all sorts of interesting
things.

462
00:20:42,440 --> 00:20:44,440
First off, the Query Planner code
is...

463
00:20:44,600 --> 00:20:46,160
There's a level of brilliance to
it.

464
00:20:46,160 --> 00:20:47,280
It encapsulates a lot.

465
00:20:47,280 --> 00:20:50,160
And the work that folks have done
over the past, you know, 35

466
00:20:50,220 --> 00:20:51,800
years tuning it is quite remarkable.

467
00:20:52,200 --> 00:20:54,220
I’d also say that it is quite clean.

468
00:20:54,720 --> 00:20:57,780
It takes a lot, like, there's
a lot of it, right.

469
00:20:57,780 --> 00:20:59,180
So like, there's a lot to dive
into.

470
00:20:59,180 --> 00:21:01,960
But like, once you get into it,
you can navigate your

471
00:21:01,960 --> 00:21:02,660
way around.

472
00:21:03,320 --> 00:21:04,940
Editing it is another story.

473
00:21:05,580 --> 00:21:09,040
It’s one thing to understand it,
it’s another to propose a modification,

474
00:21:09,280 --> 00:21:13,440
but what’s really cool about Postgres
is that you can make a

475
00:21:13,440 --> 00:21:17,420
lot of impact within an extension,
or basically you can add this

476
00:21:17,420 --> 00:21:19,200
functionality without having to
fork Postgres.

477
00:21:19,200 --> 00:21:21,820
You could just write it in a separate
module and then add it

478
00:21:21,820 --> 00:21:22,320
in.

479
00:21:22,660 --> 00:21:25,680
And pgvector does this through
the index access method interface,

480
00:21:25,680 --> 00:21:27,900
which is actually quite, quite
robust.

481
00:21:27,980 --> 00:21:31,680
Like we build index types that
don't fit the natural, you know,

482
00:21:31,680 --> 00:21:34,540
B-tree ordering of a relational
database.

483
00:21:34,540 --> 00:21:36,260
So it's quite powerful there.

484
00:21:36,260 --> 00:21:40,320
And we were able to ultimately
get, I'd say, pretty good costing

485
00:21:40,320 --> 00:21:44,940
for IVF Flat without having to
make any upstream modifications

486
00:21:45,100 --> 00:21:45,780
to Postgres.

487
00:21:46,640 --> 00:21:47,140
Nice.

488
00:21:47,640 --> 00:21:51,860
Michael: And the other, yeah, so
on the IVF Flat, because of

489
00:21:51,860 --> 00:21:56,500
these centers, my understanding
is it doesn't necessarily, it

490
00:21:56,500 --> 00:22:01,580
suits a dataset you already have
that is static or not changing

491
00:22:01,580 --> 00:22:06,840
much because if you're adding data,
it can’t adjust as it goes.

492
00:22:07,040 --> 00:22:10,680
So there is this trade-off inherent
in it.

493
00:22:10,680 --> 00:22:13,940
And I'm guessing some of the customers
you were talking to had

494
00:22:13,940 --> 00:22:16,860
different workloads that wouldn’t
necessarily suit that.

495
00:22:17,120 --> 00:22:19,780
And what actually, what happened
with your customer conversations

496
00:22:19,920 --> 00:22:20,420
next?

497
00:22:20,740 --> 00:22:22,800
Jonathan: Yeah, so that's a good
discussion.

498
00:22:22,800 --> 00:22:24,340
And I’d say, you know, let’s broaden
it.

499
00:22:24,340 --> 00:22:26,320
Like, it’s users of pgvector in
general.

500
00:22:26,320 --> 00:22:28,380
You know, a lot of these conversations
I have in the open source

501
00:22:28,380 --> 00:22:30,200
project, which is, again, awesome.

502
00:22:30,580 --> 00:22:32,440
So that's definitely something
that came up.

503
00:22:32,440 --> 00:22:36,580
That if my dataset, if I keep adding
vectors to the dataset or

504
00:22:36,580 --> 00:22:39,840
every now and then you update them
as well or even delete them,

505
00:22:39,960 --> 00:22:42,840
what happens is that the centers
start to skew because where

506
00:22:42,840 --> 00:22:46,500
you calculated where all these
centers are may start shifting.

507
00:22:47,040 --> 00:22:49,280
So your results might start shifting
as well.

508
00:22:49,280 --> 00:22:52,920
And you might not be getting the
expected results or the recall

509
00:22:53,200 --> 00:22:54,520
that you'd want to see.

510
00:22:54,520 --> 00:22:57,140
The other thing as well is that
for a lot of app developers,

511
00:22:57,150 --> 00:23:00,400
I mean, first off, again, you have
to wrap your head around approximate

512
00:23:00,400 --> 00:23:02,720
nearest neighbor that you may not
be getting the exact results

513
00:23:02,720 --> 00:23:03,280
that you expect.

514
00:23:03,280 --> 00:23:05,280
And like I said, I tripped up over
that as well.

515
00:23:05,280 --> 00:23:08,540
Like it's, you know, it is, it's
normal to make that mistake.

516
00:23:08,560 --> 00:23:10,180
You know, you have to learn something
different.

517
00:23:10,440 --> 00:23:13,320
The other thing is that it started
turning app developers into

518
00:23:13,320 --> 00:23:16,240
DBAs in the sense that you had
to worry about some of your tuning

519
00:23:16,240 --> 00:23:16,740
parameters.

520
00:23:17,120 --> 00:23:18,960
And granted, there are 2 of them,
right?

521
00:23:18,960 --> 00:23:21,820
There's the number of lists when
you build the index, which is

522
00:23:21,820 --> 00:23:25,200
not necessarily terrible because
as an app developer, you're

523
00:23:25,200 --> 00:23:28,820
still writing your SQL code, you
are creating that index.

524
00:23:29,180 --> 00:23:31,740
But you have to think, what is
the number of lists?

525
00:23:31,800 --> 00:23:34,540
And the pgvector project gives
some guidance on the number of

526
00:23:34,540 --> 00:23:37,520
lists to pick, but it's still like
you have to experiment.

527
00:23:37,900 --> 00:23:40,280
And then you have to be mindful
of things like, oh, am I adding

528
00:23:40,280 --> 00:23:41,000
more data to it?

529
00:23:41,000 --> 00:23:42,020
Do I need to re-index?

530
00:23:42,040 --> 00:23:43,020
How do I re-index?

531
00:23:43,080 --> 00:23:45,360
You know, who's doing the re-indexing,
etc.

532
00:23:45,860 --> 00:23:48,900
And then you have to select probes,
which again, you know, you're

533
00:23:48,900 --> 00:23:52,500
writing queries possibly in your
ORM, and suddenly you have to

534
00:23:52,500 --> 00:23:56,380
set like this magic parameter to
choose how many of these lists

535
00:23:56,380 --> 00:23:58,040
you want to visit during a query.

536
00:23:58,940 --> 00:24:01,140
So it's not as simple as set and
forget.

537
00:24:01,420 --> 00:24:04,740
And that's when I come back to
my history being an app developer,

538
00:24:04,740 --> 00:24:08,860
I always wanted to just write code
and I want it to all just work.

539
00:24:08,880 --> 00:24:11,760
Yeah, I happen to explore the database
as a side effect of that,

540
00:24:11,760 --> 00:24:14,080
not, you know, as a first-order
principle of that.

541
00:24:14,180 --> 00:24:18,120
So it became time to explore an
algorithm that could be closer

542
00:24:18,120 --> 00:24:18,940
to set and forget.

543
00:24:18,940 --> 00:24:21,300
Like none of these approximate
nearest neighbor algorithms are

544
00:24:21,300 --> 00:24:23,860
set and forget, but we can at least
make it easier.

545
00:24:24,140 --> 00:24:28,200
And this is where HNSW came in:
Hierarchical Navigable Small Worlds.

546
00:24:28,780 --> 00:24:32,600
So HNSW is a little bit different
than IVFFlat because it's not

547
00:24:32,600 --> 00:24:34,020
cluster-based, it's graph-based.

548
00:24:34,700 --> 00:24:36,680
And when you hear a graph, you
can actually almost hear, you

549
00:24:36,680 --> 00:24:38,560
know, a graph is a superset of
a tree.

550
00:24:38,560 --> 00:24:42,940
So even though Postgres is very
tree-based in a lot of its algorithms,

551
00:24:43,480 --> 00:24:45,880
you can implement something graph-based
as well, particularly

552
00:24:45,880 --> 00:24:47,300
with the custom index framework.

553
00:24:48,080 --> 00:24:52,700
And the way HNSW works is that
you build a hierarchy or layers

554
00:24:52,700 --> 00:24:53,420
of vectors.

555
00:24:53,940 --> 00:24:57,540
And as you go down the hierarchy,
the layers get denser, denser,

556
00:24:57,540 --> 00:24:58,220
and denser.

557
00:24:58,580 --> 00:25:01,200
And the idea is that you start
at the top layer and you try to

558
00:25:01,200 --> 00:25:05,860
find which vector am I closest
to, you find your local minimum

559
00:25:05,860 --> 00:25:10,160
or effectively the vector I'm closest
to, then you go down and

560
00:25:10,160 --> 00:25:11,640
you search within that area.

561
00:25:11,940 --> 00:25:14,340
You find the one that you're next
closest to and you go down and

562
00:25:14,340 --> 00:25:16,880
then it's even denser and you search
and so on and so forth until

563
00:25:16,880 --> 00:25:18,220
you get to the final layer.

564
00:25:18,520 --> 00:25:21,100
And the vectors that you should
be clustered around are most

565
00:25:21,100 --> 00:25:23,180
likely the vectors you're most
similar to.

566
00:25:24,340 --> 00:25:27,420
When you're building the index,
there's 2 parameters that...

567
00:25:27,440 --> 00:25:29,240
Now you have 2 index building parameters.

568
00:25:29,440 --> 00:25:32,640
You have something called m, which
is the number of vectors that

569
00:25:32,640 --> 00:25:34,120
you're linked to at a given time.

570
00:25:34,120 --> 00:25:35,900
And pgvector defaults to 16.

571
00:25:36,340 --> 00:25:39,160
And you have something called
ef_construction, which is the search

572
00:25:39,160 --> 00:25:41,460
radius that you're keeping as you're
descending down it.

573
00:25:41,460 --> 00:25:44,380
The idea that if you look at a
larger search radius, you're going

574
00:25:44,380 --> 00:25:47,060
to see the vectors that you're
more similar to.

575
00:25:47,220 --> 00:25:49,640
The idea also with a larger M is
that you're going to be clustered

576
00:25:49,640 --> 00:25:51,600
to vectors you're more similar
to as well.

577
00:25:51,600 --> 00:25:55,060
But there's definitely trade-offs
between how you pick that.
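
A toy Python sketch of that layered, greedy descent (the 1-D "vectors" and hand-built layers are invented for the example; this shows the intuition, not pgvector's implementation):

```python
def greedy_step(graph, start, query, dist):
    # Greedily walk to a closer neighbour until none improves on where we are
    current = start
    improved = True
    while improved:
        improved = False
        for neighbour in graph.get(current, []):
            if dist(neighbour, query) < dist(current, query):
                current = neighbour
                improved = True
                break
    return current

def hnsw_search(layers, entry, query, dist):
    # Descend layer by layer; the best node found on one (sparse) layer
    # becomes the entry point for the next (denser) layer below it.
    current = entry
    for graph in layers:
        current = greedy_step(graph, current, query, dist)
    return current

dist = lambda a, b: abs(a - b)  # 1-D "vectors" keep the toy readable

layers = [
    {0: [50], 50: [0, 100], 100: [50]},                                  # sparse top layer
    {0: [25, 50], 25: [0, 50], 50: [25, 75], 75: [50, 100], 100: [75]},  # dense bottom layer
]
print(hnsw_search(layers, entry=0, query=70, dist=dist))  # → 75, the closest point to 70
```

The real index adds the m and ef_construction knobs on top of this: m bounds how many neighbours each node links to, and ef_construction widens the candidate set kept during the descent while building.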

578
00:25:56,000 --> 00:25:59,060
What happens is that when you actually
do the query, you should

579
00:25:59,060 --> 00:26:00,640
be looking over a smaller space.

580
00:26:00,720 --> 00:26:04,300
So IVFFlat is going to grow linearly
as you add the number of

581
00:26:04,300 --> 00:26:06,900
probes and it's going to grow linearly
by the number of vectors

582
00:26:06,900 --> 00:26:07,940
that are in each list.

583
00:26:07,940 --> 00:26:10,980
So the idea is that if you can
keep the list relatively small,

584
00:26:11,300 --> 00:26:14,120
you're going to be able to do faster
searches with IVFFlat.

585
00:26:14,120 --> 00:26:17,440
But there's also trade-offs to
that as well, that you might not

586
00:26:17,440 --> 00:26:19,140
be getting the recall that you
want.

587
00:26:19,200 --> 00:26:22,040
You can see how IVFFlat
can get expensive over time,

588
00:26:22,040 --> 00:26:24,060
particularly if you're linearly
growing it.

589
00:26:24,580 --> 00:26:28,520
With HNSW, you're only searching
over a smaller subset of the

590
00:26:28,520 --> 00:26:29,020
space.

591
00:26:29,240 --> 00:26:32,360
Again, this could be dictated by
your one search parameter, which

592
00:26:32,360 --> 00:26:36,760
is ef_search, but you're only looking
at a subset of the graph.

593
00:26:36,760 --> 00:26:39,060
So you're going to be looking at
far fewer vectors.

594
00:26:39,140 --> 00:26:41,820
And the idea is that you're being
put into a cluster that's very

595
00:26:41,820 --> 00:26:42,940
similar to you.

596
00:26:43,620 --> 00:26:44,740
So there's trade-offs.

597
00:26:44,840 --> 00:26:48,260
With IVFFlat, I can build
the index super quickly.

598
00:26:48,280 --> 00:26:50,900
And technically, if I'm only using
one probe, I might be able to

599
00:26:50,900 --> 00:26:52,160
query it super quickly.

600
00:26:52,580 --> 00:26:57,040
But the trade-off is that to boost
recall in IVFFlat, what we've

601
00:26:57,040 --> 00:27:00,300
seen empirically is that it's going
to get more expensive.

602
00:27:00,940 --> 00:27:04,340
With HNSW, we can query really
quickly and we can query, you

603
00:27:04,340 --> 00:27:06,720
know, and again, the results are
showing that we can query and

604
00:27:06,720 --> 00:27:09,600
get high recall or, you know, high,
you know, we're basically

605
00:27:09,600 --> 00:27:12,760
seeing the expected results and,
you know, with pretty good accuracy.

606
00:27:13,460 --> 00:27:15,480
But the trade-off is going to be
on the build time, because we're

607
00:27:15,480 --> 00:27:18,380
going to do a lot more work up
front to build that hierarchy

608
00:27:18,600 --> 00:27:21,820
and see all of the, you know, basically
try to visit enough vectors

609
00:27:21,820 --> 00:27:23,460
that seem most similar to me.

610
00:27:23,600 --> 00:27:26,740
And that's the push and pull between
the, you know, these 2 algorithms

611
00:27:26,820 --> 00:27:31,320
and just vector similarity search
in general is that you're going

612
00:27:31,320 --> 00:27:33,340
to have to pay the cost somewhere,
and you have to figure out

613
00:27:33,340 --> 00:27:35,440
what that cost is going to be and
what makes the most sense for

614
00:27:35,440 --> 00:27:36,140
your workload.

615
00:27:36,580 --> 00:27:39,720
With HNSW, at least, if you're
willing to pay that upfront cost,

616
00:27:39,720 --> 00:27:42,940
which it seems like a lot of folks
are, you are going to get

617
00:27:42,940 --> 00:27:46,660
this very high-performance, high-recall
system, most likely.

618
00:27:46,680 --> 00:27:49,280
And again, there are other factors
that go into this as well,

619
00:27:49,280 --> 00:27:51,760
including your embedding model
or how you're querying the data

620
00:27:51,760 --> 00:27:53,400
or how quickly you're ingesting
the data.

621
00:27:53,400 --> 00:27:55,900
So, like, there's a lot to consider
here.

622
00:27:56,580 --> 00:27:59,640
Michael: Yeah, I'm a performance
guy at heart, and I love it

623
00:27:59,640 --> 00:28:03,380
when there is actually a true trade-off
and people in different

624
00:28:03,380 --> 00:28:05,240
situations will have different
preferences.

625
00:28:05,980 --> 00:28:10,320
But to me, it seems like a lot
of the currently popular use cases

626
00:28:10,900 --> 00:28:16,860
can pay or have a desire for as
high a recall as is possible

627
00:28:17,120 --> 00:28:19,900
within a latency threshold.

628
00:28:20,420 --> 00:28:25,520
So they're willing to pay some
latency and index build

629
00:28:25,520 --> 00:28:32,280
time costs for as high a recall
as possible to a point.

630
00:28:32,360 --> 00:28:35,860
And then everyone has a different
point at which that is.

631
00:28:35,860 --> 00:28:38,660
But yeah, I actually wanted to
ask almost the opposite question.

632
00:28:38,680 --> 00:28:42,680
Do you see some use cases where
the index build time itself is

633
00:28:42,940 --> 00:28:44,360
the primary driver?

634
00:28:45,560 --> 00:28:49,120
Or any other reasons that you're
seeing people choose IVF flat

635
00:28:49,120 --> 00:28:49,940
at the moment?

636
00:28:50,320 --> 00:28:52,800
Jonathan: Yeah, there are cases
where you might be needing to

637
00:28:52,800 --> 00:28:57,420
do rapid loads and a bit of analysis,
but it's very transient.

638
00:29:00,040 --> 00:29:03,120
So one of the cases I've heard is
that there's a system with, let's

639
00:29:03,120 --> 00:29:06,760
say like a hundred million vectors,
and it's really more about

640
00:29:06,760 --> 00:29:09,720
getting a rapid build and doing
a spot check on the data within

641
00:29:09,720 --> 00:29:12,540
it, as opposed to having something
that's permanent that needs

642
00:29:12,540 --> 00:29:13,460
to be run all the time.

643
00:29:13,460 --> 00:29:15,800
I think it's like a lot of these
transient or ephemeral workloads

644
00:29:15,800 --> 00:29:17,860
where an IVF flat index can make
sense.

645
00:29:17,900 --> 00:29:20,440
But frankly, like I've talked to
people who have just rolled

646
00:29:20,440 --> 00:29:23,560
out production systems with IVF
flat with several million vectors

647
00:29:23,560 --> 00:29:24,060
in it.

648
00:29:24,060 --> 00:29:24,860
They're perfectly happy.

649
00:29:24,860 --> 00:29:26,640
They're like, we're getting the
results that we need.

650
00:29:26,640 --> 00:29:30,200
We're getting it in the amount
of time that we expect it to return

651
00:29:30,200 --> 00:29:31,200
the query in.

652
00:29:31,280 --> 00:29:32,720
And they're completely fine with
it.

653
00:29:32,720 --> 00:29:34,660
So some of it is personal preference.

654
00:29:34,700 --> 00:29:38,140
And some of it, like I said, depends
on other factors, such as

655
00:29:38,140 --> 00:29:39,640
the data that has been vectorized.

656
00:29:40,240 --> 00:29:43,080
How similar are the vectors based
on the embedding model?

657
00:29:43,080 --> 00:29:46,080
And are you able to get the distances
within a range where you're

658
00:29:46,080 --> 00:29:47,940
seeing the set that you want.

659
00:29:48,560 --> 00:29:49,020
Michael: Awesome.

660
00:29:49,020 --> 00:29:50,100
That's really interesting.

661
00:29:51,440 --> 00:29:54,800
So you've got a great talk on this
that I'll link up that you

662
00:29:54,800 --> 00:29:56,980
gave recently at PGConf EU.

663
00:29:57,180 --> 00:30:00,360
And you've also had a couple of
great blog posts on the releases

664
00:30:00,480 --> 00:30:03,060
around 0.5.0 and 0.5.1.

665
00:30:04,440 --> 00:30:07,980
I'll link those up so people can
read them in their own time.

666
00:30:08,260 --> 00:30:11,760
I did actually want to ask you,
you mentioned some tuning, like

667
00:30:11,760 --> 00:30:15,860
people becoming DBAs and having
to think about, You mentioned

668
00:30:15,860 --> 00:30:19,090
right up top the page size of PostgreSQL,
for example.

669
00:30:19,090 --> 00:30:21,820
Do you want to talk a little bit
about why that's important and

670
00:30:21,820 --> 00:30:24,840
some of the lower level stuff that
people need to think about

671
00:30:24,840 --> 00:30:25,940
when they're doing this?

672
00:30:26,040 --> 00:30:26,620
Jonathan: All right.

673
00:30:27,040 --> 00:30:28,140
So, tuning or technical?

674
00:30:28,140 --> 00:30:29,440
I heard 2 different things.

675
00:30:30,040 --> 00:30:31,860
Michael: Let's go tuning as more
practical.

676
00:30:31,960 --> 00:30:34,260
We don't have to explain TOAST.

677
00:30:34,540 --> 00:30:36,540
We've done a whole episode on TOAST
before.

678
00:30:37,600 --> 00:30:39,660
But there was some interesting
stuff in there that really got

679
00:30:39,660 --> 00:30:42,080
me thinking that I hadn't thought
about before.

680
00:30:42,400 --> 00:30:46,160
Jonathan: Yeah, so there's a whole
bunch of areas to go into.

681
00:30:46,160 --> 00:30:50,340
So just to briefly recap, because
I think this does impact tuning.

682
00:30:50,900 --> 00:30:55,340
So TOAST is a system that, well,
step back, rather.

683
00:30:55,380 --> 00:30:58,040
So the foundational unit of PostgreSQL
is the page.

684
00:30:58,260 --> 00:30:59,440
That's the atomic unit.

685
00:30:59,440 --> 00:31:02,120
When you store data, you're actually
storing it within this atomic

686
00:31:02,120 --> 00:31:02,440
unit.

687
00:31:02,440 --> 00:31:04,820
You're not just storing a row randomly
on disk.

688
00:31:04,820 --> 00:31:06,700
It's going to be fit within a page.

689
00:31:07,280 --> 00:31:10,580
By default, the page for PostgreSQL
is 8 kilobytes.

690
00:31:11,120 --> 00:31:14,240
Now you can recompile PostgreSQL
to use a different page size,

691
00:31:14,240 --> 00:31:17,660
but most folks just use the default
and for a lot of good reasons.

692
00:31:18,600 --> 00:31:21,680
What's interesting is that, you
know, so if you have data that

693
00:31:21,680 --> 00:31:24,780
doesn't fit within a page, what
happens is that it gets TOASTed

694
00:31:25,160 --> 00:31:28,220
and it gets stored out of line
and you can store it arbitrarily

695
00:31:28,320 --> 00:31:28,660
large.

696
00:31:28,660 --> 00:31:31,420
I believe it's up to 1 gigabyte
per field.

697
00:31:31,640 --> 00:31:36,020
Now, what's interesting, well,
there's 3 interesting things here.

698
00:31:36,020 --> 00:31:39,060
So first, PostgreSQL has what's called
a minimum TOAST threshold,

699
00:31:39,200 --> 00:31:40,360
which is 2 kilobytes.

700
00:31:40,440 --> 00:31:43,660
So anything above 2 kilobytes is
going to get TOASTed, unless you

701
00:31:43,660 --> 00:31:44,560
change that threshold.

702
00:31:45,120 --> 00:31:51,000
The second thing is that index
pages must abide by that 8 kilobyte

703
00:31:51,000 --> 00:31:51,500
limit.

704
00:31:52,200 --> 00:31:56,200
And you can actually TOAST data
in line with an index page to

705
00:31:56,200 --> 00:31:58,240
shrink it down a bit so you can
get a little bit more on that

706
00:31:58,240 --> 00:31:58,820
index page.

707
00:31:58,820 --> 00:32:02,220
But if it's gonna be over 8 kilobytes,
Postgres can't index it.

708
00:32:02,960 --> 00:32:06,060
So this is where it gets interesting
for vectors for a few reasons.

709
00:32:06,180 --> 00:32:09,440
So first, any vector that's above
that 2 kilobyte threshold,

710
00:32:09,440 --> 00:32:12,180
which I believe, I think I forgot
off the top of my head, it's

711
00:32:12,180 --> 00:32:15,660
around like 514 dimensions for
pgvector currently, is going to

712
00:32:15,660 --> 00:32:16,220
get TOASTed.
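
A back-of-the-envelope sketch of that arithmetic in Python (the 4-bytes-per-dimension and 8-byte-header figures are my assumptions about pgvector's storage, and the thresholds are approximate):

```python
def vector_bytes(dims, bytes_per_dim=4, header=8):
    # Approximate on-disk size of a pgvector value: one 4-byte float per
    # dimension plus a small fixed header (assumed 8 bytes here)
    return dims * bytes_per_dim + header

TOAST_THRESHOLD = 2 * 1000   # ~2 kB; Postgres's actual constant is close to this
PAGE_SIZE = 8 * 1024         # default Postgres page size

print(vector_bytes(1536))                     # e.g. a common embedding-model size
print(vector_bytes(1536) > TOAST_THRESHOLD)   # → True: such vectors get TOASTed
print(vector_bytes(2000) <= PAGE_SIZE)        # → True: 2,000 dims still fits a page
```

This is why the ~500-dimension mark triggers TOASTing while the indexing cap sits near 2,000 dimensions: 2,000 four-byte floats come in just under the 8 kB page limit.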

713
00:32:16,820 --> 00:32:18,660
Okay, that might seem okay.

714
00:32:19,440 --> 00:32:22,620
But any vector currently, there's
a hard cap in terms of indexing

715
00:32:22,960 --> 00:32:27,160
pgvector vectors of 2000 dimensions,
which there are some valid

716
00:32:27,160 --> 00:32:30,480
use cases I've heard of vectors
that go beyond that size, but

717
00:32:30,480 --> 00:32:33,600
typically most things are going
to fall within that range for

718
00:32:33,600 --> 00:32:34,100
now.

719
00:32:34,920 --> 00:32:37,600
Now, there's a few things here.

720
00:32:37,600 --> 00:32:39,760
First, you're like, well, you say
you can TOAST things in line.

721
00:32:39,760 --> 00:32:41,860
So why can't we TOAST these vectors
in line?

722
00:32:41,940 --> 00:32:44,600
Well, I hate to be the bearer of
bad news, but it's actually

723
00:32:44,600 --> 00:32:48,980
very challenging to compress effectively
2,000 dimensions of

724
00:32:49,160 --> 00:32:51,100
random floating-point numbers.

725
00:32:52,660 --> 00:32:54,560
There's not really much you can
do other than dimensionality

726
00:32:54,740 --> 00:32:56,780
reduction to shrink it down.

727
00:32:57,040 --> 00:32:59,060
There are some techniques out there
called quantization, which

728
00:32:59,060 --> 00:33:03,060
you can touch on after this, but
they all have their tradeoffs,

729
00:33:03,060 --> 00:33:04,460
such as losing information.

730
00:33:06,740 --> 00:33:11,440
The second thing is that TOASTing
can screw up the query planner

731
00:33:11,440 --> 00:33:11,920
a bit.

732
00:33:11,920 --> 00:33:13,160
What do I mean by this?

733
00:33:13,940 --> 00:33:17,040
So, currently when the query planner
is trying to estimate parallel

734
00:33:17,040 --> 00:33:21,380
workers or essentially reading
data in parallel, it's going to

735
00:33:21,380 --> 00:33:24,320
look at, it basically looks at
your heap pages or your main table

736
00:33:24,320 --> 00:33:26,700
pages and uses that to drive the
estimate.

737
00:33:27,500 --> 00:33:30,720
And the thing about TOAST is that
your heap pages may actually be

738
00:33:30,720 --> 00:33:33,420
quite small, you know, in this
case, let's say, an ID and a

739
00:33:33,420 --> 00:33:33,920
vector.

740
00:33:34,200 --> 00:33:36,600
Well, I'm not going to have that
many pages in my regular heap

741
00:33:36,600 --> 00:33:39,320
because it's just going to be a
bunch of IDs and pointers to my

742
00:33:39,320 --> 00:33:40,060
TOAST table.

743
00:33:40,600 --> 00:33:43,080
The TOAST table is going to be
quite large and likely you need

744
00:33:43,080 --> 00:33:45,360
those parallel workers to suck
all the data out.

745
00:33:45,600 --> 00:33:49,240
But what Postgres is going to do
is it's going to underestimate

746
00:33:49,280 --> 00:33:50,340
the workers today.
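One way to see the mismatch being described is to compare the main table's size with its TOAST table's size, since the planner's parallel-worker estimate is driven by the main heap. A sketch, assuming a hypothetical `items` table that has a TOAST table:

```sql
-- Compare the main heap size with the TOAST table size
-- ('items' is a hypothetical table name; assumes a TOAST table exists):
SELECT pg_size_pretty(pg_relation_size(oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'items';
```

When most of each row is TOASTed, `toast_size` dwarfs `heap_size`, which is exactly the case where the heap-based estimate undercounts the useful parallel workers.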

747
00:33:50,900 --> 00:33:53,640
And it actually makes sense historically
because the data that

748
00:33:53,640 --> 00:33:56,720
you typically TOASTed was not in
your search path or your ordering

749
00:33:56,720 --> 00:33:57,040
path.

750
00:33:57,040 --> 00:33:59,440
It's typically something that you
need to do some post filtering

751
00:33:59,440 --> 00:34:00,400
on as you pull it out.

752
00:34:00,400 --> 00:34:02,660
You know, think like, you know,
a big blob of text.

753
00:34:03,400 --> 00:34:06,140
But here, we are actively querying
the vectors.

754
00:34:06,140 --> 00:34:07,760
We are calculating distances between
them.

755
00:34:07,760 --> 00:34:10,780
So we probably want them closer
to our main set of data.

756
00:34:11,000 --> 00:34:13,440
Now, we can choose to store the
vectors in line.

757
00:34:13,440 --> 00:34:16,340
You use this technique called SET
STORAGE PLAIN, and that will

758
00:34:16,340 --> 00:34:21,000
keep your 1500-dimensional vector
in line with your other table

759
00:34:21,000 --> 00:34:21,500
data.
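The storage technique mentioned here is the `ALTER TABLE ... SET STORAGE PLAIN` form. A minimal sketch, with hypothetical names, and only safe when each row still fits in the ~8 KB page:

```sql
-- Force the vector column to be stored inline (un-TOASTed);
-- 'items' and 'embedding' are hypothetical names:
ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;
```

Note this only affects newly written rows; existing rows keep their current storage until rewritten.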

760
00:34:22,240 --> 00:34:25,320
But also keep in mind, this is
going to cost a full page because

761
00:34:25,320 --> 00:34:28,320
a 1500-dimensional vector is about
6 kilobytes and Postgres is

762
00:34:28,320 --> 00:34:30,600
just going to have to allocate
a page for each of them.

763
00:34:30,740 --> 00:34:34,060
So no matter what, you're using
8 kilobytes of storage for each of them.

764
00:34:34,960 --> 00:34:36,720
So this gets very interesting.

765
00:34:37,060 --> 00:34:39,880
So in terms of tuning, again, part
of this, you've got to look

766
00:34:39,880 --> 00:34:42,080
at what your workload is and what
makes the most sense.

767
00:34:43,260 --> 00:34:47,640
What we're seeing so far, at least
with HNSW, is that even if

768
00:34:47,640 --> 00:34:50,580
you're TOASTing a 1500-dimensional
vector, the estimates are

769
00:34:50,580 --> 00:34:53,200
still pretty good overall for making
sure that you're using your

770
00:34:53,200 --> 00:34:54,440
HNSW index.

771
00:34:54,520 --> 00:34:56,260
I haven't seen as much impact there.

772
00:34:56,280 --> 00:34:59,340
We saw more impact, particularly
with IVFFlat, based upon some

773
00:34:59,340 --> 00:35:00,360
of that costing model.

774
00:35:00,360 --> 00:35:00,860
Michael: But

775
00:35:01,120 --> 00:35:03,420
Jonathan: I think there's some,
this is one of those areas where

776
00:35:03,420 --> 00:35:06,420
I think there are some improvements
to be made to how we work with TOASTed

777
00:35:06,420 --> 00:35:07,180
data upstream.

778
00:35:07,580 --> 00:35:11,100
I think I have a couple of emails
or threads on that subject.

779
00:35:11,840 --> 00:35:15,120
And if I get my C chops better,
maybe I can propose a patch.

780
00:35:15,520 --> 00:35:19,740
But that's one area to be mindful
of, is how you store the data.

781
00:35:19,740 --> 00:35:22,800
And again, I think we're starting
to get a little bit more set

782
00:35:22,800 --> 00:35:23,840
and forget there.

783
00:35:23,840 --> 00:35:28,320
But I think as we see these workloads
increase, it is definitely

784
00:35:28,320 --> 00:35:29,780
something to be mindful of.

785
00:35:30,180 --> 00:35:34,340
Michael: From a bias standpoint,
as a Postgres supporter and

786
00:35:34,340 --> 00:35:38,340
advocate, I've seen quite a lot
of benchmarks that show pgvector

787
00:35:38,440 --> 00:35:42,680
doing very well as a vector database
compared to other dedicated

788
00:35:42,720 --> 00:35:43,220
systems.

789
00:35:43,380 --> 00:35:46,220
So that's impressive, considering
they're probably not even tuning

790
00:35:46,220 --> 00:35:48,740
some of those lower-level things
while doing those benchmarks.

791
00:35:48,760 --> 00:35:51,920
So it's exciting for me to hear
that there's actually some potential

792
00:35:52,460 --> 00:35:54,880
further wins there on the Postgres
side.

793
00:35:55,080 --> 00:35:58,480
Jonathan: Yeah, I mean, maybe like
hot off the presses, Andrew

794
00:35:58,520 --> 00:36:02,300
proposed a patch last night for
speeding up HNSW index building.

795
00:36:02,540 --> 00:36:05,280
Since the 0.5.1 release, there's
already been a lot of work to

796
00:36:05,280 --> 00:36:08,140
improve this, but the patch last
night even further accelerates

797
00:36:08,260 --> 00:36:08,760
it.

798
00:36:10,160 --> 00:36:13,520
So there's support for parallel
builds for HNSW indexes coming.

799
00:36:14,060 --> 00:36:17,160
And the initial work still had
to leverage a lot of the data

800
00:36:17,160 --> 00:36:19,660
coming from disk as opposed to
being fully in memory.

801
00:36:20,020 --> 00:36:22,180
The proposal last night is fully
in memory.

802
00:36:23,000 --> 00:36:25,440
And I actually did a test, I was
so excited to see it, I did

803
00:36:25,440 --> 00:36:26,140
a test.

804
00:36:26,980 --> 00:36:30,780
I have this 10 million-vector,
1,500-dimensional dataset, or really

805
00:36:30,780 --> 00:36:33,040
a 1,536-dimensional dataset.

806
00:36:33,620 --> 00:36:34,600
Yeah, random vectors, right?

807
00:36:34,600 --> 00:36:36,820
This is really more just for like
beating up on performance.

808
00:36:36,820 --> 00:36:39,620
You know, I'm not measuring recall
here, which for real benchmarks,

809
00:36:39,620 --> 00:36:41,180
you got to measure performance
and recall.

810
00:36:41,180 --> 00:36:41,720
Michael: But I'm

811
00:36:41,720 --> 00:36:43,680
Jonathan: trying to just understand
like how quick is the index

812
00:36:43,680 --> 00:36:44,480
build time.

813
00:36:45,060 --> 00:36:49,000
So I did this with what's currently
on the master branch to date.

814
00:36:49,000 --> 00:36:52,860
I used this and then I also did
the test with the HNSW fast build

815
00:36:52,860 --> 00:36:54,640
branch, which is this in-memory
system.

816
00:36:55,600 --> 00:36:56,920
So I saw a couple of things.

817
00:36:57,800 --> 00:37:02,860
So first, this new branch was about,
I think, 7.3x faster at

818
00:37:02,860 --> 00:37:05,640
building the HNSW index than the other
branch.

819
00:37:05,640 --> 00:37:07,760
And just to compare, like this
gives some real numbers.

820
00:37:07,760 --> 00:37:09,440
So the old branch.

821
00:37:10,080 --> 00:37:12,700
Which is, you know, the unreleased
support for the

822
00:37:12,700 --> 00:37:17,460
HNSW parallel builds, it took about
3 hours and change to build

823
00:37:17,460 --> 00:37:18,340
the entire index.

824
00:37:18,340 --> 00:37:20,280
And this was with like 64 parallel
workers.

825
00:37:20,280 --> 00:37:22,860
I'm throwing a lot at it; this
is a big beefy box.

826
00:37:24,060 --> 00:37:27,680
With the new branch, it took 25
minutes.

827
00:37:29,440 --> 00:37:31,640
It's like, that's mind-blowing,
right?

828
00:37:31,640 --> 00:37:32,900
These are big vectors.

829
00:37:32,900 --> 00:37:34,100
You're doing a lot of computations.

830
00:37:34,180 --> 00:37:37,300
And I even had cranked up the
ef_construction value, which does

831
00:37:37,300 --> 00:37:38,340
increase the time.

832
00:37:38,740 --> 00:37:41,620
I did compare it to this previous
method I'd been recommending,

833
00:37:41,660 --> 00:37:43,440
which was concurrent inserts.

834
00:37:44,440 --> 00:37:47,040
And I just, you know, I did a spot
check between a blog post

835
00:37:47,040 --> 00:37:49,840
I wrote about it and this new patch.

836
00:37:50,140 --> 00:37:52,600
In the blog post where I had a
lower value of EF construction,

837
00:37:52,600 --> 00:37:56,960
it was I think 64, I was getting
a little bit over 1,000 vectors

838
00:37:56,960 --> 00:37:59,120
per second, looking over the entire
run.

839
00:38:00,220 --> 00:38:03,760
With this new technique that Andrew
posted, I was getting over

840
00:38:03,760 --> 00:38:05,580
6,500 vectors per second.

841
00:38:06,900 --> 00:38:07,980
Michael: So again, huge.

842
00:38:08,940 --> 00:38:09,440
Jonathan: Yeah.

843
00:38:09,800 --> 00:38:12,040
This is what's cool about this
because this is one of those set

844
00:38:12,040 --> 00:38:14,840
and forget things that you might
need to tweak.

845
00:38:15,200 --> 00:38:17,060
You might need to tweak one parameter.

846
00:38:17,220 --> 00:38:20,260
In this case, I tweaked
max_parallel_maintenance_workers.

847
00:38:21,040 --> 00:38:23,980
But again, huge performance boost
in terms of the index building.

848
00:38:23,980 --> 00:38:26,720
It simplifies it too, because now
it makes it truly viable to

849
00:38:26,720 --> 00:38:29,700
say, preload all your vectors,
which will be a much faster operation

850
00:38:29,700 --> 00:38:31,180
than doing a concurrent insert.

851
00:38:31,640 --> 00:38:34,240
I mean, compared to a concurrent
insert, it can even further

852
00:38:34,240 --> 00:38:35,940
speed up the load of your vectors.

853
00:38:36,580 --> 00:38:40,680
But then you can just do create
index, you know, embeddings using

854
00:38:40,680 --> 00:38:44,900
HNSW and boom, like you have an
index way faster than other systems.
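The load-then-index flow described here might look like the following sketch. The `items`/`embedding` names are hypothetical; `vector_l2_ops` is pgvector's L2-distance operator class, and `m`/`ef_construction` are its HNSW build parameters:

```sql
-- Bulk-load vectors first, then build the HNSW index in one pass:
SET max_parallel_maintenance_workers = 8;  -- allow a parallel index build

CREATE INDEX ON items
    USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 64);
```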

855
00:38:44,900 --> 00:38:46,820
And like, this is a key thing to
look at.

856
00:38:46,820 --> 00:38:49,240
You know, you touch on this, Michael,
is that when you're dealing

857
00:38:49,240 --> 00:38:51,180
with this vector data, you gotta
look at everything.

858
00:38:51,180 --> 00:38:54,140
You gotta look at your ingestion
time, your index build time,

859
00:38:54,140 --> 00:38:55,220
your query time.

860
00:38:55,440 --> 00:38:58,320
And you know, you have to focus
on all these things because there

861
00:38:58,320 --> 00:39:00,520
is a trade-off with all of them.

862
00:39:00,660 --> 00:39:03,080
Michael: Yeah, well, exciting times
ahead.

863
00:39:03,080 --> 00:39:08,940
It feels like the most rapidly
improving area at the moment and

864
00:39:09,120 --> 00:39:10,160
for good reason.

865
00:39:10,520 --> 00:39:15,040
So I saw on this, pgvector has
a great changelog.

866
00:39:15,040 --> 00:39:17,900
And I saw there's like even the
unreleased part.

867
00:39:18,080 --> 00:39:21,540
So 0.5.2 is currently on there
and unreleased.

868
00:39:21,820 --> 00:39:24,240
So anything you're particularly
excited about that's coming up

869
00:39:24,240 --> 00:39:26,700
or anything not on that list that
you'd love to see?

870
00:39:26,980 --> 00:39:27,380
Jonathan: Yeah.

871
00:39:27,380 --> 00:39:29,860
And I think this is where it really
helps to have feedback on

872
00:39:29,860 --> 00:39:30,880
how people are using it.

873
00:39:30,880 --> 00:39:33,880
So I was very excited about parallel
HNSW builds.

874
00:39:33,880 --> 00:39:39,000
Like this is, I can't, even like
in the middle of that test I

875
00:39:39,000 --> 00:39:41,380
was running last night, like I
emailed Andrew, like Andrew, the

876
00:39:41,380 --> 00:39:42,340
results are amazing.

877
00:39:42,340 --> 00:39:44,040
Like I can't, I can't wait for
it to finish.

878
00:39:44,040 --> 00:39:45,560
Like I can see where this is trending.

879
00:39:46,980 --> 00:39:49,740
So there's 2 things in particular
I'm very excited for.

880
00:39:49,940 --> 00:39:52,580
The first, you know, and we'll
see if it makes it

881
00:39:52,580 --> 00:39:53,980
in, but it's around pre-filtering.

882
00:39:54,720 --> 00:39:58,000
So a lot of the queries today we've
been seeing is like select

883
00:39:58,000 --> 00:40:02,420
star from vector table, order by,
blah blah blah, you know, limit

884
00:40:02,440 --> 00:40:02,840
10.

885
00:40:02,840 --> 00:40:04,340
Find me my 10 nearest neighbors.

886
00:40:04,700 --> 00:40:06,060
But what's happening in practice?

887
00:40:06,060 --> 00:40:10,680
In practice, it's like select star
from table where, you have

888
00:40:10,680 --> 00:40:14,140
some condition, category ID equals
this, order by blah, blah,

889
00:40:14,140 --> 00:40:14,740
blah, blah.
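The two query shapes being contrasted, as a sketch (names are hypothetical; `<->` is pgvector's L2 distance operator, and `$1` stands for the query embedding parameter):

```sql
-- Plain nearest-neighbor search: find the 10 closest vectors.
SELECT * FROM items
ORDER BY embedding <-> $1
LIMIT 10;

-- Filtered ("pre-filtering") variant: restrict by a condition first.
SELECT * FROM items
WHERE category_id = 123
ORDER BY embedding <-> $1
LIMIT 10;
```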

890
00:40:15,320 --> 00:40:18,760
So what this is kind of like is
really a multi-column index,

891
00:40:18,760 --> 00:40:22,540
because what will happen is like
today, either you won't use

892
00:40:22,540 --> 00:40:24,960
the index, effectively you're doing
an exact nearest neighbor

893
00:40:24,960 --> 00:40:26,820
search, which means very accurate
results.

894
00:40:27,260 --> 00:40:31,820
Maybe your dataset gets filtered
down to a small enough size where

895
00:40:31,960 --> 00:40:32,720
it doesn't matter.

896
00:40:32,720 --> 00:40:35,040
Like you're getting a very fast
search.

897
00:40:35,740 --> 00:40:39,080
But what if you have multiple fields
that you need to filter

898
00:40:39,080 --> 00:40:39,580
over?

899
00:40:39,620 --> 00:40:43,680
Or what if the dataset you get
back is 50,000 of these 1,500

900
00:40:43,780 --> 00:40:44,680
dimensional vectors?

901
00:40:44,680 --> 00:40:46,220
Like this could be exhausting.

902
00:40:46,400 --> 00:40:49,340
So there's a patch out there, it's
based on a newer paper on

903
00:40:49,340 --> 00:40:52,480
HNSW, which you can find in the
pgvector repo right now.

904
00:40:52,480 --> 00:40:52,900
Yes.

905
00:40:52,900 --> 00:40:56,260
That lets you build effectively
a multi-column index where it's

906
00:40:56,260 --> 00:40:59,060
able to build the links to group
them together.

907
00:40:59,060 --> 00:41:00,140
Well, there are 2 things, right?

908
00:41:00,140 --> 00:41:02,880
'Cause you still want to be able
to search over your entire vector

909
00:41:02,880 --> 00:41:04,540
set and find your most similar
vectors.

910
00:41:04,540 --> 00:41:11,260
But you can also group the vectors
by your filters.

911
00:41:11,440 --> 00:41:14,340
So that way, you're searching over
just your filters and it does

912
00:41:14,340 --> 00:41:14,980
that pre-filtering.

913
00:41:16,100 --> 00:41:21,200
This is like a what have you done
for me lately feature, because

914
00:41:21,200 --> 00:41:24,600
as soon as you put it out, users
find like, you know, more of

915
00:41:24,600 --> 00:41:27,340
these case studies emerge and users
find like, hey, like, I really

916
00:41:27,340 --> 00:41:27,980
need this.

917
00:41:27,980 --> 00:41:30,060
But what's great is that there are
people who are testing it against

918
00:41:30,060 --> 00:41:30,580
real data.

919
00:41:30,580 --> 00:41:33,040
And this is where if you want to
be involved in pgvector, you

920
00:41:33,040 --> 00:41:36,600
can help is that you see these
patches out there, please test

921
00:41:36,600 --> 00:41:37,640
and report back on them.

922
00:41:37,640 --> 00:41:41,320
Because if you're finding the HNSW
branches useful, describe

923
00:41:41,320 --> 00:41:41,820
your use case.

924
00:41:41,820 --> 00:41:43,860
Like there's an open issue where
people talk about, you know,

925
00:41:43,860 --> 00:41:45,820
they might say like, hey, I really
need this.

926
00:41:46,220 --> 00:41:48,720
And what makes it super valuable
is when you talk about like,

927
00:41:48,720 --> 00:41:51,340
here's exactly how I'm using it,
because it just further justifies

928
00:41:51,340 --> 00:41:52,860
that this is the right direction.

929
00:41:52,900 --> 00:41:56,200
I mean, one of the big goals of pgvector
is to try to keep the

930
00:41:56,200 --> 00:41:59,640
code base simple to maintain and
also make the product simple

931
00:41:59,640 --> 00:42:01,660
or the extension simple to use.

932
00:42:02,080 --> 00:42:04,580
So hearing more about what people
are doing is great.

933
00:42:05,020 --> 00:42:07,360
Another patch that I'm excited
for, and again, we'll see if it

934
00:42:07,360 --> 00:42:10,320
makes it into 0.5.2, you know,
no guarantees, is being able to

935
00:42:10,320 --> 00:42:12,040
support smaller dimensional sizes.

936
00:42:12,240 --> 00:42:15,560
So right now, dimensions are 4-byte
floats.

937
00:42:15,860 --> 00:42:19,200
But there are definitely embedding
models that provide two-byte

938
00:42:19,200 --> 00:42:21,920
floats or one-byte unsigned integers.

939
00:42:22,480 --> 00:42:24,960
So there are a couple branches that
have those in there.

940
00:42:24,960 --> 00:42:27,480
But again, hearing those use cases
will help further support

941
00:42:27,480 --> 00:42:27,980
it.

942
00:42:28,040 --> 00:42:30,560
Because the added bonus to supporting
these smaller dimensions

943
00:42:30,600 --> 00:42:33,940
is that we can index larger vectors
and that'll be able to go

944
00:42:33,940 --> 00:42:35,780
beyond that 2k limit.
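For reference, half-precision support later landed in pgvector as the `halfvec` type (from 0.7.0); at two bytes per dimension, the same page math allows roughly twice the dimensions. A sketch with a hypothetical table:

```sql
-- 2-byte-float vectors: about half the storage, so larger dimension
-- counts become indexable ('items' is a hypothetical table):
CREATE TABLE items (id bigint, embedding halfvec(3072));
CREATE INDEX ON items USING hnsw (embedding halfvec_l2_ops);
```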

945
00:42:36,320 --> 00:42:37,360
Michael: Yes, interesting.

946
00:42:37,900 --> 00:42:42,180
The one final question I had is,
do you see any world where this

947
00:42:42,180 --> 00:42:45,260
becomes part of core Postgres
in the long term?

948
00:42:45,360 --> 00:42:50,280
Jonathan: Yeah, so at PGCon 2023
last year, I had a lightning

949
00:42:50,280 --> 00:42:53,940
talk, which was basically to first
shout out to the community,

950
00:42:53,940 --> 00:42:57,360
like, hey, like, these vector workloads,
this is real, like,

951
00:42:57,360 --> 00:42:58,380
that was still early.

952
00:42:58,380 --> 00:43:01,160
But it's like, hey, this is coming,
there's a, I call it like

953
00:43:01,160 --> 00:43:02,320
a storm of data coming.

954
00:43:02,320 --> 00:43:04,340
Like we want to make sure Postgres
is positioned for it.

955
00:43:04,340 --> 00:43:06,640
Like this is very similar to what
we saw with JSON.

956
00:43:06,760 --> 00:43:09,440
And I was able to get an unconference
session as well where

957
00:43:09,440 --> 00:43:10,620
we discussed it.

958
00:43:10,680 --> 00:43:14,380
And the general consensus was in
the fullness of time, it does

959
00:43:14,380 --> 00:43:17,140
make sense to have something like
this in upstream Postgres.

960
00:43:17,560 --> 00:43:18,060
Great.

961
00:43:18,440 --> 00:43:21,100
But I think there's a few things
here.

962
00:43:21,100 --> 00:43:22,940
First, we have to look at release
cycles.

963
00:43:24,060 --> 00:43:26,200
Postgres releases once a year.

964
00:43:26,200 --> 00:43:29,560
In fact, the feature freeze for
Postgres 17 is coming up in

965
00:43:30,420 --> 00:43:31,580
less than 3 months.

966
00:43:32,100 --> 00:43:34,020
But effectively, there's a...

967
00:43:34,020 --> 00:43:36,380
Now, if you think, let's say you
come up with the idea we want

968
00:43:36,380 --> 00:43:39,720
to support a vector data type right
now, or let's say after feature

969
00:43:39,720 --> 00:43:40,080
freeze.

970
00:43:40,080 --> 00:43:42,440
I mean, there's effectively an
18-month window before it gets

971
00:43:42,440 --> 00:43:44,620
in, because we have to go through
the whole cycle.

972
00:43:45,580 --> 00:43:48,560
And given the pace that this field
is moving, we don't necessarily

973
00:43:48,560 --> 00:43:50,900
want to wait on the Postgres release
cycle.

974
00:43:51,260 --> 00:43:54,100
So being able to do this work in
pgvector or other extensions

975
00:43:54,160 --> 00:43:58,140
does help accelerate adoption of
Postgres as a vector database,

976
00:43:58,140 --> 00:43:58,920
so to speak.

977
00:43:59,680 --> 00:44:02,200
The other thing is that once it
is in upstream Postgres, you

978
00:44:02,200 --> 00:44:03,900
know, that is the on-disk format.

979
00:44:03,960 --> 00:44:05,780
Like that is the rule.

980
00:44:06,220 --> 00:44:09,720
And Tom Lane made a very good point
during that unconference session,

981
00:44:09,720 --> 00:44:12,860
which is like, let's see, you know,
how things shake out in terms

982
00:44:12,860 --> 00:44:14,940
of what the on-disk format is.

983
00:44:15,240 --> 00:44:18,620
Now, pgvector is also trying to
stay true to that contract and

984
00:44:18,620 --> 00:44:22,240
try to keep the on-disk format
as, you know, effectively not

985
00:44:22,240 --> 00:44:23,200
to change it.

986
00:44:23,440 --> 00:44:26,680
Because as soon as you change it,
you gotta like rebuild, re-index,

987
00:44:26,680 --> 00:44:28,680
restore everything, and that's
a costly operation.

988
00:44:28,780 --> 00:44:32,740
So pgvector is trying to apply
the same level of rigor as Postgres

989
00:44:32,800 --> 00:44:36,160
to implementing these features,
but it can move a little bit

990
00:44:36,160 --> 00:44:38,240
faster because it is an extension.

991
00:44:38,240 --> 00:44:40,380
It can have its own release lifecycle.

992
00:44:40,520 --> 00:44:43,580
So I think that's where the, you
know, is that I'm not gonna

993
00:44:43,580 --> 00:44:46,460
say this is an official or unofficial
community position, but

994
00:44:46,460 --> 00:44:48,160
is there an interest in supporting
it upstream?

995
00:44:48,160 --> 00:44:48,660
Absolutely.

996
00:44:48,700 --> 00:44:51,920
But given the rapid emergence and
development we need to make,

997
00:44:52,120 --> 00:44:55,740
we're trying to make as much progress
as possible within pgvector.

998
00:44:56,400 --> 00:44:58,940
Here's the other thing too, like
the big difference between now

999
00:44:58,940 --> 00:44:59,640
and JSON.

1000
00:45:00,280 --> 00:45:03,740
We have the index access method,
we can do this in an extension.

1001
00:45:04,900 --> 00:45:07,780
That is a big change since 10,
15 years ago.

1002
00:45:08,040 --> 00:45:11,020
Michael: That's really good, interesting
and good to know.

1003
00:45:11,320 --> 00:45:13,640
There's another big difference,
which is cloud providers.

1004
00:45:13,680 --> 00:45:20,000
So whilst it could be in core Postgres,
currently, a lot of people

1005
00:45:20,000 --> 00:45:23,140
that install Postgres themselves
or manage Postgres themselves

1006
00:45:23,160 --> 00:45:25,460
more specifically can install pgvector.

1007
00:45:25,840 --> 00:45:29,100
And people that don't are often
on a managed service provider.

1008
00:45:29,480 --> 00:45:32,900
Those folks, most managed service
providers have added support

1009
00:45:32,900 --> 00:45:36,380
for pgvector already, which is
the fastest I've ever seen them

1010
00:45:36,380 --> 00:45:38,300
add a new extension.

1011
00:45:38,600 --> 00:45:40,540
And it's been pretty much everyone.

1012
00:45:40,840 --> 00:45:44,540
So most folks that want access
to pgvector, at least some version

1013
00:45:44,540 --> 00:45:47,920
of it, maybe not always the latest
version, can have it now as

1014
00:45:47,920 --> 00:45:48,420
well.

1015
00:45:48,700 --> 00:45:50,580
Jonathan: Yeah, it's definitely
it's definitely exciting to see

1016
00:45:50,580 --> 00:45:52,320
all the adoption of pgvector.

1017
00:45:52,540 --> 00:45:55,520
And yeah, yeah, yeah, to tease
a little bit, I still

1018
00:45:55,520 --> 00:45:56,640
think the best is yet to come.

1019
00:45:56,640 --> 00:45:59,620
Like I think pgvector has made
a lot of progress in the past

1020
00:45:59,620 --> 00:46:00,040
year.

1021
00:46:00,040 --> 00:46:02,560
I mean, it's been tremendous
work by the community, and in

1022
00:46:02,560 --> 00:46:03,400
particular, Andrew.

1023
00:46:03,400 --> 00:46:07,080
Like, I can't say enough nice things
about the work Andrew has

1024
00:46:07,080 --> 00:46:09,520
done and really the diligence he's
put into it.

1025
00:46:09,760 --> 00:46:11,820
And I still think there's more
to do.

1026
00:46:11,820 --> 00:46:14,640
Like I said, even just like, well,
the patch that came out last

1027
00:46:14,640 --> 00:46:17,360
night, I mean, just shows that
there's still even more performance

1028
00:46:17,360 --> 00:46:18,720
that we can get out of it.

1029
00:46:18,720 --> 00:46:21,900
Michael: And for folks listening,
we're recording on the 16th.

1030
00:46:21,900 --> 00:46:23,500
So I'll link that up.

1031
00:46:23,640 --> 00:46:24,140
Wonderful.

1032
00:46:24,220 --> 00:46:25,200
Jonathan, thank you so much.

1033
00:46:25,200 --> 00:46:27,440
Is there anything else I should
have asked you but didn't?

1034
00:46:27,440 --> 00:46:29,440
Or any last things you wanted to
say?

1035
00:46:29,640 --> 00:46:32,360
Jonathan: Yeah, I think one of the
big things when you think about

1036
00:46:32,360 --> 00:46:36,420
contributing to open source is
writing a patch, writing code.

1037
00:46:36,660 --> 00:46:39,760
And I'm going to say this in the
context of pgvector, but I

1038
00:46:39,760 --> 00:46:43,580
think this applies to any project,
even Postgres itself, is that

1039
00:46:43,580 --> 00:46:45,360
there are many different ways to
contribute.

1040
00:46:45,960 --> 00:46:48,580
Testing is huge, because testing,
particularly if you can test

1041
00:46:48,580 --> 00:46:50,980
something that's close to a real
workload, like don't test your

1042
00:46:50,980 --> 00:46:53,000
production workloads with these
things, but like test something

1043
00:46:53,000 --> 00:46:54,260
that's close to production.

1044
00:46:54,520 --> 00:46:56,880
That helps because that helps drive
the use case and hearing

1045
00:46:56,880 --> 00:46:59,340
how you use something and just
talking about your different use

1046
00:46:59,340 --> 00:47:01,980
cases of being supportive of the
community in that way.

1047
00:47:02,300 --> 00:47:03,000
That helps.

1048
00:47:03,260 --> 00:47:06,820
Helping to write documentation,
helping to advocate for something

1049
00:47:06,820 --> 00:47:08,460
that you think can help others.

1050
00:47:08,480 --> 00:47:10,180
Again, all these things can help
a project.

1051
00:47:10,180 --> 00:47:13,340
So if you want to contribute to
pgvector or Postgres, there

1052
00:47:13,340 --> 00:47:14,880
are a variety of different ways.

1053
00:47:15,060 --> 00:47:20,160
And maybe too, as it's top of mind
right now, PGCon, which was

1054
00:47:20,380 --> 00:47:22,120
effectively the annual developer
conference.

1055
00:47:22,120 --> 00:47:22,840
It's evolved.

1056
00:47:22,940 --> 00:47:25,740
It's now called PGConf.dev, P-G-C-O-N-F.dev.

1057
00:47:28,840 --> 00:47:31,120
It's being held in Vancouver at
the end of May.

1058
00:47:31,120 --> 00:47:34,720
I can tell you, I can almost guarantee
vector workloads will

1059
00:47:34,720 --> 00:47:37,200
be a topic of discussion there.

1060
00:47:37,200 --> 00:47:38,920
But just all things Postgres.

1061
00:47:38,940 --> 00:47:42,280
And the idea is that while certainly
a lot of

1062
00:47:42,280 --> 00:47:46,500
the discussions are around technical
hacking topics, really,

1063
00:47:46,500 --> 00:47:49,300
if you step back, the gist is how
do we continue building and

1064
00:47:49,300 --> 00:47:50,720
growing the Postgres community?

1065
00:47:50,740 --> 00:47:55,120
So, the CFP actually
just closed yesterday, or when

1066
00:47:55,120 --> 00:47:58,640
this airs, probably a few weeks
before, but that's a great way

1067
00:47:58,640 --> 00:48:00,040
to participate as well.

1068
00:48:00,040 --> 00:48:03,120
Even if you're new to the community,
because I know personally,

1069
00:48:03,120 --> 00:48:05,740
at one point I was new to the community,
and the first time I went

1070
00:48:05,740 --> 00:48:09,140
to PGCon, I'm like, oh my God,
like I know nothing about how

1071
00:48:09,140 --> 00:48:09,840
a database works.

1072
00:48:09,840 --> 00:48:12,660
Like I can write a select query,
but geez.

1073
00:48:12,660 --> 00:48:15,720
But it is a way to help have an impact
on the community.

1074
00:48:15,720 --> 00:48:18,900
And just talking to folks who are
working on the software or

1075
00:48:18,900 --> 00:48:22,760
working on events or hosting podcasts
or finding ways to help

1076
00:48:22,760 --> 00:48:27,980
grow the community, it's a great
way to participate and help

1077
00:48:27,980 --> 00:48:28,340
growth.

1078
00:48:28,340 --> 00:48:33,680
So certainly think about attending
and participating.

1079
00:48:33,960 --> 00:48:36,380
And again, there's all sorts of
different ways to contribute.

1080
00:48:36,380 --> 00:48:38,440
So that's the parting message I
have.

1081
00:48:38,440 --> 00:48:40,460
I'm far from a database hacker.

1082
00:48:40,760 --> 00:48:43,260
I can write a couple of lines of C
here and there.

1083
00:48:43,260 --> 00:48:47,800
But where I've found a home in
open source is working on all

1084
00:48:47,800 --> 00:48:50,500
sorts of other aspects around open
source projects.

1085
00:48:55,030 --> 00:49:01,920
Thank you for having me on your
wonderful podcast.

1086
00:49:02,560 --> 00:49:03,380
Michael: Oh, cheers.

1087
00:49:03,524 --> 00:49:03,834
Take care

1088
00:49:03,834 --> 00:49:04,143
Jonathan: Take care.