1
00:00:00,060 --> 00:00:03,300
Nikolay: Hello, hello, this is
PostgresFM, episode number 101.

2
00:00:04,700 --> 00:00:08,960
My name is Nik Samokhvalov, founder
of Postgres.AI.

3
00:00:10,020 --> 00:00:10,940
I thought about...

4
00:00:13,260 --> 00:00:17,280
Since last time we had episode
number 100, and it was interesting

5
00:00:17,280 --> 00:00:21,500
because we discussed databases
at scale of 100 terabytes and

6
00:00:21,500 --> 00:00:22,000
beyond.

7
00:00:23,340 --> 00:00:27,180
Logically this episode should be
something very simple and beginner

8
00:00:27,180 --> 00:00:29,080
entry level, because it's 101.

9
00:00:29,100 --> 00:00:32,980
You know, in the US, 101 means
something like an introduction

10
00:00:32,980 --> 00:00:34,040
to something, right?

11
00:00:35,660 --> 00:00:39,920
But I guess, I hope, we will
dive quite

12
00:00:39,920 --> 00:00:42,320
deep into some details of some
things.

13
00:00:42,740 --> 00:00:44,180
Very interesting to me personally.

14
00:00:45,040 --> 00:00:51,020
But also, since a couple of days
ago, Timescale released an addition

15
00:00:51,020 --> 00:00:53,140
to pgvector called pgvectorscale.

16
00:00:54,000 --> 00:00:58,520
I asked developers to join and
they agreed.

17
00:00:58,520 --> 00:01:00,740
So meet Mat Arye.

18
00:01:01,220 --> 00:01:01,920
Hi, Mat.

19
00:01:02,076 --> 00:01:03,232
Thank you

20
00:01:03,388 --> 00:01:04,544
for coming.

21
00:01:04,700 --> 00:01:06,200
Mat: Thank you for having us.

22
00:01:06,940 --> 00:01:08,160
Nikolay: And John Pruitt.

23
00:01:10,120 --> 00:01:11,680
Thank you for coming as well.

24
00:01:12,180 --> 00:01:14,540
It's great to have you.

25
00:01:14,540 --> 00:01:18,540
And let's start. It's kind of an
interview, but maybe it will

26
00:01:18,540 --> 00:01:25,120
be some informal discussion, because
I don't know, we know each

27
00:01:25,120 --> 00:01:27,480
other for some time and why not,
right?

28
00:01:30,300 --> 00:01:34,460
Let's start from the beginning,
from maybe some distance.

29
00:01:34,820 --> 00:01:37,720
Why do we need at all some vector
search in Postgres and maybe

30
00:01:37,720 --> 00:01:39,060
some vector search in general?

31
00:01:39,060 --> 00:01:46,020
Because recently, not long ago,
LLMs became...

32
00:01:46,260 --> 00:01:49,540
They're becoming bigger and bigger
at a very high pace, in terms

33
00:01:49,540 --> 00:01:55,300
of how many dimensions, but also
in terms of context window.

34
00:01:56,880 --> 00:02:01,600
Gemini now supports 1 million, and
there is talk about 2 million.

35
00:02:02,080 --> 00:02:05,880
Although I must admit, Gemini is
terrible in terms of reliability.

36
00:02:07,540 --> 00:02:12,940
We have a lot of credits being
an AI startup, we have a lot of

37
00:02:12,940 --> 00:02:14,000
credits to spend.

38
00:02:14,640 --> 00:02:20,420
And it's a very well-known problem
with Gemini that it returns 500 errors

39
00:02:20,420 --> 00:02:21,240
all the time.

40
00:02:24,060 --> 00:02:26,080
It's making me very sad.

41
00:02:26,400 --> 00:02:31,580
But a 1 million context window means
that probably we don't need

42
00:02:31,820 --> 00:02:35,940
vector search at all and we can
just put everything into 1 prompt,

43
00:02:35,940 --> 00:02:36,440
right?

44
00:02:36,820 --> 00:02:40,900
There was such a crazy idea when
context became really large.

45
00:02:40,900 --> 00:02:44,440
Imagine if we have like 100 million
tokens context window.

46
00:02:44,500 --> 00:02:48,240
You can just put everything into
a single question, or is it an

47
00:02:48,240 --> 00:02:49,180
insane idea?

48
00:02:49,600 --> 00:02:50,520
What do you think?

49
00:02:54,060 --> 00:02:57,180
Mat: I think there are a few
points to point out.

50
00:02:57,180 --> 00:03:01,440
First of all, there's still the
bandwidth that you need to use

51
00:03:01,440 --> 00:03:06,060
up to send the question to the
LLM, right?

52
00:03:06,100 --> 00:03:12,080
So there is going to be a pure
networking limit at some point.

53
00:03:17,220 --> 00:03:23,260
Another point is that even with
very big context windows or token

54
00:03:23,260 --> 00:03:30,480
windows, yes, you can give the
LLM a lot of information, but

55
00:03:30,480 --> 00:03:35,440
there's actually an open question
about whether it will use all

56
00:03:35,440 --> 00:03:38,100
of the information it gets well.

57
00:03:38,680 --> 00:03:42,820
And there's some academic research
on LLMs that's actually

58
00:03:42,840 --> 00:03:48,900
showing that giving it less information
but more relevant information

59
00:03:49,580 --> 00:03:58,480
in the prompt gets you better results
than giving it all the

60
00:03:58,480 --> 00:04:00,560
information you have, right?

61
00:04:00,720 --> 00:04:08,300
And so, I mean, with LLMs in general,
all of this is purely empirical

62
00:04:08,520 --> 00:04:09,020
research.

63
00:04:09,320 --> 00:04:13,360
There's actually not very good
scientific backing either way,

64
00:04:13,360 --> 00:04:19,660
but with the LLMs that people have
tested, I know that huge prompts

65
00:04:20,140 --> 00:04:28,520
are often not very good in terms
of performance, in terms of

66
00:04:28,520 --> 00:04:30,000
the answers you get.

67
00:04:30,620 --> 00:04:38,160
So in that sense, narrowing prompts
down using either vector search

68
00:04:38,600 --> 00:04:44,580
or text search or hybrid search,
whatever you want, in many cases

69
00:04:44,600 --> 00:04:47,820
gives you a better answer.

71
00:04:51,740 --> 00:04:59,320
And apart from that, and this
is a bit controversial, I'm actually

72
00:04:59,540 --> 00:05:07,160
not convinced that RAG is the killer
app for vector databases.

73
00:05:07,840 --> 00:05:14,040
There's a lot of other use cases
for vector databases, starting

74
00:05:14,040 --> 00:05:19,940
from clustering, recommendation
engines, plain old search engines,

75
00:05:20,360 --> 00:05:20,860
right?

76
00:05:20,940 --> 00:05:25,240
A lot of other things that have
nothing to do with RAG and

77
00:05:25,240 --> 00:05:31,600
that increasing context windows
will not, you know, help or hurt.

78
00:05:31,600 --> 00:05:33,840
These are just orthogonal things.

79
00:05:34,620 --> 00:05:41,540
And the basic thing that vector
search gives you is semantic

80
00:05:41,780 --> 00:05:42,280
search.

81
00:05:42,540 --> 00:05:44,940
There's a lot of applications for
that.

82
00:05:45,060 --> 00:05:46,660
Nikolay: So not only RAG, right?

83
00:05:47,080 --> 00:05:47,780
So interesting.

84
00:05:50,340 --> 00:05:51,000
Yeah, that's...

85
00:05:51,220 --> 00:05:56,680
Okay, so your answer has 2 points
to summarize.

86
00:05:57,440 --> 00:06:01,920
First, we still need to narrow
down and find...

87
00:06:02,520 --> 00:06:06,420
And if we put everything into 1 question,
it will not be efficient.

88
00:06:07,040 --> 00:06:08,580
I agree, with LLMs sometimes.

89
00:06:09,840 --> 00:06:13,760
If you put too much information,
the answers are not good.

90
00:06:13,860 --> 00:06:16,620
Quality of answers decreases for
sure.

91
00:06:16,620 --> 00:06:18,940
And second, it's not only about
RAG.

92
00:06:18,940 --> 00:06:20,640
I would like to explore this.

93
00:06:20,740 --> 00:06:25,240
Do you see some good examples beyond
retrieval-augmented generation?

94
00:06:27,740 --> 00:06:32,340
Actually while you think, Let me
just spend maybe half a minute,

95
00:06:32,560 --> 00:06:36,760
because I know our audience is
mostly Postgres engineers, DBAs,

96
00:06:37,080 --> 00:06:38,140
and backend developers.

97
00:06:38,840 --> 00:06:44,720
Of course, many already heard about
RAG, but just a small recap.

98
00:06:44,820 --> 00:06:46,320
It's quite a simple thing.

99
00:06:46,740 --> 00:06:47,860
Actually, you're right.

100
00:06:47,860 --> 00:06:50,240
Maybe vector search is not needed
in RAG.

101
00:06:50,280 --> 00:06:52,620
For example, we can use full text
search there.

102
00:06:53,440 --> 00:07:01,120
So the idea is when some request
is coming to LLM, before passing

103
00:07:01,120 --> 00:07:05,320
this request to LLM, we have intermediate
software which finds

104
00:07:05,320 --> 00:07:08,180
additional information which should
be relevant.

105
00:07:08,760 --> 00:07:12,440
Here vector search can be useful,
but maybe not only vector search,

106
00:07:12,440 --> 00:07:13,940
for example, full text search.

107
00:07:14,660 --> 00:07:21,020
Then we augment this request
with this information, and

108
00:07:21,020 --> 00:07:22,580
the LLM has some context.

109
00:07:23,100 --> 00:07:27,040
Basically, it's similar to when you
go to an exam and you have a lot

110
00:07:27,040 --> 00:07:32,020
of information in your pockets.
When you need to answer, you

111
00:07:32,020 --> 00:07:36,500
can get this information out of
your pockets and answer better.

112
00:07:37,360 --> 00:07:39,260
Or use Wikipedia, for example.

113
00:07:39,820 --> 00:07:40,320
Similar.

114
00:07:42,740 --> 00:07:45,480
But you just told me that it's
not only RAG.

115
00:07:45,480 --> 00:07:46,400
What else?

116
00:07:46,620 --> 00:07:48,900
What else for vector search particularly?

117
00:07:51,680 --> 00:07:56,580
Mat: I mean, fundamentally, I think
any time that you are right

118
00:07:56,580 --> 00:08:01,160
now using full text search or text
search, you could substitute

119
00:08:01,440 --> 00:08:05,420
that with semantic search to give
you better results.

120
00:08:05,860 --> 00:08:09,460
Let me just give a simple example
for people.

121
00:08:10,120 --> 00:08:16,640
With full text search, searching
for a query on car does

122
00:08:16,640 --> 00:08:25,620
not return results that have truck
in the document, right?

123
00:08:25,940 --> 00:08:30,820
Semantic search solves that because
it's kind of, I mean, that

124
00:08:30,820 --> 00:08:34,680
sounds almost magical, but it's
kind of a search on meaning.

125
00:08:35,140 --> 00:08:39,840
So things that are close together
in meaning get retrieved.

126
00:08:40,560 --> 00:08:45,520
So you could use a totally different
lexical structure, totally

127
00:08:45,560 --> 00:08:53,520
different word, and somehow this
type of search figured out that

128
00:08:53,520 --> 00:08:59,080
they are similar in the meaning
space, if you will.

129
00:09:00,940 --> 00:09:01,740
John: Or you could use

130
00:09:01,740 --> 00:09:02,240
it to

131
00:09:03,400 --> 00:09:06,800
augment full-text search
or as a hybrid of the 2.

132
00:09:08,040 --> 00:09:10,160
Nikolay: Have you seen good examples
of this?

133
00:09:10,160 --> 00:09:13,760
Like I saw terrible techniques,
like for example, let's find

134
00:09:15,840 --> 00:09:18,340
100 results from full-text search,
100 results from vector search,

135
00:09:18,340 --> 00:09:19,280
and then combine them.

136
00:09:19,280 --> 00:09:23,500
But always I have a question, what's
the ranking system?

137
00:09:23,500 --> 00:09:27,260
Because it's very different, and
sometimes we want fresh information.

138
00:09:28,660 --> 00:09:32,060
A specific question I would like
to explore maybe later is how to

139
00:09:32,500 --> 00:09:35,580
reflect freshness of data.

140
00:09:36,500 --> 00:09:40,440
We can discuss it maybe later because
I also already want to

141
00:09:40,680 --> 00:09:43,600
discuss what you in particular just
released.

142
00:09:44,140 --> 00:09:48,420
But did you see any good examples
of combination of full-text

143
00:09:48,420 --> 00:09:51,500
search and semantic search like
vector search?

144
00:09:54,920 --> 00:10:00,180
Mat: It's hard to say because we
are database providers so we

145
00:10:00,180 --> 00:10:02,200
kind of see the...

146
00:10:02,640 --> 00:10:05,920
We don't see the examples that
our customers have.

147
00:10:06,420 --> 00:10:11,780
We get told by our customers that,
hey, this technique works,

148
00:10:11,980 --> 00:10:14,280
you find it useful, blah, blah,
blah.

149
00:10:14,340 --> 00:10:17,240
But we rarely actually see the
results ourselves.

151
00:10:19,460 --> 00:10:27,380
But like anecdotally, a lot of
people are using hybrid search.

152
00:10:29,080 --> 00:10:35,580
And like, in the AI community,
I think it's almost becoming

153
00:10:35,580 --> 00:10:36,360
the standard.

154
00:10:37,300 --> 00:10:42,320
What is potentially a problem for
Postgres we can go into later,

155
00:10:42,780 --> 00:10:47,660
but that's what we're hearing, at
least, a lot.

156
00:10:47,660 --> 00:10:52,200
But I will agree that this is totally
all ad hoc, non-scientific,

157
00:10:53,520 --> 00:10:58,380
like, scratch my left ear with my
right arm type thing, right?

158
00:11:01,440 --> 00:11:06,020
Nikolay: For me, it looks almost
always like all examples I see,

159
00:11:06,020 --> 00:11:09,160
they look ugly because, for example,
if you think about pagination,

160
00:11:09,280 --> 00:11:11,340
there is no way to have pagination
there.

161
00:11:11,680 --> 00:11:15,520
Although if we go back to vector
search, it's also problematic

162
00:11:15,720 --> 00:11:20,220
there because to go to page number
100, you need to extract all

163
00:11:20,220 --> 00:11:23,260
100 pages, and it's very inefficient
performance-wise.

164
00:11:24,140 --> 00:11:25,960
Let's discuss the project.

165
00:11:25,960 --> 00:11:27,280
It's called pgvectorscale.

166
00:11:28,580 --> 00:11:32,580
It looks like you just append a
suffix scale to pgvector.

167
00:11:33,420 --> 00:11:35,640
It's a very interesting approach.

168
00:11:37,460 --> 00:11:39,660
Let me ask you this.

169
00:11:40,740 --> 00:11:43,780
It's very well known that TimescaleDB
is a great thing

170
00:11:43,780 --> 00:11:44,840
for time series.

171
00:11:45,340 --> 00:11:49,820
It has a free version, and you
can host it yourself, a community

172
00:11:49,820 --> 00:11:51,300
edition, I think it's called.

173
00:11:52,360 --> 00:11:59,760
I doubt it's true open source because
it doesn't have an OSI-approved

174
00:11:59,760 --> 00:12:00,260
license.

175
00:12:00,920 --> 00:12:02,520
So it's like some specific thing.

176
00:12:02,520 --> 00:12:06,020
Or maybe an Apache-licensed version exists,
but it lacks a lot of good stuff.

177
00:12:06,020 --> 00:12:07,840
For example, I think it lacks compression.

178
00:12:07,840 --> 00:12:08,740
Maybe I'm wrong.

179
00:12:08,800 --> 00:12:09,440
Is it?

180
00:12:09,440 --> 00:12:10,820
Mat: Yes, it lacks compression.

181
00:12:10,900 --> 00:12:15,560
So we have an Apache version, we
have a community version.

182
00:12:16,100 --> 00:12:16,600
Nikolay: Right.

183
00:12:16,720 --> 00:12:20,280
I think you can use the extended version,
still not paying if you

184
00:12:20,280 --> 00:12:24,140
host it yourself, but it cannot be
considered true open source because

185
00:12:24,140 --> 00:12:27,320
this license is already something
like hybrid and so on.

186
00:12:29,060 --> 00:12:30,840
Mat: We call it source available.

187
00:12:32,460 --> 00:12:34,820
Nikolay: Yeah, yeah, this is a
common term.

188
00:12:37,820 --> 00:12:43,580
So then Postgres itself has a very,
very, very simple and permissive

189
00:12:43,660 --> 00:12:44,160
license.

190
00:12:45,380 --> 00:12:49,460
That's why many commercial products
developed on top of it.

191
00:12:49,960 --> 00:12:55,580
Then we have pgvector, which is
a separate extension, which in

192
00:12:55,580 --> 00:12:58,940
my opinion should at some point
go to core Postgres.

193
00:12:59,160 --> 00:13:01,140
But there is a big question, how
come?

194
00:13:01,140 --> 00:13:06,900
Because it's breaking all the rules,
because just creating an index

195
00:13:06,900 --> 00:13:11,000
you might get results which differ
from the original results without

196
00:13:11,000 --> 00:13:11,880
the index, right?

197
00:13:12,560 --> 00:13:16,160
It's something which we didn't
have before, because as always,

198
00:13:16,700 --> 00:13:19,240
indexes just sped up queries.

199
00:13:19,300 --> 00:13:23,860
Now we can change results because
it's approximate nearest neighbor search.
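Since an approximate index can change which rows come back, the standard way to quantify that difference is recall@k: the fraction of the exact top-k results that the index actually returns. A minimal sketch; the IDs below are made up for illustration.

```python
# Recall@k: how much of the exact top-k an approximate index returns.
def recall_at_k(exact_ids, approx_ids):
    exact, approx = set(exact_ids), set(approx_ids)
    return len(exact & approx) / len(exact)

exact_top3 = ["v1", "v2", "v3"]    # from an exact sequential scan
approx_top3 = ["v1", "v3", "v7"]   # from an ANN index; found 2 of 3
r = recall_at_k(exact_top3, approx_top3)
```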

200
00:13:24,960 --> 00:13:27,680
This project, I think it's...

201
00:13:27,980 --> 00:13:29,360
Which license is it?

202
00:13:30,120 --> 00:13:31,700
I don't remember, but I think...

203
00:13:31,700 --> 00:13:32,200
Yeah.

204
00:13:33,220 --> 00:13:35,140
It's also the Postgres license, right.

205
00:13:35,140 --> 00:13:39,280
And maybe that's why you also developed
your product on top of

206
00:13:39,280 --> 00:13:40,580
it using Postgres license.

207
00:13:40,580 --> 00:13:49,620
But it's very interesting because
before, Timescale had this

208
00:13:50,220 --> 00:13:53,720
major product everyone loved,
compression especially, but only

209
00:13:53,720 --> 00:13:57,840
if you go that route, source available,
or you go to Timescale

210
00:13:57,880 --> 00:14:04,340
Cloud; now you release this thing
under the Postgres license, meaning

211
00:14:04,340 --> 00:14:09,440
that anyone, any cloud provider,
most of them already have pgvector,

212
00:14:09,960 --> 00:14:16,680
so naturally they should already
be considering adding your extension

213
00:14:17,140 --> 00:14:19,060
to their services.

214
00:14:20,020 --> 00:14:22,980
I think, of course, you've thought
about it very well.

215
00:14:23,000 --> 00:14:27,140
So my question is: why did you change
your approach compared to Timescale?

216
00:14:28,780 --> 00:14:33,780
Mat: Yeah, so I'll speak for myself,
maybe not for the company.

217
00:14:37,240 --> 00:14:43,880
I think, look, we all love Postgres,
but I think Postgres needs

218
00:14:44,060 --> 00:14:45,800
different things in different areas.

219
00:14:47,780 --> 00:14:54,800
In the time series area,
Postgres already has a lot

220
00:14:55,340 --> 00:14:59,780
of functionality to build on, and
we are already building upon

221
00:15:00,480 --> 00:15:07,480
a well-known foundation where I
would argue that given our broad

222
00:15:07,480 --> 00:15:12,320
definition of time series, in many
ways Postgres already dominates

223
00:15:13,180 --> 00:15:13,880
the market.

224
00:15:14,240 --> 00:15:20,440
So we felt that we could build
a company while contributing some

225
00:15:20,440 --> 00:15:26,400
things to Postgres, but also quite
frankly making a business

226
00:15:26,680 --> 00:15:34,520
for ourselves and making money
in this area by preventing hyperscalers

227
00:15:35,600 --> 00:15:39,620
from making all the money off of
our work, which is what the

228
00:15:39,620 --> 00:15:40,840
TSL license does.

229
00:15:40,840 --> 00:15:46,360
It doesn't stop any individual
from benefiting from our work

230
00:15:46,360 --> 00:15:52,480
or really any company, it only
stops the cloud providers from

231
00:15:52,480 --> 00:15:54,000
making money off our work.

232
00:15:54,000 --> 00:15:58,300
And that was then for pragmatic
reasons so that we could make

233
00:15:58,300 --> 00:15:58,940
a business.

234
00:16:00,600 --> 00:16:08,740
In terms of vector, I think the
whole vector market is much more

235
00:16:10,440 --> 00:16:12,540
nascent and much younger.

236
00:16:13,180 --> 00:16:18,020
I don't think Postgres has won
the market a lot.

237
00:16:18,660 --> 00:16:27,600
You can look at how many other
databases have gotten millions

238
00:16:27,620 --> 00:16:31,440
of dollars in the past year to
develop their own solution.

239
00:16:31,780 --> 00:16:38,540
This is a very hot, fast-moving
market where I don't think Postgres

240
00:16:39,280 --> 00:16:40,940
yet has a particular...

241
00:16:42,980 --> 00:16:44,540
It hasn't won, right?

242
00:16:45,040 --> 00:16:49,460
And so, because we like Postgres
so much, we wanted

243
00:16:49,540 --> 00:16:50,280
Nikolay: to know when

244
00:16:50,280 --> 00:16:52,160
Mat: Postgres wins.

245
00:16:54,520 --> 00:16:55,320
And so...

246
00:16:56,800 --> 00:17:01,060
Nikolay: There are many benchmarks
where Postgres has very poor

247
00:17:01,060 --> 00:17:06,560
results with pgvector on ivfflat
index without HNSW and so on,

248
00:17:06,560 --> 00:17:07,060
right?

249
00:17:07,680 --> 00:17:09,040
And they're still around.

250
00:17:09,520 --> 00:17:14,400
Mat: I will say with pgvector 0.7,
things have improved on the

251
00:17:14,400 --> 00:17:17,340
pgvector side quite a lot as well.

252
00:17:18,600 --> 00:17:23,640
But yeah, and I think not only
in terms of benchmarks, I think

253
00:17:23,720 --> 00:17:30,520
in terms of actual usage. You
know, last episode you talked

254
00:17:30,520 --> 00:17:38,300
about 100 terabyte databases. We
want to reach 100 terabyte databases

255
00:17:38,520 --> 00:17:39,840
that have vector data.

256
00:17:39,840 --> 00:17:43,100
We have to do a lot of work to
make that happen, right?

257
00:17:43,100 --> 00:17:46,920
And I think that's the ultimate
goal.

258
00:17:47,680 --> 00:17:51,000
And that should be the ultimate
goal of everybody in the community.

259
00:17:51,500 --> 00:17:57,840
We thought that it was more wise
to help the entire community

260
00:17:58,020 --> 00:18:00,540
succeed at this point in time.

261
00:18:01,060 --> 00:18:05,460
Nikolay: Yeah, when I saw benchmarks,
like 100,000, 500,000 vectors,

262
00:18:05,460 --> 00:18:07,440
I was like, what's happening here?

263
00:18:07,440 --> 00:18:09,740
People say, oh, it's already a
lot.

264
00:18:09,880 --> 00:18:10,960
It's not a lot.

265
00:18:12,980 --> 00:18:16,720
You take any company, take their
databases, and they have already

266
00:18:16,720 --> 00:18:20,940
so many data entries in their databases.

267
00:18:21,260 --> 00:18:23,940
Out of those entries we can create
vectors usually.

268
00:18:25,440 --> 00:18:29,280
It means that we should speak about
billions already, but it's

269
00:18:29,280 --> 00:18:33,280
not there yet because it's problematic
because vectors are large

270
00:18:33,280 --> 00:18:39,780
and building indexes is slow, takes
a lot of time, and the latencies

271
00:18:39,880 --> 00:18:41,020
of search, of course.

272
00:18:41,980 --> 00:18:43,260
Everything is quite slow.

273
00:18:43,260 --> 00:18:49,220
I'm glad you did your benchmarks
starting with 50 million vectors,

274
00:18:49,220 --> 00:18:50,400
already very good.

275
00:18:51,040 --> 00:18:55,140
I'm super glad to hear that you
talk about terabytes of data.

276
00:18:55,380 --> 00:19:00,220
By the way, some people say, okay,
for us, for Postgres guys,

277
00:19:00,220 --> 00:19:04,080
not for Timescale, regular Postgres
guys, 1 terabyte is usually

278
00:19:04,080 --> 00:19:05,140
1 billion rows.

279
00:19:06,260 --> 00:19:08,940
With Timescale compression, it's
many more.

280
00:19:09,280 --> 00:19:14,120
1 terabyte should be maybe tens
of billions of rows, right?

281
00:19:14,640 --> 00:19:15,520
And I know...

282
00:19:15,900 --> 00:19:17,500
Tens of billions.

283
00:19:17,550 --> 00:19:18,340
Right, right.

284
00:19:18,340 --> 00:19:21,480
So it's a different order of magnitude.

285
00:19:22,000 --> 00:19:27,380
But if we talk about pgvector,
50 million vectors.

286
00:19:28,500 --> 00:19:29,480
Let's talk about...

287
00:19:31,220 --> 00:19:34,700
This project has 2 things to bring
on the table.

288
00:19:35,820 --> 00:19:40,520
Maybe let's close first with a
non-technical question, because

289
00:19:40,520 --> 00:19:44,520
I cannot skip this question just
sitting in my head.

290
00:19:44,600 --> 00:19:49,200
So pgvector had ivfflat index originally,
then HNSW was added.

291
00:19:49,200 --> 00:19:52,700
I remember Neon participated, starting
with a separate project,

292
00:19:52,700 --> 00:19:54,740
then decided to contribute to pgvector.

293
00:19:55,200 --> 00:19:56,620
So now it has 2 indexes.

294
00:19:57,100 --> 00:20:00,220
And pgvectorscale brings a third
type of index, right?

295
00:20:02,420 --> 00:20:04,840
It could be a pull request to pgvector,
no?

296
00:20:06,160 --> 00:20:10,180
Mat: So we actually talked with
Andrew Kane about this.

297
00:20:10,520 --> 00:20:15,360
The issue is that we are written
in Rust, not C.

298
00:20:16,400 --> 00:20:21,840
And Andrew thought that it would
be better as a separate project.

299
00:20:22,120 --> 00:20:24,820
I can't say I disagree.

300
00:20:26,040 --> 00:20:31,040
I think either could have worked,
but I think given the language

301
00:20:31,080 --> 00:20:37,360
difference, it makes some sense
to have it as a separate project.

302
00:20:37,360 --> 00:20:38,660
But we did all for...

303
00:20:39,520 --> 00:20:40,760
Nikolay: Yeah, well, makes sense.

304
00:20:40,760 --> 00:20:41,760
Makes total sense.

305
00:20:41,760 --> 00:20:42,780
But why Rust?

306
00:20:45,400 --> 00:20:47,780
Mat: Because I like to work quickly.

307
00:20:48,280 --> 00:20:56,100
And I thought Rust would allow
me to do that more than other

308
00:20:56,100 --> 00:20:56,600
things.

309
00:20:57,040 --> 00:21:01,600
I should say I'm 1 of the people
that worked on compression at Timescale,

310
00:21:02,540 --> 00:21:12,400
and, you know, our compression
system had our own data

311
00:21:12,400 --> 00:21:18,080
type on disk and I remember the
pain of having to make sure that

312
00:21:18,080 --> 00:21:25,140
the way you are writing data on
disk would work very well.

313
00:21:26,120 --> 00:21:31,660
And working across platforms, big-endian,
little-endian, all of these,

314
00:21:31,720 --> 00:21:36,140
the alignment issues, it's just
a lot of double checking that

315
00:21:36,140 --> 00:21:39,040
you have to do to make sure everything
is correct.

316
00:21:39,400 --> 00:21:40,760
I wanted to avoid this.

317
00:21:40,760 --> 00:21:47,280
And with Rust, a lot of this
could be a lot simpler.

318
00:21:48,420 --> 00:21:49,280
So, like...

319
00:21:49,280 --> 00:21:50,880
Nikolay: Yeah, makes total sense.

320
00:21:53,100 --> 00:21:53,600
Yeah.

321
00:21:53,940 --> 00:21:54,440
Mm-hmm.

322
00:21:55,440 --> 00:21:56,100
Right, right.

323
00:21:56,440 --> 00:22:01,400
So, this makes total sense, but
for end user, it's interesting.

324
00:22:02,520 --> 00:22:03,700
Especially those who self-manage.

325
00:22:03,920 --> 00:22:07,080
So you need to take Postgres, then
you need to bring pgvector,

326
00:22:07,120 --> 00:22:09,240
then you need to bring pgvectorscale.

327
00:22:10,240 --> 00:22:12,320
It's an interesting situation.

328
00:22:13,820 --> 00:22:14,300
We do

329
00:22:14,300 --> 00:22:15,580
Mat: have web-based packages.

330
00:22:19,740 --> 00:22:22,280
Nikolay: This is interesting, but
now I understand better.

331
00:22:23,080 --> 00:22:25,220
Let's talk about the index itself.

332
00:22:30,240 --> 00:22:35,780
What can you tell us about the
index compared to HNSW?

333
00:22:36,940 --> 00:22:42,800
I know HNSW is like a memory thing,
this works with disk, so

334
00:22:42,800 --> 00:22:46,640
this is definitely the right thing
to do if we talk about terabytes

335
00:22:46,640 --> 00:22:49,100
of data and billions of rows and
so on?

336
00:22:50,740 --> 00:22:51,240
Mat: Yeah.

337
00:22:52,040 --> 00:22:57,540
So I'm going to hand wave a lot
here because this is a lot of

338
00:22:57,540 --> 00:23:05,640
history, but it turns out for various
reasons that the best way

339
00:23:06,020 --> 00:23:12,880
to index vector data is using a
graph structure, so a graph database.

340
00:23:13,320 --> 00:23:20,940
You can imagine each vector is
a node and is connected to other

341
00:23:20,940 --> 00:23:26,280
nodes that are mostly
close to it, right?

342
00:23:26,280 --> 00:23:28,480
And then there are some far edges.

343
00:23:28,840 --> 00:23:39,800
And the traditional problem in
these types of indexes is getting

344
00:23:39,800 --> 00:23:43,180
from 1 end of the graph to another
end of the graph.

345
00:23:43,180 --> 00:23:48,340
So your starting point is very
far from your query vector, which

346
00:23:48,340 --> 00:23:49,940
is what you're looking for.

347
00:23:50,000 --> 00:23:52,860
It takes a lot of hops in these
graphs.

348
00:23:53,740 --> 00:24:01,260
And HNSW and DiskANN are pretty
much different ways of solving

349
00:24:01,260 --> 00:24:02,460
that basic problem.

350
00:24:03,700 --> 00:24:10,600
HNSW solves this by introducing
layers into the graph, where

351
00:24:10,600 --> 00:24:15,260
the top layer where you start only
has long distance edges.

352
00:24:15,360 --> 00:24:21,580
So it kind of allows you to jump
a long distance but close to

353
00:24:21,580 --> 00:24:28,340
where your query is, and then you
go down a level in order to

354
00:24:29,060 --> 00:24:31,580
make more fine-grained jumps.

355
00:24:32,620 --> 00:24:34,160
And there are several levels.

356
00:24:34,700 --> 00:24:38,900
By the time you get to the lowest
level, you are kind of in the

357
00:24:38,900 --> 00:24:41,760
most fine-grained area of the graph.

358
00:24:42,620 --> 00:24:47,640
And those top levels are the things
that help you solve this

359
00:24:47,640 --> 00:24:50,240
long jump issue, if you would.

360
00:24:52,040 --> 00:24:58,300
DiskANN doesn't use multiple graphs.

361
00:25:00,700 --> 00:25:06,880
Instead it kind of uses a neat
trick in the construction to...

362
00:25:08,260 --> 00:25:09,640
How do I put this?

363
00:25:09,800 --> 00:25:14,040
It puts in enough long edges to
be useful.

364
00:25:16,220 --> 00:25:20,880
So the way it constructs the graph
is modified from the regular

365
00:25:21,020 --> 00:25:26,460
way these types of graphs are usually
constructed, specifically

366
00:25:26,840 --> 00:25:31,360
to inject these long edges into
the graph, which probabilistically

367
00:25:32,300 --> 00:25:36,360
allows you to jump faster to where
you need to go.

368
00:25:36,900 --> 00:25:45,240
And this kind of flat structure
is what allows you to keep a

369
00:25:45,240 --> 00:25:48,420
better locality of where you are.

370
00:25:48,420 --> 00:25:53,300
So instead of using multiple levels,
the flat structure allows

371
00:25:53,300 --> 00:26:01,740
you to make fewer hops in memory,
which allows us to work on disk,

372
00:26:02,220 --> 00:26:04,700
on SSD rather than RAM.
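The greedy traversal common to both HNSW and DiskANN can be sketched on a toy graph. The points and edges below are invented; the long-range edge from "a" to "d" plays the role of the injected long edges Mat describes.

```python
# Toy sketch of the greedy search at the core of graph ANN indexes:
# start at an entry node and repeatedly hop to whichever neighbor is
# closest to the query, stopping when no neighbor improves.
import math

points = {  # node id -> 2-D vector (invented for illustration)
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0),
    "d": (3.0, 0.0), "e": (3.0, 1.0),
}
graph = {  # neighbor lists; "a" -> "d" is a long-range shortcut edge
    "a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["a", "c", "e"], "e": ["d"],
}

def greedy_search(query, entry="a"):
    current = entry
    while True:
        # Consider staying put or moving to any neighbor.
        best = min(graph[current] + [current],
                   key=lambda n: math.dist(points[n], query))
        if best == current:
            return current  # local minimum: no neighbor is closer
        current = best
```

Searching for a point near "e" hops directly from "a" to "d" via the shortcut, then to "e", instead of walking through "b" and "c"; fewer hops means fewer random reads, which is what makes the disk-resident layout workable.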

373
00:26:04,960 --> 00:26:05,460
Because...

374
00:26:06,100 --> 00:26:06,600
Nikolay: Yeah.

375
00:26:07,960 --> 00:26:11,180
I noticed, by the way, maybe it's
like a side note, but I noticed

376
00:26:11,180 --> 00:26:15,920
in your Postgres setup you use local
SSD disks, right?

377
00:26:16,440 --> 00:26:17,820
Not like EBS volumes.

378
00:26:18,700 --> 00:26:19,940
Is it fair?

379
00:26:20,940 --> 00:26:22,360
I mean, they're ephemeral.

380
00:26:22,360 --> 00:26:25,120
If you restart the machine, you
lose them.

381
00:26:25,760 --> 00:26:29,420
So in your benchmark particularly,
I noticed this small point.

382
00:26:30,420 --> 00:26:41,500
Mat: Our benchmarks very concretely
benchmark Pinecone against self-hosted

383
00:26:41,580 --> 00:26:42,080
Postgres.

384
00:26:42,260 --> 00:26:49,700
And the most self-hosted Postgres
people I know actually use

385
00:26:49,820 --> 00:26:52,720
NVMes and they do like...

386
00:26:55,920 --> 00:26:57,380
Nikolay: But not local NVMes.

387
00:27:00,300 --> 00:27:05,780
Mat: But you can do backups other
ways, right?

388
00:27:08,180 --> 00:27:08,540
You could

389
00:27:08,540 --> 00:27:09,440
Nikolay: do streaming

390
00:27:09,640 --> 00:27:15,060
Mat: replication, you could do
other ways to get around this

391
00:27:15,060 --> 00:27:17,000
issue of things don't go away.

392
00:27:17,180 --> 00:27:21,660
Nikolay: I'm just very curious
how much it helped because of

393
00:27:21,660 --> 00:27:24,900
course latency is better, throughput
is better.

394
00:27:24,900 --> 00:27:28,520
Mat: I can tell you it helped immensely
and I can say that we

395
00:27:28,520 --> 00:27:33,820
are right now actively thinking
about how to bring this kind

396
00:27:33,820 --> 00:27:35,880
of performance to cloud as well.

397
00:27:37,280 --> 00:27:41,600
Stay tuned, I think we'll have
some exciting stuff coming out.

398
00:27:41,600 --> 00:27:46,520
But yes, that's a very astute observation
because, you know,

399
00:27:47,420 --> 00:27:50,080
Quite frankly, none of this works
on EBS.

400
00:27:51,280 --> 00:27:58,060
You need the random read performance
of SSDs to make this work.

401
00:27:59,340 --> 00:28:05,460
So by the way, I would be shocked
if Pinecone wasn't using...

402
00:28:06,340 --> 00:28:08,040
Nikolay: ...local NVMes?

403
00:28:09,140 --> 00:28:10,160
Mat: I don't know.

404
00:28:10,380 --> 00:28:14,640
I have no idea what they use, but
I would be shocked if this

405
00:28:14,640 --> 00:28:15,840
is EBS-based.

406
00:28:17,860 --> 00:28:24,440
Nikolay: And for these 50 million
vectors, how big was the database?

407
00:28:25,760 --> 00:28:26,680
Like in terms of...

408
00:28:26,680 --> 00:28:30,560
John: The DiskANN index was about
40 gigs.

409
00:28:30,940 --> 00:28:35,680
The entire table, I think, was
around 250, somewhere between

410
00:28:35,680 --> 00:28:37,260
250 and 300 gigs.

411
00:28:37,740 --> 00:28:40,540
Nikolay: Right, and the machine
has definitely less than that,

412
00:28:41,040 --> 00:28:42,700
so of course it was not cached.

413
00:28:42,700 --> 00:28:45,440
I mean, we do need the disk.

414
00:28:45,520 --> 00:28:47,480
Okay, and back to indexes.

415
00:28:48,700 --> 00:28:53,760
DiskANN, and I saw you mentioned
in the article and the project

416
00:28:53,760 --> 00:28:58,040
itself, Microsoft work, but you
modified this work, right?

417
00:28:58,080 --> 00:29:05,200
But the question is, like, HNSW
and this DiskANN, both are approximate

418
00:29:05,220 --> 00:29:06,400
nearest neighbors algorithms.

419
00:29:07,680 --> 00:29:11,320
Can we say one is better than another
for most of workloads and

420
00:29:11,320 --> 00:29:13,540
data types, data sets?

421
00:29:14,340 --> 00:29:18,960
Or there are cases where one can
win, there are cases where another

422
00:29:18,960 --> 00:29:19,620
can win.

423
00:29:21,200 --> 00:29:22,140
That's what I think.

424
00:29:22,140 --> 00:29:25,540
John: The HNSW index supports
concurrent

425
00:29:26,420 --> 00:29:30,100
index builds, which could make
it faster, at least for building

426
00:29:30,100 --> 00:29:30,780
the index.

428
00:29:32,240 --> 00:29:37,400
Nikolay: For building the
index, yeah.

429
00:29:38,260 --> 00:29:39,160
But for search?

430
00:29:40,580 --> 00:29:46,120
Mat: For search and accuracy, look,
we haven't benchmarked everything.

431
00:29:46,120 --> 00:29:50,700
For the things we have benchmarked,
we've seen higher throughput

432
00:29:50,740 --> 00:29:53,220
and higher accuracy.

433
00:29:57,280 --> 00:30:05,500
The trade-off was always kind of
better than HNSW, but you know,

434
00:30:05,500 --> 00:30:07,460
we haven't benchmarked everything.

435
00:30:07,540 --> 00:30:13,540
We've kind of concentrated on modern
embedding systems.

436
00:30:14,060 --> 00:30:19,940
So, for example, we haven't gotten
to benchmark very low dimensional

437
00:30:20,160 --> 00:30:24,960
vectors like 128 or 256, the lowest
thing we've benchmarked is

438
00:30:24,960 --> 00:30:25,460
768.

439
00:30:26,200 --> 00:30:31,040
So a lot of caveats and I think,
you know, the space is so new

440
00:30:31,040 --> 00:30:36,600
people should actually test on
their own data, but I can say

441
00:30:37,360 --> 00:30:42,260
in terms of research we haven't
seen where we're worse yet.

442
00:30:44,060 --> 00:30:47,240
Nikolay: People should always test
on their own data, even if

443
00:30:47,240 --> 00:30:51,300
it's old school full-text search
and so on, because who knows

444
00:30:51,300 --> 00:30:52,480
what you have, right?

445
00:30:52,480 --> 00:30:52,980
Mat: Absolutely.

446
00:30:53,360 --> 00:30:56,360
John: That's one thing that made
benchmarking so hard, especially

447
00:30:56,360 --> 00:30:59,440
versus our competitors, because
a lot of the specialized vector

448
00:31:00,040 --> 00:31:04,140
databases have very few parameters
that let you actually control

449
00:31:04,700 --> 00:31:09,260
the index, whereas on Postgres
with both pgvector and pgvectorscale,

450
00:31:09,960 --> 00:31:14,640
there are several different parameters
to tune the build of the

451
00:31:14,640 --> 00:31:20,040
index, and then also query time,
plus all of the Postgres settings.

452
00:31:20,820 --> 00:31:23,760
And if you're self-hosting, you've
also got OS and machine level

453
00:31:23,760 --> 00:31:24,980
things to play with.

454
00:31:25,520 --> 00:31:29,020
So there's just a mind-boggling
number of variables you could

455
00:31:29,060 --> 00:31:32,140
play with if you had the time and
money to spend on it.

456
00:31:33,340 --> 00:31:37,200
Nikolay: Yeah, and in pgvectorscale
particularly, I also checked

457
00:31:37,200 --> 00:31:40,640
the source code and documentation
that we have so far, and I

458
00:31:40,640 --> 00:31:43,300
saw also parameters you can touch.

459
00:31:43,480 --> 00:31:47,820
A couple of them at query time,
and some that can also be adjusted

460
00:31:47,840 --> 00:31:48,980
during build time.

461
00:31:49,640 --> 00:31:55,840
What can you say about search list
sizes and so on?

462
00:31:55,840 --> 00:31:57,320
Query search list size?

463
00:31:58,680 --> 00:32:01,860
And during build time, num neighbors,
how many neighbors?

464
00:32:02,440 --> 00:32:05,780
Do you recommend trying to tune
this so far now?

465
00:32:08,040 --> 00:32:10,940
Mat: So we tried to put in
sane defaults.

466
00:32:11,340 --> 00:32:14,940
That is what we saw as the best
combination.

467
00:32:15,140 --> 00:32:19,340
But obviously, we wanted to let
people also experiment.

468
00:32:19,780 --> 00:32:25,380
But in terms of build parameters,
I would recommend keeping them

469
00:32:25,380 --> 00:32:26,260
as a default.

470
00:32:27,900 --> 00:32:33,980
The runtime parameters is really
where I think people should

471
00:32:34,180 --> 00:32:39,640
experiment because that allows
you, at query time, to make the

472
00:32:39,640 --> 00:32:43,200
trade-off between accuracy and
speed, right?

473
00:32:43,380 --> 00:32:47,840
And there's the rescore parameter
that I think is the one we

474
00:32:47,840 --> 00:32:50,200
recommend to actually tune.

475
00:32:52,240 --> 00:32:56,540
I just realized today we probably
should change the default, it's

476
00:32:56,540 --> 00:32:57,480
pretty low.

477
00:32:58,620 --> 00:32:59,960
That was my fault.

478
00:33:00,720 --> 00:33:05,820
But that is the parameter I think
people should play with.

479
00:33:06,180 --> 00:33:09,720
John: Yeah, that one seemed to have
the most impact on both speed

480
00:33:09,960 --> 00:33:10,660
and accuracy.

481
00:33:12,980 --> 00:33:14,000
At query time.

482
00:33:15,100 --> 00:33:15,600
Nikolay: Right.

483
00:33:16,780 --> 00:33:20,100
Okay, so one more question about
the index.

484
00:33:21,260 --> 00:33:25,020
You call it StreamingDiskANN,
right?

485
00:33:25,200 --> 00:33:26,080
Why streaming?

486
00:33:27,260 --> 00:33:31,060
How does it differ from the original
Microsoft implementation?

487
00:33:31,060 --> 00:33:31,560
Yeah,

488
00:33:31,860 --> 00:33:37,780
Mat: so I think both the HNSW implementation
and the original

489
00:33:37,900 --> 00:33:43,940
DiskANN implementation, you tell
it ahead of time how many things

490
00:33:43,940 --> 00:33:44,980
you want returned.

491
00:33:48,960 --> 00:33:53,400
And that just doesn't play very
well with Postgres, quite frankly,

492
00:33:53,400 --> 00:33:56,020
because the Postgres

493
00:33:57,790 --> 00:33:58,290
John: model

494
00:34:00,060 --> 00:34:05,580
Mat: for indexes is that the rest
of Postgres can always

495
00:34:05,600 --> 00:34:08,180
ask the index for the next tuple.

496
00:34:08,760 --> 00:34:16,220
And this is done so that you can
filter your results after index

497
00:34:16,320 --> 00:34:16,820
retrieval.

498
00:34:17,500 --> 00:34:24,900
So for example, let's say I want
to find the closest vectors

499
00:34:24,940 --> 00:34:33,200
to a given query that also meets
some other criteria that belong

500
00:34:33,260 --> 00:34:38,240
to the business department or the
engineering department, right?

501
00:34:39,340 --> 00:34:46,600
The vector index only fetches the
closest things to the query,

502
00:34:46,780 --> 00:34:52,020
and then what Postgres does is
that after index retrieval it

503
00:34:52,020 --> 00:34:58,120
will filter out department equals
engineering, right?

504
00:34:58,460 --> 00:35:04,960
Now let's say your closest hundreds
of vectors are all from the

505
00:35:04,960 --> 00:35:06,060
business department.

506
00:35:08,420 --> 00:35:10,180
That means if that-

507
00:35:10,440 --> 00:35:10,640
Nikolay: It will

508
00:35:10,640 --> 00:35:11,260
Mat: be fast.

509
00:35:12,500 --> 00:35:13,000
What?

510
00:35:14,160 --> 00:35:17,180
Nikolay: That means- I mean, in
this case, it will be fast because-

511
00:35:18,660 --> 00:35:22,500
Mat: Right, but if your parameter
is, you say you're returning

512
00:35:22,900 --> 00:35:28,380
the closest 50 from the index,
then no results will be returned

513
00:35:28,380 --> 00:35:29,020
at all.

514
00:35:29,380 --> 00:35:34,400
Because the index returns 50 results,
those 50 results are then

515
00:35:34,400 --> 00:35:37,040
filtered by department equals engineering.
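[Editor's note: the failure mode described here is easy to reproduce in a few lines of Python — a toy stand-in for the executor, not Postgres code. The index hands back exactly k nearest rows, and only afterwards is the filter applied.]

```python
def fixed_k_then_filter(rows, distance, k, predicate):
    """Old model: fetch exactly k nearest from the index,
    then post-filter -- too late to go back for more."""
    nearest = sorted(rows, key=distance)[:k]
    return [r for r in nearest if predicate(r)]

# 100 close vectors from 'business', one distant one from 'engineering'.
rows = [(i, "business") for i in range(100)] + [(200, "engineering")]
result = fixed_k_then_filter(
    rows,
    lambda r: abs(r[0]),                    # distance to the query at 0
    k=50,
    predicate=lambda r: r[1] == "engineering",
)
print(result)  # -> [] : 50 rows fetched, every one filtered out
```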

516
00:35:37,440 --> 00:35:38,680
Nikolay: None of them match.

517
00:35:39,140 --> 00:35:40,640
Different department, right, right,
right.

518
00:35:41,100 --> 00:35:43,360
A lot of work done, but zero
results,

519
00:35:45,780 --> 00:35:45,970
Mat: right?

520
00:35:45,970 --> 00:35:46,160
Exactly.

521
00:35:46,160 --> 00:35:50,520
And there are zero results because
there's this arbitrary limit

522
00:35:50,900 --> 00:35:55,280
that you give at the beginning
of the query to tell you, hey,

523
00:35:55,280 --> 00:35:57,080
retrieve this many results.

524
00:35:57,320 --> 00:36:02,120
Whereas in reality, Postgres has
no idea how many results you

525
00:36:02,120 --> 00:36:10,620
need to retrieve in order to match
both the query and the department

526
00:36:10,640 --> 00:36:12,240
equals engineering, right?

527
00:36:14,060 --> 00:36:19,340
So what the streaming part of the
algorithm does is it removes

528
00:36:19,340 --> 00:36:25,460
that restriction and makes the
algorithm work in the way Postgres

529
00:36:25,760 --> 00:36:27,840
expects other indexes to work.

530
00:36:27,940 --> 00:36:32,960
So you can tell the index, hey,
give me the next closest thing,

531
00:36:32,960 --> 00:36:33,980
the next closest.

532
00:36:34,500 --> 00:36:39,640
Actually you could traverse the
entire graph, your entire table

533
00:36:40,240 --> 00:36:41,020
like that.

534
00:36:42,180 --> 00:36:49,100
And that makes all of your queries
that have a secondary filter

535
00:36:49,700 --> 00:36:50,960
completely accurate.
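[Editor's note: a sketch of the streaming contract in Python. Instead of a fixed k, the index behaves like a generator that can always yield the next-closest tuple, so the executor keeps pulling until the filter is satisfied. Toy code — the real index walks a graph rather than a pre-sorted heap.]

```python
import heapq

def stream_nearest(rows, distance):
    """Yield rows one at a time in increasing distance order --
    the 'give me the next closest' contract Postgres expects."""
    heap = [(distance(r), i, r) for i, r in enumerate(rows)]
    heapq.heapify(heap)
    while heap:
        _, _, r = heapq.heappop(heap)
        yield r

def filtered_knn(rows, distance, predicate, k):
    """Pull from the stream until k rows survive the filter."""
    hits = []
    for r in stream_nearest(rows, distance):
        if predicate(r):
            hits.append(r)
            if len(hits) == k:
                break
    return hits

rows = [(i, "business") for i in range(100)] + [(200, "engineering")]
print(filtered_knn(rows, lambda r: abs(r[0]),
                   lambda r: r[1] == "engineering", k=1))
# -> [(200, 'engineering')] : the stream just walks past the 100 misses
```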

536
00:36:51,820 --> 00:36:54,860
Nikolay: And this secondary
filter, it's a different column,

537
00:36:54,860 --> 00:36:56,860
for example, text or integer, right?

538
00:36:57,520 --> 00:36:58,380
Or not?

539
00:36:58,380 --> 00:37:00,140
Mat: Yeah, it's a different
column.

540
00:37:00,140 --> 00:37:03,140
Often it's JSONB, right?

541
00:37:03,140 --> 00:37:09,100
So you could have an article, you
could have a list of associated

542
00:37:09,340 --> 00:37:13,860
tags with it in JSONB, and you
can say, hey, find me all the

543
00:37:13,860 --> 00:37:23,800
articles about cars that also come
from country USA, or any kind

544
00:37:23,800 --> 00:37:24,880
of other metadata.

545
00:37:25,080 --> 00:37:30,560
So it's this combination of semantic
and metadata, which is actually

546
00:37:30,860 --> 00:37:32,220
incredibly common.

547
00:37:33,100 --> 00:37:33,500
Nikolay: Right.

548
00:37:33,500 --> 00:37:40,280
But there we have usually either
a situation when we have Btree

549
00:37:40,440 --> 00:37:44,540
and GIN index and Postgres needs
to decide which one to use and

550
00:37:44,540 --> 00:37:48,680
then still apply a filter, or we
need to use something like additional

551
00:37:48,680 --> 00:37:59,340
extension like GiST, I'm
bad with names as usual, and then

552
00:37:59,340 --> 00:38:01,400
try to achieve single index scan.

553
00:38:01,620 --> 00:38:02,860
But here it's not possible.

554
00:38:02,860 --> 00:38:06,880
I mean, the ideal world is a single index
scan without additional filtering.

555
00:38:07,740 --> 00:38:10,020
Mat: Yes, that is the ideal world.

556
00:38:11,880 --> 00:38:15,840
Right now, none of the indexes
supports that.

557
00:38:16,440 --> 00:38:24,920
And so the best you can
do is, the state of the

558
00:38:24,920 --> 00:38:29,920
art on Postgres right now, and
I would argue state of

559
00:38:29,920 --> 00:38:35,020
the art period, but we could leave
that argument for another

560
00:38:35,020 --> 00:38:43,200
time, is that you can hope to
retrieve from the vector index

561
00:38:43,440 --> 00:38:49,140
and then post filter and still
have accurate results.

562
00:38:52,240 --> 00:38:54,880
Nikolay: I can imagine, for example,
if we take this department

563
00:38:54,960 --> 00:39:00,660
name, put it inside this input
text which builds a vector, and

564
00:39:00,660 --> 00:39:05,580
then additionally we filter by
integer like department ID, it

565
00:39:05,580 --> 00:39:08,280
probably will work better in this
case, right?

566
00:39:09,940 --> 00:39:13,620
Because first we apply semantic
search involving in the query

567
00:39:13,620 --> 00:39:18,000
which department we want, but then
we think, okay, this filter

568
00:39:18,000 --> 00:39:19,840
will do the final polishing.

569
00:39:20,020 --> 00:39:22,520
But it sounds again like some ugly
solution.

570
00:39:24,140 --> 00:39:24,340
Mat: So...

571
00:39:24,340 --> 00:39:27,280
Yeah, it's not only ugly.

572
00:39:28,620 --> 00:39:35,580
I'm actually not sure it would
work, but using my intuition,

573
00:39:37,360 --> 00:39:42,420
like the department engineering
in the text would kind of skew

574
00:39:42,660 --> 00:39:47,960
the semantics away from the actual
thing in the text you're talking

575
00:39:47,960 --> 00:39:48,460
about.

576
00:39:48,540 --> 00:39:54,440
And so I'm not sure that would
combine well in this kind of semantic

577
00:39:55,760 --> 00:39:57,280
multi-dimensional space.

578
00:39:57,340 --> 00:40:01,560
John: You could maybe
add a dimension and synthetically

579
00:40:02,420 --> 00:40:06,560
set a value in the dimension to
represent which department it

580
00:40:06,560 --> 00:40:07,060
is.

581
00:40:07,480 --> 00:40:09,980
Nikolay: And one more dimension for
time, like timestamp.
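[Editor's note: the synthetic-dimension idea can be sketched like this. The helper name and the weight are illustrative assumptions; choosing a weight that doesn't distort the semantic distances is exactly the hard part discussed here.]

```python
import math

def with_metadata_dim(embedding, value, weight):
    """Append one synthetic dimension encoding a piece of metadata
    (a department id, a timestamp, ...); 'weight' decides how
    strongly it pulls same-metadata vectors together."""
    return list(embedding) + [weight * float(value)]

def l2(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = with_metadata_dim([1.0, 0.0], value=1, weight=10.0)  # dept 1
same  = with_metadata_dim([0.9, 0.1], value=1, weight=10.0)  # dept 1
other = with_metadata_dim([1.0, 0.0], value=2, weight=10.0)  # dept 2
# The same-department doc wins despite a slightly different embedding.
print(l2(query, same) < l2(query, other))  # -> True
```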

582
00:40:11,380 --> 00:40:12,460
This is so natural.

583
00:40:13,380 --> 00:40:16,400
I'm very curious if there are some
works already in this direction,

584
00:40:16,400 --> 00:40:19,200
because everyone needs creation
time, like publication time or

585
00:40:19,200 --> 00:40:22,620
something for each data entry.

586
00:40:23,940 --> 00:40:26,880
I didn't see good discussions about
that yet.

587
00:40:27,180 --> 00:40:31,580
Because, for example, we
loaded almost a million entries

588
00:40:31,880 --> 00:40:33,660
from Postgres mailing list archives.

589
00:40:34,400 --> 00:40:39,020
Then it started working great,
but when you see some discussion

590
00:40:39,020 --> 00:40:43,940
from Bruce Momjian from 2002 about
something, I don't know, I

591
00:40:43,940 --> 00:40:47,160
remember just some cases, it's
already not relevant at all.

592
00:40:47,160 --> 00:40:50,460
You think, maybe I should delete
all data, but still, it might

593
00:40:50,460 --> 00:40:51,100
be relevant.

594
00:40:52,120 --> 00:40:57,540
What we did, ugly solution, we
return 1,000 or maybe 5,000 results,

595
00:40:58,100 --> 00:41:06,040
and then we dynamically apply some
score, I think it's a logarithmic

596
00:41:06,200 --> 00:41:12,540
approach for time, so we add some
penalty if the article, well,

597
00:41:13,260 --> 00:41:14,720
email is very old.

598
00:41:16,800 --> 00:41:19,080
It quickly becomes less relevant.

599
00:41:19,080 --> 00:41:23,380
We just combine it with the score
pgvector provides us, similarity

600
00:41:23,460 --> 00:41:24,180
or distance.
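[Editor's note: the over-fetch-then-re-rank recipe described here might look roughly like this; `alpha` and the day-based aging are assumptions for illustration, not the exact formula from Nikolay's system.]

```python
import math

def rerank_with_recency(candidates, now, alpha=0.1):
    """Re-rank an over-fetched candidate list (say, pgvector's top
    1,000) by adding a logarithmic age penalty to each distance."""
    def score(c):
        age_days = max((now - c["created_at"]) / 86400.0, 1.0)
        return c["distance"] + alpha * math.log(age_days)
    return sorted(candidates, key=score)

now = 1_700_000_000  # some epoch timestamp
candidates = [
    {"id": "from-2002", "distance": 0.25, "created_at": now - 7300 * 86400},
    {"id": "fresh",     "distance": 0.30, "created_at": now - 86400},
]
print([c["id"] for c in rerank_with_recency(candidates, now)])
# -> ['fresh', 'from-2002'] : the 20-year-old hit loses despite being closer
```

Because the re-ranking happens outside the index, the index must return far more rows than needed, which is where the latency complaint below comes from.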

601
00:41:27,040 --> 00:41:33,340
This works well, but sometimes
you have 1-second latency on a million

602
00:41:33,740 --> 00:41:36,540
rows dataset, and this is terrible
and doesn't scale.

603
00:41:37,200 --> 00:41:40,320
So this problem of additional dimensions,
I think it's huge.

604
00:41:40,320 --> 00:41:44,840
We need to extend original vector
with non-semantic dimensions

605
00:41:45,040 --> 00:41:49,940
and use them in filtering and achieve
a single index scan.

606
00:41:51,040 --> 00:41:52,260
Good performance, right?

607
00:41:53,360 --> 00:41:59,180
Mat: I would say that in the academic
literature or there is

608
00:41:59,180 --> 00:42:01,040
some progress being made.

609
00:42:01,620 --> 00:42:08,160
There is the Filtered-DiskANN
paper, which I believe is

610
00:42:08,540 --> 00:42:12,180
2 years old, which is ancient in
this space.

611
00:42:13,080 --> 00:42:20,740
There was another paper from a
group out in Berkeley talking

612
00:42:20,740 --> 00:42:26,280
about building and filtering into
these graph-based indexes as

613
00:42:26,280 --> 00:42:26,780
well.

614
00:42:30,580 --> 00:42:34,300
But you gotta crawl before
you can walk.

615
00:42:34,300 --> 00:42:41,440
We just haven't had any chance
to really implement these inside

616
00:42:41,440 --> 00:42:46,320
this index, but there's
kind of academic work in this

617
00:42:46,320 --> 00:42:46,780
area.

618
00:42:46,780 --> 00:42:48,280
Nikolay: Right, for this algorithm.

619
00:42:49,940 --> 00:42:52,960
John: You'd be able to signal
that certain dimensions,

620
00:42:53,100 --> 00:42:57,320
if that's where you're putting
your filters, would

621
00:42:57,320 --> 00:42:59,880
need to be exact versus approximate.

622
00:43:03,560 --> 00:43:04,060
Nikolay: Right.

623
00:43:04,460 --> 00:43:09,860
And do you think this DiskANN approach
is better than HNSW in this

624
00:43:09,860 --> 00:43:10,860
particular problem?

625
00:43:12,180 --> 00:43:14,340
Mat: For filtering, yes, I do.

626
00:43:16,300 --> 00:43:25,440
Obviously, I'm biased, but I think
the simplicity of going from

627
00:43:25,440 --> 00:43:30,660
multi-levels to a single level
really helps in a lot of these

628
00:43:30,660 --> 00:43:34,840
things, just because there's a
lot less edge cases to consider.

629
00:43:37,080 --> 00:43:38,100
Nikolay: Let's not forget...

630
00:43:38,600 --> 00:43:42,180
John: This helps as well, right?

631
00:43:43,160 --> 00:43:48,700
Mat: Yeah, I think Streaming was
a lot easier to implement because

632
00:43:48,700 --> 00:43:50,360
it's a single level.

633
00:43:52,480 --> 00:43:52,980
Nikolay: Interesting.

634
00:43:53,560 --> 00:43:57,760
Let's not forget to talk about
compression, because I'm sure

635
00:43:57,780 --> 00:44:00,920
we need it in vector world, in
vector search.

636
00:44:00,920 --> 00:44:01,800
We need it.

637
00:44:02,580 --> 00:44:09,640
I saw in a recent pgvector there
are ideas, let's not use full floats,

638
00:44:09,660 --> 00:44:13,260
let's use half floats and so on, with
a kind of compression, so to

639
00:44:13,260 --> 00:44:13,760
speak.

640
00:44:14,180 --> 00:44:16,720
But you talk about real compression,
right?

641
00:44:16,720 --> 00:44:22,660
Maybe some experience from TimescaleDB
extension, or no, or it's

642
00:44:22,660 --> 00:44:23,160
different.

643
00:44:25,640 --> 00:44:30,520
Because there is time series, I
remember articles, the Timescale blog

644
00:44:31,240 --> 00:44:34,520
has excellent articles, but maybe
it's different here, right?

645
00:44:34,540 --> 00:44:39,360
Mat: I don't think I directly used
any of the algorithms from

646
00:44:39,360 --> 00:44:44,240
the time series space for this,
but the basic insight of using

647
00:44:44,720 --> 00:44:49,000
the statistical properties of the
vectors you have to kind of

648
00:44:49,000 --> 00:44:54,360
make compression give you better
results is exactly what led

649
00:44:54,360 --> 00:44:54,860
to...

650
00:44:55,120 --> 00:44:59,240
So we have an algorithm called
Statistical Binary Quantization,

651
00:45:00,400 --> 00:45:06,620
SBQ, and it takes quite a simple
but well-known algorithm called

652
00:45:06,620 --> 00:45:14,560
BQ and kind of adapts it to your
dataset in a better way.

653
00:45:17,680 --> 00:45:19,700
You know, we have a blog post about
it.

654
00:45:19,700 --> 00:45:20,880
It's pretty simple.

655
00:45:21,040 --> 00:45:27,280
It's pretty much using the means
of each dimension and the kind

656
00:45:27,280 --> 00:45:34,520
of standard deviations to kind
of better segment your space,

657
00:45:34,820 --> 00:45:35,700
if you would.
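[Editor's note: a rough sketch of the idea as described — see the Timescale blog post for the real algorithm. Plain BQ thresholds every dimension at zero, while the statistical variant learns a per-dimension cutoff (simplified here to just the mean), so the bits split the data more evenly.]

```python
def bq_encode(vec):
    """Plain binary quantization: threshold each dimension at 0."""
    return tuple(1 if x > 0.0 else 0 for x in vec)

def sbq_encode(vectors):
    """Statistical variant (simplified): threshold each dimension
    at its mean across the dataset instead of at 0."""
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return [tuple(1 if v[d] > means[d] else 0 for d in range(dims))
            for v in vectors]

# All-positive data: plain BQ collapses everything to the same code...
data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print({bq_encode(v) for v in data})  # -> {(1, 1)}
# ...while the mean-centered cutoffs still separate the two clusters.
print(sbq_encode(data))              # -> [(1, 0), (1, 0), (0, 1), (0, 1)]
```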

658
00:45:38,940 --> 00:45:43,220
And yeah, and we just, honestly,
we took a very experimental

659
00:45:43,480 --> 00:45:43,980
approach.

660
00:45:45,480 --> 00:45:52,460
We took various datasets and we
tried different things and this

661
00:45:52,460 --> 00:45:54,740
turned out that it worked.

662
00:45:54,760 --> 00:45:59,700
It added a few percentage points
to the accuracy, which once

663
00:45:59,700 --> 00:46:05,380
you get into the 90s, a few percentage
points is a pretty big

664
00:46:05,380 --> 00:46:05,880
deal.

665
00:46:07,280 --> 00:46:14,100
And so, yeah, you can
read about it in our blog post.

666
00:46:14,100 --> 00:46:17,360
The algorithm is fully explained
there.

667
00:46:18,300 --> 00:46:18,820
Nikolay: To make sure...

668
00:46:18,820 --> 00:46:19,860
I'm just noticing, and...

669
00:46:23,500 --> 00:46:26,060
John: The compression that Mat's
talking about is happening

670
00:46:26,060 --> 00:46:31,520
in the index and the TimescaleDB
compression is, you know, converting

671
00:46:32,380 --> 00:46:36,480
row-based into columnar and then
compressing each column in the

672
00:46:36,480 --> 00:46:36,980
heap.

673
00:46:39,480 --> 00:46:43,540
We actually tried compressing the
vectors in the heap with TimescaleDB

674
00:46:44,220 --> 00:46:49,200
compression. And seemingly random
vector strings of numbers don't

675
00:46:49,200 --> 00:46:50,340
compress very well.

676
00:46:50,460 --> 00:46:51,660
So that didn't help very much.

677
00:46:51,660 --> 00:46:54,440
Nikolay: Because
in TimescaleDB, one of the

678
00:46:54,440 --> 00:47:00,300
key ideas is that for time series,
it's like values are changing,

679
00:47:01,440 --> 00:47:03,000
not jumping, how to say.

680
00:47:04,300 --> 00:47:08,540
So there are deltas and these deltas
are quite low and so on.

681
00:47:08,860 --> 00:47:10,840
This is what I remember from those
blog posts.

682
00:47:11,600 --> 00:47:13,580
For vectors this doesn't work,
I understand.

683
00:47:15,260 --> 00:47:24,640
For index compression, how much
could we expect in terms of compression

684
00:47:24,640 --> 00:47:25,940
ratio to achieve?

685
00:47:30,180 --> 00:47:31,560
Mat: SBQ uses 1 bit per dimension.

686
00:47:31,560 --> 00:47:42,320
The uncompressed version is float4.

687
00:47:42,900 --> 00:47:46,220
It's a very easy calculation.

688
00:47:46,400 --> 00:47:50,280
It's always a 32x compression ratio.
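[Editor's note: the 32x figure falls straight out of the widths, shown here for a 768-dimension vector, the smallest size they benchmarked.]

```python
bits_per_dim_float4 = 32   # float4 = 4 bytes per dimension
bits_per_dim_sbq = 1       # 1 bit per dimension after quantization

ratio = bits_per_dim_float4 // bits_per_dim_sbq
print(ratio)  # -> 32

dims = 768
print(dims * 4, "bytes ->", dims * bits_per_dim_sbq // 8, "bytes")
# -> 3072 bytes -> 96 bytes per vector inside the index
```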

689
00:47:52,060 --> 00:47:52,940
Nikolay: Good.

690
00:47:52,940 --> 00:47:53,440
Understood.

691
00:47:53,700 --> 00:47:54,200
Great.

692
00:47:54,520 --> 00:48:00,840
So yeah, anything else in technical
area we should mention?

694
00:48:02,520 --> 00:48:04,200
Nikolay: That helps

695
00:48:04,200 --> 00:48:07,660
John: with the size, but it also
helps with performance because

696
00:48:07,660 --> 00:48:08,220
you can fit

697
00:48:08,220 --> 00:48:10,120
Nikolay: more in.

698
00:48:10,340 --> 00:48:13,280
Fewer buffers to load to the buffer
pool. 

699
00:48:16,160 --> 00:48:17,120
This makes sense.

700
00:48:17,420 --> 00:48:21,620
I remember your blog post also
mentions storage costs you compare

701
00:48:21,620 --> 00:48:26,720
with Pinecone in terms of how much
you need to spend each

702
00:48:26,720 --> 00:48:29,080
month to store 50 million vectors.

703
00:48:29,760 --> 00:48:32,180
The difference is very noticeable,
I would say.

704
00:48:32,600 --> 00:48:35,460
So yeah, this makes sense for sure.

705
00:48:35,900 --> 00:48:39,400
Anything else, like technical stuff?

706
00:48:40,440 --> 00:48:43,280
Anything maybe you're working on
right now?

707
00:48:43,740 --> 00:48:46,320
Mat: I think, about pgvectorscale,

708
00:48:47,780 --> 00:48:48,900
That's about it.

709
00:48:48,940 --> 00:48:53,540
We haven't talked about pgai, which
is the other thing we announced

710
00:48:55,260 --> 00:48:56,060
this week.

711
00:48:58,660 --> 00:49:03,500
Nikolay: This is untrusted Python,
so it won't be possible

712
00:49:03,500 --> 00:49:06,540
to run it on managed services,
as I understand, but it allows

713
00:49:06,540 --> 00:49:07,260
you to...

714
00:49:07,540 --> 00:49:08,740
Or it's something different?

715
00:49:09,120 --> 00:49:12,540
Because it makes calls to OpenAI
API or...

716
00:49:13,040 --> 00:49:13,240
It

717
00:49:13,240 --> 00:49:13,740
Mat: does.

718
00:49:15,040 --> 00:49:22,440
So, with untrusted languages, you
can't allow users on clouds

719
00:49:22,580 --> 00:49:24,100
to write their own functions.

720
00:49:24,940 --> 00:49:31,720
But if the functions are included
inside an extension, that's

721
00:49:31,720 --> 00:49:32,220
fine.

722
00:49:32,780 --> 00:49:38,000
And it's easy to see because most
extensions are written in C, which

723
00:49:38,000 --> 00:49:40,620
is completely untrusted, right?

724
00:49:40,760 --> 00:49:50,820
So the entire point of the pgai
extension is specifically so that

725
00:49:50,820 --> 00:49:53,800
this could be run on the clouds.

726
00:49:55,080 --> 00:49:59,280
Nikolay: So it limits capabilities,
and if a cloud vendor decides,

727
00:49:59,900 --> 00:50:04,840
or verifies everything works well,
no bad calls can be

728
00:50:04,840 --> 00:50:06,020
done through it.

729
00:50:06,020 --> 00:50:06,980
So we are in the

730
00:50:06,980 --> 00:50:07,480
John: center.

731
00:50:07,933 --> 00:50:11,260
It's almost like whitelisting certain
programs.

732
00:50:12,700 --> 00:50:13,200
Nikolay: Interesting.

733
00:50:14,060 --> 00:50:14,820
Makes sense.

734
00:50:15,180 --> 00:50:19,300
So the idea is to make it really
simple to create vectors from

735
00:50:19,300 --> 00:50:25,900
regular data types, just transparently
calling these, just with

736
00:50:25,900 --> 00:50:26,780
SQL, right?

737
00:50:26,800 --> 00:50:31,580
I think there are other similar
implementations of this, But

738
00:50:31,720 --> 00:50:37,320
I like your idea of betting on
cloud providers, including this,

739
00:50:37,540 --> 00:50:40,400
and even not providing untrusted
language capabilities.

740
00:50:41,040 --> 00:50:41,500
It's interesting.

741
00:50:41,500 --> 00:50:46,320
Mat: Yeah, there are a few things
that do similar things, and

742
00:50:46,320 --> 00:50:52,100
I was always curious why they didn't
just run the Python code

743
00:50:52,200 --> 00:50:52,960
to do this.

744
00:50:52,960 --> 00:50:54,620
And so that's what we did.

745
00:50:55,760 --> 00:51:01,160
A lot of people do very complicated
stuff to get the same result.

746
00:51:01,300 --> 00:51:03,480
I don't know why, but yeah.

747
00:51:03,700 --> 00:51:05,100
We had to work that.

748
00:51:05,460 --> 00:51:07,800
Nikolay: I like this approach myself
very much.

749
00:51:07,800 --> 00:51:12,660
I mean, I do it a lot for many
years, just select and call something

750
00:51:13,260 --> 00:51:13,760
externally.

751
00:51:14,080 --> 00:51:19,200
But of course, there should be
a huge warning sign.

752
00:51:19,540 --> 00:51:23,460
It doesn't scale well because if
Python calls...

753
00:51:23,620 --> 00:51:30,860
You basically add latency to your
queries on the primary.

754
00:51:32,840 --> 00:51:36,140
Primary CPU is the most expensive
resource you have.

755
00:51:38,300 --> 00:51:43,220
If you want to generate a vector,
you cannot run it on a replica

756
00:51:43,260 --> 00:51:44,660
because you need to write it.

757
00:51:47,060 --> 00:51:49,040
You must do it on the primary.

758
00:51:50,640 --> 00:51:57,460
While this query is running, it
occupies the primary node, and

759
00:51:57,620 --> 00:52:02,940
offloading this work onto Python
application nodes makes total

760
00:52:03,080 --> 00:52:06,460
sense, because the database doesn't
notice it at all; the primary

761
00:52:06,460 --> 00:52:06,820
doesn't notice.

762
00:52:06,820 --> 00:52:10,740
You only speak to it when the result
is already retrieved from

763
00:52:11,040 --> 00:52:13,500
OpenAI or another LLM provider.

764
00:52:15,060 --> 00:52:18,700
This is handy, but very dangerous
in terms of scalability and

765
00:52:18,700 --> 00:52:20,280
performance and future issues.
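The offloading pattern described here could be sketched roughly as follows; the function names are hypothetical and the embedding call is stubbed out, so this is an illustration of the shape of the work, not of any particular tool:

```python
# Sketch: generate embeddings on an application node, not on the primary.
# The worker reads rows that still lack a vector, calls the embedding
# provider outside the database, and only touches the primary for the
# final write. All names are hypothetical.

def chunked(items, size):
    """Batch rows so one provider call covers many texts."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_pending(fetch_rows, embed_batch, write_vector, batch_size=64):
    """Run the offloaded loop.

    fetch_rows()          -> list of (id, text) pairs still missing a vector
    embed_batch(texts)    -> list of vectors (the external API call)
    write_vector(id, vec) -> the only work that lands on the primary
    """
    rows = fetch_rows()
    for batch in chunked(rows, batch_size):
        vectors = embed_batch([text for _, text in batch])
        for (row_id, _), vec in zip(batch, vectors):
            write_vector(row_id, vec)
    return len(rows)
```

With real plumbing, fetch_rows and write_vector would be queries against the primary and embed_batch an HTTP call to the provider, so the primary only pays for the cheap writes while the slow API calls happen on application nodes.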

766
00:52:23,800 --> 00:52:27,620
Mat: Completely agreed with you.

767
00:52:27,620 --> 00:52:35,320
This is a way for people to get
up and running quickly, and for testing

768
00:52:35,860 --> 00:52:36,660
and experimenting.

769
00:52:37,440 --> 00:52:41,920
And we do have another project
called pgvectorizer for when you

770
00:52:41,920 --> 00:52:46,320
need to batch things up and scale
it and do it in the background.

771
00:52:47,360 --> 00:52:49,620
We have all of that as well.

772
00:52:50,140 --> 00:52:51,940
Nikolay: I haven't seen that, that's
interesting also

773
00:52:51,940 --> 00:52:52,620
to check.

774
00:52:53,300 --> 00:52:53,800
Nikolay: Yeah,

775
00:52:55,840 --> 00:52:58,940
John: but all of that plumbing
that you have to write to drag

776
00:52:59,160 --> 00:53:03,420
data out of the database and then
send it off to an API and put

777
00:53:03,420 --> 00:53:04,580
it back in the database.

778
00:53:05,160 --> 00:53:09,240
You end up writing so much code
that when it's in the database,

779
00:53:09,240 --> 00:53:12,960
you don't have to, not to mention
all of the bandwidth you're

780
00:53:12,960 --> 00:53:15,420
consuming and latency there.

781
00:53:17,440 --> 00:53:17,800
Nikolay: Right.

782
00:53:17,800 --> 00:53:24,240
I remember I was copying images
from S3 in SQL.

783
00:53:24,720 --> 00:53:26,940
And then, of course, it's similar.

784
00:53:27,040 --> 00:53:30,460
You call some APIs from Postgres.

785
00:53:30,900 --> 00:53:34,340
It's great, but it's an interesting
direction.

786
00:53:36,500 --> 00:53:39,300
With a warning, for a start, it's good.

787
00:53:43,080 --> 00:53:45,200
I think I don't have any more questions.

788
00:53:47,620 --> 00:53:50,040
I'm looking forward to trying this
project.

789
00:53:50,540 --> 00:53:53,560
We wanted to do it before this
call, like yesterday already,

790
00:53:53,560 --> 00:53:55,820
but we had some technical difficulties.

791
00:53:56,660 --> 00:54:01,160
I hope we will solve them soon
and try pgvectorscale.

792
00:54:02,400 --> 00:54:06,260
For our case, I will tell you when
we have results.

793
00:54:07,440 --> 00:54:08,000
Thank you.

794
00:54:08,000 --> 00:54:09,740
So as a summary.

795
00:54:11,120 --> 00:54:12,940
Mat: There's one more thing about
that.

796
00:54:13,440 --> 00:54:14,700
This is a new project.

797
00:54:14,700 --> 00:54:16,540
We're trying to be very responsive.

798
00:54:16,640 --> 00:54:21,180
So if you run into any problems
at all, you know, file GitHub

799
00:54:21,260 --> 00:54:23,580
issues or contact us on Discord.

800
00:54:23,920 --> 00:54:28,120
We are more than happy and very
eager to talk to anybody.

801
00:54:29,160 --> 00:54:33,980
This is a young project, as I said,
so all feedback is very welcome.

802
00:54:34,860 --> 00:54:36,480
Nikolay: Sure, yeah, that's a good
point.

803
00:54:36,780 --> 00:54:42,720
Honestly, seeing this work in the
Timescale company repository on

804
00:54:42,720 --> 00:54:46,360
GitHub gives me good expectations
that if some problems are

805
00:54:46,360 --> 00:54:49,400
encountered, they will be addressed,
and so on.

806
00:54:51,760 --> 00:54:58,580
As a summary, let's say everyone
who works with vectors in Postgres

807
00:54:59,120 --> 00:55:02,060
should check out this new type
of index and compare.

808
00:55:02,680 --> 00:55:03,560
This is great.

809
00:55:03,940 --> 00:55:08,000
I think there are many people who
do this right now, and I'm looking

810
00:55:08,000 --> 00:55:11,300
forward to results in our case
and other cases as well.

811
00:55:11,400 --> 00:55:13,480
Thank you for a good project

812
00:55:16,740 --> 00:55:17,100
and a very interesting one.

813
00:55:17,100 --> 00:55:17,720
Nikolay: Thank you.

814
00:55:18,740 --> 00:55:19,840
Good luck with it.