1
00:00:00,539 --> 00:00:03,022
Michael: Hello and welcome to PostgresFM,
a weekly show about

2
00:00:03,220 --> 00:00:04,120
all things PostgreSQL.

3
00:00:04,299 --> 00:00:05,819
I am Michael, founder of pgMustard.

4
00:00:06,040 --> 00:00:08,080
This is my co-host Nikolay, founder
of Postgres.AI.

5
00:00:08,840 --> 00:00:11,500
Hello Nikolay, what are we talking
about today?

6
00:00:12,160 --> 00:00:15,320
Nikolay: Hi Michael, let's talk
about search at a high level.

7
00:00:15,480 --> 00:00:25,680
But before we proceed, let me express
a few words, a few thoughts

8
00:00:25,680 --> 00:00:29,619
about the news we've got a few
hours ago, a couple of hours ago,

9
00:00:29,619 --> 00:00:30,800
about Simon Riggs.

10
00:00:31,640 --> 00:00:34,760
We are recording this on Wednesday,
March 27.

11
00:00:36,140 --> 00:00:38,100
And Simon Riggs just passed away.

12
00:00:38,140 --> 00:00:42,260
I remember him as a very bright
mind.

13
00:00:43,440 --> 00:00:48,760
I remember he was not an easy person
to deal with, obviously.

14
00:00:49,010 --> 00:00:54,580
I remember like 100 emails, even
100 emails, to convince him

15
00:00:54,580 --> 00:00:56,840
to come to Moscow to speak at a conference.

16
00:00:57,720 --> 00:01:01,320
Many people were involved, but eventually
he did, he came and it was

17
00:01:01,320 --> 00:01:06,540
a great talk, but the work he did
and like in general, yeah, it's

18
00:01:06,540 --> 00:01:10,060
a big loss obviously for the Postgres
community.

19
00:01:11,380 --> 00:01:15,780
So yeah, condolences to family,
friends, and co-workers, ex-co-workers,

20
00:01:16,160 --> 00:01:17,060
and so on.

21
00:01:17,420 --> 00:01:23,080
And Simon built a lot of things
and he was quite brave to attack

22
00:01:23,200 --> 00:01:29,440
very complex topics in PostgreSQL
system in general, right, in

23
00:01:29,440 --> 00:01:32,140
the core of PostgreSQL, in the engine
itself.

24
00:01:32,720 --> 00:01:36,240
For example, point-in-time recovery,
things related to replication.

25
00:01:37,080 --> 00:01:42,040
Many achievements were made by
Simon or involving Simon.

26
00:01:42,040 --> 00:01:44,000
So it's a big loss, definitely.

27
00:01:46,060 --> 00:01:48,080
Michael: Yeah, over many years
as well, right?

28
00:01:48,180 --> 00:01:51,660
I have only, I actually had the
opportunity to meet him a couple

29
00:01:51,660 --> 00:01:54,520
of times at a couple of London
events and heard him speak.

30
00:01:54,520 --> 00:01:58,220
And not only was he a great contributor
to the code base, but

31
00:01:58,380 --> 00:02:02,360
I was amazed at how he was able
to communicate and educate and

32
00:02:02,360 --> 00:02:04,660
also engage in community building, right?

33
00:02:04,660 --> 00:02:07,600
Like he was involved in organizing
a lot of events, especially

34
00:02:07,600 --> 00:02:12,660
in the UK, growing companies and
a lot more around the ecosystem

35
00:02:12,800 --> 00:02:13,440
as well.

36
00:02:13,520 --> 00:02:19,080
Nikolay: I must say, I remember
very well this look in Simon's

37
00:02:19,180 --> 00:02:22,040
eyes, which was like, had some
sparkles.

38
00:02:22,360 --> 00:02:25,580
And I remember the conference,
the first, the very first conference

39
00:02:25,600 --> 00:02:31,680
in the American continent I attended
in 2007, speaking with Peter

40
00:02:31,680 --> 00:02:32,180
Eisentraut.

41
00:02:32,560 --> 00:02:37,360
I was a baby actually, I was, like,
involved in implementing some parts

42
00:02:37,360 --> 00:02:41,760
of XML implementation functions
and type in PostgreSQL.

43
00:02:42,080 --> 00:02:46,000
I remember Simon looking directly
to me with those sparks and

44
00:02:46,000 --> 00:02:49,620
asking, what's your next thing
to build in PostgreSQL.

45
00:02:49,960 --> 00:02:54,180
I was like, I was caught off-guard
didn't answer anything, actually.

46
00:02:54,400 --> 00:02:58,860
And yeah, so this is what I remember
about Simon, this, this look,

47
00:02:58,860 --> 00:03:02,060
and courage. Do I pronounce it right?

48
00:03:03,220 --> 00:03:04,200
Courage, yeah.

49
00:03:04,200 --> 00:03:07,800
Courage, yes, he obviously had big
courage, huge.

50
00:03:08,860 --> 00:03:09,360
Michael: So.

51
00:03:10,420 --> 00:03:13,600
And the ability to silence Nikolay,
that's quite the...

52
00:03:15,640 --> 00:03:18,020
Nikolay: Well, yeah, yeah, yeah.

53
00:03:18,060 --> 00:03:18,780
It's interesting.

54
00:03:18,840 --> 00:03:19,340
Yeah.

55
00:03:19,480 --> 00:03:25,220
Yeah, so it's sad, very sad. So yeah.

56
00:03:26,480 --> 00:03:29,440
Michael: Yeah, absolutely. Condolences
to everybody who knew him,

57
00:03:29,440 --> 00:03:33,740
and worked with him. I don't really
know how to move on from that,

58
00:03:33,740 --> 00:03:35,200
actually, but we were going to
talk

59
00:03:35,200 --> 00:03:35,940
Nikolay: about search.

60
00:03:37,540 --> 00:03:41,780
Let's just after this small break,
let's return to the search

61
00:03:41,780 --> 00:03:42,280
topic.

62
00:03:42,980 --> 00:03:46,900
And it's wide, it's very wide topic,
And I guess we just want

63
00:03:46,900 --> 00:03:50,940
to touch it a little bit today,
at a very high level, right?

64
00:03:51,600 --> 00:03:54,720
Michael: Yeah, well, it's amazing
if you consider all the things

65
00:03:54,720 --> 00:03:56,300
Postgres is used for.

66
00:03:56,680 --> 00:04:01,100
Search is one of the top use cases,
but looking back at our episodes,

67
00:04:01,440 --> 00:04:05,180
we've touched on it a few times,
like when we've looked at using

68
00:04:05,180 --> 00:04:09,560
external databases or forgotten
what somebody called them, actually,

69
00:04:09,560 --> 00:04:13,580
like a second type of database,
like a partner database or something

70
00:04:13,580 --> 00:04:14,280
like that.

71
00:04:14,540 --> 00:04:17,960
So, we touched on them a few times,
but never, we haven't done

72
00:04:17,960 --> 00:04:20,460
anything on full-text search, we
haven't done anything more recently

73
00:04:20,460 --> 00:04:24,940
on semantic search, we've done
a PG vector episode and a few

74
00:04:25,080 --> 00:04:29,540
related-ish subjects, but no, it
crossed my mind that we hadn't

75
00:04:29,540 --> 00:04:31,860
touched on this really as a topic.

76
00:04:32,040 --> 00:04:35,380
And obviously, it's one of those subjects
that the more you learn,

77
00:04:35,380 --> 00:04:38,100
the more you realize you didn't
know, or the more complicated

78
00:04:38,220 --> 00:04:38,860
it gets.

79
00:04:38,860 --> 00:04:42,500
So I can imagine us doing quite
a few follow-ups on the more

80
00:04:42,500 --> 00:04:46,400
specifics or, you know, implementation
details or the tricky

81
00:04:46,400 --> 00:04:46,900
parts.

82
00:04:47,220 --> 00:04:47,940
But yeah.

83
00:04:48,540 --> 00:04:50,560
Nikolay: Don't forget faceted search.

84
00:04:51,540 --> 00:04:52,700
Michael: Faceted, yeah.

85
00:04:53,300 --> 00:04:55,440
I could be saying it wrong, but
I think I've heard it called

86
00:04:55,440 --> 00:04:58,940
faceted like boring bitmaps and
things like that.

87
00:04:59,380 --> 00:05:02,360
Nikolay: Well, usually we start
from UI, and in UI.

88
00:05:02,360 --> 00:05:04,400
In my head, this starts from UI.

89
00:05:04,540 --> 00:05:09,880
We have some things in big form
consisting of multiple, very

90
00:05:09,880 --> 00:05:14,280
different selectors and filters
and so on.

91
00:05:15,060 --> 00:05:18,160
This is very common in various
marketplaces.

92
00:05:18,600 --> 00:05:23,580
For example, imagine Airbnb, you
want to limit price and location,

93
00:05:24,060 --> 00:05:28,040
and various categories and properties
and so on.

94
00:05:28,260 --> 00:05:31,580
And let me put this on the table
right away.

95
00:05:32,080 --> 00:05:36,060
Ideally, we always should have
just a single index scan or index

96
00:05:36,060 --> 00:05:36,840
only scan.

97
00:05:37,360 --> 00:05:41,140
This is our ultimate goal always,
but unfortunately, it's not

98
00:05:41,140 --> 00:05:42,040
always possible.

99
00:05:42,440 --> 00:05:43,380
Why do we need it?

100
00:05:43,380 --> 00:05:48,900
Because it's the best plan, and
I think you can, like, do you

101
00:05:48,900 --> 00:05:49,700
Agree or not?

102
00:05:49,700 --> 00:05:50,860
Because you deal with plans

103
00:05:50,860 --> 00:05:51,540
Michael: all the

104
00:05:52,060 --> 00:05:54,880
Nikolay: time, with Postmaster,
explain plans and so on.

105
00:05:56,920 --> 00:05:58,680
Single index scan is the best.

106
00:06:00,040 --> 00:06:03,900
Michael: One of the things I love
about databases and performance

107
00:06:04,000 --> 00:06:07,580
in general is when it gets to the
point where you have to trade

108
00:06:07,580 --> 00:06:08,860
one thing off against another.

109
00:06:08,860 --> 00:06:13,400
And I think search is one of those
topics where often we're trading

110
00:06:13,440 --> 00:06:19,260
off user experience versus complexity
of the back end.

111
00:06:20,820 --> 00:06:24,020
Some of the nicest search features
are just a search bar.

112
00:06:24,150 --> 00:06:29,480
But without getting any input from
the user, you have to do a

113
00:06:29,480 --> 00:06:33,040
lot of work on the back end to
be able to serve that in any kind

114
00:06:33,040 --> 00:06:34,540
of performance way.

115
00:06:34,540 --> 00:06:39,260
So you've got the complexity
of matching

116
00:06:39,520 --> 00:06:44,540
the results to the intent, with
the additional caveat that you

117
00:06:44,540 --> 00:06:48,340
want it to give some, at least
some results, good results quickly.

118
00:06:49,000 --> 00:06:51,140
And that's a trade-off.

119
00:06:51,220 --> 00:06:54,920
Like it's really easy to get, well
not easy, but you can give

120
00:06:54,920 --> 00:06:58,980
great results if you can search
through everything you have and

121
00:06:58,980 --> 00:07:03,600
score everything and if you've
got forever to return them, but

122
00:07:03,740 --> 00:07:07,080
if you've given yourself a budget
of returning within a few hundred

123
00:07:07,080 --> 00:07:10,640
milliseconds, suddenly that becomes
a more difficult problem.

124
00:07:10,640 --> 00:07:15,120
So I love that it's a trade-off
and we're trading off user experience

125
00:07:15,420 --> 00:07:19,300
with resources, with technical
complexity on the back end.

126
00:07:19,300 --> 00:07:24,140
So I think it's one of those topics
that there are trade-offs and

127
00:07:24,140 --> 00:07:28,100
yes, one of those is performance,
but sometimes I think you are

128
00:07:28,100 --> 00:07:32,600
willing to pay a little bit of
performance for a better result.

129
00:07:33,740 --> 00:07:36,140
Nikolay: Yeah, this is difficult,
I guess.

130
00:07:36,180 --> 00:07:40,180
What are better results and what
is high quality of search, right?

131
00:07:40,600 --> 00:07:46,400
Like, I remember a definition that
users should be happy, which

132
00:07:46,400 --> 00:07:47,460
is very broad.

133
00:07:49,200 --> 00:07:50,780
What makes users happy?

134
00:07:51,100 --> 00:07:56,820
Maybe we return good results but
UI is very bad, so they are

135
00:07:56,820 --> 00:07:57,840
not happy, right?

136
00:07:58,080 --> 00:08:01,660
Like, it's quite an interesting
topic.

137
00:08:02,900 --> 00:08:08,100
And I think you're right, but also
like I just dove into the

138
00:08:08,100 --> 00:08:10,380
very bottom of performance.

139
00:08:11,380 --> 00:08:13,120
Performance matters a lot, right?

140
00:08:13,320 --> 00:08:19,360
If the search is very slow, the users
won't be happy and it means

141
00:08:19,360 --> 00:08:22,360
poor quality of the search, right?

142
00:08:22,360 --> 00:08:26,260
So we do care about performance,
but also we do care about things

143
00:08:26,260 --> 00:08:30,320
like if it's a full-text search,
we want stop words to be removed

144
00:08:30,320 --> 00:08:35,680
and ignored, we want some dictionaries
to be used, maybe synonyms

145
00:08:35,720 --> 00:08:37,860
to be applied, and so on and so
forth, right?

146
00:08:38,000 --> 00:08:41,660
This matters a lot, and of course,
but this also moves us to

147
00:08:41,660 --> 00:08:47,300
the performance part, because if
these steps are slow, it's also

148
00:08:47,300 --> 00:08:47,800
bad.

149
00:08:48,100 --> 00:08:51,080
Why was I mentioning faceted search?

150
00:08:51,820 --> 00:08:53,500
I just see a common pattern.

151
00:08:53,940 --> 00:09:00,240
Postgres is huge in terms of capabilities
and extensibility and

152
00:09:00,280 --> 00:09:03,280
various index types, extensions.

153
00:09:04,780 --> 00:09:08,200
But we have simple problems unsolved.

154
00:09:08,860 --> 00:09:14,340
For example, take full text search
and order by timestamp or

155
00:09:14,340 --> 00:09:14,840
ID.

156
00:09:15,300 --> 00:09:20,700
I want the very, like, instead
of old school, regular approach,

157
00:09:20,900 --> 00:09:25,580
return like most relevant documents
to me, I want fresh documents

158
00:09:25,840 --> 00:09:29,940
to go first because it's social
media, and this is number 1 pattern.

159
00:09:30,660 --> 00:09:35,940
But also they need to follow some
full-text search query I used.

160
00:09:36,340 --> 00:09:40,940
I just need to see the latest but
following some text patterns.

161
00:09:41,380 --> 00:09:43,700
And this problem is unsolved in
Postgres, unfortunately.

162
00:09:43,940 --> 00:09:47,100
And there is a good, the best attempt
to solve it, it's called

163
00:09:47,100 --> 00:09:51,420
RUM index, which is an extension,
like a new generation of GIN

164
00:09:51,420 --> 00:09:51,920
index.

165
00:09:52,720 --> 00:09:55,240
But why isn't it in the core?

166
00:09:55,240 --> 00:09:56,260
Because it has issues.

167
00:09:57,260 --> 00:09:59,420
It's huge, it's slow, and so on.

168
00:10:00,060 --> 00:10:04,780
And similar things I now observe,
not only observe, I touch them.

169
00:10:05,380 --> 00:10:11,460
For example, just before we started
recording, you showed me

170
00:10:11,640 --> 00:10:15,660
the Supabase blog post about
how to combine full-text search

171
00:10:15,720 --> 00:10:18,580
and semantic search based on embeddings.

172
00:10:18,780 --> 00:10:21,680
I don't like the word embeddings,
I like the word vectors.

173
00:10:22,360 --> 00:10:27,780
Because embedding, in my opinion,
in the database, it doesn't

174
00:10:27,780 --> 00:10:29,060
settle in my mind at all.

175
00:10:29,060 --> 00:10:34,260
Embedding is what we embed into our
prompt.

176
00:10:35,420 --> 00:10:36,300
This is content.

177
00:10:36,600 --> 00:10:38,040
But a vector is a vector.

178
00:10:39,220 --> 00:10:42,680
Maybe I missed something, but why
do we call vectors embeddings?

179
00:10:43,820 --> 00:10:47,540
Honestly, in our database, this
column is called embeddings just

180
00:10:47,540 --> 00:10:49,620
because OpenAI dictated it.

181
00:10:49,840 --> 00:10:56,100
But I also see OpenAI's Assistant
APIs name state machines' states

182
00:10:56,280 --> 00:11:00,360
as statuses, which also like, what's
happening there?

183
00:11:00,820 --> 00:11:05,420
A status is a state, state machine,
state, like in progress,

184
00:11:05,580 --> 00:11:06,820
function call, etc.

185
00:11:06,820 --> 00:11:08,180
So, but it's off-topic.

186
00:11:09,060 --> 00:11:17,060
So these vectors, they provide
us a great capability to have semantic

187
00:11:17,080 --> 00:11:20,800
search and we have text search
and the Supabase article describes

188
00:11:20,800 --> 00:11:22,040
how to combine them.

189
00:11:22,380 --> 00:11:26,000
But basically we perform 2 searches
and then like merge results.

190
00:11:26,000 --> 00:11:27,840
It means we cannot do pagination.

191
00:11:27,900 --> 00:11:29,340
Pagination is very important.

192
00:11:29,760 --> 00:11:32,940
Maybe a user needs to go to the second
page, third page.

193
00:11:33,200 --> 00:11:36,480
In quality search engines, they
do need it.

194
00:11:36,500 --> 00:11:40,320
And in this case, it means that
it's similar to the offset problem

195
00:11:40,320 --> 00:11:41,020
we described.

196
00:11:41,520 --> 00:11:43,300
Michael: That's the only solution,
I guess,

197
00:11:43,500 --> 00:11:45,650
Nikolay: is pagination through
offset.

198
00:11:45,650 --> 00:11:45,815
Michael: So far, yes,

199
00:11:45,815 --> 00:11:48,160
Nikolay: but maybe it's possible
to combine something, right?

200
00:11:48,960 --> 00:11:54,380
Honestly, GIN is also about multidimensional
things and like,

201
00:11:54,380 --> 00:11:55,580
I don't know, I don't know.

202
00:11:55,580 --> 00:11:59,920
It also has KNN, I don't know,
like it's, I know only parts of

203
00:11:59,920 --> 00:12:06,020
things here, but what, like I don't
like the index scans and then

204
00:12:06,020 --> 00:12:08,720
we combine things and we lose pagination.

205
00:12:08,740 --> 00:12:13,200
I mean we can have pagination but
if we want to go to page number

206
00:12:13,200 --> 00:12:17,720
100 it's insane how much data we
need to fetch and buffers will

207
00:12:17,720 --> 00:12:21,060
show very bad numbers and analyze
buffers.

208
00:12:21,260 --> 00:12:23,900
It means it's not working well.

209
00:12:24,160 --> 00:12:29,520
And a different example, sorry, I
will finish my my complaining

210
00:12:29,640 --> 00:12:30,140
speech.

211
00:12:31,020 --> 00:12:36,020
So, a different example is what we
have right now.

212
00:12:36,020 --> 00:12:43,360
In our bot we imported more than
900,000 emails from some 6 mailing

213
00:12:43,360 --> 00:12:45,860
lists, 25 years of them.

214
00:12:46,120 --> 00:12:48,060
So we have more than 1 million
documents.

215
00:12:48,480 --> 00:12:53,480
And of course, immediately, like
before we'd only imported to

216
00:12:53,480 --> 00:12:57,500
the bot's knowledge base only documentation,
source code, and

217
00:12:57,500 --> 00:12:58,260
blog posts.

218
00:12:58,580 --> 00:13:03,080
And all of them were quite relatively
fresh, almost.

219
00:13:03,580 --> 00:13:08,940
But when we imported 25 years of
mailing list archives, I'm

220
00:13:08,940 --> 00:13:12,820
asking, hey bot, what can you explain
to me about sub-transactions?

221
00:13:13,260 --> 00:13:17,080
Okay, this is documentation, my
article, but also this is a very

222
00:13:17,080 --> 00:13:19,860
good email from Bruce Momjian from
2002.

223
00:13:21,060 --> 00:13:22,860
And it went to first place.

224
00:13:24,480 --> 00:13:25,320
It's not good.

225
00:13:25,640 --> 00:13:27,080
We need to...

226
00:13:27,520 --> 00:13:32,080
Basically we need to take into
account the age of the data here,

227
00:13:32,080 --> 00:13:32,580
right?

228
00:13:32,800 --> 00:13:33,880
How to do that?

229
00:13:34,240 --> 00:13:35,520
There's no good way.

230
00:13:35,920 --> 00:13:40,340
If you work with pgVector, there's
no good way to deprioritize

231
00:13:42,920 --> 00:13:46,300
old documents, to take into account
the age of the data.

232
00:13:47,240 --> 00:13:51,760
So what we did, we just, when,
usually we need to find like 10

233
00:13:51,760 --> 00:13:56,760
or 15, 20 entries, maximum like
100 usually entries, and embed

234
00:13:56,760 --> 00:13:59,120
them as embeddings to the prompt.

235
00:14:00,040 --> 00:14:05,220
So what we do, we find a thousand
entries, and then just in memory

236
00:14:05,220 --> 00:14:09,180
Postgres recalculates adjusted
similarity, adjusted distance

237
00:14:09,240 --> 00:14:12,000
based on a logarithm of age.

238
00:14:13,940 --> 00:14:15,520
And this is how we do it.

239
00:14:15,580 --> 00:14:20,200
If nothing new, okay, we are satisfied
with old documents.

240
00:14:20,740 --> 00:14:21,740
But if there are...

241
00:14:21,740 --> 00:14:23,660
So we take into account the age, right?

242
00:14:23,680 --> 00:14:26,400
But again, this doesn't scale well.

243
00:14:26,400 --> 00:14:30,660
Like, if we will have a lot of,
like, 10 million documents, it

244
00:14:30,660 --> 00:14:31,540
will be worse.

245
00:14:31,560 --> 00:14:35,040
And also we cannot have pagination
if we talk about search here.

246
00:14:36,220 --> 00:14:37,960
Kind of a similar problem as well.

247
00:14:38,600 --> 00:14:42,940
And this makes me think, great
that we have extensibility, but

248
00:14:43,180 --> 00:14:47,300
these types of searches are so, like,
so different.

249
00:14:47,940 --> 00:14:53,200
We have, like, what is the name
of when different things are

250
00:14:53,200 --> 00:14:53,700
combined?

251
00:14:54,120 --> 00:15:01,020
So it means that it's hard to build
good system which works...

252
00:15:01,020 --> 00:15:01,520
Heterogeneous.

253
00:15:02,460 --> 00:15:03,260
Yes, heterogeneous.

254
00:15:03,620 --> 00:15:07,940
This word, I know how to spell
it but I cannot pronounce it because

255
00:15:07,940 --> 00:15:12,880
I saw it many times in papers,
but in scientific papers and so

256
00:15:12,880 --> 00:15:15,220
on, in technical papers, but yeah.

257
00:15:15,220 --> 00:15:15,860
I'm not

258
00:15:15,860 --> 00:15:17,440
Michael: even sure I know how to
pronounce it.

259
00:15:17,440 --> 00:15:19,240
Heterogeneous or something like
that?

260
00:15:19,860 --> 00:15:21,820
Nikolay: I cannot pronounce it
in Russian, sorry.

261
00:15:23,040 --> 00:15:28,880
So, what I'm trying to say, we
are kind of in the Linux early stage.

262
00:15:28,980 --> 00:15:32,480
You need to compile a lot of drivers,
like, and deal with it

263
00:15:32,480 --> 00:15:38,540
to make the system work as you want,
like, as a good product, right?

264
00:15:38,940 --> 00:15:43,080
Compared to some things like Elasticsearch,
when you take it and things

265
00:15:43,080 --> 00:15:46,640
work together very well because
it's a single product. What do

266
00:15:46,640 --> 00:15:47,560
you think about it?

267
00:15:47,560 --> 00:15:48,400
This is a problem.

268
00:15:48,400 --> 00:15:51,840
Accessibility has a negative side
here.

269
00:15:52,540 --> 00:15:55,760
Michael: I think you've jumped
straight to, like, where are the

270
00:15:55,760 --> 00:15:58,780
limits of Postgres' search capabilities
right now?

271
00:15:58,780 --> 00:16:04,140
And that’s a really interesting
topic and quite deep already.

272
00:16:04,860 --> 00:16:08,680
But it skips over all the things
you can do already in Postgres.

273
00:16:08,680 --> 00:16:15,840
And there are a ton of different
inbuilt things or add-on modules

274
00:16:15,840 --> 00:16:19,740
or extensions that mean that those
limits are being pushed further

275
00:16:19,740 --> 00:16:20,560
and further.

276
00:16:22,740 --> 00:16:26,080
And I think a lot of people come
from an assumption that Postgres

277
00:16:26,120 --> 00:16:28,820
won't be able to handle search
super well because products like

278
00:16:28,820 --> 00:16:32,560
Elasticsearch exist and are successful,
and therefore probably people

279
00:16:32,560 --> 00:16:36,800
aren't doing this in the database, 
but I see a lot of use cases

280
00:16:36,820 --> 00:16:43,120
that can be served adequately with
good results in acceptable

281
00:16:43,940 --> 00:16:48,060
response times for users without
touching any external services.

282
00:16:48,080 --> 00:16:52,400
So I think you're right that there
are edges and there are limits

283
00:16:52,600 --> 00:16:58,040
that can be better served by other
products, but those limits

284
00:16:58,040 --> 00:17:00,680
are quite far down the road for
a lot of use cases.

285
00:17:00,680 --> 00:17:04,840
You can build pretty good search
features for a lot of different

286
00:17:04,840 --> 00:17:08,480
use cases, especially if you're
willing to learn exactly how

287
00:17:08,480 --> 00:17:14,000
it works and factor in your own
product or services requirements.

288
00:17:14,440 --> 00:17:18,880
If you're not just searching every
field for every word or like

289
00:17:19,120 --> 00:17:22,960
I'm assuming like a text search
type field, it can be really powerful

290
00:17:22,960 --> 00:17:23,460
already.

291
00:17:24,120 --> 00:17:29,180
Nikolay: Yeah, I agree, I agree,
but yeah, well.

292
00:17:29,680 --> 00:17:32,320
Michael: Can we talk about some
of them quickly, like just to

293
00:17:32,320 --> 00:17:32,680
cover?

294
00:17:32,680 --> 00:17:33,860
Nikolay: Let's talk about them.

295
00:17:34,340 --> 00:17:39,900
I agree, and you're like basically
echoing the usual problem I

296
00:17:39,900 --> 00:17:40,400
have.

297
00:17:41,980 --> 00:17:45,300
I had the cases when people listening
to me said I'm a Postgres

298
00:17:45,300 --> 00:17:46,220
hater, right?

299
00:17:46,220 --> 00:17:53,100
So again, of course, this criticism
goes quite deep, and of course

300
00:17:53,520 --> 00:17:59,200
I don't like the idea to have Elasticsearch
for full-text search.

301
00:17:59,720 --> 00:18:03,980
and the need to constantly synchronize
or maybe some...

302
00:18:06,420 --> 00:18:10,880
What's the name of these new vector
database systems?

303
00:18:11,040 --> 00:18:12,580
Pinecone or something like that?

304
00:18:12,580 --> 00:18:16,080
So you basically need to synchronize
data from your main OLTP

305
00:18:16,080 --> 00:18:20,040
database all the time and you have
a lag and then you bring some

306
00:18:20,740 --> 00:18:23,940
regular data there and you think
how to combine and search that

307
00:18:23,940 --> 00:18:30,040
data because obviously for Elastic
you need to not only bring

308
00:18:30,540 --> 00:18:34,280
textual data but you need to bring
categories to have the same

309
00:18:34,340 --> 00:18:35,340
faceted search.

310
00:18:35,580 --> 00:18:40,380
Sometimes people want, like, I
want to do full text search, but

311
00:18:40,380 --> 00:18:42,540
again, limit price, right?

312
00:18:43,280 --> 00:18:49,780
Range, some range, and this usually
is stored in a regular column

313
00:18:49,780 --> 00:18:51,080
in the relational database.

314
00:18:51,760 --> 00:18:56,580
And of course, we have good capabilities
to combine it with full-text

315
00:18:56,580 --> 00:18:58,440
search and achieve in the single
index scan.

316
00:18:58,440 --> 00:19:03,020
For example, if you use GiST index,
well, GiST is slower.

317
00:19:03,740 --> 00:19:05,920
It works well for smaller datasets.

318
00:19:06,300 --> 00:19:12,280
But you combine it with GiST B-tree
and GiST B-tree, right?

319
00:19:12,280 --> 00:19:12,780
Or...

320
00:19:13,320 --> 00:19:14,540
Michael: B-tree/GiST, I think.

321
00:19:14,540 --> 00:19:15,040
Yeah.

322
00:19:15,220 --> 00:19:15,720
Nikolay: Right.

323
00:19:15,860 --> 00:19:20,040
So, and then you have a capability
to combine both full-text

324
00:19:20,040 --> 00:19:26,480
search and numeric range filter
and have a single index scan.

325
00:19:26,480 --> 00:19:27,240
This is perfect.

326
00:19:27,240 --> 00:19:30,560
Again, I'm staying on the same
point.

327
00:19:30,560 --> 00:19:32,140
Single index scan is the best.

328
00:19:32,580 --> 00:19:35,260
But unfortunately in many cases
we cannot achieve it.

329
00:19:36,620 --> 00:19:41,520
Ideally, user types something,
chooses something on the form,

330
00:19:42,040 --> 00:19:45,360
presses search, or maybe it's automated,
like I don't like automated,

331
00:19:45,360 --> 00:19:46,920
I like to press search explicitly.

332
00:19:47,320 --> 00:19:52,120
Anyway, we have a request and this
request translates to a single

333
00:19:52,120 --> 00:19:54,320
index scan and we return.

334
00:19:54,320 --> 00:19:56,880
And this is an ideal case in terms
of performance.

335
00:19:58,020 --> 00:20:03,220
Otherwise, for bigger data sets,
you will have very bad performance.

336
00:20:03,740 --> 00:20:06,260
Michael: Well in terms of performance
but also in terms of system

337
00:20:06,260 --> 00:20:07,120
resources, right?

338
00:20:07,120 --> 00:20:11,340
Like, we're also not having to
use a lot of system resources

339
00:20:11,340 --> 00:20:14,360
to satisfy quite a lot of searches.

340
00:20:14,640 --> 00:20:17,640
Whereas a lot of the alternatives
require, well because they're

341
00:20:17,640 --> 00:20:20,520
not just one index scan, require
more resources as well.

342
00:20:20,520 --> 00:20:25,420
So I think it's efficient from
a couple of angles but it very

343
00:20:25,420 --> 00:20:27,720
much limits what the user can search
for.

344
00:20:27,720 --> 00:20:34,180
If you can, if it has to be indexable
that way some other searches

345
00:20:34,940 --> 00:20:36,060
wouldn't be possible.

346
00:20:36,060 --> 00:20:39,360
Like I, I don't know about you,
but since, to give people a bit

347
00:20:39,360 --> 00:20:42,260
of an insight into how we do this,
we agreed on a topic about 24

348
00:20:42,260 --> 00:20:45,860
hours ago and every, every product
I've used since I've been

349
00:20:45,860 --> 00:20:48,420
like thinking, how does search
work here exactly?

350
00:20:48,420 --> 00:20:51,420
And it's really interesting how
different products implement

351
00:20:51,500 --> 00:20:51,780
it.

352
00:20:51,780 --> 00:20:54,060
And not, not everyone does it the
same.

353
00:20:54,240 --> 00:20:58,980
And we've been somewhat spoiled
as users by Google, in my opinion,

354
00:20:58,980 --> 00:21:04,160
Google and Gmail, both of which
have an incredibly good search

355
00:21:04,200 --> 00:21:06,360
features for quite a long time.

356
00:21:06,580 --> 00:21:08,940
And most people have experienced
those.

357
00:21:09,320 --> 00:21:11,140
But it isn't the same in every
product.

358
00:21:11,140 --> 00:21:13,660
Not every product is firstly capable
of doing that.

359
00:21:13,660 --> 00:21:18,400
But also it is not quite the right
trade-off for a lot of products

360
00:21:18,400 --> 00:21:18,900
either.

361
00:21:19,200 --> 00:21:23,560
So like a lot of products you use,
a lot of products I use, things

362
00:21:23,560 --> 00:21:28,160
like Slack, for example, or Stripe,
they will encourage you to

363
00:21:28,160 --> 00:21:29,760
use filters.

364
00:21:30,060 --> 00:21:33,160
They let you type whatever you
want in, and they will perform

365
00:21:33,160 --> 00:21:35,960
a wide search, depending on whatever
you type.

366
00:21:36,360 --> 00:21:41,260
But they encourage the use of, for
example, in Slack, search within

367
00:21:41,260 --> 00:21:45,420
just 1 channel, or just from 1
person, or things that filter

368
00:21:45,420 --> 00:21:48,540
it right down to make those searches
much more efficient.

369
00:21:48,900 --> 00:21:51,820
So it's interesting that they're
doing that partly, I guess,

370
00:21:51,820 --> 00:21:55,280
to give you the results you're
exactly looking for high up, but

371
00:21:55,280 --> 00:22:00,780
also, as I guess, to reduce system
use, like they, they don't

372
00:22:00,780 --> 00:22:04,820
have to do as much work if you
filter it down for them.

373
00:22:06,200 --> 00:22:06,900
Nikolay: All right.

374
00:22:07,960 --> 00:22:11,420
Michael: So I think there are a few
things that the beginners

375
00:22:11,420 --> 00:22:14,220
I think or when I was a beginner
when I didn't know quite how

376
00:22:14,220 --> 00:22:15,180
this stuff worked.

377
00:22:15,720 --> 00:22:20,360
I don't think I fully appreciated
the complexity of doing search

378
00:22:20,360 --> 00:22:20,860
well.

379
00:22:20,940 --> 00:22:22,560
So there's the basics.

380
00:22:22,900 --> 00:22:25,780
When we say full-text search, by
the way, I never really understood

381
00:22:27,160 --> 00:22:28,780
what the word full is doing in
there.

382
00:22:28,780 --> 00:22:31,160
It's basically just text search.

383
00:22:31,160 --> 00:22:35,780
Does this document or does this
sentence or something contain

384
00:22:36,160 --> 00:22:40,340
this word and this word or this
word or this word and so basic

385
00:22:40,380 --> 00:22:42,760
kind of text-based searches.

386
00:22:43,080 --> 00:22:45,420
I don't know why it's called full,
do you know?

387
00:22:46,740 --> 00:22:47,370
Nikolay: No, good question.

388
00:22:47,370 --> 00:22:47,790
No?
No.

389
00:22:47,790 --> 00:22:48,500
Good question.

390
00:22:49,080 --> 00:22:52,700
I think, right, so I understand
the difference.

391
00:22:55,020 --> 00:22:59,600
You can compare the whole value,
in this case, a bit reasonable,

392
00:22:59,700 --> 00:23:04,080
but you'll be dealing with the
problem of the size of this value,

393
00:23:04,080 --> 00:23:04,580
right?

394
00:23:05,340 --> 00:23:05,920
And then you can...

395
00:23:05,920 --> 00:23:08,440
Michael: But there's also, there's
like a million complexities

396
00:23:08,600 --> 00:23:11,620
just, if we only consider that,
there's a million complexities

397
00:23:11,740 --> 00:23:15,860
to do, like, should you care about
the case of, like, does the,

398
00:23:15,860 --> 00:23:17,700
like, the capital letters matter?

399
00:23:17,700 --> 00:23:20,100
Nikolay: So, full-text search is
a very, like, well-established

400
00:23:20,460 --> 00:23:21,440
area already.

401
00:23:21,780 --> 00:23:22,280
Yes.

402
00:23:24,180 --> 00:23:30,400
And where instead of comparing
whole value or doing some mask,

403
00:23:31,120 --> 00:23:36,260
regular expressions, right, which
is also an interesting topic,

404
00:23:36,260 --> 00:23:40,760
but it's also related, like a GIN
can be, trigram search can

405
00:23:40,760 --> 00:23:41,460
be used, right?

406
00:23:41,460 --> 00:23:41,720
Michael: Yes.

407
00:23:41,720 --> 00:23:45,900
Nikolay: Instead of that, we consider
words, like, first of all,

408
00:23:45,900 --> 00:23:51,500
we, as usual, have this problem
with First Normal Form, but it's

409
00:23:51,500 --> 00:23:52,540
off topic, right?

410
00:23:52,800 --> 00:23:55,380
Because this value is not atomic
anymore, right?

411
00:23:55,380 --> 00:23:56,980
We consider each value as...

412
00:23:58,380 --> 00:24:00,590
We have atoms in this molecule,
right?

413
00:24:00,590 --> 00:24:00,720
Yeah.

414
00:24:00,720 --> 00:24:06,660
So, and first of all, some words,
words we usually either normalize

415
00:24:07,120 --> 00:24:09,940
using this stemmer, snowball stemmer,
right?

416
00:24:10,120 --> 00:24:14,200
Or we use some dictionary to find
kind of...

417
00:24:15,540 --> 00:24:16,740
Michael: Synonyms or...

418
00:24:17,020 --> 00:24:19,540
Nikolay: No, No, synonyms is one
thing, it's also a synonym dictionary

419
00:24:19,540 --> 00:24:21,980
that can be used, but I'm talking
about Ispell dictionaries,

420
00:24:21,980 --> 00:24:23,320
for example, when you take...

421
00:24:23,320 --> 00:24:24,040
Oh, yeah.

422
00:24:24,520 --> 00:24:26,260
When you take a word and...

423
00:24:27,540 --> 00:24:29,160
Stemmers is very dumb.

424
00:24:29,480 --> 00:24:32,640
It's just cut the ending.

425
00:24:32,640 --> 00:24:33,360
That's it.

426
00:24:34,020 --> 00:24:37,660
You can feed some new word and
it will cut it according to some

427
00:24:37,660 --> 00:24:38,160
rule.

428
00:24:38,160 --> 00:24:41,320
But Ispell, it's a dictionary,
it knows the language, it

429
00:24:41,320 --> 00:24:46,400
knows the set of words and it can
transform words in different

430
00:24:46,400 --> 00:24:50,360
forms to some, like, normalized,
it normalizes every word, right?

431
00:24:51,060 --> 00:24:56,540
And then we have, basically, we
have, we can build either a tree

432
00:24:56,840 --> 00:25:03,120
and use GiST, Generalized Search Tree, which can be used

433
00:25:03,120 --> 00:25:05,260
for even B-tree or R-tree.

434
00:25:06,360 --> 00:25:12,600
B-tree is one dimension, R-tree is
2 or more dimensions, and R-tree

435
00:25:12,660 --> 00:25:17,080
is based on GiST in Postgres because
implementation based on

436
00:25:17,080 --> 00:25:21,020
GiST was better than original implementation,
which was not the

437
00:25:21,020 --> 00:25:22,000
case for B-tree.

438
00:25:22,240 --> 00:25:25,760
B-tree remained the native implementation,
but there is GiST

439
00:25:25,760 --> 00:25:28,600
implementation, that's why I already
mentioned it, right?

440
00:25:28,940 --> 00:25:31,100
And then, so, tree, right?

441
00:25:31,100 --> 00:25:31,560
Great.

442
00:25:31,560 --> 00:25:36,760
So you can find the entries which
have your words, but also you

443
00:25:36,760 --> 00:25:37,260
can...

444
00:25:37,480 --> 00:25:40,360
There's an inverted in this, GIN,
right?

445
00:25:41,040 --> 00:25:42,560
And GiST actually...

446
00:25:42,560 --> 00:25:44,280
Oh, GiST, I didn't mention.

447
00:25:45,440 --> 00:25:51,220
So, B-tree is one dimension, so just
one axis, R-tree is two or more dimensions,

448
00:25:51,500 --> 00:25:57,940
and you can build trees, for example,
rectangles in two dimensions.

449
00:25:58,520 --> 00:26:00,900
But what to do with text?

450
00:26:01,020 --> 00:26:02,620
Because it has a lot of words.

451
00:26:02,620 --> 00:26:06,100
We can consider words as kind of
array or set, right?

452
00:26:06,340 --> 00:26:12,540
And then we can say, this set contains
that set, right?

453
00:26:12,540 --> 00:26:13,540
Or is contained.

454
00:26:13,740 --> 00:26:18,040
So we define operators, intersects,
is contained, contained.

455
00:26:18,920 --> 00:26:21,560
And in this case, we talk about
sets.

456
00:26:21,760 --> 00:26:25,780
And we can still build the tree
based on GiST.

457
00:26:26,200 --> 00:26:30,120
There are 7 functions you need
to implement to define the operations

458
00:26:30,480 --> 00:26:35,140
and like basically so for sets
we can build tree and it's called

459
00:26:35,140 --> 00:26:37,280
actually R-tree, Russian Doll Tree.

460
00:26:37,280 --> 00:26:40,940
This is how full-text search, it's
official name for Berkeley

461
00:26:41,640 --> 00:26:42,140
paper.

462
00:26:42,380 --> 00:26:44,620
We can attach the link to it.

463
00:26:44,700 --> 00:26:49,320
And This is how originally full-text
search was implemented based

464
00:26:49,320 --> 00:26:50,040
on GiST.

465
00:26:50,460 --> 00:26:55,460
But also later it was implemented
GIN, which is General Inverted

466
00:26:55,940 --> 00:27:00,480
Index, which works much better
for very large volumes of data,

467
00:27:00,480 --> 00:27:03,740
and this is what search engines
use.

468
00:27:04,440 --> 00:27:09,100
So it's basically a list of terms
and links to which document

469
00:27:09,520 --> 00:27:15,720
terms are mentioned and then there
are internal bit trees to

470
00:27:15,720 --> 00:27:17,520
find faster each term.

471
00:27:18,420 --> 00:27:21,600
I think there are 2 kinds of bit
tree inside GIN, but it's like

472
00:27:21,600 --> 00:27:22,780
implementation details.

473
00:27:23,260 --> 00:27:28,120
So, in general, it means that we
can say, okay, these words are

474
00:27:28,120 --> 00:27:29,440
present in this document.

475
00:27:31,060 --> 00:27:36,560
And we can very fast find them
and we can also order by like

476
00:27:36,560 --> 00:27:40,080
rank. Rank, it's an interesting thing.

477
00:27:40,080 --> 00:27:44,280
It's calculated based on like most
relevant documents. Like for

478
00:27:44,280 --> 00:27:49,920
example, I don't know. Like words
are like mentioned more in this

479
00:27:49,920 --> 00:27:50,880
document, right?

480
00:27:51,780 --> 00:27:53,080
Do we have phrase search?

481
00:27:53,080 --> 00:27:54,640
Can we do double quotes?

482
00:27:55,160 --> 00:27:56,620
Yeah, we have also some...

483
00:27:56,980 --> 00:27:58,440
And or some...

484
00:27:58,520 --> 00:28:01,120
We can write some formulas, right?

485
00:28:01,120 --> 00:28:01,220
There's like

486
00:28:01,220 --> 00:28:03,780
Michael: Followed by, yeah, you
can do like followed by for free

487
00:28:03,780 --> 00:28:07,500
search, but there's also a, so
there's, we have some data types

488
00:28:07,500 --> 00:28:11,000
and loads of functions that are
really helpful for doing this

489
00:28:11,000 --> 00:28:11,320
without...

490
00:28:11,320 --> 00:28:12,840
Nikolay: And 4 categories, right?

491
00:28:13,080 --> 00:28:14,240
ABCD, right?

492
00:28:14,240 --> 00:28:15,720
We can, you can...

493
00:28:15,720 --> 00:28:16,400
Like weighting.

494
00:28:18,820 --> 00:28:19,900
Michael: What do you mean by categories?

495
00:28:19,900 --> 00:28:20,820
Nikolay: I don't remember.

496
00:28:20,940 --> 00:28:27,080
I remember when you define GIN
index, you need to convert data

497
00:28:27,120 --> 00:28:31,320
using tsvector, so you convert
to special text search vector,

498
00:28:31,320 --> 00:28:36,160
tsvector type, and you can say
that some parts can be considered

499
00:28:36,160 --> 00:28:39,000
one category, some parts different
categories, there are a maximum

500
00:28:39,000 --> 00:28:39,840
of four categories.

501
00:28:40,080 --> 00:28:43,300
And when you search, you can say,
I'm searching within only a

502
00:28:43,480 --> 00:28:44,480
specific category.

503
00:28:45,040 --> 00:28:50,980
It means that, for example, you
can build one TS vector, but take,

504
00:28:50,980 --> 00:28:54,100
for example, if you're indexing,
for example, emails, you can

505
00:28:54,100 --> 00:28:58,540
take words from the subject and mark
them as category A, for example,

506
00:28:58,660 --> 00:28:59,160
right?

507
00:28:59,540 --> 00:29:01,020
But the body is B.

508
00:29:01,220 --> 00:29:05,040
And then you have the freedom and flexibility
to search globally,

509
00:29:05,820 --> 00:29:09,200
not taking into account the origin
of the words.

510
00:29:09,200 --> 00:29:14,440
Or you can limit inside a single,
same gene, same search, you can

511
00:29:14,440 --> 00:29:17,720
limit saying I'm searching only
inside the subject.

512
00:29:19,380 --> 00:29:24,100
So you can mark these parts of
the TS vector, which is good, but

513
00:29:24,100 --> 00:29:26,260
four is not enough in many cases.

514
00:29:28,480 --> 00:29:31,340
So there are many capabilities
in Postgres systems.

515
00:29:31,340 --> 00:29:34,900
Michael: Yes, and you don't have
to build that much around it

516
00:29:34,900 --> 00:29:36,820
to get something pretty powerful
out.

517
00:29:36,820 --> 00:29:40,660
And one thing I learned about relatively
recently from a blog post

518
00:29:40,680 --> 00:29:43,380
was websearch_to_tsquery.

519
00:29:43,380 --> 00:29:47,620
So this TS query being the query
representation and the TS

520
00:29:47,620 --> 00:29:50,100
vector being the vector representation,
like, normalized.

521
00:29:50,740 --> 00:29:54,140
So once you've, like, taken each
word, normalized them, like,

522
00:29:54,140 --> 00:29:56,340
plurals out of it and things like
that.

523
00:29:56,580 --> 00:30:01,360
Yeah, websearch_to_tsquery means you can
achieve a query that's a bit like

524
00:30:01,360 --> 00:30:06,100
you might imagine a fairly complex
search engine search, like

525
00:30:06,100 --> 00:30:10,280
taking the not operator and saying
I don't want documents that

526
00:30:10,280 --> 00:30:14,720
include this word or using and
and or type operators as well.

527
00:30:14,720 --> 00:30:19,540
So it could let you build something
that has some basic search

528
00:30:19,540 --> 00:30:24,140
engine-like features built into
the text search field without

529
00:30:24,140 --> 00:30:25,260
much work at all.

530
00:30:26,320 --> 00:30:28,940
And these will come like in Postgres
core.

531
00:30:29,180 --> 00:30:31,840
You don't even need an extension
for this lot.

532
00:30:32,060 --> 00:30:33,200
Which is pretty cool.

533
00:30:33,480 --> 00:30:34,960
But yeah, I was going to move on.

534
00:30:34,960 --> 00:30:38,520
I think obviously that contains
loads of complexities and helps

535
00:30:38,520 --> 00:30:41,560
you solve some of the trickier
things immediately.

536
00:30:42,080 --> 00:30:48,660
But there's also relatively built-in
modules for fuzzy search.

537
00:30:48,660 --> 00:30:50,840
So handling typos elegantly.

538
00:30:51,320 --> 00:30:54,740
The kind of things that just start
to get a bit more complicated.

539
00:30:55,520 --> 00:30:59,020
Do you want to be able to match
on people nearly getting the

540
00:30:59,020 --> 00:30:59,500
name right?

541
00:30:59,500 --> 00:31:04,040
And not all products do, but it's
pretty common that we do want

542
00:31:04,040 --> 00:31:04,540
to.

543
00:31:04,860 --> 00:31:08,400
And when I first came across trigram
search or pg_trgm the

544
00:31:08,400 --> 00:31:12,620
extension, I was blown away by
like how elegant a solution that

545
00:31:12,620 --> 00:31:14,940
is to typos.

546
00:31:16,520 --> 00:31:17,700
So were you not as impressed?

547
00:31:17,700 --> 00:31:21,980
Like I was thinking it's so simple
and it works so well.

548
00:31:22,280 --> 00:31:22,980
Nikolay: Trigrams.

549
00:31:23,740 --> 00:31:26,540
Well, let me disagree with you.

550
00:31:26,600 --> 00:31:27,560
I use it many times.

551
00:31:27,560 --> 00:31:28,220
Michael: Go on.

552
00:31:28,820 --> 00:31:32,120
Nikolay: For many, many years,
and I cannot say it's elegant,

553
00:31:32,220 --> 00:31:35,100
because it requires, first of all,
it requires a lot of effort.

554
00:31:36,460 --> 00:31:39,160
It's not just to say, I want regex
here, that's it.

555
00:31:39,160 --> 00:31:42,380
No, you need to do something, like
some things you need to do.

556
00:31:42,640 --> 00:31:46,720
And also at really high volumes,
it doesn't work well in terms

557
00:31:46,720 --> 00:31:47,380
of performance.

558
00:31:48,220 --> 00:31:48,720
Michael: Sure.

559
00:31:48,940 --> 00:31:53,940
Nikolay: And if you have a lot
of updates coming, you will bump

560
00:31:53,940 --> 00:31:59,660
into this dilemma, fast update
or without fast update GIN and

561
00:31:59,920 --> 00:32:05,800
pending list, which is by default
4 megabytes, right?

562
00:32:05,800 --> 00:32:11,880
And during regular select, it decides
it needs to be like processed,

563
00:32:13,480 --> 00:32:16,580
select is timing out, then you
need to tune it.

564
00:32:17,020 --> 00:32:20,040
Well, I still have a feeling it
could be better.

565
00:32:20,740 --> 00:32:21,300
Michael: No doubt.

566
00:32:21,300 --> 00:32:26,600
And I guess this is one of the examples
of can work really well

567
00:32:26,600 --> 00:32:31,180
for a while and once you hit
a certain scale, maybe an external

568
00:32:31,320 --> 00:32:33,220
system is beneficial.

569
00:32:34,540 --> 00:32:35,020
But yeah.

570
00:32:35,020 --> 00:32:36,360
Nikolay: Yeah, because it's more
polished.

571
00:32:36,600 --> 00:32:41,240
But also I think there is an opportunity
for people who are new

572
00:32:42,180 --> 00:32:48,800
to Postgres companies to improve and
benefit from a more polished

573
00:32:48,800 --> 00:32:53,080
version when you develop some product
on top of Postgres, right?

574
00:32:54,720 --> 00:32:57,480
Michael: Yeah, and there are some
startups, right, that are quite

575
00:32:57,480 --> 00:33:00,940
focused on search, or at least
what, is it ParadeDB that are

576
00:33:00,940 --> 00:33:02,080
looking into this stuff?

577
00:33:02,080 --> 00:33:08,400
And then there's also one other that's
worth, I would be ashamed

578
00:33:08,400 --> 00:33:11,080
to finish this episode without
mentioning them.

579
00:33:13,100 --> 00:33:19,440
Yes, who do synchronization, who
take the hard parts out of synchronizing

580
00:33:20,220 --> 00:33:21,360
Postgres with Elasticsearch.

581
00:33:22,120 --> 00:33:23,400
Nikolay: Yeah, that's a good tagline.

582
00:33:23,400 --> 00:33:27,080
Also, I wanted to mention, sorry,
I was trying to find the book

583
00:33:27,080 --> 00:33:28,320
I recently bought.

584
00:33:28,440 --> 00:33:33,080
For specialists, it should be super,
like, important book, I

585
00:33:33,080 --> 00:33:37,580
guess, and I need to thank Andrei
Borodin, as usual.

586
00:33:38,640 --> 00:33:42,340
This guy helps me a lot with Postgres
in general.

587
00:33:42,660 --> 00:33:45,840
So this book is called Introduction
to Information Retrieval

588
00:33:46,980 --> 00:33:53,200
by Christopher Manning, Prabhakar
Raghavan, I'm sorry, and Hinrich

589
00:33:53,200 --> 00:33:53,700
Schütze.

590
00:33:54,220 --> 00:33:57,700
So this book is interesting and
this is exactly where I saw this

591
00:33:57,700 --> 00:33:59,660
like, what is good quality?

592
00:33:59,860 --> 00:34:00,840
Users are happy.

593
00:34:00,940 --> 00:34:02,560
And then a lot of formulas, right?

594
00:34:02,560 --> 00:34:03,260
It's interesting.

595
00:34:04,220 --> 00:34:06,300
Michael: Well, and it's a moving
target.

596
00:34:06,340 --> 00:34:10,140
One of the good blog posts I re-read
as part of doing research

597
00:34:10,140 --> 00:34:14,440
for this was by Justin Searls
about a Ruby implementation and

598
00:34:14,440 --> 00:34:18,860
he made a really good point at
the end about it being a moving

599
00:34:18,860 --> 00:34:19,360
goalpost.

600
00:34:19,740 --> 00:34:23,940
So users might be happy with waiting
a few seconds 1 year and

601
00:34:23,940 --> 00:34:27,440
then 5 or 10 years later that may
not be considered good enough

602
00:34:27,440 --> 00:34:30,640
anymore because they can use other
search engines that are much

603
00:34:30,640 --> 00:34:31,140
faster.

604
00:34:31,360 --> 00:34:32,960
Or your data volumes grow.

605
00:34:32,960 --> 00:34:37,580
You talked about, you know, if
your implementation relies on,

606
00:34:38,300 --> 00:34:41,500
well, volume might change, but
also patterns might change.

607
00:34:41,500 --> 00:34:46,080
And you might find it's harder
to provide as good results as

608
00:34:46,080 --> 00:34:49,060
your data changes or as your users
change in their expectations.

609
00:34:49,200 --> 00:34:52,280
So it's kind of a moving goalpost
constantly as well.

610
00:34:52,280 --> 00:34:57,240
So not only might the same results
10 years later not be good

611
00:34:57,240 --> 00:35:02,680
enough, but also like, yeah, it's
a tricky one, but I think, I

612
00:35:02,680 --> 00:35:07,000
think user happiness is one good
one, but also Google uses lots of

613
00:35:08,300 --> 00:35:09,460
metrics like...

614
00:35:09,860 --> 00:35:10,360
Nikolay: Yeah.

615
00:35:10,960 --> 00:35:12,480
I know what I think about.

616
00:35:13,320 --> 00:35:14,020
Michael: Go on.

617
00:35:14,140 --> 00:35:16,600
Nikolay: So since we talk about
semantic search, which supports

618
00:35:16,700 --> 00:35:21,460
some kind of AI systems, like these
embeddings, I'm thinking

619
00:35:21,460 --> 00:35:25,080
about not user happiness, but LLM
happiness, so to speak.

620
00:35:25,740 --> 00:35:29,820
So, and I think that usually we
deal with very large documents

621
00:35:30,180 --> 00:35:33,780
And when we generate vectors there
are certain limits, for example

622
00:35:33,780 --> 00:35:42,240
OpenAI limit is 8,192 tokens roughly
like 15-16 characters.

623
00:35:42,720 --> 00:35:46,300
And for example my article about
sub-transactions it exceeds

624
00:35:46,520 --> 00:35:47,960
30,000 characters.

625
00:35:47,980 --> 00:35:52,260
So it was hard to vectorize it,
right?

626
00:35:52,260 --> 00:35:55,820
And we needed to summarize it first
using LLM and then vectorize

627
00:35:55,840 --> 00:35:58,380
the summary only, so unfortunately,
right?

628
00:35:58,480 --> 00:36:03,320
It works, okay, but what I'm trying
to say, when we talk about

629
00:36:03,320 --> 00:36:07,320
traditional search engines, search
results are not whole documents,

630
00:36:07,340 --> 00:36:08,160
they are snippets.

631
00:36:08,940 --> 00:36:09,440
Michael: True.

632
00:36:09,520 --> 00:36:15,060
Nikolay: And it's also a part of
this happiness or quality is

633
00:36:15,060 --> 00:36:16,500
how we present results.

634
00:36:16,820 --> 00:36:21,420
For example, we can provide snippets
and highlight the words

635
00:36:22,180 --> 00:36:24,720
which are present in the query,
right?

636
00:36:24,720 --> 00:36:25,320
It's good.

637
00:36:25,320 --> 00:36:26,700
It's good for user experience.

638
00:36:26,760 --> 00:36:30,280
User immediately sees the words
mentioned that user typed.

639
00:36:31,160 --> 00:36:35,020
If it's a synonym, it will be different,
but anyway, it's a good

640
00:36:35,020 --> 00:36:35,520
practice.

641
00:36:36,220 --> 00:36:40,220
But if you think about Google,
how it works, you see some results,

642
00:36:40,680 --> 00:36:45,800
first page, second page and so
on, but then it's good when very

643
00:36:45,800 --> 00:36:48,480
relevant results are on the very
first page, right?

644
00:36:49,080 --> 00:36:51,820
Maybe on the first top, like, included
in the first page and

645
00:36:51,820 --> 00:36:52,620
you're satisfied.

646
00:36:53,500 --> 00:36:57,840
We like, there is a new topic, LLM happiness.

647
00:36:57,940 --> 00:36:58,940
They should be satisfied.

648
00:36:58,940 --> 00:37:00,080
But what does it mean?

649
00:37:00,360 --> 00:37:04,300
It means that what we decided to
include, to embed, to prompt

650
00:37:04,300 --> 00:37:08,180
and use in answers should be very
relevant, right?

651
00:37:09,280 --> 00:37:14,160
And this is a very big topic which
is not yet discovered properly.

652
00:37:14,340 --> 00:37:17,120
In my opinion, what I have in my
mind, I'm just sharing and if

653
00:37:17,120 --> 00:37:21,040
some guys are interested, I would
love to have a discussion around

654
00:37:21,040 --> 00:37:21,540
it.

655
00:37:22,120 --> 00:37:24,800
So what I think, this is what we
are going to do.

656
00:37:24,800 --> 00:37:26,600
Not yet, we are going to do it.

657
00:37:26,880 --> 00:37:33,420
We are going to return many snippets
or summaries or something

658
00:37:33,420 --> 00:37:35,140
like that, not whole documents.

659
00:37:36,040 --> 00:37:41,020
And then ask LLM internally during
the same cycle of requests.

660
00:37:41,400 --> 00:37:44,440
We just say, LLM, evaluate, is
it relevant?

661
00:37:44,440 --> 00:37:47,300
Do you think based on this snippet,
is it relevant to your original

662
00:37:47,440 --> 00:37:47,940
request?

663
00:37:49,340 --> 00:37:55,560
And if it's on the scale from 1
to 10, we limit the result and

664
00:37:55,560 --> 00:37:58,580
then it's important, and this is
how humans behave.

665
00:37:59,340 --> 00:38:04,320
We open each document and inspect
it fully and think again, is

666
00:38:04,320 --> 00:38:04,980
it relevant?

667
00:38:05,240 --> 00:38:07,900
And only then, and maybe we go
to the second page, I don't know,

668
00:38:07,900 --> 00:38:10,800
like maybe the second page is not needed,
we just, our first page

669
00:38:10,800 --> 00:38:13,380
can be bigger because everything
is automated here.

670
00:38:13,380 --> 00:38:17,000
But we inspect the full document and
decide, is it worth keeping

671
00:38:17,980 --> 00:38:20,780
and using in the answer or no?

672
00:38:21,820 --> 00:38:22,580
Maybe not.

673
00:38:23,040 --> 00:38:27,840
Maybe in the end of this process,
we will have 0 documents left.

674
00:38:27,840 --> 00:38:30,840
Maybe we need to think maybe to
provide a different query.

675
00:38:31,420 --> 00:38:35,220
And what I see from LLM, sometimes
if we do this, like you ask

676
00:38:35,220 --> 00:38:39,020
something, there is this rank system
which performs a search

677
00:38:39,760 --> 00:38:43,640
and you are not satisfied and if
you just tell like if you consider

678
00:38:43,680 --> 00:38:49,320
LLM as a junior engineer you just
say I'm not satisfied you can

679
00:38:49,320 --> 00:38:53,800
do better. Invent some new searches
and try them.

680
00:38:53,800 --> 00:38:56,260
You don't say exactly what searches
to use.

681
00:38:58,360 --> 00:39:01,940
In all cases I did this, I was
satisfied on the second step.

682
00:39:03,280 --> 00:39:07,620
So I mean the internal process
of searching can be very complex,

683
00:39:07,660 --> 00:39:10,800
it might take longer, much longer,
but the result will be much

684
00:39:10,800 --> 00:39:11,300
better.

685
00:39:11,820 --> 00:39:15,560
And the question about quality
here is very, it's shifting.

686
00:39:16,400 --> 00:39:21,880
This semantic search should be
somehow different to be considered

687
00:39:21,900 --> 00:39:25,840
as good quality for this type of
process, for AI systems.

688
00:39:26,260 --> 00:39:31,120
So I'm reading this book, understanding
that it was written for

689
00:39:31,120 --> 00:39:35,740
search engines targeted humans,
and I'm very interested how this

690
00:39:35,740 --> 00:39:37,440
topic will be changed.

691
00:39:37,640 --> 00:39:40,220
I'm sure a lot of things will be
inherited.

692
00:39:40,600 --> 00:39:45,460
A lot of science is there, like
huge science.

693
00:39:45,540 --> 00:39:49,260
Search engine, like information
search, is a huge topic, right?

694
00:39:49,300 --> 00:39:52,400
So a lot of things will be inherited,
but there will be changes

695
00:39:52,500 --> 00:39:58,520
as well, because of a high level
of automation, and I'm very curious

696
00:39:58,520 --> 00:40:00,140
how quality will be changed.

697
00:40:00,540 --> 00:40:02,660
So this is a very interesting topic.

698
00:40:02,980 --> 00:40:08,000
I have many more questions than answers.

699
00:40:08,840 --> 00:40:09,340
Yeah.

700
00:40:10,440 --> 00:40:12,880
Thank you for listening to this
long speech.

701
00:40:13,920 --> 00:40:17,420
It just sits in my head right now,
like what will happen

702
00:40:17,420 --> 00:40:18,140
Michael: with search.

703
00:40:18,205 --> 00:40:22,000
Don't you consider search engines
semantic search too?

704
00:40:23,200 --> 00:40:27,600
Nikolay: Yeah, well, Google and all top search engines,

705
00:40:27,600 --> 00:40:30,700
they do semantic search for many,
many years already.

706
00:40:30,700 --> 00:40:31,320
I know that.

707
00:40:31,320 --> 00:40:35,780
But now we, like, they didn't provide,
like, do you know API

708
00:40:35,900 --> 00:40:36,420
to Google?

709
00:40:36,420 --> 00:40:37,060
I don't know.

710
00:40:37,060 --> 00:40:41,400
I only know some workaround solutions
to it, right?

711
00:40:41,500 --> 00:40:46,100
But now we are building small Googles
for our small knowledge

712
00:40:46,100 --> 00:40:47,660
bases, and it's interesting.

713
00:40:48,580 --> 00:40:50,960
What, like, pgvector is a very
basic thing.

714
00:40:51,060 --> 00:40:54,520
Of course, a lot of work there,
a lot, like, 2 types of indexes

715
00:40:54,520 --> 00:40:58,540
it provides, and so on, performance
improvements but what to

716
00:40:58,540 --> 00:41:02,620
do with age What to do with this
embedding process?

717
00:41:03,820 --> 00:41:07,900
Yeah, these questions are big right
now in my head.

718
00:41:09,920 --> 00:41:10,320
Michael: Cool.

719
00:41:10,320 --> 00:41:13,540
I'm looking forward to some in-depth
ones on this.

720
00:41:14,340 --> 00:41:17,280
Nikolay: I hope we will have some
follow-up episodes, maybe about

721
00:41:17,280 --> 00:41:21,260
full-text search as well, and semantic
search as well, and some

722
00:41:21,580 --> 00:41:23,040
faceted search as well.

723
00:41:23,800 --> 00:41:24,300
Yeah.

724
00:41:25,160 --> 00:41:26,240
Michael: Sounds good.

725
00:41:26,520 --> 00:41:29,120
All right, thanks so much Nikolay,
take care.