1
00:00:00,060 --> 00:00:00,660
Nikolay: Hello, hello.

2
00:00:00,660 --> 00:00:05,200
This is PostgresFM, episode number
98, right?

3
00:00:05,200 --> 00:00:05,700
98.

4
00:00:06,400 --> 00:00:06,900
Michael: Yes.

5
00:00:07,420 --> 00:00:09,180
Nikolay: And this is Nikolay Samokhvalov,

6
00:00:09,180 --> 00:00:11,680
founder of Postgres.AI.

7
00:00:11,920 --> 00:00:16,040
And my co-host is Michael Christofides,
founder of pgMustard.

8
00:00:16,880 --> 00:00:17,640
Hi, Michael.

9
00:00:18,240 --> 00:00:18,480
Michael: Wow.

10
00:00:18,480 --> 00:00:20,980
That's the first time you've attempted
my last name and you did

11
00:00:20,980 --> 00:00:21,480
great.

12
00:00:21,900 --> 00:00:22,400
Nikolay: Okay.

13
00:00:23,000 --> 00:00:24,000
Thank you so much.

14
00:00:24,060 --> 00:00:25,740
I practiced all these years.

15
00:00:27,260 --> 00:00:28,880
So what's the topic today?

16
00:00:29,820 --> 00:00:31,320
Michael: Yeah, So this was your
suggestion.

17
00:00:31,320 --> 00:00:32,360
I think it's a great one.

18
00:00:32,360 --> 00:00:34,840
We've been meaning to come back
to it for a while now.

19
00:00:34,840 --> 00:00:36,320
It's full-text search.

20
00:00:36,380 --> 00:00:38,260
Why did you want to talk about
it?

21
00:00:38,800 --> 00:00:42,540
Nikolay: I think it was on our
list for a very long time, maybe

22
00:00:42,620 --> 00:00:45,540
from the very beginning, because
it's a huge topic.

23
00:00:45,720 --> 00:00:50,460
And when we built that list, I
think pgvector already existed,

24
00:00:50,600 --> 00:00:54,580
but only a few people paid attention
to it, unlike today.

25
00:00:55,240 --> 00:01:01,280
So since we didn't manage to discuss
full-text search before

26
00:01:02,380 --> 00:01:04,960
this AI boom started last year,

27
00:01:05,220 --> 00:01:08,580
I think we should maybe compare
it a little bit.

28
00:01:09,640 --> 00:01:12,960
Since I used full-text search a
lot in the past, I still remember

29
00:01:12,960 --> 00:01:16,260
many features it offers in Postgres.

30
00:01:17,020 --> 00:01:20,920
I think it's interesting to not
only discuss, like usual discussion,

31
00:01:21,500 --> 00:01:24,180
embedded full-text search in Postgres
versus Elastic.

32
00:01:24,960 --> 00:01:28,860
But also it's interesting to discuss
embedded full-text search

33
00:01:29,380 --> 00:01:33,180
of Postgres versus the pgvector
extension and similarity

34
00:01:33,340 --> 00:01:36,520
search, or maybe not versus, maybe
together.

35
00:01:37,960 --> 00:01:38,980
And this is interesting.

36
00:01:39,280 --> 00:01:44,720
But I think time is changing and
evolution is very interesting.

37
00:01:45,060 --> 00:01:49,240
And today people probably pay less
attention to full-text search,

38
00:01:49,300 --> 00:01:53,500
but they at least should know what
capabilities it can offer.

39
00:01:55,760 --> 00:02:02,300
And let's maybe start from use
cases and discuss the functions and

40
00:02:03,160 --> 00:02:06,100
features that FTS, full-text search, provides
in Postgres.

41
00:02:07,540 --> 00:02:08,440
What do you think?

42
00:02:08,940 --> 00:02:12,540
Or maybe we can talk about some
historical aspects.

43
00:02:12,720 --> 00:02:18,660
I also can provide some overview
of what was happening in Postgres

44
00:02:18,720 --> 00:02:20,040
and not only in Postgres.

45
00:02:20,800 --> 00:02:21,720
What do you think?

46
00:02:21,740 --> 00:02:24,940
Features, use cases, history, what's
better?

47
00:02:25,600 --> 00:02:27,340
Michael: Actually, I don't know
the history that well.

48
00:02:27,340 --> 00:02:29,380
That would be awesome to hear a
tiny bit about that.

49
00:02:29,380 --> 00:02:32,980
And then I'd love to jump onto
use cases where you see it used

50
00:02:32,980 --> 00:02:34,000
most commonly, that kind

51
00:02:34,000 --> 00:02:34,340
Nikolay: of thing.

52
00:02:34,340 --> 00:02:35,740
Yeah, actually we can combine.

53
00:02:35,740 --> 00:02:37,860
I now see how we can do it.

54
00:02:37,860 --> 00:02:47,180
So if, for example, we roll back
to the first web era, Web 1.0, so

55
00:02:47,180 --> 00:02:51,360
to speak, like 25 years ago,
when Google was created.

56
00:02:52,120 --> 00:02:55,460
Because Google is kind of, it's
a search engine, originally,

57
00:02:55,600 --> 00:02:56,100
right?

58
00:02:56,400 --> 00:03:00,060
And it solves the problem of finding
proper texts, most relevant

59
00:03:00,060 --> 00:03:00,980
texts, right?

60
00:03:01,860 --> 00:03:07,000
And at that time, before Google,
already other search engines

61
00:03:07,000 --> 00:03:09,520
were created, obviously.

62
00:03:09,660 --> 00:03:14,340
Full-text search capabilities were
originally called

63
00:03:14,540 --> 00:03:15,040
tsearch2.

64
00:03:15,040 --> 00:03:19,220
Actually, I remember this contrib
module I first saw around 2005,

65
00:03:19,500 --> 00:03:23,240
maybe, when I first started working
with Postgres, created by

66
00:03:23,240 --> 00:03:27,620
Oleg Bartunov, Teodor Sigaev, and
then Alexander Korotkov joined

67
00:03:27,620 --> 00:03:28,120
them.

68
00:03:28,380 --> 00:03:32,460
They created it because they were
building a search engine originally.

69
00:03:33,440 --> 00:03:36,360
They helped build quite a big
search engine.

70
00:03:37,480 --> 00:03:43,080
Then they incorporated many of
their works into Postgres, and

71
00:03:43,080 --> 00:03:47,740
this became the tsearch2 contrib module,
which later was merged

72
00:03:48,240 --> 00:03:53,000
into core, and you don't need
to install an extension, so it's

73
00:03:53,000 --> 00:03:54,300
already in core, right?

74
00:03:54,560 --> 00:03:58,440
But interesting thing that it was
before social networks and

75
00:03:58,440 --> 00:04:05,500
so on, but If we look at how it's
built, it has a lot of things.

76
00:04:05,500 --> 00:04:09,880
For example, you can search, you
can, for example, exclude some

77
00:04:09,880 --> 00:04:10,280
words.

78
00:04:10,280 --> 00:04:12,540
By default, it excludes some words.

79
00:04:12,540 --> 00:04:17,140
It's called dictionary, which is
like in Postgres full text search

80
00:04:17,140 --> 00:04:18,980
terminology, dictionary is a program.

81
00:04:19,620 --> 00:04:23,800
It can process input somehow transforming
it.

82
00:04:24,520 --> 00:04:30,200
So stop words, it's just exclusion
of various articles for example

83
00:04:30,200 --> 00:04:33,280
and so on, which probably don't
have any meaning, so we can

84
00:04:33,280 --> 00:04:34,860
just exclude them, that's it.

85
00:04:35,020 --> 00:04:40,760
Then we have Snowball, which is
a stemmer, so it removes the

86
00:04:40,760 --> 00:04:41,260
endings.

87
00:04:42,340 --> 00:04:44,440
It's super fast and super simple.

88
00:04:45,040 --> 00:04:48,280
So we can, for example, runs and
run, it's like kind of the same

89
00:04:48,280 --> 00:04:50,700
word, so s should be removed, right?
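
To make that pre-processing concrete, here is a minimal Python sketch — not Postgres internals; the stop list and the toy suffix-stripper are crude stand-ins for a real stop-word dictionary and the Snowball stemmer:

```python
# Illustrative sketch only: a stop-word "dictionary" filters tokens,
# and a toy stemmer strips a few English endings the way Snowball
# would (real Snowball has far more rules and languages).

STOP_WORDS = {"a", "an", "the", "and", "or", "of"}

def toy_stem(word: str) -> str:
    """Strip a few common English suffixes; crude stand-in for Snowball."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = [t.lower() for t in text.split()]
    # Drop stop words, then normalize what remains
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("the cat runs and the dog sits"))
# → ['cat', 'run', 'dog', 'sit']
```

In Postgres itself, the equivalent pipeline is configured per text search configuration, e.g. `SELECT to_tsvector('english', 'the cat runs')`.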

90
00:04:51,280 --> 00:04:56,760
And then more full-fledged thing,
we can ask to join, to use

91
00:04:56,760 --> 00:05:03,360
some real dictionary, usually ISPELL,
which is also used for thesaurus,

92
00:05:03,780 --> 00:05:08,300
like it can give a lot of things,
but it's usually slower, requires

93
00:05:08,300 --> 00:05:08,980
some memory.

94
00:05:09,840 --> 00:05:13,040
And all these things are configurable
in full-text search, and

95
00:05:13,040 --> 00:05:18,820
they were configurable in tsearch2
originally, and it's great.

96
00:05:19,540 --> 00:05:23,960
But if you think about the task,
the use case, what we want to

97
00:05:23,960 --> 00:05:29,800
try to solve, we want to present
the user some information which

98
00:05:29,800 --> 00:05:31,660
is most relevant, right?

99
00:05:32,380 --> 00:05:33,480
What is most relevant?

100
00:05:33,480 --> 00:05:34,820
The user has some query.

101
00:05:34,820 --> 00:05:41,880
I want a recipe with, I don't know,
with some eggplant, for example.

102
00:05:41,880 --> 00:05:42,880
He puts eggplant.

103
00:05:42,880 --> 00:05:47,040
If we have, for example, food recipe,
website, or system, knowledge

104
00:05:47,040 --> 00:05:47,540
base.

105
00:05:48,640 --> 00:05:57,780
And then, with just stop words and
normalized words, we can filter,

106
00:05:57,780 --> 00:05:58,280
right?

107
00:05:58,380 --> 00:06:00,040
But filtering is just 1 thing.

108
00:06:00,040 --> 00:06:02,840
We also need to somehow order,
maybe, right?

109
00:06:02,840 --> 00:06:05,300
And present only the most relevant.

110
00:06:05,660 --> 00:06:06,840
What is the most relevant?

111
00:06:07,300 --> 00:06:11,820
Maybe these words are most frequently
used in these documents,

112
00:06:11,960 --> 00:06:12,460
right?

113
00:06:12,740 --> 00:06:14,820
Like repeated many times, for example.

114
00:06:15,640 --> 00:06:17,620
And this is quite a basic idea.

115
00:06:18,580 --> 00:06:22,320
If the word is used many times,
it's considered most relevant.

116
00:06:23,680 --> 00:06:26,100
Now, confession, I never used this.

117
00:06:26,640 --> 00:06:27,040
Why?

118
00:06:27,040 --> 00:06:28,520
Because I'm from a different era.

119
00:06:28,520 --> 00:06:33,840
I'm from web 2.0, where most relevant
meant something else because

120
00:06:33,840 --> 00:06:35,220
it was social networks.

121
00:06:35,920 --> 00:06:38,860
It was like also time matters a
lot.

122
00:06:38,860 --> 00:06:42,740
Like if it's fresh document, like
fresh blog post or something,

123
00:06:42,740 --> 00:06:43,920
it matters a lot.

124
00:06:44,540 --> 00:06:48,520
But Postgres full text search lacks
this, like it's hard to build it all

125
00:06:48,520 --> 00:06:49,020
together.

126
00:06:49,400 --> 00:06:52,840
There is a special index, RUM, I mentioned
it many times already.

127
00:06:53,360 --> 00:06:57,220
Maybe I need to revisit it finally
and think, maybe it still

128
00:06:57,540 --> 00:07:00,880
can be used, because if you're
on self-managed Postgres, this

129
00:07:00,940 --> 00:07:03,460
type of index can be installed
as an additional extension.

130
00:07:04,280 --> 00:07:08,920
But the original full-text search can
order by only this ts_rank,

131
00:07:09,000 --> 00:07:09,240
right?

132
00:07:09,240 --> 00:07:10,440
The function is called ts_rank.

133
00:07:10,440 --> 00:07:14,040
And there is another function, ts_rank_cd, which
also considers positions

134
00:07:14,320 --> 00:07:15,900
and distance between words.

135
00:07:16,500 --> 00:07:18,280
I think like density or something.

136
00:07:18,280 --> 00:07:22,780
Like not density, but the closer
words are, the better the ranking

137
00:07:24,440 --> 00:07:25,500
score is, right?

138
00:07:25,680 --> 00:07:33,280
But for me, it was always nonsense
to use only these word positions

139
00:07:33,380 --> 00:07:34,200
and so on.

140
00:07:34,460 --> 00:07:37,860
I couldn't imagine how I can use
it because I always needed to

141
00:07:37,860 --> 00:07:39,500
take into account time.

142
00:07:39,540 --> 00:07:44,460
And also sometimes things like
likes, reactions, comments, if

143
00:07:44,460 --> 00:07:48,660
this post is very heavily discussed,
it's very important, right?

144
00:07:49,140 --> 00:07:53,800
But back to history, we know there
is BM25 and also there is

145
00:07:53,800 --> 00:07:54,840
like this TF-IDF.

146
00:07:55,840 --> 00:07:59,740
The idea of TF-IDF: term frequency,
inverse document frequency,

147
00:07:59,740 --> 00:08:00,240
right?

148
00:08:00,240 --> 00:08:05,100
If some document has this term
like word mentioned many times,

149
00:08:05,660 --> 00:08:07,460
it's good for this document, right?

150
00:08:07,540 --> 00:08:12,840
And if, in general, this term is
mentioned in our whole database,

151
00:08:13,820 --> 00:08:17,780
it's a whole dataset, it's mentioned
not very frequently, it's

152
00:08:17,780 --> 00:08:22,320
also good for our document, because
it means that it's exceptional

153
00:08:22,840 --> 00:08:24,640
in terms of frequency, right?

154
00:08:24,640 --> 00:08:26,760
So that's why it works, document
frequency.
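
A minimal sketch of the TF-IDF idea just described — a term scores high for a document when it is frequent in that document but rare across the whole dataset. Exact formulas vary between systems; this uses a common textbook variant, and the toy corpus is made up for illustration:

```python
# TF-IDF sketch: tf rewards repetition inside the document,
# idf rewards rarity across the corpus.
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(term) / len(doc)                      # term frequency
    n_containing = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / n_containing)           # inverse document frequency
    return tf * idf

docs = [
    ["eggplant", "curry", "recipe"],
    ["chicken", "curry", "recipe"],
    ["eggplant", "parmesan", "recipe"],
]
# "recipe" appears in every document, so its idf (and score) is zero;
# "eggplant" is rarer, so it scores higher in the first document.
print(tf_idf("eggplant", docs[0], docs) > tf_idf("recipe", docs[0], docs))
# → True
```

This corpus-wide statistic is exactly what Postgres full-text search does not maintain, as noted above.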

155
00:08:26,980 --> 00:08:30,060
So it means that for such thing
we need to analyze the whole

156
00:08:30,060 --> 00:08:30,460
dataset.

157
00:08:30,460 --> 00:08:32,460
Postgres full text search doesn't
do it.

158
00:08:32,900 --> 00:08:36,260
It's quite hard to maintain such
type of thing, but I think

159
00:08:36,260 --> 00:08:40,140
there are some attempts to have
it in Postgres, maybe some extensions

160
00:08:41,040 --> 00:08:41,540
exist.

161
00:08:42,340 --> 00:08:45,120
This is what Elastic uses, right?

162
00:08:45,300 --> 00:08:46,900
But there is another approach.

163
00:08:47,360 --> 00:08:51,960
Back 25 years ago, why did Google
win?

164
00:08:51,960 --> 00:08:53,420
Because it had PageRank.

165
00:08:54,060 --> 00:08:59,820
If we think deeply about this,
it's already from Web 2.0 idea.

166
00:08:59,960 --> 00:09:07,220
Because PageRank takes into account
social network of documents,

167
00:09:07,260 --> 00:09:08,080
I would say.

168
00:09:08,080 --> 00:09:11,020
It takes into account links, right?

169
00:09:11,240 --> 00:09:11,940
So interactions.

170
00:09:12,740 --> 00:09:13,920
We are not alone.

171
00:09:14,120 --> 00:09:15,740
And this idea is super powerful.

172
00:09:15,760 --> 00:09:19,200
I'm curious, are there any extensions
which implement this in

173
00:09:19,200 --> 00:09:21,300
Postgres and how it could be implemented?

174
00:09:21,300 --> 00:09:22,160
This is super interesting.

175
00:09:22,160 --> 00:09:24,640
But this is a big power.

176
00:09:25,680 --> 00:09:30,980
If we have links, we value this
document more.

177
00:09:31,320 --> 00:09:36,400
I think there are good algorithms
how to solve it with billions

178
00:09:36,400 --> 00:09:37,700
of documents of course.

179
00:09:37,900 --> 00:09:42,420
There are papers written, anyone
who studied search engine theory

180
00:09:42,980 --> 00:09:43,680
of course.

181
00:09:44,540 --> 00:09:49,140
There is a course from Stanford
mining massive data sets, very

182
00:09:49,140 --> 00:09:52,760
good material where I studied this
PageRank and other algorithms

183
00:09:53,560 --> 00:09:56,080
closer to machine learning on big
data sets.

184
00:09:56,200 --> 00:09:58,880
So idea is like links, right?

185
00:09:59,240 --> 00:10:03,420
You multiply by some matrices and
you have some good index and

186
00:10:03,420 --> 00:10:04,560
you can use it.
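
The "multiply by some matrices" idea can be sketched as PageRank by power iteration — a hedged toy version, not Google's production algorithm: a page's rank is fed by the ranks of the pages linking to it, with the classic 0.85 damping factor. The three-page web here is invented for illustration, and every page is assumed to appear as a key:

```python
# Toy PageRank via power iteration. links maps each page to the
# pages it links to; ranks are repeatedly redistributed along links.

def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Teleportation term keeps every page reachable
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

# Both A and C link to B, so B should rank highest.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["B"]})
print(max(ranks, key=ranks.get))  # → B
```

Total rank stays (approximately) 1.0 across iterations, which is a handy sanity check on the redistribution step.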

187
00:10:04,640 --> 00:10:06,860
Again, Postgres doesn't provide
this.

188
00:10:06,860 --> 00:10:09,360
Elastic doesn't provide PageRank,
I guess.

189
00:10:09,780 --> 00:10:15,460
But it's good because we start
taking into account not only word

190
00:10:15,460 --> 00:10:15,960
positions.

191
00:10:17,140 --> 00:10:20,720
Full-text search in Postgres, only
word position for single document

192
00:10:21,900 --> 00:10:22,840
and that's it.

193
00:10:23,360 --> 00:10:28,760
But since full-text search in Postgres
is inside relational database

194
00:10:28,780 --> 00:10:32,440
system, we can have other columns,
right?

195
00:10:32,440 --> 00:10:35,240
And we can have indexes on them.

196
00:10:35,740 --> 00:10:39,680
So for example, likes or something
like timestamp when document

197
00:10:39,680 --> 00:10:40,380
was created.

198
00:10:40,640 --> 00:10:44,240
We can construct SQL, which takes
into account all these things.

199
00:10:44,480 --> 00:10:45,780
It will be just slow, right?

200
00:10:45,780 --> 00:10:51,000
Maybe, because the idea is that the
fastest query is always like

201
00:10:51,040 --> 00:10:53,980
index scan or index only scan is
the best, right?

202
00:10:54,140 --> 00:10:58,080
But when we have full text search,
we have other columns and

203
00:10:58,080 --> 00:11:00,240
we have B-trees probably on them.

204
00:11:00,660 --> 00:11:03,040
Combination is a problem, right?

205
00:11:03,980 --> 00:11:05,600
This is the problem.

206
00:11:05,600 --> 00:11:11,000
But there are things how you can
solve it, but let's maybe discuss

207
00:11:11,000 --> 00:11:14,820
a little bit later, I mean, btree_gist
and other extensions.

208
00:11:15,440 --> 00:11:20,200
But originally, the combination
of different factors in the ranking

209
00:11:20,860 --> 00:11:23,260
task, this is the problem, right?
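
The multi-factor ranking being described can be sketched in application code once the database has filtered the matching rows: blend a text-rank score with freshness and popularity into one ordering key. This is a hedged illustration — the weights, the decay curve, and the likes saturation constant are all arbitrary choices, not anything Postgres prescribes:

```python
# Combine several ranking signals into one score for ORDER BY-style
# sorting in the application. All weights here are made up.
import datetime as dt

def combined_score(text_rank: float, created_at: dt.datetime,
                   likes: int, now: dt.datetime) -> float:
    age_days = (now - created_at).total_seconds() / 86400
    freshness = 1.0 / (1.0 + age_days)        # decays as the post ages
    popularity = likes / (likes + 10.0)       # saturating likes signal
    return 0.5 * text_rank + 0.3 * freshness + 0.2 * popularity

now = dt.datetime(2024, 4, 1)
fresh_popular = combined_score(0.4, dt.datetime(2024, 3, 31), 200, now)
old_exact = combined_score(0.9, dt.datetime(2014, 3, 31), 0, now)
# A fresh, heavily-liked post can beat a 10-year-old exact match.
print(fresh_popular > old_exact)  # → True
```

Doing this inside SQL is possible too, but as discussed, no single index serves such a mixed ordering, which is the core problem here.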

210
00:11:23,760 --> 00:11:27,460
Later, Google and others, they
started taking into account social

211
00:11:27,720 --> 00:11:28,700
network activities.

212
00:11:29,440 --> 00:11:34,200
If some document is mentioned on
Twitter, for example, or Facebook.

213
00:11:34,960 --> 00:11:37,940
It happened like 15, 10 years ago.

214
00:11:38,500 --> 00:11:42,480
This document, everyone who dealt
with search engine optimization

215
00:11:42,620 --> 00:11:45,760
knows that if you have a lot of
likes in social networks, your

216
00:11:45,760 --> 00:11:48,060
document goes up in positions.

217
00:11:48,720 --> 00:11:52,700
But if we deal with our system,
we can take into account interactions

218
00:11:52,780 --> 00:11:56,260
like that because it's just some
data we store and that's it.

219
00:11:57,040 --> 00:12:01,680
Today it's not that bad usually
if you don't have tens of billions

220
00:12:01,680 --> 00:12:05,580
of records or just millions or
tens or hundreds of millions.

221
00:12:05,680 --> 00:12:08,040
It's not a huge volume of data.

222
00:12:08,080 --> 00:12:11,600
So you can rely on multiple indexes
and let Postgres decide what

223
00:12:11,600 --> 00:12:12,840
to choose based on statistics.

224
00:12:13,140 --> 00:12:16,120
This is very similar to what we
discussed, remember, with...

225
00:12:16,220 --> 00:12:16,720
Yeah?

226
00:12:16,880 --> 00:12:17,860
Yeah, very similar.

227
00:12:17,860 --> 00:12:22,480
Because it triggered the problems
with full-text search

228
00:12:22,500 --> 00:12:26,300
I had, trying to incorporate it
into social media projects.

229
00:12:26,840 --> 00:12:30,060
Because you have 2 indexes, which
to choose?

230
00:12:30,060 --> 00:12:31,680
Because you cannot choose them
together.

231
00:12:31,680 --> 00:12:36,180
Well, you can, you can have a bitmap
index scan, but it's already

232
00:12:36,180 --> 00:12:36,680
slower.

233
00:12:37,200 --> 00:12:41,120
But most likely Postgres will choose
just 1 index, like

234
00:12:41,120 --> 00:12:45,200
for example, on time, or full
text search, but based on statistics.

235
00:12:45,540 --> 00:12:49,600
And then on the fly, it will apply
either sorting in memory or

236
00:12:49,600 --> 00:12:51,020
filtering in memory, right?

237
00:12:51,020 --> 00:12:51,980
Michael: Yeah, exactly.

238
00:12:52,420 --> 00:12:54,360
That's where the inefficiency comes
in.

239
00:12:54,860 --> 00:12:55,620
Nikolay: Yeah, yeah.

240
00:12:55,840 --> 00:12:57,740
So this is like historical overview.

241
00:12:57,740 --> 00:13:03,540
We had like simple stop words, stemming,
dictionary.

242
00:13:03,540 --> 00:13:04,640
It's not simple, by the way.

243
00:13:04,640 --> 00:13:05,780
It's a lot of functionality.

244
00:13:06,660 --> 00:13:13,680
But ranking based on just our opinion
about how this document

245
00:13:13,700 --> 00:13:18,460
is relevant to your query, not
considering relations or other

246
00:13:18,460 --> 00:13:18,960
things.

247
00:13:19,460 --> 00:13:22,360
Second thing is, I think, page
rank and Postgres doesn't have

248
00:13:22,360 --> 00:13:22,860
it.

249
00:13:22,900 --> 00:13:26,740
And the third thing is taking into
account various factors like

250
00:13:26,740 --> 00:13:28,280
time and likes and so on.

251
00:13:28,280 --> 00:13:31,620
But also there is similarity search
now and we know search engines

252
00:13:31,840 --> 00:13:33,840
use it for long, right?

253
00:13:35,020 --> 00:13:35,520
So...

254
00:13:35,600 --> 00:13:37,600
Michael: Well, actually, I want
to go back to...

255
00:13:37,600 --> 00:13:40,580
I think your example is remarkably
useful.

256
00:13:40,640 --> 00:13:45,840
So the example you gave of searching
a site of recipes for eggplant,

257
00:13:45,840 --> 00:13:50,740
or let's add that maybe we're not
super sophisticated, we type

258
00:13:50,740 --> 00:13:51,420
an eggplant.

259
00:13:51,900 --> 00:13:56,020
So we can, for example, say that
'an' would count as a stop word.

260
00:13:56,630 --> 00:14:01,700
We don't want recipes that say
'an' a lot to rank highly.

261
00:14:01,920 --> 00:14:03,940
So that's getting rid of the stop
words.

262
00:14:04,380 --> 00:14:07,080
We might want, for example, to
do something slightly more.

263
00:14:07,080 --> 00:14:11,180
We might want to rank recipes that
list eggplant in the title.

264
00:14:11,460 --> 00:14:15,300
We might want those to score higher
than ones where it's listed

265
00:14:15,300 --> 00:14:17,020
in the document more times.

266
00:14:17,020 --> 00:14:20,100
So that's like an interesting additional
challenge.

267
00:14:20,280 --> 00:14:23,800
Secondly, it might be, let's say,
a user-generated recipe site,

268
00:14:23,800 --> 00:14:27,100
and we might want to factor in
how many likes a recipe has got

269
00:14:27,100 --> 00:14:29,460
or how recent it is or something
like that.

270
00:14:29,460 --> 00:14:32,480
So there's like all these, even
in the simplest example you can

271
00:14:32,480 --> 00:14:35,540
think of these days, it can get...

272
00:14:35,580 --> 00:14:37,000
Oh, I've got 1 more.

273
00:14:37,540 --> 00:14:40,860
You might also, if it's user-generated
content, you could have

274
00:14:40,860 --> 00:14:44,480
British people or like people using
aubergine instead of eggplant.

275
00:14:44,480 --> 00:14:46,920
And you also want those to rank
highly.

276
00:14:47,280 --> 00:14:48,220
So yes, exactly.

277
00:14:48,480 --> 00:14:51,760
Nikolay: So synonyms already, it's
already a bridge to discussing

278
00:14:52,540 --> 00:14:53,900
full text search versus similarity.

279
00:14:53,980 --> 00:14:56,540
Let's keep it as the last item,
right?

280
00:14:56,640 --> 00:14:59,600
But let's, great attempt to zoom.

281
00:14:59,600 --> 00:15:00,280
This is good.

282
00:15:00,280 --> 00:15:03,060
Let's do, or diving into deeper.

283
00:15:04,080 --> 00:15:08,980
So let's slightly mention how full-text
search works in Postgres.

284
00:15:09,440 --> 00:15:13,780
So after this pre-processing, we
just discussed like removal

285
00:15:13,780 --> 00:15:15,900
of stop words and you can control
it.

286
00:15:15,900 --> 00:15:19,140
Like good thing about full-text
search in Postgres, a lot of

287
00:15:19,140 --> 00:15:20,180
customization options.

288
00:15:20,180 --> 00:15:20,660
Like, a lot.

289
00:15:20,660 --> 00:15:24,000
You, for example, can control the stop
list and stop considering

290
00:15:25,020 --> 00:15:26,780
'an' as a stop word.

291
00:15:28,140 --> 00:15:30,100
Or you can, for example, remove
accents.

292
00:15:30,100 --> 00:15:32,880
There's a special extension, unaccent,
right?

293
00:15:33,580 --> 00:15:34,620
Additional pre-processing.

294
00:15:35,500 --> 00:15:38,040
Michael: So like in French, for
example, with lots of...

295
00:15:38,040 --> 00:15:41,040
Nikolay: Or there is support of
multiple languages also, it's

296
00:15:41,040 --> 00:15:41,880
also good.

297
00:15:42,180 --> 00:15:45,920
So you can start considering additional
stop words, for example,

298
00:15:45,920 --> 00:15:49,920
remove words if they're mentioned too often,
like, almost all documents

299
00:15:49,920 --> 00:15:52,900
have them, let's remove it because
it doesn't make sense to use

300
00:15:52,900 --> 00:15:53,540
it, right?

301
00:15:53,680 --> 00:15:57,440
Then like stemming or dictionary
processing, ISPELL dictionary

302
00:15:57,440 --> 00:15:57,940
processing.

303
00:15:58,100 --> 00:16:02,470
So in the end we have a set of
normalized words, right?

304
00:16:02,880 --> 00:16:05,980
And then we build an array.

305
00:16:06,360 --> 00:16:10,540
It's called tsvector, again also a
vector by the way.

306
00:16:11,040 --> 00:16:16,100
So it's a set of words with positions,
already normalized, in

307
00:16:16,100 --> 00:16:17,280
a normalized form.

308
00:16:17,660 --> 00:16:21,960
But basically it's just an array
of texts, right?
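
Conceptually, the structure just described can be sketched as a mapping from each normalized lexeme to the positions where it occurred. This is a hedged illustration, not real `to_tsvector` output — the real function applies the full dictionary pipeline, while this sketch only lowercases and drops one hard-coded stop word:

```python
# tsvector-like structure: lexeme -> list of positions.
# Positions count all tokens, including skipped stop words,
# which mirrors how Postgres numbers positions.

def to_tsvector_like(text: str,
                     stop_words: set[str] = {"the"}) -> dict[str, list[int]]:
    vector: dict[str, list[int]] = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        if token in stop_words:
            continue
        vector.setdefault(token, []).append(pos)
    return vector

print(to_tsvector_like("the cat and the cat sat"))
# → {'cat': [2, 5], 'and': [3], 'sat': [6]}
```

Compare with `SELECT to_tsvector('english', '...')` in psql to see the real output format, which prints lexemes with their position lists.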

309
00:16:23,080 --> 00:16:26,260
Michael: Yeah, like words or lexemes
or whatever.

310
00:16:26,760 --> 00:16:27,260
Yeah.

311
00:16:27,380 --> 00:16:31,460
Nikolay: For example, I forgot
also to mention hyphenated words.

312
00:16:32,440 --> 00:16:34,820
I think you also can control it.

313
00:16:34,940 --> 00:16:38,200
As I remember, it was many years
ago, I think you can choose

314
00:16:38,200 --> 00:16:42,040
either to consider them as a single
word or to split them into

315
00:16:42,040 --> 00:16:47,000
multiple words, like 2 words basically,
first and second, or

316
00:16:47,120 --> 00:16:47,620
together.

317
00:16:48,060 --> 00:16:53,600
Like you can put a pair of separate
words and the whole word with the hyphen,

318
00:16:53,600 --> 00:16:53,800
right?

319
00:16:53,800 --> 00:16:55,520
A total of 3 already, right?

320
00:16:55,680 --> 00:16:59,120
Michael: So that if people search
1 or the other, they still

321
00:16:59,320 --> 00:17:01,100
get that result back.

322
00:17:01,260 --> 00:17:01,760
Nikolay: Right.

323
00:17:02,220 --> 00:17:03,920
For example, shift left testing.

324
00:17:04,000 --> 00:17:09,220
Shift-left, if it's hyphenated, you
can put both shift and left and

325
00:17:09,220 --> 00:17:13,260
shift-left as 3 separate entries
in your

326
00:17:13,260 --> 00:17:14,120
Michael: tsvector.
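
The hyphenated-word handling described here can be sketched in a few lines — a hedged stand-in for the parser behavior, emitting the two parts plus the whole compound so a search for either form matches:

```python
# Emit parts plus the whole compound for hyphenated tokens,
# mirroring the "3 separate entries" behavior described above.

def split_hyphenated(token: str) -> list[str]:
    if "-" not in token:
        return [token]
    parts = token.split("-")
    return parts + [token]  # e.g. shift, left, and shift-left: 3 entries

print(split_hyphenated("shift-left"))
# → ['shift', 'left', 'shift-left']
```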

327
00:17:14,380 --> 00:17:17,500
Nikolay: You can control how Postgres
will do it, building

328
00:17:17,500 --> 00:17:18,000
tsvectors.

329
00:17:18,080 --> 00:17:21,580
Also, you can concatenate tsvectors
so data can come from

330
00:17:21,580 --> 00:17:22,400
different sources.

331
00:17:22,940 --> 00:17:27,940
For example, we have title, we
have body, subject, or body, like

332
00:17:27,940 --> 00:17:32,140
for example if it's email, or if
it's blog posts, title and content,

333
00:17:32,620 --> 00:17:36,220
text, how you name it, it doesn't
matter.

334
00:17:36,220 --> 00:17:40,620
And you can concatenate 2 tsvectors,
or you can concatenate

335
00:17:40,680 --> 00:17:45,840
it before that with additional
space and then build the tsvector.

336
00:17:46,500 --> 00:17:50,700
So tsvector is just an array of
texts, and you can explicitly

337
00:17:50,800 --> 00:17:52,740
store it, consuming space.

338
00:17:53,300 --> 00:17:57,380
Or you can rely on a function, because
then we need to create index

339
00:17:57,380 --> 00:17:58,100
on it.
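
The concatenation of title and body vectors can be sketched like this — a hedged approximation of what the `tsvector || tsvector` operator does, using the same dict-of-positions representation as before: positions of the second part are shifted past the end of the first:

```python
# Concatenate two tsvector-like structures (e.g. title then body),
# shifting the second part's positions past the first part.

def concat_vectors(a: dict[str, list[int]],
                   b: dict[str, list[int]]) -> dict[str, list[int]]:
    offset = max((p for ps in a.values() for p in ps), default=0)
    merged = {word: list(ps) for word, ps in a.items()}
    for word, ps in b.items():
        merged.setdefault(word, []).extend(p + offset for p in ps)
    return merged

title = {"eggplant": [1], "curry": [2]}
body = {"roast": [1], "eggplant": [2]}
print(concat_vectors(title, body))
# → {'eggplant': [1, 4], 'curry': [2], 'roast': [3]}
```

In SQL the equivalent is `to_tsvector(title) || to_tsvector(body)`, often wrapped in an expression index or a generated column.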

340
00:17:58,580 --> 00:18:00,040
Originally it was GIST.

341
00:18:00,860 --> 00:18:09,520
GIST, I think it was the work of
Joe Hellerstein in Berkeley very

342
00:18:09,520 --> 00:18:12,180
long ago, in the early 90s maybe
or when.

343
00:18:12,800 --> 00:18:17,220
It was not finished, but the guys
who I mentioned, Bartunov and

344
00:18:17,220 --> 00:18:23,000
Sigaev, they just saw this directory
in PostgreSQL source code, as

345
00:18:23,000 --> 00:18:26,860
they say, and understood this is
exactly what they need.

346
00:18:27,340 --> 00:18:28,860
There's also an additional thing.

347
00:18:29,160 --> 00:18:32,940
GiST is a generalized search tree.

348
00:18:34,440 --> 00:18:38,200
It's similar to B-tree, but it's
abstract.

349
00:18:40,240 --> 00:18:43,580
You can use it for many data types
and data operators.

350
00:18:44,920 --> 00:18:49,040
For B-tree, it will be numbers
and operators less than or more

351
00:18:49,040 --> 00:18:50,580
than and equals.

352
00:18:51,020 --> 00:18:52,620
And it's just 1 axis, right?

353
00:18:52,740 --> 00:18:57,180
For R-tree, it's 2 axes, and then you can
have it for arrays as well.

354
00:18:57,340 --> 00:19:03,660
In this case we talk about operators
intersect, includes, contains,

355
00:19:04,200 --> 00:19:08,440
is contained by, and overlaps.

356
00:19:08,660 --> 00:19:10,820
Overlaps is @@, usually.

357
00:19:11,600 --> 00:19:15,260
You can actually redefine it and
use your own symbols.

358
00:19:15,720 --> 00:19:19,380
It's actually first you define
function, then operator, since

359
00:19:19,380 --> 00:19:20,580
Postgres is very extendable.

360
00:19:21,180 --> 00:19:26,920
But historically it's @@, or <@,

361
00:19:26,920 --> 00:19:27,420
or @>.

362
00:19:28,420 --> 00:19:33,600
So these operators, you can define
how to deal with them in terms

363
00:19:33,600 --> 00:19:35,820
of this B-tree-like index structure.

364
00:19:36,140 --> 00:19:40,880
So it's balanced with many, many
children on each node.

365
00:19:41,540 --> 00:19:44,120
And balancing is automated, everything
is automated.

366
00:19:44,760 --> 00:19:49,420
So when these folks found Hellerstein's
work, it was not

367
00:19:49,420 --> 00:19:51,800
finished, and the WAL was not
supported.

368
00:19:52,200 --> 00:19:55,100
They worked on it and added WAL support.

369
00:19:55,360 --> 00:20:00,140
Then additional thing, we won't
go deep because of lack of time

370
00:20:00,140 --> 00:20:03,080
and I also forgot everything. But
there's also a thing that's

371
00:20:03,080 --> 00:20:04,680
called signature tree.

372
00:20:04,960 --> 00:20:09,180
So to put our tsvectors into this
structure, there's an additional

373
00:20:09,220 --> 00:20:11,460
concept of signature tree, you
need to build signature.

374
00:20:11,760 --> 00:20:12,840
It's also defined.

375
00:20:13,380 --> 00:20:18,360
So these signatures are placed
into the leaves of this tree.

376
00:20:19,020 --> 00:20:23,980
And then search is very B-tree-like,
but with additional exclusion,

377
00:20:24,320 --> 00:20:29,840
which is why today the GiST option
is not popular at all.

378
00:20:30,020 --> 00:20:35,460
A recheck is needed, because this
search is not exact.
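
The signature idea behind this lossy behavior can be sketched as a small bitmask: each word hashes to a bit in a fixed-size signature, and a query whose bits are all contained in a document's signature is only a *candidate* match, so a recheck against the real words is required. This is a hedged toy (16 bits, Python's built-in hash), not the actual GiST signature format:

```python
# Toy signature scheme illustrating why GiST over tsvector is lossy:
# different words can collide on the same bits, producing false
# positives that the recheck step must filter out.

SIG_BITS = 16

def signature(words: set[str]) -> int:
    sig = 0
    for w in words:
        # Note: Python string hashes are salted per process, but are
        # consistent within one run, which is all this sketch needs.
        sig |= 1 << (hash(w) % SIG_BITS)
    return sig

def maybe_contains(doc_sig: int, query_sig: int) -> bool:
    # All query bits present in the doc signature => candidate match (lossy!)
    return doc_sig & query_sig == query_sig

doc_words = {"eggplant", "roast", "recipe"}
query = {"eggplant"}
if maybe_contains(signature(doc_words), signature(query)):
    # Recheck against the actual words, as the plan's Recheck node does
    print(query <= doc_words)  # → True
```

A true subset always passes the signature test; the recheck exists for the collisions that also pass it by accident.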

379
00:20:36,740 --> 00:20:39,020
Michael: I've heard it's called
lossy index types.

380
00:20:39,320 --> 00:20:39,960
Nikolay: Right, right.

381
00:20:39,960 --> 00:20:44,200
So if you check, this is what you
do in pgMustard, right?

382
00:20:44,200 --> 00:20:44,920
Yeah, yeah.

383
00:20:44,920 --> 00:20:46,780
EXPLAIN, ANALYZE, BUFFERS, right?

384
00:20:46,920 --> 00:20:49,340
You check and see in the plan,
recheck happened.

385
00:20:49,640 --> 00:20:51,560
If GiST was used, recheck happened.

386
00:20:52,200 --> 00:20:54,780
So let me close a gap.

387
00:20:54,940 --> 00:21:00,040
The query is also processed to
build a vector, right?

388
00:21:00,040 --> 00:21:01,320
But it's called TS query.

389
00:21:01,320 --> 00:21:04,340
It's called TS query, different
data type, but it's also similar.

390
00:21:04,340 --> 00:21:07,100
Also a bunch of words, also with
pre-processing.

391
00:21:07,380 --> 00:21:10,580
You can have different pre-processing,
you decide how to pre-process,

392
00:21:10,680 --> 00:21:11,180
right?

393
00:21:11,380 --> 00:21:15,400
But usually it's the same as for tsvector;
for tsquery, the same pre-processing

394
00:21:15,440 --> 00:21:20,500
also removes stop words,
normalizes words, and so on.

395
00:21:20,600 --> 00:21:26,380
So, and then our question is, let's
find everything which includes,

396
00:21:26,680 --> 00:21:29,820
like, we need to find documents
which include everything we asked

397
00:21:29,820 --> 00:21:31,600
for in our query, right?
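
A minimal sketch of this matching step, with illustrative strings (the recipe wording is made up, not from the episode):

```sql
-- Both sides go through the same pre-processing (stop words out, stemming):
SELECT to_tsvector('english', 'Baked eggplants with fresh tomatoes');

-- The @@ operator checks whether the tsquery matches the tsvector:
SELECT to_tsvector('english', 'Baked eggplants with fresh tomatoes')
       @@ to_tsquery('english', 'eggplant & tomato');
```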

398
00:21:32,340 --> 00:21:34,140
Michael: And rank them, but yeah.

399
00:21:34,300 --> 00:21:36,040
Nikolay: Yeah, and rank also additionally.

400
00:21:36,220 --> 00:21:42,100
So there is an approach to put
such values into trees, into

401
00:21:42,100 --> 00:21:42,600
leaves.

402
00:21:42,640 --> 00:21:48,480
And then it searches similar to
B-tree, but instead of greater

403
00:21:48,480 --> 00:21:55,160
than or less than, instead of that,
it checks if this vector

404
00:21:55,160 --> 00:21:56,620
is contained in that vector.

405
00:21:56,760 --> 00:22:00,040
I mean, this array is contained
in that array, and that's why

406
00:22:00,040 --> 00:22:02,660
you need to go left or right, something
like this.

407
00:22:02,800 --> 00:22:06,900
So it's called Russian doll tree
because it's like Russian doll,

408
00:22:06,900 --> 00:22:07,400
right?

409
00:22:08,000 --> 00:22:08,500
Somehow.

410
00:22:08,860 --> 00:22:09,360
RD-tree.

411
00:22:09,520 --> 00:22:14,280
So by the way, it's also in Berkeley
papers, this term RD-tree, somehow.

412
00:22:14,540 --> 00:22:19,020
So I think it was invented before
Bartunov and Korotkov.

413
00:22:19,020 --> 00:22:19,520
Probably.

414
00:22:19,540 --> 00:22:21,540
Maybe they influenced, I don't
know.

415
00:22:21,900 --> 00:22:27,320
But then, obviously it's slow in
terms of search for large volumes

416
00:22:27,320 --> 00:22:28,800
of data because of this recheck.

417
00:22:30,140 --> 00:22:33,780
This is not how Google and others
worked in terms of like...

418
00:22:33,940 --> 00:22:34,740
Not even Google.

419
00:22:34,740 --> 00:22:37,160
Google, we know, page rank is a
bigger thing.

420
00:22:37,360 --> 00:22:41,140
But eventually, Bartunov, Sigaev,
and then Korotkov, they created

421
00:22:41,520 --> 00:22:42,020
GIN.

422
00:22:42,340 --> 00:22:46,580
GIN is a generalized inverted index,
where we have a list of terms,

423
00:22:47,080 --> 00:22:50,780
our words, and for each list of...

424
00:22:51,960 --> 00:22:52,460
Michael: Documents.

425
00:22:52,900 --> 00:22:57,800
Nikolay: Right, all tuple IDs,
tuple CTIDs, I think, where this

426
00:22:57,800 --> 00:23:02,460
is stored. But not as lists, there
are B-trees there, actually,

427
00:23:02,880 --> 00:23:04,060
for faster search.

428
00:23:04,640 --> 00:23:10,180
So it means that GiST is good only
at 1 thing, update speed.

429
00:23:10,240 --> 00:23:11,420
Michael: GiST fast is GIN.

430
00:23:11,760 --> 00:23:12,540
Nikolay: Right, right.

431
00:23:12,560 --> 00:23:15,920
But GIN was improved, there
is also a fast update option.

432
00:23:16,080 --> 00:23:19,620
But anyway, the default option for
us is GIN, right?

433
00:23:21,180 --> 00:23:27,540
And comparing tsquery and tsvector,
GIN is good, search is fast,

434
00:23:28,340 --> 00:23:31,660
but ORDER BY ts_rank is
probably not good.
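
As a rough sketch of this trade-off, assuming a hypothetical recipes table with a tsvector column fts:

```sql
CREATE INDEX recipes_fts_idx ON recipes USING gin (fts);

-- The WHERE clause is served by the GIN index,
-- but ts_rank is computed per matching row; the index cannot order by it:
SELECT id, ts_rank(fts, q) AS rank
FROM recipes, to_tsquery('english', 'eggplant & tomato') AS q
WHERE fts @@ q
ORDER BY rank DESC
LIMIT 10;
```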

435
00:23:31,960 --> 00:23:36,780
So back to the comparison, back
to these use cases you started

436
00:23:36,780 --> 00:23:37,280
mentioning.

437
00:23:37,440 --> 00:23:38,740
Let's think about them.

438
00:23:38,940 --> 00:23:44,380
First, you said, or at least second,
let's distinguish subject

439
00:23:44,380 --> 00:23:46,020
and body, right, for example.

440
00:23:47,100 --> 00:23:50,580
Michael: Yeah, or should we rank,
should we, should the, the

441
00:23:50,580 --> 00:23:53,400
presence of an ingredient in

442
00:23:54,520 --> 00:23:54,940
Nikolay: it or

443
00:23:54,940 --> 00:23:55,580
Michael: let's yeah.

444
00:23:55,580 --> 00:23:58,600
Words because we know that people
aren't going to be searching

445
00:23:58,680 --> 00:24:01,800
in our like search bar for random
things.

446
00:24:01,800 --> 00:24:04,540
Like maybe there's like a few different
things they might search

447
00:24:04,540 --> 00:24:04,840
by.

448
00:24:04,840 --> 00:24:06,360
The most common is probably ingredient.

449
00:24:06,840 --> 00:24:11,880
Should recipes that list that in
their title rank higher, or

450
00:24:11,880 --> 00:24:15,660
be weighted higher, even if they
only mention it a couple

451
00:24:15,660 --> 00:24:20,740
of times later on, I feel like
personally I'd be expecting those

452
00:24:20,740 --> 00:24:24,660
to rank higher than ones where
it was a small ingredient, or

453
00:24:24,660 --> 00:24:27,020
I only needed a small amount of
it maybe.

454
00:24:27,040 --> 00:24:29,440
Maybe the amount of the ingredient
matters a lot.

455
00:24:29,640 --> 00:24:30,600
Maybe that's easier.

456
00:24:31,380 --> 00:24:31,900
Nikolay: Right, right.

457
00:24:31,900 --> 00:24:39,200
So it's definitely a good thing
to want, right?

458
00:24:39,380 --> 00:24:43,180
If a word is in the title, it means
that maybe the whole article

459
00:24:43,180 --> 00:24:43,980
is about this, right?

460
00:24:43,980 --> 00:24:47,300
If eggplant is in the title, we pay
more attention to it.

461
00:24:47,440 --> 00:24:48,120
Makes sense.

462
00:24:48,120 --> 00:24:52,440
So for this, there are 2 bits which
you can use.

463
00:24:52,940 --> 00:24:55,940
2 bits means 4 options, right?

464
00:24:56,040 --> 00:24:57,880
So there is a function setweight.

465
00:24:59,180 --> 00:25:02,800
And when you combine information
from multiple sources, from

466
00:25:02,800 --> 00:25:06,540
title column and content column,
for example, body column, you

467
00:25:06,540 --> 00:25:12,040
can say setweight A for the first one,
setweight B for the second one.

468
00:25:12,880 --> 00:25:19,180
And they are only A, B, C, D, uppercase,
because again, only 2 bits

469
00:25:19,180 --> 00:25:20,360
are used for this.
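
A sketch of combining two sources with setweight, assuming hypothetical title and body columns:

```sql
-- The || operator concatenates the two tsvectors,
-- keeping the A/B label on each lexeme:
SELECT setweight(to_tsvector('english', title), 'A') ||
       setweight(to_tsvector('english', body),  'B') AS fts
FROM recipes;
```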

470
00:25:20,860 --> 00:25:25,580
And then later in your query, you
can also say, I pay attention

471
00:25:25,580 --> 00:25:26,820
to this or to that.

472
00:25:26,980 --> 00:25:29,020
Actually, you can use it for filtering
as well.

473
00:25:29,020 --> 00:25:30,320
And this is what I did.

474
00:25:30,560 --> 00:25:32,460
But originally it was created for
ranking.

475
00:25:33,300 --> 00:25:36,900
I don't remember details, but the setweight
function, you need to search

476
00:25:36,900 --> 00:25:40,600
in documentation, it should be
there and you will find out how

477
00:25:40,600 --> 00:25:41,040
to...

478
00:25:41,040 --> 00:25:42,540
It's a very strange concept.

479
00:25:44,480 --> 00:25:45,540
Why only 4?

480
00:25:45,780 --> 00:25:47,080
Because of 2 bits only.

481
00:25:47,080 --> 00:25:49,540
They had only 2 bits to spend for
it.

482
00:25:49,960 --> 00:25:51,700
Maybe there should be more.

483
00:25:51,740 --> 00:25:56,820
But I use it sometimes, like when
we search only in the subject, right?

484
00:25:56,820 --> 00:25:58,660
And I say, okay, only a category.

485
00:25:58,660 --> 00:26:00,040
It's called category maybe.

486
00:26:00,040 --> 00:26:02,260
It's embedded inside TS vector.

487
00:26:02,720 --> 00:26:05,140
So you can skip using it.

488
00:26:05,460 --> 00:26:07,900
It's there, but you can just ignore
it in search.

489
00:26:07,900 --> 00:26:11,340
But in different time, you can
say, I want to search only category

490
00:26:11,460 --> 00:26:13,360
A, it means only title search.

491
00:26:13,740 --> 00:26:16,740
You don't need to build 2 TS vectors
separately.

492
00:26:16,940 --> 00:26:18,420
However, you could, right?
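
A sketch of that filtering trick: a weight label in the tsquery restricts matching to lexemes carrying that label (table and column names hypothetical):

```sql
-- Matches only if 'eggplant' appears among lexemes labeled A,
-- e.g. the ones that came from the title:
SELECT id FROM recipes
WHERE fts @@ to_tsquery('english', 'eggplant:A');
```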

493
00:26:19,140 --> 00:26:19,480
Michael: Got it.

494
00:26:19,480 --> 00:26:23,100
So like, if, for example, going
back to our example, I search

495
00:26:23,100 --> 00:26:27,260
what's clearly like an author name,
and we can maybe on the application

496
00:26:27,440 --> 00:26:31,360
side, we're doing a little like
a quick, Maybe we're doing something

497
00:26:31,360 --> 00:26:34,080
first to try and categorize what
people are searching for.

498
00:26:34,080 --> 00:26:37,320
If it's an author name, I could
then send that through to Postgres

499
00:26:37,540 --> 00:26:39,940
as like, let's only look in this
category.

500
00:26:40,380 --> 00:26:41,040
Nikolay: Right, right.

501
00:26:41,040 --> 00:26:44,160
But again, it originally was created
for ranking.

502
00:26:44,200 --> 00:26:45,660
I just didn't use it.

503
00:26:45,660 --> 00:26:48,940
So I cannot explain how the documentation
of course, explains

504
00:26:48,940 --> 00:26:49,440
how.

505
00:26:49,820 --> 00:26:50,320
Right.

506
00:26:50,820 --> 00:26:54,960
So good thing to mention also,
now we have generated always thing,

507
00:26:54,960 --> 00:26:55,460
right?

508
00:26:56,020 --> 00:26:56,980
Generated columns.

509
00:26:57,740 --> 00:27:04,520
They are stored, but Postgres maintains
them automatically.

510
00:27:04,780 --> 00:27:08,320
So, I think generation of TSVector
is probably a good thing to

511
00:27:08,320 --> 00:27:10,220
use together with that functionality.

512
00:27:11,160 --> 00:27:17,760
So you have author, title, body
of blog post, and then TSVector

513
00:27:17,860 --> 00:27:23,000
can be generated based on some
expression, which puts different

514
00:27:23,000 --> 00:27:29,020
categories to different sources
of data, again up to four, and generates

515
00:27:29,020 --> 00:27:32,140
and puts TSVector there, and then
you have an index on it, right?
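
A sketch of the generated-column approach (table and column names are hypothetical):

```sql
-- Postgres recomputes the column automatically on every write:
ALTER TABLE posts
  ADD COLUMN fts tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(author, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(title,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(body,   '')), 'C')
  ) STORED;

CREATE INDEX posts_fts_idx ON posts USING gin (fts);
```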

516
00:27:32,360 --> 00:27:34,040
Or you can define a trigger.

517
00:27:34,740 --> 00:27:37,620
All my life I defined triggers
for this.

518
00:27:37,900 --> 00:27:40,680
And also there is another option
not to store it at all and just

519
00:27:40,680 --> 00:27:45,420
use index on expression, GIN, and
then your big expression, maybe

520
00:27:45,420 --> 00:27:48,500
with this setweight we just discussed.
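
A sketch of the expression-index alternative, with no stored column at all; note the query has to repeat the expression verbatim for the planner to use the index (names hypothetical):

```sql
CREATE INDEX posts_fts_expr_idx ON posts
  USING gin (to_tsvector('english',
             coalesce(title, '') || ' ' || coalesce(body, '')));

-- The same expression, repeated in the query:
SELECT id FROM posts
WHERE to_tsvector('english',
      coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('english', 'eggplant');
```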

521
00:27:49,080 --> 00:27:52,660
But in this case, it's good in
terms of storage.

522
00:27:53,100 --> 00:27:55,020
Less thing to keep in memory.

523
00:27:55,900 --> 00:27:58,760
But it might be bad in terms of...

524
00:28:00,060 --> 00:28:00,820
I don't know.

525
00:28:00,820 --> 00:28:06,360
Sometimes you need to deal with
TSVector directly.

526
00:28:07,840 --> 00:28:11,980
If you have only expressional index,
then you need to construct

527
00:28:11,980 --> 00:28:15,040
this expression once again to deal
with it.

528
00:28:16,460 --> 00:28:20,440
I don't like this approach somehow,
because it limits in terms

529
00:28:20,440 --> 00:28:23,140
of what you can do with such records.

530
00:28:23,940 --> 00:28:28,420
So, if you don't store it, you
need to build it on the fly to

531
00:28:28,420 --> 00:28:32,420
deal with it, to additionally somehow
analyze it or so.

532
00:28:32,780 --> 00:28:37,080
So I always prefer to store them,
although they can consume a

533
00:28:37,080 --> 00:28:39,280
lot of data and the TOAST and so
on.

534
00:28:39,280 --> 00:28:42,260
Sometimes I just put them to separate
table understanding that

535
00:28:42,260 --> 00:28:43,740
then I will need to join.

536
00:28:44,180 --> 00:28:48,240
For me it's easier to create a
trigger and use a regular column,

537
00:28:48,240 --> 00:28:48,740
right?

538
00:28:49,540 --> 00:28:53,800
But if you think to store it or
not to store it, it's good to

539
00:28:53,800 --> 00:28:56,340
think what you will do with it.

540
00:28:56,800 --> 00:29:03,180
And if you want to deal with TSVector
in different ways, then

541
00:29:03,180 --> 00:29:06,500
keeping it stored only in the index
itself

542
00:29:06,500 --> 00:29:11,260
is maybe not enough, and maybe
it's good to put it in the table.

543
00:29:11,320 --> 00:29:15,840
But to put it to the same table
as a column and rely on TOASTing

544
00:29:15,840 --> 00:29:21,140
and so on, or maybe to allocate
a different table and like one-to-one

545
00:29:21,220 --> 00:29:21,720
relationship?

546
00:29:23,200 --> 00:29:24,220
It's a good question.

547
00:29:24,840 --> 00:29:26,420
Again, depending on workloads.

548
00:29:29,040 --> 00:29:31,160
So okay, this we covered.

549
00:29:31,340 --> 00:29:34,060
Can you remind me of other use
cases you mentioned?

550
00:29:34,940 --> 00:29:37,800
Michael: Well, I think it's all
I consider this kind of one use

551
00:29:37,800 --> 00:29:41,020
case, but I guess, I guess there's
like all the complexities

552
00:29:41,120 --> 00:29:43,220
that you may or may not want to
support.

553
00:29:43,480 --> 00:29:47,900
1 that I didn't mention, but probably
we care about is what if

554
00:29:47,900 --> 00:29:50,420
somebody spells eggplant with 1
g?

555
00:29:51,140 --> 00:29:52,120
Nikolay: Yeah, typos.

556
00:29:52,480 --> 00:29:52,980
Michael: Yes.

557
00:29:53,760 --> 00:29:57,880
Nikolay: For this, there is pg_trgm,
trigrams.

558
00:29:59,540 --> 00:30:02,420
An extension, which actually also
uses RD-tree.

559
00:30:02,840 --> 00:30:03,580
Gin, actually.

560
00:30:03,580 --> 00:30:04,400
Gin these days.

561
00:30:04,400 --> 00:30:05,540
RD-tree, forget about it.

562
00:30:05,540 --> 00:30:08,440
Forget about GiST, very rarely
used.

563
00:30:08,560 --> 00:30:09,820
So Gin, right?

564
00:30:10,020 --> 00:30:11,900
So we have some text, right?

565
00:30:12,040 --> 00:30:12,700
Or some...

566
00:30:12,700 --> 00:30:15,820
Basically, 3-grams should work
for words, right?

567
00:30:15,820 --> 00:30:19,460
We have word, and we suspect maybe
there is a typo there.

568
00:30:19,820 --> 00:30:24,940
So it appends 1 space on 1 side
and 2 spaces on the other side,

569
00:30:24,940 --> 00:30:29,920
and then just splits into trigrams:
3 letters, 3 letters, 3 letters,

570
00:30:29,920 --> 00:30:30,420
right?

571
00:30:30,520 --> 00:30:32,380
And then we have array again, right?

572
00:30:33,580 --> 00:30:34,080
Vector.

573
00:30:34,840 --> 00:30:35,960
Vector, right.

574
00:30:36,060 --> 00:30:39,120
And then the question is which
is the most, the closest.

575
00:30:39,120 --> 00:30:41,420
Closest means like most overlapping.

576
00:30:42,440 --> 00:30:46,900
Most members of the array are the
same.

577
00:30:47,440 --> 00:30:49,380
Might be not all of them.

578
00:30:49,740 --> 00:30:57,320
And if it's just 1 letter typo,
it means 3 members of this vector

579
00:30:58,620 --> 00:30:59,940
will be different.

580
00:31:00,720 --> 00:31:01,560
Michael: Ah, nice.

581
00:31:01,560 --> 00:31:02,060
Yeah.

582
00:31:02,620 --> 00:31:08,040
Well, it wouldn't include EGG,
it wouldn't include GGP, and it...

583
00:31:08,560 --> 00:31:09,220
Is that...

584
00:31:09,220 --> 00:31:10,360
They're the only 2?

585
00:31:10,520 --> 00:31:13,840
In this case, I think, possibly,
just they're the only 2.

586
00:31:14,240 --> 00:31:16,920
Nikolay: So 3-grams, we're just
shifting, shifting, shifting

587
00:31:16,920 --> 00:31:17,620
also, right?

588
00:31:17,620 --> 00:31:19,340
We don't just split, we're shifting.

589
00:31:19,510 --> 00:31:24,940
So start position, first is,
for example, space, then 2 letters,

590
00:31:24,960 --> 00:31:27,720
then second, the first letter, second
letter, third letter, shift,

591
00:31:27,720 --> 00:31:30,760
shift, shift, and we construct
array out of it.

592
00:31:30,920 --> 00:31:39,180
And then if, for example, you just
missed, forgot 1 letter, overlapping

593
00:31:39,240 --> 00:31:40,120
will be huge.

594
00:31:40,200 --> 00:31:41,520
Distance is very close.

595
00:31:41,580 --> 00:31:44,780
Similarities, distance low, similarities
high.

596
00:31:45,100 --> 00:31:49,440
If you, for example, mixed positions
of 2 letters, just swap

597
00:31:49,440 --> 00:31:49,940
them.

598
00:31:50,740 --> 00:31:51,660
Also huge.

599
00:31:52,760 --> 00:31:56,020
This is an interesting idea behind
3-grams.

600
00:31:56,460 --> 00:32:00,920
And again, we use either RD-tree,
which nobody does, or GIN.

601
00:32:01,240 --> 00:32:07,620
GIN, we use GIN so we can find
arrays which are closest, overlapping

602
00:32:07,740 --> 00:32:08,440
is higher.

603
00:32:08,720 --> 00:32:12,900
That's how we find words which
are actually present in our document

604
00:32:13,260 --> 00:32:13,760
dataset.
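
A sketch of the typo-tolerant lookup with pg_trgm (the words table with a text column word is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT show_trgm('eggplant');              -- the trigram set for a word
SELECT similarity('eggplant', 'egplant');  -- stays high for a one-letter typo

CREATE INDEX words_trgm_idx ON words USING gin (word gin_trgm_ops);

-- % is the trigram similarity operator and can use the GIN index:
SELECT word, similarity(word, 'egplant') AS sim
FROM words
WHERE word % 'egplant'
ORDER BY sim DESC
LIMIT 5;
```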

605
00:32:15,140 --> 00:32:17,860
And then they say, you probably
thought about this word.

606
00:32:18,400 --> 00:32:23,420
1 thing, in the past I remember
we maintained the list of words

607
00:32:24,760 --> 00:32:28,480
with usage counts probably, using
some statistics provided by

608
00:32:28,520 --> 00:32:29,560
full text search.

609
00:32:30,040 --> 00:32:33,680
I don't remember details, but you
can build a list of words with

610
00:32:33,680 --> 00:32:38,660
stats, and then you can store it
in a table, and then use trigrams

611
00:32:38,760 --> 00:32:39,840
on top of it.

612
00:32:40,580 --> 00:32:43,980
You need triggers to maintain this,
or you need to refresh it

613
00:32:43,980 --> 00:32:44,480
periodically.

614
00:32:45,060 --> 00:32:48,200
For example, if you just imported
a lot of documents which use

615
00:32:48,200 --> 00:32:51,980
new words, this list of words is
already outdated, you need to

616
00:32:51,980 --> 00:32:52,800
rebuild it.

617
00:32:53,080 --> 00:32:58,980
And then this Trigram approach
worked, you have input words,

618
00:32:59,180 --> 00:33:01,160
you check the table and it works.

619
00:33:01,240 --> 00:33:03,240
Now I think we don't need it, right?

620
00:33:03,240 --> 00:33:05,820
It automatically maintains this
list of words.

621
00:33:06,900 --> 00:33:11,540
Or we just don't care and build
an index and then take whole

622
00:33:11,540 --> 00:33:12,040
input.

623
00:33:12,740 --> 00:33:14,760
Michael: That's how I've seen it
used, yeah.

624
00:33:15,180 --> 00:33:18,720
Nikolay: So there are several ways
to use it, and maintaining

625
00:33:18,780 --> 00:33:22,320
this additional table probably
still makes sense, right?

626
00:33:23,300 --> 00:33:24,240
List of words.

627
00:33:25,760 --> 00:33:27,980
Michael: Do you think it would
then speed up queries, or what

628
00:33:27,980 --> 00:33:29,440
do you think the benefit is?

629
00:33:29,540 --> 00:33:30,040
Speed?

630
00:33:30,180 --> 00:33:32,120
Nikolay: You lose speed here probably,
right?

631
00:33:32,420 --> 00:33:34,140
But accuracy is good.

632
00:33:34,340 --> 00:33:35,440
Michael: Accuracy, yeah.

633
00:33:37,060 --> 00:33:41,120
This is great, but I think the
conversation you mentioned earlier,

634
00:33:41,120 --> 00:33:44,600
like the comparison of full-text
search versus semantic search

635
00:33:44,600 --> 00:33:49,640
in the current climate, and something
I wanted to introduce was

636
00:33:50,680 --> 00:33:55,020
whether the use case suits false
positives, or suits false negatives

637
00:33:55,020 --> 00:33:58,660
better, like I think it's really
difficult to, to come up with

638
00:33:58,660 --> 00:34:00,720
a solution that that does neither.

639
00:34:00,720 --> 00:34:05,140
But often you see kind of slightly
biased towards like is it

640
00:34:05,140 --> 00:34:09,640
better like Google for example
what PageRank was really good

641
00:34:09,640 --> 00:34:12,540
at was making sure it didn't miss
a website that was clearly

642
00:34:12,540 --> 00:34:16,860
very relevant to your search term
and so it was very very good

643
00:34:16,860 --> 00:34:21,080
at avoiding false negatives but
you did often especially in the

644
00:34:21,080 --> 00:34:23,040
earlier days get quite a lot of
false positives.

645
00:34:23,100 --> 00:34:27,100
You'd get articles that didn't
match your intention, at least,

646
00:34:27,100 --> 00:34:28,740
even if it did match the words.

647
00:34:29,020 --> 00:34:32,640
But the recipe example, maybe we
don't care so much about like,

648
00:34:32,640 --> 00:34:37,040
if you miss 1 great recipe, but
you get 17 perfectly matched

649
00:34:37,040 --> 00:34:38,680
ones, that's still a good result.

650
00:34:38,680 --> 00:34:41,760
Maybe you'd rather that kind of
weigh around on the trade off.

651
00:34:41,760 --> 00:34:44,000
I don't know how you see that and
whether that's relevant.

652
00:34:44,060 --> 00:34:48,140
Nikolay: I see this as many different
small use cases.

653
00:34:49,460 --> 00:34:55,100
For example, if first of all, I
remember your different use cases.

654
00:34:55,120 --> 00:34:56,480
This is what I mentioned as well.

655
00:34:56,480 --> 00:35:01,740
Like we want to take into account
likes, comments, timestamp,

656
00:35:02,040 --> 00:35:02,540
everything.

657
00:35:03,160 --> 00:35:07,200
It gives us ability to build very
big query and then we think,

658
00:35:07,200 --> 00:35:11,380
okay, but our full text search,
single GIN index is not enough,

659
00:35:11,380 --> 00:35:11,880
right?

660
00:35:12,260 --> 00:35:16,740
Sometimes we can use btree_gin,
for example, right?

661
00:35:16,980 --> 00:35:22,320
Which allows us combining these
arrays of text and regular numbers

662
00:35:22,440 --> 00:35:24,800
or timestamps in a single index.

663
00:35:25,440 --> 00:35:26,420
And this is good.

664
00:35:27,380 --> 00:35:29,060
We can have single index scan.

665
00:35:29,060 --> 00:35:30,700
Also there is B-tree GiST.

666
00:35:31,740 --> 00:35:32,920
Also interesting thing.

667
00:35:33,520 --> 00:35:36,660
Also there is RUM, which you need
to install as a separate extension,

668
00:35:36,660 --> 00:35:41,680
but these 2, B-tree GiST and B-tree GIN,
they are included with regular

669
00:35:42,660 --> 00:35:45,260
contrib modules, so with any Postgres
it's available.
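
A sketch of combining full-text and a scalar condition in one GIN index via btree_gin (names hypothetical; note btree_gin helps equality conditions, not range scans):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- One multicolumn GIN index covering the tsvector and a scalar column:
CREATE INDEX posts_fts_author_idx ON posts USING gin (fts, author_id);

SELECT id FROM posts
WHERE fts @@ to_tsquery('english', 'eggplant')
  AND author_id = 42;
```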

670
00:35:46,100 --> 00:35:50,280
But in some cases we want, like
I forgot to mention Postgres

671
00:35:50,280 --> 00:35:51,420
also supports it.

672
00:35:52,360 --> 00:35:56,640
I think the function websearch_to_tsquery
or something like

673
00:35:56,640 --> 00:35:57,140
this.

674
00:35:57,280 --> 00:36:03,080
It supports some small language
of phrase search and inverted

675
00:36:03,120 --> 00:36:06,440
search, you can exclude, user can
say, I want everything but

676
00:36:06,440 --> 00:36:07,940
this word should not be there.

677
00:36:08,180 --> 00:36:09,360
Just minus, right?

678
00:36:09,520 --> 00:36:12,880
Or put in double quotes for exact
match.
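
That small query language is what websearch_to_tsquery parses; a minimal sketch with a made-up search string:

```sql
-- Plain words are AND-ed, a quoted phrase becomes a phrase
-- (followed-by) search, and a leading minus excludes a word:
SELECT websearch_to_tsquery('english', '"baked eggplant" tomato -mushroom');
```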

679
00:36:14,380 --> 00:36:19,060
This is all great; it means that the user
can control it, and in a similar manner

680
00:36:19,080 --> 00:36:21,840
Google also supports things like
that, right?

681
00:36:22,160 --> 00:36:28,760
So it means that you want, for
example, to see exact phrase mentioning,

682
00:36:28,780 --> 00:36:32,420
well you know it's there and you
just can do it.

683
00:36:32,780 --> 00:36:37,420
But then similarity search, 2 big
use cases.

684
00:36:37,420 --> 00:36:39,520
For example, you mentioned synonyms,
right?

685
00:36:40,080 --> 00:36:44,760
Someone doesn't use the word eggplant,
there's another word meaning

686
00:36:44,760 --> 00:36:45,640
the same thing.

687
00:36:46,180 --> 00:36:46,680
Michael: Aubergine.

688
00:36:49,760 --> 00:36:51,740
It's honestly, that's what we call
them.

689
00:36:51,740 --> 00:36:52,240
Aubergine.

690
00:36:52,740 --> 00:36:53,240
A-U-B-E-R.

691
00:36:54,000 --> 00:36:55,000
I'm not going to try the

692
00:36:55,000 --> 00:36:55,260
Nikolay: rest.

693
00:36:55,260 --> 00:36:56,540
I haven't heard it.

694
00:36:56,660 --> 00:36:57,160
Cool.

695
00:36:57,540 --> 00:37:02,900
So if you want a synonym search,
full-text search supports it.

696
00:37:02,980 --> 00:37:04,300
But you need to maintain it.

697
00:37:04,300 --> 00:37:06,840
You need to maintain the dictionary
of synonyms.
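
A sketch of wiring up a synonym dictionary; the dictionary and configuration names are made up, and the synonym file must be placed under $SHAREDIR/tsearch_data/:

```sql
-- my_syns.syn could contain a line like:  aubergine eggplant
CREATE TEXT SEARCH DICTIONARY my_syns (
  TEMPLATE = synonym,
  SYNONYMS = my_syns
);

CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);

ALTER TEXT SEARCH CONFIGURATION my_english
  ALTER MAPPING FOR asciiword WITH my_syns, english_stem;
```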

698
00:37:07,780 --> 00:37:12,260
And the normalization process will
handle it automatically: all synonyms

699
00:37:12,340 --> 00:37:16,820
will be folded; in our tsvector
and tsquery it will be

700
00:37:17,040 --> 00:37:18,080
1 word we chose.

701
00:37:18,080 --> 00:37:19,740
So we can maintain synonyms.

702
00:37:19,920 --> 00:37:20,680
It's easy.

703
00:37:20,680 --> 00:37:21,220
Not easy.

704
00:37:21,220 --> 00:37:22,760
I mean, it's straightforward.

705
00:37:23,000 --> 00:37:26,160
It's not easy because it requires
effort of maintaining synonyms,

706
00:37:26,160 --> 00:37:26,660
right?

707
00:37:27,660 --> 00:37:35,500
But on the other hand, if we use
pgVector, it probably puts both

708
00:37:35,500 --> 00:37:39,800
words very close in this
high-dimensional space, right?

709
00:37:40,640 --> 00:37:42,180
Because meaning is the same.

710
00:37:43,840 --> 00:37:44,700
Right, interesting.

711
00:37:45,040 --> 00:37:49,540
So maybe semantic search provided
by pgVector is better here,

712
00:37:49,540 --> 00:37:50,040
right?

713
00:37:51,160 --> 00:37:51,660
Okay.

714
00:37:51,960 --> 00:37:56,420
Another thing is, for example,
something is not working versus

715
00:37:56,420 --> 00:37:58,100
something is working.

716
00:37:58,620 --> 00:38:00,740
Not is a stop word, right?

717
00:38:02,420 --> 00:38:03,660
What do we try to search?

718
00:38:03,740 --> 00:38:04,740
Something is not working.

719
00:38:04,740 --> 00:38:06,360
For example, my car is not working.

720
00:38:06,780 --> 00:38:11,180
Maybe not working I should put
in double quotes, so not as definitely

721
00:38:11,260 --> 00:38:13,640
there for exact phrase search.

722
00:38:13,940 --> 00:38:18,120
Or maybe I just need to use pgVector
because it definitely will

723
00:38:18,120 --> 00:38:22,480
distinguish semantically that not
working and working are very far

724
00:38:22,480 --> 00:38:26,600
apart in this high-dimensional space,
so similarity is not good,

725
00:38:26,600 --> 00:38:26,880
right?

726
00:38:26,880 --> 00:38:28,540
Distance is not good.

727
00:38:28,980 --> 00:38:34,820
And if we are trying to find my
car is not working, we won't

728
00:38:34,820 --> 00:38:36,000
find documents.

729
00:38:37,120 --> 00:38:42,380
They will be put not high, right,
in terms of ranking by similarity.

730
00:38:43,680 --> 00:38:44,740
But also pgvector...

732
00:38:45,720 --> 00:38:47,220
Michael: I think that depends,
though.

733
00:38:47,220 --> 00:38:50,360
I think that's contextually important
and interesting because

734
00:38:50,820 --> 00:38:55,440
let's say this was a car forum,
how many posts are you going

735
00:38:55,440 --> 00:38:57,360
to find about cars that are working great?

736
00:38:57,360 --> 00:38:59,180
Like realistically in the

737
00:38:59,180 --> 00:38:59,627
Nikolay: data set?

738
00:38:59,627 --> 00:39:00,507
Michael: Well, it's a bad

739
00:39:00,507 --> 00:39:05,040
Nikolay: example. It was a bad
example, but in general, inverted,

740
00:39:06,280 --> 00:39:11,080
given how full-text search works
in Postgres, it will remove

741
00:39:11,520 --> 00:39:13,040
not as a stop word.

742
00:39:13,700 --> 00:39:15,080
And it's bad sometimes, right?

743
00:39:15,080 --> 00:39:17,300
Because it's the opposite meaning,
right?
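
This is easy to see directly: with the english configuration, both sentences keep only the lexemes car and work, so a plain full-text match cannot tell them apart:

```sql
SELECT to_tsvector('english', 'my car is not working');
SELECT to_tsvector('english', 'my car is working');
```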

744
00:39:18,500 --> 00:39:19,340
Michael: Yeah, makes sense.

745
00:39:19,340 --> 00:39:20,560
I didn't realize that.

746
00:39:20,640 --> 00:39:23,440
Nikolay: We might think about similar
examples where the removal

747
00:39:23,440 --> 00:39:31,260
of the stop word basically leads to
very bad ranking or filtering,

748
00:39:31,320 --> 00:39:31,820
also.

749
00:39:32,360 --> 00:39:35,860
Michael: Yeah, like one that comes
up quite often is like film

750
00:39:35,860 --> 00:39:38,800
titles or like band titles.

751
00:39:38,800 --> 00:39:41,780
If you put the in front of it,
it could be a completely different

752
00:39:41,780 --> 00:39:42,280
band.

753
00:39:42,860 --> 00:39:50,980
Nikolay: Or... English has
such examples, a lot

754
00:39:50,980 --> 00:39:51,820
of such examples.

755
00:39:51,820 --> 00:39:55,220
For example, go on, go on, go,
it's like goes, right?

756
00:39:55,240 --> 00:39:56,540
On is just some...

757
00:39:56,780 --> 00:39:58,820
So go on means continue, right?

758
00:39:59,600 --> 00:40:00,980
Meaning is very different.

759
00:40:01,040 --> 00:40:05,440
If you just look at words separately,
you don't get this meaning

760
00:40:05,440 --> 00:40:05,740
at all.

761
00:40:05,740 --> 00:40:08,680
And there are many such phrases,
right?

762
00:40:10,440 --> 00:40:11,440
In many languages.

763
00:40:12,340 --> 00:40:17,360
And semantic search will capture
this, the meaning, and put it

764
00:40:17,360 --> 00:40:18,160
to vector, right?

765
00:40:18,160 --> 00:40:23,100
And then we use approximate nearest
neighbor search, ANN search.

766
00:40:24,440 --> 00:40:28,440
But it lacks many capabilities,
like for example, exact phrase

767
00:40:28,440 --> 00:40:34,440
search or inverted word search,
or this categorization I mentioned.

768
00:40:35,280 --> 00:40:36,140
What do you think?

769
00:40:36,140 --> 00:40:36,920
It's interesting.

770
00:40:37,020 --> 00:40:39,740
Should we use only semantic search
these days?

771
00:40:42,040 --> 00:40:44,760
Michael: I don't think so, but
only because I'm thinking of some

772
00:40:44,760 --> 00:40:49,040
quite boring use cases where, for
example, you're set, like,

773
00:40:49,240 --> 00:40:52,160
I quite often think about software
as a service applications.

774
00:40:52,900 --> 00:40:56,420
And let's say you might want, you
might be looking up a customer

775
00:40:56,580 --> 00:41:00,180
and you might want to look up them
by email address or by name

776
00:41:00,180 --> 00:41:03,000
or by something else; there are a few
different things you might want to

777
00:41:03,000 --> 00:41:04,060
look them up by.

778
00:41:04,400 --> 00:41:07,820
And you might want to provide
your users with a single search

779
00:41:07,820 --> 00:41:09,700
bar to look those up.

780
00:41:10,240 --> 00:41:14,060
Nikolay: Or you can have tags,
for example. Based on what

781
00:41:14,060 --> 00:41:18,300
I just described, with these
arrays of text,

782
00:41:18,380 --> 00:41:25,520
you can use GIN for
text search, putting all the text

783
00:41:25,520 --> 00:41:28,400
for a document into a single value,
a single column, right?

784
00:41:28,500 --> 00:41:31,080
All the text, denormalizing it.

785
00:41:31,080 --> 00:41:34,700
I first did it, by the way, using
GiST in 2008.

786
00:41:34,940 --> 00:41:38,300
It was the topic of my first talk
in the U.S.

787
00:41:38,600 --> 00:41:40,580
I presented in 2008.

788
00:41:40,740 --> 00:41:41,540
Can you imagine?

789
00:41:42,180 --> 00:41:47,380
And we were building this,
it was social media,

790
00:41:47,780 --> 00:41:53,560
and all tags were stored in a single
column instead of the EAV approach,

791
00:41:53,560 --> 00:41:59,200
where you store words separately:
tags, terms, which can be phrases.

792
00:42:00,200 --> 00:42:05,780
1 table for documents, another table
for terms, and between them a relationship

793
00:42:06,020 --> 00:42:06,880
table, right?

794
00:42:08,300 --> 00:42:13,140
And then you need to have 3-table
joins, like 2 joins, all the

795
00:42:13,140 --> 00:42:13,640
time.

796
00:42:14,160 --> 00:42:17,800
In terms of storage and speed of
search, it was terrible, it

797
00:42:17,800 --> 00:42:21,100
remains terrible, this EAV,
Entity-Attribute-Value approach.

798
00:42:21,600 --> 00:42:26,040
But here you can put all tags and
full text search can provide

799
00:42:26,040 --> 00:42:29,480
you exact search by tags.
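(A sketch of the denormalized-tags idea described here, on a hypothetical documents table: tags live in one text[] column with a GIN index, and the containment operator @> gives exact tag search with no joins.)

```python
# Hypothetical schema: tags denormalized into one text[] column,
# instead of the 3-table EAV layout (documents / terms / relationship).
schema_sql = """
CREATE TABLE documents (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title text NOT NULL,
    body  text NOT NULL,
    tags  text[] NOT NULL DEFAULT '{}'
);
CREATE INDEX documents_tags_gin ON documents USING gin (tags);
"""

def exact_tag_query(tags: list[str]) -> tuple[str, list[list[str]]]:
    # Array containment (@>) is GIN-indexable: exact search by tags,
    # no joins at query time. Returns (sql, params) for a driver
    # such as psycopg.
    sql = "SELECT id, title FROM documents WHERE tags @> %s;"
    return sql, [tags]

sql, params = exact_tag_query(["postgres", "search"])
```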

800
00:42:30,600 --> 00:42:35,248
Versus if you put everything into a
vector, how would we

801
00:42:35,248 --> 00:42:35,344
do it?

802
00:42:35,344 --> 00:42:41,200
We probably would
just take author, title, body,

803
00:42:41,200 --> 00:42:42,780
and then okay, we have tags.

804
00:42:42,980 --> 00:42:47,800
And we probably append "tags:" and then
this, comma, this, comma, this,

805
00:42:47,800 --> 00:42:48,300
right?

806
00:42:48,420 --> 00:42:53,940
And rely on our vector search to
use it somehow, but

807
00:42:54,240 --> 00:42:56,100
we lose control, right?

808
00:42:56,120 --> 00:43:00,200
With full-text search, or if we
have multiple indexes, we can

809
00:43:00,200 --> 00:43:03,640
reliably find the proper documents,
which definitely contain some

810
00:43:03,640 --> 00:43:04,140
text.

811
00:43:04,940 --> 00:43:06,440
Michael: This might be my answer.

812
00:43:06,500 --> 00:43:11,680
I think if I care a lot about reducing
false negatives, like

813
00:43:11,680 --> 00:43:15,600
it would be really bad if you search
the exact name of a customer

814
00:43:15,780 --> 00:43:20,080
in my application and we didn't
show it back to you.

815
00:43:20,080 --> 00:43:21,340
If we couldn't find it.

816
00:43:21,340 --> 00:43:26,140
So sometimes the problem with the
index types for vector search

817
00:43:26,440 --> 00:43:29,080
is they can have false negatives.

818
00:43:29,440 --> 00:43:33,240
Results that should be the correct
answer turn out to not

819
00:43:33,240 --> 00:43:34,620
be returned in the answer.

820
00:43:34,960 --> 00:43:40,660
Nikolay: And also we remember
that vector search, since dimensionality

821
00:43:42,100 --> 00:43:44,960
is super high, is approximate
nearest neighbors.

822
00:43:45,040 --> 00:43:46,560
Michael: So that's what I'm trying
to explain.

823
00:43:46,560 --> 00:43:51,140
Nikolay: But with full-text search it's
okay, nearest neighbors or something

824
00:43:51,140 --> 00:43:52,680
like filtering is exact.

825
00:43:53,520 --> 00:43:54,020
Yes.

826
00:43:54,400 --> 00:43:54,900
Precise.

827
00:43:55,320 --> 00:43:57,280
So let me rephrase you.

828
00:43:57,440 --> 00:44:01,840
If we want to present the user just a
single input with a button, fine,

829
00:44:01,840 --> 00:44:02,340
right?

830
00:44:02,460 --> 00:44:05,720
Probably similarity search is enough,
right?

831
00:44:06,000 --> 00:44:11,420
Or if we want to have some of this
query language, like exclude this

832
00:44:11,420 --> 00:44:16,620
word or exact phrase search in
double quotes, or we have advanced

833
00:44:18,180 --> 00:44:22,080
search form, like choose the author,
date range, or something.

834
00:44:22,840 --> 00:44:25,740
Similarity search is not enough
and probably we will need not

835
00:44:25,740 --> 00:44:30,560
only full text search but also
faceted search with filtering,

836
00:44:30,600 --> 00:44:31,720
additional things.

837
00:44:32,080 --> 00:44:36,860
Some things we can put into the full-text
search index, as I described,

838
00:44:36,860 --> 00:44:38,280
but not all of them.

839
00:44:38,680 --> 00:44:41,680
And it's interesting that now we
have similarity search, now

840
00:44:41,680 --> 00:44:47,020
we have full text search, typo
correction basically, but it requires

841
00:44:47,020 --> 00:44:48,160
effort, as I said.

842
00:44:48,760 --> 00:44:51,800
Other indexes, like B-tree, don't go away
at all, right?

843
00:44:52,200 --> 00:44:56,580
And then we can probably build
some system which can be very

844
00:44:56,580 --> 00:44:59,640
powerful in terms of what you can
do with it.

845
00:45:00,060 --> 00:45:04,440
And if we consider a particular
example, for example, I know

846
00:45:04,440 --> 00:45:08,580
you used PostgreSQL.org and Google
search, right?

847
00:45:09,020 --> 00:45:10,360
2 search engines, right?

848
00:45:12,360 --> 00:45:18,920
And PostgreSQL.org search is remarkably
weak.

849
00:45:19,540 --> 00:45:19,820
Michael: Is

850
00:45:19,820 --> 00:45:20,380
it true?

851
00:45:20,380 --> 00:45:22,100
Yeah, it's not great, is it?

852
00:45:22,500 --> 00:45:26,180
Nikolay: It seems it doesn't use
full text search, or maybe it

853
00:45:26,180 --> 00:45:31,580
uses it in some strange form, because
I don't see all the power

854
00:45:31,580 --> 00:45:32,620
full-text search provides.

855
00:45:33,280 --> 00:45:37,580
Maybe we could look into the details
because I guess the source code

856
00:45:37,580 --> 00:45:38,400
should be available.

857
00:45:38,900 --> 00:45:44,180
But in general it can be improved
and things like just using

858
00:45:44,180 --> 00:45:47,840
PostgreSQL's own full-text search
will improve it and we could

859
00:45:47,840 --> 00:45:50,460
define more things, right?

860
00:45:50,660 --> 00:45:54,980
But if pgVector would be installed,
it would be even better,

861
00:45:55,640 --> 00:45:55,960
right?

862
00:45:55,960 --> 00:46:00,140
Because we could combine semantic
search and full-text search

863
00:46:00,140 --> 00:46:00,480
phrase.

864
00:46:00,480 --> 00:46:05,580
I think, for me, full-text search
is good in terms of these capabilities

865
00:46:05,860 --> 00:46:09,980
like the query language, like phrase
search and negative exclusion,

866
00:46:11,120 --> 00:46:12,180
and also categories.

867
00:46:13,280 --> 00:46:13,780
Yes.

868
00:46:14,280 --> 00:46:17,220
So as someone who built a lot of
systems using full text search

869
00:46:17,220 --> 00:46:23,860
and recently built 1 system using
pgVector and semantic similarity,

870
00:46:25,680 --> 00:46:30,760
I think it's probably the proper
time to start adding full-text

871
00:46:30,760 --> 00:46:33,900
search capabilities also, and it
will be tricky and interesting

872
00:46:33,900 --> 00:46:36,180
how to properly combine them.

873
00:46:36,180 --> 00:46:39,440
I saw simple examples, trivial
examples, of how to combine

874
00:46:39,440 --> 00:46:39,900
them.

875
00:46:39,900 --> 00:46:45,200
Let's find 50 documents using semantic
search, 50 documents using

876
00:46:45,200 --> 00:46:47,220
full-text search, that's it.
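(One common way to merge those two top-50 lists into a single ranking is reciprocal rank fusion, RRF; a minimal sketch with made-up document ids. The constant k=60 is the value commonly used in the literature.)

```python
# Reciprocal rank fusion: merge a semantic top-N list with a
# full-text top-N list into one ranking. Documents found by both
# searches accumulate score from both lists and rise to the top.
def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first.
    return sorted(scores, key=scores.get, reverse=True)

semantic_top = [101, 205, 33]   # ids from ANN (pgVector) search
fulltext_top = [205, 87, 101]   # ids from full-text search

fused = rrf([semantic_top, fulltext_top])
# Document 205 ranks first: it appears high in both lists.
```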

877
00:46:47,640 --> 00:46:52,640
Or let's find 1,000 using full-text
search, then re-rank them

878
00:46:52,640 --> 00:46:55,220
using similarity and leave only
100.

879
00:46:55,400 --> 00:47:00,140
And then define some additional
ranking and leave only 10.
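(The "find 1,000 with full-text search, then re-rank by similarity" idea can be sketched as a single SQL statement. The table and column names here, documents, tsv, embedding, are hypothetical, and the <=> cosine-distance operator assumes pgVector is installed.)

```python
# Two-phase hybrid search, sketched as a parameterized SQL string:
# phase 1 filters candidates with full-text search, phase 2 re-ranks
# the survivors by pgVector cosine distance.
hybrid_sql = """
WITH fts_candidates AS (
    SELECT id, title, embedding,
           ts_rank(tsv, websearch_to_tsquery('english', %(q)s)) AS fts_rank
    FROM documents
    WHERE tsv @@ websearch_to_tsquery('english', %(q)s)
    ORDER BY fts_rank DESC
    LIMIT 1000
)
SELECT id, title
FROM fts_candidates
ORDER BY embedding <=> %(q_embedding)s  -- pgVector cosine distance
LIMIT 100;
"""
```

As discussed above, this answers the recall question (full-text filtering is exact) but leaves pagination and latency open: the inner LIMIT 1000 scan runs on every page request.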

880
00:47:00,480 --> 00:47:04,060
Quite weak examples because I have
no idea how pagination will

881
00:47:04,060 --> 00:47:04,540
work, right?

882
00:47:04,540 --> 00:47:06,140
And also, what about speed?

883
00:47:06,680 --> 00:47:14,040
So for me, it remains an open question
how to look back at old

884
00:47:14,120 --> 00:47:17,120
functionality provided by full-text
search and bring it to

885
00:47:17,120 --> 00:47:18,160
similarity search.

886
00:47:18,220 --> 00:47:20,100
So it's a super interesting topic.

887
00:47:20,580 --> 00:47:22,320
Michael: It depends on the use
case, right?

888
00:47:22,860 --> 00:47:27,280
So I'm thinking about your bot
for Postgres.AI, you don't need

889
00:47:27,280 --> 00:47:28,180
to paginate, right?

890
00:47:28,180 --> 00:47:32,980
Like if you're returning the results
in a chat-based interface.

891
00:47:32,980 --> 00:47:33,800
Nikolay: Yeah, there, yes.

892
00:47:34,460 --> 00:47:37,200
Michael: So it's interesting,
like, depending

893
00:47:37,200 --> 00:47:39,620
on the use case, maybe don't always
have the same constraints

894
00:47:39,620 --> 00:47:42,460
we're used to having, like, via
web search type interfaces.

895
00:47:43,200 --> 00:47:46,400
Nikolay: Well, I think yes and
no.

896
00:47:46,400 --> 00:47:52,040
We don't need it now, but building
the bot, I already follow

897
00:47:52,300 --> 00:47:54,020
my own behavior using Google.

898
00:47:54,400 --> 00:47:55,460
When using Google.

899
00:47:55,920 --> 00:47:56,320
Michael: And there

900
00:47:56,320 --> 00:47:59,440
Nikolay: I definitely sometimes
go to page number 2, page number

901
00:47:59,440 --> 00:47:59,940
3.

902
00:48:00,740 --> 00:48:02,020
Michael: But not page 20.

903
00:48:02,020 --> 00:48:05,860
Like pagination maybe is not that
big a deal if you're looking

904
00:48:05,860 --> 00:48:07,240
at the first 3 pages.

905
00:48:08,800 --> 00:48:09,920
Nikolay: But speed will suffer.

906
00:48:09,920 --> 00:48:14,240
But anyway, I think I should
start combining it with full-text

907
00:48:14,240 --> 00:48:18,280
search because sometimes I
want an exact search: I know the exact

908
00:48:18,280 --> 00:48:19,940
phrase or function name.

909
00:48:20,660 --> 00:48:22,260
Sometimes I want to exclude something.

910
00:48:22,480 --> 00:48:25,180
I mean, the bot already wants to exclude
something.

911
00:48:26,260 --> 00:48:26,980
It happens.

912
00:48:26,980 --> 00:48:32,760
And also I think about two-phase
search when we search for more

913
00:48:32,760 --> 00:48:33,260
items.

914
00:48:33,280 --> 00:48:35,780
Maybe indeed pagination is not
needed.

915
00:48:36,060 --> 00:48:40,060
I think about the idea to search
a lot of items, like a thousand,

916
00:48:40,580 --> 00:48:44,080
2,000 items, but then analyze snippets
like in Google.

917
00:48:45,060 --> 00:48:49,620
Using, again, automated analysis
with an LLM, and leave only a few.

918
00:48:49,960 --> 00:48:54,840
For them, we need to expand because
I always open every relevantly

919
00:48:55,040 --> 00:48:56,760
looking snippet.

920
00:48:56,760 --> 00:49:00,220
I just click it and open a new
tab and consume the whole document.

921
00:49:01,280 --> 00:49:06,320
But we cannot consume whole documents
for 10 or 20 items.

922
00:49:06,540 --> 00:49:07,820
It's interesting there.

923
00:49:07,820 --> 00:49:10,280
But full text search, I think,
is about to return.

924
00:49:11,020 --> 00:49:14,780
Because again, sometimes I want
to search by title reliably.

925
00:49:15,040 --> 00:49:17,680
I know it was mentioned in the
title or something.

926
00:49:17,680 --> 00:49:18,420
Or authors.

927
00:49:18,900 --> 00:49:19,780
Also author.

928
00:49:20,420 --> 00:49:20,920
Yeah.

929
00:49:22,740 --> 00:49:23,500
Michael: Makes sense.

930
00:49:23,940 --> 00:49:24,860
Exciting times.

931
00:49:25,240 --> 00:49:29,320
Nikolay: Yeah, maybe it was slightly
messy in terms of content

932
00:49:30,060 --> 00:49:31,260
structure this time.

933
00:49:32,040 --> 00:49:36,020
But I still have questions and
I don't like examples like just

934
00:49:36,020 --> 00:49:38,040
50 semantic, 50 full-text search.

935
00:49:38,560 --> 00:49:40,140
Michael: Do you want to hear from
people?

936
00:49:40,160 --> 00:49:42,180
Should people let us know how they're
currently doing it?

937
00:49:42,360 --> 00:49:44,680
Nikolay: Yeah, maybe there are
better ideas already.

938
00:49:45,140 --> 00:49:49,620
I think it's a rapidly evolving area,
right?

939
00:49:49,820 --> 00:49:54,840
And maybe there will be more systems
where capabilities will

940
00:49:54,840 --> 00:49:58,980
be combined, because we know, for
example, that Google combines

941
00:49:58,980 --> 00:50:04,020
a lot, including similarity and
full-text things.

942
00:50:05,380 --> 00:50:08,200
Michael: Yeah, so if you're already doing
this in Postgres, combining

943
00:50:08,200 --> 00:50:10,580
the 2 techniques, let us know.

944
00:50:11,040 --> 00:50:13,520
Nikolay: Yeah, how exactly, what
are the details?

945
00:50:13,520 --> 00:50:14,660
It's super interesting.

946
00:50:15,480 --> 00:50:15,980
Good.

947
00:50:16,860 --> 00:50:17,540
Michael: Nice one.

948
00:50:17,540 --> 00:50:18,360
Cheers, Nikolay.