1
00:00:05,080 --> 00:00:07,440
Nikolay: Hello, hello, this is
Postgres.FM.

2
00:00:07,440 --> 00:00:11,840
As usual, I'm Nik, Postgres AI,
and as usual, my co-host is

3
00:00:11,840 --> 00:00:13,300
Michael, pgMustard.

4
00:00:13,660 --> 00:00:14,360
Hi, Michael.

5
00:00:15,060 --> 00:00:15,920
Michael: Hello, Nikolay.

6
00:00:16,320 --> 00:00:17,940
Nikolay: How are you doing today?

7
00:00:18,080 --> 00:00:18,800
Michael: I'm good, thanks.

8
00:00:18,800 --> 00:00:19,280
How are you?

9
00:00:19,280 --> 00:00:19,780
Nikolay: Great.

10
00:00:20,640 --> 00:00:21,800
Very good, very good.

11
00:00:22,420 --> 00:00:23,460
Thank you for asking.

12
00:00:24,380 --> 00:00:27,240
So today we have a very big topic.

13
00:00:27,980 --> 00:00:28,700
I'm joking.

14
00:00:29,540 --> 00:00:35,500
But Yeah, I hope for this podcast
episode should be the shortest

15
00:00:35,500 --> 00:00:38,900
because we are going to discuss
very narrow topic.

16
00:00:39,800 --> 00:00:43,220
Although this topic hits almost
every project, as I noticed.

17
00:00:44,100 --> 00:00:48,560
So every system has usually some
table like users.

18
00:00:49,120 --> 00:00:55,580
It has some column like email and
in email we usually expect

19
00:00:55,580 --> 00:01:00,560
the search to not to distinguish
registry of characters look

20
00:01:00,560 --> 00:01:04,400
case it should be case-insensitive
search right So if it's all

21
00:01:04,400 --> 00:01:09,780
uppercase email or lowercase or
mixed, all those values must

22
00:01:09,780 --> 00:01:12,480
be considered the same.

23
00:01:13,520 --> 00:01:17,780
However, All people made this mistake,
including myself, many,

24
00:01:17,780 --> 00:01:18,500
many times.

25
00:01:18,540 --> 00:01:23,720
You usually start using varchar
or something or text for email

26
00:01:23,720 --> 00:01:24,220
column.

27
00:01:24,720 --> 00:01:30,420
And then you see that multiple
accounts, multiple users' records

28
00:01:30,920 --> 00:01:34,680
have basically the same record,
just written in different cases.

29
00:01:35,340 --> 00:01:35,840
Right?

30
00:01:35,900 --> 00:01:38,800
So here, we usually, what do we
do?

31
00:01:38,800 --> 00:01:46,120
We just introduce additional unique
index, over expression, lower

32
00:01:46,320 --> 00:01:46,820
email.

33
00:01:47,540 --> 00:01:51,480
So we convert email value to lowercase
or uppercase, doesn't

34
00:01:51,480 --> 00:01:55,460
matter, and build an index, B-tree index with unique option,

35
00:01:55,800 --> 00:02:00,860
create unique index concurrently
on lower or upper function from

36
00:02:00,860 --> 00:02:05,460
that value and this gives us uniqueness,
case-sensitive uniqueness,

37
00:02:06,140 --> 00:02:06,640
right?

38
00:02:07,760 --> 00:02:12,040
But that's not it, we then need
to fix our search queries.

39
00:02:14,240 --> 00:02:19,760
Michael: And make sure that every
future use of email uses the

40
00:02:19,760 --> 00:02:22,800
function as well in order to be
able to use the index or have

41
00:02:22,800 --> 00:02:26,200
2 indexes if you want to be able
to, you know, there's a few

42
00:02:26,200 --> 00:02:29,320
kind of like gotchas around this
approach.

43
00:02:29,540 --> 00:02:33,260
Nikolay: Honestly, like roughly
half of the cases, you know,

44
00:02:33,260 --> 00:02:38,300
I, we do a lot of health check
of our startups, like usually

45
00:02:38,300 --> 00:02:41,200
like they are grown to some like
terabyte or so.

46
00:02:41,200 --> 00:02:45,890
And we see some tables and not
every time we look into the, the,

47
00:02:45,890 --> 00:02:49,040
the, the table structure, but oftentimes
we do.

48
00:02:49,340 --> 00:02:55,080
And I just see some indexes, like
index on raw email value and

49
00:02:55,080 --> 00:02:58,000
on lower email, I see it a lot.

50
00:02:58,080 --> 00:03:01,020
And some queries use 1 index.

51
00:03:01,020 --> 00:03:05,460
So we hunt for unused indexes,
which is part of our health check,

52
00:03:05,660 --> 00:03:09,800
and we discussed it on our podcast
in terms of how to maintain

53
00:03:09,800 --> 00:03:10,940
good health of indexes.

54
00:03:11,420 --> 00:03:14,080
Extra indexes is not what you want
to have.

55
00:03:14,280 --> 00:03:18,580
And in this case, if we have index
on raw value and on lower

56
00:03:18,760 --> 00:03:23,420
of email, often we also see both
of them are used, which means

57
00:03:23,460 --> 00:03:24,460
there is a mess.

58
00:03:26,460 --> 00:03:30,840
So all of this was introduction
of our today's topic, which is

59
00:03:30,840 --> 00:03:31,940
called citext.

60
00:03:32,960 --> 00:03:36,560
Michael: Yeah, so this was, I know
it's a very narrow topic,

61
00:03:36,560 --> 00:03:40,240
but this was my choice because
I was looking through the contrib

62
00:03:40,240 --> 00:03:45,140
modules at what was in there that
was both popular and that we

63
00:03:45,140 --> 00:03:46,400
hadn't discussed yet.

64
00:03:46,420 --> 00:03:49,760
And we've got to enough episodes
now that that was not a long

65
00:03:49,760 --> 00:03:52,840
list of ones that, I mean, there
was, there were definitely some

66
00:03:52,840 --> 00:03:56,480
other extensions we have, some
other contrib modules that we

67
00:03:56,480 --> 00:04:00,720
haven't discussed yet, but the
ones on that list are not ones

68
00:04:00,720 --> 00:04:03,580
I hear discussed often, not ones
I see in use.

69
00:04:04,000 --> 00:04:08,440
Whereas citext, I feel like at
least was quite popular.

70
00:04:09,000 --> 00:04:12,100
It did get quite a lot of use,
or at least a lot of people were

71
00:04:12,100 --> 00:04:13,520
interested in the use case.

72
00:04:13,940 --> 00:04:17,020
So yeah, I was interested kind
of in your, in your perspective

73
00:04:17,020 --> 00:04:17,420
on this.

74
00:04:17,420 --> 00:04:21,180
And I also, after suggesting it
as a topic, started looking into

75
00:04:21,180 --> 00:04:24,440
it more and found the topic more
interesting than I was expecting

76
00:04:24,440 --> 00:04:24,940
to.

77
00:04:25,240 --> 00:04:28,180
So yeah I'm looking forward to
getting your thoughts on this

78
00:04:28,600 --> 00:04:30,260
and what people should be doing.

79
00:04:30,660 --> 00:04:35,260
But just in terms of those drawbacks
of using the function approach,

80
00:04:35,660 --> 00:04:38,300
there's a couple more mentioned
by the documentation I thought

81
00:04:38,300 --> 00:04:39,020
were good.

82
00:04:39,280 --> 00:04:43,680
1 is less important but there is
the just the verbosity of every

83
00:04:43,680 --> 00:04:47,380
query and just having to have functions
in them.

84
00:04:47,420 --> 00:04:51,920
But a more important 1, potentially,
depending on other opinions

85
00:04:51,940 --> 00:04:55,320
you might have, is that you also
then can't use it as a primary

86
00:04:55,320 --> 00:04:59,920
key if it's just text because,
or at least you'd have to have

87
00:04:59,920 --> 00:05:03,900
another unique index on it to guarantee
uniqueness, to make sure

88
00:05:03,900 --> 00:05:06,020
somebody couldn't sign up with
the same email address, but with

89
00:05:06,020 --> 00:05:08,660
some case different because it's
the same email address.

90
00:05:08,860 --> 00:05:10,580
But yeah, email, email's an important
1.

91
00:05:10,580 --> 00:05:13,600
I don't know if you'd actually
want email as a primary key anywhere,

92
00:05:14,140 --> 00:05:15,600
but also username.

93
00:05:15,660 --> 00:05:20,440
Like a lot of services allow you
to set a username that is

94
00:05:20,440 --> 00:05:23,360
case-insensitive, like social media
handles and things.

95
00:05:24,140 --> 00:05:28,300
Those will often be, if not
case-insensitive, definitely unique

96
00:05:28,680 --> 00:05:31,360
depending on case in a lot of cases.

97
00:05:32,680 --> 00:05:34,540
So like Twitter handles, for example.

98
00:05:35,240 --> 00:05:38,740
Nikolay: Yeah, and the question
is why it's an extension not

99
00:05:38,740 --> 00:05:39,640
in core, right?

100
00:05:40,640 --> 00:05:42,720
Michael: Well, there's a couple
of questions, right?

101
00:05:42,720 --> 00:05:44,340
Like, could this be a data type?

102
00:05:44,340 --> 00:05:46,080
Yeah, could it be a core data type?

103
00:05:46,080 --> 00:05:47,700
Nikolay: Well, it is data type,
but

104
00:05:48,980 --> 00:05:49,200
Michael: it's

105
00:05:49,200 --> 00:05:51,240
Nikolay: living outside of the
core.

106
00:05:52,120 --> 00:05:54,320
Michael: But I actually think so.

107
00:05:54,320 --> 00:05:57,100
Well, I don't know how quickly
you want to jump ahead, but the

108
00:05:57,100 --> 00:06:01,360
reason I found this super interesting
is In the docs for citext,

109
00:06:01,560 --> 00:06:07,720
the contract module, it actually
suggests not using either citext

110
00:06:08,220 --> 00:06:12,840
or the lower Function approach,
but instead creating a case-insensitive

111
00:06:13,360 --> 00:06:16,200
collation and using that for it.

112
00:06:17,040 --> 00:06:19,500
And that was so elegant.

113
00:06:20,360 --> 00:06:23,980
But it is something you have to
do at the User level.

114
00:06:24,380 --> 00:06:27,980
And we could, yeah, we could ship,
like, I don't see why Postgres

115
00:06:27,980 --> 00:06:31,640
couldn't come with 1 of
those prebuilt that you could

116
00:06:31,640 --> 00:06:32,460
just use.

117
00:06:32,740 --> 00:06:36,140
And maybe even, maybe it wouldn't
be a case-insensitive field.

118
00:06:36,580 --> 00:06:38,660
I could imagine a case.

119
00:06:38,940 --> 00:06:39,960
Sorry, I'll keep this.

120
00:06:39,960 --> 00:06:42,940
So I need to stop using the word
case to mean like argument.

121
00:06:43,080 --> 00:06:47,560
I can imagine an argument for having
email as a datatype that

122
00:06:47,560 --> 00:06:48,280
could be...

123
00:06:49,660 --> 00:06:51,540
Nikolay: Oh, that's interesting.

124
00:06:52,080 --> 00:06:59,040
I remember for Postgres, some huge
snippet to check that it's

125
00:06:59,040 --> 00:07:00,560
really email.

126
00:07:00,660 --> 00:07:03,120
It's like this RFC, right?

127
00:07:03,120 --> 00:07:09,320
And it's not trivial to ensure
some valid email address or not.

128
00:07:09,320 --> 00:07:14,540
It's not, it's not really like
so trivial as 1 might think.

129
00:07:15,300 --> 00:07:18,040
Michael: Well, so I've seen 2 cases
for this 1 is the really

130
00:07:18,040 --> 00:07:21,760
complicated 1, which actually I
think suits the argument that

131
00:07:21,760 --> 00:07:24,060
it should be pre-built because
you don't want everybody rolling

132
00:07:24,060 --> 00:07:27,980
their own version of that and The
other version is no just ship

133
00:07:27,980 --> 00:07:29,000
it right back.

134
00:07:29,060 --> 00:07:34,940
All you need to do is It doesn't
contain an at It maybe is it

135
00:07:34,940 --> 00:07:37,840
all lowercase or can like cast
it to lowercase?

136
00:07:38,760 --> 00:07:39,780
And then that's it.

137
00:07:40,920 --> 00:07:44,720
Maybe a period after the at, or
a full stop after the at.

138
00:07:44,760 --> 00:07:49,140
But I think anything more complicated
than that runs into potentially

139
00:07:49,400 --> 00:07:52,180
not allowing things that should
be allowed, right?

140
00:07:52,200 --> 00:07:53,980
Nikolay: So I will tell you what
I think.

141
00:07:53,980 --> 00:07:56,760
First of all, I agree that collation
is an elegant solution.

142
00:07:57,040 --> 00:08:00,580
And actually, collation, it's the
job of collations, even in

143
00:08:00,580 --> 00:08:05,440
SQL standard, is to define rules
how we compare values, right?

144
00:08:05,640 --> 00:08:09,780
And case sensitive or case-insensitive,
this is comparison.

145
00:08:10,200 --> 00:08:12,540
So it's exactly where it should
be.

146
00:08:12,540 --> 00:08:17,220
The only thing is, I remember there
should be some, like you

147
00:08:17,220 --> 00:08:21,100
need to distinguish comparison
at the byte level, like logical

148
00:08:21,140 --> 00:08:21,580
level.

149
00:08:21,580 --> 00:08:23,940
It's like deterministic or something,
right?

150
00:08:24,000 --> 00:08:24,500
Yes.

151
00:08:24,620 --> 00:08:25,120
Yeah.

152
00:08:26,000 --> 00:08:28,040
So honestly, I never use it myself.

153
00:08:28,140 --> 00:08:28,500
Never.

154
00:08:28,500 --> 00:08:31,520
I just read about this and I have no idea.

155
00:08:31,920 --> 00:08:38,080
My own experience is, I was like, when I first discovered citext,

156
00:08:38,200 --> 00:08:40,580
probably it happened 19 years ago or so.

157
00:08:41,400 --> 00:08:45,860
Like on my second year of Postgres use, or Maybe in the first

158
00:08:45,860 --> 00:08:46,400
year even.

159
00:08:46,400 --> 00:08:48,460
Michael: I have the history of it here, by the way.

160
00:08:48,620 --> 00:08:50,660
Nikolay: Yeah, it's a very old contrib module.

161
00:08:50,980 --> 00:08:54,460
Michael: First released to the public 17th of February 2003,

162
00:08:55,200 --> 00:08:57,580
but it wasn't contrib module at that point.

163
00:08:57,840 --> 00:08:59,940
That only happened a few years later.

164
00:09:01,220 --> 00:09:03,840
And it was a kind of like a redesign at that point.

165
00:09:03,840 --> 00:09:05,440
But yeah, so you were pretty early.

166
00:09:05,860 --> 00:09:09,020
Nikolay: Well, yeah, I started using Postgres in 2005 or 6.

167
00:09:09,140 --> 00:09:11,760
And of course, it was social media.

168
00:09:11,760 --> 00:09:14,280
So we definitely had the table users.

169
00:09:14,900 --> 00:09:17,800
At the time, I tried to avoid plural.

170
00:09:17,800 --> 00:09:21,320
So it was maybe something, something else like person or something.

171
00:09:21,620 --> 00:09:26,280
Yeah, Then I was convinced by the influence of Ruby on Rails,

172
00:09:26,280 --> 00:09:28,860
and Django and others, like, okay, it should be plural.

173
00:09:29,020 --> 00:09:31,220
And then I quickly became a fan of citext.

174
00:09:32,260 --> 00:09:36,040
And then I quickly became not a fan because like maybe 1 or 2

175
00:09:36,040 --> 00:09:40,440
years later I stopped being a fan because like it's extra effort

176
00:09:40,440 --> 00:09:43,780
to install extension and then you need to describe everyone what

177
00:09:43,780 --> 00:09:48,240
like oh this special data type and I found it easier to keep

178
00:09:48,240 --> 00:09:54,480
using text or varchar and these days I think just text but well

179
00:09:54,480 --> 00:10:00,560
it's not a topic text or varchar and with limit or no for length.

180
00:10:00,940 --> 00:10:02,900
Michael: But did you use collations or no?

181
00:10:02,900 --> 00:10:03,900
Nikolay: No, no, no, no, no, no.

182
00:10:03,900 --> 00:10:04,900
Michael: I just interesting.

183
00:10:04,940 --> 00:10:05,940
Nikolay: It's just functions.

184
00:10:06,580 --> 00:10:10,340
Yeah, function index with unique option if we need uniqueness,

185
00:10:10,440 --> 00:10:11,220
that's it.

186
00:10:11,320 --> 00:10:13,860
And then we need to adjust all the Queries.

187
00:10:15,060 --> 00:10:18,680
And I was in my own projects, I was hunting to if we if we made

188
00:10:18,680 --> 00:10:21,740
decision that it should be case-insensitive search, then probably

189
00:10:21,740 --> 00:10:22,620
it should be global.

190
00:10:22,640 --> 00:10:25,740
So we should get rid of index on raw value.

191
00:10:26,200 --> 00:10:27,180
So that's it.

192
00:10:27,180 --> 00:10:30,040
And fix it and all Queries and avoid sequential scans.

193
00:10:30,040 --> 00:10:30,480
And that's it.

194
00:10:30,480 --> 00:10:35,740
So this is like, you either want a sensitive search, so index

195
00:10:35,740 --> 00:10:36,980
should be on the raw value.

196
00:10:37,580 --> 00:10:40,360
Or you want case and sensitive search.

197
00:10:40,540 --> 00:10:44,820
In this case, you must have index on lower or upper.

198
00:10:45,360 --> 00:10:48,820
You need to choose in advance and make it a rule, like part of

199
00:10:48,820 --> 00:10:50,700
your code style, probably.

200
00:10:51,340 --> 00:10:55,680
And then use it usually on both sides of comparison, so input

201
00:10:55,680 --> 00:11:02,880
value also get lowercase or uppercase,
and the value of Column,

202
00:11:02,980 --> 00:11:06,660
and we know that we rely on Index
on expression here.

203
00:11:07,440 --> 00:11:11,140
Michael: Yeah, well this is an
interesting thing that the original

204
00:11:11,160 --> 00:11:14,260
author, so it was originally, the
citext module was originally

205
00:11:14,260 --> 00:11:19,900
authored by Donald Fraser and in
the initial email about this,

206
00:11:19,900 --> 00:11:25,580
1 of the use cases was multi-Column
functional indexes.

207
00:11:26,260 --> 00:11:29,240
So if you wanted to have, so if
you want an Index on a single,

208
00:11:29,240 --> 00:11:33,160
like just on email, fine, you can
do a Function on that.

209
00:11:33,160 --> 00:11:35,640
But at least at the time, you couldn't
have it on 2, let's say

210
00:11:35,640 --> 00:11:40,320
you wanted it on email, comma,
username, which may have been

211
00:11:40,320 --> 00:11:43,480
more common back then before they
had like the includes option.

212
00:11:43,860 --> 00:11:48,220
If you wanted it to be on the lower
of both of those, you'd have

213
00:11:48,220 --> 00:11:53,620
to have like a kind of a parent
Function that did both of those.

214
00:11:54,140 --> 00:11:57,320
So it got a bit messy in terms
of multi-Column indexes.

215
00:11:57,660 --> 00:12:00,860
Whereas if you have them as their
own data type, you can just

216
00:12:01,200 --> 00:12:02,260
do that at will.

217
00:12:02,440 --> 00:12:06,440
Nikolay: Multi-Column expression
on 2 expressions, so lower email,

218
00:12:06,440 --> 00:12:07,360
lower something else.

219
00:12:07,360 --> 00:12:09,500
I think right now it's definitely
possible.

220
00:12:09,960 --> 00:12:10,640
Michael: Okay, great.

221
00:12:10,640 --> 00:12:11,860
I didn't realize that.

222
00:12:12,260 --> 00:12:14,060
But back then it wasn't, so that
was like...

223
00:12:14,060 --> 00:12:14,860
Nikolay: Yeah, that's interesting.

224
00:12:14,860 --> 00:12:16,680
I don't remember this nuance.

225
00:12:16,690 --> 00:12:17,440
Yeah, okay.

226
00:12:17,440 --> 00:12:23,700
But yeah, so it's difficult to
name, to use the term multi-Column

227
00:12:23,800 --> 00:12:28,480
here because it's already not Column
but expression yeah multi-layered

228
00:12:28,940 --> 00:12:34,480
or something I don't know so multi-dimension
multi-dimensional

229
00:12:35,540 --> 00:12:38,320
maybe Index on expressions.

230
00:12:38,820 --> 00:12:40,520
It should work these days.

231
00:12:40,680 --> 00:12:45,600
Not a problem, but also there is
a like there is in inconvenience.

232
00:12:45,800 --> 00:12:50,280
If you go this route and use like
lower email or something you

233
00:12:50,580 --> 00:12:55,240
it's easier to forget that we need
to when Index is created we

234
00:12:55,240 --> 00:12:59,020
need to run Analyze on the Table
because it lacks statistics

235
00:12:59,180 --> 00:12:59,860
on expression

236
00:13:00,020 --> 00:13:02,000
Michael: yes yeah good point

237
00:13:02,220 --> 00:13:05,560
Nikolay: If you build the regular
Index on raw, a Column value,

238
00:13:06,900 --> 00:13:08,800
statistics usually already there.

239
00:13:09,320 --> 00:13:13,580
But for expressional indexes, it's
not there.

240
00:13:13,780 --> 00:13:20,480
But yeah, since then, I made this
decision in favor of expression

241
00:13:20,500 --> 00:13:23,460
indexes and I never went back.

242
00:13:24,520 --> 00:13:27,100
So why should I use citext in
2025?

243
00:13:28,140 --> 00:13:30,580
Michael: I don't think you should,
but I also don't think you

244
00:13:30,580 --> 00:13:34,340
should be having to do lower Functions.

245
00:13:34,340 --> 00:13:37,200
Nikolay: Both paths are bad, collation
path is good.

246
00:13:38,220 --> 00:13:40,380
Michael: That's what I think is
the solution now.

247
00:13:40,380 --> 00:13:41,820
So it's like super interesting.

248
00:13:42,040 --> 00:13:45,900
I actually came across just by
chance while looking into this

249
00:13:46,100 --> 00:13:51,600
a blog post by Adam Johnson who's
worked on Django for many years.

250
00:13:51,820 --> 00:13:56,580
It seems like Django actually removed
support or mostly removed

251
00:13:56,580 --> 00:13:59,940
support for citext in version 5.1
of Django.

252
00:14:00,140 --> 00:14:04,340
So they actually deprecated this
and he blogged about what you

253
00:14:04,340 --> 00:14:08,400
should do instead and wrote up
about the how to set up a collation

254
00:14:08,560 --> 00:14:12,160
and what we kind of define all
the kind of ICU I don't even know

255
00:14:12,160 --> 00:14:15,180
what you call them like codes or
it's called a locale isn't it

256
00:14:15,360 --> 00:14:19,800
But it's like made up of a bunch
of different, complete gibberish

257
00:14:19,820 --> 00:14:20,580
in my opinion.

258
00:14:20,820 --> 00:14:24,900
But you can look up the spec and
kind of choose which language

259
00:14:24,900 --> 00:14:27,540
do you want it to be and do you
want it to be case sensitive

260
00:14:27,540 --> 00:14:28,240
or not?

261
00:14:28,260 --> 00:14:31,300
Do you want it to be sensitive
to accents or not?

262
00:14:31,300 --> 00:14:31,620
So you

263
00:14:31,620 --> 00:14:32,120
Nikolay: can.

264
00:14:32,280 --> 00:14:33,380
Yeah, that's cool.

265
00:14:33,680 --> 00:14:35,660
And it's much more flexibility.

266
00:14:37,200 --> 00:14:39,840
And it's also SQL standard approach,
I think.

267
00:14:40,400 --> 00:14:41,440
Michael: Yes, it is.

268
00:14:41,980 --> 00:14:42,580
The last time

269
00:14:42,580 --> 00:14:45,480
Nikolay: I looked into SQL standard,
it was more than 10 years

270
00:14:45,480 --> 00:14:45,720
ago.

271
00:14:45,720 --> 00:14:50,420
So my memory might make tricks
with me, but I remember something

272
00:14:50,460 --> 00:14:51,240
about collations.

273
00:14:52,580 --> 00:14:53,540
This came up,

274
00:14:54,000 --> 00:14:57,500
Michael: well, in 2003 when it
was first proposed, I think I

275
00:14:57,500 --> 00:15:02,320
read a post from Tom Lane in that
thread that said basically

276
00:15:02,320 --> 00:15:05,340
the SQL standard way of doing this
is via collations, but I can't

277
00:15:05,340 --> 00:15:06,820
see us doing that anytime soon.

278
00:15:06,820 --> 00:15:09,120
So this makes sense as a hack in
the meantime.

279
00:15:09,620 --> 00:15:10,620
Nikolay: Oh, that's great.

280
00:15:10,960 --> 00:15:11,400
Yeah.

281
00:15:11,400 --> 00:15:13,180
So, so that's great.

282
00:15:13,180 --> 00:15:15,220
So it's very hard.

283
00:15:15,480 --> 00:15:17,720
How are we going to name this episode?

284
00:15:17,740 --> 00:15:20,780
I thought it will be citext now
it's shifted.

285
00:15:21,680 --> 00:15:24,060
Michael: I was going to suggest
maybe case-insensitive

286
00:15:25,080 --> 00:15:25,580
Nikolay: text.

287
00:15:26,320 --> 00:15:27,380
Don't cut this.

288
00:15:28,100 --> 00:15:32,200
Because it shows confusion, you
know, like it's like case-insensitive

289
00:15:32,240 --> 00:15:32,740
comparison.

290
00:15:32,920 --> 00:15:33,420
Yeah.

291
00:15:34,140 --> 00:15:36,100
Michael: Because that's what we
want, isn't it?

292
00:15:36,100 --> 00:15:38,280
Nikolay: But it goes beyond just
cases.

293
00:15:38,560 --> 00:15:39,740
You mentioned excellence.

294
00:15:40,240 --> 00:15:40,840
That's great.

295
00:15:40,840 --> 00:15:44,280
Because some people might use my
like some names, for example,

296
00:15:45,120 --> 00:15:46,140
like Thomas.

297
00:15:46,520 --> 00:15:47,020
Maybe.

298
00:15:47,500 --> 00:15:52,260
Michael: But as you said, I think
the 2 main use cases for this

299
00:15:52,260 --> 00:15:54,900
are email addresses and usernames.

300
00:15:55,240 --> 00:15:56,760
That's where I've seen it.

301
00:15:56,820 --> 00:16:01,400
And email addresses are unique,
no matter the case, but I don't

302
00:16:01,400 --> 00:16:02,740
know if they're unique.

303
00:16:03,560 --> 00:16:04,640
Nikolay: No, they are not.

304
00:16:04,640 --> 00:16:05,080
They are not.

305
00:16:05,080 --> 00:16:10,640
But we might decide, for example,
if username in our system is

306
00:16:11,180 --> 00:16:16,980
allowing like variants of Latin
characters, We might say, okay,

307
00:16:16,980 --> 00:16:19,680
we allow them, but we don't want
collisions.

308
00:16:21,340 --> 00:16:27,540
Like 1 user used pure Latin, another
user used like native version,

309
00:16:27,540 --> 00:16:29,680
like from another language.

310
00:16:30,140 --> 00:16:32,220
Many languages have it, right?

311
00:16:32,300 --> 00:16:36,540
But we say we allow them, but first
person who use it should

312
00:16:36,540 --> 00:16:38,940
be like, should be protected from
collisions.

313
00:16:39,560 --> 00:16:42,720
In this case, we might decide to
use collisions and define unique

314
00:16:42,720 --> 00:16:43,200
indexes.

315
00:16:43,200 --> 00:16:44,340
That's a great idea, I think.

316
00:16:44,340 --> 00:16:44,840
Yeah.

317
00:16:45,480 --> 00:16:46,260
Michael: Yeah, good point.

318
00:16:46,260 --> 00:16:49,760
In fact, I know it's not quite
related to that, but reading about

319
00:16:49,760 --> 00:16:55,640
this, I came across somebody giving
an example of lower not always

320
00:16:55,640 --> 00:16:59,440
being ideal, or lower and upper
being subtly different, because

321
00:17:00,020 --> 00:17:03,960
in some languages, you can get
an uppercase character that can

322
00:17:03,960 --> 00:17:07,740
have 2 alternative lowercase versions
of it.

323
00:17:07,740 --> 00:17:07,940
I

324
00:17:07,940 --> 00:17:09,220
Nikolay: didn't know about this.

325
00:17:09,520 --> 00:17:10,220
Michael: Nor me.

326
00:17:10,240 --> 00:17:13,780
But that, I mean, you can imagine
the kind of hassle that's going

327
00:17:13,780 --> 00:17:14,780
to end up causing.

328
00:17:14,840 --> 00:17:15,900
So yeah, it's...

329
00:17:16,860 --> 00:17:17,140
Nikolay: Wow.

330
00:17:17,140 --> 00:17:19,760
Michael: Collations are where this
stuff is defined properly,

331
00:17:19,760 --> 00:17:20,260
right?

332
00:17:20,580 --> 00:17:21,100
Yeah, yeah.

333
00:17:21,100 --> 00:17:22,080
So it gives

334
00:17:22,080 --> 00:17:23,260
Nikolay: you big freedom.

335
00:17:23,620 --> 00:17:25,940
But what are downsides of using
collations?

336
00:17:26,600 --> 00:17:30,040
Michael: Well, I did wonder because
do you remember, you know,

337
00:17:30,400 --> 00:17:34,320
if glibc changes and we can end
up with corrupted indexes.

338
00:17:34,640 --> 00:17:37,320
Nikolay: I don't, I, how can I
like,

339
00:17:37,360 --> 00:17:38,760
Michael: Well, of course you remember?

340
00:17:38,940 --> 00:17:40,180
Nikolay: Yes, of course I remember.

341
00:17:40,580 --> 00:17:43,120
We had multiple cases with multiple
clients.

342
00:17:45,120 --> 00:17:48,620
Michael: Could that be, Could that
be an issue here?

343
00:17:48,740 --> 00:17:48,900
I

344
00:17:48,900 --> 00:17:49,660
Nikolay: don't know.

345
00:17:49,940 --> 00:17:52,320
Well, if you use ICU, it should
not be so.

346
00:17:52,800 --> 00:17:53,260
I don't

347
00:17:53,260 --> 00:17:53,600
Michael: see how.

348
00:17:53,600 --> 00:17:54,240
Right, yeah, right.

349
00:17:54,240 --> 00:17:54,740
Nikolay: Yeah.

350
00:17:55,200 --> 00:17:59,700
And I think in this case, we definitely
want to specify a provider

351
00:17:59,700 --> 00:18:00,700
ICU, right?

352
00:18:00,920 --> 00:18:01,420
Yes.

353
00:18:01,560 --> 00:18:02,300
Good point.

354
00:18:02,320 --> 00:18:07,140
Yeah, and this comparison, I don't
see issue here.

355
00:18:07,960 --> 00:18:12,280
And also there we have issues not
with like comparison but with

356
00:18:12,280 --> 00:18:14,120
order right?

357
00:18:14,780 --> 00:18:15,280
True.

358
00:18:15,420 --> 00:18:17,420
Michael: So isn't that the same
thing?

359
00:18:17,620 --> 00:18:24,020
Nikolay: No Who is bigger who is
who's upper who is lower basically

360
00:18:24,020 --> 00:18:28,660
in the order it doesn't it's not
the same As who equals who I

361
00:18:28,660 --> 00:18:29,160
know

362
00:18:29,440 --> 00:18:34,640
Michael: it Except if you're looking
for like is there a clash

363
00:18:34,700 --> 00:18:36,100
is this Is this unique?

364
00:18:37,000 --> 00:18:40,640
You might traverse it only to the
point where you're expecting

365
00:18:40,640 --> 00:18:42,940
it to be, but it's actually lowered
down.

366
00:18:43,180 --> 00:18:46,560
So you say, oh no, I've already
got past that point in the index,

367
00:18:46,560 --> 00:18:49,000
you can insert this and you can
end up with duplicate values

368
00:18:49,000 --> 00:18:50,140
in the unique index.

369
00:18:52,420 --> 00:18:56,760
Nikolay: Yeah, what I don't know
is if, so here we use deterministic

370
00:18:57,440 --> 00:18:58,740
false, right?

371
00:18:58,740 --> 00:18:59,780
Michael: Yes, yes.

372
00:19:00,040 --> 00:19:02,920
Nikolay: And it means that at byte
level, the characters are

373
00:19:02,920 --> 00:19:07,620
different, but we logically, we
think about them as equal.

374
00:19:08,240 --> 00:19:12,360
But if you think about as equal
in the order by, they should

375
00:19:12,360 --> 00:19:13,400
go together, right?

376
00:19:13,400 --> 00:19:20,440
But do we have deterministic ordering
of them always?

377
00:19:20,460 --> 00:19:22,700
This is, I don't know, like should
be, right?

378
00:19:23,420 --> 00:19:25,420
Or there's some other mystic thing.

379
00:19:25,440 --> 00:19:28,340
If so, that might cause this corruption.

380
00:19:29,540 --> 00:19:33,060
Or yeah, If it's not unique index,
if it's unique index, it's

381
00:19:33,060 --> 00:19:33,940
just 1 value.

382
00:19:33,940 --> 00:19:36,900
I don't see corruption risks.

383
00:19:37,820 --> 00:19:44,200
But if it's non-unique index and
like these values will come

384
00:19:44,200 --> 00:19:48,440
in groups, is there like deterministic
structure?

385
00:19:48,480 --> 00:19:51,000
I mean, not a rule, how they are
ordered.

386
00:19:51,100 --> 00:19:56,100
So we avoid the switch of order
when glibc version changes.

387
00:19:56,180 --> 00:19:59,920
But again, here we talk about ICU,
so glibc is not involved,

388
00:19:59,920 --> 00:20:00,140
right?

389
00:20:00,140 --> 00:20:02,480
So everything should be fine.

390
00:20:02,980 --> 00:20:06,900
The only thing, just not to forget,
ICU should be enabled at

391
00:20:06,900 --> 00:20:07,920
compile time.

392
00:20:08,560 --> 00:20:12,100
I just, yeah, which is default,
I think right now.

393
00:20:12,100 --> 00:20:12,840
Is it default?

394
00:20:14,540 --> 00:20:18,840
I just remember Cursor and I were
fighting a few weeks ago because

395
00:20:18,960 --> 00:20:23,600
my laptop didn't have ICU installed,
so I always reminded that

396
00:20:23,680 --> 00:20:27,680
I created even Cursor a rule without
ICU.

397
00:20:27,840 --> 00:20:33,480
If you run configure without this,
Then I surrendered and installed

398
00:20:33,480 --> 00:20:37,640
ICUs just because somehow Cursor
kept losing this rule somehow.

399
00:20:37,640 --> 00:20:38,540
I don't know why.

400
00:20:38,900 --> 00:20:41,200
So yeah, so it should it should
be present.

401
00:20:41,200 --> 00:20:44,160
It might be not present and not
be available on the system.

402
00:20:44,760 --> 00:20:47,880
Michael: I think lots of people
use managed services these days,

403
00:20:47,880 --> 00:20:49,860
I think it tends to be on those.

404
00:20:49,920 --> 00:20:51,380
But yeah, really good point.

405
00:20:51,380 --> 00:20:54,720
But yeah, this seems to me like
the way to do it now.

406
00:20:54,820 --> 00:21:00,280
And just to complete that, I'm
going to read from Adam's blog

407
00:21:00,280 --> 00:21:00,780
post.

408
00:21:01,160 --> 00:21:03,960
The syntax is create collation,
and then you can call it whatever

409
00:21:03,960 --> 00:21:04,340
you want.

410
00:21:04,340 --> 00:21:05,560
For example, case-insensitive.

411
00:21:06,040 --> 00:21:10,420
And then you set provider and he
goes provider equals ICU.

412
00:21:10,560 --> 00:21:14,480
Then locale and then that's the
string that you can set all the

413
00:21:14,480 --> 00:21:16,100
various settings we mentioned earlier.

414
00:21:16,160 --> 00:21:19,080
And then you choose whether it's,
you have to choose in this

415
00:21:19,080 --> 00:21:20,820
case, deterministic equals false.

416
00:21:20,860 --> 00:21:23,560
So that's the byte-wise thing that
you're talking about.

417
00:21:23,560 --> 00:21:24,060
Nikolay: Right.

418
00:21:25,160 --> 00:21:25,660
Right.

419
00:21:25,960 --> 00:21:28,980
Michael: So yeah, in fact, just,
is it worth going through that

420
00:21:28,980 --> 00:21:31,600
gobbledygook in the locale quickly?

421
00:21:31,720 --> 00:21:36,960
So yeah, he's gone with und, so
UND, I thought it was German

422
00:21:36,960 --> 00:21:39,440
at first, but it actually just
means undetermined language.

423
00:21:39,580 --> 00:21:43,140
So you could set it to like German
or French or something if

424
00:21:43,140 --> 00:21:44,400
you really wanted to.

425
00:21:45,440 --> 00:21:52,040
U specifies Unicode attributes,
and then KS, level 2.

426
00:21:52,060 --> 00:21:57,040
So apparently that's collation
strength and then level 2 doesn't

427
00:21:57,040 --> 00:22:00,040
include case in comparisons only
letters and accents.

428
00:22:00,660 --> 00:22:04,060
So letters and accents can be considered
different, but case

429
00:22:04,060 --> 00:22:04,560
can't.

430
00:22:04,660 --> 00:22:09,080
So yeah, so that's what you'd want
in email address use case,

431
00:22:09,080 --> 00:22:09,740
I think.

432
00:22:10,440 --> 00:22:12,760
Nikolay: Yeah, so yeah, that's
interesting.

433
00:22:13,220 --> 00:22:13,980
Anything else?

434
00:22:14,200 --> 00:22:18,040
We have chances to have this episode,
as I promised, very short.

435
00:22:18,540 --> 00:22:21,300
Or there are some additional aspects
here.

436
00:22:21,820 --> 00:22:23,740
Michael: So no, I think that's
it.

437
00:22:23,740 --> 00:22:26,660
Nikolay: So it looks like both
of us, if we were building a new

438
00:22:26,660 --> 00:22:30,460
system, we probably would try to
use this approach with collations.

439
00:22:31,160 --> 00:22:31,560
Sounds

440
00:22:31,560 --> 00:22:32,220
Michael: like 100%

441
00:22:32,960 --> 00:22:35,720
Nikolay: most flexible, most powerful
these days, right?

442
00:22:36,200 --> 00:22:36,700
Michael: Yeah.

443
00:22:36,820 --> 00:22:37,320
Nikolay: Yeah.

444
00:22:37,660 --> 00:22:41,280
You know what I'm
checking right now and I see

445
00:22:41,280 --> 00:22:45,460
cases about, collation
corruption.

446
00:22:45,900 --> 00:22:46,020
Yeah.

447
00:22:46,020 --> 00:22:46,520
Yeah.

448
00:22:47,000 --> 00:22:51,980
I see, Daniel Vérité or how...

449
00:22:51,980 --> 00:22:53,160
Michael: Oh yeah.

450
00:22:53,560 --> 00:22:54,440
Is he psycopg?

451
00:22:55,440 --> 00:22:56,260
Nikolay: Not sure.

452
00:22:56,580 --> 00:22:57,880
I remember this name.

453
00:22:57,880 --> 00:23:00,720
I read some blog posts in the past, but definitely collation

454
00:23:00,720 --> 00:23:02,940
versioning problem with ICU 73.

455
00:23:04,400 --> 00:23:10,640
So it's interesting to think about possible corruption when something

456
00:23:10,640 --> 00:23:12,420
changes under the hood.

457
00:23:13,320 --> 00:23:17,180
Yeah, and if we go this route, what to expect?

458
00:23:17,860 --> 00:23:23,500
I lack personal experience with it, but I would definitely study

459
00:23:23,500 --> 00:23:28,840
this topic closer because I thought if we use ICU collations,

460
00:23:30,180 --> 00:23:33,640
well, glibc change is not a problem anymore, So we can upgrade

461
00:23:33,640 --> 00:23:34,540
OS, right?

462
00:23:34,640 --> 00:23:40,360
But it looks like an ICU library upgrade can be a problem.

463
00:23:40,640 --> 00:23:44,080
So we need to check this, right?

464
00:23:44,680 --> 00:23:49,200
We should be very careful considering what happened in the past

465
00:23:49,200 --> 00:23:50,780
with glibc changes.

466
00:23:51,760 --> 00:23:56,200
I remember it started from not knowing about it at all, then

467
00:23:56,200 --> 00:24:01,300
thinking only one change was a problem, and then adjusting vision

468
00:24:01,300 --> 00:24:05,540
like, you know what, every glibc version upgrade should be very

469
00:24:05,540 --> 00:24:10,760
carefully tested because things change quite often actually there.

470
00:24:11,520 --> 00:24:16,020
So I would bring this experience to here and then double check

471
00:24:16,020 --> 00:24:16,820
what's happening.

472
00:24:17,380 --> 00:24:20,500
Michael: It's probably a case for re-indexing at those times,

473
00:24:20,740 --> 00:24:27,240
like having maybe a prioritized list of indexes that you want

474
00:24:27,240 --> 00:24:30,540
to make sure, like maybe unique indexes at least.

475
00:24:32,780 --> 00:24:34,480
Nikolay: It's only about unique indexes.

476
00:24:34,620 --> 00:24:40,120
If we talk about corruption, we, again, order what matters.

477
00:24:40,160 --> 00:24:44,340
And we don't want some values to be returned in different order

478
00:24:44,340 --> 00:24:47,220
or search is not working as expected and so on.

479
00:24:47,220 --> 00:24:49,120
Michael: I was thinking about the limitation of only being able

480
00:24:49,120 --> 00:24:52,440
to concurrently create or like re-index one at a time.

481
00:24:52,500 --> 00:24:54,920
You probably want to prioritize lists, so which ones would you

482
00:24:54,920 --> 00:24:55,620
do first?

483
00:24:56,120 --> 00:24:58,480
Yeah, maybe your unique ones first.

484
00:24:58,520 --> 00:25:01,720
Nikolay: Yeah, during OS upgrades we usually look at them and

485
00:25:01,720 --> 00:25:07,120
using amcheck to identify indexes which are going to be corrupted

486
00:25:07,120 --> 00:25:13,740
and then we plan to just rebuild them or we rebuild them on if

487
00:25:13,740 --> 00:25:17,340
it's this upgrade involves logical replication.

488
00:25:17,440 --> 00:25:19,580
We rebuild them before we switch.

489
00:25:21,100 --> 00:25:23,860
This logical replication gives us flexibility here.

490
00:25:26,320 --> 00:25:31,300
But amcheck, back in the days, could only check B-tree indexes.

491
00:25:32,580 --> 00:25:36,880
There was a multi-year work to bring it to GiST and GIN.

492
00:25:37,340 --> 00:25:37,600
Yeah.

493
00:25:37,600 --> 00:25:38,540
I don't remember.

494
00:25:38,560 --> 00:25:40,300
There was some progress there recently.

495
00:25:40,380 --> 00:25:43,280
So Postgres 18 probably will bring something, but I don't remember

496
00:25:43,280 --> 00:25:44,540
from top of my head.

497
00:25:45,040 --> 00:25:48,540
So that's why you probably want to rebuild all GIN and GiST indexes

498
00:25:49,040 --> 00:25:52,780
just for safety, because corruption might happen there as well.

499
00:25:54,680 --> 00:25:55,620
Michael: So nice.

500
00:25:56,000 --> 00:25:56,500
Nikolay: Yeah.

501
00:25:56,920 --> 00:25:57,420
Okay.

502
00:25:58,100 --> 00:25:58,440
Yeah.

503
00:25:58,440 --> 00:25:59,720
I hope this was helpful.

504
00:25:59,960 --> 00:26:01,360
I learned something from you.

505
00:26:01,360 --> 00:26:02,040
Thank you so much.

506
00:26:02,040 --> 00:26:05,080
And next, next time I need this,
I probably will look into collations

507
00:26:05,080 --> 00:26:08,340
based on ICU and deterministic
equals false.

508
00:26:09,060 --> 00:26:09,840
Michael: Yeah, nice.

509
00:26:10,080 --> 00:26:10,520
Nikolay: That's a good idea, yeah.

510
00:26:10,520 --> 00:26:14,060
And probably we'll think about
accents next time.

511
00:26:14,060 --> 00:26:16,420
Yeah, it's a good thing to keep
in mind.

512
00:26:16,960 --> 00:26:17,660
Michael: For sure.

513
00:26:17,680 --> 00:26:18,120
All right.

514
00:26:18,120 --> 00:26:19,470
Nice one, Nikolay.

515
00:26:19,540 --> 00:26:19,980
Take care.

516
00:26:19,980 --> 00:26:20,320
Nikolay: Thank you.

517
00:26:20,320 --> 00:26:21,360
See you next time.