1
00:00:00,060 --> 00:00:02,360
Michael: Hello and welcome to Postgres.FM,
a weekly show about

2
00:00:02,360 --> 00:00:03,220
all things PostgreSQL.

3
00:00:03,480 --> 00:00:06,420
I am Michael, founder of pgMustard,
and this is my co-host Nikolay,

4
00:00:06,420 --> 00:00:07,640
founder of Postgres.AI.

5
00:00:07,640 --> 00:00:09,860
Hey Nikolay, what are we going
to be talking about today?

6
00:00:10,080 --> 00:00:13,860
Nikolay: Hi Michael, let's talk
about planning time and what

7
00:00:13,860 --> 00:00:16,460
can increase it, potentially.

8
00:00:17,520 --> 00:00:19,060
Michael: Nice. And this was your
suggestion.

9
00:00:19,080 --> 00:00:21,300
Why is it something you're thinking
about at the

10
00:00:21,900 --> 00:00:22,100
moment?

11
00:00:22,100 --> 00:00:27,580
Nikolay: Well, we are making circles
around this topic all the time

12
00:00:27,580 --> 00:00:33,660
because basically almost all of
our customers, which are mostly

13
00:00:34,640 --> 00:00:38,860
fast-growing startups, they reach
the point where partitioning

14
00:00:39,060 --> 00:00:42,840
is needed, they reach the point
where LockManager contention

15
00:00:43,080 --> 00:00:49,140
happens, the number of indexes becomes
a concern, and also many of

16
00:00:49,140 --> 00:00:54,560
them still don't use prepared statements
mostly because PgBouncer

17
00:00:54,560 --> 00:00:58,680
didn't support it until
very recently, right?

18
00:00:59,080 --> 00:01:03,600
This means that planning time being
basically unnoticed for quite

19
00:01:03,600 --> 00:01:06,180
long suddenly becomes an issue.

20
00:01:06,680 --> 00:01:11,480
And we often show them, oh, it
looks like planning time is an

21
00:01:11,480 --> 00:01:12,740
issue in these cases.

22
00:01:14,280 --> 00:01:18,480
And actually, especially the partitioning
thing, it's like, there's

23
00:01:18,520 --> 00:01:21,420
always a question, especially for
time series.

24
00:01:21,900 --> 00:01:27,540
It's always a question how big
partitions should be, which can

25
00:01:27,540 --> 00:01:31,240
be translated to how many partitions
you will have.

26
00:01:32,580 --> 00:01:38,600
And if you don't partition well,
you may end up having many thousands

27
00:01:38,620 --> 00:01:39,340
of partitions.

28
00:01:40,480 --> 00:01:41,820
They will be very small.

29
00:01:42,660 --> 00:01:47,220
But if you don't have prepared
statements, you can end up having

30
00:01:47,220 --> 00:01:49,380
a huge problem with planning time.

31
00:01:50,320 --> 00:01:53,860
And planning time can exceed execution
time for simple primary

32
00:01:53,860 --> 00:01:55,120
key lookups, right?

33
00:01:55,440 --> 00:01:56,460
This is not fun.
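To make the planning-vs-execution crossover concrete, here is a toy model (not Postgres internals; the constants are invented, chosen only to land near the ballpark numbers mentioned in this episode), assuming planning cost grows linearly with partition count while a primary-key lookup executes in roughly constant time:

```python
# Toy model: planning cost grows linearly with partition count,
# while a simple primary-key lookup executes in constant time.
# All constants are made up for illustration (microseconds).
PLANNING_BASE_US = 100        # fixed planning overhead
PLANNING_PER_PARTITION_US = 4 # marginal cost per partition
EXECUTION_US = 200            # constant-time PK index lookup

def planning_us(partitions):
    """Linear planning-time model, in microseconds."""
    return PLANNING_BASE_US + PLANNING_PER_PARTITION_US * partitions

def crossover_partitions():
    """Smallest partition count where planning exceeds execution."""
    n = 0
    while planning_us(n) <= EXECUTION_US:
        n += 1
    return n

print(crossover_partitions())      # 26 partitions under this model
print(planning_us(10_000) / 1000)  # 40.1 ms at 10,000 partitions
```

With these made-up constants, planning overtakes a 0.2 ms lookup after only a couple dozen partitions, which is the point Nikolay makes here.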

34
00:01:56,820 --> 00:01:59,060
Michael: Yeah, it's not the only
case of that, right?

35
00:01:59,060 --> 00:02:01,800
But it definitely makes sense that
it's a common one.

36
00:02:01,880 --> 00:02:05,360
Nikolay: Yeah. Basically, last
week we discussed with a couple

37
00:02:05,360 --> 00:02:08,360
of companies how to properly implement
partitioning.

38
00:02:09,520 --> 00:02:13,020
And this question was in the center,
what size of partitions

39
00:02:13,020 --> 00:02:13,800
to choose?

40
00:02:14,440 --> 00:02:21,200
And the rule of thumb is like,
don't let them be larger than

41
00:02:21,200 --> 00:02:22,860
100 gigabytes, right?

42
00:02:23,640 --> 00:02:26,340
Because this is the golden rule
of partitioning.

43
00:02:26,680 --> 00:02:30,700
It's time to apply partitioning
when there's a risk: the table has already

44
00:02:30,700 --> 00:02:32,300
exceeded 100 gigabytes.
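The 100-gigabyte rule of thumb can be sketched as a trivial sizing calculation (a hypothetical helper, not from any tool mentioned here; it only captures the upper bound, while the rest of the discussion warns against overshooting into thousands of tiny partitions):

```python
# Rule of thumb from the discussion: keep each partition under ~100 GB.
def partitions_needed(table_size_gb, max_partition_gb=100):
    """Minimum partition count so no partition exceeds the size cap."""
    return -(-table_size_gb // max_partition_gb)  # ceiling division

print(partitions_needed(750))     # 8 partitions for a 750 GB table
print(partitions_needed(10_000))  # 100 partitions for a 10 TB table
```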

45
00:02:32,540 --> 00:02:35,340
But it also can be applied to partitions.

46
00:02:35,860 --> 00:02:38,740
Also, they should be balanced in
terms of size.

47
00:02:39,180 --> 00:02:41,300
This is usually a good idea to
do.

48
00:02:41,760 --> 00:02:46,220
But on the other hand, if you make
them tiny, you end up having

49
00:02:46,220 --> 00:02:47,940
a lot of them and it's not good.

50
00:02:47,980 --> 00:02:49,020
Why is it not good?

51
00:02:49,020 --> 00:02:50,780
Because of planning time, mostly.

52
00:02:51,900 --> 00:02:55,640
This is not the only reason why
you don't want a lot of partitions,

53
00:02:55,680 --> 00:02:57,540
but this is one of the biggest reasons.

54
00:02:58,180 --> 00:03:01,440
If you have, for example, 10,000
partitions, you might have

55
00:03:02,780 --> 00:03:06,560
dozens of milliseconds or like
maybe a second of planning time

56
00:03:06,560 --> 00:03:07,580
for each query.

57
00:03:08,240 --> 00:03:12,440
If prepared statements are not used,
every time a SQL query is

58
00:03:12,440 --> 00:03:16,880
executed, planning happens
in Postgres, right?

59
00:03:17,420 --> 00:03:19,540
Michael: Yeah, and you're talking
about the best case.

60
00:03:19,640 --> 00:03:23,500
Planning time can expand depending
on the complexity of the query.

61
00:03:23,500 --> 00:03:26,860
You're talking about for the simplest
possible query.

62
00:03:26,960 --> 00:03:27,680
Nikolay: Very simple.

63
00:03:28,100 --> 00:03:30,040
B-tree index scan.

64
00:03:31,220 --> 00:03:31,720
Michael: Yeah.

65
00:03:32,660 --> 00:03:35,600
Is it worth us explaining what
are we talking about?

66
00:03:35,600 --> 00:03:36,800
What is planning time?

67
00:03:36,860 --> 00:03:38,360
Nikolay: Yeah, let's do it.

68
00:03:38,420 --> 00:03:41,040
So when a query is executed, there
are stages, right?

69
00:03:41,040 --> 00:03:42,320
Do you remember stages?

70
00:03:43,260 --> 00:03:47,720
Planning, it means, well, the planner
is based on statistics

71
00:03:48,040 --> 00:03:48,840
and considers...

72
00:03:49,900 --> 00:03:54,400
The goal of the planning is to
choose the best plan as fast as

73
00:03:54,400 --> 00:03:54,900
possible.

74
00:03:55,400 --> 00:03:56,540
It might be wrong.

75
00:03:56,640 --> 00:03:58,220
Very often it's wrong, actually.

76
00:03:58,580 --> 00:03:59,280
Not very.

77
00:03:59,280 --> 00:04:00,080
It happens.

78
00:04:02,020 --> 00:04:07,720
For me, it's every day I deal with
such cases, because I usually see

79
00:04:07,720 --> 00:04:10,320
problems, not like ideal situations.

80
00:04:12,800 --> 00:04:18,000
This is when my team and I are
involved when something is wrong.

81
00:04:18,580 --> 00:04:24,220
So it needs to choose as fast as
possible the kind of optimal

82
00:04:24,240 --> 00:04:24,740
plan.

83
00:04:24,840 --> 00:04:28,380
And then when a plan is chosen,
execution happens.

84
00:04:29,020 --> 00:04:32,120
Executor executes according to
the plan.

85
00:04:33,080 --> 00:04:36,980
And there are like many, many,
many possible plans.

86
00:04:37,820 --> 00:04:38,860
Many possible plans.

87
00:04:38,860 --> 00:04:42,040
Sometimes I think an infinite number
of possible plans.

88
00:04:42,560 --> 00:04:43,040
Right?

89
00:04:43,040 --> 00:04:44,380
Michael: Well, it depends, right?

90
00:04:44,380 --> 00:04:49,260
Like, if you're doing a single-table
SELECT and there are no indexes,

91
00:04:49,540 --> 00:04:51,680
there may be only one possible
plan.

92
00:04:53,480 --> 00:04:54,640
But once you include...

93
00:04:55,080 --> 00:04:56,080
Nikolay: What about parallelization?

94
00:04:56,580 --> 00:04:58,940
There's a couple of possible plans,
right?

95
00:04:59,020 --> 00:05:01,040
Michael: Yeah, but all I mean is
as the...

96
00:05:01,560 --> 00:05:05,760
It gets important, like, for example,
when you start joining

97
00:05:05,760 --> 00:05:09,280
multiple tables and you start to
realize that you have multiple

98
00:05:09,280 --> 00:05:14,600
scan choices, multiple join orders,
multiple join types that

99
00:05:14,600 --> 00:05:15,260
you could...

100
00:05:15,380 --> 00:05:17,540
Nikolay: Multiple indexes are present.

101
00:05:18,180 --> 00:05:20,100
The planner needs to choose them properly.

102
00:05:20,380 --> 00:05:20,940
So, yeah.

103
00:05:20,940 --> 00:05:22,480
Michael: And these things are combinatorial.

104
00:05:22,800 --> 00:05:26,700
These things multiply to create
the number of total possible

105
00:05:26,820 --> 00:05:27,320
plans.

106
00:05:27,840 --> 00:05:31,940
And it quickly, like as the number
of joins, for example, increases,

107
00:05:32,600 --> 00:05:34,220
that can explode exponentially.
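The combinatorial explosion can be sketched by counting join trees alone. Ignoring scan-method and join-algorithm choices (which multiply the total further), the number of distinct binary join trees over n relations is n! orderings times the (n-1)-th Catalan number of tree shapes. This growth is why Postgres falls back to the genetic optimizer (GEQO) once a query joins geqo_threshold or more items (12 by default):

```python
import math

def catalan(k):
    """k-th Catalan number: counts binary tree shapes with k internal nodes."""
    return math.comb(2 * k, k) // (k + 1)

def join_tree_count(n):
    """Distinct binary join trees over n relations:
    n! leaf orderings times catalan(n - 1) tree shapes."""
    return math.factorial(n) * catalan(n - 1)

for n in (2, 4, 8, 12):
    print(n, join_tree_count(n))
# 2 relations -> 2 trees, 4 -> 120, 8 -> 17,297,280,
# 12 -> roughly 2.8e13, far too many to enumerate exhaustively
```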

108
00:05:35,800 --> 00:05:38,260
So there are some things in Postgres
too.

109
00:05:38,260 --> 00:05:40,780
Nikolay: There's also a genetic
optimizer, right?

110
00:05:41,520 --> 00:05:42,020
Michael: Yes.

111
00:05:42,040 --> 00:05:43,480
So because it can...

112
00:05:43,940 --> 00:05:47,500
Well, I guess the simple version
is that the planner isn't always

113
00:05:47,500 --> 00:05:50,460
looking for the best plan in those
cases.

114
00:05:50,460 --> 00:05:53,360
It's sometimes looking for a good
enough plan in a reasonable

115
00:05:53,560 --> 00:05:54,360
time frame.

116
00:05:54,520 --> 00:05:58,180
Because if we're taking way longer
to plan the query, if we take

117
00:05:58,180 --> 00:06:01,160
multiple seconds to plan a query
that's probably going to execute

118
00:06:01,160 --> 00:06:04,840
in a few milliseconds, it's probably
a complete waste of time.

119
00:06:04,840 --> 00:06:07,780
So there's a balancing act here
for complex queries.

120
00:06:07,800 --> 00:06:12,340
But for simple queries, it doesn't
make sense to be paying dozens

121
00:06:12,340 --> 00:06:16,960
of milliseconds for a 0.2 millisecond
primary key lookup.

122
00:06:17,640 --> 00:06:22,020
Nikolay: Right. So the process
of finding the ideal plan is called

123
00:06:22,020 --> 00:06:24,300
optimization, plan optimization.

124
00:06:24,840 --> 00:06:29,520
And the planner deals with basically
several things.

125
00:06:30,200 --> 00:06:34,500
First of all, decisions of course
heavily depend on the version

126
00:06:34,500 --> 00:06:37,900
of Postgres you use, because improvements
are being

127
00:06:37,900 --> 00:06:38,400
made.

128
00:06:38,960 --> 00:06:44,360
Second of all, it uses statistics
currently present in Postgres.

129
00:06:44,440 --> 00:06:49,020
That's why it's important to keep
autovacuum tuned so it maintains

130
00:06:49,120 --> 00:06:49,780
the statistics.

131
00:06:50,080 --> 00:06:53,820
And this is not only pg_statistic,
it's also the estimated number

132
00:06:54,320 --> 00:06:59,240
of tuples (reltuples) and pages (relpages)
in pg_class, the number of pages the

133
00:06:59,240 --> 00:07:00,860
table or index has.

134
00:07:01,120 --> 00:07:02,220
This is super important.

135
00:07:02,400 --> 00:07:05,820
And finally, it also takes into
account the settings Postgres

136
00:07:05,820 --> 00:07:06,760
currently has.

137
00:07:06,780 --> 00:07:10,940
An interesting point, as I also
mentioned many, many times, those

138
00:07:10,940 --> 00:07:15,240
who have listened to us for a while,
they know it, right?

139
00:07:15,240 --> 00:07:19,240
It's interesting that the planner
has no idea about the real

140
00:07:19,240 --> 00:07:21,440
physical capabilities of your machine.

141
00:07:21,780 --> 00:07:26,900
It doesn't know what type of processor
you have, how many cores,

142
00:07:28,360 --> 00:07:33,960
how fast disks are, or how many
gigabytes of RAM you provided.

143
00:07:34,460 --> 00:07:35,500
It has no idea.

144
00:07:36,180 --> 00:07:37,580
It relies on the settings.

145
00:07:38,480 --> 00:07:40,220
First of all, costs.

146
00:07:41,040 --> 00:07:45,140
So, random_page_cost, seq_page_cost,
and the CPU costs like cpu_tuple_cost

147
00:07:45,340 --> 00:07:46,520
and so on.

148
00:07:46,740 --> 00:07:51,560
And also, settings like work_mem
and some estimate of available

149
00:07:51,560 --> 00:07:53,800
memory called effective_cache_size.

150
00:07:53,940 --> 00:07:58,140
And you can put any number there
even if you don't have those

151
00:07:58,140 --> 00:07:59,440
gigabytes of RAM.

152
00:07:59,500 --> 00:08:05,080
This is bad because if you forgot
to tune your server, Postgres

153
00:08:05,080 --> 00:08:08,360
might not know that you have a
lot of RAM, like a terabyte of

154
00:08:08,360 --> 00:08:10,900
RAM, but Postgres still thinks
you have only 1 gigabyte, for

155
00:08:10,900 --> 00:08:11,280
example.

156
00:08:11,280 --> 00:08:15,260
It's easy to be in this situation
and it's bad.

157
00:08:15,360 --> 00:08:19,360
But at the same time, it's good
for us who deal with a lot of

158
00:08:19,360 --> 00:08:22,740
experiments, because it means that
we can fool Postgres and

159
00:08:22,740 --> 00:08:26,280
say, oh, you know, we have the
same number of gigabytes as production

160
00:08:26,280 --> 00:08:26,780
has.

161
00:08:27,360 --> 00:08:31,940
And this helps us to reproduce
the problem in lower environments

162
00:08:32,220 --> 00:08:35,140
when we have a much weaker machine.

163
00:08:35,500 --> 00:08:36,860
This is great for experimentation.

164
00:08:37,900 --> 00:08:39,120
So good and bad.

165
00:08:39,160 --> 00:08:40,140
So that's it.

166
00:08:40,280 --> 00:08:46,180
This is everything that matters
when optimization happens, when

167
00:08:46,220 --> 00:08:51,380
the planner chooses the plan the
executor will be using.

168
00:08:53,100 --> 00:08:58,260
Michael: Yeah, for anybody that's familiar
with EXPLAIN, I've had

169
00:08:58,260 --> 00:09:02,560
success describing planning time
as the time it takes Postgres

170
00:09:02,560 --> 00:09:04,540
to come up with that explain plan.

171
00:09:04,900 --> 00:09:06,180
That's what it's doing.

172
00:09:06,180 --> 00:09:07,900
And you can even see this.

173
00:09:08,720 --> 00:09:12,240
By default, explain on its own
doesn't show planning time.

174
00:09:12,380 --> 00:09:16,580
But if you do explain summary,
so EXPLAIN with the parameter

175
00:09:16,640 --> 00:09:19,540
summary, it will show you planning
time as part of that.

176
00:09:19,740 --> 00:09:22,100
So it's quite a cool way of thinking.

177
00:09:22,900 --> 00:09:23,800
Yes, exactly.

178
00:09:23,800 --> 00:09:24,520
At the bottom.

179
00:09:25,020 --> 00:09:29,340
Nikolay: Or if you ask the executor
to be involved and you run EXPLAIN

180
00:09:29,340 --> 00:09:33,940
ANALYZE, then SUMMARY is
turned on automatically, right?

181
00:09:33,940 --> 00:09:34,440
Michael: Yeah.

182
00:09:34,540 --> 00:09:37,700
And you'll see the planning time

183
00:09:37,700 --> 00:09:39,360
and the execution time then.

184
00:09:39,480 --> 00:09:42,500
And the planning time is roughly
the time it took to run Explain

185
00:09:42,500 --> 00:09:46,120
on its own, and the execution time,
in addition to the planning

186
00:09:46,120 --> 00:09:49,080
time, is the time taken roughly
to run explain analyze.

187
00:09:49,280 --> 00:09:51,800
So that's quite a cool way of thinking
about the difference.
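Both numbers appear as trailing lines of the plan text, so they are easy to scrape; a minimal helper along these lines (a hypothetical sketch, not part of any tool discussed here) could be:

```python
import re

def explain_times_ms(plan_text):
    """Extract 'Planning Time' / 'Execution Time' (in ms) from
    EXPLAIN (SUMMARY) or EXPLAIN ANALYZE text output."""
    times = {}
    for label in ("Planning Time", "Execution Time"):
        match = re.search(rf"{label}: ([\d.]+) ms", plan_text)
        if match:
            times[label] = float(match.group(1))
    return times

sample = """Index Scan using t_pkey on t  (cost=0.43..8.45 rows=1 width=16)
Planning Time: 42.180 ms
Execution Time: 0.205 ms"""
print(explain_times_ms(sample))
# {'Planning Time': 42.18, 'Execution Time': 0.205}
```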

188
00:09:52,280 --> 00:09:52,740
Nikolay: Yeah.

189
00:09:52,740 --> 00:09:57,320
And it's worth mentioning, again,
those who have listened to us for long

190
00:09:57,380 --> 00:09:59,660
know it very well, the word buffers.

191
00:10:00,580 --> 00:10:03,780
Since some version, like several
years ago, it was implemented,

192
00:10:03,900 --> 00:10:05,920
I think Postgres 13 maybe.

193
00:10:07,120 --> 00:10:11,280
You can use BUFFERS, EXPLAIN (BUFFERS),
even without ANALYZE, only the

194
00:10:11,280 --> 00:10:15,640
planning phase, and you can see
how many buffers were involved

195
00:10:15,660 --> 00:10:16,260
in planning.

196
00:10:16,260 --> 00:10:21,560
And this is super cool because
usually bad timing is because

197
00:10:21,560 --> 00:10:24,860
of a lot of IO, which is buffers,
right?

198
00:10:24,860 --> 00:10:29,200
I/O is the number one reason why Postgres
can be slow at anything.
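Each buffer counted by EXPLAIN (BUFFERS) is one page, 8 kB with the default block size, so planning buffer hits convert directly into a data volume; a small sketch of that arithmetic (the 25,000-hit example is made up for illustration):

```python
# One Postgres buffer = one page (8 kB with the default BLCKSZ).
def buffers_to_bytes(buffer_count, block_size=8192):
    return buffer_count * block_size

def pretty(n_bytes):
    """Human-readable byte count."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n_bytes < 1024:
            return f"{n_bytes:g} {unit}"
        n_bytes /= 1024
    return f"{n_bytes:g} TiB"

# e.g. 25,000 buffer hits during planning on a heavily partitioned table:
print(pretty(buffers_to_bytes(25_000)))  # ~195 MiB touched just to plan
```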

199
00:10:30,340 --> 00:10:30,840
Michael: Yeah.

200
00:10:31,380 --> 00:10:35,440
And you might think like, what's
Postgres reading during planning

201
00:10:35,440 --> 00:10:35,940
time?

202
00:10:36,620 --> 00:10:37,120
Nikolay: Yeah.

203
00:10:37,440 --> 00:10:37,940
Yeah.

204
00:10:38,260 --> 00:10:42,180
Actually, you know what, I described
a picture which is like

205
00:10:42,260 --> 00:10:48,080
maybe true in 99% of cases, but I now remember
a case where something

206
00:10:48,080 --> 00:10:49,100
else was happening.

207
00:10:49,860 --> 00:10:52,580
It was some interesting query.

208
00:10:53,140 --> 00:10:56,420
I don't remember how many tables,
but definitely it was a join,

209
00:10:56,580 --> 00:10:57,880
maybe a couple of joins.

210
00:10:59,060 --> 00:11:05,460
And I remember planning time was
completely abnormal.

211
00:11:05,460 --> 00:11:09,580
It was like half a minute or so.

212
00:11:10,240 --> 00:11:11,820
Michael: I think I remember this
case.

213
00:11:11,820 --> 00:11:12,840
Was it the merge-join?

214
00:11:13,320 --> 00:11:13,820
Nikolay: Exactly.

215
00:11:13,940 --> 00:11:15,400
Yeah.
It was not merge-join.

216
00:11:15,600 --> 00:11:20,320
Merge-join was not considered the
best plan, it was not chosen,

217
00:11:21,180 --> 00:11:23,940
but it was the reason why it was
so slow.

218
00:11:24,860 --> 00:11:30,800
While considering the merge-join path, the
planner used not only what I

219
00:11:30,800 --> 00:11:32,700
just described, but additional
information.

220
00:11:33,240 --> 00:11:39,780
It tried to find the minimum or
maximum values for some subsets

221
00:11:39,840 --> 00:11:44,080
of data, the edges, real values from
the table.

222
00:11:44,280 --> 00:11:49,220
So it needed to, during planning
time, it needed real values.

223
00:11:49,760 --> 00:11:52,840
And this, for some reason, was
super slow.

224
00:11:53,300 --> 00:11:56,660
So planning took many, many,
many seconds exceeding statement

225
00:11:56,660 --> 00:11:59,020
timeout and failing.

226
00:11:59,760 --> 00:12:02,220
It was extremely hard to diagnose,
actually.

227
00:12:02,800 --> 00:12:03,600
And I actually...

228
00:12:03,600 --> 00:12:04,700
Michael: Do you remember why?

229
00:12:04,840 --> 00:12:05,340
Nikolay: Yeah.

230
00:12:06,060 --> 00:12:06,800
Why what?

231
00:12:07,420 --> 00:12:08,820
Michael: Why it was hard to diagnose?

232
00:12:09,100 --> 00:12:12,180
Nikolay: Because EXPLAIN doesn't
show other options.

233
00:12:12,300 --> 00:12:13,820
It shows only the winner.

234
00:12:14,760 --> 00:12:18,200
Michael: Well, there were also extra reasons
it was difficult.

235
00:12:19,460 --> 00:12:22,740
I remember talking to you at the
time, auto_explain doesn't show

236
00:12:22,740 --> 00:12:25,760
planning time, and therefore, when
you're trying to diagnose...

237
00:12:25,760 --> 00:12:26,180
It's not

238
00:12:26,180 --> 00:12:29,140
Nikolay: a problem for us because,
well, we didn't use it at

239
00:12:29,140 --> 00:12:30,240
that time for this project.

240
00:12:30,240 --> 00:12:35,400
We didn't use auto_explain at all,
but we had DBLab Engine installed

241
00:12:35,740 --> 00:12:41,020
and I knew when these problems
started happening, so I quickly

242
00:12:41,520 --> 00:12:44,520
rewound, created a snapshot and
a clone.

243
00:12:45,060 --> 00:12:48,400
So yeah, I worked on the clone
and reproduced the problem easily.

244
00:12:48,460 --> 00:12:51,860
Interesting that like one hour later
it was different.

245
00:12:52,540 --> 00:12:56,760
And statistics changed somehow
and the planner stopped considering

246
00:12:56,980 --> 00:12:59,640
merge path, merge join path.

247
00:13:00,040 --> 00:13:03,520
Yeah, so it self-healed, but it
was super scary what happened

248
00:13:03,520 --> 00:13:04,700
one hour earlier, right?

249
00:13:04,780 --> 00:13:08,540
But thanks to DBLab Engine, we
easily reproduced it.

250
00:13:08,720 --> 00:13:11,960
So I agree with you, auto_explain,
not showing planning time

251
00:13:11,960 --> 00:13:12,760
at that time.

252
00:13:12,800 --> 00:13:15,420
It was an issue, but not in this
project particularly.

253
00:13:16,280 --> 00:13:20,940
What was an issue is you reproduced
it, you see super slow planning

254
00:13:20,940 --> 00:13:23,300
time, but you have no idea what's
happening.

255
00:13:23,700 --> 00:13:30,260
And you try to understand, if only we
had a way to see the second-best,

256
00:13:30,720 --> 00:13:32,060
third-best plans.

257
00:13:32,500 --> 00:13:36,420
In this case, we probably, yeah,
second best probably would be

258
00:13:36,420 --> 00:13:40,440
merge join, it lost eventually,
but during consideration of it,

259
00:13:40,440 --> 00:13:45,040
it was like the Postgres planner spent
a lot of time considering it

260
00:13:45,040 --> 00:13:46,620
because it needed real values.

261
00:13:46,840 --> 00:13:52,360
I think what helped in that case,
actually, is a friend's advice,

262
00:13:52,480 --> 00:13:55,880
because Kukushkin first said, have
you tried SET enable_mergejoin

263
00:13:57,340 --> 00:13:58,040
to off?

264
00:14:00,720 --> 00:14:01,760
This is weird.

265
00:14:01,920 --> 00:14:05,580
But once I did it, planning time
went to almost zero.

266
00:14:05,580 --> 00:14:06,360
Like, boom.

267
00:14:06,660 --> 00:14:08,500
It was an interesting case.

268
00:14:08,640 --> 00:14:13,600
And of course, when you try to
dig deeper, you can use, like, perf

269
00:14:13,600 --> 00:14:18,120
or eBPF and understand what's happening
under the hood and in

270
00:14:18,120 --> 00:14:20,240
which functions Postgres spends
time.

271
00:14:21,160 --> 00:14:23,860
And you quickly find it's the merge-join
path.

272
00:14:24,760 --> 00:14:29,080
And then read the code, because
it's open source, this is cool.

273
00:14:29,340 --> 00:14:33,240
Read the code and understand that
Postgres, when analyzing a merge join,

274
00:14:33,740 --> 00:14:37,160
sometimes needs real values from
the table.

275
00:14:37,680 --> 00:14:38,180
Wow.

276
00:14:39,660 --> 00:14:41,780
It was exotic, honestly.

277
00:14:42,480 --> 00:14:45,920
Michael: Yeah, that is exotic,
but it's also very helpful in

278
00:14:45,920 --> 00:14:49,140
understanding how the optimizer
is working.

279
00:14:50,280 --> 00:14:54,520
Because it isn't exhaustively calculating
every plan that is

280
00:14:54,520 --> 00:14:55,020
possible.

281
00:14:55,080 --> 00:14:59,560
It can give up on plans once it
gets to a sufficiently high cost.

282
00:14:59,680 --> 00:15:06,100
So, because of the way enable_mergejoin
= off works, it's penalizing

283
00:15:06,340 --> 00:15:06,600
merge join.

284
00:15:06,600 --> 00:15:09,720
It's not saying you can't ever
have a merge join.

285
00:15:09,720 --> 00:15:13,300
It penalizes it cost-wise, at which
point the planner gives up

286
00:15:13,300 --> 00:15:13,940
on that.

287
00:15:13,940 --> 00:15:16,740
Nikolay: Yeah, it discards very
quickly because the cost is already

288
00:15:17,040 --> 00:15:17,940
super high.
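A sketch of this mechanism (simplified: in Postgres releases before 17, enable_mergejoin = off and its siblings add a huge constant, disable_cost = 1.0e10 in costsize.c, to the path's estimated cost rather than forbidding the path outright):

```python
# Disabled paths are not removed; they get a prohibitive cost penalty,
# so ordinary cheapest-path comparison discards them almost immediately.
DISABLE_COST = 1.0e10  # same constant Postgres uses (pre-v17)

def cheapest(paths, disabled=()):
    """Pick the lowest-cost (name, cost) path, penalizing disabled kinds."""
    def effective_cost(path):
        name, cost = path
        return cost + (DISABLE_COST if name in disabled else 0.0)
    return min(paths, key=effective_cost)

paths = [("merge join", 120.0), ("hash join", 450.0)]
print(cheapest(paths))                           # ('merge join', 120.0)
print(cheapest(paths, disabled={"merge join"}))  # ('hash join', 450.0)
```

This is why a disabled merge join can still appear in a plan when no alternative path exists at all: the penalty makes it lose every comparison, but it never becomes impossible.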

289
00:15:18,400 --> 00:15:22,480
Michael: But I don't think that's
how everybody thinks of how

290
00:15:22,480 --> 00:15:23,240
the planner works.

291
00:15:23,240 --> 00:15:26,860
They think it's going all the way
to the end of each option.

292
00:15:26,940 --> 00:15:28,640
But it can abort early.

293
00:15:28,700 --> 00:15:31,020
And this is one of those trade-offs
it's making.

294
00:15:31,060 --> 00:15:33,520
It will sometimes make a mistake
because it seems like it's gonna

295
00:15:33,520 --> 00:15:35,420
be an expensive plan, but then
it isn't.

296
00:15:35,940 --> 00:15:39,340
So it isn't always looking for,
in my opinion, the fastest plan,

297
00:15:39,340 --> 00:15:41,300
it's looking for a fast enough
plan.

298
00:15:41,400 --> 00:15:44,740
Like that's quite like, well, maybe
I'm wrong in that.

299
00:15:45,020 --> 00:15:47,620
Like it is looking for the fastest,
but it uses these shortcuts

300
00:15:48,120 --> 00:15:51,480
to avoid spending excessive time
to do so.

301
00:15:52,740 --> 00:15:53,700
Nikolay: Yeah, yeah, exactly.

302
00:15:53,800 --> 00:15:59,800
As usual, I like exotic and hard
problems, but let's maybe switch

303
00:15:59,800 --> 00:16:01,240
to simple things.

304
00:16:01,980 --> 00:16:03,160
Michael: Yeah, good idea.

305
00:16:03,620 --> 00:16:08,800
Nikolay: Yeah, you sent me a very
good post by depesz, Hubert

306
00:16:08,860 --> 00:16:10,520
Lubaczewski, depesz, right?

307
00:16:11,040 --> 00:16:12,840
And it was 3 years ago.

308
00:16:13,440 --> 00:16:18,640
He was curious about, is it really
so that Postgres doesn't allow

309
00:16:18,640 --> 00:16:21,960
more than 65,000 partitions?

310
00:16:22,360 --> 00:16:24,780
And it turned out it's not so.

311
00:16:25,080 --> 00:16:27,240
He reached 100,000 partitions.

312
00:16:28,680 --> 00:16:32,540
But at the same time, Hubert checked
the planning time.

313
00:16:32,860 --> 00:16:37,500
How exactly the planning time depends
on the number of partitions.

314
00:16:38,680 --> 00:16:40,200
And this is back to

315
00:16:40,200 --> 00:16:42,660
Michael: the initial question,
the initial reason you were looking

316
00:16:42,660 --> 00:16:43,760
into planning time.

317
00:16:45,700 --> 00:16:45,944
Nikolay: Exactly.
Yeah.

318
00:16:45,944 --> 00:16:48,480
We don't want a lot of partitions,
many thousands of them, because

319
00:16:48,480 --> 00:16:54,060
planning time can be already high
— dozens of milliseconds, maybe

320
00:16:54,060 --> 00:16:56,140
hundreds of milliseconds, and so
on.

321
00:16:57,720 --> 00:16:59,880
Basically, the growth is linear.

322
00:17:00,860 --> 00:17:04,740
If you double the number of partitions,
you double planning time.

323
00:17:05,580 --> 00:17:06,080
Likely.

324
00:17:06,680 --> 00:17:09,820
So of course it depends on the
query you use, I guess.

325
00:17:10,080 --> 00:17:13,100
But the simple rule is just if
you have a lot of partitions,

326
00:17:13,100 --> 00:17:14,560
you have bad planning time.

327
00:17:15,060 --> 00:17:16,580
It means we need to be...

328
00:17:18,080 --> 00:17:22,400
I also think Postgres versions
matter a lot here, many optimizations

329
00:17:22,600 --> 00:17:23,600
were made, right?

330
00:17:24,520 --> 00:17:26,020
Michael: Well, I'm aware of one.

331
00:17:26,120 --> 00:17:29,560
So, depesz's test, he says in the
comments that he tested on

332
00:17:29,560 --> 00:17:34,200
version 14, which is important
because there's a commit in version

333
00:17:34,200 --> 00:17:39,340
15 a couple of years ago from David
Rowley, and in

334
00:17:39,340 --> 00:17:42,040
the release notes, it said, improve
planning time for queries

335
00:17:42,040 --> 00:17:43,640
referencing partitioned tables.

336
00:17:43,820 --> 00:17:47,660
This change helps when only a few
of many partitions are relevant,

337
00:17:47,780 --> 00:17:50,880
which is the case here where we're
doing a primary key lookup.

338
00:17:51,040 --> 00:17:55,360
Like, if we're just searching for
a single row in a single partition,

339
00:17:56,020 --> 00:17:59,860
of all the 100,000 partitions,
only one of them is relevant to

340
00:17:59,860 --> 00:18:02,220
this query.
We can prune that at planning time.

341
00:18:02,720 --> 00:18:07,260
In those cases, planning time should
be much reduced as of version

342
00:18:07,260 --> 00:18:08,160
15 onwards.

343
00:18:08,620 --> 00:18:12,940
Nikolay: Well, you know, I used
a different table structure and

344
00:18:12,940 --> 00:18:16,100
queries, but it was also a primary
key lookup.

345
00:18:16,100 --> 00:18:20,500
But basically, I reproduced the
same experiment using our AI

346
00:18:20,500 --> 00:18:21,000
system.

347
00:18:22,120 --> 00:18:26,300
I'm just checking right now, depesz
had for 10,000 partitions,

348
00:18:27,200 --> 00:18:30,060
he had like 40 milliseconds or
so, right?

349
00:18:30,060 --> 00:18:31,420
30, 40 milliseconds.

350
00:18:31,640 --> 00:18:36,420
It's pretty much the same as what
we observed with our AI system.

351
00:18:37,120 --> 00:18:42,700
But I also used so-called shared
environment experiments, so

352
00:18:42,700 --> 00:18:44,260
I was not alone on the machine.

353
00:18:44,260 --> 00:18:49,620
So that's why I involved buffers
in my analysis.

354
00:18:50,740 --> 00:18:56,260
Because this linear time growth,
according to my benchmarks,

355
00:18:57,100 --> 00:19:02,860
is explained by linear growth of
buffer hits associated with

356
00:19:02,860 --> 00:19:03,360
planning.

357
00:19:04,440 --> 00:19:10,880
So if you have 10,000 partitions,
it might be several dozens

358
00:19:10,960 --> 00:19:11,700
of milliseconds.

359
00:19:13,860 --> 00:19:14,980
So what do you think?

360
00:19:15,520 --> 00:19:16,800
What are you looking for?

361
00:19:17,620 --> 00:19:20,980
Michael: I was looking at depesz's
post just to see if he gave

362
00:19:20,980 --> 00:19:25,520
enough details around how, like,
did he pick the best of, you

363
00:19:25,520 --> 00:19:28,040
know, sometimes when you're doing
this kind of experiment, you

364
00:19:28,040 --> 00:19:32,420
would run the query a few times
and pick the fastest or the average.

365
00:19:32,680 --> 00:19:33,180
He

366
00:19:33,620 --> 00:19:34,300
Nikolay: did.
Several.

367
00:19:34,300 --> 00:19:35,280
I didn't.
Yeah.

368
00:19:35,280 --> 00:19:40,060
He used several, not many, I think
3 or maybe 5 runs only.

369
00:19:40,380 --> 00:19:42,740
I used only 1 run for now.

370
00:19:42,740 --> 00:19:47,360
I think it's a good idea to check,
to convert this experiment

371
00:19:47,360 --> 00:19:51,300
to a more full-fledged experiment,
maybe on a separate machine

372
00:19:51,680 --> 00:19:56,540
and maybe even with pgbench, and
have many, many competing, like,

373
00:19:56,680 --> 00:20:00,340
not competing, concurrent SELECTs.
The idea is, it's again about

374
00:20:00,340 --> 00:20:04,260
buffers: many, many buffer hits for
planning, and this explains

375
00:20:04,740 --> 00:20:06,700
the growth of planning time.

376
00:20:06,700 --> 00:20:10,740
Michael: Yeah, I like how in your
experiment, I like how clearly

377
00:20:11,040 --> 00:20:14,680
linear the chart with buffers is,
whereas the one with timing,

378
00:20:15,020 --> 00:20:17,860
it just... it looks like it might
be linear, but there's enough

379
00:20:18,400 --> 00:20:20,980
noise that it's tricky to see.

380
00:20:20,980 --> 00:20:23,920
Although there are a couple of
dips in the buffers chart, so

381
00:20:23,920 --> 00:20:25,660
there's something interesting going
on there.

382
00:20:26,180 --> 00:20:28,300
Nikolay: Yeah, I'm also interested.

383
00:20:28,880 --> 00:20:33,120
We also discussed before this call
that it really matters, is

384
00:20:33,120 --> 00:20:38,420
it first query when you just connected
or even not involving

385
00:20:38,420 --> 00:20:39,460
prepared statements.

386
00:20:39,560 --> 00:20:43,760
If you run the very first query,
you have extra penalty because

387
00:20:44,060 --> 00:20:48,480
in this connection, information
from system catalogs is not cached

388
00:20:48,480 --> 00:20:49,140
at all.

389
00:20:49,340 --> 00:20:53,440
And this adds, like, extra
buffer hits during planning

390
00:20:53,440 --> 00:20:53,940
time.

391
00:20:54,340 --> 00:20:55,680
Michael: Yeah, that's what we thought
it was.

392
00:20:55,680 --> 00:20:57,080
I don't know for sure.

393
00:20:57,240 --> 00:21:01,000
And it seems really interesting
like what it could be that's

394
00:21:01,000 --> 00:21:01,940
going on there.

395
00:21:02,120 --> 00:21:04,920
The reason I thought it might be
that is I've seen in the past

396
00:21:04,920 --> 00:21:09,860
when running experiments, if I
run ANALYZE, like nothing to have

397
00:21:09,860 --> 00:21:12,800
explained in this case, just gathering
new statistics for the

398
00:21:12,800 --> 00:21:16,620
table, I've seen planning buffers
spike on the next query execution

399
00:21:17,060 --> 00:21:19,280
and then come back down on the
second execution.

400
00:21:19,280 --> 00:21:22,380
So that makes me think it might
be statistics related.

401
00:21:22,760 --> 00:21:26,320
But yeah, I've seen VACUUM drastically
reduce buffers as well,

402
00:21:26,320 --> 00:21:29,700
not VACUUM ANALYZE, just VACUUM
on its own, and definitely not

403
00:21:29,700 --> 00:21:30,640
VACUUM FULL.

404
00:21:31,060 --> 00:21:35,940
So whilst buffers are much more
reliable and stable than timings

405
00:21:36,020 --> 00:21:39,060
for this kind of experiment, there
are a few things that can

406
00:21:39,060 --> 00:21:40,620
affect buffers still.

407
00:21:42,440 --> 00:21:43,560
Nikolay: Yeah, yeah, yeah.

408
00:21:43,750 --> 00:21:48,540
So anyway, I couldn't reproduce
the VACUUM thing and the ANALYZE thing,

409
00:21:48,540 --> 00:21:52,880
but I can imagine if you have complex
query, rebuilding statistics

410
00:21:53,180 --> 00:21:58,520
leads to many more hits from pg_statistic,
for example, of new

411
00:21:58,520 --> 00:21:59,680
data just collected.

412
00:22:00,320 --> 00:22:04,000
Especially if you have a lot of
buckets, if the default statistics

413
00:22:04,000 --> 00:22:05,780
target is increased and so on.

414
00:22:05,860 --> 00:22:07,940
It's a lot of data to hit additionally.

415
00:22:08,620 --> 00:22:10,100
I don't know, it's interesting.

416
00:22:10,320 --> 00:22:11,780
But anyway, back to partitioning.

417
00:22:13,840 --> 00:22:17,440
The current recommendation would
be, let's not allow the number

418
00:22:17,440 --> 00:22:22,200
of partitions to grow to many,
many thousands, unless we are

419
00:22:22,200 --> 00:22:24,360
okay with increased planning time.

420
00:22:24,520 --> 00:22:28,040
We can be okay with increased planning
time if we use, first

421
00:22:28,040 --> 00:22:29,940
of all, a pooler. The pooler matters.

422
00:22:30,060 --> 00:22:34,960
So the situation when it's the
very first query in session just

423
00:22:34,960 --> 00:22:37,780
established, it's very rare, right?

424
00:22:37,780 --> 00:22:40,820
It's good if it's really rare,
and the pooler helps.

425
00:22:41,680 --> 00:22:43,860
Even application-side pooler helps.

426
00:22:44,650 --> 00:22:50,980
And second, of course, prepared
statements just help us shave

427
00:22:50,980 --> 00:22:53,740
off planning time in a lot of situations.

428
00:22:54,600 --> 00:22:59,600
You remember Dirk from Adyen discussed
problems even if you use

429
00:22:59,600 --> 00:23:02,520
prepared statements, still there
might be interesting problems

430
00:23:02,520 --> 00:23:05,940
and we had a very interesting episode
number 100, right?

431
00:23:06,820 --> 00:23:09,960
It was an interesting discussion,
but it's already an advanced

432
00:23:09,960 --> 00:23:13,680
topic and at bare minimum it's
good if you know planning time

433
00:23:13,680 --> 00:23:14,940
is going up.

434
00:23:15,480 --> 00:23:20,200
It's good to consider prepared
statements and PgBouncer

435
00:23:20,200 --> 00:23:21,820
supports them now, so it's great.
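As a sketch of why prepared statements shave planning time: the statement is parsed once, and after several executions Postgres may switch to a cached generic plan, skipping replanning entirely. Table and statement names here are illustrative.

```sql
-- Illustrative only: 'users' and 'get_user' are hypothetical names.
PREPARE get_user (bigint) AS
  SELECT * FROM users WHERE id = $1;

EXECUTE get_user(42);

-- Since PostgreSQL 14, pg_prepared_statements shows whether a cached
-- generic plan is being reused instead of planning each execution:
SELECT name, generic_plans, custom_plans
FROM pg_prepared_statements;
```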

436
00:23:22,680 --> 00:23:25,840
And I must say, like, dozens of
milliseconds is already super

437
00:23:25,840 --> 00:23:27,840
slow overhead for planning.

438
00:23:27,860 --> 00:23:34,320
If we aim to have all queries fast,
fast for me means in OLTP

439
00:23:34,540 --> 00:23:35,200
as usual.

440
00:23:35,600 --> 00:23:37,540
This is episode number 1.

441
00:23:37,540 --> 00:23:38,800
What is slow query?

442
00:23:39,220 --> 00:23:43,280
Slow query is if it exceeds 100
milliseconds, 200 milliseconds,

443
00:23:43,860 --> 00:23:46,400
because this is human perception,
200 milliseconds.

444
00:23:48,280 --> 00:23:51,540
Most people already see it's quite
slow.

445
00:23:51,900 --> 00:23:56,580
And if 1 HTTP request consists of multiple SQL statements, this

446
00:23:56,580 --> 00:23:59,680
means we should be below like 10 milliseconds already.

447
00:24:00,140 --> 00:24:04,000
But if planning time adds 50 milliseconds, for example, if we

448
00:24:04,000 --> 00:24:11,100
have 20,000 partitions, for example, wow, it's not a good situation

449
00:24:11,100 --> 00:24:12,100
to be in, right?

450
00:24:12,340 --> 00:24:16,320
So prepared statements should be a good tool to fight with this

451
00:24:16,320 --> 00:24:16,820
problem.

452
00:24:17,420 --> 00:24:19,580
Or just keep partition numbers low.

453
00:24:20,580 --> 00:24:24,060
Michael: Well, a few times you've said, and I completely agree

454
00:24:24,060 --> 00:24:28,380
with you, that our primary server's CPU is our most limited resource,

455
00:24:29,180 --> 00:24:31,200
and planning time is CPU.

456
00:24:31,840 --> 00:24:37,460
It's a pure use of that, and if we can shave it, why not?

457
00:24:37,460 --> 00:24:41,860
Nikolay: It's CPU, but I insist it's again buffers.

458
00:24:41,880 --> 00:24:45,040
It's just buffer hits, it's related to memory, but yes, it's

459
00:24:45,040 --> 00:24:45,980
CPU, yes.

460
00:24:46,260 --> 00:24:47,540
But still it's I/O.

461
00:24:48,260 --> 00:24:49,640
I consider I/O

462
00:24:49,900 --> 00:24:55,020
in a wide sense: when we involve communication to memory also, I

463
00:24:55,020 --> 00:24:55,960
consider it I/O.

464
00:24:57,100 --> 00:24:58,780
Not disk I/O, of course, in this case.

465
00:24:58,780 --> 00:24:59,720
Most of the time.

466
00:24:59,720 --> 00:25:03,220
It can be buffer reads as well, but in rare cases.

467
00:25:03,460 --> 00:25:07,180
Sometimes we see buffer reads involved into planning as well.

468
00:25:07,740 --> 00:25:10,660
Michael: You mean like shared read rather than shared hit?

469
00:25:11,040 --> 00:25:13,780
Nikolay: Yeah, I call them shared buffer reads because...

470
00:25:14,200 --> 00:25:14,700
Michael: Shared...

471
00:25:15,140 --> 00:25:16,500
Nikolay: Yeah, we discussed it.

472
00:25:17,420 --> 00:25:23,200
I consider this bad naming because it's I/O operations, not amount

473
00:25:23,200 --> 00:25:23,860
of data.

474
00:25:25,240 --> 00:25:25,640
Right?

475
00:25:25,640 --> 00:25:26,140
Right.

476
00:25:26,260 --> 00:25:26,760
Yeah.

477
00:25:26,840 --> 00:25:29,240
So, and what else?

478
00:25:29,240 --> 00:25:30,180
Indexes, right?

479
00:25:30,180 --> 00:25:33,460
If we have a lot of indexes, planning time is increasing, increasing,

480
00:25:33,460 --> 00:25:34,420
increasing also.

481
00:25:35,060 --> 00:25:39,180
And I suspect it's not linear even, but it's worth double checking.

482
00:25:39,620 --> 00:25:44,240
So if we have, for example, 10 indexes, that's some penalty and

483
00:25:44,240 --> 00:25:48,260
planning time in a primary key lookup can be already like 1

484
00:25:48,260 --> 00:25:50,780
millisecond or so, or half a millisecond.

485
00:25:51,300 --> 00:25:56,380
But if we go, like if our table has many columns, we are tempted

486
00:25:56,380 --> 00:25:58,220
to have dozens of indexes.

487
00:25:58,740 --> 00:26:00,060
And that's really bad.

488
00:26:00,060 --> 00:26:04,580
And also, not only planning time increases itself due to buffer

489
00:26:04,740 --> 00:26:09,400
hits, again, buffer hits also grow here, but also locking.

490
00:26:09,960 --> 00:26:13,680
When planning happens, Postgres applies access share lock to

491
00:26:13,680 --> 00:26:15,160
each object involved.

492
00:26:15,160 --> 00:26:19,140
All indexes of a table are locked by access share lock.

493
00:26:20,220 --> 00:26:24,260
And this is quite expensive, I would say.

494
00:26:24,520 --> 00:26:29,020
If you have thousands of queries per second for this primary

495
00:26:29,020 --> 00:26:33,780
key lookup, it might be a problem,
especially if you exceed 15

496
00:26:33,860 --> 00:26:38,180
indexes, because the total number
of objects locked already reaches

497
00:26:38,180 --> 00:26:44,440
16, which is a threshold: above

498
00:26:44,440 --> 00:26:49,700
16, Postgres cannot use fast-path
locking in the LockManager anymore.
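One way to observe this, as a sketch: the fastpath column of pg_locks shows which relation locks of the current backend fit into the limited per-backend fast-path slots, and which had to go through the shared LockManager.

```sql
-- Run inside a transaction that has just planned/executed the query:
SELECT relation::regclass, mode, fastpath
FROM pg_locks
WHERE locktype = 'relation'
  AND pid = pg_backend_pid();
-- fastpath = false rows indicate locks taken via the shared LockManager.
```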

499
00:26:49,700 --> 00:26:51,540
And we discussed this as well.

500
00:26:51,960 --> 00:26:52,900
And this is the number.

501
00:26:52,900 --> 00:26:54,840
Michael: Is that number being increased
in?

502
00:26:55,680 --> 00:26:56,680
Nikolay: I'm not sure.

503
00:26:56,940 --> 00:27:00,000
There are discussions to make these
things configurable.

504
00:27:00,320 --> 00:27:02,220
Actually, I think I raised this
question.

505
00:27:02,360 --> 00:27:04,460
It turned out not to be so simple.

506
00:27:05,020 --> 00:27:05,660
I'm not sure.

507
00:27:05,660 --> 00:27:08,860
I think in 2017, nothing happened
with this.

508
00:27:09,240 --> 00:27:10,820
So what helps here?

509
00:27:10,840 --> 00:27:16,340
Like, add more CPU, more cores,
or decrease number of indexes,

510
00:27:17,080 --> 00:27:22,080
or again, prepared statements, right?

511
00:27:22,800 --> 00:27:23,540
Prepared statements.

512
00:27:23,740 --> 00:27:24,320
What else?

513
00:27:24,320 --> 00:27:28,480
What else helps in this situation
when you need?

514
00:27:28,660 --> 00:27:29,660
Yeah, I don't know.

515
00:27:29,660 --> 00:27:30,580
Michael: Maybe even denormalizing.

516
00:27:30,920 --> 00:27:35,780
If you've got a table with 15 indexes,
why do you have 15 indexes?

517
00:27:37,400 --> 00:27:42,100
Nikolay: Or this also helps, decrease
QPS of this kind, caching

518
00:27:42,340 --> 00:27:43,380
on application side.

519
00:27:43,380 --> 00:27:49,140
If you have 5,000 queries per second,
I see it all the time.

520
00:27:49,160 --> 00:27:54,320
So every click, an application
reads from the users table, checking

521
00:27:54,520 --> 00:27:55,360
who this is.

522
00:27:56,120 --> 00:27:57,140
Nothing changed.

523
00:27:57,540 --> 00:27:58,760
So cache it.

524
00:27:59,020 --> 00:28:01,080
Why are you reading this all the
time?

525
00:28:01,080 --> 00:28:03,760
Or Rails, recently we discussed
with 1 customer.

526
00:28:04,120 --> 00:28:08,600
Rails code is constantly bombarding
Postgres system catalogs,

527
00:28:08,740 --> 00:28:12,740
checking some pg_class, pg_attribute,
and so on.

528
00:28:12,840 --> 00:28:16,200
Checking schema all the time, all
the time, like thousands per

529
00:28:16,200 --> 00:28:16,700
second.

530
00:28:17,080 --> 00:28:17,580
Why?

531
00:28:18,200 --> 00:28:20,180
On replicas, nothing changed.

532
00:28:20,200 --> 00:28:22,020
It's changing, but not so fast.

533
00:28:22,420 --> 00:28:27,380
So lack of caching everywhere,
and this quickly can become a

534
00:28:27,380 --> 00:28:31,180
problem, especially if you don't
use prepared statements again.

535
00:28:31,720 --> 00:28:36,400
But interesting that we can combine
this problem and they can

536
00:28:36,480 --> 00:28:36,980
amplify.

537
00:28:38,360 --> 00:28:44,000
If you have, say, 10 indexes, it's
enough to have 3 partitions.

538
00:28:44,320 --> 00:28:49,100
Even 2 partitions is enough if
you haven't checked partition

539
00:28:49,200 --> 00:28:52,280
pruning and some queries involve
both partitions.

540
00:28:53,140 --> 00:28:58,380
You already have a situation where
22 objects need to be locked

541
00:28:58,380 --> 00:29:02,360
by access share lock and lock manager
contention might strike

542
00:29:02,360 --> 00:29:02,860
you.

543
00:29:03,420 --> 00:29:07,860
Jeremy Schneider had a great overview
of various works in this

544
00:29:07,860 --> 00:29:09,320
area from various companies.

545
00:29:09,960 --> 00:29:13,260
It's both about execution time
but mostly about planning time

546
00:29:13,260 --> 00:29:17,860
because planning time is worse
because this is when Postgres

547
00:29:17,880 --> 00:29:19,840
locks all the objects.

548
00:29:20,080 --> 00:29:23,260
During execution, it locks only
what is really needed.

549
00:29:25,000 --> 00:29:29,660
So yeah, the worst case, you have
partitioning, not many partitions,

550
00:29:29,800 --> 00:29:35,940
but also indexes, and then a lot
of QPS, and it can put your

551
00:29:35,980 --> 00:29:40,380
server down completely because
of LockManager, lightweight lock

552
00:29:40,380 --> 00:29:40,880
contention.

553
00:29:42,040 --> 00:29:48,040
Michael: Yeah, I think you've talked
about people who could be

554
00:29:48,040 --> 00:29:50,320
doing things for good reasons still
getting caught up.

555
00:29:50,320 --> 00:29:54,520
But I think there are also a couple
of cases where, like around

556
00:29:54,520 --> 00:29:59,840
partitioning, people running a
lot of queries without the partition

557
00:29:59,900 --> 00:30:01,500
key being supplied in them.

558
00:30:01,500 --> 00:30:02,960
Like that kind of thing can...

559
00:30:02,960 --> 00:30:04,460
Nikolay: How to catch those situations?

560
00:30:04,600 --> 00:30:06,160
What would you do?

561
00:30:06,340 --> 00:30:07,920
How to verify that?

562
00:30:08,180 --> 00:30:11,920
For example, imagine we are prepared.

563
00:30:12,180 --> 00:30:16,200
We prepared already a good partitioning
approach, like with 0

564
00:30:16,200 --> 00:30:17,940
downtime conversion, everything.

565
00:30:18,340 --> 00:30:22,840
And we need to double check that
all queries will have partition

566
00:30:22,840 --> 00:30:23,800
pruning involved.

567
00:30:24,060 --> 00:30:25,340
How to check it?

568
00:30:26,660 --> 00:30:27,660
Michael: Holistically.
Good question.

569
00:30:28,700 --> 00:30:31,220
Well, actually, that brings me
back to an idea.

570
00:30:31,400 --> 00:30:35,400
We've got pg_stat_statements and
we can now look at some planning

571
00:30:35,580 --> 00:30:37,440
related data as of version 13.

572
00:30:37,440 --> 00:30:39,140
It was a big, big release.

573
00:30:39,800 --> 00:30:41,460
Nikolay: It's not on.

574
00:30:41,680 --> 00:30:42,100
You need

575
00:30:42,100 --> 00:30:42,400
Michael: to enable it.
Yeah.

576
00:30:42,400 --> 00:30:43,540
Why is it off?

577
00:30:45,040 --> 00:30:45,740
Nikolay: Good question.

578
00:30:46,360 --> 00:30:50,220
Since it's off, many people, including
actually myself, still

579
00:30:50,220 --> 00:30:54,140
have some fears that the observer
effect might strike you in

580
00:30:54,140 --> 00:30:55,220
some cases.

581
00:30:56,000 --> 00:31:01,320
So yeah, honestly, yesterday we
discussed with 1 of our clients,

582
00:31:01,320 --> 00:31:05,900
we discussed that we should definitely
consider enabling it.

583
00:31:06,140 --> 00:31:07,620
It's a good thing to have.

584
00:31:07,800 --> 00:31:10,080
Michael: It would be good to hear
what the observer effect is

585
00:31:10,080 --> 00:31:10,900
in that case.

586
00:31:10,900 --> 00:31:15,320
Nikolay: I think we had experiments
checking this with synthetic

587
00:31:15,320 --> 00:31:19,540
benchmarks and it was very low,
this overhead, but it doesn't

588
00:31:19,540 --> 00:31:23,000
eliminate, synthetic benchmarks
don't eliminate all the fears,

589
00:31:23,000 --> 00:31:23,500
right?

590
00:31:23,540 --> 00:31:24,780
Michael: Of course, of course.

591
00:31:25,520 --> 00:31:31,360
Nikolay: So yeah, I'm curious if
people have planning time.

592
00:31:31,620 --> 00:31:36,940
Well, I know projects heavily loaded,
huge clusters, planning

593
00:31:36,940 --> 00:31:37,700
time enabled.

594
00:31:39,020 --> 00:31:40,020
Michael: So you mean pg_stat_statements.track_planning?
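For reference, enabling it and reading the planning columns looks roughly like this (PostgreSQL 13+):

```sql
-- pg_stat_statements.track_planning is off by default.
ALTER SYSTEM SET pg_stat_statements.track_planning = on;
SELECT pg_reload_conf();

-- Then planning metrics appear alongside execution metrics:
SELECT query, plans, total_plan_time, mean_plan_time
FROM pg_stat_statements
ORDER BY total_plan_time DESC
LIMIT 10;
```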

595
00:31:42,260 --> 00:31:43,440
Nikolay: Not planning time, sorry.

596
00:31:43,440 --> 00:31:44,280
Yes, yes.

597
00:31:44,440 --> 00:31:46,940
It's not only time, it's also I/O,
right?

598
00:31:46,940 --> 00:31:49,440
I mean, buffers for planning phase.

599
00:31:50,060 --> 00:31:54,160
They are fine, but I'm curious
about the opposite, when people were

600
00:31:54,160 --> 00:31:58,380
not fine and decided to keep it
off. Otherwise, I will call it

601
00:31:58,380 --> 00:32:00,260
another default bullshit.

602
00:32:01,880 --> 00:32:02,660
Michael: Right, and it's...

603
00:32:02,660 --> 00:32:07,660
Nikolay: We have a collection of
them, like random_page_cost or

604
00:32:07,660 --> 00:32:08,000
what

605
00:32:08,000 --> 00:32:11,900
Michael: I think this
one's even worse because pg_stat_statements

606
00:32:11,920 --> 00:32:13,640
isn't on by default.

607
00:32:13,680 --> 00:32:16,020
So people already have to turn
that on.

608
00:32:16,020 --> 00:32:16,400
And they...

609
00:32:16,400 --> 00:32:19,200
Nikolay: already agreed to pay
the overhead price.

610
00:32:19,200 --> 00:32:22,540
Michael: Yeah, but also it means
that in a default ship, like

611
00:32:22,540 --> 00:32:26,740
if Postgres ships by default, even
if this parameter was on,

612
00:32:26,820 --> 00:32:30,160
by default it would be off still
because pg_stat_statements isn't

613
00:32:30,540 --> 00:32:30,660
on.

614
00:32:30,660 --> 00:32:34,340
So I think this one might be one of
the easier ones to change.

615
00:32:34,540 --> 00:32:35,380
Nikolay: Yeah, yeah.

616
00:32:35,380 --> 00:32:36,920
Well, we need to check discussion.

617
00:32:36,980 --> 00:32:41,780
I might ask our AI to check what
the discussion was and provide a summary.

618
00:32:43,480 --> 00:32:46,100
Michael: But you asked how to monitor
for that.

619
00:32:46,100 --> 00:32:49,620
I would expect to see issues there
quite quickly.

620
00:32:50,140 --> 00:32:52,320
Nikolay: I'm interested in such
issues.

621
00:32:52,920 --> 00:32:57,840
So anyway, I agree this is important,
observability for planning

622
00:32:58,180 --> 00:32:58,680
overhead.

623
00:33:00,480 --> 00:33:05,820
Michael: Also just P95, P99, like
query latencies, I would expect

624
00:33:05,920 --> 00:33:06,700
to suffer.

625
00:33:07,200 --> 00:33:12,720
Nikolay: But I think also I have
in mind an approach we discussed

626
00:33:12,720 --> 00:33:15,800
with a couple of clients actually
over the last few weeks.

627
00:33:16,120 --> 00:33:20,040
If you enable partitioning, it's
worth checking with auto_explain

628
00:33:20,320 --> 00:33:24,320
in lower environments, enabling
it for everything and just having

629
00:33:24,320 --> 00:33:27,560
it enabled while playing with the functionality.

630
00:33:30,060 --> 00:33:31,760
It's about testing coverage, right?

631
00:33:31,760 --> 00:33:34,340
If you cannot reproduce the whole
workload, of course, there

632
00:33:34,340 --> 00:33:35,940
are chances that you miss something.

633
00:33:35,940 --> 00:33:41,400
But it's good, then you can just
analyze logs and see if there

634
00:33:41,400 --> 00:33:43,500
are plans which involve multiple
partitions.
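A minimal sketch of that auto_explain setup for a lower environment (the thresholds here are assumptions for testing, not production recommendations):

```sql
-- Log the plan of every statement so plans touching multiple
-- partitions can be found by analyzing the logs afterwards.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;   -- log all statements
SET auto_explain.log_format = 'json';    -- easier to parse programmatically
```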

635
00:33:43,640 --> 00:33:45,920
But you need to partition properly
as well.

636
00:33:45,920 --> 00:33:49,280
You need to have multiple partitions
for all tables which are

637
00:33:49,280 --> 00:33:50,160
partitioned.
Michael: Yeah, true.

638
00:33:50,160 --> 00:33:53,640
And if you were doing this and
you had thousands of partitions,

639
00:33:54,320 --> 00:33:56,760
that would be a lot of information
in the logs.

640
00:33:57,880 --> 00:33:59,580
And still no planning time information.

641
00:34:01,240 --> 00:34:01,620
Nikolay: Yeah, yeah.

642
00:34:01,620 --> 00:34:05,700
But we are interested in plan structure
in this case.

643
00:34:06,160 --> 00:34:09,960
And in lower environments, we usually
like if it's a smaller

644
00:34:09,960 --> 00:34:13,640
database, for example, it's not
thousands of partitions.

645
00:34:15,060 --> 00:34:19,460
Like for example, we say, okay,
we want to have at least 10 partitions

646
00:34:19,480 --> 00:34:23,440
for each partitioned table, and then
we just check plans and with

647
00:34:23,440 --> 00:34:26,380
something, some analysis, we can
analyze them.

648
00:34:27,040 --> 00:34:30,140
And it's a separate question how
to analyze them, but anyway,

649
00:34:30,140 --> 00:34:34,060
we can holistically find bad
plans.

650
00:34:34,960 --> 00:34:38,300
But I like your idea about pg_stat_statements
as well.

651
00:34:38,360 --> 00:34:46,400
I wish it had, you know, it had
information about plans more,

652
00:34:46,400 --> 00:34:51,160
not only timing and buffers, but
for example, how many objects

653
00:34:51,220 --> 00:34:56,660
were locked, for example, or involved
in the plan, or I don't

654
00:34:56,660 --> 00:34:59,780
know, like some additional things
here.

655
00:35:00,780 --> 00:35:06,040
For example, just was partition
pruning involved?

656
00:35:06,340 --> 00:35:08,320
How many times was it involved
in plans?

657
00:35:09,280 --> 00:35:11,680
You know, there's a column called
plans.

658
00:35:11,680 --> 00:35:12,620
How many plans?

659
00:35:14,640 --> 00:35:16,460
Partition pruning, question.

660
00:35:16,500 --> 00:35:17,780
Because partitioning is hard.

661
00:35:17,780 --> 00:35:21,180
It has so many surprising aspects,
I would say.

662
00:35:21,760 --> 00:35:23,160
So many surprising aspects.

663
00:35:23,240 --> 00:35:24,720
Michael: I have some sympathy with
them.

664
00:35:24,720 --> 00:35:28,480
I think pg_stat_statements already
has a lot of columns.

665
00:35:29,440 --> 00:35:31,340
Nikolay: Yes, I agree.

666
00:35:34,200 --> 00:35:35,180
Still not enough.

667
00:35:36,980 --> 00:35:40,220
We have interesting work happening
in the area of monitoring,

668
00:35:41,280 --> 00:35:46,320
and we discussed not only pg_stat_statements,
but query analysis.

669
00:35:46,840 --> 00:35:50,740
And the holistic table of query
analysis is so wide, because

670
00:35:50,740 --> 00:35:55,220
we take every metric and have 3
derivatives for all metrics.

671
00:35:56,120 --> 00:35:59,840
Time-based differentiation, calls-based,
and also percentage,

672
00:36:00,300 --> 00:36:01,880
except column calls.

673
00:36:01,880 --> 00:36:04,520
For this column, it doesn't make
sense.

674
00:36:04,660 --> 00:36:06,540
Calls per calls is always 1.

675
00:36:06,820 --> 00:36:10,780
So it's like 4 times more minus
1.

676
00:36:11,280 --> 00:36:13,360
It's so many columns to present.

677
00:36:13,900 --> 00:36:19,220
Yeah, we don't need them all the
time, but I can find cases when

678
00:36:19,220 --> 00:36:23,360
some of them, like almost every
one is useful in some cases.

679
00:36:23,400 --> 00:36:28,220
So query analysis becomes harder
and harder, wider and wider

680
00:36:28,420 --> 00:36:30,560
in terms of how many metrics we
have.

681
00:36:31,120 --> 00:36:32,460
And it's still not enough.

682
00:36:32,720 --> 00:36:33,540
That's it.

683
00:36:34,020 --> 00:36:34,740
So yeah, good.

684
00:36:34,740 --> 00:36:37,160
I think we covered this topic quite
well.

685
00:36:37,820 --> 00:36:39,740
Michael: I have 1 last question
for you.

686
00:36:40,900 --> 00:36:44,960
Do you ever change people's
from_collapse_limit or join_collapse_

687
00:36:44,960 --> 00:36:45,460
limit?

688
00:36:46,780 --> 00:36:48,000
Nikolay: Oh, interesting.

689
00:36:48,340 --> 00:36:54,060
I honestly like last, for example,
12 months, I don't remember

690
00:36:54,060 --> 00:36:55,940
cases of it at all.

691
00:36:56,200 --> 00:36:56,700
Somehow.

692
00:36:57,260 --> 00:37:00,780
In the past I had cases where we
considered it and I think even

693
00:37:00,780 --> 00:37:03,780
applied it, but it was so long
ago.

694
00:37:04,660 --> 00:37:05,160
Somehow.

695
00:37:05,280 --> 00:37:06,900
Michael: I think the default is
quite good.

696
00:37:06,900 --> 00:37:08,860
The default for both of them is
8.

697
00:37:09,960 --> 00:37:11,740
I think it works pretty well.
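For reference, the current values can be checked like this; both collapse limits default to 8, and geqo_threshold defaults to 12:

```sql
SHOW from_collapse_limit;  -- default 8
SHOW join_collapse_limit;  -- default 8
SHOW geqo_threshold;       -- default 12
```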

698
00:37:12,180 --> 00:37:13,040
Nikolay: It works pretty well.

699
00:37:13,040 --> 00:37:19,280
And also, I don't know, like when
I see 12 tables involved in

700
00:37:19,280 --> 00:37:24,000
plan joins and like I already like,
oh, genetic optimizer already

701
00:37:24,000 --> 00:37:24,640
here, right?

702
00:37:24,640 --> 00:37:27,340
It has a threshold of 12,
no?

703
00:37:27,500 --> 00:37:29,620
Michael: Yeah, I only learned this
today.

704
00:37:30,300 --> 00:37:33,340
But I think because of from_collapse_limit
and join_collapse_

705
00:37:33,340 --> 00:37:35,900
limit, GEQO never kicks in.

706
00:37:35,900 --> 00:37:36,720
Nikolay: Wow, okay.

707
00:37:36,980 --> 00:37:38,220
I have old memory.

708
00:37:38,420 --> 00:37:38,920
Okay.

709
00:37:38,940 --> 00:37:39,520
Michael: Me too.

710
00:37:39,520 --> 00:37:42,980
Like, well, or maybe I misunderstood
originally, but unless you

711
00:37:42,980 --> 00:37:46,620
raise those to above the GEQO threshold,
This is my new understanding

712
00:37:46,640 --> 00:37:47,780
based on reading the docs.

713
00:37:47,780 --> 00:37:50,740
Unless you raise them above the
GEQO threshold, it will never

714
00:37:50,740 --> 00:37:51,480
kick in.

715
00:37:51,940 --> 00:37:53,600
Nikolay: Well, yeah.

716
00:37:54,380 --> 00:37:58,100
Somehow recently I mostly deal
with different problems when only

717
00:37:58,100 --> 00:38:00,480
a few tables are involved in plans.

718
00:38:01,100 --> 00:38:01,800
Makes sense.

719
00:38:02,300 --> 00:38:05,180
Yeah, but my experience is not
everything, right?

720
00:38:05,180 --> 00:38:10,740
So I'm sure there are cases where
collapse limits matter and

721
00:38:11,240 --> 00:38:14,560
worth considering, but just didn't
happen to me recently.

722
00:38:15,300 --> 00:38:18,480
Michael: No, I was just interested
because it's 1 of the

723
00:38:18,480 --> 00:38:21,340
things we look out for with high planning
time. It's 1 of the things

724
00:38:21,340 --> 00:38:24,140
we recommend looking at in case
someone has tweaked them. For

725
00:38:24,140 --> 00:38:27,100
example, if somebody's increased
them too much, that can massively

726
00:38:27,160 --> 00:38:32,000
balloon planning time. Interesting,
but I wanted to know, like,

727
00:38:32,000 --> 00:38:33,460
in the real world, for you,

728
00:38:33,880 --> 00:38:35,180
is it something you see?

729
00:38:35,180 --> 00:38:36,200
It sounds like me

730
00:38:37,200 --> 00:38:37,660
Nikolay: Yeah, cool.

731
00:38:37,660 --> 00:38:42,840
Well, there might be more cases
we even didn't see yet.

732
00:38:43,520 --> 00:38:44,680
Michael: Oh, of course, always.

733
00:38:45,520 --> 00:38:46,460
Nikolay: Always, yeah.

734
00:38:46,640 --> 00:38:49,580
So because Postgres planner is
complex.

735
00:38:49,900 --> 00:38:53,420
It's not as complex as SQL Server
or Oracle planners.

736
00:38:53,680 --> 00:38:54,740
Michael: Not yet, no.

737
00:38:55,840 --> 00:38:57,040
Well, hopefully never.

738
00:38:57,340 --> 00:38:59,240
Nikolay: Already complex enough,
I would say.

739
00:39:00,020 --> 00:39:00,800
So yeah.

740
00:39:00,880 --> 00:39:01,380
Good.

741
00:39:02,080 --> 00:39:05,420
Oh, by the way, do you know if
we use pg_hint_plan?

742
00:39:07,200 --> 00:39:07,880
Michael: Oh yeah.

743
00:39:09,140 --> 00:39:09,640
pg_hint_plan.

744
00:39:10,460 --> 00:39:10,960
Nikolay: Right.

745
00:39:11,000 --> 00:39:13,600
Because we tell the planner what
to do directly.

746
00:39:14,640 --> 00:39:18,300
Does it help fight high planning
time?

747
00:39:19,340 --> 00:39:20,400
Michael: I don't know.

748
00:39:21,140 --> 00:39:21,500
Nikolay: Good question.

749
00:39:21,500 --> 00:39:21,820
It would

750
00:39:21,820 --> 00:39:24,580
Michael: make sense that it would,
but I don't know where it's

751
00:39:24,580 --> 00:39:25,520
hooking in.

752
00:39:25,760 --> 00:39:26,040
Nikolay: Yeah.

753
00:39:26,040 --> 00:39:30,380
Yeah, I don't have this used in,
I don't observe it used in production

754
00:39:30,380 --> 00:39:32,140
at all, only non-production.

755
00:39:32,500 --> 00:39:33,340
So I also

756
00:39:33,340 --> 00:39:33,840
Michael: don't see.

757
00:39:33,840 --> 00:39:35,740
Okay, good.

758
00:39:35,740 --> 00:39:36,840
Question to our listeners.

759
00:39:37,120 --> 00:39:38,160
Nikolay: Yeah, it's homework.

760
00:39:39,760 --> 00:39:42,240
Michael: And also, the other thing
I wanted to ask is if anybody

761
00:39:42,240 --> 00:39:48,440
is using some of the more complex
extensions that hook into planning,

762
00:39:48,620 --> 00:39:52,280
like for example, Citus or Timescale.

763
00:39:53,460 --> 00:39:56,460
Yeah, I've seen more planning time
issues.

764
00:39:56,580 --> 00:39:58,520
But it just feels like a coincidence
at the moment.

765
00:39:58,520 --> 00:40:02,720
I haven't looked into if it's definitely
causal rather than

766
00:40:02,720 --> 00:40:06,980
just, you know, what's the correlation
versus

767
00:40:07,820 --> 00:40:08,320
Nikolay: causation.

768
00:40:08,720 --> 00:40:09,220
But

769
00:40:09,520 --> 00:40:10,740
Michael: yes, that's the 1.

770
00:40:10,760 --> 00:40:13,980
So I don't know if it's just coincidence,
but I tend to see more

771
00:40:13,980 --> 00:40:16,960
planning related issues with the
more complex extensions.

772
00:40:16,960 --> 00:40:20,860
Nikolay: Maybe some extensions
help improve planning time, not

773
00:40:20,860 --> 00:40:21,740
make it worse.

774
00:40:22,720 --> 00:40:26,540
Michael: Well, you mentioned pg_hint_plan
would be the one that would

775
00:40:26,580 --> 00:40:28,040
make the most sense on that.

776
00:40:28,520 --> 00:40:30,060
Nikolay: Well, yeah, it's just
a random guess.

777
00:40:30,060 --> 00:40:31,520
I'm not sure at all.

778
00:40:31,560 --> 00:40:32,320
Worth checking.

779
00:40:32,320 --> 00:40:32,900
Michael: Me too.

780
00:40:32,900 --> 00:40:33,960
Well, let us know.

781
00:40:34,040 --> 00:40:35,580
There are definitely people out there
who know more

782
00:40:35,580 --> 00:40:35,920
Nikolay: about it.

783
00:40:35,920 --> 00:40:36,720
Let's do a summary.

784
00:40:37,780 --> 00:40:42,880
So, try to avoid a high partition
number, try to avoid a lot of

785
00:40:42,880 --> 00:40:46,840
indexes, and especially avoid the
combination of these 2

786
00:40:46,840 --> 00:40:49,300
problems and use prepared statements.

787
00:40:49,840 --> 00:40:50,280
Right?

788
00:40:50,280 --> 00:40:50,780
Michael: Yeah.

789
00:40:51,040 --> 00:40:52,000
You don't think avoiding

790
00:40:52,000 --> 00:40:52,280
lots of joins as well?

791
00:40:52,280 --> 00:40:53,360
Nikolay: 3 simple ones?

792
00:40:53,360 --> 00:40:53,920
Maybe, yeah.

793
00:40:55,520 --> 00:40:58,600
Again, like, somehow I deal with
very simple plans lately.

794
00:40:58,680 --> 00:40:59,560
I don't know.

795
00:40:59,560 --> 00:41:01,300
Yeah, good point as well.

796
00:41:01,900 --> 00:41:02,620
Good point.

797
00:41:04,220 --> 00:41:09,760
I try to avoid a lot of tables
and many joins, and then we can

798
00:41:09,760 --> 00:41:16,060
expect good performance because
I always hunt for a single index

799
00:41:16,060 --> 00:41:18,280
scan, or better, an index-only scan,
always.

800
00:41:18,600 --> 00:41:21,820
And if you have even 1 join, even
2 tables, you cannot build

801
00:41:21,820 --> 00:41:22,960
an index on 2 tables.

802
00:41:22,960 --> 00:41:26,780
It's always one table; indexes
always belong to their own table.

803
00:41:27,840 --> 00:41:31,900
And materialized views, as we maybe
discussed, are out of scope for

804
00:41:32,020 --> 00:41:33,580
my recommendations, always.

805
00:41:33,740 --> 00:41:35,100
Regular materialized views.

806
00:41:35,540 --> 00:41:40,140
So that's why I tried to think,
how can we simplify things and,

807
00:41:40,920 --> 00:41:44,480
okay, nested loops and so on, but
make queries sub-millisecond.

808
00:41:45,780 --> 00:41:50,140
But yeah, if many, many tables,
planning time might be huge.

809
00:41:50,140 --> 00:41:54,280
Anyway, I wanted 3 lessons: low
index number, low partition

810
00:41:54,320 --> 00:41:56,260
number, but not super low.

811
00:41:56,480 --> 00:42:00,200
So partitions should be
relatively small.

812
00:42:00,860 --> 00:42:02,220
And prepared statements.

813
00:42:03,620 --> 00:42:07,000
And I think these 3 might take
years to implement in large projects.

814
00:42:08,000 --> 00:42:08,900
Michael: Yeah, true.

815
00:42:09,140 --> 00:42:09,840
Nikolay: Yeah, unfortunately.

816
00:42:10,520 --> 00:42:11,020
Michael: Cool. Nice one,

817
00:42:12,260 --> 00:42:13,660
Nikolay, thanks so much.

818
00:42:13,940 --> 00:42:14,440
Nikolay: Bye-bye.

819
00:42:14,760 --> 00:42:15,460
Michael: Catch you next week.