1
00:00:00,060 --> 00:00:02,400
Michael: Hello and welcome to Postgres.FM, a weekly show about

2
00:00:02,400 --> 00:00:03,260
all things PostgreSQL.

3
00:00:03,400 --> 00:00:06,220
I am Michael, founder of pgMustard, and this is my co-host Nikolay,

4
00:00:06,220 --> 00:00:07,440
founder of Postgres.AI.

5
00:00:07,440 --> 00:00:09,440
Hey Nikolay, what are we talking about today?

6
00:00:09,520 --> 00:00:13,480
Nikolay: Hi Michael, let's talk about performance cliffs and

7
00:00:13,480 --> 00:00:16,280
one case in particular: track_planning.

8
00:00:17,580 --> 00:00:22,920
The reason why I brought this to our attention is that I observed

9
00:00:23,040 --> 00:00:27,260
recently several, actually over the last few years, I observed

10
00:00:27,260 --> 00:00:31,580
several strong cases in production systems, various production

11
00:00:31,620 --> 00:00:38,740
systems, where the lack of planning phase tracking caused a huge amount,

12
00:00:38,740 --> 00:00:41,260
a huge effort invested in troubleshooting.

13
00:00:42,440 --> 00:00:46,500
And you know, our best example is, we talked about this many

14
00:00:46,500 --> 00:00:52,220
times, during planning Postgres locks all indexes with access

15
00:00:52,220 --> 00:00:52,920
share lock.

16
00:00:52,920 --> 00:00:56,240
All tables, all indexes involved, even those which will not be

17
00:00:56,240 --> 00:00:58,300
used, it still locks them, right?

18
00:00:58,900 --> 00:01:05,360
And we talked a lot about lightweight lock, LockManager contention

19
00:01:06,200 --> 00:01:07,260
issues, right?
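
As a minimal sketch of how to see this (assuming a hypothetical table t with several indexes; relation locks taken during planning are held until the transaction ends):

```sql
BEGIN;
-- Planning alone takes AccessShareLock on the table and on ALL of its indexes
EXPLAIN SELECT * FROM t WHERE id = 1;
-- Inspect the locks this backend now holds
SELECT relation::regclass, mode
FROM pg_locks
WHERE pid = pg_backend_pid() AND locktype = 'relation';
COMMIT;
```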

20
00:01:07,960 --> 00:01:09,900
And it happens during planning.

21
00:01:10,460 --> 00:01:14,520
And if we don't have track_planning, pg_stat_statements.track_planning

22
00:01:16,760 --> 00:01:22,100
enabled, it's not visible to our top-down query analysis.

23
00:01:22,500 --> 00:01:29,840
We cannot see which queries spend a lot of time in planning and

24
00:01:29,840 --> 00:01:32,200
a lot of buffers, maybe, right, and so on.

25
00:01:33,460 --> 00:01:38,560
Michael: Yes, and crucially, not only is pg_stat_statements not

26
00:01:38,560 --> 00:01:43,220
on by default, although a lot of people have it turned on, but

27
00:01:43,220 --> 00:01:46,980
pg_stat_statements.track_planning is off by default even when you

28
00:01:46,980 --> 00:01:51,540
enable pg_stat_statements. So most folks that enable pg_stat_statements

29
00:01:51,580 --> 00:01:56,080
don't enable track_planning, including most cloud providers.
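
For reference, a minimal sketch of checking and enabling it (assuming superuser access and that pg_stat_statements is already in shared_preload_libraries):

```sql
SHOW pg_stat_statements.track_planning;   -- off by default

ALTER SYSTEM SET pg_stat_statements.track_planning = on;
SELECT pg_reload_conf();                  -- a reload is enough, no restart needed
```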

30
00:01:57,040 --> 00:01:58,800
Nikolay: So let's zoom out, actually.

31
00:01:58,800 --> 00:01:59,940
This is a good point.

32
00:02:00,180 --> 00:02:06,500
By default, Postgres presents only poor metrics for query analysis.

33
00:02:08,040 --> 00:02:15,040
If you have track_io_timing enabled, in pg_stat_database you see

34
00:02:15,400 --> 00:02:15,900
I/O timing.

35
00:02:16,740 --> 00:02:20,780
You have the number of transactions, so you can understand how many

36
00:02:20,780 --> 00:02:23,000
transactions per second, but not queries per second.

37
00:02:23,000 --> 00:02:27,380
It's like Postgres doesn't track it, and that's it. So you don't

38
00:02:27,380 --> 00:02:28,400
understand, like...

39
00:02:29,060 --> 00:02:30,060
You don't understand throughput.

40
00:02:30,060 --> 00:02:32,520
You don't understand the average latency in general.

41
00:02:32,520 --> 00:02:33,660
It's super hard.

42
00:02:36,040 --> 00:02:40,580
And even at the highest level, like my whole database, how is

43
00:02:40,580 --> 00:02:41,040
it doing?

44
00:02:41,040 --> 00:02:41,880
How many TPS?

45
00:02:42,260 --> 00:02:43,440
How many TPS is fine?

46
00:02:43,440 --> 00:02:44,360
How many QPS?

47
00:02:44,760 --> 00:02:45,840
Queries per second.

48
00:02:46,420 --> 00:02:47,720
And what about latencies?

49
00:02:48,480 --> 00:02:53,860
But if you want to go down and understand which groups of queries,

50
00:02:54,020 --> 00:02:59,760
parts of workload are responsible for, like, where do we spend

51
00:02:59,760 --> 00:03:05,340
time, for example, or do some work
with the buffer pool.

52
00:03:05,740 --> 00:03:08,360
To do that, you don't have it by
default.

53
00:03:08,360 --> 00:03:11,900
You need to install the pg_stat_statements
extension.
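
The usual setup, as a sketch (the library itself requires a restart to load; the extension then exposes the view):

```sql
-- In postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_stat_statements'

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Example: top query groups by total execution time (Postgres 13+ column names)
SELECT queryid, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```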

54
00:03:13,500 --> 00:03:17,420
And this is actually what most
people I know do.

55
00:03:18,080 --> 00:03:24,020
This lesson is learned and pg_stat_statements
is maybe the most popular extension

56
00:03:24,020 --> 00:03:25,220
because of that, right?

57
00:03:25,760 --> 00:03:26,260
Definitely.

58
00:03:27,040 --> 00:03:29,700
And we did have a couple of episodes
about pg_stat_statements.

59
00:03:30,360 --> 00:03:34,380
It's hard to overstate how good
this extension is.

60
00:03:36,900 --> 00:03:37,900
It's really valuable.

61
00:03:38,560 --> 00:03:42,040
And there are opinions that it
should be already...

62
00:03:42,940 --> 00:03:48,780
It's time for it to go to the core
engine, because it sucks that

63
00:03:48,840 --> 00:03:52,900
we install this extension and Postgres
needs to parse, analyze

64
00:03:52,900 --> 00:03:55,360
the query once again, just for metrics.

65
00:03:56,660 --> 00:03:58,280
So it should be part of core.

66
00:04:00,360 --> 00:04:04,860
But it's already a non-default
situation when you have a good

67
00:04:04,920 --> 00:04:05,780
query analysis.

68
00:04:06,180 --> 00:04:09,220
And it has problems, as we discussed
many times, it doesn't track

69
00:04:09,220 --> 00:04:10,760
failing queries and so on.

70
00:04:10,760 --> 00:04:11,740
It's another story.

71
00:04:12,180 --> 00:04:18,880
But next step, we install pg_stat_statements,
we see all queries,

72
00:04:19,200 --> 00:04:22,000
well, top 5,000 by default, right?

73
00:04:22,000 --> 00:04:26,460
It has pg_stat_statements.max parameter,
which controls how many

74
00:04:26,920 --> 00:04:32,620
normalized, aggregated, or, in official
language, normalized queries,

75
00:04:33,460 --> 00:04:36,240
queries without parameters, are
tracked.
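
A quick sketch of checking that limit and whether entries are being evicted (pg_stat_statements_info and its dealloc counter exist since Postgres 14):

```sql
SHOW pg_stat_statements.max;   -- 5000 by default

-- dealloc counts how many least-used entries were evicted after hitting the limit
SELECT dealloc, stats_reset FROM pg_stat_statements_info;
```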

76
00:04:36,820 --> 00:04:41,360
But then, at some point, and actually
I think many folks don't

77
00:04:41,420 --> 00:04:42,600
understand it.

78
00:04:43,320 --> 00:04:48,960
Honestly, I didn't feel it so deeply
as I do now, like the last

79
00:04:48,960 --> 00:04:50,180
couple of years maybe.

80
00:04:50,740 --> 00:04:54,900
We track only part of our, of Postgres
work.

81
00:04:55,080 --> 00:04:56,500
We track only execution.

82
00:04:56,920 --> 00:04:58,160
We don't track planning.

83
00:04:58,660 --> 00:05:03,840
And planning can be a huge headache
if workload is heavy and

84
00:05:03,840 --> 00:05:06,000
machine is huge and so on.

85
00:05:06,260 --> 00:05:12,660
So in heavily loaded projects,
not tracking planning means like

86
00:05:12,660 --> 00:05:13,380
you don't...

87
00:05:14,160 --> 00:05:19,140
You can, like, very roughly, you
can be in trouble in 50% of

88
00:05:19,140 --> 00:05:22,320
production cases when you need
to investigate what's happening,

89
00:05:23,140 --> 00:05:26,260
what's causing high CPU consumption,
for example.

90
00:05:26,400 --> 00:05:28,660
But pg_stat_statements doesn't know,
it doesn't see it.

91
00:05:28,660 --> 00:05:30,780
It doesn't see the planning phase
at all.

92
00:05:30,940 --> 00:05:32,280
Only execution phase.

93
00:05:32,580 --> 00:05:35,460
So it feels very weird.

94
00:05:35,460 --> 00:05:40,840
It's like we have, okay, we install
pg_stat_statements, we tell

95
00:05:40,840 --> 00:05:45,680
all people who run OLTP to install
it, we know overhead is quite

96
00:05:45,680 --> 00:05:49,020
low, and we will discuss it in
depth in a moment.

97
00:05:49,700 --> 00:05:54,020
But then we say, okay, actually,
the pg_stat_statements extension

98
00:05:54,100 --> 00:05:57,780
we actually installed, it's like
half of it.

99
00:05:57,780 --> 00:06:00,980
Half of the extension, because
it's only about execution.

100
00:06:01,000 --> 00:06:04,340
But planning is a super important
phase and we don't have it.

101
00:06:04,920 --> 00:06:08,100
So all folks who have pg_stat_statements,
most of them, most

102
00:06:08,100 --> 00:06:08,760
of them.

103
00:06:08,760 --> 00:06:12,460
All folks who haven't changed settings
of pg_stat_statements,

104
00:06:12,740 --> 00:06:15,300
they have only part of the solution.

105
00:06:16,120 --> 00:06:19,220
Michael: Yeah, I would probably
make the case that it's a bit

106
00:06:19,220 --> 00:06:20,020
more than half.

107
00:06:20,020 --> 00:06:26,680
Like I think, in general,
a much lower proportion

108
00:06:26,980 --> 00:06:30,640
of performance issues are planning
related than execution related,

109
00:06:32,080 --> 00:06:32,980
in my experience.

110
00:06:33,960 --> 00:06:37,840
But I do take your point that it's
not 0 percent planning

111
00:06:37,840 --> 00:06:41,520
related. So even if it's 80/20 or
90/10, there's still

112
00:06:41,520 --> 00:06:44,960
a whole category of issues that
we're not spotting by having

113
00:06:44,960 --> 00:06:48,580
this off. And I think there's a
couple of extra things. Like,

114
00:06:48,580 --> 00:06:52,160
I don't think it's just heavily
loaded systems. I've seen planning-

115
00:06:52,480 --> 00:06:57,340
dominated queries that are analytical
in nature where it's just

116
00:06:57,340 --> 00:07:00,620
somebody trying to diagnose why
a single query is slow and that

117
00:07:00,620 --> 00:07:04,420
is hidden as well if you're just
looking at things like...

118
00:07:04,460 --> 00:07:06,640
Nikolay: This is a just-in-time comment,
actually.

119
00:07:07,420 --> 00:07:09,640
Michael: Well, there's just-in-time
compilation as well.

120
00:07:09,920 --> 00:07:11,420
But, yeah, good.

121
00:07:11,680 --> 00:07:15,020
No, but sticking to planning, I
think there's a few confusing

122
00:07:15,020 --> 00:07:15,300
things.

123
00:07:15,300 --> 00:07:19,920
And I see this mistake made by
all sorts of quite expert people

124
00:07:19,920 --> 00:07:23,240
as well. They might
be looking at a query, a single

125
00:07:23,240 --> 00:07:26,820
query plan, and they say this query
plan only takes 100 milliseconds,

126
00:07:27,180 --> 00:07:29,600
but there's 10 milliseconds of
planning as well. Like, they're

127
00:07:29,600 --> 00:07:32,200
only looking at the execution time;
they don't realize that the

128
00:07:32,200 --> 00:07:34,680
planning is in addition to the
execution time, which they should

129
00:07:34,680 --> 00:07:38,560
be saying is 110, or that they should
be summing those 2.
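
As an illustration, a sketch with any query; EXPLAIN ANALYZE reports the 2 phases on separate lines, and the real server-side total is their sum:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pgbench_accounts WHERE aid = 1;
-- The output ends with 2 separate lines, for example:
--   Planning Time: 10.123 ms
--   Execution Time: 100.456 ms
-- The query really took ~110 ms on the server, not 100 ms.
```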

130
00:07:38,560 --> 00:07:41,280
And the same is true in pg_stat_statements,
we should be summing

131
00:07:41,280 --> 00:07:42,020
the 2.

132
00:07:42,040 --> 00:07:47,540
And we can't sum the 2 if it's
off. Or we can, it's just that it mistakenly

133
00:07:48,060 --> 00:07:51,860
tells us that there's 0 planning
time, where it's actually just

134
00:07:51,860 --> 00:07:52,780
not being tracked.

135
00:07:53,760 --> 00:07:59,780
Nikolay: And pg_stat_statements once had
only metrics called like total_time,

136
00:07:59,780 --> 00:08:08,960
mean_time and so on, before
Postgres 13, and since 13 it

137
00:08:08,960 --> 00:08:13,580
was renamed, it was renamed, right?
It never tracked planning, right?

138
00:08:14,180 --> 00:08:16,400
Michael: There was no planning,
it was only execution, but the

139
00:08:16,400 --> 00:08:17,620
naming made it clearer.

140
00:08:18,640 --> 00:08:19,460
And on 13...

141
00:08:20,080 --> 00:08:25,540
Nikolay: And in 13 it was renamed
to total_exec_time,

142
00:08:26,040 --> 00:08:30,400
min_exec_time and so on, and there
is another group of metrics

143
00:08:31,580 --> 00:08:36,500
total_plan_time and so on, min_plan_time
and so on.
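
A sketch of summing the 2 phases per query group, using the Postgres 13+ column names (the plan-side columns stay at 0 while track_planning is off):

```sql
SELECT queryid,
       calls,
       plans,  -- number of planning events; 0 if track_planning is off
       total_plan_time + total_exec_time AS total_time_ms
FROM pg_stat_statements
ORDER BY total_time_ms DESC
LIMIT 10;
```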

144
00:08:36,600 --> 00:08:41,600
And it means that, okay, we have
now the setting, but it's not

145
00:08:41,600 --> 00:08:42,540
on by default.

146
00:08:43,660 --> 00:08:48,240
And it means we have a second lesson
to learn.

147
00:08:48,240 --> 00:08:49,420
Should we enable it?

148
00:08:50,280 --> 00:08:52,520
We say install pg_stat_statements.

149
00:08:52,900 --> 00:08:54,340
Don't go without it.

150
00:08:54,340 --> 00:08:57,720
It's the number 1 recommendation
in terms of query analysis.

151
00:08:57,720 --> 00:08:59,260
Always have pg_stat_statements installed.

152
00:08:59,260 --> 00:09:05,240
I remember when it appeared, some rumors said, it adds 7% of

153
00:09:05,540 --> 00:09:07,500
overhead, but it's worth it.

154
00:09:07,700 --> 00:09:12,260
Well, as we can see now, it's actually very small overhead under

155
00:09:12,260 --> 00:09:17,860
normal circumstances, but we had an episode showing, remember,

156
00:09:17,920 --> 00:09:23,440
4 million TPS, showing that in an edge case, overhead can be drastic

157
00:09:23,560 --> 00:09:25,640
and drop performance.

158
00:09:25,640 --> 00:09:27,080
It's like performance cliff.

159
00:09:27,360 --> 00:09:30,360
But I doubt normal projects reach that point.

160
00:09:31,400 --> 00:09:36,720
Everyone should be aware of that overhead, but again, it's a

161
00:09:36,720 --> 00:09:37,220
cliff.

162
00:09:37,300 --> 00:09:39,500
It doesn't come immediately.

163
00:09:39,600 --> 00:09:43,840
It comes only when you reach a lot, a lot of queries per second

164
00:09:44,200 --> 00:09:46,400
for a particular query ID, right?

165
00:09:47,380 --> 00:09:49,380
Michael: Yes, and I think this suffers.

166
00:09:49,400 --> 00:09:52,680
Tell me if you think I'm wrong, but I think the problem here is that

167
00:09:52,680 --> 00:09:57,080
the easiest way to stress test Postgres or the easiest way to

168
00:09:57,080 --> 00:10:00,900
do a quick benchmark, is using pgbench, yes, which has this exact...

169
00:10:00,900 --> 00:10:04,360
Nikolay: It's not a big secret. Immediately, okay, let's

170
00:10:04,360 --> 00:10:08,020
talk about, like, the problem. Like, it just says pg_stat_statements

171
00:10:08,020 --> 00:10:11,540
is the number 1 recommendation for query analysis.

172
00:10:11,560 --> 00:10:12,160
Install it.

173
00:10:12,160 --> 00:10:12,920
Everyone agrees?

174
00:10:12,920 --> 00:10:13,420
Good.

175
00:10:13,780 --> 00:10:14,840
There is some overhead.

176
00:10:14,860 --> 00:10:16,360
We don't see it.

177
00:10:16,360 --> 00:10:18,240
Like, it's not huge.

178
00:10:18,240 --> 00:10:18,740
Okay.

179
00:10:19,160 --> 00:10:24,300
Well, we saw it can be huge, but with some pathological workloads,

180
00:10:24,320 --> 00:10:24,720
right?

181
00:10:24,720 --> 00:10:26,080
And we will discuss it.

182
00:10:26,120 --> 00:10:29,320
Now, the question of this episode, basically.

183
00:10:29,680 --> 00:10:32,600
Should everyone start saying, enable track_planning?

184
00:10:34,080 --> 00:10:39,140
Should we, like... First, the default situation when Postgres doesn't

185
00:10:39,140 --> 00:10:43,140
have anything, and then we install the extension. Okay, everyone learned

186
00:10:43,140 --> 00:10:43,820
this lesson.

187
00:10:43,900 --> 00:10:45,040
It's kind of solved.

188
00:10:45,380 --> 00:10:50,860
Should we consider the second lesson similarly and tell everyone

189
00:10:50,860 --> 00:10:52,100
enable track_planning?

190
00:10:52,660 --> 00:10:56,900
Because this gives you the full power of pg_stat_statements now.

191
00:10:57,640 --> 00:11:01,920
Michael: Yeah, my current opinion, and I'll be happy to adjust

192
00:11:01,920 --> 00:11:06,500
this as the episode goes on, is I would enable this very early

193
00:11:06,500 --> 00:11:10,320
in a project these days. And doing research for this,

194
00:11:10,320 --> 00:11:11,920
I did come across some interesting things

195
00:11:11,920 --> 00:11:17,000
I didn't realize, including why and when the default was made

196
00:11:17,000 --> 00:11:17,500
off.

197
00:11:18,400 --> 00:11:23,980
But yeah, I would say when you've not got high volume, high load,

198
00:11:24,140 --> 00:11:28,600
turn it on while you've not got high load and then reassess if

199
00:11:29,160 --> 00:11:30,400
you end up with a pathological

200
00:11:31,020 --> 00:11:31,520
Nikolay: workload.

201
00:11:32,120 --> 00:11:32,460
Which I

202
00:11:32,460 --> 00:11:33,340
Michael: don't think you will.

203
00:11:33,340 --> 00:11:36,400
Nikolay: I understand this approach,
and usually it's good.

204
00:11:36,400 --> 00:11:42,460
For example, if we had some overhead
which would grow monotonically

205
00:11:42,980 --> 00:11:43,480
with...

206
00:11:45,620 --> 00:11:48,440
Maybe linearly or somehow with
our workload.

207
00:11:49,360 --> 00:11:50,800
I would understand this approach.

208
00:11:50,900 --> 00:11:56,120
Because it's like taking these weights and
going to the gym with additional

209
00:11:56,120 --> 00:11:56,880
weights all the time.

210
00:11:56,880 --> 00:11:58,340
So you got used to it.

211
00:11:58,380 --> 00:12:00,440
But in this case, it's very different.

212
00:12:00,440 --> 00:12:06,240
We don't have any noticeable overhead,
as we saw from benchmarks.

213
00:12:06,340 --> 00:12:10,120
We don't have it and it happens
only at very, very, very, very

214
00:12:10,240 --> 00:12:11,140
extreme situations.

215
00:12:12,520 --> 00:12:13,780
Let's talk about history.

216
00:12:14,380 --> 00:12:18,560
I just learned about what you found
right before our recording

217
00:12:18,580 --> 00:12:20,140
so I was super surprised.

218
00:12:20,760 --> 00:12:23,240
Tell us more about it, very interesting.

219
00:12:23,240 --> 00:12:26,980
Michael: Yeah so 1 thing, like
an easy thing I do in preparation

220
00:12:26,980 --> 00:12:29,740
for most episodes is just check
when was the feature introduced,

221
00:12:30,020 --> 00:12:34,740
what was the commit, like what
was the discussion around it. And

222
00:12:34,740 --> 00:12:37,600
I noticed it was in version 13
that it was introduced and I went

223
00:12:37,600 --> 00:12:40,380
to the version 13 release notes
to see what was said, kind of

224
00:12:40,380 --> 00:12:41,620
as a high-level summary.

225
00:12:41,760 --> 00:12:44,900
And the release notes now have
these awesome little links to

226
00:12:44,900 --> 00:12:45,540
the commits.

227
00:12:45,540 --> 00:12:46,840
Even for old versions?

228
00:12:47,780 --> 00:12:49,140
Well, 13 had it.

229
00:12:49,140 --> 00:12:52,420
I knew they did it for the latest
version and I was surprised

230
00:12:52,420 --> 00:12:55,240
to see them in 13, but very pleasantly
surprised, so thank you

231
00:12:55,240 --> 00:12:55,760
to whoever did that.

232
00:12:55,760 --> 00:12:57,280
Like...
Nikolay: A portal, that's great.

233
00:12:57,280 --> 00:12:57,780
Yeah.

234
00:12:58,840 --> 00:13:04,640
So for every item in the release notes we can
quickly trace discussion, commits

235
00:13:04,640 --> 00:13:05,380
and so on, right?

236
00:13:05,380 --> 00:13:06,020
That's great.

237
00:13:06,020 --> 00:13:06,760
Michael: Yes, exactly.

238
00:13:06,760 --> 00:13:09,720
Or you can go to the commits and
from the commits you can go

239
00:13:09,720 --> 00:13:10,520
to the discussions.

240
00:13:10,640 --> 00:13:11,540
In most cases.

241
00:13:12,340 --> 00:13:13,400
Yeah, in most cases.

242
00:13:14,200 --> 00:13:18,240
But often new features, especially
simpler ones like this, like

243
00:13:18,240 --> 00:13:23,160
small, like a new parameter, they'll
only have 1 commit. This 1

244
00:13:23,160 --> 00:13:28,640
had 2, which piqued my interest
immediately, so I opened

245
00:13:28,640 --> 00:13:31,920
them both, and the first 1 was pretty
normal; it made sense

246
00:13:31,920 --> 00:13:33,580
that this was added.

247
00:13:33,820 --> 00:13:38,740
And then the second 1, it was when
I realized, oh it was during

248
00:13:38,740 --> 00:13:42,840
the beta, it was during the beta
phase for 13, somebody had reported

249
00:13:44,440 --> 00:13:49,080
a performance issue with having
track_planning on and turning

250
00:13:49,080 --> 00:13:52,760
it off made the performance go
back to exactly how it was in

251
00:13:52,760 --> 00:13:53,260
12.

252
00:13:53,600 --> 00:13:58,640
So basically they had a regression
according to a synthetic benchmark

253
00:13:58,820 --> 00:14:02,000
and then asked, could we turn it
off? And it was pretty

254
00:14:02,000 --> 00:14:05,740
unanimous. I think it got a few
replies, and all the replies were

255
00:14:05,740 --> 00:14:10,080
in favor of turning it off for
the version 13 release. And as

256
00:14:10,080 --> 00:14:12,440
far as I can tell, it's not been
revisited since.

257
00:14:13,320 --> 00:14:16,560
Nikolay: Yeah, so I now understand
much better what's happening

258
00:14:16,560 --> 00:14:17,060
here.

259
00:14:17,220 --> 00:14:20,020
Let me like unwrap it here.

260
00:14:20,020 --> 00:14:21,500
So what happened?

261
00:14:21,740 --> 00:14:25,120
It would be obviously
good if it was enabled

262
00:14:25,120 --> 00:14:26,780
for all by default, right?

263
00:14:27,180 --> 00:14:32,120
But then it was learned that there
is a performance degradation of

264
00:14:32,300 --> 00:14:32,800
45%.

265
00:14:32,900 --> 00:14:35,900
Okay, it doesn't matter, actually,
it can be 90% in some cases,

266
00:14:35,900 --> 00:14:38,140
it depends on your machine and
so on.

267
00:14:38,940 --> 00:14:42,480
And it was related to spin lock
contention in the results.

268
00:14:43,180 --> 00:14:49,960
And this is exactly what we have
recently observed in our benchmarks

269
00:14:50,160 --> 00:14:56,660
using our AI workflow which we
discussed also several times.

270
00:14:56,820 --> 00:15:01,900
I wanted to say how I see our benchmark
workflow for synthetic

271
00:15:02,020 --> 00:15:03,840
benchmarks, pgbench and so on.

272
00:15:04,460 --> 00:15:10,280
We built a lot, and we collect
80 artifacts for each run, very

273
00:15:10,280 --> 00:15:14,940
comprehensive configuration, everything
is stored, we can iterate.

274
00:15:15,620 --> 00:15:19,640
And I see LLMs as just, like, kind
of oil, so eventually we should

275
00:15:19,640 --> 00:15:25,580
have a well-oiled machine, right, engine.
But the main part is

276
00:15:25,580 --> 00:15:26,660
not in the LLM.

277
00:15:26,680 --> 00:15:31,840
It just makes it easier
to iterate. So we wanted to

278
00:15:31,840 --> 00:15:39,140
check overhead, and we checked it.
So pgbench, which was also used

279
00:15:39,400 --> 00:15:44,440
in this research mentioned
in the mailing list, pgbench

280
00:15:44,480 --> 00:15:50,140
by default is not limiting TPS.
But under

281
00:15:50,140 --> 00:15:54,120
normal circumstances in production,
of course we don't, so this

282
00:15:54,120 --> 00:15:57,680
is not just load test, it's a kind
of load test which is called

283
00:15:57,680 --> 00:15:58,600
stress test.

284
00:15:59,180 --> 00:16:02,840
So we are on the edge, We are checking
what's happening on the

285
00:16:02,840 --> 00:16:03,340
edge.

286
00:16:04,960 --> 00:16:07,060
And in production we don't have
it normally.

287
00:16:08,000 --> 00:16:13,520
You don't want to work at 100%
CPU usually in OLTP because you

288
00:16:13,520 --> 00:16:20,140
will experience various interesting
phenomena like this basically,

289
00:16:20,140 --> 00:16:20,640
right?

290
00:16:20,800 --> 00:16:21,720
So spin lock.

291
00:16:21,900 --> 00:16:28,240
And we checked it and like we took
very big machines, I think

292
00:16:29,380 --> 00:16:35,500
192 cores, fifth
generation, on Google Cloud, still

293
00:16:35,500 --> 00:16:37,480
spending Google Cloud credits,
right?

294
00:16:37,860 --> 00:16:43,420
Fifth generation of Xeon scalable
Intel and almost 200 cores,

295
00:16:43,680 --> 00:16:44,580
a lot of memory.

296
00:16:44,580 --> 00:16:48,820
It doesn't matter because we also
took small scale, scale 100,

297
00:16:48,820 --> 00:16:52,320
that means 10 million entries in
pgbench accounts only.

298
00:16:53,000 --> 00:16:57,280
And we started, first we started
with not select-only; no difference.

299
00:16:57,360 --> 00:17:01,100
Like if you use read-write workload,
you don't see difference.

300
00:17:01,400 --> 00:17:07,100
But once you switch to select only
workload, you quickly observe.

301
00:17:07,300 --> 00:17:11,680
So what we did, and like in our
show notes, we will have links

302
00:17:11,680 --> 00:17:18,220
to detailed details, all details
with all numbers and reproducible

303
00:17:18,480 --> 00:17:20,540
with all pictures and so on.

304
00:17:20,540 --> 00:17:26,660
So we start with just 1 client
and 10, 20 and so on, 100 and

305
00:17:26,660 --> 00:17:31,560
so on until 200 clients and we
expect that we should basically

306
00:17:31,560 --> 00:17:33,520
grow in terms of TPS, right?

307
00:17:33,520 --> 00:17:37,280
Because more clients, more TPS,
and Postgres these days scales

308
00:17:37,720 --> 00:17:39,500
to many cores quite well.

309
00:17:40,240 --> 00:17:47,080
But we quickly saw that we've enabled
on this machine, quite

310
00:17:47,080 --> 00:17:48,400
powerful machine, right?

311
00:17:48,580 --> 00:17:51,260
We've enabled pg_stat_statements.track_planning

312
00:17:52,080 --> 00:17:56,360
Peak is reached after 30, between
30 and 40 clients.

313
00:17:56,640 --> 00:17:57,420
Very early.

314
00:17:57,440 --> 00:17:59,940
And then TPS goes down.

315
00:18:00,780 --> 00:18:05,760
And without track_planning, similar
picture actually, but it's

316
00:18:05,760 --> 00:18:13,160
later, until like 70-80, like 2
times to the right, 2 times more

317
00:18:13,160 --> 00:18:15,700
clients and TPS also higher.

318
00:18:16,240 --> 00:18:20,140
And once the peak is reached and
we go down, so first the lines

319
00:18:20,140 --> 00:18:23,080
are together, but once the peak
is reached and we go down, we

320
00:18:23,080 --> 00:18:26,340
see like kind of 2 times difference.

321
00:18:27,440 --> 00:18:32,900
Peak is reached 2 times sooner
And TPS is 2 times lower.

322
00:18:32,900 --> 00:18:33,900
Interesting, right?

323
00:18:34,340 --> 00:18:38,600
But looking at this, I was thinking,
okay, why do we have peak

324
00:18:39,300 --> 00:18:47,280
even without track_planning, reached
much sooner than the 196 cores, I

325
00:18:47,280 --> 00:18:48,160
think, vCPUs?

326
00:18:48,820 --> 00:18:50,280
Because normally it should be there.

327
00:18:50,380 --> 00:18:51,460
How many cores we have?

328
00:18:51,460 --> 00:18:57,180
This is our maximum, most optimal
point of load.

329
00:18:57,920 --> 00:19:01,560
There are nuances because we run
pgbench on the same machine,

330
00:19:01,560 --> 00:19:08,200
so we limit number of threads by
25% of cores, so we couldn't

331
00:19:08,560 --> 00:19:12,420
have all CPU consumed by clients,
basically.

332
00:19:12,980 --> 00:19:18,300
So anyway, Then I recalled our
February tests and we had a podcast

333
00:19:19,220 --> 00:19:20,400
episode about it.

334
00:19:20,440 --> 00:19:23,660
pg_stat_statements overhead, 4 million TPS.

335
00:19:23,760 --> 00:19:28,580
We needed to remove pg_stat_statements to reach
maximum workload.

336
00:19:28,660 --> 00:19:32,580
I remember when we removed it,
peak shifted to normal position,

337
00:19:32,580 --> 00:19:35,040
closer to number of vCPUs we have.

338
00:19:35,860 --> 00:19:38,100
So what does it tell me?

339
00:19:38,100 --> 00:19:41,740
Okay, I'm thinking, oh, it looks
like pg_stat_statements has some

340
00:19:41,740 --> 00:19:46,320
significant overhead, but when
we enable track_planning, this

341
00:19:46,320 --> 00:19:47,380
overhead doubles.

342
00:19:48,260 --> 00:19:54,780
And then, since our automation
collects flame graphs, it collects

343
00:19:55,080 --> 00:19:59,120
pg_wait_sampling analysis and
so on, so we quickly identify

344
00:19:59,380 --> 00:20:03,900
there are spin locks indeed, but
the same spin locks are present

345
00:20:04,540 --> 00:20:06,680
when you just use pg_stat_statements.

346
00:20:07,120 --> 00:20:08,960
This is the most interesting part,
right?

347
00:20:10,080 --> 00:20:18,480
So on flame graphs we see, and
like, I think I should publish

348
00:20:18,480 --> 00:20:19,100
a report.

349
00:20:19,180 --> 00:20:24,280
If I have it already, please let's
have it in show notes, link

350
00:20:24,280 --> 00:20:25,140
to the report.

351
00:20:25,440 --> 00:20:30,240
So in Flame Graphs we see that
without track_planning enabled,

352
00:20:30,660 --> 00:20:35,020
for this particular workload, we
have spin lock contention, and

353
00:20:35,020 --> 00:20:42,540
there is a very wide s_lock
function inside pgss_store.

354
00:20:42,540 --> 00:20:46,080
pgss_store is a function which
saves metrics.

355
00:20:47,020 --> 00:20:47,860
So what's happening?

356
00:20:47,860 --> 00:20:50,240
Michael: To pg_stat_statements,
that's the pass, right?

357
00:20:50,240 --> 00:20:51,600
Nikolay: Yeah, yeah.

358
00:20:51,600 --> 00:20:56,440
And like I would expect, like,
I remember in February I expected,

359
00:20:56,480 --> 00:21:00,040
like, I wanted to quickly reach
1 million TPS and go further.

360
00:21:00,040 --> 00:21:04,160
And I know it was achieved by Alexander
Korotkov in 2016, like

361
00:21:04,160 --> 00:21:04,920
8 years ago.

362
00:21:04,920 --> 00:21:08,200
So I was very surprised I couldn't
reach it easily.

363
00:21:08,480 --> 00:21:11,920
And only when I removed pg_stat_statements,
I reached it.

364
00:21:12,180 --> 00:21:16,540
So what's happening here is with
pg_stat_statements used, not

365
00:21:16,540 --> 00:21:20,560
track_planning, just pg_stat_statements,
if you have such a weird workload,

366
00:21:21,060 --> 00:21:27,540
like it's just a single query ID
and it runs hundreds of thousands

367
00:21:27,540 --> 00:21:31,580
of TPS for this, QPS in this case,
queries per second, For this

368
00:21:31,580 --> 00:21:34,540
particular machine, it can be lower.

369
00:21:34,700 --> 00:21:37,800
Actually, we reproduced problems
on 8-core machines.

370
00:21:38,040 --> 00:21:39,920
This is super interesting point
as well.

371
00:21:39,920 --> 00:21:40,240
Yeah, yeah.

372
00:21:40,240 --> 00:21:44,840
So this performance cliff can come
to you on small machines as

373
00:21:44,840 --> 00:21:45,340
well.

374
00:21:45,740 --> 00:21:48,780
Michael: But wait, wait, how many
concurrent connections?

375
00:21:48,780 --> 00:21:50,460
I guess it comes sooner?

376
00:21:51,020 --> 00:21:52,760
Nikolay: Yeah, well, it comes sooner.

377
00:21:53,360 --> 00:21:56,680
I don't remember from the top of
my head, but I know this problem

378
00:21:56,680 --> 00:21:57,700
can happen sooner.

379
00:21:57,700 --> 00:22:03,340
So the idea is, if we have limited
resources and transactions,

380
00:22:03,740 --> 00:22:07,180
these queries are so fast, like
I think it was like 15 microseconds

381
00:22:07,260 --> 00:22:09,780
or so, it's super fast for primary
key lookup.

382
00:22:09,960 --> 00:22:13,640
But they all fight to update metrics
in a single pg_stat_statements

383
00:22:14,080 --> 00:22:14,580
record.

384
00:22:15,220 --> 00:22:17,080
And spin lock is required for that.

385
00:22:17,080 --> 00:22:19,020
That's why we see spin lock contention.

386
00:22:21,040 --> 00:22:23,540
So it's observer effect as it is.

387
00:22:24,140 --> 00:22:25,580
Pure observer effect.

388
00:22:25,680 --> 00:22:29,720
And pg_stat_statements can have observer
effect, but you need to

389
00:22:29,720 --> 00:22:40,000
reach a lot of queries per second.

390
00:22:40,200 --> 00:22:40,660
Michael: Good point.

391
00:22:40,660 --> 00:22:42,460
Nikolay: Because of limited resources,
right?

392
00:22:42,940 --> 00:22:48,800
So, yeah, and this means that in
reality, very unlikely you will

393
00:22:48,800 --> 00:22:49,340
see it.

394
00:22:49,340 --> 00:22:50,880
Maybe, maybe, but unlikely.

395
00:22:51,100 --> 00:22:56,520
It should be super fast index-only
scan, for example, and the

396
00:22:56,520 --> 00:22:58,040
frequency is so high.

397
00:22:59,340 --> 00:23:04,160
But when you look at flame graphs
with track_planning enabled,

398
00:23:04,180 --> 00:23:10,680
you see exactly 2 areas of similar
width. Both are s_lock inside

399
00:23:10,680 --> 00:23:15,480
pgss_store, and 1 pgss_store is
in the execution phase and the other pgss_

400
00:23:15,900 --> 00:23:22,360
store is in the planning phase. So metrics
are saved separately, and if

401
00:23:22,360 --> 00:23:24,780
you enable track_planning, they are saved 2 times.

402
00:23:25,380 --> 00:23:29,360
That fully explains what's happening
here.

403
00:23:29,860 --> 00:23:34,440
If we enable it, we just move the performance
cliff 2 times closer

404
00:23:34,440 --> 00:23:35,100
to us.

405
00:23:35,860 --> 00:23:38,540
Well if we sit in 0, right?

406
00:23:38,540 --> 00:23:40,020
If we shift it already, it's...

407
00:23:40,440 --> 00:23:40,840
Yeah, so...

408
00:23:40,840 --> 00:23:44,820
Michael: Well it explains why in
your synthetic benchmarks you

409
00:23:44,820 --> 00:23:48,180
got saturation twice as fast.

410
00:23:48,420 --> 00:23:49,520
Nikolay: Yes, yes.

411
00:23:50,440 --> 00:23:53,380
All pieces of the puzzle are like...

412
00:23:53,480 --> 00:23:55,440
This took quite a lot.

413
00:23:55,440 --> 00:24:00,700
We had 4 sessions of various kinds
of research starting in February

414
00:24:00,760 --> 00:24:04,700
with pg_stat_statements and idea,
let's squeeze a lot of TPS from

415
00:24:04,700 --> 00:24:06,180
our big machines.

416
00:24:06,960 --> 00:24:08,440
But this is interesting, right?

417
00:24:08,440 --> 00:24:09,640
Let's think about it.

418
00:24:09,800 --> 00:24:12,660
So we have this pgss store.

419
00:24:12,780 --> 00:24:16,500
Obviously, I think it should be
possible to find a way to store

420
00:24:16,500 --> 00:24:17,260
it just once.

421
00:24:17,260 --> 00:24:18,740
I mean, to save it once.

422
00:24:18,820 --> 00:24:23,060
Michael: Well, I remember you suggested
that when we were discussing

423
00:24:23,120 --> 00:24:27,100
privately, and since then, having
read the discussions, it's

424
00:24:27,100 --> 00:24:30,040
a deliberate design decision to
save them separately.

425
00:24:30,860 --> 00:24:32,340
And I didn't know about that.

426
00:24:32,480 --> 00:24:37,200
So you mentioned a while back,
we don't even track queries that

427
00:24:37,200 --> 00:24:38,260
fail for example.

428
00:24:38,600 --> 00:24:43,780
Well now in a way we are like because
the number of times a query

429
00:24:43,780 --> 00:24:46,480
was planned doesn't have to equal
the number of times it was

430
00:24:46,480 --> 00:24:48,540
executed anymore in pg_stat_statements.

431
00:24:48,840 --> 00:24:51,600
Once you've got track_planning
on you can see the number of times

432
00:24:51,600 --> 00:24:54,940
it was planned versus the number
of times it was executed.

433
00:24:56,780 --> 00:25:00,380
Nikolay: Are you saying that if
we have a failed query, but it

434
00:25:00,380 --> 00:25:04,020
was planned, planned time would
be saved?

435
00:25:04,600 --> 00:25:05,100
Michael: Yes.

436
00:25:05,740 --> 00:25:07,360
That's my understanding from reading
the...

437
00:25:07,360 --> 00:25:09,620
I haven't tested this, but that's
my understanding from reading

438
00:25:09,620 --> 00:25:11,600
the discussions around the design.

439
00:25:11,680 --> 00:25:14,160
Nikolay: Still I think it should
be possible to optimize.

440
00:25:14,280 --> 00:25:17,640
Yeah, I understand the design decision,
but if we think about...

441
00:25:17,640 --> 00:25:20,580
Michael: Either the design would
need to change or the...

442
00:25:20,580 --> 00:25:23,760
Yeah exactly but it's a trade-off
and that's like I found that

443
00:25:23,760 --> 00:25:24,260
interesting.

444
00:25:24,520 --> 00:25:27,540
Nikolay: I remember cases, terrible
cases which were very hard,

445
00:25:27,540 --> 00:25:31,100
extremely hard to diagnose because
pg_stat_statements didn't show

446
00:25:31,100 --> 00:25:35,980
like for example merge joins which
like merge join itself like

447
00:25:35,980 --> 00:25:40,820
not merge join somehow like maybe
hash join was used, but considering

448
00:25:40,940 --> 00:25:45,820
merge join, the planner spent many seconds,
like 5 to 10 seconds,

449
00:25:45,900 --> 00:25:47,180
just considering it.

450
00:25:47,200 --> 00:25:51,740
Because it needs sometimes to check
the table actual data and

451
00:25:51,740 --> 00:25:55,620
see like min or max values for
some columns, which is unexpected,

452
00:25:55,920 --> 00:25:58,160
but planner sometimes does it.

453
00:25:58,520 --> 00:26:02,740
And if it takes too long, planning
phase, even if merge join

454
00:26:03,480 --> 00:26:05,900
is not chosen eventually.

455
00:26:07,360 --> 00:26:09,500
In the end, we don't see it, right?

456
00:26:09,640 --> 00:26:11,420
LockManager contention, right?

457
00:26:12,120 --> 00:26:17,160
Sometimes with JSONB values, we
have something like planning

458
00:26:17,160 --> 00:26:21,900
is long and consuming a lot of
CPU, but we should see it.

459
00:26:21,900 --> 00:26:23,040
We should see it, right?

460
00:26:23,040 --> 00:26:27,280
And pg_stat_kcache, which tracks
physical metrics and those people

461
00:26:27,280 --> 00:26:32,200
who are, I don't know, like on
self-managed Postgres can use

462
00:26:32,200 --> 00:26:33,360
it and it's great.

463
00:26:33,760 --> 00:26:37,760
It also has track_planning actually
the same, same parameter,

464
00:26:37,760 --> 00:26:38,260
right?

465
00:26:38,480 --> 00:26:39,860
Michael: Oh no, off by default.

466
00:26:40,320 --> 00:26:41,420
Nikolay: It's off by default.

467
00:26:41,460 --> 00:26:41,880
Yes.

468
00:26:41,880 --> 00:26:45,280
So now we understand the design
decision and we understand that

469
00:26:45,280 --> 00:26:49,900
some benchmark, which checked very,
very like edge case, I would

470
00:26:49,900 --> 00:26:53,720
say maybe even corner case because
you have unlimited TPS and

471
00:26:53,720 --> 00:26:57,080
you have select-only, a single query
ID. It's edge, it's corner case,

472
00:26:57,080 --> 00:27:01,300
not just edge; we have 2 edges coming
here, it's a corner case.

473
00:27:01,360 --> 00:27:06,640
So the likelihood of having it in production is
extremely low, and the decision,

474
00:27:06,740 --> 00:27:09,840
in my opinion the decision was wrong.
What do you think?

475
00:27:10,680 --> 00:27:15,160
Michael: Yeah, I think I agree. I
can see why it was made. I think

476
00:27:15,320 --> 00:27:19,140
it's quite a startling
number if you look at a

477
00:27:19,140 --> 00:27:22,860
45% drop; that was the number reported
and shown.

478
00:27:23,160 --> 00:27:27,040
I can see why people were scared
of that, especially when it

479
00:27:27,040 --> 00:27:28,280
was in the beta phase.

480
00:27:28,320 --> 00:27:32,540
But in hindsight, especially with
what I know now based on what

481
00:27:32,540 --> 00:27:36,680
you've said, it seems like we'd
be much better off having it

482
00:27:36,680 --> 00:27:39,220
on for everybody.

483
00:27:39,640 --> 00:27:42,380
Nikolay: 45% is roughly 50, right?

484
00:27:42,380 --> 00:27:43,360
It's like 2x.

485
00:27:44,540 --> 00:27:46,340
Michael: It sounds scary though,
right?

486
00:27:47,320 --> 00:27:48,140
Nikolay: It's reasonable.

487
00:27:48,160 --> 00:27:52,460
If you are sitting on the edge,
you are suffering from this penalty

488
00:27:52,780 --> 00:27:54,560
pg_stat_statements already gives
you.

489
00:27:55,440 --> 00:28:00,140
You just double it, because pgss_store
is called twice.

490
00:28:00,140 --> 00:28:02,140
That's why it's roughly 50%.

491
00:28:02,980 --> 00:28:06,460
But the conclusion here is not,
let's not enable by default.

492
00:28:06,460 --> 00:28:11,160
Let's put a note in documentation
that pg_stat_statements is

493
00:28:11,160 --> 00:28:15,060
sensitive to the cases when it's
a single query ID and a lot

494
00:28:15,060 --> 00:28:17,180
of high-frequency queries happening.

495
00:28:18,060 --> 00:28:19,980
Michael: I think there is that
note, yeah.

496
00:28:20,980 --> 00:28:22,240
Nikolay: If there is, it's good.

497
00:28:22,240 --> 00:28:25,040
Then track_planning just doubles
this penalty and that's it,

498
00:28:25,040 --> 00:28:27,540
but it's not happening under normal
circumstances.

499
00:28:30,060 --> 00:28:32,460
Michael: It's a really good point
that it's not track_planning.

500
00:28:32,600 --> 00:28:35,660
track_planning isn't the thing
causing this.

501
00:28:35,660 --> 00:28:37,540
The track_planning makes it twice
as bad.

502
00:28:37,540 --> 00:28:41,640
But twice as bad is quite a small
multiplication factor when

503
00:28:41,640 --> 00:28:44,840
you're talking about such an extreme
case.

504
00:28:45,540 --> 00:28:46,030
Nikolay: Yeah.

505
00:28:46,030 --> 00:28:47,140
It would be like

506
00:28:47,140 --> 00:28:50,320
Michael: if we imagine if we had
twice as you know that the transaction

507
00:28:50,320 --> 00:28:52,580
ID wraparound is a couple of billion?

508
00:28:53,080 --> 00:28:55,080
Or a couple of billion in the positive.

509
00:28:55,080 --> 00:28:57,840
If it was 4 billion in the positive,
that would make a bit of

510
00:28:57,840 --> 00:29:00,880
difference, but not a huge, huge
difference.

511
00:29:02,200 --> 00:29:05,900
In these extreme cases, twice isn't
that bad.

512
00:29:06,580 --> 00:29:08,660
Nikolay: So I have 2 separate questions
here.

513
00:29:09,000 --> 00:29:12,600
Can we have a single pgss_store
instead of 2?

514
00:29:12,700 --> 00:29:18,080
And second question, should we
recommend enabling it to everyone

515
00:29:18,080 --> 00:29:22,400
but understanding that doubles
the problem of pg_stat_statements which

516
00:29:22,580 --> 00:29:26,280
already is present, it's there
already, but you just don't feel

517
00:29:26,280 --> 00:29:26,780
it.

518
00:29:27,740 --> 00:29:31,220
You don't feel it, for example,
on that machine, almost 200 vCPUs.

519
00:29:31,440 --> 00:29:35,940
You don't feel it unless a single
query ID has like 200,000

520
00:29:38,480 --> 00:29:39,560
calls per second.

521
00:29:39,560 --> 00:29:40,060
200,000.

522
00:29:42,260 --> 00:29:43,300
It's insane.

523
00:29:44,640 --> 00:29:45,480
This is the edge.

524
00:29:45,480 --> 00:29:46,700
It's super far.

525
00:29:47,920 --> 00:29:50,740
And without track_planning, okay,
it's 400,000.

526
00:29:53,180 --> 00:29:56,250
So not enabling it, you're just
saying, okay, instead of 200,000,

527
00:29:56,250 --> 00:29:57,940
we will have 400,000.

528
00:29:58,620 --> 00:30:02,040
But looking at your statistics,
you see, okay, maximum calls

529
00:30:02,040 --> 00:30:03,480
per second is what?

530
00:30:03,860 --> 00:30:04,360
1,000?

531
00:30:04,660 --> 00:30:05,160
5,000?

532
00:30:05,440 --> 00:30:05,940
10,000?

533
00:30:07,120 --> 00:30:08,800
It's already a lot, right?

534
00:30:10,160 --> 00:30:13,780
And usually we have different kinds
of problems with such queries

535
00:30:14,380 --> 00:30:18,700
which are exactly present during
planning time because locking,

536
00:30:19,120 --> 00:30:19,540
right?

537
00:30:19,540 --> 00:30:20,660
It's already a problem.

538
00:30:20,740 --> 00:30:21,980
Michael: Yeah, such a good point.

539
00:30:22,200 --> 00:30:26,920
Nikolay: And I wonder, if we have
a high-frequency query, a single

540
00:30:26,920 --> 00:30:31,500
query ID like 10,000, primary
key lookup in many cases, maybe

541
00:30:31,500 --> 00:30:32,340
in most cases.

542
00:30:32,780 --> 00:30:35,780
We say, okay, LockManager can come
after you, right?

543
00:30:36,160 --> 00:30:37,560
Let's cache it.

544
00:30:38,040 --> 00:30:41,460
I don't know, like prepared statements,
or maybe a PL/pgSQL function,

545
00:30:41,520 --> 00:30:43,100
indirect caching and so on.

546
00:30:43,100 --> 00:30:44,280
Let's avoid planning.
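
A sketch of that idea with prepared statements (the table and parameter are just an example; with plan_cache_mode forced to generic, repeated executions skip fresh planning):

```sql
PREPARE get_account(int) AS
  SELECT abalance FROM pgbench_accounts WHERE aid = $1;

-- Optionally force a generic plan so planning effectively happens once
SET plan_cache_mode = force_generic_plan;

EXECUTE get_account(1);
EXECUTE get_account(2);  -- reuses the cached generic plan, no re-planning
```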

547
00:30:46,340 --> 00:30:47,980
Planning will be okay, right?

548
00:30:48,820 --> 00:30:50,920
Michael: I haven't checked, but
that would be an interesting

549
00:30:50,920 --> 00:30:51,420
test.

550
00:30:51,500 --> 00:30:52,480
Actually, no.

551
00:30:53,100 --> 00:30:58,360
Yeah, you could do the same test
again, but with prepared statements.

552
00:30:58,360 --> 00:31:01,840
It would be interesting to see
if track_planning is updated at

553
00:31:01,840 --> 00:31:02,340
all.

554
00:31:02,540 --> 00:31:06,000
Presumably it's not doing planning
so it shouldn't do that.

555
00:31:06,000 --> 00:31:08,680
Nikolay: That's a great point. I'm
going to check it, because, like,

556
00:31:08,680 --> 00:31:14,620
as I said with our almost well
oiled machine I will just ask

557
00:31:14,620 --> 00:31:21,440
to repeat the experiment with hyphen
capital M, prepared, and we will

558
00:31:21,440 --> 00:31:24,960
see. And basically it should have
the effect like we disabled it, but

559
00:31:24,960 --> 00:31:26,600
we keep it enabled, right?

560
00:31:27,040 --> 00:31:29,380
Michael: with a pool and maybe
like with some way of keeping

561
00:31:29,380 --> 00:31:33,000
the sessions. Remember last time
it was doing the first...

562
00:31:33,520 --> 00:31:35,140
Nikolay: Yeah, I remember.

563
00:31:35,140 --> 00:31:36,560
Michael: Would it need to be that
or not?

564
00:31:36,560 --> 00:31:37,760
I actually can't remember.

565
00:31:38,160 --> 00:31:40,020
Maybe that would be fine in this
case.

566
00:31:40,080 --> 00:31:43,780
Nikolay: In this case it will be
fine, because the problem is,

567
00:31:43,780 --> 00:31:44,920
like it's vice versa.

568
00:31:44,920 --> 00:31:47,140
pgbench maintains the connection here.

569
00:31:47,300 --> 00:31:51,360
What we discussed is that like
in reality, if connection is not

570
00:31:51,360 --> 00:31:54,860
maintained, we connect again and
then we don't have the cache.

571
00:31:55,840 --> 00:31:58,860
So the connection, okay, I will
double check this area.

572
00:31:58,860 --> 00:31:59,840
It's a good point.

573
00:32:00,140 --> 00:32:04,700
And additional research to make
the picture complete.

574
00:32:05,200 --> 00:32:10,580
But my gut tells me it should be
enabled in many cases, but we

575
00:32:10,580 --> 00:32:14,140
should be aware of spin-lock contention,
which can happen if

576
00:32:14,140 --> 00:32:15,200
we have high frequency.

577
00:32:15,240 --> 00:32:19,260
Honestly, I'm thinking of finding
some rule, maybe, in monitoring:

578
00:32:19,500 --> 00:32:23,100
for example, we know this machine
has this number of vCPUs, and

579
00:32:23,100 --> 00:32:26,140
we can quickly check how many calls
per second, even if

580
00:32:26,140 --> 00:32:29,040
track_planning is not enabled, we can
check how many calls per second

581
00:32:29,040 --> 00:32:31,300
we have for the most frequent...

582
00:32:33,420 --> 00:32:34,340
Michael: Very high frequency.

583
00:32:34,760 --> 00:32:36,340
Order by calls, descending.

584
00:32:36,340 --> 00:32:36,920
Nikolay: Order by calls, exactly.

585
00:32:36,920 --> 00:32:37,600
It's not hard.

586
00:32:39,020 --> 00:32:39,520
Yeah.

587
00:32:39,780 --> 00:32:43,440
So top 10 by calls, and how big
is that?
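
A sketch of that check; pg_stat_statements counters are cumulative, so turning calls into calls per second needs 2 snapshots some interval apart (or the stats_since column, available since Postgres 17):

```sql
SELECT queryid,
       calls,
       mean_exec_time
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```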

588
00:32:43,440 --> 00:32:47,840
And okay, this number of cores,
this number of QPS, this level

589
00:32:47,840 --> 00:32:53,460
of QPS, just roughly estimate how
far we are from the performance

590
00:32:53,460 --> 00:32:53,960
cliff.

591
00:32:54,620 --> 00:32:56,820
Maybe it's possible to do this,
right?

592
00:32:56,820 --> 00:32:57,460
For modern...

593
00:32:57,880 --> 00:32:59,740
Of course, it depends on many things.

594
00:33:00,060 --> 00:33:01,860
It doesn't depend on the plan,
I think.

595
00:33:03,240 --> 00:33:05,620
Because saving happens only once.

596
00:33:05,660 --> 00:33:08,580
It depends, not
on duration, but on frequency.

597
00:33:10,760 --> 00:33:14,760
And of course, on the resources,
type of processor you're using.

598
00:33:15,760 --> 00:33:21,260
So I think it's possible to have
some kind of prediction, how

599
00:33:21,260 --> 00:33:22,780
far we are from this.

600
00:33:23,240 --> 00:33:25,840
And if we are really far, let's
enable track_planning.

601
00:33:25,840 --> 00:33:26,820
This is my point.

602
00:33:27,780 --> 00:33:30,880
Meanwhile, let's think maybe it's
possible to save it just once

603
00:33:30,880 --> 00:33:32,420
and reconsider the decision.

604
00:33:34,200 --> 00:33:36,820
I'm really thankful you did this
research.

605
00:33:36,820 --> 00:33:39,260
I liked it, it fits in my puzzle.

606
00:33:39,520 --> 00:33:42,000
Like, I'm still collecting pieces
of it.

607
00:33:42,280 --> 00:33:42,780
Cool.

608
00:33:44,340 --> 00:33:50,780
So enable it, but it will double
the overhead, so be aware of that overhead.

609
00:33:50,840 --> 00:33:53,540
This is the bottom line for pg_stat_statements.

610
00:33:54,880 --> 00:33:55,380
Michael: Yeah.

611
00:33:56,680 --> 00:33:59,540
If you ran a managed Postgres service,
would you enable it for

612
00:33:59,540 --> 00:34:01,420
everybody by default at the beginning?

613
00:34:02,480 --> 00:34:06,220
Nikolay: Well, yeah, this is another
piece of great info you

614
00:34:06,220 --> 00:34:06,720
found.

615
00:34:07,780 --> 00:34:09,440
We almost forgot to discuss it.

616
00:34:09,440 --> 00:34:10,460
You found the...

617
00:34:10,760 --> 00:34:12,480
So you found that many...

618
00:34:12,560 --> 00:34:14,320
Well, it's not a surprise to me.

619
00:34:14,480 --> 00:34:19,440
You sent me the list of managed
service providers and what's

620
00:34:19,440 --> 00:34:22,420
possible and what's not there. It's
not a surprise to me that most

621
00:34:22,420 --> 00:34:27,420
have it off by default. The surprise
for me was that Google Cloud

622
00:34:27,980 --> 00:34:30,180
doesn't even allow changing it.

623
00:34:30,620 --> 00:34:33,140
Michael: Not just Google Cloud,
quite a few of the providers...

624
00:34:34,020 --> 00:34:35,780
Nikolay: Crunchy Bridge, right?

625
00:34:36,240 --> 00:34:38,760
Michael: So yeah, it's
definitely, it's definitely

626
00:34:38,760 --> 00:34:45,060
not a full list, and I'm guilty
of testing more of the ones that

627
00:34:45,060 --> 00:34:47,140
make it easy to spin up new instances.

628
00:34:47,320 --> 00:34:51,920
Nikolay: But let's drop some names
here.

629
00:34:52,240 --> 00:34:53,500
RDS, off by default?

630
00:34:53,500 --> 00:34:56,460
Michael: Should we go the other
way around?

631
00:34:56,460 --> 00:34:59,020
The other direction is the ones
that do make it configurable.

632
00:35:00,180 --> 00:35:03,480
RDS is I think still by far the
most popular.

633
00:35:03,840 --> 00:35:05,140
Off by default but configurable.

634
00:35:06,460 --> 00:35:09,220
Bear in mind pg_stat_statements in
RDS is on by default.

635
00:35:09,520 --> 00:35:11,280
So they do change some of

636
00:35:11,280 --> 00:35:11,880
Nikolay: the defaults

637
00:35:11,880 --> 00:35:13,040
Michael: around this, for example.

638
00:35:13,380 --> 00:35:16,560
Most of these providers, in fact,
I can't remember the last time

639
00:35:16,560 --> 00:35:19,900
I checked a provider that didn't
have pg_stat_statements on by

640
00:35:19,900 --> 00:35:20,320
default.

641
00:35:20,320 --> 00:35:21,000
You mean

642
00:35:21,000 --> 00:35:24,320
Nikolay: on by default, you mean
it's in shared_preload_libraries,

643
00:35:24,320 --> 00:35:26,700
or it's created in like template
database?

644
00:35:26,840 --> 00:35:29,960
Michael: It's, yeah, it's, it starts
tracking queries without

645
00:35:29,960 --> 00:35:30,680
you doing anything.

646
00:35:30,680 --> 00:35:32,880
You don't have to create extension
or whatever the normal thing

647
00:35:32,880 --> 00:35:33,660
is to do.

648
00:35:33,760 --> 00:35:34,900
Nikolay: I don't remember if we should
create the extension.

649
00:35:34,900 --> 00:35:37,200
Michael: Almost all providers,
because they rely on it for their

650
00:35:37,200 --> 00:35:40,820
monitoring; the monitoring they
present to you runs off

651
00:35:40,840 --> 00:35:43,060
pg_stat_statements for most of these.

652
00:35:43,180 --> 00:35:46,140
So it's on by default, but
track_planning isn't.

653
00:35:47,080 --> 00:35:51,100
That's true for almost all of the
ones I tested on.

654
00:35:51,100 --> 00:35:52,760
But I think Timescale...

655
00:35:53,140 --> 00:35:54,940
Nikolay: Timescale is the champion
here.

656
00:35:55,080 --> 00:35:55,580
Michael: Yeah.

657
00:35:55,640 --> 00:35:58,780
Well, it's configurable, but it's
also on by default, which was,

658
00:35:58,780 --> 00:36:01,500
I think, the only one that I found
that had it on by default.

659
00:36:01,720 --> 00:36:03,520
There probably are others I didn't
check them all.

660
00:36:03,520 --> 00:36:04,840
It's quite time consuming.

661
00:36:04,840 --> 00:36:05,640
Nikolay: That's great.

662
00:36:06,220 --> 00:36:07,320
I love this decision.

663
00:36:08,940 --> 00:36:09,720
Michael: Yeah, it's great.

664
00:36:09,720 --> 00:36:09,960
Nikolay: But a

665
00:36:09,960 --> 00:36:12,060
Michael: few of the others made
it configurable, which is nice.

666
00:36:12,060 --> 00:36:15,820
I just was surprised that at least
one of the major ones doesn't

667
00:36:15,820 --> 00:36:16,020
even.

668
00:36:16,020 --> 00:36:19,000
Nikolay: Yeah, Google Cloud SQL
doesn't even allow changing

669
00:36:19,000 --> 00:36:19,840
it and we...

670
00:36:20,240 --> 00:36:22,620
You gave up trying to check Azure.

671
00:36:24,380 --> 00:36:28,200
Michael: Yeah, I had limited time
and the UI and I don't...

672
00:36:28,840 --> 00:36:31,840
I need to talk to somebody there
to teach me how to use it or

673
00:36:31,840 --> 00:36:33,720
something because I was really
struggling.

674
00:36:35,280 --> 00:36:37,100
Nikolay: Yeah, I understand that.

675
00:36:37,380 --> 00:36:43,200
So, yeah, our
recommendation is, for those

676
00:36:43,200 --> 00:36:47,320
who are listening from those teams,
like Cloud SQL, Crunchy and

677
00:36:47,320 --> 00:36:49,460
Supabase: guys, make it configurable.

678
00:36:50,280 --> 00:36:52,360
It's not normal that it's not.

679
00:36:53,940 --> 00:37:00,280
I would say consider making it
on by default, but at the bare

680
00:37:00,280 --> 00:37:04,260
minimum, let users
decide.

681
00:37:05,560 --> 00:37:08,260
Michael: I think so, and I think
normally when these providers

682
00:37:08,300 --> 00:37:11,380
change a default they'll change
it for new instances only where

683
00:37:11,380 --> 00:37:16,220
I think there's just such low
risk. So I like that approach

684
00:37:16,340 --> 00:37:20,160
of you know if people create new
clusters or new instances, have

685
00:37:20,160 --> 00:37:21,140
it on by default.

686
00:37:22,360 --> 00:37:26,320
It will help people when they come
across planning time related

687
00:37:26,320 --> 00:37:30,040
issues, diagnose those issues much
quicker, much easier, whether

688
00:37:30,040 --> 00:37:33,060
that's on their own or once they
pull people in.

689
00:37:33,400 --> 00:37:38,180
Nikolay: Yeah, yeah, that's,
that's for sure. Yeah, so,

690
00:37:38,180 --> 00:37:46,260
good. I hope we brought some food
for thought to the folks who are

691
00:37:46,260 --> 00:37:47,280
listening to us.

692
00:37:47,860 --> 00:37:49,940
So yeah, thank you.

693
00:37:49,940 --> 00:37:50,880
Michael: Nice one Nikolay.

694
00:37:51,060 --> 00:37:52,200
Nikolay: Yeah, thank you.

695
00:37:52,200 --> 00:37:53,700
Michael: Take care, have a good
week, bye.