1
00:00:00,060 --> 00:00:02,140
Nikolay: Hello, hello, this is
Postgres.FM.

2
00:00:02,300 --> 00:00:05,780
I'm Nik, Postgres.AI, and as usual
with me, Michael, pgMustard.

3
00:00:06,540 --> 00:00:07,500
Hello, Michael.

4
00:00:08,400 --> 00:00:09,380
Michael: Hello, Nik.

5
00:00:09,380 --> 00:00:10,220
How's it going?

6
00:00:10,680 --> 00:00:11,660
Nikolay: Everything is all right.

7
00:00:11,660 --> 00:00:12,640
A lot of work.

8
00:00:13,100 --> 00:00:15,020
There is a progress on many fronts.

9
00:00:15,020 --> 00:00:18,900
So, yeah, excited to discuss a
quite important topic.

10
00:00:19,660 --> 00:00:24,360
And maybe let's see what we can
learn from this discussion.

11
00:00:24,380 --> 00:00:26,500
I don't know, like mutually learn
or something.

12
00:00:27,180 --> 00:00:29,180
Michael: Yeah, I'm glad you think
it's interesting.

13
00:00:29,760 --> 00:00:32,520
Nikolay: It's interesting because
it's always with us.

14
00:00:33,160 --> 00:00:36,600
When we deal with performance,
we need to work with many numbers

15
00:00:36,780 --> 00:00:39,900
and this topic pops up all the
time.

16
00:00:40,280 --> 00:00:40,780
Michael: Yeah.

17
00:00:41,400 --> 00:00:46,960
So specifically, the topic is when
we're looking at latencies

18
00:00:47,280 --> 00:00:48,956
or monitoring or troubleshooting,
looking at latencies or you

19
00:00:48,956 --> 00:00:54,620
know monitoring or troubleshooting,
maybe prioritization, we

20
00:00:55,240 --> 00:01:00,640
often need to look at queries by
some measure of their duration.

21
00:01:00,900 --> 00:01:06,260
Like often, and I see this in blog
posts, in tools, various places,

22
00:01:06,260 --> 00:01:12,080
often ordered by the mean or the
average of their latencies.

23
00:01:12,540 --> 00:01:15,520
Very easy to get from pg_stat_statements
for example, it has

24
00:01:15,520 --> 00:01:20,040
a dedicated column, but in a lot
of other places, a lot of other,

25
00:01:20,140 --> 00:01:24,620
you know, different types of monitoring
tools and I'd say more

26
00:01:24,620 --> 00:01:26,160
on the application developer side.

27
00:01:26,160 --> 00:01:31,460
I'm seeing increasingly people
moving to more monitoring the

28
00:01:31,460 --> 00:01:38,100
p95 p99 type latencies, so percentile
based and kind of the tail

29
00:01:38,100 --> 00:01:43,280
latency is not the average, so
wondering about that.

30
00:01:43,440 --> 00:01:50,680
And yeah, I saw recently another
feature request to have the

31
00:01:50,680 --> 00:01:54,720
ability to see these in pg_stat_statements
and I've realized it's

32
00:01:54,720 --> 00:01:57,340
something we can approximate but
it's not something we can get

33
00:01:57,340 --> 00:02:00,660
from that so I wrote a post about
it recently and I'm glad you

34
00:02:00,660 --> 00:02:01,780
find interesting too.

35
00:02:02,840 --> 00:02:03,340
Nikolay: Yeah.

36
00:02:04,300 --> 00:02:09,060
So that's important topic, I think,
because we all work with

37
00:02:09,280 --> 00:02:11,780
averages or mean values all the
time.

38
00:02:11,780 --> 00:02:14,240
And, it's not only about positives,
right?

39
00:02:14,280 --> 00:02:20,420
Mean value, meaning of average
is often misunderstood and it's

40
00:02:20,420 --> 00:02:21,300
often misleading.

41
00:02:21,480 --> 00:02:27,880
I remember the idea that in any
organization, if you check what

42
00:02:27,880 --> 00:02:34,080
is the mean salary or average salary,
most people might find

43
00:02:34,080 --> 00:02:37,360
out it's like in some organizations,
in some organizations, oh

44
00:02:37,360 --> 00:02:41,980
wow, it's like this average salary
is much above my own salary,

45
00:02:41,980 --> 00:02:42,880
what's happening here?

46
00:02:42,880 --> 00:02:46,980
Well, what's happening, a couple
of dudes have huge salary, that's

47
00:02:46,980 --> 00:02:47,780
why, right?

48
00:02:49,300 --> 00:02:55,580
So it means that we like, when
we back to Postgres, back to queries,

49
00:02:56,840 --> 00:03:00,480
you see average or mean time is
like 1 millisecond.

50
00:03:01,020 --> 00:03:02,780
Oh, not bad, not bad, right?

51
00:03:02,840 --> 00:03:06,500
But how many queries are much about that?

52
00:03:07,200 --> 00:03:08,500
It might be a lot.

53
00:03:09,120 --> 00:03:14,300
And I must notice that mean time, mean exact time, mean planned

54
00:03:14,300 --> 00:03:16,240
time, these columns are redundant.

55
00:03:16,360 --> 00:03:20,520
They could be avoided because we always can calculate them dividing

56
00:03:20,580 --> 00:03:22,540
total by number of calls, right?

57
00:03:23,200 --> 00:03:25,460
Or plans in case of planning time.

58
00:03:25,460 --> 00:03:29,080
By the way, I read your article, I noticed you optimistically

59
00:03:30,060 --> 00:03:32,420
involve planning metrics.

60
00:03:33,260 --> 00:03:33,740
Yeah.

61
00:03:33,740 --> 00:03:37,940
Unfortunately, 99% of cases, we don't see them because they are

62
00:03:37,940 --> 00:03:38,900
off by default.

63
00:03:39,740 --> 00:03:44,060
And I remember last year or previous year, it took some effort

64
00:03:44,060 --> 00:03:50,100
for us to prove that basically you can turn them on, but unfortunately,

65
00:03:50,220 --> 00:03:54,120
as we had episode about this, right, unfortunately, performance

66
00:03:54,120 --> 00:03:58,440
cliff will be 2 times closer to you if you turn it on, right,

67
00:03:58,440 --> 00:04:02,220
but again, like, default recommendation is to turn it on because

68
00:04:02,220 --> 00:04:03,740
it's very valuable information.

69
00:04:04,120 --> 00:04:08,900
So again, back to averages, we could always calculate them dynamically

70
00:04:09,020 --> 00:04:10,820
or something, it's not a big deal.

71
00:04:11,120 --> 00:04:14,760
But I guess it was just a matter of convenience to add them and

72
00:04:14,760 --> 00:04:19,320
keep them already pre-calculated in pg_stat_statements and other

73
00:04:19,640 --> 00:04:22,260
similar extensions and so on.

74
00:04:22,800 --> 00:04:26,620
But would be really great to see percentiles, right?

75
00:04:26,820 --> 00:04:31,020
Or better to see whole histogram analysis of what's happening.

76
00:04:31,380 --> 00:04:34,700
But unfortunately, I guess it's a very complex topic.

77
00:04:34,760 --> 00:04:38,700
I mean, I know pg_stat_monitor implements this, but we need to

78
00:04:38,700 --> 00:04:39,440
check overhead.

79
00:04:40,120 --> 00:04:41,880
Michael: That's the critical part.

80
00:04:41,880 --> 00:04:44,080
And also, just it's more columns.

81
00:04:44,180 --> 00:04:47,700
We've got already a lot of columns in pg_stat_statements.

82
00:04:48,120 --> 00:04:53,180
So it has additional overhead both to tracking, but also in terms

83
00:04:53,180 --> 00:04:54,240
of user experience.

84
00:04:54,520 --> 00:04:58,140
And I mean, I think at this point we have so many columns that

85
00:04:58,140 --> 00:04:59,680
why not have a few more?

86
00:05:00,340 --> 00:05:01,880
But it is a problem.

87
00:05:01,880 --> 00:05:05,220
Like, there are, I think, just in time compilation, we've got

88
00:05:05,220 --> 00:05:08,940
a bunch of additional columns for every single part of that and

89
00:05:08,940 --> 00:05:11,420
there's there has been Discussion about whether that should be

90
00:05:11,420 --> 00:05:13,920
collapsed down because do we really need granular?

91
00:05:15,460 --> 00:05:20,140
Reporting there But yeah, I I would be in favor of some histograms.

92
00:05:20,500 --> 00:05:24,060
I also haven't yet seen any analysis of...

93
00:05:24,140 --> 00:05:27,480
So pg_stat_monitor is an alternative to pg_stat_statements, relatively

94
00:05:27,540 --> 00:05:29,920
new in the grand scheme of things, probably a few years old at

95
00:05:29,920 --> 00:05:30,600
this point.

96
00:05:30,600 --> 00:05:35,980
But I don't see it in many installations and I haven't yet seen

97
00:05:35,980 --> 00:05:40,900
anybody publish benchmarks on its performance relative to, yeah.

98
00:05:40,900 --> 00:05:44,020
Nikolay: It was created in Percona and I remember I discussed

99
00:05:44,100 --> 00:05:48,420
it with Peter Zaitsev who was the founder of Percona when this

100
00:05:48,420 --> 00:05:51,500
extension was created and they said instead of contributing to

101
00:05:51,500 --> 00:05:56,240
pg_stat_statements, they wanted a separate, independent release

102
00:05:56,240 --> 00:05:56,740
cycle.

103
00:05:56,980 --> 00:06:01,180
And as we know, pg_stat_statements is a part of the set of so-called

104
00:06:01,180 --> 00:06:02,080
contrib modules.

105
00:06:02,100 --> 00:06:06,560
It means that they are, like, basically, this cycle, it's Postgres's

106
00:06:06,660 --> 00:06:08,760
own release cycle, so once per year.

107
00:06:09,140 --> 00:06:13,180
And if you want a few times per year to release something, it's

108
00:06:13,180 --> 00:06:13,680
hard.

109
00:06:14,120 --> 00:06:18,480
And then I know the original creator left Percona and I don't

110
00:06:18,480 --> 00:06:21,540
know what's happening with development of this extension but

111
00:06:21,540 --> 00:06:28,120
when it was created, our first look, we did, our team, I remember

112
00:06:29,040 --> 00:06:34,500
we did some analysis and early versions looked promising but

113
00:06:34,500 --> 00:06:36,440
challenging in terms of performance overhead.

114
00:06:37,440 --> 00:06:41,120
And of course, if you talk about RDS and others, they don't have

115
00:06:41,120 --> 00:06:41,460
it.

116
00:06:41,460 --> 00:06:44,240
I saw it only maybe on 1 or 2 platforms.

117
00:06:44,660 --> 00:06:47,940
So Yeah, and performance overhead is a big question.

118
00:06:48,260 --> 00:06:51,000
That being said, I don't know the current performance overhead.

119
00:06:51,600 --> 00:06:54,720
And it would be great to have histograms and so on, right?

120
00:06:54,720 --> 00:06:55,520
So percentiles.

121
00:06:56,400 --> 00:07:01,560
So yeah, and another thing is 2 aspects here additionally in

122
00:07:01,560 --> 00:07:02,220
my approach.

123
00:07:03,400 --> 00:07:08,040
First is that in many cases, I don't care about average at all,

124
00:07:08,400 --> 00:07:13,280
because this is, as we also discussed several times, depending

125
00:07:13,380 --> 00:07:17,200
on our goals, it might be relevant or not, like important or

126
00:07:17,200 --> 00:07:17,640
not.

127
00:07:17,640 --> 00:07:22,520
For example, if we target to optimize how much resource utilization

128
00:07:22,660 --> 00:07:28,000
happens on this server, we care about total time and total values

129
00:07:28,140 --> 00:07:33,160
because this is what exactly shows the picture because a query

130
00:07:33,160 --> 00:07:38,860
might have very like tiny average, tiny standard deviation so

131
00:07:38,860 --> 00:07:44,840
they all like similar all query executions are similar and planning

132
00:07:45,480 --> 00:07:50,260
occurrences are similar but so many of them like 10,000 QPS,

133
00:07:50,320 --> 00:07:51,300
queries per second.

134
00:07:52,060 --> 00:07:55,640
This is going to load our server a lot and we need to optimize

135
00:07:55,640 --> 00:08:01,520
this query in spite of good values in terms of average.

136
00:08:01,560 --> 00:08:02,720
Total will be high.

137
00:08:03,180 --> 00:08:08,480
So averages, as I feel it, it's mostly, you mostly need it when

138
00:08:08,480 --> 00:08:11,520
you target to deliver good experience to your users.

139
00:08:12,340 --> 00:08:13,700
Michael: So I think so.

140
00:08:14,880 --> 00:08:17,140
Yeah, that you're completely right.

141
00:08:17,140 --> 00:08:21,300
Obviously, if we're trying to reduce utilization, there's no

142
00:08:21,300 --> 00:08:25,960
point looking at things by how slow they are on average, or even

143
00:08:26,020 --> 00:08:30,760
at the extremes, at the p99 level, because if they don't run

144
00:08:30,760 --> 00:08:32,800
that often, they're not things taking the most time.

145
00:08:32,800 --> 00:08:37,580
So total, ordering by totals, whether that's time, I still like

146
00:08:37,580 --> 00:08:41,000
to do time, or some combination of buffers, if for example your

147
00:08:41,000 --> 00:08:44,120
resource that you're constrained by is I O, which could be the

148
00:08:44,120 --> 00:08:47,500
case, or like if you're paying by I O, like if you're on Aurora

149
00:08:47,500 --> 00:08:50,180
or something, ordering by something else could make sense.

150
00:08:50,460 --> 00:08:52,900
But yeah, totals always make sense for those cases.

151
00:08:52,900 --> 00:08:57,280
But yeah, this is for those other
cases where you care about

152
00:08:57,280 --> 00:09:00,560
user experience, perhaps, like
how long a dashboard's taken to

153
00:09:00,560 --> 00:09:05,700
load for a large customer or the
boss has complained or that

154
00:09:05,700 --> 00:09:06,520
kind of thing.

155
00:09:07,640 --> 00:09:11,280
Or we've talked a few times about
the issues with long running

156
00:09:11,280 --> 00:09:11,760
queries.

157
00:09:11,760 --> 00:09:15,440
So you might be investigating what
queries do I even have that

158
00:09:15,440 --> 00:09:16,660
run long sometimes?

159
00:09:17,140 --> 00:09:21,060
So I think that's the other use
case I thought of as like, why

160
00:09:21,060 --> 00:09:25,300
might you want to order by some
version of either average or

161
00:09:25,440 --> 00:09:25,940
percentile?

162
00:09:26,600 --> 00:09:28,220
Nikolay: Yeah, let me argue with
myself.

163
00:09:29,600 --> 00:09:31,020
My favorite thing.

164
00:09:31,020 --> 00:09:36,820
So yeah, when we optimize for resource
consumption, lower resource

165
00:09:36,820 --> 00:09:41,880
consumption, we look at totals,
but at the second step, we should

166
00:09:41,880 --> 00:09:46,760
look at average, because we need
to decide Average and frequency,

167
00:09:46,820 --> 00:09:50,520
which multiplied bring us to total,
right?

168
00:09:50,660 --> 00:09:56,380
So we need to go down and see is
it about too frequent query

169
00:09:56,380 --> 00:10:00,880
or is it about not very frequent
but very slow query and decide

170
00:10:01,680 --> 00:10:04,780
which optimization methods to apply,
right?

171
00:10:05,220 --> 00:10:10,020
And in this case, I'm wondering,
does it really matter to understand

172
00:10:10,320 --> 00:10:13,260
that, like, to look at standard
deviation, basically, right?

173
00:10:13,260 --> 00:10:14,440
To think about percentiles?

174
00:10:15,140 --> 00:10:18,220
Michael: I think you're conflating
2 things here, looking at

175
00:10:18,220 --> 00:10:20,420
something versus ordering by it.

176
00:10:22,240 --> 00:10:24,760
And yes, when you're looking at
resource consumption, you want

177
00:10:24,760 --> 00:10:29,360
to look at average time sometimes,
but you don't order by it.

178
00:10:29,540 --> 00:10:31,980
Crucially, you need to order by
total time.

179
00:10:32,200 --> 00:10:35,680
And the same is true for standard
deviation, I think.

180
00:10:36,560 --> 00:10:41,260
I'm talking about adding maybe
1, maybe 2, depending on exactly

181
00:10:41,260 --> 00:10:44,160
what percentile you want, adding
a couple of standard deviations

182
00:10:44,160 --> 00:10:46,500
to the mean to order by it.

183
00:10:46,500 --> 00:10:50,720
Not necessarily to look at it,
but just to get the initial prioritization

184
00:10:51,040 --> 00:10:53,680
order, which things should I start
looking at, what are my top

185
00:10:53,680 --> 00:10:57,880
10, because the order will be different,
assuming you have some

186
00:10:57,880 --> 00:11:00,880
variants and you know, distribution
of your queries and some

187
00:11:00,880 --> 00:11:03,720
of them have a longer tail than
others, the order will be different

188
00:11:03,720 --> 00:11:06,880
if you order by mean plus a couple
of standard deviations than

189
00:11:06,880 --> 00:11:08,560
it would be if you just order by
mean.

190
00:11:08,560 --> 00:11:10,120
So it's about the prioritization
order.

191
00:11:10,120 --> 00:11:13,020
And then you care about looking
at a lot of the other columns,

192
00:11:13,660 --> 00:11:15,040
but you don't order by them.

193
00:11:15,040 --> 00:11:16,580
So I think that's a subtle difference.

194
00:11:16,720 --> 00:11:17,940
Nikolay: Yeah, yeah, yeah, I agree.

195
00:11:18,680 --> 00:11:20,780
Order by standard deviation, it's
interesting.

196
00:11:21,280 --> 00:11:22,040
Michael: I'm joking.

197
00:11:22,060 --> 00:11:22,700
I'm joking.

198
00:11:23,880 --> 00:11:24,380
Yeah.

199
00:11:24,840 --> 00:11:25,240
Nikolay: Yeah.

200
00:11:25,240 --> 00:11:30,560
I understand you are thinking about
ordering, for example, average

201
00:11:30,560 --> 00:11:35,980
plus some factor times standard
deviation.

202
00:11:35,980 --> 00:11:37,540
This is a smart idea, I think.

203
00:11:37,660 --> 00:11:40,920
And before we move to that idea,
which is explained in your blog

204
00:11:40,920 --> 00:11:41,420
post.

205
00:11:41,520 --> 00:11:41,760
Michael: Yeah.

206
00:11:41,760 --> 00:11:44,800
Nikolay: So, so again, like, okay,
we order by total time, and

207
00:11:44,800 --> 00:11:52,220
then we, we look to decide which
approach to use for optimization.

208
00:11:53,400 --> 00:12:00,740
And still maybe there it also makes
some sense to look, to have

209
00:12:00,740 --> 00:12:04,240
percentiles for example, because
we would understand, oh, this

210
00:12:04,440 --> 00:12:10,780
query has this average, which is
not that bad, but for example,

211
00:12:10,900 --> 00:12:13,860
we don't have percentiles in pg_stat_statements, no histogram.

212
00:12:13,860 --> 00:12:17,340
So, but we see, Oh, standard deviation
is so huge, it means that

213
00:12:17,360 --> 00:12:20,100
some occurrences are okay, some
are not at all.

214
00:12:20,860 --> 00:12:21,360
Right?

215
00:12:21,680 --> 00:12:26,440
And this brings us to the idea,
oh, it means that we have instability

216
00:12:26,640 --> 00:12:27,720
here, right?

217
00:12:27,720 --> 00:12:31,360
Depending on maybe data volumes,
or maybe we were blocked by

218
00:12:31,360 --> 00:12:31,720
someone.

219
00:12:31,720 --> 00:12:34,760
By the way, I would rather prefer
to have this information for,

220
00:12:34,760 --> 00:12:37,440
you know what, different metrics
and time, right?

221
00:12:38,040 --> 00:12:38,540
Yeah.

222
00:12:38,800 --> 00:12:41,840
But for hits and reads, this is
where I would really love to

223
00:12:41,840 --> 00:12:46,840
see that piece, standard deviation
or even better, percentiles.

224
00:12:49,360 --> 00:12:50,020
Michael: Oh my goodness.

225
00:12:50,020 --> 00:12:52,620
I'm just imagining the number of
columns at that point.

226
00:12:52,660 --> 00:12:55,260
Nikolay: Yeah, because it always,
in pg_stat_statements approach, it

227
00:12:55,260 --> 00:12:59,740
always concerns me how much of
those slowness was associated

228
00:13:00,400 --> 00:13:03,840
with query being blocked by another
query, just wait time.

229
00:13:04,400 --> 00:13:06,480
We don't know and it's hidden here.

230
00:13:07,060 --> 00:13:10,940
Michael: Yeah, I mean, it's a good
point that these aggregates

231
00:13:10,960 --> 00:13:12,940
are always going to hide information.

232
00:13:13,080 --> 00:13:16,640
You know, it's the nature of aggregations
is we don't get all

233
00:13:16,640 --> 00:13:17,220
the information.

234
00:13:17,360 --> 00:13:19,440
It's the trade-off, right?

235
00:13:19,440 --> 00:13:23,320
We get some useful information
on aggregate, but we don't get

236
00:13:23,320 --> 00:13:24,220
all of the information.

237
00:13:24,320 --> 00:13:30,260
I quite often see people include
min, mean, and max in their

238
00:13:30,260 --> 00:13:32,360
queries, their querying pg_stat_statements form.

239
00:13:32,360 --> 00:13:34,420
So you get kind of the full distribution.

240
00:13:35,540 --> 00:13:39,340
I could imagine adding a couple,
you know, like either side of

241
00:13:39,340 --> 00:13:42,740
the mean to say what's the mean
minus the standard deviation,

242
00:13:42,740 --> 00:13:44,700
what's the mean plus the standard
deviation.

243
00:13:44,840 --> 00:13:47,720
Nikolay: Yeah, without standard
deviation, mean and max, they

244
00:13:47,780 --> 00:13:51,940
can be some, like you have million
calls, but just 1 of them

245
00:13:51,940 --> 00:13:56,140
was so far and like, okay, it broke
your whole picture, affected

246
00:13:56,140 --> 00:13:58,860
your whole picture, while standard
deviation brings some good

247
00:13:58,860 --> 00:13:59,740
sense, right?

248
00:14:00,200 --> 00:14:03,060
Michael: Yeah, I mean, I guess
min is possibly more interesting

249
00:14:03,160 --> 00:14:07,220
than max, but yeah of course sometimes
there'll be 1 parameter

250
00:14:07,220 --> 00:14:09,500
that causes the query to run very
differently.

251
00:14:09,680 --> 00:14:13,140
Nikolay: 0 rows returned, we have
min close to 0, it doesn't

252
00:14:13,140 --> 00:14:14,380
make any sense, right?

253
00:14:14,380 --> 00:14:15,520
Michael: Good point, yeah.

254
00:14:17,180 --> 00:14:21,520
But by including more you get a
better picture of the distribution

255
00:14:21,600 --> 00:14:25,800
you get more points on the on the
graph to get an idea of quite

256
00:14:25,800 --> 00:14:29,580
how is this distributed is that
max and outlier or are there

257
00:14:29,580 --> 00:14:34,260
quite a few queries running nearly
that long, you know, so without

258
00:14:34,720 --> 00:14:39,400
some measure close to p95, p99
that kind of thing, I think it's

259
00:14:39,400 --> 00:14:42,280
really hard to know that and I
think possibly what's happening

260
00:14:42,280 --> 00:14:45,640
at the moment is quite a few of
these dashboards, monitoring

261
00:14:45,640 --> 00:14:49,700
tools, diagnosis tools, I'll give
you an order when you look

262
00:14:49,700 --> 00:14:55,160
at it by mean that's fine, it's
useful, but could be more useful

263
00:14:55,160 --> 00:14:58,460
and you could maybe start at the
more problematic queries if

264
00:14:58,460 --> 00:15:03,520
you ordered them by a higher percentile
and then started at the

265
00:15:03,520 --> 00:15:04,500
top from there.

266
00:15:05,100 --> 00:15:05,360
Nikolay: Yeah.

267
00:15:05,360 --> 00:15:09,980
Because it's not it's like we say,
oh, on average, we deliver

268
00:15:10,080 --> 00:15:12,900
everything like 10 milliseconds,
everything is good.

269
00:15:12,900 --> 00:15:16,800
Or 1 millisecond again, like for
a queries, we want to be below

270
00:15:16,800 --> 00:15:20,100
10 milliseconds ideally, definitely
not 100 milliseconds, it's

271
00:15:20,140 --> 00:15:20,780
too much.

272
00:15:21,580 --> 00:15:30,040
So we say it's good, but what if
almost 50% of that was 1 millisecond,

273
00:15:30,100 --> 00:15:34,800
and almost 50% of that was how
many, like, not 1 millisecond.

274
00:15:35,540 --> 00:15:37,120
Michael: What did you say the average
was?

275
00:15:37,900 --> 00:15:40,060
Nikolay: My math is going to be
really bad.

276
00:15:40,280 --> 00:15:45,120
I was trying to say you can, you
might have 2 segments or clusters

277
00:15:45,620 --> 00:15:46,300
of cases.

278
00:15:46,820 --> 00:15:50,200
1 is very close to 0 and other
is very bad.

279
00:15:50,600 --> 00:15:51,100
And

280
00:15:51,540 --> 00:15:54,640
Michael: they call it like a bimodal
bimodal distribution.

281
00:15:55,080 --> 00:15:55,580
Yeah.

282
00:15:55,580 --> 00:15:59,060
Because perhaps because of a plan
like different plan for different

283
00:15:59,060 --> 00:16:00,400
amounts of data, maybe like

284
00:16:00,400 --> 00:16:04,080
Nikolay: parameters, maybe, or
maybe this locking issues.

285
00:16:04,700 --> 00:16:07,580
And then you observe something
in the middle.

286
00:16:07,580 --> 00:16:10,760
Well, maybe, okay, maybe 50-50
is a bad example, because we are

287
00:16:10,760 --> 00:16:11,960
going to mean the middle.

288
00:16:11,960 --> 00:16:17,700
What if, what if like 70%, very
close to 0, 30% is very far from

289
00:16:17,700 --> 00:16:21,800
0, and they really struggle with
performance, those users, right?

290
00:16:22,480 --> 00:16:27,700
And we see average is not bad it's
going to be shifted towards

291
00:16:27,700 --> 00:16:32,940
to 0 because it's average not and
oh no it's not going to look

292
00:16:32,940 --> 00:16:37,280
okay depends right So but but average
will tell us it's like,

293
00:16:37,280 --> 00:16:38,340
it's not that bad.

294
00:16:38,400 --> 00:16:39,320
We can leave it.

295
00:16:39,320 --> 00:16:42,560
Michael: It's very, yeah, it's
very easy to imagine distributions

296
00:16:42,980 --> 00:16:46,220
that on average are less than some
threshold that we might not

297
00:16:46,220 --> 00:16:48,080
care about, like 50 milliseconds
or something.

298
00:16:48,080 --> 00:16:49,900
Maybe we set our threshold there.

299
00:16:50,060 --> 00:16:52,080
On average, they're 40, 45.

300
00:16:52,800 --> 00:16:57,220
But for a non-zero, you know, like
a decent number, maybe if

301
00:16:57,340 --> 00:17:01,260
it's p95, p99, and we're talking
about either 5% or 1% of times,

302
00:17:01,500 --> 00:17:03,340
it's running for hundreds of milliseconds.

303
00:17:03,340 --> 00:17:04,280
And we've talked in the

304
00:17:04,280 --> 00:17:05,340
Nikolay: past about how...

305
00:17:05,740 --> 00:17:07,320
Michael: Or even seconds, yeah,
exactly.

306
00:17:07,420 --> 00:17:13,660
So those are noticeable to human
durations, even though on average

307
00:17:13,660 --> 00:17:15,840
it's running in 45 milliseconds.

308
00:17:16,260 --> 00:17:20,140
So it's very easy, I think, to
hide things that on average look

309
00:17:20,140 --> 00:17:22,900
like they maybe aren't perceptible
to humans but if the tail

310
00:17:22,900 --> 00:17:27,280
is long enough then for quite a
lot of people especially you

311
00:17:27,280 --> 00:17:31,800
know let's say it's 5% of people
that's 1 in 20 times somebody's

312
00:17:31,920 --> 00:17:34,460
loading that dashboard or using
that feature.

313
00:17:34,700 --> 00:17:37,180
1 in 20 sounds quite often.

314
00:17:37,240 --> 00:17:38,220
Nikolay: Yeah, yeah.

315
00:17:38,500 --> 00:17:42,380
Yeah, if it's 1 in a thousand,
well, things

316
00:17:42,380 --> 00:17:43,440
happen, right?

317
00:17:43,960 --> 00:17:48,760
Yeah, we can say, okay,
we have a very large user base,

318
00:17:48,760 --> 00:17:51,420
some of them, like, we need to
take care of everyone, but we

319
00:17:51,420 --> 00:17:53,400
cannot, like, it happens, right?

320
00:17:54,520 --> 00:17:58,180
But this approach works only if
your p99 is good.

321
00:17:58,580 --> 00:17:59,080
Michael: Yes.

322
00:18:00,040 --> 00:18:04,620
Nikolay: Okay, 1% is not that good,
okay, but if it happens with

323
00:18:04,620 --> 00:18:08,980
20% of your users, they suffer,
and you missed this fact just

324
00:18:08,980 --> 00:18:11,680
looking at the average, it's not
good, it's not good.

325
00:18:11,680 --> 00:18:13,780
Michael: Yeah, exactly.

326
00:18:13,780 --> 00:18:15,340
Well, imagine, go back to our example.

327
00:18:15,340 --> 00:18:19,040
If you've got 2 queries, most,
they look the same in terms of

328
00:18:19,040 --> 00:18:22,560
average, let's say one's 45 milliseconds
on average; the other's

329
00:18:22,560 --> 00:18:28,060
50 milliseconds on average, but
the one that's 45 has a p99 that's

330
00:18:28,140 --> 00:18:32,500
10 times that 450 milliseconds,
and the one that's 50 milliseconds

331
00:18:32,500 --> 00:18:35,940
has a p99 of only 60 milliseconds,
so only a little bit higher.

332
00:18:36,020 --> 00:18:41,540
I would much, much, much rather
start, have it ranked higher,

333
00:18:41,640 --> 00:18:44,760
the one that had the higher p99 in
that case, even though it has

334
00:18:44,760 --> 00:18:46,580
a lower average duration.

335
00:18:46,780 --> 00:18:48,960
So that's kind of the whole point.

336
00:18:48,960 --> 00:18:51,400
Nikolay: Yeah, let's explain the
idea you had.

337
00:18:52,300 --> 00:18:53,600
The idea is simple.

338
00:18:53,940 --> 00:18:57,360
We don't have buckets, we don't
have percentiles, we don't have

339
00:18:57,360 --> 00:19:00,700
histogram, we cannot draw it, but
we have standard deviation.

340
00:19:01,100 --> 00:19:06,000
And the idea is how I learn, how
I hear it or read it.

341
00:19:06,900 --> 00:19:11,780
In all analysis we have over the
last years with pg_stat_statements

342
00:19:11,780 --> 00:19:15,440
and other things, pg_stat_kcache,
pg_wait_sampling maybe and so

343
00:19:15,440 --> 00:19:19,340
on, we always look at throughput
and average, throughput and

344
00:19:19,340 --> 00:19:20,880
average latency, that's it.

345
00:19:21,140 --> 00:19:24,220
Yeah, we know that from, you mentioned,
the article mentioned

346
00:19:24,220 --> 00:19:26,820
that from application side, some
monitoring systems, they bring

347
00:19:26,820 --> 00:19:32,360
you p99, p99.5, it's great, or 2
values even better than just

348
00:19:32,360 --> 00:19:33,180
1 here.

349
00:19:33,740 --> 00:19:34,240
Yeah.

350
00:19:34,860 --> 00:19:38,360
And sometimes colorful, good looking
graphs.

351
00:19:39,340 --> 00:19:41,920
But in Postgres we don't do it
somehow.

352
00:19:42,040 --> 00:19:45,420
But we have standard deviation
and in which Postgres monitoring

353
00:19:45,420 --> 00:19:47,240
standard deviation is really appreciated?

354
00:19:47,360 --> 00:19:48,260
I don't know.

355
00:19:48,580 --> 00:19:50,040
Michael: I don't know any, yeah.

356
00:19:50,080 --> 00:19:54,600
Nikolay: We don't have p99 or p95,
p90, nothing like that.

357
00:19:54,600 --> 00:19:58,440
But what if we assume we like distribution,
well, your assumption,

358
00:19:58,440 --> 00:20:01,420
it's normal, it's not normal, like
half normal.

359
00:20:02,640 --> 00:20:06,460
But idea is let's take average
and standard deviation and have

360
00:20:06,460 --> 00:20:11,260
some multipliers, so we have formulas,
average plus some multiplier

361
00:20:11,380 --> 00:20:16,620
times standard deviation, and we
have a kind of p something,

362
00:20:16,720 --> 00:20:19,160
p 90 something, kind of, right?

363
00:20:19,340 --> 00:20:23,680
Michael: Yeah, I called it approximate,
just only because it

364
00:20:23,680 --> 00:20:26,920
really does assume a normal distribution,
like standard deviation

365
00:20:26,920 --> 00:20:31,120
only makes sense in a, like, not
that it only makes sense, you

366
00:20:31,120 --> 00:20:34,200
can only approximate if you assume
a normal distribution and

367
00:20:34,200 --> 00:20:36,820
I made the case that I think it
would be quite rare for you to

368
00:20:36,820 --> 00:20:40,160
have queries, or at least problematic
queries that are normally

369
00:20:40,160 --> 00:20:40,660
distributed.

370
00:20:40,960 --> 00:20:43,920
I suspect a lot of the problematic
ones are going to have long

371
00:20:43,920 --> 00:20:46,920
tails but it definitely isn't a
safe assumption that they're

372
00:20:46,920 --> 00:20:51,340
all going to be normally distributed,
but once you factor that

373
00:20:51,340 --> 00:20:53,200
in, I think this is still useful.

374
00:20:53,200 --> 00:20:57,460
I think queries that have a low
average and a high standard deviation,

375
00:20:57,600 --> 00:21:03,340
where this ordering difference,
where this change makes a difference,

376
00:21:03,740 --> 00:21:05,940
are probably going to have, well,
are definitely going to have

377
00:21:05,940 --> 00:21:07,400
high p99s as well.

378
00:21:08,000 --> 00:21:11,240
It's still directionally correct,
I think I used that phrase,

379
00:21:11,240 --> 00:21:13,980
so it still points you probably
in the right direction.

380
00:21:13,980 --> 00:21:16,640
This order is probably not going
to be that different to the

381
00:21:16,640 --> 00:21:18,580
order by true p99.

382
00:21:19,460 --> 00:21:25,220
So it's a step in the right direction,
it's not necessarily accurate,

383
00:21:25,400 --> 00:21:27,080
but it's directionally correct.

384
00:21:27,440 --> 00:21:28,340
Nikolay: S.
Yeah, I agree.

385
00:21:28,340 --> 00:21:32,860
And since I suck at statistics,
as every listener already should

386
00:21:32,860 --> 00:21:34,300
understand by this point.

387
00:21:34,940 --> 00:21:39,440
I talked to my daughter who is
learning data science at UCSD

388
00:21:39,520 --> 00:21:40,220
right now.

389
00:21:42,540 --> 00:21:47,780
And she told me, oh, this is quite
known, like p95 or something

390
00:21:47,780 --> 00:21:52,120
for normal distribution we just
standard deviation times 2 to

391
00:21:52,120 --> 00:21:56,260
both sides this is how you can
get p90 I should say or p95 I

392
00:21:56,260 --> 00:22:01,940
forgot like yeah so this is reasonable
she said but then like

393
00:22:01,940 --> 00:22:04,980
of course it's not normal distribution,
and it's a big assumption.

394
00:22:05,500 --> 00:22:09,060
And what I thought, idea number
1 I had, and maybe I will encourage

395
00:22:09,060 --> 00:22:12,960
some, I don't know, some students
or some people who are willing

396
00:22:12,960 --> 00:22:14,240
to do something here.

397
00:22:14,380 --> 00:22:18,440
What if we had some raw data from
somewhere?

398
00:22:18,960 --> 00:22:23,600
I don't know who can afford logging
all queries with duration.

399
00:22:24,620 --> 00:22:28,980
Since we talked last week, we talked
how dangerous it can be

400
00:22:28,980 --> 00:22:30,060
to log all queries.

401
00:22:30,060 --> 00:22:34,200
But imagine we collected some raw
data and we see what's present

402
00:22:34,200 --> 00:22:35,340
in persistent statements.

403
00:22:36,100 --> 00:22:39,060
And what I'm trying to say, we
could find, we could understand

404
00:22:39,060 --> 00:22:43,440
distribution nature, maybe observing
many, many cases from OLTP,

405
00:22:43,580 --> 00:22:45,640
like web apps, mobile apps.

406
00:22:46,220 --> 00:22:52,000
And then we could probably find
some heuristic based multiplier,

407
00:22:52,360 --> 00:22:52,860
right?

408
00:22:53,160 --> 00:22:57,040
Which would serve like closer to
reality, I don't know, like

409
00:22:57,040 --> 00:23:00,740
to avoid this assumption that it's
normal, but it's a big work.

410
00:23:00,860 --> 00:23:04,840
I know like it's, you need to collect
a lot of data process it

411
00:23:04,840 --> 00:23:05,780
and so on.

412
00:23:06,220 --> 00:23:07,040
What do you think?

413
00:23:07,420 --> 00:23:09,960
Michael: Well, and it would be
different for different queries

414
00:23:09,960 --> 00:23:14,020
like unless we find out that it's
just I think it's a tricky

415
00:23:14,020 --> 00:23:18,660
problem unless you actually approximate
it in while you're monitoring.

416
00:23:19,200 --> 00:23:21,300
Nikolay: Look, I understand what
you said.

417
00:23:23,500 --> 00:23:27,020
So if we take assumption, we like,
it's kind of normal.

418
00:23:27,840 --> 00:23:31,100
It sounds for me like fragile assumption.

419
00:23:33,280 --> 00:23:37,320
If we learn how some OLTP system,
social media or something,

420
00:23:37,540 --> 00:23:41,480
with all these queries, like select
2 users, select 2 something,

421
00:23:41,480 --> 00:23:45,080
join some tables, and so on, how
these queries behave, at least,

422
00:23:45,080 --> 00:23:49,040
I don't know, a few dozens of them,
how they behave.

423
00:23:50,500 --> 00:23:55,380
This like practice learned multiplier,
I'm pretty sure it will

424
00:23:55,380 --> 00:23:58,660
be in another system it will be
very close, you know?

425
00:23:58,660 --> 00:24:00,300
Michael: Yeah, actually it's a
good point.

426
00:24:00,940 --> 00:24:06,040
You could get more accurate with
the number, like the exact multiple

427
00:24:06,500 --> 00:24:11,020
to get a more likely to be p99
measure.

428
00:24:11,200 --> 00:24:14,100
It might turn out that the number
I've suggested based on normal

429
00:24:14,100 --> 00:24:17,960
distribution is actually more likely
to be a p96 than a p99.

430
00:24:18,340 --> 00:24:20,140
Or p97 instead of p99.

431
00:24:21,060 --> 00:24:22,660
Does it make that much difference?

432
00:24:22,660 --> 00:24:23,040
No, no.

433
00:24:23,040 --> 00:24:24,920
Nikolay: If it's just order by,
no.

434
00:24:24,920 --> 00:24:25,960
I agree with you.

435
00:24:25,960 --> 00:24:30,360
And what I'm proposing breaks Pareto
principle, right?

436
00:24:31,100 --> 00:24:33,640
Michael: Yeah, It's a lot
of work for arguably not that

437
00:24:33,640 --> 00:24:36,340
much additional, like, I think
if someone's going to put a lot

438
00:24:36,340 --> 00:24:39,180
of effort into this, it would be
much better to actually track

439
00:24:39,180 --> 00:24:42,840
this and put it into P statements.

440
00:24:44,500 --> 00:24:47,280
Nikolay: This will be, again, overhead
question.

441
00:24:47,280 --> 00:24:50,740
It would be great to have, but
again, I understand.

442
00:24:50,740 --> 00:24:54,960
What I'm proposing is going to
be outside of p80 of the effort

443
00:24:54,960 --> 00:24:56,100
we're going to have.

444
00:24:56,140 --> 00:24:57,540
Or p20.

445
00:24:57,720 --> 00:24:58,320
I like it.

446
00:24:58,320 --> 00:24:59,160
p20, p20.

447
00:24:59,160 --> 00:24:59,980
I don't know.

448
00:25:00,060 --> 00:25:02,320
p80 of the value, p20 of...

449
00:25:02,780 --> 00:25:05,940
Okay, so another thing, another
thought I have.

450
00:25:06,220 --> 00:25:10,080
I noticed over time that when I run some benchmarks, very often

451
00:25:10,080 --> 00:25:14,840
with pgbench, I'm absolutely unsatisfied with...

452
00:25:15,560 --> 00:25:19,920
So I use, we have it in automation as well, like all my team

453
00:25:19,920 --> 00:25:20,640
members use it.

454
00:25:20,640 --> 00:25:28,300
So I have lowercase r to report results in the end for each statement

455
00:25:28,320 --> 00:25:30,140
used in transactions, right?

456
00:25:30,660 --> 00:25:35,820
Sometimes we use multiple files with different transactions and

457
00:25:35,820 --> 00:25:38,440
like with some weights assigned to them.

458
00:25:39,720 --> 00:25:43,220
It can be achieved with option F and add sign.

459
00:25:43,220 --> 00:25:45,300
It's great feature, by the way, underappreciated.

460
00:25:46,000 --> 00:25:49,740
But anyway, I see these reports and we discuss them and like,

461
00:25:49,740 --> 00:25:50,980
Oh, this looks great.

462
00:25:50,980 --> 00:25:54,980
Oh, it looks not that great, but I realized it's not enough.

463
00:25:55,080 --> 00:25:56,080
I want histograms.

464
00:25:56,980 --> 00:25:57,480
Right.

465
00:25:57,720 --> 00:25:58,220
Michael: Yeah.

466
00:25:59,180 --> 00:26:03,480
Nikolay: And to mitigate it, at least somehow, we started to

467
00:26:03,480 --> 00:26:05,400
use capital P option.

468
00:26:06,100 --> 00:26:10,550
For example, P30, it means like every 30 seconds, it will be

469
00:26:10,550 --> 00:26:10,840
reported.
Oh,

470
00:26:10,840 --> 00:26:11,340
Michael: interesting.

471
00:26:11,600 --> 00:26:12,680
Different P notation.

472
00:26:12,800 --> 00:26:13,300
Yeah.

473
00:26:14,160 --> 00:26:14,760
Nikolay: Yeah, yeah.

474
00:26:14,760 --> 00:26:16,420
So it will be reported as progress.

475
00:26:16,480 --> 00:26:19,360
So it reports latencies every 30 seconds, for

476
00:26:19,360 --> 00:26:19,860
Michael: example.

477
00:26:20,460 --> 00:26:24,200
Nikolay: Nice.
Or every 10 seconds if benchmark is very fast, or every minute

478
00:26:24,200 --> 00:26:25,580
if it's longer, depends.

479
00:26:26,040 --> 00:26:29,340
And from there, we already understand, oh, distribution is like

480
00:26:29,340 --> 00:26:31,380
this, right?

481
00:26:31,380 --> 00:26:33,840
So we can feel some distribution.

482
00:26:34,900 --> 00:26:37,720
It reports by the way a standard deviation as well, but it's

483
00:26:37,720 --> 00:26:38,700
still not enough.

484
00:26:40,360 --> 00:26:44,620
There is another aspect here, not only about distribution of

485
00:26:44,620 --> 00:26:49,660
np90, np95, It's also about the distribution in time, over time,

486
00:26:49,760 --> 00:26:53,900
because there might be a spike of latency, right?

487
00:26:54,280 --> 00:26:55,080
And then it's better.

488
00:26:55,080 --> 00:26:55,580
True.

489
00:26:55,807 --> 00:26:57,444
So, yeah, yeah.

490
00:26:57,444 --> 00:27:00,300
But still, I think what I was trying to say, actually, there's

491
00:27:00,300 --> 00:27:04,260
an idea to implement histograms or percentiles.

492
00:27:05,240 --> 00:27:07,260
Maybe it's better to start in pgbench.

493
00:27:09,620 --> 00:27:11,780
Yeah, not in pg_stat_statements first.

494
00:27:12,340 --> 00:27:13,280
Michael: Yeah, interesting idea.

495
00:27:13,280 --> 00:27:16,320
I hadn't thought, I get where you're coming from.

496
00:27:16,320 --> 00:27:17,640
I think it'd be easier, right?

497
00:27:17,640 --> 00:27:19,260
You don't worry so much about overhead.

498
00:27:19,660 --> 00:27:25,660
Nikolay: Well, you worry about
overhead, but this is a tool which

499
00:27:26,640 --> 00:27:28,680
used to make decisions and so on.

500
00:27:28,680 --> 00:27:30,980
Well, pgbench also is used.

501
00:27:31,100 --> 00:27:35,420
But anyway, maybe it's easier because
it's less risky.

502
00:27:35,660 --> 00:27:37,160
The idea is it's not production.

503
00:27:38,240 --> 00:27:38,560
So...

504
00:27:38,560 --> 00:27:41,020
Michael: Are you thinking it's
more like, are you thinking educationally,

505
00:27:41,660 --> 00:27:44,900
you could see the idea that this
is a useful thing and once people

506
00:27:44,900 --> 00:27:48,540
start seeing it, like, oh, this
is really helpful, Why can't

507
00:27:48,540 --> 00:27:49,440
I have it elsewhere?

508
00:27:49,440 --> 00:27:50,520
So it kind of builds.

509
00:27:51,580 --> 00:27:54,120
Nikolay: I think most of people
understand that this is helpful,

510
00:27:54,160 --> 00:27:57,900
but it's just risky because pg_stat_statements
have overhead.

511
00:27:57,900 --> 00:27:58,440
We know it.

512
00:27:58,440 --> 00:28:01,680
And it's probably increasing because
more and more things are

513
00:28:01,680 --> 00:28:02,460
being added.

514
00:28:03,080 --> 00:28:08,720
And just again, pg_stat_statements is not about
production, usually, depends

515
00:28:08,720 --> 00:28:09,400
of course.

516
00:28:09,640 --> 00:28:10,460
And I think-

517
00:28:10,460 --> 00:28:11,820
Michael: pgbench is not about-

518
00:28:12,620 --> 00:28:15,540
Nikolay: It's not used in production,
it's a learning tool for

519
00:28:15,540 --> 00:28:16,040
experiments.

520
00:28:17,860 --> 00:28:21,400
Yeah, pg_stat_statements is for production,
so it's harder to bring something

521
00:28:21,400 --> 00:28:22,200
heavy there.

522
00:28:22,780 --> 00:28:27,180
pgbench, again, like optionally,
and see how it works, like

523
00:28:27,180 --> 00:28:28,780
it's easier in my opinion.

524
00:28:28,780 --> 00:28:31,080
And it's also, like, first of all,
it's needed there.

525
00:28:31,080 --> 00:28:31,780
It's needed.

526
00:28:32,560 --> 00:28:35,880
It's very useful to have it there,
it would be very useful.

527
00:28:36,980 --> 00:28:40,360
Do you know if Sysbench is bringing
histograms?

528
00:28:41,340 --> 00:28:42,460
Michael: No, I don't know.

529
00:28:42,540 --> 00:28:43,440
Nikolay: I don't remember.

530
00:28:44,340 --> 00:28:46,020
Yeah, it brings percentiles.

531
00:28:48,040 --> 00:28:48,540
Michael: Yeah.

532
00:28:49,300 --> 00:28:51,140
How many buckets or which?

533
00:28:51,220 --> 00:28:51,740
I don't know.

534
00:28:51,740 --> 00:28:52,920
Just report a couple.

535
00:28:53,840 --> 00:28:55,660
Nikolay: It brings p95.

536
00:28:57,940 --> 00:28:58,440
Yeah.

537
00:28:59,060 --> 00:29:01,460
And I don't know, I don't see any
more.

538
00:29:01,460 --> 00:29:02,060
I don't know.

539
00:29:02,060 --> 00:29:05,280
Anyway, it's already better than
just standard deviation I guess,

540
00:29:05,280 --> 00:29:05,780
right?

541
00:29:05,900 --> 00:29:10,180
But I agree with your point if
we order by just like maybe just

542
00:29:10,200 --> 00:29:11,680
2 times standard deviation.

543
00:29:12,700 --> 00:29:13,360
Michael: Okay, I

544
00:29:13,360 --> 00:29:16,320
did, yeah, in the blog
post I did consider rounding

545
00:29:16,320 --> 00:29:19,540
just to 2 instead of using the
1 that's actually an approximation

546
00:29:21,600 --> 00:29:26,120
Yeah exactly because at
that point I'm saying look it's

547
00:29:26,120 --> 00:29:28,740
this is approximate already why
are you multiplying by such a

548
00:29:28,740 --> 00:29:32,280
specific number but then what are
we approximating it's like

549
00:29:32,280 --> 00:29:36,040
a different p value and I just
didn't really I thought it would

550
00:29:36,040 --> 00:29:41,320
just be confusing as a kind of
to people like people reading

551
00:29:41,320 --> 00:29:44,480
the Results rather than people
looking at the query.

552
00:29:44,480 --> 00:29:47,960
I figured more people would probably
end up If you've got if

553
00:29:47,960 --> 00:29:50,160
you put these into dashboards and
things, you're gonna have more

554
00:29:50,160 --> 00:29:52,200
people reading the dashboards than
looking at the queries behind

555
00:29:52,200 --> 00:29:52,860
the dashboards.

556
00:29:52,960 --> 00:29:56,260
So I figured I'd rather have something
confusing in the query

557
00:29:56,260 --> 00:29:59,520
than something confusing in the
like column header.

558
00:29:59,860 --> 00:30:03,100
Nikolay: Makes sense, makes sense,
Yeah.

559
00:30:03,100 --> 00:30:05,800
Michael: So yeah, quite strange
to use such a precise figure

560
00:30:05,800 --> 00:30:07,620
to calculate an approximation.

561
00:30:08,040 --> 00:30:11,180
Nikolay: You know what, I will
think about next time we will,

562
00:30:11,740 --> 00:30:17,640
we will maybe do some reports or
dashboards, maybe we will implement

563
00:30:17,640 --> 00:30:18,360
this approach.

564
00:30:18,740 --> 00:30:22,480
We'll call it p??

565
00:30:23,800 --> 00:30:27,440
Michael: I suggested calling
it approx_p99, just so that

566
00:30:27,440 --> 00:30:32,300
it's really clearly, Like, you're
leading with the word approximate.

567
00:30:32,900 --> 00:30:34,740
Nikolay: So it's Mysterious percentile.

568
00:30:35,400 --> 00:30:35,900
Yes.

569
00:30:37,880 --> 00:30:41,200
Michael: And even if you don't
include, like, it's probably confusing

570
00:30:41,200 --> 00:30:43,340
to look at a dashboard that doesn't
include it in the results,

571
00:30:43,340 --> 00:30:45,900
but still include all the other
columns you're going to include,

572
00:30:45,900 --> 00:30:49,520
just ordering by that is then,
I think, a useful

573
00:30:50,640 --> 00:30:54,280
Nikolay: ordering means in monitoring,
it means another, chart

574
00:30:54,280 --> 00:30:58,780
you, you like exactly top end,
top end by what, and this is ordering

575
00:30:58,860 --> 00:31:02,360
and, instead of top end by average
or in addition to top end

576
00:31:02,360 --> 00:31:02,840
by average?

577
00:31:02,840 --> 00:31:10,160
This would be helpful to bring
sense of how users feel about

578
00:31:10,160 --> 00:31:11,340
your service, right?

579
00:31:11,780 --> 00:31:15,300
Michael: And I really respect monitoring
tools and reports that

580
00:31:15,300 --> 00:31:20,700
try and limit the number of pages
they or dashboards they show

581
00:31:20,920 --> 00:31:25,320
but most I come across don't don't
seem to try and limit it too

582
00:31:25,320 --> 00:31:30,240
much so I think another what's
another chart but if you are worried

583
00:31:30,240 --> 00:31:33,720
about quite how many charts you
have, I might argue that I'd

584
00:31:33,720 --> 00:31:37,080
rather have, well, my argument
is I'd rather have this than the

585
00:31:37,080 --> 00:31:39,840
1 ordered by me, which all of,
pretty much all of them include.

586
00:31:39,840 --> 00:31:42,840
So if you include that 1, I would
rather have this 1.

587
00:31:44,160 --> 00:31:44,440
Nikolay: Yeah.

588
00:31:44,440 --> 00:31:48,920
And Do you think it's worth keeping
a chart for average?

589
00:31:49,400 --> 00:31:50,960
For pure average?

590
00:31:51,400 --> 00:31:53,200
Michael: Again, it depends on the
constraints, right?

591
00:31:53,200 --> 00:31:56,180
If you've already got a hundred
charts and you're showing it's

592
00:31:56,920 --> 00:32:00,060
99 versus 100, why not include
the average still?

593
00:32:00,060 --> 00:32:04,320
But if you're talking about only
having 2, then 2 versus 3 is

594
00:32:04,320 --> 00:32:05,260
quite a big difference.

595
00:32:05,640 --> 00:32:08,040
So I think it depends on the content.

596
00:32:08,040 --> 00:32:12,380
Nikolay: And this chart is going
to be regular, like not non-stacked,

597
00:32:12,540 --> 00:32:16,600
because you cannot summarize it,
like it's not cumulative thing,

598
00:32:17,120 --> 00:32:18,140
averages, right?

599
00:32:18,140 --> 00:32:18,640
Yeah.

600
00:32:19,400 --> 00:32:23,200
Yeah, And probably it should include
only like 10 or something.

601
00:32:23,480 --> 00:32:26,100
Michael: Well, it's not really
a chart, it's just like a top

602
00:32:26,280 --> 00:32:26,780
end.

603
00:32:26,820 --> 00:32:27,620
Nikolay: Ah, table.

604
00:32:28,380 --> 00:32:29,200
Michael: Yeah, like a table.

605
00:32:29,200 --> 00:32:30,060
Yeah, I think so.

606
00:32:30,060 --> 00:32:34,080
Nikolay: Chart also matters, like
again, table is good, but it

607
00:32:34,200 --> 00:32:36,140
loses historical information.

608
00:32:37,280 --> 00:32:41,180
Michael: Yeah, okay, so maybe your
chart is, yeah, your most

609
00:32:41,180 --> 00:32:44,280
important queries looking at their
p99 over time, but then you

610
00:32:44,280 --> 00:32:46,400
can include the mean on the same
chart, can't you?

611
00:32:46,400 --> 00:32:49,680
I've seen those in monitoring tools
where you're looking at the

612
00:32:49,680 --> 00:32:50,180
same...

613
00:32:51,390 --> 00:32:55,467
Nikolay: Interesting thing, for
the chart we can get average

614
00:32:55,467 --> 00:32:58,880
for specific period just dealing
with total and calls.

615
00:32:59,640 --> 00:33:05,380
We have delta for total, delta
for calls, we can have pure average

616
00:33:05,380 --> 00:33:07,080
for specific period of time.

617
00:33:07,660 --> 00:33:09,440
But what about standard deviation?

618
00:33:09,440 --> 00:33:10,620
We don't have it.

619
00:33:11,680 --> 00:33:15,580
Michael: You can calculate it using
deltas, but well, the same

620
00:33:15,580 --> 00:33:19,160
way pg_stat_statements recalculates
it every time it has a new

621
00:33:19,160 --> 00:33:20,240
query come in.

622
00:33:20,280 --> 00:33:20,740
Nikolay: Oh, yeah.

623
00:33:20,740 --> 00:33:23,040
Michael: So, Indeed.

624
00:33:24,140 --> 00:33:26,160
Yeah, I was looking at the source
code, I was like, oh, that's

625
00:33:26,160 --> 00:33:26,640
clever.

626
00:33:26,640 --> 00:33:27,760
Nikolay: Yeah, yeah, yeah, yeah,
yeah, exactly.

627
00:33:28,260 --> 00:33:29,440
Otherwise, yeah, exactly.

628
00:33:30,560 --> 00:33:34,920
Michael: But you also have the
option of snapshotting pg_stat_statements.

629
00:33:34,920 --> 00:33:36,020
Some people do that, right?

630
00:33:36,020 --> 00:33:40,240
Reset snapshot, like some monitoring
tools or some providers

631
00:33:40,440 --> 00:33:41,520
reset it regularly.

632
00:33:41,520 --> 00:33:42,680
Nikolay: It's not a good idea.

633
00:33:43,360 --> 00:33:48,560
Because to do it often you're going
to have troubles under load

634
00:33:48,560 --> 00:33:53,660
because it doesn't come for free
to add new entries.

635
00:33:54,280 --> 00:33:55,460
Michael: But then we have deltas.

636
00:33:55,520 --> 00:33:56,920
Deltas is a good idea.

637
00:33:56,920 --> 00:33:57,620
Nikolay: Yeah, yeah, yeah.

638
00:33:57,620 --> 00:34:01,000
Well, and standardization can be
also understood for a specific

639
00:34:01,000 --> 00:34:01,640
period of time.

640
00:34:01,640 --> 00:34:08,680
And in this case, you can have
this metric with like magic, mysterious

641
00:34:09,140 --> 00:34:09,520
percentile.

642
00:34:09,520 --> 00:34:15,900
And you can have a chart of, for
example, 10, the top 10 queries.

643
00:34:17,200 --> 00:34:22,100
Yeah, and maybe also average of
everything.

644
00:34:28,660 --> 00:34:31,600
Michael: Well, it's a good point
of like there are a couple of

645
00:34:32,080 --> 00:34:34,900
I don't even know if I mentioned
this in the blog post, but there's

646
00:34:34,900 --> 00:34:39,180
at least 1 major caveat here is
a lot of people have a timeout,

647
00:34:40,320 --> 00:34:44,120
often like 30 second, like statement_timeout.

648
00:34:45,660 --> 00:34:46,640
It won't be registered.

649
00:34:46,640 --> 00:34:51,520
So it does quite seriously affect
those metrics like the max,

650
00:34:51,740 --> 00:34:53,500
the standard deviation and

651
00:34:53,560 --> 00:34:55,060
Nikolay: Affects in which way?

652
00:34:55,160 --> 00:34:57,480
It's not registered, so you miss
it there.

653
00:34:57,700 --> 00:34:58,320
Michael: You miss it.

654
00:34:58,320 --> 00:34:59,560
You miss an outlier.

655
00:35:00,080 --> 00:35:02,280
So it's bringing the average down.

656
00:35:02,280 --> 00:35:03,480
It's bringing the p19

657
00:35:04,060 --> 00:35:06,000
Nikolay: looks more positive than
it is.

658
00:35:06,460 --> 00:35:06,820
Yes.

659
00:35:06,820 --> 00:35:09,840
But you know, like if a query was
cancelled, job is not done,

660
00:35:09,840 --> 00:35:11,080
we don't count it.

661
00:35:12,120 --> 00:35:16,800
As I remember, this is another
point, pg_stat_monitor aims to solve.

662
00:35:17,380 --> 00:35:17,880
Yes.

663
00:35:18,320 --> 00:35:22,200
Like, it's a common problem with
pg_stat_statements and all its

664
00:35:22,200 --> 00:35:27,420
friends, like pg_stat_kcache, that
canceled queries are not counted.

665
00:35:27,980 --> 00:35:31,940
And I think also a big missing
piece here, honestly, is that

666
00:35:31,940 --> 00:35:36,180
in pg_stat_database we have TPS,
2 metrics committed rolled back,

667
00:35:36,180 --> 00:35:36,680
right?

668
00:35:36,700 --> 00:35:41,640
But we don't have QPS, we don't
have latencies, standard deviation,

669
00:35:41,640 --> 00:35:42,740
it should be there, actually.

670
00:35:42,740 --> 00:35:50,560
This is another idea to register,
like, these metrics, calls,

671
00:35:51,300 --> 00:35:56,920
mean, max, mean, max, and min,
average, and standard deviation

672
00:35:57,440 --> 00:36:02,540
at the database level for everything,
regardless of the statements.

673
00:36:03,820 --> 00:36:08,000
And maybe involving error doubt
queries, Maybe.

674
00:36:08,000 --> 00:36:13,840
This is I'm not sure because this
is like, if we involve them,

675
00:36:14,180 --> 00:36:18,900
the values observed by producer
statements and at this high level,

676
00:36:20,140 --> 00:36:21,660
they won't match at all.

677
00:36:22,420 --> 00:36:22,920
Michael: Yeah.

678
00:36:23,680 --> 00:36:26,920
I was unsure what ideal looks like
here.

679
00:36:26,920 --> 00:36:33,100
I did wonder about, in an ideal
world, would you add the timeout?

680
00:36:33,480 --> 00:36:36,300
Like, would you add a query to
30 seconds?

681
00:36:36,300 --> 00:36:38,320
Let's say 1 got canceled at 30
seconds.

682
00:36:38,420 --> 00:36:42,340
Would you add an instance of a
query that ran 30 seconds and

683
00:36:42,340 --> 00:36:45,060
add that to the average, add that
to the standard deviation?

684
00:36:45,060 --> 00:36:47,200
Would you want that or would you
not want that?

685
00:36:47,200 --> 00:36:47,720
Nikolay: It's easy.

686
00:36:47,720 --> 00:36:49,940
Let's just have another pg_stat_statements.

687
00:36:50,460 --> 00:36:56,780
We'll call it pg_stat_error_statements
and have a whole set of

688
00:36:56,780 --> 00:36:57,280
metrics.

689
00:36:57,360 --> 00:36:58,580
I'm joking, of course.

690
00:36:59,240 --> 00:37:01,720
Michael: Well, it's not a bad idea,
though, right?

691
00:37:01,720 --> 00:37:03,760
Like how many of each error type
do we have?

692
00:37:03,760 --> 00:37:04,200
Like it's

693
00:37:04,200 --> 00:37:04,300
Nikolay: kind of cool.

694
00:37:04,300 --> 00:37:06,740
And could it be canceled before
statement_timeout?

695
00:37:06,740 --> 00:37:10,640
It can be canceled due to deadlock_timeout, for example, or just

696
00:37:10,640 --> 00:37:12,460
something else happens, I don't know.

697
00:37:13,620 --> 00:37:14,120
Kill.

698
00:37:14,720 --> 00:37:18,900
Term, a pg_cancel_backend, for example, came, canceled us.

699
00:37:19,600 --> 00:37:24,120
So yeah, and it would be great to count all those things like

700
00:37:24,400 --> 00:37:25,740
buffer hits, reads,

701
00:37:25,960 --> 00:37:26,780
Michael: and so on.

702
00:37:27,720 --> 00:37:30,800
So yeah, that's the long and short of it.

703
00:37:30,800 --> 00:37:31,560
Was there anything else

704
00:37:31,560 --> 00:37:31,880
Nikolay: you wanted to talk about?

705
00:37:31,880 --> 00:37:35,640
I feel we went to some very dark corner of Postgres observability,

706
00:37:35,840 --> 00:37:36,520
you know?

707
00:37:36,900 --> 00:37:37,780
Michael: Yeah, yeah, yeah.

708
00:37:38,080 --> 00:37:40,560
It definitely seems less explored than I was expecting.

709
00:37:40,560 --> 00:37:44,340
As your daughter said, this is not a novel, like this is not

710
00:37:44,340 --> 00:37:45,640
an out there revelation.

711
00:37:46,740 --> 00:37:50,740
I was just surprised When I saw the feature request for it, and

712
00:37:50,740 --> 00:37:54,960
I've heard it a few times, I was thinking, why don't we add,

713
00:37:54,960 --> 00:37:57,600
like, why don't more people have that in their queries?

714
00:37:57,600 --> 00:38:01,020
Why don't I see this in blog posts or in monitoring tools?

715
00:38:01,020 --> 00:38:02,740
Nikolay: And also for buffer operations.

716
00:38:04,140 --> 00:38:05,100
And WAL operations.

717
00:38:05,660 --> 00:38:07,280
Michael: Well, I know why we don't have that.

718
00:38:07,280 --> 00:38:10,420
It's because the standard deviation is not even available there.

719
00:38:10,520 --> 00:38:11,820
Nikolay: Well, it should be available.

720
00:38:11,820 --> 00:38:12,840
Let's add it.

721
00:38:13,180 --> 00:38:16,860
Michael: So, yeah, my summary is I'd love to see these in pg_stat_statements.

722
00:38:16,860 --> 00:38:20,220
But in the meantime, like you mentioned, if you're doing a consulting

723
00:38:20,220 --> 00:38:23,520
report or if you've got a dashboard internally consider adding

724
00:38:23,520 --> 00:38:26,820
this like or replacing the ordering by mean.

725
00:38:26,820 --> 00:38:32,340
Nikolay: Just don't call it fake p95 call it mysterious pn

726
00:38:33,380 --> 00:38:38,660
Michael: fake news p99 yeah I liked your mysterious what was

727
00:38:38,660 --> 00:38:40,120
it MMP

728
00:38:40,440 --> 00:38:41,400
Nikolay: magic P

729
00:38:44,140 --> 00:38:44,540
Michael: yeah

730
00:38:44,540 --> 00:38:45,040
Nikolay: okay

731
00:38:45,620 --> 00:38:46,520
Michael: yep absolutely

732
00:38:47,180 --> 00:38:47,680
Nikolay: good