1
00:00:00,060 --> 00:00:03,120
Michael: Hello and welcome to PostgresFM, a weekly show about

2
00:00:03,120 --> 00:00:04,540
all things PostgreSQL.

3
00:00:04,640 --> 00:00:07,940
I am Michael, founder of pgMustard,
and this is my co-host,

4
00:00:07,960 --> 00:00:09,840
Nikolay, founder of Postgres.ai.

5
00:00:09,960 --> 00:00:11,020
Welcome back, Nikolay.

6
00:00:11,180 --> 00:00:13,420
How are you getting on and what
would you like to talk about

7
00:00:13,420 --> 00:00:13,920
today?

8
00:00:14,540 --> 00:00:15,320
Nikolay: Hi, Michael.

9
00:00:15,800 --> 00:00:19,540
First of all, thank you for keeping
the, how is it called, the

10
00:00:19,540 --> 00:00:24,520
ball rolling, or how to say, long
time no see or long time no

11
00:00:24,520 --> 00:00:25,020
hear.

12
00:00:25,520 --> 00:00:25,580
Yeah.

13
00:00:25,580 --> 00:00:26,080
Yeah.

14
00:00:26,320 --> 00:00:29,080
Thank you for the very interesting
interviews.

15
00:00:29,540 --> 00:00:34,700
I liked listening to them while
I was slightly off, but maybe

16
00:00:34,700 --> 00:00:39,160
it's time to return to our regular
format or have a mix, I don't

17
00:00:39,160 --> 00:00:42,840
know what we will decide next,
but today we have a regular format,

18
00:00:42,840 --> 00:00:43,340
right?

19
00:00:43,440 --> 00:00:43,940
Michael: Yeah.

20
00:00:44,720 --> 00:00:48,660
Nikolay: Yeah, and the topic I
brought today is the overhead

21
00:00:48,900 --> 00:00:51,420
of observability tools, the most
popular ones.

22
00:00:51,420 --> 00:00:55,460
Well, one is the most popular one,
pg_stat_statements.

23
00:00:56,740 --> 00:01:02,320
And I remember
exactly when we didn't have

24
00:01:02,320 --> 00:01:05,420
pg_stat_statements, we had only
log-based analysis.

25
00:01:05,420 --> 00:01:08,040
We didn't have wait events, didn't
have pg_stat_statements like

26
00:01:08,040 --> 00:01:09,360
15 years ago, right?

27
00:01:09,800 --> 00:01:17,220
And we only had log-based analysis
and there was a tool written

28
00:01:17,240 --> 00:01:23,760
in PHP called pgFouine, and then
another tool replaced it, written

29
00:01:23,760 --> 00:01:25,100
in Perl called pgBadger.

30
00:01:26,120 --> 00:01:31,600
And the idea was, like, the first thing
people figured

31
00:01:31,600 --> 00:01:36,400
out was that usually we see
only the tip of the iceberg;

32
00:01:36,400 --> 00:01:39,960
we see only some queries which
are the slowest ones, but

33
00:01:39,960 --> 00:01:46,080
you cannot properly perform like
full holistic analysis if you

34
00:01:46,080 --> 00:01:52,580
don't see the fast queries, because
fast queries might be consuming

35
00:01:52,580 --> 00:01:54,660
even more resources than slow queries.

36
00:01:54,900 --> 00:01:59,080
So what people usually did, the
experienced DBAs usually said,

37
00:01:59,080 --> 00:02:04,820
okay, I'm going to switch on all
query logging for a few minutes

38
00:02:04,900 --> 00:02:06,880
to collect everything and then
analyze.
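
For context, in modern Postgres that temporary "log everything" window looks roughly like the following; this is a sketch only, and the log path and pgBadger invocation are illustrative assumptions:

    # temporarily log every statement (0 ms threshold), then revert
    psql -c "ALTER SYSTEM SET log_min_duration_statement = 0;"
    psql -c "SELECT pg_reload_conf();"
    sleep 300                                    # collect a few minutes of workload
    psql -c "ALTER SYSTEM RESET log_min_duration_statement;"
    psql -c "SELECT pg_reload_conf();"
    # then feed the collected logs to pgBadger, for example:
    # pgbadger /var/log/postgresql/postgresql-16-main.log -o report.html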

39
00:02:09,320 --> 00:02:11,260
In very many cases it worked well.

40
00:02:11,740 --> 00:02:16,260
Yes, there is an observer effect
because we probably put our

41
00:02:16,260 --> 00:02:20,840
Postgres down completely if we
log all queries, but not always.

42
00:02:21,020 --> 00:02:23,080
Sometimes it depends.

43
00:02:24,000 --> 00:02:27,420
So yeah, this is quite understandable.

44
00:02:27,780 --> 00:02:33,520
But then pg_stat_statements was
created and some DBAs, I remember,

45
00:02:34,360 --> 00:02:37,400
were saying, oh, you know, like
we still need to study its

46
00:02:37,400 --> 00:02:37,900
overhead.

47
00:02:39,520 --> 00:02:43,400
And then somehow there was like,
there were some benchmarks,

48
00:02:43,520 --> 00:02:47,660
I don't remember any benchmarks
that I would say I trust.

49
00:02:48,180 --> 00:02:53,000
But I remember many experienced
folks started saying, oh, you

50
00:02:53,000 --> 00:02:56,620
know, yes, there is overhead, but
it's below 10%.

51
00:02:57,500 --> 00:03:00,680
And since this thing is measuring
everything, it's observing

52
00:03:00,720 --> 00:03:04,800
the whole workload, it's worth keeping
it enabled.

53
00:03:05,660 --> 00:03:06,940
Let's have it enabled.

54
00:03:07,580 --> 00:03:11,920
I also remember there was some
number, like, many people

55
00:03:11,920 --> 00:03:15,080
saying, you know, pg_stat_statements
overhead is 7%.

56
00:03:17,220 --> 00:03:19,740
It's kind of strange, but kind
of 7%.

57
00:03:20,020 --> 00:03:20,520
7%?

58
00:03:21,340 --> 00:03:21,840
7%.

59
00:03:22,840 --> 00:03:26,660
I don't remember any benchmarks
that proved it, but I remember

60
00:03:26,660 --> 00:03:32,260
it was like the consensus was below
10%, so we are all good.

61
00:03:32,800 --> 00:03:37,200
If we try to find a good benchmark
from these early days of

62
00:03:37,200 --> 00:03:38,980
pg_stat_statements, it would be
great.

63
00:03:39,480 --> 00:03:40,760
I don't think it exists.

64
00:03:41,460 --> 00:03:45,880
But honestly, we as a community
should perform good benchmarks

65
00:03:45,920 --> 00:03:46,420
there.

66
00:03:47,500 --> 00:03:48,880
I don't remember that.

67
00:03:48,940 --> 00:03:51,220
It doesn't mean they don't exist, of
course, but...

68
00:03:51,600 --> 00:03:53,920
Michael: I did a little bit of
searching beforehand because I

69
00:03:53,920 --> 00:03:56,880
was interested in if there was
any up-to-date ones as well.

70
00:03:56,880 --> 00:04:00,360
I couldn't find anything great,
but there were a few trustworthy

71
00:04:00,460 --> 00:04:00,960
sources.

72
00:04:01,560 --> 00:04:07,120
One was pganalyze, a commercial
monitoring tool that uses pg_stat_statements.

74
00:04:07,760 --> 00:04:11,380
In their frequently asked questions,
they gave an estimate of

75
00:04:11,400 --> 00:04:14,560
approximately 1% of CPU time, which
I thought was interesting

76
00:04:14,560 --> 00:04:16,200
that they're telling their customers.

77
00:04:16,260 --> 00:04:19,080
Anyway, as we're going to discuss,
this is all very dependent

78
00:04:19,080 --> 00:04:21,760
on workload, right but at least
that's interesting and

79
00:04:21,760 --> 00:04:25,960
I trust them enough to say that's
probably based on some tests

80
00:04:25,960 --> 00:04:29,040
on their side even if I didn't
see the actual benchmarking, and

81
00:04:29,040 --> 00:04:33,480
then the other things I've seen
are a series of, well, there's

82
00:04:33,480 --> 00:04:37,760
a question on the DBA Stack Exchange
and somebody's quoting various

83
00:04:37,760 --> 00:04:40,360
Postgres authors and a benchmarking
series.

84
00:04:40,620 --> 00:04:44,280
One that was like a more realistic
workload that was about 0.5%

85
00:04:45,060 --> 00:04:49,820
measured overhead and another that
was more of a stress test that

86
00:04:49,820 --> 00:04:51,180
measured it at about 10%.

87
00:04:51,680 --> 00:04:55,440
So those are numbers, but that's
a wide range of different numbers.

88
00:04:55,440 --> 00:04:58,180
And as we're going to discuss later,
you've done some work that

89
00:04:58,180 --> 00:05:01,700
shows even that it can be very
different, which is cool.

90
00:05:02,220 --> 00:05:05,360
Nikolay: I hope by the end of this
episode, people will start

91
00:05:05,360 --> 00:05:10,160
having some ideas why this range
can be so wide.

92
00:05:11,280 --> 00:05:16,600
But I can imagine like 0.5% or
10%.

93
00:05:17,600 --> 00:05:20,400
In both cases, it can be so.

94
00:05:20,560 --> 00:05:22,620
I mean, it depends on workload,
actually.

95
00:05:22,960 --> 00:05:24,320
It very much depends on the workload.

96
00:05:24,320 --> 00:05:28,380
And also it's not only about workload,
we will talk about it later,

97
00:05:28,380 --> 00:05:28,880
right?

98
00:05:29,480 --> 00:05:31,040
So, yeah, okay.

99
00:05:31,220 --> 00:05:34,860
But when I say overhead, of course
I mean CPU overhead, first

100
00:05:34,860 --> 00:05:39,720
of all, because we don't think
about disk I/O or disk usage, disk

101
00:05:39,720 --> 00:05:45,600
space usage here, network, nothing
like that, only pure CPU overhead.

102
00:05:47,460 --> 00:05:47,800
Nikolay: By the

103
00:05:47,800 --> 00:05:56,540
way, when guys like RDS
and others, probably inheriting

104
00:05:56,580 --> 00:06:02,280
this from Oracle, consider a
wait event being null in

105
00:06:02,280 --> 00:06:10,460
pg_stat_activity as CPU, marking
it as green, CPU green, in Performance

106
00:06:11,140 --> 00:06:11,640
Insights,

107
00:06:12,040 --> 00:06:17,760
do they mean that the other wait events,
all of them, are not CPU?

108
00:06:18,940 --> 00:06:20,400
I don't understand this.

109
00:06:21,820 --> 00:06:22,700
Michael: I don't know.

110
00:06:23,320 --> 00:06:29,620
Nikolay: Because many of them, for example,
lightweight locks, spin

111
00:06:29,620 --> 00:06:32,280
locks, for example, they are purely
CPU.

112
00:06:33,840 --> 00:06:36,720
Michael: I know we have several
listeners from AWS, so maybe

113
00:06:36,720 --> 00:06:37,940
they can let us know.

114
00:06:38,760 --> 00:06:40,140
Nikolay: We have listeners, yeah.

115
00:06:40,640 --> 00:06:46,860
And on the other hand, if you say
CPU instead of null, null means

116
00:06:46,860 --> 00:06:47,360
unknown.

117
00:06:47,780 --> 00:06:51,920
And according to PostgreSQL documentation
and source code, it means

118
00:06:52,540 --> 00:06:53,340
no wait.

119
00:06:54,120 --> 00:06:57,920
In reality, it means either really
no wait and maybe some kind

120
00:06:57,920 --> 00:07:03,380
of CPU work, but also in many cases
it means a wait event which

121
00:07:03,380 --> 00:07:06,880
is not yet created and the code is
not yet covered with one.

122
00:07:07,260 --> 00:07:10,960
For example, PostgreSQL 16, we recently
had some benchmarks and

123
00:07:10,960 --> 00:07:15,060
we saw a wait event which you see
in PostgreSQL 16 but you don't

124
00:07:15,060 --> 00:07:18,420
see it in PostgreSQL 14 because it
was not yet created there.

125
00:07:20,280 --> 00:07:23,740
So I'm thinking, okay, it means
in performance insights on RDS

126
00:07:24,360 --> 00:07:27,600
and I think in many other systems
maybe in CloudSQL as well

127
00:07:27,600 --> 00:07:32,500
or this ASH viewer tool
written in Java, it also likes

128
00:07:32,500 --> 00:07:36,020
to use green color and say this
is CPU, but it's actually not

129
00:07:36,020 --> 00:07:37,440
CPU, it's null.

130
00:07:37,440 --> 00:07:41,580
Because CPU doesn't exist in the
list of wait events in pg_stat_activity.

132
00:07:42,780 --> 00:07:49,440
So if you mark it as CPU for 14 because
it was null, but

133
00:07:49,440 --> 00:07:52,000
then you start distinguishing it
for 16.

134
00:07:52,440 --> 00:07:53,760
Something is not right here.

135
00:07:54,140 --> 00:07:56,620
So I'm asking, is it really CPU?
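
A rough way to see how much of that "CPU green" is really just a null wait event on your own system; just a sketch against pg_stat_activity:

    # active backends grouped by wait event; NULL rows are what get painted as "CPU"
    psql -c "SELECT wait_event_type, wait_event, count(*)
             FROM pg_stat_activity
             WHERE state = 'active' AND pid <> pg_backend_pid()
             GROUP BY 1, 2
             ORDER BY 3 DESC;"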

136
00:07:59,200 --> 00:08:02,880
So interesting question, but it's
slightly off topic.

137
00:08:02,960 --> 00:08:04,440
Back to pg_stat_statements.

138
00:08:05,200 --> 00:08:07,000
Let's talk about benchmarks we
had.

139
00:08:07,280 --> 00:08:11,420
So, the idea was we got some credits
from Google Cloud.

140
00:08:12,280 --> 00:08:17,460
And it's great, this year we will
be performing, I mean, we,

141
00:08:17,460 --> 00:08:22,080
I mean, Postgres.ai team, we will
be performing a lot of benchmarks.

142
00:08:22,200 --> 00:08:24,640
I'm going to do a lot of post...

143
00:08:24,720 --> 00:08:28,340
We already had a lot of benchmarks,
I think thousands of them

144
00:08:28,380 --> 00:08:34,540
if you count each iteration, and
hundreds of them if you count

145
00:08:34,540 --> 00:08:38,700
the whole benchmark consisting
of many iterations.

146
00:08:39,840 --> 00:08:45,160
So we were curious how many TPS
we can squeeze from PostgreSQL

147
00:08:45,180 --> 00:08:51,480
16 on various big machines, Intel,
AMD, and there was a good

148
00:08:51,480 --> 00:08:57,860
article from 2016 from Alexander
Korotkov, when both PostgreSQL

149
00:08:57,900 --> 00:09:00,700
and MySQL, I remember, teamed up.

150
00:09:01,580 --> 00:09:02,460
It was interesting.

151
00:09:02,500 --> 00:09:07,220
I think, I don't remember if somebody
from Percona was also working

152
00:09:07,220 --> 00:09:13,180
on MySQL, the goal was to demonstrate
that both systems can show

153
00:09:13,180 --> 00:09:14,080
a million TPS.

154
00:09:15,720 --> 00:09:20,820
And during that work, some contention
issues were improved, fixed

155
00:09:20,840 --> 00:09:21,540
in Postgres.

156
00:09:21,760 --> 00:09:26,520
So, Postgres reached a million
TPS on some strong machines.

157
00:09:26,680 --> 00:09:29,120
And I like that benchmark because
it's simple, it's pgBench,

158
00:09:30,540 --> 00:09:31,440
select only.

159
00:09:32,060 --> 00:09:34,820
So, it's only selects, very simple
ones.

160
00:09:35,420 --> 00:09:37,620
So we just repeated the same benchmark.

161
00:09:38,560 --> 00:09:44,840
And as usual for such benchmarks,
it's actually stress testing,

162
00:09:45,180 --> 00:09:48,540
because you are exploring this
edge.

163
00:09:49,020 --> 00:09:51,900
It's not what you need to do for
your application, for example,

164
00:09:51,900 --> 00:09:54,260
unless you're on purpose studying
it.

165
00:09:54,380 --> 00:09:58,680
I hate this behavior being default
in pgBench.

166
00:10:00,480 --> 00:10:03,480
It should not be so, because it
provokes us to perform stress

167
00:10:03,480 --> 00:10:08,100
tests instead of regular load tests
and exploring normal situation,

168
00:10:08,160 --> 00:10:11,040
for example, 25 or 50% of CPU load.

169
00:10:11,040 --> 00:10:14,000
But in our case, we just want to
squeeze as much as we can.

170
00:10:14,340 --> 00:10:19,780
So we did this, and I published
a blog post about our new AI

171
00:10:19,780 --> 00:10:24,820
bot, and it has details and links
to how we did it.

172
00:10:24,960 --> 00:10:32,340
So we took the newest Intel, fourth
generation Xeon Scalable,

173
00:10:33,380 --> 00:10:35,420
Sapphire Rapids it's called, right?

174
00:10:35,460 --> 00:10:41,640
And also 4th generation AMD EPYC,
C3 and C3D instances on GCP.

175
00:10:42,660 --> 00:10:50,280
One has, the biggest number is 176
vCPUs, and AMD has 360 vCPUs.

176
00:10:51,820 --> 00:10:54,260
And both have more than a terabyte
of RAM.

177
00:10:54,480 --> 00:10:57,780
So it's insane, but we use spots.

178
00:10:58,780 --> 00:11:01,500
AWS has spots, GCP has spot instances, it's
cool.

179
00:11:01,760 --> 00:11:05,040
You pay a couple of bucks for such
an experiment, honestly.

180
00:11:06,020 --> 00:11:06,960
Like it's not...

181
00:11:07,580 --> 00:11:11,140
I realized even if we didn't have
credits, I would probably pay

182
00:11:11,140 --> 00:11:14,940
for myself, it's interesting to
explore these things.

183
00:11:15,060 --> 00:11:18,480
Because we just provision temporary
machines for 1 or 2 hours

184
00:11:18,480 --> 00:11:21,500
and spot means huge discount.

185
00:11:22,200 --> 00:11:28,740
Yeah, so, this kind of experiment,
it's a classic experiment for

186
00:11:28,740 --> 00:11:29,500
stress testing.

187
00:11:29,500 --> 00:11:32,980
You first start with 1 connection,
then more and more connections,

188
00:11:33,080 --> 00:11:34,440
and you control...

189
00:11:35,280 --> 00:11:39,400
We chose Alexander Korotkov's approach.

190
00:11:39,940 --> 00:11:44,920
In pgBench both -c and -j are the same.

191
00:11:45,060 --> 00:11:48,920
So we start from 1, then we put
50.

192
00:11:51,400 --> 00:11:54,100
With such huge machines, this step
is reasonable.

193
00:11:54,780 --> 00:11:57,460
Jumping to 50 right away.

194
00:11:57,560 --> 00:12:02,380
Then 100, then 150, 200, and so
on until, I think, 500.

195
00:12:03,280 --> 00:12:06,800
Exceeding the number of vCPUs in
both cases.
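
As a rough sketch of the run being described (the database name, scale factor, and step duration are illustrative; -c and -j are kept equal, as mentioned):

    pgbench -i -s 1000 testdb                        # initialize the pgbench tables
    for c in 1 50 100 150 200 250 300 350 400 450 500; do
      pgbench -S -c "$c" -j "$c" -P 10 -T 60 testdb  # -S = built-in select-only script
    done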

196
00:12:07,720 --> 00:12:13,860
So since we use PostgreSQL cluster,
which is a project, it's

197
00:12:13,860 --> 00:12:17,840
open source project, It's a very
good thing, Ansible kind

198
00:12:17,840 --> 00:12:21,100
of set of playbooks to provision
Patroni clusters with a lot

199
00:12:21,100 --> 00:12:21,760
of things.

200
00:12:22,200 --> 00:12:26,540
It's maintained by Vitaliy, who
is working in my team.

201
00:12:26,940 --> 00:12:31,780
So it's a great tool and it has
a lot of things, including those

202
00:12:31,780 --> 00:12:37,380
we discussed together over the years
and he just added into postgresql_

203
00:12:37,480 --> 00:12:37,980
cluster.

204
00:12:38,440 --> 00:12:45,040
So we used it and because of that
we had a lot of observability

205
00:12:45,180 --> 00:12:48,580
tools including those for query
analysis: pg_stat_statements, pg_stat_kcache,

206
00:12:49,200 --> 00:12:53,500
much less popular, and pg_wait_sampling
for wait event analysis.

207
00:12:54,140 --> 00:12:57,100
And then we saw a very strange
picture.

208
00:12:57,180 --> 00:13:02,360
Until 50 we grow in terms of TPS,
kind of approaching a million

209
00:13:02,360 --> 00:13:04,680
TPS, but then go down very quickly.

210
00:13:05,660 --> 00:13:09,380
And on Intel we went down even
more, but it's a different story.

211
00:13:09,620 --> 00:13:15,760
And thanks to pg_wait_sampling,
we saw that the number 1

212
00:13:15,760 --> 00:13:17,960
wait event is related to pg_stat_kcache.

213
00:13:19,640 --> 00:13:22,800
So we immediately realized this
is an observer effect from this

214
00:13:22,800 --> 00:13:24,440
pg_stat_kcache.
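
The kind of pg_wait_sampling check that surfaces this looks roughly like the following, assuming the extension is installed:

    # top wait events accumulated across all backends since the collector started
    psql -c "SELECT event_type, event, sum(count) AS samples
             FROM pg_wait_sampling_profile
             GROUP BY 1, 2
             ORDER BY samples DESC
             LIMIT 10;"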

215
00:13:24,780 --> 00:13:29,060
For those who don't know, pg_stat_kcache
is an additional extension

216
00:13:29,060 --> 00:13:33,380
to the pg_stat_statements extension which
provides physical level metrics,

217
00:13:34,160 --> 00:13:39,820
user CPU, sys CPU, real disk IO
at physical level, context switches.
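
A minimal sketch of reading those metrics; the column names below are from recent pg_stat_kcache releases (2.2+), and older ones expose user_time/system_time without the exec_ prefix:

    # per-database user vs system CPU time, as measured by the kernel via getrusage()
    psql -c "SELECT datname, exec_user_time, exec_system_time
             FROM pg_stat_kcache;"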

218
00:13:40,840 --> 00:13:46,100
So it's very useful to understand
real CPU usage, unlike this

219
00:13:46,380 --> 00:13:52,060
wait event equals null, which
I think is wrongly presented

220
00:13:52,160 --> 00:13:54,100
in RDS performance insights.

221
00:13:54,800 --> 00:13:59,320
This thing is trustworthy, we have used
it for many years, a few big companies

222
00:13:59,640 --> 00:14:00,660
use it as well.

223
00:14:00,660 --> 00:14:03,800
It's not super popular, RDS doesn't
have it, CloudSQL doesn't

224
00:14:03,800 --> 00:14:09,520
have it, most managed Postgres
providers don't have it.

225
00:14:09,760 --> 00:14:11,820
I know Yandex Cloud has it.

226
00:14:12,400 --> 00:14:13,420
I think they might be the

227
00:14:13,420 --> 00:14:14,740
Michael: only ones that have it.

228
00:14:15,640 --> 00:14:18,940
Nikolay: Maybe, but also one of the
biggest e-commerce companies

229
00:14:18,940 --> 00:14:23,360
we worked with also has it for
many years and they have very

230
00:14:23,360 --> 00:14:26,540
critical systems running on Postgres,
so it's...

231
00:14:27,500 --> 00:14:31,300
But in this case, overhead was
so huge, like, what's happening?

232
00:14:31,560 --> 00:14:35,540
And I immediately published it
on Twitter and Vitaliy created

233
00:14:35,540 --> 00:14:42,080
an issue in the pg_stat_kcache GitHub
repository, and then the maintainers,

234
00:14:43,100 --> 00:14:45,600
they created a fix in 4 hours,
I think.

235
00:14:46,680 --> 00:14:48,420
So I think they expected something.

236
00:14:48,420 --> 00:14:53,380
There was some additional lock
which was removed in this fix

237
00:14:54,140 --> 00:14:57,160
with the idea this lock is something
old and it's not really

238
00:14:57,160 --> 00:15:01,720
needed, but in our case it was
slowing everything down when contention

239
00:15:01,780 --> 00:15:04,720
is high, when many, many sessions
fight...

240
00:15:04,900 --> 00:15:08,520
Not fight, basically what's happening
with pgbench you have by

241
00:15:08,520 --> 00:15:11,200
default only 4 queries, right?

242
00:15:11,200 --> 00:15:17,440
If you say select-only, dash uppercase
S, it's only a single query,

243
00:15:17,440 --> 00:15:21,360
only one select to the pgbench_accounts
table, that's it.

244
00:15:21,780 --> 00:15:28,480
In this case, like hundreds of
sessions try to increment metrics

245
00:15:28,940 --> 00:15:32,580
in a single record of pg_stat_kcache,
right?

246
00:15:33,100 --> 00:15:36,240
And due to that lock, which was
removed in the latest version,

247
00:15:36,960 --> 00:15:37,720
it was not good.

248
00:15:37,720 --> 00:15:42,520
I mean, the more sessions we have,
the bigger the contention, the

249
00:15:42,520 --> 00:15:47,220
bigger the overhead is, and we see it
as we increase the number of

250
00:15:47,220 --> 00:15:51,180
clients, number of connections
and jobs, the dash C, dash J in

251
00:15:51,180 --> 00:15:55,380
pgBench parameters, TPS go down.

252
00:15:55,380 --> 00:15:59,360
So we have more, we have room,
we have more vCPUs, right?

253
00:16:00,060 --> 00:16:05,440
But we cannot use them properly
because all of them try to execute

254
00:16:05,440 --> 00:16:07,660
basically the same normalized query.

255
00:16:08,680 --> 00:16:10,320
Parameters don't matter here.

256
00:16:11,080 --> 00:16:13,580
Michael: Yeah, I pulled the chart
up from your tweet and just

257
00:16:13,580 --> 00:16:18,820
to give people an idea, at 50 clients
it's about 550,000 TPS.

258
00:16:19,200 --> 00:16:25,360
At 100 clients it does go up to
about 700,000 TPS, but at 150

259
00:16:26,320 --> 00:16:30,260
we're down below the 50 client
rate and it's only 400,000.

260
00:16:30,920 --> 00:16:35,740
Then it settles at about 200 clients
and above, at about 300,000

261
00:16:36,200 --> 00:16:38,440
TPS, which is less than we had
at 50.

262
00:16:38,440 --> 00:16:42,500
So it's a super interesting curve
and cool to hear that you got

263
00:16:42,500 --> 00:16:45,820
so quickly to the root cause and
that the team was able to fix

264
00:16:45,820 --> 00:16:46,680
it so quickly.

265
00:16:47,460 --> 00:16:48,840
Nikolay: Yeah, in a few hours.

266
00:16:48,840 --> 00:16:52,360
Well, let's split the story here
into 2 paths.

267
00:16:52,360 --> 00:16:55,900
First path is purely pg_stat_kcache
and the second path is without

268
00:16:55,900 --> 00:17:00,200
it, because when we saw this overhead
we continued our discovery

269
00:17:00,480 --> 00:17:06,540
of maximum TPS on modern Intel
and modern AMD without pg_stat_kcache,

270
00:17:06,660 --> 00:17:07,520
of course.

271
00:17:07,720 --> 00:17:11,460
Because we didn't expect it to
be fixed so quickly.

272
00:17:11,820 --> 00:17:17,780
But when the fix was ready, we
asked the bot to repeat this benchmark.

273
00:17:18,120 --> 00:17:21,720
It was interesting to convince
the bot that it's safe to download

274
00:17:22,000 --> 00:17:24,440
fresh code from GitHub and compile
it.

275
00:17:25,120 --> 00:17:28,760
It hallucinated, saying it's against
policy.

276
00:17:28,820 --> 00:17:32,060
When I asked which policy, It said
PostgreSQL policy.

277
00:17:33,280 --> 00:17:37,580
So it was funny, like I went checking,
do we have such policy?

278
00:17:37,580 --> 00:17:41,560
I checked all our docs because
the bot does know our documents.

279
00:17:41,680 --> 00:17:44,800
So probably like, but it was just
pure hallucination.

280
00:17:46,800 --> 00:17:51,580
So then we convinced it, verified
and indeed we saw that the

281
00:17:51,580 --> 00:17:56,760
fix resolves the problem
and no more such overhead.

282
00:17:57,440 --> 00:18:02,100
So it was good, and the next week a
new release of pg_stat_kcache was

283
00:18:02,100 --> 00:18:02,600
issued.

284
00:18:02,860 --> 00:18:07,240
But interesting question, like,
as I mentioned, I trust those

285
00:18:07,240 --> 00:18:11,680
people, and also not just trust, I worked
with some of them, and I

286
00:18:11,680 --> 00:18:15,520
touched production systems with
my hands, so I didn't see such

287
00:18:15,520 --> 00:18:17,980
problems for years on production.

288
00:18:18,580 --> 00:18:21,840
But here we saw obviously a very
big overhead.

289
00:18:23,420 --> 00:18:24,120
Why so?

290
00:18:24,640 --> 00:18:29,340
Why have we survived with this on production?

291
00:18:29,340 --> 00:18:31,960
This is an interesting question.
Let's return to it after we

292
00:18:31,960 --> 00:18:33,140
explore the second path.

293
00:18:33,140 --> 00:18:36,660
Second path, okay, forget about
pg_stat_kcache, we only have pg_stat

294
00:18:36,660 --> 00:18:41,140
statements, pg_wait_sampling, we
know the overhead of both is not

295
00:18:41,140 --> 00:18:41,920
so big.

296
00:18:42,040 --> 00:18:48,340
And then we just explore from 1
to 500 connections on both modern

297
00:18:48,520 --> 00:18:57,260
Intel platform and modern AMD,
176 vCPUs on one and 360 vCPUs on

298
00:18:57,260 --> 00:18:57,760
another.

299
00:18:58,500 --> 00:19:02,360
And a huge surprise was Intel behaved...

300
00:19:02,480 --> 00:19:09,020
Again, the same workload, pgBench -S,
so select only, single query,

301
00:19:09,340 --> 00:19:11,640
very simple, sub-millisecond latency.

302
00:19:12,660 --> 00:19:17,280
Huge surprise was Intel behaves
not well at all.

303
00:19:17,620 --> 00:19:21,000
It behaves similar to what we had
with pg_stat_kcache enabled.

304
00:19:21,220 --> 00:19:24,660
But higher, yes, higher, it reached
1 million I think or so,

305
00:19:25,680 --> 00:19:28,040
maybe slightly below it, and then
went down.

306
00:19:28,780 --> 00:19:35,020
And this going down,
even before you reach the number

307
00:19:35,020 --> 00:19:40,080
of vCPUs, while increasing the number of
connections and jobs in pgBench.

309
00:19:40,580 --> 00:19:41,420
It's not normal.

310
00:19:41,760 --> 00:19:43,220
Some kind of problem, obviously.

311
00:19:44,060 --> 00:19:49,500
While AMD also was not good, but
it didn't go down.

312
00:19:49,940 --> 00:19:51,600
It went down slightly, right?

313
00:19:52,740 --> 00:19:54,520
It demonstrated almost a plateau.

314
00:19:55,240 --> 00:19:56,420
Very different behavior.

315
00:19:57,180 --> 00:20:01,520
And we started studying what's
happening here, and obviously

316
00:20:01,520 --> 00:20:04,040
the problem was pg_stat_statements
in this case.

317
00:20:04,460 --> 00:20:09,020
And we saw it, I think, from wait
events by pg_wait_sampling, but

318
00:20:09,020 --> 00:20:14,480
we also collected flame graphs
and we obtained 2 very different

319
00:20:14,480 --> 00:20:16,160
pictures for these platforms.
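
One common way to collect such flame graphs, assuming perf and Brendan Gregg's FlameGraph scripts are available on the machine (file names are illustrative):

    # sample on-CPU stacks machine-wide for 30 seconds
    perf record -F 99 -a -g -- sleep 30
    # fold the stacks and render an SVG flame graph
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg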

320
00:20:16,780 --> 00:20:22,280
Everything is the same, Ubuntu
22.04, Postgres 16 latest version,

321
00:20:23,440 --> 00:20:27,980
everything is like, some kind of
tuning applied, kind of default

322
00:20:27,980 --> 00:20:31,320
tuning we usually apply to new
clusters under our control.

323
00:20:33,460 --> 00:20:36,940
Nothing special, nothing fancy,
but very different behavior.

324
00:20:37,540 --> 00:20:42,500
And on flame graphs we could see
that in the case of Intel,

325
00:20:42,500 --> 00:20:48,460
inside pg_stat_statements, a huge
s_lock function call consumed a lot
of time.

326
00:20:48,460 --> 00:20:51,540
And in case of AMD this call is
much, like, smaller.

327
00:20:52,120 --> 00:20:55,460
So in case of Intel it was like
75% of the whole time spent by CPU

328
00:21:00,660 --> 00:21:01,360
on the flame graph.

329
00:21:01,620 --> 00:21:02,720
Like, Why?

330
00:21:02,720 --> 00:21:03,540
What's happening?

331
00:21:03,700 --> 00:21:05,340
Something not normal at all.

332
00:21:05,860 --> 00:21:10,440
And I talked to a few guys, talked
to Andrey Borodin and Alexander

333
00:21:10,460 --> 00:21:14,260
Korotkov, and interestingly, they
both mentioned the idea that

334
00:21:14,260 --> 00:21:17,220
probably pg_stat_statements needs
sampling here.

335
00:21:18,340 --> 00:21:18,840
Right?

336
00:21:19,200 --> 00:21:19,700
Sampling.

337
00:21:20,020 --> 00:21:21,160
So, it...

338
00:21:21,740 --> 00:21:23,560
Michael: Yeah, so it's an interesting
idea.

339
00:21:23,560 --> 00:21:25,360
We don't currently have a parameter.

340
00:21:26,040 --> 00:21:26,880
Yeah, we don't...

341
00:21:26,880 --> 00:21:29,640
Like for a lot of the other things
we have, like for logging,

342
00:21:29,640 --> 00:21:32,780
for example, because there's such
overhead to logging, we have

343
00:21:32,780 --> 00:21:34,300
parameters for that kind of thing.

344
00:21:34,300 --> 00:21:39,800
We can sample to only measure 1
in 10 or 1 in 100.
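
For reference, the logging-side sampling knobs being referred to look roughly like this; auto_explain.sample_rate assumes the auto_explain module is loaded, and exact availability depends on the Postgres version:

    # log only a fraction of qualifying statements instead of all of them
    psql -c "ALTER SYSTEM SET log_min_duration_sample = '10ms';"
    psql -c "ALTER SYSTEM SET log_statement_sample_rate = 0.1;"   # 1 in 10
    psql -c "ALTER SYSTEM SET auto_explain.sample_rate = 0.01;"   # 1 in 100
    psql -c "SELECT pg_reload_conf();"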

345
00:21:39,800 --> 00:21:40,440
Yeah, yeah.

346
00:21:40,440 --> 00:21:44,440
Well, I'm thinking actually, I
know we've discussed this before,

347
00:21:44,440 --> 00:21:48,320
but from log_min_duration_statement,
yeah, I think since more recently,

348
00:21:48,320 --> 00:21:52,760
but for auto_explain from a long
time ago, for example. But

349
00:21:52,760 --> 00:21:55,900
the 75% makes sense as well just
to give people an idea of the

350
00:21:55,900 --> 00:21:59,820
exact numbers I've pulled up that
chart as well On Intel with

351
00:21:59,820 --> 00:22:00,800
pg_stat_statements.

352
00:22:00,800 --> 00:22:05,560
It does get to very nearly a million
tps without pg_stat_kcache

353
00:22:05,740 --> 00:22:09,660
and then drops a little bit less
rapidly but still it drops down

354
00:22:09,660 --> 00:22:14,620
to 500,000 150 and then down to
about 300,000 by the time you

355
00:22:14,620 --> 00:22:20,460
get into the hundreds of clients
whereas AMD with pg_stat_statements

356
00:22:21,040 --> 00:22:25,640
continues to rise at 100 clients
to about 1.2 million, gets above

357
00:22:25,640 --> 00:22:31,820
1.5 million at about 150 clients,
then seems saturated and it's

358
00:22:31,820 --> 00:22:34,700
mostly a plateau, slight decline
as you get more.

359
00:22:34,820 --> 00:22:38,500
So that's about 5 times more by
the time you get to, you know,

360
00:22:38,500 --> 00:22:40,640
300,000 versus 1.5 million.

361
00:22:41,120 --> 00:22:45,140
Starts to make sense that that's
like 75% overhead, I guess.

362
00:22:45,660 --> 00:22:50,900
Nikolay: Yeah, so in all cases
when you reach some usual number

363
00:22:50,900 --> 00:22:53,640
of vCPU, you go down.

364
00:22:54,400 --> 00:22:57,540
But normally you go down slowly,
right?

365
00:22:58,260 --> 00:23:03,340
But the picture demonstrated on
Intel, like very acute, like

366
00:23:03,340 --> 00:23:05,040
acutely going down, right?

367
00:23:05,740 --> 00:23:07,060
Going down very fast.

368
00:23:07,640 --> 00:23:08,760
And this is not normal.

369
00:23:08,860 --> 00:23:12,540
And of course, we, to confirm that
pg_stat_statements was involved,

370
00:23:12,640 --> 00:23:16,920
and also to get numbers we wanted,
we wanted big numbers, right?

371
00:23:18,660 --> 00:23:21,560
When I posted it on Twitter, of
course, people started liking

372
00:23:21,560 --> 00:23:22,260
and reposting.

373
00:23:22,960 --> 00:23:27,160
Like, with AMD, I think we've got,
without pg_stat_statements,

374
00:23:27,160 --> 00:23:31,440
just removing it, we've got 2.5
million TPS, right?

375
00:23:32,060 --> 00:23:32,820
On AMD.

376
00:23:33,400 --> 00:23:34,360
Michael: Nearly, yeah.

377
00:23:35,280 --> 00:23:36,380
Nikolay: Almost, almost, right.

378
00:23:36,380 --> 00:23:38,040
About 2 million TPS.

379
00:23:38,040 --> 00:23:40,740
My tweet was 2 million TPS on PostgreSQL
16.

380
00:23:41,120 --> 00:23:41,620
Right.

381
00:23:42,340 --> 00:23:49,300
Well, yeah, it was funny to see
the reaction from CEO of PlanetScale.

382
00:23:51,740 --> 00:23:53,860
Like, it's not real workload.

383
00:23:54,840 --> 00:23:56,760
Well, no, it's not real workload.

384
00:23:56,760 --> 00:23:57,480
It's select-only.

385
00:23:58,780 --> 00:24:00,600
It's purely synthetic workload.

386
00:24:00,860 --> 00:24:05,280
We're just exploring some edge
of what system can provide.

387
00:24:05,280 --> 00:24:08,260
Of course, it's not realistic,
it's some select, that's it.

388
00:24:08,820 --> 00:24:13,080
Well, maybe there are some systems
which need mostly this kind

389
00:24:13,080 --> 00:24:13,620
of workload.

390
00:24:13,620 --> 00:24:16,920
And this is an interesting question,
because if they have such

391
00:24:16,920 --> 00:24:22,100
workload, they suffer from bigger
problems from pg_stat_statements.

392
00:24:22,820 --> 00:24:26,820
I think this is the main idea of
today's episode.

393
00:24:27,340 --> 00:24:31,020
Look at your workload and understand
the nature of it.

394
00:24:31,160 --> 00:24:35,640
But obviously, CEO of PlanetScale
is not interested in single-node

395
00:24:36,040 --> 00:24:39,520
performance, because their main
thing is sharding.

396
00:24:40,440 --> 00:24:44,020
So single-node should not provide
millions of TPS, right?

397
00:24:45,060 --> 00:24:45,960
It's not normal, right?

398
00:24:47,220 --> 00:24:48,620
Michael: It doesn't help the marketing.

399
00:24:49,340 --> 00:24:50,140
Nikolay: Right, right.

400
00:24:50,140 --> 00:24:55,220
Everyone realizes it's select only,
everything is cached in memory,

401
00:24:55,060 --> 00:24:58,030
but still, 2.5 million TPS, wow.

402
00:24:58,840 --> 00:25:00,820
On a machine you can just rent easily.

403
00:25:00,820 --> 00:25:02,780
Well, it's an expensive machine, I
think.

404
00:25:03,480 --> 00:25:06,050
If it's not spot, if it's normal,
without any discounts, it's

405
00:25:06,050 --> 00:25:09,460
above $10,000 per month.

406
00:25:09,060 --> 00:25:10,760
So it's an expensive machine.

407
00:25:10,760 --> 00:25:11,260
Yeah.

408
00:25:11,660 --> 00:25:12,260
Of course.

409
00:25:12,260 --> 00:25:13,020
But it's possible.

410
00:25:13,020 --> 00:25:16,520
And this is just Postgres with
minimal tuning, right?

411
00:25:16,780 --> 00:25:17,460
It's good.

412
00:25:17,500 --> 00:25:19,340
Like 2 million TPS, whoo-hoo.

413
00:25:19,900 --> 00:25:24,360
Of course, I'm slightly sad comparing
to 2016, how many years

414
00:25:24,360 --> 00:25:25,120
have already passed?

415
00:25:25,120 --> 00:25:25,820
Like 8?

416
00:25:26,360 --> 00:25:27,840
3,000,000?

417
00:25:30,720 --> 00:25:33,920
Maybe we can do more, we can squeeze
maybe more, but it's a separate

418
00:25:33,920 --> 00:25:34,420
question.

419
00:25:34,440 --> 00:25:35,860
We will probably think about it.

420
00:25:35,860 --> 00:25:39,960
But returning to pg_stat_statements,
what's happening?

421
00:25:39,960 --> 00:25:46,780
Again, a lot of sessions, they
compete trying to update the same

422
00:25:46,780 --> 00:25:50,020
record in the pg_stat_statements, just single
record, select, query, that's

423
00:25:50,020 --> 00:25:50,700
it.

424
00:25:51,140 --> 00:25:55,580
If it was different queries, it
would be okay.

425
00:25:56,780 --> 00:26:00,560
So, I even started calling this
workload pathological.

426
00:26:01,980 --> 00:26:07,280
But then I say, okay, what about
all SaaS systems and social

427
00:26:07,280 --> 00:26:07,780
media?

428
00:26:09,060 --> 00:26:13,780
Do they have something, some query
which should

429
00:26:13,780 --> 00:26:17,520
be fast and is executed in many
cases?

430
00:26:17,540 --> 00:26:18,740
And the answer is yes.

431
00:26:18,740 --> 00:26:23,540
Usually, if people work with your
system, you need to select...

432
00:26:23,800 --> 00:26:27,940
It can be cached, of course, but
I saw it many times, some primary

433
00:26:27,940 --> 00:26:33,180
key lookup to tables like users
or posts or projects or blogs

434
00:26:33,180 --> 00:26:33,900
or something.

435
00:26:34,000 --> 00:26:40,020
And you see most of sessions, I
mean not only database sessions,

436
00:26:40,020 --> 00:26:43,760
but for example, like session in
terms of HTTP communication,

437
00:26:43,900 --> 00:26:44,680
web sessions.

438
00:26:45,100 --> 00:26:47,060
Most of them need this query, right?

439
00:26:47,220 --> 00:26:48,740
In this case, you probably...

440
00:26:49,940 --> 00:26:51,420
You might have this problem.

441
00:26:51,480 --> 00:26:53,500
You might have this observer effect.

442
00:26:54,240 --> 00:26:58,400
And, of course, the solution would
be to start caching them,

443
00:26:58,940 --> 00:26:59,940
probably, right?

444
00:27:00,100 --> 00:27:01,020
And so on.

445
00:27:01,400 --> 00:27:03,480
But, this is still an issue.

446
00:27:03,600 --> 00:27:07,120
So, if you have, for example, a
thousand TPS or more of some

447
00:27:07,120 --> 00:27:11,360
primary key lookup, probably you
already might have this problem,

448
00:27:11,740 --> 00:27:15,980
overhead from pg_stat_statements, which
maybe for the whole workload which

449
00:27:15,980 --> 00:27:21,240
you have is not so big, as we mentioned,
can be 1 or 0.5%.

450
00:27:22,200 --> 00:27:25,680
But if you zoom into this part
of the workload, primary key lookup,

451
00:27:25,680 --> 00:27:30,720
probably there this overhead is
bigger and maybe guys I talked

452
00:27:30,720 --> 00:27:33,900
to, maybe they are right and maybe
sampling would be a good solution.

453
00:27:34,860 --> 00:27:38,200
Maybe pg_stat_statements could guess,
oh, this is high frequency

454
00:27:38,300 --> 00:27:42,840
query, a lot of QPS are happening
here, the calls count is high,

455
00:27:42,980 --> 00:27:46,980
maybe I just need to start sampling,
Maybe, I don't know.
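
To check whether your own workload has this shape, a rough query against pg_stat_statements (column names as in Postgres 13+):

    # queries with the highest call counts; tiny mean times plus huge call counts
    # are exactly the high-frequency pattern discussed here
    psql -c "SELECT queryid, calls, round(mean_exec_time::numeric, 3) AS mean_ms,
                    left(query, 60) AS query
             FROM pg_stat_statements
             ORDER BY calls DESC
             LIMIT 10;"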

456
00:27:47,900 --> 00:27:49,060
It's an interesting question.

457
00:27:49,540 --> 00:27:52,980
And of course, another interesting
question was why Intel?

458
00:27:54,280 --> 00:27:56,020
And I don't have an answer yet.

459
00:27:56,820 --> 00:28:01,360
We see these flame graphs, we realize,
okay, this code is running

460
00:28:01,360 --> 00:28:03,300
much longer on Intel than on AMD.

461
00:28:04,080 --> 00:28:08,640
Right now there is an idea to explore
older Intels, Cascade Lake,

462
00:28:08,640 --> 00:28:15,580
and maybe even older Xeons, which
may be used much more in production

463
00:28:15,600 --> 00:28:16,100
systems.

464
00:28:16,780 --> 00:28:22,400
And maybe also older EPYCs, third
and second generation maybe.

465
00:28:23,680 --> 00:28:28,440
We also have an issue, not related,
but we also observe an issue

466
00:28:28,440 --> 00:28:33,120
with, we discussed it, lightweight
lock lock manager contention,

467
00:28:34,300 --> 00:28:37,060
there AMD behaves worse than Intel.

468
00:28:37,200 --> 00:28:39,340
So it's kind of interesting.

469
00:28:39,800 --> 00:28:44,820
But what we need to understand,
if a lot of sessions run the

470
00:28:44,820 --> 00:28:48,360
same query, it can be bad in terms
of lock manager.

471
00:28:49,160 --> 00:28:53,000
So the solution would be to get rid
of planning time using prepared

472
00:28:53,000 --> 00:28:54,360
statements, this is ideal.

473
00:28:54,860 --> 00:28:57,720
Or just reduce frequency and you
won't notice this, right?

474
00:28:57,720 --> 00:29:03,760
Or make sure fast path is always
true, which means that you have

475
00:29:03,760 --> 00:29:07,880
only a few indexes and partition
pruning works in plans.
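
A minimal sketch of both checks mentioned here, with illustrative names: re-running the same select-only test with prepared statements, and seeing how many locks currently use the fast path:

    # same select-only workload, but with prepared statements (-M prepared)
    pgbench -S -M prepared -c 100 -j 100 -P 10 -T 60 testdb
    # how many locks are currently taken via the fast path vs the main lock manager
    psql -c "SELECT fastpath, count(*) FROM pg_locks GROUP BY 1;"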

476
00:29:08,320 --> 00:29:11,600
And also, you have an observer effect
from pg_stat_statements

477
00:29:11,600 --> 00:29:12,460
in this case.

478
00:29:13,940 --> 00:29:15,600
And second, Intel versus AMD.

479
00:29:18,960 --> 00:29:22,900
I don't have answers here, but
it's interesting to just dig into

480
00:29:22,900 --> 00:29:24,260
it and understand it.

481
00:29:25,920 --> 00:29:28,680
Michael: The thing I've heard you
say multiple times in the past

482
00:29:28,680 --> 00:29:31,720
is when you're doing this kind
of work, or when you're looking

483
00:29:31,720 --> 00:29:34,820
at your system, it's trying to
understand where the bottleneck

484
00:29:34,820 --> 00:29:35,140
is.

485
00:29:35,140 --> 00:29:37,480
Something I really like about this
piece of work that you've

486
00:29:37,480 --> 00:29:42,080
done is you first identified that
pg_stat_kcache was a bottleneck

487
00:29:42,980 --> 00:29:46,720
and then switched that off and
then tried to work out what the

488
00:29:46,720 --> 00:29:50,820
next bottleneck was, looks like
it actually might be pg_stat_statements,

489
00:29:51,360 --> 00:29:55,240
even though maybe in your head
you were thinking, it probably

490
00:29:55,240 --> 00:29:58,480
isn't, but let's try turning it
off and see if that makes a difference.

491
00:29:58,480 --> 00:29:59,720
That made a big difference.

492
00:30:00,040 --> 00:30:03,520
So it's each time, like even when
you're thinking about maybe

493
00:30:03,520 --> 00:30:07,960
trying to get above that 2.5 million
or maybe 3 million, we have

494
00:30:07,960 --> 00:30:10,160
to work out what's the current
bottleneck.

495
00:30:10,160 --> 00:30:11,700
Like that's how we...

496
00:30:12,740 --> 00:30:13,580
Yeah, exactly.

497
00:30:14,020 --> 00:30:17,220
So that's a really nice thing that
I don't see enough people

498
00:30:17,220 --> 00:30:19,840
thinking about on their own workloads,
but also when they're

499
00:30:19,840 --> 00:30:22,740
benchmarking, what are we currently
limited by?

500
00:30:23,800 --> 00:30:24,560
Nikolay: Right, right, right.

501
00:30:24,560 --> 00:30:32,060
So when we perform any performance
research, like benchmarks,

502
00:30:32,360 --> 00:30:35,880
analysis, root cause analysis after
some incidents on production,

503
00:30:36,060 --> 00:30:40,380
or we try to reproduce problems,
so we perform, like all these

504
00:30:41,040 --> 00:30:44,200
database systems is a complex thing,
and workload usually is

505
00:30:44,200 --> 00:30:44,940
quite complex.

506
00:30:45,420 --> 00:30:49,960
So to study the whole, we need
to apply this, like I mentioned

507
00:30:50,140 --> 00:30:55,640
before we had this call, I mentioned
René Descartes or how to

508
00:30:55,640 --> 00:30:56,820
pronounce it in English.

509
00:30:57,440 --> 00:30:59,360
Michael: I think that's, well,
he's French, right?

510
00:30:59,880 --> 00:31:01,220
Nikolay: He's French, yes.

511
00:31:01,620 --> 00:31:02,120
Michael: Yeah.

512
00:31:02,640 --> 00:31:03,740
I like René Descartes.

513
00:31:04,800 --> 00:31:05,300
Right.

514
00:31:05,440 --> 00:31:06,680
So the idea

515
00:31:06,680 --> 00:31:12,340
Nikolay: is we need to properly
split the whole into segments and

516
00:31:12,340 --> 00:31:15,120
then try to study each segment
separately.

517
00:31:15,580 --> 00:31:20,960
When you study each segment separately
and know how each of it

518
00:31:21,360 --> 00:31:21,860
behaves.

519
00:31:21,960 --> 00:31:25,580
For example, okay, there is a high frequency
select.

520
00:31:25,600 --> 00:31:28,360
Let's study how it behaves without
anything else.

521
00:31:29,040 --> 00:31:32,300
We know how it usually behaves
without anything else, in like

522
00:31:33,060 --> 00:31:34,480
an emptiness, right?

523
00:31:34,940 --> 00:31:38,000
By the way, when we study it, we
also can divide it into smaller

524
00:31:38,000 --> 00:31:38,360
pieces.

525
00:31:38,360 --> 00:31:42,600
For example, okay, let's remove
this, let's remove that extension.

526
00:31:43,700 --> 00:31:47,140
So go deeper, deeper, so basically
minimal pieces.

527
00:31:47,140 --> 00:31:49,200
We start, it takes time, of course,
right?

528
00:31:49,200 --> 00:31:53,360
But then we know small pieces,
how they behave, we can try to

529
00:31:53,360 --> 00:31:58,140
compose it back to a complex workload
and study an ensemble

530
00:31:58,140 --> 00:32:00,580
of it as a whole.

531
00:32:00,940 --> 00:32:04,920
This is a regular scientific approach,
I think one of the oldest

532
00:32:04,920 --> 00:32:09,780
ones, but we must do it here, and
I cannot agree with PlanetScale

533
00:32:10,080 --> 00:32:10,580
CEO.

534
00:32:11,400 --> 00:32:16,440
It's not normal, it's not, but
we study it because it's present

535
00:32:16,980 --> 00:32:19,540
in our complex production workloads,
right?

536
00:32:19,700 --> 00:32:23,860
For example, primary key lookups
with 1,000 or more QPS.

537
00:32:24,620 --> 00:32:25,820
It's not uncommon.

538
00:32:27,620 --> 00:32:28,120
Michael: Yeah.

539
00:32:28,260 --> 00:32:33,720
Well, but to be clear, I'm still
a big fan of encouraging every

540
00:32:33,720 --> 00:32:37,680
workload I've ever seen, or every
setup I've ever seen, especially

541
00:32:37,680 --> 00:32:40,240
we're talking about SaaS companies,
like that kind of thing,

542
00:32:40,240 --> 00:32:43,860
I would still encourage them to
have pg_stat_statements, unless

543
00:32:43,860 --> 00:32:48,220
they've done some testing that
somehow is affecting their workload.

544
00:32:48,820 --> 00:32:49,460
Nikolay: I agree.

545
00:32:51,100 --> 00:32:55,240
Even if it was 30%, I would say
in many cases we still need to

546
00:32:55,240 --> 00:32:55,760
have it.

547
00:32:55,760 --> 00:32:57,660
Because without it, we are blind.

548
00:32:58,780 --> 00:32:59,280
Michael: Yeah.

549
00:32:59,440 --> 00:33:03,400
But if it was 30%, if we did find
that out, it would be cool

550
00:33:03,400 --> 00:33:04,300
to have sampling.

551
00:33:04,440 --> 00:33:07,660
If we then took 1 in 10, we could
reduce that maybe to 3 percent.

552
00:33:07,660 --> 00:33:09,940
Also, maybe it's not quite linear,
but you know.

553
00:33:09,940 --> 00:33:11,980
Nikolay: But sampling here should
be smart.

554
00:33:11,980 --> 00:33:18,060
It should be applied only to high
frequency queries. So anyway,

555
00:33:18,060 --> 00:33:20,360
I think we're almost out of time.

556
00:33:20,640 --> 00:33:21,220
We are.

557
00:33:21,220 --> 00:33:24,100
The bottom line, yeah, the bottom
line, check...

558
00:33:24,480 --> 00:33:29,180
I think we should also check ARM
platform as well and see how

559
00:33:29,180 --> 00:33:29,840
it behaves.

560
00:33:30,140 --> 00:33:36,700
So there is a difference in pg_stat_statements
behavior on regular

561
00:33:36,700 --> 00:33:41,680
queries versus regular slow queries
like updates or deletes versus

562
00:33:42,040 --> 00:33:45,580
high-frequency, very fast selects
like primary key lookups, and

563
00:33:45,580 --> 00:33:47,920
there is a difference between AMD
and Intel.

564
00:33:48,540 --> 00:33:50,580
So this is the bottom line.

565
00:33:51,020 --> 00:33:55,520
And I'm excited to see the results
of our further investigation of

566
00:33:55,520 --> 00:33:56,280
what's happening.

567
00:33:56,280 --> 00:33:58,360
We plan to understand details here.

568
00:33:59,680 --> 00:34:00,400
Michael: Nice one.

569
00:34:01,460 --> 00:34:02,560
Thanks so much, Nikolay.

570
00:34:02,780 --> 00:34:05,220
Thank you everyone for listening
and catch you soon.