1
00:00:00,060 --> 00:00:01,900
Nikolay: Hello, hello, this is
PostgresFM.

2
00:00:02,020 --> 00:00:03,980
I'm Nikolay from Postgres.AI.

3
00:00:04,000 --> 00:00:08,240
And as usual, my co-host is Michael
from pgMustard.

4
00:00:08,360 --> 00:00:09,100
Hi, Michael.

5
00:00:09,480 --> 00:00:10,380
Hello, Nikolay.

6
00:00:11,360 --> 00:00:14,240
So I chose a very boring topic.

7
00:00:16,120 --> 00:00:20,860
And we talked about it a lot. I
hope this is the last time we

8
00:00:20,860 --> 00:00:26,780
are going to talk about it. Maybe,
right? BUFFERS, or pages, or blocks,

9
00:00:26,780 --> 00:00:32,860
or how to name it, depending
on the perspective you have, right?

10
00:00:32,860 --> 00:00:38,340
But BUFFERS episode 3, which
title are you going to choose?

11
00:00:38,400 --> 00:00:41,040
Michael: Oh, it's like the third
in a trilogy, isn't it?

12
00:00:41,040 --> 00:00:43,860
Maybe it's like Return of the BUFFERS
or something, I don't know.

13
00:00:43,860 --> 00:00:48,480
Nikolay: Right, but we have a few
reasons to revisit this topic

14
00:00:48,520 --> 00:00:49,020
today.

15
00:00:49,300 --> 00:00:54,400
Of course, this topic has been
very close to my heart.

16
00:00:55,380 --> 00:00:59,440
For a couple of years, we discussed
how important it is to consider

17
00:00:59,440 --> 00:01:04,500
BUFFERS when optimizing queries
at micro-level, when we talk

18
00:01:04,500 --> 00:01:08,700
about EXPLAIN ANALYZE, or at macro-level when we talk

19
00:01:08,700 --> 00:01:12,580
about pg_stat_statements and other
extensions, and aggregate

20
00:01:13,260 --> 00:01:17,260
everything and see parts of workload
or the whole workload.

21
00:01:18,700 --> 00:01:23,580
And recently, finally, as we already
mentioned a few times, BUFFERS

22
00:01:24,240 --> 00:01:29,720
was added to EXPLAIN ANALYZE
by default after a few unsuccessful

23
00:01:30,060 --> 00:01:32,220
attempts from previous years.

24
00:01:32,900 --> 00:01:34,840
I think it was David Rowley, right?

25
00:01:35,380 --> 00:01:38,760
Michael: Yes, and Guillaume Lelarge
played a big part as well.

26
00:01:39,060 --> 00:01:45,320
Nikolay: Right, and all previous
hackers also who made even unsuccessful

27
00:01:45,600 --> 00:01:48,060
attempts made some shift in this
direction.

28
00:01:48,060 --> 00:01:52,500
Of course, it's really a small
technical change.

29
00:01:53,100 --> 00:01:57,020
So now you don't need to write
the word BUFFERS.

30
00:01:57,040 --> 00:01:59,280
You just write EXPLAIN ANALYZE.

31
00:01:59,800 --> 00:02:03,460
And as a reminder, EXPLAIN just
provides a plan for the query,

32
00:02:03,680 --> 00:02:08,160
which the planner, the optimizer,
is going to choose.

33
00:02:08,160 --> 00:02:12,560
And then EXPLAIN ANALYZE
provides both the plan and actual

34
00:02:12,560 --> 00:02:13,060
execution.

35
00:02:13,660 --> 00:02:17,800
It actually executes the query,
so you see more data.

36
00:02:18,740 --> 00:02:23,900
And by default, buffer information,
buffer hits, reads, BUFFERS,

37
00:02:24,060 --> 00:02:27,080
dirtied and written, was not
present in the plan.

38
00:02:27,900 --> 00:02:33,480
And that was a bummer for us who
dealt with queries at scale,

39
00:02:33,740 --> 00:02:34,120
right?

40
00:02:34,120 --> 00:02:38,640
Because this is important information,
as we discussed, and

41
00:02:38,640 --> 00:02:40,380
we will discuss again, I think.

42
00:02:40,940 --> 00:02:46,020
But now in Postgres 18, you don't
need to write BUFFERS explicitly.

43
00:02:46,240 --> 00:02:48,460
This information will be included
by default.

44
00:02:49,320 --> 00:02:51,020
And that's great, that's great.

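The change being described could be sketched like this (the table and query are hypothetical; the "Buffers:" line follows EXPLAIN's usual output shape):

```sql
-- Postgres 17 and older: buffer counters must be requested explicitly.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE id = 1;

-- Postgres 18: plain EXPLAIN ANALYZE is expected to include them by default,
-- showing lines like "Buffers: shared hit=3 read=1" under plan nodes.
EXPLAIN ANALYZE SELECT * FROM users WHERE id = 1;
```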
45
00:02:51,020 --> 00:02:56,120
I think it's a super important
small change which will affect

46
00:02:56,120 --> 00:02:58,220
the lives of many engineers.

47
00:02:59,100 --> 00:02:59,440
Because...

48
00:02:59,440 --> 00:03:02,860
Michael: Yeah, I think it's just worth noting,
I know you've alluded to this

49
00:03:02,860 --> 00:03:06,600
already, that this is in an upcoming
version, Postgres 18.

50
00:03:06,940 --> 00:03:10,240
It's been committed, and things
that get committed have a very

51
00:03:10,240 --> 00:03:12,480
good chance of making it to the final version.

52
00:03:13,140 --> 00:03:15,620
But things do get reverted.

53
00:03:15,620 --> 00:03:18,840
So there's a small chance that we'll still have to talk about

54
00:03:18,840 --> 00:03:23,060
this more in the future. But yeah, we already had the ability to request

55
00:03:23,100 --> 00:03:26,280
BUFFERS; we could already add EXPLAIN (ANALYZE, BUFFERS) and other

56
00:03:26,280 --> 00:03:31,260
parameters. But a lot of the time, when people are asking for help

57
00:03:31,260 --> 00:03:35,180
online, or when people set up auto_explain (although I see BUFFERS

58
00:03:35,180 --> 00:03:37,445
more in auto_explain, I don't know what your experience is), but

59
00:03:37,445 --> 00:03:38,560
yeah, when people are

62
00:03:38,560 --> 00:03:41,620
asking for help or people are writing blog posts or people are

63
00:03:41,880 --> 00:03:45,960
just getting familiar with Postgres, they'll come across

64
00:03:45,960 --> 00:03:51,340
EXPLAIN ANALYZE and then not know that BUFFERS exists or not use it when

65
00:03:51,340 --> 00:03:54,140
it would actually have shown the issue. And so there's a bunch

66
00:03:54,140 --> 00:03:57,620
of cases where, by having it on by default, we'll be able to help

67
00:03:57,620 --> 00:04:00,020
more people online when they first have an issue without having

68
00:04:00,020 --> 00:04:03,080
to ask them to go back and get another query plan or it will

69
00:04:03,080 --> 00:04:05,740
just be more likely to be in, well it will be in those query

70
00:04:05,740 --> 00:04:08,980
plans by default if they're on Postgres 18 or newer.

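As an aside, the auto_explain setup being referenced typically looks something like this (a configuration sketch; the threshold value is an illustrative assumption, not a recommendation from the episode):

```ini
# postgresql.conf sketch; auto_explain must be preloaded to take effect
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '500ms'  # log plans of queries slower than this
auto_explain.log_analyze = on            # include run-time statistics
auto_explain.log_buffers = on            # include buffer counters (needs log_analyze)
```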
71
00:04:09,520 --> 00:04:10,620
Nikolay: Yeah, yeah.

72
00:04:10,680 --> 00:04:14,600
And I know you also have an opinion, and I agree with you, that some

73
00:04:14,600 --> 00:04:17,800
other things should be included if we provide something like

74
00:04:17,800 --> 00:04:19,720
ALL or VERBOSE.

75
00:04:19,860 --> 00:04:22,900
There is VERBOSE, but it doesn't include everything, and this is

76
00:04:22,900 --> 00:04:26,300
slightly strange and unexpected if you don't know the details.

77
00:04:26,940 --> 00:04:29,080
But let's focus only on BUFFERS.

78
00:04:30,520 --> 00:04:33,220
It's really hard for me to imagine it will be reverted.

79
00:04:34,740 --> 00:04:36,860
It might be, but I think, yeah.

80
00:04:37,740 --> 00:04:40,640
Michael: I could just imagine this aging really, you know, aging

81
00:04:40,640 --> 00:04:43,440
like milk, we say sometimes, just in case.

82
00:04:43,440 --> 00:04:45,860
I just wanted to make people aware that might have happened.

83
00:04:45,920 --> 00:04:49,180
Nikolay: Yeah, so we talk about Postgres 18, which is going to

84
00:04:49,180 --> 00:04:54,060
be released this fall, in many months, and it will hit production

85
00:04:54,160 --> 00:04:59,340
only in a couple of years, on average, maybe, or maybe even more.

86
00:04:59,340 --> 00:05:02,360
So it's still far in the future, in the distant future.

87
00:05:02,780 --> 00:05:07,680
But I wanted to just think in advance, since we are both building

88
00:05:07,680 --> 00:05:13,320
some tools and methodologies and everything, I wanted to think,

89
00:05:13,780 --> 00:05:19,180
is it going to have some, I think positive, effect on

90
00:05:20,140 --> 00:05:22,720
how we are going to optimize queries in the future?

91
00:05:23,680 --> 00:05:29,060
For example, defaults matter a lot at scale, and I learned it

92
00:05:29,060 --> 00:05:31,120
building 3 social networks in the past.

93
00:05:33,460 --> 00:05:38,440
The simple lesson I remember: some spammer, on behalf
94
00:05:38,440 --> 00:05:43,380
of the administration, was sending the message: send a text message
95
00:05:43,380 --> 00:05:45,940
to this number, without any reasoning.

96
00:05:46,120 --> 00:05:50,420
And if you send it to 1,000,000 people, there will be some people

97
00:05:50,420 --> 00:05:51,300
who will follow.

98
00:05:51,760 --> 00:05:53,460
Because, you know, like, why not?

99
00:05:54,440 --> 00:05:57,260
Of course, if you add some reasoning, it will work even better.

100
00:05:57,500 --> 00:05:59,200
So defaults matter a lot.

101
00:05:59,540 --> 00:06:04,200
And we can put everywhere words
like don't forget BUFFERS, don't

102
00:06:04,200 --> 00:06:08,740
forget BUFFERS, but if the default
behavior remains

103
00:06:08,900 --> 00:06:13,780
without BUFFERS, there will be
a lot of people who will not use

104
00:06:13,780 --> 00:06:14,180
it, right?

105
00:06:14,180 --> 00:06:18,480
And this is what we see. If you
go to explain.depesz.com, it has

106
00:06:18,480 --> 00:06:18,980
history.

107
00:06:19,640 --> 00:06:24,660
And, of course, actually, even
explain.depesz.com has advocated

108
00:06:24,720 --> 00:06:28,340
for BUFFERS since not long ago,
which is great.

109
00:06:28,660 --> 00:06:32,460
But if you just check history,
you see a lot of plans provided

110
00:06:32,480 --> 00:06:33,780
without BUFFERS, right?

111
00:06:34,080 --> 00:06:34,820
Because defaults.

112
00:06:35,240 --> 00:06:41,940
Now we have good reason to expect
that after a couple of years

113
00:06:41,940 --> 00:06:44,880
we will see the majority of plans with
BUFFERS.

114
00:06:45,920 --> 00:06:46,720
Do you agree?

115
00:06:47,220 --> 00:06:51,860
Michael: Yes, maybe not in full,
but, like, definitely

116
00:06:51,860 --> 00:06:55,340
in 5 years' time, maybe in 2, 3, 4 years
it will become the majority.

117
00:06:55,340 --> 00:06:59,060
Nikolay: Well, let's say if it's
version 18 plus, these plans,

118
00:06:59,060 --> 00:07:04,000
again, 99% or even almost 100%
of them will include BUFFERS,

119
00:07:04,000 --> 00:07:07,660
because only a few people will bother
turning it off.

120
00:07:07,820 --> 00:07:10,740
Michael: Oh, I meant actually people
still running very old versions.

121
00:07:10,960 --> 00:07:13,000
Yeah, that was my understanding.

122
00:07:13,320 --> 00:07:16,240
Nikolay: But imagine we're already
thinking about people with

123
00:07:16,240 --> 00:07:18,940
new versions and I think we will
have BUFFERS.

124
00:07:19,400 --> 00:07:21,840
What does it mean for analysis?

125
00:07:23,000 --> 00:07:26,980
Are we going to highlight this
information or use this information

126
00:07:27,400 --> 00:07:28,660
even more intensively?

127
00:07:29,600 --> 00:07:30,460
What do you think?

128
00:07:30,780 --> 00:07:31,860
Michael: Well, good question.

129
00:07:32,140 --> 00:07:33,740
I think people will ask...

130
00:07:34,460 --> 00:07:36,880
People already have lots of questions
when looking at

131
00:07:36,880 --> 00:07:39,960
EXPLAIN ANALYZE output for the first time,
and people are like, what does

132
00:07:39,960 --> 00:07:43,140
this statistic mean, what does that
statistic mean? And I think

133
00:07:43,140 --> 00:07:46,620
BUFFERS are another kind of 1 that
at first people are a little

134
00:07:46,620 --> 00:07:49,080
bit confused by what does it mean?

135
00:07:49,080 --> 00:07:50,500
Like what is that number?

136
00:07:50,500 --> 00:07:51,160
Is that a lot?

137
00:07:51,160 --> 00:07:52,320
Is it not a lot?

138
00:07:52,440 --> 00:07:53,880
What unit is it in?

139
00:07:53,880 --> 00:07:55,180
All these kind of questions.

140
00:07:55,460 --> 00:07:59,260
So I suspect it will immediately
in people's heads have a bit

141
00:07:59,260 --> 00:08:00,340
of extra confusion.

142
00:08:00,780 --> 00:08:04,680
But once they're familiar, I can't
see how it isn't helpful.

143
00:08:04,960 --> 00:08:08,480
It's just going to be so helpful,
especially once people start,

144
00:08:08,480 --> 00:08:12,780
like, people and tools start converting
that into data volume

145
00:08:12,780 --> 00:08:13,280
sizes.

146
00:08:13,340 --> 00:08:18,960
I know it's imperfect, but let's
accept the imperfection:

147
00:08:18,960 --> 00:08:23,860
each buffer being a page, and

148
00:08:24,360 --> 00:08:27,500
multiplying the count by 8 kilobytes,
you get an idea of the data volume

149
00:08:27,500 --> 00:08:28,180
being read.

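The conversion Michael describes can be sketched in a few lines (my own illustration, not a tool from the episode), assuming the default 8 kB Postgres block size:

```python
# Converting EXPLAIN (ANALYZE, BUFFERS) counters into human-readable
# data volumes. BLOCK_SIZE assumes the default build; check SHOW block_size.
BLOCK_SIZE = 8192  # bytes per buffer/page

def buffers_to_human(buffers: int) -> str:
    """Convert a buffer count into a human-readable volume."""
    n = buffers * BLOCK_SIZE
    for unit in ("bytes", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n:.0f} {unit}" if unit == "bytes" else f"{n:.2f} {unit}"
        n /= 1024
    return f"{n:.2f} PiB"

print(buffers_to_human(100))     # 100 buffer hits -> 800.00 KiB
print(buffers_to_human(131072))  # -> 1.00 GiB
```

Keep in mind the caveat discussed later in the episode: these are operation counts, so the same 8 kB buffer hit repeatedly is counted every time.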
150
00:08:28,260 --> 00:08:33,040
When people see a query and notice
it's reading a gigabyte of

151
00:08:33,040 --> 00:08:37,080
data, even if it's from shared
memory, when it's not using an

152
00:08:37,080 --> 00:08:39,640
index or when it's using a specific
index, and then it's only

153
00:08:39,640 --> 00:08:42,800
using a few kilobytes when it's
using a different index, suddenly

154
00:08:42,800 --> 00:08:47,080
they realize the index is helping
speed things up because it's

155
00:08:47,080 --> 00:08:50,840
more efficient because Postgres
is only having to read far fewer

156
00:08:51,720 --> 00:08:55,400
blocks, like much less in order
to serve that query.

157
00:08:55,440 --> 00:09:00,260
I think it becomes a lot more intuitive
that indexes help because

158
00:09:01,100 --> 00:09:03,440
they're a more efficient way of
running certain queries, not

159
00:09:03,440 --> 00:09:04,140
that they're magic.

160
00:09:04,180 --> 00:09:10,760
And I think, currently not
amongst Postgres folks, but

161
00:09:10,760 --> 00:09:14,120
definitely in the developer community
as a whole, there's this

162
00:09:14,120 --> 00:09:15,960
perception that indexes are magic.

163
00:09:16,160 --> 00:09:20,420
And I think BUFFERS being present
is like a good step towards:

164
00:09:20,540 --> 00:09:23,860
no, they're not magic, they're
a very clever idea

165
00:09:24,000 --> 00:09:25,600
that makes things really simple.

166
00:09:25,600 --> 00:09:29,760
Nikolay: Yeah, that's a good
point. I was listening to you

167
00:09:29,760 --> 00:09:35,320
and thought I'm contradicting
myself when I say buffer, BUFFERS,

168
00:09:35,920 --> 00:09:38,140
hit and read and written and dirtied.

169
00:09:38,800 --> 00:09:42,560
It's incorrect, it should be operations,
BUFFERS, buffer reads,

170
00:09:42,560 --> 00:09:47,580
hits, writes, and I don't know
how to deal with dirt.

171
00:09:48,740 --> 00:09:49,240
Michael: Dirties?

172
00:09:49,540 --> 00:09:50,340
I don't know.

173
00:09:50,640 --> 00:09:51,240
Nikolay: Yeah, yeah, yeah.

174
00:09:52,260 --> 00:09:53,700
Pieces of dirt, right?

175
00:09:54,440 --> 00:09:56,820
No, no, no, like actions of dirt.

176
00:09:56,820 --> 00:10:00,860
But anyway, like I'm saying, these
are operations. And if you

177
00:10:00,860 --> 00:10:05,680
sum up some numbers, it might be
the same buffer which you hit

178
00:10:05,680 --> 00:10:06,760
multiple times.

179
00:10:07,920 --> 00:10:10,400
Not sure about other operations,
but hit, we definitely have

180
00:10:10,400 --> 00:10:14,870
in the nested loop, we can have
hits for the same buffer many,

181
00:10:14,870 --> 00:10:15,820
many, many times.

182
00:10:16,080 --> 00:10:19,440
But when you say we need to convert,
I totally agree.

183
00:10:19,440 --> 00:10:20,740
I did it for a long time.

184
00:10:20,740 --> 00:10:24,440
I noticed that very long ago: once
you convert to bytes, megabytes,

185
00:10:24,640 --> 00:10:29,600
gigabytes, sometimes terabytes
actually, tebibytes, right?

186
00:10:29,860 --> 00:10:30,560
Gibibytes.

187
00:10:31,060 --> 00:10:34,100
Once you convert it, engineers
have an aha moment.

188
00:10:34,640 --> 00:10:35,140
Always.

189
00:10:35,740 --> 00:10:38,280
I mean, those who see it first
time.

190
00:10:38,360 --> 00:10:43,400
Because 100 BUFFERS, buffer hits,
or reads, 100 buffer reads,

191
00:10:44,060 --> 00:10:45,700
there's nothing to them, actually.

192
00:10:45,700 --> 00:10:46,920
It's what it is.

193
00:10:47,600 --> 00:10:48,100
Hard.

194
00:10:48,340 --> 00:10:55,180
But once you say, okay, here
we read 800 kibibytes to return

195
00:10:56,040 --> 00:10:59,480
a number, like an 8-byte number.

196
00:11:00,660 --> 00:11:03,300
You can feel it already how much...

197
00:11:04,280 --> 00:11:05,040
It's just...

198
00:11:05,580 --> 00:11:09,500
To return 1 byte, you are reading
100 or 800...

199
00:11:09,920 --> 00:11:10,620
How many?

200
00:11:10,900 --> 00:11:11,400
KB.

201
00:11:11,980 --> 00:11:15,520
Each KB is 1024 bytes.

202
00:11:16,420 --> 00:11:20,700
Wow, like it's not very efficient, right?

203
00:11:21,140 --> 00:11:25,040
How about having, I don't know, a hash, or hash table, or something

204
00:11:25,040 --> 00:11:28,580
to find this number? And then: oh, okay, this is why, because

205
00:11:28,580 --> 00:11:29,980
it's a sequential scan.

206
00:11:30,040 --> 00:11:33,020
And this sequential scan is just scanning whole table.

207
00:11:33,480 --> 00:11:37,780
If we remember big O notation, sequential scan has terrible,

208
00:11:38,760 --> 00:11:41,660
terrible performance, because of so many operations.

209
00:11:42,040 --> 00:11:48,120
And it also depends on the size of each tuple, how many tuples

210
00:11:48,120 --> 00:11:53,440
are present in 1 8-kilobyte block, 8-kilobyte buffer, right?

211
00:11:53,560 --> 00:12:00,480
So if there are only a few tuples there, we need to read a lot of data

212
00:12:00,480 --> 00:12:01,980
from memory or from disk.

213
00:12:02,780 --> 00:12:07,320
So once you switch to a B-tree index, it becomes only a few

214
00:12:07,800 --> 00:12:08,660
buffer operations.

215
00:12:08,860 --> 00:12:11,520
Even if you have a billion rows, it will be like 7.

216
00:12:11,520 --> 00:12:12,980
I don't remember exactly, but

217
00:12:12,980 --> 00:12:13,260
Michael: it's

218
00:12:13,260 --> 00:12:16,080
Nikolay: like 7 buffer hits.

219
00:12:16,080 --> 00:12:16,820
That's it.

220
00:12:17,380 --> 00:12:21,280
And even, okay, they are still like 8 kilobytes, it's a lot,

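The "like 7" figure can be sanity-checked with a back-of-the-envelope estimate (my own illustration; the fanout of ~300 entries per 8 kB index page is an assumption, and the real hit count also includes heap pages):

```python
# Why a B-tree lookup touches only a handful of pages even on a huge table:
# depth grows with the logarithm of the row count, base "fanout".
import math

def btree_depth(rows: int, fanout: int = 300) -> int:
    """Approximate number of index pages read to descend a B-tree."""
    return max(1, math.ceil(math.log(rows, fanout)))

# A billion rows: only about 4 index page visits; add root revisits and
# heap pages and you land in the ballpark of the figure quoted above.
print(btree_depth(1_000_000_000))  # -> 4
```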
221
00:12:21,280 --> 00:12:25,200
but it's already much better than if it would be sequential scan,

222
00:12:25,200 --> 00:12:27,040
which is insanely slow in this case.

223
00:12:27,040 --> 00:12:30,880
And you think, okay, that's why it's slow, because we deal with

224
00:12:31,020 --> 00:12:32,420
huge volumes of data.

225
00:12:32,580 --> 00:12:37,480
And when we apply index, we suddenly deal with very low volumes

226
00:12:37,480 --> 00:12:38,180
of data.

227
00:12:38,440 --> 00:12:38,680
Right?

228
00:12:38,680 --> 00:12:42,320
And this is number 1 reason why database becomes fast.

229
00:12:43,260 --> 00:12:44,680
This is how indexing works.

230
00:12:44,680 --> 00:12:47,180
Like, we just reduce the number of I/O

231
00:12:47,520 --> 00:12:48,020
operations.

232
00:12:48,240 --> 00:12:49,200
And when I say I/O

233
00:12:49,200 --> 00:12:55,060
operations, I often include operations with memory, hits.

234
00:12:55,460 --> 00:12:58,700
Because, well, we know operations with memory versus operations

235
00:12:58,700 --> 00:13:01,500
with disk, it's like 1,000 times difference.

236
00:13:01,500 --> 00:13:05,000
It's, of course, a lot, but still, like, we can compare it, and

237
00:13:05,000 --> 00:13:09,980
1 read is roughly like 1000 hits to memory.

238
00:13:10,840 --> 00:13:13,620
Unless this read is also from memory, from the page cache in

239
00:13:13,620 --> 00:13:14,140
this case.

240
00:13:14,140 --> 00:13:15,740
So they're very similar.

241
00:13:16,500 --> 00:13:21,520
So what I'm trying to say, we talked about this reasoning and

242
00:13:21,900 --> 00:13:26,760
I agree with you, we need to convert to bytes and present it.

243
00:13:26,840 --> 00:13:30,940
At the same time, we somehow need to have some remark that

244
00:13:31,080 --> 00:13:36,000
these buffer hits might be to the same buffer.

245
00:13:36,820 --> 00:13:37,700
It's not only...

246
00:13:38,200 --> 00:13:39,260
It's a volume...

247
00:13:39,520 --> 00:13:41,680
It's some abstract volume of data.

248
00:13:42,180 --> 00:13:49,320
We can have a kilobyte or megabyte of data hitting the same 8-kB

249
00:13:50,560 --> 00:13:51,640
buffer, right?

250
00:13:52,280 --> 00:13:53,760
So it's okay.

251
00:13:54,320 --> 00:13:58,480
But comparing volumes of data, it's so good.

252
00:13:58,480 --> 00:14:01,720
Because first of all, you start thinking about data volumes,

253
00:14:01,720 --> 00:14:05,240
which is the number 1 reason for slowness, especially if we forget

254
00:14:05,240 --> 00:14:06,000
about concurrency.

255
00:14:07,080 --> 00:14:08,860
Because concurrency is a different topic.

256
00:14:09,340 --> 00:14:14,380
If we take just 1 query, there are cases when it can be a very

257
00:14:14,380 --> 00:14:17,980
CPU-intensive workload, but in the majority of cases it will

258
00:14:17,980 --> 00:14:18,480
be I/O

259
00:14:18,480 --> 00:14:22,400
intensive workload for 1 query, for databases, right?

260
00:14:23,100 --> 00:14:27,540
And our goal is to help Postgres with indexes, with maybe some

261
00:14:27,540 --> 00:14:33,900
redesign, to help it deal with as few operations with memory

262
00:14:33,900 --> 00:14:37,040
and disk as possible. It means we need to focus on BUFFERS.

263
00:14:38,800 --> 00:14:39,900
Michael: Yeah, are there any...

264
00:14:40,360 --> 00:14:45,180
I think there might be a few other exceptions, but I think they're

265
00:14:45,180 --> 00:14:45,680
more...

266
00:14:45,960 --> 00:14:47,040
Yeah, I don't think that...

267
00:14:47,040 --> 00:14:49,740
Well, maybe it's only 1. I think maybe CPU as well, actually.

268
00:14:49,740 --> 00:14:52,200
I was thinking JIT compilation, that's

269
00:14:52,200 --> 00:14:55,580
Nikolay: Probably. Recently we had a discussion about our

270
00:14:56,040 --> 00:15:04,920
RLS, and there we had inefficient memory work, CPU work, like checking

271
00:15:04,920 --> 00:15:09,140
some volatile function or stable function also, right?

272
00:15:09,140 --> 00:15:14,320
So if it's in a loop for all rows, it's CPU-intensive work.

273
00:15:15,060 --> 00:15:17,440
Michael: Yeah, there have been, I was just trying to think, there

274
00:15:17,440 --> 00:15:20,380
have been a couple of other times I've really not been able to

275
00:15:20,380 --> 00:15:22,280
spot issues by looking at BUFFERS.

276
00:15:23,180 --> 00:15:27,280
But they're not actually, it's not because Postgres isn't reading

277
00:15:27,280 --> 00:15:30,400
data, it's just it's not reporting it in EXPLAIN ANALYZE yet.

278
00:15:30,400 --> 00:15:34,400
So actually, maybe getting it on by default will encourage some

279
00:15:34,400 --> 00:15:36,020
more reporting of BUFFERS as well.

280
00:15:36,020 --> 00:15:41,200
So, for example, I don't think triggers report the BUFFERS read.

281
00:15:41,200 --> 00:15:46,220
Like, if you have, you know, the famous case of not indexing

282
00:15:46,260 --> 00:15:49,300
a foreign key and then having an ON DELETE CASCADE.

283
00:15:49,300 --> 00:15:53,620
Like that trigger might tell you it's taking many, many seconds

284
00:15:53,620 --> 00:15:56,980
because it's having to do a sequential scan of the referencing

285
00:15:57,100 --> 00:16:01,820
table, but no BUFFERS reported as part of that.

286
00:16:01,820 --> 00:16:06,540
Same with in-memory operations, like sorts or hashes.

287
00:16:06,940 --> 00:16:09,240
They could be really slow, even though they're done in memory; maybe you've

288
00:16:09,240 --> 00:16:12,720
got quite a large work_mem, and maybe it's the dominant part of

289
00:16:12,720 --> 00:16:16,120
the query plan, but no BUFFERS reported as part of that.

290
00:16:16,120 --> 00:16:17,120
Because it's in memory.

291
00:16:17,120 --> 00:16:19,840
It will report the BUFFERS if it spills to disk.

292
00:16:21,820 --> 00:16:25,200
For reads of tables and indexes, we get reports even if it's

293
00:16:25,200 --> 00:16:25,760
from memory.

294
00:16:25,760 --> 00:16:29,940
But for operations like sorts and hashes, we don't get.

295
00:16:30,180 --> 00:16:33,780
So there are a few things that are still doing work that it would

296
00:16:33,780 --> 00:16:34,820
be good to get those.

297
00:16:34,820 --> 00:16:37,920
But it doesn't go against what you're saying in principle.

298
00:16:37,920 --> 00:16:42,260
It just means I can't use what Postgres is actually giving me

299
00:16:42,260 --> 00:16:46,560
in EXPLAIN (ANALYZE, BUFFERS) to actually diagnose the issue there.

300
00:16:46,560 --> 00:16:49,360
Nikolay: Yeah, I remember we also touched on the topic that it would

301
00:16:49,360 --> 00:16:51,820
be really great to see details
for BUFFERS.

302
00:16:51,820 --> 00:16:57,180
For example, how many BUFFERS from
heap, from index, maybe distinguish

303
00:16:57,180 --> 00:17:00,280
in each index, some details to
see.

304
00:17:00,860 --> 00:17:06,420
Because if we go back to the original
approach for dealing with

305
00:17:06,420 --> 00:17:10,040
EXPLAIN ANALYZE, and I see it like
even many hackers, blog posts,

306
00:17:10,040 --> 00:17:13,480
provide plans, ANALYZE plans, but
they still don't use BUFFERS.

307
00:17:13,780 --> 00:17:16,580
I hope this will change, of course,
over time.

308
00:17:16,800 --> 00:17:22,080
So usually, okay, we have time,
we try to guess, okay, this time

309
00:17:22,080 --> 00:17:23,200
is lost here, why?

310
00:17:23,200 --> 00:17:24,640
Okay, because sequential scan.

311
00:17:25,000 --> 00:17:28,320
But in terms of data volumes, what
do we have?

312
00:17:28,320 --> 00:17:29,840
Only rows, right?

313
00:17:30,160 --> 00:17:31,900
Expected and actual rows.

314
00:17:33,260 --> 00:17:34,580
Logical rows.

315
00:17:34,740 --> 00:17:40,060
And sometimes we think, okay, we
fetch 1 row here.

316
00:17:40,360 --> 00:17:42,040
Michael: Oh, we get width as well.

317
00:17:42,440 --> 00:17:45,120
Like an average estimated width.

318
00:17:45,400 --> 00:17:46,040
Which is...

319
00:17:46,780 --> 00:17:50,560
Multiplied by rows gives you some idea,
but it's only the width of

320
00:17:50,560 --> 00:17:53,420
what's being returned not the width
of what's being read

321
00:17:53,840 --> 00:17:54,340
Nikolay: Exactly.

322
00:17:54,400 --> 00:17:55,440
This is like, okay.

323
00:17:55,440 --> 00:17:59,540
Yeah, we can multiply width by rows
and get, basically, the number of

324
00:17:59,540 --> 00:18:02,880
bytes. But it will be the logical
level, and the underlying level

325
00:18:02,880 --> 00:18:06,640
might be very different,
drastically different.

326
00:18:06,640 --> 00:18:12,620
For example, if there are a lot
of dead tuples, which Postgres

327
00:18:12,640 --> 00:18:15,120
checked to return this very row,
right?

328
00:18:15,420 --> 00:18:17,620
Maybe it checked a thousand dead
tuples.

329
00:18:18,080 --> 00:18:23,160
And it's hidden if we look only
at rows and timing.

330
00:18:23,160 --> 00:18:26,040
And we have no idea why timing
is so bad, right?

331
00:18:26,040 --> 00:18:29,200
Okay, we returned 1 row in the
index scan, but why so bad?

332
00:18:29,480 --> 00:18:30,180
Here's why.

333
00:18:30,180 --> 00:18:36,960
We go to BUFFERS and we see that
a lot of BUFFERS were hit or

334
00:18:36,960 --> 00:18:37,280
read.

335
00:18:37,280 --> 00:18:38,660
I don't know, it doesn't matter.

336
00:18:38,760 --> 00:18:40,080
That's why timing is so bad.

337
00:18:40,080 --> 00:18:45,600
And then, my point is that it's
already very useful. But imagine

338
00:18:45,600 --> 00:18:50,080
if we saw, oh, actually, from this
particular index, from this

339
00:18:50,080 --> 00:18:53,580
particular table, we got those
operations with BUFFERS.

340
00:18:54,020 --> 00:18:57,660
It means that to return this row,
so many BUFFERS, even with

341
00:18:57,660 --> 00:18:59,100
index scan, it's bad.

342
00:18:59,100 --> 00:19:00,940
Oh, that's why, because it's bloat.

343
00:19:01,120 --> 00:19:06,740
This row had many versions not
cleaned, and the index scan needs

344
00:19:06,740 --> 00:19:11,140
to check the heap table to get version
information.

345
00:19:11,880 --> 00:19:16,560
So it would be useful to, like,
detail BUFFERS, to have detailed

346
00:19:16,560 --> 00:19:17,640
BUFFERS maybe, right?

347
00:19:18,140 --> 00:19:18,660
And I

348
00:19:18,660 --> 00:19:22,260
Michael: had this case recently.
In January, I was doing

349
00:19:22,260 --> 00:19:25,380
some office hours calls just to
get an idea of what kind of issues

350
00:19:25,380 --> 00:19:28,940
people were facing, and I had some
really good conversations, thank

351
00:19:28,940 --> 00:19:33,820
you to everybody that jumped in on those. But 1 of them, they

352
00:19:33,820 --> 00:19:37,480
were using our tool, pgMustard, and 1 of the

353
00:19:37,480 --> 00:19:41,000
tips we show people, we call it read efficiency, is to look for

354
00:19:41,000 --> 00:19:41,820
exactly that.

355
00:19:41,820 --> 00:19:43,520
When are you reading lots of data?

356
00:19:43,520 --> 00:19:48,420
And we actually, for long-term users, used to call it table bloat

357
00:19:48,580 --> 00:19:50,420
potential, or like bloat potential.

358
00:19:50,580 --> 00:19:52,480
So it could be index bloat, could be table bloat.

359
00:19:52,480 --> 00:19:55,960
But we renamed it read efficiency a few years ago because there

360
00:19:55,960 --> 00:19:59,100
are other possible causes of it as well.

361
00:19:59,100 --> 00:20:00,280
So it's a tricky 1.

362
00:20:00,280 --> 00:20:04,120
And so, on this office hours call I mentioned, they

363
00:20:04,120 --> 00:20:06,960
were hitting quite a few read efficiency issues, but they couldn't

364
00:20:06,960 --> 00:20:10,840
diagnose exactly where it was. Like, was it index level, was it

365
00:20:10,840 --> 00:20:13,500
table level? They looked at both, and they didn't seem bloated,

366
00:20:13,500 --> 00:20:17,400
they reindexed, they tried a lot of the things, and they couldn't

367
00:20:17,400 --> 00:20:22,160
reduce it. So it was a really interesting call. But

368
00:20:22,160 --> 00:20:26,040
yeah, my first port of call was, this used to be true

369
00:20:26,040 --> 00:20:29,540
more in older versions of Postgres, but indexes especially

370
00:20:29,540 --> 00:20:33,520
can get extremely bloated so can tables in some cases but yeah

371
00:20:33,520 --> 00:20:36,760
it's amazing that you can suddenly see that by turning on

372
00:20:36,760 --> 00:20:37,260
BUFFERS.

373
00:20:38,480 --> 00:20:41,260
And that is something I wasn't expecting to be true.

374
00:20:41,400 --> 00:20:45,060
Because I see bloat as kind of like more of a system-wide issue,

375
00:20:45,060 --> 00:20:47,900
But to be able to spot it in query analysis was really cool.

376
00:20:47,900 --> 00:20:49,400
Nikolay: Oh, bloat can be very tricky.

377
00:20:49,400 --> 00:20:52,800
Sometimes, with our consulting clients also, we have

378
00:20:52,800 --> 00:20:55,080
a situation where bloat is low, but some query suffers.

379
00:20:55,400 --> 00:20:59,760
And it turns out to be what we usually call local bloat.

380
00:20:59,760 --> 00:21:03,700
So for particular IDs, the situation is super bad.

381
00:21:03,840 --> 00:21:08,680
It can also be not bloat but, you know, sparsely

382
00:21:08,800 --> 00:21:10,620
stored records.

383
00:21:11,140 --> 00:21:14,360
They are distributed among many BUFFERS.

384
00:21:14,640 --> 00:21:20,600
And you expect, okay, I have a very narrow table, only like 4 or

385
00:21:20,600 --> 00:21:21,520
5 columns.

386
00:21:21,740 --> 00:21:25,820
I expect a lot of tuples to fit inside 1 page.

387
00:21:26,040 --> 00:21:31,720
But somehow, reading 1,000 records, rows, I deal with thousands

388
00:21:31,720 --> 00:21:34,070
of buffer operations. What's happening here?

389
00:21:34,070 --> 00:21:40,140
And if we just check CTID, we can understand that each row is

390
00:21:40,140 --> 00:21:43,700
stored in its own page, so we have a lack of data locality

391
00:21:43,780 --> 00:21:44,100
here.

392
00:21:44,100 --> 00:21:49,660
And so clustering with pg_repack without downtime could help, or

393
00:21:49,820 --> 00:21:53,400
partitioning helps usually in such situations and so on.
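The CTID check Nikolay describes can be sketched like this; the table name and filter are hypothetical:

```sql
-- ctid is (page, tuple): the first component is the heap page number.
-- If 25 rows land on ~25 distinct pages, data locality is poor and
-- each row costs a separate buffer operation.
SELECT count(DISTINCT (ctid::text::point)[0]) AS pages_touched,
       count(*)                               AS rows_read
FROM my_narrow_table           -- hypothetical table
WHERE account_id = 42;         -- hypothetical filter
```

If pages_touched is close to rows_read, re-clustering (pg_repack supports an `--order-by` option) or partitioning, as mentioned, restores locality.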

394
00:21:53,400 --> 00:21:54,380
Yeah, yeah, yeah.

395
00:21:54,380 --> 00:21:58,840
So, and BUFFERS is exactly the way to see all these

396
00:21:58,840 --> 00:22:00,120
problems better.

397
00:22:00,220 --> 00:22:04,300
But I agree, I like your idea to talk about read efficiency and

398
00:22:04,300 --> 00:22:06,140
maybe write efficiency as well.

399
00:22:06,420 --> 00:22:08,960
Because for the end user, it's obvious.

400
00:22:09,020 --> 00:22:13,820
Okay, I return this data, I return only 25 rows on my

401
00:22:13,820 --> 00:22:14,320
page.

402
00:22:15,240 --> 00:22:23,080
So I expect not gigabytes of data to be fetched from buffer pool,

403
00:22:23,080 --> 00:22:25,580
or even worse, from disk, right?

404
00:22:25,580 --> 00:22:29,640
Or from page cache, which is between disk and the buffer pool.

405
00:22:29,820 --> 00:22:33,740
So I expect maybe only like some kilobytes, right?

406
00:22:33,740 --> 00:22:35,220
Maybe tens of kilobytes.

407
00:22:36,420 --> 00:22:39,400
Even not megabytes, if it's quite a narrow table.

408
00:22:42,400 --> 00:22:45,780
Postgres should not do a lot of work to show 25 rows.

409
00:22:46,560 --> 00:22:50,020
If it does a lot of work, something is not right and we need

410
00:22:50,020 --> 00:22:51,240
to optimize this query.
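A minimal sketch of this check, with a hypothetical table and query (on versions where BUFFERS is not yet the default, request it explicitly):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM posts                     -- hypothetical table
ORDER BY created_at DESC
LIMIT 25;
-- In the plan, compare "Buffers: shared hit=... read=..." against the
-- 25 rows returned: thousands of 8 kB buffers for 25 rows is a red flag.
```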

411
00:22:51,820 --> 00:22:54,660
And without BUFFERS, we only can guess.

412
00:22:55,440 --> 00:22:57,240
With BUFFERS, we see it, right?

413
00:22:57,660 --> 00:22:59,980
With buffer details, it would be even better, right?

414
00:23:00,060 --> 00:23:05,420
Someday maybe we will have detailed BUFFERS per each database

415
00:23:05,440 --> 00:23:05,940
object.

416
00:23:06,420 --> 00:23:09,740
Some statistics would be interesting, I think, to observe.

417
00:23:12,020 --> 00:23:13,960
And there is criticism of this approach.

418
00:23:13,980 --> 00:23:20,900
I recently had an interesting discussion on Twitter, on X. And

419
00:23:20,900 --> 00:23:25,360
somebody told me that in Oracle, there is the ability to flush caches.

420
00:23:26,440 --> 00:23:26,880
Right.

421
00:23:26,880 --> 00:23:28,980
And in Postgres, it's tricky.

422
00:23:29,040 --> 00:23:33,560
Usually it means we need to restart Postgres to flush the

423
00:23:33,560 --> 00:23:38,140
buffer pool and also to echo 3 blah blah blah to flush Linux

424
00:23:38,140 --> 00:23:43,620
caches, if you want to go hardcore, to check a

425
00:23:43,620 --> 00:23:44,660
very cold state.

426
00:23:45,060 --> 00:23:45,980
And this is interesting.

427
00:23:46,560 --> 00:23:47,740
Michael: I think you can restart.

428
00:23:48,340 --> 00:23:49,180
Nikolay: Restart what?

429
00:23:49,660 --> 00:23:50,220
Michael: The Database.

430
00:23:50,220 --> 00:23:52,800
Like if you're testing locally, for example, or on a clone, if

431
00:23:52,800 --> 00:23:56,080
you restart the Database, doesn't that reset caches?

432
00:23:57,180 --> 00:24:01,680
Nikolay: A restart of Postgres will empty only the buffer

433
00:24:01,680 --> 00:24:02,180
pool.

434
00:24:02,260 --> 00:24:05,780
But if it's Linux, we also have the page cache, right?

435
00:24:05,980 --> 00:24:09,680
We can flush the page cache with a simple command, of course,

436
00:24:09,680 --> 00:24:10,820
if you have sudo.

437
00:24:11,100 --> 00:24:11,600
Michael: Yes,

438
00:24:12,040 --> 00:24:13,300
Nikolay: so it's easy.

439
00:24:13,660 --> 00:24:17,520
I always Google it: echo 3 to some path, and then sync, and

440
00:24:17,520 --> 00:24:18,140
that's it
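The command Nikolay is referring to is roughly this (root required; only for dedicated lab machines, never shared or production hosts):

```shell
# Flush dirty pages to disk first, then drop the page cache plus
# dentries and inodes (writing 1 would drop only the page cache).
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
```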

441
00:24:18,140 --> 00:24:20,840
Michael: If you're on a managed service, what's the best you can

442
00:24:20,840 --> 00:24:21,340
do?

443
00:24:21,500 --> 00:24:23,900
Nikolay: Well, on a managed service, you cannot do this, of course.

444
00:24:24,480 --> 00:24:25,380
Michael: Restart is like, I

445
00:24:25,380 --> 00:24:25,933
Nikolay: think the best

446
00:24:25,933 --> 00:24:26,347
Michael: you can

447
00:24:26,347 --> 00:24:26,553
Nikolay: do.

448
00:24:26,553 --> 00:24:26,760
Reboot.

449
00:24:26,760 --> 00:24:27,660
Reboot, yeah.

450
00:24:27,660 --> 00:24:28,440
Yeah, yeah, reboot.

451
00:24:28,780 --> 00:24:30,000
Well, this is hardcore.

452
00:24:30,480 --> 00:24:32,440
So let's talk about this.

453
00:24:32,440 --> 00:24:40,320
I also see a constant desire to
check the cold state in the Database

454
00:24:40,320 --> 00:24:41,740
Lab we built.

455
00:24:42,120 --> 00:24:46,380
Some users use it for query optimization
with some bot we have,

456
00:24:46,380 --> 00:24:51,000
drawbot, very old stuff, but it
works really well for query optimization

457
00:24:51,660 --> 00:24:53,620
and also like experiments, right?

458
00:24:53,620 --> 00:24:58,940
So you can check on a thin clone,
on a branch, you can check various

459
00:24:58,940 --> 00:25:03,300
ideas. And there people say, okay,
I see the hot state.

460
00:25:03,340 --> 00:25:04,340
Data is already cached.

461
00:25:04,340 --> 00:25:06,000
I want to check the cold state.

462
00:25:06,820 --> 00:25:09,360
And I think it's a natural desire,
but yeah.

463
00:25:09,480 --> 00:25:12,540
Well, every time we say there is
no such easy way.

464
00:25:13,320 --> 00:25:16,240
And it's interesting, I didn't
know this, but in

465
00:25:16,240 --> 00:25:21,960
PostgreSQL 17, in pg_buffercache,
there is a new function, pg_buffercache_evict,

466
00:25:23,680 --> 00:25:25,160
and you can try to use it.
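A sketch of how it can be used (Postgres 17+, superuser; the table name is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Try to evict every shared buffer belonging to one relation.
SELECT count(*) FILTER (WHERE pg_buffercache_evict(bufferid)) AS evicted,
       count(*)                                               AS candidates
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('my_table');  -- hypothetical table
-- Pinned buffers (e.g. held by vacuum) return false and stay cached.
```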

467
00:25:25,160 --> 00:25:28,980
I think we are going to try it
at some point when Postgres 17

468
00:25:28,980 --> 00:25:31,800
is more present on production
systems.

469
00:25:32,440 --> 00:25:36,720
We probably will consider adding
this to make the buffer pool

470
00:25:37,120 --> 00:25:38,040
empty, right?

471
00:25:38,320 --> 00:25:42,980
But unfortunately, this function,
I suspect it won't work really

472
00:25:42,980 --> 00:25:43,480
well.

473
00:25:43,500 --> 00:25:45,800
It's definitely not for production,
first of all, right?

474
00:25:46,220 --> 00:25:49,520
It's for like lab environments
we build.

475
00:25:49,860 --> 00:25:55,020
And unfortunately, per documentation,
it won't evict BUFFERS

476
00:25:55,020 --> 00:25:59,340
which are pinned by, for example,
the autovacuum or vacuum process

477
00:25:59,340 --> 00:26:00,060
and so on.

478
00:26:00,060 --> 00:26:03,080
So in some cases it will fail,
we will need to think how to deal

479
00:26:03,080 --> 00:26:03,780
with it.

480
00:26:04,020 --> 00:26:07,420
Of course, restart is definitely
working in this case.

481
00:26:07,540 --> 00:26:16,440
So if we think about why people
need the cold state at all, what's

482
00:26:16,440 --> 00:26:17,800
your opinion on this?

483
00:26:18,340 --> 00:26:22,060
Michael: Well, I think it's coming
from a good engineering rationale.

484
00:26:22,860 --> 00:26:25,940
I want to make sure the worst case
isn't too bad.

485
00:26:27,440 --> 00:26:30,520
It's definitely, you want good
engineers to be thinking about

486
00:26:30,520 --> 00:26:33,460
edge cases, to be thinking about
the worst possible case.

487
00:26:33,840 --> 00:26:38,580
But I also think that in practice,
the kinds of optimizations

488
00:26:38,860 --> 00:26:43,260
we're doing here, the whole point
of the discussion here is we're

489
00:26:43,260 --> 00:26:48,340
saying try and reduce the amount
of data being read to as minimal

490
00:26:48,340 --> 00:26:48,900
as possible.

491
00:26:48,900 --> 00:26:52,360
If it's an important query, try
and get a very, very good access

492
00:26:52,360 --> 00:26:53,660
path for that query.

493
00:26:54,160 --> 00:26:56,920
That will involve getting the buffer
numbers down as much as

494
00:26:56,920 --> 00:26:59,340
possible, whether they're cached
or not.

495
00:27:00,860 --> 00:27:04,940
Maybe you want to explain to people
in the PR, or you need to

496
00:27:04,940 --> 00:27:07,580
explain to your team, how much
faster has this made it?

497
00:27:07,700 --> 00:27:11,000
So I can understand wanting to
compare cold cache state to cold

498
00:27:11,000 --> 00:27:13,700
cache state, and then having these
kind of relative differences.

499
00:27:14,120 --> 00:27:17,080
But I would encourage going the
opposite route.

500
00:27:17,080 --> 00:27:19,640
It's much easier to work with warm
cache, especially when you're

501
00:27:19,640 --> 00:27:22,380
doing query optimization, you're naturally going to warm up the

502
00:27:22,380 --> 00:27:25,120
cache if you're running the same query over and over again.

503
00:27:25,960 --> 00:27:29,240
So I would encourage more going the opposite direction and say

504
00:27:29,340 --> 00:27:29,840
compare...

505
00:27:33,180 --> 00:27:38,440
Yes, it definitely applies for OLTP, but even for OLAP

506
00:27:38,640 --> 00:27:44,060
queries, I don't see the downside of saying to your team, these,

507
00:27:44,060 --> 00:27:45,660
I mean, it's still EXPLAIN ANALYZE, right?

508
00:27:45,660 --> 00:27:48,740
If you're showing query execution plans before and after, which

509
00:27:48,740 --> 00:27:51,020
is often what people are doing, kind of show this is what it

510
00:27:51,020 --> 00:27:54,660
was, this is what it will be, it's already imperfect in terms

511
00:27:54,660 --> 00:27:56,320
of reporting numbers.

512
00:27:56,320 --> 00:27:59,100
It's not exactly the same as what the client's seeing.

513
00:27:59,260 --> 00:28:04,040
So If we're already imperfect, I don't see the downside of comparing

514
00:28:04,060 --> 00:28:07,080
warm cache to warm cache, so I tend to run things a few times,

515
00:28:07,540 --> 00:28:08,000
get the...

516
00:28:08,000 --> 00:28:10,680
Nikolay: Before optimization and after optimization, we compare

517
00:28:10,680 --> 00:28:14,680
2 cases, both should be warm, and we focus on not...

518
00:28:15,900 --> 00:28:21,040
Even if we still have reads, for example, the buffer pool may be

519
00:28:21,100 --> 00:28:22,540
smaller in a lab environment.

520
00:28:22,700 --> 00:28:26,620
We usually have much smaller buffer pool because we run multiple

521
00:28:27,160 --> 00:28:29,020
Postgres instances on the same machine.

522
00:28:29,540 --> 00:28:30,580
It's a shared environment.

523
00:28:31,100 --> 00:28:34,960
So if you deal with large volumes of data, it might not fit the

524
00:28:34,960 --> 00:28:37,400
buffer pool, so you are going to see reads.

525
00:28:37,480 --> 00:28:41,880
But the idea is let's just summarize reads and hits, maybe even

526
00:28:42,100 --> 00:28:45,220
writes and dirties, how to name it, right?

527
00:28:45,220 --> 00:28:49,080
So all 4 buffer operations, if we just summarize it and use it

528
00:28:49,080 --> 00:28:52,660
as a universal metric, this is now our number of...

529
00:28:52,660 --> 00:28:54,960
Or just maybe reads and hits, depends, right?

530
00:28:54,960 --> 00:29:00,580
So but overall, we just see this is overall volume of data.

531
00:29:01,580 --> 00:29:04,100
And this is what you return, 25 rows.

532
00:29:04,740 --> 00:29:08,800
Michael: I like that a lot, that's exactly what we do, but I get

533
00:29:08,800 --> 00:29:11,820
the sense that what people really want is to compare timings

534
00:29:12,180 --> 00:29:15,940
so they want to say, this query that used to take... Because sometimes

535
00:29:15,940 --> 00:29:17,720
it's coming from a complaint, right, or something.

536
00:29:17,720 --> 00:29:22,760
This query that used to take 20 seconds now is 500 milliseconds.

537
00:29:22,800 --> 00:29:26,520
Or this query that used to be 200 milliseconds is now less than

538
00:29:26,520 --> 00:29:27,180
a millisecond.

539
00:29:28,180 --> 00:29:32,620
So they want this kind of difference in timing, even though behind

540
00:29:32,620 --> 00:29:34,460
the scenes it's a difference in BUFFERS.

541
00:29:34,840 --> 00:29:38,800
So for that, if you want kind of apples to apples comparison,

542
00:29:38,860 --> 00:29:42,880
instead of trying to get cold and be fair on the cold side, I'd

543
00:29:42,880 --> 00:29:46,140
just say it's easier to be fair and do it on the warmer side.

544
00:29:46,520 --> 00:29:50,500
I know it's not perfect, and it might be that your real users

545
00:29:50,500 --> 00:29:53,920
are hitting a cold cache state, but if that's what you want,

546
00:29:53,920 --> 00:29:55,460
that's the direction I would go.

547
00:29:55,557 --> 00:29:59,640
Nikolay: Yeah, well, we have slightly different methodologies

548
00:29:59,760 --> 00:30:00,260
here.

549
00:30:00,300 --> 00:30:04,700
I usually say, okay, we start from the statement, the query is

550
00:30:04,700 --> 00:30:07,120
slow, it takes like 30 seconds or 1 minute.

551
00:30:07,120 --> 00:30:10,820
It's super slow, unacceptable, we want it below 100 milliseconds

552
00:30:10,840 --> 00:30:11,760
or 10 milliseconds.

553
00:30:13,580 --> 00:30:16,760
And then I say, forget about timing for now, completely.

554
00:30:17,320 --> 00:30:18,620
Focus only on BUFFERS.

555
00:30:19,280 --> 00:30:24,140
If we check BUFFERS and see a low volume, 100 BUFFERS, 100 buffer

556
00:30:24,140 --> 00:30:29,520
hits and reads, to return 25 rows, we know our micro-optimization

557
00:30:29,880 --> 00:30:30,600
is over.

558
00:30:31,080 --> 00:30:34,640
We need to go and see the whole picture of what's happening with

559
00:30:34,640 --> 00:30:35,140
concurrency.

560
00:30:35,500 --> 00:30:36,760
Probably we'll have some...

561
00:30:37,480 --> 00:30:41,500
Maybe this query is waiting on lock acquisition for all those seconds.

562
00:30:41,740 --> 00:30:42,740
That's it, right?

563
00:30:42,740 --> 00:30:44,020
So it's already this.

564
00:30:44,340 --> 00:30:48,140
But if we confirm large volumes of data, we forget about time.

565
00:30:48,480 --> 00:30:55,120
We focus only on obtaining sane numbers for BUFFERS.

566
00:30:55,760 --> 00:30:59,440
Sane means, okay, hundreds, thousands, not more.

567
00:31:00,060 --> 00:31:03,580
Even thousands to return 25 rows, it's already quite a lot.

568
00:31:04,340 --> 00:31:05,640
And we must do it.

569
00:31:05,640 --> 00:31:07,860
Sometimes, of course, it's not easy.

570
00:31:07,860 --> 00:31:12,280
Sometimes we need to do denormalization, some complex surgery

571
00:31:12,280 --> 00:31:18,360
on our database involving long-lasting operations with backfilling

572
00:31:18,460 --> 00:31:23,040
and so on, but once we achieve this, in most cases we can have

573
00:31:23,640 --> 00:31:28,220
a sane number of BUFFERS, and then we say, okay, now we can compare

574
00:31:28,220 --> 00:31:28,720
timing.

575
00:31:29,540 --> 00:31:34,920
And comparing timing in a lab environment, it still might differ

576
00:31:34,920 --> 00:31:35,640
from production.

577
00:31:36,580 --> 00:31:37,440
Michael: Yeah, of course.

578
00:31:37,660 --> 00:31:40,400
Nikolay: But of course, I know, I know, I use it as well.

579
00:31:40,580 --> 00:31:46,620
I say, we improved timing a hundred, like 10,000 times.

580
00:31:46,620 --> 00:31:48,020
It happens, right?

581
00:31:48,580 --> 00:31:53,500
And this is good for a final comment, you know, in the end, but only

582
00:31:53,500 --> 00:31:58,220
when we did the work on BUFFERS fully, right? This is me.

583
00:31:58,620 --> 00:32:03,280
Michael: Yeah, given it's you, you could, with your reputation

584
00:32:03,480 --> 00:32:08,500
and your teachings, leave that last line as "we improved

585
00:32:09,120 --> 00:32:12,980
buffer reads", or "read efficiency", for example.

586
00:32:12,980 --> 00:32:16,640
Read efficiency, and it would probably be almost the exact same

587
00:32:16,640 --> 00:32:17,120
number.

588
00:32:17,120 --> 00:32:19,140
Nikolay: Not necessarily, but yeah, maybe.

589
00:32:19,200 --> 00:32:21,360
Michael: You'd be surprised how often it's close.

590
00:32:21,460 --> 00:32:22,800
Nikolay: Okay, good, good, good.

591
00:32:22,800 --> 00:32:23,800
Yeah, that's good.

592
00:32:24,020 --> 00:32:27,320
Maybe I should think about it and also present this information

593
00:32:27,560 --> 00:32:31,400
during consulting activities and in tools as well.

594
00:32:31,400 --> 00:32:32,860
Yeah, I think it's good.

595
00:32:33,180 --> 00:32:35,160
So yeah, read efficiency.

596
00:32:35,280 --> 00:32:40,320
And you take the number of rows and the width and just multiply, get

597
00:32:40,320 --> 00:32:41,980
bytes from there and then...
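As a back-of-the-envelope sketch of that multiplication (all numbers are hypothetical, as if taken from an EXPLAIN (ANALYZE, BUFFERS) output):

```sql
-- Bytes touched vs bytes actually returned to the client.
SELECT 12000 * 8192                       AS bytes_touched,   -- buffers x 8 kB
       25 * 64                            AS bytes_returned,  -- rows x avg width
       round(12000 * 8192 / (25 * 64.0))  AS waste_ratio;     -- ~61,440x
```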

598
00:32:42,440 --> 00:32:46,020
Michael: Yeah, it's a tricky 1, because with different scan types

599
00:32:46,020 --> 00:32:50,200
you expect different amounts of efficiency, so yes, it's slightly

600
00:32:50,200 --> 00:32:52,480
Nikolay: more involved. Also interesting, right?

601
00:32:53,040 --> 00:32:58,760
Michael: Yeah, but I think the main differences are Sequential

602
00:32:58,860 --> 00:33:02,260
Scan, like Sequential Scan is very different from... well, Sequential

603
00:33:02,260 --> 00:33:04,060
Scan and Bitmap Scan have some similarities.

604
00:33:05,060 --> 00:33:08,160
Index Scan and Index Only Scan have some similarities, but then

605
00:33:09,020 --> 00:33:12,660
like looped index scans are somewhat different to like range

606
00:33:12,660 --> 00:33:16,900
index scans in terms of how many tuples, like in a loop for example,

607
00:33:16,900 --> 00:33:21,400
you can't get fewer than the number of loops, like it has to

608
00:33:21,400 --> 00:33:22,620
do at least that many.

609
00:33:23,040 --> 00:33:28,920
So yeah, there's some subtlety, but yeah, that's it essentially.

610
00:33:29,680 --> 00:33:32,640
Nikolay: Yeah. But back to the cold state, I also agree.

611
00:33:32,640 --> 00:33:35,180
I can imagine you would need it.

612
00:33:35,280 --> 00:33:39,440
And unfortunately, it's possible, as I see, only in a lab environment

613
00:33:39,480 --> 00:33:40,640
where you are alone.

614
00:33:40,840 --> 00:33:44,240
Because this is a global operation, to flush the operating system

615
00:33:44,240 --> 00:33:45,420
page cache, for example.

616
00:33:45,900 --> 00:33:47,980
Michael: And when you control the hardware, right?

617
00:33:47,980 --> 00:33:50,760
Or like the, you know, the VMs.

618
00:33:51,280 --> 00:33:54,380
Nikolay: Well, if it's a managed situation, again, restart, as

619
00:33:54,380 --> 00:33:58,100
you said, also an option, but again, this only works if you're

620
00:33:58,100 --> 00:33:59,640
alone, and this is expensive.

621
00:34:00,060 --> 00:34:04,740
We aim to make experiments super, super, super, extremely fast

622
00:34:04,740 --> 00:34:05,780
and cost-efficient.

623
00:34:06,760 --> 00:34:08,720
So it means a managed environment.

624
00:34:08,860 --> 00:34:12,700
Many people work on the same machine, so if 1 decides to flush

625
00:34:12,700 --> 00:34:15,980
everything, others will notice, right?

626
00:34:16,880 --> 00:34:21,540
And with this function, pg_buffercache_evict, I think if we manage

627
00:34:21,540 --> 00:34:25,520
to make it work, I see it's useful.

628
00:34:25,600 --> 00:34:29,600
Even imagine if we have page cache, or in the case of DBLab,

629
00:34:29,620 --> 00:34:32,540
it's the ZFS ARC, which is common.

630
00:34:34,020 --> 00:34:37,440
If 1 block is cached, it's cached for all clones, all branches,

631
00:34:37,440 --> 00:34:39,740
it's very good in terms of efficiency.

632
00:34:40,900 --> 00:34:45,560
But sometimes, indeed, engineers want to check the inefficient situation.

633
00:34:45,600 --> 00:34:50,660
Okay, in this case, I think what we will do: we try to, okay,

634
00:34:50,660 --> 00:34:54,800
if a user requests it for their particular database, particular Postgres

635
00:34:54,800 --> 00:34:59,560
instance running on this machine, we can ask to evict maybe everything

636
00:34:59,600 --> 00:35:03,660
from the buffer pool, or for particular tables and indexes, say everything.

637
00:35:04,220 --> 00:35:09,100
And if we succeeded, there are chances that these buffers are

638
00:35:09,100 --> 00:35:10,940
cached in the underlying cache, right?

639
00:35:10,940 --> 00:35:12,320
Page cache or ARC.

640
00:35:12,980 --> 00:35:17,340
In this case, query timing is not going to be very different,

641
00:35:17,640 --> 00:35:18,940
but plans will be different.

642
00:35:18,940 --> 00:35:21,580
We will see reads instead of hits.

643
00:35:23,420 --> 00:35:26,180
Observing buffer numbers, we will see the difference.

644
00:35:26,280 --> 00:35:27,940
Okay, reads now, not hits.

645
00:35:28,660 --> 00:35:33,060
And what I'm thinking: in this case, we could consider it relatively

646
00:35:33,240 --> 00:35:35,400
cold, this situation.

647
00:35:36,400 --> 00:35:36,900
Michael: Lukewarm.

648
00:35:37,660 --> 00:35:38,360
Nikolay: Yeah, yeah, yeah.

649
00:35:38,360 --> 00:35:44,320
And we could just say, okay, we know that reading from disk is

650
00:35:44,320 --> 00:35:47,180
roughly a thousand times slower than from memory.

651
00:35:48,500 --> 00:35:52,940
So we could do some estimation
of the real cold state.

652
00:35:53,800 --> 00:35:55,420
Michael: Well, or there's that...

653
00:35:55,640 --> 00:35:59,880
How often do you see clients having
track_io_timing on?

654
00:36:00,420 --> 00:36:01,760
Nikolay: Oh, very often.

655
00:36:02,100 --> 00:36:02,600
Michael: Great.

656
00:36:02,800 --> 00:36:04,740
Okay, that's really positive.

657
00:36:04,840 --> 00:36:09,440
But that's because that gives us
an idea of how long those reads

658
00:36:09,600 --> 00:36:11,100
rather than hits took.

659
00:36:11,120 --> 00:36:16,640
And you could do some division
of reads by I/O timing to get an

660
00:36:16,640 --> 00:36:18,560
idea of where they're coming from.
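That kind of estimate can be sketched against pg_stat_statements. It requires track_io_timing = on, and note the column was renamed in Postgres 17 (shared_blk_read_time; before that, blk_read_time):

```sql
-- Average read latency per block: microseconds suggest the OS page
-- cache, milliseconds suggest real disk.
SELECT queryid,
       shared_blks_read,
       round((shared_blk_read_time /
              nullif(shared_blks_read, 0))::numeric, 3) AS avg_ms_per_block
FROM pg_stat_statements
WHERE shared_blks_read > 0
ORDER BY shared_blk_read_time DESC
LIMIT 10;
```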

661
00:36:18,560 --> 00:36:20,520
I think there might also be some
work.

662
00:36:21,180 --> 00:36:23,900
I think I saw something either
in a commitfest entry or a

663
00:36:23,900 --> 00:36:26,680
hackers thread to try and get

664
00:36:26,840 --> 00:36:27,540
Nikolay: the details.

665
00:36:27,980 --> 00:36:32,580
It's like, there's just that cache
for micro level, right?

666
00:36:33,040 --> 00:36:33,540
Yeah.

667
00:36:33,740 --> 00:36:38,080
Well, it's possible if you have
access to everything, it's possible

668
00:36:38,080 --> 00:36:43,820
from /proc, the process number, io,
you can get it from there, I think.

669
00:36:44,340 --> 00:36:47,220
Michael: But it would be good to
have it in Postgres core so

670
00:36:47,220 --> 00:36:49,940
that we could even get it from
managed services, for example.

671
00:36:50,020 --> 00:36:50,460
Nikolay: Yeah, yeah.

672
00:36:50,460 --> 00:36:54,440
Well, there is something that can
be done here to improve observability

673
00:36:54,580 --> 00:36:58,360
of this part, and work with a particular
query, if you have control.

674
00:36:58,440 --> 00:36:58,920
Yeah.

675
00:36:58,920 --> 00:37:02,960
But I'm trying to say, like, even
with this evict function, we

676
00:37:02,960 --> 00:37:07,560
can get some interesting information,
even if we cannot allow

677
00:37:07,720 --> 00:37:12,680
ourselves to reset the whole, everything,
including page cache.

678
00:37:13,140 --> 00:37:16,680
So it already can be useful in
non-production environments to

679
00:37:16,680 --> 00:37:19,660
study the query behavior if it's
a clone.

680
00:37:19,860 --> 00:37:20,240
Yeah, yeah.

681
00:37:20,240 --> 00:37:24,800
So I feel there's a lot of things
that can be improved in this

682
00:37:24,800 --> 00:37:26,440
area to work with queries.

683
00:37:26,440 --> 00:37:30,480
And especially it's important to
continue this work Because I

684
00:37:30,480 --> 00:37:38,800
suspect we will use some tools
to work on this at massive scale.

685
00:37:39,100 --> 00:37:40,640
Like for example, if we fetched
from...

686
00:37:40,640 --> 00:37:44,240
We already know during consulting,
we did it with a few clients.

687
00:37:45,060 --> 00:37:48,840
With auto_explain, we fetched
a lot of queries, hundreds usually,

688
00:37:48,840 --> 00:37:49,840
sometimes thousands.

689
00:37:49,900 --> 00:37:50,640
It's insane.

690
00:37:50,740 --> 00:37:51,010
Nice.

691
00:37:51,010 --> 00:37:52,000
Examples with plans.

692
00:37:52,490 --> 00:37:57,340
And then, well, actually, you know,
because we partnered, in

693
00:37:57,340 --> 00:38:01,920
some cases we use pgMustard to provide
additional insights, which

694
00:38:01,920 --> 00:38:02,580
is great.

695
00:38:02,960 --> 00:38:04,940
And then we have a lot of stuff.

696
00:38:05,140 --> 00:38:10,120
Unfortunately, you know, unfortunately,
I still don't see how

697
00:38:10,120 --> 00:38:11,930
to avoid a human here.

698
00:38:11,930 --> 00:38:16,560
Well, not like, I like to involve
a human there, but I would like

699
00:38:16,560 --> 00:38:22,340
to involve them less, you know. Because
usually... Right now we have a

700
00:38:22,340 --> 00:38:26,820
lot of cases of query analysis
like that and then we need to

701
00:38:26,940 --> 00:38:32,220
draw some common picture from it, right? And this, like going

702
00:38:32,220 --> 00:38:35,580
above all those plans and say, okay, we have a pattern here.

703
00:38:35,580 --> 00:38:39,240
For example, bloat is obvious or something like this, like, or

704
00:38:39,240 --> 00:38:40,820
index health is not good.

705
00:38:41,260 --> 00:38:45,120
And to draw this picture, a human is heavily involved right now.

706
00:38:45,180 --> 00:38:48,660
I'm thinking like over time, probably we will build some additional

707
00:38:51,500 --> 00:38:57,240
macro-level analysis for micro-level cases of hundreds or thousands

708
00:38:57,240 --> 00:38:57,940
of them.

709
00:38:58,620 --> 00:38:59,860
This is going to be great.

710
00:39:00,060 --> 00:39:04,240
And then looking at BUFFERS, and actually we have BUFFERS there

711
00:39:04,240 --> 00:39:04,540
as well.

712
00:39:04,540 --> 00:39:08,220
You were right that in auto_explain, timing is not present.

713
00:39:08,260 --> 00:39:13,100
People are afraid of overhead. BUFFERS also have overhead, but

714
00:39:13,180 --> 00:39:14,020
kind of smaller.

715
00:39:14,020 --> 00:39:14,520
Right.

716
00:39:15,060 --> 00:39:18,360
And yeah, we had articles, you had articles about this as well.

717
00:39:18,480 --> 00:39:18,980
Yeah.

718
00:39:19,000 --> 00:39:23,740
So yeah, and so with cold cache it's tricky, and there is this

719
00:39:23,740 --> 00:39:25,060
function, that's it.

720
00:39:25,680 --> 00:39:27,320
So Postgres 17 plus.

721
00:39:28,040 --> 00:39:28,760
What else?

722
00:39:28,860 --> 00:39:30,400
There is macro level, right?

723
00:39:30,480 --> 00:39:33,900
And at macro level we can talk about pg_stat_statements, blocks,

724
00:39:35,220 --> 00:39:36,240
read, hit.

725
00:39:36,540 --> 00:39:41,540
By the way, there it's, yeah, also naming, we talked about it.

726
00:39:41,540 --> 00:39:45,980
It's also naming like it's data volumes, but it's okay.

727
00:39:46,680 --> 00:39:49,020
And so read, hit, written, dirtied.

728
00:39:50,860 --> 00:39:53,420
For shared buffer blocks.

729
00:39:54,120 --> 00:39:57,400
In pg_stat_statements, also there is pg_stat_kcache if we want

730
00:39:57,400 --> 00:39:59,560
real disk I/O to be observed.

731
00:39:59,760 --> 00:40:03,460
And again, I advocate including pg_stat_kcache in any setup.

732
00:40:03,820 --> 00:40:06,000
It's a great extension.

733
00:40:06,600 --> 00:40:09,520
And there we can talk about some macro level, like, okay, this

734
00:40:09,520 --> 00:40:13,820
query ID has this volume per second, or this volume per query,

735
00:40:15,020 --> 00:40:20,140
or it's responsible for 90% of all hits to the buffer pool happening

736
00:40:20,220 --> 00:40:20,900
in database.

737
00:40:21,400 --> 00:40:26,540
Maybe it doesn't look super slow, but it's so intensive with

738
00:40:26,580 --> 00:40:27,760
the shared buffer pool.

739
00:40:28,220 --> 00:40:29,220
Michael: That's a good point.

740
00:40:29,220 --> 00:40:32,740
I almost always encourage people to look at their top queries

741
00:40:32,840 --> 00:40:34,060
by total time.

742
00:40:36,460 --> 00:40:40,240
Unfortunately, with pg_stat_statements, at least, there's no

743
00:40:40,240 --> 00:40:41,600
kind of total blocks.

744
00:40:42,040 --> 00:40:46,820
Because it's split into these many, many steps, you could order

745
00:40:46,820 --> 00:40:52,860
by shared hit plus shared read and get a good sense of 1 type

746
00:40:53,600 --> 00:40:57,540
or by dirtied for a different one, but yeah, you'd have to sum in

747
00:40:57,660 --> 00:41:01,080
most cases I'm thinking of, to get a good system-wide

748
00:41:01,260 --> 00:41:05,280
aggregation, you need to sum 2 columns, I think, to get a good...

749
00:41:05,280 --> 00:41:06,220
Nikolay: Reads and hits, right?

750
00:41:06,220 --> 00:41:06,720
Michael: Ordering.

751
00:41:06,900 --> 00:41:07,400
Yeah.

752
00:41:08,400 --> 00:41:09,060
Nikolay: Yes, yes.

753
00:41:09,060 --> 00:41:10,460
So we usually...

754
00:41:12,120 --> 00:41:18,100
It's an underestimated approach with
pg_stat_statements to order by

755
00:41:18,280 --> 00:41:19,660
sum of reads and hits.

756
00:41:19,940 --> 00:41:20,440
Michael: Yeah.

757
00:41:20,580 --> 00:41:21,760
Nikolay: It's super important.

758
00:41:22,360 --> 00:41:26,100
This is how you identify I/O-intensive
queries.

759
00:41:26,840 --> 00:41:30,200
And now they can be mostly hits,
but if the database grows, they

760
00:41:30,200 --> 00:41:32,220
can be converted more and more
into reads.

761
00:41:33,180 --> 00:41:36,840
Or vice versa, you're going to
double your memory size and the

762
00:41:36,840 --> 00:41:40,080
buffer pool size, so reads will
be gone.

763
00:41:40,080 --> 00:41:44,380
But just sum them and consider
them together to understand how

764
00:41:44,380 --> 00:41:46,060
I/O-intensive the query is.
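The ordering Nikolay describes can be sketched like this (a minimal example against pg_stat_statements; the 60-character truncation and limit are arbitrary choices):

```sql
-- Rank queries by total buffer traffic (hit + read), i.e. how
-- I/O-intensive each one is regardless of current cache state.
SELECT queryid,
       left(query, 60) AS query,
       calls,
       shared_blks_hit + shared_blks_read AS total_blks,
       round((shared_blks_hit + shared_blks_read) * 8 / 1024.0
             / greatest(calls, 1), 2) AS mib_per_call  -- 8 KiB blocks
FROM pg_stat_statements
ORDER BY total_blks DESC
LIMIT 20;
```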

765
00:41:46,560 --> 00:41:51,000
And think, oh, this query probably
needs optimization, especially

766
00:41:51,040 --> 00:41:54,460
if it returns only 1 row or 0 rows.

767
00:41:55,440 --> 00:42:02,800
So, yeah, in this case, we can
also consider buffer-focused optimization,

768
00:42:03,700 --> 00:42:04,620
which is important.

769
00:42:04,640 --> 00:42:05,780
I think it's underestimated.

770
00:42:06,580 --> 00:42:09,860
But also, it's worth mentioning
this interesting blog post from

771
00:42:09,860 --> 00:42:17,440
Andrei Lepikhov, who gave, I would
say, a hacker-researcher perspective

772
00:42:17,540 --> 00:42:22,620
on this. Actually, I didn't
mention it before, but this is my point

773
00:42:22,620 --> 00:42:26,700
for sure for lab environments we
build. It's important to say:

774
00:42:27,500 --> 00:42:31,280
timing is super volatile, it's unpredictable;
you can have timing

775
00:42:31,820 --> 00:42:38,260
1 today, another tomorrow, 1
on production, another on a clone

776
00:42:38,260 --> 00:42:40,300
of production. Very different timing.

777
00:42:41,600 --> 00:42:45,680
And it would be a big task to
track all the reasons why timing

778
00:42:45,720 --> 00:42:46,820
is unpredictable.

779
00:42:47,900 --> 00:42:51,300
And there is a concept, there is
a simple methodology, if we

780
00:42:51,300 --> 00:42:54,240
deal with a complex problem, let's
split it into pieces and analyze

781
00:42:54,240 --> 00:42:55,220
pieces separately.

782
00:42:55,520 --> 00:42:59,260
So when we take a clone and bring
a single query and optimize

783
00:42:59,260 --> 00:43:04,640
it, we already get rid of concurrency
issues, heavy locks, lightweight

784
00:43:04,640 --> 00:43:07,080
locks, everything, like forgetting
about them.

785
00:43:07,160 --> 00:43:13,380
And we deal with a single query
alone in 1 clone and it's already

786
00:43:13,380 --> 00:43:14,160
super good.

787
00:43:14,600 --> 00:43:20,940
But we still need to have something
reliable in terms of metrics

788
00:43:20,940 --> 00:43:21,620
to optimize.

789
00:43:21,660 --> 00:43:23,980
And timing is not reliable because
it's unpredictable.

790
00:43:24,160 --> 00:43:26,080
It depends on the states of caches.

791
00:43:26,260 --> 00:43:29,440
If it's a shared environment, maybe
CPU and disk are super busy,

792
00:43:29,440 --> 00:43:33,340
and there's noise from neighbors
when we optimize in this case,

793
00:43:33,340 --> 00:43:33,840
right?

794
00:43:34,300 --> 00:43:39,060
But once you focus on BUFFERS inside
optimization, you have an invariant

795
00:43:39,240 --> 00:43:39,740
metric.

796
00:43:40,160 --> 00:43:45,780
Okay, it can be reads versus hits,
but the sum of them will be constant

797
00:43:45,780 --> 00:43:47,120
if you repeat the query.

798
00:43:47,780 --> 00:43:51,100
For this particular clone, for
this particular query, these parameters,

799
00:43:51,300 --> 00:43:52,540
everything is the same.

800
00:43:53,100 --> 00:43:54,620
Plan is the same, right?

801
00:43:54,800 --> 00:43:55,960
So you have the same thing.

802
00:43:55,960 --> 00:43:57,420
It's repeatable, it's reproducible.
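A minimal illustration of that invariance (the table and the numbers here are hypothetical):

```sql
-- Run the same query twice on a clone: timing will differ run to
-- run, but the buffers touched by the same plan stay constant.
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM orders WHERE customer_id = 42;
-- 1st run: Buffers: shared read=1210   (cold cache)
-- 2nd run: Buffers: shared hit=1210    (warm cache)
-- The hit/read split moves around, but hit + read stays the same.
```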

803
00:43:58,480 --> 00:44:01,720
This gives you a foundation for
optimization because without

804
00:44:01,720 --> 00:44:07,620
it you have a very shaky foundation,
not a solid one,

805
00:44:07,660 --> 00:44:08,160
right?

806
00:44:08,600 --> 00:44:09,100
Timing.

807
00:44:09,440 --> 00:44:11,760
So, and then you optimize and you...

808
00:44:11,980 --> 00:44:16,560
I like your idea about optimization,
like, actually I used it.

809
00:44:16,560 --> 00:44:17,020
I used it.

810
00:44:17,020 --> 00:44:19,740
I said, okay, 10,000 times timing.

811
00:44:19,760 --> 00:44:24,020
And I remember I mentioned BUFFERS
as well, but I didn't expect

812
00:44:24,020 --> 00:44:28,020
people would value this comment
a lot.

813
00:44:28,020 --> 00:44:32,620
Maybe with BUFFERS by default,
we should push on this methodology

814
00:44:32,760 --> 00:44:33,580
more and more.

815
00:44:33,740 --> 00:44:38,040
But what Andrei is talking about
in this article is that for

816
00:44:38,440 --> 00:44:41,680
research purposes and so on, instead
of timing, maybe we should

817
00:44:41,680 --> 00:44:43,700
focus only on reads.

818
00:44:45,440 --> 00:44:51,540
And reads is something that could
be a target itself for optimization

819
00:44:52,200 --> 00:44:56,880
and a basis for deciding whether the system
works efficiently or not.

820
00:44:57,180 --> 00:45:00,060
Because we are talking about databases
and I/O is always the number

821
00:45:00,060 --> 00:45:03,360
1 thing in terms of what takes
time.

822
00:45:03,700 --> 00:45:04,800
And I like this idea.

823
00:45:04,800 --> 00:45:05,140
Yes.

824
00:45:05,140 --> 00:45:09,160
So like this is very close to what
I've seen for many years building

825
00:45:09,160 --> 00:45:13,620
lab environments, non-production
environments, which can help

826
00:45:13,620 --> 00:45:18,460
people scale their teams and databases
and processes and so on.

827
00:45:18,460 --> 00:45:22,280
Yeah, so what's your opinion about
this article?

828
00:45:23,440 --> 00:45:24,520
Michael: Yeah, I thought it was
good.

829
00:45:24,520 --> 00:45:28,380
I think this blog's
great, but

830
00:45:28,380 --> 00:45:29,480
it's very hacker-focused.

831
00:45:29,480 --> 00:45:30,700
And I think that's good, right?

832
00:45:30,700 --> 00:45:33,660
Like This is a good topic for hackers
as well.

833
00:45:33,940 --> 00:45:37,580
There are a lot of blog posts I see
from people often explaining

834
00:45:37,600 --> 00:45:40,540
performance issues and not including
BUFFERS, and it seems like

835
00:45:40,540 --> 00:45:43,780
a missed opportunity to educate
on that front.

836
00:45:43,780 --> 00:45:47,100
So this is a great argument for
the people doing the educating,

837
00:45:47,100 --> 00:45:51,180
I think, or even doing the implementations
of why it can be helpful

838
00:45:51,180 --> 00:45:51,960
to see this.

839
00:45:51,960 --> 00:45:56,420
I think he makes some good points
as well about hardware changing

840
00:45:56,420 --> 00:46:02,240
over time and his initial paragraph
is even about I can't reproduce

841
00:46:02,240 --> 00:46:02,380
this paper.

842
00:46:02,380 --> 00:46:04,400
Nikolay: I can't reproduce timing, yes.

843
00:46:04,400 --> 00:46:07,720
Michael: Yeah, because of all this.
And it's not just that; it's also,

844
00:46:07,760 --> 00:46:10,900
and this is a slightly different
point, that there's

845
00:46:10,900 --> 00:46:13,920
insufficient setup information,
which is really common in benchmarks.

846
00:46:13,940 --> 00:46:18,540
It's really common to not include
which exact machines were being

847
00:46:18,540 --> 00:46:19,700
used or which exact.

848
00:46:20,140 --> 00:46:23,400
So without that information, BUFFERS
can still be used as like,

849
00:46:23,400 --> 00:46:24,480
oh, this is interesting.

850
00:46:24,520 --> 00:46:27,540
Sure, there might be optimizations
over time so that Postgres can

851
00:46:27,660 --> 00:46:31,500
use fewer BUFFERS for certain operations
or certain indexes become

852
00:46:31,500 --> 00:46:32,240
more efficient.

853
00:46:32,780 --> 00:46:36,020
There's been a bunch of work to
optimize B-trees in Postgres

854
00:46:36,020 --> 00:46:38,760
over the years, and I would expect
that to lead to slightly fewer

855
00:46:38,760 --> 00:46:40,900
buffer reads in a bunch of cases.

856
00:46:41,320 --> 00:46:44,780
But they would be easier to see,
and easier to see the differences

857
00:46:44,920 --> 00:46:46,520
in than timings.

858
00:46:46,940 --> 00:46:50,500
And I think he's also got a really
good graph here that shows

859
00:46:50,500 --> 00:46:56,140
that there's not 0 variance in
buffer numbers, but it's much,

860
00:46:56,200 --> 00:47:00,460
much lower and it's way more stable
than timings.

861
00:47:01,220 --> 00:47:04,820
Nearly completely stable with some
exceptions, which is my experience

862
00:47:04,820 --> 00:47:05,520
as well.

863
00:47:05,600 --> 00:47:06,600
Nikolay: Yeah, yeah, yeah.

864
00:47:07,260 --> 00:47:10,940
I agree, especially for example,
if you run the query the first time

865
00:47:10,940 --> 00:47:14,060
the relcache is not there, it will
give you,

866
00:47:14,060 --> 00:47:14,760
Michael: yeah, planning

867
00:47:14,760 --> 00:47:17,560
Nikolay: time, it will give you
huge overhead and huge buffer

868
00:47:17,560 --> 00:47:21,960
numbers, but the second execution will
fully shave it off.

869
00:47:22,640 --> 00:47:23,800
Michael: Yeah, planning BUFFERS.

870
00:47:24,160 --> 00:47:26,360
Nikolay: Planning BUFFERS is a
separate topic.

871
00:47:26,380 --> 00:47:27,940
It's a very important topic.

872
00:47:28,260 --> 00:47:29,560
I like to have it.

873
00:47:30,360 --> 00:47:33,060
Michael: Me too, but I don't fully
understand them yet.

874
00:47:33,660 --> 00:47:40,520
Like why do we get so many more
in the first execution and then

875
00:47:40,520 --> 00:47:43,680
why do we sometimes get none reported
at all?

876
00:47:43,680 --> 00:47:44,560
Nikolay: Ah, okay.

877
00:47:44,680 --> 00:47:45,780
You mean, yeah, yeah, yeah.

878
00:47:45,780 --> 00:47:50,840
I think we discussed it maybe,
not sure if you remember. I saw some very

879
00:47:50,840 --> 00:47:52,620
weird behavior in planning BUFFERS.

880
00:47:52,820 --> 00:47:53,320
Yeah.

881
00:47:53,540 --> 00:47:56,180
Maybe something is not tracked
as well, as you've mentioned.

882
00:47:56,180 --> 00:47:56,760
Michael: I think it

883
00:47:56,760 --> 00:47:57,120
Nikolay: must be.

884
00:47:57,120 --> 00:47:58,260
Foreign keys, yeah, yeah, yeah.

885
00:47:58,260 --> 00:48:03,620
So maybe we should dig into this
topic and raise the discussion

886
00:48:03,640 --> 00:48:04,200
about that.

887
00:48:04,200 --> 00:48:04,700
Yeah.
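A simple way to observe the planning-buffers effect discussed above (any table works; the numbers are illustrative, not real measurements):

```sql
-- In a fresh session, the first EXPLAIN shows buffer activity during
-- planning (catalog and relcache lookups); repeating it shows far
-- fewer, because those entries are now cached in the backend.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM pg_class;
-- First run may include a section like:
--   Planning:
--     Buffers: shared hit=75 read=12
-- On the second run, the "read" part typically disappears.
```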

888
00:48:04,920 --> 00:48:06,960
So back to Andrei's blog post.

889
00:48:06,960 --> 00:48:07,460
Michael: Yes.

890
00:48:08,500 --> 00:48:09,160
Nikolay: I agree.

891
00:48:09,160 --> 00:48:14,280
And like, "I cannot reproduce".
We build lab environments sometimes

892
00:48:14,540 --> 00:48:18,060
very different from production
just for the sake of cost optimization.

893
00:48:18,340 --> 00:48:24,900
Like if you have a 360-core EPYC,
5th-generation EPYC, on production,

894
00:48:26,120 --> 00:48:30,700
and almost a terabyte or more of
memory, it doesn't mean you cannot

895
00:48:30,720 --> 00:48:34,300
optimize your queries on a Raspberry
Pi.

896
00:48:34,360 --> 00:48:38,440
You can, on magnetic disks, very
cheap.

897
00:48:39,520 --> 00:48:43,700
And that query fully cached might
be slow for you on production,

898
00:48:43,740 --> 00:48:44,980
might be like a second.

899
00:48:45,720 --> 00:48:49,700
You go and optimize, you understand
that you have shitty hardware.

900
00:48:50,600 --> 00:48:55,220
In this case, you just see, okay,
it takes minutes or hours.

901
00:48:55,900 --> 00:48:59,080
But BUFFERS are the same, and the plan
is the same.

902
00:48:59,100 --> 00:49:04,700
You can make decisions on super
cheap hardware with smaller caches,

903
00:49:04,900 --> 00:49:09,720
everything is smaller, very slow
disk, you can still make cautious

904
00:49:09,720 --> 00:49:13,920
decisions and move forward with
optimization, find a solution,

905
00:49:14,760 --> 00:49:18,460
see that you optimize buffer numbers
like a thousand times and

906
00:49:18,460 --> 00:49:22,700
expect a similar effect, maybe the
same effect on production, affecting

907
00:49:22,720 --> 00:49:24,220
time accordingly.

908
00:49:24,320 --> 00:49:24,820
Right.

909
00:49:24,860 --> 00:49:27,480
And this is the power
of lab environments.

910
00:49:27,720 --> 00:49:31,320
I think everyone should have them
super cheap, shared.

911
00:49:31,320 --> 00:49:35,940
So many people and CI pipelines
work together and focusing on

912
00:49:35,940 --> 00:49:37,000
BUFFERS is magic.

913
00:49:37,500 --> 00:49:38,000
Right?

914
00:49:38,720 --> 00:49:39,220
Yeah, I

915
00:49:39,220 --> 00:49:39,840
Michael: think it's easy.

916
00:49:39,840 --> 00:49:40,960
I think you're completely right.

917
00:49:40,960 --> 00:49:43,740
I think it's easy for people to
make the mistake of then also

918
00:49:43,740 --> 00:49:45,320
not using sufficient data volumes.

919
00:49:45,320 --> 00:49:46,160
I know you don't.

920
00:49:46,160 --> 00:49:48,920
I know that's not what you're advocating
for.

921
00:49:48,920 --> 00:49:53,500
But yeah, it's the data volumes
that matter way more than the

922
00:49:55,000 --> 00:49:58,600
power of the disks or the specific
CPU.

923
00:49:59,060 --> 00:50:01,720
Nikolay: Yeah, usually people say,
okay, we are going to have

924
00:50:01,720 --> 00:50:07,240
weaker hardware for non-production,
and we are going to focus

925
00:50:07,240 --> 00:50:08,400
on relative optimization.

926
00:50:08,480 --> 00:50:12,940
So if it was a second in production,
10 seconds in this non-production,

927
00:50:13,080 --> 00:50:15,920
which is weak, we are going to
optimize from there.

928
00:50:15,920 --> 00:50:19,280
But it's still weak because of
the state of the caches and so on.

929
00:50:19,280 --> 00:50:22,700
Just focus on BUFFERS and then
you can understand it.

930
00:50:22,700 --> 00:50:26,420
Again, there are exceptions, I
agree, such as we discussed with

931
00:50:26,820 --> 00:50:28,280
RLS and so on.

932
00:50:29,240 --> 00:50:35,200
It's possible that your particular
case is CPU-intensive, but

933
00:50:35,340 --> 00:50:37,560
these are exceptions to the main
rule.

934
00:50:37,940 --> 00:50:42,380
In the absolute majority of cases, you
will be dealing with BUFFERS

935
00:50:43,200 --> 00:50:44,700
during optimization, right?

936
00:50:45,140 --> 00:50:49,240
So congrats on the Postgres 18 change.

937
00:50:50,920 --> 00:50:55,680
And I also want to thank Andrei
Lepikhov for this blog post,

938
00:50:55,680 --> 00:50:57,440
which is very welcome.

939
00:50:57,440 --> 00:51:01,880
I think we need more posts about
the importance of I/O metrics

940
00:51:02,120 --> 00:51:06,760
at various levels, like 1 query
level, micro level, or macro

941
00:51:06,760 --> 00:51:09,100
level like pg_stat_statements, pg_stat_kcache.

942
00:51:09,880 --> 00:51:16,300
So yeah, I hope this topic will
grow and be more popular, this

943
00:51:16,300 --> 00:51:16,800
methodology.

944
00:51:17,700 --> 00:51:19,220
Michael: Yeah, agreed.

945
00:51:20,280 --> 00:51:23,160
Another thank you to the people
involved in getting this committed

946
00:51:23,160 --> 00:51:27,380
to Postgres 18, and big thanks. Oh,
and I think this might be our

947
00:51:27,380 --> 00:51:32,560
first episode since we both got
Postgres Contributor status, Nikolay,

948
00:51:32,560 --> 00:51:36,340
so congratulations to you and thank
you to whoever nominated

949
00:51:36,420 --> 00:51:38,860
us or whatever the process is there.

950
00:51:38,860 --> 00:51:40,180
Thank you to everyone involved.

951
00:51:41,480 --> 00:51:43,620
Nikolay: Good, see you next time.

952
00:51:44,440 --> 00:51:45,060
Michael: See you soon, bye.