1
00:00:00,060 --> 00:00:02,280
Michael: Hello and welcome to Postgres.FM, a weekly show about

2
00:00:02,280 --> 00:00:03,140
all things PostgreSQL.

3
00:00:03,280 --> 00:00:05,760
I am Michael, founder of pgMustard, and this is Nikolay, founder

4
00:00:05,760 --> 00:00:06,800
of Postgres.AI.

5
00:00:07,120 --> 00:00:08,440
Hey Nikolay, how's it going?

6
00:00:08,600 --> 00:00:09,440
Nikolay: Hello, Michael.

7
00:00:09,640 --> 00:00:10,320
Going great.

8
00:00:10,320 --> 00:00:11,100
How are you?

9
00:00:11,840 --> 00:00:12,740
Michael: Yeah, good.

10
00:00:12,740 --> 00:00:13,520
Thank you.

11
00:00:13,860 --> 00:00:16,220
So what are we talking about this week?

12
00:00:16,640 --> 00:00:20,220
Nikolay: Let's talk about why we can have writes on replicas.

13
00:00:22,080 --> 00:00:26,780
Michael: Yeah, or the thing I've seen people get confused by

14
00:00:27,180 --> 00:00:32,320
is why they can get writes even on the primary for SELECT queries,

15
00:00:32,320 --> 00:00:34,020
like for read queries.

16
00:00:34,020 --> 00:00:35,340
But yeah, I like your framing.

17
00:00:35,380 --> 00:00:37,780
Nikolay: Yeah, it can happen on the primary as well.

18
00:00:38,040 --> 00:00:38,540
Yeah.

19
00:00:38,560 --> 00:00:41,820
I mean, this surprise can happen on the primary as well.

20
00:00:42,040 --> 00:00:45,360
Michael: And specifically where people might notice this is seeing

21
00:00:45,920 --> 00:00:48,300
buffers in query plans, like shared

22
00:00:48,600 --> 00:00:49,100
Nikolay: dirtied.

23
00:00:49,160 --> 00:00:49,660
Exactly.

24
00:00:50,060 --> 00:00:54,280
Michael: But also, I think the place I've seen people spot it,

25
00:00:54,280 --> 00:00:57,540
at least when they mention it in blog posts, is pg_stat_statements

26
00:00:57,540 --> 00:01:00,420
or monitoring tools that are looking at pg_stat_statements

27
00:01:01,080 --> 00:01:04,240
and seeing that some of the top offending queries for buffers

28
00:01:04,240 --> 00:01:07,160
dirtied could even be SELECT queries, and that really surprised

29
00:01:07,160 --> 00:01:07,360
them.

30
00:01:07,360 --> 00:01:11,480
So it's kind of one of those surprising moments where you think,

31
00:01:11,480 --> 00:01:13,520
ah, is something badly wrong?

32
00:01:13,520 --> 00:01:16,380
And then you look into it and realize maybe not.
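
For illustration, a minimal sketch of spotting this in pg_stat_statements, assuming the stock view and its counters:

    -- SELECT-only statements that nevertheless dirty shared buffers
    SELECT queryid,
           left(query, 60) AS query,
           calls,
           shared_blks_dirtied,
           shared_blks_written
    FROM pg_stat_statements
    WHERE query ILIKE 'select%'
      AND shared_blks_dirtied > 0
    ORDER BY shared_blks_dirtied DESC
    LIMIT 10;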

33
00:01:16,380 --> 00:01:19,640
Nikolay: Yeah, actually, to be fully clear, writes can happen

34
00:01:19,640 --> 00:01:22,360
during SELECTs because of temporary files as well, right?

35
00:01:22,360 --> 00:01:24,320
And if you think about writes to disk.

36
00:01:24,440 --> 00:01:26,600
But this is pretty straightforward.

37
00:01:26,740 --> 00:01:30,020
We see, again, in pg_stat_statements, we can see it as

38
00:01:30,020 --> 00:01:35,220
well, as temporary bytes or blocks written and read, and...

39
00:01:35,220 --> 00:01:37,480
Michael: Yeah, is it local or is it temp?

40
00:01:37,480 --> 00:01:38,600
I think it's local.

41
00:01:39,380 --> 00:01:40,520
Nikolay: No, local is different.

42
00:01:40,920 --> 00:01:42,100
I always forget what local

43
00:01:42,100 --> 00:01:42,180
Michael: is.

44
00:01:42,180 --> 00:01:42,880
No, you're right.

45
00:01:43,260 --> 00:01:43,780
You're right.

46
00:01:43,780 --> 00:01:46,360
Local is for temporary tables, that's why I get confused.

47
00:01:46,360 --> 00:01:49,220
Local is for temporary tables and temporary objects.

48
00:01:50,540 --> 00:01:54,640
But temp is for, like, if for example a sort or hash spills to

49
00:01:54,640 --> 00:01:59,040
disk because it's too big for work_mem (or work_mem times

50
00:01:59,040 --> 00:02:04,220
hash_mem_multiplier), then you get temp blocks read and temp blocks written.
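
A quick way to reproduce the temp-spill case, as a sketch: shrink work_mem and sort something larger than fits in it:

    SET work_mem = '64kB';
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT g FROM generate_series(1, 100000) AS g ORDER BY g DESC;
    -- The plan reports "Sort Method: external merge  Disk: ...kB" and a
    -- "temp read=... written=..." buffers line; pg_stat_statements
    -- accumulates the same activity in temp_blks_read / temp_blks_written.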

51
00:02:04,400 --> 00:02:07,800
But I think what people are spotting is the dirtied and being

52
00:02:07,800 --> 00:02:09,220
like, wait, that should...

53
00:02:09,280 --> 00:02:12,480
Dirtied means, you know, an Update or Delete, normally.

54
00:02:13,480 --> 00:02:14,600
Nikolay: It's so confusing, right?

55
00:02:14,600 --> 00:02:18,940
I'm checking the pg_stat_statements documentation, and for local blocks,

56
00:02:19,200 --> 00:02:25,420
read, dirtied, written, and local block read time, write time.

57
00:02:26,040 --> 00:02:29,440
It only says total number of local blocks read by statement.

58
00:02:29,640 --> 00:02:33,560
Total number of local blocks, like everyone understands what

59
00:02:33,560 --> 00:02:34,440
local block is.

60
00:02:34,440 --> 00:02:36,580
Well, local block, okay, I think you're right.

61
00:02:36,580 --> 00:02:37,620
It's a block.

62
00:02:38,100 --> 00:02:42,320
Also, like, the more and more I deal with it in Postgres,

63
00:02:42,520 --> 00:02:45,300
it's been how many years, almost 20 years, right?

64
00:02:45,760 --> 00:02:49,940
The more I dislike the fact that we call operations blocks.

65
00:02:50,740 --> 00:02:52,540
These are operations, not blocks.

66
00:02:53,740 --> 00:03:01,120
Because if we say 10 blocks written, it's actually 10 block writes.

67
00:03:01,120 --> 00:03:03,980
Maybe it's the same block which we've written 10 times, right?

68
00:03:04,300 --> 00:03:05,040
Or rare.

69
00:03:05,860 --> 00:03:09,300
So it should be block reads, block writes, block...

70
00:03:10,440 --> 00:03:14,080
Dirty, I don't know the proper word, how to say it.

71
00:03:14,600 --> 00:03:15,100
Dirtying...

72
00:03:15,820 --> 00:03:17,640
Dirtyings, I don't know.

73
00:03:18,140 --> 00:03:23,040
But anyway, here, of course I think it would be great to have

74
00:03:23,680 --> 00:03:25,460
an understanding of what a local block is.

75
00:03:25,460 --> 00:03:27,600
A local block, I think, is, again, an operation.

76
00:03:28,040 --> 00:03:31,280
A local block operation is what's caused when we work with

77
00:03:31,280 --> 00:03:35,220
temporary tables, which I always try to avoid because of different

78
00:03:35,920 --> 00:03:36,420
reasons.

79
00:03:36,420 --> 00:03:39,640
Maybe we should talk one day about temporary tables.

80
00:03:40,120 --> 00:03:43,100
But temporary tables, temporary blocks... temp blocks are not

81
00:03:43,100 --> 00:03:46,320
about temporary tables, but rather about temporary files which

82
00:03:46,320 --> 00:03:50,960
can appear during execution of a query dealing with normal tables

83
00:03:51,200 --> 00:03:54,300
anytime because work memory is not enough, right?

84
00:03:54,840 --> 00:03:56,400
So again, this is easy.

85
00:03:56,520 --> 00:03:57,440
This part is easy.

86
00:03:57,440 --> 00:04:01,400
If we see some SELECT or Update or anything producing a lot

87
00:04:01,400 --> 00:04:06,420
of temporary blocks, read or written,

88
00:04:06,940 --> 00:04:10,240
we just need to consider raising work memory or optimizing queries

89
00:04:10,240 --> 00:04:16,080
so they deal with lower volumes, smaller volumes of data, right?

90
00:04:16,700 --> 00:04:22,300
For example, I don't know, like, just increase selectivity of

91
00:04:22,300 --> 00:04:25,600
our, I don't know, it's a separate topic anyway.

92
00:04:25,780 --> 00:04:29,180
Let's just point out that writes can happen because of temporary

93
00:04:29,180 --> 00:04:29,680
files.

94
00:04:29,680 --> 00:04:32,120
Of course, SELECTs can produce temporary files because work_mem

95
00:04:32,380 --> 00:04:33,280
is not enough.

96
00:04:34,080 --> 00:04:34,900
That's it.

97
00:04:35,320 --> 00:04:40,900
But then sometimes we have huge work memory, definitely, which

98
00:04:40,900 --> 00:04:45,080
is enough, and a SELECT is still writing to disk.

99
00:04:45,080 --> 00:04:49,120
Or, well, technically dirtying: blocks dirtied.

100
00:04:49,700 --> 00:04:54,100
These operations, I will keep calling these operations, not data.

101
00:04:55,520 --> 00:04:59,180
These events, these operations
can happen during SELECTs.

102
00:05:00,100 --> 00:05:04,800
And if it's just dirtied, it doesn't
necessarily go to disk immediately,

103
00:05:04,800 --> 00:05:08,980
because this is the work for Checkpointer and background writer.
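
As a sketch, pg_stat_bgwriter shows who eventually flushes those dirtied buffers (column names as in Postgres 16 and older; Postgres 17 moves the checkpointer columns to pg_stat_checkpointer):

    SELECT buffers_checkpoint,  -- flushed by the checkpointer
           buffers_clean,       -- flushed by the background writer
           buffers_backend      -- flushed directly by backends
    FROM pg_stat_bgwriter;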

104
00:05:10,280 --> 00:05:12,180
But it's already like, it's not
good.

105
00:05:12,180 --> 00:05:17,600
Like, a SELECT is producing some writes
to the buffer pool, to shared_buffers

106
00:05:17,880 --> 00:05:18,380
area.

107
00:05:18,400 --> 00:05:20,920
And of course, if it's written,
it's even worse, because I'm

108
00:05:20,920 --> 00:05:26,320
pretty sure this makes the SELECT also
slow, similarly to what happens

109
00:05:26,320 --> 00:05:30,120
when it creates a temporary file
and writes to it, right?

110
00:05:30,560 --> 00:05:36,240
Michael: Yeah, good point. Although
I think in the 2 cases that

111
00:05:36,240 --> 00:05:39,220
we're going to be talking about,
actually, no, no, there could

112
00:05:39,220 --> 00:05:39,960
be a lot of them.

113
00:05:39,960 --> 00:05:43,140
I was thinking, I was thinking
it's generally, in a lot of

114
00:05:43,140 --> 00:05:45,660
the cases I've seen, only
one block at a time.

115
00:05:45,800 --> 00:05:50,040
But actually, you could easily
have a scan doing a lot of these.

116
00:05:50,580 --> 00:05:52,580
So yeah, could be a lot of data.

117
00:05:52,580 --> 00:05:57,140
Nikolay: Yeah, and actually to
be even like absolutely precise,

118
00:05:58,260 --> 00:06:01,360
a block written doesn't necessarily
go to disk, because it is

119
00:06:01,360 --> 00:06:03,280
written to the page cache.

120
00:06:05,140 --> 00:06:10,060
And then like pdflush or something,
this is already about Linux,

121
00:06:11,040 --> 00:06:13,000
it should write to disk at some
point.

122
00:06:14,960 --> 00:06:17,700
Written doesn't necessarily mean
that it goes to disk.

123
00:06:18,240 --> 00:06:23,300
And I think to understand that,
we should use a different extension.

124
00:06:24,100 --> 00:06:26,700
Because pg_stat_statements is great,
it's available everywhere.

125
00:06:26,980 --> 00:06:29,940
Because it's like, I think this
is the most popular extension,

126
00:06:30,040 --> 00:06:31,640
or should be, at least among observability extensions.

127
00:06:31,640 --> 00:06:34,220
And there is an opinion that it should
go to core at some point.

128
00:06:35,280 --> 00:06:39,780
It should stop being an extension,
especially after query ID

129
00:06:40,120 --> 00:06:43,220
was propagated to pg_stat_activity
and to auto_explain, right?

130
00:06:43,480 --> 00:06:45,720
Definitely, pg_stat_statements
should be in core.

131
00:06:45,720 --> 00:06:49,440
But what I'm saying is that there
is pg_stat_kcache, which extends

132
00:06:49,440 --> 00:06:56,120
pg_stat_statements to allow
users to see metrics related

133
00:06:56,120 --> 00:07:01,920
to actual physical resources such
as user CPU, system CPU, and also

134
00:07:01,920 --> 00:07:02,680
disk I/O.

135
00:07:02,680 --> 00:07:06,360
Real disk writes, real disk reads:
how many bytes are written,

136
00:07:06,360 --> 00:07:10,980
how many bytes are read for each
query ID, which is great.

137
00:07:11,940 --> 00:07:19,460
I think we talked about how, finally,
BUFFERS was committed to EXPLAIN

138
00:07:19,460 --> 00:07:20,140
ANALYZE.

139
00:07:20,280 --> 00:07:24,780
By default, when you run EXPLAIN ANALYZE
in Postgres 18, unless it's

140
00:07:24,780 --> 00:07:25,280
reverted.

141
00:07:25,320 --> 00:07:26,820
I hope it won't be reverted.

142
00:07:27,260 --> 00:07:31,260
In Postgres 18, EXPLAIN ANALYZE
will always include BUFFERS in

143
00:07:31,260 --> 00:07:32,940
the output, which is a great victory.
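
So the difference is just this (table name hypothetical):

    -- Postgres 17 and older: buffer counters must be requested explicitly
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE id = 1;
    -- Postgres 18, if the change ships: plain EXPLAIN ANALYZE already
    -- prints a line like "Buffers: shared hit=4 read=2 dirtied=1"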

144
00:07:33,540 --> 00:07:36,740
We talked about it for a couple of
years on this podcast, I think

145
00:07:36,740 --> 00:07:38,260
almost since the very beginning.

146
00:07:38,680 --> 00:07:42,900
And I was thinking about what topic to
choose next to complain about

147
00:07:43,140 --> 00:07:44,780
occasionally, regularly.

148
00:07:45,940 --> 00:07:47,780
Michael: Or maybe campaign rather
than complaint.

149
00:07:47,780 --> 00:07:48,840
Nikolay: Campaign, okay.

150
00:07:50,740 --> 00:07:52,100
Michael: More positive, yeah.

151
00:07:52,120 --> 00:07:55,620
Nikolay: Right, so what kind of
campaign to launch after this?

152
00:07:56,140 --> 00:08:00,600
Like, I was happy, but I was like,
oh, I will miss being able to

153
00:08:00,600 --> 00:08:03,540
talk about buffers and how EXPLAIN
lies and so on.

154
00:08:04,660 --> 00:08:06,600
I chose this topic.

155
00:08:07,060 --> 00:08:12,840
I think we should focus on all
managed Postgres platform builders

156
00:08:13,780 --> 00:08:18,560
and campaign for them to include pg_wait_sampling
and pg_stat_kcache

157
00:08:18,740 --> 00:08:19,460
in their extensions.

158
00:08:20,860 --> 00:08:25,560
Because this is super important
to have them both to be able

159
00:08:25,560 --> 00:08:30,180
to do proper query analysis in
different aspects.
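
Checking whether a platform ships them is one query, something like:

    SELECT name, default_version, installed_version
    FROM pg_available_extensions
    WHERE name IN ('pg_stat_kcache', 'pg_wait_sampling');
    -- Where available (both also need shared_preload_libraries entries):
    -- CREATE EXTENSION pg_stat_kcache;
    -- CREATE EXTENSION pg_wait_sampling;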

160
00:08:30,280 --> 00:08:32,000
pg_stat_statements is great.

161
00:08:32,380 --> 00:08:33,640
It has a lot of metrics.

162
00:08:34,460 --> 00:08:38,260
It keeps growing in terms of number
of metrics, but it doesn't

163
00:08:38,260 --> 00:08:41,880
have what pg_stat_kcache and pg_wait_sampling
provide.

164
00:08:42,520 --> 00:08:46,480
So we need those extensions on
each serious database, especially

165
00:08:46,560 --> 00:08:50,060
heavily loaded ones, and pg_stat_kcache
is great here.

166
00:08:50,180 --> 00:08:54,660
And this would allow us to see,
oh, actually this query, being a

167
00:08:54,660 --> 00:08:58,300
SELECT, caused actual physical disk
writes.

168
00:08:58,440 --> 00:09:00,300
Michael: Yeah, I'm going to have
a think.

169
00:09:00,300 --> 00:09:04,160
If we only get one campaign, I'm
not sure that would be mine, but maybe.

170
00:09:04,300 --> 00:09:04,860
I like it.

171
00:09:04,860 --> 00:09:05,880
It's a good idea.

172
00:09:05,900 --> 00:09:09,140
Nikolay: Okay, let's please
consider this.

173
00:09:09,140 --> 00:09:12,400
I'm definitely launching a campaign
at least on my Twitter and

174
00:09:12,400 --> 00:09:17,040
LinkedIn and actually I already
launched it and I saw a

175
00:09:17,040 --> 00:09:21,380
reaction. So, actually, I mentioned
that these extensions are

176
00:09:21,380 --> 00:09:26,040
present on all serious setups,
which we helped to maintain for

177
00:09:26,040 --> 00:09:27,080
quite some time.

178
00:09:27,340 --> 00:09:32,120
And pg_stat_kcache has been used like
in serious databases I'm dealing

179
00:09:32,120 --> 00:09:34,100
with for a very long time, many
years.

180
00:09:34,120 --> 00:09:38,900
And I know some platforms use it,
but in different countries,

181
00:09:38,900 --> 00:09:39,400
right?

182
00:09:40,440 --> 00:09:46,840
And the only case I see is that pg_wait_sampling
is available on Google

183
00:09:46,840 --> 00:09:48,240
Cloud SQL.

184
00:09:48,820 --> 00:09:52,540
pg_stat_kcache is not available there,
but pg_wait_sampling is,

185
00:09:52,540 --> 00:09:53,420
which is great.

186
00:09:53,720 --> 00:09:54,520
Michael: Are you sure?

187
00:09:54,640 --> 00:09:55,240
Nikolay: I'm sure.

188
00:09:55,240 --> 00:09:57,840
pg_wait_sampling is available,
so you can have...

189
00:09:58,260 --> 00:10:01,360
Imagine RDS: they have Performance
Insights, which is great.

190
00:10:01,420 --> 00:10:04,800
But this information is not available
through the SQL interface.

191
00:10:04,900 --> 00:10:08,400
I'm not sure about CloudWatch,
maybe it's available there, but

192
00:10:08,400 --> 00:10:09,380
it's a different API.

193
00:10:09,380 --> 00:10:14,540
I want to see details right in
SQL because I have tools to analyze

194
00:10:14,960 --> 00:10:18,580
pg_stat_statements, and I want extensions
for active session history

195
00:10:18,580 --> 00:10:22,860
analysis, and also for physical metrics,
which pg_stat_kcache provides.
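
For example, a sketch against the stock pg_wait_sampling view, aggregating sampled wait events per query ID:

    SELECT queryid, event_type, event, count
    FROM pg_wait_sampling_profile
    ORDER BY count DESC
    LIMIT 10;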

196
00:10:24,440 --> 00:10:28,240
So Cloud SQL has pg_wait_sampling,
but it doesn't have pg_stat_kcache.

197
00:10:28,680 --> 00:10:30,800
Others don't have either of these
two.

198
00:10:30,940 --> 00:10:35,540
And the reaction I saw was from one
of the Cloud SQL engineers, who

199
00:10:35,540 --> 00:10:38,040
promised to think about it.

200
00:10:38,440 --> 00:10:42,680
I mean, it would be great and Cloud
SQL could be much better

201
00:10:43,740 --> 00:10:46,220
if they had pg_stat_kcache.

202
00:10:46,220 --> 00:10:48,740
I think they're already much better
in terms of observability,

203
00:10:49,240 --> 00:10:53,220
better than RDS because pg_wait_sampling
is better than performance

204
00:10:53,300 --> 00:10:53,800
insights.

205
00:10:55,640 --> 00:10:59,540
Usually we combine multiple tools
for observability.

206
00:10:59,540 --> 00:11:05,000
For active session history on
RDS or Aurora you go to one place,

207
00:11:05,000 --> 00:11:07,960
to their own place, for everything
else, you go to another place.

208
00:11:07,960 --> 00:11:08,900
It's not convenient.

209
00:11:09,760 --> 00:11:13,520
So in the case of Google, they
have, I think, Query Insights

210
00:11:13,520 --> 00:11:14,760
or something like this.

211
00:11:14,760 --> 00:11:17,360
They have it exposed in their own
monitoring interface, but they

212
00:11:17,360 --> 00:11:21,420
also expose it for anyone through
SQL interface, which is great,

213
00:11:21,760 --> 00:11:22,260
right?

214
00:11:22,360 --> 00:11:24,660
Through pg_wait_sampling.

215
00:11:25,520 --> 00:11:29,140
If they have pg_stat_kcache, I will
be happy. But also I'm

216
00:11:29,140 --> 00:11:32,620
going to ask others to consider
these two extensions.

217
00:11:33,100 --> 00:11:36,100
For example, Crunchy Bridge, Supabase,
who else?

218
00:11:36,600 --> 00:11:37,100
Everyone.

219
00:11:37,360 --> 00:11:40,520
I actually think RDS should consider
pg_wait_sampling as well,

220
00:11:40,520 --> 00:11:41,960
and pg_stat_kcache as well.

221
00:11:41,960 --> 00:11:42,660
Why not?

222
00:11:43,380 --> 00:11:45,700
So, yeah, it should be just an
option.

223
00:11:46,040 --> 00:11:47,420
Users could decide.

224
00:11:49,300 --> 00:11:50,520
So back to our question.

225
00:11:51,060 --> 00:11:54,380
I think if you have pg_stat_kcache,
you can see actual writes

226
00:11:54,380 --> 00:11:58,900
happening from queries and you
can see SELECTs and you know work

227
00:11:58,900 --> 00:12:01,960
memory is enough, so it's not temporary
files.

228
00:12:02,060 --> 00:12:05,440
You can confirm it through pg_stat_statements
or also through

229
00:12:05,440 --> 00:12:10,840
individual query execution via
EXPLAIN (ANALYZE, BUFFERS), and that's

230
00:12:10,840 --> 00:12:12,840
it, you're left wondering what's happening,
right?

231
00:12:14,060 --> 00:12:16,160
Why is a SELECT writing to disk?

232
00:12:16,920 --> 00:12:22,100
Or at least, is it writing to the
buffer pool, which is already

233
00:12:22,120 --> 00:12:23,420
quite a surprise, right?

234
00:12:24,960 --> 00:12:25,460
Michael: Yeah.

235
00:12:25,920 --> 00:12:30,460
And well, until recently, I knew
this was a phenomenon.

236
00:12:31,060 --> 00:12:34,820
And I'd read about it a couple
of times, seen it on the mailing list

237
00:12:34,820 --> 00:12:41,440
and in a post or two. But I only knew
of one of the potential 

238
00:12:41,440 --> 00:12:44,540
reasons and the reason I think
this came up recently is there

239
00:12:44,540 --> 00:12:48,320
was a good blog post by Alex Jesipow.

240
00:12:48,320 --> 00:12:49,800
I'm not sure how to pronounce that.

241
00:12:49,800 --> 00:12:50,280
Sorry.

242
00:12:50,900 --> 00:12:52,760
Who mentioned a second reason as
well.

243
00:12:52,760 --> 00:12:57,980
So the first reason is related
to the setting of hint bits, and maybe

244
00:12:57,980 --> 00:12:59,440
we can go into detail on that one
first.

245
00:12:59,440 --> 00:13:03,580
Cause that's the one I've seen most
often and it comes up the most.

246
00:13:03,600 --> 00:13:05,400
Do you want to discuss that one first?

247
00:13:05,420 --> 00:13:06,780
Nikolay: Yeah, let's do it

248
00:13:06,960 --> 00:13:11,180
Michael: So why do we have
hint bits, and what are hint bits?

249
00:13:11,440 --> 00:13:15,600
Nikolay: By the way, maybe it's
Alex Jesipow but Alex

250
00:13:15,600 --> 00:13:19,740
is from Germany, I see, and works
at Luminovo, which is interesting

251
00:13:19,860 --> 00:13:23,100
because we worked together last year.

252
00:13:23,100 --> 00:13:24,640
I mean with his colleagues.

253
00:13:25,080 --> 00:13:25,820
It's interesting.

254
00:13:25,920 --> 00:13:26,820
I just realized.

255
00:13:27,880 --> 00:13:28,600
It's cool.

256
00:13:28,600 --> 00:13:29,100
Yeah.

257
00:13:29,240 --> 00:13:31,860
Yeah, I'm going to send him this episode for sure.

258
00:13:32,300 --> 00:13:34,840
So we have 2 cases, right?

259
00:13:34,840 --> 00:13:35,580
2 cases.

260
00:13:35,800 --> 00:13:38,800
And I'm not sure which one is the easiest.

261
00:13:38,800 --> 00:13:40,020
Let's start with the easiest.

262
00:13:40,640 --> 00:13:43,700
Michael: I think hint bits are easiest to understand, and they're

263
00:13:43,700 --> 00:13:44,480
the most common.

264
00:13:45,060 --> 00:13:51,560
So my understanding of this is when new data is written to a

265
00:13:51,560 --> 00:13:56,940
page, at that moment, Postgres doesn't yet know if that could

266
00:13:56,940 --> 00:14:02,220
be part of a much larger, longer transaction, and we don't yet

267
00:14:02,220 --> 00:14:06,300
know if that transaction is going to commit or get aborted and

268
00:14:06,300 --> 00:14:07,060
rolled back.

269
00:14:07,920 --> 00:14:13,220
So we at that moment cannot set, we cannot say on the data page

270
00:14:13,780 --> 00:14:16,880
that this transaction has been committed and therefore should

271
00:14:16,880 --> 00:14:21,180
be visible to new reads of this page.

272
00:14:21,680 --> 00:14:29,620
So that information is in a separate log once the transaction

273
00:14:29,700 --> 00:14:30,440
does commit.

274
00:14:31,100 --> 00:14:37,060
And that means, if somebody's reading the page,

275
00:14:37,060 --> 00:14:41,340
they need to check which of these row versions have already been

276
00:14:41,540 --> 00:14:42,760
committed for sure.

277
00:14:42,880 --> 00:14:46,360
And if there are any ambiguous ones, any where we don't know yet,

278
00:14:46,360 --> 00:14:50,440
then, based on these hint bits, based on these 4 different

279
00:14:50,440 --> 00:14:54,380
flags, we can tell what status it is.

280
00:14:54,380 --> 00:14:55,220
Is it unknown?

281
00:14:55,280 --> 00:14:56,460
Is it definitely committed?

282
00:14:56,520 --> 00:14:58,340
Is it already out of date?

283
00:14:58,480 --> 00:14:59,940
Like is it already being replaced?

284
00:15:00,540 --> 00:15:08,600
So, in the case of these reads causing writes, it's a subsequent

285
00:15:08,900 --> 00:15:14,280
read coming along, seeing a new row version that hasn't yet,

286
00:15:14,280 --> 00:15:17,480
that we don't know yet whether it's been committed, checking

287
00:15:17,480 --> 00:15:21,600
the commit log and then setting, like having read it once, it

288
00:15:21,600 --> 00:15:26,700
sets that and dirties the page, writes the page again,

289
00:15:27,260 --> 00:15:30,580
so that any future reads now don't have to check.
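
A sketch of watching this happen with a throwaway table; the first read after commit sets hint bits and dirties pages, the second does not:

    CREATE TABLE hint_demo AS
        SELECT g AS id FROM generate_series(1, 100000) AS g;
    CHECKPOINT;  -- flush, so the pages start out clean in shared buffers
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
    -- First run:  Buffers: shared hit=... dirtied=...
    -- Second run: no "dirtied" anymore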

290
00:15:31,060 --> 00:15:35,880
Nikolay: Right, yeah, so yeah, I think this can be like invisible,

291
00:15:36,300 --> 00:15:41,280
dead, frozen, right, for a tuple. And the thing is that the most

292
00:15:41,280 --> 00:15:45,240
interesting part of it is that this write can happen on the

293
00:15:45,240 --> 00:15:49,960
replica, which is kind of a surprise for people, right?

294
00:15:49,960 --> 00:15:50,720
How come?

295
00:15:51,420 --> 00:15:54,140
Michael: There's a really good, oh, sorry, just to go back to

296
00:15:54,140 --> 00:15:58,060
hint bits, there's a really good page on the Postgres wiki that

297
00:15:58,100 --> 00:16:01,420
describes it really succinctly, and there are 4 hint bits.

298
00:16:01,820 --> 00:16:06,760
xmin committed, xmin aborted, xmax committed, xmax aborted.

299
00:16:08,600 --> 00:16:09,100
Nikolay: Yeah.

300
00:16:09,440 --> 00:16:10,920
So, yeah, that's it actually.

301
00:16:10,920 --> 00:16:14,860
So we just know about this phenomenon
and that's it.

302
00:16:15,280 --> 00:16:17,880
It should not be a surprise that
it happens.

303
00:16:19,020 --> 00:16:19,700
And subsequently...

304
00:16:20,200 --> 00:16:21,300
Michael: And it's healthy, right?

305
00:16:21,300 --> 00:16:24,960
It's not a sign of anything having
gone wrong. In a healthy system,

306
00:16:24,960 --> 00:16:27,340
it will happen quite often.

307
00:16:27,940 --> 00:16:30,280
Nikolay: Yeah, and in case it is
subsequent.

308
00:16:30,280 --> 00:16:32,220
So it already is not causing this,
right?

309
00:16:32,220 --> 00:16:36,740
So, it's because it's already
updated and it's kind of

310
00:16:36,740 --> 00:16:41,700
you can consider it a kind of like
dealing with warming up caches,

311
00:16:41,980 --> 00:16:43,880
but it's vice versa because it's
writes.

312
00:16:44,940 --> 00:16:47,960
So it's only, this overhead is
only present on the very first

313
00:16:47,960 --> 00:16:49,200
call with the same parameters.

314
00:16:49,200 --> 00:16:52,480
Of course, different parameters
can cause such writes in different

315
00:16:52,480 --> 00:16:52,980
pages.

316
00:16:55,080 --> 00:16:59,860
Michael: And it's a necessary result
of us wanting to be able

317
00:16:59,860 --> 00:17:01,420
to serve things concurrently.

318
00:17:01,880 --> 00:17:05,040
Because of MVCC, we need versions
of rows.

319
00:17:05,380 --> 00:17:08,700
And we need to then know which
one should be visible to which transactions.

320
00:17:09,160 --> 00:17:14,240
So in order to do that, this is
a necessary part of that.

321
00:17:14,280 --> 00:17:14,780
Nikolay: Right.

322
00:17:15,440 --> 00:17:18,340
So yeah, let's talk about the second
case.

323
00:17:18,900 --> 00:17:19,760
Michael: Yeah, this.

324
00:17:19,760 --> 00:17:20,140
Yeah.

325
00:17:20,140 --> 00:17:21,820
The second was more interesting
to me.

326
00:17:21,820 --> 00:17:23,760
Like I hadn't come across this
before.

327
00:17:24,100 --> 00:17:24,600
Nikolay: Yeah.

328
00:17:25,080 --> 00:17:26,420
Let's talk about it.

329
00:17:26,640 --> 00:17:29,540
Michael: So in Alex's post, I think
you're probably right on

330
00:17:29,540 --> 00:17:32,880
the surname pronunciation, but
I'm not going to try it again.

331
00:17:33,520 --> 00:17:39,000
He describes this as page pruning,
he calls it, which I have not

332
00:17:39,000 --> 00:17:42,940
heard it called that before. And
when I think of pruning, my mind

333
00:17:43,540 --> 00:17:49,620
naturally goes to, for example,
the mechanism for removing

334
00:17:49,700 --> 00:17:52,160
an empty page at the end of the
heap, for example.

335
00:17:52,240 --> 00:17:53,600
Is that also called pruning?

336
00:17:54,240 --> 00:17:55,360
But this is different.

337
00:17:55,600 --> 00:17:56,700
Nikolay: I think it's truncation.

338
00:17:56,980 --> 00:18:00,680
If you talk about removing the last
page, when vacuum does it, it's

339
00:18:00,680 --> 00:18:01,180
truncation.

340
00:18:02,220 --> 00:18:03,160
Michael: That makes sense.

341
00:18:03,280 --> 00:18:08,100
So this is almost like a, I
think you just, before the call

342
00:18:08,100 --> 00:18:12,820
you mentioned it was described
as a kind of in-page vacuum.

343
00:18:12,860 --> 00:18:13,220
Yeah.

344
00:18:13,220 --> 00:18:14,140
So it's like.

345
00:18:14,160 --> 00:18:17,200
Nikolay: And it can happen during
SELECTs on the fly, like it's

346
00:18:17,200 --> 00:18:18,200
interesting, right?

347
00:18:18,520 --> 00:18:20,020
Which is like also strange.

348
00:18:20,220 --> 00:18:25,520
But if we, if we recall how HOT
updates are organized, which

349
00:18:25,520 --> 00:18:29,140
is a great feature, unfortunately
not always available because

350
00:18:29,140 --> 00:18:31,900
it requires 2 special conditions
to be met.

351
00:18:31,920 --> 00:18:35,600
First is that we are changing only
values of the

352
00:18:35,600 --> 00:18:37,120
columns which are not indexed.

353
00:18:37,660 --> 00:18:42,160
And second is there is enough empty
space in the same page where

354
00:18:42,160 --> 00:18:45,360
our old tuple is stored.

355
00:18:45,560 --> 00:18:51,200
In this case, so-called HOT chains
are created, and there can be

356
00:18:51,200 --> 00:18:55,740
multiple versions of the same tuple
in the same page.

357
00:18:56,960 --> 00:19:02,780
In this case, what happens if
we have an index scan: we only

358
00:19:02,780 --> 00:19:06,260
know the page and offset for the
first version.

359
00:19:06,860 --> 00:19:09,380
Indexes are not updated in this
case because it's a HOT update.

360
00:19:09,380 --> 00:19:12,400
This is an optimization to fight index
write amplification, which

361
00:19:12,400 --> 00:19:17,440
is a terrible aspect of Postgres MVCC
behavior many projects are suffering

362
00:19:17,440 --> 00:19:17,940
from.

363
00:19:18,220 --> 00:19:21,740
When updating one row, having many
indexes on the table,

364
00:19:21,740 --> 00:19:22,700
We deal with...

365
00:19:22,800 --> 00:19:26,760
We need to update all of the indexes,
producing a lot of WAL writes

366
00:19:27,160 --> 00:19:31,360
and just making updates slow, heavy,
and so on.

367
00:19:31,540 --> 00:19:35,620
In the case of HOT updates, those 2
conditions I mentioned are met.

368
00:19:35,660 --> 00:19:38,560
Postgres writes only to the same
page where the tuple is already

369
00:19:38,560 --> 00:19:41,320
stored because there is enough
space and it doesn't update indexes

370
00:19:41,320 --> 00:19:44,160
at all because we are changing
the value which is not indexed.

371
00:19:44,340 --> 00:19:48,060
And it produces a new tuple version,
a new row version, inside the

372
00:19:48,060 --> 00:19:52,040
same page, creating a chain. It can
be new, new, new. And when

373
00:19:52,040 --> 00:19:56,980
we have an index scan, the index scan
points to the first version in

374
00:19:56,980 --> 00:20:03,240
the page, and then it's quick to
jump between versions and find

375
00:20:03,240 --> 00:20:05,580
the actual one inside the same page.
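
A sketch of peeking at such a chain with the pageinspect extension (demo names made up):

    CREATE EXTENSION IF NOT EXISTS pageinspect;
    CREATE TABLE hot_demo (id int PRIMARY KEY, n int);
    INSERT INTO hot_demo VALUES (1, 0);
    UPDATE hot_demo SET n = n + 1 WHERE id = 1;  -- HOT: n is not indexed
    -- Line pointers and t_ctid show the chain inside page 0; after
    -- pruning, redirecting line pointers appear as lp_flags = 2:
    SELECT lp, lp_flags, t_xmin, t_xmax, t_ctid
    FROM heap_page_items(get_raw_page('hot_demo', 0));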

376
00:20:07,040 --> 00:20:09,520
Michael: One crucial thing that I
think becomes important later

377
00:20:09,520 --> 00:20:13,260
is that that chain information
is stored in the header of the

378
00:20:13,260 --> 00:20:18,480
page, whereas the row, like the
data from the row version is

379
00:20:18,480 --> 00:20:20,180
stored at the end of the page.

380
00:20:20,380 --> 00:20:21,360
Nikolay: Right, right.

381
00:20:21,800 --> 00:20:22,620
This is true.

382
00:20:22,720 --> 00:20:24,220
Yes, it's so.

383
00:20:24,960 --> 00:20:31,260
And if we already have old versions,
at some point they need

384
00:20:31,260 --> 00:20:31,920
to be deleted.

385
00:20:31,920 --> 00:20:36,360
It can happen during vacuuming
or it can happen earlier if, while

386
00:20:36,540 --> 00:20:39,680
dealing with this page Postgres
says, okay, we can clean up old

387
00:20:39,680 --> 00:20:41,360
versions right now, why not?

388
00:20:41,820 --> 00:20:44,840
And it can happen during a SELECT as
well, which is very strange.

389
00:20:44,840 --> 00:20:45,340
Yeah.

390
00:20:46,160 --> 00:20:47,420
Michael: Well, it's so cool.

391
00:20:48,040 --> 00:20:52,700
But I think, again, there's like
a condition where it will only

392
00:20:52,880 --> 00:20:58,120
do this if there is not that much
space left on the page.

393
00:20:58,180 --> 00:21:04,200
So I think the number is 10% of
the page is left in terms of

394
00:21:04,200 --> 00:21:04,900
free space.

395
00:21:06,140 --> 00:21:09,340
And there's some subtlety around
if you've changed fill factor.

396
00:21:10,080 --> 00:21:13,880
Nikolay: And so this means that
it is an effort to maintain

397
00:21:14,240 --> 00:21:19,020
hotness of updates further, because
without this, we would

398
00:21:19,020 --> 00:21:20,320
need to go to another page.

399
00:21:20,320 --> 00:21:23,900
And this will definitely lead to
updating all indexes, right?

400
00:21:24,520 --> 00:21:25,020
Michael: Yes.

401
00:21:25,840 --> 00:21:28,780
But I think the optimization of
not doing it unless the page

402
00:21:28,780 --> 00:21:31,280
is quite full means we don't have
to do it that often.

403
00:21:31,280 --> 00:21:35,820
So for example, if we have maybe
like say a really quick 10-15

404
00:21:36,100 --> 00:21:41,140
updates of the same row, and they
all fit on the same page, and

405
00:21:41,140 --> 00:21:45,840
then we're having reads of that
row in between, we're not cleaning

406
00:21:45,860 --> 00:21:49,960
up each version each time we do
a read, we're waiting until we

407
00:21:49,960 --> 00:21:51,940
get full and then doing it in one
go.

408
00:21:52,200 --> 00:21:56,340
So I think it's a pretty
smart optimization.

409
00:21:56,540 --> 00:22:01,920
Nikolay: It's a kind of trade-off,
a balance between two worst-case situations

410
00:22:02,280 --> 00:22:02,780
and so on.

411
00:22:02,780 --> 00:22:08,540
So we don't do it too often, but
also we try not to allow this

412
00:22:08,540 --> 00:22:12,700
tuple to drift to another page,
which would cause the need to

413
00:22:12,700 --> 00:22:15,680
update indexes because indexes
point to the old page.

414
00:22:17,600 --> 00:22:18,640
This is quite interesting.

415
00:22:18,740 --> 00:22:25,080
Again, this can happen during a SELECT.
And yeah, but it can happen

416
00:22:25,080 --> 00:22:31,340
only on the primary, right? Yes or
no? It's my guess, I don't know

417
00:22:31,340 --> 00:22:36,780
honestly why, because... can it
happen on a replica?

418
00:22:37,520 --> 00:22:40,280
Michael: Yeah, well, I actually don't
know. Well, let us know in

419
00:22:40,280 --> 00:22:40,880
the comments.

420
00:22:41,120 --> 00:22:44,700
Nikolay: Yeah, this is an interesting
question, because if it's

421
00:22:44,700 --> 00:22:48,920
happening on a replica, it means
we have different, very different

422
00:22:49,540 --> 00:22:52,460
content of pages on both replica
and primary.

423
00:22:55,380 --> 00:23:00,620
Vacuum happening on a replica?
Please, no, because this changes

424
00:23:00,620 --> 00:23:02,060
the content of the page.

425
00:23:05,880 --> 00:23:08,400
I cannot understand how page content
should be synchronized in

426
00:23:08,400 --> 00:23:09,160
terms of

427
00:23:09,680 --> 00:23:09,960
Michael: tuples.

428
00:23:09,960 --> 00:23:11,540
It's a physical replica, yeah.

429
00:23:11,580 --> 00:23:14,980
Nikolay: Yeah, so it should happen
only on the primary because...

430
00:23:15,020 --> 00:23:15,560
Michael: Okay, yeah.

431
00:23:15,560 --> 00:23:17,780
Nikolay: And the replica should
just get this content.

432
00:23:18,240 --> 00:23:20,340
Now, with hint bits it's different.

433
00:23:20,340 --> 00:23:25,320
Hint bits add additional information
and they don't change how

434
00:23:25,320 --> 00:23:26,980
tuples are stored inside the page.

435
00:23:27,120 --> 00:23:28,320
This thing is changing it.

436
00:23:28,320 --> 00:23:31,080
It's cleaning up the space for
new tuples.

437
00:23:31,720 --> 00:23:34,240
So it should happen on the primary,
I think.

438
00:23:36,180 --> 00:23:37,060
Michael: That makes sense.

439
00:23:37,060 --> 00:23:37,560
Yeah.

440
00:23:38,940 --> 00:23:40,580
Logically on a physical replica.

441
00:23:40,580 --> 00:23:41,000
Nikolay: Right.

442
00:23:41,000 --> 00:23:41,500
Okay.

443
00:23:42,040 --> 00:23:42,880
Well, good.

444
00:23:42,900 --> 00:23:45,140
Michael: There's this one more open
question.

445
00:23:45,140 --> 00:23:45,640
Yeah.

446
00:23:46,300 --> 00:23:50,500
I saw, when this latest
blog post was shared to Reddit,

447
00:23:50,500 --> 00:23:53,920
there was a really interesting
comment that no one

448
00:23:53,920 --> 00:23:58,000
has replied to, and there's almost
no activity on it. But somebody

449
00:23:58,300 --> 00:24:01,060
posed the question, and I think it's
a good one: whether there might

450
00:24:01,060 --> 00:24:03,740
be a third possible cause of this.

451
00:24:03,740 --> 00:24:06,460
And that's, so it's more of a question
to the listeners because

452
00:24:06,460 --> 00:24:09,520
I'm not actually sure, I don't
know myself and haven't had the chance, you

453
00:24:09,520 --> 00:24:11,020
know, to confirm for sure.

454
00:24:11,040 --> 00:24:15,640
But there's a PD_ALL_VISIBLE flag
on pages.

455
00:24:16,500 --> 00:24:22,660
And that might be a third way: if
SELECTs can flip that based

456
00:24:22,660 --> 00:24:27,180
on visibility of all the rows,
or the tuples, in the page,

457
00:24:27,400 --> 00:24:30,660
then that might be a third case.
But I don't know if SELECTs

458
00:24:30,660 --> 00:24:33,780
can. So yeah, again, if you know,
I'll be really interested to hear.
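
For the curious, the flag itself can be inspected with pageinspect; 0x0004 is PD_ALL_VISIBLE in the page header (reusing the hypothetical table from the earlier sketch):

    SELECT flags, (flags & 4) <> 0 AS pd_all_visible
    FROM page_header(get_raw_page('hot_demo', 0));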

459
00:24:34,280 --> 00:24:38,920
Nikolay: Yeah, yeah. So I guess
that's it, right? Yeah.

460
00:24:39,660 --> 00:24:42,440
Michael: So yeah, great, great blog
post. There's also an old one

461
00:24:42,440 --> 00:24:47,180
from Nikolay Sivko on the Okmeter
blog that I'll share. So there are

462
00:24:47,180 --> 00:24:49,460
a few articles to read more about
this.

463
00:24:49,460 --> 00:24:52,660
Nikolay: Correction, Nikolay currently
has a new startup which

464
00:24:52,660 --> 00:24:53,260
is called Coroot.

465
00:24:53,260 --> 00:24:53,760
Yes.

466
00:24:53,780 --> 00:24:56,660
Yeah, so then Okmeter is his old
startup.

467
00:24:57,100 --> 00:25:00,980
And Peter Zaitsev, like, the founder
of Percona, joined Nikolay

468
00:25:01,020 --> 00:25:04,680
recently, not very long ago, and
I'm like, Coroot is a great

469
00:25:04,680 --> 00:25:05,180
project.

470
00:25:05,420 --> 00:25:08,900
But this, I guess this blog post
was created while Nikolay was

471
00:25:08,900 --> 00:25:12,840
working on the previous project
and just started observing the

472
00:25:12,840 --> 00:25:16,560
writes happening from SELECTs and
was very surprised and decided

473
00:25:16,560 --> 00:25:18,840
to blog about it.

474
00:25:19,640 --> 00:25:22,620
Michael: Yeah, well it was a monitoring
project as well, right?

475
00:25:22,960 --> 00:25:28,680
Nikolay: Yeah, it had very good
Postgres-related features, which

476
00:25:28,680 --> 00:25:31,840
I guess were in some ways
inherited by Coroot.

477
00:25:33,560 --> 00:25:36,920
But Coroot is amazing, just a couple
of words about it.

478
00:25:37,400 --> 00:25:43,040
If you, if you like flame graphs,
Coroot can show you them dynamically,

479
00:25:43,180 --> 00:25:47,740
like imagine you have a dashboard,
you choose a time range for Postgres

480
00:25:48,080 --> 00:25:56,500
and you see flame graphs but hanging
down so they are like 180

481
00:25:56,820 --> 00:26:00,260
degrees rotated or mirrored, right?

482
00:26:00,340 --> 00:26:01,420
Horizontally mirrored.

483
00:26:02,380 --> 00:26:06,220
You can see details. Like, if you
have debug symbols installed,

484
00:26:06,220 --> 00:26:08,960
I think everyone should have debug
symbols always installed.

485
00:26:09,380 --> 00:26:14,660
You see deep details of what's happening,
where time is spent.

486
00:26:15,380 --> 00:26:18,340
Dynamic diagnostics like this,
it's crazy.

487
00:26:18,580 --> 00:26:22,520
I think probably I should add this
to recommendations for Postgres

488
00:26:22,540 --> 00:26:25,960
platform builders, managed Postgres
platform builders, because

489
00:26:25,960 --> 00:26:30,420
considering Coroot is also like
a good thing for observability.

490
00:26:31,060 --> 00:26:32,340
Michael: It's open source, right?

491
00:26:32,900 --> 00:26:33,820
Nikolay: It has an open source version.

492
00:26:33,820 --> 00:26:36,980
I think it's like an open
core model.

493
00:26:36,980 --> 00:26:37,860
I'm not sure.

494
00:26:38,040 --> 00:26:39,980
But the main thing is definitely
open source.

495
00:26:40,120 --> 00:26:41,360
Again, I'm not sure.

496
00:26:41,760 --> 00:26:42,880
Michael: Thanks so much, Nikolay.

497
00:26:43,260 --> 00:26:44,440
Thanks everyone for listening.

498
00:26:44,440 --> 00:26:45,300
We'll catch you next week.

499
00:26:45,300 --> 00:26:46,060
Nikolay: Thank you, Michael.

500
00:26:46,060 --> 00:26:46,860
See you soon.