1
00:00:00,060 --> 00:00:02,960
Michael: Hello and welcome to Postgres
FM, a weekly show about

2
00:00:02,960 --> 00:00:04,160
all things PostgreSQL.

3
00:00:04,160 --> 00:00:05,640
I am Michael, founder of pgMustard.

4
00:00:05,680 --> 00:00:08,180
This is my co-host Nikolay, founder
of Postgres.AI.

5
00:00:08,220 --> 00:00:10,120
Hello Nikolay, what are we talking
about today?

6
00:00:10,380 --> 00:00:14,640
Nikolay: Hi Michael, we wanted
to talk about one of my favorite

7
00:00:14,640 --> 00:00:19,540
topics, massive delete, I usually
call it, but maybe you can

8
00:00:19,540 --> 00:00:22,060
call it somehow differently.

9
00:00:23,440 --> 00:00:24,800
Michael: Yeah, I really like this.

10
00:00:24,800 --> 00:00:28,100
So when you say massive delete,
do you mean deleting a lot of

11
00:00:28,100 --> 00:00:29,480
rows all at once?

12
00:00:30,180 --> 00:00:30,680
Right.

13
00:00:30,740 --> 00:00:32,540
Nikolay: In one SQL statement.

14
00:00:33,240 --> 00:00:39,040
So for example, 10 million rows,
1 billion rows, or just 1 million.

15
00:00:39,760 --> 00:00:40,740
It's good already.

16
00:00:41,160 --> 00:00:44,980
Good enough to feel how bad this
is for Postgres.

17
00:00:45,480 --> 00:00:48,840
So we can discuss in detail what's
happening if you do it.

18
00:00:50,380 --> 00:00:56,360
I like this type of thing because
it's not only bad, it's also

19
00:00:56,360 --> 00:00:56,860
good.

20
00:00:57,920 --> 00:01:02,140
And we can explore that as well
because if you do it on production,

21
00:01:02,220 --> 00:01:07,360
it's bad, but if you do it on purpose
in some testing environment,

22
00:01:08,160 --> 00:01:09,380
it can be very helpful.

23
00:01:10,320 --> 00:01:11,520
Michael: Oh, I see what you mean.

24
00:01:11,520 --> 00:01:13,520
So it's a useful tool for certain
things.

25
00:01:13,520 --> 00:01:14,020
Right.

26
00:01:14,440 --> 00:01:15,360
Okay, cool.

27
00:01:15,360 --> 00:01:16,640
Yeah, I get it.

28
00:01:16,640 --> 00:01:17,540
Nikolay: Stress test,

29
00:01:18,040 --> 00:01:18,540
Michael: yeah.

30
00:01:18,820 --> 00:01:21,960
So you mentioned this in a previous
episode, that it would be

31
00:01:21,960 --> 00:01:24,820
good to cover this in more detail,
and I was really interested.

32
00:01:26,120 --> 00:01:29,800
But I think it's important to say
what we're not gonna cover

33
00:01:29,800 --> 00:01:30,060
as well.

34
00:01:30,060 --> 00:01:34,040
So massive deletes, some people
might also be thinking of like

35
00:01:34,040 --> 00:01:37,200
some other things that sound like
deletes.

36
00:01:37,200 --> 00:01:41,820
Like some people often implement
soft deletes, which would be

37
00:01:41,820 --> 00:01:43,140
like implemented by updates.

38
00:01:43,140 --> 00:01:45,060
We're not going to talk about that
kind of thing.

39
00:01:45,060 --> 00:01:48,500
We're talking about only actual
Postgres deletes at this point.

40
00:01:48,560 --> 00:01:52,740
Nikolay: Some people complain why
soft deletes are not already

41
00:01:52,740 --> 00:01:56,880
implemented by default in the database
system, like, natively.

42
00:01:57,720 --> 00:02:00,300
Well then, maybe it's a good idea
to have, but...

43
00:02:00,660 --> 00:02:02,360
Yeah, we will skip this topic.

44
00:02:02,360 --> 00:02:06,140
So, maybe it's another episode
because it has interesting nuances.

45
00:02:06,600 --> 00:02:10,760
Yeah, but let's talk about what's
happening if you want to delete

46
00:02:10,760 --> 00:02:17,840
many, many rows, how to do it correctly
on production, not to

47
00:02:17,840 --> 00:02:22,100
bring your database to its knees, and
how to use it in non-production

48
00:02:22,440 --> 00:02:25,540
environments for something good,
for good reasons.

49
00:02:26,280 --> 00:02:26,780
Right?

50
00:02:27,040 --> 00:02:27,380
Michael: I like it.

51
00:02:27,380 --> 00:02:27,560
Where do

52
00:02:27,560 --> 00:02:28,540
Nikolay: you want to start?

53
00:02:28,740 --> 00:02:31,640
Michael: Maybe with some use cases,
like, when are you generally

54
00:02:31,640 --> 00:02:33,940
seeing people do these huge deletes?

55
00:02:34,600 --> 00:02:37,980
I've got a few examples I've seen,
but I'd be interested in yours.

56
00:02:38,440 --> 00:02:42,160
Nikolay: Well, usually people do
massive deletes by mistake.

57
00:02:43,480 --> 00:02:45,540
Not understanding how Postgres
works.

58
00:02:46,560 --> 00:02:50,820
And MVCC and so on, and they expect
it's a simple operation.

59
00:02:51,420 --> 00:02:56,380
They probably saw some warning
that you shouldn't expect immediate

60
00:02:56,960 --> 00:02:59,360
reduction of disk space used.

61
00:02:59,440 --> 00:03:03,040
For example, you delete from some
table and think, okay, I will

62
00:03:03,040 --> 00:03:03,840
free some space.

63
00:03:03,840 --> 00:03:08,480
No, you won't free some space because
it's only part of the work.

64
00:03:08,480 --> 00:03:13,080
Another part is Autovacuum, which
will need to actually delete

65
00:03:13,080 --> 00:03:13,800
those tuples.

66
00:03:14,640 --> 00:03:16,800
But, yeah, what?

67
00:03:17,100 --> 00:03:20,600
Michael: And even then, you won't
see, if you're monitoring disk

68
00:03:20,600 --> 00:03:24,060
space, you won't see that as free
space again.

69
00:03:25,760 --> 00:03:26,340
Nikolay: It depends.

70
00:03:26,400 --> 00:03:33,040
Autovacuum also can truncate relations
if some pages at the end

71
00:03:33,040 --> 00:03:37,060
are deleted, all tuples are deleted
and vacuumed.

72
00:03:38,300 --> 00:03:41,020
So physically deleted already,
right?

73
00:03:41,260 --> 00:03:45,920
Pages become free, it will truncate
them and it will reclaim disk

74
00:03:45,920 --> 00:03:46,420
space.

75
00:03:47,380 --> 00:03:48,160
Michael: Yeah, okay.

76
00:03:48,340 --> 00:03:51,540
But I feel like those cases are
the exception rather than the

77
00:03:51,540 --> 00:03:51,880
norm.

78
00:03:51,880 --> 00:03:55,140
I see a lot more cases where people
are deleting a subset of

79
00:03:55,140 --> 00:03:59,100
data from a non-partitioned table
and really surprised when their

80
00:03:59,100 --> 00:04:01,160
relation doesn't decrease in size.

81
00:04:01,320 --> 00:04:04,300
But just to cover a few of the
other basics, I think this is

82
00:04:04,300 --> 00:04:07,840
actually a topic that's coming
up more and more because we have

83
00:04:07,840 --> 00:04:11,880
these new privacy laws that require
people to be able to, to

84
00:04:11,880 --> 00:04:13,220
have the right to be forgotten.

85
00:04:13,940 --> 00:04:18,840
And deleting data is quite a safe
way of not having a data leak.

86
00:04:19,180 --> 00:04:22,100
The less data you have, the less
data that can be leaked security-wise.

87
00:04:22,960 --> 00:04:25,200
So I feel like there's these forcing
functions. 

88
00:04:25,200 --> 00:04:28,680
I think also people are starting
to get serious about having

89
00:04:29,060 --> 00:04:30,180
data retention policies.

90
00:04:30,180 --> 00:04:34,280
So how long should we keep each
type of data around for?

91
00:04:34,280 --> 00:04:36,980
So I think there are some of these
use cases that mean this is

92
00:04:36,980 --> 00:04:37,880
coming up more.

93
00:04:38,100 --> 00:04:40,760
And in the past, at least in the
last 10 years, I saw a lot more

94
00:04:40,760 --> 00:04:44,220
people kind of just storing everything
forever and not really

95
00:04:44,220 --> 00:04:45,400
thinking about it.

96
00:04:45,480 --> 00:04:48,080
So I didn't know if this is something
you're seeing more often

97
00:04:48,080 --> 00:04:50,440
or it's just an age-old problem.

98
00:04:51,220 --> 00:04:51,980
Nikolay: Well, yes.

99
00:04:52,040 --> 00:04:57,180
So I think GDPR and everything,
it's not that big because, you

100
00:04:57,180 --> 00:05:02,060
know, like, single user delete,
it's not, it's usually like some

101
00:05:02,220 --> 00:05:03,800
fraction of the whole, right?

102
00:05:03,800 --> 00:05:07,440
But usually, massive delete happens
when we need to clean up

103
00:05:07,440 --> 00:05:11,920
and understand there's some old
data which is not super useful,

104
00:05:12,240 --> 00:05:18,900
we pay for it, and we want to postpone
the moment when we need

105
00:05:18,900 --> 00:05:24,360
to scale our instances, database
instances, machines, and so

106
00:05:24,360 --> 00:05:24,860
on.

107
00:05:25,080 --> 00:05:30,600
In this case, in general, I would
like to mention before COVID,

108
00:05:31,020 --> 00:05:37,900
I went to the VLDB conference, which
is a conference I know since

109
00:05:37,900 --> 00:05:43,040
being a kid, basically, when I
was a student learning database

110
00:05:43,040 --> 00:05:43,540
theory.

111
00:05:44,060 --> 00:05:45,680
I've heard about this conference.

112
00:05:45,680 --> 00:05:50,420
So I went to it because it was
very close in Los Angeles.

113
00:05:51,060 --> 00:05:57,840
I remember a keynote, some researcher,
she was presenting the

114
00:05:58,180 --> 00:05:59,480
talk, which was interesting.

115
00:06:00,060 --> 00:06:01,620
It was about delete basically.

116
00:06:02,500 --> 00:06:07,060
She said this is exponential growth
of the total volumes

117
00:06:07,640 --> 00:06:10,100
of data in our databases in the
world.

118
00:06:10,160 --> 00:06:11,180
It was like phew.

119
00:06:12,260 --> 00:06:17,440
Because storage becomes cheaper,
we produce a lot of data, big

120
00:06:17,440 --> 00:06:19,440
data, huge data, and so on.

121
00:06:19,440 --> 00:06:22,040
This is like some zettabytes or
something.

122
00:06:22,040 --> 00:06:25,740
It's an insane curve, insane, with
some forecast.

123
00:06:26,160 --> 00:06:31,660
And she said, we spent decades
to learn how to store properly

124
00:06:31,680 --> 00:06:36,180
ACID, not to lose data, reliable,
highly available and so on.

125
00:06:36,180 --> 00:06:40,860
Now it's coming time to learn how
to clean up and delete.

126
00:06:41,120 --> 00:06:44,200
Understand which data can be safely
deleted, how to delete it

127
00:06:44,200 --> 00:06:45,780
efficiently and so on.

128
00:06:45,780 --> 00:06:47,060
And I was super inspired.

129
00:06:47,080 --> 00:06:50,660
I also did a talk at a conference
because around the same time

130
00:06:50,660 --> 00:06:52,000
I had a production incident.

131
00:06:53,000 --> 00:06:56,560
A senior engineer, a back-end engineer,
with a good understanding

132
00:06:57,660 --> 00:07:02,360
of analytical databases actually,
got a task to delete some old

133
00:07:02,360 --> 00:07:05,580
data, preparing for some marketing
campaign.

134
00:07:05,900 --> 00:07:10,420
Because the forecast was saying
we either need to invest a lot

135
00:07:10,680 --> 00:07:15,360
to upgrade or we need to clean
up before we do this huge marketing

136
00:07:15,360 --> 00:07:15,860
campaign.

137
00:07:16,640 --> 00:07:23,300
So he went to production and
just performed delete from table

138
00:07:23,300 --> 00:07:26,820
where created_at is older than 1
year.
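
(A minimal sketch of the single-statement form described here; the table name is assumed, and this is the risky pattern, not a recommendation:)

```sql
-- Hypothetical reconstruction of the incident query: one huge transaction
-- that dirties pages all over the heap and produces a burst of WAL.
DELETE FROM events  -- "events" is an assumed table name
WHERE created_at < now() - interval '1 year';
```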

139
00:07:27,280 --> 00:07:31,640
He estimated in advance that it
will be 10 million rows and we

140
00:07:31,640 --> 00:07:32,620
got an incident.

141
00:07:33,520 --> 00:07:37,860
Downtime, more than 10 minutes,
cost the company a lot of money.

142
00:07:38,200 --> 00:07:39,100
I was super impressed.

143
00:07:39,100 --> 00:07:40,140
A lot of money.

144
00:07:40,680 --> 00:07:45,360
And he almost left the company
himself because he was super embarrassed.

145
00:07:48,140 --> 00:07:51,220
And at the same time, this VLDB
conference about delete, and

146
00:07:51,220 --> 00:07:54,440
this, like, I'm saying, like, we
need something, right?

147
00:07:54,440 --> 00:07:58,700
We need, at least we need to educate
people that delete should

148
00:07:58,700 --> 00:07:59,840
be split into batches.

149
00:08:00,060 --> 00:08:04,040
Of course, if you have good disk,
maybe you'll be fine.

150
00:08:04,620 --> 00:08:06,980
But in that case, we had quite
slow...

151
00:08:06,980 --> 00:08:11,700
It was some enterprise level, but
some sun, I think, or some

152
00:08:11,820 --> 00:08:12,760
old system.

153
00:08:13,280 --> 00:08:14,420
It was on premise.

154
00:08:15,180 --> 00:08:19,120
And also, at the same time, I was
advocating this company, like

155
00:08:19,540 --> 00:08:22,420
I was helping as a consultant with
Postgres.

156
00:08:22,920 --> 00:08:26,700
The company was growing super fast,
a huge startup.

157
00:08:27,520 --> 00:08:31,920
And I was saying, guys, you do
need to increase the

158
00:08:31,980 --> 00:08:32,480
max_wal_size.

159
00:08:33,460 --> 00:08:35,040
1 gigabyte is not enough.

160
00:08:35,800 --> 00:08:41,420
So with the default max_wal_size
setting, an untuned checkpointer and

161
00:08:41,520 --> 00:08:43,040
quite slow storage.

162
00:08:43,100 --> 00:08:48,120
I think maximum throughput for
writes was maybe 600 megabytes

163
00:08:48,120 --> 00:08:49,440
per second, maybe 500.

164
00:08:49,820 --> 00:08:54,360
It's not like current modern NVMe
which gives you like 2 gigabytes

165
00:08:54,380 --> 00:08:55,620
per second, for example.

166
00:08:56,260 --> 00:09:00,840
These 2 factors plus the idea let's
delete 10 million rows, it's

167
00:09:00,840 --> 00:09:02,580
not a huge number, right?

168
00:09:05,660 --> 00:09:10,580
So checkpointer became crazy because
a lot of...

169
00:09:10,580 --> 00:09:11,420
What's happening?

170
00:09:12,040 --> 00:09:13,780
First of all, we find rows.

171
00:09:14,380 --> 00:09:16,720
I mean, executor finds rows.

172
00:09:17,260 --> 00:09:20,520
In that case, it was also unfortunate
that it was not a sequential

173
00:09:20,740 --> 00:09:21,240
pattern.

174
00:09:23,240 --> 00:09:29,140
So rows were ordered according
to created_at, so this starts

175
00:09:29,140 --> 00:09:33,620
sparsely and first you need to
put, Postgres needs to put xmax

176
00:09:34,280 --> 00:09:36,600
value to this hidden system column.

177
00:09:36,820 --> 00:09:39,580
It needs to put the current transaction
ID, the one which deletes, right?

178
00:09:40,600 --> 00:09:45,020
The value goes to the xmax column for
the first row, and the page becomes dirty.
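
(The hidden system columns can be selected explicitly if you name them; a small sketch with an assumed table name:)

```sql
-- xmax is 0 for a live tuple; a deleting transaction writes its ID there.
SELECT ctid, xmin, xmax
FROM events  -- assumed table name
LIMIT 5;
```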

179
00:09:45,600 --> 00:09:49,740
Dirty means in memory we have a
different page than on disk,

180
00:09:49,940 --> 00:09:53,760
so we need to flush it to disk
eventually.

181
00:09:54,960 --> 00:09:57,420
A checkpointer needs to do it,
basically.

182
00:09:58,180 --> 00:10:01,620
And then a different row was a different
tuple, basically, a physical

183
00:10:01,620 --> 00:10:05,520
row, a version of row in a different
page, so we dirty different

184
00:10:05,520 --> 00:10:06,020
page.

185
00:10:06,280 --> 00:10:13,040
So almost, like if you need to
update xmax in 100 rows, it's

186
00:10:13,040 --> 00:10:17,220
like almost 100 pages already dirtied,
very inefficiently already,

187
00:10:17,220 --> 00:10:20,700
because of this pattern of random
access basically.

188
00:10:21,040 --> 00:10:22,940
It's not random, but it's not sequential.

189
00:10:23,300 --> 00:10:24,440
This is the key, right?

190
00:10:24,760 --> 00:10:29,920
And then checkpointer sees, okay,
I have max_wal_size 1 gigabyte.

191
00:10:30,480 --> 00:10:33,560
And it was Postgres, I think, 9.5,
9.6.

192
00:10:33,920 --> 00:10:37,320
At that time, it means that real
distance between 2 checkpoints

193
00:10:37,360 --> 00:10:39,900
was 3 times lower.

194
00:10:40,520 --> 00:10:42,280
Only 300 megabytes.

195
00:10:42,280 --> 00:10:46,620
You can read in Yegor Rogov's book,
Postgres Internals, a very

196
00:10:46,620 --> 00:10:51,900
good explanation mentioning the
improvements and why it's not

197
00:10:51,900 --> 00:10:55,520
really even max_wal_size but
3 times smaller.

198
00:10:55,520 --> 00:10:58,680
Now I think it's either 2 times
smaller or something, I already

199
00:10:58,680 --> 00:10:59,740
don't remember.

200
00:11:00,020 --> 00:11:03,840
So it means that Checkpointer comes
and says okay it's time already,

201
00:11:03,900 --> 00:11:08,420
we already accumulated too many
dirty buffers, we need to,

202
00:11:08,420 --> 00:11:12,460
like, flush them to the page cache
first of all, and then pdflush,

203
00:11:13,780 --> 00:11:16,700
something like pdflush will go
and flush them to disk.

204
00:11:16,720 --> 00:11:20,280
So it starts working, producing
a lot of IO and it converts to

205
00:11:20,280 --> 00:11:21,440
disk IO already.

206
00:11:22,200 --> 00:11:28,940
And then, boom, different row happens
to be in the same page

207
00:11:28,940 --> 00:11:30,100
which was just flushed.

208
00:11:30,340 --> 00:11:31,480
It was just flushed.

209
00:11:32,780 --> 00:11:34,200
But we update it again.

210
00:11:34,200 --> 00:11:37,260
I mean, we update different tuples
on the same page, it becomes

211
00:11:37,260 --> 00:11:38,040
dirty again.

212
00:11:39,140 --> 00:11:42,180
And Checkpointer says, okay, again,
a lot of work.

213
00:11:42,670 --> 00:11:48,140
Michael: So it's duplicating effort
by flushing excessively.

214
00:11:48,480 --> 00:11:49,500
Is that what you're saying?

215
00:11:49,960 --> 00:11:50,140
Nikolay: Yeah.

216
00:11:50,140 --> 00:11:55,820
So the distance between checkpoints
was like 20 seconds only.

217
00:11:56,200 --> 00:12:01,720
Also, I learned that if I'm not
mistaken, I think many years

218
00:12:01,720 --> 00:12:06,220
already passed, checkpoints can
overlap sometimes for a couple

219
00:12:06,220 --> 00:12:06,840
of seconds.

220
00:12:07,200 --> 00:12:09,940
It's still happening, but it's
insane.

221
00:12:10,460 --> 00:12:12,540
So checkpoint produced a lot of
I/O.

222
00:12:12,560 --> 00:12:14,320
And the disk couldn't handle this I/O.

223
00:12:15,120 --> 00:12:19,520
And database became unresponsive
and we had a critical incident,

224
00:12:19,540 --> 00:12:24,620
basically downtime, failovers,
it's insane, everything became

225
00:12:24,620 --> 00:12:28,400
insane, the downtime was huge,
and that's not good.

226
00:12:28,940 --> 00:12:31,420
But as I remember, there are 2
effects.

227
00:12:32,080 --> 00:12:36,760
Not only do you need to flush the
same page multiple times if

228
00:12:36,760 --> 00:12:43,340
you have this non-sequential access
pattern, but also once the

229
00:12:43,340 --> 00:12:48,200
checkpointer or anything else made
the page clean, we have a

230
00:12:48,200 --> 00:12:49,020
different effect.

231
00:12:50,600 --> 00:12:51,640
Michael: Full-page images.

232
00:12:52,200 --> 00:12:52,560
Nikolay: Yes, yes.

233
00:12:52,560 --> 00:12:54,780
So full-page images, full-page
writes.

234
00:12:55,380 --> 00:12:58,920
Since full-page writes is on, it
means that after checkpoint,

235
00:12:59,620 --> 00:13:04,700
if we visit with our change, our
delete, this page again, it

236
00:13:04,700 --> 00:13:07,460
goes in full to the WAL, right?

237
00:13:08,760 --> 00:13:12,340
Michael: Whereas if it was in a
previous, if it had managed to

238
00:13:12,340 --> 00:13:15,480
sneak into the previous before
the check, the next checkpoint,

239
00:13:15,480 --> 00:13:18,140
it would have been a much smaller
amount of work.

240
00:13:18,420 --> 00:13:18,780
Nikolay: Yeah, yeah.

241
00:13:18,780 --> 00:13:23,240
If we had, for example, a huge,
a max_wal_size like 100 gigabytes,

242
00:13:23,800 --> 00:13:27,860
we have enough disk space, we can
afford it, we understand if

243
00:13:27,860 --> 00:13:32,040
we crash, startup time will be
longer, replica provisioning also

244
00:13:32,040 --> 00:13:35,520
takes longer, because the recovery
point

245
00:13:35,740 --> 00:13:36,660
Michael: takes time.

246
00:13:39,120 --> 00:13:43,120
Nikolay: And then in this case,
even if we have random, not random,

247
00:13:43,420 --> 00:13:47,280
almost non-sequential access pattern,
we visit the same page

248
00:13:47,280 --> 00:13:49,080
multiple times, not sequentially.

249
00:13:49,860 --> 00:13:52,960
And not only the checkpointer will
flush it just once instead

250
00:13:52,960 --> 00:13:58,520
of multiple times, but also only
the first change will go to

251
00:13:58,520 --> 00:14:02,900
the WAL as a full page, but subsequent
changes will be just

252
00:14:02,900 --> 00:14:04,140
presented as a tuple.

253
00:14:04,760 --> 00:14:10,640
And this means WAL generation
decreases significantly if we

254
00:14:10,640 --> 00:14:11,340
do this.

255
00:14:11,840 --> 00:14:16,680
So it was helpful to me as a consultant
because I finally found

256
00:14:17,380 --> 00:14:20,260
an argument that we need checkpoint
tuning.

257
00:14:21,060 --> 00:14:25,280
It was the beginning of my practice
in the US, so it was hard

258
00:14:25,280 --> 00:14:30,200
for me to find good arguments,
but this case showed okay.

259
00:14:30,360 --> 00:14:33,580
I just did a very good series of
experiments.

260
00:14:34,140 --> 00:14:36,000
You know I'm a big fan of experiments.

261
00:14:36,680 --> 00:14:41,720
So if you just start experimenting
with max_wal_size and run

262
00:14:41,720 --> 00:14:45,580
deletes each time, you just do
this 1 gigabyte.

263
00:14:46,100 --> 00:14:47,240
This is the I/O

264
00:14:47,900 --> 00:14:49,340
for this massive delete.

265
00:14:49,360 --> 00:14:53,200
Like 2 gigabytes, 4 gigabytes,
logarithmic approach.

266
00:14:53,940 --> 00:14:56,660
64 for example, or 128 gigabytes.

267
00:14:57,100 --> 00:14:59,120
And you can draw a good curve.

268
00:15:00,040 --> 00:15:00,540
Right?
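
(One iteration of such an experiment might look like this on a disposable test instance; pg_current_wal_lsn() and pg_wal_lsn_diff() are the modern function names, and \gset is psql-specific:)

```sql
ALTER SYSTEM SET max_wal_size = '16GB';  -- 1GB, 2GB, 4GB, ... on other runs
SELECT pg_reload_conf();

SELECT pg_current_wal_lsn() AS lsn_before \gset
-- ... run the same massive delete here ...
SELECT pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before')
) AS wal_generated;
```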

269
00:15:01,000 --> 00:15:02,140
Look at the I/O.

270
00:15:02,140 --> 00:15:06,860
And usually, I/O, if you also have
monitoring, if this delete

271
00:15:06,860 --> 00:15:11,140
takes like a minute or 2 minutes,
5 minutes, you can see monitoring

272
00:15:11,260 --> 00:15:15,460
that you have a plateau for your I/O because
it's saturated.

273
00:15:16,100 --> 00:15:17,780
Michael: Are you taking the total
I/O?

274
00:15:17,780 --> 00:15:20,800
So let's say you're doing the same
delete each time and you're

275
00:15:20,800 --> 00:15:24,480
looking at total I/O across the
time and showing that there's

276
00:15:24,800 --> 00:15:27,690
less I/O as we increase the...

277
00:15:27,690 --> 00:15:32,080
Nikolay: I just, yeah, I just,
well, I did several things in

278
00:15:32,080 --> 00:15:35,720
this experiment, and I think it's
worth maybe a good how-to article

279
00:15:35,800 --> 00:15:36,520
or something.

280
00:15:37,020 --> 00:15:41,060
Actually, it was 2 phases, 2 parts
of the whole experiment.

281
00:15:41,060 --> 00:15:45,040
First is to study this I/O behavior
and checkpointer behavior.

282
00:15:45,140 --> 00:15:50,080
So I did a snapshot of pg_stat_bgwriter,
which as we know,

283
00:15:50,380 --> 00:15:55,020
until the recent version, contains
not only the background writer,

284
00:15:55,120 --> 00:16:00,920
but also a checkpointer and backend
activity for cleaning dirty

285
00:16:00,920 --> 00:16:01,420
buffers.

286
00:16:01,720 --> 00:16:07,040
So I converted as usual, I converted
buffers to gigabytes because

287
00:16:07,080 --> 00:16:11,280
this is how any engineer can start
understanding what's happening.

288
00:16:11,280 --> 00:16:15,220
If you say buffers, nobody understands
except DBAs, right?

289
00:16:16,980 --> 00:16:22,180
If you multiply it by 8 kilobytes,
you have gigabytes, megabytes,

290
00:16:22,660 --> 00:16:24,060
everyone starts understanding.
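
(A sketch of that conversion; these pg_stat_bgwriter columns exist up to Postgres 16, after which the checkpointer counters moved to pg_stat_checkpointer:)

```sql
SELECT pg_size_pretty(buffers_checkpoint * 8192) AS flushed_by_checkpointer,
       pg_size_pretty(buffers_clean      * 8192) AS flushed_by_bgwriter,
       pg_size_pretty(buffers_backend    * 8192) AS flushed_by_backends
FROM pg_stat_bgwriter;
```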

291
00:16:24,660 --> 00:16:29,800
So I just showed that with default
setting, Checkpointer had

292
00:16:29,800 --> 00:16:31,020
a lot of I/O.

293
00:16:31,020 --> 00:16:32,980
And also, there was the WAL.

294
00:16:33,580 --> 00:16:35,300
Much more WAL was generated.

295
00:16:35,460 --> 00:16:39,720
But I also made screenshots of
monitoring showing that we had

296
00:16:39,720 --> 00:16:43,760
a plateau situation for disk I/O, disk
write I/O.

297
00:16:43,940 --> 00:16:49,780
But when we had already like 16
gigabytes, 32 gigabytes, we already

298
00:16:49,780 --> 00:16:54,720
see like a spiky pattern up and down
and this is good.

299
00:16:54,720 --> 00:16:55,375
It means we have room.

300
00:16:55,375 --> 00:16:56,260
Michael: Because of the batches?

301
00:16:57,040 --> 00:16:57,850
Why is it spiky?

302
00:16:57,850 --> 00:17:01,220
Nikolay: Yeah, some batching, something
like, yeah, checkpoint,

303
00:17:01,220 --> 00:17:04,020
that's also like, it's of course
checkpoint_completion_target,

304
00:17:04,020 --> 00:17:07,420
it's like close to 1, so it should
be spread, but it's like,

305
00:17:07,420 --> 00:17:08,850
this is batching there, obviously.

306
00:17:08,850 --> 00:17:10,040
And this is good.

307
00:17:10,040 --> 00:17:11,180
We must reiterate it:

308
00:17:11,295 --> 00:17:13,025
Plateau is bad.

309
00:17:13,140 --> 00:17:13,640
Michael: Right.

310
00:17:14,240 --> 00:17:14,457
So.

311
00:17:14,457 --> 00:17:16,623
Actually, it's a good point that
this helps,

312
00:17:16,623 --> 00:17:19,960
like tuning the checkpointer is
helping with massive deletes,

313
00:17:20,020 --> 00:17:21,700
regardless of whether we batch
or not.

314
00:17:21,700 --> 00:17:24,240
It's helping in the case where
we don't batch and it's helping

315
00:17:24,240 --> 00:17:25,680
in the case where we do batch.

316
00:17:26,320 --> 00:17:27,180
Is that right?

317
00:17:27,880 --> 00:17:31,720
Nikolay: Right now I'm talking
only about tuning, about

318
00:17:31,720 --> 00:17:33,960
how tuning would prevent this incident.

319
00:17:34,360 --> 00:17:40,520
I think it would convert P1 or
priority 1 or criticality 1 incident

320
00:17:40,520 --> 00:17:41,140
to P2.

321
00:17:41,600 --> 00:17:45,900
So basically we have a slow database
but it's not down.

322
00:17:48,340 --> 00:17:51,360
Because we just discussed this.

323
00:17:51,820 --> 00:17:54,840
A well-tuned checkpointer has
less I/O

324
00:17:54,840 --> 00:17:55,520
to perform.

325
00:17:55,600 --> 00:17:56,100
Michael: Okay.

326
00:17:56,260 --> 00:17:57,640
I understand the spikiness now.

327
00:17:57,640 --> 00:17:59,980
We're talking about that's when
the checkpoints are happening.

328
00:18:00,200 --> 00:18:00,700
Nikolay: Yeah.

329
00:18:00,740 --> 00:18:01,360
The I/O

330
00:18:01,360 --> 00:18:03,500
from the checkpointer, the disk I/O,

331
00:18:03,500 --> 00:18:04,700
was spiky, and that's good.

332
00:18:04,700 --> 00:18:07,220
That means we are not having a plateau,
we are not saturated.

333
00:18:07,340 --> 00:18:11,520
Of course, better if you know your
numbers, you know the limit

334
00:18:11,520 --> 00:18:16,120
of I/O your device, your storage device,
can handle, and you can draw

335
00:18:16,120 --> 00:18:19,180
this line on the graph and understand
how far you are from it.

336
00:18:19,860 --> 00:18:26,600
Basically, regular SRE practices
starting from utilization, saturation,

337
00:18:26,920 --> 00:18:31,220
errors, should be applied here to
study this incident and perform

338
00:18:31,220 --> 00:18:32,220
root cause analysis.

339
00:18:32,440 --> 00:18:33,480
And this was great.

340
00:18:33,480 --> 00:18:39,120
It was obvious that if we reach
like 16 gigabytes or 32 gigabytes,

341
00:18:39,680 --> 00:18:41,320
we are in much better shape.

342
00:18:41,320 --> 00:18:44,720
We just need to have the disk I/O. And
also, the second phase of the experiment.

343
00:18:44,720 --> 00:18:49,160
I think we had an episode about
max_wal_size and checkpoint tuning.

344
00:18:49,540 --> 00:18:53,580
So second phase, I won't go into
detail there, but second phase

345
00:18:53,580 --> 00:18:59,100
is that you do need to understand recovery
time in the worst situation.

346
00:18:59,680 --> 00:19:03,740
And I invented the term like double
worst situation, double unlucky.

347
00:19:04,120 --> 00:19:09,360
So worst situation if your database
crashed right before checkpoint

348
00:19:09,360 --> 00:19:13,940
completes and double unlucky if
at the same time you had massive

349
00:19:13,940 --> 00:19:15,480
delete or something like that.

350
00:19:15,940 --> 00:19:20,820
In this case it means a lot of
work during recovery.

351
00:19:21,380 --> 00:19:22,860
So yeah, that's it actually.

352
00:19:22,920 --> 00:19:28,380
This is what a massive delete looks
like in the wild, and you should

353
00:19:28,380 --> 00:19:29,240
avoid it.

354
00:19:29,700 --> 00:19:32,660
Michael: Yeah, So how do we then
go from the...

355
00:19:32,780 --> 00:19:35,460
I guess we've gone from priority
1 incident to priority 2.

356
00:19:35,460 --> 00:19:37,620
How do we make this not an incident
at all?

357
00:19:38,160 --> 00:19:38,660
Nikolay: Batching.

358
00:19:39,020 --> 00:19:40,780
So just split it to batches.

359
00:19:41,380 --> 00:19:45,080
I think there might be cases when
logically you cannot afford

360
00:19:45,080 --> 00:19:49,040
splitting to batches because you
must delete everything in 1

361
00:19:49,040 --> 00:19:49,540
transaction.

362
00:19:49,700 --> 00:19:53,420
But in my practice, I never saw
this.

363
00:19:53,420 --> 00:19:58,880
I mean, I always could convince
people to split to batches and

364
00:19:59,180 --> 00:20:00,980
do everything in different transactions.

365
00:20:02,020 --> 00:20:06,840
So in this case, we need just to
understand what is the ideal

366
00:20:07,020 --> 00:20:08,320
batch size for us.

367
00:20:08,360 --> 00:20:10,220
Not too small, not too big.

368
00:20:10,940 --> 00:20:15,140
Too small means a lot of transaction
overhead, too big, we just

369
00:20:15,140 --> 00:20:15,640
discussed.
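
(A minimal sketch of one batch, with assumed table and column names; a driving script repeats it until zero rows are deleted:)

```sql
WITH batch AS (
  SELECT ctid
  FROM events                                   -- assumed table name
  WHERE created_at < now() - interval '1 year'  -- assumed retention rule
  LIMIT 5000                                    -- tuned so one batch stays well under 1 second
)
DELETE FROM events
WHERE ctid IN (SELECT ctid FROM batch);
```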

370
00:20:18,620 --> 00:20:18,960
I was

371
00:20:18,960 --> 00:20:21,360
Michael: gonna say, I haven't seen
a delete use case that, like...

372
00:20:21,360 --> 00:20:24,280
we're deleting stuff, right? We don't
need it anymore. That's almost

373
00:20:24,280 --> 00:20:27,880
by definition: we don't need it
anymore. So why would

374
00:20:27,880 --> 00:20:30,720
you need it to all be gone, or none
of it to be gone? It doesn't...

375
00:20:30,720 --> 00:20:32,030
That doesn't make sense to me.

376
00:20:32,860 --> 00:20:35,640
Nikolay: Potentially there might
be a case when you don't want

377
00:20:35,640 --> 00:20:37,920
users to see parts of the

378
00:20:37,920 --> 00:20:38,620
Michael: old data.

379
00:20:39,900 --> 00:20:43,260
So you want them to not see anything,
but you'd rather they saw

380
00:20:43,260 --> 00:20:45,480
everything than part of it.

381
00:20:45,480 --> 00:20:48,960
Nikolay: Yeah, in this case, you
only adjust application logic

382
00:20:48,960 --> 00:20:53,140
so the application cannot reach
that data already, right?

383
00:20:53,140 --> 00:20:56,080
Even if it's present in the database,
but you hide it already,

384
00:20:56,120 --> 00:20:58,070
maybe like that, based on timestamps
or something.

385
00:20:59,600 --> 00:21:02,980
But again, this is just a theoretical
discussion.

386
00:21:03,160 --> 00:21:05,840
In practice I didn't see any cases
when we couldn't.

387
00:21:06,740 --> 00:21:11,760
The benefits for Postgres, with
its MVCC model, the benefits from

388
00:21:11,920 --> 00:21:17,800
batched deletes, are always much higher
than experiencing this pain

389
00:21:17,800 --> 00:21:18,760
and risks.

390
00:21:19,340 --> 00:21:24,100
So yeah, batches, and we know the
ideal size of batches, we discussed

391
00:21:24,100 --> 00:21:28,340
it many, many times, starting from
the very first episode we

392
00:21:28,340 --> 00:21:30,140
had almost 2 years ago.

393
00:21:30,480 --> 00:21:33,380
Michael: And very recently in the
Don't Do This episode, right?

394
00:21:33,740 --> 00:21:34,340
Nikolay: Yeah, yeah, yeah.

395
00:21:34,340 --> 00:21:40,120
So ideal, my recipe is just try
to be below 1 second, but maybe

396
00:21:40,120 --> 00:21:44,120
not below 100 milliseconds or 50.

397
00:21:44,500 --> 00:21:48,220
So this is the ideal size to me based
on human perception.

398
00:21:49,080 --> 00:21:49,360
Michael: Yeah.

399
00:21:49,360 --> 00:21:53,440
You mentioned something in the
recent episode about it degrading

400
00:21:53,480 --> 00:21:53,940
over time.

401
00:21:53,940 --> 00:21:58,940
So let's say you can, you get a
clone or you have a replica of

402
00:21:58,940 --> 00:22:03,440
production where you're testing
batch sizes and you get the sweet

403
00:22:03,440 --> 00:22:03,800
spot.

404
00:22:03,800 --> 00:22:09,080
Maybe you find out that you can
delete 5,000 rows in just under

405
00:22:09,080 --> 00:22:09,620
a second.

406
00:22:09,620 --> 00:22:12,340
And you think let's go with 5,000
as the batch size.

407
00:22:12,400 --> 00:22:13,040
You mentioned

408
00:22:14,600 --> 00:22:18,460
Nikolay: 1 hour later, you see
the same delete already takes

409
00:22:18,460 --> 00:22:19,080
1 minute.

410
00:22:19,080 --> 00:22:19,900
What's happening?

411
00:22:20,740 --> 00:22:23,560
Michael: Yeah, so I was gonna ask,
not only what's causing it,

412
00:22:23,560 --> 00:22:27,540
but how would you, like, do you
run the test for longer?

413
00:22:27,540 --> 00:22:30,040
Or like, how do you think about
that stuff?

414
00:22:30,360 --> 00:22:32,900
Nikolay: Well, you just need to
understand that when you delete

415
00:22:32,900 --> 00:22:37,840
or update, you produce dead tuples,
and it means that only the

416
00:22:37,840 --> 00:22:42,000
first part of 2 parts of the operation
is done.

417
00:22:42,120 --> 00:22:45,760
It's done sequentially by your
SQL query, but there is a second

418
00:22:45,760 --> 00:22:47,740
super important part, which is
vacuuming.

419
00:22:49,740 --> 00:22:53,800
So delete is not complete when
you see a transaction committed.

420
00:22:54,280 --> 00:22:57,720
It's complete only logically, but
physically we have these dead

421
00:22:57,720 --> 00:23:02,240
tuples, basically garbage unless
some transactions are still

422
00:23:02,240 --> 00:23:02,740
needed.

423
00:23:03,740 --> 00:23:05,940
And you need to understand vacuum.

424
00:23:06,780 --> 00:23:10,940
And actually in the same company
to perform delete properly,

425
00:23:11,320 --> 00:23:15,200
of course I split into batches and
then I was a big fan like, oh

426
00:23:15,200 --> 00:23:19,700
I want to create a simple query
which is basically stateless.

427
00:23:20,800 --> 00:23:26,020
I don't want to deal with remembering
last ID or last timestamp

428
00:23:26,360 --> 00:23:28,220
of previous batch.

429
00:23:28,780 --> 00:23:30,900
Basically no pagination.

430
00:23:32,720 --> 00:23:36,900
I don't want to have pagination
because they have a similar problem

431
00:23:36,900 --> 00:23:43,420
as pagination for selects showing
huge result set split to pages

432
00:23:43,420 --> 00:23:47,220
to users, similar problem here,
very similar, because it's the

433
00:23:47,220 --> 00:23:48,420
same problem, basically.

434
00:23:48,540 --> 00:23:53,540
So I wanted to do something like
stateless and I just relied

435
00:23:53,560 --> 00:23:58,580
on Postgres on some index, I checked
indexes used, and then already

436
00:23:58,580 --> 00:24:01,100
in production I saw this degradation.

437
00:24:01,860 --> 00:24:06,240
So degradation was because of a
lot of dead tuples and autovacuum

438
00:24:06,240 --> 00:24:09,120
couldn't catch up with my speed
of deletion.

439
00:24:09,400 --> 00:24:11,060
I also remember an interesting
thing.

440
00:24:11,060 --> 00:24:16,560
Since then, I'm a big fan of single-threaded
maintenance jobs.

441
00:24:16,980 --> 00:24:22,840
I implemented multiple threads
originally, but then I saw that

442
00:24:22,840 --> 00:24:26,700
even a single thread is too fast
for autovacuum.

443
00:24:27,500 --> 00:24:30,120
And we don't need parallelization
here.

444
00:24:30,120 --> 00:24:33,420
Of course, parallelization in my
case was based on select for

445
00:24:33,420 --> 00:24:34,840
update, skip locked.

446
00:24:35,280 --> 00:24:36,840
Like fancy, super cool.

447
00:24:37,540 --> 00:24:42,160
It was maybe like 5, 6 years ago,
like, let's do fancy stuff,

448
00:24:42,620 --> 00:24:45,120
we will have like 5 workers, let's
go.

449
00:24:45,480 --> 00:24:47,220
But you don't need it here.

450
00:24:47,220 --> 00:24:51,300
1 thread is enough to clean up.

451
00:24:52,200 --> 00:24:56,640
Because otherwise you need to speed
up vacuum somehow, and vacuum

452
00:24:56,640 --> 00:24:58,480
for a single table is always single-threaded.

453
00:24:59,160 --> 00:25:03,540
You cannot have multiple workers
cleaning up dead tuples from

454
00:25:03,540 --> 00:25:05,640
a single physical table.

455
00:25:05,680 --> 00:25:06,960
Michael: Non-partitioned, yeah.

456
00:25:07,120 --> 00:25:10,020
Nikolay: If it's partitioned, yes,
they can work on multiple

457
00:25:10,440 --> 00:25:13,580
partitions at the same time, but
if it's a single table, no.

458
00:25:15,060 --> 00:25:16,580
Michael: I want to come back to
partitioning.

459
00:25:17,020 --> 00:25:21,800
But in the meantime, like, so massive
delete also has a big impact

460
00:25:21,900 --> 00:25:25,960
on, like, I was thinking about
index only scans, not for the

461
00:25:25,960 --> 00:25:30,060
delete, but for other like concurrent
reads.

462
00:25:31,020 --> 00:25:34,200
Let's say we're in a situation
where we are deleting a lot of

463
00:25:34,200 --> 00:25:37,720
tuples from not necessarily random
but like lots of different

464
00:25:37,720 --> 00:25:41,000
pages; the visibility map is going
to be... well, there's going to

465
00:25:41,000 --> 00:25:44,700
be a lot of pages that have changes
and therefore can't be true

466
00:25:44,700 --> 00:25:48,820
index only scans. So I was thinking
for that case we might actually

467
00:25:48,820 --> 00:25:52,540
want to run a vacuum manually every
certain number of batches.

468
00:25:52,860 --> 00:25:55,600
Nikolay: Yeah, this is what I did
and I also thought maybe I

469
00:25:55,600 --> 00:25:56,040
should.

470
00:25:56,040 --> 00:25:56,740
Yeah, I did.

471
00:25:56,740 --> 00:26:01,300
I did manual, not manual, I mean
my script did it.

472
00:26:02,940 --> 00:26:07,780
But after n batches, like 100,000
batches, vacuum.
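
(In other words, something like this in the driving script, with N found experimentally:)

```sql
-- Issued after every N delete batches to clear dead tuples
-- and the index entries pointing at them:
VACUUM events;  -- assumed table name
```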

473
00:26:08,300 --> 00:26:10,140
Michael: And how did you determine
n?

474
00:26:11,140 --> 00:26:11,880
Nikolay: Good question.

475
00:26:12,180 --> 00:26:13,040
Well, experiments.

476
00:26:13,340 --> 00:26:20,460
So if you want to go stateless
and you know a lot of dead tuples

477
00:26:21,100 --> 00:26:26,520
will be a problem because the index
you use on primary key or

478
00:26:26,520 --> 00:26:31,120
on the creation timestamp, created_at,
this index also needs

479
00:26:31,120 --> 00:26:34,160
vacuum because it will be pointing
to dead tuples and that's

480
00:26:34,160 --> 00:26:38,300
why subsequent batches will be
slower, degradation over

481
00:26:38,300 --> 00:26:39,240
time will happen.

482
00:26:39,840 --> 00:26:45,360
But if you vacuum, you get rid
of these links to dead tuples

483
00:26:46,500 --> 00:26:48,280
And it becomes good again.

484
00:26:48,710 --> 00:26:54,980
So based on that, applying the
rule, we don't want our transactions

485
00:26:55,840 --> 00:27:01,180
to exceed 1 or 2 seconds because
users might notice some bad

486
00:27:01,180 --> 00:27:01,680
things.

487
00:27:02,260 --> 00:27:04,980
When we have a long transaction,
we also block...

488
00:27:05,980 --> 00:27:10,520
We also are some kind of a problem
for all autovacuum workers

489
00:27:10,520 --> 00:27:12,540
globally for our database.

490
00:27:13,480 --> 00:27:16,720
So if you see, oh, it degraded
to 2 seconds, this is the right

491
00:27:16,720 --> 00:27:18,220
time to run vacuum, right?

492
00:27:18,420 --> 00:27:21,300
But I eventually chose not to do
vacuum, actually.

493
00:27:23,100 --> 00:27:27,660
I had many iterations of my solution,
and finally I gave up and

494
00:27:27,660 --> 00:27:29,080
decided to go stateful.

495
00:27:29,720 --> 00:27:34,840
So I performed just this keyset
pagination, just based on the

496
00:27:34,840 --> 00:27:35,780
last timestamp.

497
00:27:36,900 --> 00:27:38,920
I select the next batch based on
that.

498
00:27:38,920 --> 00:27:41,680
It's super fast even if we had
a lot of dead tuples.
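
(A sketch of that stateful, keyset-paginated batch; names are assumed, :'last_created_at' is a psql-style variable the script carries between batches, and if created_at is not unique a tiebreaker column such as id is needed too:)

```sql
WITH batch AS (
  SELECT id
  FROM events
  WHERE created_at > :'last_created_at'         -- cursor from the previous batch
    AND created_at < now() - interval '1 year'
  ORDER BY created_at
  LIMIT 5000
)
DELETE FROM events e
USING batch b
WHERE e.id = b.id
RETURNING e.created_at;  -- the script keeps max(created_at) as the new cursor
```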

499
00:27:42,700 --> 00:27:43,800
So I don't care.

500
00:27:44,100 --> 00:27:48,740
Because autovacuum is good at skipping
pages it can skip.

501
00:27:48,820 --> 00:27:50,740
It's not like 10 or 15 years ago.

502
00:27:50,740 --> 00:27:52,220
It's already quite good.

503
00:27:52,260 --> 00:27:55,280
So multiple runs versus just single
big run.

504
00:27:56,380 --> 00:27:58,600
This is a good question, by the
way, worth exploring.

505
00:27:58,780 --> 00:28:05,100
Is it good to run vacuum quite
often in this case, versus let's

506
00:28:05,100 --> 00:28:06,360
run it just once.

507
00:28:06,680 --> 00:28:09,720
In terms of how much I/O it will produce,
in terms of how many disk

508
00:28:09,720 --> 00:28:13,940
operations in total, this is a
good experiment to conduct and

509
00:28:13,940 --> 00:28:15,040
compare these numbers.

510
00:28:15,540 --> 00:28:18,220
Michael: Well, and doesn't it depend
a little bit on like, I

511
00:28:18,220 --> 00:28:20,140
think it depends a bit on your
workload, right?

512
00:28:20,140 --> 00:28:25,380
If you have no other concurrent
users, running 1 vacuum is likely

513
00:28:25,380 --> 00:28:27,280
more efficient than running lots
of small ones.

514
00:28:27,280 --> 00:28:30,140
But if you have other concurrent
users and you're forcing heap

515
00:28:30,140 --> 00:28:35,440
fetches, then maybe your IO impact
from those reads outweighs

516
00:28:35,860 --> 00:28:38,040
the duplicate effort on vacuum.

517
00:28:38,340 --> 00:28:41,200
Or even when I say duplicate effort
on vacuum, I guess we're

518
00:28:41,200 --> 00:28:42,100
talking about

519
00:28:42,800 --> 00:28:44,640
Nikolay: the same kind of a similar...

520
00:28:46,460 --> 00:28:46,788
Yeah, you're bringing up a good
point here.

521
00:28:46,788 --> 00:28:50,220
If we see that we want to run a single
vacuum, not frequent ones, we

522
00:28:50,220 --> 00:28:55,020
will live with a lot of buffers
which are out of the visibility

523
00:28:55,080 --> 00:28:55,580
map.

524
00:28:55,840 --> 00:28:56,340
Yeah.

525
00:28:56,460 --> 00:28:58,940
Yeah, and index-only scans will
degrade.

526
00:28:59,240 --> 00:29:01,740
But you add complexity here.

527
00:29:01,740 --> 00:29:02,940
I know, I know.

528
00:29:03,820 --> 00:29:06,700
In my case, I don't care about
other users for now.

529
00:29:06,700 --> 00:29:08,040
I like this approach.

530
00:29:08,340 --> 00:29:12,560
First you study the problem alone
and then you start adding what

531
00:29:12,560 --> 00:29:14,680
you did, like, oh, we have other
users.

532
00:29:15,120 --> 00:29:18,840
Let's already understand the problem
in a vacuum, so to speak.

533
00:29:20,340 --> 00:29:20,740
Michael: I know what

534
00:29:20,740 --> 00:29:21,220
Nikolay: you mean.

535
00:29:21,220 --> 00:29:22,240
In single user mode, basically.

536
00:29:22,660 --> 00:29:26,400
Then you start adding considerations,
what about other users?

537
00:29:26,460 --> 00:29:30,840
If you know how Postgres behaves,
how your workload behaves for

538
00:29:30,840 --> 00:29:33,960
single user situation, it's already
better.

539
00:29:34,280 --> 00:29:38,680
You are much more confident when
you start looking at others

540
00:29:38,680 --> 00:29:39,240
as well.

541
00:29:39,240 --> 00:29:40,580
Yeah, it's a good point.

542
00:29:41,040 --> 00:29:44,720
If we rely on index-only scans
a lot, probably we should vacuum

543
00:29:44,720 --> 00:29:46,720
frequently to avoid heap fetches.

544
00:29:47,080 --> 00:29:51,440
You saw it in your plans at pgMustard
quite a lot, right?

545
00:29:52,200 --> 00:29:53,840
Michael: Yeah, it's quite a common
one.

546
00:29:53,840 --> 00:30:01,660
It's not the biggest issue, but
it is, especially range scans

547
00:30:01,720 --> 00:30:04,820
where people are doing like returning
a lot of rows or doing

548
00:30:04,820 --> 00:30:08,300
aggregates that rely on index only
scans. On these kind of, you

549
00:30:08,300 --> 00:30:12,100
know, like, hybrid workloads, sort
of analytical queries, they

550
00:30:12,100 --> 00:30:15,820
can degrade quite quickly in certain
environments if you go from

551
00:30:15,820 --> 00:30:20,080
an index-only scan across a few
thousand pages to an index-only

552
00:30:20,080 --> 00:30:21,860
scan with a few thousand heap fetches.

553
00:30:23,100 --> 00:30:26,680
Nikolay: Yeah, so your deletes
can be a problem for these.

554
00:30:26,920 --> 00:30:31,960
So, selects might degrade because
the visibility map is outdated.

555
00:30:32,380 --> 00:30:32,880
Michael: Exactly.

556
00:30:33,480 --> 00:30:35,900
Nikolay: So yeah, more frequent
vacuums might be good.

557
00:30:36,100 --> 00:30:38,900
It's an interesting discussion,
interesting point as well.

558
00:30:39,240 --> 00:30:41,460
But in my case, I liked the...

559
00:30:42,120 --> 00:30:46,320
When I gave up my idea to be stateless,
which I liked so much, like

560
00:30:46,320 --> 00:30:49,520
I just remember this, and remember,
I just...

561
00:30:50,360 --> 00:30:55,360
I also moved to a single-threaded
approach, forgetting about this

562
00:30:55,380 --> 00:30:56,780
select for update, skip locked.

563
00:30:58,080 --> 00:31:02,580
And so I just need 1 value always
to remember. So easy.

564
00:31:02,980 --> 00:31:07,960
And then performance became very
predictable, reliable, stable.

565
00:31:08,680 --> 00:31:14,080
So all batches were the same size
and latencies were the same.

566
00:31:15,660 --> 00:31:18,800
Also interesting, when you do it,
you start thinking, oh, I want

567
00:31:18,800 --> 00:31:19,840
a progress bar basically.

568
00:31:19,960 --> 00:31:23,460
I want to predict ETA, how much
time is left.

569
00:31:24,440 --> 00:31:30,640
And if it's degrading or not stable,
this latency of each delete,

570
00:31:30,900 --> 00:31:32,640
you cannot predict reliably.

571
00:31:32,800 --> 00:31:36,820
But once you perform keyset pagination
here, you know your

572
00:31:36,820 --> 00:31:40,320
batch, it takes like 300 milliseconds
each batch.

573
00:31:40,320 --> 00:31:40,820
Great.

574
00:31:42,520 --> 00:31:46,400
I integrated it into my SQL with
a progress bar.

575
00:31:46,400 --> 00:31:50,720
It reported the percentage, how
much was left, and the ETA.

576
00:31:51,820 --> 00:31:52,220
Michael: So, quite

577
00:31:52,220 --> 00:31:53,160
Nikolay: an important thing.

578
00:31:53,160 --> 00:31:55,840
Michael: A quick question on the
ordering of batches.

579
00:31:55,840 --> 00:31:58,580
Do you try and do any kind of natural…
You mentioned created_at

580
00:31:58,580 --> 00:32:01,420
being a little bit random,
but I would have expected, unless

581
00:32:01,840 --> 00:32:06,900
the tuples are getting updated
a fair amount, that created_at

582
00:32:06,900 --> 00:32:09,520
might be a relatively natural ordering.

583
00:32:09,520 --> 00:32:12,080
Nikolay: It should be a very good
correlation between the

584
00:32:12,080 --> 00:32:13,740
physical location and created_at.

585
00:32:13,740 --> 00:32:14,980
You can check it easily.

586
00:32:15,400 --> 00:32:22,500
Select ctid, created_at, and order
by created_at, some limit.

587
00:32:22,740 --> 00:32:26,660
And you can see ctid is the physical
location and you can understand

588
00:32:26,660 --> 00:32:30,520
how correlated physical location
is with created_at.
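
(Spelled out, the check looks something like this; the table name is assumed:)

```sql
-- ctid is (block number, offset within block); if the block numbers grow
-- roughly monotonically here, physical order correlates with created_at.
SELECT ctid, created_at
FROM events
ORDER BY created_at
LIMIT 100;
```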

589
00:32:30,860 --> 00:32:36,000
In my case it did have a lot of
updates in the past, so it was

590
00:32:36,760 --> 00:32:38,660
basically everywhere in the table.

591
00:32:39,860 --> 00:32:41,760
That's why the incident happened,
actually.

592
00:32:41,820 --> 00:32:42,340
Yeah, it

593
00:32:42,340 --> 00:32:43,020
Michael: makes sense.

594
00:32:44,150 --> 00:32:46,240
Nikolay: So, yeah.

595
00:32:46,240 --> 00:32:47,120
Oh, no.

596
00:32:48,360 --> 00:32:53,100
I ordered by modified_at and I
even considered creating an index

597
00:32:53,100 --> 00:32:54,440
on modified_at.

598
00:32:54,640 --> 00:32:56,540
But this is an anti-pattern to me.

599
00:32:57,800 --> 00:33:02,840
An index on modified_at is a problem
because you start losing hot

600
00:33:02,840 --> 00:33:07,280
updates, heap-only tuple updates
immediately because each update

601
00:33:07,460 --> 00:33:12,140
needs, usually by trigger or something,
to change this value,

602
00:33:12,340 --> 00:33:17,500
and if you have an index on it,
by definition, Postgres cannot

603
00:33:17,500 --> 00:33:22,440
implement heap-only tuple updates,
and it means you need to deal

604
00:33:22,440 --> 00:33:25,740
with index write amplification
problem, which is nasty.

605
00:33:26,540 --> 00:33:29,440
Michael: So you're optimizing your
delete, but you're messing

606
00:33:29,440 --> 00:33:30,260
up your updates.

607
00:33:30,820 --> 00:33:31,180
Nikolay: So now

608
00:33:31,180 --> 00:33:32,780
Michael: you are considering other
users.

609
00:33:33,080 --> 00:33:34,860
Nikolay: I remember I went deeper.

610
00:33:35,160 --> 00:33:40,840
I decided, okay, honestly, I don't
want this index to be built.

611
00:33:40,840 --> 00:33:45,520
I know it will bite me back because
of heap-only tuples.

612
00:33:46,160 --> 00:33:47,440
I don't want to lose them.

613
00:33:47,440 --> 00:33:52,260
I checked statistics and I saw
we have a good ratio of them among

614
00:33:52,260 --> 00:33:52,800
all updates.

615
00:33:52,800 --> 00:33:54,260
We have a lot of hot updates.
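
(A sketch of such a check against the cumulative statistics views:)

```sql
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```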

616
00:33:54,480 --> 00:33:56,400
So I decided, you know what I decided?

617
00:33:56,400 --> 00:34:00,960
I decided to rely on the created_at
index, but then somehow perform

618
00:34:00,960 --> 00:34:04,060
logical replication with modified_at based
on partial correlation.

619
00:34:04,060 --> 00:34:05,460
It was something crazy.

620
00:34:05,660 --> 00:34:06,780
It worked quite well.

621
00:34:06,780 --> 00:34:10,300
I don't remember details, but I
did some tricks there.

622
00:34:11,140 --> 00:34:12,420
Avoiding this.

623
00:34:12,580 --> 00:34:16,320
At first I already coordinated
index creation with the guys.

624
00:34:16,600 --> 00:34:18,120
They said, oh yeah, let's do it.

625
00:34:18,120 --> 00:34:19,540
But then I said, no, no, no, no.

626
00:34:19,540 --> 00:34:23,580
In a different company, like 1 year
before that, I already had

627
00:34:23,580 --> 00:34:25,760
made this mistake.

628
00:34:26,100 --> 00:34:29,820
I suggested some index and then
degradation of updates happened

629
00:34:29,820 --> 00:34:31,460
because we lost hot updates.

630
00:34:31,960 --> 00:34:36,680
Actually, that incident led me
to the idea we need to verify

631
00:34:36,740 --> 00:34:42,440
all changes, holistically checking
all queries, ideally, what

632
00:34:42,440 --> 00:34:42,900
we have.

633
00:34:42,900 --> 00:34:45,980
And this experimentation approach
and so on.

634
00:34:46,280 --> 00:34:47,820
It's all connected and interesting.

635
00:34:48,120 --> 00:34:52,480
But maybe let's skip this, since
we are out of time.

636
00:34:52,540 --> 00:34:53,600
I have a question.

637
00:34:53,940 --> 00:34:55,320
Michael: 2 more quick, 2 more ones.

638
00:34:55,320 --> 00:34:59,100
I guess they're not quick, but
when we're done with this large

639
00:34:59,100 --> 00:35:02,920
delete, what do you, like, are
there any kind of maintenance

640
00:35:03,160 --> 00:35:05,720
tidy up tasks that you recommend
doing?

641
00:35:05,820 --> 00:35:08,720
I was thinking like we did a whole
episode on index maintenance,

642
00:35:08,720 --> 00:35:12,160
I think rebuilding index is concurrently
my...

643
00:35:12,620 --> 00:35:15,880
Nikolay: This question exactly
is a good bridge to what I wanted

644
00:35:15,880 --> 00:35:17,520
to discuss as the last topic.

645
00:35:17,720 --> 00:35:22,320
I wanted to discuss how massive
delete can be useful in non-production.

646
00:35:22,760 --> 00:35:23,100
Michael: Okay,

647
00:35:23,100 --> 00:35:23,600
Nikolay: yeah.

648
00:35:23,740 --> 00:35:26,600
And let's talk about this, why
your question is a bridge.

649
00:35:26,600 --> 00:35:32,320
Because now I remember very well,
but in my career I forgot about

650
00:35:32,320 --> 00:35:33,840
this many, many, many times.

651
00:35:34,380 --> 00:35:36,800
Delete doesn't touch indexes at
all.

652
00:35:37,660 --> 00:35:38,660
This is the key.

653
00:35:38,760 --> 00:35:43,840
So, with delete, indexes
are not touched at all during

654
00:35:43,840 --> 00:35:44,340
delete.

655
00:35:44,540 --> 00:35:46,980
Delete just puts xmax, that's it.

656
00:35:47,380 --> 00:35:51,180
And if the transaction is considered
committed, then vacuum will

657
00:35:51,220 --> 00:35:55,460
remove this, and also if no other
transactions need this version,

658
00:35:55,580 --> 00:35:56,940
they're already in the future.

659
00:35:56,940 --> 00:35:59,440
In this case, the tuple will be deleted.

660
00:35:59,760 --> 00:36:04,320
And also links to this tuple in
the indexes will be deleted asynchronously

661
00:36:04,480 --> 00:36:09,240
by a vacuum, but our backend is
not dealing with indexes at all.

662
00:36:09,240 --> 00:36:16,840
It only needs to use one index to
find the scope of work, the tuples

663
00:36:16,840 --> 00:36:17,900
we need to touch.

664
00:36:17,900 --> 00:36:22,300
Oh, well, also during planning,
as we know, all indexes are considered

665
00:36:22,300 --> 00:36:25,940
by a planner and an access share lock
is acquired.

666
00:36:26,520 --> 00:36:29,440
So I don't see any index maintenance
here at all.

667
00:36:29,440 --> 00:36:32,280
Michael: Well, I was just thinking
about them being, let's say

668
00:36:32,280 --> 00:36:34,240
we deleted 10% of our table.

669
00:36:34,280 --> 00:36:36,560
Nikolay: Well, yeah, you're right.

670
00:36:36,900 --> 00:36:41,540
Right, so if we don't do frequent
vacuuming...

671
00:36:42,800 --> 00:36:45,300
Michael: Even if we do frequent
vacuuming, there'll still be

672
00:36:45,300 --> 00:36:50,320
like 10% bloat, especially if it's
like by created_at or something

673
00:36:50,320 --> 00:36:52,320
where we're deleting old data.

674
00:36:52,820 --> 00:36:56,580
Nikolay: Yes, vacuum doesn't rebalance
the B-tree, so it will probably be in

675
00:36:56,580 --> 00:36:58,000
worse shape.

676
00:36:59,540 --> 00:37:04,280
I would say if we deleted a big
fraction of the table data,

677
00:37:04,280 --> 00:37:08,840
we probably need to check bloat
in indexes and probably we'll

678
00:37:08,840 --> 00:37:11,460
need to rebuild them.

679
00:37:11,720 --> 00:37:16,900
But I already got used to relying
on automated index rebuild jobs,

680
00:37:17,220 --> 00:37:20,340
which we suggest implementing on
weekends or something.
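
As a sketch (index name hypothetical; pgstattuple is one of several ways to check): after a massive delete and vacuum, look at leaf density, then rebuild online.

```sql
-- Low avg_leaf_density suggests a bloated B-tree:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('t_pkey');

-- Rebuild without blocking reads/writes (Postgres 12+):
REINDEX INDEX CONCURRENTLY t_pkey;
```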

681
00:37:21,040 --> 00:37:23,180
Michael: But this feels like a
good time to reconsider.

682
00:37:23,680 --> 00:37:25,660
Anyway, it felt worth mentioning.

683
00:37:25,760 --> 00:37:29,080
I know you want to bridge to that
one, but I had one more quick one.

684
00:37:29,160 --> 00:37:33,420
And that's like, more and more
I'm thinking about how some of these

685
00:37:33,420 --> 00:37:38,420
use cases are deleting, like a
multi-tenant SaaS application

686
00:37:38,460 --> 00:37:40,180
wants to delete a single tenant.

687
00:37:40,840 --> 00:37:43,720
And if you've partitioned, like
I was thinking about partitioning

688
00:37:44,060 --> 00:37:47,980
and detaching, like, another
way of avoiding big deletes is

689
00:37:47,980 --> 00:37:50,420
if you can just drop a whole partition.

690
00:37:50,740 --> 00:37:51,820
Nikolay: Or truncate it.

691
00:37:51,820 --> 00:37:52,620
Yeah, depending.

692
00:37:53,320 --> 00:37:55,080
Michael: Sorry, yeah, like, you
know, detached.

693
00:37:55,080 --> 00:37:57,940
Nikolay: PgQ, long ago Skype implemented
this for queue-like workloads:

694
00:37:58,100 --> 00:38:02,180
three partitions,
and then, like, a round-robin

695
00:38:02,620 --> 00:38:06,240
approach, truncating when it's possible.

696
00:38:06,900 --> 00:38:11,400
It's much faster, it's much better.

697
00:38:12,180 --> 00:38:15,620
No work for many components, and
it's good.

698
00:38:15,620 --> 00:38:17,460
Yeah, cool.

699
00:38:17,620 --> 00:38:18,090
I agree.

700
00:38:18,090 --> 00:38:22,000
And indexes on partitions
are physical, so if a partition

701
00:38:22,060 --> 00:38:23,720
is truncated, they are truncated too.

702
00:38:23,800 --> 00:38:24,560
So, great.
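
A sketch with hypothetical names, assuming a range-partitioned table events:

```sql
-- Instead of DELETE FROM events WHERE ts < ... ,
-- detach and drop the whole old partition:
ALTER TABLE events DETACH PARTITION events_2024_01;
DROP TABLE events_2024_01;

-- Or, keeping the partition attached:
TRUNCATE events_2024_01;
```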

703
00:38:25,080 --> 00:38:28,600
Yeah, and let's touch a little
bit on the last topic.

704
00:38:28,820 --> 00:38:30,200
Why massive delete is useful.

705
00:38:30,660 --> 00:38:36,140
So I was already, like, you know,
into Database Lab and ZFS, like,

706
00:38:36,140 --> 00:38:40,680
branching, thin cloning and iteration,
starting from the idea

707
00:38:40,680 --> 00:38:44,900
of experimentation: it should start
from the same point in each

708
00:38:44,900 --> 00:38:45,340
iteration.

709
00:38:45,340 --> 00:38:49,340
When you compare things,
you need to compare apples versus

710
00:38:49,340 --> 00:38:50,140
apples, right?

711
00:38:50,580 --> 00:38:51,860
Always apples to apples.

712
00:38:51,860 --> 00:38:55,880
So each iteration must start from
the same state of the database.

713
00:38:56,200 --> 00:39:00,160
And this is difficult if the table
is huge, the database is huge, and

714
00:39:00,160 --> 00:39:02,980
you want to test it on big hardware.

715
00:39:04,440 --> 00:39:08,980
In the case of single-threaded behavior,
if you don't care about

716
00:39:09,000 --> 00:39:12,560
the checkpointer or vacuuming and so
on, usually thin clones that the

717
00:39:12,560 --> 00:39:15,920
Database Lab Engine provides are great
if you just study plans and

718
00:39:15,920 --> 00:39:16,440
so on.

719
00:39:16,440 --> 00:39:21,200
But in this case,
we need to consider the same

720
00:39:21,200 --> 00:39:24,780
file system and we need to understand
checkpointer behavior,

721
00:39:24,920 --> 00:39:26,820
WAL, vacuum, like everything.

722
00:39:26,980 --> 00:39:31,340
So we need a dedicated clone when
we perform this experiment

723
00:39:31,720 --> 00:39:35,420
for checkpoint tuning and to study
this behavior.

724
00:39:35,820 --> 00:39:39,680
In this case, for this tuning,
I found it super cool that this kind

725
00:39:39,680 --> 00:39:43,480
of workload gives us a very good,
interesting tool.

726
00:39:44,440 --> 00:39:48,840
If we perform a massive delete but
don't commit, rolling back instead,

727
00:39:49,300 --> 00:39:51,200
the physical layout remains the
same.

728
00:39:51,200 --> 00:39:54,320
I mentioned it multiple times in
previous episodes and this is

729
00:39:54,320 --> 00:39:55,960
a super cool observation.

730
00:39:56,540 --> 00:40:00,260
You can just do begin, delete, rollback;
begin, delete, rollback, massively.
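
As a sketch (table name hypothetical): each round generates WAL and dirties pages, yet the visible data never changes, so every iteration starts from the same state.

```sql
BEGIN;
DELETE FROM t;   -- sets xmax on every tuple, touches no index
ROLLBACK;        -- aborted: tuples stay alive, layout unchanged
```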

731
00:40:01,120 --> 00:40:05,280
This puts good pressure on your
WAL system, right?

732
00:40:05,280 --> 00:40:07,520
The checkpointer too, it's cool.

733
00:40:07,660 --> 00:40:12,500
And it means that, of course, for
vacuum it won't do anything

734
00:40:12,500 --> 00:40:17,560
because the transaction is marked
as rolled back, so even if xmax

735
00:40:17,620 --> 00:40:23,080
is already updated, the tuples
survive, they're still alive.

736
00:40:23,680 --> 00:40:26,660
But Postgres generates a lot of
work for such a workload.

737
00:40:27,260 --> 00:40:31,320
It also puts pressure on the backup
system, on the replication system,

738
00:40:31,820 --> 00:40:34,980
and also the checkpointer has a lot
of work.

739
00:40:35,280 --> 00:40:41,840
So it means you can stress test
many components just with

740
00:40:41,840 --> 00:40:42,940
this kind of workload.

741
00:40:43,260 --> 00:40:48,480
And I know we mentioned it last week,
right? We had Sai as a guest.

742
00:40:49,180 --> 00:40:54,940
I suggest it to those who haven't watched
that episode, it was a super interesting

743
00:40:54,940 --> 00:41:00,140
discussion about logical replication
with PeerDB founder Sai.

744
00:41:01,020 --> 00:41:04,780
So I mentioned that this is exactly
how I found that we can easily

745
00:41:04,780 --> 00:41:08,400
reach and saturate a single core
on the WAL sender.

746
00:41:09,520 --> 00:41:12,180
So you just delete a million rows,
rollback.

747
00:41:13,440 --> 00:41:20,880
It spams WAL, and the WAL sender, hitting
100% of a single vCPU,

748
00:41:21,140 --> 00:41:23,440
becomes the bottleneck quite
quickly.
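
One way to see this pressure, a psql-flavored sketch (\gset is a psql feature; table name hypothetical):

```sql
-- Measure WAL generated by one delete+rollback round:
SELECT pg_current_wal_lsn() AS before_lsn \gset
BEGIN;
DELETE FROM t;
ROLLBACK;
SELECT pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn')
) AS wal_generated;
```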

749
00:41:23,600 --> 00:41:28,000
Fortunately, in production, as
I mentioned, I couldn't find such a

750
00:41:28,260 --> 00:41:31,020
situation, but in non-production
it's easy to reproduce.

751
00:41:31,780 --> 00:41:38,000
So I found this a very handy tool
to stress test Postgres, for

752
00:41:38,000 --> 00:41:39,820
checkpoint tuning and so on.

753
00:41:40,160 --> 00:41:43,180
That's why massive delete is not only
bad but also good.

754
00:41:44,440 --> 00:41:44,720
Useful.

755
00:41:44,720 --> 00:41:46,620
But only when being rolled back.

756
00:41:47,320 --> 00:41:47,820
Yeah.

757
00:41:48,820 --> 00:41:54,520
Isn't it funny and exciting that
the physical layout of the table

758
00:41:54,520 --> 00:41:55,260
doesn't change?

759
00:41:55,440 --> 00:41:57,440
You did a delete and a rollback, but
it's the same.

760
00:41:57,440 --> 00:41:57,740
This is

761
00:41:57,740 --> 00:42:00,380
Michael: one of those things where
if you talk to Oracle guys,

762
00:42:00,380 --> 00:42:04,440
you'll get them sweating by this
point because of a different implementation.

763
00:42:05,460 --> 00:42:08,860
It's only because of Postgres'
specific MVCC implementation.

764
00:42:10,260 --> 00:42:15,380
Because in Oracle they have the
undo log, it's like the opposite

765
00:42:15,380 --> 00:42:15,880
trade-off.

766
00:42:15,940 --> 00:42:19,900
So it's so interesting that the
implementation details are so...

767
00:42:20,380 --> 00:42:23,500
It's good to know them in order
to work out how to test things.

768
00:42:24,100 --> 00:42:28,200
Nikolay: Yeah, Postgres is expecting
you to perform rollbacks

769
00:42:28,200 --> 00:42:28,700
often.

770
00:42:29,720 --> 00:42:33,420
Michael: Yeah, it's good to make
use of it when it's advantageous.

771
00:42:34,340 --> 00:42:34,820
Nikolay: Right.

772
00:42:34,820 --> 00:42:39,960
So the tuple remains in the same spot
in the page, it doesn't shift,

773
00:42:39,960 --> 00:42:41,260
unlike in other systems.

774
00:42:41,280 --> 00:42:46,740
But still, xmax is updated, so
the page becomes dirty and must be

775
00:42:46,820 --> 00:42:47,360
flushed to

776
00:42:47,360 --> 00:42:47,860
Michael: disk.

777
00:42:49,200 --> 00:42:50,880
Yep, makes sense.

778
00:42:51,100 --> 00:42:53,720
Nikolay: Yeah, I think we explored
it well enough, quite

779
00:42:53,720 --> 00:42:54,220
deep.

780
00:42:54,440 --> 00:42:55,580
Takeaways.

781
00:42:57,800 --> 00:43:01,560
First, perform checkpointer tuning;
check out our episode about

782
00:43:01,560 --> 00:43:02,060
that.
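
The knobs usually involved, as a sketch; the values are placeholders, not recommendations (see the dedicated episode):

```sql
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();  -- all three are reloadable without a restart
```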

783
00:43:02,860 --> 00:43:07,260
And to avoid massive deletes,
perform deletes in batches

784
00:43:08,000 --> 00:43:10,940
each taking roughly no longer than 1 or 2
seconds.

785
00:43:12,180 --> 00:43:17,160
Unless you don't have users who
deal with your database a lot,

786
00:43:17,160 --> 00:43:21,060
in which case maybe you can go with, like,
30 seconds.
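
A common batching sketch (table, column, and batch size hypothetical; tune the LIMIT so one batch takes about a second):

```sql
-- One batch; repeat until 0 rows are deleted. Each statement is
-- its own short transaction, so locks and WAL bursts stay small.
DELETE FROM t
WHERE id IN (
  SELECT id
  FROM t
  WHERE created_at < now() - interval '90 days'
  LIMIT 5000
);
```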

787
00:43:21,500 --> 00:43:25,260
Michael: Well, at least consider
batching and try it, yeah.

788
00:43:26,600 --> 00:43:27,440
I just wanted to

789
00:43:27,440 --> 00:43:30,620
Nikolay: have some rule and for
regular OLTP 1 second is a good

790
00:43:30,620 --> 00:43:31,120
rule.

791
00:43:31,420 --> 00:43:32,880
Michael: Yeah, great, I like it.

792
00:43:32,980 --> 00:43:35,800
Yeah, I think it's like a healthy
recommendation as well.

793
00:43:35,800 --> 00:43:39,620
Like even if you wouldn't go down
or even if you wouldn't have

794
00:43:39,620 --> 00:43:44,140
like, degradation that users
would notice, it's just an

795
00:43:44,140 --> 00:43:48,840
unhealthy thing to do, like, to go
to excess, and then, like, it's

796
00:43:48,840 --> 00:43:52,800
kind of like extremes where if
you can keep things more level,

797
00:43:52,800 --> 00:43:54,600
you'll probably see better things.

798
00:43:54,600 --> 00:43:54,960
Yeah, and

799
00:43:54,960 --> 00:44:00,100
Nikolay: a couple more takeaways:
vacuuming, don't forget about the

800
00:44:00,240 --> 00:44:04,480
regular vacuum, it affects index-only
scans, as we discussed.

801
00:44:04,480 --> 00:44:07,760
Maybe you want frequent and quite
aggressive vacuuming, like, to

802
00:44:07,760 --> 00:44:12,680
go faster, with more resources consumed.
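
What "more aggressive" could mean per table, a sketch with hypothetical values:

```sql
ALTER TABLE t SET (
  autovacuum_vacuum_scale_factor = 0.01,  -- trigger after ~1% dead tuples
  autovacuum_vacuum_cost_delay = 1        -- less throttling (milliseconds)
);
```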

803
00:44:14,380 --> 00:44:17,780
And also partitioning: maybe your massive delete
just becomes a truncate; in this case all

804
00:44:17,780 --> 00:44:21,300
those negative effects can go away.

805
00:44:21,460 --> 00:44:21,960
Yeah.

806
00:44:22,360 --> 00:44:22,540
Yeah.

807
00:44:22,540 --> 00:44:23,300
Love it.

808
00:44:24,060 --> 00:44:25,180
Michael: Thanks so much, Nikolay.

809
00:44:25,920 --> 00:44:26,760
Nikolay: Thank you, Michael.