1
00:00:00,009 --> 00:00:04,030
Michael: Hello, and welcome to Postgres FM, a
weekly show about all things PostgreSQL.

2
00:00:04,270 --> 00:00:06,039
I'm Michael, founder of pgMustard.

3
00:00:06,070 --> 00:00:08,740
And this is my co-host Nikolay, founder of Postgres.ai.

4
00:00:09,070 --> 00:00:10,840
Hey Nikolay, what are we going to talk about today?

5
00:00:11,200 --> 00:00:13,930
Nikolay: Hi, Michael. Let's talk about checkpoint tuning.

6
00:00:14,064 --> 00:00:14,514
Michael: Yes.

7
00:00:14,564 --> 00:00:15,174
Write-ahead

8
00:00:15,174 --> 00:00:15,954
log in general,

9
00:00:15,954 --> 00:00:16,254
right?

10
00:00:16,372 --> 00:00:16,882
WAL.

11
00:00:16,972 --> 00:00:17,242
WAL

12
00:00:17,242 --> 00:00:18,292
configuration.

13
00:00:18,349 --> 00:00:21,796
We had a really good request for this topic from one of our listeners.

14
00:00:21,895 --> 00:00:22,165
Nikolay: Right.

15
00:00:22,170 --> 00:00:24,715
So not only checkpoint tuning, but checkpoint tuning is a part of it.

16
00:00:24,775 --> 00:00:25,795
Michael: Yes, absolutely.

17
00:00:25,832 --> 00:00:27,392
Thank you to Chelsea for this one.

18
00:00:27,438 --> 00:00:27,998
Nikolay: Uhhuh.

19
00:00:28,188 --> 00:00:28,594
Thank you.

20
00:00:28,685 --> 00:00:30,425
Very interconnected topics.

21
00:00:30,556 --> 00:00:35,631
We should think about them both, WAL configuration and checkpoint tuning.

22
00:00:35,691 --> 00:00:36,051
They come together.

23
00:00:36,604 --> 00:00:36,844
Michael: Yeah.

24
00:00:36,849 --> 00:00:41,434
Should we go through them one at a time in
terms of, well, so what is a checkpoint?

25
00:00:41,464 --> 00:00:42,394
What are checkpoints?

26
00:00:42,394 --> 00:00:43,114
Should we start there?

27
00:00:43,114 --> 00:00:44,478
Or what is WAL?

28
00:00:44,822 --> 00:00:45,152
Nikolay: Right.

29
00:00:45,152 --> 00:00:53,040
So what are checkpoints? To talk about checkpoints, we
need to talk about WAL first, the write-ahead log. And the write-ahead log is

30
00:00:53,040 --> 00:00:57,930
the basic, absolutely fundamental concept of any database system,

31
00:00:58,650 --> 00:01:03,942
not only relational ones, which says that data is first recorded

32
00:01:04,767 --> 00:01:10,804
in a different place, in some additional
kind of binary log, the write-ahead log.

33
00:01:11,074 --> 00:01:15,770
And only then is it changed in memory, in data pages.

34
00:01:15,984 --> 00:01:19,473
And this allows us to build a recovery system.

35
00:01:19,845 --> 00:01:25,184
This allows us to survive unexpected restarts, failures, various bugs, and so on.

36
00:01:25,432 --> 00:01:27,892
And in Postgres, this also allows us to have

37
00:01:28,427 --> 00:01:32,942
physical replication, because it's built on top of the recovery subsystem.

38
00:01:33,178 --> 00:01:40,512
So if you think that you change some row in a table,
first this change is recorded in the write-ahead log.

39
00:01:40,622 --> 00:01:46,380
And only when we know that this change is recorded reliably in WAL,

40
00:01:46,935 --> 00:01:49,335
that it's already on disk, only

41
00:01:49,335 --> 00:01:57,264
then can the user
see the commit. And the actual change in the data can still

42
00:01:57,264 --> 00:02:00,594
be only in memory, not flushed to disk yet.

43
00:02:00,688 --> 00:02:01,248
Right?

44
00:02:01,418 --> 00:02:03,578
So basically we record twice.

45
00:02:03,638 --> 00:02:07,208
We record the change in the write-ahead log, and we change the data

46
00:02:07,958 --> 00:02:16,118
in the place where it's stored permanently. But this flushing
to disk, in the second case, does not occur synchronously,

47
00:02:16,118 --> 00:02:25,580
so we can see the commit, but the data files, table and index files,
are not yet in their up-to-date state on disk. And the checkpoint

48
00:02:25,609 --> 00:02:29,345
is the process of writing so-called dirty

49
00:02:29,347 --> 00:02:38,105
blocks, dirty buffers, to disk. So dirty in this
context means changed, but not yet saved on disk.

50
00:02:38,110 --> 00:02:41,635
So if it's already saved, it's called a clean buffer.

51
00:02:41,905 --> 00:02:44,245
If it's not yet saved, it's called a dirty buffer.

52
00:02:44,485 --> 00:02:47,845
And when we have dirty buffers, it means that we changed a lot.

53
00:02:47,992 --> 00:02:50,572
We already reflected these changes in WAL.

54
00:02:50,577 --> 00:02:51,832
It's like rule number one:

55
00:02:52,500 --> 00:02:54,900
write-ahead log. But it's not yet saved on disk.

56
00:02:55,080 --> 00:03:02,066
When a checkpoint happens, all dirty buffers are saved,
and it means that next time we need to start,

57
00:03:02,187 --> 00:03:09,408
for example after some crash, power loss, anything, we don't care
about previous changes because they're already reflected on disk.

58
00:03:10,088 --> 00:03:10,298
Michael: Yeah.

59
00:03:10,298 --> 00:03:13,268
So we only have to worry about things since the last checkpoint

60
00:03:13,328 --> 00:03:13,598
Nikolay: Right.

61
00:03:13,598 --> 00:03:14,798
And we need to redo them.

62
00:03:14,798 --> 00:03:15,458
Right, right, right.

63
00:03:15,708 --> 00:03:16,198
Michael: Yeah.

64
00:03:16,508 --> 00:03:22,812
And just before we move on from the basics, because
I had to read up a bit about this: the main

65
00:03:22,812 --> 00:03:26,472
reason we need this is to prevent data loss on recovery.

66
00:03:26,502 --> 00:03:28,531
So it's the idea of the dual system.

67
00:03:28,531 --> 00:03:33,601
The idea of having this is for the
D in ACID, I believe, the durability side.

68
00:03:33,606 --> 00:03:34,411
So preventing

69
00:03:34,466 --> 00:03:35,026
Nikolay: Right.

70
00:03:35,026 --> 00:03:43,412
ACID is the core concept of a database
system. It should be like this: if it says data is committed,

71
00:03:43,589 --> 00:03:44,999
it can never lose it.

72
00:03:45,149 --> 00:03:46,499
Otherwise it's a bad system.

73
00:03:46,652 --> 00:03:47,552
So the write-ahead

74
00:03:47,552 --> 00:03:51,872
log is exactly what allows us to have it.

75
00:03:52,352 --> 00:04:01,529
And without checkpoints, we would need to keep a lot of changes and replay
a lot of changes, and startup time after any crash would be very long.

76
00:04:01,713 --> 00:04:01,953
Right.

77
00:04:02,096 --> 00:04:04,196
And so checkpoints happen all the time.

78
00:04:04,196 --> 00:04:05,970
They happen kind of on schedule.

79
00:04:06,673 --> 00:04:15,130
They can also happen, for example, when we need to shut down
the server or restart. There is a so-called shutdown checkpoint.

80
00:04:15,353 --> 00:04:21,837
So Postgres doesn't report that shutdown has completed
until this shutdown checkpoint finishes.

81
00:04:22,052 --> 00:04:22,352
Right.

82
00:04:22,382 --> 00:04:29,522
So it's also important to understand, and that's why sometimes
we can see that shutdown takes a significant time, because

83
00:04:29,522 --> 00:04:32,092
we have a lot of dirty buffers and we need to save them first.

84
00:04:32,447 --> 00:04:34,727
Michael: Yeah, on this topic, I saw a good recommendation.

85
00:04:34,727 --> 00:04:40,188
I think it was from you actually around Postgres
upgrades and the idea of taking a checkpoint triggering

86
00:04:40,188 --> 00:04:43,208
a checkpoint manually to reduce the time needed.

87
00:04:43,493 --> 00:04:43,883
Nikolay: Right.

88
00:04:43,883 --> 00:04:49,120
Because during the shutdown checkpoint,
Postgres doesn't respond to new queries anymore.

89
00:04:49,120 --> 00:04:50,768
It says, shutting down,

90
00:04:50,768 --> 00:04:51,849
So come later.

91
00:04:51,939 --> 00:04:52,179
Right.

92
00:04:52,539 --> 00:04:56,909
But if we run a manual checkpoint, the SQL command CHECKPOINT,

93
00:04:56,921 --> 00:04:58,551
it can run in parallel.

94
00:04:58,671 --> 00:05:05,211
So we save a lot of dirty buffers ourselves, just
running it in a terminal, in psql, for example.

95
00:05:05,361 --> 00:05:08,381
And then when the shutdown checkpoint happens,

96
00:05:08,434 --> 00:05:12,596
it's already fast because there's a very low number of dirty buffers.

97
00:05:12,626 --> 00:05:12,806
Right?

98
00:05:12,826 --> 00:05:13,316
Michael: Yeah.

99
00:05:13,332 --> 00:05:19,092
Nikolay: So it's always recommended when we need to restart
the server, for example for a minor upgrade, or we need, for example,

100
00:05:19,092 --> 00:05:28,222
to perform a switchover. This manual, well, not manual, it can
be automated of course, but additional checkpoint, I would say,

101
00:05:29,117 --> 00:05:34,365
an explicit checkpoint should be there,
because after it the shutdown checkpoint will happen.

102
00:05:34,365 --> 00:05:37,695
So we need to help it to be shorter, faster,

103
00:05:38,090 --> 00:05:38,420
Michael: Yeah.

104
00:05:39,260 --> 00:05:45,470
So you already mentioned a couple of uses
for the write-ahead log. One is recovery on crash.

105
00:05:45,470 --> 00:05:50,149
One is replication. I believe point-in-time recovery is another one.

106
00:05:50,539 --> 00:05:53,488
So tools like pgBackRest make use of it,

107
00:05:53,488 --> 00:05:53,758
I believe.

108
00:05:53,768 --> 00:05:54,248
Nikolay: Right, right.

109
00:05:54,848 --> 00:05:56,814
Well, it's a different topic, but of course, right.

110
00:05:56,934 --> 00:06:03,974
If we store full copies of PGDATA, by the way,
they are never in a consistent state.

111
00:06:04,004 --> 00:06:08,144
If you copy PGDATA on a live server, it's not consistent.

112
00:06:08,190 --> 00:06:19,268
So you always need a bunch of WALs to be able to reach a consistent point.
But you can also store the whole stream of WALs in addition,

113
00:06:19,638 --> 00:06:27,819
and this allows you to take some so-called base backup, PGDATA
corresponding to some point in time, and then you can replay

114
00:06:27,819 --> 00:06:31,579
additional WALs and reach a new point in time, and it can be arbitrary.

115
00:06:31,579 --> 00:06:36,294
So if you store a continuous stream of
WALs in your archive, you can choose any point in time

116
00:06:36,629 --> 00:06:44,992
you want. The only problem usually is, if your database is
quite big, the initial copy of PGDATA takes time.

117
00:06:45,089 --> 00:06:51,409
And a very, very rough rule is one terabyte per
hour. It can be slower, it can be faster, but it's a very rough rule.

118
00:06:51,529 --> 00:06:58,439
So if you have a 10-terabyte database, you should be
prepared for 5 or 10 or 15 hours of initial copy.

119
00:06:58,596 --> 00:06:59,696
And I wonder why people

120
00:07:00,167 --> 00:07:03,688
don't often use cloud snapshots for that, to speed it up.

121
00:07:03,988 --> 00:07:04,168
Right?

122
00:07:04,168 --> 00:07:12,138
It would make complete sense, but I know sometimes
cloud snapshots in Amazon and Google are not reliable.

123
00:07:12,378 --> 00:07:14,728
Like there are issues with them sometimes.

124
00:07:14,752 --> 00:07:18,592
Also making them takes time, but we could do it on it's.

125
00:07:18,652 --> 00:07:27,646
I'm taking us into a different field. Let's
postpone the discussion of backups and other things. But you're right,

126
00:07:27,676 --> 00:07:30,676
point-in-time recovery is another area of application

127
00:07:31,186 --> 00:07:39,192
of WAL. But the primary goal is to allow us to
recover from an unexpected shutdown or power loss.

128
00:07:39,612 --> 00:07:42,782
I think so. It's like in games, right?

129
00:07:43,346 --> 00:07:44,176
Michael: I lost you briefly

130
00:07:44,291 --> 00:07:44,681
Nikolay: Yeah.

131
00:07:44,681 --> 00:07:47,273
I see, I see. Let me listen to you then.

132
00:07:47,773 --> 00:07:48,103
so

133
00:07:48,603 --> 00:07:51,041
Michael: Well, yeah, I think it's getting better.

134
00:07:51,101 --> 00:07:53,531
So we have a lot of parameters for tuning these things.

135
00:07:53,531 --> 00:08:01,624
We have a lot of parameters for controlling how fast these
things happen, how much has to happen before they kick in.

136
00:08:02,124 --> 00:08:06,944
There's a few settings that it seems like we
really shouldn't touch pretty much ever as well.

137
00:08:07,194 --> 00:08:13,404
But it also seems that, especially on write-heavy workloads,
there are some really big wins we can get by changing some

138
00:08:13,459 --> 00:08:13,849
Nikolay: Right.

139
00:08:13,879 --> 00:08:16,849
Defaults are not good enough for you usually, right?

140
00:08:16,926 --> 00:08:17,256
Michael: Okay.

141
00:08:17,496 --> 00:08:18,876
Is that generally true?

142
00:08:18,876 --> 00:08:26,076
Or, if I have a read-heavy application and not a ton of writes,
how would I know if I'm hitting problems?

143
00:08:26,076 --> 00:08:28,236
What's the kind of, what are the first telltale signs?

144
00:08:28,444 --> 00:08:29,837
Nikolay: Well, that's a good question.

145
00:08:29,859 --> 00:08:30,579
Usually

146
00:08:30,605 --> 00:08:35,713
it's good to evaluate the distance between two daily backups, for example.

147
00:08:35,773 --> 00:08:39,933
And you can, even in SQL, take two LSNs.

148
00:08:40,083 --> 00:08:43,680
Usually when a backup is taken, you know its LSN, log sequence number.

149
00:08:43,980 --> 00:08:45,990
So it's like a position in WAL.

150
00:08:46,200 --> 00:08:47,490
It's all sequential.

151
00:08:47,657 --> 00:08:49,307
It has some specific structure.

152
00:08:49,732 --> 00:08:50,902
There are a few articles.

153
00:08:51,142 --> 00:08:56,499
We will attach articles, explaining how to
read an LSN and how to understand this structure.

154
00:08:56,859 --> 00:09:04,179
So if you take two LSNs and convert them to the pg_lsn
data type in Postgres, you can subtract one

155
00:09:04,179 --> 00:09:07,089
from another and the difference will be in bytes.

156
00:09:07,419 --> 00:09:16,506
So you get this difference, and then you can run pg_size_pretty
and you can see the difference in megabytes, gigabytes,

157
00:09:16,679 --> 00:09:24,280
gibibytes actually, but Postgres doesn't use this notation.
So you can understand how much WAL you generate during a day.

158
00:09:24,464 --> 00:09:24,764
Right?
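
The LSN arithmetic described here can be sketched outside the database too. This is only an illustration in Python, assuming the usual textual LSN form of two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits). The two LSN values and the helper names are made up for the example; inside Postgres you would subtract pg_lsn values and call pg_size_pretty instead.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '2A/0' to an absolute byte position in WAL."""
    high, low = lsn.split("/")
    # Before the slash: high 32 bits. After the slash: low 32 bits. Both are hex.
    return (int(high, 16) << 32) | int(low, 16)


def wal_bytes_between(start_lsn: str, end_lsn: str) -> int:
    """Bytes of WAL generated between two LSNs (e.g. taken at two daily backups)."""
    return lsn_to_bytes(end_lsn) - lsn_to_bytes(start_lsn)


def pretty_size(num_bytes: int) -> str:
    """Human-readable size in binary units (1 KiB = 1024 bytes)."""
    value = float(num_bytes)
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        if abs(value) < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} TiB"


# Hypothetical LSNs recorded at two consecutive daily backups.
daily_wal = wal_bytes_between("2A/0", "2B/80000000")
print(daily_wal, pretty_size(daily_wal))
```

With these made-up values the day's WAL comes out at 6 GiB, which by the rule of thumb in this conversation is on the small side.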

159
00:09:24,911 --> 00:09:34,343
If this size is small, like a gigabyte or 10 gigabytes, it's quite
small, and you probably don't need specific checkpoint and WAL tuning.

160
00:09:35,131 --> 00:09:43,552
But if you have one terabyte generated per day, it's quite
a lot, and I'm sure you need to move away from the defaults,

161
00:09:43,702 --> 00:09:46,102
and you will have better performance, better everything.

162
00:09:46,352 --> 00:09:49,352
For example, WAL compression is not enabled by default.

163
00:09:49,701 --> 00:09:50,391
And, And,

164
00:09:50,556 --> 00:09:51,306
Michael: I missed that one.

165
00:09:51,306 --> 00:09:51,486
That's

166
00:09:51,681 --> 00:09:59,071
Nikolay: Yeah, but I'm going to check if it's enabled in recent
versions, because Postgres defaults are improving, but they are still lagging.

167
00:09:59,266 --> 00:10:02,451
If you have heavily loaded systems, you definitely want to tune.

168
00:10:02,516 --> 00:10:12,109
So for Postgres 13, for example, WAL compression is disabled
by default. Let's talk about what is written to WAL.

169
00:10:12,290 --> 00:10:17,233
Oh, first of all, just a simplified explanation people usually use

170
00:10:17,313 --> 00:10:18,347
about checkpoints.

171
00:10:18,497 --> 00:10:21,851
It's like in games, you want to save your progress.

172
00:10:21,981 --> 00:10:22,281
Right.

173
00:10:22,281 --> 00:10:26,984
And if something bad happens you, you will repeat fewer steps.

174
00:10:27,031 --> 00:10:27,331
Right.

175
00:10:27,331 --> 00:10:29,221
So it's a very simple analogy.

176
00:10:29,461 --> 00:10:29,641
Yeah.

177
00:10:29,641 --> 00:10:32,720
It's still, in Postgres 14,

178
00:10:32,720 --> 00:10:35,520
it's still not enabled, WAL compression.

179
00:10:35,550 --> 00:10:35,760
Yeah.

180
00:10:35,880 --> 00:10:37,610
In Postgres 15 it's still not enabled.

181
00:10:38,110 --> 00:10:41,342
And this, I think, should be enabled in most setups.

182
00:10:41,452 --> 00:10:44,282
I, I, I'm almost sure on just it's enabled.

183
00:10:44,403 --> 00:10:54,638
So you can, for example, do checkpoints very rarely, once per week.
It's insanely rare, but in this case there is a high chance that if a crash

184
00:10:54,638 --> 00:10:58,805
happens, you will need to wait a long time while Postgres replays many WALs, right?

185
00:10:58,805 --> 00:11:00,840
A lot to do in terms of so-called redo.

186
00:11:01,061 --> 00:11:03,043
And during this period you will be down.

187
00:11:03,317 --> 00:11:04,277
Your system is down.

188
00:11:04,277 --> 00:11:05,114
So no, not good.

189
00:11:05,114 --> 00:11:07,480
That's why, logically,

190
00:11:07,539 --> 00:11:10,633
I would say it's good to have checkpoints more often.

191
00:11:10,643 --> 00:11:10,943
Right.

192
00:11:11,018 --> 00:11:14,615
Michael: It seems like a Goldilocks problem, right? Too often

193
00:11:14,658 --> 00:11:20,688
and you have a lot of overhead, but too infrequent
and it will take a long time to recover.

194
00:11:20,825 --> 00:11:22,955
So it feels like there's a balance.

195
00:11:23,105 --> 00:11:24,275
Nikolay: There is a trade-off here.

196
00:11:24,275 --> 00:11:26,312
And there are two kinds of overhead.

197
00:11:26,317 --> 00:11:30,820
We will talk about it in a second, but
to understand where overhead comes from.

198
00:11:30,875 --> 00:11:32,570
in spite of dirty buffers.

199
00:11:32,840 --> 00:11:36,770
let's talk about what is written to WAL. By default,

200
00:11:36,770 --> 00:11:38,826
full-page writes are enabled, right?

201
00:11:38,826 --> 00:11:40,681
And what is a full-page write?

202
00:11:40,711 --> 00:11:46,745
If you change anything in some table,
in some row, it may be a very small change,

203
00:11:46,751 --> 00:11:48,761
but Postgres writes the whole page

204
00:11:49,251 --> 00:11:50,571
to the write-ahead log.

205
00:11:50,768 --> 00:11:51,098
Why?

206
00:11:51,103 --> 00:11:52,604
Because there is a difference.

207
00:11:52,624 --> 00:12:00,717
A buffer is usually 8 kibibytes in size, but the
file system probably uses a block size of 4 kibibytes.

208
00:12:00,994 --> 00:12:07,046
And you don't want to have a partial write,
where writing to disk reported success

209
00:12:07,046 --> 00:12:08,346
but you wrote only

210
00:12:08,606 --> 00:12:09,236
half of it.

211
00:12:09,513 --> 00:12:11,943
So that's why full-page writes are needed.

212
00:12:12,003 --> 00:12:18,166
And by the way, the first talks about Aurora Postgres from Grant McAlister,

213
00:12:18,166 --> 00:12:20,836
if I'm not mistaken, are very good.

214
00:12:21,673 --> 00:12:21,823
We will

215
00:12:22,438 --> 00:12:25,858
find links to YouTube and probably the slide deck.

216
00:12:26,248 --> 00:12:31,498
They explain this problem of
full-page writes and this big overhead very well.

217
00:12:31,678 --> 00:12:38,558
So when the first change to a page occurs, the first time
it's written after a checkpoint, it's a full-page write.

218
00:12:38,818 --> 00:12:44,468
If you change it once again, only a small
delta is written. So it's not a full-page

219
00:12:44,468 --> 00:12:44,708
write.

220
00:12:45,218 --> 00:12:47,278
But only until the next checkpoint.

221
00:12:47,961 --> 00:12:53,348
Once a checkpoint has happened, the first change to each page is again a full-page write.

222
00:12:53,434 --> 00:12:57,374
If checkpoints are very frequent, we have a lot of full-page writes.

223
00:12:57,574 --> 00:13:04,828
If checkpoints are not frequent, very often we
have repeated changes to the same page.

224
00:13:04,828 --> 00:13:05,998
So we changed.

225
00:13:06,171 --> 00:13:08,781
We, for example, inserted something.

226
00:13:08,811 --> 00:13:10,971
We insert once again into the same page.

227
00:13:10,971 --> 00:13:12,775
It's a new change again.

228
00:13:12,775 --> 00:13:14,485
And again, we update something.

229
00:13:14,485 --> 00:13:18,253
We have a HOT update, a heap-only tuple update.

230
00:13:18,258 --> 00:13:19,843
So we changed the same page.

231
00:13:20,263 --> 00:13:22,333
And this means we touch the same page.

232
00:13:22,543 --> 00:13:26,053
We write to it multiple times in this case.

233
00:13:26,653 --> 00:13:34,173
So I'm saying not only the number of writes
matters, the nature of the writes also matters.

234
00:13:34,383 --> 00:13:43,217
If you have HOT updates touching the same page many, many times, you
will benefit a lot from rare checkpoints, because only one full-page write

235
00:13:43,217 --> 00:13:44,957
will happen after a checkpoint.

236
00:13:45,137 --> 00:13:47,237
And then you benefit from writing very

237
00:13:47,782 --> 00:13:54,735
little to the write-ahead log for every
subsequent change, until the next checkpoint of course.
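
A toy model can make this effect concrete. The sketch below is not how Postgres accounts for WAL. It only assumes that the first change to a page after a checkpoint costs a full 8 KiB page image and every later change costs a small record. The 100-byte record size and the function name are assumptions chosen for illustration.

```python
PAGE_SIZE = 8192   # Postgres default block size, in bytes
DELTA_SIZE = 100   # assumed size of a small WAL record (illustrative only)


def wal_bytes_for_hot_page(num_updates: int, updates_per_checkpoint: int) -> int:
    """Approximate WAL written by num_updates changes that all hit one page.

    A checkpoint is assumed to happen after every updates_per_checkpoint changes,
    so the next change to the page is logged as a full-page image again.
    """
    total = 0
    for i in range(num_updates):
        first_change_since_checkpoint = (i % updates_per_checkpoint == 0)
        total += PAGE_SIZE if first_change_since_checkpoint else DELTA_SIZE
    return total


# The same 1000 updates to one page, with frequent and with rare checkpoints.
frequent = wal_bytes_for_hot_page(1000, updates_per_checkpoint=10)
rare = wal_bytes_for_hot_page(1000, updates_per_checkpoint=1000)
print(frequent, rare)
```

In this toy setup the frequent-checkpoint case writes roughly eight times more WAL for the same workload, which is the benefit of rare checkpoints described above. Real numbers depend on the workload and on WAL compression.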

238
00:13:55,053 --> 00:13:56,930
Michael: Yeah, that's super interesting.

239
00:13:56,990 --> 00:13:58,640
And I think it also

240
00:13:58,959 --> 00:14:01,899
explains why, so there are some people I've seen,

241
00:14:01,908 --> 00:14:10,627
and I suspect this is a very bad idea, turning off full-page writes in
order to increase throughput, but it feels like a very risky thing to do.

242
00:14:10,627 --> 00:14:13,717
And I feel like we're gonna cover quite a few better

243
00:14:14,017 --> 00:14:23,465
Nikolay: Some systems can afford it. You need to understand
what your file system is, your disk, and what settings they use.

244
00:14:23,465 --> 00:14:26,515
And in some cases it's possible, but it's quite dangerous.

245
00:14:26,515 --> 00:14:33,293
You should understand all the risks and be a hundred percent sure that
it's possible in your system, but usually we don't go this way.

246
00:14:33,293 --> 00:14:37,403
Usually we use ext4 with a 4K block size and so on.

247
00:14:37,403 --> 00:14:40,598
And we want full-page writes enabled.

248
00:14:40,748 --> 00:14:45,248
So back to compression: compression is applied only to full-page writes.

249
00:14:45,938 --> 00:14:46,568
Michael: Interesting.

250
00:14:47,163 --> 00:14:47,493
Nikolay: Right.

251
00:14:47,493 --> 00:14:54,287
So, as I understand it, Postgres
doesn't compress these small changes.

252
00:14:54,437 --> 00:14:56,267
It compresses only these:

253
00:14:56,451 --> 00:14:59,361
the first time we change something in a page,

254
00:14:59,751 --> 00:15:03,111
we record this page fully, and we can compress it.

255
00:15:03,274 --> 00:15:05,284
And compression is not enabled by default.

256
00:15:06,091 --> 00:15:16,307
And if you enable it, you can see a huge benefit in terms of how much
WAL you write. Why do we care about volume here? Because if we have

257
00:15:16,367 --> 00:15:25,304
a write-heavy system, of course writing a lot more to WAL
is additional overhead on disk. And if you have 10 replicas,

258
00:15:25,304 --> 00:15:29,371
and sometimes people have that, all replicas need to receive this data.

259
00:15:29,836 --> 00:15:32,926
Physical replication, and logical as well,

260
00:15:33,136 --> 00:15:35,236
works through WAL, through the write-ahead log.

261
00:15:35,476 --> 00:15:39,566
So if we write a lot, we need to send a lot over the network.

262
00:15:40,066 --> 00:15:49,348
We want WAL compression enabled to compress all full-page writes, and
we want checkpoints to happen rarely, to have fewer full-page writes as well.

263
00:15:49,798 --> 00:15:51,118
So I would tune

264
00:15:51,162 --> 00:15:59,418
max_wal_size and checkpoint_timeout to have very
infrequent checkpoints. But in this case, if they are not frequent,

265
00:15:59,448 --> 00:16:10,155
again, startup time after a crash, and also failover, for example,
the timing of these procedures will be very, very bad, many minutes.

266
00:16:10,305 --> 00:16:13,245
Sometimes I see various engineers struggle

267
00:16:13,444 --> 00:16:14,404
to understand why,

268
00:16:14,404 --> 00:16:22,701
for example, shutdown takes so long, why restart takes so long. And
they become nervous, and in extreme cases they use kill -9.

269
00:16:23,346 --> 00:16:30,816
So, SIGKILL. Postgres survives because we have the write-ahead
log and we just redo, but redo also takes a long time.

270
00:16:31,026 --> 00:16:32,526
They kill it.

271
00:16:32,856 --> 00:16:33,966
It's not acceptable.

272
00:16:34,476 --> 00:16:35,556
Only in rare cases.

273
00:16:35,556 --> 00:16:36,126
We should do it.

274
00:16:36,626 --> 00:16:37,556
It's like last resort.

275
00:16:37,616 --> 00:16:38,336
We should not do it.

276
00:16:38,666 --> 00:16:43,016
But afterwards, Postgres starts, and startup also takes many minutes.

277
00:16:43,016 --> 00:16:44,396
They're still nervous.

278
00:16:44,726 --> 00:16:46,182
It's not a good situation.

279
00:16:46,332 --> 00:16:48,825
That's why people need to understand.

280
00:16:48,945 --> 00:16:54,623
how much WAL needs to be written and the distance between checkpoints.

281
00:16:54,725 --> 00:16:57,185
Michael: Yeah, let's go back to a couple of those ones you mentioned.

282
00:16:57,185 --> 00:17:05,925
So my understanding is that checkpoint_timeout is the maximum time
between checkpoints, and its default is quite low, five minutes.

283
00:17:05,925 --> 00:17:06,390
Nikolay: Very low.

284
00:17:06,529 --> 00:17:10,609
Michael: so what would be a sensible
starting point for most people in terms of

285
00:17:10,679 --> 00:17:18,694
Nikolay: Yeah, so usually the main metric here is how
long you can afford to be down in the bad case.

286
00:17:18,694 --> 00:17:23,501
In the case of an incident, this is the main number you need to understand.

287
00:17:23,531 --> 00:17:31,211
You need to talk with your business people and find some number
like, okay, we can be down up to two minutes, for example, right.

288
00:17:31,721 --> 00:17:33,221
From there, you start thinking:

289
00:17:33,888 --> 00:17:41,153
if we have this requirement, or SLO, service
level objective, if we are SREs, right?

290
00:17:41,603 --> 00:17:49,343
So if we have two minutes, let's think: during two minutes,
how much can we replay? We can measure it with an experiment.

291
00:17:49,793 --> 00:17:55,199
We can, for example, set checkpoint_timeout and
max_wal_size to insanely big values.

292
00:17:55,499 --> 00:18:04,974
Then we can have a lot of writes happening, with pgbench, for example.
Then we can wait until one checkpoint finishes.

293
00:18:05,184 --> 00:18:11,254
Then, when the next one is about to finish, we
kill -9 our Postgres, crash it on purpose.

294
00:18:11,314 --> 00:18:19,816
And then we observe recovery and just measure the speed of recovery,
how many bytes of WAL we can replay per second, per minute.

295
00:18:19,996 --> 00:18:27,973
And this gives us an understanding of how much WAL we can
afford to replay without exceeding two minutes of downtime,

296
00:18:27,973 --> 00:18:28,423
for example.

297
00:18:28,736 --> 00:18:28,946
Michael: Yep.

298
00:18:29,096 --> 00:18:29,429
Perfect.

299
00:18:29,430 --> 00:18:34,587
Nikolay: From this we can start thinking. It's
very important to understand recovery speed

300
00:18:35,158 --> 00:18:38,715
in terms of bytes per second, bytes per minute, or gigabytes per minute,

301
00:18:38,848 --> 00:18:49,278
anything. From there, we can understand how many bytes of
WAL we produce when everything is normal, during quite busy hours. Usually

302
00:18:49,488 --> 00:18:55,018
at night, for example, we have lower activity; in the daytime on working days

303
00:18:55,108 --> 00:18:55,648
we probably have

304
00:18:56,054 --> 00:18:57,404
more activity, right?

305
00:18:57,644 --> 00:19:02,797
But usually we say, okay, we produce, for
example, one WAL per second. It's quite a good speed.

306
00:19:03,217 --> 00:19:05,812
Each WAL here means a file.

307
00:19:06,312 --> 00:19:12,885
There is also confusion in terminology because, as I remember,
the documentation says a WAL file is some abstract thing,

308
00:19:12,885 --> 00:19:16,671
like two gigabytes, and a WAL segment is 16

309
00:19:16,675 --> 00:19:22,884
mebibytes. But if you go to the pg_wal
directory, you will see each file is 16 megabytes.

310
00:19:22,974 --> 00:19:28,274
Usually. As I remember, they tuned it, and they have 64 mebibytes for each WAL.

311
00:19:28,604 --> 00:19:29,624
So I say WALs.

312
00:19:30,194 --> 00:19:32,624
Each WAL is usually 16 megabytes.

313
00:19:32,804 --> 00:19:38,863
So one WAL per second during normal, quite busy
hours means we produce 16 megabytes per second.

314
00:19:39,668 --> 00:19:39,908
Right.

315
00:19:39,908 --> 00:19:40,448
So, okay.

316
00:19:40,748 --> 00:19:43,668
And it means that, okay, what is our replay speed?

317
00:19:43,824 --> 00:19:45,834
What is our production speed?

318
00:19:46,101 --> 00:19:54,791
And from there, we can understand how long it takes us to generate the amount
of WAL data which will give us two minutes of recovery time, right?

319
00:19:55,291 --> 00:19:56,071
Quite complex.

320
00:19:56,071 --> 00:19:56,551
I understand.
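
The reasoning above can be written down as simple arithmetic. This is a rough sketch, not a tuning formula. The two-minute budget and the 16 MiB per second production rate come from the conversation, while the replay rate of 64 MiB per second is a made-up stand-in for whatever the crash experiment would measure. It also treats the WAL to replay as one checkpoint interval's worth, which is a simplification.

```python
MIB = 1024 * 1024


def max_wal_to_replay(replay_bytes_per_sec: float, downtime_budget_sec: float) -> float:
    """How much WAL can be replayed within the downtime budget."""
    return replay_bytes_per_sec * downtime_budget_sec


def seconds_to_generate(wal_bytes: float, production_bytes_per_sec: float) -> float:
    """How long normal activity takes to generate that much WAL."""
    return wal_bytes / production_bytes_per_sec


downtime_budget_sec = 120    # "we can be down up to two minutes"
production_rate = 16 * MIB   # one 16 MiB WAL file per second during busy hours
replay_rate = 64 * MIB       # hypothetical result of the crash-recovery experiment

wal_budget = max_wal_to_replay(replay_rate, downtime_budget_sec)
interval_sec = seconds_to_generate(wal_budget, production_rate)

print(f"WAL budget between checkpoints: {wal_budget / MIB:.0f} MiB")
print(f"Rough distance between checkpoints: {interval_sec / 60:.0f} minutes")
```

With these assumed rates the WAL budget is 7680 MiB and the distance between checkpoints is about 8 minutes. A faster measured replay rate or a quieter system would allow a longer interval.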

321
00:19:57,421 --> 00:20:02,011
Michael: Well, it feels like luckily we've
got that second parameter, max_wal_size

322
00:20:02,301 --> 00:20:02,911
Nikolay: No, no, no, no.

323
00:20:03,241 --> 00:20:04,711
Let's pause.

324
00:20:04,891 --> 00:20:05,191
Yes.

325
00:20:05,251 --> 00:20:14,307
I'm talking about how, in my opinion, to understand
what a normal checkpoint_timeout is for you.

326
00:20:14,312 --> 00:20:14,787
Right?

327
00:20:15,717 --> 00:20:17,427
So in this case we understand, okay.

328
00:20:17,427 --> 00:20:20,727
Replay speed is this, and the speed of production

329
00:20:21,967 --> 00:20:22,567
is this.

330
00:20:22,807 --> 00:20:32,044
So we can conclude that to have no more than two minutes of recovery
time, we can afford to produce this amount of WAL data between checkpoints.

331
00:20:32,194 --> 00:20:34,894
So we will set checkpoint_timeout to probably something like half an hour.

332
00:20:35,094 --> 00:20:38,122
Maybe 15 minutes, 30 minutes.

333
00:20:38,392 --> 00:20:41,872
It depends, of course. Observing a concrete system,

334
00:20:41,872 --> 00:20:42,442
We can.

335
00:20:42,462 --> 00:20:43,512
make some conclusion.

336
00:20:43,512 --> 00:20:43,692
Okay.

337
00:20:43,692 --> 00:20:50,314
We want 30 minutes, for example, but then we start
distinguishing planned checkpoints and requested checkpoints.

338
00:20:50,444 --> 00:20:55,064
Requested checkpoints: like, Postgres has two logics.

339
00:20:55,334 --> 00:20:56,394
One logic is, okay,

340
00:20:56,394 --> 00:21:01,020
on schedule: when the time comes, it's time to have a checkpoint, every 30 minutes,

341
00:21:01,020 --> 00:21:02,760
for example. By default it's five minutes.

342
00:21:02,760 --> 00:21:03,691
I think it's  too.

343
00:21:04,199 --> 00:21:04,469
Right.

344
00:21:05,069 --> 00:21:14,153
But then there is another parameter called max_wal_size, and I think it's a
very, very important parameter to understand. It's our protection for the cases

345
00:21:14,153 --> 00:21:24,961
when we have elevated activity and we want to have more frequent checkpoints,
because we want to be protected, again, from writing too much data to WAL and,

346
00:21:25,485 --> 00:21:27,045
again, a longer wait. Again,

347
00:21:27,045 --> 00:21:27,615
The same logic.

348
00:21:27,615 --> 00:21:32,578
If we understand how much we produce,
the speed of production, we can say, okay,

349
00:21:32,608 --> 00:21:41,109
max_wal_size also roughly corresponds to... so, checkpoint_timeout
and max_wal_size, their tuning can be correlated here.

350
00:21:41,609 --> 00:21:41,939
Michael: Yeah.

351
00:21:42,599 --> 00:21:49,422
So my understanding is, it sounds like we should
rely on checkpoint_timeout for the majority of the time.

352
00:21:49,439 --> 00:21:52,829
That should be the thing that kicks off checkpoints, but

353
00:21:53,219 --> 00:22:00,104
if more than that amount of WAL is generated, more
than the amount we expected, we could set an amount in

354
00:22:00,419 --> 00:22:09,669
Nikolay: We should say that, usually, like... and the default there is
very small. The one-gigabyte default is insanely small for modern workloads.

355
00:22:09,859 --> 00:22:15,159
Usually I recommend going up, sometimes up to a hundred gigabytes.

356
00:22:15,924 --> 00:22:18,721
But we need to understand this recovery trade-off.

357
00:22:18,781 --> 00:22:19,021
Right?

358
00:22:19,021 --> 00:22:26,911
So we need to measure recovery and guarantee our business that we
will not be down for more than, for example, two minutes or five minutes.

359
00:22:26,941 --> 00:22:27,271
Right.

360
00:22:27,961 --> 00:22:28,633
But right.

361
00:22:28,813 --> 00:22:35,953
max_wal_size protects us from the cases when we have more
writes, and Postgres can decide to perform a requested checkpoint.

362
00:22:36,043 --> 00:22:38,713
We see it in logs, by the way. Logging of checkpoints

363
00:22:38,713 --> 00:22:38,953
we

364
00:22:39,268 --> 00:22:45,088
enable always. As far as I remember, recently the
default was changed and logging is now enabled.

365
00:22:45,148 --> 00:22:48,252
I remember the discussion on the hackers mailing list.

366
00:22:48,612 --> 00:22:51,492
So log_checkpoints should be enabled for all checkpoints.

367
00:22:51,659 --> 00:22:53,159
I'm a hundred percent sure.

368
00:22:53,219 --> 00:22:57,090
This is what you want to understand: the default was false.

369
00:22:57,840 --> 00:23:02,840
The default is disabled in Postgres 12, disabled in Postgres 14,

370
00:23:03,340 --> 00:23:06,910
but enabled in Postgres 15, which will be released very soon.

371
00:23:07,420 --> 00:23:09,820
So this is a new change in Postgres

372
00:23:09,820 --> 00:23:21,054
15: log_checkpoints is enabled, and I recommend enabling it for
any Postgres. So I also saw some DBAs see that, like, 90% of all

373
00:23:21,054 --> 00:23:24,174
checkpoints are requested; they occur according to max_wal_size.

374
00:23:24,621 --> 00:23:25,431
Is this a problem?

375
00:23:25,581 --> 00:23:26,601
No, it's not a problem.

376
00:23:26,781 --> 00:23:32,631
It's not a problem because a requested checkpoint
and a timed checkpoint, like, planned on schedule,

377
00:23:32,961 --> 00:23:41,278
they are the same actually, like, no big difference. But of
course you want to be in order, everything should be in order.

378
00:23:41,278 --> 00:23:48,341
Of course, it's just a sign that probably you
need to reconsider settings, but it's not an urgent situation.

379
00:23:48,565 --> 00:23:48,835
Right.

380
00:23:49,165 --> 00:23:49,495
Michael: Yeah,

381
00:23:49,766 --> 00:23:50,216
Nikolay: Well,

382
00:23:50,521 --> 00:23:50,626
Michael: good.

383
00:23:50,898 --> 00:23:54,038
Nikolay: There is another one, checkpoint_completion_target.

384
00:23:54,038 --> 00:23:55,238
We didn't mention it.

385
00:23:55,658 --> 00:24:00,113
And by default it's 0.7 or, or,

386
00:24:00,123 --> 00:24:00,473
or

387
00:24:00,518 --> 00:24:01,568
Michael: This changed.

388
00:24:01,568 --> 00:24:04,208
Yeah, I looked this up. Until very recently

389
00:24:04,213 --> 00:24:05,558
it was 0.5.

390
00:24:05,625 --> 00:24:07,474
Nikolay: Oh, 0.5 is terrible.

391
00:24:07,654 --> 00:24:08,194
I would say

392
00:24:08,494 --> 00:24:09,244
it's not what you.

393
00:24:09,863 --> 00:24:10,163
Michael: Yeah.

394
00:24:10,163 --> 00:24:12,503
But in 14 it was increased to 0.9.

395
00:24:12,953 --> 00:24:13,433
Nikolay: Great.

396
00:24:13,433 --> 00:24:14,273
This is a good number.

397
00:24:14,573 --> 00:24:16,013
So what is it?

398
00:24:16,061 --> 00:24:20,935
When you run a manual checkpoint, an explicit CHECKPOINT, it goes full speed.

399
00:24:20,935 --> 00:24:24,518
So it writes dirty buffers to disk as fast as possible.

400
00:24:24,668 --> 00:24:27,698
And it produces some stress on disk.

401
00:24:27,788 --> 00:24:28,208
It's OK

402
00:24:28,208 --> 00:24:31,658
stress, but normally you want to be

403
00:24:31,863 --> 00:24:33,483
more gentle with your disk system.

404
00:24:33,483 --> 00:24:33,783
Right?

405
00:24:34,113 --> 00:24:36,783
So that's why we spread it over time.

406
00:24:36,783 --> 00:24:39,033
And a 0.9 checkpoint completion

407
00:24:39,033 --> 00:24:44,163
target means that between two checkpoints,
90% of the time we want to spend writing.

408
00:24:44,703 --> 00:24:46,653
And 10% of the time we are resting.

409
00:24:47,163 --> 00:24:49,143
Maybe you want even more: 99,

410
00:24:49,148 --> 00:24:50,913
I don't know, like 99% of time.
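
To make the spreading concrete, here is a small Python sketch of the arithmetic. The 10 GiB of dirty buffers and the 30-minute timeout are assumed numbers for illustration, and the function name is hypothetical; it only models the averaging described here, not the checkpointer's actual pacing logic.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def checkpointer_write_plan(dirty_bytes, checkpoint_timeout_s, completion_target):
    """Return (write window in seconds, average write rate in bytes/second).

    The checkpoint's writes are spread over completion_target of the
    interval between checkpoints; the remainder is the "resting" time.
    """
    write_window_s = checkpoint_timeout_s * completion_target
    return write_window_s, dirty_bytes / write_window_s

# Assumed numbers: 10 GiB of dirty buffers, 30-minute checkpoint_timeout.
for target in (0.5, 0.9):
    window, rate = checkpointer_write_plan(10 * GIB, 30 * 60, target)
    print(f"target={target}: {window / 60:.1f} min of writing at {rate / MIB:.1f} MiB/s")
```

A higher target gives the same amount of data a longer window, so the average write rate the disks see is lower.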

411
00:24:51,543 --> 00:24:56,784
So, and this is important, because it's
hard to understand the distance between checkpoints.

412
00:24:56,784 --> 00:24:58,654
It's quite a tricky question.

413
00:24:58,688 --> 00:25:03,395
Logging will report something, but you can think about when a checkpoint starts.

414
00:25:03,545 --> 00:25:04,595
This is like the beginning.

415
00:25:05,165 --> 00:25:09,158
So 10 minutes between them, or 15 minutes between them,

416
00:25:09,863 --> 00:25:11,153
Or 30, it's fine.

417
00:25:11,693 --> 00:25:14,617
But what I wanted to deliver, this is very tricky.

418
00:25:14,622 --> 00:25:22,501
It bothered me for a few years, actually, and only in the book
by Egor Rogov, I already mentioned it, this PostgreSQL Internals...

419
00:25:23,001 --> 00:25:26,596
So I read it in Russian even earlier. Now it's published;

420
00:25:26,596 --> 00:25:27,976
both parts are published in English.

421
00:25:27,976 --> 00:25:28,546
It's very good.

422
00:25:28,663 --> 00:25:32,263
It explains everything in detail with links to source code.

423
00:25:32,263 --> 00:25:41,722
And finally, I understood why, if we set max_wal_size to one gigabyte, the
distance measured in bytes can be like 300-something megabytes.

424
00:25:41,882 --> 00:25:42,812
So it's like three times.

425
00:25:43,569 --> 00:25:44,219
Michael: Why is that?

426
00:25:44,919 --> 00:25:47,139
Nikolay: So, the explanation is interesting.

427
00:25:47,529 --> 00:25:49,149
I'm looking at it right now.

428
00:25:49,279 --> 00:25:51,409
So I knew it from practice.

429
00:25:51,422 --> 00:25:55,502
I just, like, when I saw, oh, you have the default one gigabyte,

430
00:25:55,562 --> 00:26:02,702
you know that it means that the actual distance measured
in bytes will be 300 megabytes. It's a tiny distance.

431
00:26:03,032 --> 00:26:06,362
It means that checkpoints will disturb your system constantly.

432
00:26:06,572 --> 00:26:08,642
And I even saw the case of a very large...

433
00:26:09,297 --> 00:26:13,399
where people had some cleanup job happening in the background.

434
00:26:13,429 --> 00:26:17,959
And then, before a big marketing event, they disabled this job.

435
00:26:18,409 --> 00:26:23,989
And then a couple of months later they realized that
the job was disabled, and some engineer, a very experienced one

436
00:26:24,319 --> 00:26:25,514
but not a Postgres expert,

437
00:26:25,519 --> 00:26:28,719
he said, okay, this job was not painful at all.

438
00:26:28,719 --> 00:26:30,249
It was working for many years.

439
00:26:30,249 --> 00:26:38,279
So he went ahead and tried to delete 10 million
rows using one DELETE and put the system down for 10 minutes

440
00:26:38,929 --> 00:26:43,224
because they didn't have checkpoint tuning in place.

441
00:26:43,404 --> 00:26:45,234
So max_wal_size was the default,

442
00:26:45,234 --> 00:26:49,944
one gigabyte; the actual distance was 300 megabytes.

443
00:26:49,944 --> 00:26:50,614
I will explain.

444
00:26:51,114 --> 00:26:56,124
It means that when you produce a lot, you
have checkpoints happening all the time.

445
00:26:56,124 --> 00:26:57,144
Boom, boom, boom, boom.

446
00:26:57,384 --> 00:26:59,814
And a lot of full-page writes.

447
00:26:59,814 --> 00:27:00,354
Boom, boom, boom.

448
00:27:00,924 --> 00:27:09,646
It's not compressed, and the disks were quite good, like enterprise disks, but
not NVMe, unfortunately. And, just, saturation happened and they went down

449
00:27:10,366 --> 00:27:12,256
for 10 minutes. Just one DELETE.

450
00:27:12,466 --> 00:27:16,040
I even had a talk, I did it in Russia some time ago,

451
00:27:16,490 --> 00:27:26,260
just about this case: how one line of DELETE can put your system down,
even though before it worked very well, and it's, like, a critical system.

452
00:27:26,260 --> 00:27:28,930
So checkpoint tuning is an important thing to have.

453
00:27:29,290 --> 00:27:37,127
So if you have one gigabyte: until Postgres 11,
if you have checkpoint_completion_target close to one,

454
00:27:37,930 --> 00:27:40,990
you should divide by three. Since Postgres 11,

455
00:27:40,990 --> 00:27:42,580
you should divide by two.

456
00:27:43,420 --> 00:27:48,280
So if you have one gigabyte max_wal_size,
the actual distance will be half a gigabyte,

457
00:27:48,280 --> 00:27:57,998
roughly, if your checkpoint_completion_target is 0.9, because
Postgres needs everything since the last checkpoint and also

458
00:27:58,118 --> 00:28:02,078
everything between the previous one and the latest one.

459
00:28:03,055 --> 00:28:03,405
Michael: Oh,

460
00:28:04,235 --> 00:28:07,745
Nikolay: And before Postgres 11, an additional cycle was needed.

461
00:28:08,245 --> 00:28:11,725
So two successful cycles and a tail, right?

462
00:28:11,725 --> 00:28:15,175
Not a tail: this tail is before us, not behind us.

463
00:28:15,180 --> 00:28:16,225
It's in front of us.

464
00:28:16,225 --> 00:28:16,405
Right.

465
00:28:16,735 --> 00:28:23,528
So if checkpoint_completion_target is 0.9, like,
roughly three times, like, three intervals are needed.
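
The divide-by-two and divide-by-three rule of thumb can be written out as a short Python sketch. The divisor used below (one or two completed cycles plus checkpoint_completion_target) follows the explanation given here and is meant as an estimate, not a reproduction of Postgres internals; the function name is hypothetical.

```python
def checkpoint_distance_mb(max_wal_size_mb, completion_target, pg_major):
    """Estimate how much WAL (in MB) triggers a requested checkpoint.

    Since Postgres 11: one full cycle plus the in-progress "tail".
    Before Postgres 11: one additional full cycle was kept.
    """
    full_cycles = 1.0 if pg_major >= 11 else 2.0
    return max_wal_size_mb / (full_cycles + completion_target)

# Default max_wal_size of 1 GB with checkpoint_completion_target = 0.9
print(round(checkpoint_distance_mb(1024, 0.9, 14)))  # about half a gigabyte
print(round(checkpoint_distance_mb(1024, 0.9, 10)))  # the "300-something" megabytes
```

This is why a 1 GB max_wal_size does not mean a checkpoint every gigabyte of WAL: the effective distance is a fraction of the setting.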

466
00:28:24,473 --> 00:28:24,773
Michael: Yeah.

467
00:28:24,848 --> 00:28:27,428
Nikolay: That's why you need to raise max_wal_size anyway.

468
00:28:27,508 --> 00:28:27,988
Right?

469
00:28:28,883 --> 00:28:32,873
Michael: Yeah, that seems like one that almost
everybody would want to increase.

470
00:28:33,203 --> 00:28:33,893
I've read, like,

471
00:28:34,838 --> 00:28:42,096
there are some other interesting ones that I'd be keen on your view on,
and also, actually, I guess: are people on cloud providers...

472
00:28:42,156 --> 00:28:43,566
You mentioned RDS a couple of times.

473
00:28:43,596 --> 00:28:45,396
Are they generally more protected from this,

474
00:28:45,396 --> 00:28:47,252
because they've been tuned already?

475
00:28:47,436 --> 00:28:54,334
Nikolay: Tuning here means increasing max_wal_size. Increase
max_wal_size, but do it not blindly: understand recovery.

476
00:28:54,949 --> 00:28:55,276
Michael: Yeah.

477
00:28:55,290 --> 00:29:00,591
The other one is min_wal_size, for example. I've read that that can increase write

478
00:29:00,621 --> 00:29:03,172
performance if you increase that number.

479
00:29:03,432 --> 00:29:03,937
Nikolay: From my practice,

480
00:29:03,942 --> 00:29:05,407
I cannot say anything here.

481
00:29:05,407 --> 00:29:10,097
Like, I didn't dive deeply enough to discuss this, but

482
00:29:10,097 --> 00:29:12,297
max_wal_size is my favorite topic.

483
00:29:12,592 --> 00:29:12,922
Michael: Yeah.

484
00:29:12,922 --> 00:29:16,282
If you haven't had to worry about min_wal_size, then I can't imagine it's

485
00:29:17,062 --> 00:29:18,112
that important.

486
00:29:18,112 --> 00:29:18,502
So, yeah.

487
00:29:18,562 --> 00:29:19,222
Good to know.

488
00:29:19,271 --> 00:29:22,631
And yeah, the, the only other one I wanted to ask about is,

489
00:29:23,051 --> 00:29:25,661
Nikolay: Let's... sorry, like, it's so important.

490
00:29:25,666 --> 00:29:35,168
I just want to emphasize it, you know. So if we have a very short
distance in terms of max_wal_size, forced checkpoints, and we have unexpected,

491
00:29:35,168 --> 00:29:38,978
or maybe expected, someone decided to do it, a lot of write activity.

492
00:29:39,022 --> 00:29:48,544
We can measure it with experiments. And what I found: you know, thin
clones are good to iterate, but we cannot use thin clones here because we need

493
00:29:48,604 --> 00:29:52,514
our disk and file system to behave exactly the same as on production.

494
00:29:53,084 --> 00:30:01,552
So I found a good recipe for how to have some workload
which will not touch our physical layout of data.

495
00:30:02,032 --> 00:30:02,392
It's

496
00:30:03,142 --> 00:30:14,732
a transaction with a massive DELETE, like delete 10 or a hundred million rows, but
cancel it, roll it back. DELETE will write to xmax.

497
00:30:14,852 --> 00:30:16,912
We discussed it a couple of months ago.

498
00:30:16,917 --> 00:30:21,992
Probably it'll write the transaction ID which deleted the tuple, but

499
00:30:22,022 --> 00:30:26,064
if the transaction got canceled, this is virtually zero.

500
00:30:26,124 --> 00:30:28,254
Zero means this tuple is still alive.

501
00:30:28,614 --> 00:30:38,846
So we produce a lot of WAL, produce a big stress on the system, but then we
say nothing changed and we can do another experiment on the same system.

502
00:30:39,026 --> 00:30:42,056
It's a perfect workload for a lab, right?
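
A minimal sketch of this lab workload, written as a Python helper that only builds the SQL text. The table and column names are hypothetical placeholders, and the exact statement shape is an assumption; the episode describes only "a massive DELETE that is rolled back". Identifiers are interpolated without quoting, so this is for a disposable lab system only.

```python
def rollback_delete_script(table, key_column, n_rows, order_column=None):
    """Build a script that deletes n_rows and then rolls the delete back.

    The DELETE generates WAL and dirties pages, but after ROLLBACK the
    rows are still visible, so the experiment can be repeated.
    """
    order_by = order_column or key_column
    return (
        "BEGIN;\n"
        f"DELETE FROM {table}\n"
        f"WHERE {key_column} IN (\n"
        f"    SELECT {key_column} FROM {table}\n"
        f"    ORDER BY {order_by}\n"
        f"    LIMIT {n_rows}\n"
        ");\n"
        "ROLLBACK;\n"
    )

# Hypothetical table "events" with primary key "id"
print(rollback_delete_script("events", "id", 10_000_000))
```

The generated script could be run with psql against a lab copy of production while watching disk I/O, then repeated with a different max_wal_size.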

503
00:30:42,305 --> 00:30:46,185
So we can have a sequence of experiments with different max_wal_size

504
00:30:46,820 --> 00:30:52,109
and see, using, like, iotop, or, I don't know,
iostat; we can use monitoring.

505
00:30:52,349 --> 00:30:56,860
I recommend using Netdata because it has an export button.

506
00:30:56,860 --> 00:31:00,809
You can export all graphs and you can see how the disks behave.

507
00:31:01,019 --> 00:31:10,142
And usually if you have one gigabyte max_wal_size and the disks are not
very powerful, you'll see a plateau because it's saturated. Right.

508
00:31:10,652 --> 00:31:10,862
Then

509
00:31:11,349 --> 00:31:14,469
double your max_wal_size: plateau again. Double max_wal_size...

510
00:31:14,589 --> 00:31:18,279
At some point you will see your system under such stress:

511
00:31:18,779 --> 00:31:24,509
it's already not a plateau. And iotop
shows, when max_wal_size is small, iotop shows,

512
00:31:24,589 --> 00:31:28,724
if you order by write throughput, the
checkpointer will be number one. It writes

513
00:31:29,399 --> 00:31:34,799
200, 300, I don't know, 500 megs per second, like, write, write, write.

514
00:31:34,919 --> 00:31:38,589
Also, I promised to explain two reasons for overhead. One

515
00:31:38,589 --> 00:31:40,498
we already discussed: full-page writes.

516
00:31:40,798 --> 00:31:47,550
If we just finished our checkpoint and we needed to start
another because max_wal_size commands us to have them very frequently,

517
00:31:48,307 --> 00:31:49,297
it's, like, insane.

518
00:31:49,327 --> 00:31:50,467
Insane checkpointing,

519
00:31:50,887 --> 00:31:53,137
checkpointing like the checkpointer went mad.

520
00:31:53,137 --> 00:31:53,437
Right?

521
00:31:53,617 --> 00:31:56,317
So for example, I saw like every 15 seconds.

522
00:31:56,347 --> 00:31:57,217
Boom, boom, boom, boom.

523
00:31:57,217 --> 00:32:01,267
Because of these deletes, like, 300 megs:

524
00:32:01,267 --> 00:32:02,737
it's quite, quite fast.

525
00:32:02,947 --> 00:32:06,037
So again, again, again, default settings.

526
00:32:06,127 --> 00:32:08,985
So full-page writes are one type of overhead.

527
00:32:08,985 --> 00:32:13,039
So basically... oh, also make your deletes non-sequential.

528
00:32:13,539 --> 00:32:23,262
For example, you can have some index on some random column and you can say,
let's delete the first 10 million rows ordered by this column, but it's random.

529
00:32:23,262 --> 00:32:25,722
So the first tuple is on the first page.

530
00:32:25,722 --> 00:32:26,112
The second

531
00:32:26,232 --> 00:32:27,552
tuple is on page number one thousand.

532
00:32:27,672 --> 00:32:29,382
So we jump between various pages,

533
00:32:29,958 --> 00:32:33,798
and this is the worst type, because we could benefit...

534
00:32:33,858 --> 00:32:40,248
Like, if they are sequential, probably all changes
in one page will happen inside one checkpointing cycle.

535
00:32:40,638 --> 00:32:47,618
But if we jump between pages, we constantly produce a lot of full-page writes,
and we need to produce them once again, because a checkpoint just finished.

536
00:32:47,623 --> 00:32:47,888
Right.

537
00:32:48,178 --> 00:32:49,408
This is the worst situation.

538
00:32:49,408 --> 00:32:53,643
And this happens. This is exactly what put down that system I described.

539
00:32:53,733 --> 00:32:56,373
So the second one I didn't realize, but it's quite obvious.

540
00:32:56,378 --> 00:32:57,933
The second overhead is quite obvious.

541
00:32:58,323 --> 00:33:00,976
If our page was dirty, it was checkpointed.

542
00:33:01,036 --> 00:33:02,926
Now it's clean, checkpointed.

543
00:33:03,226 --> 00:33:04,074
Uh, okay.

544
00:33:04,104 --> 00:33:05,934
We visited it once again, it became dirty

545
00:33:05,934 --> 00:33:06,234
Again.

546
00:33:06,627 --> 00:33:07,647
We need to write it again.

547
00:33:07,840 --> 00:33:12,600
If two writes were inside one checkpointing cycle, we would produce only one.

548
00:33:13,206 --> 00:33:18,656
But if two of our visits happened in different
checkpoint cycles, we need to perform two disk writes.

549
00:33:19,846 --> 00:33:20,656
Michael: It's more I/O.

550
00:33:21,166 --> 00:33:21,586
Nikolay: Right?

551
00:33:21,826 --> 00:33:22,096
Right.

552
00:33:22,096 --> 00:33:24,976
So a sequential delete is not that bad.

553
00:33:25,235 --> 00:33:29,065
A random delete, according to some index, can be, right?

554
00:33:29,125 --> 00:33:30,025
Michael: that's a good point.

555
00:33:30,025 --> 00:33:30,895
So as well as all.

556
00:33:30,906 --> 00:33:31,296
Nikolay: Sorry.

557
00:33:31,296 --> 00:33:33,966
I, like... so much fun.

558
00:33:33,966 --> 00:33:38,016
I spent some months exploring it and we made a lot of very good,

559
00:33:38,021 --> 00:33:47,545
like, I would say, enterprise-scale experiments, and I can take any system
and show exactly how recovery will behave, how exactly you need to tune.

560
00:33:47,550 --> 00:33:49,585
I can, like, show graphs.

561
00:33:49,645 --> 00:33:54,581
It's quite expensive, in terms of
time and probably money, research of a system.

562
00:33:54,581 --> 00:33:55,753
But I think big systems,

563
00:33:56,408 --> 00:34:01,713
they need to understand their workload, their
system, and understand what will happen.

564
00:34:02,113 --> 00:34:07,087
So this random delete, I named it "double unfortunate".

565
00:34:07,807 --> 00:34:10,027
You can be unfortunate because you crashed.

566
00:34:10,497 --> 00:34:15,754
And you are unfortunate twice because you
crashed during some random, intensive writes.

567
00:34:15,999 --> 00:34:25,037
In this case, you definitely want to understand your max_wal_size and checkpoint_timeout,
and you want your disk I/O graph not to have a plateau, but to be, like, spiky.

568
00:34:25,230 --> 00:34:28,500
And this is a sign that you have some room for other I/O.

569
00:34:29,000 --> 00:34:35,990
This is like, our research shows, okay, at 16
gigabytes or 32 gigabytes, we already don't have a plateau.

570
00:34:36,170 --> 00:34:41,537
So this is our desired setting for max_wal_size,
maybe a hundred gigabytes, then divided by two.

571
00:34:41,717 --> 00:34:49,127
Like, we need to understand this as well. And then we say,
okay, but in this case, recovery time can be... at normal times

572
00:34:49,127 --> 00:34:52,079
it'll be this; at bad times, when somebody is writing randomly

573
00:34:52,579 --> 00:34:53,059
a lot,

574
00:34:53,479 --> 00:34:55,519
it can be this, like 10 minutes.

575
00:34:55,729 --> 00:34:57,429
Can you afford it, or is it not good here?

576
00:34:57,508 --> 00:35:05,006
So, you see how much, like, I had in the past with max_wal_size,

577
00:35:05,006 --> 00:35:05,806
especially.

578
00:35:06,201 --> 00:35:06,501
Michael: Yeah.

579
00:35:07,336 --> 00:35:07,996
This is great.

580
00:35:07,996 --> 00:35:10,271
And I hope people are encouraged to go.

581
00:35:10,271 --> 00:35:11,795
And you can easily check this by the way.

582
00:35:11,800 --> 00:35:16,849
Can't you? Just like with all Postgres
parameters, you can just use SHOW: SHOW max_wal_size.

583
00:35:16,854 --> 00:35:20,680
If you get one gigabyte back, maybe it's time to have a look at that.

584
00:35:20,760 --> 00:35:22,741
Same with checkpoint_timeout.

585
00:35:22,741 --> 00:35:28,441
So SHOW checkpoint_timeout; check that out. If
it comes back five minutes, or it might say 300 seconds,

586
00:35:29,261 --> 00:35:30,581
again, another one to look at.
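
As a small illustration of this check, here is a Python sketch that takes the text returned by SHOW and flags the defaults mentioned here (1 GB and five minutes). It does not connect to a database; the helper names are hypothetical, and the unit tables cover only the common suffixes, which is an assumption about how your server formats the values.

```python
SIZE_UNITS_MB = {"kB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}
TIME_UNITS_S = {"ms": 0.001, "s": 1, "min": 60, "h": 3600, "d": 86400}

def parse_setting(value, units):
    """Convert a value such as '1GB' or '5min' using the given unit table."""
    # Try longer suffixes first so that 'ms' is not mistaken for 's'.
    for suffix in sorted(units, key=len, reverse=True):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unrecognized unit in {value!r}")

def still_default(max_wal_size, checkpoint_timeout):
    """Report which of the two settings still have their default values."""
    return {
        "max_wal_size": parse_setting(max_wal_size, SIZE_UNITS_MB) == 1024,
        "checkpoint_timeout": parse_setting(checkpoint_timeout, TIME_UNITS_S) == 300,
    }

# Values as they might come back from SHOW max_wal_size / SHOW checkpoint_timeout
print(still_default("1GB", "5min"))
print(still_default("32GB", "300s"))
```

Either setting reported as True would be a candidate for the kind of review discussed in this episode.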

587
00:35:30,921 --> 00:35:31,801
Anything else?

588
00:35:31,836 --> 00:35:36,486
Nikolay: There are other things, but let's
stop at this point because we are out of time.

589
00:35:36,486 --> 00:35:37,446
Definitely here.

590
00:35:37,946 --> 00:35:41,699
I apologize for too, too many details in this case.

591
00:35:41,894 --> 00:35:47,384
Michael: I don't think that's the kind of feedback
we... if anybody thinks we did do too many details, let me know,

592
00:35:47,529 --> 00:35:48,089
Nikolay: Right.

593
00:35:48,134 --> 00:35:49,210
Michael: think that's gonna be the,

594
00:35:49,719 --> 00:35:56,824
Nikolay: And again, I want to, like, advertise Netdata here,
because if you do some experiments, you take, like, the

595
00:35:56,824 --> 00:36:04,515
same virtual machine, same settings, everything as production,
you do this very unfortunate, massive delete, rolled back, again

596
00:36:04,515 --> 00:36:13,168
delete, and you check with various settings. Do install Netdata
and export the whole dashboard, with all disk I/O and everything, to a file.

597
00:36:13,168 --> 00:36:16,018
And then you can compare: you can open in a browser several,

598
00:36:16,055 --> 00:36:18,719
several files, right,

599
00:36:18,869 --> 00:36:22,620
and see exactly the difference in behavior for different settings.

600
00:36:23,070 --> 00:36:24,000
It's so convenient

601
00:36:24,750 --> 00:36:25,230
and you can

602
00:36:25,235 --> 00:36:25,800
store the,

603
00:36:25,800 --> 00:36:27,579
those artifacts long-term.

604
00:36:28,079 --> 00:36:30,779
Michael: Yeah, I really enjoyed you showing me that.

605
00:36:30,839 --> 00:36:32,909
I, I also wanted to advertise a few things.

606
00:36:32,939 --> 00:36:35,687
There's a couple of great websites for checking out parameters.

607
00:36:35,687 --> 00:36:37,748
If you want to see like what they mean.

608
00:36:37,748 --> 00:36:42,266
Obviously the Postgres documentation's
great, but there's also postgresqlco.nf.

609
00:36:42,566 --> 00:36:45,326
Oh, postgresqlco.nf; I'll link it up.

610
00:36:45,646 --> 00:36:47,276
And pgPedia as well.

611
00:36:47,276 --> 00:36:48,716
I find great for this kind of thing.

612
00:36:48,716 --> 00:36:50,957
They have a section on this I found useful.

613
00:36:51,197 --> 00:36:52,787
So I'll, I'll share those as well.

614
00:36:53,442 --> 00:36:53,742
Nikolay: right.

615
00:36:54,192 --> 00:37:01,152
But, like, I suppose if you have a heavily loaded OLTP
system, you probably will set checkpoint_timeout to 15 or 30

616
00:37:01,152 --> 00:37:09,832
minutes and max_wal_size to something like 32 gigabytes, at
least, maybe more. But better to conduct full-fledged research

617
00:37:10,092 --> 00:37:13,722
and make decisions based on your requirements from business.

618
00:37:14,757 --> 00:37:14,937
Michael: Yeah,

619
00:37:15,042 --> 00:37:15,432
Nikolay: Good.

620
00:37:16,262 --> 00:37:17,042
Michael: Thanks so much, Nikolay.

621
00:37:17,042 --> 00:37:17,792
Thanks everyone.

622
00:37:18,093 --> 00:37:19,053
Nikolay: Thank you everyone.

623
00:37:19,053 --> 00:37:19,383
Yes.

624
00:37:19,455 --> 00:37:22,035
Share like share, share, share, share is important.

625
00:37:22,035 --> 00:37:22,995
Most important, probably.

626
00:37:23,565 --> 00:37:28,284
And, by the way, I have a special request to our listeners today.

627
00:37:28,524 --> 00:37:33,954
If you have an iOS device, please go to Apple Podcasts

628
00:37:33,983 --> 00:37:35,169
and like us please.

629
00:37:35,169 --> 00:37:36,279
And write a review.

630
00:37:36,969 --> 00:37:37,089
would

631
00:37:37,389 --> 00:37:40,709
Michael: Yeah, I dunno if you saw, but we got a nice one already.

632
00:37:40,709 --> 00:37:42,285
So thank you to that person.

633
00:37:43,005 --> 00:37:43,305
Nikolay: Good.

634
00:37:43,305 --> 00:37:43,975
Thank you, Michael.

635
00:37:44,070 --> 00:37:44,730
Michael: Cheers everyone.

636
00:37:45,075 --> 00:37:45,495
Nikolay: Bye