1
00:00:00,060 --> 00:00:02,280
Michael: Hello and welcome to Postgres.FM,
a weekly show about

2
00:00:02,280 --> 00:00:03,340
all things PostgreSQL.

3
00:00:03,340 --> 00:00:06,100
I am Michael, founder of pgMustard,
and as usual, this is my

4
00:00:06,100 --> 00:00:08,220
co-host Nikolay, founder of Postgres.AI.

5
00:00:08,220 --> 00:00:10,460
Hey Nikolay, what are we talking
about today?

6
00:00:10,760 --> 00:00:11,460
Nikolay: Hi Michael.

7
00:00:12,260 --> 00:00:18,400
We are talking about snapshots
and not only snapshots, but backups

8
00:00:18,460 --> 00:00:21,900
of large databases and RPO, RTO.

9
00:00:22,540 --> 00:00:24,740
More RTO, actually.

10
00:00:25,080 --> 00:00:30,180
RTO means how much time it would
take for us to recover from

11
00:00:30,180 --> 00:00:31,900
some disaster from backups.

12
00:00:33,420 --> 00:00:37,160
So, is it minutes, hours, days?

13
00:00:38,100 --> 00:00:43,280
For example, let's take some hypothetical
database of 10 terabytes

14
00:00:43,780 --> 00:00:48,740
and consider what we would expect
from a good backup system in

15
00:00:49,200 --> 00:00:53,480
various managed Postgres situations
and also self-managed in

16
00:00:53,480 --> 00:01:00,700
cloud, maybe bare metal as well.
Okay, pardon the bird sounds.

17
00:01:00,700 --> 00:01:02,120
Michael: It's a nice
sound.

18
00:01:02,120 --> 00:01:05,420
Nikolay: It's spring. I've got to
close the door, and they decided

19
00:01:05,460 --> 00:01:07,280
to have a nest again here.

20
00:01:07,280 --> 00:01:09,180
This is what they try to do every
time.

21
00:01:11,640 --> 00:01:12,760
Michael: So yeah, good topic.

22
00:01:12,880 --> 00:01:15,060
Why is it on your mind at the moment?

23
00:01:15,840 --> 00:01:17,340
Nikolay: Well yeah, it's a good
question.

24
00:01:17,580 --> 00:01:22,440
I just observe larger and larger
databases and I also observe

25
00:01:22,440 --> 00:01:26,260
different situations of managed
services recently and I see that

26
00:01:26,800 --> 00:01:31,420
it's really a big pain when you
achieve some level of scale,

27
00:01:31,980 --> 00:01:35,260
maybe not 10 terabytes, maybe 20
or 50 terabytes.

28
00:01:35,340 --> 00:01:43,160
And at this level, it becomes a
big pain to do various operations,

29
00:01:44,100 --> 00:01:46,840
including backups. I mean, recovery
first of all, but backups as

30
00:01:46,840 --> 00:01:47,160
well.

31
00:01:47,160 --> 00:01:51,020
We will talk about them as well,
but also provisioning of new

32
00:01:51,020 --> 00:01:54,400
nodes, provisioning of replicas
or clones.

33
00:01:55,520 --> 00:01:57,580
It becomes more and more difficult.

34
00:01:58,380 --> 00:02:01,820
And the only solution I see is
snapshots.

35
00:02:02,980 --> 00:02:09,500
And we have 2 big, I mean, the 2
most popular backup systems right

36
00:02:09,500 --> 00:02:11,020
now in Postgres ecosystem.

37
00:02:11,480 --> 00:02:15,520
They are pgBackRest and WAL-G,
in my opinion.

38
00:02:15,520 --> 00:02:17,440
All others became less popular.

39
00:02:17,440 --> 00:02:18,540
This is my perception.

40
00:02:18,600 --> 00:02:19,640
I might be wrong.

41
00:02:19,640 --> 00:02:22,760
I have only anecdotal data to prove
it.

42
00:02:23,360 --> 00:02:27,840
But it's time for both of them
to seriously think about cloud

43
00:02:27,840 --> 00:02:29,740
snapshots in cloud environments.

44
00:02:30,820 --> 00:02:34,540
Snapshots of... When I say cloud snapshots,
let's clarify: I mean

45
00:02:34,540 --> 00:02:41,680
snapshots of EBS volumes in AWS,
of persistent disks, PD, SSD,

46
00:02:41,680 --> 00:02:47,040
or others, Hyperdisk in Google
Cloud, and alternatives in other

47
00:02:47,040 --> 00:02:47,540
clouds.

48
00:02:48,920 --> 00:02:52,360
I have less experience with Azure,
but let's cover all 3.

49
00:02:52,360 --> 00:02:54,940
But again, I have much less experience
with Azure.

50
00:02:55,380 --> 00:03:00,300
My main experience is with GCP
and AWS.

51
00:03:01,500 --> 00:03:08,380
So these snapshots, I think, must
be used by those who create

52
00:03:08,380 --> 00:03:13,080
large database systems, maintain
them, and especially if you

53
00:03:13,080 --> 00:03:19,120
are a managed Postgres provider,
it's inevitable that you should

54
00:03:19,120 --> 00:03:22,860
rely on these disk snapshots
for backups and restoration.

55
00:03:23,680 --> 00:03:31,500
And, of course, I think AWS RDS
Postgres and Google Cloud SQL,

56
00:03:32,080 --> 00:03:33,760
they already do it.

57
00:03:33,760 --> 00:03:37,300
I suspect Azure also should do
it, but we have many more managed

58
00:03:37,300 --> 00:03:38,700
Postgres providers, right?

59
00:03:39,020 --> 00:03:40,200
Michael: So many now, yeah.

60
00:03:40,200 --> 00:03:44,280
Nikolay: Right, and also we have
some people who self-manage

61
00:03:44,440 --> 00:03:45,780
their Postgres clusters.

62
00:03:46,400 --> 00:03:50,780
And also we have, of course, many
users who use RDS or Cloud

63
00:03:50,780 --> 00:03:51,980
SQL or Azure.

64
00:03:52,540 --> 00:03:57,240
And I think it's good to understand
some details about how backups

65
00:03:57,240 --> 00:04:02,160
of large databases are done and
the role of snapshots in those

66
00:04:02,160 --> 00:04:02,660
backups.

67
00:04:03,400 --> 00:04:07,920
Michael: So yeah, so I understand
the attraction for huge, huge

68
00:04:07,920 --> 00:04:13,840
databases in terms of recovery
time, but are there also benefits

69
00:04:14,200 --> 00:04:16,720
for smaller databases as well?

70
00:04:16,860 --> 00:04:20,200
Or are we just fine
recovering from a normal backup

71
00:04:20,200 --> 00:04:25,320
like pgBackRest or WAL-G, as you
say? Are there speed

72
00:04:25,320 --> 00:04:28,580
advantages at smaller sizes as
well?

73
00:04:29,840 --> 00:04:35,320
Nikolay: That's a good question.
Well, yeah, let's just let

74
00:04:35,320 --> 00:04:41,140
me walk us through the steps of
backup and recovery.

75
00:04:41,380 --> 00:04:42,080
Michael: Good idea.

76
00:04:42,360 --> 00:04:46,560
Nikolay: With WAL-G or pgBackRest,
let's forget about delta

77
00:04:46,560 --> 00:04:46,960
backups.

78
00:04:46,960 --> 00:04:51,320
Let's consider the basic,
most fundamental process.

79
00:04:51,860 --> 00:04:57,540
So first of all, we need to take
a copy of the data directory and bring

80
00:04:57,540 --> 00:05:00,420
it to the backup archive.

81
00:05:00,720 --> 00:05:06,540
It's usually in an object store like
AWS S3 or Google Cloud Storage, GCS.

82
00:05:07,880 --> 00:05:08,380
Right.

83
00:05:08,440 --> 00:05:13,340
And in Azure, it's Blob Storage,
or whatever they call it.

84
00:05:13,580 --> 00:05:19,800
And this copy of data directory
is not consistent by default.

85
00:05:20,540 --> 00:05:21,420
And it's okay.

86
00:05:21,820 --> 00:05:24,820
We can have inconsistent copy.

87
00:05:25,200 --> 00:05:29,180
We can copy data directory manually
with rsync, with cp.

88
00:05:29,180 --> 00:05:30,060
It doesn't matter.

89
00:05:30,060 --> 00:05:32,740
It will take time to copy.

90
00:05:33,080 --> 00:05:36,820
And usually roughly like 4 terabytes,
it's roughly like 1 hour

91
00:05:36,820 --> 00:05:37,820
or half an hour.

92
00:05:38,480 --> 00:05:41,420
If it's faster, you're lucky, you
have good infrastructure.

93
00:05:42,120 --> 00:05:46,200
If it's slower than 1 hour, something
should be tuned.

94
00:05:46,500 --> 00:05:49,840
But if you have 10 terabytes, we're
ready to talk about 5

95
00:05:49,840 --> 00:05:54,840
to 10 hours. If you have 50 terabytes,
it's more than 1 day.
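
As a back-of-envelope check of these numbers (the throughput is an assumption; real speed depends on disks, network, compression, and parallelism):

```shell
# Rough copy-time estimate: size in GiB, throughput in MiB/s.
copy_time_hours() {
  local size_gib=$1 mib_per_s=$2
  # total MiB divided by MiB/s gives seconds; 3600 seconds per hour
  echo $(( size_gib * 1024 / mib_per_s / 3600 ))
}

copy_time_hours 10240 300   # 10 TiB at 300 MiB/s: about 9 hours
copy_time_hours 51200 300   # 50 TiB at 300 MiB/s: about 48 hours
```

At roughly 1 GiB/s, a well-tuned setup, the same 10 TiB fits into the "lucky" 2-3 hour range.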

96
00:05:55,320 --> 00:05:59,760
right, to copy. And it's inconsistent,
but it's fine, because we

97
00:05:59,760 --> 00:06:05,260
do it while wrapping this process
up with 2 calls, pg_start_backup

98
00:06:05,660 --> 00:06:06,360
and pg_stop_backup.

99
00:06:07,800 --> 00:06:14,180
These 2 important calls tell Postgres
that it should keep WALs

100
00:06:14,240 --> 00:06:16,800
that will be used in recovery.

101
00:06:17,680 --> 00:06:22,660
It's understood that the directory
copy is not consistent, but

102
00:06:22,660 --> 00:06:26,180
we have enough WALs to reach consistency
point.

103
00:06:26,820 --> 00:06:32,460
So when we recover this backup,
this full backup, it will be

104
00:06:32,460 --> 00:06:36,720
copied back to some normal disk,
to another machine, right?

105
00:06:38,360 --> 00:06:41,440
And we fetch enough WALs and replay
some WALs to reach the consistency

106
00:06:41,640 --> 00:06:42,140
point.

107
00:06:42,640 --> 00:06:45,740
And here is a problem.

108
00:06:46,100 --> 00:06:50,400
When you recover, sometimes you
think, oh, how long will it take

109
00:06:50,440 --> 00:06:55,960
to reach the consistency point? And
I have a recipe in my set

110
00:06:55,960 --> 00:07:02,500
of how-to articles: how to understand
the current position in terms

111
00:07:02,500 --> 00:07:09,020
of LSN and how much is left
to reach the consistency point, because

112
00:07:09,520 --> 00:07:14,540
while those WALs are applied, any
connection attempt will be

113
00:07:14,540 --> 00:07:18,680
rejected, right? And Postgres
needs to reach the consistency

114
00:07:18,800 --> 00:07:21,180
point to open the gates and accept
connections.
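
One such recipe, sketched: LSNs like '5/3000F3D8' are two hex numbers, where the left part counts 4 GiB units, so you can turn the replayed position (for example, from `pg_last_wal_replay_lsn()`, or from the startup process logs) and the target LSN into byte positions and subtract:

```shell
# Convert an LSN like '5/3000F3D8' into an absolute byte position.
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( (16#$hi << 32) + 16#$lo ))
}

# WAL bytes still to replay between the current and target LSN.
wal_bytes_left() {
  echo $(( $(lsn_to_bytes "$2") - $(lsn_to_bytes "$1") ))
}

wal_bytes_left 5/3000F3D8 6/0   # bytes remaining until LSN 6/0
```

Dividing the remaining bytes by the observed replay rate gives a rough time-to-consistency estimate.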

115
00:07:22,240 --> 00:07:28,380
And I think there was some work
to improve logging of this process.

116
00:07:28,840 --> 00:07:32,080
And maybe it will happen in Postgres
18.

117
00:07:32,080 --> 00:07:34,940
I saw some discussions and some
patches, I think.

118
00:07:35,280 --> 00:07:38,940
But before that, while we still
don't have transparent logging

119
00:07:38,940 --> 00:07:44,000
of that process, I have some recipes,
so you can use them to

120
00:07:44,480 --> 00:07:49,900
be less anxious, because sometimes
you wait many, many minutes

121
00:07:49,900 --> 00:07:54,960
and don't understand, and it's
not fun, right? And that's

122
00:07:54,960 --> 00:07:58,940
it, this is the full backup. But it's
only part 1, because it will reach

123
00:07:58,940 --> 00:08:05,600
the consistency point and it will represent
the backup for some time

124
00:08:05,600 --> 00:08:06,360
in the past.

125
00:08:07,540 --> 00:08:13,760
But we also need to reach
the latest state of the database.

126
00:08:13,780 --> 00:08:19,340
And for that, we need the second
part of the backup archive: the

127
00:08:19,340 --> 00:08:20,140
WAL archive.

128
00:08:21,300 --> 00:08:26,960
Write-ahead log files, by default
they are 16 MB; on RDS they

129
00:08:26,960 --> 00:08:29,280
are tuned to 64 MB.

130
00:08:30,160 --> 00:08:37,860
Most clusters use the default 16 MB,
so this small file is also compressed

131
00:08:38,560 --> 00:08:40,760
in 1 way or another.

132
00:08:40,760 --> 00:08:47,640
So it takes not 16, but say 7 or
10 or 5 MB, depending on the

133
00:08:47,640 --> 00:08:49,200
nature of data inside it.

134
00:08:49,400 --> 00:08:54,680
And then when we recover, we reach
the consistency point, and

135
00:08:54,680 --> 00:08:56,120
we start replaying WALs.

136
00:08:56,120 --> 00:08:59,840
Postgres starts replaying WALs
according to restore command

137
00:09:00,060 --> 00:09:05,640
configuration, which just tells
fetch next WAL, and we will

138
00:09:05,640 --> 00:09:06,420
apply it, right?

139
00:09:06,420 --> 00:09:08,380
Postgres will apply it.
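
For reference, the restore_command for these two tools looks roughly like this (a sketch; the stanza name is a placeholder, and both tools take many more options):

```ini
# postgresql.conf, WAL-G:
restore_command = 'wal-g wal-fetch %f %p'

# postgresql.conf, pgBackRest (stanza name is a placeholder):
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
```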

140
00:09:08,380 --> 00:09:12,800
And there's prefetching, because,
of course, object storage is

141
00:09:12,800 --> 00:09:14,440
good with parallelization.

142
00:09:15,300 --> 00:09:18,900
It's not a good idea to use just
a single process to fetch a

143
00:09:18,900 --> 00:09:20,040
file sequentially.

144
00:09:21,180 --> 00:09:25,640
We can ask to fetch 10 files in
parallel to move faster, because

145
00:09:25,860 --> 00:09:30,620
in a heavily loaded system sometimes
we see the source, the production,

146
00:09:30,640 --> 00:09:32,940
can generate many, many WALs per
second.

147
00:09:33,520 --> 00:09:37,560
And recovery, it can be like dozens
of WALs per second, depending

148
00:09:37,740 --> 00:09:39,880
on hardware and so on.

149
00:09:39,880 --> 00:09:42,600
So it's worth fetching very fast.

150
00:09:42,740 --> 00:09:47,220
And this process can take long,
of course. Like, after we did

151
00:09:47,220 --> 00:09:51,100
full backup, many WALs can be
archived until the latest point.

152
00:09:51,100 --> 00:09:56,200
And you can restore to the latest
point or to some point in time.

153
00:09:56,200 --> 00:09:57,940
It's called point-in-time recovery,
right?

154
00:09:57,940 --> 00:10:01,840
You can configure, I need this
timestamp or this LSN, and you

155
00:10:01,840 --> 00:10:02,740
need to wait.

156
00:10:03,480 --> 00:10:08,300
So, of course, it's good if you
have quite frequent full backups.

157
00:10:09,140 --> 00:10:12,880
It means that WAL replay, this
additional WAL replay, because

158
00:10:12,880 --> 00:10:15,800
there is WAL replay to reach consistency
point, but there is

159
00:10:15,800 --> 00:10:20,020
additional WAL replay to reach
the latest or desired point in

160
00:10:20,020 --> 00:10:20,520
time.

161
00:10:21,020 --> 00:10:22,580
It will be much less.

162
00:10:22,800 --> 00:10:27,580
But usually, like, to replay a
whole day of WAL, it takes

163
00:10:27,580 --> 00:10:28,860
not a whole day, definitely.

164
00:10:29,280 --> 00:10:34,480
It takes maybe an hour
or 2 hours, it depends a lot

165
00:10:34,480 --> 00:10:36,300
on a particular case.

166
00:10:36,600 --> 00:10:40,580
But usually it's much faster than
to generate those WALs.

167
00:10:42,180 --> 00:10:46,980
And to replay 7 days of WALs,
well, it might take up to 1 day,

168
00:10:47,040 --> 00:10:48,900
maybe less, it depends again.

169
00:10:49,600 --> 00:10:55,620
So people need to configure the
frequency of full backups and

170
00:10:55,620 --> 00:11:01,760
understand that up to the whole
period between 2 backups, we

171
00:11:01,760 --> 00:11:05,440
might need to replay this amount
of WALs.

172
00:11:05,440 --> 00:11:05,940
Right.

173
00:11:06,540 --> 00:11:09,260
This affects recovery time as well.
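
The total can be sketched as restore time plus replay time, where the replay-vs-generation speedup ratio is an assumption to replace with your own measurements:

```shell
# Rough RTO: hours to restore the full backup, plus hours to replay
# the WAL accumulated since it was taken, assuming replay runs some
# factor faster than production generated that WAL.
rto_hours() {
  local restore_h=$1 days_of_wal=$2 replay_speedup=$3
  echo $(( restore_h + days_of_wal * 24 / replay_speedup ))
}

rto_hours 9 1 12   # 9h restore + 2h to replay 1 day of WAL: prints 11
```

With 7 days of WAL between full backups, the same assumptions give 9 + 14 = 23 hours, which is why full-backup frequency matters so much.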

174
00:11:09,260 --> 00:11:12,020
RTO, recovery time objective.

175
00:11:12,160 --> 00:11:12,660
Right.

176
00:11:13,360 --> 00:11:19,000
And to improve that situation,
both pgBackRest and WAL-G support

177
00:11:19,000 --> 00:11:19,820
Delta backups.

178
00:11:23,100 --> 00:11:27,600
Instead of copying whole 10 terabytes
or 50 terabytes to GCS

179
00:11:27,860 --> 00:11:32,740
or S3, they copy only delta compared
to previous backup.

180
00:11:33,480 --> 00:11:38,860
And then recovery looks like restore
full backup and apply delta

181
00:11:38,860 --> 00:11:39,360
backups.

182
00:11:39,520 --> 00:11:44,500
It's just some diffs of files basically,
which is supposed to

183
00:11:44,500 --> 00:11:46,160
be faster than WAL replay.

184
00:11:46,560 --> 00:11:50,920
And also much cheaper because you
don't need to keep, for example,

185
00:11:50,920 --> 00:11:54,520
for 7 days, you don't need to keep
7 days of full backups, but

186
00:11:54,520 --> 00:11:57,540
maybe you keep only 1 full backup
plus deltas only.
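
The budget side is easy to illustrate (sizes here are made-up placeholders; the real delta size depends on how much data changes per day):

```shell
# Retention storage, in GiB: 7 daily fulls vs 1 full + 6 daily deltas.
full_gib=10240    # 10 TiB full backup (placeholder)
delta_gib=512     # assumed daily delta, ~5% churn (placeholder)

echo $(( 7 * full_gib ))              # fulls only: 71680 GiB
echo $(( full_gib + 6 * delta_gib ))  # full + deltas: 13312 GiB
```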

187
00:11:58,600 --> 00:12:04,040
It's already optimization of budget
and also optimization of

188
00:12:04,540 --> 00:12:08,900
our RTO, because replaying WAL
is slower than applying those

189
00:12:08,900 --> 00:12:09,400
deltas.

190
00:12:10,580 --> 00:12:16,720
However, we still need to start
with the first full backup fetched

191
00:12:16,800 --> 00:12:17,520
and applied.

192
00:12:18,080 --> 00:12:23,080
And it's definitely slower than
if we had this full backup made

193
00:12:23,080 --> 00:12:26,700
recently, so we just apply this
and so on.

194
00:12:27,120 --> 00:12:34,900
Anyway, if you use pgBackRest or
WAL-G, they basically ignore

195
00:12:34,900 --> 00:12:37,040
the existence of another alternative.

196
00:12:37,540 --> 00:12:43,400
Instead of copying data directly
to object storage, you can use

197
00:12:43,780 --> 00:12:47,740
cloud-native capabilities and snapshot
the disk.

198
00:12:49,020 --> 00:12:53,600
Snapshotting a disk is also bringing
data to object storage in

199
00:12:53,680 --> 00:12:56,940
the background, but it's on the
shoulders of the cloud provider,

200
00:12:57,180 --> 00:12:59,780
AWS or GCP or Azure.

201
00:13:00,140 --> 00:13:03,100
And they have this automation,
they use it in many processes,

202
00:13:03,260 --> 00:13:07,240
and these days it's already quite
reliable in most cases.
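
The CLI side of that automation looks roughly like this; the functions below only print the command lines, and the volume and disk names are placeholders, so treat it as a sketch rather than something to paste into production:

```shell
# Sketch: the cloud CLI calls behind a disk snapshot (not executed here;
# the functions print the command lines, with placeholder resource names).
aws_snapshot_cmd() {
  echo "aws ec2 create-snapshot --volume-id $1 --description pg-data-$2"
}
gcp_snapshot_cmd() {
  echo "gcloud compute disks snapshot $1 --zone=$2 --snapshot-names=$3"
}

aws_snapshot_cmd vol-0123456789abcdef0 daily
gcp_snapshot_cmd pg-data-disk us-central1-a pg-data-snap
```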

203
00:13:08,240 --> 00:13:12,080
Of course, this is, by the way,
an objection I hear.

204
00:13:13,440 --> 00:13:17,360
Like, copying the whole directory to object
storage, it's reliable.

205
00:13:18,700 --> 00:13:22,000
What will happen with those snapshots,
who knows?

206
00:13:22,260 --> 00:13:23,580
Well, yes, okay.

207
00:13:24,160 --> 00:13:27,660
But a lot of mission-critical systems
already use them for many

208
00:13:27,660 --> 00:13:28,580
years, right?

209
00:13:29,540 --> 00:13:35,240
And it's also copied to object
storage but done in the background.

210
00:13:36,280 --> 00:13:37,320
And it's done...

211
00:13:37,660 --> 00:13:38,340
It's like...

212
00:13:38,420 --> 00:13:39,120
It's basically...

213
00:13:39,600 --> 00:13:44,440
You create a snapshot, it takes
only minutes for a large volume,

214
00:13:44,440 --> 00:13:46,500
like 10 terabytes, 50 terabytes.

215
00:13:46,840 --> 00:13:51,840
And there's also
a concept of delta snapshots.

216
00:13:51,980 --> 00:13:53,080
It's incremental.

217
00:13:53,300 --> 00:13:57,680
So there's 1 snapshot, and then
additional snapshots are done

218
00:13:58,040 --> 00:14:00,720
faster and take less space in
object storage.

219
00:14:00,720 --> 00:14:02,460
So you pay less.

220
00:14:03,080 --> 00:14:06,440
And again, this is a responsibility
of a cloud provider like

221
00:14:06,440 --> 00:14:07,560
AWS or GCP.

222
00:14:07,960 --> 00:14:08,660
It works.

223
00:14:09,760 --> 00:14:13,860
Then you can recover: snapshot
restore takes only a

224
00:14:13,860 --> 00:14:14,700
few minutes.

225
00:14:14,700 --> 00:14:15,820
This is the magic.

226
00:14:17,020 --> 00:14:21,540
And restore takes only a few minutes,
even for like 50 terabytes,

227
00:14:21,540 --> 00:14:23,600
it can take like 10-15 minutes.

228
00:14:24,300 --> 00:14:30,820
Because I'm speaking about GCP
and AWS here, there is a concept

229
00:14:30,820 --> 00:14:31,940
of lazy load.

230
00:14:32,360 --> 00:14:35,740
So it looks like the disk already exists
and data is already there,

231
00:14:35,740 --> 00:14:39,220
but in the background it's still pulling
data from object storage.

232
00:14:40,440 --> 00:14:45,610
And that's why it's good, recovery
is fast, but then it's slow.

233
00:14:45,610 --> 00:14:48,540
I mean, you try to SELECT from
a table, but it's slow.

234
00:14:48,600 --> 00:14:53,000
And this concept is very well known
for people who use RDS, for

235
00:14:53,000 --> 00:14:55,760
example, or work with EBS volume
snapshots.

236
00:14:56,320 --> 00:15:01,340
It might be bad in some cases,
and there is a trade-off.

237
00:15:02,680 --> 00:15:07,460
You may say, okay, I have a recovery
time of a few minutes only.

238
00:15:07,640 --> 00:15:12,380
RTO is perfect in our case, but
we will be slow initially.

239
00:15:12,800 --> 00:15:14,980
We will need to warm up the disks.

240
00:15:16,020 --> 00:15:20,860
Or you can say, OK, recovery time
RTO is bad, but we are fast

241
00:15:20,860 --> 00:15:21,860
from the very beginning.

242
00:15:22,660 --> 00:15:26,820
Well, we need to warm up buffer
pool and page cache anyway, but

243
00:15:26,820 --> 00:15:27,840
it's a different story.

244
00:15:27,880 --> 00:15:31,140
It usually takes only minutes,
not hours definitely.

245
00:15:32,080 --> 00:15:34,180
So that's the story about snapshots.

246
00:15:34,280 --> 00:15:40,140
And I think, I just observe that
people have a tendency to like

247
00:15:40,280 --> 00:15:46,760
snapshots and prefer them both
for DR and for cloning

248
00:15:46,840 --> 00:15:53,480
of nodes for purposes of creating
replicas, standbys, or forking,

249
00:15:53,480 --> 00:15:53,980
anything.

250
00:15:55,080 --> 00:15:58,860
Once they reach a level like 10-20
terabytes.

251
00:16:00,800 --> 00:16:06,180
Before that, it's not that obvious that snapshots are beneficial.

252
00:16:07,060 --> 00:16:09,600
Because, okay, we can wait a few hours.

253
00:16:10,760 --> 00:16:15,200
If you have 5 terabytes, okay, we can wait 5 hours or 2 and a

254
00:16:15,200 --> 00:16:17,320
half hours for backup.

255
00:16:17,320 --> 00:16:21,600
We can wait a couple of hours to recover, not a big deal.

256
00:16:22,200 --> 00:16:28,040
But once we have 10, 20, 30, 50 terabytes, it already exceeds

257
00:16:28,140 --> 00:16:28,880
1 day.

258
00:16:30,060 --> 00:16:34,140
It exceeds 8 hours of work, a working day, right?

259
00:16:34,400 --> 00:16:37,960
So it already becomes a two-day job for someone.

260
00:16:38,260 --> 00:16:38,480
Yeah.

261
00:16:38,480 --> 00:16:39,960
To control this process, right?

262
00:16:39,960 --> 00:16:43,300
This is, I think, the psychological threshold here.

263
00:16:44,280 --> 00:16:47,120
Michael: But there are, I'm just thinking there are a couple

264
00:16:47,120 --> 00:16:52,100
of other use cases where that time does matter. Like, if you decide

265
00:16:52,220 --> 00:16:55,440
that you don't want to go down the logical replication route

266
00:16:55,440 --> 00:16:58,200
for major version upgrades, for example, and you're willing to

267
00:16:58,200 --> 00:17:02,040
take a certain amount of downtime. One of the steps is to clone

268
00:17:02,040 --> 00:17:05,560
your database, right? Do the upgrade on the clone and then switch

269
00:17:05,560 --> 00:17:10,760
over. On a cloud, like, with a lot of cloud providers, that would

270
00:17:10,760 --> 00:17:12,700
be quicker, or less time.

271
00:17:12,700 --> 00:17:14,280
Nikolay: Why do you need a clone?

272
00:17:15,360 --> 00:17:18,840
Michael: I think if you do it in place then you're down the whole

273
00:17:18,840 --> 00:17:19,280
time.

274
00:17:19,280 --> 00:17:22,220
Whereas if you do it on a clone and then transfer any changes

275
00:17:22,220 --> 00:17:24,380
across that customers have made in the meantime, you might be

276
00:17:24,380 --> 00:17:25,680
able to minimize downtime.

277
00:17:26,280 --> 00:17:27,660
Nikolay: How do you transfer changes?

278
00:17:28,980 --> 00:17:29,480
Michael: Script.

279
00:17:31,100 --> 00:17:32,780
Let's say it's not super busy.

280
00:17:33,960 --> 00:17:35,520
Nikolay: It's an interesting approach.

281
00:17:36,280 --> 00:17:36,780
Yeah.

282
00:17:37,080 --> 00:17:41,880
Sounds like reinventing the wheel of logical replication. So you

283
00:17:41,880 --> 00:17:47,880
need, if you have a complex system, how will you understand the diff,

284
00:17:47,880 --> 00:17:48,560
the changes?

285
00:17:48,560 --> 00:17:51,180
Michael: I'm only thinking, I'm thinking mostly about quiet systems

286
00:17:51,180 --> 00:17:54,760
where it's unlikely that there are that many changes or maybe

287
00:17:54,760 --> 00:17:55,260
0.

288
00:17:55,280 --> 00:17:58,880
Like if you're doing a quiet time and it's like an internal system.

289
00:17:59,380 --> 00:18:02,960
Nikolay: Well, for me there is in-place, there is in-place upgrade

290
00:18:02,960 --> 00:18:04,660
and there is 0 downtime involved.

291
00:18:05,900 --> 00:18:08,100
And logical has variations, actually.

292
00:18:08,920 --> 00:18:11,520
But what you propose, it's kind of like...

293
00:18:11,760 --> 00:18:14,200
It breaks my brain a little bit.

294
00:18:15,280 --> 00:18:20,140
Michael: Well, all I'm thinking is, I can imagine cases where

295
00:18:20,500 --> 00:18:24,020
being able to take a clone in a minute or 2, instead of it

296
00:18:24,020 --> 00:18:26,980
being 10 or 20 minutes is actually quite attractive for the same

297
00:18:26,980 --> 00:18:31,560
reason you mentioned about 8 hours. It's a different... I would stop

298
00:18:31,560 --> 00:18:32,160
doing another time.

299
00:18:32,160 --> 00:18:34,400
If something's gonna take 20 minutes, I'd move on to a different

300
00:18:34,400 --> 00:18:34,640
task.

301
00:18:34,640 --> 00:18:37,280
But if it's only gonna take 1 or
2 minutes, I might only check

302
00:18:37,280 --> 00:18:39,640
my emails, you know, like there's
a very different

303
00:18:40,640 --> 00:18:44,920
Nikolay: It might be you need
to do some research on the

304
00:18:44,920 --> 00:18:49,560
fork, on the clone. I call it clone, some
people call it fork. You

305
00:18:49,560 --> 00:18:53,240
need to verify, for example, you
want to verify how long it will

306
00:18:53,240 --> 00:18:56,420
take to upgrade, an in-place upgrade.

307
00:18:57,840 --> 00:19:03,340
For testing, it's good to have
an identical copy of the same machine

308
00:19:03,340 --> 00:19:06,780
and just run it all, like, and
see the log. And how else

309
00:19:06,780 --> 00:19:10,420
can you check it, right? And
if forking takes a day...

310
00:19:12,720 --> 00:19:13,220
Michael: well

311
00:19:13,520 --> 00:19:17,500
Nikolay: And it's frustrating. Yeah,
it's frustrating. But if it

312
00:19:17,500 --> 00:19:21,920
takes only dozens of minutes, it's
good.

313
00:19:22,080 --> 00:19:25,060
The only catch here again, there
will be lazy load.

314
00:19:25,900 --> 00:19:26,820
Lazy load, right?

315
00:19:26,820 --> 00:19:30,640
So you need to understand: okay,
you have a fork, it pretends

316
00:19:30,660 --> 00:19:32,380
it's up and running.

317
00:19:32,380 --> 00:19:34,240
It's up and running, but it's slow.

318
00:19:34,540 --> 00:19:35,020
Does that

319
00:19:35,020 --> 00:19:36,640
Michael: invalidate the test then?

320
00:19:37,300 --> 00:19:40,540
Nikolay: Yeah, some tests will
be invalidated for sure.

321
00:19:40,760 --> 00:19:41,760
Maybe not upgrade.

322
00:19:42,040 --> 00:19:45,800
For upgrade, what I would do for a
test of major upgrade in place

323
00:19:45,980 --> 00:19:53,000
with hard links, I would just in
this case warm up only the system

324
00:19:53,000 --> 00:19:53,920
catalog part.

325
00:19:55,160 --> 00:19:59,780
Because this is what pg_upgrade
does, it dumps and restores system

326
00:19:59,780 --> 00:20:00,280
catalogs.

327
00:20:01,980 --> 00:20:05,780
Then I would ignore the remainder
of data knowing that it will

328
00:20:05,780 --> 00:20:07,900
be handled by hard links.

329
00:20:07,960 --> 00:20:09,340
So this is my warm up.

330
00:20:09,340 --> 00:20:10,280
It's super fast.

331
00:20:10,680 --> 00:20:13,520
So this is a perfect example then.

332
00:20:14,440 --> 00:20:15,900
But we need to understand details.

333
00:20:18,140 --> 00:20:24,900
So we forked during 10-20 minutes,
I mean, we restored a

334
00:20:24,900 --> 00:20:30,140
snapshot, we have up-and-running
Postgres, we dump the schema a couple

335
00:20:30,140 --> 00:20:32,500
of times, well, 1 time is enough
actually.

336
00:20:32,720 --> 00:20:41,300
And then we run pg_upgrade -k or
--link, and see what happens,

337
00:20:41,300 --> 00:20:43,400
how long it takes for us.

338
00:20:44,680 --> 00:20:51,240
And we are not worried about data
actually still being in GCS

339
00:20:51,260 --> 00:20:53,220
or S3 or blob storage.

340
00:20:53,900 --> 00:20:56,320
We know it doesn't matter in this
particular case.

341
00:20:56,320 --> 00:21:00,800
But if it's something, for example,
you want to check how long

342
00:21:00,800 --> 00:21:05,440
it will take to run pg_repack or
VACUUM FULL or pg_squeeze on a large

343
00:21:05,440 --> 00:21:07,700
table without additional traffic.

344
00:21:07,860 --> 00:21:09,140
It's also useful data.

345
00:21:09,140 --> 00:21:13,860
You know that additional traffic
will affect timing in a negative

346
00:21:13,860 --> 00:21:17,440
way, but you want to measure at
least in an empty space.

347
00:21:18,140 --> 00:21:19,700
Michael: Or even adding a new index.

348
00:21:19,800 --> 00:21:20,300
Like,

349
00:21:21,000 --> 00:21:22,320
Nikolay: how long would that take?

350
00:21:22,660 --> 00:21:26,320
Yeah, in this case, the fact that
we have a lazy load problem

351
00:21:26,320 --> 00:21:29,720
with snapshot restore, well, it will affect your experiment for

352
00:21:29,720 --> 00:21:30,220
sure.

353
00:21:30,240 --> 00:21:34,280
You will have bad timing; in production, you will have good timing,

354
00:21:34,280 --> 00:21:38,500
and you will regret you did the experiment at all.

355
00:21:38,500 --> 00:21:41,740
You will stop trusting those experiments if you don't understand

356
00:21:41,740 --> 00:21:43,280
lazy load, this problem.

357
00:21:43,840 --> 00:21:47,300
If you understand lazy load, you first warm up the particular table,

358
00:21:47,980 --> 00:21:52,080
just doing, I don't know, SELECT everything from it, right?

359
00:21:52,360 --> 00:21:55,820
And then create index and measure, right?

360
00:21:56,400 --> 00:21:59,260
Michael: So yeah, just to get an idea, how much slower are we

361
00:21:59,260 --> 00:21:59,760
talking?

362
00:21:59,860 --> 00:22:01,080
Is there some rough...

363
00:22:01,080 --> 00:22:03,040
How much slower?

364
00:22:03,200 --> 00:22:06,940
Let's say you SELECT *, how much slower would you expect it

365
00:22:06,940 --> 00:22:10,780
to be while it's lazy loading still versus on production?

366
00:22:11,000 --> 00:22:12,600
Nikolay: Oh, it's a tough question.

367
00:22:14,140 --> 00:22:16,660
I don't have good examples, recent examples.

368
00:22:16,980 --> 00:22:20,440
The first time I heard about lazy load was 2015 or something.

369
00:22:21,020 --> 00:22:23,540
This was the first time I was working with RDS.

370
00:22:26,040 --> 00:22:31,240
This is when the importance of working with clones started to form

371
00:22:31,240 --> 00:22:32,100
in my mind.

372
00:22:32,880 --> 00:22:36,040
I was super impressed, like how fast we can clone a terabyte

373
00:22:36,040 --> 00:22:39,800
database and start working and experimenting with it. But I was

374
00:22:39,800 --> 00:22:43,280
upset about why it was slow, so I reached out to support and they explained

375
00:22:43,280 --> 00:22:47,660
lazy load. And in their articles, by the way, AWS explains this quite

376
00:22:47,660 --> 00:22:48,980
well lazy load.

377
00:22:49,200 --> 00:22:54,400
There are explanations at various levels, I think, both at the EBS

378
00:22:54,480 --> 00:22:57,020
level, so basic EBS volumes, and also RDS.

379
00:22:58,180 --> 00:23:02,200
But I must admit, Google Cloud doesn't do a good job here in

380
00:23:02,200 --> 00:23:03,620
terms of documenting this.

381
00:23:04,940 --> 00:23:06,400
I know it exists.

382
00:23:06,900 --> 00:23:10,460
I know they admit it exists, but it's not well documented.

383
00:23:10,920 --> 00:23:14,840
There's a good opportunity to catch up here in terms of documentation

384
00:23:15,040 --> 00:23:15,780
and understanding.

385
00:23:16,100 --> 00:23:18,280
As for the question of duration, I don't know.

386
00:23:18,280 --> 00:23:20,200
Definitely it can be frustrating.

387
00:23:23,000 --> 00:23:27,440
SELECT, for example, count(*) from a table — if it's a sequential

388
00:23:27,540 --> 00:23:31,040
scan, it will have the same effect for warming up, right?

389
00:23:32,100 --> 00:23:33,940
It will not spam your output.

390
00:23:33,940 --> 00:23:37,620
Well, a different way not to spam your output is just

391
00:23:38,300 --> 00:23:43,680
SELECT from the table without specifying columns, and maybe also somehow

392
00:23:44,020 --> 00:23:44,800
so that the result...

393
00:23:44,800 --> 00:23:45,660
Michael: EXPLAIN ANALYZE

394
00:23:47,220 --> 00:23:53,300
Nikolay: Yeah, by the way — yes, yes. So your timing will

395
00:23:53,300 --> 00:23:57,640
be really high in that EXPLAIN ANALYZE, if it's enabled. I hope

396
00:23:57,640 --> 00:24:05,220
it's enabled. So, I think, SELECT count(*) is super slow

397
00:24:05,220 --> 00:24:08,480
in Postgres, we know it, because it's row storage and all the

398
00:24:08,480 --> 00:24:10,420
things, MVCC and so on.

399
00:24:10,580 --> 00:24:13,440
But here it can be order of magnitude slower.

400
00:24:14,340 --> 00:24:15,040
Michael: Makes sense.

401
00:24:15,160 --> 00:24:16,660
Nikolay: Maybe 2 orders of magnitude.

402
00:24:16,920 --> 00:24:18,900
I don't have numbers here in my
head.

403
00:24:19,020 --> 00:24:21,260
Michael: Yeah, I was just trying
to understand if it was 10%

404
00:24:21,280 --> 00:24:23,040
or double or an order of magnitude.

405
00:24:23,480 --> 00:24:23,680
No, no, no.

406
00:24:23,680 --> 00:24:24,520
It's really

407
00:24:24,780 --> 00:24:28,120
Nikolay: fetching data from object
storage in the background.

408
00:24:30,320 --> 00:24:34,020
So it can be frustrating, this
I can tell you.

409
00:24:34,660 --> 00:24:40,240
But again, if
you aim to warm up a node just

410
00:24:40,240 --> 00:24:44,480
restored from a snapshot, a Postgres
node, you can again benefit

411
00:24:44,540 --> 00:24:48,080
from the fact that object storage
is good in terms of parallelization

412
00:24:48,160 --> 00:24:48,880
of operations.

413
00:24:49,280 --> 00:24:54,780
You can warm it up with not just a single
connection to Postgres.

414
00:24:54,780 --> 00:25:00,180
You can run 10 or 20 parallel warm-up
queries.

415
00:25:01,660 --> 00:25:05,200
And this is a recommendation on
how to warm up because of lazy

416
00:25:05,200 --> 00:25:05,580
load.
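
One way to sketch the "10 or 20 parallel warm-up queries" idea is to split the table's pages into disjoint ctid ranges, one per worker, so each range can be scanned in its own connection. This is an assumption-laden sketch: in practice `relpages` would come from `pg_class`, and the worker count is arbitrary.

```python
def warmup_queries(table: str, relpages: int, workers: int) -> list[str]:
    """Disjoint ctid-range scans that together cover all pages of `table`."""
    step = -(-relpages // workers)  # ceiling division
    queries = []
    for i in range(workers):
        lo, hi = i * step, min((i + 1) * step, relpages)
        if lo >= hi:
            break  # fewer workers than ranges are needed
        queries.append(
            f"SELECT count(*) FROM {table} "
            f"WHERE ctid >= '({lo},0)'::tid AND ctid < '({hi},0)'::tid;"
        )
    return queries

qs = warmup_queries("orders", 1000, 16)
```

Each generated statement touches a distinct slice of the heap, which is what makes the lazy fetches happen in parallel instead of serially.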

417
00:25:05,580 --> 00:25:10,060
By the way, AWS also offers this — a few
years ago, they implemented

418
00:25:10,080 --> 00:25:10,400
it.

419
00:25:10,400 --> 00:25:13,760
I think this is called Fast Restore,
Fast Recovery, something.

420
00:25:15,180 --> 00:25:18,160
Fast Snapshot Restore, FSR, or
something.

421
00:25:18,160 --> 00:25:19,200
I might be mistaken.

422
00:25:19,340 --> 00:25:24,680
But, if my memory is right,
they support up to 50 snapshots

423
00:25:24,820 --> 00:25:30,460
marked as available for fast restore
in the region, I think,

424
00:25:30,780 --> 00:25:32,380
for a particular account.

425
00:25:32,740 --> 00:25:38,300
So you can mark some snapshots
and the lazy load problem will disappear

426
00:25:38,300 --> 00:25:39,780
for those snapshots.

427
00:25:40,080 --> 00:25:41,940
This is a good thing to have.

428
00:25:41,940 --> 00:25:47,020
And I'm not aware of a similar thing
for Google Cloud or Azure.

429
00:25:47,940 --> 00:25:54,660
But this means, like, once per
week, I don't know, you have

430
00:25:54,660 --> 00:25:58,260
a good snapshot which you can recover from
very fast and then replay

431
00:25:58,260 --> 00:26:02,320
WALs. But you have super fast
recovery for experiments.

432
00:26:02,320 --> 00:26:05,900
It's already super cool — super cool,
right?

433
00:26:06,140 --> 00:26:09,520
Michael: What about for the actual
record like the RTO discussion?

434
00:26:09,620 --> 00:26:14,180
What are you seeing people with
dozens of terabytes opt for in

435
00:26:14,180 --> 00:26:15,300
terms of that trade-off?

436
00:26:16,060 --> 00:26:18,960
Are they happy to accept things
will be slower, it's kind of

437
00:26:18,960 --> 00:26:22,200
like a brownout situation for a
while, but they're back online

438
00:26:22,200 --> 00:26:22,660
at least?

439
00:26:22,660 --> 00:26:27,820
Or do they prefer longer RTO but
everything works exactly as

440
00:26:27,820 --> 00:26:29,980
expected once it is back online?

441
00:26:30,080 --> 00:26:35,500
Nikolay: Well I think nobody is
happy when recovery takes 10,

442
00:26:35,500 --> 00:26:36,840
20, 30 hours.
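
Those hours come straight from size-over-throughput arithmetic. A rough sketch, with illustrative throughput numbers rather than measurements:

```python
def restore_hours(db_size_tb: float, throughput_mb_s: float) -> float:
    """Hours to copy a database of db_size_tb at a sustained MB/s rate."""
    mb = db_size_tb * 1024 * 1024
    return mb / throughput_mb_s / 3600

slow = restore_hours(10, 100)  # 10 TB at 100 MB/s: roughly 29 hours
fast = restore_hours(10, 500)  # 10 TB at 500 MB/s: still almost 6 hours
```

This is why a snapshot-based restore, which avoids the full copy entirely, changes RTO by orders of magnitude at the dozens-of-terabytes scale.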

443
00:26:36,900 --> 00:26:37,400
Nobody.

444
00:26:37,660 --> 00:26:38,320
Of course.

445
00:26:38,920 --> 00:26:39,420
Right?

446
00:26:40,260 --> 00:26:43,440
So this becomes a problem and the
characteristics of the system

447
00:26:43,700 --> 00:26:45,020
become very bad.

448
00:26:45,020 --> 00:26:46,400
I mean, this is not...

449
00:26:46,400 --> 00:26:50,280
It also becomes unacceptable for the business
to understand that

450
00:26:50,280 --> 00:26:54,520
we will be down a whole day or
2, just to recover from backups.

451
00:26:55,440 --> 00:27:01,020
But before the business realizes it,
operations engineers realize it, because

452
00:27:01,020 --> 00:27:02,340
it's really hard to experiment.

453
00:27:03,260 --> 00:27:06,340
Michael: So, well, and the cases
where this happens are super

454
00:27:06,340 --> 00:27:06,880
rare, right?

455
00:27:06,880 --> 00:27:09,720
Like we're talking, we're not talking
about failover, like with

456
00:27:09,720 --> 00:27:12,240
all of these setups would generally
be like high availability

457
00:27:12,440 --> 00:27:13,180
anyway, right?

458
00:27:13,180 --> 00:27:17,900
So we're talking only a very specific
class of recovery where

459
00:27:17,900 --> 00:27:21,840
you have to recover from backups,
like there's no failover situation,

460
00:27:22,200 --> 00:27:24,140
which I'm guessing is pretty rare.

461
00:27:24,140 --> 00:27:27,980
People have to think about it,
but they don't face it very often.

462
00:27:28,380 --> 00:27:33,420
Nikolay: For 10-plus years, cloud environments
were super unreliable

463
00:27:33,640 --> 00:27:35,720
and failovers happened very often.

464
00:27:36,100 --> 00:27:44,440
Right now, again, at least for
Google Cloud and AWS, I see nodes

465
00:27:44,440 --> 00:27:47,540
can be running for years without
disruption.

466
00:27:47,780 --> 00:27:52,000
So it's much better right now in
terms of lower risks.

467
00:27:52,280 --> 00:27:57,040
Well, again, we have a habit of
considering cloud environments as

468
00:27:57,040 --> 00:27:58,120
not super reliable.

469
00:27:58,400 --> 00:28:04,500
And, for example, it also depends
on particular instance classes

470
00:28:05,280 --> 00:28:06,680
and the sizes, I think.

471
00:28:06,680 --> 00:28:13,940
If you have a very small instance from a very popular
instance family,

472
00:28:14,540 --> 00:28:16,140
migration might happen often.

473
00:28:16,240 --> 00:28:19,920
For example, we were down because
we are on GCP, I mean Postgres.AI.

475
00:28:20,080 --> 00:28:23,860
We were down because the Kubernetes
cluster was migrated again, like

476
00:28:23,860 --> 00:28:25,120
all nodes were migrated.

477
00:28:26,200 --> 00:28:28,860
And then we had a problem, let
me admit.

478
00:28:29,060 --> 00:28:33,640
We had a problem with backups because,
after a reconfiguration half

479
00:28:33,640 --> 00:28:40,320
a year ago, a .history
file for some timeline in the future —

480
00:28:40,640 --> 00:28:44,100
which was in the past, but for us
at that moment was in the future — was

481
00:28:44,100 --> 00:28:49,720
left in the pg_wal directory.
And after such a migration —

482
00:28:49,780 --> 00:28:57,020
caused by migration of VMs in Kubernetes,
we couldn't start.

483
00:28:57,560 --> 00:29:01,400
Because promotion happened, timeline
changed, and there was basically

484
00:29:01,400 --> 00:29:05,780
a rogue file from a previous life
corresponding to the same timeline

485
00:29:05,780 --> 00:29:06,280
number.

486
00:29:06,280 --> 00:29:07,800
It was a nasty incident.

487
00:29:08,680 --> 00:29:12,680
I think we were down like an hour
until we figured out, unfortunately, the

488
00:29:12,880 --> 00:29:13,380
misconfiguration.

489
00:29:14,280 --> 00:29:16,280
Yeah, it's a bad problem.

490
00:29:16,280 --> 00:29:19,600
But it was good that it happened
only on a Saturday, so not many

491
00:29:19,600 --> 00:29:22,320
users of our systems noticed.

492
00:29:23,180 --> 00:29:28,940
And I must thank my team — a
couple of folks who jumped very

493
00:29:28,940 --> 00:29:31,980
quickly to Zoom and helped to diagnose
and fix.

494
00:29:32,560 --> 00:29:35,040
So yeah, back to questions.

495
00:29:35,140 --> 00:29:39,660
In my opinion, again, snapshots
are inevitable to use if you

496
00:29:39,660 --> 00:29:42,020
reach the dozens-of-terabytes scale.

497
00:29:44,280 --> 00:29:47,540
And again, there are several levels
of trust here.

498
00:29:47,540 --> 00:29:48,900
First level, are they consistent?

499
00:29:50,380 --> 00:29:53,500
The answer is: we don't care, because
we have pg_start_backup and pg_stop_backup.
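
A sketch of why consistency of the snapshot itself doesn't matter: the low-level backup API brackets the snapshot, and crash recovery plus WAL replay handles the rest. The step list below is an orchestration sketch, not an executable API — the "create snapshot" and "archive WAL" entries are placeholders for cloud-specific actions, and the pg_start_backup/pg_stop_backup calls use the pre-Postgres-15 function names used in this conversation.

```python
# Ordered steps of snapshot-based backup orchestration (a sketch).
def snapshot_backup_steps() -> list[str]:
    return [
        "SELECT pg_start_backup('snap', false, false);",  # begin non-exclusive backup
        "create cloud disk snapshot",                      # crash-consistent block copy
        "SELECT pg_stop_backup(false);",                   # returns backup_label content
        "archive WAL through the stop point",              # needed to reach consistency
    ]

steps = snapshot_backup_steps()
```

The ordering is the whole trick: as long as the snapshot is taken inside the start/stop bracket and the WAL through the stop point is kept, recovery can always reach a consistent state.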

500
00:29:55,580 --> 00:29:59,480
Second question, are they reliable
and what about possible corruption

501
00:29:59,540 --> 00:30:00,400
and so on?

502
00:30:01,740 --> 00:30:04,200
The answer is: test them — test backups.

503
00:30:04,340 --> 00:30:07,480
You must test backups anyway, because
otherwise it's a Schrödinger

504
00:30:07,500 --> 00:30:08,460
backup system.

505
00:30:09,560 --> 00:30:11,980
So we don't know if it will work.

506
00:30:11,980 --> 00:30:12,880
We must test.

507
00:30:13,680 --> 00:30:19,280
And once enough tests
show positive results, trust

508
00:30:19,280 --> 00:30:20,100
will be built.

509
00:30:20,740 --> 00:30:21,100
Right?

510
00:30:21,100 --> 00:30:23,820
And I have several...

511
00:30:23,940 --> 00:30:28,180
You know, at the beginning of the
year I announced that every

512
00:30:28,180 --> 00:30:32,300
managed Postgres provider must
consider — or, I highly recommend,

513
00:30:32,300 --> 00:30:36,220
considering — pg_wait_sampling and
pg_stat_kcache as very important

514
00:30:36,220 --> 00:30:40,580
pieces of observability to extend
what we have with pg_stat_statements.

515
00:30:42,880 --> 00:30:47,020
Even if you are on RDS and have Performance
Insights, still consider

516
00:30:47,280 --> 00:30:53,520
adding pg_wait_sampling because
this will add very flexible tooling

517
00:30:53,520 --> 00:30:55,380
for your customers.

518
00:30:56,140 --> 00:31:02,260
Now I say, for backups: of course
these big guys like RDS already

519
00:31:02,260 --> 00:31:03,480
use snapshots 100%.

520
00:31:04,540 --> 00:31:10,200
We know it indirectly from my
own suffering with lazy load

521
00:31:10,200 --> 00:31:12,280
in 2015 or 2016.

522
00:31:13,480 --> 00:31:19,400
And now all competitors of RDS
or alternatives to RDS, please

523
00:31:19,400 --> 00:31:21,360
consider snapshots, right?

524
00:31:21,740 --> 00:31:26,420
Because I know you don't yet. It's
worth it, because

525
00:31:26,760 --> 00:31:32,700
your customers grow, and once they
reach 10, 20, 30, 40, 50 terabytes,

526
00:31:33,540 --> 00:31:39,480
they will go to RDS or self-managed
and stop paying you, 100%, because

527
00:31:39,480 --> 00:31:40,380
it's a big pain.

528
00:31:40,380 --> 00:31:41,740
We just described this pain.

529
00:31:42,040 --> 00:31:45,840
And now those who self-manage also
consider snapshots, right?

530
00:31:46,060 --> 00:31:53,680
But most importantly, developers
of WAL-G and pgBackRest, consider

531
00:31:53,680 --> 00:31:58,240
supporting snapshots, cloud disk
snapshots, as an alternative,

532
00:31:58,380 --> 00:32:01,320
as an official alternative to full
backups and delta backups

533
00:32:01,320 --> 00:32:02,380
you already have.

534
00:32:03,060 --> 00:32:04,880
In this case it would become native.

535
00:32:06,500 --> 00:32:09,740
And yeah, so I think it's worth
doing this.

536
00:32:10,200 --> 00:32:11,760
Maybe other types of snapshots.

537
00:32:11,920 --> 00:32:16,160
So, like, an external tool
for snapshotting — if it exists

538
00:32:16,620 --> 00:32:19,500
in the infrastructure and is natively
supported by a backup tool, I

539
00:32:19,500 --> 00:32:21,140
think it's a good idea.

540
00:32:21,500 --> 00:32:25,420
So a backup tool would perform
all the orchestration and skip

541
00:32:25,520 --> 00:32:26,760
some of the full backups.

542
00:32:26,760 --> 00:32:27,860
It can be a mix.

543
00:32:28,280 --> 00:32:32,920
The ideal backup system for me would
be a mix of occasional full

544
00:32:32,920 --> 00:32:37,820
backups, continuous WAL stream,
and snapshots as the main tool

545
00:32:37,820 --> 00:32:40,660
for provisioning nodes and the main
strategy for DR.

546
00:32:41,200 --> 00:32:42,920
But not the only 1.

547
00:32:42,920 --> 00:32:47,120
So full backups would exist separately,
occasionally — again, like,

548
00:32:47,120 --> 00:32:48,780
I would have full backups.
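
The mixed strategy is easy to state as a recovery rule: restore the most recent snapshot at or before the target time, then replay WAL for the remaining gap, with full backups kept as an independent fallback. A sketch using plain numeric timestamps; all values are illustrative.

```python
def recovery_plan(snapshot_times: list[float], target: float) -> tuple[float, float]:
    """Return (snapshot to restore, seconds of WAL to replay after it)."""
    base = max((s for s in snapshot_times if s <= target), default=None)
    if base is None:
        raise ValueError("no snapshot precedes the target; fall back to a full backup")
    return base, target - base

# Snapshots taken at t=100, 200, 300; recover to t=250:
plan = recovery_plan([100.0, 200.0, 300.0], 250.0)
```

The more frequent the snapshots, the smaller the WAL gap to replay, which is exactly the RTO knob being discussed.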

549
00:32:49,020 --> 00:32:53,200
By the way, both snapshots and
full backups must be

550
00:32:53,200 --> 00:32:57,260
done — and I see it on some systems,
they do it on the primary.

551
00:32:57,280 --> 00:32:58,760
They must be done on replicas.

552
00:33:00,060 --> 00:33:03,740
They must be done on replicas because
it's a big stress for the primary.

553
00:33:04,460 --> 00:33:07,740
Even a snapshot is a big stress for
the primary, as I learned.

554
00:33:09,000 --> 00:33:09,740
So, yeah.

555
00:33:10,080 --> 00:33:11,740
It's good to do it on a replica.

556
00:33:11,960 --> 00:33:16,240
And pg_start_backup, pg_stop_backup,
they support so-called non-exclusive

557
00:33:17,060 --> 00:33:20,720
backups, so it can be done on replicas,
on physical replicas.

558
00:33:21,340 --> 00:33:25,080
And even if they are lagging for
some time, it doesn't matter

559
00:33:25,080 --> 00:33:29,780
because even a lag of some minutes
nobody will notice, because

560
00:33:30,600 --> 00:33:34,500
we have a continuous WAL stream,
and replaying a few minutes of WAL,

561
00:33:34,760 --> 00:33:35,700
it's very fast.

562
00:33:36,160 --> 00:33:40,760
Michael: Yeah, I saw in RDS's documentation
that for multi-availability

563
00:33:41,120 --> 00:33:44,660
zone deployments, they take the
backups from the standby.

564
00:33:45,240 --> 00:33:49,300
Nikolay: That's absolutely right,
like, this is what should happen.

565
00:33:49,540 --> 00:33:49,780
Yeah.

566
00:33:49,780 --> 00:33:54,820
Yeah, and snapshots also should
be done from there. I saw — well,

567
00:33:55,260 --> 00:34:00,300
maybe it was specific cases, but
I saw some cases when a snapshot

568
00:34:00,340 --> 00:34:05,740
of the primary's disk affected performance.
Because in the background,

569
00:34:06,300 --> 00:34:11,180
data is being sent to object storage
and it's definitely generating

570
00:34:11,200 --> 00:34:14,040
some read disk IO for our
disks.

571
00:34:16,100 --> 00:34:18,840
Michael: So I'm curious though,
you mentioned that in an

572
00:34:18,840 --> 00:34:22,360
ideal world you have a mix, what's
the benefit of having both?

573
00:34:22,360 --> 00:34:27,760
Like once you move to snapshots,
what's the benefit of also having

574
00:34:29,380 --> 00:34:33,780
traditional pgBackRest style backups
as well?

575
00:34:34,660 --> 00:34:35,640
Nikolay: It's a good question.

576
00:34:35,640 --> 00:34:39,100
I remember reading Google Cloud
documentation many years ago

577
00:34:39,100 --> 00:34:43,840
and it said: when we create a snapshot
of a disk, this snapshot is

578
00:34:43,840 --> 00:34:47,780
not guaranteed to be consistent
if an application is running

579
00:34:47,780 --> 00:34:52,860
and writing. And then suddenly the
sentence was removed from the Google

580
00:34:52,860 --> 00:34:56,400
Cloud documentation. It's a
matter of trust, right?

581
00:34:56,680 --> 00:35:01,640
So now I trust Google Cloud
snapshots and EBS volume

582
00:35:01,640 --> 00:35:02,660
snapshots in AWS.

583
00:35:02,660 --> 00:35:03,740
Yes, quite a lot.

584
00:35:03,840 --> 00:35:10,960
I have trust, but still some part
of me is, like, paranoid,

585
00:35:10,960 --> 00:35:15,400
you know. So I would keep full
backups, not very frequent.

586
00:35:15,400 --> 00:35:20,640
I would keep them just to
be on the safe side. But maybe

587
00:35:20,640 --> 00:35:24,880
over time trust will be complete —
like, if you have a lot of data

588
00:35:24,880 --> 00:35:30,660
proving snapshot creation and recovery
is very reliable.

589
00:35:32,860 --> 00:35:35,000
Well, consistency, by the way,
doesn't matter.

590
00:35:35,400 --> 00:35:39,480
The problem I mentioned in the documentation —
it's not about whether snapshots

591
00:35:39,480 --> 00:35:42,600
are consistent or not.
Honestly, I don't know

592
00:35:42,600 --> 00:35:45,980
even, I don't care, because I have
pg_start_backup and pg_stop_backup,

593
00:35:45,980 --> 00:35:49,460
and I'm protected from an
inconsistent state. Postgres will

594
00:35:49,460 --> 00:35:53,860
take care of it. But it's a matter
of trust — like, lack of transparency

595
00:35:54,020 --> 00:35:57,180
in documentation, understanding
what's happening and so on.

596
00:35:57,180 --> 00:36:03,080
So only obtaining experience over
time can bring some trust and

597
00:36:03,080 --> 00:36:06,860
understanding that the system works
fine. And regular testing —

598
00:36:06,900 --> 00:36:12,360
there should be automated recovery testing
of backups, and so on.

599
00:36:12,360 --> 00:36:17,580
So now I think it doesn't matter
if they are

600
00:36:18,140 --> 00:36:22,400
consistent or not. It matters if
they work, right?

601
00:36:22,860 --> 00:36:28,500
If we can recover and then
reach a consistency point by

602
00:36:28,500 --> 00:36:33,220
our means — I mean, by what Postgres
has — that's it.

603
00:36:33,280 --> 00:36:38,000
But still, what if I see snapshots
created, but I cannot use

604
00:36:38,000 --> 00:36:38,500
them?

605
00:36:39,140 --> 00:36:41,460
And also bugs happen, people change
systems.

606
00:36:41,460 --> 00:36:45,600
I mean, cloud providers also are
being developed all the time

607
00:36:45,600 --> 00:36:50,680
and new types of disks and other
changes are introduced all the

608
00:36:50,680 --> 00:36:51,180
time.

609
00:36:51,740 --> 00:36:54,520
So what if snapshots will not work?

610
00:36:54,900 --> 00:36:58,900
If I have a very reliable testing
process — if I test, maybe, for

611
00:36:58,900 --> 00:37:02,980
example, all snapshots, I restore
from them automatically all

612
00:37:02,980 --> 00:37:08,240
the time, and I see that over the
last few months every day we

613
00:37:08,240 --> 00:37:10,180
created a snapshot, or maybe every
few hours.

614
00:37:10,180 --> 00:37:13,980
we created a snapshot and we tested
them all — maybe at some point

615
00:37:13,980 --> 00:37:18,800
I will say, okay, full backups by
pgBackRest or WAL-G are not needed

616
00:37:18,800 --> 00:37:19,300
anymore.

617
00:37:19,960 --> 00:37:23,320
But for now, my mind is in a state
of paranoia.

618
00:37:26,140 --> 00:37:26,980
Michael: Very reasonable.

619
00:37:27,440 --> 00:37:28,200
Sounds good?

620
00:37:28,480 --> 00:37:28,980
Nikolay: Yeah.

621
00:37:29,640 --> 00:37:35,080
1 interesting thing here is that
if you use snapshots, there

622
00:37:35,080 --> 00:37:37,940
is a catch for large databases.

623
00:37:38,720 --> 00:37:44,220
EBS volumes and persistent disks
in Google Cloud, they are limited

624
00:37:44,220 --> 00:37:45,560
to 64 terabytes.

625
00:37:46,080 --> 00:37:52,180
I think some types of EBS volumes
are limited by 128 terabytes,

626
00:37:52,180 --> 00:37:54,640
if I'm not mistaken, so the limit
is increased.

627
00:37:55,360 --> 00:38:00,060
But 64 terabytes, if you experience
problems being at the level of

628
00:38:00,060 --> 00:38:04,100
10, 20, 30, 50 terabytes, 64 terabytes
is not far already.
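
Headroom against the volume limit is worth computing explicitly; a sketch under an assumed linear growth rate (all inputs illustrative):

```python
def months_until_limit(current_tb: float, growth_tb_per_month: float,
                       limit_tb: float = 64.0) -> float:
    """Months of linear growth left before hitting the volume size limit."""
    if growth_tb_per_month <= 0:
        raise ValueError("growth rate must be positive")
    return max(0.0, (limit_tb - current_tb) / growth_tb_per_month)

m = months_until_limit(30.0, 2.0)  # 30 TB today, growing 2 TB/month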

629
00:38:04,300 --> 00:38:05,420
Michael: Yeah, that's true.

630
00:38:05,800 --> 00:38:11,460
Nikolay: And then the question
will be what to do, because if

631
00:38:11,460 --> 00:38:19,340
you use regular backup, full backup
by means of pgBackRest or

632
00:38:19,340 --> 00:38:20,400
WAL-G.

633
00:38:21,340 --> 00:38:24,800
You can have multiple disks and
again, it doesn't matter.

634
00:38:25,320 --> 00:38:31,420
Thanks to pgBackRest, pg_start_backup,
pg_stop_backup, you can copy

635
00:38:31,500 --> 00:38:33,820
data from multiple disks, right?

636
00:38:33,820 --> 00:38:42,100
It can be LVM with multiple disks
organized to give you more

637
00:38:42,100 --> 00:38:43,240
disk space, right?

638
00:38:43,820 --> 00:38:44,880
Michael: Like table spaces?

639
00:38:46,840 --> 00:38:48,780
Nikolay: No, no, no, not a
tablespace.

640
00:38:48,780 --> 00:38:51,360
A tablespace is a different thing,
it's at the Postgres level.

641
00:38:51,740 --> 00:38:58,040
If you need more space, you can combine
multiple disks using LVM2.

642
00:38:59,820 --> 00:39:06,880
And have 1 big, basically, logical
volume for Postgres.

643
00:39:07,360 --> 00:39:12,500
And for regular backups, when you
copy data, pgBackRest or WAL-G

644
00:39:12,500 --> 00:39:16,580
copies data to object storage,
it doesn't matter that the underlying

645
00:39:17,080 --> 00:39:22,100
many disks contribute to building
this big logical volume, because

646
00:39:22,820 --> 00:39:26,960
we have only files up to 1 gigabyte
in Postgres. So each huge

647
00:39:26,960 --> 00:39:31,860
table is split into 1-gigabyte
files, so they are compressed

648
00:39:31,860 --> 00:39:33,900
and copied by our backup tool.
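
The 1-gigabyte segmentation is visible on disk: a relation larger than 1 GB is stored as `<filenode>`, `<filenode>.1`, `<filenode>.2`, and so on, which is what lets a file-level backup tool ignore how the logical volume is assembled underneath. A sketch of the naming; the filenode value is hypothetical.

```python
SEGMENT = 1 << 30  # default 1 GiB segment size in Postgres

def segment_files(filenode: int, relation_bytes: int) -> list[str]:
    """On-disk file names for a heap relation of the given size."""
    n = max(1, -(-relation_bytes // SEGMENT))  # ceiling division, at least one file
    return [str(filenode)] + [f"{filenode}.{i}" for i in range(1, n)]

files = segment_files(16384, 5 * SEGMENT + 123)  # a table a bit over 5 GB
```

Each segment can be read, compressed, and shipped independently, which is also what makes parallel full backups possible in the first place.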

649
00:39:34,300 --> 00:39:37,180
But if you start using snapshots,
it's interesting.

650
00:39:38,300 --> 00:39:42,260
Michael: Well, also people at that
scale, people are thinking

651
00:39:42,260 --> 00:39:45,060
of the companies we've talked to
at that scale — we had a good hundredth

652
00:39:45,060 --> 00:39:46,120
episode, didn't we?

653
00:39:46,120 --> 00:39:49,860
And a lot of them are considering
sharding by that point anyway.

654
00:39:50,280 --> 00:39:53,220
So I guess in some cases that...

655
00:39:53,800 --> 00:39:54,960
Nikolay: Or deleting data, right?

656
00:39:54,960 --> 00:39:56,040
That's my favorite advice.

657
00:39:56,040 --> 00:39:57,080
Just delete something.

658
00:39:58,960 --> 00:40:01,320
And sleep well.

659
00:40:02,140 --> 00:40:02,640
Yeah.

660
00:40:02,680 --> 00:40:05,860
Well, so the problem is that
it's not a problem, again, it's

661
00:40:05,860 --> 00:40:08,040
not a problem that you will create
multiple snapshots.

662
00:40:08,400 --> 00:40:11,640
If those snapshots are created
between pg_start_backup and pg_stop_backup,

663
00:40:12,160 --> 00:40:13,940
you will be able to recover from
them.

664
00:40:13,940 --> 00:40:17,520
The only question will be: will
LVM survive this?

665
00:40:18,160 --> 00:40:21,260
And this is question not about
Postgres, it's a question about

666
00:40:21,760 --> 00:40:24,940
the cloud provider and how like
snapshot orchestration.

667
00:40:25,640 --> 00:40:30,060
And this topic already goes beyond
my current understanding of

668
00:40:30,060 --> 00:40:31,100
things here.

669
00:40:31,340 --> 00:40:36,140
I can just suggest, I can recommend,
okay, use snapshots, but

670
00:40:36,140 --> 00:40:40,400
if you reach a limit and you need
to start using LVM, it's a

671
00:40:40,400 --> 00:40:41,240
different story.

672
00:40:42,180 --> 00:40:45,720
So you need to check how LVM will
survive a snapshot restore,

673
00:40:45,720 --> 00:40:46,500
Michael: and test.

674
00:40:48,220 --> 00:40:52,200
Nikolay: I think it's definitely
solvable because like RDS did

675
00:40:52,200 --> 00:40:53,160
solve it, right?

676
00:40:54,340 --> 00:40:54,840
So

677
00:40:55,920 --> 00:40:56,980
Michael: yeah, good point.

678
00:40:57,040 --> 00:40:57,540
Nikolay: Right.

679
00:40:57,740 --> 00:40:59,780
Michael: Well, that's actually
that's a really good probably

680
00:40:59,860 --> 00:41:00,700
place to...

681
00:41:02,060 --> 00:41:05,860
I don't know how you're doing for
time, but it makes me think

682
00:41:05,860 --> 00:41:12,180
about the likes of Aurora or people
that are doing like innovations

683
00:41:12,260 --> 00:41:15,640
at the storage layer where this
kind of problem just doesn't

684
00:41:16,580 --> 00:41:19,160
exist anymore, you know,
they've already got multiple,

685
00:41:19,920 --> 00:41:22,440
well I guess we do need to still
have a disaster recovery, you

686
00:41:22,440 --> 00:41:24,520
know, whatever, all of Aurora's
down.

687
00:41:24,960 --> 00:41:26,780
We need a way of recovering.

688
00:41:27,640 --> 00:41:31,580
But they are promising to handle
that for us in this distributed

689
00:41:31,720 --> 00:41:32,460
world, right?

690
00:41:33,080 --> 00:41:37,360
That's part of the promise of keeping
copies at the storage layer.

691
00:41:38,400 --> 00:41:45,900
Nikolay: Yeah, well, these are definitely
new approaches, where data

692
00:41:45,900 --> 00:41:51,840
is stored on object storage originally,
not only backups, but

693
00:41:52,720 --> 00:41:56,820
it becomes the primary storage
originally, and then there is

694
00:41:56,820 --> 00:42:00,920
a bigger orchestration to bring
data closer to compute. But it's

695
00:42:00,920 --> 00:42:02,940
all hidden from the user.

696
00:42:03,820 --> 00:42:05,040
It's interesting approach.

697
00:42:05,280 --> 00:42:10,460
But there are also doubts that this
approach will win in terms of

698
00:42:10,520 --> 00:42:11,820
the market share.

699
00:42:12,740 --> 00:42:14,320
Right, so it's interesting.

700
00:42:15,240 --> 00:42:18,840
It's definitely an area where I don't know
many things.

701
00:42:18,840 --> 00:42:22,540
Maybe our listeners will share
some interesting ideas and we

702
00:42:22,540 --> 00:42:24,380
can explore them.

703
00:42:24,520 --> 00:42:27,660
And maybe we should invite someone
who can talk about this more.

704
00:42:28,260 --> 00:42:29,240
Michael: Yeah, sounds good.

705
00:42:29,240 --> 00:42:30,160
Nikolay: Yeah, yeah.

706
00:42:30,160 --> 00:42:32,640
But anyway, I think snapshots are
great.

707
00:42:32,640 --> 00:42:35,280
I think they can be great for smaller
databases as well.

708
00:42:35,280 --> 00:42:40,120
Like, small, I mean 1TB still,
because instead of 1 hour, it

709
00:42:40,120 --> 00:42:41,140
will be a minute.

710
00:42:41,820 --> 00:42:42,320
Right?

711
00:42:42,660 --> 00:42:44,080
Michael: I think that might be...

712
00:42:45,060 --> 00:42:47,840
I'm curious about that area, because
I think there are a lot

713
00:42:47,840 --> 00:42:53,120
more databases in that range of
hundreds of gigabytes to a couple

714
00:42:53,120 --> 00:42:56,580
of terabytes than there are companies
having to deal with dozens

715
00:42:56,580 --> 00:42:57,220
of terabytes.

716
00:42:57,900 --> 00:43:01,800
But if we can make their developer
experiences much better...

717
00:43:03,340 --> 00:43:04,040
Nikolay: I agree.

718
00:43:05,380 --> 00:43:08,940
A few terabytes, I would say, it's
already mid-market in terms

719
00:43:08,940 --> 00:43:10,280
of database size.

720
00:43:11,400 --> 00:43:13,860
So it's very common to have 1,
2 terabytes.

721
00:43:16,220 --> 00:43:21,040
And snapshots can be beneficial
because operations speed up significantly

722
00:43:21,280 --> 00:43:23,300
and RTO is improved.

723
00:43:23,560 --> 00:43:29,060
If you remember about lazy
load and know how to mitigate

724
00:43:29,060 --> 00:43:30,300
it or accept it.

725
00:43:31,100 --> 00:43:31,600
Michael: Yeah.

726
00:43:31,880 --> 00:43:32,380
Nikolay: Good.

727
00:43:32,520 --> 00:43:33,020
Good.

728
00:43:33,140 --> 00:43:34,500
Michael: Nice one, Nikolay.

729
00:43:34,500 --> 00:43:35,480
Nikolay: Thank you for listening.

730
00:43:35,500 --> 00:43:39,960
Again, if you have ideas in this
area I'm very... I will be happy

731
00:43:39,960 --> 00:43:43,320
to learn more because I'm learning
all the time myself and I

732
00:43:43,320 --> 00:43:47,380
think Michael is also in the same shoes,
right?

733
00:43:47,580 --> 00:43:50,460
Michael: Yeah, I find this stuff
interesting. I have to deal with

734
00:43:50,460 --> 00:43:54,520
it a lot less than you, which I'm
grateful for. But yeah, definitely

735
00:43:54,520 --> 00:43:57,960
let us know, I guess via YouTube
comments or on various social

736
00:43:57,960 --> 00:43:58,940
media things.

737
00:43:59,200 --> 00:44:01,220
Nikolay: Yeah good.
Michael: Nice one.

738
00:44:01,820 --> 00:44:03,300
We'll catch you next week, Nikolay.

739
00:44:03,400 --> 00:44:03,900
Nikolay: Bye-bye.