1
0:0:0,06 --> 0:0:2,54
Nikolay: Hello, hello, this is Postgres.FM.

2
0:0:3,26 --> 0:0:8,0
My name is Nik as usual, PostgresAI, and as usual with me,

3
0:0:8,0 --> 0:0:8,98
Michael, pgMustard.

4
0:0:9,34 --> 0:0:10,12
Hi, Michael.

5
0:0:10,44 --> 0:0:11,3
Michael: Hello, Nik.

6
0:0:11,82 --> 0:0:15,66
Nikolay: So we have a question today from one of the listeners,

7
0:0:16,68 --> 0:0:18,2
and it's about archiving.

8
0:0:19,54 --> 0:0:22,28
There are so many options for what it can mean.

9
0:0:23,26 --> 0:0:24,22
Where to start?

10
0:0:24,84 --> 0:0:27,36
Michael: Yeah, well they specifically asked whether it was possible

11
0:0:27,36 --> 0:0:30,36
for us to do a video on archiving and they said they want to

12
0:0:30,36 --> 0:0:33,74
archive a database that is about 5 terabytes in size.

13
0:0:34,9 --> 0:0:38,7
So what do you assume that means when you first hear it?

14
0:0:39,32 --> 0:0:44,78
Nikolay: Well if you are familiar with Postgres internals, and

15
0:0:45,48 --> 0:0:49,9
just not internals, at least about Postgres architecture.

16
0:0:51,14 --> 0:0:52,36
Archiving is simple.

17
0:0:52,36 --> 0:0:53,96
You just set up archive command.

18
0:0:53,96 --> 0:0:54,94
That's it, right?

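As a literal illustration of that joke — and only of the narrow, WAL-archiving sense of the word — here is a minimal sketch, assuming WAL-G is installed and its storage already configured. The settings are written to a scratch file here, not to a live postgresql.conf:

```shell
# Minimal WAL-archiving settings, written to a scratch file for
# illustration; the wal-g invocation assumes WAL-G is configured.
cat > archiving.conf <<'EOF'
archive_mode = on
archive_command = 'wal-g wal-push %p'   # %p expands to the WAL segment path
archive_timeout = 300                   # force a segment switch every 5 minutes
EOF
cat archiving.conf
```

With pgBackRest the archive_command would instead call `pgbackrest archive-push`; the idea is identical.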
19
0:0:55,12 --> 0:0:56,18
But it's a joke.

20
0:0:56,26 --> 0:0:58,98
I think this is not what was meant here.

21
0:1:2,52 --> 0:1:7,0
So I think 5 terabyte these days is an average size database.

22
0:1:7,2 --> 0:1:9,26
Who doesn't agree?

23
0:1:10,24 --> 0:1:14,94
Well, I see bigger databases and 5TB is not a big database anymore.

24
0:1:15,64 --> 0:1:17,64
It was big 10 years ago.

25
0:1:17,98 --> 0:1:20,72
These days big means more than 10 terabytes.

26
0:1:21,82 --> 0:1:22,92
This is just it.

27
0:1:23,26 --> 0:1:23,54
Right.

28
0:1:23,54 --> 0:1:28,86
So 5 terabytes is not a very large database, and archiving it means

29
0:1:29,06 --> 0:1:32,28
that we want to keep it longer term, right?

30
0:1:33,18 --> 0:1:34,0
That's how I read it.

31
0:1:34,0 --> 0:1:40,52
We want to keep it longer term, and/or we probably want to clean

32
0:1:40,52 --> 0:1:41,02
up.

33
0:1:41,84 --> 0:1:45,22
Sometimes we want to clean up.

34
0:1:45,98 --> 0:1:50,04
Well, yeah, because, for example, imagine it's e-commerce:

35
0:1:50,38 --> 0:1:54,6
we accumulated order history for many, many years.

36
0:1:54,82 --> 0:1:57,94
It's not used, but it's so much data.

37
0:1:58,82 --> 0:2:2,56
So why not archive it and then clean up?

38
0:2:2,68 --> 0:2:6,88
Cleanup in Postgres — straightforward cleanup with DELETE, as

39
0:2:6,88 --> 0:2:11,32
we've discussed many times — is not an easy operation in large

40
0:2:11,32 --> 0:2:17,36
databases because of MVCC, vacuum and all these complexities.

41
0:2:18,68 --> 0:2:25,2
So if it's partitioned, if bigger tables are partitioned, then

42
0:2:25,2 --> 0:2:26,72
cleanup is easy, right?

43
0:2:26,72 --> 0:2:30,98
We just detach and drop all partitions, that's it.

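The detach-and-drop step can be sketched as plain DDL. A sketch with made-up table names, written to a file rather than run against a live cluster; DETACH ... CONCURRENTLY requires Postgres 14+:

```shell
# Retiring one old partition (hypothetical names). CONCURRENTLY avoids
# holding a strong lock on the parent table during the detach.
cat > retire_partition.sql <<'EOF'
ALTER TABLE orders DETACH PARTITION orders_2015 CONCURRENTLY;
-- archive orders_2015 (dump / export) before this last step:
DROP TABLE orders_2015;
EOF
cat retire_partition.sql
```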
44
0:2:31,5 --> 0:2:34,8
And before that, obviously, archiving makes a lot of sense.

45
0:2:35,66 --> 0:2:40,02
But even if you don't have a task to clean up, archiving might

46
0:2:40,88 --> 0:2:45,36
make sense for backup purposes, right?

47
0:2:46,42 --> 0:2:46,56
We

48
0:2:46,56 --> 0:2:47,12
Michael: want to

49
0:2:47,12 --> 0:2:49,14
Nikolay: keep archive for long, yeah.

50
0:2:49,66 --> 0:2:52,48
Or maybe there are requirements.

51
0:2:53,12 --> 0:2:55,46
There also may be there are requirements, right?

52
0:2:56,06 --> 0:2:59,12
Michael: Yeah, I think there's an argument that it's slightly

53
0:2:59,12 --> 0:3:0,78
different to backups.

54
0:3:1,16 --> 0:3:7,62
But I'm not actually certain that the way I have

55
0:3:7,74 --> 0:3:11,26
internalized it is actually helpful in any

56
0:3:11,26 --> 0:3:15,24
practical sense, because if you've got data archived,

57
0:3:15,24 --> 0:3:19,16
it's kind of a backup, right? Like, it is another source of old

58
0:3:19,16 --> 0:3:19,66
data.

59
0:3:20,74 --> 0:3:24,38
But I do think they're different in that you wouldn't expect

60
0:3:24,72 --> 0:3:27,8
an archive to have up-to-date data, whereas you would expect

61
0:3:27,8 --> 0:3:29,2
a backup to have up-to-date data.

62
0:3:29,2 --> 0:3:31,5
So I do think there's like subtle differences.

63
0:3:32,06 --> 0:3:36,82
And also, I think you back up and you archive for

64
0:3:36,82 --> 0:3:38,98
different reasons; therefore there are different trade-offs.

65
0:3:38,98 --> 0:3:41,88
So like, if we're backing up in order to be able to get back

66
0:3:41,88 --> 0:3:45,52
online quickly after a disaster, you might want to put

67
0:3:45,52 --> 0:3:48,34
it somewhere different in a different format than if we archive

68
0:3:48,4 --> 0:3:52,16
it because we need access to it but not necessarily performantly.

69
0:3:55,74 --> 0:4:1,36
Nikolay: For me, in real life — for example, take our DBLab,

70
0:4:1,56 --> 0:4:2,38
or experiments.

71
0:4:3,68 --> 0:4:6,88
Experiments should be done on clones, right?

72
0:4:7,44 --> 0:4:11,78
On thin clones, on branching, if it's DBLab, for example, or

73
0:4:12,28 --> 0:4:17,3
full-fledged clones on separate virtual machines, like if it's

74
0:4:17,3 --> 0:4:18,42
self-managed, right?

75
0:4:18,42 --> 0:4:24,4
And we say, let's restore either from snapshot, cloud snapshot

76
0:4:24,4 --> 0:4:29,94
of disk, or let's restore from backups or from archive.

77
0:4:29,96 --> 0:4:32,72
And these are purely synonyms for me.

78
0:4:32,72 --> 0:4:34,34
Like let's restore from archive.

79
0:4:34,9 --> 0:4:37,86
And that's why there's archive_command — it's a part of backups, right,

80
0:4:38,04 --> 0:4:39,84
it's archiving WALs.

81
0:4:40,68 --> 0:4:47,78
And we say we have many versions of our database — not versions,

82
0:4:47,84 --> 0:4:53,1
many exemplars of our database in the archive, or in backups.

83
0:4:53,1 --> 0:4:55,68
Backups, archive, it's the same for me in this case.

84
0:4:56,4 --> 0:5:0,36
And of course we want it to be quickly restored. But these days,

85
0:5:0,36 --> 0:5:2,3
if it's cloud, it's easy to configure.

86
0:5:2,88 --> 0:5:9,26
So you basically archive it, back it up using pgBackRest or WAL-G.

87
0:5:9,84 --> 0:5:15,16
It goes to S3 or other object storage, like GCS, or — I'm not an

88
0:5:15,16 --> 0:5:18,3
expert in Azure — Blob Storage, as they name it, right?

89
0:5:18,56 --> 0:5:20,66
And there you can configure it.

90
0:5:20,66 --> 0:5:25,54
So for example, a couple of weeks later, that backup automatically

91
0:5:26,2 --> 0:5:31,38
moves to colder storage, like in S3 it's called Glacier.

92
0:5:31,4 --> 0:5:32,22
Michael: To an archive.

93
0:5:32,8 --> 0:5:37,62
Nikolay: Well, all our archives, 1 is hotter, another is colder,

94
0:5:37,66 --> 0:5:38,04
right?

95
0:5:38,04 --> 0:5:43,94
It's Glacier or like in GCS and Google Cloud, it's called Coldline,

96
0:5:44,12 --> 0:5:45,56
I don't remember the name.

97
0:5:45,6 --> 0:5:51,32
But the idea is it's cheaper, but slower to restore.

98
0:5:53,4 --> 0:5:57,72
And this still triggers a bug in WAL-G that has been known for many years.

99
0:5:57,72 --> 0:5:57,94
I

100
0:5:57,94 --> 0:6:2,54
recently vibe-coded a fix because it changes modification time

101
0:6:3,24 --> 0:6:6,3
and latest becomes broken.

102
0:6:6,68 --> 0:6:12,18
If you want to restore using uppercase
LATEST with WAL-G —

103
0:6:12,18 --> 0:6:13,7
wal-g backup-fetch LATEST —

104
0:6:14,28 --> 0:6:19,54
In this case it might fetch the
wrong backup, because

105
0:6:19,54 --> 0:6:22,26
it's based on mod time, modification
time.

106
0:6:22,72 --> 0:6:26,98
And when it's moved automatically
to colder storage,

107
0:6:26,98 --> 0:6:30,84
the modification time of the old backup
changes.

108
0:6:31,26 --> 0:6:32,8
So latest might be broken.

109
0:6:32,8 --> 0:6:38,46
I recently sent a pull request
to Andrey to fix it

110
0:6:38,56 --> 0:6:39,22
in WAL-G.

111
0:6:39,24 --> 0:6:40,82
It should be fixed, right?

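Until a fix like that is released, a defensive habit is to resolve the backup name explicitly instead of trusting the mtime-based LATEST alias. A sketch — the commands are only echoed here, since they need configured WAL-G storage, and the backup name is a hypothetical example:

```shell
# Instead of `wal-g backup-fetch $PGDATA LATEST`, list the backups and
# pass an explicit name; echoed only, since real storage is required.
list_cmd='wal-g backup-list --detail'
fetch_cmd='wal-g backup-fetch /var/lib/postgresql/data base_000000010000000000000042'
echo "$list_cmd"
echo "$fetch_cmd"
```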
112
0:6:42,04 --> 0:6:46,16
Yeah, but I mean, modification time
cannot be trusted here, obviously.

113
0:6:46,64 --> 0:6:53,36
But it's a good idea, and it's
great that clouds support it —

114
0:6:53,8 --> 0:6:57,6
it's super easy to do. You just configure
some policy: how long to

115
0:6:57,6 --> 0:7:1,64
keep them, and when
to move them to colder storage.

116
0:7:1,64 --> 0:7:6,22
So it allows you, for the same
money, to keep archives for longer,

117
0:7:6,82 --> 0:7:9,22
right?

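The policy Nikolay describes is, on S3, a lifecycle configuration. A sketch with a hypothetical bucket, prefix, and day count; the aws CLI call that would apply it is shown commented out:

```shell
# Lifecycle rule: transition backups to Glacier after 14 days.
# Bucket name, prefix, and days are hypothetical examples.
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "backups-to-glacier",
    "Status": "Enabled",
    "Filter": { "Prefix": "pg-backups/" },
    "Transitions": [{ "Days": 14, "StorageClass": "GLACIER" }]
  }]
}
EOF
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket my-backup-bucket --lifecycle-configuration file://lifecycle.json
cat lifecycle.json
```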
118
0:7:9,24 --> 0:7:12,04
Michael: Or if, for example, you're doing

119
0:7:12,04 --> 0:7:14,34
this for compliance reasons, and
the length

120
0:7:14,34 --> 0:7:14,48
Nikolay: of

121
0:7:14,48 --> 0:7:17,38
Michael: time is predetermined
how long you have to keep it for,

122
0:7:17,52 --> 0:7:19,54
you can keep it for that long for
cheaper.

123
0:7:19,54 --> 0:7:22,32
It feels to me like
that's the more common case, yeah.

124
0:7:23,0 --> 0:7:28,62
Nikolay: Yeah, so obviously it's
not SSD anymore, it's HDD, but

125
0:7:28,62 --> 0:7:32,14
it's still super reliable, and
as I understand, they like promise

126
0:7:33,06 --> 0:7:37,9
like a lot of nines, like amazing
number of nines.

127
0:7:39,0 --> 0:7:40,7
11 nines or how many?

128
0:7:41,4 --> 0:7:41,885
It's like basically all-

129
0:7:41,885 --> 0:7:45,08
Michael: But here we're talking
about durability, not availability,

130
0:7:45,42 --> 0:7:45,92
right?

131
0:7:46,52 --> 0:7:46,82
Nikolay: Yeah.

132
0:7:46,82 --> 0:7:47,22
Exactly.

133
0:7:47,22 --> 0:7:50,78
So obviously data won't be lost.

134
0:7:51,02 --> 0:7:52,16
That's the key.

135
0:7:52,36 --> 0:7:55,6
And it can be stored for years,
years, years if you want.

136
0:7:55,6 --> 0:7:56,1
Right.

137
0:7:56,52 --> 0:7:57,46
And that's great.

138
0:7:57,54 --> 0:8:3,74
The only risk that remains: if somebody
gets hold of your cloud account

139
0:8:4,94 --> 0:8:8,1
access, they might destroy everything.

140
0:8:8,62 --> 0:8:12,32
To protect from that, there is
a lock option, right?

141
0:8:12,32 --> 0:8:15,8
You can lock it and basically say
that nobody can

142
0:8:15,8 --> 0:8:20,02
delete it, or you can copy it to a different
account or even a different

143
0:8:20,02 --> 0:8:20,52
cloud.

144
0:8:21,54 --> 0:8:22,06
Michael: Yeah, right.

145
0:8:22,06 --> 0:8:24,46
So different permissions, right?

146
0:8:25,44 --> 0:8:30,9
Nikolay: Right, and again, if
it's backups using WAL-G —

147
0:8:30,9 --> 0:8:36,36
WAL-G has a special command for this: copy.

148
0:8:36,6 --> 0:8:40,58
So you can copy from 1 backup location,
1 archive location to

149
0:8:40,58 --> 0:8:42,34
different archive location, right?

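That copy looks roughly like this; flag names should be checked against `wal-g copy --help` for your version, and the config paths and backup name are hypothetical. The command is echoed, not executed:

```shell
# Copy one backup from the primary storage to a separately-secured
# destination (different account or even different cloud).
copy_cmd='wal-g copy --backup-name base_000000010000000000000042 \
  --from /etc/walg/primary.json --to /etc/walg/offsite.json'
echo "$copy_cmd"
```

Combined with distinct credentials (and, on S3, Object Lock), this protects the archive even if the primary account is compromised.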
150
0:8:43,04 --> 0:8:49,68
But if you are on managed Postgres,
in this case, physical backups

151
0:8:49,68 --> 0:8:53,4
are not available unless it's Crunchy
Bridge, right?

152
0:8:53,56 --> 0:8:56,98
It's great that they allow
access to physical backups

153
0:8:57,04 --> 0:8:57,78
and WALs.

154
0:8:58,94 --> 0:9:3,58
And in this case, the only option
that remains is to create a logical

155
0:9:3,9 --> 0:9:6,5
backup, which I prefer to call
dump.

156
0:9:7,12 --> 0:9:13,62
And we had an episode: is a logical
backup a backup?

157
0:9:14,72 --> 0:9:16,08
My answer is still no.

158
0:9:16,74 --> 0:9:17,24
But

159
0:9:19,02 --> 0:9:20,04
Michael: Is it an archive?

160
0:9:20,14 --> 0:9:25,54
Because I think, like, archiving
almost always isn't incremental,

161
0:9:25,84 --> 0:9:26,2
right?

162
0:9:26,2 --> 0:9:28,38
Like, you do it in batches, right?

163
0:9:28,38 --> 0:9:30,84
You take a whole year and archive
it.

164
0:9:30,86 --> 0:9:32,78
So in this case, it's not changing.

165
0:9:32,78 --> 0:9:37,52
So a logical dump is kind of, is
an archive.

166
0:9:38,68 --> 0:9:42,68
Nikolay: Well, for me, again, I
don't feel this difference.

167
0:9:43,48 --> 0:9:45,32
I hear you, you feel some difference.

168
0:9:45,4 --> 0:9:47,7
But for me, all backups are also
archive.

169
0:9:47,9 --> 0:9:50,6
They're just continuous thanks
to WAL archive.

170
0:9:50,74 --> 0:9:55,88
They consist of 2 parts: full copies or
deltas — incremental, differential,

171
0:9:56,02 --> 0:9:59,14
there are many names — but basically a

172
0:9:59,64 --> 0:10:5,82
full copy of data directory plus
WALs, right, and WALs make

173
0:10:5,82 --> 0:10:7,0
them like continuous.

174
0:10:7,12 --> 0:10:8,74
This archive becomes continuous.

175
0:10:9,4 --> 0:10:11,6
Michael: This is where I think
we have a difference because I

176
0:10:11,6 --> 0:10:15,06
think... Let's go back to the
question, right? They want to

177
0:10:15,06 --> 0:10:17,36
archive a database that's 5 terabytes
in size.

178
0:10:17,36 --> 0:10:18,92
That's all the information we got.

179
0:10:19,12 --> 0:10:23,52
But we don't know — there's a chance
that it's an old product

180
0:10:23,52 --> 0:10:26,68
that's no longer served and they
just need to keep the data for

181
0:10:26,68 --> 0:10:27,88
a certain amount of time.

182
0:10:28,08 --> 0:10:30,94
It could be that it doesn't need
to be accessed anymore.

183
0:10:31,0 --> 0:10:34,62
It could be, it could be all manner
of...

184
0:10:37,16 --> 0:10:39,96
Nikolay: Maybe they want to drop the
database and just keep it for

185
0:10:39,96 --> 0:10:40,94
history, right?

186
0:10:41,18 --> 0:10:46,32
So to answer this question directly,
the first thing we need to

187
0:10:46,32 --> 0:10:51,88
understand is it managed service
or we have access to physical

188
0:10:52,36 --> 0:10:52,86
level.

189
0:10:54,0 --> 0:10:57,1
If it's managed service, the only
option for us is pg_dump.

190
0:10:57,88 --> 0:10:58,78
Michael: No, I disagree.

191
0:11:0,06 --> 0:11:0,56
Why?

192
0:11:1,02 --> 0:11:2,86
Well, because I think you could...

193
0:11:3,68 --> 0:11:7,58
So depending on the parameters,
right, like you could export

194
0:11:7,58 --> 0:11:9,3
it in a different format, right?

195
0:11:9,44 --> 0:11:10,58
There's a lot of...

196
0:11:12,98 --> 0:11:16,5
Nikolay: Okay, logical backup is
the only option here.

197
0:11:16,52 --> 0:11:17,02
Yes.

198
0:11:17,64 --> 0:11:20,28
Although technically speaking,
we can rely on...

199
0:11:20,6 --> 0:11:23,72
We can say, okay, let's create
a snapshot and consider the archive

200
0:11:23,72 --> 0:11:26,68
and rely on RDS that they will
keep it for longer.

201
0:11:27,18 --> 0:11:27,54
Why not?

202
0:11:27,54 --> 0:11:31,04
Michael: Yeah, but here's what I'm thinking though: 5 terabytes

203
0:11:31,04 --> 0:11:33,82
is not very large, right, as you were saying at the beginning.

204
0:11:34,12 --> 0:11:37,74
But in the general case, where we care about cost — and it seems

205
0:11:37,74 --> 0:11:40,32
like cost actually matters here; it's 1 of the main reasons for

206
0:11:40,32 --> 0:11:41,74
doing this in the first place.

207
0:11:42,04 --> 0:11:45,86
If we care a lot about cost it feels to me like some of the newer

208
0:11:45,86 --> 0:11:50,46
formats that compress extremely well — and maybe new is not

209
0:11:50,46 --> 0:11:53,8
the right term, but it's new to me in terms of coming across

210
0:11:53,8 --> 0:11:57,84
it from a Postgres perspective but the formats like Iceberg and

211
0:11:57,84 --> 0:12:3,08
Parquet, which compress down extremely well — it feels to me like that would

212
0:12:3,08 --> 0:12:8,04
be the cheapest way of archiving data that's static and is not

213
0:12:8,04 --> 0:12:10,92
going to be changing or you're going to get a new batch of it

214
0:12:10,92 --> 0:12:11,92
once per year.

215
0:12:11,92 --> 0:12:15,08
Like the archiving use cases that I'm thinking of.

216
0:12:15,48 --> 0:12:20,02
If we exported in those formats they'd be tiny, like the 5 terabytes

217
0:12:20,02 --> 0:12:22,86
might go down 95% or so.

218
0:12:23,74 --> 0:12:25,22
Nikolay: Oh, I doubt.

219
0:12:25,6 --> 0:12:26,96
20 times, I doubt.

220
0:12:27,44 --> 0:12:28,22
Well, maybe.

221
0:12:29,14 --> 0:12:31,08
Okay, what have you seen?

222
0:12:31,08 --> 0:12:34,26
Michael: Yeah, I've seen people reporting that kind of...

223
0:12:35,02 --> 0:12:38,1
Even on the Timescale compression side, I've seen that reported.

224
0:12:38,1 --> 0:12:39,18
And that wasn't even...

225
0:12:40,32 --> 0:12:41,88
Nikolay: Depending on data again.

226
0:12:42,54 --> 0:12:43,68
Michael: Yeah, depends on the data.

227
0:12:43,68 --> 0:12:45,9
But if you've got a lot of numerical...

228
0:12:46,1 --> 0:12:48,38
A lot of columns that compress...

229
0:12:48,6 --> 0:12:51,88
That have a lot of similar data in them, which I think is pretty

230
0:12:51,88 --> 0:12:55,46
common, especially for like, if we're thinking it's healthcare

231
0:12:55,52 --> 0:13:1,72
related or financial, like transactional data. A lot of that

232
0:13:1,72 --> 0:13:3,98
is numerical stuff.

233
0:13:4,64 --> 0:13:8,2
Nikolay: Well, I, again, like, okay, Parquet is interesting.

234
0:13:8,22 --> 0:13:9,94
I think it's an option as well.

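For the Parquet path, one common route — assuming the DuckDB CLI with its postgres extension is available; the DSN and table name below are made up — is to let DuckDB read from Postgres and write compressed Parquet. The SQL is only written to a file here, not executed:

```shell
# Sketch: export one table to zstd-compressed Parquet via DuckDB's
# postgres extension. DSN and table name are hypothetical.
cat > export_parquet.sql <<'EOF'
ATTACH 'dbname=mydb host=localhost' AS pg (TYPE postgres);
COPY (SELECT * FROM pg.public.orders)
  TO 'orders.parquet' (FORMAT parquet, COMPRESSION zstd);
EOF
cat export_parquet.sql
```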
235
0:13:10,52 --> 0:13:15,04
And in this case, you can also, like, it depends on access patterns

236
0:13:15,06 --> 0:13:15,82
later, right?

237
0:13:16,12 --> 0:13:18,84
The first thing we need to answer: do we have access

238
0:13:18,84 --> 0:13:20,92
to physical or only logical?

239
0:13:20,92 --> 0:13:23,56
Logical, we obviously have access because this is our database.

240
0:13:23,56 --> 0:13:24,06
Yeah.

241
0:13:24,34 --> 0:13:24,98
I suppose.

242
0:13:25,76 --> 0:13:28,52
All right, so physical versus logical thing.

243
0:13:28,52 --> 0:13:31,68
We discussed it many times, but physical also can be compressed,

244
0:13:31,68 --> 0:13:31,92
right?

245
0:13:31,92 --> 0:13:37,28
So both WAL-G and pgBackRest, they compress data directory and WALs.

246
0:13:37,44 --> 0:13:38,3
So why not?

247
0:13:38,32 --> 0:13:42,8
The only thing — the super important thing — to remember is

248
0:13:42,8 --> 0:13:45,32
that you cannot just take the data directory and that's it.

249
0:13:45,32 --> 0:13:47,94
You need to make sure WALs are also archived.

250
0:13:49,12 --> 0:13:52,58
If it's only logical, OK, there is a Parquet option, but also

251
0:13:52,58 --> 0:13:55,64
there is a pg_dump option, which
also has compression.

252
0:13:56,48 --> 0:14:0,32
Michael: Yeah, but we're not getting
the compression like at

253
0:14:0,32 --> 0:14:1,66
the page level, right?

254
0:14:2,52 --> 0:14:3,98
Or kind of the row level.

255
0:14:4,16 --> 0:14:8,22
And I think column level compression
is hard to compete with.

256
0:14:10,08 --> 0:14:14,2
Nikolay: Well, it's page level
or file level?

257
0:14:15,66 --> 0:14:16,4
Michael: File level.

258
0:14:17,02 --> 0:14:18,06
Okay, what's the difference?

259
0:14:18,82 --> 0:14:19,7
Nikolay: What is the difference?

260
0:14:20,98 --> 0:14:22,76
Huh how...

261
0:14:23,26 --> 0:14:29,36
Well, from pg_dump I would expect —
sometimes, in some cases — 10x

262
0:14:29,38 --> 0:14:31,12
compression if you...

263
0:14:31,92 --> 0:14:33,74
There are options in pg_dump, right?

264
0:14:34,66 --> 0:14:39,28
zstd I would check, definitely.
Do you remember

265
0:14:39,28 --> 0:14:42,76
if pg_dump allows you to control the compression
level?

266
0:14:43,2 --> 0:14:45,78
Can you set compression level 19?

267
0:14:46,98 --> 0:14:49,22
It will be terribly slow, right?

268
0:14:49,46 --> 0:14:50,11
Or if you check...

269
0:14:50,11 --> 0:14:50,46
We might

270
0:14:50,46 --> 0:14:51,54
Michael: be okay with that.

271
0:14:51,54 --> 0:14:55,76
That's the nice thing about this
use case, is we might be okay

272
0:14:55,76 --> 0:14:59,48
with really slow retrieval really
slow restores because the main

273
0:14:59,48 --> 0:15:1,74
point is keep it for many years

274
0:15:3,3 --> 0:15:7,9
Nikolay: Right, right. Anyway,
pg_dump has supported this for multiple

275
0:15:7,9 --> 0:15:13,14
years already — it has options,
and you can control the compression

276
0:15:13,14 --> 0:15:13,88
level there.

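To make that concrete: since Postgres 16, pg_dump accepts a method:level pair such as zstd:19. A sketch with a hypothetical database name and output path, echoed rather than run, since it needs a live server:

```shell
# Custom-format dump with a high zstd level (Postgres 16+ syntax).
# High levels trade a lot of CPU time for a smaller archive.
dump_cmd='pg_dump --format=custom --compress=zstd:19 \
  --file=mydb_archive.dump mydb'
echo "$dump_cmd"
```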
277
0:15:14,04 --> 0:15:17,8
You cannot control compression
level in WAL compression, which

278
0:15:17,8 --> 0:15:22,16
compresses full-page writes inside,
full-page images inside WAL,

279
0:15:22,16 --> 0:15:22,66
right?

280
0:15:23,1 --> 0:15:26,86
There, it was my proposal, I need
to finish it.

281
0:15:27,26 --> 0:15:30,74
There you can control algorithm,
but not the level of compression.

282
0:15:31,08 --> 0:15:35,46
So I'm not sure everyone needs
it, but still for completeness,

283
0:15:36,04 --> 0:15:38,14
we should support it in Postgres,
right?

284
0:15:38,14 --> 0:15:41,38
But for pg_dump, definitely you
can control level.

285
0:15:41,94 --> 0:15:48,78
So if you are okay to wait and
to load the CPU a lot —

286
0:15:48,78 --> 0:15:52,26
say, if you choose level
19 —

287
0:15:53,24 --> 0:15:58,26
I think it should be super slow,
but maybe beneficial in terms

288
0:15:58,26 --> 0:16:1,6
of size. And of course the restore
will also be slow, right?

289
0:16:2,04 --> 0:16:3,88
Yeah Yeah

290
0:16:4,14 --> 0:16:6,5
Michael: I think that is something
to factor in if people

291
0:16:6,5 --> 0:16:10,58
are deciding which way to go:
what kind of questions

292
0:16:10,58 --> 0:16:12,18
are you going to get about this old
data?

293
0:16:12,18 --> 0:16:15,04
Like are there going to be auditors
that want to come along and

294
0:16:15,04 --> 0:16:16,56
be able to query it?

295
0:16:16,68 --> 0:16:19,76
If you're going to get advance

296
0:16:19,76 --> 0:16:22,8
warning that they're coming, does
it matter that it takes a day

297
0:16:22,8 --> 0:16:23,54
to restore?

298
0:16:24,84 --> 0:16:26,54
Are there gonna be analytical queries?

299
0:16:26,72 --> 0:16:29,5
Like are people gonna want to know
aggregations on it?

300
0:16:29,5 --> 0:16:33,18
Because then if it's gonna be mostly
aggregates, maybe you do

301
0:16:33,18 --> 0:16:36,24
want it in some columnar format
where

302
0:16:36,24 --> 0:16:36,54
you can

303
0:16:36,54 --> 0:16:37,36
get those done quickly.

304
0:16:37,36 --> 0:16:41,18
Nikolay: Well, you confuse me,
but obviously, when you compress

305
0:16:41,48 --> 0:16:45,16
a dump, it's file level, so the
whole thing is compressed. It's

306
0:16:45,16 --> 0:16:46,22
not page level.

307
0:16:46,32 --> 0:16:47,46
Michael: That makes sense, yeah.

308
0:16:47,72 --> 0:16:48,54
Nikolay: It should be good.

309
0:16:48,54 --> 0:16:51,96
Page level, if it's, for example,
you store something on ZFS,

310
0:16:52,08 --> 0:16:53,4
there it's page level.

311
0:16:54,14 --> 0:16:54,82
Here, no.

312
0:16:54,82 --> 0:16:55,96
The whole thing is compressed.

313
0:16:55,96 --> 0:16:59,56
It can be super slow, of course,
but quite well compressed.

314
0:17:0,36 --> 0:17:2,92
Not sure how it will compete with
Parquet.

315
0:17:3,34 --> 0:17:7,32
I don't have a lot of experience
with it. In production,

316
0:17:7,44 --> 0:17:13,08
we have a customer who has a self-managed
database, and it's archived

317
0:17:13,08 --> 0:17:14,94
from time to time in Parquet format.

318
0:17:15,14 --> 0:17:16,3
Yeah, and it's great.

319
0:17:16,42 --> 0:17:17,38
It's good work.

320
0:17:17,52 --> 0:17:18,46
It's also ZFS.

321
0:17:18,46 --> 0:17:23,24
It's an interesting setup. But anyway,
I think there is potential

322
0:17:23,24 --> 0:17:27,58
here if somebody wants to write
some article or do some research,

323
0:17:27,78 --> 0:17:30,52
there are many options to discover
here and benchmark.

324
0:17:31,18 --> 0:17:37,3
Like, we could build some decision
tree — physical or logical,

325
0:17:37,3 --> 0:17:43,06
how to decide. I would stick with
physical always, if recovery

326
0:17:43,08 --> 0:17:46,8
time matters, and if we want continuous
archiving, right?

327
0:17:47,38 --> 0:17:51,56
And if we have access, of course,
because if there's no access, it's

328
0:17:51,56 --> 0:17:53,18
blocked — this path is blocked.

329
0:17:53,4 --> 0:17:56,62
Logical, okay, dump or Parquet
format.

330
0:17:57,04 --> 0:17:58,86
Dump, we have compression options.

331
0:17:59,38 --> 0:18:4,94
I would definitely not
choose a single file.

332
0:18:5,14 --> 0:18:7,12
Well, again, depends, right?

333
0:18:7,12 --> 0:18:11,7
But normally we choose the custom or directory format,
because we want compression

334
0:18:12,16 --> 0:18:14,68
and parallelization, right?

335
0:18:15,16 --> 0:18:20,04
Because if you want single file
dump, then you cannot use parallelization.

336
0:18:21,18 --> 0:18:27,66
And you will be limited by a single
vCPU work when dumping and

337
0:18:27,66 --> 0:18:28,16
restoring.

338
0:18:29,18 --> 0:18:34,3
When you use the directory format, you
can say how many jobs, how many

339
0:18:34,3 --> 0:18:39,22
vCPUs to utilize when dumping and
separately when restoring.

340
0:18:39,44 --> 0:18:41,42
It can be different number.

341
0:18:43,3 --> 0:18:47,78
And this matters a lot if we want
to dump faster and restore

342
0:18:47,88 --> 0:18:48,96
faster, right?

343
0:18:48,96 --> 0:18:50,14
This thing.

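A sketch of that dump/restore job split, with made-up names and counts, echoed only. One caveat worth noting: parallel dump (--jobs) actually requires the directory format; pg_restore --jobs works with either directory- or custom-format dumps:

```shell
# Parallel dump and restore with independently chosen job counts.
# pg_dump --jobs requires --format=directory; echoed, not executed.
dump_cmd='pg_dump --format=directory --jobs=8 --compress=zstd:6 \
  --file=/archive/mydb.dir mydb'
restore_cmd='pg_restore --jobs=16 --dbname=mydb_restored /archive/mydb.dir'
echo "$dump_cmd"
echo "$restore_cmd"
```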
344
0:18:50,28 --> 0:18:53,44
And then finally, like, so this
is a decision making tree, right?

345
0:18:53,44 --> 0:18:56,04
And then finally, where to store
it?

346
0:18:56,04 --> 0:19:0,42
Definitely, an archive shouldn't
be stored on expensive SSD

347
0:19:0,42 --> 0:19:1,72
disks, right?

348
0:19:1,72 --> 0:19:2,72
It should be stored somewhere cheaper.

349
0:19:3,48 --> 0:19:6,18
Object storage is great if it's
cloud, right?

350
0:19:6,28 --> 0:19:11,26
If it's not cloud, some cheap disks,
which we don't use a lot,

351
0:19:11,72 --> 0:19:14,24
so they serve as archive.

352
0:19:14,86 --> 0:19:17,6
Or even, is tape still an option?

353
0:19:17,6 --> 0:19:20,02
I think many old organizations still

354
0:19:20,02 --> 0:19:20,46
use tape.

355
0:19:20,46 --> 0:19:22,92
Well, it has a huge lifespan, right?

356
0:19:23,32 --> 0:19:25,14
So tape, right?

357
0:19:25,4 --> 0:19:26,14
Why not?

358
0:19:26,68 --> 0:19:30,94
Anyway, it should be something
which we know is reliable, will

359
0:19:30,94 --> 0:19:32,58
live very long, right?

360
0:19:33,52 --> 0:19:38,58
But also we need to take into account
restoration timing, right?

361
0:19:39,24 --> 0:19:40,32
Michael: Maybe, yeah.

362
0:19:40,52 --> 0:19:46,02
Nikolay: Because Glacier is super
cheap — S3 Glacier is like the cheapest,

363
0:19:46,32 --> 0:19:49,26
but also the slowest to recover
from, right?

364
0:19:49,3 --> 0:19:49,7
Yeah.

365
0:19:49,7 --> 0:19:52,98
To fetch, fetch data, fetch data
from there.

366
0:19:54,12 --> 0:19:56,12
So it's interesting, right?

367
0:19:56,12 --> 0:20:0,92
And physical backups — a physical
archive — also can be

368
0:20:0,92 --> 0:20:4,28
placed into various locations.

369
0:20:4,28 --> 0:20:8,7
So location question, I think it's
a separate decision.

370
0:20:9,16 --> 0:20:10,98
Michael: Well, yeah.

371
0:20:12,28 --> 0:20:15,36
I think there's a couple of other
things to factor in.

372
0:20:15,36 --> 0:20:21,9
Because if you do store it in Parquet,
for example, you can query

373
0:20:21,9 --> 0:20:23,0
it without restoring.

374
0:20:23,92 --> 0:20:27,44
You don't even have to restore
to get the data.

375
0:20:27,44 --> 0:20:28,52
Nikolay: Yeah, that's great.

376
0:20:28,52 --> 0:20:29,4
Great option.

377
0:20:29,58 --> 0:20:34,86
Also logical dump path is great
because it allows you partial

378
0:20:34,9 --> 0:20:35,4
archive.

379
0:20:36,9 --> 0:20:39,18
If you want only specific tables.

380
0:20:40,46 --> 0:20:44,28
And also of course you need to
take into account that you only

381
0:20:45,3 --> 0:20:48,8
archive data, no helper data.

382
0:20:48,8 --> 0:20:51,84
Helper data is indexes or materialized
views.

383
0:20:52,36 --> 0:20:58,88
So derived things, well, which
means that recovery will be longer,

384
0:20:59,82 --> 0:21:0,32
right?

385
0:21:0,48 --> 0:21:4,46
So I would say if you are going
to drop some database for historical

386
0:21:4,54 --> 0:21:7,58
purposes, even if you have physical
access, I would probably

387
0:21:7,58 --> 0:21:15,74
still do a dump. A 5-terabyte dump...
If it's a powerful machine, definitely

388
0:21:15,74 --> 0:21:20,18
within 1 hour, definitely. You know,
but...

389
0:21:20,18 --> 0:21:21,24
Michael: yeah, we took local

390
0:21:22,66 --> 0:21:25,7
Nikolay: Yeah, and if it's like
with many threads, you can...

391
0:21:26,68 --> 0:21:27,94
Well, if it's a...

392
0:21:28,03 --> 0:21:31,1
By the way, dumps are always interesting.

393
0:21:31,12 --> 0:21:35,04
I think it doesn't matter, Parquet
or native pg_dump.

394
0:21:35,58 --> 0:21:40,06
If you have unpartitioned case
and, for example, 5 terabytes,

395
0:21:40,14 --> 0:21:42,42
but 4 of them is a single table.

396
0:21:42,84 --> 0:21:44,28
We see it a lot, actually.

397
0:21:44,88 --> 0:21:46,26
1 huge table.

398
0:21:47,64 --> 0:21:54,4
Usually called logs or history
or something like historical events.

399
0:21:55,38 --> 0:21:58,94
Sometimes queue-like workload, but
something like accumulated.

400
0:22:0,56 --> 0:22:2,84
And then people think how to get
rid of it.

401
0:22:2,84 --> 0:22:4,6
Partitioning is needed there, right?

402
0:22:4,6 --> 0:22:7,94
Because without partitioning, dumping
will be single threaded

403
0:22:7,94 --> 0:22:8,66
as well.

404
0:22:8,86 --> 0:22:12,44
Although you could create your
own tool, and I think actually

405
0:22:12,44 --> 0:22:18,94
pg_dump could support it, using
ID ranges, dump using multiple

406
0:22:18,94 --> 0:22:19,88
parallel threads.

407
0:22:20,82 --> 0:22:23,5
Theoretically, it's possible and
you can write this.

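Such a tool can be sketched in a few lines of shell: split the ID space into ranges and emit one COPY per range, to be run in parallel (e.g. with xargs -P). Table name, ID column, and bounds are hypothetical; only the job list is generated here:

```shell
# Generate one \copy command per ID range of a single huge table.
# For a consistent archive, run the jobs against the same snapshot
# (e.g. an idle replica), since each psql call is its own transaction.
min_id=1; max_id=4000000; step=1000000
i=$min_id
while [ "$i" -le "$max_id" ]; do
  hi=$((i + step - 1))
  echo "psql -c \"\\copy (SELECT * FROM big_table WHERE id BETWEEN $i AND $hi) TO 'big_table_$i.csv'\""
  i=$((i + step))
done > copy_jobs.txt
wc -l < copy_jobs.txt
```

Something like `xargs -P 16 -I{} sh -c '{}' < copy_jobs.txt` would then fan the jobs out across cores.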
408
0:22:24,56 --> 0:22:31,0
But out of the box, pg_dump
will limit you and probably dumping

409
0:22:31,0 --> 0:22:32,94
5 terabytes will look like this.

410
0:22:33,74 --> 0:22:39,78
And you say, I want to utilize
all my 16 cores, using 16 threads,

411
0:22:40,58 --> 0:22:41,82
to send it to S3.

412
0:22:42,54 --> 0:22:47,22
First 10 minutes, all 16 threads
are utilized.

413
0:22:47,86 --> 0:22:52,12
And then a couple of hours, just
single thread is working.

414
0:22:53,1 --> 0:22:55,3
Because 1 huge table.

415
0:22:57,34 --> 0:22:58,2
That's why.

416
0:22:58,62 --> 0:23:4,68
So partitioning would make archiving
much faster and more flexible

417
0:23:4,78 --> 0:23:10,24
to tune, like, let's move
faster, because, again, if it's

418
0:23:10,24 --> 0:23:14,56
live database, every time we archive
logically, we have long-running

419
0:23:14,64 --> 0:23:15,14
transaction.

420
0:23:16,32 --> 0:23:21,36
And we affect vacuum behavior,
even if it's on replicas which

421
0:23:21,36 --> 0:23:25,16
have hot_standby_feedback on, right?

422
0:23:25,84 --> 0:23:32,92
So this can lead to higher bloat
and various negative consequences.
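With time-based partitions, archiving a chunk avoids that long-running transaction against the live parent. A rough sketch, with hypothetical table names (DETACH ... CONCURRENTLY is available since Postgres 14):

```sql
-- Detach one monthly partition without a long exclusive lock:
ALTER TABLE events DETACH PARTITION events_2023_01 CONCURRENTLY;

-- events_2023_01 is now a standalone table: dump it at leisure,
-- e.g. pg_dump -t events_2023_01, then verify the archive, then:
DROP TABLE events_2023_01;
```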

423
0:23:34,02 --> 0:23:37,12
Michael: I don't see it as much
anymore, but I used to get quite

424
0:23:37,12 --> 0:23:41,1
frustrated seeing partitioning
advice that mostly focused on

425
0:23:41,52 --> 0:23:45,26
like the read performance benefits
of partitioning, and the more

426
0:23:45,26 --> 0:23:48,24
I've learned about it over the
years the more I think the main

427
0:23:48,24 --> 0:23:53,68
reason to partition is for these
maintenance tasks and for health

428
0:23:53,68 --> 0:23:57,18
of the database overall. But time-based
in particular: if you've

429
0:23:57,18 --> 0:23:59,86
got an archiving requirement if
you're ever going to need to

430
0:23:59,86 --> 0:24:5,64
remove data based on time, time-based
partitioning is just a dream.

431
0:24:7,66 --> 0:24:11,5
Nikolay: Yeah, you're absolutely
right, as some LLMs say.

432
0:24:15,42 --> 0:24:20,34
So exactly like this archiving,
restoring, like dump restore,

433
0:24:21,3 --> 0:24:24,16
how vacuum works, how index creation
works.

434
0:24:25,44 --> 0:24:31,6
Even physical backups benefit from
partitioning, because updates

435
0:24:31,62 --> 0:24:35,7
and, say, deletes are localized
to specific files.

436
0:24:36,22 --> 0:24:41,64
So Postgres keeps tables and indexes
in files of up to 1 gigabyte.

437
0:24:42,04 --> 0:24:46,28
So if it's 5 terabytes, 4 of which
is a single table, it will be split

438
0:24:46,28 --> 0:24:51,1
into 1-gigabyte files. And imagine
we have a single unpartitioned

439
0:24:51,18 --> 0:24:56,98
table: all writes can come to any
of these files all the time,

440
0:24:57,1 --> 0:25:0,74
any blocks inside any pages, all
the time. Like, it's

441
0:25:0,74 --> 0:25:7,24
not localized, right? And in
this case, if you use snapshots,

442
0:25:8,14 --> 0:25:11,82
RDS snapshots, they have incremental
support, right?

443
0:25:11,82 --> 0:25:16,0
So the first snapshot is full, and
then the next snapshot is just a diff

444
0:25:16,0 --> 0:25:17,06
at block level.

445
0:25:17,6 --> 0:25:22,62
If you use WAL-G or pgBackRest,
they also have delta backups

446
0:25:22,62 --> 0:25:27,44
or incremental backups or differential
backups where also at

447
0:25:27,44 --> 0:25:30,78
block level, only changes are backed
up, right?

448
0:25:31,7 --> 0:25:35,44
So in this case, if you have a
huge table, you write constantly

449
0:25:35,5 --> 0:25:39,9
everywhere, your delta will be
bigger compared to the case when

450
0:25:39,9 --> 0:25:45,74
you write only to a specific file,
which is like our latest file,

451
0:25:45,9 --> 0:25:51,5
and only occasionally you write,
updating all data, all the records.

452
0:25:51,78 --> 0:25:56,66
In this case, these deltas will
be much more localized and this

453
0:25:56,66 --> 0:25:58,4
is also beneficial, right?

454
0:25:59,18 --> 0:26:2,78
So even physical backups will benefit
from partitioning.
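The 1-gigabyte segment files Nikolay describes are visible on disk; a quick way to see them (the table name is a placeholder, the paths are examples):

```sql
-- Path of a relation's first file, relative to the data directory:
SELECT pg_relation_filepath('big_table');
-- e.g. base/16384/16423
-- A table larger than 1 GB continues in numbered segment files:
--   base/16384/16423, base/16384/16423.1, base/16384/16423.2, ...
-- Block-level incremental backups only re-copy changed blocks,
-- so localized writes mean smaller deltas.
```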

455
0:26:3,64 --> 0:26:8,88
So general advice, yes. Our
general advice was always: if

456
0:26:8,88 --> 0:26:13,78
1 table exceeds 100 gigabytes,
it's time to partition.

457
0:26:14,06 --> 0:26:17,36
People ask how it's counted, like
do we count it with indexes,

458
0:26:17,54 --> 0:26:21,8
with TOAST, or without? Well,
it's a very, very rough number.

459
0:26:22,08 --> 0:26:26,68
So recently we started saying, okay,
if it's exceeding 50 gigabytes

460
0:26:27,44 --> 0:26:29,76
already let's think about partitioning,
right?

461
0:26:31,1 --> 0:26:32,64
Without indexes maybe.
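A rough check against that 50–100 GB rule of thumb could look like this. Note that `pg_table_size` counts heap plus TOAST but excludes indexes, while `pg_total_relation_size` includes them; the thresholds are the hosts' heuristics, not hard limits:

```sql
-- Largest tables, with and without indexes:
SELECT relname,
       pg_size_pretty(pg_table_size(oid))          AS table_size,
       pg_size_pretty(pg_total_relation_size(oid)) AS with_indexes
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_table_size(oid) DESC
LIMIT 10;
```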

462
0:26:34,08 --> 0:26:36,62
Michael: The point is don't worry
about it if your whole database

463
0:26:36,62 --> 0:26:41,62
is 1 gigabyte. But also, if your
largest table is already a terabyte,

464
0:26:42,34 --> 0:26:43,94
maybe it's a good time to

465
0:26:44,24 --> 0:26:45,04
Nikolay: It's late.

466
0:26:45,9 --> 0:26:49,44
Michael: You know what I mean,
the reason for giving a rule of

467
0:26:49,44 --> 0:26:52,24
thumb is to give people an idea of
where they stand, roughly.

468
0:26:52,24 --> 0:26:55,48
And if you're close to that number,
be thinking about it.

469
0:26:55,48 --> 0:26:59,38
Nikolay: So we have funny stories
when AI companies come to us

470
0:26:59,38 --> 0:27:2,32
for help because they grow too
fast and they need help.

471
0:27:2,32 --> 0:27:2,82
Yeah.

472
0:27:3,34 --> 0:27:4,76
They lack database expertise.

473
0:27:4,76 --> 0:27:8,0
We have a bunch of AI companies,
AI startups.

474
0:27:8,2 --> 0:27:10,24
And it was an interesting story.

475
0:27:10,24 --> 0:27:16,1
So some company reached between
10 and 20 terabytes already.

476
0:27:16,1 --> 0:27:19,1
And they have a table exceeding
a terabyte.

477
0:27:19,24 --> 0:27:23,4
And someone from our team said
they should have partitioned a year ago,

478
0:27:23,4 --> 0:27:25,58
but they said, we just launched a year
ago.

479
0:27:27,62 --> 0:27:33,68
The pace is very very high and
I think Postgres needs improvements

480
0:27:33,7 --> 0:27:38,38
in the area of partitioning to
simplify the developer experience,

481
0:27:38,4 --> 0:27:43,0
and I hope we will work in this
area next year with our tooling.

482
0:27:44,0 --> 0:27:47,46
Michael: It has got a lot better
at native partitioning.

483
0:27:48,16 --> 0:27:52,7
It feels like maybe not 18 as much,
but it feels like most releases

484
0:27:52,78 --> 0:27:56,5
before then, since it was introduced,
there were like significant

485
0:27:56,76 --> 0:27:57,26
improvements.

486
0:27:57,34 --> 0:27:58,44
Maybe there were in 18

487
0:27:58,44 --> 0:27:59,02
Nikolay: as well.

488
0:27:59,34 --> 0:28:3,0
In terms of various things, like detach
partition concurrently, in

489
0:28:3,0 --> 0:28:7,96
terms of support of online operations,
locking things, but still

490
0:28:7,96 --> 0:28:8,98
it's super painful.

491
0:28:9,06 --> 0:28:9,56
Still.

492
0:28:9,72 --> 0:28:12,66
You know, like you basically need
to allocate an

493
0:28:12,66 --> 0:28:16,66
engineer for a month or 2 to implement
it.

494
0:28:16,82 --> 0:28:21,76
And we have great guidelines, which
we give our clients, but

495
0:28:21,76 --> 0:28:22,94
there's still a lack of automation.

496
0:28:23,32 --> 0:28:26,38
Michael: You mean like to migrate
from non-partition setup to

497
0:28:26,38 --> 0:28:27,48
a partition setup?

498
0:28:28,14 --> 0:28:28,64
Nikolay: Yeah.

499
0:28:28,9 --> 0:28:32,54
The hardest part here is that everyone
is using various ORMs

500
0:28:33,08 --> 0:28:38,2
and they want this tooling to be
closer to their language.

501
0:28:39,52 --> 0:28:42,78
There is, of course, pg_partman, but still.

502
0:28:43,38 --> 0:28:43,94
Michael: Oh, okay.

503
0:28:43,94 --> 0:28:45,54
It's about the creation of new
partitions.

504
0:28:45,54 --> 0:28:46,34
Okay, interesting.

505
0:28:46,62 --> 0:28:46,92
Nikolay: Yeah.

506
0:28:46,92 --> 0:28:49,64
It's about creation of new partitions
and then maintaining them,

507
0:28:49,64 --> 0:28:54,3
because when you already have a partitioned
schema, there is amplification

508
0:28:54,6 --> 0:28:56,18
of several risks.

509
0:28:56,82 --> 0:28:59,76
1 of them is the very well-known LockManager, right?

510
0:28:59,76 --> 0:29:4,2
And I had a series of blog posts
recently, studied Postgres 18

511
0:29:4,2 --> 0:29:7,8
and deeper, also in the context
of prepared statements.

512
0:29:8,48 --> 0:29:9,44
And it was interesting.

513
0:29:9,84 --> 0:29:13,74
And another thing is that any DDL
schema changes, they might

514
0:29:13,74 --> 0:29:16,78
be really painful, right?

515
0:29:16,78 --> 0:29:21,18
Even simple index creation or dropping
index can be painful or

516
0:29:21,38 --> 0:29:22,72
foreign keys, right?

517
0:29:22,72 --> 0:29:28,02
So yeah, it requires effort every
time. Every time

518
0:29:28,02 --> 0:29:32,58
we see people implement some
nuances, and it feels like Postgres

519
0:29:32,64 --> 0:29:33,78
needs to solve it.

520
0:29:34,86 --> 0:29:35,36
Fair.

521
0:29:36,54 --> 0:29:36,7
All

522
0:29:36,7 --> 0:29:39,38
Michael: right, anything else on
the archiving front?

523
0:29:40,56 --> 0:29:45,36
Nikolay: Final thing, if you archive
something, but haven't tested,

524
0:29:46,3 --> 0:29:46,58
right?

525
0:29:46,58 --> 0:29:46,74
It's

526
0:29:46,74 --> 0:29:47,88
Michael: like backups, right?

527
0:29:48,46 --> 0:29:49,32
Nikolay: Yeah, Schrödinger.

528
0:29:49,94 --> 0:29:52,36
Michael: When we say tested, you
mean like, can it

529
0:29:52,36 --> 0:29:52,94
Nikolay: be restored?

530
0:29:52,94 --> 0:29:53,98
Michael: Can it be queried?

531
0:29:54,16 --> 0:29:57,54
And also, is it the same as the
data it was originally?

532
0:29:58,26 --> 0:30:0,36
Like, it's about verifying as well, right?

533
0:30:0,36 --> 0:30:3,02
Like, do some basic checks.
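Those basic checks could be as simple as restoring the archive into a scratch database and running sanity queries. Table and column names here are placeholders:

```sql
-- After pg_restore into a scratch database, check that the archive
-- is queryable and matches expectations from the source:
SELECT count(*) FROM events_2023_01;
SELECT min(created_at), max(created_at) FROM events_2023_01;
-- Compare these numbers with the same queries run on the source
-- before archiving; only then drop the original data.
```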

534
0:30:4,44 --> 0:30:4,94
Nikolay: Exactly.

535
0:30:4,96 --> 0:30:13,28
So unverified, unrestored backups
and archives are like undercooked.

536
0:30:14,6 --> 0:30:15,6
Michael: Yeah, great point.

537
0:30:15,6 --> 0:30:20,54
Nikolay: Just simply undercooked
and should be considered as

538
0:30:20,54 --> 0:30:24,78
a huge critical mistake. That's
it.

539
0:30:25,08 --> 0:30:26,66
Michael: huge and a very small
mistake

540
0:30:27,04 --> 0:30:34,0
Nikolay: Yeah, yeah. So if you
did something but never ran it,

541
0:30:34,0 --> 0:30:38,96
if you write code but never run
and ship it, how does it feel?

542
0:30:38,96 --> 0:30:42,62
The same with backups, you created
a backup but you haven't tested

543
0:30:43,12 --> 0:30:47,86
the archives, so it's a critical
mistake, that's it.

544
0:30:48,42 --> 0:30:49,7
Michael: Nice, good addition.

545
0:30:50,68 --> 0:30:53,6
Nikolay: This addition, yeah, it was
just almost overlooked.

546
0:30:54,92 --> 0:30:55,9
Yeah, okay?

547
0:30:56,82 --> 0:31:0,44
Michael: Nice, well, thanks again
and catch you next time.

548
0:31:1,66 --> 0:31:2,19
Nikolay: Bye bye.