1
00:00:00,060 --> 00:00:02,860
Michael: Hello and welcome to Postgres
FM, a weekly show about

2
00:00:02,860 --> 00:00:03,960
all things PostgreSQL.

3
00:00:03,960 --> 00:00:05,580
I am Michael, founder of pgMustard.

4
00:00:05,580 --> 00:00:08,100
This is my co-host, Nikolay, founder
of Postgres.AI.

5
00:00:08,260 --> 00:00:09,100
Hello, Nikolay.

6
00:00:09,140 --> 00:00:10,960
What are we talking about this
week?

7
00:00:11,260 --> 00:00:12,020
Nikolay: Hi, Michael.

8
00:00:12,040 --> 00:00:13,740
We are out of disk space.

9
00:00:13,820 --> 00:00:14,740
What to do?

10
00:00:15,560 --> 00:00:16,060
Michael: Panic.

11
00:00:17,100 --> 00:00:18,840
Nikolay: This is what Postgres does, right?

12
00:00:19,400 --> 00:00:21,400
Michael: Yeah, in some cases, right?

13
00:00:21,820 --> 00:00:24,220
Nikolay: Yeah, well, I saw it a
lot.

14
00:00:24,320 --> 00:00:26,060
I recently saw it again, actually.

15
00:00:27,040 --> 00:00:28,980
And yeah, it's not fun.

16
00:00:29,240 --> 00:00:30,040
Why panic?

17
00:00:30,060 --> 00:00:31,180
Don't panic, right?

18
00:00:32,320 --> 00:00:34,160
Michael: Keep calm and carry on,
right?

19
00:00:34,160 --> 00:00:36,720
Nikolay: Yeah, it's Postgres, you
should, yeah.

20
00:00:37,200 --> 00:00:38,800
Just use Postgres, you know.

21
00:00:39,060 --> 00:00:44,740
Yeah, so as we discussed before
this call, before recording,

22
00:00:45,480 --> 00:00:50,000
let's just talk about possible
reasons, mitigation and avoidance,

23
00:00:50,280 --> 00:00:50,780
right?

24
00:00:51,200 --> 00:00:51,700
Yeah.

25
00:00:52,080 --> 00:00:52,580
Prevention.

26
00:00:53,040 --> 00:00:54,000
Prevention is better.

27
00:00:54,000 --> 00:00:54,500
Avoidance.

28
00:00:55,120 --> 00:00:55,620
Michael: Avoidance.

29
00:00:56,400 --> 00:00:56,900
Denial.

30
00:00:57,160 --> 00:00:58,940
It's like the stages of grief.

31
00:00:59,480 --> 00:01:01,240
Nikolay: It's not my problem, actually.

32
00:01:02,180 --> 00:01:03,360
Yeah, I'm a DBA.

33
00:01:03,560 --> 00:01:05,360
It's an SRE problem, right?

34
00:01:05,660 --> 00:01:07,620
They should just give us more space.

35
00:01:08,300 --> 00:01:13,020
To start about problems, I mentioned
it before the call, but

36
00:01:13,020 --> 00:01:13,760
let's repeat.

37
00:01:14,380 --> 00:01:19,240
The only reason, the only true
reason for this, is insufficient

38
00:01:19,540 --> 00:01:20,720
disk space allocation.

39
00:01:21,260 --> 00:01:22,320
It's always so.

40
00:01:23,080 --> 00:01:27,900
And it's a joke, but we will be
returning to this joke, I'm quite

41
00:01:27,900 --> 00:01:28,400
sure.

42
00:01:28,860 --> 00:01:31,940
Because sometimes you do need more
space, that's it.

43
00:01:32,380 --> 00:01:37,600
Of course, I would deprioritize
this path and consider everything

44
00:01:37,600 --> 00:01:38,100
else.

45
00:01:38,100 --> 00:01:43,260
But sometimes we end up with this
reason and mitigation action,

46
00:01:43,260 --> 00:01:44,840
like, just add disk space.

47
00:01:45,300 --> 00:01:50,720
But let's consider this as a last
resort, right?

48
00:01:51,900 --> 00:01:53,480
Michael: Well, yeah, I like your
structure.

49
00:01:53,480 --> 00:01:58,640
So, causes, then recovery, and
then mitigation or prevention.

50
00:01:59,440 --> 00:02:00,600
Mitigation is recovery.

51
00:02:01,160 --> 00:02:02,280
Yeah, okay, great.

52
00:02:02,320 --> 00:02:03,840
Prevention being the last one.

53
00:02:03,840 --> 00:02:04,540
So, causes.

54
00:02:04,540 --> 00:02:08,420
What are the most common reasons
you see this happening to people

55
00:02:08,420 --> 00:02:09,180
in the wild?

56
00:02:09,640 --> 00:02:10,460
Nikolay: Good question.

57
00:02:12,260 --> 00:02:13,440
There are many reasons.

58
00:02:13,440 --> 00:02:14,480
I don't have statistics.

59
00:02:15,420 --> 00:02:18,420
Michael: I can tell you the one I
see most blog posts about, because

60
00:02:18,420 --> 00:02:23,040
I don't see this that often, but
the one that comes up time and

61
00:02:23,040 --> 00:02:27,680
time again in blog posts is an
open replication slot.

62
00:02:30,120 --> 00:02:31,240
Nikolay: Oh, it's too narrow.

63
00:02:31,240 --> 00:02:38,160
Let's just say a lot of WAL is written
and it remains, a lot of

64
00:02:38,160 --> 00:02:40,760
WAL files remain in pg_wal directory.

65
00:02:41,200 --> 00:02:44,820
Or pg_xlog if we talk about ancient
versions of Postgres.

66
00:02:45,060 --> 00:02:49,120
Michael: Yes, but that's the root
cause of why it's the most

67
00:02:49,120 --> 00:02:50,340
common one I've seen.

68
00:02:51,000 --> 00:02:53,860
Nikolay: I think it's a super popular
reason, but I would structure

69
00:02:53,900 --> 00:02:59,640
this like if we need to provide
a comprehensive list of reasons,

70
00:02:59,640 --> 00:03:02,720
I would put it in second place,
and in first place I

71
00:03:02,720 --> 00:03:03,380
would put...

72
00:03:03,840 --> 00:03:08,300
And not only because of slots,
slots are just one subsection here.

73
00:03:08,300 --> 00:03:09,560
The whole section is just...

74
00:03:10,080 --> 00:03:13,100
The pg_wal directory takes a lot of
disk space.

75
00:03:13,100 --> 00:03:17,220
This is the big class of reasons,
right?

76
00:03:17,340 --> 00:03:21,000
But the very first class of reasons,
I would say, anything related

77
00:03:21,000 --> 00:03:23,680
to the data directory itself.

78
00:03:24,160 --> 00:03:30,040
Like a lot of disk space consumed
by tables and indexes, right?

79
00:03:30,060 --> 00:03:34,540
It's also like a whole class, so
we have specific reasons inside

80
00:03:34,540 --> 00:03:35,040
it.

81
00:03:35,460 --> 00:03:39,160
Michael: Yeah, so we actually have
more data than the size of

82
00:03:39,160 --> 00:03:39,240
Nikolay: the disk.

83
00:03:39,240 --> 00:03:40,140
Than we expected.

84
00:03:40,200 --> 00:03:44,940
We tried to delete, but the space
is not freed, and so on.

85
00:03:45,420 --> 00:03:47,320
Michael: Lots of bloat, for example,
that we talked about.

86
00:03:47,320 --> 00:03:50,320
Nikolay: Data-related, WAL-related,
and the third class, probably,

87
00:03:50,320 --> 00:03:53,760
or category is probably anything
else.

88
00:03:55,240 --> 00:03:59,060
Anything else is tricky and interesting,
we will talk about it

89
00:03:59,060 --> 00:03:59,840
soon, right?

90
00:04:00,060 --> 00:04:01,180
So how about this classification?

91
00:04:02,560 --> 00:04:02,680
Michael: Yeah.

92
00:04:02,680 --> 00:04:05,680
Just to give people an idea
of anything else, like you

93
00:04:05,680 --> 00:04:09,860
including logs in that, like things
that can grow quickly that

94
00:04:09,860 --> 00:04:14,560
aren't data, like main data directory
stuff or WAL.

95
00:04:15,140 --> 00:04:17,920
Nikolay: The base sub-directory is
where data is stored.

96
00:04:18,180 --> 00:04:22,900
So if we think about PGDATA directory
structure, we can think

97
00:04:22,900 --> 00:04:26,900
about base, so tables and indexes,
right?

98
00:04:27,160 --> 00:04:28,600
Materialized views as well.

99
00:04:28,940 --> 00:04:35,140
And then the pg_wal sub-directory,
and it is huge, for example.

100
00:04:35,900 --> 00:04:36,760
And everything else.

101
00:04:36,760 --> 00:04:38,600
Everything else is interesting
as well.
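
As a rough illustration of this breakdown, you can size up each area from inside Postgres (a sketch, assuming Postgres 10+ and superuser or pg_monitor privileges):

    -- size of the current database (roughly the base/ contents for this DB):
    SELECT pg_size_pretty(pg_database_size(current_database()));

    -- total size of the pg_wal sub-directory:
    SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

    -- total size of the server log directory, if Postgres manages it:
    SELECT pg_size_pretty(sum(size)) FROM pg_ls_logdir();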

102
00:04:38,680 --> 00:04:45,620
Maybe also outside of PGDATA, but
let's not reveal secrets before

103
00:04:45,840 --> 00:04:46,580
it's time.

104
00:04:46,580 --> 00:04:49,380
So what about tables and indexes?

105
00:04:50,080 --> 00:04:51,040
What do you think?

106
00:04:51,040 --> 00:04:53,460
Like, possible specific reasons?

107
00:04:55,240 --> 00:05:00,820
I deleted 90% of my table, but
it didn't help.

108
00:05:01,500 --> 00:05:04,580
Disk consumption is still 99%. What
to do?

109
00:05:04,700 --> 00:05:05,940
Time to panic, right?

110
00:05:07,120 --> 00:05:11,340
So, like, obviously DELETE doesn't
delete data, right?

111
00:05:12,100 --> 00:05:15,640
It just marks it deleted, but it's
a two-stage process.
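
As a sketch of what "marked deleted" looks like in practice: the dead tuples show up in pg_stat_user_tables until vacuum processes them, and even then the file itself usually does not shrink:

    -- dead tuples left behind by DELETE, per table:
    SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;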

112
00:05:15,800 --> 00:05:19,120
Michael: Well, and this is a tricky
one, because a lot of the times

113
00:05:19,120 --> 00:05:23,540
we've discussed how to deal with
issues like this, we actually

114
00:05:23,540 --> 00:05:30,980
require quite a lot of disk space
in order to repack or to actually

115
00:05:31,180 --> 00:05:32,440
mitigate that problem.

116
00:05:33,580 --> 00:05:37,940
The easiest way, the way to do
that in an online fashion, requires

117
00:05:38,860 --> 00:05:42,320
double the size of the relation
or double the table size normally.

118
00:05:43,460 --> 00:05:45,140
So it is a tricky one in this case.

119
00:05:45,140 --> 00:05:45,640
Nikolay: Yeah.

120
00:05:45,660 --> 00:05:46,160
Exactly.

121
00:05:46,780 --> 00:05:50,940
Just a few days ago, I was discussing
this exact problem with

122
00:05:51,140 --> 00:05:52,320
one of our customers.

123
00:05:53,300 --> 00:05:57,320
And they mentioned that they have
a huge table which makes up

124
00:05:57,440 --> 00:05:59,360
90% of their database.

125
00:06:00,140 --> 00:06:03,000
And to fight bloat, they know
there is bloat.

126
00:06:03,660 --> 00:06:08,760
And to fight bloat, especially
in the heap, it's obvious that

127
00:06:08,760 --> 00:06:13,400
you need to use pg_repack, and to
use it, you need the same size.

128
00:06:13,940 --> 00:06:19,180
Like, temporarily, you need to
have two large tables of the same

129
00:06:19,180 --> 00:06:22,200
size, because it rebuilds the whole
table in the background and

130
00:06:22,200 --> 00:06:23,980
then switches to it.

131
00:06:24,380 --> 00:06:28,440
So yeah, this is a problem, and
if you don't have enough disk

132
00:06:28,440 --> 00:06:29,540
space, oops.
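
A sketch of the disk math being described here, with my_big_table as a placeholder name; pg_repack builds a full copy of the table before swapping it in, so you want at least the table's current total size free:

    -- total size of one relation, including its indexes and TOAST:
    SELECT pg_size_pretty(pg_total_relation_size('my_big_table'));

    -- compare with the whole database to see what fraction it is:
    SELECT pg_size_pretty(pg_database_size(current_database()));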

133
00:06:30,400 --> 00:06:33,900
But yeah, so I would say this is
an edge case.

134
00:06:33,900 --> 00:06:38,800
Normally we, there's no single
large table, which is like 90%

135
00:06:38,940 --> 00:06:40,420
of the whole database, right?

136
00:06:40,440 --> 00:06:42,240
So normally it's not a big problem.

137
00:06:42,280 --> 00:06:47,980
And if we keep free disk space
30, 40%, which should be so like

138
00:06:47,980 --> 00:06:53,080
at least 25%, maybe, we have this
space and we know that it's

139
00:06:53,080 --> 00:06:53,820
just temporary.

140
00:06:53,860 --> 00:06:58,980
We like just when we're repacking
the heap, the table, right?

141
00:07:00,280 --> 00:07:01,280
Not a problem usually.

142
00:07:01,280 --> 00:07:04,780
But if you're in this situation,
well, not good.

143
00:07:05,680 --> 00:07:08,740
Maybe you should do something else.

144
00:07:09,060 --> 00:07:11,080
There are alternatives to pg_repack.

145
00:07:11,960 --> 00:07:18,420
I think the pg_squeeze alternative from
Cybertec also requires, also

146
00:07:18,940 --> 00:07:19,940
has this issue.

147
00:07:20,080 --> 00:07:23,300
It requires additional disk space
because it rebuilds the table.

148
00:07:23,680 --> 00:07:29,020
But older solutions, like in this
particular edge case, I would

149
00:07:29,020 --> 00:07:33,360
think I would return to solutions
which exist for quite some

150
00:07:33,360 --> 00:07:38,480
time, several implementations of
not "let's rebuild the table", but

151
00:07:38,480 --> 00:07:39,340
another idea.

152
00:07:39,480 --> 00:07:41,260
It's called pgcompacttable.

153
00:07:41,980 --> 00:07:46,060
Again, there are probably three attempts
by different people to

154
00:07:46,060 --> 00:07:50,620
implement this idea and all of
them I think are in Perl.

155
00:07:51,500 --> 00:07:56,680
But the idea is let's issue some
updates which don't change data.

156
00:07:57,660 --> 00:08:02,420
When you issue an update even saying
update blah blah set id

157
00:08:02,420 --> 00:08:06,180
equals id where id equals some
number, you know a new tuple is

158
00:08:06,180 --> 00:08:06,680
created.

159
00:08:07,200 --> 00:08:08,160
Always, right?

160
00:08:08,300 --> 00:08:12,220
So this tool is issuing these updates
in a smart way.

161
00:08:12,340 --> 00:08:16,920
It checks which tuples are at the
end of the table, and it knows there

162
00:08:16,920 --> 00:08:21,100
is bloat, so there is empty space
in the first pages, for example.

163
00:08:21,540 --> 00:08:25,760
So if we update tuples which sit
at the end of the table, they will

164
00:08:25,760 --> 00:08:27,880
move to the beginning of the table.

165
00:08:28,520 --> 00:08:35,240
And once there are no tuples left
in the final page, final block,

166
00:08:35,860 --> 00:08:36,840
it will be truncated.

167
00:08:37,220 --> 00:08:41,940
By default, this is what vacuum
does, unless it's turned off.

168
00:08:43,180 --> 00:08:48,480
And then we do the same with next,
from the end, the next page,

169
00:08:48,480 --> 00:08:49,700
and so on and so on.
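
A minimal sketch of this idea in plain SQL, assuming a table my_table with a primary key id (the real tools choose rows and pace themselves much more carefully):

    -- see which rows sit in the last pages; ctid is (block number, offset):
    SELECT ctid, id FROM my_table ORDER BY ctid DESC LIMIT 100;

    -- a no-op update: the data doesn't change, but a new tuple version is
    -- written, ideally into free space earlier in the table:
    UPDATE my_table SET id = id WHERE id = 12345;

    -- once the trailing pages are empty, vacuum can truncate them:
    VACUUM my_table;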

170
00:08:49,700 --> 00:08:54,520
Like basically, this is similar
to like, I remember in Windows,

171
00:08:54,520 --> 00:08:55,840
like defragmentation, right?

172
00:08:55,840 --> 00:08:58,760
So it's moving data to different
places.

173
00:08:59,060 --> 00:09:00,780
Michael: Remember that defrag screen
on Windows?

174
00:09:00,780 --> 00:09:01,100
Nikolay: Yeah, yeah.

175
00:09:01,100 --> 00:09:01,875
People love that.

176
00:09:01,875 --> 00:09:02,640
People visualize it.

177
00:09:02,640 --> 00:09:03,580
Yeah, yeah, yeah.

178
00:09:04,080 --> 00:09:06,600
Maybe like red and blue, right?

179
00:09:07,080 --> 00:09:08,300
So, yeah.

180
00:09:08,560 --> 00:09:13,040
And this is a much slower process
than using pg_repack for a table.

181
00:09:13,580 --> 00:09:14,080
But...

182
00:09:15,060 --> 00:09:21,460
Michael: Also, in our case, if
we notice we've got 99% disk full,

183
00:09:21,460 --> 00:09:22,940
maybe this kind of thing helps.

184
00:09:23,200 --> 00:09:27,260
But if we're actually out of disk
space, even this isn't

185
00:09:27,260 --> 00:09:27,940
Nikolay: an option.

186
00:09:28,260 --> 00:09:29,180
Yeah, yeah, yeah.

187
00:09:30,060 --> 00:09:33,980
I'm talking about mitigation here,
maybe too early, but of course,

188
00:09:34,600 --> 00:09:40,520
if we already reached 100%, we
cannot move tuples, so it's not what I

189
00:09:40,520 --> 00:09:41,180
would do.

190
00:09:41,680 --> 00:09:45,020
But you just raised a very good
point about extra disk space

191
00:09:45,020 --> 00:09:49,040
needed, and I just remembered there
are alternative solutions and

192
00:09:49,040 --> 00:09:53,040
I think they don't require any
extensions so they could be used

193
00:09:54,180 --> 00:09:54,680
anywhere.

194
00:09:54,960 --> 00:09:55,680
True, true.

195
00:09:55,680 --> 00:09:57,760
Like RDS, Cloud SQL, anything.

196
00:09:58,660 --> 00:09:59,940
It's just simple updates.

197
00:10:00,480 --> 00:10:06,060
Super slow, but not requiring extra
resources, which is great.

198
00:10:06,380 --> 00:10:10,200
Also probably you will need to
make sure the tool is working

199
00:10:10,200 --> 00:10:12,740
well with the current version of Postgres
because they are quite

200
00:10:12,740 --> 00:10:15,140
old, and again, there are three of
them.

201
00:10:15,140 --> 00:10:18,300
I think Depesz also tried to implement
one of them.

202
00:10:19,660 --> 00:10:21,460
Michael: That makes sense that
it's in Perl then.

203
00:10:21,460 --> 00:10:23,960
Nikolay: This is a small zoo of
ancient tools.

204
00:10:24,240 --> 00:10:25,960
And I like the idea, actually.

205
00:10:26,940 --> 00:10:29,440
But it's not an emergency tool.

206
00:10:29,840 --> 00:10:33,480
It's right before an emergency tool.

207
00:10:34,540 --> 00:10:36,000
Trying to avoid it.

208
00:10:36,140 --> 00:10:38,800
So yeah, this, right?

209
00:10:39,060 --> 00:10:43,680
But as for indexes, probably yes,
when we rebuild an index concurrently,

210
00:10:45,360 --> 00:10:47,140
like Postgres builds another index.

211
00:10:47,420 --> 00:10:51,980
But I think this should not be
super noticeable because no index

212
00:10:52,120 --> 00:10:52,620
can...

213
00:10:53,860 --> 00:10:57,520
Yes, it requires extra disk space,
but we should have it if we

214
00:10:57,520 --> 00:10:58,360
do it, right?
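
For reference, the online rebuild looks like this (the index name is hypothetical; REINDEX ... CONCURRENTLY exists since Postgres 12, and it temporarily needs extra space for the second copy of the index):

    REINDEX INDEX CONCURRENTLY my_index_name;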

215
00:10:58,360 --> 00:11:02,840
So anyway, if we delete, we should
think about, okay, dead tuples

216
00:11:02,840 --> 00:11:03,580
and bloat.

217
00:11:04,340 --> 00:11:05,420
Keep it in mind.

218
00:11:06,180 --> 00:11:11,100
Try to avoid massive deletes or
design them properly with MVCC

219
00:11:11,260 --> 00:11:13,680
and vacuum behavior in mind.

220
00:11:14,120 --> 00:11:16,030
Michael: We did a whole episode
on that, I think.

221
00:11:16,030 --> 00:11:17,020
Nikolay: Right, exactly.

222
00:11:18,820 --> 00:11:23,760
So yeah, I saw crazy cases when
some tables and indexes were

223
00:11:23,760 --> 00:11:24,880
bloated like 99%.

224
00:11:26,100 --> 00:11:26,600
Michael: Yeah.

225
00:11:26,940 --> 00:11:29,320
And you saw these cause out of
disk issues.

226
00:11:30,140 --> 00:11:31,820
Nikolay: Well, they contributed
a lot.

227
00:11:31,820 --> 00:11:37,220
And if we keep it as is, we will
soon be out of disk space.

228
00:11:38,440 --> 00:11:42,940
So it's definitely worth keeping
an eye on the situation in this

229
00:11:42,940 --> 00:11:46,320
area, bloat and vacuum behavior.

230
00:11:47,540 --> 00:11:51,960
And yeah, if you don't do it, or
do it very infrequently, then

231
00:11:51,960 --> 00:11:57,160
once in 5 years you take care
of it and then you realize

232
00:11:57,180 --> 00:12:03,180
out of 10 terabytes of disk space
you had, now it's below 1 terabyte.

233
00:12:03,520 --> 00:12:06,640
And what do you do with the other 9
terabytes?

234
00:12:06,760 --> 00:12:09,680
You keep paying the cloud provider,
right?

235
00:12:09,680 --> 00:12:14,560
Because it's easy to add disk space,
but it's not easy to reduce

236
00:12:14,680 --> 00:12:18,900
disk space because they don't have
such a function and it's possible

237
00:12:18,900 --> 00:12:25,240
only if you create a new standby
cluster with a smaller disk and

238
00:12:25,240 --> 00:12:28,520
then just switch to it, switch
over, right?

239
00:12:29,060 --> 00:12:32,220
Michael: Yeah, but yeah, it's quite
funny seeing that message,

240
00:12:32,220 --> 00:12:35,360
you know, this is a one-way ticket.

241
00:12:35,920 --> 00:12:37,160
Nikolay: One-way ticket, yeah.

242
00:12:37,900 --> 00:12:39,800
Michael: So I think those make
sense, right?

243
00:12:39,800 --> 00:12:43,740
Normal data exceeding the size
of the disk, whether that's actual

244
00:12:43,740 --> 00:12:46,680
data or whether that's bloat, makes
perfect sense.

245
00:12:46,680 --> 00:12:48,420
Do you want to talk any more about-

246
00:12:48,480 --> 00:12:49,180
Nikolay: Of course.

247
00:12:49,280 --> 00:12:54,320
Actually, ever since VLDB in Los
Angeles, I think in 2018, I still believe

248
00:12:54,320 --> 00:12:59,980
in the keynote there.

249
00:13:00,040 --> 00:13:06,520
I liked this keynote: data
deletion is a huge, big problem

250
00:13:07,300 --> 00:13:11,140
we engineers need to solve and
find better solutions for.

251
00:13:11,400 --> 00:13:16,900
So sometimes it's not about bloat,
it's about a lot of garbage,

252
00:13:17,860 --> 00:13:19,400
duplicated data also.

253
00:13:20,220 --> 00:13:23,200
And I saw many times when we start
paying attention to bloat

254
00:13:23,680 --> 00:13:27,340
and pushing customers to fight
with it, suddenly they say, you

255
00:13:27,340 --> 00:13:28,060
know what?

256
00:13:28,940 --> 00:13:30,880
I had it like maybe a month ago.

257
00:13:31,720 --> 00:13:32,300
You know what?

258
00:13:32,300 --> 00:13:33,740
We don't need this table.

259
00:13:34,120 --> 00:13:38,000
Let's just not spend time and just
drop it.

260
00:13:38,000 --> 00:13:38,200
Drop it.

261
00:13:38,200 --> 00:13:38,860
Drop this table.

262
00:13:38,860 --> 00:13:39,600
Or truncate it.

263
00:13:39,600 --> 00:13:40,680
We don't need the data.

264
00:13:40,680 --> 00:13:43,080
And it's like, wow, some terabytes
are free.

265
00:13:43,080 --> 00:13:43,580
So...

266
00:13:44,340 --> 00:13:44,840
Michael: Yeah.

267
00:13:45,320 --> 00:13:46,080
I see this.

268
00:13:46,080 --> 00:13:49,540
I see this with like analytics
data, for example, people tracking

269
00:13:50,200 --> 00:13:50,700
everything.

270
00:13:51,240 --> 00:13:53,940
Like they don't know, they never
look at their analytics and

271
00:13:53,940 --> 00:13:56,960
never look at who's using which
feature, but they track everything

272
00:13:56,960 --> 00:13:58,520
because we might need it someday.

273
00:13:58,520 --> 00:14:01,640
Or, you know, like that, there's
all these data use cases, and...

274
00:14:01,640 --> 00:14:05,280
And even the solution to this,
like, in terms of...

275
00:14:05,740 --> 00:14:07,500
I've heard the phrase retention
policy.

276
00:14:07,500 --> 00:14:10,600
I know it's not quite the same
topic, but talking about how long

277
00:14:10,600 --> 00:14:13,980
we should retain data for, even
there we don't even mention deletion.

278
00:14:13,980 --> 00:14:16,960
It's not a deletion policy, it's
a retention policy, which is quite

279
00:14:16,960 --> 00:14:17,460
funny.

280
00:14:17,780 --> 00:14:21,820
Nikolay: Also, maybe one of the last
things in this section I would

281
00:14:21,820 --> 00:14:27,120
mention is it's becoming more popular
instead of deleting old

282
00:14:27,120 --> 00:14:34,200
data, to have some better approach:
tiering, tiers of data storage.

283
00:14:35,020 --> 00:14:38,340
And I know from Tembo there is
an extension called

284
00:14:38,340 --> 00:14:38,840
pg_tier.

285
00:14:38,960 --> 00:14:41,900
First of all, Timescale has this,
right?

286
00:14:41,900 --> 00:14:44,700
But in cloud only, I think it's
not open source.

287
00:14:45,960 --> 00:14:47,720
If I'm not wrong.

288
00:14:48,580 --> 00:14:54,680
And so the idea is: let's consider
some data archived, and it still

289
00:14:54,860 --> 00:14:57,740
looks like it's present in Postgres,
but it's not stored on an

290
00:14:57,740 --> 00:15:01,300
expensive disk which also has capacity
limits.

291
00:15:01,840 --> 00:15:05,880
We offload it to object storage,
like S3 on AWS.

292
00:15:06,980 --> 00:15:12,660
And this is a great idea, actually,
but I think some latency

293
00:15:12,660 --> 00:15:17,320
issues will appear, and also maybe
some errors sometimes, inconsistencies

294
00:15:17,980 --> 00:15:18,720
or so.

295
00:15:19,280 --> 00:15:20,240
Like, it's interesting.

296
00:15:20,540 --> 00:15:21,600
But it's a great idea.

297
00:15:21,600 --> 00:15:26,720
And again, like I said, Tembo has
the pg_tier extension, which

298
00:15:26,720 --> 00:15:32,840
I think I hope to test someday
when I or my team have an opportunity

299
00:15:32,860 --> 00:15:33,820
and time for it.

300
00:15:33,820 --> 00:15:36,100
It's super interesting to check
how it works.

301
00:15:37,200 --> 00:15:38,760
And I think it's a great idea.

302
00:15:38,760 --> 00:15:43,400
Like, if you have some huge tables,
maybe already partitioned,

303
00:15:44,060 --> 00:15:49,200
you can go either with sharding,
or if some historical data is

304
00:15:49,200 --> 00:15:52,760
not needed all the time, you can
just offload it to object storage,

305
00:15:52,760 --> 00:15:54,440
which is like virtually infinite.

306
00:15:55,020 --> 00:16:00,600
And then, okay, users experience
some worse latency when reading

307
00:16:00,600 --> 00:16:01,660
this old data.

308
00:16:02,120 --> 00:16:03,280
Imagine e-commerce, for example.

309
00:16:03,340 --> 00:16:04,100
Michael: Just quickly.

310
00:16:04,660 --> 00:16:08,400
I feel like a kind of old-fashioned

311
00:16:08,440 --> 00:16:10,420
solution to this was tablespaces.

312
00:16:11,200 --> 00:16:17,360
Nikolay: Yeah, but tablespaces
in cloud reality became not popular

313
00:16:17,360 --> 00:16:18,020
at all.

314
00:16:18,220 --> 00:16:22,800
The last time I used tablespaces in
a serious project was more than

315
00:16:22,800 --> 00:16:23,940
10 years ago, I think.

316
00:16:23,940 --> 00:16:25,580
Michael: Yeah, but that's what
I mean.

317
00:16:25,760 --> 00:16:28,600
We had this before, it's just it
had a different name.

318
00:16:30,180 --> 00:16:31,100
Nikolay: Maybe, yeah.

319
00:16:31,500 --> 00:16:35,500
Tablespaces, yeah, you can have
a cheaper disk attached to your

320
00:16:35,500 --> 00:16:36,000
machine.

321
00:16:36,240 --> 00:16:40,140
But I still don't like this particular
approach because I like

322
00:16:40,520 --> 00:16:42,880
offloading to object storage more
than that.

323
00:16:42,880 --> 00:16:46,300
Because imagine you have primary
and multiple standby nodes.

324
00:16:47,360 --> 00:16:52,700
If you use tablespaces and cheaper
disks, you are forced to

325
00:16:52,700 --> 00:16:57,500
use the same structure of disks
on each node and it becomes still

326
00:16:57,500 --> 00:17:01,660
expensive and slower, for example,
right?

327
00:17:01,780 --> 00:17:06,340
While object storage, it's worse
in terms of uptime compared

328
00:17:06,340 --> 00:17:08,740
to EBS volumes on AWS, for example.

329
00:17:08,940 --> 00:17:13,140
But it's much better in terms of
reliability or vice versa.

330
00:17:14,380 --> 00:17:16,020
Michael: Durability or other things.

331
00:17:16,020 --> 00:17:18,840
Nikolay: If you check, yeah, availability
and durability, if

332
00:17:18,840 --> 00:17:23,380
you check S3's SLAs... I forget.

333
00:17:24,060 --> 00:17:27,660
So if you check characteristics,
you think, okay, this is what

334
00:17:27,660 --> 00:17:31,060
I actually would like to have for
archived data.

335
00:17:31,260 --> 00:17:34,880
Maybe latency will be slower, worse,
right?

336
00:17:35,320 --> 00:17:38,940
But if you think about e-commerce
and order history, for example,

337
00:17:38,940 --> 00:17:44,440
if you run a large e-commerce
website, if you have the ability

338
00:17:44,540 --> 00:17:52,400
to offload old activities, and
users touch them very rarely,

339
00:17:52,960 --> 00:17:58,380
but it's still good to have the
ability to see the old order

340
00:17:58,380 --> 00:17:58,880
history.

341
00:18:00,060 --> 00:18:03,220
But it's not needed often, so we
can keep it outside of Postgres

342
00:18:03,220 --> 00:18:05,220
and evict from caches and so on.

343
00:18:05,320 --> 00:18:08,980
And don't pay for our main cluster,
we just pay for S3.

344
00:18:09,280 --> 00:18:11,620
There we also can have tiering,
right?

345
00:18:12,620 --> 00:18:13,120
Usually.

346
00:18:13,520 --> 00:18:14,360
And even automated.

347
00:18:15,060 --> 00:18:19,060
Usually providers have these features,
like if it's older than

348
00:18:19,060 --> 00:18:23,900
like one month, it goes to some colder
and cheaper space.

349
00:18:24,320 --> 00:18:31,560
So I think this direction will
receive some more popularity and

350
00:18:31,560 --> 00:18:33,920
will be better developed and so
on.

351
00:18:34,140 --> 00:18:35,720
Tiering for storage.

352
00:18:37,480 --> 00:18:40,300
Michael: Once again, we've been
sucked into kind of like prevention.

353
00:18:41,720 --> 00:18:44,960
Nikolay: I think yes, it's hard
to talk about reasons and then

354
00:18:44,960 --> 00:18:49,420
yeah, like this is a big flaw of
the structure I proposed.

355
00:18:50,440 --> 00:18:54,020
Michael: Well, no, I like it still,
but it feels like we haven't

356
00:18:54,060 --> 00:18:55,460
really touched on...

357
00:18:56,060 --> 00:18:57,720
Nikolay: The most popular, let's
talk about.

358
00:18:57,720 --> 00:18:58,220
Yeah.

359
00:18:59,160 --> 00:18:59,760
WAL, right?

360
00:18:59,760 --> 00:19:00,600
We have some...

361
00:19:00,700 --> 00:19:02,120
So, pg_wal is huge.

362
00:19:02,800 --> 00:19:08,160
We identified it, which is actually
already some good skill.

363
00:19:08,600 --> 00:19:12,500
Not everyone can do that, but if
you manage to identify that

364
00:19:12,500 --> 00:19:16,560
your pg_wal data is huge, there
are a few certain reasons.
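
Identifying it from SQL might look like this (pg_ls_waldir needs superuser or pg_monitor):

    -- how many WAL files are being kept and how much space they take:
    SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS total_size
    FROM pg_ls_waldir();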

365
00:19:16,560 --> 00:19:20,140
And there is a good article from
Cybertec about this.

366
00:19:20,860 --> 00:19:23,800
Why doesn't Postgres delete old
files

367
00:19:24,340 --> 00:19:25,444
when it's already time to remove them?

368
00:19:25,444 --> 00:19:26,760
Michael: Ah, yeah, the various
reasons.

369
00:19:27,440 --> 00:19:31,160
Nikolay: This is exactly when we
need to apply this article as

370
00:19:32,100 --> 00:19:35,560
the list of possible reasons and
just exclude one of them until

371
00:19:35,560 --> 00:19:39,060
we find our case, our real reason.

372
00:19:39,380 --> 00:19:45,860
So one of them you mentioned: some
replication slot, logical or

373
00:19:45,860 --> 00:19:49,200
physical, not progressing and accumulating
a lot of changes.

374
00:19:49,200 --> 00:19:51,800
Well, it's not exactly accumulating
in the slot itself.

375
00:19:52,200 --> 00:19:53,980
Postgres doesn't write to slots.

376
00:19:54,620 --> 00:19:57,540
It writes to pg_wal, a lot of WAL files,
right?

377
00:19:57,540 --> 00:20:02,820
But a slot has a position, and if it's
inactive or the position is frozen,

378
00:20:03,840 --> 00:20:07,380
consumers don't consume and don't
shift this position.

379
00:20:07,700 --> 00:20:12,220
It means Postgres must keep those
WAL files in the pg_wal directory

380
00:20:13,260 --> 00:20:14,980
until it finishes.
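
A hedged example of checking this: diffing each slot's restart_lsn against the current WAL position shows roughly how much WAL the slot is pinning:

    -- inactive slots with large retained_wal are the usual suspects:
    SELECT slot_name, slot_type, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
             AS retained_wal
    FROM pg_replication_slots
    ORDER BY restart_lsn;

    -- last resort, once you're sure the consumer is gone:
    -- SELECT pg_drop_replication_slot('some_slot_name');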

381
00:20:15,180 --> 00:20:19,440
And since recently, a few years
ago, Postgres, I think maybe

382
00:20:19,960 --> 00:20:26,440
13 or 14, received a setting to
limit this, to have a threshold

383
00:20:26,680 --> 00:20:31,560
when we give up and say, it's
better to kill our slot, to destroy

384
00:20:31,560 --> 00:20:39,720
our slot, but let's stop this situation
so WAL files can be deleted

385
00:20:39,720 --> 00:20:40,220
already.
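
The setting is max_slot_wal_keep_size, added in Postgres 13; a sketch of using it (the 100GB value is only an example, to be sized against your disk):

    -- cap how much WAL a slot may retain; -1, the default, means unlimited:
    ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
    SELECT pg_reload_conf();

    -- slots that went over the limit are reported as 'lost':
    SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;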

386
00:20:40,240 --> 00:20:45,980
And I remember this big fear in
a good engineer I worked with,

387
00:20:46,080 --> 00:20:50,380
he was not a Postgres person, more
an SRE, but with a lot of Postgres

388
00:20:50,380 --> 00:20:50,740
experience.

389
00:20:50,740 --> 00:20:55,680
And when, it was long ago, when
initially replication slots for

390
00:20:55,680 --> 00:21:00,100
physical replication, were introduced,
we had basically a

391
00:21:00,100 --> 00:21:00,620
small fight.

392
00:21:00,620 --> 00:21:02,700
Like I said, let's use it, a great
feature.

393
00:21:02,720 --> 00:21:06,020
He said, no, no, no, I'm not going
to use it.

394
00:21:06,020 --> 00:21:07,920
He said, this is super dangerous.

395
00:21:08,200 --> 00:21:13,080
And he understood that the danger
is if some replica is somehow

396
00:21:14,280 --> 00:21:17,400
stuck, the slot is not progressing,
the primary is out of disk

397
00:21:17,400 --> 00:21:20,400
space, and this is the last thing
he wanted.

398
00:21:21,740 --> 00:21:23,680
After lost backups, of course.

399
00:21:26,320 --> 00:21:30,320
But over time, we started using
slots, but then this setting,

400
00:21:30,320 --> 00:21:34,240
I think it's a great setting, you
can understand your disk layout

401
00:21:34,740 --> 00:21:39,300
and how much free disk space you
can afford.

402
00:21:39,900 --> 00:21:42,640
If you approach 90%, it's time
to kill slots.

403
00:21:42,640 --> 00:21:46,400
So you can do some math and understand
that the maximum number

404
00:21:46,400 --> 00:21:51,100
of gigabytes you can allow for
a lag is this.

405
00:21:51,420 --> 00:21:55,280
Of course, over time, the database
grows and still this...

406
00:21:55,380 --> 00:21:57,180
It still can be a problem, right?

407
00:21:57,180 --> 00:22:03,480
Because if data grows, tables and
indexes grow, and probably

408
00:22:03,480 --> 00:22:07,000
your setting will not save you
from an emergency, right?

409
00:22:07,700 --> 00:22:08,200
Maybe.

410
00:22:08,480 --> 00:22:10,580
But this is definitely one of the
reasons.

411
00:22:10,840 --> 00:22:11,380
What else?

412
00:22:11,380 --> 00:22:12,080
What reasons?

413
00:22:13,080 --> 00:22:15,640
Because of a failing archive_command,
right?

414
00:22:15,720 --> 00:22:19,120
Because if the archive_command is

415
00:22:19,120 --> 00:22:21,760
configured, for Postgres it's important
to archive.

416
00:22:22,360 --> 00:22:25,440
And it can be failing for any
reason.

417
00:22:26,040 --> 00:22:30,060
Postgres cannot archive, so if
it cannot archive, it cannot remove

418
00:22:30,060 --> 00:22:30,800
these WAL files.

419
00:22:31,720 --> 00:22:33,060
And that's a problem.

420
00:22:33,340 --> 00:22:37,320
So you should monitor archiving
command lag separately.
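
One way to watch it, as a sketch, is the pg_stat_archiver view:

    -- a growing failed_count or a stale last_archived_time means WAL
    -- is piling up in pg_wal:
    SELECT archived_count, failed_count,
           last_archived_wal, last_archived_time,
           last_failed_wal, last_failed_time
    FROM pg_stat_archiver;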

421
00:22:37,960 --> 00:22:42,140
And if it's growing, it's very
bad for backups already.

422
00:22:42,840 --> 00:22:47,360
But it's also bad because it can
hit you in terms of disk space.

423
00:22:48,460 --> 00:22:53,080
So yeah, these processes, replication
and archiving of WAL

424
00:22:53,100 --> 00:22:57,940
are two processes that, if something
is wrong with them, Postgres

425
00:22:57,940 --> 00:22:59,560
cannot remove WAL files.

426
00:23:00,900 --> 00:23:01,740
Another reason actually...

427
00:23:01,740 --> 00:23:03,400
Michael: And by the way, this can
grow quickly.

428
00:23:03,420 --> 00:23:06,420
I think people don't always realize
how quickly this can grow.

429
00:23:06,420 --> 00:23:06,920
Yeah.

430
00:23:07,840 --> 00:23:08,620
Some database.

431
00:23:09,440 --> 00:23:12,040
Nikolay: Yeah, it can be a small
but very active database, and

432
00:23:12,040 --> 00:23:14,940
you can have terabytes of WAL
generated per day.

433
00:23:15,040 --> 00:23:18,240
It's actually good to know how
much you generate per day or per

434
00:23:18,240 --> 00:23:19,460
hour, or per second.
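
A rough way to measure that: note the WAL position, wait, and diff (the LSN literal below is a placeholder for whatever the first query returned):

    -- step 1, note the current WAL position:
    SELECT pg_current_wal_lsn();

    -- step 2, some time later, bytes of WAL generated since then:
    SELECT pg_size_pretty(
      pg_wal_lsn_diff(pg_current_wal_lsn(), '0/1A2B3C4D')
    );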

435
00:23:20,140 --> 00:23:22,860
Michael: So I think even with like
a low activity, as long as

436
00:23:22,860 --> 00:23:26,060
something's happening every time
there's a checkpoint, I've seen

437
00:23:26,060 --> 00:23:30,180
people with toy databases like,
you know, like free tier RDS,

438
00:23:30,360 --> 00:23:33,020
be surprised that it's generating
like a couple of gigabytes

439
00:23:33,040 --> 00:23:37,840
a day, like 64 megabytes per 5
minutes or so.

440
00:23:38,480 --> 00:23:41,500
Adds up quite quickly with small...

441
00:23:42,380 --> 00:23:46,080
Nikolay: A funny reason I can also
mention: you have not a huge

442
00:23:46,080 --> 00:23:49,600
database, but also quite active,
and you think, oh, it's time

443
00:23:49,600 --> 00:23:50,640
for checkpoint tuning.

444
00:23:50,640 --> 00:23:52,580
We had an episode about it, checkpoint
tuning.

445
00:23:52,580 --> 00:23:55,360
Let's raise max_wal_size and
checkpoint_timeout.

446
00:23:55,640 --> 00:24:00,340
You raise it, but you didn't do
proper math in terms of how much

447
00:24:00,340 --> 00:24:06,500
disk space you have, and end up
having too big, normally too

448
00:24:06,500 --> 00:24:07,400
big, pg_wal.

449
00:24:07,760 --> 00:24:11,140
Because distance between checkpoints
increases, and Postgres

450
00:24:11,180 --> 00:24:12,940
needs to keep more WAL files.

451
00:24:14,540 --> 00:24:17,060
And this can be just a mistake
of configuration.
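
A sketch of doing that math before raising anything (the values here are illustrative, not recommendations):

    SHOW max_wal_size;        -- pg_wal can hold roughly this much between checkpoints
    SHOW checkpoint_timeout;

    -- raising these increases the distance between checkpoints, and with it
    -- the expected pg_wal size; check it still fits the disk first:
    ALTER SYSTEM SET max_wal_size = '16GB';
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    SELECT pg_reload_conf();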

452
00:24:18,640 --> 00:24:23,800
So yeah, after this, I think we
can move on and let's talk about

453
00:24:23,800 --> 00:24:24,560
other reasons.

454
00:24:25,760 --> 00:24:27,320
Michael: Yeah, what else do you
see?

455
00:24:27,320 --> 00:24:31,060
Nikolay: I would put in first
place in this category the log

456
00:24:31,060 --> 00:24:37,580
directory, especially if it's inside
PGDATA, inside this main

457
00:24:37,580 --> 00:24:38,540
Postgres directory.

458
00:24:39,660 --> 00:24:41,080
Not especially, like, no.

459
00:24:41,200 --> 00:24:46,780
Especially if log directory where
Postgres writes logs, if it's

460
00:24:46,780 --> 00:24:50,740
on the same drive as the data directory.

461
00:24:51,660 --> 00:24:53,080
For example, inside PGDATA.

462
00:24:53,480 --> 00:24:55,740
Or just next to it but on the same
drive.

463
00:24:56,100 --> 00:25:00,060
If logs are suddenly, like we have
some performance degradation

464
00:25:00,060 --> 00:25:03,360
and log_min_duration_statement
is configured or auto_explain.

465
00:25:03,780 --> 00:25:07,040
Or, for example, we decided, oh,
we need more audit, and pgAudit

466
00:25:07,120 --> 00:25:08,140
is a great tool.

467
00:25:08,860 --> 00:25:10,220
Right, let's bring it.

468
00:25:10,440 --> 00:25:12,640
It starts writing a lot to logs.

469
00:25:12,880 --> 00:25:18,920
And if you didn't separate your
data from logs, you should separate.
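
A quick way to see how much the Postgres-managed log directory currently holds (superuser or pg_monitor):

    SELECT count(*) AS log_files, pg_size_pretty(sum(size)) AS total_size
    FROM pg_ls_logdir();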

470
00:25:19,220 --> 00:25:23,460
I'm not talking about pg_wal, I'm
not a big fan of having a separate

471
00:25:23,460 --> 00:25:25,340
disk for pg_wal, actually.

472
00:25:27,100 --> 00:25:33,200
It was a thing in the past, but
recently I don't see benefits

473
00:25:33,260 --> 00:25:34,560
from doing this often.

474
00:25:34,600 --> 00:25:36,420
Benchmarks don't prove it and so
on.

475
00:25:36,500 --> 00:25:38,420
So, it depends, of course.

476
00:25:38,420 --> 00:25:42,480
It depends if it's your own data
center, maybe it's worth doing

477
00:25:42,480 --> 00:25:42,980
this.

478
00:25:43,080 --> 00:25:46,920
But if you don't offload logs,
regular logs, and keep it on the

479
00:25:46,920 --> 00:25:51,680
same disk, then you bring pgAudit,
and it suddenly starts writing gigabytes

480
00:25:51,780 --> 00:25:59,480
per hour, or maybe some problem
with rotation of logs.

481
00:25:59,640 --> 00:26:00,140
Michael: Yeah.

482
00:26:00,900 --> 00:26:02,640
Nikolay: Like rotation or retention.

483
00:26:02,980 --> 00:26:06,100
So they are not deleted properly,
for example.

484
00:26:06,380 --> 00:26:09,620
And then this can hit you badly
and you have a panic, actually, for

485
00:26:09,620 --> 00:26:11,740
your database, and your SELECTs
are not working.

486
00:26:12,100 --> 00:26:16,320
But if you've offloaded to a different
disk, and at least with

487
00:26:16,320 --> 00:26:20,080
logging collector enabled, I know
that the database actually will be

488
00:26:20,080 --> 00:26:21,320
working fully.

489
00:26:21,740 --> 00:26:22,440
That's great.

490
00:26:22,440 --> 00:26:26,080
Actually, recently, maybe a few
months ago, I had this experience.

491
00:26:27,560 --> 00:26:31,360
I got used to it, Postgres is fully
down if we are out of disk

492
00:26:31,360 --> 00:26:36,000
space, and I expected the same
behavior when our log drive is

493
00:26:36,000 --> 00:26:36,500
full.

494
00:26:37,200 --> 00:26:40,360
If we are flooded, if it's full,
I expected big problems.

495
00:26:40,440 --> 00:26:42,000
Postgres almost didn't notice.

496
00:26:43,360 --> 00:26:45,760
So okay, we cannot log, but we
keep working.

497
00:26:45,760 --> 00:26:46,940
That's great, actually.

498
00:26:47,640 --> 00:26:52,580
So this is a super big benefit of
having a different drive for regular

499
00:26:52,580 --> 00:26:53,540
Postgres logs.

500
00:26:54,280 --> 00:26:58,400
And even if it's slow, it can affect
performance, right?

501
00:26:58,440 --> 00:27:02,200
But if it's 100% full, not that
bad.

502
00:27:03,400 --> 00:27:06,560
Michael: Yeah, we've talked several
times about excessive logging

503
00:27:06,560 --> 00:27:08,040
having an impact on performance.

504
00:27:08,560 --> 00:27:09,520
Nikolay: Observer effect.

505
00:27:09,820 --> 00:27:13,100
Michael: Yeah, and I've definitely
done benchmarks, you know,

506
00:27:13,100 --> 00:27:17,100
this is the age-old
log_min_duration_statement = 0.

507
00:27:17,100 --> 00:27:20,660
Like, why we recommend not setting
it, or I recommend never setting

508
00:27:20,660 --> 00:27:21,540
it to 0.

509
00:27:21,980 --> 00:27:25,460
Because this, like you can generate
gigabytes in like a 30 minute

510
00:27:25,580 --> 00:27:27,700
benchmark, like with pgbench or
something.

511
00:27:27,700 --> 00:27:31,860
So it surprised me how
much it can grow.

512
00:27:32,620 --> 00:27:35,900
Nikolay: Yeah, it's actually a
good idea maybe to just, you know,

513
00:27:35,900 --> 00:27:39,520
like maybe to ask our bot to benchmark
with regular pgbench and

514
00:27:39,520 --> 00:27:44,860
then to have sampling for logging
of queries in one way or another

515
00:27:45,560 --> 00:27:50,880
and just to see how observer effect
grows and kills your performance.

516
00:27:52,000 --> 00:27:57,540
Only people with small databases
can recommend log_min_duration_statement

517
00:27:57,540 --> 00:27:58,540
= 0.

518
00:27:59,440 --> 00:28:01,140
Michael: Or quiet databases, yeah.

519
00:28:01,620 --> 00:28:02,700
Nikolay: Quiet, yeah.

520
00:28:03,740 --> 00:28:04,240
Not active.

521
00:28:04,240 --> 00:28:08,740
Small databases and small workloads.

522
00:28:10,080 --> 00:28:10,640
Michael: That's it.

523
00:28:10,640 --> 00:28:11,260
So yeah.

524
00:28:13,480 --> 00:28:18,480
Or I guess super low, like turn
it to that only very, very briefly,

525
00:28:18,480 --> 00:28:20,080
like a minute or two or something.

526
00:28:20,080 --> 00:28:20,420
Nikolay: Yeah.

527
00:28:20,420 --> 00:28:20,740
Yeah.

528
00:28:20,740 --> 00:28:23,960
So, for serious workloads,
you cannot do that.

529
00:28:25,600 --> 00:28:30,040
I already shared my experience
putting big, critical production databases

530
00:28:30,280 --> 00:28:35,620
down, and yeah, don't
repeat my mistakes.

531
00:28:36,340 --> 00:28:38,960
Michael: As well as logging, what
else did you have in this category

532
00:28:38,960 --> 00:28:39,640
of things?

533
00:28:39,960 --> 00:28:41,460
Nikolay: Oh, that's a good question.

534
00:28:41,540 --> 00:28:46,900
Well, let's not maybe spend too
much time discussing some unknowns

535
00:28:47,020 --> 00:28:50,320
or external things like we have,
I don't know, something else

536
00:28:50,320 --> 00:28:54,280
installed on the same machine and
using the same disks and suddenly

537
00:28:54,440 --> 00:28:55,680
we are out of disk space.

538
00:28:55,680 --> 00:28:57,740
Also it can happen sometimes.

539
00:28:58,940 --> 00:29:01,840
Or maybe, for example, you do some
troubleshooting, you have

540
00:29:01,840 --> 00:29:05,880
self-managed Postgres, you can
SSH to the box, and you start sampling

541
00:29:06,000 --> 00:29:10,380
some good logs every second, but
write them, for example, to the home directory,

542
00:29:10,580 --> 00:29:13,240
and then it's full or something
like this.

543
00:29:13,400 --> 00:29:15,760
So you always need to be careful
with that.

544
00:29:16,120 --> 00:29:19,360
But interesting case when, for
example, you try to upgrade, you

545
00:29:19,360 --> 00:29:23,540
have Patroni, and, I don't
remember exact details, but

546
00:29:23,540 --> 00:29:28,460
in some cases Patroni decides to
retry provisioning of a standby

547
00:29:28,460 --> 00:29:34,340
node, for example, and it fetches
PGDATA in the way you configured

548
00:29:34,740 --> 00:29:40,320
with pg_basebackup or from archives,
but it renames the old directory

549
00:29:40,900 --> 00:29:43,860
to make it like backup, not like
local backup, right?

550
00:29:43,860 --> 00:29:47,300
It just renames it and you end
up having two directories suddenly,

551
00:29:47,900 --> 00:29:48,400
right?

552
00:29:48,520 --> 00:29:52,440
Data directories, full-fledged,
like, wow, it can quickly be

553
00:29:52,440 --> 00:29:54,440
a problem in terms of disk space.

554
00:29:54,760 --> 00:29:54,960
What else?

555
00:29:54,960 --> 00:29:55,440
Michael: Does that mean,

556
00:29:55,440 --> 00:29:59,540
like, if you had,
let's say your data storage was

557
00:29:59,540 --> 00:30:02,360
about 40% of your disk size, so
you thought you had loads of

558
00:30:02,360 --> 00:30:02,840
headroom.

559
00:30:02,840 --> 00:30:06,000
Suddenly you're at 80% plus whatever
else is on there.

560
00:30:06,660 --> 00:30:07,160
Yeah.

561
00:30:07,820 --> 00:30:08,420
Nikolay: Yeah, okay.

562
00:30:08,420 --> 00:30:08,860
Wow.

563
00:30:08,860 --> 00:30:10,180
Maybe large temporary files, but
it's quite exotic.

564
00:30:10,180 --> 00:30:12,800
Like, they can be
huge, right?

565
00:30:12,800 --> 00:30:14,360
But it's like, it's exotic.

566
00:30:15,040 --> 00:30:17,800
Michael: I saw some Stack Overflow
posts mention that, you know,

567
00:30:17,800 --> 00:30:21,020
people that were out of disk got
the error, but when they checked,

568
00:30:21,020 --> 00:30:22,440
they did have free disk space.

569
00:30:22,440 --> 00:30:26,120
And one of the suggested answers
mentioned checking for temp files,

570
00:30:26,120 --> 00:30:28,860
like your queries could be doing
a lot of temp file stuff, but

571
00:30:28,860 --> 00:30:30,360
yeah, I've never seen that myself.
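
It's cheap to check for, though; a sketch:

    -- cumulative temp file usage per database since the last stats reset:
    SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;

    -- log each temp file larger than 100 MB as it's created (example value):
    ALTER SYSTEM SET log_temp_files = '100MB';
    SELECT pg_reload_conf();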

572
00:30:30,900 --> 00:30:33,900
What about, well, backups is an
interesting one.

573
00:30:33,900 --> 00:30:37,700
Like backups, if you... because some
people recommend keeping some,

574
00:30:37,960 --> 00:30:42,100
obviously keeping all your backups
on your, on the same disk

575
00:30:42,100 --> 00:30:46,980
as your database is suboptimal
for recovery purposes, but keeping

576
00:30:46,980 --> 00:30:49,900
some on it makes sense, especially
for huge databases, right?

577
00:30:49,900 --> 00:30:51,000
Nikolay: I don't think so.

578
00:30:51,220 --> 00:30:52,980
I would not recommend doing this.

579
00:30:54,080 --> 00:30:56,780
Michael: But what about for recovery,
like no network?

580
00:30:57,840 --> 00:31:00,400
Nikolay: Regular backups should
be separate, of course.

581
00:31:01,100 --> 00:31:04,540
But if you do some operations manually,
for example, you decided

582
00:31:04,540 --> 00:31:09,400
to back up some tables and just
dump them to some compressed

583
00:31:09,400 --> 00:31:12,760
form and keep them on the same drive,
well, it might happen.

584
00:31:12,960 --> 00:31:15,860
But there are many mistakes you
can make manually.

585
00:31:17,020 --> 00:31:20,080
Or you just do a CREATE TABLE AS
SELECT, for example.

586
00:31:20,080 --> 00:31:23,000
You basically just clone the table
and then forget about it.

587
00:31:23,000 --> 00:31:25,020
Well, many such mistakes can be
made.

588
00:31:26,320 --> 00:31:31,820
I think there are many exotic situations
we don't discuss here,

589
00:31:32,220 --> 00:31:36,300
and I'm curious if our listeners
had some interesting situation

590
00:31:36,420 --> 00:31:38,640
we didn't discuss yet.

591
00:31:38,900 --> 00:31:39,720
It would be interesting.

592
00:31:39,720 --> 00:31:41,860
Like I think many exotic things
can happen.

593
00:31:41,920 --> 00:31:44,340
Let's talk about what to do in
an emergency.

594
00:31:44,860 --> 00:31:45,360
Right?

595
00:31:46,100 --> 00:31:46,360
Yeah.

596
00:31:46,360 --> 00:31:47,920
First thing, understand the reasons.

597
00:31:48,560 --> 00:31:49,400
Or maybe no.

598
00:31:49,400 --> 00:31:49,780
This is extremely interesting.

599
00:31:49,780 --> 00:31:51,040
Michael: No, I don't think so.

600
00:31:51,040 --> 00:31:51,540
Yeah.

601
00:31:51,980 --> 00:31:55,520
I mean, bear in mind, if we're
actually in an emergency, our

602
00:31:55,520 --> 00:32:00,300
database is not only no longer
accepting updates and writes

603
00:32:00,300 --> 00:32:04,940
and things, but it's also in a
lot of cases not accepting SELECTs,

604
00:32:05,140 --> 00:32:05,640
which...

605
00:32:06,600 --> 00:32:08,040
Nikolay: Which is ridiculous, right?

606
00:32:08,240 --> 00:32:11,980
You shared with me before our call,
you shared an Aiven doc, which

607
00:32:11,980 --> 00:32:16,040
says if a managed Postgres service
has automation, if it understands

608
00:32:16,160 --> 00:32:22,320
that we are approaching full
disk space, it converts our database

609
00:32:22,800 --> 00:32:24,860
to read-only state.

610
00:32:25,380 --> 00:32:30,620
Just setting a parameter, which
actually any client can override.

611
00:32:31,220 --> 00:32:34,080
So it's weak protection.

612
00:32:34,540 --> 00:32:37,200
default_transaction_read_only turned
on.

613
00:32:38,360 --> 00:32:40,660
This parameter,
default_transaction_read_only.

614
00:32:40,900 --> 00:32:44,680
I like the idea, actually, because
it's weak protection, but

615
00:32:44,680 --> 00:32:48,340
it's quite reliable if you don't
have this set to off all the

616
00:32:48,340 --> 00:32:49,200
time in your code.

617
00:32:49,200 --> 00:32:52,540
Usually, there are low chances
you have it in your code.

618
00:32:52,540 --> 00:32:54,680
So it means that it will protect
you.
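
A sketch of the mechanism being described (the generic parameter, not Aiven's exact automation):

    -- make new transactions read-only by default:
    ALTER SYSTEM SET default_transaction_read_only = on;
    SELECT pg_reload_conf();

    -- why it's weak protection: any session can simply override it:
    SET default_transaction_read_only = off;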

619
00:32:54,680 --> 00:32:57,420
And then you have alerts, you need
to have proper monitoring

620
00:32:57,440 --> 00:32:58,180
and so on.

621
00:32:58,180 --> 00:33:00,400
But this is like a prevention measure,
maybe.

622
00:33:00,400 --> 00:33:00,900
Michael: Yeah.

623
00:33:00,900 --> 00:33:01,400
Again.

624
00:33:02,000 --> 00:33:03,020
Nikolay: Yeah, it's again prevention.

625
00:33:03,020 --> 00:33:03,220
Yeah.

626
00:33:03,220 --> 00:33:06,980
But if you're in an emergency, this
is like, I think we should write

627
00:33:06,980 --> 00:33:10,680
some good how-to in general, a
how-to for any case.

628
00:33:10,680 --> 00:33:12,520
Michael: I did check your how-tos
and I did.

629
00:33:12,520 --> 00:33:14,940
I was surprised not to find one for
this actually.

630
00:33:15,560 --> 00:33:21,040
But I think also one thing worth
mentioning, nobody's going to

631
00:33:21,040 --> 00:33:23,980
be listening to a podcast and be
like half an hour in or whatever

632
00:33:23,980 --> 00:33:25,880
and be like, you know, because
they're panicking.

633
00:33:26,600 --> 00:33:27,100
Yeah.

634
00:33:27,640 --> 00:33:28,140
Yeah.

635
00:33:29,540 --> 00:33:33,780
But one thing to remember that does
seem to catch people out is,

636
00:33:34,700 --> 00:33:37,200
I mean, hopefully everybody listening
here won't, this won't

637
00:33:37,200 --> 00:33:40,640
happen to you, but they know that
WAL stands for Write-Ahead

638
00:33:40,640 --> 00:33:41,140
Logging.

639
00:33:41,820 --> 00:33:44,940
They assume that it's just logs, and
therefore that they can delete them.

640
00:33:45,060 --> 00:33:49,200
And then because it's like a large
amount of what they think

641
00:33:49,200 --> 00:33:52,620
are logs, they delete those to
try and recover.

642
00:33:52,960 --> 00:33:56,260
And I think that the main advice
from me in terms of recovery,

643
00:33:56,280 --> 00:34:00,460
like from every guide you'll read
on this, is don't do that.

644
00:34:00,800 --> 00:34:03,960
Nikolay: Yeah, I will say the main,
the like classic approach

645
00:34:03,960 --> 00:34:07,720
is understand reasons and fix them,
mitigate, right?

646
00:34:07,960 --> 00:34:11,460
But you might be under time
pressure.

647
00:34:11,460 --> 00:34:14,800
In this case, it's good if you
can quickly identify something

648
00:34:14,800 --> 00:34:17,580
you can delete safely without big
consequences.

649
00:34:17,680 --> 00:34:19,980
For example, regular Postgres logs.

650
00:34:20,220 --> 00:34:25,020
If you have a lot of logs, gigabytes,
it's good to delete just

651
00:34:25,740 --> 00:34:31,560
one file or so, start the database if
it's down, and then you do need

652
00:34:31,560 --> 00:34:36,260
to do full analysis and mitigate
the real reason and so on.

653
00:34:36,260 --> 00:34:39,520
Or maybe you can delete something
outside the Postgres data directory.

654
00:34:39,560 --> 00:34:40,060
Something.

655
00:34:40,580 --> 00:34:44,840
You can check quickly if there
are different directories around

656
00:34:45,240 --> 00:34:49,340
which have some gigabytes of data,
or at least megabytes, you

657
00:34:49,340 --> 00:34:56,120
can quickly delete and let Postgres
behave normally for quite

658
00:34:56,120 --> 00:34:58,540
some time, it buys you some room,
right?

659
00:34:58,620 --> 00:35:00,360
For analysis and proper mitigation.

660
00:35:00,860 --> 00:35:03,660
But mitigation is inevitable. You
need to understand what happened.

661
00:35:03,680 --> 00:35:06,640
Is it WAL, or data, or something
else?

662
00:35:07,200 --> 00:35:08,900
And then you need to mitigate.

663
00:35:09,720 --> 00:35:11,720
Michael: Yeah, I was reading some
guides on this.

664
00:35:11,720 --> 00:35:15,520
There's a good one on the Crunchy
Data blog, and they made a really

665
00:35:15,520 --> 00:35:19,480
good point that we don't all think
of doing when you're in that

666
00:35:19,480 --> 00:35:20,740
kind of panic mode.

667
00:35:20,840 --> 00:35:25,020
Get, you know, getting things back
online is my primary objective

668
00:35:25,760 --> 00:35:27,160
in cases like that.

669
00:35:27,740 --> 00:35:31,260
But they recommend taking a file
system level copy of everything

670
00:35:31,280 --> 00:35:32,280
in its current state.

671
00:35:32,280 --> 00:35:36,540
Like take a snapshot of the current
directory as it is before

672
00:35:36,540 --> 00:35:37,900
you start deleting things.

673
00:35:38,260 --> 00:35:40,120
Nikolay: Yeah, it's additional
work in progress.

674
00:35:40,760 --> 00:35:43,100
Michael: Yeah, because it takes
time and we're down.

675
00:35:44,380 --> 00:35:47,640
Nikolay: In cloud actually, if
you don't see what you can delete

676
00:35:47,640 --> 00:35:51,740
quickly, like in Cloud, it's managed,
for example, you cannot

677
00:35:51,740 --> 00:35:54,560
go and run du and so on.

678
00:35:54,560 --> 00:35:58,260
But in this case, probably it's
easier just to add a few percent

679
00:35:58,260 --> 00:36:02,860
of disk space quickly to buy some
room again, and then to investigate.

680
00:36:03,420 --> 00:36:04,780
Obvious approach, right?

681
00:36:05,060 --> 00:36:07,240
Michael: I also think because of
some reasons we're going to

682
00:36:07,240 --> 00:36:10,400
discuss in a bit, this is just
so much less likely to happen

683
00:36:10,400 --> 00:36:11,440
in cloud environments.

684
00:36:11,760 --> 00:36:14,480
Like it just feels like this is
1 of those features.

685
00:36:15,060 --> 00:36:16,120
There aren't that many.

686
00:36:16,120 --> 00:36:17,420
Nikolay: This is their job, right?

687
00:36:17,560 --> 00:36:21,960
Michael: Yeah, this is 1 of those
2 or 3 features that they really

688
00:36:21,960 --> 00:36:24,420
do manage and can manage for you
automatically.

689
00:36:24,860 --> 00:36:27,020
Nikolay: Notify properly and so
on, right?

690
00:36:27,260 --> 00:36:28,060
Michael: Or just auto-scale.

691
00:36:28,780 --> 00:36:29,980
Auto-scale, yeah.

692
00:36:30,320 --> 00:36:31,340
You just don't hit this button.

693
00:36:31,340 --> 00:36:34,200
Nikolay: I don't like auto-scaling
in this area because it's

694
00:36:34,200 --> 00:36:35,640
auto-scaling of budgets.

695
00:36:36,980 --> 00:36:40,040
Michael: Well, but it's also not
solving the root cause of the

696
00:36:40,040 --> 00:36:40,380
problem.

697
00:36:40,380 --> 00:36:43,520
If it's several of those
problems we talked about, it's

698
00:36:43,520 --> 00:36:45,200
not your actual data growing.

699
00:36:45,200 --> 00:36:49,460
It's not your, like, it's not your
true tables and indexes.

700
00:36:50,460 --> 00:36:55,320
It's write-ahead logging going
out of control because of an open

701
00:36:55,320 --> 00:36:55,820
replication slot.

702
00:36:56,200 --> 00:36:58,260
For some reason, it shouldn't be
that much data.

703
00:36:58,260 --> 00:37:02,580
So auto scaling those is just not
solving the problem.
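
The slot check behind this point can be done at SQL level; a minimal sketch, assuming a primary running Postgres 10+ (on older versions the function names differ):

```sql
-- Replication slots ordered by how much WAL they are holding back;
-- an inactive slot with large retained_wal prevents WAL recycling
-- and is the usual suspect when pg_wal grows out of control.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```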

704
00:37:02,780 --> 00:37:03,280
Exactly.

705
00:37:03,740 --> 00:37:04,240
Yeah.

706
00:37:04,740 --> 00:37:07,620
Nikolay: Like hiding dirt under
the rug, isn't it?

707
00:37:08,300 --> 00:37:11,200
Michael: As long as you've been
told when it has scaled up, then

708
00:37:11,200 --> 00:37:13,600
at least you can go and investigate.

709
00:37:14,440 --> 00:37:18,700
Nikolay: Yeah, so I think we covered
quite well mitigation in

710
00:37:18,700 --> 00:37:19,200
general.

711
00:37:21,000 --> 00:37:22,440
Michael: Well, what do you think
about...

712
00:37:22,900 --> 00:37:26,600
Yeah, I think increasing makes
sense.

713
00:37:26,600 --> 00:37:30,200
I saw some warnings not to increase
in place necessarily.

714
00:37:31,060 --> 00:37:34,700
Maybe you should look at setting
up a replica and failing over,

715
00:37:34,700 --> 00:37:36,420
but I thought that would be slower.

716
00:37:36,420 --> 00:37:38,360
Like I didn't, I wasn't sure myself.

717
00:37:39,520 --> 00:37:40,120
Nikolay: I don't know.

718
00:37:40,120 --> 00:37:45,280
Like I think if you can add space
or delete something safe, it's

719
00:37:45,280 --> 00:37:45,780
good.

720
00:37:45,800 --> 00:37:49,040
First step, then you have a big step
of analysis, proper analysis

721
00:37:49,040 --> 00:37:54,440
and mitigation of true reason,
main reason why you approached

722
00:37:54,480 --> 00:37:57,440
it without noticing, right?

723
00:37:57,440 --> 00:38:01,180
So a big part of mitigation, if
you perform proper root cause

724
00:38:01,180 --> 00:38:04,600
analysis, I'm quite sure will be,
we didn't notice this and it's

725
00:38:04,600 --> 00:38:07,700
a problem itself, so our observability
is bad.

726
00:38:08,260 --> 00:38:11,240
Michael: The process is not... Either
monitoring or probably alerting

727
00:38:11,540 --> 00:38:15,780
is the problem, in fact. Yeah,
we have to go to prevention.

728
00:38:16,500 --> 00:38:19,360
Nikolay: Yeah, prevention, I think,
philosophically, is quite

729
00:38:19,360 --> 00:38:19,740
simple.

730
00:38:19,740 --> 00:38:23,460
You just need, like, the better
way to prevent it is understand,

731
00:38:23,680 --> 00:38:29,800
like, keep a good level of understanding
of how you spend your

732
00:38:29,800 --> 00:38:31,900
bytes of disk, of gigabytes.

733
00:38:32,300 --> 00:38:37,080
So you understand the layout, you
understand WAL data, real

734
00:38:37,080 --> 00:38:42,180
database data, is it like actual
data or a lot of bloat, and

735
00:38:42,180 --> 00:38:45,060
then you understand logs and everything
else and you understand,

736
00:38:45,060 --> 00:38:49,880
okay, we are good and we understand
how we spend our disk bytes.
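
At SQL level, a minimal sketch of that layout check could look like this, using only core size functions (bloat estimation would need a separate query or an extension such as pgstattuple):

```sql
-- Per-database totals:
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Largest tables in the current database, including indexes and TOAST:
SELECT c.oid::regclass AS relation,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```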

737
00:38:51,000 --> 00:38:54,560
And if this understanding is good,
everything will be fine and

738
00:38:54,560 --> 00:38:56,540
you will notice problems early.

739
00:38:58,260 --> 00:39:02,740
But it's hard to do if you have
hundreds or thousands of clusters,

740
00:39:02,760 --> 00:39:03,260
right?

741
00:39:03,900 --> 00:39:08,980
In this case, you do need some
rules to be alerted if bloat is

742
00:39:08,980 --> 00:39:16,740
high, if pg_wal grows without good
control, if logs are not properly

743
00:39:17,600 --> 00:39:20,860
– like, also, if logs consume too much.

744
00:39:20,860 --> 00:39:23,940
You just need to have good observability,
right?

745
00:39:24,840 --> 00:39:25,740
To have mitigation.
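
For the pg_wal rule specifically, here is one way a check could sample WAL generation, sketched as psql commands (assuming Postgres 10+; the \gset variable and the 60-second pause are illustrative, a monitoring agent would instead compare two consecutive scrapes):

```sql
-- Capture the current WAL position, wait, then measure how much
-- WAL was generated in the interval.
SELECT pg_current_wal_lsn() AS lsn_before \gset
SELECT pg_sleep(60);
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before')
       ) AS wal_written_last_minute;
```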

746
00:39:26,120 --> 00:39:29,540
And then you just, with understanding
of possible reasons, you

747
00:39:29,540 --> 00:39:34,080
navigate in these reasons and find
mitigation.

748
00:39:34,340 --> 00:39:37,280
But the key of prevention is understanding
and understanding

749
00:39:37,280 --> 00:39:38,740
is observability, right?

750
00:39:38,940 --> 00:39:40,660
Michael: Yeah, 2 other things.

751
00:39:40,680 --> 00:39:44,940
I think like whilst I'm kind of
against auto scaling in principle,

752
00:39:45,560 --> 00:39:50,900
I think if you've got the option,
going up in budget and having

753
00:39:50,900 --> 00:39:55,840
to migrate later back down to a
smaller size is so much better

754
00:39:55,840 --> 00:39:56,860
than being down.

755
00:39:57,700 --> 00:40:01,180
I personally would have that on
myself if I had that option.

756
00:40:01,560 --> 00:40:04,400
And I think also, oversizing.

757
00:40:04,820 --> 00:40:07,860
Like, disk is so cheap compared
to everything else we're doing.

758
00:40:08,240 --> 00:40:10,940
Unless you really are in the like
extreme...

759
00:40:11,400 --> 00:40:11,900
Huh?

760
00:40:12,180 --> 00:40:12,320
Nikolay: I mean...

761
00:40:12,320 --> 00:40:15,980
Disk is not cheap if you have a
lot of terabytes and many standby

762
00:40:16,080 --> 00:40:18,580
nodes and oh, it can be not cheap.

763
00:40:18,820 --> 00:40:20,732
Michael: But most of us aren't
in that case.

764
00:40:20,732 --> 00:40:25,460
Like most of us aren't, so yeah,
I agree that for like people

765
00:40:25,460 --> 00:40:27,080
with serious, serious data.

766
00:40:27,500 --> 00:40:30,180
Nikolay: Some people pay like millions
for disk space.

767
00:40:31,100 --> 00:40:32,600
It's like, it can be a headache.

768
00:40:33,020 --> 00:40:36,680
That's why I mentioned PgTier; it's
a potentially very

769
00:40:36,740 --> 00:40:39,000
good thing, and tiering is good.

770
00:40:39,520 --> 00:40:44,660
And with some design, understanding
the structure of your

771
00:40:44,660 --> 00:40:50,080
disk space consumption, you design
some rules to move data

772
00:40:50,080 --> 00:40:52,760
properly and keep everything under
control.

773
00:40:52,900 --> 00:40:54,780
And that's, that's, this is good.

774
00:40:54,780 --> 00:40:55,460
This is good.

775
00:40:55,920 --> 00:40:56,780
But yeah,

776
00:40:56,880 --> 00:40:57,540
Michael: I think...

777
00:40:57,560 --> 00:41:00,360
Imagine, imagine advising some
startup and they had like some

778
00:41:00,360 --> 00:41:03,760
10 gigabyte disk or something,
and their database was only a

779
00:41:03,760 --> 00:41:04,900
few hundred megabytes.

780
00:41:05,740 --> 00:41:08,680
The cost of being on a hundred
gigabyte disk instead is just

781
00:41:08,680 --> 00:41:15,180
so tiny, like it's just such a small
amount, but yeah, I take your point.

782
00:41:15,660 --> 00:41:20,860
Nikolay: But again, I think observability
can always be improved,

783
00:41:20,860 --> 00:41:21,520
I think.

784
00:41:21,600 --> 00:41:26,160
So like, especially if we know
that this is WAL, this is

785
00:41:26,160 --> 00:41:30,720
data, this is logs, everything
else, and we have some proper

786
00:41:31,080 --> 00:41:33,660
analysis and alerts for it.

787
00:41:33,660 --> 00:41:36,040
Michael: I think that's the bit
that trips people up.

788
00:41:36,040 --> 00:41:38,940
I think a lot of people actually
have quite decent observability,

789
00:41:39,380 --> 00:41:43,500
like they can see what's going
on, but with alerting, people aren't

790
00:41:43,500 --> 00:41:47,440
getting alerted early enough, or
not at all, is the 1 I see quite

791
00:41:47,440 --> 00:41:47,940
often.

792
00:41:48,160 --> 00:41:53,700
No alerts at all for, you know,
disk at 50% sends an email,

793
00:41:54,080 --> 00:41:58,740
70% raises a warning, 90% sets
off a page, you know, like these

794
00:41:58,740 --> 00:42:00,540
kinds of like serious alarms.

795
00:42:00,840 --> 00:42:03,900
Nikolay: Yeah, you know, I know
we're already running out of

796
00:42:03,900 --> 00:42:08,940
time, but let's mention some small
practical advice in the end.

797
00:42:09,320 --> 00:42:13,660
If you cannot SSH to the box, for
example, it's RDS or other

798
00:42:13,660 --> 00:42:17,040
managed Postgres service, and Postgres
is still alive, but you

799
00:42:17,040 --> 00:42:20,140
wonder why the disk usage goes
up.

800
00:42:20,380 --> 00:42:24,920
At SQL level, you can check what's
happening in directories using

801
00:42:24,920 --> 00:42:26,400
pg_ls_dir, pg_ls_waldir.

802
00:42:27,500 --> 00:42:31,780
There is a set of administrative
functions which allow you to

803
00:42:31,780 --> 00:42:34,020
inspect the content of directories.
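
For example, a minimal sketch, assuming Postgres 10+ and sufficient privileges (superuser, or pg_monitor for the pg_ls_* listing functions):

```sql
-- Total size and file count of pg_wal:
SELECT count(*) AS wal_files,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();

-- Raw listing of a directory inside the data directory:
SELECT * FROM pg_ls_dir('base') LIMIT 10;
```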

804
00:42:35,200 --> 00:42:38,140
And this is very helpful when you
troubleshoot and understand

805
00:42:38,140 --> 00:42:38,760
what's happening.

806
00:42:38,760 --> 00:42:43,160
And maybe it can also be helpful
in mitigation activities,

807
00:42:44,240 --> 00:42:44,940
in observability.

808
00:42:45,140 --> 00:42:49,440
For example, if you have a report
triggered, like out of disk space

809
00:42:49,540 --> 00:42:52,260
soon, usage exceeded 90%.

810
00:42:52,860 --> 00:42:56,980
Then you can include some automated
analysis to have a snapshot

811
00:42:56,980 --> 00:43:03,080
of what's happening in key directories
and understand what are

812
00:43:03,080 --> 00:43:05,240
consumers of disk space.
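
A minimal sketch of what such an automated snapshot could collect at SQL level when the alert fires (same function availability and privilege assumptions as above):

```sql
-- Coarse snapshot of the main disk consumers visible from SQL.
SELECT 'pg_wal' AS consumer,
       pg_size_pretty(sum(size)) AS total
FROM pg_ls_waldir()
UNION ALL
SELECT 'server logs',
       pg_size_pretty(sum(size))
FROM pg_ls_logdir()
UNION ALL
SELECT 'databases',
       pg_size_pretty(sum(pg_database_size(datname)))
FROM pg_database;
```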

813
00:43:06,660 --> 00:43:10,540
Just some advice I thought we could
have missed.

814
00:43:10,580 --> 00:43:11,780
Let's not miss it.

815
00:43:11,820 --> 00:43:13,120
That's it, I think, right?

816
00:43:13,500 --> 00:43:14,360
Michael: I think so.

817
00:43:14,450 --> 00:43:17,960
I hope none of you ever actually
have to face this and you can

818
00:43:17,960 --> 00:43:20,960
all mitigate or prevent it from
happening.

819
00:43:21,600 --> 00:43:23,220
And yeah, thanks so much, Nikolay.

820
00:43:23,220 --> 00:43:24,260
Catch you next week.