1
00:00:00,120 --> 00:00:04,120
Nikolay: Hello everyone, this is
PostgresFM, a podcast about,

2
00:00:04,120 --> 00:00:05,180
guess what, Postgres.

3
00:00:05,520 --> 00:00:07,520
I'm Nikolay and hi Michael.

4
00:00:08,020 --> 00:00:08,980
Michael: Hello Nikolay.

5
00:00:09,400 --> 00:00:10,700
Nikolay: How are you doing today?

6
00:00:11,000 --> 00:00:11,940
Michael: I'm good, thank you.

7
00:00:11,940 --> 00:00:12,660
How are you?

8
00:00:13,780 --> 00:00:14,840
Nikolay: Very good.

9
00:00:15,180 --> 00:00:21,160
So we discussed before the call
that we have a few options for

10
00:00:21,160 --> 00:00:21,660
topics.

11
00:00:22,740 --> 00:00:27,120
And it was my choice, but somehow
you chose, right?

12
00:00:27,120 --> 00:00:31,780
You chose the health check or checkup.

13
00:00:32,780 --> 00:00:36,360
Postgres health checks and how
to check the health of Postgres

14
00:00:36,360 --> 00:00:41,380
clusters periodically, why to do
so, what to look at and so on,

15
00:00:41,380 --> 00:00:41,880
right?

16
00:00:42,660 --> 00:00:45,580
Michael: Well, you said you had
2 ideas, one of them was this one,

17
00:00:45,580 --> 00:00:48,160
and I was like, I like that idea.
And then you explained the other

18
00:00:48,160 --> 00:00:48,280
one.

19
00:00:48,280 --> 00:00:49,540
I was like, I like that idea too.

20
00:00:49,540 --> 00:00:51,340
So I think we've got lots of good
ones left

21
00:00:51,340 --> 00:00:51,760
Nikolay: to go.

22
00:00:51,760 --> 00:00:51,960
Yeah.

23
00:00:51,960 --> 00:00:54,060
We need to discuss the internal
policy.

24
00:00:54,440 --> 00:00:56,340
How transparent should we be?

25
00:00:56,580 --> 00:00:57,080
Right.

26
00:00:57,280 --> 00:00:57,740
Michael: Yeah.

27
00:00:57,740 --> 00:01:01,920
Well, and nice reminder to people
that you can submit ideas for

28
00:01:01,920 --> 00:01:05,280
future episodes on our Google doc
or by mentioning us on Twitter

29
00:01:05,280 --> 00:01:06,540
or anything like that.

30
00:01:07,080 --> 00:01:07,360
Nikolay: Right.

31
00:01:07,360 --> 00:01:07,680
Right.

32
00:01:07,680 --> 00:01:08,560
That's, that's good.

33
00:01:08,560 --> 00:01:12,780
I know people go there sometimes,
and I think it's good

34
00:01:12,780 --> 00:01:16,980
to remind people of this opportunity,
and this feeds our

35
00:01:16,980 --> 00:01:18,180
podcast with ideas.

36
00:01:18,760 --> 00:01:20,640
So, yeah, good point.

37
00:01:20,900 --> 00:01:23,960
So, health check, where to start?

38
00:01:24,520 --> 00:01:25,880
Goals, maybe, right?

39
00:01:26,000 --> 00:01:26,500
Or...

40
00:01:27,100 --> 00:01:28,660
Michael: Well, I think definitions,
actually.

41
00:01:28,660 --> 00:01:31,100
I think, because I was doing a
little bit of research around

42
00:01:31,100 --> 00:01:33,540
this, it is a slightly loaded term.

43
00:01:33,600 --> 00:01:37,480
When you say health check, it could
be, for example, software just

44
00:01:37,480 --> 00:01:41,680
checking that something's even
alive, but all the way through

45
00:01:41,680 --> 00:01:45,700
to kind of consultancy services
around, you know, that kind of

46
00:01:45,700 --> 00:01:45,840
thing.

47
00:01:45,840 --> 00:01:47,300
Nikolay: SELECT 1 is working.

48
00:01:47,940 --> 00:01:48,740
That's good.
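
The SELECT 1 liveness probe joked about here can be sketched in a few lines. This is a minimal illustration, not a real driver API: `run_query` is a stand-in for whatever executes SQL in your stack (a psycopg cursor wrapper, for example).

```python
def is_alive(run_query):
    # Minimal liveness probe: the server is "alive" if it answers the
    # trivial query SELECT 1. run_query is any callable that executes SQL
    # and returns rows; it is a hypothetical stand-in, not a real API.
    try:
        return run_query("SELECT 1") == [(1,)]
    except Exception:
        # Connection refused, timeout, etc. all count as "not alive".
        return False
```

In practice this kind of probe is what load balancers and orchestration tools run constantly — which is exactly the shallow end of "health check" being contrasted with a periodic deep checkup.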

49
00:01:49,860 --> 00:01:52,400
Michael: Yeah, So what do you mean
when you say health check?

50
00:01:54,320 --> 00:01:58,860
Nikolay: My perception on this
term is very broad, of course.

51
00:01:58,940 --> 00:02:04,440
Because in my opinion, what we
mean is not like looking at monitoring

52
00:02:04,540 --> 00:02:08,440
and seeing everything is green or
something, no alerts or nothing.

53
00:02:08,940 --> 00:02:09,720
No, no, no.

54
00:02:09,720 --> 00:02:15,440
It's about something that should
be done not every day, but also

55
00:02:15,440 --> 00:02:19,900
not once per 5 years, slightly
more frequently, right?

56
00:02:19,900 --> 00:02:23,820
Like for example, once per quarter
or twice per year.

57
00:02:25,320 --> 00:02:31,840
And it's very similar to checking
health of humans or complex

58
00:02:32,420 --> 00:02:34,340
technology like cars, right?

59
00:02:34,540 --> 00:02:36,140
Sometimes you need to check.

60
00:02:36,420 --> 00:02:41,180
And it doesn't eliminate the need
to have continuous monitoring.

61
00:02:41,500 --> 00:02:47,000
Like, for example, in your car
you have various checks constantly

62
00:02:47,100 --> 00:02:50,420
running, for example, tire
pressure.

63
00:02:51,360 --> 00:02:56,200
By the way, I think for humans
it's a big disadvantage that we

64
00:02:56,200 --> 00:02:57,160
don't have monitoring.

65
00:02:58,180 --> 00:03:02,780
It's improving, for example, a
modern smartwatch can provide

66
00:03:02,780 --> 00:03:05,140
you a heart rate and so on.

67
00:03:05,500 --> 00:03:08,080
But still, we're monitoring...

68
00:03:08,440 --> 00:03:13,520
When I complain about how poor
some Postgres monitoring systems

69
00:03:13,580 --> 00:03:18,640
are, I remind myself how poor human
body monitoring is right

70
00:03:18,640 --> 00:03:19,460
now still.

71
00:03:21,460 --> 00:03:26,300
And we can monitor technological
systems much better than non-technological.

72
00:03:26,760 --> 00:03:32,920
So yeah, we have some things constantly
being checked, but still

73
00:03:32,920 --> 00:03:38,220
we sometimes need to perform deeper
and wider analysis for different

74
00:03:38,220 --> 00:03:38,720
goals.

75
00:03:39,320 --> 00:03:44,240
The first goal is to ensure that everything
is currently fine and

76
00:03:44,240 --> 00:03:45,360
nothing is overlooked.

77
00:03:46,880 --> 00:03:51,040
And the second goal is to predict and
prevent some problems in the

78
00:03:51,040 --> 00:03:51,960
nearest future.

79
00:03:52,740 --> 00:03:59,640
And the third goal is things like
how we will live in the next

80
00:03:59,640 --> 00:04:00,400
few years.

81
00:04:00,780 --> 00:04:05,320
It involves capacity planning and
proper planning of growth,

82
00:04:05,540 --> 00:04:07,940
and prediction, and so on.

83
00:04:08,080 --> 00:04:11,820
So these are 3 key goals, very
roughly, right?

84
00:04:12,880 --> 00:04:13,860
What do you think?

85
00:04:14,440 --> 00:04:17,780
Michael: Yeah, I really like it,
and I think the car analogy

86
00:04:18,200 --> 00:04:22,760
is particularly good in that I
don't think this is true in the

87
00:04:22,760 --> 00:04:26,520
US but in the UK at least, once
a car is a certain age, like

88
00:04:26,520 --> 00:04:30,660
3 years old or something, it's
required by law to have an annual

89
00:04:31,060 --> 00:04:35,600
health check, we call it an MOT,
but it's not just what's wrong

90
00:04:35,600 --> 00:04:38,800
with the car now that you need
to fix, but it's also looking

91
00:04:38,800 --> 00:04:43,100
at things that could
go wrong and trying to

92
00:04:43,100 --> 00:04:45,080
get them replaced before they fail.

93
00:04:45,300 --> 00:04:48,000
So there are certain parts
of the car that, if they haven't

94
00:04:48,000 --> 00:04:50,820
been changed in a certain number
of years, have to be changed.

95
00:04:50,980 --> 00:04:54,560
So I think that covers at least
2 of the

96
00:04:54,560 --> 00:04:59,440
goals: looking at what's wrong now,
but also, if we don't do anything

97
00:04:59,440 --> 00:05:03,640
about it, what could go wrong very
soon, and we should get ahead

98
00:05:03,640 --> 00:05:06,840
of it, which is not necessarily
something monitoring is always

99
00:05:06,840 --> 00:05:07,620
good at.

100
00:05:08,300 --> 00:05:10,940
Nikolay: Yeah, well, in the US
it's very different.

101
00:05:11,200 --> 00:05:12,480
No pressure at all.

102
00:05:12,640 --> 00:05:14,840
You just change the oil and that's
it.

103
00:05:15,060 --> 00:05:18,280
Everyone is on their own, as I
understand, at least in California.

104
00:05:19,000 --> 00:05:23,000
And actually, I'm going to argue
with myself and with you a little

105
00:05:23,000 --> 00:05:23,500
bit.

106
00:05:24,600 --> 00:05:29,500
Better technology means less need
for health checks.

107
00:05:29,500 --> 00:05:33,300
For example, I own a couple of
Teslas, and I was very concerned

108
00:05:33,340 --> 00:05:37,000
that nobody is replacing anything
like filters and so on and

109
00:05:37,720 --> 00:05:43,080
then I needed to replace
the windshield after some trip

110
00:05:43,080 --> 00:05:49,640
to Nevada. And I asked like guys,
like check everything like brake

111
00:05:49,640 --> 00:05:53,180
pads and so on because of course
I'm worried like it's already

112
00:05:53,180 --> 00:05:57,360
almost 2 years and nothing. And they said everything

113
00:05:57,360 --> 00:06:00,660
is fine, we are going to replace
just the cabin filter and that's

114
00:06:00,660 --> 00:06:04,260
it. No need to replace
anything because it's better technology.

115
00:06:05,020 --> 00:06:10,220
So I think, of course, the amount
of work that needs to be done...

116
00:06:11,980 --> 00:06:15,480
Let's go back to Postgres because
it's a PostgresFM podcast, as

117
00:06:15,480 --> 00:06:16,760
I said in the beginning.

118
00:06:17,440 --> 00:06:23,840
So, if we have very old Postgres,
it perhaps needs

119
00:06:23,960 --> 00:06:29,180
more attention and work to check
and maintain good health.

120
00:06:30,140 --> 00:06:31,580
Michael: I think there are 2 axes.

121
00:06:32,080 --> 00:06:34,640
I think you're right that technology
plays into it, but I think

122
00:06:34,640 --> 00:06:38,940
also simplicity versus complexity
plays a part as well.

123
00:06:39,120 --> 00:06:42,660
And as Postgres gets more complex
or as your Postgres setup gets

124
00:06:42,660 --> 00:06:47,080
more complex, I think actually
maybe there's more you need to

125
00:06:47,080 --> 00:06:51,600
do, whereas on the car analogy,
I think one of the arguments for

126
00:06:51,600 --> 00:06:53,800
things like electric cars is
they're simpler.

127
00:06:53,800 --> 00:06:55,460
They have fewer parts.

128
00:06:57,040 --> 00:06:58,680
They are easier to maintain.

129
00:06:58,860 --> 00:07:02,740
So I think there's also an argument
for keeping your Postgres

130
00:07:02,840 --> 00:07:03,340
setup simple.

131
00:07:03,940 --> 00:07:08,040
This is one of the arguments for
simplicity in, well, we'll get

132
00:07:08,040 --> 00:07:12,480
to some of the specifics, but on
the technology side, for example,

133
00:07:12,500 --> 00:07:19,140
if we could get bigger than 32-bit
transaction IDs,

134
00:07:19,540 --> 00:07:19,740
Nikolay: we

135
00:07:19,740 --> 00:07:22,400
Michael: wouldn't have to monitor
as often for wraparound, like

136
00:07:22,400 --> 00:07:23,920
wraparound would be less of an issue.
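
The wraparound risk mentioned here comes from the 32-bit transaction ID space. A rough sketch of the arithmetic behind the usual check, which would be fed `age(datfrozenxid)` from `pg_database` — note that the real freeze limits include safety margins this sketch ignores:

```python
XID_SPACE = 2**31  # 32-bit transaction IDs wrap after ~2.1 billion XIDs

def wraparound_fraction(datfrozenxid_age):
    # age(datfrozenxid) says how many transactions old the oldest
    # unfrozen XID in a database is; the closer this fraction gets to
    # 1.0, the more urgent freezing becomes. Simplified illustration.
    return datfrozenxid_age / XID_SPACE
```

Monitoring systems typically alert once this fraction passes some threshold (say 0.5), well before Postgres starts forcing aggressive autovacuum freezes.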

137
00:07:23,920 --> 00:07:29,440
If we defaulted to int8 or bigint
primary keys instead of int4,

138
00:07:30,060 --> 00:07:31,920
we wouldn't have as many of those
issues.

139
00:07:31,920 --> 00:07:37,180
So, by certain choices, we can simplify
and prevent issues, but

140
00:07:37,180 --> 00:07:38,100
also the technology.

141
00:07:38,860 --> 00:07:41,020
Nikolay: And defaults matter a
lot.

142
00:07:41,040 --> 00:07:42,100
Defaults matter.

143
00:07:42,540 --> 00:07:46,420
If, for example, all our application
code or frameworks, and

144
00:07:46,420 --> 00:07:49,980
for example, Ruby on Rails made
this choice a few years ago.

145
00:07:50,140 --> 00:07:56,780
As you said, default to bigint
for surrogate primary keys,

146
00:07:56,820 --> 00:07:59,280
then that's one less report needed, right?

147
00:08:00,100 --> 00:08:05,860
At least, yeah, so we don't need
to check integer capacity

148
00:08:05,860 --> 00:08:07,500
for primary keys.
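
The integer-capacity check referred to here boils down to simple arithmetic. A sketch, where `current_max_id` would come from each table's max id or its sequence's last_value — the names are illustrative:

```python
INT4_MAX = 2**31 - 1  # 2147483647, the largest value an int4 column holds

def int4_used_fraction(current_max_id):
    # Fraction of the int4 key space already consumed by a surrogate
    # primary key; in a real health check you would run this per table.
    return current_max_id / INT4_MAX
```

Once this fraction gets uncomfortably high, the usual remediation is a carefully staged migration of the column to bigint — which is exactly why defaulting to bigint up front removes this report entirely.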

149
00:08:08,200 --> 00:08:09,140
And that's good.

150
00:08:09,180 --> 00:08:12,260
But of course, the complexity of the
system can grow and sometimes

151
00:08:12,280 --> 00:08:16,120
when we check Postgres health, we go
beyond Postgres, for example,

152
00:08:16,120 --> 00:08:23,100
we check connection pooling and
how orchestration and automation

153
00:08:23,300 --> 00:08:27,720
around Postgres is organized with
the goals to understand the

154
00:08:27,720 --> 00:08:32,340
HA, high availability, disaster
recovery, these kinds of topics.

155
00:08:32,900 --> 00:08:34,620
So and of course capacity planning.

156
00:08:34,640 --> 00:08:38,420
Will we grow a few years without
performance issues or not?

157
00:08:39,060 --> 00:08:43,740
But anyway, we can add many things
like that.

158
00:08:43,740 --> 00:08:50,180
For example, if the threads topic succeeds,
started by Heikki a year

159
00:08:50,180 --> 00:08:57,180
ago, if this succeeds, we won't
be worried about single CPU core

160
00:08:57,180 --> 00:09:00,080
capacity for system processes.

161
00:09:00,480 --> 00:09:04,800
Or for example, there is an ongoing
discussion for pgBouncer that

162
00:09:05,140 --> 00:09:09,840
threading is also a good thing,
so we won't be bound, the process

163
00:09:09,840 --> 00:09:15,140
won't be bound by a single core,
and this is a hard wall to hit.

164
00:09:15,300 --> 00:09:18,580
And some Postgres processes

165
00:09:18,580 --> 00:09:22,680
can hit 100% single-core utilization,
and even if you have 100

166
00:09:22,680 --> 00:09:24,940
more cores, it doesn't help you,
right?

167
00:09:25,320 --> 00:09:31,020
So I cannot say this
is simplicity, because the technology

168
00:09:31,160 --> 00:09:35,200
is similar to an EV. There's also
a lot of complex technology

169
00:09:35,660 --> 00:09:38,820
behind it, but it's simplicity of
use, right?

170
00:09:39,000 --> 00:09:40,940
Of usage, maybe.

171
00:09:41,920 --> 00:09:45,460
Michael: Well yeah, I would actually
argue that let's say we

172
00:09:45,460 --> 00:09:50,800
did get threading in Postgres in
a major version in a few years

173
00:09:50,800 --> 00:09:51,300
time.

174
00:09:52,240 --> 00:09:54,860
Actually, at first, you'd want
to be doing health checks more

175
00:09:54,860 --> 00:09:55,360
often.

176
00:09:56,400 --> 00:10:00,100
You'd be more nervous about things
going wrong because you've added

177
00:10:00,100 --> 00:10:04,540
complexity and things that maybe
for over a decade have been

178
00:10:04,540 --> 00:10:09,320
working great suddenly may not work as well or just their

179
00:10:09,320 --> 00:10:10,620
patterns may change.

180
00:10:12,240 --> 00:10:18,720
Nikolay: Well maybe, maybe, but
let's find some mutual ground

181
00:10:18,720 --> 00:10:23,100
here. We both agree that
health checks are needed,

182
00:10:23,100 --> 00:10:23,600
right?

183
00:10:24,440 --> 00:10:26,340
Or right now at least.

184
00:10:26,820 --> 00:10:30,660
Maybe in Postgres 25, it won't be
needed so much because it will

185
00:10:30,660 --> 00:10:33,100
be an autonomous database.

186
00:10:34,200 --> 00:10:39,760
Michael: I personally, I think
it's a really good way of keeping

187
00:10:39,760 --> 00:10:44,600
a system reliable and avoiding
major issues including downtime

188
00:10:44,800 --> 00:10:46,900
and outages and huge things.

189
00:10:47,860 --> 00:10:51,360
But that said, yeah, "need"
is an interesting word.

190
00:10:51,360 --> 00:10:54,840
Like a lot of the people I see
hitting issues are just living

191
00:10:54,840 --> 00:10:56,040
in a reactionary world.

192
00:10:56,040 --> 00:10:58,820
A lot of them are startups and
they...

193
00:10:59,100 --> 00:10:59,600
Yeah.

194
00:11:00,200 --> 00:11:03,420
But once things are working
well and, like,

195
00:11:03,420 --> 00:11:06,060
maybe you're hitting an inflection
point, or you need to

196
00:11:06,060 --> 00:11:10,380
do capacity planning, or you're in a large
established business where

197
00:11:10,460 --> 00:11:16,220
outages can be really expensive,
it seems wild to me to not be

198
00:11:16,440 --> 00:11:20,820
doing semi-regular, if not very
regular, health checks.

199
00:11:21,580 --> 00:11:22,080
Nikolay: Right.

200
00:11:22,280 --> 00:11:24,320
Yeah, I think you're very right.

201
00:11:24,320 --> 00:11:28,080
There are two categories of people
who, for example, go to us for

202
00:11:28,080 --> 00:11:28,680
health checks.

203
00:11:28,680 --> 00:11:34,700
One is probably the bigger category
of people.

204
00:11:35,020 --> 00:11:39,920
People who knew that it's good
to have a health check, but they

205
00:11:39,920 --> 00:11:43,340
only come to us when something happens,
something bad happens,

206
00:11:44,640 --> 00:11:45,780
something is already hurting.

207
00:11:46,640 --> 00:11:50,880
And in this case, well, it's actually
like a psychological thing,

208
00:11:50,880 --> 00:11:51,280
right?

209
00:11:51,280 --> 00:11:55,160
For example, for a CTO, the Postgres
database is just one of the things

210
00:11:56,320 --> 00:11:57,240
in infrastructure.

211
00:11:59,860 --> 00:12:05,040
And when it hurts, only then do we
realize: okay, let's check what

212
00:12:05,140 --> 00:12:10,940
can hurt soon as well, what other
things can be a problem in

213
00:12:10,940 --> 00:12:12,680
this area, Postgres, right?

214
00:12:13,140 --> 00:12:16,860
This is a very good moment to perform
a health check.

215
00:12:17,220 --> 00:12:21,440
Another category is, I would say, 
wiser guys.

216
00:12:22,440 --> 00:12:29,120
They do health checks before a launch
or in the initial stage,

217
00:12:29,580 --> 00:12:32,840
but at the same time it's harder
to work with them because usually

218
00:12:32,840 --> 00:12:38,860
the project is smaller, and it's harder
to predict workload patterns,

219
00:12:39,300 --> 00:12:41,220
growth rates, and so on.

220
00:12:42,180 --> 00:12:47,340
But it is better, of course,
to start checking health initially

221
00:12:47,440 --> 00:12:49,500
and then return to it after some
time.

222
00:12:49,820 --> 00:12:53,040
Michael: Yeah, it can go hand in
hand with, like, stress testing

223
00:12:53,040 --> 00:12:53,440
as well,

224
00:12:53,440 --> 00:12:53,940
can't it?

225
00:12:53,940 --> 00:12:57,460
Like, I see a lot of people
doing this kind of thing before

226
00:12:57,840 --> 00:13:01,980
big events, like in the US you've
got Black Friday, Cyber Monday

227
00:13:01,980 --> 00:13:05,680
peak times where you know there's
going to be increased traffic

228
00:13:05,680 --> 00:13:06,880
or increased load.

229
00:13:07,240 --> 00:13:10,080
Before that, there's a lot of work
that people are doing often

230
00:13:10,080 --> 00:13:11,620
to just make sure they can handle it.

231
00:13:11,940 --> 00:13:13,860
Is that a good time to do a health
check?

232
00:13:14,200 --> 00:13:14,700
Nikolay: Right.

233
00:13:15,060 --> 00:13:18,340
Well, yes, but I would distinguish
these topics.

234
00:13:18,640 --> 00:13:22,320
Load testing and preparation for
some events like Black Friday

235
00:13:22,420 --> 00:13:24,060
is something else.

236
00:13:24,140 --> 00:13:29,440
Health checks, as I've been doing
them for many years, are quite

237
00:13:29,440 --> 00:13:31,600
an established field.

238
00:13:32,120 --> 00:13:38,900
They include review of settings,
review of bloat, let's talk bloat

239
00:13:38,900 --> 00:13:43,780
first, both table and index B-tree, then
index health, and we had an episode

240
00:13:43,780 --> 00:13:47,440
about bloat and an episode

241
00:13:47,440 --> 00:13:54,240
about index health
and maintenance. Then query analysis:

242
00:13:55,680 --> 00:14:01,100
do we have outliers that are
attacking us too much in various

243
00:14:01,100 --> 00:14:06,000
metrics, I mean queries in pg_stat_statements and other extensions.
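
The outlier hunt described here can be sketched as a simple ranking over rows from pg_stat_statements. Plain dicts stand in for the real view, and the `total_exec_time` column name assumes Postgres 13+ (earlier versions call it `total_time`):

```python
def top_outliers(rows, metric="total_exec_time", n=2):
    # Rank query entries (dicts mimicking pg_stat_statements rows) by one
    # metric and keep the worst offenders. In a real checkup you would
    # repeat this for calls, shared block reads, WAL bytes, and so on.
    return sorted(rows, key=lambda r: r[metric], reverse=True)[:n]
```

Running the same ranking over several metrics is what surfaces the different kinds of "attackers": a query can be cheap per call but dominant in total time, or rare but brutal on I/O.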

244
00:14:06,960 --> 00:14:10,600
Then static analysis, for example,
you mentioned int4 primary

245
00:14:10,600 --> 00:14:15,720
keys, lack of foreign keys and
so on.

246
00:14:15,720 --> 00:14:19,700
Because if we see it for the first time,
there are a lot of findings.

247
00:14:19,840 --> 00:14:24,860
If we see it a second time, maybe
since last time, developers added

248
00:14:24,860 --> 00:14:26,460
a lot of new stuff, right?

249
00:14:27,100 --> 00:14:28,680
And other areas as well.

250
00:14:28,680 --> 00:14:33,880
And capacity planning can be strategic,
like many years, or it

251
00:14:33,880 --> 00:14:38,860
can be before some specific event
like Black Friday for e-commerce.

252
00:14:39,640 --> 00:14:43,480
In this case, it may involve additional
actions like let's do

253
00:14:43,480 --> 00:14:47,960
load testing, but I would say health
check, normal health check,

254
00:14:48,340 --> 00:14:52,320
is a simpler task than proper load
testing.

255
00:14:52,660 --> 00:14:55,580
That's why I'm trying to say it's
a separate topic, because if

256
00:14:55,580 --> 00:14:58,640
it's involved, the health check
suddenly becomes bigger.

257
00:15:01,820 --> 00:15:05,140
You can think about it, if you
go, like in the US, if you go

258
00:15:05,140 --> 00:15:09,440
to a doctor, there is an annual health
checkup, covered by insurance

259
00:15:09,440 --> 00:15:10,460
always, right?

260
00:15:10,520 --> 00:15:14,440
So, it's like your primary care
physician doing some stuff, discussing

261
00:15:14,440 --> 00:15:18,720
with you, checking many things
actually, but it's not difficult.

262
00:15:19,340 --> 00:15:23,040
And then if there is some question,
you go to a specific doctor.

263
00:15:23,040 --> 00:15:25,740
And load testing is a specific
doctor, I would say.

264
00:15:26,480 --> 00:15:28,820
Michael: Yeah, I was just thinking
about timing, if it's more

265
00:15:28,820 --> 00:15:31,300
likely that people start the process
then.

266
00:15:31,380 --> 00:15:34,340
And just like maybe just before
they go on a holiday maybe they

267
00:15:34,340 --> 00:15:37,060
go for their annual checkup or
like I don't know if there's like

268
00:15:37,060 --> 00:15:40,120
times where people are more likely
to go for those than others

269
00:15:40,320 --> 00:15:42,460
But yeah, should we get into some
more specifics?

270
00:15:43,140 --> 00:15:46,060
Nikolay: Yeah, well, speaking of
time, to finish this topic: sometimes

271
00:15:46,080 --> 00:15:51,440
people just have periodic, you
know, like, life-cycle episodes

272
00:15:51,480 --> 00:15:52,860
in their work, right?

273
00:15:52,860 --> 00:15:55,900
For example, the financial year,
right?

274
00:15:55,940 --> 00:15:59,380
And you plan finances, you spend
them, and so on.

275
00:15:59,380 --> 00:16:00,360
You manage them.

276
00:16:00,520 --> 00:16:06,000
Similarly, proper capacity planning
and database health management

277
00:16:06,000 --> 00:16:07,360
should also be periodic.

278
00:16:07,580 --> 00:16:10,840
You would just have some point
of time, maybe after some events,

279
00:16:10,840 --> 00:16:14,060
for example, after Black Friday,
you think, okay, next year,

280
00:16:14,060 --> 00:16:20,400
what we expect our growth to look
like, what we expect our incident

281
00:16:20,400 --> 00:16:22,040
management to look like.

282
00:16:22,120 --> 00:16:25,600
And this is a good time to review
the health, understand ongoing

283
00:16:25,600 --> 00:16:27,900
problems and try to predict future
problems.

284
00:16:28,080 --> 00:16:33,480
You also perform capacity planning
to understand do we need to

285
00:16:33,480 --> 00:16:38,600
plan big changes, like architectural
changes, to sustain the

286
00:16:38,600 --> 00:16:42,280
big growth if we predict it in
a couple of years.

287
00:16:42,520 --> 00:16:45,680
In this case probably you need
to think about microservices or

288
00:16:45,680 --> 00:16:47,380
sharding or something like that.

289
00:16:48,320 --> 00:16:51,540
And a health check can include this
kind of overview.

290
00:16:52,120 --> 00:16:56,960
And this is a big distinction from
monitoring, because monitoring

291
00:16:57,400 --> 00:17:00,120
very often has very small retention.

292
00:17:01,220 --> 00:17:03,840
You see only a couple of weeks
or a couple of months, that's

293
00:17:03,840 --> 00:17:04,340
it.

294
00:17:05,460 --> 00:17:10,260
For an existing evolved system, you
need a few years of observation,

295
00:17:10,560 --> 00:17:13,540
data points; it's better to have them,
right?

296
00:17:13,780 --> 00:17:17,120
To understand what will happen
in the next years.

297
00:17:18,080 --> 00:17:23,440
And this is where a health check
and capacity planning can help.

298
00:17:23,940 --> 00:17:27,080
Maybe load testing as well, but
it's additionally, I would say.

299
00:17:27,180 --> 00:17:28,380
Yeah, specific topics.

300
00:17:29,580 --> 00:17:32,860
So the tool starts usually for
me specifically.

301
00:17:33,260 --> 00:17:35,920
We actually have this tool, Postgres
Checkup and it has a big

302
00:17:35,920 --> 00:17:36,420
plan.

303
00:17:37,120 --> 00:17:40,900
The tool itself implements only,
I would say, less than 50% of

304
00:17:40,900 --> 00:17:41,540
the plan.

305
00:17:41,580 --> 00:17:44,840
The plan was my vision of health checks
several years ago.
several years ago.

306
00:17:45,260 --> 00:17:48,220
I did conference talks about it.

307
00:17:49,540 --> 00:17:54,480
At that time, I remember I performed
really heavy health checks

308
00:17:54,480 --> 00:17:56,620
for a couple of really big companies.

309
00:17:57,440 --> 00:18:00,040
It took a few weeks of work.

310
00:18:00,920 --> 00:18:06,840
It's a lot, like deep and
wide, like a 100-page report,

311
00:18:07,440 --> 00:18:10,020
a PDF of 100 pages, it's interesting.

312
00:18:10,460 --> 00:18:15,420
Executive summary alone was like
5 or 6 pages, like quite big.

313
00:18:15,420 --> 00:18:15,920
Wow.

314
00:18:16,740 --> 00:18:19,140
Yeah, yeah, well, it's a serious
thing.

315
00:18:20,500 --> 00:18:24,780
But it was for a company that
was worth more than $10 billion.

316
00:18:26,180 --> 00:18:30,360
And the Postgres databases, a few
clusters, were

317
00:18:30,360 --> 00:18:34,340
at the center of this serious business,
so it was worth it.

318
00:18:34,900 --> 00:18:38,860
And then, this was the vision,
and in this

319
00:18:38,860 --> 00:18:42,220
tool we implemented almost
half of this vision.

320
00:18:43,260 --> 00:18:45,980
And for me everything starts with
version.

321
00:18:46,400 --> 00:18:47,000
Simple thing.

322
00:18:47,000 --> 00:18:47,980
Let's check version.

323
00:18:48,520 --> 00:18:49,620
Major and minor.

324
00:18:49,960 --> 00:18:51,480
And maybe history of it.

325
00:18:51,760 --> 00:18:55,220
Because both things matter a lot.

326
00:18:55,340 --> 00:19:00,060
If your current version is 11,
you're already out of what?

327
00:19:00,060 --> 00:19:01,380
Out of normal life, right?

328
00:19:01,380 --> 00:19:04,840
Because the community won't help you,
and bugs won't be fixed unless

329
00:19:04,840 --> 00:19:05,820
you pay a lot.

330
00:19:06,160 --> 00:19:08,300
Michael: And out of security patches
as well.

331
00:19:08,360 --> 00:19:11,540
Nikolay: Right, yeah, that includes
security also, bugs.

332
00:19:11,820 --> 00:19:15,180
If you're on 12, you have time
until November, right?

333
00:19:15,540 --> 00:19:18,540
So, major version is important,
and also you are missing a lot

334
00:19:18,540 --> 00:19:19,380
of good stuff.

335
00:19:20,640 --> 00:19:25,640
For example, from one of my recent experiences,
if you're on 12, it

336
00:19:25,640 --> 00:19:31,760
not only means that you will soon
be running unsupported

337
00:19:33,100 --> 00:19:37,800
Postgres, but also you lack good
things like WAL-related metrics

338
00:19:38,240 --> 00:19:41,920
in pg_stat_statements and explain
plans, which were added in the next

339
00:19:41,920 --> 00:19:43,380
version, 13, right?

340
00:19:43,940 --> 00:19:48,040
You lack a lot of improvements
related to logical decoding and

341
00:19:48,040 --> 00:19:48,540
replication.

342
00:19:49,460 --> 00:19:55,180
Probably you need to move data
to Snowflake or from Postgres

343
00:19:55,180 --> 00:19:57,940
to Postgres and you miss a lot
of good stuff.

344
00:19:58,260 --> 00:20:02,960
So major version also matters a
lot, but the minor version may be

345
00:20:02,960 --> 00:20:04,000
the number 1 thing.

346
00:20:04,000 --> 00:20:08,680
If you also lag in minor version,
obviously you probably miss

347
00:20:08,680 --> 00:20:11,080
some important fixes and security
patches.
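
Measuring how far behind a server is on minor releases is simple arithmetic over version strings. A sketch assuming Postgres 10+ "major.minor" numbering — the latest minor number would come from the release notes, not from this code:

```python
def minor_lag(current, latest):
    # How many minor releases behind a server is, e.g. "12.8" vs "12.18".
    # Each skipped minor release is a batch of unapplied bug fixes and
    # security patches.
    cur_major, cur_minor = (int(p) for p in current.split("."))
    new_major, new_minor = (int(p) for p in latest.split("."))
    if cur_major != new_major:
        raise ValueError("compare versions within one major release")
    return new_minor - cur_minor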

348
00:20:11,680 --> 00:20:19,580
So here our tool is using Why Upgrade
from depesz.com, with links to explain

349
00:20:19,700 --> 00:20:20,200
what...

350
00:20:21,020 --> 00:20:25,020
Yeah, it's a great tool, but now
we're already moving towards

351
00:20:25,080 --> 00:20:25,580
combining...

352
00:20:27,380 --> 00:20:29,440
It's a huge list sometimes, right?

353
00:20:30,180 --> 00:20:35,740
If you're lagging 10 minor versions
behind, it's more than a

354
00:20:35,740 --> 00:20:38,700
couple of years of fixes.

355
00:20:38,940 --> 00:20:41,420
And you're like, oh, what's important
here?

356
00:20:42,980 --> 00:20:45,140
Security, but maybe not only security.

357
00:20:45,720 --> 00:20:50,720
So, yeah, we are moving towards
trying to summarize this and

358
00:20:51,680 --> 00:20:53,500
understand particular context.

359
00:20:53,740 --> 00:20:55,220
Of course, guess what helps?

360
00:20:55,480 --> 00:20:56,820
LLM helps here.

361
00:21:00,720 --> 00:21:02,900
Like it's, yeah, I think in the
nearest future we will have interesting

362
00:21:02,900 --> 00:21:07,360
things here, like combining observations,
the Why Upgrade diff,

363
00:21:07,360 --> 00:21:12,120
and a summary of what matters
for you, almost a part

364
00:21:12,120 --> 00:21:17,240
of an executive summary that can be read
by a CTO, with links, so like

365
00:21:17,240 --> 00:21:18,240
it's not like...

366
00:21:18,560 --> 00:21:19,060
Yeah

367
00:21:19,360 --> 00:21:22,520
Michael: I hope you're right long
term, but I think at the moment,

368
00:21:22,660 --> 00:21:26,860
the release notes are important
enough that the risk of missing

369
00:21:26,860 --> 00:21:29,200
something very important is still
very high.

370
00:21:29,200 --> 00:21:33,960
Like I think sometimes the LLMs
at the moment still, they can

371
00:21:33,960 --> 00:21:36,920
be good for like idea generation,
but if you want an exhaustive

372
00:21:37,160 --> 00:21:41,580
list of what's very important,
they can miss things or make things

373
00:21:41,580 --> 00:21:41,920
up.

374
00:21:41,920 --> 00:21:46,020
So I would still not rely on them
for things like reading, which

375
00:21:46,020 --> 00:21:49,540
like even minor release notes can
include things that are very

376
00:21:49,540 --> 00:21:53,300
important to do, like maybe a re-indexing
is needed and things

377
00:21:53,300 --> 00:21:54,000
like that.

378
00:21:54,960 --> 00:21:59,560
Nikolay: Yeah, I agree with you,
but I think you are criticizing

379
00:21:59,640 --> 00:22:04,040
a different idea than I'm describing.

380
00:22:04,220 --> 00:22:07,020
I'm not saying, like, first of
all, health checks we are doing

381
00:22:07,020 --> 00:22:10,580
and I think in the next few years
we'll still be doing them with a

382
00:22:10,580 --> 00:22:13,960
lot of manual steps, for serious
projects.

383
00:22:14,540 --> 00:22:18,920
So they, of course, involve a DBA
who is looking at everything.

384
00:22:19,640 --> 00:22:21,980
But LLM can help here.

385
00:22:23,760 --> 00:22:25,580
You usually have time pressure.

386
00:22:26,600 --> 00:22:29,560
You need to do this work and a
lot of stuff.

387
00:22:30,420 --> 00:22:33,500
And a changelog with 200 items,
for example.

388
00:22:35,640 --> 00:22:42,140
Of course you check them, but I
bet no DBA spends significant

389
00:22:42,400 --> 00:22:44,280
time to check properly.

390
00:22:45,740 --> 00:22:50,200
They say they check, but I don't
fully trust humans here either.

391
00:22:50,740 --> 00:22:55,040
Michael: So sorry, I think including
them in a report is different

392
00:22:55,040 --> 00:22:56,820
to reading them before upgrading.

393
00:22:56,920 --> 00:22:58,120
That's a really good point.

394
00:22:58,780 --> 00:23:03,060
Nikolay: Right, so what I'm trying
to say, we try to convince

395
00:23:03,080 --> 00:23:06,960
clients that it's important to
upgrade, both minor and major.

396
00:23:07,900 --> 00:23:11,320
And while explaining, of course
we can provide, not we can, we

397
00:23:11,320 --> 00:23:15,400
provide the full list of changes;
Why Upgrade is great for this too.

398
00:23:15,760 --> 00:23:22,620
But also we want to bring the most
sound arguments, right?

399
00:23:24,280 --> 00:23:26,700
And LLM helps to highlight.

400
00:23:28,380 --> 00:23:31,020
It was my long-term vision a few
years ago.

401
00:23:31,020 --> 00:23:32,540
I thought: Why Upgrade?

402
00:23:32,540 --> 00:23:35,240
You take it and it would be great
to understand.

403
00:23:36,200 --> 00:23:39,460
There is an item about GiST indexes,
or GIN indexes, for example.

404
00:23:40,080 --> 00:23:43,560
We have a schema here, we can check
if we have GIN indexes, right?

405
00:23:44,600 --> 00:23:45,840
Does it affect us?

406
00:23:46,080 --> 00:23:48,920
And if it does, this should be
highlighted.

407
00:23:49,940 --> 00:23:51,000
This is the idea.

408
00:23:51,040 --> 00:23:52,560
And LLM is good here.

409
00:23:52,720 --> 00:23:56,120
It's like it just performs some
legwork here.

410
00:23:56,120 --> 00:24:00,780
I mean, not only LLM, some automation
with LLM.

411
00:24:01,560 --> 00:24:06,700
And it just helps DBAs not to miss
important things, you know?

412
00:24:06,880 --> 00:24:11,460
But of course, again, I'm sure
it should be combined, like ideally

413
00:24:11,480 --> 00:24:17,080
it should be the work from automation
and humans.

414
00:24:17,980 --> 00:24:20,060
Okay, enough about versions.

415
00:24:20,100 --> 00:24:23,080
Actually, it's the smallest item
usually, right?

416
00:24:23,240 --> 00:24:26,760
We don't have time for everything
else. What to do? Okay, let's

417
00:24:26,760 --> 00:24:28,600
speed up maybe a little bit.

418
00:24:28,940 --> 00:24:31,220
Michael: I think we've mentioned
a bunch of them already, haven't

419
00:24:31,220 --> 00:24:32,640
we? Like, you've done...

420
00:24:32,640 --> 00:24:34,540
Nikolay: Yeah, right.

421
00:24:34,540 --> 00:24:37,580
We had episodes separately, but
they are super important.

422
00:24:38,040 --> 00:24:39,780
Settings usually go second.

423
00:24:39,800 --> 00:24:40,300
Settings.

424
00:24:40,760 --> 00:24:41,880
It's a huge topic.

425
00:24:42,340 --> 00:24:48,340
But, from my experience, if I see
autovacuum settings being

426
00:24:48,340 --> 00:24:52,680
default, or checkpoint settings,
first of all, max_wal_size,

427
00:24:52,680 --> 00:24:56,540
in my opinion, the most important
one, being default 1 gigabyte,

428
00:24:56,940 --> 00:24:59,080
or logging settings being default.

429
00:24:59,680 --> 00:25:00,360
What else?

430
00:25:01,300 --> 00:25:02,380
These kinds of things.

431
00:25:02,380 --> 00:25:05,180
Michael: We did a whole... We've
done one on the default configuration.

432
00:25:06,740 --> 00:25:09,660
Nikolay: Our favorite, random_page_cost being 4.

433
00:25:10,440 --> 00:25:11,860
Recently we discussed it.

434
00:25:12,280 --> 00:25:16,220
These are strong signs to me that
we need to pay attention here.

435
00:25:16,560 --> 00:25:20,000
Even if I didn't see the database
yet, I already know that this

436
00:25:20,000 --> 00:25:20,820
is a problem.

437
00:25:21,180 --> 00:25:23,600
Michael: So this is really useful
for that first health check

438
00:25:23,600 --> 00:25:24,140
you ever do.

439
00:25:24,140 --> 00:25:26,820
Maybe you're a new DBA or new developer
on a team.

440
00:25:28,740 --> 00:25:31,440
When you're doing a second health
check or subsequent ones, are

441
00:25:31,440 --> 00:25:34,960
you diffing this list of settings
and seeing what's changed in

442
00:25:34,960 --> 00:25:35,780
the last year?

443
00:25:36,060 --> 00:25:36,660
Nikolay: Good question.

444
00:25:36,660 --> 00:25:40,940
We have reports that can help us
to understand that, for example,

445
00:25:40,940 --> 00:25:45,780
checkpoint settings need reviewing,
reconsideration,

446
00:25:45,780 --> 00:25:48,240
or autovacuum needs reconsideration.

447
00:25:48,920 --> 00:25:50,380
So yeah, you're right.

448
00:25:50,380 --> 00:25:53,720
I'm just coming here with like,
I first met this database and

449
00:25:53,720 --> 00:25:54,560
reviewed it.

450
00:25:55,240 --> 00:26:00,020
But yeah, second is easier usually,
but if it grows fast, it

451
00:26:00,020 --> 00:26:01,440
can be challenging as well.

452
00:26:01,440 --> 00:26:03,340
So sometimes we need to revisit.

453
00:26:03,820 --> 00:26:08,040
Yeah, but settings, it's a huge
topic.

454
00:26:08,760 --> 00:26:12,940
I just wanted to highlight the
easy, low-hanging fruit we usually

455
00:26:12,940 --> 00:26:13,440
have.

456
00:26:13,440 --> 00:26:15,920
Default is a low-hanging fruit.

457
00:26:16,080 --> 00:26:20,780
For example, if the autovacuum
scale factor, for analyze or regular vacuum,

458
00:26:21,460 --> 00:26:27,740
it's default like 10%, for any
OLTP database I would say it's

459
00:26:27,980 --> 00:26:29,660
too high, way too high.

460
00:26:31,100 --> 00:26:32,380
So many things like that.
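As a hedged sketch of the "is this still default?" check described here, one can compare current values against compiled-in defaults in pg_settings; the list of setting names below is only an illustration, not postgres-checkup's actual list:

```sql
-- Spot important settings still sitting at their compiled-in defaults.
-- (The name list is illustrative; a real checkup covers many more.)
SELECT name, setting, boot_val
FROM pg_settings
WHERE name IN ('max_wal_size',
               'random_page_cost',
               'autovacuum_vacuum_scale_factor',
               'log_min_duration_statement')
  AND setting = boot_val;  -- still default => worth a look
```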

461
00:26:32,380 --> 00:26:36,920
And logging settings, if I see
logging settings are default,

462
00:26:38,100 --> 00:26:42,180
I guess people don't look at databases
properly yet and we need

463
00:26:42,180 --> 00:26:43,500
to adjust them and so on.

464
00:26:43,500 --> 00:26:45,460
But second time is easier usually,
I agree.

465
00:26:45,460 --> 00:26:50,240
Second time you just revisit things
that change a lot.

466
00:26:50,540 --> 00:26:55,580
So, bloat estimates tables, indexes.

467
00:26:56,040 --> 00:26:56,880
Quite easy, right?

468
00:26:56,880 --> 00:27:00,840
We usually don't want, very roughly,
we don't want anything to

469
00:27:00,840 --> 00:27:01,980
be higher than 40-50%.

470
00:27:03,280 --> 00:27:06,060
Depends, of course, but this is
normal.

471
00:27:07,360 --> 00:27:11,260
And converting to number of bytes
always helps.

472
00:27:11,800 --> 00:27:15,380
Like, people say, oh, my database
is 1 terabyte, and I have like

473
00:27:15,380 --> 00:27:18,460
200 gigabytes of table and index
bloat.

474
00:27:18,560 --> 00:27:19,060
Wow.

475
00:27:19,100 --> 00:27:24,240
Of course we have this asterisk,
like a comment, a remark, that it's

476
00:27:24,240 --> 00:27:27,620
an estimate, it can be very very
wrong.
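Where the estimate looks suspicious, the pgstattuple extension can measure a table exactly; it is much heavier because it scans the whole relation (the table name below is hypothetical):

```sql
-- Exact (but heavy: full table scan) dead-space measurement,
-- unlike the lightweight estimate discussed above.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT dead_tuple_percent, free_percent
FROM pgstattuple('my_table');  -- hypothetical table name
```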

477
00:27:27,620 --> 00:27:31,680
We discussed, we had an episode
about bloat, and autovacuum

478
00:27:31,680 --> 00:27:34,160
as well, very related topic here.

479
00:27:34,280 --> 00:27:38,940
And then from table bloat, index
bloat, from index bloat we already

480
00:27:38,940 --> 00:27:44,940
go to index health, which of
course includes index bloat,

481
00:27:45,060 --> 00:27:49,640
but also unused indexes, redundant
indexes, well, rarely used

482
00:27:49,640 --> 00:27:53,400
indexes, like it's a rarely used
report.

483
00:27:54,240 --> 00:27:57,580
But also invalid indexes, unindexed
foreign keys.

484
00:27:58,260 --> 00:27:58,860
What else?

485
00:27:58,860 --> 00:27:59,533
Maybe that's it.
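Two of the checks from this list can be sketched with catalog queries, assuming direct SQL access (a sketch, not the tool's actual reports):

```sql
-- Invalid indexes, e.g. left behind by a failed
-- CREATE INDEX CONCURRENTLY:
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Indexes with zero scans since statistics were last reset:
SELECT schemaname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
```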

486
00:27:59,533 --> 00:28:00,420
Michael: What about overlapping indexes?

487
00:28:02,040 --> 00:28:03,840
Nikolay: Redundant, we call that
redundant.

488
00:28:06,100 --> 00:28:11,520
So yeah, and we suggest talking about
index maintenance and I usually,

489
00:28:11,520 --> 00:28:16,160
if it's the first time, I explain
why it's so important to implement

490
00:28:16,160 --> 00:28:21,360
index maintenance and why we should
understand that we do need

491
00:28:21,400 --> 00:28:23,500
to rebuild indexes from time to
time.

492
00:28:24,440 --> 00:28:26,140
Michael: Are you doing any corruption
checks?

493
00:28:26,140 --> 00:28:28,500
Or I'm guessing maybe not in this
case.

494
00:28:28,500 --> 00:28:30,560
Nikolay: Oh, that's a great question
actually.

495
00:28:30,860 --> 00:28:35,820
I think we can include this, but
a proper check is quite heavy.

496
00:28:36,180 --> 00:28:36,680
Michael: Yeah.

497
00:28:36,820 --> 00:28:39,960
And one of the nice things about
this tool is it's like a lot of

498
00:28:39,960 --> 00:28:43,360
these, it's been designed to be
lightweight in this case, right?

499
00:28:43,380 --> 00:28:45,840
Nikolay: Yeah, it's very lightweight.

500
00:28:46,460 --> 00:28:49,320
It was battle-tested in very heavily
loaded clusters.

501
00:28:50,640 --> 00:28:55,880
And it limits itself with statement
timeout like 30 seconds or

502
00:28:55,880 --> 00:28:56,660
even 15.

503
00:28:57,180 --> 00:29:01,400
I don't remember off the top of my
head, but if it takes too long,

504
00:29:01,400 --> 00:29:04,940
sometimes we just miss some reports
because of the statement timeout.

505
00:29:04,940 --> 00:29:07,000
We don't want to be a problem.

506
00:29:07,200 --> 00:29:08,080
Observer effect.

507
00:29:08,360 --> 00:29:08,860
Michael: Yeah.

508
00:29:08,940 --> 00:29:09,640
Nikolay: So, yeah.

509
00:29:09,640 --> 00:29:14,560
And if we talk about indexes, okay,
index maintenance, quite

510
00:29:15,180 --> 00:29:17,780
straightforward topic already,
quite well-studied.

511
00:29:18,660 --> 00:29:22,480
And of course, we don't forget replicas,
and we also don't forget

512
00:29:22,480 --> 00:29:26,000
about the age of statistics, because
if it was reset yesterday,

513
00:29:26,320 --> 00:29:29,260
we cannot draw proper conclusions,
right?

514
00:29:29,260 --> 00:29:33,240
Because maybe these indexes were
not used in the last 24 hours.
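The "age of statistics" caveat can be checked directly, for example:

```sql
-- How long ago were the cumulative statistics reset
-- for the current database?
SELECT datname,
       stats_reset,
       now() - stats_reset AS stats_age
FROM pg_stat_database
WHERE datname = current_database();
```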

515
00:29:33,240 --> 00:29:37,400
But okay, anyway, we had the index
maintenance episode.

516
00:29:37,400 --> 00:29:37,820
Yes.

517
00:29:37,820 --> 00:29:39,120
Michael: I have a couple of questions.

518
00:29:39,480 --> 00:29:42,180
I have a couple of questions on
the cluster health.

519
00:29:42,180 --> 00:29:46,420
So, replication, are you looking
at delay at all?

520
00:29:46,420 --> 00:29:49,780
Or what are you looking at when
it comes to replication health?

521
00:29:50,340 --> 00:29:52,120
Nikolay: That's a good question.

522
00:29:52,360 --> 00:29:56,740
I doubt we have replication lags
being checked by this tool at

523
00:29:56,740 --> 00:29:57,240
all.

524
00:29:57,440 --> 00:29:59,440
It's usually present in monitoring.
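For reference, the lag check that monitoring usually covers looks roughly like this when run on the primary (a sketch, not a postgres-checkup report):

```sql
-- Per-standby physical replication lag in bytes,
-- measured on the primary.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
         AS replay_lag_bytes
FROM pg_stat_replication;
```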

525
00:30:00,520 --> 00:30:06,380
So as part of a broader health
check, we use not only the

526
00:30:06,380 --> 00:30:10,040
postgres-checkup report, but
we also talk with the

527
00:30:10,040 --> 00:30:13,500
customer and check what they have
in monitoring, and sometimes

528
00:30:13,500 --> 00:30:17,760
we install secondary monitoring,
which we have, and gather at

529
00:30:17,760 --> 00:30:20,660
least 1 day of observations and
include everything.

530
00:30:20,900 --> 00:30:22,540
And also discussion matters.

531
00:30:22,780 --> 00:30:28,280
This is probably what automation
won't have anytime soon:

532
00:30:28,680 --> 00:30:34,660
live discussion, trying to understand
pain points, like what

533
00:30:34,660 --> 00:30:39,560
bothers you, what is the issue
recently, what issue do you expect

534
00:30:39,560 --> 00:30:40,620
to happen soon.

535
00:30:40,840 --> 00:30:46,680
This live discussion at a high level
may help prioritize further

536
00:30:46,680 --> 00:30:47,180
research.

537
00:30:48,300 --> 00:30:51,760
So proper health check includes
both automated reports and these

538
00:30:51,760 --> 00:30:52,440
kinds of...

539
00:30:52,440 --> 00:30:56,200
So there we can check replication,
but usually if people don't

540
00:30:56,200 --> 00:30:59,280
mention replication lags being
an issue, they have it in monitoring,

541
00:30:59,340 --> 00:31:01,040
we just skip this, usually.

542
00:31:01,400 --> 00:31:04,400
Sometimes they mention, for example,
a lot of WAL is generated

543
00:31:04,400 --> 00:31:07,600
and we say, okay, let's go to pg_stat_statements
and check WAL

544
00:31:07,600 --> 00:31:13,620
metrics, try to understand why
we generate so much WAL, optimize

545
00:31:13,620 --> 00:31:17,220
some queries, enable WAL compression
if it's not enabled, and

546
00:31:17,220 --> 00:31:18,360
so on and so on.
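The WAL investigation described here might look like this on Postgres 13+, where pg_stat_statements exposes WAL columns (a sketch; superuser rights assumed for the ALTER SYSTEM):

```sql
-- Top WAL-producing query groups (pg_stat_statements, PG13+).
SELECT queryid,
       calls,
       wal_bytes,
       left(query, 60) AS query_head
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;

-- And, if WAL compression is not enabled yet:
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
```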

547
00:31:19,280 --> 00:31:24,260
Also, disk, let's just check what
kind of disks you have.

548
00:31:24,380 --> 00:31:25,460
It's a big topic.

549
00:31:26,640 --> 00:31:30,100
Michael: Yeah, the last question I
had was another thing that I imagine

550
00:31:30,100 --> 00:31:33,280
is very important to check the
first time you want to know the

551
00:31:33,280 --> 00:31:37,920
overall system health, or at least
setup health, but might be

552
00:31:37,920 --> 00:31:42,940
tricky to automate or quickly run
a query for, is what's the

553
00:31:42,940 --> 00:31:44,940
health of the backup process or

554
00:31:45,800 --> 00:31:46,520
Nikolay: the risk?

555
00:31:46,920 --> 00:31:48,260
You're raising questions.

556
00:31:48,420 --> 00:31:52,580
Yeah, if you check the plan, which
is in the readme of the postgres-checkup,

557
00:31:52,580 --> 00:31:55,380
this is included there,
and the checkbox is not checked,

558
00:31:55,380 --> 00:31:56,080
of course.

559
00:31:56,180 --> 00:31:56,960
Michael: Oh, okay.

560
00:31:57,180 --> 00:31:57,660
Nice.

561
00:31:57,660 --> 00:32:01,120
Nikolay: Yeah, we thought about
automating this, but we have

562
00:32:01,120 --> 00:32:06,420
so many different situations, different
backup tools used, different

563
00:32:06,420 --> 00:32:08,800
managed providers and so on.

564
00:32:09,520 --> 00:32:11,140
Every time it's quite different.

565
00:32:11,960 --> 00:32:13,780
We have a wide range of situations.

566
00:32:14,480 --> 00:32:16,780
Michael: Yeah, I wasn't actually
thinking about postgres-checkup

567
00:32:17,040 --> 00:32:19,340
at all, I was thinking more about
a health check in general.

568
00:32:19,540 --> 00:32:22,720
As part of a health check, you'd
probably want to check you can

569
00:32:22,720 --> 00:32:23,100
restore.

570
00:32:23,100 --> 00:32:26,200
Nikolay: Did we have an episode
about backups and disaster recovery

571
00:32:26,200 --> 00:32:26,640
at all?

572
00:32:26,640 --> 00:32:28,080
If not, let's just plan it.

573
00:32:28,080 --> 00:32:33,160
Because it's not a topic for 1
minute, it's a huge topic and

574
00:32:33,160 --> 00:32:33,660
probably...

575
00:32:34,200 --> 00:32:34,820
Michael: Of course.

576
00:32:34,820 --> 00:32:39,320
Nikolay: Yeah, maybe number 1 topic
for DBA, because it's the

577
00:32:39,320 --> 00:32:44,360
worst nightmare if backups are
lost and data loss happened, right?

578
00:32:44,720 --> 00:32:48,460
Obviously, we'll take this into
consideration, unless this is

579
00:32:48,460 --> 00:32:49,360
a managed service.

580
00:32:49,360 --> 00:32:52,800
If it's a managed service, we have
guys who are responsible for

581
00:32:52,800 --> 00:32:53,500
it, right?

582
00:32:54,640 --> 00:32:59,980
Well, it brings us to this topic,
managed versus self-managed.

583
00:33:00,420 --> 00:33:06,600
Because for example, HA, since
not long ago, both RDS and GCP

584
00:33:06,600 --> 00:33:11,920
CloudSQL, they allow you to initiate
a failover to test it.

585
00:33:12,840 --> 00:33:18,760
So, you rely on them in terms of
HA, in terms of if 1 node is

586
00:33:18,760 --> 00:33:20,400
down, a failover will happen.

587
00:33:20,560 --> 00:33:23,180
But you can also test it, and this
is super important.

588
00:33:23,620 --> 00:33:28,660
Backups, it's like a black box, we
cannot access files as we discussed

589
00:33:28,660 --> 00:33:29,360
last time.

590
00:33:30,060 --> 00:33:32,140
But you can test restoration, right?

591
00:33:32,540 --> 00:33:33,880
And it's fully automated.

592
00:33:35,320 --> 00:33:39,600
And maybe you should do it if you
have a serious system running

593
00:33:39,600 --> 00:33:43,280
on RDS, because as I understand,
RDS itself doesn't do it for

594
00:33:43,280 --> 00:33:43,780
you.

595
00:33:44,100 --> 00:33:47,460
They rely on some global data,
as I understand.

596
00:33:47,560 --> 00:33:49,340
They don't test every backup.

597
00:33:50,140 --> 00:33:53,600
Michael: All I mean is if I was
a new DBA going into a new company

598
00:33:53,600 --> 00:33:57,140
and I want, like 1 of my first
tasks was to do a health check,

599
00:33:57,340 --> 00:34:00,440
this, that would be 1 of the things
that I would think of as

600
00:34:00,440 --> 00:34:02,520
including as in that health check.

601
00:34:02,920 --> 00:34:05,010
Maybe a new consultant would as
well.

602
00:34:05,160 --> 00:34:10,740
Nikolay: DR and HA, two fundamental
two-letter terms, right?

603
00:34:11,120 --> 00:34:14,820
You definitely, if you're starting
working with some database

604
00:34:15,060 --> 00:34:19,280
seriously, you definitely need
to allocate a good amount of time

605
00:34:19,280 --> 00:34:20,520
for both these topics.

606
00:34:20,980 --> 00:34:24,140
Like good means at least a couple
of days for both to deeply

607
00:34:24,140 --> 00:34:25,760
understand what...

608
00:34:25,760 --> 00:34:28,140
For example, if we talk about backups,
you need to understand

609
00:34:28,140 --> 00:34:35,140
RPO, RTO, real and desired, to
check if we have a document, where

610
00:34:35,140 --> 00:34:41,780
it's documented and procedures
defined to maintain these objectives.

611
00:34:43,940 --> 00:34:48,000
These are super serious topics
and for any infrastructure engineer

612
00:34:48,000 --> 00:34:51,960
I think it's quite clear, for any
SRE it should be quite clear

613
00:34:51,960 --> 00:34:56,200
that these are two SLOs that you
need to keep in mind and always

614
00:34:56,200 --> 00:34:59,980
revisit and maintain and have monitoring
that covers them and

615
00:34:59,980 --> 00:35:00,620
so on.

616
00:35:00,620 --> 00:35:03,360
But this is again, like this is
maybe...

617
00:35:03,920 --> 00:35:06,980
In health check, we just need to
check and understand if there

618
00:35:06,980 --> 00:35:08,040
is a problem here.

619
00:35:08,680 --> 00:35:11,980
To study this topic is similar
to a lot of testing topics.

620
00:35:11,980 --> 00:35:13,640
It's like a special doctor.

621
00:35:15,040 --> 00:35:15,760
So yeah.

622
00:35:16,720 --> 00:35:17,380
What else?

623
00:35:18,660 --> 00:35:21,820
To wrap up, maybe a couple of final
things.

624
00:35:21,820 --> 00:35:25,680
I like to check query analysis
in table form.

625
00:35:25,680 --> 00:35:30,280
In table form it's maybe quite exotic,
because people got used

626
00:35:30,280 --> 00:35:34,180
to some graphs of course, historical
data, it's great.

627
00:35:34,300 --> 00:35:38,720
But when I look at table data for
pg_stat_statements with all these

628
00:35:38,720 --> 00:35:43,980
metrics and derived metrics, it
gives me a very good feeling of

629
00:35:43,980 --> 00:35:44,700
the workload.

630
00:35:45,060 --> 00:35:47,580
So to understand that, it's great.

631
00:35:47,960 --> 00:35:49,260
And also static analysis.

632
00:35:49,600 --> 00:35:50,640
Michael: Ordered by what?

633
00:35:51,760 --> 00:35:52,860
Always ordered by total time.

634
00:35:53,360 --> 00:35:57,080
Total time, I think, because we're
looking at system-level health.

635
00:35:57,700 --> 00:35:58,480
Nikolay: Yeah, yeah, yeah.

636
00:35:58,980 --> 00:36:02,580
Total time is default for us because
we talk about resources,

637
00:36:03,240 --> 00:36:10,560
but if you aim to analyze the situation
for end-users, you need

638
00:36:10,560 --> 00:36:15,680
to order by mean time, and sometimes
you need to order by different

639
00:36:15,780 --> 00:36:16,120
metrics.

640
00:36:16,120 --> 00:36:21,400
For example, as we discussed
earlier, the WAL metric, for

641
00:36:21,400 --> 00:36:21,900
example.

642
00:36:22,060 --> 00:36:23,560
Sometimes temporary files, for
example.

643
00:36:23,560 --> 00:36:26,980
If we generate a lot of temporary
files, we want to order by

644
00:36:26,980 --> 00:36:27,540
temporary files.

645
00:36:27,540 --> 00:36:31,660
In this case, sometimes we just
generate additional reports manually.
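A sketch of such a manually generated report, swapping the ORDER BY metric depending on the question (column names as in pg_stat_statements on PG13+):

```sql
-- Resource view: order by total time.
-- For a temp-file investigation, swap the ORDER BY metric.
SELECT queryid,
       calls,
       total_exec_time,
       mean_exec_time,
       temp_blks_written
FROM pg_stat_statements
ORDER BY total_exec_time DESC   -- or: temp_blks_written DESC
LIMIT 20;
```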

646
00:36:32,280 --> 00:36:33,640
Michael: Yeah, makes sense.

647
00:36:35,820 --> 00:36:37,240
Nikolay: And actually that's it.

648
00:36:38,560 --> 00:36:39,840
Not difficult, right?

649
00:36:40,520 --> 00:36:41,020
Michael: No.

650
00:36:41,980 --> 00:36:44,640
Do you see a lot of people doing
this in-house?

651
00:36:45,100 --> 00:36:48,260
I know a lot of consultancies,
including yourselves, offer this

652
00:36:48,260 --> 00:36:49,080
as a service.

653
00:36:49,160 --> 00:36:51,700
Do you see the pros and cons of
those?

654
00:36:52,720 --> 00:36:58,780
Nikolay: Well, usually in-house,
if you do it forever, you don't

655
00:36:58,780 --> 00:37:02,320
see problems, or you don't prioritize
them well.

656
00:37:02,320 --> 00:37:08,220
Because it's like: I see we have bloat, but
we still have some kind of life,

657
00:37:08,360 --> 00:37:09,740
we are not down.

658
00:37:10,140 --> 00:37:16,340
I see we have, for example, a 5-terabyte
unpartitioned table, but

659
00:37:16,340 --> 00:37:19,060
it's kind of fine maybe, and so
on.

660
00:37:19,700 --> 00:37:23,760
But when external people come,
at least temporarily, or for example

661
00:37:23,760 --> 00:37:29,440
you hire a new DBA, DBRE, this
is a good point to revisit and

662
00:37:29,440 --> 00:37:33,980
reprioritize things and avoid problems
in the future.

663
00:37:34,860 --> 00:37:39,520
So that's why fresh blood, a fresh
look, is good here.

664
00:37:42,740 --> 00:37:50,740
But usually people get a lot of
value during the first checkup, the health

665
00:37:50,740 --> 00:37:55,820
check; then smart people implement
periodic health checks.

666
00:37:56,760 --> 00:38:01,900
And then just, I would say, in
1 or 2 years, if things change

667
00:38:02,080 --> 00:38:07,620
significantly, this is when you
need to talk again.

668
00:38:07,780 --> 00:38:14,720
Because if health check was full-fledged
in the first year, 1 or

669
00:38:14,720 --> 00:38:20,380
2 years will be fine and your team
will be handling all the

670
00:38:20,380 --> 00:38:24,560
problems, solving all the problems,
knowing what to do.

671
00:38:25,160 --> 00:38:26,780
Yeah, nice one.

672
00:38:28,380 --> 00:38:32,140
But honestly, give me any database,
I will find problems easily.

673
00:38:33,340 --> 00:38:37,500
I mean, actual live production
database, not some synthetic.

674
00:38:39,180 --> 00:38:44,400
Anyway, the summary is: don't
postpone this.

675
00:38:45,060 --> 00:38:47,580
Perform a health check: plan it and
do it.

676
00:38:48,900 --> 00:38:49,900
Check your health.

677
00:38:50,600 --> 00:38:52,160
Michael: I had one last thing to
mention.

678
00:38:52,160 --> 00:38:56,440
I met a few people at PG Day Paris,
so hi to everyone who's listening

679
00:38:56,440 --> 00:38:57,160
from there.

680
00:38:57,440 --> 00:39:00,560
And if you're watching on YouTube
and just listening to the audio,

681
00:39:00,760 --> 00:39:04,280
we do also have an audio-only version
in the show notes.

682
00:39:04,280 --> 00:39:09,520
So you don't have to waste your
precious bandwidth and data on

683
00:39:09,520 --> 00:39:10,460
watching us.

684
00:39:10,680 --> 00:39:13,380
And if you listened to the audio
and didn't know we had a YouTube,

685
00:39:13,380 --> 00:39:14,940
then now you know that too.

686
00:39:15,720 --> 00:39:16,880
Nikolay: Right, right.

687
00:39:16,960 --> 00:39:22,580
We also have a transcript if you
like reading better and I think

688
00:39:22,580 --> 00:39:23,660
it's quite good.

689
00:39:24,480 --> 00:39:25,460
It's quite good.

690
00:39:25,850 --> 00:39:30,140
Actually, guys who are listening,
please check transcript and

691
00:39:30,140 --> 00:39:35,780
let us know if you think it's good
to read or it's hard to read.

692
00:39:37,000 --> 00:39:42,100
I would like to know what you think
about the quality of transcripts.

693
00:39:42,560 --> 00:39:48,000
I was thinking about improving
this and having some...

694
00:39:48,640 --> 00:39:55,900
Not a book, but a kind of collection of
the discussions we've had over

695
00:39:55,900 --> 00:39:58,720
almost the last 2 years, as just a set of texts.

696
00:39:59,280 --> 00:40:02,120
Because some people just like reading,
right?

697
00:40:03,040 --> 00:40:08,940
So if people find the quality good,
I think it's worth investing

698
00:40:10,360 --> 00:40:15,060
in organizing it better and maybe
provide some more links and pictures

699
00:40:15,060 --> 00:40:16,100
maybe and so on.

700
00:40:16,120 --> 00:40:18,520
So this is the idea we discussed
some time ago.

701
00:40:18,520 --> 00:40:22,060
I think maybe I will pursue it
soon.

702
00:40:22,200 --> 00:40:25,880
But I need to know what people
think about the quality of recent

703
00:40:26,200 --> 00:40:28,340
transcripts we have on PostgresFM.

704
00:40:29,600 --> 00:40:32,340
Michael: What's the best way to
contact you for that?

705
00:40:33,620 --> 00:40:38,260
Nikolay: YouTube comments, Twitter,
or Twitter, right?

706
00:40:38,480 --> 00:40:38,940
Michael: Yeah, yeah.

707
00:40:38,940 --> 00:40:40,120
Nikolay: I think, yeah.

708
00:40:40,520 --> 00:40:41,000
Michael: Nice.

709
00:40:41,000 --> 00:40:41,500
Nikolay: Good.

710
00:40:41,540 --> 00:40:42,040
Michael: Yeah.

711
00:40:42,500 --> 00:40:43,640
Thanks so much, Nikolay.

712
00:40:44,440 --> 00:40:45,140
Nikolay: Thank you, Michael.

713
00:40:45,140 --> 00:40:45,715
Michael: Catch you next week.

714
00:40:45,715 --> 00:40:45,910
Bye.

715
00:40:45,910 --> 00:40:46,105
Bye-bye.

716
00:40:46,105 --> 00:40:46,885
See you next week.

717
00:40:46,885 --> 00:40:47,080
Bye.