1
00:00:00,060 --> 00:00:02,060
Nikolay: Hello, hello, this is
Postgres.FM.

2
00:00:02,060 --> 00:00:06,260
My name is Nikolay from Postgres.AI
and as usual, my co-host

3
00:00:06,260 --> 00:00:07,780
is Michael from pgMustard.

4
00:00:08,080 --> 00:00:08,800
Hi, Michael.

5
00:00:09,200 --> 00:00:10,100
Michael: Hello, Nikolay.

6
00:00:10,840 --> 00:00:16,360
Nikolay: And the topic I chose
is related to problems, acute

7
00:00:16,360 --> 00:00:20,880
problems, unpredictable, which
sometimes happen to production

8
00:00:21,420 --> 00:00:23,040
systems including Postgres.

9
00:00:24,340 --> 00:00:29,620
In many cases, databases are in the
center of the storm, and let's discuss

10
00:00:29,640 --> 00:00:32,220
how we manage this and how to manage
it better.

11
00:00:32,780 --> 00:00:38,660
So yeah, how to handle crisis situations
with production Postgres.

12
00:00:40,080 --> 00:00:44,860
I called it Postgres urgent care
or emergency room.

13
00:00:44,860 --> 00:00:49,240
I don't know like what's better
name here, but yeah, I guess

14
00:00:49,240 --> 00:00:53,180
this is something I can share from
my past experience.

15
00:00:53,800 --> 00:00:54,280
Yeah.

16
00:00:54,280 --> 00:00:55,580
So let's discuss this.

17
00:00:56,460 --> 00:00:57,040
Michael: Sounds good.

18
00:00:57,040 --> 00:01:00,060
And good point about the database
often being in the center of

19
00:01:00,060 --> 00:01:00,480
things.

20
00:01:00,480 --> 00:01:04,960
I think when you see large, sometimes
I guess it is often the

21
00:01:04,960 --> 00:01:08,580
large companies that we notice
on Twitter and things that post,

22
00:01:09,000 --> 00:01:11,540
you know, people start tweeting
that it's down.

23
00:01:11,580 --> 00:01:15,020
I think recently there was a big
GitHub incident and that was,

24
00:01:15,020 --> 00:01:19,040
I think, one of the first communications
was about it being database

25
00:01:19,040 --> 00:01:19,540
related.

26
00:01:20,020 --> 00:01:23,000
Nikolay: Database changes related.
Not only the database, but also

27
00:01:23,000 --> 00:01:27,100
database changes. When you
have a change involving the database,

28
00:01:27,100 --> 00:01:32,220
it means an increased risk of incident
in most cases, actually.

29
00:01:33,100 --> 00:01:35,880
But it was not about Postgres,
so let's exclude this case.

30
00:01:36,660 --> 00:01:37,660
It's MySQL.

31
00:01:38,680 --> 00:01:43,480
But I truly believe that the database,
as I usually say, is the

32
00:01:43,480 --> 00:01:48,160
heart of any tech system because
it's very, very stateful, right,

33
00:01:48,160 --> 00:01:49,340
because it's data.

34
00:01:49,780 --> 00:01:55,120
And since it's stateful, it's really
hard to scale it and handle

35
00:01:55,120 --> 00:01:56,600
performance and so on.

36
00:01:56,740 --> 00:02:01,920
So indeed, since the database is the heart
of our systems, production

37
00:02:01,920 --> 00:02:04,820
systems, it's often in the center
of storm.

38
00:02:06,040 --> 00:02:07,760
Michael: Where would you like to
start with this?

39
00:02:08,200 --> 00:02:09,060
Nikolay: I have a plan.

40
00:02:09,060 --> 00:02:11,400
Let's have a two-step discussion.

41
00:02:11,540 --> 00:02:16,120
First, we'll discuss the psychological
part of incident management

42
00:02:16,120 --> 00:02:18,620
related to databases, and second,
purely technical.

43
00:02:19,060 --> 00:02:19,940
How about this?

44
00:02:20,080 --> 00:02:21,220
Michael: Yeah, I like it.

45
00:02:21,220 --> 00:02:24,320
The psychological aspect probably
isn't talked about as much

46
00:02:24,320 --> 00:02:25,080
as it could be.

47
00:02:25,080 --> 00:02:30,040
I think people often in postmortems
focus on the technical things

48
00:02:30,040 --> 00:02:32,700
that could be done differently
to avoid the issue reoccurring.

49
00:02:33,240 --> 00:02:37,760
I mean, that obviously is the most
important thing, but I feel

50
00:02:37,760 --> 00:02:42,160
like so many could also have learned
a lot from how they communicated

51
00:02:42,440 --> 00:02:44,080
or how much they communicated.

52
00:02:44,340 --> 00:02:49,160
Often I see almost no communication
from companies that are having

53
00:02:49,160 --> 00:02:52,660
big outages, very poor levels of
updates.

54
00:02:53,100 --> 00:02:56,240
And it wouldn't completely alleviate
the situation, of course.

55
00:02:56,240 --> 00:02:58,220
If you're down, you're still down.

56
00:02:58,540 --> 00:03:02,720
But I'd be surprised if companies
that communicate better don't

57
00:03:02,900 --> 00:03:06,540
have much better outcomes than
ones that don't.

58
00:03:06,660 --> 00:03:09,740
People looking to cancel services
afterwards, that kind of thing.

59
00:03:10,160 --> 00:03:13,680
Nikolay: Yeah, so indeed, it may
not be well discussed, but I

60
00:03:13,680 --> 00:03:17,920
think anyone who is dealing with
production at a larger scale definitely

61
00:03:18,260 --> 00:03:22,740
knows that, first of all, many
people manage this kind of stress

62
00:03:22,740 --> 00:03:24,360
not well and that's fine.

63
00:03:25,240 --> 00:03:28,160
Maybe it's not your kind of thing,
right?

64
00:03:28,660 --> 00:03:30,720
Some people manage stress very
well.

65
00:03:31,080 --> 00:03:32,200
It's not me, by the way.

66
00:03:32,200 --> 00:03:33,960
I manage stress moderately well.

67
00:03:33,960 --> 00:03:37,360
I've learned how to do it and so on,
but in the center of a production

68
00:03:37,360 --> 00:03:41,100
incident, I still find myself with
high emotions.

69
00:03:41,820 --> 00:03:43,880
We should do something very quickly,
right?

70
00:03:44,480 --> 00:03:45,260
It's hard.

71
00:03:46,260 --> 00:03:47,480
Also, tunnel vision.

72
00:03:47,980 --> 00:03:50,940
You see only some things and you
don't have time for anything

73
00:03:50,940 --> 00:03:51,440
else.

74
00:03:51,480 --> 00:03:55,260
It's very hard to relax and capture
the whole picture and so

75
00:03:55,260 --> 00:03:55,760
on.

76
00:03:56,120 --> 00:03:59,440
And that's why tooling is also
important here, right?

77
00:03:59,440 --> 00:04:03,960
Tooling should be designed like
for people who are under stress

78
00:04:03,960 --> 00:04:05,280
and some runbooks and so on.

79
00:04:05,280 --> 00:04:06,500
But this is technical stuff.

80
00:04:06,580 --> 00:04:08,860
We will talk about it slightly
later.

81
00:04:09,440 --> 00:04:12,440
So I think there are some trainings
and so on.

82
00:04:12,440 --> 00:04:17,920
I'm not a good source of advice here, but I know we should

83
00:04:17,920 --> 00:04:22,840
look at people who deal with production
systems, SREs, and so

84
00:04:22,840 --> 00:04:28,340
on, and there are many, many books
written about it, and handbooks, 

85
00:04:28,440 --> 00:04:29,840
runbooks, and so on.

86
00:04:29,840 --> 00:04:34,940
So, yeah, there are some good practices
how to deal with such

87
00:04:34,940 --> 00:04:36,140
stress and so on.

88
00:04:36,140 --> 00:04:38,040
Michael: What are your favorites
of those?

89
00:04:38,220 --> 00:04:41,280
Nikolay: I like to use materials
from Google, of course.

90
00:04:41,280 --> 00:04:44,740
There is sre.google, this is the hostname.

91
00:04:45,660 --> 00:04:49,240
There are 3 books there and quite
good content.

92
00:04:49,900 --> 00:04:54,740
I also like the handbook from GitLab
for the production SRE team.

93
00:04:54,960 --> 00:04:58,000
Michael: I seem to remember Netflix
having some good stuff as

94
00:04:58,000 --> 00:04:58,440
well.

95
00:04:58,440 --> 00:04:59,020
Nikolay: Yeah, yeah.

96
00:04:59,020 --> 00:05:02,260
Well, we are in the area like it's
not only about databases,

97
00:05:02,320 --> 00:05:03,280
of course, right?

98
00:05:03,340 --> 00:05:07,440
It's the SRE area basically and there
are many good materials for

99
00:05:07,440 --> 00:05:07,860
this.

100
00:05:07,860 --> 00:05:11,920
But specifically for databases,
I think in general, if you feel

101
00:05:11,920 --> 00:05:15,220
emotional, maybe you should tell
this to colleagues or maybe

102
00:05:15,280 --> 00:05:17,420
someone else should be acting more.

103
00:05:17,740 --> 00:05:20,140
You need to understand yourself
basically, right?

104
00:05:20,240 --> 00:05:26,120
If you feel… I sometimes feel myself…
I remember we had a Black

105
00:05:26,120 --> 00:05:30,260
Friday, very boring because we
were very well prepared.

106
00:05:30,920 --> 00:05:35,040
We had like a lot of stuff, like
it's a large company, e-commerce,

107
00:05:35,200 --> 00:05:39,020
and we had very good preparation,
a war room organized, and we

108
00:05:39,020 --> 00:05:42,440
were prepared for incidents, and the
whole Black Friday was so boring.

109
00:05:42,880 --> 00:05:47,780
So when we finally found some incident
at 9 p.m., I was happy.

110
00:05:47,780 --> 00:05:49,180
Finally, we have some material.

111
00:05:50,080 --> 00:05:54,000
And recently, I was helping
some customers, and I also, I

112
00:05:54,000 --> 00:05:55,960
remember exactly this state.

113
00:05:56,760 --> 00:05:59,120
Like, finally something interesting,
you know?

114
00:05:59,320 --> 00:06:01,160
This is a great state to be in.

115
00:06:01,360 --> 00:06:05,860
Instead of being stressed and everything
is on your shoulders,

116
00:06:05,860 --> 00:06:10,280
you don't know what to do or maybe
you know, but what if it won't

117
00:06:10,280 --> 00:06:10,780
work?

118
00:06:11,960 --> 00:06:16,200
You have some good materials to
check, like some monitoring systems

119
00:06:16,200 --> 00:06:18,780
to check, but so stressed, right?

120
00:06:18,960 --> 00:06:21,500
Because you have fears of failure,
right?

121
00:06:21,960 --> 00:06:26,880
What if you won't be able to bring
database up again, back up

122
00:06:27,400 --> 00:06:31,760
during many hours, and then everything
is disaster.

123
00:06:32,380 --> 00:06:37,060
But if you're in the state of, like,
this is interesting, this is

124
00:06:37,060 --> 00:06:41,020
like finally we have some work
to do in production, let's just...

125
00:06:41,480 --> 00:06:43,380
I have a high level of curiosity.

126
00:06:44,480 --> 00:06:45,840
Maybe it's a new case.

127
00:06:46,360 --> 00:06:47,580
This comes with experience.

128
00:06:47,840 --> 00:06:51,560
You saw many cases, but you are
looking for new kinds of cases

129
00:06:51,560 --> 00:06:54,360
because it's already too boring
to deal with the same kind of

130
00:06:54,360 --> 00:06:59,440
corruption again or I don't know,
like some database down again.

131
00:07:00,060 --> 00:07:01,220
You saw it many times.

132
00:07:01,360 --> 00:07:05,820
And you are hunting for new types
of cases and curiosity helps.

133
00:07:06,140 --> 00:07:09,340
So these are 2 very opposite states,
I would say.

134
00:07:09,340 --> 00:07:11,900
And I was in both in my life.

135
00:07:12,380 --> 00:07:13,240
So yeah.

136
00:07:13,900 --> 00:07:16,460
Michael: For you, does
it depend a bit on the severity

137
00:07:16,640 --> 00:07:20,280
though because for me even if it
was 9 p.m.

138
00:07:20,280 --> 00:07:24,640
And I'd been hoping for some interesting
case to come up if it

139
00:07:24,640 --> 00:07:28,620
was super serious and the whole
like everything was down I wouldn't

140
00:07:28,620 --> 00:07:30,840
be happy that we finally got…

141
00:07:30,900 --> 00:07:35,720
Nikolay: Well honestly I didn't
have cases when like for example

142
00:07:36,440 --> 00:07:38,460
life of people depends on this.

143
00:07:39,900 --> 00:07:44,200
I can assume this might happen
with some systems, but I was in

144
00:07:44,200 --> 00:07:47,300
cases when the cost of downtime was
super high.

145
00:07:49,020 --> 00:07:51,180
And now I'm not scared already,
you know?

146
00:07:51,760 --> 00:07:54,800
I already had it, right?

147
00:07:54,800 --> 00:07:57,020
So I know how it feels and so on.

148
00:07:57,160 --> 00:07:59,440
I'm not super scared if it's only
about money.

149
00:07:59,880 --> 00:08:03,700
But life-threatening downtime,
honestly, I didn't have it.

150
00:08:03,700 --> 00:08:08,900
And I think if it happened, I would
be very concerned, right?

151
00:08:09,160 --> 00:08:12,980
Maybe this realization that this
is only about money, and the worst

152
00:08:12,980 --> 00:08:15,860
thing that can happen, somebody
will lose money, and you will

153
00:08:15,860 --> 00:08:16,560
lose a job.

154
00:08:16,560 --> 00:08:18,740
It's not the worst case actually.

155
00:08:19,620 --> 00:08:20,400
Just relax.

156
00:08:21,020 --> 00:08:23,260
But life-threatening, it's another
story.

157
00:08:23,260 --> 00:08:27,040
I'm very curious if someone who
is listening to us has some system

158
00:08:27,780 --> 00:08:33,640
where the state of Postgres can
influence the health or life of people.

159
00:08:33,640 --> 00:08:34,500
This is interesting.

160
00:08:35,060 --> 00:08:38,800
Michael: Yeah, you do hear about
healthcare use cases, but also

161
00:08:39,060 --> 00:08:40,340
military use cases.

162
00:08:40,600 --> 00:08:41,500
Nikolay: Right, right.

163
00:08:41,580 --> 00:08:42,880
Yeah, it might happen.

164
00:08:42,920 --> 00:08:43,620
Might happen.

165
00:08:43,620 --> 00:08:44,720
I didn't have it.

166
00:08:45,140 --> 00:08:45,725
No, me neither.

167
00:08:45,725 --> 00:08:47,660
So, that's why I'm not super scared.

168
00:08:47,660 --> 00:08:50,280
It's like, okay, it's not a big
deal.

169
00:08:50,280 --> 00:08:51,600
I know it's a big deal.

170
00:08:52,060 --> 00:08:54,240
We will be professionally helping,
right?

171
00:08:54,240 --> 00:08:58,120
But let's just do what we can and
that's it.

172
00:08:58,860 --> 00:09:00,300
And that's why I'm curious.

173
00:09:01,420 --> 00:09:02,780
Is it a new case, finally?

174
00:09:03,180 --> 00:09:05,020
Okay, let's work on it.

175
00:09:05,020 --> 00:09:09,260
But again, back to my point, you
need to understand yourself.

176
00:09:09,320 --> 00:09:10,540
This is very important.

177
00:09:11,400 --> 00:09:16,560
If you know you react not well,
even to small problems, it's

178
00:09:16,560 --> 00:09:18,480
better maybe to be an assistant.

179
00:09:18,940 --> 00:09:19,440
Michael: Ah, yeah.

180
00:09:19,540 --> 00:09:22,860
I was doing some thinking before
the episode on what kinds of

181
00:09:23,000 --> 00:09:24,400
emergencies there could be.

182
00:09:24,400 --> 00:09:27,960
And a couple that I don't know
if you're 100% thinking of that

183
00:09:27,960 --> 00:09:32,220
would be really scary for me would
be like security-style incidents,

184
00:09:32,536 --> 00:09:35,484
either external or internal.

185
00:09:35,800 --> 00:09:36,500
Yeah, exactly.

186
00:09:36,580 --> 00:09:38,400
Nikolay: Like hackers acting right
now.

187
00:09:39,840 --> 00:09:42,200
Michael: That would be scary in
a different way, potentially.

188
00:09:42,740 --> 00:09:44,060
Or exciting in a different

189
00:09:44,060 --> 00:09:44,560
way.

190
00:09:44,680 --> 00:09:47,320
Nikolay: This is, I think this is CEO level
already.

191
00:09:47,320 --> 00:09:53,040
So definitely if something like
that happens, it's not only like

192
00:09:53,040 --> 00:09:57,180
there's a technical aspect here,
but it's also a very high level

193
00:09:57,180 --> 00:10:00,600
organizational aspect of it, how
to handle this situation properly.

194
00:10:01,160 --> 00:10:01,420
Right?

195
00:10:01,420 --> 00:10:04,020
Michael: So how - Oh, I was still
talking about psychologically

196
00:10:04,280 --> 00:10:04,780
though.

197
00:10:06,060 --> 00:10:08,900
Nikolay: Psychologically, but this
like decisions, like how to

198
00:10:08,900 --> 00:10:10,700
handle it, it's already CEO level.

199
00:10:11,040 --> 00:10:11,680
It happens.

200
00:10:11,880 --> 00:10:15,920
Recently, we received from our
insurance, I think we received

201
00:10:15,920 --> 00:10:20,380
like a regular routine notice that,
you know, our data was stolen

202
00:10:20,380 --> 00:10:20,880
again.

203
00:10:23,040 --> 00:10:24,060
And it just happens.

204
00:10:24,520 --> 00:10:27,380
Like, you know, we
don't know if your record is

205
00:10:27,380 --> 00:10:29,840
also stolen, maybe, no.

206
00:10:30,180 --> 00:10:34,680
And a couple of days later, I found
on GitHub a very good project.

207
00:10:35,220 --> 00:10:39,780
Some guy created a database of
all SSNs of all Americans and

208
00:10:39,780 --> 00:10:41,030
just published it on GitHub.

209
00:10:43,580 --> 00:10:49,720
The fact is that it's only 1,000,000,000,
like, that's how many values this

210
00:10:49,720 --> 00:10:50,580
9-digit number has.

211
00:10:51,140 --> 00:10:55,840
So he just published all numbers
up to 1,000,000,000,

212
00:10:55,840 --> 00:10:56,340
Michael: okay.

213
00:10:56,780 --> 00:11:01,760
Nikolay: But some people on Twitter
started thinking, oh, I found

214
00:11:01,760 --> 00:11:03,020
my SSN as well.

215
00:11:06,060 --> 00:11:08,040
It was like a snowball joke.

216
00:11:08,240 --> 00:11:11,680
Some people started, okay, I'm
going to remove my SSN, created

217
00:11:11,680 --> 00:11:12,420
pull request.

218
00:11:15,560 --> 00:11:16,320
It's funny.

219
00:11:18,760 --> 00:11:23,740
So back to this. If you know yourself,
it's good.

220
00:11:23,740 --> 00:11:26,960
It helps you understand your stress
level.

221
00:11:27,040 --> 00:11:30,860
On another note, it's funny that
we aim to monitor a database and

222
00:11:30,860 --> 00:11:34,400
production systems well, like with
second-level precision sometimes,

223
00:11:34,540 --> 00:11:36,300
but we don't monitor ourselves.

224
00:11:36,820 --> 00:11:38,180
Like cortisol level, right?

225
00:11:38,180 --> 00:11:42,100
It would be great to understand,
but we don't have it.

226
00:11:42,180 --> 00:11:45,700
This bothers me a lot, monitoring
of human bodies.

227
00:11:45,940 --> 00:11:49,120
I don't understand my own state
except how do I feel.

228
00:11:49,540 --> 00:11:50,740
So it will be good

229
00:11:50,740 --> 00:11:51,520
to see

230
00:11:51,820 --> 00:11:53,000
heart rate for example,
right?

231
00:11:53,000 --> 00:11:53,100
Michael: Yeah.

232
00:11:53,100 --> 00:11:56,480
Rings and watches
that monitor heart rate which

233
00:11:56,480 --> 00:11:59,800
is probably quite, like correlates
probably quite well with stress

234
00:11:59,800 --> 00:12:00,170
level.

235
00:12:00,170 --> 00:12:00,540
Nikolay: Yeah.

236
00:12:00,540 --> 00:12:04,640
Yeah. But let's maybe slowly shift
to technical stuff.

237
00:12:04,640 --> 00:12:07,520
So of course, knowing yourself
helps.

238
00:12:07,820 --> 00:12:10,360
If you… I wanted to share 1 story.

239
00:12:10,760 --> 00:12:15,600
Very long ago, 15 years ago or
so, I had a great team, a great

240
00:12:15,600 --> 00:12:16,100
startup.

241
00:12:17,100 --> 00:12:18,880
I was CTO, I think.

242
00:12:19,540 --> 00:12:21,240
Maybe no, I was CEO actually.

243
00:12:21,420 --> 00:12:28,920
But yeah, combining these 2 roles.
And I had a Postgres production

244
00:12:28,940 --> 00:12:33,580
system and great Postgres experts
in my team.

245
00:12:34,340 --> 00:12:37,800
And I remember 1 guy was a great
Postgres expert and I made

246
00:12:37,800 --> 00:12:38,360
a mistake.

247
00:12:38,360 --> 00:12:39,400
It was my mistake.

248
00:12:39,480 --> 00:12:43,020
I was leaving on a trip for a few
days, and I said, you will

249
00:12:43,020 --> 00:12:47,640
be responsible for production,
especially Postgres state, because

250
00:12:47,640 --> 00:12:50,580
he was the best Postgres expert
in my team, right?

251
00:12:50,740 --> 00:12:54,660
But it was an obvious mistake because
an incident happened and he

252
00:12:54,660 --> 00:13:01,500
couldn't handle it properly and
he was completely like… he lost

253
00:13:01,500 --> 00:13:03,140
his shit, sorry for my French.

254
00:13:03,740 --> 00:13:04,060
Right?

255
00:13:04,060 --> 00:13:06,980
Michael: So you mean technically he would
have been best placed to handle

256
00:13:06,980 --> 00:13:07,860
it in the team.

257
00:13:08,680 --> 00:13:13,140
Nikolay: A technical expert is not
necessarily good in terms of

258
00:13:13,140 --> 00:13:14,660
incident management, right?

259
00:13:16,060 --> 00:13:18,480
And this is my mistake, I didn't
recognize it.

260
00:13:19,400 --> 00:13:23,760
And this led to the end of our
cooperation, unfortunately.

261
00:13:24,520 --> 00:13:30,020
So sometimes good technical experts
should be an assistant, right?

262
00:13:30,020 --> 00:13:32,580
Not feel the pressure on their shoulders,
right?

263
00:13:33,040 --> 00:13:34,700
This is super important to understand.

264
00:13:35,280 --> 00:13:38,720
So my advice is, you know, like
try to understand yourself and

265
00:13:38,720 --> 00:13:41,780
whether you should be responsible for incident
management or just assisting

266
00:13:42,540 --> 00:13:43,060
technically, right?

267
00:13:44,080 --> 00:13:48,060
Michael: Yeah, know yourself, but
also know your team and know

268
00:13:48,060 --> 00:13:51,680
who in your team could, like who,
yeah, who you can call on for

269
00:13:51,680 --> 00:13:52,060
different things.

270
00:13:52,060 --> 00:13:52,560
Yeah.

271
00:13:53,400 --> 00:13:53,900
Nikolay: Yeah.

272
00:13:54,060 --> 00:13:56,260
Now, let's move to the technical
stuff.

273
00:13:56,400 --> 00:13:57,640
What is helpful?

274
00:13:58,440 --> 00:14:03,480
Very helpful, first of all, for
many small companies like the ones we deal with. Our

275
00:14:03,580 --> 00:14:08,380
main focus right now is companies
who are growing, startups, usually

276
00:14:08,380 --> 00:14:12,440
lacking database expertise and
many such companies come to us

277
00:14:12,440 --> 00:14:16,760
for help and almost none of them
have good incident management

278
00:14:16,760 --> 00:14:17,460
in place.

279
00:14:17,960 --> 00:14:19,740
It's not only about Postgres, right?

280
00:14:19,740 --> 00:14:25,320
We always suggest thinking about
at least a simple process because

281
00:14:25,320 --> 00:14:28,520
they say, we had an incident last
week.

282
00:14:28,600 --> 00:14:32,100
My question is, show us the incident
notes.

283
00:14:33,000 --> 00:14:35,040
Are they logged anyhow, like with
timestamps?

284
00:14:35,840 --> 00:14:39,520
In most cases, they don't have
anything but just words.

285
00:14:39,760 --> 00:14:40,580
They have words.

286
00:14:40,580 --> 00:14:45,140
Okay, we saw the database was slow,
then it was unresponsive, blah,

287
00:14:45,140 --> 00:14:45,700
blah, blah.

288
00:14:45,740 --> 00:14:50,000
What you must have for an incident
is a sequence. Like,

289
00:14:50,000 --> 00:14:56,120
we must have a document with artifacts:
the first known event that

290
00:14:56,120 --> 00:15:01,500
happened, some logs, screenshots
from monitoring, better with

291
00:15:01,500 --> 00:15:03,740
links so we can revisit it.

292
00:15:04,020 --> 00:15:08,300
But screenshots matter a lot because
sometimes monitoring has

293
00:15:08,440 --> 00:15:14,940
a small retention window and the investigation
might be long, especially

294
00:15:15,040 --> 00:15:18,780
if you involve external consultants
like us, right?

295
00:15:19,000 --> 00:15:24,620
So there should be some template
and a plan for documenting incidents.

296
00:15:25,380 --> 00:15:28,280
And when you have it, it also helps
with stress because you know

297
00:15:28,280 --> 00:15:28,780
what to do.

298
00:15:28,780 --> 00:15:33,640
You need to identify the first abnormal
event, document it, things

299
00:15:33,640 --> 00:15:36,900
before it, things after it; like, there's
some form it should take.

300
00:15:36,900 --> 00:15:42,280
And everything you notice is also
documented, important things highlighted.

301
00:15:42,600 --> 00:15:45,160
It can be a Google Doc or something
with discussion around it.

302
00:15:45,160 --> 00:15:49,540
It's good when it's possible to
discuss some things so people

303
00:15:49,540 --> 00:15:52,580
can ask questions, clarify, add
some additional knowledge and

304
00:15:52,580 --> 00:15:53,200
so on.

305
00:15:53,200 --> 00:15:54,980
It can be anything actually, right?

306
00:15:55,080 --> 00:15:58,700
But it's important to have, to
be prepared to document it.

307
00:15:59,100 --> 00:16:03,480
Michael: Yeah, I've seen a lot
of people start with like a, like

308
00:16:03,480 --> 00:16:07,300
whatever app you use for chat normally
in the team, or some people

309
00:16:07,300 --> 00:16:09,520
have like a different app for incidents
specifically.

310
00:16:09,800 --> 00:16:13,900
But if you're using Slack, for
example, start a new channel for

311
00:16:13,900 --> 00:16:17,020
the incident, all incident-related
stuff goes in there.

312
00:16:17,040 --> 00:16:18,340
Screenshots, logs,

313
00:16:18,420 --> 00:16:19,340
Nikolay: chat, everything.

314
00:16:20,060 --> 00:16:22,620
Michael: And then people turn it
into a doc later sometimes.

315
00:16:22,900 --> 00:16:25,340
But I could see an argument for
starting with the doc.

316
00:16:26,760 --> 00:16:30,200
But normally people are panicking
at the beginning, so chat makes

317
00:16:30,200 --> 00:16:30,700
sense.

318
00:16:31,240 --> 00:16:36,040
Nikolay: Yeah, chat is more convenient
for many people.

319
00:16:36,040 --> 00:16:38,620
It's what you use every day, so
chat is good.

320
00:16:38,740 --> 00:16:44,780
It's important to have long-term
storage for this document, converted

321
00:16:44,820 --> 00:16:45,480
to document.

322
00:16:45,480 --> 00:16:49,600
And I can say, like, most startups
which grew to a terabyte or

323
00:16:49,600 --> 00:16:53,040
a couple of terabytes in terms
of database size, most of them

324
00:16:53,040 --> 00:16:56,820
don't have proper incident management
workflow developed.

325
00:16:56,960 --> 00:16:58,220
They must have it.

326
00:16:58,380 --> 00:16:59,340
It's time already.

327
00:16:59,820 --> 00:17:01,840
So yeah, I definitely encourage it.

328
00:17:01,840 --> 00:17:05,200
Even if you have, like, a couple of
technical teams and technical

329
00:17:05,500 --> 00:17:09,520
experts in your team, still it's
super important to have an incident

330
00:17:09,520 --> 00:17:11,900
management workflow developed.

331
00:17:13,740 --> 00:17:17,880
So yeah, detailed, step-by-step,
so we understand what's happening.

332
00:17:18,220 --> 00:17:21,640
And you agree on the format of this
document in advance.

333
00:17:21,980 --> 00:17:24,280
You can use some other companies
as examples.

334
00:17:24,280 --> 00:17:29,600
Again, sre.google and the GitLab Handbook
for this particular area

335
00:17:29,620 --> 00:17:30,320
are useful.

336
00:17:31,020 --> 00:17:33,140
GitLab, for example, particularly
has an example

337
00:17:34,140 --> 00:17:35,240
for incident management.

338
00:17:35,860 --> 00:17:40,820
Many other companies also share
their templates and descriptions of

339
00:17:40,860 --> 00:17:42,420
how to document it properly.

340
00:17:42,980 --> 00:17:43,760
Super important.

341
00:17:44,200 --> 00:17:47,980
And also, of course, sometimes
you feel, okay, I'm documenting,

342
00:17:47,980 --> 00:17:51,940
documenting, but who will be actually
solving the problem, right?

343
00:17:51,940 --> 00:17:55,860
So it's good if you have a few
folks who can help each other

344
00:17:55,860 --> 00:17:59,960
and one of them is responsible
for documenting, another is trying

345
00:17:59,960 --> 00:18:01,680
to find a quick solution.

346
00:18:02,640 --> 00:18:06,200
And also a document is important
to have because then, in bigger

347
00:18:06,200 --> 00:18:09,840
companies, we have a procedure
called root cause analysis, RCA,

348
00:18:09,960 --> 00:18:10,460
right?

349
00:18:11,040 --> 00:18:14,560
To learn from mistakes and fix
them and prevent them in the future,

350
00:18:14,620 --> 00:18:15,120
right?

351
00:18:16,220 --> 00:18:18,160
That's why it's also important to document.

352
00:18:18,700 --> 00:18:22,940
But then this helps, and this is,
I think, the fundamental number 1

353
00:18:22,940 --> 00:18:24,380
technical thing you need to do.

354
00:18:24,380 --> 00:18:26,620
Oh, it's an organizational thing,
sorry.

355
00:18:26,980 --> 00:18:30,400
But it includes some technical
aspects.

356
00:18:30,400 --> 00:18:34,080
For example, which monitoring we
use when an incident happens,

357
00:18:34,080 --> 00:18:34,580
right?

358
00:18:34,840 --> 00:18:35,880
Where do we start?

359
00:18:36,760 --> 00:18:39,180
This dashboard or that dashboard,
right?

360
00:18:39,360 --> 00:18:42,540
What technical things we must document
there?

361
00:18:42,780 --> 00:18:48,560
For example, of course, we care
about CPU level and disk I/O,

362
00:18:48,560 --> 00:18:49,520
basics, right?

363
00:18:49,520 --> 00:18:50,440
Host stats.

364
00:18:50,660 --> 00:18:55,620
If the database seems to be slow or
unresponsive, we must document

365
00:18:55,680 --> 00:18:56,380
these things.

366
00:18:56,720 --> 00:19:00,660
We had discussions about monitoring
dashboard number 1, we propose

367
00:19:00,660 --> 00:19:01,620
like these things.

368
00:19:01,720 --> 00:19:06,940
Dashboard number 1 in our pgwatch2
Postgres.AI edition is designed

369
00:19:08,360 --> 00:19:12,720
for shallow but very wide analysis,
very quick, like up to 1

370
00:19:12,720 --> 00:19:17,060
minute analysis of various components
of Postgres and the various

371
00:19:17,320 --> 00:19:22,400
properties at a very high level,
like 30,000 feet level of workload

372
00:19:22,940 --> 00:19:26,740
to understand which directions
to investigate further, right?

373
00:19:26,760 --> 00:19:28,260
Michael: Yeah, where's
the issue?

374
00:19:28,260 --> 00:19:32,260
Nikolay: Right. Yeah, this is
very good to prepare in advance.

375
00:19:32,480 --> 00:19:38,740
I know if something happens, how
I will act, where I will start,

376
00:19:39,000 --> 00:19:39,500
right?

377
00:19:40,240 --> 00:19:41,820
Yeah, so this is important.

378
00:19:42,040 --> 00:19:44,440
And you will document, you'll know
how to start.

379
00:19:45,060 --> 00:19:48,620
This is about monitoring and observability
and logs and so on.

380
00:19:49,540 --> 00:19:53,420
Next, there are several particular
cases I can quickly share,

381
00:19:53,940 --> 00:19:57,540
which are important to be prepared
for.

382
00:19:57,740 --> 00:20:01,160
For example, of course, if you
already know that the database has,

383
00:20:01,160 --> 00:20:03,140
for example, a transaction ID wraparound.

384
00:20:04,340 --> 00:20:07,580
Michael: You can see straight away
that there's an error in the

385
00:20:07,580 --> 00:20:08,540
log or something.

386
00:20:08,600 --> 00:20:09,100
Nikolay: Yeah.

387
00:20:09,180 --> 00:20:15,120
So we have cases very well documented
from Sentry, MailChimp,

388
00:20:15,540 --> 00:20:19,940
somebody else, And also we have
very, very good work from Google,

389
00:20:20,280 --> 00:20:22,100
GCP, Hannu Krosing.

390
00:20:22,500 --> 00:20:28,080
He was at PostgresTV presenting
his talk about how to handle transaction

391
00:20:28,080 --> 00:20:30,960
ID wraparound without single-user
mode.

392
00:20:31,780 --> 00:20:35,560
He thinks single-user mode is not
the right way to do it, but

393
00:20:35,560 --> 00:20:39,060
this is a traditional approach,
single-user mode, and a very

394
00:20:39,060 --> 00:20:44,240
long time for processing of the
database, for recovering the

395
00:20:44,240 --> 00:20:45,560
state of the database.

396
00:20:45,860 --> 00:20:50,280
So this is like, you can just
document what to do if it happens sometime

397
00:20:50,280 --> 00:20:54,960
someday, but I haven't seen it
for so long because monitoring

398
00:20:54,960 --> 00:20:57,340
has it, alerts have it, like,
and so on.
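
(For reference, a minimal sketch of the kind of wraparound monitoring mentioned here; exact queries vary by tool, but checking datfrozenxid age per database is the usual approach:)

```sql
-- Hypothetical check: how far each database has advanced towards the
-- ~2 billion XID wraparound limit; alerts usually fire well below 100%.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 2)
         AS pct_towards_wraparound
FROM pg_database
ORDER BY xid_age DESC;
```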

399
00:20:57,340 --> 00:21:00,560
Michael: And also recent versions
have improvements in this area.

400
00:21:00,560 --> 00:21:03,520
I remember, I think Peter Geoghegan
did some good work around this.

401
00:21:03,520 --> 00:21:04,740
probably others too.

402
00:21:04,760 --> 00:21:07,800
Nikolay: Yeah, I just started
from a very scary thing.

403
00:21:08,000 --> 00:21:12,540
The scariest thing, maybe the scariest
of all, is loss of backups and

404
00:21:12,540 --> 00:21:15,220
you cannot perform disaster recovery,
right?

405
00:21:16,260 --> 00:21:21,240
Also like very low-risk these days.

406
00:21:22,540 --> 00:21:28,100
Michael: Yeah, I guess these days,
sometimes major issues

407
00:21:28,100 --> 00:21:33,080
are things like the whole of US East 1 is down for like, this

408
00:21:33,080 --> 00:21:36,680
hasn't really happened for a while,
but like a cloud regional

409
00:21:36,860 --> 00:21:37,360
outage.

410
00:21:38,040 --> 00:21:40,900
I feel like that could still take
down a company's data.

411
00:21:40,900 --> 00:21:44,160
If you're using a managed service
or the cloud at all, you're

412
00:21:44,160 --> 00:21:45,140
at risk of that.

413
00:21:45,140 --> 00:21:47,320
Obviously there, you can have plans
in place to mitigate that.

414
00:21:47,320 --> 00:21:51,080
Nikolay: Even if it's self-managed,
not many people have multi-region

415
00:21:51,460 --> 00:21:51,960
setup.

416
00:21:52,640 --> 00:21:53,140
Michael: Exactly.

417
00:21:53,300 --> 00:21:54,640
Nikolay: It's very hard, actually.

418
00:21:55,080 --> 00:21:58,860
Michael: So if you don't have off-site
backups, you're sat there

419
00:21:58,860 --> 00:22:01,480
thinking, "We just have to wait".

420
00:22:01,820 --> 00:22:04,900
Nikolay: Yeah, it's a complex thing
to have multi-region purely

421
00:22:05,020 --> 00:22:09,240
and well-tested productions, like,
failover-tested very well,

422
00:22:09,240 --> 00:22:10,060
and so on.

423
00:22:10,380 --> 00:22:11,980
Yeah, it's a big topic actually.

424
00:22:12,440 --> 00:22:16,180
So backups and transaction ID wraparound
are probably the 2 nightmares of

425
00:22:16,180 --> 00:22:17,660
any Postgres DBA, right?

426
00:22:17,980 --> 00:22:19,840
Michael: Are they the scariest
to you?

427
00:22:20,140 --> 00:22:22,220
I think corruption is pretty scary.

428
00:22:23,000 --> 00:22:25,460
Nikolay: Well, it's a good and interesting
topic.

429
00:22:25,460 --> 00:22:29,280
Corruption, we had an episode about
corruption as well, right?

430
00:22:30,060 --> 00:22:34,780
But this is good to put into the
preparation for incidents.

431
00:22:34,900 --> 00:22:37,260
If corruption happens, what will
we do?

432
00:22:37,580 --> 00:22:38,260
Some steps.

433
00:22:38,300 --> 00:22:44,280
And the first step is, according to
wiki.postgresql.org, copy the database,

434
00:22:44,280 --> 00:22:44,640
right?

435
00:22:44,640 --> 00:22:47,720
Because you will try to fix it, maybe
you will break it more, right?

436
00:22:47,720 --> 00:22:48,280
So copy.

437
00:22:48,280 --> 00:22:50,420
This is the first step to do.

438
00:22:50,740 --> 00:22:54,520
And knowing this helps because
this kind of thing you can know

439
00:22:54,520 --> 00:22:55,240
in advance.

440
00:22:55,680 --> 00:22:58,740
By the way, transaction ID wraparound
you can practice as well.

441
00:22:58,740 --> 00:23:03,340
There is a recipe I wrote on how to
simulate it, right?

442
00:23:03,340 --> 00:23:07,720
So you can have it in a lower environment
and then good luck dealing

443
00:23:07,720 --> 00:23:08,460
with it.

444
00:23:08,680 --> 00:23:12,080
Or you can clone your database
and simulate it there.

445
00:23:12,440 --> 00:23:16,080
So corruption is a very broad topic,
many types of corruption,

446
00:23:16,080 --> 00:23:18,460
but some kinds can also be simulated.

447
00:23:18,520 --> 00:23:19,860
There are tools for it.

448
00:23:19,940 --> 00:23:22,100
So it's good to know it.

449
00:23:22,660 --> 00:23:29,800
But in the cases I saw, in most cases,
it was quite... like, there was

450
00:23:29,800 --> 00:23:32,720
some path to escape.

451
00:23:33,260 --> 00:23:36,300
In some cases, the escape was we just
restore from backups, losing

452
00:23:36,300 --> 00:23:39,260
some data, and for that project
it was, like, acceptable.

453
00:23:40,020 --> 00:23:45,360
In some cases it was, okay, we
just noticed that only pg_statistic

454
00:23:45,700 --> 00:23:46,400
is corrupted.

455
00:23:47,360 --> 00:23:49,460
So running ANALYZE fixes it.
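
(A note on this: pg_statistic holds planner statistics, which are fully derived data, so rebuilding them is safe; something like this, run in each affected database, is enough:)

```sql
-- Planner statistics in pg_statistic are derived data, so if only this
-- catalog is corrupted, recomputing them recovers it (per database):
ANALYZE;
```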

456
00:23:49,740 --> 00:23:54,860
But long term we see the database
is on NFS and this is no, no,

457
00:23:54,860 --> 00:23:55,360
no.

458
00:23:55,440 --> 00:23:59,240
Don't use NFS for PGDATA, right?

459
00:23:59,280 --> 00:24:03,900
It's quite... Like, in most cases
I saw corruption, it was something

460
00:24:03,900 --> 00:24:04,900
silly actually.

461
00:24:05,740 --> 00:24:10,520
But corruption happens also like
due to bugs, due to various

462
00:24:10,520 --> 00:24:16,700
stuff, or mistakes planning some
major change like switching to a

463
00:24:16,700 --> 00:24:19,260
new operating system, glibc.

464
00:24:20,030 --> 00:24:24,520
Fortunately over the last few years
such corruption happened

465
00:24:25,120 --> 00:24:27,740
in non-production, so we fully
prevented it.

466
00:24:27,740 --> 00:24:31,060
Michael: Well, the reason
I find it scary is more that

467
00:24:31,060 --> 00:24:33,920
we could have been returning bad
results.

468
00:24:34,860 --> 00:24:36,340
Nikolay: Like it's just silent
corruption.

469
00:24:36,820 --> 00:24:37,320
Michael: Yeah.

470
00:24:38,480 --> 00:24:43,240
It's that's the fear to me is more
how far back does this go?

471
00:24:43,440 --> 00:24:45,460
Anyway, but it's a different kind
of emergency.

472
00:24:46,020 --> 00:24:50,640
Nikolay: Yeah, we had the corruption
due to index and GDPc change

473
00:24:50,640 --> 00:24:53,720
in production with one company last
year.

474
00:24:54,580 --> 00:24:56,060
And it was our oversight.

475
00:24:56,820 --> 00:25:00,360
But fortunately, it happened only
on standby nodes, which were

476
00:25:00,360 --> 00:25:01,220
not used.

477
00:25:01,840 --> 00:25:07,060
So it was a pure matter of luck that
this production corruption

478
00:25:07,120 --> 00:25:07,620
happened.

479
00:25:08,860 --> 00:25:09,780
Michael: And no failover.

480
00:25:10,120 --> 00:25:10,740
Nikolay: Yeah, yeah.

481
00:25:10,920 --> 00:25:13,080
Other clusters used standby nodes.

482
00:25:13,080 --> 00:25:14,740
This cluster didn't use it.

483
00:25:15,040 --> 00:25:17,820
And we just saw some errors in...

484
00:25:18,260 --> 00:25:21,560
It was during upgrade with logical
replication.

485
00:25:21,600 --> 00:25:25,020
We saw errors in logs and quickly
reacted and then realized,

486
00:25:25,440 --> 00:25:27,380
these standby nodes are not used.

487
00:25:28,260 --> 00:25:31,900
Let's pray that failover won't
happen soon.

488
00:25:32,280 --> 00:25:34,700
Of course, it's like just imagine
like...

489
00:25:34,740 --> 00:25:36,920
So we quickly mitigated this completely.

490
00:25:37,200 --> 00:25:38,240
Nobody noticed.

491
00:25:38,940 --> 00:25:42,240
But if it happens, yeah, the question
is how, like what's the

492
00:25:42,240 --> 00:25:43,440
propagation here?

493
00:25:43,860 --> 00:25:49,220
But there's also like tooling and
knowing, like learning from

494
00:25:49,220 --> 00:25:51,740
other people's mistakes helps, of
course, as usual.

495
00:25:51,740 --> 00:25:57,040
And knowing tools like amcheck helps.
It should be a very routine tool,

496
00:25:57,040 --> 00:25:59,020
used often, right?

497
00:25:59,020 --> 00:26:00,700
amcheck, to check B-tree indexes.
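
(A minimal sketch of routine amcheck usage, assuming the extension is available; bt_index_check takes light locks, while the stricter bt_index_parent_check locks more:)

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Check every B-tree index in one schema ('public' here, as an example).
SELECT c.relname, bt_index_check(c.oid)
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i'
  AND am.amname = 'btree'
  AND c.relnamespace = 'public'::regnamespace;
```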

498
00:26:01,000 --> 00:26:04,400
Hopefully it will finally support
other types of indexes soon.

499
00:26:04,400 --> 00:26:06,540
I think it's still a work in progress,
right?

500
00:26:06,600 --> 00:26:07,100
Michael: Yeah.

501
00:26:07,240 --> 00:26:08,940
I saw some work going on.

502
00:26:09,240 --> 00:26:10,120
Nikolay: GIN and GiST.

503
00:26:11,540 --> 00:26:12,560
Yeah.
What else?

504
00:26:12,980 --> 00:26:20,060
For example, if the database is shutting
down too slowly, it takes

505
00:26:20,060 --> 00:26:22,940
a lot of time or starting up takes
a lot of time.

506
00:26:23,480 --> 00:26:28,840
More than once, I saw people being
nervous, not understanding that

507
00:26:28,840 --> 00:26:33,160
it's normal, not understanding
how to check the progress, what

508
00:26:33,160 --> 00:26:33,900
to expect.

509
00:26:35,060 --> 00:26:39,100
And it was like when you perform
checkpoint tuning, we also had

510
00:26:39,100 --> 00:26:42,780
an episode about it, and increase
checkpoint timeout and max

511
00:26:42,780 --> 00:26:46,420
WAL size, which you should do
on loaded systems.

512
00:26:46,880 --> 00:26:51,760
In this case, like restart or just
stopping the database or starting

513
00:26:51,760 --> 00:26:54,500
the database might take many, 
many, many minutes.

514
00:26:55,240 --> 00:26:59,020
And if it's self-managed, I saw
people send kill -9,

515
00:26:59,760 --> 00:27:00,740
SIGKILL, right?

516
00:27:01,340 --> 00:27:04,680
Sending to Postgres because they
are nervous, not understanding,

517
00:27:04,700 --> 00:27:06,880
"Oh, Postgres is not starting. What
to do?"

518
00:27:07,060 --> 00:27:12,080
And I think, I think now in fresh
versions, there are some log

519
00:27:12,080 --> 00:27:17,200
messages telling that we are in
recovery mode and showing some

520
00:27:17,200 --> 00:27:18,100
progress, right?

521
00:27:18,520 --> 00:27:20,120
I thought about it.

522
00:27:20,120 --> 00:27:21,500
Michael: I think it's very recent.

523
00:27:21,500 --> 00:27:23,480
I think, I can't remember if it...

524
00:27:23,480 --> 00:27:24,640
Nikolay: It should
be so.

525
00:27:24,640 --> 00:27:25,660
I mean, it should be so.

526
00:27:25,660 --> 00:27:26,880
It should be very straightforward.

527
00:27:27,800 --> 00:27:32,840
A DBA should see the progress and
have an understanding of when it

528
00:27:32,840 --> 00:27:33,580
will finish.

529
00:27:34,860 --> 00:27:39,900
For older versions, at least definitely
older than 16, it's unclear.

530
00:27:40,440 --> 00:27:45,420
Usually, you need to, if it's self-managed,
you just run ps to

531
00:27:45,420 --> 00:27:50,820
see what process reports in its
title or top, right?

532
00:27:50,900 --> 00:27:55,900
And you see the LSN there, then you
check pg_controldata output to

533
00:27:55,900 --> 00:28:00,040
understand the point of consistency,
and then you understand

534
00:28:00,060 --> 00:28:04,280
how much is left: if you have 2 LSNs,
go to another Postgres, you

535
00:28:04,280 --> 00:28:07,100
can calculate the difference and the difference
is in bytes.

536
00:28:07,360 --> 00:28:11,900
So you understand how many bytes,
megabytes, gigabytes left,

537
00:28:12,040 --> 00:28:16,480
and then you can already monitor
like every minute or every second

538
00:28:16,480 --> 00:28:21,020
and understand the progress and
have ETA, expected time of arrival,

539
00:28:21,020 --> 00:28:21,520
right?

540
00:28:22,540 --> 00:28:23,460
And this helps.
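
(A minimal sketch of that calculation; both LSNs here are hypothetical, one read from the startup process title in ps and one from pg_controldata, and any running Postgres can compute the gap. Sampling the replayed LSN every few seconds then gives the rate and the ETA:)

```sql
SELECT pg_size_pretty(
         pg_wal_lsn_diff('2/8E000000',   -- hypothetical consistency/target LSN
                         '1/FA000000')   -- hypothetical currently replayed LSN
       ) AS bytes_left;
```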

541
00:28:23,680 --> 00:28:27,340
And I think it's a good idea to
learn how to do it.

542
00:28:27,340 --> 00:28:30,400
In older versions, in newer versions,
I have a how-to about it,

543
00:28:30,400 --> 00:28:30,900
actually.

544
00:28:31,300 --> 00:28:36,100
What if the Postgres startup or stop
takes, like, a long time?

545
00:28:36,140 --> 00:28:37,640
What to do about it?

546
00:28:38,060 --> 00:28:40,840
And yeah, it should be just learned,
right?

547
00:28:40,840 --> 00:28:45,040
And if you're prepared, it will reduce
stress.

548
00:28:45,660 --> 00:28:50,200
And yeah, we had a lot of such
cases working on DBLab.

549
00:28:51,000 --> 00:28:53,940
Sometimes, like, a clone is not created.

550
00:28:54,000 --> 00:28:54,500
Why?

551
00:28:54,520 --> 00:28:57,020
But it's because WAL size is
huge and so on.

552
00:28:57,020 --> 00:29:00,140
It's just recovering, so you just
need to wait a little bit more.

553
00:29:00,380 --> 00:29:02,660
But then we improved it.

554
00:29:03,080 --> 00:29:05,640
So yeah, this might happen.

555
00:29:05,660 --> 00:29:07,080
This is a very common situation.

556
00:29:07,920 --> 00:29:09,560
Long restart time.

557
00:29:10,080 --> 00:29:12,720
Michael: Yeah, I'll definitely
share that episode in the show

558
00:29:12,720 --> 00:29:15,240
notes as well so people can find
it if they weren't listening

559
00:29:15,240 --> 00:29:15,940
back then.

560
00:29:16,120 --> 00:29:16,780
Nikolay: What else?

561
00:29:17,480 --> 00:29:20,740
Somebody deleted data and you

562
00:29:20,740 --> 00:29:21,180
Michael: need to recover.

563
00:29:21,180 --> 00:29:23,540
We have other episodes like out
of disk.

564
00:29:23,560 --> 00:29:25,280
Like there's other kinds of emergencies.

565
00:29:26,660 --> 00:29:30,600
One we haven't covered, I don't think,
in much detail was the big one,

566
00:29:30,660 --> 00:29:32,380
like out of integers.

567
00:29:32,860 --> 00:29:33,520
Yeah, in short.

568
00:29:33,520 --> 00:29:36,480
Nikolay: Oh, out of integers is
a big disaster.

569
00:29:37,820 --> 00:29:38,240
Michael: Yeah.
Yeah.

570
00:29:38,240 --> 00:29:41,980
But I guess that's quite common
like in terms of other common

571
00:29:41,980 --> 00:29:45,140
issues people come to you with
is that up there, or what...

572
00:29:45,140 --> 00:29:46,120
what tends to come up?

573
00:29:46,340 --> 00:29:49,760
Nikolay: Maybe
I'm biased here because I have

574
00:29:49,760 --> 00:29:52,980
a feeling it's a very well-known
problem and people already mitigate

575
00:29:53,000 --> 00:29:57,320
it, or are mitigating it, not requiring
a lot of expertise.

576
00:29:58,280 --> 00:30:01,880
Our Postgres checkup tool has a
report for it, like saying how

577
00:30:01,880 --> 00:30:08,160
much of the capacity of an int4 regular
integer primary key is left for

578
00:30:08,160 --> 00:30:09,120
a particular table.
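
(A minimal sketch of such a report for a single table, with a hypothetical sequence name; the checkup tool itself does more, this just shows the idea:)

```sql
-- How much of the int4 range (max 2,147,483,647) the id sequence has used.
SELECT last_value,
       round(100.0 * last_value / 2147483647, 2) AS pct_used
FROM my_table_id_seq;   -- hypothetical sequence backing an int4 primary key
```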

579
00:30:09,560 --> 00:30:12,280
For me, it's like straightforward
already, right?

580
00:30:12,720 --> 00:30:15,040
And I don't see a big deal.

581
00:30:15,040 --> 00:30:18,200
But if it happens, of course, it's
like partial, at least partial

582
00:30:18,200 --> 00:30:21,960
downtime because you cannot insert
new rows in this table.

583
00:30:23,120 --> 00:30:23,560
And it

584
00:30:23,560 --> 00:30:24,400
can be scary.

585
00:30:25,020 --> 00:30:27,660
Michael: That's true of so many
of these issues though, right?

586
00:30:27,660 --> 00:30:30,060
Like once you're monitoring for
them, once you know about them

587
00:30:30,060 --> 00:30:33,620
and you've got alerts far enough
out, they become not emergencies.

588
00:30:34,860 --> 00:30:35,360
Nikolay: Right.

589
00:30:35,600 --> 00:30:39,620
But I'd like to also mention a
common problem, like the database

590
00:30:39,620 --> 00:30:42,540
is slow or the database is unresponsive,
and what to do.

591
00:30:43,260 --> 00:30:45,360
Like very general, like where to
start?

592
00:30:45,720 --> 00:30:46,620
What do you think?

593
00:30:47,080 --> 00:30:49,200
Michael: Well, I think that's the
monitoring thing, isn't it?

594
00:30:49,200 --> 00:30:52,580
Like that's to go to the monitoring,
that number 1 dashboard

595
00:30:52,580 --> 00:30:53,580
you talked about.

596
00:30:53,680 --> 00:30:53,800
Nikolay: Yeah.

597
00:30:53,800 --> 00:30:57,020
Michael: I think that's how you work out
where the problem is.

598
00:30:57,340 --> 00:30:59,280
It needs to be the first point,
doesn't it?

599
00:30:59,280 --> 00:31:01,360
Nikolay: Yeah, I agree.

600
00:31:01,680 --> 00:31:06,400
And the first thing I would start understanding,
I think we can talk

601
00:31:06,400 --> 00:31:11,660
about methodologies here like starting
from USE, right,

602
00:31:11,660 --> 00:31:14,440
and others, like there are many 
of them.

603
00:31:14,860 --> 00:31:20,140
But the question, like, do you see the
utilization, saturation, errors from Brendan 

604
00:31:20,140 --> 00:31:22,440
Gregg, like basics from Netflix, 
right?

605
00:31:22,900 --> 00:31:25,940
It's a very, very trivial approach, 
I would say.

606
00:31:26,760 --> 00:31:32,080
But yeah, here, first question, 
if the database is slow and unresponsive,

607
00:31:32,100 --> 00:31:35,880
first question, are we really putting 
more workload on it?

608
00:31:36,220 --> 00:31:38,660
Very simple question, but sometimes 
hard to answer.

609
00:31:39,120 --> 00:31:44,340
Because often we find out that 
many more clients are connected, some

610
00:31:44,340 --> 00:31:48,540
background job started like bombarding 
the database with new queries,

611
00:31:48,820 --> 00:31:51,440
retrying a lot of connections.

612
00:31:52,120 --> 00:31:53,680
Michael: Ah, like a cascading effect.

613
00:31:53,680 --> 00:31:54,640
Nikolay: As well.
Yeah, yeah, yeah.

614
00:31:55,080 --> 00:31:58,020
Michael: Is it because of elephants, or 
I actually don't know that term,

615
00:31:58,020 --> 00:31:58,780
but it's like a...

616
00:31:58,780 --> 00:32:04,020
Nikolay: So the question is, is more 
load coming from, like, 

617
00:32:04,020 --> 00:32:05,520
externally to the database.

618
00:32:06,340 --> 00:32:11,420
And this can, of course, be a reason 
why it's slow.

619
00:32:12,040 --> 00:32:17,860
And if it's not tuned well to handle 
spikes of load, for example,

620
00:32:17,860 --> 00:32:22,520
you keep max_connections high, 
ignoring advice from Postgres

621
00:32:22,540 --> 00:32:24,560
experts to keep it sane.

622
00:32:25,400 --> 00:32:29,060
Recently I saw – I'm sharing without 
names so I can share, right?

623
00:32:29,060 --> 00:32:30,860
– 12,000 max_connections.

624
00:32:32,380 --> 00:32:34,160
This is for me, I think, a record.

625
00:32:34,160 --> 00:32:36,360
A new client showed it and they 
explained.

626
00:32:37,200 --> 00:32:38,660
I see it's a trend.

627
00:32:39,240 --> 00:32:42,620
Recently when I say you need to 
decrease max_connections, I

628
00:32:42,620 --> 00:32:47,640
also say most likely you will not 
do it right now because most

629
00:32:47,640 --> 00:32:49,400
people tend not to do it.

630
00:32:49,400 --> 00:32:52,840
They all have reasons why max_connections 
should be very high.

631
00:32:53,000 --> 00:32:56,840
And of course, since Postgres, 
I think, 14, things have improved

632
00:32:56,840 --> 00:32:58,680
in terms of handling idle connections.

633
00:32:59,760 --> 00:33:03,980
But when an incident happens, these 
idle connections become active, 

634
00:33:04,540 --> 00:33:11,700
and we have almost 0 chances for 
statements to be finished because

635
00:33:12,700 --> 00:33:14,840
the server is overwhelmed with load.

636
00:33:16,100 --> 00:33:19,640
Whereas if you have a sane number of 
max_connections, I would say,

637
00:33:19,640 --> 00:33:25,580
take your vCPUs number, multiply 
it by some relatively low multiplier, 

638
00:33:25,680 --> 00:33:27,540
like less than 10.

639
00:33:28,860 --> 00:33:31,300
That should be max_connections 
for all OLTP workloads.

640
00:33:31,880 --> 00:33:33,720
Then you have pgBouncer or something.
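
(A sketch of the sizing rule just described, with hypothetical numbers: a 16-vCPU server and a multiplier of 5 give 80 slots, and pgBouncer queues the rest:)

```sql
-- 16 vCPUs * 5 = 80; takes effect only after a server restart.
ALTER SYSTEM SET max_connections = 80;
```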

641
00:33:34,180 --> 00:33:40,660
So if you have this and an enormous
load is coming, additional load 

642
00:33:41,440 --> 00:33:44,340
will be receiving an "out of connections" 
error.

643
00:33:44,760 --> 00:33:46,720
And existing transactions or something.

644
00:33:46,720 --> 00:33:50,460
Yeah, and existing sessions have chances
to finish current statements, current

645
00:33:50,460 --> 00:33:53,900
query processing, and new ones, and
so on.

646
00:33:54,020 --> 00:33:59,980
So it's much better than trying
to please everyone, right?

647
00:34:00,720 --> 00:34:05,020
And failing everyone, including,
like, your old clients,

648
00:34:05,020 --> 00:34:05,700
you know.

649
00:34:06,260 --> 00:34:09,500
Michael: It also makes some diagnosis
easier, right?

650
00:34:09,960 --> 00:34:13,540
If the database is still responding
to anything, it's easier

651
00:34:13,540 --> 00:34:16,480
to diagnose issues than if it's
not responding at all.

652
00:34:18,820 --> 00:34:19,320
Nikolay: Exactly.

653
00:34:20,540 --> 00:34:23,360
Michael: It's kind of just moving
the problem, but it's definitely

654
00:34:23,360 --> 00:34:24,020
an improvement.

655
00:34:24,860 --> 00:34:26,140
Yeah.
But yeah, it's a good point.

656
00:34:26,140 --> 00:34:28,320
Like it could just be overwhelmed,
but it could be, there are

657
00:34:28,320 --> 00:34:29,880
like a million other reasons.

658
00:34:30,020 --> 00:34:30,640
Nikolay: Of course.

659
00:34:30,720 --> 00:34:35,180
But the first question I would
say, are we receiving more load?

660
00:34:37,580 --> 00:34:40,280
So the reason is already outside
of Postgres.

661
00:34:40,520 --> 00:34:44,800
Well, technically I just explained
an additional factor, high

662
00:34:44,800 --> 00:34:47,960
max_connections, so partially the
problem is inside Postgres, but

663
00:34:47,960 --> 00:34:50,580
the main reason, the root cause, is
outside.

664
00:34:50,580 --> 00:34:53,160
Like we're just receiving much
more than usual.

665
00:34:53,680 --> 00:34:56,340
Right.
This is number 1 thing to check.

666
00:34:56,480 --> 00:35:02,280
Like, we don't have time to discuss
the full recipe for troubleshooting

667
00:35:02,380 --> 00:35:03,280
of such cases.

668
00:35:03,280 --> 00:35:05,180
Michael: We've got an episode,
I think, for that.

669
00:35:05,280 --> 00:35:07,100
Nikolay: Maybe, yeah, I already
keep forgetting.

670
00:35:07,100 --> 00:35:08,460
Michael: Probably actually just
monitoring,

671
00:35:08,680 --> 00:35:09,180
Nikolay: yeah.

672
00:35:09,520 --> 00:35:15,280
Yeah, but maybe we should have,
like, you know, like, how to

673
00:35:15,280 --> 00:35:18,020
troubleshoot a slow database, step
by step.

674
00:35:18,060 --> 00:35:23,240
So, to save time, the second advice,
I would say, just check wait

675
00:35:23,240 --> 00:35:24,060
event analysis.

676
00:35:24,820 --> 00:35:25,540
Second thing.

677
00:35:26,040 --> 00:35:30,300
If you have a lot of active sessions...
though actually, sometimes

678
00:35:30,300 --> 00:35:32,720
databases are slow without a lot of
active sessions.

679
00:35:32,720 --> 00:35:33,400
It's interesting.

680
00:35:33,540 --> 00:35:37,420
But also, understanding the
number of active sessions is

681
00:35:37,420 --> 00:35:38,160
very important.

682
00:35:38,200 --> 00:35:41,800
But the next thing is to understand
their state, what they are doing,

683
00:35:41,800 --> 00:35:42,300
right?

684
00:35:42,700 --> 00:35:48,580
So are they doing a lot of I/O, or
is there contention related

685
00:35:48,580 --> 00:35:51,660
to the lock manager, for example, or
sub-transactions or anything

686
00:35:51,660 --> 00:35:52,380
like that.

687
00:35:52,480 --> 00:35:55,580
So, wait event analysis is super
important.
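
As a rough sketch of what wait event analysis looks like at the SQL level (one snapshot of pg_stat_activity; monitoring tools sample this every second or so):

    -- Active sessions segmented by wait event; NULL wait_event usually
    -- means the backend is running on CPU rather than waiting.
    SELECT state,
           wait_event_type,
           wait_event,
           count(*) AS sessions
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid <> pg_backend_pid()   -- exclude this query's own session
    GROUP BY 1, 2, 3
    ORDER BY sessions DESC;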

688
00:35:56,640 --> 00:36:01,520
And we're discussing right now how to
improve dashboard number 1 and...

689
00:36:02,940 --> 00:36:03,940
No, no, not dashboard.

690
00:36:03,940 --> 00:36:07,280
Dashboard number 3, which is query
analysis in pgwatch Postgres.AI

691
00:36:07,360 --> 00:36:07,860
edition.

692
00:36:08,100 --> 00:36:13,400
And I'm almost convinced to put
wait event query analysis to

693
00:36:13,400 --> 00:36:14,080
the top.

694
00:36:14,380 --> 00:36:22,886
Previously, I was thinking we should
have total time from pg_stat_statements and average time; total
time maybe should be higher

695
00:36:22,920 --> 00:36:28,380
and we had, like, a long discussion
inside the team about what should

696
00:36:28,380 --> 00:36:29,120
be higher.
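
For context, a sketch of the two orderings being debated; column names assume pg_stat_statements on Postgres 13 or newer:

    -- Top queries by total time (overall load); swap the ORDER BY
    -- to mean_exec_time to rank by average time instead.
    SELECT queryid,
           calls,
           round(total_exec_time::numeric, 1) AS total_ms,
           round(mean_exec_time::numeric, 3)  AS mean_ms
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;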

697
00:36:29,260 --> 00:36:32,440
But now I'm almost convinced actually
wait event analysis should

698
00:36:32,440 --> 00:36:36,260
be at the very top because it gives
you a very quick understanding

699
00:36:37,240 --> 00:36:40,640
just from 1 chart you quickly see
the number of active sessions

700
00:36:40,640 --> 00:36:45,540
and the distribution in terms of what
they are doing.

701
00:36:46,240 --> 00:36:51,600
In any analysis, when you have some
number, the next step is to segment

702
00:36:51,600 --> 00:36:52,980
it properly, right?

703
00:36:53,200 --> 00:36:55,620
So to divide this number into some
segments.

704
00:36:56,140 --> 00:37:00,680
And I think wait event is a very
good direction for segmentation,

705
00:37:01,000 --> 00:37:01,860
how to say.

706
00:37:02,320 --> 00:37:07,020
Michael: Yeah, it's like it splits
it into fewer things, so therefore

707
00:37:07,020 --> 00:37:09,420
it's easier to spot if there's
like a majority.

708
00:37:09,700 --> 00:37:14,820
Whereas with query analysis, you
could have a real long tail,

709
00:37:14,820 --> 00:37:19,540
like your, even the most commonly
executed query might only be

710
00:37:19,540 --> 00:37:20,840
1% of your workload.

711
00:37:21,180 --> 00:37:26,180
Well, yeah, it might be 50%, but
more likely it might be 1%.

712
00:37:26,520 --> 00:37:29,520
Nikolay: Yeah, and timing in pg_stat_statements hides details;

713
00:37:29,540 --> 00:37:33,260
it might be actual work the database
is doing, and that's why

714
00:37:33,260 --> 00:37:36,180
it's spending time, for example,
sequential scans due to lack

715
00:37:36,180 --> 00:37:39,680
of indexes or something like that,
or contention, or it might be

716
00:37:39,760 --> 00:37:41,540
waiting for a lock to be acquired.

717
00:37:42,180 --> 00:37:46,000
So that's also spending time, and you
quickly see it.

718
00:37:46,000 --> 00:37:51,440
So, very good books, as usual,
from Brendan Gregg.

719
00:37:52,040 --> 00:37:57,320
In troubleshooting, I remember
also his talks, a two-part

720
00:37:57,440 --> 00:38:01,380
talk about tooling for Linux and
so on, and he mentioned what

721
00:38:01,380 --> 00:38:06,760
he'd pick if he needed to choose just 1 Linux
tool, like you can use only

722
00:38:06,760 --> 00:38:09,980
1 tool and want the biggest outcome
in terms of troubleshooting.

723
00:38:10,380 --> 00:38:11,040
What is it?

724
00:38:11,040 --> 00:38:11,840
Do you remember?

725
00:38:11,840 --> 00:38:12,340
No?

726
00:38:12,560 --> 00:38:13,060
Michael: No.

727
00:38:13,100 --> 00:38:13,880
Nikolay: It's iostat.

728
00:38:15,100 --> 00:38:15,840
Oh, why?

729
00:38:15,960 --> 00:38:19,520
It gives you disk I/O and also it
reports CPU as well, segmented

730
00:38:19,600 --> 00:38:23,160
by, like, user, system, I/O wait.

731
00:38:23,540 --> 00:38:27,720
So it's super good, like you see
disk I/O and CPU just from 1

732
00:38:27,720 --> 00:38:28,220
tool.

733
00:38:28,270 --> 00:38:35,280
Similar here, we see the number of
active sessions and also we

734
00:38:35,280 --> 00:38:37,580
see wait events segmentation.

735
00:38:38,320 --> 00:38:41,340
It's a very good chart to have
for troubleshooting.

736
00:38:41,940 --> 00:38:44,580
Michael: It feels to me like an
interesting trade-off, like whether

737
00:38:44,580 --> 00:38:48,120
you're looking at monitoring more
often or not even necessarily

738
00:38:48,120 --> 00:38:51,540
more often, but do you optimize
for people in an incident or

739
00:38:51,540 --> 00:38:55,740
do you optimize for people doing
general performance work?

740
00:38:55,740 --> 00:38:59,060
And I think optimizing for the
incident people makes some sense,

741
00:38:59,060 --> 00:39:01,090
even though it's less often, hopefully.

742
00:39:01,090 --> 00:39:02,860
Nikolay: Yeah, they
have less time.

743
00:39:02,860 --> 00:39:07,160
Michael: Less time,
but also heightened emotions

744
00:39:07,260 --> 00:39:09,440
and not thinking straight, like
we discussed at the start.

745
00:39:09,440 --> 00:39:10,320
So maybe that's a

746
00:39:10,320 --> 00:39:11,620
Nikolay: The path should be shorter.

747
00:39:12,260 --> 00:39:13,260
Yeah, yeah, I agree.

748
00:39:13,260 --> 00:39:13,760
Right.

749
00:39:14,920 --> 00:39:15,420
Yeah.

750
00:39:15,940 --> 00:39:19,320
So there are many other things
that can happen with a database,

751
00:39:19,320 --> 00:39:20,080
of course, right?

752
00:39:20,080 --> 00:39:23,380
But if you know some common things,
it helps a lot.

753
00:39:23,920 --> 00:39:24,420
Yeah.

754
00:39:24,800 --> 00:39:30,120
And tooling should be prepared
and, yeah, observability is important.

755
00:39:31,500 --> 00:39:33,260
Michael: Yeah, 1 last question.

756
00:39:33,560 --> 00:39:39,420
I think there are some arguments
for trying to reduce incidents

757
00:39:39,480 --> 00:39:43,000
down to like nearly 0, like trying
to put everything in place

758
00:39:43,000 --> 00:39:46,580
so that you never have any incidents,
you know, high availability,

759
00:39:46,880 --> 00:39:49,460
everything to try and minimize
the risk.

760
00:39:49,860 --> 00:39:55,760
And then I think as a team, you
can get out of practice dealing

761
00:39:55,760 --> 00:39:58,820
with incidents if you're good at
that kind of thing.

762
00:39:58,820 --> 00:40:01,860
But then when one does happen, it
can really throw you.

763
00:40:02,640 --> 00:40:06,580
Some teams like to deal with super
minor incidents and treat

764
00:40:06,580 --> 00:40:08,740
those as incidents, almost like
as practice.

765
00:40:08,860 --> 00:40:12,260
Do you have any opinions or feelings
around that kind of thing?

766
00:40:12,260 --> 00:40:13,540
Nikolay: Yeah, good point.

767
00:40:13,700 --> 00:40:17,220
So we actually didn't discuss many
things, for example, how to

768
00:40:17,220 --> 00:40:20,760
categorize incidents, like priority
1, priority 2, and so on.

769
00:40:20,760 --> 00:40:26,040
Because when a client comes, it
happened a couple of times over

770
00:40:26,040 --> 00:40:28,660
the last month, like a client comes
and shows me some graphs

771
00:40:28,660 --> 00:40:34,240
with spikes of active sessions
exceeding the CPU count significantly,

772
00:40:35,080 --> 00:40:39,740
I already say, oh, you are having
at least, like, you know, a P3

773
00:40:39,960 --> 00:40:41,400
incident or maybe P2.

774
00:40:41,920 --> 00:40:47,440
Maybe it's not user-facing, people
haven't noticed it, but it's

775
00:40:47,440 --> 00:40:48,420
an incident already.

776
00:40:48,420 --> 00:40:52,120
It requires investigation. And they're
like, the database is slow,

777
00:40:52,120 --> 00:40:56,920
but this already means you need some
reaction and mitigation for

778
00:40:56,920 --> 00:40:57,420
it.

779
00:40:57,840 --> 00:41:02,200
So it requires maybe understanding
and expertise and classification

780
00:41:03,460 --> 00:41:07,600
rules, which require PostgreSQL
understanding, right?

781
00:41:08,180 --> 00:41:12,960
Because sometimes I have a hard
time convincing people that if

782
00:41:12,960 --> 00:41:16,920
you have, I don't know, like 64
cores, but active session count jumped

783
00:41:16,920 --> 00:41:19,460
to 200, 300, it's already not normal.
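
A minimal sketch of that check; Postgres doesn't know its own vCPU count, so the 64 below is an assumed value you'd substitute:

    -- Compare the current active session count to the core count (assumed 64)
    SELECT count(*) AS active_sessions,
           count(*) > 64 AS exceeds_vcpu_count
    FROM pg_stat_activity
    WHERE state = 'active';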

784
00:41:19,660 --> 00:41:22,260
They say, well, it worked.

785
00:41:22,880 --> 00:41:23,740
Michael: No one complained?

786
00:41:24,180 --> 00:41:24,940
Nikolay: Yeah, yeah, yeah.

787
00:41:25,900 --> 00:41:26,820
Well, it worked.

788
00:41:27,260 --> 00:41:31,280
And part of the problem in Postgres is
we don't have a good metric

789
00:41:31,280 --> 00:41:34,420
for average latency, for example,
for query processing, because

790
00:41:34,780 --> 00:41:40,960
the database's job is, like, we want
query processing to not produce

791
00:41:41,040 --> 00:41:42,560
errors and to be fast.

792
00:41:42,980 --> 00:41:46,480
For fast, we have a definition of
fast for the OLTP case.

793
00:41:46,500 --> 00:41:48,060
I have an article about it.

794
00:41:48,100 --> 00:41:50,400
Definitely, it's not 1 second,
it should be below.

795
00:41:50,600 --> 00:41:52,200
It should be below 100 milliseconds.

796
00:41:52,200 --> 00:41:56,040
In most cases, it should be below
10 milliseconds because 1 HTTP

797
00:41:56,040 --> 00:42:00,040
request consists of multiple SQL
queries, usually, in many cases.

798
00:42:00,540 --> 00:42:05,680
And human perception is
200 milliseconds, so we have

799
00:42:05,680 --> 00:42:09,640
some threshold already, so let's
keep latency low.

800
00:42:09,660 --> 00:42:14,200
But the funny thing is, Postgres doesn't
have latency exposed, average

801
00:42:14,200 --> 00:42:14,700
latency.

802
00:42:15,600 --> 00:42:16,300
It doesn't.

803
00:42:16,720 --> 00:42:19,460
So, pg_stat_database doesn't
have it.

804
00:42:20,660 --> 00:42:21,900
Nothing has it.

805
00:42:21,900 --> 00:42:23,300
Only pg_stat_statements.

806
00:42:24,440 --> 00:42:26,380
But it's not precise.

807
00:42:27,040 --> 00:42:28,400
Michael: It's not in core.

808
00:42:29,180 --> 00:42:30,140
Nikolay: It's not in core.

809
00:42:30,140 --> 00:42:31,040
It's not precise.

810
00:42:31,080 --> 00:42:32,520
There is pg_stat_statements.max, 5,000 by default.

811
00:42:32,700 --> 00:42:36,260
In some cases, workload is complex
and there is constant eviction

812
00:42:38,260 --> 00:42:41,760
of records from pg_stat_statements
and appearance of new ones.

813
00:42:42,040 --> 00:42:46,580
So the latency measured from pg_stat_statements,
this is what most

814
00:42:46,580 --> 00:42:51,840
monitoring systems do, including
dashboard number 1 we discussed

815
00:42:51,860 --> 00:42:54,720
earlier, from pgwatch2 Postgres.AI
edition.

816
00:42:55,480 --> 00:43:00,000
But it feels not fully reliable,
right?

817
00:43:00,720 --> 00:43:04,120
But this is important because this
is how we can say, okay, really

818
00:43:04,120 --> 00:43:05,600
slow, how much?

819
00:43:05,600 --> 00:43:09,520
We had sub-millisecond latency,
now we have 5 millisecond latency.

820
00:43:09,520 --> 00:43:11,540
Okay, indeed, there's proof of
it.

821
00:43:11,820 --> 00:43:13,740
I like that PgBouncer reports it.

822
00:43:13,740 --> 00:43:14,140
Michael: I was

823
00:43:14,140 --> 00:43:15,320
going to
ask, yeah.

824
00:43:15,320 --> 00:43:17,220
Nikolay: It logs it, and then in
stats it reports it.

825
00:43:17,220 --> 00:43:18,100
This is great.

826
00:43:18,740 --> 00:43:21,980
This is what we should have, honestly,
in Postgres as well.
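
For reference, PgBouncer exposes this through its admin console (connect to the special pgbouncer database, for example psql -p 6432 pgbouncer); in SHOW STATS, avg_query_time is reported in microseconds:

    -- In the PgBouncer admin console:
    SHOW STATS;
    -- Per-database columns include total_query_count, total_query_time,
    -- avg_query_time (microseconds), avg_xact_time, and others.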

827
00:43:22,500 --> 00:43:26,700
But, yeah, I actually don't remember
discussions about it.

828
00:43:26,720 --> 00:43:28,020
There should be some discussions.

829
00:43:28,680 --> 00:43:32,640
So, this is maybe our main characteristic
of performance.

830
00:43:33,080 --> 00:43:36,520
I wish, of course, we had percentiles,
not only average.

831
00:43:39,060 --> 00:43:41,600
Many people monitor it from the client
side.

832
00:43:41,840 --> 00:43:45,960
Datadog has APM and there's the ability
to monitor it from client

833
00:43:45,960 --> 00:43:50,580
side, but this is not purely database
latency because it includes

834
00:43:51,160 --> 00:43:55,460
round trips, RTTs, round trip times,
network, right?

835
00:43:55,920 --> 00:43:59,320
And it should be excluded if we
talk about the database, to understand

836
00:44:00,060 --> 00:44:01,500
the behavior of the database, right?

837
00:44:01,720 --> 00:44:05,360
So yeah, pg_stat_statements, this
is how we understand latency.
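
A minimal sketch of how that average latency is usually derived; real monitoring takes two snapshots and divides the deltas, while this one-shot version averages over the whole lifetime of the stats:

    -- Average statement latency in milliseconds since the last stats reset
    -- (total_exec_time is in ms on Postgres 13+)
    SELECT sum(total_exec_time) / NULLIF(sum(calls), 0) AS avg_latency_ms
    FROM pg_stat_statements;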

838
00:44:06,220 --> 00:44:09,480
And yeah, if it's slow, it's slow.

839
00:44:09,480 --> 00:44:13,380
And then we need to, again, apply
segmentation and top-down analysis

840
00:44:13,380 --> 00:44:15,060
and find what exactly is slow.

841
00:44:15,060 --> 00:44:17,680
Everything or just some of it,
right?

842
00:44:19,200 --> 00:44:23,740
So it's, it's P2, P3 incidents.

843
00:44:24,140 --> 00:44:27,680
I think for smaller companies,
it's hard in terms of database.

844
00:44:27,920 --> 00:44:29,760
It's possible, but it's too much
work.

845
00:44:29,760 --> 00:44:30,260
Maybe.

846
00:44:30,560 --> 00:44:33,360
Michael: Well, but I also think,
I think there could be an argument

847
00:44:33,400 --> 00:44:37,340
for, like, making incidents a bit
more normal in your team and

848
00:44:37,340 --> 00:44:38,100
less stressful.

849
00:44:38,200 --> 00:44:42,280
So when you do have a stressful
1, or like when you do have a

850
00:44:42,280 --> 00:44:44,280
big 1, that's a bigger deal.

851
00:44:44,380 --> 00:44:45,460
Nikolay: I see your point.

852
00:44:46,620 --> 00:44:50,540
Unless your team is overwhelmed
with P1 incidents, which

853
00:44:50,540 --> 00:44:55,680
I also had in my team actually,
and I saw it, like, every

854
00:44:55,680 --> 00:44:57,420
day we have the database down.

855
00:44:58,680 --> 00:45:02,340
Other than that, it's a good idea if
you don't have database incidents

856
00:45:03,120 --> 00:45:09,580
to say, okay, let's look for P2,
P3 incidents and start processing

857
00:45:09,720 --> 00:45:14,440
them routinely so we build a muscle
for incident management.

858
00:45:14,440 --> 00:45:15,580
It's great advice.

859
00:45:16,420 --> 00:45:16,750
Michael: Cool.

860
00:45:16,750 --> 00:45:17,540
Nikolay: Indeed, indeed.

861
00:45:17,540 --> 00:45:18,040
Yeah.

862
00:45:18,940 --> 00:45:20,020
Yeah, maybe that's it.

863
00:45:20,020 --> 00:45:21,540
Let's wrap it up.

864
00:45:21,900 --> 00:45:22,700
Michael: Sounds good.

865
00:45:23,000 --> 00:45:24,100
Thanks so much, Nikolay.

866
00:45:24,140 --> 00:45:25,220
Catch you next week.