Postgres FM

Nikolay and Michael discuss PostgreSQL emergencies — both the psychological side of incident management, and some technical aspects too.
 
Here are some links to things they mentioned:

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With special thanks to:
  • Jessie Draws for the elephant artwork 

Creators & Guests

Host: Michael Christofides, founder of pgMustard
Host: Nikolay Samokhvalov, founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Nikolay: Hello, hello, this is
Postgres.FM.

My name is Nikolay from Postgres.AI
and as usual, my co-host

is Michael from pgMustard.

Hi, Michael.

Michael: Hello, Nikolay.

Nikolay: And the topic I chose is related to acute, unpredictable problems, which sometimes happen to production systems, including Postgres.

In many cases, the database is at the center of the storm, so let's discuss how we manage this and how to manage it better.

So yeah, how to handle crisis situations with production Postgres.

I called it Postgres urgent care
or emergency room.

I don't know which name is better here, but yeah, I guess

this is something I can share from
my past experience.

Yeah.

So let's discuss this.

Michael: Sounds good.

And good point about the database
often being in the center of

things.

I think it's often the large companies that we notice on Twitter and places like that, when people start tweeting that it's down.

I think recently there was a big GitHub incident, and I think one of the first communications was about it being database related.

Nikolay: Related to database changes, not only the database, but also database changes; when you have a change involving the database, it increases the risk of an incident in most cases, actually.

But it was not about Postgres, so let's exclude this case.

It was MySQL.

But I truly believe that the database, as I usually say, is the heart of any tech system, because it's very, very stateful, right, because it's data.

And since it's stateful, it's really hard to scale it and handle performance and so on.

So indeed, since the database is the heart of our production systems, it's often at the center of the storm.

Michael: Where would you like to
start with this?

Nikolay: I have a plan.

Let's have a two-step discussion.

First, we'll discuss the psychological
part of incident management

related to databases, and second,
purely technical.

How about this?

Michael: Yeah, I like it.

The psychological aspect probably
isn't talked about as much

as it could be.

I think people often in postmortems
focus on the technical things

that could be done differently
to avoid the issue reoccurring.

I mean, that obviously is the most
important thing, but I feel

like so many could also have learned
a lot from how they communicated

or how much they communicated.

Often I see almost no communication
from companies that are having

big outages, very poor levels of
updates.

And it wouldn't completely alleviate
the situation, of course.

If you're down, you're still down.

But I'd be surprised if companies
that communicate better don't

have much better outcomes than
ones that don't.

People looking to cancel services
afterwards, that kind of thing.

Nikolay: Yeah, so indeed, it may not be well discussed, but I think anyone who is dealing with production at a larger scale definitely knows that, first of all, many people don't manage this kind of stress well, and that's fine.

Maybe it's not your kind of thing,
right?

Some people manage stress very
well.

It's not me, by the way.

I manage stress moderately well.

I've learned how to do it and so on,
but in the center of a production

incident, I still find myself with
high emotions.

We should do something very quickly,
right?

It's hard.

Also, tunnel vision.

You see only some things and you
don't have time for anything

else.

It's very hard to relax and capture
the whole picture and so

on.

And that's why tooling is also
important here, right?

Tooling should be designed like
for people who are under stress

and some runbooks and so on.

But this is technical stuff.

We will talk about it slightly
later.

So I think there is some training available and so on.

I'm not a good source of advice here, but I know we should look at people who deal with production systems, SREs, and so on, and there are many, many books written about it, and handbooks, runbooks, and so on.

So, yeah, there are some good practices for how to deal with such stress.

Michael: What are your favorites
of those?

Nikolay: I like to use materials
from Google, of course.

There is sre.google, this is the hostname.

There are 3 books there and quite
good content.

I also like the handbook from GitLab
for the production SRE team.

Michael: I seem to remember Netflix
having some good stuff as

well.

Nikolay: Yeah, yeah.

Well, this area is not only about databases, of course, right?

It's the SRE area, basically, and there are many good materials for this.

But specifically for databases, I think in general, if you feel emotional, maybe you should tell your colleagues, or maybe someone else should be acting more.

You need to understand yourself
basically, right?

If you feel… I sometimes feel myself… I remember we had a Black Friday that was very boring because we were very well prepared.

We had a lot of stuff; it's a large e-commerce company, and we had very good preparation, a war room organized, we were prepared for incidents, and the whole Black Friday was so boring.

So when we finally found some incident at 9 p.m., I was happy.

Finally, we had some material.

And recently, I was helping some customers, and I remember exactly this state.

Finally, something interesting, you know?

This is a great state to be in.

Instead of being stressed, with everything on your shoulders, you don't know what to do, or maybe you know, but what if it won't work?

You have some good materials to check, like some monitoring systems, but you're so stressed, right?

Because you have fears of failure, right?

What if you won't be able to bring the database back up for many hours, and then everything is a disaster?

But if you're in the state of, this is interesting, this is like finally we have some work to do in production, let's just...

I have a high level of curiosity.

Maybe it's a new case.

This comes with experience.

You saw many cases, but you are
looking for new kind of cases

because it's already too boring
to deal with the same kind of

corruption again or I don't know,
like some database down again.

You saw it many times.

And you are hunting for new types
of cases and curiosity helps.

So these are 2 very opposite states, I would say.

And I was in both in my life.

So yeah.

Michael: For you, does it depend a bit on the severity, though? Because for me, even if it was 9 p.m. and I'd been hoping for some interesting case to come up, if it was super serious and everything was down, I wouldn't be happy that we finally got…

Nikolay: Well, honestly, I didn't have cases where, for example, people's lives depended on this.

I can assume this might happen with some systems, but I was in cases where the cost of downtime was super high.

And now I'm not scared already,
you know?

I already had it, right?

So I know how it feels and so on.

I'm not super scared if it's only
about money.

But life-threatening downtime,
honestly, I didn't have it.

And I think if it happened, I would
be very concerned, right?

Maybe there's the realization that it's only about money, and the worst thing that can happen is that somebody will lose money and you will lose a job.

It's not the worst case, actually.

Just relax.

But life-threatening, it's another
story.

I'm very curious if someone who
is listening to us has some system

where the state of Postgres can
influence the health or life of people.

This is interesting.

Michael: Yeah, you do hear about
healthcare use cases, but also

military use cases.

Nikolay: Right, right.

Yeah, it might happen.

Might happen.

I didn't have it.

No, me neither.

So, that's why I'm not super scared.

It's like, okay, it's not a big deal.

I know it's a big deal.

We will be helping professionally, right?

But let's just do what we can, and that's it.

And that's why I'm curious.

Is it a new case, finally?

Okay, let's work on it.

But again, back to my point, you
need to understand yourself.

This is very important.

If you know you react not well,
even to small problems, it's

better maybe to be an assistant.

Michael: Ah, yeah.

I was doing some thinking before the episode on what kinds of emergencies there could be.

And a couple that I don't know if you were thinking of, that would be really scary for me, would be security-style incidents, either external or internal.

Yeah, exactly.

Nikolay: Like hackers acting right
now.

Michael: That would be scary in
a different way, potentially.

Or exciting in a different

way.

Nikolay: This is, I think, CEO level already.

So definitely, if something like that happens, there's not only a technical aspect here, but also a very high-level organizational aspect of how to handle this situation properly.

Right?

Michael: So how - Oh, I was still
talking about psychologically

though.

Nikolay: Psychologically, yes, but these decisions, like how to handle it, are already at CEO level.

It happens.

Recently, I think we received from our insurance a regular routine notice that, you know, our data was stolen again.

And it just happens.

Like, you know, we don't know if your record was also stolen, maybe, no.

And a couple of days later, I found a very good project on GitHub.

Some guy created a database of all SSNs of all Americans and just published it on GitHub.

The fact is that there are only 1,000,000,000 possibilities, given how many digits this number has.

So he just published all numbers up to 1,000,000,000,

Michael: okay.

Nikolay: But some people on Twitter started thinking, oh, I found my SSN as well.

It was like a snowballing joke.

Some people started, okay, I'm going to remove my SSN, and created pull requests.

It's funny.

So back to this: if you know yourself, it's good.

It helps you understand your stress level.

On another note, it's funny that we aim to monitor a database and production systems well, sometimes with second-level precision, but we don't monitor ourselves.

Like cortisol level, right?

It would be great to understand, but we don't have it.

This bothers me a lot, monitoring of human bodies.

I don't understand my own state except for how I feel.

So it would be good to see heart rate, for example, right?
Michael: Yeah.

There are rings and watches that monitor heart rate, which probably correlates quite well with stress level.

Nikolay: Yeah.

Yeah, but let's maybe slowly shift to technical stuff.

So of course, knowing yourself helps.

I wanted to share 1 story.

Very long ago, 15 years ago or so, I had a great team, a great startup.

I was CTO, I think.

Maybe no, I was CEO actually.

But yeah, I was combining these 2 roles, and I had a Postgres production system and great Postgres experts on my team.

And I remember 1 guy was a great Postgres expert, and I made a mistake.

It was my mistake.

I was leaving on a trip for a few days, and I said, you will be responsible for production, especially the Postgres state, because he was the best Postgres expert on my team, right?

But it was an obvious mistake, because an incident happened and he couldn't handle it properly, and he completely… he lost his shit, pardon my French.

Right?

Michael: So you mean technically he would
have been best placed to handle

it in the team.

Nikolay: A technical expert is not necessarily good at incident management, right?

And this was my mistake; I didn't recognize it.

And this led to the end of our cooperation, unfortunately.

So sometimes a good technical expert should be an assistant, right?

Not feel the pressure on their shoulders, right?

This is super important to understand.

So my advice is, you know, try to understand yourself and whether you should be responsible for incident management or just assisting technically, right?

Michael: Yeah, know yourself, but also know your team and know who in your team you can call on for different things.

Yeah.

Nikolay: Yeah.

Now, let's move to the technical
stuff.

What is helpful?

Very helpful, first of all: many small companies, like the ones we deal with. Our main focus right now is companies that are growing, startups, usually lacking database expertise, and many such companies come to us for help, and almost none of them have good incident management in place.

It's not only about Postgres, right?

We always suggest thinking about at least a simple process, because they say, we had an incident last week.

My question is, show us the incident notes.

Are they logged anyhow, like with timestamps?

In most cases, they don't have anything but words.

They have words.

Okay, we saw the database was slow,
then it was unresponsive, blah,

blah, blah.

What you must have for an incident is a sequence; we must have a document with artifacts: when the first known event happened, some logs, screenshots from monitoring, better with links so we can revisit it.

But screenshots matter a lot because
sometimes monitoring has

a small retention window and the investigation
might be long, especially

if you involve external consultants
like us, right?

So there should be some template
and a plan for documenting incidents.

And when you have it, it also helps
with stress because you know

what to do.

You need to identify the first abnormal event and document it, along with things before it and things after it, in some agreed form.

And everything you notice should also be documented, with important things highlighted.

It can be a Google Doc or something
with discussion around it.

It's good when it's possible to
discuss some things so people

can ask questions, clarify, add
some additional knowledge and

so on.

It can be anything actually, right?

But it's important to have, to
be prepared to document it.

Michael: Yeah, I've seen a lot of people start with whatever app you normally use for chat in the team, or some people have a different app specifically for incidents.

But if you're using Slack, for example, start a new channel for the incident, and all incident-related stuff goes in there.

Screenshots, logs,

Nikolay: chat, everything.

Michael: And then people turn it
into a doc later sometimes.

But I could see an argument for
starting with the doc.

But normally people are panicking
at the beginning, so chat makes

sense.

Nikolay: Yeah, chat is more convenient
for many people.

It's what you use every day, so
chat is good.

It's important to have long-term storage for this, converted to a document.

And I can say that most startups which grew to a terabyte or a couple of terabytes in terms of database size, most of them don't have a proper incident management workflow developed.

They must have it.

It's time already.

So yeah, I definitely encourage.

Even if you have a couple of technical teams and technical experts on your team, it's still super important to have an incident management workflow developed.

So yeah, detailed, step by step, so we understand what's happening.

And you agree on the format of this document in advance.

You can use some other companies as examples.

Again, sre.google and the GitLab Handbook are useful for this particular area.

GitLab, for example, has a particular example for incident management.

Many other companies also share their templates and descriptions of how to document it properly.

Super important.

And also, of course, sometimes you feel, okay, I'm documenting, documenting, but who will actually be solving the problem, right?

So it's good if you have a few folks who can help each other, with some of them responsible for documenting and another trying to find a quick solution.

And the document is also important to have because then, in bigger companies, we have a procedure called root cause analysis, RCA, right?

To learn from mistakes and fix them and prevent them in the future, right?

That's why it's also important to document.

But this helps, and this is, I think, the fundamental number 1 technical thing you need to do.

Oh, it's an organizational thing, sorry.

But it includes some technical aspects.

For example, which monitoring do we use when an incident happens, right?

Where do we start?

This dashboard or that dashboard, right?

What technical things must we document there?

For example, of course, we care about CPU level and disk I/O, the basics, right?

Host stats.

If the database seems to be slow or
unresponsive, we must document

these things.

We had discussions about monitoring dashboard number 1; we proposed these things.

Dashboard number 1 in our pgwatch2 Postgres.AI edition is designed for shallow but very wide analysis, very quick, like up to 1 minute of analysis of various components of Postgres and various properties of the workload at a very high level, like the 30,000-foot level, to understand which directions to investigate further, right?

Michael: Yeah, where's
the issue?

Nikolay: Right. Yeah, this is very good to prepare in advance.

I know, if something happens, how I will act and where I will start, right?

Yeah, so this is important.

And you will document it; you'll know how to start.

This is about monitoring and observability
and logs and so on.

Next, there are several particular
cases I can quickly share,

which are important to be prepared
for.

For example, of course, if you
already know that the database has,

for example, a transaction ID wraparound.

Michael: You can see straight away
that there's an error in the

log or something.

Nikolay: Yeah.

So we have cases very well documented from Sentry, MailChimp, and others. And we also have very, very good work from Google, GCP: Hannu Krosing.

He was on PostgresTV presenting his talk about how to handle transaction ID wraparound without single-user mode.

He thinks single-user mode is not the right way to do it, but that is the traditional approach: single-user mode, and a very long processing time for recovering the state of the database.

So you can document what to do if it happens someday, but I haven't seen it in so long, because monitoring has it, alerts have it, and so on.
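As a rough illustration of the kind of check such monitoring and alerts usually rely on (a generic sketch, not something from the episode):

```sql
-- How close each database is to transaction ID wraparound;
-- the hard limit is about 2.1 billion XIDs (2^31).
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2147483647, 2) AS pct_towards_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```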

Michael: And also recent versions
have improvements in this area.

I remember, I think Peter Geoghegan
did some good work around this.

probably others too.

Nikolay: Yeah, I just started with a very scary thing.

Also among the scariest is loss of backups, when you cannot perform disaster recovery, right?

Also very low-risk these days.

Michael: Yeah, I guess these days, sometimes major issues are things like the whole of US East 1 being down. This hasn't really happened for a while, but a cloud regional outage.

I feel like that could still take down a company's data.

If you're using a managed service, or the cloud at all, you're at risk of that.

Obviously there, you can have plans
in place to mitigate that.

Nikolay: Even if it's self-managed, not many people have a multi-region setup.

Michael: Exactly.

Nikolay: It's very hard, actually.

Michael: So if you don't have off-site
backups, you're sat there

thinking, "We just have to wait".

Nikolay: Yeah, it's a complex thing to have a purely multi-region and well-tested production setup, failover-tested very well, and so on.

Yeah, it's a big topic, actually.

So backups and transaction ID wraparound are probably the 2 nightmares of any Postgres DBA, right?

Michael: Are they the scariest
to you?

I think corruption is pretty scary.

Nikolay: Well, it's a good and interesting topic.

Corruption, we had an episode about corruption as well, right?

But this is good to include in the preparation for incidents.

If corruption happens, what will we do?

Some steps.

And the first step, according to wiki.postgresql.org, is to copy the database, right?

Because you will try to fix it, and maybe you will break it more, right?

So copy.

This is the first step to take.

And knowing this helps, because this kind of thing you can know in advance.

By the way, transaction ID wraparound you can practice as well.

There is a recipe I wrote on how to simulate it, right?

So you can have it in a lower environment
and then good luck dealing

with it.

Or you can clone your database
and simulate it there.

So corruption is a very broad topic,
many types of corruption,

but some kinds can also be simulated.

There are tools for it.

So it's good to know it.

But in the cases I saw, in most cases, there was some path of escape.

In some cases, the escape was that we just restored from backups, losing some data, and for that project it was acceptable.

In some cases it was, okay, we just noticed that only pg_statistic is corrupted, so running ANALYZE fixes it.

But long term, we see the database is on NFS, and this is a no, no, no.

Don't use NFS for PGDATA, right?

In most cases where I saw corruption, it was something silly, actually.

But corruption also happens due to bugs, due to various stuff, or due to a mistake in planning some major change, like switching to a new operating system and glibc version.

Fortunately, over the last few years, such corruption happened in non-production, so we fully prevented it.

Michael: Well, the reason
I find it scary is more that

we could have been returning bad
results.

Nikolay: Like it's just silent
corruption.

Michael: Yeah.

That's the fear for me, it's more: how far back does this go?

Anyway, but it's a different kind of emergency.

Nikolay: Yeah, we had corruption due to an index and glibc change in production with one company last year.

And it was our oversight.

But fortunately, it happened only
on standby nodes, which were

not used.

So it was a pure matter of luck that this production corruption caused no harm.

Michael: And no failover.

Nikolay: Yeah, yeah.

Other clusters used standby nodes.

This cluster didn't use them.

And we just saw some errors in...

It was during an upgrade with logical replication.

We saw errors in the logs, quickly reacted, and then realized these standby nodes are not used.

Let's pray that failover won't happen soon.

Of course, it's like, just imagine...

So we quickly mitigated this completely.

Nobody noticed.

But if it happens, yeah, the question is what's the propagation here?

But there's also tooling, and knowing, like learning from other people's mistakes, helps, of course, as usual.

And knowing tools like amcheck; it should be a very routine tool, used often, right?

amcheck, to check B-tree indexes.

Hopefully it will finally support other types of indexes soon.

I think it's still a work in progress, right?
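For reference, routine amcheck usage on B-tree indexes might look roughly like this; the index name is a made-up example:

```sql
-- amcheck is a contrib extension shipped with PostgreSQL.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Basic structural check of a B-tree index (hypothetical name);
-- takes only an AccessShareLock, so it is relatively unintrusive.
SELECT bt_index_check('my_table_pkey'::regclass);

-- Stricter variant that also verifies every heap tuple is indexed;
-- heavier, so better suited to a low-traffic window.
SELECT bt_index_check('my_table_pkey'::regclass, true);
```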

Michael: Yeah.

I saw some work going on.

Nikolay: GIN and GiST.

Yeah.
What else?

For example, the database is shutting down too slowly, it takes a lot of time, or starting up takes a lot of time.

More than once I saw people being nervous, not understanding that it's normal, not understanding how to check the progress or what to expect.

And it happens when you perform checkpoint tuning, we also had an episode about it, and increase checkpoint_timeout and max_wal_size, which you should do on loaded systems.
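For context, checkpoint tuning of that kind typically looks something like this; the values are purely illustrative, not recommendations from the episode:

```sql
-- Larger values mean fewer, bigger checkpoints, which also means
-- potentially longer crash recovery and longer shutdown/startup times.
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET max_wal_size = '64GB';
SELECT pg_reload_conf();  -- both settings can be applied with a reload
```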

In this case, a restart, or just stopping the database or starting the database, might take many, many minutes.

And if it's self-managed, I saw people send kill -9, SIGKILL, to Postgres, because they are nervous, not understanding: "Oh, Postgres is not starting. What to do?"

And I think now, in fresh versions, there are some log messages telling us that we are in recovery mode and showing some progress, right?

I thought about it.

Michael: I think it's very recent.

I think, I can't remember if it...

Nikolay: It should
be so.

I mean, it should be so.

It should be very straightforward.

A DBA should see the progress and
have an understanding of when it

will finish.

For older versions, at least definitely
older than 16, it's unclear.

Usually, if it's self-managed, you just run ps or top to see what the process reports in its title, right?

And you see an LSN there; then you look at your pg_controldata output to understand the point of consistency, and then, if you have 2 LSNs, you go to another Postgres and calculate the difference, and the difference is in bytes.

So you understand how many bytes, megabytes, gigabytes are left, and then you can monitor it every minute or every second, understand the progress, and have an ETA, expected time of arrival, right?

And this helps.

And I think it's a good idea to
learn how to do it.

For older versions and newer versions, I have a how-to about it, actually: what to do if Postgres startup or stop time takes long.

And yeah, it should just be learned, right?

And if you're prepared, it will reduce stress.
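To make the arithmetic concrete, a minimal sketch; the two LSNs are placeholders for values you would take from pg_controldata output and from the ps title of the recovering process:

```sql
-- Run on any available Postgres: bytes of WAL left to replay,
-- i.e. target LSN minus the LSN currently reported by recovery.
SELECT pg_size_pretty(
         pg_wal_lsn_diff('2A/7F000000', '2A/10000000')
       ) AS wal_left_to_replay;
```

Sampling this every minute or so gives the replay rate and, from that, an ETA.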

And yeah, we had a lot of such cases working on DBLab.

Sometimes a clone is not created. Why?

It's because the WAL size is huge and so on.

It's just recovering, so you just need to wait a little bit more.

But then we improved it.

So yeah, this might happen.

This is a very common situation: long restart time.

Michael: Yeah, I'll definitely
share that episode in the show

notes as well so people can find
it if they weren't listening

back then.

Nikolay: What else?

Somebody deleted data and you

Michael: need to recover.

We have other episodes like out
of disk.

Like there's other kinds of emergencies.

One we haven't covered in much detail, I don't think, was the big one, like out of integers.

Yeah, in short.

Nikolay: Oh, out of integers is
a big disaster.

Michael: Yeah.
Yeah.

But I guess that's quite common. In terms of other common issues people come to you with, is that up there, or what tends…

Nikolay: I may be biased here, because I have a feeling it's a very well-known problem, and people have already mitigated it, or are mitigating it, without requiring a lot of expertise.

Our Postgres checkup tool has a report for it, saying how much of the capacity of an int4, regular integer, primary key is left for a particular table.

For me, it's straightforward already, right?

And I don't see a big deal.

But if it happens, of course, it's at least partial downtime, because you cannot insert new rows into this table.

And it can be scary.
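A simplified version of that kind of report, for a single serial primary key; the sequence name is hypothetical, and a real checkup tool would scan all int4 key sequences:

```sql
-- How much of the int4 range (max 2,147,483,647) this key has consumed.
SELECT last_value,
       round(100.0 * last_value / 2147483647, 2) AS pct_of_int4_used
FROM my_table_id_seq;
```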

Michael: That's true of so many
of these issues though, right?

Like once you're monitoring for
them, once you know about them

and you've got alerts far enough
out, they become not emergencies.

Nikolay: Right.

But I'd also like to mention a common problem: the database is slow, or the database is unresponsive, and what to do.

Very general. Where to start?

What do you think?

Michael: Well, I think that's the
monitoring thing, isn't it?

Like that's to go to the monitoring,
that number 1 dashboard

you talked about.

Nikolay: Yeah.

Michael: I think that's the working out where the problem is.

It needs to be the first point, doesn't it?

Nikolay: Yeah, I agree.

And the first thing I would start understanding… I think we can talk about methodologies here, starting from USE, right, and others; there are many of them.

USE is utilization, saturation, errors, from Brendan Gregg, the basics from Netflix, right?

It's a very, very trivial approach, I would say.

But yeah, here, the first question, if the database is slow and unresponsive, the first question is: are we really putting more workload on it?

A very simple question, but sometimes hard to answer.

Because often we find out that many more clients are connected, some background job started bombarding the database with new queries, retrying a lot of connections.
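A quick, generic way to sanity-check that first question from inside the database (just one possible view of it, not a prescribed recipe):

```sql
-- Who is connected right now, and in what state? Compare against
-- what is normal for this system.
SELECT usename, application_name, state, count(*) AS connections
FROM pg_stat_activity
GROUP BY usename, application_name, state
ORDER BY connections DESC;
```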

Michael: Ah, like a cascading effect.

Nikolay: As well.
Yeah, yeah, yeah.

Michael: Is it because of elephants, or
I actually don't know that term,

but it's like a...

Nikolay: So the question is whether more load is coming externally to the database.

And this can, of course, be a reason why it's slow.

And if it's not tuned well to handle spikes of load, for example, you keep max_connections high, ignoring advice from Postgres experts to keep it sane.

Recently I saw, I'm sharing without names so I can share, right, 12,000 max_connections.

This is, for me, I think, a record.

A new client showed it, and they explained.

I see it's a trend.

Recently when I say you need to
decrease max_connections, I

also say most likely you will not
do it right now because most

people tend not to do it.

They all have reasons why max_connections
should be very high.

And of course, since Postgres,
I think, 14, things have improved

in terms of handling idle connections.

But when an incident happens, these
idle connections become active,

and we have almost 0 chances for
statements to be finished because

the server is overwhelmed with load.

Whereas if you have a sane number for max_connections, I would say, take your vCPU count and multiply it by some relatively low multiplier, like less than 10.

That should be max_connections for all OLTP workloads.

Then you have PgBouncer or something.
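As a sketch of that rule of thumb only; the vCPU count and multiplier below are example assumptions, not numbers from the episode:

```sql
-- Example: 24 vCPUs * 4 = 96; application connections go through a
-- pooler such as PgBouncer instead of hitting Postgres directly.
SHOW max_connections;
ALTER SYSTEM SET max_connections = 96;  -- requires a Postgres restart
```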

So if you have this and enormous load is coming, the additional load will receive an "out of connections" error.

And existing transactions, or whatever is already there, yeah, have a chance to finish: current statements, current query processing, new ones, and so on.

So it's much better than trying to please everyone, right, and not being able to do it at all, including with your old clients, you know.

Michael: It also makes some diagnosis
easier, right?

If the database is still responding
to anything, it's easier

to diagnose issues than if it's
not responding at all.

Nikolay: Exactly.

Michael: It's kind of just moving
the problem, but it's definitely

an improvement.

Yeah.
But yeah, it's a good point.

Like it could just be overwhelmed, but there are like a million other reasons.

Nikolay: Of course.

But the first question, I would say, is: are we receiving more load?

So the reason is already outside of Postgres.

Well, technically, I just explained an additional factor, high max_connections; partially the problem is inside Postgres, but the main reason, the root cause, is outside.

We're just receiving much more than usual.

Right.
This is the number 1 thing to check.

We don't have time to discuss a full recipe for troubleshooting such cases.

Michael: We've got an episode,
I think, for that.

Nikolay: Maybe, yeah, I already
keep forgetting.

Michael: Probably actually just
monitoring,

Nikolay: yeah.

Yeah, but maybe we should have, like, you know, how to troubleshoot a slow database, step by step.

So, to save time, the second piece of advice, I would say, is to just check wait event analysis.

Second thing.

If you have a lot of active sessions… maybe, actually. Sometimes databases are slow without a lot of active sessions.

It's interesting.

But also, understanding the number of active sessions is very important.

But the next thing is to understand what they are doing, right?

So are they doing a lot of I/O, or is there contention related to the lock manager, for example, or sub-transactions, or anything like that?

So, wait event analysis is super important.
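A minimal ad hoc version of that wait event breakdown, sampling pg_stat_activity once (a dashboard would sample it continuously and chart it over time):

```sql
-- Active sessions grouped by what they are waiting on;
-- a NULL wait_event generally means the backend is busy on CPU.
SELECT wait_event_type,
       wait_event,
       count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
```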

And we are discussing right now how to improve dashboard number 1 and...

No, no, not that dashboard.

Dashboard number 3, which is query analysis, in pgwatch Postgres.AI edition.

And I'm almost convinced to put wait event query analysis at the top.

Previously, I was thinking we should have total time from pg_stat_statements and average time, with total time maybe higher, and we had a long discussion inside the team about which should be higher.

But now I'm almost convinced that wait event analysis should actually be at the very top, because it gives you a very quick understanding: just from 1 chart you quickly understand the number of active sessions and their distribution in terms of what they are doing.

In any analysis, when you have some number, the next step is to segment it properly, right?

So, to divide this number into some segments.

And I think wait events are a very good dimension for segmentation, how to say it.

Michael: Yeah, it splits it into fewer things, so it's easier to spot if there's a majority.

Whereas with query analysis, you could have a real long tail; even the most commonly executed query might only be 1% of your workload.

Well, yeah, it might be 50%, but more likely it might be 1%.

Nikolay: Yeah, and timing in pg_stat_statements hides details: it might be actual work the database is doing, and that's why it's spending time, for example, sequential scans due to lack of indexes or something like that, or contention, or it might be waiting for a lock to be acquired.

So it's also spending time, and you quickly see which.

So, very good books, as usual, from Brendan Gregg.

On troubleshooting, I remember also his talks, a two-part talk about tooling for Linux and so on, and he mentioned: if he needed to choose just 1 Linux tool, like you can use only 1 tool with the biggest outcome in terms of troubleshooting, what is it?

Do you remember?

No?

Michael: No.

Nikolay: It's iostat.

Oh, why?

It gives you disk I/O, and it also reports CPU, segmented by user, system, I/O wait, and so on.

So it's super good; you see disk I/O and CPU just from 1 tool.

Similar here, we see the number of
active sessions and also we

see wait events segmentation.

It's a very good chart to have
for troubleshooting.

Michael: It feels to me like an interesting trade-off: whether you're looking at monitoring more often, or not even necessarily more often, but do you optimize for people in an incident, or do you optimize for people doing general performance work?

And I think optimizing for the incident people makes some sense, even though it's less often, hopefully.

Nikolay: Yeah, they
have less time.

Michael: Less time, but also heightened emotions and not thinking straight, like we said at the start.

So maybe that's a…

Nikolay: The path should be shorter.

Yeah, yeah, I agree.

Right.

Yeah.

So there are many other things
that can happen with a database,

of course, right?

But if you know some common things,
it helps a lot.

Yeah.

And tooling should be prepared
and, yeah, observability is important.

Michael: Yeah, 1 last question.

I think there's some arguments
for trying to reduce incidents

down to like nearly 0, like trying
to put everything in place

so that you never have any incidents,
you know, high availability,

everything to try and minimize
the risk.

And then I think as a team, you
can get out of practice dealing

with incidents if you're good at
that kind of thing.

But then when one does happen, it
can really throw you.

Some teams like to deal with super
minor incidents and treat

those as incidents, almost like
as practice.

Do you have any opinions or feelings
around that kind of thing?

Nikolay: Yeah, good point.

So we actually didn't discuss many things, for example, how to categorize incidents, like priority 1, priority 2, and so on.

Because when a client comes, and it happened a couple of times over the last month, a client comes and shows me some graphs with spikes of active sessions significantly exceeding the CPU count, I already say: oh, you are having at least, you know, a P3 incident, or maybe P2.

Maybe it's not user-facing, people haven't noticed it, but it's an incident already.

It requires investigation, and they say the database is slow, but this already needs some reaction and mitigation.

So it maybe requires understanding, expertise, and classification rules, which require PostgreSQL understanding, right?

Because sometimes I have a hard time convincing people that if you have, I don't know, 64 cores, but the active session count jumped to 200, 300, it's already not normal.

They say, well, it worked.

Michael: No one complained?

Nikolay: Yeah, yeah, yeah.

Well, it worked.

And part of the problem is that in Postgres we don't have a good metric for average latency, for example, for query processing, because the database's job is, like, we want query processing to not produce errors and to be fast.

Fast, we have a definition of fast for the OLTP case.

I have an article about it.

Definitely, it's not 1 second; it should be below that.

It should be below 100 milliseconds.

In most cases, it should be below 10 milliseconds, because 1 HTTP request consists of multiple SQL queries, usually, in many cases.

And human perception is 200 milliseconds, so we have some threshold already, so let's keep latency low.

But the funny thing is, Postgres doesn't have latency exposed, average latency.

It doesn't.

pg_stat_database doesn't have it.

Nothing has it.

Only pg_stat_statements.

But it's not precise.

Michael: It's not in core.

Nikolay: It's not in core.

It's not precise.

There is a maximum, 5,000 entries by default.

In some cases, the workload is complex, and there is constant eviction of records from pg_stat_statements and appearance of new ones.

So latency is measured from pg_stat_statements; this is what most monitoring systems do, including dashboard number 1 we discussed earlier, from the pgwatch2 Postgres.AI edition.
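The derived average latency is usually computed roughly like this (column names assume PostgreSQL 13+, where total_time was split into total_exec_time; it is subject to the record eviction just mentioned):

```sql
-- Approximate average statement latency, in milliseconds, across
-- everything pg_stat_statements currently tracks.
SELECT sum(total_exec_time) / nullif(sum(calls), 0) AS avg_latency_ms
FROM pg_stat_statements;
```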

But it doesn't feel fully reliable, right?

But this is important, because this is how we can say, okay, it's really slow, but by how much?

We had sub-millisecond latency; now we have 5-millisecond latency.

Okay, indeed, there's proof of it.

I like that PgBouncer reports it.

Michael: I was going to ask, yeah.

Nikolay: It reports it in its logs and in its stats.

This is great.

This is what we should have, honestly,
in Postgres as well.

But, yeah, I actually don't remember
discussions about it.

There should be some discussions.

So, this is maybe our main characteristic of performance.

I wish, of course, we had percentiles, not only the average.

Many people monitor it from the client side.

Datadog has APM, and there's the ability to monitor it from the client side, but this is not purely database latency, because it includes round trips, RTTs, round-trip times, network, right?

And that should be excluded if we're talking about the database, to understand the behavior of the database, right?

So yeah, pg_stat_statements, this is how we understand latency.

And yeah, if it's slow, it's slow.

And then we need to, again, apply segmentation and top-down analysis and find what exactly is slow.

Everything, or just some of it, right?

So it's P2, P3 incidents.

I think for smaller companies, it's hard in terms of the database.

It's possible, but it's too much work.

Maybe.

Michael: Well, but I also think there could be an argument for making incidents a bit more normal in your team, and less stressful, for when you do have a stressful one, or when you do have a big one that's a bigger deal.

Nikolay: I see your point.

Unless your team is overwhelmed with P1 incidents, which I also had in my team, actually, where I saw we had the database down every day.

Unless that's the case, it's a good idea, if you don't have database incidents, to say, okay, let's look for P2, P3 incidents and start processing them routinely, so we build a muscle for incident management.

It's great advice.

Michael: Cool.

Nikolay: Indeed, Indeed.

Yeah.

Yeah, maybe that's it.

Let's let's wrap it up.

Michael: Sounds good.

Thanks so much, Nikolay.

Catch you next week.