Postgres FM

Nikolay and Michael discuss some more advanced topics around EXPLAIN, including some tips for complex query plans, some recent improvements, and an idea or two that might make it even better.
 
Here are some links to things they mentioned:

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With special thanks to:
  • Jessie Draws for the elephant artwork 

Creators & Guests

Host: Michael Christofides, founder of pgMustard
Host: Nikolay Samokhvalov, founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Nikolay: Hello, hello, this is Postgres.FM.

My name is Nikolay, Postgres.AI, and with me, as usual, is my co-host Michael, pgMustard.

Hi Michael, how are you doing?

Michael: Hello Nikolay, I'm good,
how are you?

Nikolay: Very good, very good.

So it was my turn to choose the
topic.

And I chose the topic naturally.

I thought we hadn't discussed EXPLAIN.

We never, ever discussed EXPLAIN, right?

We didn't do it.

But then I googled and found out
that you discussed it without

me.

So I thought, it doesn't count.

Let's discuss it once again, but maybe going deeper in some places.

Michael: Yeah, I covered the basics and
I think it's a good suggestion

to cover more advanced topics.

I think EXPLAIN is one of those topics
that you can just keep getting

more advanced.

And there's definitely a level
at which I'll get way out of my

depth.

But yeah, there's definitely room
for a second episode where

we go into more advanced topics
for sure.

Nikolay: Right.

And just to those who don't know,
you spend most of your professional

time dealing with plans because
pgMustard is helping people analyze

and improve queries, particular
queries, looking at EXPLAIN plans,

right?

Michael: Yeah, exactly.

It's a visualization tool for EXPLAIN
and then also tries to

provide performance advice based
on that.

So we normally get people's more complex query plans.

So I guess this is an area where I've got more practical experience than most.

Obviously, everyone knows, when you're building something: people that build healthcare software aren't always intimately familiar with everything about healthcare.

There are certain things you get really familiar with, but you also end up having to deal with things that are not to do with that at all.

But yeah, I have over the past
few years come across probably

most variations of how performance
problems can be shown via

EXPLAIN plans.

I have some blind spots for sure, but yeah, it's been a big thing for me. But also for you, right? Like, a big part of your work's been performance consulting, and…

Nikolay: Yeah, well, it's…

Michael: …hard to do that without EXPLAIN.

Nikolay: For me, EXPLAIN is not the number one tool usually, except in cases when a customer asks us to help with a specific query, which happens as well.

But usually we deal with analysis, like top-down analysis of the whole workload.

We start with pg_stat_statements and wait event analysis, Performance Insights or something like that.

Then we go down, and we identify a first aggregated query, or "normalized query" in official pg_stat_statements terminology.

Only then do we find examples of the badly behaving query.

Then we discuss how to improve it. We need a playground, a proper playground, and maybe we will want to discuss how to make it so EXPLAIN plans match what we see in production.

And only then do we go and check the actual plan and try to improve it and so on.

Sometimes we have auto_explain,
right?

auto_explain is a great tool.
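
[Editor's note: a minimal sketch of enabling auto_explain, since it comes up here; the threshold and options below are illustrative, not recommendations.]

    -- Load the module and log plans of statements slower than 1 second:
    ALTER SYSTEM SET shared_preload_libraries = 'auto_explain';  -- needs a restart
    ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
    ALTER SYSTEM SET auto_explain.log_analyze = on;
    ALTER SYSTEM SET auto_explain.log_buffers = on;
    SELECT pg_reload_conf();  -- picks up the auto_explain.* settings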

Michael: Yeah, we did an episode on that
too.

Nikolay: Yeah, yeah, I remember.

And yeah, if you have it, it's
great.

But it's not always so.

It's not enabled by default.

It requires some effort to be enabled.

Although it's present, it's available
everywhere.

So I wish I saw it more often.

Michael: Did you see I did a blog post on this, this year, on which cloud providers provide access to EXPLAIN?

Nikolay: I don't remember, honestly. I need to check it.

Michael: I'll share it. It's not the most interesting read, but it's a good reference, and I find it useful if somebody mentions a cloud provider to me.

Nikolay: So everyone has it right?

Michael: Yes, pretty much.

Not quite everyone but pretty much
everyone.

And most of them provide access to most parameters. So EXPLAIN and auto_explain have a bunch of parameters, and I found it hard to check which providers offered which parameters.

And I've got good news as well,
actually.

I got a nice direct message from somebody who works at AWS, and the parameters they didn't offer (there are like 2 or 3 that they didn't offer), they've submitted a pull request for.

So I'm looking forward to the next release of the RDS products, which I think will include those.

Nikolay: That's good.

So yeah, of course, we also deal with EXPLAIN, but it's just one of the things, and we probably don't dive as deep as you do.

So where do you want to start this discussion?

Michael: It was your suggestion, where do you want to start?

Nikolay: Oh, but you're an expert, right?

Michael: Okay, alright, maybe.

I'm not sure, I still feel really uncomfortable with the word

expert.

Nikolay: I'm feeling very uncomfortable with the word ANALYZE.

It should be EXECUTE, not ANALYZE, right? In the EXPLAIN command.

Michael: I mean, naming...

I've been thinking about naming for a while with this, because part of our tool is we give people something to copy and paste, so they can easily copy it and put it in front of their query if they're using psql or an editor or something.

And it's getting longer and longer, the more parameters that get added to EXPLAIN. And the more information, the better, when you're using a tool.

If you don't have to worry about the output getting long, and your visualization tool should be pointing people at the right place anyway, the amount of data is like, the more the better, really.

The more information you get, the more likely it is you can point

out the problem.

However, it's become like EXPLAIN and then, in parentheses, ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT JSON, and there's more.

Like, wow, now there's SERIALIZE and MEMORY. But most of these are off by default, so you have to specify all of them to get all of the information.
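
[Editor's note: for reference, a sketch of the kind of "everything on" invocation Michael describes, on Postgres 17; the table is hypothetical.]

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, WAL,
             SERIALIZE, MEMORY, FORMAT JSON)
    SELECT * FROM orders WHERE customer_id = 42;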

So ANALYZE is one of them, but the reason I was thinking about naming was I was wondering about suggesting something like EXPLAIN ALL. But the problem with that is there are so many pieces of advice out there in the wild about, when you run EXPLAIN ANALYZE, be careful, because your query will be executed.

If your query is data-modifying, it will modify the data.

If your query is long, it will take that amount of time to run.

And all of that advice would become, not obsolete, but if we change the name, all of that advice becomes difficult. Like, you have to go back and change all of the advice, or you have to add "if you want to, EXPLAIN ANALYZE, or EXPLAIN ALL".

Nikolay: For 5 years, the old word could be supported.

But it's so confusing to explain to people that ANALYZE is actually executing the query, and it's not the same ANALYZE as in "ANALYZE table_name", which probably also affects the plan, but you need to understand this.

If you run ANALYZE table_name, probably after that you will see a different plan in EXPLAIN and EXPLAIN ANALYZE, but it's not because these two words are the same word.

They are very different words.

Remember we discussed Postgres 17, there was a confusion resolution. I wish this confusion would one day be resolved as well.
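
[Editor's note: a minimal illustration of the two unrelated meanings of ANALYZE being discussed; "orders" is a hypothetical table.]

    -- ANALYZE as an EXPLAIN option: actually executes the statement.
    EXPLAIN ANALYZE SELECT * FROM orders WHERE id = 1;

    -- ANALYZE as a standalone command: recalculates planner statistics
    -- for the table (which may, in turn, change future plans).
    ANALYZE orders;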

Michael: I agree in principle, I just think in practice it's

gonna be painful.

Nikolay: Why painful?

The word EXECUTE is not used there yet.

First we introduce a synonym, and then in 5 years we ditch the old…

Michael: It's used in prepared statements, right?

Like prepare, execute.

Nikolay: Yeah, but well, EXECUTE means execute, and ANALYZE in EXPLAIN also means execute.

Okay, by the way, let me confess about something related to prepared statements.

A few weeks ago, we had an episode about planning time, which is also related to today's topic, because we mostly check planning time using EXPLAIN.

Of course we can check it in pg_stat_statements, but again, not everyone has track_planning enabled in pg_stat_statements, so usually we check it in auto_explain, or regularly, manually, in an ad hoc way, using EXPLAIN.

Michael: Not auto_explain, sadly.

Nikolay: It cannot show planning time?

Wow.

Okay.

Good to know.

Michael: But in pg_stat_statements, it's there if track_planning is on, yeah.

Nikolay: Yeah, if track_planning.
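
[Editor's note: a sketch of the two ways to see planning time that were just mentioned; the settings and table are illustrative, and pg_stat_statements must already be installed.]

    -- pg_stat_statements only tracks planning stats when this is on:
    ALTER SYSTEM SET pg_stat_statements.track_planning = on;
    SELECT pg_reload_conf();

    -- Ad hoc: plain EXPLAIN with SUMMARY reports Planning Time
    -- without executing the query.
    EXPLAIN (SUMMARY) SELECT * FROM orders WHERE id = 1;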

So if we, for example, deal with partitioning, with a high number of partitions, planning time increases, as we discussed.

But I just blogged about it, and right after that we had a discussion on Twitter, and it turned out that my experiment was wrong. My ideas were not fully wrong, but not purely accurate, because planning time is affected only the very first time a query is executed in the session.

If you execute it once again, we already have the relation metadata cache, relcache, and the overhead becomes almost zero.

At least for simple queries.

Michael: Yes, simple queries that can prune down to a single

partition or like a low number of partitions.

Michael: And it was David Rowley who pointed out…

Nikolay: Yes, of course. If there's no pruning, then it's a different story.

Of course, with the number of partitions growing, more work needs

to be done at planning time.

But yeah, I'm talking about partition pruning happening at planning

time.

The idea is only the very first execution suffers.

Michael: It's so... I mean, I'm really impressed David noticed that so quickly.
so quickly.

Yeah.

I mean he knows his stuff around
this topic, I'm pretty sure

he was involved in the fixes, I
think it was in Postgres 12 to

improve that exact metric.

So yeah, kudos to him for pointing
that out so politely as well.

I think he said it was misleading,
not wrong.

And also a really good reminder. Do you remember when we had Melanie on to talk about benchmarking, how difficult benchmarking is, to make sure you're measuring the thing you think you're measuring?

It's just a good reminder that
these things are so difficult.

But by publishing them, by publishing your findings, and by the Postgres community being so open to correcting us when we're wrong, we've all learned something, I think.

Nikolay: That's good. Exactly.

It was very good.

And I cannot agree it's wrong, because there was no mistake there.

The problem was that the experiment didn't do the next step, which of course I should have done.

Actually, misleading is a good term here.

I extended the blog post right away.

It turned out that if you use just EXPLAIN, no execution, no ANALYZE work, if you just use EXPLAIN, you can see what happens as you add partitions.

We checked from 0, 100, 200, and up to 1,000 partitions.

We were using EXPLAIN with a SELECT, a primary key lookup, which was basically also a partition key lookup.

First of all, you see an index scan on only one partition.

Partition pruning is working at planning time.

But planning time grows linearly with the number of partitions, and the number of planning buffers also grows linearly, for this kind of query. And it was range partitioning; we didn't check others.

But once you run EXPLAIN once again in the same session, it's only 3 buffer hits, always, regardless of the number of partitions.
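
[Editor's note: a minimal sketch of the experiment described above; the table layout is illustrative and partition creation is omitted.]

    -- Range-partitioned table with, say, 1,000 partitions already attached:
    CREATE TABLE events (id bigint, created_at timestamptz)
      PARTITION BY RANGE (created_at);

    -- First EXPLAIN in a fresh session: relcache is cold, so the
    -- "Planning:" section's buffers and Planning Time grow with partition count.
    EXPLAIN (BUFFERS, SUMMARY)
      SELECT * FROM events WHERE created_at = '2024-01-01';

    -- Same statement again in the same session: relcache is warm,
    -- and planning buffer hits drop to a small constant.
    EXPLAIN (BUFFERS, SUMMARY)
      SELECT * FROM events WHERE created_at = '2024-01-01';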

That's also a good relief, right?

So you think, oh, it's great. I'm just going to relax, and I can have 10,000 partitions for simple queries like this.

Of course, complex queries will be different, I'm quite sure.

But if partition pruning works at planning time, I'm happy, right?

It just tells me I need to take care of proper connection poolers. I need to have them always.

The problem with our discussion and my ideas a few weeks ago was when I said: oh, if you have high overhead of planning time (which, as it turns out, is only for the first execution's planning), in this case just use prepared statements.

But this is maybe not only misleading but also wrong, because, as I actually learned in detail, if you use prepared statements, you see that partition pruning shifts to execution.

So the plan, a generic plan, includes all partitions, right?

Because we cannot cache a plan that deals only with a single partition in this case. What if we need to execute the prepared statement with a different parameter?

We'd probably need a different partition.

The cached plan includes all partitions, and if you run EXPLAIN for the prepared statement with a specific parameter, you see, at execution time, "Subplans Removed".

This is interesting.

It means that the overhead shifted to execution.

That's not good, right?
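
[Editor's note: a minimal sketch of forcing the generic plan for a prepared statement, reusing the hypothetical partitioned "events" table from above.]

    PREPARE recent (timestamptz) AS
      SELECT * FROM events WHERE created_at = $1;

    -- Force the cached generic plan instead of a custom plan:
    SET plan_cache_mode = force_generic_plan;

    EXPLAIN (ANALYZE) EXECUTE recent('2024-01-01');
    -- The Append node reports "Subplans Removed: N":
    -- pruning happened at execution time, not planning time.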

But again, overhead... which overhead? If we have the relcache here, all good.

So it means that for a high number of partitions, it's not prepared statements that matter.

They can help, but they don't help fully, because of this shift of partition pruning.

But what actually helps a lot is
poolers.

We have poolers in all heavily
loaded projects.

And if you are preparing to have huge tables, like we always say, if you exceed 100 gigabytes, you need partitioning.

But it also means you do need proper connection pooling, to basically avoid the initial overhead of planning.

With a high number of partitions, that first planning will not be good.
So back to explain.

Michael: Yeah, a question back to EXPLAIN, kind of linking them.

When you did the second EXPLAIN, extracting the buffers and things, did you use FORMAT JSON for that and parse the output?

Or what did you do?
Or what did you do?

Nikolay: It was regular text.

Michael: And you just parsed it.

Nikolay: Yes, yeah.

And it was done by our AI, so it doesn't use JSON at this point.

We just don't want a very bloated format.

And actually, yeah, we needed to fix how it works, because right now it works through psycopg, so Python.

If you run a couple of EXPLAINs in one shot, it received only one result.

We fixed that, so now it sees both results.

The second result is always, in our case, three buffer hits.

Planning time remains way below one millisecond, I think maybe 10 microseconds or so, super tiny, almost zero.

The bottom line here is what we usually say: if you run EXPLAIN (ANALYZE, BUFFERS), of course (we had a couple of episodes about it), always run it a couple of times, maybe a few times, because of caches.

If you want to check timing... if you're too obsessed with timing; not my case. I'm obsessed with buffers, not timing.

You need to run it multiple times and check the second execution, because the first time probably everything is not cached: data is not in the buffer pool, and not in the page cache. So run it twice.

That's a good piece of advice,
right?

But what we just discussed means that even for planning, run it twice in the same session, right?

Because maybe you observe a lot of work, if partitioning is involved, and especially because a lot of that work is related to the lack of relcache entries.

Michael: I've seen this.

The other weird thing I've seen, I don't know if you know the answer to, is you can do a simple ID lookup, like a primary key lookup, and you will get planning time, but no planning buffers reported, even if you include the BUFFERS option.

What's that?

How's that?

It's a bug, I think.

Yeah.

Nikolay: Inside my team, last week, while we worked on this post (many people actually were involved), this particular problem was very annoying.

Very annoying.

Like, why are buffers not reported for planning? Okay, now they are reported.

And we cannot understand what it depends on.

So, once we understand, it will probably become a bug report, right?

Because something is wrong there: we requested them.

Michael: It's weird. But I mean, the text format is hurting you there, because by default in the text format, if things are 0, they don't get reported; that's roughly a general rule.

Nikolay: You think so?

Michael: Whereas in JSON format,
you'd still get buffers reported,

but they would have zeros.

It's not much better.

Nikolay: In our case, it should be reported, because it was always something, like some number of hits.

But yeah.

Michael: So that's a different bug, right?

Like, all I meant is, if planning buffers

Nikolay: are 0,

Michael: I don't understand how planning buffers can be 0 if it was actually doing work, if it's actually, you know, picking which index to scan.

So I found that…

Nikolay: In the case of a prepared, cached plan: if the plan is cached, there is still some planning time, but I suspect this planning time occurs during execution, because of these "Subplans Removed" and partition pruning happening.

I'm not sure, by the way.

This is some nuance.

Michael: I think maybe, like, parsing might count as planning. Or, like, I think there are other steps.

Nikolay: Well, I'm looking at it right now.

I'm looking at the plan which checks execution of the forced generic plan.

plan_cache_mode is set to force_generic_plan, and we check execution of a prepared statement.

It's cached. The plan is cached.

We see no planning-related buffers reported at all. It's missing.

But planning time is like 38 microseconds.

And partition pruning happened at execution time.

So I wonder, what are those 38 microseconds?

If the plan is cached, maybe it's just additional checks, I don't know.

It's interesting, right?

Michael: You know there are multiple stages. Like, I can't remember all of them, but one of them, for example, is parsing.

One of them is, like, rewriting, so you might rewrite the query.

I think some of those get bucketed at the moment as planning.

So

Nikolay: probably you're right, there are no buffer hits involved at all, and that's why it's not reported at all.

Michael: But I've seen this. I've seen 0 reported when it's not a cached plan, like if it's a primary key lookup that you've run.

Nikolay: So where there is some
work for sure, right?

Michael: Yeah, or like in my head
it would make sense that there's

work.

Anyway, it's an interesting quirk.

Nikolay: And even if it's not a bug, the fact that you raised this topic, and we discussed this topic inside my team, it's already something that is not clear, right?

And yeah, if someone watching us is more experienced than us, please tell us.

This is a super interesting piece of information.

So, what I wanted to say: you need to check the EXPLAIN, for planning, a second time inside the same session, just to exclude the possibility that the overhead you observed came from the lack of relcache entries.

This is the lesson I learned last week.

This is a correction to the ideas I expressed a few weeks ago in the planning time episode.

Michael: And there are some good practices that help mitigate this kind of issue anyway.

Normally, when people are benchmarking or testing things, they will run something 5 times and maybe take the best of the 5, or, you know, run things multiple times. I don't know if you've ever seen how some sports do it, like diving, for example: they give people scores out of 10, and then they cross out the top-scoring judge and the bottom-scoring judge.

And yeah, actually,

Nikolay: we did that.

We did it, but, due to some reasons, it was not quite like that. We plan to improve this, and this kind of analysis should be done exactly that way.

You run multiple times, you check which data points are way off and exclude them, and then you take the average.

That makes total sense.

But in this case, just how we did it, it was a single-session experiment, and we ran all the EXPLAINs in it.

So the bottom line: if you run EXPLAIN 5 times and every time it's a new connection, the problem persists.

You need to run it multiple times inside the same connection.

And then I would not agree with excluding the first one due to the lack of relcache entries, because it's also an interesting fact that, with a growing number of partitions, the very first execution can reach dozens of milliseconds, and even hundreds of milliseconds.

It's a lot.

It means connection poolers are needed, and they need to be configured properly, if you have thousands of partitions.

Otherwise, very often a new connection will be established and will execute for the first time over a huge number of partitions, and you have a penalty of dozens of milliseconds. For heavily loaded projects it's a lot, maybe 100 milliseconds.

It's a huge overhead.

This fact is also interesting, right?

We learned two facts: the first-execution overhead is huge, growing linearly, for this particular, very trivial case; and there is basically no overhead for subsequent queries.

Wow.

This was like a sequence of surprises for me in this work.

I like it, actually.

Let's maybe switch to another topic.

Maybe you have some ideas about what to discuss next?

Michael: Yeah, well when I think of advanced EXPLAIN, I'm thinking

more like advanced query optimisation.

Maybe you're familiar with, and got good at, reading EXPLAIN plans when it's a relatively simple query: maybe some normal scan types, some normal JOIN algorithms.

You're somewhat familiar with it, you can work your way around

that.

But I've spent several years looking at this and I still from

time to time see operation types that I've never seen before.

And things get complex just in terms of how many operations there are, or there are things that complicate plans, like CTEs and InitPlans, and which order things happen in, things happening at the same time.

I think we had Haki Benita on relatively
recently talking about

how if you have multiple CTEs,
those happen at the same time

and independently of each other.

So, like, that's confusing.

And, like, how those appear in EXPLAIN is interesting.

But because it's shown as a tree, they have to kind of work out different display mechanisms for showing that. That's not necessarily misleading, because it's hard to say how you should visualize that.

But that's what I'm thinking in terms of, like, advanced topics around EXPLAIN.

How do you interpret those?

And also, what can you ignore?

If you want to do this, normally your aim is: can I speed this up?

And as things get complex, you often want to simplify.

Like, what can I do to discard a bunch of this information?

Which bits of this are already fast, that I can ignore?

Maybe I can run a simpler query that's only the part of the plan that's problematic.

That's what I'm thinking in terms
of more complex topics.

Nikolay: Right, I agree.

Reading complex plans.

This is it.

Yeah, I agree.

Here, actually, what helps? Experience definitely helps.

For me, buffers are a must.

Buffers are a must.

First, I start reading execution time, planning time, but I quickly scan buffers, and they are cumulative: they are accumulated up to the root.

The root includes everything.

So you can go from the root to the leaves, basically.

It's like an upside-down tree, right?

So we do top-down analysis again.

Okay, we understand the number of buffers, and understand how many megabytes, or sometimes even gigabytes, it is, if we multiply by the block size, which in most cases is 8 kibibytes.

Then we go down and see where exactly this huge volume came from.

This is my default way of working with plans, including execution, of course.
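
[Editor's note: a worked example of the buffers-to-bytes arithmetic, with made-up numbers.]

    -- Suppose the root node of the plan reports:
    --   Buffers: shared hit=250000 read=50000
    -- That is 300,000 buffers touched in total; at the default 8 KiB block size:
    --   300,000 * 8 KiB = 2,400,000 KiB, roughly 2.3 GiB of buffer traffic.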

Because if you don't have execution, you only have costs, and buffers are reported only for planning, as we just discussed.

In this case, usually it helps.

In most cases, it helps.

And I wish I was also able to see lock-related information, right? Mixed into the EXPLAIN plan.

Michael: Yeah. Like waits. Yeah.

It's difficult to see how they would do that, but...

Nikolay: Not necessarily waits.

Michael: What do you mean by locks?

Nikolay: If we...

Yeah, yeah, actually maybe you're right.

Yeah, what's happening under the hood, slightly. But it's already super complicated output, right?

So...

Michael: Yeah, and there's macro
versus micro issues here.

If you're on a test environment
where you're the only person

there, it's much less likely that
you're going to be incorrect.

Nikolay: When I said locks, I meant heavy locks.

So at planning time, all indexes are locked, and all tables, all relations, are locked.

And that's a lot of overhead.

I wish I saw it.

It's interesting information.

AccessShareLock for all of them.

And if it's an UPDATE, okay, what kind of lock do I have here?

Michael: Oh, interesting.

Yeah, that's not what I was thinking
at all.

Nikolay: Yeah, you thought about lightweight locks.

Michael: Well, no, I was...

Oh, even a different...

Or wait events.

No, I was thinking, like, if you...

The easiest experiment to show this is: start one transaction and do some DDL, like drop a column or something, anything that takes a really heavy lock on that table. Then open a second transaction, just do a simple SELECT query on that table, go back to the first transaction and commit, go back to the second one, and see that your query was just waiting that entire time.

If you run that query with EXPLAIN in that second session...

Nikolay: If you commit it, it's not waiting anymore.

Michael: But the execution time of that EXPLAIN will be huge.

Nikolay: Oh yeah, yeah, yeah, you're right.

And this execution time could be explained if we saw, basically, a pie chart or something, a distribution of wait events.

Michael: There will be no node, no operation, in that EXPLAIN plan that shows that it was waiting on an exclusive lock.

It will just show that one of the operations was extremely slow.
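
[Editor's note: a minimal sketch of the two-session experiment Michael describes; the table and column names are hypothetical.]

    -- Session 1:
    BEGIN;
    ALTER TABLE orders DROP COLUMN note;   -- takes an ACCESS EXCLUSIVE lock

    -- Session 2 (blocks until session 1 commits):
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM orders;

    -- Session 1:
    COMMIT;

    -- Session 2 now finishes. Its total time was dominated by the lock
    -- wait, but no node in the plan says it was waiting on a lock.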

Nikolay: Simple wait event
analysis for this particular

session would help.

Michael: Yes, but it doesn't come up that often for me.

Maybe it's one of my blind spots, because I don't see those, but I don't see plans that have these really unexpected bottlenecks.

Nikolay: I would like to be able to, not by default, but I would like to be able to see this wait event analysis, and heavy locks as well, because I want to understand this query: which locks it's going to acquire when I go to production.

When execution finished, what locks were acquired?

I think these would be interesting extensions, but the output is already so huge.

Michael: And as I mentioned, we're getting more and more. I think this is part of the reason people are adding more parameters to EXPLAIN, like write-ahead logging information.

Nikolay: It should not be by default,
it's too much.

Michael: Well, I actually have
a proposal here.

I haven't suggested it on any of
the mailing lists yet, but I

would like your opinion.

I thought about adding them by default when you include the parameter VERBOSE.

So if you do EXPLAIN (ANALYZE, VERBOSE), give me everything.

Nikolay: You just don't want to
update your website

Michael: snippet.

Yes, yeah, I don't want to update
that snippet for sure, but

also I want people to only have
to remember to do that and they'll

get whatever Postgres can give
them.

Verbose means really long.

It means everything.

It kind of means like, I'm okay
with loads of information, give

me it all.

So it's already kind of in the
meaning.

Anyway, I haven't proposed it yet,
but that would also come with

buffers, so I thought you might
be happy.

Nikolay: Okay.

And also, talking about what's missing and what you cannot get from EXPLAIN: at some point, maybe some extension of this could be... imagine if we could see operations at the disk level as well.

And we'd understand not only hits and reads for the buffer pool, but also hits and reads for the page cache, for example. And maybe CPU analysis as well, for this particular execution, if it is EXPLAIN ANALYZE.

Michael: Right.

That's a good point.

It's not exactly what you're talking about, but loads of people don't realize, because it's not an EXPLAIN parameter, that with track_io_timing on, you can start to get some insights into the disk-level performance. And that's improving in each release; even in 17 we got some improvements there.

Like, buffers are split into shared, local and temp. I/O timing wasn't initially, and is now split into all three.
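
[Editor's note: a sketch of the track_io_timing route just mentioned; the query is illustrative.]

    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM orders;
    -- Nodes that actually hit storage now carry lines like:
    --   I/O Timings: shared read=12.345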

Nikolay: Yeah, that's great.

But I'm talking about, basically, pg_stat_kcache, but for EXPLAIN.

Michael: I understand, but the timings are a good start, right?

Nikolay: I agree.

Yeah.

And actually, I think it's possible to have all of this right now, already, but you need to connect many pieces.

For example, it could be a tool which runs EXPLAIN, but also looks at wait event analysis for this PID in pg_stat_activity, and at locks as well.

And what else?

It collects everything into a big report for a particular query.

And also physical metrics.

Physical: if we have physical access, we can get it from /proc.

Knowing the PID. I think we had a prototype of this for our old bot.

So you just read counters, counters from /proc.
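
[Editor's note: a minimal sketch of the "connect many pieces" idea: sampling session-level views for a known backend PID (12345 is hypothetical; query_id needs Postgres 14+).]

    -- Wait events and query id for the backend running our statement:
    SELECT wait_event_type, wait_event, query_id
    FROM pg_stat_activity
    WHERE pid = 12345;

    -- Heavy locks held or awaited by the same backend:
    SELECT locktype, relation::regclass, mode, granted
    FROM pg_locks
    WHERE pid = 12345;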

Michael: You can, because you've got the query identifier, which is the normalized one that you can match to, like, pg_stat_statements, I'm imagining.

This also, yeah.

Yeah, you could start to tie all of these things together.

Nikolay: Yeah, a lot of stuff is possible.

Who's doing this, by the way?

Maybe someone is doing this.

Michael: I don't know the monitoring tools well enough to know who's doing some of this micro-level stuff, but if they are, let us know.

Like, if you're using one, or if you're building one that does that, it would be cool to hear.

To be honest, though, it goes back to some of the conversations we've had about the macro-to-micro zoom. Once you've done your macro analysis and you've identified a few problematic queries, like the main consumers, if you've got a few bottlenecks, EXPLAIN is a really good tool for doing that micro analysis. Once you've got some sensible set of parameters that you know will be slow, it's not that hard to reproduce, normally.

The difficult thing is then going through a 200-line EXPLAIN plan.

You asked what can help.

I think we've put a lot of effort into helping with that stage

and a lot of the other tool providers have as well.

One thing we do that I think is really helpful is, if a subplan is fast, as a proportion of your query, we collapse it by default.

We don't hide it exactly. We just say: there are, like, 20 nodes here; probably you don't want to look here. Which means that a 300-line plan can go down to, like, 20.

Nikolay: That's cool.

Michael: But those kinds of things, I think, can really help initially.

It's not good for a Postgres hacker, like, trying to work out the cost model for a new feature.

But for somebody just trying to speed up a single query, you don't need to look at the 200 partitions that were never executed, or that took 0 milliseconds.

Nikolay: Yeah.

And you mentioned query ID. For example, I think it was last week, we were working with one of our customers, and we saw that auto_explain was used, and it's great.

And there was a task to analyze a lot, and we still plan to do it.

I'm very thankful to you that we have an integration in our AI right now, and your system helps our system.

I plan to perform some massive analysis at a larger scale.

But the interesting thing is that if you want to analyze a lot of queries from auto_explain, you do need some...

You basically need to compute... it's called the compute_query_id parameter, right?

You need to turn it on, because by default it's "auto", and for auto_explain it will not work.

Michael: It does if you have VERBOSE on.

Nikolay: Okay, in auto_explain.

Okay, in our case we didn't have VERBOSE on, I think.

Michael: But only as of 16, I think.
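
[Editor's note: a sketch of the settings being discussed, assuming (as the hosts do) Postgres 16+, where auto_explain's verbose output includes the query identifier.]

    ALTER SYSTEM SET compute_query_id = on;           -- default is 'auto'
    ALTER SYSTEM SET auto_explain.log_verbose = on;   -- adds Query Identifier (PG 16+)
    SELECT pg_reload_conf();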

Nikolay: It's better to turn it
on, actually.

Michael: Maybe, yeah.

Nikolay: It's a question of overhead, and let's, by the way, talk about overhead in a second.

But imagine you have a lot of query cases, various queries from the log.

The presence of the query ID first gives you an understanding of how many different normalized, or aggregated, queries you are dealing with.

Second, you can then join this information with pg_stat_statements, go back up to the macro level, and understand: okay, this query is not very influential, because it's a tiny part of the whole. Or maybe vice versa: it's influential, although we see only a few occurrences in our slow log, because we have a threshold by time.

auto_explain.log_min_duration, the query duration threshold.

We usually set it to 1 second, for example.

If we have only a few cases, but we see in the workload: oh, this is actually 50% of all total exec time. Oh, we do need to think about it properly, and the influence of this optimization will be high.

Let's do it.

This helps you understand, when you have a lot of queries to deal with, where to start.

You want to start with the most influential steps.

So query ID is a good thing.
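
[Editor's note: a sketch of the macro-level join Nikolay describes; it simply ranks normalized queries in pg_stat_statements by their share of total execution time.]

    SELECT queryid,
           calls,
           round(total_exec_time::numeric, 1) AS total_exec_ms,
           round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2)
             AS pct_of_total
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;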

I'm glad we have it everywhere: in EXPLAIN, in the logs, in pg_stat_statements; you can connect all the dots.

And pg_stat_activity as well, right?

So wait event analysis is possible too, just joining by query ID, if you sample many, many times, right?

So, yeah, great.

What about overhead?

I think it's also an advanced topic.

I know you blogged about it, and OnGres did.

You talked about auto_explain, right?

But EXPLAIN itself also has overhead.

Michael: Well, EXPLAIN can do both: it can under-report and it can over-report the timing.

Nikolay: Django, I'm not a fan
of Django.

Michael: Well, but bear with me.

I think people focus a lot... there are different things.

The OnGres thing was about the queries being slower: you actually cause some overhead by observing, and that's true when you're running EXPLAIN ANALYZE with timing, because TIMING is on with ANALYZE by default. But it's also true if you're running auto_explain: there's also some overhead to measuring and to logging; if you log a lot of plans, that's some overhead as well.

So there are those overheads, but there's also the possibility that it under-reports, which I think throws some people off.

So this is one of the reasons they added the parameter SERIALIZE: if you are doing a lot of work after the query is executed, but in order to send the data to the client.
Nobody really cares about the server-side time. Well, the server-side time is important, and that's what we get from EXPLAIN ANALYZE, but what people normally care about is the user experience, like the full client-to-server-to-client time.

Nikolay: And... I cannot agree here, but okay, go on.

Michael: Okay, well, what I mean is, if you're getting a report that "my dashboard is slow", there's no point only speeding up the server-side time.

If you're sending huge amounts of data, part of that work is serialization: getting the data ready to transmit.

And then there's the network overhead as well, of actually sending all that data.

So, yeah, I guess you're right.

Nikolay: It's not network cost. It's not transmission cost.

It's a cost of... not cost; "cost" may not be a good word, it's overloaded here, right?

It's time, right? Or cost... it's cost, actually, right?

Michael: SERIALIZE is reported as time and amount of data.

Nikolay: What about cost?

Okay.

Michael: I actually don't think it has a cost related to it.

I think it's one of those things that gets reported in the summary section, like just-in-time compilation and planning.

Nikolay: It's about serialization, not transmitting.

Michael: Yes. It's the step just before transmitting: getting the data ready to transmit.

Nikolay: It's still everything
on the server.

Michael: For example, uncompressing
or decompressing TOASTed data.

Nikolay: And it's a Postgres 17 feature, right?

So it just goes deeper, to understand what's happening on the server side, like execution, and what part of it is associated with, attributed to, serialization, right?

Michael: Well, and not just
that.

Before, if you ran EXPLAIN ANALYZE on a SELECT of some data that had been TOASTed in this table, you would get a really fast execution plan, because it wouldn't do the serialization.

It would say this executes really quickly.

So it wasn't just that you couldn't see the time; a query that might take 100 milliseconds when you run it from psql, if you put EXPLAIN ANALYZE in front of it, could run in 1 millisecond or 10 milliseconds, much, much faster.

Because it didn't have to do that expensive step.

So this is one of those cases that used to be... and not just used to be: by default, this is still the behavior in version 17.

If you want to EXPLAIN ANALYZE some basic...

Nikolay: SELECT star from blah blah blah. And before 17, we didn't see this part at all.

Michael: Even in 17, you don't see it unless you also explicitly ask for EXPLAIN (ANALYZE, SERIALIZE).

That's what I've been saying.
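
[Editor's note: a sketch on Postgres 17+, assuming a hypothetical "documents" table with large TOASTed values.]

    EXPLAIN (ANALYZE, SERIALIZE) SELECT payload FROM documents;
    -- Adds a summary line like:
    --   Serialization: time=123.456 ms  output=45678kB  format=text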

Nikolay: And what about VERBOSE?

VERBOSE includes it, right?

Michael: No, it doesn't.

But this is my proposal, that VERBOSE should include all of these additional parameters.

Nikolay: It should, of course, I agree.

Well, okay.

It is as it is right now.

Michael: And I have to actually say, EXPLAIN is awesome.

EXPLAIN in Postgres, in particular, is the best I've seen in any database that I've worked with.

Other databases, like MySQL for example, are catching up; they got EXPLAIN ANALYZE in version 8. But the amount of detail we can already get is beyond the other open source databases I've seen.

So it is good; it's just that it can always be better, right?

Nikolay: Yeah, yeah, yeah.

If somebody is watching us thinking: oh, a couple of haters, right? Haters of EXPLAIN.

It's just that there's a lot of stuff, yeah.

But a lot of this stuff is helping all the time, it's true.

Do you want to discuss the overhead of the BUFFERS option?

Michael: I couldn't measure it.

I did some really basic experiments, with very, very small queries, but I couldn't get any additional overhead.

There wasn't a measurable extra overhead of requesting buffers.

I think you said you did an experiment once that did show it, but I don't remember.

Nikolay: Yeah. On timing.

It cannot affect the buffer numbers themselves; they are basically invariant.

Of course, there can be hits or reads moving back and forth between them.

But if everything is cached, it's very stable: you're reading the same data, the same query, just SELECT multiple times, and you will see the same number of buffers all the time.

That's one of the reasons I like them.

But if you include BUFFERS, I did see in the past, some time ago, that timing increases.

Because a lot of buffers need to be reported, and the counters need to be maintained internally.

If it's a complex query, this overhead adds up in many places, and the timing without the BUFFERS option is very different compared to with BUFFERS.
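
[Editor's note: if measurement overhead is the concern, one commonly suggested middle ground is to keep BUFFERS but drop per-node timing; a sketch, with a hypothetical table.]

    EXPLAIN (ANALYZE, TIMING OFF, BUFFERS) SELECT count(*) FROM orders;
    -- Per-node timings are skipped (total execution time is still shown);
    -- row counts and buffer counters are still reported.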

Michael: I've seen single queries where that's true, and I think Lukas Fittl from pganalyze has done some examples of this.

I think I've only seen it in queries that have nested loops with lots of loops.

That's the example everybody turns to.

Nikolay: For example, a join with a nested loop.

Michael: But the reason I think it's slightly overhyped is that when you're optimising, you don't always care.

If the EXPLAIN ANALYZE says 4 seconds, or maybe let's pick 400 milliseconds, and your real execution time is 300 milliseconds: if you can speed the 400 milliseconds up to 50 milliseconds, the 300 might go down to 40, and directionally you're still looking for the same things.

And by asking for buffers information,
you're getting more information,

you're more likely to be able to
optimize your query.

So I don't care as much as most
that it's slightly different.

Nikolay: Maybe. I only care that including BUFFERS doesn't change buffers.

That's it.

That's it.

So for me, when I start working with buffers, I'm less concerned about timing.

Focus on buffers; optimize, optimize, optimize; and then go back to the result and see: oh, timing has improved drastically, because we reduced the number of buffer hits and reads.

That's it.

This is a very, very common path for me.

Just reduce. Because, like, a sequential scan: lots of buffer hits and reads. We add an index: okay, a couple, 3, 4, 5 buffer hits, that's it. Time to look at timing. Oh, of course, it's a thousand times faster, right?

Michael: I like, you know, I like buffers a lot, but for me they'll always be secondary. Because if you look at timings first and last, that's the main aim.

The main aim is to get things faster.

The main aim is to get things faster.

And yes, in maybe 99% of cases,
it's going to be I/O.

That's the limiting factor.

But in some queries, there'll be a sort in memory.

There'll be something that doesn't report extra buffers in EXPLAIN.

There'll be just-in-time compilation, or there'll be something like trigger timing.

Nikolay: Just-in-time compilation
is off.

Michael: All I'm saying is: planning time, like actual planning time, you now do get buffers for. But trigger times, just-in-time compilation, there are a few other examples, sorts in memory, hashes in memory... if some of these other bottlenecks don't report buffers, why not look at timings and then buffers?

Also, I almost always see query plans with both these days.

So it's kind of a…

Nikolay: Okay. That's great that buffers are present.

At some point I started saying: if there are no buffers, I'm not dealing with this query.

This is what I say to customers: go and give it to me with BUFFERS, because we're missing something super important.

Okay, if you put it in second place, fine, but this piece should always be present.

Yeah.

Nikolay: I'm sure...

Michael: That's what it was like before.

I think that's probably enough.

I've got a talk that I gave a few years ago, called "EXPLAIN Beyond the Basics", which I would have thought we had covered some of the bits of. But actually, I took a completely different path there: I went down some of the less common types of issue that you can spot with EXPLAIN, which we haven't really talked about. So I can include that, if anybody wants to know a bit more of the specifics. It's a bit nicer, like, that format is good, because you can actually see examples of EXPLAIN plans, so it's probably a bit better video-wise.

And yeah, the reason I see buffers more often now is everybody suggests adding it. You, but also some of the other tools as well; like, depesz added it to the default thing that he suggests people get.

So there are loads of people.

Nikolay: That's news to me.

I remember it was missing and it
was a big disadvantage.

Michael: He added it.

It's great.

Nikolay: That's good.

Yeah.

Michael: Anything else you wanted to add?

Nikolay: I think it's enough.

Michael: Me too.

Nikolay: Thank you.

See you later.

Bye.