A weekly podcast about all things PostgreSQL
Nikolay: Hello, hello, this is
PostgresFM, episode number 96.
My name is Nikolay, founder of
Postgres.AI, and as usual, my
co-host is Michael from pgMustard.
Hi, Michael.
Michael: Hello, Nikolay.
Nikolay: So pgMustard, as I mentioned
last time, is focusing on
explain plans and helping optimize
Postgres queries.
And the topic I chose this time
is very, very close to that topic,
right?
I mean, it's a subtopic.
I know many people discuss a lot
of very in-depth materials around
this topic, but let's just focus
on basics just for awareness.
This topic is how plan caching
works and generic versus custom
plans.
Because at a very high level, we
don't know about them at all.
We know that EXPLAIN shows us a
plan.
We know there is plan caching.
We know pgBouncer since recently
started supporting prepared
statements, so you can avoid planning
time, because sometimes
it's very costly.
For example, if you have a lot
of partitions, a lot of indexes,
log manager contention can happen.
So it's good, especially for queries
which have very high frequency,
for example, a thousand times per
second, it's good to just perform
planning once per session and then
rely on the cached plan.
But it's not really once per session,
right?
There are more nuances in this
area.
So let's uncover them and discuss.
Michael: Yeah, nice.
At a kind of high level, I see
people that are using ORMs and
frameworks using prepared statements,
sometimes without realizing
it, or sometimes they know but
they don't really know what it
means.
Do you see a lot of people, like
is that the same in your experience?
A lot of people using this without
realizing, or are you seeing
quite a lot of conscious choice,
like conscious use of prepared
statements, even for people that
aren't using it via an ORM or
similar?
Nikolay: Well, you know my point
of view on this world.
On one hand, I think people are quite
smart in general,
Michael: but
Nikolay: they are smart only when
they are very motivated.
And we know a lot of engineers
don't distinguish between binary tree
and B-tree at all, because they
are not motivated enough to understand
the differences.
Similarly here, a lot of people,
in my opinion, don't understand
that one normalized query in pg_stat_statements
might have different
plans depending on parameters.
This simple truth is something
that should be understood, but
it can be understood only when
you are motivated, when you have
some incident, for example, and
you start off thinking, oh, okay,
this query can behave differently
depending on parameters.
So this is like basic knowledge.
And honestly, I didn't realize
it for quite long in my career
as well.
I thought, okay, this is so, but
why is it important?
Well, it's very important because
looking at pg_stat_statements
and understanding heavy queries,
it's not easy to jump to optimizing
these queries, especially when
we talk about mixed situation,
when it's also quite high frequency
query and also it has issues
with not perfect indexes chosen,
not perfect plan, like not only
just index only scan, that's it.
So it means that we need to understand,
okay, we depend on parameters,
but which parameters should we
choose for optimization?
Michael: When you say for optimization,
do you mean like we get
the PG stat statements and it has
these parameterized queries,
which ones do we put in in order
to run, explain, analyze buffers?
Nikolay: Right, right, right.
This is unfortunately a big gap
between macro and micro levels
of query analysis, which still
exists.
Of course, with the addition of
query ID to logs and to pg_stat_activity,
we have this bridge between pg_stat_statements and individual
queries, which we can find in logs
or pg_stat_activity.
But still, I remember 1 lesson
I learned in practice when I spent
1 hour or so optimizing some query.
It was some parameter which was
true or false.
And I was trying to optimize for
value which was present in 90%
of rows, majority of rows.
And it was, of course, difficult
because selectivity is low,
cardinality is high, right?
But then I understood all this
hour spent was wasted because
it's never solved.
Never, ever, application runs this
value, uses this value in
query.
So The idea is that we need to
understand real cases and also
not super rare cases or non-existing
at all cases, but we need
to understand the cases which hurt
user experience, for example,
and focus on them.
In this case, generic plan is something
really weird in this
picture I just painted, right?
So what do you think?
Like generic and custom and maybe
let's talk about how planner
uses them and then discuss what
should we do about it at all.
So this 5 times rule, what do you
think?
Michael: Yeah, I think that catches
people out sometimes.
So the way prepared statements
works, at least my understanding,
is that you can prepare a query
in advance.
And that will do, like, the parse
stage.
And it will do, like, several stages
up to but not including
the execution of that query, then
you can ask it to execute with
parameters.
Now, that then, on the subsequent
executions, you don't have
to reparse, like do the set, like
do several of those.
Well, planning at first does have
to be done again, right?
So it's the parsing and the rewriting
stages.
Yes, on a per session basis, we
don't have to do several of those
steps.
But there is also this weird kind
of initial phase.
So if you've done prepare, if you've
prepared the statement and
then you execute it, the first
5 times, Postgres will plan each
of those 5 executions separately
based on the parameters that
are given.
So if those are in the case you
gave if you used true was it
that was 90% of time You've used
true every single time for those
5 times It's gonna plan it the
same way each time and on the
sixth execution it will then look
at the cost of planning.
And if that cost is less than a
generic, or if it's more than
a generic plan that it's calculated,
or I think there's some
subtlety around exactly how precise
that is, then it will choose
its generic plan thereafter.
It will flip to not planning each
new execution.
So sixth time, seventh time, eighth
time, ninth time, forever
onwards, during that session, it
won't replan.
So now you're not only saving on
the parsing, on the rewriting
stages, you're also saving on the
planning stage, which you mentioned
already.
But that's not true if, for example,
you did 3 or 4 executions
with true and then a couple of
executions with false in your
example, then you're going to get
a different kind of average
cost for those 5 executions.
Then that will get compared to
the generic plan and then a different
decision will get made.
So depending on those first 5,
So those first 5 do actually determine
quite a lot what happens thereafter,
which is interesting.
But yeah, the 1 time I've seen
that the 5 become important is...
This can be really confusing for
folks.
If you're trying to debug something
and it's fast for a while,
and then suddenly it goes slow,
This is 1 of those telltale signs
once you've been around a while.
If you notice that it's fast for
5 executions then slow on the
sixth, you should look into this
issue.
Custom plan versus generic plan.
It's kind of a telltale sign, like
a smoking gun, or like a,
you know, it's a telltale sign
that this could be what you're
hitting.
But aside from that, I haven't
really thought too much about
it.
I don't think it's configurable,
that 5 number.
I think it's hard-coded.
Nikolay: Yeah, it's, by the way,
why 5?
Why 5?
Michael: Yeah, good question.
Nikolay: Some decision, right?
That's it.
But like, what is the generic plan?
Like generic plan is just we don't
take into account parameters
at all, right?
It means that we don't take into
account this distribution of
values?
Or we take…
Michael: Yeah, good question.
Did you look into how it's calculated
in the first place?
Nikolay: No, but it's an interesting
question.
Michael: I could be wrong, but
I would have guessed that it would
be based on some statistics at
the time.
If you were implementing this feature,
you would think if you've
got a data distribution there,
maybe you'd factor that in.
But it's difficult.
Because if it's skewed, if it's
like completely uniform, it's
quite easy.
Like if we're talking about primary
key lookups, it's primary
key lookup query and every single
one's unique and every single
one's only going to return 1 row.
It's only scanning 1 table, that
kind of thing.
The generic plan is really easy
because every single parameter
you could provide to that is going
to be planned the same way,
even in the custom plan mode.
So generic plan in that case is
quite easy.
But in the case where it's a skewed
distribution, some values
have more entries than others,
it becomes tricky, I think, to
know how you would plan that in
a generic way.
So I don't actually know.
Nikolay: For me, a generic plan
is a good thing that we don't
have a lot of time to spend finding
particular examples in logs
or in pg_stat_activity.
In pg_stat_activity, we have a
different problem, the track activity
size.
So this parameter, which by default
is 1024, means that really
long queries are truncated and
we don't see the full query.
Parameters usually are in the end,
right?
Select all columns, our favorite,
from table and so many column
names, so I don't see parameter.
Something equals truncated.
I saw it not once.
So when we don't want to spend
time, we want something really
fast, this trick prepare statement,
we prepare statement and
then we use just nulls as all values.
This gives us a generic plan, but
also we can say set plan cache
mode to force generic plan, which
exists as we just checked before
we started recording this episode.
Surprisingly, the plan cache mode
configuration setting, it exists
since many, many years ago.
So in all current versions, it's
supported, right?
Michael: Yes, since version 12.
Nikolay: Exactly.
So 12 is the oldest currently supported
version.
A few months left until the end
of it.
Michael: Yeah.
Nikolay: Yeah, yeah.
So this trick allows us to, looking
at the pg_stat_statements,
just take some normalized query,
which has this $1, $2, doesn't
have parameters at all, they are
removed.
And just to use this prepared statements
with null streak and
get some generic plan and think
already how bad the situation
in general, not in particular cases,
but overall.
But this trick, as you absolutely
correctly mentioned before
this episode recording, doesn't
work if we have, for example,
a primary key there.
Because primary key equals null
or is null won't work properly
because there is no null constraint,
so the planner knows that
nothing will be returned.
Doesn't show us the proper plan
at all.
Michael: Yeah, if you don't specify,
like, if you don't set plan
cache mode to force generic plan,
then you'll get a custom plan
and the planner will shortcut it
because there's a really, yeah.
So, but if you do force generic
plan on version 12 onwards, then
you will get the generic plan you're
hoping for.
Nikolay: Well, that's great.
I didn't realize that.
So yeah, good, good.
But if we are lucky enough to have
already Postgres 16, all these
tricks are not needed because we
can just say, explain generic
plan.
There is a new option in explain
command, and this is very handy.
We just ask a generic plan and
explain.
Don't care.
And interesting that we can even
use $1, $2, and this query text
right from pg_stat_statements.
This is super convenient.
We can say, explain generic plan
and that text.
We can even probably automate this,
right?
And see, for example, if we have
sequential scans in generic
plans and put warnings somewhere
and so on.
But of course, this won't work
if we want to execute and we want
to explain analyze buffers.
This won't work.
I mean, in this case, we need to
replace $1, $2 with some values.
Michael: Yeah, and it makes sense,
right?
We can't do the execution unless
we specify something, otherwise
what are you executing?
So it makes sense.
And kudos to Lawrence from CyberTech
who added this and has blogged
about it.
So I'll share the blog post about
that.
Nikolay: I remember actually this
wave from Lawrence and also
Lucas Fittal, who generalized blogging
about generic plans and
also getting excited.
Also, Frank Pascho, I think, joined
at some point.
I actually don't know who started
this, but I remember waves
of, wave of thinking, multiple
blog authors, blog posts about
this, generic plans a few years
ago and it was great.
Michael: I don't think it's super
surprising that this has come
out of a consultancy, like a Postgres
consultancy, and that Lucas
is excited about it.
So I think it's 1 of those things
that really helps people help
others.
Because if you really know what
you're doing, like these people
all do, you can now set up Postgres
in a way where you can find
parameter values in the vast majority
of cases, or you can reconfigure
things so that it does log the
query ID, or you can get your
slowest queries logged with parameters.
So there are these ways of doing
it now, but if people haven't
set that up in advance and you're
helping them with a production
issue, this is a really helpful
tool to give you a clue as to
what's going on.
Now explain will always, there
will only ever be a clue, not
because you don't have the execution
system, you don't know for
sure why it's slow.
But those clues are often enough
for experienced people to get
a theory, get a hypothesis as to
what's going wrong, and then
give them something to be getting
on with instead of being able
to say, I have no idea why this
is like, instead of, it's kind
of like extra information to make
the guess a little bit more
educated rather than an actual
answer.
Nikolay: Yeah, and we can have
a holistic approach now, like checking
everything in an automated fashion,
not guessing and so on.
It's great.
So, yeah, what else worth mentioning
in this area?
Maybe partitioning, which we like,
it's funny that you found
this commit in Postgres 15, which
improved the situation.
My general impression, if you use
a lot of partitioning, you
should perform major upgrades promptly,
not lagging at all, because
every year a lot of things are
improved.
So this is a quite in-depth thing
that was improved in Postgres
15, saying that now, like partition
pruning, so exclusion for
irrelevant partitions based on,
like if user asks to return some
data in a query, Postgres can remove
partitions which are for
sure irrelevant from consideration.
For example, we have partitioning
by time, range by time and
by date, and then the user for
sure wants fresh data, definitely
didn't need to look at very old
partitions, partitions with very
old data.
But this might happen both at planning
time and execution time,
which can be a big surprise.
And you want this to happen at
planning time, basically, better.
Because this will mean that we
can rely later on cached plan,
or if we use prepared statements,
we can rely more on cached plans.
And then we have a very good situation
when partition pruning
was already done at planning time,
we just execute that set.
But it's tricky when we talk about
generic plan, because generic
plan doesn't take into account
the values.
So something was improved in Postgres
15 in this area, right?
Michael: Yeah, We tried to look
into it before the call, didn't
we?
And couldn't...
Nikolay: Yeah, I couldn't reproduce
the problem.
Michael: Yeah, so we tried on an
old version before these.
So it would be interesting to hear
from people that have reproduced
this.
I didn't see a blog post about
it, I saw it referenced a couple
of times.
Nikolay: Yeah, Lukas Fittel in
this blog post, the generalized
blog post, mentioned that before
we had a problem with generic
plans, that partition pruning didn't
work in planning time.
It worked only at execution time.
I couldn't reproduce it, but if
it's so it means that, for example,
if you have a lot of partitions
with a lot of indexes and at
execution time you involve all
of them.
In general, if you have a lot of
partitions, just checking them,
not returning everything, it's
a big overhead, huge overhead.
But at least at execution time,
Postgres doesn't put access share
locks on all indexes, because at
planning time it does.
And if at planning time we don't
have partition pruning working,
it's a nightmare.
This case
Michael: of manager spikes.
Well, I think that would happen
at X.
Yeah.
Nikolay: So,
Michael: a question for you.
Because we do now have this plan
cache mode, so plan cache mode
got added in 12, that's 1 of the
things we can use to force generic
plan, force custom plan while we're
debugging, while we're trying
to, pre-version 16 releases, before
we had this explained generic
plan, we could use that for debugging.
But I don't think I've ever seen
anybody change.
So, Plan cache mode got added.
You have 3 options, Auto, Force
Custom Plan and Force Generic
Plan.
Auto is the default, so Auto acts
as it did as...
Nikolay: As you described.
Michael: Yes, so the first 5 executions...
Nikolay: times custom, then decision
is made.
Yeah.
What choice?
Well, I
Michael: was going to ask because
I think I got the impression
that some of these ORMs, like ActiveRecord,
for example, uses
prepared statements.
I got the impression the main reason
they do so is to protect
against SQL injection attacks,
not for performance reasons.
And that might be not quite accurate.
But there is this other benefit
that seems to be mentioned quite
a lot around prepared statements.
Now, I wondered if you'd ever seen
anybody change, like if you
only were, if you only were using
prepared statements for.
SQL injection protection, and you
didn't, you weren't as bothered
about the performance benefits
in some cases, or pitfalls in
other cases, you might want to
allow Postgres to replan every
single time.
I can imagine some people considering
turning force custom plan
on.
I can imagine like the upside of
doing so.
I can't see quite the same upside
for doing forced generic plan
because all you're doing is saving
those 5.
Well, I think you'd only save those
5 executions and then you
flip to the generic plan anyway.
Maybe there's another case for
that.
But yeah, have you seen anybody
changing that setting in any
cases?
Nikolay: I see everyone doesn't
know about it, almost everyone.
It's very deep.
It's not super, super...
It feels like some kind of fine
tuning.
I see people have fears, for example,
to rely on prepared statements
globally, because this idea, what
if global plan will work not
well?
Michael: Right?
And...
So generic plan is bad for a specific
parameter...
Nikolay: If generic plan is chosen,
it's cached, and then it's
bad for some cases and we have
performance degradation for particular
parameters.
But also I think this idea to general,
to use prepared statements,
I don't think these days it already
matters as protection from...
I think the main reason is planning
time is usually fast, but
not always fast, as we know.
And the idea that during planning
time Postgres locks all indexes,
it's a super big surprise.
Many people, including myself,
had not long ago.
And we had benchmarks showing this
small overhead adding with
each index for select, not for
update, for select.
And at some point when total number
of relations, both tables
and indexes, reaches 16, which
is also a hard-coded constant.
If a query is very frequent, 1000
or 2000 per second, primary kilo
caps are usually so.
We have a log manager spike, and
partitioning increases the probability
of it.
And that's why getting rid of planning
is good.
And when you get rid of planning,
it's worth understanding this
behavior of how caching works and
generic versus custom.
But just changing this globally,
I'm not sure this is a popular
decision.
But of course, my experience is
very limited.
Michael: Let us know if you have
changed it.
It'd be interesting to hear from
you.
Nikolay: Yeah, it's an interesting
topic anyway.
But in general, I think it's worth
spending efforts to get rid
of planning for high-frequency queries,
especially for partitioning
tables.
Michael: I have one more related
question.
So now pgBouncer supports prepared
statements.
Nikolay: I
Michael: saw they have a max prepared
state.
Well, so it's off by default, which
I didn't realize.
They have a max prepared statements
parameter.
Do you have any opinions on tuning
that?
Or like I saw that they said a
default of 100 might make sense.
Like, might be a sensible start.
Nikolay: Yeah, some high value
is good.
I think it's like, depends on the
situation, of course, right?
I mean, but in general, it's not
an issue to just increase it.
Yeah.
Cool.
But again, I see the tendency to
be afraid of turning prepare
statements globally in already
existing heavily loaded projects.
If you start from scratch, it's
a very good decision to make
to start using prepared statements
from the very beginning.
And then you just grow and see
the problems, fix them as usual,
like you would do anyway.
And then there's confidence level
of using already there, but
switching for a big project from
0 to 100% coverage, it's a very
hard decision to make.
Michael: That's hard as in scary
or?
Nikolay: Just people are not sure
and that's the problem.
And there is no good path.
Maybe there is a good path to it.
But in a couple of big companies
I observed recently, people
decided not to proceed with this
after a lot of consideration.
And it's kind of an interesting
topic.
Instead, It feels safer to start
using prepare statements only
for specific queries.
Michael: That's a slow way in.
Maybe for new features.
Nikolay: Not new features.
Not at all.
Again, the problem is usually this
log manager overhead.
When we have 1,000, 2,000 QPS,
we see very simple queries, select
star from or select something from
table When we have 1000-2000
QPS, we see a very simple query,
select something from the table
where ID equals something, primary
queue lookup.
And if we have planning time every
time, select itself is fast,
planning itself is also fast.
But if you have a lot of backends
competing, access share lock
is not a problem at all.
But lock manager has overhead because
of fast path is false.
And this is what multiple companies
experienced recently.
And Jeremy Schneider said, local
managers strikes back or bites
back or something like that.
Last year was a local manager year
in terms of performance issues
people experienced.
Michael: So you're saying, yeah,
so I'll share the episode where
we discussed that in more detail.
And so you're saying they choose
to turn prepared statements
on only for those queries, because
it reduces the issue there,
but not globally.
Okay, makes sense.
Nikolay: Just to remove these hotspots.
Yeah.
Only for these primary lookups.
Michael: I mean, we're talking,
So if you're talking about these
super fast queries, even though
both execution and planning time
are fast, it could be that planning
time is more than execution
time.
Like it's quite common.
Nikolay: It happens.
Michael: So even if it wasn't for
the issues you're describing,
you could maybe increase throughput.
Or like there's a lot of overhead
to be gained by not having
to plan those queries.
Nikolay: Yeah, shave off planning
time is a good idea in many
cases in general.
It's like, for example, partitioning,
right?
If partitioning is there, planning
time can be high if you have
a lot.
It can depend on the number of
partitions.
This is the problem as well.
In this case, of course, getting
rid of it is good.
Nice.
Cool.
Anything else you want
Michael: to add to this 1?
Nikolay: No.
It's an interesting thing to keep
in mind when dealing with plan
optimization.
So use explain-generic-plan in
Postgres 16 and in future Postgres
versions.
Michael: Yeah, if you can't, as
a starting point, or if you can't
get
Nikolay: parameters.
Michael: Lazy mode.
Lazy mode.
Nikolay: If you don't want to go
to logs to find examples, or
maybe a log min duration statement
was not configured somehow.
Michael: Exactly.
If it wasn't configured for the
time where you're having issues,
or if you can't turn it on easily,
like if you've got a lot of
process at your company before
changing some of these parameters,
you know, so there might be reasons
why you can't get real values.
But of course, I think The reason
I don't know as much as maybe
I should on this is, if you can
get the real values, performance
optimization work becomes so much
easier.
Nikolay: Of course.
Michael: There's less guesswork.
Nikolay: Yeah.
This generic plan should be something
like a partial solution
in terms of analysis.
It's a partial analysis.
It always should be with remark,
you know, this is not, maybe
not it.
Maybe but maybe not.
Michael: Yeah.
Like a stepping stone on the way
to the solution, but like it's
not a solution in itself.
Nikolay: Right.
On the other hand, it might give
very quick insights what to
do and how to speed up.
So yeah,
Michael: good point.
Yeah, good point.
Yeah.
Nice one, Nikolay.
Thank you so much.
Nikolay: Thank you.
Have a good day.
Bye.
Michael: You too.