A weekly podcast about all things PostgreSQL
Hello, hello, this is
PostgresFM episode number 78,
live, because I'm alone today again.
Michael is on vacations, holidays,
but I cannot allow us to miss
any weeks, because we do it already
1 year and a half.
Episode 78, so no weeks are missed.
So this week we'll do it as well.
But it will be a small episode.
First of all, happy holidays everyone!
It's December 29th, so it's holiday
season.
And let's have some small episode
about work_mem.
Somebody asked me to cover this
topic.
And I actually wrote a how-to,
I just haven't published it yet.
I'm in my Postgres Marathon series,
where I publish how-tos every
day.
Almost, I'm lagging as well a little
bit, because of holiday
season, but I'm going to catch
up.
So work_mem.
work_mem is that everyone uses,
right?
We all run queries.
We need to use work_mem.
But this is super tricky, super,
how to say, basic setting, because
everyone needs it.
And it's also super tricky because
every statement can utilize
less than work_mem.
This is limit, but it defines an upper
limit, right?
So if we need less, we use less.
It's not like an allocated amount
of memory, like shared buffers,
the size of the buffer pool.
But, so we can use less, any query
can use less.
But the trickiest part is that
we don't know, any statement can
be used multiple times.
Less may be up to work memory,
but multiple times, because this
is the amount of memory needed
for hashing, sorting operations.
So if we have, for example, multiple
hashing operations inside
one query, for example, multiple
hash joins, we can use it multiple
times.
And this adds the kind of, this
like unpredictability, it's hard
to define a good algorithm to tune
it clearly, because we don't
know, right?
For example, we have some amount
of memory and we know our buffer
pool size, it's simple and it's
another topic, but you define
it once and you cannot change it
without restart.
So we say, like some rule, most
people use it 25%.
Okay, we allocate 25% for the buffer
pool.
What's left we can use, and also
the operating system can use for
its own page cache.
We should not forget also that
there is also maintenance work_mem
which is more predictable.
It has some trickiness as well
because there is an autovacuum work_mem
which is by default minus 1.
It means maintenance work_mem will
be used and we have multiple
workers for autovacuum.
So if we set it, for example, I
see people set to 2 gigabytes,
it's maybe not really, it's quite
a lot.
You need to ensure that, for example,
when you create an index,
2 gigabytes is indeed helpful.
But we can have multiple autovacuum
workers, and I usually advocate
to raise the autovacuum workers a lot
so we can have many of them
say 10 if you have a lot of CPUs
or maybe 20 even.
It means if you set maintenance
work_mem to 2 gigabytes, autovacuum
work_mem is minus 1, means
it inherits from maintenance
work_mem.
Autovacuum alone can use up to
20 gigabytes.
It also depends because not for
everything it will use it, but
anyway, we need to subtract these
20GB from the remaining memory.
We also have some overhead for
additional processes.
And then what's left, we can just
say, okay, we have max connections,
say, 200.
So we just divide the remaining
memory by 200 and this is roughly
per each backend what we can use.
But we don't know how many times
backends will use work_mem, right?
So let's discuss the approach I
kind of developed just
writing this how-to.
First, my favorite rule is this
Pareto principle rule 80/20.
So we take for example PGTune Leopard,
I forgot the name, but
this is a very simple tuning heuristic
based tool, which is quite
good.
I mean, it's good.
Good enough in many cases.
You just use it, and it will give
you some value for work_mem.
Quite a safe value.
Again, the main thing is with work_mem
to see how your max connections,
because if you set, if you increase
max connections you need
to understand that, like, we don't
want to have out of memory.
This is the main thing to be careful
with.
So, okay, it will give you some
rough value, and I think let's
go with this value, that's it.
Then we run it for some time in
production, and the second important
step is to have very good monitoring.
Monitoring can provide you with some
very useful insights about temporary
file creation.
When work_mem is not enough,
it doesn't mean Postgres cannot
execute the query.
Postgres will execute your query,
but it will involve temporary
file creation, meaning that it
will use disk to have more memory.
And this is of course very slow,
it will slow down query execution
a lot, but it will eventually finish
unless statement_timeout
is reached, right?
So, okay, we applied this rough
tuning.
We started monitoring work_mem,
oh by the way, how to monitor
these temporary files.
I see two sources of temporary file
creation.
First is, like, very high level is
pg_stat_database.
For each database, you can see the
number of temporary files already
created and the size of them, total
size of them, columns, temp
files, and temp bytes.
So if your monitoring is good,
or if you can extend it to have
this, you will see the rates of
temp file creation and also size,
size also interesting.
We can talk about average size
or maybe maximum size for each
file.
Well, we can probably play with
this data more, but it's only
two numbers, right?
So number of files and number of
bytes.
It's not a lot, we cannot have
p95, for example, here, right?
So next, more detailed information
is from Postgres logs.
If we adjust log_temp_files setting,
we can have details about
every occurrence of temporary file
creation in the Postgres logs.
Of course, we need to be careful
with observer effect because
if we set it, for example, to 0 and
for example our work memory
is very small and a lot of queries
need to create temporary files.
Not only temporary files will slow
us down, but also we will
produce a lot of logging.
Observer effect can be bad here.
So probably we should be careful
and not set it immediately to
0, but to some sane value first,
and go down a little bit and
see.
But eventually, if we know that
temporary files are not created
often, we can go even to 0.
Again, we should be careful.
And finally, the third source of
important monitoring data here
is pg_stat_statements.
It has a couple of columns, temp
blocks, BLKS read and temp blocks
written.
So we can understand for each normalized
query, I call it query
group.
For each query group, we can see
again, like same as for database
level, we can see a number of,
oh no, not the same.
We don't have a number of files
here.
Instead, we have block read, read
and written.
So written blocks are interesting
here.
But the good idea here is that
we can identify the parts of our
whole workload and understand which
queries are most active in
terms of temporary file creation.
That means they need more work
memory, right?
They lack work memory.
So once we build our monitoring,
or we already have it, maybe.
I'm not sure everyone has very
good.
As usual, I'm very skeptical in
terms of the current state of
Postgres monitoring in general.
But assuming we have this covered
in our monitoring tools, and
we have some details probably in
logs, the next thing, of course,
we can identify parts of our code
and we can think about optimization
first.
Instead of raising our work_mem,
we can have an idea, let's try
to reduce, let's be less hungry
for work_mem, right?
Let's reduce the memory usage.
Sometimes it's quite straightforward,
sometimes it's tricky.
Again, here I recommend using the
Pareto principle and not to
spend too much effort on this optimization.
We just try, if it takes too much
time, too much effort, we just
proceed to the next step.
Next step is raising work_mem.
From these, like monitoring already
can suggest us what is average
temporary file size and what is
maximum temporary file size.
And from that information we can
understand how much we need
to raise.
Of course, instead of jumping straight
to this new value, it
may be risky.
Sometimes I see people do it.
I mean, we know our max_connections
value.
We know that each statement can
consume multiple times up to
work_mem size because of operations,
this approach.
Also, since Postgres 13, there
is a new setting, which is...
I always forget this name, but
there is a setting that tells
you multiplier for hash operations.
And as I remember, by default it's
2, meaning that you have work_mem,
but hash operations can use up
to 2 work_mem, which adds complexity
in the logic and tuning.
And again, it makes it even trickier
to tune.
So, like on the safe side, if you
want to be on the safe side,
you understand the available memory,
you understand your max
connections, and you add some multiplier,
like 2, 3, maybe 4,
but usually this will lead us to
very low work_mem.
So this is why this iterative approach
and maybe raising understanding
that, like our workload won't change
tomorrow suddenly, like
a whole, usually.
In our existing project, usually
we understand, okay, realistic
consumption of memory is this.
So we are fine.
We can start raising this work_mem.
But like, and If you apply the
formula, you will see, oh, we
have risks of out of memory.
But no, no, we, our workload is,
we know our workload, right?
Of course, if we release it, release
changes in applications,
often workloads can change as well,
right?
So we should be careful with it.
Especially we should be careful
raising max_connections after
this tuning of work_mem because
this can lead us to higher out
of memory risks.
So instead of raising globally,
I recommend trying to think about
raising locally.
For example, you can say, I want
to raise for a specific session
because I know this is a heavy
report.
It needs more memory.
I want to avoid temporary files.
I just set work_mem to a higher value
in this session and that's
it.
Other sessions still use the global
setting of work_mem.
We can set even, say, even set
local work_mem in a transaction,
so when the transaction finishes, work_mem
kind of resets in the same session.
Or we can identify some parts of
the workload and this is good practice
to split the workload by users and,
for example, we have a special
user that runs heavier queries
like analytical queries, maybe
and we know this user needs a higher
work_mem, so we can alter the user's
work_mem.
And this is also a good practice
to avoid global raises.
But of course, this will make the
logic complex.
We need to document it properly.
So, if we have a bigger team and
we need to think, other people
will deal with it.
Of course, this needs proper documentation.
SET doesn't have a comment unlike
database objects.
So maybe, by the way, I just realized
maybe it's a good idea to
have some commenting capabilities
in Postgres for configuration
settings, right?
So anyway, as a final step, of
course, we consider raising it
globally.
And we do it all the time.
I mean, we see max_connections
quite high, and we raise work_mem,
so even if you multiply max_connections
by work_mem, you see that
you already exceed the kind of
available memory.
But this is tricky, I mean, this
is risky of course, but if we
observe our workload for a very
long time, and we know we don't
change everything drastically,
but we change only parts of the workload,
sometimes it's okay.
But of course, we understand there
are risks here, right?
So raising work_mem is kind
of risky and should be done with
an understanding of the details I just described.
Okay, I think maybe that's it.
Oh, there is also, since Postgres
14, there is a function pg_get_backend_memory_contexts.
It's very useful.
I mean, I don't use it myself yet,
because it's quite new.
Postgres 14 is only a couple of years old.
But, and there's a drawback to
it.
It can be applied only to the current
session.
So this is only for troubleshooting,
detailed troubleshooting.
If you deal with some queries,
you can see what's happening with
memory for a particular session.
I saw discussions to extend this
function to be able to use it
for other backends, for any session,
and when I was preparing
my how-to, These days I use our
new AI bot and of course it hallucinated
thinking Oh, you can just pass
PID to it.
No, it doesn't have any parameters.
You cannot pass anything to it
I would expect it.
So I will probably hallucinate
as well.
But the reality is it supports
only the current session, that's
it.
Maybe in the future it will be
extended.
So that discussion, as I understand,
didn't lead to patches accepted
yet.
But anyway, This is additional,
like, extra.
I think what I just described is
already quite practical.
Just remember that any session
can use, any query can use multiple
work_mems, but usually it's not
so.
And so the approach based on temporary
files is the way to go
these days.
Just monitor temporary files.
It's not like a big deal if we
have few of them happening sometimes,
especially for queries, analytical
queries.
They anyway are slow probably.
And okay, temporary files, we can
check how much we can win if
we raise work_mem.
But anyway, for WALTP, of course,
you want to avoid temporary
file creation.
And by default, I forgot to mention,
work memory is just 4 megabytes.
It's quite low.
These days, it's quite low.
I see in practice for mobile web
apps on bigger servers with
hundreds of gigabytes, we usually
raise it to 100 megabytes,
having few hundred max connections
and connection poolers, we
usually tend to have like 100 megabytes
work memory.
Maybe even more sometimes, again,
depends.
I think that's it.
So hello chat, I see several people
joined, thank you for joining.
Honestly, I recorded live just
because this is more convenient
for me.
So this is podcast anyway, this
will be distributed as usual.
I want to again say thank you for
being a listener, happy holidays,
and I hope we will have very good
topics in new year and I hope
every Postgres production server
is up and running with very
good uptime and with as few failovers
as possible and with as
low temporary file numbers as possible
as well.
So this is my wish for you in a
new year and thank you for listening
and watching.