Postgres FM

Happy holidays! Today's topic is work_mem — how it's used and how to tune it.
 
Related post: https://x.com/samokhvalov/status/1740813478150189172?s=46&t=HSxrW86NZqzPQP9vXVJEWg.

Creators & Guests

Host
Nikolay Samokhvalov
Founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Hello, hello, this is PostgresFM episode number 78, live, because I'm alone today again.

Michael is on vacation for the holidays, but I cannot allow us to miss any weeks, because we've been doing this for a year and a half already.

Episode 78, so no weeks missed.

So this week we'll do it as well.

But it will be a small episode.

First of all, happy holidays everyone!

It's December 29th, so it's holiday
season.

And let's have a small episode about work_mem.

Somebody asked me to cover this
topic.

And I actually wrote a how-to; I just haven't published it yet.

I'm in the middle of my Postgres Marathon series, where I publish how-tos every day.

Well, almost every day; I'm lagging a little bit because of the holiday season, but I'm going to catch up.

So, work_mem.

work_mem is something everyone uses, right?

We all run queries, and queries need work_mem.

But this is a super tricky, how to say, basic setting: basic because everyone needs it.

And it's tricky because every statement can use less than work_mem. It's not an allocated amount of memory like shared_buffers, the size of the buffer pool; it defines an upper limit, so if we need less, we use less.

But the trickiest part is that a single statement can consume this amount multiple times, because work_mem is the per-operation memory budget for sorting and hashing.

So if we have, for example, multiple hashing operations inside one query, say multiple hash joins, work_mem can be used multiple times.
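
To illustrate (a minimal sketch; the tables orders, customers, and products are hypothetical):

    SET work_mem = '4MB';

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products  p ON p.id = o.product_id
    ORDER BY o.created_at;
    -- If the planner chooses two hash joins plus a sort here, each of the
    -- three nodes gets its own work_mem budget, so this single statement
    -- may use up to ~3x work_mem (hashes can use even more, see below).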

And this adds unpredictability: it's hard to define a good algorithm to tune it cleanly, because we don't know how many times it will be used, right?

Compare this with the buffer pool: we have some amount of memory and we know our buffer pool size. That's simple, and it's another topic; you define it once and you cannot change it without a restart.

So we follow some rule; most people use 25%. Okay, we allocate 25% of RAM for the buffer pool.

What's left, the backends can use, and the operating system can also use it for its own page cache.

We should also not forget maintenance_work_mem, which is more predictable.

It has some trickiness as well, because there is autovacuum_work_mem, which is -1 by default. That means maintenance_work_mem will be used, and we have multiple autovacuum workers.

So if we set it to, say, 2 gigabytes, as I see people do, that's quite a lot.

You need to ensure that, for example, when you create an index, 2 gigabytes is indeed helpful.

But we can have multiple autovacuum workers, and I usually advocate raising the autovacuum worker count a lot, so we can have many of them: say 10 if you have a lot of CPUs, or maybe even 20.

It means that if you set maintenance_work_mem to 2 gigabytes, and autovacuum_work_mem is -1, so it inherits from maintenance_work_mem, then autovacuum alone can use up to 20 gigabytes.

It also depends, because it won't use that much for everything, but anyway, we need to subtract these 20GB from the remaining memory.
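
As a sketch of this arithmetic (the values are illustrative, not recommendations):

    ALTER SYSTEM SET maintenance_work_mem = '2GB';
    ALTER SYSTEM SET autovacuum_work_mem = -1;     -- default: inherit maintenance_work_mem
    ALTER SYSTEM SET autovacuum_max_workers = 10;  -- changing this requires a restart
    -- Worst case for autovacuum alone: 10 workers * 2GB = 20GB.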

We also have some overhead for additional processes.

And then, for what's left, we can just say: okay, we have max_connections, say, 200. So we divide the remaining memory by 200, and this is roughly what each backend can use.

But we don't know how many times each backend will use work_mem, right?
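
A rough worked example of this budgeting (all numbers are illustrative):

    -- server RAM:                 64GB
    -- shared_buffers (25%):      -16GB
    -- autovacuum worst case:     -20GB
    -- left for backends:         ~28GB
    -- max_connections = 200  =>  ~140MB per backend,
    -- and each backend may use work_mem several times per query.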

So let's discuss the approach I kind of developed while writing this how-to.

First, my favorite rule is the Pareto principle, the 80/20 rule.

So we take, for example, PGTune (pgtune.leopard.in.ua); this is a very simple, heuristic-based tuning tool, which is quite good. I mean, good enough in many cases.

You just use it, and it will give you some value for work_mem. Quite a safe value.

Again, the main thing with work_mem is to watch your max_connections, because if you increase max_connections, you need to understand that we don't want to hit out-of-memory. This is the main thing to be careful with.

So, okay, the tool will give you some rough value, and I say: let's go with this value, that's it.

Then we run with it for some time in production, and the second important step is to have very good monitoring.

Monitoring can provide very useful insights about temporary file creation.

When work_mem is not enough, it doesn't mean Postgres cannot execute the query.

Postgres will execute your query, but it will involve temporary file creation, meaning it will use disk as extra memory.

This is of course very slow; it will slow down query execution a lot, but the query will eventually finish, unless statement_timeout is reached, right?
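
A minimal, self-contained sketch to see this in action:

    -- With a deliberately tiny work_mem, this sort spills to disk:
    SET work_mem = '64kB';

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT g FROM generate_series(1, 100000) AS g ORDER BY g DESC;
    -- In the output, look for:  Sort Method: external merge  Disk: ...kB
    -- With a larger work_mem the same plan shows:  Sort Method: quicksort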

So, okay, we applied this rough tuning and started monitoring.

Oh, by the way, how do we monitor these temporary files?

I see three sources of information about temporary file creation.

The first, at a very high level, is pg_stat_database. For each database, you can see the number of temporary files already created and their total size, in the columns temp_files and temp_bytes.
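
For example (these counters are cumulative since the last statistics reset):

    SELECT datname,
           temp_files,
           pg_size_pretty(temp_bytes) AS temp_bytes_pretty
    FROM pg_stat_database
    WHERE datname IS NOT NULL
    ORDER BY temp_bytes DESC;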

So if your monitoring is good, or if you can extend it to include this, you will see the rate of temp file creation and also the size; the size is also interesting.

We can talk about the average file size, and maybe play with this data a bit more, but it's only two cumulative numbers, right? The number of files and the number of bytes.

It's not a lot; we cannot get a p95 here, for example.

The next, more detailed source of information is the Postgres logs.

If we adjust the log_temp_files setting, we get details about every occurrence of temporary file creation in the Postgres logs.

Of course, we need to be careful with the observer effect here. If we set it, for example, to 0 while our work_mem is very small and a lot of queries need to create temporary files, then not only will the temporary files slow us down, but we will also produce a lot of logging.

The observer effect can be bad here.

So we should probably be careful: not set it to 0 immediately, but start with some sane value, lower it a little bit, and see.

But eventually, if we know that temporary files are not created often, we can go even to 0.

Again, we should be careful.
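
A sketch of this gradual approach (the 100MB threshold is just an example to start with):

    -- The unit is kilobytes; 0 logs every temporary file, -1 disables logging.
    ALTER SYSTEM SET log_temp_files = '100MB';
    SELECT pg_reload_conf();
    -- Entries in the log look roughly like this:
    -- LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp12345.0", size 268435456
    -- STATEMENT:  select ... order by ...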

And finally, the third source of important monitoring data here is pg_stat_statements.

It has a couple of relevant columns: temp_blks_read and temp_blks_written.

So for each normalized query (I call it a query group), we can see... well, not quite the same as at the database level, because we don't have the number of files here.

Instead, we have blocks read and written, and the written blocks are the interesting part here.

The good thing here is that we can identify which parts of our whole workload, which query groups, are most active in terms of temporary file creation.

That means they need more work_mem, right? They lack work_mem.
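
For example, assuming the pg_stat_statements extension is installed (blocks are 8kB by default):

    SELECT queryid,
           left(query, 60) AS query_start,
           temp_blks_written,
           pg_size_pretty(temp_blks_written * 8192) AS temp_written
    FROM pg_stat_statements
    ORDER BY temp_blks_written DESC
    LIMIT 10;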

So, once we build our monitoring, or maybe we already have it. I'm not sure everyone has a very good one; as usual, I'm quite skeptical about the current state of Postgres monitoring in general.

But assuming we have this covered in our monitoring tools, and we have some details in the logs, the next step, of course, is to identify the relevant parts of our workload and think about optimization first.

Instead of raising work_mem, we can try to reduce memory usage: let's be less hungry for work_mem, right?

Sometimes it's quite straightforward, sometimes it's tricky.

Again, here I recommend applying the Pareto principle and not spending too much effort on this optimization. We just try; if it takes too much time or effort, we proceed to the next step.

The next step is raising work_mem.

Monitoring can already suggest the average temporary file size and the maximum temporary file size, and from that information we can understand how much we need to raise it.
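
A hypothetical example of such sizing (the numbers are illustrative; the in-memory footprint is often larger than the spilled file, so treat this as a starting point, not a formula):

    -- Suppose the logs show a maximum temporary file size of ~200MB.
    -- Try something above that, in one session first, not globally:
    SET work_mem = '256MB';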

Of course, jumping straight to the new value may be risky, although sometimes I see people do it.

The careful approach: we know our max_connections value, and we know that each statement can consume up to work_mem multiple times, because of multiple operations.

Also, since Postgres 13, there is a newer setting, hash_mem_multiplier, which defines a multiplier for hash operations.

By default it's 2.0 (since Postgres 15; it was 1.0 initially), meaning hash operations can use up to 2 times work_mem, which adds complexity to the logic and makes it even trickier to tune.
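
To check it:

    SHOW hash_mem_multiplier;  -- 1.0 in Postgres 13-14, 2.0 since Postgres 15
    -- Effective per-hash-node budget = work_mem * hash_mem_multiplier,
    -- so with the default 2.0 a hash join may use up to 2x work_mem.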

So, if you want to be on the safe side: you take the available memory, divide it by max_connections, and then divide by some multiplier, like 2, 3, maybe 4. But usually this leads to a very low work_mem.

This is why I prefer the iterative approach, combined with the understanding that our workload usually won't change drastically overnight.

In an existing project, we usually understand the realistic memory consumption: okay, it is this much, so we are fine, and we can start raising work_mem.

If you apply the strict formula, you will see: oh, we have out-of-memory risks. But no, we know our workload, right?

Of course, when we release changes in applications, the workload can change as well, right? So we should be careful with this.

Especially, we should be careful about raising max_connections after this tuning of work_mem, because it can lead to higher out-of-memory risks.

So instead of raising it globally, I recommend thinking about raising it locally.

For example, you can say: I want to raise it for a specific session, because I know this is a heavy report. It needs more memory, and I want to avoid temporary files. I just set work_mem to a higher value in this session, and that's it. Other sessions still use the global work_mem setting.

We can even use SET LOCAL inside a transaction, so when the transaction finishes, work_mem resets within the same session.

Or we can identify some parts of the workload; it's good practice to split the workload by users. For example, we have a special user that runs heavier queries, like analytical queries, and we know this user needs a higher work_mem, so we can alter that user's work_mem.

This is also a good practice to avoid global raises.
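
A sketch of these three scopes (the role name analytics_reader is illustrative):

    -- Session level: lasts until the session ends or RESET work_mem:
    SET work_mem = '256MB';

    -- Transaction level: reverts automatically at COMMIT or ROLLBACK:
    BEGIN;
    SET LOCAL work_mem = '256MB';
    -- ... run the heavy report here ...
    COMMIT;

    -- User level: a dedicated role for heavier, analytical queries:
    ALTER USER analytics_reader SET work_mem = '256MB';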

But of course, this makes the logic more complex; we need to document it properly.

If we have a bigger team, we need to think about the other people who will deal with it, so this needs proper documentation.

Unlike database objects, settings have no COMMENT capability. So, by the way, I just realized it might be a good idea to have some commenting capability in Postgres for configuration settings, right?

So anyway, as a final step, of course, we consider raising it globally.

And we do it all the time. I mean, we often see max_connections set quite high and still raise work_mem, so that if you multiply max_connections by work_mem, you already exceed the available memory.

This is tricky, I mean, risky of course, but if we have observed our workload for a very long time, and we know we don't change everything drastically but only change parts of the workload, sometimes it's okay.

But of course, we understand there are risks here, right?

So raising work_mem globally is kind of risky and should be done with an understanding of the details I just described.

Okay, I think maybe that's it.

Oh, also: since Postgres 14, there is a function, pg_get_backend_memory_contexts().

It's very useful. I don't use it myself much yet, because it's quite new; Postgres 14 is only a couple of years old.

But there's a drawback: it can be applied only to the current session.

So it's only for detailed troubleshooting: if you are dealing with some queries, you can see what's happening with memory in that particular session.
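
For example, inside the session you are troubleshooting:

    -- Postgres 14+; shows memory contexts of the current session only:
    SELECT name, parent, total_bytes, used_bytes
    FROM pg_get_backend_memory_contexts()
    ORDER BY total_bytes DESC
    LIMIT 10;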

I saw discussions about extending this function so it can be used for other backends, for any session. When I was preparing my how-to (these days I use our new AI bot), it of course hallucinated, thinking: oh, you can just pass a PID to it.

No, it doesn't have any parameters; you cannot pass anything to it. I would expect that too, so I would probably hallucinate the same way.

But the reality is that it supports only the current session, that's it.

Maybe in the future it will be extended, but as I understand, that discussion hasn't led to accepted patches yet.

But anyway, this is an extra. I think what I just described is already quite practical.

Just remember that any query can use multiple work_mems, but usually that's not the case.

So the approach based on temporary files is the way to go these days.

Just monitor temporary files.

It's not a big deal if a few of them appear sometimes, especially for analytical queries, which are probably slow anyway.

And okay, from the temporary files we can estimate how much we can win if we raise work_mem.

But for OLTP, of course, you want to avoid temporary file creation.

And by default, I forgot to mention, work_mem is just 4 megabytes, which is quite low these days.

In practice, for mobile and web apps on bigger servers with hundreds of gigabytes of RAM, a few hundred max_connections, and connection poolers, we usually tend to raise it to around 100 megabytes.

Maybe even more sometimes; again, it depends.
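
For reference, checking and changing the global value (only after the analysis described above; 100MB mirrors the practice just mentioned):

    SHOW work_mem;                        -- '4MB' by default
    ALTER SYSTEM SET work_mem = '100MB';  -- global; applies after reload
    SELECT pg_reload_conf();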

I think that's it.

So, hello chat, I see several people have joined, thank you for joining.

Honestly, I recorded live just because it's more convenient for me. This is a podcast anyway; it will be distributed as usual.

I want to say thank you again for being a listener. Happy holidays!

I hope we will have very good topics in the new year, and I hope every Postgres production server stays up and running with very good uptime, as few failovers as possible, and as few temporary files as possible as well.

This is my wish for you in the new year. Thank you for listening and watching.