Postgres FM

Nikolay and Michael discuss memory in PostgreSQL — how it works, important settings, and how to go about tuning them.
 
Here are links to a few things we mentioned: 
~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

If you would like to share this episode, here's a good link (and thank you!)

~~~

Postgres FM is brought to you by:
With special thanks to:

Creators & Guests

Host
Michael Christofides
Founder of pgMustard
Host
Nikolay Samokhvalov
Founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

[00:00:00] Nikolay: Hi, this is Nikolay, and my co-host is Michael. Hi, Michael.

[00:00:04] Michael: Hello, Nikolay.

[00:00:06] Nikolay: And today we talk about memory management: shared_buffers, work_mem, effective_cache_size, maintenance_work_mem, autovacuum_work_mem, and so on, right?

[00:00:19] Michael: Yeah, absolutely. And just before the show, actually, you were telling me you had a joke about this.

[00:00:25] Nikolay: I forgot it.

[00:00:27] Michael: Yeah. Nice

[00:00:28] Nikolay: Yeah.

Okay, let's try to be shallow because it's easier, but very wide. I mean, we will try to cover as much as possible in half an hour, and we can give tips on where to dive deeper if people want to; we can recommend some great materials and sources such as articles, books, and so on.

Source code, too. So where do you want to start?

[00:00:58] Michael: My experience is very biased because I make a tool for optimizing individual queries, so these system-wide things don't come up as often in the conversations I'm having. But my understanding is that a lot of people still don't tune these, and there still seems to be disagreement in the community about what a sensible setting is for things like shared_buffers, or for things like work_mem.

So some of these extremely high-level but extremely important settings have some debate around them, with very experienced people disagreeing about exactly what the docs should say or exactly what equation to use to set them. There seem to be some differences of opinion between people who are more into theory and people who are more into practice, and depending on the database, like how big or small it is, the advice can be very different. So yeah, I think there's a good high-level overview we can do. I'd probably like to start with shared_buffers; it feels like there are a few things Postgres uses shared memory for, but that's the biggest, right?

Or normally the biggest.

[00:02:07] Nikolay: Yeah, I agree. This is probably number one in terms of memory management to consider. But there is also an underlying thing that should probably go before it, and that underlying thing is called the page cache of our operating system,

the file cache. Postgres always works through it; Postgres never reads files from disk bypassing the file system, it always goes through the page cache. So, first of all, why a cache? Why do we need a cache? Because SSDs are quite fast, right? You can do a lot of operations per second.

Latency is quite good, sometimes below one millisecond. Of course, memory is much better, definitely. Despite the rise of new external storage devices, memory is still several orders of magnitude faster.

And we have a lot of memory lately. You can easily have a server with hundreds of gigabytes, and sometimes already terabytes, of memory, and memory is super fast. So of course it makes sense to keep a lot of data in memory, so the final latency, the query execution latency, is better for our end users.

Well, our end users are usually not the end users of the systems; our end users, from the database perspective, are some application, or sometimes humans, but very often not humans. And we want query execution to be very fast; performance is critical for databases. That's why we want a cache.

And we always have the page cache underneath, before even getting to shared_buffers. And usually, when we think about it, we don't know how much of it we will have originally, because Linux, for example, uses for the page cache everything it has that is not used by others.

All free memory becomes page cache memory. And this is good, this is fully transparent. And only then do we have the buffer pool, which is controlled by shared_buffers and allocated at startup time. It cannot be changed while the server is running, and this is a big limitation.

This is one of the biggest items I would like to see changed in Postgres, but it's very hard. It would be great to change shared_buffers without a restart, but right now it's not possible. So if you want to adjust shared_buffers, the size of the buffer pool, you need to restart Postgres.

And since it's allocated at startup, we need to think about how big it should be. But thinking about it, we should consider the underlying cache, the page cache. So Postgres basically has two levels of cache: the page cache and the buffer pool. And the rule of thumb right now is 25%. I remember when we started to have more than a couple of gigabytes of memory on most machines,

there was a debate; at some point the rule was 25% if that doesn't exceed roughly eight gigabytes, but now it's okay to allocate a hundred or two hundred gigabytes, it's normal. And I should note that 25% is not optimal; depending on your workload, you might want to allocate 80 percent, for example.

But it's risky. Of course, I would like to allocate like 90% to shared_buffers, and Linux would use less memory for the page cache. That would be okay, because the buffer pool is smarter than the page cache. For example, in some cases Postgres decides to use a very limited area of shared_buffers: if you perform a sequential scan on a large table, the page cache doesn't know it's a sequential scan, Linux doesn't know it, and the scan will spam your page cache.

It will spam it, but in the buffer pool you will have a ring buffer, only a limited area, 256 kilobytes if I remember correctly. So this process will use only this limited area to read the whole table, and it won't cause eviction of useful data that is needed by other backends, right?

So the buffer pool is smarter than the page cache, and we would like to give it 80 or 90%, but it's risky because we can hit an out-of-memory state sooner.
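
For reference, here is a minimal sketch of what checking and changing this looks like; the 16GB value is only an example for a 64 GB server, not a recommendation:

    -- check the current setting and whether a restart is required to change it
    SHOW shared_buffers;
    SELECT name, setting, unit, context
    FROM pg_settings
    WHERE name = 'shared_buffers';
    -- context = 'postmaster' means the value is applied only at server start

    -- example: roughly 25% of RAM on a 64 GB server (adjust for your machine)
    ALTER SYSTEM SET shared_buffers = '16GB';
    -- a restart of Postgres is required for this to take effect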

[00:06:48] Michael: Yeah, so lots to go on there. 25% does seem to be... I think it's even mentioned in the Postgres docs, and it's mentioned in a lot of guides. It's interesting you say it's not optimal, but there probably is a workload where it is; for example, if your tables and indexes are almost exactly about 25% of your RAM, if they fit exactly in shared_buffers, maybe that is optimal.

Well, I'm just saying there is such a case.

[00:07:22] Nikolay: Yeah, yeah, yeah. I agree.

[00:07:24] Michael: The problems we've got are that the default is so low that it's unlikely to be a good starting point, and 25% probably gives you a better starting point for experimentation.

But then to experiment you need restarts, and you also need to be able to replay your workload and things like that. So it's a tricky one. And before we move on, I did really want to say there's a really good tweet thread from Andres Freund; I couldn't believe he managed to summarize this in basically two tweets.

But the reason your statement that it's probably not optimal is a good one is that it assumes the data doesn't fit in RAM. For a lot of people it will: if you're working on a small database, your data might well fit in less than 25%, in which case you're probably going to be fine. It's probably going to be...

[00:08:17] Nikolay: Or 10%, yeah.

[00:08:19] Michael: Yeah, exactly. But the people that are going to have problems with this, the people that are going to need to listen to this or look into some of these links, are probably people that have got much larger data sets than will fit easily into RAM, or at least it would be extremely expensive to acquire a server where they would fit into RAM.

And then you've got the problem of the Postgres cache, shared_buffers, versus the operating system cache. And I think Andres does a great job of explaining that you can have the same data saved in both, and if you have the same data duplicated, that's inefficient for your cache hit rate.

[00:08:57] Nikolay: That's why 50% is probably not a good idea at all, right? You need to move this slider either to one side or to the other, and I would prefer to move it to 80 or 90, but it's risky.

[00:09:12] Michael: I've seen more commonly people starting to point maybe closer to 40 at the higher end and towards 10 at the lower end. But 80 or 90 just sounds wild to me when you consider what else we want to use memory for.

[00:09:25] Nikolay: From my personal experience: I spent some time studying various cases and running a lot of experiments, trying to develop a fully automated methodology, so for each particular database we capture the workload somehow (that's a different topic), and we just run a lot of experiments and see where our optimal point is. Sometimes we have a local optimum before 50% and a slightly better optimum after, when we pass 50 and go closer to a hundred. But again, it's risky; we can discuss why. Eventually, though, I got completely disappointed in this area, absolutely disappointed, because query optimization gives you much more.

Right now I prefer to say: let's just put 25%, no problem. We are not optimal, but we are on the safe side. Let's first focus on other settings which are very influential, and on query optimization. And if we really do need to squeeze a lot of performance, for example if we have 10,000 Postgres nodes and we need to save on memory and really need optimal performance, then probably it would make sense to find a better setting than 25%.

But in my opinion, if you start with 25% for OLTP workloads and then say we are going to optimize it, maybe 40 is better, that's premature optimization, because most likely you have different things to tune first: optimizing queries and the workload, indexes, and so on.

You need partitioning to have data locality, to use the buffer pool more efficiently. And if you focus on those areas, then maybe when you return and think about 25 versus 40%, which will probably give you a 10% benefit, you think: later, later, later. And you keep 25% for longer. This is my current understanding of this topic.

And again, I would like to thank Henrietta Dombrovskaya. I'm bad with names as usual, but I would like to thank her, because it was like a cold shower for me when I presented our approach to Postgres config tuning with experiments and so on in New York, maybe five years ago, and she said: you can optimize a lot, but then developers or ORMs write bad queries and all your optimization vanishes, because the queries or the indexes are written poorly, and so on.

At the time my reaction was: no, no, config optimization is super important, let's do it. But then I switched completely, and now we work on query optimization as well, with experiments, thin cloning, and so on. The change happened there. So that's why I say shared_buffers optimization is probably number 10

on the list of items you want to optimize:

put 25% and go elsewhere first.

[00:12:33] Michael: Yeah. But given our topic is memory, memory management,

[00:12:37] Nikolay: Well, we have other memory settings we need to consider, and they are also very interesting. But shared_buffers is of course number one in terms of Postgres configuration, because it's hard to change, a restart is needed, and it's very influential if you don't change it at all, because the original Postgres default is 128 megabytes.

Right? And there is also a different default dictated by some packages, which is probably half a gigabyte, I don't remember, maybe 256 megabytes. But these days such settings would only fit a Raspberry Pi, and not even the latest models, because the latest models have like eight gigs of memory. So this default shared_buffers setting in the Postgres config is so outdated that it can be used only on toy setups,

teapot-style Postgres. So it should always be tuned, but the 25% rule is good

enough.

[00:13:36] Michael: The one exception is that for a lot of people these days, the default config is now the default config chosen by their cloud provider. And because of the reason you said, this is one of the few settings that I see almost always changed; everyone I've checked has tuned this, not always to people's taste, but yeah, this is one people do change.

[00:13:58] Nikolay: And they use 25%.

[00:14:01] Michael: Not all of them,

no.

[00:14:03] Nikolay: In your experience?

[00:14:05] Michael: Well, I haven't checked all of them recently, starting new instances, but I was talking to a Heroku customer just recently and they're having a real ordeal trying...

[00:14:16] Nikolay: Heroku is a very old solution.

[00:14:21] Michael: Yeah. But lots

of people are still on it

[00:14:24] Nikolay: Yes, I know. I have customers on it as well. Yeah,

[00:14:27] Michael: and they have tuned shared_buffers, they have...

[00:14:29] Nikolay: they don't have

[00:14:30] Michael: ...from the default. I know, I know, but they have changed it, but not to 25%. Which is interesting because that means it's a deliberate decision not to go with what the docs say. Like, I think the easiest, the least...

[00:14:41] Nikolay: This decision was made before it was written into the docs.

[00:14:44] Michael: Maybe

[00:14:45] Nikolay: So probably they still apply 25%, capped at eight gigabytes: a maximum of eight gigabytes, and 25%. And this was also the recommendation in the old Josh Berkus

annotated postgresql.conf.

It was, yes. And it's unfortunately quite lagging behind for new versions, but 25% is the current rule of thumb. That's it, let's keep it; it should be number 10 on the list.

Let's agree to put 25% and forget about it for a while.

[00:15:22] Michael: Cool.

[00:15:23] Nikolay: So we have

25% that went to shared_buffers, and if we live in a vacuum, nothing else, and processes are not using memory, 75% will go to the page cache. And we have effective_cache_size, which is an additional but very important setting, which tells Postgres' planner how big both our caches are together. Not together, it's...

[00:15:51] Michael: It's the remaining... normally people recommend setting effective_cache_size as the total memory minus... yeah. Or three...

[00:16:00] Nikolay: No, not minus. Not minus.

It includes shared_buffers. For example, if we indeed decided to exceed 50%... I remember several years ago we discussed this on hackers, and Bruce was also participating, and I think Bruce fixed it in the documentation; I participated in that discussion. So, for example, if shared_buffers is 60%,

and we know our processes, backends and other non-Postgres processes, use say 10%, we have 30% left for the page cache. It means: okay, 60% shared_buffers, 30% page cache. Effective_cache_size should be either the sum of them... the sum is wrong, right, because we have duplicated caching, so either the maximum or the sum.

It's interesting in this case; I already forgot. But in general, if we decide to start with 25% for the buffer pool, I would just put 75% for effective_cache_size as a starting point. But what I would like to say here: you can put 200, 500, a thousand percent of the RAM available, and since it's not an allocation, Postgres will work. And this helps us on lower environments where we don't have enough memory but we want the planner to behave like production, actually exactly like production: we use planner settings and work_mem the same as on production. And effective_cache_size is one of the most important planner settings.

So we can fool Postgres here: the planner doesn't look at the real available RAM, the real available CPU or disk, it doesn't know anything about them, and it doesn't even look at shared_buffers. It just looks at effective_cache_size, and that's interesting.
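
Since it is only a planner hint, it can be changed on the fly; a rough illustration (the 48GB value assumes a ~64 GB machine and is only an example):

    -- planner hint only: nothing is allocated, so no restart is needed
    ALTER SYSTEM SET effective_cache_size = '48GB';  -- ~75% of RAM as a starting point
    SELECT pg_reload_conf();

    -- it can even be set per session, for example to make a small clone
    -- plan queries the way production would
    SET effective_cache_size = '48GB';
    SHOW effective_cache_size;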

[00:17:52] Michael: I still remember your tweet about that. I think you did a poll, and I got it wrong: I thought it would take shared_buffers into account in the planner, but I was wrong. That was a few years ago now, but yeah, good one. I feel like we should move on. From my experience at least, work_mem is hugely important here.

Is that the next thing you'd want to talk about, or where would you go next?

[00:18:14] Nikolay: Yeah, so, a brief summary. shared_buffers is more of a physical setting, which will affect the timing of query execution. Of course, if the data is there, good; if it's not there, bad, and we need to inspect the page cache again. For Postgres it will already count as reads, a buffer read, but it might involve the disk or maybe not touch the disk at all.

Postgres doesn't know; that's why we need something like pg_stat_kcache to understand. But effective_cache_size is for the planner, for how the planner chooses: the planner thinks, okay, we expect that many gigabytes of RAM involved in caching, so it's more optimistic that data will be in memory, or it expects some disk access to happen,

heavier IO operations. So this is a very high-level, logical-only thing, the planner deciding what to do, while shared_buffers, again, affects the timing, not the planner's behavior. We should mention our episode about checkpointer tuning, because some buffers get changed, and it can happen during a SELECT as well, if hint bits are updated on the fly. A buffer is basically a block read from

disk into memory, a page of eight kilobytes. We have huge pages as well, right, but that's a different topic; a buffer is eight

kilobytes, and if we changed at least something inside this page, it's considered dirty, and then the checkpointer or the background writer, first of all the checkpointer, and sometimes the backend itself, need to clean it, to synchronize it with disk, so there will be less stress the next time we need to perform recovery.
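
Who actually cleans dirty buffers can be observed in the cumulative statistics; a quick look (column layout as in Postgres 16 and older, where the checkpointer counters still live in pg_stat_bgwriter):

    SHOW block_size;  -- 8192 bytes by default, the page size discussed above

    -- how many buffers were written by checkpoints, by the background writer,
    -- and by backends themselves
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend
    FROM pg_stat_bgwriter;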

And we discussed this topic, right? So we need checkpointer tuning. But what we probably didn't discuss is that the page cache also needs tuning sometimes, because if we accumulate a lot of dirty pages in the page cache... it's not directly related to Postgres, but it is related to the overall performance of our database.

If we allow the operating system to accumulate a lot of dirty pages in the underlying page cache, we might have performance issues as well, because we want this process to be smoother, and you probably want to tune the behavior of the page cache by adjusting kernel settings via sysctl,

so that the amount of dirty pages doesn't grow a lot and the operating system flushes dirty buffers more often. It's slightly outside of Postgres, but it sometimes really affects how we work with memory. So again, Postgres relies on the page cache a lot, and this topic should be studied as well.

Okay. Now, what's left? work_mem and maintenance_work_mem, these two guys, first of all, right?

[00:21:19] Michael: Yeah, and when you mentioned a previous episode, I thought you were actually going to mention the Buffers episode. I think there's also something to be said there before we move on, which is: when people are doing estimations for their cache hit rate, for example, or if you're looking at a single query plan with BUFFERS, which hopefully you are, you'll see hits and reads.

But the thing you mentioned here that's relevant is that those reads aren't necessarily from disk. They could be from the OS page cache, and we can't tell. So yeah, something to be aware of.

[00:21:53] Nikolay: Yes. And pg_stat_statements cannot tell, but pg_stat_kcache can. And unfortunately we don't have anything for EXPLAIN; it would be great to have some extension which would extend EXPLAIN (ANALYZE, BUFFERS) so we would see real disk reads. But real disk IO can be measured if you look at /proc/<pid>/io for our process ID: you just cat /proc/<pid>/io and you see reads and writes, and this is real disk. So you can see it if you have access to Linux, of course not on RDS.

So sometimes you can cheat a little bit and check what's happening before you run EXPLAIN (ANALYZE, BUFFERS) and after it,

[00:22:34] Michael: Nice.

[00:22:35] Nikolay: right?

[00:22:35] Michael: Right. work_mem.

[00:22:37] Nikolay: So, work_mem. We discussed it briefly recently as well. The difficulty with work_mem is that nobody knows in advance how much we will need, and each query can consume multiple times work_mem, but sometimes only a fraction of work_mem.

So work_mem...

[00:22:54] Michael: Yep.

[00:22:55] Nikolay: it defines the amount of memory each backend can use to process a query, per operation, for operations like hash joins, ordering, grouping, and so on. And if we have multiple such operations in one query, we might use multiple times work_mem. And if we reach work_mem and need more, then temporary files will be created and query execution will slow down significantly.

[00:23:26] Michael: And that's because Postgres is no longer doing, for example, that sort in memory; it's doing that sort on disk using a different algorithm. Yeah, exactly. And it's not even just per operation: we did an episode recently on parallelism, and it can be per parallel worker, per operation, so it can multiply out quickly.
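
A quick way to see that spill for yourself; a synthetic example with arbitrary numbers:

    SET work_mem = '4MB';
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM generate_series(1, 5000000) AS g ORDER BY g DESC;
    -- the plan reports something like: Sort Method: external merge  Disk: ...kB

    SET work_mem = '512MB';
    -- re-run the same query: the sort should now report "quicksort  Memory: ...kB"
    -- and no temporary file is created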

[00:23:46] Nikolay: Yeah. Also, there is the hash_mem_multiplier setting, so it's very tricky to understand how much memory to expect. And usually our approach to tuning it is very reactive. Of course, the default for work_mem, again, is only for toy setups

usually, but it's very related to max_connections and to how many CPUs and how many connections, backends, you expect.

Starting from Postgres 14, we might allow a lot of idle connections. And you mentioned Andres Freund: there are a couple of tweets and a couple of blog posts related to memory consumption by backends and also max_connections, and whether it is okay to have a lot of idle connections. Starting from Postgres 14 we have an optimization there.

It turned out that memory management was not the number one problem; the number one problem was how the work with snapshots is organized. So those two blog posts are very good and in-depth. But it's very tricky to understand how much memory is used by a process, even if you have full access and you look at ps, top, everything, you have sar, all of these things. It's hard to understand in Linux how much memory is really consumed by each process.

But also, we cannot limit it reliably. We cannot say: okay, a hundred megabytes for each backend, that's it. Unfortunately, if we set work_mem to a hundred megabytes, a backend is still allowed to use it multiple times, because of this behavior: work_mem defines the limit for each operation, not for the backend, and we also have parallel workers.

Altogether this becomes quite a tricky topic, and our approach usually is very reactive and iterative. We first start from some safe work_mem setting. The safe approach is like: okay, we have max_connections, say 300 or 500, for our 96 or 244 core server, a big one. And then we say, okay, we should be on the safe side.

We know 25% goes to shared_buffers. We know some backends will use maintenance_work_mem for index creation or autovacuum, and we allocate like four gigs, or two gigs; we should discuss this as well. And we also have a maximum of, say, 10 such workers, so okay, 40 gigs. We have maybe half a terabyte, 700 gigs, so 25% goes there.

40 gigs goes to maintenance workers: index creation, reindexing, index maintenance, because it's needed all the time, and autovacuum, say we have 10 autovacuum workers and one or two workers rebuilding indexes. Okay, we know the math is good here, so what's left? Ignoring the page cache, we have like 300 gigs, for example.

For a while we ignore the page cache and we say: okay, we have 300 max_connections and 300 gigs, so we can give up to one gigabyte to each backend, right? Okay, that's good. But let's apply some multiplier, two or three, because we know some queries can have multiple operations, like sorts or hash joins, and so on.

In this case we say: okay, our quite safe setting is a hundred megabytes, because times two or three it'll be 200 or 300 megabytes, and we don't reach one gigabyte per backend. So we can use a hundred megabytes for work_mem; let's start with it. And we start with it, and we see, first, that most backends don't reach it even once, but some backends exceeded it, they need more, and this is an imbalance. We also see that the page cache is quite huge; what's left is mostly page cache. We understand: okay, our average work_mem consumption is quite low, but sometimes we need more, and so we start to tune, we start to raise it.

We know on average we don't reach it at all, but some backends need it. We have two options here: either set it locally in sessions. If we control the application code, we can SET work_mem for a particular query, because we know it's very heavy and we don't want this query to use temporary files.

Or we can just raise work_mem globally, after observing our workload for quite a long time and seeing that it's quite safe. And it's not uncommon to see people already having settings that are not safe on paper: if you take max_connections, take work_mem with a multiplier of two or three, and consider shared_buffers and maintenance_work_mem,

you see that we don't have that much memory; but we know we don't reach the limits.
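
To make the two options concrete, a small sketch; the values follow the arithmetic above and are purely illustrative:

    -- option 1: raise it only for a known heavy query, inside its own transaction
    BEGIN;
    SET LOCAL work_mem = '1GB';
    -- ... run the heavy reporting query here ...
    COMMIT;  -- the setting reverts automatically at the end of the transaction

    -- option 2: raise it globally after observing the workload for a while
    ALTER SYSTEM SET work_mem = '100MB';
    SELECT pg_reload_conf();  -- no restart needed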

[00:28:44] Michael: And this is what I was alluding to at the beginning of the episode. You say it's not safe, but you mean it's not safe in theory, not unsafe in practice.

Yeah, exactly, with the current workload. So it's an interesting topic. I think what you said there is actually a really good point that sometimes gets missed: it can be set at different levels, and if you've got one reporting query that runs once a month, for example, you could set a higher number for that.

[00:29:11] Nikolay: A gigabyte. Five gigabytes. Yeah, just

for a single backend.

[00:29:16] Michael: Yeah, if you know it's running at a time that's relatively quiet, for example, and it's not going to be contending with lots of other things.

So lots of things are possible here. Smaller servers than the ones Nikolay is talking about are available, but even for very small services, even for extremely small instances, I very rarely see one where a setting of less than 16 megabytes, four times the default, is ever sensible, as far as I've seen.

And quite often, even these small queries, even these transactional workloads, do spill over four meg. You'll see it if you start logging temporary files, which I think is very good advice, although I saw in one of the AWS docs that you shared with me, which I'll link up, that they advise only turning it on temporarily.

So I was going to ask you whether that's the case for you as well. But logging temporary files is extremely useful, and if you see lots of them being logged in the four to ten megabyte range, which is not uncommon for transactional workloads, you've probably not changed the setting and it's well worth looking at.

[00:30:18] Nikolay: And we have aggregated statistics for temporary files, the size and the number of files that occurred, for each database. So this is a good thing to monitor as well, and to have alerts on, and so on. But as for logging, I prefer logging all temporary file occurrences, on a tuned server. If it's not tuned, then each query execution will involve them and it will be a disaster. But normally we have already raised shared_buffers and raised work_mem, and since we raised work_mem, only particular queries produce

temporary files. So it's okay to log them all; I mean, to set zero as the size threshold for logging.
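
Roughly, the two things being described; a sketch, with zero meaning "log every temporary file regardless of size":

    -- aggregated temporary file statistics, per database
    SELECT datname, temp_files, temp_bytes
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;

    -- log every temporary file (0 = no size threshold)
    ALTER SYSTEM SET log_temp_files = 0;
    SELECT pg_reload_conf();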

[00:31:00] Michael: But I would say I see issues with work_mem way more often than I see issues with shared_buffers, probably because of my bias, but also because cloud providers don't tend to tune it, because it's this

scary setting. They don't even use that formula you were talking about, the formula that's shared in lots of places.

I particularly like an EDB guide by Vik Fearing, which I'll share in the links as well. But that formula, they could apply it, right? They could set it...

[00:31:24] Nikolay: The annotated postgresql.conf also has it, yeah.

[00:31:28] Michael: But my experience is they don't; they start at four megabytes still. Well, maybe I'm wrong, maybe I'm a bit out of date, but a lot of the ones I was checking didn't.

[00:31:37] Nikolay: You see temporary files in plans a lot, right? And the first advice: raise work_mem, because I often see four megabytes.

[00:31:46] Michael: First advice:

check your indexes.

[00:31:48] Nikolay: Check your indexes.

[00:31:49] Michael: Well, first advice: tune the query. Like, if you're doing a massive sort, you know...

[00:31:55] Nikolay: In this case, second advice: apply partitioning.

[00:31:58] Michael: Well, first advice: do less. Yeah.

[00:32:01] Nikolay: First

advice. is always buffers because if you focus on buffers, you like, database is all, all database.

Performance optimization is about to do less. Io. Right. That's, that's the key for good latency timings and so on. Less io.

[00:32:20] Michael: Well, we're possibly talking about the one exception to that in today's topic, which is memory, right? It's not reporting buffers if we're doing, say, a sort in memory. Is that IO? I think not.

[00:32:34] Nikolay: It can be internal, in memory, or external, on disk or in the page cache, which counts as external. But let's forget about that.

[00:32:42] Michael: It's not reported as buffers, crucially, if it's done in memory. But maybe that's an implementation detail. Regardless, if you do one thing as a result of this episode, please check what your current work_mem setting is, and if it's four megabytes, I'd start checking your temporary files.

[00:33:03] Nikolay: Yeah, probably your cloud provider already tuned shared_buffers, not to an optimal state, but to a good enough state. But work_mem they probably didn't touch; it's still four megabytes, and for modern workloads and data volumes it's a tiny, toy-sized value. But if you raise it a lot and max_connections is huge, with a lot of idle connections, you will probably run out of memory.

So it should also be done carefully, maybe in multiple steps, observing, and so on. But again, I agree with you. And partitioning is also important here, because it gives data locality. You mentioned that when our database fits fully into shared_buffers, that's great; but if we partitioned, we can say: okay, this is our total database size, but this is our working set size only,

the fresh partitions we work with intensively. And the buffer pool and the page cache both have quite good, simple mechanisms to understand what's needed and what's not: usage counters and so on. And this can be studied using a good extension called pg_buffercache: you can study what's inside your buffer pool right now.

It doesn't hurt to run it; it will take some time because it scans the whole buffer pool, and if the pool is huge, it will take a while. But it will give you statistics like: this relation, this index, is present with this many blocks, and you can aggregate it. It's an official contrib module shipped with Postgres, and the documentation has basic examples, so you can understand what's inside the buffer pool right now.

And this is interesting. So if you apply partitioning, things might start improving, because instead of mixing all the new data in a single table, and in its indexes too, it becomes much more efficient.
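
The kind of aggregation being described, close to the example in the pg_buffercache documentation (run as a superuser or a role with the right privileges):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- top relations by number of buffers currently in the buffer pool
    SELECT c.relname,
           count(*) AS buffers,
           pg_size_pretty(count(*) * 8192) AS cached
    FROM pg_buffercache b
    JOIN pg_class c
      ON b.relfilenode = pg_relation_filenode(c.oid)
     AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                               WHERE datname = current_database()))
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10;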

[00:34:56] Michael: Yeah. In fact, Timescale gives this recommendation in their docs, I think. They actually talk about setting the chunk interval so that at least your latest chunk, which is their word for partition, fits in shared_buffers. So that's another way of thinking about it: you could be choosing your partitions' size or frequency based on what you have available memory-wise. It's a cool flipped way of thinking about it.

You said you wanted to talk about something; I've forgotten what it was, but we needed to make sure we covered one more thing. Yes. Oh yeah, let's cover it.

[00:35:34] Nikolay: So, autovacuum_work_mem is set to minus one by default, meaning that it is inherited from maintenance_work_mem.

[00:35:43] Michael: yeah.

[00:35:43] Nikolay: autovacuum_work_mem is minus one by default, and in most cases I see people don't touch it. It means that maintenance_work_mem covers both operations like index creation or reindexing and also autovacuum. I don't remember the default; it's also very low. Usually you want to raise it to at least one gigabyte if you have a lot of memory, or half a gigabyte. But interestingly, a couple of years ago I ran experiments trying to understand how beneficial it is to go to higher values, and for index creation in particular I didn't see big benefits beyond half a gigabyte or one gigabyte; the difference between one gigabyte and five gigabytes

was low in my experience. But don't trust me blindly; it's always good to experiment, right? So you create your index with one setting, then you restart to reset the state of shared_buffers and everything, and also don't forget to flush the page cache:

you echo 3 into the drop_caches file; you can find on the internet how to flush the page cache in Linux, it's easy, it's a single line. And then you start the experiment from scratch again with a higher value of maintenance_work_mem, and then you compare duration, for example, and you can compare how much IO happened, and so on.

It's a good experiment to run, a simple exercise, and you'll see whether it's beneficial to raise it for your particular Postgres version, your database, your index, and so on. For autovacuum it makes more sense to raise it, but again, you can experiment with vacuum too; you just need to have the same number of dead tuples for each experiment run,

so it's interesting how to build such an experiment. If our listeners have a different experience, please comment on Twitter or YouTube, anywhere. I would like to hear whether some of you found it beneficial for index maintenance to raise maintenance_work_mem much higher than one gigabyte, for example.

But again, since autovacuum_work_mem is set to minus one, we need to take maintenance_work_mem and multiply it by the number of autovacuum workers, because it's used by autovacuum workers. And my recommendation: three workers is usually not enough; if you have a lot of cores, raise it to like 25% of the cores available, for example.

And then you can understand how much memory can be consumed by autovacuum workers as well, right? It can be a lot.
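
For orientation, the settings involved; the numbers are only examples for a large server, not recommendations:

    SHOW maintenance_work_mem;    -- 64MB by default
    SHOW autovacuum_work_mem;     -- -1 means "use maintenance_work_mem"
    SHOW autovacuum_max_workers;  -- 3 by default

    -- example values for a big machine; each autovacuum worker may use up to
    -- autovacuum_work_mem (or maintenance_work_mem when that is -1)
    ALTER SYSTEM SET maintenance_work_mem = '1GB';
    ALTER SYSTEM SET autovacuum_work_mem = '1GB';
    ALTER SYSTEM SET autovacuum_max_workers = 10;  -- this one requires a restart
    SELECT pg_reload_conf();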

[00:38:18] Michael: The limit is per worker, right? The limit we talked about earlier, each worker can consume that.

But only once, right? Not multiple times.

[00:38:27] Nikolay: Only once in this case, as I understand it. So my preference, especially if we have a lot of partitions, is more workers, maybe with a not-so-big maintenance_work_mem. And index maintenance may be just a single backend recreating indexes, maybe sometimes two, maybe three, but not many.

[00:38:48] Michael: Nice.

I feel like we should wrap up soon, but there are a few links that I found really useful, one in particular we haven't managed to mention yet, so maybe we should do a quickfire round of good resources.

[00:38:59] Nikolay: Yeah, first of all, both internals books, or rather online books: one is a PDF, one is just a website, both very interesting, from Suzuki and from Rogov, and they cover many of these aspects quite deeply. If you want to go inside, and in this case I think almost everyone should, even backend developers could benefit from understanding how Postgres works: the process model rather than threads, shared_buffers, how it works with relations, work_mem, maintenance_work_mem. This

is the number one thing.

[00:39:36] Michael: Yeah, before you move on, I wanted to give a special shout out to the Rogov book. I was expecting, given it's translated, for it not to be as clear and well-written as it is, and I looked up just before this who translated it, and that's Liudmila Mantrova. So shout out to her, because it's fantastically well written, even the English version.

So I'll share a link to that.

[00:39:58] Nikolay: Yeah, I use this book as a reference, and honestly I don't even notice which language I'm using, because I speak both, and I found both versions good enough. It means the translation is really good, I agree. But before the internals books we should mention the documentation, of course, and also the README file in the backend source code, I don't remember exactly where,

about the buffer pool, yes, about the buffer pool, because it also explains, in quite plain English, some details of how the buffer pool is organized, about pins and so on, about locks. But it's also interesting for folks who are using newer versions of Postgres:

in Postgres 14 we have the pg_backend_memory_contexts system view, which gives aggregated stats for memory usage. Somehow I never use it in production, because I still mostly have older versions, well, some are already on Postgres 14, but I would need some incident to start using it.

It's an interesting thing to keep in mind that we now have a good observability tool, and in Postgres 16 pg_stat_io was added,

also aggregated statistics, for how many operations happened, timing, and so on, also very interesting. And pg_buffercache. These are the things; this is enough to go quite deep.

Yeah.
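
Both views can be queried directly; a quick look, assuming Postgres 14+ for the first and 16+ for the second:

    -- memory contexts of your own backend, largest first (Postgres 14+)
    SELECT name, total_bytes, used_bytes
    FROM pg_backend_memory_contexts
    ORDER BY total_bytes DESC
    LIMIT 10;

    -- cluster-wide IO statistics by backend type and context (Postgres 16+)
    SELECT backend_type, object, context, reads, writes, extends
    FROM pg_stat_io
    WHERE reads > 0 OR writes > 0
    ORDER BY reads + writes DESC;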

And also pg_prewarm, by the way. If you want to restore the state of the buffer pool, you can check pg_buffercache plus pg_prewarm. And there is also automatic prewarming, added a few versions ago. So there is a recipe: you have a restart and you want to restore the state of the buffer pool

so query performance is good again; it's possible. Honestly, I never use it, but I know it's possible, and knowing what is possible is sometimes more important than knowing some details.
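
A sketch of both flavors; the table name is hypothetical:

    CREATE EXTENSION IF NOT EXISTS pg_prewarm;

    -- manually load one relation into shared buffers; returns the number of blocks read
    SELECT pg_prewarm('my_hot_table');

    -- automatic prewarming across restarts: add the module to
    -- shared_preload_libraries in postgresql.conf (restart required), e.g.
    --   shared_preload_libraries = 'pg_prewarm'
    -- the autoprewarm worker then periodically saves the buffer pool contents
    -- and reloads them at the next start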

[00:42:02] Michael: I've heard of people using it for experimentation, but not...

[00:42:05] Nikolay: Yeah, yeah, this is the opposite approach: instead of starting cold, you probably want to start from the same warmed-up state. Makes sense. It's also worth mentioning that if you are on ZFS and you use, for example, Database Lab, you can use it on production, and in this case you have branched Postgres.

You can use a smaller buffer pool. Of course, in this case the cache is not as smart as the buffer pool, but it's good to have multiple branched Postgres versions utilizing a single underlying cache instead of the regular page cache. It's called ARC, the ZFS ARC. If all the databases are writable and running on a single host, they use the single ARC cache from ZFS, and it's great to see that shared blocks are cached there.

So you start a new branch and it's already quite fast. The data is not in the buffer pool, and we need to use a smaller buffer pool if we run multiple Postgres instances on a single machine, but it's already quite fast because it's cached in this underlying ARC. So it's also an interesting approach, quite underappreciated, I think, but some folks run this in production and it's interesting.

[00:43:23] Michael: I think that's enough for today. What do you reckon?

[00:43:25] Nikolay: Right, yeah, that's it. And I like your idea: raise work_mem as the single recommendation. Consider raising it, but be careful.

[00:43:34] Michael: Yeah, exactly. In fact, I've got a blog post on that; I can't believe I forgot to mention it, but I'll share that as well.

Right. Thanks so much, Nikolay.

[00:43:44] Nikolay: Thank

[00:43:44] Michael: I won't see you next week, because even though last week we were talking about such a good streak of weeks in a row, I'll be missing the next

[00:43:51] Nikolay: Uh,

[00:43:51] Michael: two.

[00:43:52] Nikolay: Yeah, you... it's called a slacker, right?

But we will have episodes no matter what.

[00:44:01] Michael: Wonderful. Well, I look forward to coming back from my holiday and seeing what they're on, and being a...

[00:44:06] Nikolay: Right. Surprise.

[00:44:07] Michael: Yeah. Exciting.

[00:44:09] Nikolay: Thank you.

[00:44:10] Michael: Take care and yeah, catch you soon.

[00:44:12] Nikolay: Bye.

[00:44:13] Michael: Bye.