Michael: Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and as usual I'm joined by Nick, founder of Postgres.AI. Hey Nick, how's it going? Nikolay: Hello Michael, going great, how are you? Michael: I am good also. What are we talking about this week? Nikolay: We're talking about... What's written on the cover? Michael: Are we gonna write LWLocks or lightweight locks as a full word? What do you think? Nikolay: I like the shorter version of course. There's also a question to write 3 uppercase letters or everything lowercase. Yeah, lightweight locks, it's like, it's even hard to pronounce. So bottleneck locks also not good, right? LWLocks. I like some systems call it latches, you know? Michael: Latches, yeah. Yeah. Why Postgres doesn't do it? Nikolay: Maybe because of Linux or I don't know. Michael: Naming things is hard. Nikolay: Yeah, yeah, yeah. Because we have confusion sometimes when we say just locks. You know, like when we say backups or logical backups dumps, here also locks, there are 2 types of locks. There are more types Michael: of locks, right? Nikolay: But there are big 2 major types. Yeah, categories. Yeah. Is it category or type in pg_stats, in pg_locks? Michael: Good point. Yeah, maybe types. I don't know. Another loaded term, types is another loaded term. Nikolay: Or mode or maybe more let me let me just check right now maybe it's called mode in pg_locks I'm constantly confused about yeah there are you know there are some terms like class type mode like they can be like they are quite abstract right it's called mode in pg_locks it's called mode Michael: cool Nikolay: but pg_locks is about heavyweight locks we are talking today about lightweight locks right Michael: yeah we did we did a whole episode that we just called locks because that's generally what they referred to the heavyweight locks yeah Nikolay: if there is no additional word it means heavyweight locks so yeah and and why why so because heavyweight locks you can name them just locks because they are closer to User, right? You can see them in pg_locks, for example, you can sometimes, not, not always, but sometimes you can acquire them directly using just lock SQL command right just lock table name or SELECT for UPDATE and you sit in transaction being acquired some locks right. Michael: Yeah I also think in general more people come across them because they affect you at earlier stages in project life cycle. Like you don't have to be at such extreme scale to start being affected by them or having to be aware of them. Nikolay: Yeah, I agree. In general case. Because in some cases, for example, if you have read-only workloads, that's it. In read-only, you don't like, you have heavyweight locks, but you won't notice them because they're like ACCESS SHARE lock, that's it. Right. But if you have really a lot of TPS, you might start observing some lightweight locks. Yeah, Michael: I was thinking even like Schema changes though, like even in read-only. Nikolay: This was edge case. Michael: Yeah, okay, fine. Nikolay: In general, I agree with you, heavy locks, you bump into them sooner. Schema change is a great example. Michael: Cool. So where did you, I mean, starting with the difference between locks and lightweight locks is probably great. Nikolay: Yeah, let's talk more about differences, because there are important differences to understand and feel and take into account when you develop things or tune, optimize, scale, migrate. So heavyweight locks are acquired during like SQL operations and they are acquired for like Database objects, like Relational level locks. By the way, this is super confusing. You know, yeah, I'll be talking about heavyweight locks a little bit, trying to make it shorter, but as you know, I write again almost every day. I skip some weekends, but I write Postgres Marathon posts again, And many days I already sit in between heavyweight locks and lightweight locks and research 1 of LWLock:LockManager, right? So relation level locks, it's quite confusing name because they are called in documentation, they are called Table level locks. Yeah. Which is misleading because they are also this type of like the same thing and inside documentation it's already, it becomes clear that indexes are also involved and materialized views and views. And so all the relations, sequences are also relations, right? Michael: Yes. I never really thought of it like that. Nikolay: Well, if you check class and reltype, I think S is sequence. Maybe I should check again. Should we develop a habit to check things right online? Michael: Why not? Nikolay: Yeah, so I will be checking, but meanwhile, you can acquire locks, heavyweight locks on tables, on indexes, on database, right? Like even higher level, you can lock the whole database. And we know the recent problem when Recall.ai blog post, right? Our clients, they posted about database level lock acquired when NOTIFY happens to establish sequential NOTIFY events at commit time. And also row level locks. Yeah. Tuple or row level. Let's leave it for another time. So you can acquire locks on database objects. These are heavyweight locks. Documentation is also confusing because it says explicit locking. Although most of the cases where you have it, it's implicit locking. You say alter table and you don't say lock table, you say alter table. So it's actually implicit. Well, this, I have also always like some shift in my brain when I need to Google documentation for lock or heavyweight locks. I just remember, I need to search. I need to ignore the fact that I'm going to look at explicit locking documentation, although I need implicit locking documentation, right? Michael: By the way, before we move on, I checked pg_class and you're right, sequences are in there, and weirdly the rel kind is capital S all the others are lowercase I think well the ones I can see anyway. Nikolay: Does it mean something? Michael: Don't know. Nikolay: Yeah. Oh, by the way, explicit locking documentation, it mentions that you can lock indexes with ACCESS SHARE lock, for example. But You cannot do it explicitly. You cannot say lock an index name. So I'm pretty sure you cannot do it with sequences as well. Yeah, anyway, so these are heavyweight locks, right? So you basically, your actions, I mean your SQL, this is what directly creates heavyweight locks. And why is it needed? Because we need to, we are not working in single user mode. We need to protect resources from concurrent operations, reading, writing, changing. And usually we don't need to protect from reading, but while somebody is reading, another Backend shouldn't modify it usually, right? Or for example, if you read from table, dumping it for example, other Session cannot modify, like add a Column, for example, cannot run DDL, for example, right? Michael: Or drop it, for example. Nikolay: And the important thing about heavyweight locks compared to lightweight locks is to understand that once a lock is acquired, it can be released only in the very end of Transaction, Commit or Rollback. That's it. Only 2 options to release this lock. You cannot release it midway. Right, and this is super important for understanding always. It means that Transactions should be shorter. Right, so your actions won't affect others. Like, or chances to affect others would be lower. Right? This is... Michael: Or time that it affects others is lower, right? Nikolay: Yeah, yeah, yeah. Michael: Because you will affect people just for less time and there's a point at which that becomes unnoticeable or acceptable. Nikolay: Yeah, yeah. Or won't affect at all if they come a little bit later. But if you change something or even if you read something and keep Transaction open for hours, it means nobody can modify this table, no DDL is possible, autovacuum cannot do some things and so on, like it's bad. Michael: Yeah, it's worse than that, isn't it? I know we've talked about this before, but if DDL comes along and doesn't have a lock_timeout, then naturally you can suddenly be down. Yeah. Nikolay: Yeah. Yeah. So because, yeah, this is also a good point. So heavy locks, they have this ability, like this property to be acquired. Release happens only in the very end. And what you say also good point. They also, there is a LockManager. There's, By the way, I couldn't find definition of LockManager. Nowhere, nowhere. Like it's like, it's obvious, right? Even in the source code, it's not defined, which is interesting. So LockManager is responsible for managing locks, heavyweight locks, right? And backends can form a queue of waiting for a lock acquisition. So if I'm waiting to acquire lock, some other backends can be waiting and they like ask where is the end of the line and go there, right? So it's just natural, like in the order of first, like in natural order, right? So unlike lightweight locks, lightweight locks acquired and released very quickly. I think documentation, source code, I mentioned it's like dozens of operations. Unlike There is underlying concept of spin locks, which like few operations only, like few instructions only. Lightweight locks are bigger, but it's very fast as well. Acquired and released, and they don't wait until the end of transaction because they work in lower level of abstraction. It's not closer to users, closer to resources like memory. So their main purpose is to protect some physical resources like parts of the memory, shared buffers, and so on. Right? OK. Yeah. Like these things. So they can be acquired and list quickly. There are only 2 types, exclusive lock and share lock, unlike heavyweight locks. Heavyweight locks have a list. And interesting relationships between different ones, right? Here it's only share and exclusive, shared and exclusive. Shared locks don't conflict, shared lightweight locks don't conflict. But exclusive lock cannot be acquired while share lock is still running, lasting. Right? Share lock should be released first, then only you can acquire exclusive lock, because exclusive lock is needed to modify the resource, right? Share lock is needed to protect for reading. It's saying I'm reading, don't change it because I'm still reading. And when I'm done you can modify it, right? So this is lightweight locks. And that's why they are lightweight, because they are much, much shorter living, right? So these are main differences between them. What else? Michael: Maybe types of lightweight locks? Well forgive the word types but you know what what should people be aware of Because I've only really come across 1 type because that's the type that seems to cause the problems. But what should people be aware of at least? Nikolay: Yeah so types, modes, I struggle, I'm mixing these terms and it's really hard to distinguish between them. So if we talk about types, exclusive and shared, we just covered it. If we talk about different kinds or, let's say wait events we observe in pg_stat_activity. pg_stat_activity is the main cumulative statistics system view. Everyone should think, like, learn about it, right? It's super important because it shows what's currently happening in database. And there are 2 columns called wait event type and wait event. Also, by the way, slightly confusing because the word type is there and so on. I would prefer like, it would be good to name that thing like classes maybe or category. I don't know because type word is so overused or like overloaded. Right. Anyway, wait event type can be, I think there are less than 10 class, 10 types. And 2 of them which are most interesting today is lock, meaning heavyweight lock. And LWLock. And wait event type LWLock, you can check documentation. There are many, many, many, many dozens of wait events for LWLock, meaning that we have a lot of kinds of LWLock. These kinds, like again, types are only like exclusive and shared, but these kinds, it's classification with respect to the resource we are locking. For example, LockManager itself, although the main purpose of LockManager is to handle, to manage heavyweight locks. When it does it, it does it using a piece of memory, shared memory. Special piece of shared memory called, well it's called like main lock table, right? It's a big piece of memory which is segmented partition to 16 partitions starting Postgres I think 9.2 or 8.2 actually it was very long. num lock partitions 16. And when a new information about heavyweight lock is needed to be written there, The partition of this main lock table where it needs to be written, it needs to be locked by lightweight lock, right? To ensure nobody else is writing to it. So, LockManager can have up to 16 lightweight locks, which are seen as LWLock:lock_manager. 16 because we have 16 partitions of this main lock table in memory. And how it works based on, for example, for relational level locks, based on the relation name, there's a hash function which understands which partition to use, determines which partition to use, right? Michael: Yeah. So Nikolay: The same table or index will always go to the same partition of all of those 16 partitions. Right? Michael: So this was a long time ago. So back in the day, I'm guessing this was 1 thing, and there was probably too much contention. Nikolay: Before 8.2. Yeah. Michael: Yeah. Okay. That's what you're talking about. Great. Nikolay: Yeah. And this just to like, there's a confusion because there was another 16. Yes. Michael: That's what I was thinking. Nikolay: It's a different constant, which, which changed that behavior changed in, in Postgres 18. This behavior hasn't changed. Yeah, fastpath changed. This hasn't. This hasn't. This still is 16 partitions and if you have a lot of heavy lock acquisition attempts for the same relation, a lot I mean like thousands or maybe dozens of thousands per second, a lot, really a lot, then exclusive lightweight locks on the same partition will be competing. And while you are waiting, like while we try to establish heavyweight lock to some index or table, but that partition is already locked by exclusive lightweight locks from different backends attempt write heavyweight lock information about it. We need to wait a little bit. And this little wait will be seen as wait event type LWLock and wait event lock manager. Right? Is it clear? Because we are like, we have weird combination of heavyweight locks and lightweight locks in the same topic here. Michael: Especially because the lightweight lock manager is actually looking at heavyweight locks. That's the confusing part for sure. But I was just looking up in the docs, in the table of all of the... They've called them types, wait events of type LWLock, and you're right, it's such a long list. I think there might be 50 or more. Nikolay: You will see checkpoint, autovacuum there, but isn't it great that we don't observe them? It means it's quite well optimized, right? Like for example, yeah, yeah, you'll find a lot of stuff. I see many of them, I do observe, not just a lock manager. We saw many of them in production and yeah. Usually the rule is if you see lock wait event type, it means you need to go and think how to redesign your application. Because classic example is, for example, we are doing some billing system and we have a single account which needs to be updated for each transaction people do. I mean, financial transaction. And this is a classic example when you shoot yourself into the foot because updating the same row will be like hotspot. And you will see a lot of, you see heavyweight lock contention because many, many backends try to update the same row. Right. Yeah. So, or you mentioned a very good example. If you do DDL without lock timeout and retries, you also can have a chain of waiting backends, which just wait until your DDL finishes, but it itself is waiting for some other SELECT. And people, I see examples in blog posts, people, like I see examples, people try to explain this problem, but many of them involve updates, deletes. No, just SELECT, DDL, and many other SELECTs. You don't need even to update anything or INSERT or DELETE. That's it. Just SELECTs and DDL. And we see a lock wait event in pg_stat_activity. It's bad. But when we talk about... Yeah, and for lock you saw like it's a relation, object, page, also page interesting, tuple, virtualxid, that's interesting. Advisory locks is kind of a different thing. But for LWLock, we have a lot, and among them, there is lock manager. I like the approach, which I think RDS started, maybe not RDS, but they use it a lot and I also started using it. We usually take WaitEventType and WaitEvent to columns from pg_stat_activity and write them with a colon in between. So it becomes LWLock:LockManager. Just in, you know, like in texts where we discuss problems and do RCA or something. Yeah, it's just convenient. Michael: Like a naming convention. Nikolay: And I wanted to highlight that this problem, which related to both heavyweight locks and lightweight locks, in the name of it, we have the word lock twice. LWLock:lock_manager. First time it's about lightweight lock but in lock manager it's about heavyweight lock. That's why the lock is encountered twice, 2 times. What else? We have other, we observed other types of lightweight locks problems. For example, Michael: yeah. Well, I've spotted 1 in the list, SubtransSLRU. Nikolay: Yeah, this is my favorite 1. Although I must admit, I haven't touched this topic for a few years. Yeah, yeah. I touched it heavily in 2021 when GitLab had the problem. Yeah. And studied it. And yeah, and since then, SLRU, it's simple, at least recently used, it's small caches, Postgres have multiple of them. I think since Postgres 12 or 13, we have pg_stat_slru system view, where you can see like counters of work of those SLRUs, But also SLRU mechanism got some handles, I mean, settings, GUC, GUCs, right? People say GUCs. You can change them and increase it and yeah, to postpone this performance cliff. So I haven't seen them often since then. Like there are customers who usually read and come, we help like, we help easily, like just try to get rid of sub-transactions, although, like I still think by default you should avoid sub-transactions, but in some cases I already see they can be used in safe way. You know, you need to just understand the limits and then you can use them. For example, again, DDL, sometimes complex changes of schema. You don't want to lose part of schema. And this approach with attempts to acquire lock, how can you do it inside transaction, you need sub-transaction, right? Yeah. Because if attempt fails, you don't want to lose everything. You want to lose only the last step, right? And this is exactly where I think it's worth thinking about to use some transaction, but you need to understand like details. For example, you don't want to have a long transaction running on the primary in parallel. To other table, by the way, not to any table. And replicas which receive a lot of like transactions per second because they might be down because of the use of sub-transaction and you can see subtrans SLRU because SLRU is overflown. And again when it's overflown the lightweight locks acquisition like we see contention and we see it as in pg_stat_activity and wait event analysis as LWLock:SubtransSLRU Right. There are other SLRUs, right, mentioned like notify SLRU, I'm pretty sure MultiXact upset, MultiXact member SLRU. Speaking of them, we had an episode about the case from Metronome, right? Michael: Yeah, true. Nikolay: Yeah, it was a great blog post. Like this is a great example how company can share with community what happened and others benefit. Since then we had another client, new client we had, which came to us with very same problem related to MultiXact member. Michael: Wow. Member exhaustion. Nikolay: Yeah, exactly. Wow. So it's also observed like a lightweight lock, MultiXact blah, blah. There are, there's a bunch of MultiXact lightweight locks you can see in the table. Another 1 is a very popular 1 is LWLock:buffer_mapping. So usually it's called a buffer thrashing, right? When we let like the buffer pool is not big enough and a lot of eviction happens and new pages are loaded all the time and and we see when of course when it's happening to protect memory Postgres needs to use exclusive lightweight lock when writing happening, shared lightweight lock when reading happening. And this is exactly when we can see some backends are a little bit waiting for other backends, right? And this is how it's seen. Michael: So like if somebody's limited by the amount of shared memory, or like shared buffers. Nikolay: Yeah, solution is simple. We need to increase the buffer pool. We need to fight bloat because this is what like increases this problem. You need to get rid of unused indexes and other things because they also contribute to it. Right. Unused ones. Of course. Michael: Like, I was just thinking they would be evicted and not. Nikolay: Think better when you change something with insert or non-HOT update. Michael: All indexes need to Nikolay: be loaded to be changed. Yeah. Yeah. And they contribute to this spam coming to the buffer pool and also to WAL, but it's a different story. So we, that's why I think people underestimate how bad bloat is. I feel it like we have a new wave of companies coming to for consulting to us, which I call AI companies. They probably are quite old companies, but they have the transition to AI. And they have increasing data volumes, increasing workloads, and they underestimate the problem of bloat and unused indexes and index write amplification, all this spam coming to memory and WAL and it means backups, replication. It's like, this thing is like multi-sided. Yeah. Michael: Like cascades, doesn't it? Nikolay: Yeah. Yeah, yeah. It doesn't, and also it's not performance cliff which you like suddenly see and oh we have a problem it's slowly slowly slowly like growing Michael: or like you sink into it slowly like like sand or like a swamp Nikolay: yeah yeah And then you need to increase instance size or think about sharding and so on. By the way, I like sharding a lot, but I think in many cases it's just hiding the problem. It's just distributing the problem and you just you pay to not to solve problems. For business it's sometimes a valid approach, right? You just you don't have time to solve this. But we also can just implement automated procedures to reduce the amount of trash you have, right? Michael: Yeah, I hear about it all of the time as well, even at smaller companies that just upgrade their instance, especially at the lower sizes. They just don't wanna throw engineering time Nikolay: at it. I'm very surprised. I recently started asking directly on the very first call when we have consulting like guys, do you care about bloat and index health? And I usually hear no. And then It's our job to explain why, right? Michael: So it's easy. Yeah, bloat is not the problem. Bloat is like, well, it might be the root cause, but it's not the problem users see. They don't see. They're not. Nikolay: Well, unused indexes also they don't see. Michael: Yeah, yeah. Nikolay: They created some indexes and forgot about it, right? Or redundant indexes. Same problem. I mean, similar in this case, but then boom, like buffer mapping, LWLock:buffer_mapping. Why? We don't have enough memory. Michael: Yeah. Also, this might be part of it. I think Bloat used to be worse. Like I remember in the Nikolay: before a Michael: few optimizations Nikolay: 14-14. Michael: Yeah before then we you could come across especially indexes that were like 99.9% you know in extreme cases you could come across very very very bloated indexes it's just much less likely now so I think it's Nikolay: less likely but not much Michael: yeah well okay Nikolay: well only deduplication doesn't solve the problem when... Like Postgres B-tree doesn't have merge. Michael: But it does have bottom up deletion. It does like a lot less spitting. Nikolay: Well, if you have in the middle of B-tree, half empty page, This space won't be used if you write, for example, it's an incremental timestamp. You're writing to the end always. In the middle, nobody will write, so you have this Bloat which won't be eliminated. Michael: True. Until like, yeah, anyway, let's get back to Nikolay: the topic. Until reindex. Michael: Well, I was going to say until like logical replication upgrade or something like that. Nikolay: Well, yeah, yeah, yeah, Anyway, when you rebuild index anyway, right? Michael: Yeah. Nikolay: By the way, let me advertise something. We have open source component, which we just recently developed. It's now entering beta stage and it aims to like to automatically rebuild indexes on any platform, just reach out to me in any way. I will share details because we are looking for more cases before we move on and make it more publicly available. I don't advertise it because I want to understand the use cases people have in terms of... Anyway, if you want to rebuild indexes in an automated fashion, we have an open source component for you. Fresh 1, very interesting, not only for B-tree, but for any index, almost any. I think BRIN is not supported. All others are supported. Michael: I've never seen a bloated BRIN index. Nikolay: Have you? Yeah, good question. It can degrade a lot if you modify. Michael: We've got another good episode on that actually. MaxMulti I think is... Nikolay: On every sentence we say we had an episode already, right? Anyway, yeah, we're looking for early adopters for this small tool. Michael: Cool, I'll put it in the show notes. Nikolay: Yeah, yeah. Which is open source. Yeah. Okay. WAL write is another 1 we see often. Michael: Oh really? Nikolay: Yeah. Again, like when somebody hasn't dropped unused and redundant indexes And they write a lot of, oh, I forgot to mention, of course you can write, you can find queries which contribute to this buffer thrashing, right? And maybe get rid of them or make it less, causing less smaller storms, right? So you can optimize queries sometimes and, and, and avoid that LWLock:buffer_mapping contention. Yeah. So about WAL writes, same thing, like if you have a lot of indexes which contribute to WAL writes, or you have like very frequent checkpoints and you have a random access pattern of writes. When you write to the same page often, but between those writes to the same page you have checkpoint, It means this page will go as a full page right to WAL multiple times over and over. You have a lot of WAL and this can be also an issue. And if disks are, maybe disk I/O is saturated as well. Yeah, these things. Michael: Yeah. Or IOPS. Nikolay: Yeah. There are, there are several things there. Like there is I/O WALWrite, I think, and LWLock WALWrite. I don't remember from top of my head, but there are interesting nuances there. I probably should cover it some day in Postgres Marathon, because sometimes you are waiting on disks, but sometimes you are waiting on locking internal structures, lightweight lock. So if WAL buffers is like, there is a quite small amount of WAL buffers in the memory, so if it's already fully written, it needs to go to disk, probably you are waiting on disk. But if you are writing to... Yeah, so it should be checked in detail when we have it. There are several wait events there. Also interesting thing which pops up recently is SyncRep. LWLock SyncRep. Synchronous replication, when the primary cannot continue because it waits confirmation from replicas, synchronous replicas. Michael: So, okay, and you're seeing a lot of those. Nikolay: Not a lot, but this started happening more and more if you use synchronous replication, Quorum commit. Michael: What? Yeah. I don't... I come across a lot of people that think they're going to use synchronous replication but then end up don't. Do you see it quite commonly used? Nikolay: Let's say so. Big old clusters are on async. All new clusters should be on synchronous replication, although there is a bunch of issues with it. And there was a great talk a few months ago presented by Alexander Kukushkin about misunderstanding of synchronous replication and various anomalies you can experience in current implementation because it's actually not synchronous replication. This is the thing. Because when commit happens, main thing, when commit happens on the primary, it actually happens. It already happened. Commit happened. But we just are locked, by the way, on heavy lock, right? Heavyweight lock. Our transaction is locked. And we are waiting for 1 of replicas to confirm. Or this is actually a lightweight lock sync replica. Yeah, this is it. Yeah, this is it. Michael: This is it. We are Nikolay: locked and we are waiting. And when a replica confirms, this lightweight lock released. This is a special case when we need to wait for something outside, which will help us unlock. Michael: So I have watched that talk, and I remember a really good slide in it with like all of the hops, like a really good diagram of what actually happens. So yeah, I'll include a link to Nikolay: that. Yeah, and this how to troubleshoot LWLock:SyncRep is like, it's not fully understood. There are interesting new cases which are not covered by in articles and talks. I think more materials are coming. I know about some. Michael: Okay. Can you share them with me? Nikolay: Well, it hasn't happened yet. Check out the upcoming PGConf.EU. Michael: Great. So, yeah. How would you feel about calling the episode there and actually then talking about specifically the lock manager issues in a different episode Nikolay: Sounds good. Good. Nice because there are interesting answers inside. Yeah, great Let's call it a day for today. I think we covered 1 and a half percent because it's huge. The list is huge. Some of them I haven't seen ever. Michael: But I think that's useful. I think it's useful to kind of, for people to get a grasp on like which ones are they most like? Nikolay: Oh, main thing, always I mention when we talk about lightweight lock and actually wait event analysis, RDS documentation has great list of like knowledge and how to style troubleshooting documents for many wait events, including many lightweight locks, not all of them, only subset, but it's great documentation. I hope it will be improved over time, extended, right? Michael: Yeah. Yeah. It's Nikolay: good. I know a lot of effort was invested to building by many people I recently reread the blog post by Jeremy Schneider how it was done during a couple of years. So it was huge effort That's why it's so good Short yeah, but yeah short documents, but so so much wisdom inside Michael: Yeah Done over a long period of time, but also by very good people, like people that really know, you know, stuff. Nikolay: So basically do this, this, this, like list of mitigation action items. But behind each step, many RCAs, right? Yeah. Cases, case studies, it's so much time paid to just write 1 line what to do. Or what to check, or how to change, how to improve. That's a great example of documentation. Michael: Thanks so much, Nikolay. Thank you. Look forward to talking again soon.