Here are links to a few things we mentioned:
- TOAST docs
- TOAST wiki
- Hussein Nasser on rows per page (Twitter)
- Toasting in action (dbi services blog)
- Interview with Peter Zaitsev (Postgres TV)
- Building columnar compression in a row-oriented database (Timescale blog post)
- The Surprising Impact of Medium-Size Texts on PostgreSQL Performance (blog post by Haki Benita)
- PostgreSQL at Scale: Saving Space Basically for Free (blog post by Braintree on column Tetris)
- postgres_dba alignment padding query
What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.
If you would like to share this episode, here's a good link (and thank you!)
Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard
- Jessie Draws for the amazing artwork
Creators & Guests
What is Postgres FM?
A weekly podcast about all things PostgreSQL
Michael: [00:00:00] Hello and welcome to Postgres Event, a weekly show about all things Postgres Cure. I'm Michael, founder of PG Mustard, and this is my co-host Nikolai, founder of Postgres ai. Hey Nikola, what are we talking about today?
Nikolay: Your choice. Uh, and it's toast, yet another boring topic.
Michael: I think it's the tastiest, breakfast snack. So, I thought you might actually wanna talk about it being, you know, food.
Nikolay: Mm-hmm. . Mm-hmm. . Well, okay. I agree that this is something that everyone should learn, at least in some detail. not , but I mean, all developers should understand how positive stores data and why limits are so good. I mean, we can store up to one gigabyte of data in a single. Column in single row.
In single column, I mean a single cell. if you speak a simple language and it's great. So you can sort really large values, usually
Michael: Yeah. This is one of my favorite types of Postgres feature in that it's on by the default. It does something that most people never have to worry about or, [00:01:00] or it's tr exactly transparent and on my default. And it, it helps in so many ways that you might never thought of, thought about, or you might only hit at the limits.
And I, yeah, I, I love this kind of thing because you can go through life never knowing about it, and it's still helping you out.
Nikolay: Unlimited toast. Toast to everyone. Almost unlimit.
Michael: Yeah, so should we start with what it, what we are even talking about, I guess if anybody new to this, we're talking about how Postgres, well, based on a couple of Postgres limitations specifically, you can't store a row across multiple pages and the default page size being eight kilobytes and
Nikolay: Keep your.
Michael: ki Bytes.
Sure. Yeah. And there's one other thing that, is specifically limited, I think, a two kilobyte limit, so that it can store at least four rows or two pools per page. if we only had that, we'd only be able to store smaller, medium values in, in Postgres, right?
But we, we want to be able to [00:02:00] potentially store sometimes larger or, uh, relatively large, um, pieces of data from time to time.
Nikolay: Yeah, let's, uh, warm up with small topic. Um, maybe like, it's not about toted sell, but it's, it's like, it's good intro topic. Uh, uh, recently, very popular YouTuber and the blogger, hue Naer Naser, uh, sorry, I always pronounce. Other names wrong, but they pay me for the same because my last name is very difficult to pronounce.
So. It was great. couple of posts and on Twitter, LinkedIn, how many rows can we ensure? How many taps actually let let's speak, correct language here because we are on post podcast, right? Not Rose, but taps or two posts as you prefer. how many of them can fit into one single, page, which is, as you said, by default, eight gigabytes.
And it, it's a great question to understand the, internal layout. Uh, there is a, page header, and [00:03:00] it's some, some static, Uh, amount of bites. Number of bites is 24 bytes in each page is allocated, for page Heather. Then, from both directions, we have, uh, we fill, page with some data from the beginning.
We fill it with, pointers to doubles, pointers to interest. Right. And each one is only four. . So each top in page, the pointer takes four bites and they start from the beginning, but real data comes from the end of the page, right? And, uh, of course the size there depends on the data itself. But each topple there has always has, uh, a header.
double header and it's 23 bites, but due to alignment, paging, it's padded to 24 bites. So alignment, paging, it's for performance reasons. data should be fetched in, in like so-called wards of eight bytes. That's why 24 not 23. And, uh, it [00:04:00] means that we can feed a few hundred, uh, of, tables, right?
But interesting point that you can create a table without any column. right? That's interesting. So you can create table parenthesis, that's it. Create table t1, for example. Parenthesis, that's it, semi. poss allows it, of course, it's not like standard thing, but POSS allows it. Why not? We will always have internal, like virtual system column cti.
Id xme X max as we discussed a few times. So you can select CTI i d. And it'll provide page number and offset there. Offset. This is exactly this pointer. And so we can start inserting, inserting to T1 Select again. Select and post Select cloud allows empty, uh, list of columns as well. It's some kind of trick and this is how you can reach, a little bit.
Tops inserted to page 2, 291 tops. and, uh, the other hat is huge [00:05:00] actually in this case, but I would like to say it's not a hundred percent because, uh, uh, number of columns, uh, number of rows or topples we have, uh, it's also some knowledge, right?
But almost a hundred percent wasted for overhead for various headers and point.
Michael: Yeah, so we are looking, I guess with, that's the theoretical maximum number of
Nikolay: Practical as well. I, I, I did it. Uh, but nobody, nobody wants them to
Michael: It's, but it's not useful, right? Yeah.
Nikolay: in some testing. Maybe in some testing sometimes it's useful, right?
Michael: does mean that in, in practice, we'll only ever have a maximum of a few hundred rows. page, even if you're only storing, very, very small amounts of data. But I guess we're talking about the opposite in this case, we're talking about the other end of the limit.
Nikolay: Yes, this is extreme. I just wanted to, to give wider understanding of what, like, of what we talk about. This is extreme situation. When we have zero, like no [00:06:00] data at all in the table and we can have 291 topples, then we start having some column, for example, an interesting thing, it doesn't matter, like smallest column probably will be one bite.
Uh, if you take for example, Berlin, Berlin is one. Already a lot of waste, right? Because it's not beat, it's, it's bad, but we like, this is how the world we live in, like we, we need, this is a lot of overhead and relational approach. Of course, if you need to store some beats, probably you need to create some additional storage or pack them to bites and, and store some somehow differently. But take for example, one bite, data type. Again, it'll be padded till eight bites and you'll, be able to store only 200, uh, 26 rows because of this overhead. Immediately. The same for regular integer, which is four.
It'll be ED for with four zeros in each value or inter [00:07:00] eight or begin, data type. They all will, will take the same, storage and you will be able to fit only 226, tops into a single page. quite straightforward. and, and, uh, if it's times timestamp of timestamp with timezone, which it both take 16 bites, okay, you'll have less, you'll be able to feed, fewer, doubles into a page Because, uh, they might be just one row, but many versions of it, right? So now, let, let's go to toast Now, finally, right? We add more and more, or for example, we decide to have a column of from va, family of, of data types. So VA means a variable length. For example, text json xm.
Michael: episode. Yep.
Nikolay: Yeah. Or numeric for example as well, because numeric is exact number and it can take a lot of data. So it's also Lina,
Michael: Interesting. But yeah, [00:08:00] typically this would be a, a text field or a Jason field. That's pretty common, isn't it?
Nikolay: right? Right. And so they all behave similar similarly, and they all involve, uh, this mechanism test, uh, works. implicitly, transparently, we, we usually don't think about it until we deal with some performance issues and we need to understand what's happening under the hood. so what, what is tossed?
Uh, it's, yeah, let's, let's discuss it.
Michael: I was really interested in where the name came from. and I looked into the mailing lists, and I think the original, reason for calling it Toast, uh, so it's an acronym, but the, the original reason was it, it's about slicing. So if you, if we go back to the, the page size and we, let's say we're just trying to insert an 80 kilobyte blob of text.
To fit that. We, we can't fit that in an eight kilobyte. Uh, I probably picked a bad format there for, let's say a megabyte of text [00:09:00] into an eight kilobyte page. Naturally, we're gonna struggle and we've got the limitation that we can't store it, over multiple pages. So, this is a, a strategy for slicing that data.
Into multiple, topples off, like in a separate table and storing pointers to each one so that we can reconstruct it, or that, so that, so Postgres can reconstruct it later when we request it. it also, by the way, includes, compression. so that will, I'm sure will come to that. But, by default, we'll first try and compress it, and if not, we'll store it off page and yeah, so the name comes initially from slicing and I think there's some funny, like, people kind of working backwards into acronyms, which I think is how a lot of things end up getting named. but I think the, the generally accepted one is the oversized attribute storage technique and it had a, a code name of toast or the best thing since sliced bread, which I found funny and I'll, I'll include the mailing list thread there for anybody interested in that.[00:10:00]
Nikolay: Yeah. imagine toast with gin, how they work together,
Nikolay: So, about technical aspects of it. roughly, very roughly when our. total length of all columns involved and one topple value approaches roughly two, 2000 bites. , uh, this mechanisms start, starts working. There are many details inside it.
We, we like podcasts. Probably not the best place to discuss everything because of like, there are many, if else, and so on. details, uh, how it works, but roughly, , PO tries to feed. at least four values in one page. That's why actually, I wanted to bring this topic of, uh, the maximum number, but at least let's, let's have at least four values in, in one page.
It means we need to shrink value to chunks. roughly 2000 bytes and some data will be in the main page. , other chunks will go to a [00:11:00] separate table, which automatically created called Toast Table. For each regular table we can have up to only one to table, right? But it can have many, many chunks for each value.
And this is interesting. what should we discuss about it? Let's, let's talk.
Michael: so. There's a couple of things. One, kinda like the performance implications of that, both good and bad. also, I guess, when do we need to worry about the, we, we've got certain values that can be, certain things that can be changed. Should we ever change those? There's default, so there's default compression, for example.
There's default, whether this is even on or off. there's a few things to be wary of, like the usual kind of select star, or even whether we. Store, you know, we can, we can use Postgres version of this or we could, avoid storing this data in the database at all. That's a not, that's a common technique.
We could store it ourselves deliberately in a separate table to avoid some of those issues. Like I guess we could discuss some of those pros and cons. [00:12:00] Which direction would you like to go?
Nikolay: Yeah. Well, first of all, It's, as we already discussed on previous episode about json. nothing can beat, narrow cases when we have really small, small number of columns. All columns are small, like even if VA data types involved, like text, the actual values are small. Everything fits like roughly.
Total value of double is less than two 2000 bytes. By the way, the easiest way to check, uh, the size, it's really interesting. You can select T1 from q1, taking row, record. Um, , you can convert it to text and you can get length of it. But also there is a, uh, there are special functions ppg uh, column size.
I don't remember by name, on top of my head. But it's possible also to, to check. Uh, it's, uh, more like wisely. But quite often, I use this trick to converting. Whole record to check and just getting, checking the length of [00:13:00] it. So, if we, everything below 2000, nothing can be that this, because toast is not involved.
So there is no overhead because of course, uh, if you need to read Whole Top, you need to join with the pocus needs to join with those table and, uh, multiple, uh, records. Get, extract multiple records from Toast Table and then. Collect everything and, and, and it, it's, it takes time, of course. And, and, uh, there is overhead here without host.
It's much faster, obviously. That's why we discussed, uh, in Sison episode if, uh, Sison values are below 2000, performance is much better. Once we approached roughly 2000, we have some liquidation because to starts to be involved. but. Interesting thing that it's quite, there are many things, uh, we are done to make toast quite efficient in terms of performance.
For example, we know VCC POCUS model when you [00:14:00] change, when you update some row and new tackle is created and the old topple is. Also until auto Wacom deletes it, uh, seeing that it's not needed anymore, not all transactions look at it, can look at it. So it's for toast, it's slightly different.
right. So the in toast there is, uh, like special version versioning applied and it's quite re reduced. So only in shorts and. and updates trigger inserts of toast values only if toast value actually changed if you change some different column. the same toast, uh, records. Toast is regular table, actually right from third glances, glances, but, update happens in the main page.
It can be hot update actually inside this, this, uh, page. Very efficient, not involving index, not causing index simplification problem, but [00:15:00] toast. We just reuse the same records for new, actual top in in main page, right in main table. And this is quite interesting and this adds efficiency as well, right?
Michael: Yeah, very much so. And I read another cool thing, which is. The, the large values in the, from the Toast table, on are only pulled out if they're selected at all. and they're only pulled out at the time the results set is sent to the client. Now that's super interesting for things like in memory processes.
So if we're, like, for example, if we're sorting rows, and we're carrying. Like large rows around, that's gonna take up a bit more space in memory. And if, if we only have to pull out some of that larger information at the end, chances are more of that's gonna be done, you know, quicker, uh, and with less footprint.
So it's, it's super interesting, some of the benefits this comes with that. Some of the other things we mentioned do, don't, so yeah,
it, yeah, it's
Nikolay: this means, putting all [00:16:00] columns into select clouds, which sometimes ORM can do it can cause issue with performance. Uh, you can be much more efficient if you select only those columns you, you really need at this moment. Right.
Michael: Yeah. Uh, so I think. it's slightly different. Yeah. So this is like a select, this is a typical select star from problem, isn't it? Now that has other issues. Like if what, what happens if you add a column, select star is, um, kind of can change on you, but, uh,
Nikolay: convenient, right? You, you have everyth.
Michael: For anybody who's listening on audio, Nikola has got a smirk on his face and he's definitely.
Nikolay: Right. So, so, uh, what about compression? there is automatic compression applied, which is lightweight, not, super, high ratio. They, uh, producing not super high ratio, but, doesn't, consume too many CPU cycles.
Michael: Yeah. I. For anybody trying to test this out. In fact, I read a good blog post of [00:17:00] somebody trying to test out toast and it looks like the first example they picked actually didn't trigger a toast, table as they were expect or didn't trigger any chunks to be inserted into a toast table because the value they were inserting, whilst it would be two kilobytes, of it compressed really well because they were just doing a repeat character, for example.
Nikolay: you're saying it.
was compressed before, going to toast tables and it, it, it could fit, uh, main table. Main table was enough. Mm-hmm.
Michael: In fact, that's a better way of describing it. It at first, if the, if the whole tupo, as you mentioned, looks like it's gonna exceed two kilobytes, the first thing that Postgres tries to do is compress a co, like one, or it compresses as, as many of these hostable values as it can to store them on the page still and o Only if it can't, will it.
move them off, the page into a, into the toast table. So that's a really important thing that's on by default. That can be like every combination of [00:18:00] this is possible. but yeah, by default it will try and compress and it will by default use a, this is something I didn't realize until a year or two ago.
It will use a Postgre, A P L z, compression algorithm. and since Postgres 14, you can use, uh, LZ four. Is
Nikolay: You can control. I don't remember, but there is, there are options in, in new pos there are options to control, but I would like to emphasize that you also can control and tell poss not to try to compress, because if you, for example, have some value, which you know, already compressed it, it would, it would be not good in terms of efficiency to try to compress it once again.
So you can, uh, you can tell POCUS to have like something like storage external. In this case, would go to Tost, uh, immediately without, compression attempts at all.
Michael: Yeah, exactly. really good point. you do pay an overhead, like I think, I think one of the benefits of LZ four, I think it's, um, well, with compression there's normally like a trade off, right? [00:19:00] It's how well it compresses like versus speed of compression versus speed of re and I, the benchmarks I've seen.
LZ four does come out on top. but I guess it'll depend on the exact scenario, but it, as you say, if it's already compressed, it doesn't matter what algorithm you're going to use, there's gonna be overhead for no real benefit.
Nikolay: Right. I'm not a big compression expert, but of course there are many like. Recent, advances in, in this area, not only pipe piper and so on. Uh, and uh, it was like, for example, broadly we use broadly quite often for wall G compression of walls when we do backups. But I'd like to say. it's good that we have, uh, more control, so it's very, like, it's very good.
but recently we had a brief, uh, small, uh, like word battle with, CTO of Victorian metrics. And, we discussed number of rows versus terabytes, of data. For example, one terabyte for me [00:20:00] is always something like 1 billion of rows. And for them it's roughly one.
billion of France is one gigabyte of data.
So it's like three orders of magnitude difference. Why? Because they, talk about, time series often, for example, victory metrics, right click house, column store, and they live in very different world. Unfortunately, with pocus, we still don't have quite good battle.
column store. there are attempts. There are, there is one going work. Several, teams working on it. Uh, we know. But when we talk about compression, it really shines when you, we have column store,
Michael: Yeah, because the data's more similar, right?
Nikolay: Right, right, right. Because you, when we, you try to compress, values in raw, not, not, uh, involving the context of other values in the same column, uh, you are in worst position.
So the compression topic, actually, Peter ze of, interview happened. This, A week on [00:21:00] Monday on Positive tv. A very good, chat. It was, this is his idea that, uh, you, you should consider compression and expect good results. Only if you combine this topic with, uh, storage engine topic and, uh, For Com store, it's much better.
Or like Times Carere, they're very smart. They still work in, in the Row store, but they, apply compression. looking at column, it's a really interesting article. We already discussed it a few times, so Warf, referencing one small time, but, again, in some cases, uh, some compression in place.
It's good and it's good that to have more controls in pocus.
but I hope in future 14, sorry. I hope in future POS versions, we will have, in core, product. We will have, column store storage, storage engine providing the same level of compression, like maybe not thousand x, but a hundred. [00:22:00] And, uh, in this case, we will also say, okay, 1 billion rows.
It's not one terabyte anymore. It's like 10 gigabytes maybe, right?
Michael: Yeah. Should we talk about some of the limits of, toast? Like what's the maximum size
Nikolay: One gigabyte
for one value. , I don't remember the limits quite good, but sometimes we hit them. And for me, the biggest limitation is that toast is not, uh, applied. Like for indexes, it's not applied. There is compression there. I don't remember details, but I remember like, you cannot index really huge value of like a hundred megabits.
You cannot index it. You need first maybe hash it and, and index only some expression. So this is one of the biggest limit. because in toasting, it supplied only to heap, right? this full fledged toasting. So this is one limitation, one gigabyte, four values in another limitation. What else?
Michael: I've read a couple more in the docs, but I don't think you're likely to hit them. [00:23:00] So for example, you can't have more than 4 billion. Toasted values in a single table, which would be a lot given they all have to be over a certain size. You know, we're talking about, terabytes of data.
Nikolay: Well, I, I can easily imagine that we can hit it, but before that, we probably will face, other performance issues. So everyone, with table more than hundred gigabytes should already use partition
Michael: this is another
benefit of partitioning. And then one more kind of slightly, uh, hypothetical one, please don't do this, but you could, I think in theory you could only, have about 450 wide columns, in a table
Nikolay: How many?
Michael: 450, columns that are toast all or some, uh, it's a lot,
Nikolay: It's a lot, but still, I already saw cases when it was achieved.
Michael: So, yeah. pretty big numbers, but, things to be aware of if you are working in an extreme case for some reason.
Nikolay: All right. Right. And, uh, if you, if you care about, uh, [00:24:00] super, like if, if storage efficiency, my advice is to learn about column Tetris in pocus. Like this is quite common term already. and, uh, I usually there are. Two approaches, uh, either start from big to and go to small or vice versa. I prefer starting from small columns first.
for example, all one bite, two bite columns, go first, then we. We have four byte columns, eight byte columns, 16, and then all our lean, columns go last. In this case, uh, you probably will have, fewer zeros due to alignment, paging, and, spend fewer bytes, maybe sometimes gigabytes actually, or hundreds of gigabytes if you have big data volumes for.
And it's not only for storage, it's also for, the, amount of ram, both for shared buffers and page cash and also for, backups and so on and so on. Sometimes the benefits of applying, , better column teris tactics, uh, [00:25:00] are good. And toast prostate, like valino toast should go last, and in this case, It's good if you mix them.
This is the worst case, regular integer, some text, regular integer, some text. In this case, uh, you will have a lot of zeros due to alignment paging.
Michael: I read a Good Post by Braintree, who I think built a, they built something to help kind of. . It's almost like repacking a table. Like for example, you can, you can sort out a column ordering that makes sense initially, but then what happens if you add a couple of columns? does the order still make sense for large enough tables?
they found it could be worth, changing it.
Nikolay: Oh, I, I also, I also build it, I, I have a, a small report, actually, it's not small, it's reive CTE with recursive, analyzing this. For any table and proposing better, uh, order. It's in my POCUS DBA quite old tool, like a hoc tool to analyze some POCUS cluster aspects. So yeah, and [00:26:00] some people use it, I think.
Michael: There's one more blog post I wanted to mention before, it's kind of on this topic, but kind of not, is by hacky Benita. I've mentioned it in a couple of episodes, but this is almost a side effect of the fact we have toast, is he noticed that there was some kind of medium sized text, so he defined small text as being, you know, like integers, timestamps.
Small numbers of bites per collar, and then large texts being kind of like the blobs we're talking about here, you know, whole HDML pages or, larger things that you know automatically gonna be quite big and off. And therefore off the page he mentioned there was. he calls it surprising impact of medium size texts.
So kind of still text fields or, you know, it might be a comment on a blog or something could get quite long, could, could get into the multiple kilobytes even, or just about still fits on the page. so it's below that threshold. We were talking about two kilobytes compressed. Now these, if you, if you start to only have, for example, that limit of four [00:27:00] tules per page and you, you, you know, they're getting towards two kilobytes each suddenly if you're doing reads of multiple page, if you're doing kind of scans, , you could be having to read quite a lot of data for the amount you're returning if you're not returning those columns.
If you're not returning, for example, the comments, you don't need them for the query you're doing, you're still having to go through quite a lot of pages. So efficiency-wise, those sometimes you might want deliberately store yourself in a different table so that you can fit much more data on the pages that you are scanning more frequently.
So it's a really interesting web.
Nikolay: sometimes one, one, one-to-one relationship, uh, it like, might seem strange, but it makes sense. So one table is for very compact storage and another for our like, bigger values.
Michael: Yeah, exactly.
Nikolay: But in this case, you always should remember that there are no indexes on top of two tables. So you lose some, index only scan, for example, or index scan opportu.
Michael: Yeah. Was there anything else you wanted to cover on this one?
Nikolay: No, that's it, I think. Right?
Michael: I think so. Thank you everyone for joining us. Thank you, Nikola. I see you next week.
Nikolay: Thank you. Bye-Bye.