Postgres FM

Nikolay and Michael discuss "Column Tetris" — what it is, why it matters, how to order columns for new tables, and how to re-organise existing ones.
 
Here are some links to things they mentioned:

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With special thanks to:
  • Jessie Draws for the elephant artwork 

Creators & Guests

Host
Michael Christofides
Founder of pgMustard
Host
Nikolay Samokhvalov
Founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Nikolay: Hello hello, this is Postgres.FM,
episode number again,

I don't remember 126, great,
more than 2 years and my

name is Nikolay Postgres.AI as usual
My co-host is Michael pgMustard.

Hi Michael

Michael: Hello, Nikolay.

Nikolay: How are you doing today?

Michael: Yeah, good.

How are you?

Nikolay: Great as well So let's play some Tetris column Tetris.

I mean talk about this game

Michael: It's not a very fun game.

Is it I like Tetris, but Column Tetris not so much.

Nikolay: Yeah well I think you can know 0 about it then you usually

get excited about it and then you kind of think okay it exists

but at some point you stop playing too much.

Let's unwrap it.

What is it?

So columns can have different sizes, Data types have different

sizes.

Michael: Yes.

Nikolay: And most folks say, for example, integer, big int, small

int, but I prefer saying int4, int8, int2, because you see exactly

how many bytes it occupies.

Timestamps, regardless of with time zone, without time zone,

they all occupy 8 bytes.

UUID, I think, 16, right?

It's huge.

Or 8 as well.

I think 16.

I always forget.

Michael: I actually can't remember.

Nikolay: Yeah.

Boolean, which should take only 1 bit, takes whole 8 bits, so

1 byte, right?

It's a very wasteful way to store your flags, like true-false

values.

What else?

Michael: There are the big ones, aren't there?

I guess you've got varying length, all the numeric ones.

In fact the documentation is great listing what they all are

but the other varying length ones I think of the other big category

Nikolay: yeah Marlena like text and watch our and dress on a

race just some be of course all these guys they are variable

length and it's different story.

They are not fixed size data types right but these data types

which have fixed size it's interesting to see that sometimes

just reordering columns you can save disk space and get better

performance

Michael: yeah I would say most people including myself when they

first learn this find it very unexpected

Nikolay: Yeah, I agree Same it
was long ago, but yeah, same

Michael: And mine was probably
more recent.

Yeah But yeah, so I don't know
if you want to get into any of

the details as to why I don't fully
understand the reasons why

but my understanding was reading

Nikolay: let's just say it's something
about efficiency of how

CPU works and if it's 8 bytes worth,
it's better for performance

to read and write this number of
bytes, 8 bytes in modern systems.

But it was a surprise to me to
realize, you know, we talked before,

the recording, the alignment happens
not only to 8 bytes, but

also sometimes to 4 bytes and to
2 bytes.

So, yeah, but let's have an example.

For example, if you have, very
typical example, you have a column,

4 byte column, which is primary
key, say ID, integer.

Michael: Which we don't recommend.

Nikolay: Yeah, we don't recommend,
and I always, like, you know,

it's interesting, like you see
something, you learn something

usually by doing some mistakes
and some going through some pain

in production and so on and then
and then you start recommending

something but still people use
integer for primary keys for example

Supabase and OrioleDB they
just had great release it was

on on top of Hacker News but by
the way congratulations it looks

great benchmarks look great beta
7 right yeah and I admire Alexander

Korotkov a lot like he helped me
personally many times.

And for example, recently we had
a discussion about track planning

and we didn't mention that there
is a good thread and discussion

of sampling to mitigate overhead
from pg_stat_statements.

I discussed it with Korotkov some
time ago.

I dropped a line and he immediately
commented on that thread.

So it's cool.

It's cool.

I mean, great.

But he used integer for primary
key in examples of benchmarks.

But you know, when table is named
like, I don't like users, can

we have 2 billion, 2.1 billion
insert attempts to such table?

Maybe yes.

I don't know.

It depends on the size of the project.

But anyway, we don't recommend
using integer for primary keys

as you mentioned I I stay at this
point again, but I'm less aggressive

judging people when they use integer
4 primary keys.

Michael: So a lot of the argument
for int4 primary key versus

int8 would be saving space.

But due to this alignment padding
issue, if you're going to follow

it with an 8 by 8 type.

Nikolay: Create a timestamp, for
example.

Michael: Yes, exactly.

Super common.

If that's the order when you create
your table, you're not saving

space even, other than, well, and
we'll get to this, other than

maybe in some indexes, you're not saving space by using the smaller

data type.

So I think it's really fascinating.

And as I said, for me, it was unexpected.

Not because I knew better, but because I think in like, like

when you create things in any other, in most other formats, like

you see their length grow as you add things together and you

don't

Nikolay: you

Michael: don't consider that the lower level structure is going

to be like grouped into little pay I guess I guess the 1 the

1 exception is when you're creating maybe maybe some PDFs or

something like let's say you wanted to print out the documentation

into PDFs, you probably would want to start new chapters on new

pages, for example.

There's some formatting when you do books and things.

You don't want the chapter to be at the end of a page.

So it's that kind of thing, right?

It's kind of shifting data to start at a new point to make things

easier for the thing.

Nikolay: Yeah, when you use Google Docs, for example, you insert

a page break.

Yeah.

Yeah.

Similar.

A good analogy, I agree.

Back to example again.

If it's int4 primary key and then create it at timestamp,

and then for example, I don't know, like org ID or something,

group ID, also int4, or int8, doesn't matter.

At storage level, what happens between Id and created, between

the first and second columns, Postgres will just fill it up with

4 zeros.

And we can see it using pg_hexedit on Linux if you have Ubuntu

or something.

You can have this graphical interface which works on top of Page

Inspect extension.

You can see zeros with your own eyes, right?

It's from Peter Geoghegan.

Michael: Is that the Peter Geoghegan?

Yeah.

Yeah.

Yeah.

I've seen him using it.

Nikolay: Yeah, I used it a few times, but I remember it required

Ubuntu.

I'm not sure maybe it's possible to run on macOS right now,

so I use it in virtual machines like it would be overhead to

run.

Michael: But these zeros are that padding yes?

Nikolay: Yes since the second column
is 8 bytes for efficiency

Postgres adds 4 zeros so the first
column also basically takes

8 bytes instead of 4.

So 2 first columns, they both take
16 bytes instead of 12.

And it means if you used int8 or
bigint, primary key, it would

be the same in terms of storage
here, as you said.

And we do recommend using integer
4 because who knows, maybe

you will achieve 2.1 billion for
your sequence which is generating

the numbers and once you achieve
that, sequence itself is 8 bytes

always, it's okay.

But if the column where insert
is 4 bytes, with zeros always,

you will end up having problem
because you cannot insert into

it anymore.

Because larger values, larger than
2.1 billion won't be possible

to insert, because the capacity
of 4 bytes is 4.2 billion and

we have signed int.

It's a bad situation because a
big surgery will be needed on

this table.

It's really not an easy topic.

And we touched it a few times in
the past.

But what I didn't realize until
very recently is that padding

can happen also to 4 bytes and
to 2 bytes.

So If you have, for example, int2,
int2, int4, 3 columns, they

take 8 bytes.

But if you have int2, int4, int2,
moving 1 of the int2s to the

end, the first int2 is going to
be padded to 4 bytes, then 4

bytes, then 2 bytes.

So overall it will be 10 bytes.

This was a surprise to me because
I thought they will take only

8 bytes.

It's not so.

Padding happens at different levels.

8 bytes, 4 bytes, and also 2 bytes,
as you said.

Same thing if you use Boolean,
and you can see similarly that

it's going to be padded either
to 2 or 4 or 8 bytes depending

on the subsequent column's

Michael: data type.

Exactly, always the next 1.

Nikolay: And also, to make the
picture complete, if we have,

for example, just...

The First example we used.

Int4, primary key, createdAt, timestamp
taking 8 bytes, and then,

for example, I don't know, like
orgID or something, also 4 bytes.

So padding for first column is
going to be up to 8 bytes, then

we have 8 bytes.

And in the end, it also will be
padding, but not for the tuple.

It will be padding for before next
tuple.

So overall, we will end up having
3 8-byte wards, so it's 24

bytes, right?

And just moving the third column
to the second position, we'll

go down from 24 bytes to 16 bytes.

Also there are 24 bytes for tuple
header, which is actually 23,

but 1 byte is not used, as I remember.

But it's also padded always to
24 bytes.

So overall, we will have 48 bytes
per tuple instead of 40.

And this is amount of saving per
tuple we can achieve if we just

move, we just back 2 int4 data
types together into 1 8 byte position.

And this will contradict with our
recommendation to use int8

primary keys, but again the reason
for int8 primary keys is to

avoid the risks of surgery for
the table.

If you do know you will never achieve
for your sequence value,

you will never achieve 2.1 billion
value.

Never ever.

It's okay.

I could even use 2 byte integer,
small int.

If I know I want to achieve 32,
000, right?

I don't know.

Something like this.

Michael: I'm not good at factors.

Nikolay: It requires careful planning,
so you need to think about

the future and all possibilities.

My rule of thumb is just using
for int8 and that's it but back

to this additional example this
is kind of significant saving

right from from 48 to 40 bytes
It's kind of 20 percent, right?

Michael: Yeah as a percentage,
but then I think I think where

this becomes meaningful is when
people have huge tables so we're

talking about only a small number
like bytes and this day and

age even RAM's not that expensive
like it's expensive at scale

but we can get quite powerful machines
for not very much money

these days and loads of storage
for not very much money so I

think this program

Nikolay: because the storage is
not only about storage it's spamming

RAM not with zeros

Michael: I am a huge fan of doing
things efficiently and I but

I think at scale it's where it
starts the investment in this

kind of thing starts to pay off
and I remember a blog that 1

of my favorite blog posts on this
topic is by the team at Braintree

and they mentioned by going back,
like once they discovered this

phenomenon, by going back through
their largest tables they calculated

it was about a 10% saving on disk
space from where they were

before to to a more optimal setup
so I believe that I don't see

any reason why that that wouldn't
be true for a lot of organizations

that hadn't done this deliberately
in the past or people hadn't

been conscious about this but it's
a 10% a lot in the by some

benchmark.

Nikolay: Noticeable.

It's noticeable.

Michael: Yeah, but it's also not
like sometimes when we talk

about performance, we often talk
about orders of magnitude and

when I don't think we're talking
about that in most cases.

Nikolay: Still, you know, like
again, like there are stages of

impression here and minor counterpression.

Well, it's important problem, but
I don't know, like it's It's

worth thinking about it when you
first design a table, which

will obviously be large in the
future.

I know, for example, GitLab, they
have public documentation and

there is a Column Tetris page there.

I know it's a rule to check it.

If it's a new table, to always
check it.

Maybe there is also some automation,
I'm not sure.

But we have bloat as well in different
dimensions.

This dimension is kind of like
programmed bloat in columns level,

but we have in rows level we have
bloat there.

And 10% is probably… for example,
if I see extra 10% of load

in both indexes and tables, it's
not triggering action from me.

Exactly.

Right?

But if you design a table from
scratch, of course, it's worth

thinking about it.

Yes.

Michael: Mostly because it's so
cheap to think about.

And I guess we'll talk about this
in a moment.

It's not a difficult equation to
work out how you should order

things.

Nikolay: And it can be automated.

We discussed also before recording
and I have report, I have

analysis script in my postgres_dba
old toolkit, which will suggest

to you.

It's easy.

Just start from the largest or
from smallest, doesn't matter.

Maybe it matters a little bit.

Actually start from 1 byte columns,
then put 2 bytes columns,

and so on.

Or vice versa, start from the biggest
size bytes and then smallest.

It will provide you, it will reorganize
columns so you have less

cases when you see zeros in physical
storage.

But don't you think it makes Column
Order weird?

For example, ID can be not at the
first position and it's so

strange to deal with such tables.

Michael: Yeah, I haven't had to
do this optimization myself.

Nikolay: I did it a lot.

Michael: I think I would start
large.

I think I would want to start at
least, maybe not largest to

smallest, but at least the eights
first, or at least a group

of eights first, because I'd want
my ID in the first place.

If you're doing SELECT * [from a] table,
you're just doing some testing.

Just having that ID in those first
few useful columns first is

beneficial for me as a user.

But also, I was reading the Braintree
post, and they've done

some optimizations around, they've
suggested putting the ID first

because it's going to be read the
most so in terms of unpacking

the row Apparently there's some
optimization around that as well.

Okay.

So that, like, so they also, they
also recommended some other

super minor optimizations because
they've built tooling around

this.

Like for example, within your,
let's say you've got a bunch of

8 byte columns, within those you're
probably better off putting

your not null columns first because
they're more likely to have

data in them.

And then putting your default columns
a bit after that because

they will almost always have data
in them.

So like there's some of these,
like probably right at the edge

in terms of optimization, but now
we're not talking about storage

They don't we're talking about
Like we're talking about read

performance So it's and and actually
I owes a good point because

if your data is taken up less space
some of your queries are

going to also require less IO and
that's faster cheaper all the

all good things right well it's

Nikolay: this is I didn't see that
did they show some benchmarks

to prove that there's a reason
I mean, foundation for these decisions

to reorder columns.

It's interesting, I need to check
that.

I didn't know, I didn't see it.

But yeah, I think probably you're
right and I need to adjust

my recommendation logic and put
8 bytes before 16 byte columns

because ID should be there, and
the created add should be there,

and so on.

Although if it's UUID type, Again,
it's going to be 16 bytes.

I don't know.

Anyway, this is the topic.

I think it's super easy not to
know about it, obviously, but

it's also super easy to overestimate
the benefit of paying too

much attention to it.

But interestingly, what happened
recently with Postgres versions

17.1, 16.4, 5, a couple of weeks
ago they were released and caused,

for example, TimescaleDB to not
work with them.

What happened in some structures,
there was additional Boolean

type which was added, and physically,
like physical layout changed

in minor version of structures
and causing some extensions like

TimescaleDB to stop working.

And yeah, it was not a good situation
and I think lack of testing,

but more interesting like next
week the deploy new releases so

17.2, 16.6 and others, they fixed
this problem.

And actually, originally, Boolean
value was added to fix some

problem.

It was a fix, because in minor
versions of Postgres, unlike some

other systems, Postgres doesn't
release new features.

It's only about fixes.

But this fix caused incompatibility
in terms of physical layout.

It's bad.

And how next releases fix it?

This Boolean was moved to different
location in terms of order,

where some padding happened, where
some zeros were present, right?

Feeling 1 of those zeros.

So Column Tetris, as I've tweeted,
Column Tetris well played,

right?

So, I mean, Postgres hackers themselves
played Column Tetris

a few weeks ago, fixing some problem,
fixing a bug which was

another fix of another bug.

That was interesting to observe.

Michael: I think it's worth discussing,
We've discussed a bit

why you would do this, like save
on disk space, less IO, better

caching, like better use of RAM.

But I think when is also useful
to discuss.

Obviously, we can do this when
we're creating a new table.

We can think about column order.

That's normally when we're adding
the most number of columns.

But I think also sometimes when
you're adding new feature or

you're doing refactoring, you might
be adding a few columns at

the same time.

And I think it's worth pointing
out that that's another time

where you can make use of this.

But equally, I think we're moving
to a world where there's more

and more emphasis on incremental
improvements.

And I know for my own projects,
I tend to add 1 column at a time.

Like, I tend to be doing a migration.

Nikolay: Oh, yeah, like 5 releases
per day.

Michael: Yeah.

I'm not doing 5 per day, but, you
know, like, they keep things

small, and they tend to not be
multicolored.

Nikolay: Well, in software engineering,
this makes sense completely,

because if you mix many things
in 1 big change, making them depending

on each other, it's bad, right?

So we try to split it to small
steps and release small steps

sooner, because if we have a problem
with 1 of the steps, it

doesn't block others from deployment,
right?

This means that we probably indeed
add columns 1 by 1 or sometimes

in pairs.

Michael: Not normally, or maybe
other people have different experience.

But yeah, so even in pairs, I guess
the order does.

Nikolay: But in this context, in
previous context, what do you

think?

First of all, when Postgres creates,
When we run create a table

with many, many columns, what?

Why Postgres doesn't do it?

Michael: You're going to ask why
doesn't it do it by default?

Itself, yeah.

Then it's unexpected in a different
way.

Like we were saying, I want my
ID to come first.

Nikolay: No, No, no, no, no.

Imagine we had a logical order
and physical order.

Instead of at num, column in
pg_attribute, we would have

Michael: to num.

Nikolay: And we know what to present
to user, like logical.

And we know physical because at
creation time Postgres would

Have some algorithm to reorganize
it right and I think it's possible

I don't I'm not sure if it was
discussed like it should should

be discussed a few times But obviously
it's not happened.

It didn't happen yet but father
let's let's think like let's

dream a little bit additionally
and Based on this case with new

releases, minor releases, which
played Column Tetris, when you

have an existing table, imagine
we have an existing table and

we have some padding happening
there and we add a boolean column

and table is huge we could add
it and Postgres could add it probably

to some different position not
increasing size at all

Michael: right well I don't know
what needs to then happen to

what happens then to the data files
yeah

Nikolay: well existing was is different
if we if we need fully

right we do it right but since
post this 11 if you if you define

default for a column, it will be
virtual, so without physical

rewrite.

It's just, it will say, okay, until
this moment, I think in xmin

or something, until this transaction
ID, all old rows, They virtually

have this value even we don't write
it physically But it was

great optimization and actually
what I discuss here.

It's also in the same area of possible
optimization of Postgres

could have Right, So only future
rows will have it.

They had zeros already, padding.

Michael: Would the padding look
different?

Nikolay: Well, there is a specific
additional place in Postgres,

I don't remember, I forgot, but
in Postgres 11 it was created

for defaults, right?

Saying all rows kind of have this
value for this column which

is new.

All future values will be read,
All future rows will have it

in normal way.

Same here.

We have padding zeros, right?

In future, like future this column,
say it like, fired, right?

Bullying.

True or false.

Or null, 3 value logic.

We can have it not in, like, in
some position where we had zeros

for old rows.

All new rows will have not 0.

That's it.

It's good, right?

Michael: Yeah, I mean, it would
be beautiful, it would be lovely.

You know, in real Tetris, if it's
getting quite fast and you're

starting to get some of these gaps,
every now and again you get

a shape that looks exactly like
a gap you've got right down near

the bottom and you want to just
pick it up and pop it in.

It would feel like that, wouldn't
it?

Nikolay: Similar, yeah.

Another good analogy, right?

So I think there is some sense
in this optimization.

I don't know, maybe I should discuss
it with Andrey and Kirk.

We're probably slowly returning
to our normal hiking sessions

on Postgres TV.

We have actually already quite
a long list of ideas.

Michael: Yeah.

There's 2 other things I wanted
to discuss with you on this.

1 was doing it with existing data.

So we find out that this is a phenomenon
or we've inherited a

system that wasn't set up in optimal
or it's just evolved over

time we've been adding columns
as we've been developing new features

as a system evolved And we decide
it is worth it for us to, or

like, we want to look into how
would we reorganize the table

into a different column ordering.

How would you go about doing that?

Or how do you go about doing that?

Nikolay: It's the same complexity
for surgery as for int4 to

int8 primary key conversion.

I wish pg_repack would do it, but
we had a discussion and I think

I had some very quick and dirty
prototype to organize cold motor

when you are repacking table using
pg_repack.

It will be a good moment because
it rewrites everything, but

there are some doubts and fears
and I think we just had lack

of attention from very experienced
hackers there, so it was not...

I think somebody in that pull request,
let's check it, and have

it shown on somebody, try that
again.

Because the idea is simple.

pg_repack is rewriting the whole
table.

If we need to physically reorganize
the order of columns, we

also need to rewrite the table.

Let's just change the column order
at this moment.

But this is again, this is something
which is not ready.

You cannot use it in production.

I don't recommend it.

For this, although there is a cooking
dirty prototype for this.

How I would do it, I would just
apply this new table approach,

same as in 8, But it requires some
additional effort.

It's similar to pg_repack

You need some kind of delta table,
you need trigger, and you

need to write changes to that delta
table.

Then you create a copy.

While you create a copy, this trigger
should write all changes

in this delta.

Then copy is created.

There are interesting tricks there
to minimize the duration of

transactions, as I remember.

It was long ago, last time I used
this and developed, I think,

maybe 5 years past.

We had a very serious workflow
developed for a big company originally,

and we used it several times, this
approach, under very heavy

loads and very mission-critical
systems where Postgres was in

the center.

And then you just have some transaction
to switch.

You should be very careful.

Foreign keys will be the biggest
problem actually in terms of

Michael: switch.

Well, I had an alternative.

Nikolay: Logical replication.

Michael: Yeah.

And I think if you've only just
discovered this and you have

several large tables that

Nikolay: you want

Michael: to do it on all at the
same time like I realize it's

heavy-handed if it's not your largest
table or if there's You've

got tons of data that is already
pretty optimal.

But if you're doing 1 big retroactive...

Nikolay: Maybe.

Michael: Well, what's the downside?

Nikolay: Of logical?

As always, downsides of using logical
Under heavy loads, it's

tricky sometimes.

In this case, we need publication
for 1 table, so we need to

carefully work with retries and
so on, because some locks will

be needed.

They won't block other sessions,
but you can just fail.

For example, if you have a long-running,
as usual, long-running

autovacuum process, which is processing
your table to prevent

transaction ID wraparound, you
won't be able to get a lock for

long.

I mean, well, it's running, so
you'll fail it.

You need to...

There are some nuances here, right?

But I think it's obviously a valid
approach.

And combining all these thoughts,
I think PjSqueeze from CYBERTEC,

which is an alternative to PjRepack,
probably is a tool where

this should be maybe...

Maybe it's already supported there.

Michael: I checked.

I couldn't see it in the documentation

Nikolay: But it makes sense getting
rid of bloat let's also reorganize

table a little bit and get rid
of padding which can be considered

also kind of bloat right

Michael: Yeah I think so.

I think depending on your definition...

Nikolay: Vertical bloat.

Michael: Like, if you rebuilt your
table now, how much smaller

would it be?

Is kind of how I think of bloat,
the delta between.

How big is it now?

How big would it be if

Nikolay: you think you could...

Well, you're talking about dump
restore, you can check it like

that, or vacuum full, but it won't
remove padding zeros.

Michael: No, so therefore maybe
it doesn't count in that strict

definition, but if you allow for
column reordering in between,

then it suddenly, it counts.

Nikolay: This would be a good feature
for pg_squeeze and unlike

pg_repack I think I think it should
people should not have fears

because if you use logical and
you organize your table like cold

mother.

I don't see problems with same.

Yeah, because it is a pocket works
with substituting real file

nodes and it's like basically substituting
files, it's kind of

a hack, right?

And it sounds scary, while pg_squeeze
is using official API basically,

logical decoding, right?

Which is good because sometimes
pg_repack is not available.

If it's a managed Postgres offering
and they forgot to add pg_repack,

I don't know why they could do
it, but actually I think Supabase

doesn't have pg_repack.

Michael: Interesting.

Nikolay: Yeah, yeah, I think so.

I think I checked last week, we
had a new client come, they are

on Supabase, and I think I checked
and didn't see pg_repack among

supported extensions.

I wonder how Supabase clients
deal with bloat, or maybe they

don't care yet.

But by the way, again, congratulations
with OrioleDB.

It's great.

OrioleDB is a super innovative
thing, which has high chances

to be widely used because it's,
I think it's an Apache license

and the plan is to have it as an
extension for regular Postgres,

which is great.

But it's not yet there because
some changes need to be done in

Postgres core.

But there are chances it will be
done.

Michael: And it's not recommended
for production used yet?

Nikolay: Well yeah, of course,
because basically it requires

patching the original Postgres.

Michael: I think it's more than
that.

I think there's still others.

Nikolay: Yeah, it requires many,
many years to build a reliable

database.

It can be considered a kind of
new database because it's heavily

modified Postgres, right?

Storage layer is heavily modified there.

I can't wait to see, like you know, we discussed it with Korotkov

that branching can be native in Postgres.

This could be huge.

Michael: Yeah, there's a lot of promise there I think, yeah.

Nikolay: But anyway, back to pg_squeeze, I think it's a

good idea if it's supported.

You have many ideas today, but back to practical.

Michael: These are more episodes, these are separate episodes.

Nikolay: Yeah.

Michael: Yeah, back to practical stuff.

There's 1 more thing, So yeah, so I think those are 2 like ways

of doing it.

And I also wondered at what kind of volume are you generally

seeing?

Like what kind of size tables are we talking about before this

starts to make any sense normally?

Nikolay: You know, it's hard to say, because again, like I told

you, even if I have a 1TB table, which is already above the threshold

where partitioning should be used, and I see additional 10% of

bloat, I don't care.

So if I see additional 10% of padding, I don't care.

10% because of engineering time is very expensive, right?

Yeah.

Yeah, I know it's not only about storage.

Like people say it's a storage source, but it's I'm mostly concerned

not about storage, although storage also matters because it affects

backup sizes and replication size and so on.

What matters more for me is state of memory.

And this is like additional spam in the buffer pool and page

cache.

It's not good.

So 10% I wouldn't be bothered, but it's 20% I already kind of

in warning mode, right?

It's 30, 40, it's already, it's worth doing it, right?

And by the way, it's interesting, you know, like I always tell

people these bloat, let's connect topics in very weird way.

So, bloat, regular bloat, when we have dead tuples and they were

cleaned by autovacuum or vacuum, and we have gaps in terms of

tuple, like slots for tuples are not filled and page has only

few tuples and many empty spaces.

This is called blow, right?

So over time Postgres will probably put new tuples there, but

maybe no, because new tuples are coming in different pace and

we end up having too much space used than it could be.

And dump restore would help, or repack, or pg_squeeze, or VACUUM FULL,

which is not good in production, and so on.

So usually how do we understand the bloat?

How do we understand the bloat level?

We use some scripts, some queries
or tools use some queries or

monitoring systems use some queries
but these queries they are

like they sometimes are heavy but
not so heavy as full table

scan would be right still they
are light but they are estimate

estimated yes And you see a bloat
level 40, but it's not 40.

You dump restore and you still
see 40.

How come?

And Interesting, if you take example,
1 of those we discussed,

for example, Boolean int8, Boolean
int8, Boolean int8, repeated

like 10 times, and rinse, and check
bloat level using Estimated

bloat using 1 of those queries
which Actually all originated

from 1 Work, so you will see some
bloat.

I can demonstrate bloat like 50%
It's insane right and you dump

restore you are come for you Pj
repack and this bloat cannot

be eliminated because it's not
bloat.

I think it's very related to mistakes
the errors these scripts

like error related to padding

Michael: You could that could be
fixed right that

Nikolay: could be fixed I tried
I didn't have enough time a few

years ago, and the complexity was
high.

I think, first I need to make sure
my analysis script in postgres_dba

works really right.

I need to check these 4 by 2 byte
padding cases.

And if it's right, I think jumping
from there to those estimate,

bloat estimate queries, I think
it's possible, should be, right?

Michael: You need to know the column
order, but that's...

Nikolay: I know it from practice.

This is how I can easily demonstrate
that you should not trust

blindly those estimated bloat queries.

Yeah.

Michael: Cool.

Last thing, I think Posts on this
normally don't mention indexes,

but 1 good 1 recently did by Renato Massaro.

And I don't think it's as important
as heap for perhaps obvious

reasons, But it did somewhat surprise
me that There are some

somewhat surprising Alignment padding
issues related to indexes

as well, and I thought that was
fascinating

Nikolay: Yeah, but we usually we
cannot change Most cases Changing

column order and yeses just because
of this is not a good idea,

because column order matters in
terms of query performance.

Michael: Yes, so I completely agree.

I think there are some rare cases
where, if you're only doing

equality type, there's some rare
cases where you can you do have

some flexibility on the on at least
a couple of the columns if

you're always including them in
all queries like that kind of

thing but the reason I the reason
I brought it up was I've seen

twice now, including the person
who wrote this article, twice

now people have spent time on something
that then turned out

to be not worth it.

So they had spent time reducing,
like they didn't need a timestamp,

they really only needed the date,
for example.

So they thought, I'm going to, I've got this index on, they only

had a single column index on this timestamp, and I'm going to

create a B-tree, I'm going to create the index again, but only

on the date, like the truncated part of it.

And they spent development time thinking about this, doing this.

Nikolay: And then they stopped.

Well, date is maybe, how much does

it take?

4 bytes?

Michael: It's padded to 8, actually, in BG.

Nikolay: Yeah, you say, like, having

two-column index, 4 and 4 bytes, and just single-column 4 bytes

is kind of same?

Or what?

Michael: Single column 4 bytes is the same as single column 8

bytes.

And that's really surprising.

Nikolay: Yeah, cool.

Michael: So that time was wasted for them, even though it wasn't

a multi-column index.

The padding meant that they didn't gain anything.

Nikolay: So guys have more two-column indexes if it's about 4

byte columns.

Michael: Yeah, maybe you would, yeah.

Nikolay: Or maybe covering indexes as well.

Oh, covering indexes may be different.

You know about including?

If it's a single column 4 byte index and we have another like

passenger, right?

Passenger.

Michael: Yeah, but it only gets stored in the leaf pages.

Nikolay: Yeah, so it's different.

Okay, yeah, that's interesting.

And I didn't go there ever.

This is good that this article raises this topic.

So

Michael: Yeah, I think it's a good thing to be aware of.

Probably not something you should be.

You shouldn't, don't, reorder your indexes based on performance

reasons not based on the storage size it would be my recommendation

Nikolay: I agree yeah

Michael: oh yeah but yeah a really good point that I hadn't realized

until I read this article.

Nikolay: Okay, good stuff.

Thank you for choosing this topic.

Michael: Oh, you're welcome.

We finally got to it.

We've brought it up a few times.

And

yeah.

Nikolay: Okay.

Michael: Have a good week.

Take care.

Bye.

Nikolay: Have a great week.