Postgres FM

Nikolay and Michael discuss doing massive DELETE operations in Postgres — what can go wrong, how to prevent major issues, and some ideas to minimise their impact.
 
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Creators & Guests

Michael Christofides — Host, founder of pgMustard
Nikolay Samokhvalov — Host, founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Michael: Hello and welcome to Postgres
FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard.

This is my co-host Nikolay, founder
of Postgres.AI.

Hello Nikolay, what are we talking
about today?

Nikolay: Hi Michael, we wanted
to talk about one of my favorite

topics, massive delete, I usually
call it, but maybe you can

call it somehow differently.

Michael: Yeah, I really like this.

So when you say massive delete,
do you mean deleting a lot of

rows all at once?

Right.

Nikolay: In one SQL statement.

So for example, 10 million rows,
1 billion rows, or just 1 million.

It's good already.

Good enough to feel how bad this
is for Postgres.

So we can discuss in detail what's
happening if you do it.

I like this type of thing because
it's not only bad, it's also

good.

And we can explore that as well
because if you do it on production,

it's bad, but if you do it on purpose
in some testing environment,

it can be very helpful.

Michael: Oh, I see what you mean.

So it's a useful tool for certain
things.

Right.

Okay, cool.

Yeah, I get it.

Nikolay: Stress test,

Michael: yeah.

So you mentioned this in a previous
episode, that it would be

good to cover this in more detail,
and I was really interested.

But I think it's important to say
what we're not gonna cover

as well.

So massive deletes, some people
might also be thinking of like

some other things that sound like
deletes.

Like some people often implement
soft deletes, which would be

like implemented by updates.

We're not going to talk about that
kind of thing.

We're talking about only actual
Postgres deletes at this point.

Nikolay: Some people complain that soft deletes are not implemented by default in the database system, natively.

Well, maybe it's a good idea to have, but...

Yeah, we will skip this topic.

So, maybe it's another episode
because it has interesting nuances.

Yeah, but let's talk about what's happening if you want to delete many, many rows, how to do it correctly on production without bringing your database to its knees, and how to use it in non-production environments for something good, for good reasons.

Right?

Michael: I like it.

Nikolay: Where do you want to start?

Michael: Maybe with some use cases — like, when are you generally

seeing people do these huge deletes?

I've got a few examples I've seen,
but I'd be interested in yours.

Nikolay: Well, usually people do
massive deletes by mistake.

Not understanding how Postgres
works.

And MVCC and so on, and they expect
it's a simple operation.

They probably saw some warning
that you shouldn't expect immediate

reduction of disk space used.

For example, you delete from some
table and think, okay, I will

free some space.

No, you won't free some space because
it's only part of the work.

Another part is Autovacuum, which
will need to actually delete

those tuples.

But, yeah, what?

Michael: And even then, you won't
see, if you're monitoring disk

space, you won't see that as free
space again.

Nikolay: It depends.

Autovacuum also can truncate relations
if some pages at the end

are deleted, all tuples are deleted
and vacuumed.

So physically deleted already,
right?

Pages become free, it will truncate
them and it will reclaim disk

space.

Michael: Yeah, okay.

But I feel like those cases are
the exception rather than the

norm.

I see a lot more cases where people
are deleting a subset of

data from a non-partitioned table
and really surprised when their

relation doesn't decrease in size.

But just to cover a few of the
other basics, I think this is

actually a topic that's coming
up more and more because we have

these new privacy laws that give people the right to be forgotten.

And deleting data is quite a safe
way of not having a data leak.

The less data you have, the less
data that can be leaked security-wise.

So I feel like there's these forcing
functions.

I think also people are starting
to get serious about having

data retention policies.

So how long should we keep each
type of data around for?

So I think there are some of these
use cases that mean this is

coming up more.

And in the past, at least in the
last 10 years, I saw a lot more

people kind of just storing everything
forever and not really

thinking about it.

So I didn't know if this is something
you're seeing more often

or it's just an age-old problem.

Nikolay: Well, yes.

So I think GDPR and everything, it's not that big a factor, because a single user's delete is usually just some fraction of the whole, right?

But usually, massive delete happens
when we need to clean up

and understand there's some old
data which is not super useful,

we pay for it, and we want to postpone
the moment when we need

to scale our instances, database
instances, machines, and so

on.

In this case, in general, I would
like to mention before COVID,

I went to the VLDB conference, which
is a conference I know since

being a kid, basically, when I
was a student learning database

theory.

I've heard about this conference.

So I went to it because it was
very close in Los Angeles.

I remember a keynote, some researcher,
she was presenting the

talk, which was interesting.

It was about delete basically.

She showed the exponential growth of the total volume of data in our databases in the world.

It was like phew.

Because storage becomes cheaper,
we produce a lot of data, big

data, huge data, and so on.

This is like some zettabytes or
something.

It's an insane curve, insane, with
some forecast.

And she said, we spent decades learning how to store data properly — ACID, not losing data, reliable, highly available and so on.

Now it's coming time to learn how
to clean up and delete.

Understand which data can be safely
deleted, how to delete it

efficiently and so on.

And I was super inspired.

I also did a talk at a conference
because around the same time

I had a production incident.

A senior engineer, a back-end engineer,
with a good understanding

of analytical databases actually,
got a task to delete some old

data, preparing for some marketing
campaign.

Because the forecast was saying
we either need to invest a lot

to upgrade or we need to clean
up before we do this huge marketing

campaign.

So he went to production and just performed a DELETE FROM the table WHERE created_at was older than 1 year.

He estimated in advance that it would be 10 million rows, and we

got an incident.

Downtime, more than 10 minutes,
cost the company a lot of money.

I was super impressed.

A lot of money.

And he almost left the company
himself because he was super embarrassed.

And at the same time there was this VLDB keynote about delete, and then this incident — so I'm saying, we need something, right?

We need, at least we need to educate
people that delete should

be split into batches.

Of course, if you have good disk,
maybe you'll be fine.

But in that case, we had quite
slow...

It was some enterprise-level storage, but a SAN, I think, or some old system.

It was on premise.

And also, at the same time, I was
advocating this company, like

I was helping as a consultant with
Postgres.

The company was growing super fast,
a huge startup.

And I was saying, guys, you do need to increase max_wal_size.

1 gigabyte is not enough.

So: the default max_wal_size setting, an untuned checkpointer, and quite slow storage.

I think maximum throughput for
writes was maybe 600 megabytes

per second, maybe 500.

It's not like current modern NVMe
which gives you like 2 gigabytes

per second, for example.

These 2 factors plus the idea let's
delete 10 million rows, it's

not a huge number, right?

So checkpointer became crazy because
a lot of...

What's happening?

First of all, we find rows.

I mean, executor finds rows.

In that case, it was also unfortunate
that it was not a sequential

pattern.

The rows were selected according to created_at, but they sat sparsely across the table, and first Postgres needs to put an xmax value into this hidden system column — the ID of the current transaction which deletes, right?

It writes that value into the xmax column for the first row, and the page becomes dirty.

Dirty means in memory we have a
different page than on disk,

so we need to flush it to disk
eventually.

A checkpointer needs to do it,
basically.

And then the next row — a different tuple, basically a physical row version — is in a different page, so we dirty a different page.

So if you need to update xmax in 100 rows, it's almost 100 pages already dirtied — very inefficient already, because of this basically random access pattern.

It's not random, but it's not sequential.

This is the key, right?
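To make this concrete — a minimal sketch, assuming a hypothetical events table with a created_at column — you can watch xmax being set by an open deleting transaction from a second session:

    -- session 1: start the delete but don't commit yet
    BEGIN;
    DELETE FROM events WHERE created_at < now() - interval '1 year';

    -- session 2: the affected tuples are still visible here, but their xmax
    -- now carries the ID of the deleting transaction (0 means no one has deleted it)
    SELECT ctid, xmin, xmax
    FROM events
    WHERE created_at < now() - interval '1 year'
    LIMIT 10;

Every page that received an xmax update is dirty at this point, even if session 1 later rolls back.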

And then the checkpointer sees, okay, I have max_wal_size of 1 gigabyte.

And it was Postgres, I think, 9.5, 9.6.

At that time, it means that the real distance between 2 checkpoints was 3 times smaller.

Only about 300 megabytes.

You can read in Yegor Rogov's book,
Postgres Internals, a very

good explanation mentioning the
improvements and why it's not

really even max_wal_size but 3 times smaller.

Now I think it's either 2 times
smaller or something, I already

don't remember.

So it means that the checkpointer comes and says, okay, it's time already, we have accumulated too many dirty buffers, we need to flush them to the page cache first of all, and then pdflush, or something like pdflush, will go and flush them to disk.

So it starts working, producing a lot of I/O, and it converts to disk I/O already.

And then, boom, a different row happens to be in the same page which was just flushed.

It was just flushed.

But we update it again.

I mean, we update different tuples
on the same page, it becomes

dirty again.

And Checkpointer says, okay, again,
a lot of work.

Michael: So it's duplicating effort
by flushing too excessively.

Is that what you're saying?

Nikolay: Yeah.

So the distance between checkpoints
was like 20 seconds only.

Also, I learned that if I'm not
mistaken, I think many years

already passed, checkpoints can
overlap sometimes for a couple

of seconds.

It's still happening, but it's
insane.

So the checkpointer produced a lot of I/O.

And the disk couldn't handle this I/O.

And database became unresponsive
and we had a critical incident,

basically downtime, failovers,
it's insane, everything became

insane, the downtime was huge,
and that's not good.

But as I remember, there are 2 effects.

Not only do you need to flush the
same page multiple times if

you have this non-sequential access
pattern, but also once the

checkpointer or anything else made
the page clean, we have a

different effect.

Michael: Full-page images.

Nikolay: Yes, yes.

So full-page images, full-page
writes.

Since full_page_writes is on, it means that after a checkpoint, if we visit this page again with our change, our delete, it goes in full to the WAL, right?

Michael: Whereas if it had managed to sneak in before the next checkpoint, it would have been a much smaller amount of work.

Nikolay: Yeah, yeah.

If we had, for example, a huge max_wal_size, like 100 gigabytes,

we have enough disk space, we can
afford it, we understand if

we crash, startup time will be
longer, replica provisioning also

takes longer, because the recovery
point

Michael: takes time.

Nikolay: And then in this case,
even if we have random, not random,

almost non-sequential access pattern,
we visit the same page

multiple times, not sequentially.

And not only the checkpointer will
flush it just once instead

of multiple times, but also only
the first change will go to

the WAL as a full page, but subsequent
changes will be just

presented as a tuple.

And this means WAL generation
decreases significantly if we

do this.
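One rough way to see this effect, as a sketch using psql (table name hypothetical), is to measure how many WAL bytes one and the same delete generates right after a checkpoint versus long after one:

    -- remember the WAL position before the delete (psql \gset stores it in a variable)
    SELECT pg_current_wal_lsn() AS before_lsn \gset

    DELETE FROM events WHERE created_at < now() - interval '1 year';

    -- WAL bytes generated by the statement above
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn'));

The run that lands just after a checkpoint should show noticeably more WAL, because of the full-page images.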

So it was helpful to me as a consultant
because I finally found

an argument that we need checkpoint
tuning.

It was the beginning of my practice
in the US, so it was hard

for me to find good arguments,
but this case showed okay.

I just did a very good series of
experiments.

You know I'm a big fan of experiments.

So you just start experimenting with max_wal_size and run the same delete each time: with 1 gigabyte, this is the I/O for this massive delete; then 2 gigabytes, 4 gigabytes — a logarithmic approach — 64, for example, or 128 gigabytes.

And you can draw a good curve.

Right?
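A sketch of one iteration of that experiment — the values are just examples:

    -- raise max_wal_size for the next run; it's a sighup parameter, no restart needed
    ALTER SYSTEM SET max_wal_size = '16GB';
    SELECT pg_reload_conf();

    -- start each run from a comparable point
    CHECKPOINT;

    -- ...then re-run the same massive DELETE and record I/O, WAL, and checkpointer stats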

Look at how the I/O behaves.

And usually, if you also have monitoring, and this delete takes like a minute or 2 minutes, 5 minutes, you can see in monitoring that you have a plateau for your I/O because it's saturated.

Michael: Are you taking the total
I/O?

So let's say you're doing the same
delete each time and you're

looking at total I/O across the
time and showing that there's

less I/O as we increase the...

Nikolay: I just, yeah, I just,
Well, I did several things in

this experiment, and I think it's
worth maybe a good how-to article

or something.

Actually, it was 2 phases, 2 parts
of the whole experiment.

First is to study this I/O behavior
and checkpointer behavior.

So I did a snapshot of pg_stat_bgwriter,
which as we know,

until recent versions, contained not only background writer activity, but also checkpointer and backend activity for cleaning dirty buffers.

So I converted as usual, I converted
buffers to gigabytes because

this is how any engineer can start
understanding what's happening.

If you say buffers, nobody understands
except DBAs, right?

If you multiply it by 8 kilobytes,
you have gigabytes, megabytes,

everyone starts understanding.
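A sketch of that conversion, using pg_stat_bgwriter as it looks on older versions (on Postgres 17+ the checkpointer counters moved to pg_stat_checkpointer), assuming the default 8 kB block size; you take one snapshot before and one after the experiment and diff them:

    SELECT
      buffers_checkpoint * 8192.0 / 1024^3 AS checkpointer_written_gb,
      buffers_clean      * 8192.0 / 1024^3 AS bgwriter_written_gb,
      buffers_backend    * 8192.0 / 1024^3 AS backend_written_gb
    FROM pg_stat_bgwriter;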

So I just showed that with default
setting, Checkpointer had

a lot of I/O.

And also, much more WAL was generated.

But I also made screenshots of
monitoring showing that we had

a plateau situation for disk I/O, disk
write I/O.

But when we had already like 16
gigabytes, 32 gigabytes, we already

see like a spiky pattern up and down
and this is good.

It means we have room.

Michael: Because of the batches?

Why is it spiky?

Nikolay: Yeah, some batching, something like that — checkpoint_completion_target is of course close to 1, so the work should be spread out, but there is batching there, obviously.

And this is good.

Let me reiterate it: a plateau is bad.

Michael: Right.

So.

Actually, it's a good point that this helps — this optimization,

like tuning the checkpointer is
helping with massive deletes,

regardless of whether we batch
or not.

It's helping in the case where
we don't batch and it's helping

in the case where we do batch.

Is that right?

Nikolay: Right now I'm talking only about how tuning would have prevented this incident.

I think it would convert P1 or
priority 1 or criticality 1 incident

to P2.

So basically we have a slow database
but it's not down.

Because we just discussed this.

A well-tuned checkpointer has less I/O to perform.

Michael: Okay.

I understand the spikiness now.

We're talking about that's when
the checkpoints are happening.

Nikolay: Yeah.

The I/O from the checkpointer — the disk I/O — was spiky, and that's good.

That means we are not having a plateau; we are not saturated.

Of course, it's better if you know your numbers — you know the limit of I/O your storage device can handle — and you can draw this line on the graph and understand how far away you are.

Basically, regular SRE practices, starting from USE — utilization, saturation, errors — should be applied here to study this incident and perform

root cause analysis.

And this was great.

It was obvious that if we reach
like 16 gigabytes or 32 gigabytes,

we are in much better shape.

We just need to have headroom in disk I/O. And there was also the second phase of the experiment.

I think we had an episode about max_wal_size and checkpoint tuning.

So, the second phase — I won't go into detail there, but in the second phase you do need to understand recovery time in the worst situation.

And I invented the term like double
worst situation, double unlucky.

So worst situation if your database
crashed right before checkpoint

completes and double unlucky if
at the same time you had massive

delete or something like that.

In this case it means a lot of
work during recovery.

So yeah, that's it actually.

This is what a massive delete looks like in the wild, and you should

avoid it.

Michael: Yeah, So how do we then
go from the...

I guess we've gone from priority
1 incident to priority 2.

How do we make this not an incident
at all?

Nikolay: Batching.

So just split it into batches.

I think there might be cases when
logically you cannot afford

splitting to batches because you
must delete everything in 1

transaction.

But in my practice, I never saw
this.

I mean, I always could convince
people to split to batches and

do everything in different transactions.

So in this case, we need just to
understand what is the ideal

batch size for us.

Not too small, not too big.

Too small means a lot of transaction
overhead, too big, we just

discussed.

Michael: I was gonna say, I haven't seen a delete use case like that. We're deleting stuff, right? We don't need it anymore — that's almost by definition. So why would you need it all to be gone, or none of it to be gone? That doesn't make sense to me.

Nikolay: Potentially there might
be a case when you don't want

users to see parts of the

Michael: old data.

So you want them to not see anything,
but you'd rather they saw

everything than part of it.

Nikolay: Yeah, in this case the fix is really just to adjust the application logic so the application cannot reach that data anymore, right?

Even if it's present in the database,
but you hide it already,

maybe like that, based on timestamps
or something.

But again, this is just a theoretical
discussion.

In practice I didn't see any cases
when we couldn't.

With Postgres and its MVCC model, the benefits of batched deletes are always much higher than the pain and risks of not batching.

So yeah, batches, and we know the
ideal size of batches, we discussed

it many, many times, starting from
the very first episode we

had almost 2 years ago.

Michael: And very recently in the
Don't Do This episode, right?

Nikolay: Yeah, yeah, yeah.

So, ideally, my recipe is just to try to be below 1 second, but maybe

not below 100 milliseconds or 50.

So this is the ideal size to me based
on human perception.
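As a minimal sketch of that recipe (table and column names are hypothetical), the application just loops a statement like this until it reports DELETE 0, tuning the LIMIT so each call stays under roughly a second:

    DELETE FROM events
    WHERE id IN (
      SELECT id
      FROM events
      WHERE created_at < now() - interval '1 year'
      LIMIT 5000
    );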

Michael: Yeah.

You mentioned something in the
recent episode about it degrading

over time.

So let's say you can, you get a
clone or you have a replica of

production where you're testing
batch sizes and you get the sweet

spot.

Maybe you find out that you can
delete 5,000 rows in just under

a second.

And you think let's go with 5,000
as the batch size.

You mentioned

Nikolay: 1 hour later, you see
the same delete already takes

1 minute.

What's happening?

Michael: Yeah, so I was gonna ask,
not only what's causing it,

but how would you, like, do you
run the test for longer?

Or like, how do you think about
that stuff?

Nikolay: Well, you just need to
understand that when you delete

or update, you produce dead tuples,
and it means that only the

first part of 2 parts of the operation
is done.

It's done sequentially by your
SQL query, but there is a second

super important part, which is
vacuuming.

So delete is not complete when
you see a transaction committed.

It's complete only logically, but
physically we have these dead

tuples — basically garbage, unless some transactions still need those versions.

And you need to understand vacuum.

And actually, in the same company, to perform the delete properly, of course I split it into batches, and then I was a big fan of, oh, I want to create a simple query which is basically stateless.

I don't want to deal with remembering
last ID or last timestamp

of previous batch.

Basically no pagination.

I didn't want to have pagination, because it has a similar problem to pagination for selects showing a huge result set split into pages to users — a similar problem here, very similar, because it's basically the same problem.

So I wanted to do something like
stateless and I just relied

on Postgres on some index, I checked
indexes used, and then already

in production I saw this degradation.

So degradation was because of a
lot of dead tuples and auto vacuum

couldn't catch up with my speed
of deletion.

I also remember an interesting
thing.

Since then, I'm a big fan of single-threaded
maintenance jobs.

I implemented multiple threads
originally, but then I saw that

even a single thread is too fast
for auto vacuum.

And we don't need parallelization
here.

Of course, parallelization in my
case was based on select for

update, skip locked.

Like fancy, super cool.

It was maybe like 5, 6 years ago,
like, let's do fancy stuff,

we will have like 5 workers, let's
go.

But you don't need it here.

1 thread is enough to clean up.

Because otherwise you need to speed
up vacuum somehow, and vacuum

for a single table is always single-threaded.

You cannot have multiple workers
cleaning up dead tuples from

a single physical table.

Michael: Non-partitioned, yeah.

Nikolay: If it's partitioned, yes,
they can work on multiple

partitions at the same time, but
if it's a single table, no.

Michael: I want to come back to
partitioning.

But in the meantime, like, so massive
delete also has a big impact

on, like, I was thinking about
index only scans, not for the

delete, but for other like concurrent
reads.

Let's say we're in a situation
where we are deleting a lot of

tuples from not necessarily random
but like lots of different

pages; the visibility map is going to suffer; there are going to be a lot of pages that have changes and therefore won't allow true

index only scans. So I was thinking
for that case we might actually

want to run a vacuum manually every
certain number of batches.

Nikolay: Yeah, this is what I did
and I also thought maybe I

should.

Yeah, I did.

I did it manually — well, not manually, I mean my script did it.

But after n batches, like 100,000
batches, vacuum.

Michael: And how did you determine
n?

Nikolay: Good question.

Well, experiments.

So if you want to go stateless
and you know a lot of dead tuples

will be a problem — because the index you use, on the primary key or on the creation timestamp, created_at, also needs vacuum, since it will be pointing to dead tuples, and that's

why subsequent batches will be
slower, degradation over

time will happen.

But if you vacuum, you get rid
of these links to dead tuples

And it becomes good again.

So based on that, applying the rule, we don't want our transactions to exceed 1 or 2 seconds, because users might notice some bad things.

When we have a long transaction,
we also block...

We also are some kind of a problem
for all autovacuum workers

globally for our database.

So if you see, oh, it degraded
to 2 seconds, this is the right

time to run vacuum, right?

But I eventually chose not to do
vacuum, actually.

I had many iterations of my solution,
and finally I gave up and

decided to go stateful.

So I performed just this key set
pagination, just based on the

last timestamp.

I select the next batch based on
that.

It's super fast even if we had
a lot of dead tuples.

So I don't care.
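A sketch of that stateful, keyset-style batch — names are hypothetical, and :'last_created_at' / :last_id stand for the values the script remembered from the previous batch (psql-style placeholders):

    WITH batch AS (
      SELECT id, created_at
      FROM events
      WHERE created_at < now() - interval '1 year'
        AND (created_at, id) > (:'last_created_at', :last_id)
      ORDER BY created_at, id
      LIMIT 5000
    ), deleted AS (
      DELETE FROM events e
      USING batch b
      WHERE e.id = b.id
      RETURNING b.created_at, b.id
    )
    -- remember the last key for the next iteration
    SELECT created_at, id
    FROM deleted
    ORDER BY created_at DESC, id DESC
    LIMIT 1;

For the very first batch, the placeholders are seeded with values below any real key.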

Because autovacuum is good at skipping pages it can skip.

It's not like 10 or 15 years ago.

It's already quite good.

So multiple runs versus just single
big run.

This is a good question, by the
way, worth exploring.

Is it good to run vacuum quite
often in this case, versus let's

run it just once.

In terms of how much I/O it will produce, in terms of how many disk

operations in total, this is a
good experiment to conduct and

compare these numbers.

Michael: Well, and doesn't it depend
a little bit on like, I

think it depends a bit on your
workload, right?

If you have no other concurrent
users, running 1 vacuum is likely

more efficient than running lots
of small ones.

But if you have other concurrent
users and you're forcing heap

fetches, then maybe your IO impact
from those reads outweighs

the duplicate effort on vacuum.

Or even when I say duplicate effort
on vacuum, I guess we're

talking about

Nikolay: the same kind of a similar...

Yeah, you're bringing up a good
point here.

If we decide to run a single vacuum, not frequent ones, we will live with a lot of buffers which are not marked all-visible in the visibility map.

Yeah.

Yeah, and index-only scans will
degrade.

But you add complexity here.

I know, I know.

In my case, I don't care about
other users for now.

I like this approach.

First you study the problem alone
and then you start adding what

you did, like, oh, we have other
users.

Let's already understand the problem
in a vacuum, so to speak.

Michael: I know what you mean.

Nikolay: In single user mode, basically.

Then you start adding considerations,
what about other users?

If you know how Postgres behaves,
how your workload behaves for

single user situation, It's already
better.

You are much more confident when
you start looking at others

as well.

Yeah, it's a good point.

If we rely on index-only scans
a lot, probably we should vacuum

frequently to avoid heap fetches.

You see it in plans in pgMustard quite a lot, right?

Michael: Yeah, it's quite a common
one.

It's not the biggest issue, but it comes up — especially range scans, where people are returning a lot of rows or doing aggregates that rely on index-only scans, in these kind of hybrid-workload, analytical-ish queries. They can degrade quite quickly in certain environments if you go from

an index-only scan across a few
thousand pages to an index-only

scan with a few thousand heap fetches.

Nikolay: Yeah, so your deletes
can be a problem for these.

So selects might degrade because the visibility map is outdated.

Michael: Exactly.

Nikolay: So yeah, more frequent
vacuums might be good.

It's an interesting discussion,
interesting point as well.

But in my case, when I gave up my idea to be stateless, I liked it so much.

I also switched to a single-threaded approach, forgetting about this SELECT FOR UPDATE SKIP LOCKED.

And so I just need 1 value to always remember — so easy.

And then performance became very
predictable, reliable, stable.

So all batches were the same size
and latencies were the same.

Also interesting, when you do it,
you start thinking, oh, I want

a progress bar basically.

I want to predict ETA, how much
time left.

And if it's degrading or not stable,
this latency of each delete,

you cannot predict reliably.

But once you perform keyset pagination
here, you know your

batch, it takes like 300 milliseconds
each batch.

Great.

I integrated it into my SQL with
a progress bar.

It reported the percentage done, how much was left, and the ETA.

Michael: So, quite

Nikolay: an important thing.

Michael: A quick question on the
ordering of batches.

Do you try and do any kind of natural… You mentioned created_at being a little bit random, but I would have expected, unless the tuples are getting updated a fair amount, that created_at might be a relatively natural ordering.

Nikolay: There should be a very good correlation between physical location and created_at.

You can check it easily.

Select ctid, created_at, and order
by created_at, some limit.

And you can see ctid is the physical
location and you can understand

how correlated the physical location is with created_at.
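For example, something along these lines (table name hypothetical):

    SELECT ctid, created_at
    FROM events
    ORDER BY created_at
    LIMIT 100;
    -- ctid is (block_number, offset_within_block); if block numbers grow roughly
    -- monotonically down the result, physical layout correlates well with created_at

    -- planner statistics offer a shortcut for the same question (-1..1, close to 1 is good):
    SELECT correlation
    FROM pg_stats
    WHERE tablename = 'events' AND attname = 'created_at';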

In my case it did have a lot of
updates in the past, so it was

basically everywhere in the table.

That's why the incident happened,
actually.

Yeah, it

Michael: makes sense.

Nikolay: So, yeah.

Oh, no.

I ordered by modified_at and I
even considered creating an index

on modified_at.

But this is an anti-pattern to me.

An index on modified_at is a problem
because you start losing hot

updates — heap-only tuple updates — immediately, because each update, usually via a trigger or something, changes this value, and if you have an index on it, by definition Postgres cannot do heap-only tuple updates,
and it means you need to deal

with index write amplification
problem, which is nasty.

Michael: So you're optimizing your
delete, but you're messing

up your updates.

Nikolay: So now

Michael: you are considering other
users.

Nikolay: I remember I went deeper.

I decided, okay, honestly, I don't
want this index to be built.

I know it will bite me back because
of heap-only tuples.

I don't want to lose them.

I checked statistics and I saw
we have a good ratio of them among

all updates.

We have a lot of hot updates.
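A sketch of that check against the cumulative statistics (table name hypothetical):

    SELECT
      n_tup_upd,
      n_tup_hot_upd,
      round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
    FROM pg_stat_user_tables
    WHERE relname = 'events';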

So I decided, you know what I decided?

I decided to rely on the created_at index, but then somehow drive the pagination by modified_at, based on partial correlation.

It was something crazy.

It worked quite well.

I don't remember details, but I
did some tricks there.

Avoiding this.

At first I already coordinated
index creation with guys.

They said, oh yeah, let's do it.

But then I said, no, no, no, no.

In a different company, like 1 year before that, I had already made this mistake.

I suggested some index and then
degradation of updates happened

because we lost hot updates.

Actually, that incident led me
to the idea we need to verify

all changes holistically, ideally checking all the queries we have.

And this experimentation approach
and so on.

It's all connected and interesting.

But maybe let's skip this, since
we are out of time.

I have a question.

Michael: 2 more quick, 2 more ones.

I guess they're not quick, but
when we're done with this large

delete, what do you, like, are
there any kind of maintenance

tidy up tasks that you recommend
doing?

I was thinking like we did a whole
episode on index maintenance,

I think rebuilding indexes concurrently is my...

Nikolay: This question exactly
is a good bridge to what I wanted

to discuss as the last topic.

I wanted to discuss how massive
delete can be useful in non-production.

Michael: Okay,

Nikolay: yeah.

And let's talk about this, why
your question is a bridge.

Because now I remember very well,
but in my career I forgot about

this many, many, many times.

Delete doesn't touch indexes at
all.

This is the key.

So during a delete, indexes are not touched at all.

Delete just puts xmax, that's it.

And if the transaction is considered committed, and no other transactions still need this version — they all started later — then vacuum will remove it.

In this case, the tuple will be deleted.

And the links to this tuple in the indexes will also be deleted asynchronously by vacuum, but our backend is not dealing with indexes at all.

It only needs to use one index to
find the scope of work, the tuples

we need to touch.

Oh, well, also during planning,
as we know, all indexes are considered

by a planner and an access share lock
is acquired.

So I don't see any index maintenance
here at all.

Michael: Well, I was just thinking
about them being, let's say

we deleted 10% of our table.

Nikolay: Well, yeah, you're right.

Right, so if we don't do frequent
vacuuming...

Michael: Even if we do frequent
vacuuming, there'll still be

like 10% bloat, especially if it's by created_at or something

where we're deleting old data.

Nikolay: Yes, vacuum doesn't rebalance
B-tree, so B-tree will be in

worse shape probably.

I would say if we deleted a big
fraction of the table data,

we probably need to check bloat
in indexes and probably we'll

need to rebuild them.

But I already got used to relying on automated rebuild jobs,

which we suggest implementing on
weekends or something.
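The tidy-up itself is short — index name hypothetical:

    -- rebuild a bloated index without blocking writes (Postgres 12+)
    REINDEX INDEX CONCURRENTLY events_created_at_idx;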

Michael: But this feels like a
good time to reconsider.

Anyway, it felt worth mentioning.

I know you want to bridge to that
1, but I had one more quick one.

And that's like, more and more
I'm thinking about some of these

use cases are deleting, like a
multi-tenant SaaS application

wants to delete a single tenant.

And if you've partitioned — like, I was thinking about partitioning and detaching — another way of avoiding big deletes is

if you can just drop a whole partition.

Nikolay: Or truncate it.

Yeah, depending.

Michael: Sorry, yeah, like, you know, detach it.

Nikolay: PgQ — long ago, Skype implemented this: three partitions for queue-like workloads, and then a round-robin approach, truncating when it's possible.

It's much faster, it's much better.

No work for many components, and it's good.

Yeah, cool.

I agree.

And indexes are on partitions — they are physical per partition — and if the partition is truncated, the index is also truncated.

So, great.
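A sketch of that path, assuming a hypothetical range-partitioned events table:

    -- detach first (Postgres 14+ for CONCURRENTLY), so dropping doesn't hold a long lock on the parent
    ALTER TABLE events DETACH PARTITION events_2022 CONCURRENTLY;
    DROP TABLE events_2022;

    -- or, for queue-like round-robin reuse, keep the partition and just empty it
    TRUNCATE events_2022;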

Yeah, and let's touch a little
bit on the last topic.

Why delete is useful.

So I was already, you know, into Database Lab and ZFS — branching, thin cloning — and iteration, starting from the idea that experimentation should start from the same point in each iteration.

You need to, you compare things,
you need to compare apples versus

apples, right?

Always apples to apples.

So each iteration must start from
the same state of database.

And this is difficult if table
is huge, database is huge, and

you want to test it on big hardware.

In case of single-threaded behavior,
if you don't care about

checkpoint or vacuuming and so
on, usually thin clones that the

Database Lab Engine provides are great
if you just study plans and

so on.

But in this case we need to consider the same

file system and we need to understand
checkpointer behavior,

WAL, like everything, vacuum.

So we need a dedicated clone when
we perform this experiment

for checkpoint tuning and to study
this behavior.

In this case, for this tuning, I found it super cool that this kind of workload gives us a very good, interesting tool.

If we perform a massive delete but don't commit — we roll back —

the physical layout remains the
same.

I mentioned it multiple times in
previous episodes and this is

a super cool observation.

You can just do begin, delete, rollback — massively.
This puts good pressure on your WAL system, right?

The checkpointer too — it's cool.

And it means that, of course, vacuum won't do anything, because the transaction is marked as rolled back, so even if xmax is already updated, the tuples survive — they're still alive.

But Postgres generates a lot of
work for such workload.

It also puts pressure on the backup system, on the replication system, and the checkpointer also has a lot of work.

So it means you can stress test
many components just with

this kind of workload.

And I know we mentioned this last week, right — we had Sai as a guest.

For those who haven't watched that episode, I suggest it — it was a super interesting discussion about logical replication with PeerDB founder Sai.

So I mentioned that this is exactly
how I found that we can easily

reach and saturate single core
on WAL sender.

So you just delete a million rows, rollback.

It spams WAL, and the WAL sender hits 100% of a single vCPU and becomes a bottleneck quite quickly.

Fortunately, in production, as
I mentioned, I couldn't find such

situation, but in non-production
it's easy to reproduce.

So I found this very handy tool
to stress test Postgres and for

checkpoint tuning and so on.

That's why massive delete is not only bad but also good.

Useful.

But only when it's rolled back.

Yeah.

Isn't it funny and exciting that the physical layout of the table doesn't change?

You just did a rollback, but it's the same.

Michael: This is one of those things where
if you talk to Oracle guys,

you'll get them sweating by this
point because of a different implementation.

It's only because of Postgres'
specific MVCC implementation.

Because in Oracle they have the
undo log, it's like the opposite

trade-off.

So it's so interesting that the
implementation details are so...

It's good to know them in order
to work out how to test things.

Nikolay: Yeah, Postgres is expecting
you to perform rollbacks

often.

Michael: Yeah, It's good to make
use of it when it's advantageous.

Nikolay: Right.

So tuple remains in the same spot
in the page, doesn't shift,

unlike in other systems.

But still, xmax is updated, so
page becomes dirty and must be

flushed to

Michael: disk.

Yep, makes sense.

Nikolay: Yeah, I think we explored
it quite well enough, quite

deep.

Takeaways.

First, perform checkpointer tuning — check out our episode about that.

And prohibit massive deletes: perform deletes in batches

roughly not longer than 1 or 2
seconds.

Unless you don't have users who
deal with your database a lot

and maybe you can go with like
30 seconds in this case.

Michael: Well, at least consider batching and try it, yeah.

Nikolay: I just wanted to have some rule, and for regular OLTP, 1 second is a good rule.

Michael: Yeah, great, I like it.

I think it's a healthy recommendation as well.

Even if you wouldn't go down, or wouldn't have degradation that users would notice, it's just an unhealthy thing to go to excessive extremes — if you can keep things more level, you'll probably see better things.

Nikolay: Yeah, and a couple more takeaways. One is vacuuming — don't forget about regular vacuum; it affects index-only scans, as we discussed. Maybe you want frequent and quite aggressive vacuuming — going faster, consuming more I/O. And also partitioning: maybe your massive delete becomes just a truncate, and in this case all those negative effects can go away.

Yeah.

Yeah.

Love it.

Michael: Thanks so much, Nikolay.

Nikolay: Thank you, Michael.