Postgres FM

Nik and Michael discuss a listener question about archiving a database.
 
Here are some links to things they mentioned: 


~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With credit to:
  • Jessie Draws for the elephant artwork

Creators and Guests

Hosts
  • Michael Christofides, founder of pgMustard
  • Nikolay Samokhvalov, founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Nikolay: Hello, hello, this is Postgres.FM.

My name is Nik as usual, PostgresAI, and as usual with me,

Michael, pgMustard.

Hi, Michael.

Michael: Hello, Nik.

Nikolay: So we have a question today from one of the listeners,

and it's about archiving.

I can think of so many options for what it can mean.

Where to start?

Michael: Yeah, well they specifically asked whether it was possible

for us to do a video on archiving and they said they want to

archive a database that is about 5 terabytes in size.

So what do you assume that means when you first hear it?

Nikolay: Well, if you are familiar with Postgres internals, or not even internals, at least with the Postgres architecture, archiving is simple.

You just set up archive_command.

That's it, right?
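
For context, the "joke" answer really is just a couple of settings on a self-managed instance; a minimal sketch using WAL-G (the exact archive_command depends on your tooling, and the paths here are hypothetical):

~~~
# Minimal WAL archiving setup on a self-managed instance (illustrative only).
psql -c "ALTER SYSTEM SET archive_mode = 'on';"
psql -c "ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';"
# archive_mode only takes effect after a restart.
pg_ctl restart -D /var/lib/postgresql/data
~~~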

But it's a joke.

I think this is not what was meant here.

So I think 5 terabytes these days is an average-size database.

Who doesn't agree?

Well, I see bigger databases, and 5 TB is not a big database anymore.

It was big 10 years ago.

These days, big means more than 10 terabytes.

That's just it.

Right.

So 5 terabytes is not a very large database, and archiving it means that we want to keep it longer term, right?

That's how I read it.

We want to keep it longer term, and/or we probably want to clean up.

Sometimes we want to clean up.

Well, yeah, because, for example, imagine it's e-commerce and we have accumulated order history for many, many years.

It's not used, but it's so much data.

So why not archive it and then clean up?

Cleanup in Postgres, straightforward cleanup with DELETE, as we discussed many times, is not an easy operation in large databases because of MVCC, vacuum, and all these complexities.

So if it's partitioned, if bigger tables are partitioned, then

cleanup is easy, right?

We just detach and drop all partitions, that's it.
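
For reference, the cleanup described here looks roughly like this when the big table is range-partitioned (table and partition names are hypothetical, and you would archive the partition first):

~~~
# Drop old data by detaching and dropping a partition (names are hypothetical).
psql -c "ALTER TABLE orders DETACH PARTITION orders_2015 CONCURRENTLY;"  # Postgres 14+
psql -c "DROP TABLE orders_2015;"
~~~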

And before that, obviously, archiving makes a lot of sense.

But even if you don't have a task to clean up, archiving might

make sense for backup purposes, right?

We...

Michael: ...want to...

Nikolay: ...keep the archive for long, yeah.

Or maybe there are requirements.

There also may be requirements, right?

Michael: Yeah, I think there's an argument that it's slightly

different to backups.

But I'm not actually certain that the way I've internalized that is helpful in any practical sense, because if you've got data archived, it's kind of a backup, right? Like, it is another source of old data.

But I do think they're different in that you wouldn't expect

an archive to have up-to-date data, whereas you would expect

a backup to have up-to-date data.

So I do think there are subtle differences.

And also I think you back up and you archive for different reasons, and therefore there are different trade-offs.

So, if we're backing up in order to be able to get back online quickly after a disaster, you might want to put it somewhere different, in a different format, than if we archive it because we need access to it, but not necessarily performantly.

Nikolay: For me, in real life... For example, take our DBLab, or experiments.

Experiments should be done on clones, right?

On thin clones, with branching, if it's DBLab, for example, or full-fledged clones on separate virtual machines if it's self-managed, right?

And we say, let's restore either from a snapshot, a cloud disk snapshot, or let's restore from backups or from the archive.

These are pure synonyms for me.

Like, let's restore from the archive.

And that's why archive_command is a part of backups, right, it's archiving WALs.

And we say we have many versions of our database, not versions, many copies of our database, in the archive, or in backups.

Backups, archive, it's the same for me in this case.

And of course we want it to be quickly restored, but these days, if it's cloud, it's easy to configure.

So you basically archive it, back it up using pgBackRest or WAL-G.

It goes to S3 or other object storage, like GCS, or Azure Blob Storage, I think they call it; I'm not an expert in Azure.

And there you can configure it.

So, for example, a couple of weeks later, that backup automatically moves to colder storage; in S3 it's called Glacier.

Michael: To an archive.

Nikolay: Well, they're all our archives, one is hotter, another is colder, right?

It's Glacier, or in GCS, in Google Cloud, it's called Coldline or something, I don't remember the exact name.

But the idea is it's cheaper, but slower to restore.
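
The kind of policy being described can be expressed as an S3 lifecycle rule; a minimal sketch, assuming a bucket used by WAL-G with its usual basebackups_005/ prefix (the bucket name, day counts, and expiry are made up):

~~~
# Move base backups to Glacier after 14 days and expire them after a year
# (bucket name and periods are hypothetical).
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-pg-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "cold-archive",
      "Status": "Enabled",
      "Filter": {"Prefix": "basebackups_005/"},
      "Transitions": [{"Days": 14, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 365}
    }]
  }'
~~~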

And this still causes a bug that has been known for many years in WAL-G.

I recently vibe-coded a fix, because the transition to colder storage changes the modification time and LATEST becomes broken.

If you want to restore using uppercase LATEST, will wal-g backup-fetch LATEST work?

In this case it might fetch the wrong backup, because LATEST is resolved based on mtime, the modification time.

And when a backup is moved automatically to colder storage, the modification time of that old backup changes.

So LATEST might be broken.

I recently sent Andrey a pull request to fix it in WAL-G.

It should be fixed, right?
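
To make the failure mode concrete: WAL-G resolves LATEST by modification time, so while lifecycle transitions are in play it is safer to pin an explicit backup name (the data directory path and backup name below are hypothetical):

~~~
# Safer than relying on LATEST when lifecycle transitions touch mtimes:
# list backups and fetch one by its explicit name.
wal-g backup-list
wal-g backup-fetch /var/lib/postgresql/data base_000000010000000A000000C4
~~~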

Yeah, but I mean, modification time obviously cannot be trusted here.

Still, it's a good idea, and it's great that clouds support this, and it's super easy to do: you just configure a policy for how long to keep backups and when to move them to colder storage.

So it allows you to keep archives for longer for the same money, right?

Michael: Or if, for example, you're doing this for compliance reasons and the length...

Nikolay: ...of...

Michael: ...time you have to keep it for is predetermined, you can keep it for that long more cheaply.

Like that, it feels to me like that's the more... yeah.

Nikolay: Yeah, so obviously it's not SSD anymore, it's HDD, but it's still super reliable, and as I understand, they promise a lot of nines, an amazing number of nines.

11 nines, or how many?

It's basically all...

Michael: But here we're talking about durability, not availability, right?

Nikolay: Yeah.

Exactly.

So obviously data won't be lost.

That's the key.

And it can be stored for years and years if you want.

Right.

And that's great.

The only remaining risk is that if somebody gains access to your cloud account, they might destroy everything.

To protect from that, there is a lock option, right?

You can lock it and basically say nobody can delete it, or you can copy it to a different account or even a different cloud.
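
The "lock option" mentioned here corresponds to something like S3 Object Lock; a rough sketch, assuming the bucket was created with Object Lock enabled (the bucket name and retention period are made up):

~~~
# Default retention so nobody, even with account access, can delete backups
# for 400 days (hypothetical bucket and period; Object Lock must be enabled
# when the bucket is created).
aws s3api put-object-lock-configuration \
  --bucket my-pg-backups \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 400}}
  }'
~~~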

Michael: Yeah, right.

So different permissions, right?

Nikolay: Right, and again, if it's backups using WAL-G, WAL-G has a special command for this, copy.

So you can copy from one backup location, one archive location, to a different archive location, right?

But if you are on managed Postgres, physical backups are not available, unless it's Crunchy Bridge, right?

Which is great, that they allow access to physical backups and WALs.

And in this case, the only remaining option is to create a logical backup, which I prefer to call a dump.

And we had an episode: is a logical backup a backup?

My answer is still no.

But...

Michael: Is it an archive?

Because I think archiving almost always isn't incremental, right?

Like, you do it in batches, right?

You take a whole year and archive it.

So in this case, it's not changing.

So a logical dump kind of is an archive.

Nikolay: Well, for me, again, I don't feel this difference.

I hear you, you feel some difference.

But for me, all backups are also an archive.

They're just continuous, thanks to the WAL archive.

They consist of two parts: full copies, or deltas (incremental, differential, there are many names), so basically a full copy of the data directory plus WALs, right, and the WALs make them continuous.

The archive becomes continuous.

Michael: This is where I think we have a difference, because... let's go back to the question, right: they want to archive a database that's 5 terabytes in size.

That's all the information we got.

But there's a chance that it's an old product that's no longer served and they just need to keep the data for a certain amount of time.

It could be that it doesn't need to be accessed anymore.

It could be all manner of...

Nikolay: Maybe they want to drop the database and just keep it for history, right?

So, to answer this question directly, the first thing we need to understand is: is it a managed service, or do we have access to the physical level?

If it's a managed service, the only option for us is pg_dump.

Michael: No, I disagree.

Why?

Well, because I think you could...

So depending on the parameters, right, like you could export it in a different format, right?

There's a lot of...

Nikolay: Okay, a logical backup is the only option here.

Yes.

Although, technically speaking, we can rely on...

We can say, okay, let's create a snapshot, consider it the archive, and rely on RDS to keep it for longer.

Why not?

Michael: Yeah, but here's what I'm thinking, though: 5 terabytes is not very large, right, as you were saying at the beginning.

But in the general case we care about cost, and it seems like cost actually matters here; it's one of the main reasons for doing this in the first place.

If we care a lot about cost, it feels to me like some of the newer formats that compress extremely well, and maybe new is not the right term, but it's new to me in terms of coming across it from a Postgres perspective, formats like Iceberg and Parquet that compress down extremely well, would be the cheapest way of archiving data that's static and is not going to be changing, or where you're going to get a new batch of it once per year.

Like the archiving use cases that I'm thinking of.

If we exported in those formats, they'd be tiny; the 5 terabytes might go down 95% or so.

Nikolay: Oh, I doubt it.

20 times? I doubt it.

Well, maybe.

Okay, what have you seen?

Michael: Yeah, I've seen people reporting that kind of...

Even on the Timescale compression side, I've seen that reported.

And that wasn't even...

Nikolay: Depends on the data again.

Michael: Yeah, depends on the data.

But if you've got a lot of numerical...

A lot of columns that compress...

That have a lot of similar data in them, which I think is pretty common, especially if we're thinking it's healthcare-related or financial, like transactional data; a lot of that is numerical stuff.

Nikolay: Well, again, okay, Parquet is interesting.

I think it's an option as well.

And in this case, it also depends on access patterns later, right?

The first thing we need to answer is: do we have access to the physical level, or only logical?

Logical, we obviously have access to, because this is our database.

Yeah.

I suppose.

All right, so the physical versus logical thing.

We discussed it many times, but physical backups can also be compressed, right?

Both WAL-G and pgBackRest compress the data directory and WALs.

So why not?

The only thing, the super important thing to remember, is that you cannot just take the data directory and that's it.

You need to make sure WALs are also archived.

If it's only logical, OK, there is a Parquet option, but there is also a pg_dump option, which also has compression.

Michael: Yeah, but we're not getting the compression at the page level, right?

Or kind of the row level.

And I think column-level compression is hard to compete with.

Nikolay: Well, is it page level or file level?

Michael: File level.

Okay, what's the difference?

Nikolay: What is the difference?

Huh, how...

Well, from pg_dump I would expect 10x compression, sometimes, in some cases, if you...

There are options in pg_dump, right?

zstd I would definitely check, and do you remember if pg_dump allows you to control the compression level?

Can you set compression level 19?

It will be terribly slow, right?

Or if you check...

We might...

Michael: ...be okay with that.

That's the nice thing about this use case: we might be okay with really slow retrieval, really slow restores, because the main point is to keep it for many years.

Nikolay: Right, right. Anyway, pg_dump has supported this for multiple years already; it has options and you can control the compression level there.

You cannot control the compression level for wal_compression, which compresses full-page images inside the WAL, right?

There, it was my proposal, I need to finish it.

There you can control the algorithm, but not the level of compression.

I'm not sure everyone needs it, but still, for completeness, we should support it in Postgres, right?

But for pg_dump, you can definitely control the level.

So if you are okay with waiting and loading the CPU a lot... if you pick level 19, I think it will be super slow, but maybe beneficial in terms of size, and of course the restore will also be slow, right?

Yeah, yeah.

Michael: I think that is something to factor in if people are deciding which way to go, though: what kind of questions are you going to get about this old data?

Like, are there going to be auditors that want to come along and be able to query it?

If you're going to get advance warning that they're coming, does it matter that it takes a day to restore?

Are there going to be analytical queries?

Like, are people going to want aggregations on it?

Because if it's going to be mostly aggregates, maybe you do want it in some columnar format where you can get those done quickly.

Nikolay: Well, you confused me, but obviously, when you compress a dump, it's file level, so the whole thing is compressed; it's not page level.

Michael: That makes sense, yeah.

Nikolay: It should be good.

Page level is when, for example, you store something on ZFS; there it's page level.

Here, no: the whole thing is compressed.

It can be super slow, of course, but quite well compressed.

Not sure how it will compete with Parquet.

I don't have a lot of experience with it.

In production we have a customer with a self-managed database who archives to Parquet format from time to time.

Yeah, and it's great, it works well.

It's also on ZFS; it's an interesting setup.

But anyway, I think there is potential here: if somebody wants to write an article or do some research, there are many options to explore and benchmark.

We could build a decision tree: physical or logical, how to decide.

I would always stick with physical if recovery time matters, and if we want continuous archiving, right?

And if we have access, of course, because if there's no access, this path is blocked.

Logical: okay, dump or Parquet format.

For a dump, we have compression options.

I would definitely not choose a single plain file.

Well, again, it depends, right?

But normally we choose the custom or, better, the directory format, because we want compression and parallelization, right?

Because if you produce a single-file dump, then you cannot use parallelization.

And you will be limited by a single vCPU when dumping and restoring.

When you use the directory format, you can say how many jobs, how many vCPUs, to utilize when dumping, and separately when restoring.

It can be a different number.

And this matters a lot if we want to dump faster and restore faster, right?

This thing.
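
A minimal sketch of that with a recent pg_dump (database name, paths, job counts, and compression level are made up; parallel dump requires the directory format, and zstd support needs Postgres 16 or newer):

~~~
# Directory-format dump: compressed and parallel (8 jobs), then a parallel
# restore with a different job count. All names and numbers are hypothetical.
pg_dump -Fd -j 8 --compress=zstd:10 -f /archive/appdb_2025 appdb
createdb appdb_restored
pg_restore -j 4 -d appdb_restored /archive/appdb_2025
~~~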

And then, finally, so this is a decision-making tree, right?

And then finally: where to store it?

An archive definitely shouldn't be stored on expensive SSD disks, right?

Object storage is great if it's cloud, right?

If it's not cloud, some cheap disks which we don't use a lot, so they can serve as the archive.

Or even... is tape still an option?

I think many old organizations still use tape.

Well, it has a huge lifespan, right?

So tape, right?

Why not?

Anyway, it should be something which we know is reliable and will live very long, right?

But we also need to take into account restoration timing, right?

Michael: Maybe, yeah.

Nikolay: Because Glacier is super cheap; S3 Glacier is like the cheapest, but also the slowest to recover from, right?

Yeah.

To fetch data, to fetch data from there.

So it's interesting, right?
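
Getting data back out of Glacier is an explicit, slow operation; per object it looks roughly like this (the bucket, key, and retrieval window are hypothetical, and a physical backup consists of many such objects):

~~~
# Request retrieval of one archived object from Glacier, keeping the restored
# copy available for 7 days (names and numbers are hypothetical).
aws s3api restore-object \
  --bucket my-pg-backups \
  --key backups/part_001.tar.zst \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'
~~~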

And a physical backup, a physical archive, can also be placed into various locations.

So the location question, I think, is a separate decision.

Michael: Well, yeah.

I think there are a couple of other things to factor in.

Because if you do store it in Parquet, for example, you can query it without restoring.

You don't even have to restore to get the data.

Nikolay: Yeah, that's great.

Great option.

Also, the logical dump path is great because it allows you a partial archive, if you want only specific tables.
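
A partial, table-level archive is just pg_dump with -t; for example (all names here are hypothetical):

~~~
# Hypothetical partial archive: dump only the historical tables.
pg_dump -Fd -j 4 --compress=zstd:10 \
  -t orders_2015 -t orders_2016 \
  -f /archive/orders_2015_2016 appdb
~~~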

And also, of course, you need to take into account that you only archive the data, no helper data.

Helper data is indexes or materialized views.

So, derived things, which means that recovery will take longer, right?

So I would say, if you are going to drop some database for historical purposes, even if you have physical access, I would probably still do a dump.

A 5-terabyte dump... if it's a powerful machine, definitely within an hour... you know, but...

Michael: Yeah, we took local...

Nikolay: Yeah, and if it's with many threads, you can...

Well, if it's a...

By the way, dumps are always interesting.

I think it doesn't matter, Parquet or native pg_dump.

If you have an unpartitioned case and, for example, 5 terabytes, but 4 of them are a single table.

We see it a lot, actually.

One huge table.

Usually called logs, or history, or something like historical events.

Sometimes a queue-like workload, but something that has accumulated.

And then people think about how to get rid of it.

Partitioning is needed there, right?

Because without partitioning, dumping will be single-threaded as well.

Although you could create your own tool, and I think pg_dump could actually support it: using ID ranges, dump with multiple parallel threads.

Theoretically it's possible, and you can write this.

But out of the box, pg_dump will limit you, and dumping 5 terabytes will probably look like this: you say, I want to utilize all my 16 cores, using 16 threads, to send it to S3.

For the first 10 minutes, all 16 threads are utilized.

And then, for a couple of hours, just a single thread is working.

Because of one huge table.

That's why.
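
The do-it-yourself ID-range approach sketched here could look something like this (table name, id column, ranges, and chunk count are all hypothetical; note that separate sessions see separate snapshots unless you export and share one):

~~~
# Hypothetical sketch: export one huge unpartitioned table in 4 parallel
# chunks by id range, compressing each stream with zstd.
for i in 0 1 2 3; do
  lo=$((i * 250000000)); hi=$(((i + 1) * 250000000))
  psql -c "COPY (SELECT * FROM events WHERE id >= $lo AND id < $hi) TO STDOUT" \
    | zstd > events_part_$i.zst &
done
wait
~~~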

So partitioning would make archiving much faster and more flexible to tune, like, let's move faster, because, again, if it's a live database, every time we archive logically, we have a long-running transaction.

And we affect vacuum behavior, even if it's on replicas which have hot_standby_feedback on, right?

So this can lead to higher bloat and various negative consequences.

Michael: I don't see it as much anymore, but I used to get quite frustrated seeing partitioning advice that mostly focused on the read-performance benefits of partitioning.

The more I've learned about it over the years, the more I think the main reason to partition is for these maintenance tasks and for the health of the database overall.

But time-based in particular: if you've got an archiving requirement, if you're ever going to need to remove data based on time, time-based partitioning is just a dream.
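
For anyone who hasn't set it up, time-based declarative partitioning itself is short; a minimal sketch (table name, columns, and bounds are hypothetical):

~~~
# Hypothetical range-partitioned table, one partition per year.
psql <<'SQL'
CREATE TABLE events (
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE events_2025 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
SQL
~~~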

Nikolay: Yeah, "you're absolutely right", as some LLMs say.

So, exactly: this archiving, restoring, like dump and restore, how vacuum works, how index creation works.

Even physical backups benefit from partitioning, because updates and deletes become localized to specific files.

So Postgres keeps tables and indexes in files of up to 1 gigabyte.

So if it's 5 terabytes, 4 of which is a single table, it will be chunked into 1-gigabyte files, and imagine we have a single unpartitioned table: writes can come to any of these files, any pages inside them, all the time; it's not localized, right?

And in this case, if you use snapshots, RDS snapshots, they have incremental support, right?

So the first snapshot is full, and then the next snapshot is just a diff at the block level.

If you use WAL-G or pgBackRest, they also have delta backups, or incremental or differential backups, where, also at the block level, only changes are backed up, right?

So in this case, if you have a huge table and you write constantly everywhere, your delta will be bigger compared to the case when you write only to a specific file, which is like our latest file, and only occasionally update all the data, all the records.

In this case, these deltas will be much more localized, and this is also beneficial, right?

So even physical backups will benefit from partitioning.

So, general advice: yes, and our general advice was always, if one table exceeds 100 gigabytes, it's time to partition.

People ask how to count it: with indexes, with TOAST, or without? Well, it's a very, very rough number.

So recently we started saying, okay, if it's already exceeding 50 gigabytes, let's think about partitioning, right?

Without indexes, maybe.
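
A quick way to see where a database stands against that rule of thumb, with and without indexes (the LIMIT is arbitrary):

~~~
# Largest tables, with and without indexes, to compare against the
# rough 50-100 GB partitioning threshold.
psql -c "
  SELECT relname,
         pg_size_pretty(pg_table_size(oid))          AS table_and_toast,
         pg_size_pretty(pg_total_relation_size(oid)) AS with_indexes
  FROM pg_class
  WHERE relkind = 'r'
  ORDER BY pg_table_size(oid) DESC
  LIMIT 10;"
~~~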

Michael: The point is, don't worry about it if your whole database is 1 gigabyte, but also, if your largest table is already a terabyte, maybe it's a good time to...

Nikolay: It's late.

Michael: You know what I mean; the reason for giving a rule of thumb is to give people an idea of where they stand, roughly.

And if you're close to that number, be thinking about it.

Nikolay: So we have funny stories, when AI companies come to us for help because they grow too fast and they need help.

Yeah.

They lack database expertise.

We have a bunch of AI companies, AI startups.

And it was an interesting story.

So some company has already reached between 10 and 20 terabytes.

And they have a table exceeding a terabyte.

And someone from our team said they should have partitioned a year ago, but they said, we just launched a year ago.

The pace is very, very high, and I think Postgres needs improvements in the area of partitioning, to simplify the developer experience, and I hope we will work in this area next year with our tooling.

Michael: It has got a lot better with native partitioning.

It feels like maybe not 18 as much, but it feels like in most releases before then, since it was introduced, there were significant improvements.

Maybe there were in 18...

Nikolay: ...as well.

In terms of various things, like DETACH PARTITION CONCURRENTLY, in terms of support for online operations, locking things, but it's still super painful.

Still.

You know, you basically need to allocate an engineer for a month or two to implement it.

And we have great guidelines which we give our clients, but there's still a lack of automation.

Michael: You mean, like, to migrate from a non-partitioned setup to a partitioned setup?

Nikolay: Yeah.

The hardest part here is that everyone is using various ORMs and they want this tooling to be closer to their language.

There is of course pg_partman but still.

Michael: Oh, okay.

It's about the creation of new partitions.

Okay, interesting.

Nikolay: Yeah.

It's about the creation of new partitions and then maintaining them, because when you already have a partitioned schema, several risks are amplified.

One of them is the very well known one, LockManager, right?

And I had a series of blog posts recently, studying Postgres 18 and deeper, also in the context of prepared statements.

And it was interesting.

And another thing is that any DDL, any schema changes, might be really painful, right?

Even simple index creation, or dropping an index, can be painful, or foreign keys, right?

So yeah, every time we see people implement some nuances, and it feels like Postgres needs to solve it.

Fair.

Michael: All right, anything else on the archiving front?

Nikolay: Final thing: what if you archived something but haven't tested it, right?

It's...

Michael: ...like backups, right?

Nikolay: Yeah, Schrödinger's backup.

Michael: When we say tested, you mean, can it...

Nikolay: ...be restored?

Michael: Can it be queried?

And also, is it the same as the data was originally?

Like, it's verifying as well, right?

Like, do some basic checks.

Nikolay: Exactly.

So unverified, unrestored backups and archives are, like, undercooked.

Michael: Yeah, great point.

Nikolay: Just simply undercooked, and it should be considered a huge, critical mistake, that's it.

Michael: Huge, and a very small mistake.

Nikolay: Yeah, yeah. So if you did something but never ran it... if you write code but never run and ship it, how does that feel?

The same with backups: you created a backup, but you haven't tested the archive, so it's a critical mistake, that's it.
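
A bare-minimum verification pass for a dump-based archive might look like this (paths, database, and table names are hypothetical):

~~~
# Hypothetical smoke test: restore the archived dump into a scratch database
# and run a couple of sanity checks against it.
createdb archive_check
pg_restore -j 4 -d archive_check /archive/appdb_2025
psql -d archive_check -c "SELECT count(*) FROM orders;"
psql -d archive_check -c "SELECT max(created_at) FROM orders;"
~~~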

Michael: Nice, good addition.

Nikolay: This addition, yeah, was almost overlooked.

Yeah, okay?

Michael: Nice, well, thanks again and catch you next time.

Nikolay: Bye bye.