A weekly podcast about all things PostgreSQL
Nikolay: Hello, hello, this is Postgres.FM.
My name is Nik, as usual, from PostgresAI, and as usual with me is Michael from pgMustard.
Hi, Michael.
Michael: Hello, Nik.
Nikolay: So we have a question today from one of the listeners,
and it's about archiving.
There are so many options for what it can mean.
Where to start?
Michael: Yeah, well they specifically asked whether it was possible
for us to do a video on archiving and they said they want to
archive a database that is about 5 terabytes in size.
So what do you assume that means when you first hear it?
Nikolay: Well, if you are familiar with Postgres internals, or not even internals, at least with Postgres architecture, archiving is simple.
You just set up archive_command.
That's it, right?
But it's a joke.
I think this is not what was meant here.
So I think 5 terabytes these days is an average-size database.
Who disagrees?
Well, I see bigger databases, and 5 TB is not a big database anymore.
It was big 10 years ago.
These days, big means more than 10 terabytes.
That's just it, right?
So 5 terabytes is not a very large database, and archiving it means that we want to keep it longer term, right?
That's how I read it: we want to keep it longer term, and/or we probably want to clean up.
Sometimes we want to clean up.
Well, yeah, because, for example, imagine it's e-commerce and we've accumulated order history for many, many years.
It's not used, but it's so much data.
So why not archive it and then clean up?
Cleaning up in Postgres, straightforward cleanup with DELETE as we discussed many times, is not an easy operation in large databases because of MVCC, vacuum, and all these complexities.
But if it's partitioned, if the bigger tables are partitioned, then cleanup is easy, right?
We just detach and drop old partitions, that's it.
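(A minimal sketch of that detach-and-drop cleanup, assuming a range-partitioned table named orders with an old partition orders_2015; both names are hypothetical:)

    # Run each statement separately: DETACH ... CONCURRENTLY (Postgres 14+)
    # cannot run inside a transaction block.
    psql -d mydb -c "ALTER TABLE orders DETACH PARTITION orders_2015 CONCURRENTLY;"
    psql -d mydb -c "DROP TABLE orders_2015;"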
And before that, obviously, archiving makes a lot of sense.
But even if you don't have a task to clean up, archiving might make sense for backup purposes, right?
We want to keep the archive for a long time, yeah.
Or maybe there are requirements; there may well be requirements to keep it, right?
Michael: Yeah, I think there's an argument that it's slightly different to backups.
But I'm not actually certain that the way I've internalized that is helpful in any practical sense, because if you've got data archived, it's kind of a backup, right?
It is another source of old data.
But I do think they're different, in that you wouldn't expect an archive to have up-to-date data, whereas you would expect a backup to have up-to-date data.
So I do think there are subtle differences.
And also, I think you back up and you archive for different reasons, and therefore there are different trade-offs.
If we're backing up in order to be able to get back online quickly after a disaster, we might want to put it somewhere different, in a different format, than if we archive it because we need access to it but not necessarily performantly.
Nikolay: For me, in real life... for example, take our DBLab, or experiments.
Experiments should be done on clones, right?
On thin clones, with branching, if it's DBLab, for example, or on full-fledged clones on separate virtual machines if it's self-managed, right?
And we say: let's restore either from a snapshot, a cloud snapshot of the disk, or from backups, or from the archive.
Those are purely synonyms for me.
Like, let's restore from the archive.
And that's why archive_command is a part of backups, right, it's archiving WALs.
And we say we have many versions of our database, not versions, many copies of our database, in the archive or in backups.
Backups, archive, it's the same for me in this case.
And of course we want it to be restored quickly, but these days, if it's cloud, it's easy to configure.
So you basically archive it, back it up using pgBackRest or WAL-G.
It goes to S3 or another object storage like GCS, or, I'm not an expert in Azure, Blob Storage I think they name it, right?
And there you can configure it so that, for example, a couple of weeks later, a backup automatically moves to colder storage, which in S3 is called Glacier.
Michael: To an archive.
Nikolay: Well, it's all our archive; one part is hotter, another is colder, right?
In S3 it's Glacier, and in GCS, Google Cloud, it's called Coldline, or something, I don't remember the exact name.
But the idea is that it's cheaper, but slower to restore from.
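(A minimal sketch of such a lifecycle rule with the AWS CLI; the bucket name, the WAL-G-style basebackups_005/ prefix, and the 14-day threshold are all assumptions for illustration:)

    # Move base backups to Glacier two weeks after upload
    aws s3api put-bucket-lifecycle-configuration \
      --bucket my-postgres-backups \
      --lifecycle-configuration '{
        "Rules": [{
          "ID": "base-backups-to-glacier",
          "Status": "Enabled",
          "Filter": { "Prefix": "basebackups_005/" },
          "Transitions": [{ "Days": 14, "StorageClass": "GLACIER" }]
        }]
      }'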
And this still runs into a bug which has been known for many years in WAL-G.
I recently vibe-coded a fix, because the move changes the modification time and LATEST becomes broken.
If you want to restore using uppercase LATEST with WAL-G, with wal-g backup-fetch LATEST, it might fetch the wrong backup, because the choice is based on mod time, modification time.
And when a backup is moved automatically to colder storage, the modification time of that old backup changes.
So LATEST might be broken.
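(For context, the command in question; LATEST is resolved from the backups' modification times, which is exactly what a storage-class transition can silently change. The data directory path and backup name are hypothetical:)

    # Fetch the most recent base backup via the LATEST alias
    wal-g backup-fetch /var/lib/postgresql/17/main LATEST

    # Safer when lifecycle transitions are in play: list backups and pass an explicit name
    wal-g backup-list
    wal-g backup-fetch /var/lib/postgresql/17/main base_000000010000000000000042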
I recently sent Andrey a pull request to fix it in WAL-G.
It should be fixed, right?
Yeah, but I mean, modification time obviously cannot be trusted here.
Still, it's a good idea, and it's great that clouds support it and it's super easy to do: you just configure a policy for how long to keep backups and when to move them to colder storage.
So it allows you, for the same money, to keep archives for longer, right?
Michael: Or if, for example, you're doing this for compliance reasons and the length of time you have to keep it for is predetermined, you can keep it for that long more cheaply.
That feels to me like the more typical case, yeah.
Nikolay: Yeah, so obviously it's not SSD anymore, it's HDD, but it's still super reliable, and as I understand, they promise a lot of nines, an amazing number of nines.
11 nines, or how many?
It's basically all...
Michael: But here we're talking
about durability, not availability,
right?
Nikolay: Yeah.
Exactly.
So obviously data won't be lost.
That's the key.
And it can be stored for years,
years, years if you want.
Right.
And that's great.
The only risk that remains is if somebody gets hold of your cloud account access; they might destroy everything.
To protect against that, there is a lock option, right?
You can lock it and basically say nobody can delete it, or you can copy it to a different account or even a different cloud.
Michael: Yeah, right.
So different permissions, right?
Nikolay: Right, and again, if it's backups using WAL-G, WAL-G has a special command for this: copy.
So you can copy from one backup location, one archive location, to a different archive location, right?
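(The exact flags of wal-g copy vary by version, so check wal-g copy --help; as a hedged alternative sketch, a plain object-storage copy into a bucket owned by a different account achieves the same goal. Bucket names are hypothetical:)

    # Replicate the whole backup prefix into a bucket in another AWS account
    aws s3 sync s3://prod-postgres-backups/ s3://offsite-archive-backups/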
But if you are on managed Postgres, physical backups are not available, unless it's Crunchy Bridge, right?
It's great that they allow access to physical backups and WALs.
And in this case, the only option that remains is to create a logical backup, which I prefer to call a dump.
And we had an episode: is a logical backup a backup?
My answer is still no.
But...
Michael: Is it an archive?
Because I think archiving almost always isn't incremental, right?
You do it in batches: you take a whole year and archive it.
So in this case, it's not changing.
So a logical dump kind of is an archive.
Nikolay: Well, for me, again, I don't feel this difference.
I hear you, you feel some difference.
But for me, all backups are also an archive.
They're just continuous, thanks to the WAL archive.
They consist of two parts: a full copy of the data directory (or deltas, incremental, differential, there are many names), plus the WALs, right, and the WALs make them continuous.
This archive becomes continuous.
Michael: This is where I think we have a difference.
Let's go back to the question, right: they want to archive a database that's 5 terabytes in size.
That's all the information we got.
There's a chance that it's an old product that's no longer served and they just need to keep the data for a certain amount of time.
It could be that it doesn't need to be accessed anymore.
It could be all manner of...
Nikolay: Maybe they want to drop the database and just keep it for history, right?
So to answer this question directly, the first thing we need to understand is: is it a managed service, or do we have access to the physical level?
If it's a managed service, the only option for us is pg_dump.
Michael: No, I disagree.
Nikolay: Why?
Michael: Well, because I think you could... Depending on the parameters, right, you could export it in a different format, right?
There's a lot of...
Nikolay: Okay, a logical backup is the only option here.
Yes.
Although, technically speaking, we could say: okay, let's create a snapshot, consider that the archive, and rely on RDS to keep it for longer.
Why not?
Michael: Yeah, but here's what I'm thinking, though, and 5 terabytes is not very large, right, as you were saying at the beginning.
In the general case we care about cost, and it seems like cost actually matters here; it's one of the main reasons for doing this in the first place.
If we care a lot about cost, it feels to me like some of the newer formats that compress extremely well... and maybe new is not the right term, but it's new to me in terms of coming across it from a Postgres perspective... formats like Iceberg and Parquet that compress down extremely well would be the cheapest way of archiving data that's static, that is not going to be changing, or where you're going to get a new batch of it once per year.
Those are the archiving use cases that I'm thinking of.
If we exported in those formats they'd be tiny; the 5 terabytes might go down 95% or so.
Nikolay: Oh, I doubt it.
20 times? I doubt it.
Well, maybe.
Okay, what have you seen?
Michael: Yeah, I've seen people reporting that kind of...
Even on the Timescale compression side, I've seen that reported.
And that wasn't even...
Nikolay: It depends on the data, again.
Michael: Yeah, it depends on the data.
But if you've got a lot of numerical columns... a lot of columns that compress well, that have a lot of similar data in them, which I think is pretty common, especially if we're thinking it's healthcare-related or financial, transactional data; a lot of that is numerical stuff.
Nikolay: Well, again, okay, Parquet is interesting.
I think it's an option as well.
And in this case it also depends on the access patterns later, right?
The first thing we need to answer is: do we have access to the physical level, or only logical?
Logical access we obviously have, because it's our database.
Yeah.
I suppose.
All right, so the physical versus logical thing.
We've discussed it many times, but physical can also be compressed, right?
Both WAL-G and pgBackRest compress the data directory and the WALs.
So why not?
The super important thing to remember is that you cannot just take the data directory and that's it; you need to make sure the WALs are also archived.
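(A minimal sketch of that WAL-archiving piece with WAL-G; it assumes WALG_S3_PREFIX and credentials are already set in the Postgres service environment, and the pgBackRest stanza name is hypothetical:)

    # Ship every completed WAL segment to object storage
    psql -c "ALTER SYSTEM SET archive_mode = on;"
    psql -c "ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';"
    # with pgBackRest it would be: archive_command = 'pgbackrest --stanza=main archive-push %p'
    pg_ctl restart -D "$PGDATA"   # archive_mode changes require a restart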
If it's only logical, okay, there is the Parquet option, but there is also the pg_dump option, which also has compression.
Michael: Yeah, but we're not getting the compression at the page level, right?
Or kind of the row level.
And I think column-level compression is hard to compete with.
Nikolay: Well, is it page level or file level?
Michael: File level.
Okay, what's the difference?
Nikolay: What is the difference?
Hmm, how...
Well, from pg_dump I would expect 10x compression sometimes, in some cases, if you...
There are options in pg_dump, right?
I would definitely check zstd, and it has... do you remember if pg_dump allows you to control the compression level?
Can you set compression level 19?
It will be terribly slow, right?
Or if you check...
Michael: We might be okay with that.
That's the nice thing about this use case: we might be okay with really slow retrieval, really slow restores, because the main point is to keep it for many years.
Nikolay: Right, right.
Anyway, pg_dump has supported this for multiple years already; it has the options and you can control the compression level there.
You cannot control the compression level for WAL compression, which compresses full-page writes, full-page images, inside the WAL, right?
There was my proposal there; I need to finish it.
There you can control the algorithm, but not the level of compression.
I'm not sure everyone needs it, but still, for completeness, we should support it in Postgres, right?
But for pg_dump, you can definitely control the level.
So if you are okay to wait and to load the CPU a lot, say with zstd at level 19, I think it will be super slow, but maybe beneficial in terms of size, and of course the restore will also be slow, right?
Yeah, yeah.
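(A minimal sketch of cranking the level up; zstd support in pg_dump arrived in Postgres 16, and the database and file names are hypothetical:)

    # Custom-format dump with zstd at level 19: CPU-heavy and slow, but small
    pg_dump -Fc --compress=zstd:19 -f /archive/mydb.dump mydb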
Michael: I think that is something to factor in if people are deciding which way to go, though: what kind of questions are you going to get about this old data?
Are there going to be auditors that want to come along and be able to query it?
If you're going to get advance warning that they're coming, does it matter that it takes a day to restore?
Are there going to be analytical queries?
Are people going to want aggregations on it?
Because if it's going to be mostly aggregates, maybe you do want it in some columnar format where you can get those done quickly.
Nikolay: Well, you confused me, but obviously, when you compress a dump, it's file level, so the whole thing is compressed.
It's not page level.
Michael: That makes sense, yeah.
Nikolay: It should be good.
Page level is when, for example, you store something on ZFS; there it's page level.
Here, no, the whole thing is compressed.
It can be super slow, of course, but quite well compressed.
I'm not sure how it will compete with Parquet.
I don't have a lot of experience with it.
In production we have a customer who has a self-managed database and archives it from time to time in Parquet format.
Yeah, and it's great, it works well.
It's also on ZFS; it's an interesting setup.
But anyway, I think there is potential here: if somebody wants to write an article or do some research, there are many options to explore and benchmark here.
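(One commonly used route for the Parquet option is DuckDB's postgres extension; this is only a sketch, with a hypothetical connection string, table, and output path, so double-check the syntax against the DuckDB docs:)

    echo "
    INSTALL postgres; LOAD postgres;
    ATTACH 'dbname=mydb host=localhost user=postgres' AS pg (TYPE postgres);
    -- Export one table to a zstd-compressed Parquet file
    COPY (SELECT * FROM pg.public.orders)
      TO '/archive/orders.parquet' (FORMAT parquet, COMPRESSION zstd);
    " | duckdb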
We could build a decision tree: physical or logical, how to decide.
I would always stick with physical if recovery time matters and if we want continuous archiving, right?
And if we have access, of course, because if there's no access, this path is blocked.
Logical: okay, a dump or the Parquet format.
For dumps, we have compression options.
I would definitely not choose a single plain file.
Well, again, it depends, right?
But normally we choose the custom format, or the directory format, because we want compression and parallelization, right?
Because if you want a single-file dump, then you cannot use parallelization, and you will be limited to a single vCPU's worth of work when dumping and restoring.
When you use those formats, you can say how many jobs, how many vCPUs, to utilize when dumping and, separately, when restoring.
It can be a different number.
And this matters a lot if we want to dump faster and restore faster, right?
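(A sketch of that parallel path; note that parallel dump itself requires the directory format, while pg_restore can use parallel jobs with either the directory or the custom format. Names and job counts are hypothetical:)

    # Dump with 16 parallel jobs into a compressed directory-format archive
    pg_dump -Fd -j 16 --compress=zstd:3 -f /archive/mydb_dir mydb
    # Restore with a different number of jobs
    pg_restore -j 8 -d mydb_restored /archive/mydb_dir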
And then finally (so this is a decision-making tree, right?) where to store it?
An archive definitely shouldn't be stored on expensive SSD disks, right?
Object storage is great if it's cloud, right?
If it's not cloud, some cheap disks which we don't use much otherwise, so they serve as the archive.
Or even, is tape still an option?
I think many organizations, older organizations, still use tape.
Well, it has a huge lifespan, right?
So tape, why not?
Anyway, it should be something which we know is reliable and will live very long, right?
But we also need to take restoration timing into account, right?
Michael: Maybe, yeah.
Nikolay: Because Glacier is super cheap, S3 Glacier is like the cheapest, but it's also the slowest to recover from, right?
Yeah.
To fetch data from there.
So it's interesting, right?
And a physical archive can also be placed in various locations.
So the location question, I think, is a separate decision.
Michael: Well, yeah.
I think there are a couple of other things to factor in.
Because if you do store it in Parquet, for example, you can query it without restoring.
You don't even have to restore it to get at the data.
Nikolay: Yeah, that's a great option.
The logical dump path is also great because it allows you a partial archive, if you want only specific tables.
And of course you need to take into account that you only archive data, no helper data.
Helper data being indexes or materialized views, derived things, which means that recovery will take longer, right?
So I would say, if you are going to drop some database and keep it for historical purposes, even if you have physical access, I would probably still do a dump.
A 5-terabyte dump, if it's a powerful machine, definitely within one hour, you know, but...
Michael: Yeah, we took local...
Nikolay: Yeah, and if it's with many threads, you can...
Well, if it's a...
By the way, dumps are always interesting.
I think it doesn't matter whether it's Parquet or native pg_dump.
If you have an unpartitioned case and, for example, 5 terabytes, but 4 of them is a single table...
We see it a lot, actually.
One huge table, usually called logs, or history, or something like historical events.
Sometimes it's a queue-like workload, but something accumulated.
And then people think about how to get rid of it.
Partitioning is needed there, right?
Because without partitioning, dumping will be single-threaded as well.
Although you could create your own tool, and I think pg_dump could actually support it: using ID ranges, dump with multiple parallel threads.
Theoretically it's possible, and you can write this.
But out of the box, pg_dump will limit you, and dumping 5 terabytes will probably look like this: you say, I want to utilize all my 16 cores, using 16 threads, to send it to S3.
For the first 10 minutes, all 16 threads are utilized.
And then, for a couple of hours, just a single thread is working.
Because of one huge table.
That's why.
So partitioning would make archiving much faster and more flexible to tune, like, let's move faster, because, again, if it's a live database, every time we archive logically, we have a long-running transaction.
And we affect vacuum behavior, even if it runs on replicas which have hot_standby_feedback on, right?
So this can lead to higher bloat and various negative consequences.
Michael: I don't see it as much anymore, but I used to get quite frustrated seeing partitioning advice that mostly focused on the read performance benefits of partitioning.
The more I've learned about it over the years, the more I think the main reason to partition is for these maintenance tasks and for the health of the database overall.
But time-based in particular: if you've got an archiving requirement, if you're ever going to need to remove data based on time, time-based partitioning is just a dream.
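(A minimal sketch of that time-based layout; table and column names are hypothetical, and old partitions then age out via the detach-and-drop pattern mentioned earlier:)

    psql -d mydb -c "
    CREATE TABLE events (
        id         bigint GENERATED ALWAYS AS IDENTITY,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);
    CREATE TABLE events_2024 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
    CREATE TABLE events_2025 PARTITION OF events
        FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
    "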
Nikolay: Yeah, "you're absolutely right", as some LLMs say.
So, exactly: this archiving, restoring, dump and restore, how vacuum works, how index creation works.
Even physical backups benefit from partitioning, because updates and deletes become localized to specific files.
So Postgres keeps tables and indexes in files of up to 1 gigabyte.
So if it's 5 terabytes, 4 of which is a single table, it will be split into 1-gigabyte files, and imagine we have a single unpartitioned table: writes can come to any of these files all the time, any blocks, any pages, all the time.
It's not localized, right?
And in this case, if you use snapshots, RDS snapshots, they have incremental support, right?
So the first snapshot is full, and then the next snapshot is just a diff at the block level.
If you use WAL-G or pgBackRest, they also have delta backups, or incremental backups, or differential backups, where, also at the block level, only changes are backed up, right?
So in this case, if you have a huge table and you write constantly everywhere, your delta will be bigger, compared to the case when you write only to a specific file, like our latest partition, and only occasionally update older data, older records.
In that case, the deltas will be much more localized, and this is also beneficial, right?
So even physical backups will benefit from partitioning.
So, general advice: our general advice was always that if one table exceeds 100 gigabytes, it's time to partition.
People ask: do we count it with indexes, with TOAST, or without?
Well, it's a very, very rough number.
So recently we started saying, okay, if it's already exceeding 50 gigabytes, let's think about partitioning, right?
Without indexes, maybe.
Michael: The point is, don't worry about it if your whole database is 1 gigabyte, but also, if your largest table is already a terabyte, maybe it's a good time to...
Nikolay: It's late.
Michael: You know what I mean; the reason for giving a rule of thumb is to give people an idea of where they stand, roughly.
And if you're close to that number, be thinking about it.
Nikolay: So we have funny stories when AI companies come to us for help, because they grow too fast and they need help.
Yeah.
They lack database expertise.
We have a bunch of AI companies, AI startups.
And there was an interesting story.
One company has already reached between 10 and 20 terabytes, and they have a table exceeding a terabyte.
Someone from our team said they should have partitioned a year ago, but they said: we only launched a year ago.
The pace is very, very high, and I think Postgres needs improvements in the area of partitioning to simplify the developer experience, and I hope we will work in this area next year with our tooling.
Michael: It has got a lot better at native partitioning.
It feels like maybe not 18 as much, but most releases before then, since it was introduced, had significant improvements.
Maybe there were in 18 as well.
Nikolay: In terms of things like DETACH PARTITION CONCURRENTLY, in terms of support for online operations, locking things, but still it's super painful.
Still.
You know, you basically need to allocate an engineer for a month or two to implement it.
And we have great guidelines which we give our clients, but there's still a lack of automation.
Michael: You mean to migrate from a non-partitioned setup to a partitioned setup?
Nikolay: Yeah.
The hardest part here is that everyone is using various ORMs, and they want this tooling to be closer to their language.
There is, of course, pg_partman, but still.
Michael: Oh, okay.
It's about the creation of new
partition.
Okay, interesting.
Nikolay: Yeah.
It's about the creation of new partitions and then maintaining the whole thing, because when you already have a partitioned schema, several risks are amplified.
One of them is the very well known LockManager contention, right?
And I had a series of blog posts recently, studying Postgres 18 and deeper, also in the context of prepared statements.
And it was interesting.
And another thing is that any DDL, any schema changes, might be really painful, right?
Even simple index creation or dropping an index can be painful, or foreign keys, right?
So yeah, every time we see people implementing some nuances themselves, and it feels like Postgres needs to solve it.
Michael: Fair.
All right, anything else on the archiving front?
Nikolay: A final thing: what if you archived something but haven't tested it, right?
Michael: It's like backups, right?
Nikolay: Yeah, Schrödinger.
Michael: When we say tested, you mean like, can it...
Nikolay: Be restored?
Michael: Can it be queried?
And also, is it the same as the
data it was originally?
Like, it's verifying as well, right?
Like, do some basic checks.
Nikolay: Exactly.
So unverified, unrestored backups
and archives are like undercooked.
Michael: Yeah, great point.
Nikolay: Just simply undercooked, and it should be considered a huge, critical mistake.
That's it.
Michael: huge and a very small
mistake
Nikolay: Yeah, yeah.
If you did something but never ran it, if you wrote code but never ran and shipped it, how does it feel?
It's the same with backups: you created a backup but you haven't tested the archive, so it's a critical mistake, that's it.
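(A minimal "has this archive actually been tested?" sketch for a custom-format dump; the file, database, and table names are hypothetical:)

    # The file is readable and has a sane table of contents
    pg_restore --list /archive/mydb.dump > /dev/null
    # Full test restore into a scratch database, then spot-check the contents
    createdb archive_restore_test
    pg_restore -j 8 -d archive_restore_test /archive/mydb.dump
    psql -d archive_restore_test -c "SELECT count(*) FROM orders;"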
Michael: Nice, good addition.
Nikolay: This addition, yeah, I almost overlooked it.
Yeah, okay.
Michael: Nice, well, thanks again
and catch you next time.
Nikolay: Bye bye.