Postgres FM

Michael and Nikolay are joined by three special guests for episode 100 who have all run Postgres at significant scale — Arka Ganguli from Notion, Sammy Steele from Figma, and Derk van Veen from Adyen. They cover how their setups have evolved, what their plans are for the future, and get into the weeds of some fun and interesting challenges along the way!

Links to some of the things discussed: 

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With special thanks to:

Creators & Guests

Host
Michael Christofides
Founder of pgMustard
Host
Nikolay Samokhvalov
Founder of Postgres AI
Guest
Arka Ganguli
Software Engineer on the infrastructure team at Notion, leading database scaling efforts, and having a blast
Guest
Derk van Veen
Database specialist and PostgreSQL DBA at Adyen
Guest
Sammy Steele
Senior Staff Engineer at Figma, tech lead for the overall databases area, and leading the multi-year project to horizontally shard Figma's database stack

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Michael: Hello and welcome to Postgres
FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard,
and I'm joined as usual by

Nikolay, founder of Postgres.AI.

Hey Nikolay.

Nikolay: Hi Michael.

Michael: And this week we have
not 1, not 2, but 3 special guests

with us to celebrate episode number
100.

And we asked you, our listeners,
for ideas for this episode.

And we got some great suggestions,
including the topic of scaling

to 100 terabytes for episode 100.

So we thought that...

Nikolay: And beyond.

Michael: Oh, and beyond.

Yes, always beyond.

Yeah, no downtime allowed.

We're delighted to welcome 3 people
who have successfully scaled

Postgres exceptionally well.

And those are Arka Ganguli from
Notion.

Hey, Arka.

Arka: Hey, good to be here.

Really excited to talk about Postgres.

Michael: And hey, Sammy, Sammy
Steele from Figma.

Sammy: Thanks so much for having
me on.

Michael: And finally, Derk van
Veen from Adyen.

Derk: Yeah, thank you very much
for the invitation.

I'm honored.

Michael: We are honored to have
all 3 of you.

So yeah, thank you all for joining.

So we're gonna try a kind of panel
round table style discussion.

I'll play good cop trying to keep
us on track, but you are all

welcome to chime in with questions
for each other.

And I know Nikolay will as well.

So, let's start on the simpler
side of things.

And starting with you, Derk, would
you mind letting us know what

kind of things you're using Postgres
for at Adyen there?

Derk: Well, that's an easy question.

We keep our data within Postgres
databases.

The thing is we are processing
financial transactions.

And the thing about financial transactions
is you'd better not

lose them.

Because if you have a social media
kind of platform, right, and

you refresh your page, and you're
missing a picture and refresh

again, and the picture's there,
nobody really cares.

But if you refresh your bank app
and your money is gone, then

people, for some reason, they kind
of freak out.

So yeah, we try to not lose a single
bit of data within our Postgres

databases.

Nikolay: So you don't use data
type money as well, right?

Derk: No.

Okay.

Michael: Nice.

Thanks so much.

Sammy, how about at Figma?

Sammy: Hi, everyone.

I'm Sammy and at Figma, our platform
is an online web-based SaaS

software that allows designers
to collaborate together.

And you know, kind of the magic
secret sauce is this real-time

platform where you can comment
on a design and it immediately

shows up for other people.

And so all of that core data is
powered by Postgres and it has

to be really reliable, really fast
and low latency, so it can

feel real-time.

And we've grown 100X in the last
4 years.

So that's where we're getting our
100 for this episode from.

Michael: Nice, even more hundreds,
we like it.

And Arka, how about Notion?

Arka: Yeah, so unsurprisingly,
we also use Postgres to store

our data and make sure that it
continues being there when our

customers expect it to be.

For those of you that don't know,
Notion is kind of like a collaborative

workspace.

You can use it for documents, wikis,
product management, and

it has a lot of AI built in recently.

So really it's like the place where
a lot of customers store

their entire knowledge base and
run their company or their personal

use cases as well.

Nikolay: I just wanted to mention that I'm a user of all 3 companies' products. For example, Notion — I'm a big fan of lists and so on, so it's a great thing for that, for people like me.

And Figma, I actually remember
I created a lot of interfaces

in Figma myself.

It's super cool, too, and especially
this real-time thing as

well.

And somehow I noticed Figma replaced
all things I used in the

past.

And Adyen, I just told Derk before
we started recording that every

time I go with my daughter to Legoland, I see Adyen on the payment machine when I pay for something.

And I think, oh, Postgres processing
is there.

So it's cool.

Thank you for coming once again.

Michael: I love how humble you
all are, assuming people might

not know who your companies are
as well.

It's nice.

Let's flip the order and let's
look at how you set things up

Postgres-wise.

So how things were in the past,
how things are set up now.

I've realized this might be quite an in-depth answer. And any future plans, perhaps, as well. So maybe describing that at the highest level would be awesome. Okay, Arka, we'll start on your side.

Arka: Sure.

So I joined Notion about a little
bit over 3 years ago.

And when I got there, we had actually
just finished like our

first sharding effort, but essentially
at Notion, we run a pretty

simple deployment.

We use RDS, which is Amazon's managed
database offering.

We run Postgres on that.

And we run pgBouncer in front
of our RDS cluster for connection

pooling.

And that's pretty much it.

We do have sharding, like I mentioned,
which basically is just

like splitting up the data across
more databases.

And the logic for how to talk to
the database is actually controlled

by the application.

So depending on some ID, we know
what database to send your query

to.

And yeah, that's kind of how it
is even today.

We have a few more databases than
when I started, but the overall

high level architecture is pretty
much the same.

Michael: Nice.

I've read both of your sharding
and resharding blog posts.

I'll link both up in the show notes.

But I'm guessing that will continue
to work.

Any reason it won't continue to
work for the foreseeable?

Arka: Yeah, I think our plans are
not really to move away from

Postgres.

It's worked really well for us
so far, and we've got a lot of

people at the company now that
have built up a lot of knowledge

around how to operate this thing
at scale.

So definitely something we want
to stick with moving forward.

Probably we do have to re-shard
in the future at some point,

but in the longer term we're actually looking at moving to more of a cellular architecture, where we can kind of deploy maybe different versions of Notion to scale our capacity, with their own databases and caches and things like that.

So we don't have to continuously
spend our time re-sharding over

and over again, because that's
a very time intensive and operationally

intensive kind of thing for engineers
to take on.

So we probably do have to re-shard
in the near future, but in

the longer term, we hope our more
like cellular architecture

will be able to scale much further.

Michael: Super interesting.

Sammy, let's go to you.

I know you're sharding as well,
but slightly differently, I believe.

Sammy: Yeah, well, we definitely
have a lot in common with what

Arka just described.

And so your blog posts were a great
inspiration for us as we

were in the early days of horizontal
sharding.

So similar high-level stacks, in that we're also on RDS Postgres, Amazon's managed system, and we also use pgBouncer for connection pooling, at least for now.

And where we start diverging from
Notion is that I think our

data model is somewhat more complex
and less naturally a good

fit for sharding.

My understanding is Notion had
shard IDs based on the orgs or

the workspaces and that that worked
pretty well.

But at Figma, we have a lot of
data not associated with an org

and data moves quite frequently
between orgs, which makes that

kind of sharding model quite hard.

And so we actually ended up going
with a more tactical approach

of, for each table, picking a set
of a few shard keys that would

be more optimal.

And we ended up building a proxy
layer so that instead of the

application having to be aware
of shards, we actually have this

DB proxy service we built, which
is a Golang service.

And it's able to do all the routing
and handle all of the horizontal

sharding, scatter gathers where
a query hits many shards, and

then you aggregate results back.

And so that's kind of a superpower
for us to have a lot more

control in that layer.

And long-term, we have sharded
our simplest, highest write rate

tables, but we still have a ways
to go until everything is sharded.

And we have to build out a lot
more infrastructure to support

that.

So unlike Notion, we actually expect re-shard operations to be quite common. And so our goal is like 1-click failover, where you can just hit a button and then have a re-shard operation happen transparently in the background.

Nikolay: Sounds like logical replication
involved, but maybe

we'll talk about that later.

Derk: I'm so interested in this
proxy layer.

Did you build it yourself in-house?

Sammy: We did.

Yeah.

So it's a Golang service.

So it has a couple of components.

It has a query engine, which has
a parsing layer that takes in

raw SQL and transforms it to an
AST.

And for that part we did borrow CockroachDB's Postgres parser, which has mostly worked well for us with a few modifications. But then, for all of the routing logic, we have a topology layer that we've built that knows where all the databases are and what shard keys map to what tables.

And so the query engine really
does all that evaluation and query

rewriting.

Derk: Yeah, that sounds pretty
mind blowing to me.

It's really cool.

Nikolay: Yeah, there are a couple of projects like PgCat and SPQR which do that, but maybe when you started they were at a very early stage, right?

Sammy: Yes, actually to clarify, we wish we had named this something besides dbproxy, because PgCat would not be a replacement for what we have. We actually might use PgCat to replace pgBouncer, because we are hitting some pgBouncer scaling problems.

I'd be curious if you guys are
also running into challenges.

But dbproxy is really, it's a stateless
service that talks to

every single database.

And so a request gets routed to
it, and it has a connection pooling

layer which talks to a pgBouncer
or PGCAT-like layer that is

specific to a single host.

So it's really playing more of
a high-level orchestrator role.

Michael: So cool.

And I love that we've already got
2 different approaches to sharding,

1 that's transparent to the application
and the developers, and

1 that's not.

And I think we're about to get
a third answer from Derk's side.

How are things set up at Adyen
and past, present, future?

Derk: Let's start with the past.

I think it was by far our most well-read blog article, the one about how to upgrade a 50 terabyte database.

And when looking back on it, it's
quite a while ago.

So I think 2 years ago, we had
an internal event and it was like,

no, it's not how you upgrade a
50 terabyte database.

It's how you upgrade a terabyte
database with multiple zeros

in the end.

Nikolay: At least 2, I guess, right?

Derk: Yeah, we have 2 zeros in
the end.

Nikolay: And not just upgrade,
but 0 downtime upgrade, or like,

I just want to clarify, because
regular upgrade is not super

big problem if you can afford a
couple of hours downtime to.

Derk: Yeah, we did think about an ordinary upgrade, but the rough calculation was it would take us like 4 or 5 weeks of downtime, which didn't cut it.

Nikolay: With hard links?

Derk: No, that 1 is much faster.

Fortunately, I think we limited
to 15 minutes of downtime.

And we already had a big fight
for this 15 minutes, but it has

to happen.

And it's not even database downtime; it's more all the infrastructure around it that we had to get fixed, which required downtime.

If you look purely at Postgres,
I think it was minutes, but it's

not just a database.

Nikolay: I know this problem very
well.

So

Derk: Yeah, that's

Nikolay: everything around.

It's more and more minutes and
then sometimes hours, right?

Sammy: Yeah.

I'm curious, what's the biggest
individual table or do you have

partitioning under the hood there
to have so many terabytes on

1 box?

Derk: What is the biggest table
we have?

I don't know, but the biggest partition
is over 30 terabytes,

which is a single partition.

Sammy: Wow.

And you don't run into vacuuming
issues with that?

Derk: I think we are the most skilled
company when it comes to

vacuum and how to tweak vacuum.

Nikolay: And rebuilding your indexes, which holds back the xmin horizon, right?

Derk: Yeah, we have seen many vacuum
issues.

When I started working with Postgres, I was installing this database and transaction wraparound was some kind of weird theoretical thing. And then I joined Adyen, and now it's something you deal with every week, at least.
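
For reference, how close a cluster is to wraparound can be checked with standard catalog queries; a minimal sketch, not specific to any of the setups discussed here:

    -- Age of the oldest unfrozen transaction ID per database; autovacuum
    -- switches to aggressive anti-wraparound vacuums as this approaches
    -- autovacuum_freeze_max_age (200 million by default).
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- The same per table, to find which relations are driving it.
    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind IN ('r', 'm', 't')
    ORDER BY xid_age DESC
    LIMIT 20;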

Nikolay: So yesterday I asked my Twitter audience: what would you ask people who manage many, many, many terabytes? And one of the questions was, how are you?
I guess this is the case when you
have problems like 30 terabytes

on 1 table.

How long is the index creation
time or rebuild time?

It's like days or?

Yeah, days.

Days.

During which the xmin horizon is held back, right? So we accumulate bloat in the whole database, for all tables, right?

It's a super annoying problem.

Derk: But most of these tables
are partitions.

So first we create an index per
partition, and then finally in

the end, we create the index on
the parent.

Otherwise we don't make it before
we hit wraparound.
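
The pattern Derk describes, indexing each partition separately and only then dealing with the parent, maps onto the documented approach for partitioned indexes on Postgres 11 and later; a sketch with hypothetical table and index names, not Adyen's actual schema:

    -- Create the index on the parent only; it starts out invalid.
    CREATE INDEX payments_merchant_idx ON ONLY payments (merchant_id);

    -- Build each partition's index concurrently, without blocking writes.
    CREATE INDEX CONCURRENTLY payments_2023_merchant_idx ON payments_2023 (merchant_id);
    CREATE INDEX CONCURRENTLY payments_2024_merchant_idx ON payments_2024 (merchant_id);

    -- Attach them; once every partition is attached, the parent index becomes valid.
    ALTER INDEX payments_merchant_idx ATTACH PARTITION payments_2023_merchant_idx;
    ALTER INDEX payments_merchant_idx ATTACH PARTITION payments_2024_merchant_idx;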

Nikolay: Oh, wraparound.

It's also a problem, but I'm talking
about also vacuum and bloat

problems, so multiple problems
here.

Well, that's a super interesting experience.

Derk: It keeps us busy.

But it's also an honor to have these problems. I think all 3 of us got these big databases because our companies are very successful.

Nikolay: Right.

Derk: And I think at the end of
the day, having a big database

is actually a design problem.

You forgot to design in time for
the size you need to accommodate.

Sammy: See, I actually disagree.

I think it's a good problem, an expected problem. If you design a company for scale too early, then you're probably not gonna have a company, if you're trying to solve those problems when you're 5 or 10 people.

I think 1 of the cool things about
Postgres is all of our companies

did get very far on 1 host.

Like Figma existed for 7 or 8 years
and was pretty successful

before we had to actually start
scaling out.

Derk: That's definitely true.

Nikolay: I agree with that.

And my team helped a couple of companies who went to IPO on a single Postgres cluster, and they got valuations of billions of dollars with one cluster, and this was impressive.

Of course, later it should be changed, but yeah, so I agree: with one cluster you can scale a lot. But I would not like to be in your shoes dealing with multi-day index rebuilds and also transaction ID wraparound problems. So how do you escape from these problems today? Like 30 terabytes in one partition, it's quite difficult, right? So is it to be partitioned into smaller partitions, or...

Derk: Yeah, I'm thinking about
how to answer this question.

I think before I joined Adyen,
I tried to manage my database

in a proper way to do it by the
book and by the rules and everything

is fine.

And Adyen is pushing me to find which boundaries can be bent or violated without too much risk.

I think that's the way to put it.

We still are on the safe side,
but with more knowledge, you're

better able to decide when do I
cross a boundary and when is

it really dangerous and when is
it safe to do, but not strictly

advised.

Sammy: I think that's definitely one of the benefits of staying on a system like Postgres that you end up understanding really well: all these limits are extremely specific to the workloads and the tables, and even what works for one table at Figma doesn't work for another.

And so there is a lot of just accumulated
organizational knowledge

that makes it easier to run these
systems at scale and understand

actually at what point do things
get scary.

Nikolay: Yeah, I agree.

If, for example, index creation
takes one day, but our real transaction

ID doesn't grow a lot, it's not
a big problem.

But if it grows very fast, then
we have two kinds of problems.

So transaction ID wraparound and
bloat accumulated.

I agree.

So it's very specific to particular
workloads.

Michael: I want to go back to Derk
quickly because you've done

such a great series of blog posts
on partitioning.

I'll share those.

I've reread them today and part
3 promises a part 4.

So I'm hopeful of getting another installment at some point. But would you say partitioning is largely the way you've scaled there, or are there other strategies?

Like, are you moving data out?

Like how are you dealing with it?

And how are you planning to in
the future?

Derk: Yeah, I think we started
with horizontal sharding and then

within every shard, tables are
partitioned, but financial transactions,

you need to be able to refund them
for a very, very long time.

So it takes quite a long time before
you can actually archive

data.

So we have a lot of old data which is still around, but yeah, these partitions kind of shift further to the back of the row, and vacuum is much easier if you don't change a lot on a partition.

But at the moment we are actually
reaching the limits of our

sharding solution again.

So we basically start redesigning
the entire system again.

Arka: You know, that's really interesting. Because at Notion, it's interesting that you chose Postgres partitioning as the thing that got you really far, because at Notion we actually ran into some really weird issues with dropping indexes on partitioned tables. But that's because this database was running on Postgres 12, and I think a lot of these things may have been fixed in later releases.

But we actually went the opposite
way where we're like, we don't

want Postgres partitions, let us
manage our own.

Because I don't know, there's some
weird behavior we ran into

when creating and dropping indexes,
especially because some things

you can't do concurrently on partitioned
tables.

Derk: Yeah, dropping indexes is
a nightmare.

I always tell my developers, you
can create indexes on a partitioned

table, but if you want to drop
them, I come after you.

Arka: Yeah.

Fun problem.

Michael: Actually, Derk, you said
something earlier about knowing

which rules you can kind of bend
or break.

One of the ones you mentioned in
those posts was around adding

check constraints.

And so adding them in a not valid
state, but then not running

validate afterwards.

I think you mentioned just updating
the system catalog.

And that's such a cool trick.

But like, how?

Yeah, I guess you have to make yourself...

Nikolay: Comfortable, yes. We used this trick like 15 years ago, when VALIDATE was not an option.

So we used something like that, but it's an undocumented thing. Don't do it if you're not the one owning the database. For example, if you're a consultant, this is not something you should recommend, because who knows what will happen next. If you own the database, if you work inside the company, it's probably okay. It's not official.
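
For context, the trick being referenced looks roughly like this; the table, column, and constraint names are hypothetical, and the catalog update is exactly the unsupported part Nikolay is warning about:

    -- Add the constraint without scanning existing rows; only new writes are checked.
    ALTER TABLE payments
        ADD CONSTRAINT payments_amount_positive CHECK (amount >= 0) NOT VALID;

    -- The supported follow-up, which scans the table but takes only a light lock:
    -- ALTER TABLE payments VALIDATE CONSTRAINT payments_amount_positive;

    -- The rule-bending variant discussed above: flip the catalog flag directly.
    -- Undocumented, and only sane if you are certain every existing row already
    -- satisfies the check and you own the database.
    UPDATE pg_constraint
    SET convalidated = true
    WHERE conname = 'payments_amount_positive';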

By the way, I wanted to emphasize that the problems we just started touching on, most people who listen to us don't have, right? Because these are extreme problems.

Derk: I'm not completely sure.

I think partitioning becomes useful
for a lot of people way before

you hit the limit where you have
to.

So if you're listening to this podcast and you think, yeah, partitioning is too far away, just think again. Because if you do it now, when you have all the time in the world to figure it out and find a good strategy, then you don't end up with a huge partition here, a small partition there, weirdly partitioned things. So I would say: just start in time.

Nikolay: I'm on the same page with you here, because you mentioned this rule of 100 gigabytes as a threshold for when you need to start partitioning. We apply the same rule everywhere: if you exceed 100 gigabytes, it's time to partition. I agree.
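
For anyone taking the 100-gigabyte rule to heart, declarative range partitioning is only a few lines of DDL; a minimal sketch with hypothetical names:

    CREATE TABLE events (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL,
        payload    jsonb,
        PRIMARY KEY (id, created_at)   -- the partition key must be part of the PK
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024_q1 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
    CREATE TABLE events_2024_q2 PARTITION OF events
        FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');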

Sammy: Interesting.

We have a lot of tables larger
than that that aren't partitioned,

but probably about a terabyte is
what we aim for max for our

largest shards, although we have
a 116 terabyte table today.

But another interesting topic that I think is more relevant to people is upgrades.

We have built this no downtime
tooling that we've mostly used

for re-shard and horizontal sharding
operations, but more recently

we've been using it for no downtime
major version upgrades with

the ability to roll back if you
run into problems.

And so I think that's something
that could be interesting for

anyone who's running Postgres.

Upgrades are always a hard thing
to do.

Nikolay: Yeah. By the way, we forgot to mention that Adyen is in a self-managed situation; it's not on RDS, right? Because it's very different. Yeah. I'm very curious how you did this in the RDS case, not using Blue-Green deployments, right? Just fully...

Sammy: At least today... Well, Blue-Green only very recently became available for Postgres. Last year it was only really MySQL. And so we'll probably explore it going forward. One blocker, which we're actually talking to the RDS team about today, is rollback. Exactly. There is no way to roll back.

Nikolay: And that

Sammy: is the main thing.

Nikolay: That's against the whole idea of Blue-Green deployments. It should be symmetric, and when you switch over, reverse replication should be set up immediately. Without this, you don't have a rollback plan, right?

Sammy: Exactly.

It means data loss.

Yeah, so that's what our failover
operation does.

I think it's also nice to have
more control when you do it locally

because we're able to, for instance, do a no-downtime replica failover.

And because most of our P0 critical
workflows read from replicas,

that means that most failovers
don't actually incur any downtime.

Whereas, you know, if you're using
a managed thing like Blue-Green,

you lose that kind of control.

Nikolay: Yeah, well, I'm super
curious about details here, but

maybe it should be a separate discussion.

And first of all, I wanted to thank
all of you.

We invited you because you shared
very great posts.

So this knowledge sharing is super
important.

So I would be happy to see more
and more coming.

And maybe if we talk more and exchange
ideas, we could collect

better materials, for example,
how to do 0 downtime upgrades,

because I know very good recipe
for self-managed Postgres, but

I don't know for RDS, because they
don't allow you to control

recovery target LSN.

So I'm very curious to talk about
details, but maybe in a separate

discussion.

Because I know Michael has a different
plan, not to talk half

an hour about upgrades.

Michael: Well, maybe we could actually
do a quick lightning round

of which major version each of
you are currently running, if

you know and are happy to share.

Nikolay: Is it a single version
only?

Michael: Also a good question.

Sammy, are you happy to share that one?

Sammy: Sure, I can definitely talk
about that.

Well, we are newly, as of 2 months ago, fully on Postgres 13. We had a couple of very legacy Postgres 11 boxes; those were some of our earliest, largest databases, and that actually was the main motivation for building out this upgrade workflow, to get those onto 13. And we would love to use some of the features that are available in Postgres 15 and 16.

And so we're probably going to
look into fleet-wide upgrades

sometime next year to get to a
newer version.

But this is where one-click upgrades or failovers really matter, because today it's quite toilsome and it would be pretty painful to upgrade the whole fleet.

Michael: Yeah, absolutely.

How about you, Arka?

Arka: Yeah, we're actually like 90% on Postgres 15. And it is my life's mission to get us that final 10% over. Yeah, most of our databases are actually on Postgres 15.

We finished that upgrade last year.

And yeah, we really wanted some of the features that I think got introduced after Postgres 14, especially around being able to filter the replication stream, because that's something we want to use in the future. Probably, if we want to re-shard, that'll be a very useful thing for us.

And just wanted to touch on Sammy's
point a little bit.

We actually use the exact same
kind of, I think, failover that

you guys use as well at Figma based
on pgBouncer and with the

reversibility as well.

And that's something that has been
really, really surprisingly

amazing.

Nikolay: Yeah, you mean pgBouncer pause/resume. I guess this is why you don't use RDS Proxy, being on RDS, right? Because it doesn't have pause/resume.

Yeah, that's great.

I also was surprised last year
that it can handle really heavy

workloads.

Yeah.

Just a few seconds' spike of latency, and you can upgrade and switch over to a different primary.

Arka: Yeah, it was one of those things
where when we were writing

out the tech spec for it, I was
like, is this thing going to

actually work?

Nikolay: The problem with this, by the way, is that there are no good materials about it, proving it. Nobody says it; it's like a public secret. Nobody talks about it somehow, but it works really well. So we need more materials proving that pause/resume works well in pgBouncer.

Sammy: Yeah, it's been great for
us.

We have some pgBouncer scaling problems, but everything on the failover side, with pause and resume, has been flawless.
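
The pgBouncer side of such a failover is, at its core, a pair of admin-console commands; a minimal sketch assuming a database named mydb in pgBouncer's config:

    -- On the pgBouncer admin console (e.g. psql -p 6432 -U pgbouncer pgbouncer):
    PAUSE mydb;     -- let in-flight queries finish, hold new ones in the queue
    -- ...promote the new primary and update pgBouncer's config to point at it...
    RELOAD;         -- pick up the new connection target
    RESUME mydb;    -- queued clients continue against the new primary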

Michael: That's so good to hear.

And finally, Derk, on the major version side of things.

Derk: We are fully on 13, and we're having discussions about the next version we want to upgrade to, because every version has nice things. I want certain partitioning functionality; a colleague wants more logical replication functionality.

So, of course, 17 would be the
ideal, but then do we really want

to run 17 already with the biggest
clusters we have?

So, yeah, to be continued.

Michael: And I think also the fact
that, I mean, Notion being

on 15 was somewhat of a surprise
to me.

I think because of these major
version upgrade challenges, I

see so many companies, even at
the cutting edge of what Postgres

can do, lagging several years behind
because upgrades are so

difficult.

So I know we won't get a chance to discuss all the reasons why, but I think the fact that we're all lagging is proof of that.

Sammy: Yeah, well, I actually think
it's not for us so much fear

of newer versions.

You know, I actually worked in
MySQL before Postgres, and MySQL

major version upgrades were truly
terrifying.

These were like 2 or 3-year-long
efforts where you found a lot

of major incompatibilities.

But one thing I love about Postgres
is it is mostly backward compatible

across major versions.

And so it's not so much that we're scared of the newer ones; it's that we can't take downtime to do these upgrades. And so there, it's much more manual toil.

Michael: Yeah, exactly.

I think that's the case for a lot of people. So the last thing I was going to ask each of you is whether you have a funny or scary story, and I don't know which one I want most, that you can share with us related to your scaling. If any of you have got one prepared already, feel free to jump in. Or Nikolay?

Nikolay: Lightweight locks, lock manager, multixact transactions, multixact IDs and so on, something like that, maybe.

Derk: Yeah, I think we definitely
got bitten by the LockManager,

like most people who start doing
partitioning.

It's like,

Nikolay: yeah, well,

Derk: We started partitioning and we did a good job. And I built this framework around partitioning.

By the way, Michael, the fourth
blog post is, we are writing

it at the moment.

Yes.

So it will be there.

It's a nice idea.

Now I lost the rest of my story.

Oh yeah, partitioning.

Nikolay: So it sounds like you have prepared statements.

Derk: Yes, we do have prepared statements. But you know everything about prepared statements, right?

You try it 5 times, then the optimizer
decides, do we go with

the generic plan or the custom
plan?

But we also created a lot and a lot of partitions and a lot of partitioned tables. So we definitely ran out of locks.

And our CPU, basically every time we created a new partition for a table, the CPU usage jumped 10%.

But jumping 10% was fine because
we had a lot of CPU to spare.

So by the time it alerted, it was pretty high.

And then we were looking back and
we see these jumps of 10% and

10% and 10%.

And then we were looking at the next jump of 10%, that's above 100%. And then we're not in a good position.

So then we had this small window with all kinds of things prepared, like: don't create any partitions anymore on this cluster; can we break open partition boundaries so we can basically extend the final partition to hold more data?

And then we were researching this problem, like why is the CPU usage jumping like this?

And in the end, we started to force
the optimizer to pick generic

plans because custom plans were
faster in execution, but definitely

not at prepare time.

So we won like a few milliseconds
in execution time, but we were

paying like 100 milliseconds in
prepare time.

And that was really heating up
the CPUs.
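
Derk doesn't say exactly how they force generic plans, but since Postgres 12 the usual knob for this is plan_cache_mode; a sketch, with a hypothetical role name:

    -- Forcing generic plans skips per-execution planning (and the per-partition
    -- locking it implies), at the cost of possibly less optimal plans.
    SET plan_cache_mode = 'force_generic_plan';                        -- per session
    ALTER ROLE app_user SET plan_cache_mode = 'force_generic_plan';    -- per role
    ALTER SYSTEM SET plan_cache_mode = 'force_generic_plan';           -- cluster-wide
    SELECT pg_reload_conf();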

Nikolay: Because during planning, Postgres locks all tables and indexes, all partitions and all their indexes, if partition pruning is not working. So it's terrible.

Derk: Yeah. So you're running out of your fast-path locks, because you need to lock like 20 partitions and 20 indexes on these partitions for a single table. And then you join with another partitioned table, and all these locks go to the lock manager, which is burning CPU cycles.

Nikolay: You're in a good club, with a bunch of companies who have experienced this over the last couple of years.

Derk: Yeah.

Sammy: I was going to say, we hit this for vacuuming on our bigger tables.

When you get above like 4 terabytes, you start seeing that at the end of the cycle, when you run analyze and the cached plans get invalidated, you suddenly have these custom plans that are 5x more expensive. And that's when CPU spikes to 100% for a few seconds.

Nikolay: I have a question for all of you.

When you have these problems, do
you discuss this with people

who can influence this?

I mean, hackers somehow.

Do you raise these problems in
mailing lists or somewhere else

where hackers live?

Arka: We work with AWS support
quite a bit.

Nikolay: Well, some hackers there
are for sure, but not all of

them, right?

Sammy: Yeah, we're good friends
with the AWS RDS team and they

have actually a team that contributes
to Postgres, so we try

to leverage them to get some of
these patches in.

But it's honestly a lot of us reading
the source code ourselves

as well and trying to figure out
workarounds.

Nikolay: By the way, just last week I think I realized that the SLRU patches originally created by Andrey Borodin, which will be in Postgres 17, have already been adopted by AWS Aurora for almost a couple of years.

I was super surprised.

Actually, Andrey was also surprised.

But you're not going to use Aurora,
right?

Instead of regular RDS.

Arka: Maybe.

I think, I mean, for Notion, I
think we are actually starting

to look into if Aurora makes sense.

Mostly because the AWS team promises
a lot better replication

for global replicas compared to
the RDS built-in replicas.

So it's something we're considering, but we haven't actually used it before. So I was curious if any of you have any experience with that, or with scaling read replicas in general?

Because that's also something we
haven't leveraged too much at

Notion yet.

Sammy: Well, less scaling read replicas, but actually my scary story was going to be around moving things more to replicas. Because at the same time that we were growing 100X, we were dealing with a lot more pressure on the reliability of our system, and trying to move towards 4 nines globally as a company.

And so our application, because it's in the web browser and we have a kind of single-threaded Ruby monolith that serves a lot of traffic, is very latency sensitive: if your database slows down for a minute, then the whole application can back up and your whole website can be down.

And so one of our biggest reliability risks and most persistent pain points is that RDS does have some big latency spikes on EBS that most customers don't notice, but we were really hitting. And so our 2 big solutions there were to basically remove writes from our most business-critical routes wherever possible, and to put all reads, or 99% of reads, onto replicas for these P0 user workflows.

And then we built out replica hedging.

So every request hits 2 replicas
and returns the first results

that are successful there, which
allows us to tolerate any 1

replica being down.

And that's been hugely useful.

We also got onto IO2 for EBS, which
has been a game changer for

EBS reliability.

Arka: Oh, nice.

How many replicas do you guys normally
have off 1 database?

Sammy: Today, there are only 2 large replicas.

In the future we want to move towards
more smaller replicas.

Derk: We have multiple, let's put it that way. Reading from a replica is much harder when it's financial data, because you want to have the correct data and it might have just been updated.

So we try to move a lot of stuff
to the replicas, but it's kind

of hard.

Arka: Yeah, that would definitely
be a problem for financial

data.

Nikolay: Are we talking about asynchronous
replicas only, or

maybe some semi-synchronous or
quorum commit approach?

Derk: We have both.

Arka: Also, cache invalidation. I think that's one of the bigger concerns for us at Notion, because we're super, super heavy users of memcache.

And reasoning about how that works
for replicas is another thing

that we don't really want to do,
but we have to do.

I don't really have a super scary story, but there's something I've been working on recently where we ran into some surprising behavior. So maybe it'd just be fun to talk about.

So recently at Notion, we've been
trying to upgrade that last

10%.

And of course those databases are
just the hardest because they're

just so big.

So we're trying to upgrade this partitioned table that is around 20 terabytes, with like 16 partitions on each database. So, you know, each partition is about a terabyte-ish.

And we were having trouble getting
this to work with just regular

logical replication.

So we actually started exploring
DMS, which is like an AWS service

that lets you migrate data between
two separate databases.

Side note, it's actually super
cool.

Like you can go from like a completely
random system to like

another new system.

It doesn't have to be like Postgres
to Postgres only.

I don't know how good it is at
huge scale.

Nikolay: It just requires a department
in your company to manage.

Arka: Yeah, so originally I was
like, oh yeah, I can do this.

And then it's been like 4 months
now and I'm still like, oh my

God, this is a lot.

But anyway, 1 of the problems we
ran into was actually related

to Postgres bloat.

Cause it turns out with DMS, you
get a lot of kind of knobs that

you can tune for the concurrency
of like the copy.

And that goes pretty fast.

Like we're able to copy, you know,
a terabyte of data in maybe

12, 14 hours with the way we have
our DMS set up.

But the thing we noticed after we set up the target databases, so we use pganalyze as our Postgres monitoring tool, and we noticed that the data size literally on disk was like 3 times larger on the new tables for some reason than on the old ones.

And you know, we were going from
like Postgres 12 to Postgres

15, so at first we thought maybe
it was like a TOAST compression

thing.

Cause we know that the TOAST compression
algorithm changed between,

I think, 12 to 14 or something.

It went from like PGLZ to LZ4 or
something.

I don't remember exactly but I
know that that changed.

So I was like, oh, is it the new TOAST compression that's causing the data size to be literally much larger on disk? Which obviously leads to much worse latency, because then every one of your queries is doing much more I/O.

Nikolay: Buffers.

Buffers.

Arka: Yeah, so we tried a few tests with changing the compression back to the old one. We weren't fully convinced still. But it turns out we were actually just hitting bloat, and this clicked when I was listening to one of your earlier podcasts, I think on Postgres bloat, and I realized that our vacuums weren't actually fully completing on the new table, because there were just so many new transactions being written that the vacuum could not keep up, which meant that the bloat just kept on accumulating and accumulating.

And on top of that, a lot of the data in this table was being TOASTed, because they're just JSONB columns.

So anyway, it was just like 2 weeks
of investigation to be like,

oh, we should just run a full vacuum.

And turns out that brought the
data size right back to what it

should be.

So it was a fun little kind of
investigation.
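
A quick way to spot the situation Arka describes, vacuum not keeping up during a heavy copy, is to watch dead tuples per table; a sketch with a hypothetical table name. The final VACUUM FULL takes an exclusive lock, so it is only an option while the table is not yet serving traffic, as in this migration:

    SELECT relname,
           n_live_tup,
           n_dead_tup,
           last_autovacuum,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;

    -- Rewrite the bloated table and reclaim the space:
    VACUUM (FULL, VERBOSE) my_big_table;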

Nikolay: It sounds like pganalyze doesn't tell you about the xmin horizon, and things like why a vacuum can be lagging and skipping some dead tuples that cannot be deleted. pganalyze has a very good vacuum dashboard, so I'm very curious, does it talk about this xmin horizon?

Arka: It actually does point those things out.

We just didn't have the new databases
in PGAnalyze yet, because

we're like, oh, we're not using
production traffic on these,

so we don't need it there.

When we put it there, we realized
that, oh yeah, like the vacuums

are falling behind.

Nikolay: I'm just personally very annoyed that Postgres monitoring systems lack better alerts for the xmin horizon and long transactions, which are 2 separate things, basically. A long transaction can happen and nobody tells you. It's not right if you sometimes have 2 Postgres monitoring systems and nobody tells you about this. So I'm curious if pganalyze has some alerts.

Arka: It does.

Nikolay: Okay, okay.

Arka: It sends you an email if,
for example, you have had any

transaction running for over an
hour.

Nikolay: Right. But if you use logical replication, it's not the only reason for the xmin horizon being held back. So, yeah.

Arka: Right.

Nikolay: Okay.
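
The usual suspects for a held-back xmin horizon can all be checked from SQL; a minimal sketch:

    -- Long-running transactions:
    SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start
    LIMIT 10;

    -- Replication slots (including logical ones) can hold xmin back as well:
    SELECT slot_name, slot_type, active, xmin, catalog_xmin
    FROM pg_replication_slots;

    -- Forgotten prepared transactions are a third, easy-to-miss holder:
    SELECT gid, prepared, owner FROM pg_prepared_xacts;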

Arka: For us, pganalyze has been really great, actually. I love working with that team. Shout out to Lukas. He's been so amazing.

Nikolay: And also content, again, back to content sharing. Lukas does a great job, and his team does a great job.

Sammy: Yeah, Lukas is awesome. I talked to him as well a few weeks ago, and everything at pganalyze, their blogs, is really great.

Okay, I think it's interesting
you guys had the data copy be

such a big bottleneck.

We also had a big pain point there
when we were first doing vertical

sharding.

And what we ended up finding out
is we kind of went into the

source code for Postgres and logical
replication and realized

that it's really, really expensive
to have indexes on your database

table while you're doing logical
replication, and much faster

to drop the indexes and then recreate
them afterwards.

So it went from weeks to a day
for a multi-terabyte data copy

when you drop all of your indexes
and foreign keys.

Arka: Yeah, that's a great insight.
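
A sketch of the pattern Sammy describes, with hypothetical object names; the exact publication and subscription setup will differ per environment:

    -- On the destination, before the initial table sync: keep only what the copy
    -- needs (typically the primary key / replica identity), drop the rest.
    DROP INDEX IF EXISTS files_owner_idx;
    ALTER TABLE files DROP CONSTRAINT IF EXISTS files_owner_fk;

    -- Subscribe; the initial copy now writes into a lean table.
    CREATE SUBSCRIPTION copy_sub
        CONNECTION 'host=source-db dbname=app user=repl'
        PUBLICATION app_pub;

    -- Once caught up, rebuild indexes and constraints without blocking writes.
    CREATE INDEX CONCURRENTLY files_owner_idx ON files (owner_id);
    ALTER TABLE files ADD CONSTRAINT files_owner_fk
        FOREIGN KEY (owner_id) REFERENCES owners (id) NOT VALID;
    ALTER TABLE files VALIDATE CONSTRAINT files_owner_fk;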

Nikolay: Yeah, I'm also curious whether you're talking about regular logical replica provisioning, where data is copied logically. For example, there is a binary copy option. And also, if you need it for upgrades, or you need the majority of the data, sometimes it's better to convert a physical replica to a logical one, which I hope will soon become a standard recipe inside Postgres itself, but for now it's only possible if you orchestrate it properly.

I'm very curious if you do this
trick on RDS, both of you, or

no?

Sammy: We rely a lot on logical
replication for all of our core

failover tooling, so we haven't
really explored physical replication.

I'm not sure if we really can because
of RDS.

Nikolay: On RDS it's possible if you apply the so-called Instacart recipe, but there are debates about whether it's safe to do. Actually, Lukas's blog posted about it recently as well.

Arka: You're talking about the
thing where you create a replication

slot, take a backup, and then advance
the replication slot.

Nikolay: Yeah, either this or...

Yeah, on RDS only this, yes.

Arka: Yeah,

Nikolay: yeah.

On self-managed, you can play with
recovery target LSN, right?

Arka: Yeah.

Nikolay: So I'm curious, is it only the Instacart way, or what...

Arka: We've actually used that
before too.

So actually, before we knew the
trick that Sammy just told us

about with the indexes, we could
never get logical replication

to catch up ever to our tables.

So we actually ended up using the
Instacart trick to do our first

round of upgrades that I ever did
at Notion.

It worked well.

I think one of the problems you run into with that, though, is that similarly, when you provision the new database off the backup, you have to do the extra step of actually dropping the indexes again. Because even just catching up with logical replication from the point that you took the backup, for a sufficiently large database, still took us like a day or so, just to catch up the changes made since the backup. So I think, yeah, that trick could work really well, probably, and combined with dropping the indexes it would speed things up significantly.

But actually for us, after that,
we've always just used regular

logical replication, and it's been
fast enough if we can provision

the schemas without any indexes.
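
A rough outline of the slot-plus-backup recipe being discussed, with hypothetical names and a placeholder LSN; the subtle part, and the reason for the safety debates, is pinning down exactly which LSN the restored copy corresponds to:

    -- 1. On the source: create a logical slot before taking the snapshot, so WAL
    --    is retained from this point on.
    SELECT pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');

    -- 2. Take a backup/snapshot of the source, restore it as the new cluster, and
    --    determine the LSN the restored copy is consistent to (the debated step).

    -- 3. On the new cluster: subscribe without copying data or creating a slot,
    --    then fast-forward the subscription's replication origin to that LSN so
    --    only changes made after the snapshot are streamed.
    CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-primary dbname=app user=repl'
        PUBLICATION app_pub
        WITH (copy_data = false, create_slot = false,
              slot_name = 'upgrade_slot', enabled = false);

    SELECT pg_replication_origin_advance('pg_' || oid, '0/1234ABCD'::pg_lsn)
    FROM pg_subscription WHERE subname = 'upgrade_sub';

    ALTER SUBSCRIPTION upgrade_sub ENABLE;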

Nikolay: Actually, dropping indexes leads us to one topic I didn't want to miss: huge WAL volumes generated, and index write amplification. Can anyone say anything about these problems, have you experienced something like that, or...

Because this is 1 of the key criticism
points from Uber when

they posted this infamous article,
right?

Postgres has index write amplification.

We have hot updates, but still,
it's a problem, right?

Actually, Derk, I remember an article from maybe a couple of years ago about fill factor, right? Is it still a tool you use actively?

Derk: Well, fill factor is not a tool, right?

It's just a parameter you can set
for your tables and your indexes.

Nikolay: But the tool is to use
it to provoke more hot updates.

Derk: Yeah.

If we lose HOT updates, like we did last week, it's like we doubled our WAL usage. And that was only for a single table, for some reason. We had a longer-running query, which means pretty long, because it was holding back HOT updates.

And that means like you just double
the WAL volume on your system,

but we have many tables, right?

And then this, only this single
table is responsible or becomes

responsible for half the WAL of
the entire cluster.

For me, that's an amazing number
that you can generate or skip

creating that amount of WAL when
using hot updates.

So yeah, if we are having heavily
updated tables, we always try

to find the right fill factor.

I've been trying to get to a formula, like which tables need a fill factor and how high do you set it?

And it's still a mystery to me
because we have some tables which

do fine at 95% fill factor and
some do really great at 70%.
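
For reference, the knob and the feedback loop Derk is describing, with a hypothetical table name:

    -- Leave free space in each heap page so updates can stay on-page (HOT).
    ALTER TABLE payment_events SET (fillfactor = 85);
    -- Only newly written pages honor the new setting; a rewrite (VACUUM FULL or
    -- pg_repack) applies it to existing pages.

    -- Check how many updates are HOT versus regular, index-touching updates.
    SELECT relname,
           n_tup_upd,
           n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
    FROM pg_stat_user_tables
    ORDER BY n_tup_upd DESC
    LIMIT 20;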

Nikolay: And also you need to protect
this table from new indexes

being created and losing all the
hot updates somehow, right?

If developers decide to add one more
index.

Derk: Yeah, I can't.

I have no way to force my developers.

Sammy: I'm actually curious.

A big pain point we have right
now is for really expensive index

creations, there's no good way
of throttling it.

So if it eats a lot of your IOPS, there's not really a Postgres way we've found to tune this so that, you know, maybe you want the index creation to happen much more slowly in the background, over a few days, for some really large tables.

Nikolay: But it will hold xmin
again, this infamous problem

from...

Sammy: Sure, but I mean, maybe
3 days of holding that is better

than an hour of the website being
down.

Michael: This might be a stupid question, but would reducing maintenance_work_mem, and can you do that in a session, would that help?

Nikolay: More disk I/O maybe.

Sammy: Yeah, I think we have the minimum. We're only using 1 thread for the maintenance settings there, but it has been, a couple of times, enough to spike our IOPS to basically the RDS limits.
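
For reference, the settings being discussed are session-settable; a sketch with hypothetical values and object names. Note this only shapes memory and parallelism; as Sammy says, there is no built-in way to throttle the I/O of an index build:

    SET maintenance_work_mem = '1GB';
    SET max_parallel_maintenance_workers = 0;   -- single worker
    CREATE INDEX CONCURRENTLY payments_account_idx ON payments (account_id);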

Derk: I always want indexes to
be created faster.

Nikolay: Yeah.

But it requires faster disk I/O capabilities, of course.

Michael: This is what I love about true scale: you just hit trade-offs. You hit times where one thing's best, and other times where the complete opposite is best.

So yeah, I'm really conscious of
time.

You've all been amazing.

Thanks so much for joining.

I wanted to give you each an opportunity to say anything else you wanted to add or forgot to say, or feel free to give a shout-out if you're hiring, that kind of thing.

Nikolay: Let me ask one more technical question. Does everyone use foreign keys everywhere, or did you decide to drop some of them, or all of them?

Arka: No, I will actually say something
about this.

I think like, honestly, we use
probably Postgres in like the

least interesting way possible.

For our sharded cloud databases,
we don't have foreign keys.

We don't do joins.

We don't have any triggers.

We don't really have anything that can... we don't have any generated columns.

Like we limit so many things that
you can do that we're basically

using Postgres as like a blob store
at this point, which is why

we end up with really giant databases
with a lot of indexes,

but the queries are quite simple
actually to retrieve the data.

And I think that's been like probably
1 of the primary reasons

that I've actually, and like my
team has actually been able to

kind of keep up with the growth
of our Postgres cluster because

we're able to do things like re-sharding
without having to, you

know, talk to every team at the
company and be like, can you

please change this query?

Can you please not do this thing?

You know, I think a lot of people
get really excited about using

all the fun features of relational
databases, but actually you

can use it in a very boring way,
but still get some of the benefits

of relational databases, such as
consistency and things like

that.

So I think that's like my 1 trick
to high scale with any database,

not just Postgres, MySQL and really
any other thing that you

can think of.

It's good to be aware of some of
the pitfalls of the more interesting

features.

Nikolay: So no foreign keys?

Arka: Absolutely not.

Okay.

Not allowed.

Okay.

Nikolay: Any more input?

Sammy: Yeah, we do allow foreign keys on the sharding key.

And similarly, we allow joins on
the sharding key.

So for us, it's a matter of picking
a sharding key so that most

operations are happening within
that.

And then that allows us to avoid
rewriting a lot of application

logic.

Smart.

Derk: We also have foreign keys. At the Postgres conference, which is in 2 weeks' time, a little bit less, I'll tell you how to find a proper partitioning strategy, because up to now I've been focusing on partitioning a single table in the best way possible, but I kind of forgot to look at the bigger picture. So all my tables are partitioned slightly differently, with different partition boundaries, and then the entire system becomes a bit messy. So now we are working on basically straightening out all the partition boundaries, because then the joins and the foreign keys work much more efficiently. So it's partition pruning.

Michael: Clever.

Derk: I kid you not, some queries
became 20 times faster by straightening

out partition boundaries.

Michael: Wow.

Sammy: I'll make my last plug on
the hiring side.

My team is hiring really aggressively
right now.

We're basically trying to double
this year.

And in particular, we just had
an awesome senior staff engineer

become a manager.

And so we have a big gap in senior
staff, principal type engineers,

folks who have worked at scale
before.

So if you're interested in those
problems, please reach out to

me or come apply.

Arka: Notion is also hiring.

But my pitch is definitely not
as refined as Sammy's.

But if you want to come work on
fun problems, check out our careers

page.

Derk: I actually think this might be the first time in 20 years: we are not expanding my team this year.

Nikolay: Oh wow.

Michael: But we should plug Derk's Postgres talk, which you can watch.

So I'll include links to all of
these things.

One final thank you from me.

This has been amazing.

Thank you so much for joining us
for episode a hundred.

Nikolay: Yeah.

Thank you so much.

I just want to once again say thank you for the knowledge sharing. Super important. Looking forward to more, definitely more blog posts, and maybe some runbooks published, things you use in your daily work maintaining these large Postgres databases. So I'm super curious to see more from you, and maybe to collaborate on some things, exchanging knowledge and so on.

Thank you.

Derk: Thank you very much, both
of you.

It's really cool to be here, especially
in such good company.

Sammy: Yeah, this is a really fun
conversation.

Thanks for getting us all together.

Arka: Yeah.

Thanks again.

I really enjoyed this.

I hope to be back in the future
as well.

Nikolay: This was the best episode
number a hundred we could

imagine.