Postgres FM

Nik and Michael are joined by Bruce Momjian to discuss his new talk "What’s Missing in Postgres?"
 
Here are some links to things they mentioned: 

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With credit to:
  • Jessie Draws for the elephant artwork

Creators and Guests

Host
Michael Christofides
Founder of pgMustard
Host
Nikolay Samokhvalov
Founder of Postgres AI
Guest
Bruce Momjian
Vice President Postgres Evangelist at EDB, PostgreSQL Core Team

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Michael: Hello and welcome to Postgres.FM,
a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard.

I'm joined as always by Nik, founder of PostgresAI.

Hey, Nik.

Nikolay: Hi, Michael.

How are you?

Michael: Good.

And we're delighted to be joined by Bruce Momjian, who is VP of

Postgres Evangelist at EDB and long-serving Postgres Core Team

member who recently gave a new talk about the missing features

in Postgres.

So welcome Bruce, it's an honour to have you here.

Bruce: Good to be with you.

Michael: Wonderful.

Perhaps we could get started with why this topic, why is it important

or why is it important to you?

Bruce: So I'd love to say that I thought of this topic and I'm

a genius, but no, actually Melanie Plageman came to me in Riga,

PG Europe in the fall.

And she said, we'd really love you to do a new talk at the

PGConf.dev Conference in Vancouver.

I said, okay.

And she said, how about things that aren't in Postgres?

I think, wow, that's a great idea.

So then I guess I was showing these slides to Robert Haas once

I finished them.

And he's like, wow, Melanie got a mention.

I'm like, yeah, I said half the job is figuring out what topic

to do for a talk.

So he said, I'll have to think about that now.

I can get a mention just by thinking of a topic.

I don't even have to do the talk.

So yeah, I thought it was a cool idea.

I never would have thought of it.

There are a couple of my talks where you'll see in the introduction

where the title concept came from, because again, I have about 67

talks now, and it's hard to think of topics.

I've been really lucky because I did 2 new ones in the fall of

last year.

1 is this talk, another is the wonderful world of WAL, which

is about the write-ahead log, and then I have a third 1 which

is more of an AI talk.

I already have 2 AI talks.

This is another 1 called building an MCP server using Postgres.

So I got 3 new talks queued up for 2026.

So it's unusual for me.

I write my talks and basically give them for maybe a year around

and then come up with a new 1 because I've been doing it for

30 years.

So on average, I'm making 2 talks
a year because I have 67, but

this year I get 3.

So I was like, wow, this is great.

And Melanie, I have to give her
credit for the topic on this

one.

Michael: That's awesome to hear.

Friend of the show Melanie, she's
been a guest before.

That's good to hear.

Nikolay: Yeah, I have a question.

If my memory doesn't play games
with me, 20 years ago I remember

you maintained the to-do list.

Bruce: That's right.

Nikolay: Isn't this a to-do list?

Bruce: Is this a to-do list?

I would say no.

The to-do list is basically driven
by what people have asked

for.

It could be a small thing, it could
be a big thing.

This is more of the, I would say,
the big missing features.

And a lot of them are ones that
we had no intention of doing.

So it's sort of like, let's step
back.

Let's look at what big things were
missing.

Let's look at what big things we're
maybe working on.

And then look at some of the big
things that we'll probably never

do.

Nikolay: So it's strategic.

Bruce: Much more strategic.

You're right.

If I drill down to the to-do list,
we'll be here forever.

But again, I think it is valuable
to step back and look at what

is missing and what isn't.

And I even learned something just
by writing the talk because

stepping back far enough and looking
at the missing things that,

so I think I have, what is that?

12 items.

And what's interesting is 11 of
the 12 items relate to performance.

That surprised me.

Michael: If you had to guess beforehand,
what would you have thought

more of them would be?

Bruce: I thought it would be basically
missing operational features

or missing security features.

Well, TDE is that one.

That's 1 of the 12, right?

Missing just integration, infrastructure,
some of these big missing

things.

But when I actually wrote them
down, I kind of like, wow.

So 11 of the 12 are basically related
to scaling, either on a

single host, 7 of them, or multiple
hosts, which is another area

we need to work on.

So really you've got 2 kinds: 11
of the 12 are performance, 7 of

them for a single host and 4 of
them for multi-host.

And again, I just never suspected
that it would be a performance

need and not an operational need.

If I had done this 10 years ago,
I know there would have been

a whole bunch of things that...

Nikolay: I can bring some ideas.

Maybe I should do it at the end.
They are not performance related,

and they're definitely big missing
pieces, and they're not included

in your list.

So maybe let's

Bruce: see.

I have...

What's interesting, I've given
this twice, both in January, 1

in Prague, 1 in Brussels, and nobody
really came up, in those

groups, nobody really came up with
anything that was wrong.

The only, I think, feeling I got
was some of the people were

sort of like, well Oracle RAC has
some value in these small use

cases.

And I'm like, yeah.

So there was a little bit of, you
don't have this Oracle thing

or multi-master like, well maybe
there was some value.

So I get it.

Yeah, it's there, but I didn't
get anything that sort of shocked

me.

So I'll be interested to see if
you have anything that was sort

of like,

Nikolay: Sounds good, I will do
it.

Bruce: Yeah, I want to hear this.

Michael: Why don't we do it straight
away?

Bruce: Sure.

Nikolay: Right now, let's discuss
this.

Okay, I'm looking at your list
and I don't see, for example,

synchronous replication improvements
because right now synchronous

replication is not all right.

If you watched Kukushkin's talk
last summer, he talked about the

caveats. And this is a big topic,
for example, when there are

new people coming to the Postgres
ecosystem from MySQL;

one big group is Multigres,
from the creators of Vitess.

Some words from them, some words
from Kukushkin, and I see synchronous

replication is not all right,
because it's based on logs. And if

the primary has already written data,
and then it suddenly restarts,

the primary thinks the data is written,
but the standby never received it.

This is 1 thing.

Another thing is that if you have
logical replication involved,

it is a nightmare to handle.

Basically, during a failover with
issues, you lose the logical

replica.

And looking at other systems, including
MySQL, there is a big

potential to improve synchronous
replication.

Basically, I won't name names,
but just watch Kukushkin's

talk.

It's broken right now.
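For readers following along, the mechanism being discussed is configured with a handful of settings. Here's a minimal sketch of quorum-based synchronous replication (the standby names are hypothetical); the caveats Nikolay raises concern what can happen underneath this configuration during restarts and failovers:

```sql
-- On the primary: wait for acknowledgement from any 2 of 3 named standbys.
-- The standby names here are hypothetical examples.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)';
SELECT pg_reload_conf();

-- Per-transaction durability level: 'on' waits for the standby to flush
-- the commit record; 'remote_apply' also waits for it to be replayed.
SET synchronous_commit = 'remote_apply';
```

Note that even with these settings, the commit record is written to the primary's local WAL before the wait for standbys begins, which is the root of the restart scenario described above.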

Bruce: So I'm working on the PG 19
release notes.

I do see improvements in both of
those areas in the release notes

that I'm working on.

They're not done yet, it won't
be done for about 2 weeks.

But it'd be interesting if you
could come back and take a look

and see what was done.

So what happens in that particular
example, both examples, is you're

now out of the 30,000-foot view
and you're getting down into the

details.

And there's totally stuff to do.

Nikolay: Sorry, I'm not getting
down.

When people with enterprise and
big cluster experience come,

they say, where is synchronous
replication?

We cannot build proper clusters.

And what I'm telling them, most
people still use asynchronous

replication and they cannot believe.

They cannot believe.

Really?

Bruce: Yes.

Nikolay: How can it be, they say?

Other systems have proper synchronous
replication with proper

consensus algorithm.

This is another topic.

There are many attempts to bring
consensus algorithm or something

inside Postgres, right?

That's similar to pooler question.

It's always there are many ways,
and so this topic is not about

performance, it's about HA.

Bruce: No, what I'm saying

is I'm trying to look at this from
a very high level to look

at big holes.

What you're actually looking at
is we have the feature but it

doesn't work right or it needs
whatever.

I get that but again if I go down
that level I'll never get done

right.

I'm trying to from the top look
for these big holes and actually

to me not only look at the big
holes, but look at why are they

holes, right?

Because a lot of people who aren't
me or you, who are sitting

watching everything that's happening
24 hours a day, right?

They're like, what's the problem?

What's the holdup?

And then the other thing for me
is, by having that 10,000-foot

view, I see a pattern clearer than
I could see when I was down

in the...

I'm not disputing that these things are not feature complete,

I would say.

It's not atypical for us, particularly with partitioning and

logical replication and failover, to take years to get to a feature

complete point.

And I think we are at a feature-complete point in a whole bunch of areas,

but I tried to stay away from, okay, we're not feature complete

here, we're not feature complete there.

I'm like, what things are we not even like moving in?

And there are a bunch of them that I would say we're just not

engaged at all.

Nikolay: Yeah, I get that.

That's a good description.

And when you're at this highest level, you see a lot of performance

related stuff, right?

Bruce: That's right.

At that level, I see a lot of stuff that we're not even in the

ballpark in.

That we're just scratching the surface.

I'm not saying all the things we've done so far have been

implemented or done perfectly or are feature complete; clearly

they're not.

But at least I feel we're moving there.

What I see up top is a whole class of things where we're not

even engaging, and we may not need to engage.

It's more of a thought experiment for people to come in and say,

okay, what things are they not even in the ballpark on?

And it's helpful to know, because if you look at the list it's

not a terrible list.

And if you think this is the biggest stuff we don't have, we

must have a lot of stuff, right?

Because I can't think of any other big things except these 12.

There might be 1 or 2 that somebody will think of at some point.

But I think for me it's a great exercise in seeing where we are

at that big level.

Where do we think we need to go 5 to 10 years out?

That's usually the area, that's the distance I'm looking at,

Not necessarily what comes in 19 or 20.

I think some of the stuff you're looking for it's coming in those

releases

Nikolay: That's good.

Bruce: But what stuff do we really need to at least evaluate

and say is this a direction we need to go in or not?

Michael: Looking through the list it struck me that there's a

bunch of things that you can currently do, but you need extensions

or a separate tool.

You can use Postgres and get those features, but you need an

extension or another service like PgBouncer or something.

But there's also a bunch that you literally can't do with Postgres,

so you would stick with maybe a proprietary database or just

a different solution completely, maybe a fork or something like

that.

So it felt like there's that kind of 2 categories there where

there are things you can do with Postgres, but you need something

outside of core.

And there are things that you can't do with it.

So is there 1 of those that's more interesting than the other

in terms of bringing into core?

Bruce: That's funny you mentioned that.

I really struggled with that question.

And the audience struggled with that question.

I remember people saying, hey, what about this?

What about that?

And I think the version, these slides are regularly updated.

Like every time I give a talk, I think it's something that, I'm

just looking at 1 of the capitalizations I did wrong.

Anyway, I'll fix that after this talk.

But my point is that you're right.

You have, there's like concentric circles.

So if you're willing, if you're willing to go to a proprietary

fork of Postgres, it's a big circle, right?

And then if you're willing to use extension or external tools,

let's say external tools and then proprietary extensions and

then extensions And then the core, right?

And in fact, maybe I should have a slide for this to illustrate

what that span is.

And in a lot of cases, we will say we have it.

This is just great you asked that question.

Nobody asked it quite the same way you did.

So some of the complaints we have are not that

it's not available as either an extension or an external tool

or a proprietary fork of Postgres.

The complaint we have is it's not in core Postgres.

Or that there's an extension, but the extension, because it's

an extension, doesn't work exactly the same way it could possibly

work if it was in core.

Or if it's a third-party tool, the third-party tool is not as

effective or easy to use as it would be if it was integrated,

which is the connection pooling example.

Nikolay: Or there's a choice of them.

Like, why a choice of them?

Bruce: Right, there's a choice of them.

Although the choice of them I kind of like, particularly in the

pgpool, PgBouncer realm, there is actually a reason for that.

But my point is that some of the missing stuff, even something

like columnar storage, right?

You could say we have extensions to do that, but yeah, I guess,

but I'm not really happy with that and I'm not necessarily wedded

to Citus and it has limitations and because it's an extension,

it might not be available.

So you get all these sort of complaints
And it brings you to

a bigger philosophical question
of, does everything need to be

in core, right?

And there's probably not.

I think pgvector is a great example
of something external that

can be developed on its own release
cycle.

And it really doesn't hamper its
ability to be used effectively

with the code.

But when you're starting to talk
columnar storage, you're starting

to talk connection pooling, where
you have problems with authentication

going through the connection pooler.

We've tried to improve that, but
it's still kind of yucky.

You know, logical replication of
DDL, you know, another one that's

available as an external proprietary
product but not in the code.

TDE is a classic that's available
in a bunch of external stuff.

It's not available in the code.

So it's sometimes when people complain
or it's missing, what

I tried to focus on was basically
in this talk, is it available

either in Postgres or as an extension
that has no downsides?

That's my litmus test.

Anything that has limitations and
is an extension, limitations

as a third party tool, or obviously
as a proprietary fork, those

are going to be effectively not
available in the community.

And we have to decide, is that
a good place to be, a bad place

to be?

Maybe that's fine.

Michael: I really like the concentric
circles thought and I want

just on the kind of the last 2,
is there also a difference in

your head between contrib extensions
and core?

Like, for example, pg_stat_statements,
a lot of the pg_stat_* views

are on by default when people install
Postgres, but pg_stat_statements

isn't.

So like that's an interesting,
I don't know if that's too semantic

a discussion, but that feels interesting
to me as is that missing

or not really missing?

Bruce: You're right.

That's another concentric circle.

Nikolay: Yeah.

Bruce: Right.

So yeah, see like pgvector, maybe
that would be nice to be in

contrib instead of being external.

But then if we pulled into contrib
and its release cycle becomes

tied to the major releases.

Maybe that's not a good idea, I
don't know.

And when you start to talk about
the cloud vendors will support

some contrib extensions and not
others, and you get people who

pull me aside, hey, why don't you
move this into core, get out

of contrib, And I'm like, what's
the matter with contrib?

And you're like, well, my cloud
vendor doesn't support it.

I'm like, okay, so you're complaining
to me that your cloud vendor

doesn't support something we're already shipping.

Is that, do you think that's gonna be like fruitful?

And they're like, well, why isn't it?

But there's a variety of reasons that stuff is in contrib.

It may be an edge use case.

It may be like pg_stat_statements, creates its own tables and

has its own overhead that we're not sure everyone would want.

So that kind of makes sense out there.
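As a concrete illustration of that setup cost, here is roughly what enabling pg_stat_statements takes (a sketch; on Postgres 12 and earlier the timing column is named total_time rather than total_exec_time):

```sql
-- pg_stat_statements ships with Postgres (contrib) but must be preloaded,
-- which requires a server restart:
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- ...restart the server, then enable it per database:
CREATE EXTENSION pg_stat_statements;

-- After that, the slowest statements by total execution time:
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

The pg_stat_* views, by contrast, are available with no setup at all, which is the asymmetry Michael is pointing at.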

Some of them are oddball ones, like cube, an edge case, or pg_trgm.

I don't know.

It's hard to say.

Is it better as an extension?

Is it going to be better in core?

You can make a case either way, I would say.

Michael: Nice.

From the list, are there any that are like particularly your

favorites or ones that you would love to see progress on or do

you not like to pick favourites?

Bruce: Yeah, the 2 that I've really felt called to champion are

cluster file encryption, which is called TDE in the industry,

and sharding.

Those are the 2 that have the biggest impact in terms of adoption

and expanding Postgres workloads, for me.

The other ones are interesting, but they're more...

I don't see them as expanding the Postgres adoption universe.

That's, I guess, the phrase I would use.

Yeah, The other ones are good, but they're more operational.

It would be nice to have this.

I would be more optimized if I had this other thing.

For me, the sharding and the TDE really take Postgres to a new

level.

I would say on both of those, I've pretty much either failed

or executed poorly, because I've been working on, at

least championing, those for at least 5 years now.

I don't feel I've made a lot of, I guess I've been waiting for

a groundswell of support because both of them were very hard

to implement as an individual.

I got close with TDE but got stuck on something.

And sharding, I think, it really requires a team.

I don't think I've seen the kind of desire behind that.

I think part of the reason for sharding is that machines keep

getting so big that effectively you're better off just buying

a bigger machine or renting a bigger machine than going down

the sharding route.

Because the sharding route really is very workload specific and

there's an increasing dislike of workload specific things in

an enterprise.

There's much more of a push toward just generic compute, generic

solutions, and away from the sort of, oh, I've got special

chips on the drive which filter stuff, and oh, I've got this special,

whatever, special hardware that does something.

It just, the enterprise focus is a lot less on hardware now,

a lot less on infrastructure, a lot more on solutions.

Maybe that might be good, I'm not saying good or bad, but I think

sharding is...

As people have needed sharding, systems have increased in size,

pretty much in lockstep with what they needed.

So the number of people who actually need it is limited, and

the number of people who are willing to put in the effort to

architect it is also limited.

Nikolay: Yeah, I agree with you that hardware available right

now is so huge.

Recently, I saw a cluster which had default vacuum settings,

self-managed.

Usually RDS or others tune it, or half-tune it.

But this was the default vacuum settings, 3 workers only,

nothing tuned, and it was 20 terabytes and somehow surviving.

It was insane, absolutely.

So how is it possible?
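For reference, here are the defaults in question and the kind of tuning a self-managed cluster of that size would normally get (the values are illustrative, not a recommendation for any specific workload):

```sql
-- Stock defaults: autovacuum_max_workers = 3, and a worker cost limit
-- that falls back to vacuum_cost_limit = 200, which throttles vacuum
-- heavily on multi-terabyte clusters.
ALTER SYSTEM SET autovacuum_max_workers = 10;         -- historically requires a restart
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000; -- let workers do more per cycle
ALTER SYSTEM SET autovacuum_naptime = '10s';          -- check tables more often
SELECT pg_reload_conf();
```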

But at the same time, RDS and Cloud SQL and others have a

hard limit.

I know they work on it.

And Aurora has more, 64 terabytes.

And these days to collect that amount of data is not rare already.

Even 1 person surrounded by AIs can collect a lot of data.

This is 1 thing.

And another thing is that there are hard problems.

For example, lightweight lock manager problem, which was solved

in Postgres 18.

There are other problems as well that you encounter when you grow

to some heights. Some people would not like to spend time firefighting,

so sharding is really needed.

But I guess, what's your take?

For me, it's really hard to find consensus how to achieve it.

Also, sharding topic usually triggers topics like internal connection

pooler, for example, because they are adjacent.

If you talk about routing and so on, you think about also auto

failover, you think about connection pooling.

These topics usually come together in my head.

But there are other ways to do sharding.

There are many ways.

What's your take?

Will it ever be implemented in core or no?

Bruce: I've always been of the opinion, and I do have a sharding talk

on my website, that sharding has to be developed organically

within Postgres, and it really has

to be built on, effectively, partitioning, foreign data

wrappers and parallelism.
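Those building blocks already compose into a rough version of this today. A minimal sketch (the server name, host, and table are hypothetical): a hash-partitioned table where one partition is local and another lives on a second node behind postgres_fdw.

```sql
CREATE EXTENSION postgres_fdw;

-- A second node acting as a shard; host and credentials are hypothetical.
CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard2.example.com', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard2
    OPTIONS (user 'app', password 'secret');

CREATE TABLE measurements (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY HASH (id);

-- Local partition:
CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);

-- Remote partition, reached through the foreign data wrapper:
CREATE FOREIGN TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 2, REMAINDER 1)
    SERVER shard2 OPTIONS (table_name 'measurements_p1');
```

What this lacks, as discussed below, is exactly the global coordination (lock manager, snapshot manager) that read-write sharding would need.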

I don't think we're going to have the appetite to create a whole

new architecture for sharding.

And as every year that goes by,
it becomes clearer that, yeah,

we just can't.

So I think we can get closer than
we are now with little impact.

I think we can, for read-only sharding.
If we want to do

read-write sharding, then we have
to have a global lock manager and

a global snapshot manager.

It becomes much more complicated.

Yeah, I keep going there and people
who think they need it effectively

don't end up needing it or end
up getting bigger hardware or

they re-architect what they're
doing and then they don't need

it anymore.

So I think that's 1 of the reasons
it hasn't moved forward.

But I do think, for the read-only,
sort of data warehouse

kind of sharding, I think within
2 years we could have a pretty

good solution in the industry.

But I just haven't seen a lot of
interest. There's 1 guy at Fujitsu who

was working on it, but that was
about it.

And frankly, it's been pretty dormant
for the past 2 years, and

I haven't had time to work on it
much either.

Nikolay: Yeah.

If you stay on a single primary
cluster, there is a big limitation,

well known for those who achieved
some heights, where you have

200, maybe 300 megabytes per second of WAL.

We talked about this a lot with
folks who develop sharding systems.

You hit the limits of a single-threaded
process. Not the WAL receiver,

the replay process, right?

It shows up in top, and it basically
replays the changes from the primary,

and it's a single-threaded process.

And I know also somebody from Japan
working on it since 2013,

I think, on and off.

I saw some conference talks and
so on.

This problem is not included in
your list.

It's not a small problem if you
achieve heights.

What do you think about that problem?

Bruce: Yeah, so as I remember,
that was the 1 where they're trying

to create a dependency graph from
the WAL, and therefore identify

which parts of the WAL are parallelizable,
and send those off

to workers.

CPUs do that now with CPU instructions.

They figure out which parts can
be run on different cores or

in parallel cores.

Yeah, you're right.

But we don't

Nikolay: have threading, so, yeah.

Bruce: I don't know.

I don't know if threading is...

Threading's on the list that I
have.

I'm not sure threading is a requirement for

that, because we do have parallelism without threading now.

It seems to work fine.

Because you load them into a shared memory queue and the process

just pulls out of there.
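That process-based model is visible in any parallel query plan. A sketch (the table name is hypothetical, and the plan shape assumes the table is large enough for the planner to choose workers):

```sql
SET max_parallel_workers_per_gather = 4;

EXPLAIN (COSTS OFF) SELECT count(*) FROM big_table;
-- Typical plan shape for a sufficiently large table:
--  Finalize Aggregate
--    ->  Gather
--          Workers Planned: 4
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on big_table
```

The workers here are separate operating-system processes coordinating through shared memory, not threads.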

I don't know.

I think you're right.

I think that would be getting a parallelizable replay of logical

replication.

Physical doesn't seem to matter too much because it's so fast

that...

Nikolay: Hold on 1 second.

Logical I think already, if there is work maybe in Postgres 18

or something, I remember something for logical to parallelize

it.

Bruce: To parallelize it, okay.

Nikolay: I think so.

Technically you can parallelize it with multiple publication

subscription pairs, multiple slots, but it will be having problems

with foreign keys.

It will be eventually consistent in terms of referential integrity.
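The workaround Nikolay describes looks roughly like this (table names and the connection string are hypothetical). Each subscription gets its own replication slot and apply worker, which is where both the parallelism and the referential-integrity caveat come from:

```sql
-- On the publisher: split the tables across two publications.
CREATE PUBLICATION pub_orders FOR TABLE orders, order_items;
CREATE PUBLICATION pub_events FOR TABLE events;

-- On the subscriber: one subscription (slot + apply worker) per publication.
CREATE SUBSCRIPTION sub_orders
    CONNECTION 'host=primary dbname=app' PUBLICATION pub_orders;
CREATE SUBSCRIPTION sub_events
    CONNECTION 'host=primary dbname=app' PUBLICATION pub_events;

-- Transactions are no longer applied in one global order, so foreign keys
-- between the two table sets can be transiently violated on the subscriber.
```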

I'm talking purely about physical replication.

Bruce: Physical replication, okay.

Nikolay: And this is a limit some companies hit, and it's super

painful because there is no escape from it.

You need to do either sharding or vertical split.

You need to reach a certain level, a lot of writes basically.

Bruce: I think the reason we don't hear about it a lot in the

community level, and this might be different from the level that

you're working at, is we're definitely a general purpose database.

And not that we can't go to the heights and we keep pushing the

ceiling up higher and higher, but there is a limit to how much

complexity we're willing to add, and potentially unreliability

to Postgres to get up to the super heights.

It's possible that's why I know you specialize in that area.

That's possible we don't hear a lot about it.

That didn't get on my list as something that I've heard a lot

about.

I don't see many email threads really addressing that.

Nikolay: This is so.

But I know, for example, Instacart hit it years ago, during

COVID actually, and there are others who hit it.

Sometimes people hit it and don't notice because, okay, some

replica are lagging a little bit and so on.

Yeah, but I understand it's not a super common problem.

Yeah.

Bruce: We had a, EDB had a customer who hit it on specialized

storage hardware and we identified that it was the storage hardware

that was causing it.

So that was, yeah, that was kind of a get new hardware, get new

storage hardware, your problem will go away.

So that was, I think the answer to that 1.

Nikolay: Okay,

Bruce: yeah.

That's the only case I've heard of replay complaints and lagging

in my work.

Yeah.

Nikolay: But what's your take on threading topic?

Because there was a big impulse a couple of years ago from Heikki

originally, I think.

Bruce: Yeah.

I have a blog entry about it from 2018, which

references, I think it might even reference Heikki's,

let me see if it references Heikki's email.

No, it doesn't.

I think we're still on the fence on that 1.

We keep chewing away at some of the small things like getting

rid of global variables, I think was 1.

And I think the other thing that's going to bail us out here

is that it appears as though we now have compiler

support for this.

So instead of having to re-architect all our code, if we can

get rid of the global variables, which

is pretty simple, it looks like the compiler will sort of

thread-local a lot of stuff for us to make it easier, without us

having to really re-architect a lot of the code.

So I'm hoping in that area.

But my understanding is that the threading prototypes don't seem

to gain a whole lot.

They don't, at least from what we found, because it only reduces

task-switching time.

But again, for example, if you're doing a GUI, you really need

threading because you're updating a whole bunch of stuff on the

screen at once, and you have all these background jobs

running in your address space and so forth to update a whole

bunch of very light processes and light statuses.

But when you start to talk about a database, there really isn't

a whole lot of light out there.

Yeah, I guess we could do index lookups with threads.

That would be cool.

Nikolay: Visibility bits maybe?

Bruce: Visibility, yeah.

But the problem is there's a lot of the stuff is really heavy.

For example, when you're trying to parallelize a query, those are

not light, right?

So you might as well just get a process and run it and have its

own everything.

So I think that's what has slowed it down.

Another problem obviously is resiliency.

Right now if a session crashes or runs out of memory or whatever,

we keep running just fine, unless
it's in a critical section.

Whereas with threading, it would
make us less resilient to sessions

that misbehave, and we have to
balance those 2 off.

We know the value of threading,
and then the less resiliency

of it.

I don't really know the answer.

I've been surprised we haven't
done more in this area, but it

appears to not be a huge problem
if I'm not hearing about it

every couple months.

I certainly

Nikolay: am not.

Let's talk about huge problems
related to 4-byte transaction

IDs; 8-byte transaction IDs are
included in your list.

And some folks have them for years
already.

And also there is, I saw some preparation
work and so on.

I'm very curious What are your
forecasts and take on this?

But also, what is not included
in your list is the undo log and

redesigning MVCC and these topics
which were quite popular a

few years ago.

There are some efforts, but now
like silence.

Maybe I don't see it, but...
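For context on the 4-byte limit in question: the pressure shows up as transaction ID "age", which vacuum must keep below roughly 2 billion by freezing old rows. A standard monitoring query (not from the episode) makes it visible:

```sql
-- Per-database distance toward XID wraparound with 4-byte transaction IDs:
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(age(datfrozenxid) * 100.0 / 2000000000, 2) AS pct_toward_wraparound
FROM pg_database
ORDER BY xid_age DESC;
```

64-bit transaction IDs would make most of this bookkeeping, and the forced freezing behind it, unnecessary.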

Bruce: No, I think you're right
on the MVCC.

I mean, we had the Zheap effort
by Robert Haas.

That was probably 10 years ago.

And he really poured into it, but
I think the job just got too

large.

And I think you also, once you
start to look at what undo requires,

things get very complicated.

So I'm not saying it's not doable,
but you end up with a lot

of complicated challenges And Oracle
suffered from those challenges

for decades until they finally
figured out how to deal with them.

So what would we gain by having
an undo?

Certainly it would make updates
easier.

I'm not sure it would help with
something like deletes or aborted

inserts.

Nikolay: I think it's deletes.

Let's just update and see.

Bruce: Yeah, just update and be
done with it, exactly.

Michael: It's all the bloat and
vacuum issues, I think, that

people are most excited about getting
rid of.

Bruce: Yeah, I mean, but at the
same time they keep improving

vacuum every release I saw it in
the thing.

You have today's vacuum, you have
the vacuum from 2 years ago,

you have the vacuum 5 years ago.

Which 1 are we complaining about?

Because in fact it has gotten better.

I'd love to just get rid of the
whole thing.

I don't think we can. Even with an undo,
I'm not sure I'm ever going

to be able to get rid of vacuum.

Nikolay: Because it has different
jobs as well.

Bruce: Right, because of the problem with aborted inserts and

deletes and so forth.

So what is that going to look like?

I don't know.

I don't know.

I'm surprised at how long we've been able to stay with what we

got, what, 30, 40 years ago.

A much simpler architectural setup.

It's funny, I had a discussion with somebody and I was in Geneva

for the CERN PGDay and somebody came to me and they were talking

about page compression.

Do we do page compression?

And I have a blog entry about it and I showed it to him and he

didn't really, I hoped he would read it from my phone, but he

didn't want to.

I said, why do you want page compression?

He said, it requires less disk space.

I said, this is not the 1990s.

What's the actual reason?

I said, I understand why it was needed 30 years ago, but I'm

not sure that, because once we compress a page, then obviously

if we do an update and then the new data can't be compressed

as well, then we got a bigger page, you've got to put that somewhere,

and where do we put it, and then how do we deal with the index

entries, and like what are we really gaining once we're all done

all that computation, all that moving the data around, What have

we really gained with the compression?

Now, we can use storage compression, right? There is storage compression, and I think Postgres will run on that. But again, that pushes the problem down to the storage layer. I don't know what performance would be like, because frankly, databases don't like compression at that layer.

But I guess my point is that he was asking for a very specific thing, but when I asked him why he wanted it, he couldn't really articulate it except, okay, it uses less disk space. I wasn't sure what the goal was there.
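The failure mode Bruce describes, where an updated page no longer compresses as small and has to overflow somewhere, can be sketched with a toy 8K page (illustrative values only, not Postgres internals):

```python
import os
import zlib

PAGE = 8192

# A page full of repetitive tuple data compresses very well...
page = (b"status=active;region=eu;" * 400)[:PAGE]
small = len(zlib.compress(page))

# ...but after an update writes less-compressible data into the page,
# the recompressed page is bigger and may not fit its old slot.
updated = bytearray(page)
updated[4000:4200] = os.urandom(200)
bigger = len(zlib.compress(bytes(updated)))

print(small < bigger)  # the updated page compresses worse
```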

Nikolay: For me, compression, as in "let's compress data", is more related to column storage and how you can compress there.

Bruce: I told him that. I said, if you're telling me column storage, that's a completely different setup. And we do have solutions for that. But again, you're assuming a lot of duplicate data in the same column. And that's a different case than compressing an 8K page on disk.

And it's like, no, no, no, I don't want that. I want the one about the page. I don't see anybody working on that. And I'm like, okay.
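Bruce's point that column storage pays off because of duplicate values in the same column can be illustrated with a toy sketch (generic compression standing in for real columnar encodings like run-length or dictionary compression):

```python
import zlib

# The same 10,000 rows, two layouts: column-wise groups duplicates
# together, row-wise interleaves them with high-cardinality values.
statuses = ["active", "inactive"] * 5000   # low-cardinality column
ids = [str(i) for i in range(10000)]       # high-cardinality column

# Column-oriented: each column stored contiguously.
columnar = ("".join(statuses) + "".join(ids)).encode()
# Row-oriented: values from different columns interleaved per row.
row_major = "".join(s + i for s, i in zip(statuses, ids)).encode()

col_sz = len(zlib.compress(columnar))
row_sz = len(zlib.compress(row_major))
print(col_sz < row_sz)  # same bytes, but the columnar layout compresses better
```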

Nikolay: We don't serve this in
this restaurant.

Bruce: Yeah, we don't serve that.

In terms of 64-bit transaction
IDs, I think it's a little hard

to understand what's going on when
you look at the email threads,

because I think the impetus is
that Postgres Pro has a version

of Postgres that does 64-bit transaction
IDs.

And therefore, you don't have to
freeze and the whole thing there.

But as we talked about earlier,
Nikolay, I think that the Postgres

Pro customers are really at the
very high end throughput requirements

And they're willing to pay for
that.

They're willing to have a more
complex system that does that.

But when we're now in the more
generic workload case, the Russians

have been very willing to give
us the patches to do it.

But there's a resistance in terms
of exactly how to do it.

In a way, it's going to benefit
the high-end users, no question.

And one of the things I've learned in the past year is that one of the reasons that Oracle looks the way it does, and a lot of the proprietary forks look the way they do, is because they're really selling to that top 5% of market volume.

Whereas when Postgres is working
on its code, we're trying to

hit that 50% mark, right?

And the reason Oracle is so complicated, the reason a lot of these databases become complicated or add these complicated features, is because of who they're really selling to. I don't say they don't care, but they're only really focused on their top 50 customers, and everyone else is just along for the ride.

And therefore you get these very complicated systems with a lot of weird options, which were added only because one of the 50 wanted it.

You have this very high-end group who's very demanding, who's calling the shots, and they're dragging the database into this high-end volume.

But every time they're dragging
it up there, they're potentially

making the generic workload either
slower or harder or less efficient.

And they effectively don't care
because the money is in that

high-end.

Nikolay: Let me comment, please. I can't keep silent here, because we help startups. Our primary customer is a startup which grew to some terabytes and hit problems.
problems.

And in the last couple of years we've seen a lot of AI startups, and this is crazy, absolutely. They reach heights much quicker, with much less effort and resources.

And they start hitting problems very quickly, for example when freezing becomes a big problem and we need to do manual freezing that skips indexes. This is very common for us right now.
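The kind of manual freezing Nikolay describes can be done with vacuum options roughly like this (a command sketch: the database and table names are hypothetical, and the INDEX_CLEANUP option requires PostgreSQL 12 or later):

```shell
# Proactively freeze tuples while skipping the index-cleanup phase,
# which is the expensive part on tables with many large indexes.
# "events" is a hypothetical table name.
psql -d mydb -c "VACUUM (FREEZE, INDEX_CLEANUP OFF, VERBOSE) events;"
```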

And the second thing is that we're already losing some of them. Because when we start saying, you need partitioning, you need to allocate some resources for these experiments and so on, they switch to a different database system, or quickly consider one where there's less headache.

But this concerns me a lot.

Michael: Isn't that kind of Bruce's point? If they're an Oracle customer and they grew to being a significant portion of revenue, worth saving, that's different. The Postgres project doesn't have to listen to the huge customers just because they're paying.

Maybe Oracle is incentivized to listen to those customers more than the average customer, more than all the little guys. So Postgres is a different landscape, where we can listen to people more across the board, not just the guys hitting extreme scale.

Nikolay: My point is that smaller teams have bigger databases, and that's accelerating. More and more often they hit problems which before were hit only by big users. And these guys are ready to switch faster, because they are surrounded by AI themselves, because they are AI startups.

Bruce: What do they switch to?

Nikolay: Sometimes proprietary databases, whatever AWS or GCP offer. Sometimes MongoDB is also on the table, and they run into different problems. But I'm just concerned that I'm seeing something I didn't see in the previous 20 years.

Bruce: So let's look at what's going on with the 64-bit IDs as a microcosm. They're willing to give us all the patches to make this happen, right?

But what's happened is that we've only incrementally implemented

those.

We have some of them in PG 19.

For example, the MultiXact sizes are now 64-bit.

The members or the groups, I can't remember, I think it's the

groups.

Nikolay: And sorry, we had a couple of good examples last year, and some of them came on our podcast, where this space was exhausted.

Bruce: The MultiXact one.

Nikolay: Yeah, yeah.

And also AI startups which have a lot of data very fast.

Bruce: So there are two ways we can go at it. We could just go whole hog: 64-bit everything, increase the header size by 33% by making the transaction IDs 64-bit, and be done with it.
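Bruce's 33% figure checks out as back-of-envelope arithmetic, assuming the usual 24-byte aligned heap tuple header with 4-byte xmin and xmax fields:

```python
# The fixed part of a heap tuple header (HeapTupleHeaderData) is
# 23 bytes, MAXALIGN-padded to 24 on typical platforms.
current_header = 24
growth = 2 * (8 - 4)                  # widen xmin and xmax from 4 to 8 bytes
new_header = current_header + growth  # 32 bytes
increase_pct = round(100 * growth / current_header)
print(new_header, increase_pct)  # 32 33
```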

But we're worried about the impact on an average user.

So what we've been doing is slowly 64-bit-enabling the server. MultiXact is one, and there are some other areas. I think we're looking at doing the commit log (clog). I don't know if we got to that.

So we're trying to do it.

We're trying to 64-bit the areas that are costless to us, basically,

where we can increase it without really any problems.

And then I think once we get to that, we'll have 64-bit pretty much everywhere that we can easily do it. Then we have to look at, okay, now at the places where it's going to cost us, what do we do?

I think one of the great ideas that I saw was to put basically an epoch LSN in the page header, so that for mass loads of data, which a lot of these companies do, you could basically say that effectively all the tuples are in the same transaction. They're clearly all in the same epoch, right? So you just put one epoch, and then all your heap tuples look just the same as they did before.

Right.

I think that's where we're going: the concept of having an epoch on the page, and then basically allowing those tuples not to necessarily have to be frozen at all.
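The per-page epoch idea can be sketched like this: the page stores one epoch, tuples keep their compact 32-bit XIDs, and readers reconstruct a full 64-bit XID on the fly. Names here are illustrative, not actual Postgres code:

```python
def full_xid(page_epoch: int, tuple_xid: int) -> int:
    # Combine a 32-bit page epoch with a 32-bit tuple XID into a
    # 64-bit XID; comparisons on these values never wrap around.
    return (page_epoch << 32) | tuple_xid

# Tuples bulk-loaded in the same epoch keep their 32-bit XIDs unchanged,
# so the heap tuples look just the same as before.
print(full_xid(3, 5))                               # 12884901893
print(full_xid(3, 0xFFFFFFF0) > full_xid(2, 0x10))  # True
```

Because the 64-bit values are strictly increasing, anti-wraparound freezing would no longer be needed for such pages, which is the point Bruce is making.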

I think if we start to bring the epoch down to individual rows

and we increase the header size, which people have already complained

is too big, then we potentially could get complaints.

If I had to take a guess, I think that's where we're going, but we're going at it very incrementally, again hitting all the places that store LSNs outside of the heap and index pages, because we know those are going to be complicated, and getting the infrastructure done.

I know that we have some patches in 19 for that.

I've seen them coming through.

They got committed.

Then we have to decide, okay, now that all this stuff is in place, let's put up a proof of concept. Let's see what a per-page epoch looks like.

Let's do some loading of data.

Let's see how far this gets us down the road.

We don't have a whole lot of empty space on the pages, so it'll

be a little tricky to figure out where to put it, but I think

we can find a place to put it.

But I think that's where we basically are is trying to figure

out once we get all the bottom stuff done, what do we do for

the main stuff, the heap and the index pages.

Michael: Nice.

I think that makes a ton of sense.

And I really like the incremental approach that Postgres takes.

I've only been in and around the project for the past 10 years

or so.

Bruce: It's a lot!

Michael: I know, it's a lot compared to some, but it pales in

comparison to 30 years.

And it's been really nice seeing the incremental improvements

but also that they stack up.

It really has come a long way in 10 years, and things that looked incremental 9 years ago, 8 years ago, 7 years ago have really added up. And you mentioned vacuum, but it's not just vacuum, right? All of the changes that have helped with index bloat are kind of attacking from the other direction.

So it's...

Bruce: That's very true.

Yeah, replication.

Yeah, yeah, partitioning, yeah.

Michael: Yeah, I like that approach, and I feel like it might sound slow, but time flies, and in a few years I can imagine us having made significant progress on some of these things.

So yeah, that's really cool.

Bruce, I'm conscious of time.

Is there anything we didn't talk about that you wanted to make

sure we did mention or any last shout outs or pointers you wanted

to give people?

Bruce: No, it was just nice that we got a chance to at least talk about it. To me, the categories of what's missing were a big takeaway, to understand why we are here: why don't we have some of the stuff that we're missing, and some of the stuff we may never have?

Optimizer hints.

Although it seems like a portion of that is coming in Postgres 19. It's not called optimizer hints, it's called plan advice, but it could be used in a similar way.

So again, I think the problem is that we don't have a roadmap. We don't, because we're too dynamic to have a roadmap. So it's almost a surprise to me to see what gets into every release.

And, I think everybody's pleasantly surprised by what's in 19

and obviously what we're going to be starting for Postgres 20

in July.

Michael: Yeah, absolutely.

And I think I saw the work from Robert.

I think it was in contrib modules, right?

At least the first version.

Bruce: pg_plan_advice, yeah.

Michael: Yes.

Looks very cool.

All right.

Well, thank you so much for joining us.

I'm sorry Nik had to drop off. It's been an absolute pleasure having you.

Bruce: Great.

Thanks, it was nice talking to you.

Michael: Likewise.