A weekly podcast about all things PostgreSQL
Michael: Hello and welcome to
Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and as usual, I'm joined
by Nikolay, founder of Postgres.AI.
Hello, Nikolay.
Nikolay: Hi, Michael.
Michael: Today is no normal episode
though, because we are delighted
to be joined by Melanie Plageman,
who is a Database Internals
Engineer working at Microsoft and
major contributor and committer
to PostgreSQL.
It's a real honor to have you on
the show, Melanie.
Welcome.
Melanie: Thank you.
I'm excited to be here.
Michael: We're excited to have
you.
There are so many things we could have chosen as a topic today, but off the back of a couple of talks you've given at Postgres conferences, we've opted to go for the topic of getting started with benchmarking.
Would you mind giving us a little
background about why this is
a topic you're particularly interested
in?
Melanie: Yeah, sure.
So since I started working at Microsoft
a few years ago, a lot
of the work that our team does
is performance related.
So I would say as a Postgres engineer,
even for features that
are not performance related, you
have to make sure that you understand
their performance impact.
But we tend to work on things like
AIO and other features where
you wouldn't do them if they didn't
have a positive performance
impact.
So assessing that is a huge part
of my job.
And it's an area where every little
detail matters.
And the things that don't matter
are kind of counterintuitive.
And it's just, it's kind of an
art and it's so hard to learn.
And it's like a dark art, you know,
and it's, I think I've been
really blessed in that I've gotten
to talk to and learn from
people that know how to do it,
like Andres Freund and other people
at Microsoft and in the community,
but not everyone gets to learn
from people that have 15 years
or 20 years of experience.
So I like talking about it and
trying to see if I can help people
get started and demystify what
I can about it.
Nikolay: Yeah, that's a good point.
So you think it can be improved,
right?
And people can understand benchmarks
better, right?
What do you think are the tools for that, to find solutions better and understand bottlenecks faster, and so on?
Yes.
Misleading results we have all the time: we think we found something, but it's wrong, and then one day later you realize it's sometimes completely the opposite, right? So you think it can be improved, right?
Melanie: Right.
So I think one of the things that
users have going for them is
that they have a baseline understanding
of how you should configure
your database.
And some of the mistakes that I made when I was beginning with performance work, you might not make.
So I remember like the first set
of benchmarks that I did, I
didn't configure shared buffers
at all.
This was years ago, so I give myself
a little bit of an out.
But basically, I was looking at it and saying, this is supposed to be an improvement, and it looks like it's actually hurting performance. And I was showing Andres, and he was like, well, tell me how you configured it, and... I hadn't changed any configuration. And he was like, okay, we'll just throw all of this away.
And so some of the basic stuff
around benchmarking, I think that's
something that developers who maybe
don't have as much user experience
have to learn, but users have that.
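For reference, the kind of baseline configuration being talked about here is just making sure the obvious knobs aren't left at defaults before benchmarking. A minimal sketch, where the values are illustrative assumptions rather than recommendations:

    -- give the buffer cache a realistic size before benchmarking (requires a restart)
    ALTER SYSTEM SET shared_buffers = '8GB';
    -- time reads and writes so I/O statistics are meaningful
    ALTER SYSTEM SET track_io_timing = on;
    -- after restarting, confirm the settings actually took effect
    SHOW shared_buffers;
    SHOW track_io_timing;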
I think one of the things that I
see happen more than you would
think is that there's some unexplained
performance impact or,
like a lot of times you're, you
develop a change, you want to
see how does it improve performance,
and when it doesn't, or
it has an effect that you can't
explain, you have to start investigating.
One of the things that people almost never do, at least performance engineers: they don't look at the logs. And maybe because it's a development branch, right? Maybe you actually had a bug and it's crashing, or there's something in the logs, and that's what's actually slowing it down.
And it's such a basic thing that
you don't think about.
You're like looking at IO stat
and looking at these more advanced
tools when we really just need
to look at the log.
Nikolay: Or it was fast because
it didn't do any work, right?
Melanie: Yeah.
Nikolay: And looking at logs, you
can notice it.
Yeah.
Good point.
Sometimes, with some companies, I won't name them, benchmarks are published and they claim something, and if you have 10-plus years of experience, you quickly have some ideas about what's wrong with those benchmarks.
For example, full-text search in Postgres is bad, it shows O(N). You think, how come? It's not possible. You quickly realize it's not possible. And you look at it and realize, oh yeah, they just didn't build a GIN index at all. Right? But if you don't have experience, it's easy to make errors, to have mistakes.
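To make the GIN example concrete: without an index on the tsvector expression, a full-text query has to scan the whole table, which is the O(N) behavior being described. A minimal sketch with hypothetical table and column names:

    -- without this index, the query below degrades to a sequential scan
    CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('english', body));
    -- the query must use the same expression for the index to be usable
    SELECT id FROM docs
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & benchmark');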
And my idea, like it would be good
if something would help people
to brainstorm what's wrong or what
else to look at as early as
possible.
So we need some brainstorming mechanism,
right?
What do you think?
Melanie: Yeah, almost like a checklist
or that kind of thing.
I think that's one of the reasons
why when I started out with development,
I thought, you know, with Postgres,
there's not enough basic
resources like for performance
work that give you a checklist.
But then I think I realized over
time, like I sort of moved towards
the, you almost need that one-on-one
interaction with someone
who has more experience because
it's just so individual and so
it's just so hard to give general
advice to people.
So I think that there are basic
questions you can ask yourself.
There's a big difference between use cases. For a lot of users, when they do benchmarking, I think a common case is they're thinking about upgrading. And so they're like, let me just see if it's slower or faster, whatever, right?
In that case, using pgbench for
that is not necessarily going
to help you, right?
Because first of all, benchmarking what you actually want to benchmark with pgbench or any benchmarking tool is hard.
And second of all, what if that's
not like your workload at all?
You know, so let's say that in
Postgres there were changes to
the way that vacuum works and you
run a benchmark for less than
a few minutes.
Well, like, did you even end up
having anything be vacuumed?
You almost have to think about
what do I actually care about?
So if you're a user who's worried about upgrading, I guess you guys would know better than me, but you'll probably have options like running your real workload in a non-production environment on a newer version and seeing how it is.
And like, that's the ideal scenario.
But if that's not an option for
you, then I think what you have
to do is really think about what
are the highest risk areas for
me?
Like, do I have one critical report
I have to run?
Or, you know, I need to make sure
that I'm able to reach this
TPS at this time of day, or whatever
it is.
And then you have to have a way
to reproduce that.
Like you almost have to have a
minimal repro for the performance
question that you're asking and
be able to test it.
And I would say even then, like,
I don't know if you're gonna
get a perfect answer outside of
actual production environment.
Of the use cases that users have for benchmarking, I haven't heard of one where I think it's going to work out for them.
But I don't know, maybe you can tell me what people want to use it for.
Nikolay: Let me shamelessly ask you your opinion about the approach I
ended up having for upgrades with
our customers, big and small,
doesn't matter.
So I was a big fan of replaying workload, like 5 years ago, but we don't have tooling for it. And mirroring, PgCat, I think, has mirroring, but it's hard. It should sit in between clients and server, and it's very critical infrastructure. Usually pgBouncer is there, and you cannot just quickly replace it with PgCat, and there's overhead and so on. And mirroring is also a tricky thing, actually.
So what I ended up doing, I split
it into 2 things.
First thing is load testing generally, with pgbench. And usually that has already been done by hackers and so on, with performance farms or something. And we can check it additionally with some synthetic workload, pgbench, not pgbench, sysbench, it doesn't matter.
So this is one thing.
And usually we rely on community
results or we can extend them
as well.
But as for the workload of our project, instead of replaying the whole thing, we think, okay, we forget about the lock manager, buffer pool, and so on, all those components, and we care only about plan flips.
We want just to check the planner
behavior, and we just grab
query examples, like 100 of them,
which are most critical in
terms of total time or calls or
something from pg_stat_statements.
Sometimes we collect multiple examples
for one normalized query
in pg_stat_statements.
And then this is our testing set.
And then we don't care about the machine we use, actually. It can be a smaller machine. We just have an exact clone of the production database, to have the same data and pg_statistic. And then we also adjust settings, all planner settings and work_mem, which is not a planner setting but also affects planner behavior.
And then we just reproduce plans,
upgrade, reproduce plans again,
and compare and see if some plans
changed.
And also we use BUFFERS, of course; we usually look at costs and BUFFERS. We don't care about timing because it can be a smaller machine and the caches can be different. And if some plan flip occurs, we quickly find it. And it can happen on a small machine, in a shared environment, basically. pgbench is not needed for a plan flip check.
Melanie: Right.
Nikolay: And that's it.
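A minimal sketch of how such a testing set might be pulled from pg_stat_statements (column names as of Postgres 13+; the limit of 100 matches the number mentioned above):

    -- most critical normalized queries by total time; ordering by calls works the same way
    SELECT queryid, calls, total_exec_time, query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 100;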
Melanie: Yeah, I think if I understand
correctly, at some point
you use pgbench, and then you realize
that it wasn't actually
going to help you with this particular
case.
Nikolay: It's super hard to reproduce
actual workload.
Melanie: Yeah, so I support that.
I think really, honestly, what
you're doing, which is looking
for plan flips and starting with
pg_stat_statements is probably
the best thing that you could do
to try to make sure that you're
going to, when you upgrade that
it's going to work out for you.
And because I think that's one of
the only things that you can
realistically reproduce because
like the planner doesn't take
into account other things that
are happening.
Each plan is planned in isolation
with the statistics and the
data that you're talking about.
It's not looking at overall system
load.
I mean, maybe we should do things like that, but the planner doesn't do that right now. So it's actually possible to reproduce that.
So I think that's a really good
sort of precautionary thing that
you can do to see if the next version
of Postgres is going to
cause you problems.
So yes, I sign off on that.
But that makes sense as a strategy.
Nikolay: Before that, we were using
pgbench and we took most
frequent and most time-consuming
queries from pg_stat_statements
and we called it crafted workload.
We tried to find some parameters or examples of queries and put them into pgbench using -f, and then the @ sign to balance them somehow, to have some balanced workload, and then ignore some errors if they happen, like foreign key violations or something.
I cannot say it worked well.
It helped, but then I just realized,
why do we care about buffer
pool and so on?
Let's just split this.
Melanie: So you were trying to,
you were doing that before upgrading
to see if you would get the same.
Nikolay: I stopped doing this because it's hard, actually, time consuming.
And if we care about plan flips,
that's it.
We have a recipe.
If we care about improvements in
some components of Postgres,
we can research them using some
simple workloads like pgbench.
That's it.
Melanie: Right.
Yeah.
You really can't use pgbench to
replicate real workloads because,
so for example, if I want to analyze
a patch that I'm doing and
I want to replicate some scenario,
I might run multiple instances
of pgbench at the same time.
So I run 1 that's doing a SELECT
query, and another 1 that's
doing a transactional workload.
And you can combine different pgbenches,
and you can sequence
them, and that kind of thing.
But ultimately, each pgbench is
going to do the same thing, no
matter what kind of custom script
you provide.
So there are different variables, and you can do things like have variables in the script and then interpolate random numbers and things like that. But in a real system, you'll have some kind of work happening and then another kind of work happening that's different work, interspersed or at unpredictable intervals. And with pgbench, during a pgbench run, all of the workers are going to be doing the same thing, whatever's in the pgbench script. So it's very, very hard to replicate realistic scenarios with it.
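For readers who haven't used custom scripts: what's being described is pgbench's \set meta-command with its random() function, plus the ability to weight several scripts in one run, e.g. pgbench -f reads.sql@8 -f writes.sql@2. A minimal sketch of such a script, borrowing the line used by the built-in tpcb workload and the standard pgbench_accounts table:

    -- reads.sql: pick a random account id and read its balance
    \set aid random(1, 100000 * :scale)
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;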
And there's lots of creative things
you can do.
Like you can have some if statement
logic where you're like,
okay, if this worker is this worker
number, then do this other
thing.
And like, but at that point, I
think it's not useful if you're
trying to understand what the performance
impact is going to
be for your real world workload.
Like I can see users maybe using
it to figure out, to understand
a particular quirk or something
like that.
Or maybe if you're thinking about
changing your hardware, using
a different SKU from the same vendor
or something like that,
you can try to understand how it
might affect general Postgres
workloads.
And then it could be helpful.
But pgbench is mainly a developer tool. And I mean, that's what it's aimed at. Not that it's really marketed, but I don't think it's represented to anyone as a tool for users to understand Postgres performance of their workloads.
But I mean, all the other benchmarks
that are out there, I think
pgbench kind of gets lumped in
with them.
And it's not really useful for
any of the things other than development.
In my opinion, it's very hard to make it useful for comparing, whatever, Postgres to MySQL or that kind of thing, I mean, because you can't use...
Nikolay: Them. For different hardware options, for example, it can be useful.
Melanie: Yeah, you can use it for
that for sure.
Yeah, but it definitely serves a different purpose than the TPC benchmarks, for example, which allow you to compare across databases.
And then of course we also have
a community of people that are
doing performance testing from
version to version of Postgres
that's different than just testing
1 patch and saying, what is
this patch doing?
They're like sort of looking for
broad themes and regressions
across versions.
And you can use pgbench for that,
but ultimately, I would say
it's easier or more meaningful
to do something like run the TPC
benchmarks when you're looking
to see if Postgres has a regression
from 1 version to another for the
purpose of community work.
Michael: You mentioned we've got
a community of people doing
that.
I've seen a couple, but I don't
know of many.
Do you have any names or things
I could look into?
Melanie: So I think the benchmarker that I follow most closely, and that many people follow, is Mark Callaghan.
Michael: Yeah, great.
Melanie: Yeah, so he's great.
And he actually came to PGConf.dev
this year and I got to meet
him.
He's kind of one of the people I was starstruck by when I met him, you know, and he's super nice.
Nikolay: I like his blog's name,
Small Datum.
Melanie: Yeah, yeah.
And he's great because he's an engineer who has a ton of experience doing performance work for a long time, mainly in MySQL, the MySQL community.
And then with benchmarking, describing
your methodology is so
important.
So he does that in great detail.
And because he's active and doing these benchmarks all the time, he's providing a lot of really useful information for the community around how he does version-to-version Postgres comparisons, different types of benchmarks, on different sized machines and that kind of thing.
So he's great.
And if you want to learn more about
benchmarking, I would say
his blog posts are fairly advanced.
So like, if you're getting started,
it might not be the best
resource.
I don't know.
But I still am kind of like, okay,
let me read through this.
And you know, he's getting down
to looking at how does your CPU
frequency governor affect your
benchmark results, right?
So it's definitely at the point
of, it's definitely past configuring
shared buffers for sure.
So he's great.
And then I would say within the Postgres community, there are developers like Tomas Vondra and of course Andres Freund who do benchmarks when they're developing, and then usually they'll look at the release holistically.
But most of those results are just
getting posted on hackers
and they're not sort of like, because
it takes so much time to
do what Mark does and sort of describe
your methodology, like
double check every result, investigate
every discrepancy, and
then publish it and sort of be
accountable for it because people
will come after you, they'll be
like, no, that is not a regression
or whatever.
You have to be ready to, to like
defend your work.
And it's almost like a full-time job in itself.
So a lot of the benchmarks that
get done within the community
are people
doing them and then just saying,
hey, I found this.
Can anyone else look into it?
There's also Alexander Lakhin, who runs TPC-DS and I think TPC-H also, every release, and does a lot of investigation.
And he's great because he does
so much investigation of bugs
and like reporting of bugs and
reproducing very hard to reproduce
bugs.
So if he finds some sort of regression,
performance regression,
he'll also bisect it down to the
commit that actually caused
it.
And as a developer, it's nice
to have someone doing that
level of analysis.
And so again, like that's not,
he's not necessarily publishing
a methodology post and all of
that.
Then there are some people that maintain the kits, because if you want to implement the TPC benchmarks, it's not like you just run them; it's more involved than that. There are people that maintain different kits, the bash scripts and so on. And I think Mark Wong just did a new one for TPC-DS.
So he has a lot of database internals
experience.
And then he was involved with the
TPC council for a while.
And so he kind of is an expert
on just the TPC benchmarks.
And so he tends to put together
these kits to try to help people
because it's not that straightforward.
And then I know there's some ongoing
work with the Postgres performance
farm initiative, but it's really
hard, right, to agree upon what
is a useful benchmark in general
to run.
So you can find some regression
or whatever.
And it's like, in order to actually
prove that that's correct
or valid is a lot of work on the
part of everyone else.
So I think in some ways, publishing
benchmarks, you can't just
say I set up this farm and I ran
all these benchmarks and now
I expect people to go look at them.
As a benchmarker, you're kind of
accountable for the analysis
and trying to find out if your
results are valid.
Nikolay: Practically useful. One day my team found a bottleneck in the WAL sender for logical replication. It was so easy to reproduce. You just create some transactions with deletes and roll them back. And at some point, very quickly, you reach a hundred percent of CPU for the WAL sender. And it led to bad conclusions, that in production we wouldn't be able to use logical for many things, but it turned out that this kind of workload doesn't happen in the wild at all, because it's a very specific kind of workload.
So I guess this bottleneck maybe
it's not worth fixing right
now, right?
I mean, it's like, so it's interesting.
Melanie: Yeah, you can definitely
report things like that on
the hackers mailing list or on
performance mailing lists and
have a discussion about, you know,
if it's kind of the right
thing to focus on and also, you
know, if it's reproducible.
Nikolay: I had some discussions with some hackers about this and didn't find agreement that it's worth fixing. And then I realized that in production we don't see such a bottleneck at all. So I just postponed this research for the future, maybe.
What do you think about observability tools to be used in feature benchmarks? Which tools, like extensions or additional tools and so on, do you look at? To avoid the observer effect as well, right?
Melanie: Yeah, I mean, I tend to not worry too much about the observer effect, depending on what tools I'm using, as long as you don't have log_min_duration_statement set to 0 or something like that.
But if you're comparing to another
version of Postgres, and you're
using the same observability tools,
it's sort of the price you
have to pay, some overhead.
But every person that does performance
work is different.
When I started, I made the mistake
of trying to like generalize
and make some of my tools useful
to other people and no one wants
that.
So I learned that the hard way.
But personally, I find that it's
very hard to look at aggregated
numbers at the end of a benchmark
run and make sense of it.
And like, that might be what you
present to other people, because
they don't want to look at every
detailed, you know, graph that
you produced or chart that you
produced.
But while I'm doing the work, I
can't do it at all without visual
representations and charts and
things like that.
For the actual charting, I use Matplotlib and pandas and that kind of thing. But I have all sorts of scripts that do things, because I use a lot of data input sources.
It depends on the patch, but I'll query pg_stat_io or, depending on what the patch is, pg_stat_activity or pg_stat_all_tables, and then gather certain information every 2 seconds or every 1 second.
And then I have scripts that take
this output and then I'm able
to just basically make it CSVs
and then load it into pandas data
frames.
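A minimal sketch of that kind of periodic sampling, using only psql: one timestamped snapshot per run, repeated at an interval (for example with psql's \watch 2) and collected into the CSVs she mentions:

    -- one snapshot of I/O statistics; the column list here is abbreviated
    SELECT now() AS sample_time,
           backend_type, object, context,
           reads, writes, extends, hits, evictions
    FROM pg_stat_io;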
So for example, I've been working
on vacuum recently, and then
one of the things that you consider
when you're doing vacuum performance
work is how often vacuum work is
being done.
So, you might want to look at the wait events and see whether autovacuum workers are waiting on the vacuum delay wait event a lot and not actually doing the vacuuming work, and whether you have the right autovacuum settings to actually be able to observe the thing you're trying to observe.
So, you know, just gathering that
information and then plotting
it correctly and that kind of thing.
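A minimal sketch of the wait-event check being described, i.e. whether autovacuum workers are mostly sleeping in the cost-based delay instead of doing work:

    -- sample this at intervals during the benchmark;
    -- wait_event = 'VacuumDelay' means the worker is sleeping due to cost-based delay
    SELECT pid, wait_event_type, wait_event, state, query
    FROM pg_stat_activity
    WHERE backend_type = 'autovacuum worker';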
Another thing: because Postgres uses buffered I/O right now, I tend to use external tools for I/O observability, like iostat, because with reads and writes in the Postgres statistics you can't actually tell; that could be a read or write to the kernel buffer cache. You could be reading from there, and that's obviously going to be quite different from actual reads from disk.
So I use iostat depending
on what I'm doing.
And then also there's different
files like, well, I use Linux,
so there's different CPU statistics
or memory usage statistics
that you can get.
And so a lot of what my scripts
do is just query whatever files
somewhere in the file system that
has that information over time
at different intervals so that
I can see, okay, what's the memory
utilization or what kinds of effects
is this patch having on
my resource utilization?
So that's a source that I use.
And then I also use some different
Postgres extensions, again,
depending on the patch.
So pg_buffercache, I look at the...
Again, I just query it at intervals
and then see in terms of
the shared buffers that I have,
how many are pinned.
And it helps me to see how the
patch is affecting the behavior
of shared buffers and utilization
there.
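A minimal sketch of that pg_buffercache sampling (the extension must be installed; the pinning_backends column has been in the view since 9.5):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    -- how many shared buffers are currently pinned, in use at all, and total
    SELECT count(*) FILTER (WHERE pinning_backends > 0) AS pinned,
           count(*) FILTER (WHERE relfilenode IS NOT NULL) AS in_use,
           count(*) AS total
    FROM pg_buffercache;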
Basically all of the statistics views are fair game. And the output from pgbench itself, so I parse that. There's the progress output, where pgbench does some averaging and that kind of thing.
So I also parse the execution reports
that come out, which is
like, they call it a transaction,
very confusing.
It's literally just an execution
of the script by a worker.
So you might get an average TPS,
but depending on, you know,
because of how it's calculated,
that might not actually show
you the P99 latency of an individual
execution of one of the execution
scripts.
So I typically parse all the execution
reports.
There's one line for each execution.
And then do various analysis of
that and plot that.
So that's an input, and there are always new sources. It's funny, I'll bring up some analysis challenge and say, I haven't figured out why there's this weird spike in IO at this one particular time. And then he's like, well, here's this other input source you could use to investigate this. And I was like, I have 55 input sources, or what? How is it possible?
There's something else that I could
look at.
But I guess like everything is
caused by something.
And so it's possible to find out
what it is, I guess, if you
look hard enough.
Nikolay: Right.
Don't you feel it would be so good to have P99, P95 in pgbench? It would be so good. It would work well, right? I'm also using pgbench all the time, like 10 seconds, 30 seconds. It's so great to see how caches are filled and so on, but there's the lack of percentiles.
Melanie: Yeah.
Yeah.
So if you do get the execution
reports, you can just like, then
read them in with some Python tooling.
Nikolay: Execution reports, is that like -r, or...
Melanie: It's -l, --log, and then you can provide a log prefix. Yeah. So that gives you the actual time of each execution of the script.
Nikolay: But this is slow, no?
Melanie: Well, I mean, it's not
free.
Yeah, that's true.
But I think if you wanted the P99
latency, I'm sure there's statistical
methods for getting it without
recording every execution, but
like you would need something like
that.
Nikolay: Yeah.
Where do you put it? Do you use an actual file there, or some RAM disk, or memory or something?
Melanie: I just, I use tmpfs and
then when it's over, I copy
it somewhere to a disk.
Yeah.
Nikolay: Good idea.
I will add it to my tool set.
Melanie: Yeah.
At first, I was pretty apprehensive about doing it because it sounded like a lot of work, but once you actually have the execution reports, you literally just use the statistics module in Python. You can give it the quantile you want and it calculates it; it's 2 lines of code, or 3 or 4, or whatever.
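What's being described: pgbench -l (optionally with --log-prefix) writes one line per script execution, and the per-execution latency column can be fed to Python's statistics module in a couple of lines, as Melanie says. The same calculation can also be done in SQL after loading the log; a sketch assuming the default (non-rate-limited) log format, with a hypothetical file name and column names invented here:

    CREATE TEMP TABLE pgbench_log (
        client_id int, transaction_no bigint, latency_us bigint,
        script_no int, epoch bigint, epoch_us int);
    \copy pgbench_log FROM 'pgbench_log.12345' WITH (format csv, delimiter ' ')
    -- P95 and P99 of per-execution latency, in microseconds
    SELECT percentile_cont(ARRAY[0.95, 0.99])
           WITHIN GROUP (ORDER BY latency_us::float8) AS p95_p99_us
    FROM pgbench_log;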
Nikolay: Yeah, and you mentioned pg_stat_io. This was great when I first saw it; this should have existed for many years already. We discussed it with Michael, and here on Postgres.FM we often discuss the lack of BUFFERS in plans, for example, right? In individual query plans. We try to advertise that it's so important to have BUFFERS in execution plans. And this is the macro level, the high level. So good to have it, right?
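For anyone new to the two levels being contrasted here, a minimal sketch (the query itself is hypothetical):

    -- micro level: buffer hits/reads for one query plan
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM docs WHERE id = 42;
    -- macro level: the same kinds of hits, reads and writes, aggregated per backend type (Postgres 16+)
    SELECT * FROM pg_stat_io WHERE backend_type = 'client backend';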
But unfortunately, I cannot say I've used it much, because it's only in Postgres 16, and we don't have that in most production systems yet. But maybe next year I will start seeing it in production with our customers.
Thank you for doing this.
My question is, how did you come up with this idea originally?
Melanie: Well, I can't take credit
for it.
So Andres has done a lot of performance
work for years and years.
So when I joined Microsoft, one of
the first projects he suggested
was to add some view that would
improve observability of I/O.
And at that time, we didn't have
the shared memory stats system.
So, the patch was much bigger at
that point because of the way
the stats collector worked.
So I did an early version of it,
and then I used it in my own
development, and I ended up working
on some of the AIO work and
found it really useful for that.
And then I ended up doing some
review of the shared memory stats
patch.
And after that, it made the pg_stat_io
patch much smaller and
simpler.
So then I went back to it.
But honestly, for me, I just looked at it as something that helps me as a developer doing benchmarking.
So it's cool that users like it
because I don't have that perspective
really.
Michael: Obviously it's useful
for benchmarking itself, but I'm
wondering about the other way around,
like, did you have to look
at its overhead and how did you
go about that?
Melanie: Yeah.
So it was much easier, like I said, once we had shared memory stats, because then it didn't need any extra infrastructure; it's basically just another data structure, a couple of other data structures. And then everything works the way that it does now.
We already had some counting for different kinds of reads and writes, because for EXPLAIN ANALYZE and that kind of thing we had those counters. And then there were some reads and writes in pg_statio_all_tables, one of the views that has some IO things. And then there were some for pg_stat_statements. So there were some places where we already had some IO counting.
And then in the other places where we didn't, there wasn't too much risk, because if you're adding some timing for, whatever, the background writer or checkpointer, those are processes where it's okay to add a tiny bit of overhead. So you didn't have to think too much about it.
I think the hard thing was actually
finding all of the places
that we could possibly do IO for
the types of IO that we were
representing and then thinking
about how to categorize it.
And especially for the different
buffer access strategies.
So like that's one of the things
that was completely not surfaced
to users was how buffer access
strategies work and how they reuse
buffers.
And so for copy and vacuum and
things like that, literally you
have no information about that.
So I think just figuring out how to represent all of that was the hard part; the performance impact was probably the least of the work for that patch, just because there was a lot of instrumentation there already, it just was not complete, if that makes sense.
Michael: So you're saying like
we're already paying a lot of
the overhead, like we're running
Postgres 15, for example, we're
doing a lot of that counting anyway.
So storing it, and also displaying it in an extra view, adds very little in actual instrumentation.
Okay, cool.
Melanie: Yeah, I mean, for backends that were doing queries, we had counting of reads and writes and hits in most places. So the places where it was sort of net new were mostly for types of processes where we don't mind a little bit of overhead, if there is any. And it would be so small.
And it would be so small.
I mean, it's measuring IO, right? So if you're doing IO, the cost of the IO itself will cover any overhead of instrumentation, for the most part.
Nikolay: Unless it's hits, actually; it also counts hits. And I found in some cases, for example, monitoring systems try to avoid storing these metrics at all, for example from pg_stat_statements. But I had some incidents where we lacked hit numbers. But pg_stat_io has hits, right?
I'm looking at the documentation
just to confirm.
Melanie: Yeah, it does have hits.
Nikolay: And this might be the kind of counting overhead that's interesting, because there's no real IO, I mean, in terms of disk.
Melanie: Yeah, it actually was
already being counted.
We had hits somewhere else, one of
the other views, or maybe it
was for pg_stat_statements, I'm
not sure.
So that was already kind of being
counted.
And when you go to get something from shared buffers, there's locking and all sorts of other things that happen that are expensive enough that it was pretty negligible.
Nikolay: Nice. Yeah, okay.
But are there some benchmarks proving that it didn't add overhead? Or is it just...
Melanie: Wait, you want to see
my receipts?
Nikolay: Yeah, yeah.
I'm just curious.
Is it just a general feeling that it doesn't bring overhead, or is there solid proof of it in terms of benchmarks?
Melanie: That's a good point.
It's mostly based on a feeling.
So, like...
Nikolay: I'm curious, is it worth
checking this, for example?
Because maybe we can strip it out
somehow or compare 15 versus
16 in various cases and see if...
Melanie: I would look at those benchmarks, but I guess my gut feeling would be that, because when track_io_timing is on we were already doing a lot of the measurements, it wouldn't be much different.
Nikolay: But actually, I can answer myself. We have benchmarks with insane numbers.
Melanie: Oh, you do? Okay.
Nikolay: No, no, I mean, with 15 we compare all versions, and we try to reach as much TPS as possible. We achieved how many, Michael? 4 million?
Michael: Nearly. 3.75, I think.
Nikolay: Yeah, just select-only pgbench, right?
And we didn't see any degradation
for version 16 at all.
It's already some kind of proof
that there is no overhead here.
And we have wait events and flame
graphs and so on.
We would notice it.
Michael: But Nikolay, you wouldn't have had track_io_timing on.
Nikolay: I want to check it.
And maybe repeat it.
No way.
Yeah, I will check.
Michael: If you did, you should
repeat again.
Maybe you'll get 4 million.
Melanie: Okay, I'm checking right
now.
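The check being made here is simply whether the setting was on during the run; a minimal sketch:

    SHOW track_io_timing;                      -- was it on?
    ALTER SYSTEM SET track_io_timing = off;    -- flip it for the repeat run
    SELECT pg_reload_conf();                   -- it can be changed without a restart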
Nikolay: Good point.
Michael: But yeah, this is great.
It's super interesting, like, how do you choose which things to benchmark and which things not to, then?
Melanie: Yeah.
So, developing intuition around that... I actually became a software engineer, and the first thing that I worked on professionally was Greenplum, which is a fork of Postgres.
I was pretty new to software engineering,
right?
So I think a lot of software engineers
develop intuition around,
like, you know, if you take a lock
every millisecond versus every
30 seconds, like, which 1 of those
is okay, or whatever, right?
Like, it obviously depends on the
context.
But I think there's a lot of sort
of performance intuition that
people develop just from being
software engineers.
And that's sort of like the first
step of, you know, thinking
about, okay, well, is it okay for
me to take an exclusive lock
here?
Do I need to find a lock-free design
for this or whatever it
is?
And that's when you have something
where you had a performance
question and you said, is it okay
that I do X, then you have
to think about, okay, how would
I benchmark this?
And in a lot of cases, maybe not
benchmark, how would I profile
this?
How would I evaluate or microbenchmark
this or something like
that?
So, for example, in 17 I worked on some refactoring of heap pruning, which is kind of scary for different reasons, because of data corruption stuff. But from the perspective of performance, because we do on-access heap pruning when you're doing a SELECT query or reading in the buffer to do an update, changing that code, adding any sort of additional instructions, potentially has overhead.
So of course, like if you're doing
actual heap pruning, then you
have IO from the WAL and you have
IO from writing out the dirty
buffer and that kind of thing.
And so you have to think about: is the thing that I'm changing actually going to matter in the context that it's in? And then evaluate from there.
And also, how important would it be if there was a difference, right? On every SELECT query, if it's a little bit more expensive, that's obviously bad, even if it's only a tiny bit more expensive. So Heikki and I did a lot of microbenchmarking around those changes to make sure that they seemed right.
And then you have to design the
benchmark so that it's actually
exercising what you think might
be the bottleneck, which is its
own challenge.
And then, right now I'm working on something that has to do with improving eager freezing, so that you freeze more data, VACUUM freezes more data sooner. And there's a bunch of different pieces to it. But one of them is a new sort of responsibility that the background writer would have for maintaining a new data structure that contains some new statistics.
And one of the questions I got in review was, have you profiled this to make sure that it doesn't have overhead? And my first reaction was, but it's the background writer. No one cares about the background writer's performance, you know? It's different from having some overhead on every SELECT query.
But again, it was doing something
that was like more, I'd say,
heavyweight.
Like it's doing some calculations
that could take time and doing
them while holding an exclusive
lock.
And so even though it's Background
Writer, like, it's worth proving
that, you know, the number of extra
instructions, for example,
is tolerable.
And that's a judgment call.
But what you want is a paper trail,
too, that you, like, did
your due diligence.
Which is why it was funny when
you were asking me about pg_stat_io's
performance impact, because in
that case, I kind of thought,
it's negligible compared to the
fact that you're doing IO first
of all, and second of all, we had
a lot of that instrumentation
there.
But for most other things, like,
I haven't had a patch other
than that probably in the last
couple years where there wasn't
a discussion of what the performance
impact could be and then
benchmarking or profiling.
Nikolay: Nice. Yeah, Michael, it was on. track_io_timing was on.
Michael: Well, that's good news
for Melanie.
There's potential for 4 million.
Nikolay: Yeah, and we didn't see a difference between 15 and 16, so I mean... Wow, wow, wow. Yeah, I already asked our robot to double-check with track_io_timing off, so we'll see that as well.
Interesting.
I hope we'll have more TPS.
I’m curious what you are working
on right now and direction of
future work, ideas, and so on.
Melanie: Yeah, so I mean, I started
a project last release that
had to do with reducing the amount
of WAL emitted by vacuum.
And that was like kind of a refactoring,
but it spun out a lot
of other projects.
And I think I sort of wanted to
take a break from vacuum this
release, but it didn't work out
that way.
So I’m working on a couple of different
things around vacuum.
And some of them are some leftover
pieces around combining WAL
records that vacuum used to emit,
like up to 6 WAL records per
block.
So we've combined a few of them
and there's more to do.
So that's 1 of the things, but
sort of the more probably exciting
work is around the freezing work.
So that basically was born out of work Peter Geoghegan did a few years ago, I think it was 16, to have us do more eager freezing.
So one of the things we hear a lot
from users is they have, say,
insert-only tables and none of
the data gets frozen.
And then all of a sudden they have
anti-wraparound vacuums and
then eventually their system becomes
read-only.
But even before those anti-wraparound vacuums, if you don't have any actual pruning or vacuuming to do, it can be expensive from an IO perspective. And so it can slow your whole system down.
And so being more aggressive about
freezing data sooner was something
that people had been asking about.
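For readers who haven't hit this: the insert-only-table problem can be spotted by watching how old each table's oldest unfrozen transaction ID is. A minimal sketch:

    -- when age approaches autovacuum_freeze_max_age (200 million by default),
    -- an anti-wraparound vacuum is forced
    SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 10;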
And so Peter worked on that and
it's really difficult to find
a heuristic that works for determining
whether or not to freeze
something because if you have a
table where you're doing a lot
of updates, like a standard pgbench
built-in workload, then
it's a waste to freeze things because
freezing emits WAL and
it dirties the page and that kind
of thing.
So finding a heuristic that freezes
the stuff you want to freeze
and doesn't freeze the stuff you
don't want to freeze is then
a journey.
And one of the parts that's really
hard is that you have to try
to come up with all of the workloads
that are important, the
best case and worst case, especially
the worst case workloads
that users might have and think
about how it would perform.
And that's so hard.
So that took a lot of time.
And I spent a lot of time improving
my different benchmarking
setups and coming up with these
different workloads and evaluating
them, looking at them.
So that's one thing I've been working
on.
And then last release, there's the new read stream API that does vectored I/O, does larger reads, that Thomas Munro did. And I worked on the different users of it. So, for sequential scans and... I had some additional users that didn't go into 17, because they have performance effects that you have to analyze really closely, and they weren't ready.
Nikolay: And one of them was...?
Michael: Did you use the word analyze deliberately there? Was it analyze?
Melanie: Yeah.
No, no, analyze went in actually.
Michael: Oh, cool.
Nice.
Melanie: That was the laws work.
Yeah.
So analyze went in, but bitmap heap scan and vacuum didn't.
Michael: Got it.
Melanie: So vacuum is an interesting one, because streaming vacuum sounds great, but vacuum uses a ring buffer to keep it from having too much of a negative effect on shared buffers, so you don't wash out all of the things from your workload that are resident in shared buffers.
It reuses buffers, which means that it has to write WAL, typically, and there's a lot of calculation around how big the ring is, such that the WAL writer can help vacuum to write out the WAL.
So vacuum's not doing all of it.
And if you do I/O and if you do
reads in a different size, that
means that you're dirtying and
needing to clean buffers at a
different rate.
And that actually was affecting
how many individual WAL writes
and syncs that we would do.
So like, in a given Fsync, you
can have different amounts of
data, but each Fsync's a system
call and has overhead, right?
So you want to amortize that cost
over a large enough amount
of data and also have WAL writer
help to do some of the work.
So by doing bigger I/Os, we are
actually making the performance
worse in some cases because of
these interactions with additional
WAL syncs.
So Thomas Munro has been working
on a, I think he's calling
it streaming write-behind.
It's kind of a way of thinking, for the purposes eventually of AIO, about how backends and maintenance processes and writes in general in Postgres can make sure, if they're doing larger writes, that they're actually being conscious of when they should do those writes, based on the WAL that they're required to emit.
So like if you're doing reads,
you don't have to really think
about that.
But if you're writing out data,
the WAL associated with those
dirty buffers has to be flushed
first.
So you have to think about when
do I wanna do writes?
Because you need to think about
when do you actually want to
do the WAL writes.
So he's been working on that and
it's a tough problem.
A lot of benchmarking, a lot of
thinking about it and buffer
access strategies really complicate
it.
And those are used for large selects.
Those are used for, you know, copy,
vacuum.
So he's doing some work around
that and I'll probably sort of
jump in after some of this vacuum
stuff is done and try to work
on some of the AIO work that's
coming up.
And I think that ultimately, like
just between those things,
that'll probably be most of what
I end up doing on my own.
But I would like to see... for pg_stat_io, there's been a patch to add WAL-related IO statistics to it that's been around for about a release.
And basically we haven't come to an agreement on what the right thing to do is. Our thought was that they would be in units of the block size; you can configure your block size for Postgres, and you can configure it for WAL, so they would be in units of the WAL block size. But that actually isn't always exactly the unit that we do reads and writes in, though it usually is.
But now we're actually talking about rejiggering pg_stat_io, especially in light of vectored I/O, to change it so that it's not: you have a thousand writes, you have the block size, and you multiply them to get the number of bytes. Do we change it to represent it differently and put it just in number of bytes? And how do you represent that? If you do, do you want the number of writes to be the number of system calls?
So we were just thinking about
how to actually represent it,
but I would love to see WAL stats
go into pg_stat_io in 18 also.
So we'll see.
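For context, this is the representation being discussed: in 16 and 17, pg_stat_io reports counts plus an op_bytes column, so getting bytes means multiplying. A sketch:

    -- bytes read and written per backend type and context, derived from counts (Postgres 16/17)
    SELECT backend_type, object, context,
           reads  * op_bytes AS bytes_read,
           writes * op_bytes AS bytes_written
    FROM pg_stat_io
    WHERE reads > 0 OR writes > 0;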
And there are also a bunch of other exciting things going on. Masahiko Sawada is working on parallelizing the first and third phases of vacuum. He did that really exciting work in 17 on the TID store; I think that's my favorite feature from 17.
Nikolay: Can you explain it to us?
Michael: Yeah, I don't know that one.
Melanie: Yeah, so one of people's biggest complaints about vacuum is when they have to do multiple rounds of index vacuuming, because even if you set maintenance_work_mem to a high value, you might end up filling it up with dead TIDs, and then you have to do a round of index vacuuming.
And if your indexes are very big,
you can imagine that that ends
up really affecting your vacuum
performance.
So what he did was introduce a new data structure that organizes the dead TIDs in a way that is much, much more efficient. And that made it so that you're using way less memory for the actual dead TID storage, but also it can change size on the fly; basically, you're not tied to the same restrictions that we had around the size of maintenance_work_mem before.
So most people are going to not
need ever to do multiple passes
of index vacuuming.
So I think that people with very
large indexes and large tables
are going to see their vacuum performance
be a lot better.
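One way to see the behavior she's describing on a running system: pg_stat_progress_vacuum reports how many index-vacuum passes the current vacuum has already done. A minimal sketch:

    -- index_vacuum_count > 1 means the dead-TID memory filled up and indexes
    -- had to be scanned more than once (much rarer with the 17 TID store)
    SELECT pid, relid::regclass AS table_name, phase, index_vacuum_count
    FROM pg_stat_progress_vacuum;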
Nikolay: I have big customers with big partitioned tables. We've been talking about partitioning for years, and it's hard, and they will benefit. So vacuuming will be faster, basically, right?
Melanie: Yeah.
I mean, if you don't have this
problem with the multiple index
vacuuming passes, then maybe not.
But it's something we hear a lot
from sophisticated customers.
Nikolay: And also, I remember that before PostgreSQL 17, only up to 1 gigabyte could be used for that.
Melanie: Yeah, that restriction
is lifted now.
Nikolay: Is it related to this work?
Melanie: Yes, it is related, yeah.
Nikolay: I suspected so, good.
Melanie: Yeah.
So that's really cool.
And I think it's something that's
hard to explain unless you've
had the problem.
I think if you haven't had this
problem, you're not really thinking
about it.
But for people that do have this
problem, I think it's going
to make a big difference.
He's building on that work, and there are a lot of things that we can do with vacuum because of it; this read stream API will make it somewhat easier to do parallelization of the first and third phases of vacuuming. So he's working on that as well. And then, you know, the dreams of having global indexes involve a step where you persist the dead TIDs to disk, because otherwise you wouldn't be able to.
Basically, there's discussion of being able to split up the phases of vacuuming and be able to vacuum indexes at some point, come back to it and do it later. Like, have vacuum just do the first stage of vacuuming. And if you want to do the phases of vacuum separately, you have to have the dead tuples, basically what index entries you need to delete; you need to save that somewhere that's not in memory.
And so there are some modifications to the TID store that they can make that would probably make it easy to persist.
And I don't know that he's planning
on working on that now, but
Dilip had done some work on dead
TID storage.
And so I think there's like a lot
of exciting things around vacuum
that'll be happening.
And there's also been discussion, which I know has happened in the past, but new discussion, about autovacuum scheduling. And not just autovacuum scheduling, but cost-based delay, and how to change that to be more adaptive and to use statistics that are collected while vacuuming to decrease the delay adaptively if you're not able to finish the vacuum appropriately, and things like that.
So we'll see.
Nikolay: So we talked to Peter Geoghegan in the past as well, and I remember the deduplication work. You both sound not like people who dream of getting rid of vacuum completely; you sound like there is big potential for improvements, and it's happening, so it's going to stay, but with improvements. I'm trying to simplify for our audience.
for our audience.
Melanie: The undo people? I mean, they still would have needed vacuum.
Nikolay: The undo people, right?
Melanie: I think that there has been a lot more work on vacuum this release than there has been...
Nikolay: In 17 or 18?
Melanie: In 17 and 18.
Nikolay: 17, 18.
Yeah, yeah, that's great.
That's great.
So it's, yeah, many people need
it.
I mean, Postgres instances.
So yeah.
Michael: I like this kind of work.
I think Peter Geoghegan's done a lot
of work in the last few releases,
and you and others that focus on
things that will help people.
Well, almost everybody will benefit
without having to change
anything.
And that's, those are so powerful.
They're so easily forgotten or
easily ignored, but everybody
benefits, it gets better for almost
everybody without A, noticing
and B, having to do anything.
Just so powerful, those kind of
changes.
So thank you for working on those
and encouraging others to as
well.
Appreciate it a lot.
Melanie: Yeah, the dream is that we get rid of all of those autovacuum GUCs, because that's terrible. I don't know, the fact that there are so many blog posts... I mean, they should all be gone. Your autovacuum configuration should be basically nothing. The system should figure out what to do.
Nikolay: We often discuss outdated
defaults, but this is a radical
position.
Melanie: Right.
Yeah.
I mean, it's not going to happen overnight, but that's my vision, and I think a lot of people want this. I think the cost-based delay system is really hard to understand because it's not intuitive. The relationship, like scale factor and all of that, it's just like, what is that?
I mean, I understand why it happened
from a developer perspective,
but I think that getting vacuum
to do the right thing is like
our job as engineers and shouldn't
be users' jobs, basically.
So we'll see.
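For reference, these are the kinds of knobs being talked about, the ones all those blog posts explain; the values in the comments are the current defaults, not recommendations:

    -- per-table triggering: vacuum when dead tuples > threshold + scale_factor * reltuples
    SHOW autovacuum_vacuum_threshold;      -- 50
    SHOW autovacuum_vacuum_scale_factor;   -- 0.2
    -- cost-based delay: how long a worker sleeps once it has spent its cost budget
    SHOW autovacuum_vacuum_cost_delay;     -- 2ms
    SHOW autovacuum_vacuum_cost_limit;     -- -1, falls back to vacuum_cost_limit (200)
    SHOW autovacuum_max_workers;           -- 3
    SHOW autovacuum_naptime;               -- 1min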
Michael: Wow.
Well, we don't do clips on this
show, but if we did, I think
that would be one.
Nikolay: Also the number of workers, 3 workers by default. If you have almost 200 cores and only 3 workers, this is insane, right?
Melanie: Yeah.
Well, actually, so I will take
it back.
I would say that the one configuration
that I think users should
be able to provide is the max number
of workers because like
that's your system preference,
right?
Like, I mean, maybe not the minimum number of workers, but the maximum number.
You should be able to keep the
system from having 50 processes
just doing vacuuming work if you
want.
Right?
I think that they should be.
Nikolay: Because this is the second phase of writes. We need it, right? Because if you just deleted something or updated something, the work is not done until it's vacuumed, and you need more workers anyway.
Michael: Also, Postgres doesn't know how many cores there are, right?
Nikolay: Doesn't know.
Melanie: It doesn't take that into account, yeah. I mean, people are allowed to use their servers for other things at the same time. Maybe you don't, but we sort of don't assume that we have access to everything, I guess.
Michael: Yeah, makes sense.
Nikolay: Thank you for coming. It was very interesting. Thank you for this work again. Thank you for the pg_stat_io work and so on, and the benchmarks. We will keep an eye on it. Good luck.
Melanie: Thank you.
Michael: Yeah, thanks so much.
Melanie: Thanks.
Michael: Bye.