Postgres FM

Michael and Nikolay are joined by Peter Geoghegan, major contributor and committer to Postgres, to discuss adding skip scan support to PostgreSQL over versions 17 and 18.
 
Here are some links to things they mentioned:

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With special thanks to:
  • Jessie Draws for the elephant artwork 

Creators & Guests

Host
Michael Christofides
Founder of pgMustard
Host
Nikolay Samokhvalov
Founder of Postgres AI
Guest
Peter Geoghegan
PostgreSQL Major Contributor and Committer, working at AWS. Irish expat. A theoretician's practitioner.

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Michael: Hello and welcome to Postgres.FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard and I'm joined as usual by

Nikolay, founder of Postgres.AI.

Hello Nikolay.

Nikolay: Hi Michael, how are you?

Let's not miss the fact that we have a guest today.

Michael: We didn't miss it.

You threw things off by saying, how are you?

And I'm British, so I have to reply, and how are you?

Nikolay: Well, let's blame the US.

This is where I learned to ask.

So I'm super glad we have Peter today.

Hi, Peter.

Peter: Hi.

Good to be here.

Thanks for having me on the pod.

Michael: We are delighted to have you.

And yeah, it's Peter Geoghegan.

Hopefully everybody knows of Peter, but in case you don't, he's

a major contributor and committer to Postgres for many years.

The last three or so of those have been at AWS.

So today we're going to be talking about some of Pete's recent

and upcoming work on Postgres related to skip scan implementation,

but it's far more than that.

And honestly, the more I learn about this work, the more excited

I am about it.

I thought I knew quite a lot, and I've been reading up in the

last couple of days, and it's even better.

I didn't realise, and I'm probably cheating or skipping ahead

far too much, but I want people to understand how big this is.

This could affect our indexing strategies going forwards.

We could be choosing to index Postgres differently.

I guess it's obvious in hindsight, but there are some real major

differences.

So yeah, I'm super excited about this work.

I want to thank you for having started it and continuing with

it.

I'd be interested to hear, though, what attracted you to this

work, and did you realise how powerful it would be when you first

started on it?

Peter: Actually not at first, no.

I last did a similar interview to this.

I think it was a predecessor podcast, technically, not this one.

At that time, I happened to talk about how, for whatever reason,

I kind of like finding as many different ways as possible of

looking at a thing.

At the time, I think I said something like, well, there's a bunch

of different perspectives.

There's the data structure perspective, there's the concurrency

control perspective, and on and
on down the list.

And it was that sort of general
mentality that brought me to

this work, because I realised,
hey, I'm missing something here.

It's this tendency I have, to sort
of collect these different ways

of looking at the thing, the same
familiar old topic.

I was for a long time missing this
extra bit, which I think is

embodied by the work I'm doing
now, and its source material,

which is a pretty obscure 1995
paper.

It happens to be called Efficient
Search of Multidimensional

Indexes.

So you haven't heard of it, but
hopefully that's not any indication

of its real-world relevance.

It has been something that I believe
has influenced a bunch of

other systems.

So it's, in that sense, I think,
pretty standard stuff.

But the implications, maybe, as
you were saying, Michael, are

not fully appreciated.

So maybe we'll start with a general
overview for those that don't

know what skip scan is.

It's a feature that certainly Oracle
and in recent versions MySQL

and also SQLite have.

They all call it skip scans.

I think in DB2 it's called jump
scans or something like that.

So the basic idea is to extend
the standard B-tree index access

method to support new types of
querying without changing anything

about the on-disk contents, without
fundamentally altering anything.

This is possible with multi-column
indexes, composite indexes.

The simplest case, and the one that
gets the most attention, is

where you have an omitted prefix
column in your query.

So, for example, you do a SELECT
from a table where b equals,

let's say, 5.

Here we would have a composite
index on (A, B); of course, A has been

omitted from the query predicate.

So traditionally, in Postgres at
least, you wouldn't be able

to really use the index.

You might do a very inefficient
full index scan where you sort

of have to grovel through the entire
leaf level of the index.

And that works, but obviously that's
not very efficient.

What SkipScan will allow you to
do here is, assuming that there

are not terribly many distinct
values in column A, as a rule

of thumb, I would say hundreds
of distinct values total, or

perhaps thousands, is about the
upper limit, then you can effectively

use that composite index as a collection
of logical sub-indexes,

as they're sometimes called.

So, for example, the first value
might be 1 in the column A.

So we'll do A equals 1, B equals
5; A equals 2, B equals 5;

A equals 3, B equals 5; and exhaustively
we'll do multiple small

index scans that are effectively
pretending to be one continuous

index scan.

And that will be certainly much
more efficient than a sequential

scan in most cases.
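(A minimal sketch of the case Peter describes, with hypothetical table and column names:

  CREATE TABLE t (a int, b int);
  CREATE INDEX t_a_b_idx ON t (a, b);

  -- Column a is omitted from the predicate, but with skip scan the
  -- (a, b) index is still usable, one small descent per distinct a:
  SELECT * FROM t WHERE b = 5;

The effect is as if the executor ran WHERE a = 1 AND b = 5, then WHERE a = 2 AND b = 5, and so on, for every distinct value of a.)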

Of course, it's true that you'd
be much better off having an index

on B directly, but it's also true
that who says that that's going

to be possible?

You might just not think to do
it, or in some cases, at least,

it might not really be feasible,
which depends on the requirements

for the user.

This is particularly likely with
things like data warehousing

use cases, the fact table, where
a lot of the querying is sort

of ad hoc, and it's not really
feasible to have all the indexes

you might possibly require ahead
of time.

So here, it's making a composite
index, which is more useful

for a wider variety of queries.

If you only require adequate performance
here, sub-second performance

is attainable.

It's not going to give you sub-millisecond
performance, although

that does sometimes happen.

But it's still, for these kinds
of use cases, almost as good.

And yet it allows you to avoid
creating a bunch of quasi-redundant

indexes.

So in that sense, as you were saying,
Michael, it's kind of like...

It at least would require some
of the conventional wisdom to be updated.

I wouldn't say it's a radical shift,
but there is a sense in

which, OK, in light of this new
capability, we're going to have

to rethink some of the rules of
thumb we have for indexing in

general.

I think that's likely to work out
that way.

I would say I'm excited to see
how people end up using the feature

in practice.

Many things are possible.

So that example I gave is a sort
of very traditional simple example.

But also worth noting that the
techniques that I'm working on

include similar, though slightly
different, cases where, for

example, you have a range.

So suppose the query is SELECT
* from sales, where sales date

between the 1st of August and the
10th of August.

Then it could be that you're not
enumerating every possible date

in the entire table, but just those
dates from that range.

But the same exact idea still.

So for the 1st of August, plus
whatever the actual thing that

appeared in your predicate in the
query is, let's say, again,

it's b equals 5, I don't know.

That for the first of August, for
the second of August, for the

third of August.

So from the query perspective,
you haven't omitted the column.

The column appears in this between
range, but it is effectively

the same idea.
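(A sketch of the range variant, with hypothetical names and dates, assuming an index on (sales_date, b):

  SELECT * FROM sales
  WHERE sales_date BETWEEN '2024-08-01' AND '2024-08-10'
    AND b = 5;

Rather than enumerating every date in the table, the scan only needs one small descent per date inside the range.)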

It also should be pointed out that
these are all very general

techniques.

They can be combined in all kinds
of different ways.

And that's where it becomes super
interesting.

Like, you have very, very complicated
combinations of these things.

The example query that I used in
my initial email to the hackers list,

trying to sell it to people,
was taken from the original

source paper.

And in that instance, it was, I
believe...

So you have an initial omitted
prefix column, which is on a column

called Apartments.

Then you have a range.

Then you have an in, and then you
have another in.

So it's all kind of working together.

You see it's like this sort of...

And that's where I think it can
make a huge difference when you

combine all these techniques together,
which you will be able

to do freely.

So there, effectively, you have
this query that would conventionally

require a huge and inefficient index
scan, which is just kind of

a non-starter.

What you actually would have to
do is have a bunch of single-column

indexes and use some kind of bitmap
or something like that.

But then you give up the ability
to have the sort order from

the index.

You can't do a limit 5 or something
like that.

You can't use a group aggregate
directly, etc.

Whereas this way, you can pretty
well convert what you would

think of as 1 big query into what
you would think of as maybe

thousands of tiny queries that
each land on 1 leaf page again and

again.

And it's sort of a lot of small
queries pretending to be 1 big

one.

It's not apparent to you looking
at EXPLAIN ANALYZE that that's going

on, but it is.

So effectively, you're rewriting
the query as if it was just

a bunch of tiny queries in a way
that is fairly intuitively obvious.

It wouldn't be very practical to
do this, but you could do it

by procedurally generating a bunch
of these small queries yourself,

in principle, at least.

And that is the one core clever
idea here.

It's surprising how far this one
core idea will get you, because

it is a fairly simple idea in the
end.

But there's a bunch of things that
I haven't even considered

yet that the paper also contemplates.

For example, it has a part about
not-in predicates.

You would think it would be impossible
to use a not-in predicate

for an index.

But again, it's the same idea.

It's like skip-scan, except rather
than enumerating every possible

value in the index, it's enumerating
every possible value in

the index except these ones.
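(A sketch of that case, with the same hypothetical index on (a, b): a query like

  SELECT * FROM t WHERE a NOT IN (3, 7) AND b = 5;

could be handled by enumerating every distinct value of a except 3 and 7, with one small scan for each.)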

So it's basically the same thing
again and again.

So it just buys you an awful lot,
one core simple idea, which I

find kind of cool.

So that's the big picture.

Yeah, something that I'm expecting
to get into Postgres 18, and

something that builds on work that
started with Postgres 17.

Michael: So yeah, that's a great
explanation.

I think the multi-column thing
is the complex example you mentioned

is the part that blows my mind
and starts to make me think, you

know all this advice we've given
that's, you generally advise

people against huge multi-column
indexes because they often only

target then like a very, very niche
set of queries.

And you end up accepting that you're
going to do a large index

scan with a bunch of rows being
filtered, which we know is inefficient,

but it's currently the best solution.

You lose...

Nikolay: ...HOT updates also.

I'm going to continue advising
against too many columns in indexes.

Michael: Well, okay.

But I can see the flip side.

Once you've got fewer, once you
only need fewer indexes, you

don't have to support, like let's
say you've got 5 columns that

are important here.

Sometimes I've seen people index
all 5 in 3 different orders

to support different versions of...

And I feel like that kind of thing
will become less important

if we even need it at all anymore.

So that's mind-blowing.

But just to go back, your thought
process is super interesting

here as well.

So did you come across this paper
as part of widening your horizons

and then think of all these implementation
ideas?

Or was it, you thought, this is
an area, this is a blind spot

for me, I should seek out papers
in this area?

Peter: Well, there's surprisingly
little written about the topic,

actually, even though, as I say,
it's implemented in a bunch

of other systems, unlike Postgres.

I also think it's probably true
that they don't exploit it to

its full extent.

The actual paper was written by
people that were working on a

system called NonStop SQL in the
1990s, which hasn't quite completely

died yet, but it certainly has
far fewer users.

It was a very influential system,
though.

So I think that it's likely that
the full generality of it hasn't

been exploited for whatever reason
in other systems.

My hope is that my implementation
will do just that.

It's kind of weird.

The example complicated query I
gave, what I noticed about that

is, most of the time, if I changed
the order of the columns,

not always, but most of the time,
it made hardly any difference.

Michael: Wow.

Peter: Right, which is sort of like...

That wouldn't always be true, but
just given the particulars

of the query that I'm talking about,
it kind of didn't matter

if I swapped.

It made maybe a small difference
one way or the other...

But that is a bit mind-blowing,
sure, that that could ever be

true.

Again, it's going to be most compelling
in cases where you don't

know what the requirements are
ahead of time.

It's very good for ad hoc indexing.

And you can always beat it by having
a dedicated index if you

want to pay for that.

But that's perhaps impractical
for a lot of users.

And again, I think in OLTP apps,
it is still useful, but it's

more for avoiding the worst case,
for something that you maybe

should have anticipated but didn't
for whatever reason.

It's good for that more so than
something that you would ideally

plan on using.

But it is relevant to both.

As for why, I don't know.

I think it entered my awareness
about 5 or 6 years ago, but I

didn't fully understand it.

And it took me a little while to
figure out the right way of

doing it.

So having explained all that, it
should also be pointed out that

Whether or not you actually want
to rewrite your query in this

manner, it all depends, doesn't
it?

It could be, for example, that
the old way of doing it happens

to be the most efficient one, because,
yes, you could, in principle,

rewrite it into 100 different small
queries, but those small

queries would all be hitting the
same few leaf pages anyway,

so you actually would want to do
maybe not 100, maybe 1 or 2

at most small queries there.

How do you know ahead of time which
it is?

Well, you don't, right?

But you don't have to.

So I've had this intuition that
if I solve that problem first,

if I solve it for in-lists, then
after that, I would be able

to move on to this.

That would be a useful foundation.

So Postgres 17, this has been implemented.

It'll be available, I think, in
3 weeks when we release 17, if

all goes as planned.

So that took the existing capability
for indexing in-lists and

made it so that if you have 50
integers, say, and they're contiguous

integers like 1, 2, 3, 4, 5,
through to 50, then suppose they're

all co-located on the same leaf
page, then you're going to do

1 index scan and read them all
at once.

If, on the other hand, it turns
out that they're spread out all

over the index because maybe it's
low cardinality, or maybe it's

low cardinality just in that part
of the index, there's no rule

that says it has to be 1 or the
other.

It could be at different points
in the key space.

Low cardinality, others could be
high.

So making a dynamic choice is very
useful, I think, because you

don't have to know.

That's what makes it safe to aggressively
apply these optimizations,

because if it doesn't work out,
in any case, the B-tree scan code

is able to combine the would-be
accesses into 1 on the fly.

So you'll get the same behavior
as in previous versions, if

and only if that makes sense.
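(A sketch of the in-list case, with a hypothetical single-column index on a:

  SELECT * FROM t
  WHERE a = ANY (ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);

If the matching entries happen to sit on the same leaf page, Postgres 17 reads them in one index scan; if they are spread out, it descends separately for each group, and that choice is made on the fly as the scan progresses.)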

So that sort of flexibility, that
dynamic behavior, seemed like

an essential prerequisite.

So it's not literally true, but

it's conceptually the case that
you're rewriting the big query

to lots of little ones and then
combining the result, allowing

them to pretend that they are one
big index scan, even though

they're not.

But the actual degree to which
you do that, how much atomizing

actually happens, is determined
on the fly as the scan progresses.

So it's kind of indeterminate if
it's how many actual index scans

you're doing, but that doesn't
matter, if you get what I mean.

Michael: Yeah, it's so cool.

It doesn't matter to the, it's
like one of those things that as

a user, I don't have to worry about
it, think about it at all.

As somebody who's implementing
it though, it feels like it makes

it much more complex for you to
build.

Is that fair?

Peter: Yes and no.

I think there's two ways of looking
at it.

There's what you just said, which
is, in an immediate sense,

true.

It is more complicated.

But in another sense, It isn't,
because...

Complicated to who?

I mean, the optimizer has to model
all these costs, right?

And it turns out that's hard.

And it turns out that statistics
are...

Well, they're statistics.

And if you're talking about, like,
you have all these secret

hidden correlations, but these
are multiple column indexes, of

course, so we have statistics for
each column, but at least,

unless you go out of your way to,
there's no correlated statistics

for columns.

There's a generic assumption that
these are independent, even

though very often they're not.

So having the access path be very
forgiving, so the optimizer

gets things quite wrong, but it
actually doesn't matter, or it

doesn't matter very much at least,
is way easier, at least from

the point of view of the query
optimizer.

And on balance, that probably makes
it much, much easier to actually,

as a practical matter, make it
work.

I think that that will probably
turn out to be really important,

although at this point, that's
pretty speculative.

Michael: I had this on my list
of things that might come up,

but the cost models are super interesting.

It seems to me, and I haven't checked
for sure, but from looking

at some examples, you might not
have changed the cost estimations

at least in 17.

You might consider that because
you're making a bunch of things

more efficient in a bunch of cases,
you might try and reduce

the costs for those to encourage
those scans when they make sense.

But I guess we're already doing...

Anyway, I guess my question is,
have you changed the cost model?

And if not, why not?

Peter: Yes, but I'm not surprised
that you didn't notice it because

the main change that I made to
the cost model was that now it

will saturate.

Because, a very simple observation,
the most expensive possible

index scan ought to be a full index
scan.

You know, very simple.

If you scan every single...

There's a pretty tangible maximum
cost here, at least

in this world where this stuff
is available, as it will be in

17, where it is physically impossible
to scan the same leaf page

a second time now, no matter how
the constants have been specified

in the query itself because automatically
that just cannot happen.

It's a physical impossibility.

So we know that we can take to
the bank.

And so the cost model saturates.

There's an awful lot of uncertainty,
except when it comes to

the worst case.

We know for absolute sure that
in the very worst case we'll

scan every leaf page once.

So that's where you notice it more.

So you can have an absurdly long
in-list, and at a certain point,

it'll just stop adding additional
cost, since it is physically

impossible for the cost to keep
going up even as you continue

to add these.

So that sort of plays into what
I was saying about having some

general understanding of what the
worst possible case can be

when everything goes wrong, when
the statistics are garbage.

That assurance, in the end, makes
everything easier rather than

making it harder, I feel.

It's also just useful.

Michael: I like that it seems
pessimistic but also safe.

Peter: That's exactly it, sort of.

I think statistics are kind
of inherently just very uncertain. It's

amazing that optimizers work as
well as they do, I'd say.

Like, people don't realise, I feel,
often, when something goes

wrong, well, it was next to a miracle
that everything was working

as well as it was.

And probably things were not technically
working optimally, but

you didn't know that because you
were happy with adequate performance.

You didn't consider that that was
technically 10 times slower

than what could have, in principle,
have been attained with the

same index and such.

So, like, I think that if we're
honest, we have to admit that

there's an awful lot of uncertainty
about the cost models, and

that doesn't necessarily matter.

I'm more encouraging the idea,
amongst people that

are on the hackers list, that it's better
to have a kind of...

I wouldn't even call it insurance,
that's too strong a word,

but just have...

Account for variation at runtime
dynamically rather than doing

so statically in the query plan.

That just seems like, to me, something
that has very little downside

and potentially quite a lot of
upside.

It is a more general idea.

It doesn't just apply here.

It applies, for example, with...

I don't know.

You can do some similar things
with hash joins.

It does get more speculative if
you keep going down the line

of things that are like that.

But why would you trust an uncertain
selectivity estimate if

you didn't have to?

There's no reason why you can't
have some kind of way of recovering

from a misestimation at runtime,
at least in certain cases.

I guess that it is related to that
whole way of thinking about

it.

The projects I'm working
on are sort of in that vein,

I feel.

Nikolay: Do you mind if I break
the structure a little bit as

usual?

Peter: Go ahead.

Nikolay: Honestly, I didn't read
the whole discussion.

So I'm going to ask some newbie
questions.

First of all, I see this potential
of...

There is loose index scan as well,
right?

And this is a very different thing
for distinct values.

And we have a wiki page, and we're
constantly using this technique.

And we also mentioned, like you
did, other database systems have

it implemented, Postgres doesn't.

But here we discuss skip scan and
skip scan, and this is very

different, I see.

But I just googled a little bit
and I found people use skip scan

in that context as well, right?

So if I'm right, some people reaching
this point will still not

fully understand what we'll discuss,
right?

Peter: It's possible.

Nikolay: Right?

So we discussed the problem.

There is a problem.

We have an index on columns A, B,
but the filter is only on B, the first

column, A, is not relevant, and
usually it means this index cannot

be applied

Peter: Well.

It's worth pointing out that with
what I'm talking about, it

only applies when you have at least
2 index columns, whereas

I don't believe that's true with
loose index scan.

You can still usefully skip if
there's only 1 column.

Nikolay: Because there we need
distinct values, it's a different

task.

Here we need all values, but we
have an index, which usually we

say you have the wrong index, right?

But your optimization will make
it good, but only if a column

has not many values, right?

If it has a lot of values, maybe
it won't be good, right?

Peter: It's certainly not going
to be able to help because,

of course, technically you could
say, okay, there's a million

logically distinct subindexes,
but that degenerates to a full

index scan, so it's the same.

By the way, that might still be
the best plan, all things considered.

We can't rule that out either.

Sometimes that is actually the
best available plan.

But overall, yeah, you're probably
not going to want to use it

unless you have hundreds, perhaps
low thousands, or occasionally

maybe tens of thousands, but after
that it's unlikely.

Nikolay: Will the planner decide,
will it have some threshold

to decide when skip scan should
be applied and when it doesn't

make sense anymore?

Peter: Not as such, because as
I sort of touched on already,

there is no special skip scan index
node.

I think that's a bad idea because
it's all dynamic, right?

It's an additive thing.

Nikolay: Yeah, now I'm catching
up.

Peter: Sometimes for all the usual
reasons, you might want to

have an index only scan that happens
to use a skip scan, or you

might want to bitmap index scan,
or you might want to parallel

index scan, etc.

It is useful, I think, to not have
a distinct kind of index path.

More choices means more wrong choices
from the point of view

of the optimizer.

So I feel like that's a distinct
advantage of the approach I've

taken.

Of course, I still have to figure
out a way of making it not

matter when that's not the right
thing to do.

And that's not very easy, but I'm
almost...

I think I've got that part figured
out.

We'll see.

Michael: Do you mean to like minimize
regressions at times where...

Peter: Yeah, OK.

Because again, occasionally it's
going to be intrinsically the

best thing is to just scan the
index without trying to skip at

all.

So you don't want to waste a lot
of CPU cycles on a useless enterprise

that is skipping when you shouldn't
be skipping.

You really do want to fall back
on essentially the same behaviour

as previous releases there.

Now, mind you, overall, hopefully
the optimizer just wouldn't

pick that way of doing it, because
it would be wrong.

It wouldn't be the fastest available
plan, given the available

indexes.

So that only applies at all when
you're already doing something

that's basically pretty slow.

But overall, yeah, I think we're
going to have to make sure that

you don't notice it when it isn't
able to do what you would hope

it would.

That is sort of complicated, for
sure.

Michael: Just to make sure I've
understood, if we could do a

scan without skips, if we had the
perfect index for a very efficient,

let's say, index-only scan, that
would have been the index that

would have been chosen in the first
place because its cost estimate

would have been lowest.

So already when we are skipping,
you're doing a scan that involves

some skips.

We haven't had that available,
and therefore, yeah, okay.

That's really cool.

Peter: Yeah, in general, I think
it makes more sense to think

about these sorts of queries as
being on a continuum.

One extreme is where there's, let's
say, no more than 5 or 10 distinct values.

And then the fact that you're doing
5 different descents of the

index is hardly noticeable.

It's maybe less than a millisecond
in the end anyway.

So that would be a very favourable
case for skipping.

And then you have less favourable
cases, I suppose, where cardinality

goes up.

And eventually, you get to the
point where you can't do any skipping

at all, and that's where you don't
want to pay a penalty for

being open to the possibility of
skipping, if you like.

Also, as I sort of touched
on earlier, it could

be, for example, that you have
an index where 90% of the pages,

in respect of the
leading column, have 3 values total,

but nevertheless, the remaining
10% are all perfectly unique.

So now you don't have...

You can't really average the two.

It's just two different things.

You can't think of it as the average
of the 2, really.

So the fact that it's kind of continuous
just makes sense.

You wouldn't want to have it be...

If the planner were to make a static
choice, then how would you

account for that variation where
you legitimately need to change

your strategy for the same index
scan in real time? It's

totally possible that that would
be the best thing to do, so I

like not having an opinion until
you really have to have an opinion, at the

last minute.

You do get the best of both worlds.

That's the thinking.

Anyway.

Michael: I think it's super smart.

And those distributions are quite
common.

We see them all the time.

And it's the time where statistics
most obviously falls flat

on its face because of averages.

I know it does some things to most
common values.

But anyway, I completely agree
that makes loads of sense.

Peter: Sorry, you were going to
say?

I was going to give another example
of what you're saying with

statistics.

I mentioned briefly earlier, there's
this...

I think it's called the attribute value independence
assumption.

It's this generic sort of assumption
made by all cost-based optimizers

that there's no correlation between
different columns, which

is demonstrably false all the time,
but somehow mostly doesn't

break that badly.

But for something like this, you've
got to think, there could

be a very strong correlation between
columns, or there could

be a totally different, completely
independent relationship between

the 2.

And it's really hard to model that.

So if you can just not do that
to much of an extent at all and

still get useful query plans, that
seems ideal.

You know, there's just different
types of indexes.

So you have, for example, in the
paper that I talk about, it's

data warehousing, So you explicitly
have different dimensions

of a business that are obviously
quite independent.

But then you could also have something
like, I don't know, supplier

ID, first column, second column,
discount code.

So it might be that different suppliers
use the same discount

code, coincidentally, But even
then, those 2 discount codes are

completely distinct entities.

So it just doesn't make sense to
think of them as being the same

at all.

You wouldn't expect a query to
just supply a predicate that doesn't

have a supplier ID.

You would expect it either to have
a supplier ID or have both.

You wouldn't expect it to be omitted,
because it just wouldn't

make sense.

So it's impossible or very difficult,
in any case, for the optimizer

to understand those kinds of distinctions,
even though they're

obviously very, very common.

So again, ideally, it wouldn't
have to.

Ideally, it would just be able
to fall back on this sort of idea

of, hey, we'll try to work it out
at runtime to the best of our

ability.

As long as we get the fastest query
plan, everyone's happy.

It doesn't necessarily have to
be that exact in the cost model.

In fact, it usually isn't.

So that's kind of the overarching
philosophy of the thing.

Michael: Yeah, as usual with your
work, it's so simple once

you understand it, but I'm
super impressed

with the thought process to have
got there.

It feels like yet another feature
where, as users, we don't have

to think about how it's working,
why it's working that much.

Maybe with strategies later in
terms of indexing, we can take

advantage of that.

But we could carry on indexing
exactly the way we are, and we'll

only benefit as a result still.

Peter: Unless I mess up, of course,
but hopefully I won't do that.

Yeah.

I'm not exactly playing to my strengths
when I'm talking about

the optimizer, you understand,
but nevertheless, that is kind

of part of the philosophy.

I'm not really an optimizer person.

I like being successful, that's
probably why.

But here, I think that undeniably,
there is some thought given

to those aspects, because it kind
of has to be.

Michael: What do you mean successful?

Peter: Well, it's just really hard
to do anything in the optimizer.

That's why so few people do.

And thank goodness someone is willing
to do it.

But it's a very difficult area
to work in, I feel, which does

seem to put people off more than
perhaps it should.

Perhaps that will change at some
point in the future, but you

have to ask somebody else about
that.

Michael: Yes.

That somebody else, I believe,
is going on another podcast, which

I'm excited to…

Peter: Oh, great.

I think I know what you're talking
about.

Michael: Yes, for everybody else,
this is Tom Lane going on the

Talking Postgres podcast, which
is scheduled to be live, I think,

next month.

So I'll share a link to that.

Exciting times.

So we've talked about it briefly
already, But another cool thing

you've done here is split this
work into some useful stuff on

its own for version 17 coming out
soon.

I'll be interested to hear how
and when did you realize you could

split it up into the in-list optimizations
that I think are going

to be very useful for a bunch of
often-generated queries that

people don't have that much control
over, like long, like relatively

long in-lists seem to pop up, at
least from what I've seen from

like ORMs normally.

How did you work that out?

Peter: I'm not sure exactly when.

I think it probably wasn't any
1 moment where I realised that

those were… I remember talking
about another patch, actually

one for loose index scan, and just being
very confused about that

whole area myself, being very confused
about what the difference

is, where to draw the boundaries
architecturally.

And I recall asking to nobody in
particular at the time, how

does this relate to in-lists or
what we would call internally,

ScalarArrayOpExprs, that is, = ANY
array expressions?

There must be some relationship
here.

At that time, I was aware of the
paper, the source material,

and that strongly suggested as
much.

But I just didn't...

I kind of thought, you can't really
have an opinion about 1 without

having an opinion about the other.

You want to be able to combine
these things arbitrarily.

So that was when the initial awareness
sort of sunk in.

Then I had some work on Vacuum
over a couple of releases, and

then I came back to B-tree again.

And I think at that point I realised,
OK, this is probably the

most useful thing I could do.

And then I got a little bit lucky
as well with the project for

17.

I worked out a couple of things
on the optimizer side that I

wasn't necessarily counting on.

That's more subtle.

It's stuff that's related to whether
or not we know that the

scan will return ordered results.

So I realized the relationship
between that question...

So in previous versions, you would
do a query like SELECT FROM

TABLE where a equals 5 and b in
a long list, order by a, b limit

something, and it wouldn't be able
to terminate the scan early

because, for reasons that are complicated
and not that interesting,

the implementation couldn't trust
the B-tree code to correctly return

things in order.
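(A sketch of the kind of query affected, hypothetical names:

  SELECT * FROM t
  WHERE a = 5 AND b IN (1, 2, 3)
  ORDER BY a, b
  LIMIT 10;

On 17, the in-list scan can be trusted to return rows in index order, so a plan like this can stop as soon as the LIMIT is satisfied.)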

So I kind of had this idea that
that was important, but I didn't

know how important it was to everything
else.

So I kind of got a bit lucky in
figuring that out in about 2023.

Before I even realized that I was
pretty sure that I would do

the skip scan thing, but there
was some serendipity: there

was actually a user complaint,
I think it was on Slack.

Michael: Oh, yeah, Benoit.

I'll link that up as well.

Peter: Yeah, Benoit had a complaint.

Basically, because of the limitation
I just mentioned, the query

plan, which was for such a query
with a limit, it couldn't do

it without creating a way of filtering
the non-matching rows

that required heap access.

So for reasons that are not really
fundamental and worth explaining,

it did way more heap accesses just
to exclude non-matching rows,

when in principle it could have
done that at the index level

before accessing the heap at all.

So for very roundabout, hard to
explain, and actually not that

relevant reasons, that made a huge
difference for his use case.

I wasn't even thinking about it,
because I'm a B-tree fellow.

I wasn't really thinking about
avoiding heap accesses.

But then I found, actually, that's
by far the main reason to

do that.

I was thinking initially about
just the simple fact that, hey,

do you want to do the index descent
1 time or a hundred times?

Obviously, you want to do it 1
time if that's feasible.

That's where I started, but then
I realized not much later, but

somewhat later with some help from
Edoard, oh, actually, yeah,

you can in some cases at least
completely...

Leaving aside all this stuff about
restrictions in the

planner, you can completely...

In some cases, like the one he showed,
you can completely eliminate

a huge number of heap accesses.

And that was by far the best, the
most flattering case for the

B-tree work, where I think it was
like 8, 20 times faster, something

like that.

It doesn't really matter what the
number is, it's way, way faster,

because you're fundamentally doing
way less I/O.

So I didn't necessarily count on
that happening at all.

And then it kind of did without
my initially realizing it, which

was nice.

Serendipity did play a role, as
it often does.

Michael: Very cool.

Nikolay, do you have any questions?

Nikolay: Of course, I have comments
and questions.

All of them are silly, maybe.

Once again, we should avoid this
confusion.

Loose index scan is not skip scan.

Because I saw posts mixing these
names, but I double-checked

Oracle and SQLite, and they have
both skip scan, the same meaning,

so we should stick to this.

Peter: I guess so, yeah.

I mean, I wouldn't mind calling
it something else entirely, but

I certainly do think the distinction
between loose index scan

and skip scan matters.

I was confused on that point myself.

I mentioned this already.

The important distinction between
the 2 is...

So loose index scan involves an
index scan that knows specifically

that it's feeding into a group
aggregate.

So it's only correct because you
don't actually logically need

to return all of the rows from
the index, or indeed from the

heap at all.

It's only correct because of that
high-level context that you're

applying in the low-level code.

Whereas, in contrast, SkipScan
isn't like that.

SkipScan is just exploiting information
that's readily available

to the scan.

It's doing Index scan things in
a more clever way without really

understanding how that information
is being consumed 1 level

up or 2 levels up.

So that's how I think of it.

Nikolay: Yeah, you mentioned work
on loose index scan also happening.

What do you think about the future
of that work?

Peter: I think it's a perfectly
fine idea.

There's nothing wrong with it.

I did briefly look at that some
years back.

My recollection of exactly how
it was at the time is a little

bit vague.

I remember noticing that it had
lots of bitrot, which was

probably a little bit of something
that discouraged my involvement,

because, again, I like being successful.

Right, yeah.

But fundamentally, it's a perfectly
sound idea.

It's obviously a good idea that
could be implemented with enough

effort.

And I think that the patch didn't
have any major problems, or

at least didn't have any major
problems that could not be addressed

in the fullness of time.

I recall Tom, at the time, who
was sort of asked to take a look

at it by me, saying it was maybe
doing things the wrong way on

the Planner, but that's kind of
above my pay grade.

But from an indexing point of view,
obviously it makes sense.

It's like, just thinking about
it very sensibly, it's doing potentially

an enormous amount of extra I/O
you don't need to do logically.

It's just not...

Yeah, I mean, it's inarguable that
that's a perfectly good idea.

If I haven't started there or I
haven't paid too much attention

to it, it's only because of competing
priorities.

I think it's totally reasonable.

Nikolay: Yeah, we write with recursive
all the time to implement

this, basically at the application
level.
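(Roughly the wiki technique Nikolay refers to: emulating a loose index scan for SELECT DISTINCT a with a recursive CTE, so each step is a single index descent; names hypothetical:

  WITH RECURSIVE distinct_a AS (
    (SELECT a FROM t ORDER BY a LIMIT 1)
    UNION ALL
    SELECT (SELECT t.a FROM t WHERE t.a > d.a ORDER BY t.a LIMIT 1)
    FROM distinct_a d
    WHERE d.a IS NOT NULL
  )
  SELECT a FROM distinct_a WHERE a IS NOT NULL;
)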

I'm curious if there are any specific
benchmarks which could

be helpful for both of these patches
to check, like, what kinds of workloads?

Peter: Well, not really, because
as is often the case with things

that have some compiler component
to them, it's sort of like...

It's perfectly obvious that it's
not a quantitative improvement,

it's a qualitative improvement.

So you name a number, and I'll
give you a test case that is faster

by that amount.

Just make up a number.

It's no problem.

I'll get you a test case that will
do that.

I just have to come up with something
that has sufficiently large

amounts of things that you will
be skipping now.

Coming up with a number, it's not
really very meaningful.

You're comparing doing it in the efficient
way, the obvious, sensible way,

with having not done that, so there's
really no practical limit on

how much faster it could be.

And it's hard to make generalizations
in the real world.

That's the annoying answer to your
question.

Nikolay: By the way, I wanted to
ask you about propagation of

this idea from B-tree to other
types of indexes, GIN first of

all.

Is it a valid idea at all?

Maybe in combination with
btree_gin?

Peter: It definitely could be,
but then the way the GIN in particular

does multicolumn indexes is kind
of weird.

The way that actually works is,
essentially, rather than having

a B-tree that has A and B stored
together, if and only if you have

a multi-column index, it has
a B-tree where attribute 1's entries

are stored with the number 1, whatever
the value for A is, and attribute 2's

entries are stored with the number 2,
whatever the value for B is.

So the column number is, in the
B-tree itself, where the entries

live, is the most significant key,
which is very different.

Now, if you said to me, what about
GIST, then that's very different,

because GIST is much more like
B-tree, particularly here.

I think with GIST it's totally
possible.

The way that GIST works, so the
source paper I mentioned, Efficient

Search of Multidimensional Indexes,
you might be confused.

You might think, well, multidimensional
like GIST?

And the answer is, yes, sort of.

I'm doing work that's purely enhancing
B-trees, but some of the

ideas with dimensionality are at
least at a high level related

to things like multidimensional
indexing using GIST.

So I think it's entirely possible
to do that.

GIST is loosely ordered, so it's
kind of inherently doing something

a bit like that anyway, where it's
not so much finding the exact

match and the exact place where
it must be, but rather looking

in all the places it could be exhaustively,
which is usually

only a fairly small number of pages.

It doesn't have the same propensity
of B-tree to go to exactly

the right...

The only right place for that value
that you're looking for directly.

It's more because of the loose
ordering of the index, it's not

quite doing it that way.

There's a bounding box in the internal
pages which describes

what the leaf pages contain.

So you might have to do 2 or 3,
even for a point lookup, you

might have to do 2 or 3 leaf page
accesses, as opposed to just

1, which would be very often the
case with B-tree.

So it's almost kind of doing that
anyway.

So it seems entirely plausible
that with GiST you could do something

similar.

I don't think that's probably possible
in the case of GIN, though,

because what does that even mean
in light of the scheme with

GIN storing two different attributes
in completely different places?

And the tree doesn't really seem
compatible with that.

I don't know, though.

I mean, anything's possible

Nikolay: What if it's btree_gin,
where the first column is a

scalar, like organization ID
or category or something, and the

second column is like JSONB or
something?

Peter: Maybe.

I mean, I think it's probably more
proveable to talk about it

in the context of a use case.

So there, I guess, intrinsically,
you have a kind of nested structure

to JSON.

And so having that capability could
be useful if you had some

kind of partitioning scheme or
something like that, I suppose.

Nikolay: I mean, JSON nested structure
doesn't matter.

It's a second column.

Peter: I know, I know.

Nikolay: Right?

We put a scalar value, like some
number or timestamp on the first

place.

Peter: Yeah, I mean, I think that
could be possible.

I haven't thought about it too
much, though.

You have to talk about it at two
levels as well, though.

It's like, what would be practically
achievable within a year?

Nikolay: Let me tell you.

Why I care so much about this.

Because I started to reflect on
one problem recently related to

full-text search in Postgres and
why people move to Elastic.

Why did I?

Because I saw a similar situation
happening with pgvector.

The most popular issue in pgvector
repository right now on GitHub

is additional filtering.

So people need like global index
with vectors, but they also

need to be able to sometimes search
locally for a particular project,

organization, I don't know, anything.

And this usually is done, what
Elastic and others, they call

it metadata search.

Basically, it's additional filters,
usually scalars, like

numbers or timestamps, something.

And for vector search, it's not
solved yet.

That's why people still can say,
like, they say PostgreSQL vector

search with pgvector does not work
well for us.

Because the usual solution is,
let's just create partial indexes.

But if you have hundreds or thousands
of categories, it doesn't

work well.

Then I realized that for full-text
search, we had a similar problem

for years.

Usually it's solved by the
btree_gin extension.

Then you are able to create a multi-column
index, but this is

a barrier for many people.

Although this is a contrib module,
it's present everywhere, it's

some kind of a barrier.

And I saw cases when people even
created this extension, but

somehow still not using it well.

They still have this problem, B-tree
vs. GIN dilemma for planner,

right?
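(A sketch of the btree_gin setup being discussed, hypothetical names:

  CREATE EXTENSION IF NOT EXISTS btree_gin;
  CREATE INDEX docs_category_fts_idx
    ON docs USING gin (category_id, tsv);

  -- a "local" search within one category:
  SELECT * FROM docs
  WHERE category_id = 42
    AND tsv @@ to_tsquery('english', 'skip & scan');

Here category_id is an ordinary integer column and tsv a tsvector column, combined in one GIN index.)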

So I'm thinking, for example, if
we use btree_gin more often, we

usually will put the scalar value in
first position, because it's

more selective, by default, maybe.

We don't know.

And if it has a low number of distinct
values, this skip-scan

approach would make total sense
there as well.

Right?

Peter: Maybe.

I don't have a lot of domain expertise
here, so forgive me, I

need to ask questions to clarify
what you meant.

Are you talking about pushing down
a limit for each grouping

or something like that?

Nikolay: Limit is like...

I'm not sure about limit and ordering,
I'm just thinking...

Peter: But not a limit for the
entire query, but rather for each

individual grouping or whatever.

Where you want to...

Is that what you mean to the question?

Nikolay: Imagine we have full-text
search, TSVector, GIN index,

and then we have additional things,
like, for example, creation

time.

And when we want to search within
some time window only. By default,

full-text search in Postgres doesn't
support it itself.

It will be a different B-tree index
on timestamp, like created at

timestamp, and we have a global
whole table GIN.

And then the planner decides what
is cheaper.

B-tree scan and then filter on the fly,
or use full text search and

then filter or order by on timestamp.

And the btree_gin extension gives
you the opportunity to combine,

to have a 2-column index.

And then the goal is to find everything,
no limits, right?

But if we put scalar value on first
position, queries which don't

have this filter will be not good
with this index.

So we put a timestamp as our column
A and the TS vector as our

column B, use btree_gin for that
to have a 2-column index.

And then I would like to have skip
scan for this, because if

my query doesn't have additional
filtering and we have a low

number of distinct values for column
A, maybe timestamp was not

good because it will be a lot of
distinct values.

Let's say it's some category ID
or, I don't know, like group

ID or something.

In this case, probably skip scan
will help there as well for

such queries.

Or am I forced to put this column
in second position and keep

full-text vector on the first position?

Okay, this is something to think
about additionally.

I'm just very curious.

Peter: I don't know is the simple
answer.

In principle, you could...

I mean, after all, what I've described
is, in the end, there's

a bunch of details, but it is basically
quite simple.

You have to be willing to do, at
least in the simple case where

you omitted, fully omitted a prefix
column, you have to be willing

to look at every single individual
partition, every distinct

value, which adds up quite quickly
if it's something like a timestamp.

Whereas if you had some kind of
bucket by day or something like

that, then it would be more feasible,
because then you would

have relatively few buckets, and
especially if you had not completely

omitted the date column, if you
had a range in between, then

it would be, I think, feasible
from a data structure point of

view, not necessarily in GIN, but
in something, to do something

like that.

I mean, I don't know what the basis
of comparison here, what

is Elastic doing?

I just don't know.

I can't speak to that.

I don't know
Nikolay: as well, internals.

They just claim if you have additional
filtering, it's not slower

at all, and usually it's faster.

This is what their documentation
says.

I'm very curious about internals
as well, because I just feel

this, even from a user perspective,
even the need to create btree_gin

extension, it's already a
barrier.

Some people don't do it, and they
think, okay, full-text search

is weak, And they are going to
have the same impression about

vector search.

And this whole topic about combination
of these powerful indexes

with B-tree indexes, it's an interesting

direction, I think, and can be improved
as well.

And with SkipScan, maybe it can
be improved further.

This is my basic idea.

It's quite maybe not well thought.

I promised to have silly questions.

So

Peter: It's okay.

There are no silly questions.

Well, I've got an even sillier
answer, because I have, I think,

a very, very Fisher-Price-level
understanding of how pgvector

works, which is not going to do
much good here. So, you know,

I'm not going to say much
about a thing I know so little

about.

Maybe I'll talk to Jonathan Katz
about it next time I see him; he's

my go-to expert on these things.

Yeah

Nikolay: I'm sure he knows already
this problem very well, because

as I said, this is the most active
discussion among issues on

GitHub repository, and basically all people
need this.

They need a combination of B-tree
and vector search somehow.

Maybe not only just one B-tree, not
only this scalar, but a bunch

of them, because they need to filter
additionally.

This is a challenge.

I also know one project which they
built some kind of quite universal

solution for the others.

When they first came to us, they
said, we are forced to index

every column for flexibility.

We know this is an anti-pattern, but
we are forced.

I'm sure they will be happy when
Postgres 18 is out with this

optimization.

This is a super relief for them,
because they are struggling with

this problem.

It's good to have only, as Michael
said, only one index on multiple

columns and don't think too much
about order.

Peter: Yeah.

Especially if, as I say, the important
distinction is between

getting a very slow, totally unacceptable
performance, typically

a sequential scan of a very
large table, versus adequate,

sub-second but not sub-millisecond.

Like, a lot of people are in that
zone, and if they have a lot

of ad hoc queries, they can't really
predict the indexing needs

ahead of time.

I think it makes the biggest difference
there.

Much less so in OLTP apps, but
still to some degree, it's going

to get you acceptable performance
where you didn't know that

you really ought to have had a
certain kind of index, and then

you get adequate-ish performance
there too, which could make

all the difference in a pinch.

Nikolay: Yeah, well, actually,
interesting additional question

I have.

With loose index scan, we have
this wiki page describing this

technique.

We apply this technique using recursive
CTE, and there is an

expectation that in the future we will
stop doing this, it will be

fully automated.

Does it make any sense, while we
are still not there, and not on

Postgres 18, it's not even committed,
to try to

implement a similar approach
at a higher level ourselves, somehow

splitting the query into pieces,
like different index scans?

Peter: It might.

I would just say, in the case of...

If you're able to upgrade to Postgres
17,

Nikolay: and

Peter: having done that, you're
able to also...

Suppose you have 10 distinct values
in your leading column.

If you can write an inlist that
has those 10 distinct values

and are satisfied that that will
work for you, that will give

you the correct answer, not only
when you write the query, but

going forward, if it's sort of
static, then you can totally do

that.

And in fact, I would expect the
actual access patterns to be

identical to what you would have
gotten had you had the skip

scan feature.

It is basically doing the same
thing.

It's just rather than having an
array with these constants that

come from a query, you will instead
have a magical sort of an

array that the B-tree scan code
uses almost the same way.

It's just that it generates its
array elements procedurally and

on demand, but it is essentially
the same thing at quite a low

level.

So it would be not just similar,
but identical, if you could

make it work within those constraints.

And in particular, there'd be no
possible downside in terms of

whether or not it made sense to
do a full index scan.

Depending on the distribution of
values, it's going to be adaptive

in that same way I described.

With a big caveat, of course, you
kind of need to know that,

at least for those leading values,
that it's static and predictable,

and it won't change and break your
query.
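(A sketch of what that looks like, hypothetical names again: instead of waiting for skip scan to handle

  SELECT * FROM t WHERE b = 5;

you can, on 17, spell out the leading values yourself, given an index on (a, b):

  SELECT * FROM t
  WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    AND b = 5;

with the caveat Peter gives: the set of values in a has to be static and predictable.)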

Nikolay: So I guess our approaches
will change with this.

Usually we, like, analyze workload
and decide on indexes, we

decide what to put on the first
position, second position.

Here it seems like sometimes we
will decide to put columns with

lower cardinality of distinct values
or something.

Peter: That'll start to make way
more sense than it did.

I don't want to overstate it.

It's not like a revolution by any
means, but there are specific

rules of thumb that will be weakened
for sure. One that

you already said yourself, Nikolay,
was: just put

the highest cardinality column first.
That's already kind

of suspect actually, but it especially
is in this world with

this capability.

Nikolay: Interesting.

Peter: I think it's mostly confined.

That'll be most true by far with
more your data warehousing use

cases, where there is a built-in
requirement for a lot of ad

hoc querying that you can't really
exactly predict ahead of time,

looking at the details of something
rather than just a summary,

really matters, then yes, absolutely,
that will be true.

Well, with OLTP apps, less
so.

Nikolay: Cool.

Yeah, this is an interesting direction
to explore, even if the patch

is not yet committed.

Yeah, thank you.

Good to have something on the to-do
list.

Michael: Yeah.

Is there anything else you wanted
to make sure you mentioned,

Peter, or anything anyone can help
you with?

Peter: Well, I always welcome input
into the work I'm doing.

It would be kind of nice if I could
get some feedback on the

stuff we were just discussing about,
hey, how as a practical

matter does this alter the general
advice that we give to you

about how to index things?

I'm not entirely sure how, in practice,
that will tend to work

out anyway.

It is nice that this is a very
general thing.

I think that will tend to make
it possible to use it more often,

because in a certain sense, the
optimizer only has to have the

right rough idea, because if it
has that, then it will pick an

index scan, and then the index
scan itself will figure out what

to do.

So I think that's probably going
to make it more compelling than

it appears to be in other systems,
where, as far as I can tell,

I might be mistaken, they require
the optimizer to generate a

distinct index node type.

So it's kind of either-or, which
seems to me anyway, could be

something that limited the applicability
of the optimizations.

So I feel like it would be nice
if I had a better idea of how,

practically speaking, this will
change things and where.

I don't think I've got a complete
understanding of that myself.

Like, what kind of applications,
what kind of usage patterns.

That would be nice to have more
detail on.

So if anyone wants to help me,
that would be an obvious place

to start, I would say.

Michael: Nice, thanks so much,
Peter.

It's been an honor having you on.

Great chat.

Peter: Thank you both.

Thanks, Michael.

I really enjoyed it.

Nikolay: Thank you.