A weekly podcast about all things PostgreSQL
Michael: Hello and welcome to Postgres
FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard.
This is my co-host Nikolay, founder
of Postgres.AI.
Hey Nikolay, what are we talking
about today?
Nikolay: Hi Michael.
It's your proposal actually, Limit,
right?
But before that, let's mention
a little bit what our companies
are doing because I think last
time we did it 90 episodes ago.
Michael: Right, sure.
Nikolay: Yeah, so pgMustard, by
the way, I wanted to tell you
yesterday but forgot.
During some session, people asked
me about pgMustard suddenly
and asked, can I recommend?
Michael: What did you say?
Nikolay: I said, yes, of course.
And became shy.
I should recommend it before they
ask, right?
Michael: It's good that they ask.
Nikolay: So yeah, we discussed
some plan and they wanted...
Obviously we discussed how important
it is to use buffers,
last week's topic.
If you haven't listened to it,
please do.
I mean, not you, but our listeners,
right?
Because buffers are super important
and there are many aspects.
And I actually found pgMustard
a good tool to visualize plans
and suggest improvements.
It's not my idea, it's an idea
from my customer.
I personally think visualization
is not the strongest pgMustard
feature.
Strongest is a list of recommendations
based on heuristics, which
is quite a good thing.
So, and I know you improve it all
the time, already many years.
So it's great.
Yeah.
And what about me?
My company is currently creating
a bot.
Check out our latest blog post -
it's only a few months old - my
blog post where I described how
we built it.
And currently we are rebuilding
it fully.
It will be interesting.
The release will be soon.
Yeah, it's quite interesting.
We are bringing it to the web from Slack.
I think it will still live in Slack,
but it will also be on the web.
It's just easier to use it there.
And we started using different
models, including Gemini 1.5 with
1 million context window.
So yeah, we are changing our RAG approach
right now.
And it's kind of interesting.
It knows already more than 1 million
documents about Postgres,
it's huge.
So yeah, that's probably enough
about Postgres.AI and pgMustard.
Then let's talk about limit.
Michael: Nice, yeah.
That's cool to hear.
Maybe we'll hear more about that
another time about the differences
between the models.
Yes, so I suggested this topic
and it came out of a tweet a long
time ago that I found myself recommending
over and over again
to people because it was a phenomenon
I saw and most people didn't
understand or didn't intuitively
understand, including myself.
So this was a tweet from Christophe
Pettus and the headline was,
limit considered harmful or something
along those lines.
And it was a really good kind of
initial tweet for a thread.
I was like, what?
Limit's good.
There's so many good things about
limit.
It's really helpful for so many
performance reasons.
What's harmful about it?
And he goes on to describe 1 specific
case where it can cause
plan flips and really, really slow
queries in cases where you
would expect it to be helpful in
performance.
So, that was the trigger for this
topic, but it might be nice
to discuss all the positives of
using limit as well: at least when
we should be using it, whether we
should be using it, that kind
of thing.
Nikolay: Yeah, well first of all,
I guess we will be fighting
a little bit in this episode because
we have different opinions.
Let's start.
I cannot be on the side of Christophe
Pettus in that tweet.
I think there is huge potential
here, because limit - this is what
every one of us is using all the
time, right?
The limit is everywhere.
So if there are some cases when
it's harmful, it's interesting.
But my regular approach is if you
don't limit, you're in danger.
Limitless queries are harmful because
you don't know what you tested.
You know, my favorite topic: testing
and experiments and how
to test, and the idea that if you
test only on small databases, you
miss things.
It's bad testing.
It's an anti-pattern, unless you know
what you're doing and do it
before you test on full-size databases.
So this is exactly a case where it's not
even about performance.
It's just that you test on small
databases, small tables, and some
queries showing results on
a page don't have a limit.
And we usually start with that.
If we prototype something, first
we say: forget about pagination,
let's show everything here.
Then you test, it works well -
1,000 rows on 1 page is
not a problem, probably, depending
on how heavy they are in markup.
But then project grows and some
users, for example, have 10,000
posts or, I don't know, comments,
and you show all of them on
1 page, and you have problems.
And this is a postponed problem
if you start a new project, right?
So limitless queries can be harmful
quite easily, and this is
a straightforward idea.
If you limit, at least you tell
yourself what to expect.
Okay, I expect 15 rows, for example.
Okay, we know it's good.
We know how many buffers we should
expect in the very efficient
query.
15 times the width of row plus
some internal buffer operations,
continuing the topic we had last
week.
Right, and that's it.
If you don't have limit, you don't
know what to expect.
Maybe it's a billion rows.
Michael: Yeah, so I agree and I
think pagination is quite a good
place to start, because if you look
up limit in the docs, for example,
you're also going to learn
about offset, and it's a really
common concept.
Well yeah, so I'm a big fan of...
I'm going to say friend of the
podcast, Markus Winand, previous
guest.
Yeah.
I've always wanted to say friend
of the podcast.
So maybe I finally got to say...
Nikolay: It was a great episode.
Michael: Thank you.
And yeah, it was...
So he has a page on his website
called No Offset, and I was just
about to...
Nikolay: I just use the NoOffset
hashtag on Twitter.
I use it all the time, because when
somebody mentions offset, it's usually
not an expert who is using offset
on purpose, knowing why - for example,
to get index-only scans before
dealing with a regular index scan.
There you probably want offset
sometimes, but very rarely actually.
So I always use this hashtag no
offset.
It's good.
Michael: I'm sure most people listening
know exactly what these
things do, but just for clarity,
When we use limit, we might
add limit 10 to the end of our
query, limit 20, limit some number.
And that will, as the name suggests,
limit the result set to
that many.
That many rows.
Nikolay: Limit 0.
I use sometimes.
Do you use limit 0?
Michael: No, but I've seen quite
a lot of offset 0.
Nikolay: I use limit 0 in 2 cases.
When I, for example, create a table,
I select specific columns
limit 0.
For example, sometimes just bootstrap
some table.
I don't want to repeat all data
types.
Sometimes it's like you're dealing
with CSV or something.
And also when I do tests, for example,
presence of column, you
can just select that column from
table limit 0, not caring about
anything.
And yeah, kind of assert.
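(A minimal sketch of the two LIMIT 0 tricks described here, assuming a hypothetical orders table with id, user_id, and created_at columns.)

    -- 1) Bootstrap an empty table with the same column names and types:
    CREATE TABLE orders_snapshot AS
    SELECT id, user_id, created_at
    FROM orders
    LIMIT 0;

    -- 2) A cheap "assert" that a column exists (errors out if it doesn't):
    SELECT created_at FROM orders LIMIT 0;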
Michael: So yeah, I've seen
that - really cool use case.
I've seen offset 0 used a few times.
So limit limits the result to the
first N rows; I've never thought
of using 0 for that.
It's cool.
Offset will jump you forward that
many before you start limiting.
So you could offset, you take the
first 10, throw those away,
and then limit will take you the
next 10.
So if you do limit 10, offset 10,
you get the second set of 10.
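(A minimal illustration of what Michael describes, assuming a hypothetical orders table: the second page of 10 rows.)

    SELECT *
    FROM orders
    ORDER BY id
    LIMIT 10 OFFSET 10;  -- skips rows 1-10, returns rows 11-20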
Nikolay: Yeah, and offset 0 is either
generated code, right? Because it
was "offset n" where n happened to be 0.
Or it's that performance
trick, some hack, right?
Yes.
I already forgot.
Michael: It's quite, yeah, I think
it's an optimization.
It is or was an optimization fence.
So it can, it can trick the planner
into not doing certain optimizations,
which I guess we'll get onto a
little bit, but there is a line
in the doc.
Nikolay: Does that trick still work, or was it mitigated?
Michael: Well, something in the
docs made me think it might...
I think it's still...
I think they still work, because
I remember reading in a hacker's
thread, or it was in a thread somewhere
in the mailing list.
Nikolay: Yeah, certainly, it has
been used, but it was so long
ago, that's why I'm wondering if
it's still relevant.
Michael: So I read a line in the
docs in preparation that the
offset 0 is the same as omitting
the offset clause, but I don't
think that's quite true because
of the planner implications, but
in terms of the results, that's
true.
In the results, you
Nikolay: won't change it.
Semantically, it's true, but physical
execution might be affected
by the presence of offset 0.
Michael: Yeah.
So the reason I brought this up
entirely was because we were
talking about pagination, right?
Both common ways of
paginating use limit.
1 is the kind of crude way of doing
it.
A lot of ORMs, a lot of
tools implemented it this way,
at least initially: for the
first set of, let's say, 20 rows
that you want, just add limit 20.
The next set, they just did offset
20, limit 20.
And then the next 1, offset 40,
limit 20.
And the problem with that is performance
degrades linearly as
you go.
And if you want to do a large,
you want to get to page 20 or
page 100, that can be quite an
inefficient query.
And Markus wrote a great post about
why we should instead be
doing key set pagination for predictable
performance and efficiency.
Nikolay: Right, the explanation
is simple: it's just because
internally it fetches everything
up to that point and discards it.
Exactly.
It's a lot of inefficiency.
It grows with the depth of your
pagination, right?
Michael: So yeah, what we were
talking about as the good approach
is "where ID is above...", like, where
some parameter is greater than
the last result you
fetched, and then still limiting
by 20 or 21, depending on exactly
how you want to do it.
Something like that.
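(A sketch of the two approaches, with a hypothetical table and values: offset pagination does more work the deeper the page, keyset pagination seeks straight to the next page.)

    -- Offset pagination: page 3 of 20 - Postgres still fetches and discards the first 40 rows
    SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 40;

    -- Keyset pagination: remember the last id from the previous page and seek past it
    SELECT * FROM orders
    WHERE id > 12345   -- hypothetical last id seen on the previous page
    ORDER BY id
    LIMIT 20;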
Nikolay: So limit 0 can be a problem
or limit 10 can be a problem,
right?
Let's talk about this problem maybe,
right?
Because I hope I already explained
why limit is good in terms
of performance, because you limit,
you understand yourself what
to expect, and you know what is
efficient and what's not if you
look at buffer operation numbers.
Michael: I also think there are
a couple of cases, like even
simpler cases. Chances
are you don't want... like, if
your query could return a hundred
thousand rows, do you really
want all of them?
Nikolay: Right, right.
Michael: Or do you, you might only
be interested in the top 10
or maybe your user might only care
about the most recent 10.
And returning all 100,000, even
if the query is quite fast on
the server side, sending all that
data over the wire is inefficient.
So I think that's kind
of the more fundamental
reason that limit is good: returning
a subset of rows, exactly
what it's designed for.
Nikolay: It's close to our topic
about delete and batching.
By the way, someone asked on YouTube
comments, how about updates?
And I explained it's very similar,
but with additional things.
So if you don't batch and send
the whole result set in 1 shot,
it's not only a lot
of memory and so on, but also resiliency,
reliability: what if it fails at
the end of processing? You lose
everything, right?
And the retry will be huge again.
So it's better to go in smaller
steps with retry logic.
And limit, of course, is the key
tool to split the batches.
So pagination, batching - we need
it.
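(A rough sketch of one batch of such an update, assuming a hypothetical orders table with a processed flag; the application would repeat this, with retry logic, until no rows are updated.)

    WITH batch AS (
        SELECT id
        FROM orders
        WHERE processed = false
        ORDER BY id
        LIMIT 1000
    )
    UPDATE orders o
    SET processed = true
    FROM batch
    WHERE o.id = batch.id;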
And also, in general, common sense:
if I limit, again, I
know how much I want. This is my
request - I want 15, for example,
or 2,500 - and if you scan many
more parts of the database than these
15 records or 25 records, plus additional
internal index
pages and so on, then I'm not considering
this efficient work,
right?
So it sets a simple expectation
and metric for efficiency.
Efficient versus not efficient
comparison of queries, right?
Michael: Yeah, I think that makes
tons of sense.
Like in application code, for example,
you are suggesting that
you're kind of communicating with
future developers what the
expectation is here.
I think this is where we're going
to start to differ in opinion,
though, because I think this is
exactly the case. Let's say it's an
ad hoc query. I've sat next to
people working on production systems
who were quite scared, and saw
them do this and didn't think it was
a problem - I was actually learning
from them at the time.
They were only expecting a single
row back and they added a limit
1 at the end of the query.
Nikolay: For example, primary key
lookup, right?
Michael: Yeah, I think this might
have even been like, I can't
remember exactly what they were
doing, but they definitely added
Nikolay: a primary key or a unique
key.
It cannot hurt, because it's
like if you don't trust
Postgres unique keys, you add limit
1 just as an assert, basically.
You think, okay, I must check,
it should fail or something if
it returns more.
Michael: Yeah, I actually can't
remember exactly why.
The thing I remembered was, oh
that's quite a clever way of not
having a runaway...
Nikolay: Actually, I remember I
did a different thing.
I did limit 2 on purpose.
So if something goes wrong and
I have 2 rows and I don't remember
where, but somewhere I should have
an error or exception, catch
it and process.
So limit 2 was...
Well, it should be unique, so I
expect 1 row, but I add limit
2 on purpose to check later that
it's not 2.
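(A sketch of the LIMIT 2 trick on a hypothetical lookup that should be unique: the application expects at most 1 row back and raises an error if it actually gets 2.)

    SELECT id
    FROM users
    WHERE email = 'alice@example.com'   -- expected to be unique
    LIMIT 2;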
Michael: Yeah, I like it.
Nikolay: But yeah, so anyway, like
in my opinion, the limit is
always, I would add it like as
a must for everything.
Michael: I think the fear was in
this case, a runaway sequential
scan across a huge table that might
start to eat resources and
cores.
I think this was a very, very on
fire server.
Nikolay: I'm thinking now and I
think maybe I would just even
consider this as a rule, mandatory,
like limit everywhere and
the only thing we cannot limit
is, for example, SELECT count(*)
FROM something.
This basically deals with all rows,
returns just 1 number, and
we cannot limit there, unfortunately.
There, okay.
Michael: I mean, you can add limit,
you're just still going to
get the count.
Nikolay: Yeah, limit will be useless
in this case.
Yeah, definitely.
Michael: The interesting thing,
though - I was looking back,
just for fun, to see if I could find a
version of the Postgres docs
that didn't have limit in it.
I'm sure it's as old as time.
But it was there in the oldest version
of the docs I could get back to, 7.1,
because that's the oldest one online.
And I got way more than I deserved
looking back at that.
And it said, as of Postgres 7.0,
the query optimizer takes LIMIT
into account when generating a
query plan, which I thought might
be a nice segue.
So before 7.0, there wouldn't have
been any danger in adding
LIMIT to your query.
Nikolay: I object.
Mind your language - this
language is wrong.
There is no danger in adding LIMIT.
Let's already jump to the core
of the topic.
There is no danger in adding LIMIT.
There is danger in having bad statistics
or expecting some wrong,
like forming expectations based
on indexes which cannot support
your ORDER BY with filtering and
LIMIT.
By the way, when you describe this
case, focus on ORDER BY as
well, because without ORDER BY,
this case is impossible.
It's not about LIMIT, it's about
ORDER BY plus LIMIT, which forms
expectations - I mean, gives
the planner the ability to use some
index, right?
Michael: Yeah, so let's
talk about - well, for
anybody thinking we've jumped,
I shared some examples with Nikolay
before the call, so that
he will talk about that.
Nikolay: I saw
this example before, like
a long time ago.
I saw it live.
I just...
Michael: Yeah, well, that's good.
So, let's back up and try and explain
the problem or the edge
case or whatever you want to call
it.
So, exactly the same query, if
you run it with and without "limit
some number", can be dramatically
slower with the limit
added than it is without the limit
at all, which I found really
counterintuitive.
And yes, it relies on some specific
cases, but those specific
things are not as rare in the real
world.
Nikolay: It's not rare, I agree.
Michael: Yeah, so that person sitting
in front of that console,
hoping that limit some number is
going to make it safe and guarantee
fast execution.
Nikolay: The same logic is expected,
yes, I agree.
Michael: Can actually shoot themselves
in the foot and make things
loads worse for themselves.
And I actually think there might
be other advice you could follow.
I'd much rather they added additional
where clauses or other
things than an additional limit.
So that's potentially controversial.
I'd love your take on it.
Let's try to explain how this can
happen.
So the simplest case is to filter
by a relatively rare condition
and then order by something else
that you have indexed -
so, for example, created_at.
So if we think about a, let's say,
a software as a service business,
and maybe you have a customer that
cancelled a few years ago,
and you want to look at their most
recent event or something
like that.
And so you're looking at: where company
equals A, order by created_at
descending, limit 1 - something
like that.
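(The rough shape of the query being described, with a hypothetical events table and company id.)

    SELECT *
    FROM events
    WHERE company_id = 123          -- a company that cancelled years ago, so no recent rows
    ORDER BY created_at DESC
    LIMIT 1;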
Nikolay: Or limit 10.
Michael: Or limit 10, yeah, exactly.
Nikolay: So it can happen, the
problem can happen with not only
1.
It can be some small number.
Yeah.
Michael: It could be a big number.
Nikolay: Or relatively small.
The key is that it's small enough
to have a flip.
Michael: Yeah.
Well, Christophe made a good point
that I haven't tested yet.
He thinks it's worse around the
point of the actual number of
rows.
There are some cases where, let's
say that company had a thousand
events, the limit of a thousand
would be problematic.
So the problem comes from not knowing,
like Postgres assumes
things are randomly distributed
unless it knows better, unless
it has statistics to suggest otherwise.
And if it thinks that the table
that contains these events is
going to have events from any company
in any order, because it
knows it only has to fetch 1 or
10 events, it now has this optimization
where it might be quicker for it
to scan backwards through the created_at index.
Nikolay: You missed a very important
thing.
Let me add things.
So we talk about, for example,
you took some SAS system, some
users, and users have some activities,
let's say, like comments
or, I don't know, orders, anything.
So we have orders table, for example.
This is what you showed, if I'm
not mistaken.
I remember.
So, orders.
And we have user_id, we have
created_at, we have the ID of the order -
enough.
The key here is we want to select
orders for this specific user
in chronological order, like order
by created at desc, so like
in reverse order, the newest first,
the latest first, right?
And then the key here is not to
forget, our developers, or we
are developers, We don't know about
multi-column indexes.
So we created an index on created_at,
we created an index on user_id,
and on ID.
Okay, a simplifying assumption:
we like to index all columns.
Even worse,
when we do it, we use only
single-column indexes, always,
right?
This is quite a classic example.
I can imagine it happens in the
life of every engineer during
5 years, for example, of development,
definitely at least once
it should happen, I would say.
99% of developers.
It happened with me many times.
And with my products, my databases,
and also with others when
I observe them.
Actually, recently we had a similar
case, maybe a few months
ago, I remember.
So, the key is Postgres needs to
choose: okay, I have a filter on
user_id, and I have order by
created_at desc limit 1 or limit
10.
And I have 2 indexes.
So I have 2 opportunities here
to save in terms of IO operations.
Of course, Postgres should avoid
sequential scan here.
It can be 1 user of many, of millions,
so it can be a huge table.
So we don't want to scan the whole
table, so we need to choose
only 1 index or 2 of them, maybe
combine with bitmap scan and
so on.
But there are 2 opportunities here: filtering
by user_id, and also the
order by created_at desc limit
1 or 10.
And the key here is order by.
Without order by, the limit will
always be fast, because,
I mean, order by is the key reason to
consider the index on created_at.
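(A sketch of the setup Nikolay describes - hypothetical names, single-column indexes only - and the query the planner has to pick a plan for.)

    CREATE TABLE orders (
        id         bigint PRIMARY KEY,
        user_id    bigint,
        created_at timestamptz
    );
    CREATE INDEX ON orders (user_id);
    CREATE INDEX ON orders (created_at);

    SELECT *
    FROM orders
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 10;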
Michael: Yeah, We need both the
equality check and the order
by to make it a choice for the
planner.
Either it has to filter on the
equality condition with one of the
indexes, take the data and sort
it and return the order, or it
has to choose the index that's
ordered and take the first row
it reaches that matches the equality
condition.
Nikolay: It's not necessarily
equality - it can be a range, between
one value and another.
Michael: It could be anything, yeah.
Sorry, it doesn't
also have to
be an order by, by the way; it could
be different - it has to be two
separate conditions, and those
have to fall on different indexes.
Nikolay: They fight between each
other, right?
Yes.
How do they fight?
They fight based on costs, based
on statistics and what Planner
expects in terms of cost.
And the key also, like, important
point here is that Planner
has two costs.
Startup cost, which can be very
small, it means the first row
can be returned very fast.
And also the full cost.
Full cost is the most important:
the full
cost is what the planner uses to make
the final decision, choosing
between a variety of plans, right?
Full cost.
So in our case, the index on user_id:
if we use just it, the planner
thinks, okay, I will fetch all
rows for this user, but then
I need to order them in memory,
right?
It means startup cost cannot be
close to 0 because ordering and
so on, it will take some time.
And once done, only once done,
we can return all the rows.
I assume like in this case, startup
cost and the final full cost,
they are very close to each other,
but far from 0 probably, right?
But the other option: order by
created_at desc limit 1 or 10.
This option means that we immediately
use the created_at index, we
start fetching rows from there,
but we need to filter out
irrelevant rows owned by
other users, right?
We only need our user ID.
It means we start returning rows
quite quick, but only if we
find our user ID quite quick.
There is uncertainty here.
But for Planner, it's quite certain
because it has some expectations
about distribution.
You mentioned it.
Probably the planner can think,
okay, I will be lucky, because
maybe this user is a normal user,
maybe it's still super active -
and, by the way, we start from
the very fresh rows.
The latest row will be considered
first, and if our user ID
is there, it means we already found
one row in the very first step.
And we go, go, go.
But if, in reality, this user has been
super inactive for the last year,
for example, we need to scan the whole
year to reach the first row
of that user to return.
And here is exactly where the planner
can be very wrong.
I mean, it simply doesn't know.
It doesn't know.
There is no correlation in the statistics
between user IDs and
created_at.
And also, the planner doesn't know
who is actually active and
who is not.
The planner doesn't know.
So it can be very hard.
Michael: Yeah.
So there are multiple.
So I completely agree - you
brought up multi-column indexes.
In the case that I shared with
you and the case we just discussed,
a multi-column index on the user
ID, or whatever it was, and then
created_at, ordered, would make
this query efficient and predictable
all the time.
And so, if this was an application...
Nikolay: Yes, yes, yes.
Michael: But if this is an application query,
and this was showing up
in pg_stat_statements as a problem,
and we needed to optimize
it, and it was one of our biggest
concerns, we didn't mind about
the overhead of adding another
index, great, we can add this
index, remove the one on user ID
or whatever we had, like, And
it would be fast all the time.
But I was talking about a case
where we're sitting in front of
a console on production trying
to stop ourselves from running
a large query or a slow query and
adding that limit, even though
we didn't think it would make a
difference, even though we think
we're only going to get one or two
rows returned, if we add it, we're
taking a risk that it could actually
make things worse.
Because we don't, like, if we don't
check, and this is actually
maybe what I would recommend.
I don't know about you.
But if you add explain before the
query and just check what index
it's going to use, does it make
sense?
If we saw it was going to do a
backwards scan on created_at,
maybe that would be a sign that
it was a bad idea.
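(A sketch of that check, reusing the hypothetical orders query; plain EXPLAIN only plans the query, it does not execute it.)

    EXPLAIN
    SELECT *
    FROM orders
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 10;
    -- An "Index Scan Backward" on the created_at index with a Filter on user_id
    -- is the risky plan being discussed here.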
The tricky part is this is an issue
because of a cost underestimation.
So it's always going to look like
a cheap query in cost.
The reason it's being
chosen, if it's a bad plan, if
it was a slow plan, is because
the planner thinks it will be
cheap.
It's because it thinks it can abort
early.
So costs won't give you a clue
here.
You'd have to be attuned to the
fact of which indexes it's scanning.
And is that enough?
Is that a good idea for this query?
So I'm talking about production
cases, like where you're doing
ad hoc queries, not so much application
queries where you should
optimize.
Nikolay: Yeah, for example, we
have to return our table and allow
users to order by all columns in
both directions.
It's a nightmare to support in
terms of performance.
And if you have many such tables,
or maybe you allow users
to create tables - any tables - and
you want to hide this performance
optimization from them: some website
builder or mobile app builder,
where your users are developers
of websites or mobile apps, and you
say, I allow you to create tables,
I will take care of performance.
In tables, you're allowed to create
columns, I will take care
of performance.
And you can put some element on
your UI for your users, millions
of users, and you can allow them
to order by any column in any
direction.
It's a super nightmare.
Absolutely nightmare.
Yes.
Because obviously you want to create
index on every column and
also you think, okay, I'm smart,
I know about keyset pagination,
you do keyset pagination, and then
you allow this.
This is exactly where you can encounter
it, right?
Because this limit...
So I wanted to emphasize again,
like I described, 1 index has
very low startup cost.
It's the created_at index.
And why is it low?
Because Postgres doesn't know, for
an arbitrary user.
We probably want this query to work
not for 1 user but for arbitrary
user, right?
So in this case, you probably think
Postgres doesn't know for
this particular user how fresh
activity is or orders are, right?
So maybe, but it assumes something.
It can be slightly more, slightly
less, it depends.
But it assumes something.
For the second row, it's 2 times more;
for 10 rows, 10 times more.
But once we find it, we don't
need to do anything else.
We can traverse the created_at index
and just filter out irrelevant
other users, right?
We just need to find our users.
Once we find our N, 1 or 10 rows,
we are happy.
This is our final cost.
And then if we don't use a limit,
our final cost is our expectation
for this user.
This statistic probably is present.
Probably, unless we don't have
statistics at all.
At all.
Postgres expects for this user ID,
I expect this number of rows,
and this forms our total cost.
Probably if it expects a lot of
rows, total cost will be high
and it won't start using this approach.
In this case, it switches to the
more reliable approach, I will
fetch all the rows and do a memory
sort.
So the key here, if we limit - this
is probably why Michael mentioned
that the boundary is this limit.
It's not the actual number of rows,
but the statistics' expectation
for the number of
rows for this user ID.
So when our limit goes down below
this number, the
traversal with filtering out irrelevant
users becomes more attractive,
because the total cost starts looking
good there.
But it's a trick, right?
Because the problem here, it's
a trick and of course, it's super
interesting phenomenon.
I would say this is a phenomenon.
It's not a harmful situation because
there are many cases when
limit doesn't help.
For example, we don't have indexes
at all, we say, well, no.
Michael: No, that can really help.
Nikolay: It can help, it can help,
yeah, yeah, yeah, because
filtering out, it can help.
But it's easy to imagine when limit
is useless in this.
Michael: Actually, that's another
time I use limit.
So the docs are quite...
Nikolay: In the general case, it improves performance
because you will stop
earlier.
But on edge case, when our record
is the latest in scanning.
It doesn't help.
Here we deal with a similar situation.
In general it's good, but we have
some edge case which not only
doesn't help, but it decreases
performance.
But it's an edge case and it's
phenomenal.
It's not a harmful limit, especially
because you cannot do this
without order by.
And especially because the problem
here is not limit, but the lack
of a proper index.
Michael: And again, we're talking
about a one-off query rather
than an ongoing optimization challenge.
I'm talking about it's just an
interesting phenomenon that, and
I don't think most people expect
this, that by adding limit you
can make the query slower.
And that's true.
It can be a lot slower.
So I just don't want people to
use that as their safety mechanism.
Nikolay: You know, when you understand
what's happening inside,
it's clear why adding the limit
went wrong.
It's clear.
You just force the planner to choose
the wrong index here.
Right?
That's it.
But of course for people who don't
dive inside the execution
and planning process, they of course
are very surprised and think
limit is bad.
But remove order by and you won't
see this effect, right?
Michael: But the docs, and this
is what I wanted to say, the
docs, quite often when you're reading
about limit, will say it's
really sensible to use order by
with limit, because otherwise
you're getting things back in, like,
an undetermined order...
Nikolay: Again, 1 second.
If we talk about arbitrary user
ID, what if this user ID is super
active and created a thousand orders
right today?
In this case, it's right.
So that's why I say this is not
even an edge case, it's a corner
case.
You have several things in the
corner here.
And observing various strange effects,
phenomena, of course,
it makes sense.
But create proper indexes, that's
it.
But at least let's correct this
harmful title.
Let's say limit 1 with order by
considered harmful.
Order by limit.
Michael: It's not just order by,
though.
It could be another condition that
you add.
Nikolay: Lack of multi-column indexes
plus order by plus limit considered
harmful.
This is the full truth.
This is the truth actually here.
Not just limit.
Michael: I was hoping you'd understand
my point of view on this.
It feels like you haven't.
I do think that in a lot of real
world cases, people don't have
perfect stats or they have data
that's correlated strongly in
certain ways.
For example, like user ID and account
ID being extremely tightly
coupled.
Or you know, the age old city and
country codes.
And Postgres assumes that all of
these stats are uncorrelated
by default.
And if you don't, you can add,
you can create statistics for
these things.
But in a lot of cases, we have
a lot of skewed distributions
in databases that we haven't created
extra statistics for.
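(A sketch of extended statistics for correlated columns, with hypothetical table and column names; this has been supported since Postgres 10.)

    CREATE STATISTICS addresses_city_country (dependencies)
        ON city, country_code
        FROM addresses;
    ANALYZE addresses;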
And most of the time, the planner
is great and really can work
things out nicely.
But by having this limit optimization,
it can flip plans that
we're not expecting to flip.
And the reason I found it so powerful
is we're using it exactly
in the cases where we're most scared
of doing a big scan on production.
If that's what you're really scared
of doing, this is something
that can be quite...
Nikolay: Why do I not agree with this whole
"considered harmful" idea?
Because imagine we don't have an
index on user ID at all.
In this case, if...
In this case, created_at shows
bad performance because the
user became inactive last year -
we need to scan the whole year.
But at least we have some approach
to scan, right?
created_at adds some order to the scanning.
Without it, we only have a sequential
scan, right?
That's it.
So the problem is we don't have
a proper index.
Now, okay, it turns out we have
an index on user_id.
It happens to be helpful, but
what if some user has 50% of all
the rows - millions of records?
Michael: So bear with me.
This is why I'm suggesting instead
of running the query on production
by adding a limit 1 or something
to make it feel safe, instead
of adding limit 1, put just 1 word,
explain, in front of the
query, not explain analyze, not
explain analyze buffers, just
explain your query without the
limit and see what the query plan
is before you run it.
If you're scared of what it's going
to do in production, do that
instead of adding the limit.
And you'll see which indexes it's
using, whether you're missing one
or not.
Nikolay: But I cannot run it without the limit,
because I don't want my page to blow up
because of many, many rows returned.
Michael: Well, add limit 1 and
put explain in front just to check
what it's going to do.
Because that's what's really going
to check.
That's what's going
to actually keep you safe.
Nikolay: You will check
the planner behavior, right?
Michael: Yeah, you're checking
what you're actually scared of
instead of assuming you're going
to be safe.
It's like, I understand your point
that it's almost never going
to be a problem, but imagine if
a safety on a gun, for example,
almost always kept you safe.
It's like a safety feature that's
almost always good, but sometimes
it's going to actually shoot you
when you press that safety catch.
Not when you pull the trigger,
but when you press the catch.
Nikolay: Again, in my world, I
add limit 1 to limit things.
If it doesn't work, it means I
don't have an index.
If there is some index which would
help me when I don't have
a limit, I don't care because I do
need limit 1 anyway.
Michael: And maybe you'll hit some
like statement timeout or
you'll cancel your query quickly
or there was some other safety
mechanism in place that will stop
you from hurting yourself.
Nikolay: But still, if my query with limit
doesn't behave well, I look at
which index will help.
And I'm not considering the index on
user_id as helpful here, because
I don't want to deal with a lot
of records in memory when
I need only 15 records.
So that's why I jump straight to
multi-column indexes here.
And as we discussed...
Michael: I agree.
I agree that to optimize this query,
but we're not talking about
optimizing the query, we're talking
about should you use limit
as a safety feature when you don't
need it?
Nikolay: Why wouldn't I? Why might
I not?
Michael: Let's say you know there's
only going to be 1 row, should
you add limit 1.
Nikolay: How do you know it's only
1 row?
Unique index?
If you have unique index, this
problem won't happen.
Michael: Let's say a company could
set up multiple things, but
they generally set up 1.
Like, how many websites are you
going to have for each organization?
You don't have a unique index on
it, people can add a second, okay,
but almost everybody else...
Nikolay: For you, for example, right?
Michael: Yeah, email addresses, yeah.
It's probably going to be fewer
than 10, and maybe you're
not certain it's going to be 1,
but you're pretty sure it's going
to be 1.
Nikolay: But do I have up-to-date stats
or not?
Because the stats will
form an expectation for Postgres:
if it expects a low number of rows, it
would prefer...
Michael: I think I'm struggling
because it's a simplified example,
but if you've got lots of joins
going on there could be all sorts
of other like complexities.
Nikolay: Joins are tricky by the
way.
The same problem becomes much
more tricky, but I don't think
limit is the problem.
I think the problem is a lack of
proper indexes, and sometimes
you cannot create them.
Imagine user_id is present in 1
table but created_at is present
in a different table.
Michael: So you're saying you've
never sat in front of a database
where you don't know all of the
indexes in the system, you don't
know the state of stats and wanted
to run a query and not being
scared that it might take a long
time.
Nikolay: Well, when I do this,
I add a limit, and I understand that,
if I lack a proper index,
I might be dealing with a sequential
scan, or the wrong index being
scanned almost fully, right?
I understand that.
If you don't understand that, that's
strange.
Right?
Michael: Well, but then, that's
kind of my, like,
takeaway.
Instead of adding limit 1...
Nikolay: And hoping that I'm okay.
Let me give you an example I saw
before, of what we discussed today.
So we have full-text search, for
example.
And we have rare words and we have
very frequent words present
in the majority of records in TSVector.
Yeah, that's a nice example.
Right.
And then we have created_at or
ID, and we want to find all matching
documents and return only the 15 latest,
for example.
In this case, Postgres must
choose between ordering by the B-tree
index and then applying the filter -
the @@ operator, right, tsvector
@@ tsquery -
or it might decide, okay, I need
to extract
all matching documents and then order in
memory and then limit.
Same problem, completely same problem.
And I saw terrible cases when I
thought, wow, like, why do you
do this?
Why do you think this word is super
frequent?
So instead of using GIN, you decide
to use the B-tree for order by, limit
15, for example.
And then you end up scanning almost
everything, finding my 1
or 2 records only at the end of
the scan.
It's absolutely the same situation.
But the problem to me is not limit
there at all.
I put limit, I see the problem,
I understand.
Limit didn't help, I understand
why.
Because we need to combine both
things in 1 single index and
have only 1 single index scan.
Unfortunately, this is impossible
there, right, in this particular
case.
Because for that you need to use
a RUM index, and it may not be
available.
On RDS, for example, it's not
present.
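(A sketch of the full-text case, with a hypothetical documents table: the planner has to pick between the GIN index on the tsvector and the B-tree on created_at.)

    CREATE INDEX ON documents USING gin (tsv);
    CREATE INDEX ON documents (created_at);

    SELECT *
    FROM documents
    WHERE tsv @@ to_tsquery('english', 'some_word')
    ORDER BY created_at DESC
    LIMIT 15;
    -- For a frequent word, walking created_at backward and filtering is fine;
    -- for a rare word, that same plan can scan nearly the whole table
    -- before it finds 15 matches.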
Michael: Or like work out some
middle ground, take way more rows
than you want and then sort them,
that kind of thing.
Nikolay: Right, so I mean, we sit
on the edge here.
We might be lucky, but we are not.
And what else?
What can I say?
It's not a problem of limit, it's
a problem of lack of indexing.
Michael: Let's try and wrap up.
And I think I agree with you that
to solve the performance issue,
you don't get rid of the limit,
you add a good index or you get
your stats, like you work out your
stats correlation, maybe create
some statistics there.
Maybe you just need to run analyze
on that table or maybe globally,
like that.
There's lots of ways of solving
it so that that query doesn't
run slowly again in future.
I just wanted to publicize and
share that.
And by the way, I wanted to shout
out Franck Pachot, who got in
touch recently when I shared this
Twitter thread again, to say,
I think we agreed that it's
an optimization problem.
And yeah, he would completely be
agreeing with you right now.
I just found it really interesting
and quite counterintuitive.
Probably more
than you did - that,
just by adding limit, you could make
a query slower.
Nikolay: Right.
Michael: Even though that's the
sign of other issues.
So I wanted
Nikolay: to clarify that.
It's an interesting phenomenon,
for sure.
And adding a limit is not a performance
optimization tool.
The performance optimization tool is
order by plus limit and a proper
index, together.
Yes.
And adding the limit, I should not
expect quick execution from it alone.
I just form expectations for myself
and I can then say, okay,
this is not what I want and find
a better index, which is sometimes
not possible, for example, GIN
versus B-tree and so on.
But by the way, to be fair, with
GIN statistics already, Postgres
is quite often right, so it knows
when to use B-tree and quickly
find proper results, and it knows
when it's better to find everything
and order by memory.
So it's not that bad, but also
not ideal, not just single index
scan.
Michael: Yeah, cool.
Nikolay: Okay.
Michael: Thanks so much, Nikolay.
Take care.
Nikolay: Thank you.