A weekly podcast about all things PostgreSQL
Michael: Hello and welcome to Postgres
FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard.
This is my co-host Nikolay, founder
of Postgres.AI.
Hey Nikolay, what are we talking
about today?
Nikolay: Hi Michael.
It's your proposal actually, Limit,
right?
But before that, let's mention
a little bit what our companies
are doing because I think last
time we did it 90 episodes ago.
Michael: Right, sure.
Nikolay: Yeah, so pgMustard, by
the way, I wanted to tell you
yesterday but forgot.
During some session, people asked
me about pgMustard suddenly
and asked, can I recommend?
Michael: What did you say?
Nikolay: I said, yes, of course.
And became shy.
I should recommend it before they
ask, right?
Michael: It's good that they ask.
Nikolay: So yeah, we discussed
some plan and they wanted...
Obviously we discussed how important
it is to use buffers,
last week's topic.
If you haven't listened to it,
please do.
I mean, not you, but our listeners,
right?
Because buffers are super important
and there are many aspects.
And I actually found pgMustard
a good tool to visualize plans
and suggest improvements.
It's not my idea, it's an idea
from my customer.
I personally think visualization
is not the strongest pgMustard
feature.
Strongest is a list of recommendations
based on heuristics, which
is quite a good thing.
So, and I know you improve it all
the time, already many years.
So it's great.
Yeah.
And what about me?
My company is currently creating
a bot.
Check out our latest blog post -
it's only a few months old - my
blog post where I described how
we built it.
And currently we are rebuilding
it fully.
It will be interesting.
The release will be soon.
Yeah, it's quite interesting.
We are bringing it to the web from Slack.
I think it will still live in Slack,
but it will also be on the web.
It's just easier to use it there.
And we started using different
models, including Gemini 1.5 with
1 million context window.
So yeah, we are changing our RAG approach
right now.
And it's kind of interesting.
It knows already more than 1 million
documents about Postgres,
it's huge.
So yeah, that's probably enough
about Postgres.AI and pgMustard.
Then let's talk about limit.
Michael: Nice, yeah.
That's cool to hear.
Maybe we'll hear more about that
another time about the differences
between the models.
Yes, so I suggested this topic
and it came out of a tweet a long
time ago that I found myself recommending
over and over again
to people because it was a phenomenon
I saw and most people didn't
understand or didn't intuitively
understand, including myself.
So this was a tweet from Christophe
Pettus and the headline was,
limit considered harmful or something
along those lines.
And it was a really good kind of
initial tweet for a thread.
I was like, what?
Limit's good.
There's so many good things about
limit.
It's really helpful for so many
performance reasons.
What's harmful about it?
And he goes on to describe 1 specific
case where it can cause
plan flips and really, really slow
queries in cases where you
would expect it to be helpful in
performance.
So, that was the trigger for this
topic, but it might be nice
to discuss all the positives of
using limit as well: at least when
we should be using it, whether we
should be using it, that kind
of thing.
Nikolay: Yeah, well first of all,
I guess we will be fighting
a little bit in this episode because
we have different opinions.
Let's start.
I cannot be on the side of Christophe
Pettus in that tweet.
I think there is huge potential
here, because limit - this is what
every one of us is using all the
time, right?
The limit is everywhere.
So if there are some cases when
it's harmful, it's interesting.
But my regular approach is if you
don't limit, you're in danger.
Limitless queries are harmful because
you don't know what you tested.
You know, my favorite topic: testing
and experiments and how
to test, and the idea that if you
test only on small databases, you
miss things.
It's bad testing.
It's an anti-pattern, unless you know
what you're doing and do it
before you test on full-size databases.
So this is exactly a case where it's not
even about performance.
It's just that you test on small
databases, small tables, and some
queries showing results on
a page don't have a limit.
And we usually start with that.
If we prototype something, first
we say: forget about pagination,
let's show everything here.
Then you test, it works well -
1,000 rows on 1 page is
not a problem, probably, depending
on how heavy they are in markup.
But then project grows and some
users, for example, have 10,000
posts or, I don't know, comments,
and you show all of them on
1 page, and you have problems.
And this is a postponed problem
if you start a new project, right?
So limitless queries can be harmful
quite easily, and this is
a straightforward idea.
If you limit, at least you tell
yourself what to expect.
Okay, I expect 15 rows, for example.
Okay, we know it's good.
We know how many buffers we should
expect in the very efficient
query.
15 times the width of row plus
some internal buffer operations,
continuing the topic we had last
week.
Right, and that's it.
If you don't have limit, you don't
know what to expect.
Maybe it's a billion rows.
Michael: Yeah, so I agree and I
think pagination is quite a good
place to start, because if you look
up limit in the docs, for example,
you're also going to learn
about offset, and it's a really
common concept.
Well yeah, so I'm a big fan of...
I'm going to say friend of the
podcast, Markus Winand, previous
guest.
Yeah.
I've always wanted to say friend
of the podcast.
So maybe I finally got to say...
Nikolay: It was a great episode.
Michael: Thank you.
And yeah, it was...
So he has a page on his website
called No Offset, and I was just
about to...
Nikolay: I just use the NoOffset
hashtag on Twitter.
I use it all the time, because when
somebody mentions offset, it's usually
not an expert who is using offset
on purpose, knowing why - for example,
to get index-only scans before
dealing with a regular index scan.
There you probably want offset
sometimes, but very rarely actually.
So I always use this hashtag no
offset.
It's good.
Michael: I'm sure most people listening
know exactly what these
things do, but just for clarity,
When we use limit, we might
add limit 10 to the end of our
query, limit 20, limit some number.
And that will, as the name suggests,
limit the result set to
that many.
That many rows.
Nikolay: Limit 0.
I use sometimes.
Do you use limit 0?
Michael: No, but I've seen quite
a lot of offset 0.
Nikolay: I use limit 0 in 2 cases.
When I, for example, create a table,
I select specific columns
limit 0.
For example, sometimes just bootstrap
some table.
I don't want to repeat all data
types.
Sometimes it's like you're dealing
with CSV or something.
And also when I do tests, for example,
presence of column, you
can just select that column from
table limit 0, not caring about
anything.
And yeah, kind of assert.
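(A minimal sketch of the two LIMIT 0 tricks described here, assuming a hypothetical orders table with id, user_id, and created_at columns.)

    -- 1) Bootstrap an empty table with the same column names and types:
    CREATE TABLE orders_snapshot AS
    SELECT id, user_id, created_at
    FROM orders
    LIMIT 0;

    -- 2) A cheap "assert" that a column exists (errors out if it doesn't):
    SELECT created_at FROM orders LIMIT 0;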
Michael: So yeah, I've seen
that - really cool use case.
I've seen offset 0 used a few times.
So limit limits the result to the
first N rows; I've never thought
of using 0 for that.
It's cool.
Offset will jump you forward that
many before you start limiting.
So you could offset, you take the
first 10, throw those away,
and then limit will take you the
next 10.
So if you do limit 10, offset 10,
you get the second set of 10.
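(A minimal illustration of what Michael describes, assuming a hypothetical orders table: the second page of 10 rows.)

    SELECT *
    FROM orders
    ORDER BY id
    LIMIT 10 OFFSET 10;  -- skips rows 1-10, returns rows 11-20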
Nikolay: Yeah, and offset 0 is either
generated code, right? Because it
was "offset n" where n happened to be 0.
Or it's that performance
trick, some hack, right?
Yes.
I already forgot.
Michael: It's quite, yeah, I think
it's an optimization.
It is or was an optimization fence.
So it can, it can trick the planner
into not doing certain optimizations,
which I guess we'll get onto a
little bit, but there is a line
in the doc.
Nikolay: Does that trick still work, or was it mitigated?
Michael: Well, something in the
docs made me think it might...
I think it's still...
I think they still work, because
I remember reading in a hacker's
thread, or it was in a thread somewhere
in the mailing list.
Nikolay: Yeah, certainly, it has
been used, but it was so long
ago, that's why I'm wondering if
it's still relevant.
Michael: So I read a line in the
docs in preparation that the
offset 0 is the same as omitting
the offset clause, but I don't
think that's quite true because
of the planner implications, but
in terms of the results, that's
true.
In the results, you
Nikolay: won't change it.
Semantically, it's true, but physical
execution might be affected
by the presence of offset 0.
Michael: Yeah.
So the reason I brought this up
entirely was because we were
talking about pagination, right?
Both common ways of
paginating use limit.
1 is the kind of crude way of doing
it.
A lot of ORMs, a lot of
tools implemented it this way,
at least initially: for the
first set of, let's say, 20 rows
that you want, just add limit 20.
The next set, they just did offset
20, limit 20.
And then the next 1, offset 40,
limit 20.
And the problem with that is performance
degrades linearly as
you go.
And if you want to do a large,
you want to get to page 20 or
page 100, that can be quite an
inefficient query.
And Markus wrote a great post about
why we should instead be
doing key set pagination for predictable
performance and efficiency.
Nikolay: Right, the explanation
is simple: it's just because
internally it fetches everything
up to that point and discards it.
Exactly.
It's a lot of inefficiency.
It grows with the depth of your
pagination, right?
Michael: So yeah, what we were
talking about as the good approach
is "where ID is above...", like, where
some parameter is greater than
the last result you
fetched, and then still limiting
by 20 or 21, depending on exactly
how you want to do it.
Something like that.
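(A sketch of the two approaches, with a hypothetical table and values: offset pagination does more work the deeper the page, keyset pagination seeks straight to the next page.)

    -- Offset pagination: page 3 of 20 - Postgres still fetches and discards the first 40 rows
    SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 40;

    -- Keyset pagination: remember the last id from the previous page and seek past it
    SELECT * FROM orders
    WHERE id > 12345   -- hypothetical last id seen on the previous page
    ORDER BY id
    LIMIT 20;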
Nikolay: So limit 0 can be a problem
or limit 10 can be a problem,
right?
Let's talk about this problem maybe,
right?
Because I hope I already explained
why limit is good in terms
of performance, because you limit,
you understand yourself what
to expect, and you know what is
efficient and what's not if you
look at buffer operation numbers.
Michael: I also think there are
a couple of cases, like even
simpler cases. Chances
are you don't want... like, if
your query could return a hundred
thousand rows, do you really
want all of them?
Nikolay: Right, right.
Michael: Or do you, you might only
be interested in the top 10
or maybe your user might only care
about the most recent 10.
And returning all 100,000, even
if the query is quite fast on
the server side, sending all that
data over the wire is inefficient.
So I think that's kind
of the more fundamental
reason that limit is good: returning
a subset of rows, exactly
what it's designed for.
Nikolay: It's close to our topic
about delete and batching.
By the way, someone asked on YouTube
comments, how about updates?
And I explained it's very similar,
but with additional things.
So if you don't batch and send
the whole result set in 1 shot,
it's not only a lot
of memory and so on, but also resiliency,
reliability: what if it fails at
the end of processing? You lose
everything, right?
And the retry will be huge again.
So it's better to go in smaller
steps with retry logic.
And limit, of course, is the key
tool to split the batches.
So pagination, batching - we need
it.
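(A rough sketch of one batch of such an update, assuming a hypothetical orders table with a processed flag; the application would repeat this, with retry logic, until no rows are updated.)

    WITH batch AS (
        SELECT id
        FROM orders
        WHERE processed = false
        ORDER BY id
        LIMIT 1000
    )
    UPDATE orders o
    SET processed = true
    FROM batch
    WHERE o.id = batch.id;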
And also, in general, common sense:
if I limit, again, I
know how much I want. This is my
request - I want 15, for example,
or 2,500 - and if you scan many
more parts of the database than these
15 records or 25 records, plus additional
internal index
pages and so on, then I'm not considering
this efficient work,
right?
So it sets a simple expectation
and metric for efficiency.
Efficient versus not efficient
comparison of queries, right?
Michael: Yeah, I think that makes
tons of sense.
Like in application code, for example,
you are suggesting that
you're kind of communicating with
future developers what the
expectation is here.
I think this is where we're going
to start to differ in opinion,
though, because I think this is
exactly the case. Let's say it's an
ad hoc query. I've sat next to
people working on production systems
who were quite scared, and saw
them do this and didn't think it was
a problem - I was actually learning
from them at the time.
They were only expecting a single
row back and they added a limit
1 at the end of the query.
Nikolay: For example, primary key
lookup, right?
Michael: Yeah, I think this might
have even been like, I can't
remember exactly what they were
doing, but they definitely added
Nikolay: a primary key or a unique
key.
It cannot hurt, because it's
like if you don't trust
Postgres unique keys, you add limit
1 just as an assert, basically.
You think, okay, I must check,
it should fail or something if
it returns more.
Michael: Yeah, I actually can't
remember exactly why.
The thing I remembered was, oh
that's quite a clever way of not
having a runaway...
Nikolay: Actually, I remember I
did a different thing.
I did limit 2 on purpose.
So if something goes wrong and
I have 2 rows and I don't remember
where, but somewhere I should have
an error or exception, catch
it and process.
So limit 2 was...
Well, it should be unique, so I
expect 1 row, but I add limit
2 on purpose to check later that
it's not 2.
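(A sketch of the LIMIT 2 trick on a hypothetical lookup that should be unique: the application expects at most 1 row back and raises an error if it actually gets 2.)

    SELECT id
    FROM users
    WHERE email = 'alice@example.com'   -- expected to be unique
    LIMIT 2;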
Michael: Yeah, I like it.
Nikolay: But yeah, so anyway, like
in my opinion, the limit is
always, I would add it like as
a must for everything.
Michael: I think the fear was in
this case, a runaway sequential
scan across a huge table that might
start to eat resources and
cores.
I think this was a very, very on
fire server.
Nikolay: I'm thinking now and I
think maybe I would just even
consider this as a rule, mandatory,
like limit everywhere and
the only thing we cannot limit
is, for example, SELECT count(*)
FROM something.
This basically deals with all rows,
returns just 1 number, and
we cannot limit there, unfortunately.
There, okay.
Michael: I mean, you can add limit,
you're just still going to
get the count.
Nikolay: Yeah, limit will be useless
in this case.
Yeah, definitely.
Michael: The interesting thing,
though - I was looking back,
just for fun, to see if I could find a
version of the Postgres docs
that didn't have limit in it.
I'm sure it's as old as time.
But it was there in the oldest version
of the docs I could get back to, 7.1,
because that's the oldest one online.
And I got way more than I deserved
looking back at that.
And it said, as of Postgres 7.0,
the query optimizer takes LIMIT
into account when generating a
query plan, which I thought might
be a nice segue.
So before 7.0, there wouldn't have
been any danger in adding
LIMIT to your query.
Nikolay: I object.
Mind your language - this
language is wrong.
There is no danger in adding LIMIT.
Let's already jump to the core
of the topic.
There is no danger in adding LIMIT.
There is danger in having bad statistics
or expecting some wrong,
like forming expectations based
on indexes which cannot support
your ORDER BY with filtering and
LIMIT.
By the way, when you describe this
case, focus on ORDER BY as
well, because without ORDER BY,
this case is impossible.
It's not about LIMIT, it's about
ORDER BY plus LIMIT, which forms
expectations - I mean, gives
the planner the ability to use some
index, right?
Michael: Yeah, so let's
talk about - well, for
anybody thinking we've jumped,
I shared some examples with Nikolay
before the call, so that
he will talk about that.
Nikolay: I saw
this example before, like
a long time ago.
I saw it live.
I just...
Michael: Yeah, well, that's good.
So, let's back up and try and explain
the problem or the edge
case or whatever you want to call
it.
So, exactly the same query, if
you run it with and without "limit
some number", can be dramatically
slower with the limit
added than it is without the limit
at all, which I found really
counterintuitive.
And yes, it relies on some specific
cases, but those specific
things are not as rare in the real
world.
Nikolay: It's not rare, I agree.
Michael: Yeah, so that person sitting
in front of that console,
hoping that limit some number is
going to make it safe and guarantee
fast execution.
Nikolay: The same logic is expected,
yes, I agree.
Michael: Can actually shoot themselves
in the foot and make things
loads worse for themselves.
And I actually think there might
be other advice you could follow.
I'd much rather they added additional
where clauses or other
things than an additional limit.
So that's potentially controversial.
I'd love your take on it.
Let's try to explain how this can
happen.
So the simplest case is to filter
by a relatively rare condition
and then order by something else
that you have indexed -
so, for example, created_at.
So if we think about a, let's say,
a software as a service business,
and maybe you have a customer that
cancelled a few years ago,
and you want to look at their most
recent event or something
like that.
And so you're looking at: where company
equals A, order by created_at
descending, limit 1 - something
like that.
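(The rough shape of the query being described, with a hypothetical events table and company id.)

    SELECT *
    FROM events
    WHERE company_id = 123          -- a company that cancelled years ago, so no recent rows
    ORDER BY created_at DESC
    LIMIT 1;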
Nikolay: Or limit 10.
Michael: Or limit 10, yeah, exactly.
Nikolay: So it can happen, the
problem can happen with not only
1.
It can be some small number.
Yeah.
Michael: It could be a big number.
Nikolay: Or relatively small.
The key is that it's small enough
to have a flip.
Michael: Yeah.
Well, Christophe made a good point
that I haven't tested yet.
He thinks it's worse around the
point of the actual number of
rows.
There are some cases where, let's
say that company had a thousand
events, the limit of a thousand
would be problematic.
So the problem comes from not knowing,
like Postgres assumes
things are randomly distributed
unless it knows better, unless
it has statistics to suggest otherwise.
And if it thinks that the table
that contains these events is
going to have events from any company
in any order, because it
knows it only has to fetch 1 or
10 events, it now has this optimization
where it might be quicker for it
to scan backwards through the created_at index.
Nikolay: You missed a very important
thing.
Let me add things.
So we talk about, for example,
you took some SAS system, some
users, and users have some activities,
let's say, like comments
or, I don't know, orders, anything.
So we have orders table, for example.
This is what you showed, if I'm
not mistaken.
I remember.
So, orders.
And we have user_id, we have
created_at, we have the ID of the order -
enough.
The key here is we want to select
orders for this specific user
in chronological order, like order
by created at desc, so like
in reverse order, the newest first,
the latest first, right?
And then the key here is not to
forget, our developers, or we
are developers, We don't know about
multi-column indexes.
So we created an index on created_at,
we created an index on user_id,
and on ID.
Okay, a simplifying assumption:
we like to index all columns.
Even worse,
when we do it, we use only
single-column indexes, always,
right?
This is quite a classic example.
I can imagine it happens in the
life of every engineer during
5 years, for example, of development,
definitely at least once
it should happen, I would say.
99% of developers.
It happened with me many times.
And with my products, my databases,
and also with others when
I observe them.
Actually, recently we had a similar
case, maybe a few months
ago, I remember.
So, the key is Postgres needs to
choose: okay, I have a filter on
user_id, and I have order by
created_at desc limit 1 or limit
10.
And I have 2 indexes.
So I have 2 opportunities here
to save in terms of IO operations.
Of course, Postgres should avoid
sequential scan here.
It can be 1 user of many, of millions,
so it can be a huge table.
So we don't want to scan the whole
table, so we need to choose
only 1 index or 2 of them, maybe
combine with bitmap scan and
so on.
But there are 2 opportunities here: filtering
by user_id, and also the
order by created_at desc limit
1 or 10.
And the key here is order by.
Without order by, the limit will
always be fast, because,
I mean, order by is the key reason to
consider the index on created_at.
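(A sketch of the setup Nikolay describes - hypothetical names, single-column indexes only - and the query the planner has to pick a plan for.)

    CREATE TABLE orders (
        id         bigint PRIMARY KEY,
        user_id    bigint,
        created_at timestamptz
    );
    CREATE INDEX ON orders (user_id);
    CREATE INDEX ON orders (created_at);

    SELECT *
    FROM orders
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 10;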
Michael: Yeah, We need both the
equality check and the order
by to make it a choice for the
planner.
Either it has to filter on the
equality condition with one of the
indexes, take the data and sort
it and return the order, or it
has to choose the index that's
ordered and take the first row
it reaches that matches the equality
condition.
Nikolay: It's not necessarily
equality - it can be a range, between
one value and another.
Michael: It could be anything, yeah.
Sorry, it doesn't
also have to
be an order by, by the way; it could
be different - it has to be two
separate conditions, and those
have to fall on different indexes.
Nikolay: They fight between each
other, right?
Yes.
How do they fight?
They fight based on costs, based
on statistics and what Planner
expects in terms of cost.
And the key also, like, important
point here is that Planner
has two costs.
Startup cost, which can be very
small, it means the first row
can be returned very fast.
And also the full cost.
Full cost is the most important:
the full
cost is what the planner uses to make
the final decision, choosing
between a variety of plans, right?
Full cost.
So in our case, the index on user_id:
if we use just it, the planner
thinks, okay, I will fetch all
rows for this user, but then
I need to order them in memory,
right?
It means startup cost cannot be
close to 0 because ordering and
so on, it will take some time.
And once done, only once done,
we can return all the rows.
I assume like in this case, startup
cost and the final full cost,
they are very close to each other,
but far from 0 probably, right?
But the other option: order by
created_at desc limit 1 or 10.
This option means that we immediately
use the created_at index, we
start fetching rows from there,
but we need to filter out
irrelevant rows owned by
other users, right?
We only need our user ID.
It means we start returning rows
quite quick, but only if we
find our user ID quite quick.
There is uncertainty here.
But for Planner, it's quite certain
because it has some expectations
about distribution.
You mentioned it.
Probably the planner can think,
okay, I will be lucky, because
maybe this user is a normal user,
maybe it's still super active -
and, by the way, we start from
the very fresh rows.
The latest row will be considered
first, and if our user ID
is there, it means we already found
one row in the very first step.
And we go, go, go.
But if, in reality, this user has been
super inactive for the last year,
for example, we need to scan the whole
year to reach the first row
of that user to return.
And here is exactly where the planner
can be very wrong.
I mean, it simply doesn't know.
It doesn't know.
There is no correlation in the statistics
between user IDs and
created_at.
And also, the planner doesn't know
who is actually active and
who is not.
The planner doesn't know.
So it can be very hard.
Michael: Yeah.
So there are multiple.
So I completely agree - you
brought up multi-column indexes.
In the case that I shared with
you and the case we just discussed,
a multi-column index on the user
ID, or whatever it was, and then
created_at, ordered, would make
this query efficient and predictable
all the time.
And so, if this was an application...
Nikolay: Yes, yes, yes.
Michael: But if this is an application query,
and this was showing up
in pg_stat_statements as a problem,
and we needed to optimize
it, and it was one of our biggest
concerns, we didn't mind about
the overhead of adding another
index, great, we can add this
index, remove the one on user ID
or whatever we had, like, And
it would be fast all the time.
But I was talking about a case
where we're sitting in front of
a console on production trying
to stop ourselves from running
a large query or a slow query and
adding that limit, even though
we didn't think it would make a
difference, even though we think
we're only going to get one or two
rows returned, if we add it, we're
taking a risk that it could actually
make things worse.
Because we don't, like, if we don't
check, and this is actually
maybe what I would recommend.
I don't know about you.
But if you add explain before the
query and just check what index
it's going to use, does it make
sense?
If we saw it was going to do a
backwards scan on created_at,
maybe that would be a sign that
it was a bad idea.
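(A sketch of that check, reusing the hypothetical orders query; plain EXPLAIN only plans the query, it does not execute it.)

    EXPLAIN
    SELECT *
    FROM orders
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 10;
    -- An "Index Scan Backward" on the created_at index with a Filter on user_id
    -- is the risky plan being discussed here.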
The tricky part is this is an issue
because of a cost underestimation.
So it's always going to look like
a cheap query in cost.
The reason it's being
chosen, if it's a bad plan, if
it was a slow plan, is because
the planner thinks it will be
cheap.
It's because it thinks it can abort
early.
So costs won't give you a clue
here.
You'd have to be attuned to the
fact of which indexes it's scanning.
And is that enough?
Is that a good idea for this query?
So I'm talking about production
cases, like where you're doing
ad hoc queries, not so much application
queries where you should
optimize.
Nikolay: Yeah, for example, we
have to return our table and allow
users to order by all columns in
both directions.
It's a nightmare to support in
terms of performance.
And if you have many such tables,
or maybe you allow users
to create tables - any tables - and
you want to hide this performance
optimization from them: some website
builder or mobile app builder,
where your users are developers
of websites or mobile apps, and you
say, I allow you to create tables,
I will take care of performance.
In tables, you're allowed to create
columns, I will take care
of performance.
And you can put some element on
your UI for your users, millions
of users, and you can allow them
to order by any column in any
direction.
It's a super nightmare.
Absolutely nightmare.
Yes.
Because obviously you want to create
index on every column and
also you think, okay, I'm smart,
I know about keyset pagination,
you do keyset pagination, and then
you allow this.
This is exactly where you can encounter
it, right?
Because this limit...
So I wanted to emphasize again,
like I described, 1 index has
very low startup cost.
It's the created_at index.
And why is it low?
Because Postgres doesn't know, for
an arbitrary user.
We probably want this query to work
not for 1 user but for arbitrary
user, right?
So in this case, you probably think
Postgres doesn't know for
this particular user how fresh
activity is or orders are, right?
So maybe, but it assumes something.
It can be slightly more, slightly
less, it depends.
But it assumes something.
For the second row, it's 2 times more;
for 10 rows, 10 times more.
But once we find it, we don't
need to do anything else.
We can traverse the created_at index
and just filter out irrelevant
other users, right?
We just need to find our users.
Once we find our N, 1 or 10 rows,
we are happy.
This is our final cost.
And then if we don't use a limit,
our final cost is our expectation
for this user.
This statistic probably is present.
Probably, unless we don't have
statistics at all.
At all.
Postgres expects for this user ID,
I expect this number of rows,
and this forms our total cost.
Probably if it expects a lot of
rows, total cost will be high
and it won't start using this approach.
In this case, it switches to the
more reliable approach, I will
fetch all the rows and do a memory
sort.
So the key here, if we limit - this
is probably why Michael mentioned
that the boundary is this limit.
It's not the actual number of rows,
but the statistics' expectation
for the number of
rows for this user ID.
So when our limit goes down below
this number, the
traversal with filtering out irrelevant
users becomes more attractive,
because the total cost starts looking
good there.
But it's a trick, right?
Because the problem here, it's
a trick and of course, it's super
interesting phenomenon.
I would say this is a phenomenon.
It's not a harmful situation because
there are many cases when
limit doesn't help.
For example, we don't have indexes
at all, we say, well, no.
Michael: No, that can really help.
Nikolay: It can help, it can help,
yeah, yeah, yeah, because
filtering out, it can help.
But it's easy to imagine when limit
is useless in this.
Michael: Actually, that's another
time I use limit.
So the docs are quite...
Nikolay: In the general case, it improves performance
because you will stop
earlier.
But on edge case, when our record
is the latest in scanning.
It doesn't help.
Here we deal with a similar situation.
In general it's good, but we have
some edge case which not only
doesn't help, but it decreases
performance.
But it's an edge case and it's
phenomenal.
It's not a harmful limit, especially
because you cannot do this
without order by.
And especially because the problem
here is not limit, but the lack
of a proper index.
Michael: And again, we're talking
about a one-off query rather
than an ongoing optimization challenge.
I'm talking about it's just an
interesting phenomenon that, and
I don't think most people expect
this, that by adding limit you
can make the query slower.
And that's true.
It can be a lot slower.
So I just don't want people to
use that as their safety mechanism.
Nikolay: You know, when you understand
what's happening inside,
it's clear why adding the limit
went wrong.
It's clear.
You just force the planner to choose
the wrong index here.
Right?
That's it.
But of course for people who don't
dive inside the execution
and planning process, they of course
are very surprised and think
limit is bad.
But remove order by and you won't
see this effect, right?
Michael: But the docs, and this
is what I wanted to say, the
docs, quite often when you're reading
about limit, will say it's
really sensible to use order by
with limit, because otherwise
you're getting things back in, like,
an undetermined order...
Nikolay: Again, 1 second.
If we talk about arbitrary user
ID, what if this user ID is super
active and created a thousand orders
right today?
In this case, it's right.
So that's why I say this is not
even an edge case, it's a corner
case.
You have several things in the
corner here.
And observing various strange effects,
phenomena, of course,
it makes sense.
But create proper indexes, that's
it.
But at least let's correct this
harmful title.
Let's say limit 1 with order by
considered harmful.
Order by limit.
Michael: It's not just order by,
though.
It could be another condition that
you add.
Nikolay: Lack of multi-column indexes
plus order by plus limit considered
harmful.
This is the full truth.
This is the truth actually here.
Not just limit.
Michael: I was hoping you'd understand
my point of view on this.
It feels like you haven't.
I do think that in a lot of real
world cases, people don't have
perfect stats or they have data
that's correlated strongly in
certain ways.
For example, like user ID and account
ID being extremely tightly
coupled.
Or you know, the age old city and
country codes.
And Postgres assumes that all of
these stats are uncorrelated
by default.
And if you don't, you can add,
you can create statistics for
these things.
But in a lot of cases, we have
a lot of skewed distributions
in databases that we haven't created
extra statistics for.
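(A sketch of extended statistics for correlated columns, with hypothetical table and column names; this has been supported since Postgres 10.)

    CREATE STATISTICS addresses_city_country (dependencies)
        ON city, country_code
        FROM addresses;
    ANALYZE addresses;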
And most of the time, the planner
is great and really can work
things out nicely.
But by having this limit optimization,
it can flip plans that
we're not expecting to flip.
And the reason I found it so powerful
is we're using it exactly
in the cases where we're most scared
of doing a big scan on production.
If that's what you're really scared
of doing, this is something
that can be quite...
Nikolay: Why do I not agree with this whole
"considered harmful" idea?
Because imagine we don't have an
index on user ID at all.
In this case, if...
In this case, created_at shows
bad performance because the
user became inactive last year -
we need to scan the whole year.
But at least we have some approach
to scan, right?
created_at adds some order to the scanning.
Without it, we only have a sequential
scan, right?
That's it.
So the problem is we don't have
a proper index.
Now, okay, it turns out we have
an index on user_id.
It happens to be helpful, but
what if some user has 50% of all
the rows - millions of records?
Michael: So bear with me.
This is why I'm suggesting instead
of running the query on production
by adding a limit 1 or something
to make it feel safe, instead
of adding limit 1, put just 1 word,
explain, in front of the
query, not explain analyze, not
explain analyze buffers, just
explain your query without the
limit and see what the query plan
is before you run it.
If you're scared of what it's going
to do in production, do that
instead of adding the limit.
And you'll see which indexes it's
using, whether you're missing one
or not.
Nikolay: But I cannot run it without the limit,
because I don't want my page to blow up
because of many, many rows returned.
Michael: Well, add limit 1 and
put explain in front just to check
what it's going to do.
Because that's what's really going
to check.
That's what's going
to actually keep you safe.
Nikolay: You will check
the planner behavior, right?
Michael: Yeah, you're checking
what you're actually scared of
instead of assuming you're going
to be safe.
It's like, I understand your point
that it's almost never going
to be a problem, but imagine if
a safety on a gun, for example,
almost always kept you safe.
It's like a safety feature that's
almost always good, but sometimes
it's going to actually shoot you
when you press that safety catch.
Not when you pull the trigger,
but when you press the catch.
Nikolay: Again, in my world, I
add limit 1 to limit things.
If it doesn't work, it means I
don't have an index.
If there is some index which would
help me when I don't have
a limit, I don't care because I do
need limit 1 anyway.
Michael: And maybe you'll hit some
like statement timeout or
you'll cancel your query quickly
or there was some other safety
mechanism in place that will stop
you from hurting yourself.
Nikolay: But still, if my query with limit
doesn't behave well, I look at
which index will help.
And I'm not considering the index on
user_id as helpful here, because
I don't want to deal with a lot
of records in memory when
I need only 15 records.
So that's why I jump straight to
multi-column indexes here.
And as we discussed...
Michael: I agree.
I agree that to optimize this query,
but we're not talking about
optimizing the query, we're talking
about should you use limit
as a safety feature when you don't
need it?
Nikolay: Why wouldn't I? Why might
I not?
Michael: Let's say you know there's
only going to be 1 row, should
you add limit 1.
Nikolay: How do you know it's only
1 row?
Unique index?
If you have unique index, this
problem won't happen.
Michael: Let's say a company could
set up multiple things, but
they generally set up 1.
Like, how many websites are you
going to have for each organization?
You don't have a unique index on
it, people can add a second, okay,
but almost everybody else...
Nikolay: For you, for example, right?
Michael: Yeah, email addresses, yeah.
It's probably going to be fewer
than 10, and maybe you're
not certain it's going to be 1,
but you're pretty sure it's going
to be 1.
Nikolay: But do I have up-to-date stats
or not?
Because the stats will
form an expectation for Postgres:
if it expects a low number of rows, it
would prefer...
Michael: I think I'm struggling
because it's a simplified example,
but if you've got lots of joins
going on there could be all sorts
of other like complexities.
Nikolay: Joins are tricky by the
way.
The same problem becomes much
more tricky, but I don't think
limit is the problem.
I think the problem is a lack of
proper indexes, and sometimes
you cannot create them.
Imagine user_id is present in 1
table but created_at is present
in a different table.
Michael: So you're saying you've
never sat in front of a database
where you don't know all of the
indexes in the system, you don't
know the state of stats and wanted
to run a query and not being
scared that it might take a long
time.
Nikolay: Well, when I do this,
I add a limit, and I understand that,
if I lack a proper index,
I might be dealing with a sequential
scan, or the wrong index being
scanned almost fully, right?
I understand that.
If you don't understand that, that's
strange.
Right?
Michael: Well, but then, that's
kind of my, like,
takeaway.
Instead of adding limit 1...
Nikolay: And hoping that I'm okay.
Let me give you an example I saw
before, of what we discussed today.
So we have full-text search, for
example.
And we have rare words and we have
very frequent words present
in the majority of records in TSVector.
Yeah, that's a nice example.
Right.
And then we have created_at or
ID, and we want to find all matching
documents and return only the 15 latest,
for example.
In this case, Postgres must
choose between ordering by the B-tree
index and then applying the filter -
the @@ operator, right, tsvector
@@ tsquery -
or it might decide, okay, I need
to extract
all matching documents and then order in
memory and then limit.
Same problem, completely same problem.
And I saw terrible cases when I
thought, wow, like, why do you
do this?
Why do you think this word is super
frequent?
So instead of using GIN, you decide
to use the B-tree for order by, limit
15, for example.
And then you end up scanning almost
everything, finding my 1
or 2 records only at the end of
the scan.
It's absolutely the same situation.
But the problem to me is not limit
there at all.
I put limit, I see the problem,
I understand.
Limit didn't help, I understand
why.
Because we need to combine both
things in 1 single index and
have only 1 single index scan.
Unfortunately, this is impossible
there, right, in this particular
case.
Because for that you need to use
a RUM index, and it may not be
available.
On RDS, for example, it's not
present.
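(A sketch of the full-text case, with a hypothetical documents table: the planner has to pick between the GIN index on the tsvector and the B-tree on created_at.)

    CREATE INDEX ON documents USING gin (tsv);
    CREATE INDEX ON documents (created_at);

    SELECT *
    FROM documents
    WHERE tsv @@ to_tsquery('english', 'some_word')
    ORDER BY created_at DESC
    LIMIT 15;
    -- For a frequent word, walking created_at backward and filtering is fine;
    -- for a rare word, that same plan can scan nearly the whole table
    -- before it finds 15 matches.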
Michael: Or like work out some
middle ground, take way more rows
than you want and then sort them,
that kind of thing.
Nikolay: Right, so I mean, we sit
on the edge here.
We might be lucky, but we are not.
And what else?
What can I say?
It's not a problem of limit, it's
a problem of lack of indexing.
Michael: Let's try and wrap up.
And I think I agree with you that
to solve the performance issue,
you don't get rid of the limit,
you add a good index or you get
your stats, like you work out your
stats correlation, maybe create
some statistics there.
Maybe you just need to run analyze
on that table or maybe globally,
like that.
There's lots of ways of solving
it so that that query doesn't
run slowly again in future.
I just wanted to publicize and
share that.
And by the way, I wanted to shout
out Franck Pachot, who got in
touch recently when I shared this
Twitter thread again, to say,
I think we agreed that it's
an optimization problem.
And yeah, he would completely be
agreeing with you right now.
I just found it really interesting
and quite counterintuitive.
Probably more
than you did - that,
just by adding limit, you could make
a query slower.
Nikolay: Right.
Michael: Even though that's the
sign of other issues.
So I wanted
Nikolay: to clarify that.
It's an interesting phenomenon,
for sure.
And adding a limit is not a performance
optimization tool.
The performance optimization tool is
order by plus limit and a proper
index, together.
Yes.
And adding the limit, I should not
expect quick execution from it alone.
I just form expectations for myself
and I can then say, okay,
this is not what I want and find
a better index, which is sometimes
not possible, for example, GIN
versus B-tree and so on.
But by the way, to be fair, with
GIN statistics already, Postgres
is quite often right, so it knows
when to use B-tree and quickly
find proper results, and it knows
when it's better to find everything
and order by memory.
So it's not that bad, but also
not ideal, not just single index
scan.
Michael: Yeah, cool.
Nikolay: Okay.
Michael: Thanks so much, Nikolay.
Take care.
Nikolay: Thank you.