1
00:00:00,060 --> 00:00:02,300
Nikolay: Hello, hello, this is
Postgres.FM.

2
00:00:02,500 --> 00:00:04,240
As usual, I don't remember episode
number.

3
00:00:04,240 --> 00:00:05,280
Actually, I don't care.

4
00:00:05,280 --> 00:00:06,420
I know 100 something.

5
00:00:06,500 --> 00:00:09,640
And my name is Nikolay, founder
of Postgres.AI.

6
00:00:09,800 --> 00:00:13,540
And as usual, my co-host is Michael,
pgMustard.

7
00:00:14,240 --> 00:00:15,040
Hi, Michael.

8
00:00:15,360 --> 00:00:16,280
Michael: Hello, Nikolay.

9
00:00:16,400 --> 00:00:17,220
How are you doing?

10
00:00:18,260 --> 00:00:19,120
Nikolay: I'm doing great.

11
00:00:19,120 --> 00:00:19,940
How are you?

12
00:00:20,280 --> 00:00:21,420
Michael: Yeah, good as well.

13
00:00:22,060 --> 00:00:27,160
Nikolay: I actually feel very good
after therapy session we had

14
00:00:27,160 --> 00:00:27,940
last week.

15
00:00:28,580 --> 00:00:29,560
Michael: Glad to hear it.

16
00:00:29,580 --> 00:00:32,180
Nikolay: Yeah, So I know it's your choice.

17
00:00:32,900 --> 00:00:35,580
I'm glad you chose a very practical
topic.

18
00:00:35,920 --> 00:00:37,160
Technical, practical.

19
00:00:37,540 --> 00:00:39,780
So yeah, what is it?

20
00:00:39,960 --> 00:00:41,980
Michael: Yeah, I chose index-only
scans.

21
00:00:41,980 --> 00:00:45,220
I was aiming for technical and
practical, but I also wanted something

22
00:00:45,280 --> 00:00:46,320
that's still beginner-friendly.

23
00:00:46,780 --> 00:00:50,880
Like, I think there's enough depth
we can go into with index-only

24
00:00:50,940 --> 00:00:51,260
scans.

25
00:00:51,260 --> 00:00:53,300
I'm hoping there's something in
here for everybody.

26
00:00:53,800 --> 00:00:59,280
But I definitely see some beginners
not fully understand index-only

27
00:00:59,480 --> 00:01:04,120
scans, how to get them, or even
knowing that they exist, really

28
00:01:04,120 --> 00:01:09,560
truly understanding their benefits,
and also the downsides when

29
00:01:09,560 --> 00:01:12,240
they go wrong, or when they aren't
so efficient.

30
00:01:12,320 --> 00:01:16,420
So yeah, lots to cover I think,
but we can start nice and gently.

31
00:01:16,820 --> 00:01:22,140
Nikolay: Yeah, and since this topic
is related to query optimization,

32
00:01:23,860 --> 00:01:27,520
it's your field, I would say, because
of pgMustard, so I will

33
00:01:27,520 --> 00:01:29,740
let you to be in the driver's seat.

34
00:01:32,120 --> 00:01:34,860
Michael: Yeah, well, I feel like
you are as much, probably more

35
00:01:34,860 --> 00:01:37,060
of an expert in this field than
me still.

36
00:01:37,360 --> 00:01:40,040
But yes, it's definitely something
I've had to look into a lot

37
00:01:40,040 --> 00:01:41,420
for the product I work on.

38
00:01:41,420 --> 00:01:44,520
And we've got a couple of tips
around this within the product.

39
00:01:44,540 --> 00:01:49,440
But I actually wanted to start
like, super basic, like,

40
00:01:49,440 --> 00:01:51,060
index-only scans are a type of scans.

41
00:01:51,060 --> 00:01:54,280
I think it's worth covering that
we've, we've got other scan

42
00:01:54,280 --> 00:01:58,100
types and Postgres gets to choose
between them in certain situations.

43
00:01:58,280 --> 00:02:01,080
We can cover when it gets
the choice of doing an index-only

44
00:02:01,080 --> 00:02:04,240
scan, but I wanted to start with
even the absolute basics.

45
00:02:04,240 --> 00:02:07,680
If, for example, we don't have
any indexes, the only scan option

46
00:02:07,680 --> 00:02:10,660
we have for a table is a sequential
scan.

47
00:02:10,920 --> 00:02:11,420
Yep.

48
00:02:12,200 --> 00:02:16,700
Going through the pages in order,
one at a time, and then if we

49
00:02:16,700 --> 00:02:17,060
have an index...

50
00:02:17,060 --> 00:02:19,960
Nikolay: Sequential or parallel sequential,
if we distinguish between

51
00:02:19,960 --> 00:02:20,700
the 2.

52
00:02:21,060 --> 00:02:22,640
Michael: You consider that a different
type?

53
00:02:22,640 --> 00:02:23,500
Yeah, interesting.

54
00:02:24,160 --> 00:02:26,020
Nikolay: Well, well, technically
it's different.

55
00:02:27,440 --> 00:02:30,400
The planner has to make a choice.

56
00:02:32,180 --> 00:02:33,640
Michael: True, true.

57
00:02:33,920 --> 00:02:37,160
Oh, actually, this came up, and
maybe this is too off-topic,

58
00:02:37,280 --> 00:02:41,200
but do you think parallelism should
be on by default, like with

59
00:02:41,200 --> 00:02:41,700
Postgres?

60
00:02:42,560 --> 00:02:43,440
Like, maximizing

61
00:02:43,440 --> 00:02:44,360
the work together?

62
00:02:45,020 --> 00:02:48,540
Nikolay: I think JIT should
be off by default.

63
00:02:48,840 --> 00:02:50,220
Michael: This is when it came up.

64
00:02:50,220 --> 00:02:50,720
Yeah.

65
00:02:50,740 --> 00:02:55,460
Nikolay: I got beaten very badly again and
it was in my home.

66
00:02:55,760 --> 00:02:58,400
I had a bad, bad, bad, bad, bad situation.

67
00:02:58,400 --> 00:03:00,060
I don't understand why JIT is on.

68
00:03:00,060 --> 00:03:01,620
I completely don't understand.

69
00:03:02,540 --> 00:03:05,340
It just hurts workloads and that's
it.

70
00:03:05,740 --> 00:03:11,080
So yeah, I think random_page_cost
should be closer to seq_page_cost.

71
00:03:11,200 --> 00:03:12,840
We will discuss this, right?

72
00:03:13,820 --> 00:03:17,340
random_page_cost adjustment, and
we discussed it a few times.

73
00:03:18,740 --> 00:03:25,640
And as for parallelism, I saw Jeremy
Schneider's opinion that

74
00:03:27,660 --> 00:03:33,000
maybe in some cases it's better
to switch it off to have more

75
00:03:33,000 --> 00:03:36,920
predictable behavior of query processing.

76
00:03:37,060 --> 00:03:38,760
Because if you...

77
00:03:39,120 --> 00:03:41,340
Well, there are many things like
that.

78
00:03:41,820 --> 00:03:46,280
If you have parallelism turned
on, which is on by default, 2

79
00:03:46,280 --> 00:03:48,780
additional workers usually.

80
00:03:49,660 --> 00:03:53,480
But there are several such things
and definitely it's our topic

81
00:03:53,480 --> 00:03:53,980
today.

82
00:03:54,640 --> 00:03:59,820
Index-only scans is an unpredictable
thing and heap-only tuple updates

83
00:03:59,820 --> 00:04:01,200
is an unpredictable thing.

84
00:04:02,140 --> 00:04:06,980
Quite recently I saw talking with
customers, I saw opinion, which

85
00:04:06,980 --> 00:04:10,420
is probably a reasonable opinion
from backend engineers,

86
00:04:10,520 --> 00:04:13,220
or like, oh, this is something
quite unpredictable.

87
00:04:13,740 --> 00:04:17,180
We have many, many, many installations,
various situations, different

88
00:04:17,220 --> 00:04:18,900
Postgres infrastructures.

89
00:04:20,080 --> 00:04:25,600
So if we think we cannot, if it
was only 1 cluster, we should

90
00:04:25,600 --> 00:04:28,040
think, oh, this is like our pet.

91
00:04:28,080 --> 00:04:32,380
We will take care of behavior closely,
watch it closely.

92
00:04:32,500 --> 00:04:36,900
But if you have thousands of setups,
you cannot rely on such

93
00:04:36,900 --> 00:04:37,400
things.

94
00:04:37,660 --> 00:04:41,840
And maybe you will prefer, like
let's raise this question

95
00:04:41,840 --> 00:04:45,260
and in the end maybe we'll try
to answer ourselves.

96
00:04:45,920 --> 00:04:50,140
So to have predictable behavior,
maybe you should switch off

97
00:04:50,600 --> 00:04:57,080
parallelism, stop relying on heap-only
tuple updates, HOT updates,

98
00:04:57,740 --> 00:05:01,940
and also stop relying on index-only
scans because they can degrade

99
00:05:02,040 --> 00:05:03,980
to regular index scans, right?

100
00:05:05,280 --> 00:05:07,380
Michael: Yeah, or even slightly
worse, I think.

101
00:05:07,440 --> 00:05:09,100
So yeah, let's go back to basics.

102
00:05:09,520 --> 00:05:11,520
Nikolay: Basics also, question.

103
00:05:11,820 --> 00:05:15,780
Do you feel the name is kind of
off?

104
00:05:16,560 --> 00:05:23,480
If we had name choices made right
now, maybe index-only scans

105
00:05:23,480 --> 00:05:27,540
would be named as index scans,
and index scans would be named

106
00:05:27,540 --> 00:05:32,040
like, I don't know, like index
heap scans or something, because

107
00:05:32,040 --> 00:05:32,940
it's hard to...

108
00:05:32,940 --> 00:05:36,560
You need to explain it, And if
people don't work with Postgres

109
00:05:36,820 --> 00:05:42,260
planner every day, they forget
and they expect index scan is

110
00:05:42,260 --> 00:05:45,760
what deals with only indexes, but
then index-only scan, what

111
00:05:45,760 --> 00:05:46,500
is it?

112
00:05:46,800 --> 00:05:50,640
Michael: Yeah, that's a really
good transition into what these

113
00:05:50,640 --> 00:05:51,000
are.

114
00:05:51,000 --> 00:05:56,300
Yeah, so index scans came first,
and I think that's probably

115
00:05:56,380 --> 00:05:58,140
like why we're stuck with that
name.

116
00:05:58,140 --> 00:06:01,560
And in Postgres... in a lot of other
database management systems,

117
00:06:02,080 --> 00:06:06,100
not including Postgres, you have
indexed organized tables.

118
00:06:06,500 --> 00:06:08,100
But in Postgres we don't have that.

119
00:06:08,100 --> 00:06:12,220
We have heap and indexes are what
is often called secondary.

120
00:06:12,560 --> 00:06:16,360
So to look something up via an
index, we'll check first in the

121
00:06:16,360 --> 00:06:20,380
index and then that will point
us to the right place in the heap

122
00:06:20,380 --> 00:06:21,640
to find that row.

123
00:06:21,820 --> 00:06:22,840
Nikolay: Probably right.

124
00:06:24,060 --> 00:06:25,020
Michael: Yeah, yes.

125
00:06:25,920 --> 00:06:30,260
Nikolay: At least pages are right, but offset
might be wrong, right?

126
00:06:30,660 --> 00:06:32,560
Michael: Yeah, well, back to Heap
again.

127
00:06:32,560 --> 00:06:34,420
Let's keep it simple to start with.

128
00:06:34,540 --> 00:06:37,940
Nikolay: For simple, yes, for index
points, but it doesn't know

129
00:06:37,940 --> 00:06:40,880
if this tuple is alive or dead.

130
00:06:42,660 --> 00:06:44,540
This is why we need to check heap.

131
00:06:45,360 --> 00:06:49,720
Michael: Or even, I guess not even
like alive or dead, but visible,

132
00:06:49,840 --> 00:06:53,760
like it might be alive to new transactions.

133
00:06:53,940 --> 00:06:55,780
Nikolay: Right, visible to current
transactions.

134
00:06:56,000 --> 00:06:56,900
Michael: Visible, exactly.

135
00:06:57,060 --> 00:06:57,540
Cool.

136
00:06:57,540 --> 00:07:02,700
So that becomes important in a
moment when, in fact, until, you

137
00:07:02,700 --> 00:07:06,420
know, when index-only scans were
added to Postgres, I found out

138
00:07:06,420 --> 00:07:07,060
it was...

139
00:07:07,280 --> 00:07:08,680
Nikolay: It was not very long.

140
00:07:08,680 --> 00:07:09,180
Michael: 9.2.

141
00:07:10,240 --> 00:07:10,640
Yeah.

142
00:07:10,640 --> 00:07:13,260
So yeah, not that long, like in
the history

143
00:07:13,260 --> 00:07:13,520
of Postgres.

144
00:07:13,520 --> 00:07:16,800
Nikolay: I remember living without them
very long, of course.

145
00:07:16,800 --> 00:07:21,920
And this mantra, like, indexes
don't store visibility information,

146
00:07:22,920 --> 00:07:26,060
I first heard 20 years ago, and
it was a rule.

147
00:07:27,180 --> 00:07:30,900
Index scan must consult Heap to
understand visibility.

148
00:07:31,620 --> 00:07:32,380
Michael: Yes.

149
00:07:32,700 --> 00:07:37,120
So, and looking back, in fact,
the Wiki includes, the Postgres

150
00:07:37,120 --> 00:07:40,020
Wiki includes a really good entry
on index-only scans and explains

151
00:07:40,120 --> 00:07:45,140
that the majority of the work to
add index-only scans was work

152
00:07:45,440 --> 00:07:48,980
on the visibility map to make it
crash safe.

153
00:07:49,200 --> 00:07:52,200
So that, I found that fascinating
looking into it for this episode.

154
00:07:52,200 --> 00:07:54,340
I didn't know that until yesterday.

155
00:07:55,080 --> 00:07:56,420
So yeah, super fascinating.

156
00:07:56,460 --> 00:07:58,740
So the index-only scan, probably
should actually explain what

157
00:07:58,740 --> 00:07:59,860
it is and does.

158
00:08:00,300 --> 00:08:05,580
If the data that we need for the
query is already in the index,

159
00:08:05,660 --> 00:08:09,920
so very typical for a B-tree index,
not for all index types,

160
00:08:09,920 --> 00:08:13,080
but in a B-tree index, we have
the data that we're indexing in

161
00:08:13,080 --> 00:08:13,700
the index.

162
00:08:13,980 --> 00:08:17,500
If that's all we need for the query
we're running...

163
00:08:17,500 --> 00:08:20,440
Nikolay: If we don't use
SELECT *, if we

164
00:08:20,440 --> 00:08:24,520
only select, for example, columns
which are in the index, this

165
00:08:24,520 --> 00:08:26,060
is conditional, right?

166
00:08:26,640 --> 00:08:27,680
Michael: Yes, exactly.

167
00:08:27,780 --> 00:08:32,240
So if the columns we're selecting
are in the index, or for example,

168
00:08:32,240 --> 00:08:36,100
if we're doing like an aggregation,
like COUNT *, we're not

169
00:08:36,100 --> 00:08:39,160
really selecting a column, but
we... yeah.

170
00:08:39,160 --> 00:08:41,340
So yeah, there are like a couple
of cases.

171
00:08:42,120 --> 00:08:42,940
Yes, exactly.

172
00:08:44,080 --> 00:08:44,640
And yeah.

173
00:08:44,640 --> 00:08:48,680
And if the index type supports
it, like B-trees do, then we can

174
00:08:48,680 --> 00:08:50,240
get an index-only scan.

175
00:08:50,280 --> 00:08:51,840
So yeah, really cool feature.

176
00:08:54,720 --> 00:09:00,660
Nikolay: This is like, okay, we
deal only with values which are

177
00:09:00,660 --> 00:09:03,400
present in index keys.

178
00:09:03,420 --> 00:09:05,500
Index keys maybe can be called.

179
00:09:05,500 --> 00:09:08,180
Maybe not, maybe it's a bad idea
to call that.

180
00:09:08,740 --> 00:09:15,000
But yeah, the planner might choose,
may choose or may not choose

181
00:09:15,180 --> 00:09:16,060
index-only scan.

182
00:09:16,060 --> 00:09:21,760
It still may think index scan is
better because the second component,

183
00:09:22,200 --> 00:09:25,460
like I agree with everything you
said so far, but then you say

184
00:09:26,060 --> 00:09:27,480
we don't consult the heap.

185
00:09:27,500 --> 00:09:32,460
Well, in general case, index-only
scan consults the heap because

186
00:09:33,820 --> 00:09:36,520
we don't have visibility information
still, right?

187
00:09:37,080 --> 00:09:40,540
But in some cases, it skips it.

188
00:09:41,760 --> 00:09:43,740
Michael: Do you count the visibility
map as the

189
00:09:43,740 --> 00:09:44,060
heap?

190
00:09:44,060 --> 00:09:44,560
Nikolay: Yes.

191
00:09:45,140 --> 00:09:46,300
No, no, no, no.

192
00:09:46,300 --> 00:09:52,120
No, I mean, we check the visibility
map and see what bit it has in

193
00:09:52,120 --> 00:09:57,420
terms of all visible bit for the
page and the heap we need.

194
00:10:00,040 --> 00:10:06,180
If that page is marked as all visible,
this is a win.

195
00:10:06,260 --> 00:10:07,380
But it's not guaranteed.

196
00:10:07,960 --> 00:10:10,080
It may not be marked as all visible.

197
00:10:10,080 --> 00:10:13,520
In this case, it's kind of degradation
to the behavior of an index scan

198
00:10:13,520 --> 00:10:18,640
for this particular value reference
to the tuple.

199
00:10:18,660 --> 00:10:28,080
And we need to go to the heap and first
see if this tuple is visible

200
00:10:28,080 --> 00:10:32,780
to our session, and secondly, probably
we will jump inside page

201
00:10:32,800 --> 00:10:38,160
to proper position because of Heap-Only Tuple
chain, right?

202
00:10:39,280 --> 00:10:40,200
But yeah, that's it.

203
00:10:40,200 --> 00:10:45,100
So I mean, everything is right,
but it's not guaranteed that

204
00:10:45,100 --> 00:10:47,000
we won't consult heap.

205
00:10:47,860 --> 00:10:48,560
Michael: Yeah, good.

206
00:10:48,560 --> 00:10:49,360
Yes, exactly.

207
00:10:49,660 --> 00:10:50,420
Good point.

208
00:10:50,900 --> 00:10:56,260
So heap, you'll see this in EXPLAIN
plans as heap fetches, if

209
00:10:56,260 --> 00:10:57,760
you're running EXPLAIN ANALYZE.

210
00:10:58,620 --> 00:11:02,080
Nikolay: Even if BUFFERS is not
used in EXPLAIN (ANALYZE, BUFFERS).

211
00:11:02,100 --> 00:11:02,600
Michael: Yes.

212
00:11:02,620 --> 00:11:06,400
Nikolay: Even if you skip BUFFERS, this
kind of buffers you still get.

213
00:11:09,180 --> 00:11:12,600
Michael: In fact, imagine if you
didn't have heap fetches and

214
00:11:12,600 --> 00:11:16,560
you only had buffers, you'd have
to be running it twice to spot

215
00:11:16,560 --> 00:11:17,580
these, like, once.

216
00:11:18,660 --> 00:11:22,280
Anyway, so, it's I really like
that they call this out explicitly.

217
00:11:22,740 --> 00:11:23,720
It's really helpful.

218
00:11:24,220 --> 00:11:29,620
And as you mentioned, like, let's
say we're scanning many, many

219
00:11:29,620 --> 00:11:31,460
rows for our index-only scan.

220
00:11:31,460 --> 00:11:36,100
So, we're doing a range scan across
maybe returning 100 rows

221
00:11:36,100 --> 00:11:36,840
or something.

222
00:11:37,240 --> 00:11:39,620
And we only have to do 1 or 2 heap
fetches.

223
00:11:40,240 --> 00:11:41,920
That can still be a huge win.

224
00:11:41,920 --> 00:11:45,560
It's only if the proportion grows
to a decent.

225
00:11:45,660 --> 00:11:47,800
So, yeah, 2 big benefits.

226
00:11:47,800 --> 00:11:51,820
Let's go back to why even bother
with all that work to make the

227
00:11:51,820 --> 00:11:53,100
visibility map crash safe?

228
00:11:53,100 --> 00:11:56,700
Why even go to the trouble of doing
this?

229
00:11:56,780 --> 00:12:02,800
It's not only the benefit of not
having to do those reads on

230
00:12:02,800 --> 00:12:06,100
the heap, but also I think data
locality.

231
00:12:06,600 --> 00:12:11,320
Like if we're doing a range scan,
the hundreds values on the

232
00:12:11,320 --> 00:12:15,720
index are likely in a much smaller
range of index pages than

233
00:12:15,720 --> 00:12:17,320
they would be heap pages.

234
00:12:17,320 --> 00:12:20,440
I mean, if they were all inserted
in order and there have been

235
00:12:20,440 --> 00:12:24,400
very few updates, maybe they are,
of course, only a relatively

236
00:12:24,400 --> 00:12:26,260
small number of pages, but they
could be anywhere.

237
00:12:26,260 --> 00:12:28,220
It could be really random reads.

238
00:12:28,460 --> 00:12:31,000
So I think historically, especially
if you consider when this

239
00:12:31,000 --> 00:12:36,060
was added, doing those random reads
on the heap could be quite

240
00:12:36,060 --> 00:12:39,440
a lot more expensive than doing
the reads.

241
00:12:40,160 --> 00:12:45,400
Nikolay: Right, and if there is
no data locality, each tuple

242
00:12:45,660 --> 00:12:48,680
sits in its own page, basically,
buffer, and we have a lot of

243
00:12:48,680 --> 00:12:53,540
buffer operations additionally,
just to check visibility, basically,

244
00:12:53,860 --> 00:12:54,700
in this case.

245
00:12:54,960 --> 00:12:59,780
If we have all values already in
index, there's a lot of buffer

246
00:13:00,060 --> 00:13:01,660
operations just to check visibility.

247
00:13:02,960 --> 00:13:09,180
It feels super inefficient and
index-only scans is just an optimization

248
00:13:09,360 --> 00:13:10,380
of index scans.

249
00:13:12,440 --> 00:13:16,600
But they'd be super good if visibility...

250
00:13:16,740 --> 00:13:20,140
Like If you have, for example,
if you just loaded your data into

251
00:13:20,140 --> 00:13:26,980
a table, run a vacuum once on the
table, and don't do any writes

252
00:13:26,980 --> 00:13:28,980
anymore, this is a perfect situation.

253
00:13:29,160 --> 00:13:32,900
Heap fetches will be 0, the best
case for index-only scans.

254
00:13:33,340 --> 00:13:36,780
But in the worst situation, you'll
have the maximum number of

255
00:13:36,860 --> 00:13:40,800
heap fetches, and this should be
maybe the same number of buffer

256
00:13:40,800 --> 00:13:42,340
operations as index scans.

257
00:13:43,500 --> 00:13:46,560
Michael: It, plus, you've had
to check the visibility map,

258
00:13:46,560 --> 00:13:50,220
which admittedly is not many reads,
like a visibility map is

259
00:13:50,220 --> 00:13:52,900
tiny, but you don't have to check
that if you're doing an index

260
00:13:52,900 --> 00:13:53,220
scan.

261
00:13:53,220 --> 00:13:53,720
Like

262
00:13:53,760 --> 00:13:53,800
it's...

263
00:13:53,800 --> 00:13:58,620
Nikolay: Oh, so you think index-only scan
in this extreme case is worse

264
00:13:58,620 --> 00:13:59,680
than index scan?

265
00:14:00,840 --> 00:14:01,740
Michael: Tiny bit.

266
00:14:01,880 --> 00:14:06,500
Nikolay: I know planner might decide,
I think based on the state

267
00:14:06,500 --> 00:14:12,720
of visibility map, in some cases
it might decide to choose index

268
00:14:12,720 --> 00:14:17,940
scan, even the main condition is
met.

269
00:14:19,060 --> 00:14:20,380
Michael: I don't know that it does.

270
00:14:20,380 --> 00:14:21,820
Like in theory, that would make
sense.

271
00:14:21,820 --> 00:14:23,940
I don't know for sure that it does.

272
00:14:24,380 --> 00:14:25,400
Nikolay: It's worth checking.

273
00:14:25,440 --> 00:14:27,240
It's a simple experiment.

274
00:14:27,840 --> 00:14:29,940
Worth checking and looking at buffers.

275
00:14:30,140 --> 00:14:30,640
Yeah.

276
00:14:31,300 --> 00:14:37,220
But for simplicity, number 1 thing
to memorize, index-only scan.

277
00:14:38,000 --> 00:14:41,980
If you have autovacuum tuned, index-only
scan is the best what

278
00:14:41,980 --> 00:14:46,000
you can get from Postgres, for
example, for aggregate

279
00:14:46,000 --> 00:14:48,900
SELECT COUNT, things like that.

280
00:14:48,900 --> 00:14:52,720
If you get index-only scan, even
if it's like a lot, a lot of

281
00:14:52,720 --> 00:14:54,980
rows, this is the best.

282
00:14:55,680 --> 00:14:57,320
Michael: Well, I completely agree.

283
00:14:57,400 --> 00:14:59,700
Nikolay: Index-only scan and heap
fetches are 0.

284
00:14:59,700 --> 00:15:01,500
This is the ideal situation.

285
00:15:02,020 --> 00:15:06,260
Michael: And 1 more condition,
Rows removed by filter 0 as well.

286
00:15:07,300 --> 00:15:12,340
So I see sometimes people get excited
because they see an index

287
00:15:12,340 --> 00:15:15,680
scanned or an index-only scan and
then they stop because they

288
00:15:15,680 --> 00:15:20,200
think I've got an index-only scan,
but it's returning 100 rows

289
00:15:20,200 --> 00:15:21,180
for every … So

290
00:15:21,180 --> 00:15:22,540
Nikolay: you get 1 row, right?

291
00:15:22,540 --> 00:15:23,460
Michael: Yeah, exactly.

292
00:15:24,780 --> 00:15:29,180
So there is this 1 other gotcha
that sometimes catches beginners

293
00:15:29,180 --> 00:15:29,380
out.

294
00:15:29,380 --> 00:15:31,020
You would never be caught out by
that.

295
00:15:31,020 --> 00:15:34,080
But some people see index-only
scan and think, great, I'm done.

296
00:15:34,080 --> 00:15:36,000
And actually, it's still hideously
inefficient.

297
00:15:36,100 --> 00:15:39,560
So yeah, they're the 2 things we
look out for.

298
00:15:39,960 --> 00:15:40,440
Nikolay: Okay.

299
00:15:40,440 --> 00:15:47,220
So yes, in phrasing this, you really
need those rows that index-only

300
00:15:47,300 --> 00:15:49,660
scan brought to you, right?

301
00:15:49,900 --> 00:15:51,000
You really need them.

302
00:15:51,580 --> 00:15:56,820
Because if you see 99% was filtered
out, it means those rows

303
00:15:57,620 --> 00:16:01,320
were not needed in the first place,
and you just need to find

304
00:16:01,320 --> 00:16:02,620
better index, for example.

305
00:16:03,080 --> 00:16:03,580
Michael: Exactly.

306
00:16:04,160 --> 00:16:05,640
Nikolay: Also, let's touch this
thing.

307
00:16:05,640 --> 00:16:10,020
You create an index, you run vacuum,
no writes yet.

308
00:16:10,440 --> 00:16:15,560
You run SELECT COUNT(*) from
this table, or SELECT column,

309
00:16:15,940 --> 00:16:20,420
indexed column, like ID, for example,
which is technically the

310
00:16:20,420 --> 00:16:22,080
same because it cannot be null.

311
00:16:23,000 --> 00:16:26,980
We know count, won't count, rows,
which are null.

312
00:16:26,980 --> 00:16:29,120
And * cannot be null, even if...

313
00:16:29,440 --> 00:16:30,320
It's complicated.

314
00:16:30,820 --> 00:16:34,500
Maybe we should save it for another
episode about counts, right?

315
00:16:34,700 --> 00:16:39,220
But you run it, but you still see
sequential scan in the plan.

316
00:16:39,400 --> 00:16:40,920
This is happening all the time,
right?

317
00:16:40,920 --> 00:16:41,260
Michael: That's a

318
00:16:41,260 --> 00:16:41,760
good point.

319
00:16:41,760 --> 00:16:42,620
Yeah, good point.

320
00:16:43,580 --> 00:16:45,260
Or parallel sequential scan.

321
00:16:45,320 --> 00:16:46,700
Nikolay: Or parallel sequential
scan.

322
00:16:46,700 --> 00:16:47,820
And you think, why?

323
00:16:48,060 --> 00:16:50,020
Why, like, Index is better, right?

324
00:16:50,020 --> 00:16:52,060
Because index-only scan is better.

325
00:16:53,000 --> 00:16:53,800
I know it.

326
00:16:53,800 --> 00:16:58,840
It's much fewer buffer hits or
reads, or both.

327
00:16:59,540 --> 00:17:00,040
Right?

328
00:17:00,280 --> 00:17:01,160
What to do?

329
00:17:05,740 --> 00:17:09,100
Michael: Yeah, so before we jump
straight to it, you can use,

330
00:17:09,100 --> 00:17:11,880
if you want to ask this kind of
question yourself, there's like

331
00:17:11,880 --> 00:17:15,420
some parameters that we can set
locally to help us diagnose these

332
00:17:15,420 --> 00:17:15,720
things.

333
00:17:15,720 --> 00:17:20,420
So we could use enable_seqscan
and set that to false briefly.

334
00:17:21,400 --> 00:17:25,020
And then hopefully that will encourage
an index-only scan.

335
00:17:25,020 --> 00:17:26,360
We can compare the costs.

336
00:17:26,840 --> 00:17:29,780
And once you compare the costs,
it should become clear.

337
00:17:30,060 --> 00:17:31,420
Nikolay: How to compare the cost?

338
00:17:31,800 --> 00:17:34,180
Disable scan and look?

339
00:17:34,280 --> 00:17:38,360
Michael: Run, EXPLAIN, ANALYZE,
SELECT counts, or maybe with

340
00:17:38,360 --> 00:17:38,860
BUFFERS.

341
00:17:39,060 --> 00:17:40,240
But you don't need it.

342
00:17:40,240 --> 00:17:41,980
In fact, you don't even need ANALYZE.

343
00:17:42,520 --> 00:17:45,780
Nikolay: Yeah, yeah, just because
we want to compare planning

344
00:17:45,780 --> 00:17:46,280
cost.

345
00:17:46,460 --> 00:17:51,220
And if sequential scan was chosen,
it means the cost was lower

346
00:17:51,220 --> 00:17:52,900
than for index-only scan.

347
00:17:53,240 --> 00:17:54,180
Michael: Yes, exactly.

348
00:17:54,440 --> 00:17:57,440
And then I'm guessing where you're
going with this is probably

349
00:17:57,440 --> 00:17:58,640
random_page_cost.

350
00:17:58,740 --> 00:17:59,240
Nikolay: Right.

351
00:17:59,500 --> 00:18:02,540
This happens if you have default
random_page_cost.

352
00:18:02,540 --> 00:18:06,720
And it doesn't make sense if you
have fast disks, random accesses

353
00:18:07,840 --> 00:18:12,600
close to sequential access on SSD
and NVMe, right?

354
00:18:12,800 --> 00:18:14,360
In memory, it's also so.

355
00:18:14,540 --> 00:18:18,220
You might have very slow magnetic
disks, but the database is

356
00:18:18,220 --> 00:18:20,580
well cached and you rely on it
somehow, right?

357
00:18:20,580 --> 00:18:21,920
It also can happen still.

358
00:18:22,720 --> 00:18:26,980
In this case, random_page_cost
is 4 compared to seq_page_cost

359
00:18:26,980 --> 00:18:33,480
is 1, and to work in a B-tree it's
random access compared to page

360
00:18:33,480 --> 00:18:37,320
by page reading of heap sequentially.

361
00:18:38,420 --> 00:18:43,340
So index scan, index-only scan,
the planner looks at

362
00:18:43,340 --> 00:18:43,840
random_page_cost.

363
00:18:44,380 --> 00:18:50,360
If it's 4 times more expensive,
it might prefer seq_page_cost.

364
00:18:50,380 --> 00:18:52,220
Michael: Especially if it can do
it in parallel.

365
00:18:52,280 --> 00:18:52,780
Yeah.

366
00:18:53,040 --> 00:18:53,900
Yes, yes.

367
00:18:53,960 --> 00:18:57,440
Nikolay: Well, index scan and index
only scan also can be parallel.

368
00:18:57,440 --> 00:18:57,800
Michael: True.

369
00:18:57,800 --> 00:18:58,260
Good point.

370
00:18:58,260 --> 00:18:59,780
Nikolay: Like we have duplicated.

371
00:19:01,840 --> 00:19:03,140
Michael: So maybe that doesn't
matter.

372
00:19:03,140 --> 00:19:03,380
Yeah.

373
00:19:03,380 --> 00:19:03,900
Good point.

374
00:19:03,900 --> 00:19:07,340
Nikolay: Bitmap scan, index scan,
index-only scan, seq scan,

375
00:19:07,340 --> 00:19:10,460
and then parallelized versions
of them, right?

376
00:19:10,680 --> 00:19:14,780
Michael: Yeah, I skipped bitmap
scan to keep things simpler.

377
00:19:15,400 --> 00:19:15,780
But we

378
00:19:15,780 --> 00:19:18,060
maybe should do a whole
episode on that because it is

379
00:19:18,060 --> 00:19:19,060
interesting enough.

380
00:19:19,160 --> 00:19:19,660
Nikolay: Yeah.

381
00:19:20,540 --> 00:19:21,040
Yeah.

382
00:19:21,100 --> 00:19:25,700
And for index scan parallelization
cannot be, as I remember,

383
00:19:25,960 --> 00:19:30,360
it cannot be applied to navigating
inside B-tree, inside the

384
00:19:30,360 --> 00:19:32,580
tree, going from root to leaf.

385
00:19:32,900 --> 00:19:39,480
But it can be applied to fetching
leaf pages in tree, and also

386
00:19:39,480 --> 00:19:44,020
accessing heap, and also traversing
leaf nodes.

387
00:19:44,480 --> 00:19:51,440
So it has a bidirectional list for
all the leaf nodes, allowing

388
00:19:52,200 --> 00:19:56,960
not going up and down all the time,
not starting from root for

389
00:19:56,960 --> 00:19:58,400
each entry, right?

390
00:19:58,520 --> 00:20:00,040
We just need the first entry.

391
00:20:00,600 --> 00:20:03,940
And then I think it can be parallelized,
dealing with leaf nodes.

392
00:20:03,940 --> 00:20:07,220
This is what I remember, I might
be slightly wrong about internals

393
00:20:07,240 --> 00:20:07,740
here.

394
00:20:08,300 --> 00:20:13,120
But again, index-only scan is the
best if you need a lot of rows.

395
00:20:13,260 --> 00:20:18,240
For SELECT COUNT(*) you might
still see a sequential scan.

396
00:20:18,320 --> 00:20:23,440
It's a good sign you have not tuned
random_page_cost, which we

397
00:20:23,440 --> 00:20:30,920
recommend to everyone to do sooner
because later it will feel

398
00:20:30,920 --> 00:20:31,800
more risky.

399
00:20:32,520 --> 00:20:33,520
Michael: Scarier, yeah.

400
00:20:33,520 --> 00:20:37,240
Nikolay: Yeah, scarier because you
have a heavy workload and changing

401
00:20:37,240 --> 00:20:42,680
this might flip your plans in some
unpredictable manner.

402
00:20:42,720 --> 00:20:48,220
So yeah, it's a big task, But if
you do it early, live with it,

403
00:20:48,220 --> 00:20:53,500
it feels more reasonable than keeping
defaults.

404
00:20:56,320 --> 00:20:59,760
Did you have cases when you see
this, okay, cost for sequential

405
00:20:59,760 --> 00:21:03,740
scan is lower, that's why it's
chosen, But when you say enable

406
00:21:03,740 --> 00:21:08,240
seq scan to off, giving some huge
penalty to sequential operations

407
00:21:08,300 --> 00:21:12,780
in planner's mind, then you see
timing and buffer operations

408
00:21:13,420 --> 00:21:15,780
so much better for index-only scan.

409
00:21:15,780 --> 00:21:18,420
You think, oh wow, planner is so
wrong here.

410
00:21:18,560 --> 00:21:20,560
You had it, yeah, I have it all
the time.

411
00:21:21,100 --> 00:21:25,420
And this is a good sign we need
to change random_page_cost by

412
00:21:25,420 --> 00:21:27,340
default in all databases, basically.

413
00:21:28,020 --> 00:21:30,980
Michael: Yeah, I think we've discussed
this in previous episodes,

414
00:21:31,120 --> 00:21:35,080
but in the past I've seen, like,
conservatively, people go down

415
00:21:35,080 --> 00:21:36,520
to 2, but

416
00:21:37,200 --> 00:21:37,540
more aggressively.

417
00:21:37,540 --> 00:21:38,540
Nikolay: And you know the
reason, right?

418
00:21:38,680 --> 00:21:42,260
I mean, ah, not 2, I was
thinking, why default is still

419
00:21:42,260 --> 00:21:42,760
4?

420
00:21:43,780 --> 00:21:47,440
You know, like why is it still
4, the default?

421
00:21:47,680 --> 00:21:49,200
Michael: Yeah, I don't know.

422
00:21:49,200 --> 00:21:52,860
I've seen argument, or I've seen
people mention that they don't

423
00:21:53,080 --> 00:21:57,720
want, like people have old systems
and they don't want to change

424
00:21:57,720 --> 00:21:58,220
those.

425
00:22:01,560 --> 00:22:05,280
But yeah, I agree with you, but
that is the argument I've heard.

426
00:22:05,280 --> 00:22:07,940
Nikolay: Defaults are applied
to new systems. New systems are on SSD...

427
00:22:07,940 --> 00:22:12,320
Michael: I understand.
But that's what people say.

428
00:22:12,740 --> 00:22:18,180
And I actually think also, even
if they didn't, like, It's not

429
00:22:18,180 --> 00:22:21,420
about old systems versus new systems,
it's about SSDs versus

430
00:22:21,660 --> 00:22:22,660
spinning disks.

431
00:22:23,080 --> 00:22:24,720
It's about disk type, mostly.

432
00:22:25,440 --> 00:22:30,080
Most people aren't setting up new
systems on spinning disks.

433
00:22:30,400 --> 00:22:32,540
Like most people are setting up.

434
00:22:32,620 --> 00:22:36,140
In that case, why are we forcing
the majority to have to change

435
00:22:36,140 --> 00:22:39,260
their default instead of like default
should be for the majority?

436
00:22:39,620 --> 00:22:39,840
Yeah.

437
00:22:39,840 --> 00:22:43,100
Nikolay: it's second place after
JIT, or maybe default shared

438
00:22:43,100 --> 00:22:43,600
buffers.

439
00:22:43,860 --> 00:22:44,640
Third place.

440
00:22:44,760 --> 00:22:47,640
Third place goes t
random_page_costs.

441
00:22:47,640 --> 00:22:56,200
Let's have maybe some chart of
defaults we don't like, we would

442
00:22:56,200 --> 00:22:57,280
like to be changed.

443
00:22:58,080 --> 00:23:00,140
By the way, things are changing
slowly.

444
00:23:00,600 --> 00:23:05,040
Yesterday I learned that pg_dump
is not a backup tool anymore.

445
00:23:05,320 --> 00:23:07,460
It's off topic, but it's very similar.

446
00:23:08,000 --> 00:23:14,660
So many years of discussing, it's
like you're swimming against

447
00:23:14,660 --> 00:23:15,880
the current all the time.

448
00:23:16,060 --> 00:23:19,160
You say something, but you open
documentation and it says this,

449
00:23:19,160 --> 00:23:23,340
and you know every experienced
guy says this, but still...

450
00:23:24,060 --> 00:23:25,420
So defaults are similar.

451
00:23:27,440 --> 00:23:32,680
So Postgres until version 18, in
documentation stated, pg_dump

452
00:23:32,780 --> 00:23:34,100
is a backup tool.

453
00:23:35,060 --> 00:23:37,700
Like, first sentence about pg_dump.

454
00:23:38,560 --> 00:23:41,840
And finally, like 5 days ago, Peter
Eisentraut committed the

455
00:23:41,840 --> 00:23:42,340
change.

456
00:23:42,440 --> 00:23:45,740
pg_dump is a tool to export PostgreSQL
databases.

457
00:23:46,160 --> 00:23:46,660
Michael: Yay!

458
00:23:47,220 --> 00:23:48,040
Nikolay: Not backup.

459
00:23:49,140 --> 00:23:52,800
But this will change only in a
year, in PostgreSQL 18.

460
00:23:53,780 --> 00:23:55,760
But it's already a win, like so
good.

461
00:23:55,760 --> 00:23:58,720
I hope defaults also, like they
are slowly changing.

462
00:23:59,700 --> 00:24:02,300
Log checkpoint was on, work_mem.

463
00:24:02,520 --> 00:24:03,020
Yeah.

464
00:24:03,330 --> 00:24:03,830
Is

465
00:24:04,140 --> 00:24:04,360
it?

466
00:24:04,360 --> 00:24:07,480
Michael: hash_mem_multiplier has been, like
work_mem's been increasing very

467
00:24:07,480 --> 00:24:08,440
slowly over the years.

468
00:24:08,440 --> 00:24:11,820
hash_mem_multiplier doubled in like
not that long ago.

469
00:24:11,820 --> 00:24:12,680
Like there was...

470
00:24:13,740 --> 00:24:17,400
Nikolay: Autovacuum default throttling
10 times.

471
00:24:17,840 --> 00:24:19,740
But it was many years already ago.

472
00:24:19,760 --> 00:24:22,400
Yeah, slowly it's changing, it's
good.

473
00:24:22,800 --> 00:24:27,420
So random_page_cost, most people
say, decrease it.

474
00:24:27,900 --> 00:24:29,140
1.1 maybe.

475
00:24:29,240 --> 00:24:32,060
You said some guys decide to go
with 2.

476
00:24:32,220 --> 00:24:39,900
Well, I can say some guys decided
to put 1 there, but put 4 to

477
00:24:39,900 --> 00:24:40,940
seq_page_cost.

478
00:24:41,920 --> 00:24:42,420
Michael: What?

479
00:24:42,880 --> 00:24:43,380
Nikolay: Yes.

480
00:24:44,440 --> 00:24:44,940
Michael: Okay.

481
00:24:45,060 --> 00:24:48,820
So I've seen sensible people that
have done testing, I've seen

482
00:24:48,820 --> 00:24:51,520
set anywhere between 1.1 and 2.

483
00:24:51,580 --> 00:24:56,540
And I tended to go closer to 1.1
if the performance results.

484
00:24:56,600 --> 00:24:57,100
Yeah, I

485
00:24:57,100 --> 00:24:57,740
think that's really cool.

486
00:24:57,740 --> 00:25:00,420
Nikolay: Crunchy, after we did
the episode about something, They

487
00:25:00,420 --> 00:25:03,740
tested it and published some details.

488
00:25:04,120 --> 00:25:06,140
Maybe we'll find them a touch.

489
00:25:08,100 --> 00:25:13,360
Their benchmarks showed that 1.1
in their case, I think it's

490
00:25:13,360 --> 00:25:15,665
AWS and GCP or something, right?

491
00:25:15,665 --> 00:25:19,540
It's still In their case, they
found out that 1.1 is slightly

492
00:25:19,540 --> 00:25:23,540
better than 1.0, according to some
benchmarks.

493
00:25:23,760 --> 00:25:25,140
Not mine, I don't know.

494
00:25:26,060 --> 00:25:32,940
So since then, I also kind of put
1.1 before I tried to put

495
00:25:32,940 --> 00:25:34,160
1.0.

496
00:25:34,400 --> 00:25:35,420
So I don't know.

497
00:25:35,640 --> 00:25:39,100
1.1 is a good start, right?

498
00:25:39,580 --> 00:25:40,940
Enough maybe here.

499
00:25:41,120 --> 00:25:43,160
Michael: Let's go through a few
other recommendations.

500
00:25:43,380 --> 00:25:45,920
I think this is 1 recommendation
of something that you can do

501
00:25:45,920 --> 00:25:49,200
to encourage index-only scans or
recommendations that are around

502
00:25:49,200 --> 00:25:50,260
index-only scans.

503
00:25:50,280 --> 00:25:54,020
We did a whole episode, for example,
on over-indexing, and I

504
00:25:54,020 --> 00:25:55,240
think that's relevant here.

505
00:25:55,240 --> 00:25:57,440
Nikolay: Before we go to over-indexing,
1 second.

506
00:25:57,440 --> 00:26:00,600
It's super important to tune to
autovacuum because otherwise heap

507
00:26:00,600 --> 00:26:04,700
fetches will be super high and
you need to, you need to, autovacuum...

508
00:26:04,740 --> 00:26:08,660
autovacuum is not only about vacuuming
or collection of statistics

509
00:26:08,720 --> 00:26:11,820
or fighting with transaction and
MultiXact ID wraparound.

510
00:26:11,960 --> 00:26:16,120
It's also important because it
maintains visibility maps.

511
00:26:16,680 --> 00:26:17,080
Michael: Yes.

512
00:26:17,080 --> 00:26:20,280
And it's the only thing that maintains
visibility, like the only

513
00:26:20,280 --> 00:26:21,820
thing that can set those bits

514
00:26:21,820 --> 00:26:24,220
Nikolay: Directly, yes, yes,
directly.

515
00:26:24,340 --> 00:26:28,500
Michael: I think this is important
because any, let's say we've

516
00:26:28,500 --> 00:26:32,800
got a page in Postgres that has
20 tuples on it.

517
00:26:32,800 --> 00:26:36,220
Like really common to have like
a couple of dozen of tuples on

518
00:26:36,220 --> 00:26:36,900
a page.

519
00:26:36,980 --> 00:26:42,740
If any of those rows are changed,
the visibility of that row,

520
00:26:42,740 --> 00:26:43,820
like the bit gets-

521
00:26:43,820 --> 00:26:45,040
Nikolay: It's out of the game.

522
00:26:45,040 --> 00:26:45,720
Yeah, this page-

523
00:26:45,720 --> 00:26:48,160
Michael: Yeah, it goes from 1 to
0, right?

524
00:26:48,400 --> 00:26:51,940
It goes from all visible to not
all visible, or not guaranteed

525
00:26:51,980 --> 00:26:52,960
to be all visible.

526
00:26:53,800 --> 00:26:56,360
Nikolay: No, it's not all visible,
that's it.

527
00:26:56,680 --> 00:26:59,800
We don't know, but it already cannot
be considered all visible,

528
00:26:59,800 --> 00:27:00,540
that's it.

529
00:27:01,100 --> 00:27:04,740
Yeah, technically, these tuples
might be visible to everyone.

530
00:27:04,840 --> 00:27:07,060
Autovacuum just didn't mark it.

531
00:27:07,060 --> 00:27:08,200
Michael: Didn't get there yet.

532
00:27:08,240 --> 00:27:08,500
Nikolay: Yeah.

533
00:27:08,500 --> 00:27:11,680
So autovacuum needs to do
this as well.

534
00:27:11,680 --> 00:27:16,600
That's why frequent visits of tables
are so important.

535
00:27:17,220 --> 00:27:19,460
Not only because of deleting that
tuples.

536
00:27:20,280 --> 00:27:20,780
Michael: Yes.

537
00:27:20,800 --> 00:27:24,280
Now is, I guess it is possible
that you will never benefit from

538
00:27:24,280 --> 00:27:27,440
index-only scans if you have a table,
but maybe like a small table

539
00:27:27,440 --> 00:27:30,460
that's constant, that the rows
are constantly being updated.

540
00:27:30,520 --> 00:27:33,900
But if you, most of us have tables
where, or at least like the

541
00:27:33,900 --> 00:27:37,580
cases we're thinking about here,
you have like some data that's

542
00:27:37,580 --> 00:27:40,440
pretty old that doesn't change
often that we do still want to

543
00:27:40,440 --> 00:27:40,940
read.

544
00:27:41,040 --> 00:27:44,760
And like in those cases, as long
as you can, yeah, tuning

545
00:27:44,760 --> 00:27:49,340
autovacuum super valuable for maintaining
the performance of index-only

546
00:27:49,440 --> 00:27:49,940
scans.

547
00:27:52,820 --> 00:27:54,780
Is there anything else around that?

548
00:27:55,160 --> 00:27:59,060
I actually shared just recently
in my newsletter an old post

549
00:27:59,060 --> 00:27:59,760
by Tomas Vondra.

550
00:27:59,760 --> 00:28:01,720
I thought I revisited it.

551
00:28:01,720 --> 00:28:04,100
He's updated it for the latest
versions of Postgres.

552
00:28:04,280 --> 00:28:06,300
I'll share it again because I think
it's so good

553
00:28:06,820 --> 00:28:08,600
Nikolay: Oh yeah, yeah, I also saw it.
That's great.

554
00:28:08,600 --> 00:28:12,040
I actually expected this because
after they acquired the second

555
00:28:12,040 --> 00:28:15,300
quadrant, they broke that block
and some links didn't work.

556
00:28:15,300 --> 00:28:20,540
And I used archive.org to fetch
some old pages.

557
00:28:20,540 --> 00:28:21,220
It's great.

558
00:28:21,220 --> 00:28:22,700
These posts needed to be updated.

559
00:28:22,720 --> 00:28:25,580
I also think they should be part
of documentation, honestly.

560
00:28:26,200 --> 00:28:26,680
Michael: Nice.

561
00:28:26,680 --> 00:28:27,660
Well, high praise.

562
00:28:27,980 --> 00:28:28,480
Nikolay: Yeah.

563
00:28:28,940 --> 00:28:36,240
So, but documentation tends to
avoid good how-tos and so far,

564
00:28:36,340 --> 00:28:37,120
only basics.

565
00:28:37,960 --> 00:28:43,280
There are how-to notes in documentation,
but they are scattered

566
00:28:43,540 --> 00:28:47,820
among general documentation, reference
style, and so on.

567
00:28:48,920 --> 00:28:54,060
I wish documentation grew like
a huge how-to section, but it's

568
00:28:54,060 --> 00:28:55,020
a different story.

569
00:28:55,320 --> 00:29:00,040
But let's not lose this important
point in this context.

570
00:29:00,460 --> 00:29:05,980
So autovacuum maintains visibility
map, which has this all visible.

571
00:29:05,980 --> 00:29:08,300
It also has all frozen, but that's
a different story.

572
00:29:08,300 --> 00:29:11,180
All visible, 2 bits for each page.

573
00:29:11,720 --> 00:29:15,820
And if we have a lot of writes,
like many writes are happening,

574
00:29:16,880 --> 00:29:21,020
many pages are having 0 for all
visible bit.

575
00:29:22,400 --> 00:29:25,440
And what's important here to understand
that partitioning is

576
00:29:25,440 --> 00:29:26,140
so good.

577
00:29:27,940 --> 00:29:31,640
Like if you just consider unpartitioned
one terabyte table and

578
00:29:31,640 --> 00:29:37,540
partitioned one where old partitions,
like most of writes go to

579
00:29:38,000 --> 00:29:43,040
the latest or few latest partitions
and old partitions are like

580
00:29:43,040 --> 00:29:47,280
they can be, they can receive writes
but it happens rarely.

581
00:29:47,980 --> 00:29:48,400
Yeah.

582
00:29:48,400 --> 00:29:52,100
And vacuum aggressively fairly
frequently visits such pages.

583
00:29:52,100 --> 00:29:54,900
Oh, okay, some write happened in
old partition but I immediately

584
00:29:54,920 --> 00:29:58,820
mark this page all visible, and
it's only 1 page in this partition.

585
00:29:59,640 --> 00:30:04,540
That's why, like, archive, archive
data locality and archive

586
00:30:04,540 --> 00:30:10,240
kind partitions, old ones and only
new or the latest receives

587
00:30:10,240 --> 00:30:14,420
all inserts, some updates, they're
usually for fresh data.

588
00:30:14,720 --> 00:30:19,080
It means this, like old partitions
most of the time will be all

589
00:30:19,080 --> 00:30:20,620
visible for all their pages.

590
00:30:21,100 --> 00:30:26,920
This is so good if massive reads
are needed for all data.

591
00:30:27,700 --> 00:30:31,660
Without partitioning, any page
can suddenly receive a write,

592
00:30:31,860 --> 00:30:36,360
even if all rows in this page are
5 years old.

593
00:30:36,880 --> 00:30:39,640
And then suddenly, oh, there is
some place here.

594
00:30:39,960 --> 00:30:41,500
Let's insert a new tuple.

595
00:30:42,340 --> 00:30:44,640
And it’s out of...

596
00:30:44,680 --> 00:30:47,360
So everything is shuffled, right?

597
00:30:47,620 --> 00:30:51,420
New data and old data, they can
live in the same page.

598
00:30:52,280 --> 00:30:56,400
And for visibility map, it's kind
of a lot of work for autovacuum,

599
00:30:56,460 --> 00:31:01,320
and visibility map is most of the
time kind of in bad shape.

600
00:31:01,400 --> 00:31:05,500
So what I'm trying to say, partitioning
is needed not only because

601
00:31:05,500 --> 00:31:06,800
indexes become smaller.

602
00:31:07,120 --> 00:31:12,880
This reason is, over years I realized,
is one of the smallest reasons

603
00:31:12,880 --> 00:31:13,620
for me.

604
00:31:13,840 --> 00:31:17,620
Because B-tree height grows very
slow, you know.

605
00:31:18,640 --> 00:31:23,600
But this particular thing, this
may be number one reason, since

606
00:31:23,600 --> 00:31:25,080
recently I realized it.

607
00:31:25,080 --> 00:31:25,520
Michael: Wow,

608
00:31:25,520 --> 00:31:26,460
you think so?

609
00:31:26,720 --> 00:31:30,760
Nikolay: For partitioning, like,
we can say, okay, This partition

610
00:31:31,020 --> 00:31:33,280
is barely touched in terms of writes.

611
00:31:33,280 --> 00:31:37,220
We just read it because people
want to see archived data.

612
00:31:37,540 --> 00:31:43,480
But reading is well organized most
of the time.

613
00:31:43,780 --> 00:31:48,160
If it counts or something, It's
index-only scans with heap fetches

614
00:31:48,820 --> 00:31:50,040
very close to zero.

615
00:31:51,400 --> 00:31:54,500
Without partitioning, impossible.

616
00:31:56,380 --> 00:31:59,980
I mean, and less predictable also.

617
00:32:00,520 --> 00:32:03,280
Michael: Yeah, I mean, in the case
you're talking about where

618
00:32:03,280 --> 00:32:07,480
it's append mostly and it's to
the right of the...

619
00:32:07,800 --> 00:32:12,380
I've definitely seen cases where
you don't touch old data anyway,

620
00:32:12,380 --> 00:32:14,080
whether it's partitioned or not.

621
00:32:14,240 --> 00:32:17,980
But I definitely see this as yet
another benefit of partitioning.

622
00:32:18,180 --> 00:32:20,760
I'm surprised you said it's like,
you see it as the biggest,

623
00:32:20,760 --> 00:32:21,960
but that's really cool.

624
00:32:22,540 --> 00:32:27,340
Nikolay: Well, actually, I think
if we, for example, decide to

625
00:32:27,340 --> 00:32:30,720
create additional indexes to support
index-only scans, and we

626
00:32:30,720 --> 00:32:32,440
will talk about methods right now,
right?

627
00:32:32,440 --> 00:32:35,960
But in the index-write amplification,
I know we have a plan.

628
00:32:36,060 --> 00:32:39,620
But if you, for example, decide...
usually we keep the same sets

629
00:32:39,620 --> 00:32:41,380
of indexes on all partitions, right?

630
00:32:41,380 --> 00:32:44,940
But technically, you can create
an index for a particular partition.

631
00:32:45,300 --> 00:32:50,400
And if, for example, you created
it for archive-style partitions

632
00:32:50,660 --> 00:32:56,540
to support index-only scan, to
prioritize them for reads, and

633
00:32:56,540 --> 00:32:59,640
you maybe don't care about write
amplification, because again,

634
00:32:59,640 --> 00:33:04,040
these partitions barely receive
writes, and HOT updates also

635
00:33:04,040 --> 00:33:04,940
we don't care.

636
00:33:05,080 --> 00:33:06,080
We care about...

637
00:33:06,220 --> 00:33:09,520
So we might decide to create more
indexes for all partitions

638
00:33:10,680 --> 00:33:12,720
to benefit from...

639
00:33:12,720 --> 00:33:13,220
Michael: Yeah,

640
00:33:14,440 --> 00:33:16,420
I've not seen that but
it's a cool idea.

641
00:33:16,560 --> 00:33:18,920
Nikolay: It's just like, I don't
know, maybe it's a crazy idea,

642
00:33:18,920 --> 00:33:23,100
maybe we'll have some walls hit
if we implement it.

643
00:33:23,100 --> 00:33:25,680
But it seems reasonable, and for
fresh partitions, I never did

644
00:33:25,680 --> 00:33:26,180
it.

645
00:33:27,720 --> 00:33:30,800
For fresh partitions, you avoid
creating these indexes because

646
00:33:30,800 --> 00:33:35,260
you want HOT updates occurring.

647
00:33:36,280 --> 00:33:37,000
So, right?

648
00:33:37,100 --> 00:33:37,600
Yeah.

649
00:33:37,940 --> 00:33:40,340
Let's maybe dive into details a
little bit.

650
00:33:40,440 --> 00:33:43,380
Michael: I yeah, well, we could
point people towards the we did

651
00:33:43,380 --> 00:33:46,100
a whole episode on over indexing
and we did another 1 on HOT

652
00:33:46,100 --> 00:33:46,560
updates.

653
00:33:46,560 --> 00:33:49,080
So let's point people at those
because I think there's a couple

654
00:33:49,080 --> 00:33:53,800
more like definitely want to get
your thoughts on the difference

655
00:33:53,800 --> 00:33:59,140
between multi-column index and
include like when you use each

656
00:33:59,140 --> 00:33:59,640
or...

657
00:34:00,120 --> 00:34:05,180
Nikolay: when it was released again
like 11? 12? 10? 11? I don't

658
00:34:05,180 --> 00:34:05,680
remember.

659
00:34:06,280 --> 00:34:10,400
Some many years, already
many years ago, and I was

660
00:34:10,400 --> 00:34:12,940
thinking, is it only for unique
indexes?

661
00:34:13,780 --> 00:34:18,720
Honestly, I remember I had
Anastasia Lubennikova presented on

662
00:34:18,720 --> 00:34:21,100
Postgres TV details, some details.

663
00:34:21,710 --> 00:34:26,440
Yeah, or maybe it was an interview
because she participated in

664
00:34:26,580 --> 00:34:29,160
this include keyword covering indexes.

665
00:34:29,680 --> 00:34:32,140
So my understanding, we only need...

666
00:34:33,980 --> 00:34:36,280
It might be a size difference also,
right?

667
00:34:36,280 --> 00:34:42,540
Because if we want to put in a
new column to have multi-column

668
00:34:42,720 --> 00:34:47,040
index, 1 more column we just included,
it will participate in

669
00:34:47,040 --> 00:34:48,040
structure, right?

670
00:34:48,540 --> 00:34:54,320
But if we say include, it's just
additionally stored in leaf

671
00:34:54,320 --> 00:34:54,860
nodes, right?

672
00:34:54,860 --> 00:34:55,920
Michael: Leaf page, exactly.

673
00:34:56,040 --> 00:34:56,540
Nikolay: Yeah.

674
00:34:56,820 --> 00:35:02,720
And it means that kind of for to
achieve index-only scans, both

675
00:35:02,720 --> 00:35:07,720
cases work, but the first case
can be also used since it's in

676
00:35:07,720 --> 00:35:10,120
structure, it can be used to verify
uniqueness.

677
00:35:10,440 --> 00:35:12,320
Or maybe also constrain exclusion.

678
00:35:13,120 --> 00:35:16,100
Exclusion constraints, right?

679
00:35:16,100 --> 00:35:16,600
Maybe.

680
00:35:16,880 --> 00:35:18,740
For range types.

681
00:35:19,540 --> 00:35:21,920
It's a different story, but definitely
for uniqueness.

682
00:35:23,140 --> 00:35:29,260
I now realize that actually we
cannot put some data type, columns

683
00:35:29,260 --> 00:35:32,840
of some data type to multi-column
index, but we can put them

684
00:35:32,840 --> 00:35:33,400
to include.

685
00:35:33,400 --> 00:35:35,060
This is also a benefit of include.

686
00:35:35,060 --> 00:35:40,940
So unique constraint can, like,
we don't want to put index because

687
00:35:40,940 --> 00:35:44,360
it will break the logic of uniqueness,
right?

688
00:35:44,540 --> 00:35:48,780
But we can put additional column
to include and have index-only

689
00:35:48,840 --> 00:35:51,980
scan, not changing logic of unique
constraint.

690
00:35:52,580 --> 00:35:57,900
But also to include, we can put
data types which are not B-tree

691
00:35:57,900 --> 00:36:00,660
friendly, which are not supported
by B-tree.

692
00:36:01,160 --> 00:36:02,120
Michael: Like not...

693
00:36:02,280 --> 00:36:02,760
Nikolay: Arrays, for example, right?

694
00:36:02,760 --> 00:36:03,980
Michael: Yeah, yeah, yeah.

695
00:36:04,980 --> 00:36:07,040
Nikolay: Polygons or something, like points.

696
00:36:07,960 --> 00:36:10,240
Michael: Although you do need to
be careful with size, right?

697
00:36:10,920 --> 00:36:11,760
Nikolay: Of course, yes.

698
00:36:11,760 --> 00:36:12,380
Of course.

699
00:36:12,440 --> 00:36:13,300
Size matters.

700
00:36:13,580 --> 00:36:17,540
But imagine we have like point, 2 numbers,
right?

701
00:36:18,300 --> 00:36:22,000
So we can put it to include and
the index-only scans will happen.

702
00:36:22,280 --> 00:36:24,760
Michael: I've got 1 more thing
that I wanted to make sure we

703
00:36:24,760 --> 00:36:29,080
mentioned, which was something
practical and very beginner-friendly.

704
00:36:29,540 --> 00:36:33,120
In fact, probably the main thing
that I see stopping index-only

705
00:36:33,280 --> 00:36:38,220
scans in the real world, which
is people selecting more data

706
00:36:38,220 --> 00:36:39,120
than they need.

707
00:36:40,160 --> 00:36:44,940
And I know it's kind of obvious,
but it's really common in ORMs

708
00:36:45,100 --> 00:36:48,580
to have SELECT * be the default,
which is almost always going

709
00:36:48,580 --> 00:36:51,420
to be giving you more columns than
you actually need for that

710
00:36:51,420 --> 00:36:52,480
specific query.

711
00:36:52,700 --> 00:36:56,380
But I see people doing it, like,
people deleting features, for

712
00:36:56,380 --> 00:36:56,820
example.

713
00:36:56,820 --> 00:37:00,920
Like, changing how a page is structured
in an application.

714
00:37:01,360 --> 00:37:05,500
They no longer need all the data
from the queries returning,

715
00:37:05,500 --> 00:37:06,880
but don't change the query.

716
00:37:06,880 --> 00:37:11,040
There are so many ways it can creep
into people's code that they

717
00:37:11,040 --> 00:37:13,240
no longer need all the data they're
returning.

718
00:37:13,460 --> 00:37:17,640
So being careful on that front
and also just remembering that

719
00:37:17,640 --> 00:37:21,900
there are these optimizations if
you can be, even if the application

720
00:37:21,900 --> 00:37:25,920
does currently use all of the columns,
do we need to like what

721
00:37:25,920 --> 00:37:28,860
value is that adding at the moment?

722
00:37:28,860 --> 00:37:31,620
Can you make things much faster
by not including that information

723
00:37:31,640 --> 00:37:32,500
at that point.

724
00:37:32,840 --> 00:37:33,340
Nikolay: Right.

725
00:37:33,840 --> 00:37:34,220
Yes.

726
00:37:34,220 --> 00:37:37,540
And finally, maybe let's mention
also HOT updates.

727
00:37:37,540 --> 00:37:41,680
I already kind of mentioned that
indirectly, but directly, like

728
00:37:41,680 --> 00:37:48,620
if you have, if you need to put
extra, 1 more column to the index,

729
00:37:48,900 --> 00:37:49,740
either to...

730
00:37:49,740 --> 00:37:54,660
Doesn't matter if it's to extend
this multi-column list or to

731
00:37:54,660 --> 00:37:58,840
include, to have coverage for your
selects.

732
00:37:59,580 --> 00:38:02,720
You can lose the HOTness of some updates,
right?

733
00:38:02,980 --> 00:38:09,160
Because even if it's include, still,
if this column is updated

734
00:38:10,040 --> 00:38:12,280
frequently, these updates cannot
be HOT.

735
00:38:12,280 --> 00:38:13,740
They will be much slower.

736
00:38:13,740 --> 00:38:15,540
They will be regular updates.

737
00:38:16,560 --> 00:38:20,380
So it's better sometimes to avoid
this.

738
00:38:20,380 --> 00:38:23,860
So this trade-off here, you want
select only scans but you lose

739
00:38:23,860 --> 00:38:24,560
HOT updates.

740
00:38:26,480 --> 00:38:27,340
What's needed?

741
00:38:27,340 --> 00:38:29,480
Selects or writes, I mean updates.

742
00:38:30,140 --> 00:38:33,480
Michael: Yeah, I know this is very
off topic, but do you know

743
00:38:33,480 --> 00:38:36,940
heap-only tuple updates are no
longer heap-only in some cases?

744
00:38:37,540 --> 00:38:40,960
Like in BRIN, there's some changes
to BRIN indexes.

745
00:38:41,980 --> 00:38:43,000
Have you seen this?

746
00:38:44,060 --> 00:38:46,720
And now, Yeah, now it updates the
BRIN index.

747
00:38:46,720 --> 00:38:47,820
I thought that was funny.

748
00:38:48,540 --> 00:38:49,240
Nikolay: Yeah, yeah.

749
00:38:49,540 --> 00:38:52,700
Okay, so I think we've covered enough,
right?

750
00:38:52,940 --> 00:38:54,060
Michael: Yeah, 100%.

751
00:38:54,380 --> 00:38:55,220
Nikolay: For iOS.

752
00:38:56,540 --> 00:38:59,980
When I hear iOS, I think index-only
scans, right?

753
00:39:00,060 --> 00:39:01,020
Michael: That’s so funny.

754
00:39:01,120 --> 00:39:02,620
Are you on Android or iOS?

755
00:39:03,760 --> 00:39:05,240
Nikolay: I'm on iOS.

756
00:39:05,640 --> 00:39:06,340
My phone, I mean.

757
00:39:06,340 --> 00:39:07,360
Michael: Index-only scans.

758
00:39:07,360 --> 00:39:07,960
Cool, cool.

759
00:39:07,960 --> 00:39:08,460
Nikolay: Yeah.

760
00:39:09,280 --> 00:39:09,440
Yeah.

761
00:39:09,440 --> 00:39:09,940
OK.

762
00:39:10,320 --> 00:39:10,820
Good.

763
00:39:11,100 --> 00:39:12,080
See you next week.

764
00:39:12,240 --> 00:39:12,740
Bye.

765
00:39:12,820 --> 00:39:13,780
Michael: See you next week.

766
00:39:13,780 --> 00:39:14,030
Bye.