1
00:00:00,060 --> 00:00:04,200
Nikolay: Hello, hello, this is PostgresFM, podcast about Postgres,

2
00:00:04,320 --> 00:00:05,460
as you can guess.

3
00:00:05,860 --> 00:00:09,360
My name is Nikolay from Postgres.AI, and as usual, my co-host,

4
00:00:09,480 --> 00:00:12,320
not guest, co-host is Michael from pgMustard.

5
00:00:12,520 --> 00:00:13,300
Hi, Michael.

6
00:00:13,660 --> 00:00:14,560
Michael: Hello, Nikolay.

7
00:00:15,720 --> 00:00:19,060
Nikolay: So you chose a topic, and it is?

8
00:00:19,980 --> 00:00:22,920
Michael: Slow counts, or why is count slow and what can

9
00:00:22,920 --> 00:00:23,780
we do about it?

10
00:00:23,780 --> 00:00:25,240
Nikolay: How to make counts slow?

11
00:00:25,760 --> 00:00:26,680
You just count.

12
00:00:27,980 --> 00:00:29,520
So yeah, let's discuss.

13
00:00:29,640 --> 00:00:30,140
I don't know.

14
00:00:30,140 --> 00:00:36,660
For me it's a super old topic and I'm not sure I will use some

15
00:00:36,660 --> 00:00:37,860
fresh knowledge here.

16
00:00:38,040 --> 00:00:41,120
I have a lot of fresh knowledge about Postgres, and I share it

17
00:00:41,120 --> 00:00:41,620
constantly.

18
00:00:41,840 --> 00:00:45,120
But this is something which is super old.

19
00:00:46,220 --> 00:00:49,540
There are improvements we can discuss, like index-only scan.

20
00:00:49,740 --> 00:00:51,980
We talked about them recently a lot.

21
00:00:52,800 --> 00:00:57,600
But I guess my role will be to use my knowledge from, you know,

22
00:00:57,600 --> 00:00:59,420
like 15 years ago mostly.

23
00:00:59,760 --> 00:01:03,780
And just before we started this recording, I shared with you

24
00:01:03,820 --> 00:01:08,140
how I helped PostgREST a little bit to get rid of count, right?

25
00:01:08,300 --> 00:01:10,580
Because sometimes you don't need it.

26
00:01:10,580 --> 00:01:14,440
So I'm going to stay in that area 10 years ago.

27
00:01:14,480 --> 00:01:20,580
And if we have something new, I will just be putting more to-do

28
00:01:20,580 --> 00:01:24,140
items to my to-do list to explore and that's it.

29
00:01:24,140 --> 00:01:26,180
Maybe you know some things better than me.

30
00:01:26,320 --> 00:01:27,080
Let's see.

31
00:01:27,520 --> 00:01:30,980
Michael: Yeah, well, the PostgREST link is great because the resource

32
00:01:31,060 --> 00:01:35,080
I still like to point people to on this topic, the best thing

33
00:01:35,080 --> 00:01:38,800
I've ever found on it, is a blog post by the PostgREST author,

34
00:01:38,860 --> 00:01:41,180
Joe Nelson, on the Citus blog.

35
00:01:41,260 --> 00:01:44,060
So I'll link that up for people that I think explains it in a

36
00:01:44,060 --> 00:01:46,720
lot of depth, but in 2016.

37
00:01:47,380 --> 00:01:52,500
So quite a long time ago, 8 years at the point of recording.

38
00:01:53,680 --> 00:01:58,140
And I was kind of surprised how many things have changed, but

39
00:01:58,140 --> 00:02:00,000
also how little has changed in a way.

40
00:02:00,000 --> 00:02:01,200
Like there are optimizations.

41
00:02:01,640 --> 00:02:06,380
A lot of things that were kind of little gotchas back then aren't

42
00:02:06,380 --> 00:02:07,100
true anymore.

43
00:02:07,120 --> 00:02:10,460
There are a bunch of ways that it's faster now by default.

44
00:02:11,460 --> 00:02:17,500
But the essence of it is still true, like because of the way

45
00:02:17,500 --> 00:02:22,060
Postgres does MVCC, when we do an aggregation.

46
00:02:23,200 --> 00:02:24,220
Because it's row store.

47
00:02:24,560 --> 00:02:25,060
Yeah.

48
00:02:25,580 --> 00:02:29,260
Basically, because all the things that make it great as like

49
00:02:29,260 --> 00:02:32,640
a transactional database count against it.

50
00:02:32,640 --> 00:02:34,540
And, pun not intended.

51
00:02:35,140 --> 00:02:40,520
When it comes to aggregating across
a lot of data and count is,

52
00:02:40,520 --> 00:02:44,560
I think, at least in my experience,
that's the most common aggregation

53
00:02:44,760 --> 00:02:47,500
people want to do in like typical
use cases.

54
00:02:48,340 --> 00:02:49,840
Nikolay: Yeah, that's a good point.

55
00:02:50,020 --> 00:02:52,400
Since Postgres is a row store, it's
better.

56
00:02:52,430 --> 00:02:56,980
Imagine you have a table of users,
for example, as many SaaS

57
00:02:56,980 --> 00:03:00,560
systems or e-commerce systems do,
like customers, users.

58
00:03:01,220 --> 00:03:06,200
Since Postgres is mostly like OLTP,
row store, OLTP-focused database

59
00:03:06,200 --> 00:03:09,420
system, it stores a tuple, a row.

60
00:03:10,080 --> 00:03:14,620
If it's not too wide, it stores
it in 1 place, in 1 page, basically.

61
00:03:15,600 --> 00:03:19,900
As we discussed many times, a page
is 8 kilobytes by default

62
00:03:19,900 --> 00:03:26,580
in most cases, and very roughly,
if a row doesn't exceed 2 kilobytes,

63
00:03:26,580 --> 00:03:28,760
it will be on the same page, and
that's it.

64
00:03:29,840 --> 00:03:33,000
The mechanism called TOAST won't
be involved.

65
00:03:33,320 --> 00:03:37,540
If it exceeds it, it will be chunked
to pieces and stored in an

66
00:03:37,540 --> 00:03:40,020
additional table called the TOAST table.

67
00:03:40,600 --> 00:03:43,160
So imagine we have a users table.

68
00:03:43,320 --> 00:03:47,800
And then users can log in, so we
need to retrieve information

69
00:03:47,840 --> 00:03:48,820
about this user.

70
00:03:48,940 --> 00:03:52,000
And it's good because all information
about a single user is

71
00:03:52,000 --> 00:03:53,600
stored on the same page, right?

72
00:03:54,240 --> 00:03:55,240
If it's narrow, right?

73
00:03:55,240 --> 00:03:58,260
As I said, it means that it's quick
to retrieve.

74
00:03:58,260 --> 00:04:01,300
It's just 1 buffer hit or read.

75
00:04:03,480 --> 00:04:06,460
Michael: Yes, extremely efficient,
like if we wanted to show

76
00:04:06,460 --> 00:04:10,080
the profile of that user if we
wanted their name, their age,

77
00:04:10,080 --> 00:04:13,740
you know, loads of information about
them. A single page read to

78
00:04:13,740 --> 00:04:16,480
return that information maybe like
a couple once you look them

79
00:04:16,480 --> 00:04:20,180
up in the index, then the heap
page.

80
00:04:20,280 --> 00:04:25,120
Nikolay: So it's very I/O optimized
basically.

81
00:04:25,520 --> 00:04:25,640
Efficient.

82
00:04:25,640 --> 00:04:26,780
Efficient, exactly.

83
00:04:27,280 --> 00:04:34,800
But when it comes to counting or
calculating some sum or average,

84
00:04:35,280 --> 00:04:37,320
minimum, maximum, everything, aggregates.

85
00:04:38,740 --> 00:04:44,220
It's not good because usually aggregates
deal with only a single

86
00:04:44,220 --> 00:04:46,300
column or a few columns.

87
00:04:46,720 --> 00:04:49,780
And we don't need all the other
columns this table has.

88
00:04:49,900 --> 00:04:56,140
If we have a row store, as Postgres
is, in this case, it means that

89
00:04:56,140 --> 00:05:02,660
we need to do much more I/O, many
more buffer operations, hits

90
00:05:02,660 --> 00:05:08,040
and reads to get information to
calculate aggregates

91
00:05:08,080 --> 00:05:09,120
for a single column.

92
00:05:09,320 --> 00:05:11,140
For example, count them, right?

93
00:05:11,400 --> 00:05:15,860
And it means if you want, for example,
a very, very common

94
00:05:16,320 --> 00:05:23,820
ask in startups, in social media,
in SaaS, in e-commerce, let's

95
00:05:23,820 --> 00:05:28,120
just show how many new users we
have every day.

96
00:05:28,780 --> 00:05:30,980
We need to count daily, right?

97
00:05:31,240 --> 00:05:36,500
And this means, like, imagine if
we have only a few records in

98
00:05:36,500 --> 00:05:41,680
an 8-kilobyte page, a few user records,
a few user tuples, because

99
00:05:41,680 --> 00:05:42,780
we have many columns.

100
00:05:43,440 --> 00:05:50,640
It means 1 buffer operation will
allow us to count only like

101
00:05:50,640 --> 00:05:53,900
4 users, 6 users, and that's it.

102
00:05:53,940 --> 00:05:58,940
Compared to column store where
each column is stored in separate

103
00:05:58,940 --> 00:06:04,620
files, and if this file is about
ID, it's only ID.

104
00:06:04,820 --> 00:06:08,820
If it's about creation time, it's
only creation time, datetime.

105
00:06:09,260 --> 00:06:14,500
In this case, counting is much
more efficient because 1 page

106
00:06:14,500 --> 00:06:18,900
can bring us many more users, and
we can count them much better.

107
00:06:19,200 --> 00:06:24,300
This is quite straightforward and
very simple, a trivial comparison

108
00:06:24,660 --> 00:06:28,680
of column store and row store and
why in column store aggregates

109
00:06:29,220 --> 00:06:34,380
and counts specifically are much
faster than in row store, just

110
00:06:34,380 --> 00:06:35,780
because of I/O, right?
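
The page arithmetic here can be sketched as a toy model in Python; the row width, page fill, and row counts below are illustrative assumptions, not measured Postgres figures (real pages also carry headers and per-tuple overhead):

```python
# Toy model of why counting in a row store needs more I/O than in a
# column store. All sizes are illustrative assumptions.

PAGE_SIZE = 8192  # Postgres default page size in bytes

def pages_to_scan(n_rows: int, bytes_per_value: int) -> int:
    """Pages that must be read to touch one value from each of n_rows."""
    values_per_page = PAGE_SIZE // bytes_per_value
    return -(-n_rows // values_per_page)  # ceiling division

# Row store: the whole ~2 KB user row travels together, so a count
# that needs only one column still drags in every column.
print(pages_to_scan(1_000_000, 2000))  # 250000 pages

# Column store: an 8-byte id column is stored on its own.
print(pages_to_scan(1_000_000, 8))     # 977 pages
```

Same million rows, hundreds of times fewer pages when only one narrow column has to be read.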

111
00:06:37,700 --> 00:06:41,180
Michael: Yes, and that's before you
start to think, like, I think

112
00:06:41,180 --> 00:06:45,520
because column stores are used
for these aggregation and analytics

113
00:06:45,600 --> 00:06:49,540
use cases so much, they
tend to have

114
00:06:49,540 --> 00:06:55,440
more optimizations around these
aggregations as well than at

115
00:06:55,440 --> 00:06:58,780
least like the "old school" transactional
processing databases

116
00:06:59,020 --> 00:06:59,520
have.

117
00:07:00,060 --> 00:07:02,560
I think lines started getting blurred
a little bit with some

118
00:07:02,560 --> 00:07:06,000
of the extensions, like some of
the newer extensions in Postgres

119
00:07:06,200 --> 00:07:08,500
world that also have some of these
optimizations.

120
00:07:09,520 --> 00:07:13,940
But yeah, I think that's a great
starting point as some of Postgres'

121
00:07:14,160 --> 00:07:15,600
strengths working against it.

122
00:07:15,600 --> 00:07:20,140
I think another 1 that is not just
row store versus column store

123
00:07:20,800 --> 00:07:24,560
is because of how Postgres does
visibility.

124
00:07:24,960 --> 00:07:30,140
I think if you look at
some other transactional processing

125
00:07:30,180 --> 00:07:37,360
databases, because they don't do
the same marking of rows

126
00:07:37,360 --> 00:07:39,820
as dead, because they do undo logs,
basically.

127
00:07:40,640 --> 00:07:43,980
They can also cheat a little bit
for certain counts.

128
00:07:43,980 --> 00:07:46,960
Like, if you want to count all
of the users, which if you have

129
00:07:46,960 --> 00:07:51,600
some specific use cases, like let's
count all of the records

130
00:07:51,600 --> 00:07:56,600
in this table, they can keep, they
cheat and do some of the solutions

131
00:07:56,600 --> 00:07:57,340
that we're going to

132
00:07:57,340 --> 00:07:57,912
Nikolay: discuss maybe at the day
to day level.

133
00:07:57,912 --> 00:07:58,520
Well, they're not cheating.

134
00:07:58,520 --> 00:08:02,880
They put the new row version in the
same place as the old row version,

135
00:08:02,880 --> 00:08:07,460
it means that if we
already have compactness of

136
00:08:07,640 --> 00:08:10,580
storage, it's not lost.

137
00:08:10,880 --> 00:08:15,200
Compared to Postgres, where the new
version comes to a new place, maybe

138
00:08:15,200 --> 00:08:18,540
a different page, in many cases it's
a different page.

139
00:08:19,200 --> 00:08:23,380
Maybe it's a place which was taken
by another tuple which became

140
00:08:23,380 --> 00:08:27,720
dead and then vacuumed, deleted
by a vacuum, right?

141
00:08:27,720 --> 00:08:28,380
Or autovacuum.

142
00:08:28,940 --> 00:08:33,840
It means that we might end up having
very scattered storage for

143
00:08:33,840 --> 00:08:38,280
our tuples, especially if we need
to count only specific tuples.

144
00:08:38,420 --> 00:08:41,880
We have the WHERE clause in our
SELECT COUNT.

145
00:08:42,040 --> 00:08:47,900
It means that we need to, or we
might have an index scan for this,

146
00:08:47,900 --> 00:08:50,020
but it's kind of random access,
right?

147
00:08:50,020 --> 00:08:53,460
We need to count many tuples which
are stored in random places,

148
00:08:53,560 --> 00:08:54,940
and we don't use partitioning.

149
00:08:55,240 --> 00:08:57,600
Our table is 1 terabyte in size.

150
00:08:57,900 --> 00:08:59,240
It's super bad case.

151
00:08:59,340 --> 00:09:02,680
An UPDATE randomizes the storage,
right?

152
00:09:02,680 --> 00:09:04,740
The new version is stored in random
locations.

153
00:09:05,220 --> 00:09:07,020
And it becomes worse and worse
and worse.

154
00:09:07,020 --> 00:09:09,040
It's very scattered, and we lose
compactness.

155
00:09:10,080 --> 00:09:13,760
This is how MVCC in Postgres works,
unfortunately.

156
00:09:14,020 --> 00:09:19,620
We oftentimes say it's more like
rollback-focused than commit-focused.

157
00:09:19,920 --> 00:09:25,320
Because in case of rollback, the new
version becomes basically dead

158
00:09:25,320 --> 00:09:30,060
eventually and cleaned up, and the
old version remains in the same

159
00:09:30,060 --> 00:09:32,220
place, and we don't lose compactness.

160
00:09:32,720 --> 00:09:36,680
If we have a commit, we move rows
all the time.

161
00:09:36,760 --> 00:09:41,280
We move tuples, the physical versions
of rows, all the time, and we

162
00:09:41,280 --> 00:09:42,640
lose compactness eventually.

163
00:09:43,680 --> 00:09:48,220
And this is bad for count, bad
for other aggregates as well.

164
00:09:48,460 --> 00:09:52,040
Michael: Yeah, so I think that's
like a good overall summary

165
00:09:52,040 --> 00:09:54,720
of why count is slow, or at least
why

166
00:09:54,720 --> 00:09:55,440
Nikolay: it scales.

167
00:09:57,740 --> 00:10:01,980
Michael: I think while it's not
bad on tiny datasets, it's fine,

168
00:10:01,980 --> 00:10:02,480
right?

169
00:10:03,340 --> 00:10:07,080
You probably won't notice it too
much, but as data scales, even

170
00:10:07,080 --> 00:10:11,240
in the blog post I was reading,
they only did a dataset of a

171
00:10:11,240 --> 00:10:14,840
million rows and already, even like
the fastest optimizations

172
00:10:15,140 --> 00:10:20,020
at the time could only get an accurate
count in about 100, 200

173
00:10:20,020 --> 00:10:21,420
milliseconds, something like that.

174
00:10:21,420 --> 00:10:25,320
So we're already talking about
10x what you would consider a

175
00:10:25,320 --> 00:10:26,120
fast query.

176
00:10:26,120 --> 00:10:30,360
So even at a million records, which
is not many in many use cases,

177
00:10:30,360 --> 00:10:34,900
we're talking about a potentially
meaningful length of time.

178
00:10:34,940 --> 00:10:39,240
So yeah, I think, should we move
on to what we can do about it?

179
00:10:39,620 --> 00:10:42,760
Nikolay: Let me share the history
of this blog post.

180
00:10:42,800 --> 00:10:46,520
I might have somehow provoked that blog
post that Joe Nelson wrote a little

181
00:10:46,520 --> 00:10:50,560
bit, and I'm not taking full responsibility
for that blog post,

182
00:10:50,560 --> 00:10:51,060
definitely.

183
00:10:51,220 --> 00:10:54,740
But I remember he published it
in 2016, right?

184
00:10:55,240 --> 00:10:57,080
And I started using it all the
time.

185
00:10:57,080 --> 00:11:00,940
I shared, like, if you want to
understand count, just read that

186
00:11:00,940 --> 00:11:01,720
blog post.

187
00:11:01,720 --> 00:11:04,840
Of course, right now I would say
it's missing BUFFERS in the

188
00:11:04,840 --> 00:11:09,960
EXPLAIN plans. For sure, because
as we just discussed, I/O is

189
00:11:09,960 --> 00:11:13,640
reason number 1 of why things are
slow, right?

190
00:11:13,780 --> 00:11:17,480
We should not talk about timing
only. We should talk about BUFFERS

191
00:11:17,480 --> 00:11:18,660
as well all the time.

192
00:11:19,040 --> 00:11:24,940
But back 1 year earlier, in 2015,
I started looking at PostgREST,

193
00:11:25,520 --> 00:11:30,060
which right now is 1 of the main
building blocks

194
00:11:30,060 --> 00:11:31,100
Supabase has.

195
00:11:31,460 --> 00:11:36,500
And I was impressed, like, great,
we can ditch Django and Rails

196
00:11:36,500 --> 00:11:40,020
applications completely in simple,
at least in simple cases,

197
00:11:40,560 --> 00:11:43,000
and do everything on server-side
and front-end.

198
00:11:43,040 --> 00:11:46,200
This is actually the same time
React became popular.

199
00:11:46,500 --> 00:11:50,460
I think it was started in 2013,
'14, and so on.

200
00:11:50,460 --> 00:11:53,760
So it skyrocketed in 2015.

201
00:11:55,240 --> 00:11:58,320
And a lot of business logic went
to frontend, of course.

202
00:11:58,320 --> 00:11:59,440
Also React Native.

203
00:11:59,760 --> 00:12:00,260
Great.

204
00:12:00,360 --> 00:12:04,700
So we need to get rid of middleware,
keep data-oriented business

205
00:12:04,700 --> 00:12:08,720
logic in the database, keep UI-oriented
business logic on the

206
00:12:08,720 --> 00:12:11,100
client side, and this is great.

207
00:12:11,200 --> 00:12:15,560
So I found PostgREST, and I quickly
realized I came from already

208
00:12:16,160 --> 00:12:17,860
quite heavily loaded systems.

209
00:12:18,340 --> 00:12:22,540
I remember, for example, the first
time I saw an integer primary

210
00:12:22,540 --> 00:12:25,220
key exhausted, it was in my own
project in 2008.

211
00:12:25,900 --> 00:12:29,380
So I already realized some data
volumes and how they work in

212
00:12:29,380 --> 00:12:33,640
Postgres, And I quickly found that
in PostgREST, they had already

213
00:12:33,640 --> 00:12:38,300
some pagination, very good, but
on each GET request, you say,

214
00:12:38,300 --> 00:12:42,980
I want only 25 items on my page,
but they also show you how many

215
00:12:43,520 --> 00:12:46,760
overall items you have for this
request with all your filters

216
00:12:46,760 --> 00:12:47,580
and so on.

217
00:12:47,780 --> 00:12:49,700
And I thought, well, that's not
good.

218
00:12:49,700 --> 00:12:52,000
I know Postgres has slow count,
right?

219
00:12:52,340 --> 00:12:58,140
What if they counted every time? So
it's written in Haskell, so

220
00:12:58,140 --> 00:13:03,020
I started digging inside source
code and I found indeed it runs

221
00:13:03,480 --> 00:13:06,940
count all the time, so I helped
them to get rid of it by default.

222
00:13:07,200 --> 00:13:11,520
So I remember I created an issue:
let's get rid of count by default,

223
00:13:11,520 --> 00:13:15,040
and they quickly agreed and removed
it.

224
00:13:15,060 --> 00:13:19,080
So I'm happy I helped a little
bit to realize that count should

225
00:13:19,080 --> 00:13:21,440
not be used by default.

226
00:13:21,500 --> 00:13:24,960
And also at the same time, Joe
Nelson, the original creator

227
00:13:24,960 --> 00:13:29,540
of PostgREST, started to dig into
this topic and found several

228
00:13:29,540 --> 00:13:35,100
alternative ways to make count
fast, including an approximate

229
00:13:35,600 --> 00:13:36,100
method.

230
00:13:36,600 --> 00:13:37,660
I like this method.

231
00:13:37,660 --> 00:13:39,000
It doesn't work all the time.

232
00:13:39,000 --> 00:13:42,680
It works very well if you have
autovacuum well-tuned, so it

233
00:13:42,680 --> 00:13:47,620
runs auto-analyze frequently, right,
because you need fresh statistics.

234
00:13:47,920 --> 00:13:51,300
And it also works if you don't
play too much with filtering.

235
00:13:51,600 --> 00:13:55,820
So, for example, if you have a
users table and you know to display

236
00:13:55,960 --> 00:14:01,100
overall count of rows in this table,
you can use just approximate

237
00:14:01,320 --> 00:14:05,880
approach, where, in a few words,
you just run EXPLAIN without

238
00:14:05,900 --> 00:14:10,460
ANALYZE, you just run EXPLAIN
SELECT * FROM users, right?

239
00:14:10,680 --> 00:14:13,460
And the planner will tell you how
many rows it expects.

240
00:14:13,500 --> 00:14:17,060
And this is already an approximate
number you can use.
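
As a sketch of how this trick can be consumed programmatically: the planner's estimate can be read out of `EXPLAIN (FORMAT JSON)` output. The JSON below is a hypothetical, made-up capture for a `users` table; in a real application you would obtain it through your database driver:

```python
import json

# Hypothetical, pre-captured output of:
#   EXPLAIN (FORMAT JSON) SELECT * FROM users;
# The numbers are made up for illustration.
explain_json = """
[
  {
    "Plan": {
      "Node Type": "Seq Scan",
      "Relation Name": "users",
      "Plan Rows": 1000000,
      "Plan Width": 72
    }
  }
]
"""

def estimated_rows(explain_output: str) -> int:
    """Read the planner's row estimate from EXPLAIN (FORMAT JSON) output."""
    top_plan = json.loads(explain_output)[0]["Plan"]
    return top_plan["Plan Rows"]

print(estimated_rows(explain_json))  # 1000000
```

No rows are scanned at all: the estimate comes straight from the statistics the planner already keeps, which is why it returns instantly regardless of table size.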

241
00:14:17,180 --> 00:14:21,720
This is an old trick I first learned
maybe in 2004 or 2005, when I started

242
00:14:21,720 --> 00:14:22,380
with Postgres.

243
00:14:22,580 --> 00:14:23,480
We used it.

244
00:14:23,480 --> 00:14:26,640
Michael: I think, yeah, I think
this trick comes into its own

245
00:14:27,580 --> 00:14:31,640
for things that are harder to find
out from the statistics or

246
00:14:31,640 --> 00:14:32,700
from other methods.

247
00:14:32,760 --> 00:14:38,240
I think, like for example, if you
have dynamic filters that users

248
00:14:38,240 --> 00:14:40,540
can set, or if you, basically,
if you have,

249
00:14:41,040 --> 00:14:41,280
Nikolay: if you

250
00:14:41,280 --> 00:14:44,140
Michael: need to be able to set
arbitrary where clauses or you

251
00:14:44,140 --> 00:14:48,740
want to be able to join relations
and get estimates, then this

252
00:14:48,740 --> 00:14:54,440
is an extremely powerful estimation
tool because you're basically

253
00:14:54,440 --> 00:14:57,740
using Postgres's planner to do
it.

254
00:14:57,740 --> 00:15:01,180
Nikolay: But you need to make sure
it's up to date most of the

255
00:15:01,180 --> 00:15:01,680
time.

256
00:15:02,280 --> 00:15:04,540
Michael: Well, that's not even
enough though because we've had

257
00:15:04,540 --> 00:15:07,600
plenty of episodes where we've
talked about times where the plan's

258
00:15:07,600 --> 00:15:09,200
estimations can be way off.

259
00:15:09,200 --> 00:15:13,380
Nikolay: So yeah, just have 2 columns
and 2 filters and don't

260
00:15:13,380 --> 00:15:15,660
have multi-column statistics, that's
it.

261
00:15:15,660 --> 00:15:15,900
Exactly.

262
00:15:15,900 --> 00:15:17,240
Michael: It's a super easy way.

263
00:15:18,280 --> 00:15:22,200
Yeah, or in 2 different tables,
like, where you can't even.

264
00:15:22,200 --> 00:15:25,700
So yeah, so we've talked about
the limitations of that before,

265
00:15:26,060 --> 00:15:30,480
but in the real world, I've seen
very few, I mean, it's probably

266
00:15:30,480 --> 00:15:33,540
a limitation of my experience rather
than because I've seen how

267
00:15:33,540 --> 00:15:35,960
popular these estimation tools
are.

268
00:15:36,220 --> 00:15:40,140
I've seen very few cases where
people would rather have an estimated

269
00:15:40,160 --> 00:15:41,780
count than no count at all.

270
00:15:41,780 --> 00:15:44,800
In lots of user facing applications,
I've seen people want an

271
00:15:44,800 --> 00:15:47,080
exact count, or they don't want
to show it.

272
00:15:47,080 --> 00:15:48,040
And there seems to be this...

273
00:15:48,040 --> 00:15:49,100
Nikolay: I cannot agree.

274
00:15:49,120 --> 00:15:49,920
Nikolay: I cannot agree.

275
00:15:49,920 --> 00:15:50,280
For example, I...

276
00:15:50,280 --> 00:15:50,940
What do you see?

277
00:15:50,940 --> 00:15:51,660
Michael: Where do you see

278
00:15:51,660 --> 00:15:51,980
Nikolay: it used?

279
00:15:51,980 --> 00:15:54,060
Well, for example, I had this.

280
00:15:54,160 --> 00:15:58,860
We have a social network, for example,
and we just want to display

281
00:15:58,860 --> 00:16:02,440
the total number of registered
users and the total number of,

282
00:16:02,440 --> 00:16:06,720
I don't know, like groups created
or anything created, posts,

283
00:16:06,760 --> 00:16:08,300
comments on the front page.

284
00:16:09,860 --> 00:16:11,540
And it's just overall number.

285
00:16:12,520 --> 00:16:15,320
Definitely this approach will work
there very well.

286
00:16:15,320 --> 00:16:16,480
And we rounded it.

287
00:16:16,480 --> 00:16:23,460
We don't show, well, next idea
let's round it but also show some,

288
00:16:23,700 --> 00:16:29,360
if we know current number and 1
week ago number so we can understand

289
00:16:29,440 --> 00:16:33,680
the velocity, the speed of growth
and we can also pretend it's

290
00:16:33,680 --> 00:16:34,540
growing right now.

291
00:16:34,540 --> 00:16:39,140
It's like fake, but this is what
social media do all the time,

292
00:16:39,440 --> 00:16:39,940
right?

293
00:16:39,940 --> 00:16:42,980
Michael: But this is kind of my
point that you wouldn't use the

294
00:16:42,980 --> 00:16:43,340
actual...

295
00:16:43,340 --> 00:16:46,440
So the number when it's an estimate
could go down.

296
00:16:46,780 --> 00:16:49,940
Like even if it's like a number
of users, in that case, you

297
00:16:49,940 --> 00:16:52,440
wouldn't want to show a number
this week that was lower than

298
00:16:52,440 --> 00:16:53,760
the number last week.

299
00:16:54,440 --> 00:16:54,940
Nikolay: Right.

300
00:16:55,760 --> 00:16:56,480
You're right.

301
00:16:56,480 --> 00:16:58,940
Michael: So there are... you
still wouldn't use the

302
00:16:58,940 --> 00:17:00,040
exact like.

303
00:17:00,060 --> 00:17:03,140
So to be more specific, there's
some really cool algorithms.

304
00:17:03,280 --> 00:17:08,300
There's a really popular extension
called HyperLogLog, an extension

305
00:17:08,300 --> 00:17:09,520
called Postgres HyperLogLog.

306
00:17:10,080 --> 00:17:10,580
Yeah.

307
00:17:10,960 --> 00:17:12,540
Nikolay: Count distinct.

308
00:17:13,200 --> 00:17:16,320
Michael: Yes, so it's for estimating count
distinct, you're right.
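
To illustrate the idea behind such an extension, here is a toy HyperLogLog in pure Python. It is a from-scratch sketch of the general published algorithm, not the extension's actual implementation or its bias corrections:

```python
import hashlib
import math

# Toy HyperLogLog: estimate count distinct in one pass with tiny,
# fixed-size state. Illustrative sketch only.

B = 11                            # 2^11 = 2048 registers
M = 1 << B
ALPHA = 0.7213 / (1 + 1.079 / M)  # standard bias-correction constant

def hll_count_distinct(items) -> int:
    """Estimate the number of distinct items in a single pass."""
    registers = [0] * M
    for item in items:
        # 64 roughly uniform hash bits per item
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h & (M - 1)                        # low bits choose a register
        rest = h >> B                            # remaining 53 bits
        rank = (64 - B) - rest.bit_length() + 1  # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    raw = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * M:            # small-range correction: linear counting
        zeros = registers.count(0)
        if zeros:
            raw = M * math.log(M / zeros)
    return round(raw)

# 100,000 distinct values estimated from a few KB of register state,
# typically within a few percent of the true count
estimate = hll_count_distinct(range(100_000))
```

The extension keeps this kind of fixed-size sketch as a value you can store, update incrementally, and merge, which is what makes estimated count distinct cheap at scale.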

309
00:17:16,520 --> 00:17:20,500
So more specific than just count,
but it's available on pretty

310
00:17:20,500 --> 00:17:24,720
much every, every cloud provider,
which makes me think it must

311
00:17:24,720 --> 00:17:25,560
be pretty popular.

312
00:17:25,560 --> 00:17:28,120
Like it must have been requested
on a bunch of these.

313
00:17:28,320 --> 00:17:29,080
Nikolay: Not necessarily.

314
00:17:30,060 --> 00:17:31,580
Things work in life differently.

315
00:17:32,020 --> 00:17:37,660
It might be just, you know, like,
we don't know how popular it

316
00:17:37,660 --> 00:17:38,500
is overall.

317
00:17:38,920 --> 00:17:41,540
It might be just cool to have it
at some point.

318
00:17:41,880 --> 00:17:44,940
You know, sometimes people have
high expectations.

319
00:17:45,660 --> 00:17:51,200
And also in the case of managed Postgres
services, in any area of

320
00:17:51,200 --> 00:17:56,000
business, any segment of the market,
if some big player added it,

321
00:17:56,000 --> 00:17:57,480
others also added it.

322
00:17:58,360 --> 00:18:02,160
Michael: Or even if, let's say
only 1 big customer needs it or

323
00:18:02,160 --> 00:18:02,900
wants it.

324
00:18:02,900 --> 00:18:05,280
If they're big enough, maybe you
add it.

325
00:18:05,280 --> 00:18:05,740
Nikolay: Yeah.

326
00:18:05,740 --> 00:18:06,700
Also might happen.

327
00:18:06,700 --> 00:18:07,120
Yeah.

328
00:18:07,120 --> 00:18:08,380
Michael: So yeah, good point.

329
00:18:09,480 --> 00:18:13,660
But I do think this whole category
of estimated counts is super

330
00:18:13,660 --> 00:18:14,160
interesting.

331
00:18:14,180 --> 00:18:18,740
And that feels like 1 big solution
that you can say, look, your

332
00:18:18,740 --> 00:18:21,320
count is slow, 1 option is estimating.

333
00:18:21,420 --> 00:18:24,520
And I think it's really interesting
intellectually, and I'd love

334
00:18:24,520 --> 00:18:29,360
to hear from people that do use
it in production for real things.

335
00:18:30,660 --> 00:18:34,200
But very often I see people say,
okay, that's cool, but what

336
00:18:34,200 --> 00:18:35,200
are the other options?

337
00:18:35,280 --> 00:18:36,340
What else can I do?

338
00:18:36,340 --> 00:18:40,340
Nikolay: Let's finish with when
to apply this approximate approach.

339
00:18:40,600 --> 00:18:45,700
I think if you need to count rows
in 1 table, it's easy, no filtering,

340
00:18:45,720 --> 00:18:47,060
it's definitely an option.

341
00:18:47,160 --> 00:18:51,780
But you need to make sure autovacuum
is configured to trigger,

342
00:18:51,780 --> 00:18:57,040
like autovacuum analyze scale factor
is configured properly

343
00:18:57,100 --> 00:19:00,820
so statistics are up to date most
of the time, I would say.

344
00:19:00,840 --> 00:19:04,740
But also, if you have filters,
these filters should be like single

345
00:19:04,740 --> 00:19:06,160
column, no joins.

346
00:19:06,340 --> 00:19:07,640
In this case, it's good.

347
00:19:08,460 --> 00:19:10,680
In some cases, it can be not good
as well.

348
00:19:10,680 --> 00:19:15,840
For example, if you need a daily
count, it might be off completely.

349
00:19:16,600 --> 00:19:23,500
Or if your project already is old,
you have many years of data,

350
00:19:24,240 --> 00:19:28,340
daily counts might also be challenging
because we know by default Postgres

351
00:19:28,340 --> 00:19:31,160
keeps only statistics for 100 buckets.

352
00:19:31,740 --> 00:19:32,580
Michael: Oh, interesting.

353
00:19:32,960 --> 00:19:33,120
Nikolay: I

354
00:19:33,120 --> 00:19:34,080
Michael: see where you're coming
from.

355
00:19:34,080 --> 00:19:35,200
Nikolay: But we can raise it.

356
00:19:35,200 --> 00:19:39,340
We can raise it for specific tables,
even specific columns.

357
00:19:40,080 --> 00:19:42,660
And this can help as well in this
case.

358
00:19:42,660 --> 00:19:49,800
If we know we do daily or monthly
counts, we can change this

359
00:19:49,840 --> 00:19:54,360
setting and have more precise statistics,
right?

360
00:19:54,480 --> 00:19:56,680
And still use an approximate approach.

361
00:19:57,720 --> 00:19:58,380
Make sense?

362
00:19:59,500 --> 00:20:00,000
Michael: Yeah.

363
00:20:00,720 --> 00:20:04,920
I think it might even be simpler,
like it might be that most

364
00:20:04,920 --> 00:20:08,240
people don't end up going this
route because they actually find

365
00:20:08,240 --> 00:20:10,460
they can live with the trade-offs
of the other approaches.

366
00:20:10,560 --> 00:20:13,820
Like it feels like a last resort
estimation.

367
00:20:14,540 --> 00:20:18,060
If you've got a slow count, maybe
you're thinking, well, can

368
00:20:18,060 --> 00:20:21,320
we make it like there are ways
of making it faster, but they

369
00:20:21,320 --> 00:20:22,500
come with certain trade-offs.

370
00:20:22,540 --> 00:20:25,320
So maybe we should discuss those
first and then come back at

371
00:20:25,320 --> 00:20:28,860
the end to say, if you can't live
with those trade-offs, then

372
00:20:28,860 --> 00:20:32,180
you have to, and you still
want to display something or

373
00:20:32,180 --> 00:20:32,920
show something.

374
00:20:33,100 --> 00:20:34,840
Nikolay: For me it's vice versa
actually.

375
00:20:35,820 --> 00:20:38,300
Approximate count is super easy
to implement.

376
00:20:38,300 --> 00:20:39,660
I would consider it first.

377
00:20:40,600 --> 00:20:45,080
Would it work well enough in our
case, or do we exclude it completely?

378
00:20:45,220 --> 00:20:45,980
That's it.

379
00:20:46,120 --> 00:20:47,700
Because it's super easy to implement.

380
00:20:48,080 --> 00:20:49,280
But let's move on.

381
00:20:49,280 --> 00:20:52,360
You already started talking about
HyperLogLog, right?

382
00:20:52,360 --> 00:20:55,940
This is for distinct count and
this extension is indeed available

383
00:20:55,960 --> 00:20:56,820
almost everywhere.

384
00:20:57,380 --> 00:21:01,880
So if you have a count distinct,
why not consider it, right?

385
00:21:01,880 --> 00:21:05,500
Because it might be very good,
fast, and so on.
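[Editor's note: a minimal sketch of the hll extension for approximate distinct counts; table and column names are hypothetical, and user_id is assumed to be a bigint.]

```sql
CREATE EXTENSION IF NOT EXISTS hll;

-- Pre-aggregate one HyperLogLog sketch per day
CREATE TABLE daily_uniques (
    day   date PRIMARY KEY,
    users hll NOT NULL
);

INSERT INTO daily_uniques
SELECT created_at::date, hll_add_agg(hll_hash_bigint(user_id))
FROM events
GROUP BY 1;

-- Approximate distinct users over an arbitrary range, merging sketches
SELECT hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE day >= date '2024-01-01';
```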

386
00:21:06,040 --> 00:21:10,840
And Joe Nelson, in this old article,
explains it.

387
00:21:12,180 --> 00:21:15,040
So what I would do, I would consider
it in this case.

388
00:21:15,240 --> 00:21:19,380
I would do some benchmarks with
current data volumes, but also

389
00:21:19,380 --> 00:21:24,860
10x for example, predicting future
growth with focus on I/O.

390
00:21:24,940 --> 00:21:30,660
I definitely would look at buffers
numbers in the plans and understand

391
00:21:30,660 --> 00:21:33,160
how, like, what about scalability
here?

392
00:21:33,160 --> 00:21:41,660
If we scale 2x, 5x, 10x, will the
I/O numbers still be fine, leading

393
00:21:41,660 --> 00:21:44,120
to acceptable latencies or no?

394
00:21:44,440 --> 00:21:47,120
This is the question, right?

395
00:21:48,100 --> 00:21:50,460
Michael: But this is, we're still
in estimation, right?

396
00:21:50,740 --> 00:21:53,580
Nikolay: Yeah, but it's more like
it's smarter already, right?

397
00:21:53,840 --> 00:21:54,740
Michael: Yeah, true.

398
00:21:54,920 --> 00:21:57,340
Smarter than just relying on Postgres
statistics.

399
00:21:57,340 --> 00:21:57,720
Right.

400
00:21:57,720 --> 00:21:58,400
Yeah, true.

401
00:21:58,620 --> 00:22:02,140
Nikolay: And next, like, I know
the article also, just, like,

402
00:22:02,140 --> 00:22:05,440
it's interesting that we're discussing
an article which is 9 years

403
00:22:05,440 --> 00:22:07,420
old, or how many? 9, right?

404
00:22:07,420 --> 00:22:07,660
Well,

405
00:22:07,660 --> 00:22:11,180
Michael: I actually did, as part
of my prep for this.

406
00:22:11,180 --> 00:22:16,020
Conveniently, they provided
code snippets for each step,

407
00:22:16,020 --> 00:22:21,040
and I did a few
changes, but it was very, very

408
00:22:21,040 --> 00:22:23,740
easy for me to run this on Postgres
17, actually, just because

409
00:22:23,740 --> 00:22:24,940
I was doing some testing.

410
00:22:24,960 --> 00:22:25,700
Nikolay: What changed?

411
00:22:25,940 --> 00:22:29,820
Michael: Well, so the first thing
I noticed was his plans didn't

412
00:22:29,820 --> 00:22:30,760
even have parallel.

413
00:22:30,760 --> 00:22:33,480
Like, bear in mind, we're talking
about doing a count over a

414
00:22:33,480 --> 00:22:34,320
million rows.

415
00:22:34,640 --> 00:22:38,720
These days you just expect to see
the first query plan have a

416
00:22:38,720 --> 00:22:39,940
gather and...

417
00:22:40,320 --> 00:22:42,180
Nikolay: 3 workers by default,
basically.

418
00:22:43,140 --> 00:22:43,880
Michael: Yeah, exactly.

419
00:22:45,120 --> 00:22:45,860
And it didn't.

420
00:22:45,860 --> 00:22:49,400
And I was like, wait, so his blog
post was pre-Postgres having

421
00:22:49,400 --> 00:22:50,300
parallel workers.

422
00:22:50,740 --> 00:22:51,300
And that was

423
00:22:51,300 --> 00:22:51,900
Nikolay: the first thing.

424
00:22:51,900 --> 00:22:55,220
And it means roughly it should
be by default 3 times faster,

425
00:22:55,260 --> 00:22:59,180
because aggregation, like, append,
or how is it called, is easy

426
00:22:59,180 --> 00:23:02,220
in this case, it's just a sum of
numbers. That's it, right?

427
00:23:02,220 --> 00:23:04,620
Michael: Yeah, count should parallelize
really well.

428
00:23:04,760 --> 00:23:08,300
Because I don't know his hardware,
and I didn't go back and try

429
00:23:08,300 --> 00:23:10,900
and get a really old version of
Postgres, I didn't actually.

430
00:23:10,900 --> 00:23:13,760
Actually, I probably could have
just tested by turning off parallelization.

431
00:23:14,120 --> 00:23:16,260
But yeah, probably about 3 times
faster.

432
00:23:17,140 --> 00:23:17,500
Nikolay: Good.

433
00:23:17,500 --> 00:23:18,480
This is a good point.

434
00:23:18,480 --> 00:23:18,980
Yeah.

435
00:23:19,120 --> 00:23:20,980
Michael: Yeah, that was the first
thing I noticed.

436
00:23:20,980 --> 00:23:26,280
And then there were a couple of
other differences in the nitty

437
00:23:26,280 --> 00:23:30,680
gritty, but nothing like fundamental,
nothing like too big.

438
00:23:31,380 --> 00:23:34,400
So most of the points still stood
really well, even though, as

439
00:23:34,400 --> 00:23:37,120
I said, it was like a lot's changed,
but also not that much has

440
00:23:37,120 --> 00:23:37,620
changed.

441
00:23:37,660 --> 00:23:43,280
So yeah, the other 1 was in the
past, you had to do a workaround

442
00:23:43,780 --> 00:23:47,960
for: you needed to use a subquery
trick for like count distinct

443
00:23:47,960 --> 00:23:49,580
to get an index only scan.

444
00:23:49,940 --> 00:23:53,720
So instead of being able to do
count distinct on column and get

445
00:23:53,720 --> 00:23:57,720
an index only scan, you had to
do like a select within that.

446
00:23:57,720 --> 00:24:01,480
So you had to select distinct
to get the index-only scan; you don't

447
00:24:01,480 --> 00:24:02,380
need to anymore.

448
00:24:02,440 --> 00:24:04,060
That was quite cool to see.
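[Editor's note: the two forms being compared, sketched against a hypothetical events table with an indexed user_id column; per the discussion, the subquery form was the old workaround.]

```sql
-- Old workaround: wrap DISTINCT in a subquery to get an index-only scan
SELECT count(*)
FROM (SELECT DISTINCT user_id FROM events) AS s;

-- On recent versions, the direct form can use the index too
SELECT count(DISTINCT user_id) FROM events;
```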

449
00:24:04,080 --> 00:24:06,240
Nikolay: Yeah, that's a good point
as well.

450
00:24:07,120 --> 00:24:08,360
Michael: But yeah, like a bunch

451
00:24:08,360 --> 00:24:11,260
So actually index-only scans, we
should probably talk about that.

452
00:24:11,920 --> 00:24:14,980
Nikolay: Yeah, we just had the
episode about it a month ago.

453
00:24:15,060 --> 00:24:16,980
So it's very related to...

454
00:24:17,280 --> 00:24:23,360
The ideal case for Postgres in
its rowstore situation is

455
00:24:23,360 --> 00:24:27,420
index-only scan with low heap fetches,
ideally 0.

456
00:24:27,740 --> 00:24:31,860
Which means, again, that leads
to the need to configure autovacuum

457
00:24:31,960 --> 00:24:36,000
to make it work more frequently
and move on faster.

458
00:24:36,940 --> 00:24:40,080
In this case, if it's an index-only
scan, this is the best we

459
00:24:40,080 --> 00:24:41,260
can do with Postgres.

460
00:24:41,460 --> 00:24:44,520
Parallel index-only scan, maybe,
as you said.

461
00:24:45,060 --> 00:24:46,200
And this is good.

462
00:24:46,200 --> 00:24:50,000
I mean, this is the best thing
you can see in plans: index-only

463
00:24:50,220 --> 00:24:51,800
scan with low heap fetches.
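[Editor's note: checking for this in practice might look like the following; the table name is a placeholder.]

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM events;
-- Look for "Index Only Scan" with "Heap Fetches: 0" (or close to it).
-- Growing Heap Fetches mean the visibility map is stale: tune autovacuum
-- (e.g. lower autovacuum_vacuum_scale_factor for this table) or run
-- VACUUM so more pages are marked all-visible.
```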

464
00:24:52,300 --> 00:24:58,360
The only better situation would
mean you need to start denormalization,

465
00:24:58,980 --> 00:25:00,420
caching, and so on.

466
00:25:01,060 --> 00:25:05,280
I would consider it as a heavy
solution to the problem.

467
00:25:06,160 --> 00:25:14,240
Because this would give you a few
buffer operations to retrieve

468
00:25:14,440 --> 00:25:15,140
your count.

469
00:25:15,480 --> 00:25:19,040
If you have fully denormalized,
maybe, for example, materialized

470
00:25:19,080 --> 00:25:19,580
view.

471
00:25:19,740 --> 00:25:22,860
I would not recommend using materialized
view in complex projects

472
00:25:22,860 --> 00:25:23,540
at all.

473
00:25:23,940 --> 00:25:25,680
Default materialized view.

474
00:25:25,680 --> 00:25:28,780
But consider we have it already,
and we have proper index on

475
00:25:28,780 --> 00:25:29,280
it.

476
00:25:29,360 --> 00:25:32,960
Finding proper count would be just
finding maybe 1 row, that's

477
00:25:32,960 --> 00:25:34,340
it, 1 value.

478
00:25:34,820 --> 00:25:37,280
It's like primary key lookup, right?

479
00:25:37,600 --> 00:25:41,020
Or maybe just index lookup to find
1 record.

480
00:25:41,480 --> 00:25:41,980
That's it.

481
00:25:41,980 --> 00:25:47,540
It's super easy, but to achieve
that, first of all, it requires

482
00:25:47,580 --> 00:25:48,080
effort.

483
00:25:48,160 --> 00:25:51,840
You need probably triggers or asynchronous
triggers somehow.

484
00:25:52,660 --> 00:25:56,640
And second of all, if it's not
materialized view, you need to

485
00:25:56,640 --> 00:26:01,540
have something probably with desire
to support partial refresh

486
00:26:01,740 --> 00:26:05,820
because default materialized view
supports only full refresh,

487
00:26:06,280 --> 00:26:10,700
which is not efficient in most
cases and can lead to bloat itself.

488
00:26:13,520 --> 00:26:17,340
Michael: So I've seen people implement
versions of this and use

489
00:26:17,920 --> 00:26:21,180
off the shelf things that are kind
of in Postgres extensions

490
00:26:21,340 --> 00:26:22,540
for this as well.

491
00:26:22,660 --> 00:26:26,060
So this is actually the approach
I've seen most commonly

492
00:26:26,180 --> 00:26:29,240
used for this, because counting doesn't
scale with volume.

493
00:26:29,340 --> 00:26:33,220
I think it's really attractive
because even an index only scan,

494
00:26:33,500 --> 00:26:36,720
if you double the data size, you're
going to roughly double the

495
00:26:37,060 --> 00:26:38,800
time it takes to do the count.

496
00:26:39,140 --> 00:26:42,380
So you're going to keep bumping
into this problem down the line.

497
00:26:42,380 --> 00:26:42,880
Whereas...

498
00:26:43,900 --> 00:26:45,560
Nikolay: Well, yeah, you're right.

499
00:26:45,560 --> 00:26:46,420
If you double...

500
00:26:47,460 --> 00:26:51,840
Yeah, even with the index-only
scan, we still need to...

501
00:26:51,900 --> 00:26:54,680
If we, for example, if we forget
about structure completely,

502
00:26:54,960 --> 00:27:00,080
talk about only leaf nodes, if
we have 2 times more values to

503
00:27:00,080 --> 00:27:03,000
store, we need 2 times more pages,
right?

504
00:27:03,460 --> 00:27:05,100
2 times more leaf nodes.

505
00:27:05,740 --> 00:27:09,560
And it translates to 2 times more
I/O.

506
00:27:09,560 --> 00:27:10,060
Operations.

507
00:27:10,900 --> 00:27:11,660
And of course...

508
00:27:11,660 --> 00:27:15,780
Michael: And even if it's not 2
times, it's still like, it still

509
00:27:15,780 --> 00:27:17,380
degrades over time.

510
00:27:17,560 --> 00:27:19,200
And these counter-caches...

511
00:27:19,840 --> 00:27:21,060
Nikolay: At least, not better than 2
times.

512
00:27:21,060 --> 00:27:23,540
It can be even worse, but not better.

513
00:27:23,540 --> 00:27:28,520
Because if you need to keep references
to 2 times more tuples,

514
00:27:28,520 --> 00:27:31,580
you do need more space in leaf
nodes, right?

515
00:27:31,800 --> 00:27:35,180
2 times more leaf nodes, at least,
maybe more actually, if we

516
00:27:35,180 --> 00:27:41,740
have a rebalancing situation and
again, not compact storage of

517
00:27:41,740 --> 00:27:43,040
tuples and so on.

518
00:27:43,580 --> 00:27:43,940
Michael: Yeah.

519
00:27:43,940 --> 00:27:46,800
So on the index-only scan, actually,
I think it's worth, before

520
00:27:46,800 --> 00:27:50,140
we move on from that, we're thinking
about some, because I think

521
00:27:50,140 --> 00:27:51,920
that goes back to the beginning
of the conversation.

522
00:27:51,920 --> 00:27:53,580
You talked about rowstore versus
columnstore.

523
00:27:53,640 --> 00:27:58,160
If we have an index only scan on
only a single, relatively small,

524
00:27:58,340 --> 00:28:01,400
like let's say a primary key, for
example, that we can use to

525
00:28:01,400 --> 00:28:03,240
count, it's ordered as well.

526
00:28:03,240 --> 00:28:05,860
So it can help us with count and
count distinct.

527
00:28:06,060 --> 00:28:11,480
It's almost columnstore-like in
a way, because if we only have

528
00:28:11,480 --> 00:28:16,760
to search the index, we're searching
like a single column effectively.

529
00:28:17,020 --> 00:28:20,720
So we're almost getting that columnstore
benefit without some

530
00:28:20,720 --> 00:28:23,140
of the aggregation tricks that
some of these columnstores have.

531
00:28:23,140 --> 00:28:23,420
Nikolay: Leaf nodes

532
00:28:23,420 --> 00:28:27,040
are doubly linked in
both directions, so we avoid

533
00:28:27,380 --> 00:28:29,360
traversing the whole tree.

534
00:28:29,640 --> 00:28:33,900
We just start at the beginning of
our range, the first value, and then

535
00:28:33,900 --> 00:28:38,560
we just go scan leaves only, and
that's great.

536
00:28:38,560 --> 00:28:40,360
I mean, you're right.

537
00:28:40,360 --> 00:28:43,840
If it's an index-only scan, we don't
care about other columns.

538
00:28:44,140 --> 00:28:49,520
But interesting point that we need
to make sure we select count

539
00:28:50,240 --> 00:28:53,620
star, we'll probably choose some
index, most likely primary key,

540
00:28:53,620 --> 00:28:54,120
right?

541
00:28:54,640 --> 00:28:59,860
But if you count some column which
allows NULLs, like we had

542
00:28:59,860 --> 00:29:02,720
a very long time ago, we had an episode
about the dangers of NULLs.

543
00:29:02,720 --> 00:29:06,540
And this is one of the dangers, because
count, if it's NULL, it

544
00:29:06,540 --> 00:29:07,400
doesn't count.

545
00:29:08,300 --> 00:29:10,180
Count doesn't count NULLs, right?
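[Editor's note: a quick illustration of the difference; the nullable column name is hypothetical.]

```sql
-- count(*) counts rows; count(column) skips rows where the column IS NULL
SELECT count(*)           AS all_rows,
       count(facebook_id) AS rows_with_facebook_id
FROM users;
```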

546
00:29:10,360 --> 00:29:14,940
This thing might be surprising
in some cases, even for experienced

547
00:29:14,960 --> 00:29:15,460
folks.

548
00:29:15,620 --> 00:29:19,700
I surprise myself with this once
every couple of years at least.

549
00:29:20,140 --> 00:29:24,280
Michael: I see this so often in
people talking from a computer

550
00:29:24,280 --> 00:29:25,260
science perspective.

551
00:29:25,320 --> 00:29:28,380
People asking, like, what's faster,
count(*) or count(1) or

552
00:29:28,380 --> 00:29:29,240
count(ID)?

553
00:29:29,720 --> 00:29:33,480
But I've never in practice seen
any use for anything other than

554
00:29:33,480 --> 00:29:34,400
count(*).

555
00:29:34,900 --> 00:29:35,580
Have you?

556
00:29:35,680 --> 00:29:40,280
Nikolay: Well, in some cases I
specifically count not null values

557
00:29:40,280 --> 00:29:41,180
in some column.

558
00:29:41,200 --> 00:29:43,540
So I put the column there and that's
it.

559
00:29:43,660 --> 00:29:46,860
In many cases I count with additional
filtering.

560
00:29:46,860 --> 00:29:50,500
You know this, it's already not
new, it's like 20 years old,

561
00:29:50,500 --> 00:29:53,040
but, you know, the FILTER clause.

562
00:29:53,680 --> 00:29:56,940
Like you say, count(*), this
is our overall count.

563
00:29:56,940 --> 00:30:03,140
And then instead of doing these
very heavy structures in SQL,

564
00:30:03,420 --> 00:30:06,960
like case when blah blah, case
when else, I don't like them.

565
00:30:06,960 --> 00:30:12,100
So instead of that, you just say
count(*) or count(column) FILTER

566
00:30:12,280 --> 00:30:13,940
and then, in parentheses, WHERE.

567
00:30:13,940 --> 00:30:19,240
And this gives you opportunity
to count many things in 1 go,

568
00:30:19,240 --> 00:30:19,540
right?
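[Editor's note: the FILTER syntax being described, sketched against a hypothetical users table; column names and conditions are illustrative.]

```sql
-- Several counts in a single scan, instead of CASE WHEN pyramids
SELECT count(*)                                            AS total,
       count(*) FILTER (WHERE created_at >= current_date)  AS today,
       count(*) FILTER (WHERE deleted_at IS NOT NULL)      AS deleted
FROM users;
```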

569
00:30:19,540 --> 00:30:20,380
You just have

570
00:30:20,380 --> 00:30:21,140
Michael: 1 scan.

571
00:30:22,360 --> 00:30:24,640
Nikolay: Yeah, so it's filtering
on the fly.

572
00:30:24,640 --> 00:30:30,060
If it suits your filter conditions,
then it will increment your

573
00:30:30,060 --> 00:30:30,800
additional counter.

574
00:30:30,800 --> 00:30:36,300
So you have multiple results in
1 select, multiple numbers returned,

575
00:30:37,200 --> 00:30:37,440
right?

576
00:30:37,440 --> 00:30:38,500
That's a great thing.

577
00:30:38,540 --> 00:30:43,140
And in there, sometimes I use a column,
understanding that nulls

578
00:30:43,140 --> 00:30:45,560
won't be counted in this approach.

579
00:30:46,900 --> 00:30:51,400
For example, I want to understand
how many people filled this value,

580
00:30:51,960 --> 00:30:55,520
again, users table, and we have,
for example, Facebook ID.

581
00:30:55,520 --> 00:30:59,320
It's NULL if Facebook
was not connected. In this

582
00:30:59,320 --> 00:31:03,360
case, you say count(*) means
overall number of users registered

583
00:31:03,500 --> 00:31:07,840
daily, for example, and then you
say count(Facebook ID), counting

584
00:31:08,000 --> 00:31:11,520
only those people who connected
their Facebook accounts.

585
00:31:11,760 --> 00:31:12,780
Makes sense, right?

586
00:31:13,440 --> 00:31:13,620
Yeah.

587
00:31:13,620 --> 00:31:16,420
In this case, you don't even need
filter.

588
00:31:17,360 --> 00:31:20,880
So it's kind of just maybe syntax
sugar, right?

589
00:31:21,340 --> 00:31:24,460
Michael: Yeah, I have seen an argument
that * is misleading

590
00:31:24,580 --> 00:31:28,220
because it doesn't mean what it
means in other contexts.

591
00:31:28,260 --> 00:31:31,400
And maybe we'd have been better
off if it was like count and

592
00:31:31,400 --> 00:31:33,740
then parentheses but nothing in
the parentheses would have been

593
00:31:33,740 --> 00:31:34,230
acceptable.

594
00:31:34,230 --> 00:31:36,780
Nikolay: I actually
agree with this.

595
00:31:36,780 --> 00:31:41,020
Because if we propagate this logic
about nulls, the * should

596
00:31:41,020 --> 00:31:44,360
mean none of the columns are null.

597
00:31:44,540 --> 00:31:45,410
And this is always so.

598
00:31:45,410 --> 00:31:45,700
Yeah,

599
00:31:45,700 --> 00:31:47,540
Michael: But all of the columns
are not null.

600
00:31:47,920 --> 00:31:50,940
Nikolay: So let's talk about this
denormalized approach.

601
00:31:50,940 --> 00:31:53,540
I think this is most powerful but
requires effort.

602
00:31:54,340 --> 00:31:55,940
Michael: And well, it depends.

603
00:31:57,840 --> 00:32:00,600
I think there's like some really
simple approaches that have

604
00:32:00,600 --> 00:32:04,520
some big trade-offs and then there's
some more complicated approaches

605
00:32:04,780 --> 00:32:08,360
that have like fewer performance
trade-offs, but they are more

606
00:32:08,360 --> 00:32:09,720
complex to set up.

607
00:32:09,720 --> 00:32:13,860
Like, for example, I think adding
a trigger. Let's say you

608
00:32:13,860 --> 00:32:17,860
know what you want to count, and
maybe you're doing

609
00:32:17,860 --> 00:32:21,180
a bunch of reads but not many writes
to this table and it's already

610
00:32:21,180 --> 00:32:22,060
quite big.

611
00:32:22,440 --> 00:32:25,580
So you're okay to pay a little
bit of overhead on each write

612
00:32:25,580 --> 00:32:28,620
in order to maintain just literally
a counter.
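[Editor's note: a minimal version of such a trigger-maintained counter; all names are illustrative, and note the single-row hotspot discussed next.]

```sql
CREATE TABLE row_counts (
    table_name text PRIMARY KEY,
    n          bigint NOT NULL
);
INSERT INTO row_counts VALUES ('events', 0);

CREATE FUNCTION bump_count() RETURNS trigger AS $$
BEGIN
    IF tg_op = 'INSERT' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = tg_table_name;
    ELSE  -- DELETE
        UPDATE row_counts SET n = n - 1 WHERE table_name = tg_table_name;
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END $$ LANGUAGE plpgsql;

CREATE TRIGGER events_count
AFTER INSERT OR DELETE ON events
FOR EACH ROW EXECUTE FUNCTION bump_count();
```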

613
00:32:28,620 --> 00:32:29,880
Nikolay: Yeah, I've done it many
times.

614
00:32:30,900 --> 00:32:32,660
But you know what will happen,
right?

615
00:32:32,780 --> 00:32:33,580
Under load.

616
00:32:35,020 --> 00:32:35,520
Okay.

617
00:32:36,780 --> 00:32:40,900
Especially if you have foreign
key, as I remember, there are

618
00:32:40,900 --> 00:32:43,680
issues because of multixact IDs and
contention.

619
00:32:44,340 --> 00:32:48,560
And imagine you have count, like,
you know this popular anti-pattern

620
00:32:49,020 --> 00:32:53,100
many people do in the beginning
of their career, when they try

621
00:32:53,100 --> 00:32:56,840
to develop some accounting part
of their systems.

622
00:32:56,840 --> 00:32:58,320
Michael: Like debit and credit.

623
00:32:58,320 --> 00:32:59,020
Nikolay: Yeah, yeah.

624
00:32:59,340 --> 00:33:03,360
And they have like common, how
to say, common account or something,

625
00:33:03,380 --> 00:33:05,460
like counting all the money flow.

626
00:33:05,820 --> 00:33:09,740
And this requires updates to a single
row in some table.

627
00:33:10,760 --> 00:33:14,940
Any transaction, financial transaction,
triggers this update

628
00:33:15,060 --> 00:33:15,560
synchronously.

629
00:33:15,960 --> 00:33:20,600
And this quickly becomes a hotspot,
obviously, because if you

630
00:33:20,600 --> 00:33:25,040
have 2 SQL transactions doing some

631
00:33:25,040 --> 00:33:28,440
financial transactions, then by
triggering, they try to update

632
00:33:28,440 --> 00:33:29,940
a single row, they compete.

633
00:33:30,020 --> 00:33:34,300
Like, I mean, this 1 will be blocked
until the other finishes

634
00:33:34,360 --> 00:33:36,260
with commit or rollback, right?

635
00:33:36,560 --> 00:33:38,720
And this doesn't scale at all.

636
00:33:39,840 --> 00:33:42,180
Michael: Well, I've seen solutions
to that.

637
00:33:42,280 --> 00:33:45,060
I've seen somebody, I think it
might have even been Tobias Petry,

638
00:33:45,060 --> 00:33:47,300
it might have been one of his Twitter
tips.

639
00:33:47,640 --> 00:33:49,940
All you do is you add more.

640
00:33:50,060 --> 00:33:51,380
If you split-

641
00:33:51,380 --> 00:33:52,940
Nikolay: Depends on the table,
right?

642
00:33:53,080 --> 00:33:53,920
Michael: No, no.

643
00:33:54,840 --> 00:33:58,720
So you have, say, 10, keep it small,
counters.

644
00:33:59,340 --> 00:34:03,940
And each write triggers, updates
one of those, but you don't know

645
00:34:03,940 --> 00:34:04,540
which one.

646
00:34:04,540 --> 00:34:08,080
It gets random, it updates one of
them, and then when you want to count

647
00:34:08,080 --> 00:34:08,420
everything,

648
00:34:08,420 --> 00:34:10,520
You have to sum all 10.
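[Editor's note: the slot-splitting trick can be sketched like this; 10 slots is an arbitrary choice.]

```sql
CREATE TABLE counter_slots (
    slot int PRIMARY KEY,
    n    bigint NOT NULL DEFAULT 0
);
INSERT INTO counter_slots SELECT g, 0 FROM generate_series(0, 9) AS g;

-- Each write bumps a random slot, spreading the row-lock contention
UPDATE counter_slots
SET n = n + 1
WHERE slot = floor(random() * 10)::int;

-- Reading the total means summing all slots
SELECT sum(n) FROM counter_slots;
```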

649
00:34:10,520 --> 00:34:11,020
Nikolay: Right.

650
00:34:12,380 --> 00:34:14,680
Michael: So yeah, it removes the
hot, like it spreads the

651
00:34:14,680 --> 00:34:15,240
Nikolay: The heat.

652
00:34:15,240 --> 00:34:16,660
Yeah, this is one approach.

653
00:34:16,780 --> 00:34:19,780
Second approach, if you, for example,
have some...

654
00:34:20,420 --> 00:34:22,920
Well, it's not easy, I would say.

655
00:34:23,040 --> 00:34:26,200
And batching is already a good
thing, but it's also like, okay,

656
00:34:26,200 --> 00:34:30,740
you basically reduce contention,
but you don't eliminate it.

657
00:34:31,160 --> 00:34:33,900
Full elimination would require
an asynchronous trigger.

658
00:34:34,740 --> 00:34:37,720
Michael: All I meant is it's quite
simple at low scale.

659
00:34:37,900 --> 00:34:40,300
It only gets complex at higher
scales.

660
00:34:41,280 --> 00:34:42,840
Nikolay: Yeah, I also did it.

661
00:34:43,080 --> 00:34:47,040
Let's increment by 10 all the time
or by 100 in jumps, but it

662
00:34:47,040 --> 00:34:49,860
becomes already not real-time,
it's okay.

663
00:34:50,280 --> 00:34:54,940
But ideally, in heavily loaded
systems, it should be asynchronous.

664
00:34:55,520 --> 00:34:58,760
So normal writes don't trigger
immediately.

665
00:34:58,970 --> 00:35:02,780
I mean, they don't lead to increments
at all.

666
00:35:03,280 --> 00:35:06,520
And they just register an event, something
happened, right?

667
00:35:06,580 --> 00:35:10,160
Or they just are there, and we
will retrieve them by selects

668
00:35:10,160 --> 00:35:11,100
and that's it.

669
00:35:11,100 --> 00:35:14,690
Yeah, asynchronously we will do
processing with batches as well.

670
00:35:14,690 --> 00:35:14,840
Do you

671
00:35:14,840 --> 00:35:17,280
Michael: know my favorite, the
favorite thing I've seen for this

672
00:35:17,280 --> 00:35:21,300
is, well, other than maybe some
proprietary solutions is simple

673
00:35:21,300 --> 00:35:22,280
roll-up tables.

674
00:35:23,460 --> 00:35:28,520
So maintaining a count of older
stuff in bigger batches.

675
00:35:28,520 --> 00:35:31,560
So you might do a yearly count, and
then a monthly count, then

676
00:35:31,560 --> 00:35:35,380
a daily count, and then anything
since that day, you go through

677
00:35:35,380 --> 00:35:36,340
all of the data.

678
00:35:36,420 --> 00:35:42,280
But it means that over time, you
can roll things up into bigger

679
00:35:42,280 --> 00:35:42,780
aggregates.

680
00:35:43,180 --> 00:35:46,040
And I think that, to me, scales
really well.
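[Editor's note: a simple roll-up scheme along those lines; table names and the nightly schedule are assumptions.]

```sql
CREATE TABLE daily_counts (
    day date PRIMARY KEY,
    n   bigint NOT NULL
);

-- Nightly job: roll up yesterday's rows once they stop changing
INSERT INTO daily_counts
SELECT created_at::date, count(*)
FROM events
WHERE created_at >= current_date - 1
  AND created_at <  current_date
GROUP BY 1;

-- Total = rolled-up history + live tail since midnight
SELECT (SELECT coalesce(sum(n), 0) FROM daily_counts)
     + (SELECT count(*) FROM events
        WHERE created_at >= current_date) AS total;
```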

681
00:35:46,520 --> 00:35:51,940
Nikolay: Yes, but the main approach,
naive, simple approach,

682
00:35:51,940 --> 00:35:55,900
let's do synchronous updates and
we will immediately have a hotspot

683
00:35:55,960 --> 00:35:56,780
it's bad.

684
00:35:57,400 --> 00:36:01,560
Then, okay, the asynchronous approach:
there are, like, two sub-approaches.

685
00:36:02,260 --> 00:36:09,420
One is polling: like, every minute we
poll: look, oh, do we have some new

686
00:36:09,420 --> 00:36:13,880
data? If we have, then in one batch
we issue an update, every minute.

687
00:36:14,140 --> 00:36:16,040
It's an okay approach, actually.

688
00:36:16,240 --> 00:36:19,900
And a reduction of it, it's actually
a regular materialized view, and

689
00:36:19,900 --> 00:36:22,360
you just refresh it fully from
time to time.

690
00:36:22,360 --> 00:36:28,040
It's again, it's super simple,
relatively.

691
00:36:28,520 --> 00:36:32,560
And it lacks partial updates, this
is probably very annoying.

692
00:36:33,060 --> 00:36:36,140
So materialized view is a very rough
approach, regular materialized

693
00:36:36,140 --> 00:36:36,640
view.

694
00:36:36,980 --> 00:36:40,920
If it's pg_ivm or how it's called,
it should be better.

695
00:36:40,920 --> 00:36:44,280
I haven't tested it myself, I
need to confess.

696
00:36:44,800 --> 00:36:52,940
I had big plans to test Hydra,
this new thing with DuckDB, right?

697
00:36:52,940 --> 00:36:53,860
pg_duckdb or something.

698
00:36:53,860 --> 00:36:54,360
Yeah.

699
00:36:54,620 --> 00:36:55,320
This pg_ivm.

700
00:36:55,520 --> 00:36:59,080
I have so many things on my plate
to test for maybe already not

701
00:36:59,080 --> 00:37:00,140
months but years.

702
00:37:00,720 --> 00:37:06,400
But unfortunately, I must confess,
customers with managed Postgres,

703
00:37:07,060 --> 00:37:12,100
with their own needs, occupy our
time lately most of the time.

704
00:37:12,100 --> 00:37:18,040
Because we try to do things which
don't require extensions.

705
00:37:18,900 --> 00:37:21,740
RDS, CloudSQL, and others don't
have them.

706
00:37:21,960 --> 00:37:23,640
I wanted to mention PGQ.

707
00:37:24,440 --> 00:37:28,140
PGQ is present in CloudSQL, right?

708
00:37:28,180 --> 00:37:28,980
It's interesting.

709
00:37:29,060 --> 00:37:32,320
So CloudSQL users can use it and
benefit from it because it's

710
00:37:32,320 --> 00:37:36,420
very old and very well battle-tested
solution from Skype.

711
00:37:38,360 --> 00:37:41,920
So imagine if we have some things
we can register in PGQ, it

712
00:37:41,920 --> 00:37:45,360
will take care of scalability because
it has partitioning inside

713
00:37:45,360 --> 00:37:46,040
and so on.

714
00:37:46,040 --> 00:37:49,960
And then we just, it's like push
events, push events, and then

715
00:37:49,960 --> 00:37:54,240
consumers of those events, they
increment count, but also maybe

716
00:37:54,240 --> 00:37:55,440
in batches and so on.

717
00:37:55,440 --> 00:38:01,120
So like, so issue just 1 update
to increment by some number,

718
00:38:01,120 --> 00:38:01,620
right?
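[Editor's note: one hedged sketch of such a batching consumer, using a plain queue table rather than PGQ itself; all names are hypothetical.]

```sql
-- Writers only INSERT cheap event rows; no shared counter is touched
CREATE TABLE count_events (
    id    bigserial PRIMARY KEY,
    delta int NOT NULL DEFAULT 1
);
CREATE TABLE counters (name text PRIMARY KEY, n bigint NOT NULL);

-- A periodic consumer drains a batch and issues a single increment
WITH batch AS (
    DELETE FROM count_events
    WHERE id IN (SELECT id FROM count_events ORDER BY id LIMIT 1000)
    RETURNING delta
)
UPDATE counters
SET n = n + (SELECT coalesce(sum(delta), 0) FROM batch)
WHERE name = 'events';
```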

719
00:38:02,260 --> 00:38:06,640
And this improves things a lot
and makes the system more resilient

720
00:38:07,120 --> 00:38:09,640
because lags will be lower.

721
00:38:10,640 --> 00:38:14,240
It can be not PGQ, but something
like Kafka or I don't know,

722
00:38:14,340 --> 00:38:19,440
Sidekiq or what you have, like
Celery, if you're a Python guy,

723
00:38:19,440 --> 00:38:19,940
right?

724
00:38:20,920 --> 00:38:26,920
But this asynchronous nature
is really good, and nobody likes

725
00:38:26,920 --> 00:38:30,560
actually delays like 1 hour because
you start with 1 minute,

726
00:38:30,600 --> 00:38:33,140
you say, okay, it's acceptable,
but the system grows.

727
00:38:33,520 --> 00:38:37,600
And then you say, okay, we need
to reduce frequency to like once

728
00:38:37,600 --> 00:38:38,940
per 5 minutes now.

729
00:38:39,060 --> 00:38:41,840
But it then grows again once per
10 minutes now.

730
00:38:41,840 --> 00:38:46,040
But if you have asynchronous push,
like through an asynchronous

731
00:38:47,060 --> 00:38:50,860
message queue or something like
an event processing system. In

732
00:38:50,860 --> 00:38:54,120
this case, you can build a good
asynchronous kind of trigger,

733
00:38:54,120 --> 00:38:54,620
right?

734
00:38:55,680 --> 00:38:59,760
I wish Postgres had something internally,
and every managed service

735
00:38:59,760 --> 00:39:03,100
provider supported it, so all our
customers already had something.

736
00:39:03,260 --> 00:39:06,020
But right now, unfortunately, it
requires effort, and you need

737
00:39:06,020 --> 00:39:10,400
to build with things you use, sometimes
outside Postgres, like

738
00:39:10,400 --> 00:39:12,460
Sidekiq or Kafka or something.

739
00:39:13,520 --> 00:39:16,840
Michael: Yeah, I think we've mentioned
pretty much everything

740
00:39:17,520 --> 00:39:21,360
from, like, vanilla Postgres without
extensions at the moment.

741
00:39:21,380 --> 00:39:23,540
But there are a bunch of other
interesting projects.

742
00:39:23,680 --> 00:39:24,060
I agree.

743
00:39:24,060 --> 00:39:29,180
I think the DuckDB stuff, especially
that cross-company initiative

744
00:39:29,540 --> 00:39:33,060
seems super interesting, but there's
at least a couple of projects

745
00:39:33,060 --> 00:39:38,200
at the moment pushing some of these
analytical queries to a column-oriented

746
00:39:38,800 --> 00:39:39,020
file.

747
00:39:39,020 --> 00:39:42,760
Nikolay: But we need to keep in
mind that this also brings additional

748
00:39:42,780 --> 00:39:43,280
challenges.

749
00:39:43,520 --> 00:39:47,920
Again, this means you need denormalization
because you have original

750
00:39:47,920 --> 00:39:51,680
table as a source of truth and you
need to bring data to extra

751
00:39:51,780 --> 00:39:56,680
table or something and maintain
this relationship all the time

752
00:39:56,680 --> 00:40:00,820
so it's consistent and not lagging
too much, right?

753
00:40:01,720 --> 00:40:05,920
Michael: Yeah, but I think the
optimizations they add might then

754
00:40:05,920 --> 00:40:07,620
be worth it in a lot of cases.

755
00:40:08,300 --> 00:40:10,760
Like the column store and then
the vectorization, like there

756
00:40:10,760 --> 00:40:15,040
are so many benefits if you're willing

757
00:40:15,040 --> 00:40:16,540
to pay the upfront cost.

758
00:40:17,280 --> 00:40:20,320
Nikolay: You can also do more like,
I would say it's already

759
00:40:20,320 --> 00:40:24,060
a traditional architecture when you
have an analytical system in addition

760
00:40:24,060 --> 00:40:24,720
to Postgres.

761
00:40:24,860 --> 00:40:27,740
It can be, I don't know, like it
can be even ClickHouse.

762
00:40:27,800 --> 00:40:30,840
It can be Vertica if it's old or
these days Snowflake.

763
00:40:31,280 --> 00:40:36,300
And we know PeerDB was recently
acquired by ClickHouse.

764
00:40:37,120 --> 00:40:42,480
And it means, imagine we have our
OLTP storage with users table,

765
00:40:42,540 --> 00:40:46,160
and then through logical replication,
it goes to ClickHouse,

766
00:40:46,200 --> 00:40:50,940
for example, and there you can
do any counts with good speed,

767
00:40:50,940 --> 00:40:51,440
right?

768
00:40:51,580 --> 00:40:56,760
You can move any analytical
workloads, which are

769
00:40:56,880 --> 00:40:59,560
maybe even user-facing, there,
right?

770
00:41:00,360 --> 00:41:04,740
It's also worth considering, but
it moves us outside of Postgres,

771
00:41:04,740 --> 00:41:06,420
and I don't like it personally.

772
00:41:07,200 --> 00:41:12,240
Michael: To stay kind of within
Postgres, I think the work the

773
00:41:12,240 --> 00:41:15,600
Citus team did was super interesting,
because I think they were

774
00:41:15,600 --> 00:41:20,340
largely used for these large analytical
workloads where they

775
00:41:20,340 --> 00:41:25,580
used the fact that they had sharded
to multiple nodes to let you parallelize

776
00:41:25,740 --> 00:41:26,680
some of these queries.

777
00:41:26,680 --> 00:41:29,880
So it gained from some
of that extra parallelization.

778
00:41:30,480 --> 00:41:33,160
And then the other company doing
some interesting stuff in the

779
00:41:33,160 --> 00:41:36,980
space, I think is Timescale, who've
added some optimizations,

780
00:41:37,280 --> 00:41:38,380
and added some functions.

781
00:41:39,800 --> 00:41:40,760
But I think...

782
00:41:40,760 --> 00:41:42,840
Nikolay: Continuous aggregates
is a great thing.

783
00:41:43,380 --> 00:41:43,820
Michael: Yes.

784
00:41:43,820 --> 00:41:47,080
And continuous aggregates means
that they've done a lot of that

785
00:41:47,080 --> 00:41:47,860
heavy lifting.

786
00:41:47,960 --> 00:41:52,000
If you have access to Timescale,
if you're self-hosted or on

787
00:41:52,000 --> 00:41:56,460
their cloud, then you don't have
to do all the complicated stuff.

788
00:41:56,460 --> 00:41:59,280
Nikolay: And if you have not only
access to Timescale, but to

789
00:41:59,280 --> 00:42:03,480
proper Timescale, because when
some company says we support Timescale,

790
00:42:03,580 --> 00:42:06,800
but it's only, how is it called,
like community edition?

791
00:42:06,900 --> 00:42:07,400
Michael: Apache.

792
00:42:07,900 --> 00:42:09,660
Yeah, I get confused.

793
00:42:09,660 --> 00:42:13,840
I think, actually, the community
edition is the one that you can't

794
00:42:13,940 --> 00:42:14,200
use.

795
00:42:14,200 --> 00:42:17,060
It's the Apache 2 one that you can
use on other clouds.

796
00:42:17,180 --> 00:42:21,700
Nikolay: Yeah, some clouds support
Timescale, but only this reduced

797
00:42:22,180 --> 00:42:22,680
option.

798
00:42:23,940 --> 00:42:26,720
But full-fledged Timescale is a
great thing to have.

799
00:42:26,720 --> 00:42:29,880
Definitely compression plus continuous
aggregates.

800
00:42:30,720 --> 00:42:33,300
We have it in a couple of places,
it's so great.

801
00:42:34,280 --> 00:42:38,100
And this is it, like, it's a flavor
of Postgres, which I do like,

802
00:42:38,100 --> 00:42:42,180
but it's not available if you're
already deep inside RDS or so,

803
00:42:42,180 --> 00:42:45,480
you need to consider migration
and so on, yeah.

804
00:42:45,480 --> 00:42:47,000
Timescale Cloud or self-hosted.

805
00:42:48,300 --> 00:42:48,800
Michael: Exactly.

806
00:42:48,960 --> 00:42:51,380
So, yeah, and then the last thing
to bring us back to the topic

807
00:42:51,380 --> 00:42:55,080
we talked about last week is they
have an implementation of loose

808
00:42:55,080 --> 00:42:57,080
index scan for count distinct.

809
00:42:57,260 --> 00:42:59,880
You can use your own implementation
of loose index scan

810
00:42:59,880 --> 00:43:00,729
to speed up some

811
00:43:00,729 --> 00:43:01,066
of these.

812
00:43:01,066 --> 00:43:01,403
Nikolay: Like in the

813
00:43:01,403 --> 00:43:01,740
Wiki, right?

814
00:43:01,740 --> 00:43:02,460
Michael: Yeah, exactly.

815
00:43:02,460 --> 00:43:03,360
Nikolay: With recursive.
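The wiki recipe Nikolay refers to can be sketched end to end. This is only an illustration: SQLite is used as a stand-in for Postgres so the snippet is self-contained and runnable, and the events table and index are invented, but the WITH RECURSIVE query itself is the same loose index scan emulation for counting distinct values.

```python
import sqlite3

# SQLite stands in for Postgres here; the WITH RECURSIVE pattern is
# the same trick described on the Postgres wiki for loose index scans.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER)")
conn.execute("CREATE INDEX events_user_id_idx ON events (user_id)")

# Many rows, few distinct values: the case where the trick pays off.
conn.executemany(
    "INSERT INTO events (user_id) VALUES (?)",
    [(i % 5,) for i in range(10_000)],
)

# Each recursive step jumps straight to the next distinct value via
# the index, so a handful of index probes replace a 10,000-row scan.
(distinct_count,) = conn.execute("""
    WITH RECURSIVE t(val) AS (
        SELECT min(user_id) FROM events
        UNION ALL
        SELECT (SELECT min(user_id) FROM events WHERE user_id > t.val)
        FROM t
        WHERE t.val IS NOT NULL
    )
    SELECT count(val) FROM t
""").fetchone()

print(distinct_count)  # 5
```

In Postgres the same CTE should run as-is; with a b-tree index on user_id, each recursive step becomes roughly one index probe, so the cost scales with the number of distinct values rather than the number of rows.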

816
00:43:05,140 --> 00:43:07,700
Michael: Yeah, so that's pretty
much everything I had.

817
00:43:07,780 --> 00:43:08,280
Nikolay: Yeah.

818
00:43:08,900 --> 00:43:10,820
So I think we covered the basics.

819
00:43:11,180 --> 00:43:17,040
If you want to deal with billions of rows with
hundreds of thousands of transactions per second,

820
00:43:18,180 --> 00:43:20,040
it's not an easy topic, actually.

821
00:43:20,180 --> 00:43:22,040
Michael: Thanks so much, Nikolay.
Catch you next week.