1
00:00:00,060 --> 00:00:02,560
Michael: Hello and welcome to Postgres
FM, a weekly show about

2
00:00:02,560 --> 00:00:03,740
all things PostgreSQL.

3
00:00:03,740 --> 00:00:06,820
I am Michael, founder of pgMustard,
and I'm joined as usual by

4
00:00:06,820 --> 00:00:08,540
Nikolay, founder of Postgres.AI.

5
00:00:08,760 --> 00:00:09,440
Hey Nikolay.

6
00:00:09,780 --> 00:00:10,520
Nikolay: Hi Michael.

7
00:00:10,960 --> 00:00:15,200
Michael: And this week we have
not 1, not 2, but 3 special guests

8
00:00:15,200 --> 00:00:17,900
with us to celebrate episode number
100.

9
00:00:18,640 --> 00:00:21,980
And we asked you, our listeners,
for ideas for this episode.

10
00:00:21,980 --> 00:00:25,660
And we got some great suggestions,
including the topic of scaling

11
00:00:25,760 --> 00:00:28,140
to 100 terabytes for episode 100.

12
00:00:28,140 --> 00:00:28,820
So we thought that...

13
00:00:28,820 --> 00:00:29,800
Nikolay: And beyond.

14
00:00:30,060 --> 00:00:31,040
Michael: Oh, and beyond.

15
00:00:31,060 --> 00:00:32,260
Yes, always beyond.

16
00:00:32,500 --> 00:00:34,040
Yeah, no downtime allowed.

17
00:00:34,060 --> 00:00:37,660
We're delighted to welcome 3 people
who have successfully scaled

18
00:00:37,820 --> 00:00:39,440
Postgres exceptionally well.

19
00:00:39,440 --> 00:00:42,540
And those are Arka Ganguli from
Notion.

20
00:00:42,720 --> 00:00:43,420
Hey, Arka.

21
00:00:43,520 --> 00:00:44,540
Arka: Hey, good to be here.

22
00:00:44,540 --> 00:00:46,080
Really excited to talk about Postgres.

23
00:00:46,240 --> 00:00:49,020
Michael: And hey, Sammy, Sammy
Steele from Figma.

24
00:00:49,120 --> 00:00:50,660
Sammy: Thanks so much for having
me on.

25
00:00:50,800 --> 00:00:53,540
Michael: And finally, Derk van
Veen from Adyen.

26
00:00:53,960 --> 00:00:55,900
Derk: Yeah, thank you very much
for the invitation.

27
00:00:56,040 --> 00:00:56,780
I'm honored.

28
00:00:57,560 --> 00:00:59,760
Michael: We are honored to have
all 3 of you.

29
00:00:59,760 --> 00:01:01,300
So yeah, thank you all for joining.

30
00:01:01,700 --> 00:01:06,180
So we're gonna try a kind of panel
round-table-style discussion.

31
00:01:06,540 --> 00:01:09,760
I'll play good cop trying to keep
us on track, but you are all

32
00:01:09,760 --> 00:01:12,440
welcome to chime in with questions
for each other.

33
00:01:12,440 --> 00:01:14,360
And I know Nikolay will as well.

34
00:01:14,540 --> 00:01:17,580
So, let's start on the simpler
side of things.

35
00:01:18,340 --> 00:01:21,500
And starting with you, Derk, would
you mind letting us know what

36
00:01:21,500 --> 00:01:24,360
kind of things you're using Postgres
for at Adyen there?

37
00:01:24,620 --> 00:01:26,180
Derk: Well, that's an easy question.

38
00:01:26,280 --> 00:01:28,600
We keep our data within Postgres
databases.

39
00:01:28,940 --> 00:01:31,620
The thing is we are processing
financial transactions.

40
00:01:32,620 --> 00:01:35,440
And the thing about financial transactions
is you'd better not

41
00:01:35,440 --> 00:01:36,180
lose them.

42
00:01:36,180 --> 00:01:39,840
Because if you have a social media
kind of platform, right, and

43
00:01:39,840 --> 00:01:43,160
you refresh your page, and you're
missing a picture and refresh

44
00:01:43,180 --> 00:01:46,460
again, and the picture's there,
nobody really cares.

45
00:01:46,640 --> 00:01:50,500
But if you refresh your bank app
and your money is gone, then

46
00:01:50,500 --> 00:01:52,860
people, for some reason, they kind
of freak out.

47
00:01:52,860 --> 00:01:57,040
So yeah, we try to not lose a single
bit of data within our Postgres

48
00:01:57,040 --> 00:01:57,540
databases.

49
00:01:58,200 --> 00:02:01,040
Nikolay: So you don't use data
type money as well, right?

50
00:02:01,320 --> 00:02:01,820
Derk: No.

51
00:02:02,720 --> 00:02:03,220
Okay.

52
00:02:03,460 --> 00:02:03,960
Michael: Nice.

53
00:02:04,120 --> 00:02:04,900
Thanks so much.

54
00:02:04,900 --> 00:02:06,580
Sammy, how about at Figma?

55
00:02:06,780 --> 00:02:07,580
Sammy: Hi, everyone.

56
00:02:07,580 --> 00:02:13,180
I'm Sammy and at Figma, our platform
is an online web-based SaaS

57
00:02:13,180 --> 00:02:15,680
software that allows designers
to collaborate together.

58
00:02:16,020 --> 00:02:18,860
And you know, kind of the magic
secret sauce is this real-time

59
00:02:18,860 --> 00:02:22,060
platform where you can comment
on a design and it immediately

60
00:02:22,080 --> 00:02:23,500
shows up for other people.

61
00:02:23,520 --> 00:02:27,620
And so all of that core data is
powered by Postgres and it has

62
00:02:27,620 --> 00:02:31,460
to be really reliable, really fast
and low latency, so it can

63
00:02:31,460 --> 00:02:32,180
feel real-time.

64
00:02:32,780 --> 00:02:35,560
And we've grown 100X in the last
4 years.

65
00:02:35,640 --> 00:02:38,740
So that's where we're getting our
100 for this episode from.

66
00:02:39,060 --> 00:02:41,820
Michael: Nice, even more hundreds,
we like it.

67
00:02:41,840 --> 00:02:43,300
And Arka, how about Notion?

68
00:02:43,780 --> 00:02:47,000
Arka: Yeah, so unsurprisingly,
we also use Postgres to store

69
00:02:47,000 --> 00:02:50,840
our data and make sure that it
continues being there when our

70
00:02:50,840 --> 00:02:52,240
customers expect it to be.

71
00:02:52,240 --> 00:02:55,540
For those of you that don't know,
Notion is kind of like a collaborative

72
00:02:55,680 --> 00:02:56,180
workspace.

73
00:02:56,520 --> 00:03:00,920
You can use it for documents, wikis,
product management, and

74
00:03:00,920 --> 00:03:02,980
recently it has a lot of AI built in.

75
00:03:03,080 --> 00:03:06,020
So really it's like the place where
a lot of customers store

76
00:03:06,020 --> 00:03:09,820
their entire knowledge base and
run their company or their personal

77
00:03:10,080 --> 00:03:11,200
use cases as well.

78
00:03:11,200 --> 00:03:14,720
Nikolay: I just wanted to mention
that I'm a user of all 3 companies'

79
00:03:14,720 --> 00:03:17,900
products, because for example,
Notion, I'm a big fan of lists

80
00:03:17,900 --> 00:03:21,820
and so on, so it's a great tool
for that, for the kind of person

81
00:03:21,820 --> 00:03:22,540
I am.

82
00:03:23,100 --> 00:03:28,520
And Figma, I actually remember
I created a lot of interfaces

83
00:03:28,620 --> 00:03:29,620
in Figma myself.

84
00:03:29,620 --> 00:03:33,000
It's super cool, too, and especially
this real-time thing as

85
00:03:33,000 --> 00:03:33,500
well.

86
00:03:34,540 --> 00:03:38,180
And somehow I noticed Figma replaced
all the things I used in the

87
00:03:38,180 --> 00:03:38,680
past.

88
00:03:38,680 --> 00:03:42,120
And Adyen, I just told Derk before
we started recording that every

89
00:03:42,120 --> 00:03:46,080
time I go with my daughter to Legoland,
I see Adyen on the payment

90
00:03:46,080 --> 00:03:47,680
machine when I pay for something.

91
00:03:47,680 --> 00:03:50,140
And I think, oh, Postgres processing
is there.

92
00:03:50,140 --> 00:03:51,020
So it's cool.

93
00:03:51,220 --> 00:03:52,740
Thank you for coming once again.

94
00:03:52,740 --> 00:03:55,080
Michael: I love how humble you
all are, assuming people might

95
00:03:55,080 --> 00:03:57,120
not know who your companies are
as well.

96
00:03:57,120 --> 00:03:57,740
It's nice.

97
00:03:57,800 --> 00:04:01,860
Let's flip the order and let's
look at how you set things up

98
00:04:01,860 --> 00:04:02,580
Postgres-wise.

99
00:04:02,580 --> 00:04:05,900
So how things were in the past,
how things are set up now.

100
00:04:05,900 --> 00:04:09,000
I realize this might be quite
an in-depth answer, and any

101
00:04:09,000 --> 00:04:09,860
future plans.

102
00:04:09,880 --> 00:04:10,760
Perhaps as well.

103
00:04:10,760 --> 00:04:13,620
So maybe at the highest level,
if you can describe that, it would be

104
00:04:13,620 --> 00:04:14,120
awesome.

105
00:04:14,200 --> 00:04:14,700
Okay.

106
00:04:14,880 --> 00:04:16,400
We'll start on your side.

107
00:04:16,560 --> 00:04:17,060
Arka: Sure.

108
00:04:17,180 --> 00:04:21,040
So I joined Notion a little
bit over 3 years ago.

109
00:04:21,040 --> 00:04:24,900
And when I got there, we had actually
just finished like our

110
00:04:24,900 --> 00:04:28,740
first sharding effort, but essentially
at Notion, we run a pretty

111
00:04:28,740 --> 00:04:29,620
simple deployment.

112
00:04:30,040 --> 00:04:33,780
We use RDS, which is Amazon's managed
database offering.

113
00:04:34,200 --> 00:04:36,000
We run Postgres on that.

114
00:04:36,040 --> 00:04:39,780
And we run pgBouncer in front
of our RDS cluster for connection

115
00:04:39,800 --> 00:04:40,300
pooling.

116
00:04:40,640 --> 00:04:41,980
And that's pretty much it.

117
00:04:41,980 --> 00:04:45,560
We do have sharding, like I mentioned,
which basically is just

118
00:04:45,560 --> 00:04:47,780
like splitting up the data across
more databases.

119
00:04:48,520 --> 00:04:53,200
And the logic for how to talk to
the database is actually controlled

120
00:04:53,200 --> 00:04:53,940
by the application.

121
00:04:54,280 --> 00:04:58,440
So depending on some ID, we know
what database to send your query

122
00:04:58,440 --> 00:04:58,780
to.

123
00:04:58,780 --> 00:05:00,760
And yeah, that's kind of how it
is even today.

124
00:05:00,760 --> 00:05:05,140
We have a few more databases than
when I started, but the overall

125
00:05:05,140 --> 00:05:07,220
high level architecture is pretty
much the same.

126
00:05:07,600 --> 00:05:08,100
Michael: Nice.

127
00:05:08,160 --> 00:05:11,640
I've read both of your sharding
and resharding blog posts.

128
00:05:11,640 --> 00:05:13,260
I'll link both up in the show notes.

129
00:05:13,260 --> 00:05:15,060
But I'm guessing that will continue
to work.

130
00:05:15,060 --> 00:05:17,540
Any reason it won't continue to
work for the foreseeable?

131
00:05:18,180 --> 00:05:22,360
Arka: Yeah, I think our plans are
not really to move away from

132
00:05:22,360 --> 00:05:22,860
Postgres.

133
00:05:22,960 --> 00:05:25,760
It's worked really well for us
so far, and we've got a lot of

134
00:05:25,760 --> 00:05:28,500
people at the company now that
have built up a lot of knowledge

135
00:05:28,500 --> 00:05:31,220
around how to operate this thing
at scale.

136
00:05:31,220 --> 00:05:33,980
So definitely something we want
to stick with moving forward.

137
00:05:34,540 --> 00:05:37,620
Probably we do have to re-shard
in the future at some point,

138
00:05:37,660 --> 00:05:41,120
but in the longer term we're actually
looking at moving to more

139
00:05:41,120 --> 00:05:45,240
of like a cellular architecture
where we can kind of deploy maybe

140
00:05:45,240 --> 00:05:48,680
like different versions of Notion
to scale our capacity with

141
00:05:48,680 --> 00:05:51,920
like their own databases and caches
and things like that.

142
00:05:51,920 --> 00:05:55,520
So we don't have to continuously
spend our time re-sharding over

143
00:05:55,520 --> 00:05:59,200
and over again, because that's
a very time intensive and operationally

144
00:05:59,440 --> 00:06:02,320
intensive kind of thing for engineers
to take on.

145
00:06:02,620 --> 00:06:05,420
So we probably do have to re-shard
in the near future, but in

146
00:06:05,420 --> 00:06:08,620
the longer term, we hope our more
like cellular architecture

147
00:06:08,620 --> 00:06:10,440
will be able to scale much further.

148
00:06:10,920 --> 00:06:11,740
Michael: Super interesting.

149
00:06:12,560 --> 00:06:13,940
Sammy, let's go to you.

150
00:06:13,940 --> 00:06:16,980
I know you're sharding as well,
but slightly differently, I believe.

151
00:06:17,420 --> 00:06:20,860
Sammy: Yeah, well, we definitely
have a lot in common with what

152
00:06:20,860 --> 00:06:21,920
Arka just described.

153
00:06:22,360 --> 00:06:25,280
And so your blog posts were a great
inspiration for us as we

154
00:06:25,280 --> 00:06:27,380
were in the early days of horizontal
sharding.

155
00:06:27,680 --> 00:06:32,700
So similar high-level stacks, in
that we're also on RDS Postgres,

156
00:06:32,920 --> 00:06:36,360
Amazon's managed system, and we
also use pgBouncer for connection

157
00:06:36,360 --> 00:06:38,140
pooling, at least for now.

158
00:06:38,400 --> 00:06:42,780
And where we start diverging from
Notion is that I think our

159
00:06:42,860 --> 00:06:46,780
data model is somewhat more complex
and less naturally a good

160
00:06:46,780 --> 00:06:47,660
fit for sharding.

161
00:06:48,120 --> 00:06:52,160
My understanding is Notion had
shard IDs based on the orgs or

162
00:06:52,160 --> 00:06:54,360
the workspaces and that that worked
pretty well.

163
00:06:54,440 --> 00:06:57,940
But at Figma, we have a lot of
data not associated with an org

164
00:06:58,140 --> 00:07:01,400
and data moves quite frequently
between orgs, which makes that

165
00:07:01,400 --> 00:07:03,180
kind of sharding model quite hard.

166
00:07:03,460 --> 00:07:06,680
And so we actually ended up going
with a more tactical approach

167
00:07:06,760 --> 00:07:11,120
of, for each table, picking a set
of a few shard keys that would

168
00:07:11,120 --> 00:07:12,040
be more optimal.

169
00:07:12,600 --> 00:07:15,480
And we ended up building a proxy
layer so that instead of the

170
00:07:15,480 --> 00:07:19,080
application having to be aware
of shards, we actually have this

171
00:07:19,080 --> 00:07:22,340
DB proxy service we built, which
is a Golang service.

172
00:07:22,740 --> 00:07:26,660
And it's able to do all the routing
and handle all of the horizontal

173
00:07:26,660 --> 00:07:30,160
sharding, scatter gathers where
a query hits many shards, and

174
00:07:30,160 --> 00:07:31,740
then you aggregate results back.

175
00:07:31,840 --> 00:07:34,760
And so that's kind of a superpower
for us to have a lot more

176
00:07:34,760 --> 00:07:36,100
control in that layer.

177
00:07:36,580 --> 00:07:40,380
And long-term, we have sharded
our simplest, highest write rate

178
00:07:40,380 --> 00:07:43,580
tables, but we still have a ways
to go until everything is sharded.

179
00:07:43,580 --> 00:07:46,020
And we have to build out a lot
more infrastructure to support

180
00:07:46,020 --> 00:07:46,520
that.

181
00:07:46,640 --> 00:07:49,900
So unlike Notion, we actually expect
re-shard operations to be

182
00:07:49,900 --> 00:07:50,880
quite common.

183
00:07:50,980 --> 00:07:54,320
And so our goal is like 1 click
failover where you can just hit

184
00:07:54,320 --> 00:07:57,180
a button and then have a re-shard
operation happen transparently

185
00:07:57,240 --> 00:07:58,000
in the background.

186
00:07:58,520 --> 00:08:00,800
Nikolay: Sounds like logical replication
involved, but maybe

187
00:08:00,800 --> 00:08:02,060
we'll talk about that later.

188
00:08:02,320 --> 00:08:04,760
Derk: I'm so interested in this
proxy layer.

189
00:08:04,760 --> 00:08:06,540
Did you build it yourself in-house?

190
00:08:07,200 --> 00:08:07,900
Sammy: We did.

191
00:08:07,960 --> 00:08:08,240
Yeah.

192
00:08:08,240 --> 00:08:10,160
So it's a Golang service.

193
00:08:10,460 --> 00:08:11,880
So it has a couple of components.

194
00:08:11,980 --> 00:08:16,000
It has a query engine, which has
a parsing layer that takes in

195
00:08:16,000 --> 00:08:18,300
raw SQL and transforms it to an
AST.

196
00:08:18,480 --> 00:08:22,200
And that part we did borrow CockroachDB's
Postgres parser, which

197
00:08:22,200 --> 00:08:26,040
has mostly worked well for us with
a few modifications, but then

198
00:08:26,040 --> 00:08:28,940
all of the logic for routing, we have like
a topology layer that we've

199
00:08:28,940 --> 00:08:32,180
built that knows where all the
databases are and what shard keys

200
00:08:32,180 --> 00:08:33,200
map to what tables.

201
00:08:33,240 --> 00:08:36,000
And so the query engine really
does all that evaluation and query

202
00:08:36,000 --> 00:08:36,500
rewriting.

203
00:08:36,900 --> 00:08:39,440
Derk: Yeah, that sounds pretty
mind-blowing to me.

204
00:08:39,440 --> 00:08:40,360
It's really cool.

205
00:08:41,120 --> 00:08:43,380
Nikolay: Yeah, there are a couple
of projects like PgCat and

206
00:08:43,380 --> 00:08:44,960
SPQR which

207
00:08:44,960 --> 00:08:45,540
Sammy: do that,

208
00:08:45,540 --> 00:08:47,960
Nikolay: but maybe when you started
they were at a very early stage,

209
00:08:47,960 --> 00:08:48,240
right?

210
00:08:48,240 --> 00:08:51,540
Sammy: Yes, actually to clarify,
so we wish we had named this

211
00:08:51,540 --> 00:08:55,560
something besides dbproxy, because
PgCat would not be a replacement

212
00:08:55,560 --> 00:08:59,820
for what we have, because we actually
might use PgCat to replace

213
00:08:59,960 --> 00:09:03,000
pgBouncer, because we are hitting
some pgBouncer scaling problems.

214
00:09:03,000 --> 00:09:05,860
I'd be curious if you guys are
also running into challenges.

215
00:09:06,340 --> 00:09:10,400
But dbproxy is really, it's a stateless
service that talks to

216
00:09:10,400 --> 00:09:11,460
every single database.

217
00:09:11,680 --> 00:09:14,840
And so a request gets routed to
it, and it has a connection pooling

218
00:09:14,840 --> 00:09:17,840
layer which talks to a pgBouncer
or PgCat-like layer that is

219
00:09:17,840 --> 00:09:19,340
specific to a single host.

220
00:09:19,340 --> 00:09:22,320
So it's really playing more of
a high-level orchestrator role.

221
00:09:23,420 --> 00:09:23,940
Michael: So cool.

222
00:09:23,940 --> 00:09:27,460
And I love that we've already got
2 different approaches to sharding,

223
00:09:27,560 --> 00:09:30,780
1 that's transparent to the application
and the developers, and

224
00:09:30,780 --> 00:09:31,500
1 that's not.

225
00:09:31,500 --> 00:09:34,900
And I think we're about to get
a third answer from Derk's side.

226
00:09:34,900 --> 00:09:38,360
How are things set up at Adyen
and past, present, future?

227
00:09:39,160 --> 00:09:40,700
Derk: Let's start with the past.

228
00:09:40,920 --> 00:09:45,560
I think, well, it was
by far our most well-

229
00:09:45,560 --> 00:09:47,340
read blog article.

230
00:09:47,380 --> 00:09:50,340
And that's how to upgrade a 50
terabyte database.

231
00:09:51,040 --> 00:09:54,520
And when looking back on it, it's
quite a while ago.

232
00:09:54,520 --> 00:09:58,860
So I think 2 years ago, we had
an internal event and it was like,

233
00:09:58,860 --> 00:10:01,820
no, it's not how you upgrade a
50 terabyte database.

234
00:10:01,820 --> 00:10:04,780
It's how you upgrade a terabyte
database with multiple zeros

235
00:10:04,780 --> 00:10:05,580
at the end.

236
00:10:05,580 --> 00:10:07,240
Nikolay: At least 2, I guess, right?

237
00:10:07,260 --> 00:10:09,860
Derk: Yeah, we have 2 zeros at
the end.

238
00:10:09,960 --> 00:10:13,520
Nikolay: And not just upgrade,
but 0 downtime upgrade, or like,

239
00:10:13,520 --> 00:10:16,280
I just want to clarify, because
a regular upgrade is not a super

240
00:10:16,280 --> 00:10:19,840
big problem if you can afford a
couple of hours of downtime.

241
00:10:19,840 --> 00:10:24,220
Derk: Yeah, we tried to think about
ordinary upgrades, but the

242
00:10:24,220 --> 00:10:27,380
rough calculation was it would take
us like 4 or 5 weeks of downtime,

243
00:10:27,380 --> 00:10:29,940
which didn't fit the bill.

244
00:10:30,020 --> 00:10:31,040
Nikolay: With hard links?

245
00:10:31,620 --> 00:10:33,040
Derk: No, that 1 is much faster.

246
00:10:33,040 --> 00:10:36,540
Fortunately, I think we limited
it to 15 minutes of downtime.

247
00:10:37,000 --> 00:10:40,680
And we already had a big fight
over these 15 minutes, but it had

248
00:10:40,680 --> 00:10:41,400
to happen.

249
00:10:41,520 --> 00:10:44,720
And it's not even a database
downtime, but it's more like

250
00:10:44,720 --> 00:10:48,340
all the infrastructure around it
to get it fixed, which required

251
00:10:48,400 --> 00:10:48,900
downtime.

252
00:10:49,060 --> 00:10:52,540
If you look purely at Postgres,
I think it was minutes, but it's

253
00:10:52,540 --> 00:10:53,540
not just a database.

254
00:10:53,800 --> 00:10:55,080
Nikolay: I know this problem very
well.

255
00:10:55,080 --> 00:10:55,580
So

256
00:10:55,680 --> 00:10:56,680
Derk: Yeah, that's

257
00:10:56,680 --> 00:10:57,540
Nikolay: everything around.

258
00:10:57,920 --> 00:11:01,040
It's more and more minutes and
then sometimes hours, right?

259
00:11:01,160 --> 00:11:01,620
Sammy: Yeah.

260
00:11:01,620 --> 00:11:04,400
I'm curious, what's the biggest
individual table or do you have

261
00:11:04,400 --> 00:11:06,980
partitioning under the hood there
to have so many terabytes on

262
00:11:06,980 --> 00:11:07,700
1 box?

263
00:11:08,240 --> 00:11:09,940
Derk: What is the biggest table
we have?

264
00:11:09,940 --> 00:11:13,220
I don't know, but the biggest partition
is over 30 terabytes,

265
00:11:13,780 --> 00:11:15,060
which is a single partition.

266
00:11:15,060 --> 00:11:15,400
Sammy: Wow.

267
00:11:15,400 --> 00:11:17,820
And you don't run into vacuuming
issues with that?

268
00:11:18,640 --> 00:11:22,600
Derk: I think we are the most skilled
company when it comes to

269
00:11:22,600 --> 00:11:24,500
vacuum and how to tweak vacuum.

270
00:11:25,520 --> 00:11:27,940
Nikolay: And rebuild your indexes,
which holds back the xmin horizon,

271
00:11:28,360 --> 00:11:28,860
right?

272
00:11:29,380 --> 00:11:31,660
Derk: Yeah, we have seen many vacuum
issues.

273
00:11:31,720 --> 00:11:34,540
When I started to work with Postgres
and I was installing this

274
00:11:34,540 --> 00:11:38,160
database and transaction wraparound
was some kind of weird theoretical

275
00:11:38,440 --> 00:11:38,940
thing.

276
00:11:39,280 --> 00:11:42,260
And then I joined Adyen, and then it's
something you do every week,

277
00:11:42,500 --> 00:11:43,200
at least.

278
00:11:43,320 --> 00:11:47,020
Nikolay: So I yesterday asked My
Twitter audience, what would

279
00:11:47,020 --> 00:11:50,420
you ask people who manage many,
many, many terabytes?

280
00:11:51,020 --> 00:11:53,660
And 1 of the questions was, how
are you?

281
00:11:55,080 --> 00:11:59,860
I guess this is the case when you
have problems like 30 terabytes

282
00:11:59,860 --> 00:12:00,860
on 1 table.

283
00:12:02,220 --> 00:12:05,020
How long is the index creation
time or rebuild time?

284
00:12:05,020 --> 00:12:06,400
It's like days or?

285
00:12:07,360 --> 00:12:08,260
Yeah, days.

286
00:12:08,540 --> 00:12:09,040
Days.

287
00:12:09,280 --> 00:12:11,860
During which the xmin horizon is held back,
right?

288
00:12:12,040 --> 00:12:17,220
So we accumulate bloat in the whole
database for all tables, right?

289
00:12:17,740 --> 00:12:19,440
It's a super annoying problem.
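
(A generic way to see what is holding the xmin horizon back during such a
rebuild; long-running backends and replication slots are the usual suspects.
This is illustrative, not any guest's production query.)

    SELECT pid, state, backend_xmin, xact_start
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC
    LIMIT 5;
    -- replication slots can hold the horizon back too:
    SELECT slot_name, xmin, catalog_xmin FROM pg_replication_slots;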

290
00:12:19,440 --> 00:12:21,280
Derk: But most of these tables
are partitions.

291
00:12:21,280 --> 00:12:24,140
So first we create an index per
partition, and then finally in

292
00:12:24,140 --> 00:12:26,460
the end, we create the index on
the parent.

293
00:12:26,520 --> 00:12:28,760
Otherwise we don't make it before
we hit wraparound.
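
(The pattern Derk describes maps to standard Postgres DDL since version 11;
table and index names here are invented.)

    -- 1. create a "shell" index on the parent only; it starts out invalid:
    CREATE INDEX payments_created_idx ON ONLY payments (created_at);
    -- 2. build each partition's index without a long blocking lock:
    CREATE INDEX CONCURRENTLY payments_p1_created_idx
        ON payments_p1 (created_at);
    -- 3. attach it; the parent index becomes valid once every partition
    --    has been attached:
    ALTER INDEX payments_created_idx ATTACH PARTITION payments_p1_created_idx;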

294
00:12:29,060 --> 00:12:29,840
Nikolay: Oh, wraparound.

295
00:12:30,060 --> 00:12:33,520
It's also a problem, but I'm also talking
about vacuum and bloat

296
00:12:33,520 --> 00:12:35,460
problems, so multiple problems
here.

297
00:12:35,920 --> 00:12:37,840
Well, that's super interesting
experience.

298
00:12:37,840 --> 00:12:38,900
Derk: It keeps us busy.

299
00:12:39,380 --> 00:12:43,200
But it's also, it's an honor to
have these problems, but I think

300
00:12:43,200 --> 00:12:46,800
all 3 of us got these big databases
because the company is very successful.

302
00:12:47,780 --> 00:12:48,280
Nikolay: Right.

303
00:12:48,540 --> 00:12:51,340
Derk: And I think at the end of
the day, having a big database

304
00:12:51,420 --> 00:12:52,900
is actually a design problem.

305
00:12:53,360 --> 00:12:57,660
You forgot to design in time for
the size you need to accommodate.

306
00:12:57,880 --> 00:12:59,040
Sammy: See, I actually disagree.

307
00:12:59,040 --> 00:13:01,560
I think it's a good, it's an expected
problem.

308
00:13:01,560 --> 00:13:05,860
If you design a company too early
to scale, then you're not gonna

309
00:13:05,860 --> 00:13:08,420
have a company probably if you're
trying to solve those problems

310
00:13:08,420 --> 00:13:10,140
when you're 5 or 10 people.

311
00:13:10,760 --> 00:13:13,660
I think 1 of the cool things about
Postgres is all of our companies

312
00:13:13,660 --> 00:13:15,480
did get very far on 1 host.

313
00:13:15,480 --> 00:13:18,820
Like Figma existed for 7 or 8 years
and was pretty successful

314
00:13:18,820 --> 00:13:21,020
before we had to actually start
scaling out.

315
00:13:21,760 --> 00:13:23,140
Derk: That's definitely true.

316
00:13:23,300 --> 00:13:24,440
Nikolay: I agree with that.

317
00:13:24,520 --> 00:13:28,140
And my team helped a couple of
companies who went to IPO being

318
00:13:28,140 --> 00:13:33,080
like on a single Postgres cluster
and they got evaluation, thousands

319
00:13:33,080 --> 00:13:37,660
of billions of dollars having 1
cluster and this was impressive.

320
00:13:38,420 --> 00:13:44,020
Of course, later it should be changed,
but yeah, so I agree: with

321
00:13:44,540 --> 00:13:48,820
1 cluster you can scale a lot,
but I would not like to be in

322
00:13:48,820 --> 00:13:53,720
your shoes dealing with multi-day
index rebuild and also transaction

323
00:13:53,720 --> 00:13:54,940
ID wraparound problems.

324
00:13:55,600 --> 00:13:57,880
So how do you escape from these
problems today?

325
00:13:57,980 --> 00:14:02,980
Like 30 terabytes in 1 partition,
it's quite difficult, right?

326
00:14:02,980 --> 00:14:07,580
So is it to be split into
smaller partitions, or...

327
00:14:07,660 --> 00:14:10,020
Derk: Yeah, I'm thinking about
how to answer this question.

328
00:14:10,160 --> 00:14:14,880
I think before I joined Adyen,
I tried to manage my database

329
00:14:14,960 --> 00:14:18,900
in a proper way to do it by the
book and by the rules and everything

330
00:14:18,900 --> 00:14:19,580
was fine.

331
00:14:20,860 --> 00:14:25,880
And Adyen is pushing me to find
which boundaries can be bent

332
00:14:25,960 --> 00:14:29,480
or violated without too much risk.

333
00:14:30,040 --> 00:14:31,560
I think that's the way to put it.

334
00:14:31,560 --> 00:14:34,700
We still are on the safe side,
but with more knowledge, you're

335
00:14:34,700 --> 00:14:38,260
better able to decide when do I
cross a boundary and when is

336
00:14:38,260 --> 00:14:41,700
it really dangerous and when is
it safe to do, but not strictly

337
00:14:41,760 --> 00:14:42,260
advised.

338
00:14:43,480 --> 00:14:45,880
Sammy: I think that's definitely
one of the benefits of staying

339
00:14:45,880 --> 00:14:48,680
on a system like Postgres that
you end up understanding really

340
00:14:48,680 --> 00:14:52,560
well: all these limits are extremely
specific to the workloads

341
00:14:52,660 --> 00:14:55,640
and the tables and even what works
for one table at Figma doesn't

342
00:14:55,640 --> 00:14:56,420
work for another.

343
00:14:56,420 --> 00:14:59,220
And so there is a lot of just accumulated
organizational knowledge

344
00:14:59,220 --> 00:15:02,040
that makes it easier to run these
systems at scale and understand

345
00:15:02,040 --> 00:15:04,220
actually at what point do things
get scary.

346
00:15:04,860 --> 00:15:05,720
Nikolay: Yeah, I agree.

347
00:15:06,040 --> 00:15:11,920
If, for example, index creation
takes one day, but our real transaction

348
00:15:11,940 --> 00:15:14,440
ID doesn't grow a lot, it's not
a big problem.

349
00:15:14,440 --> 00:15:18,340
But if it grows very fast, then
we have two kinds of problems.

350
00:15:18,340 --> 00:15:20,920
So transaction ID wraparound and
bloat accumulated.

351
00:15:21,360 --> 00:15:21,820
I agree.

352
00:15:21,820 --> 00:15:25,020
So it's very specific to particular
workloads.

353
00:15:25,680 --> 00:15:28,780
Michael: I want to go back to Derk
quickly because you've done

354
00:15:28,780 --> 00:15:31,780
such a great series of blog posts
on partitioning.

355
00:15:32,300 --> 00:15:33,240
I'll share those.

356
00:15:33,240 --> 00:15:37,180
I've reread them today and part
3 promises a part 4.

357
00:15:37,180 --> 00:15:41,180
So I'm hopeful of getting
another installment at some point,

358
00:15:41,200 --> 00:15:44,440
but yeah, would you say
partitioning is the big thing, like

359
00:15:44,440 --> 00:15:49,500
the way you've largely scaled there
or are there other strategies?

360
00:15:49,600 --> 00:15:51,260
Like, are you moving data out?

361
00:15:51,260 --> 00:15:52,360
Like how are you dealing with it?

362
00:15:52,360 --> 00:15:54,560
And how are you planning to in
the future?

363
00:15:55,160 --> 00:15:58,740
Derk: Yeah, I think we started
with horizontal sharding and then

364
00:15:58,740 --> 00:16:03,620
within every shard, tables are
partitioned, but financial transactions,

365
00:16:03,620 --> 00:16:06,760
you need to be able to refund them
for a very, very long time.

366
00:16:06,780 --> 00:16:09,900
So it takes quite a long time before
you can actually archive

367
00:16:10,080 --> 00:16:10,580
data.

368
00:16:10,840 --> 00:16:14,140
So we have a lot of old data, which
is still around, but yeah,

369
00:16:14,140 --> 00:16:18,520
these partitions kind of shift
more and more to the back of the row

370
00:16:18,520 --> 00:16:22,700
and vacuum is much easier if you
don't change a lot on a partition.

371
00:16:23,860 --> 00:16:26,800
But at the moment we are actually
reaching the limits of our

372
00:16:26,800 --> 00:16:28,040
sharding solution again.

373
00:16:28,540 --> 00:16:32,380
So we basically start redesigning
the entire system again.

374
00:16:32,580 --> 00:16:33,980
Arka: You know, that's really interesting.

375
00:16:34,120 --> 00:16:36,980
Cause at Notion, like, it's interesting
that you chose partitioning in

376
00:16:37,480 --> 00:16:40,760
Postgres as like your thing that
like got you really far because

377
00:16:40,760 --> 00:16:44,640
at Notion we actually ran into
I don't know like really weird

378
00:16:44,640 --> 00:16:48,900
issues with dropping indexes on
partitioned tables. But that's

379
00:16:48,900 --> 00:16:51,300
because I think we were running
on like, like this database was

380
00:16:51,300 --> 00:16:52,360
running on like Postgres 12.

381
00:16:52,360 --> 00:16:55,080
And I think a lot of these things
may have been fixed in the

382
00:16:55,080 --> 00:16:55,880
future releases.

383
00:16:56,420 --> 00:16:58,780
But we actually went the opposite
way where we're like, we don't

384
00:16:58,780 --> 00:17:01,100
want Postgres partitions, let us
manage our own.

385
00:17:01,720 --> 00:17:04,820
Because I don't know, there's some
weird behavior we ran into

386
00:17:04,820 --> 00:17:07,720
when creating and dropping indexes,
especially because some things

387
00:17:07,720 --> 00:17:10,160
you can't do concurrently on partitioned
tables.

388
00:17:10,440 --> 00:17:13,880
Derk: Yeah, dropping indexes is
a nightmare.

389
00:17:14,640 --> 00:17:18,740
I always tell my developers, you
can create indexes on a partitioned

390
00:17:18,740 --> 00:17:21,980
table, but if you want to drop
them, I come after you.

391
00:17:23,940 --> 00:17:24,440
Arka: Yeah.

392
00:17:25,680 --> 00:17:26,460
Fun problem.

393
00:17:27,500 --> 00:17:30,660
Michael: Actually, Derk, you said
something earlier about knowing

394
00:17:30,660 --> 00:17:32,980
which rules you can kind of bend
or break.

395
00:17:33,260 --> 00:17:37,560
One of the ones you mentioned in
those posts was around adding

396
00:17:37,560 --> 00:17:38,380
check constraints.

397
00:17:38,680 --> 00:17:41,880
And so adding them in a NOT VALID
state, but then not running

398
00:17:41,880 --> 00:17:42,800
VALIDATE afterwards.

399
00:17:42,800 --> 00:17:45,560
I think you mentioned just updating
the system catalog.

400
00:17:45,860 --> 00:17:47,760
And that's such a cool trick.

401
00:17:47,760 --> 00:17:48,720
But like, how?

402
00:17:48,800 --> 00:17:49,960
Yeah, I guess you have to make

403
00:17:49,960 --> 00:17:50,610
Nikolay: ...yourself comfortable.

404
00:17:50,610 --> 00:17:54,320
We used this trick like 15 years
ago when VALIDATE was not an

405
00:17:54,320 --> 00:17:54,720
option.

406
00:17:54,720 --> 00:17:58,880
So we used something like that,
but it was like an undocumented

407
00:17:59,220 --> 00:17:59,720
thing.

408
00:18:00,100 --> 00:18:00,800
Don't do it.

409
00:18:00,800 --> 00:18:06,740
Like if you're, if you're not owning
the database... For example, if

410
00:18:06,740 --> 00:18:09,600
you're a consultant, this is not
something you should recommend,

411
00:18:09,600 --> 00:18:11,320
because who knows what will happen
next.

412
00:18:11,320 --> 00:18:14,920
If you own a database, if you work
inside the company, it's probably

413
00:18:14,920 --> 00:18:15,420
okay.

414
00:18:15,560 --> 00:18:16,520
It's not official.
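
(For context, the trick under discussion looks roughly like this; the catalog
update is the undocumented part Nikolay is warning about, and the table and
constraint names are invented.)

    ALTER TABLE payments
        ADD CONSTRAINT payments_amount_positive
        CHECK (amount > 0) NOT VALID;  -- fast, takes only a brief lock
    -- the documented follow-up, a full scan without blocking writes:
    ALTER TABLE payments VALIDATE CONSTRAINT payments_amount_positive;
    -- the rule-bending shortcut instead of VALIDATE, only if you already
    -- know the existing data is clean:
    UPDATE pg_constraint
    SET convalidated = true
    WHERE conrelid = 'payments'::regclass
      AND conname = 'payments_amount_positive';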

415
00:18:16,560 --> 00:18:20,200
By the way, I wanted to emphasize:
the problems we just started touching,

416
00:18:20,580 --> 00:18:23,760
most people who listen to us don't
have, right?

417
00:18:23,760 --> 00:18:25,640
Because these are like extreme problems.

418
00:18:26,120 --> 00:18:27,840
Derk: I'm not completely sure.

419
00:18:27,840 --> 00:18:32,640
I think partitioning becomes useful
for a lot of people way before

420
00:18:32,640 --> 00:18:34,740
you hit the limit where you have
to.

421
00:18:34,860 --> 00:18:37,640
So if you're listening to this
podcast, and you think like, yeah,

422
00:18:37,640 --> 00:18:40,920
partitioning is too far away, just
think again, because if you

423
00:18:40,920 --> 00:18:43,500
do it now, when you have all the
time in the world to figure

424
00:18:43,500 --> 00:18:47,440
it out, find a good strategy, Then
you don't end up with a huge

425
00:18:47,440 --> 00:18:50,880
partition, a small partition, weirdly
partitioning things.

426
00:18:51,600 --> 00:18:53,460
So I would say just start in time.

427
00:18:53,560 --> 00:18:56,380
Nikolay: I'm on the same page with
you here because you mentioned

428
00:18:56,380 --> 00:18:59,680
this rule, like 100 gigabytes as a
threshold for when you need to start

429
00:18:59,680 --> 00:19:00,180
partitioning.

430
00:19:00,660 --> 00:19:02,920
Same rule we also apply everywhere.

431
00:19:03,520 --> 00:19:06,920
If you exceed 100 gigabytes, it's
time to partition, I agree.
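
(For anyone taking that advice early, the declarative starting point is
small; names here are illustrative.)

    CREATE TABLE events (
        id bigint GENERATED ALWAYS AS IDENTITY,
        created_at timestamptz NOT NULL,
        payload jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024_06 PARTITION OF events
        FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');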

432
00:19:07,540 --> 00:19:08,040
Sammy: Interesting.

433
00:19:08,080 --> 00:19:10,740
We have a lot of tables larger
than that that aren't partitioned,

434
00:19:10,840 --> 00:19:14,980
but probably about a terabyte is
what we aim for max for our

435
00:19:14,980 --> 00:19:18,940
largest shards, although we have
a 116 terabyte table today.

436
00:19:19,300 --> 00:19:22,920
But another interesting topic that
I think is more relevant to

437
00:19:22,920 --> 00:19:24,180
people is upgrades.

438
00:19:25,240 --> 00:19:28,260
We have built this no-downtime
tooling that we've mostly used

439
00:19:28,260 --> 00:19:31,500
for re-shard and horizontal sharding
operations, but more recently

440
00:19:31,500 --> 00:19:34,840
we've been using it for no-downtime
major version upgrades with

441
00:19:34,840 --> 00:19:37,440
the ability to roll back if you
run into problems.

442
00:19:37,840 --> 00:19:40,580
And so I think that's something
that could be interesting for

443
00:19:40,640 --> 00:19:41,940
anyone who's running Postgres.

444
00:19:42,120 --> 00:19:44,000
Upgrades are always a hard thing
to do.

445
00:19:44,280 --> 00:19:47,960
Nikolay: Yeah, by the way, we forgot
to mention that Adyen is

446
00:19:47,960 --> 00:19:49,220
self-managed.

447
00:19:49,240 --> 00:19:50,780
It's not on RDS, right?

448
00:19:51,060 --> 00:19:52,320
Because it's very different.

449
00:19:53,160 --> 00:19:53,660
Yeah.

450
00:19:54,340 --> 00:19:59,580
I'm very curious how you did this
for the RDS case, not using Blue-Green

451
00:19:59,580 --> 00:20:00,540
deployments, right?

452
00:20:00,540 --> 00:20:01,600
Just fully...

453
00:20:03,820 --> 00:20:04,340
Sammy: At least today...

454
00:20:04,340 --> 00:20:07,840
Well, Blue-Green only very recently
became available for Postgres.

455
00:20:08,140 --> 00:20:09,840
Last year it was only really MySQL.

456
00:20:10,520 --> 00:20:13,300
And so we'll probably explore it
going forward.

457
00:20:13,380 --> 00:20:16,260
1 blocker, which actually we're talking
to the RDS team about today, is

458
00:20:16,260 --> 00:20:17,120
around rollback.

459
00:20:17,640 --> 00:20:18,140
Exactly.

460
00:20:18,760 --> 00:20:20,120
There is no way to roll back.

461
00:20:20,360 --> 00:20:20,740
Nikolay: And that

462
00:20:20,740 --> 00:20:21,060
Sammy: is the main thing.

463
00:20:21,060 --> 00:20:21,460
Nikolay: That's what's wrong with

464
00:20:21,460 --> 00:20:22,960
the Blue-Green deployments idea.

465
00:20:22,960 --> 00:20:23,980
It should be symmetric.

466
00:20:24,120 --> 00:20:27,780
And when you switch over, reverse
replication should be installed

467
00:20:27,860 --> 00:20:28,360
immediately.

468
00:20:30,080 --> 00:20:32,660
Without this, you don't have a rollback
plan, right?

469
00:20:32,860 --> 00:20:33,360
Sammy: Exactly.

470
00:20:33,820 --> 00:20:35,100
It means data loss.

471
00:20:35,220 --> 00:20:37,480
Yeah, so that's what our failover
operation does.

472
00:20:37,480 --> 00:20:40,520
I think it's also nice to have
more control when you do it locally

473
00:20:40,520 --> 00:20:43,920
because we're able to, for instance,
do a no-downtime replica

474
00:20:43,940 --> 00:20:44,440
failover.

475
00:20:44,680 --> 00:20:48,160
And because most of our P0 critical
workflows read from replicas,

476
00:20:48,260 --> 00:20:51,180
that means that most failovers
don't actually incur any downtime.

477
00:20:51,300 --> 00:20:53,940
Whereas, you know, if you're using
a managed thing like Blue-Green,

478
00:20:53,940 --> 00:20:55,340
you lose that kind of control.

479
00:20:55,760 --> 00:20:58,520
Nikolay: Yeah, well, I'm super
curious about details here, but

480
00:20:58,520 --> 00:21:00,360
maybe it should be a separate discussion.

481
00:21:00,360 --> 00:21:04,240
And first of all, I wanted to thank
all of you.

482
00:21:04,400 --> 00:21:07,020
We invited you because you shared
really great posts.

483
00:21:07,660 --> 00:21:10,460
So this knowledge sharing is super
important.

484
00:21:10,760 --> 00:21:13,940
So I would be happy to see more
and more coming.

485
00:21:13,940 --> 00:21:19,100
And maybe if we talk more and exchange
ideas, we could collect

486
00:21:19,120 --> 00:21:22,480
better materials, for example,
how to do 0 downtime upgrades,

487
00:21:22,480 --> 00:21:25,200
because I know a very good recipe
for self-managed Postgres, but

488
00:21:25,200 --> 00:21:29,080
I don't know for RDS, because they
don't allow you to control

489
00:21:29,540 --> 00:21:30,800
recovery_target_lsn.

490
00:21:31,160 --> 00:21:34,820
So I'm very curious to talk about
details, but maybe in a separate

491
00:21:34,820 --> 00:21:35,320
discussion.

492
00:21:35,900 --> 00:21:39,480
Because I know Michael has a different
plan, not to talk for half

493
00:21:39,480 --> 00:21:40,700
an hour about upgrades.

494
00:21:41,660 --> 00:21:45,140
Michael: Well, maybe we could actually
do a quick lightning round

495
00:21:45,180 --> 00:21:49,240
of which major version each of
you are currently running, if

496
00:21:49,240 --> 00:21:51,000
you know and are happy to share.

497
00:21:51,140 --> 00:21:52,940
Nikolay: Is it a single version
only?

498
00:21:53,740 --> 00:21:54,900
Michael: Also a good question.

499
00:21:55,520 --> 00:21:57,320
Sammy, are you happy to share that
1?

500
00:21:57,740 --> 00:21:59,920
Sammy: Sure, I can definitely talk
about that.

501
00:21:59,920 --> 00:22:03,540
Well, we are newly, as of 2 months
ago, fully on PG-13.

502
00:22:04,440 --> 00:22:08,560
So we had a couple of PG-11, very
legacy boxes.

503
00:22:08,560 --> 00:22:11,520
They were some of our earliest,
largest databases, and that

504
00:22:11,520 --> 00:22:14,180
actually was the main motivation
for building out this upgrade

505
00:22:14,180 --> 00:22:16,420
workflow, was to get those into
PG-13.

506
00:22:17,960 --> 00:22:20,380
And we would love to use some of
the features that are available

507
00:22:20,380 --> 00:22:21,880
in PG-15, PG-16.

508
00:22:22,420 --> 00:22:25,400
And so we're probably going to
look into fleet-wide upgrades

509
00:22:25,400 --> 00:22:27,860
sometime next year to get to a
newer version.

510
00:22:28,080 --> 00:22:31,320
But this is where one-click
upgrades or failovers really

511
00:22:31,320 --> 00:22:34,000
matter, because today it's quite
toilsome and it would be pretty

512
00:22:34,000 --> 00:22:35,620
painful to upgrade the whole fleet.

513
00:22:36,060 --> 00:22:36,880
Michael: Yeah, absolutely.

514
00:22:37,260 --> 00:22:38,340
How about you, Arka?

515
00:22:38,800 --> 00:22:42,880
Arka: Yeah, we're actually
like 90% on Postgres 15.

516
00:22:42,880 --> 00:22:44,220
And it is my life's mission

517
00:22:44,220 --> 00:22:44,340
to

518
00:22:44,340 --> 00:22:46,580
get us that final 10% over.

519
00:22:47,320 --> 00:22:49,900
Yeah, like most of our databases
are actually on Postgres 15.

520
00:22:49,900 --> 00:22:52,220
We finished that upgrade last year.

521
00:22:52,640 --> 00:22:55,320
And yeah, we really wanted some
of the features that I think

522
00:22:55,320 --> 00:22:59,020
got introduced maybe after Postgres
14, especially around being

523
00:22:59,020 --> 00:23:01,560
able to filter the replication
stream because that's something

524
00:23:01,560 --> 00:23:04,500
we want to use in the future, probably
if we want to re-shard,

525
00:23:04,540 --> 00:23:06,840
that'll be a very useful thing
for us.
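
(The feature Arka is most likely referring to is the row filtering that
logical replication gained in Postgres 15; a hypothetical sketch of using it
to peel part of a table off to a new shard:)

    -- on the source, publish only the rows that belong on the new shard:
    CREATE PUBLICATION move_shard FOR TABLE blocks
        WHERE (shard_key % 4 = 0);
    -- on the target:
    CREATE SUBSCRIPTION move_shard_sub
        CONNECTION 'host=old-primary dbname=app'
        PUBLICATION move_shard;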

526
00:23:06,880 --> 00:23:09,140
And just wanted to touch on Sammy's
point a little bit.

527
00:23:09,140 --> 00:23:12,980
We actually use the exact same
kind of, I think, failover that

528
00:23:12,980 --> 00:23:17,780
you guys use as well at Figma based
on pgBouncer and with the

529
00:23:17,780 --> 00:23:18,920
reversibility as well.

530
00:23:18,920 --> 00:23:21,840
And that's something that has been
really, really surprisingly

531
00:23:22,040 --> 00:23:22,540
amazing.

532
00:23:22,740 --> 00:23:25,020
Nikolay: Yeah, you mean pgBouncer
pause-resume.

533
00:23:25,460 --> 00:23:29,180
I guess this is why you don't use
RDS Proxy while being on RDS, right?

534
00:23:29,180 --> 00:23:30,640
Because it doesn't have pause-resume.

535
00:23:31,160 --> 00:23:32,040
Yeah, that's great.

536
00:23:32,040 --> 00:23:35,720
I also was surprised last year
that it can handle really heavy

537
00:23:35,720 --> 00:23:36,220
workloads.

538
00:23:36,820 --> 00:23:37,100
Yeah.

539
00:23:37,100 --> 00:23:42,380
Just a few seconds spike of latency,
you can upgrade and switch

540
00:23:42,380 --> 00:23:44,340
over to a different primary.

541
00:23:44,340 --> 00:23:46,640
Arka: Yeah, it was one of those things
where when we were writing

542
00:23:46,640 --> 00:23:48,840
out the tech spec for it, I was
like, is this thing going to

543
00:23:48,840 --> 00:23:49,900
actually work?

544
00:23:50,080 --> 00:23:52,260
Nikolay: The problem with this,
by the way, there are no good

545
00:23:52,260 --> 00:23:53,660
materials about this.

546
00:23:53,840 --> 00:23:57,100
Proving like, nobody says, like
it's like a public secret.

547
00:23:57,720 --> 00:24:00,840
Nobody talks about it somehow,
but it works really well.

548
00:24:00,860 --> 00:24:04,340
So we need more materials proving
that pause-resume works well

549
00:24:04,340 --> 00:24:05,780
in pgBouncer.
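
(The pattern being praised, roughly; PAUSE, RELOAD, and RESUME are real
pgBouncer admin-console commands, while the database name and the switchover
step in the middle are illustrative.)

    -- connected to the admin console, e.g. psql -p 6432 -U pgbouncer pgbouncer
    PAUSE app_db;   -- let in-flight queries finish, queue new ones
    -- ...promote the new primary and repoint the pool's config, then:
    RELOAD;
    RESUME app_db;  -- queued clients continue against the new primary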

550
00:24:06,580 --> 00:24:08,100
Sammy: Yeah, it's been great for
us.

551
00:24:08,420 --> 00:24:11,520
We have some pgBouncer scaling
problems, but everything on the

552
00:24:11,520 --> 00:24:14,940
failover's pause-resume side
has been flawless.

553
00:24:15,480 --> 00:24:16,820
Michael: That's so good to hear.

554
00:24:17,000 --> 00:24:20,280
And finally, Derk, on the major
version side of things.

555
00:24:20,540 --> 00:24:24,400
Derk: We are fully on 13, and we're
having discussions about

556
00:24:24,400 --> 00:24:25,200
the next version we want to upgrade to,

557
00:24:25,200 --> 00:24:28,980
because every version
has nice things.

558
00:24:29,240 --> 00:24:32,580
I want certain partitioning functionality,
a colleague wants

559
00:24:32,580 --> 00:24:34,300
more logical replication functionality.

560
00:24:35,460 --> 00:24:40,120
So, of course, 17 would be the
ideal, but then do we really want

561
00:24:40,120 --> 00:24:43,780
to run 17 already with the biggest
clusters we have?

562
00:24:43,780 --> 00:24:46,360
So, yeah, to be continued.

563
00:24:47,100 --> 00:24:50,000
Michael: And I think also the fact
that, I mean, Notion being

564
00:24:50,000 --> 00:24:51,760
on 15 was somewhat of a surprise
to me.

565
00:24:51,760 --> 00:24:55,080
I think because of these major
version upgrade challenges, I

566
00:24:55,080 --> 00:24:58,380
see so many companies, even at
the cutting edge of what Postgres

567
00:24:58,380 --> 00:25:02,660
can do, lagging several years behind
because upgrades are so

568
00:25:02,660 --> 00:25:03,040
difficult.

569
00:25:03,040 --> 00:25:05,900
So I know we won't get a chance
to discuss all the reasons why,

570
00:25:05,900 --> 00:25:08,820
but I think it's proof because
we're all lagging.

571
00:25:09,380 --> 00:25:12,400
Sammy: Yeah, well, I actually think
it's not for us so much fear

572
00:25:12,400 --> 00:25:13,360
of newer versions.

573
00:25:13,380 --> 00:25:16,660
You know, I actually worked with
MySQL before Postgres, and MySQL

574
00:25:16,840 --> 00:25:18,840
major version upgrades were truly
terrifying.

575
00:25:18,900 --> 00:25:21,660
These were like 2 or 3-year-long
efforts where you found a lot

576
00:25:21,660 --> 00:25:22,700
of major incompatibilities.

577
00:25:23,100 --> 00:25:26,540
But one thing I love about Postgres
is it is mostly backward compatible

578
00:25:26,540 --> 00:25:27,760
across major versions.

579
00:25:28,040 --> 00:25:31,480
And so it's not so much that we're
scared of the newer ones; it's

580
00:25:31,480 --> 00:25:34,200
that we can't take downtime
to do these upgrades.

581
00:25:34,200 --> 00:25:37,000
And so it's much more manual toil.

582
00:25:37,540 --> 00:25:38,160
Michael: Yeah, exactly.

583
00:25:38,160 --> 00:25:40,880
I think that's
the case for a lot of people.

584
00:25:41,140 --> 00:25:44,860
So last one I was going to ask each
of you is if you have a

585
00:25:44,860 --> 00:25:48,940
funny or scary (and I don't know
which one I want most) story that

586
00:25:48,940 --> 00:25:51,240
you can share with us related to
your scaling.

587
00:25:51,380 --> 00:25:55,360
Any of you got one prepared already,
feel free to jump in.

588
00:25:55,760 --> 00:25:56,420
Or Nikolay?

589
00:25:56,680 --> 00:25:59,560
Nikolay: Lightweight locks, lock
manager, multi-xact transactions,

590
00:26:00,280 --> 00:26:03,720
multi-xact IDs and so on, like
something like that, maybe.

591
00:26:03,820 --> 00:26:06,820
Derk: Yeah, I think we definitely
got bitten by the lock manager,

592
00:26:06,820 --> 00:26:09,120
like most people who start doing
partitioning.

593
00:26:09,920 --> 00:26:10,460
It's like,

594
00:26:10,460 --> 00:26:11,160
Nikolay: yeah, well,

595
00:26:11,160 --> 00:26:14,340
Derk: we started partitioning
and we did a good

596
00:26:14,340 --> 00:26:14,640
job.

597
00:26:14,640 --> 00:26:16,620
And I built this framework around
partitioning.

598
00:26:17,440 --> 00:26:20,920
By the way, Michael, the fourth
blog post is, we are writing

599
00:26:20,920 --> 00:26:21,920
it at the moment.

600
00:26:22,040 --> 00:26:22,540
Yes.

601
00:26:22,580 --> 00:26:23,800
So it will be there.

602
00:26:23,940 --> 00:26:24,980
It's a nice idea.

603
00:26:25,080 --> 00:26:26,960
Now I lost the rest of my story.

604
00:26:26,960 --> 00:26:27,780
Oh yeah, partitioning.

605
00:26:28,080 --> 00:26:30,360
Nikolay: So it sounds like you have
prepared statements.

606
00:26:30,620 --> 00:26:32,720
Derk: No, we just have to prepare
statements.

607
00:26:33,920 --> 00:26:36,340
But you know everything about prepared
statements, right?

608
00:26:36,420 --> 00:26:39,520
You try it 5 times, then the optimizer
decides, do we go with

609
00:26:39,520 --> 00:26:41,420
the generic plan or the custom
plan?

610
00:26:42,120 --> 00:26:45,480
But we also created lots and
lots of partitions and a lot of

611
00:26:45,480 --> 00:26:46,280
partitioned tables.

612
00:26:46,280 --> 00:26:48,300
So we definitely ran out of locks.

613
00:26:48,520 --> 00:26:52,200
And our CPU, basically every time
we created a new partition

614
00:26:52,200 --> 00:26:54,980
for a table, the CPU usage jumped
10%.

615
00:26:55,440 --> 00:26:58,860
But jumping 10% was fine because
we had a lot of CPU to spare.

616
00:26:58,860 --> 00:27:02,380
So by the time it alerted, it was
pretty high.

617
00:27:02,780 --> 00:27:06,060
And then we were looking back and
we saw these jumps of 10% and

618
00:27:06,060 --> 00:27:07,320
10% and 10%.

619
00:27:07,540 --> 00:27:10,640
And then we saw that the
next jump of 10% would put us above

620
00:27:10,640 --> 00:27:11,140
100.

621
00:27:12,040 --> 00:27:14,880
And then we would not be in a good position.

622
00:27:15,520 --> 00:27:19,180
So, then we had this small window
with all kinds of things prepared,

623
00:27:19,180 --> 00:27:22,160
like don't create any partitions
anymore on this cluster.

624
00:27:22,500 --> 00:27:25,760
What can we break open partition
boundaries so we can basically

625
00:27:25,760 --> 00:27:28,300
extend the final partition to hold
more data?

626
00:27:28,780 --> 00:27:31,880
And then we were researching this
problem, like why is the CPU

627
00:27:31,880 --> 00:27:33,460
usually jumping like this?

628
00:27:33,700 --> 00:27:37,440
And in the end, we started to force
the optimizer to pick generic

629
00:27:37,440 --> 00:27:41,620
plans because custom plans were
faster in execution, but definitely

630
00:27:41,820 --> 00:27:43,120
not at prepare time.

631
00:27:43,180 --> 00:27:47,540
So we won like a few milliseconds
in execution time, but we were

632
00:27:47,540 --> 00:27:50,140
paying like 100 milliseconds in
prepare time.

633
00:27:50,280 --> 00:27:52,860
And that was really heating up
the CPUs.
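
(To make the trade-off concrete: with the default plan_cache_mode = auto, a
prepared statement gets custom plans for its first five executions before
Postgres considers caching a generic plan; since Postgres 12 the choice can
be forced. Table and parameter names are invented.)

    PREPARE q (bigint) AS SELECT * FROM payments WHERE account_id = $1;
    EXECUTE q(42);  -- executions 1-5 get custom plans
    -- skip the expensive re-planning on partitioned tables, as Derk describes:
    SET plan_cache_mode = force_generic_plan;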

634
00:27:53,520 --> 00:27:56,720
Nikolay: Because during planning
Postgres locks all indexes and

635
00:27:56,720 --> 00:28:00,580
tables, all partitions, all its
indexes if partition pruning

636
00:28:00,580 --> 00:28:01,400
is not working.

637
00:28:01,420 --> 00:28:02,640
So it's like, it's terrible.

638
00:28:02,640 --> 00:28:05,220
Derk: Yeah, so you're running out
of your fastpath locks because

639
00:28:05,220 --> 00:28:09,020
you need to lock like 20 partitions
and 20 indexes on these partitions

640
00:28:09,060 --> 00:28:09,840
for a single table.

641
00:28:09,840 --> 00:28:12,600
And then you join with a partitioned
table, and with a partitioned table,

642
00:28:12,720 --> 00:28:15,980
all these locks go to the lock manager,
which is burning CPU cycles.
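
(A generic way to watch this happen; each backend has only 16 fast-path lock
slots, so partition-heavy plans overflow into the shared lock manager. Not
any guest's production query.)

    SELECT fastpath, mode, count(*)
    FROM pg_locks
    WHERE locktype = 'relation'
    GROUP BY fastpath, mode
    ORDER BY count(*) DESC;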

643
00:28:16,220 --> 00:28:19,080
Nikolay: You're in a good club
of a bunch of companies who experienced

644
00:28:19,080 --> 00:28:20,940
this over the last couple of years.

645
00:28:21,040 --> 00:28:21,420
Sammy: Yeah, I was

646
00:28:21,420 --> 00:28:24,440
going to say, we hit this
for vacuuming on our bigger

647
00:28:24,440 --> 00:28:24,720
tables.

648
00:28:24,720 --> 00:28:27,720
When you get above like 4 terabytes,
you start seeing that at

649
00:28:27,720 --> 00:28:31,080
the end of the cycle, when you
run analyze and the cache gets

650
00:28:31,080 --> 00:28:34,280
cleared out, and then you suddenly
have this custom plans that

651
00:28:34,280 --> 00:28:35,520
are 5x more expensive.

652
00:28:35,580 --> 00:28:38,680
And that's when CPU spikes to 100%
for a few seconds.

653
00:28:38,680 --> 00:28:40,340
Nikolay: I have a question for all
of you.

654
00:28:40,340 --> 00:28:44,340
When you have these problems, do
you discuss this with people

655
00:28:44,340 --> 00:28:46,300
who can influence this?

656
00:28:46,300 --> 00:28:47,580
I mean, hackers somehow.

657
00:28:47,780 --> 00:28:51,300
Do you raise these problems in
mailing lists or somewhere else

658
00:28:51,300 --> 00:28:53,240
where hackers live?

659
00:28:53,920 --> 00:28:56,860
Arka: We work with AWS support
quite a bit.

660
00:28:57,040 --> 00:29:00,140
Nikolay: Well, some hackers there
are for sure, but not all of

661
00:29:00,140 --> 00:29:00,360
them, right?

662
00:29:00,360 --> 00:29:03,700
Sammy: Yeah, we're good friends
with the AWS RDS team and they

663
00:29:03,700 --> 00:29:06,780
have actually a team that contributes
to Postgres, so we try

664
00:29:06,780 --> 00:29:09,060
to leverage them to get some of
these patches in.

665
00:29:09,060 --> 00:29:12,380
But it's honestly a lot of us reading
the source code ourselves

666
00:29:12,380 --> 00:29:14,200
as well and trying to figure out
workarounds.

667
00:29:15,060 --> 00:29:18,540
Nikolay: By the way, I just last
week I think I realized that

668
00:29:18,540 --> 00:29:22,900
SLRU patches created by originally
by Andrey Borodin, which will

669
00:29:22,900 --> 00:29:27,660
be in Postgres 17, they have already been
adopted by AWS RDS Aurora

670
00:29:27,780 --> 00:29:29,320
for a couple of years, almost.

671
00:29:29,620 --> 00:29:30,900
I was super surprised.

672
00:29:30,900 --> 00:29:32,520
Actually, Andrey was also surprised.

673
00:29:32,660 --> 00:29:35,180
But you're not going to use Aurora,
right?

674
00:29:35,380 --> 00:29:36,920
Instead of regular RDS.

675
00:29:37,420 --> 00:29:37,920
Arka: Maybe.

676
00:29:38,160 --> 00:29:41,760
I think, I mean, for Notion, I
think we are actually starting

677
00:29:41,760 --> 00:29:44,380
to look into whether Aurora makes sense.

678
00:29:44,380 --> 00:29:48,580
Mostly because the AWS team promises
a lot better replication

679
00:29:49,080 --> 00:29:53,000
for global replicas compared to
the RDS built-in replicas.

680
00:29:53,440 --> 00:29:56,480
So it's something we're considering,
but we haven't actually

681
00:29:56,480 --> 00:29:57,940
used before.

682
00:29:58,140 --> 00:30:01,640
So I was curious if any of you have
any experience with that or

683
00:30:01,640 --> 00:30:04,540
like scaling read replicas in general?

684
00:30:04,540 --> 00:30:07,200
Because that's also something we
haven't leveraged too much at

685
00:30:07,200 --> 00:30:08,040
Notion yet.

686
00:30:08,240 --> 00:30:10,760
Sammy: Well, less scaling read
replicas, but actually my scary

687
00:30:10,760 --> 00:30:14,560
story was going to be around moving
things more to replicas because

688
00:30:15,040 --> 00:30:18,560
at the same time that we were growing
100X, we were dealing with

689
00:30:18,560 --> 00:30:21,300
a lot more pressure on the reliability
of our system and trying

690
00:30:21,300 --> 00:30:23,940
to move towards 4 nines globally
as a company.

691
00:30:24,280 --> 00:30:28,520
And so our application, because
it's a web browser and we have

692
00:30:28,520 --> 00:30:31,560
a kind of single threaded Ruby
monolith that serves a lot of

693
00:30:31,560 --> 00:30:35,980
traffic, is so latency sensitive
that if your database slows

694
00:30:35,980 --> 00:30:39,920
down for a minute, then the whole
application can back up and

695
00:30:39,920 --> 00:30:41,520
your whole website can be down.

696
00:30:41,740 --> 00:30:45,980
And so 1 of our biggest
reliability risks and most persistent

697
00:30:46,120 --> 00:30:51,880
pain points is that RDS does have
some big latency spikes on EBS

698
00:30:51,900 --> 00:30:54,560
that most customers don't notice,
but we were really hitting.

699
00:30:54,720 --> 00:31:00,460
And so our 2 big solutions there
were to move, basically remove

700
00:31:00,460 --> 00:31:04,080
writes from our most business critical
routes wherever possible

701
00:31:04,740 --> 00:31:09,440
and put all reads, or 99% of reads
onto replicas for these P0

702
00:31:09,620 --> 00:31:10,460
user workflows.

703
00:31:11,000 --> 00:31:13,060
And then we built out replica hedging.

704
00:31:13,060 --> 00:31:17,980
So every request hits 2 replicas
and returns the first results

705
00:31:18,280 --> 00:31:21,140
that are successful there, which
allows us to tolerate any 1

706
00:31:21,140 --> 00:31:22,480
replica being down.

707
00:31:22,740 --> 00:31:24,500
And that's been hugely useful.

708
00:31:25,240 --> 00:31:28,940
We also got onto io2 for EBS, which
has been a game changer for

709
00:31:28,940 --> 00:31:29,840
EBS reliability.

710
00:31:30,940 --> 00:31:31,760
Arka: Oh, nice.

711
00:31:31,860 --> 00:31:35,940
How many replicas do you guys normally
have off 1 database?

712
00:31:36,340 --> 00:31:40,520
Sammy: Today there
are only 2 large replicas.

713
00:31:40,760 --> 00:31:43,400
In the future we want to move towards
more smaller replicas.

714
00:31:43,980 --> 00:31:46,580
Derk: We have multiple, let's put
it that way.

715
00:31:46,640 --> 00:31:49,700
Reading from a replica is much
harder when it is about financial

716
00:31:49,740 --> 00:31:53,400
data because you want to have the
correct data and it might have just

717
00:31:53,400 --> 00:31:54,180
been updated.

718
00:31:54,760 --> 00:31:58,080
So we try to move a lot of stuff
to the replicas, but it's kind

719
00:31:58,080 --> 00:31:58,720
of hard.

720
00:31:59,180 --> 00:32:01,860
Arka: Yeah, that would definitely
be a problem for financial

721
00:32:01,880 --> 00:32:02,380
data.

722
00:32:02,540 --> 00:32:05,160
Nikolay: Are we talking about asynchronous
replicas only, or

723
00:32:05,160 --> 00:32:09,300
maybe some semi-synchronous or
quorum commit approach?

724
00:32:10,120 --> 00:32:10,960
Derk: We have both.

725
00:32:12,740 --> 00:32:16,080
Arka: Also, cache invalidation,
I think that's 1 of the bigger

726
00:32:16,080 --> 00:32:19,420
concerns for us at Notion because
we're super, super heavy users

727
00:32:19,640 --> 00:32:20,400
of memcache.

728
00:32:21,420 --> 00:32:24,960
And reasoning about how that works
for replicas is another thing

729
00:32:24,960 --> 00:32:28,540
that we don't really want to do,
but we have to do.

730
00:32:29,800 --> 00:32:33,680
I don't really have a super scary
story, but actually it's something

731
00:32:33,680 --> 00:32:37,160
I've been working on recently,
where we ran into some surprising

732
00:32:37,160 --> 00:32:37,640
behavior.

733
00:32:37,640 --> 00:32:39,560
So maybe it'd be just fun to talk
about.

734
00:32:39,720 --> 00:32:43,180
So recently at Notion, we've been
trying to upgrade that last

735
00:32:43,180 --> 00:32:43,680
10%.

736
00:32:44,060 --> 00:32:47,200
And of course those databases are
just the hardest because they're

737
00:32:47,200 --> 00:32:48,140
just so big.

738
00:32:48,280 --> 00:32:52,640
So we're trying to upgrade this
partition table that is around

739
00:32:52,640 --> 00:32:56,300
20 terabytes with like 16 partitions
on each database.

740
00:32:56,600 --> 00:32:59,740
So, you know, each partition is
about like a terabyte ish.

741
00:33:00,060 --> 00:33:03,740
And we were having trouble getting
this to work with just regular

742
00:33:03,860 --> 00:33:04,840
logical replication.

743
00:33:05,660 --> 00:33:09,720
So we actually started exploring
DMS, which is like an AWS service

744
00:33:09,720 --> 00:33:13,540
that lets you migrate data between
two separate databases.

745
00:33:14,140 --> 00:33:15,660
Side note, it's actually super
cool.

746
00:33:15,660 --> 00:33:19,200
Like you can go from like a completely
random system to like

747
00:33:19,200 --> 00:33:20,140
another new system.

748
00:33:20,140 --> 00:33:22,540
It doesn't have to be like Postgres
to Postgres only.

749
00:33:22,540 --> 00:33:25,460
I don't know how good it is at
huge scale.

750
00:33:25,680 --> 00:33:29,120
Nikolay: It just requires a department
in your company to manage.

751
00:33:30,660 --> 00:33:33,840
Arka: Yeah, so originally I was
like, oh yeah, I can do this.

752
00:33:33,840 --> 00:33:36,380
And then it's been like 4 months
now and I'm still like, oh my

753
00:33:36,380 --> 00:33:38,500
God, this is a lot.

754
00:33:38,680 --> 00:33:41,780
But anyway, 1 of the problems we
ran into was actually related

755
00:33:41,820 --> 00:33:43,300
to Postgres bloat.

756
00:33:43,620 --> 00:33:46,840
Cause it turns out with DMS, you
get a lot of kind of knobs that

757
00:33:46,840 --> 00:33:49,900
you can tune for the concurrency
of like the copy.

758
00:33:49,920 --> 00:33:51,420
And that goes pretty fast.

759
00:33:51,420 --> 00:33:54,800
Like we're able to copy, you know,
a terabyte of data in maybe

760
00:33:54,800 --> 00:33:58,780
12, 14 hours with the way we have
our DMS set up.

761
00:33:58,820 --> 00:34:01,980
But the thing we noticed after
we set up the target databases

762
00:34:02,160 --> 00:34:07,040
is somehow the data size, like,
so we use pganalyze as like

763
00:34:07,040 --> 00:34:08,860
our Postgres monitoring tool.

764
00:34:09,100 --> 00:34:12,880
And like, we noticed that the data
size literally on disk was

765
00:34:12,880 --> 00:34:17,040
like 3 times larger on the new
tables for some reason than the

766
00:34:17,040 --> 00:34:17,740
old ones.

767
00:34:18,740 --> 00:34:21,680
And you know, we were going from
like Postgres 12 to Postgres

768
00:34:21,680 --> 00:34:26,020
15, so at first we thought maybe
it was like a TOAST compression

769
00:34:26,180 --> 00:34:26,640
thing.

770
00:34:26,640 --> 00:34:29,500
Cause we know that the TOAST compression
algorithm changed between,

771
00:34:29,500 --> 00:34:32,040
I think, 12 to 14 or something.

772
00:34:32,040 --> 00:34:35,540
It went from like PGLZ to LZ4 or
something.

773
00:34:35,540 --> 00:34:37,940
I don't remember exactly but I
know that that changed.

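For reference, the default did not actually change: pglz is still the default, but Postgres 14 added LZ4 as an opt-in per-column choice (when the server is built with LZ4 support). A sketch with hypothetical names:

```sql
SHOW default_toast_compression;  -- 'pglz' unless overridden

-- Switch one column; existing rows keep their old compression
-- until they are rewritten.
ALTER TABLE blocks
  ALTER COLUMN properties SET COMPRESSION lz4;
```
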
774
00:34:38,000 --> 00:34:41,200
So I was like, oh, is it
the new TOAST compression that's

775
00:34:41,200 --> 00:34:44,540
causing like the data size to be
literally much larger on disk?

776
00:34:44,540 --> 00:34:47,800
Which obviously leads to much worse
latency because then every

777
00:34:47,800 --> 00:34:50,140
1 of your queries is doing much
more I/O.

778
00:34:50,580 --> 00:34:51,080
Nikolay: Buffers.

779
00:34:51,220 --> 00:34:51,720
Buffers.

780
00:34:52,820 --> 00:34:56,440
Arka: Yeah, so you know we tried
a few tests with like changing

781
00:34:56,440 --> 00:34:59,440
the compression back to the old
1.

782
00:34:59,440 --> 00:35:02,480
We weren't fully convinced still,
but it turns out actually we

783
00:35:02,480 --> 00:35:05,680
were just hitting like bloat and
this clicked when I was actually

784
00:35:05,680 --> 00:35:09,440
listening to 1 of your earlier
podcasts I think on Postgres bloat

785
00:35:10,080 --> 00:35:13,480
and I realized that our vacuums
weren't actually fully completing

786
00:35:13,480 --> 00:35:16,160
on the new table because there
were just so many new transactions

787
00:35:16,180 --> 00:35:19,540
being written that the vacuum could
not keep up, which meant that the

788
00:35:19,540 --> 00:35:21,600
bloat just kept on accumulating
and accumulating.

789
00:35:21,960 --> 00:35:24,740
And on top of that, a lot of the
data in this table was being

790
00:35:24,740 --> 00:35:27,280
TOASTed because they're just JSONB
columns.

791
00:35:27,800 --> 00:35:31,500
So anyway, it was just like 2 weeks
of investigation to be like,

792
00:35:31,500 --> 00:35:33,140
oh, we should just run a full vacuum.

793
00:35:33,380 --> 00:35:37,260
And turns out that brought the
data size right back to what it

794
00:35:37,260 --> 00:35:37,900
should be.

795
00:35:37,900 --> 00:35:40,020
So it was a fun little kind of
investigation.

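A sketch of that check and fix (table name hypothetical); note that VACUUM FULL takes an ACCESS EXCLUSIVE lock and rewrites the whole table, which is why pg_repack is a common online alternative:

```sql
-- Rough dead-tuple picture from the statistics views:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Rewrite the table to reclaim the space (blocks all access meanwhile):
VACUUM (FULL, VERBOSE, ANALYZE) blocks;
```
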
796
00:35:41,680 --> 00:35:45,040
Nikolay: It sounds like pganalyze
doesn't tell you about the xmin horizon

797
00:35:45,480 --> 00:35:49,960
and things like why
a vacuum can be lagging and skipping

798
00:35:49,960 --> 00:35:52,320
some dead tuples that cannot be
deleted.

799
00:35:52,660 --> 00:35:55,780
pganalyze has a very good vacuum
dashboard.

800
00:35:56,480 --> 00:35:59,980
So I'm very curious, does it speak
about

801
00:36:00,540 --> 00:36:01,420
Arka: The xmin horizon?

802
00:36:01,780 --> 00:36:04,020
It actually does point those things
out.

803
00:36:04,020 --> 00:36:07,660
We just didn't have the new databases
in pganalyze yet, because

804
00:36:07,660 --> 00:36:10,120
we're like, oh, we're not using
production traffic on these,

805
00:36:10,120 --> 00:36:11,260
so we don't need it there.

806
00:36:11,260 --> 00:36:13,980
When we put it there, we realized
that, oh yeah, like the vacuums

807
00:36:13,980 --> 00:36:15,060
are falling behind.

808
00:36:15,060 --> 00:36:18,520
Nikolay: I'm just personally
very annoyed that Postgres

809
00:36:18,520 --> 00:36:23,200
monitoring systems lack better
xmin horizon and long transactions,

810
00:36:23,360 --> 00:36:25,220
which are 2 separate things basically.

811
00:36:25,940 --> 00:36:26,440
Alerts.

812
00:36:27,100 --> 00:36:29,480
Long transaction can happen and
nobody tells you.

813
00:36:29,480 --> 00:36:32,920
It's not right if you
sometimes have 2 Postgres monitoring systems

814
00:36:32,920 --> 00:36:34,620
and nobody tells you about this.

815
00:36:34,900 --> 00:36:37,500
So I'm curious if pganalyze has
some alerts.

816
00:36:37,960 --> 00:36:38,700
Arka: It does.

817
00:36:39,000 --> 00:36:39,840
Nikolay: Okay, okay.

818
00:36:39,960 --> 00:36:42,560
Arka: It sends you an email if,
for example, you have had any

819
00:36:42,560 --> 00:36:44,560
transaction running for over an
hour.

820
00:36:44,760 --> 00:36:48,120
Nikolay: Right, but if you use
logical replication, it's not the only reason

821
00:36:48,120 --> 00:36:49,940
for the xmin horizon being stalled.

822
00:36:49,940 --> 00:36:51,200
So, yeah.

823
00:36:51,200 --> 00:36:51,700
Arka: Right.

824
00:36:52,120 --> 00:36:52,620
Nikolay: Okay.

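For anyone who wants the raw check, a sketch of the usual xmin-horizon suspects, long-running backends and replication slots:

```sql
-- Backends holding the oldest snapshots:
SELECT pid, state, age(backend_xmin) AS xmin_age, xact_start
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

-- Replication slots (e.g. logical ones) pinning old tuples:
SELECT slot_name, slot_type, active, xmin, catalog_xmin
FROM pg_replication_slots;
```
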
825
00:36:52,900 --> 00:36:53,040
Arka: For

826
00:36:53,040 --> 00:36:56,040
us, like, pganalyze has
been really great, actually.

827
00:36:56,040 --> 00:36:58,220
Like, I love working with that
team.

828
00:36:58,320 --> 00:36:59,240
Shout out to Lucas.

829
00:36:59,240 --> 00:37:00,540
Like, he's been so amazing.

830
00:37:01,360 --> 00:37:04,680
Nikolay: And also content, again,
back to content sharing.

831
00:37:05,380 --> 00:37:08,100
Lucas does a great job and his
team does a great job.

832
00:37:08,380 --> 00:37:09,340
Sammy: Yeah, Lucas is awesome.

833
00:37:09,340 --> 00:37:12,520
I talked to him as well a few weeks
ago, and everything at pganalyze,

834
00:37:12,520 --> 00:37:14,400
their blogs are really
great.

835
00:37:14,920 --> 00:37:17,880
Okay, I think it's interesting
you guys had the data copy be

836
00:37:17,880 --> 00:37:18,940
such a big bottleneck.

837
00:37:19,280 --> 00:37:22,800
We also had a big pain point there
when we were first doing vertical

838
00:37:22,800 --> 00:37:23,300
sharding.

839
00:37:23,680 --> 00:37:26,940
And what we ended up finding out
is we kind of went into the

840
00:37:26,940 --> 00:37:31,080
source code for Postgres and logical
replication and realized

841
00:37:31,120 --> 00:37:34,640
that it's really, really expensive
to have indexes on your database

842
00:37:34,640 --> 00:37:37,540
table while you're doing logical
replication, and much faster

843
00:37:37,540 --> 00:37:40,360
to drop the indexes and then recreate
them afterwards.

844
00:37:40,840 --> 00:37:46,780
So it went from weeks to a day
for a multi-terabyte data copy

845
00:37:46,780 --> 00:37:49,420
when you drop all of your indexes
and foreign keys.

846
00:37:49,820 --> 00:37:51,140
Arka: Yeah, that's a great insight.

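A sketch of that pattern on the subscriber side (all names hypothetical): create the target table with only its primary key, let the initial copy run, then rebuild the expensive objects.

```sql
-- Target has no secondary indexes or foreign keys during the copy.
CREATE SUBSCRIPTION blocks_sub
  CONNECTION 'host=old-primary dbname=app user=replicator'
  PUBLICATION blocks_pub
  WITH (copy_data = true);

-- After the copy finishes and the subscriber catches up:
CREATE INDEX CONCURRENTLY blocks_parent_idx ON blocks (parent_id);
ALTER TABLE blocks
  ADD CONSTRAINT blocks_space_id_fkey
  FOREIGN KEY (space_id) REFERENCES spaces (id) NOT VALID;
ALTER TABLE blocks VALIDATE CONSTRAINT blocks_space_id_fkey;
```
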
847
00:37:51,660 --> 00:37:55,140
Nikolay: Yeah, I'm also curious
if you talk about regular logical

848
00:37:55,440 --> 00:37:59,940
replica provisioning when data
is copied logically or maybe,

849
00:38:00,040 --> 00:38:03,560
for example, there is a binary
copy option, and also,

850
00:38:03,560 --> 00:38:06,260
if you need it for upgrades, for
example, or you need the majority

851
00:38:06,260 --> 00:38:09,800
of data, sometimes it's better
to convert a physical replica to

852
00:38:09,800 --> 00:38:13,740
logical, which I hope will soon
become a standard recipe inside

853
00:38:13,740 --> 00:38:16,780
Postgres itself, but for now it's only
possible if you orchestrate

854
00:38:16,800 --> 00:38:17,440
it properly.

855
00:38:17,440 --> 00:38:21,340
I'm very curious if you do this
trick on RDS, both of you, or

856
00:38:21,340 --> 00:38:21,840
no?

857
00:38:22,280 --> 00:38:25,440
Sammy: We rely a lot on logical
replication for all of our core

858
00:38:25,440 --> 00:38:28,380
failover tooling, so we haven't
really explored physical replication.

859
00:38:28,380 --> 00:38:30,820
I'm not sure if we really can because
of RDS.

860
00:38:31,240 --> 00:38:34,180
Nikolay: On RDS it's possible if
you apply the so-called Instacart

861
00:38:34,240 --> 00:38:35,420
recipe, but

862
00:38:35,860 --> 00:38:36,000
there

863
00:38:36,000 --> 00:38:38,340
are debates about whether it's
safe to do it.

864
00:38:39,060 --> 00:38:42,040
Actually, Lucas posted about
it on his blog recently as well.

865
00:38:42,660 --> 00:38:45,560
Arka: You're talking about the
thing where you create a replication

866
00:38:45,600 --> 00:38:49,140
slot, take a backup, and then advance
the replication slot.

867
00:38:49,340 --> 00:38:51,000
Nikolay: Yeah, either this or...

868
00:38:51,180 --> 00:38:54,520
Yeah, on RDS only this, yes.

869
00:38:54,520 --> 00:38:54,720
Arka: Yeah,

870
00:38:54,720 --> 00:38:54,960
Nikolay: yeah.

871
00:38:54,960 --> 00:38:58,360
On self-managed, you can play with
recovery target LSN, right?

872
00:38:58,780 --> 00:38:59,280
Arka: Yeah.

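A rough sketch of that physical-to-logical recipe on self-managed Postgres (names illustrative; on RDS the recovery_target_lsn step is replaced by advancing the slot against a restored backup):

```sql
-- 1. On the source: create a logical slot and note its LSN.
SELECT * FROM pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');

-- 2. Restore a physical replica and recover it exactly to that LSN
--    (settings on the replica, then promote):
--      recovery_target_lsn    = '<LSN from step 1>'
--      recovery_target_action = 'promote'

-- 3. On the promoted copy: attach to the source as a logical
--    subscriber without re-copying any data.
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=app'
  PUBLICATION upgrade_pub
  WITH (copy_data = false, create_slot = false, slot_name = 'upgrade_slot');
```
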
873
00:38:59,700 --> 00:39:02,560
Nikolay: So I'm curious, is it
only in Stackgres or what...

874
00:39:02,640 --> 00:39:04,740
Arka: We've actually used that
before too.

875
00:39:04,740 --> 00:39:07,600
So actually, before we knew the
trick that Sammy just told us

876
00:39:07,600 --> 00:39:10,780
about with the indexes, we could
never get logical replication

877
00:39:10,840 --> 00:39:13,300
to ever catch up with our tables.

878
00:39:13,780 --> 00:39:16,480
So we actually ended up using the
Instacart trick to do our first

879
00:39:16,480 --> 00:39:19,340
round of upgrades that I ever did
at Notion.

880
00:39:19,660 --> 00:39:20,540
It worked well.

881
00:39:20,540 --> 00:39:23,580
I think 1 of the problems you run
into with that though is similarly,

882
00:39:23,940 --> 00:39:27,560
when you provision the new database
off the backup, you have

883
00:39:27,560 --> 00:39:30,860
to do another extra step of actually
dropping the indexes again.

884
00:39:30,860 --> 00:39:35,060
Because even just catching up with
logical replication from the

885
00:39:35,060 --> 00:39:38,080
point that you took the backup
for a sufficiently large database

886
00:39:38,560 --> 00:39:42,380
still took us like a day or so
just to catch up the changes between

887
00:39:42,380 --> 00:39:46,240
the backup point and the present.
So I think, yeah, that trick

888
00:39:46,240 --> 00:39:49,140
could work really well, probably;
combined with dropping the indexes, it

889
00:39:49,140 --> 00:39:50,420
would speed it up significantly.

890
00:39:51,260 --> 00:39:54,480
But actually for us, after that,
we've always just used regular

891
00:39:54,480 --> 00:39:57,260
logical replication, and it's been
fast enough if we can provision

892
00:39:57,260 --> 00:39:58,840
the schemas without any indexes.

893
00:39:58,940 --> 00:40:02,060
Nikolay: Actually, dropping indexes
leads us to 1 topic I didn't want

894
00:40:02,080 --> 00:40:02,940
to miss.

895
00:40:03,700 --> 00:40:07,400
Huge generated WAL volumes and
index write amplification.

896
00:40:07,640 --> 00:40:10,640
Can anyone tell us anything about
these problems, like have you experienced

897
00:40:10,640 --> 00:40:12,020
something like this, or...

898
00:40:13,040 --> 00:40:17,220
Because this is 1 of the key criticism
points from Uber when

899
00:40:17,220 --> 00:40:20,280
they posted this infamous article,
right?

900
00:40:20,660 --> 00:40:22,680
Postgres has index write amplification.

901
00:40:22,760 --> 00:40:26,320
We have hot updates, but still,
it's a problem, right?

902
00:40:26,320 --> 00:40:29,700
Actually, Derk, I remember an article
about maybe a couple of

903
00:40:29,700 --> 00:40:31,460
years ago about fill factor, right?

904
00:40:31,800 --> 00:40:34,100
Is it still the tool you use actively?

905
00:40:35,020 --> 00:40:38,440
Derk: Well, fill factor is
not a tool, right?

906
00:40:38,440 --> 00:40:42,540
It's just a parameter you can set
for your tables and your indexes.

907
00:40:43,040 --> 00:40:47,080
Nikolay: But the tool is to use
it to provoke more hot updates.

908
00:40:47,440 --> 00:40:47,940
Derk: Yeah.

909
00:40:48,280 --> 00:40:53,340
If we lose hot updates, like we
did last week, it's like we doubled

910
00:40:53,360 --> 00:40:54,660
our WAL usage.

911
00:40:55,760 --> 00:40:58,860
So that's already a lot, and that was
only for a single table for some

912
00:40:58,860 --> 00:41:01,160
reason, after the new
release.

913
00:41:01,160 --> 00:41:03,940
We had a longer-running query,
which means pretty long, because

914
00:41:03,940 --> 00:41:05,420
it was holding back hot updates.

915
00:41:05,740 --> 00:41:09,100
And that means like you just double
the WAL volume on your system,

916
00:41:09,140 --> 00:41:11,080
but we have many tables, right?

917
00:41:11,080 --> 00:41:15,920
And then only this single
table becomes

918
00:41:15,920 --> 00:41:18,660
responsible for half the WAL of
the entire cluster.

919
00:41:19,340 --> 00:41:23,740
For me, it's amazing that
you can generate, or skip

920
00:41:23,860 --> 00:41:26,880
creating, that amount of WAL when
using hot updates.

921
00:41:27,440 --> 00:41:31,740
So yeah, if we are having heavily
updated tables, we always try

922
00:41:31,800 --> 00:41:33,580
to find the right fill factor.

923
00:41:33,600 --> 00:41:36,900
I've been trying to get this formula,
like which tables do you

924
00:41:36,900 --> 00:41:39,520
need to set a fill factor on, and how high
do you set it?

925
00:41:39,620 --> 00:41:43,500
And it's still a mystery to me
because we have some tables which

926
00:41:43,500 --> 00:41:48,980
do fine at 95% fill factor and
some do really great at 70%.

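The knob and its feedback loop, as a sketch (table name hypothetical):

```sql
-- Leave ~20% of each heap page free so UPDATEs can stay on the same
-- page (HOT) instead of creating new index entries; this affects new
-- pages, so existing data only benefits after a rewrite.
ALTER TABLE payments SET (fillfactor = 80);

-- Watch what fraction of updates actually are HOT:
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```
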
927
00:41:49,840 --> 00:41:52,800
Nikolay: And also you need to protect
this table from new indexes

928
00:41:52,800 --> 00:41:56,140
being created and losing all the
hot updates somehow, right?

929
00:41:56,300 --> 00:41:58,920
If developers decide to add one more
index.

930
00:41:59,200 --> 00:42:00,120
Derk: Yeah, I can't.

931
00:42:01,080 --> 00:42:03,220
I have no way to force my developers.

932
00:42:04,020 --> 00:42:05,040
Sammy: I'm actually curious.

933
00:42:05,200 --> 00:42:08,720
A big pain point we have right
now is for really expensive index

934
00:42:08,720 --> 00:42:11,140
creations, there's no good way
of throttling them.

935
00:42:11,140 --> 00:42:15,260
So if it eats a lot of your IOPS,
there's not really a Postgres

936
00:42:15,280 --> 00:42:18,220
way we've found to tune this so
that, you know, maybe you

937
00:42:18,220 --> 00:42:21,140
want the index creation to happen
much more slowly in the background

938
00:42:21,140 --> 00:42:23,540
over a few days for some really
large tables.

939
00:42:24,120 --> 00:42:28,140
Nikolay: But it will hold xmin
again, this infamous problem

940
00:42:28,140 --> 00:42:28,440
from...

941
00:42:28,440 --> 00:42:31,440
Sammy: Sure, but I mean, maybe
3 days of holding that is better

942
00:42:31,440 --> 00:42:33,620
than an hour of the website being
down.

943
00:42:34,120 --> 00:42:36,540
Michael: This might be a stupid
question, but would reducing

944
00:42:36,560 --> 00:42:39,720
maintenance_work_mem, can
you do that in a session?

945
00:42:39,760 --> 00:42:40,240
Would that help?

946
00:42:40,240 --> 00:42:41,680
Nikolay: More disk I/O maybe.

947
00:42:41,940 --> 00:42:43,520
Sammy: Yeah, I think we have the
minimum.

948
00:42:43,520 --> 00:42:47,180
Yeah, we're only using 1 thread
for the maintenance settings

949
00:42:47,180 --> 00:42:50,440
there, but it has been enough a
couple of times to spike

950
00:42:50,440 --> 00:42:52,760
our IOPS to basically the RDS limits.

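For reference, the session-level settings in question, as a sketch; they bound memory and parallelism, but nothing here throttles the I/O itself, which is the gap Sammy describes:

```sql
SET maintenance_work_mem = '1GB';           -- memory for the index build
SET max_parallel_maintenance_workers = 0;   -- single worker only
CREATE INDEX CONCURRENTLY payments_user_idx ON payments (user_id);
```
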
951
00:42:53,860 --> 00:42:56,260
Derk: I always want indexes to
be created faster.

952
00:42:56,400 --> 00:42:56,900
Nikolay: Yeah.

953
00:42:57,180 --> 00:43:02,120
But it requires faster disk I/O
capabilities, of course.

954
00:43:02,660 --> 00:43:06,260
Michael: This is what I love about
true scale: you just hit

955
00:43:06,260 --> 00:43:06,760
trade-offs.

956
00:43:07,040 --> 00:43:11,880
You hit times where 1 thing's best
and you hit other times where

957
00:43:11,920 --> 00:43:13,860
the complete opposite is best.

958
00:43:14,140 --> 00:43:16,780
So yeah, I'm really conscious of
time.

959
00:43:16,840 --> 00:43:18,340
You've all been amazing.

960
00:43:18,460 --> 00:43:19,780
Thanks so much for joining.

961
00:43:19,900 --> 00:43:22,740
I wanted to give you each an opportunity
to say anything else

962
00:43:22,740 --> 00:43:25,960
you wanted to add or forgot to
say, or feel free to shout out

963
00:43:25,960 --> 00:43:27,840
that you're hiring, that kind
of thing.

964
00:43:27,980 --> 00:43:29,880
Nikolay: Let me ask 1 more technical
question.

965
00:43:29,880 --> 00:43:33,360
Does everyone use foreign keys everywhere,
or did you decide to drop some of

966
00:43:33,360 --> 00:43:35,040
them, or all of them, right?

967
00:43:36,260 --> 00:43:39,060
Arka: No, I will actually say something
about this.

968
00:43:39,060 --> 00:43:42,720
I think like, honestly, we use
probably Postgres in like the

969
00:43:42,720 --> 00:43:44,660
least interesting way possible.

970
00:43:44,920 --> 00:43:48,380
For our sharded cloud databases,
we don't have foreign keys.

971
00:43:48,420 --> 00:43:49,700
We don't do joins.

972
00:43:49,700 --> 00:43:51,040
We don't have any triggers.

973
00:43:51,100 --> 00:43:54,380
We don't really have anything that
can make trouble; we don't have any

974
00:43:54,380 --> 00:43:55,460
generated columns.

975
00:43:55,520 --> 00:43:59,440
Like we limit so many things that
you can do that we're basically

976
00:43:59,440 --> 00:44:02,080
using Postgres as like a blob store
at this point, which is why

977
00:44:02,080 --> 00:44:05,260
we end up with really giant databases
with a lot of indexes,

978
00:44:05,720 --> 00:44:09,060
but the queries are quite simple
actually to retrieve the data.

979
00:44:09,720 --> 00:44:13,080
And I think that's been like probably
1 of the primary reasons

980
00:44:13,080 --> 00:44:15,920
that I've actually, and like my
team has actually been able to

981
00:44:15,920 --> 00:44:18,900
kind of keep up with the growth
of our Postgres cluster because

982
00:44:18,900 --> 00:44:21,960
we're able to do things like re-sharding
without having to, you

983
00:44:21,960 --> 00:44:24,140
know, talk to every team at the
company and be like, can you

984
00:44:24,140 --> 00:44:24,960
please change this query?

985
00:44:24,960 --> 00:44:26,520
Can you please not do this thing?

986
00:44:26,520 --> 00:44:28,940
You know, I think a lot of people
get really excited about using

987
00:44:28,940 --> 00:44:32,260
all the fun features of relational
databases, but actually you

988
00:44:32,260 --> 00:44:35,460
can use it in a very boring way,
but still get some of the benefits

989
00:44:35,460 --> 00:44:38,420
of relational databases, such as
consistency and things like

990
00:44:38,420 --> 00:44:38,920
that.

991
00:44:39,060 --> 00:44:43,440
So I think that's like my 1 trick
to scaling any database,

992
00:44:43,440 --> 00:44:47,000
not just Postgres: MySQL, and really
anything else that you

993
00:44:47,000 --> 00:44:47,780
can think of.

994
00:44:47,780 --> 00:44:50,880
It's good to be aware of some of
the pitfalls of the more interesting

995
00:44:50,880 --> 00:44:51,380
features.

996
00:44:51,760 --> 00:44:52,980
Nikolay: So no foreign keys?

997
00:44:53,320 --> 00:44:54,260
Arka: Absolutely not.

998
00:44:54,560 --> 00:44:54,880
Okay.

999
00:44:54,880 --> 00:44:55,320
Not allowed.

1000
00:44:55,320 --> 00:44:55,820
Okay.

1001
00:44:57,980 --> 00:44:59,280
Nikolay: Any more input?

1002
00:44:59,760 --> 00:45:03,500
Sammy: Yeah, we do allow foreign
keys on the sharding key.

1003
00:45:03,580 --> 00:45:06,060
And similarly, we allow joins on
the sharding key.

1004
00:45:06,060 --> 00:45:08,940
So for us, it's a matter of picking
a sharding key so that most

1005
00:45:08,940 --> 00:45:10,680
operations are happening within
that.

1006
00:45:10,680 --> 00:45:13,520
And then that allows us to avoid
rewriting a lot of application

1007
00:45:13,580 --> 00:45:14,080
logic.

1008
00:45:14,800 --> 00:45:15,300
Smart.

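A sketch of what that looks like in a schema (names hypothetical): the sharding key leads every primary key, so foreign keys and joins resolve within a single shard.

```sql
CREATE TABLE files (
  org_id  bigint NOT NULL,       -- sharding key
  file_id bigint NOT NULL,
  PRIMARY KEY (org_id, file_id)
);

CREATE TABLE comments (
  org_id     bigint NOT NULL,    -- same sharding key
  comment_id bigint NOT NULL,
  file_id    bigint NOT NULL,
  PRIMARY KEY (org_id, comment_id),
  FOREIGN KEY (org_id, file_id) REFERENCES files (org_id, file_id)
);
```
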
1009
00:45:15,540 --> 00:45:20,400
Derk: We also have foreign keys.
At Posette, which is in 2 weeks'

1010
00:45:20,400 --> 00:45:21,720
time, a little bit less,

1011
00:45:21,820 --> 00:45:26,000
I'll tell you how to find a proper
partitioning strategy, because

1012
00:45:26,000 --> 00:45:29,240
up to now I've been focusing on
partitioning a single table in

1013
00:45:29,240 --> 00:45:32,600
the best way possible, but I kind
of forgot to look at the bigger

1014
00:45:32,600 --> 00:45:33,080
picture.

1015
00:45:33,080 --> 00:45:36,040
So all my tables are partitioned
slightly differently, with different

1016
00:45:36,040 --> 00:45:36,980
partition boundaries.

1017
00:45:37,540 --> 00:45:40,160
And then the entire system becomes
a bit messy.

1018
00:45:40,280 --> 00:45:43,380
So now we are working on basically
straightening out all the partition

1019
00:45:43,380 --> 00:45:47,440
boundaries, because then the joins
and the foreign keys work

1020
00:45:47,440 --> 00:45:48,380
much more efficiently.

1021
00:45:49,240 --> 00:45:50,520
So it's partition pruning.

1022
00:45:50,900 --> 00:45:51,400
Michael: Clever.

1023
00:45:51,820 --> 00:45:55,820
Derk: I kid you not, some queries
became 20 times faster by straightening

1024
00:45:55,900 --> 00:45:57,320
out partition boundaries.

1025
00:45:58,140 --> 00:45:58,640
Michael: Wow.

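The mechanism behind that speedup, sketched with hypothetical tables: when two tables are range-partitioned on identical boundaries and joined on the partition key, Postgres can join them partition by partition and prune the rest.

```sql
-- payments and refunds both range-partitioned by month on created_at,
-- with the same boundaries.
SET enable_partitionwise_join = on;  -- off by default

EXPLAIN (COSTS OFF)
SELECT p.id, r.amount
FROM payments p
JOIN refunds r
  ON r.payment_id = p.id
 AND r.created_at = p.created_at;
```
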
1026
00:46:00,060 --> 00:46:02,780
Sammy: I'll make my last plug on
the hiring side.

1027
00:46:02,860 --> 00:46:05,780
My team is hiring really aggressively
right now.

1028
00:46:05,820 --> 00:46:08,080
We're basically trying to double
this year.

1029
00:46:08,080 --> 00:46:11,280
And in particular, we just had
an awesome senior staff engineer

1030
00:46:11,280 --> 00:46:12,180
become a manager.

1031
00:46:12,180 --> 00:46:16,360
And so we have a big gap in senior
staff, principal type engineers,

1032
00:46:16,620 --> 00:46:18,300
folks who have worked at scale
before.

1033
00:46:18,340 --> 00:46:21,020
So if you're interested in those
problems, please reach out to

1034
00:46:21,020 --> 00:46:22,400
me or come apply.

1035
00:46:23,560 --> 00:46:24,880
Arka: Notion is also hiring.

1036
00:46:27,740 --> 00:46:30,640
But my pitch is definitely not
as refined as Sammy's.

1037
00:46:30,720 --> 00:46:33,580
But if you want to come work on
fun problems, check out our careers

1038
00:46:33,580 --> 00:46:34,080
page.

1039
00:46:34,300 --> 00:46:36,820
Derk: I actually think this
might be the first time in

1040
00:46:36,820 --> 00:46:37,580
20 years that

1041
00:46:37,580 --> 00:46:39,720
we are not expanding my team this
year.

1042
00:46:39,760 --> 00:46:40,540
Nikolay: Oh wow.

1043
00:46:40,920 --> 00:46:43,820
Michael: But we should plug Derk's
talk at Posette, which you can

1044
00:46:43,820 --> 00:46:44,280
watch.

1045
00:46:44,280 --> 00:46:47,220
So I'll include links to all of
these things.

1046
00:46:47,680 --> 00:46:49,540
1 final thank you from me.

1047
00:46:49,540 --> 00:46:50,500
This has been amazing.

1048
00:46:50,500 --> 00:46:52,700
Thank you so much for joining us
for episode a hundred.

1049
00:46:52,900 --> 00:46:53,140
Nikolay: Yeah.

1050
00:46:53,140 --> 00:46:53,980
Thank you so much.

1051
00:46:53,980 --> 00:46:58,480
I just want to once again, like,
say thank you for the knowledge

1052
00:46:58,480 --> 00:46:58,980
sharing.

1053
00:46:59,440 --> 00:47:00,260
Super important.

1054
00:47:00,540 --> 00:47:04,120
Looking forward to more, definitely
to more blog posts and maybe

1055
00:47:04,120 --> 00:47:05,200
some runbooks published.

1056
00:47:05,200 --> 00:47:10,540
I don't know, like things like
you use in your daily work maintaining

1057
00:47:10,600 --> 00:47:12,940
these large databases, large Postgres
databases.

1058
00:47:13,080 --> 00:47:18,300
So super curious to see more from
you and maybe to collaborate

1059
00:47:18,340 --> 00:47:22,060
on some things, like some knowledge
exchange and so on.

1060
00:47:22,280 --> 00:47:22,940
Thank you.

1061
00:47:23,520 --> 00:47:25,520
Derk: Thank you very much, both
of you.

1062
00:47:25,520 --> 00:47:29,000
It's really cool to be here, especially
in such good company.

1063
00:47:29,680 --> 00:47:31,340
Sammy: Yeah, this is a really fun
conversation.

1064
00:47:31,440 --> 00:47:32,860
Thanks for getting us all together.

1065
00:47:33,520 --> 00:47:33,840
Arka: Yeah.

1066
00:47:33,840 --> 00:47:34,440
Thanks again.

1067
00:47:34,440 --> 00:47:35,660
I really enjoyed this.

1068
00:47:35,660 --> 00:47:37,900
I hope to be back in the future
as well.

1069
00:47:37,900 --> 00:47:40,520
Nikolay: This was the best episode
number a hundred we could

1070
00:47:40,520 --> 00:47:41,020
imagine.