1
00:00:00,060 --> 00:00:02,060
Nikolay: Hello, hello, this is
Postgres.FM.

2
00:00:02,260 --> 00:00:04,700
This is Nik from Postgres.AI.

3
00:00:05,020 --> 00:00:06,820
And as usual, co-host Michael.

4
00:00:06,820 --> 00:00:07,580
Hi, Michael.

5
00:00:07,900 --> 00:00:08,520
Michael: Hi, Nik.

6
00:00:08,520 --> 00:00:08,720
Nikolay: Yeah.

7
00:00:08,720 --> 00:00:12,380
And we invited a couple of guests
today, Nate and Andrew.

8
00:00:12,660 --> 00:00:17,880
And we invited them because the company
Metronome experienced outages

9
00:00:18,140 --> 00:00:21,280
in, I think, May, right? Or April?
In May, right?

10
00:00:21,280 --> 00:00:28,740
In May of 2025, which were extremely
hard to diagnose, but these

11
00:00:28,740 --> 00:00:33,320
folks did a very good job and discussed
all the details in an excellent

12
00:00:33,320 --> 00:00:34,100
blog post.

13
00:00:34,740 --> 00:00:38,480
And this blog post looked so great,
so we couldn't miss it.

14
00:00:38,480 --> 00:00:39,640
So thank you for coming.

15
00:00:40,360 --> 00:00:44,840
Let's dive into some very rare
Postgres problems.

16
00:00:45,540 --> 00:00:46,720
Nate: Yeah, thanks for having us.

17
00:00:46,720 --> 00:00:50,260
And hopefully sharing this out
will keep making it rare.

18
00:00:50,740 --> 00:00:51,580
Andrew: Yeah, yes.

19
00:00:51,580 --> 00:00:52,780
Thank you for having us.

20
00:00:53,000 --> 00:00:54,240
It's a pleasure to be here.

21
00:00:54,380 --> 00:00:54,880
Nikolay: Great.

22
00:00:56,040 --> 00:00:58,200
So, Michael, where should we start?

23
00:01:00,040 --> 00:01:02,560
Michael: How about we start with,
could you guys give us a little

24
00:01:02,560 --> 00:01:07,000
bit of an overview of what Metronome
does, why you're using Postgres,

25
00:01:07,540 --> 00:01:12,540
and what you were up to that was
hitting the limits this hard?

26
00:01:13,140 --> 00:01:14,240
Yeah, so I can start with

27
00:01:14,240 --> 00:01:17,500
Nate: the overview, and then Andrew,
I can pair on the exact

28
00:01:17,720 --> 00:01:19,900
way that we kind of blew our foot
off here.

29
00:01:20,320 --> 00:01:20,820
Andrew: Absolutely.

30
00:01:21,340 --> 00:01:24,500
Nate: Yeah, so Metronome is a usage-based
billing platform.

31
00:01:25,080 --> 00:01:27,660
And so, you know, when it was founded,
I think it's coming up

32
00:01:27,660 --> 00:01:30,860
on 6 years ago now, it was really,
you know, at that point in

33
00:01:30,860 --> 00:01:34,300
time, this was just before COVID
and lockdowns began, the target

34
00:01:34,300 --> 00:01:37,060
was really, there's a lot of infrastructure
and service companies,

35
00:01:37,420 --> 00:01:39,660
most of them are building on some
kind of heartbeat, you know,

36
00:01:39,660 --> 00:01:41,780
per minute or sometimes per second
model.

37
00:01:42,040 --> 00:01:43,980
That gets really tricky to do at
scale.

38
00:01:44,060 --> 00:01:46,920
In the simplest cases, yeah, you're
just incrementing a counter,

39
00:01:47,080 --> 00:01:50,380
but as things ramp up and your
account executives go out and

40
00:01:50,380 --> 00:01:54,400
cook up all different ways to finesse
how they want to charge

41
00:01:54,400 --> 00:01:56,460
that client, it gets really complicated.

42
00:01:56,480 --> 00:01:59,380
You add in things like committed
spend, credits, you want to

43
00:01:59,380 --> 00:02:02,140
automate it, you want to cut people
off when they're not paying.

44
00:02:02,800 --> 00:02:05,720
The complexity really compounds
immediately.

45
00:02:06,760 --> 00:02:10,640
And 1 of the founders had worked
on the billing at Dropbox

46
00:02:10,800 --> 00:02:14,000
and had also encountered issues
where when you build this internally,

47
00:02:14,100 --> 00:02:18,000
which is where everyone kind of
starts, it gets very easy to

48
00:02:18,000 --> 00:02:19,700
build the brittle system by accident.

49
00:02:19,780 --> 00:02:22,860
It builds to your local maxima, fits
your current pricing and packaging.

50
00:02:25,840 --> 00:02:26,280
Then launches get delayed.

51
00:02:26,280 --> 00:02:28,520
Like they can't launch their product
or they can't charge for

52
00:02:28,520 --> 00:02:30,720
it for weeks or something, months.

53
00:02:31,380 --> 00:02:35,760
So you get led into the spot where
you're delaying launches or

54
00:02:36,040 --> 00:02:39,240
just like burning compute because
you can't actually charge for

55
00:02:39,240 --> 00:02:40,760
it, which feels really silly.

56
00:02:40,760 --> 00:02:44,180
And then also, frequently, you
just accept a lot of slippage

57
00:02:44,540 --> 00:02:47,180
because building these systems
to be perfectly accurate gets

58
00:02:47,240 --> 00:02:48,000
really tricky.

59
00:02:48,040 --> 00:02:51,900
And thankfully there's a lot of
foundational pieces such as Kafka

60
00:02:51,900 --> 00:02:55,580
exactly once and strong transactional
databases to make this

61
00:02:55,580 --> 00:02:58,300
possible to build and have confidence
in.

62
00:02:58,880 --> 00:03:03,080
But it's very difficult to do this
internally and prioritize

63
00:03:03,280 --> 00:03:04,540
that degree of rigor.

64
00:03:04,660 --> 00:03:06,980
And also if you're building an infrastructure
company, you know,

65
00:03:06,980 --> 00:03:09,860
let's say you're a database startup,
your best engineers are

66
00:03:09,860 --> 00:03:12,100
there because they want to work
on building a database, they're

67
00:03:12,100 --> 00:03:14,340
not actually there because they
want to work on building billing.

68
00:03:14,380 --> 00:03:16,780
And so we also think that, you
know, kind of consolidating talent

69
00:03:16,780 --> 00:03:19,040
to work on this problem yields
a better result.

70
00:03:19,040 --> 00:03:24,160
And I think the marriage of the
business logic domain, the exact

71
00:03:24,160 --> 00:03:27,420
correctness needs, and the overall
data scale volumes, since we're

72
00:03:27,700 --> 00:03:30,940
right place, right time for this
AI boom, introduces a lot of

73
00:03:30,940 --> 00:03:32,140
very interesting challenges.

74
00:03:32,160 --> 00:03:34,860
And well, we're about to talk about
one of them.

75
00:03:35,900 --> 00:03:38,080
Nikolay: And you chose Postgres,
obviously, right?

76
00:03:38,480 --> 00:03:41,280
Nate: Yeah, the Postgres decision
preceded both of us by a ways,

77
00:03:41,280 --> 00:03:43,760
but I think we're broadly quite
happy with it.

78
00:03:43,820 --> 00:03:45,140
Nikolay: Okay, good, good.

79
00:03:45,180 --> 00:03:48,780
So do you know why Postgres was
chosen in the past?

80
00:03:49,460 --> 00:03:50,200
Nate: Not precisely.

81
00:03:50,820 --> 00:03:54,640
I mean, like, probably, yeah, we
immediately went to using RDS.

82
00:03:54,660 --> 00:03:56,460
I don't know if it was immediately
Aurora.

83
00:03:57,440 --> 00:03:57,940
Interesting.

84
00:03:58,040 --> 00:03:58,540
Yeah.

85
00:03:58,660 --> 00:04:01,500
I think it would be an interesting
decision as a small startup

86
00:04:01,500 --> 00:04:04,340
who anticipates having a lot of
data scale to want to deal with

87
00:04:04,340 --> 00:04:07,400
the schema change pain associated
with MySQL.

88
00:04:08,740 --> 00:04:12,820
Nikolay: Yeah, I wanted to add
a joke that maybe somebody googled

89
00:04:12,920 --> 00:04:16,220
some open source relational
database with good money

90
00:04:16,220 --> 00:04:17,060
data type.

91
00:04:17,780 --> 00:04:21,500
But yeah, you know, money data
type is what people shouldn't

92
00:04:21,500 --> 00:04:22,400
use in Postgres.

93
00:04:22,540 --> 00:04:23,140
This is...

94
00:04:23,140 --> 00:04:25,840
Nate: Yeah, no, I think we're mostly
in numeric, but at this

95
00:04:25,840 --> 00:04:26,820
point we're kind of cursed.

96
00:04:26,820 --> 00:04:30,320
We didn't lock down the precision
more tightly because Postgres

97
00:04:30,320 --> 00:04:31,980
will give you far too much precision.

98
00:04:32,200 --> 00:04:35,460
And kind of having to interoperate
with any other system becomes

99
00:04:36,040 --> 00:04:36,540
difficult.

100
00:04:36,580 --> 00:04:39,020
And so we're starting to figure
out how to unwind some of those

101
00:04:39,020 --> 00:04:39,520
decisions.

102
00:04:41,340 --> 00:04:42,460
Michael: Yeah, yeah.

103
00:04:42,740 --> 00:04:44,280
That is an interesting tangent.

104
00:04:45,860 --> 00:04:50,280
Especially, I guess, with people
billing by fractions of a cent

105
00:04:50,280 --> 00:04:50,860
and things.

106
00:04:50,860 --> 00:04:54,360
So yeah, anyway, probably too much
of a tangent, or maybe we'll

107
00:04:54,360 --> 00:04:55,580
come back to it later.

108
00:04:55,840 --> 00:04:59,080
Nate: Yeah, I mean, in short, yeah,
it gets very painful translating

109
00:04:59,120 --> 00:05:01,980
between the different systems that
do or don't have that precision.

110
00:05:02,080 --> 00:05:05,460
Sometimes even your billing provider,
understandably, can't bill

111
00:05:05,460 --> 00:05:09,360
in fractions of a cent, which means
that all the different

112
00:05:09,780 --> 00:05:13,680
sub-products you paid for, once
those round together, might

113
00:05:13,680 --> 00:05:15,220
not end up with the same total.

114
00:05:15,560 --> 00:05:16,060
Michael: Yeah.

115
00:05:16,160 --> 00:05:16,660
Yeah.

116
00:05:17,360 --> 00:05:20,320
Nikolay: Can you tell us a little
bit about the scale?

117
00:05:20,320 --> 00:05:23,400
I remember 30 terabytes on Aurora
from the blog post.

118
00:05:23,400 --> 00:05:28,260
Anything like in this area you
could share to get us the feeling

119
00:05:28,260 --> 00:05:29,120
of the scale?

120
00:05:29,540 --> 00:05:32,580
Nate: So we have 2 primary workloads
that are running on Postgres.

121
00:05:32,780 --> 00:05:36,980
The 1 we'll mostly be talking about
is our main DB, which is

122
00:05:36,980 --> 00:05:39,840
a common start-up pattern where
we've got our kitchen sink database

123
00:05:39,840 --> 00:05:42,900
that has nearly every dataset within
it.

124
00:05:42,900 --> 00:05:46,840
And yeah, that's ballooned out
to around 30 terabytes now and

125
00:05:46,840 --> 00:05:49,540
growing terabyte plus per month
at this point.

126
00:05:50,020 --> 00:05:53,160
And yeah, this wasn't an issue
for a while.

127
00:05:53,160 --> 00:05:56,460
And then depending on how our clients
leverage our platform,

128
00:05:56,920 --> 00:06:00,620
you know, part of billing is generating
invoices and invoices

129
00:06:00,620 --> 00:06:02,980
are usually characterized, you
know, with like, you know, a top

130
00:06:02,980 --> 00:06:05,920
level total and whatnot, but then
you have a line item per product

131
00:06:05,920 --> 00:06:06,420
usage.

132
00:06:07,060 --> 00:06:10,520
And, you know, some of these companies
have a lot of different

133
00:06:10,520 --> 00:06:11,580
products you can pay for.

134
00:06:11,580 --> 00:06:14,940
You know, you look at the list
of different LLM model engines

135
00:06:14,940 --> 00:06:15,640
that are available.

136
00:06:16,100 --> 00:06:20,380
Sometimes you can use batch, async
versus real time, or like

137
00:06:20,380 --> 00:06:23,320
infrastructure vendors who are
deployed to dozens of different

138
00:06:23,320 --> 00:06:23,820
regions.

139
00:06:24,720 --> 00:06:28,440
And it's easy for the cardinality
here to explode, especially

140
00:06:28,440 --> 00:06:32,060
if you want to group it by a logical
application on their platform.

141
00:06:33,940 --> 00:06:38,560
And we made, but now regret, the
decision to basically persist

142
00:06:38,560 --> 00:06:42,440
those line items as separate
rows in a ginormous table in our

143
00:06:42,440 --> 00:06:43,480
Postgres database.

144
00:06:45,040 --> 00:06:45,360
Nikolay: How big is

145
00:06:45,360 --> 00:06:46,100
that table?

146
00:06:47,320 --> 00:06:49,800
Nate: It's about half the cluster
and growing.

147
00:06:50,980 --> 00:06:53,580
Nikolay: Like 15 terabytes unpartitioned,
right?

148
00:06:54,560 --> 00:06:55,060
Yep.

149
00:06:55,240 --> 00:06:55,740
Yeah.

150
00:06:55,840 --> 00:06:57,380
Yeah, well, I know this.

151
00:06:58,500 --> 00:06:59,280
It's a pain,

152
00:07:00,380 --> 00:07:00,880
Nate: Yeah.

153
00:07:01,120 --> 00:07:03,900
There's some scaling problems you
get really ahead of and others

154
00:07:03,900 --> 00:07:06,780
that you defer because I mean,
a lot of the technology in the

155
00:07:06,780 --> 00:07:08,800
Postgres space is really fantastic
now.

156
00:07:08,800 --> 00:07:13,780
Like Aurora does probably just
handle it, copy out, making sure

157
00:07:13,780 --> 00:07:17,300
that you're dealing with vacuuming
and whatnot and all the other

158
00:07:17,300 --> 00:07:18,620
kind of implicit thresholds.

159
00:07:19,540 --> 00:07:23,680
Nikolay: Actually, I forgot to
mention that those who watch Postgres

160
00:07:23,680 --> 00:07:28,040
FM regularly should by this point
already have watched the previous episode

161
00:07:28,040 --> 00:07:31,740
with Sugu, where we discussed Multigres
and sharding.

162
00:07:31,980 --> 00:07:35,940
And in that very episode, he mentioned
your case, right?

163
00:07:36,160 --> 00:07:40,580
So yeah, I wanted to answer that this
is what we were going to do the next

164
00:07:40,580 --> 00:07:42,280
week, but it was a secret at that
time.

165
00:07:42,280 --> 00:07:47,560
So, and before that we also had
PgDog, Lev Kokotov from PgDog

166
00:07:48,460 --> 00:07:53,480
And obviously sharding is currently
a hot topic again, and with

167
00:07:53,480 --> 00:07:55,300
Citus in the past, there is SPQR.

168
00:07:56,120 --> 00:07:59,440
Before that we had PL/Proxy from
Skype very long ago, like

169
00:07:59,440 --> 00:08:00,460
previous era.

170
00:08:00,660 --> 00:08:04,120
So is it right to say that the
problem we are going to discuss

171
00:08:04,120 --> 00:08:09,360
today, you wouldn't meet if
you were on a sharded schema?

172
00:08:10,520 --> 00:08:13,780
If your database was sharded, this
wouldn't happen probably.

173
00:08:14,100 --> 00:08:16,920
Nate: I think we've seen evidence to
suggest that it would be easier

174
00:08:16,920 --> 00:08:17,860
to avoid it.

175
00:08:17,860 --> 00:08:20,960
As part of the migration that we're
about to mention, we're moving

176
00:08:21,040 --> 00:08:24,560
another table of generated
data to a partitioned scheme.

177
00:08:24,960 --> 00:08:27,740
And we've already seen a lot of
positive effects there, in that

178
00:08:27,740 --> 00:08:30,180
you can run a lot of autovacuum
workers in parallel.

179
00:08:30,620 --> 00:08:34,480
And so it allows you to much more
easily keep up with the rate

180
00:08:34,480 --> 00:08:35,100
of change.

181
00:08:35,580 --> 00:08:37,160
Nikolay: This is partitioning,
not sharding.

182
00:08:37,720 --> 00:08:40,900
And with unpartitioned table, the
problem is vacuum is single

183
00:08:40,900 --> 00:08:41,880
threaded, right?

184
00:08:43,440 --> 00:08:43,940
Yeah.

185
00:08:44,340 --> 00:08:49,220
So with a partitioned schema, you
can raise the number of workers,

186
00:08:49,220 --> 00:08:54,120
autovacuum_max_workers, you can raise it
and benefit from it if you have

187
00:08:54,120 --> 00:08:57,100
a lot of CPUs and disk IO capacity.

188
00:08:58,080 --> 00:09:04,180
But what I'm trying to say is that
problems like this can be

189
00:09:04,180 --> 00:09:08,560
avoided completely if we have sharding
and every shard is smaller.

190
00:09:08,560 --> 00:09:09,900
Every single shard

191
00:09:09,900 --> 00:09:10,240
Nate: is smaller.

192
00:09:10,240 --> 00:09:11,980
Yeah, and that's what I was trying
to get at.

193
00:09:11,980 --> 00:09:15,140
By partitioning it, we can kind
of see the effect of, you know,

194
00:09:15,140 --> 00:09:18,340
if these were all in logical chunks,
that were maybe at max 100

195
00:09:18,340 --> 00:09:22,820
gigabytes, it's much easier for
the garbage collection system

196
00:09:22,960 --> 00:09:23,860
to keep up.

197
00:09:24,240 --> 00:09:25,420
Nikolay: Yeah, yeah, yeah, yeah.

198
00:09:25,600 --> 00:09:26,280
That's interesting.

199
00:09:26,280 --> 00:09:29,120
Nate: It would be even easier if
we were across many machines.

200
00:09:29,340 --> 00:09:31,900
But at the very least, I think we can infer most of the effect,

201
00:09:31,920 --> 00:09:33,220
even if it's all on 1.

202
00:09:33,560 --> 00:09:34,060
Nikolay: Yeah.

203
00:09:34,400 --> 00:09:34,740
Okay.

204
00:09:34,740 --> 00:09:37,780
Maybe it's already time to dive into the problem itself, right?

205
00:09:37,920 --> 00:09:41,340
Nate: Yeah.
Or maybe briefly, like, how we created the problem.

206
00:09:41,980 --> 00:09:42,440
Nikolay: Yeah.

207
00:09:42,440 --> 00:09:47,540
Before you go into this, I was curious, did you have any

208
00:09:47,660 --> 00:09:51,820
issues with multixacts before that at all?

209
00:09:51,820 --> 00:09:56,020
Like maybe other kinds of issues, like observing in Performance

210
00:09:56,040 --> 00:09:59,840
Insights lightweight locks contention on multixacts,

211
00:10:00,080 --> 00:10:02,280
or like a couple of them.

212
00:10:02,840 --> 00:10:04,520
You never had them before, right?

213
00:10:06,180 --> 00:10:07,060
Nate: Not in particular.

214
00:10:07,260 --> 00:10:10,600
No, I mean, we had run into emergency autovacuums for

215
00:10:10,600 --> 00:10:11,100
sure.

216
00:10:11,200 --> 00:10:13,780
Nikolay: Yeah, yeah, that's very interesting because your path

217
00:10:13,780 --> 00:10:17,340
is, like, definitely different from what we observed with other

218
00:10:17,720 --> 00:10:18,220
projects.

219
00:10:18,800 --> 00:10:22,200
Because usually, different problems with multixacts came, and

220
00:10:22,200 --> 00:10:24,060
the contention happened, and so on.

221
00:10:24,440 --> 00:10:28,260
Your path is quite unique in this case, I think.

222
00:10:28,260 --> 00:10:28,760
Yeah.

223
00:10:28,980 --> 00:10:30,180
So let's dive in.

224
00:10:30,180 --> 00:10:31,120
Let's dive in.

225
00:10:31,420 --> 00:10:34,080
Nate: Yeah, I really want to describe how we created the problem

226
00:10:34,080 --> 00:10:34,600
for ourselves.

227
00:10:34,600 --> 00:10:37,200
And I really want Andrew to explain how we diagnosed it.

228
00:10:37,200 --> 00:10:39,780
Cause like, that's the most interesting part here by far.

229
00:10:39,840 --> 00:10:40,200
Yeah.

230
00:10:40,200 --> 00:10:43,120
As we keep seeing these tables grow, you know, we kind of realized,

231
00:10:43,120 --> 00:10:45,100
also due to some access patterns that were introduced,

232
00:10:45,100 --> 00:10:49,020
it wasn't really tenable to keep just appending data month after

233
00:10:49,020 --> 00:10:52,540
month for every customer in terms of the line item table.

234
00:10:53,000 --> 00:10:55,440
And yeah, this was also painful for the invoice table, which,

235
00:10:55,440 --> 00:10:58,260
you know, there's a lot of write amplification between the 2.

236
00:10:58,260 --> 00:11:01,580
The average invoice has, you know, dozens and sometimes hundreds

237
00:11:01,640 --> 00:11:04,980
of line items, depending on how a client's using our platform.

238
00:11:06,020 --> 00:11:09,780
And so invoices have to be queried a lot of different ways.

239
00:11:09,780 --> 00:11:12,660
We didn't really see a way of moving that outside of a relational

240
00:11:12,700 --> 00:11:15,840
store and like being able to use foreign keys and constraints

241
00:11:15,840 --> 00:11:17,240
with it and the rest of our data.

242
00:11:17,240 --> 00:11:19,940
It was providing too much value to really do anything

243
00:11:19,940 --> 00:11:20,780
different there.

244
00:11:20,840 --> 00:11:24,660
We did wind up partitioning it just to like enable the autovacuum

245
00:11:24,660 --> 00:11:26,760
process to keep up much more easily.

246
00:11:27,380 --> 00:11:29,800
But we realized we could constrain the access pattern for the

247
00:11:29,800 --> 00:11:32,440
line items sufficiently that it didn't actually need to be in

248
00:11:32,440 --> 00:11:32,940
Postgres.

249
00:11:33,340 --> 00:11:35,920
Because I guess within Metronome,
you know, over the course of

250
00:11:35,920 --> 00:11:38,980
a month, you have the notion of
an invoice that's in a draft state

251
00:11:39,060 --> 00:11:41,600
where we'll calculate it on demand
for you, and we're not actually

252
00:11:41,600 --> 00:11:43,640
persisting data associated with
it.

253
00:11:44,160 --> 00:11:47,060
Then only after you go through
a finalization, which is typically

254
00:11:47,300 --> 00:11:49,840
the first of the month, because
most billing periods are going

255
00:11:49,840 --> 00:11:53,100
to be monthly, that we do all the
math, do the calculations,

256
00:11:53,320 --> 00:11:54,480
persist that data.

257
00:11:55,160 --> 00:11:58,780
From which you can then infer our workload
sounds pretty funky.

258
00:11:58,780 --> 00:12:01,920
So we've got a steady trickle of
writes for configuration changes.

259
00:12:01,980 --> 00:12:05,660
And then once a month, we have
to write out several billion rows

260
00:12:05,660 --> 00:12:08,160
which, you know, amounts to like
1 to 2 terabytes of data.

261
00:12:08,580 --> 00:12:09,080
Yeah

262
00:12:09,640 --> 00:12:11,060
Michael: all on the same day.

263
00:12:11,240 --> 00:12:12,840
Nate: As close to it as we can
get.

264
00:12:13,220 --> 00:12:15,900
Nikolay: Yeah, very different from
social media.

265
00:12:16,420 --> 00:12:17,240
Very different.

266
00:12:18,460 --> 00:12:21,680
Nate: Otherwise it's, you know, a
more typical, you know, we have

267
00:12:21,780 --> 00:12:24,360
99 plus percent reads versus writes.

268
00:12:25,680 --> 00:12:27,760
It's a lot of compute on demand.

269
00:12:28,260 --> 00:12:32,780
So it's a little funky, I think
that because the high spikes

270
00:12:32,780 --> 00:12:35,780
in write volume are so periodic
is why we haven't seen multixact

271
00:12:35,920 --> 00:12:36,980
problems in the past.

272
00:12:37,200 --> 00:12:40,240
I think we'd likely narrowly been
avoiding them because it's

273
00:12:40,240 --> 00:12:42,780
so hard to observe whether or not
you're approaching them.

274
00:12:43,260 --> 00:12:46,420
I feel like this is 1 of the meta-stable
states of Postgres,

275
00:12:47,380 --> 00:12:49,940
if you're familiar with that idea,
from the definition of stable,

276
00:12:49,940 --> 00:12:52,800
where often things are working,
working, working, and they just

277
00:12:52,800 --> 00:12:55,240
fall off a cliff or entirely lock
up.

278
00:12:55,240 --> 00:12:58,140
But the basic trigger here being
we identified we wanted to make

279
00:12:58,140 --> 00:13:02,360
this architecture change, and then
we were running the very long

280
00:13:02,360 --> 00:13:06,340
migration process of pushing these
line items into S3, which

281
00:13:06,340 --> 00:13:08,800
is what our back end store was
going to be, since their access pattern

282
00:13:08,800 --> 00:13:09,640
was so simple.

283
00:13:09,960 --> 00:13:14,680
But also rewriting and moving the
invoices from 1 table to another.

284
00:13:15,060 --> 00:13:19,540
And fortunately, we're doing that
broadly 1 by 1, which we'll

285
00:13:19,540 --> 00:13:20,820
get to why that's relevant.

286
00:13:21,040 --> 00:13:22,540
Nikolay: So you mentioned performance
cliff.

287
00:13:22,540 --> 00:13:25,440
This is a very good term when we
talk about multixacts,

288
00:13:25,440 --> 00:13:29,060
when we had different cases in
the past where, as I mentioned,

289
00:13:29,060 --> 00:13:30,060
contention happened.

290
00:13:30,060 --> 00:13:37,440
And interestingly enough, like we
always, like 20, 15, 10 years

291
00:13:37,440 --> 00:13:43,580
ago, we often saw opinions that
foreign keys are dangerous, like

292
00:13:43,580 --> 00:13:48,340
they have huge overhead, don't
use them because like they slow

293
00:13:48,340 --> 00:13:49,820
down everything and so on.

294
00:13:49,900 --> 00:13:53,480
But it's interesting that now we can
agree with it, but in a very different

295
00:13:53,480 --> 00:13:57,980
way, in an unexpected way, because
they, yes, they slow down

296
00:13:57,980 --> 00:14:00,060
writes, but we can afford it.

297
00:14:00,060 --> 00:14:05,140
And your table, which has grown to
15 terabytes, is a good example,

298
00:14:05,140 --> 00:14:05,380
right?

299
00:14:05,380 --> 00:14:09,160
You survived with foreign keys
and you paid that price to have

300
00:14:09,160 --> 00:14:10,880
good data consistency, right?

301
00:14:11,500 --> 00:14:13,940
But then performance cliffs, right?

302
00:14:15,340 --> 00:14:18,180
Nate: Yeah, and for that table
in particular, the reason why

303
00:14:18,180 --> 00:14:21,220
we were able to move that was it
only had 1 foreign key, which

304
00:14:21,220 --> 00:14:22,900
is back to the invoice row.

305
00:14:22,900 --> 00:14:23,940
It was much simpler.

306
00:14:24,220 --> 00:14:25,960
That table is effectively append-only.

307
00:14:27,380 --> 00:14:30,060
It was a workload that just really
didn't have to be on Postgres.

308
00:14:30,060 --> 00:14:34,000
I don't think we were taking advantage
of the qualities to justify

309
00:14:34,000 --> 00:14:34,660
the cost.

310
00:14:36,500 --> 00:14:41,140
Nikolay: So let's discuss multixacts
in general, like why we

311
00:14:41,400 --> 00:14:42,700
should think about them.

312
00:14:42,700 --> 00:14:46,300
Like, you know, my own story is
usually, like with multixacts,

313
00:14:47,540 --> 00:14:49,940
I take some monitoring, this is
my hobby.

314
00:14:49,940 --> 00:14:53,860
Take some Postgres monitoring and
we check which metrics they

315
00:14:53,860 --> 00:14:54,120
have.

316
00:14:54,120 --> 00:14:59,940
And usually they have transaction
ID wraparound risk analysis,

317
00:15:00,020 --> 00:15:00,520
right?

318
00:15:01,100 --> 00:15:04,920
Like to control freezing process
and to understand how far we

319
00:15:04,920 --> 00:15:06,600
are from transaction ID wraparound.

320
00:15:08,140 --> 00:15:11,540
My question number 1, where is
multixact ID wraparound analysis?

321
00:15:12,380 --> 00:15:14,520
Most monitoring systems lack it.

322
00:15:14,920 --> 00:15:16,820
But they lack it because it never
happens.

323
00:15:17,680 --> 00:15:20,360
Usually, like, transaction ID wraparound
happens.

324
00:15:20,380 --> 00:15:24,260
We know several blog posts and
horror stories from the past.

325
00:15:24,440 --> 00:15:28,820
But multixact ID wraparound,
nobody has experienced it.

326
00:15:29,540 --> 00:15:32,460
So OK, why should we think about
multixacts?

327
00:15:33,700 --> 00:15:37,700
Tell us more technical details
about what happened in your case.

328
00:15:37,860 --> 00:15:41,020
Nate: Well, briefly I'll respond
to the point about foreign keys.

329
00:15:41,680 --> 00:15:44,400
The community is right that you
don't always need them.

330
00:15:44,680 --> 00:15:47,840
But I think what can be tricky
is that when you're a very young

331
00:15:47,840 --> 00:15:50,980
company making lots of changes
to your product, trying to find

332
00:15:50,980 --> 00:15:53,560
that product market fit and get
to the point where scaling becomes

333
00:15:53,560 --> 00:15:54,360
the biggest concern,

334
00:15:54,520 --> 00:15:57,240
foreign keys are really useful to
like prevent teams from stepping

335
00:15:57,240 --> 00:15:58,840
on each other and breaking your
data.

336
00:15:59,240 --> 00:16:02,440
And only when those product areas
and like the platform you're

337
00:16:02,440 --> 00:16:05,140
building on top of become mature
enough, can you maybe afford

338
00:16:05,220 --> 00:16:09,160
to put in the, you know, like pull
request level, you know, integrity

339
00:16:09,160 --> 00:16:11,760
checks that make sure you're not
changing some access pattern

340
00:16:11,760 --> 00:16:14,860
or maybe build the asynchronous,
you know, like offline data

341
00:16:14,860 --> 00:16:17,540
consistency check to make sure
you're not violating it.

342
00:16:18,000 --> 00:16:21,420
And this is a case where we didn't
think critically about whether

343
00:16:21,420 --> 00:16:23,700
we made that leap from 1 side to
the other.

344
00:16:23,740 --> 00:16:25,840
And in retrospect, we have.

345
00:16:25,840 --> 00:16:27,660
And it's 1 of the mitigations we
took.

346
00:16:28,520 --> 00:16:29,020
Yeah.

347
00:16:29,540 --> 00:16:32,600
Where I guess I would defend having
a really strict schema early

348
00:16:32,600 --> 00:16:35,520
on, and then as you scale the platform,
it just becomes necessary

349
00:16:35,600 --> 00:16:37,620
to progress some of those guardrails.

350
00:16:38,100 --> 00:16:40,940
Nikolay: Yeah, to maintain good
quality of data, that's absolutely

351
00:16:41,280 --> 00:16:42,180
great advice.

352
00:16:42,880 --> 00:16:48,340
But let's discuss how multixacts
are created, and so on.

353
00:16:48,580 --> 00:16:49,080
Andrew: Sure.

354
00:16:49,280 --> 00:16:49,780
Yes.

355
00:16:50,020 --> 00:16:54,020
So I suppose we should start with
what a multixact even is

356
00:16:54,020 --> 00:16:55,240
and when it's used.

357
00:16:55,240 --> 00:16:59,200
So basically, in Postgres, when
you have 2 or more transactions

358
00:16:59,280 --> 00:17:02,360
trying to lock the same row at
the same time, for instance, when

359
00:17:02,360 --> 00:17:06,460
you're using SELECT FOR SHARE or
SELECT FOR UPDATE or in fact

360
00:17:06,460 --> 00:17:09,720
when you're inserting into a table
that has a foreign key to

361
00:17:09,720 --> 00:17:13,740
another table, Postgres uses a
system called multixacts or multi

362
00:17:13,740 --> 00:17:14,240
transactions.

363
00:17:15,020 --> 00:17:18,580
And a multixact is basically a
logical object that represents

364
00:17:18,580 --> 00:17:21,240
a set of transactions locking the
same row.

365
00:17:21,740 --> 00:17:26,020
Each participating transaction
is called a multixact member.

366
00:17:26,500 --> 00:17:29,700
And those are a little distinct,
but they are related.

367
00:17:30,560 --> 00:17:34,160
And the way that a multixact
is stored is that it's represented

368
00:17:34,200 --> 00:17:38,160
by a 32-bit integer ID and that's
where your wraparound issue

369
00:17:38,160 --> 00:17:39,860
can happen with multixact IDs.

370
00:17:39,860 --> 00:17:42,540
But in this scenario, that was
actually not the problem.

371
00:17:43,220 --> 00:17:47,460
For every multixact, Postgres
keeps an offset into a globally

372
00:17:47,520 --> 00:17:50,720
shared, append-only, multixact
membership space.

373
00:17:51,300 --> 00:17:55,460
These are stored in the pg_multixact_members
file on disk.

374
00:17:55,680 --> 00:17:59,880
You can think of it as logically
like a simple least recently

375
00:18:00,060 --> 00:18:04,480
used (LRU) cache that is contiguous,
so you have to append to it every

376
00:18:04,480 --> 00:18:05,220
single time.

377
00:18:06,740 --> 00:18:11,740
And so when you have to store these
members, you will have to

378
00:18:11,740 --> 00:18:18,060
look up an offset into this theoretically
long array, and each

379
00:18:18,060 --> 00:18:22,400
1 of your transaction IDs is stored
in 1 slot in this array,

380
00:18:22,740 --> 00:18:25,460
and this array is indexed by a
32-bit integer.

381
00:18:26,380 --> 00:18:30,300
This membership array is what
was exhausted in our case.

382
00:18:31,100 --> 00:18:36,440
So due to the way that multixacts
are created and subsequently

383
00:18:36,480 --> 00:18:40,560
reaped, it is easy to exhaust this
membership space when you

384
00:18:40,560 --> 00:18:44,860
have a lot of parallel transactions
attempting to lock the same

385
00:18:44,860 --> 00:18:48,480
row, whether it's through foreign
keys or whether you're doing

386
00:18:48,480 --> 00:18:50,580
SELECT FOR SHARE or SELECT FOR
UPDATE.

387
00:18:51,280 --> 00:18:56,020
And the reason for that is that
since these multixacts are immutable,

388
00:18:56,580 --> 00:19:00,300
suppose that you have 3 transactions
trying to lock the same

389
00:19:00,300 --> 00:19:00,680
row.

390
00:19:00,680 --> 00:19:02,960
You would first create the first
multixact between the first

391
00:19:02,960 --> 00:19:05,880
2 transactions, and then the third
transaction would come in

392
00:19:05,880 --> 00:19:08,860
and you would have to create yet
another multixact with the

393
00:19:08,860 --> 00:19:11,680
old transaction IDs and then the
new 1.
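
[Editor's note: the cloning behavior Andrew describes can be sketched in a toy model. This is illustrative only, not Postgres internals: a hypothetical immutable multixact whose only way to gain a locker is to allocate a brand-new multixact, leaving the superseded one's member slots occupied until a vacuum-like cleanup.]

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MultiXact:
    """Toy model of an immutable multixact: a set of member txids."""
    members: tuple

def add_locker(mx: MultiXact, xid: int, member_space: list) -> MultiXact:
    """A new locker can't modify mx; it allocates a fresh multixact.
    The old multixact's member slots remain in member_space."""
    new_mx = MultiXact(mx.members + (xid,))
    member_space.append(new_mx)  # append-only until a vacuum-like cleanup
    return new_mx

space = []
mx = MultiXact((101, 102))        # first two transactions lock the row
space.append(mx)
mx = add_locker(mx, 103, space)   # a third transaction joins

slots_used = sum(len(m.members) for m in space)
print(slots_used)  # 5 slots consumed (2 + 3) for only 3 transactions
```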

394
00:19:12,180 --> 00:19:16,380
But the membership space is not
relinquished until a vacuum occurs,

395
00:19:16,560 --> 00:19:20,200
which can happen later, or perhaps
never in some cases, if you

396
00:19:20,200 --> 00:19:23,820
are constantly adding new transactions
and you're super busy

397
00:19:23,820 --> 00:19:25,060
doing all these updates.

398
00:19:25,680 --> 00:19:27,620
And what's interesting is this.

399
00:19:28,260 --> 00:19:32,920
The multi-transaction membership space cannot be reclaimed in

400
00:19:33,060 --> 00:19:33,420
chunks.

401
00:19:33,420 --> 00:19:34,860
It has to be reclaimed contiguously.

402
00:19:35,500 --> 00:19:38,440
So suppose you have a very long-running transaction that is part

403
00:19:38,440 --> 00:19:41,100
of a multixact that is very old.

404
00:19:41,520 --> 00:19:44,680
It can hold up the entire vacuuming of subsequent multixacts

405
00:19:44,800 --> 00:19:47,280
that are finished but are short-lived.

406
00:19:47,640 --> 00:19:51,180
And it is not until the oldest multi-transaction is finished

407
00:19:51,340 --> 00:19:54,100
that the vacuum can actually reclaim all the membership space

408
00:19:54,140 --> 00:19:57,140
from the old multixact to the most recent one.

409
00:19:57,500 --> 00:20:00,820
And this scenario creates the ability to exhaust the multixact

410
00:20:01,020 --> 00:20:01,940
membership space.

411
00:20:02,540 --> 00:20:04,620
And that's what we experienced ourselves.

412
00:20:05,280 --> 00:20:07,180
Nikolay: Yeah, I wanted to add one thing.

413
00:20:07,340 --> 00:20:14,560
It might sound strange that SELECT FOR UPDATE may also produce multixacts.

414
00:20:15,360 --> 00:20:16,780
You also explored this, right?

415
00:20:16,780 --> 00:20:19,980
It happens only when we have sub-transactions, I think, right?

416
00:20:20,060 --> 00:20:25,080
So if it's a single transaction, select for update, there is

417
00:20:25,080 --> 00:20:26,260
no place for multixacts.

418
00:20:26,460 --> 00:20:31,080
But if we start doing savepoints there, basically we have different

419
00:20:31,080 --> 00:20:34,620
transactions inside one transaction, and then multixact mechanism

420
00:20:34,640 --> 00:20:35,780
starts working again.

421
00:20:35,980 --> 00:20:38,500
And I just remember, like, how many years ago?

422
00:20:38,500 --> 00:20:43,020
Four years ago, I explored this when I explored another big performance

423
00:20:43,020 --> 00:20:44,320
cliff related to sub-transactions.

424
00:20:45,060 --> 00:20:46,880
So there I noticed this as well.

425
00:20:46,880 --> 00:20:47,380
Yeah.

426
00:20:47,720 --> 00:20:48,620
Andrew: That is true.

427
00:20:48,700 --> 00:20:49,500
Nikolay: Good introduction.

428
00:20:49,540 --> 00:20:50,520
Thank you so much.

429
00:20:51,000 --> 00:20:54,380
So yeah, it's not it, right?

430
00:20:54,380 --> 00:20:58,400
There's also some problems like quadratic behavior, right?

431
00:20:58,780 --> 00:21:00,720
Andrew: Yes, I hinted at that before.

432
00:21:00,860 --> 00:21:06,100
So essentially, if you have, let's say, five transactions that are

433
00:21:06,100 --> 00:21:09,440
attempting to lock the same row, you have to create the multixacts

434
00:21:09,960 --> 00:21:13,940
one at a time by essentially cloning it with the previous transaction

435
00:21:13,940 --> 00:21:17,440
IDs and adding that additional transaction ID one at a time.

436
00:21:17,440 --> 00:21:21,180
So now you have a multixact with two, then three, then four, then five.

437
00:21:21,220 --> 00:21:25,220
And you add that all up, and that becomes a quadratically growing

438
00:21:25,580 --> 00:21:29,360
usage of multixact membership space, at least until the

439
00:21:29,360 --> 00:21:31,500
vacuum can actually clean up those old multixacts.

440
00:21:32,180 --> 00:21:36,180
And so that's where you will find quadratic consumption of the

441
00:21:36,180 --> 00:21:37,320
multixact space.
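
[Editor's note: as back-of-the-envelope arithmetic, under the simplifying assumption that no vacuum reclaims anything while a chain builds, a chain of n transactions piling onto one row consumes 2 + 3 + ... + n member slots, i.e. n(n+1)/2 - 1. That is where the quadratic term comes from.]

```python
def members_consumed(n: int) -> int:
    """Member slots used by n transactions serially locking one row,
    assuming no cleanup happens in between (simplification)."""
    return n * (n + 1) // 2 - 1  # 2 + 3 + ... + n

print(members_consumed(5))     # 14 slots for just 5 transactions
print(members_consumed(1000))  # 500499: half a million slots for one chain

# How many such 1000-deep chains fit in the 32-bit member space?
print(2**32 // members_consumed(1000))  # 8581 chains, roughly 8.6k
```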

442
00:21:37,440 --> 00:21:41,200
And this can happen very quickly if you have a lot of concurrent

443
00:21:41,200 --> 00:21:44,780
transactions attempting to, say, insert into a table with a foreign

444
00:21:44,780 --> 00:21:48,340
key that links to another table that is a low cardinality table.

445
00:21:48,340 --> 00:21:51,100
So they're all trying to reference that exact same row.

446
00:21:51,260 --> 00:21:54,740
Nikolay: But it's also applicable when we also have an additional

447
00:21:54,960 --> 00:21:56,400
long-running transaction, right?

448
00:21:56,540 --> 00:21:58,760
Because otherwise, cleanup will happen, right?

449
00:21:58,780 --> 00:22:01,400
Andrew: Correct, so a long running
transaction will cause the

450
00:22:01,400 --> 00:22:05,080
vacuum to essentially stop reclaiming
multixact membership

451
00:22:05,080 --> 00:22:08,200
space until it is finished, because
it can only do it sequentially.

452
00:22:09,960 --> 00:22:12,780
Nikolay: Is it this transaction,
long-running transaction, should

453
00:22:12,780 --> 00:22:16,200
it also participate in multixact
mechanism or it can be any,

454
00:22:16,200 --> 00:22:17,780
even SELECT, simple SELECT?

455
00:22:18,940 --> 00:22:20,820
Andrew: A simple SELECT probably
won't do it.

456
00:22:20,820 --> 00:22:23,740
It would have to be in a multi-transaction
context.

457
00:22:24,020 --> 00:22:29,340
So if you, let's say, have a multixact
and you have 2 transactions

458
00:22:29,340 --> 00:22:32,080
in that and the first 1 finishes,
that multixact is not cleaned

459
00:22:32,080 --> 00:22:33,580
up until that second transaction,

460
00:22:33,600 --> 00:22:35,880
part of that multixact, is also
finished.

461
00:22:36,260 --> 00:22:39,440
So all transactions in the multixact
have to finish before the

462
00:22:39,440 --> 00:22:40,760
vacuum can reclaim it.

463
00:22:40,960 --> 00:22:45,400
Nikolay: Yeah, so this highlights
pieces of advice like partition

464
00:22:45,400 --> 00:22:49,080
your large tables, avoid long-running
transactions if mixed

465
00:22:49,080 --> 00:22:50,320
with OLTP, right?

466
00:22:50,660 --> 00:22:54,180
Like just yet another reason to
avoid long-running transactions.

467
00:22:54,640 --> 00:22:58,040
And if it's, yeah, if, so long-running
transactions on replica

468
00:22:58,040 --> 00:23:02,270
reported by hot_standby_feedback,
they are harmless here.

469
00:23:02,270 --> 00:23:04,700
So, right, so like again, yeah,
yeah.

470
00:23:04,900 --> 00:23:09,300
I'm trying, we are debating right
now how to properly monitor

471
00:23:09,480 --> 00:23:12,840
long running transactions because
there are different effects,

472
00:23:12,840 --> 00:23:15,140
negative effects, and we need to
distinguish them.

473
00:23:15,240 --> 00:23:19,640
In this case, yeah, we should monitor
probably long-running transactions

474
00:23:19,640 --> 00:23:20,580
which participate.

475
00:23:22,040 --> 00:23:26,180
And by long, how long was
it in your case?

476
00:23:26,840 --> 00:23:30,840
Like 1 minute is long or 30 minutes
is long in this case?

477
00:23:31,000 --> 00:23:31,720
Andrew: I'm actually not sure.

478
00:23:31,720 --> 00:23:33,540
I don't actually have the metrics
on that.

479
00:23:33,540 --> 00:23:34,580
Nate, do you know?

480
00:23:35,060 --> 00:23:41,300
Nate: I think 99.9%, probably more
nines, of our queries are sub-second.

481
00:23:41,600 --> 00:23:44,880
And I guess I previously worked
on a database and for a company,

482
00:23:45,180 --> 00:23:48,940
or for a team at Stripe, and had
some of these lessons, just

483
00:23:48,940 --> 00:23:52,900
like really deeply, or have all
the scar tissue in short.

484
00:23:53,480 --> 00:23:55,240
Running a platform at their scale
is difficult.

485
00:23:55,240 --> 00:23:58,860
And so, I've been pushing us in
that direction and kind of moving

486
00:23:58,860 --> 00:24:02,460
anything that we expect to be longer
off of the online workload

487
00:24:03,280 --> 00:24:04,380
into our lake house.

488
00:24:04,380 --> 00:24:06,820
And so it's very, you know, occasionally
we have queries that

489
00:24:06,820 --> 00:24:10,580
go 5, 10 seconds, but I believe
we have a like, you know, global

490
00:24:10,580 --> 00:24:11,900
timeout of about 30 seconds.

491
00:24:12,540 --> 00:24:15,040
So I guess like part of my misunderstanding
here is I thought

492
00:24:15,060 --> 00:24:20,460
part of these multixacts is that
the multi-transaction can

493
00:24:20,460 --> 00:24:24,360
be daisy-chained as you have many
concurrent overlapping transactions

494
00:24:24,400 --> 00:24:24,720
occur.

495
00:24:24,720 --> 00:24:27,340
I think that's more what you're
running into than a long running

496
00:24:27,340 --> 00:24:27,840
1.

497
00:24:28,260 --> 00:24:31,960
It was that we just had this steady
stream of overlapping transactions.

498
00:24:33,260 --> 00:24:37,260
And until you break that chain,
you're in this like, hosed position

499
00:24:37,260 --> 00:24:40,700
where you keep, you know, very
quickly at an increasing rate,

500
00:24:40,960 --> 00:24:41,900
consuming space.

501
00:24:42,380 --> 00:24:46,500
Andrew: Yes, I was merely
stating the cleanup scenario and

502
00:24:46,500 --> 00:24:48,660
the challenges there, but you're
absolutely correct.

503
00:24:48,680 --> 00:24:51,860
When you have this daisy-chaining,
you know, constantly creating

504
00:24:51,860 --> 00:24:55,380
new multixacts by adding new transaction
IDs to it, you will

505
00:24:55,380 --> 00:24:57,380
very quickly exhaust your membership
space.

506
00:24:57,880 --> 00:25:01,000
Michael: You say very quickly and
very easily, but I mean, we

507
00:25:01,000 --> 00:25:03,080
are still talking about 2 billion,
right?

508
00:25:03,080 --> 00:25:07,300
Like, I think that's a, there's
been a debate in the Postgres

509
00:25:07,300 --> 00:25:12,240
world about 32-bit versus 64-bit
for a long time, but we are

510
00:25:12,240 --> 00:25:13,780
still talking about 2 billion.

511
00:25:13,780 --> 00:25:19,300
So you say very quickly and very
easily at your scale, right?

512
00:25:20,260 --> 00:25:23,440
Like this is also due to how many
things you're doing and maybe

513
00:25:23,440 --> 00:25:27,500
the pattern as well, like invoices
that have lots of line items

514
00:25:27,500 --> 00:25:32,140
or, you know, like it feels like
there are specifics here that

515
00:25:32,140 --> 00:25:33,300
make this more likely.

516
00:25:33,660 --> 00:25:36,180
I'm not saying it shouldn't be
a problem, I'm looking forward

517
00:25:36,180 --> 00:25:39,820
to hearing how you diagnose this
and what we can do better.

518
00:25:40,080 --> 00:25:44,600
But I feel like we should say it's
not necessarily very easy

519
00:25:44,600 --> 00:25:45,280
to hit this limit.

520
00:25:45,280 --> 00:25:48,840
I'm not actually, I think I found
1 other Stack Overflow post

521
00:25:48,840 --> 00:25:51,820
or somebody else who hit it. I don't
know if you came across that

522
00:25:51,820 --> 00:25:54,480
while you were in
the middle of this, but I couldn't

523
00:25:54,480 --> 00:25:57,620
find any other examples on the
mailing list, but it's super interesting

524
00:25:57,620 --> 00:26:00,760
I think we're seeing more and more
companies using Postgres at

525
00:26:00,760 --> 00:26:03,340
this scale. Like Nikolay mentioned
a few of them that we talked

526
00:26:03,340 --> 00:26:07,920
to, but we also had an episode
with folks from Notion and Figma

527
00:26:08,000 --> 00:26:09,820
and Adyen as well on that same
episode.

528
00:26:09,840 --> 00:26:12,820
So there are a lot of high volume
companies now on Postgres.

529
00:26:13,140 --> 00:26:16,220
So I think this stuff is super
interesting, but 2 billion still

530
00:26:16,220 --> 00:26:18,820
seems like, even with the daisy
chaining, it still seems like

531
00:26:18,820 --> 00:26:19,860
a lot to me.

532
00:26:20,740 --> 00:26:22,200
Nikolay: Let me disagree with you.

533
00:26:22,200 --> 00:26:26,780
There is no, like, debate about
64 or 32.

534
00:26:27,700 --> 00:26:31,360
Postgres desperately needs 64-bit transaction IDs.

535
00:26:31,440 --> 00:26:34,340
And for example, OrioleDB already
implemented it.

536
00:26:34,900 --> 00:26:40,280
But OrioleDB is like an alternative
storage engine and it's right

537
00:26:40,280 --> 00:26:42,840
now in the hands of Supabase in terms
of development.

538
00:26:43,060 --> 00:26:46,100
And I think right now there is
launch week and they published

539
00:26:46,100 --> 00:26:48,580
some benchmarks again, like it's
great, great project.

540
00:26:48,800 --> 00:26:53,400
So I think many projects dream
about the day when we will have

541
00:26:53,860 --> 00:26:55,220
64, not 32.

542
00:26:55,480 --> 00:26:58,480
Right now only several forks
have implemented it.

543
00:26:59,380 --> 00:27:02,860
There are commercial forks which
implement 64-bit transaction IDs

544
00:27:02,860 --> 00:27:04,080
and multixact IDs.

545
00:27:04,660 --> 00:27:05,160
Right?

546
00:27:06,580 --> 00:27:07,700
Michael: I didn't know that.

547
00:27:08,480 --> 00:27:10,540
Sadly not Aurora it sounds like.

548
00:27:10,840 --> 00:27:11,340
Nikolay: Yeah.

549
00:27:12,340 --> 00:27:14,880
So 2 billion is not a lot already.

550
00:27:15,140 --> 00:27:16,200
It's not a lot.

551
00:27:16,860 --> 00:27:17,360
Andrew: Yeah.

552
00:27:17,900 --> 00:27:21,760
It's pretty easy to hit it if you're
going at scale and you're

553
00:27:21,760 --> 00:27:23,940
hitting it with quadratic velocity.

554
00:27:25,240 --> 00:27:26,680
You get to 2 billion quickly.

555
00:27:26,680 --> 00:27:28,500
It does sound like a big number,

556
00:27:28,500 --> 00:27:29,660
Nikolay: but it's not that big.

557
00:27:30,060 --> 00:27:31,360
Michael, sorry for the interruption.

558
00:27:31,400 --> 00:27:35,800
Michael, you can take a chessboard
and start putting 1 grain on

559
00:27:35,800 --> 00:27:37,060
the first cell.

560
00:27:37,060 --> 00:27:37,540
And so.
Michael: Oh, yeah.

561
00:27:37,540 --> 00:27:38,980
So wait, maybe I misunderstood.

562
00:27:40,320 --> 00:27:45,740
If there are 5 row changes we're
trying to do, do we create 2

563
00:27:45,740 --> 00:27:48,180
to the 5 rows in this table?

564
00:27:48,400 --> 00:27:51,000
Or is it like 1, then 2, then 3?

565
00:27:51,000 --> 00:27:52,700
Is that only 5 rows?

566
00:27:53,380 --> 00:27:54,640
So in the member space?

567
00:27:55,200 --> 00:27:56,180
Andrew: Yeah, good question.

568
00:27:56,280 --> 00:27:59,440
So basically, it's more of a sequencing
thing.

569
00:27:59,720 --> 00:28:01,980
So essentially, you have 2 transactions.

570
00:28:01,980 --> 00:28:02,840
You start with 2.

571
00:28:02,840 --> 00:28:05,140
You get 1 multixact taking up
2 members.

572
00:28:05,580 --> 00:28:08,100
The third transaction comes in,
you create a new multixact

573
00:28:08,680 --> 00:28:11,820
with 3 members, but that old 1
with 2 members has not been cleaned

574
00:28:11,820 --> 00:28:13,500
up yet, so you're only taking up
5.

575
00:28:13,780 --> 00:28:16,560
The third 1 comes in, I'm sorry,
excuse me, the fourth transaction

576
00:28:16,560 --> 00:28:19,840
comes in, you create a multixact
with 4 and 3 and 2.

577
00:28:19,840 --> 00:28:22,360
So now you've added an additional
4 on top of that.

578
00:28:22,360 --> 00:28:24,620
And now the fifth 1 comes in, and
now you have a multixact

579
00:28:24,620 --> 00:28:26,460
with 5, 4, 3, 2.

580
00:28:26,460 --> 00:28:28,940
And that all adds together, it takes
up membership space until

581
00:28:28,940 --> 00:28:32,200
the vacuum can come up and reap
those old multixacts.
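
[Editor's note: to make Andrew's accounting concrete, here is the running total as a simplified sketch. It is the sum 2 + 3 + 4 + 5, not 2 to the 5.]

```python
# Five transactions lock the same row one after another; each newcomer
# forces a fresh multixact holding all lockers so far (simplified).
multixacts = [tuple(range(1, k + 1)) for k in range(2, 6)]
# -> (1, 2), (1, 2, 3), (1, 2, 3, 4), (1, 2, 3, 4, 5)

total_slots = sum(len(mx) for mx in multixacts)
print(total_slots)  # 14 = 2 + 3 + 4 + 5
print(2**5)         # 32, which is NOT what happens
```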

582
00:28:34,020 --> 00:28:37,900
Nikolay: Yeah, chessboard example
is good here, I guess.

583
00:28:37,960 --> 00:28:43,740
We can try to understand the volume
when we approach cell number

584
00:28:43,740 --> 00:28:45,920
64, how big it will be.

585
00:28:45,920 --> 00:28:50,080
Michael: Andrew, you've done such
a good job describing how this

586
00:28:50,080 --> 00:28:53,480
could happen, but when you're in
the middle of this, I'm guessing

587
00:28:53,480 --> 00:28:56,000
this is relatively new knowledge,
or like when you're in the

588
00:28:56,000 --> 00:28:59,380
middle of this, how did you work
out what on earth was going

589
00:28:59,380 --> 00:28:59,880
on?

590
00:29:00,060 --> 00:29:02,720
Andrew: Yes, that's quite the journey.

591
00:29:02,980 --> 00:29:05,200
So I'm actually really new to Metronome.

592
00:29:05,200 --> 00:29:06,360
I just joined in April.

593
00:29:06,360 --> 00:29:09,960
So when this incident was occurring,
it was only my fifth week

594
00:29:09,960 --> 00:29:10,640
or so.

595
00:29:10,840 --> 00:29:14,180
And this was occurring on a Saturday
right before a vacation

596
00:29:14,200 --> 00:29:14,840
to Miami.

597
00:29:15,060 --> 00:29:16,320
So, wow.

598
00:29:16,320 --> 00:29:20,140
Nate: Well, let's rewind a little
bit, because I guess the most

599
00:29:20,140 --> 00:29:22,800
embarrassing part of this is like
the first, what we now know

600
00:29:22,800 --> 00:29:24,940
was the first occurrence, was the
full week prior.

601
00:29:25,640 --> 00:29:26,140
Andrew: Yes.

602
00:29:26,280 --> 00:29:30,020
Nate: Where our migration workload
had basically implicitly triggered

603
00:29:30,020 --> 00:29:30,520
this.

604
00:29:30,860 --> 00:29:32,660
We hadn't really known what to
do.

605
00:29:33,820 --> 00:29:36,220
And unfortunately we weren't able
to get to a full RCA.

606
00:29:36,260 --> 00:29:39,400
And so, yeah, we got to our best
explanation, which is like,

607
00:29:39,400 --> 00:29:42,660
oh, there's something in transaction
ID wraparounds that we're

608
00:29:42,660 --> 00:29:45,800
not grokking correctly, or Amazon's
hiding something from us.

609
00:29:45,920 --> 00:29:48,600
And then we proceeded to turn it
back on with a slightly lower

610
00:29:48,600 --> 00:29:50,660
concurrency, you know, 5 days later.

611
00:29:50,840 --> 00:29:53,220
And so like, yeah, so we didn't
get there immediately.

612
00:29:53,400 --> 00:29:56,680
But then, you know, that was a
P1 incident, you know, internally,

613
00:29:57,340 --> 00:30:00,060
tried to be cautious, but didn't
follow through sufficiently.

614
00:30:00,060 --> 00:30:03,160
And so this is like kind of in
the second chapter where Andrew

615
00:30:03,160 --> 00:30:03,980
comes in.

616
00:30:04,200 --> 00:30:05,260
Andrew: Yes, that's right.

617
00:30:05,280 --> 00:30:08,040
I was not a participant in the
first 3 incidents.

618
00:30:08,040 --> 00:30:11,880
It was only the fourth 1, which
I think was on a Saturday in

619
00:30:11,880 --> 00:30:16,560
which the call went out from Cosmo
to see if additional engineering

620
00:30:16,560 --> 00:30:19,140
resources could come and jump on,
so I decided to.

621
00:30:19,140 --> 00:30:22,900
But having been so new, I actually
didn't have any tactical knowledge

622
00:30:22,900 --> 00:30:26,320
I could use to address all the
specific fires that were being

623
00:30:26,320 --> 00:30:27,380
started by this.

624
00:30:27,880 --> 00:30:31,720
So I decided to actually figure
out what exactly was happening.

625
00:30:31,920 --> 00:30:35,140
And as Nate had said, we weren't
really sure, but we knew it

626
00:30:35,140 --> 00:30:36,900
was related to multixact in some
way.

627
00:30:36,900 --> 00:30:38,900
We just weren't really sure exactly
why.

628
00:30:39,180 --> 00:30:41,920
And we were struggling to really
fully understand because in

629
00:30:41,920 --> 00:30:46,220
our initial evaluation, we saw
that the number of multixact

630
00:30:46,360 --> 00:30:49,900
IDs that was being created was
well below the threshold of like

631
00:30:49,900 --> 00:30:50,780
400 million.

632
00:30:51,140 --> 00:30:52,260
And we weren't really sure why.

633
00:30:52,260 --> 00:30:55,200
So what I did is I took that knowledge
and I just went ahead

634
00:30:55,200 --> 00:30:58,860
and looked into the Postgres source
code and looked at the multixact.c

635
00:30:59,100 --> 00:31:03,420
file, which is this enormous
file that contains pretty much

636
00:31:03,420 --> 00:31:05,460
all the logic that you'll need
to know about multixacts.

637
00:31:06,140 --> 00:31:09,100
And in there, we were working with
the hypothesis that somehow

638
00:31:09,280 --> 00:31:12,980
this threshold of 400 million
(though we were only seeing 200 million

639
00:31:12,980 --> 00:31:15,920
before vacuums started occurring)
was related in some way.

640
00:31:16,400 --> 00:31:21,920
So I found the function called
MultiXactMemberFreezeThreshold,

641
00:31:22,540 --> 00:31:26,720
which calculates the threshold
at which you should basically start

642
00:31:26,720 --> 00:31:30,780
an autovacuum based on the number
of multixacts that are in

643
00:31:30,780 --> 00:31:35,200
use and, notably, the number of multixact
members that are in use,

644
00:31:35,200 --> 00:31:36,620
and that's what caught my eye.

645
00:31:36,620 --> 00:31:42,680
So from there, I mathematically
proved that the number of members

646
00:31:43,260 --> 00:31:47,140
should be extremely high given
the code path that we are obviously

647
00:31:47,160 --> 00:31:50,740
hitting, because we're seeing vacuums
occur at a specific number

648
00:31:50,740 --> 00:31:52,320
of multixacts being used.
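
[Editor's note: the shape of that logic can be paraphrased in Python. This is a loose sketch of MultiXactMemberFreezeThreshold in multixact.c, not the real implementation; the SAFE/DANGER constants and the exact formula here are assumptions for illustration, so check the actual source for the real values. The point it shows: once member usage crosses a safe fraction of the 32-bit space, the effective freeze age shrinks, so autovacuum fires well below the configured 400 million.]

```python
MAX_MEMBERS = 2**32            # 32-bit member offset space
SAFE = MAX_MEMBERS // 2        # assumed "start worrying" threshold
DANGER = MAX_MEMBERS           # assumed "must not reach" threshold
FREEZE_MAX_AGE = 400_000_000   # autovacuum_multixact_freeze_max_age

def member_freeze_threshold(multixacts: int, members: int) -> int:
    """Effective freeze age: the full FREEZE_MAX_AGE while member usage
    is low, scaled down as usage approaches the danger zone."""
    if members <= SAFE:
        return FREEZE_MAX_AGE
    fraction = (members - SAFE) / (DANGER - SAFE)
    victims = int(multixacts * fraction)  # multixacts to freeze early
    return min(multixacts - victims, FREEZE_MAX_AGE)

# Low member usage: vacuum waits for the nominal 400M multixacts.
print(member_freeze_threshold(200_000_000, SAFE))
# 75% of member space used: threshold drops to 100M multixacts.
print(member_freeze_threshold(200_000_000, 3 * MAX_MEMBERS // 4))
```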

649
00:31:52,580 --> 00:31:56,540
And from there I was able to find
log messages related to multixact

650
00:31:56,820 --> 00:32:00,880
membership exhaustion and correlate
that with Datadog logs that

651
00:32:00,880 --> 00:32:03,460
we had found ourselves in the recent
week.

652
00:32:03,600 --> 00:32:07,580
And there we could conclude that
multixact members were being,

653
00:32:07,580 --> 00:32:09,180
the membership space was being
exhausted.

654
00:32:09,860 --> 00:32:13,440
And we were able to put together
what I would consider kind of

655
00:32:13,440 --> 00:32:16,960
a hacky estimate of the number
of multixact members that are

656
00:32:16,960 --> 00:32:20,860
currently in use by looking at
the files that are written for

657
00:32:20,860 --> 00:32:24,140
each member and estimating by multiplying
by a constant how many

658
00:32:24,140 --> 00:32:25,940
members are currently in use at
the moment.

659
00:32:26,140 --> 00:32:29,720
And that's how we came across the
solution, or the cause, I should

660
00:32:29,720 --> 00:32:30,220
say.

661
00:32:32,380 --> 00:32:35,900
Michael: I think this is worth
diving into a little bit because

662
00:32:35,900 --> 00:32:40,620
I definitely didn't realize that
there were at least 2 exhaustion

663
00:32:40,860 --> 00:32:45,700
possibilities with multixacts.
So we've got the total, well,

664
00:32:45,700 --> 00:32:49,220
yeah, so why are there these
2 numbers? Like, why do we have

665
00:32:49,260 --> 00:32:52,720
2? There's 2 billion, and you were
monitoring at 400 million,

666
00:32:52,720 --> 00:32:54,380
which should be super safe, right?

667
00:32:54,380 --> 00:32:56,760
Because we have the possibility
to go up to 2 billion, but you

668
00:32:56,760 --> 00:32:59,440
want it to be cleaning up much
sooner than that.

669
00:32:59,440 --> 00:33:02,060
So you monitored at 400 million
or so.

670
00:33:02,680 --> 00:33:04,180
Why is there another limit?

671
00:33:04,980 --> 00:33:08,900
Andrew: Yeah, so we have 2 limits
here.

672
00:33:08,940 --> 00:33:11,180
We have an ID that we give to multixacts.

673
00:33:11,920 --> 00:33:13,440
That's a 32-bit integer.

674
00:33:13,660 --> 00:33:17,600
So, of course, you cannot assign
more than 2 billion to these

675
00:33:17,600 --> 00:33:20,460
multixacts because we don't assign
negative IDs.

676
00:33:20,860 --> 00:33:23,500
And then you have the multixact
membership space.

677
00:33:23,720 --> 00:33:30,580
Now that is logically represented as
a simple LRU that is append

678
00:33:30,580 --> 00:33:31,080
only.

679
00:33:31,480 --> 00:33:38,200
And so we can only address up to
32 bits of membership space.

680
00:33:38,600 --> 00:33:42,540
Each membership slot is addressed
by a 32-bit integer.
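
[Editor's note: the two ceilings in concrete numbers. The ~5 bytes per member is an approximation of how the members area packs a 4-byte transaction ID plus flag bits, so treat the disk figure as an estimate.]

```python
member_slots = 2**32   # addressable member slots (32-bit offsets)
bytes_per_member = 5   # approx: 4-byte xid plus packed flag bits

print(f"{member_slots:,} member slots")  # 4,294,967,296
disk_gib = member_slots * bytes_per_member / 2**30
print(f"~{disk_gib:.0f} GiB in pg_multixact/members if fully exhausted")
```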

681
00:33:42,980 --> 00:33:45,520
And thus now you have 2 different
values that you have to worry

682
00:33:45,520 --> 00:33:45,800
about.

683
00:33:45,800 --> 00:33:49,700
You have to worry about how many
multixacts you're creating and

684
00:33:49,860 --> 00:33:52,780
how much membership space you're
consuming with those multixacts.

685
00:33:53,080 --> 00:33:55,260
And that's where those 2 numbers
come from.

686
00:33:56,040 --> 00:33:58,640
Michael: And 1 of the great things
in your blog post you mention

687
00:33:59,180 --> 00:34:01,300
is how do you even monitor that?

688
00:34:02,960 --> 00:34:04,480
Have you worked out a way since?

689
00:34:04,480 --> 00:34:06,240
Is there anything you can do there?

690
00:34:06,420 --> 00:34:09,940
Andrew: Yes, I think I might have
just offhand mentioned it,

691
00:34:09,940 --> 00:34:15,740
but we use an estimate where we
read the membership files because

692
00:34:15,740 --> 00:34:19,340
they are written to disk, And then
we know that the files have

693
00:34:19,340 --> 00:34:22,860
to be a certain size and we multiply
by that constant, whatever

694
00:34:22,860 --> 00:34:25,320
the size of these members are,
and we get an estimate for the

695
00:34:25,320 --> 00:34:26,000
number of members.
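
[Editor's note: a sketch of that estimator. The directory layout matches Postgres's data directory, but the per-member byte constant is an approximation, so the result is a rough gauge rather than an exact count.]

```python
import os

BYTES_PER_MEMBER = 5  # approx: 4-byte xid plus packed flag bits

def estimate_multixact_members(datadir: str) -> int:
    """Rough member count from segment file sizes under
    <datadir>/pg_multixact/members."""
    members_dir = os.path.join(datadir, "pg_multixact", "members")
    total_bytes = sum(
        os.path.getsize(os.path.join(members_dir, name))
        for name in os.listdir(members_dir)
    )
    return total_bytes // BYTES_PER_MEMBER

# e.g. estimate_multixact_members("/var/lib/postgresql/data")
```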

696
00:34:26,000 --> 00:34:29,180
Now, this is not the best solution,
which is why I submitted

697
00:34:29,180 --> 00:34:32,120
a Postgres patch to surface this
telemetry directly.

698
00:34:32,640 --> 00:34:34,140
Michael: Nice, thank you.

699
00:34:34,300 --> 00:34:36,540
Nikolay: Yeah, that's what I wanted to mention.

700
00:34:36,660 --> 00:34:40,080
We have several things going on in mailing lists after this blog

701
00:34:40,080 --> 00:34:44,440
post.
By the way, before this, why did you decide to post in such great

702
00:34:44,440 --> 00:34:44,940
detail?

703
00:34:45,300 --> 00:34:47,420
Like, what was your intention?

704
00:34:49,340 --> 00:34:51,640
To help the community, or like?

705
00:34:52,200 --> 00:34:53,400
Nate: I guess a couple of factors.

706
00:34:53,420 --> 00:34:56,420
1, we do want to help the greater Postgres community.

707
00:34:56,420 --> 00:34:58,980
I think there are lots of other situations where we've benefited

708
00:34:59,020 --> 00:35:02,560
greatly by blog posts put out, in particular by companies like

709
00:35:02,560 --> 00:35:06,300
Percona, as well as others who have maybe led us away from paths

710
00:35:06,300 --> 00:35:07,900
that would have wound up being painful.

711
00:35:08,360 --> 00:35:12,380
But another part is that internally at Metronome, we view ourselves

712
00:35:12,380 --> 00:35:14,880
as an infrastructure provider for our clients.

713
00:35:15,560 --> 00:35:19,300
And while I think it's okay to say that we're not fully recognized

714
00:35:19,300 --> 00:35:22,420
as that yet, and we're still on our journey to solidifying that

715
00:35:22,420 --> 00:35:25,920
narrative, it is what we want to, or how we want to act to be

716
00:35:25,920 --> 00:35:26,980
perceived that way.

717
00:35:27,840 --> 00:35:30,480
And we are in a critical path for a lot of our clients and how

718
00:35:30,480 --> 00:35:31,560
they operate.

719
00:35:32,020 --> 00:35:36,600
And so for us to have multiple multi-hour downtimes in the space

720
00:35:36,600 --> 00:35:40,120
of less than 10 days is pretty bad and honestly not what you

721
00:35:40,120 --> 00:35:42,560
want for your infrastructure provider if you choose to be in

722
00:35:42,560 --> 00:35:42,980
that way.

723
00:35:42,980 --> 00:35:47,080
And so having a really crisp explanation of, hey, we are a little

724
00:35:47,080 --> 00:35:50,280
embarrassed by this, but here's what we learned so you can learn

725
00:35:50,280 --> 00:35:55,320
from it, as well as a little bit of vetting of, we didn't run

726
00:35:55,320 --> 00:35:56,600
into the easiest thing.

727
00:35:56,880 --> 00:35:57,800
It is novel.

728
00:35:57,800 --> 00:36:00,560
There is a reason why it took a while for us to sort it out.

729
00:36:00,940 --> 00:36:03,840
And so I think a lot of our partners felt a lot better.

730
00:36:04,660 --> 00:36:05,980
Better could be a strong term.

731
00:36:05,980 --> 00:36:11,040
But when you have an honest RCA, and Cloudflare might be leading

732
00:36:11,040 --> 00:36:14,920
the industry in how they operate in this vein, you feel a lot

733
00:36:14,920 --> 00:36:17,800
better that your provider is actually learning from mistakes

734
00:36:17,840 --> 00:36:20,660
and is going to do a better job of avoiding them than you might

735
00:36:20,660 --> 00:36:23,740
internally if you're evaluating, should I just build this in-house

736
00:36:23,740 --> 00:36:24,240
instead?

737
00:36:25,440 --> 00:36:28,340
Nikolay: So great transparency and so on.

738
00:36:28,640 --> 00:36:30,040
It pays off over time.

739
00:36:30,040 --> 00:36:30,860
Yeah, that's great.

740
00:36:30,920 --> 00:36:34,340
So I saw several mailing list discussions, but before that I

741
00:36:34,340 --> 00:36:38,200
want to note: I feel like with multixacts, we can have different

742
00:36:38,200 --> 00:36:39,120
kinds of problems.

743
00:36:39,120 --> 00:36:41,640
1 is wraparound, right?

744
00:36:42,540 --> 00:36:44,980
Okay, as I said, I never saw it.

745
00:36:45,040 --> 00:36:48,300
And that's why many monitoring systems don't even monitor it.

746
00:36:48,480 --> 00:36:52,180
And we have several cases of SLRU limits reached, right?

747
00:36:52,720 --> 00:36:55,700
And 1 of them was explored in the past.

748
00:36:55,940 --> 00:36:59,840
We can maybe provide blog posts, links to descriptions.

749
00:37:00,380 --> 00:37:01,780
And this is your case.

750
00:37:01,780 --> 00:37:07,260
And your case obviously bumped into a lack of observability bits,

751
00:37:07,460 --> 00:37:12,820
like how much of the capacity we've reached, how far from the performance

752
00:37:12,820 --> 00:37:14,280
cliff we are, right?

753
00:37:14,540 --> 00:37:17,720
But also besides that, I saw different topics were discussed

754
00:37:17,720 --> 00:37:18,400
in the mailing list.

755
00:37:18,400 --> 00:37:22,800
For example, there is a setting vacuum_multixact_failsafe_age,

756
00:37:23,800 --> 00:37:26,180
which is about wraparound, actually.

757
00:37:27,800 --> 00:37:32,120
Peter Geoghegan brought to our attention that this setting doesn't take

758
00:37:32,120 --> 00:37:34,460
into account multixact membership exhaustion.

759
00:37:34,820 --> 00:37:37,920
So it's also an interesting side effect from your blog post.

760
00:37:38,160 --> 00:37:40,300
Maybe fully unexpected, right?

761
00:37:40,520 --> 00:37:48,400
Then I saw, Andrew, I think your proposal to have multixact

762
00:37:49,240 --> 00:37:54,140
monitoring, like a system view, so any monitoring could have some

763
00:37:54,140 --> 00:37:56,620
graphs and also alerts probably, right?

764
00:37:57,200 --> 00:38:00,120
And this proposal is currently at what stage?

765
00:38:00,360 --> 00:38:04,020
I saw it's in commitfest, but it requires review, right?

766
00:38:04,240 --> 00:38:04,740
Andrew: Correct.

767
00:38:04,800 --> 00:38:05,280
Yes.

768
00:38:05,280 --> 00:38:10,020
The patch is in the commitfest and it definitely still requires

769
00:38:10,020 --> 00:38:10,520
review.

770
00:38:10,680 --> 00:38:14,480
Well, I've engaged with another contributor named Naga,

771
00:38:14,480 --> 00:38:18,440
and we've actually submitted competing proposals.

772
00:38:18,640 --> 00:38:24,240
His actually reads directly from the memory space, while

773
00:38:24,240 --> 00:38:28,920
mine uses the pg_stats infrastructure to surface this telemetry.

774
00:38:29,380 --> 00:38:31,020
But yes, it is still just in review.

775
00:38:31,020 --> 00:38:33,380
It has not been accepted yet, as far as I can tell.

776
00:38:33,740 --> 00:38:36,240
Nikolay: Yeah, but it's great that it's registered in the commitfest.

777
00:38:36,600 --> 00:38:40,720
And I think some people who listen to us, and maybe some people

778
00:38:40,720 --> 00:38:45,620
who also watch the hacking sessions we do with Andrey and

779
00:38:45,620 --> 00:38:47,680
Kirk, sometimes almost every week.

780
00:38:47,680 --> 00:38:51,780
We missed today because of some issues, but we are going to be

781
00:38:51,780 --> 00:38:52,840
back next week.

782
00:38:52,960 --> 00:38:56,780
So we show how easy it is actually to test.

783
00:38:57,040 --> 00:39:00,840
You take Cursor, you tell it: just compile Postgres, apply the

784
00:39:00,840 --> 00:39:02,540
patch, and let's start using it.

785
00:39:02,540 --> 00:39:06,640
And instead of diving too much into C code, you think

786
00:39:06,640 --> 00:39:11,460
about use cases and how to, for example, emulate problems, right?

787
00:39:11,460 --> 00:39:12,220
And so on.

788
00:39:12,280 --> 00:39:13,400
This is actually fun.

789
00:39:13,620 --> 00:39:18,380
And I encourage our audience to maybe start with their first

790
00:39:18,380 --> 00:39:22,440
review and test this patch, and this can increase the chances that

791
00:39:22,440 --> 00:39:24,740
it will be included in Postgres 19, right?

792
00:39:25,580 --> 00:39:28,780
Andrew: Yes, what's also kind of interesting is I know you earlier

793
00:39:28,780 --> 00:39:32,040
mentioned that there was a debate about moving to 64-bit integers

794
00:39:32,040 --> 00:39:35,780
and in fact part of this commitfest does include a proposal

795
00:39:36,380 --> 00:39:38,080
to change those offsets to 64 bits.

796
00:39:38,420 --> 00:39:41,520
So there's a lot of changes that are coming in 19, it seems,

797
00:39:41,520 --> 00:39:45,380
if they're accepted, of course,
which would help alleviate this

798
00:39:45,380 --> 00:39:48,460
multixact membership issue and the multixact
problem writ large.

799
00:39:48,900 --> 00:39:50,940
Nikolay: That's great, but it's
only next year.

800
00:39:52,740 --> 00:39:54,640
Andrew: I suppose we can't go too
fast.

801
00:39:55,160 --> 00:39:57,600
Michael: Well, and even longer
if it's Aurora, right?

802
00:39:57,600 --> 00:40:02,160
They generally have a bit of
additional lag in getting major

803
00:40:02,160 --> 00:40:05,680
versions because it's a fork that
they have to do a lot of work

804
00:40:05,680 --> 00:40:06,180
on.

805
00:40:06,340 --> 00:40:09,520
On that note though, you mentioned
trust in an infrastructure

806
00:40:09,760 --> 00:40:14,600
provider and how issues, big issues
can rock that trust and you

807
00:40:14,720 --> 00:40:19,540
need to see some maybe root cause
analysis or maybe some action.

808
00:40:19,960 --> 00:40:24,320
Has this at any point rocked your
confidence in Postgres or less

809
00:40:24,320 --> 00:40:28,280
so Aurora, I guess, in terms of
is it the right platform as you

810
00:40:28,280 --> 00:40:28,780
scale?

811
00:40:28,940 --> 00:40:31,440
Maybe initially and then since
you've worked out exactly what

812
00:40:31,440 --> 00:40:33,580
the issue was, what are your thoughts
there?

813
00:40:34,120 --> 00:40:39,380
Andrew: Yeah, I don't think it
actually has really rocked our

814
00:40:39,380 --> 00:40:40,560
confidence in Postgres.

815
00:40:40,560 --> 00:40:43,500
I haven't seen anyone run for the
hills yet.

816
00:40:44,060 --> 00:40:48,940
But you know, it definitely surfaces
the fact that there is quite

817
00:40:48,940 --> 00:40:51,940
a bit to learn and that this is
a very complex system. But to

818
00:40:51,940 --> 00:40:55,200
that end, something I've learned,
at least participating in the

819
00:40:55,200 --> 00:40:58,260
Postgres dev community and being
on the mailing list is that

820
00:40:58,260 --> 00:41:01,020
there's a lot of very smart individuals
who are part of this

821
00:41:01,020 --> 00:41:04,500
community and who are actively
contributing and are passionately

822
00:41:05,240 --> 00:41:08,000
debating what should and shouldn't
be part of it and how things

823
00:41:08,000 --> 00:41:08,680
should be designed.

824
00:41:08,680 --> 00:41:11,820
So, in many ways, it actually may
have built my confidence to

825
00:41:11,820 --> 00:41:14,740
see so many people paying attention
and caring about it from

826
00:41:15,200 --> 00:41:19,340
all across the world, from Fujitsu,
Amazon, Google, ourselves,

827
00:41:19,740 --> 00:41:22,660
even just regular folks doing it
in their hobby spare time.

828
00:41:22,660 --> 00:41:27,340
So in many ways perhaps it's a
testament to how far open source

829
00:41:27,340 --> 00:41:30,800
can go and how much confidence
we can have in these systems because

830
00:41:30,940 --> 00:41:34,580
like you said, this is a pretty
rare instance, and many of the

831
00:41:34,640 --> 00:41:37,960
core cases that most people need
to use are covered or at least

832
00:41:37,960 --> 00:41:38,700
will be.

833
00:41:38,860 --> 00:41:41,860
Michael: That's really good to
hear and of course when people

834
00:41:41,860 --> 00:41:45,480
hit issues with proprietary systems
like SQL Server and Oracle,

835
00:41:45,480 --> 00:41:48,920
I'm sure when they blog about it
those conversations do happen

836
00:41:48,920 --> 00:41:53,660
internally at Microsoft or Oracle,
but you can't see them. You

837
00:41:53,660 --> 00:41:57,100
can't see the actual conversations.
And part of the way you

838
00:41:57,100 --> 00:42:00,920
diagnosed this issue was diving
into the source code so there

839
00:42:00,920 --> 00:42:04,920
are some really cool factors that
wouldn't have been possible

840
00:42:04,920 --> 00:42:05,820
if this wasn't open source.

841
00:42:05,820 --> 00:42:09,320
Yeah, it's not just open source,
but open discussions as well,

842
00:42:09,320 --> 00:42:10,420
open mailing lists.

843
00:42:10,760 --> 00:42:13,380
So yeah, that's really good to
hear, but it would have been very

844
00:42:13,380 --> 00:42:14,720
understandable if...

845
00:42:15,040 --> 00:42:17,980
People have definitely left Postgres
for less good reasons.

846
00:42:18,080 --> 00:42:21,780
So not naming names, Uber.

847
00:42:22,040 --> 00:42:25,940
Andrew: Yeah, I will say, as we
said in our blog post, 30 terabytes

848
00:42:25,940 --> 00:42:29,200
is not a small amount of data to
move.

849
00:42:29,540 --> 00:42:33,060
And so if we decided to make a
large decision like that, I believe

850
00:42:33,520 --> 00:42:36,720
we would have to reckon with the
transition itself.

851
00:42:37,260 --> 00:42:41,320
And so in many ways, you know,
we probably want to try to work

852
00:42:41,320 --> 00:42:42,160
with what we got.

853
00:42:42,160 --> 00:42:45,740
And thankfully, I think it's served
us well for the most part.

854
00:42:46,620 --> 00:42:49,860
Nate: Yeah, there's something to
be said for the devil you know.

855
00:42:50,060 --> 00:42:53,300
And, you know, there's lots of
other systems Postgres compatible

856
00:42:53,480 --> 00:42:57,760
that, you know, promise you the
world, and something gives, typically,

857
00:42:57,880 --> 00:43:00,340
whether it's the price, the actual
performance on either the

858
00:43:00,340 --> 00:43:03,560
reads or the writes, or how maintainable
it is to actually run

859
00:43:03,560 --> 00:43:04,060
it.

860
00:43:04,760 --> 00:43:08,000
And yeah, I think it's clear that
we're going to have to undergo

861
00:43:08,000 --> 00:43:11,360
the investment to figure out what's
next, because we can only

862
00:43:11,360 --> 00:43:14,540
go single writer for so long, especially
with the demands that

863
00:43:14,540 --> 00:43:17,680
we put on the database on the first
of the month every single

864
00:43:17,680 --> 00:43:18,180
month.

865
00:43:19,280 --> 00:43:20,240
A great extreme.

866
00:43:20,860 --> 00:43:21,360
Nikolay: Yeah.

867
00:43:21,680 --> 00:43:24,820
And behind the scenes, we also
chatted with Nate about the formula.

868
00:43:24,960 --> 00:43:29,280
So Michael, for the first time,
we should include some formula

869
00:43:29,300 --> 00:43:30,300
in the show notes.

870
00:43:30,820 --> 00:43:35,700
There's a formula saying
that roughly 63,000 overlapping transactions

871
00:43:36,380 --> 00:43:41,420
are enough to reach
the 2 billion limit.

872
00:43:42,100 --> 00:43:45,040
Nate: Yeah, and it's a really dangerous
thing, doing some math on

873
00:43:45,040 --> 00:43:47,640
the fly that can be checked by
anyone after the fact.

874
00:43:47,640 --> 00:43:52,440
But yeah, as I worked it out, yeah,
only 63,000 overlapping

875
00:43:52,460 --> 00:43:55,080
transactions, which is not that
many.

876
00:43:55,160 --> 00:43:57,980
Relatively easy to do,
you know, with a

877
00:43:57,980 --> 00:44:01,820
highly concurrent workload operating
1 row at a time with a potential

878
00:44:01,920 --> 00:44:03,100
overlap between them.

879
00:44:03,520 --> 00:44:05,100
You can consume that really quickly.
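
The back-of-the-envelope math can be sketched in a few lines. Assuming, as the discussion implies, that each new overlapping locker on the same row creates a fresh multixact containing all lockers so far, member-space consumption for n overlapping transactions grows quadratically, n(n+1)/2. This is a minimal illustration, not Metronome's exact formula:

```python
def members_consumed(n: int) -> int:
    # If each of n overlapping lockers creates a new multixact holding
    # all lockers so far, member usage totals 1 + 2 + ... + n.
    return n * (n + 1) // 2

def min_overlapping(budget: int) -> int:
    # Smallest n whose cumulative member usage reaches the budget.
    n = int((2 * budget) ** 0.5)
    while members_consumed(n) < budget:
        n += 1
    while n > 1 and members_consumed(n - 1) >= budget:
        n -= 1
    return n

# Roughly 63,000 overlapping transactions reach a 2-billion budget,
# matching the figure quoted here; even the full 2^32 member space
# takes only about 93,000.
print(min_overlapping(2_000_000_000))  # 63246
print(min_overlapping(2**32))          # 92682
```

Either way, the point stands: tens of thousands of overlapping row lockers, not billions, are enough to burn through the member space.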

880
00:44:05,540 --> 00:44:06,880
Andrew: It might not also be transactions.

881
00:44:06,880 --> 00:44:09,620
If you take a savepoint, that's
almost seen as like a sub-transaction

882
00:44:10,440 --> 00:44:12,980
and will also create a new multixact
as well.

883
00:44:12,980 --> 00:44:16,220
So if you're using a lot of savepoints,
you can easily hit

884
00:44:16,220 --> 00:44:17,120
that as well.

885
00:44:17,460 --> 00:44:18,580
Nikolay: Are you using sub-transactions?

886
00:44:20,080 --> 00:44:22,860
Andrew: I'm not actually sure myself,
but it is a possibility.

887
00:44:24,280 --> 00:44:26,460
Nate: I don't believe we are, at
least not intentionally.

888
00:44:27,340 --> 00:44:30,400
Nikolay: I think right now in the
latest versions of Postgres

889
00:44:30,960 --> 00:44:38,020
there's the ability to change SLRU sizes
to avoid SLRU overflow cases,

890
00:44:38,760 --> 00:44:43,040
But my opinion about sub-transactions
in projects like this,

891
00:44:43,040 --> 00:44:45,980
at this scale, remains just avoid
them.

892
00:44:46,740 --> 00:44:50,780
Yeah, because they can bring other
problems as well, like the subtrans SLRU

893
00:44:51,560 --> 00:44:53,600
can overflow and so on.

894
00:44:54,220 --> 00:44:56,260
And they consume transaction ID.

895
00:44:56,440 --> 00:44:57,440
So yeah.

896
00:44:57,900 --> 00:44:58,400
Good.

897
00:44:59,480 --> 00:45:04,080
That's I think quite good deep
dive, unless we forgot anything.

898
00:45:04,700 --> 00:45:06,700
Michael: Any last things you wanted
to add?

899
00:45:06,960 --> 00:45:09,520
Nate: I think we can talk about
the mitigations, just so that

900
00:45:09,520 --> 00:45:13,440
if someone's unfortunate enough
to run into this, that they have

901
00:45:13,440 --> 00:45:16,160
a few breadcrumbs for how to work
around it.

902
00:45:16,620 --> 00:45:20,080
Because it was rather expensive
for us to figure out what those

903
00:45:20,080 --> 00:45:22,640
were, including Andrew and others
digging through the source

904
00:45:22,640 --> 00:45:24,100
to really understand it.

905
00:45:24,100 --> 00:45:27,480
But a lot of what we inferred after
the fact was, so we had this

906
00:45:27,480 --> 00:45:33,120
migration which was doing a bunch
of row-by-row writes, not ideal

907
00:45:33,240 --> 00:45:35,720
if you're doing a highly concurrent
process here.

908
00:45:36,020 --> 00:45:37,500
And so we did batch those.

909
00:45:37,540 --> 00:45:41,180
We also did a lot of deep thinking
on which of the foreign keys

910
00:45:41,180 --> 00:45:43,220
we actually needed on this new
table.

911
00:45:44,380 --> 00:45:49,020
Nikolay: Batches. How did you find
the batch size that's ideal for you?

912
00:45:50,200 --> 00:45:53,080
Nate: I don't know that we're at
an ideal, but we climbed

913
00:45:53,080 --> 00:45:55,180
the hill far enough that it's
not a problem.

914
00:45:55,600 --> 00:45:56,000
Nikolay: I see.

915
00:45:56,000 --> 00:45:58,200
Nate: And so it was kind of a trade-off
between what throughput

916
00:45:58,200 --> 00:46:01,860
we could obtain and making sure
that we weren't having the kind

917
00:46:01,860 --> 00:46:07,260
of, like, over-time growth in our
heuristic for the member space

918
00:46:07,280 --> 00:46:10,600
utilization, which is mainly based
on how many of the offset

919
00:46:10,600 --> 00:46:11,400
files existed.

920
00:46:11,980 --> 00:46:12,480
Yeah,

921
00:46:14,540 --> 00:46:18,200
Nikolay: I usually say, when choosing
batches, choose the biggest possible,

922
00:46:18,280 --> 00:46:22,400
but avoid locking rows for so
long that it affects users, because

923
00:46:22,580 --> 00:46:26,740
for users, we know perception is 200
milliseconds, so if you lock,

924
00:46:26,740 --> 00:46:30,180
like, 1 second should be the maximum,
2 seconds should be the absolute

925
00:46:30,180 --> 00:46:33,540
maximum, for the batch size choice.
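
Nikolay's rule of thumb can be sketched as a simple loop: one short transaction per slice, with the batch size capped so each transaction, and the row locks it holds, stays around or under a second. A minimal sketch; `run_batch` is a hypothetical callback standing in for the real per-batch write:

```python
from typing import Callable, Sequence

def migrate_in_batches(ids: Sequence[int],
                       run_batch: Callable[[Sequence[int]], None],
                       batch_size: int = 200) -> int:
    # One short transaction per slice, instead of one per row, keeps
    # lock hold times low and creates far fewer overlapping lockers
    # on shared rows such as foreign-key parents.
    batches = 0
    for start in range(0, len(ids), batch_size):
        run_batch(ids[start:start + batch_size])  # one transaction here
        batches += 1
    return batches

# 1,000 rows in batches of 200 -> 5 transactions instead of 1,000.
written = []
count = migrate_in_batches(list(range(1000)), written.extend, batch_size=200)
print(count, len(written))  # 5 1000
```

In practice you would tune `batch_size` until per-batch latency sits comfortably under the lock-time budget, rather than fixing it up front.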

926
00:46:34,080 --> 00:46:35,500
Nate: Yeah, it's less often.

927
00:46:35,500 --> 00:46:37,400
So we're at batches around, like
200,

928
00:46:38,000 --> 00:46:38,200
Michael: which

929
00:46:38,200 --> 00:46:41,200
Nate: is enough to mitigate this
concern and kind of keep all

930
00:46:41,200 --> 00:46:45,060
of the operations up, you know,
short, since there's a very online

931
00:46:45,060 --> 00:46:46,420
workload hitting this as well.

932
00:46:46,640 --> 00:46:51,080
Nikolay: This is not users in terms
of batch transactions, but

933
00:46:51,500 --> 00:46:53,800
users read results, right?

934
00:46:53,800 --> 00:46:54,520
Or no?

935
00:46:54,960 --> 00:46:55,760
Nobody's using

936
00:46:55,760 --> 00:46:56,460
Nate: it now.

937
00:46:57,260 --> 00:47:00,600
So sorry, this is in the context
of our greater migration to

938
00:47:00,600 --> 00:47:05,080
move line items, line item construct
out of our database, as

939
00:47:05,080 --> 00:47:08,040
well as partitioning the invoice
table, which has also grown

940
00:47:08,040 --> 00:47:09,020
to be quite large.

941
00:47:09,720 --> 00:47:13,520
And so, you know, kind of we would
read the old version, all

942
00:47:13,520 --> 00:47:17,500
the line items in the invoice row,
write those to a new format

943
00:47:17,500 --> 00:47:21,040
in S3, which is ultimately keyed
by the invoice ID.

944
00:47:21,220 --> 00:47:24,720
And then a lot of logic, like we
can't just do a deterministic

945
00:47:24,840 --> 00:47:27,900
mapping because then you get hot
partitions in S3, so you have

946
00:47:27,900 --> 00:47:31,260
to do a little bit of extra work
and leave yourself a link in

947
00:47:31,260 --> 00:47:33,120
the actual database row.

948
00:47:33,540 --> 00:47:35,980
And so then we were writing those
out to the new table, which

949
00:47:35,980 --> 00:47:37,320
is partitioned 1 by 1.

950
00:47:37,900 --> 00:47:40,680
And our diagnosis, and it's a little
hard to confirm this, because

951
00:47:40,680 --> 00:47:45,220
again, there is no observability
bit here, is that that was the

952
00:47:45,220 --> 00:47:48,940
big mistake, as well as the single
inserts, like very quick.

953
00:47:48,940 --> 00:47:53,300
It was the issue that they all
reference a handful of other tables

954
00:47:53,300 --> 00:47:56,100
for foreign keys, some of which
were very low cardinality.

955
00:47:56,880 --> 00:48:00,960
So, you know, we have like some
enum tables and we, you know,

956
00:48:00,960 --> 00:48:03,560
for better or worse, in this case
worse, had chosen to use a

957
00:48:03,560 --> 00:48:05,420
foreign key reference to them.

958
00:48:05,940 --> 00:48:09,900
And so if every invoice corresponds
to the USD credit type,

959
00:48:09,900 --> 00:48:12,860
for example, you're going to have
a lot of overlapping transactions

960
00:48:13,020 --> 00:48:16,960
all accessing that row to verify
the integrity of the data.

961
00:48:18,340 --> 00:48:21,300
And so we kind of walked through
it, and we realized a lot of these

962
00:48:21,300 --> 00:48:24,660
low cardinality cases, we don't
really change those code paths.

963
00:48:24,780 --> 00:48:26,820
We're quite confident they're going
to be stable.

964
00:48:27,160 --> 00:48:28,940
It's only the higher cardinality
cases.

965
00:48:29,280 --> 00:48:32,880
An invoice also references its
customer that owns that invoice.

966
00:48:33,540 --> 00:48:35,540
But that's a much higher cardinality
connection.

967
00:48:35,740 --> 00:48:39,160
We're much less likely to have
overlapping transactions referring

968
00:48:39,160 --> 00:48:42,280
to the same customer in a migration
like this.

969
00:48:42,700 --> 00:48:45,040
Nikolay: Without foreign keys,
did you think about periodic

970
00:48:45,100 --> 00:48:48,080
checks of referential integrity,
like asynchronously?

971
00:48:48,940 --> 00:48:52,160
Nate: Yeah, so we're going to be
adding those to run on our Lakehouse.

972
00:48:53,300 --> 00:48:56,580
So yeah, we have a syncing process
that brings all the files

973
00:48:56,580 --> 00:48:57,700
over into Parquet.

974
00:48:58,180 --> 00:49:00,900
Eventually, we're going to finish
the work on getting it to be

975
00:49:00,900 --> 00:49:04,700
CDC, so it's more incremental,
probably using Iceberg.

976
00:49:06,180 --> 00:49:09,920
And so on those, we'll run every,
I mean, it's much cheaper to

977
00:49:09,920 --> 00:49:11,600
run the checks there, for a lot
of reasons.
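
The asynchronous check mentioned here is, at its core, an anti-join: find child rows whose foreign-key value has no matching parent. In production it would run as SQL over the lakehouse tables; this is just an in-memory sketch with hypothetical data:

```python
from typing import Hashable, Iterable, Set

def orphaned_keys(child_fks: Iterable[Hashable],
                  parent_pks: Iterable[Hashable]) -> Set[Hashable]:
    # Anti-join: foreign-key values present in the child table but
    # absent from the parent table. None (NULL) references are skipped,
    # matching SQL foreign-key semantics.
    parents = set(parent_pks)
    return {fk for fk in child_fks if fk is not None and fk not in parents}

invoice_ids = [101, 102, 103]                # parent primary keys
line_item_fks = [101, 101, 103, 999, None]   # child foreign keys
print(orphaned_keys(line_item_fks, invoice_ids))  # {999}
```

An empty result means the dropped constraint still holds; any orphans get flagged for investigation instead of blocking writes.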

978
00:49:11,600 --> 00:49:14,980
Nikolay: More and more reasons
to move data out of Postgres.

979
00:49:15,040 --> 00:49:20,280
I understand it for analytical
reasons, but yesterday,

980
00:49:20,280 --> 00:49:22,540
S3 released a vector index.

981
00:49:22,700 --> 00:49:24,640
If you haven't heard it, it's interesting.

982
00:49:25,120 --> 00:49:29,580
So I'm just curious where the place
for Postgres will be in our

983
00:49:29,580 --> 00:49:30,080
systems.

984
00:49:31,220 --> 00:49:34,460
But I totally understand, for analytical
workloads, definitely.

985
00:49:35,380 --> 00:49:35,880
Nate: Yeah.

986
00:49:36,040 --> 00:49:38,720
I mean, everything that's OLTP-shaped
is going to be staying

987
00:49:38,720 --> 00:49:40,180
in Postgres for some time.

988
00:49:40,200 --> 00:49:42,180
Can't see a reason to move it anywhere
else.

989
00:49:42,800 --> 00:49:43,680
Nikolay: Well, yeah.

990
00:49:44,380 --> 00:49:44,880
Michael: Great.

991
00:49:45,060 --> 00:49:49,000
In terms of mitigations, did you
tune any of the settings around

992
00:49:49,000 --> 00:49:50,760
multixacts?

993
00:49:51,020 --> 00:49:51,520
Yeah.

994
00:49:52,360 --> 00:49:53,160
What did you do?

995
00:49:53,160 --> 00:49:54,940
Which ones did you increase or
decrease?

996
00:49:56,180 --> 00:49:59,020
Andrew: I don't know if there were
actually any specific

997
00:49:59,020 --> 00:50:01,760
Postgres flags that we changed, actually.

998
00:50:01,760 --> 00:50:03,660
I don't think we changed any constants specifically.

999
00:50:03,840 --> 00:50:08,940
But what we did do is we did introspect into the code base and

1000
00:50:08,940 --> 00:50:11,700
determine the danger levels for membership spaces.

1001
00:50:12,240 --> 00:50:17,680
And so we used that as a threshold for alerting ourselves with

1002
00:50:17,680 --> 00:50:18,180
Grafana.

1003
00:50:18,480 --> 00:50:23,300
And so if our membership space consumption breaches a certain

1004
00:50:23,300 --> 00:50:26,320
level, what we will do is we will take actions to reduce the

1005
00:50:26,320 --> 00:50:29,440
level of concurrency, the amount of concurrent transactions

1006
00:50:29,440 --> 00:50:32,800
that are occurring until Postgres can recover that membership

1007
00:50:32,800 --> 00:50:33,300
space.

1008
00:50:33,480 --> 00:50:36,920
So it's a little bit of a manual intervention at the moment,

1009
00:50:37,360 --> 00:50:40,520
but I don't think there's any specific knobs that we can tune

1010
00:50:40,520 --> 00:50:44,080
because it is fundamentally limited by a 32-bit integer.

1011
00:50:45,040 --> 00:50:50,280
And the telemetry with respect to multixact membership numbers

1012
00:50:50,280 --> 00:50:54,260
is not used anywhere else except in the detection of when autovacuum

1013
00:50:54,400 --> 00:50:55,100
should occur.

1014
00:50:55,120 --> 00:50:58,880
And that is sort of automatic, and kind of hard-coded.

1015
00:51:00,060 --> 00:51:04,540
Nikolay: Yeah, and for the current utilization, you use the function

1016
00:51:04,540 --> 00:51:06,360
pg_get_multixact_members, right?

1017
00:51:06,360 --> 00:51:07,700
Or how do you?

1018
00:51:08,300 --> 00:51:10,360
Andrew: No, I don't believe that is it.

1019
00:51:10,460 --> 00:51:14,180
I think it's some specialized function where we actually have

1020
00:51:14,180 --> 00:51:18,020
to get all the files and read the file sizes themselves.

1021
00:51:18,280 --> 00:51:21,840
So it's not something from Postgres, I don't believe.

1022
00:51:22,880 --> 00:51:23,680
Nikolay: But it's Aurora.

1023
00:51:23,680 --> 00:51:25,240
You don't have access to files.

1024
00:51:26,040 --> 00:51:27,040
Andrew: There is some function.

1025
00:51:27,040 --> 00:51:28,080
I can't remember the name.

1026
00:51:28,080 --> 00:51:30,040
Nikolay: It's pg_ls-something, right?

1027
00:51:30,180 --> 00:51:31,620
So it lists files.

1028
00:51:32,420 --> 00:51:33,900
You can still use SQL.

1029
00:51:33,900 --> 00:51:35,580
You can list files, oh, OK, interesting.

1030
00:51:36,020 --> 00:51:38,720
Nate: The query originated from an Amazon blog post.

1031
00:51:38,740 --> 00:51:39,800
I'll try to pull it up.

1032
00:51:39,800 --> 00:51:41,640
You can see if you want to include it in the show notes.
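
The file-counting heuristic described here can be reconstructed from constants in Postgres's multixact.c; this is our reading of the source, not the exact query from the Amazon post. Each pg_multixact/members segment file holds 32 pages of 1,636 members, so the 2^32 member space spans roughly 82,041 segment files, and counting the files (for example via pg_ls_dir('pg_multixact/members'), where your role permits it) yields a utilization estimate:

```python
import math

PAGE_SIZE = 8192
MEMBERGROUP_SIZE = 4 * 4 + 4  # four 4-byte member XIDs plus four flag bytes
MEMBERS_PER_PAGE = (PAGE_SIZE // MEMBERGROUP_SIZE) * 4  # 409 groups * 4 = 1636
PAGES_PER_SEGMENT = 32
MEMBERS_PER_SEGMENT = MEMBERS_PER_PAGE * PAGES_PER_SEGMENT  # 52,352
MAX_SEGMENTS = math.ceil(2**32 / MEMBERS_PER_SEGMENT)  # ~82,041 files total

def member_space_utilization(segment_file_count: int) -> float:
    # Rough fraction of the 2^32 member space consumed, judged purely
    # by how many members segment files currently exist, e.g. from:
    #   SELECT count(*) FROM pg_ls_dir('pg_multixact/members');
    return segment_file_count / MAX_SEGMENTS

print(MEMBERS_PER_PAGE, MEMBERS_PER_SEGMENT, MAX_SEGMENTS)  # 1636 52352 82041
print(round(member_space_utilization(41_000), 2))  # 0.5
```

Alerting when this fraction crosses, say, 50 percent leaves plenty of headroom before the wraparound protections kick in.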

1033
00:51:41,640 --> 00:51:42,440
Nikolay: Let's make sure.

1034
00:51:42,440 --> 00:51:46,160
I don't remember if this recipe was shared in your blog post, how

1035
00:51:46,160 --> 00:51:47,260
to monitor this.

1036
00:51:47,280 --> 00:51:49,740
Andrew: It was not, we did not post it there.

1037
00:51:49,740 --> 00:51:53,900
We actually intend to have a follow-up for a more engineering-

1038
00:51:53,940 --> 00:51:57,660
based audience, where we would detail these steps kind of in the

1039
00:51:57,660 --> 00:52:02,260
same way that we detailed it here, but in a more concrete format

1040
00:52:02,260 --> 00:52:05,880
for engineers and not for executives and company leaders.

1041
00:52:06,420 --> 00:52:09,060
Nate: Yeah, and we were a bit more timeline-constrained for the initial

1042
00:52:09,060 --> 00:52:09,560
RCA.

1043
00:52:10,680 --> 00:52:13,780
And we are optimistic that we can link to Andrew's patch for

1044
00:52:13,780 --> 00:52:15,560
the second follow-up blog post.

1045
00:52:16,420 --> 00:52:16,920
Nikolay: Exactly.

1046
00:52:17,460 --> 00:52:18,140
That's awesome.

1047
00:52:18,380 --> 00:52:22,000
Because probably we
should add it to monitoring

1048
00:52:22,000 --> 00:52:24,740
practically everywhere in
large systems, right?

1049
00:52:24,960 --> 00:52:27,800
This would be the main outcome of this
discussion probably, right?

1050
00:52:27,800 --> 00:52:29,940
For those who listen and have big
systems.

1051
00:52:30,180 --> 00:52:30,680
Andrew: Absolutely.

1052
00:52:31,120 --> 00:52:34,860
Nikolay: Before your patch, Andrew,
gets accepted and everyone upgrades

1053
00:52:34,860 --> 00:52:37,160
to Postgres 19, which will take
some time.

1054
00:52:37,540 --> 00:52:38,360
Right, so...

1055
00:52:38,620 --> 00:52:42,360
Andrew: Yes, until then we definitely
have some guidance for

1056
00:52:42,360 --> 00:52:43,400
monitoring systems.

1057
00:52:43,660 --> 00:52:46,880
We admit though that it is not
the most ideal, but it is something

1058
00:52:46,880 --> 00:52:51,600
that people can use to guide themselves
away from disaster.

1059
00:52:54,520 --> 00:52:58,100
Michael: Well, and I think, as well,
there's the thread you mentioned

1060
00:52:58,580 --> 00:53:03,400
from Peter Geoghegan; he's somebody
that has worked in the past on

1061
00:53:03,400 --> 00:53:04,440
mitigations.

1062
00:53:04,440 --> 00:53:08,300
So not only to be able to monitor
transaction ID wraparound,

1063
00:53:08,300 --> 00:53:11,520
but he did some things to vacuum
so that it's just less likely

1064
00:53:11,520 --> 00:53:13,520
to get into that state in the first
place.

1065
00:53:13,520 --> 00:53:19,640
And I think making that parameter
he mentioned aware of the member

1066
00:53:19,640 --> 00:53:24,120
space, or having some different parameter
that kicks in and does

1067
00:53:24,120 --> 00:53:28,040
like an anti-wraparound vacuum,
or whatever the equivalent

1068
00:53:28,040 --> 00:53:31,560
would need to be earlier would
make this less likely to happen.

1069
00:53:31,560 --> 00:53:35,660
And of course, monitoring is a
good idea anyway, but prevention

1070
00:53:35,800 --> 00:53:36,900
seems even better.

1071
00:53:36,900 --> 00:53:37,420
So yeah.

1072
00:53:37,420 --> 00:53:40,120
Nikolay: Optimization inside Postgres
could be implemented here

1073
00:53:40,120 --> 00:53:40,640
Michael: because it's

1074
00:53:40,640 --> 00:53:44,920
Nikolay: quadratic behavior; at
least, as in other

1075
00:53:44,920 --> 00:53:49,720
cases in Postgres, it
could be split into some pieces,

1076
00:53:49,760 --> 00:53:52,580
like 128 pieces, you know, something
like this.

1077
00:53:52,580 --> 00:53:53,420
I don't know.

1078
00:53:54,280 --> 00:53:58,460
I haven't seen a discussion of an actual
optimization to get more

1079
00:53:58,460 --> 00:53:58,960
capacity.

1080
00:53:59,940 --> 00:54:01,300
Have you seen it?

1081
00:54:03,040 --> 00:54:06,380
Michael: No, but I was looking
through what parameters we even

1082
00:54:06,380 --> 00:54:09,880
have that mention multixact,
and 1 of them is in version 17,

1083
00:54:09,880 --> 00:54:13,940
we got multixact member buffers,
which is quite low by default,

1084
00:54:13,940 --> 00:54:18,820
only 256 kilobytes or 32 pages,
and we got multixact offset

1085
00:54:18,820 --> 00:54:22,040
buffers, also low, 128 kilobytes
by default.

1086
00:54:22,040 --> 00:54:24,620
So I'm wondering whether increasing
those would help with the

1087
00:54:24,620 --> 00:54:28,020
member space maybe that's the offset
buffers 1 and then yeah

1088
00:54:28,020 --> 00:54:32,280
we also have multixact failsafe
age, which you mentioned, from version

1089
00:54:32,280 --> 00:54:36,860
14, that's the 1 from Peter Geoghegan,
and 2 much older ones. I did wonder

1090
00:54:36,860 --> 00:54:41,040
about Freeze Min Age, that's 5
million by default, I wondered

1091
00:54:41,040 --> 00:54:44,480
if lowering that would just get
you at, make sure that the vacuum

1092
00:54:44,480 --> 00:54:47,360
ran quicker when it did need to
happen, you know that kind of

1093
00:54:47,360 --> 00:54:52,080
thing, and then there's the freeze
table age, which is 150

1094
00:54:52,120 --> 00:54:55,760
million, so again I'm not sure
if lowering that would mean

1095
00:54:55,760 --> 00:54:58,780
you just get vacuums more often
but I guess when we're talking

1096
00:54:58,780 --> 00:55:01,860
about such a huge table, it's still
going to be a problem with what

1097
00:55:01,860 --> 00:55:05,100
the vacuum does, so until you get
partitioning in place anyway.

1098
00:55:06,340 --> 00:55:09,800
Thank you so much, both of you, it's
been amazing learning about

1099
00:55:09,800 --> 00:55:15,060
this, scary but great, and thank
you for paying the pain and the

1100
00:55:15,060 --> 00:55:18,780
cost so that many fewer people
will have to in future.

1101
00:55:18,960 --> 00:55:21,940
Nikolay: And special thanks for
the transparency and admitting

1102
00:55:21,940 --> 00:55:23,760
your own mistakes in the past.

1103
00:55:23,760 --> 00:55:28,240
I especially appreciate this,
how you talk freely about

1104
00:55:28,460 --> 00:55:29,240
your own mistakes.

1105
00:55:29,240 --> 00:55:32,940
This is very valuable and a good
example for others as well.

1106
00:55:34,060 --> 00:55:34,540
Andrew: Absolutely.

1107
00:55:34,540 --> 00:55:35,500
Thanks for having us.

1108
00:55:35,500 --> 00:55:37,060
It was a pleasure to be here.

1109
00:55:37,460 --> 00:55:38,040
Nikolay: Thank you.

1110
00:55:38,040 --> 00:55:39,560
Nate: Yeah.
It was a really fun chat.