1
00:00:00,060 --> 00:00:02,640
Michael: Hello and welcome to PostgresFM, a weekly show about

2
00:00:02,640 --> 00:00:03,480
all things PostgreSQL.

3
00:00:03,680 --> 00:00:06,420
I am Michael, founder of pgMustard, and I'm joined as usual by

4
00:00:06,420 --> 00:00:08,000
Nik, founder of Postgres.AI.

5
00:00:08,000 --> 00:00:08,820
Hey, Nik.

6
00:00:09,480 --> 00:00:10,300
Nikolay: Hi, Michael.

7
00:00:10,680 --> 00:00:12,780
And let's welcome our guest.

8
00:00:13,260 --> 00:00:16,680
Michael: Yeah, we are joined by a very special guest, Sugu, who

9
00:00:16,680 --> 00:00:21,660
is a co-creator of Vitess, co-founded PlanetScale, and is now

10
00:00:21,660 --> 00:00:25,440
at Supabase working on an exciting project called Multigres.

11
00:00:26,120 --> 00:00:27,240
So welcome, Sugu.

12
00:00:27,880 --> 00:00:28,540
Sugu: Thank you.

13
00:00:29,060 --> 00:00:30,220
Glad to be here.

14
00:00:30,700 --> 00:00:31,020
Michael: All right.

15
00:00:31,020 --> 00:00:31,940
It's our pleasure.

16
00:00:32,220 --> 00:00:36,660
So it's my job to ask you a couple of the easy questions to start

17
00:00:36,660 --> 00:00:37,120
off.

18
00:00:37,120 --> 00:00:41,080
So what is Multigres and why are you working on it?

19
00:00:42,440 --> 00:00:46,560
Sugu: Multigres is a Vitess adaptation for Postgres.

20
00:00:47,440 --> 00:00:54,160
It's been on my mind for a long time, many years, and we've even

21
00:00:54,160 --> 00:00:56,700
had a few false starts with this project.

22
00:00:57,740 --> 00:01:02,460
And I guess there is a timing for everything and finally the

23
00:01:02,460 --> 00:01:03,440
timing has come.

24
00:01:03,800 --> 00:01:07,980
So I'm very excited to get started on this finally.

25
00:01:09,520 --> 00:01:13,060
Michael: Yeah, timing is an interesting one. It feels like for many

26
00:01:13,060 --> 00:01:17,240
years I was looking at PlanetScale and Vitess specifically, very

27
00:01:17,240 --> 00:01:21,100
jealously, thinking: you can promise the world, you can promise

28
00:01:21,100 --> 00:01:25,760
this, you know, horizontal scaling with a relational database for

29
00:01:25,760 --> 00:01:29,540
OLTP, and it's, you know, all of the things that people want,

30
00:01:29,540 --> 00:01:32,280
and we didn't really have a good answer for it in Postgres.

31
00:01:32,280 --> 00:01:35,660
But all of a sudden, in the last few months, it seems almost,

32
00:01:35,980 --> 00:01:40,440
there are now 3, 4 competing projects all doing it.

33
00:01:40,440 --> 00:01:41,240
So why now?

34
00:01:41,240 --> 00:01:42,840
Why is it all happening now?

35
00:01:46,980 --> 00:01:50,740
Sugu: I would say, I think there is a timing for every new idea

36
00:01:50,760 --> 00:01:51,940
to gather momentum.

37
00:01:52,800 --> 00:01:57,340
Like the idea may be good, but the timing may be bad for one reason

38
00:01:57,340 --> 00:01:58,040
or another.

39
00:01:58,080 --> 00:02:02,200
But once the timing is good, then it is obvious that it is the

40
00:02:02,200 --> 00:02:02,920
right time.

41
00:02:03,340 --> 00:02:05,720
And it feels like now is that time.

42
00:02:05,940 --> 00:02:08,960
Like I would say, for example, PayPal, there have been previous

43
00:02:09,400 --> 00:02:13,300
attempts at moving money online, but it never worked.

44
00:02:13,540 --> 00:02:16,860
But PayPal just came at the right time and therefore took off

45
00:02:16,960 --> 00:02:18,220
during that time.

46
00:02:18,480 --> 00:02:21,580
Many such examples where, you know, some ideas came too early

47
00:02:21,580 --> 00:02:22,760
and didn't take off.

48
00:02:23,080 --> 00:02:26,180
But for Postgres, this feels like it's the right time.

49
00:02:27,180 --> 00:02:28,020
Nikolay: That's interesting.

50
00:02:28,660 --> 00:02:29,880
Let me reveal something.

51
00:02:30,560 --> 00:02:33,480
We met in 2018, I think, at a Percona
conference.

52
00:02:33,620 --> 00:02:37,520
And in 2019, we met again at Nutanix
at the meetup.

53
00:02:38,420 --> 00:02:39,640
Go meetup, right.

54
00:02:39,860 --> 00:02:43,220
And discussions about Vitess for
Postgres started.

55
00:02:43,280 --> 00:02:46,400
And a couple of times we had a
call, I tried to involve a couple

56
00:02:46,400 --> 00:02:51,140
of guys. From my understanding,
it never worked because

57
00:02:52,020 --> 00:02:56,160
you could not do it yourself, being
busy with other things, I guess

58
00:02:56,160 --> 00:02:59,780
MySQL-related, and the guys
were looking at the complexity,

59
00:03:00,200 --> 00:03:04,600
other guys were looking at the complexity
of it, and didn't

60
00:03:04,600 --> 00:03:05,100
proceed.

61
00:03:05,220 --> 00:03:09,480
And actually, in 1 case,
they decided to build from scratch.

62
00:03:09,480 --> 00:03:11,680
It was, it was the SPQR project.

63
00:03:11,680 --> 00:03:15,060
It's live and
it does sharding for Postgres.

64
00:03:15,060 --> 00:03:16,580
Sugu: Yeah, Andrey Borodin.

65
00:03:17,200 --> 00:03:18,400
Nikolay: Yeah, Andrey Borodin.

66
00:03:19,280 --> 00:03:19,780
Borodin.

67
00:03:20,900 --> 00:03:21,280
Yeah.
Yeah.

68
00:03:21,280 --> 00:03:23,100
And other folks were also involved.

69
00:03:23,260 --> 00:03:27,700
And so for me, it was disappointing
that it didn't work.

70
00:03:27,700 --> 00:03:32,220
And at some point I saw the message
in Vitess, I think, that

71
00:03:32,220 --> 00:03:33,660
we are not going to do it.

72
00:03:33,840 --> 00:03:35,580
So, like, don't expect it.

73
00:03:36,220 --> 00:03:40,200
Sugu: I feel so bad because I was
so excited about doing it.

74
00:03:40,200 --> 00:03:43,220
And then I realized, oh my God,
you know.

75
00:03:45,060 --> 00:03:47,900
Nikolay: But now you've started, and
last week PlanetScale decided

76
00:03:47,900 --> 00:03:48,840
to support Postgres.

77
00:03:48,840 --> 00:03:49,840
So what's happening?

78
00:03:49,900 --> 00:03:51,180
Like, I don't understand.

79
00:03:51,260 --> 00:03:52,480
Just, just the right time.

80
00:03:52,480 --> 00:03:52,980
Right.

81
00:03:53,360 --> 00:03:53,560
Yeah.

82
00:03:53,560 --> 00:03:58,080
Enough, enough companies
using Postgres, which really

83
00:03:58,080 --> 00:03:58,580
needed it.

84
00:03:59,140 --> 00:04:00,840
Sugu: At least 1 horse will win.

85
00:04:02,020 --> 00:04:02,520
Yeah.

86
00:04:02,980 --> 00:04:04,400
Nikolay: So yeah, it's great.

87
00:04:04,400 --> 00:04:08,000
But, but yeah, long, long story
to this point.

88
00:04:09,340 --> 00:04:09,660
Michael: Yeah.

89
00:04:09,660 --> 00:04:12,180
Sometimes when there are multiple
projects, there are kind of slight

90
00:04:12,180 --> 00:04:18,400
differences in philosophy or approach
or trade-offs, a willingness

91
00:04:18,400 --> 00:04:21,020
to trade 1 thing off in relation
to another.

92
00:04:21,020 --> 00:04:25,440
And I saw your plan, I really liked
that you used, you mentioned

93
00:04:25,440 --> 00:04:26,340
building incrementally.

94
00:04:26,580 --> 00:04:30,780
So Vitess is a huge project, lots
and lots of features. But

95
00:04:30,780 --> 00:04:34,300
I've heard you talk in the past
about building it quite incrementally

96
00:04:34,480 --> 00:04:35,280
while at YouTube.

97
00:04:35,280 --> 00:04:39,780
You know, it didn't start off as
complex as it is now, obviously,

98
00:04:40,460 --> 00:04:42,540
and you did it kind of 1 feature
at a time.

99
00:04:42,540 --> 00:04:45,200
And it sounds like that's the plan
again with Multigres.

100
00:04:46,720 --> 00:04:48,740
Is that different to some of the other projects?

101
00:04:48,740 --> 00:04:53,200
Or what do you see as your philosophy and how it might differ

102
00:04:53,200 --> 00:04:54,740
slightly to some of the others?

103
00:04:55,340 --> 00:05:01,360
Sugu: I think my philosophy is, I would say, that I don't want

104
00:05:01,360 --> 00:05:06,300
to compromise on what the final project is going to look like.

105
00:05:06,660 --> 00:05:11,260
So any path that makes me deviate from hitting the target on the final

106
00:05:11,260 --> 00:05:13,400
project, I do not want to take.

107
00:05:13,740 --> 00:05:17,960
But any shortcuts that can get us early results, that we can

108
00:05:17,960 --> 00:05:23,360
do, but only those that ensure that it doesn't risk what we become

109
00:05:23,360 --> 00:05:29,440
eventually, which is a project that should feel native as if

110
00:05:29,440 --> 00:05:33,300
it was for Postgres, by Postgres kind of thing.

111
00:05:33,480 --> 00:05:36,140
I want it to be a pure Postgres project.

112
00:05:36,780 --> 00:05:40,820
Nikolay: In this case, I must ask: are you sure

113
00:05:41,060 --> 00:05:44,940
you're still good with the choice of the Go language?

114
00:05:45,320 --> 00:05:49,740
Because sometimes we see projects which have really like sub

115
00:05:49,740 --> 00:05:54,400
millisecond latency on average, like 100 microseconds, for example,

116
00:05:54,400 --> 00:05:55,060
on average.

117
00:05:55,460 --> 00:06:00,760
And Go definitely will bring a few hundred microseconds of latency

118
00:06:00,800 --> 00:06:01,300
overhead.

119
00:06:01,840 --> 00:06:06,760
So usually it's not a big deal, but maybe in some cases it's

120
00:06:06,760 --> 00:06:07,860
some deal, right?

121
00:06:08,560 --> 00:06:10,240
Are you happy with Go?

122
00:06:10,240 --> 00:06:15,220
Yeah, I mean- Because you were 1 of the first big Go language

123
00:06:15,640 --> 00:06:20,580
users building Vitess, as we know from various interviews and

124
00:06:20,580 --> 00:06:21,180
so on.

125
00:06:21,560 --> 00:06:22,060
Yeah.

126
00:06:22,280 --> 00:06:25,700
So it's still a good choice because now there is Rust, right?

127
00:06:26,980 --> 00:06:27,480
Sugu: Yes.

128
00:06:27,700 --> 00:06:33,760
I would say, by the way, when we started with Go, compared to where

129
00:06:33,760 --> 00:06:35,420
Go is today, it was a nightmare.

130
00:06:36,460 --> 00:06:37,700
Nikolay: Yeah, like 10 milliseconds.

131
00:06:39,140 --> 00:06:42,920
Sugu: Like, yeah, like 10 milliseconds or something round trip

132
00:06:42,920 --> 00:06:44,480
is what we were paying for.

133
00:06:45,060 --> 00:06:46,860
In those days we had hard disks, by the way.

134
00:06:46,860 --> 00:06:51,000
So that's another 3 to 5 milliseconds just within the database.

135
00:06:51,760 --> 00:06:53,640
But things are a lot better now.

136
00:06:54,280 --> 00:07:01,080
And at this point, the way I would put it is, like, the trade-offs

137
00:07:01,640 --> 00:07:04,020
are in favor of Go, let's put it that way.

138
00:07:04,020 --> 00:07:07,900
Mainly because there is a huge amount of existing code that you

139
00:07:07,900 --> 00:07:09,840
can just lift and port.

140
00:07:10,200 --> 00:07:14,120
And rewriting all of that in Rust is going to just delay us.

141
00:07:14,880 --> 00:07:20,600
And at least in Vitess, it has proven itself to scale for, you

142
00:07:20,600 --> 00:07:23,600
know, like multi, hundreds of terabytes.

143
00:07:24,480 --> 00:07:28,260
And the latencies that people see are, they are not affected

144
00:07:28,260 --> 00:07:30,200
by a couple of hundred microseconds.

145
00:07:31,460 --> 00:07:37,100
So I think, and plus, there's this inherent acceptance

146
00:07:37,200 --> 00:07:40,580
of this network latency for storage and stuff.

147
00:07:40,900 --> 00:07:44,380
And if you bring the storage local, then this actually wins out

148
00:07:44,380 --> 00:07:45,840
over anything that's there.

149
00:07:46,080 --> 00:07:48,120
Nikolay: This is exactly what I wanted to mention.

150
00:07:48,120 --> 00:07:51,660
Yeah, I see PlanetScale right now, they came out with Postgres

151
00:07:51,660 --> 00:07:52,920
support, but no Vitess.

152
00:07:53,560 --> 00:07:56,520
And I'm very curious how much it will take for them to bring

153
00:07:56,520 --> 00:07:58,040
it and compete with you.

154
00:07:58,040 --> 00:07:59,440
It's another interesting question.

155
00:08:00,620 --> 00:08:05,660
But from the past week, my main impression is, like, the main stake

156
00:08:05,660 --> 00:08:06,960
is on local storage.

157
00:08:07,240 --> 00:08:11,940
And this is great because local storage for Postgres, we use

158
00:08:11,940 --> 00:08:15,300
it in some places where we struggle with EBS volumes and so on,

159
00:08:15,300 --> 00:08:19,240
but it's considered like not standard, not safe, blah blah blah.

160
00:08:19,940 --> 00:08:22,080
There are companies who use it, I know.

161
00:08:23,040 --> 00:08:24,700
I know myself, right?

162
00:08:24,960 --> 00:08:25,820
And it's great.

163
00:08:26,340 --> 00:08:30,480
Like today, for example, with Patroni, and since Postgres 12 we don't

164
00:08:30,480 --> 00:08:33,540
need to restart nodes when we have failover.

165
00:08:33,740 --> 00:08:38,160
So if we lose a node, forget about the node, we just fail over and so

166
00:08:38,160 --> 00:08:38,480
on.

167
00:08:38,480 --> 00:08:41,020
And with local storage, not a big deal.

168
00:08:41,260 --> 00:08:46,720
But now I expect with your plans to bring local storage, it will

169
00:08:46,720 --> 00:08:51,060
become more, like I expect it will be more and more popular,

170
00:08:51,060 --> 00:08:51,940
and that's great.

171
00:08:52,000 --> 00:08:58,160
So you shave off latency there and keep Go, which brings 200 microseconds.

172
00:08:58,940 --> 00:08:59,720
Sugu: Good compromise.

173
00:09:02,220 --> 00:09:07,100
Effectively, it's a win because 1 network hop completely eliminates

174
00:09:08,000 --> 00:09:09,260
language level overheads.

175
00:09:10,160 --> 00:09:10,940
Nikolay: Sounds good.

176
00:09:10,960 --> 00:09:12,660
Maybe Go will improve additionally.

177
00:09:16,080 --> 00:09:19,640
Michael: I wanted to go back, Sugu, you mentioned not wanting

178
00:09:19,640 --> 00:09:23,080
to compromise on feeling Postgres native.

179
00:09:23,680 --> 00:09:29,240
That feels to me like a really big statement, coming from Vitess

180
00:09:29,280 --> 00:09:35,820
being very MySQL specific, saying you want to be Postgres native

181
00:09:35,820 --> 00:09:38,540
feels like it adds a lot of work to the project.

182
00:09:39,520 --> 00:09:41,200
It feels like a lot to me.

183
00:09:41,200 --> 00:09:42,280
What does it mean?

184
00:09:42,500 --> 00:09:45,260
Is that about compatibility with the protocol?

185
00:09:45,260 --> 00:09:47,900
What does it mean to be Postgres native?

186
00:09:50,920 --> 00:09:52,040
Sugu: There's 2 answers.

187
00:09:52,200 --> 00:09:56,500
1 is, why do we still think we can bring Vitess if it was built

188
00:09:56,500 --> 00:09:57,180
for MySQL?

189
00:09:58,420 --> 00:10:00,420
And how do you make it Postgres native?

190
00:10:00,660 --> 00:10:02,680
That's because of Vitess's history.

191
00:10:03,840 --> 00:10:08,620
For the longest time, Vitess was built not to be tied to MySQL.

192
00:10:09,520 --> 00:10:16,160
It was built to be a generic SQL-92-compliant database.

193
00:10:17,420 --> 00:10:21,020
That was actually our restriction for a very long time until

194
00:10:21,020 --> 00:10:26,540
the MySQL community said, you need to support all these MySQL

195
00:10:26,580 --> 00:10:27,900
features, otherwise we won't

196
00:10:27,900 --> 00:10:29,360
move forward.
Nikolay: CTEs, right?

197
00:10:30,080 --> 00:10:32,200
Common Table Expressions, WITH, right?

198
00:10:32,520 --> 00:10:36,460
It's, I guess, an SQL 99 feature, not 92.

199
00:10:36,460 --> 00:10:40,840
Sugu: Yeah, I think the first parser
I built was SQL 92, which

200
00:10:40,840 --> 00:10:43,880
is the most popular 1 that I know
of.

201
00:10:44,380 --> 00:10:45,780
So that's answer 1.

202
00:10:46,620 --> 00:10:49,580
Answer 2 is more with the behavior
of Postgres.

203
00:10:50,420 --> 00:10:55,800
What we want is to completely mimic
the Postgres behavior right

204
00:10:55,800 --> 00:10:56,640
from the beginning.

205
00:10:57,840 --> 00:11:03,540
Basically, in other words, we plan
to actually copy or translate

206
00:11:03,900 --> 00:11:08,380
what we can from the Postgres engine
itself, where that behavior

207
00:11:08,440 --> 00:11:09,900
is very specific to Postgres.

208
00:11:10,900 --> 00:11:16,440
And the goal is not compatibility
just at the communication layer,

209
00:11:16,720 --> 00:11:21,000
but even internally, possibly even,
you know,

210
00:11:21,040 --> 00:11:22,920
at the risk of recreating bugs.

211
00:11:25,020 --> 00:11:30,080
Nikolay: In this case, it's like,
so there is Citus, which in the hands

212
00:11:30,080 --> 00:11:33,180
of Microsoft got everything open
sourced.

213
00:11:33,520 --> 00:11:38,680
So before, resharding was only
in the paid version, now it's in the free

214
00:11:38,680 --> 00:11:41,300
version, open source, so it's fully
open source.

215
00:11:41,680 --> 00:11:45,360
And they put Postgres in between,
so they don't need to mimic

216
00:11:45,360 --> 00:11:46,980
it, they can use it, right?

217
00:11:47,280 --> 00:11:51,960
And latency overhead is surprisingly
low, we checked it.

218
00:11:51,960 --> 00:11:55,020
Well, it's C, but it's a whole database
in between.

219
00:11:55,240 --> 00:11:58,100
But it's sub millisecond, so it's
acceptable as well.

220
00:11:58,100 --> 00:12:01,280
I think it's half a millisecond
or so in our experiments with

221
00:12:01,520 --> 00:12:04,980
a simple SELECT 1 or something, just
a bare SELECT.

222
00:12:05,140 --> 00:12:08,940
So don't you think it's like quite,
like in comparison, it's

223
00:12:08,940 --> 00:12:13,140
quite a challenging point when
you say I'm going to mimic a lot

224
00:12:13,140 --> 00:12:17,260
of stuff, but they just use Postgres
in between?

225
00:12:17,800 --> 00:12:21,100
Sugu: Yeah, yeah, I think there's
a difference in architecture

226
00:12:21,140 --> 00:12:26,020
between our approaches, between Multigres
and Citus.

227
00:12:26,320 --> 00:12:30,380
I think the main difference is
it's a single coordinator for

228
00:12:30,380 --> 00:12:30,880
Citus.

229
00:12:31,680 --> 00:12:35,200
And there is some bottleneck issue
with that.

230
00:12:35,280 --> 00:12:41,180
If you scale to extremely large
workloads, like that goes into

231
00:12:41,180 --> 00:12:44,020
millions of QPS, hundreds of terabytes.

232
00:12:44,760 --> 00:12:47,860
So having that single bottleneck,
I think would be a problem

233
00:12:48,600 --> 00:12:49,340
in the future.

234
00:12:49,340 --> 00:12:49,840
Whereas,

235
00:12:50,660 --> 00:12:51,160
Nikolay: yeah.

236
00:12:51,500 --> 00:12:55,160
I understand you can put multiple,
like you can have multiple

237
00:12:55,680 --> 00:12:59,180
nodes there, and also you can
put a PgBouncer to mitigate

238
00:12:59,180 --> 00:13:00,320
the connection issues.

239
00:13:00,920 --> 00:13:02,460
So it can scale as well.

240
00:13:03,480 --> 00:13:04,020
Sugu: That's good.

241
00:13:04,020 --> 00:13:06,440
That's not, that's something I
didn't know before.

242
00:13:06,820 --> 00:13:10,600
So, yeah, it's possible then that
they may also have something

243
00:13:10,600 --> 00:13:12,600
that can viably scale for OLTP.

244
00:13:13,620 --> 00:13:14,120
Nikolay: Yeah.

245
00:13:15,040 --> 00:13:18,720
So we're still exploring this and
more benchmarks are needed.

246
00:13:18,720 --> 00:13:23,900
Actually, I'm surprised how few
and how non-comprehensive the benchmarks

247
00:13:24,240 --> 00:13:26,440
published for this are.

248
00:13:27,180 --> 00:13:31,020
Sugu: Yeah, what I know of Citus
is probably what you told me

249
00:13:31,020 --> 00:13:32,020
when we met.

250
00:13:33,340 --> 00:13:35,100
So my knowledge is about 5 years old.

251
00:13:35,860 --> 00:13:37,380
Michael: Yeah, yeah.

252
00:13:38,420 --> 00:13:41,500
Another big difference, and this
is typically Nikolay's question

253
00:13:41,580 --> 00:13:45,860
is on the license front. I think
you've picked about as open a

254
00:13:45,860 --> 00:13:49,540
license as you could possibly pick,
which is not the case, I think,

255
00:13:49,540 --> 00:13:52,360
for many of the other projects.

256
00:13:52,360 --> 00:13:56,640
So that feels to me like a very
Supabase thing to do, and also

257
00:13:56,660 --> 00:14:01,160
in line with what Vitess did
and that seems to me like a major

258
00:14:01,240 --> 00:14:07,240
advantage in terms of collaborating
with others, other providers

259
00:14:07,360 --> 00:14:10,340
also adopting this or working with
you to make it better.

260
00:14:10,440 --> 00:14:13,320
Like, what's your philosophy
on that side of things?

261
00:14:14,180 --> 00:14:18,740
Sugu: My philosophy is, my metric
for success is adoption.

262
00:14:19,900 --> 00:14:20,400
Yeah.

263
00:14:20,740 --> 00:14:24,960
And the only way to have a project
be adopted is to have a good

264
00:14:24,960 --> 00:14:28,480
license, a license that people
are confident to use.

265
00:14:29,180 --> 00:14:31,960
That has been the case from day
1 of Vitess.

266
00:14:32,360 --> 00:14:35,820
We actually first launched with
BSD license, which is even more

267
00:14:35,820 --> 00:14:37,160
permissive than Apache.

268
00:14:37,900 --> 00:14:42,040
And then when we moved the project
to CNCF, they said, oh, no,

269
00:14:42,040 --> 00:14:45,020
no, we need the Apache license, which
is why we converted it to that

270
00:14:45,020 --> 00:14:45,520
1.

271
00:14:45,760 --> 00:14:48,140
So adoption has always been, yeah.

272
00:14:49,700 --> 00:14:50,800
Nikolay: Why do they say it?

273
00:14:50,800 --> 00:14:51,600
Do you know?

274
00:14:52,200 --> 00:14:54,560
Sugu: Why does CNCF want Apache?

275
00:14:55,640 --> 00:14:58,100
I think Apache is a pretty good
license.

276
00:14:58,260 --> 00:14:59,700
They just made it a policy.

277
00:15:00,040 --> 00:15:03,280
I mean, had we asked to keep the
BSD license, they would have

278
00:15:03,280 --> 00:15:06,900
allowed us, but we didn't feel
like it was a problem to move

279
00:15:06,900 --> 00:15:07,580
to Apache.

280
00:15:08,940 --> 00:15:13,140
Nikolay: Yeah, and I remember you
described like, when you did

281
00:15:13,140 --> 00:15:16,920
it, like at YouTube, you thought
about external users.

282
00:15:17,040 --> 00:15:20,140
You need external users for this
project to grow.

283
00:15:21,380 --> 00:15:25,600
I guess at Google, AGPL is not
popular at all, we know.

284
00:15:25,600 --> 00:15:26,500
Sugu: Oh, banned.

285
00:15:27,100 --> 00:15:28,220
Nikolay: Yeah, yeah, banned.

286
00:15:29,380 --> 00:15:30,980
And Citus is AGPL,

287
00:15:31,980 --> 00:15:32,940
Sugu: which is interesting.

288
00:15:34,020 --> 00:15:37,900
Nikolay: Also, compared to Citus,
I think you have chances to

289
00:15:37,900 --> 00:15:43,980
be compatible with RDS and other
managed Postgres, to work on

290
00:15:43,980 --> 00:15:47,360
top of them, unlike Citus, which
requires an extension and so on,

291
00:15:47,360 --> 00:15:47,860
right?

292
00:15:48,540 --> 00:15:49,000
Sugu: Correct.

293
00:15:49,000 --> 00:15:49,440
Correct.

294
00:15:49,440 --> 00:15:49,940
Yes.

295
00:15:51,900 --> 00:15:56,000
This was actually something that
we learned very early on, wanting

296
00:15:56,000 --> 00:15:56,660
it to work.

297
00:15:57,540 --> 00:16:02,860
We made a 5-line change to MySQL
just to make Vitess work initially.

298
00:16:03,500 --> 00:16:07,680
And it was such a nightmare to
keep that binary up, to keep that

299
00:16:07,680 --> 00:16:08,600
build running.

300
00:16:08,680 --> 00:16:09,180
Fork.

301
00:16:09,520 --> 00:16:11,300
Yeah, to keep that fork alive.

302
00:16:11,680 --> 00:16:15,600
So we decided, no, it's like, we
are going to make this work

303
00:16:15,600 --> 00:16:19,260
without a single line of code
changed in MySQL.

304
00:16:20,860 --> 00:16:24,360
And that actually is what helped
Vitess move forward.

305
00:16:24,440 --> 00:16:27,260
Because people would come in with
all kinds of configurations

306
00:16:27,840 --> 00:16:29,720
and say, make it work for this.

307
00:16:30,060 --> 00:16:35,940
So in this case, actually, we'll
probably talk about the consensus

308
00:16:36,040 --> 00:16:36,540
part.

309
00:16:37,040 --> 00:16:42,020
That is 1 part where we think it
is worth making a patch for Postgres,

310
00:16:42,340 --> 00:16:45,260
and we're going to work hard at
getting that patch accepted.

311
00:16:45,780 --> 00:16:51,220
But I think what we will do is
we will also make Multigres work

312
00:16:51,220 --> 00:16:55,380
for unpatched Postgres for those
who want it that way.

313
00:16:55,840 --> 00:16:59,200
Except they will lose all the cool
things that consensus

314
00:16:59,240 --> 00:17:00,140
can give you.

315
00:17:00,660 --> 00:17:04,560
Nikolay: I'm smiling because we
have so many variations of Postgres

316
00:17:04,600 --> 00:17:05,260
these days.

317
00:17:05,380 --> 00:17:08,600
I would expect people coming not
only with different configurations

318
00:17:08,800 --> 00:17:12,180
of Postgres, but also with various
flavors like Aurora.

319
00:17:12,800 --> 00:17:17,940
We have a client who just migrated
from regular Postgres to AlloyDB.

320
00:17:18,900 --> 00:17:20,400
Hello from Google Cloud.

321
00:17:21,260 --> 00:17:25,640
And they already sharded on the application
side, but imagine they

322
00:17:25,640 --> 00:17:28,860
could come to you and say, let's
support AlloyDB now.

323
00:17:28,860 --> 00:17:30,520
It's almost Postgres, right?

324
00:17:31,760 --> 00:17:35,900
So yeah, so these things might
happen as well.

325
00:17:36,820 --> 00:17:39,820
Sugu: Don't they claim full compatibility
with Postgres?

326
00:17:39,920 --> 00:17:41,120
I thought you...

327
00:17:41,120 --> 00:17:42,600
Nikolay: Not full, but most of it.

328
00:17:42,600 --> 00:17:47,980
They did interesting stuff
in memory, like column

329
00:17:47,980 --> 00:17:49,540
storage in memory for tables.

330
00:17:49,540 --> 00:17:53,300
It's row storage on disk, but column
storage in memory.

331
00:17:54,020 --> 00:17:57,800
But it looks like kind of Postgres,
and we actually even had

332
00:17:57,800 --> 00:18:01,780
some questions answered by my
team unexpectedly because we

333
00:18:01,780 --> 00:18:05,040
don't normally work with AlloyDB,
but it looks like Postgres.

334
00:18:05,340 --> 00:18:09,720
So I could imagine the request,
let's support AlloyDB as well.

335
00:18:09,960 --> 00:18:14,720
Anyway, but my question: I remember
a feature in Vitess, that it could

336
00:18:14,720 --> 00:18:17,060
work with RDS and managed MySQL.

337
00:18:17,900 --> 00:18:20,640
Did those features, like, has this
feature survived?

338
00:18:21,820 --> 00:18:27,800
Sugu: No, actually later we decided
that at least, we call it

339
00:18:27,800 --> 00:18:30,040
managed versus unmanaged.

340
00:18:30,720 --> 00:18:34,440
Managed means that
Vitess manages its databases.

341
00:18:35,580 --> 00:18:38,920
And unmanaged means that the database
is managed by somebody

342
00:18:38,920 --> 00:18:41,880
else, Vitess just acts as a proxy
to serve queries.

343
00:18:42,840 --> 00:18:49,120
At some point of time, we realized
that supporting both is diluting

344
00:18:49,120 --> 00:18:49,820
our efforts.

345
00:18:51,040 --> 00:18:54,440
And that's when we decided, okay,
you know, it's not worth it

346
00:18:54,440 --> 00:18:57,880
to try and make this work with
every available version that exists

347
00:18:57,880 --> 00:18:59,000
out there in the world.

348
00:19:00,060 --> 00:19:04,940
And we said, okay, we will do only
managed, which means that

349
00:19:04,940 --> 00:19:06,340
we will manage it ourselves.

350
00:19:06,820 --> 00:19:10,680
And if you want, we'll build the
tools to migrate out of wherever

351
00:19:10,680 --> 00:19:14,760
you are, and we'll make it safe,
we'll make it completely transparent.

352
00:19:15,580 --> 00:19:19,640
In other words, you deploy Vitess
on both and then we'll migrate

353
00:19:19,640 --> 00:19:22,400
your data out without you having
to change your application.

354
00:19:23,000 --> 00:19:28,300
But then Vitess can be more intentional
about its features,

355
00:19:28,780 --> 00:19:31,740
more opinionated about how clusters
are to be managed.

356
00:19:32,120 --> 00:19:35,420
And we were able to commit to that
because at that point, Vitess

357
00:19:35,420 --> 00:19:36,500
had become mature enough.

358
00:19:36,500 --> 00:19:38,420
People were completely trusting
it.

359
00:19:38,420 --> 00:19:41,460
They actually preferred it over
previous other managed solutions.

360
00:19:42,340 --> 00:19:44,440
So it wasn't a problem at that
time.

361
00:19:44,920 --> 00:19:47,040
Nikolay: Yeah, it's not a surprise.

362
00:19:47,040 --> 00:19:47,980
That's why I asked.

363
00:19:47,980 --> 00:19:52,480
Because you talk about local disks,
backups, HA, a lot of stuff,

364
00:19:52,480 --> 00:19:52,980
right?

365
00:19:53,560 --> 00:19:54,140
And basically-

366
00:19:54,140 --> 00:19:57,940
Sugu: Yeah, 5 nines is like what
Vitess shoots for.

367
00:19:58,900 --> 00:20:03,840
And like most big companies that
run Vitess do operate at that

368
00:20:03,840 --> 00:20:04,860
level of availability.

369
00:20:06,100 --> 00:20:09,460
Nikolay: So what's the plan for what
Multigres is going to support?

370
00:20:09,740 --> 00:20:12,360
So only managed version, right?

371
00:20:13,580 --> 00:20:20,240
Sugu: Only... So it would be, yes,
it would be only managed versions

372
00:20:20,240 --> 00:20:25,460
because I believe that the cluster
management section of Vitess

373
00:20:25,640 --> 00:20:29,560
will port directly over to Postgres,
which means that you will,

374
00:20:30,480 --> 00:20:35,520
once it goes live, it will be coming
with batteries included

375
00:20:35,600 --> 00:20:39,500
on cluster management, which should
hopefully be equal to or

376
00:20:39,500 --> 00:20:41,880
better than what is already out
there.

377
00:20:42,660 --> 00:20:47,040
So I don't see a reason why we
should try to make it work with

378
00:20:47,040 --> 00:20:49,700
everything that exists today.

379
00:20:50,220 --> 00:20:53,920
Nikolay: So it means there is no,
like, this is the same as

380
00:20:53,920 --> 00:20:57,180
with Citus, it doesn't work with
RDS on 1 hand, but on the other

381
00:20:57,180 --> 00:21:00,600
hand, we, like, I see it's not
only a sharding solution.

382
00:21:01,160 --> 00:21:02,860
It's everything, which is great.

383
00:21:02,860 --> 00:21:05,020
I mean, it's interesting, super
interesting.

384
00:21:05,020 --> 00:21:08,220
So a lot of problems will be solved.

385
00:21:08,680 --> 00:21:13,160
And I expect even more managed
services will be created.

386
00:21:13,180 --> 00:21:16,320
I don't know how it will continue,
like in terms of Supabase,

387
00:21:16,320 --> 00:21:18,540
because of the very open license
and so on.

388
00:21:18,540 --> 00:21:22,560
But also I expect that many people
will reconsider their

389
00:21:23,000 --> 00:21:24,520
opinion about managed services.

390
00:21:25,160 --> 00:21:26,820
We had an episode about this.

391
00:21:27,260 --> 00:21:30,940
This is my usual opinion about
managed services because they

392
00:21:30,940 --> 00:21:34,540
hide superuser from you, they
don't provide you access, it's

393
00:21:34,540 --> 00:21:35,940
hard to troubleshoot problems.

394
00:21:36,180 --> 00:21:39,180
In this case, if problems are solved
in it, and this gives you

395
00:21:39,180 --> 00:21:43,680
a new way to run Postgres, so if
many problems are solved, it's great.

396
00:21:44,680 --> 00:21:45,280
If you want

397
00:21:45,280 --> 00:21:46,120
Sugu: to do so,

398
00:21:46,160 --> 00:21:46,860
Nikolay: for example.

399
00:21:47,480 --> 00:21:51,800
Sugu: Yeah, you may not know, but the initial focus of Vitess

400
00:21:51,820 --> 00:21:54,780
was actually solving these problems first.

401
00:21:54,840 --> 00:21:59,280
Sharding actually came much later. It was about protecting the database,

402
00:21:59,440 --> 00:22:02,220
making sure that it survives abusive queries.

403
00:22:02,720 --> 00:22:05,180
Basically, that's what we built Vitess for initially.

404
00:22:05,680 --> 00:22:11,520
And the counterpart of taking away power from the user, like

405
00:22:11,520 --> 00:22:15,720
you said, is: 1, well, we now know exactly how to make sure

406
00:22:15,720 --> 00:22:17,380
that the cluster doesn't go down.

407
00:22:17,760 --> 00:22:22,780
And 2, we countered that by building really, really good metrics.

408
00:22:24,440 --> 00:22:29,680
So when there is an outage, you can very quickly zero in on a query.

409
00:22:29,680 --> 00:22:32,900
If a query was responsible, Vitess will have it at the top,

410
00:22:32,900 --> 00:22:35,660
like at the top of the list, saying that this is the query that's

411
00:22:35,680 --> 00:22:36,720
killing your database.

412
00:22:37,500 --> 00:22:41,040
So we built some really, really good metrics, which should

413
00:22:41,040 --> 00:22:44,880
become available in Postgres, probably from day 1.

414
00:22:46,780 --> 00:22:47,580
Nikolay: That's interesting.

415
00:22:47,580 --> 00:22:50,920
I didn't see, maybe I missed it, I didn't see it in the readme you're

416
00:22:50,920 --> 00:22:52,580
writing right now in the project.

417
00:22:53,000 --> 00:22:54,980
Sugu: There's a last section called observability.

418
00:22:56,040 --> 00:22:56,680
Nikolay: I missed that.

419
00:22:56,680 --> 00:22:57,740
I need to revisit.

420
00:22:58,320 --> 00:23:00,800
Nikolay: We're actually building something there as well for

421
00:23:00,800 --> 00:23:01,700
regular Postgres.

422
00:23:01,700 --> 00:23:04,900
I'm very curious, I will definitely revisit this, interesting,

423
00:23:06,340 --> 00:23:07,680
okay.
So yeah, great.

424
00:23:08,100 --> 00:23:08,600
Yeah.

425
00:23:09,440 --> 00:23:13,340
Michael: Also, I feel like this is quite a big difference on

426
00:23:13,340 --> 00:23:16,840
the, at least with Citus in terms of the philosophy or at least

427
00:23:16,840 --> 00:23:18,040
the origin story.

428
00:23:18,480 --> 00:23:24,520
I feel like that started much more with OLAP-focused features

429
00:23:24,580 --> 00:23:30,040
in terms of distributed queries and parallelised across multiple

430
00:23:30,040 --> 00:23:33,900
shards and aggregations and columnar, and loads of things that

431
00:23:33,900 --> 00:23:38,060
really benefit OLAP workloads, whereas this has come from a philosophy

432
00:23:38,160 --> 00:23:42,720
of let's not worry about optimizing for those cross shard queries,

433
00:23:42,720 --> 00:23:47,380
this is much more let's optimize for the single shard very very

434
00:23:47,700 --> 00:23:53,100
short quick OLTP queries and let's make sure we protect it against

435
00:23:53,140 --> 00:23:54,160
abusive queries.

436
00:23:54,600 --> 00:23:57,440
So it feels like it's coming, architecturally, it's coming from

437
00:23:57,440 --> 00:24:00,040
a very different place of what to optimize for first.

438
00:24:01,500 --> 00:24:04,300
Sugu: And historically, that was YouTube's problem.

439
00:24:06,180 --> 00:24:12,540
Surviving the onslaught of a huge number of QPS and making sure

440
00:24:12,540 --> 00:24:16,120
that 1 single query doesn't take the rest of the site down.

441
00:24:16,880 --> 00:24:17,380
Michael: Yeah.

442
00:24:17,940 --> 00:24:18,440
Yeah.

443
00:24:19,200 --> 00:24:19,980
Perfect.
Makes loads of sense.

444
00:24:20,220 --> 00:24:24,500
So actually, before we move on too much from that, what, where

445
00:24:24,500 --> 00:24:27,540
do you see sharding as becoming necessary?

446
00:24:27,660 --> 00:24:32,080
Like, is it just a case of a total number of QPS or like writes

447
00:24:32,080 --> 00:24:33,340
per second type thing?

448
00:24:33,340 --> 00:24:35,740
We've talked about sharding in
the past and talked about kind

449
00:24:35,740 --> 00:24:40,040
of a max that you can scale up
to perhaps in terms of writes,

450
00:24:40,040 --> 00:24:41,200
in terms of WAL.

451
00:24:41,200 --> 00:24:43,940
WAL per second I think was the
metric we ended up discussing.

452
00:24:44,380 --> 00:24:49,740
Are there other reasons or kind
of bottlenecks that you see people

453
00:24:49,740 --> 00:24:53,720
getting to that sharding then kind
of makes sense as it's now

454
00:24:53,720 --> 00:24:56,140
time or you should be considering
at this point?

455
00:24:56,740 --> 00:25:01,240
Sugu: Well, there is a physical
limiting factor which is the single,

456
00:25:01,500 --> 00:25:06,300
if you max out your single machine,
that is, your Postgres server,

457
00:25:06,820 --> 00:25:09,180
then that's the end of your scale.

458
00:25:09,220 --> 00:25:11,600
There is nothing more to do beyond
that.

459
00:25:11,660 --> 00:25:14,620
And there are a lot of people already
hitting those limits from

460
00:25:14,620 --> 00:25:15,540
what I hear.

461
00:25:16,020 --> 00:25:19,740
And the sad part of it is they
probably don't realize it.

462
00:25:19,840 --> 00:25:24,060
As soon as that limit is hit, in
order to protect the database,

463
00:25:24,640 --> 00:25:29,160
they actually push back on engineering
features indirectly, saying

464
00:25:29,160 --> 00:25:31,500
that, you know, this data, can
you make it smaller?

465
00:25:31,920 --> 00:25:33,680
Can you somehow lower the QPS?

466
00:25:33,820 --> 00:25:35,100
Or could you put it elsewhere?

467
00:25:36,580 --> 00:25:39,060
Nikolay: Let's stop showing this
number on the front page.

468
00:25:40,440 --> 00:25:41,260
Sugu: Yeah, yeah.

469
00:25:41,760 --> 00:25:43,440
And it affects the entire organization.

470
00:25:43,860 --> 00:25:46,420
It's a very small, it's a very
subtle change.

471
00:25:46,760 --> 00:25:48,980
But the entire organization slows
down.

472
00:25:49,340 --> 00:25:54,220
Like we experienced that at YouTube
when we were at our limits.

473
00:25:54,660 --> 00:25:58,300
Like, the default answer from
a DBA was always no.

474
00:25:58,920 --> 00:26:00,260
We used to even kid:

475
00:26:00,280 --> 00:26:01,180
The answer is no.

476
00:26:01,180 --> 00:26:02,180
What's your question?

477
00:26:03,800 --> 00:26:09,860
And when we started sharding, it
took us a while to change our

478
00:26:09,860 --> 00:26:13,440
answer to say that, you know, bring
your data, I like it, we can

479
00:26:13,440 --> 00:26:18,220
scale as much as you want. Believe
it or not, we went from 16 

480
00:26:18,480 --> 00:26:22,000
shards to 256 in no time.

481
00:26:22,360 --> 00:26:25,560
And the number of features in YouTube
exploded during that time

482
00:26:25,560 --> 00:26:29,540
because there was just no restriction
on how much data you wanted

483
00:26:29,540 --> 00:26:30,200
to put.

484
00:26:31,020 --> 00:26:35,520
And coming back here, the upper,
like reaching the limit of a

485
00:26:35,520 --> 00:26:38,820
machine is actually something you
should never do.

486
00:26:39,280 --> 00:26:42,280
It's very unhealthy for a large
number of reasons, like even

487
00:26:42,280 --> 00:26:45,700
if there is a crash, like how long
is it going to take to recover?

488
00:26:46,500 --> 00:26:50,440
Like the thing that we found out
is once you can shard, it actually

489
00:26:50,440 --> 00:26:53,620
makes sense to keep your instances
way, way small.

490
00:26:54,440 --> 00:27:00,960
So we used to run like 20 to 50
instances of MySQL per machine.

491
00:27:01,920 --> 00:27:05,500
And that was a lot healthier than
running big ones.

492
00:27:06,340 --> 00:27:07,660
For a couple of reasons.

493
00:27:07,700 --> 00:27:12,040
One is, if you try to run so many
threads within a process, that

494
00:27:12,040 --> 00:27:14,040
itself is a huge overhead for the
machine.

495
00:27:14,280 --> 00:27:17,640
And it doesn't do that very efficiently,
whereas it does it better

496
00:27:17,640 --> 00:27:21,180
if you run it as smaller instances,
I think.

497
00:27:21,820 --> 00:27:24,880
It's more of a feeling, but I don't
know if there is proof or

498
00:27:24,880 --> 00:27:25,380
whatever.

499
00:27:25,520 --> 00:27:28,220
But like Go, for example, wouldn't
do well.

500
00:27:28,260 --> 00:27:32,120
Go, I think, beyond a certain memory
size or beyond a certain

501
00:27:32,120 --> 00:27:36,560
number of goroutines would start
to slow down, would not be

502
00:27:36,560 --> 00:27:38,200
as efficient as it was before.

503
00:27:38,420 --> 00:27:42,260
Mainly because the data structures
to keep track of those threads

504
00:27:42,260 --> 00:27:44,880
and stuff, they are growing bigger.

505
00:27:45,720 --> 00:27:52,040
But more importantly, on an outage,
a smaller number of users

506
00:27:52,040 --> 00:27:52,740
are affected.

507
00:27:54,960 --> 00:28:00,920
If you have 256 shards and 1 shard
goes down, it is 1/256th of

508
00:28:00,920 --> 00:28:02,340
the outage, right?

509
00:28:02,620 --> 00:28:06,600
And so the site looks a lot healthier,
behaves a lot healthier.

510
00:28:06,940 --> 00:28:09,220
There's less panic if a shard goes
down.

511
00:28:09,760 --> 00:28:13,560
So people are, you know, a lot
less stressed managing such instances.

512
00:28:15,840 --> 00:28:20,560
Nikolay: Right, I wanted to mention
that this discussion was

513
00:28:20,740 --> 00:28:24,840
with Lev Kokotov, PgDog, which
is a competitor as well, a new

514
00:28:24,840 --> 00:28:27,380
sharding tool written in Rust.

515
00:28:27,740 --> 00:28:32,440
And we discussed that there is
a big limitation when Postgres...

516
00:28:33,940 --> 00:28:38,520
So replication, physical replication
has a limitation because it's

517
00:28:38,520 --> 00:28:41,100
a single-threaded process on the standby.

518
00:28:41,280 --> 00:28:47,160
If we reach like somewhat like
150, 200, 250 megabytes per second,

519
00:28:47,160 --> 00:28:52,320
depending on core and also number
of, not number, structure of

520
00:28:52,700 --> 00:28:54,100
tuples and so on.

521
00:28:54,520 --> 00:29:00,860
We hit 1 single CPU, 100%, 1 process,
and it becomes a bottleneck

522
00:29:01,060 --> 00:29:04,040
and replica standbys, they start
lagging.

523
00:29:04,640 --> 00:29:07,640
It's a big nightmare because you
usually by that time, that's

524
00:29:07,640 --> 00:29:11,400
like at high scale you have multiple
replicas and you will offload

525
00:29:11,400 --> 00:29:13,360
a lot of read-only queries there.

526
00:29:13,840 --> 00:29:16,860
And then you don't want that, you don't
know what to do except, as you

527
00:29:16,860 --> 00:29:19,620
described, let's remove this feature
and slow down development

528
00:29:19,700 --> 00:29:21,880
and this is not fun at all.

529
00:29:22,360 --> 00:29:26,240
So what I'm trying to do here is
trying to move us to a discussion

530
00:29:26,320 --> 00:29:28,680
of replication, not physical but
logical.

531
00:29:28,940 --> 00:29:34,320
I noticed your plans heavily involve
logical replication in Postgres.

532
00:29:34,600 --> 00:29:37,700
But we know it has, it's improving
every year.

533
00:29:37,700 --> 00:29:41,140
So like when we started the discussion
5, 6 years ago, it was

534
00:29:41,140 --> 00:29:41,920
much worse.

535
00:29:42,440 --> 00:29:43,760
Right now it's much better.

536
00:29:43,780 --> 00:29:47,880
Many things are solved, improved,
but many things still are not

537
00:29:47,920 --> 00:29:48,380
solved.

538
00:29:48,380 --> 00:29:52,960
For example, schema changes are
not replicated, right?

539
00:29:52,960 --> 00:29:57,180
And sequences, there is work in
progress, but if it's committed,

540
00:29:57,180 --> 00:29:59,760
it will be only in Postgres 19,
not in 18.

541
00:29:59,760 --> 00:30:03,260
So it means, like, a long wait for
many people.

542
00:30:03,580 --> 00:30:05,640
So what are your plans here?

543
00:30:05,640 --> 00:30:09,300
Are you ready to deal with problems
like this?

544
00:30:10,080 --> 00:30:12,040
Pure Postgres problems, you know?

545
00:30:12,980 --> 00:30:13,680
Sugu: Yeah, yeah.

546
00:30:15,040 --> 00:30:18,520
If you ask me, I think the Postgres
problems are fewer than what

547
00:30:18,520 --> 00:30:19,740
we faced with MySQL.

548
00:30:20,740 --> 00:30:24,840
Nikolay: I wanted to involve physical
as well, because of this great

549
00:30:24,840 --> 00:30:30,040
talk by Kukushkin, which describes
very bad anomalies when data

550
00:30:30,040 --> 00:30:31,660
loss happens and so on.

551
00:30:32,380 --> 00:30:33,680
Let's talk about this.

552
00:30:34,640 --> 00:30:36,400
Sugu: Yeah, we should talk about
both.

553
00:30:36,400 --> 00:30:40,920
I think overall the Postgres design
is cleaner, is what I would

554
00:30:40,920 --> 00:30:41,420
say.

555
00:30:41,840 --> 00:30:44,840
Like you can feel that from things.

556
00:30:45,060 --> 00:30:49,920
Like the design somewhat supersedes
performance, which I think

557
00:30:49,920 --> 00:30:53,560
in my case is a good trade-off,
especially for sharded solutions,

558
00:30:53,560 --> 00:30:56,840
because some of these design decisions
affect you only if you

559
00:30:56,840 --> 00:30:58,380
are running at, you know,

560
00:30:58,660 --> 00:30:58,820
Nikolay: if

561
00:30:58,820 --> 00:31:01,360
Sugu: you're pushing it really,
really hard, then these design

562
00:31:01,520 --> 00:31:02,660
decisions affect you.

563
00:31:02,660 --> 00:31:06,660
But if your instances are small
to medium size you won't even

564
00:31:06,660 --> 00:31:09,740
know and then you benefit from
the fact that these designs are

565
00:31:09,740 --> 00:31:10,240
good.

566
00:31:10,840 --> 00:31:15,920
So I actually like the approaches
that Postgres has taken with

567
00:31:15,920 --> 00:31:19,100
respect to the WAL as well as
logical replication.

568
00:31:19,640 --> 00:31:23,240
And by the way, I think logical
replication theoretically can

569
00:31:23,260 --> 00:31:26,360
do better things than what it does
now, and we should push those

570
00:31:26,360 --> 00:31:26,860
limits.

571
00:31:27,700 --> 00:31:33,220
But yes, I think the issue about
schema not being part of

572
00:31:33,220 --> 00:31:37,680
logical replication, it feels like
that is also a theoretically

573
00:31:37,680 --> 00:31:40,920
solvable problem except that people
haven't gotten to it.

574
00:31:40,920 --> 00:31:45,040
I think there are issues about
the transactionality of DDLs which

575
00:31:45,040 --> 00:31:46,420
doesn't even exist in MySQL.

576
00:31:46,480 --> 00:31:50,640
So at least in Postgres, it exists
in most cases.

577
00:31:50,640 --> 00:31:52,940
There are only a few cases where
it is not.

578
00:31:53,680 --> 00:31:57,840
And for such things like a proxy
layer, like Multigres or Vitess,

579
00:31:58,180 --> 00:32:01,440
it's no problem for them because
you just say, oh, OK, this

580
00:32:01,440 --> 00:32:05,740
particular construct is not transactionally
safe, well, then

581
00:32:05,740 --> 00:32:08,680
we'll even prevent you from doing
it transactionally because

582
00:32:08,680 --> 00:32:10,740
we don't want you to get the wrong
impression.

583
00:32:11,180 --> 00:32:14,860
We'll let you do it non-transactionally,
and we know that it's

584
00:32:14,860 --> 00:32:18,140
non-transactional, and therefore,
we can do something about it,

585
00:32:18,140 --> 00:32:18,640
right?

586
00:32:19,020 --> 00:32:23,000
Those abilities didn't exist previously.
But eventually, if it

587
00:32:23,000 --> 00:32:26,020
becomes transactional, then we
can actually include it in a transaction.

588
00:32:26,880 --> 00:32:31,020
Nikolay: Yeah, just for those who
are curious, because there

589
00:32:31,020 --> 00:32:34,520
is a concept, all DDL in Postgres
is transactional.

590
00:32:34,540 --> 00:32:37,200
Here we talk about things like
CREATE INDEX CONCURRENTLY, because

591
00:32:37,200 --> 00:32:40,820
we had a discussion offline about
this before recording.

592
00:32:41,320 --> 00:32:46,580
So yeah, CREATE INDEX CONCURRENTLY
can be an issue, but you

593
00:32:46,580 --> 00:32:48,520
obviously have a solution for it.

594
00:32:48,520 --> 00:32:49,340
That's great.

595
00:32:49,540 --> 00:32:52,280
Sugu: The way I would say it is
we have dealt with much worse

596
00:32:52,280 --> 00:32:53,740
with MySQL.

597
00:32:53,960 --> 00:32:57,740
So this is much better than what was there then.

598
00:32:58,260 --> 00:32:58,940
Sounds good.

599
00:32:58,940 --> 00:32:59,440
Yeah.

600
00:33:00,060 --> 00:33:00,960
Nikolay: Okay, good.

601
00:33:01,280 --> 00:33:05,160
And let's talk about physical replication because I saw you are

602
00:33:05,160 --> 00:33:06,440
going to use it.

603
00:33:06,700 --> 00:33:09,440
So each shard is going to have a standby.

604
00:33:10,580 --> 00:33:14,340
And yeah, so with quorum commit, right?

605
00:33:14,340 --> 00:33:18,280
So like we don't lose data because data is on local disks by

606
00:33:18,280 --> 00:33:21,700
default, as I understand, like ephemeral storage in the cloud.

607
00:33:21,880 --> 00:33:25,360
So we want to be sure that data is written to at least 2 places,

608
00:33:25,360 --> 00:33:26,900
for example, or 3, right?

609
00:33:27,160 --> 00:33:28,420
Configurable, of course.

610
00:33:29,340 --> 00:33:33,400
Here, this interesting talk by Kukushkin, he presented it recently

611
00:33:33,580 --> 00:33:39,160
at an online conference by Microsoft, describing that synchronous

612
00:33:39,240 --> 00:33:41,760
replication in Postgres is not what you think.

613
00:33:42,260 --> 00:33:42,760
Sugu: Correct.

614
00:33:42,880 --> 00:33:43,600
So, right.

615
00:33:44,280 --> 00:33:44,780
Correct.

616
00:33:45,040 --> 00:33:46,580
Nikolay: What are you going to do about this?

617
00:33:47,240 --> 00:33:51,980
Sugu: Well, I was just chatting with someone and essentially

618
00:33:52,500 --> 00:33:57,040
synchronous replication is theoretically impure when it comes

619
00:33:57,040 --> 00:33:57,840
to consensus.

620
00:33:58,460 --> 00:34:01,920
I think it's provable that if you use synchronous replication

621
00:34:02,680 --> 00:34:05,940
then you will hit corner cases that you can't handle.

622
00:34:06,500 --> 00:34:11,260
And the most egregious situation is that it can definitely lead to some

623
00:34:11,260 --> 00:34:15,600
level of split brain, but in some cases it can even

624
00:34:15,600 --> 00:34:17,620
lead to downstream issues.

625
00:34:17,800 --> 00:34:20,780
Because it's a leaky abstraction, it's a leaky implementation,

626
00:34:21,680 --> 00:34:26,980
there are situations where you can see a transaction and think

627
00:34:26,980 --> 00:34:28,060
that it is committed.

628
00:34:28,580 --> 00:34:31,740
And later, the system may fail.

629
00:34:31,840 --> 00:34:35,580
And in the recovery, may choose not to propagate that transaction,

630
00:34:35,740 --> 00:34:39,240
or may not be able to, and it's going to discard that transaction

631
00:34:39,240 --> 00:34:40,140
and move forward.

632
00:34:40,760 --> 00:34:44,020
Nikolay: But this is the same as with asynchronous replication,

633
00:34:44,020 --> 00:34:46,400
it's the same, we're just losing some data, right?

634
00:34:46,400 --> 00:34:49,060
Sugu: Exactly, yeah, it is the same as asynchronous replication,

635
00:34:49,280 --> 00:34:49,780
Yes.

636
00:34:50,140 --> 00:34:52,760
Nikolay: It's not split-brain, it's just data loss.

637
00:34:53,080 --> 00:34:54,440
Sugu: It's data loss, correct.

638
00:34:54,520 --> 00:34:58,440
It's data loss, but for example, if you are running like a logical

639
00:34:58,440 --> 00:35:03,980
replication off of 1 of those, then that logical replication

640
00:35:04,120 --> 00:35:08,460
may actually propagate it into an external system and now you

641
00:35:08,460 --> 00:35:13,420
have corrupted downstream systems that don't match the source.

642
00:35:14,020 --> 00:35:18,340
So those risks exist, and at Vitess scale people see this all the

643
00:35:18,340 --> 00:35:22,700
time, for example, and they have to build defenses against this

644
00:35:23,000 --> 00:35:24,560
and it's very, very painful.

645
00:35:24,840 --> 00:35:29,340
It's not impossible, but it's very hard to reason about failures

646
00:35:29,440 --> 00:35:31,820
when a system is behaving like this.

647
00:35:32,100 --> 00:35:36,360
So that is the problem with synchronous
replication.

648
00:35:37,200 --> 00:35:41,760
And this is the reason why I feel
like it may be worth patching

649
00:35:41,880 --> 00:35:42,380
Postgres.

650
00:35:42,740 --> 00:35:46,680
Because there is no existing primitive
in Postgres on which you

651
00:35:46,680 --> 00:35:48,720
can build a clean consensus system.

652
00:35:49,020 --> 00:35:51,980
I feel like that primitive should
be in Postgres.

653
00:35:52,700 --> 00:35:55,320
Nikolay: I now remember from Kukushkin's
talk, there is another

654
00:35:55,320 --> 00:35:59,880
case when a primary transaction
looks like it's not committed because

655
00:36:00,160 --> 00:36:03,840
we wait for a replica, but the replica
somehow is like lost connection

656
00:36:03,840 --> 00:36:04,500
or something.

657
00:36:04,540 --> 00:36:08,460
And then we suddenly, and the client
thinks it's not committed because

658
00:36:08,560 --> 00:36:09,740
commit was not returned.

659
00:36:10,120 --> 00:36:11,760
But then it suddenly looks committed.

660
00:36:11,760 --> 00:36:14,120
It's like not data loss, it's data
and loss

661
00:36:14,540 --> 00:36:15,040
Sugu: somehow.

662
00:36:15,080 --> 00:36:15,580
Boom.

663
00:36:16,700 --> 00:36:20,260
Nikolay: Like suddenly, and this
is not all right as well.

664
00:36:21,180 --> 00:36:26,000
And when you think about consensus,
I think you are very good

665
00:36:26,000 --> 00:36:29,540
at describing these things, like concepts
in distributed systems.

666
00:36:30,180 --> 00:36:34,400
It feels like if you have 2 places
to write, definitely there

667
00:36:34,400 --> 00:36:37,940
will be corner cases where something
will go wrong if you don't

668
00:36:37,940 --> 00:36:39,840
use two-phase commit, right?

669
00:36:39,880 --> 00:36:40,380
Sugu: Correct.

670
00:36:40,680 --> 00:36:42,040
Nikolay: And here we have this.

671
00:36:42,340 --> 00:36:45,160
But when you say you're going to
bring something with consensus,

672
00:36:45,160 --> 00:36:49,540
it immediately triggers my memory
how difficult it is and how

673
00:36:49,540 --> 00:36:56,580
many attempts were made to bring
pure HA to Postgres, just

674
00:36:56,580 --> 00:36:58,000
to have auto-failover.

675
00:36:58,660 --> 00:36:59,780
All of them failed.

676
00:37:00,040 --> 00:37:00,860
All of them.

677
00:37:01,080 --> 00:37:04,420
And they all say, let's stay outside
of Postgres.

678
00:37:05,160 --> 00:37:10,440
So here maybe it will be similar
complexity to bring these 2

679
00:37:10,440 --> 00:37:11,420
inside Postgres.

680
00:37:11,760 --> 00:37:14,680
Is it possible to build it outside
this thing?

681
00:37:15,060 --> 00:37:16,980
Sugu: It is not possible to build
it outside.

682
00:37:17,720 --> 00:37:20,780
Because if it was, that is what
I would have proposed.

683
00:37:21,580 --> 00:37:25,680
The reason is because building
it outside is like putting band-aid

684
00:37:25,680 --> 00:37:26,520
over the problem.

685
00:37:26,520 --> 00:37:28,480
It will not solve the core problem.

686
00:37:28,740 --> 00:37:34,900
The core problem is you've committed
data in 1 place, and if

687
00:37:34,900 --> 00:37:39,760
that data can be lost, and there
is a gap when the data can be

688
00:37:39,760 --> 00:37:44,760
read by someone — that is the root cause of the problem.

689
00:37:44,760 --> 00:37:45,640
That is unsolvable.

690
00:37:45,800 --> 00:37:50,780
Even if you recover later, Raft may
choose to honor that transaction

691
00:37:50,860 --> 00:37:51,600
or not.

692
00:37:51,900 --> 00:37:55,440
And that becomes ambiguous, but
we don't want ambiguity, right?

693
00:37:55,440 --> 00:37:58,840
Nikolay: What if we created some
extension to commit, like

694
00:37:58,840 --> 00:38:02,880
made it extendable to talk to some
external system, to understand

695
00:38:02,940 --> 00:38:04,920
whether it can be finalized or something?

696
00:38:04,920 --> 00:38:05,940
I don't know, consensus.

697
00:38:06,280 --> 00:38:06,740
Sugu: Correct, correct.

698
00:38:06,740 --> 00:38:10,760
So essentially, if you
reason through this, your

699
00:38:10,760 --> 00:38:12,940
answer will become a two-phase
system.

700
00:38:13,380 --> 00:38:13,880
Nikolay: Yeah.

701
00:38:14,540 --> 00:38:15,700
Sugu: Without a two-phase commit.

702
00:38:15,700 --> 00:38:16,660
Nikolay: Which scares me.

703
00:38:17,500 --> 00:38:23,300
But as I told you, a two-phase
commit in the Postgres OLTP world

704
00:38:23,300 --> 00:38:26,480
is considered really, really slow
and the rule is let's just

705
00:38:26,480 --> 00:38:27,340
avoid it.

706
00:38:27,880 --> 00:38:32,280
I see your enthusiasm and I think,
I couldn't find good benchmarks,

707
00:38:32,520 --> 00:38:33,980
zero published.

708
00:38:34,700 --> 00:38:36,500
Sugu: This is not two-phase commit,
by the way.

709
00:38:36,500 --> 00:38:38,100
This is two-phase synchronization.

710
00:38:38,540 --> 00:38:39,140
Nikolay: I understand.

711
00:38:39,140 --> 00:38:43,340
It's not — in two-phase commit,
more communication happens.

712
00:38:43,680 --> 00:38:44,640
I understand this.

713
00:38:45,060 --> 00:38:49,640
Sugu: So two-phase synchronization,
the network overhead is exactly

714
00:38:49,640 --> 00:38:53,680
the same as full sync, because
the transaction completes on the

715
00:38:53,680 --> 00:38:54,460
first sync.

716
00:38:54,960 --> 00:38:58,260
Later it sends an acknowledgement
saying that yes, I'm happy,

717
00:38:58,260 --> 00:39:01,320
you can commit it, but the transaction
completes on the first

718
00:39:01,320 --> 00:39:04,260
sync, so it will be no worse than
full sync.
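
A minimal sketch of the two-phase synchronization flow Sugu describes, using hypothetical Go types (walRecord, follower) rather than anything from Multigres: the commit path blocks on exactly one synchronous append — the same network cost as full sync — and the "you can commit it" acknowledgement goes out asynchronously afterwards.

```go
package main

import "fmt"

type walRecord struct{ lsn uint64 }

// follower is a hypothetical stand-in for a synchronous replica.
type follower interface {
	// Append ships the record and blocks until it is durable remotely.
	Append(rec walRecord) error
	// Finalize tells the follower the record may be committed; it is
	// sent asynchronously, so it adds no latency to the commit path.
	Finalize(lsn uint64)
}

// commit acknowledges the client after a single synchronous round
// trip — phase two happens off the commit path.
func commit(rec walRecord, f follower) error {
	if err := f.Append(rec); err != nil { // phase 1: same cost as full sync
		return err
	}
	go f.Finalize(rec.lsn) // phase 2: fire-and-forget
	return nil
}

// loopback is a trivial in-process follower for demonstration.
type loopback struct{}

func (loopback) Append(walRecord) error { return nil }
func (loopback) Finalize(uint64)        {}

func main() {
	if err := commit(walRecord{lsn: 42}, loopback{}); err == nil {
		fmt.Println("commit acknowledged after one sync round trip")
	}
}
```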

719
00:39:05,740 --> 00:39:09,520
Nikolay: Yeah, compared to current
situation when primary commit

720
00:39:09,520 --> 00:39:14,140
happens, but there is a lock which
is being held until- Correct,

721
00:39:14,140 --> 00:39:15,420
Sugu: it is the same cost.

722
00:39:15,900 --> 00:39:16,880
Nikolay: Yeah, it is the same cost.

723
00:39:16,880 --> 00:39:20,580
We wait until the standby. And for the
user it looks like, when the lock is

724
00:39:20,580 --> 00:39:22,360
released, it thinks, okay, commit
happens.

725
00:39:22,360 --> 00:39:26,540
But the problem with this design,
if, for example, standby restarts,

726
00:39:26,920 --> 00:39:30,040
the lock is automatically released
and the commit is there, and it's unexpected.

727
00:39:30,520 --> 00:39:32,420
This is data un-loss, right?

728
00:39:32,420 --> 00:39:37,220
So you are saying we can redesign
this, network cost will be

729
00:39:37,220 --> 00:39:40,360
the same, but it will be pure.

730
00:39:40,520 --> 00:39:42,660
Yeah, that's great, I like this.

731
00:39:42,780 --> 00:39:45,360
I'm just thinking, will it be acceptable?

732
00:39:45,700 --> 00:39:48,540
Because bringing autofailover
is not acceptable.

733
00:39:48,680 --> 00:39:52,660
There was another attempt last
year from someone and with great

734
00:39:52,660 --> 00:39:55,460
enthusiasm, let's bring autofailover
inside Postgres.

735
00:39:55,680 --> 00:39:59,060
Actually, maybe you know this guy,
it was Konstantin Osipov who

736
00:39:59,060 --> 00:40:01,160
built a Tarantool database system.

737
00:40:01,160 --> 00:40:02,220
It's like in-memory.

738
00:40:03,060 --> 00:40:08,000
He was ex-MySQL, in performance, after
Zaitsev.

739
00:40:08,000 --> 00:40:11,140
Zaitsev was ex-MySQL, then Osipov
was ex-MySQL.

740
00:40:11,260 --> 00:40:16,660
And so Konstantin came to Postgres
saying let's build this.

741
00:40:16,680 --> 00:40:21,300
Great enthusiasm, but it's extremely
hard to get such a big

742
00:40:21,300 --> 00:40:23,360
thing accepted into core.

743
00:40:23,800 --> 00:40:27,160
So if you say it's not a big thing,
this already...

744
00:40:27,800 --> 00:40:33,340
Sugu: So, I'll probably
have to explain it in a bigger

745
00:40:33,340 --> 00:40:38,300
blog, but essentially, now that
I've studied the problem well

746
00:40:38,300 --> 00:40:43,100
enough, the reason why it's hard to implement consensus in Postgres

747
00:40:43,480 --> 00:40:48,000
with the write-ahead log is because they are trying to make Raft work with

748
00:40:48,000 --> 00:40:48,500
the write-ahead log.

749
00:40:49,120 --> 00:40:55,200
And there are limitations about how commits work

750
00:40:55,200 --> 00:40:59,820
within Postgres that mismatch with how Raft wants commits to

751
00:40:59,820 --> 00:41:00,560
be processed.

752
00:41:01,580 --> 00:41:07,400
And that mismatch, so far, I have not found a way to work around

753
00:41:07,400 --> 00:41:07,900
that.

754
00:41:08,520 --> 00:41:14,640
But a variation of Raft can be made to work.

755
00:41:15,940 --> 00:41:16,440
Nikolay: Interesting.

756
00:41:17,220 --> 00:41:20,860
Sugu: The way the, I don't know if you know about my blog series

757
00:41:20,860 --> 00:41:22,700
that I wrote when I was at PlanetScale.

758
00:41:23,100 --> 00:41:25,820
It's an 8 part blog series about generalized consensus.

759
00:41:26,380 --> 00:41:30,360
People think that Raft is the only way to do consensus, but it

760
00:41:30,360 --> 00:41:32,540
is 1 of a thousand ways to do consensus.

761
00:41:33,080 --> 00:41:36,460
So that blog series explains the rules you must follow if you

762
00:41:36,460 --> 00:41:37,900
have to build a consensus system.

763
00:41:37,900 --> 00:41:41,680
And if you follow those rules, you will get all the properties

764
00:41:41,780 --> 00:41:43,760
that are required by a consensus system.

765
00:41:44,380 --> 00:41:48,840
So this design that I have in mind follows

766
00:41:48,840 --> 00:41:53,240
those rules, and I am able to prove to myself that it will work,

767
00:41:53,480 --> 00:41:54,640
but it's not Raft.

768
00:41:54,960 --> 00:41:56,780
It's going to be similar to Raft.

769
00:41:57,240 --> 00:42:01,400
I think we can make Raft also work, but that may require changes

770
00:42:01,400 --> 00:42:03,940
to the write-ahead log, which I don't want to do.

771
00:42:04,020 --> 00:42:07,980
So this system I want to implement without changes to the write-ahead log

772
00:42:08,500 --> 00:42:10,320
as possibly a plugin.

773
00:42:11,820 --> 00:42:12,320
Nikolay: Yeah.

774
00:42:12,720 --> 00:42:16,360
Well, now I understand — like, another reason you

775
00:42:16,360 --> 00:42:20,580
cannot take Patroni: not only because it's Python versus Postgres,

776
00:42:20,580 --> 00:42:23,860
but also because you need another version of the consensus algorithm.

777
00:42:24,960 --> 00:42:25,840
Sugu: Correct, correct.

778
00:42:25,840 --> 00:42:30,480
Nikolay: And among those hundreds, thousands, millions of ways.

779
00:42:30,480 --> 00:42:34,920
Sugu: By the way Patroni can take this and use it because it's

780
00:42:34,920 --> 00:42:36,840
very close to how full sync works.

781
00:42:37,700 --> 00:42:38,200
Nikolay: Good.

782
00:42:38,360 --> 00:42:38,860
Okay.

783
00:42:38,860 --> 00:42:41,900
Michael: I was just thinking, watching Alexander Kukushkin's

784
00:42:42,180 --> 00:42:44,840
talk, he said a couple of things that were interesting.

785
00:42:44,870 --> 00:42:47,220
1 is that he was surprised that this hasn't happened upstream.

786
00:42:47,360 --> 00:42:52,440
So you definitely have an ally in Kukushkin in terms of trying

787
00:42:52,440 --> 00:42:57,480
to get this upstreamed, but also that he thinks every cloud provider

788
00:42:57,980 --> 00:43:02,740
has had to patch Postgres in order to offer their own high

789
00:43:02,740 --> 00:43:07,900
availability products with Postgres, each 1 has had to patch

790
00:43:07,900 --> 00:43:08,320
it.

791
00:43:08,320 --> 00:43:12,280
And they are having to, you mentioned earlier today how painful

792
00:43:12,280 --> 00:43:16,080
it is to maintain even a small patch on something.

793
00:43:16,080 --> 00:43:20,140
Nikolay: I don't think it's every, I think it's Microsoft for

794
00:43:20,140 --> 00:43:22,800
sure, knowing where Kukushkin works.

795
00:43:22,800 --> 00:43:23,300
Yeah.

796
00:43:23,300 --> 00:43:25,040
But maybe more, not every.

797
00:43:25,040 --> 00:43:25,540
Yeah, 

798
00:43:25,960 --> 00:43:29,720
Michael: all I mean is that there
are a growing number of committers

799
00:43:29,780 --> 00:43:32,920
working for hyperscalers and hosting
providers.

800
00:43:33,100 --> 00:43:38,340
So I suspect you might have more
optimism for consensus

801
00:43:38,420 --> 00:43:41,720
or at least a few allies in terms
of getting something committed

802
00:43:41,720 --> 00:43:45,860
upstream. So I personally think
there might be a growing chance

803
00:43:45,860 --> 00:43:48,820
of this happening even if it hasn't
in the past for some reason.

804
00:43:49,300 --> 00:43:53,520
Sugu: Yeah, I feel like also being
new to the Postgres community,

805
00:43:53,520 --> 00:43:58,520
I am feeling a little you know,
shy about proposing this upfront.

806
00:43:58,520 --> 00:44:03,200
So what I am thinking of doing
is at least show it working, show

807
00:44:03,200 --> 00:44:07,080
it working at scale, have people
gain confidence that this is

808
00:44:07,080 --> 00:44:11,140
actually efficient and performant
and safe.

809
00:44:11,980 --> 00:44:16,100
So I also plan to, I don't know
if you've heard of FlexPaxos,

810
00:44:16,720 --> 00:44:21,440
which is actually, in my opinion,
a better way to handle durability,

811
00:44:22,200 --> 00:44:25,900
because today's cloud environments
are a lot more complex, and

812
00:44:25,900 --> 00:44:29,360
a simple majority-based quorum
is actually very hard to configure

813
00:44:29,800 --> 00:44:34,340
if your needs are different, which
actually FlexPaxos does handle.

814
00:44:34,740 --> 00:44:38,660
It's actually something I'm a co-inventor
of, of some sort.

815
00:44:39,920 --> 00:44:41,540
And this blog post...

816
00:44:42,380 --> 00:44:43,740
Nikolay: I only heard the name.

817
00:44:43,740 --> 00:44:44,440
That's it.

818
00:44:44,440 --> 00:44:45,580
So I like...

819
00:44:46,580 --> 00:44:48,720
Yeah.
Can you explain a little bit?

820
00:44:48,720 --> 00:44:49,720
Not super difficult.

821
00:44:49,720 --> 00:44:50,180
Sugu: Oh sure.

822
00:44:50,180 --> 00:44:50,500
Yeah.

823
00:44:50,500 --> 00:44:51,200
So the...

824
00:44:51,580 --> 00:44:54,480
Actually, let me explain the reason why.

825
00:44:54,480 --> 00:44:59,420
So FlexPaxos was published a few
years ago, about 7 years ago

826
00:44:59,420 --> 00:45:00,120
or so.

827
00:45:00,260 --> 00:45:05,000
And you'll see my name mentioned
there, which I feel very proud

828
00:45:05,000 --> 00:45:05,500
of.

829
00:45:06,100 --> 00:45:09,960
And the blog series that I wrote
is actually a refinement of

830
00:45:09,960 --> 00:45:10,460
FlexPaxos.

831
00:45:11,420 --> 00:45:15,180
And that actually explains better
why these things are important.

832
00:45:16,080 --> 00:45:19,800
The reason why it's important is
because people think of consensus

833
00:45:20,020 --> 00:45:23,680
as either a bunch of nodes agreeing
on a value, right?

834
00:45:23,680 --> 00:45:25,360
That's what you commonly hear.

835
00:45:25,640 --> 00:45:31,220
Or you think of like reaching majority,
reaching quorum is important.

836
00:45:31,560 --> 00:45:34,840
But the true reason for consensus
is just durability.

837
00:45:37,660 --> 00:45:41,620
When you ask for a commit and the
system says, yes, I have it,

838
00:45:41,900 --> 00:45:43,940
you don't want the system to lose
it.

839
00:45:45,660 --> 00:45:51,180
So instead of defining quorum and
all those things, define the

840
00:45:51,180 --> 00:45:55,400
problem as it is and solve it the
way it was asked: how

841
00:45:55,400 --> 00:45:58,860
do you solve the problem of durability
in a transactional system?

842
00:46:00,060 --> 00:46:03,980
And the simple answer to that is,
make sure your data is elsewhere.

843
00:46:07,120 --> 00:46:10,060
Michael: Yeah, I love how simple
you make it.

844
00:46:10,160 --> 00:46:12,540
Sugu: Yeah, if you make sure your
data is elsewhere, if there

845
00:46:12,540 --> 00:46:16,040
is a failure, your challenge is to find out where the data is

846
00:46:16,040 --> 00:46:18,300
and continue from where it left off.

847
00:46:19,220 --> 00:46:21,940
And that is all that consensus is about.

848
00:46:22,540 --> 00:46:25,960
And then all you have to do is have rules to make sure that these

849
00:46:25,960 --> 00:46:27,540
properties are preserved.

850
00:46:28,820 --> 00:46:31,220
And Raft is only just 1 way to do this.

851
00:46:32,240 --> 00:46:35,280
So if you look at this problem, if you approach this problem

852
00:46:35,280 --> 00:46:39,620
this way, you could ask for something like, I just want my data

853
00:46:39,620 --> 00:46:42,340
to go across availability zones.

854
00:46:42,340 --> 00:46:45,640
As long as it's in a different availability zone, I'm happy.

855
00:46:46,080 --> 00:46:50,800
Or you can say, I want the data to be across regions, or I want

856
00:46:50,800 --> 00:46:53,480
at least 2 other nodes to have it, right?

857
00:46:53,480 --> 00:46:55,360
So that's your durability requirement.

858
00:46:56,240 --> 00:46:59,900
But you could say, I want 2 other nodes to have it, but I want

859
00:46:59,900 --> 00:47:03,760
to run 7 nodes in the system, or 20 nodes.

860
00:47:04,940 --> 00:47:08,840
It sounds outrageous but it is actually very practical.

861
00:47:08,940 --> 00:47:17,800
At YouTube we had 70 replicas, but the data only had

862
00:47:17,800 --> 00:47:22,480
to be in 1 other node for it to be durable and we were able to

863
00:47:22,480 --> 00:47:23,600
run this at scale.

864
00:47:24,320 --> 00:47:27,860
The trade-off is that when you do a failover you have a wild

865
00:47:27,860 --> 00:47:32,360
goose chase looking for the transaction that went elsewhere, but

866
00:47:32,360 --> 00:47:34,180
you find it and then you continue.

867
00:47:35,020 --> 00:47:38,900
And so that is basically the principle of this consensus system.

868
00:47:39,380 --> 00:47:42,080
And that's what I want to bring in Multigres.

869
00:47:43,580 --> 00:47:47,860
While making sure that the people that want simpler majority

870
00:47:47,860 --> 00:47:50,660
based quorums to also work using the same primitives.
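
A sketch of the FlexPaxos-style idea in Go, with hypothetical types — not the Multigres design. Instead of counting a majority of all nodes, durability becomes a predicate over the acks received, so "one copy in another availability zone" and "any 2 of 70 replicas" are both expressible.

```go
package main

import "fmt"

// Ack records which node acknowledged the data and where it lives.
type Ack struct {
	Node string
	Zone string
}

// DurabilityPolicy reports whether the received acks make a
// transaction durable.
type DurabilityPolicy func(primaryZone string, acks []Ack) bool

// CrossAZ: durable once at least one ack comes from a different
// availability zone than the primary.
func CrossAZ(primaryZone string, acks []Ack) bool {
	for _, a := range acks {
		if a.Zone != primaryZone {
			return true
		}
	}
	return false
}

// AnyN returns a policy satisfied by acks from any n other nodes,
// regardless of how many replicas exist in total (70, 20, ...).
func AnyN(n int) DurabilityPolicy {
	return func(_ string, acks []Ack) bool { return len(acks) >= n }
}

func main() {
	acks := []Ack{{Node: "replica-17", Zone: "us-east-1b"}}
	fmt.Println(CrossAZ("us-east-1a", acks)) // true: the data is elsewhere
	fmt.Println(AnyN(2)("us-east-1a", acks)) // false: only 1 ack so far
}
```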

871
00:47:52,840 --> 00:47:56,200
Michael: Just quickly to clarify, when you say the wild goose

872
00:47:56,200 --> 00:48:00,420
chase, is that because it was 1 of 70, but different transactions

873
00:48:00,420 --> 00:48:04,080
could have gone to different ones of the 70, or is it always the same

874
00:48:04,080 --> 00:48:06,380
1, but you have to know which 1 that is?

875
00:48:07,020 --> 00:48:11,000
Sugu: No, it could be any 1. The way we

876
00:48:11,000 --> 00:48:15,580
ran it, at any given point of time there's

877
00:48:15,580 --> 00:48:18,920
only 1 primary, which means that there is only 1 transaction that

878
00:48:18,920 --> 00:48:20,140
you have to chase down.

879
00:48:21,940 --> 00:48:23,100
Michael: The latest 1.

880
00:48:23,160 --> 00:48:24,440
Sugu: The latest 1, yes.

881
00:48:24,440 --> 00:48:25,460
Michael: Yeah, makes sense.

882
00:48:25,740 --> 00:48:26,240
Sugu: Yeah.

883
00:48:27,100 --> 00:48:30,920
There was a time when we found that transaction in a different

884
00:48:30,920 --> 00:48:31,420
country.

885
00:48:32,780 --> 00:48:36,240
So we had to bring it back home and then continue.

886
00:48:37,660 --> 00:48:41,820
It happened once in whatever the 10 years that we ran it.

887
00:48:43,480 --> 00:48:45,800
Nikolay: It's interesting that talking about sharding, we need

888
00:48:45,800 --> 00:48:48,800
to discuss these things, which are not sharding per se, right?

889
00:48:48,800 --> 00:48:54,060
So it's about HA inside each shard, right?

890
00:48:55,600 --> 00:49:00,420
Sugu: It's actually like what I would call healthy database principles,

891
00:49:02,420 --> 00:49:06,060
which is, I think, somewhat more important than sharding.

892
00:49:06,940 --> 00:49:07,440
Nikolay: Yeah.

893
00:49:07,480 --> 00:49:07,840
Yeah.

894
00:49:07,840 --> 00:49:08,340
Yeah.

895
00:49:08,860 --> 00:49:11,940
Michael: It is true that it is to do with it being a distributed

896
00:49:12,100 --> 00:49:13,040
system, right?

897
00:49:13,140 --> 00:49:17,220
And that is because it's sharded, no?

898
00:49:20,820 --> 00:49:22,080
Sugu: I think they are orthogonal.

899
00:49:23,100 --> 00:49:23,600
Okay.

900
00:49:23,820 --> 00:49:29,660
Yeah, I think sharding, like you can do sharding on anything,

901
00:49:30,060 --> 00:49:30,420
Right?

902
00:49:30,420 --> 00:49:33,920
Like you can do sharding on RDS.

903
00:49:34,280 --> 00:49:37,640
Somebody asked me, like, what about Neon?

904
00:49:37,640 --> 00:49:39,520
I said, you can do sharding on Neon too.

905
00:49:39,520 --> 00:49:43,600
It's like you put a proxy in front and then it does the sharding.

906
00:49:44,100 --> 00:49:49,940
But the problem with sharding is it is not just a proxy.

907
00:49:50,200 --> 00:49:53,960
That's what people think of it when they first think of the problem

908
00:49:53,960 --> 00:49:56,020
because they haven't looked ahead.

909
00:49:56,980 --> 00:49:59,660
Once you have sharded, you have to evolve.

910
00:50:00,380 --> 00:50:03,300
You start with 4 shards, then you have to go to 8 shards.

911
00:50:03,740 --> 00:50:06,820
And the evolution is not linear this way.

912
00:50:06,880 --> 00:50:10,440
Actually, it's an exponential growth because 4, 8, 16.

913
00:50:10,840 --> 00:50:15,920
But at some point of time, it changes because your sharding scheme

914
00:50:15,920 --> 00:50:17,820
itself will not scale.

915
00:50:17,900 --> 00:50:21,720
Like if you, for example, are in a multi-tenant workload and

916
00:50:21,720 --> 00:50:26,320
you say shard by tenant, at some point of time, a single tenant

917
00:50:26,320 --> 00:50:28,920
is going to be so big that they won't fit in an instance.

918
00:50:29,760 --> 00:50:31,100
And that we have seen.

919
00:50:32,720 --> 00:50:35,260
And at that time, we had to change the sharding scheme.

920
00:50:35,380 --> 00:50:37,320
So how do you change the sharding scheme?

921
00:50:37,900 --> 00:50:41,020
Slack had to go through this, where they had a tenant-based

922
00:50:41,200 --> 00:50:45,300
sharding scheme, and a single tenant just became too big.

923
00:50:46,020 --> 00:50:48,540
They couldn't even fit 1 tenant in 1 shard.

924
00:50:48,740 --> 00:50:51,560
So they had to change their sharding scheme to be user-based.

925
00:50:52,120 --> 00:50:55,020
They actually talk about it in 1 of their presentations.

926
00:50:56,200 --> 00:51:00,180
And Vitess has the tools to do these changes without actually

927
00:51:00,380 --> 00:51:04,680
incurring any kind of downtime — which, again, Multigres

928
00:51:04,680 --> 00:51:05,320
will have.

929
00:51:05,320 --> 00:51:08,900
I keep talking about Vitess, but these are all things that Multigres

930
00:51:08,940 --> 00:51:12,740
will have, which means that you are future-proofed when it comes

931
00:51:12,740 --> 00:51:13,240
to.

932
00:51:14,240 --> 00:51:16,820
And these are extremely difficult problems to solve.

933
00:51:17,100 --> 00:51:19,640
Because when you're talking about changing the sharding scheme,

934
00:51:19,640 --> 00:51:24,320
you are basically looking at a full crisscross replication of

935
00:51:24,320 --> 00:51:24,820
data.

936
00:51:27,980 --> 00:51:29,340
And across data centers.
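
A toy illustration of the 4 → 8 → 16 progression under range-based sharding — hypothetical code, not Vitess's actual vindex logic. Routing takes a fixed-point slice of a stable hash, so doubling the shard count only ever splits each shard in two; changing the sharding *key* (the Slack case) is what forces the full crisscross copy instead.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a sharding key to one of shardCount shards using
// hash * count / 2^32 (a keyrange split, roughly in Vitess terms).
func shardFor(key string, shardCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return uint32(uint64(h.Sum32()) * uint64(shardCount) >> 32)
}

func main() {
	// A key in shard s under 4 shards lands in shard 2s or 2s+1
	// under 8 — rows never jump to an unrelated shard on a split.
	for _, count := range []uint32{4, 8, 16} {
		fmt.Printf("%2d shards: tenant-42 -> shard %d\n",
			count, shardFor("tenant-42", count))
	}
}
```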

937
00:51:30,920 --> 00:51:34,200
Nikolay: Yeah, and also, as I know, it has version 3, right?

938
00:51:34,200 --> 00:51:40,140
It was when you changed, basically created a new planner, right,

939
00:51:40,140 --> 00:51:46,860
to deal with an arbitrary query and understand how to route it properly

940
00:51:46,960 --> 00:51:51,260
and where to execute it — is it a single shard, or is it global,

941
00:51:51,260 --> 00:51:53,560
or it's like different shards and so on.

942
00:51:53,560 --> 00:51:58,580
Like what's, are you going to do the same with Postgres?

943
00:51:58,580 --> 00:51:59,240
I think yes.

944
00:51:59,240 --> 00:51:59,740
Right.

945
00:52:00,040 --> 00:52:04,460
Sugu: So that's the part that I'm
still on the fence about.

946
00:52:06,540 --> 00:52:09,340
By the way, the v3 now has become
Gen 4.

947
00:52:09,340 --> 00:52:13,000
It's actually much better than
what it was when I built it.

948
00:52:16,240 --> 00:52:20,320
The problem with v3 is that it
is still not a full query engine.

949
00:52:20,320 --> 00:52:22,480
It doesn't support the full query
set yet.

950
00:52:23,480 --> 00:52:28,380
It supports like 90% of
it, I would say, but not everything.

951
00:52:28,820 --> 00:52:32,860
On the temptation side, there's
the Postgres engine that supports

952
00:52:32,860 --> 00:52:33,360
everything.

953
00:52:34,280 --> 00:52:37,860
So I'm still debating how do we
bring the 2 together?

954
00:52:38,600 --> 00:52:42,300
If it was possible to do a simple
git merge, I would do it.

955
00:52:43,480 --> 00:52:46,220
But obviously this 1 is in C, this
was in Go.

956
00:52:46,800 --> 00:52:51,540
And the part that I'm trying to
figure out is how much of the

957
00:52:51,540 --> 00:52:54,780
sharding bias exists in the current
engine in Vitess.

958
00:52:55,520 --> 00:52:59,540
If we brought the Postgres engine
as is, without that sharding

959
00:52:59,540 --> 00:53:02,700
bias, would this engine work well
for a sharded system?

960
00:53:04,440 --> 00:53:09,320
Nikolay: So this looks like Citus, sort of,
if you bring up the whole

961
00:53:09,320 --> 00:53:09,820
Postgres.

962
00:53:11,200 --> 00:53:19,540
There's a library, libpg_query,
by Lukas Fittl, which basically

963
00:53:19,540 --> 00:53:24,520
takes the parser part of Postgres
and brings it to...

964
00:53:24,520 --> 00:53:27,020
And there is a Go version of it
as well.

965
00:53:27,440 --> 00:53:29,620
Sugu: Oh, libpg_query you said?

966
00:53:29,760 --> 00:53:31,560
Nikolay: Yeah, yeah, yeah, I will
send it to you.

967
00:53:31,560 --> 00:53:36,260
So many, many, many, many systems
use it when we need to parse.

968
00:53:37,900 --> 00:53:44,320
Yeah, 1 day I told the SPQR guys
about this, and eventually

969
00:53:44,320 --> 00:53:46,160
I think they ended up using it.

970
00:53:46,160 --> 00:53:52,360
I think Lev Kokotov uses it — PgDog
also uses it, the Rust version.

971
00:53:52,360 --> 00:53:54,680
Sugu: Is it like 100% Postgres
compatible?

972
00:53:55,320 --> 00:53:58,500
Nikolay: Well, it's based on Postgres
source code.

973
00:53:58,500 --> 00:54:02,220
So the parser is truly brought over, but it's
not the whole of Postgres.

974
00:54:02,420 --> 00:54:04,500
So maybe you should consider this.

975
00:54:05,140 --> 00:54:05,640
Right?
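
For example, a minimal use of the Go binding mentioned here — assuming the pganalyze/pg_query_go module, whose import path tracks its major version (v5 at the time of writing, so check the current one); set_user_name is a made-up function:

```go
package main

import (
	"fmt"

	pg_query "github.com/pganalyze/pg_query_go/v5"
)

func main() {
	// Parse with the real Postgres grammar, outside of Postgres.
	result, err := pg_query.Parse("SELECT set_user_name($1, $2)")
	if err != nil {
		panic(err)
	}
	// The tree only says this is a SelectStmt; it cannot say whether
	// the called function writes — the routing problem raised next.
	fmt.Println("select:", result.Stmts[0].Stmt.GetSelectStmt() != nil)
}
```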

976
00:54:06,400 --> 00:54:12,660
If you're thinking about parsing,
I mean, queries and so on,

977
00:54:12,660 --> 00:54:17,420
but I'm very curious, I also noticed
you mentioned routing, like

978
00:54:17,420 --> 00:54:20,360
read-only queries routed to replicas
automatically.

979
00:54:20,640 --> 00:54:25,740
And this concerns me a lot because
many Postgres developers,

980
00:54:25,800 --> 00:54:31,100
I mean, who use it, users, they
use PL/pgSQL functions,

981
00:54:31,100 --> 00:54:34,940
PL/Python functions, anything, which
are writing data.

982
00:54:35,020 --> 00:54:39,060
And the standard way to call a function
is SELECT — SELECT function

983
00:54:39,060 --> 00:54:39,560
name.

984
00:54:40,240 --> 00:54:43,740
So understanding that this function
is actually writing data

985
00:54:43,740 --> 00:54:44,580
is not trivial.

986
00:54:45,980 --> 00:54:51,600
Right, and we know in pgpool, which
all my life I just avoided.

987
00:54:51,600 --> 00:54:55,860
I touched it a few times, decided
not to use it at all because it

988
00:54:55,860 --> 00:55:01,020
tries to do a lot of stuff at once
and always considered like,

989
00:55:01,020 --> 00:55:03,060
no, I'm not going to use this tool.

990
00:55:03,260 --> 00:55:08,080
So pgpool solves it by saying,
okay, let's build a list

991
00:55:08,080 --> 00:55:11,200
of functions which are actually
writing or something like this.

992
00:55:11,200 --> 00:55:14,280
So it's like a patch approach, you
know, a workaround approach.

993
00:55:14,720 --> 00:55:17,780
So this is going to be a huge challenge,
I think, if you...

994
00:55:17,780 --> 00:55:19,640
Sugu: Yeah, yeah.

995
00:55:19,640 --> 00:55:21,980
Nikolay: For automatic routing, it's a huge challenge.

996
00:55:23,100 --> 00:55:27,620
Sugu: Yeah, I think this is the reason why I think it is important

997
00:55:27,620 --> 00:55:34,580
to have the full Postgres functional engine in Multigres, because

998
00:55:34,580 --> 00:55:38,300
then these things will work as intended is my hope.

999
00:55:38,940 --> 00:55:42,900
What we will have to do is add our own sharded understanding

1000
00:55:43,000 --> 00:55:46,080
to these functions and figure out what does it mean to call this

1001
00:55:46,080 --> 00:55:47,260
function, right?

1002
00:55:47,440 --> 00:55:51,280
If this function is going to call out to a different shard, then

1003
00:55:51,280 --> 00:55:55,160
that interpretation has to happen at the higher level.

1004
00:55:55,520 --> 00:55:58,520
But if that function is going to be accessing something within

1005
00:55:58,520 --> 00:56:02,480
the shard, then just push the

1006
00:56:02,480 --> 00:56:06,160
whole SELECT along with the function down and let the individual

1007
00:56:06,180 --> 00:56:07,760
Postgres instance do it.
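
A sketch of that routing decision with hypothetical helpers — this is not pgpool's or Multigres's actual mechanism. One defensible signal in Postgres is pg_proc.provolatile, where 'v' marks a volatile function that may write, so a proxy could cache that catalog column rather than maintain a hand-written list:

```go
package main

import "fmt"

// volatile caches hypothetical pg_proc.provolatile lookups: true
// means the function is marked 'v' and may write data.
var volatile = map[string]bool{
	"set_user_name": true, // made-up writing function
	"lower":         false,
}

// routeTarget decides where a SELECT that calls the named functions
// can safely run. Any volatile (or unknown) function forces the
// primary, since a replica cannot accept writes.
func routeTarget(calledFunctions []string) string {
	for _, fn := range calledFunctions {
		v, known := volatile[fn]
		if !known || v {
			return "primary"
		}
	}
	return "replica"
}

func main() {
	fmt.Println(routeTarget([]string{"lower"}))         // replica
	fmt.Println(routeTarget([]string{"set_user_name"})) // primary
}
```

Even then, the conservative default matters: as the next exchange points out, a function can call another function, so anything unknown has to go to the primary.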

1008
00:56:08,160 --> 00:56:09,360
Nikolay: Yeah, but how do you understand that?

1009
00:56:09,400 --> 00:56:12,180
A function can call another function, and so on.

1010
00:56:12,900 --> 00:56:15,700
It can be so complex in some cases.

1011
00:56:17,000 --> 00:56:22,800
It's also funny that there is actually — Google

1012
00:56:22,800 --> 00:56:27,520
Cloud, CloudSQL supports it — a kind of language, it's not a language,

1013
00:56:27,520 --> 00:56:29,840
called PL/Proxy, which does sharding.

1014
00:56:30,340 --> 00:56:35,720
For those who have workload only in functions, this can route

1015
00:56:35,780 --> 00:56:36,660
to the proper shard.

1016
00:56:36,660 --> 00:56:38,080
It was created at Skype.

1017
00:56:38,320 --> 00:56:41,100
It still exists, but not super popular these days.

1018
00:56:41,720 --> 00:56:45,040
But there is a big requirement to write everything in functions.

1019
00:56:45,140 --> 00:56:48,740
In your case, if you continue, like, I would expect in some case

1020
00:56:48,740 --> 00:56:50,820
you would say, okay, don't use functions.

1021
00:56:51,060 --> 00:56:52,940
But I'm afraid it's not possible.

1022
00:56:53,260 --> 00:56:54,500
Like, I love functions.

1023
00:56:55,080 --> 00:56:59,000
Actually, Supabase loves functions because they use Postgres,

1024
00:56:59,440 --> 00:57:02,140
right?
Postgres, like, it provokes you to use functions.

1025
00:57:02,560 --> 00:57:03,120
Sugu: Oh, really?

1026
00:57:03,120 --> 00:57:04,000
Oh, yeah, yeah.

1027
00:57:04,000 --> 00:57:04,960
Actually, I saw that.

1028
00:57:04,960 --> 00:57:05,460
Yeah.

1029
00:57:08,040 --> 00:57:12,540
So in Vitess, I feel like this was a mistake that we made, which

1030
00:57:12,700 --> 00:57:16,900
is: if we felt that any functionality that you used

1031
00:57:17,460 --> 00:57:18,580
didn't make sense.

1032
00:57:18,900 --> 00:57:22,500
Like if I were you, I wouldn't do this, right?

1033
00:57:23,000 --> 00:57:24,960
Because it's not, it won't scale.

1034
00:57:24,960 --> 00:57:26,080
It's a bad idea.

1035
00:57:26,120 --> 00:57:29,720
You know, it's like those we didn't support.

1036
00:57:29,800 --> 00:57:30,980
We didn't want to support.

1037
00:57:31,880 --> 00:57:35,600
We said, no, we will never do this for you because we'll not

1038
00:57:35,600 --> 00:57:37,700
give you a rope long enough to hang yourself.

1039
00:57:38,860 --> 00:57:40,780
Basically, that was our philosophy.

1040
00:57:41,920 --> 00:57:48,460
But in Multigres, we want to move away from that, which means

1041
00:57:48,460 --> 00:57:53,160
that if you want to call a function that writes, have at it.

1042
00:57:54,380 --> 00:57:57,540
Nikolay: Just put a comment, it's going to write something.

1043
00:57:57,880 --> 00:57:58,380
Sugu: Yeah.

1044
00:57:59,180 --> 00:58:01,960
If you want a function that calls
a function that writes, have

1045
00:58:01,960 --> 00:58:02,660
at it.

1046
00:58:04,120 --> 00:58:08,760
If we cannot, like the worst case
scenario for us is we don't

1047
00:58:08,760 --> 00:58:10,460
know how to optimize this.

1048
00:58:11,280 --> 00:58:14,340
And what we will do is we'll execute
the whole thing on the VTGate

1049
00:58:14,340 --> 00:58:15,060
side.

1050
00:58:16,840 --> 00:58:22,700
Nikolay: There's another, I remember
there is an interesting solution

1051
00:58:22,900 --> 00:58:27,800
in, I think in AWS RDS proxy, which
when they, as I know it,

1052
00:58:27,800 --> 00:58:32,040
maybe I'm wrong, when they needed
to create a global, it's called

1053
00:58:32,040 --> 00:58:35,100
I think Aurora Global Database
maybe or something like this.

1054
00:58:35,100 --> 00:58:39,020
So there is a secondary cluster
living in a different region

1055
00:58:39,440 --> 00:58:42,280
and it's purely read-only, but
it accepts writes.

1056
00:58:42,560 --> 00:58:48,680
And when a write comes, this proxy
routes it to the original primary,

1057
00:58:48,760 --> 00:58:53,800
waits until this write is propagated
back to the replica, and responds.

1058
00:58:54,380 --> 00:58:55,220
Sugu: Oh, wow.

1059
00:58:55,940 --> 00:58:56,440
Yeah.

1060
00:58:58,280 --> 00:59:01,560
I don't think that feature, I don't
know how even that feature

1061
00:59:01,560 --> 00:59:02,440
can be supported.

1062
00:59:02,440 --> 00:59:04,820
Nikolay: No, no, it's just some
exotic, interesting solution

1063
00:59:04,820 --> 00:59:06,080
I just wanted to share.

1064
00:59:06,080 --> 00:59:09,520
Maybe, you know, if we, for example,
if you originally route

1065
00:59:09,560 --> 00:59:13,700
a write to a replica, then somehow
in Postgres you understand,

1066
00:59:13,700 --> 00:59:15,020
oh, it's actually a write.

1067
00:59:15,020 --> 00:59:18,400
Sugu: Okay. Yeah, so maybe 100%
is theoretically impossible to

1068
00:59:18,400 --> 00:59:18,900
support.

1069
00:59:19,540 --> 00:59:20,580
Nikolay: Yes, it's okay.

1070
00:59:20,580 --> 00:59:21,500
It's super exotic.

1071
00:59:21,500 --> 00:59:22,000
Okay.

1072
00:59:23,800 --> 00:59:28,140
Sugu: But I think if people are
doing things like that, it means

1073
00:59:28,140 --> 00:59:32,940
that they are trying to solve a
problem that doesn't have a good

1074
00:59:33,120 --> 00:59:34,180
existing solution.

1075
00:59:34,400 --> 00:59:34,900
Nikolay: Exactly.

1076
00:59:35,320 --> 00:59:38,740
Sugu: So if we can find a good
existing solution, I think they'll

1077
00:59:38,740 --> 00:59:42,040
be very happy to adopt that instead
of whatever they were trying

1078
00:59:42,040 --> 00:59:42,760
to do.

1079
00:59:43,780 --> 00:59:45,900
Nikolay: Well, this is just multi-region
setup.

1080
00:59:46,680 --> 00:59:51,500
I saw more than 1 CTO who wanted it
for, like, dealing with Postgres —

1081
00:59:51,500 --> 00:59:56,260
like, say, we are still single-region,
we need to be present

1082
00:59:56,260 --> 01:00:00,540
in multiple regions in case
1 AWS region is down, right? It's

1083
01:00:00,540 --> 01:00:07,120
also okay. Yeah, so, availability
and business characteristics.

1084
01:00:07,120 --> 01:00:15,320
So yeah, anyway — okay, it's
exotic but interesting still,

1085
01:00:15,340 --> 01:00:17,680
yeah.

1086
01:00:17,680 --> 01:00:19,940
Michael: So you've got a lot of work
ahead of you, Sugu.

1087
01:00:21,620 --> 01:00:26,400
Sugu: I feel like we barely covered
like 1 of so many topics.

1088
01:00:29,760 --> 01:00:31,160
Nikolay: Let's touch something
else.

1089
01:00:32,500 --> 01:00:35,380
Maybe it's a very long episode,
but it's worth it, I think.

1090
01:00:35,380 --> 01:00:36,300
It's super interesting.

1091
01:00:37,540 --> 01:00:38,300
What else?

1092
01:00:38,380 --> 01:00:39,060
What else?

1093
01:00:40,840 --> 01:00:45,560
Sugu: I think the other interesting
1 would be 2PC and isolation.

1094
01:00:47,980 --> 01:00:49,300
Nikolay: Isolation from what?

1095
01:00:50,600 --> 01:00:56,060
Sugu: Like the 1 issue with the sharded solution is that, again,

1096
01:00:56,060 --> 01:00:59,200
this was a philosophy: for the longest time in Vitess we didn't

1097
01:00:59,200 --> 01:01:00,320
allow 2PC.

1098
01:01:00,560 --> 01:01:04,620
We said, you shard it in such a way that you do not have distributed

1099
01:01:04,660 --> 01:01:05,160
transactions.

1100
01:01:07,200 --> 01:01:09,520
And many people lived with that.

1101
01:01:09,520 --> 01:01:10,840
And some people actually did

1102
01:01:10,840 --> 01:01:11,780
not adopt Vitess.

1103
01:01:12,180 --> 01:01:17,500
Nikolay: Let me interrupt you here, because this is like the

1104
01:01:17,500 --> 01:01:21,660
best feature I liked about Vitess — this materialize feature

1105
01:01:21,660 --> 01:01:22,900
when data is brought.

1106
01:01:22,900 --> 01:01:24,620
Sugu: Oh yeah, materialized is another topic.

1107
01:01:24,620 --> 01:01:26,460
That's actually a better topic than 2PC.

1108
01:01:27,740 --> 01:01:30,120
Nikolay: Well, yeah, because this is your strength, right?

1109
01:01:30,120 --> 01:01:34,240
So this is like, I love this idea, basically distributed materialized

1110
01:01:34,380 --> 01:01:37,440
view, which is incrementally updated.

1111
01:01:37,960 --> 01:01:39,300
Sugu: Yes, yes, yes.

1112
01:01:39,440 --> 01:01:40,160
Nikolay: That's great.

1113
01:01:40,600 --> 01:01:42,480
We need it in Postgres ecosystem.

1114
01:01:43,500 --> 01:01:47,480
Just maybe as a separate project, even, you know, like we lack

1115
01:01:47,480 --> 01:01:48,280
it everywhere.

1116
01:01:49,300 --> 01:01:52,500
So yeah, this is how you avoid distributed transactions basically,

1117
01:01:52,500 --> 01:01:53,000
right?

1118
01:01:53,800 --> 01:01:55,600
Sugu: No, this is 1 way to avoid it.

1119
01:01:55,600 --> 01:01:56,680
1 way, yeah.

1120
01:01:56,760 --> 01:02:00,660
Like there are 2 use cases where materialized views are super

1121
01:02:00,660 --> 01:02:01,160
awesome.

1122
01:02:01,740 --> 01:02:06,500
You know a table that has multiple foreign keys, that has foreign

1123
01:02:06,500 --> 01:02:11,640
keys to 2 different tables is the classic use case, where the

1124
01:02:11,640 --> 01:02:15,600
example I gave was a user that's producing music and listeners

1125
01:02:15,600 --> 01:02:20,380
that are listening to it, which means that the row where I listen

1126
01:02:20,380 --> 01:02:24,060
to this music has 2 foreign keys, 1 to the creator and 1 to the

1127
01:02:24,060 --> 01:02:24,560
listener.

1128
01:02:25,880 --> 01:02:27,580
And where should this row live?

1129
01:02:27,660 --> 01:02:31,320
Should this row live with the creator or should this row live

1130
01:02:31,320 --> 01:02:35,180
with the listener is a classic problem.

1131
01:02:35,680 --> 01:02:37,920
And there is no perfect solution for it.

1132
01:02:37,920 --> 01:02:39,640
It depends on your traffic pattern.

1133
01:02:40,200 --> 01:02:44,380
But what if the traffic pattern is 1 way in 1 case and another

1134
01:02:44,380 --> 01:02:45,480
way in another case?

1135
01:02:45,560 --> 01:02:46,900
There is no perfect solution.

1136
01:02:47,580 --> 01:02:51,600
So this is where, in Multigres, what you could do is you say, okay,

1137
01:02:51,680 --> 01:02:56,040
in most cases this row should live with the creator — let's assume

1138
01:02:56,040 --> 01:03:00,860
that, right? So then you say this row lives with the creator and

1139
01:03:00,860 --> 01:03:05,040
we shard it this way, which means that if you join the creator

1140
01:03:05,380 --> 01:03:10,940
table with this event row, it'll be all local joins.

1141
01:03:11,140 --> 01:03:14,540
But if you join the listeners table with this event row, it's

1142
01:03:14,540 --> 01:03:17,240
a huge cross-shard wild goose chase.

1143
01:03:17,720 --> 01:03:23,040
So in this case, you can say materialize this table using a different

1144
01:03:23,040 --> 01:03:27,100
foreign key, which is the listener's
foreign key into the same

1145
01:03:27,100 --> 01:03:30,520
sharded database under a different
table name.

1146
01:03:30,820 --> 01:03:35,320
And now you can do a local join
with the listener and this event

1147
01:03:35,320 --> 01:03:35,820
table.

1148
01:03:36,260 --> 01:03:40,640
And this materialized view is near
real-time, basically the time

1149
01:03:40,640 --> 01:03:43,220
it takes to read the WAL and apply
it.

1150
01:03:43,980 --> 01:03:46,020
And this can go on forever.

1151
01:03:46,440 --> 01:03:50,320
And this is actually also the secret
behind resharding, changing

1152
01:03:50,320 --> 01:03:51,300
the sharding key.

1153
01:03:51,600 --> 01:03:56,260
This is essentially a table that
is present, in real time, with 2

1154
01:03:56,260 --> 01:03:57,100
sharding keys.

1155
01:03:57,260 --> 01:04:01,480
If you say, oh, at some point of
time, this is more authoritative,

1156
01:04:01,620 --> 01:04:04,080
all you have to do is swap this
out.

1157
01:04:04,440 --> 01:04:06,960
Make 1 the source, the other is
a target.

1158
01:04:07,200 --> 01:04:08,480
You've changed your sharding key.

1159
01:04:08,480 --> 01:04:11,320
Actually, the change sharding key
works exactly like this for

1160
01:04:11,320 --> 01:04:11,980
a table.

1161
01:04:13,040 --> 01:04:15,220
Nikolay: Distributed denormalization
technique.

1162
01:04:15,300 --> 01:04:15,860
This is what

1163
01:04:15,860 --> 01:04:16,320
Michael: it is.

1164
01:04:16,320 --> 01:04:18,000
Sugu: Yeah, yeah, yeah, exactly.

1165
01:04:18,940 --> 01:04:24,960
And the other use case is when
you reshard, you leave behind

1166
01:04:24,960 --> 01:04:28,680
smaller tables, reference tables,
we call them.

1167
01:04:29,180 --> 01:04:31,640
And they have to live in a different
database because they are

1168
01:04:31,640 --> 01:04:35,400
too small and even if you shard
them, they won't shard well.

1169
01:04:36,220 --> 01:04:41,480
Like if you have a billion rows
in 1 table and a thousand rows

1170
01:04:41,480 --> 01:04:44,100
in a smaller table, you don't want
to shard your thousand row

1171
01:04:44,100 --> 01:04:44,600
table.

1172
01:04:45,060 --> 01:04:47,460
And there's no benefit to sharding
that either.

1173
01:04:47,580 --> 01:04:52,280
So it's better that that table
lives in a separate database.

1174
01:04:53,100 --> 01:04:57,320
But if you want to join between
these 2, how do you do it, right?

1175
01:04:57,380 --> 01:05:00,720
The only way you join is join at
the application level, read

1176
01:05:00,720 --> 01:05:02,180
1 and then read the other.

1177
01:05:02,560 --> 01:05:04,740
And so at high QPS, it's not efficient.

1178
01:05:05,740 --> 01:05:09,240
So what we can do is actually materialize
this table on all the

1179
01:05:09,240 --> 01:05:13,380
shards as reference, and then all
joins become local.
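
A toy sketch of the two materialize use cases just described, with hypothetical helpers; the real Vitess feature is a managed workflow fed from the WAL, not application code. Each replicated listen event is rewritten into a copy sharded by the other foreign key, and reference-table rows are broadcast to every shard, so both kinds of join stay shard-local:

```go
package main

import "fmt"

const numShards = 8

type listenEvent struct{ CreatorID, ListenerID, TrackID int64 }

// shardFor is a stand-in for the real sharding function.
func shardFor(key int64) int { return int(key % numShards) }

// writeToShard is a stand-in for applying a row on one shard.
func writeToShard(shard int, table string, row any) {
	fmt.Printf("shard %d: upsert into %s: %+v\n", shard, table, row)
}

// onReplicatedEvent is what the materializer does for each event it
// reads from the WAL of the creator-sharded source table: maintain a
// near-real-time copy sharded by listener_id instead.
func onReplicatedEvent(ev listenEvent) {
	writeToShard(shardFor(ev.ListenerID), "listens_by_listener", ev)
}

// onReferenceRow broadcasts a small reference-table row to every
// shard, so joins against it never leave the shard.
func onReferenceRow(row any) {
	for shard := 0; shard < numShards; shard++ {
		writeToShard(shard, "currencies_ref", row)
	}
}

func main() {
	onReplicatedEvent(listenEvent{CreatorID: 7, ListenerID: 1001, TrackID: 3})
	onReferenceRow(struct{ Code string }{"EUR"})
}
```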

1180
01:05:13,820 --> 01:05:14,320
Nikolay: Yeah.

1181
01:05:15,060 --> 01:05:18,680
And you definitely need logical
replication for all of this.

1182
01:05:18,680 --> 01:05:22,320
So this is where we started, like
challenges with logical replication.

1183
01:05:23,420 --> 01:05:24,360
Sugu: Yeah, yeah.

1184
01:05:25,080 --> 01:05:29,180
You do have the... so the reason
why 2PC is still important is because

1185
01:05:29,180 --> 01:05:32,660
there are trade-offs to this solution,
which is, there's a lag.

1186
01:05:33,340 --> 01:05:39,140
So it takes time for the things
to go through the WAL and come

1187
01:05:39,140 --> 01:05:40,140
to the other side.

1188
01:05:41,580 --> 01:05:47,040
Whereas 2PC is essentially, basically
the transaction system

1189
01:05:47,040 --> 01:05:51,020
itself trying to complete a transaction,
which means that it

1190
01:05:51,020 --> 01:05:54,860
will handle cases where there are
race conditions, right?

1191
01:05:55,020 --> 01:05:58,440
If somebody else tries to change
that row elsewhere while this

1192
01:05:58,440 --> 01:06:02,860
row is being changed, 2PC will
block that from happening, whereas

1193
01:06:02,860 --> 01:06:04,940
in the other case, you cannot do
that.

1194
01:06:05,380 --> 01:06:08,940
Nikolay: Yeah, if it's just user views of some video like on

1195
01:06:08,940 --> 01:06:14,240
YouTube, we can say, okay, there will be some lag, probably some

1196
01:06:14,240 --> 01:06:15,360
small mistake, it's fine.

1197
01:06:15,360 --> 01:06:18,960
But if it's financial data, it should be 2PC, but latency of

1198
01:06:18,960 --> 01:06:22,440
write will be high, throughput will be low, right?

1199
01:06:22,440 --> 01:06:23,160
This is...

1200
01:06:23,740 --> 01:06:25,180
Sugu: I actually want to...

1201
01:06:25,240 --> 01:06:26,960
I read the design of...

1202
01:06:26,980 --> 01:06:31,920
which is, again, by the way, a very elegant API, and I assume...

1203
01:06:32,220 --> 01:06:37,960
I can see the implementation from the API and I don't think we

1204
01:06:37,960 --> 01:06:40,060
will see performance problems with 2PC.

1205
01:06:40,240 --> 01:06:41,700
Nikolay: We need to benchmark it

1206
01:06:42,040 --> 01:06:45,960
Sugu: We will — we will benchmark it, but I will be very surprised.

1207
01:06:46,220 --> 01:06:50,600
I think there are some isolation issues that we may not have

1208
01:06:50,600 --> 01:06:54,020
time to go through today because it's a long topic.

1209
01:06:54,660 --> 01:06:58,680
Like the way 2PC is currently supported in Postgres, I think

1210
01:06:58,680 --> 01:07:00,080
it'll perform really well.

1211
01:07:00,240 --> 01:07:04,700
Nikolay: Isolation issues when we are in read committed and use

1212
01:07:04,700 --> 01:07:05,200
2PC.

1213
01:07:05,600 --> 01:07:06,600
You mean this, right?

1214
01:07:06,600 --> 01:07:08,040
Not in repeatable read.

1215
01:07:08,420 --> 01:07:08,940
In default.

1216
01:07:08,940 --> 01:07:12,380
Sugu: Yeah, read committed I think will be, there will be some

1217
01:07:12,380 --> 01:07:15,600
tradeoffs on read committed, but not the kind that will affect

1218
01:07:15,600 --> 01:07:16,420
most applications.

1219
01:07:17,040 --> 01:07:19,260
MVCC will be the bigger challenge.

1220
01:07:19,740 --> 01:07:25,240
But from what I hear, most people don't use it; like, the most

1221
01:07:25,240 --> 01:07:27,080
common use case is read committed.

1222
01:07:27,340 --> 01:07:28,480
Nikolay: Of course, as default.

1223
01:07:28,620 --> 01:07:29,380
Yeah, it's faster.

1224
01:07:29,380 --> 01:07:30,200
Sugu: It's a default.

1225
01:07:30,400 --> 01:07:30,740
Yeah.

1226
01:07:30,740 --> 01:07:34,550
So people won't even, yeah, I don't, I think this.

1227
01:07:34,550 --> 01:07:38,160
Nikolay: They're already on some... they're already in a bad state.

1228
01:07:38,560 --> 01:07:39,620
It won't be worse.

1229
01:07:40,160 --> 01:07:40,960
Sugu: It won't be worse.

1230
01:07:40,960 --> 01:07:41,460
Yes.

1231
01:07:42,160 --> 01:07:42,540
Nikolay: Yes.

1232
01:07:42,540 --> 01:07:46,920
Yeah.
So 2PC, of course, depends on the distance between nodes, right?

1233
01:07:46,920 --> 01:07:54,260
A lot, like if they are far, we need to talk like client is somewhere,

1234
01:07:54,340 --> 01:07:58,400
2 nodes are somewhere, and if it's different availability zones,

1235
01:07:58,980 --> 01:08:00,060
it depends, right?

1236
01:08:00,060 --> 01:08:03,840
So this distance is a big contributor to latency, right?

1237
01:08:03,840 --> 01:08:04,280
Network.

1238
01:08:04,280 --> 01:08:10,060
Because there are 4 communication messages that are needed.

1239
01:08:10,440 --> 01:08:10,920
So.

1240
01:08:10,920 --> 01:08:11,680
Sugu: Correct, correct.

1241
01:08:11,680 --> 01:08:16,220
Actually, I have actually done the mathematics for it.

1242
01:08:16,220 --> 01:08:17,300
But you're probably right.

1243
01:08:17,300 --> 01:08:19,840
It's about double the number of
round trips.
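
For a rough sense of scale, with illustrative numbers rather than measurements: say ~0.5 ms round trip between availability zones. A single-shard full-sync commit pays about one primary-to-standby round trip, so ~0.5 ms. A cross-shard 2PC from the coordinator pays a prepare round and then a commit round to the participants — roughly two round trips, ~1 ms, so about double, as said above — and far worse if the participants sit in different regions at ~10 ms per round trip.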

1244
01:08:20,540 --> 01:08:24,960
Nikolay: Yeah, if we put everything
in 1 AZ, client and both

1245
01:08:24,960 --> 01:08:26,620
primaries, we are fine.

1246
01:08:27,740 --> 01:08:30,260
But in reality, they will be in
different places.

1247
01:08:30,300 --> 01:08:33,180
And if it's different regions,
it's nightmare, of course.

1248
01:08:33,180 --> 01:08:33,480
But at

1249
01:08:33,480 --> 01:08:36,380
Sugu: least it's- Yeah, the 2PC
is not done by the client, by

1250
01:08:36,380 --> 01:08:36,740
the way.

1251
01:08:36,740 --> 01:08:41,420
The 2PC would be done by the VTGate,
which would be- It should

1252
01:08:41,420 --> 01:08:42,540
have the nodes nearby.

1253
01:08:44,060 --> 01:08:46,380
Nikolay: Ah, should have. In 1
availability

1254
01:08:46,380 --> 01:08:46,880
Sugu: zone?

1255
01:08:47,100 --> 01:08:50,860
Unless you did some crazy configuration,
they should be pretty

1256
01:08:50,860 --> 01:08:52,000
close to each other.

1257
01:08:52,200 --> 01:08:55,300
Nikolay: Pretty close means still
different availability zones

1258
01:08:55,320 --> 01:08:56,660
in the general case, right?

1259
01:08:57,040 --> 01:08:57,940
Sugu: No, no.

1260
01:08:58,580 --> 01:09:01,160
The availability zone is only for
durability.

1261
01:09:01,980 --> 01:09:02,460
Nikolay: Okay.

1262
01:09:02,460 --> 01:09:03,640
Sugu: For replica level.

1263
01:09:04,000 --> 01:09:09,220
But a 2PC, you're coordinating
between 2 primaries, which may

1264
01:09:09,220 --> 01:09:11,680
actually be on the same machine
for all you care.

1265
01:09:12,040 --> 01:09:12,780
It's not...

1266
01:09:12,780 --> 01:09:13,280
Well,

1267
01:09:13,660 --> 01:09:15,400
Nikolay: Imagine the real practical
case.

1268
01:09:15,400 --> 01:09:17,200
We have sharded schema, we have...

1269
01:09:17,220 --> 01:09:20,420
Every shard has primary and a couple
of standbys, right?

1270
01:09:20,420 --> 01:09:20,920
Sugu: Correct.

1271
01:09:20,980 --> 01:09:21,320
Correct.

1272
01:09:21,320 --> 01:09:24,660
Nikolay: So are you saying that
we need to keep primaries all

1273
01:09:24,660 --> 01:09:26,820
in the same availability zone?

1274
01:09:27,700 --> 01:09:29,760
Sugu: That's usually how things
are.

1275
01:09:29,760 --> 01:09:30,480
Nikolay: Ah, interesting.

1276
01:09:30,480 --> 01:09:31,560
I didn't know about this.

1277
01:09:31,560 --> 01:09:36,080
By the way, I wanted to rant
a little bit about PlanetScale benchmarks

1278
01:09:36,100 --> 01:09:37,580
they published last week.

1279
01:09:37,700 --> 01:09:38,980
They compared to everyone.

1280
01:09:39,140 --> 01:09:40,240
They compared to everyone.

1281
01:09:40,240 --> 01:09:42,980
I wanted just, it's not like, I'm
sorry, I will take a little

1282
01:09:42,980 --> 01:09:43,900
bit of time.

1283
01:09:43,900 --> 01:09:48,860
They compared to everyone, and
they just published like PlanetScale

1284
01:09:48,900 --> 01:09:49,740
versus something.

1285
01:09:50,540 --> 01:09:56,520
And on this very topic, on the charts,
we have PlanetScale in a single

1286
01:09:56,520 --> 01:10:00,180
AZ — everything, client and server,
in the same AZ.

1287
01:10:00,540 --> 01:10:05,520
And a line, which is like the normal
case, where the client is in a different AZ.

1288
01:10:05,940 --> 01:10:11,460
And the line with the same AZ is active;
the normal line is not active.

1289
01:10:11,720 --> 01:10:15,800
And others like Neon, Supabase,
everyone — it's different.

1290
01:10:16,640 --> 01:10:21,980
And of course, PlanetScale looks
really well, because by default,

1291
01:10:22,640 --> 01:10:26,460
they presented numbers for the
same availability zone.

1292
01:10:26,880 --> 01:10:30,840
And below the chart, everything
is explained, but who reads it,

1293
01:10:30,840 --> 01:10:32,820
right?
So people just see the graphs.

1294
01:10:32,900 --> 01:10:36,260
And you can unselect, select proper
PlanetScale numbers and

1295
01:10:36,260 --> 01:10:37,540
see that they are similar.

1296
01:10:38,680 --> 01:10:42,480
But by default, the same-AZ
number is chosen.

1297
01:10:44,440 --> 01:10:46,960
And this is like benchmarking,
you know, like.

1298
01:10:47,740 --> 01:10:53,040
Sugu: Well, I think if you look
at the architecture, like, even

1299
01:10:53,040 --> 01:10:56,200
fair comparison, PlanetScale should
come out ahead, like, like

1300
01:10:56,200 --> 01:11:00,300
the performance of a local disk,
of course, should.

1301
01:11:00,700 --> 01:11:03,820
Nikolay: But this was SELECT 1 —
no disks.

1302
01:11:04,940 --> 01:11:07,060
Sugu: Disks, SELECT 1 is not a
benchmark.

1303
01:11:08,360 --> 01:11:10,300
Nikolay: Well, it was part of benchmark.

1304
01:11:10,320 --> 01:11:15,060
It's just checking query path,
but it fully depends on where

1305
01:11:15,060 --> 01:11:16,580
client and server are located.

1306
01:11:17,360 --> 01:11:21,460
So what's the point of showing better
numbers by just putting the client

1307
01:11:21,460 --> 01:11:21,960
closer?

1308
01:11:22,900 --> 01:11:25,120
I don't like that part of that
benchmark.

1309
01:11:26,000 --> 01:11:27,040
Sugu: Okay, yeah.

1310
01:11:28,580 --> 01:11:31,840
I saw the publications, but I didn't
go into the details because

1311
01:11:31,840 --> 01:11:35,440
I thought, well, it has to be faster
because it's on local disks.

1312
01:11:35,660 --> 01:11:36,240
So why?

1313
01:11:36,260 --> 01:11:39,140
Nikolay: Yeah, for data which is
not fully in cache, of course.

1314
01:11:39,140 --> 01:11:42,320
Yeah, local disks are amazing versus
EBS volumes.

1315
01:11:42,360 --> 01:11:42,800
Sugu: You're right.

1316
01:11:42,800 --> 01:11:46,200
Yeah, if the data is in cache,
then there is, yeah, then all

1317
01:11:46,200 --> 01:11:48,900
performance, the performance of
everything would be the same.

1318
01:11:49,180 --> 01:11:51,520
Nikolay: Yeah, well, I wanted to
share this.

1319
01:11:51,820 --> 01:11:57,040
I was annoyed about this, but I
fully support the idea of local

1320
01:11:57,040 --> 01:11:57,980
disks.
It's great.

1321
01:11:57,980 --> 01:12:00,400
I think we need to use them more
in more systems.

1322
01:12:00,780 --> 01:12:04,240
Sugu: I think, I wouldn't be surprised
if you reached out to

1323
01:12:04,240 --> 01:12:04,740
PlanetScale.

1324
01:12:05,280 --> 01:12:09,380
They may be willing to, like if
you want to run your benchmarks,

1325
01:12:10,020 --> 01:12:11,780
they may be willing to give you
the...

1326
01:12:11,780 --> 01:12:14,400
Nikolay: Yeah, there is source
code published and in general

1327
01:12:14,540 --> 01:12:17,140
benchmarks look great, the idea
is great.

1328
01:12:17,200 --> 01:12:22,280
And actually, with local disks,
the only concern is usually the

1329
01:12:22,280 --> 01:12:24,180
limit, hard limit.

1330
01:12:24,480 --> 01:12:26,260
We cannot have more space.

1331
01:12:26,420 --> 01:12:30,100
But if you have sharded solution,
there is no such thing.

1332
01:12:30,480 --> 01:12:30,980
Sugu: Correct.

1333
01:12:31,100 --> 01:12:36,780
But speaking about the hard limit,
today's SSDs, you can buy

1334
01:12:36,780 --> 01:12:41,480
100-plus-terabyte SSDs, a single
SSD, and you can probably

1335
01:12:41,480 --> 01:12:42,620
stack them up on

1336
01:12:43,260 --> 01:12:44,480
Nikolay: 1 next to the other.

1337
01:12:45,720 --> 01:12:45,780
Yeah, yeah.

1338
01:12:45,780 --> 01:12:48,040
But in the cloud there are limitations,

1339
01:12:48,260 --> 01:12:49,020
for instance.

1340
01:12:49,040 --> 01:12:49,820
Sugu: Okay, yeah.

1341
01:12:51,160 --> 01:12:53,900
I saw AWS SSDs of over 100 terabytes.

1342
01:12:54,960 --> 01:13:00,400
Nikolay: In Google Cloud, 72 terabytes is hard limit for Z3 metal.

1343
01:13:00,560 --> 01:13:01,720
And I didn't see more.

1344
01:13:01,720 --> 01:13:05,280
So 72 terabytes is a lot, but sometimes that limit already matters.

1345
01:13:05,280 --> 01:13:09,740
Sugu: Yeah, at that size, your storage is not the limit.

1346
01:13:10,120 --> 01:13:13,380
You will not be able to run a database of that size on a single

1347
01:13:13,380 --> 01:13:13,880
machine.

1348
01:13:15,060 --> 01:13:16,740
Nikolay: Why not?
Sugu: You will max out your CPU.

1349
01:13:17,400 --> 01:13:18,840
Nikolay: We have such cases. CPU...

1350
01:13:18,920 --> 01:13:21,260
Well, again, the problem will be replication.

1351
01:13:21,580 --> 01:13:23,680
If we talk about a single node, we can have...

1352
01:13:23,680 --> 01:13:24,600
Sugu: Or replication.

1353
01:13:25,520 --> 01:13:29,440
Nikolay: 360 cores in AWS, almost 1,000 cores already for Xeon

1354
01:13:29,440 --> 01:13:31,400
scalable generation 5 or something.

1355
01:13:31,400 --> 01:13:33,580
So hundreds of cores.

1356
01:13:35,040 --> 01:13:36,780
The problem is Postgres design.

1357
01:13:36,820 --> 01:13:42,900
If replication, physical replication, were multi-threaded, we could

1358
01:13:42,900 --> 01:13:43,760
scale more.
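
The bottleneck Nikolay is describing shows up as replay lag on a physical replica, since WAL apply is single-threaded no matter how many cores the box has. A quick check using the built-in function (hypothetical replica host):

```python
import psycopg2  # assumes psycopg2-binary; connect to a physical replica

conn = psycopg2.connect("host=replica.example.com dbname=postgres user=postgres")
cur = conn.cursor()
# pg_last_xact_replay_timestamp() is NULL on a primary; on a replica,
# now() minus it approximates how far single-threaded replay is behind.
cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag")
print("replay lag:", cur.fetchone()[0])
```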

1359
01:13:45,480 --> 01:13:47,720
Sugu: By the way, replication is not the only problem.

1360
01:13:48,080 --> 01:13:49,040
Backup recovery.

1361
01:13:49,540 --> 01:13:52,540
If your machine goes down, you're down for hours.

1362
01:13:53,860 --> 01:13:55,740
Recovering something of that size.

1363
01:13:56,280 --> 01:13:56,780
Nikolay: Yeah.

1364
01:13:58,940 --> 01:13:59,860
Not many hours.

1365
01:14:00,780 --> 01:14:05,780
Someone in my team recently saw 17 or 19 terabytes per hour for

1366
01:14:05,780 --> 01:14:08,380
recovery with pgBackRest or WAL-G.

1367
01:14:09,720 --> 01:14:10,680
In AWS.

1368
01:14:10,680 --> 01:14:13,040
I was like, my jaw dropped.

1369
01:14:15,540 --> 01:14:17,780
On our podcast I usually say 1 terabyte.

1370
01:14:17,780 --> 01:14:18,740
Sugu: Can you repeat that?

1371
01:14:18,740 --> 01:14:19,540
17 or 19

1372
01:14:19,540 --> 01:14:20,700
Nikolay: terabytes per hour?

1373
01:14:21,100 --> 01:14:22,760
17 terabytes per hour.

1374
01:14:22,840 --> 01:14:23,900
With local disks.

1375
01:14:24,960 --> 01:14:26,280
So this is important.

1376
01:14:26,280 --> 01:14:27,340
With EBS, it's impossible.

1377
01:14:27,340 --> 01:14:28,300
Sugu: It's good to know.
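
To put those throughput figures in perspective, a back-of-the-envelope calculation using the numbers from this conversation (the 72-terabyte single-node ceiling and the observed restore speeds):

```python
# Rough full-restore times for a 72 TB database at the rates discussed above.
db_size_tb = 72

for label, tb_per_hour in [
    ("local NVMe (observed)", 17),
    ("EBS (observed)", 7),
    ("old rule of thumb", 1),
]:
    print(f"{label:>22}: {db_size_tb / tb_per_hour:5.1f} hours")
```

At 1 terabyte per hour a full restore takes 3 days; at 17, it is under 5 hours, which is why these local-disk numbers matter.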

1378
01:14:30,260 --> 01:14:34,760
Nikolay: Yeah.
With Michael, I was always saying 1 terabyte is like what you

1379
01:14:34,760 --> 01:14:35,420
should achieve.

1380
01:14:35,420 --> 01:14:36,700
If it's below, it's bad.

1381
01:14:36,700 --> 01:14:38,660
Now I'm thinking 1 terabyte is already...

1382
01:14:40,520 --> 01:14:41,260
Yeah, yeah.

1383
01:14:41,540 --> 01:14:45,340
So with EBS volumes, we managed to achieve, I think, 7 terabytes

1384
01:14:45,340 --> 01:14:48,040
per hour to restore with WAL-G.

1385
01:14:48,500 --> 01:14:48,940
And that's great.

1386
01:14:48,940 --> 01:14:49,300
Sugu: But there's

1387
01:14:49,300 --> 01:14:50,420
a danger there.

1388
01:14:50,860 --> 01:14:52,760
You could become a noisy neighbor.

1389
01:14:55,160 --> 01:15:00,600
So we actually built throttling in our restore just to prevent

1390
01:15:00,600 --> 01:15:01,420
noisy neighbors.
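
Sugu doesn't spell out how the Vitess restore throttle works here, but the usual technique is a token bucket that caps bytes per second. A minimal sketch under that assumption (all names are hypothetical):

```python
import time


class TokenBucket:
    """Caps throughput at `rate` bytes/second, with bursts up to `burst` bytes."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def consume(self, nbytes: int) -> None:
        nbytes = min(nbytes, self.burst)  # a chunk larger than the bucket would never fit
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)  # wait for the bucket to refill


# Hypothetical restore loop capped at 200 MB/s, so the restore cannot
# saturate a disk or network shared with live traffic:
bucket = TokenBucket(rate=200 * 1024**2, burst=8 * 1024**2)
# for chunk in backup_stream:        # backup_stream and dest are placeholders
#     bucket.consume(len(chunk))
#     dest.write(chunk)
```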

1391
01:15:03,080 --> 01:15:06,840
Nikolay: With local disks, you lose the ability to use EBS snapshots,

1392
01:15:07,380 --> 01:15:09,220
cloud disk snapshots.

1393
01:15:09,380 --> 01:15:10,460
Sugu: Correct, correct, yeah.

1394
01:15:10,640 --> 01:15:12,780
Nikolay: That's what you lose, unfortunately.

1395
01:15:13,080 --> 01:15:15,640
And they're great, and people enjoy them more and more.

1396
01:15:16,220 --> 01:15:21,440
Yeah, so I agree, and as I remember, for 17 terabytes it was

1397
01:15:21,660 --> 01:15:30,040
128 threads of WAL-G or pgBackRest, I don't remember, local disks. I

1398
01:15:30,040 --> 01:15:30,780
need to update...

1399
01:15:30,780 --> 01:15:33,780
Sugu: Technology is changing too fast.

1400
01:15:34,280 --> 01:15:35,360
Nikolay: Exactly, yeah.

1401
01:15:35,436 --> 01:15:39,780
And hundreds of cores, terabytes of RAM already, right?

1402
01:15:39,900 --> 01:15:40,400
Like,

1403
01:15:41,400 --> 01:15:41,900
Michael: yeah.

1404
01:15:42,800 --> 01:15:46,880
Yeah, yeah.
But it does go straight to your point of the smaller they are,

1405
01:15:46,880 --> 01:15:48,780
the faster you can recover still.

1406
01:15:49,280 --> 01:15:51,100
Sugu: You don't hit some of these limits.

1407
01:15:51,220 --> 01:15:55,080
These systems were not designed with these types of limits in

1408
01:15:55,080 --> 01:15:55,580
mind.

1409
01:15:56,040 --> 01:16:00,520
Some weird data structure suddenly turns out to have a limit of only

1410
01:16:00,620 --> 01:16:01,600
100 items.

1411
01:16:02,640 --> 01:16:04,980
And you hit those limits and then you're stuck.

1412
01:16:05,580 --> 01:16:07,520
Recently, Metronome had an issue.

1413
01:16:08,860 --> 01:16:12,280
They had that, the...

1414
01:16:12,340 --> 01:16:12,840
MultiXact.

1415
01:16:13,860 --> 01:16:18,880
The MultiXact thing, which nobody has ever run into before, but

1416
01:16:21,140 --> 01:16:22,260
they hit that problem.
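
MultiXact age is exactly the kind of limit nobody watches until it bites. A monitoring query along these lines, using the standard mxid_age() function, shows how far each database is from the 32-bit wraparound threshold (hypothetical DSN):

```python
import psycopg2  # assumes psycopg2-binary; the DSN is hypothetical

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
# Autovacuum must freeze multixacts long before mxid_age approaches ~2 billion.
cur.execute("""
    SELECT datname, mxid_age(datminmxid) AS multixact_age
    FROM pg_database
    ORDER BY 2 DESC
""")
for datname, age in cur.fetchall():
    print(f"{datname}: multixact age {age}")
```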

1417
01:16:22,800 --> 01:16:27,260
Nikolay: Yeah, we saw many problems also when you're on the edge.

1418
01:16:27,600 --> 01:16:33,420
And it pushes forward Postgres actually sometimes, but if you

1419
01:16:33,420 --> 01:16:36,180
want to be on the safe side, but I really like the, like, it's

1420
01:16:36,180 --> 01:16:40,200
kind of resilience characteristics when, even if it's down, it's

1421
01:16:40,200 --> 01:16:42,040
only a small part of your system is down.

1422
01:16:42,040 --> 01:16:42,720
That's great.

1423
01:16:42,720 --> 01:16:43,400
Sugu: Correct,
Nikolay: yeah.

1424
01:16:44,440 --> 01:16:46,360
That's mature architecture already.

1425
01:16:47,380 --> 01:16:52,240
Sugu: That actually makes it easier to achieve 5 nines uptime.

1426
01:16:53,520 --> 01:16:55,120
Because that's the way you calculate.

1427
01:16:55,120 --> 01:16:58,600
Like if only 1 node is down, you divide it by the number of users

1428
01:16:58,920 --> 01:16:59,620
Nikolay: being affected.

1429
01:17:00,260 --> 01:17:00,760
Budget.

1430
01:17:00,880 --> 01:17:01,195
Sugu: Downtime budget.

1431
01:17:01,195 --> 01:17:01,510
Yeah, budget.

1432
01:17:01,510 --> 01:17:02,010
Yeah.
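
The arithmetic behind that: 5 nines allows roughly 5.3 minutes of downtime per year, and with sharding an incident only charges the affected fraction of users against that budget. A toy calculation with made-up shard counts:

```python
minutes_per_year = 365 * 24 * 60
budget = minutes_per_year * (1 - 0.99999)  # five nines: about 5.26 minutes/year

shards = 50            # hypothetical fleet size
incident_minutes = 10  # one shard down for 10 minutes
charged = incident_minutes / shards  # only 1/50th of users were affected

print(f"yearly budget:   {budget:.2f} user-minutes")
print(f"incident charge: {charged:.2f} user-minutes")
```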

1433
01:17:02,240 --> 01:17:02,940
Nikolay: That's good.

1434
01:17:03,660 --> 01:17:04,160
Cool.

1435
01:17:04,540 --> 01:17:08,340
I think it's maybe 1 of the longest episodes we've had.

1436
01:17:08,760 --> 01:17:09,380
Enjoyed it.

1437
01:17:09,380 --> 01:17:10,220
Oh my God.

1438
01:17:12,040 --> 01:17:12,800
I enjoyed it.

1439
01:17:12,800 --> 01:17:17,060
I hope we will continue the discussion
of issues with logical replication,

1440
01:17:17,080 --> 01:17:20,320
for example, and maybe
whether things will be improved,

1441
01:17:20,320 --> 01:17:20,940
and so on.

1442
01:17:20,940 --> 01:17:24,640
Looking forward to testing the POC once
you have it.

1443
01:17:24,720 --> 01:17:25,520
Thank you so much

1444
01:17:25,520 --> 01:17:26,780
for coming.
Sugu: I am so excited.

1445
01:17:27,100 --> 01:17:27,800
Nikolay: Yeah, thank you.

1446
01:17:27,800 --> 01:17:30,220
Michael: Yeah, is there any last
things you wanted to add or

1447
01:17:30,220 --> 01:17:32,740
anything you wanted help from people
on?

1448
01:17:34,240 --> 01:17:37,940
Sugu: I would say it feels like
nothing is happening on the repository

1449
01:17:38,040 --> 01:17:41,940
except me pushing, you know, a
few things, a few changes,

1450
01:17:41,940 --> 01:17:45,360
but a huge amount of work is happening
in the background.

1451
01:17:45,520 --> 01:17:50,320
Like some of this design work
about consensus is almost

1452
01:17:50,320 --> 01:17:54,560
ready to go, and there's also hiring
going on. There are people

1453
01:17:54,560 --> 01:18:00,160
coming on board very soon, so you
will see this snowball. It's

1454
01:18:00,160 --> 01:18:03,380
a very tiny snowball right now,
but it's going to get very big

1455
01:18:04,120 --> 01:18:05,640
as momentum builds up.

1456
01:18:05,640 --> 01:18:07,360
So pretty excited about that.

1457
01:18:09,020 --> 01:18:13,780
We may still have 1 or 2 spots
open to add to the team, but it's

1458
01:18:13,780 --> 01:18:14,740
filling up fast.

1459
01:18:15,240 --> 01:18:19,840
So if any of you are very familiar...
this is a very high bar, to

1460
01:18:20,020 --> 01:18:21,260
contribute to Multigres.

1461
01:18:21,260 --> 01:18:23,760
You have to understand consensus,
you have to understand query

1462
01:18:23,760 --> 01:18:24,260
processing.

1463
01:18:25,080 --> 01:18:28,660
But if there are people who want
to contribute, we are still

1464
01:18:28,660 --> 01:18:32,420
looking for maybe 1 or 2 people
and also on the orchestration

1465
01:18:32,500 --> 01:18:34,540
side and the Kubernetes side of
things.

1466
01:18:35,140 --> 01:18:36,080
Yeah, so that's...

1467
01:18:36,940 --> 01:18:39,060
Nikolay: Do you mind a small joke
at the end?

1468
01:18:39,060 --> 01:18:40,320
Just not to finish on...

1469
01:18:40,320 --> 01:18:41,520
Sugu: I do not mind at all.

1470
01:18:41,520 --> 01:18:42,180
Let's hear it.

1471
01:18:42,180 --> 01:18:43,780
Nikolay: Yeah, so I know what you're
doing.

1472
01:18:43,780 --> 01:18:46,920
You're writing a lot of markdown
right now and then you will

1473
01:18:46,920 --> 01:18:48,000
feed it to AI.

1474
01:18:52,020 --> 01:18:52,860
Sugu: I wish!

1475
01:18:55,680 --> 01:18:56,620
Oh my god.

1476
01:18:57,660 --> 01:19:03,500
I almost hope that day never comes,
but it is so fun working on

1477
01:19:03,500 --> 01:19:04,980
this project, creating it.

1478
01:19:05,280 --> 01:19:08,140
Why would I want to give it to an
AI to do, you know?

1479
01:19:09,140 --> 01:19:09,640
Nikolay: Okay.

1480
01:19:10,520 --> 01:19:10,840
Good.

1481
01:19:10,840 --> 01:19:11,390
Thank you.

1482
01:19:11,390 --> 01:19:12,620
I enjoyed it a lot.

1483
01:19:13,140 --> 01:19:13,428
Michael: Yeah.
Yeah.

1484
01:19:13,428 --> 01:19:14,640
Thank you so much for joining us.

1485
01:19:14,640 --> 01:19:17,480
It's great to have you as part
of the Postgres community now

1486
01:19:17,480 --> 01:19:20,740
and I'm excited to see what you
get up to.

1487
01:19:21,360 --> 01:19:22,240
Sugu: And me too.

1488
01:19:23,480 --> 01:19:24,900
Michael: Wonderful, thanks so much.