1
00:00:00,060 --> 00:00:02,220
Michael: Hello and welcome to PostgresFM, a weekly show about

2
00:00:02,220 --> 00:00:03,240
all things PostgreSQL.

3
00:00:03,240 --> 00:00:05,640
My name is Michael, founder of pgMustard, and I'm joined

4
00:00:05,640 --> 00:00:07,540
as usual by Nik, founder of Postgres.AI.

5
00:00:07,540 --> 00:00:08,260
Hey Nik.

6
00:00:08,320 --> 00:00:09,160
Nikolay: Hi, Michael.

7
00:00:09,960 --> 00:00:12,520
Michael: And today we have a special guest.

8
00:00:12,520 --> 00:00:16,160
We have Harry Brundage from Gadget, co-founder and CTO there.

9
00:00:16,160 --> 00:00:17,820
Hey, Harry, thanks for joining us.

10
00:00:18,160 --> 00:00:19,440
Harry: Thanks so much for having me on.

11
00:00:19,440 --> 00:00:20,580
I'm happy to be here.

12
00:00:21,100 --> 00:00:22,240
Michael: It is our pleasure.

13
00:00:22,240 --> 00:00:25,360
So you've been doing some cool things with Postgres for a while.

14
00:00:25,600 --> 00:00:30,660
You recently blogged about a 0 downtime upgrade, but

15
00:00:30,660 --> 00:00:33,760
there's a ton of other things we wanted to talk with you about

16
00:00:33,760 --> 00:00:34,460
as well.

17
00:00:34,920 --> 00:00:36,440
Where would you like to start?

18
00:00:36,700 --> 00:00:39,900
Harry: Yeah, I'd love to tell you about this 0 downtime upgrade,

19
00:00:39,900 --> 00:00:43,180
just because, I don't know, it's like a point of pride of mine,

20
00:00:43,180 --> 00:00:43,820
I guess.

21
00:00:44,240 --> 00:00:47,200
It's always so frustrating to me when I get an email from 1 of

22
00:00:47,200 --> 00:00:47,840
my providers.

23
00:00:47,840 --> 00:00:51,940
It's like, hey, Sunday at 2 PM, we're going to be down for 30

24
00:00:51,940 --> 00:00:52,240
minutes.

25
00:00:52,240 --> 00:00:53,540
Like, hope that's OK.

26
00:00:54,800 --> 00:00:57,420
So yeah, Gadget's an infrastructure provider.

27
00:00:57,980 --> 00:01:00,920
Like, the developers who use us use us for mission-critical apps.

28
00:01:00,920 --> 00:01:04,500
So we just try to take it more seriously than that and like not,

29
00:01:04,900 --> 00:01:05,780
you know, disappear.

30
00:01:06,360 --> 00:01:10,260
So yeah, we did a 0 downtime upgrade with the help of Postgres.AI

31
00:01:11,120 --> 00:01:13,940
and Nik, thank you so much for your help, Nik, what a treat.

32
00:01:13,940 --> 00:01:17,680
Nikolay: Thank you for trusting us because you know, like, we

33
00:01:17,680 --> 00:01:20,680
did it with other companies and always like in the beginning,

34
00:01:20,680 --> 00:01:24,440
we might hear pushback, because we understand how much

35
00:01:24,440 --> 00:01:27,980
work is needed there, you know, because it's far from

36
00:01:27,980 --> 00:01:28,940
being well automated.

37
00:01:28,940 --> 00:01:32,140
It should be in Postgres and it's improving, but

38
00:01:32,980 --> 00:01:33,480
Harry: yeah.

39
00:01:34,340 --> 00:01:35,420
Not quite yet.

40
00:01:36,020 --> 00:01:36,840
Nikolay: Yeah, yeah, yeah.

41
00:01:37,660 --> 00:01:41,600
So you did this journey and I thank you for not stopping.

42
00:01:42,380 --> 00:01:43,540
Harry: Absolutely, absolutely.

43
00:01:43,860 --> 00:01:45,460
So it was a lot of work,

44
00:01:45,460 --> 00:01:46,480
Nikolay: it was, yeah.

45
00:01:46,880 --> 00:01:49,200
Harry: I think too, it's like, to me it just kind of feels like

46
00:01:49,200 --> 00:01:51,780
something that should be built into the managed providers.

47
00:01:52,540 --> 00:01:56,040
Like I'm kind of surprised that so few of them have it.

48
00:01:56,040 --> 00:01:59,540
I think the only 1 I could find that has like true 0 downtime

49
00:01:59,540 --> 00:02:00,220
is PlanetScale.

50
00:02:00,920 --> 00:02:02,480
Nikolay: But for Postgres,

51
00:02:02,720 --> 00:02:05,900
Harry: the GCP options, I don't
know about the new Postgres one.

52
00:02:05,900 --> 00:02:08,760
I guess I know they have it for
MySQL but not for Postgres yet.

53
00:02:09,020 --> 00:02:11,020
Nikolay: For MySQL, yeah, I think
so.

54
00:02:11,040 --> 00:02:15,900
I don't know details but for Postgres
I don't see it yet anywhere.

55
00:02:16,600 --> 00:02:18,480
Harry: But again, like what the
heck?

56
00:02:19,360 --> 00:02:22,120
Like as an industry shouldn't this
have been the thing we kind

57
00:02:22,120 --> 00:02:24,840
of like solved 20 years ago and
never looked back?

58
00:02:24,840 --> 00:02:26,340
You know, I don't understand.

59
00:02:26,680 --> 00:02:27,180
Nikolay: Yeah.

60
00:02:27,180 --> 00:02:29,800
And there's no AI needed there.

61
00:02:29,800 --> 00:02:31,040
It's just pure automation.

62
00:02:32,380 --> 00:02:32,880
Harry: Yeah.

63
00:02:32,980 --> 00:02:34,500
Terrifying button to press.

64
00:02:34,500 --> 00:02:35,020
Nikolay: I think.

65
00:02:35,020 --> 00:02:35,520
Harry: But

66
00:02:35,580 --> 00:02:38,600
Nikolay: yeah, I think what happens
is that it should be automated

67
00:02:38,660 --> 00:02:40,280
more in Postgres itself.

68
00:02:40,680 --> 00:02:44,580
And most managed platforms like
RDS and others, they just sit

69
00:02:44,580 --> 00:02:45,460
on the fence.

70
00:02:45,700 --> 00:02:49,540
Well, I know they work and
do it, but they don't want to

71
00:02:49,540 --> 00:02:51,640
automate things outside Postgres
too much.

72
00:02:51,640 --> 00:02:53,040
For example, ANALYZE, right?

73
00:02:53,040 --> 00:02:54,560
ANALYZE is not automated.

74
00:02:54,800 --> 00:02:56,920
You always forget about this.

75
00:02:56,920 --> 00:03:01,400
And they don't automate it, and
they just put it in documentation

76
00:03:01,740 --> 00:03:03,180
on the shoulders of users.

77
00:03:03,400 --> 00:03:04,940
Now, it's time to ANALYZE.

78
00:03:06,100 --> 00:03:10,080
And I started like asking around
like maybe a couple of years

79
00:03:10,080 --> 00:03:15,480
ago, why, why, why, 1 year ago, and
then okay, Postgres 18 should

80
00:03:15,480 --> 00:03:16,900
be out next week, right?

81
00:03:17,040 --> 00:03:20,800
Finally, dump and
restore of statistics will be

82
00:03:20,800 --> 00:03:21,520
fully automated.

83
00:03:21,580 --> 00:03:26,780
So they will just exclude this
from their how-to manual, right?

84
00:03:26,980 --> 00:03:28,120
So this is an example.
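
A minimal sketch of the manual step being described, assuming a hypothetical connection string; before Postgres 18, pg_upgrade did not carry planner statistics over, so providers' docs told users to run ANALYZE themselves after an upgrade:

    import psycopg2

    # Reconnect to the freshly upgraded cluster (DSN is hypothetical).
    conn = psycopg2.connect("host=upgraded-primary dbname=app")
    conn.autocommit = True
    with conn.cursor() as cur:
        # Rebuild planner statistics for the current database; until this
        # runs, queries can pick bad plans right after pg_upgrade.
        cur.execute("ANALYZE;")
    conn.close()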

85
00:03:28,780 --> 00:03:30,200
Things should be inside Postgres.

86
00:03:30,660 --> 00:03:33,580
But I think still, they should
automate much better.

87
00:03:34,540 --> 00:03:35,040
Indeed.

88
00:03:35,280 --> 00:03:37,220
Not waiting for Postgres itself.

89
00:03:37,700 --> 00:03:41,340
Anyway, I'm with you here, 100%.

90
00:03:42,780 --> 00:03:43,480
Harry: I'm glad.

91
00:03:43,860 --> 00:03:48,900
What I was going to say is I've
always attributed it, or the

92
00:03:48,900 --> 00:03:53,340
absence of something like a true
0 downtime major version upgrade

93
00:03:53,600 --> 00:03:57,160
to the fact that it's like more
than just Postgres that you need

94
00:03:57,160 --> 00:03:58,220
to pull it off.

95
00:03:58,380 --> 00:03:58,620
Right?

96
00:03:58,620 --> 00:04:02,560
Like you need a connection pooler
or a proxy or something in

97
00:04:02,560 --> 00:04:06,500
front of it sitting outside the
system and you need a bunch of

98
00:04:06,500 --> 00:04:09,840
tight coordination between that
proxy and the rest of the system

99
00:04:09,840 --> 00:04:14,380
to kind of orchestrate it all and
so like I know at least on

100
00:04:14,380 --> 00:04:18,200
our previous Postgres provider like they ran the PgBouncer on

101
00:04:18,200 --> 00:04:20,880
the same VM as Postgres itself.

102
00:04:21,500 --> 00:04:24,900
And so they're kind of hamstrung by the architecture where they

103
00:04:24,900 --> 00:04:28,380
can't have this piece that lives outside that can talk to 2 at

104
00:04:28,380 --> 00:04:31,720
once, whereas we self-hosted our bouncers for exactly this reason

105
00:04:31,720 --> 00:04:34,660
and can kind of do it all and orchestrate it all and it's like

106
00:04:34,760 --> 00:04:39,060
sucks to self-host it but gets you that architectural flexibility

107
00:04:39,140 --> 00:04:41,300
you know we needed that.

108
00:04:41,480 --> 00:04:44,660
Nikolay: Yeah and PgBouncer supports pause/resume which helps you

109
00:04:44,660 --> 00:04:47,000
achieve 0 downtime.
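
A minimal sketch of that pause/resume dance, assuming PgBouncer's admin console is reachable on port 6432 and a database entry named "app" (all names hypothetical):

    import psycopg2

    # Connect to PgBouncer's admin console (the special "pgbouncer" database).
    admin = psycopg2.connect(host="pgbouncer.internal", port=6432,
                             dbname="pgbouncer", user="pgbouncer")
    admin.autocommit = True  # admin commands run outside transactions
    cur = admin.cursor()

    cur.execute("PAUSE app;")   # wait for in-flight queries, hold new clients
    # ... repoint pgbouncer.ini's "app" entry at the new primary here ...
    cur.execute("RELOAD;")      # re-read the config
    cur.execute("RESUME app;")  # release held clients onto the new server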

110
00:04:47,980 --> 00:04:51,760
Harry: We're maybe nuts too and we run our PgBouncers in K8s

111
00:04:52,300 --> 00:04:56,960
and do the whole rolling restart and whatever and it's turned

112
00:04:56,960 --> 00:05:00,140
out to be, I think, an okay choice.

113
00:05:00,400 --> 00:05:02,660
I think a lot of people maybe shy away from that because it's

114
00:05:02,660 --> 00:05:04,420
sort of super stateful, super critical.

115
00:05:04,700 --> 00:05:06,340
Maybe this is a pet, not cattle.

116
00:05:06,340 --> 00:05:08,520
But we did it in K8s just because that's where the rest of our

117
00:05:08,520 --> 00:05:11,280
compute runs and it's like easy to sort of bin pack and share

118
00:05:11,280 --> 00:05:11,920
it all.

119
00:05:12,040 --> 00:05:15,860
It actually really saved our butts recently because we're paying

120
00:05:15,860 --> 00:05:20,960
a lot for inter-zonal network transit to our databases.

121
00:05:22,200 --> 00:05:26,980
Specifically, like once from our application pod to the bouncer,

122
00:05:27,180 --> 00:05:29,980
and then again from the bouncer to the database.

123
00:05:31,020 --> 00:05:34,780
And because the bouncer was in K8s, we could turn on the handy-dandy

124
00:05:35,540 --> 00:05:40,020
like prefer zone local network routing configurations and kind

125
00:05:40,020 --> 00:05:43,620
of have the application pod go to a zone local bouncer and then

126
00:05:43,620 --> 00:05:47,540
to the database. And I think, were we outside of K8s, that would

127
00:05:47,540 --> 00:05:50,940
have been a lot more annoying to kind of like, teach all the

128
00:05:50,940 --> 00:05:53,620
routing to be able to do the right thing.

129
00:05:54,560 --> 00:05:55,620
Nik, you look scared.

130
00:05:56,900 --> 00:05:57,820
You look skeptical.

131
00:05:58,040 --> 00:06:01,460
Nikolay: No, no, I'm just thinking this is also a job for the database

132
00:06:01,460 --> 00:06:03,880
platform, I think, to take care of such things.

133
00:06:04,360 --> 00:06:07,680
So yeah, and I know RDS has RDS Proxy, but

134
00:06:07,680 --> 00:06:11,520
it has cool features PgBouncer doesn't, but it also lacks cool

135
00:06:11,520 --> 00:06:15,860
features PgBouncer has, so it's a zoo. And I think pause/resume

136
00:06:15,860 --> 00:06:17,460
is a wonderful feature.

137
00:06:17,980 --> 00:06:20,240
Still, I have to emphasize it.

138
00:06:20,240 --> 00:06:22,860
Michael: I was going to ask about your PgBouncer setup, actually.

139
00:06:22,860 --> 00:06:25,520
Because your blog post on the upgrade is really cool, and goes

140
00:06:25,520 --> 00:06:28,220
into good detail on the actual upgrade process.

141
00:06:28,940 --> 00:06:34,040
But you didn't talk much about your PgBouncer setup, and how

142
00:06:34,040 --> 00:06:35,180
many you're running.

143
00:06:36,340 --> 00:06:38,860
And does all of your application stuff go through that?

144
00:06:38,860 --> 00:06:41,260
Or does any of it skip it going to the database?

145
00:06:41,400 --> 00:06:43,940
Is there anything interesting about that that you wanted to share

146
00:06:43,940 --> 00:06:44,640
with people?

147
00:06:45,060 --> 00:06:45,860
Harry: Good question.

148
00:06:46,400 --> 00:06:48,220
I have 2 interesting tidbits.

149
00:06:48,900 --> 00:06:52,280
The first is, I think, our PgBouncer deployment's a bit

150
00:06:52,280 --> 00:06:54,300
weird, and then we have 9 instances.

151
00:06:55,200 --> 00:06:55,700
Michael: Interesting.

152
00:06:55,920 --> 00:06:58,520
Harry: Yeah, and we take our kind
of total connection count,

153
00:06:58,520 --> 00:07:02,640
where we're targeting 400 kind
of server-side connections, and

154
00:07:02,640 --> 00:07:07,220
split that up over 9. And the reason
for that is that same K8s

155
00:07:07,240 --> 00:07:12,560
network-aware routing thing, where
up until very

156
00:07:12,560 --> 00:07:18,340
recent versions of Kubernetes,
you needed 3 endpoints per availability

157
00:07:18,580 --> 00:07:21,500
zone for the network-aware routing
to kick in.

158
00:07:21,580 --> 00:07:25,440
So we had to use a minimum of 9
and initially we were really

159
00:07:25,440 --> 00:07:28,300
skeptical about that because it
seems kind of bad, right? Instead

160
00:07:28,300 --> 00:07:32,220
of having like say 1 or 2 buckets
of a lot of connections, we

161
00:07:32,220 --> 00:07:33,900
were kind of brokering them fairly.

162
00:07:34,000 --> 00:07:37,000
You have many small buckets and
you could get kind of weird inefficiencies,

163
00:07:37,200 --> 00:07:39,180
but it didn't come to pass where
that bit us.

164
00:07:39,180 --> 00:07:41,140
It seemed actually okay in the
end.

165
00:07:41,540 --> 00:07:45,760
And what I like about it is we
have a lot of CPU headroom, right?

166
00:07:45,760 --> 00:07:48,360
Like PgBouncer, I think is still
largely single-threaded.

167
00:07:48,500 --> 00:07:52,540
I don't know the innermost details
of it, but we're using the

168
00:07:52,540 --> 00:07:55,680
fancy new prepared statement stuff
within it, and so I'm kind

169
00:07:55,680 --> 00:08:00,400
of happy that I can give it 9 cores
instead of 2 or 4 or something

170
00:08:00,400 --> 00:08:02,580
like that, and kind of be like,
you know.

171
00:08:04,160 --> 00:08:05,580
Take as much as you need.

172
00:08:07,200 --> 00:08:07,800
Michael: Yeah, nice.

173
00:08:07,800 --> 00:08:08,980
Always good to have headroom.

174
00:08:09,720 --> 00:08:10,220
Cool.

175
00:08:10,940 --> 00:08:14,640
Anything else about the upgrade,
or like which bits were most

176
00:08:14,640 --> 00:08:18,200
scary when Nik came to you with
the proposed plan, you said

177
00:08:18,200 --> 00:08:21,440
pushing the button was terrifying. Is it because
it's single-direction,

178
00:08:21,500 --> 00:08:23,160
or did you have a rollback plan?

179
00:08:23,680 --> 00:08:24,440
Harry: Good question.

180
00:08:24,440 --> 00:08:28,140
We had a rollback plan, but it
was written rather than tested.

181
00:08:28,140 --> 00:08:32,120
Because we sort of, like... so the
way that we pulled it off, or

182
00:08:32,120 --> 00:08:35,140
the way that we kind of built the
whole thing is we built a Temporal

183
00:08:35,280 --> 00:08:35,780
workflow.

184
00:08:36,040 --> 00:08:39,960
So we're like huge fans of Temporal,
very easy to kind of script

185
00:08:39,960 --> 00:08:41,140
something like this.

186
00:08:41,520 --> 00:08:44,280
And the reason why we did this
is because it lets us get all

187
00:08:44,280 --> 00:08:47,860
the nice kind of retries and idempotency
stuff for free, where

188
00:08:47,860 --> 00:08:50,940
we get to write in a somewhat failure
oblivious way.

189
00:08:51,340 --> 00:08:54,260
And we wrote our workflow to have
like a dry run capability,

190
00:08:54,640 --> 00:08:57,180
where we do all but the very last
step.

191
00:08:57,180 --> 00:09:00,580
So we would go do everything where
we like prepared the new host,

192
00:09:01,020 --> 00:09:04,440
kind of replicated the data, made
sure we were caught up, did

193
00:09:04,440 --> 00:09:09,020
the PgBouncer pause, and then
we just wouldn't reload the config

194
00:09:09,020 --> 00:09:11,520
or change over and we'd just do
a resume and everything would

195
00:09:11,520 --> 00:09:13,820
keep on going as it was.
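
A minimal sketch of a workflow shaped like the one Harry describes, using Temporal's Python SDK; the activity names and the dry_run flag are assumptions based on his description, not Gadget's actual code:

    from datetime import timedelta
    from temporalio import workflow

    @workflow.defn
    class ZeroDowntimeUpgrade:
        @workflow.run
        async def run(self, dry_run: bool) -> None:
            # Each step is an activity: retried and idempotent, so this
            # code can stay mostly failure-oblivious.
            for step in ("prepare_new_host", "replicate_data",
                         "wait_for_catchup", "pause_pgbouncer"):
                await workflow.execute_activity(
                    step, start_to_close_timeout=timedelta(hours=1))
            if not dry_run:
                # The one step the dry runs never exercised.
                await workflow.execute_activity(
                    "switch_config_and_reload",
                    start_to_close_timeout=timedelta(minutes=5))
            await workflow.execute_activity(
                "resume_pgbouncer",
                start_to_close_timeout=timedelta(minutes=5))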

196
00:09:14,340 --> 00:09:17,520
And so the terrifying part is we'd
just never actually done the

197
00:09:17,520 --> 00:09:18,260
middle thing.

198
00:09:18,480 --> 00:09:21,340
And we weren't really sure, like,
here's how we can test this,

199
00:09:21,340 --> 00:09:24,500
other than like standing up a whole
parallel set of infrastructure,

200
00:09:24,520 --> 00:09:27,100
which we were like, I'm not even
sure that will prove this 1

201
00:09:27,100 --> 00:09:28,140
will work, right?

202
00:09:28,860 --> 00:09:34,180
Like, just in my admittedly still
limited lifespan, like, the

203
00:09:34,180 --> 00:09:36,840
things that go wrong are always
the things you didn't foresee,

204
00:09:36,960 --> 00:09:37,240
right?

205
00:09:37,240 --> 00:09:40,380
So it's like we dry-runned it a
million times, found all the

206
00:09:40,380 --> 00:09:43,320
things we could foresee, and then
it was like the spooky, you

207
00:09:43,320 --> 00:09:43,740
know.

208
00:09:43,740 --> 00:09:46,780
Nikolay: Yeah, so it's really like...
it's impossible to test switchover

209
00:09:46,840 --> 00:09:50,640
in production, but I wanted to
emphasize it's possible to test

210
00:09:50,640 --> 00:09:51,820
everything before that.

211
00:09:51,820 --> 00:09:52,760
And it's great.

212
00:09:53,500 --> 00:09:55,220
Because this is the beauty of this
procedure.

213
00:09:55,240 --> 00:09:58,840
You create the whole cluster, which
is already running the new

214
00:09:58,840 --> 00:09:59,340
version.

215
00:10:00,060 --> 00:10:01,900
You see it's working.

216
00:10:02,900 --> 00:10:05,680
You can touch it and it's great,
and so on.

217
00:10:05,680 --> 00:10:09,800
But yeah, I agree, the scariest
part is the switchover, how to test

218
00:10:09,800 --> 00:10:10,460
it properly.

219
00:10:10,520 --> 00:10:13,820
Of course, it was tested in non-production,
but it's different.

220
00:10:16,460 --> 00:10:18,460
Here we just need to jump at some
point.

221
00:10:19,020 --> 00:10:19,520
Harry: Exactly.

222
00:10:19,640 --> 00:10:22,080
It was the best kind of jump, where
it was like a big meeting

223
00:10:22,080 --> 00:10:24,840
that we were all like scared of,
we all like sat down, Nik was

224
00:10:24,840 --> 00:10:27,720
on the phone, and it was over in
like 7 minutes, where it was

225
00:10:27,720 --> 00:10:29,380
like, oh, well, I guess it worked.

226
00:10:29,380 --> 00:10:30,780
Like all the numbers look good.

227
00:10:30,780 --> 00:10:32,720
Like yeah, that's it.

228
00:10:32,940 --> 00:10:33,820
Onwards you know.

229
00:10:33,820 --> 00:10:37,360
So I feel fortunate to work with
a team where, you know, that

230
00:10:37,360 --> 00:10:41,200
was how it went as opposed to like
us screaming around like,

231
00:10:41,200 --> 00:10:42,040
chickens with their heads cut off.

232
00:10:42,040 --> 00:10:45,820
Nikolay: I wanted to thank you
for this, like emphasizing that

233
00:10:45,820 --> 00:10:50,340
you use this Temporal framework
because we, like Postgres.AI, we're

234
00:10:50,340 --> 00:10:54,720
always big fans of Go, and we've
developed a lot of our

235
00:10:54,720 --> 00:10:55,020
stuff using Go.

236
00:10:55,020 --> 00:10:59,060
But recently, our team, we have
a few Python guys right now.

237
00:10:59,060 --> 00:11:02,780
And I just envision that probably
we'll learn from you in this

238
00:11:02,780 --> 00:11:06,600
experience as well and do some
stuff with Temporal as well.

239
00:11:06,600 --> 00:11:07,980
So thank you for emphasizing that.

240
00:11:09,020 --> 00:11:11,040
Harry: Yeah, we're heavy, heavy
users of it.

241
00:11:11,040 --> 00:11:15,460
We use it for these kinds of systems
administration tasks or

242
00:11:15,480 --> 00:11:17,000
behind-the-scenes maintenance workflows.

243
00:11:17,220 --> 00:11:19,300
But we use it for a lot of other
stuff too.

244
00:11:19,820 --> 00:11:23,980
Like Gadget is mostly used today
by people building Shopify apps.

245
00:11:24,400 --> 00:11:27,780
And when you build Shopify apps,
you have to receive a metric

246
00:11:28,460 --> 00:11:29,980
shit ton of webhooks.

247
00:11:30,660 --> 00:11:34,180
And so Gadget has like a background
kind of queuing system

248
00:11:34,180 --> 00:11:37,240
that knows how to take a webhook
and process it later or process

249
00:11:37,240 --> 00:11:38,680
it with retries or whatever.

250
00:11:38,680 --> 00:11:41,660
So each 1 of those is a Temporal
workflow for us and so we're

251
00:11:41,660 --> 00:11:45,900
doing I don't know something like
3k, 4k Temporal state transitions

252
00:11:46,000 --> 00:11:51,140
per second also backed by Postgres
in this big-ass Temporal cluster.

253
00:11:51,420 --> 00:11:53,860
And that's something I never ever
want to have to try to build

254
00:11:53,860 --> 00:11:54,360
myself.

255
00:11:54,520 --> 00:11:59,980
Like it's so complicated and nasty
and fun for some.

256
00:12:00,660 --> 00:12:04,400
Michael: You mentioned in the post
also that sharding was on

257
00:12:04,400 --> 00:12:05,420
your to-do list.

258
00:12:05,920 --> 00:12:06,420
Yes.

259
00:12:07,360 --> 00:12:10,560
You mentioned though you were using
a different database for

260
00:12:10,560 --> 00:12:15,200
that now, like AlloyDB specifically
I think for the new setup,

261
00:12:15,520 --> 00:12:17,800
but you said it was just for the
user data.

262
00:12:17,960 --> 00:12:20,640
What's the separation there with
like the Shopify?

263
00:12:20,740 --> 00:12:24,300
Is it the data of the Shopify apps,
for example?

264
00:12:24,840 --> 00:12:28,580
Harry: Yeah, so Gadget is like
a general purpose app development

265
00:12:28,580 --> 00:12:29,080
platform.

266
00:12:29,240 --> 00:12:34,540
So if you want to make a to-do
list tracker or a ChatGPT clone

267
00:12:34,540 --> 00:12:38,080
or a Shopify app, you can sign
up for Gadget and you get kind

268
00:12:38,080 --> 00:12:40,520
of like a full stack dev environment.

269
00:12:41,060 --> 00:12:44,060
So you get a place to run kind
of backend code.

270
00:12:44,060 --> 00:12:45,400
It's all JavaScript, TypeScript.

271
00:12:45,480 --> 00:12:48,800
You get a database or a sort of
virtual slice of a database to

272
00:12:48,800 --> 00:12:49,920
store back-end data.

273
00:12:49,920 --> 00:12:52,200
You get a front-end and React and
whatever and you're off to

274
00:12:52,200 --> 00:12:52,720
the races.

275
00:12:52,720 --> 00:12:55,020
You write back-end code, you write
front-end code, you define

276
00:12:55,020 --> 00:12:58,460
your data models, you transact,
so on and so forth.

277
00:12:58,460 --> 00:13:01,540
And then the Shopify part comes
in where we kind of have a managed

278
00:13:01,560 --> 00:13:05,640
API connection to Shopify, where
we'll do the OAuth and the webhooks

279
00:13:06,260 --> 00:13:09,440
and sync the historical data, et
cetera, et cetera.

280
00:13:10,140 --> 00:13:13,200
So like many infrastructure providers,
we have a control plane

281
00:13:13,200 --> 00:13:14,240
and a data plane.

282
00:13:14,380 --> 00:13:19,440
The control plane stores our list
of users, who has what

283
00:13:19,440 --> 00:13:21,180
access to what app, so on and so
forth.

284
00:13:21,180 --> 00:13:25,360
Then the data plane is the actual
apps' data, like this app has

285
00:13:25,360 --> 00:13:28,480
a to-do model that stores, you
know, this many to-dos.

286
00:13:28,480 --> 00:13:31,780
This app has the Shopify product
model and, you know, stores

287
00:13:31,780 --> 00:13:33,060
this many Shopify products.

288
00:13:33,620 --> 00:13:37,540
So when we started we just
stuck those in the same Postgres.

289
00:13:38,000 --> 00:13:38,200
Yeah.

290
00:13:38,200 --> 00:13:43,160
And that Postgres got too big and
so this major version upgrade

291
00:13:43,380 --> 00:13:46,800
kind of precipitated splitting
the control plane and the data

292
00:13:46,800 --> 00:13:50,580
plane, which I think is like a pretty
obvious best practice.

293
00:13:51,180 --> 00:13:53,940
Michael: When you say too big,
by which metric? Like, where

294
00:13:53,940 --> 00:13:56,820
was the bottleneck
coming from?

295
00:13:57,100 --> 00:14:01,380
Harry: Good question, so we had
about 8 terabytes of data in

296
00:14:01,380 --> 00:14:04,920
the 1 instance, or sorry, our disk
was 8 terabytes big and we

297
00:14:04,920 --> 00:14:08,160
were at 6 or something like that,
6.5, and we're like, okay,

298
00:14:08,160 --> 00:14:12,240
this is scary, like we gotta figure
out the next thing, and we

299
00:14:12,240 --> 00:14:18,620
were on PG13, which was end of
life long ago, and so we needed

300
00:14:18,740 --> 00:14:24,360
to both do the upgrade and we just
wanted kind of a better roadmap

301
00:14:24,520 --> 00:14:26,620
for how we were gonna deal with
growth.

302
00:14:27,180 --> 00:14:31,260
Because Gadget has lots of these
e-commerce customers, BFCM is

303
00:14:31,260 --> 00:14:37,040
always a quote fun unquote time
for us and like we'll see probably

304
00:14:37,040 --> 00:14:41,240
this year like a 4x traffic increase
and like a 4x database QPS

305
00:14:41,240 --> 00:14:45,620
increase and so that means like
we kind of need to be ready for

306
00:14:45,620 --> 00:14:51,340
a 4xing of you know a database
CPU and our provider at the time

307
00:14:51,340 --> 00:14:52,640
didn't sell a big enough 1.

308
00:14:52,640 --> 00:14:55,320
So we were kind of like, we have
to do the horizontal shard,

309
00:14:55,320 --> 00:14:57,280
or scaling sort of journey.

310
00:14:57,900 --> 00:15:01,560
So what we did is we sharded in
order to move the bulk of that

311
00:15:01,560 --> 00:15:05,320
6 terabytes out into newer instances
that had much more room

312
00:15:05,320 --> 00:15:09,020
to grow, leaving only the reasonably
sized control plane left

313
00:15:09,020 --> 00:15:11,200
over to do this major version upgrade.

314
00:15:11,200 --> 00:15:14,220
And it simplified a few of the
steps of this 0 downtime major

315
00:15:14,220 --> 00:15:16,980
version upgrade, because we didn't
have to replay 6 terabytes

316
00:15:17,040 --> 00:15:19,100
worth of stuff on the new instance.

317
00:15:19,540 --> 00:15:20,320
Michael: Yeah, yeah.

318
00:15:20,320 --> 00:15:21,020
Makes sense.

319
00:15:21,020 --> 00:15:24,500
But why not stick with the existing
provider for those shards?

320
00:15:25,080 --> 00:15:27,260
Harry: 2 reasons, or 3 reasons.

321
00:15:27,780 --> 00:15:31,720
The first is we're a Google shop,
and we get good pricing from

322
00:15:31,720 --> 00:15:32,220
Google.

323
00:15:33,340 --> 00:15:39,740
The second is Google claims up
to a 4X TPS increase with AlloyDB.

324
00:15:40,840 --> 00:15:44,240
Like, I like Google and I don't
want to poo-poo them.

325
00:15:44,240 --> 00:15:47,740
I'm not sure I've ever seen it
actually hit 4X, but we did see

326
00:15:48,440 --> 00:15:53,600
like at least a 1.5X kind of like
TPS increase without really

327
00:15:53,600 --> 00:15:56,260
much config change or tuning or
whatever.

328
00:15:56,660 --> 00:16:00,200
I can't even say if it's actually
the same CPUs like under the

329
00:16:00,200 --> 00:16:00,840
hood, you know what I mean?

330
00:16:00,840 --> 00:16:04,240
I don't know what the initial provider
was using, but the instance

331
00:16:04,240 --> 00:16:06,580
was faster for the same price,
so that was nice.

332
00:16:06,780 --> 00:16:12,340
And then the other big thing with
Alloy is you don't pay inter-zonal

333
00:16:12,440 --> 00:16:14,440
network transit fees.

334
00:16:14,680 --> 00:16:16,360
It's not like a normal VM.

335
00:16:17,280 --> 00:16:21,460
I present that without comment
to the competitive forces that

336
00:16:21,460 --> 00:16:22,860
may or may not be listening.

337
00:16:23,000 --> 00:16:26,760
But yeah, it's free networking
to the Alloy instance.

338
00:16:26,980 --> 00:16:31,300
Whereas if you're Aiven or Crunchy
or whatever and you run instances,

339
00:16:31,560 --> 00:16:34,200
your customers have to pay from
2 out of 3 zones just to talk

340
00:16:34,200 --> 00:16:35,340
to it, so that sucks.

341
00:16:35,740 --> 00:16:39,360
And then the second thing is you
don't pay for replica storage.

342
00:16:40,080 --> 00:16:46,420
It's a split storage compute sort
of setup, where the disk is

343
00:16:46,420 --> 00:16:50,580
this sort of auto-scaled elastic
fancy log storage service thing,

344
00:16:50,740 --> 00:16:53,720
and then you pay for your nodes
that kind of read and write from

345
00:16:53,720 --> 00:16:56,780
that shared tier, but you don't
pay n times, you pay once.

346
00:16:57,100 --> 00:17:01,360
So it's like 40-ish cents a gigabyte,
which is like more expensive

347
00:17:01,620 --> 00:17:06,660
than like say a big fat EBS volume
but not that much more and

348
00:17:06,660 --> 00:17:09,400
so if you're running like a lot
of replicas it ends up being

349
00:17:09,400 --> 00:17:13,280
cheaper storage-wise and you don't
have like all the nasty IO

350
00:17:13,280 --> 00:17:15,560
spikes from backups or any of that stuff.

351
00:17:16,460 --> 00:17:16,960
Cool.

352
00:17:17,660 --> 00:17:22,440
Nikolay: And unlike Aurora, I don't see a hidden cost here. With

353
00:17:22,440 --> 00:17:24,660
Aurora, it's usually the I/O costs.

354
00:17:25,920 --> 00:17:27,720
Harry: Yeah, no I/O costs.

355
00:17:27,720 --> 00:17:28,680
Nikolay: Yeah, that's interesting.

356
00:17:28,980 --> 00:17:32,400
Harry: So our database bill came out to be about the same as

357
00:17:32,400 --> 00:17:38,040
a pretty classic like GCP VM with, you know, there's no EBS

358
00:17:38,100 --> 00:17:43,980
in GCP, but like fat kind of SSD volumes backing it, and we save

359
00:17:43,980 --> 00:17:48,160
money on the networking, we save money on not having to pay for

360
00:17:48,160 --> 00:17:52,080
storage again for the replicas, and we deleted our I/O spike

361
00:17:52,080 --> 00:17:53,420
during backups problem.

362
00:17:53,480 --> 00:17:54,780
Nikolay: That's an interesting comparison.

363
00:17:54,840 --> 00:17:58,600
So we don't even talk about features like this column store, or

364
00:17:58,600 --> 00:17:59,980
column versus row store.

365
00:18:00,740 --> 00:18:02,880
Even without that, it's a direct comparison.

366
00:18:02,940 --> 00:18:03,440
Interesting.

367
00:18:04,240 --> 00:18:04,740
Harry: Yeah.

368
00:18:04,900 --> 00:18:07,440
We intend to make use of that column store.

369
00:18:07,440 --> 00:18:08,720
Like, I'm skeptical.

370
00:18:09,020 --> 00:18:12,080
I used to work on sort of like the big data warehousing team

371
00:18:12,080 --> 00:18:16,020
at Shopify when I worked there and like I just don't really believe

372
00:18:16,320 --> 00:18:22,000
that you can force Postgres to do the kind of super wide throughput

373
00:18:22,280 --> 00:18:25,400
stuff you need for crazy scale analytics.

374
00:18:25,560 --> 00:18:27,980
But you can definitely get pretty far with it.

375
00:18:27,980 --> 00:18:29,000
You know what I mean?

376
00:18:29,340 --> 00:18:33,000
So my guess is some of our operational reporting use cases where

377
00:18:33,000 --> 00:18:36,340
you're traversing half a million, a million rows to answer a query

378
00:18:36,580 --> 00:18:38,990
on a read replica with this column store is going to be fine.

379
00:18:38,990 --> 00:18:43,400
But in the like billion, 10 billion scale, I think purpose-built

380
00:18:43,480 --> 00:18:45,920
analytics databases still should exist.

381
00:18:47,300 --> 00:18:49,440
Michael: We've had a few of those conversations recently.

382
00:18:49,640 --> 00:18:50,620
Harry: I'm sure, I'm sure.

383
00:18:50,620 --> 00:18:53,660
Michael: I had 1 more boring question on the major version upgrade

384
00:18:53,660 --> 00:18:55,520
before we move on to more interesting things.

385
00:18:55,520 --> 00:18:58,180
It looks like you're on Crunchy Data, you were on Crunchy Data

386
00:18:58,180 --> 00:19:01,360
before for the control plane stuff and are going to be again

387
00:19:01,360 --> 00:19:03,060
for the upgrade.

388
00:19:03,980 --> 00:19:06,700
But you only upgraded I think from 13, which by the way you're

389
00:19:06,700 --> 00:19:07,540
a bit harsh on yourself.

390
00:19:07,540 --> 00:19:09,160
I don't think it's quite yet out of support.

391
00:19:09,160 --> 00:19:09,980
I think it's

392
00:19:10,080 --> 00:19:11,680
Harry: like a

393
00:19:11,680 --> 00:19:14,360
Michael: couple of months time that'll be out of support.

394
00:19:15,060 --> 00:19:18,260
But you only moved to 15 and I think Crunchy have always been

395
00:19:18,260 --> 00:19:21,960
very good at supporting the latest versions so probably 17 is

396
00:19:21,960 --> 00:19:22,460
available.

397
00:19:22,540 --> 00:19:23,140
So why not?

398
00:19:23,140 --> 00:19:24,980
Like why 15 not 16 or 17?

399
00:19:25,320 --> 00:19:27,660
Harry: Yes, Crunchy has been great at that kind of thing.

400
00:19:28,140 --> 00:19:30,540
I think the main reason was...

401
00:19:32,220 --> 00:19:32,980
Shoot, you know what?

402
00:19:32,980 --> 00:19:34,020
I don't even remember.

403
00:19:35,380 --> 00:19:36,380
Michael: Maybe Nik knows.

404
00:19:36,580 --> 00:19:37,800
Nikolay: I also don't remember.

405
00:19:37,800 --> 00:19:42,080
I remember that Crunchy had different
end of life dates somehow.

406
00:19:42,340 --> 00:19:43,860
Crunchy Bridge had different dates.

407
00:19:43,860 --> 00:19:47,320
Harry: Yeah, they deprecate support,
or they stop sort of long-term

408
00:19:47,320 --> 00:19:52,000
support before the full-on end
of the kind of core support.

409
00:19:52,480 --> 00:19:56,480
But they, in fairness, they gave
us extreme advance warning

410
00:19:56,480 --> 00:19:59,480
and like we knew it was coming
for a long time and they supported

411
00:19:59,480 --> 00:20:03,460
it for 4 years or whatever, which
is, I think, pretty good.

412
00:20:03,840 --> 00:20:04,340
I don't remember.

413
00:20:04,340 --> 00:20:07,760
I think it might have been a desire
to keep the versions homogeneous

414
00:20:08,040 --> 00:20:09,860
across all our instances.

415
00:20:10,400 --> 00:20:10,900
Sure.

416
00:20:11,940 --> 00:20:15,980
And for really silly reasons, the
Alloy shards that we stood

417
00:20:15,980 --> 00:20:20,260
up were stuck on 15 and couldn't
be newer.

418
00:20:20,660 --> 00:20:22,260
Michael: Yes, that would explain
it.

419
00:20:22,260 --> 00:20:26,760
So, Alloy, I think because it's
a fork and it's probably quite

420
00:20:26,760 --> 00:20:29,240
hard to maintain, I don't
think they're as good at keeping

421
00:20:29,240 --> 00:20:30,840
up with the major versions.

422
00:20:30,840 --> 00:20:33,240
So, there's a good chance you wouldn't
even have the option of

423
00:20:33,240 --> 00:20:34,820
going to 16 or 17 on there.

424
00:20:34,820 --> 00:20:35,320
I don't know

425
00:20:35,320 --> 00:20:35,570
Harry: for sure.

426
00:20:35,570 --> 00:20:35,940
That's correct.

427
00:20:35,940 --> 00:20:36,580
That's correct.

428
00:20:36,580 --> 00:20:40,640
At the time, 16 was GA and 17 wasn't
even out yet.

429
00:20:40,640 --> 00:20:43,640
Now 17 is out, but it's still in
preview.

430
00:20:43,780 --> 00:20:46,640
Like, you can't, shouldn't run
production workloads on Alloy

431
00:20:46,640 --> 00:20:47,140
17.

432
00:20:47,440 --> 00:20:51,100
But the weirdest thing was we wanted
to go to 16, but they just

433
00:20:51,100 --> 00:20:54,640
don't have pg_squeeze support on
their 16 version.

434
00:20:54,720 --> 00:20:57,980
They have it on 15 and they have
it on 17 and not on 16 and couldn't

435
00:20:57,980 --> 00:21:01,020
tell you why, but we squeeze a
lot, so we really cared about

436
00:21:01,020 --> 00:21:01,520
that.

437
00:21:02,580 --> 00:21:04,040
Michael: That makes so much sense.

438
00:21:04,900 --> 00:21:07,760
In my head, the only reason I could
think of is because now you're

439
00:21:07,760 --> 00:21:10,480
so good at major version upgrades,
you wanted to give yourself

440
00:21:10,680 --> 00:21:12,040
more to do in the future.

441
00:21:12,660 --> 00:21:13,820
Harry: Let's rinse and repeat.

442
00:21:13,860 --> 00:21:14,160
Yeah.

443
00:21:14,160 --> 00:21:14,660
Nice.

444
00:21:16,620 --> 00:21:17,040
Michael: All right.

445
00:21:17,040 --> 00:21:21,920
So you're running Postgres for
a lot of different workloads.

446
00:21:22,540 --> 00:21:25,440
Are there any challenges or fun
things you wanted to share on

447
00:21:25,440 --> 00:21:26,180
that front?

448
00:21:26,460 --> 00:21:29,480
Harry: I think maybe a good 1 just
for the audience and like

449
00:21:29,480 --> 00:21:31,880
if you're listening out there and
you think I'm full of it, like

450
00:21:31,880 --> 00:21:33,000
please send me an email.

451
00:21:33,000 --> 00:21:34,920
I would love to know what you think.

452
00:21:34,920 --> 00:21:41,100
But we did the classic "Postgres
is a queue" thing early in our

453
00:21:41,100 --> 00:21:41,600
life.

454
00:21:43,260 --> 00:21:46,620
And I think that, well, it's
on my mind a lot.

455
00:21:47,080 --> 00:21:50,900
So a lot of people who have like
background queuing problems

456
00:21:51,020 --> 00:21:51,720
use Redis.

457
00:21:51,960 --> 00:21:54,640
When we were at Shopify or when
I was at Shopify, we used this

458
00:21:54,640 --> 00:21:55,740
library called Resque.

459
00:21:56,280 --> 00:21:58,940
There's another really popular
1 called Sidekiq, and in Node, it's

460
00:21:58,940 --> 00:21:59,440
BullMQ.

461
00:21:59,960 --> 00:22:01,400
There's a litany of them, right?

462
00:22:01,400 --> 00:22:02,820
They all seem to use Redis.

463
00:22:03,340 --> 00:22:07,860
We decided that for our user, who
tends to be like a front-end

464
00:22:07,960 --> 00:22:12,540
dev or sort of someone who isn't
deeply familiar with like data

465
00:22:12,540 --> 00:22:15,920
consistency or data durability
concerns, that we didn't really

466
00:22:15,920 --> 00:22:19,060
want to be like, hey, we'll happily
process your background jobs,

467
00:22:19,200 --> 00:22:23,600
but we'll drop 0.5% of them and
not be able to tell you which

468
00:22:23,600 --> 00:22:26,400
ones we dropped and not be able
to tell you when it happened.

469
00:22:27,900 --> 00:22:30,680
So for our background queue product,
we said it's going to be

470
00:22:30,680 --> 00:22:31,180
durable.

471
00:22:31,220 --> 00:22:33,140
Not only is it going to be durable,
it's going to be transactional,

472
00:22:33,400 --> 00:22:36,760
where it's like when you do some
business logic and you enqueue

473
00:22:36,760 --> 00:22:40,600
a job within a transaction, if
the transaction commits, then

474
00:22:40,600 --> 00:22:42,240
the job will get enqueued.

475
00:22:42,440 --> 00:22:45,280
And if the transaction doesn't
commit, the job won't be enqueued,

476
00:22:45,280 --> 00:22:47,140
and that's the transaction boundary.
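
A minimal sketch of that transaction boundary, assuming a hypothetical job_outbox table: the business write and the enqueue commit or roll back together:

    import json
    import psycopg2

    def save_and_enqueue(dsn, shop_id, job):
        conn = psycopg2.connect(dsn)
        try:
            with conn:  # one transaction: commit on success, rollback on error
                with conn.cursor() as cur:
                    # The business-logic write...
                    cur.execute("UPDATE shops SET synced_at = now() WHERE id = %s",
                                (shop_id,))
                    # ...and the enqueue, atomic with it.
                    cur.execute("INSERT INTO job_outbox (payload, state) "
                                "VALUES (%s, 'queued')", (json.dumps(job),))
        finally:
            conn.close()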

477
00:22:47,500 --> 00:22:49,940
So we were like, this is nice.

478
00:22:49,940 --> 00:22:53,940
This is us kind of trying to prioritize
our customers' sanity.

479
00:22:54,440 --> 00:22:57,560
And I think it's the kind of thing
that's like annoying, because

480
00:22:57,560 --> 00:22:59,920
not that many of them are going
to notice or care.

481
00:22:59,920 --> 00:23:03,760
It's just going to be no data quality
issues for this business

482
00:23:03,760 --> 00:23:04,260
logic.

483
00:23:05,020 --> 00:23:09,820
But that locked us into needing
Postgres to be on the hot path

484
00:23:09,840 --> 00:23:10,920
for those enqueues, right?

485
00:23:10,920 --> 00:23:14,960
We have to participate in that
business logic transaction to

486
00:23:14,960 --> 00:23:17,180
kind of save the thing.

487
00:23:17,720 --> 00:23:20,640
We knew we didn't want to build
a giant queue of our own.

488
00:23:20,640 --> 00:23:23,860
So we knew we were going to get
it into Temporal and let Temporal

489
00:23:23,860 --> 00:23:27,600
take over, but we sort of needed
like a Postgres outbox, if that

490
00:23:27,600 --> 00:23:30,400
makes sense, where it's like, you
write to the outbox and then,

491
00:23:30,400 --> 00:23:33,280
you know, you're slowly forklifting
out of the outbox into the

492
00:23:33,280 --> 00:23:34,420
actual system eventually.

493
00:23:34,660 --> 00:23:36,180
And then Temporal itself is durable.

494
00:23:36,180 --> 00:23:36,880
And so, yeah.

495
00:23:36,880 --> 00:23:38,720
So that's the architecture we went
with.

496
00:23:38,720 --> 00:23:42,240
But as we sort of built it out
more and more, we learned about

497
00:23:42,340 --> 00:23:44,140
more limitations of Temporal.

498
00:23:45,060 --> 00:23:49,200
Like it came out of Uber from this
system that kind of predates

499
00:23:49,200 --> 00:23:50,140
it called Cadence.

500
00:23:50,380 --> 00:23:52,840
And my guess is they learned all
this stuff the hard way and

501
00:23:52,840 --> 00:23:56,940
built in a bunch of limits that
make it, you know, easier for

502
00:23:56,940 --> 00:24:00,040
the operators to guarantee certain
sort of performance characteristics,

503
00:24:00,060 --> 00:24:02,120
but really annoying for people
like me.

504
00:24:02,220 --> 00:24:05,840
So 1 of those limits is when you
start a job, there's a maximum

505
00:24:06,000 --> 00:24:08,860
payload size that's pretty small.

506
00:24:09,220 --> 00:24:10,520
I forget what it is exactly.

507
00:24:10,520 --> 00:24:13,760
I think it's like 50 megs or 5
megs or something like that.

508
00:24:14,020 --> 00:24:18,120
And then as that job executes,
it can only generate X amount

509
00:24:18,220 --> 00:24:21,960
of state transitions, as they're called,
events in its processing,

510
00:24:21,960 --> 00:24:24,280
where it can only have a certain
number of retries, basically,

511
00:24:24,280 --> 00:24:26,780
and you need to sort of re-architect
your Temporal workflows

512
00:24:26,780 --> 00:24:28,180
if you want to overcome that.

513
00:24:28,580 --> 00:24:31,960
And again, in our case, our users
are kind of like, hey Harry,

514
00:24:31,960 --> 00:24:36,540
I have this 400 megabyte video
file that I'd like to process

515
00:24:36,540 --> 00:24:37,700
in a background action.

516
00:24:38,080 --> 00:24:39,860
Can you just make it work please?

517
00:24:39,860 --> 00:24:43,140
Can you not make me like re-architect
my entire thing to toss

518
00:24:43,140 --> 00:24:44,060
it into cloud storage?

519
00:24:44,060 --> 00:24:45,520
Whatever, it's ephemeral data.

520
00:24:45,660 --> 00:24:47,380
Like I don't need it in the cloud.

521
00:24:47,540 --> 00:24:50,700
I just want to push it to this
thing and work it in the background.

522
00:24:51,260 --> 00:24:54,080
So we found that the Temporal limitations
were annoying enough

523
00:24:54,080 --> 00:24:59,340
that we actually needed to stage
a bunch of each job's data outside

524
00:24:59,380 --> 00:25:00,040
of Temporal.

525
00:25:00,480 --> 00:25:04,340
And so we ended up kind of having
this nasty hybrid where that

526
00:25:04,340 --> 00:25:09,800
Postgres outbox I mentioned sort
of became like a store for a

527
00:25:09,800 --> 00:25:12,760
bunch of extra stuff and then the
Temporal workflow, as it went,

528
00:25:12,760 --> 00:25:16,060
it would be you know transacting
with its state storage and doing

529
00:25:16,060 --> 00:25:19,480
its state transition things, but
then also updating what is now

530
00:25:19,480 --> 00:25:23,400
no longer the outbox to be our
source of truth, it powered our

531
00:25:23,400 --> 00:25:26,260
UI, it had all these extra params,
yada, yada, yada.

532
00:25:26,820 --> 00:25:29,600
So we weren't even really using
Postgres as a queue.

533
00:25:30,020 --> 00:25:33,740
We weren't having the classic SKIP
LOCKED problems and whatever,

534
00:25:34,020 --> 00:25:37,580
but we were just having like this
sort of like request scale

535
00:25:38,240 --> 00:25:42,420
data transit, like 1 webhook
would be 1 write to this outbox

536
00:25:42,640 --> 00:25:46,340
to enqueue, 1 write to start processing,
1 write to finish processing.

537
00:25:46,560 --> 00:25:49,140
So just like massive, massive tuple
turnover.
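
For contrast, the classic "Postgres is a queue" worker Harry alludes to looks roughly like this: competing workers claim rows with FOR UPDATE SKIP LOCKED so they never block each other (table and handler are hypothetical):

    import time
    import psycopg2

    def worker_loop(dsn, handle):
        conn = psycopg2.connect(dsn)
        while True:
            with conn, conn.cursor() as cur:
                cur.execute("SELECT id, payload FROM jobs WHERE state = 'queued' "
                            "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED")
                row = cur.fetchone()
                if row:
                    job_id, payload = row
                    handle(payload)  # user-supplied job handler
                    cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))
            if not row:
                time.sleep(1)  # nothing claimable; back off briefly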

538
00:25:49,860 --> 00:25:52,540
And we also needed to build a UI
on this stuff.

539
00:25:52,540 --> 00:25:55,680
So we only could get a few of them
to be like HOT updates, like

540
00:25:55,680 --> 00:25:57,040
many of them were not.

541
00:25:57,380 --> 00:26:01,660
And so we were just like destroying
this poor Postgres instance

542
00:26:02,080 --> 00:26:06,880
with what I would call like less
important data than the actual

543
00:26:06,980 --> 00:26:10,240
data plane data but still important
enough that we felt that

544
00:26:10,240 --> 00:26:11,820
merited being on a disk.

545
00:26:12,340 --> 00:26:16,780
So all that to say we ditched Postgres
for this and we switched

546
00:26:16,780 --> 00:26:22,120
to Bigtable, which is like Google's
fancy super distributed key

547
00:26:22,120 --> 00:26:25,180
value store and it friggin sucks.

548
00:26:25,860 --> 00:26:29,100
It really sucks and the reason
is no transactions, right?

549
00:26:29,100 --> 00:26:32,520
It's like a way worse model for
me as a programmer and a systems

550
00:26:32,520 --> 00:26:37,460
designer, and a way more scalable, fundamentally supersized system

551
00:26:37,460 --> 00:26:38,300
under the hood.

552
00:26:38,420 --> 00:26:42,960
But boy, oh boy, do I miss the creature comforts, we'll call

553
00:26:42,960 --> 00:26:48,260
them, of secondary indexes and transactions and a sane query

554
00:26:48,260 --> 00:26:53,160
language, let alone the gazillion and 1 features that a full

555
00:26:53,160 --> 00:26:56,020
relational system has over a key-value store like that.

556
00:26:56,320 --> 00:26:59,840
So we use Postgres, I think, by default for all our workloads.

557
00:27:00,040 --> 00:27:01,960
And then when we find something that we're like, okay, I think

558
00:27:01,960 --> 00:27:03,420
there's maybe a better fit here.

559
00:27:03,480 --> 00:27:05,680
We try to switch it out for the next best thing.

560
00:27:05,680 --> 00:27:07,800
But it's like, it's really hard to beat.

561
00:27:07,800 --> 00:27:11,100
And I know this is maybe unsurprising on a Postgres podcast,

562
00:27:11,100 --> 00:27:13,940
but it's like, I just wish I didn't have to do that.

563
00:27:13,940 --> 00:27:17,060
I could have stayed with my creature comforts and got my scale,

564
00:27:17,080 --> 00:27:17,740
you know?

565
00:27:18,560 --> 00:27:21,720
Nikolay: Yeah, so I'm curious what was the number 1 reason.

566
00:27:21,900 --> 00:27:24,300
Is it like bloat maybe or?

567
00:27:24,780 --> 00:27:29,480
Harry: Yeah, it was like bloat, tuple turnover and just like

568
00:27:29,480 --> 00:27:31,520
the fact that we were like always vacuuming.

569
00:27:32,420 --> 00:27:33,400
Nikolay: Yeah, yeah.

570
00:27:34,100 --> 00:27:35,040
I can imagine.

571
00:27:35,380 --> 00:27:37,760
Harry: Another big motivation for switching to something akin

572
00:27:37,760 --> 00:27:42,180
to Bigtable is like compute elasticity, where we can kind of

573
00:27:42,180 --> 00:27:43,980
like... Bigtable auto-scales by default.

574
00:27:43,980 --> 00:27:46,760
You can't even turn it off, I don't think, where they just add

575
00:27:46,760 --> 00:27:49,640
more nodes as demand kind of grows.

576
00:27:49,640 --> 00:27:53,660
And for us, where we're kind of like the background subsystems

577
00:27:53,860 --> 00:27:56,120
are the thing that absorb load, right?

578
00:27:56,120 --> 00:27:58,520
When you have a flash sale on your Shopify shop and you need

579
00:27:58,520 --> 00:28:01,440
to enqueue a bunch of stuff, it's quite... we need this thing

580
00:28:01,440 --> 00:28:04,700
to be able to handle unpredictable surges.

581
00:28:04,820 --> 00:28:08,160
And it was a bummer at the time, we didn't have any way to do

582
00:28:08,160 --> 00:28:11,060
a 0 downtime, like non-disruptive Postgres resize.

583
00:28:11,280 --> 00:28:14,400
Now we do, but even then I don't really want to be doing it automatically,

584
00:28:14,640 --> 00:28:16,080
like in response to load.

585
00:28:16,640 --> 00:28:19,920
I kind of want something like Bigtable that's like, 5 seconds

586
00:28:19,920 --> 00:28:22,220
later you can get your capacity expansion.

587
00:28:23,520 --> 00:28:25,960
Nikolay: Yeah, well, I can see it easily.

588
00:28:27,900 --> 00:28:34,840
Harry: I think, I think like my ideal would be a million tiny

589
00:28:34,840 --> 00:28:40,580
Postgreses and have some fancy system for like disaggregating

590
00:28:41,400 --> 00:28:43,820
a logical Postgres from a physical VM.

591
00:28:44,440 --> 00:28:47,480
Or if I have like, let's say I shard my workload into 16,000

592
00:28:47,780 --> 00:28:50,200
Postgreses, but they only run on 4 VMs.

593
00:28:50,280 --> 00:28:53,660
And then as I need more compute, they kind of split out, get

594
00:28:53,660 --> 00:28:54,940
bigger, and whatever.

595
00:28:54,940 --> 00:29:00,040
But that's like a twinkle in old man Harry's eyes.

596
00:29:00,180 --> 00:29:03,540
Nikolay: There is work in this direction by multiple teams right

597
00:29:03,540 --> 00:29:03,840
now.

598
00:29:03,840 --> 00:29:04,920
So we have good format.

599
00:29:04,920 --> 00:29:05,720
Harry: Go on.

600
00:29:05,740 --> 00:29:06,420
Go on.

601
00:29:06,420 --> 00:29:06,920
Yeah.

602
00:29:07,100 --> 00:29:07,540
Nikolay: Yeah.

603
00:29:07,540 --> 00:29:08,340
There is PgDog.

604
00:29:09,080 --> 00:29:11,600
There is Multigres and PlanetScale also.

605
00:29:12,180 --> 00:29:14,240
They're all working in this direction.

606
00:29:15,900 --> 00:29:18,440
Michael: And Multigres specifically is 1 of the things Sugu

607
00:29:18,440 --> 00:29:19,500
mentioned as well.

608
00:29:19,740 --> 00:29:23,240
From his experience at YouTube, he loved having lots of smaller

609
00:29:23,300 --> 00:29:28,280
databases and he much preferred that in terms of reliability

610
00:29:29,120 --> 00:29:31,120
and localized outages even.

611
00:29:31,260 --> 00:29:32,940
So there were just so many benefits

612
00:29:33,580 --> 00:29:37,120
Nikolay: highly automated operations are in the roadmap, like

613
00:29:37,120 --> 00:29:38,220
upgrades as well

614
00:29:38,600 --> 00:29:40,420
Michael: and rebalancing, yeah.

615
00:29:41,040 --> 00:29:46,000
Harry: Yeah, I am. Forgive me for another brief aside, but are you

616
00:29:46,000 --> 00:29:48,340
familiar with Azure Cosmos DB?

617
00:29:49,080 --> 00:29:53,000
Like the kind of main 1, not the Citus 1, the main 1.

618
00:29:54,280 --> 00:29:55,860
Michael: I thought Cosmos was Citus.

619
00:29:55,880 --> 00:29:57,840
I get confused by their names though.

620
00:29:58,780 --> 00:30:02,960
Harry: So I think there is Azure Cosmos DB for Postgres, which

621
00:30:02,960 --> 00:30:03,620
is Citus.

622
00:30:04,160 --> 00:30:07,360
But I think there's, and I'm not deeply familiar, so I may be

623
00:30:07,360 --> 00:30:09,840
totally wrong here, but I think there is a different original

624
00:30:09,920 --> 00:30:15,180
Cosmos DB that's kind of the Azure answer to Dynamo and Bigtable.

625
00:30:15,580 --> 00:30:18,780
Like they're kind of like super scale, sort of less sophisticated

626
00:30:18,840 --> 00:30:22,540
query patterns, but guaranteed scale for whenever.

627
00:30:23,080 --> 00:30:26,300
The way that it works, or the way that it scales is you specify

628
00:30:26,400 --> 00:30:30,800
like a partition key with every piece of data, and you're only

629
00:30:30,800 --> 00:30:32,940
allowed partition-local transactions.

630
00:30:33,840 --> 00:30:37,680
So it's not like a Spanner or a Cockroach where the whole key

631
00:30:37,680 --> 00:30:39,560
space can be transacted against together.

632
00:30:39,560 --> 00:30:42,720
It's just like a million tiny partitions and then every partition

633
00:30:42,720 --> 00:30:44,340
could be a max of 50 gigs.

634
00:30:44,440 --> 00:30:48,400
And to me, that's like so desirable because it means like 1 little

635
00:30:48,400 --> 00:30:53,200
partition, you get like node-local performance, but the same horizontal

636
00:30:53,240 --> 00:30:57,480
scaling properties of the super scalers, but without necessarily

637
00:30:57,840 --> 00:30:59,740
two-phase commit or Paxos or whatever.

638
00:31:00,020 --> 00:31:03,920
And so I'm hoping someone manages to do that.

639
00:31:03,920 --> 00:31:07,840
I think it's maybe, what's it called, Neki, like the Vitess

640
00:31:07,840 --> 00:31:08,540
for Postgres.

641
00:31:09,840 --> 00:31:10,440
Did I get that

642
00:31:10,440 --> 00:31:10,940
Michael: right?

643
00:31:11,400 --> 00:31:12,100
It's Neki.

644
00:31:12,440 --> 00:31:13,700
I have no idea how to pronounce it.

645
00:31:13,700 --> 00:31:14,440
Harry: Yeah, same.

646
00:31:14,440 --> 00:31:15,360
I have no idea.

647
00:31:15,360 --> 00:31:19,260
But I'm hoping that's what it is, which is like tiny little boundaries

648
00:31:19,440 --> 00:31:23,800
that force me to design my data in a way that makes the transactions

649
00:31:23,940 --> 00:31:27,160
efficient where you don't need a bunch of computers to agree

650
00:31:27,440 --> 00:31:30,020
on a transaction outcome and instead
you can just funnel them

651
00:31:30,020 --> 00:31:31,040
to the right place
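
A minimal sketch of that funneling, assuming a hypothetical fleet of small shards: hash the partition key to one Postgres and keep every transaction shard-local, so no cross-node commit protocol is needed:

    import hashlib
    import psycopg2

    SHARD_DSNS = [f"host=pg-shard-{i} dbname=app" for i in range(16)]

    def shard_for(partition_key: str) -> str:
        digest = hashlib.sha256(partition_key.encode()).digest()
        return SHARD_DSNS[int.from_bytes(digest[:4], "big") % len(SHARD_DSNS)]

    def move_item(partition_key: str, item_id: int, dest_list: int) -> None:
        # Both rows live on the same shard, so one single-node
        # transaction is enough: no 2PC, no Paxos.
        conn = psycopg2.connect(shard_for(partition_key))
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE items SET list_id = %s WHERE id = %s",
                        (dest_list, item_id))
            cur.execute("UPDATE lists SET item_count = item_count + 1 "
                        "WHERE id = %s", (dest_list,))
        conn.close()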

652
00:31:31,360 --> 00:31:36,360
Nikolay: Unfortunately, they all...
not all, but many of them came

653
00:31:36,360 --> 00:31:38,980
to me and asked, have you used
2PC?

654
00:31:41,880 --> 00:31:45,560
I said, no, I always avoided it. But they need it, unfortunately.
but so they need it, unfortunately.

655
00:31:46,220 --> 00:31:49,760
Of course, like some workload will
avoid it, but some workload

656
00:31:49,760 --> 00:31:51,440
will still need 2PC.

657
00:31:51,600 --> 00:31:53,860
This is the sad truth.

658
00:31:54,860 --> 00:31:55,360
Harry: Totally.

659
00:31:55,600 --> 00:31:56,100
Totally.

660
00:31:56,260 --> 00:32:00,420
I think like the thing like I,
so I used to work at Shopify circa,

661
00:32:00,420 --> 00:32:02,880
I don't know, 2012, 2013.

662
00:32:03,580 --> 00:32:08,220
We were running like a MySQL fleet
at the time, and we felt constrained

663
00:32:08,440 --> 00:32:11,060
by the biggest database servers
we could buy.

664
00:32:11,240 --> 00:32:15,720
Like we were sharding, but it was
still like the biggest customers

665
00:32:15,920 --> 00:32:19,200
could push the 1 instance, I don't
know what it was, probably

666
00:32:19,200 --> 00:32:23,800
64 threads, like, to the max. I'm
not really feeling that way

667
00:32:23,800 --> 00:32:24,300
anymore.

668
00:32:24,620 --> 00:32:28,160
Like I feel like there's few workloads
that don't have a natural

669
00:32:28,200 --> 00:32:33,080
partitioning where you need more
than 1 server's worth of compute

670
00:32:33,080 --> 00:32:37,200
just to satisfy 1 partition's worth
of work. I mean, maybe you

671
00:32:37,200 --> 00:32:39,320
talked about this same thing last
time. But it's like you can

672
00:32:39,320 --> 00:32:43,740
always split it up, and when you
split it up, like, you don't need

673
00:32:43,740 --> 00:32:47,500
you need, in aggregate, many computers,
but 1 transaction boundary

674
00:32:47,500 --> 00:32:49,540
doesn't need more than 1.

675
00:32:49,540 --> 00:32:53,460
And so I'm just, and like the hardware
has just gotten so much

676
00:32:53,460 --> 00:32:58,080
better in that 10 years that I'm
not sure I could grow my company

677
00:32:58,080 --> 00:33:00,220
fast enough to need that, right?

678
00:33:00,220 --> 00:33:03,280
It's like maybe if we had like
a graph problem, maybe if we were

679
00:33:03,280 --> 00:33:06,140
Facebook or whatever, but as it
stands right now.

680
00:33:06,140 --> 00:33:06,740
Nikolay: That's true.

681
00:33:06,740 --> 00:33:12,420
And on GCP, if we don't look at
AMD, we look only at Xeon Scalable

682
00:33:13,200 --> 00:33:16,380
fifth generation, the best available
C4 nodes, I think, right?

683
00:33:16,380 --> 00:33:22,280
You can get 192 cores or so, but
if you go to AWS, you will have

684
00:33:22,280 --> 00:33:26,420
even more, like almost, I think,
800 cores, also the same type

685
00:33:26,420 --> 00:33:28,580
of processor, and this is already
huge.

686
00:33:28,580 --> 00:33:30,060
So you can scale 1 node.

687
00:33:30,060 --> 00:33:30,960
Harry: It's just bonkers.

688
00:33:31,120 --> 00:33:32,420
Like, that's ridiculous.

689
00:33:32,840 --> 00:33:33,340
Yeah.

690
00:33:33,740 --> 00:33:35,820
It's a lot of dogecoins, you know?

691
00:33:36,900 --> 00:33:38,860
Sorry, so that was a long-winded
rant.

692
00:33:38,860 --> 00:33:41,820
But what more can I tell you about
that Temporal sort of Postgres

693
00:33:41,980 --> 00:33:43,240
workload setup thing?

694
00:33:43,320 --> 00:33:45,920
Michael: Before we started, you
mentioned 1 of the most interesting

695
00:33:45,920 --> 00:33:48,720
things you're working on is that
you're serving other people's

696
00:33:48,720 --> 00:33:53,400
workloads, with designs that you don't
control, like schemas

697
00:33:53,440 --> 00:33:56,020
and design decisions that you don't
control.

698
00:33:57,340 --> 00:33:59,760
How do you stay sane while doing
that?

699
00:33:59,760 --> 00:34:02,300
And also, what are the interesting
parts of that?

700
00:34:02,720 --> 00:34:03,460
Harry: Great question.

701
00:34:03,600 --> 00:34:07,680
You hit me with the question that
just invalidates my entire

702
00:34:07,680 --> 00:34:12,180
prior rant, which is like, yeah,
you can partition your workload

703
00:34:12,180 --> 00:34:14,840
if you think hard about the schema
design, but yeah.

704
00:34:15,060 --> 00:34:17,140
So, okay, let me set the stage
a little.

705
00:34:17,320 --> 00:34:19,700
So, Gadget's, like, reason for being.

706
00:34:19,700 --> 00:34:22,540
Like, the reason why we're working
hard on this thing is because

707
00:34:22,540 --> 00:34:26,680
we kind of believe software development
is still hard, unnecessarily.

708
00:34:27,260 --> 00:34:30,520
That, like, when you go to spin
up an app on the weekend or your

709
00:34:30,520 --> 00:34:33,720
next big business or just, like,
solve a small problem in your

710
00:34:33,720 --> 00:34:34,580
life with software.

711
00:34:34,840 --> 00:34:38,220
You spend a lot of time on what
we call like baloney problems

712
00:34:38,360 --> 00:34:41,540
of provisioning infrastructure,
setting up connection poolers,

713
00:34:41,680 --> 00:34:45,860
like worrying about the autovacuum
schedule, so on and so forth.

714
00:34:46,020 --> 00:34:49,720
We just think that's all a distraction
from actually building

715
00:34:49,720 --> 00:34:50,540
something useful.

716
00:34:50,660 --> 00:34:54,960
It matters, like you can't ignore
it, but you also ideally don't

717
00:34:54,960 --> 00:34:55,840
want to have to deal with it.

718
00:34:55,840 --> 00:34:59,120
So similar to the people who came
before us, like the Firebases

719
00:34:59,440 --> 00:35:03,000
of the world, we kind of want to
delete a bunch of those problems

720
00:35:03,000 --> 00:35:06,760
from our customers' lives, take
them on ourselves, and allow

721
00:35:06,760 --> 00:35:10,280
them to focus on whatever unique,
interesting custom problem

722
00:35:10,280 --> 00:35:10,940
they want.

723
00:35:11,280 --> 00:35:15,300
Like, recently, it might even be
allow Claude Code to focus on

724
00:35:15,300 --> 00:35:17,460
whatever unique, interesting problem
they want, right?

725
00:35:17,460 --> 00:35:21,840
But, like, by kind of deleting
infrastructure foot guns and systems

726
00:35:21,860 --> 00:35:25,300
integration, we just let people
kind of succeed faster.

727
00:35:25,760 --> 00:35:28,420
And then I guess I should say,
bearing in mind the audience listening

728
00:35:28,420 --> 00:35:30,060
here, it's not for everybody, right?

729
00:35:30,060 --> 00:35:33,940
Like, if you have the perfect knowledge
of the Postgres schema

730
00:35:33,940 --> 00:35:36,560
you want, Gadget's probably going
to be less efficient than that,

731
00:35:36,560 --> 00:35:36,780
right?

732
00:35:36,780 --> 00:35:38,860
Because we've added lots of layers
of abstraction.

733
00:35:39,180 --> 00:35:41,760
We have our own operations and
infrastructure team kind of managing

734
00:35:41,760 --> 00:35:42,180
this stuff.

735
00:35:42,180 --> 00:35:45,320
So I'm not trying to say, like,
no 1 needs to care about these

736
00:35:45,320 --> 00:35:45,680
problems.

737
00:35:45,680 --> 00:35:48,280
I'm just trying to say that a lot
of people, I hope, don't have

738
00:35:48,280 --> 00:35:51,140
to care about these problems, because
me and my team pull our

739
00:35:51,140 --> 00:35:52,740
hair out stressing about them.

740
00:35:52,800 --> 00:35:57,420
All this means is there's this
tension between asking the developer

741
00:35:57,680 --> 00:36:02,560
to be more specific about their
data access patterns and adding

742
00:36:02,560 --> 00:36:06,780
friction for them, like, on the
path to success, right? That, like,

743
00:36:06,780 --> 00:36:11,660
every time you say, do you want
4, 8, or 32 bytes for this number,

744
00:36:12,100 --> 00:36:15,180
you're posing what I would
call a computer baloney question,

745
00:36:15,360 --> 00:36:19,300
rather than, like, it's just a number,
like, you know, screw off, I

746
00:36:19,300 --> 00:36:22,200
don't know how big
it's ever gonna be, just you figure

747
00:36:22,200 --> 00:36:25,700
it out, you know. So in Gadget we
have, like, a number field type

748
00:36:25,760 --> 00:36:29,580
instead of an int and a float and
a decimal. We've taken this

749
00:36:29,580 --> 00:36:34,540
to a pretty intense extreme where
when you do data modeling in

750
00:36:34,540 --> 00:36:38,440
Gadget you specify tables, we call
them models because they include

751
00:36:38,440 --> 00:36:41,540
behavior as well, but you specify
tables and you specify fields,

752
00:36:42,100 --> 00:36:46,800
you can specify relationships among
them, but you don't run any

753
00:36:46,800 --> 00:36:49,040
SQL to realize those schemas.

754
00:36:49,440 --> 00:36:55,160
And then you get a kind of fully
auto-generated GraphQL API for

755
00:36:55,160 --> 00:36:58,380
interacting with those tables without
doing anything else.

756
00:36:58,380 --> 00:37:00,740
You just, you specify your schema,
you get the API.

757
00:37:01,200 --> 00:37:04,480
We generate a nice client for that
API, we generate nice docs

758
00:37:04,480 --> 00:37:08,160
for that API, but the idea is like
you say, todo: title string,

759
00:37:08,560 --> 00:37:12,800
completed boolean, and now you
immediately can write some React

760
00:37:12,800 --> 00:37:16,320
code for a form, or you can ask
the AI to write some React code

761
00:37:16,320 --> 00:37:18,780
for a form to like transact against
that stuff.
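
As a rough, invented illustration of the "you specify fields, we own the schema" idea (this is not Gadget's actual internal DDL), a todo model with a string title and a boolean completed might be realized as:

```sql
-- Hypothetical backing table for a "todo" model; Gadget, not the
-- developer, owns and migrates this DDL.
CREATE TABLE gadget_todos (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title     text,
    completed boolean
);

-- Because the generated API may sort or filter on any field, each
-- user-visible column gets an index (see the indexing discussion below).
CREATE INDEX ON gadget_todos (title);
CREATE INDEX ON gadget_todos (completed);
```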

762
00:37:19,280 --> 00:37:25,140
All this to say, we're responsible
for the SQL that, or the schema

763
00:37:25,160 --> 00:37:27,900
of those tables under the, like
the real tables under the hood,

764
00:37:27,900 --> 00:37:31,060
as well as the SQL that that API
generates and executes.

765
00:37:32,260 --> 00:37:35,860
So this is a win because we can
avoid things like the

766
00:37:35,860 --> 00:37:38,940
N+1 problem, like we know how to
do that good.

767
00:37:39,480 --> 00:37:44,820
We can implement things like 0
downtime cutovers or kind of maintenance

768
00:37:44,820 --> 00:37:48,280
modes where we know how it's like
this app is in the process

769
00:37:48,280 --> 00:37:50,940
of being moved to a different shard,
like, hold on a second,

770
00:37:51,000 --> 00:37:53,540
we'll finish that and then serve
your request, you know.

771
00:37:53,600 --> 00:37:56,420
It's powerful that Gadget sort
of owns the whole stack.

772
00:37:56,420 --> 00:37:59,180
Oh, we can implement other stuff
like cloud storage, where you

773
00:37:59,180 --> 00:38:03,400
can store data blobs in the right
way, instead of within Postgres,

774
00:38:03,400 --> 00:38:04,060
you know.

775
00:38:04,640 --> 00:38:08,520
But it means that all we know about
a given application is what

776
00:38:08,520 --> 00:38:12,260
we've collected from the developer,
which is that list of tables,

777
00:38:12,260 --> 00:38:14,760
the list of fields on the tables,
and the relationships among

778
00:38:14,760 --> 00:38:19,340
them. Not stuff like, ah, this
is a shard key, or this is how

779
00:38:19,340 --> 00:38:23,640
I intend to transact, or if I'm
going to make an HTTP call in

780
00:38:23,640 --> 00:38:27,560
the middle of my transaction, like
2 days from now when I change

781
00:38:27,560 --> 00:38:28,220
my code.

782
00:38:28,940 --> 00:38:32,200
So we picked this bed, and we made
it on purpose.

783
00:38:32,960 --> 00:38:36,860
The idea is we want building high
quality software that scales

784
00:38:36,860 --> 00:38:39,860
and like meets the demands of Black
Friday, Cyber Monday to be

785
00:38:39,860 --> 00:38:43,880
more accessible, and to not be locked
behind, you know, whatever,

786
00:38:44,060 --> 00:38:47,720
between the 3 of us, 60 years of
Postgres experience or something

787
00:38:47,720 --> 00:38:48,480
like that, right?

788
00:38:48,480 --> 00:38:53,440
You should be able to build and
get an app going way faster without

789
00:38:53,440 --> 00:38:53,940
that.

790
00:38:54,520 --> 00:38:58,660
So the gist of it is we need some
way to structure our data under

791
00:38:58,660 --> 00:39:01,680
the hood within our Postgres such
that people can't take down

792
00:39:01,680 --> 00:39:02,240
the platform.

793
00:39:02,240 --> 00:39:04,100
They can't use more than their
fair share.

794
00:39:04,120 --> 00:39:08,320
We just need to be safe by default
where we're committing to

795
00:39:08,320 --> 00:39:11,320
serving this API, and regardless
of what strange queries you

796
00:39:11,320 --> 00:39:15,040
throw at it, strange transaction
patterns, poorly written code,

797
00:39:15,060 --> 00:39:17,920
either out of ignorance or malice
or whatever, it just needs to

798
00:39:17,920 --> 00:39:18,720
still work.

799
00:39:18,960 --> 00:39:21,820
Same as like a Firebase or yeah,
those other companies I mentioned

800
00:39:21,820 --> 00:39:22,320
before.

801
00:39:22,960 --> 00:39:27,320
And the way that we do that is
like kind of 2 main ways.

802
00:39:28,080 --> 00:39:33,280
We don't give everyone their own
instance and say it's your job,

803
00:39:33,300 --> 00:39:34,660
like you manage the performance.

804
00:39:34,900 --> 00:39:37,940
I think that's what, say, a Neon
or Supabase does, which is

805
00:39:37,940 --> 00:39:39,380
still helpful, but it's different,
right?

806
00:39:39,380 --> 00:39:41,880
They're sort of saying, you get
this much resources and within

807
00:39:41,880 --> 00:39:43,300
that, it's up to you to optimize.

808
00:39:43,320 --> 00:39:44,100
We don't say that.

809
00:39:44,100 --> 00:39:48,420
We charge by operation, more usage-based,
like say an Upstash

810
00:39:48,420 --> 00:39:49,780
Redis or something like that.

811
00:39:49,860 --> 00:39:53,940
And then we consider ourselves,
Gadget, on the hook for the performance

812
00:39:54,020 --> 00:39:54,900
of those operations.

813
00:39:55,600 --> 00:39:59,800
And that means that, again, we
have to find some way to not process

814
00:39:59,800 --> 00:40:01,320
queries that could take us down.

815
00:40:01,320 --> 00:40:02,460
So we do 2 things.

816
00:40:02,720 --> 00:40:07,660
We refuse to execute a query unless
there's an index on the predicates.

817
00:40:09,320 --> 00:40:13,100
And that's just a safety must,
basically.

818
00:40:14,060 --> 00:40:17,960
We can't run unindexed queries
because we can't know that they'll

819
00:40:17,960 --> 00:40:20,940
process fast enough for it to be
safe to run at scale.
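
One hedged way to implement that check from the application side (not necessarily Gadget's actual mechanism) is to ask the planner before running anything, reusing the hypothetical table from earlier:

```sql
-- Ask the planner how it would execute the user's filter; the app tier
-- can reject the query if the returned plan contains a Seq Scan node.
EXPLAIN (FORMAT JSON)
SELECT id, title FROM gadget_todos WHERE completed = false;
-- With an index on (completed) the plan comes back as an Index Scan or
-- Bitmap Index Scan; without one, it's a Seq Scan and gets refused.
```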

820
00:40:21,180 --> 00:40:22,320
And then we rate limit.

821
00:40:23,040 --> 00:40:25,760
And that allows us to say like,
you know, no 1 can have more

822
00:40:25,760 --> 00:40:26,780
than their fair share.

823
00:40:26,820 --> 00:40:29,880
But the crazy thing is almost no
1 hits their rate limits.

824
00:40:30,300 --> 00:40:35,140
We consider that like a regrettable
private platform protection

825
00:40:35,200 --> 00:40:41,500
mechanism as opposed to like a
pricing mechanism or, like, an incentive,

826
00:40:41,820 --> 00:40:44,200
and the reason that no 1 hits
their rate limits is because

827
00:40:44,200 --> 00:40:47,700
if you just force every query to
be indexed, like, you get pretty

828
00:40:47,700 --> 00:40:48,140
far.

829
00:40:48,140 --> 00:40:51,800
There's... the planner is
so good, and, like, the bitmap index

830
00:40:51,940 --> 00:40:58,000
combining stuff gets you so far
that almost nobody does. Any time

831
00:40:58,000 --> 00:41:01,640
we ever have a database CPU spike,
it's usually because something's

832
00:41:01,640 --> 00:41:06,040
gone wrong with our creation of
the indexes, like Gadget's automations,

833
00:41:06,280 --> 00:41:09,440
they're Temporal workflows, surprise,
like, have failed to set the

834
00:41:09,440 --> 00:41:13,020
index up properly, or we like let
the code that makes the API

835
00:41:13,020 --> 00:41:15,620
calls ship before the index was
actually in place, or something

836
00:41:15,620 --> 00:41:16,560
like that, you know.

837
00:41:16,560 --> 00:41:19,900
But it's like Postgres just works
if you do this.

838
00:41:19,900 --> 00:41:22,660
So it's really naive, but it works.
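
What "bitmap index combining" buys you: the planner can AND together several single-column indexes, so per-column indexes cover multi-predicate filters without composite indexes. A sketch against the hypothetical table from earlier:

```sql
EXPLAIN
SELECT * FROM gadget_todos
WHERE completed = false AND title = 'buy milk';
-- Typical plan shape on a large enough table:
--   Bitmap Heap Scan on gadget_todos
--     ->  BitmapAnd
--           ->  Bitmap Index Scan on gadget_todos_completed_idx
--           ->  Bitmap Index Scan on gadget_todos_title_idx
```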

839
00:41:22,660 --> 00:41:25,160
Nikolay: Are you okay to share
how many indexes you've already

840
00:41:25,160 --> 00:41:25,660
created?

841
00:41:26,200 --> 00:41:29,660
Harry: Yeah, we have, I think,
probably over 2 million indexes

842
00:41:30,240 --> 00:41:31,640
in our system right now.

843
00:41:31,640 --> 00:41:33,180
Actually, sorry, I take that back.

844
00:41:33,260 --> 00:41:38,040
At the time of the sharding, we
did, but we've reduced that drastically

845
00:41:38,360 --> 00:41:40,600
since then, for kind of unrelated
reasons.

846
00:41:40,600 --> 00:41:44,340
But yeah, basically if a user wants
to be able to, in their API,

847
00:41:44,480 --> 00:41:48,740
sort or filter on a column, we
just say there has to be an index.

848
00:41:48,740 --> 00:41:53,180
And so we have, I don't know, most
of the columns indexed.

849
00:41:54,120 --> 00:41:58,780
And like, oh god, I can feel the
hate mail now, talking to a

850
00:41:58,780 --> 00:42:02,280
giant legion of database engineers who are all screaming in their

851
00:42:02,280 --> 00:42:03,080
cars right now.

852
00:42:03,080 --> 00:42:04,300
Why are you indexing it all?

853
00:42:04,400 --> 00:42:05,160
No, no, no.

854
00:42:05,580 --> 00:42:09,880
The modern hardware that respected Nik just mentioned means

855
00:42:09,880 --> 00:42:13,620
that the cost of this, while substantial, is not actually that

856
00:42:13,620 --> 00:42:14,120
crazy.

857
00:42:14,560 --> 00:42:16,160
It's higher, no doubt.

858
00:42:16,160 --> 00:42:19,300
But we're not in the days where you need to be so incredibly

859
00:42:19,360 --> 00:42:23,000
stingy about every index you make, especially when the cardinality

860
00:42:23,000 --> 00:42:26,320
of the tables is like a million, 2 million records, as opposed

861
00:42:26,320 --> 00:42:28,060
to like 10 gazillion, right?

862
00:42:28,320 --> 00:42:30,720
When our first 10 gazillion customer comes along, we're going

863
00:42:30,720 --> 00:42:33,240
to work with them closely to be like, you don't need that index.

864
00:42:33,240 --> 00:42:34,540
Like, let's turn that off.

865
00:42:34,640 --> 00:42:35,940
Your stuff will go faster.

866
00:42:36,040 --> 00:42:39,840
But for the vast majority of people, the overhead adds up,

867
00:42:39,840 --> 00:42:42,260
but it's not so meaningful that our business is busted.

868
00:42:42,800 --> 00:42:46,740
Nikolay: Yeah, so I wanted to say that this is a pet versus cattle

869
00:42:47,220 --> 00:42:47,720
philosophy.

870
00:42:48,220 --> 00:42:52,120
If we, like, think about indexes as pets, we need to get rid of

871
00:42:52,120 --> 00:42:55,220
unused indexes, take care of bloat and all.

872
00:42:55,440 --> 00:42:58,820
But at your scale, before the schema became sharded, with millions of

873
00:42:58,820 --> 00:43:01,480
indexes, you think about them as cattle as well.

874
00:43:01,480 --> 00:43:03,900
Just inside 1 Postgres instance, right?

875
00:43:04,700 --> 00:43:06,180
Before it became sharded.

876
00:43:07,280 --> 00:43:08,860
This is an interesting approach.

877
00:43:09,520 --> 00:43:13,820
And I agree, since we have small tables, we know there is

878
00:43:13,820 --> 00:43:17,160
write amplification with all those indexes: if an update is not HOT, it

879
00:43:17,160 --> 00:43:20,520
needs to update all those indexes, and some of them are not used,

880
00:43:20,660 --> 00:43:25,520
but since the approach by design is cattle, okay, it's not that

881
00:43:25,520 --> 00:43:26,020
bad.

882
00:43:26,720 --> 00:43:27,040
Yes.

883
00:43:27,040 --> 00:43:32,260
This is an interesting case. I must say I never saw it before, it's

884
00:43:32,260 --> 00:43:32,760
interesting.

885
00:43:34,500 --> 00:43:37,120
Michael: I've heard of 1 other case, which Heap Analytics

886
00:43:37,260 --> 00:43:40,840
blogged about, something about, yeah, you know the 1, great.

887
00:43:40,840 --> 00:43:41,780
I sure do.

888
00:43:41,820 --> 00:43:44,280
So that was the only other time I heard about this.

889
00:43:44,340 --> 00:43:47,660
Nikolay: Was it also where they discussed that they create indexes

890
00:43:47,660 --> 00:43:49,540
without CONCURRENTLY, right?

891
00:43:49,540 --> 00:43:53,100
Michael: Yeah, because you can't do more than 1 at the same time.

892
00:43:53,100 --> 00:43:54,780
Nikolay: Yeah, to parallelize work.

893
00:43:55,460 --> 00:43:55,960
Michael: Yeah.

894
00:43:56,480 --> 00:44:00,580
Harry: What's neat or strange or weird about Gadget is we kind

895
00:44:00,580 --> 00:44:05,780
of participate in the whole SDLC, like the whole life cycle of

896
00:44:06,100 --> 00:44:09,600
like authoring a feature, debugging it, deciding it's good, merging

897
00:44:09,600 --> 00:44:10,820
it and deploying to production.

898
00:44:11,000 --> 00:44:14,120
Gadget has a development environment as well as the production

899
00:44:14,120 --> 00:44:14,620
environment.

900
00:44:15,060 --> 00:44:18,220
And so we kind of get to witness the moment that you want to

901
00:44:18,220 --> 00:44:18,920
go live.

902
00:44:18,960 --> 00:44:23,600
And we just do a normal-style, call it Rails or whatever, migration

903
00:44:23,680 --> 00:44:26,880
where on deploy, we create the indexes concurrently behind the

904
00:44:26,880 --> 00:44:27,340
scenes.

905
00:44:27,340 --> 00:44:29,840
Then we ship the code that makes use of them.
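
The migration ordering he describes maps onto standard Postgres; as a sketch, again using the hypothetical table from earlier:

```sql
-- Step 1 of the deploy: build the index without blocking writes.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- and a failed build leaves an INVALID index to be dropped and retried.
CREATE INDEX CONCURRENTLY IF NOT EXISTS gadget_todos_title_idx
    ON gadget_todos (title);
-- Step 2: only after the index is valid does the code that relies
-- on it ship.
```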

906
00:44:30,040 --> 00:44:32,920
And then it's like, we can do that because we're not sort of

907
00:44:32,920 --> 00:44:36,240
exposing the raw SQL and asking people to manage their own schema,

908
00:44:36,460 --> 00:44:37,580
so on and so forth.

909
00:44:38,200 --> 00:44:41,040
But what do I want to share about this that I think is the most

910
00:44:41,040 --> 00:44:41,280
interesting?

911
00:44:41,280 --> 00:44:41,960
Oh, yeah.

912
00:44:41,960 --> 00:44:45,987
So the first thing is, like, giant catalogs where you have, you

913
00:44:45,987 --> 00:44:50,340
know, we had, I think, at peak in 1 instance, like 250,000 tables,

914
00:44:50,560 --> 00:44:51,920
1.4 million indexes.

915
00:44:52,480 --> 00:44:56,980
Giant catalogs suck, but they were more manageable than I was guessing.

916
00:44:58,080 --> 00:45:01,840
I was assuming there was going to be some O(n²) system

917
00:45:01,840 --> 00:45:05,500
table inside Postgres, or some doinky little thing that just

918
00:45:05,500 --> 00:45:06,820
breaks when things are that big.

919
00:45:06,820 --> 00:45:10,020
But no, that instance operated well.

920
00:45:10,200 --> 00:45:14,200
The thing that did break was a bunch of the analytics tools out

921
00:45:14,200 --> 00:45:14,700
there.

922
00:45:15,400 --> 00:45:16,020
Like we were using...

923
00:45:16,020 --> 00:45:16,600
Like pg_stat_statements?

924
00:45:17,760 --> 00:45:19,320
pg_stat_statements was OK.

925
00:45:20,020 --> 00:45:20,740
It was big.

926
00:45:20,740 --> 00:45:23,520
It takes a lot of data, but it wasn't, like, an ungodly amount

927
00:45:23,520 --> 00:45:24,620
of performance overhead.

928
00:45:24,920 --> 00:45:28,740
But, like, pganalyze and many of the kind of collectors, they're

929
00:45:28,740 --> 00:45:32,300
like, let's get the size of every table every minute, you know?

930
00:45:32,300 --> 00:45:34,740
It's like that's a hard no.
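
The kind of per-table sweep those collectors run looks roughly like this; against ~250,000 tables, re-running it every minute is what hurts:

```sql
-- Size of every user table: cheap on a normal schema, expensive when
-- pg_class holds hundreds of thousands of relations.
SELECT n.nspname, c.relname,
       pg_total_relation_size(c.oid) AS total_bytes
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');
```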

931
00:45:35,380 --> 00:45:38,440
Nikolay: Our monitoring also didn't work properly because of

932
00:45:38,480 --> 00:45:39,060
the catalogs.

933
00:45:39,060 --> 00:45:41,320
Harry: Yeah, but you guys only had to make a couple tweaks.

934
00:45:41,320 --> 00:45:44,600
I think we're talking about pgwatch2, like the Postgres.AI

935
00:45:45,900 --> 00:45:48,680
remix, and it works well for us.

936
00:45:48,680 --> 00:45:50,020
It's kind of the fastest 1.

937
00:45:50,020 --> 00:45:52,500
I haven't tried pgMustard, but I must.

938
00:45:52,580 --> 00:45:52,700
I

939
00:45:52,700 --> 00:45:53,800
Michael: owe them a minimum.

940
00:45:53,800 --> 00:45:54,195
It's not

941
00:45:54,195 --> 00:45:54,960
Nikolay: a money-ching code.

942
00:45:54,960 --> 00:45:55,740
Harry: It's all right.

943
00:45:56,600 --> 00:45:57,780
So that was surprising.

944
00:45:58,040 --> 00:46:01,120
But the other, actually... this is just a fun fact.

945
00:46:01,120 --> 00:46:04,800
So we did our sharding, we moved this gazillion number of tables

946
00:46:04,800 --> 00:46:05,880
over to AlloyDB.

947
00:46:06,240 --> 00:46:08,540
We turned on their analytics thing.

948
00:46:09,140 --> 00:46:11,020
It's called Advanced Query Insights.

949
00:46:11,200 --> 00:46:12,260
And it's not bad.

950
00:46:12,260 --> 00:46:16,660
It's a nice UI for pg_stat_statements, souped up a bit.

951
00:46:17,040 --> 00:46:22,440
They store all the data for it in a private database in your

952
00:46:22,440 --> 00:46:22,940
instance.

953
00:46:23,200 --> 00:46:27,160
So like you have an almost superuser but not a full superuser

954
00:46:27,160 --> 00:46:30,360
and there's this one database you can't touch that has it in it.

955
00:46:30,440 --> 00:46:34,240
And currently, this is my fault, I set it all up and I was

956
00:46:34,240 --> 00:46:36,300
like, oh, this is probably going to be not that expensive.

957
00:46:36,340 --> 00:46:42,180
Currently our, like, AlloyDB Insights database is almost 1.5x the

958
00:46:42,180 --> 00:46:43,760
size of the actual database.

959
00:46:44,240 --> 00:46:47,360
And I think it's because the query diversity is so high, and

960
00:46:47,360 --> 00:46:51,300
all the data that they're keeping about each one adds up so much,

961
00:46:51,500 --> 00:46:55,520
that yeah, we're paying more than double in storage just for

962
00:46:55,520 --> 00:46:57,080
the privilege of this thing.

963
00:46:57,260 --> 00:46:58,780
And half the queries time out.

964
00:46:58,780 --> 00:47:01,640
But again, I'm going to blame me for that, not Alloy for that.

965
00:47:01,640 --> 00:47:03,140
We just need to turn that off.

966
00:47:03,600 --> 00:47:09,020
Nikolay: I'm curious if the main, like if we have just one AlloyDB

967
00:47:09,320 --> 00:47:14,440
database and we store metrics inside the same thing, if it's

968
00:47:14,440 --> 00:47:16,080
down, how will we troubleshoot?

969
00:47:17,440 --> 00:47:23,500
Harry: I think it is only, like, Query Insights, like the equivalent

970
00:47:23,500 --> 00:47:24,680
of pg_stat_statements.

971
00:47:26,880 --> 00:47:28,580
I guess you're right, I don't know.

972
00:47:29,100 --> 00:47:32,340
Michael: Isn't the same true for Postgres like with pg_stat_statements?

973
00:47:33,140 --> 00:47:36,060
Nikolay: Well yeah, but that's why we need external monitoring

974
00:47:36,060 --> 00:47:39,500
tool which will pull data and store it independently for troubleshooting.

975
00:47:40,160 --> 00:47:44,440
I thought about this like self-observed or externally observed

976
00:47:44,920 --> 00:47:45,420
system.

977
00:47:45,620 --> 00:47:47,420
I think we actually need both.

978
00:47:47,500 --> 00:47:50,640
But if it's just self-observed, it might be a problem.

979
00:47:50,640 --> 00:47:53,400
Harry: Yeah, it goes full circle to that same thing we were talking

980
00:47:53,400 --> 00:47:55,120
about with the infrastructure provider earlier, which is like,

981
00:47:55,120 --> 00:47:57,380
they don't have anywhere else to put it unless they're going

982
00:47:57,380 --> 00:47:58,380
to pay for it.

983
00:47:58,440 --> 00:47:58,620
Right?

984
00:47:58,620 --> 00:48:01,560
So by putting it in my instance, I pay for it, which I am paying

985
00:48:01,560 --> 00:48:05,380
for dearly, and I'm now going to turn off, you know, but I agree.

986
00:48:05,380 --> 00:48:09,220
We... you do get, like, the system metrics that they store in their,

987
00:48:09,220 --> 00:48:12,420
you know, fancy time series database, but yeah.

988
00:48:13,580 --> 00:48:14,960
Nikolay: Yeah, I didn't know about this.

989
00:48:14,960 --> 00:48:15,460
Interesting.

990
00:48:16,340 --> 00:48:20,540
Harry: Yeah, I think, if I may, the most interesting thing we've

991
00:48:20,540 --> 00:48:27,260
done to manage this ungodly number of indexes and tables is classic

992
00:48:27,260 --> 00:48:29,240
computer science, a hybrid strategy.

993
00:48:30,380 --> 00:48:34,540
What we noticed is many, many, many Gadget apps are like tutorials.

994
00:48:34,740 --> 00:48:37,960
It's like someone wants to kick the tires, or they're building

995
00:48:37,960 --> 00:48:40,300
a to-do app and they only ever add three to-dos.

996
00:48:40,580 --> 00:48:45,780
It's kind of excessive to create them one to-do table, one user table,

997
00:48:45,920 --> 00:48:50,780
indexes on many of those columns to store zero to five rows.

998
00:48:51,660 --> 00:48:57,320
So we use a thing we call a storage strategy pattern where different

999
00:48:57,340 --> 00:49:00,200
Gadget models are backed by different storage strategies.

1000
00:49:00,360 --> 00:49:03,420
We still store them in the same Postgres, and that's so that

1001
00:49:03,420 --> 00:49:05,340
we can still transact against them together.

1002
00:49:05,740 --> 00:49:10,920
Like we support real, actual Postgres transactions, but we use

1003
00:49:10,920 --> 00:49:14,840
doinky little JSONB tables when the cardinality is small.

1004
00:49:15,100 --> 00:49:21,560
So we have 1 shared JSON table, it has an ID, a model identifier,

1005
00:49:21,820 --> 00:49:24,980
and then just field values, and we don't index that at all.

1006
00:49:25,080 --> 00:49:26,260
No GIN, no nothing.

1007
00:49:26,400 --> 00:49:27,840
We just do a table.

1008
00:49:28,520 --> 00:49:31,080
You could call it a table scan even though it's not, it's like

1009
00:49:31,080 --> 00:49:32,220
a range scan.

1010
00:49:32,680 --> 00:49:35,020
And then we just bound the cardinality of that.
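
A hedged sketch of that shared low-cardinality store (table and column names are invented, not Gadget's actual schema):

```sql
-- 1 shared table backs every small model; field values live in JSONB
-- with no GIN or per-field indexes at all.
CREATE TABLE small_model_rows (
    id       bigint GENERATED ALWAYS AS IDENTITY,
    model_id bigint NOT NULL,   -- which Gadget model this row belongs to
    fields   jsonb  NOT NULL,   -- all field values, unindexed
    PRIMARY KEY (model_id, id)
);

-- Reads are a bounded range scan on the (model_id, id) key plus a
-- filter, which stays cheap while the per-model row count is capped.
SELECT id, fields
FROM small_model_rows
WHERE model_id = 1234
  AND fields->>'completed' = 'false';
```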

1011
00:49:35,020 --> 00:49:39,520
And then when we witness the, I think it's like, 2,500th record,

1012
00:49:40,080 --> 00:49:44,640
we swap in kind of a real table that has the real indexes and

1013
00:49:44,640 --> 00:49:44,860
whatever.

1014
00:49:44,860 --> 00:49:48,280
And so that cut our table count and our index count down by a

1015
00:49:48,280 --> 00:49:49,420
lot, a lot, a lot.

1016
00:49:49,700 --> 00:49:54,340
Again, that swap is 0 downtime, and it has to be.

1017
00:49:54,340 --> 00:49:55,940
That's again our commitment to our customers.

1018
00:49:55,940 --> 00:49:58,700
And that was actually quite difficult. Say someone's

1019
00:49:58,700 --> 00:50:00,740
doing a bulk load of data or whatever.

1020
00:50:00,800 --> 00:50:04,320
We witness that 2,500th record, we have to kind of, like, pause

1021
00:50:04,320 --> 00:50:08,160
traffic, do the switcheroo, resume traffic without... like, it'll

1022
00:50:08,160 --> 00:50:10,340
be a latency spike, but not an error spike.
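
Very roughly, the promotion step could look like the following; the real version brackets it with the traffic pause he describes, and the table and flag names here are invented:

```sql
BEGIN;
-- Copy the model's rows out of the shared JSONB store into its newly
-- created, fully indexed dedicated table.
INSERT INTO gadget_todos (id, title, completed)
OVERRIDING SYSTEM VALUE
SELECT id,
       fields->>'title',
       (fields->>'completed')::boolean
FROM small_model_rows
WHERE model_id = 1234;
-- (A real version would also resync the identity sequence.)

DELETE FROM small_model_rows WHERE model_id = 1234;

-- Flip the (hypothetical) routing flag so new queries hit the real table.
UPDATE model_storage SET strategy = 'table' WHERE model_id = 1234;
COMMIT;
```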

1023
00:50:10,640 --> 00:50:14,760
And I think that's kind of... Gadget is a bundle, right?

1024
00:50:14,760 --> 00:50:17,440
Where we have this backend runtime, this frontend runtime, this

1025
00:50:17,440 --> 00:50:19,820
database, all these things that work nicely together.

1026
00:50:19,940 --> 00:50:22,820
And we see that as validation that the bundle's showing through.

1027
00:50:22,820 --> 00:50:25,680
Or it's like, we can only do that because we have control over

1028
00:50:25,680 --> 00:50:28,660
the app tier, control over the load balancers to be able to be

1029
00:50:28,660 --> 00:50:29,980
like, ah, hold on a second.

1030
00:50:30,040 --> 00:50:32,200
We've got to do some maintenance on your app real quick.

1031
00:50:32,200 --> 00:50:33,240
We won't drop any requests.

1032
00:50:33,240 --> 00:50:36,820
We'll do a little switcheroo without anybody knowing, and you're

1033
00:50:36,820 --> 00:50:40,340
off to the races with the full performance after that.

1034
00:50:41,240 --> 00:50:42,820
Michael: That's a really neat trick.

1035
00:50:43,080 --> 00:50:43,940
I like that.

1036
00:50:45,300 --> 00:50:48,160
You must have provisions in that code that doesn't allow

1037
00:50:48,160 --> 00:50:52,280
queries without indexes, to, like, account for tiny tables. Yeah,

1038
00:50:52,280 --> 00:50:53,160
oh my goodness.

1039
00:50:53,440 --> 00:50:54,360
This sounds painful.

1040
00:50:55,600 --> 00:50:58,980
Harry: We've been working on it for like 4 years, so yeah, yes,

1041
00:50:58,980 --> 00:51:02,000
but I think we're in a good position.

1042
00:51:02,000 --> 00:51:07,040
Like, I'm excited for BFCM this year, which for operations staff

1043
00:51:07,040 --> 00:51:08,620
is, I think, a rare thing to hear.

1044
00:51:08,620 --> 00:51:12,240
Like, I think I'm excited for the challenge, you know?

1045
00:51:12,280 --> 00:51:14,560
Michael: I saw the color drain from Nikolay's face.

1046
00:51:15,480 --> 00:51:15,980
Nikolay: Yeah.

1047
00:51:17,020 --> 00:51:17,520
Yeah.

1048
00:51:17,860 --> 00:51:21,680
Do you think there might be some point when true Postgres will

1049
00:51:21,820 --> 00:51:25,880
be fully replaced in your company by something?

1050
00:51:26,320 --> 00:51:30,140
Or will there always be a place
for true Postgres, not AlloyDB

1051
00:51:30,660 --> 00:51:31,160
Postgres-like?

1052
00:51:33,120 --> 00:51:33,980
Harry: Great question.

1053
00:51:34,240 --> 00:51:35,180
I think yes.

1054
00:51:35,460 --> 00:51:37,320
I do think it will be fully replaced.

1055
00:51:37,480 --> 00:51:44,200
And I think it's only because we
haven't seen any Alloy downsides.

1056
00:51:45,060 --> 00:51:49,440
Like while it's a fork, it's still
99% true Postgres.

1057
00:51:49,440 --> 00:51:53,040
So we have like pgvector support,
for example, and pg_squeeze

1058
00:51:53,040 --> 00:51:53,540
support.

1059
00:51:54,100 --> 00:51:56,200
It's not like a wire-compatible
thing.

1060
00:51:56,200 --> 00:52:01,260
It's like mostly actually Postgres
with the WAL and the block

1061
00:52:01,260 --> 00:52:02,800
storage kind of ripped out.

1062
00:52:03,060 --> 00:52:06,340
Nikolay: It also has lightweight
locks and so on, all those problems.

1063
00:52:06,340 --> 00:52:06,580
Harry: Yes.

1064
00:52:06,580 --> 00:52:07,080
Yeah.

1065
00:52:08,300 --> 00:52:09,060
Yes, true.

1066
00:52:10,160 --> 00:52:12,800
I think we're going to stick on
Alloy, but not because it's the

1067
00:52:12,800 --> 00:52:15,580
very best, but because now we have
familiarity, now we know we

1068
00:52:15,580 --> 00:52:17,860
can maintain it, now we can scale
to Valhalla.

1069
00:52:17,860 --> 00:52:20,200
So we've sort of unlocked something
and we've got other stuff

1070
00:52:20,200 --> 00:52:20,820
to do.

1071
00:52:20,820 --> 00:52:23,900
I think the thing that would make
me revisit that is OrioleDB,

1072
00:52:25,160 --> 00:52:26,460
or whatever the...

1073
00:52:27,040 --> 00:52:29,340
Whatever InnoDB for Postgres is.

1074
00:52:29,340 --> 00:52:32,580
Like, when that comes around and
I don't have to vacuum anymore,

1075
00:52:33,420 --> 00:52:37,340
that seems like a pretty epic win
that I'd be interested in realizing.

1076
00:52:38,100 --> 00:52:38,860
Nikolay: Makes sense.

1077
00:52:39,240 --> 00:52:40,140
Great answer.

1078
00:52:40,840 --> 00:52:42,480
Great, thank you so much.

1079
00:52:42,800 --> 00:52:43,300
Enjoyed.

1080
00:52:44,800 --> 00:52:45,080
Harry: Thank you

1081
00:52:45,080 --> 00:52:47,580
guys, I really appreciate
getting to nerd out.

1082
00:52:47,580 --> 00:52:49,660
These are the things that keep
me up at night.

1083
00:52:49,660 --> 00:52:51,520
There's like 1.4 million indexes.

1084
00:52:51,760 --> 00:52:52,700
What do I do?

1085
00:52:52,740 --> 00:52:54,660
It's nice to let it out a little
bit.

1086
00:52:55,580 --> 00:52:56,080
Nikolay: Cool.

1087
00:52:56,120 --> 00:52:56,640
Thank you.

1088
00:52:56,640 --> 00:52:57,600
Michael: It's our pleasure.