1
0:0:0,06 --> 0:0:3,62
Nikolay: Hello, hello, this is
PostgresFM, your favorite podcast

2
0:0:3,70 --> 0:0:4,54
about Postgres.

3
0:0:4,78 --> 0:0:9,96
And as usual, my name is Nik,
PostgresAI, and Michael is here

4
0:0:9,96 --> 0:0:11,88
as usual as well from pgMustard.

5
0:0:12,10 --> 0:0:12,88
Hi, Michael.

6
0:0:13,64 --> 0:0:14,42
Michael: Hello, Nik.

7
0:0:14,44 --> 0:0:18,76
Nikolay: And we have a great guest
for the second time, Lev, PgDog.

8
0:0:19,44 --> 0:0:21,24
Hello, Lev, thank you for coming.

9
0:0:21,82 --> 0:0:22,32
Lev: Hello.

10
0:0:22,58 --> 0:0:24,64
So thank you for having me for
a second time.

11
0:0:24,64 --> 0:0:28,34
Nikolay: Yeah, I'm sure there's
a lot new to discuss.

12
0:0:30,4 --> 0:0:34,54
So yeah, I think it's a great idea
to see what has changed since

13
0:0:34,54 --> 0:0:37,86
last time we talked, because it
was around your first release,

14
0:0:37,86 --> 0:0:38,8
I guess, right?

15
0:0:39,96 --> 0:0:42,88
And yeah, so where to start?

16
0:0:44,34 --> 0:0:45,1
Lev: Where to start?

17
0:0:45,1 --> 0:0:48,08
Yeah, we're in a different year,
2026.

18
0:0:48,44 --> 0:0:52,48
So it's been almost 8 months since
we last chatted, something

19
0:0:52,48 --> 0:0:53,0
like that.

20
0:0:53,0 --> 0:0:53,74
7 months.

21
0:0:55,46 --> 0:0:56,46
Now we're in January.

22
0:0:56,46 --> 0:0:57,58
Michael: Yeah, I think 10.

23
0:0:58,2 --> 0:0:58,96
Lev: 10 months.

24
0:0:59,04 --> 0:0:59,72
Oh boy.

25
0:1:0,04 --> 0:1:3,34
I mean, I've been working full
time on this stuff, so I have

26
0:1:3,48 --> 0:1:6,68
a whole list of things that have
been developed ever since.

27
0:1:6,68 --> 0:1:12,16
I think the biggest thing is PgDog
is very much real and in production

28
0:1:12,24 --> 0:1:14,16
with real companies and real customers.

29
0:1:14,16 --> 0:1:18,06
You know, last time we chatted,
it was like a thing, but now

30
0:1:18,4 --> 0:1:19,62
it's a real thing.

31
0:1:19,74 --> 0:1:21,18
So that's the big update.

32
0:1:21,86 --> 0:1:22,74
It's in prod.

33
0:1:23,3 --> 0:1:27,16
Most of the code I think I would
comfortably say is stable and

34
0:1:27,16 --> 0:1:31,94
performant, which obviously I would
say that, but there's real

35
0:1:31,94 --> 0:1:33,9
usage behind this which is really
cool.

36
0:1:34,2 --> 0:1:37,32
Let's see, there's a long list
of features that's been developed

37
0:1:37,36 --> 0:1:38,08
since then.

38
0:1:39,06 --> 0:1:42,18
Michael: Before we go on to those,
I wondered of the people using

39
0:1:42,18 --> 0:1:46,3
it in production, what are the
primary drivers?

40
0:1:46,32 --> 0:1:50,06
Is it sharding or is it like an
upgraded pooler?

41
0:1:50,06 --> 0:1:52,9
You know, I feel like there are
a few selling points to PgDog

42
0:1:52,9 --> 0:1:55,24
that we talked about last time
and I wondered which ones were

43
0:1:55,24 --> 0:1:57,34
driving like the primary adoption.

44
0:1:58,08 --> 0:1:59,28
Lev: That's a great question.

45
0:2:0,04 --> 0:2:2,8
One of our early adopters, they went
straight for sharding, which

46
0:2:2,8 --> 0:2:6,66
to me was like, wow, you're brave,
sir.

47
0:2:8,08 --> 0:2:11,84
And they went straight for it and
they replaced Citus with it,

48
0:2:11,84 --> 0:2:13,04
so that was really cool.

49
0:2:13,2 --> 0:2:15,22
Nikolay: Can you name some names
or?

50
0:2:16,02 --> 0:2:17,06
Lev: They're on my website.

51
0:2:17,22 --> 0:2:19,64
One of their logos, I'll let you
guess which one.

52
0:2:19,64 --> 0:2:22,5
I'm still trying to close them
as an official customer, so soon

53
0:2:22,5 --> 0:2:24,9
I'll be able to say their name,
but their logo is there.

54
0:2:25,44 --> 0:2:31,4
And yeah, they're pushing, I think,
over 100k QPS per shard.

55
0:2:32,02 --> 0:2:37,28
And they have like 12 shards, I
think. Yeah, yeah, it's kind

56
0:2:37,28 --> 0:2:40,08
of crazy. They're using direct-to-shard
queries, so that's

57
0:2:40,08 --> 0:2:44,18
the real, you know, why it worked
so early is because they didn't

58
0:2:44,18 --> 0:2:47,8
use any of the fancy like aggregates
and sorting across multiple

59
0:2:47,8 --> 0:2:48,3
shards.

60
0:2:48,3 --> 0:2:49,54
That's the real tricky part.
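[Editor's note] The direct-to-shard routing described above can be sketched roughly like this — a minimal, hypothetical Python illustration (names like `shard_for` and `NUM_SHARDS` are invented, not PgDog's API): when every query filters on the sharding key, the proxy can hash that key and forward the query to exactly one shard, so no cross-shard aggregation or sorting is needed.

```python
# Hypothetical sketch of direct-to-shard routing; all names are invented
# for illustration.
import hashlib

NUM_SHARDS = 12  # matches the shard count mentioned above

def shard_for(key: int) -> int:
    # Stable hash of the sharding key. A real router would reuse the
    # database's own partition-hashing function so placement matches
    # declarative partitions; SHA-256 here just keeps the sketch portable.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def route(query_key: int) -> str:
    return f"shard_{shard_for(query_key)}"

# The same key always routes to the same shard.
assert route(42) == route(42)
```
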

61
0:2:49,54 --> 0:2:54,68
Although we've made some progress
there, it's still not 100%

62
0:2:54,8 --> 0:2:56,32
compatible with how Postgres does it.

63
0:2:56,32 --> 0:2:58,3
So on its way though.

64
0:2:58,32 --> 0:3:2,86
But other use cases, load balancer
is a huge thing.

65
0:3:2,86 --> 0:3:7,2
Like people love having the ability
to talk to multiple replicas

66
0:3:7,2 --> 0:3:10,34
without actually manually creating
a connection pool for each.

67
0:3:10,52 --> 0:3:12,22
You know, there's health checks
and failover.

68
0:3:12,74 --> 0:3:17,14
One of the big use cases is we deployed it
alongside, you know, Aurora

69
0:3:17,14 --> 0:3:21,06
or RDS, where Patroni is basically
not used. And it can detect

70
0:3:21,06 --> 0:3:24,96
primary replica failover, and it'll
move the traffic to the new

71
0:3:24,96 --> 0:3:28,36
primary within a second or so,
which is really great.

72
0:3:28,36 --> 0:3:32,54
And then, of course, when one of
the replicas starts failing, it'll

73
0:3:32,68 --> 0:3:35,7
ban it and then use the remaining
replicas for re-query.

74
0:3:35,84 --> 0:3:39,06
So you get that, you know, that
amazing, like 4 nines of uptime

75
0:3:39,14 --> 0:3:39,94
for Postgres, which is pretty

76
0:3:40,14 --> 0:3:40,46
Nikolay: cool.
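[Editor's note] The health-check-and-ban behavior described above can be modeled like this — a toy sketch with invented class and method names, not PgDog's actual internals: replicas that fail health checks are banned, and traffic continues round-robin over the remaining ones.

```python
# Toy model of the load balancer behavior described above; names are
# invented for illustration.
class LoadBalancer:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.banned = set()

    def healthy(self):
        return [r for r in self.replicas if r not in self.banned]

    def ban(self, replica):
        # Called when health checks against this replica start failing.
        self.banned.add(replica)

    def unban(self, replica):
        # Called when the replica passes health checks again.
        self.banned.discard(replica)

    def pick(self, counter):
        # Simple round-robin over whatever is currently healthy.
        pool = self.healthy()
        return pool[counter % len(pool)]

lb = LoadBalancer(["replica1", "replica2", "replica3"])
lb.ban("replica2")
assert all(lb.pick(i) != "replica2" for i in range(10))
```
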

77
0:3:40,46 --> 0:3:42,68
I have questions about all these
features.

78
0:3:43,52 --> 0:3:44,02
Yeah.

79
0:3:44,14 --> 0:3:44,84
All right.

80
0:3:45,96 --> 0:3:47,78
Before that, I have an underlying
question.

81
0:3:48,22 --> 0:3:51,48
10 months. Is it really 10 months
or 10 years?

82
0:3:51,88 --> 0:3:53,16
Because you use AI.

83
0:3:53,16 --> 0:3:54,84
Do you use AI to develop?

84
0:3:54,84 --> 0:3:55,54
Lev: I do.

85
0:3:55,68 --> 0:3:56,14
Yes.

86
0:3:56,14 --> 0:3:57,34
Claude, right?

87
0:3:57,8 --> 0:3:58,02
Lev: Yeah.

88
0:3:58,02 --> 0:4:0,2
I use Claude, Opus specifically.

89
0:4:1,02 --> 0:4:3,06
Nikolay: Claude Code, Opus 4.5 right
now.

90
0:4:3,06 --> 0:4:3,26
Right.

91
0:4:3,26 --> 0:4:4,78
So this is what you use.

92
0:4:5,26 --> 0:4:5,74
Lev: That's right.

93
0:4:5,74 --> 0:4:6,24
Yeah.

94
0:4:6,66 --> 0:4:9,62
For only certain areas of the code,
not all of it.

95
0:4:9,62 --> 0:4:12,18
There's some areas where I don't
trust it at all, but some areas

96
0:4:12,18 --> 0:4:16,46
like, like we have a, we're building
a dashboard with

97
0:4:16,46 --> 0:4:17,8
all the metrics and stuff.

98
0:4:17,84 --> 0:4:19,46
Nikolay: Not critical paths, right?

99
0:4:19,54 --> 0:4:20,04
Lev: Yeah.

100
0:4:20,22 --> 0:4:23,44
Critical paths are still organic
code, if you will.

101
0:4:23,56 --> 0:4:24,56
Organic code.

102
0:4:28,36 --> 0:4:28,86
Exactly.

103
0:4:29,06 --> 0:4:29,16
Yeah.

104
0:4:29,16 --> 0:4:29,68
This is the first

105
0:4:29,68 --> 0:4:30,7
Nikolay: time I hear this.

106
0:4:30,7 --> 0:4:31,6
It's hilarious.

107
0:4:32,94 --> 0:4:33,44
Lev: Yeah.

108
0:4:33,8 --> 0:4:35,34
Homegrown organic code.

109
0:4:35,34 --> 0:4:38,2
But, I don't write JavaScript anymore, for example.

110
0:4:38,2 --> 0:4:40,96
I don't write CSS anymore, like that kind of stuff.

111
0:4:41,04 --> 0:4:45,6
Nikolay: So you use Claude Code for these like additional pieces of software,

112
0:4:46,34 --> 0:4:48,02
not super critical parts.

113
0:4:49,06 --> 0:4:55,44
And I'm sure you ask it to deduplicate and reuse the code as much

114
0:4:55,44 --> 0:4:59,1
as possible, recycle existing pieces of code.

115
0:4:59,1 --> 0:4:59,6
Stop.

116
0:5:0,18 --> 0:5:0,48
Okay.

117
0:5:0,48 --> 0:5:0,98
It's

118
0:5:2,98 --> 0:5:5,72
Lev: crazy how much that machine loves to generate like copy-pasted

119
0:5:5,8 --> 0:5:6,52
code around.

120
0:5:7,28 --> 0:5:9,1
It's allergic to function reuse.

121
0:5:10,26 --> 0:5:12,32
You really have to tell it like, hey, yeah.

122
0:5:13,44 --> 0:5:18,18
Nikolay: Yeah, I have a special additional code review

123
0:5:18,3 --> 0:5:22,86
process, which is basically targeting this deduplication

124
0:5:24,18 --> 0:5:24,68
vector.

125
0:5:25,76 --> 0:5:27,22
But it's possible to achieve.

126
0:5:28,1 --> 0:5:31,5
If you don't do that, by default, you will get a lot of duplicated

127
0:5:31,56 --> 0:5:32,06
code.

128
0:5:32,56 --> 0:5:33,84
Yeah, that's great to hear.

129
0:5:34,34 --> 0:5:37,1
Okay, I see, by the way, great logos on your website.

130
0:5:37,36 --> 0:5:40,32
I missed that, haven't visited the front page for a while.

131
0:5:40,32 --> 0:5:45,14
I see Coinbase, Circleback, we use Circleback at work.

132
0:5:45,86 --> 0:5:48,76
Some other names, that's great, congrats.

133
0:5:50,14 --> 0:5:51,5
Lev: Thank you, hard at work.

134
0:5:51,82 --> 0:5:53,84
Nikolay: Yeah, so what is your current pace?

135
0:5:53,84 --> 0:5:56,18
How often releases happen?

136
0:5:57,54 --> 0:5:58,58
How do you ship?

137
0:5:59,06 --> 0:6:2,36
And if you still write code a lot, like I guess it's not like

138
0:6:2,36 --> 0:6:3,82
every day new version.

139
0:6:3,92 --> 0:6:5,42
It's less, right?

140
0:6:6,3 --> 0:6:6,72
Lev: Yep.

141
0:6:6,72 --> 0:6:7,54
Not every day.

142
0:6:7,54 --> 0:6:9,36
Our release schedule is every Thursday.

143
0:6:9,48 --> 0:6:11,64
I'm trying to be very diligent about it.

144
0:6:11,64 --> 0:6:16,16
Like whatever code that's in main is usually stable, but I give

145
0:6:16,16 --> 0:6:19,44
it a week before I tag a release.

146
0:6:19,64 --> 0:6:23,5
And then every Thursday I compile release notes and ship it out.

147
0:6:23,72 --> 0:6:26,98
And then you get it in GitHub and there's a Docker image that

148
0:6:26,98 --> 0:6:30,66
gets built for both x86 and ARM.

149
0:6:30,66 --> 0:6:33,84
So you can pretty much run it anywhere, including on your Mac,

150
0:6:33,84 --> 0:6:34,84
if you wanted to.

151
0:6:34,84 --> 0:6:37,96
And we have a Helm chart for if you're using Kubernetes, our

152
0:6:37,96 --> 0:6:41,4
Helm chart is getting a lot of good usage and contributions from

153
0:6:41,4 --> 0:6:42,18
a bunch of people.

154
0:6:42,18 --> 0:6:46,24
So, you know, I'm not a Kubernetes expert, but They are, so they're

155
0:6:46,24 --> 0:6:47,26
helping me along.

156
0:6:47,78 --> 0:6:50,36
So if you're running Kube, you should be able to deploy PgDog

157
0:6:50,44 --> 0:6:52,26
in like 30 seconds, honestly.

158
0:6:52,64 --> 0:6:54,74
But otherwise, you know, it's just a container.

159
0:6:55,2 --> 0:6:56,18
You can run it anywhere.

160
0:6:56,64 --> 0:7:1,4
Nikolay: Right, and you use libpg_query for query parsing, so

161
0:7:1,4 --> 0:7:5,66
to build ASTs, and yeah, and you recently optimized it, right?

162
0:7:5,66 --> 0:7:9,34
So it's not merged, your optimization is not merged to upstream

163
0:7:9,34 --> 0:7:12,54
yet, but can you tell us a little bit more about it?

164
0:7:12,54 --> 0:7:17,98
Also, specifically, I think my question is also about is it worth

165
0:7:17,98 --> 0:7:18,48
optimizing?

166
0:7:19,44 --> 0:7:22,76
Like how many microseconds do we save on each query?

167
0:7:23,22 --> 0:7:26,94
Should we worry about those microseconds at all?

168
0:7:28,08 --> 0:7:29,76
Lev: Yeah, you should.

169
0:7:29,76 --> 0:7:30,36
It was...

170
0:7:31,1 --> 0:7:32,3
First of all, full disclosure.

171
0:7:32,64 --> 0:7:34,9
I told Claude to do it.

172
0:7:36,1 --> 0:7:37,32
And then he did it.

173
0:7:37,5 --> 0:7:38,54
I checked it.

174
0:7:39,9 --> 0:7:40,94
I was very open about it.

175
0:7:40,94 --> 0:7:41,72
I put in the PR.

176
0:7:41,72 --> 0:7:43,94
I know how open source projects hate this kind of stuff.

177
0:7:43,94 --> 0:7:46,16
Like, you know, 3,000 lines of boilerplate.

178
0:7:46,56 --> 0:7:48,52
I was up front, look, guys, this is Claude-generated.

179
0:7:48,52 --> 0:7:49,2
I reviewed it.

180
0:7:49,2 --> 0:7:49,78
It works okay.

181
0:7:49,78 --> 0:7:54,22
But give it a few months before seriously looking at it.

182
0:7:54,28 --> 0:7:59,54
But the idea is in libpg_query, there's a translation layer between

183
0:7:59,54 --> 0:8:2,02
the C library and a bunch of other bindings.

184
0:8:2,5 --> 0:8:6,04
So notably, like we use the Rust binding and the translation

185
0:8:6,04 --> 0:8:7,26
layer is in protobuf.

186
0:8:7,58 --> 0:8:11,64
So every single time you call parse or fingerprint, every time

187
0:8:11,64 --> 0:8:14,86
you get the AST back, basically, it goes through a serialization

188
0:8:15,06 --> 0:8:18,66
and deserialization process with protobuf inside the same process.

189
0:8:18,94 --> 0:8:20,82
And that's, and that's really slow actually.

190
0:8:20,84 --> 0:8:24,76
So I just replaced it, like only for the Rust bindings, with

191
0:8:24,76 --> 0:8:28,14
just like direct C-to-Rust and Rust-to-C mappings of the Postgres

192
0:8:28,14 --> 0:8:33,12
AST. And that improved the translation part of it by, honestly,

193
0:8:33,12 --> 0:8:35,22
I'm not even exaggerating, like 5x faster.

194
0:8:35,42 --> 0:8:38,76
The throughput went from like, you know, 1000 per, I think per

195
0:8:38,76 --> 0:8:42,98
millisecond or per second to like 5 or 10k or something like

196
0:8:42,98 --> 0:8:43,26
that.

197
0:8:43,26 --> 0:8:46,56
The benchmarks are in the PR, but it's huge.
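[Editor's note] The overhead being removed can be illustrated with a stand-in sketch — Python and `json` here playing the role of Rust and protobuf, and the AST shape invented: serializing and deserializing inside a single process just walks and copies every node twice to end up with the tree you already had.

```python
# Stand-in illustration of the old vs. new binding paths (json plays the
# role of protobuf; the AST shape is invented for illustration).
import json

ast = {"SelectStmt": {"targetList": [{"ResTarget": {"name": "id"}}],
                      "fromClause": [{"RangeVar": {"relname": "users"}}]}}

def via_serialization(tree):
    # Old path: encode the tree to bytes, then decode it back, even though
    # producer and consumer share the same address space.
    return json.loads(json.dumps(tree))

def direct(tree):
    # New path: hand the structure over directly (in the Rust binding,
    # a field-by-field C-struct mapping with only a few allocations).
    return tree

# Both paths yield an identical tree; only the direct one avoids the copies.
assert via_serialization(ast) == direct(ast)
```
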

198
0:8:46,98 --> 0:8:49,66
And I think another important factor is CPU utilization.

199
0:8:50,66 --> 0:8:53,44
So for us, we're a network proxy, so our CPU budget has to be

200
0:8:53,44 --> 0:8:54,14
pretty reasonable.

201
0:8:54,14 --> 0:8:56,72
If you can decrease it by 5x, and
most of our CPU comes

202
0:8:56,72 --> 0:8:57,66
from query parsing.

203
0:8:58,08 --> 0:9:0,44
Everything else is pretty straightforward
stuff, just I/O

204
0:9:0,44 --> 0:9:1,3
back and forth.

205
0:9:1,36 --> 0:9:2,06
And atomics.

206
0:9:2,28 --> 0:9:3,78
So for us, it's huge.

207
0:9:4,04 --> 0:9:7,58
Nikolay: So this optimization is
just like, you reduced the number

208
0:9:7,58 --> 0:9:11,32
of times the query is copied
between various buffers.

209
0:9:11,32 --> 0:9:11,82
Lev: Exactly.

210
0:9:12,5 --> 0:9:16,72
And the protobuf serialization
is probably one of the fastest in

211
0:9:16,72 --> 0:9:19,34
the industry, but it's still very,
very slow compared to just

212
0:9:19,34 --> 0:9:22,86
copying a C struct in memory, like
a memcpy.

213
0:9:23,36 --> 0:9:26,32
Nikolay: It sounds to me like shorter
queries should suffer less

214
0:9:26,32 --> 0:9:31,22
from this than huge CTEs, you know,
like multi-kilobytes in size.

215
0:9:31,86 --> 0:9:33,84
They should suffer even more, right?

216
0:9:34,74 --> 0:9:35,96
Lev: Yeah, yeah, definitely.

217
0:9:36,28 --> 0:9:40,76
I haven't measured how much faster
are slow queries or long queries.

218
0:9:40,76 --> 0:9:43,94
I just took a pretty sizable
query, like 600 lines, and used

219
0:9:43,94 --> 0:9:44,94
that as my benchmark.

220
0:9:45,56 --> 0:9:49,7
Because a lot of people who use
PgDog surprisingly like to write

221
0:9:49,92 --> 0:9:53,82
long and complex SQL statements
for their database.

222
0:9:54,08 --> 0:9:56,76
So that's what I went with and
that's where latency will really

223
0:9:56,76 --> 0:9:57,26
matter.

224
0:9:57,44 --> 0:10:2,24
So yeah, for small ones, maybe
it's smaller, but doing protobuf,

225
0:10:2,68 --> 0:10:7,22
ser/de in memory, it's a big waste
of time when you can just map

226
0:10:7,72 --> 0:10:10,62
like a C-struct directly with a
few allocations.

227
0:10:11,68 --> 0:10:12,6
Nikolay: Right, right.

228
0:10:12,94 --> 0:10:19,86
I also saw there was an optimization
of like SIMD vector calculations.

229
0:10:20,28 --> 0:10:23,04
So like, this is like cool stuff
or what?

230
0:10:23,04 --> 0:10:25,74
Like is it, or you saw some real
benefit?

231
0:10:27,38 --> 0:10:29,54
Lev: Well, the vector one is kind
of funny.

232
0:10:29,54 --> 0:10:33,4
There's a blog post about it, but
basically I tried to shard

233
0:10:33,94 --> 0:10:37,32
a pgvector index among multiple
Postgres instances.

234
0:10:38,04 --> 0:10:42,32
And to do that, I implemented one
of their algorithms, IVFFlat

235
0:10:42,4 --> 0:10:47,36
inside PgDog, and then it's able
to basically like separate the

236
0:10:47,58 --> 0:10:50,14
vectors between multiple databases
and then search for them using

237
0:10:50,14 --> 0:10:51,06
IVFFlat.

238
0:10:52,12 --> 0:10:56,82
And for that I needed SIMD to calculate,
I think I'm supporting,

239
0:10:57,34 --> 0:11:1,4
we're supporting cosine distance
only, just as a POC, just to

240
0:11:1,4 --> 0:11:3,58
see if anyone even cares about
this stuff.

241
0:11:3,78 --> 0:11:5,98
And that's where the SIMD optimization
comes from, which is to

242
0:11:5,98 --> 0:11:8,98
calculate the cosine distance between
the input and one of the

243
0:11:8,98 --> 0:11:12,18
centroids, because we store the
centroids basically in a config

244
0:11:12,18 --> 0:11:12,68
file.

245
0:11:12,84 --> 0:11:15,52
Everything in PgDog right now
is config-driven because I trust

246
0:11:15,52 --> 0:11:17,26
the disk not to do anything.

247
0:11:18,08 --> 0:11:20,14
Eventually we'll move to
something more dynamic.

248
0:11:20,14 --> 0:11:22,64
Right now it's like you pre-calculate
your centroids in advance

249
0:11:22,64 --> 0:11:26,32
and then you store them and then
you can basically query one shard

250
0:11:26,32 --> 0:11:30,22
or multiple shards to get your
probes, the number of probes,

251
0:11:30,24 --> 0:11:30,94
if you will.

252
0:11:30,94 --> 0:11:34,28
And then underneath each Postgres
database has their own IVFFlat or HNSW index.
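[Editor's note] The centroid-based routing described above might look roughly like this — a hedged sketch with invented names, where the hard-coded centroids stand in for the precomputed ones PgDog keeps in its config file: rank shards by cosine distance from the query vector to each shard's centroid, and probe the nearest ones.

```python
# Sketch of IVFFlat-style routing across shards; all names here are
# illustrative, not PgDog's implementation.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# shard id -> centroid of the vectors stored on that shard
centroids = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}

def shards_to_probe(query, probes=1):
    # Rank shards by distance to their centroid; probe the closest few.
    ranked = sorted(centroids, key=lambda s: cosine_distance(query, centroids[s]))
    return ranked[:probes]

# A query vector pointing mostly "up" is routed to the shard whose
# centroid is [0.0, 1.0].
assert shards_to_probe([0.1, 0.9]) == [1]
```
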

253
0:11:34,28 --> 0:11:36,24
So it's an approximation on top
of an approximation, which is

254
0:11:36,46 --> 0:11:39,28
fun because you can scale pgvector
into like presumably billions

255
0:11:39,28 --> 0:11:43,38
of billions

256
0:11:43,38 --> 0:11:44,18
Nikolay: Of vectors, like others
announced, right, and we

257
0:11:44,18 --> 0:11:49,24
said it's not true because it's
not a single instance.

258
0:11:49,44 --> 0:11:52,94
So, yeah.

259
0:11:53,4 --> 0:11:54,64
Yeah.

260
0:11:54,64 --> 0:11:55,12
Yeah.

261
0:11:55,12 --> 0:11:58,48
A billion vectors on a single instance,
I haven't seen this yet.

262
0:11:58,7 --> 0:12:1,76
Lev: No, no, no, it's not possible
with pgvector, but with sharding

263
0:12:1,76 --> 0:12:2,44
it is.

264
0:12:2,44 --> 0:12:2,94
Nikolay: Right.

265
0:12:3,08 --> 0:12:7,66
So if you have a sharding schema
defined, like you can store a

266
0:12:7,66 --> 0:12:8,16
lot.

267
0:12:8,54 --> 0:12:10,62
Right, so it's a great feature.

268
0:12:11,18 --> 0:12:15,08
Is anyone using it, or is it just a showcase
and that's it?

269
0:12:15,36 --> 0:12:18,06
Lev: I don't think anyone, I think
I got an email from somebody

270
0:12:18,06 --> 0:12:19,92
who was like, this is really cool
and I'm like, well, this is

271
0:12:19,92 --> 0:12:20,8
how you need to use it.

272
0:12:20,8 --> 0:12:21,92
And they're like, cool.

273
0:12:22,58 --> 0:12:24,22
I haven't heard back from them.

274
0:12:24,8 --> 0:12:27,74
But yeah, it's more like a...

275
0:12:28,08 --> 0:12:30,56
Nikolay: Speaking of performance,
let's talk about two-phase

276
0:12:30,6 --> 0:12:31,1
commit.

277
0:12:31,16 --> 0:12:32,7
There is support for it, right?

278
0:12:34,02 --> 0:12:37,92
Can you describe what's your
experience and view on using

279
0:12:38,0 --> 0:12:40,3
two-phase commit in larger systems?

280
0:12:41,04 --> 0:12:42,34
Oh, we have a dog.

281
0:12:44,76 --> 0:12:48,58
Lev: We have a dog. And we have
a dog on two-phase commit as well.

282
0:12:49,28 --> 0:12:49,94
Okay, just for

283
0:12:49,94 --> 0:12:52,82
Nikolay: those who are listening
on the audio version, we just

284
0:12:52,82 --> 0:12:56,62
saw a dog actually walking in the background.

285
0:12:57,4 --> 0:12:58,16
That's great.

286
0:12:58,18 --> 0:12:58,94
Lev: Big dog.

287
0:12:59,8 --> 0:13:2,28
She is the reason why PgDog
is named PgDog.

288
0:13:2,28 --> 0:13:2,78
Nikolay: Right.

289
0:13:2,9 --> 0:13:4,3
You explained that last time.

290
0:13:4,46 --> 0:13:4,96
Right.

291
0:13:5,34 --> 0:13:10,24
So what about two-phase commit
and performance overhead it has

292
0:13:10,24 --> 0:13:11,06
and so on?

293
0:13:11,06 --> 0:13:12,44
Lev: I'll be completely honest
with you.

294
0:13:12,44 --> 0:13:14,24
I haven't benchmarked it very much.

295
0:13:14,24 --> 0:13:16,86
Nobody really talked to me about
using it yet in production.

296
0:13:16,96 --> 0:13:22,66
So I don't know how much the Postgres
internals themselves create

297
0:13:22,66 --> 0:13:23,24
an overhead.

298
0:13:23,24 --> 0:13:27,94
I just know from the proxy side,
it's 2 extra statements to execute.

299
0:13:28,14 --> 0:13:32,54
And also the additional complexity
of managing phase 1, phase

300
0:13:32,54 --> 0:13:33,84
2 of the commit.

301
0:13:33,84 --> 0:13:36,18
So if phase 1 fails, you need to
roll back.

302
0:13:36,18 --> 0:13:39,38
And if phase 2 fails, you need
to roll forward or commit.

303
0:13:39,72 --> 0:13:42,54
And you also need to make it PgDog
crash resistant, which we

304
0:13:42,54 --> 0:13:43,48
haven't done yet.

305
0:13:43,86 --> 0:13:47,38
So there's still work to do there
to make it, I think, production

306
0:13:47,38 --> 0:13:47,88
grade.

307
0:13:48,1 --> 0:13:51,04
But from our point of view, it
was just an extra couple of statements

308
0:13:51,04 --> 0:13:51,7
to execute.
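[Editor's note] The phase-1/phase-2 handling described above corresponds to Postgres's PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED statements. A rough sketch — the `Shard` class is a stand-in for a real connection, not PgDog code:

```python
# Sketch of the two-phase-commit flow across shards. Phase 1 issues
# PREPARE TRANSACTION everywhere; if any shard fails, the already-prepared
# shards are rolled back. Phase 2 rolls forward with COMMIT PREPARED.
class Shard:
    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare  # simulate a phase-1 failure
        self.log = []

    def execute(self, sql):
        if sql.startswith("PREPARE TRANSACTION") and self.fail_prepare:
            raise RuntimeError(f"{self.name}: prepare failed")
        self.log.append(sql)

def two_phase_commit(shards, gid):
    prepared = []
    try:
        for shard in shards:                       # phase 1
            shard.execute(f"PREPARE TRANSACTION '{gid}'")
            prepared.append(shard)
    except RuntimeError:
        for shard in prepared:                     # roll back survivors
            shard.execute(f"ROLLBACK PREPARED '{gid}'")
        return False
    for shard in shards:                           # phase 2: roll forward
        shard.execute(f"COMMIT PREPARED '{gid}'")
    return True

assert two_phase_commit([Shard("a"), Shard("b")], "tx1") is True
```
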

309
0:13:51,82 --> 0:13:52,86
Nikolay: Yeah, okay, understood.

310
0:13:52,9 --> 0:13:57,34
So this is just like, also like
some foundation for future work,

311
0:13:57,34 --> 0:13:57,64
maybe, right?

312
0:13:57,64 --> 0:13:58,22
Definitely, yeah.

313
0:13:58,38 --> 0:13:59,02
Lev: After we get

314
0:13:59,02 --> 0:14:0,7
Nikolay: some feedback from production
systems.

315
0:14:0,7 --> 0:14:1,4
Okay, I see.

316
0:14:1,46 --> 0:14:4,86
Another thing I noticed also, like,
it's a fun, fun feature.

317
0:14:4,94 --> 0:14:7,3
You created an ID generator, right?

318
0:14:7,3 --> 0:14:12,88
Which is like, we, like, Postgres
18 brings UUID version 7, but

319
0:14:12,88 --> 0:14:14,04
it's 16 bytes.

320
0:14:14,44 --> 0:14:19,54
It's quite a lot of space spent
for a single ID value.

321
0:14:20,02 --> 0:14:21,86
Yours is 8 bytes, right?

322
0:14:22,08 --> 0:14:23,3
Lev: Yeah, bigint, yeah.

323
0:14:23,3 --> 0:14:25,96
Nikolay: It's an 8-byte integer, but
it behaves similarly.

324
0:14:26,18 --> 0:14:28,48
So there is a timestamp component
in it.

325
0:14:29,16 --> 0:14:33,58
This is a cool feature and it's going
to serve us until when?

326
0:14:34,3 --> 0:14:35,44
70 years, right?

327
0:14:35,44 --> 0:14:36,6
70 years left.

328
0:14:36,66 --> 0:14:40,3
It's not a lot, but for many systems
it should be enough.

329
0:14:40,52 --> 0:14:42,94
What inspired you to build this?

330
0:14:43,58 --> 0:14:47,6
Lev: I have a potential production
deployment and there was

331
0:14:47,6 --> 0:14:51,26
big usage of omni-sharded tables
or replicated tables or mirrored

332
0:14:51,26 --> 0:14:53,94
tables or whatever you want to
call it. Basically, the data

333
0:14:53,94 --> 0:14:58,6
in the table is the same on all
shards. And they needed to generate

334
0:14:58,98 --> 0:15:3,4
the primary key for that row. And
you couldn't rely on the sequence

335
0:15:3,4 --> 0:15:6,22
because even if you used two-phase
commit, which they originally

336
0:15:6,22 --> 0:15:8,9
thought they would do, there's
still a chance that the sequence

337
0:15:8,9 --> 0:15:12,28
numbers on each shard will be different,
especially because sequences

338
0:15:12,28 --> 0:15:14,7
are consumed whether the transaction
succeeds or fails.

339
0:15:15,18 --> 0:15:17,98
So the ID could be different on
each shard.

340
0:15:17,98 --> 0:15:20,66
So you needed an external ID generation
service.

341
0:15:20,8 --> 0:15:24,86
And the first thought was about using
a third database or another

342
0:15:24,86 --> 0:15:28,98
database to use its sequence, to
pull it in and then insert into

343
0:15:28,98 --> 0:15:33,06
the query, which creates a single
point of failure, but it's

344
0:15:33,06 --> 0:15:36,36
much simpler to actually be able
to generate a unique big int

345
0:15:36,4 --> 0:15:38,4
inside your proxy, inside PgDog,
which is not that hard

346
0:15:38,4 --> 0:15:38,8
Nikolay: to do.

347
0:15:38,8 --> 0:15:42,04
Couldn't you use sequences? We just could
have some extra added, like

348
0:15:42,04 --> 0:15:46,88
for example, if we say we don't
need more than 100 billion per

349
0:15:46,88 --> 0:15:51,38
shard, we could say on shard number
1, we always add 100 billion,

350
0:15:51,38 --> 0:15:54,26
on the second shard we add 200
billion.

351
0:15:54,78 --> 0:15:55,46
Why not?

352
0:15:57,34 --> 0:15:59,36
Michael: They need to be the same
on every shard.

353
0:15:59,38 --> 0:16:0,9
This is the table that's replicated.

354
0:16:1,74 --> 0:16:2,1
Nikolay: Okay.

355
0:16:2,1 --> 0:16:2,16
I see.

356
0:16:2,16 --> 0:16:2,68
Lev: I like

357
0:16:2,68 --> 0:16:2,86
Michael: it.

358
0:16:2,86 --> 0:16:3,82
I think it's interesting.

359
0:16:3,82 --> 0:16:7,58
So I guess it's because we don't
have unique constraints across

360
0:16:7,58 --> 0:16:8,5
shards, right?

361
0:16:9,9 --> 0:16:11,3
Lev: Yeah, that's right.

362
0:16:12,26 --> 0:16:16,72
Well, we don't need them basically
because the ID generation

363
0:16:16,84 --> 0:16:18,04
is guaranteed to be unique.

364
0:16:18,04 --> 0:16:18,54
Before.

365
0:16:18,74 --> 0:16:19,7
Yeah, before.

366
0:16:19,9 --> 0:16:20,4
Yeah.

367
0:16:20,66 --> 0:16:22,8
I mean, all the good database engineers
will tell you should

368
0:16:22,8 --> 0:16:25,96
still have a unique constraint, even
for UUIDs, which are theoretically

369
0:16:26,0 --> 0:16:29,2
guaranteed to be unique, but even
in this case, you don't necessarily

370
0:16:29,2 --> 0:16:29,88
need one.

371
0:16:29,92 --> 0:16:32,68
So the idea is, again, so you insert
the same ID into all the

372
0:16:32,68 --> 0:16:35,26
shards and you need to be able
to generate it quickly and efficiently.

373
0:16:35,84 --> 0:16:39,74
And that's why this, like, it's
the first of its kind, a PgDog-executed

374
0:16:39,9 --> 0:16:42,66
Postgres function, okay?

375
0:16:43,04 --> 0:16:44,8
Cause we're using the Postgres
parser, right?

376
0:16:44,8 --> 0:16:48,56
So why not, you know, just hook
into the query execution.

377
0:16:49,76 --> 0:16:50,22
Nikolay: Extract.

378
0:16:50,22 --> 0:16:50,72
One

379
0:16:51,0 --> 0:16:51,105
Lev: step towards Citus

380
0:16:51,105 --> 0:16:54,36
Nikolay: and having the whole of Postgres
in the middle there, no?

381
0:16:55,24 --> 0:16:58,12
Lev: I mean, you heard it here
first on PostgresFM.

382
0:16:59,68 --> 0:17:0,72
No, I don't know.

383
0:17:0,72 --> 0:17:3,92
I think it's a good start, but
it was so simple to implement.

384
0:17:4,04 --> 0:17:7,1
You extract, you find the function
execution, and then you pull,

385
0:17:7,4 --> 0:17:10,22
you generate a unique ID inside
the pooler.

386
0:17:10,46 --> 0:17:13,34
And the way you do that is, well,
you get the timestamp.

387
0:17:13,48 --> 0:17:15,8
You kind of assume your clock is
reasonably accurate.

388
0:17:16,16 --> 0:17:18,7
Then you have an internal sequence
number so you can generate

389
0:17:18,7 --> 0:17:20,46
multiple IDs per millisecond.

390
0:17:20,66 --> 0:17:22,48
So the clock tick is millisecond.

391
0:17:23,18 --> 0:17:27,84
And then there's another unique
number, call it node ID or PgDog

392
0:17:27,9 --> 0:17:28,5
node ID.

393
0:17:28,5 --> 0:17:31,72
So you can have multiple PgDog
containers in front of the same

394
0:17:31,72 --> 0:17:36,06
deployment and it will guarantee
that none of them will hit a

395
0:17:36,06 --> 0:17:37,36
duplicate from the

396
0:17:37,36 --> 0:17:38,08
Nikolay: same list.

397
0:17:38,26 --> 0:17:38,96
No collisions, yeah.

398
0:17:39,16 --> 0:17:40,36
Lev: No collisions, exactly.

399
0:17:40,68 --> 0:17:43,34
So, you know, the 70 years is a
little bit arbitrary.

400
0:17:43,38 --> 0:17:47,42
We can reduce any one of these components
and increase the timestamp

401
0:17:47,44 --> 0:17:51,18
component to extend it by just reducing
the number of PgDog nodes

402
0:17:51,18 --> 0:17:51,88
that you can deploy.

403
0:17:51,88 --> 0:17:58,26
Right now it's 1,024, which if
you're running that many, you're

404
0:17:58,26 --> 0:17:59,28
Facebook probably.

405
0:18:0,06 --> 0:18:3,62
But you can reduce that by half
and give that extra bit to the

406
0:18:3,62 --> 0:18:4,12
timestamp.

407
0:18:4,18 --> 0:18:7,74
And then you can also reduce the
number of IDs generated per millisecond.

408
0:18:8,0 --> 0:18:9,14
Right now it's 4 million.

409
0:18:9,14 --> 0:18:11,34
You can generate 4 million per
millisecond.

410
0:18:12,38 --> 0:18:13,22
It's a lot.

411
0:18:14,1 --> 0:18:15,64
You could probably reduce that.

412
0:18:16,34 --> 0:18:16,78
Actually, no.

413
0:18:16,78 --> 0:18:17,46
Let me double check.

414
0:18:17,46 --> 0:18:19,38
I think it's probably 4,000 per
millisecond.

415
0:18:19,94 --> 0:18:20,94
That sounds more reasonable.

416
0:18:20,94 --> 0:18:23,04
Michael: 4 million per second would
make sense.

417
0:18:23,56 --> 0:18:24,3
4 million.

418
0:18:24,62 --> 0:18:25,78
Lev: That's right, yeah.
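[Editor's note] The numbers quoted here line up with a Snowflake-style 64-bit layout: roughly 41 bits of millisecond timestamp (about 70 years), 10 bits of node ID (1,024 nodes), and 12 bits of per-millisecond sequence (4,096 IDs per millisecond, about 4 million per second). The exact bit split below is an assumption inferred from the conversation, not PgDog's source:

```python
# Hypothetical Snowflake-style packing of an 8-byte ID. The bit widths are
# assumptions inferred from the numbers discussed above.
TIMESTAMP_BITS, NODE_BITS, SEQ_BITS = 41, 10, 12  # 63 bits: fits a signed bigint

def make_id(ms: int, node: int, seq: int) -> int:
    # Pack: [timestamp | node | sequence]; IDs sort roughly by time.
    assert node < (1 << NODE_BITS) and seq < (1 << SEQ_BITS)
    return (ms << (NODE_BITS + SEQ_BITS)) | (node << SEQ_BITS) | seq

def split_id(id_: int):
    # Unpack the three components back out of the packed value.
    return (id_ >> (NODE_BITS + SEQ_BITS),
            (id_ >> SEQ_BITS) & ((1 << NODE_BITS) - 1),
            id_ & ((1 << SEQ_BITS) - 1))

# 2^41 milliseconds is roughly 69.7 years of key space.
years = (1 << TIMESTAMP_BITS) / (1000 * 60 * 60 * 24 * 365.25)
assert 69 < years < 70

assert split_id(make_id(1_700_000_000_000, 5, 42)) == (1_700_000_000_000, 5, 42)
```
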

419
0:18:25,86 --> 0:18:28,22
This is where documentation, or
good documentation,

420
0:18:28,44 --> 0:18:28,68
Michael: comes

421
0:18:28,68 --> 0:18:29,36
Lev: in handy.

422
0:18:30,04 --> 0:18:32,06
Michael: That's a big change from
last time as well.

423
0:18:32,6 --> 0:18:33,04
Lev: Yep.

424
0:18:33,04 --> 0:18:35,76
The biggest thing I'm proud of
is adding an icon next to every

425
0:18:35,76 --> 0:18:36,54
single link.

426
0:18:38,1 --> 0:18:42,52
Just, you know, being very, like,
attentive to detail, I suppose.

427
0:18:42,52 --> 0:18:45,72
But yeah, like, every single comment
I get from people is like,

428
0:18:45,72 --> 0:18:46,8
well, is this documented?

429
0:18:46,8 --> 0:18:47,44
Is this documented?

430
0:18:47,44 --> 0:18:49,68
And somebody's talking about our
feature, and I'm like, good

431
0:18:49,68 --> 0:18:50,18
point.

432
0:18:50,74 --> 0:18:51,56
Now it is.

433
0:18:51,56 --> 0:18:52,28
Thank you.

434
0:18:52,7 --> 0:18:56,94
So if it's not in the docs, I want
to say it doesn't exist, but

435
0:18:56,94 --> 0:18:57,84
that's not true.

436
0:18:58,84 --> 0:19:1,4
There's a backlog of documentation
to be done.

437
0:19:2,36 --> 0:19:2,86
Yeah.

438
0:19:4,2 --> 0:19:5,58
Nikolay: I should help with this.

439
0:19:6,58 --> 0:19:9,86
Lev: I would love for you guys
to help out with anything.

440
0:19:9,96 --> 0:19:11,08
Nikolay: Not I, AI.

441
0:19:11,82 --> 0:19:12,44
Oh, AI.

442
0:19:12,44 --> 0:19:13,54
Maybe I as well.

443
0:19:13,6001 --> 0:19:15,36
I would love to help with AI.

444
0:19:15,54 --> 0:19:18,34
Yeah, I'm looking at PgDog from
time to time and definitely

445
0:19:18,64 --> 0:19:20,26
looking for some use case.

446
0:19:20,28 --> 0:19:25,14
And as we all know, well, maybe not
all of us, my team benchmarked it.

447
0:19:25,84 --> 0:19:30,48
And so I'm looking forward to revisiting
that benchmark to see how

448
0:19:30,48 --> 0:19:34,0
that optimization related to protobuf
helped.

449
0:19:34,54 --> 0:19:35,46
I'm very curious.

450
0:19:35,58 --> 0:19:40,08
This time we are going to visit
not with trivial queries like

451
0:19:40,08 --> 0:19:42,94
SELECT, that's it, but with some
heavy...

452
0:19:42,94 --> 0:19:46,2
That's why I mentioned, like, I
expect some benefits from...

453
0:19:46,48 --> 0:19:49,9
For heavier queries, bigger text.

454
0:19:50,58 --> 0:19:51,46
Yeah, so yeah.

455
0:19:53,16 --> 0:19:55,58
Documentation, I think AI should
help with it.

456
0:19:55,58 --> 0:19:56,68
Like this is my plan.

457
0:19:56,84 --> 0:19:57,72
Lev: No, no way, man.

458
0:19:57,72 --> 0:19:59,86
Like, because they all sound the
same.

459
0:19:59,86 --> 0:20:0,9
They all look the same.

460
0:20:0,9 --> 0:20:3,54
Like you could give it a couple
of tips, like, you'll sound original,

461
0:20:3,56 --> 0:20:6,78
but like if you write text using
AI, it just all reads like it

462
0:20:6,78 --> 0:20:8,06
was written by the same.

463
0:20:8,1 --> 0:20:11,04
Nikolay: But at least the reference
could be like the usual problem

464
0:20:11,04 --> 0:20:15,9
with reference-like documentation,
which lists various functions,

465
0:20:15,94 --> 0:20:18,78
config params, everything, it usually
lags.

466
0:20:19,54 --> 0:20:24,44
And to maintain it, I think AI
is great.

467
0:20:24,78 --> 0:20:27,34
Lev: Yeah, I mean, I don't disagree.

468
0:20:27,5 --> 0:20:30,7
For me, writing documentation is
kind of like a, it's like a

469
0:20:30,7 --> 0:20:33,34
ritual, you know, I go through
the things that we built, and

470
0:20:33,34 --> 0:20:36,38
I'm like, oh, document, and then
I look at what I wrote.

471
0:20:36,38 --> 0:20:37,54
I'm like, wait a minute.

472
0:20:39,24 --> 0:20:39,92
This is wrong.

473
0:20:39,92 --> 0:20:41,64
I'm putting myself into the user's
shoes.

474
0:20:41,64 --> 0:20:42,66
I'm reading the docs.

475
0:20:42,66 --> 0:20:44,14
I'm like, OK, this is what it looks
like.

476
0:20:44,14 --> 0:20:47,02
Nikolay: It's an additional point
where you can verify the whole

477
0:20:47,02 --> 0:20:50,28
picture and see if all that makes
sense, all that needs to be

478
0:20:50,28 --> 0:20:51,3
adjusted additionally.

479
0:20:51,36 --> 0:20:52,32
Yeah, I agree.

480
0:20:52,9 --> 0:20:54,1
Also, demos, right?

481
0:20:54,1 --> 0:20:58,1
Like when you think how to show
what was built, sometimes you

482
0:20:58,1 --> 0:20:59,62
end up changing everything.

483
0:21:0,72 --> 0:21:1,22
Lev: Yep.

484
0:21:1,5 --> 0:21:2,62
That has happened before.

485
0:21:2,62 --> 0:21:5,14
Nikolay: Because you simply cannot
explain it, it means that

486
0:21:5,14 --> 0:21:7,24
it probably should be done differently,
right?

487
0:21:8,1 --> 0:21:9,36
Lev: Exactly, exactly.

488
0:21:9,62 --> 0:21:10,12
So,

489
0:21:10,38 --> 0:21:14,84
Nikolay: let's talk maybe about
hard stuff, like prepared statements,

490
0:21:16,02 --> 0:21:21,52
what happened to them, like how
PgDog supports prepared statements,

491
0:21:21,76 --> 0:21:23,32
transaction mode and so on.

492
0:21:23,32 --> 0:21:24,74
Can you, can you...

493
0:21:24,84 --> 0:21:25,24
Lev: Funny.

494
0:21:25,24 --> 0:21:25,74
Yeah.

495
0:21:26,14 --> 0:21:28,42
You know, we have to stop sharding,
but prepared statements is

496
0:21:28,42 --> 0:21:29,76
where we really have issues.

497
0:21:30,06 --> 0:21:31,32
Nikolay: Yeah, let's talk about
it

498
0:21:31,32 --> 0:21:32,14
Lev: a little bit.

499
0:21:32,18 --> 0:21:32,68
Sure.

500
0:21:33,48 --> 0:21:34,2
They work.

501
0:21:34,3 --> 0:21:35,02
They're fine.

502
0:21:35,02 --> 0:21:36,02
Everything's great.

503
0:21:36,82 --> 0:21:38,8
It works, I think.

504
0:21:38,86 --> 0:21:40,1
Nikolay: You saved them, right?

505
0:21:40,24 --> 0:21:41,54
Lev: Yeah, they're cached.

506
0:21:41,76 --> 0:21:43,1
It's a global cache.

507
0:21:43,68 --> 0:21:48,22
So if multiple clients send the
same statement, the same statement

508
0:21:48,22 --> 0:21:49,66
on the server side is executed.

509
0:21:49,74 --> 0:21:51,66
So pretty high cache hits.
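
The global cache idea being described can be sketched roughly like this. This is a minimal illustration of the concept, not PgDog's actual Rust implementation: many clients may prepare the same SQL under different client-side names, but identical SQL text maps to one server-side statement, so each server connection prepares it only once:

```python
# Minimal sketch (not PgDog's actual code) of a pooler-side global
# prepared-statement cache: identical SQL text from different clients
# resolves to the same server-side statement name.
class GlobalStatementCache:
    def __init__(self):
        self._by_sql = {}   # SQL text -> server-side statement name
        self._counter = 0

    def resolve(self, client_name: str, sql: str) -> tuple:
        """Map a client's Parse message to a server-side statement.

        Returns (server_name, newly_created). client_name is the
        client's own statement name; it is irrelevant to the cache key.
        """
        if sql in self._by_sql:
            return self._by_sql[sql], False
        self._counter += 1
        server_name = "__pooler_%d" % self._counter
        self._by_sql[sql] = server_name
        return server_name, True

cache = GlobalStatementCache()
a, new_a = cache.resolve("s1", "SELECT * FROM users WHERE id = $1")
b, new_b = cache.resolve("s9", "SELECT * FROM users WHERE id = $1")
# Two clients, same SQL: same server-side statement, prepared once.
```

This is why the cache hit rate is high: the key is the SQL text, not the per-client statement name.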

510
0:21:51,66 --> 0:21:54,64
Nikolay: Implicit or explicit,
like protocol level or prepare

511
0:21:54,88 --> 0:21:55,38
keyword?

512
0:21:56,4 --> 0:22:0,76
Lev: Protocol level right now and
partial support for explicit,

513
0:22:1,12 --> 0:22:1,22
Nikolay: you

514
0:22:1,22 --> 0:22:2,58
Lev: know, since we have the parser.

515
0:22:2,72 --> 0:22:6,0
For the explicit ones I need to
differentiate on the number.

516
0:22:6,0 --> 0:22:8,56
Sometimes you can pass the data
types in the prepared statement,

517
0:22:8,56 --> 0:22:9,58
sometimes you don't.

518
0:22:9,72 --> 0:22:12,94
I just need to add a little bit
more support to the parser there

519
0:22:12,94 --> 0:22:17,82
to extract the differentiation
between like different unique...

520
0:22:17,96 --> 0:22:21,06
Identical prepared statements,
but they expect different data

521
0:22:21,06 --> 0:22:22,16
types for some reason.

522
0:22:22,36 --> 0:22:25,24
So protocol ones work.

523
0:22:26,04 --> 0:22:27,38
I guarantee it.

524
0:22:27,54 --> 0:22:29,2
If they don't, I swear.

525
0:22:30,72 --> 0:22:31,4
They do.

526
0:22:31,62 --> 0:22:33,84
But the other kind is still a work
in progress.

527
0:22:34,54 --> 0:22:39,12
Nikolay: Yeah, and have you seen
any issues with prepared statements

528
0:22:40,24 --> 0:22:43,62
used together with partitioning
or your clients don't use partitioning

529
0:22:43,62 --> 0:22:45,16
at all because they have sharding?

530
0:22:45,16 --> 0:22:46,4
Why use partitioning?

531
0:22:48,42 --> 0:22:50,14
Lev: I think we use them with partitioning.

532
0:22:50,5 --> 0:22:53,26
I haven't seen any issues.

533
0:22:53,26 --> 0:22:56,04
Like if the query, if the prepared
statement plan is wrong, Postgres

534
0:22:56,04 --> 0:22:59,6
will return an error and we evict
it from the cache and then

535
0:22:59,76 --> 0:23:0,14
we prepare it.

536
0:23:0,14 --> 0:23:1,5
Nikolay: It's not that simple,
right?

537
0:23:1,5 --> 0:23:6,06
Because the first few calls are
not cached yet in prepared statements

538
0:23:6,1 --> 0:23:10,02
because it uses a custom plan
for the first 5 times.

539
0:23:10,68 --> 0:23:13,48
There is a hard-coded constant
in PostgreSQL.
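
The heuristic Nikolay is referring to can be simulated roughly as below. This is a simplified sketch of PostgreSQL's plan-cache logic (plancache.c), not the real cost model, which also accounts for replanning overhead:

```python
# Rough simulation of PostgreSQL's prepared-statement plan choice:
# the first 5 executions always use a custom plan; after that, the
# planner switches to the cached generic plan if its estimated cost
# is not higher than the average custom-plan cost.
CUSTOM_PLAN_TRIES = 5  # the hard-coded constant mentioned above

def choose_plan(num_executions: int, avg_custom_cost: float,
                generic_cost: float) -> str:
    if num_executions < CUSTOM_PLAN_TRIES:
        return "custom"
    # Simplified comparison; the real check adds planning overhead
    # to the custom side before comparing.
    return "generic" if generic_cost <= avg_custom_cost else "custom"

# Seven executions of one prepared statement: 5 custom, then generic.
plans = [choose_plan(n, avg_custom_cost=100.0, generic_cost=90.0)
         for n in range(7)]
```

With partitioning this matters because the switch to a generic plan is exactly the point where planning-time pruning is lost.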

540
0:23:14,76 --> 0:23:17,92
And with partitioning, it's much
more complicated.

541
0:23:18,18 --> 0:23:20,9
I have a series of blog posts about
this.

542
0:23:20,9 --> 0:23:22,62
It's like super complicated.

543
0:23:23,56 --> 0:23:25,2
Yeah, I'm curious about this.

544
0:23:25,2 --> 0:23:30,06
Maybe just that nobody encountered
any issues yet, but yeah,

545
0:23:30,06 --> 0:23:31,36
I'm curious about this.

546
0:23:32,5 --> 0:23:33,22
Lev: No clue.

547
0:23:33,34 --> 0:23:35,64
Yeah, let me know what I need to
fix.

548
0:23:35,86 --> 0:23:36,66
Nikolay: Yeah, I understand.

549
0:23:36,66 --> 0:23:37,4
Very practical.

550
0:23:37,4299 --> 0:23:38,76
I understand that.

551
0:23:39,34 --> 0:23:42,78
Michael: Yeah, but it's also, also
Nik, I think you're like,

552
0:23:42,98 --> 0:23:45,74
this feels like the kind of thing
that actually gets a little

553
0:23:45,74 --> 0:23:47,06
bit better with sharding.

554
0:23:47,14 --> 0:23:51,04
Like you just reduce the number,
you reduce the QPS, you reduce

555
0:23:51,04 --> 0:23:55,54
the load on each, like, everything
that could cause issues

556
0:23:55,6 --> 0:24:0,86
around that is, however many shards
you have, times less bad

557
0:24:0,86 --> 0:24:2,66
or less likely to hit the cliff?

558
0:24:2,86 --> 0:24:5,28
Nikolay: Absolutely, sharding reduces
the need in partitioning,

559
0:24:5,46 --> 0:24:10,02
100% so, but maybe not fully eliminates
it because partitioning,

560
0:24:10,12 --> 0:24:14,48
if it's time-based, it still makes
sense within 1 shard.

561
0:24:15,18 --> 0:24:20,24
We know cases of really large databases
which combine both approaches.

562
0:24:21,5 --> 0:24:23,62
Michael: Yeah, for maintenance
reasons it still makes complete

563
0:24:23,62 --> 0:24:24,14
sense, right?

564
0:24:24,14 --> 0:24:24,64
Nikolay: Right.

565
0:24:24,72 --> 0:24:25,48
Yeah, yeah.

566
0:24:25,48 --> 0:24:27,38
So yeah, it's sticky.

567
0:24:27,38 --> 0:24:30,92
So and if you want prepared statements,
oh, it's like, it's

568
0:24:30,92 --> 0:24:31,7
super complex.

569
0:24:32,36 --> 0:24:34,1
Lev: So what's the deal?

570
0:24:34,64 --> 0:24:35,7
What doesn't work?

571
0:24:39,72 --> 0:24:44,76
Nikolay: So partitioning brings
cache not at planning time, but

572
0:24:44,76 --> 0:24:46,06
at execution time.

573
0:24:46,98 --> 0:24:51,04
And this increases complexity because
if you just deal with planning

574
0:24:51,04 --> 0:24:53,14
time cache, it's quite simple.

575
0:24:53,44 --> 0:24:57,98
So if it's not partitioned, only
the planner can decide to cache

576
0:24:57,98 --> 0:24:58,82
or not, right?

577
0:24:58,82 --> 0:24:59,64
That's it.

578
0:24:59,72 --> 0:25:4,7
If it's partitioned, there is a
decision point shifted to execution

579
0:25:4,76 --> 0:25:5,26
time.

580
0:25:5,32 --> 0:25:8,8
And there is a point when basically
you are without cache and

581
0:25:8,8 --> 0:25:14,44
you cannot avoid it because when
cache is being built, at that

582
0:25:14,44 --> 0:25:17,34
execution, it's the sixth execution
or fifth execution.

583
0:25:17,64 --> 0:25:22,08
So at that point, executor doesn't
have this cache because it's

584
0:25:22,08 --> 0:25:22,86
being built.

585
0:25:23,08 --> 0:25:24,72
So we are without cache.

586
0:25:24,72 --> 0:25:31,02
And in high QPS cases, when frequency
of calls is super high,

587
0:25:31,16 --> 0:25:36,02
We basically, there is a moment
when we must live without cache

588
0:25:36,58 --> 0:25:39,1
and it might be super non-performant.

589
0:25:39,28 --> 0:25:42,72
So performance can be terrible
in that zone,

590
0:25:42,72 --> 0:25:43,26
Lev: I see.

591
0:25:43,26 --> 0:25:46,22
Nikolay: Which doesn't happen with
unpartitioned case at all.

592
0:25:46,32 --> 0:25:50,34
This is like, I already keep, I
already start forgetting details

593
0:25:50,34 --> 0:25:54,8
because so many nuances if you
don't live there, if you don't

594
0:25:54,8 --> 0:25:58,86
only visit it like I did for a
couple of weeks and then, but

595
0:25:58,86 --> 0:26:1,98
I have several blog posts about
this problem with demonstrations.

596
0:26:3,48 --> 0:26:6,56
Lev: See, this is the prime example
why being a proxy for Postgres

597
0:26:6,56 --> 0:26:9,86
sharding is the best place to be
because to me, that problem

598
0:26:9,86 --> 0:26:11,34
is completely irrelevant.

599
0:26:11,98 --> 0:26:12,34
Right?

600
0:26:12,34 --> 0:26:15,26
Because I'm just passing statements
through, right?

601
0:26:15,28 --> 0:26:19,54
I think, unless I'm, because like, they will work or they won't,

602
0:26:19,54 --> 0:26:23,2
but from a performance perspective, like if you have trouble

603
0:26:23,2 --> 0:26:25,2
with partition tables and prepared statements, you just don't

604
0:26:25,2 --> 0:26:26,4
use prepared statements, right?

605
0:26:26,4 --> 0:26:27,26
Nikolay: Yeah, yeah, yeah.

606
0:26:27,26 --> 0:26:34,04
So partitioning also brings the problem

607
0:26:34,12 --> 0:26:37,44
of explosion of access share locks.

608
0:26:38,04 --> 0:26:38,54
Because...

609
0:26:40,68 --> 0:26:42,84
I'm talking about this particular, but yeah.

610
0:26:43,38 --> 0:26:48,54
So if you have a lot of partitions, and we have plan caching, one

611
0:26:48,54 --> 0:26:52,48
idea, but also we have routing to a specific partition, which is

612
0:26:52,48 --> 0:26:54,86
called constraint exclusion, right?

613
0:26:54,96 --> 0:26:56,82
We need to route to specific partition.

614
0:26:56,82 --> 0:26:57,94
This is another idea.

615
0:26:58,3 --> 0:27:3,76
And at planning time, if planner knows where to route, this is

616
0:27:3,76 --> 0:27:7,94
normal case, when you don't have prepared statements.

617
0:27:8,04 --> 0:27:10,82
It just routes to specific partition, all good.

618
0:27:11,5 --> 0:27:12,18
No problem.

619
0:27:12,18 --> 0:27:14,78
We lock only 1 partition with all its indexes.

620
0:27:15,62 --> 0:27:21,54
But if we postpone it to execution, it needs to lock all tables

621
0:27:21,54 --> 0:27:23,54
with all their indexes, and it's terrible.
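
The blow-up is simple arithmetic. A back-of-the-envelope sketch, with hypothetical numbers: with planning-time pruning you take an AccessShareLock on one partition plus its indexes; when pruning is postponed to execution time (the generic-plan case), every partition and every index must be locked first:

```python
# Hypothetical numbers to show the AccessShareLock explosion when
# partition pruning happens at execution time instead of plan time.
partitions = 365            # e.g. one partition per day
indexes_per_partition = 5

# Pruned at plan time: lock one partition and its indexes.
pruned_locks = 1 + indexes_per_partition                  # 6 locks

# Pruned only at execution time: lock everything up front.
generic_locks = partitions * (1 + indexes_per_partition)  # 2,190 locks
```

At high QPS, thousands of locks per query is what pushes the lock manager over the cliff mentioned below.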

622
0:27:24,8 --> 0:27:29,28
With sharding, I don't know, you just...

623
0:27:30,06 --> 0:27:32,22
Lev: Well, you reduce the QPS, so...

624
0:27:32,24 --> 0:27:33,66
Nikolay: Yeah, this is already...

625
0:27:33,92 --> 0:27:36,78
And the lock manager is not a problem anymore.

626
0:27:36,78 --> 0:27:41,46
Or you just need to use Postgres 18 and raise max_locks_per_transaction.

627
0:27:42,18 --> 0:27:45,6
In this case, lock manager also is the same.

628
0:27:46,26 --> 0:27:48,48
So yeah, anyway, this is an interesting topic.

629
0:27:48,48 --> 0:27:52,48
And I'm curious if some people combine sharding and partitioning

630
0:27:53,36 --> 0:27:56,52
I think they should sometimes. They should sometimes, and it's

631
0:27:56,52 --> 0:28:0,04
just in really big setups, yeah.

632
0:28:0,56 --> 0:28:3,18
Lev: I mean as a guy who's trying to build sharding you should

633
0:28:3,18 --> 0:28:6,42
always use sharding unless you can't but like that's you know

634
0:28:6,96 --> 0:28:8,68
Michael: I think there's a maintenance argument.

635
0:28:8,68 --> 0:28:14,06
Imagine if you've got like a logs table per shard and you want

636
0:28:14,06 --> 0:28:17,22
to delete old logs after a certain time, you've got a retention period,

637
0:28:17,22 --> 0:28:21,04
you probably still want partitioning to make those deletes really

638
0:28:21,04 --> 0:28:21,54
efficient.

639
0:28:21,6 --> 0:28:24,64
I think there's still a use case for it but it's not so much

640
0:28:24,64 --> 0:28:27,34
for performance reasons but it's for maintenance reasons, that'd

641
0:28:27,34 --> 0:28:28,22
be my guess.

642
0:28:28,58 --> 0:28:33,28
Nikolay: Yeah, for to build indexes, to have vacuum working more

643
0:28:33,28 --> 0:28:35,14
efficiently, and so on and so on.

644
0:28:36,3 --> 0:28:38,72
Michael: Yeah, but I think those are, I still characterize those

645
0:28:38,72 --> 0:28:41,14
slight, I know they are maintenance, but they're kind of still

646
0:28:41,14 --> 0:28:42,4
performance reasons, right?

647
0:28:42,4 --> 0:28:44,44
You want them, because you don't want the table to be too big.

648
0:28:44,44 --> 0:28:48,02
I'm talking about doing like deletes efficiently, being able

649
0:28:48,02 --> 0:28:51,18
to, instead of having to like do batch deletes you can...

650
0:28:51,18 --> 0:28:51,68
Exactly.

651
0:28:51,82 --> 0:28:55,28
Nikolay: It's exactly... how's it
called? Not data retention, but

652
0:28:55,28 --> 0:28:57,0
there is some additional term

653
0:28:57,18 --> 0:28:59,7
Michael: Yeah, yeah, the opposite
of...

654
0:29:0,02 --> 0:29:2,54
Lev: Yeah, there could be like a
legal requirement for data retention.

655
0:29:2,54 --> 0:29:5,28
I think that's the only reason
why people would ever delete data,

656
0:29:5,28 --> 0:29:5,78
right?

657
0:29:6,18 --> 0:29:7,48
Well, the other reason is performance.

658
0:29:8,16 --> 0:29:11,78
So if you eliminate the performance
argument, like if you make

659
0:29:11,78 --> 0:29:14,94
performance just free because of
sharding makes that, I mean,

660
0:29:15,04 --> 0:29:16,78
obviously that's not the case,
but like, no.

661
0:29:16,92 --> 0:29:19,86
Nikolay: Yeah, and if you compare
sharding and partitioning,

662
0:29:20,66 --> 0:29:25,52
with partitioning, we often are
willing to have 1 active partition

663
0:29:25,52 --> 0:29:27,76
where we write and all others are
archived.

664
0:29:27,98 --> 0:29:30,36
In case of sharding, it's strange,
right?

665
0:29:30,36 --> 0:29:32,66
Because a lot of hardware is not
used.

666
0:29:32,66 --> 0:29:36,9
Like if you have only 1 shard read-write
and others like archive,

667
0:29:36,9 --> 0:29:38,66
it's strange, simply strange, right?

668
0:29:38,66 --> 0:29:41,24
So you want to distribute writes
usually.

669
0:29:41,94 --> 0:29:44,48
Lev: You want it to be equivalent,
like you want like a perfect,

670
0:29:44,48 --> 0:29:47,36
like all shards have the same load,
And that's why you'd usually

671
0:29:47,36 --> 0:29:49,9
use like hash-based partitioning
if you can, again.

672
0:29:49,9 --> 0:29:52,64
Or the way you design your schema
for sharding is you make sure

673
0:29:52,64 --> 0:29:55,24
that each shard roughly has the
same load.
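
The even-load goal can be sketched as plain hash routing on the sharding key. The hash function here (`zlib.crc32`) is just an illustrative stand-in; a real setup would use PostgreSQL's own hash functions so routing matches declarative hash partitioning:

```python
# Sketch of hash-based shard routing by a tenant/customer key.
# zlib.crc32 is a stand-in hash for illustration, not what PgDog
# or Postgres hash partitioning actually uses.
import zlib

def shard_for(key: str, num_shards: int) -> int:
    return zlib.crc32(key.encode()) % num_shards

# With a decent hash, 10,000 customers spread roughly evenly
# across 8 shards, avoiding the hot-shard problem.
counts = [0] * 8
for customer_id in range(10_000):
    counts[shard_for(str(customer_id), 8)] += 1

balanced = max(counts) - min(counts) < 400  # expected ~1,250 each
```

The hot-shard problem appears when one key (one big B2B customer) concentrates load regardless of how good the hash is, which is why a second partitioning tier comes up next.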

674
0:29:55,24 --> 0:29:57,78
If you have hot shard, you're gonna
have a hot shard problem

675
0:29:58,18 --> 0:29:59,36
sooner rather than later.

676
0:29:59,48 --> 0:29:59,98
Nikolay: Yeah.

677
0:30:0,66 --> 0:30:3,06
Lev: Exactly, especially for like
multi-tenant, like what I call

678
0:30:3,06 --> 0:30:5,86
multi-tenant is B2B basically,
where your sharding key is like

679
0:30:5,86 --> 0:30:9,32
a customer ID, if you will, and
then 1 of your customers grows

680
0:30:9,32 --> 0:30:14,06
big and you need a second sharding
tier, or if you will, a second

681
0:30:14,06 --> 0:30:14,4066
partitioning tier.

682
0:30:14,4066 --> 0:30:17,64
Nikolay: And does PgDog support
resharding without downtime

683
0:30:17,66 --> 0:30:18,64
already, you know?

684
0:30:18,74 --> 0:30:20,18
Oh yeah, yeah, yeah, of course.

685
0:30:20,18 --> 0:30:22,16
How's it technically working?

686
0:30:22,42 --> 0:30:23,3
Lev: I mean, it's good.

687
0:30:23,42 --> 0:30:25,52
It's been tested, so that's kind
of cool.

688
0:30:26,18 --> 0:30:30,66
The idea is, I mean, I don't want
to say simple, but I think

689
0:30:30,66 --> 0:30:31,16
practical.

690
0:30:31,36 --> 0:30:35,28
For re-sharding, this is where
the contrast between how this

691
0:30:35,28 --> 0:30:38,98
works and like NoSQL databases
work is a little bit more apparent.

692
0:30:39,14 --> 0:30:41,6
With NoSQL, you kind of add another
machine and then it kind

693
0:30:41,6 --> 0:30:44,28
of rebalances all the data using
existing machines.

694
0:30:45,14 --> 0:30:49,02
The PgDog approach is you create
a brand new empty cluster, and

695
0:30:49,02 --> 0:30:51,64
then you copy all the data out
into the new cluster.

696
0:30:52,12 --> 0:30:55,34
And this decision is actually not
because of a particular limitation,

697
0:30:55,38 --> 0:30:57,18
but it's intentional.

698
0:30:57,38 --> 0:31:0,58
Because you want to reduce the
load on the existing cluster during

699
0:31:0,58 --> 0:31:3,42
re-sharding, because usually when
you start re-sharding, it's

700
0:31:4,14 --> 0:31:6,26
already under load, it's already
underperforming, it's already

701
0:31:6,26 --> 0:31:6,82
having problems.

702
0:31:6,82 --> 0:31:9,84
So you don't want to add additional load with resharding to it.

703
0:31:9,84 --> 0:31:12,54
So you create a new, brand new cluster, you move the data over

704
0:31:12,54 --> 0:31:17,12
and we are able to use replicas since Postgres 16 to copy data

705
0:31:17,12 --> 0:31:19,46
out with logical replication, that was a huge thing.

706
0:31:19,46 --> 0:31:22,4
That was, I think, one of the game changers that made this possible,

707
0:31:22,9 --> 0:31:25,08
is logical replication slots on replicas.

708
0:31:25,08 --> 0:31:25,58
Why?

709
0:31:26,66 --> 0:31:27,84
Why does it matter?

710
0:31:28,86 --> 0:31:29,48
Why does it matter?

711
0:31:29,48 --> 0:31:33,38
Because you can create 16 replicas and parallelize your copying.

712
0:31:33,84 --> 0:31:34,94
Why does it matter?

713
0:31:35,28 --> 0:31:36,22
Nikolay: Disk I/O or what?

714
0:31:36,22 --> 0:31:37,26
Why does it matter?

715
0:31:37,26 --> 0:31:40,94
Like you don't want to use disk I/O on the primary or what?

716
0:31:41,32 --> 0:31:42,32
Lev: Well, you definitely don't.

717
0:31:42,32 --> 0:31:43,68
Your primary is already shot.

718
0:31:43,68 --> 0:31:45,66
Like if when you're sharding, your primary is already having

719
0:31:45,66 --> 0:31:46,72
a really, really bad time.

720
0:31:46,72 --> 0:31:48,62
That's the main reason behind sharding.

721
0:31:48,62 --> 0:31:50,2
You want to scale your writes.

722
0:31:50,2 --> 0:31:54,64
So adding additional replication load to an already underwater

723
0:31:54,72 --> 0:31:56,14
database is bad.

724
0:31:56,6 --> 0:31:59,76
And especially when you want to create, let's say, you go from

725
0:31:59,76 --> 0:32:1,22
one shard to 16 shards.

726
0:32:1,24 --> 0:32:3,02
Extreme example, but realistic.

727
0:32:3,36 --> 0:32:6,4
You're going to add 16 replication streams to your primary.

728
0:32:7,12 --> 0:32:9,82
That's like 16 times like streaming the same data.

729
0:32:9,84 --> 0:32:11,12
The thing is going to fall over.

730
0:32:11,12 --> 0:32:12,32
I guarantee it.

731
0:32:13,28 --> 0:32:14,8
Nikolay: But overhead is amplified here.

732
0:32:14,8 --> 0:32:17,94
So the same work is done on different CPUs, VCPUs.

733
0:32:19,16 --> 0:32:20,94
Lev: Yeah, yeah, decoded and recoded.

734
0:32:21,62 --> 0:32:22,64
Nikolay: Yeah, yeah, yeah.

735
0:32:22,64 --> 0:32:23,14
So.

736
0:32:23,68 --> 0:32:25,78
You want to shard this load, right?

737
0:32:27,6 --> 0:32:29,18
Lev: Exactly, exactly.

738
0:32:30,36 --> 0:32:33,98
We're able to use one replica per table basically for, for re-sharding.

739
0:32:34,04 --> 0:32:37,2
So if you have like, say 1000 tables, you're able to shard your

740
0:32:37,2 --> 0:32:40,08
sharding stream and, like, split it into 16 different

741
0:32:40,08 --> 0:32:43,1
machines or 32 different machines and sync it in parallel, which

742
0:32:43,1 --> 0:32:44,06
I think is pretty huge.
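
The parallel-copy idea described here can be sketched as a simple assignment plan: spin up several physical replicas, assign each table's initial copy to one of them round-robin, then wait for every stream to reach a common LSN before cutover. Names and numbers are illustrative, not PgDog's actual scheduler:

```python
# Sketch of spreading the initial table copy across replicas for
# parallel resharding. Round-robin assignment; real logic might
# balance by table size instead.
def assign_tables(tables: list, replicas: int) -> dict:
    plan = {r: [] for r in range(replicas)}
    for i, table in enumerate(tables):
        plan[i % replicas].append(table)
    return plan

# 1,000 tables spread over 16 replicas: each replica copies
# 62 or 63 tables in parallel, then all streams synchronize
# at the same LSN before cutover to the new cluster.
tables = ["table_%d" % i for i in range(1000)]
plan = assign_tables(tables, 16)
sizes = {len(t) for t in plan.values()}
```

Because the replicas all follow the same primary, their WAL positions are comparable, which is what makes the final "wait for the same LSN, pause traffic, cut over" step possible.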

743
0:32:44,06 --> 0:32:45,3
It's all the same LSN, right?

744
0:32:45,3 --> 0:32:46,56
Because they're replicas, right?

745
0:32:46,56 --> 0:32:49,36
So you can synchronize at the end and make sure that you got,

746
0:32:49,36 --> 0:32:53,08
you're at the right point and like pause the live traffic and

747
0:32:53,08 --> 0:32:56,12
then wait for the logical stream to catch up and cut over to

748
0:32:56,12 --> 0:32:56,82
the new cluster.

749
0:32:56,82 --> 0:33:0,56
Like all of that is, the pieces are there, you just need a button

750
0:33:0,56 --> 0:33:1,48
to make it work.

751
0:33:1,48 --> 0:33:4,4
We haven't built the button yet, but we've built all the levers

752
0:33:4,4 --> 0:33:5,96
you need to push as an operator.

753
0:33:6,1 --> 0:33:6,54
So when

754
0:33:6,54 --> 0:33:10,68
Nikolay: you say you create cluster, you mean create Postgres

755
0:33:10,68 --> 0:33:13,3
cluster as just basically 1 new primary node.

756
0:33:13,58 --> 0:33:17,42
And when you say you copy, it means that there is already something

757
0:33:17,48 --> 0:33:18,98
in PgDog that copies.

758
0:33:19,12 --> 0:33:20,7
You just need to use it.

759
0:33:21,38 --> 0:33:23,66
It's not on the shoulders of the user, right?

760
0:33:24,34 --> 0:33:28,66
Lev: No, we hook into the replication protocol to stream the

761
0:33:28,66 --> 0:33:32,8
rows and we use copy, binary copy underneath, and we shard the

762
0:33:32,8 --> 0:33:33,68
copy stream.

763
0:33:33,74 --> 0:33:37,02
Nikolay: All right, this is the option since Postgres 16, or

764
0:33:37,12 --> 0:33:38,18
14, I think.

765
0:33:38,26 --> 0:33:38,66
Binary.

766
0:33:38,66 --> 0:33:39,88
Lev: Oh, it's been a minute, yeah.

767
0:33:39,88 --> 0:33:43,34
Nikolay: When you initiate logical replica, I guess.

768
0:33:45,32 --> 0:33:49,62
There is an option, binary = true, for the initialization part.

769
0:33:49,66 --> 0:33:50,66
Interesting, actually.

770
0:33:50,66 --> 0:33:51,68
Lev: Yeah, look at this.

771
0:33:51,98 --> 0:33:55,2
Nikolay: There is also trick to convert physical to logical,

772
0:33:55,2 --> 0:33:58,94
which is official since Postgres 17, I think it's called

773
0:33:59,76 --> 0:34:2,52
pg_createsubscriber, CLI.

774
0:34:2,64 --> 0:34:3,36
Lev: Oh, interesting.

775
0:34:3,54 --> 0:34:6,92
I'm using like pgoutput, which streams everything in text

776
0:34:7,2 --> 0:34:7,7
Nikolay: format.

777
0:34:9,14 --> 0:34:10,02
Yeah, yeah, yeah.

778
0:34:10,6 --> 0:34:12,62
But yes, but right.

779
0:34:12,94 --> 0:34:15,04
But still there are additional things here.

780
0:34:15,06 --> 0:34:19,3
You can convert physical replica to logical replica, and we could

781
0:34:19,3 --> 0:34:21,0
do it before, but now it's official.

782
0:34:21,04 --> 0:34:21,88
It's official.

783
0:34:22,12 --> 0:34:22,86
Lev: Oh, I

784
0:34:22,86 --> 0:34:22,96
Nikolay: see.

785
0:34:22,96 --> 0:34:23,46
pg_createsubscriber.

786
0:34:24,22 --> 0:34:27,74
Unfortunately, we won't be able to use it on most managed services

787
0:34:27,74 --> 0:34:29,44
because they don't expose this.

788
0:34:29,44 --> 0:34:30,28
They should, actually.

789
0:34:30,28 --> 0:34:30,64
Yeah.

790
0:34:30,64 --> 0:34:32,14
They should expose this in API.

791
0:34:32,72 --> 0:34:33,94
I hope they will.

792
0:34:34,34 --> 0:34:38,04
But in case of self-managed, it's good.

793
0:34:39,18 --> 0:34:41,28
Lev: Yeah, with self-managed, you have a lot more options for

794
0:34:41,28 --> 0:34:41,64
sure.

795
0:34:41,64 --> 0:34:45,36
I think the original goal was to make this work in managed environments

796
0:34:45,36 --> 0:34:46,56
because that's where they are.

797
0:34:46,56 --> 0:34:47,06
Nikolay: Right.

798
0:34:47,22 --> 0:34:47,48
Yeah.

799
0:34:47,48 --> 0:34:50,28
You mentioned Aurora already, support of Aurora.

800
0:34:50,28 --> 0:34:51,06
Lev: That's great.

801
0:34:52,08 --> 0:34:52,58
Nikolay: Yeah.

802
0:34:52,74 --> 0:34:57,04
And these replicas, so you create
a logical replica out of a physical one,

803
0:34:57,7 --> 0:35:0,25
copying data and performing
CDC.

804
0:35:0,25 --> 0:35:5,7
So, so that runs from a slot
on a replica, which is supported

805
0:35:6,82 --> 0:35:8,04
since recently, right?

806
0:35:8,94 --> 0:35:14,36
I'm curious about this: when you initialize
a logical replica from this,

807
0:35:14,72 --> 0:35:16,6
is this physical standby?

808
0:35:16,82 --> 0:35:20,28
Does it have hot standby feedback
on or off?

809
0:35:20,28 --> 0:35:24,6
Does it report to the primary that
it still needs some data?

810
0:35:24,6 --> 0:35:28,18
Because if it does, the primary
health can be affected.

811
0:35:29,22 --> 0:35:32,66
If you eventually switch over,
that bloat doesn't matter.

812
0:35:33,4 --> 0:35:35,86
But if you roll back...

813
0:35:36,78 --> 0:35:37,84
Lev: So I think

814
0:35:38,3 --> 0:35:41,9
Nikolay: if this physical replica
is not used by anyone else,

815
0:35:41,9 --> 0:35:44,0
I would keep HBF off.

816
0:35:46,12 --> 0:35:48,78
To let it lag, I would raise the
max.

817
0:35:48,78 --> 0:35:52,86
There are a couple of settings
which allow a replica to lag.

818
0:35:53,74 --> 0:35:57,12
If hot standby feedback is off, it's allowed
to lag.

819
0:35:57,12 --> 0:35:59,6
There are many, many nuances here
in this process.

820
0:35:59,6 --> 0:36:1,86
I'm curious how you do it.

821
0:36:2,12 --> 0:36:4,86
But okay, you're using logical
replica here.

822
0:36:5,84 --> 0:36:8,5
Lev: Yeah, that's a good call out
actually.

823
0:36:8,6 --> 0:36:11,62
Like my recommendation is to create
a dedicated replica for this

824
0:36:11,64 --> 0:36:14,1
or dedicated set of replicas, but.

825
0:36:14,1 --> 0:36:15,7
Nikolay: Without hot standby feedback, I would

826
0:36:15,7 --> 0:36:16,24
Lev: do it.

827
0:36:16,24 --> 0:36:18,4
Nikolay: If it's temporary replica,
why not?

828
0:36:18,4 --> 0:36:23,2
Like, And also how this replica
is created, right?

829
0:36:23,2 --> 0:36:25,14
Like it's an interesting topic.

830
0:36:25,76 --> 0:36:30,22
Do you integrate with some backup
tools or you just, it's out

831
0:36:30,22 --> 0:36:31,34
of scope, right?

832
0:36:31,36 --> 0:36:32,14
For PgDog.

833
0:36:32,24 --> 0:36:36,98
Lev: I feel like at the moment
it is, but maybe in 6 months it

834
0:36:36,98 --> 0:36:37,76
won't be.

835
0:36:37,8 --> 0:36:41,48
Like right now I'm relying on people
using like some kind of

836
0:36:41,48 --> 0:36:43,02
reliable way to create replicas.

837
0:36:43,08 --> 0:36:44,94
If you use RDS, you just click
a button.

838
0:36:45,04 --> 0:36:47,12
Again, like if you're using like
any of the managed services,

839
0:36:47,12 --> 0:36:49,98
you just click a button, you get
a replica, it's using disk snapshots.

840
0:36:50,84 --> 0:36:51,68
You're self-hosting.

841
0:36:51,68 --> 0:36:54,52
Like if you have ZFS, presumably
you'll have your snapshots somewhere.

842
0:36:54,52 --> 0:36:57,8
You can restore from, from a snapshot
and resynchronize pretty

843
0:36:57,8 --> 0:36:58,3
quickly.

844
0:36:58,66 --> 0:37:2,96
But even then it takes a while
to, to sync the ZFS snapshot on

845
0:37:2,96 --> 0:37:3,46
disk.

846
0:37:3,94 --> 0:37:5,04
Tested that one.

847
0:37:5,38 --> 0:37:8,08
Takes minutes, sometimes hours.

848
0:37:8,8 --> 0:37:11,18
Oh, well, if you have like a 2
terabyte file system.

849
0:37:11,76 --> 0:37:13,76
Nikolay: You mean the first send
receive?

850
0:37:13,86 --> 0:37:15,8
So between the hosts?

851
0:37:15,8 --> 0:37:16,7
Yeah, I understand.

852
0:37:17,56 --> 0:37:18,06
Lev: Yeah.

853
0:37:18,68 --> 0:37:21,48
It's surprisingly limited by, I
think, the network, actually.

854
0:37:21,5 --> 0:37:27,6
I needed to like compress it and
then do something else to it.

855
0:37:28,14 --> 0:37:31,3
And I wrote a blog post about it on
our old blog at Postgres.

856
0:37:33,42 --> 0:37:33,96
Oh my god.

857
0:37:33,96 --> 0:37:34,46
Michael: ML?

858
0:37:34,46 --> 0:37:35,46
Lev: Thank you.

859
0:37:38,42 --> 0:37:42,6
But the website went down so my
shame went down with it as well.

860
0:37:42,72 --> 0:37:44,02
The blog post is gone.

861
0:37:44,02 --> 0:37:46,04
I think it was a good blog post
but whatever.

862
0:37:46,06 --> 0:37:49,2
I learned how to use compression
tools and ZFS and, you know,

863
0:37:49,2 --> 0:37:49,9
big deal.

864
0:37:52,06 --> 0:37:55,2
Michael: Just out of interest,
like, are people generally using

865
0:37:55,2 --> 0:37:58,74
this with a managed service or
is it quite split between managed

866
0:37:58,74 --> 0:37:59,84
service and self-managed?

867
0:38:1,4 --> 0:38:2,58
Lev: Right now it's split.

868
0:38:2,64 --> 0:38:6,0
A lot of managed service
use especially, like failover

869
0:38:6,02 --> 0:38:7,54
for Aurora is a big deal.

870
0:38:7,84 --> 0:38:11,68
Like failover detection, because
you don't run Patroni typically.

871
0:38:12,26 --> 0:38:17,18
So it's able to like query the
pg..., sorry, what was the

872
0:38:17,18 --> 0:38:18,1
name of the function?

873
0:38:18,12 --> 0:38:19,58
Nikolay: pg_is_in_recovery.

874
0:38:19,94 --> 0:38:21,46
Lev: pg_is_in_recovery, thank you.

875
0:38:21,46 --> 0:38:24,18
Nikolay: In my opinion, to stop
confusing people.

876
0:38:26,04 --> 0:38:28,18
Lev: The people who know, know,
and the people who don't know,

877
0:38:28,18 --> 0:38:28,76
don't need to know.

878
0:38:28,76 --> 0:38:29,14
Yeah, if you know, you know.

879
0:38:29,14 --> 0:38:29,64
Yeah.

880
0:38:31,62 --> 0:38:32,12
Right.

881
0:38:32,78 --> 0:38:37,02
Yeah, and I'm sure you know Aurora
loves to fail over on a perfectly

882
0:38:37,02 --> 0:38:37,54
good day.

883
0:38:37,54 --> 0:38:40,78
So it helps a lot with uptime.

884
0:38:41,0 --> 0:38:44,24
But one of my biggest sharded deployments
is literally like on-prem,

885
0:38:44,24 --> 0:38:47,64
on-prem, like they rent their own
data center and everything.

886
0:38:47,64 --> 0:38:48,78
So it varies.

887
0:38:48,78 --> 0:38:50,36
Like that's the beauty of it.

888
0:38:50,66 --> 0:38:53,76
Like operating on the application
network layer, you really don't

889
0:38:53,76 --> 0:38:55,14
care that much where you're deployed.

890
0:38:56,44 --> 0:39:0,84
Nikolay: So based on this, almost
a year since you started this,

891
0:39:1,4 --> 0:39:6,36
what's your current vision for
your customer who needs pg_auto_failover?

892
0:39:6,82 --> 0:39:9,36
And what's your vision for next
future steps?

893
0:39:9,72 --> 0:39:10,82
Where do you move?

894
0:39:11,04 --> 0:39:15,86
Lev: Yeah, this is a really interesting
question because sharding

895
0:39:15,86 --> 0:39:20,6
is always going to be like the
North Star, because it explains

896
0:39:20,6 --> 0:39:23,76
to people: this is a scaling
product, ultimately. You can start

897
0:39:23,76 --> 0:39:26,14
anywhere. I already feel like it's
a better connection pool than

898
0:39:26,14 --> 0:39:26,88
pgBouncer

899
0:39:26,94 --> 0:39:27,1
Nikolay: I

900
0:39:27,1 --> 0:39:29,92
Lev: already feel like the load
balancing aspect of it is a game

901
0:39:29,92 --> 0:39:34,3
changer. And sharding, as much as it is a work in progress, is

902
0:39:34,3 --> 0:39:35,28
the North Star for us.

903
0:39:35,28 --> 0:39:38,04
So that's how we want this project to evolve.

904
0:39:38,04 --> 0:39:44,94
We want to make Postgres as scalable as Cassandra or Scylla or

905
0:39:45,06 --> 0:39:45,56
Elasticsearch.

906
0:39:45,76 --> 0:39:48,74
I don't know what people think about those particular databases,

907
0:39:48,74 --> 0:39:51,6
but I think they're horizontally scaled, we can all agree.

908
0:39:51,74 --> 0:39:54,22
So we want Postgres to be like that as well.

909
0:39:54,34 --> 0:39:58,08
We're building, it hasn't been released yet, we're building an

910
0:39:58,08 --> 0:40:2,02
enterprise edition of PgDog, which adds the classic things like

911
0:40:2,02 --> 0:40:5,22
monitoring, understanding how everything works, a control plane,

912
0:40:5,5 --> 0:40:6,72
something that we can

913
0:40:7,36 --> 0:40:9,72
sell as a product to people, you know, like you don't have to

914
0:40:9,72 --> 0:40:12,56
like go through like many, many steps required to set up like

915
0:40:12,56 --> 0:40:15,76
a, like a sharding layer and network proxy. So that should come

916
0:40:15,76 --> 0:40:17,78
out within the next few months, hopefully.

917
0:40:18,34 --> 0:40:21,54
Nikolay: Will it include automated sharding with AI?

918
0:40:21,9 --> 0:40:24,9
Like, press a button and your database is sharded already.

919
0:40:25,68 --> 0:40:29,86
Lev: You want automatic sharding with AI as part of our feature

920
0:40:29,86 --> 0:40:30,36
set?

921
0:40:30,6 --> 0:40:31,28
Nikolay: I think we

922
0:40:31,28 --> 0:40:33,06
Michael: just want more consulting clients.

923
0:40:35,92 --> 0:40:39,3
Lev: I don't think we'll ever get automatic sharding with AI,

924
0:40:39,34 --> 0:40:41,46
but I don't know.

925
0:40:41,64 --> 0:40:45,44
Right now my idea for that is like dump the, use pg_dump, get

926
0:40:45,44 --> 0:40:48,74
the schema out, pipe it into ChatGPT and ask it what to do.

927
0:40:51,14 --> 0:40:51,82
With a prompt.

928
0:40:51,82 --> 0:40:52,46
And test.

929
0:40:53,44 --> 0:40:54,26
Yeah, exactly.

930
0:40:54,34 --> 0:40:54,8
And test.

931
0:40:54,8 --> 0:40:55,16
Yeah.

932
0:40:55,16 --> 0:40:56,02
I don't know.

933
0:40:56,26 --> 0:40:59,54
A human being is able to come up with a sharding key really quickly.

934
0:40:59,54 --> 0:41:4,12
Like I talked to like a senior database person at a company or

935
0:41:4,12 --> 0:41:6,22
an app engineer and they're like, oh, my sharding key is like,

936
0:41:6,22 --> 0:41:11,9
they already know it's org ID or customer ID or like they instinctively

937
0:41:11,94 --> 0:41:14,24
know how the data is organized and they even know what the edge

938
0:41:14,24 --> 0:41:15,14
cases are.

939
0:41:15,18 --> 0:41:19,28
You know, like I don't think you need AI for that one unless you

940
0:41:19,28 --> 0:41:23,08
literally have never seen this database before. But yeah,

941
0:41:23,08 --> 0:41:25,74
that's the way I think about it.

942
0:41:26,04 --> 0:41:26,88
Nikolay: I don't think

943
0:41:27,26 --> 0:41:28,78
Michael: Yeah, on the vision stuff.

944
0:41:28,78 --> 0:41:32,82
How how closely are you paying attention to the other sharding

945
0:41:34,18 --> 0:41:39,14
projects in the space and how do you differ in terms of your

946
0:41:39,14 --> 0:41:42,04
North Star versus what you think theirs is based on what they're

947
0:41:42,04 --> 0:41:42,54
saying?

948
0:41:43,26 --> 0:41:46,4
Lev: I'll be completely honest with you, I literally never look

949
0:41:46,4 --> 0:41:47,02
at them.

950
0:41:47,02 --> 0:41:50,28
And it's not because I think I'm cool or something. It's literally

951
0:41:50,28 --> 0:41:52,58
I just don't have the time. Like,
I have something in my

952
0:41:52,58 --> 0:41:55,32
mind, I'm building it. What the other
guys are doing is usually

953
0:41:55,32 --> 0:41:58,24
more of a distraction than a
learning point to me personally.

954
0:41:58,26 --> 0:42:1,64
So I have my vision of my North 
Star and I'm gonna follow it

955
0:42:1,64 --> 0:42:2,96
until I hit a wall or something.

956
0:42:2,96 --> 0:42:4,06
And so far I haven't.

957
0:42:4,28 --> 0:42:7,06
So that's how I think about it.

958
0:42:7,28 --> 0:42:9,9
Nikolay: Great words to finish 
this podcast, I guess.

959
0:42:11,04 --> 0:42:13,7
Michael: Is there, is there anything 
left that you wanted to

960
0:42:13,7 --> 0:42:16,16
like mention that we haven't asked 
about?

961
0:42:16,68 --> 0:42:19,76
Lev: Ah, well, let's see, I have 
like a, I'm gonna do a post

962
0:42:19,76 --> 0:42:23,16
on Hacker News soon with all the 
stuff that we've done, but I'm

963
0:42:23,16 --> 0:42:25,44
just gonna blab with you, I know 
it's kind of boring.

964
0:42:25,44 --> 0:42:29,28
We added a bunch of support for
aggregate functions, you know,

965
0:42:29,38 --> 0:42:31,72
we do query rewriting now, which 
is really cool.

966
0:42:31,72 --> 0:42:34,76
Like we take the AST and we add
whatever we need

967
0:42:34,76 --> 0:42:36,28
to that query to execute it.

968
0:42:36,28 --> 0:42:38,72
1 thing that I think is really, 
really phenomenal is we support

969
0:42:38,72 --> 0:42:39,62
sharding key updates.

970
0:42:39,62 --> 0:42:42,74
That's something I think Citus
never got around to doing, like

971
0:42:42,74 --> 0:42:45,06
changing the actual value of the 
sharding key and moving the

972
0:42:45,06 --> 0:42:46,82
rows between the shards.

973
0:42:47,4 --> 0:42:47,9
Wow.

974
0:42:48,08 --> 0:42:49,2
That one's really fun.

975
0:42:49,2 --> 0:42:51,96
I don't know how often it's gonna 
be used, but anytime somebody

976
0:42:51,96 --> 0:42:54,72
asks us about sharding, they ask 
about that feature, so it must

977
0:42:54,72 --> 0:42:55,38
be useful.

978
0:42:56,4 --> 0:42:59,4
We support multi-tuple inserts,
we rewrite insert statements,

979
0:42:59,6 --> 0:43:2,08
we separate the tuples, and we
send them individually to each

980
0:43:2,08 --> 0:43:4,26
shard, you know, all of that happens, 
you know, transparently,

981
0:43:5,02 --> 0:43:5,84
which is really cool.

982
0:43:5,84 --> 0:43:8,56
We're just, we're focusing on usability, 
like the idea is you

983
0:43:8,56 --> 0:43:11,26
just deploy this and you don't 
change your app, like don't rewrite

984
0:43:11,26 --> 0:43:14,12
your queries, don't rewrite your 
code and just make it work.

985
0:43:14,12 --> 0:43:17,3
But, you know, obviously we're
not there yet, but we're getting

986
0:43:17,3 --> 0:43:17,8
closer.

987
0:43:18,48 --> 0:43:20,74
So every week there's something 
new that comes out.

988
0:43:20,74 --> 0:43:22,62
Nikolay: And it comes from the 
field, right?

989
0:43:23,24 --> 0:43:23,74
Lev: Yep.

990
0:43:23,86 --> 0:43:26,46
It comes from people asking me, 
hey, is this supported?

991
0:43:26,46 --> 0:43:28,76
I'm like, yes.

992
0:43:29,02 --> 0:43:30,12
Almost, almost.

993
0:43:31,12 --> 0:43:31,94
Next week.

994
0:43:32,98 --> 0:43:34,14
Thank you for the suggestion.

995
0:43:36,0 --> 0:43:36,3
Nikolay: Yeah.

996
0:43:36,3 --> 0:43:36,8
Cool.

997
0:43:37,2 --> 0:43:37,7
Yeah.

998
0:43:38,0 --> 0:43:39,36
Production driven development.

999
0:43:40,52 --> 0:43:41,18
Lev: It is.

1000
0:43:41,46 --> 0:43:43,78
I really enjoy it because every 
single feature is immediately

1001
0:43:43,78 --> 0:43:44,24
used.

1002
0:43:44,24 --> 0:43:45,9
Like, it's great.

1003
0:43:46,18 --> 0:43:46,68
Nikolay: Cool.

1004
0:43:46,88 --> 0:43:47,38
Great.

1005
0:43:47,46 --> 0:43:47,98
Thank you.

1006
0:43:47,98 --> 0:43:49,2
Thank you for the update.

1007
0:43:49,64 --> 0:43:53,16
It's really good to see the project
is growing.

1008
0:43:53,68 --> 0:43:57,18
Excited to test it once again soon.

1009
0:43:57,4 --> 0:43:58,26
Lev: That would be great.

1010
0:43:58,26 --> 0:44:0,94
And I still have to review and
merge your PR.

1011
0:44:1,02 --> 0:44:1,74
Nikolay: It's OK.

1012
0:44:1,88 --> 0:44:5,22
Lev: It's been open for, I believe,
10 months, since the last

1013
0:44:5,22 --> 0:44:6,16
time we talked.

1014
0:44:6,56 --> 0:44:9,12
Nikolay: Something like this.

1015
0:44:9,26 --> 0:44:9,76
Good.

1016
0:44:10,16 --> 0:44:12,32
Michael: Well, yeah, really good
to see you again, Lev.

1017
0:44:12,38 --> 0:44:13,12
Take care.

1018
0:44:13,32 --> 0:44:13,78
Lev: Thank you so much.

1019
0:44:13,78 --> 0:44:14,6
Michael: Catch you soon.

1020
0:44:14,7 --> 0:44:16,36
Lev: Yeah, thanks so much for having
me, guys.

1021
0:44:16,36 --> 0:44:17,06
Bye bye.