1
00:00:00,060 --> 00:00:04,200
Nikolay: Hello, hello, this is
PostgresFM, your favorite podcast

2
00:00:04,200 --> 00:00:05,040
about Postgres.

3
00:00:05,380 --> 00:00:06,560
And my name is Nikolay.

4
00:00:06,720 --> 00:00:08,260
My co-host is Michael.

5
00:00:08,760 --> 00:00:09,560
Michael: Hello, Nikolay.

6
00:00:09,960 --> 00:00:10,780
Nikolay: Hi, Michael.

7
00:00:10,960 --> 00:00:13,400
And we have a very good guest today,
Sai.

8
00:00:13,580 --> 00:00:15,000
Sai: Hey, Nik and Michael.

9
00:00:15,140 --> 00:00:16,320
Nikolay: Founder of PeerDB.

10
00:00:17,420 --> 00:00:19,500
We are very glad to have you here.

11
00:00:19,540 --> 00:00:23,240
And we just discussed that we are
not inviting guests just for

12
00:00:23,240 --> 00:00:23,680
guests.

13
00:00:23,680 --> 00:00:27,220
We choose topic first, so it's
topic-centric discussion.

14
00:00:27,900 --> 00:00:31,960
But my first thought when we discussed
we need to talk about

15
00:00:31,960 --> 00:00:32,460
logical.

16
00:00:33,340 --> 00:00:35,940
And also I saw your blog post published
recently.

17
00:00:36,020 --> 00:00:41,140
My first thought was you're a very
great person to have for this

18
00:00:41,140 --> 00:00:41,820
discussion, right?

19
00:00:41,820 --> 00:00:43,480
So that's why we invited you.

20
00:00:43,520 --> 00:00:46,240
Glad you found time to join.

21
00:00:47,140 --> 00:00:49,000
Sai: Thanks Nik and Michael for
inviting me.

22
00:00:49,000 --> 00:00:52,580
And yeah, logical replication and
logical decoding have been my

23
00:00:52,580 --> 00:00:55,040
life for the past 7, 8 months.

24
00:00:55,040 --> 00:00:58,380
And we are trying to get into as
much depth as possible to, you

25
00:00:58,380 --> 00:01:02,000
know, understand how it works and
probably down the line contribute

26
00:01:02,020 --> 00:01:02,980
upstream as well.

27
00:01:02,980 --> 00:01:05,140
But yeah, I'm very excited to be
here.

28
00:01:05,380 --> 00:01:05,880
Nikolay: Cool.

29
00:01:05,900 --> 00:01:09,920
I know Michael has a lot of questions,
so I probably will be

30
00:01:09,920 --> 00:01:11,460
less active this episode.

31
00:01:11,520 --> 00:01:13,940
But I will have questions as well
and maybe comments.

32
00:01:14,140 --> 00:01:15,560
Michael: Yeah, I definitely have
questions.

33
00:01:15,720 --> 00:01:18,640
But yeah, I completely echo everything
Nikolay said.

34
00:01:18,640 --> 00:01:22,940
And yeah, we did a kind of a basics
intro episode to logical,

35
00:01:23,240 --> 00:01:27,840
where Nikolay also talked us through
how to handle scale in terms

36
00:01:27,840 --> 00:01:29,540
of major upgrades, at least.

37
00:01:29,540 --> 00:01:31,940
So that's a common use case for
logical.

38
00:01:32,300 --> 00:01:36,600
But 1 thing we didn't even touch
on, which was the topic of 1

39
00:01:36,600 --> 00:01:39,880
of your recent blog posts, was
I listened back and I think protocol

40
00:01:39,940 --> 00:01:43,520
versions got a single mention,
kind of like an off-the-cuff mention

41
00:01:43,580 --> 00:01:47,500
at 1 point in the episode, you
know, 30, 40-minute episode.

42
00:01:47,500 --> 00:01:50,860
So that was where I was thinking
of starting is, it'd be awesome

43
00:01:50,860 --> 00:01:54,060
to hear a little bit about the
protocol versions, why you worry

44
00:01:54,060 --> 00:01:55,020
about them recently.

45
00:01:56,120 --> 00:01:56,860
Sai: Yeah, absolutely.

46
00:01:57,100 --> 00:01:58,040
Thanks, Michael.

47
00:01:58,260 --> 00:02:02,280
So the thing is logical replication
has this concept of protocol

48
00:02:02,420 --> 00:02:06,180
versions that you could specify
as a part of the subscription

49
00:02:06,580 --> 00:02:10,060
or if you are like an external
client who is reading the replication

50
00:02:10,080 --> 00:02:14,640
slot you could do it as a part
of the START_REPLICATION API which

51
00:02:14,640 --> 00:02:18,840
lets you read the replication slot
and there are like four versions

52
00:02:19,280 --> 00:02:22,700
that Postgres supports now with
PostgreSQL 16, right.

53
00:02:22,960 --> 00:02:27,280
The first version is the default
version which decodes transactions

54
00:02:27,400 --> 00:02:30,780
that are only committed so it doesn't
like start the decoding

55
00:02:30,780 --> 00:02:36,680
process before the transaction
commits and the second is the

56
00:02:36,820 --> 00:02:40,880
more advanced option which does
logical decoding of transactions

57
00:02:41,120 --> 00:02:45,060
that are not yet committed, right
like so it basically does decoding

58
00:02:45,060 --> 00:02:48,800
for in-flight transactions, right
that is number two and then third

59
00:02:48,800 --> 00:02:53,300
is it lets you decode two-phase
commit transactions, that is third

60
00:02:53,480 --> 00:02:58,020
and the fourth is it lets you decode
in-flight transactions and

61
00:02:58,020 --> 00:03:02,440
apply them in parallel to the target
which is the subscriber

62
00:03:02,440 --> 00:03:04,420
which is PostgreSQL basically, right.
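
The four protocol versions Sai describes map onto options that the subscriber (or any client reading the slot) passes to pgoutput. As an illustrative sketch (the slot, publication, and subscription names here are made up):

```sql
-- External client reading a slot directly: proto_version '2' (PG 14+)
-- streams large in-flight transactions; '3' adds two-phase commit
-- decoding, '4' enables parallel apply of in-flight transactions.
START_REPLICATION SLOT "my_slot" LOGICAL 0/0
    (proto_version '2', publication_names '"my_pub"');

-- Native subscription: the streaming option selects the same behavior.
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=source dbname=app'
    PUBLICATION my_pub
    WITH (streaming = on);
```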

63
00:03:04,700 --> 00:03:08,180
Now in that blog which you're talking
about we compared like

64
00:03:08,180 --> 00:03:11,680
version one and version two because
they are more like the common

65
00:03:11,680 --> 00:03:12,040
ones.

66
00:03:12,040 --> 00:03:14,540
Two-phase commits are not like
very common, right that's the reason

67
00:03:14,540 --> 00:03:19,500
we didn't go into that piece but
the biggest difference is the

68
00:03:19,500 --> 00:03:24,780
impact of version 2 is it improves logical
decoding performance, because

69
00:03:24,960 --> 00:03:28,780
you're actually decoding
while the transaction is

70
00:03:28,780 --> 00:03:32,060
happening, right like you are not
letting the transaction finish

71
00:03:32,320 --> 00:03:35,860
and then only start decoding but
rather you're decoding as the

72
00:03:35,860 --> 00:03:37,300
transactions go on.

73
00:03:37,300 --> 00:03:41,720
So the idea is you're giving more
time to the decoder, the WAL

74
00:03:41,720 --> 00:03:45,900
sender process to perform the decoding
and now this has a lot

75
00:03:45,900 --> 00:03:51,300
of benefits, right like number one
it reduces the slot growth, right

76
00:03:51,300 --> 00:03:54,940
like so in that blog we talk about
a situation where we have

77
00:03:54,940 --> 00:03:58,540
two long running transactions that
are being committed.

78
00:03:58,620 --> 00:04:02,220
And with version one, the decoding
process doesn't start and the

79
00:04:02,220 --> 00:04:04,140
slot keeps growing, growing, growing.

80
00:04:04,440 --> 00:04:07,980
And only after the transactions
are committed, the decoding starts.

81
00:04:08,040 --> 00:04:12,000
And then for a few minutes until
the decoding is finished, the

82
00:04:12,040 --> 00:04:13,760
slot size remains the same.

83
00:04:13,780 --> 00:04:15,460
Right, so this is with version 1.

84
00:04:15,600 --> 00:04:20,280
With two, what happens is as the
transactions happen, decoding

85
00:04:20,280 --> 00:04:20,780
starts.

86
00:04:20,900 --> 00:04:21,100
Right?

87
00:04:21,100 --> 00:04:24,860
Like, so you do see the decoder
doing its job.

88
00:04:25,020 --> 00:04:29,240
And as in when the transactions
finish, the slot immediately

89
00:04:29,340 --> 00:04:29,840
falls.
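
The slot growth and fall described here can be watched on the publisher with a catalog query; restart_lsn lags behind the current WAL position while decoding is held back:

```sql
-- Retained WAL per slot: grows during the long transaction,
-- falls once decoding catches up.
SELECT slot_name,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```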

90
00:04:30,240 --> 00:04:34,340
And the advantage of this is so
the summary was with version

91
00:04:34,380 --> 00:04:38,740
one, the slot growth of like, I think
we showed like 5, 6 gigabytes

92
00:04:39,180 --> 00:04:43,480
remain for like 5, 6 minutes until
the entire decoding finished.

93
00:04:43,620 --> 00:04:47,920
But with version two, it remained
there for an instant, because

94
00:04:47,920 --> 00:04:51,000
you know, like the decoding was
already happening as the transaction

95
00:04:51,020 --> 00:04:55,680
was going on, right so this can
have tremendous impact in use

96
00:04:55,680 --> 00:04:59,740
cases which have long running or
sizable transactions and that

97
00:04:59,740 --> 00:05:02,800
is very common, right, like we
were recently working

98
00:05:02,800 --> 00:05:05,280
with a fintech customer, right,
where they had

99
00:05:05,280 --> 00:05:07,880
sizable transactions,
like hundreds of

100
00:05:07,880 --> 00:05:10,760
operations happening in
a transaction, and then these

101
00:05:10,760 --> 00:05:14,680
transactions were like interleaved
and now the advantage with

102
00:05:14,680 --> 00:05:18,980
version 2 would have helped
them a lot, because as the

103
00:05:18,980 --> 00:05:22,260
operations are happening, the
decoder just

104
00:05:22,260 --> 00:05:23,080
decodes them.

105
00:05:23,080 --> 00:05:26,780
With one the problem that was happening
was there was one large

106
00:05:27,040 --> 00:05:31,360
long-running transaction which
took like an hour and then every

107
00:05:31,360 --> 00:05:35,860
time the decoder needs to decode
committed transactions it was

108
00:05:35,860 --> 00:05:39,220
decoding this long-running transaction
also, so the order of

109
00:05:39,220 --> 00:05:42,940
time is more quadratic basically
with version 1, right,

110
00:05:43,040 --> 00:05:46,400
because, you know, WAL is sequential,
right, like so you keep

111
00:05:46,400 --> 00:05:48,840
writing writing writing and then
like and then as and when there

112
00:05:48,840 --> 00:05:51,880
is a commit like the decoder like
starts like working now the

113
00:05:51,880 --> 00:05:55,760
problem is, with version 1, as the
existing long-running

114
00:05:55,760 --> 00:05:58,940
transaction is not yet decoded,
the decoder process still

115
00:05:58,940 --> 00:06:01,760
decodes it again for other transactions,
right?

116
00:06:01,880 --> 00:06:06,180
But with version 2, this
problem wouldn't have happened

117
00:06:06,180 --> 00:06:09,520
because once this long-running
transaction is committed I mean

118
00:06:09,520 --> 00:06:12,680
it's already consumed it's already
decoded if that makes sense

119
00:06:12,680 --> 00:06:16,400
right so the high-level summary
is version two is very helpful

120
00:06:16,400 --> 00:06:19,440
when there are like, you know,
long-running transactions that

121
00:06:19,440 --> 00:06:20,780
are like, you know, interleaved.

122
00:06:22,300 --> 00:06:25,460
Nikolay: So the lesson is upgrade
guys, right?

123
00:06:26,760 --> 00:06:28,940
Just use the latest PostgreSQL version.

124
00:06:29,380 --> 00:06:32,460
Sai: Version 2 comes with PG 14;
version 4 is in PG 16.

125
00:06:32,520 --> 00:06:32,860
Nikolay: Sorry.

126
00:06:32,860 --> 00:06:34,080
I meant a small thing.

127
00:06:34,080 --> 00:06:37,800
I've noticed you mentioned a subscriber,
which is PostgreSQL.

128
00:06:38,020 --> 00:06:40,740
This phrase, attracted my attention.

129
00:06:40,960 --> 00:06:42,320
It's not always PostgreSQL.

130
00:06:43,280 --> 00:06:47,860
But the parallelized apply for
long transactions, it's done

131
00:06:47,860 --> 00:06:51,360
only for PostgreSQL subscribers,
not for others, right?

132
00:06:51,780 --> 00:06:52,200
Sai: Correct.

133
00:06:52,200 --> 00:06:52,640
Exactly.

134
00:06:52,640 --> 00:06:55,680
So the fourth version, which does
like, you know, parallel apply

135
00:06:55,680 --> 00:06:59,440
of these in-flight transactions
is more relevant for PostgreSQL

136
00:06:59,440 --> 00:07:00,560
to PostgreSQL replication.

137
00:07:00,820 --> 00:07:04,080
The apply piece is, you know, PostgreSQL
and that's the standard

138
00:07:04,080 --> 00:07:04,940
logical replication.

139
00:07:05,240 --> 00:07:08,400
And logical decoding is more like
a subset of logical replication

140
00:07:08,400 --> 00:07:13,760
where external clients like PeerDB
can read the slot and replicate

141
00:07:13,880 --> 00:07:16,220
to homogeneous or heterogeneous
targets.

142
00:07:17,680 --> 00:07:20,900
Nikolay: In your experience, does
it mean that there's no sense

143
00:07:20,900 --> 00:07:25,240
to have this feature for other
situations like to Snowflake or

144
00:07:25,240 --> 00:07:25,740
others?

145
00:07:26,120 --> 00:07:27,900
Sai: The parallel apply feature
you mean,

146
00:07:27,900 --> 00:07:28,400
Nikolay: right?

147
00:07:28,420 --> 00:07:28,700
Right.

148
00:07:28,700 --> 00:07:29,200
Sai: Yeah.

149
00:07:30,060 --> 00:07:30,600
Good question.

150
00:07:30,600 --> 00:07:33,300
So that is the onus of the ETL
tool.

151
00:07:33,660 --> 00:07:40,600
So in PeerDB we have an option to
do parallel apply or not do parallel

152
00:07:40,600 --> 00:07:40,900
apply.

153
00:07:40,900 --> 00:07:43,440
So it's like a single flag that
we provide.

154
00:07:43,440 --> 00:07:47,700
And the difference is with parallel
apply, we cannot guarantee

155
00:07:48,120 --> 00:07:51,540
the order of transactions across
tables.

156
00:07:52,420 --> 00:07:56,380
Nikolay: So foreign key and referential
consistency, similar

157
00:07:56,380 --> 00:07:59,680
to multiple slots in regular Postgres
to Postgres.

158
00:08:00,140 --> 00:08:00,640
Sai: Correct.

159
00:08:00,900 --> 00:08:01,740
Sai: Very, very similar.

160
00:08:01,740 --> 00:08:02,240
Right.

161
00:08:02,500 --> 00:08:05,220
And then both, I mean, customers
do it, but the advantage of

162
00:08:05,220 --> 00:08:07,360
parallelism is it will improve
latency, right?

163
00:08:07,360 --> 00:08:10,580
Like replication, latency reduces,
lag reduces, etc.

164
00:08:10,920 --> 00:08:14,180
But it doesn't guarantee, you know,
consistency across tables.

165
00:08:14,480 --> 00:08:16,320
But the sequential does that, right?

166
00:08:16,320 --> 00:08:18,540
So customers like we have both
customers.

167
00:08:19,220 --> 00:08:20,400
Nikolay: Yeah, that's interesting.

168
00:08:20,820 --> 00:08:26,940
I know like in many cases people
think, okay, we work at like

169
00:08:26,940 --> 00:08:32,300
half of our capacity, no worries
if TPS grows using a single slot.

170
00:08:32,300 --> 00:08:36,340
If TPS grows, we know the bottleneck
is on the subscriber, we will

171
00:08:36,340 --> 00:08:37,800
just use multiple slots.

172
00:08:37,800 --> 00:08:43,580
But then they realize that with foreign
keys, you need to basically

173
00:08:43,580 --> 00:08:48,740
agree that they are broken temporarily
on the subscriber and

174
00:08:48,740 --> 00:08:52,420
can cause a lot of troubles if
you point the application to such

175
00:08:52,420 --> 00:08:53,000
nodes, right?

176
00:08:53,000 --> 00:08:53,760
That's interesting.

177
00:08:54,240 --> 00:08:55,220
Yeah, absolutely.

178
00:08:55,680 --> 00:08:56,400
Makes sense.

179
00:08:56,680 --> 00:08:58,620
Sai: And then I did want to call
out, right?

180
00:08:58,620 --> 00:09:01,760
Like for that, I mean, there is
a setting for disabling foreign

181
00:09:01,760 --> 00:09:04,200
keys completely; there is
a setting called

182
00:09:04,200 --> 00:09:07,900
session_replication_role basically,
you can set that to replica,

183
00:09:07,900 --> 00:09:10,620
and that is pretty standard,
that's the reason Postgres

184
00:09:10,760 --> 00:09:13,580
gave this setting, so you can set
that to replica and it would

185
00:09:13,580 --> 00:09:15,840
disable foreign keys and triggers
on the target.
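
For reference, the value that enables this mode is replica (valid values are origin, replica, and local); a sketch of an apply session on the target:

```sql
-- Disable triggers and FK enforcement for this session, the same mode
-- the built-in logical replication apply worker runs in.
SET session_replication_role = replica;
-- ... apply replicated changes ...
RESET session_replication_role;
```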

186
00:09:16,420 --> 00:09:19,640
Nikolay: But by default,
if you use multiple slots, it's

187
00:09:19,640 --> 00:09:20,140
disabled.

188
00:09:21,740 --> 00:09:23,260
Sai: By default, is it like disabled?

189
00:09:23,540 --> 00:09:24,040
Nikolay: Yes.

190
00:09:24,340 --> 00:09:30,820
I know it since recent work to
use it for upgrade, zero-downtime

191
00:09:30,820 --> 00:09:31,320
upgrades.

192
00:09:31,740 --> 00:09:31,980
Michael: Got it.

193
00:09:31,980 --> 00:09:34,040
Nikolay: It's indeed disabled and
it's interesting.

194
00:09:36,100 --> 00:09:41,760
So you need to think maybe you
shouldn't use it.

195
00:09:42,040 --> 00:09:48,120
I mean maybe go back to a single
slot and just choose a different

196
00:09:48,120 --> 00:09:49,160
time for your workload.

197
00:09:49,160 --> 00:09:52,560
I mean, work on weekends, guys,
right?

198
00:09:53,300 --> 00:09:57,020
Lower traffic and you can afford
working with a single slot.

199
00:09:57,500 --> 00:10:01,780
I mean, if we use logical only
for a temporary time, like for

200
00:10:01,780 --> 00:10:03,400
upgrades, we don't need it.

201
00:10:03,580 --> 00:10:04,060
Sai: Exactly.

202
00:10:04,060 --> 00:10:04,780
It really

203
00:10:04,780 --> 00:10:06,080
depends on the use case.

204
00:10:06,220 --> 00:10:06,800
Nikolay: Yeah, yeah.

205
00:10:06,900 --> 00:10:10,620
But if it's for a long time, forget
about foreign keys maybe.

206
00:10:11,740 --> 00:10:14,440
And you said latency, but it's
also about throughput.

207
00:10:14,480 --> 00:10:17,960
We need to process more bytes per
time.

208
00:10:18,480 --> 00:10:18,980
Sai: Absolutely.

209
00:10:19,700 --> 00:10:21,100
Nikolay: Lags can be nasty.

210
00:10:21,460 --> 00:10:25,540
I don't like dealing with logical
decoding in Postgres 11, 12.

211
00:10:26,520 --> 00:10:28,500
It's not pleasant at all.

212
00:10:30,320 --> 00:10:33,340
So these improvements are exciting
and everyone should upgrade.

213
00:10:33,340 --> 00:10:38,720
That's what I see here. You just
uncovered a lot of details.

214
00:10:38,720 --> 00:10:39,520
That's great.

215
00:10:39,840 --> 00:10:43,020
Sai: Yeah, absolutely. And the
beauty of

216
00:10:43,020 --> 00:10:46,180
Postgres is
that this is already built in,

217
00:10:46,240 --> 00:10:47,560
in Postgres core, right?

218
00:10:47,560 --> 00:10:49,640
Like logical replication does
this by default.

219
00:10:49,640 --> 00:10:52,360
It's just, you know, a setting
you need to set to say, like,

220
00:10:52,360 --> 00:10:55,400
Hey, I want to use version 2 and
it will be significantly more

221
00:10:55,400 --> 00:10:57,340
performant, reduce replication slot growth.
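
Enabling the setting on an existing subscription is a one-liner (the subscription name is made up):

```sql
ALTER SUBSCRIPTION my_sub SET (streaming = on);       -- PG 14+: protocol v2
ALTER SUBSCRIPTION my_sub SET (streaming = parallel); -- PG 16+: parallel apply
```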

222
00:10:57,360 --> 00:11:00,880
But if it's an ETL tool, who's
using logical decoding, the transaction

223
00:11:01,020 --> 00:11:03,260
logic needs to be managed by the
ETL tool.

224
00:11:03,260 --> 00:11:06,260
Because as you're reading like
in-flight transactions, we need

225
00:11:06,260 --> 00:11:09,760
to keep track whether this transaction
is committed or not, and

226
00:11:09,760 --> 00:11:11,740
then only we should push it to
the target.

227
00:11:11,980 --> 00:11:16,320
It becomes tricky for ETL tools,
which we are working on in PADB

228
00:11:16,320 --> 00:11:17,380
now as we speak.

229
00:11:17,440 --> 00:11:20,740
But the beauty of Postgres is that
it just gives you the setting

230
00:11:20,740 --> 00:11:23,940
out of the box and you just need
to upgrade your Postgres version

231
00:11:23,940 --> 00:11:25,740
to 14, 15, and 16.

232
00:11:26,640 --> 00:11:30,980
Nikolay: And with this, in 16 we
have an example.

233
00:11:31,000 --> 00:11:34,960
I mean, the developers of these
tools like your company, they

234
00:11:34,960 --> 00:11:39,060
now have an example of Postgres to Postgres
native logical replication to show how

235
00:11:39,060 --> 00:11:40,060
it can be implemented.

236
00:11:40,240 --> 00:11:42,320
So the reference example exists.

237
00:11:42,440 --> 00:11:43,400
So it's good.

238
00:11:44,160 --> 00:11:45,140
Yeah, that's great.

239
00:11:45,780 --> 00:11:47,620
Sorry if my client interrupted
you.

240
00:11:47,900 --> 00:11:48,980
No, not at all.

241
00:11:49,740 --> 00:11:51,540
Michael: This is a really good
diversion.

242
00:11:51,980 --> 00:11:55,280
But we've also, we've talked about
3 things already, right?

243
00:11:55,680 --> 00:11:58,140
We've talked about throughput,
we've talked about lag, and we've

244
00:11:58,140 --> 00:11:59,760
talked about slot size growth.

245
00:12:00,160 --> 00:12:05,580
And I feel like when you said initially,
the performance of the

246
00:12:05,580 --> 00:12:08,700
version 2 protocol can be better,
I'm guessing we're talking

247
00:12:08,700 --> 00:12:11,020
mostly in terms of that lag metric.

248
00:12:11,280 --> 00:12:14,380
Like we can start processing things
faster, therefore the lag

249
00:12:14,380 --> 00:12:16,380
is lower, makes total sense.

250
00:12:16,380 --> 00:12:20,060
And when you said the slot size
growth is reduced, I think I

251
00:12:20,060 --> 00:12:21,260
understand where you're coming
from.

252
00:12:21,260 --> 00:12:26,100
I think the specific benchmark 
in the blog post shows that the

253
00:12:26,100 --> 00:12:30,520
growth is the same, but then it
reduces, like it kind of reduces

254
00:12:30,520 --> 00:12:30,780
quicker.

255
00:12:30,780 --> 00:12:35,280
But when you mentioned the like 
interleaved long running transactions

256
00:12:35,320 --> 00:12:39,440
I guess it's in a less synthetic 
workload where you've got lots

257
00:12:39,440 --> 00:12:44,340
of potentially overlapping long 
running transactions the peak 

258
00:12:44,340 --> 00:12:46,980
will be lower if
you're using version 2.

259
00:12:46,980 --> 00:12:47,480
Correct.

260
00:12:47,780 --> 00:12:51,180
Sai: Exactly right. Because
in the scenario we did, it's like,

261
00:12:51,220 --> 00:12:55,120
just do 2 transactions that are
long-running, and we just

262
00:12:55,120 --> 00:12:57,940
had 2 of them, right, and
then they ended at the same

263
00:12:57,940 --> 00:13:01,780
time. But in real-world workloads,
I mean, it can be very

264
00:13:01,780 --> 00:13:06,900
arbitrary, right. But with version
2, as the slot consumption

265
00:13:06,900 --> 00:13:11,680
is faster and the slot size falls
quicker, right, the cumulative

266
00:13:11,740 --> 00:13:15,420
effect can be very significant
in slot growth.

267
00:13:15,540 --> 00:13:16,040
Michael: Nice.

268
00:13:16,160 --> 00:13:19,120
Sai: If that makes sense. And that 
is the next benchmark we want 

269
00:13:19,120 --> 00:13:23,160
to do, where in the real world benchmark, 
I want to see how does 

270
00:13:23,160 --> 00:13:26,600
the peak slot size like compare 
with version 1 and version 2.

271
00:13:26,600 --> 00:13:29,160
We are actually building that feature,
so we plan to, like,

272
00:13:29,160 --> 00:13:32,680
you know, we have a few like customers 
we are design partnering 

273
00:13:32,680 --> 00:13:34,700
with, you know, to implement this 
feature.

274
00:13:34,740 --> 00:13:37,600
And we know, you
know, what is the peak slot

275
00:13:37,600 --> 00:13:39,360
size we are seeing with version 
1.

276
00:13:39,480 --> 00:13:43,180
And with this design partnership, 
we will get to know, okay, 

277
00:13:43,180 --> 00:13:44,380
what is the peak slot size?

278
00:13:44,380 --> 00:13:47,240
My understanding is it will
fall very quickly because like,

279
00:13:47,240 --> 00:13:49,120
you know, you're giving more time 
to the decoder, right?

280
00:13:49,120 --> 00:13:50,240
Like it's more efficient.

281
00:13:50,660 --> 00:13:53,480
And with these long running transactions, 
it cannot be quadratic,

282
00:13:53,480 --> 00:13:54,880
like the decoding process, right?

283
00:13:54,880 --> 00:13:57,760
But that's a very good question 
you pointed out, Mike.

284
00:13:57,900 --> 00:14:00,540
And that is the next benchmark 
we want to do.

285
00:14:00,660 --> 00:14:01,160
Michael: Nice.

286
00:14:01,560 --> 00:14:05,280
So then the question then becomes, 
are there any workloads?

287
00:14:06,040 --> 00:14:09,520
And I'm thinking now that it must 
only be if you had, let's say

288
00:14:09,520 --> 00:14:13,020
you had a lot of long-running transactions
that end up getting aborted

289
00:14:13,020 --> 00:14:14,880
or rolled back instead of committed.

290
00:14:15,520 --> 00:14:19,080
I'm thinking that might be the 
only case where the version 1 

291
00:14:19,080 --> 00:14:22,800
protocol might be like you might 
prefer to carry on using it

292
00:14:22,800 --> 00:14:25,740
or are there other cases where 
you might prefer to carry on using

293
00:14:25,740 --> 00:14:29,480
the v1 protocol over the v2 one 
as long as you're on 14 and above.

294
00:14:29,480 --> 00:14:31,840
Sai: Even when the transactions 
are rolled back the WAL is not

295
00:14:31,840 --> 00:14:32,200
removed.

296
00:14:32,200 --> 00:14:34,740
So it still needs to be like decoded 
basically, right?

297
00:14:34,740 --> 00:14:35,960
Like for future transactions.

298
00:14:38,100 --> 00:14:41,420
That's exactly what happened in 
a customer scenario where there

299
00:14:41,420 --> 00:14:44,700
was a long running transaction, 
which the customer killed basically.

300
00:14:44,700 --> 00:14:46,820
It was running for like a few hours,
right?

301
00:14:46,960 --> 00:14:51,340
But still that impacted other transactions
because Postgres is

302
00:14:51,340 --> 00:14:53,400
not yet smart where it removes
the WAL.

303
00:14:53,400 --> 00:14:55,340
Nikolay: This is my favorite use
case.

304
00:14:55,640 --> 00:14:58,820
So you can overload WAL sender.

305
00:14:58,820 --> 00:15:02,360
If you just do a very simple trick,
you create a table with like

306
00:15:02,360 --> 00:15:06,360
a million, 10 million rows, and then
delete rows in a transaction

307
00:15:06,420 --> 00:15:07,660
and roll it back.

308
00:15:07,660 --> 00:15:09,680
Massive delete rollback, massive
delete rollback.

309
00:15:09,680 --> 00:15:13,100
I like this workload in many cases,
this included.
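
The pathological workload Nikolay describes is easy to reproduce; every deleted tuple is still WAL-logged, so the WAL sender must decode the whole transaction even though it never commits:

```sql
CREATE TABLE t AS SELECT g AS id FROM generate_series(1, 10000000) g;
-- Repeat this pair: each round spams WAL that decoding must chew through.
BEGIN;
DELETE FROM t;
ROLLBACK;
```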

310
00:15:13,340 --> 00:15:17,880
And this makes WAL sender consume
100% CPU very quickly.

311
00:15:18,740 --> 00:15:26,600
Because this transaction spams
WAL, writing xmax constantly

312
00:15:27,100 --> 00:15:28,360
for a lot of tuples.

313
00:15:29,100 --> 00:15:33,900
And then saying, okay, this xmax,
this transaction ID was rolled

314
00:15:33,900 --> 00:15:35,220
back, so it doesn't mean...

315
00:15:35,220 --> 00:15:36,920
Like they're still alive, right?

316
00:15:36,980 --> 00:15:40,440
The next transaction comes, does
the same, and the WAL sender

317
00:15:40,440 --> 00:15:41,780
is becoming crazy.

318
00:15:42,540 --> 00:15:48,060
And I think there should be some
way to optimize decoding to

319
00:15:48,060 --> 00:15:50,060
understand it, maybe.

320
00:15:50,060 --> 00:15:51,720
Because if you start processing...

321
00:15:51,760 --> 00:15:57,100
Ah, when it starts processing,
it's not known that this transaction

322
00:15:57,100 --> 00:15:58,360
is already rolled back.

323
00:15:59,180 --> 00:16:03,060
But maybe there is some opportunity
to optimize because if it's

324
00:16:03,060 --> 00:16:07,000
known already it's already rolled
back why we can probably skip

325
00:16:07,000 --> 00:16:10,440
better or something I don't know
I don't know details here.

326
00:16:10,520 --> 00:16:13,520
Sai: That's a very good point Nik
and I was recently chatting

327
00:16:13,520 --> 00:16:17,860
with one of the committers, and we
don't even skip rows, is

328
00:16:17,860 --> 00:16:21,140
my understanding, basically,
so every time we try to

329
00:16:21,140 --> 00:16:25,120
decode. So the immediate
optimization is, if there is

330
00:16:25,120 --> 00:16:30,020
an entry in the WAL of a rolled-back
transaction, then we don't

331
00:16:30,020 --> 00:16:31,800
decode that entry in the WAL.

332
00:16:31,800 --> 00:16:34,000
But you bring up a great point
here.

333
00:16:34,000 --> 00:16:36,240
So there is a lot of scope of improvement
there.

334
00:16:36,340 --> 00:16:40,580
Nikolay: Yeah, and this led us
to the conclusion that we won't

335
00:16:40,580 --> 00:16:45,520
be able to deal with logical at
a few modifying transactions

336
00:16:45,580 --> 00:16:46,280
per second.

337
00:16:46,580 --> 00:16:51,480
It was a wrong conclusion because
this was a pathological workload.

338
00:16:53,180 --> 00:16:56,980
And in production, I think it's
a very rare case when a WAL

339
00:16:56,980 --> 00:16:59,740
sender is hitting 100% of a single
core.

340
00:17:00,040 --> 00:17:01,400
Or it's not, I'm not sure.

341
00:17:01,400 --> 00:17:03,560
Because the problem is you cannot
scale it.

342
00:17:03,560 --> 00:17:10,100
You add a second slot, but you
see a WAL sender is hitting 100%

343
00:17:10,280 --> 00:17:14,580
of a single core CPU, either
constantly or from time to time,

344
00:17:14,580 --> 00:17:18,400
and you think, okay, I will try
to distribute workload among

345
00:17:18,820 --> 00:17:21,100
2 slots or 4 slots, 8 slots.

346
00:17:21,760 --> 00:17:22,760
But this doesn't help.

347
00:17:22,760 --> 00:17:26,600
All of them are hitting 100% of
different cores and that's it.

348
00:17:26,600 --> 00:17:32,120
So this can be bottleneck, but
it looks like it's very rare right

349
00:17:32,120 --> 00:17:33,460
now in production.

350
00:17:34,220 --> 00:17:35,140
Sai: Yeah, absolutely.

351
00:17:35,460 --> 00:17:39,120
And that is another optimization
that, I mean, the community

352
00:17:39,280 --> 00:17:43,580
could do, where, I mean, currently
if there are multiple

353
00:17:43,580 --> 00:17:44,200
slots, right?

354
00:17:44,200 --> 00:17:47,380
Like the decoding process runs
for every slot, right?

355
00:17:47,440 --> 00:17:51,360
If there are ways in which the
decoder runs for just 1 slot and

356
00:17:51,360 --> 00:17:54,520
the other slots reuse these like
decoded changes, right?

357
00:17:54,520 --> 00:17:56,260
Like if that makes sense.

358
00:17:56,280 --> 00:17:57,420
Nikolay: Yeah, exactly.

359
00:17:57,880 --> 00:18:00,020
The same work that's done multiple
times.

360
00:18:01,360 --> 00:18:02,880
Everyone is processing everything.

361
00:18:02,980 --> 00:18:06,940
This is where, definitely, there
is opportunity for improvement

362
00:18:06,940 --> 00:18:07,020
here.

363
00:18:07,020 --> 00:18:11,010
But I think people just don't see
this as a bottleneck in production

364
00:18:11,920 --> 00:18:12,420
often.

365
00:18:12,980 --> 00:18:14,070
That's why it's not optimized yet.

366
00:18:14,070 --> 00:18:17,360
Sai: In production, what we see
is like, it's like these 1 off

367
00:18:17,360 --> 00:18:20,660
batch operations where customers
delete like a bunch of data

368
00:18:20,660 --> 00:18:23,080
or add like do copy with like millions
of rows where there is

369
00:18:23,080 --> 00:18:24,780
like a WAL spike.

370
00:18:24,920 --> 00:18:27,540
And you know, the recommendation
that we give is:

371
00:18:27,540 --> 00:18:28,940
guys, have more disk.

372
00:18:29,020 --> 00:18:33,050
I mean, it's hard: get a faster

373
00:18:33,050 --> 00:18:36,060
and larger disk, because
logical replication can

374
00:18:36,060 --> 00:18:40,520
only scale up to 25,000 to 30,000
messages per

375
00:18:40,520 --> 00:18:41,060
second, basically.

376
00:18:41,880 --> 00:18:45,920
So have larger disk so that like
once this small spike like falls,

377
00:18:45,940 --> 00:18:46,070
it'll catch up.

378
00:18:46,070 --> 00:18:47,160
Okay, sure.

379
00:18:47,160 --> 00:18:48,740
At that time, there is more latency.

380
00:18:49,090 --> 00:18:55,160
Nikolay: It's a good CPU if 25,000,
30,000 messages per second.

381
00:18:55,160 --> 00:18:59,540
Because for PgBouncer on modern
Intel and AMD, I didn't see more

382
00:18:59,540 --> 00:19:01,620
than 20,000 messages per second.

383
00:19:02,080 --> 00:19:04,070
We can compare this like also messages
per second.

384
00:19:04,070 --> 00:19:08,940
PgBouncer processes transactions
just by passing them to backends,

385
00:19:08,940 --> 00:19:11,460
Postgres backends and returning
the result.

386
00:19:12,040 --> 00:19:16,050
And the logical replication also
similar, some messages, just

387
00:19:16,050 --> 00:19:17,560
some messages, right?

388
00:19:18,080 --> 00:19:22,040
Different type of work because
decoding maybe is more consuming

389
00:19:22,440 --> 00:19:25,180
but 30,000 sounds very good.

390
00:19:25,440 --> 00:19:29,020
Sai: Yeah, 100%, but the thing is
it has to be done well. I mean,

391
00:19:29,020 --> 00:19:32,460
it has to be done properly, because
one of the tips

392
00:19:32,460 --> 00:19:36,660
we follow, which we have seen lacking
in a few ETL tools, is to always

393
00:19:36,660 --> 00:19:39,560
consume the slot. We cannot give
up on consuming the slot; we

394
00:19:39,560 --> 00:19:42,360
need to constantly consume the
slot and flush it,

395
00:19:42,360 --> 00:19:48,280
and not lag. And the thing
is, if you give up and start

396
00:19:48,280 --> 00:19:51,050
reading it again, it doesn't read
from the confirmed flush LSN

397
00:19:51,050 --> 00:19:53,260
it reads from the restart LSN.

398
00:19:53,520 --> 00:19:56,860
And sometimes that restart LSN
can be very like old.

399
00:19:57,340 --> 00:20:00,220
And I don't want to get into when
Postgres updates the restart

400
00:20:00,220 --> 00:20:02,440
LSN because it's a more complicated
thing.

401
00:20:02,440 --> 00:20:05,640
But the idea is when I give up the connection and reacquire the connection

402
00:20:05,640 --> 00:20:10,680
again, it starts from the restart LSN, which can increase my

403
00:20:10,680 --> 00:20:12,040
decoding times a lot.

404
00:20:12,040 --> 00:20:13,380
So this is very common.

405
00:20:13,380 --> 00:20:18,060
When we initially built PeerDB, if the slot size was 150, 200 gigs,

406
00:20:18,340 --> 00:20:19,940
we were giving up the connection periodically.

407
00:20:20,020 --> 00:20:23,720
And whenever this slot is big, for like 10, 15 minutes, it was

408
00:20:23,720 --> 00:20:24,780
just stuck in the WAL reader.

409
00:20:24,880 --> 00:20:26,600
It was not even getting 1 change.

410
00:20:26,640 --> 00:20:31,400
So that is one tip I recommend, where you always consume the replication

411
00:20:31,500 --> 00:20:32,000
slot.
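
The restart_lsn versus confirmed_flush_lsn distinction above can be watched from SQL. A minimal monitoring sketch against the standard pg_replication_slots catalog (the logical slot itself is assumed to exist already):

```sql
-- Shows how far back each logical slot would restart decoding from,
-- and how much WAL the server is retaining on its behalf.
SELECT slot_name,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

A large gap between restart_lsn and confirmed_flush_lsn is exactly the situation Sai describes: a reconnecting consumer must decode that whole gap again before it sees a single new change.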

412
00:20:32,980 --> 00:20:38,040
Nikolay: I was trying to recall another problem I saw when I

413
00:20:38,040 --> 00:20:41,400
inspected what's happening if you start using multiple slots.

414
00:20:41,780 --> 00:20:45,280
And it's kind of an interesting problem when you start publishing

415
00:20:45,340 --> 00:20:48,900
changes to a single slot you use for all tables, for example.

416
00:20:50,020 --> 00:20:52,620
For example, for upgrade, it makes sense.

417
00:20:53,000 --> 00:20:53,980
And this is great.

418
00:20:54,000 --> 00:20:55,500
It works very well.

419
00:20:56,040 --> 00:21:01,080
But if you start distributing tables among multiple slots, you

420
00:21:01,080 --> 00:21:04,200
need to alter the table, I think.

421
00:21:04,220 --> 00:21:09,520
Anyway, you need a lock for each table individually in the group.

422
00:21:09,520 --> 00:21:16,360
So you divide groups, and then you need to start publishing changes.

423
00:21:16,560 --> 00:21:20,340
And this is a similar lock as, like DDL, basically.

424
00:21:20,840 --> 00:21:21,700
No, not similar.

425
00:21:21,700 --> 00:21:23,740
It's not an exclusive lock, but still.

426
00:21:23,800 --> 00:21:26,760
I remember it was, it got stuck.

427
00:21:27,340 --> 00:21:29,640
It was not, it's not an exclusive lock.

428
00:21:29,640 --> 00:21:30,020
Yeah.

429
00:21:30,020 --> 00:21:34,260
So it's, it's not that bad as, as altering a table and adding a column.

430
00:21:34,280 --> 00:21:36,860
So I remember it got stuck.

431
00:21:37,120 --> 00:21:41,100
It couldn't complete, but at least it didn't block others.

432
00:21:42,720 --> 00:21:43,080
Sai: Right.

433
00:21:43,080 --> 00:21:46,160
And was this while creating the multiple slots, or was this just

434
00:21:46,160 --> 00:21:47,540
like creating publications?

435
00:21:47,780 --> 00:21:48,840
Nikolay: Creating a publication.

436
00:21:49,220 --> 00:21:52,320
So you specify tables for which you need to create a publication.

437
00:21:52,440 --> 00:21:56,140
And this needs to acquire locks on all of them.

438
00:21:56,140 --> 00:22:00,180
It can be successful for some of them, not successful for others.

439
00:22:00,180 --> 00:22:02,540
It's a single operation, so you're waiting.

440
00:22:03,200 --> 00:22:04,540
And this is not good.

441
00:22:05,200 --> 00:22:10,680
But not as bad as DDL, which can block selects, which come after.

442
00:22:10,680 --> 00:22:14,500
We discussed many times how dangerous it can be under heavy load.

443
00:22:14,500 --> 00:22:15,420
Yeah, that's interesting.

444
00:22:15,420 --> 00:22:17,280
Anyway, multiple slots are tricky.

445
00:22:17,280 --> 00:22:19,540
This is the lesson here.

446
00:22:19,960 --> 00:22:24,040
So you need to test and learn before using them.
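
The publication-creation lock Nikolay describes can be bounded with lock_timeout, so the command fails fast instead of queueing behind long-running transactions. A sketch, with hypothetical table and publication names:

```sql
-- CREATE PUBLICATION takes a lock on every listed table; under heavy
-- load it can wait behind long transactions. Fail fast instead:
SET lock_timeout = '3s';
CREATE PUBLICATION pub_group_1 FOR TABLE orders, order_items;
-- Adding a table later takes the same per-table lock:
ALTER PUBLICATION pub_group_1 ADD TABLE invoices;
```

If the statement times out, nothing is left half-created; it can simply be retried at a quieter moment.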

447
00:22:24,340 --> 00:22:25,580
Michael: Yeah, absolutely.

448
00:22:26,760 --> 00:22:27,260
Nikolay: Good.

449
00:22:28,080 --> 00:22:29,320
What's next?

450
00:22:30,060 --> 00:22:32,540
Michael: Well, before we move on from that, I'm interested in,

451
00:22:32,540 --> 00:22:35,540
Are there any other like hard,
so you mentioned that rule of

452
00:22:35,540 --> 00:22:38,340
thumb, 25 to 30,000 messages per
second.

453
00:22:38,740 --> 00:22:42,320
Like, are there any other like
rules of thumb or hard limits

454
00:22:42,320 --> 00:22:46,160
or anything that people could be
thinking, oh, we've probably

455
00:22:46,160 --> 00:22:49,200
not tuned it enough because we're
not getting that much throughput

456
00:22:49,200 --> 00:22:51,380
or that low lag or something like
that.

457
00:22:51,380 --> 00:22:55,600
Nikolay: I guess this number depends
on the, on the core, on

458
00:22:55,600 --> 00:22:57,240
the type of CPU you have.

459
00:22:57,240 --> 00:22:59,600
If it's old, it can be 10,000 for
example.

460
00:22:59,960 --> 00:23:00,200
Sai: Yeah.

461
00:23:00,200 --> 00:23:00,900
A hundred percent.

462
00:23:00,900 --> 00:23:04,440
And then, you know, this setting
is this number that I shared

463
00:23:04,440 --> 00:23:10,900
is more on the higher end in a
non-local environment where the

464
00:23:10,900 --> 00:23:13,040
source and the target are not local.

465
00:23:13,040 --> 00:23:15,720
They are in the same region, but
in different boxes.

466
00:23:15,840 --> 00:23:18,020
So network latency is a very big
factor.

467
00:23:18,200 --> 00:23:22,080
1 thing we do want to see is, what
is the logical replication

468
00:23:22,540 --> 00:23:25,760
performance when both are on the
same local server, which is

469
00:23:25,760 --> 00:23:28,120
not in real world that doesn't
happen.

470
00:23:28,260 --> 00:23:31,680
So this, whatever I'm saying, is
network latency bound.

471
00:23:31,680 --> 00:23:34,600
Because it's single-threaded, and
the network latency kicks in,

472
00:23:34,600 --> 00:23:38,980
and it can only scale up to say
20-30 Mbps if it's like you know

473
00:23:38,980 --> 00:23:43,080
30-40 Mbps if it's done well also
right so those are also some

474
00:23:43,080 --> 00:23:47,220
things to keep in mind now coming
to gotchas right like a few

475
00:23:47,220 --> 00:23:51,360
things that I made
a note of about logical replication:

476
00:23:51,760 --> 00:23:55,880
first is it doesn't support like
replication of DDL commands

477
00:23:56,040 --> 00:23:59,220
which includes like adding of columns
like dropping of columns

478
00:23:59,220 --> 00:24:03,220
like you know adding new tables
creating indexes like you know

479
00:24:03,220 --> 00:24:07,040
truncating tables right like so
these are not supported out of

480
00:24:07,040 --> 00:24:07,360
the box.

481
00:24:07,360 --> 00:24:10,520
And this is 1 of the common concerns
that we hear from customers,

482
00:24:10,520 --> 00:24:13,200
because in the real world, you
know, people, I mean, you add

483
00:24:13,200 --> 00:24:15,720
tables, you like you have these
Django migrations, which like

484
00:24:15,720 --> 00:24:18,280
add a bunch of indexes etc etc
right.

485
00:24:18,280 --> 00:24:20,340
Michael: Partitions we talked about
last time.

486
00:24:21,020 --> 00:24:23,600
Sai: Yeah, but they don't do that
also.

487
00:24:24,060 --> 00:24:27,180
Michael: So yeah, when you say
customers raise it as a concern,

488
00:24:27,500 --> 00:24:31,220
is that before starting or like
realizing it later on?

489
00:24:31,220 --> 00:24:32,920
Where does that pop up for you?

490
00:24:33,120 --> 00:24:34,900
Sai: It happens both ways, right?

491
00:24:34,900 --> 00:24:37,280
Like because I mean, obviously,
Postgres logical replication

492
00:24:37,280 --> 00:24:40,520
is the easiest way to like, you
know, replicate databases, right?

493
00:24:40,520 --> 00:24:43,660
They start off and then there's
a column that is added and then

494
00:24:43,660 --> 00:24:45,040
logical replication breaks.

495
00:24:45,520 --> 00:24:48,800
And now the good thing is you can
add a column manually on the

496
00:24:48,800 --> 00:24:52,100
target and it would continue But
then this becomes like difficult

497
00:24:52,120 --> 00:24:55,760
in production, which is when they
reach out saying, hey, you

498
00:24:55,760 --> 00:24:59,340
guys are also using
logical decoding and

499
00:24:59,340 --> 00:25:02,560
you support Postgres as a target
and that's how they come to

500
00:25:02,560 --> 00:25:02,720
us.

501
00:25:02,720 --> 00:25:06,500
And we do support schema changes,
but only add columns and drop

502
00:25:06,500 --> 00:25:06,980
columns.

503
00:25:06,980 --> 00:25:11,540
We don't yet support creating tables
and indexes and truncate

504
00:25:11,600 --> 00:25:13,300
and all of that yet, basically.
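
The manual recovery Sai mentions, when a new column on the publisher breaks apply on the subscriber, looks roughly like this (table and subscription names are hypothetical):

```sql
-- On the subscriber, mirror the DDL that ran on the publisher:
ALTER TABLE public.orders ADD COLUMN discount numeric;
-- The apply worker retries and continues automatically.
-- Newly created tables additionally need a publication refresh:
ALTER SUBSCRIPTION my_sub REFRESH PUBLICATION;
```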

505
00:25:13,660 --> 00:25:15,860
And sometimes it happens before
itself.

506
00:25:15,860 --> 00:25:19,440
We're like, okay, this is a, I
mean, we keep adding tables, we

507
00:25:19,440 --> 00:25:21,720
keep like adding columns very often
and like this doesn't work

508
00:25:21,720 --> 00:25:22,040
for us.

509
00:25:22,040 --> 00:25:23,740
So like we want another solution.

510
00:25:24,140 --> 00:25:24,820
Michael: Yeah, nice.

511
00:25:24,960 --> 00:25:29,880
A couple of the other things that
I see people confused or struggling

512
00:25:30,060 --> 00:25:33,220
with are things like sequence synchronization.

513
00:25:33,840 --> 00:25:36,520
I guess that's just a one-off task
at the beginning.

514
00:25:36,900 --> 00:25:39,640
Do you see anything else catching
people out?

515
00:25:39,880 --> 00:25:41,180
Sai: That's a good question, right?

516
00:25:41,180 --> 00:25:45,560
Like I think DDL commands is number
1, replication slot growth

517
00:25:45,560 --> 00:25:49,200
issues is number 2, which we talked
a lot about basically.

518
00:25:49,200 --> 00:25:49,540
Yeah.

519
00:25:49,540 --> 00:25:52,340
It's like, and, you know, always
keep reading the slot, avoid

520
00:25:52,340 --> 00:25:56,100
long running or like open transactions,
you know, monitor slot

521
00:25:56,100 --> 00:25:57,260
growth, right?

522
00:25:57,340 --> 00:25:59,860
Like fourth is, you know, use protocol
versions, right?

523
00:25:59,860 --> 00:26:03,400
Like upgrade to like latest Postgres
versions and start using

524
00:26:03,400 --> 00:26:06,100
protocol versions 2, 3, and 4, which are faster.

525
00:26:07,200 --> 00:26:10,240
The third thing that we see is
Postgres logical replication doesn't

526
00:26:10,240 --> 00:26:12,780
replicate toast columns out of
the box.

527
00:26:12,800 --> 00:26:17,560
You need to set up replica identity
full to make that happen.

528
00:26:17,860 --> 00:26:21,680
And replica identity full could
be expensive if you are not on

529
00:26:21,680 --> 00:26:24,960
Postgres 16 because for updates
and deletes it could lead to

530
00:26:24,960 --> 00:26:26,200
like a sequential scan.

531
00:26:26,200 --> 00:26:31,160
Nikolay: And what happened there?
Was CTID used, or is it used,

532
00:26:31,160 --> 00:26:34,240
and why is it better in 16?

533
00:26:34,240 --> 00:26:35,740
Sai: It's using indexes now.

534
00:26:35,740 --> 00:26:37,560
It's like using indexes basically.

535
00:26:37,940 --> 00:26:38,440
Nikolay: Okay.

536
00:26:39,960 --> 00:26:41,120
Sai: And the thing is,

537
00:26:41,120 --> 00:26:42,180
Nikolay: If you have them.

538
00:26:42,260 --> 00:26:45,060
Sai: Yeah, if you have them and
if you have the right indexes

539
00:26:45,060 --> 00:26:48,040
And another thing is some customers,
right?

540
00:26:48,040 --> 00:26:51,180
Like who have primary keys, but
they have toast columns.

541
00:26:51,620 --> 00:26:56,760
We still like need to do replica
identity full and that helps.

542
00:26:56,940 --> 00:27:01,700
So replica identity full with primary
keys is more efficient.

543
00:27:01,820 --> 00:27:05,380
And there is a good blog that the
Zalando guys wrote, which

544
00:27:05,380 --> 00:27:06,180
I really liked.

545
00:27:06,180 --> 00:27:08,860
And that is something that I refer
to customers where, hey guys,

546
00:27:08,860 --> 00:27:10,340
like you can add this.

547
00:27:10,380 --> 00:27:12,900
I think it increased like CPU and
IO by 30%.

548
00:27:13,140 --> 00:27:14,280
That was their analysis.

549
00:27:15,040 --> 00:27:18,400
But that is about like toast columns
where you need to have replica

550
00:27:18,400 --> 00:27:21,800
identity full and the impact of
that can change based on scenarios.
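
The replica identity settings discussed here are plain DDL; a sketch with hypothetical names:

```sql
-- Log the whole old row for updates/deletes (needed to get TOAST
-- column values into the decoded stream; pricier in WAL and, before
-- Postgres 16, can force sequential scans on the subscriber):
ALTER TABLE public.documents REPLICA IDENTITY FULL;

-- Cheaper alternative when a suitable unique index exists:
ALTER TABLE public.documents REPLICA IDENTITY USING INDEX documents_pkey;
```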

551
00:27:22,440 --> 00:27:22,720
Michael: Yeah.

552
00:27:22,720 --> 00:27:23,220
Awesome.

553
00:27:23,220 --> 00:27:24,480
I didn't know about that one.

554
00:27:24,480 --> 00:27:25,920
I would love to read that blog
post.

555
00:27:25,920 --> 00:27:27,920
If you send it to me afterward,
I'll include it in the show

556
00:27:27,920 --> 00:27:28,880
notes as well.

557
00:27:29,280 --> 00:27:29,780
Sai: Absolutely.

558
00:27:30,140 --> 00:27:33,580
And the fourth thing that we have
seen is logical replication,

559
00:27:33,900 --> 00:27:37,400
at least, you know, logical decoding
did not support virtual

560
00:27:37,400 --> 00:27:38,580
and generated columns.

561
00:27:38,860 --> 00:27:43,040
If there are like generated columns,
they didn't show up on WAL

562
00:27:43,040 --> 00:27:43,540
decoding.

563
00:27:43,780 --> 00:27:48,960
And the way we, I mean, we easily
solve that, I mean, is on the

564
00:27:48,960 --> 00:27:52,040
target that customers set up like
a virtual column or they use

565
00:27:52,040 --> 00:27:55,580
like DBT or like some transformation
tool to, you know, backfill

566
00:27:55,760 --> 00:27:56,420
this, right?

567
00:27:56,420 --> 00:27:58,380
Like, so that is number four.

568
00:27:58,500 --> 00:28:01,880
And then last but not the least,
the fifth one that we see, right.

569
00:28:01,880 --> 00:28:05,520
Like, I'm still, you know, trying
to discover, it's like slot

570
00:28:05,580 --> 00:28:10,580
invalidation issues where the restart
LSN becomes null, and have

571
00:28:10,580 --> 00:28:12,440
seen that happen in two scenarios.

572
00:28:12,780 --> 00:28:17,920
So the first scenario is there
is a setting called max_slot_wal_keep_size,

573
00:28:17,980 --> 00:28:21,900
I mean, which
actually limits the size

574
00:28:21,900 --> 00:28:25,140
of the slot. This setting
was added in Postgres 13,

575
00:28:25,140 --> 00:28:28,700
and it lets you safeguard
from out-of-storage

576
00:28:28,700 --> 00:28:32,660
issues. So as and when the slot
hits, you know, over a 2 terabyte,

577
00:28:32,660 --> 00:28:36,380
right, like or if that's the setting
that you have, it basically

578
00:28:36,380 --> 00:28:39,640
invalidates the slot, and in that
scenario, the restart LSN

579
00:28:39,640 --> 00:28:42,940
becomes null, and the slot gets
invalidated, and you need to restart

580
00:28:42,940 --> 00:28:43,760
logical replication.
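
The safeguard Sai is describing is max_slot_wal_keep_size (added in Postgres 13), and invalidation is visible in pg_replication_slots:

```sql
-- Cap how much WAL any slot may retain; -1 (the default) is unlimited.
ALTER SYSTEM SET max_slot_wal_keep_size = '200GB';
SELECT pg_reload_conf();

-- After invalidation, wal_status reads 'lost' and restart_lsn is NULL:
SELECT slot_name, wal_status, safe_wal_size
FROM pg_replication_slots;
```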

581
00:28:43,900 --> 00:28:47,620
The second scenarios we have seen
is sometimes it happens randomly,

582
00:28:47,640 --> 00:28:51,720
also, unseen, and we are still
figuring out why that can happen.

583
00:28:51,720 --> 00:28:55,120
And now we reached out to the community,
right, like, and we heard

584
00:28:55,120 --> 00:28:59,280
that like it could be because of
cloud providers' Postgres, right,

585
00:28:59,280 --> 00:29:02,140
like because cloud providers have
a bunch of like backup operations,

586
00:29:02,320 --> 00:29:05,060
some, some forks, you know, of Postgres, right.

587
00:29:05,060 --> 00:29:07,840
And then it could be. I mean, the
community doesn't say that,

588
00:29:07,840 --> 00:29:10,840
like, I mean, their point is, like,
can we try to reproduce this

589
00:29:10,840 --> 00:29:12,040
on Vanilla Postgres, right?

590
00:29:12,040 --> 00:29:12,980
Like not cloud.

591
00:29:12,980 --> 00:29:15,280
And then that was not becoming
easy for us.

592
00:29:15,280 --> 00:29:18,760
But like, we have seen that out
of the 7 to 8 months, 1 to 2

593
00:29:18,760 --> 00:29:22,960
times, randomly the slot gets invalidated
and we don't have like

594
00:29:22,960 --> 00:29:25,020
an RCA on when that can happen.

595
00:29:25,080 --> 00:29:28,380
And immediately we
check, hey, is max_slot_wal_keep_size

596
00:29:28,380 --> 00:29:31,920
set for
the slot, right?

597
00:29:31,920 --> 00:29:34,540
But it's set to minus 1, it's disabled,
but still we run

598
00:29:34,540 --> 00:29:35,340
into these issues.

599
00:29:35,440 --> 00:29:39,360
So I'm on that quest of figuring
out when this happens because

600
00:29:39,520 --> 00:29:41,980
it's a scenario that we need to understand better.

601
00:29:42,260 --> 00:29:43,220
Michael: And that's terrifying.

602
00:29:43,320 --> 00:29:46,400
And yeah, how do you even begin
to go about reproducing that?

603
00:29:46,520 --> 00:29:47,300
Good luck.

604
00:29:47,520 --> 00:29:50,920
Sai: Exactly. So we were able to
reproduce this on one of the cloud

605
00:29:50,920 --> 00:29:51,420
providers.

606
00:29:51,820 --> 00:29:54,880
They're like for high throughputs,
like over 50K transactions

607
00:29:54,960 --> 00:29:58,940
per second, every 15 to 16 hours,
it was like getting invalidated.

608
00:30:00,340 --> 00:30:00,560
Right?

609
00:30:00,560 --> 00:30:04,700
Michael: 50,000 per second and
every 15 to 16 hours.

610
00:30:04,900 --> 00:30:05,780
Sai: It's a lot.

611
00:30:06,280 --> 00:30:07,280
That's a lot

612
00:30:07,540 --> 00:30:08,140
Michael: of messages.

613
00:30:08,480 --> 00:30:09,340
Sai: That's not real.

614
00:30:09,340 --> 00:30:12,140
I mean, I don't think that happens
often, but yeah.

615
00:30:12,740 --> 00:30:13,440
Michael: Right, wow.

616
00:30:14,340 --> 00:30:14,840
Cool.

617
00:30:15,040 --> 00:30:17,480
I don't feel like we talked about
use cases much.

618
00:30:17,480 --> 00:30:20,700
Is that because it kind of these
aren't use-case dependent?

619
00:30:20,900 --> 00:30:24,060
I guess other than people using
two-phase commit, do you see any

620
00:30:24,060 --> 00:30:26,860
differences between people that
are using logical replication

621
00:30:26,940 --> 00:30:31,560
for analytics databases versus
for, do you see any difference

622
00:30:31,560 --> 00:30:32,860
between use cases basically?

623
00:30:33,180 --> 00:30:33,760
Sai: Good question.

624
00:30:33,760 --> 00:30:36,460
Like I mean, logical replication
is very powerful.

625
00:30:36,460 --> 00:30:39,220
I mean, it supports like, you know,
migration, online migrations,

626
00:30:39,240 --> 00:30:42,260
number 1, then it supports like
online upgrades, right, like

627
00:30:42,260 --> 00:30:44,440
which Nik mentioned, right, like
number 2.

628
00:30:44,480 --> 00:30:48,340
Number 3, it also supports like
HA and backup scenarios.

629
00:30:48,580 --> 00:30:52,500
So I have seen some customers who
use logical decoding or

630
00:30:52,500 --> 00:30:57,340
logical replication to have HA
across regions or in like hybrid

631
00:30:57,340 --> 00:30:57,840
environments.

632
00:30:57,880 --> 00:31:01,720
Like, okay, I'm running stuff on
my on-cloud and I still want

633
00:31:01,720 --> 00:31:05,140
to have a backup or like a replica
on my on-premise like, you

634
00:31:05,140 --> 00:31:06,840
know, self-hosted environments,
right?

635
00:31:07,100 --> 00:31:11,300
And the only way it's possible
is via logical replication because

636
00:31:11,320 --> 00:31:14,060
cloud providers don't give access
to like WAL, right?

637
00:31:14,060 --> 00:31:16,800
That is number 3, where like HA
and backups is another use case,

638
00:31:16,800 --> 00:31:17,300
right?

639
00:31:17,560 --> 00:31:21,900
4 is replication to like non-Postgres
like targets or even Postgres

640
00:31:21,900 --> 00:31:25,680
targets for like workload isolation,
use-case isolation, where,

641
00:31:25,680 --> 00:31:28,400
okay, I have my OLTP database running
on Postgres.

642
00:31:28,480 --> 00:31:31,160
Now I want to pipe this to another
database.

643
00:31:31,560 --> 00:31:35,560
Like it could be Snowflake or Postgres
for analytics or for search,

644
00:31:35,560 --> 00:31:38,460
like I want to pipe this on like
a Clickhouse or like Elasticsearch,

645
00:31:38,680 --> 00:31:39,000
right?

646
00:31:39,000 --> 00:31:40,520
Like for optimizing search.

647
00:31:40,520 --> 00:31:42,980
Yeah, logical replication is very
powerful and I believe that

648
00:31:42,980 --> 00:31:46,280
it's going to be the future for
Postgres and down the line it

649
00:31:46,280 --> 00:31:48,820
can open up like active-active
use-cases also right like where

650
00:31:48,820 --> 00:31:51,660
you know, okay, you have like cross-region and I recently read

651
00:31:51,660 --> 00:31:54,720
a blog where folks were using logical
replication for like active

652
00:31:54,720 --> 00:31:58,040
active I'd like so I mean it opens
up like a plethora of use

653
00:31:58,040 --> 00:32:01,740
cases and makes Postgres more,
you know, powerful extensible

654
00:32:01,800 --> 00:32:03,100
and yeah

655
00:32:03,580 --> 00:32:05,080
Michael: Yeah, I read that blog
post too.

656
00:32:05,080 --> 00:32:08,760
I think it was version 16 change
that made it possible to like

657
00:32:08,900 --> 00:32:12,620
not replicate changes that got
made by logical replication.

658
00:32:12,740 --> 00:32:15,640
So you could kind of like set up
logical replication in both

659
00:32:15,640 --> 00:32:16,140
directions.
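
The Postgres 16 piece Michael refers to is the subscription-level origin filter; a two-node sketch with hypothetical names (node B would mirror it in the other direction):

```sql
-- origin = none: skip rows that were themselves applied by logical
-- replication, which breaks the infinite loop in a bidirectional setup.
CREATE SUBSCRIPTION sub_from_node_b
  CONNECTION 'host=node-b dbname=app user=replicator'
  PUBLICATION pub_node_b
  WITH (origin = none, copy_data = false);
```

As the discussion notes, this only prevents loops; it does nothing about conflict resolution.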

660
00:32:16,400 --> 00:32:18,240
It seems super dangerous to me.

661
00:32:18,240 --> 00:32:20,240
Like, okay, cool.

662
00:32:20,380 --> 00:32:22,760
Would you recommend it at the moment
or do you think we need

663
00:32:22,760 --> 00:32:24,220
more features before that?

664
00:32:24,520 --> 00:32:29,440
Sai: I think it's more, I mean,
it's more like intense than that,

665
00:32:29,440 --> 00:32:29,680
right?

666
00:32:29,680 --> 00:32:33,420
Because like conflict resolution
and all of this is tricky, right?

667
00:32:33,420 --> 00:32:37,020
Like, and, you know, I mean, if
it's like very workload specific,

668
00:32:37,020 --> 00:32:39,440
where like, okay, I don't touch
the same rows, right?

669
00:32:39,440 --> 00:32:41,880
Like, and, you know, maybe there
it could work.

670
00:32:41,920 --> 00:32:46,600
But I mean, out of the box, like
implementing that is tricky,

671
00:32:46,640 --> 00:32:48,420
and it requires more like effort.

672
00:32:48,420 --> 00:32:51,240
And maybe we go there in the future,
right?

673
00:32:51,260 --> 00:32:54,440
Because we are seeing a few cases
with customers where they want

674
00:32:54,440 --> 00:32:58,280
Active-Active and there is not
an out-of-the-box solution.

675
00:32:59,240 --> 00:33:00,840
Nikolay: Why do people need Active-Active?

676
00:33:02,860 --> 00:33:03,600
Sai: Good question.

677
00:33:03,740 --> 00:33:07,020
I think I have like a lot of thoughts
here.

678
00:33:07,020 --> 00:33:10,560
So like the thing is, I think it
really helps with like HA, right?

679
00:33:10,560 --> 00:33:13,820
Like for example, I remember this
scenario in Microsoft where

680
00:33:13,820 --> 00:33:17,120
customers were having like a bunch
of like SQL Server Active-Active

681
00:33:17,140 --> 00:33:20,460
across regions and then this was
a bank and then 1 of the region

682
00:33:20,460 --> 00:33:24,720
went down And then every second
is like thousands of dollars.

683
00:33:24,720 --> 00:33:27,700
And then they immediately pivoted
to the other region and it

684
00:33:27,700 --> 00:33:28,820
kind of worked seamlessly.

685
00:33:30,940 --> 00:33:35,440
Nikolay: But doesn't it mean that
all clusters need to perform

686
00:33:35,440 --> 00:33:36,520
the same writes?

687
00:33:38,500 --> 00:33:43,440
And I didn't buy this idea that
replaying changes through logical

688
00:33:43,440 --> 00:33:48,060
replication is less expensive than
applying them initially, which

689
00:33:48,060 --> 00:33:51,000
was advertised in the BDR documentation.

690
00:33:51,380 --> 00:33:52,620
I didn't buy this.

691
00:33:53,500 --> 00:33:57,260
I hope to find time to test it
properly and write a blog post about that.

692
00:33:57,260 --> 00:34:00,060
So far didn't find time yet for
this.

693
00:34:00,060 --> 00:34:04,200
It's an interesting topic because
like in BDR, it means like, for

694
00:34:04,200 --> 00:34:07,440
example, you have multiple regions,
4 or 8.

695
00:34:07,740 --> 00:34:09,980
I know such cases as well.

696
00:34:10,520 --> 00:34:13,860
And they suffer because everyone
needs to perform the same writes.

697
00:34:13,860 --> 00:34:17,760
And if there is a spike in 1 region
or this, it's not a resilient

698
00:34:17,760 --> 00:34:18,840
system at all.

699
00:34:19,380 --> 00:34:20,820
I don't understand this.

700
00:34:21,000 --> 00:34:21,180
Sai: Yeah.

701
00:34:21,180 --> 00:34:21,820
A hundred percent.

702
00:34:21,820 --> 00:34:22,120
I agree.

703
00:34:22,120 --> 00:34:25,620
I think, I mean, that's the reason
there is a big opportunity

704
00:34:25,800 --> 00:34:26,120
there.

705
00:34:26,120 --> 00:34:28,580
And I recently saw one company,

706
00:34:28,580 --> 00:34:32,280
like what the PGYedge who are like
trying to do something like

707
00:34:32,280 --> 00:34:32,780
this.

708
00:34:33,480 --> 00:34:35,780
Nikolay: Yeah, new wave of this.

709
00:34:35,900 --> 00:34:40,220
I know every CTO is dreaming about
this, I know it.

710
00:34:40,840 --> 00:34:43,700
Because we are like, we work in
multiple regions.

711
00:34:44,700 --> 00:34:49,120
Maybe we should first check that
all our physical standbys are

712
00:34:49,120 --> 00:34:52,600
in different availability zones
than primaries, right?

713
00:34:53,720 --> 00:34:55,140
Sai: There are more basic things.

714
00:34:55,460 --> 00:34:58,260
Nikolay: Yeah, but multiple regions
is a great thing to have,

715
00:34:58,260 --> 00:35:00,700
but still I'm very skeptical.

716
00:35:01,260 --> 00:35:06,220
But I see in DBA's mind, there
is a mind shift as well.

717
00:35:06,220 --> 00:35:10,280
Like 10 years ago, all DBAs said
you don't need this.

718
00:35:10,520 --> 00:35:11,980
But things are changing.

719
00:35:11,980 --> 00:35:17,940
So in Postgres 16, this feature to avoid
loops, infinite loops, right?

720
00:35:18,340 --> 00:35:19,660
This feature, it's interesting.

721
00:35:21,820 --> 00:35:25,920
So I don't understand use cases
and how we can scale writes if

722
00:35:25,920 --> 00:35:28,780
we don't split them, like in sharding.

723
00:35:28,780 --> 00:35:30,740
We split them in sharding, and
that's great.

724
00:35:31,880 --> 00:35:33,300
Here, everyone needs everything.

725
00:35:33,380 --> 00:35:36,200
Well, I'm very skeptical.

726
00:35:36,340 --> 00:35:39,440
I don't understand
this part

727
00:35:39,440 --> 00:35:40,220
of the landscape.

728
00:35:41,140 --> 00:35:41,820
Sai: Yeah, 100%.

729
00:35:41,880 --> 00:35:45,180
And like, I think it becomes very
critical in tier 0 use cases,

730
00:35:45,180 --> 00:35:45,540
guys.

731
00:35:45,540 --> 00:35:49,040
I mean, not like, I mean, tier 1,
tier 2, tier 3, I think it's

732
00:35:49,040 --> 00:35:52,200
kind of lesser, but like, it's
more these tier 0 where like,

733
00:35:52,200 --> 00:35:54,440
it's like a Chase bank or something
like that right? Like when

734
00:35:54,440 --> 00:35:55,400
it becomes like hyper.

735
00:35:55,400 --> 00:35:57,720
Nikolay: But you think it's possible
to build good system?

736
00:35:57,720 --> 00:36:00,860
Sai: No, I don't think it's, I mean
with Postgres I am also very

737
00:36:00,860 --> 00:36:03,453
skeptical, but I think there is
an opportunity there, right?

738
00:36:03,453 --> 00:36:07,700
Like, and community, both community
and like, you know, I mean,

739
00:36:07,700 --> 00:36:09,780
I mean, community will be very
critical here, right?

740
00:36:09,780 --> 00:36:13,860
Like, I don't think that it can
happen just by logical replication.

741
00:36:13,860 --> 00:36:18,460
Nikolay: Maybe if we build a system,
categorizing data in tiers

742
00:36:18,460 --> 00:36:24,240
as well and replicating writes
only for the most critical data between

743
00:36:24,240 --> 00:36:25,300
regions, right?

744
00:36:25,580 --> 00:36:28,640
Michael: We're probably tiptoeing
into a different topic here.

745
00:36:28,860 --> 00:36:29,360
It's

746
00:36:29,940 --> 00:36:30,640
Nikolay: not different.

747
00:36:30,760 --> 00:36:31,740
It's not different.

748
00:36:32,340 --> 00:36:35,380
Many people think logical leads
to multi-master, definitely,

749
00:36:35,380 --> 00:36:37,620
the so-called old term.

750
00:36:38,620 --> 00:36:42,600
Michael: But I think the tier 0
use case feels like it's a long

751
00:36:42,600 --> 00:36:44,880
way from being supported by native
Postgres.

752
00:36:45,660 --> 00:36:48,680
There are a lot of providers out
there for that kind of thing.

753
00:36:48,680 --> 00:36:52,140
And I think the general term for
it is being called distributed

754
00:36:52,800 --> 00:36:57,040
SQL, or DistSQL is what
I've been referring

755
00:36:57,040 --> 00:36:57,540
to it as.

756
00:36:58,680 --> 00:37:01,960
But I would warn people against
using logical for this in the

757
00:37:01,960 --> 00:37:03,960
short term, like anytime soon.

758
00:37:03,960 --> 00:37:04,460
Yeah.

759
00:37:04,740 --> 00:37:05,020
Nikolay: Okay.

760
00:37:05,020 --> 00:37:05,640
Sai: I agree.

761
00:37:06,280 --> 00:37:06,780
Nikolay: Okay.

762
00:37:07,200 --> 00:37:10,120
Then what, what other questions
do you have?

763
00:37:10,120 --> 00:37:11,320
Michael: I didn't have anything
else.

764
00:37:11,320 --> 00:37:12,680
I wanted to thank Sai.

765
00:37:13,140 --> 00:37:14,840
Did you have anything else Nikolay?

766
00:37:15,240 --> 00:37:15,980
Nikolay: Well no.

767
00:37:16,020 --> 00:37:16,840
Thank you so much.

768
00:37:16,840 --> 00:37:17,680
It was interesting.

769
00:37:17,980 --> 00:37:18,980
Thank you for coming.

770
00:37:19,080 --> 00:37:19,860
Sai: Absolutely, guys.

771
00:37:19,860 --> 00:37:24,200
I really enjoyed chatting with
you and thanks for inviting me.

772
00:37:24,920 --> 00:37:28,940
Nikolay: I hope you won't stop
posting interesting technical

773
00:37:29,020 --> 00:37:29,440
posts.

774
00:37:29,440 --> 00:37:30,020
Sai: No, no, no.

775
00:37:30,020 --> 00:37:31,420
So content is our currency, guys.

776
00:37:31,420 --> 00:37:36,240
I mean, the thing is, for me, I'm
building the company because

777
00:37:36,980 --> 00:37:38,200
I'm very curious, right?

778
00:37:38,200 --> 00:37:41,180
Like now the thing that is
haunting me is that slot

779
00:37:41,320 --> 00:37:43,860
invalidation, like, I'm not understanding
why it can happen.

780
00:37:43,860 --> 00:37:44,060
Right.

781
00:37:44,060 --> 00:37:47,840
Like, so, because of
curiosity, we will

782
00:37:47,840 --> 00:37:50,440
be publishing a lot of content,
but

783
00:37:50,740 --> 00:37:53,160
Nikolay: yeah, benchmarks, graphs,
data and so on.

784
00:37:53,160 --> 00:37:54,440
Reproducible also.

785
00:37:54,620 --> 00:37:55,480
Yeah, that's great.

786
00:37:55,480 --> 00:37:56,460
Thank you so much.

787
00:37:56,720 --> 00:37:59,100
Michael: And also, if any of our
listeners have seen that as

788
00:37:59,100 --> 00:38:03,280
well, have any theories or have
a reproduction case for it, let

789
00:38:03,280 --> 00:38:03,980
us know.

790
00:38:04,660 --> 00:38:05,160
Nikolay: Mm-hmm.

791
00:38:05,500 --> 00:38:06,000
Absolutely.

792
00:38:06,180 --> 00:38:06,680
Michael: Good.

793
00:38:06,980 --> 00:38:07,480
Wonderful.

794
00:38:07,540 --> 00:38:08,040
Thank you.

795
00:38:08,040 --> 00:38:08,880
Thank you both.

796
00:38:08,880 --> 00:38:09,340
Take care.

797
00:38:09,340 --> 00:38:10,120
Sai: Thanks, guys.

798
00:38:10,871 --> 00:38:11,371
Bye-bye.

799
00:38:11,431 --> 00:38:11,681
Bye-bye.