1
00:00:00,060 --> 00:00:02,520
Michael: Hello and welcome to Postgres.FM,
a weekly show about

2
00:00:02,520 --> 00:00:03,400
all things PostgreSQL.

3
00:00:03,740 --> 00:00:06,220
I'm Michael, founder of pgMustard,
I'm joined as usual by

4
00:00:06,220 --> 00:00:07,940
Nikolay, founder of Postgres.AI.

5
00:00:07,960 --> 00:00:08,580
Hey Nikolay.

6
00:00:09,140 --> 00:00:09,940
Nikolay: Hi Michael.

7
00:00:10,680 --> 00:00:14,280
Michael: And today we are delighted to be
joined by Alexander Kukushkin,

8
00:00:14,720 --> 00:00:17,860
a Postgres contributor currently
working at Microsoft and most

9
00:00:17,860 --> 00:00:19,860
famously, maintainer of Patroni.

10
00:00:20,280 --> 00:00:23,240
We had a listener request to discuss
Patroni, so we're delighted

11
00:00:23,240 --> 00:00:24,940
you agreed to join us for an episode.

12
00:00:25,080 --> 00:00:25,580
Alexander.

13
00:00:26,040 --> 00:00:28,220
Alexander: Yeah, hello Michael,
hello Nikolay.

14
00:00:28,620 --> 00:00:30,040
Thank you for inviting me.

15
00:00:30,040 --> 00:00:33,500
I'm really excited to talk about
my favorite project.

16
00:00:35,080 --> 00:00:35,680
Michael: Us too.

17
00:00:35,680 --> 00:00:38,180
Perhaps as a starting point, could
you give us an introduction?

18
00:00:38,400 --> 00:00:40,760
Like for most people, I think will
have heard of Patroni and

19
00:00:40,760 --> 00:00:43,080
know what it is, but for anybody
that doesn't, could you give

20
00:00:43,080 --> 00:00:45,260
an introduction what it is and
why it's important?

21
00:00:45,900 --> 00:00:49,000
Alexander: Yeah, so Patroni, like
in the simple words, it's

22
00:00:49,000 --> 00:00:51,100
a failover manager for Postgres.

23
00:00:51,480 --> 00:00:55,580
It solves the problem of availability
of a primary.

24
00:00:56,260 --> 00:01:00,640
In Postgres, we don't use some
words that are non-inclusive,

25
00:01:01,240 --> 00:01:02,940
like master.

26
00:01:03,100 --> 00:01:06,300
That's why we call it primary,
and Patroni actually recently

27
00:01:06,540 --> 00:01:09,060
got rid of these non-inclusive words
completely.

28
00:01:10,760 --> 00:01:15,240
And the way Patroni does it: it
makes sure that we are running

29
00:01:15,240 --> 00:01:19,060
just a single primary at a time,
and at the same time, Patroni

30
00:01:19,060 --> 00:01:24,180
helps you to manage as many read-only
replicas as you like to

31
00:01:24,180 --> 00:01:28,260
have, and keeps those replicas
ready to become primary in case

32
00:01:28,260 --> 00:01:29,600
the primary has failed.

33
00:01:30,040 --> 00:01:35,320
At the same time, Patroni helps
to automate usual DBA tasks like

34
00:01:35,320 --> 00:01:39,560
switchover, configuration management,
stuff like that.

35
00:01:39,620 --> 00:01:41,340
Nikolay: Node provisioning also,
right?

36
00:01:43,740 --> 00:01:44,860
Alexander: Node provisioning, not
really.

37
00:01:45,160 --> 00:01:48,060
Node provisioning is a task for
DBA.

38
00:01:48,580 --> 00:01:52,560
DBA has to start Patroni, and Patroni
will take care of bootstrapping

39
00:01:52,800 --> 00:01:53,360
this node.

40
00:01:53,360 --> 00:01:58,680
In case it's a totally new cluster,
Patroni will start as a primary.

41
00:01:58,680 --> 00:02:02,840
In case the node joins an existing
cluster, the replica node

42
00:02:02,840 --> 00:02:06,420
will take a pg_basebackup by default
from the running primary

43
00:02:06,580 --> 00:02:08,080
and start the replica.

44
00:02:08,500 --> 00:02:12,700
And the most interesting part,
let's say we bring node back which

45
00:02:12,700 --> 00:02:16,760
was previously running as a primary,
and Patroni does everything

46
00:02:16,880 --> 00:02:22,000
to convert this failed primary
into a new standby to join the cluster

47
00:02:22,120 --> 00:02:25,060
and be prepared for the next unforeseen
event.
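
For readers who want to picture this part, here is a rough sketch of the decision Patroni makes when it starts on a node; this is simplified pseudologic in Python, not Patroni's real code, and the host names and helper arguments are made up:

import os
import subprocess

def bootstrap_or_join(pgdata, cluster_has_leader, was_primary_before):
    """Simplified sketch of Patroni's startup decision on one node."""
    if not os.path.exists(pgdata):
        if not cluster_has_leader:
            # Brand-new cluster: initialize and run as the first primary.
            subprocess.run(["initdb", "-D", pgdata], check=True)
            return "start as primary"
        # Joining an existing cluster: copy data from the running primary.
        subprocess.run(["pg_basebackup", "-D", pgdata, "-X", "stream",
                        "-h", "current-primary"], check=True)  # host is illustrative
        return "start as standby"
    if was_primary_before and cluster_has_leader:
        # A former primary coming back: rewind it so it can follow the new timeline.
        subprocess.run(["pg_rewind", "--target-pgdata", pgdata,
                        "--source-server", "host=current-primary"], check=True)
        return "rejoin as standby"
    return "start normally"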

48
00:02:25,320 --> 00:02:28,540
Nikolay: At least you agree that
it does part of node provisioning,

49
00:02:28,580 --> 00:02:33,020
because otherwise, we wouldn't
have situations when old data

50
00:02:33,020 --> 00:02:38,240
directory, the old PGDATA, was copied
and a new one was created, and we

51
00:02:38,240 --> 00:02:39,840
are suddenly out of disk space.

52
00:02:40,160 --> 00:02:44,080
And if you don't expect Patroni
to participate in node provisioning,

53
00:02:44,200 --> 00:02:45,820
then you think, what's happening?

54
00:02:45,820 --> 00:02:47,460
Why am I out of disk space?

55
00:02:48,580 --> 00:02:49,080
Right?

56
00:02:49,120 --> 00:02:50,200
It happens sometimes.

57
00:02:50,980 --> 00:02:56,400
Alexander: It used to happen, I
think, with bootstrap mode.

58
00:02:56,400 --> 00:03:03,480
Like, when Patroni, like, I don't
remember up until which version,

59
00:03:03,480 --> 00:03:07,940
but Patroni, when it tries to create
a new cluster, usually by

60
00:03:07,940 --> 00:03:12,340
using initdb, but in some cases
you can configure Patroni to

61
00:03:12,340 --> 00:03:16,980
create a cluster from existing
backup, like from base backup.

62
00:03:17,440 --> 00:03:21,300
And if something goes wrong, Patroni
does not remove this data

63
00:03:21,300 --> 00:03:23,040
directory, but renames it.

64
00:03:23,500 --> 00:03:27,380
And it used to apply current timestamp
to the file name.

65
00:03:27,560 --> 00:03:31,520
And therefore, after the first
failure, it gives up, waits a

66
00:03:31,520 --> 00:03:33,840
little bit and does the next attempt.

67
00:03:34,340 --> 00:03:35,780
Nikolay: To directory, right?

68
00:03:36,340 --> 00:03:41,820
Alexander: Yeah, it creates, it
uses yet another base backup,

69
00:03:41,960 --> 00:03:45,180
creates a new data directory, fails,
and renames.

70
00:03:45,520 --> 00:03:51,440
Now it is not working like this,
it just renames PGDATA to PGDATA-old,

71
00:03:52,800 --> 00:03:56,620
something like this, and that's
why you will not have an infinite

72
00:03:56,640 --> 00:03:57,600
number of directories.

73
00:03:58,320 --> 00:04:01,620
And having just one is enough to
investigate the failure.
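
As a small illustration of that change (a sketch, not Patroni's actual code): instead of a new timestamped copy per failed attempt, at most one renamed copy of the data directory is kept for investigation, so retries cannot multiply disk usage:

import os
import shutil

def keep_single_failed_copy(pgdata):
    """Keep at most one copy of a failed data directory for later investigation."""
    failed_copy = pgdata + "_failed"   # suffix is illustrative
    if os.path.exists(failed_copy):
        shutil.rmtree(failed_copy)     # drop the previous failed copy
    os.rename(pgdata, failed_copy)     # keep exactly one copy to inspect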

74
00:04:02,220 --> 00:04:06,100
Nikolay: But at maximum we end up,
like, if our data directory

75
00:04:06,420 --> 00:04:10,800
was expected to fill 70% of the
disk, we still might run out of

76
00:04:10,800 --> 00:04:11,620
disk space.

77
00:04:12,540 --> 00:04:16,680
Alexander: Yeah that's unfortunate
but the other option really

78
00:04:16,680 --> 00:04:21,000
like you just drop it but at the
same time like all the evidence

79
00:04:21,460 --> 00:04:24,840
of what failed, why it failed are
also gone.

80
00:04:24,840 --> 00:04:26,420
You have nothing to investigate.

81
00:04:27,340 --> 00:04:27,840
Nikolay: Okay.

82
00:04:27,880 --> 00:04:31,960
To me it still sounds like Patroni
participates in node provisioning.

83
00:04:31,960 --> 00:04:36,020
Yes, it doesn't bring you resources
like disk and virtual machine

84
00:04:36,340 --> 00:04:42,040
and so on, but it brings data,
like the most important part of

85
00:04:42,040 --> 00:04:43,940
Postgres node provisioning, right?

86
00:04:43,940 --> 00:04:46,080
Okay, I just wanted to be right
a little bit.

87
00:04:46,080 --> 00:04:46,580
Okay.

88
00:04:46,640 --> 00:04:47,140
Alexander: Okay.

89
00:04:48,860 --> 00:04:49,600
Nikolay: It's a joke.

90
00:04:49,600 --> 00:04:50,020
Okay.

91
00:04:50,020 --> 00:04:52,060
Michael: I think diving deep quickly
is great.

92
00:04:52,060 --> 00:04:54,600
It'd be good to discuss complex
topics but I think something

93
00:04:54,600 --> 00:04:55,920
simple would also be good.

94
00:04:55,920 --> 00:04:58,440
I would love to hear a little bit
almost about the history of

95
00:04:58,440 --> 00:04:58,780
Patroni.

96
00:04:58,780 --> 00:05:03,000
Like the early days, what were
you doing before Patroni to solve

97
00:05:03,000 --> 00:05:07,260
this kind of issue and why was
it built?

98
00:05:07,260 --> 00:05:10,580
What problems were there with the
existing setups?

99
00:05:11,320 --> 00:05:15,660
Alexander: To be honest, while
working for my previous company

100
00:05:15,660 --> 00:05:19,340
we didn't have any automatic failover
solution in place.

101
00:05:19,640 --> 00:05:23,540
What we relied on was just a good
monitoring system that sent

102
00:05:23,540 --> 00:05:27,180
you a message or some on-call engineer
just calls you in the

103
00:05:27,180 --> 00:05:28,680
night, the database failed.

104
00:05:28,680 --> 00:05:33,220
There were a lot of false positives,
unfortunately, but it still

105
00:05:33,220 --> 00:05:38,640
felt more reliable than using solutions
like replication manager,

106
00:05:39,020 --> 00:05:39,820
repmgr.

107
00:05:40,240 --> 00:05:44,180
Nikolay: Yeah, I remember this
very well.

108
00:05:44,180 --> 00:05:48,740
Like people constantly saying we
don't need autofailover, it's

109
00:05:48,740 --> 00:05:52,200
evil because it can switch over
suddenly, failover suddenly,

110
00:05:52,200 --> 00:05:53,300
and it's a mess.

111
00:05:53,360 --> 00:05:56,260
Let's rely on manual actions.

112
00:05:56,520 --> 00:05:57,980
I remember this time very well.

113
00:05:57,980 --> 00:06:01,980
Alexander: Yeah, in our defense,
the number of databases, like

114
00:06:01,980 --> 00:06:05,820
database clusters that we ran wasn't
so high, like I think a

115
00:06:05,820 --> 00:06:10,240
few dozens, and it was running
on-prem, didn't fail so often,

116
00:06:10,240 --> 00:06:11,720
and therefore it was manageable.

117
00:06:12,500 --> 00:06:19,340
A bit later, we started moving
to the cloud, and suddenly, not

118
00:06:19,340 --> 00:06:22,960
suddenly, but luckily for us, we
found a project named Governor,

119
00:06:23,460 --> 00:06:28,700
which basically brought an idea
of how to implement autofailover

120
00:06:29,720 --> 00:06:35,140
in a very nice manner, without
having so many false positives

121
00:06:35,200 --> 00:06:38,260
and without the risk of running into
a split-brain.

122
00:06:38,960 --> 00:06:40,880
Nikolay: Was it abandoned project
already?

123
00:06:41,480 --> 00:06:45,780
Alexander: No, no, so it was not
really abandoned, but it wasn't

124
00:06:45,780 --> 00:06:46,820
also very active.

125
00:06:46,820 --> 00:06:53,940
So we started applying it, found
some problems, reported problems

126
00:06:53,940 --> 00:06:57,840
to the maintainer of the Governor,
got no reaction unfortunately,

127
00:06:58,440 --> 00:07:03,100
started fixing those problems on
our own, and at some moment

128
00:07:03,280 --> 00:07:10,960
a number of fixes and some new
nice features accumulated and

129
00:07:10,960 --> 00:07:15,920
we decided just to fork it and
give a new name to the project.

130
00:07:16,080 --> 00:07:17,540
So this is how Patroni

131
00:07:17,560 --> 00:07:17,960
Nikolay: was born.

132
00:07:17,960 --> 00:07:18,920
Georgian name, right?

133
00:07:19,200 --> 00:07:19,700
Alexander: Right.

134
00:07:21,400 --> 00:07:21,644
Nikolay: What does

135
00:07:21,644 --> 00:07:21,732
Michael: it mean?

136
00:07:21,732 --> 00:07:21,880
Governor in Georgian.

137
00:07:21,880 --> 00:07:22,440
Nikolay: Oh, governor.

138
00:07:22,700 --> 00:07:24,960
I think so.

139
00:07:25,340 --> 00:07:25,840
Michael: Yeah,

140
00:07:26,220 --> 00:07:27,600
Alexander: almost, almost.

141
00:07:27,980 --> 00:07:28,480
Almost.

142
00:07:28,860 --> 00:07:35,720
Very close, but I'm not a good
person to explain like or to translate

143
00:07:35,760 --> 00:07:37,940
from Georgian because I don't

144
00:07:38,400 --> 00:07:38,900
Nikolay: Name

145
00:07:39,240 --> 00:07:42,760
Alexander: I know yet another word
in Georgian and it's spilo.

146
00:07:43,080 --> 00:07:45,840
Yeah, it translates from Georgian
as elephant.

147
00:07:47,860 --> 00:07:52,400
Nikolay: And the name was chosen, I guess,
by Valentine Gogichashvili, right?

148
00:07:52,540 --> 00:07:58,040
Alexander: Yes, he was, no, at
that time he wasn't my boss anymore

149
00:07:58,040 --> 00:08:02,320
but we still worked closely together
and I really appreciate his

150
00:08:02,320 --> 00:08:05,700
creativity in inventing good names
for projects.

151
00:08:07,200 --> 00:08:08,200
Michael: Yeah, great names.

152
00:08:08,200 --> 00:08:10,320
And is this a good time to bring
up Spilo?

153
00:08:10,320 --> 00:08:12,600
Like, what is Spilo and how is
that relevant?

154
00:08:12,980 --> 00:08:17,140
Alexander: Spilo, as I said, it
translates from Georgian as elephant.

155
00:08:18,620 --> 00:08:22,760
When we started playing with Governor,
we were already targeting

156
00:08:23,000 --> 00:08:25,020
to deploy everything in the cloud.

157
00:08:25,900 --> 00:08:30,520
We had no other choice but to build
a Docker image and provision

158
00:08:31,500 --> 00:08:32,720
Postgres in a container.

159
00:08:33,300 --> 00:08:38,300
And we called this Docker image
Spilo. Basically, we packaged

160
00:08:38,720 --> 00:08:44,220
Governor, Postgres, a few Postgres
extensions, and I think it was

161
00:08:44,220 --> 00:08:50,420
WAL-E back then as a backup and
point-in-time recovery solution.

162
00:08:52,740 --> 00:08:56,760
Michael: And it still exists to
this day as Spilo, but now with

163
00:08:56,760 --> 00:08:57,260
Patroni?

164
00:08:58,320 --> 00:08:59,320
Alexander: Yeah, of course.

165
00:08:59,640 --> 00:09:04,740
Now there is a Patroni inside,
and now Spilo includes plenty

166
00:09:04,740 --> 00:09:08,620
of Postgres major versions, which
may be an anti-pattern, but

167
00:09:08,620 --> 00:09:14,640
it allows you to run major upgrades,
like in-place major upgrades.

168
00:09:15,400 --> 00:09:20,700
It also includes WAL-G nowadays,
as a modern replacement of WAL-E.

169
00:09:22,940 --> 00:09:26,000
And it's used not only by...

170
00:09:26,000 --> 00:09:27,180
Nikolay: Operator, right?

171
00:09:28,920 --> 00:09:30,420
Alexander: Not really part of the Operator.

172
00:09:31,300 --> 00:09:35,200
Spilo is a product on its own.

173
00:09:35,740 --> 00:09:41,340
I know that some people run Postgres
on Kubernetes or even just

174
00:09:41,380 --> 00:09:47,420
on virtual machines with Spilo,
without using the Operator.

175
00:09:48,580 --> 00:09:50,100
Nikolay: But using Docker for example?

176
00:09:50,500 --> 00:09:51,440
Alexander: Yeah, of course.

177
00:09:52,540 --> 00:09:55,120
Michael: But that is a good opportunity
to discuss Postgres-operator.

178
00:09:55,680 --> 00:09:58,540
Postgres-operator was Zalando's...

179
00:09:58,700 --> 00:10:01,400
Was that one of the first operators
of its type?

180
00:10:01,400 --> 00:10:03,060
I know we've got lots these days.

181
00:10:03,580 --> 00:10:08,160
Alexander: Well, maybe it was,
but at the same time, the same

182
00:10:08,160 --> 00:10:11,780
name was used by Crunchy for their
Operator.

183
00:10:12,660 --> 00:10:16,080
They were developed in parallel
and back then Crunchy wasn't

184
00:10:16,260 --> 00:10:18,220
relying on Patroni yet.

185
00:10:20,220 --> 00:10:24,960
As I said, we started moving things
to the cloud and at some

186
00:10:24,960 --> 00:10:29,880
point, the vector moved a little bit
and started running plenty of

187
00:10:30,060 --> 00:10:32,520
workloads on Kubernetes, including
Postgres.

188
00:10:33,080 --> 00:10:37,360
Since deploying everything manually,
and more importantly, managing

189
00:10:37,440 --> 00:10:42,540
so many Postgres clusters manually
was really a nightmare, we

190
00:10:42,540 --> 00:10:45,040
started building Postgres-Operator.

191
00:10:45,600 --> 00:10:51,520
Back then, I don't think a very
nice Go library to implement

192
00:10:51,560 --> 00:10:55,980
the Operator pattern existed and
therefore people had to invent

193
00:10:56,800 --> 00:11:00,260
everything from scratch and there
was a lot of boilerplate code

194
00:11:00,600 --> 00:11:02,540
that was copied over and so on.

195
00:11:04,700 --> 00:11:08,300
Nikolay: Is it only the move to
the cloud that mattered here,

196
00:11:08,300 --> 00:11:13,680
but maybe also moving to microservices,
splitting everything

197
00:11:13,680 --> 00:11:14,400
to microservices?

198
00:11:14,920 --> 00:11:16,800
Because I remember from Valentin,
for example...

199
00:11:16,800 --> 00:11:19,700
Alexander: Microservices, of course,
played a big role.

200
00:11:21,100 --> 00:11:21,840
And probably...

201
00:11:23,300 --> 00:11:24,020
Not probably...

202
00:11:24,960 --> 00:11:30,040
Microservices were really the driving
force to move to the cloud,

203
00:11:30,060 --> 00:11:34,340
Because with the scale of the organization,
it wasn't possible

204
00:11:34,340 --> 00:11:36,080
to keep a monolith.

205
00:11:36,380 --> 00:11:40,680
And the idea was, let's split everything
to microservices, and

206
00:11:40,680 --> 00:11:44,040
every microservice usually requires
its own Database.

207
00:11:44,580 --> 00:11:45,080
Nikolay: Right.

208
00:11:45,860 --> 00:11:49,880
Alexander: Sometimes sharded Database,
like we used application

209
00:11:50,080 --> 00:11:50,580
sharding.

210
00:11:50,940 --> 00:11:54,780
In certain cases, the same Database
is used by multiple microservices,

211
00:11:55,120 --> 00:11:56,600
but it's a different story.

212
00:11:57,580 --> 00:12:03,620
But really, the number of Database
clusters that we had to support

213
00:12:03,660 --> 00:12:04,160
exploded.

214
00:12:04,920 --> 00:12:08,320
From dozens to hundreds and then
to thousands.

215
00:12:10,900 --> 00:12:11,400
Nikolay: Yeah.

216
00:12:11,420 --> 00:12:16,220
And this is already when you cannot
rely on humans to perform

217
00:12:16,220 --> 00:12:17,460
a failover, right?

218
00:12:18,580 --> 00:12:22,740
Alexander: Even when you run a
few hundred Database clusters,

219
00:12:23,400 --> 00:12:28,540
better not to rely on humans to
do maintenance, in my opinion.

220
00:12:29,480 --> 00:12:33,260
Nikolay: Right, so that's interesting
and maybe it's also the

221
00:12:33,260 --> 00:12:37,240
right time to discuss why Postgres
doesn't have an internal, built-in

222
00:12:37,660 --> 00:12:38,580
autofailover.

223
00:12:40,520 --> 00:12:46,380
I remember discussions about replication
when we relied on Slony

224
00:12:46,380 --> 00:12:51,580
and Londiste, and some people resisted
bringing replication inside

225
00:12:51,580 --> 00:12:54,360
Postgres, but somehow it was resolved
eventually.

226
00:12:55,520 --> 00:13:00,100
And Postgres has good replication,
physical, logical, sometimes

227
00:13:00,160 --> 00:13:01,700
not good, but it's a different
story.

228
00:13:02,240 --> 00:13:05,020
In general, it's very good and
improving, improving every release.

229
00:13:05,020 --> 00:13:09,160
Just last week Michael and I discussed
the improvements to

230
00:13:09,160 --> 00:13:14,060
logical replication in 17, and
maybe it will resonate a little

231
00:13:14,060 --> 00:13:18,920
bit with today's topic, Patroni,
but this doesn't happen for autofailover

232
00:13:19,440 --> 00:13:20,440
at all, right?

233
00:13:20,500 --> 00:13:21,260
Why so?

234
00:13:21,900 --> 00:13:29,860
Alexander: I can only guess, because
to do it correctly, we cannot

235
00:13:29,860 --> 00:13:34,940
just have 2 nodes, which most people
run, like primary and standby,

236
00:13:35,340 --> 00:13:39,060
because there are many different
factors involved.

237
00:13:39,230 --> 00:13:43,800
One of the most critical ones is
the network between those nodes.

238
00:13:45,240 --> 00:13:49,880
When just having 2 machines, you
cannot distinguish between failure

239
00:13:50,560 --> 00:13:54,300
on the networking and failure of
the primary.

240
00:13:55,380 --> 00:13:58,940
If you just run a health check from
a standby and make a decision

241
00:13:59,040 --> 00:14:02,420
based on the health check, you
may have a false positive.

242
00:14:02,640 --> 00:14:07,760
Basically, the network just experienced
some short glitch, which

243
00:14:07,760 --> 00:14:12,180
could last even a few seconds,
sometimes a few minutes, but at

244
00:14:12,180 --> 00:14:14,440
the same time the old primary is
still there.

245
00:14:15,040 --> 00:14:18,820
If we promote a standby, we get
to a split-brain situation.

246
00:14:19,940 --> 00:14:24,020
With 2 primaries, and it not being
clear on which one transactions

247
00:14:25,680 --> 00:14:26,340
are running.

248
00:14:27,100 --> 00:14:30,300
In the worst case, you end up in
an application connecting to

249
00:14:30,300 --> 00:14:31,260
both of them.

250
00:14:33,660 --> 00:14:37,040
Good luck with assembling all these
changes together.

251
00:14:37,440 --> 00:14:40,420
Nikolay: This is what tools like
repmgr do.

252
00:14:40,600 --> 00:14:44,880
So I ended up calling
repmgr a split-brain solution.

253
00:14:46,780 --> 00:14:48,920
Because I observed it many, many
times.

254
00:14:50,220 --> 00:14:55,040
Alexander: Like as a mitigation,
what maybe is possible to do,

255
00:14:55,040 --> 00:14:58,940
the primary can also run a health
check and in case if standby

256
00:14:58,940 --> 00:15:05,280
is not available, just stop accepting
writes by either restarting

257
00:15:05,820 --> 00:15:09,940
in read-only or maybe by implementing
some other mechanisms.

258
00:15:10,760 --> 00:15:16,360
But it also means that we lose
availability without a good reason.

259
00:15:16,920 --> 00:15:21,880
So with this scenario, when we
promote standby, technically if

260
00:15:21,880 --> 00:15:26,420
standby cannot access someone else,
it shouldn't be accepting

261
00:15:26,420 --> 00:15:30,140
writes either, like in the network
split.

262
00:15:30,720 --> 00:15:37,800
Basically, we come close to a setup
with what repmgr

263
00:15:38,000 --> 00:15:40,940
calls a witness node.

264
00:15:41,180 --> 00:15:42,660
Nikolay: Witness node, yes exactly.

265
00:15:42,740 --> 00:15:46,260
Alexander: Witness node, basically
you need to have more than

266
00:15:46,260 --> 00:15:46,760
2.

267
00:15:47,220 --> 00:15:52,000
And the witness node should help
in making decisions.

268
00:15:55,240 --> 00:15:58,960
Let's say we have a witness node
in some third failure domain,

269
00:15:59,180 --> 00:16:02,080
the primary can see the witness
node, therefore it can continue

270
00:16:02,080 --> 00:16:03,140
to run as a primary.

271
00:16:03,240 --> 00:16:08,200
And standby shouldn't be allowed
to promote if it cannot access

272
00:16:08,200 --> 00:16:09,360
the witness node.

273
00:16:09,860 --> 00:16:16,560
And this already resembles systems
like etcd that implement a

274
00:16:17,640 --> 00:16:23,700
consensus algorithm, where a write is
possible when it is accepted

275
00:16:23,720 --> 00:16:25,220
by majority of nodes.
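
To make that concrete, the pattern Patroni builds on looks roughly like this: each member tries to create a leader key in the DCS with a short TTL, and only the holder of that key may run as a primary. A hedged sketch using the python-etcd3 client follows; the key name, TTL, and overall shape are illustrative assumptions, not Patroni's actual implementation:

import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)
LEADER_KEY = "/service/demo-cluster/leader"  # illustrative key layout
MY_NAME = "node1"
TTL = 30                                     # leader lease in seconds

def try_to_acquire_leader_key():
    """Take the leader lock; succeeds only if nobody currently holds it."""
    lease = client.lease(TTL)
    acquired, _ = client.transaction(
        # The key must not exist yet (version == 0), i.e. there is no leader.
        compare=[client.transactions.version(LEADER_KEY) == 0],
        success=[client.transactions.put(LEADER_KEY, MY_NAME, lease)],
        failure=[],
    )
    return acquired, lease

def keep_leadership(lease):
    """The primary must keep refreshing the lease; if it cannot reach the
    DCS before the TTL expires, it should demote itself (self-fencing)."""
    lease.refresh()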

276
00:16:25,320 --> 00:16:27,140
Nikolay: This wheel is already invented,
right?

277
00:16:27,800 --> 00:16:30,860
Alexander: Yeah, so this is already
invented, and what Patroni

278
00:16:30,880 --> 00:16:35,700
is really relying on to implement
autofailover reliably.

279
00:16:36,240 --> 00:16:41,860
I can guess that at some moment
in Postgres it will be added,

280
00:16:42,040 --> 00:16:47,440
and we already have plenty of such
components in Postgres that

281
00:16:47,440 --> 00:16:47,940
exist.

282
00:16:48,340 --> 00:16:52,060
We have write-ahead log with LSN
which is always incremented.

283
00:16:52,500 --> 00:16:57,040
We have timelines which is very
similar to terms in etcd.

284
00:16:58,620 --> 00:17:03,480
So basically at the end we will
just need to have more than 2

285
00:17:03,480 --> 00:17:09,360
nodes, better 3, so that we don't
stop writes while 1 node is

286
00:17:09,360 --> 00:17:10,440
temporarily down.

287
00:17:12,040 --> 00:17:15,560
It will give the possibility to implement
autofailover without

288
00:17:15,920 --> 00:17:18,340
even doing pg_rewind, let's say.

289
00:17:18,640 --> 00:17:23,580
Because when primary writes to
write-ahead log, it will be first

290
00:17:23,720 --> 00:17:27,740
confirmed by standby nodes, and
only after that.

291
00:17:28,280 --> 00:17:33,560
So effectively, this is what we
already have, but it's not enough,

292
00:17:33,560 --> 00:17:34,060
unfortunately.

294
00:17:35,020 --> 00:17:37,760
Nikolay: So do you think at some point Patroni
will not be needed and

295
00:17:37,760 --> 00:17:39,520
everything will be inside Postgres
or no?

297
00:17:41,600 --> 00:17:43,140
Alexander: I hope so, really.

299
00:17:43,260 --> 00:17:44,020
Nikolay: I hope so.

301
00:17:44,200 --> 00:17:45,400
Alexander: No, no, no, not because

302
00:17:45,400 --> 00:17:50,380
I'm tired of maintaining Patroni,
but because this is what people

303
00:17:50,380 --> 00:17:51,660
really want to have.

304
00:17:51,840 --> 00:17:57,340
To deploy highly available Postgres
without necessity to research

305
00:17:57,700 --> 00:18:04,500
and learn a lot of external tools
like Patroni, solutions for

306
00:18:04,840 --> 00:18:05,740
backup and point...

307
00:18:05,740 --> 00:18:08,940
Nikolay: Upgrade them sometimes
because we're always lagging

308
00:18:08,940 --> 00:18:09,620
with these

309
00:18:10,240 --> 00:18:10,740
Alexander: upgrades.

310
00:18:11,760 --> 00:18:12,260
Yeah.

311
00:18:12,560 --> 00:18:17,440
But at the same time, Let's imagine
that it happens in a couple

312
00:18:17,440 --> 00:18:22,960
of years, but with a five-year
support cycle, there will still

313
00:18:22,960 --> 00:18:27,480
be a lot of setups that are running
not recent Postgres versions,

314
00:18:27,620 --> 00:18:31,340
and they still need to use something
external, like Patroni.

315
00:18:31,380 --> 00:18:34,400
Nikolay: Yeah, I'm actually looking
right now at commits of

316
00:18:34,400 --> 00:18:34,740
repmgr.

317
00:18:34,740 --> 00:18:38,420
It looks like the project is inactive
for more than 1 year, almost.

318
00:18:38,420 --> 00:18:39,780
Like a few commits, that's it.

319
00:18:39,780 --> 00:18:41,500
It's like going down.

320
00:18:42,340 --> 00:18:46,220
Alexander: So I have probably some
insights about it, not about

321
00:18:46,220 --> 00:18:50,240
repmgr, but I know
that EnterpriseDB was contributing

322
00:18:50,680 --> 00:18:56,100
some features and bug fixes to
Patroni, so they officially support

323
00:18:56,100 --> 00:18:56,600
Patroni.

324
00:18:57,340 --> 00:18:58,680
Nikolay: So it sounds interesting,
right?

325
00:18:58,680 --> 00:19:00,740
So Patroni is a winner, obviously.

326
00:19:00,760 --> 00:19:05,280
It's used by many Kubernetes operators,
many of them, and not

327
00:19:05,280 --> 00:19:09,520
only Kubernetes, of course, and
winning, of course, some projects

328
00:19:09,660 --> 00:19:13,700
were abandoned, not only
repmgr, we know some others,

329
00:19:13,740 --> 00:19:14,240
right?

330
00:19:14,600 --> 00:19:20,020
But you thinking about 1 day everything
will be in core and Patroni

331
00:19:20,020 --> 00:19:21,680
will be abandoned maybe, right?

332
00:19:21,860 --> 00:19:23,720
And you think it's maybe for good.

333
00:19:23,720 --> 00:19:26,880
Alexander: So every project has
its own life cycle.

334
00:19:28,260 --> 00:19:32,420
At some moment, the project is
abandoned and not used by anyone.

335
00:19:32,800 --> 00:19:34,140
We are not there yet.

336
00:19:34,400 --> 00:19:35,320
Nikolay: Right, right.

337
00:19:35,740 --> 00:19:38,860
While we're in this area, I wanted
to ask you what you think

338
00:19:38,860 --> 00:19:43,440
about, Kubernetes also has, it
also relies on consensus algorithm,

339
00:19:43,440 --> 00:19:43,940
right?

340
00:19:44,240 --> 00:19:45,520
Itself, it has it.

341
00:19:45,620 --> 00:19:50,200
Why some operators choose, why
do they choose to use Patroni

342
00:19:50,220 --> 00:19:54,440
while others like CloudNativePG
decide to rely on Kubernetes

343
00:19:54,600 --> 00:19:58,020
native mechanisms and avoid using
Patroni?

344
00:20:00,760 --> 00:20:05,940
Alexander: To be honest, I don't
know what is driving people that

345
00:20:05,940 --> 00:20:09,940
build cloud-native Postgres.

346
00:20:10,400 --> 00:20:11,820
Nikolay: But what's better in general?

347
00:20:11,980 --> 00:20:13,220
What are pros and cons?

348
00:20:13,220 --> 00:20:14,120
How to compare?

349
00:20:14,480 --> 00:20:15,700
What would you do?

350
00:20:16,240 --> 00:20:23,040
Alexander: In a sense, CloudNativePG, there is a component

351
00:20:23,100 --> 00:20:28,240
that tries to manage all Postgres
clusters and decide whether

352
00:20:28,520 --> 00:20:32,060
some primary has failed and promote
one of the standbys.

353
00:20:32,900 --> 00:20:37,320
I'm not sure how they implement
the fencing of the failed

354
00:20:37,840 --> 00:20:41,700
primary, because if you don't correctly
implement fencing and

355
00:20:41,700 --> 00:20:46,080
promote the standby to the primary,
you again end up in a split-brain

356
00:20:46,080 --> 00:20:46,580
situation.

357
00:20:47,640 --> 00:20:55,120
And let's imagine that 1 Kubernetes
node is isolated in the network.

358
00:20:55,380 --> 00:20:56,180
Nikolay: Network partition.

359
00:20:56,740 --> 00:20:57,240
Alexander: Yeah.

360
00:20:57,600 --> 00:21:02,240
And it automatically means that
you will not be able to stop

361
00:21:02,240 --> 00:21:04,940
pods for containers that are running
on this node.

362
00:21:05,380 --> 00:21:08,540
At the same time, applications
that are running on this node

363
00:21:08,720 --> 00:21:13,540
will still use Kubernetes services
to be able to connect to the

364
00:21:14,020 --> 00:21:15,100
isolated primary.

365
00:21:16,420 --> 00:21:17,640
Nikolay: Right, yeah.

366
00:21:17,900 --> 00:21:21,760
Alexander: So Patroni detects such
scenarios very easily, because

367
00:21:21,760 --> 00:21:26,140
the Patroni component runs in the same
pod as Postgres, and in case

368
00:21:26,140 --> 00:21:30,740
it cannot write to Kubernetes API,
it just does self-fencing,

369
00:21:30,780 --> 00:21:32,620
It puts Postgres into read-only mode.
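
A rough sketch of that behaviour; the loop and the dcs/postgres interfaces here are hypothetical, just to show the idea of demoting when the DCS cannot be reached:

import time

def ha_loop(dcs, postgres, interval=10):
    """Hypothetical member loop: hold the leader lease or stop accepting writes."""
    while True:
        try:
            holds_lock = dcs.update_leader_lease()  # made-up DCS client method
        except Exception:
            holds_lock = False                      # cannot prove we still own the lock
        if postgres.is_primary() and not holds_lock:
            # Self-fencing: demote before the lease expires so another node
            # can be promoted without the risk of two primaries.
            postgres.demote()
        time.sleep(interval)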

370
00:21:33,080 --> 00:21:34,900
Nikolay: It's simple, by the way,
right?

371
00:21:35,140 --> 00:21:40,460
Alexander: Yeah, so I don't know
if they do something similar.

372
00:21:40,840 --> 00:21:43,200
In case if they don't, it's dangerous.

373
00:21:44,340 --> 00:21:46,720
Michael: We should do a whole separate
episode of CloudNativePG

374
00:21:46,720 --> 00:21:48,980
actually, I think that would
be a good one.

375
00:21:49,400 --> 00:21:52,120
Alexander: yeah I'm not saying
that CloudNativePG is better

376
00:21:52,200 --> 00:21:53,440
like does something wrong

377
00:21:53,440 --> 00:21:55,580
Nikolay: I'm just raising questions

378
00:21:57,540 --> 00:21:58,680
Alexander: raising my concerns

379
00:21:58,820 --> 00:22:03,400
Michael: Of course. Right, back to
Patroni: it worked like this

380
00:22:03,400 --> 00:22:05,940
from the beginning, but it feels
like

381
00:22:21,740 --> 00:22:24,380
Alexander: in version 10, which
is end of life for a couple of

382
00:22:24,380 --> 00:22:25,500
years, by the way.

383
00:22:27,400 --> 00:22:31,480
From the very beginning, we wanted
to support this feature, but

384
00:22:31,480 --> 00:22:35,580
what was stopping us was the promise
of Patroni with synchronous

385
00:22:35,740 --> 00:22:40,940
replication that we want to promote
a node that was synchronous

386
00:22:40,960 --> 00:22:43,180
at the time when primary failed.

387
00:22:43,660 --> 00:22:47,780
If we just have a single name in
synchronous standby names, like

388
00:22:47,780 --> 00:22:51,600
single node, it's very easy to
say, okay, so this node was synchronous

389
00:22:51,680 --> 00:22:54,180
and therefore we can just promote
it.

390
00:22:54,340 --> 00:22:59,480
When there are more than 1 node
and we require all of them to

391
00:22:59,480 --> 00:23:01,860
be synchronous, we can promote
any of them.

392
00:23:01,880 --> 00:23:05,740
But with quorum-based replication,
you can have something like

393
00:23:06,040 --> 00:23:09,500
any one from a list of, let's say,
3 nodes.
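
As an example, quorum-based synchronous replication means a setting like synchronous_standby_names = 'ANY 1 (node2, node3, node4)'. A quick way to look at it from the primary, assuming psycopg2 and that standbys report themselves via application_name, is:

import psycopg2

conn = psycopg2.connect("host=primary-host dbname=postgres user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("SHOW synchronous_standby_names")
    print("synchronous_standby_names =", cur.fetchone()[0])

    # Members of an ANY n (...) list show sync_state = 'quorum' here.
    cur.execute("""
        SELECT application_name, sync_state, replay_lsn
          FROM pg_stat_replication
         ORDER BY application_name
    """)
    for name, state, lsn in cur.fetchall():
        print(name, state, lsn)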

394
00:23:09,620 --> 00:23:12,080
Which 1 is synchronous when the
primary failed?

395
00:23:12,560 --> 00:23:16,440
I'm not demanding an answer to this
question, so I will just explain

396
00:23:16,500 --> 00:23:19,780
how it works in Patroni, like in
the last major release.

397
00:23:20,240 --> 00:23:23,680
This information about the current
value of synchronous_standby_names

398
00:23:23,800 --> 00:23:25,540
is also stored in etcd.

399
00:23:26,280 --> 00:23:30,320
Therefore, those 3 nodes that are
listed in synchronous_standby_names

400
00:23:30,320 --> 00:23:36,280
know that they are listed
as quorum nodes and during the

401
00:23:36,280 --> 00:23:42,440
leader race they need to access
each other and get some number

402
00:23:42,440 --> 00:23:43,240
of votes.

403
00:23:43,520 --> 00:23:48,480
If there are 3 nodes, it means
that every node, to become a new

404
00:23:48,480 --> 00:23:52,000
primary, like a new candidate,
needs to access 2 remaining nodes,

405
00:23:52,000 --> 00:23:52,660
at least.

406
00:23:53,600 --> 00:24:03,180
And get confirmation that they're
not ahead of the LSN on the

407
00:24:03,180 --> 00:24:04,060
current node.

408
00:24:06,380 --> 00:24:07,360
Is it clear?

409
00:24:08,540 --> 00:24:10,940
I should elaborate a little bit
more.

410
00:24:12,340 --> 00:24:15,340
Michael: So if they were ahead,
let me ask the stupid question,

411
00:24:15,660 --> 00:24:19,580
If a node checks that it is ahead
of the current candidate to

412
00:24:19,580 --> 00:24:23,080
be leader, that's then a bad decision
to promote that leader

413
00:24:23,080 --> 00:24:24,640
because a different 1 would...

414
00:24:24,640 --> 00:24:29,060
Alexander: So just for your understanding,
in Patroni there is

415
00:24:29,060 --> 00:24:33,580
no central component that decides
on which node to promote.

416
00:24:33,620 --> 00:24:36,360
Every node makes a decision on
its own.

417
00:24:36,580 --> 00:24:40,840
Therefore, every standby node,
like listed in Synchronous Standby

418
00:24:40,840 --> 00:24:45,420
Names, goes through the cycle of
health checks.

419
00:24:45,560 --> 00:24:49,240
It accesses the remaining nodes from
synchronous_standby_names and

420
00:24:49,240 --> 00:24:52,320
checks at what LSN they are.

421
00:24:53,500 --> 00:24:59,280
And if they're on the same LSN
or behind, we can assume that

422
00:24:59,280 --> 00:25:01,300
this node is the healthiest one.
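
Sketched in Python, that check during the leader race could look roughly like the following. Patroni's REST API does expose a GET /patroni status endpoint, but the JSON field names and this simplified voting logic are assumptions, not the real implementation:

import requests

def can_promote(my_lsn, peer_urls, votes_needed):
    """Return True if enough quorum peers confirm they are not ahead of us."""
    confirmations = 0
    for url in peer_urls:
        try:
            status = requests.get(f"{url}/patroni", timeout=2).json()
        except requests.RequestException:
            continue                  # unreachable peer: no vote from it
        peer_lsn = status.get("xlog", {}).get("replayed_location", 0)  # field name assumed
        if peer_lsn > my_lsn:
            return False              # a peer has newer data; we must not promote
        confirmations += 1
    # votes_needed comes from the quorum rules stored in the DCS,
    # e.g. "reach at least 2 of the 3 listed members".
    return confirmations >= votes_needed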

423
00:25:01,960 --> 00:25:04,660
And the same procedure happens
on remaining nodes.

424
00:25:05,740 --> 00:25:09,960
Basically this way we can find,
okay, so this node is eligible

425
00:25:09,960 --> 00:25:11,180
to become a new primary.

426
00:25:11,660 --> 00:25:16,880
In case we have something like
ANY 2 of 3 nodes, we can make

427
00:25:16,880 --> 00:25:19,900
a decision by asking just a single
node.

428
00:25:20,760 --> 00:25:25,840
Because we know that 2 nodes will
have the latest commits, like

429
00:25:25,840 --> 00:25:29,080
the latest commits that are reported
to the client.

430
00:25:30,060 --> 00:25:34,080
And it will be enough to just ask
a single node.

431
00:25:34,080 --> 00:25:37,660
Although it will ask all nodes
from synchronous standby names,

432
00:25:37,660 --> 00:25:41,280
but in case if 1 of them, let's
say, failed, together with the

433
00:25:41,280 --> 00:25:46,560
primary, it is still enough to
make a decision by asking the

434
00:25:46,560 --> 00:25:47,580
remaining 1.

435
00:25:49,200 --> 00:25:49,700
Nice.

436
00:25:51,100 --> 00:25:55,240
And the tricky part comes when
we need to change synchronous

437
00:25:55,240 --> 00:25:59,840
standby names and the values that
we store in etcd.

438
00:26:00,720 --> 00:26:06,260
Let's say we want to increase the
number of synchronous nodes

439
00:26:06,540 --> 00:26:07,960
from 1 to 2.

440
00:26:08,320 --> 00:26:11,820
What should we change first, synchronous
standby names, the GUC,

441
00:26:11,820 --> 00:26:13,760
or value in etcd?

442
00:26:15,720 --> 00:26:20,000
So that we can correctly make a
decision.

443
00:26:21,160 --> 00:26:26,960
If we change first value in etcd,
it will assume, okay, so we

444
00:26:26,960 --> 00:26:30,440
need to ask just a single node
to make a decision, although there

445
00:26:30,440 --> 00:26:35,560
is just 1 node that has the latest
commits, 100%.

446
00:26:36,280 --> 00:26:38,080
And in fact we need to ask 2.

447
00:26:38,140 --> 00:26:45,020
Therefore, when we increase this
from 1 to 2, first we need to

448
00:26:45,020 --> 00:26:48,560
update the synchronous standby
names, and only after that change

449
00:26:48,720 --> 00:26:49,440
in etcd.
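
A hedged sketch of that ordering rule, with hypothetical helper objects: when the requirement becomes stricter, Postgres is reconfigured before the value in the DCS, and when it is relaxed, the DCS is updated first, so the DCS never promises more than Postgres actually guarantees:

def change_quorum(postgres, dcs, old_num_sync, new_num_sync):
    """Illustrative ordering only; real Patroni follows a longer list of rules."""
    if new_num_sync > old_num_sync:
        # Stricter requirement: make Postgres wait for more standbys first.
        postgres.set_synchronous_standby_names(f"ANY {new_num_sync} (*)")
        dcs.write_sync_state(num_sync=new_num_sync)
    else:
        # Relaxed requirement: shrink the promise recorded in the DCS first.
        dcs.write_sync_state(num_sync=new_num_sync)
        postgres.set_synchronous_standby_names(f"ANY {new_num_sync} (*)")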

450
00:26:50,600 --> 00:26:55,200
And there are almost a dozen of
rules that 1 needs to follow

451
00:26:55,200 --> 00:26:57,840
to do such changes in the correct
order.

452
00:26:57,900 --> 00:27:02,120
Because it's not only about changing
replication factor, It's

453
00:27:02,120 --> 00:27:06,540
also about adding new nodes to
synchronous standby names or removing

454
00:27:06,560 --> 00:27:09,060
nodes that are gone and so on.

455
00:27:09,520 --> 00:27:13,480
And I don't think any other failover
solution implements a general

456
00:27:14,280 --> 00:27:16,660
algorithm to do such changes.

457
00:27:17,620 --> 00:27:20,100
Nikolay: How much time did you
spend to develop this?

458
00:27:22,360 --> 00:27:27,640
Alexander: Originally this feature
was implemented by Ants Aasma,

459
00:27:27,640 --> 00:27:31,160
he's working for CYBERTEC, it
happened in 2018.

460
00:27:32,360 --> 00:27:38,860
I did a few attempts to understand
this great logic of this algorithm.

461
00:27:40,120 --> 00:27:46,540
And finally, almost 5 years after,
I was able to get enough time

462
00:27:47,720 --> 00:27:49,460
to fully focus on the problem.

463
00:27:51,560 --> 00:27:54,520
And even after that I spent, I
don't know, a couple of months

464
00:27:54,520 --> 00:28:01,220
implementing and fixing some bugs
and corner cases and implementing

465
00:28:01,500 --> 00:28:06,640
all possible unit tests to cover
all such transitions.

466
00:28:07,120 --> 00:28:12,260
Nikolay: There is no book which
describes this, that you could

467
00:28:12,260 --> 00:28:12,760
follow.

468
00:28:13,620 --> 00:28:17,620
This is something really new that
needs to be invented, right?

469
00:28:18,080 --> 00:28:22,480
Alexander: Well, the idea was obvious,
like how to do it, like,

470
00:28:22,480 --> 00:28:26,780
or what to do, but like implementing
it correctly and proving

471
00:28:26,820 --> 00:28:31,940
that it is really working correctly,
like, it's really a challenge.

472
00:28:32,780 --> 00:28:34,460
Nikolay: Finding all the edge cases,
right.

473
00:28:35,340 --> 00:28:39,340
There is another thing I would
like to discuss a little bit.

474
00:28:39,340 --> 00:28:44,700
It was in Patroni 3, version 3.0,
DCS failsafe mode.

475
00:28:44,700 --> 00:28:51,100
So DCS is distributed configuration
storage.

476
00:28:52,240 --> 00:28:56,960
And actually we just experienced
a couple of outages because

477
00:28:57,360 --> 00:29:03,180
we are in Google Cloud and they're
running the Zalando operator, and Patroni

478
00:29:03,280 --> 00:29:03,960
of course.

479
00:29:04,140 --> 00:29:08,080
And I just checked the version
of Patroni, and it seems to have

480
00:29:08,080 --> 00:29:08,580
it.

481
00:29:09,060 --> 00:29:09,820
But we...

482
00:29:09,840 --> 00:29:12,040
Alexander: But I don't think it
is enabled by default.

483
00:29:12,040 --> 00:29:14,860
Nikolay: Exactly, this is my second
question, actually, why it's

484
00:29:14,860 --> 00:29:15,600
not enabled.

485
00:29:15,720 --> 00:29:19,900
So, first question, what is it,
like, how do you solve this problem

486
00:29:19,900 --> 00:29:24,300
when etcd or Consul is temporarily
out?

487
00:29:24,520 --> 00:29:26,880
Alexander: Let's start from problem
statement.

488
00:29:27,880 --> 00:29:31,200
The promise of Patroni is that it
will run as a primary when it

489
00:29:31,200 --> 00:29:35,340
can write to a distributed configuration
store, like to etcd.

490
00:29:35,420 --> 00:29:39,340
If it cannot write to etcd, it
means that maybe something is

491
00:29:39,340 --> 00:29:44,400
wrong with etcd, or maybe this
node is isolated, and therefore

492
00:29:44,440 --> 00:29:45,360
writes are failing.

493
00:29:45,660 --> 00:29:51,400
And when the node is isolated, it's
working as designed:

494
00:29:51,460 --> 00:29:56,180
Patroni cannot write to etcd, it
will stop Postgres in read-only

495
00:29:56,280 --> 00:30:02,360
mode, but in case if etcd is totally
down, because of some human

496
00:30:02,360 --> 00:30:05,920
mistake, you cannot access any
single node of etcd.

497
00:30:06,600 --> 00:30:11,880
And in this case, Patroni also
stops primary and starts it in

498
00:30:11,880 --> 00:30:19,120
read-only to protect from the case,
let's say, some standby nodes

499
00:30:19,120 --> 00:30:23,260
can access DCS at the same time
and promote 1 of the nodes.

500
00:30:24,320 --> 00:30:28,660
So people were really annoyed by
this problem, and were asking

501
00:30:28,660 --> 00:30:31,120
why we are demoting primary.

502
00:30:32,420 --> 00:30:36,100
So far the answer was always, alright,
so we cannot determine

503
00:30:36,100 --> 00:30:40,100
the state, and therefore we demote
to be on the safe side.

504
00:30:41,000 --> 00:30:45,940
The idea how to improve on that
came at one of Postgres conferences

505
00:30:46,280 --> 00:30:49,420
after talking with other Patroni
users.

506
00:30:51,820 --> 00:30:57,020
Like, how it is improved using
the failsafe mode.

507
00:30:57,080 --> 00:31:02,720
The primary, like when it can determine
that none of etcd nodes

508
00:31:02,720 --> 00:31:08,160
are accessible, it will try to
access all Patroni nodes in the

509
00:31:08,160 --> 00:31:10,420
cluster using the Patroni REST
API.

510
00:31:10,680 --> 00:31:17,220
And in case if the Patroni primary
can get a response from all

511
00:31:17,220 --> 00:31:21,360
nodes in the Patroni cluster in
the failsafe mode, it will continue

512
00:31:21,360 --> 00:31:23,000
to run as a primary.

513
00:31:23,240 --> 00:31:28,900
In this case, it's a much stronger
requirement than quorum or

514
00:31:28,900 --> 00:31:29,400
consensus.

515
00:31:29,680 --> 00:31:34,700
So it is not expecting to get responses
from, let's say, majority.

516
00:31:35,020 --> 00:31:39,780
It really wants to get responses
from all standby nodes to continue

517
00:31:39,800 --> 00:31:41,020
to run as a primary.
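
A simplified sketch of that check; the member list, endpoint name, and payload are assumptions, and in Patroni the feature is switched on with the failsafe_mode option in the dynamic configuration (not enabled by default, as discussed):

import requests

def primary_can_keep_running(my_name, members):
    """DCS is unreachable: keep the leader role only if *every* other member
    of the cluster answers our REST API call (stronger than a majority)."""
    for member in members:
        try:
            r = requests.post(f"{member['api_url']}/failsafe",  # endpoint name assumed
                              json={"leader": my_name}, timeout=2)
            if r.status_code != 200:
                return False
        except requests.RequestException:
            return False   # even one silent member means we must demote
    return True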

518
00:31:42,920 --> 00:31:47,740
This feature was introduced in
Patroni version 3, but it is

519
00:31:47,740 --> 00:31:55,640
not enabled by default, because
I think there are some side effects

520
00:31:55,640 --> 00:32:00,220
when you enable this mode in certain
environments.

521
00:32:01,320 --> 00:32:06,240
Probably it is related to environments
where your node may respond

522
00:32:06,460 --> 00:32:08,060
with a different name.

523
00:32:10,000 --> 00:32:11,340
Nikolay: I need to think about
it.

524
00:32:12,040 --> 00:32:13,480
Alexander: This behavior is documented.

525
00:32:14,180 --> 00:32:15,360
Nikolay: Yeah, we will explore
this.

526
00:32:15,360 --> 00:32:16,780
Thank you so much for it.

527
00:32:17,020 --> 00:32:17,660
But it sounds

528
00:32:17,660 --> 00:32:18,160
Alexander: like...

529
00:32:18,740 --> 00:32:21,300
On Kubernetes it is safe to enable
it.

530
00:32:21,300 --> 00:32:23,940
Nikolay: Yeah, we should start
using this, this is what I think

531
00:32:23,940 --> 00:32:24,340
as well.

532
00:32:24,340 --> 00:32:26,280
Yeah, definitely we'll explore,
thanks.

533
00:32:26,280 --> 00:32:29,280
Alexander: Like pods always have
the same name, just different

534
00:32:29,280 --> 00:32:30,120
IP addresses.

535
00:32:30,400 --> 00:32:32,240
Nikolay: I just got help for it.

536
00:32:32,280 --> 00:32:36,380
And as usual, I just wanted to
publicly thank you for all the

537
00:32:36,380 --> 00:32:44,120
help you do for me and actually
many companies, over many years. It's

538
00:32:44,220 --> 00:32:44,720
huge.

539
00:32:45,060 --> 00:32:54,440
Thank you so much. So, another thing
I wanted to discuss is probably

540
00:32:55,320 --> 00:32:56,400
replication slots.

541
00:32:56,480 --> 00:33:01,400
And I remember a few years ago
you implemented support for failover

542
00:33:01,400 --> 00:33:02,520
of logical slots.

543
00:33:02,980 --> 00:33:05,060
Now we have it in Postgres, right?

544
00:33:05,060 --> 00:33:07,620
So one more, finally, yeah.

545
00:33:08,520 --> 00:33:12,980
One thing was basically removed,
I guess, from Patroni, right?

546
00:33:12,980 --> 00:33:14,680
Or you still keep this functionality?

547
00:33:15,400 --> 00:33:18,840
Alexander: No, we still keep it
and we didn't do anything special

548
00:33:19,300 --> 00:33:20,520
in Postgres 17.

549
00:33:21,980 --> 00:33:24,780
Nikolay: It was, I think it was
16 even, no?

550
00:33:25,440 --> 00:33:27,540
Alexander: Failover of, ah.

551
00:33:28,280 --> 00:33:29,000
Nikolay: Or 17.

552
00:33:29,720 --> 00:33:34,060
Well, ability to use a logical
slot on physical standbys was

553
00:33:34,060 --> 00:33:37,100
in 16, but failover in 17, we
just discussed it.

554
00:33:37,100 --> 00:33:38,220
Alexander: Yes, exactly, exactly.

555
00:33:39,520 --> 00:33:40,620
I confused you.

556
00:33:40,640 --> 00:33:43,480
That's why I'm saying we didn't
do anything special.

557
00:33:43,780 --> 00:33:47,720
Although I did some tweaks to make
this feature work with Patroni,

558
00:33:47,780 --> 00:33:51,620
because it requires to have your
database name in the primary

559
00:33:51,620 --> 00:33:52,120
conninfo.

560
00:33:52,660 --> 00:33:57,260
Patroni wasn't putting the DB name
into primary_conninfo, because

561
00:33:57,260 --> 00:33:59,740
for physical replication, it's
not useful.

562
00:34:01,520 --> 00:34:02,480
Nikolay: But I wonder...

563
00:34:02,680 --> 00:34:03,640
Alexander: How it does it?

564
00:34:03,920 --> 00:34:07,120
Nikolay: I wonder in my head, like,
of course, we create slot

565
00:34:07,540 --> 00:34:13,740
on the primary, it's clear, but
Patroni's main task is to keep

566
00:34:13,740 --> 00:34:19,020
primary alive, to take care of
high availability (HA) for the primary.

567
00:34:19,080 --> 00:34:23,860
Okay, but if we have multiple replicas,
multiple standby nodes,

568
00:34:23,860 --> 00:34:28,580
and 1 of them is used, or maybe
a few, but at least 1, 1 of them

569
00:34:28,580 --> 00:34:33,560
is used to logically replicate
to some Postgres or Snowflake

570
00:34:33,640 --> 00:34:39,180
or anywhere or Kafka or something
in this case if this...

571
00:34:39,600 --> 00:34:42,980
Yeah, from a standby because it's
good, we like it: fewer risks on

572
00:34:42,980 --> 00:34:47,780
the primary and so on, and the WAL
sender is not using CPU and so

573
00:34:47,780 --> 00:34:48,280
on.

574
00:34:48,580 --> 00:34:53,740
And no out of disk risks.

575
00:34:54,020 --> 00:34:57,540
So now we have this standby and
it's dead suddenly.

576
00:34:58,380 --> 00:35:01,160
It's not the job of Patroni to
take care of it, right?

577
00:35:01,160 --> 00:35:04,580
Because we need some mechanism
to failover standby now.

578
00:35:05,140 --> 00:35:09,400
Alexander: Well, you mean to keep
logical replication slot on

579
00:35:09,400 --> 00:35:12,080
a new standby where you would like
to connect.

580
00:35:13,200 --> 00:35:18,760
In theory, Patroni maybe can take
care of it, since it's possible

581
00:35:18,820 --> 00:35:22,480
to do logical replication from
standby nodes since Postgres 16.

582
00:35:23,360 --> 00:35:28,240
So how it's implemented currently
in Patroni, like logical failover

583
00:35:28,240 --> 00:35:34,160
slots, it creates logical slots
on standby nodes and uses

584
00:35:34,160 --> 00:35:39,860
pg_replication_slot_advance() to move
the slot to the same LSN as

585
00:35:39,860 --> 00:35:41,540
it's currently on the primary.
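
Roughly, that mechanism looks like the sketch below (using psycopg2; Patroni also copies slot files between nodes internally, which is left out here, and the DSNs are placeholders):

import psycopg2

def copy_slot_position(primary_dsn, standby_dsn, slot_name):
    """Create the logical slot on a standby and advance it to the primary's position."""
    with psycopg2.connect(primary_dsn) as p, p.cursor() as cur:
        cur.execute("SELECT confirmed_flush_lsn FROM pg_replication_slots"
                    " WHERE slot_name = %s", (slot_name,))
        target_lsn = cur.fetchone()[0]

    with psycopg2.connect(standby_dsn) as s, s.cursor() as cur:
        cur.execute("SELECT 1 FROM pg_replication_slots WHERE slot_name = %s", (slot_name,))
        if cur.fetchone() is None:
            # Creating a logical slot directly on a standby needs Postgres 16+.
            cur.execute("SELECT pg_create_logical_replication_slot(%s, 'pgoutput')",
                        (slot_name,))
        # This decodes WAL up to the target position, which is why it can be slow.
        cur.execute("SELECT pg_replication_slot_advance(%s, %s::pg_lsn)",
                    (slot_name, target_lsn))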

586
00:35:41,940 --> 00:35:45,140
So basically the assumption is
that logical replication happens

587
00:35:45,140 --> 00:35:45,980
on the primary.

588
00:35:46,500 --> 00:35:52,060
In theory, there is no reason why
it cannot be done for standby

589
00:35:52,200 --> 00:35:52,700
nodes.

590
00:35:53,260 --> 00:35:59,080
Let's say we create logical slots
on all standby nodes with the

591
00:35:59,080 --> 00:36:05,520
same name, and Patroni can watch
which 1 is active and publish

592
00:36:05,740 --> 00:36:10,880
this information to etcd, and the remaining
standby nodes will again,

593
00:36:11,140 --> 00:36:14,360
like, Patroni on the remaining standby
nodes will use pg_replication_slot_advance()

594
00:36:15,520 --> 00:36:20,900
to move LSN on standby nodes.

595
00:36:22,120 --> 00:36:23,680
So in theory it could work, but

596
00:36:23,680 --> 00:36:24,160
Nikolay: I don't

597
00:36:24,160 --> 00:36:27,260
Alexander: know if I would have
time to work on it.

598
00:36:28,200 --> 00:36:31,960
Nikolay: I'm just trying to understand.
This is a relatively

599
00:36:32,160 --> 00:36:37,440
new feature since 16 to be able
to logically replicate from physical

600
00:36:37,440 --> 00:36:38,420
standbys, but...

601
00:36:38,420 --> 00:36:41,640
Alexander: But please keep in mind that it still affects primary.

602
00:36:42,900 --> 00:36:43,400
Nikolay: Right.

603
00:36:44,860 --> 00:36:49,040
Alexander: So, maybe, like, pg_wal will not bloat, but pg_catalog

604
00:36:49,440 --> 00:36:50,460
certainly will.

605
00:36:50,660 --> 00:36:51,840
Nikolay: Yeah, this for sure.

606
00:36:52,080 --> 00:36:57,540
I was referring to the need to preserve WAL files on the primary.

607
00:36:58,040 --> 00:37:00,060
This risk has gone if we do this.

608
00:37:00,060 --> 00:37:04,280
But I cannot imagine how we can start using logical slots on

609
00:37:04,280 --> 00:37:09,300
physical standbys in serious projects without HA ideas.

610
00:37:09,400 --> 00:37:12,540
Because right now I don't understand how we solve HA for this.

611
00:37:12,880 --> 00:37:17,580
Alexander: Yeah, and unfortunately, this hack that Patroni implements

612
00:37:17,660 --> 00:37:21,520
with pg_replication_slot_advance() has its downsides.

613
00:37:22,900 --> 00:37:28,200
It literally takes as much time to move the position of the logical

614
00:37:28,200 --> 00:37:31,160
slot as you consume it from the slot.

615
00:37:31,920 --> 00:37:32,720
That's unfortunate.

616
00:37:32,980 --> 00:37:37,040
And how it's solved in Postgres 17, it basically does not need

617
00:37:37,040 --> 00:37:42,740
to parse the whole file and decode it, so it just literally overwrites

618
00:37:43,140 --> 00:37:49,520
some values in the replication slot, because it knows exact locations

619
00:37:50,280 --> 00:37:51,420
and does it safely.

620
00:37:52,380 --> 00:37:53,900
Patroni cannot do it.

621
00:37:54,220 --> 00:37:58,640
Although, probably, pg_failover_slots can also do the same.

622
00:37:59,920 --> 00:38:00,980
For older versions.

623
00:38:01,640 --> 00:38:05,780
Nikolay: Okay, some area, additionally, for me to explore deeper,

624
00:38:05,900 --> 00:38:09,120
because I like understanding many places here.

625
00:38:10,160 --> 00:38:12,940
Good pieces of advice as well, thank you so much.

626
00:38:13,200 --> 00:38:15,320
Anything else, Michael, you wanted to discuss?

627
00:38:15,780 --> 00:38:21,300
Like, obviously, like 1 of the biggest features was Citus support,

628
00:38:21,300 --> 00:38:21,780
right?

629
00:38:21,780 --> 00:38:26,200
But I'm not using Citus actively, so I don't know.

630
00:38:26,200 --> 00:38:28,100
If you want to discuss it, let's discuss.

631
00:38:28,460 --> 00:38:31,380
Alexander: I know that some people certainly do, because from

632
00:38:31,380 --> 00:38:36,880
time to time I get questions about Citus with Patroni on Slack,

633
00:38:37,580 --> 00:38:42,940
or maybe not Citus-specific questions, but according to the output

634
00:38:42,940 --> 00:38:46,600
of patronictl list, they are running a Citus cluster.

635
00:38:47,100 --> 00:38:50,880
There is certainly a demand, and I believe with Patroni implementing

636
00:38:51,100 --> 00:38:56,920
Citus support, it improved quality of life of some organizations

637
00:38:57,340 --> 00:39:02,400
and people that want to run sharded setups.

638
00:39:03,540 --> 00:39:07,220
Nikolay: Is there anything specific you needed to solve to support

639
00:39:07,220 --> 00:39:08,800
this or like technical details?

640
00:39:08,800 --> 00:39:10,140
Alexander: To support Citus?

641
00:39:12,720 --> 00:39:20,360
So, Citus, I wouldn't say that it was very hard, but it wasn't

642
00:39:20,680 --> 00:39:21,940
very easy either.

643
00:39:23,640 --> 00:39:30,660
So, Citus has a notion of Citus Coordinator, where you, like

644
00:39:30,660 --> 00:39:33,680
Originally you're supposed to use coordinator for everything,

645
00:39:33,900 --> 00:39:37,940
to do DDL, to run transactional workload and so on.

646
00:39:38,740 --> 00:39:42,300
And on coordinator there is a metadata table where you register

647
00:39:42,380 --> 00:39:43,600
all worker nodes.

648
00:39:43,820 --> 00:39:47,860
And worker nodes, this is where you keep the actual data, like

649
00:39:47,960 --> 00:39:48,460
shards.

650
00:39:49,160 --> 00:39:53,940
And what I had to implement in Patroni is registering automatically

651
00:39:54,600 --> 00:39:56,640
worker nodes inside this metadata.

652
00:39:57,180 --> 00:40:01,580
And in case a failover happens
for the worker nodes, we need

653
00:40:01,580 --> 00:40:07,260
to update metadata and put new
IPs or host names, whatever.
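
Under the hood this boils down to calls against the Citus metadata on the coordinator. A hedged sketch of the idea follows; citus_add_node() and citus_update_node() are real Citus functions, but the surrounding logic and arguments here are simplified assumptions:

import psycopg2

def register_worker(coordinator_dsn, host, port=5432):
    """Add a new worker to the Citus metadata (simplified)."""
    with psycopg2.connect(coordinator_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT citus_add_node(%s, %s)", (host, port))

def point_worker_to_new_primary(coordinator_dsn, node_id, new_host, new_port=5432):
    """After a failover in a worker group, point its metadata entry to the new primary."""
    with psycopg2.connect(coordinator_dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT citus_update_node(%s, %s, %s)", (node_id, new_host, new_port))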

654
00:40:09,480 --> 00:40:14,500
Basically, when you want to scale
out your Citus cluster, you

655
00:40:14,500 --> 00:40:17,020
just start more worker nodes.

656
00:40:18,900 --> 00:40:23,500
Every worker node, in fact, is
another small Patroni cluster.

657
00:40:24,160 --> 00:40:28,520
So technically, in patronictl,
it looks like just a single

658
00:40:28,520 --> 00:40:32,720
cluster, but in fact it's one cluster
for the coordinator, one cluster

659
00:40:32,720 --> 00:40:39,300
for every worker node, and on each
of them there is its own failover

660
00:40:40,440 --> 00:40:40,940
happening.

661
00:40:41,520 --> 00:40:45,240
If you start worker nodes in a
different group, like in the new

662
00:40:45,240 --> 00:40:50,040
one, it joins the existing Citus cluster
and Patroni, on the coordinator,

663
00:40:50,660 --> 00:40:53,040
registers the new worker nodes.
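
For context, the group layout described here is expressed in a small piece of each node's Patroni configuration: group 0 is the coordinator, any non-zero group is a worker, and all nodes share one scope. The fragment below is illustrative only, generated with PyYAML; the scope, node names, and database name are placeholders.

```python
# Illustrative sketch of the per-node Patroni settings that define Citus groups.
# Assumptions: PyYAML installed; scope/names/database are placeholders.
import yaml

coordinator_cfg = {
    "scope": "demo-citus",          # same cluster name on every node
    "name": "coord-1",
    "citus": {"group": 0, "database": "citus"},   # group 0 = coordinator
}

worker_cfg = {
    "scope": "demo-citus",
    "name": "worker-1a",
    "citus": {"group": 1, "database": "citus"},   # new group number = new worker
}

print(yaml.safe_dump(coordinator_cfg, sort_keys=False))
print(yaml.safe_dump(worker_cfg, sort_keys=False))
```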

664
00:40:53,420 --> 00:40:56,920
But what Patroni will not do
is redistribute the existing

665
00:40:57,040 --> 00:40:58,840
data to the new workers.

666
00:40:59,140 --> 00:41:01,900
This is something that you will
have to do manually afterwards

667
00:41:01,920 --> 00:41:06,960
and it has to be your own decision
how to scale your data and

668
00:41:07,160 --> 00:41:09,520
replicate to other nodes.

669
00:41:09,840 --> 00:41:14,560
Although, like nowadays it's possible
to do it without downtime

670
00:41:14,600 --> 00:41:19,800
because all enterprise features
of Citus are included in Citus

671
00:41:19,920 --> 00:41:20,940
version 10.

672
00:41:21,020 --> 00:41:25,020
So everything that was enterprise
is now open source.
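
As an illustration of that manual step, rebalancing shards onto newly added workers comes down to a call on the coordinator. The sketch assumes an open-source Citus build that ships the shard rebalancer and uses psycopg2 with a placeholder connection string; newer Citus releases can also run the rebalance in the background via citus_rebalance_start().

```python
# Sketch of the manual shard-rebalancing step after adding worker nodes.
# Assumptions: open-source Citus with the rebalancer included, psycopg2,
# placeholder connection string.
import psycopg2

with psycopg2.connect("host=coordinator.example dbname=citus user=postgres") as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        # Preview which shards would move to the newly added workers.
        cur.execute("SELECT * FROM get_rebalance_table_shards_plan()")
        for row in cur.fetchall():
            print(row)

        # Move shards onto the new workers.
        cur.execute("SELECT rebalance_table_shards()")
```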

673
00:41:25,140 --> 00:41:25,940
Nikolay: That's cool.

674
00:41:26,580 --> 00:41:30,460
Michael: I saw Alexander has a
good demo of this, of Citus and

675
00:41:30,460 --> 00:41:33,360
Patroni working together, including
rebalancing.

676
00:41:34,360 --> 00:41:36,320
I think it was Citus Con last year?

677
00:41:36,540 --> 00:41:37,960
Alexander: Yeah, it was Citus Con.

678
00:41:38,520 --> 00:41:40,740
Michael: Nice, I'll include that
video in the show notes.

679
00:41:41,280 --> 00:41:43,720
Nikolay: I wish I had all this
a few years ago.

680
00:41:45,260 --> 00:41:46,040
Alexander: When I...

681
00:41:46,700 --> 00:41:51,040
Yeah, of course, like, there was
a little bit more work under

682
00:41:51,040 --> 00:41:51,760
the hood.

683
00:41:51,940 --> 00:41:57,380
In case you do write workloads
via the coordinator, it's possible,

684
00:41:57,380 --> 00:42:03,300
like Patroni can do some tricks
to avoid client connection termination,

685
00:42:05,080 --> 00:42:08,940
while a switchover of worker
nodes is happening.

686
00:42:09,180 --> 00:42:11,180
This is what I did during the demo.

687
00:42:11,940 --> 00:42:15,600
There are certain tricks, but unfortunately
it works only on

688
00:42:15,600 --> 00:42:18,160
coordinator and only for write
workloads.

689
00:42:18,580 --> 00:42:22,100
For read-only workloads, your connection
will be broken.

690
00:42:22,740 --> 00:42:23,600
That's unfortunate.

691
00:42:24,340 --> 00:42:25,880
Maybe one day it will be fixed.

692
00:42:26,320 --> 00:42:32,740
So in Citus, maybe one day the
same stuff will also work on

693
00:42:32,740 --> 00:42:33,520
worker nodes.

694
00:42:33,520 --> 00:42:37,160
And by the way, on Citus, you can
run transactional workloads

695
00:42:38,300 --> 00:42:40,460
by connecting to any worker node.

696
00:42:41,320 --> 00:42:44,560
Only DDL must happen via the coordinator.

697
00:42:46,560 --> 00:42:47,060
Michael: Nice.

698
00:42:47,780 --> 00:42:51,960
Speaking of improvements in the
future, do you have anything

699
00:42:51,960 --> 00:42:54,260
lined up that you still want to
improve in Patroni?

700
00:42:57,180 --> 00:42:58,580
Alexander: That's a very good question.

701
00:43:00,060 --> 00:43:05,640
Usually, some nice improvements come out of nothing.

702
00:43:05,980 --> 00:43:10,180
You don't plan anything, but you talk to people and they say,

703
00:43:10,240 --> 00:43:13,940
it would be nice to have this improvement or this feature.

704
00:43:14,100 --> 00:43:18,480
And you start thinking about it, wow, yeah, it's a very nice

705
00:43:18,480 --> 00:43:21,480
idea and it's great to have it.

706
00:43:21,660 --> 00:43:28,020
But I rarely plan some big features from the ground up, let's

707
00:43:28,020 --> 00:43:28,520
say.

708
00:43:30,540 --> 00:43:37,240
So what I had in mind, for example, is a failover to a standby

709
00:43:37,260 --> 00:43:38,920
cluster in Patroni.

710
00:43:39,520 --> 00:43:43,820
Right now it's possible to run a standby cluster which is not

711
00:43:43,820 --> 00:43:46,580
aware of the source where it replicates from.

712
00:43:47,100 --> 00:43:50,520
It could be replicating from another Patroni cluster.

713
00:43:50,580 --> 00:43:54,460
And what people ask is: we have a primary Patroni cluster, we have

714
00:43:54,480 --> 00:43:58,240
standby Patroni clusters, but there is no mechanism to automatically

715
00:43:58,420 --> 00:44:01,480
promote the standby cluster, because it's running in a different

716
00:44:01,480 --> 00:44:05,440
region and it is using a completely different etcd.

717
00:44:06,820 --> 00:44:09,440
So they simply don't know about each other.
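
For context on what exists today, a standby Patroni cluster is pointed at its source through a small bootstrap section; promoting it is currently a manual patronictl operation. The fragment below is illustrative only, generated with PyYAML, with placeholder scope, host, port, and slot name.

```python
# Illustrative sketch of the standby_cluster settings in a Patroni bootstrap
# configuration. Assumptions: PyYAML installed; scope/host/port/slot are placeholders.
import yaml

standby_patroni_cfg = {
    "scope": "demo-standby",
    "bootstrap": {
        "dcs": {
            "standby_cluster": {
                # Where the whole standby cluster replicates from -- typically
                # the leader (or a load-balanced endpoint) of the source cluster.
                "host": "primary-region.example",
                "port": 5432,
                "primary_slot_name": "standby_cluster_slot",
                "create_replica_methods": ["basebackup"],
            }
        }
    },
}

print(yaml.safe_dump(standby_patroni_cfg, sort_keys=False))
```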

718
00:44:10,320 --> 00:44:13,660
It would be nice to have, but again I cannot promise when I can

719
00:44:13,660 --> 00:44:16,160
start working on it and whether it will happen.

720
00:44:16,320 --> 00:44:21,100
I know that people from CYBERTEC did some experiments and have

721
00:44:21,180 --> 00:44:26,260
some proof-of-concept solutions that seem to work, but for some

722
00:44:26,260 --> 00:44:30,860
reason they're also not happy with the solution they

723
00:44:30,860 --> 00:44:31,360
implemented.

724
00:44:33,400 --> 00:44:34,540
Michael: Yeah, sounds tricky.

725
00:44:35,380 --> 00:44:37,500
Alexander: Distributed systems are always tricky.

726
00:44:38,000 --> 00:44:38,500
Michael: Yeah,

727
00:44:39,000 --> 00:44:40,120
get that on a t-shirt.

728
00:44:40,460 --> 00:44:41,400
Nikolay: Thank you for coming.

729
00:44:42,040 --> 00:44:47,300
As usual, I use the podcast and all the events I participate in and

730
00:44:47,300 --> 00:44:48,660
organize, and so on,

731
00:44:48,740 --> 00:44:53,200
just for my personal education and daily work as well.

732
00:44:53,200 --> 00:44:55,020
I just want to thank you so much for the help.

733
00:44:55,520 --> 00:44:56,020
Again.

734
00:44:56,380 --> 00:44:58,360
Alexander: Yes, thank you for inviting me.

735
00:44:58,940 --> 00:45:02,060
Yeah, it's a nice job that you are doing.

736
00:45:02,080 --> 00:45:07,660
I know that many people are listening to your podcast and are very happy

737
00:45:07,660 --> 00:45:08,080
about it.

738
00:45:08,080 --> 00:45:14,020
They learn a lot of great stuff and also make a big list of

739
00:45:14,020 --> 00:45:16,960
to-do items like what to check and what to learn.

740
00:45:17,480 --> 00:45:21,940
I cannot say the same about myself that I watch every single

741
00:45:21,940 --> 00:45:25,060
episode, but sometimes I do.

742
00:45:25,840 --> 00:45:27,320
Nikolay: Cool, thank you.

743
00:45:27,440 --> 00:45:28,540
Michael: Thanks so much Alexander.

744
00:45:28,600 --> 00:45:29,320
Cheers Nikolay.