A weekly podcast about all things PostgreSQL
Michael: Hello and welcome to Postgres.FM,
a weekly show about
all things PostgreSQL.
I'm Michael, founder of pgMustard,
I'm joined as usual by
Nikolay, founder of Postgres.AI.
Hey Nikolay.
Nikolay: Hi Michael.
Michael: And today we are delighted to be
joined by Alexander Kukushkin,
a Postgres contributor currently
working at Microsoft and most
famously, maintainer of Patroni.
We had a listener request to discuss
Patroni, so we're delighted
you agreed to join us for an episode.
Alexander.
Alexander: Yeah, hello Michael,
hello Nikolay.
Thank you for inviting me.
I'm really excited to talk about
my favorite project.
Michael: Us too.
Perhaps as a starting point, could
you give us an introduction?
Most people, I think, will
have heard of Patroni and
know what it is, but for anybody
who hasn't, could you give
an introduction to what it is and
why it's important?
Alexander: Yeah, so Patroni, in
simple words, is
a failover manager for Postgres.
It solves the problem of availability
of a primary.
In Postgres, we don't use some
words that are non-inclusive,
like master.
That's why we call it primary,
and Patroni actually recently
got rid of these non-inclusive words
completely.
The way Patroni does this is that it
makes sure that we are running
just a single primary at a time.
At the same time, Patroni
helps you manage as many read-only
replicas as you would like to
have, keeping those replicas
ready to become the primary in case
the primary has failed.
At the same time, Patroni helps
to automate usual DBA tasks like
switchover, configuration management,
stuff like that.
Nikolay: Node provisioning also,
right?
Alexander: Node provisioning, not
really.
Node provisioning is a task for
the DBA.
The DBA has to start Patroni, and Patroni
will take care of bootstrapping
this node.
In case it's a totally new cluster,
Patroni will start as a primary.
In case the node joins an existing
cluster, the replica node
will take a pg_basebackup by default
from the running primary
and start the replica.
And the most interesting part:
let's say we bring back a node which
was previously running as a primary.
Patroni does everything
to convert this failed primary
into a new standby that joins the cluster
and is prepared for the next unforeseen
event.
Nikolay: At least you agree that
it does part of node provisioning,
because otherwise, we wouldn't
have situations where the old data
directory, the old PGDATA, was copied,
a new 1 is created, and we
are suddenly out of disk space.
And if you don't expect Patroni
to participate in node provisioning,
then you think, what's happening?
Why am I out of disk space?
Right?
It happens sometimes.
Alexander: It used to happen, I
think, with bootstrap mode.
I don't
remember up until which version,
but when Patroni tries to create
a new cluster, it usually does so by
using initdb, although in some cases
you can configure Patroni to
create a cluster from an existing
backup, like a base backup.
And if something goes wrong, Patroni
does not remove this data
directory, but renames it.
And it used to append the current timestamp
to the file name.
Then, after the first
failure, it gives up, waits a
little bit, and makes the next attempt.
Nikolay: To directory, right?
Alexander: Yeah, it creates, it
uses yet another base backup,
creates a new data directory, fails,
and renames.
Now it no longer works like this.
It just renames PGDATA to PGDATA-old,
or something like this, so
you will not have an unbounded
number of directories.
Having just 1 is enough to
investigate the failure.
Nikolay: But even then, if we
expected our data directory
to fill 70% of the
disk, we still might run out of
disk space.
Alexander: Yeah, that's unfortunate,
but the other option is really
to just drop it, and then
all the evidence
of what failed and why it failed is
also gone.
You have nothing to investigate.
Nikolay: Okay.
To me it still sounds like Patroni
participates in node provisioning.
Yes, it doesn't bring you resources
like disk and virtual machine
and so on, but it brings data,
like the most important part of
Postgres node provisioning, right?
Okay, I just wanted to be right
a little bit.
Okay.
Alexander: Okay.
Nikolay: It's a joke.
Okay.
Michael: I think diving deep quickly
is great.
It'd be good to discuss complex
topics but I think something
simple would also be good.
I would love to hear a little bit
almost about the history of
Patroni.
Like the early days, what were
you doing before Patroni to solve
this kind of issue and why was
it built?
What problems were there with the
existing setups?
Alexander: To be honest, while
working for my previous company,
we didn't have any automatic failover
solution in place.
What we relied on was just a good
monitoring system that sent
you a message, or some on-call engineer
who just called you in the
night to say the database had failed.
There were a lot of false positives,
unfortunately, but it still
felt more reliable than using solutions
like replication manager,
repmgr.
Nikolay: Yeah, I remember this
very well.
Like people constantly saying we
don't need autofailover, it's
evil because it can switch over
suddenly, failover suddenly,
and it's a mess.
Let's rely on manual actions.
I remember this time very well.
Alexander: Yeah, in our defense,
the number of
database clusters that we ran wasn't
so high, I think a
few dozen, and they were running
on-prem and didn't fail so often,
so it was manageable.
A bit later, we started moving
to the cloud, and suddenly, well, not
suddenly, but luckily for us, we
found a project named Governor,
which basically brought the idea
of how to implement autofailover
in a very nice manner, without
having so many false positives
and without the risk of running into
a split-brain.
Nikolay: Was it an abandoned project
already?
Alexander: No, no, so it was not
really abandoned, but it also wasn't
very active.
So we started applying it, found
some problems, reported the problems
to the maintainer of Governor,
got no reaction unfortunately,
started fixing those problems on
our own, and at some moment
a number of fixes and some new
nice features accumulated and
we decided just to fork it and
give a new name to the project.
So this is how Patroni was born.
Nikolay: Georgian name, right?
Alexander: Right.
Nikolay: What does it mean?
Michael: Governor in Georgian.
Nikolay: Oh, governor.
I think so.
Michael: Yeah,
Alexander: almost, almost.
Almost.
Very close, but I'm not a good
person to explain it or to translate
from Georgian, because I don't...
Nikolay: Name
Alexander: I know one other word
in Georgian, and it's spilo.
Yeah, it translates from Georgian
as elephant.
Nikolay: And the name was chosen, I guess,
by Valentine Gogichashvili, right?
Alexander: Yes, he was. No, at
that time he wasn't my boss anymore,
but we still worked closely together,
and I really appreciate his
creativity in inventing good names
for projects.
Michael: Yeah, great names.
And is this a good time to bring
up Spilo?
Like, what is Spilo and how is
that relevant?
Alexander: Spilo, as I said,
translates from Georgian as elephant.
When we started playing with Governor,
we were already targeting
to deploy everything in the cloud.
We had no other choice but to build
a Docker image and provision
Postgres in a container.
And we called this Docker image
Spilo. Basically, we packaged
Governor, Postgres, a few Postgres
extensions, and I think it was
WAL-E back then as a backup and
point-in-time recovery solution.
Michael: And it still exists to
this day as Spilo, but now with
Patroni?
Alexander: Yeah, of course.
Now there is a Patroni inside,
and now Spilo includes plenty
of Postgres major versions, which
may be an anti-pattern, but
it allows you to run major upgrades,
like in-place major upgrades.
It also includes WAL-G nowadays,
as a modern replacement for WAL-E.
And it's used not only by...
Operator, right?
Not really part of Operator.
Spilo is a product on its own.
I know that some people run Postgres
on Kubernetes or even just
on virtual machines with Spilo,
without using the Operator.
Nikolay: But using Docker for example?
Alexander: Yeah, of course.
Michael: But that is a good opportunity
to discuss Postgres-operator.
Postgres-operator was Zalando's...
Was that one of the first operators
of its type?
I know we've got lots these days.
Alexander: Well, maybe it was,
but at the same time, the same
name was used by Crunchy for their
Operator.
They were developed in parallel
and back then Crunchy wasn't
relying on Patroni yet.
As I said, we started moving things
to the cloud, and at some
point the vector shifted a little bit
and we started running plenty of
workloads on Kubernetes, including
Postgres.
Since deploying everything manually,
and more importantly, managing
so many Postgres clusters manually
was really a nightmare, we
started building Postgres-Operator.
Back then, I don't think any very
nice Go library for implementing
the Operator pattern existed, and
therefore people had to invent
everything from scratch, and there
was a lot of boilerplate code
that got copied over and so on.
Nikolay: Was it only the move to
the cloud that mattered here,
or maybe also moving to microservices,
splitting everything
into microservices?
Because I remember from Valentin,
for example...
Alexander: Microservices, of course,
played a big role.
And probably...
Not probably...
Microservices were really the driving
force for moving to the cloud,
because at the scale of the organization,
it wasn't possible
to keep the monolith.
And the idea was, let's split everything
to microservices, and
every microservice usually requires
its own database.
Nikolay: Right.
Alexander: Sometimes a sharded database,
like we used application
sharding.
In certain cases, the same database
is used by multiple microservices,
but it's a different story.
But really, the number of database
clusters that we had to support
exploded.
From dozens to hundreds and then
to thousands.
Nikolay: Yeah.
And this is already when you cannot
rely on humans to perform
a failover, right?
Alexander: Even when you run a
few hundred database clusters,
it's better not to rely on humans to
do maintenance, in my opinion.
Nikolay: Right, so that's interesting
and maybe it's also the
right time to discuss why Postgres
doesn't have built-in
autofailover.
I remember discussions about replication
when we relied on Slony
and Londiste and some people resisted
to bring replication inside
Postgres, but somehow it was resolved
eventually.
And Postgres has good replication,
physical, logical, sometimes
not good, but it's a different
story.
In general, it's very good and
improving, improving every release.
Just last week we discussed with
Michael the improvements to
logical replication in 17, and
maybe it will resonate a little
bit with today's topic, Patroni,
but this hasn't happened for autofailover
at all, right?
Why so?
Alexander: I can only guess, because
to do it correctly, we cannot
just have 2 nodes, which most people
run, like primary and standby,
because there are many different
factors involved.
1 of the most critical ones is
the network between those nodes.
With just 2 machines, you
cannot distinguish between a failure
of the network and a failure of
the primary.
If you just run a health check from
a standby and make a decision
based on the health check, you
may get a false positive.
Basically, the network just experienced
some short glitch, which
could last even a few seconds,
sometimes a few minutes, but at
the same time the old primary is
still there.
If we promote a standby, we get
to a split-brain situation.
You have 2 primaries, and it is not
clear on which 1 transactions
are running.
In the worst case, you end up with
an application connecting to
both of them.
Good luck with assembling all these
changes together.
Nikolay: This is what tools like
repmgr do.
So I ended up calling
repmgr a split-brain solution.
Because I observed it many, many
times.
Alexander: As a mitigation,
what is maybe possible to do:
the primary can also run a health
check, and in case the standby
is not available, just stop accepting
writes, by either restarting
in read-only mode or maybe by implementing
some other mechanism.
But it also means that we lose
availability without a good reason.
And in the scenario where we
promote the standby, technically if
the standby cannot access anyone else,
it shouldn't be accepting
writes either, as in a network
split.
Basically, we come close to a setup
with what repmgr
calls a witness node.
Nikolay: Witness node, yes exactly.
Alexander: Witness node, basically
you need to have more than
2.
And the witness node should help
in making decisions.
Let's say we have a witness node
in some third failure domain,
the primary can see the witness
node, therefore it can continue
to run as a primary.
And standby shouldn't be allowed
to promote if it cannot access
the witness node.
This already resembles systems
like etcd that implement a
consensus algorithm, where a write is
possible only when it is accepted
by a majority of nodes.
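Editor's note: the majority rule described here can be sketched in a few lines. This is a generic illustration, not code from Patroni, repmgr, or etcd; the function name and the three-member layout (primary, standby, witness) are assumptions made for the example.

```python
def has_majority(reachable_members: int, total_members: int) -> bool:
    """Return True when a node can see a strict majority of the cluster.

    reachable_members counts the node itself plus every member it can reach.
    """
    return reachable_members > total_members // 2


# Hypothetical layout: three failure domains (primary, standby, witness).
TOTAL = 3

# The primary still sees the witness, so it may keep accepting writes.
primary_keeps_role = has_majority(reachable_members=2, total_members=TOTAL)

# The standby is cut off from both other members, so it must not promote.
standby_may_promote = has_majority(reachable_members=1, total_members=TOTAL)

# With only two members, a network split leaves neither side with a majority.
two_node_split = has_majority(reachable_members=1, total_members=2)
```

The last case is the reason a two-node setup cannot make a safe decision on its own: after a split, each side sees only itself.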
Nikolay: This wheel is already invented,
right?
Alexander: Yeah, so this is already
invented, and it is what Patroni
really relies on to implement
autofailover reliably.
My guess is that at some moment
it will be added to Postgres,
and plenty of the needed
components already exist in Postgres.
We have the write-ahead log with an LSN
which is always incremented.
We have timelines, which are very
similar to terms in etcd.
So basically at the end we will
just need to have more than 2
nodes, better 3, so that we don't
stop writes while 1 node is
temporarily down.
It will make it possible to implement
autofailover without
even doing pg_rewind, let's say.
Because when primary writes to
write-ahead log, it will be first
confirmed by standby nodes, and
only after that.
So effectively, this is what we
already have, but it's not enough,
unfortunately.
Nikolay: So do you think at some point Patroni
will not be needed and
everything will be inside Postgres,
or no?
Alexander: I hope so, really.
Nikolay: I hope so.
Alexander: No, no, no, no, no, no.
Not because I'm tired of maintaining Patroni,
but because this is what people
really want to have.
To deploy highly available Postgres
without necessity to research
and learn a lot of external tools
like Patroni, solutions for
backup and point...
Nikolay: Upgrade them sometimes,
because we're always lagging
with these upgrades.
Alexander: Yeah.
But at the same time, let's imagine
that it happens in a couple
of years. With a five-year
support cycle, there will still
be a lot of setups that are running
older Postgres versions,
and they will still need to use something
external, like Patroni.
Nikolay: Yeah, I'm actually looking
right now at commits of
repmgr.
It looks like the project is inactive
for more than 1 year, almost.
Like a few commits, that's it.
It's like going down.
Alexander: So I have probably some
insights about it, not about
repmgr, but I know
that EnterpriseDB was contributing
some features and bug fixes to
Patroni, so they officially support
Patroni.
Nikolay: So it sounds interesting,
right?
So Patroni is a winner, obviously.
It's used by many Kubernetes operators,
many of them, and not
only Kubernetes, of course, and
winning, of course, some projects
were abandoned, not only
repmgr, we know some others,
right?
But you're thinking that 1 day everything
will be in core and Patroni
will be abandoned maybe, right?
And you think maybe that's for the good.
Alexander: So every project has
its own life cycle.
At some moment, the project is
abandoned and not used by anyone.
We are not there yet.
Nikolay: Right, right.
While we're in this area, I wanted
to ask what you think
about this: Kubernetes itself also
relies on a consensus algorithm,
right?
Why do some operators choose
to use Patroni,
while others, like CloudNativePG,
decide to rely on Kubernetes
native mechanisms and avoid using
Patroni?
Alexander: To be honest, I don't
know what is driving the people who
build CloudNativePG.
Nikolay: But what's better in general?
What are pros and cons?
How to compare?
What would you do?
Alexander: In CloudNativePG, there is a component
that tries to manage all Postgres
clusters and decide whether
some primary has failed and promote
1 of the standbys.
I'm not sure how they implement
fencing of the failed
primary, because if you don't correctly
implement fencing and you
promote a standby to primary,
you again end up in a split-brain
situation.
And let's imagine that 1 Kubernetes
node is isolated in the network.
Nikolay: Network partition.
Alexander: Yeah.
And it automatically means that
you will not be able to stop
pods or containers that are running
on this node.
At the same time, applications
that are running on this node
will still use Kubernetes services
to be able to connect to the
isolated primary.
Nikolay: Right, yeah.
Alexander: So Patroni detects such
scenarios very easily, because
the Patroni component runs in the same
pod as Postgres, and in case
it cannot write to the Kubernetes API,
it just does self-fencing:
it restarts Postgres in read-only mode.
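Editor's note: a minimal sketch of the self-fencing idea described here, assuming a hypothetical `update_leader_key` callable that raises `DcsWriteError` when the write to the DCS (here, the Kubernetes API) cannot be confirmed. These names are illustrative only and are not Patroni's internals.

```python
class DcsWriteError(Exception):
    """Raised by the hypothetical DCS client when a write cannot be confirmed."""


def next_role(current_role, update_leader_key):
    """Decide the role for the next cycle of a simplified HA loop.

    A primary keeps its role only while it can still write its leader key.
    If the write fails, it fences itself by dropping to read-only.
    """
    if current_role != "primary":
        return current_role
    try:
        update_leader_key()
    except DcsWriteError:
        return "read-only"
    return "primary"


def healthy_write():
    """Stand-in for a successful leader key update."""
    return None


def failing_write():
    """Stand-in for an isolated node that cannot reach the DCS."""
    raise DcsWriteError("DCS unreachable")
```

The decision is made locally, next to Postgres, so it still works on a node that no external controller can reach.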
Nikolay: It's simple, by the way,
right?
Alexander: Yeah, so I don't know
if they do something similar.
If they don't, it's dangerous.
Michael: We should do a whole separate
episode on CloudNativePG,
actually. I think that would
be a good 1.
Alexander: Yeah, I'm not saying
that CloudNativePG is bad
or does something wrong.
Nikolay: I'm just raising questions.
Alexander: Raising my concerns.
Michael: Of course. Right, back to
Patroni. It worked like this
from the beginning, but it feels
like...
Alexander: ...in version 10, which
has been end of life for a couple of
years, by the way.
From the very beginning, we wanted
to support this feature, but
what was stopping us was the promise
Patroni makes with synchronous
replication: that we promote
only a node that was synchronous
at the time the primary failed.
If we just have a single name in
synchronous_standby_names, a
single node, it's very easy to
say, okay, this node was synchronous
and therefore we can just promote
it.
When there is more than 1 node
and we require all of them to
be synchronous, we can promote
any of them.
But with quorum-based replication,
you can have something like
any 1 from a list of, let's say,
3 nodes.
Which 1 is synchronous when the
primary failed?
I'm not asking you to answer this
question, so I will just explain
how it works in Patroni as of
the last major release.
The information about the current
value of synchronous_standby_names
is also stored in etcd.
Therefore, those 3 nodes that are
listed in synchronous_standby_names
know that they are listed
as quorum nodes, and during the
leader race they need to access
each other and get some number
of votes.
If there are 3 nodes, it means
that every node, to become a
candidate for the new primary,
needs to access at least the 2 remaining nodes
and get confirmation that they're
not ahead of the LSN on the
current node.
Is it clear, or
should I elaborate a little bit
more?
Michael: So, let me ask the stupid question:
if a node checks and finds that it is ahead
of the current candidate to
be leader, then it's a bad decision
to promote that candidate,
because a different 1 would...
Alexander: So just for your understanding,
in Patroni there is
no central component that decides
which node to promote.
Every node makes a decision on
its own.
Therefore, every standby node
listed in synchronous_standby_names
goes through the cycle of
health checks.
It accesses the remaining nodes from
synchronous_standby_names and
checks what LSN they are at.
If they're on the same LSN
or behind, we can assume that
this node is the healthiest 1.
And the same procedure happens
on remaining nodes.
Basically this way we can find,
okay, so this node is eligible
to become a new primary.
If we have something like
ANY 2 of 3 nodes, we can make
a decision by asking just a single
node,
because we know that 2 nodes will
have the latest commits, that is,
the latest commits that were reported
to the client.
So it is enough to ask
a single node.
Patroni will still ask all nodes
from synchronous_standby_names,
but if 1 of them, let's
say, failed together with the
primary, it is still enough to
make a decision by asking the
remaining 1.
Nice.
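Editor's note: the two examples given here (ANY 1 of 3 needs answers from 2 other nodes, ANY 2 of 3 needs an answer from 1) fit a simple counting rule. The sketch below is a simplified reading of that description, not Patroni's implementation; the function names are hypothetical and LSNs are modelled as plain integers.

```python
from typing import Sequence


def responses_required(sync_nodes: int, sync_quorum: int) -> int:
    """How many *other* listed nodes a candidate must hear from.

    With "ANY sync_quorum" out of sync_nodes, every acknowledged commit is on
    at least sync_quorum of the listed nodes. Seeing
    sync_nodes - sync_quorum + 1 of them (the candidate included) therefore
    guarantees that at least one of the nodes seen has the latest commit.
    """
    return sync_nodes - sync_quorum


def may_promote(own_lsn: int, peer_lsns: Sequence[int],
                sync_nodes: int, sync_quorum: int) -> bool:
    """A candidate may promote if enough peers answered and none is ahead."""
    if len(peer_lsns) < responses_required(sync_nodes, sync_quorum):
        return False
    return all(lsn <= own_lsn for lsn in peer_lsns)
```

For example, `may_promote(100, [100, 90], sync_nodes=3, sync_quorum=1)` is allowed, while the same candidate with only one answer, `may_promote(100, [90], 3, 1)`, is not, because the silent node might hold a newer commit.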
And the tricky part comes when
we need to change synchronous
standby names and the values that
we store in etcd.
Let's say we want to increase the
number of synchronous nodes
from 1 to 2.
What should we change first, the synchronous_standby_names
GUC
or the value in etcd?
So that we can correctly make a
decision.
If we change the value in etcd first,
Patroni will assume that we
need to ask just a single node
to make a decision, although only
1 node is guaranteed to have the latest
commits.
In fact, we need to ask 2.
Therefore, when we increase this
from 1 to 2, first we need to
update synchronous_standby_names,
and only after that change
the value in etcd.
And there are almost a dozen
rules that 1 needs to follow
to make such changes in the correct
order,
because it's not only about changing
the replication factor. It's
also about adding new nodes to
synchronous_standby_names, removing
nodes that are gone, and so on.
And I don't think any other failover
solution implements a general
algorithm to do such changes.
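Editor's note: only one of the ordering rules is spelled out in the conversation, namely that an increase of the synchronous node count must reach Postgres before it reaches etcd. The sketch below encodes that rule. The mirrored order for a decrease is an assumption added for illustration (the value in the DCS should never promise more synchronous copies than Postgres currently enforces), and the function and step names are hypothetical.

```python
from typing import List


def plan_sync_count_change(current: int, target: int) -> List[str]:
    """Return the order in which to apply a change of the sync node count.

    "postgres" stands for updating synchronous_standby_names,
    "dcs" for updating the value stored in etcd.
    """
    if target > current:
        # From the discussion: strengthen the guarantee in Postgres first,
        # and only then tell the other nodes they may rely on it.
        return ["postgres", "dcs"]
    if target < current:
        # Assumption: weaken the promise in the DCS first, then relax Postgres.
        return ["dcs", "postgres"]
    return []
```

The remaining rules mentioned here (adding nodes, removing nodes that are gone) are not covered by this sketch.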
Nikolay: How much time did you
spend to develop this?
Alexander: Originally this feature
was implemented by Ants Aasma,
who works for CYBERTEC. That
happened in 2018.
I made a few attempts to understand
the great logic of this algorithm.
And finally, almost 5 years later,
I was able to get enough time
to fully focus on the problem.
Even after that I spent, I
don't know, a couple of months
implementing it, fixing some bugs
and corner cases, and writing
all possible unit tests to cover
all such transitions.
Nikolay: There is no book which
describes this, that you could
follow.
This is something really new that
needs to be invented, right?
Alexander: Well, the idea of
what to do was obvious,
but implementing
it correctly and proving
that it really works correctly
is really a challenge.
Nikolay: Finding all the edge cases,
right.
There is another thing I would
like to discuss a little bit.
It was in Patroni 3, version 3.0,
DCS failsafe mode.
So DCS is the distributed configuration
store.
And actually we just experienced
a couple of outages, because
we are in Google Cloud and they're
running the Zalando operator, with Patroni
of course.
And I just checked the version
of Patroni, and it seems to have
it.
But we...
Alexander: But I don't think it
is enabled by default.
Nikolay: Exactly, this is my second
question, actually, why it's
not enabled.
So, first question: what is it,
and how do you solve this problem
when etcd or Consul is temporarily
down?
Alexander: Let's start from the problem
statement.
The promise of Patroni is that it
will run as a primary only while it
can write to the distributed configuration
store, like etcd.
If it cannot write to etcd, it
means that maybe something is
wrong with etcd, or maybe this
node is isolated, and therefore
writes are failing.
When the node is isolated, it's
working as designed:
Patroni cannot write to etcd, so it
restarts Postgres in read-only
mode. But etcd can also be totally
down, say because of some human
mistake, so that you cannot access a
single node of etcd.
In this case, Patroni also
stops the primary and starts it in
read-only mode, to protect against the case where,
let's say, some standby nodes
can still access the DCS
and promote 1 of the nodes.
So people were really annoyed by
this problem, and were asking
why we are demoting the primary.
So far the answer was always, alright,
we cannot determine
the state, and therefore we demote
to be on the safe side.
The idea of how to improve on that
came at one of the Postgres conferences
after talking with other Patroni
users.
Here is how it is improved using
the failsafe mode.
When the primary determines
that none of the etcd nodes
are accessible, it will try to
access all Patroni nodes in the
cluster using the Patroni REST
API.
If the Patroni primary
can get a response from all
nodes in the Patroni cluster in
the failsafe mode, it will continue
to run as a primary.
In this case, it's a much stronger
requirement than quorum or
consensus.
So it is not expecting to get responses
from, let's say, majority.
It really wants to get responses
from all standby nodes to continue
to run as a primary.
This feature was introduced in
Patroni version 3, but it is
not enabled by default, because
I think there are some side effects
when you enable this mode in certain
environments.
Probably it is related to environments
where your node may respond
with a different name.
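Editor's note: the failsafe rule described here is stricter than a quorum, and a small sketch makes the difference visible. This is an illustration under assumed names (`keep_primary_in_failsafe`, member names such as `pg-0`), not Patroni's code.

```python
from typing import Set


def keep_primary_in_failsafe(cluster_members: Set[str],
                             responded: Set[str],
                             own_name: str) -> bool:
    """Return True only if every other known member answered the primary.

    Unlike a quorum check, a single missing answer is enough to demote.
    """
    others = cluster_members - {own_name}
    return others <= responded


# Hypothetical three-member cluster; the primary is "pg-0".
members = {"pg-0", "pg-1", "pg-2"}

all_answered = keep_primary_in_failsafe(members, {"pg-1", "pg-2"}, "pg-0")

# Two of three members (primary included) would be a majority, but not enough here.
one_missing = keep_primary_in_failsafe(members, {"pg-1"}, "pg-0")

# A member answering under an unexpected name does not count as the expected one.
renamed_member = keep_primary_in_failsafe(members, {"pg-1", "10.0.0.7"}, "pg-0")
```

The last case is one way to picture the side effect mentioned above: if a node answers under a different name than the one recorded for it, the primary cannot match the answer to the expected member.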
Nikolay: I need to think about
it.
Alexander: This behavior is documented.
Nikolay: Yeah, we will explore
this.
Thank you so much for it.
But it sounds
Alexander: like...
On Kubernetes it is safe to enable
it.
Nikolay: Yeah, we should start
using this, this is what I think
as well.
Yeah, definitely we'll explore,
thanks.
Alexander: Like pods always have
the same name, just different
IP addresses.
Nikolay: I just got help for it.
And as usual, I just wanted to
publicly thank you for all the
help you give me and actually
many companies. Over many years, it's
huge.
Thank you so much. Another thing
I wanted to discuss is
replication slots.
And I remember a few years ago
you implemented support for failover
of logical slots.
Now we have it in Postgres, right?
So one more, finally, yeah.
One thing was basically removed,
I guess, from Patroni, right?
Or you still keep this functionality?
Alexander: No, we still keep it
and we didn't do anything special
in Postgres 17.
Nikolay: It was, I think it was
16 even, no?
Alexander: Failover of, ah.
Nikolay: Or 17.
Well, the ability to use a logical
slot on physical standbys was
in 16, but failover is in 17, we
just discussed it.
Alexander: Yes, exactly, exactly.
I confused you.
That's why I'm saying we didn't
do anything special.
Although I did make some tweaks so that
this feature works with Patroni,
because it requires the
database name in
primary_conninfo.
Patroni wasn't putting the dbname
into primary_conninfo, because
for physical replication it's
not useful.
Nikolay: But I wonder...
Alexander: How it does it?
Nikolay: I wonder, of course, we create the slot
on the primary, that's clear, and
Patroni's main task is to keep the
primary alive, to take care of
high availability for the primary.
Okay, but say we have multiple
standby nodes,
and at least 1 of them
is used to logically replicate
to some Postgres or Snowflake
or Kafka or something.
We replicate from a standby because it's
good: there is less risk on
the primary, the WAL
sender is not using CPU there,
and there is no out-of-disk risk.
So now this standby
is suddenly dead.
It's not the job of Patroni to
take care of it, right?
Because now we need some mechanism
to fail over the standby.
Alexander: Well, you mean keeping the
logical replication slot on
a new standby where you would like
to connect.
In theory, Patroni maybe can take
care of it, since it's been possible
to do logical replication from
standby nodes since Postgres 16.
The way logical failover slots are
currently implemented in Patroni is that
it creates logical slots
on standby nodes and uses
pg_replication_slot_advance() to move
the slot to the same LSN that
it currently has on the primary.
So basically the assumption is
that logical replication happens
on the primary.
In theory, there is no reason why
it cannot be done for standby
nodes.
Let's say we create logical slots
on all standby nodes with the
same name, and Patroni watches
which 1 is active and publishes
this information to etcd. Then
Patroni on the remaining standby
nodes will again use pg_replication_slot_advance()
to move the LSN on those standby nodes.
So in theory it could work, but
I don't know if I would have
time to work on it.
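Editor's note: pg_replication_slot_advance(slot_name, upto_lsn) is a standard Postgres function. The sketch below only builds the statement that a standby-side helper might run to move a slot to the position reported for the primary. The helper names are hypothetical, and the `%s` placeholders assume a psycopg-style driver.

```python
from typing import Tuple


def format_lsn(lsn: int) -> str:
    """Render a 64-bit WAL position in Postgres' textual 'hi/lo' hex form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def advance_slot_statement(slot_name: str,
                           primary_lsn: int) -> Tuple[str, Tuple[str, str]]:
    """Build a parameterized call that moves a slot up to primary_lsn.

    Returns the SQL text and its parameters. Executing it is left to the caller.
    """
    sql = "SELECT pg_replication_slot_advance(%s, %s::pg_lsn)"
    return sql, (slot_name, format_lsn(primary_lsn))


# Hypothetical slot name and position, for illustration only.
example_sql, example_params = advance_slot_statement("my_logical_slot", 0x16B3748)
```

Nothing here talks to a database. A real helper would also need to read the slot's current position on the primary, which this sketch takes as an input.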
Nikolay: I'm just trying to understand,
This is a relatively
new feature since 16 to be able
to logically replicate from physical
standbys, but...
Alexander: But please keep in mind that it still affects primary.
Nikolay: Right.
Alexander: So maybe pg_wal will not bloat, but pg_catalog
certainly will.
Nikolay: Yeah, this for sure.
I was referring to the need to preserve WAL files on the primary.
This risk is gone if we do this.
But I cannot imagine how we can start using logical slots on
physical standbys in serious projects without HA ideas.
Because right now I don't understand how we solve HA for this.
Alexander: Yeah, and unfortunately, this hack that Patroni implements
with pg_replication_slot_advance() has its downsides.
It literally takes as much time to move the position of the logical
slot as it would take to consume from the slot.
That's unfortunate.
Postgres 17 solves this differently: it does not need
to parse the whole WAL file and decode it. It just overwrites
some values in the replication slot, because it knows the exact locations
and can do it safely.
Patroni cannot do that.
Although pg_failover_slots can probably do the same
for older versions.
Nikolay: Okay, one more area for me to explore deeper,
because I lack understanding in many places here.
Good pieces of advice as well, thank you so much.
Anything else, Michael, you wanted to discuss?
Like, obviously, like 1 of the biggest features was Citus support,
right?
But I'm not using Citus actively, so I don't know.
If you want to discuss it, let's discuss.
Alexander: I know that some people certainly do, because from
time to time I get questions about Citus with Patroni on Slack,
or maybe not Citus-specific questions, but according to the output
of patronictl list, they are running a Citus cluster.
There is certainly demand, and I believe that Patroni's
Citus support has improved the quality of life of some organizations
and people that want to run sharded setups.
Nikolay: Is there anything specific you needed to solve to support
this or like technical details?
Alexander: To support Citus?
So, Citus. I wouldn't say that it was very hard, but it wasn't
very easy either.
Citus has a notion of a coordinator, which
originally you were supposed to use for everything:
to do DDL, to run transactional workloads, and so on.
On the coordinator there is a metadata table where you register
all worker nodes.
The worker nodes are where you keep the actual data, the
shards.
What I had to implement in Patroni is automatically registering
worker nodes in this metadata.
And when a failover happens
for a worker node, we need
to update the metadata and put in the new
IPs or host names, whatever.
Basically, when you want to scale
out your Citus cluster, you
just start more worker nodes.
Every worker node, in fact, is
another small Patroni cluster.
So technically, in patronictl,
it looks like just a single
cluster, but in fact it's 1 cluster
for the coordinator and 1 cluster
for every worker node, and each
of them has its own failover
happening.
If you start worker nodes in a
different group, a new
1, they join the existing Citus cluster
and Patroni on the coordinator
registers the new worker nodes.
But what Patroni will not do,
it will not redistribute existing
data to the new workers.
This is something that you will
have to do manually afterwards
and it has to be your own decision
how to scale your data and
replicate to other nodes.
Although nowadays it's possible
to do it without downtime,
because all enterprise features
of Citus are included in Citus
version 10.
So everything that was enterprise
is now open source.
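Editor's note: the bookkeeping described in this answer (register each worker group on the coordinator, then repoint the entry after a worker failover) can be pictured with an in-memory model. The class below is purely illustrative: it does not call Citus or Patroni, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CoordinatorMetadata:
    """Toy stand-in for the coordinator's table of worker nodes."""

    # Maps a worker group id to the (host, port) of that group's current primary.
    workers: Dict[int, Tuple[str, int]] = field(default_factory=dict)

    def register(self, group: int, host: str, port: int) -> None:
        """Record a worker group when it joins the cluster."""
        self.workers[group] = (host, port)

    def on_worker_failover(self, group: int, new_host: str, new_port: int) -> None:
        """Point an already registered group at its newly promoted primary."""
        if group not in self.workers:
            raise KeyError(f"worker group {group} is not registered")
        self.workers[group] = (new_host, new_port)


# Hypothetical two-worker cluster.
meta = CoordinatorMetadata()
meta.register(1, "worker1-a", 5432)
meta.register(2, "worker2-a", 5432)

# Group 1 fails over to its standby; group 2 is untouched.
meta.on_worker_failover(1, "worker1-b", 5432)
```

As noted in the conversation, registering a new worker group does not move any existing data onto it. Rebalancing remains a separate, manual decision.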
Nikolay: That's cool.
Michael: I saw Alexander has a
good demo of this, of Citus and
Patroni working together, including
rebalancing.
I think it was Citus Con last year?
Alexander: Yeah, it was Citus Con.
Michael: Nice, I'll include that
video in the show notes.
Nikolay: I wish I had all this
a few years ago.
Alexander: When I...
Yeah, of course, there was
a little bit more work under
the hood.
If you do a write workload
via the coordinator,
Patroni can do some tricks
to avoid client connection termination
while a switchover of worker
nodes is happening.
This is what I did during the demo.
There are certain tricks, but unfortunately
it works only on the
coordinator and only for write
workloads.
For read-only workloads, your connection
will be broken.
That's unfortunate.
Maybe 1 day it will be fixed in Citus,
and maybe 1 day the
same stuff will also work on
worker nodes.
By the way, in Citus you can
run transactional workloads
by connecting to any worker node.
Only DDL must happen via the coordinator.
Michael: Nice.
Speaking of improvements in the
future, do you have anything
lined up that you still want to
improve in Patroni?
Alexander: That's a very good question.
Usually some nice improvements are coming out of nothing.
You don't plan anything, but you talk to people and they say,
it would be nice to have this improvement or this feature.
And you start thinking about it, wow, yeah, it's a very nice
idea and it's great to have it.
But I rarely plan some big features from the ground up, let's
say.
So what I had in mind, for example, is failover to a standby
cluster in Patroni.
Right now it's possible to run a standby cluster which is not
aware of the source it replicates from.
It could be replicating from another Patroni cluster.
And what people ask for is this: we have a primary Patroni cluster, we have
standby Patroni clusters, but there is no mechanism to automatically
promote a standby cluster, because it's running in a different
region and is using a completely different etcd.
So they simply don't know about each other.
It would be nice to have, but again I cannot promise when I can
start working on it or whether it will happen.
I know that people from CYBERTEC did some experiments and have
some proof-of-concept solutions that seem to work, but for some
reason they're also not happy with the solution they
implemented.
Michael: Yeah, sounds tricky.
Alexander: Distributed systems are always tricky.
Michael: Yeah,
get that on a t-shirt.
Nikolay: Thank you for coming.
I, as usual, use the podcast and all the events I participate in and
organize and so on
for my personal education and daily work as well.
Thank you so much for the help,
again.
Alexander: Yes, thank you for inviting me.
Yeah, it's a nice job that you are doing.
I know that many people are listening to your podcasts and are very happy
about it.
They learn a lot of great stuff and also make a big list of
to-do items, like what to check and what to learn.
I cannot say about myself that I watch every single
episode, but sometimes I do.
Nikolay: Cool, thank you.
Michael: Thanks so much Alexander.
Cheers Nikolay.