I'm joined by Nolan Brubaker of VMware to talk about Velero, an open-source backup and migration tool for Kubernetes.
- Unedited live recording with demos on YouTube
- Velero website
- Velero on Twitter
- Nolan Brubaker on Twitter
What is DevOps and Docker Talk?
Interviews and Q&A from my weekly YouTube Live show. Topics cover Docker and container tools like Kubernetes, Swarm, Cloud Native development, Cloud tech, DevOps, GitOps, DevSecOps, and the full software lifecycle supply chain. Full YouTube shows and more info available at https://podcast.bretfisher.com
Bret: You're listening to DevOps and Docker Talk.
I'm your host, Bret Fisher.
These are edited audio only versions of my YouTube live show.
Every Thursday at bret.live.
This podcast is sponsored by my Patreon supporters.
I'd like to thank the now over 100 paid supporters that make this show such a pleasure to do.
You can get more info and follow me for free at patreon.com/bretfisher.
I'm pulling another episode out of the archives from 2020, when I was
taking a not-so-short break from launching new podcast episodes.
My guest this time is Nolan Brubaker from VMware, and we talk about the Velero open source project
for backing up, migrating, and restoring all of your Kubernetes resources and volumes.
Bret: Now, usually in my audio only podcast, I'll edit out the demos on the YouTube
live show simply because they don't make a lot of sense in an audio only format.
But I listened to a lot of this one again, and it made a lot of sense and I'm leaving it in.
So somewhere between the 20 and 30 minute mark, we get into a demo, but it's largely CLI based,
and it's largely a discussion around how the tool is used
and some of the settings you might use.
So I didn't feel like we were losing much of the knowledge transfer in just having audio only.
So I'm keeping that in this podcast. If you're interested in some of the features of the
product, what's in the future of it and all that, we go through that pretty well in the demo.
Check it out.
I think it's a valuable project,
one that we all need to deal with when we're running Kubernetes
in production and we have to worry about... Now, on with the show.
Bret (2): My guest today, which we've been talking about for
months, planning this with the team at VMware, is Nolan from VMware.
Welcome to the show.
Nolan Brubaker: Hi, thanks for having me.
Bret Fisher: And what is your Twitter handle?
I just realized, what does that...
Nolan Brubaker: What does that mean?
That's palendae, something that came up in, gosh, middle school.
It's made up. I think I got it from a Dragonlance novel, took a character name and changed it.
I don't really talk about work stuff there, but if you want to follow me, you can.
Be prepared not to see a whole lot of Docker or Kubernetes,
but yeah, you're certainly welcome to follow me.
Bret Fisher: Yeah.
So I always appreciate having fellow gamers on the show.
I just finished Doom Eternal and
stayed up way too late the last month, killing that game.
And I almost threw my controller a few times, 'cause...
Nolan Brubaker: Yeah, I haven't gotten into Doom Eternal.
What I've lately been playing is Dragon Quest 11 on the Switch.
But that's going to take me a while, the way
I've been playing it, since it's like over a hundred hours.
So yeah, it's going to take me a long time to get through this.
Bret Fisher: That's commitment.
That's more than a few charges on your, uh, on your name.
Nolan Brubaker: I'm prepared for that to be like a year long thing, if not more so.
Bret Fisher: Right, right.
Oh, by the way, for those asking or wondering, I didn't know.
I had to ask if that was Metroid related in the background,
and indeed that poster is from Super Metroid, right?
That's a, yup.
Nolan Brubaker: Yup.
That's a Super Metroid thing. That was a fan art poster that I got for Christmas one year.
Bret Fisher: Yeah.
So we don't normally talk about games on this show, but I just thought I had to mention it,
'cause we're going to be staring at it
the whole show.
My gamer handle is Sonic Bum, and it didn't make much sense on the internet.
So I ended up changing it to my real name because I realized
nobody would find me. Mostly it was because I was into Sonic the
Hedgehog in the nineties, and Sonic Boom was taken and Sonic was taken.
So I made up Sonic Bum, but that is old news.
You can still find some remnants of that stuff around if someone was trying to hunt me down.
So let's talk about Kubernetes and backups, because that's why
you're here, but also it's a topic that doesn't get a lot of talk.
But first, let's talk about how you got
started at VMware, and what's your background?
Nolan Brubaker: Yeah.
So I came to VMware via Heptio.
I started at Heptio working on Velero, which was Heptio Ark.
It got renamed as part of the acquisition.
The project started one day at Heptio when Joe Beda was trying to work with some clusters
and realized he had EBS volumes, and he's like, hey, wouldn't
it be great if something snapshotted these volumes automatically?
Before that I was at Rackspace working on a project called OpenStack
Ansible that deployed the OpenStack control plane with Ansible in non-Docker containers,
LXC containers.
So it was taking the control plane and, instead of deploying
a ton of hardware, condensing that into containers.
So yeah, I went from that to working at Heptio and trying
to make this whole Kubernetes thing work out and, uh,
Bret Fisher: that
Nolan Brubaker: bet.
Bret Fisher: Yeah, really cool.
So Velero is now the new product name.
In fact, let me just bring it up so people can see the site.
It's velero.io, and they can find out all the good stuff about this here.
And, you know, typically, if we're going to talk about
backups, backups are a boring conversation, right?
And a lot of us think we have it handled.
I mean, it's funny, when I work on
container projects and stuff like that, backups are almost
pushed to the point where people just want to assume that they're happening.
Like everyone just assumes that somebody did it and that they're automatic,
and they're always right,
and that we've got everything we need to restore,
and that we've tested restores, and that we know how to do DR,
and that we've validated our DR on a regular basis.
And almost never is any of that true.
And how do we fix this?
Nolan Brubaker: Yeah.
Well, I think in a cloud native world, a
lot of people assume that your vendors handle it, right?
So you're running on a public cloud.
It doesn't matter who it is.
And you assume, well, they've handled all that hard stuff.
You assume that
your volumes are durable, that they're resistant to
availability zones falling down and things like that.
And maybe they are, maybe they aren't. Maybe East goes away and you want to get back more quickly
than those engineers can. Or maybe there are
some applications that are not made to be cloud native.
They're not designed yet to be living in that world.
So it's all well and good if you have a stateless application, but
if you've got data, you really should be owning that backup strategy.
And rather than just assuming your cloud provider has taken
care of all of it, you need to make sure you're protected.
So Velero does two things. It backs up the Kubernetes metadata,
so the YAML or JSON manifests, and it also grabs your persistent volumes.
And that's probably the bigger thing, because you could get the
Kubernetes manifests through GitOps, if you're following that strategy.
So if you want to grab your Kubernetes application data, Velero provides a way to get that.
So there's a couple of different ways we can do that.
We've got support for doing snapshots through your persistent volume
provider, or using an application called restic to get file-system-level backups.
And that makes it platform neutral, right?
So you can move from one provider to another. But yeah, like you said,
we assume all this stuff's being done, but really someone in
your organization should probably be owning that for your application.
Bret Fisher: Right.
I think a lot of people who don't really know cloud and Kubernetes
well would probably think that backups are either built in, or it's a checkbox
type of thing, because at this point, with Kubernetes and
Docker and the container world, we've felt like we've got this method for easily deploying.
So now we're rapidly deploying, and we're even able to allow people to
create new apps and deploy them without much effort in terms of getting the
operations team to take on this big manual effort to say, oh, you got a new app?
Well, it's going to take us a month to get that into production.
But what about having that experience where maybe the Kubernetes YAML has something
in it that basically says, yep, this is the part you need to back up,
and that's all I need to know?
Are we to the point with this project where, as a
developer, I can specify what needs to be backed up, or somehow label it or
something, so that it can be an automatic effort where the ops team doesn't then
have to figure out how to back up that thing after it's already on my server?
Is it that kind of way?
Nolan Brubaker: We've talked about that, so that you
could package it in your application to say, yeah,
the application could tell Velero, this is what you need.
Velero's not at that level yet.
The way Velero works is that it has a command line and it runs as a Deployment in your server.
Or, I'm sorry, in your cluster.
Velero operates by creating its own Backup CRs and then
runs a controller, or operator, to make the backup happen.
And that Backup CR says, okay, what do you want to back up?
Is it everything in a namespace?
Is it everything matching this label?
Is it everything except for these things?
So you can do excludes: everything not matching this
label, everything but this type of resource.
So maybe you don't want to include cluster roles, as an example.
We're engaging with groups upstream.
There's a relatively new,
I'm trying to remember exact timelines, but there's a new working group upstream, the Data
Protection Working Group, which is a collaboration between SIG Apps and SIG Storage.
That's trying to figure out ways to standardize how
applications can communicate what would need to be backed up.
So, like, what's in this application? Is it a Deployment?
Is it a StatefulSet?
What are the components of this application, and
also what are the building blocks for protecting it?
Does it have volumes that need to be snapshotted?
Does it have other external resources that we need to grab?
So we're not only working on it within a Velero context, we're trying
to work with the upstream community so that there are standards here.
So it's not just us.
But yeah, in terms of having something that
the application developers define, we're not there yet, but we want to get there.
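As an illustration of the include and exclude options Nolan describes, here is a minimal Python sketch that assembles a Velero Backup custom resource as a plain dictionary. The helper name `make_backup`, the backup name, the namespace, and the label are all hypothetical. The field names follow the `velero.io/v1` Backup spec as best I understand it, so check them against the documentation for your Velero version.

```python
import json


def make_backup(name, included_namespaces=None, excluded_resources=None,
                match_labels=None, velero_namespace="velero"):
    """Assemble a Velero Backup custom resource as a plain dict.

    Illustrative helper, not part of Velero. Filters that are not given
    are left out of the spec so Velero's own defaults apply.
    """
    spec = {}
    if included_namespaces:
        spec["includedNamespaces"] = list(included_namespaces)
    if excluded_resources:
        spec["excludedResources"] = list(excluded_resources)
    if match_labels:
        spec["labelSelector"] = {"matchLabels": dict(match_labels)}
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        "metadata": {"name": name, "namespace": velero_namespace},
        "spec": spec,
    }


# Hypothetical names: everything labelled app=my-app in the my-app
# namespace, but skipping cluster roles.
backup = make_backup(
    "my-app-backup",
    included_namespaces=["my-app"],
    excluded_resources=["clusterroles"],
    match_labels={"app": "my-app"},
)
print(json.dumps(backup, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed document could in principle be piped to `kubectl apply -f -` on a cluster where Velero is installed.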
Bret Fisher: Right.
'Cause, you know, not having any experience with this product, I'm coming into it as a new user.
I'm thinking, okay, if I'm trying to move my tooling to a more automated fashion, I mean,
you know, this is my year of GitOps-ing all the things, at least in my personal projects.
And so I imagine, okay, I've got a new app, I'm going to host it in my cluster.
I'm going to be putting those manifests or those Helm charts or something in my repo.
And how can I just add that,
like you said, a custom resource that specifies, hey, these things in
my deployment also need to be backed up, or maybe an annotation or a
label or something that you're talking about in the existing resources?
If you're a dedicated backup person, you want that command line, right?
That makes total sense.
You want to be able to control the cluster as a whole and
to be able to see all the backups and all those resources.
And so that's awesome.
I'm always trying to shift that responsibility to the team
that's owning the app, and of course with varying levels of success. But
that did seem like a thing that would be super helpful, to say, hey,
if you all just want to take care of it and don't even involve us, we'll just
put these things in your YAML, your Helm chart, or whatever you're creating.
And then just know that we're going to be backing that up and, you know, it'll work for whatever.
Nolan Brubaker: So the way I would
probably pitch that to an application team is, we have what are called
schedules. A schedule basically takes the same set of include and exclude fields that a
backup does, and it runs it automatically on some schedule.
So say you've got a Helm chart for your application.
You could include the schedule as part of that.
So it's not quite to that level. We've got open issues for, I
think we've called them backup templates, that applications could define.
But if you include a Velero schedule in your YAML, you could apply that and
it could say, include whatever label the application developers define.
So they could say, okay, this is the label we use,
this is the schedule we use, these are the hooks we define to back up the database.
We don't have anything in Velero to directly dump databases.
We instead define hooks that say, run this command on my container
and get the stuff dumped out of the database into a persistent volume.
So that's probably how I would approach it in the current state: have the
application developers use a schedule, or at least define a Backup CR, in their Helm chart.
And from there that can be invoked on a schedule, and the cluster
administrator would be responsible for getting Velero installed.
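To make the schedule-plus-hooks idea concrete, here is a small Python sketch of what an application team might ship alongside its chart: a Velero Schedule that selects resources by label, and the pod annotations for a pre-backup hook. The resource name, label, container name, cron expression, and dump command are hypothetical placeholders (a real MySQL dump would also need credentials). The `Schedule` fields and the `pre.hook.backup.velero.io/*` annotation keys are as I recall them from the Velero documentation, so treat them as assumptions to verify.

```python
import json

# Hypothetical label the application developers agree to put on their resources.
APP_LABEL = {"app": "my-app"}

# A Schedule wraps a backup spec ("template") in a cron expression.
schedule = {
    "apiVersion": "velero.io/v1",
    "kind": "Schedule",
    "metadata": {"name": "my-app-daily", "namespace": "velero"},
    "spec": {
        "schedule": "0 2 * * *",  # cron syntax: every day at 02:00
        "template": {
            "labelSelector": {"matchLabels": APP_LABEL},
            "ttl": "720h0m0s",
        },
    },
}

# Placeholder command that dumps the database onto a mounted volume
# before the pod's volumes are backed up.
dump_command = ["/bin/sh", "-c", "mysqldump --all-databases > /backup/dump.sql"]

# Annotations to place on the pod template; the command is a JSON array string.
hook_annotations = {
    "pre.hook.backup.velero.io/container": "mysql",
    "pre.hook.backup.velero.io/command": json.dumps(dump_command),
}

print(json.dumps(schedule, indent=2))
print(json.dumps(hook_annotations, indent=2))
```

The split mirrors the division of labor described above: the application team owns the label, the schedule, and the hook, while the cluster administrator only has to make sure Velero itself is installed.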
Bret Fisher: Right.
Can you send things to cloud storage?
Like, I want to put it in this S3 bucket or in this, whatever, digital...
Would you call those drivers or plugins, or...
Nolan Brubaker: Yep.
Those are plugins.
Velero has a plugin model. There's four main types of plugins.
Two of them are kind of grouped together.
They're item action plugins, which happen on backup and restore, so they
can modify Kubernetes manifests as they come in and out of the cluster.
So on backup, you can modify Kubernetes manifests.
We use that to walk from a pod to a PVC to a PV.
So when you're backing up a pod, we assume if it's got any PVCs, you want the PVs.
And then we've got restore item actions that manipulate stuff on restore,
kind of doing the reverse. As an example, you go from a PV to rebuild back to the pod.
Then we have object storage plugins, which hook up to S3, GCP object storage, Azure Blob Storage.
And there's third-party plugins for that too.
Those three cover the main object storage
cases. And then we have volume snapshotter plugins.
So those do EBS snapshotting, GCP volume
snapshotting. We have a vSphere snapshotter plugin now.
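The pod-to-PVC-to-PV walk mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea using standard Kubernetes manifest fields (`spec.volumes[].persistentVolumeClaim.claimName` on a pod and `spec.volumeName` on a PVC). It is not Velero's code, and its real plugins are not written this way. The function names and sample objects are hypothetical.

```python
def pvc_names_for_pod(pod):
    """Return the names of the PersistentVolumeClaims a pod manifest mounts."""
    names = []
    for volume in pod.get("spec", {}).get("volumes", []):
        claim = volume.get("persistentVolumeClaim")
        if claim:
            names.append(claim["claimName"])
    return names


def pv_name_for_pvc(pvc):
    """Return the bound PersistentVolume name recorded on a PVC, if any."""
    return pvc.get("spec", {}).get("volumeName")


# Hypothetical sample objects, trimmed to the fields the walk needs.
pod = {
    "kind": "Pod",
    "metadata": {"name": "mysql-0"},
    "spec": {
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "mysql-pv-claim"}},
            {"name": "scratch", "emptyDir": {}},
        ]
    },
}
pvcs = {
    "mysql-pv-claim": {
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "mysql-pv-claim"},
        "spec": {"volumeName": "pvc-1234"},
    }
}

# Walk pod -> PVC -> PV, which is the set of extra items a backup would pull in.
for claim_name in pvc_names_for_pod(pod):
    print(claim_name, "->", pv_name_for_pvc(pvcs[claim_name]))
```

A restore-side action would run the same relationship in the other direction, starting from the PV and rebuilding toward the pod.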
Bret Fisher: So that's at a disk level, a volume mount level.
Yeah, no, it's not so much an application level.
Nolan Brubaker: Yeah, no, those are not at an application level currently.
Bret Fisher: Yeah.
Is it using CSI in the background?
Do these plugins
for storage have to be separate from a CSI plugin?
Is this somehow related, or...
Nolan Brubaker: Right now they're separate.
Velero actually predates CSI.
Our 1.4 release, which we're hoping to get out in a couple of weeks,
actually, that's our main feature: getting a CSI integration at a beta level.
So that's something myself and Ashish are working on. Ashish is another
engineer on our team, and we're working with the CSI community to make sure
we're working with the Kubernetes CSI integration.
So yeah, our hope is eventually we deprecate the Velero plugins for snapshotting.
But that's going to be a long-term goal.
For the foreseeable future, we're going to
Bret Fisher: have both, right?
That's the same thing that the Kubernetes community is doing, right?
Like, obviously the built-in plugins are going to be there for a while,
and CSI's not necessarily feature complete in comparison.
So I can see how that's a multi-year process.
You don't want to leave anyone behind
just because they're still using the old backup that works.
Nolan Brubaker: Right.
And not every CSI driver has the snapshotting capability yet, right?
And even though CSI itself is GA, the snapshotting API is still beta.
So Velero itself is probably going to trail behind on the GA status,
just so we can see as more drivers become available. We're testing it
out with the drivers that are there, but it's requiring some tweaks to
Velero, and we want to make sure we don't break existing Velero users.
And we also want to make sure we're being good
community participants and helping inform the design.
Bret Fisher: Yeah, that makes sense.
Certainly, if people aren't used to defaulting to CSI for their regular apps,
then not using that... By the way, CSI, sorry people, is the Container Storage Interface.
It's the standard that Kubernetes is now using for new
plugins to use different storage other than just local disk.
So if you're using cloud storage, you can choose the
built-in ones, but now the new way to do it is the CSI plugins.
And the idea for all of this, right, is that
every storage vendor can make their own CSI plugin.
This is even on the roadmap now for Docker Swarm, for it to
adopt CSI as sort of a standard mechanism for storage.
So the dream someday is that the entire container industry can
rely on a single volume plugin for each type of volume, whether that's cloud storage or
your VMware storage, or your NetApp iSCSI storage, or whatever it is, right?
You can just rely on that one driver,
and it provides all the mechanisms for all these different types of tools.
And you don't have to learn all that, so each one of your products doesn't
have a different way of connecting to storage, which I kinda thought we
solved like 15 years ago with iSCSI. I felt like that was the dream.
And then the cloud happened and then containers happened.
And, you know, now we're back to trying to figure it all out again. So, yup.
Nolan Brubaker: Yup.
And on the Velero team,
we saw the CSI stuff happening and we're like,
well, it doesn't make any sense to ask storage vendors to make Velero-specific
plugins. Especially if there's competing backup solutions, it doesn't
make any sense to ask storage vendors to make a bunch of different ones.
Bret Fisher: There always is, right?
There's a ton of ideas.
Nolan Brubaker: right?
So if there's this community standard, let's get involved there,
let's make sure it works for everybody and hook into that.
Yeah, for sure.
It makes absolute sense to make Velero compatible with that.
But yeah, when Velero started, the discussion for CSI hadn't really started.
Bret Fisher: And even Docker struggled with this, because they
provided Docker plugins for storage, but this was pre-CSI.
And so now Docker is wanting to also consider CSI as a mechanism,
'cause storage vendors don't want to do this.
They don't want to make a Kubernetes plugin and a Docker plugin and a plugin for every backup vendor.
And of course, trying to make them all one is way harder than we all probably think it is.
So yeah, it'll get there someday.
I have some friends that are very skeptical that this is ever going to come to fruition,
the dream of storage.
Nolan Brubaker: It's not easy.
And I'm glad that, from my perspective, I'm just calling storage.
I'm not an implementer.
The stateful stuff's hard.
Stateful stuff is definitely hard.
Bret Fisher: And snapshotting, like, snapshotting is crazy voodoo
that sometimes I don't even really understand how it's happening,
especially when the apps are aware of the snapshot and they
actually write to disk before snapshotting. That stuff gets super nerdy.
We could talk about that all day, but I do want to get to demos, 'cause I'm sure you
prepared demos, and we all love a good demo here on YouTube where we can watch.
Let me know when you've got your screen share and I will, yeah,
Nolan Brubaker: I've got it shared.
We'll switch over to Firefox
so you don't see yourself.
So I am in my browser.
I've just got a very simple WordPress and MySQL application running.
So I just want to show that this is up and running.
I've got some data in it.
Hopefully this is big enough for people to read.
Got a post. Sorry, bumped my microphone there.
I can go into the post.
I've got a comment.
And just to show that I'm not restoring any other weird data, a live comment.
I'm not doing any... I'm going to take an actual backup.
So got some small data, and then I'm going to jump over here to my terminal.
So I just want to show I've
got Velero running in
its namespace,
and right now I've just got one replica.
And because I'm not great at talking and typing, I'm just going to run this
demo script, which is going to run commands and proceed when I hit buttons.
I've got this wordpress namespace. It's just got one pod each for WordPress and MySQL, and
they've got services to expose them, just one to go to the outside world for WordPress.
And I've also got PVCs.
So we've got one for MySQL.
We've got one for WordPress.
That's just for static assets, pictures, things like that.
And just to prove
there are indeed persistent volumes:
this is the MySQL claim that matches to this one, to
Bret Fisher: that.
Nolan Brubaker: Now I'm going to show that this is the latest version of Velero.
Got 1.3.2 on the server and 1.3.2 running on my laptop.
Bret Fisher: So, sorry, let me back up for a second.
Before this, you deployed the custom resource
definitions, and then you deployed a controller, right?
So there's a controller running in your cluster, and this
works on any standard Kubernetes conformant cluster, right?
Nolan Brubaker: Yup.
So OpenShift, TKG, Rancher.
Any Kubernetes conformant cluster.
And something I should also mention: Velero will also work on managed Kubernetes clusters.
It backs up things through the Kubernetes API.
It doesn't grab etcd directly.
So if you're on GKE or EKS or anything like that, you don't get access to etcd directly.
So that's why we work through the API server.
So it'll work even on, excuse me, managed clusters.
Yeah, there's some issues with that, because things
could be changing in the API server while the backup's running.
And we've talked in the data protection working group
about maybe introducing some sort of freeze API.
But that's probably down the road and requires upstream
Kubernetes changes. But yeah, for now that's how Velero works.
And yeah, so I'm going to
do this Velero command, very much like what we talked about.
I'm naming it wp-demo, including this namespace,
wordpress, and I'm just going to wait for it to complete.
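For readers following along without the video, the command being described probably looks like the one this short Python sketch prints. The exact flags used on screen are not audible in the recording, so `--include-namespaces` and `--wait` are inferred from the description (name it, include a namespace, wait for completion), and the helper functions are purely illustrative.

```python
import shlex


def backup_create_args(name, namespaces, wait=True):
    """Build the argument list for a `velero backup create` invocation."""
    args = ["velero", "backup", "create", name,
            "--include-namespaces", ",".join(namespaces)]
    if wait:
        # Block until the backup finishes instead of returning immediately.
        args.append("--wait")
    return args


def backup_describe_args(name):
    """Build the argument list for inspecting a finished backup."""
    return ["velero", "backup", "describe", name, "--details"]


create = backup_create_args("wp-demo", ["wordpress"])
print(shlex.join(create))
# velero backup create wp-demo --include-namespaces wordpress --wait
print(shlex.join(backup_describe_args("wp-demo")))
# velero backup describe wp-demo --details
```

On a machine with the `velero` client configured, the same list could be handed to `subprocess.run(create, check=True)` rather than printed.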
Intro Music: All right.
Nolan Brubaker: And the way that works is we just fire off a custom resource to the Kubernetes
API server and let our controller, or operator, run against it. And I'll clear the screen
so it's not all at the bottom.
And we're going to do a describe against it to see what all is there.
So I'm going to scroll up here.
So the name is what we named it.
Velero puts all its namespaces, or I'm sorry, backups in the velero namespace.
It does not store them in the same namespace as the application, because
what if you accidentally delete that namespace and you want to get it back?
We've had requests to change it and we're open to that.
We just need to figure out that problem.
Like if you accidentally delete that namespace, you need a way to get it back.
So we're not completely married to that design.
We just need to figure out a solution to that problem.
Bret Fisher: Parallel namespace.
Nolan Brubaker: Right.
We duplicate it or something like that, but for now it goes into the default.
The default namespace is velero.
You can deploy it to whatever namespace you want.
We label every backup
with the storage location, which is the object storage.
There's a representation of the object storage bucket called the backup storage location.
And this is just so we can easily fetch them from the API server.
In this case it went to the default. There were no annotations on the backup, and it was completed.
We've got some information on what stuff was included or excluded.
So the namespace was wordpress, and we did not exclude anything.
This is useful if you want to back up the whole cluster but, say, exclude
kube-system, because usually there's a lot of stuff that's managed by the
cluster that you might want to exclude because it's specific to the cluster,
or that running cluster, I should say. See, here we didn't apply a label
selector to the backup, so we didn't grab stuff based on a label.
Again, we stored it in the default location.
We automatically snapshotted any persistent volumes.
The time to live, or duration, of a backup is a month, or 720 hours.
There were no hooks defined on the backup itself.
You can define hooks on the backup, or you can actually define a hook on a
pod or a deployment, or, you know, anywhere you can put a pod template.
So you can define a hook on your application to say, dump my database,
whether that be Mongo or MySQL or whatever your application
uses. The backup format:
this is the format that we store in object storage.
We actually have some changes coming to this in 1.4 that are backwards compatible.
I can talk about this more after the demo, if you'd
like, but we're going to take some longer-term visions for the backup format.
We've got some information on when it started and completed, and then, just so
you don't have to download the whole backup to see what's in it,
we've got a resource list.
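As a quick check on the "a month, or 720 hours" time-to-live figure from the describe output, this Python sketch does the arithmetic. It assumes the expiry is counted from when the backup starts, and the start timestamp is a made-up example.

```python
from datetime import datetime, timedelta, timezone

# The default time to live mentioned in the demo.
ttl = timedelta(hours=720)
print(ttl.days)  # 30


def expiration(started_at, ttl=ttl):
    """Return when a backup started at `started_at` would expire."""
    return started_at + ttl


# Hypothetical start time, not taken from the demo.
started = datetime(2020, 5, 1, 12, 0, tzinfo=timezone.utc)
print(expiration(started).isoformat())  # 2020-05-31T12:00:00+00:00
```

So "a month" here means exactly 30 days, not a calendar month.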
Bret Fisher: Cool.
That resource list is the money. That's where you're confirming all the objects that you expected.
Nolan Brubaker: Yup.
And then finally, this is what we snapshotted.
So these are what were created.
And these IDs will vary based on your provider.
Different providers use different IDs, and I'm currently working on updating
the values that will go here for the CSI plugins that we've written.
So I've got a PR in progress to add this information for CSI snapshots.
So, now we're going to simulate a disaster or an accident,
and we're just going to nuke the WordPress namespace.
Or, or before I do that, do you have any other questions?
Bret Fisher: No, I don't think so.
Yeah, we, so we're going to nuke it and then we're going to restore it.
Is that what you're doing?
Nolan Brubaker: Yep.
I'm going to nuke it, and I'll also run to the website and show that no, really, it's gone.
So it should be busted.
Should start getting 404s soon.
Intro Music: It's
Nolan Brubaker: deleted
So just to show I'm not pulling anything, the site's really
gone, and I'm refreshing and not getting anything back.
And to confirm even further: nothing found. There's nothing on the Kubernetes side.
Bret Fisher: All right.
It just leaves nothing.
Nolan Brubaker: Nope, no persistent volumes on the
cluster anymore either. The only stuff in this cluster
was the WordPress stuff,
Contour,
which is my ingress controller, and which I think you talked to Steve Sloka about, and Velero.
And I probably should have cleared everything here, but
we're going to create a restore named wp-restore.
And we're going to use from-backup.
We're going to use wp-demo as the source for this restore.
So again, what we're doing here is submitting a
request, a custom resource, to the Kubernetes API server.
The Kubernetes API server will get it,
and the Velero restore controller, or operator, will grab that and start the restore, and we're going to
see that it will create all the namespaces and everything, and then things will start to run.
So let's take a look at the restore details. We come up here.
So of course, very similarly, the restore's name and namespace, same as the backup.
We've got the same name and the same namespace, no labels or annotations.
And the namespaces here are a little different than the backup.
We included all the namespaces that were in the backup, because
we don't know ahead of time what namespaces are in there.
So we didn't do --include-namespaces, although we could on a restore.
So you could selectively grab an individual namespace
out of a restore, or out of a backup, excuse me.
So you could say, back up my whole cluster,
and then only grab one namespace out of it.
So say I accidentally deleted this one
namespace. You could go grab one namespace out of it, if you wanted
to, or you could do a label selector out of the backup.
And similarly, we included all resources except for nodes, because it doesn't make a
whole lot of sense to recreate nodes, because that's all managed by the cluster itself.
It doesn't make a whole lot of sense to create nodes without
actual hardware or virtual hardware standing behind them.
Events are all short-lived, so we don't recreate those.
They don't make a whole lot of sense to restore.
And also we don't restore our own backups and restores, because we found restoring these actually
causes some recursion.
So if we restore backups, we kick off new backups.
If we restore restores, we kick off new restores.
So Velero doesn't do that.
And then restic repositories are for if we're doing file-level
backups, and Velero manages those outside of this backup and restore cycle.
So we don't restore those.
We can also, if you'd like, do namespace remapping on restore, so you can
say, take my wordpress namespace and rename it to something, foobar, you know?
Bret Fisher: Why would I do that?
Nolan Brubaker: Yeah, you can do that
if you wanted to clone a namespace. Say you wanted to take a production
namespace, or you're playing around, and you wanted to have a
pristine copy of this namespace, but you wanted to change something,
and you weren't sure if it was going to work. You could
copy that namespace with PVs and everything and mess around.
So we've had users request this. I honestly don't have statistics on
how many people use this, but it's been a feature that people have definitely used.
Yeah, it's something you can do and take your PV data along with you.
If you do a namespace remapping, you can copy the PVs.
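A namespace-cloning restore like the one described could be expressed as a Restore custom resource along these lines. This Python sketch just builds the manifest as a dictionary. The backup name `wp-demo` and the `wordpress` namespace come from the demo, while the restore name, the target namespace `wordpress-clone`, and the helper `make_restore` are hypothetical. The `namespaceMapping` and `includedNamespaces` fields are how I understand the `velero.io/v1` Restore spec, so verify them for your version.

```python
import json


def make_restore(name, backup_name, included_namespaces=None,
                 namespace_mapping=None, velero_namespace="velero"):
    """Assemble a Velero Restore custom resource as a plain dict.

    Illustrative helper, not part of Velero.
    """
    spec = {"backupName": backup_name}
    if included_namespaces:
        # Pull only these namespaces out of a possibly larger backup.
        spec["includedNamespaces"] = list(included_namespaces)
    if namespace_mapping:
        # Source namespace -> namespace to restore into.
        spec["namespaceMapping"] = dict(namespace_mapping)
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        "metadata": {"name": name, "namespace": velero_namespace},
        "spec": spec,
    }


# Clone the backed-up wordpress namespace into a scratch copy.
clone = make_restore(
    "wp-clone",
    "wp-demo",
    included_namespaces=["wordpress"],
    namespace_mapping={"wordpress": "wordpress-clone"},
)
print(json.dumps(clone, indent=2))
```

The original namespace is left alone, which is what makes this useful for trying a risky change against a copy of real data.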
Bret Fisher: Yeah.
I mean, that's something that is,
I think, especially in the storage realm, actually pretty normal.
Like I mentioned earlier, NetApp storage: NetApp has an
ability for you to take a snapshot and then essentially put it somewhere
else so that someone can work on a read-only copy or something like that.
So, yeah, totally.
That totally makes sense to me, especially if you think about the work that
would be involved for something like a DevOps team in order to spin up a
new namespace that's identical to the current one so that someone can see it.
That's a lot of copying YAML and stuff versus just
saying, hey, let's just restore to a new namespace.
Nolan Brubaker: Yup.
And there's similar work upstream.
I'm not sure where the KEP is, but there's a PV clone functionality that's coming.
That's not a full namespace copy, but there are similar ideas in other realms.
We're not making any commitments on changing, but we're definitely going to
look at maybe using that functionality further on down the road versus our own.
But yeah, maybe in a CI/CD
setup, you might want to use this to validate something.
And then you can also use a flag:
maybe you don't want to automatically restore PVs.
You could set that to false and not restore PVs,
if you just want the Kubernetes metadata for some reason.
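As a rough illustration of the restore options described here, the following Python sketch assembles a `velero restore create` command line with namespace remapping and with PV restore switched off. The helper name `build_restore_command` and the backup name `wordpress-backup` are hypothetical; the flags `--from-backup`, `--namespace-mappings`, and `--restore-volumes` are assumed to match the installed Velero CLI and should be checked against `velero restore create --help`.

```python
import shlex
from typing import Dict, List, Optional


def build_restore_command(
    backup_name: str,
    namespace_mappings: Optional[Dict[str, str]] = None,
    restore_volumes: bool = True,
) -> List[str]:
    """Assemble argv for `velero restore create`; nothing is executed here."""
    cmd = ["velero", "restore", "create", "--from-backup", backup_name]
    if namespace_mappings:
        # Velero expects comma-separated source:target pairs.
        pairs = ",".join(f"{src}:{dst}" for src, dst in namespace_mappings.items())
        cmd += ["--namespace-mappings", pairs]
    if not restore_volumes:
        # Restore only the Kubernetes metadata, not the persistent volumes.
        cmd.append("--restore-volumes=false")
    return cmd


if __name__ == "__main__":
    # Hypothetical names: clone the wordpress namespace into foobar.
    print(shlex.join(build_restore_command("wordpress-backup", {"wordpress": "foobar"})))
```

The list can be handed to `subprocess.run` once the flags have been confirmed for the Velero version in use.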
So after we've looked at that, take a look at the persistent volumes.
So they're all back.
You can look at the namespace and all of that stuff should be back.
Got our service back, got our pods back.
We've got our ReplicaSet, our deployments.
I go back to my website, got my post.
It's got both comments, including the one I made to prove it's still working.
So everything is still running.
And, uh, that was the end of the demo.
So yeah, the CRDs. Let's limit that
just to show the Velero CRDs.
This is what we've got in there that was installed.
And there's a velero install command.
We've got two ways to install it: with this velero
install command, which is shipped with the Velero client,
so it's all built in, or with the Velero Helm chart.
Bret Fisher: Okay.
Is there, sorry, go ahead.
Nolan Brubaker: I was going to say, we support both ways.
And the velero install command, we're currently revisiting it.
It was meant to be like a quick-start kind of tool.
And then it's kind of grown into a little bit of a beast. To
show you what I mean, if I do --help, it's a very long list
of options, right?
So we're looking at ways to fix this and make it much more friendly.
So the Helm chart's one way of doing it, and we're also looking at replacing the velero install
command with a velero config command, because velero install is like a one-and-done thing.
Bret Fisher: Right, not infrastructure-as-code friendly.
Nolan Brubaker: Yeah. It has a --dry-run,
so you can dump out YAML and get your GitOps kind of thing from there.
But yeah, it's not,
it's not great.
So we're looking at revisiting that and making it more useful for infrastructure
as code and GitOps, and maybe doing layering with, I'm already blanking on it, Kustomize, yep.
Doing Kustomize and Helm.
But right now, our options are velero install and, I'm blanking, Helm.
Those are the two ways that we support for installing it at the moment.
And of course doing the --dry-run to dump out the YAML that it produces.
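To make the dry-run idea concrete, here is a small Python sketch that assembles a `velero install --dry-run -o yaml` command so the rendered manifests can be committed to a GitOps repository instead of being applied directly. The helper `build_install_dry_run` is hypothetical, and the provider, bucket, plugin image, and secret file values are placeholders; the flag names are assumed from the Velero CLI and worth confirming with `velero install --help`.

```python
import shlex
from typing import List


def build_install_dry_run(
    provider: str, bucket: str, plugin_image: str, secret_file: str
) -> List[str]:
    """Assemble argv that renders install manifests without touching the cluster."""
    return [
        "velero", "install",
        "--provider", provider,
        "--bucket", bucket,
        "--plugins", plugin_image,
        "--secret-file", secret_file,
        "--dry-run",      # do not create anything in the cluster
        "-o", "yaml",     # print the generated resources as YAML
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    cmd = build_install_dry_run(
        "aws", "example-backups", "example.com/velero-plugin:tag", "./credentials"
    )
    print(shlex.join(cmd))
```

Redirecting the command's output to a file gives a manifest that can be reviewed and layered with other tooling before it is applied.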
Bret Fisher: Can you do that same dry run on a backup job, so you can produce
YAML that you could apply rather than doing the velero backup command, you know?
Nolan Brubaker: Yeah.
So, if I say --include-namespaces wordpress, and I do a --dry-run...
Bret Fisher: Yep.
Nolan Brubaker: What did I get wrong?
Oh, actually I think it's just -o yaml, but...
Bret Fisher: I know the flag, --include-namespaces. Yeah, it...
Nolan Brubaker: It was backup create.
There you go.
And yeah, so it's just -o.
Yeah, the install command is kind of janky.
I say that as the person who wrote it. Yeah, you can do the -o yaml. There's also the
velero schedule command, and that takes all the same arguments
as a backup, including a cron specification.
So if I say namespaces WordPress, and I'm going to forget where the cron spec is...
Bret Fisher: And then you have to have the schedule in there.
Nolan Brubaker: Couldn't remember what the name was.
And now let's do eight.
There we go.
So a schedule would look like this.
So it's got a backup template right there.
And then when you do a restore, you can do a --from-schedule.
And if you do --from-schedule on a restore, it will grab
the latest backup created from that schedule and restore from that.
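The schedule shown in the demo is a custom resource wrapping a backup template. A minimal Python sketch of what such a manifest might look like follows; the schedule name `wordpress-daily`, the cron expression, and the helper `schedule_manifest` are illustrative assumptions, while the `velero.io/v1` `Schedule` kind with `spec.schedule` and `spec.template` is assumed from the Velero API types.

```python
import json
from typing import Any, Dict, List


def schedule_manifest(name: str, cron: str, namespaces: List[str]) -> Dict[str, Any]:
    """Build a Velero Schedule resource as a plain dict."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            # Standard five-field cron expression.
            "schedule": cron,
            # The template takes the same fields as a one-off backup spec.
            "template": {"includedNamespaces": namespaces},
        },
    }


if __name__ == "__main__":
    # Hypothetical schedule: back up the wordpress namespace daily at 08:00.
    manifest = schedule_manifest("wordpress-daily", "0 8 * * *", ["wordpress"])
    # kubectl accepts JSON as well as YAML, so this output can be applied as-is.
    print(json.dumps(manifest, indent=2))
```

A restore created with `--from-schedule wordpress-daily` would then pick the most recent backup produced by this schedule, as described above.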
Bret Fisher: Okay.
And this schedule is yeah.
It's one of your API resources.
Nolan Brubaker: Yep.
It's one of the, one of the custom resource types.
Bret Fisher: Yeah.
So it acts like a cron job, but it's not actually using
Jobs or CronJobs, the default resources.
It's a custom resource.
Nolan Brubaker: It's a custom resource.
We have our own custom operator for it.
We've looked into making it use CronJob,
but we haven't prioritized that in our backlog yet.
Bret Fisher: Well, it's probably nice to keep regular
application stuff separate from your actual backup stuff.
So, you know, it has the same features.
At least for me, I'd be fine with it being
its own thing. But we had a question actually from the chat: is there some form of
scheduled job that can be deployed to a K8s cluster that would simulate the backup
and restore periodically to ensure that the backup and restore process is monitored?
Nolan Brubaker: So we include Prometheus metrics, but in
terms of restores, we don't have a schedule for restores.
So right now that needs to be done separately.
So you'd need to include your own.
I can stop sharing at this point if you'd like.
I don't really have anything more unless we need to get to something specific.
Bret Fisher: Yeah.
But I mean, that's a good question around, you know, backup
monitoring, and then DR, which are almost really two separate things.
And disaster recovery validation is always a challenge for every team.
So let's talk about monitoring for a minute.
So what's going on there?
Nolan Brubaker: There are two things we advise. We've got Prometheus metrics exposed.
We don't ship a whole lot of stuff to set up Prometheus, mostly because
that's kind of outside the scope.
But we do ship out Prometheus metrics,
like the last time a backup job failed, the last time a backup job finished.
You can also query for the latest backup job, like watch the
endpoints and grab that information. In terms of validating restores,
that is something we don't provide.
Like I said, we don't provide a restore schedule
that's equivalent to a backup schedule.
And we also don't want to just overwrite,
we don't want to just apply to your clusters.
So you're probably going to want to come up with some strategy that applies it to some test cluster.
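One way to act on the exposed metrics is to flag schedules whose last successful backup is too old. The Python sketch below parses Prometheus text-format output and does exactly that. It assumes a gauge named `velero_backup_last_successful_timestamp` with a `schedule` label, which should be verified against the metrics endpoint of the deployed Velero version; the helper names and thresholds are hypothetical.

```python
import re
import time
from typing import Dict, List, Optional

# Assumed metric name; confirm it against your Velero metrics endpoint.
METRIC = "velero_backup_last_successful_timestamp"
_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)
_SCHEDULE = re.compile(r'schedule="([^"]*)"')


def last_success_by_schedule(metrics_text: str) -> Dict[str, float]:
    """Map each schedule label to its last-success Unix timestamp."""
    result: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comments
        match = _SAMPLE.match(line)
        if not match or match.group("name") != METRIC:
            continue
        label = _SCHEDULE.search(match.group("labels") or "")
        result[label.group(1) if label else ""] = float(match.group("value"))
    return result


def stale_schedules(
    metrics_text: str, max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Return schedules whose last successful backup is older than the limit."""
    current = time.time() if now is None else now
    ages = last_success_by_schedule(metrics_text)
    return sorted(name for name, ts in ages.items() if current - ts > max_age_seconds)
```

In practice the same check is usually written as a Prometheus alerting rule; the script form is only meant to show the logic. It covers backup freshness only and says nothing about whether a restore would succeed.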
Bret Fisher: As you were talking, I was thinking that it almost needs
to be like a custom add-in that basically pushes your backups to a separate place
that has no way to affect production, and then does a real one and then validates. And then you
have to have application validation to make sure that the app is actually restored correctly.
Nolan Brubaker: That's a separate product, right?
The closest I can think of to get in a generic way is to do comparisons on the YAML.
Bret Fisher: Yeah.
Nolan Brubaker: And even then, there's K8s metadata,
like UIDs and creation timestamps and things like that,
that's just not valid in the comparison.
So you have to rip that out, like we
rip that out on restore, because it's not useful.
So that is tough.
There are some internal projects we have that are not ready to ship
that have tried that, but they're kind of half-baked POCs.
And we're also talking about, for our 1.5 release timeframe, getting some
more end-to-end testing on our public CI to start playing with this.
And maybe that gets elevated into part of the project, but mostly
it will be to validate some major bug regression testing.
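The YAML comparison idea mentioned above can be sketched in a few lines of Python: strip the fields that always differ between an original and a restored object, then compare what is left. The list of stripped fields here is an assumption chosen for illustration, not the list Velero itself uses on restore, and the function names are hypothetical.

```python
import copy
from typing import Any, Dict

# Fields assumed to differ legitimately after a restore; adjust as needed.
VOLATILE_METADATA = (
    "uid",
    "creationTimestamp",
    "resourceVersion",
    "generation",
    "managedFields",
    "selfLink",
)


def normalize(resource: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a resource without cluster-assigned metadata or status."""
    cleaned = copy.deepcopy(resource)
    metadata = cleaned.get("metadata", {})
    for key in VOLATILE_METADATA:
        metadata.pop(key, None)
    # Status is reported by the cluster, not declared by the user.
    cleaned.pop("status", None)
    return cleaned


def same_after_restore(original: Dict[str, Any], restored: Dict[str, Any]) -> bool:
    """True when two resources match once volatile fields are removed."""
    return normalize(original) == normalize(restored)
```

A match here only shows that the declared state came back; it does not show that the application works, which is the harder problem discussed in this part of the conversation.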
Bret Fisher: Yeah.
The last thing I can think of is, it sounds like it's in scope
that if I lose my entire cluster, this can provide
me a mechanism to bring the whole thing back to life.
Nolan Brubaker: Yep.
So kind of the big-brain vision would be to use
this and Cluster API to treat your clusters as cattle.
So if you think about this, let's say you
don't want to upgrade your clusters in place, right?
Just spin up a new version of Kubernetes with Cluster
API, use this to move it, kill the old clusters.
And maybe you use some higher-level load balancer, Contour, Gimbal, or some other thing
to move your traffic over once it's all there, then you kill the old one.
So yeah, absolutely.
Bret Fisher: So this is providing that.
After a default install, and assuming that you've installed Velero,
meaning you added it to the cluster, whatever, then ideally it's providing
from one cluster a command... or I guess, because you're doing it cross-cluster,
how does the new cluster even know about the old cluster's backups?
Is that something that's built in?
So do you do cross-cluster restores?
Is that a feature?
I guess I'm asking the same question.
Nolan Brubaker: Yeah, yeah, absolutely.
So the way that works, that's actually in our documentation, and that's a big
use case. Essentially the way you get that cross-cluster restore is you install Velero.
So you have to spin up the cluster, get the Velero deployment running, and
you have to make sure that Velero deployment points to the same bucket,
that same object store bucket that the previous cluster backed up to.
So once they're both pointing to the
same bucket, then you can restore from that previous cluster.
So you have cluster A write to your bucket, and then you have cluster B
read from that bucket, and then you can get that cross-cluster migration.
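The shared-bucket setup can be pictured as two storage location resources that differ only in access mode. This Python sketch builds both; the bucket name, provider, and helper name are placeholders, the `BackupStorageLocation` fields are assumed from the Velero API types, and marking cluster B read-only is a precaution added here for illustration rather than something stated in the conversation.

```python
import json
from typing import Any, Dict


def backup_storage_location(
    bucket: str, provider: str, read_only: bool = False
) -> Dict[str, Any]:
    """Build a Velero BackupStorageLocation resource as a plain dict."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "BackupStorageLocation",
        "metadata": {"name": "default", "namespace": "velero"},
        "spec": {
            "provider": provider,
            "objectStorage": {"bucket": bucket},
            # ReadOnly keeps the new cluster from writing into the shared bucket.
            "accessMode": "ReadOnly" if read_only else "ReadWrite",
        },
    }


if __name__ == "__main__":
    # Placeholder bucket and provider; both clusters point at the same bucket.
    cluster_a = backup_storage_location("example-backups", "aws")
    cluster_b = backup_storage_location("example-backups", "aws", read_only=True)
    print(json.dumps({"cluster-a": cluster_a, "cluster-b": cluster_b}, indent=2))
```

Once cluster B's Velero has synced the backups it finds in the bucket, a restore on cluster B can reference a backup that cluster A created.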
Bret Fisher: Yeah.
Let's do it.
Well, you know, it's funny, I'm realizing we basically
just created a tutorial and an intro to backups on Kubernetes.
Because I think it's a big question.
And as I kind of admitted in some of my social posts,
I'm part of the problem because I do training all the time.
I'm teaching people on the internet and I don't talk about backups a lot.
It's one of those topics that people don't buy courses on.
The backup part is something they do after they've already learned everything else.
And in fact, I'm now thinking about it: the number of people in my courses,
out of 170,000, that have asked backup questions is really, really small.
Probably just a couple of handfuls of people.
And I don't think it's because we all just don't care.
I think there are just multiple reasons.
Like you said, my advice to most people
usually is avoid persistent data in your cluster, if at all possible.
If you can just use the cloud's
data provisioning services, RDS or whatever, do that. The easiest Kubernetes cluster
is the one that can go away, and then you can rebuild it from manifests and it's fine,
and you don't have to restore data.
It would be nice if we all just never had to worry about having Velero,
and we could just have a deployment
of infrastructure as code, and the cluster comes back up and connections
start happening, and then we let the cloud worry about the persistent data.
But the reality is that everything's complex.
We're complex in that we all have legacy apps and, yeah.
So, all right.
Nolan Brubaker: Yeah.
And there was a, I think it was on Twitter,
somebody did a KubeCon talk, I believe it was KubeCon Seattle.
They talked about using Velero to do backups, and then they were playing around with
Kubernetes and ran an accidental command that just deleted all their clusters.
And they were like, we didn't think we needed this,
and then we deleted everything.
So it's an afterthought.
And it's one that we've heard from customers: oh yeah, it was always a
later, later thing, and then they deleted stuff and now it's, oh, we need this.
Bret Fisher: Right.
We thought we didn't have persistent data, but it turns out that we
actually changed things in Kubernetes and we needed that persisted.
It's a valid reason for moving everything to as much infrastructure as code and
GitOps as possible, and removing the API connections
from anyone's local machine and only allowing the automation bots to do that.
And it's a hard thing.
Most of the teams I work with don't ever get to that level,
just because there are a lot of things that have to go into that, and it sort of bites you in the rear
if you don't really have a strong, automated pipeline. But
that's for another podcast. We have a couple more questions.
Do you recommend just doing manual restore testing regularly then,
I guess, since there is no automation?
Nolan Brubaker: Yeah.
At this point I would recommend that. I would strongly recommend working
to automate that, and that's something we're working on internally.
It's something we want to get to.
We also have community meetings on Tuesdays at noon Eastern
if folks want to discuss approaches to that; that's absolutely a valid topic.
If users want to share what they've tried, that would be great to hear. One
thing I will say: as a developer of Velero, a lot of my clusters don't live very long.
So it would be great to get some insight from folks who have
clusters that live a lot longer than mine.
So I would definitely welcome feedback and user experience there.
And yeah, we've got discussions on the Slack channel. So I would recommend testing
your restores, whether they're manual or automated, as much as you can at this point.
Just because the backups are good, if you try to restore and they don't work,
then it's just as bad as not having backups.
Bret Fisher: There's stuff.
To me, most of the headaches have nothing to do with the backup tool.
Like, the restore of a backup may be fantastic,
but the application didn't write to disk properly,
so I don't actually have valid data backups. Or my connections come in to different endpoints,
and those endpoints were lost and restored, and I didn't update those things.
So there are so many things there.
I used to have an old boss that would basically
walk in on a Friday when there wasn't a lot going on.
He would say, okay, everybody in the conference room.
And we all kind of knew what that meant, because he was going to say, today is DR.
Imagine the data center's gone. How do we start restoring the data center?
And we would have the DBA team, we'd have everybody,
essentially all hands on deck, saying, okay, let's go through this exercise.
And it was always
basically a shit show.
It was always, you know, scrambling to try to figure out all the things and all the teams.
And we realized no matter how much documentation we had, there was always a gap.
There was always some exception since the last time we did it.
Nolan Brubaker: Yeah.
It's like a planned Chaos Monkey.
Bret Fisher: Yup.
With a bunch of humans. A manual Chaos Monkey
of manual activities.
I don't think there's such a thing as too much backup testing.
Especially if you're someone who's responsible for backups.
Because I think a lot of organizations just assume that the backup
person or the team responsible for that, which might be the same as the storage
team, is somehow magically able to test all these apps.
And they're usually not. They're not the developers, they're not the operators.
So they don't necessarily have the capability to even
know if the apps are going to work if they restore them.
Yeah, that's a hard thing.
And I sympathize with those people because they're usually the most relied
on in the situation, but they're usually the ones that have the least
amount of information about how the apps are supposed to work and all the other
things outside of that, like networking, that usually need to be involved as well.
Nolan Brubaker: Or even your cloud vendor.
Like if you think about it, if Amazon or Google Cloud or Azure go down,
their concern isn't necessarily your app, it's getting their infrastructure back.
They have a huge incentive to get back online.
That's for sure.
But like their incentive is not your application.
So a lot of that does fall on your organization's shoulders.
So, owning your uptime:
I think a lot of people say that, but it's hard to do.
It's expensive and it's hard.
Bret Fisher: Yeah.
And it's not immediately effective.
If the failure never happens, then no one ever got to see all your work.
Nolan Brubaker: Yeah, that's a lot of
time and effort spent for something that hopefully never happens.
Bret Fisher: Right.
Which is why it gets pushed to the back of the project.
Because all the project delays are happening and everyone's
like, well, we'll just do the DR testing later then.
Well, this has been a great discussion.
Thank you all for the questions.
Hopefully we'll have more to talk about in the future about backups and DR.
And when you guys get some major new features, it'd be great
to have you back on the show to get caught up on this.
But for those of you out there, the message is try this stuff, do the right thing.
Don't let it be that one day that you suddenly... because
sometimes jobs depend on this, and it's important.
So we'll do better on our end talking about it.
You do better on your end of actually using it.
Thanks a lot Nolan for being on the show.
And people can find you at the little Twitter handles on our little page
there. You can get Velero at velero.io, and there's also a Twitter handle.
I think it's projectvelero.
So you can follow them on Twitter and see their releases.
Nolan Brubaker: Awesome.
We're also on the Kubernetes Slack
in the Velero channel.
Bret Fisher: Oh, nice.
Bret: All right.
I hope you enjoyed that
demo and conversation with Nolan from VMware.
And of course you can get all the stuff in the show notes, all the links and info.
And I will see you in the next episode.