1
00:00:04,673 --> 00:00:07,823
Kubernetes Pod Live Migration.

2
00:00:08,063 --> 00:00:12,353
That's what Cast AI calls it when they
dynamically move your pod to a different

3
00:00:12,353 --> 00:00:14,873
node without downtime or data loss.

4
00:00:15,293 --> 00:00:19,193
They've built a Kubernetes controller
that works with CSI storage and CNI

5
00:00:19,193 --> 00:00:24,173
network plugins to copy your running
pod data, the memory, the IP address,

6
00:00:24,173 --> 00:00:28,553
and the TCP connections from one
node to another in real time. Welcome

7
00:00:28,553 --> 00:00:32,093
to DevOps and Docker Talk, and I'm
your solo host today, Bret Fisher.

8
00:00:32,453 --> 00:00:36,143
This YouTube live stream was a fun one
because I got to nerd out with engineers

9
00:00:36,413 --> 00:00:42,143
Philip Andrews and Dan Muret from Cast
AI on Kubernetes pod live migrations

10
00:00:42,203 --> 00:00:43,913
and how it works under the hood.

11
00:00:44,363 --> 00:00:47,573
We talk about use cases for
this feature, including hardware

12
00:00:47,573 --> 00:00:49,403
or OS maintenance on a node.

13
00:00:49,823 --> 00:00:55,403
Maybe for right sizing or bin packing your
pods for cost savings or moving off of a

14
00:00:55,403 --> 00:00:57,068
spot instance that's about to shut down.

15
00:00:57,848 --> 00:01:01,808
Or really anytime you need to move a
daemon set that would cause an outage if

16
00:01:01,808 --> 00:01:04,568
the pod had to restart or be redeployed.

17
00:01:05,068 --> 00:01:10,558
I don't know of any other turnkey way
to do this on Kubernetes today, but

18
00:01:10,558 --> 00:01:13,588
I've got a feeling that Cast AI has got
a winning feature on their hands, and

19
00:01:13,588 --> 00:01:15,658
I'm glad that we got to dig into it.

20
00:01:16,108 --> 00:01:20,458
Over 20 years ago, the virtual machine
vendors built this live migration feature

21
00:01:20,458 --> 00:01:27,088
into their products, and finally, in 2025,
we're now able to do that in Kubernetes.

22
00:01:27,588 --> 00:01:28,998
Let's get into the episode.

23
00:01:31,023 --> 00:01:31,743
Welcome to the show.

24
00:01:32,024 --> 00:01:35,134
All right. Both of these
gentlemen are from Cast AI.

25
00:01:35,344 --> 00:01:38,464
Philip is the Global
Field CTO at Cast AI.

26
00:01:38,514 --> 00:01:40,524
What exactly does a Global Field CTO do?

27
00:01:40,962 --> 00:01:42,762
Handling a lot of our large customers,

28
00:01:42,762 --> 00:01:46,052
a lot of the strategic partnerships,
usually new technologies. So working

29
00:01:46,052 --> 00:01:49,822
with a lot of our, you know, customers on
showing them new technologies, helping in

30
00:01:49,912 --> 00:01:51,472
proof of concepts with new technologies.

31
00:01:51,932 --> 00:01:53,192
It's been kind of a cool role.

32
00:01:53,472 --> 00:01:56,262
Basically on the technical,
customer-facing side.

33
00:01:56,572 --> 00:01:58,762
I get to do a lot with our
largest customers in solving

34
00:01:58,762 --> 00:02:00,172
some of the hardest problems.

35
00:02:00,672 --> 00:02:01,212
Nice.

36
00:02:01,722 --> 00:02:03,852
When you don't know what the
title is, it makes it sound like

37
00:02:03,852 --> 00:02:04,992
you just live in an airplane.

38
00:02:05,182 --> 00:02:06,022
It sounds impressive.

39
00:02:06,122 --> 00:02:08,692
And we've got Dan Muret, or Muret.

40
00:02:09,022 --> 00:02:12,372
I really don't know my French, so that's
probably a horrible pronunciation.

41
00:02:12,872 --> 00:02:15,962
Dan is here, he's the senior sales
engineer with Cast AI, or one of them.

42
00:02:15,962 --> 00:02:20,162
I'm gonna make you the senior sales
engineer so that you sound very elite.

43
00:02:20,222 --> 00:02:20,832
Uh, welcome Dan.

44
00:02:21,332 --> 00:02:24,962
Who can tell me, what's the
elevator pitch for Cast AI? Because

45
00:02:24,962 --> 00:02:26,312
I've known about you all for years.

46
00:02:26,612 --> 00:02:30,482
I've visited your booth at KubeCon at
least a half a dozen times over the years.

47
00:02:30,732 --> 00:02:32,982
What's the main benefit of Cast AI?

48
00:02:33,212 --> 00:02:34,112
Sure, I can grab that one.

49
00:02:34,162 --> 00:02:37,342
When Cast AI was founded, it was to
solve a problem that our founders

50
00:02:37,342 --> 00:02:41,692
had in their last startup years ago,
which was every month their AWS bill

51
00:02:41,692 --> 00:02:45,662
was going up 10% regardless of what
they did the month before to try to

52
00:02:45,922 --> 00:02:48,682
mitigate and manage costs
across the infrastructure.

53
00:02:49,232 --> 00:02:52,452
They sold that startup to Oracle. When
they finished that Oracle time, they went

54
00:02:52,452 --> 00:02:54,252
and figured out how to solve this problem.

55
00:02:54,252 --> 00:02:57,272
They said, well, Kubernetes is
gonna be the future platform.

56
00:02:57,332 --> 00:02:58,892
We're gonna make our bet on Kubernetes.

57
00:02:59,192 --> 00:03:01,862
And the only way to solve this
problem is through automation.

58
00:03:01,922 --> 00:03:04,712
Because doing things manually
every month, one, it's tedious

59
00:03:04,712 --> 00:03:05,882
and takes up a lot of time.

60
00:03:06,002 --> 00:03:08,242
And two, it's just not helpful, right?

61
00:03:08,242 --> 00:03:10,762
You save a little bit, but, you know,
it's two steps back

62
00:03:10,762 --> 00:03:11,962
for every step forward you make.

63
00:03:12,462 --> 00:03:15,852
With Cast AI, it's fully
automation first, right?

64
00:03:15,852 --> 00:03:18,942
We made the effort that everything was
going to be automated from the start.

65
00:03:19,282 --> 00:03:22,492
When it comes to node autoscaling,
node selection, node right-sizing,

66
00:03:22,492 --> 00:03:25,552
workload right-sizing, everything
to do with Kubernetes, everything

67
00:03:25,552 --> 00:03:27,472
we implement is automated.

68
00:03:27,922 --> 00:03:31,312
And that's where the live migration
piece came in, being able to automatically

69
00:03:31,312 --> 00:03:35,392
move applications around within the
cluster without having downtime.

70
00:03:35,442 --> 00:03:37,782
And that's where application
performance automation comes

71
00:03:37,782 --> 00:03:41,572
in, moving from this application
performance monitoring mindset.

72
00:03:42,361 --> 00:03:43,831
Datadog has made a lot of money on that.

73
00:03:44,331 --> 00:03:45,291
I dunno if I'm allowed to say that.

74
00:03:45,601 --> 00:03:48,191
But they've done very well
and have a fantastic platform.

75
00:03:48,221 --> 00:03:51,891
We love Datadog, but
you get data overload.

76
00:03:52,011 --> 00:03:55,701
You end up with metric overload and
actioning on those is very hard.

77
00:03:56,201 --> 00:03:59,321
Where we need to go from here,
especially with the AI mindset that

78
00:03:59,321 --> 00:04:02,801
we're moving into, is automation
of that application performance.

79
00:04:02,801 --> 00:04:04,361
And that's what Cast AI
is leading the way in.

80
00:04:04,861 --> 00:04:08,931
Nice. When you all reached out and I
learned about the fact that you now have

81
00:04:08,991 --> 00:04:13,701
live migration, it took me back to almost
25 years ago when that was first invented

82
00:04:13,701 --> 00:04:17,151
for VMs. At the time it felt like magic.

83
00:04:17,181 --> 00:04:18,651
It did not seem real.

84
00:04:18,981 --> 00:04:22,971
We all had to try it to believe
it because it seemed impossible.

85
00:04:23,376 --> 00:04:27,226
To move from one host to
another, maintain the IP,

86
00:04:27,226 --> 00:04:28,816
maintain the TCP connections.

87
00:04:28,946 --> 00:04:31,706
Surely it's gonna freeze up and
it's gonna be like a frozen screen.

88
00:04:31,836 --> 00:04:32,916
We all just assumed that.

89
00:04:32,916 --> 00:04:35,796
Eventually, and maybe at first it
was a little hiccupy, I think, if I

90
00:04:35,796 --> 00:04:38,046
remember correctly, like 2003, 2005.

91
00:04:38,046 --> 00:04:40,926
It was one of those where it wasn't
quite live, it was

92
00:04:40,926 --> 00:04:42,426
like very short gaps.

93
00:04:42,426 --> 00:04:44,196
Then eventually it got good
enough that it was live.

94
00:04:44,596 --> 00:04:47,276
I was running data centers for
local governments at the time.

95
00:04:47,276 --> 00:04:48,536
So I was very interested in this.

96
00:04:48,536 --> 00:04:50,696
'cause we were running
both ESX and Hyper-V.

97
00:04:50,696 --> 00:04:53,776
So I was heavily invested in
that feature and functionality.

98
00:04:54,076 --> 00:04:57,366
So when I saw that you were doing
it to Kubernetes, my first thought

99
00:04:57,366 --> 00:04:59,196
was, why did this take so long?

100
00:04:59,696 --> 00:05:01,676
Why don't we have this yet on everything?

101
00:05:02,066 --> 00:05:04,706
Because it's clearly possible,
it's technically possible.

102
00:05:05,066 --> 00:05:08,696
Obviously, it's not super easy and
requires a lot of low level tooling

103
00:05:08,696 --> 00:05:13,456
that has to understand networking
and memory and, you know, disk

104
00:05:13,456 --> 00:05:15,256
writes and all that kind of stuff.

105
00:05:15,256 --> 00:05:18,946
So I'm excited for us to get
into exactly how this sort of

106
00:05:19,036 --> 00:05:20,956
operates for a Kubernetes admin.

107
00:05:20,956 --> 00:05:24,406
And I really feel like this show's gonna
be great for anyone learning Kubernetes or

108
00:05:24,436 --> 00:05:29,196
Kubernetes admins, because we talk about
the stateful set, the daemon set problem

109
00:05:29,256 --> 00:05:33,026
of, we've got stateful work, everyone's
got, I mean, everyone I know almost

110
00:05:33,026 --> 00:05:34,526
has stateful workloads in Kubernetes.

111
00:05:34,526 --> 00:05:37,256
I would say, I don't know about you
all's experience, but to me it's

112
00:05:37,256 --> 00:05:40,196
an exception when everything is
stateless in Kubernetes nowadays.

113
00:05:40,196 --> 00:05:41,526
Do you find that to be the case?

114
00:05:42,066 --> 00:05:43,476
We talk to customers all the time, right?

115
00:05:43,476 --> 00:05:44,316
And yeah.

116
00:05:44,316 --> 00:05:47,436
It used to be, you know, just
a couple years ago, right?

117
00:05:47,436 --> 00:05:49,326
It was a lot more stateless.

118
00:05:49,326 --> 00:05:50,646
Web servers, whatever.

119
00:05:51,016 --> 00:05:54,616
Now we're definitely seeing a shift
to more stateful workloads, whether

120
00:05:54,616 --> 00:05:58,326
it's legacy applications being
forced into Kubernetes as part of a

121
00:05:58,326 --> 00:06:00,006
modernization project or whatever.

122
00:06:00,006 --> 00:06:03,576
We're seeing a lot more stateful
workloads in Kubernetes for sure.

123
00:06:03,819 --> 00:06:06,622
Particularly amongst the Fortune
100s, Fortune 500s, right?

124
00:06:06,622 --> 00:06:10,712
'cause you've got this modernization,
and I put it in quotes, where

125
00:06:10,712 --> 00:06:14,762
modernization means taking some
crusty old 15-year-old application,

126
00:06:15,182 --> 00:06:18,752
containerizing it and shoving it in
Kubernetes and calling it cloud native,

127
00:06:18,802 --> 00:06:20,632
the flawed approach of lift and shift.

128
00:06:21,112 --> 00:06:23,962
And you end up with a lot of
applications that are in Kubernetes.

129
00:06:23,962 --> 00:06:27,862
They're listed as a deployment, but
you can't restart them without your

130
00:06:27,862 --> 00:06:29,422
customer having a significant outage.

131
00:06:30,092 --> 00:06:32,342
It goes against everything
Kubernetes was built on.

132
00:06:32,342 --> 00:06:33,922
But that's the world we live in today.

133
00:06:33,922 --> 00:06:36,712
When we first launched live migration
and I posted about it on LinkedIn,

134
00:06:37,212 --> 00:06:39,552
some of the first questions I
got were, why is this even needed?

135
00:06:39,612 --> 00:06:43,362
If you're doing Kubernetes correctly, live
migration shouldn't even be a real thing.

136
00:06:43,412 --> 00:06:46,832
Yes, but 95% of the customers I deal with

137
00:06:47,177 --> 00:06:49,127
don't do Kubernetes the right way.

138
00:06:49,627 --> 00:06:53,787
Well, yeah, I honestly think we
could argue that's a great point.

139
00:06:53,787 --> 00:06:56,247
When Docker and Kubernetes were
both created, it was like stateless.

140
00:06:56,367 --> 00:06:58,157
'cause it's easy, you know,
move everything around.

141
00:06:58,157 --> 00:06:58,757
It's wonderful.

142
00:06:59,057 --> 00:07:03,427
But I mean, this channel,
if there's anything consistent about this

143
00:07:03,427 --> 00:07:07,507
channel over the almost decade that has
existed, it's that it's all containers.

144
00:07:07,507 --> 00:07:10,357
Like I don't care what the tool
is, we're doing it in containers.

145
00:07:10,607 --> 00:07:14,987
The large success of containers is
because we could put everything in it.

146
00:07:14,987 --> 00:07:15,557
So many

147
00:07:16,007 --> 00:07:19,997
evolutions or attempted evolutions
in tech have been, well, you're

148
00:07:19,997 --> 00:07:23,357
gonna have to rewrite to, you know,
functions.

149
00:07:23,387 --> 00:07:26,057
You're serverless, you're gonna have to
write functions now, or you're gonna have

150
00:07:26,057 --> 00:07:27,977
to rewrite in this language or whatever.

151
00:07:28,237 --> 00:07:31,817
And I think that's like the
secret sauce of containers, that we

152
00:07:31,817 --> 00:07:33,257
could literally shove everything in it.

153
00:07:33,347 --> 00:07:34,667
It's also the negative.

154
00:07:34,697 --> 00:07:37,077
And so there's a thing that,
I don't know if I learned it from a

155
00:07:37,077 --> 00:07:41,097
therapist or whatever, but often our
weaknesses are just overdone strengths.

156
00:07:41,097 --> 00:07:43,947
And I feel like the strength of containers
is that you can do everything with it.

157
00:07:43,947 --> 00:07:46,797
you can put every known
app on the planet in it.

158
00:07:47,067 --> 00:07:48,987
They will eventually work
if you figure it out.

159
00:07:49,347 --> 00:07:52,647
The overdone weakness is that we're
putting everything in there, which makes

160
00:07:53,007 --> 00:07:54,967
managing these infrastructures very hard.

161
00:07:54,997 --> 00:07:58,117
You have to assume everything's
fragile until you are sure

162
00:07:58,117 --> 00:07:59,797
that it's truly stateless.

163
00:07:59,927 --> 00:08:04,247
Even stateless, people say stateless,
and what they really mean is it

164
00:08:04,247 --> 00:08:08,587
doesn't care about disk, but it definitely
cares about connections, which is when

165
00:08:08,587 --> 00:08:11,197
we're trying to talk about stateless,
that's not technically accurate.

166
00:08:11,197 --> 00:08:13,627
Like when we say stateless,
we should probably mean also

167
00:08:13,627 --> 00:08:15,277
doesn't care about connections.

168
00:08:15,617 --> 00:08:17,177
At least once the connections are drained.

169
00:08:17,207 --> 00:08:19,937
It's an interesting dilemma we all have
in infrastructure: we have the

170
00:08:19,937 --> 00:08:23,297
power to be able to move everything and
do everything, but also everything's

171
00:08:23,297 --> 00:08:26,507
super fragile that we're running at the
same time, so how do we even manage that?

172
00:08:26,507 --> 00:08:29,567
We've encountered a lot of teams
that swore up and down they were

173
00:08:29,567 --> 00:08:32,257
stateless, right up until you
started bin packing their cluster.

174
00:08:32,757 --> 00:08:33,597
They said, wait, wait, wait.

175
00:08:33,647 --> 00:08:35,177
Why are we having all
these restarted pods?

176
00:08:35,237 --> 00:08:37,907
We're like, because we're bin
packing and we're moving things, and

177
00:08:37,907 --> 00:08:39,287
we're getting better optimization.

178
00:08:39,287 --> 00:08:41,147
They're like, but my container restarted.

179
00:08:41,207 --> 00:08:43,727
Well, yes, that's what
containers do in Kubernetes.

180
00:08:45,296 --> 00:08:45,626
Right.

181
00:08:45,626 --> 00:08:49,476
And for those that are maybe just getting
into Kubernetes or haven't dealt with

182
00:08:49,476 --> 00:08:53,181
large enterprise forever workloads
where they just can't be touched.

183
00:08:53,181 --> 00:08:55,491
I've had 30 years in tech
of don't touch that server.

184
00:08:55,491 --> 00:08:56,511
Don't touch that workload.

185
00:08:56,511 --> 00:08:59,121
It's fragile, it's precious, but
it's also probably on the oldest

186
00:08:59,121 --> 00:09:00,501
hardware and the least maintained.

187
00:09:00,951 --> 00:09:05,451
So one of the performance
measures that any significant-size

188
00:09:05,451 --> 00:09:07,821
Kubernetes team is dealing with
is the cost of infrastructure.

189
00:09:08,091 --> 00:09:10,971
And then we keep getting told, I think
this was just in the last year at

190
00:09:10,971 --> 00:09:17,391
KubeCon, that even on top of Kubernetes,
we're still only averaging like 10% CPU

191
00:09:17,391 --> 00:09:19,401
utilization across nodes.

192
00:09:19,611 --> 00:09:23,661
Like we still are struggling with
the same infrastructure problems that

193
00:09:23,661 --> 00:09:28,161
we were dealing with the last 30 years,
even before VMs, before virtualization.

194
00:09:28,521 --> 00:09:31,521
That was the same problem we had then
because everybody would want their

195
00:09:31,521 --> 00:09:35,811
own server and they always had to plan
for the worst busiest day of the year.

196
00:09:35,811 --> 00:09:38,731
So they would buy huge servers,
put 'em in, and they'd sit idle

197
00:09:38,731 --> 00:09:41,951
almost all the time because
they barely got 5% utilization.

198
00:09:42,251 --> 00:09:47,421
So I can see where like one of the core
premises of something like an application

199
00:09:47,421 --> 00:09:52,701
performance tool is that we're gonna
save tons of money by bin packing.

200
00:09:52,701 --> 00:09:55,481
Can you explain the bin packing process?

201
00:09:55,481 --> 00:09:56,081
What does that look like?

202
00:09:56,591 --> 00:10:01,261
So one of the big things with Kubernetes
is the scheduler will typically

203
00:10:01,261 --> 00:10:03,211
round-robin assign pods to nodes.

204
00:10:03,281 --> 00:10:06,941
If you have 10 nodes in a cluster,
your pods will more or less get evenly

205
00:10:06,941 --> 00:10:08,501
distributed to those nodes in the cluster.

206
00:10:08,501 --> 00:10:11,261
You can manage that with certain
different scheduler hints and

207
00:10:11,261 --> 00:10:15,021
certain scheduler suggestions to
steer that towards, you know, most

208
00:10:15,021 --> 00:10:16,431
utilized, least utilized, et cetera.

209
00:10:16,431 --> 00:10:18,531
But at the end of the day,
you're gonna have spread out

210
00:10:18,531 --> 00:10:19,671
workloads across your nodes.

211
00:10:20,171 --> 00:10:24,491
Bin packing is basically the
defragmentation of Kubernetes, right?

212
00:10:24,571 --> 00:10:27,041
You know, Bret, when you were first
starting out, when I was first starting

213
00:10:27,041 --> 00:10:29,771
out, and you could actually defragment
a hard drive and you get to move the

214
00:10:29,771 --> 00:10:31,421
little Tetris blocks around the screen.

215
00:10:31,621 --> 00:10:31,831
Those were the days.

216
00:10:32,177 --> 00:10:35,687
Being able to do that in a Kubernetes
cluster can mean massive, massive

217
00:10:35,687 --> 00:10:39,167
savings on the actual utilization of that
cluster because now you free up a bunch

218
00:10:39,167 --> 00:10:42,707
of nodes in the
cluster that are no longer necessary,

219
00:10:42,707 --> 00:10:45,617
you can delete those off and when you
need them, you just add them back.

220
00:10:45,617 --> 00:10:49,027
That's the joy of being in a cloud
environment, you can use the least amount

221
00:10:49,027 --> 00:10:50,287
of resources when you don't need 'em.

222
00:10:50,347 --> 00:10:53,257
So, for instance, at, you know,
your off-busy hours, your nighttime

223
00:10:53,257 --> 00:10:57,127
hours, and then when you start needing
'em again, you spin 'em up, you add

224
00:10:57,127 --> 00:11:01,507
more to it, you scale up during the
day, and being able to do that process

225
00:11:01,507 --> 00:11:05,377
over and over again every day is how
you can optimize your cloud resources.

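[Editor's note: to make the defragmentation idea concrete, here is a minimal Go sketch of a bin-packing feasibility check. This is not CAST AI's actual algorithm; the node type and the first-fit logic are illustrative assumptions. It asks whether one node's pods could be re-homed in the spare capacity of the other nodes, which is the precondition for draining and deleting it.]

package main

import "fmt"

// node is a toy model of a Kubernetes node: allocatable CPU and the
// sum of CPU requests of the pods currently scheduled on it (millicores).
type node struct {
	name        string
	allocatable int64
	requested   int64
}

// drainable reports whether candidate's pods could fit into the spare
// capacity of the remaining nodes. Real bin packing is per-pod
// (first-fit decreasing, plus affinity, PodDisruptionBudgets, etc.);
// this treats the requests as divisible to keep the idea visible.
func drainable(candidate node, others []node) bool {
	remaining := candidate.requested
	for _, n := range others {
		if spare := n.allocatable - n.requested; spare > 0 {
			remaining -= spare
		}
		if remaining <= 0 {
			return true
		}
	}
	return false
}

func main() {
	nodes := []node{
		{"node-a", 4000, 1200},
		{"node-b", 4000, 1500},
		{"node-c", 4000, 900},
	}
	// If node-c's 900m of requests fit on node-a and node-b, it can be
	// drained and deleted, shrinking the cluster.
	fmt.Println(drainable(nodes[2], nodes[:2])) // true
}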
226
00:11:05,877 --> 00:11:06,867
What we see is that

227
00:11:07,367 --> 00:11:11,827
people have so many stateful workloads,
whether it's stateful in the real sense, or

228
00:11:11,827 --> 00:11:15,697
stateful in, this is a really poorly
architected application, or stateful in

229
00:11:15,697 --> 00:11:20,107
this application takes 15 minutes to start
up and it's monolithic and I can only

230
00:11:20,107 --> 00:11:23,167
run one copy of it so I can't move it.

231
00:11:23,497 --> 00:11:27,427
All of those things mean you
can't bin pack a cluster, right?

232
00:11:27,427 --> 00:11:29,077
You can't move those things around.

233
00:11:29,167 --> 00:11:33,967
So what ends up happening is people just
end up with these stateful workloads

234
00:11:33,967 --> 00:11:35,767
scattered throughout all 10 nodes.

235
00:11:36,187 --> 00:11:39,877
And even if the 10 nodes are only
60% utilized, you can't get rid

236
00:11:39,877 --> 00:11:42,577
of any of them because it'll cause
some kind of service interruption.

237
00:11:43,077 --> 00:11:45,867
And that's where live migration
allows you to move those

238
00:11:45,867 --> 00:11:47,487
stateful sensitive workloads.

239
00:11:47,717 --> 00:11:52,327
So now those 10 nodes can go down
to six or seven nodes without having

240
00:11:52,327 --> 00:11:56,377
a service interruption, even if
there's less than ideal workloads

241
00:11:56,377 --> 00:11:57,547
scattered throughout the cluster.

242
00:11:58,057 --> 00:12:05,357
Stateful versus stateless, like,
where's the scenario where we need

243
00:12:05,357 --> 00:12:11,497
a live pod migration? And to those
that are perfect in all their software

244
00:12:11,497 --> 00:12:15,007
and they run and control all
the software that runs on Kubernetes.

245
00:12:15,187 --> 00:12:17,587
I don't know who those people are,
but let's just say they exist.

246
00:12:18,087 --> 00:12:19,347
Then this isn't needed.

247
00:12:19,527 --> 00:12:22,617
Every database has a replica
or database mirror, so you

248
00:12:22,617 --> 00:12:23,967
can always take a node down.

249
00:12:23,967 --> 00:12:29,897
Every pod has proper
shutdown handling, ensuring that connections

250
00:12:29,897 --> 00:12:31,937
are properly moved to a new pod.

251
00:12:32,057 --> 00:12:36,357
By the way, I used to do a whole
conference talk on TCP packets

252
00:12:36,357 --> 00:12:39,717
and resetting the connection to
make sure it moves properly through

253
00:12:39,717 --> 00:12:42,547
the load balancer to the next one,
and having a long shutdown time

254
00:12:42,547 --> 00:12:43,897
so that you can drain connections.

255
00:12:43,947 --> 00:12:48,207
That world of shutting down a pod is so
much more complicated than anyone gives it credit for.

256
00:12:48,267 --> 00:12:50,937
Everyone treats it like it's casual
and easy, and it's just not if

257
00:12:50,937 --> 00:12:53,957
you're dealing with hundreds of
thousands or millions of connections.

258
00:12:54,227 --> 00:12:56,387
There is a lot of nuance
and detail to this.

259
00:12:56,387 --> 00:12:59,307
And I often end up with teams
where they implement Kubernetes.

260
00:12:59,667 --> 00:13:01,287
It's a sort of predictable pattern, right?

261
00:13:01,287 --> 00:13:03,317
They implement Kubernetes,
move workloads to it.

262
00:13:03,817 --> 00:13:07,297
They think Kubernetes gives their
workloads magic and then they just

263
00:13:07,297 --> 00:13:09,787
start trying to move things around
and they realize when their customers

264
00:13:09,787 --> 00:13:14,937
complain that the rules of TCP, IPs,
load balancers, and connection state,

265
00:13:14,937 --> 00:13:16,407
like all these rules still apply.

266
00:13:16,687 --> 00:13:20,347
You have to understand those lower levels
and obviously disk and writing to disk

267
00:13:20,347 --> 00:13:21,637
and logs for databases and all that stuff.

268
00:13:21,637 --> 00:13:22,807
That's still there too, I think.

269
00:13:23,077 --> 00:13:26,437
I think the networking is where
I see a lot of junior engineers

270
00:13:26,837 --> 00:13:30,697
hand-waving over it because, quite
honestly, the cloud has made a lot

271
00:13:30,697 --> 00:13:32,017
of the networking problems go away.

272
00:13:32,017 --> 00:13:34,447
So we don't have to have
Cisco certifications just

273
00:13:34,447 --> 00:13:36,067
to run servers anymore.

274
00:13:36,067 --> 00:13:40,387
We used to, but now we can get away
with it till a certain point in

275
00:13:40,387 --> 00:13:42,757
the career or complexity level.

276
00:13:42,847 --> 00:13:47,897
And then suddenly you're having to really
understand the difference between TCP and

277
00:13:47,897 --> 00:13:53,387
UDP and how session state, long polling, or
WebSockets, how all these things affect

278
00:13:53,887 --> 00:13:57,447
whether you're going to break
customers when you decide to restart

279
00:13:57,447 --> 00:13:59,457
that pod or redeploy a new pod.

280
00:13:59,797 --> 00:14:03,247
I love that stuff because it's super
technical and you can get really into

281
00:14:03,247 --> 00:14:06,657
the weeds of it, and I wouldn't
call it a solved problem for everyone.

282
00:14:06,907 --> 00:14:09,397
My understanding of something
like a live migration is that it

283
00:14:09,397 --> 00:14:11,197
handles most of those concerns.

284
00:14:11,627 --> 00:14:15,377
It doesn't make them irrelevant, but
it does deal with those concerns.

285
00:14:15,377 --> 00:14:15,857
Am I right?

286
00:14:15,857 --> 00:14:18,787
In terms of networking, we're talking
about live migrations having

287
00:14:18,787 --> 00:14:22,352
to be concerned with IPs and
connection state, stuff like that.

288
00:14:22,925 --> 00:14:26,585
Yeah, so with being able to do the
networking move of things, to your

289
00:14:26,585 --> 00:14:29,165
point, reestablishing those sessions.

290
00:14:29,195 --> 00:14:32,075
One of the big things we
see is long-running jobs.

291
00:14:32,125 --> 00:14:35,305
If you've got a job that's running for
eight hours and it gets interrupted

292
00:14:35,305 --> 00:14:36,535
at six, you've lost that job.

293
00:14:37,035 --> 00:14:41,675
Even if you try to move it from one to
another, there's some checkpointing involved. A

294
00:14:41,675 --> 00:14:45,875
lot of times, like on a Spark workload,
the driver will just kill the pod

295
00:14:45,875 --> 00:14:48,725
and restart it if it senses any kind
of interruption in the networking.

296
00:14:49,225 --> 00:14:50,935
So the networking's super important there.

297
00:14:50,935 --> 00:14:51,805
Being able to maintain that.

298
00:14:52,735 --> 00:14:54,235
Long-running sessions, WebSockets.

299
00:14:54,235 --> 00:14:57,055
To your point, we've actually tested
this extensively with WebSockets.

300
00:14:57,325 --> 00:14:58,585
WebSockets stay connected.

301
00:14:59,085 --> 00:15:03,125
And we're still in those
early vMotion days.

302
00:15:03,175 --> 00:15:05,125
There is a slight pause
when we move things.

303
00:15:05,405 --> 00:15:09,065
It took vMotion multiple years before
they got it kind of really ironed out.

304
00:15:09,565 --> 00:15:11,395
We're moving probably faster
than them 'cause we have a lot

305
00:15:11,395 --> 00:15:14,305
of experience, you know, of the
experiences they went through.

306
00:15:14,585 --> 00:15:16,385
And the research that's
happened since then.

307
00:15:16,385 --> 00:15:18,875
So I think we're moving pretty fast
on shortening that time window.

308
00:15:19,215 --> 00:15:23,865
But what we found is you queue up all
the traffic and once the pod is live on

309
00:15:23,865 --> 00:15:27,765
the new node, that traffic is replayed
and all the messages come through.

310
00:15:27,765 --> 00:15:30,765
So even on something like a WebSocket,
you don't actually lose messages.

311
00:15:30,765 --> 00:15:32,805
They're just held up for a few seconds.

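[Editor's note: the "held up, not lost" behavior is easy to picture from the application side. A minimal Go sketch, simulating the cutover pause with an in-memory pipe rather than a real migration: the open connection just sees a burst of latency, not a disconnect.]

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// net.Pipe stands in for an established TCP/WebSocket connection.
	client, server := net.Pipe()
	go func() {
		time.Sleep(2 * time.Second) // the migration pause window
		server.Write([]byte("message queued during the pause"))
		server.Close()
	}()
	buf := make([]byte, 64)
	start := time.Now()
	n, _ := client.Read(buf) // blocks through the pause; no error, no reconnect
	fmt.Printf("got %q after %v\n", buf[:n], time.Since(start).Round(time.Second))
}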
312
00:15:33,305 --> 00:15:36,035
And that's extremely important
for maintaining that connection

313
00:15:36,035 --> 00:15:37,145
state, like you were mentioning.

314
00:15:37,555 --> 00:15:39,835
One of our customers that we're
working with heavily on this,

315
00:15:39,865 --> 00:15:41,485
they run Spark streaming jobs.

316
00:15:41,985 --> 00:15:46,455
So they're 24/7, 365, pulling off a
queue, running data transformations

317
00:15:46,455 --> 00:15:49,695
and detections, and then pushing
somewhere else for alerting mechanisms.

318
00:15:50,195 --> 00:15:54,515
If they have a pod go down, it takes about
two minutes to get that process restarted,

319
00:15:54,515 --> 00:15:55,865
and pull in all the data that they need

320
00:15:55,865 --> 00:15:58,585
again. That's two minutes of backlog.

321
00:15:58,675 --> 00:16:00,175
They have super tight SLAs.

322
00:16:00,175 --> 00:16:05,895
They have a five-minute SLA from
message creation to the end of the run through

323
00:16:05,895 --> 00:16:07,185
the entire detection pipeline.

324
00:16:07,185 --> 00:16:10,365
So if you've got a two-minute
delay on that shard in your Kafka

325
00:16:10,365 --> 00:16:14,555
topic, that's a huge chunk of that
five minutes that you just ate up.

326
00:16:15,055 --> 00:16:16,945
You're talking about all the rest of the pipeline.

327
00:16:16,945 --> 00:16:19,075
It's very easy to start
missing SLAs there.

328
00:16:19,575 --> 00:16:23,475
You can't take maintenance
windows if you're 24/7, 365 and

329
00:16:23,475 --> 00:16:24,765
you're doing security processing.

330
00:16:25,195 --> 00:16:25,645
You don't.

331
00:16:25,705 --> 00:16:27,865
You can't be like, well, security's
gonna be offline for 10 minutes

332
00:16:27,865 --> 00:16:28,945
while we move our pods around.

333
00:16:29,526 --> 00:16:32,316
Like, that's just not
acceptable in that world, in that mindset.

334
00:16:32,716 --> 00:16:35,956
So keeping that connectivity, keeping
the connection state, being able to

335
00:16:35,956 --> 00:16:39,386
keep everything intact, keeping the
Kafka connection, keeping the Spark

336
00:16:39,536 --> 00:16:43,406
driver connection is all super important
with being able to move that entire

337
00:16:43,406 --> 00:16:47,646
TCP/IP stack over from one node to
another during that migration process.

338
00:16:48,146 --> 00:16:51,236
Yeah, and I mean, we're really talking
about a lot of the different kinds of

339
00:16:51,236 --> 00:16:56,886
problems that come with shifting workloads
like being able to say, you know, walking

340
00:16:56,886 --> 00:17:01,876
into an environment and sort of being
your own wrecking ball and, your own chaos

341
00:17:01,876 --> 00:17:04,996
monkey and saying, I'm gonna go over here
and push the power button on this node,

342
00:17:05,066 --> 00:17:09,966
or I'm gonna properly shut down this node,
do you have everything set up correctly so

343
00:17:09,966 --> 00:17:15,256
that connections are properly drained? That
is, I would say, such a moving target,

344
00:17:15,256 --> 00:17:18,836
especially because every time we've had
these processes where I've had clients

345
00:17:18,836 --> 00:17:22,706
where we go through this like exercise of
we're going to do maintenance on a node

346
00:17:22,766 --> 00:17:27,116
and we're even gonna plan for it, and
then we do it, and then we fix all the

347
00:17:27,116 --> 00:17:31,896
issues of the pods and the shutdown timing
and the Argo CD deployment settings

348
00:17:31,896 --> 00:17:33,756
that we need to massage and perfect.

349
00:17:33,876 --> 00:17:38,016
And then, you know, six months later, if
we do it again, the same thing happens

350
00:17:38,016 --> 00:17:41,256
because now there's new workloads that
weren't perfected and weren't well tested.

351
00:17:41,536 --> 00:17:44,056
If I can make a career out
of actually being, like,

352
00:17:44,556 --> 00:17:49,576
a pod migration guru, like that sounds
like my kind of dream job where we crash

353
00:17:49,576 --> 00:17:53,666
and break everything and then we track
all of the potential issues of that and

354
00:17:53,666 --> 00:17:57,296
we are like a tiger team that goes pod
by pod and certifies this is like, yep,

355
00:17:57,296 --> 00:18:00,116
this pod can now move safely without risk.

356
00:18:00,336 --> 00:18:01,746
Because we've got everything dialed in.

357
00:18:01,746 --> 00:18:03,036
We've got all the right settings.

358
00:18:03,396 --> 00:18:05,016
I feel like that's a workshop opportunity.

359
00:18:05,016 --> 00:18:07,876
Maybe sell something on that, because
there are so many levels of complexity

360
00:18:07,876 --> 00:18:13,336
we haven't even talked about, like
database logs and database mirroring.

361
00:18:13,386 --> 00:18:16,266
You can't really spin up a new node
of a database and let it sit there

362
00:18:16,266 --> 00:18:19,176
idle as a pod while you're waiting for
the old one to shut down, they can't

363
00:18:19,176 --> 00:18:20,736
access the same files, blah, blah, blah.

364
00:18:20,786 --> 00:18:23,686
It just depends on the workload,
on how complex this all gets.

365
00:18:23,686 --> 00:18:28,266
But I'm assuming also that when we talk
about something like live migrations,

366
00:18:28,266 --> 00:18:29,556
we're not just concerned with networking.

367
00:18:29,556 --> 00:18:31,626
We're also somehow shifting storage.

368
00:18:31,981 --> 00:18:35,441
I'm guessing there's certain
limitations to that where you're not

369
00:18:35,441 --> 00:18:38,231
replicating on the backend volumes.

370
00:18:38,231 --> 00:18:41,651
You're, I guess you're just
using, like, iSCSI reconnects, or

371
00:18:41,811 --> 00:18:43,011
how does that work?

372
00:18:43,011 --> 00:18:45,501
We haven't really gotten into the
solution, but I know you're only on

373
00:18:45,501 --> 00:18:49,561
certain clouds right now, and I'm assuming
that's partly due to the technical

374
00:18:49,741 --> 00:18:51,421
limitations of their infrastructure.

375
00:18:51,921 --> 00:18:52,671
Right, exactly.

376
00:18:52,681 --> 00:18:55,521
Each cloud has different kinds
of quirks around how they

377
00:18:55,521 --> 00:18:58,041
function, what the different
technologies look like around them.

378
00:18:58,321 --> 00:19:01,801
Somebody had asked about being able
to move, you know, larger systems

379
00:19:01,801 --> 00:19:06,681
what are the limits around it, and there,
it depends on the use case, right?

380
00:19:06,681 --> 00:19:10,401
If you're talking spot instances,
being able to move from one spot

381
00:19:10,401 --> 00:19:13,221
instance to another spot instance
in a two-minute interruption window

382
00:19:13,221 --> 00:19:15,551
on AWS, it depends on how much data.

383
00:19:15,791 --> 00:19:19,511
If you're trying to move 120 gigs of
data, physics is working against you.

384
00:19:19,541 --> 00:19:22,571
You don't have enough time in that two
minute window to get enough through

385
00:19:22,571 --> 00:19:24,901
the pipe over to the new system.

386
00:19:25,401 --> 00:19:30,151
Now if you're talking small pods,
if you're talking less than 32 gig

387
00:19:30,151 --> 00:19:31,796
nodes, you can move that fast enough.

388
00:19:31,796 --> 00:19:34,046
64 gigs, maybe you're on the edge.

389
00:19:34,046 --> 00:19:36,536
Depends on how much other network
traffic is tying up the bandwidth,

390
00:19:36,876 --> 00:19:40,026
64 gigs is getting on the edge of what
you can move in a two-minute window.

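[Editor's note: a rough back-of-the-envelope for those numbers, my own arithmetic rather than CAST AI's figures. Transfer time is size times eight bits over link speed, ignoring protocol overhead and competing traffic, both of which make it worse in practice.]

package main

import "fmt"

// transferSeconds is the idealized time to push sizeGB through a
// linkGbps pipe: GB -> gigabits, divided by gigabits per second.
func transferSeconds(sizeGB, linkGbps float64) float64 {
	return sizeGB * 8 / linkGbps
}

func main() {
	for _, gb := range []float64{32, 64, 120} {
		fmt.Printf("%3.0f GB over 10 Gbps: ~%3.0f s of a 120 s spot notice\n",
			gb, transferSeconds(gb, 10))
	}
	// 32 GB ~ 26 s, 64 GB ~ 51 s, 120 GB ~ 96 s: which is why 32 is
	// comfortable, 64 is on the edge, and 120 fights physics.
}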
391
00:19:40,376 --> 00:19:43,586
That other example I was talking about,
those long-running Spark streaming

392
00:19:43,586 --> 00:19:48,027
jobs, if they're running on demand,
live migration is still a massive

393
00:19:48,027 --> 00:19:51,327
benefit because now you can do server
upgrades without taking an outage.

394
00:19:51,687 --> 00:19:55,287
You can create a new node running
your new patched version of Kubernetes,

395
00:19:55,287 --> 00:19:58,767
running your new patched OS, and
migrate the pod from one to the other.

396
00:19:59,127 --> 00:20:02,627
Your time to replicate is less important.

397
00:20:03,127 --> 00:20:06,032
Even if it takes you three minutes,
four minutes, or five minutes

398
00:20:06,032 --> 00:20:09,242
to replicate the memory from
one box to the other, who cares?

399
00:20:09,512 --> 00:20:11,942
It's not gonna be paused for that
long, because what we're doing

400
00:20:11,942 --> 00:20:13,352
is we're doing delta replication.

401
00:20:13,742 --> 00:20:16,772
So you replicate a big chunk, and then
a smaller chunk, and then a smaller

402
00:20:16,772 --> 00:20:20,252
chunk until the chunk is small enough
where you can do it in a pause window.

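[Editor's note: that shrinking-chunk scheme is the classic pre-copy live migration loop from the VM era. A minimal Go sketch of the convergence math follows; the numbers and function are illustrative assumptions, not CAST AI's implementation.]

package main

import "fmt"

// preCopy iterates the delta-replication loop: ship the current delta,
// see how much memory was dirtied while shipping, repeat. It converges
// only while the dirty rate stays below the link speed.
func preCopy(totalMB, dirtyMBps, linkMBps, pauseBudgetMB float64) (int, float64) {
	delta, passes := totalMB, 0
	for delta > pauseBudgetMB && passes < 30 {
		shipTime := delta / linkMBps // seconds spent copying this pass
		delta = dirtyMBps * shipTime // what got dirtied meanwhile
		passes++
	}
	return passes, delta
}

func main() {
	// 16 GB of pod memory, 50 MB/s dirtied, 1 GB/s link, and a final
	// delta small enough to ship inside the brief cutover pause.
	passes, finalMB := preCopy(16384, 50, 1024, 100)
	fmt.Printf("converged in %d passes; pause ships the last %.0f MB\n", passes, finalMB)
}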
403
00:20:20,672 --> 00:20:23,882
And so now when you're moving a
huge service from one to the other?

404
00:20:24,182 --> 00:20:24,602
Same thing.

405
00:20:24,602 --> 00:20:28,402
If you're talking, like, NVMe local storage,
we've got another customer we're working

406
00:20:28,402 --> 00:20:30,472
with and it's a different set of problems.

407
00:20:30,832 --> 00:20:35,812
They have a terabyte of NVMe that they use
on local ephemeral disk on every node that

408
00:20:35,812 --> 00:20:37,492
needs to be replicated from node to node.

409
00:20:37,492 --> 00:20:41,072
When they do node upgrades, that
takes about 20 minutes to replicate

410
00:20:41,072 --> 00:20:43,202
all of that from one node to
another, even on high-throughput

411
00:20:43,202 --> 00:20:44,792
disks, on high-throughput nodes.

412
00:20:45,292 --> 00:20:47,872
But if it's happening in the
background, while everything else

413
00:20:47,872 --> 00:20:50,512
is humming along nicely, who cares?

414
00:20:50,962 --> 00:20:52,072
Keep replicating it over.

415
00:20:52,072 --> 00:20:55,132
You keep going down to deltas, and
then once your deltas get small enough,

416
00:20:55,132 --> 00:20:58,492
you pause for six to 10 seconds,
depending on how big the service is.

417
00:20:58,732 --> 00:20:59,842
And then you slide it over.

418
00:21:00,182 --> 00:21:01,562
A lot of these things are being solved.

419
00:21:01,562 --> 00:21:06,242
We're actively reducing these pause
times by being able to do more prep

420
00:21:06,242 --> 00:21:09,122
behind the scenes, being able to do
more processing behind the scenes.

421
00:21:09,402 --> 00:21:11,872
Everything is operating
as a containerd plugin.

422
00:21:12,272 --> 00:21:16,852
I saw somebody asked about on-prem.
We will be supporting on-prem, we

423
00:21:16,852 --> 00:21:18,592
will be supporting other solutions.

424
00:21:19,092 --> 00:21:23,962
The big catch there is everybody
has some different flavor of

425
00:21:23,962 --> 00:21:26,212
networking and different flavors
of things behind the scenes.

426
00:21:26,212 --> 00:21:31,162
The actual live migration piece right now
could apply to any Kubernetes anywhere.

427
00:21:31,222 --> 00:21:35,602
It's the IP stack that gets a little
trickier because you've got Cilium

428
00:21:35,602 --> 00:21:39,912
running places, you've got Calico running
places, you've got the VPC CNI

429
00:21:40,002 --> 00:21:42,762
running places, you know, everybody
has different networking flavors.

430
00:21:43,262 --> 00:21:45,872
So being able to maintain network
connections when you do the

431
00:21:45,872 --> 00:21:50,062
migration is largely the more
difficult part of the whole process.

432
00:21:50,436 --> 00:21:50,706
Hmm.

433
00:21:50,722 --> 00:21:53,992
Being able to move the pod isn't that
bad, so if you've got workloads that

434
00:21:54,112 --> 00:21:57,562
you can reestablish connections and the
connection resetting is not a big deal,

435
00:21:57,742 --> 00:22:02,232
but you don't wanna have to restart
the pod, that's fairly straightforward.

436
00:22:02,232 --> 00:22:05,922
We could pretty much do that today across
any containerd-compatible Kubernetes.

437
00:22:06,252 --> 00:22:09,102
It's specifically the networking
that causes a lot more hardships,

438
00:22:09,132 --> 00:22:11,262
because everybody has a
different flavor of networking.

439
00:22:11,672 --> 00:22:17,452
For AWS, we were able to fork
the open source AWS node CNI and

440
00:22:17,452 --> 00:22:20,422
create our own flavor of it that
now handles the networking piece.

441
00:22:20,782 --> 00:22:24,652
So we're using the open source
AWS CNI code, and we've modified

442
00:22:24,652 --> 00:22:26,212
it, and now it works just fine.

443
00:22:26,212 --> 00:22:30,422
For our purposes, we're doing something
similar on GCP, working with the GKE CNI.

444
00:22:30,472 --> 00:22:34,057
GKE, if I recall, is using Cilium under
the hood for their networking.

445
00:22:34,807 --> 00:22:38,087
So we're gonna be building a
similar plugin for their Cilium side.

446
00:22:38,587 --> 00:22:41,932
Yeah, and the nice thing is, I
guess if you build it for Cilium,

447
00:22:42,432 --> 00:22:44,952
would it work universally
across any Cilium deployment?

448
00:22:44,952 --> 00:22:48,632
In theory, I mean, I'm just thinking of
the most popular CNIs and if you check

449
00:22:48,632 --> 00:22:54,122
those off the list, it suddenly gives you,
you know, a lot more reach than having to

450
00:22:54,122 --> 00:22:57,002
go plowed by cloud or os by os, you know?

451
00:22:57,222 --> 00:22:57,312
and,

452
00:22:57,362 --> 00:22:57,962
Exactly.

453
00:22:58,042 --> 00:23:01,522
Our first iteration of this back in
January, February, the first version

454
00:23:01,522 --> 00:23:04,312
that we demoed was actually Calico.

455
00:23:04,882 --> 00:23:07,342
A lot of people were like, I don't
wanna have to rip out my cluster and

456
00:23:07,342 --> 00:23:09,292
rebuild it with Calico as the CNI.

457
00:23:09,292 --> 00:23:11,992
We were able to figure out
a way to work with the VPC

458
00:23:11,992 --> 00:23:14,182
CNI as a backing basis there.

459
00:23:14,452 --> 00:23:16,582
So Calico's pretty much already built.

460
00:23:16,922 --> 00:23:21,212
We've got the AWS CNI now built,
Cilium is our next target. I saw

461
00:23:21,212 --> 00:23:25,022
somebody ask about Azure; Azure
is probably gonna be early 2026.

462
00:23:25,082 --> 00:23:28,692
We'll be EKS, GKE, and then
we'll work on AKS, and then we'll

463
00:23:28,692 --> 00:23:30,262
work on-prem solutions after that.

464
00:23:30,352 --> 00:23:32,542
So on-prem will probably
be sometime in 2026.

465
00:23:33,092 --> 00:23:35,852
Yeah, I can remember going
back to the two thousands.

466
00:23:35,852 --> 00:23:39,352
I can remember when we went from
delayed migrations or paused

467
00:23:39,352 --> 00:23:41,002
migrations to live migrations.

468
00:23:41,302 --> 00:23:44,632
I can remember reading the technical
papers coming out of VMware and

469
00:23:44,632 --> 00:23:48,852
Microsoft and they were talking about
the idea of these deltas, continually

470
00:23:48,852 --> 00:23:52,402
repeating the delta process until
you get down to zero or like you can

471
00:23:52,402 --> 00:23:55,102
fit it in a packet and then that's
the final packet kind of thing.

472
00:23:55,102 --> 00:23:57,652
I don't know why I remember that all
these years later, but I do remember

473
00:23:57,652 --> 00:24:00,932
that I thought that was some pretty cool
science, like some pretty cool physics

474
00:24:00,932 --> 00:24:04,052
across the wire because back then we were
lucky if our servers had one gigabit,

475
00:24:04,792 --> 00:24:06,472
let alone 200 gig workloads or anything like that.

476
00:24:06,782 --> 00:24:10,322
This actually led me, during my
research, and we could talk about

477
00:24:10,322 --> 00:24:15,362
the idea that there are
attempts in Linux over the years

478
00:24:15,362 --> 00:24:17,222
to try to solve this universally.

479
00:24:17,222 --> 00:24:22,792
I did some research before the show and
saw some projects around ML workloads in

480
00:24:22,792 --> 00:24:27,082
particular, a lot of engineers, whether
it's platform engineering or just the ML

481
00:24:27,082 --> 00:24:32,902
engineers themselves interested in this
because of, sort of, the problems of

482
00:24:32,902 --> 00:24:36,952
large ML or AI workloads today where you
can't interrupt them; if you interrupt

483
00:24:36,952 --> 00:24:38,962
them, you have to basically start over.

484
00:24:38,962 --> 00:24:41,242
It's sort of a precious workload
while it's running and it

485
00:24:41,242 --> 00:24:42,322
might be running a long time.

486
00:24:42,372 --> 00:24:45,982
Do you have AI and ML workload
customers?

487
00:24:46,432 --> 00:24:50,602
Are they maybe part of the first movers
to move onto something like this?

488
00:24:50,632 --> 00:24:53,167
I'm basing it on the KubeCon talks
and things that I've seen out there,

489
00:24:53,257 --> 00:24:57,197
Large-scale data analytics is
definitely one of the big players here.

490
00:24:57,197 --> 00:25:00,317
A lot of it's Spark-driven data
analytics that we're seeing,

491
00:25:00,667 --> 00:25:02,257
because of exactly that problem.

492
00:25:02,587 --> 00:25:06,397
A lot of these jobs will be running
for 8, 10, 12, 14, 16 hours, and

493
00:25:07,602 --> 00:25:10,362
running those on demand at the
scale that they're running them

494
00:25:10,362 --> 00:25:12,912
at is extraordinarily expensive.

495
00:25:12,912 --> 00:25:18,792
So the big ask is how do we get those
workloads onto spot instances where

496
00:25:18,792 --> 00:25:22,662
when we get the interruption notice we
can fall back to some type of reserved

497
00:25:22,662 --> 00:25:25,332
capacity and then fail back to spot.

498
00:25:25,422 --> 00:25:29,702
So basically the goal is to move
to this new concept where in your

499
00:25:29,702 --> 00:25:34,312
Kubernetes cluster you have some swap
space, whether that's excess spot

500
00:25:34,312 --> 00:25:37,642
capacity, two or three extra nodes of
spot capacity or a couple of nodes of

501
00:25:37,642 --> 00:25:42,622
on-demand capacity where if you get a
node interruption, you can quickly swap

502
00:25:42,622 --> 00:25:47,342
into those nodes, and then once you stand
your new spot instance back up, then

503
00:25:47,342 --> 00:25:48,737
you can swap back to that spot instance.

504
00:25:50,072 --> 00:25:50,762
That's where we're headed.

505
00:25:50,812 --> 00:25:53,812
That's what Q4 is gonna be working on
this year, is to be able to automate that

506
00:25:53,812 --> 00:25:58,492
entire process to where you can float back
and forth between reserved capacity and

507
00:25:58,492 --> 00:26:04,042
spot capacity to really save on those
data analytics jobs, those large ML jobs.

508
00:26:04,462 --> 00:26:07,102
We're not to the GPU side of things yet.

509
00:26:07,492 --> 00:26:11,362
I'd love to get us to where we could
migrate GPU workloads 'cause that's where

510
00:26:11,362 --> 00:26:13,402
the next big bottleneck is gonna be.

511
00:26:13,882 --> 00:26:17,122
The hooks aren't there in the
Nvidia toolsets yet, for the CUDA

512
00:26:17,122 --> 00:26:20,522
toolsets in a lot of places, to be
able to get what we need for data.

513
00:26:20,522 --> 00:26:21,962
We're figuring our way around that.

514
00:26:22,272 --> 00:26:27,972
They tend to be much larger, so the
time taken to move them is very expensive.

515
00:26:28,252 --> 00:26:31,342
It might be 20 minutes to be able
to move a job from one to the other.

516
00:26:31,372 --> 00:26:34,012
'cause it took 20 minutes to get
it started up in the first place.

517
00:26:34,272 --> 00:26:37,152
Just due to the size of the models and
how much data you have to replicate.

518
00:26:37,652 --> 00:26:41,432
We're starting to put some POC work
into the GPU side of things while we're

519
00:26:41,432 --> 00:26:44,822
continuing on full steam with building
out the expansion of the feature set

520
00:26:44,822 --> 00:26:47,492
of the CPU- and memory-based workloads.

521
00:26:48,042 --> 00:26:48,312
Alright.

522
00:26:48,312 --> 00:26:52,992
Dan, I was curious what you've seen
on the implementation side of this.

523
00:26:53,542 --> 00:26:54,862
When we talk about

524
00:26:55,362 --> 00:26:59,322
the need to live migrate a pod, whether
it's for maintenance, then it almost

525
00:26:59,322 --> 00:27:03,732
feels like the next level is the idea
of spot instances. I love that idea of

526
00:27:03,732 --> 00:27:07,332
my infrastructure dynamically failing
and my applications can handle it.

527
00:27:07,332 --> 00:27:11,002
Is there a maturity level where
you see people start out? It's hard

528
00:27:11,002 --> 00:27:13,822
for me to imagine like on day one
someone's like, yeah, let's just put

529
00:27:13,822 --> 00:27:15,562
it all on spot instances and YOLO.

530
00:27:15,592 --> 00:27:17,032
Let's just, we don't care.

531
00:27:17,032 --> 00:27:17,662
It's all good.

532
00:27:17,662 --> 00:27:19,102
Live migration will solve it all.

533
00:27:19,102 --> 00:27:21,962
Because obviously, there are
physics limits to the amount of

534
00:27:21,962 --> 00:27:23,702
data we can transmit over the wire.

535
00:27:23,702 --> 00:27:25,232
I'm imagining this scenario where you're

536
00:27:25,527 --> 00:27:30,437
accrediting certain workloads,
like this replica set is good

537
00:27:30,707 --> 00:27:33,347
for spot because it's low data.

538
00:27:33,397 --> 00:27:36,667
We don't need to transfer a hundred
gigs of data during a two-minute outage

539
00:27:36,667 --> 00:27:38,077
or a two-minute notice of outage.

540
00:27:38,307 --> 00:27:41,477
Do you see that as a maturity
scale where you have to

541
00:27:41,593 --> 00:27:41,803
Yeah.

542
00:27:41,803 --> 00:27:44,228
I mean, it is absolutely
a maturity scale, right?

543
00:27:44,228 --> 00:27:46,668
Kind of going back to the references
we talked about, the early days of

544
00:27:46,668 --> 00:27:50,978
VMware, nobody started doing vMotion
in production, everyone started it.

545
00:27:51,478 --> 00:27:55,798
Oh, we've got this five-second
interruption, development and test

546
00:27:55,858 --> 00:27:57,688
boxes can handle that all day long.

547
00:27:57,938 --> 00:28:00,458
So it's the same concept really
that we're living in now.

548
00:28:00,458 --> 00:28:01,898
We're going through that same evolution.

549
00:28:01,898 --> 00:28:02,498
I agree with Phil.

550
00:28:02,498 --> 00:28:07,028
I think we're doing it much faster
than VMware did in 2002, 2003.

551
00:28:07,698 --> 00:28:09,228
I was around when that happened as well.

552
00:28:09,228 --> 00:28:11,873
So I remember racking and
stacking all those boxes.

553
00:28:12,213 --> 00:28:13,713
But yeah, it's very much the same thing.

554
00:28:13,713 --> 00:28:17,843
Container live migration is brand new;
we've just been GA for a month with it.

555
00:28:17,843 --> 00:28:21,213
So we've had conversations at trade
shows and with customers and there's

556
00:28:21,213 --> 00:28:22,473
a lot of excitement around it.

557
00:28:22,473 --> 00:28:25,083
I think we're still trying to
figure out where it fits, what

558
00:28:25,083 --> 00:28:28,293
are the exact workloads that it
makes the most sense to do this in.

559
00:28:28,653 --> 00:28:31,378
And yeah, I think it's going
to be a process of adoption.

560
00:28:31,428 --> 00:28:33,258
There's definitely a lot of use cases.

561
00:28:33,618 --> 00:28:37,878
I think spot is a very interesting
use case, especially the large data models

562
00:28:37,878 --> 00:28:39,678
and things that we're processing today.

563
00:28:40,018 --> 00:28:43,078
I'm working with a customer now that's
doing a lot of video processing in

564
00:28:43,078 --> 00:28:47,488
Kubernetes and that's a very, you
know, CPU- and memory-intensive job.

565
00:28:47,488 --> 00:28:51,688
I mean, you know, we're talking a cluster
that scales from a few CPUs to 6,500

566
00:28:51,688 --> 00:28:53,368
CPUs while they're processing this.

567
00:28:53,588 --> 00:28:56,888
We're really trying to figure out
where it makes the most sense to

568
00:28:56,888 --> 00:28:58,478
apply this type of technology.

569
00:28:58,978 --> 00:29:03,708
No one wants to have that kind of
dynamic scale and then have to pay

570
00:29:03,708 --> 00:29:06,408
for reserved instances for all of
that, like, worst case scenario.

571
00:29:06,458 --> 00:29:08,018
That sounds like a billing nightmare.

572
00:29:08,068 --> 00:29:11,363
And you don't want a job that
runs for, you know, hours that

573
00:29:11,363 --> 00:29:15,833
costs you tons of money, to fail 80%
through and have to restart it. I

574
00:29:15,833 --> 00:29:17,603
mean, that's just not efficient.

575
00:29:17,603 --> 00:29:20,673
So, yeah, I think the ability to
really move this and allow those

576
00:29:20,673 --> 00:29:22,023
workloads to finish is gonna be

577
00:29:22,518 --> 00:29:23,588
huge for the market.

578
00:29:24,088 --> 00:29:25,948
Alright, so we have been
talking a lot about the problem

579
00:29:25,948 --> 00:29:27,028
and some of the solution.

580
00:29:27,088 --> 00:29:29,998
We do have some slides that give
visualizations for those on YouTube.

581
00:29:30,028 --> 00:29:31,138
This will turn into a podcast.

582
00:29:31,138 --> 00:29:34,708
So audio listeners, we will give
you the alt-text version of it

583
00:29:34,708 --> 00:29:35,518
while we're talking about it.

584
00:29:35,518 --> 00:29:40,122
But, Philip, I'm curious exactly what is
happening and the process of how a live

585
00:29:40,122 --> 00:29:41,952
migration, like how does it kick off?

586
00:29:42,382 --> 00:29:45,082
What's really going on in the
background when it starts?

587
00:29:45,142 --> 00:29:45,502
Absolutely.

588
00:29:45,502 --> 00:29:48,492
And we do have some
better demos other places,

589
00:29:48,492 --> 00:29:49,302
I think on the website.

590
00:29:49,352 --> 00:29:52,442
basically what we have is a live
migration controller looking

591
00:29:52,442 --> 00:29:55,322
across all the workloads and nodes
that are live migration enabled.

592
00:29:55,372 --> 00:29:58,012
You don't necessarily have to
turn this on for everything.

593
00:29:58,222 --> 00:30:00,832
You've got all your stateless
workloads, you don't need live

594
00:30:00,832 --> 00:30:03,322
migration for stateless workloads,
just treat them as normal.

595
00:30:03,712 --> 00:30:06,772
You've got your stateful workloads
that you do want to use this for,

596
00:30:06,832 --> 00:30:09,802
so you could set up a specific,
you know, node group for that.

597
00:30:10,052 --> 00:30:14,012
that's gonna allow you to select what you
actually want to do live migration for.

598
00:30:14,512 --> 00:30:17,512
You could use it for everything, but
it just eats up more network bandwidth

599
00:30:17,512 --> 00:30:19,882
if you're using it for the stuff
that already tolerates being moved.

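To make that concrete, here is a minimal sketch of steering a workload onto a dedicated live-migration node group with the official Kubernetes Python client. The label and node-group keys are hypothetical, invented for illustration; Cast AI's actual configuration keys aren't covered in this episode.

```python
# Hypothetical sketch: opt one Deployment into a live-migration node group.
# The label keys below are made up for illustration; Cast AI's real keys
# may differ. Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
apps = client.AppsV1Api()

# Patch the Deployment's pod template so its pods land on a node group
# reserved for live-migratable, stateful workloads.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "labels": {"example.com/live-migration": "enabled"}  # hypothetical
            },
            "spec": {
                "nodeSelector": {"example.com/node-group": "stateful"}  # hypothetical
            },
        }
    }
}
apps.patch_namespaced_deployment(
    name="video-processor", namespace="default", body=patch
)
```
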
600
00:30:20,382 --> 00:30:23,532
that controller's gonna be looking for
different signals within the cluster of

601
00:30:23,532 --> 00:30:25,422
when something needs to be live migrated.

602
00:30:25,422 --> 00:30:26,862
Instance interruption is a good one.

603
00:30:27,212 --> 00:30:30,662
being able to do bin packing, evicting
a node from the cluster because it's

604
00:30:30,662 --> 00:30:34,802
underutilized, and then migrating those
workloads to another node in the cluster.

605
00:30:34,862 --> 00:30:36,062
what we call rebalancing.

606
00:30:36,132 --> 00:30:38,952
basically, rebuilding the
cluster with a new set of nodes.

607
00:30:39,252 --> 00:30:42,012
And that could be because you're
doing a node upgrade, you're doing a

608
00:30:42,012 --> 00:30:45,152
Kubernetes upgrade, you're doing an
OS upgrade, you're just trying to

609
00:30:45,652 --> 00:30:47,632
get a more efficient set of nodes.

610
00:30:47,952 --> 00:30:51,672
all of those are good reasons that you
would want to do your live migration.

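As a sketch of one such signal: on AWS, a spot interruption is announced through the instance metadata service a couple of minutes before the instance goes away. The endpoint below is a real IMDSv1 path (it returns 404 until an interruption is scheduled); the polling loop around it is purely illustrative and is not Cast AI's controller code.

```python
import json
import time
import urllib.error
import urllib.request

# Real IMDSv1 endpoint; 404s until an interruption is scheduled.
# (Instances enforcing IMDSv2 would need a session token first.)
URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def spot_interruption():
    """Return the pending interruption notice as a dict, or None."""
    try:
        with urllib.request.urlopen(URL, timeout=1) as resp:
            return json.load(resp)  # e.g. {"action": "terminate", "time": "..."}
    except urllib.error.URLError:
        return None  # 404 or unreachable: nothing pending

while True:
    notice = spot_interruption()
    if notice:
        print("spot interruption scheduled:", notice)
        # ...this is the moment a live migration would be kicked off...
        break
    time.sleep(5)
```
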
611
00:30:52,172 --> 00:30:56,672
So what's gonna happen in that process
is the two daemon sets on the source

612
00:30:56,672 --> 00:30:58,922
node and the destination node are
gonna start talking to each other.

613
00:30:59,422 --> 00:31:03,412
They're going to look at the pods on
the source node, start synchronizing

614
00:31:03,412 --> 00:31:04,972
them over to the destination node.

615
00:31:05,272 --> 00:31:08,902
So behind the scenes, all of that
kind of memory is being copied over

616
00:31:08,902 --> 00:31:12,532
any disk state is being copied over,
any TCP/IP connection states are

617
00:31:12,532 --> 00:31:16,552
being copied over and you're doing
all that prep work behind the scenes.

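For a feel of the mechanics, the checkpoint/restore primitive underneath this kind of migration can be demonstrated on a single host with CRIU, which comes up later in this conversation. The criu subcommands and flags below are real; the two-step orchestration is a toy with a made-up PID, and Cast AI's actual cross-node pipeline is far more involved than this.

```python
# Minimal single-host sketch of checkpoint/restore using the CRIU CLI.
# Requires criu installed and root privileges; PID is illustrative.
import subprocess

PID = 12345           # process to "migrate" (illustrative)
IMAGES = "/tmp/ckpt"  # where CRIU writes the memory/state images

# Freeze the process and dump its memory, file descriptors, and
# established TCP connections into the images directory.
subprocess.run(
    ["criu", "dump", "--tree", str(PID), "--images-dir", IMAGES,
     "--tcp-established", "--shell-job"],
    check=True,
)

# On the destination (after copying IMAGES across), restore it,
# re-establishing the same TCP connections.
subprocess.run(
    ["criu", "restore", "--images-dir", IMAGES,
     "--tcp-established", "--shell-job"],
    check=True,
)
```
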
618
00:31:17,052 --> 00:31:20,982
If you have ephemeral storage on
the node, that'll start getting copied.

619
00:31:21,042 --> 00:31:24,252
Obviously, how much there is will
determine how long it takes.

620
00:31:24,752 --> 00:31:29,572
Once the two nodes have identical
copies of the data, that's when

621
00:31:29,572 --> 00:31:33,742
the live migration controller will say it's
time to cut over. It will cut the

622
00:31:34,102 --> 00:31:38,252
connections from one, pause it and put
it into a paused state in containerd,

623
00:31:38,802 --> 00:31:40,992
then it will unpause on the new node.

624
00:31:41,052 --> 00:31:42,222
It'll come up with a new name.

625
00:31:42,402 --> 00:31:44,232
Right now we call 'em
clone one, clone two.

626
00:31:44,232 --> 00:31:45,312
We just add clone to 'em.

627
00:31:45,312 --> 00:31:47,532
So you can tell which was the
before and which was the after.

628
00:31:47,872 --> 00:31:51,702
when that clone one unpauses,
then traffic will be going to it.

629
00:31:51,702 --> 00:31:55,422
It'll have the same exact IP address that
it had while it was on the previous node.

630
00:31:55,722 --> 00:31:57,852
All the traffic continues onto that node.

631
00:31:57,882 --> 00:32:02,232
It picks up exactly where it left off,
and the old pod disappears, right?

632
00:32:02,232 --> 00:32:07,302
The old pod gets shut down and torn down.
If you have something like a PVC attached,

633
00:32:07,362 --> 00:32:11,102
so if you've got an EBS PVC
attached, there is a longer pause

634
00:32:11,102 --> 00:32:13,052
because you have to do a detach reattach.

635
00:32:13,332 --> 00:32:15,252
with the API calls, it works.

636
00:32:15,312 --> 00:32:17,502
It just takes a little bit
longer for that pause state.

637
00:32:18,002 --> 00:32:20,992
That's the downfall of
having to work with APIs.

638
00:32:21,422 --> 00:32:23,812
it takes time to do an unbind
and rebind to the new node.

639
00:32:24,212 --> 00:32:25,142
but it works today.

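To see why the EBS case pauses longer, here is a hedged sketch of the detach/reattach step using boto3. The API calls and waiters are real AWS SDK surface; the IDs are made up, and in a real cluster this dance is driven by the CSI driver rather than hand-written code like this.

```python
# Why an EBS-backed PVC adds a pause: the volume must be detached and
# reattached via AWS API calls, and each transition takes real time.
# Requires: pip install boto3; all IDs below are illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME = "vol-0123456789abcdef0"  # illustrative
OLD_NODE = "i-0aaaaaaaaaaaaaaaa"  # source node's instance ID
NEW_NODE = "i-0bbbbbbbbbbbbbbbb"  # destination node's instance ID

ec2.detach_volume(VolumeId=VOLUME, InstanceId=OLD_NODE)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME])  # the slow part

ec2.attach_volume(VolumeId=VOLUME, InstanceId=NEW_NODE, Device="/dev/xvdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME])
```
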
640
00:32:25,142 --> 00:32:28,292
If you're using NFS where you can
do a multi-attach, then it's instant,

641
00:32:28,292 --> 00:32:29,612
it doesn't actually add any delay.

642
00:32:29,892 --> 00:32:32,262
it's just that NFS is a slower storage technology.

643
00:32:32,262 --> 00:32:35,162
So does that sort of make
sense from a high level?

644
00:32:35,662 --> 00:32:36,172
Yeah.

645
00:32:36,232 --> 00:32:39,882
when we talk about Cast AI as a
solution, does it do live migrations

646
00:32:39,882 --> 00:32:41,052
based on certain criteria?

647
00:32:41,112 --> 00:32:46,352
Is it making decisions around, if you
say I want to bin pack all the time,

648
00:32:46,852 --> 00:32:49,702
in the background, is it just like
doing live migrations on your behalf?

649
00:32:49,702 --> 00:32:53,812
Or is this something where you're
largely doing it with humans clicking

650
00:32:53,812 --> 00:32:56,362
buttons and controlling the chaos?

651
00:32:56,862 --> 00:32:59,322
No, this goes back to what we
had talked about at the beginning

652
00:32:59,322 --> 00:33:01,182
where automation is key.

653
00:33:01,602 --> 00:33:05,952
when a node is underutilized,
our bin packer is probably the

654
00:33:05,952 --> 00:33:07,302
most sophisticated on the market.

655
00:33:07,752 --> 00:33:11,682
It analyzes and runs live tests on
every node in the cluster of whether

656
00:33:11,682 --> 00:33:15,342
that node can be deleted, whether that
node doesn't need to be there anymore.

657
00:33:15,372 --> 00:33:18,942
And it'll simulate all the pods being
redistributed throughout the cluster.

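As a toy illustration of that simulation (Cast AI's real bin packer is far more sophisticated and weighs much more than CPU requests), here is the shape of the feasibility check: can every pod on a candidate node be placed into the spare capacity of the nodes that remain?

```python
# Toy feasibility check: can every pod on a candidate node be first-fit
# placed into the free capacity of the remaining nodes? Not Cast AI's
# algorithm, just the general idea of "can this node be deleted?".
def node_is_removable(candidate_pods, other_nodes_free):
    """candidate_pods: CPU requests (millicores) of pods on the node
    being considered for removal; other_nodes_free: free millicores
    per remaining node. Returns True if every pod fits elsewhere."""
    free = sorted(other_nodes_free, reverse=True)
    for pod in sorted(candidate_pods, reverse=True):  # biggest first
        for i, capacity in enumerate(free):
            if pod <= capacity:
                free[i] -= pod
                break
        else:
            return False  # some pod had nowhere to go
    return True

# A node holding 500m + 250m of requests can be drained into nodes with
# 600m and 300m free, so it could be deleted after live migration.
print(node_is_removable([500, 250], [600, 300]))  # True
```
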
658
00:33:19,342 --> 00:33:22,852
if the answer is we don't need this
node, it would automatically kick

659
00:33:22,852 --> 00:33:26,582
off a live migration of all the
pods on that node. Once it's empty,

660
00:33:26,792 --> 00:33:28,052
it'll just get garbage collected.

661
00:33:28,552 --> 00:33:31,072
once it's gone, all your pods
are running on the new nodes.

662
00:33:31,072 --> 00:33:32,482
Everything's moved seamlessly.

663
00:33:32,602 --> 00:33:33,982
You haven't seen any interruption.

664
00:33:34,322 --> 00:33:36,172
the cluster keeps continuing as normal.

665
00:33:36,282 --> 00:33:39,042
Most of our customers do
scheduled rebalances, so those

666
00:33:39,042 --> 00:33:40,362
are just in the background.

667
00:33:40,362 --> 00:33:45,542
It's evaluating how efficiently
designed the nodes in the cluster are.

668
00:33:46,022 --> 00:33:47,852
If the nodes in the cluster are

669
00:33:48,352 --> 00:33:51,382
not as efficient as they could be, and
different shapes and sizes would be

670
00:33:51,382 --> 00:33:55,172
better for, that setup at that point
in the day, based on the mixture of

671
00:33:55,172 --> 00:34:00,002
workloads there, it'll do a blue-green
deployment, set up new nodes, live

672
00:34:00,002 --> 00:34:02,972
migrate the workloads to those new
nodes and tear down the old ones.

673
00:34:02,972 --> 00:34:05,702
So everything that we're
talking about here can either

674
00:34:05,762 --> 00:34:07,262
be scheduled or it's automatic.

675
00:34:07,262 --> 00:34:09,122
It's running every few minutes on a cycle.

676
00:34:09,372 --> 00:34:11,542
but yeah, no, it's entirely
seamless to the users.

677
00:34:12,042 --> 00:34:12,522
Nice.

678
00:34:13,022 --> 00:34:17,432
So in the technical details, we're
moving the IP address, I think you had a

679
00:34:17,432 --> 00:34:22,102
diagram showing the pod on the nodes,
when we get down to the nitty gritty of

680
00:34:22,102 --> 00:34:26,692
Kubernetes-level stuff, the pod is recreated,
but pod names have to be unique. You

681
00:34:26,692 --> 00:34:29,682
can't have the IP on both nodes at once.

682
00:34:30,042 --> 00:34:33,342
And then there's the difference
between TCP and UDP and other, you

683
00:34:33,342 --> 00:34:38,282
know, IP protocols, and there's
a lot of little devils in the

684
00:34:38,282 --> 00:34:39,662
details that I'm super interested in.

685
00:34:39,662 --> 00:34:43,082
We won't have time to go into all
of it, but I do remember you showed

686
00:34:43,132 --> 00:34:46,422
the replica, the pod that you're
creating, step one is we create a

687
00:34:46,422 --> 00:34:47,622
pod and download an image, right?

688
00:34:47,622 --> 00:34:48,952
this is all still going
through containerd.

689
00:34:48,972 --> 00:34:52,572
So it's not, there's not like voodoo
magic in the background happening

690
00:34:52,572 --> 00:34:54,392
outside of the purview of containerd.

691
00:34:54,917 --> 00:34:56,177
maybe you can talk about that for a minute

692
00:34:56,367 --> 00:34:57,087
Right, Exactly.

693
00:34:57,087 --> 00:35:01,277
So by changing the pod name,
you now have a placeholder for

694
00:35:01,277 --> 00:35:03,317
your new pod information to go into.

695
00:35:03,797 --> 00:35:07,337
And it does maintain the same IP address
when it moves over from one to the other.

696
00:35:07,337 --> 00:35:11,107
So to your point, that's when that
switch has to kick in where the

697
00:35:11,107 --> 00:35:12,847
old pod definition disappears.

698
00:35:12,847 --> 00:35:16,687
And the new pod definition appears
in your control plane with the

699
00:35:16,687 --> 00:35:18,487
API calls up to the kube API.

700
00:35:18,797 --> 00:35:22,877
that cutover is extremely important
because you can't have the same

701
00:35:22,877 --> 00:35:24,677
pod living in the same place twice.

702
00:35:25,107 --> 00:35:28,047
that's why we do have to change
a name when we switch it over.

703
00:35:28,327 --> 00:35:31,417
there are certain services that cause
some tricky cases because they have an

704
00:35:31,417 --> 00:35:34,147
operator structure where they expect
there to be a certain pod name.

705
00:35:34,477 --> 00:35:37,432
So when you move it and add the clone
suffix to it, we're working on finding

706
00:35:37,432 --> 00:35:39,412
workarounds to that, in certain areas.

707
00:35:39,842 --> 00:35:42,632
that is a little bit tricky on
certain workloads because you can't

708
00:35:42,782 --> 00:35:45,242
have the same pod existing with the
same name in two different spots.

709
00:35:45,242 --> 00:35:46,232
They have to be unique.

710
00:35:46,592 --> 00:35:47,672
but yeah, definitely.

711
00:35:48,172 --> 00:35:52,842
the IP is the same, but the pod
name is gonna have a clone dash

712
00:35:52,842 --> 00:35:54,252
one or something like that on it

713
00:35:54,252 --> 00:35:54,462
Yep.

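The naming half of that can be pictured with the official Kubernetes Python client: read the source pod and create a uniquely named clone pinned to the destination node. This sketch only shows the pod-name bookkeeping; preserving the IP, memory, and TCP state happens below the Kubernetes API in Cast AI's containerd/CNI layer and is not shown here, and the names are invented for the example.

```python
# Illustrates only the naming side of the cutover. A pod created this
# way would start fresh; the memory/IP/TCP preservation happens below
# the API in the containerd/CNI layer and is NOT shown here.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

src = v1.read_namespaced_pod(name="video-processor-0", namespace="default")

clone = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name=src.metadata.name + "-clone-1",  # pod names must be unique
        namespace=src.metadata.namespace,
        labels=src.metadata.labels,
    ),
    spec=src.spec,  # same containers, volumes, etc.
)
clone.spec.node_name = "ip-10-0-2-99.ec2.internal"  # pin to the destination node

v1.create_namespaced_pod(namespace="default", body=clone)
```
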
714
00:35:54,512 --> 00:35:58,672
yeah, so it starts with your pod
and then there's an event that

715
00:35:58,672 --> 00:36:04,782
happens outside of the pod that is
talking into containerd, and you're...

716
00:36:05,218 --> 00:36:06,088
It's a second pod.

717
00:36:06,588 --> 00:36:07,038
Correct.

718
00:36:07,068 --> 00:36:08,268
It's adding that placeholder.

719
00:36:08,268 --> 00:36:11,358
And because the placeholder pod is
actually still in a paused state,

720
00:36:11,568 --> 00:36:13,188
it can have the same IP address.

721
00:36:13,238 --> 00:36:16,238
it's not actually routing traffic to
it, 'cause it's not an active pod yet.

722
00:36:16,608 --> 00:36:18,228
it'll be in a staged state.

723
00:36:18,678 --> 00:36:20,838
So you stage it up with all
the information, but it has

724
00:36:20,838 --> 00:36:22,278
to be named differently.

725
00:36:22,428 --> 00:36:25,068
And then when you do the cutover,
that's when you switch it from

726
00:36:25,068 --> 00:36:28,668
being inactive to active and
switch the old pod to be inactive.

727
00:36:29,008 --> 00:36:32,398
and that's the final stage, when
the clone pod becomes the primary.

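The pause/unpause primitive itself can be poked at with containerd's stock ctr tool. The subcommands below are real; the task ID is made up, and Cast AI drives this programmatically through a containerd plugin rather than the CLI.

```python
# Picture the pause/unpause step with containerd's stock `ctr` CLI
# (real subcommands; the task ID is illustrative).
import subprocess

CONTAINER = "my-task-id"  # illustrative containerd task ID

# Freeze the task's processes (cgroup freezer) on the source...
subprocess.run(
    ["ctr", "--namespace", "k8s.io", "tasks", "pause", CONTAINER], check=True
)

# ...and, at cutover, the staged clone on the destination is thawed.
subprocess.run(
    ["ctr", "--namespace", "k8s.io", "tasks", "resume", CONTAINER], check=True
)
```
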
728
00:36:32,848 --> 00:36:36,078
And because it's maintained the
exact, IP address within the

729
00:36:36,078 --> 00:36:38,848
system, it's not losing any traffic.

730
00:36:39,178 --> 00:36:42,538
So the networking system within
Kubernetes routes it to the new node

731
00:36:42,628 --> 00:36:45,698
and the routing tables are updated and
the traffic goes to the new destination.

732
00:36:46,087 --> 00:36:50,227
Yeah, that does sound like the hard
part of like the old pod is shut

733
00:36:50,227 --> 00:36:52,417
down so the IP can be released.

734
00:36:52,417 --> 00:36:53,497
I assume the IP

735
00:36:53,997 --> 00:36:57,057
can't be taken over while
that old pod is still active.

736
00:36:57,067 --> 00:37:00,427
it's one of these things where it's like,
I understand it at the theory level, but

737
00:37:00,437 --> 00:37:05,717
I have no idea what containerd and kube-proxy
and all these different things that

738
00:37:05,717 --> 00:37:11,337
are binding to a virtual interface and the
order of the things that have to happen

739
00:37:11,337 --> 00:37:16,587
in the exact right sequence in order for
you to first assign that IP to the new

740
00:37:16,737 --> 00:37:19,407
node and then also replay all the packets.

741
00:37:19,457 --> 00:37:22,997
it does seem like a very discrete
order of things that have to happen.

742
00:37:22,997 --> 00:37:25,667
it has to go in a certain order or
they'll just all fail, it feels like.

743
00:37:25,993 --> 00:37:28,093
so that's the part that took
us about a year to figure out.

744
00:37:29,663 --> 00:37:32,483
there had been a lot of studies
and some research papers around,

745
00:37:32,533 --> 00:37:34,933
the moving of memory and the
snapshotting of different workloads.

746
00:37:34,933 --> 00:37:38,053
Like that part was a little bit
more straightforward because it

747
00:37:38,053 --> 00:37:41,083
was really out of the vMotion
playbook days, from early on.

748
00:37:41,423 --> 00:37:44,183
there were also some college
studies around using CRIU to

749
00:37:44,303 --> 00:37:45,893
replicate and migrate containers.

750
00:37:46,393 --> 00:37:49,633
None of them had been able to
solve the IP side of things,

751
00:37:49,633 --> 00:37:50,983
the connectivity side of things.

752
00:37:51,303 --> 00:37:53,493
that's what Cast AI was able to solve for.

753
00:37:53,493 --> 00:37:56,083
and it took a lot of research,
took a lot of in-depth work.

754
00:37:56,473 --> 00:38:02,143
We started on this, early 2024,
with a team of about five engineers,

755
00:38:02,643 --> 00:38:05,973
deep kernel level Linux engineers,
Kubernetes engineers, people

756
00:38:05,973 --> 00:38:07,603
very familiar with the code.

757
00:38:07,883 --> 00:38:11,503
they contributed to the Kubernetes
open source project. It was

758
00:38:12,003 --> 00:38:15,123
10 months before we had a demo
and that was using Calico.

759
00:38:15,153 --> 00:38:18,033
before we could demo, we had to
have a custom AMI at that point in

760
00:38:18,033 --> 00:38:22,363
time in AWS because everything was
kernel level at the AMI level.

761
00:38:22,763 --> 00:38:26,723
we knew that was not feasible going
forward to production, but that

762
00:38:26,723 --> 00:38:29,723
was the first demoable version. Like
anything else, there's a lot of warts

763
00:38:29,723 --> 00:38:31,373
and vaporware in the first version.

764
00:38:31,823 --> 00:38:35,873
since then we were able to move
the logic up to a containerd plugin

765
00:38:35,873 --> 00:38:37,493
which makes it a lot more portable.

766
00:38:37,493 --> 00:38:39,263
Now it can be applied to different clouds.

767
00:38:39,263 --> 00:38:40,463
It's much less invasive.

768
00:38:40,463 --> 00:38:41,963
You don't need a specific AMI.

769
00:38:42,343 --> 00:38:47,003
and under the hood, we were able
to move it to the AWS VPC CNI.

770
00:38:47,033 --> 00:38:48,803
So you don't need the custom Calico CNI.

771
00:38:48,803 --> 00:38:53,073
all of those were iterative steps to
build this and make it more production

772
00:38:53,073 --> 00:38:55,083
viable and adoptable by the industry.

773
00:38:55,583 --> 00:38:58,583
Now it's a matter of, we've
got kind of two forks going on.

774
00:38:58,703 --> 00:39:02,243
One is continuing to build out
additional platforms, figuring out

775
00:39:02,243 --> 00:39:04,253
Cilium, figuring out Azure CNI.

776
00:39:04,303 --> 00:39:09,153
The other is performance tuning the
existing migrations, reducing time to

777
00:39:09,153 --> 00:39:13,583
migrate, being able to reduce the size
of the deltas down further and further

778
00:39:13,583 --> 00:39:15,413
so we can migrate it faster and faster.

779
00:39:15,723 --> 00:39:17,733
so we've got kind of those
two tracks right now.

780
00:39:17,793 --> 00:39:19,863
the team's up to I think
10 or 12 engineers.

781
00:39:19,893 --> 00:39:23,763
kind of working on those two paths,
and this is probably one of our most

782
00:39:23,763 --> 00:39:27,393
heavily invested areas in the company,
being able to further this technology.

783
00:39:27,423 --> 00:39:29,003
'cause we see how much value it brings.

784
00:39:29,387 --> 00:39:34,437
Yeah, I imagine it won't be very
long where, you know, this technology

785
00:39:34,497 --> 00:39:38,257
is pretty advanced, but others will
probably eventually catch up, if it's truly

786
00:39:38,287 --> 00:39:39,637
the thing that we all are looking for.

787
00:39:39,637 --> 00:39:43,077
And it sounds like, it feels like it is,
it feels like the kind of tooling where

788
00:39:43,577 --> 00:39:47,027
it's a hard problem to solve and we'll
maybe see other people attempt to do it.

789
00:39:47,057 --> 00:39:48,467
I mean, the

790
00:39:48,967 --> 00:39:50,377
research I had to do for the show,

791
00:39:50,377 --> 00:39:51,097
'cause I was very curious.

792
00:39:51,097 --> 00:39:52,387
I was like, what's the
history of all this?

793
00:39:52,387 --> 00:39:55,287
And someone mentioned CRIU,
which I believe you're

794
00:39:55,287 --> 00:39:56,367
using at least some of that.

795
00:39:56,707 --> 00:39:59,527
that's a project that's been around
for quite some time, over a decade.

796
00:39:59,527 --> 00:40:04,387
And it's not a new idea, but like
a lot of these other technologies,

797
00:40:04,387 --> 00:40:05,707
the devil's in the details.

798
00:40:05,757 --> 00:40:10,167
we never really had an ability to
capture and understand what a binary's

799
00:40:10,797 --> 00:40:14,647
true dependencies were, whether
it's disk or networking things.

800
00:40:15,037 --> 00:40:18,567
until we had containers. You mentioned
on here, like, it's in LXC, it's

801
00:40:18,567 --> 00:40:21,497
in Docker, it's in Podman, like
this tool is actually used widely.

802
00:40:21,547 --> 00:40:25,577
it's just maybe not well known
to us end users because it's

803
00:40:25,577 --> 00:40:27,257
packaged as a part of other tooling.

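That is easy to verify at home: Docker's experimental checkpoint feature is CRIU under the hood. The docker subcommands below are real, assuming an experimental-mode daemon, CRIU installed, and a running container with the (made-up) name shown.

```python
# CRIU shows up in everyday tooling: Docker's experimental checkpoint
# feature uses it under the hood. Real docker subcommands; the container
# name is illustrative, and the daemon must run in experimental mode.
import subprocess

# Checkpoint a running container's full process state to disk...
subprocess.run(
    ["docker", "checkpoint", "create", "mycontainer", "ckpt1"], check=True
)

# ...then later resume it exactly where it left off.
subprocess.run(
    ["docker", "start", "--checkpoint", "ckpt1", "mycontainer"], check=True
)
```
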
804
00:40:27,677 --> 00:40:29,687
And I can sort of see a world where,

805
00:40:30,057 --> 00:40:34,017
if this becomes more widespread, you're
gonna end up with haves and have nots

806
00:40:34,017 --> 00:40:38,327
where my solution doesn't have live
migration or my solution does have live

807
00:40:38,327 --> 00:40:41,907
migration at some point, maybe it's
ubiquitous, you're building functionality

808
00:40:41,907 --> 00:40:45,787
on top of it, like your automation that
truly adds value around, spot instances

809
00:40:45,787 --> 00:40:49,807
where my company maybe has never done spot
instances because it was too risky for

810
00:40:49,807 --> 00:40:53,677
us and we didn't have the tooling to take
advantage of it without risking downtime.

811
00:40:53,727 --> 00:40:56,367
I definitely have a couple of clients
that I've worked with over the last

812
00:40:56,367 --> 00:40:59,377
couple of years that are like that,
where they're a hundred percent reserved

813
00:40:59,377 --> 00:41:03,387
instances because they want the cheapest,
but they also need to guarantee uptime.

814
00:41:03,387 --> 00:41:07,107
And they can't do that at a level
that live migrations would provide.

815
00:41:07,107 --> 00:41:11,477
So they have to pay that extra
surcharge for avoiding ephemeral

816
00:41:11,477 --> 00:41:12,737
instances and stuff like that.

817
00:41:12,787 --> 00:41:15,757
to me, it gives me comfort
that the technology stack

818
00:41:16,117 --> 00:41:18,937
is part open source,
part community driven.

819
00:41:18,937 --> 00:41:23,097
there's also the product, and private IP
side of this as well, but it's not like

820
00:41:23,097 --> 00:41:24,647
you're reinventing the Linux kernel.

821
00:41:24,647 --> 00:41:28,937
It wouldn't have been that long ago where
you had to actually throw in a Linux

822
00:41:28,937 --> 00:41:32,707
module or kernel module rather, that
would only work on certain operating

823
00:41:32,707 --> 00:41:36,037
system distributions of Linux and
that you would have to deploy a custom

824
00:41:36,367 --> 00:41:38,667
ISO. That wasn't that far in the past.

825
00:41:38,667 --> 00:41:41,937
But now that we've got all these modern
things, I don't know if eBPF is involved

826
00:41:41,937 --> 00:41:45,792
in this at all, but we've got more modern
abstractions that it feels like you can

827
00:41:45,792 --> 00:41:48,387
just plug and play as long as you've
got the right networking components.

828
00:41:48,437 --> 00:41:51,677
From an engineering perspective,
it's pretty awesome because it allows you

829
00:41:51,677 --> 00:41:53,057
to build stuff on the stack like this.

830
00:41:53,107 --> 00:41:55,837
the team's not on the call, but the
team that's developing this, good job,

831
00:41:55,837 --> 00:41:57,767
bravo, that's some great engineering.

832
00:41:57,797 --> 00:42:02,087
obviously anytime something is a year
long effort to crack a nut like this, I

833
00:42:02,087 --> 00:42:05,497
feel sad for the people that had a six
month, like, no one's gonna see this

834
00:42:05,497 --> 00:42:08,587
feature for a year and I'm gonna work all
year on it and I hope someone likes it.

835
00:42:08,587 --> 00:42:12,237
So that's from a software development
perspective, that's the hard part.

836
00:42:12,237 --> 00:42:13,967
That's the true engineering.

837
00:42:14,517 --> 00:42:14,907
Um.

838
00:42:14,943 --> 00:42:15,783
Yeah, absolutely.

839
00:42:16,230 --> 00:42:19,590
we could talk about this forever,
but people have their jobs to do.

840
00:42:19,590 --> 00:42:19,800
okay.

841
00:42:19,800 --> 00:42:22,470
How do people get started? Do they just
go sign up for Cast AI, and is this

842
00:42:22,470 --> 00:42:25,350
like a feature outta the box that they
can implement in their clusters, or...

843
00:42:25,607 --> 00:42:26,447
Yep, absolutely.

844
00:42:26,497 --> 00:42:30,307
it's in the UI now, so if people
want to sign up and onboard, we do

845
00:42:30,307 --> 00:42:33,637
recommend having somebody on our sales
engineering team work with folks.

846
00:42:33,637 --> 00:42:36,747
So reach out to us, We'll also reach
out when people sign up as well.

847
00:42:37,067 --> 00:42:38,327
it's all straightforward.

848
00:42:38,327 --> 00:42:39,107
There's no caveats.

849
00:42:39,107 --> 00:42:42,617
It's Helm charts to do the install,
and then you set up the autoscaler.

850
00:42:42,987 --> 00:42:46,017
we will be adding support
for Karpenter.

851
00:42:46,017 --> 00:42:48,817
Today it's using our autoscaler,
but we will support Karpenter

852
00:42:48,867 --> 00:42:50,967
around the end of Q4 or early Q1.

853
00:42:51,687 --> 00:42:52,317
Yeah, that's great.

854
00:42:52,387 --> 00:42:56,037
my usual co-host is with AWS, so
they would greatly appreciate that.

855
00:42:56,107 --> 00:42:58,167
I know that over the last
year, Karpenter's been

856
00:42:58,167 --> 00:42:58,987
out, a little over a year.

857
00:42:59,377 --> 00:43:02,807
We've had, a surprising number
of people on our Discord server.

858
00:43:02,957 --> 00:43:05,817
For those of you watching, there's
a Discord server you can join.

859
00:43:06,087 --> 00:43:08,367
There's a lot of people on Discord
talking about using Karpenter.

860
00:43:08,417 --> 00:43:10,747
I'm really impressed with
the uptake on that project.

861
00:43:10,747 --> 00:43:14,887
And in case you're wondering what
Karpenter is, it's with a K and it's

862
00:43:14,887 --> 00:43:18,387
for Kubernetes, you can look on this
YouTube channel later because we did a

863
00:43:18,392 --> 00:43:21,642
show on it and have had people talking
about it on the show and demoing it.

864
00:43:21,712 --> 00:43:22,762
whenever it was released.

865
00:43:22,762 --> 00:43:23,962
I think that was 2024.

866
00:43:23,962 --> 00:43:24,922
I can't remember exactly.

867
00:43:25,422 --> 00:43:27,642
alright, so everyone
knows how to get started.

868
00:43:27,642 --> 00:43:31,182
Everyone now knows that they wish they
had live migrations and they currently

869
00:43:31,182 --> 00:43:32,982
don't unless they're a Cast AI customer.

870
00:43:33,292 --> 00:43:34,432
Where can we find you on the internet?

871
00:43:34,432 --> 00:43:37,102
Where can people learn more
about what you're doing?

872
00:43:37,102 --> 00:43:38,632
Are you gonna be at conferences soon?

873
00:43:39,022 --> 00:43:41,512
I'm assuming Cast is probably
gonna have a booth at KubeCon again.

874
00:43:41,512 --> 00:43:42,627
They always seem to have a booth there.

875
00:43:43,127 --> 00:43:44,517
we've got a big booth
at KubeCon this year.

876
00:43:44,537 --> 00:43:45,797
I think we've got a 20 by 20.

877
00:43:45,797 --> 00:43:48,137
We're gonna be doing demos and
presentations in the booth.

878
00:43:48,187 --> 00:43:49,507
this is gonna be a big part of that.

879
00:43:49,877 --> 00:43:51,647
we'll also be at re:Invent in Vegas.

880
00:43:52,037 --> 00:43:54,917
In early December, I
guess it's the first week of December.

881
00:43:55,227 --> 00:43:56,727
so I'll be at both of those events.

882
00:43:56,727 --> 00:43:59,757
I'm also really active on LinkedIn, so
if anybody wants to reach out to me on

883
00:43:59,757 --> 00:44:03,207
LinkedIn, if you wanna set up a session
just to go into more detail, feel free

884
00:44:03,207 --> 00:44:06,527
to ping me I post a lot of Kubernetes
content in general, best practices,

885
00:44:06,737 --> 00:44:10,277
things that we see in the industry from
a Kubernetes evolution side of things,

886
00:44:10,277 --> 00:44:12,377
and also obviously a bunch of cast stuff.

887
00:44:12,717 --> 00:44:15,717
so, you know, feel free to follow or
connect. Happy to share more information.

888
00:44:15,717 --> 00:44:16,197
Awesome.

889
00:44:16,657 --> 00:44:20,527
well, I'm looking forward to hearing about
that continual proliferation of all things

890
00:44:20,527 --> 00:44:22,387
live migration on every possible setup.

891
00:44:22,727 --> 00:44:26,677
someday it'll be on FreeBSD with
some esoteric Kubernetes variant.

892
00:44:26,987 --> 00:44:29,137
It's, pretty cool to see
the evolution of this.

893
00:44:29,557 --> 00:44:31,717
Well, thank you both for
being here, Philip and Dan.

894
00:44:31,777 --> 00:44:32,407
see you all later.

895
00:44:32,907 --> 00:44:33,267
Ciao.

896
00:44:33,767 --> 00:44:36,017
Thanks for watching, and I'll
see you in the next episode.