1
00:00:00,000 --> 00:00:05,040
It's DevOps's time to shine because with the advent of agentic AI,

2
00:00:05,040 --> 00:00:08,220
coding, everything, people are just getting a lot more done themselves.

3
00:00:13,230 --> 00:00:15,030
Welcome to Screaming in the Cloud.

4
00:00:15,240 --> 00:00:16,500
I'm Corey Quinn.

5
00:00:16,500 --> 00:00:23,220
And my guest today is David Yanacek, who's a senior principal engineer over at AWS.

6
00:00:23,520 --> 00:00:24,810
David, thank you for joining me.

7
00:00:25,140 --> 00:00:26,940
Oh, it's, I'm so excited to be here.

8
00:00:26,940 --> 00:00:27,960
Thanks for having me on.

9
00:00:28,530 --> 00:00:28,860
Longtime fan.

10
00:00:29,820 --> 00:00:32,940
This episode is sponsored in part by my day job, Duckbill.

11
00:00:32,940 --> 00:00:36,120
Do you have a horrifying AWS bill?

12
00:00:36,390 --> 00:00:38,280
That can mean a lot of things.

13
00:00:38,520 --> 00:00:41,580
Predicting what it's going to be, determining what it

14
00:00:41,580 --> 00:00:45,390
should be, negotiating your next long-term contract with

15
00:00:45,390 --> 00:00:49,680
AWS, or just figuring out why it increasingly resembles a

16
00:00:49,905 --> 00:00:53,535
phone number, but nobody seems to quite know why that is.

17
00:00:53,835 --> 00:00:57,405
To learn more, visit duckbillhq.com.

18
00:00:57,705 --> 00:01:00,585
Remember, you can't duck the Duckbill

19
00:01:00,615 --> 00:01:01,725
bill, which my

20
00:01:01,725 --> 00:01:05,325
CEO reliably informs me is absolutely not

21
00:01:05,325 --> 00:01:05,985
our slogan.

22
00:01:06,720 --> 00:01:07,110
Well, thank you.

23
00:01:07,110 --> 00:01:07,920
I appreciate that.

24
00:01:07,920 --> 00:01:10,050
Most people don't admit to that in recorded media,

25
00:01:10,050 --> 00:01:12,150
so I'm taking that and putting it on the website.

26
00:01:12,150 --> 00:01:12,780
It'll be great.

27
00:01:12,960 --> 00:01:18,750
Uh, you are a lead advisor on the agentic AI team at AWS and, as

28
00:01:18,750 --> 00:01:22,920
recently reported by the New York Times, a trim man of 42 with a gray

29
00:01:22,920 --> 00:01:26,280
beard, which does not seem to be in evidence and a jittery intensity.

30
00:01:26,640 --> 00:01:27,900
Well, that will be in evidence.

31
00:01:27,900 --> 00:01:29,580
I think I've had plenty of coffee this morning.

32
00:01:29,890 --> 00:01:31,420
So what are you doing these days?

33
00:01:31,420 --> 00:01:32,080
What are you up to?

34
00:01:32,080 --> 00:01:36,399
You've been at AWS a very long time, and agentic

35
00:01:36,399 --> 00:01:40,000
AI has, well, not been a thing for nearly that long.

36
00:01:40,030 --> 00:01:40,899
What's your backstory?

37
00:01:41,199 --> 00:01:46,630
Well, yeah, I've spent the last around 20 years at Amazon and most of that at

38
00:01:46,630 --> 00:01:50,830
AWS, like there's kind of a gray area on exactly when I started in AWS because

39
00:01:51,380 --> 00:01:54,260
so much of that was on teams straddling both AWS and the

40
00:01:54,260 --> 00:01:56,840
rest of Amazon that made like the web service framework

41
00:01:57,080 --> 00:01:59,930
that all the AWS services and Amazon services use.

42
00:02:00,290 --> 00:02:03,500
So, I don't know, I'd say the whole time that I've been at

43
00:02:03,500 --> 00:02:07,040
Amazon out of college for 20 years has been with pretty much a

44
00:02:07,040 --> 00:02:11,510
singular purpose, and that's been to make developers' lives easier.

45
00:02:11,690 --> 00:02:16,820
Everything I've done seems to be on the, the same theme of I build something.

46
00:02:17,060 --> 00:02:19,610
I see what was painful about that and then

47
00:02:19,825 --> 00:02:21,600
I build a thing to make that thing less painful.

48
00:02:22,470 --> 00:02:25,200
When you say you're talking about making a developer's

49
00:02:25,200 --> 00:02:28,140
life easier, are you talking about developer experience?

50
00:02:28,140 --> 00:02:30,930
Are you talking about underlying capabilities of the platform?

51
00:02:30,930 --> 00:02:34,350
Because making developers' lives easier can cover an awful lot of ground.

52
00:02:34,950 --> 00:02:37,380
When I look at the most pain that I have, at

53
00:02:37,380 --> 00:02:39,210
least the way that I look at developer pain.

54
00:02:39,280 --> 00:02:42,820
It's around the ongoing maintenance and operations of a thing.

55
00:02:43,060 --> 00:02:47,110
Like that's just, I shouldn't say, well, yeah, pain, it's just the most

56
00:02:47,110 --> 00:02:51,100
amount of work that I'd rather not be doing is around ongoing operations.

57
00:02:51,490 --> 00:02:53,650
And as a developer, as a software developer,

58
00:02:53,650 --> 00:02:55,390
I like to automate my way out of that.

59
00:02:55,660 --> 00:02:58,900
Like that's, that's actually why we do, and always,

60
00:02:59,170 --> 00:03:01,240
from my view, have always done at Amazon: DevOps.

61
00:03:01,870 --> 00:03:05,590
To us, what that's meant has been developers do the operations.

62
00:03:05,680 --> 00:03:06,730
There is no separate thing.

63
00:03:06,790 --> 00:03:07,810
It's just one thing.

64
00:03:08,080 --> 00:03:10,090
Developers wear all the hats, just do everything.

65
00:03:10,480 --> 00:03:14,380
And so the nice property of that is that I've been building my way

66
00:03:14,380 --> 00:03:18,580
out of and automating away the annoying stuff the whole time.

67
00:03:18,870 --> 00:03:22,800
When I started, I was on this team that ran Amazon.com's web server fleets.

68
00:03:23,040 --> 00:03:25,590
Arguably not a DevOps team because we ran the web server

69
00:03:25,590 --> 00:03:28,620
fleet, but when you have hundreds of teams pushing code to the

70
00:03:28,620 --> 00:03:30,630
same web server environment, somebody about, you can't be a

71
00:03:30,630 --> 00:03:32,940
DevOps team when you're working with that much Perl. I'm

72
00:03:32,940 --> 00:03:35,250
pretty sure it's against the, uh, the bylaws somewhere.

73
00:03:35,490 --> 00:03:35,970
Yeah.

74
00:03:35,970 --> 00:03:36,270
Yeah.

75
00:03:36,420 --> 00:03:39,330
A lot of Perl to do the, the, to script the automation.

76
00:03:39,330 --> 00:03:42,180
A lot of Perl rendering the web, the HTML. Yeah.

77
00:03:42,180 --> 00:03:42,750
A lot of that.

78
00:03:42,870 --> 00:03:46,560
So in order to, to make operating that web server fleet easier,

79
00:03:47,110 --> 00:03:50,020
we had to build systems to help us automate it.

80
00:03:50,020 --> 00:03:54,460
So we built like alarm aggregation systems to turn instead of our pager

81
00:03:54,640 --> 00:03:58,540
like I have here, uh, going off like, uh, a hundred times one for every web

82
00:03:58,540 --> 00:04:02,260
server or a thousand times as we grew, uh, to turn that into just one page.

83
00:04:02,260 --> 00:04:03,370
So, but that takes, that takes time.

84
00:04:03,370 --> 00:04:03,700
Well, now you're an

85
00:04:03,700 --> 00:04:04,210
Amazonian.

86
00:04:04,210 --> 00:04:05,380
I mean, I assume you carry six pagers.

87
00:04:05,985 --> 00:04:09,165
Uh, just one, but I make sure that I, well, I guess I, no, I shouldn't say

88
00:04:09,165 --> 00:04:12,735
that because it's my pager, it's my phone to make sure that as a backstop

89
00:04:12,735 --> 00:04:15,945
does the text message and of course, the app, uh, that we have, so.

90
00:04:15,950 --> 00:04:16,170
Okay.

91
00:04:16,175 --> 00:04:16,454
Okay.

92
00:04:16,454 --> 00:04:16,815
Good point.

93
00:04:16,815 --> 00:04:17,685
I have three pagers.

94
00:04:17,894 --> 00:04:18,345
There we go.

95
00:04:18,345 --> 00:04:21,225
You'll get up to a full six pagers before appendices app later.

96
00:04:21,225 --> 00:04:22,035
It'll be great.

97
00:04:22,185 --> 00:04:25,965
Your timing on this is impeccable because as we record this, about an

98
00:04:25,965 --> 00:04:30,525
hour ago, you folks came out with the general availability of AWS DevOps Agent.

99
00:04:30,905 --> 00:04:34,355
So now you can talk about it freely, but it wasn't enough time

100
00:04:34,355 --> 00:04:37,415
for me to actually kick the tires on it and ask the embarrassing

101
00:04:37,415 --> 00:04:38,855
"Well, what about..." questions.

102
00:04:38,975 --> 00:04:40,315
So tell me, that was all planned.

103
00:04:40,390 --> 00:04:41,255
I, I assume it was.

104
00:04:41,255 --> 00:04:44,405
I basically, me talking to people dictates

105
00:04:44,405 --> 00:04:46,625
the entirety of AWS release schedules.

106
00:04:46,895 --> 00:04:48,545
What does DevOps Agent, what's it do?

107
00:04:48,560 --> 00:04:50,570
I'm guessing it's gonna make developers' lives easier.

108
00:04:50,570 --> 00:04:52,430
Just a hint, based upon how you started this.

109
00:04:52,490 --> 00:04:55,310
It really is, I mean, it's kind of the culmination of everything I've

110
00:04:55,310 --> 00:04:58,940
been trying to do over the last 20 years at AWS. I've been working hands-on

111
00:04:58,940 --> 00:05:04,490
building DevOps Agent now for, for some time, and what it does is it responds

112
00:05:04,670 --> 00:05:08,810
autonomously to operational incidents kind of before you open your laptop.

113
00:05:08,810 --> 00:05:12,260
It has hopefully fully root caused and suggests

114
00:05:12,260 --> 00:05:14,210
remediation steps for how to fix an alarm.

115
00:05:14,360 --> 00:05:16,040
It's also proactive.

116
00:05:16,229 --> 00:05:18,930
In that it will just scan through everything, sift through

117
00:05:18,930 --> 00:05:23,130
everything to find operational improvements that will prevent future

118
00:05:23,669 --> 00:05:26,549
incidents or future issues, optimize things for you.

119
00:05:26,729 --> 00:05:29,880
So I know nothing about the service yet, so forgive me

120
00:05:29,880 --> 00:05:32,669
if this turns into a really awkward line of questioning.

121
00:05:32,849 --> 00:05:35,340
But there have been a number of bites at this apple

122
00:05:35,340 --> 00:05:38,284
historically, mostly before the rise of gen AI.

123
00:05:38,665 --> 00:05:41,515
And the biggest stumbling block to all of this was either you

124
00:05:41,515 --> 00:05:45,265
had to build your systems in a very prescriptive way or spend a

125
00:05:45,265 --> 00:05:48,835
tremendous amount of time instrumenting and getting things aligned

126
00:05:48,835 --> 00:05:52,165
in such a way that these agents could then do anything with it.

127
00:05:52,165 --> 00:05:53,815
And that was work that very often just never got done.

128
00:05:54,570 --> 00:05:56,400
Has that, uh, nut been cracked?

129
00:05:56,909 --> 00:06:01,830
That is exactly why we're building DevOps Agent now and why I think this is

130
00:06:01,830 --> 00:06:06,539
gonna be the right, like, swing at the apple or whatever mixed metaphors you want.

131
00:06:06,989 --> 00:06:09,989
We've always been improving developers' lives by making new services.

132
00:06:10,599 --> 00:06:13,840
I, I worked on DynamoDB because I didn't want to have to do database ops

133
00:06:13,840 --> 00:06:17,409
anymore, so now I don't have to like set up replication and deal with backups.

134
00:06:17,590 --> 00:06:19,539
We built Lambda because I don't want to have to patch

135
00:06:19,539 --> 00:06:21,700
operating systems and deal with server failures.

136
00:06:22,120 --> 00:06:24,760
Those are great and they solve that real problem for customers, but

137
00:06:24,760 --> 00:06:28,090
of course they have to adopt, they have to do work in order to invest

138
00:06:28,090 --> 00:06:31,180
in using it in order to get and switch their application around.

139
00:06:31,180 --> 00:06:31,990
In order to get, just

140
00:06:31,990 --> 00:06:34,270
migrate your relational database to Dynamo

141
00:06:34,270 --> 00:06:36,789
is not really that straightforward of a lift.

142
00:06:36,909 --> 00:06:38,020
Yeah, the migration part.

143
00:06:38,020 --> 00:06:38,169
Yeah.

144
00:06:38,169 --> 00:06:39,010
Using it from scratch.

145
00:06:39,305 --> 00:06:39,635
Great.

146
00:06:39,635 --> 00:06:41,225
But yeah, no, it's unfortunate.

147
00:06:41,225 --> 00:06:43,775
It's like how do we make migration easier and we keep chasing this.

148
00:06:43,835 --> 00:06:47,375
The advent of LLMs is actually the magic here.

149
00:06:47,585 --> 00:06:48,965
Before there were, okay, everything has an

150
00:06:48,965 --> 00:06:50,975
API, you can make anything, talk to anything.

151
00:06:51,155 --> 00:06:53,795
Like that's what it, that the whole web service API service

152
00:06:53,795 --> 00:06:56,465
oriented architecture thing was about, but you have to do work to

153
00:06:56,465 --> 00:06:59,315
integrate those point integrations to make those work together.

154
00:06:59,765 --> 00:07:00,790
The advent of LLMs

155
00:07:01,635 --> 00:07:05,655
is that now, as long as you have... actually, arguably in the future you won't.

156
00:07:05,655 --> 00:07:08,205
But as long as you have an MCP interface to a thing that

157
00:07:08,205 --> 00:07:11,235
isn't special purpose for a point integration, you just have

158
00:07:11,505 --> 00:07:15,045
an MCP server that's essentially just documentation plus API.

159
00:07:15,465 --> 00:07:19,095
Once you have that, anything can talk to anything and it works great.

160
00:07:19,425 --> 00:07:20,205
And so it's, I'm

161
00:07:20,415 --> 00:07:21,945
going to say what I've been saying about that.

162
00:07:21,945 --> 00:07:25,140
I feel like I've already had one for many years and it's the AWS CLI tool.

163
00:07:25,595 --> 00:07:28,715
The CLI is great as an MCP server.

164
00:07:28,715 --> 00:07:32,525
In fact, the AWS MCP server essentially is just: pass a string

165
00:07:32,525 --> 00:07:34,145
that is the AWS CLI command.

166
00:07:34,415 --> 00:07:37,145
So yeah, that is a nice interface with documentation in the

167
00:07:37,145 --> 00:07:40,865
help, just making it tailored so it's easier for an LLM to use.

168
00:07:40,865 --> 00:07:44,405
But, so the key is with DevOps Agent is like that.

169
00:07:44,555 --> 00:07:47,225
Now, it adapts because of LLMs and MCP.

170
00:07:47,225 --> 00:07:48,215
It adapts to anything.

171
00:07:48,425 --> 00:07:49,175
But we've also

172
00:07:49,690 --> 00:07:54,130
built it from the beginning to be completely unopinionated about what

173
00:07:54,130 --> 00:07:57,430
you're using to do your operations, what infrastructure you're running

174
00:07:57,430 --> 00:08:00,940
on, what frameworks you're using, what instrumentation, what, what

175
00:08:00,940 --> 00:08:05,560
observability provider and tool, like everything. It plugs into everything.

176
00:08:05,560 --> 00:08:07,450
Thanks to LLMs and MCP.

177
00:08:07,840 --> 00:08:08,410
It works.

178
00:08:08,440 --> 00:08:10,780
Say, you know, we, we call it AWS DevOps Agent.

179
00:08:10,930 --> 00:08:13,210
That's because it's built by AWS, but it is

180
00:08:13,210 --> 00:08:15,700
unopinionated about whether or not you run on AWS.

181
00:08:15,985 --> 00:08:17,815
You could use this hypothetically to troubleshoot

182
00:08:17,815 --> 00:08:19,315
things in a completely different environment.

183
00:08:19,495 --> 00:08:19,975
That's right.

184
00:08:20,035 --> 00:08:23,065
In fact, like part of GA support was like just built in,

185
00:08:23,065 --> 00:08:25,735
baked in integration for applications running on Azure.

186
00:08:26,305 --> 00:08:28,945
It works for any cloud you run on as long as you

187
00:08:28,945 --> 00:08:32,605
provide an MCP server that can describe your resources.

188
00:08:32,935 --> 00:08:35,755
Some people have hobbies or they grow an herb garden.

189
00:08:35,755 --> 00:08:37,015
I do something very similar.

190
00:08:37,015 --> 00:08:39,505
I have a test Kubernetes cluster running in the spare room

191
00:08:39,565 --> 00:08:43,380
as one does, and I basically have been letting Claude

192
00:08:43,380 --> 00:08:46,290
Code tend the garden for me, for lack of a better term.

193
00:08:46,500 --> 00:08:48,750
I'll let it go ahead and run and make changes to it.

194
00:08:48,750 --> 00:08:51,089
And periodically I have to slap a chainsaw out of its

195
00:08:51,089 --> 00:08:53,790
hands because, oh, it looks like this thing isn't working.

196
00:08:53,790 --> 00:08:56,430
I'm gonna blow away the volume and recreate it. Like,

197
00:08:56,670 --> 00:08:57,660
there's data on that.

198
00:08:57,660 --> 00:08:59,010
Maybe don't do that.

199
00:08:59,280 --> 00:09:01,410
Uh, it's the, it's the guardrail thing of

200
00:09:01,560 --> 00:09:03,780
things that might make sense in some contexts.

201
00:09:03,780 --> 00:09:05,099
Could be dangerous in others.

202
00:09:05,130 --> 00:09:06,599
But again, there's nothing critical on this.

203
00:09:06,599 --> 00:09:09,954
I'm not that irresponsible, but how do you wind up, I guess,

204
00:09:10,600 --> 00:09:13,835
avoiding the temptation to do things that could themselves be destructive?

205
00:09:14,490 --> 00:09:15,960
For the LLM, not you personally.

206
00:09:15,960 --> 00:09:17,520
I mean, we all wanna burn the office down some days.

207
00:09:17,520 --> 00:09:17,939
I get it.

208
00:09:18,180 --> 00:09:21,060
AWS DevOps Agent, we're very intentional about

209
00:09:21,120 --> 00:09:23,850
what it can do, what data it can look at.

210
00:09:24,120 --> 00:09:27,060
So just like any, any AWS service, you configure, you give

211
00:09:27,060 --> 00:09:29,520
it permissions to say you're, you can look at these things.

212
00:09:29,760 --> 00:09:33,300
We actually scope that down so you can give us permissions and if you try

213
00:09:33,300 --> 00:09:36,360
to give us too much permissions, we'll actually just take fewer permissions.

214
00:09:36,689 --> 00:09:38,790
So we are very intentional about

215
00:09:39,075 --> 00:09:42,975
it can do read-only operations, only certain read-only operations.

216
00:09:43,335 --> 00:09:45,675
Just, uh, guardrails are basically

217
00:09:46,050 --> 00:09:48,750
like the main focus of the AWS DevOps Agent, and when

218
00:09:48,750 --> 00:09:51,540
it suggests, it says, Hey, you should, here's a fix.

219
00:09:51,540 --> 00:09:54,390
Here's how to get yourself out of this operational situation.

220
00:09:54,810 --> 00:09:59,970
For in-the-moment fixes, we, we produce kind of what we do at AWS.

221
00:09:59,970 --> 00:10:03,990
Whenever we make a change, like any manual change, we write down a very

222
00:10:04,079 --> 00:10:07,439
deliberate set of steps in a very deliberate order of, okay, I'm going to

223
00:10:07,665 --> 00:10:10,365
run some commands to make sure that the world is what I think it is.

224
00:10:10,365 --> 00:10:13,545
These are pre-validations. I'm gonna record the current state of the world.

225
00:10:13,545 --> 00:10:16,485
I'm gonna make the change, have my rollback steps ahead of time and

226
00:10:16,485 --> 00:10:19,995
post validation steps to make sure that this had the effect that I want.

227
00:10:20,175 --> 00:10:22,095
We call these, uh, change management documents.

228
00:10:22,395 --> 00:10:24,645
That's how we present any suggested change.

229
00:10:24,675 --> 00:10:28,005
A very deliberate and methodical thing that you can use

230
00:10:28,070 --> 00:10:30,410
to execute that change safely.

231
00:10:30,620 --> 00:10:33,830
And then when we suggest a coding follow up change, we give you a,

232
00:10:34,100 --> 00:10:37,430
an agent-ready specification to give the agent all the context

233
00:10:37,430 --> 00:10:41,150
about the what and the why so that it, it goes for the right approach.

234
00:10:41,540 --> 00:10:45,020
One thing that I've always been a fan of when it comes to letting these agents

235
00:10:45,020 --> 00:10:49,760
run loose, in prescribed ways, let's be clear, has been that they

236
00:10:49,760 --> 00:10:53,595
sort of embody that, that spirit I always look for in SRE hires or DevOps folk,

237
00:10:53,925 --> 00:10:56,415
with the, uh, never give up, never surrender approach.

238
00:10:56,535 --> 00:10:59,055
Whenever they get blocked in a particular troubleshooting

239
00:10:59,055 --> 00:11:02,415
direction, they'll come up with some way to solve the problem.

240
00:11:02,535 --> 00:11:04,694
And some of 'em have been pretty freaking creative.

241
00:11:04,755 --> 00:11:10,094
tcpdumps after, uh, netcat experiments being fired off, SSHing into the far side.

242
00:11:10,334 --> 00:11:14,354
Just weird in-depth stuff, stracing to see what the thing's actually doing.

243
00:11:14,354 --> 00:11:18,915
It's, it gets really deep, really quickly, and I love that aspect of it.

244
00:11:19,260 --> 00:11:21,360
So this almost feels like there's a balancing act of how

245
00:11:21,360 --> 00:11:24,510
do you let it continue to drive for those solutions while

246
00:11:24,510 --> 00:11:27,060
not also letting it drive your company into the ground?

247
00:11:27,540 --> 00:11:30,120
That's definitely why we, with DevOps Agent, we said,

248
00:11:30,180 --> 00:11:32,580
okay, for production operations, let's give it a very

249
00:11:32,700 --> 00:11:36,390
specific subset of tools, not the shell, that type of thing.

250
00:11:36,450 --> 00:11:39,720
Until we have that proof that things like shell commands, uh,

251
00:11:39,750 --> 00:11:43,350
writing its own code are safe, then we're gonna continue with

252
00:11:43,350 --> 00:11:47,340
these, like, very, well, easy-to-understand perimeter and guardrails.

253
00:11:47,699 --> 00:11:50,520
So you've obviously been running this a fair bit yourself in your

254
00:11:50,520 --> 00:11:54,660
environment, which is always dangerous in that it works super well at

255
00:11:54,720 --> 00:11:59,100
companies shaped like Amazon is not as broadly applicable as one might think.

256
00:11:59,280 --> 00:12:03,209
So there's the, okay, now you start testing it on pre-launch customers.

257
00:12:03,209 --> 00:12:07,079
I believe it was in public beta for a while after re:Invent, and

258
00:12:07,665 --> 00:12:09,495
what have you seen as this has unfolded?

259
00:12:09,495 --> 00:12:11,985
What areas does it excel in and what areas is it

260
00:12:11,985 --> 00:12:14,445
still not doing as well of a job as you might hope?

261
00:12:14,745 --> 00:12:18,345
I'd say it's, it's pretty good at figuring out these, sifting

262
00:12:18,345 --> 00:12:21,765
through a ton of operational data and figuring out something that,

263
00:12:21,915 --> 00:12:25,755
like a little needle in the haystack, that can explain an alarm

264
00:12:25,755 --> 00:12:29,565
incident that would've taken a long time to, to figure out.

265
00:12:29,805 --> 00:12:31,605
There's a company, Western Governors

266
00:12:31,605 --> 00:12:34,095
University, they ran it on a, a thing that had,

267
00:12:34,410 --> 00:12:37,410
like, kind of reran it. They adopted it and then took an

268
00:12:37,410 --> 00:12:40,469
incident that took them two hours to figure out the root cause.

269
00:12:40,800 --> 00:12:42,719
This is a while ago, an earlier version of

270
00:12:42,719 --> 00:12:44,880
DevOps Agent, or an earlier incarnation of it.

271
00:12:45,120 --> 00:12:48,510
So in that two hours that they had done it manually and it figured it out in

272
00:12:48,660 --> 00:12:54,089
28 minutes. The applications... I, I found myself making a lot of applications

273
00:12:54,319 --> 00:12:58,280
that have nothing to do with AWS, doing it not the Amazon way, like the internal

274
00:12:58,280 --> 00:13:02,600
Amazon way because, uh, for this exact reason to make sure like it, it's

275
00:13:02,600 --> 00:13:05,930
actually teaching me a lot about the tools that exist in the real world too.

276
00:13:05,930 --> 00:13:06,620
It's a lot of fun.

277
00:13:06,830 --> 00:13:10,699
Like even then, like when we started, when I, when I first set it up at re:Invent

278
00:13:10,699 --> 00:13:14,960
and used it, it root caused a certain, uh, operational issue in like 11 minutes.

279
00:13:14,990 --> 00:13:15,620
Uh, and

280
00:13:15,880 --> 00:13:20,020
now I reran that scenario just a week ago, and it took four minutes.

281
00:13:20,020 --> 00:13:21,520
So we've been making a lot of improvements.

282
00:13:21,520 --> 00:13:22,120
It's very good

283
00:13:22,120 --> 00:13:23,320
at that alarm...

284
00:13:23,320 --> 00:13:24,160
Is that consistent?

285
00:13:24,160 --> 00:13:27,040
Or if you run it a third time, would it then take 45 minutes?

286
00:13:27,040 --> 00:13:29,500
Like there, the, one of the things I found is it feels like Claude,

287
00:13:29,500 --> 00:13:32,170
which is what I use for a lot of this stuff in its Claude Code incarnation.

288
00:13:32,380 --> 00:13:35,200
It has smart days and dumb days, and it,

289
00:13:35,200 --> 00:13:36,820
it's, you never know what you're gonna get.

290
00:13:36,820 --> 00:13:37,930
It's like roll for initiative.

291
00:13:37,930 --> 00:13:39,310
Every time you have it dive into something.

292
00:13:39,460 --> 00:13:40,870
It's, it's relatively consistent.

293
00:13:40,870 --> 00:13:41,260
I think.

294
00:13:41,260 --> 00:13:42,380
Um, a lot of the, what we

295
00:13:42,775 --> 00:13:45,444
are building behind the scenes around context, around

296
00:13:45,444 --> 00:13:47,635
your application, learning about your application.

297
00:13:47,875 --> 00:13:49,824
I guess it's not the same every time because

298
00:13:49,824 --> 00:13:51,805
we're learning from past troubleshooting.

299
00:13:51,895 --> 00:13:54,204
That's obviously a trade off of making it so that it doesn't

300
00:13:54,204 --> 00:13:57,834
like have target fixation on a specific, a specific past run.

301
00:13:58,105 --> 00:13:59,965
I know it's DNS. I just have to prove it.

302
00:14:00,834 --> 00:14:02,515
Well, that's, that is actually baked in.

303
00:14:02,574 --> 00:14:03,355
Unfortunately.

304
00:14:03,715 --> 00:14:04,255
It's always dns.

305
00:14:04,645 --> 00:14:05,665
There's no way for it not to be.

306
00:14:05,665 --> 00:14:07,015
The problem, of course, is when it's DNS, it...

307
00:14:07,680 --> 00:14:09,720
when it can't find its API endpoints, well

308
00:14:09,720 --> 00:14:11,190
suddenly we're not going to space today.

309
00:14:11,190 --> 00:14:12,900
But yeah, that, that's always the challenge.

310
00:14:13,110 --> 00:14:17,070
How prescriptive is it as far as this troubleshooting methodology?

311
00:14:17,070 --> 00:14:18,390
Is that baked in?

312
00:14:18,540 --> 00:14:21,150
Does it follow whatever computer equivalent gut instinct is?

313
00:14:21,630 --> 00:14:25,170
This episode is sponsored by my own company, Duckbill.

314
00:14:25,465 --> 00:14:25,795
Having

315
00:14:25,795 --> 00:14:28,525
trouble with your AWS bill? Perhaps it's

316
00:14:28,525 --> 00:14:31,075
time to renegotiate a contract with them.

317
00:14:31,435 --> 00:14:33,595
Maybe you're just wondering how to predict

318
00:14:33,595 --> 00:14:36,805
what's going on in the wide world of AWS.

319
00:14:36,895 --> 00:14:39,505
Well, that's where Duckbill comes in to help.

320
00:14:39,715 --> 00:14:42,445
Remember, you can't duck the Duckbill

321
00:14:42,445 --> 00:14:45,115
bill, which I am reliably informed by my

322
00:14:45,115 --> 00:14:48,510
business partner is absolutely not our motto.

323
00:14:48,850 --> 00:14:51,990
To learn more, visit duckbillhq.com.

324
00:14:52,845 --> 00:14:56,745
We have baked in kind of how we go about troubleshooting into it.

325
00:14:56,835 --> 00:14:59,085
Uh, that, you know, helps, uh, we find

326
00:14:59,085 --> 00:15:01,695
for these kinds of specific alarm triage.

327
00:15:01,965 --> 00:15:06,735
'cause the goal of of alarm triage is actually not to understand the root cause.

328
00:15:06,735 --> 00:15:07,635
We call it root cause.

329
00:15:07,635 --> 00:15:09,735
But that's actually not ultimately the goal.

330
00:15:09,975 --> 00:15:12,885
The goal is to figure out what mitigation step would make

331
00:15:13,165 --> 00:15:14,185
the problem

332
00:15:14,185 --> 00:15:14,905
stop, the customer

333
00:15:14,905 --> 00:15:15,655
impact stop.

334
00:15:15,834 --> 00:15:16,765
That's actually the goal.

335
00:15:16,765 --> 00:15:17,905
And so we baked that in.

336
00:15:17,964 --> 00:15:22,194
Um, talked about this in, in some past re:Invent talks, a somewhat

337
00:15:22,314 --> 00:15:25,375
ridiculous name for it, but we call it the Grand Unified Theory of

338
00:15:25,375 --> 00:15:29,694
Incident Management is that, uh, any production impact can be mitigated

339
00:15:29,694 --> 00:15:32,995
through the pursuit of looking to see whether it was a

340
00:15:32,995 --> 00:15:36,564
deployment, like a change, that, that broke it, a change in inputs,

341
00:15:36,610 --> 00:15:38,950
like a traffic spike or, or passing new

342
00:15:38,950 --> 00:15:41,110
parameters or stopping passing parameters.

343
00:15:41,380 --> 00:15:44,530
A failed component like the application crashing on one server, one

344
00:15:44,530 --> 00:15:48,310
availability zone, a dependency, uh, or running out of something.

345
00:15:48,430 --> 00:15:50,050
These are kind of all recursive.

346
00:15:50,140 --> 00:15:52,930
They kind of refer, they point to other parts of, oh, it's not...

347
00:15:53,180 --> 00:15:55,699
it's, if, if there's a change in traffic, well let's

348
00:15:55,699 --> 00:15:58,219
go investigate the caller to see if they changed.

349
00:15:58,370 --> 00:16:01,099
They did a code deployment, so it's a recursive kind of exploration.

350
00:16:01,099 --> 00:16:03,530
We found that to be very useful, and so we've baked this

351
00:16:03,530 --> 00:16:06,859
kind of thing into the agent, but we've also known that

352
00:16:06,920 --> 00:16:09,349
we, we also know that that isn't always everybody's goal.

353
00:16:09,380 --> 00:16:11,180
Their goal isn't to always find mitigation.

354
00:16:11,180 --> 00:16:13,969
Sometimes they really just have some ad hoc operational

355
00:16:13,969 --> 00:16:17,000
tasks that they want to figure out what's going on.

356
00:16:17,445 --> 00:16:19,815
So that would, that's why for GA... it has a little

357
00:16:19,815 --> 00:16:22,395
bit less of the customers using it every day.

358
00:16:22,545 --> 00:16:26,025
But we've launched on demand tasks, so you can ask any kind of question.

359
00:16:26,025 --> 00:16:28,815
It's a little bit less fixated on finding the root

360
00:16:28,815 --> 00:16:31,305
cause of an alarm and more just helping you in general.

361
00:16:31,365 --> 00:16:34,725
So that, I think, has a little bit less usage of it so far.

362
00:16:34,785 --> 00:16:40,335
But we know that operations is an endless set of very varied things.

363
00:16:40,335 --> 00:16:43,490
It's not just responding to alarms and that's where we're trying to help people.

364
00:16:44,490 --> 00:16:48,479
Where do you find that the differentiation comes in, as opposed to

365
00:16:48,540 --> 00:16:52,110
my, you know, crappy backyard experiment of just go ahead and run Claude

366
00:16:52,110 --> 00:16:55,349
Code in dangerously-skip-permissions mode, let it tear into the thing and

367
00:16:55,349 --> 00:16:59,880
have fun, versus like the, the hyper lockdown approach that some shops

368
00:16:59,880 --> 00:17:02,939
take where everything it does has to be vetted by a change committee.

369
00:17:03,209 --> 00:17:06,480
One of those leads to faster outcomes, but less

370
00:17:06,480 --> 00:17:08,800
system stability, and the other feels like

371
00:17:09,145 --> 00:17:10,944
it's a lot safer, but it doesn't get anything done.

372
00:17:10,944 --> 00:17:12,204
It feels like there's a continuum there.

373
00:17:12,655 --> 00:17:12,895
Yeah.

374
00:17:12,895 --> 00:17:16,074
I feel like striking that balance is the important part, and that's where

375
00:17:16,074 --> 00:17:20,155
we're going with the change safety, but still letting

376
00:17:20,155 --> 00:17:24,744
it have the, uh, creativity if you will, of exploring down any avenue

377
00:17:24,954 --> 00:17:28,614
that it might want to chase down.

378
00:17:29,155 --> 00:17:31,764
You've done a lot of work in your career around the

379
00:17:31,764 --> 00:17:34,705
developer experience space, the DevOps world as it is.

380
00:17:34,735 --> 00:17:37,314
Uh, where do you think the future of DevOps lies?

381
00:17:37,680 --> 00:17:40,560
There have been a lot of flavors of companies organizing

382
00:17:40,560 --> 00:17:44,070
around operations, uh, and individuals around how they

383
00:17:44,070 --> 00:17:46,380
get things done, uh, how they keep things running.

384
00:17:46,470 --> 00:17:49,980
Uh, there's SRE approaches where you have a kind of a

385
00:17:50,340 --> 00:17:54,630
frontline team who also is automating things, just on a more

386
00:17:54,770 --> 00:17:57,950
pure operations side, where there's, like, an incident response team.

387
00:17:58,370 --> 00:18:01,460
And what I think of as DevOps, which of course is a

388
00:18:01,460 --> 00:18:05,270
word that has taken a lot of turns and changes over the years.

389
00:18:05,330 --> 00:18:05,420
Uh, it

390
00:18:05,420 --> 00:18:07,910
encompasses so many things, like you can't buy

391
00:18:07,910 --> 00:18:10,010
DevOps, but I sure would like to sell it to you.

392
00:18:10,040 --> 00:18:10,340
Yeah.

393
00:18:10,370 --> 00:18:14,810
It's become a panacea for a while of, it just means you're

394
00:18:14,810 --> 00:18:17,900
a sysadmin, but if you call yourself DevOps, you'll make 40% more.

395
00:18:17,960 --> 00:18:18,200
Great.

396
00:18:18,200 --> 00:18:18,770
Good for you.

397
00:18:18,770 --> 00:18:19,520
Get the bag.

398
00:18:19,700 --> 00:18:22,040
Uh, then that became SRE, then it became platform

399
00:18:22,040 --> 00:18:23,960
engineering, and who only knows what it is this year.

400
00:18:23,975 --> 00:18:24,635
That's right.

401
00:18:24,754 --> 00:18:29,314
I think DevOps, and I'm obviously biased because this is how I've been operating

402
00:18:29,375 --> 00:18:34,504
and AWS generally operates for the last 20 years plus is I think DevOps.

403
00:18:35,175 --> 00:18:40,155
It's DevOps's time to shine, because with the advent of agentic AI,

404
00:18:40,155 --> 00:18:43,395
coding, everything, people are just getting a lot more done themselves.

405
00:18:43,425 --> 00:18:46,815
Like, I love wearing all the hats: talking to customers,

406
00:18:46,935 --> 00:18:49,845
thinking about the product, implementing it, running it.

407
00:18:49,845 --> 00:18:51,705
I like wearing the hats, uh, securing it.

408
00:18:52,125 --> 00:18:52,785
I like that.

409
00:18:52,815 --> 00:18:56,835
And I think now these tools are making that even more vital.

410
00:18:56,890 --> 00:18:59,290
To just be able to do more and more yourself

411
00:18:59,320 --> 00:19:02,530
and be a little bit more of a generalist, almost.

412
00:19:02,770 --> 00:19:05,980
And so I think DevOps is really the future of DevOps.

413
00:19:05,980 --> 00:19:08,170
With my definition, of course.

414
00:19:08,320 --> 00:19:08,590
Oh yeah.

415
00:19:08,590 --> 00:19:09,010
If you get well.

416
00:19:09,070 --> 00:19:10,540
But my version is always gonna be the future.

417
00:19:10,540 --> 00:19:11,530
Let me define the term.

418
00:19:11,530 --> 00:19:14,860
I mean, that's just common sense from where I sit.

419
00:19:15,010 --> 00:19:16,175
Software's no longer the bottleneck.

420
00:19:16,905 --> 00:19:22,035
You can do a whole bunch of things. Suddenly, I'm

421
00:19:22,035 --> 00:19:24,255
gonna have it write code to solve my specific problem.

422
00:19:24,255 --> 00:19:25,965
I've been doing that and it's great.

423
00:19:26,115 --> 00:19:27,165
Is this robust code?

424
00:19:27,165 --> 00:19:28,185
Absolutely not.

425
00:19:28,185 --> 00:19:30,435
I mean, one of the, as I'm sure you've seen building

426
00:19:30,435 --> 00:19:33,225
services and products, uh, everything you thought you knew

427
00:19:33,225 --> 00:19:35,205
about this gets thrown out the window the first time

428
00:19:35,205 --> 00:19:38,085
you put it in a customer's hands and they try to use it like a bad hammer.

429
00:19:38,535 --> 00:19:42,264
Other people's requirements, other people's usage

430
00:19:42,264 --> 00:19:46,885
patterns often cause software to no longer work the way that you wanted it to

431
00:19:46,885 --> 00:19:48,774
as soon as it violates one of those design constraints.

432
00:19:49,165 --> 00:19:50,365
Now I feel like that's okay.

433
00:19:50,365 --> 00:19:52,645
I just send Claude back to the mines to go ahead

434
00:19:52,645 --> 00:19:54,985
and build me another version that now does this.

435
00:19:55,135 --> 00:19:57,235
But back in the early days of the newsletter, when

436
00:19:57,235 --> 00:19:59,575
one day I wanted to start sending out blog posts, that

437
00:19:59,575 --> 00:20:01,855
took me three weeks to get that system to support that.

438
00:20:01,915 --> 00:20:04,705
Now it's just yell at the robot and go get a cup of coffee.

439
00:20:05,110 --> 00:20:07,870
Yeah, I think the key is that agents are

440
00:20:07,870 --> 00:20:10,900
super successful when they can see their own output.

441
00:20:11,140 --> 00:20:15,370
That's really the key is that they can, instead of, okay, let me have

442
00:20:15,370 --> 00:20:17,890
it build a thing and it'll tell me, you know, give it to me and then

443
00:20:17,890 --> 00:20:20,950
I'll try it and see that it failed and I'll paste the error back in.

444
00:20:20,950 --> 00:20:23,860
You know, that's the tedium. That's not the future.

445
00:20:24,070 --> 00:20:27,635
The future is where it can do more and more of that in that loop.

446
00:20:28,360 --> 00:20:30,790
That it runs until it works.

447
00:20:30,850 --> 00:20:34,780
And so that means going further into, well, okay, it'll produce

448
00:20:34,780 --> 00:20:37,720
a pull request for you 'cause it's run all the unit tests.

449
00:20:37,870 --> 00:20:39,670
Well what if it could do the integration tests?

450
00:20:39,850 --> 00:20:43,510
What if it could actually... we also have this AWS

451
00:20:43,510 --> 00:20:45,910
Security Agent that can do penetration testing for you.

452
00:20:46,120 --> 00:20:49,360
Like, what if you could have that be part of the CI/CD pipeline?

453
00:20:49,600 --> 00:20:51,470
What if you could have DevOps Agent

454
00:20:52,095 --> 00:20:55,425
kind of calling it back, saying, okay, this thing that you did didn't work?

455
00:20:55,605 --> 00:21:00,015
So extending that loop out beyond just your local machine, I think

456
00:21:00,015 --> 00:21:03,705
is, is very important in terms of avoiding a bottleneck where

457
00:21:03,705 --> 00:21:06,675
things will just pile up in your kind of deployment pipeline.

458
00:21:06,675 --> 00:21:11,535
With so much change, it's making sure the agent has the agency to

459
00:21:11,535 --> 00:21:14,320
see its own result even further, all the way through to production.

460
00:21:15,135 --> 00:21:18,465
One of the challenges I found with folks taking this approach has very

461
00:21:18,465 --> 00:21:23,475
often been their organization's own guardrails and boundaries, where

462
00:21:23,625 --> 00:21:27,105
even for a human sitting down and solving the problem, they're spending

463
00:21:27,105 --> 00:21:29,565
most of their time trying to get access into the right account to

464
00:21:29,565 --> 00:21:33,045
look at the logs put out by something, which is sort of a necessary

465
00:21:33,075 --> 00:21:36,014
first step unless you wanna black box troubleshoot, which no one does.

466
00:21:36,165 --> 00:21:40,004
It feels like companies do have to evolve to a point where tools,

467
00:21:40,335 --> 00:21:42,585
if not people, could be given access to some of these things.

468
00:21:43,065 --> 00:21:43,575
I agree.

469
00:21:43,575 --> 00:21:47,895
I think it's the, the teams who I've seen be the most

470
00:21:47,955 --> 00:21:51,375
productive in their spend, they make sure to set apart time.

471
00:21:51,375 --> 00:21:52,815
They give themselves some breathing room

472
00:21:52,815 --> 00:21:55,425
to say, okay, what should we smooth out?

473
00:21:55,425 --> 00:21:57,945
That was actually, I, it's kind of like these

474
00:21:57,945 --> 00:21:59,590
things that are going to make

475
00:22:00,345 --> 00:22:02,565
teams more productive with agents actually would've made the

476
00:22:02,565 --> 00:22:06,795
team more productive anyway, but now there's just even higher

477
00:22:06,795 --> 00:22:09,945
upside to it, so it's worth, it's really worth taking a step back.

478
00:22:09,945 --> 00:22:11,625
It's a nice excuse. I've always found

479
00:22:11,835 --> 00:22:16,695
having these nice excuses to take a step back and just breathe

480
00:22:16,695 --> 00:22:20,655
and improve developer life, even within a team, has been so great.

481
00:22:20,655 --> 00:22:23,415
Like when we realized many years ago that, like,

482
00:22:23,560 --> 00:22:26,590
our changes were actually piling up and, like, the time from

483
00:22:26,590 --> 00:22:31,060
check-in to actually reaching production was, I think weeks on average.

484
00:22:31,450 --> 00:22:33,790
And this was not ideal.

485
00:22:33,790 --> 00:22:35,889
We wanted to be able to innovate faster for people.

486
00:22:36,100 --> 00:22:37,750
And so we looked at this and this is when we're like,

487
00:22:37,750 --> 00:22:40,810
let's do this continuous deployment thing as an entire

488
00:22:40,810 --> 00:22:43,605
company, and that was just a great rallying North Star.

489
00:22:43,870 --> 00:22:47,500
We built an internal pipeline system so that everybody could manage

490
00:22:47,500 --> 00:22:50,800
their code as it was flowing through, automate as much as possible.

491
00:22:51,040 --> 00:22:53,169
But teams still had to do the things. Like,

492
00:22:53,375 --> 00:22:56,135
well, if teams were hesitant to turn it

493
00:22:56,135 --> 00:22:58,655
on kind of full auto, why were they hesitant?

494
00:22:58,655 --> 00:22:59,825
Well, we don't trust our tests.

495
00:22:59,855 --> 00:23:01,715
Okay, spend time and write the test.

496
00:23:01,745 --> 00:23:02,765
'cause it'll pay off.

497
00:23:03,005 --> 00:23:04,265
Like, oh, we don't trust the deploy.

498
00:23:04,505 --> 00:23:05,975
Well, okay, have better monitoring

499
00:23:05,975 --> 00:23:06,935
so it'll roll back.

500
00:23:06,935 --> 00:23:07,925
Well what about this edge case?

501
00:23:07,925 --> 00:23:08,885
Well monitor that.

502
00:23:09,095 --> 00:23:12,575
So it's really this nice rallying north star of how can

503
00:23:12,575 --> 00:23:14,825
we just actually make this all operate a lot better?

504
00:23:14,945 --> 00:23:16,895
And so this is a good time for that with AI.

505
00:23:17,195 --> 00:23:21,335
It seems like there's an opportunity here for a lot of companies to

506
00:23:21,945 --> 00:23:26,745
improve their processes with this, just because something like a powerful

507
00:23:26,760 --> 00:23:30,075
LLM assistant like this that keeps smacking into the same procedural

508
00:23:30,105 --> 00:23:34,755
guardrails and being forced to wait for someone to escalate is

509
00:23:34,755 --> 00:23:38,715
going to sound a lot more credible than when an engineering team does it.

510
00:23:38,835 --> 00:23:41,205
Oh, our engineers are just crappy slash lazy

511
00:23:41,205 --> 00:23:43,485
slash maliciously complying with policy.

512
00:23:43,635 --> 00:23:47,025
Okay, now that the robot's doing it, it lands a bit differently, I suspect.

513
00:23:47,795 --> 00:23:52,265
I wonder if that starts to act as internal ammunition to drive cultural change.

514
00:23:52,715 --> 00:23:53,285
I think you're right.

515
00:23:53,585 --> 00:23:53,975
I agree.

516
00:23:54,365 --> 00:23:55,415
That's in phase two of the bot.

517
00:23:55,415 --> 00:23:57,335
It's gonna be a dashboard explicitly aimed at that.

518
00:23:57,335 --> 00:24:00,665
I'm hoping. Now it's, uh, the problem is your crappy policies. Great.

519
00:24:00,665 --> 00:24:02,465
That has its own problems.

520
00:24:02,465 --> 00:24:03,995
Just write code the way that we do.

521
00:24:03,995 --> 00:24:05,885
And thus was launched Kubernetes.

522
00:24:06,335 --> 00:24:09,275
It was, uh, the world in which we live.

523
00:24:09,455 --> 00:24:10,325
So what's exciting to you?

524
00:24:10,325 --> 00:24:13,715
If we look at the world four years ago versus today,

525
00:24:13,835 --> 00:24:15,035
technically speaking,

526
00:24:15,305 --> 00:24:18,335
it's hard to recognize just how far we've come in a short period of time.

527
00:24:18,545 --> 00:24:20,555
Predicting the future is always dangerous, but

528
00:24:20,585 --> 00:24:22,925
what do you think is coming down the road next?

529
00:24:23,195 --> 00:24:23,555
You know what?

530
00:24:23,555 --> 00:24:29,645
I look at the kind of evolution of operations at AWS over the 20 years.

531
00:24:29,645 --> 00:24:31,020
Like, the one thing that's remained,

532
00:24:31,875 --> 00:24:36,075
well, constant, of course, constantly changing, but, uh, constant in terms

533
00:24:36,075 --> 00:24:40,785
of the priority, is the culture. You really have to... operations is

534
00:24:40,785 --> 00:24:45,075
one of these things that it can be too easy to backseat to a squeakier wheel.

535
00:24:45,345 --> 00:24:47,895
And so that's one thing that everybody, like all companies

536
00:24:47,895 --> 00:24:50,325
have to think about, is how do we make sure that

537
00:24:50,385 --> 00:24:52,755
operations is something that people care about all the time.

538
00:24:53,025 --> 00:24:54,480
Especially important for AWS because,

539
00:24:55,199 --> 00:24:57,090
you know, that is our job.

540
00:24:57,090 --> 00:24:59,760
We are doing the ops so you don't have to. Like, that is the whole thing.

541
00:25:00,209 --> 00:25:05,459
But the thing that has changed is kind of maybe the maturity around it,

542
00:25:05,580 --> 00:25:09,870
like, starting from when we were first building AWS, nobody had

543
00:25:09,870 --> 00:25:13,679
built a cloud before, and so teams were figuring

544
00:25:13,700 --> 00:25:15,650
out the right ways to do things.

545
00:25:15,650 --> 00:25:17,030
Like how do you measure availability?

546
00:25:17,450 --> 00:25:19,010
So teams would measure it by like, when I

547
00:25:19,010 --> 00:25:21,290
was on DynamoDB, we were measuring latency.

548
00:25:21,710 --> 00:25:24,620
Well, with latency, you look at percentiles

549
00:25:24,620 --> 00:25:27,320
of how long it takes to respond to certain APIs.

550
00:25:27,650 --> 00:25:27,860
Well.

551
00:25:28,545 --> 00:25:31,605
When we were building DynamoDB, it was all about predictable performance.

552
00:25:31,605 --> 00:25:32,475
That's the main thing.

553
00:25:32,475 --> 00:25:32,655
That

554
00:25:32,745 --> 00:25:37,965
slow and, uh, consistent is better for a lot of use cases than fast

555
00:25:37,965 --> 00:25:38,715
but spiky.

556
00:25:38,895 --> 00:25:40,620
Well, that's why DynamoDB went...

557
00:25:40,725 --> 00:25:42,135
We went for fast and consistent.

558
00:25:42,135 --> 00:25:42,615
That was the

559
00:25:42,675 --> 00:25:43,155
That was real.

560
00:25:43,155 --> 00:25:43,695
Exactly.

561
00:25:43,695 --> 00:25:44,060
Hey, hey.

562
00:25:44,265 --> 00:25:47,355
Well if I could have all three of the constraints, that one, then terrific.

563
00:25:47,475 --> 00:25:48,615
Well, playing with the CAP theorem is...

564
00:25:49,020 --> 00:25:50,460
Why would you ignore the CAP theorem?

565
00:25:50,460 --> 00:25:51,300
Sure, why not?

566
00:25:51,750 --> 00:25:53,940
Actually, we were able to play some really cool tricks with it.

567
00:25:53,940 --> 00:25:56,430
But anyway, so measuring performance even, so, okay.

568
00:25:56,430 --> 00:25:57,240
How do you do that?

569
00:25:57,270 --> 00:26:01,650
Well, we found in early pre-release beta, we were seeing our

570
00:26:01,650 --> 00:26:03,810
latency graphs were flapping. Well, what's going on?

571
00:26:03,810 --> 00:26:04,620
And actually, it

572
00:26:04,725 --> 00:26:07,875
wasn't that the latency was actually flapping, it's that our measurement was,

573
00:26:07,875 --> 00:26:11,925
because we were clumping together, kind of one kilobyte reads that were

574
00:26:11,925 --> 00:26:16,905
buffer pool hits with four-megabyte scans that were going all over the disk,

575
00:26:16,995 --> 00:26:20,115
and that was in the same measurement, so it's, oh, let's break that out.

576
00:26:20,115 --> 00:26:20,334
Let's have

577
00:26:20,815 --> 00:26:22,465
three different latency buckets.

578
00:26:22,465 --> 00:26:25,705
So we'll have a bucket for the, like, super small reads.

579
00:26:25,795 --> 00:26:28,525
Another bucket for the medium sized reads, a bucket for the large reads.
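
The bucketing idea above can be sketched as follows. This is a minimal illustration, assuming hypothetical size thresholds (`SMALL_MAX`, `MEDIUM_MAX`) and a nearest-rank percentile; the episode does not state the boundaries or the percentile method DynamoDB actually used.

```python
import math
from collections import defaultdict

# Hypothetical size boundaries in bytes; the real thresholds are not given in the episode.
SMALL_MAX = 4 * 1024
MEDIUM_MAX = 256 * 1024


def bucket_for(size_bytes):
    """Assign a request to a latency bucket by its payload size."""
    if size_bytes <= SMALL_MAX:
        return "small"
    if size_bytes <= MEDIUM_MAX:
        return "medium"
    return "large"


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_bucket(samples, p=99):
    """samples: iterable of (size_bytes, latency_ms) pairs. Returns {bucket: percentile}."""
    grouped = defaultdict(list)
    for size_bytes, latency_ms in samples:
        grouped[bucket_for(size_bytes)].append(latency_ms)
    return {bucket: percentile(latencies, p) for bucket, latencies in grouped.items()}


samples = [
    (1024, 2.0), (1024, 3.0),                              # small buffer-pool hits
    (4 * 1024 * 1024, 180.0), (4 * 1024 * 1024, 220.0),    # large scans
]
print(latency_by_bucket(samples, p=50))
```

Reporting each bucket separately keeps a shift in the mix of small reads and large scans from showing up as a change in latency.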

580
00:26:28,675 --> 00:26:31,495
So these kinds of operational practices everybody was learning.

581
00:26:31,495 --> 00:26:32,665
That was kind of the first phase.

582
00:26:32,905 --> 00:26:35,515
Then we started writing it down and sharing it more to

583
00:26:35,515 --> 00:26:37,660
say, turning things into checklists, saying, okay, well,

584
00:26:38,090 --> 00:26:41,659
it is kind of like Well-Architected, where you would say, okay, we know that it's

585
00:26:41,659 --> 00:26:46,070
important to do these things and organize a pipeline this way, scale this way.

586
00:26:46,100 --> 00:26:49,879
Uh, have resiliency set up in, you know, in multiple availability zones.

587
00:26:49,879 --> 00:26:51,260
You started having checklists.

588
00:26:51,439 --> 00:26:55,040
Then we started automating the evaluation of the checklist to say, well,

589
00:26:55,100 --> 00:26:58,610
let's just let teams know if they have something that isn't following.

590
00:26:59,034 --> 00:27:01,554
What we think is probably a better way to do it.

591
00:27:01,764 --> 00:27:04,375
Then we started fixing things automatically, saying, okay, here's

592
00:27:04,375 --> 00:27:08,485
a pull request that updates Java for you, like, upgrades it and all your

593
00:27:08,485 --> 00:27:11,034
dependencies, like something that would take you a long time to do.

594
00:27:11,395 --> 00:27:12,820
So I think the future

595
00:27:13,665 --> 00:27:17,595
is, like, if I follow that trend line of automating operations,

596
00:27:17,955 --> 00:27:21,555
I think the future is going even further to just getting it done.

597
00:27:22,125 --> 00:27:26,205
Like the operational backlog of improving any application is, is endless.

598
00:27:26,625 --> 00:27:29,715
There's an unlimited number of things you can do to improve your operational

599
00:27:29,715 --> 00:27:33,945
posture in a system, and everybody wants to reach further into that backlog.

600
00:27:34,035 --> 00:27:35,475
And so I think just the

601
00:27:35,700 --> 00:27:37,560
getting it done for you.

602
00:27:37,650 --> 00:27:40,470
But that means agents, they need to be able to fully like load test.

603
00:27:40,470 --> 00:27:44,100
They need to be able to fully do everything to validate those changes.

604
00:27:44,100 --> 00:27:48,300
Like pushing a change to, to upgrade your application to a newer Java runtime.

605
00:27:48,780 --> 00:27:50,520
Before, that wasn't the easy part, but now

606
00:27:50,520 --> 00:27:52,470
that's, that's actually kind of the easy part.

607
00:27:52,530 --> 00:27:55,740
Like, it's actually the, how do you make sure that deploying that

608
00:27:55,830 --> 00:27:59,070
is actually going to work well without any hiccups along the way.

609
00:27:59,220 --> 00:28:00,090
That's the hard part.

610
00:28:00,420 --> 00:28:02,460
Is there a future where the entire infrastructure

611
00:28:02,565 --> 00:28:05,415
DevOps side of the world, which today is an entire career field,

612
00:28:05,595 --> 00:28:08,595
just becomes something the platform does under the hood where people don't

613
00:28:08,595 --> 00:28:11,955
think about it of here's the application, make sure it runs, do what you've

614
00:28:11,955 --> 00:28:16,575
gotta do, and it just goes down to application development in a lot of shops?

615
00:28:16,995 --> 00:28:18,885
Well, I think there's a difference between it happening

616
00:28:18,885 --> 00:28:20,655
automatically and people not thinking about it.

617
00:28:20,655 --> 00:28:23,925
I think people always have to obsess over their customer experience.

618
00:28:24,340 --> 00:28:27,730
And ops is a lens: understanding that your operational

619
00:28:28,060 --> 00:28:31,570
characteristics are a lens through which to look at and understand the customer.

620
00:28:32,020 --> 00:28:34,660
Um, so that's not gonna go away, but if I look at what

621
00:28:35,140 --> 00:28:38,350
coding agents are doing, and the amount of

622
00:28:38,350 --> 00:28:41,080
complexity that they're encapsulating in a spinning wheel.

623
00:28:41,190 --> 00:28:41,910
There's a spinning wheel.

624
00:28:41,940 --> 00:28:42,150
Oh yeah.

625
00:28:42,150 --> 00:28:43,830
Working on working, oh yeah.

626
00:28:43,830 --> 00:28:46,710
Working on implementing your entire application and then it's done.

627
00:28:46,770 --> 00:28:49,260
You know, that's, that's, that's all a spinning wheel,

628
00:28:49,410 --> 00:28:51,360
an enormous amount of work abstracted into that.

629
00:28:51,570 --> 00:28:53,580
I think DevOps is like the next one of like,

630
00:28:53,610 --> 00:28:56,460
okay, now let's like improve your resiliency.

631
00:28:56,460 --> 00:28:57,900
Let's optimize your application.

632
00:28:57,900 --> 00:28:58,650
Let's add monitoring.

633
00:28:58,830 --> 00:29:00,960
These are, yes, these are things that are just

634
00:29:00,960 --> 00:29:03,960
accessible to anyone as just, it'll just happen for you.

635
00:29:04,389 --> 00:29:05,320
These are exciting times.

636
00:29:05,410 --> 00:29:07,840
These are fun products that are really making a lot

637
00:29:07,840 --> 00:29:11,110
of the historical toil, not nearly as toilsome even.

638
00:29:11,139 --> 00:29:12,760
Even as a guided assistant, it's: okay,

639
00:29:12,760 --> 00:29:15,100
I'm out of ideas to troubleshoot this. Robot,

640
00:29:15,100 --> 00:29:18,160
do you have one? It starts to almost become a collaboration aid.

641
00:29:18,460 --> 00:29:18,700
Yeah.

642
00:29:18,700 --> 00:29:20,800
It's kinda like rubber ducky debugging.

643
00:29:20,920 --> 00:29:22,660
Except it actually will

644
00:29:22,985 --> 00:29:25,895
give you some ideas instead of just being a reflection of your own.

645
00:29:25,985 --> 00:29:27,485
The rubber duck has learned to talk.

646
00:29:27,485 --> 00:29:29,465
We've wound up in a very

647
00:29:29,465 --> 00:29:31,985
strange place suddenly and it happened very quickly.

648
00:29:32,044 --> 00:29:32,225
Yeah.

649
00:29:32,225 --> 00:29:33,425
I have this rubber duck right here.

650
00:29:33,425 --> 00:29:34,685
Actually, this, I just picked this up.

651
00:29:34,685 --> 00:29:36,245
It's a Captain Janeway.

652
00:29:36,304 --> 00:29:38,435
That's, that's my, my rubber duck of choice.

653
00:29:38,705 --> 00:29:38,855
Yeah.

654
00:29:38,855 --> 00:29:41,794
I, I've started sneaking them into the house wherever my wife least expects.

655
00:29:41,794 --> 00:29:44,855
I call it redecorating and I'm going to have to deal

656
00:29:44,855 --> 00:29:47,465
with the consequences of that someday, but not today.

657
00:29:47,584 --> 00:29:49,985
David, thank you so much for taking the time to speak with me.

658
00:29:50,314 --> 00:29:50,495
Yeah.

659
00:29:50,615 --> 00:29:51,995
If people wanna learn more, where should they go?

660
00:29:52,275 --> 00:29:53,325
What's the best place to find you?

661
00:29:53,685 --> 00:29:57,735
LinkedIn, X, Bluesky, the fragmentation of every social media thing.

662
00:29:57,735 --> 00:29:58,125
I'm there.

663
00:29:58,125 --> 00:29:58,755
Happy to chat.

664
00:29:59,055 --> 00:29:59,505
Fantastic.

665
00:29:59,505 --> 00:30:01,725
And we'll of course put links to that in the show notes.

666
00:30:01,935 --> 00:30:03,915
Thank you so much for taking the time to speak with me.

667
00:30:03,915 --> 00:30:04,695
I appreciate it.

668
00:30:04,875 --> 00:30:05,535
Thanks for having me.

669
00:30:05,595 --> 00:30:06,015
A lot of fun.

670
00:30:06,345 --> 00:30:10,065
David Yek, a senior principal engineer at AWS.

671
00:30:10,215 --> 00:30:13,300
I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud.

672
00:30:13,670 --> 00:30:15,740
If you've enjoyed this podcast, please leave a

673
00:30:15,740 --> 00:30:18,290
five-star review on your podcast platform of choice.

674
00:30:18,320 --> 00:30:21,440
Whereas if you've hated this podcast, please leave a five-star review

675
00:30:21,440 --> 00:30:25,100
on your podcast platform of choice, along with an angry, insulting comment

676
00:30:25,250 --> 00:30:29,360
that you hopefully were able to get the AWS DevOps agent to write for you.