Oxide hosts a weekly Discord show where we discuss a wide range of topics: computer history, startups, Oxide hardware bringup, and other topics du jour. These are the recordings in podcast form.
Join us live (usually Mondays at 5pm PT) https://discord.gg/gcQxNHAKCB
Subscribe to our calendar: https://calendar.google.com/calendar/ical/c_318925f4185aa71c4524d0d6127f31058c9e21f29f017d48a0fca6f564969cd0%40group.calendar.google.com/public/basic.ics
Full enough house?
Bryan Cantrill:Full enough house. Yeah. This was a wild one. This was a really wild one. So I'm gonna give you — if you don't mind, I'm gonna give you kind of my introduction to this problem, Adam.
Bryan Cantrill:And then Will, maybe we can back up and get your introduction to this problem and kind of go forward from there. Because my introduction to this problem was: I come into the office, I think I'm on a meeting, and Robert came up to me — Robert of Holistic Engineer fame. So we can ring the chime, even though people on YouTube —
Adam Leventhal:The much-loved chime.
Bryan Cantrill:The much-loved chime. We've heard your comment on YouTube and we'll be ringing the chime a lot. We will be making as many references to previous episodes as possible. But Robert — of Holistic Engineer fame — came up to me with a wild look in his eyes. And I mean, Adam, you have known Robert for a very long time.
Bryan Cantrill:Yeah. A wild look in the eyes is not common from Robert. I mean, that alone is enough to — I mean, recall — just to get that chime really working — when we had the data center reboot, when we rebooted the data center, Robert was the one who was actually, like, keel-in-the-water cool. I was the one who thought I was gonna pass out and/or vomit on my keyboard.
Bryan Cantrill:But Robert was actually, like, the cool head in the room.
Adam Leventhal:If Robert's panicking, it's because the fire station itself has burned down.
Bryan Cantrill:That's right. That's right. So Robert comes up to me with a wild look in his eyes, and, you know — I was clearly on a call, but it was like, okay. I mean, obviously, yes. Robert, what is it?
Bryan Cantrill:And Robert says, if I attempt to write to a device address via mdb -kw, do you think it'll let me? I'm like, oh — where are we right now? Did you ever watch Quantum Leap, that show?
Adam Leventhal:Yes. I can quote, you know, chapter and verse from it, but yeah.
Bryan Cantrill:I mean — Matt, I'm sorry to — as the token millennial, does Quantum Leap ring any bells? As we've learned in previous episodes, if it wasn't featured on SpongeBob, it might not have crossed the generational chasm. I can't imagine that this would. Yeah.
Matt Keeter:Yeah. Yeah.
Bryan Cantrill:Okay. So you've heard of Quantum Leap.
Cliff Biffle:I'm familiar with it. Yeah. We had reroute satellite.
Bryan Cantrill:Yeah. There you go. Okay. So Quantum Leap was a pretty, like, interesting show because the guy would travel into someone else's body. And part of the like, in the beginning of the show, he's trying to figure out like, where am I?
Bryan Cantrill:Like, I've arrived in a new body and I've got no idea — clearly this body knows something that's going on. I felt like that. I had that feeling of, okay, I've obviously been transported into the middle of this problem where we are asking if writing to device memory is going to work. So what Robert is asking is to do a write via the kernel debugger, mdb -kw. And mdb is our debugger; mdb -k is running it such that it's debugging the kernel. The -w flag denotes: I want to actually write to memory. Which is obviously like, wait a minute.
Bryan Cantrill:Is this a debugger or is this a destroyer? It's like, well, the line gets blurry. That's right.
Adam Leventhal:But any idea that involves mdb -kw — it raises certain goosebumps. Right?
Cliff Biffle:Like that
Bryan Cantrill:What's the craziest thing you've done via mdb -kw, Adam?
Adam Leventhal:I have changed locking primitives so that everyone gets a lock. That's the — I've just changed it so that, like, whoever wants a lock gets a lock.
Bryan Cantrill:No. This is not in the production system, one assumes.
Adam Leventhal:No. No. Now what is the
Bryan Cantrill:what is the craziest thing you've done in a production system? That'd be
Matt Keeter:my stupidity.
Adam Leventhal:Probably changed an instruction to, like, you know, bounce over a condition, basically. But that feels very sticky, because you're like, if I have computed this instruction incorrectly, we're gonna bounce into outer space.
Bryan Cantrill:That's right. For me, it was — I changed the branch displacement — a particular call displacement, actually. I changed a cv_broadcast into a cv_signal. That's exciting. Oh, it was super exciting.
Bryan Cantrill:And this was definitely on a hot customer system where the customer was upset. This is back at Sun. And you remember — I remember Jesse, our dear friend that we'd worked with. And this was a customer he'd been working with. And Jesse got very nervous when I started asking him permission for what I was about to do.
Bryan Cantrill:He's like, wait, why are you asking me? Like, why are you so uncertain about what you're about to do? Like, the surgeon is not supposed to be asking me these kinds of questions. Why is the surgeon asking me permission in this regard? Like, is it —
Adam Leventhal:And just to roll it back, you're like, I'm seeing this thundering herd. So what if, instead of waking up everybody, we just wake up somebody, and —
Bryan Cantrill:like Yeah. Yeah.
Adam Leventhal:Exactly. That'll probably that could be fine.
Bryan Cantrill:Assuming I have played all this correctly. Assuming I turned this cv_broadcast into a call to cv_signal and not a call into the middle of — an illegal instruction. That's right. Or anything else that it can turn —
Adam Leventhal:into. And you're probably like, well, if it does wedge up the system, I could probably turn it back and maybe the system will unwedge too.
Bryan Cantrill:I did not really have — no. No. I kinda knew that there was no insurance policy. Okay. Okay.
Bryan Cantrill:This was always gonna be kind of a one-way trip. And I did start asking Jesse, what are the consequences of this? If this system were to reboot right now, what would be the consequences? And Jesse's like, not good. Like, why are you asking me that?
Bryan Cantrill:That's a weird question to ask me. It's like, you know, it's like your surgeon asking you, oh, you know, are your affairs in order? So —
Adam Leventhal:Like, you do have a will.
Bryan Cantrill:Right? I mean, obviously you've got a will. I mean, my god, you've clearly done the most basic adult things. You're like —
Adam Leventhal:I feel like it's more like your dentist asking that, but yeah.
Bryan Cantrill:It's like your dentist asking you that. Why do you ask? This is a — no. No. Just, like, this is a pretty — I don't know.
Bryan Cantrill:It's a little adventurous, this one. This root canal is a little more adventurous than the ones I've done. I just wanna make sure — just asking you — that your affairs are in order. You've got, like — you know, you don't want your — you want your kids to be able to, you know... they're already gonna be devastated enough having lost a parent. You don't want them to be financially ruined.
Bryan Cantrill:You're like, what? No. So — and, you know, it worked, I'm glad to say. It all worked and I was relieved, and then maybe too relieved. Jesse thought I was a little too relieved and realized, like, I should not have allowed you to do that.
Bryan Cantrill:Okay. So Robert is asking me if he can write to device memory via mdb -kw. And I'm like, if it did allow you to do that, that itself would be a bug. Like — what do you wanna do? And what Robert quickly explained — and Will, this is where we're gonna jump to you to get more context — is, like, so this is a customer system.
Bryan Cantrill:We have lost contact with all service processors. And I'm like, what do you mean, all? It's like, every compute sled. We cannot talk to the service processor on any compute sled. Hosts are up, but all service processors are — at this point, I can say, oh yeah, the wild look is definitely merited.
Bryan Cantrill:And so what Robert wanted to do — so the first thing we needed to figure out, and again, we'll go to you to get more context on all this, and I know, Matt, you jumped in here too — we wanna determine, you know, how dead is the service processor, actually? And what can we tell? Because the service processor is connected to the network, and what we know is we seemingly can't reach it over the network. We also know the service processor is in charge of certain basic maintenance of the system.
Bryan Cantrill:The service processor, among other things, has the thermal loop, which Matt wrote. And we know if the service processor goes out to lunch completely, the fans will all crank up.
Adam Leventhal:So just for a little context — the service processor may be obvious here, but it's the computer within the computer. It's the computer that is driving the basic, kinda autonomic functions of the server. That's right. Up to and including things like being able to reboot it, or the fans, as you're saying.
Bryan Cantrill:Yeah. That's right. And, I mean, this is our answer to the baseboard management controller. We do not have a traditional baseboard management controller, for a bunch of good reasons. And what we have instead is a service processor.
Bryan Cantrill:So that SP is definitely important. And the SP — and we'll go into this in detail — has been designed with an operating system that Cliff led the charge on, Hubris, that is designed to be very robust, and it is really not designed to disappear. It was very unusual that it would disappear. And so the reason for the wild look is that Robert's kind of bonkers idea was — so it requires a little bit of backstory. When we boot, the service processor talks to the DIMMs on the box. So there's an I2C bus that the DIMMs — the memory — are connected to.
Bryan Cantrill:And there's some identifying information in those DIMMs, and it's the service processor that is connected to that bus and gets that information. But the host CPU ultimately needs to get that information in order to train memory. It's got its own I2C initiator, and when it thinks it's talking to the DIMMs, it's actually talking to the service processor. Okay?
Adam Leventhal:With you.
Bryan Cantrill:So okay. But this only happens when you boot the host CPU. Once it's booted, that I2C controller is, like, not really used. And one of the things that Robert has discovered — I'm not sure when he made this discovery, but one of the things he's gotten working over the last couple of months — is, like, so this I2C initiator is hidden from — you know, we've talked about the various other chips, the hidden cores on the die. And of all of those kind of hidden things that we don't see, that I2C controller is controlled by one of those hidden cores.
Bryan Cantrill:And what Robert realized is, like, actually, the I2C controller for that thing is mapped into our address space, and we, the host operating system, can actually manipulate that I2C controller. So we can initiate I2C. And I'm not even sure — like, this kind of defies metaphor. I don't even know. It was like we're kinda borrowing its underwear.
Bryan Cantrill:I don't know. I mean, I'm not even sure what the — so what Robert wanted to do was — Robert's like, actually, what I think we can go do is hit this memory-mapped region for the I2C controller, and — because the whole CPU is up — I can force this thing to initiate I2C transactions and see if the SP can respond to those I2C transactions to identify, like, memory. Which — yikes. I was like, wow. Okay. What?
Bryan Cantrill:We are really desperate. And even Robert was like, yeah, this feels a little too adventurous. Like, that definitely feels like my cv_broadcast-to-cv_signal can just sit down. If you're gonna do this, this is definitely next level. So — but that was me coming into this when it was already kind of evolved.
Bryan Cantrill:So Will, I think you were among the first on the scene on this, to discover that the service processors were down. And you were attempting to — you were kind of supporting another issue, and you all noticed that, like, oh my god, these things are all down. How did things proceed from there?
Will Chandler:Do you want me to give us some context on like what the customer is seeing first?
Bryan Cantrill:Yeah. Sure. That'd be great. Yeah. Yeah.
Bryan Cantrill:That'd be great.
Will Chandler:Yeah. So the customer reached out to us because they were trying to stop an instance via the CLI and the request was timing out, which typically means that one of the sleds is down and, like, isn't able to respond. And so the request within the rack times out, and so the client times out. So we ask for permission to connect to the rack and they let us in. And then, you know, we list all the hosts and we can see that sled 23 is down.
Will Chandler:And then at that point, we list all the SPs just as kind of a matter of course, just to, like, validate that the SP for that sled is there. And then, as you said, all of the sled SPs are gone. We still have the SPs for the two power shelf controllers and the two switches, but the other 16 SPs are just completely missing.
Adam Leventhal:So, like, the list — you're able to list them, and it's like, yeah, I have these two or three. That's
Will Chandler:what you're looking for. Right? Yeah. That's four. Right.
Will Chandler:Yeah. So the tool we use to list SPs is called pilot, and that has, like, a streaming feature where you can just say, show me all the discoveries you have. And so we use that next, and we can see it's only discovering the four that we saw from the list, which makes sense. So we look at the MGS — management gateway service — logs next, and we can see that, you know, we're reaching out to all these SPs and it's timing out after twenty seconds and five attempts. So at that point, you know, we reach out in the channel and say, we don't see any SPs on this rack, and, you know, Matt and a number of other people hop in from that point to start to troubleshoot more.
Bryan Cantrill:And, Matt, I'm sure you were like, you can't see — you've lost all of them? I mean, it definitely reminds me of that Far Side cartoon where it's like, we paid you to look after the children and you cooked and ate them both — where they're coming home to a wicked witch. I mean, it feels like — we've lost all of the SPs? That must have been — this is not a failure mode that we've really... this really shouldn't be possible.
Matt Keeter:Yeah. I mean, to lose one SP could be seen as carelessness. To lose 27 of them, that's really, really impressive. Yeah. So at this point, I get pinged, and I'm like, oh, it's another quote, unquote management network bug — which is never the management network's fault, but, you know, I gotta go in and prove that to my satisfaction.
Matt Keeter:And so I, you know, pop into the system, and we have pretty good counters and visibility into, like, what's going on in the management network — where, I guess, the context here is the management network is the separate switch which has all of the SP traffic, and then it has a bridge up to the main network. So we can use this. This is running before the main switch is running. It runs when the system is mostly powered off. And so we start looking at the counters, and sure enough, we see that we have packet counters for all of the SPs, and the packet counters are going up for the two Sidecars and the PSCs, and they are not going up for any of the SPs.
Matt Keeter:So we're like, oh, well, that's a bad sign. That means that no packets are leaving the SP. And it's like, this all checks out. Right? Because the whole point of — the way that discovery works is that the SPs are periodically broadcasting a message saying, hello.
Matt Keeter:I'm an SP. And if we don't see that message, then it doesn't get through. So the fact that the packet counters aren't going up is like, okay. That's understandable, but not great. So I actually — let me just see.
Matt Keeter:I pulled out — unfortunately, this is a customer support ticket, so it's a private repo — but I pulled out my long list of things that we did.
Bryan Cantrill:Oh, yeah. I love that you were saying that I actually was coming into this several hours of bad ideas in. So I cannot wait to hear some of the other ideas.
Adam Leventhal:Yeah. Robert's like, let's cut off our own hands was like, that was the
Bryan Cantrill:best idea. That was exactly right. Yeah. That was after several other ideas.
Matt Keeter:So one other thing we try is — there's a command called IPCC, which talks to the SP over the serial port in the system. So that is, like, not using the network stack at all, obviously. And so we tried using that to talk to the SPs on these stuck sleds, and sure enough, they didn't talk over IPCC — the serial line — either. So at this point, I'm like, okay, something has gone wrong and, like, the kernel has hung in all of these SPs, and maybe it's, like, a timer that has gone wrong or something like that.
Matt Keeter:And I think Robert is like, well, can we just try to ping the SPs just to see if that works? And I'm like, sure. Sure. But that's not gonna work. Like, the whole system is hung.
Matt Keeter:And so we, you know, figure out the IP address, and we try to ping one of the SPs, and the SP answers the ping. And we see the packet counters go up for, like, the two packets that it takes to do ICMP discovery and echo. And we're like, oh, the SP is still running. This is weirder than we thought.
Bryan Cantrill:Yeah. Totally weird. Right? Because we I
Matt Keeter:mean So like Yeah. Yeah. Go go ahead.
Bryan Cantrill:No. No. I was gonna say, I mean, I think the expectation there is, like, when it's dead, that the SP is actually dead — the kernel has panicked or something, or the networking task is, like, totally dead. So the fact that this thing can actually respond to ICMP echoes is kinda wild.
Matt Keeter:Something very weird is going on. The SPs are alive enough to answer pings, but dead enough that they can't seem to do anything else. And one thing we know about pings is that pings are actually handled within the net task itself, so it doesn't do dispatch to a different task. And what it seems like now is that, like, every task below priority five, which is the priority of the net task, is not running for some reason. And we're not sure why this is happening.
Matt Keeter:I point out that if the net task was always runnable for mysterious reasons, then it would always be selected, and that would cause the behavior we saw. But this is still in the, like, mysterious-reasons category. We then — I think, Will, do you wanna talk a little bit about the, like, thermal and current debugging that we tried to do as well?
Will Chandler:Sure. Yeah. So the SPs will collect current metrics for the given sleds. And then we also have them from the power shelf controller. So we did this two different ways, and we can query the metrics within, kind of, like, the management system of the rack using omdb, which is kind of an internal tool.
Will Chandler:So it's really easy to run metrics queries. So the first thing we did was we just queried for hardware current from one of the dead sled SPs, just to see, like, when did this stop? And we could see that it stops, you know, roughly twenty-five days previously. And then we just repeated that for all the other sleds, and we could see they'd all failed within the space of, I think, roughly a day. And then we also checked the PSC hardware metrics — or current metrics — just to see, like, when the SPs died, did we see an increase in current indicating that the fans are kicking off?
Will Chandler:It looks like we did see an increase in draw, which was suggestive of that. I don't know if we ever conclusively determined that though.
Matt Keeter:Yeah. So we ended up seeing, like, a moderate increase in current — not enough that the fans were, like, at 100%. So it was, like, mixed results: we saw something change in current, but it was not strong evidence for exactly what was going on. At one point, we were also considering asking the customer to go down to the actual data center and, like, Car Talk style, describe the noise the server's making.
Matt Keeter:Is it more of a whoo, or is it more of a
Bryan Cantrill:whoo, whoo, whoo.
Matt Keeter:We didn't actually end up asking the customer to do that, though.
Bryan Cantrill:But maybe we should — you know, even in hindsight, we've got this thing completely debugged — why not actually, just, like... you know, I think, Matt, as long as you ask them in exactly that way, I feel we should. I like the various kinds of apes that we're trying to determine which one we're seeing.
Matt Keeter:The other thing we did at this point was take the number — like, 188.5, which was the rough uptime in days — and, like, look at it in milliseconds and try to turn it into an even power of two. So we're like, oh, maybe something is rolling over at this point. And it turns out you cannot turn a hundred and eighty-eight point five days into, like, anything that looks
Cliff Biffle:like Anything.
Matt Keeter:Power of two, unfortunately.
Bryan Cantrill:Yeah. So we've got — and the host is up, right? So we also don't wanna — well, yeah. I mean, we don't actually have the ability to extract any more state other than what we know. Were there other bad ideas before Robert wanted to write to device memory — what were some of the other things that we —
Matt Keeter:We kind of backed off the device memory idea once we saw the ping working, because we're like, okay, the SP is actually alive — which was what we were trying to prove by showing that it responded over SPD. I think the other thing that we hadn't done at this point is, like, we had not proven that the SPs would come back, which was concerning.
Bryan Cantrill:Yeah. In terms of, like — we don't know if the SPs are actually dead dead.
Matt Keeter:Yes. Although by the time we saw them responding to ping, we were maybe feeling a little bit better on that front. But we were still like, if we power cycle one of these sleds and the SP doesn't come back, then someone is flying off to a certain location with a debugger and is not gonna have a good week.
Bryan Cantrill:Yeah. That's a very dark thought. I guess I came into this after that — that dark thought had not occurred to me, that we had managed to actually, like, electrically destroy every one of the SPs. That would be — that's dark.
Bryan Cantrill:Yeah. That's definitely dark. Fortunately — so I think we did end up resetting — Will, you ended up resetting the sled that was the cause of the investigation to begin with, and it did come back up. Is that correct?
Will Chandler:Yeah. That's right. The next day, I got access to the rack again and power cycled it, and the host and the SP came right up. So that was a relief.
Bryan Cantrill:That's a relief. Okay. So now we have got really just the symptoms of this problem. It's obviously a very bad problem. We really rely on that SP being up.
Bryan Cantrill:But so we were kinda reduced to — I think, Matt, we'd kind of exhausted the information we could get out of the running system. Or maybe there were some other ideas on other information we could get out. But we're kind of reduced to, like, alright, we're gonna need to kind of go from first principles to try to reproduce elements of this problem. You had the thought that I think a lot of folks had, which is, like, could this be time related? It definitely reminds Adam of the lbolt problem — the famous Unix lbolt problem — where lbolt is a 32-bit value.
Bryan Cantrill:And lbolt is the lightning-bolt variable that dates back to, like, Sixth Edition. Yeah. They took poetic license at very odd times — like, drop the e off of creat, and lbolt as lightning bolt. So this thing fires a hundred times a second.
Bryan Cantrill:This is the system clock at 100 hertz. That value is signed, so it will go negative after two hundred forty-eight days — and, kind of famously, Unix systems of a certain vintage could not stay up for longer than that. I mean, they would be up after two hundred forty-eight days; it's just that it would be absolute madness, with people thinking that their timeouts that were in the future were in fact in the past, and a bunch of stuff would break.
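A quick back-of-the-envelope check on that rollover, as a minimal, self-contained Rust sketch — the 100 Hz tick and the signed 32-bit lbolt are from the conversation; the variable names are just illustrative:

```rust
// lbolt ticks 100 times a second; as a signed 32-bit value it goes negative
// once the tick count passes i32::MAX, i.e. after 2^31 ticks.
fn main() {
    let hz: i64 = 100;
    let rollover_ticks: i64 = i64::from(i32::MAX) + 1; // 2^31 ticks until the sign flips
    let seconds = rollover_ticks / hz;
    println!("~{:.1} days of uptime", seconds as f64 / 86_400.0); // ~248.6 days
}
```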
Bryan Cantrill:That was all, like, in the nineties. And actually, the first thing I actually did at Sun was fix that. Yeah. Someone else had kind of already started that work and was gonna hand it to me.
Cliff Biffle:It was like
Adam Leventhal:a classic job to, like, send to the new guy too. Because you're like, how hard like, this doesn't seem that hard. And I'm sure then, like, your phone was ringing for years or whatever, on this critical code path.
Bryan Cantrill:You know, it was funny, actually, on that in particular — that work had already been broadly done. And they also wanted to allow the hertz value to be tuned, to be configured. And they — the senior engineers — had decided that the values that were acceptable were 100 and 1,000. So 100 hertz and 1,000 hertz. And I'm like, why not just allow it to be configured to any value?
Bryan Cantrill:Like, well, that'll be kind of unsupportable. My god, it feels like that should be supportable. So I just, like, cranked the values higher. So what I would do is, like, I'm just gonna crank the value until the machine no longer boots, because it's just running the clock interrupt all the time — which was strangely super satisfying to do. It was like, I don't know.
Bryan Cantrill:I think that there's something sadistic in software engineers where we kind of enjoy computers that are in pain at some level. And I in particular remember that on sun4c — this is Campus, this is an old, old, old machine at this point — I don't know what the clock rate was on that thing, but I could get it to 26,000 — it hung at 26,000 hertz.
Bryan Cantrill:And I came back the next morning, and I was very satisfied that it hung. I hit the carriage return and verified that it hung — I guess I did that the night before. Because I came in the next morning and it was actually making progress. It was executing, like, some very small number of instructions at a time.
Bryan Cantrill:And then the machine immediately panicked. And it panicked on what ultimately was a chip bug that we found. Yeah. It was just crazy. And as it turns out, when you set the processor interrupt level, there was a two-instruction window immediately after that where you could take an interrupt at a level lower than the one you'd just set the interrupt level to.
Bryan Cantrill:And that was a chip bug. As it turns out, it was a chip bug in every SPARC microprocessor they had made before UltraSPARC. So yeah. It was actually kind of an interesting lesson in a whole bunch of ways. One, the consequences of just time in general.
Bryan Cantrill:These problems that you only see with uptime, right? They're really hard to test for, because how do you — you have to just, like, wait. And obviously there are ways to do this, and there are ways to kind of jump-start time.
Bryan Cantrill:I mean, Matt, one idea you had was around time. Did you end up doing the experiment of trying to accelerate time closer to this value — that, of course, is not even a power of two — to see if anything breaks?
Matt Keeter:No. We had talked about just like booting the system with the clock set to a hundred and eighty eight days, but didn't actually get around to that.
Bryan Cantrill:Didn't run it. Yeah. So it's kind of — you know, this is a high-priority issue, but we're kind of noodling around on, like, oh my god, what to do? And then, Cliff, yeah — you picked this up.
Bryan Cantrill:It's like, well, seems like a high-priority issue, obviously. What was your approach? Because this is a tough one to kinda go from first principles on.
Cliff Biffle:Yeah. So I had been out for a chunk of this dealing with family stuff. And so I get in, and I have text messages waiting from a bunch of these folks being like, hey. So there's a weird thing. Hey.
Cliff Biffle:There's a weird thing. We'd really like you to look at the weird thing. Oh my god. Could you please look at the so anyway. But, fortunately, by the time I got in, a lot of this prep work had already been done.
Cliff Biffle:And the stuff that Matt and folks investigated really narrowed down the potential failure space. So by the time I showed up, what we knew was essentially: machine goes unresponsive to everything except ICMP ping, which, as Matt mentioned, is handled on a fast path inside the network stack. So that can keep working — sort of like a brainstem reflex, like a chicken with its head cut off — long after the rest of the computer is doing something else. And Matt also mentioned earlier the possibility of the net task, like, being really busy or doing something in a loop. And so that was sort of the leading hypothesis, but we didn't know what would cause that or why it would be doing a lot of work in a loop.
Cliff Biffle:So Hubris has some interesting properties and tools here that actually came in really handy. And this sort of "we need to model the operation of the system in our heads because we can't get information in or out of the computer because it's that borked" is the sort of situation that guided the design of a lot of the system and tooling. So what we had was, from the build system, we could get a graph showing the priority relationships of all of the IPCs between all of the tasks. And from that, we could clearly see that, like, the IPCC handling — the serial port to the host — was a lower priority than the network stack, as are all of the clients of the network stack, by design. The Hubris network stack is a little weird in that it's a separate task rather than being in the kernel.
Cliff Biffle:And tasks that wanna offer a network service are lower priority than the network stack, which means whenever the network stack has an opportunity to run, it wins. It will take the CPU away from any of its clients — as a fairness thing, mostly. But in this case, the IPCC was also lower than the network stack. So if the network stack were running with wild abandon, it could explain why all of these interfaces are down except for ping. The other thing that seemed plausible at the time, though, was — one of the ways a task in Hubris can burn lots of CPU is to crash a lot, because our default response, generally — this is an unattended system — is to restart the task.
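A tiny model of the strict-priority scheduling being described — this is not the Hubris scheduler itself, just a sketch of why a net task that never blocks starves everything less important. The priority number for net follows Matt's earlier comment; the other task names and numbers are made up for illustration:

```rust
// Strict priority: of all runnable tasks, the most important one (lowest
// number in this sketch) always runs. If "net" is perpetually runnable,
// "ipcc" and "idle" are never selected, even though they're runnable too.
struct Task {
    name: &'static str,
    priority: u8, // lower = more important (illustrative convention)
    runnable: bool,
}

fn pick<'a>(tasks: &'a [Task]) -> Option<&'a Task> {
    tasks.iter().filter(|t| t.runnable).min_by_key(|t| t.priority)
}

fn main() {
    let tasks = [
        Task { name: "net", priority: 5, runnable: true },  // spinning on the interrupt
        Task { name: "ipcc", priority: 6, runnable: true }, // serial path to the host: starved
        Task { name: "idle", priority: 7, runnable: true }, // never runs, so no idle time
    ];
    println!("scheduler picks: {:?}", pick(&tasks).map(|t| t.name)); // always Some("net")
}
```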
Cliff Biffle:Now, Matt recently actually added a sort of smart timeout to give a cooldown period whenever a task crashes before it gets restarted, but I believe that this customer was on an older release, and I wasn't confident that that code was there. So that was still a possibility — that it was just falling over a lot. So we had to sort of rewind. And the good news is, like, thinking about this after the fact, there's a bunch of problems we didn't have to chase that would have been wild goose chases that we might have had to do on earlier systems I've supported.
Cliff Biffle:Like, this was probably not a wild pointer misuse in a different task because we're pretty confident the memory isolation between the tasks works at this point. It's probably not a scheduler bug because the scheduler is super, super simple. So I had to
Bryan Cantrill:Yeah. And just on that — especially on the wild pointer, Cliff, because I think this is so important. Because when you have the possibility of kind of arbitrary corruption from arbitrary things —
Cliff Biffle:Right.
Bryan Cantrill:You've got this kind of like, well, magic happens. Sometimes there's bad magic in this system, and we don't understand why. And therefore, like, certain things are unknowable.
Cliff Biffle:Yeah. Supporting complex C firmware for a long time is the reason why I know what it looks like in a hex dump if a pointer has been replaced by an IEEE floating-point number. That's the sort of thing that happens. Right.
Bryan Cantrill:So having this — I mean, not only do we have the solace that that doesn't happen, but it also gives us a kind of resolve that, like, there's something else going on. Like, we can't just chalk it up to bad magic due to something having a stray pointer.
Adam Leventhal:Well, as you say, data corruption is almost undebuggable. That is to say, you're separated at such a distance — we talked about this last week — and the ramifications are almost arbitrary. So it does give you some license to feel like giving up is an option. Totally.
Cliff Biffle:Yeah. And data corruption is sort of, like, the — well, I guess, the ghosts of software engineering.
Bryan Cantrill:Exactly. Exactly.
Cliff Biffle:Very unpleasant. For better or worse, we don't really get to reach for that one. So we instead stepped through everything sort of rigorously. And I started with the kernel, looking for potential scheduler bugs, but the scheduler is really simple, and we didn't find anything there.
Cliff Biffle:So I started looking for things that might cause the network stack to crash. Because if it's crashing, it must be crashing for a reason. It's probably a panic, because it's Rust — like, that's basically how we crash. And the number of actual panics in the network stack code is pretty low, although I did run off on a couple of tangents chasing ghosts that wound up not bearing any fruit.
Cliff Biffle:But we pretty quickly narrowed it down to the idea that the network task was probably failing to correctly acknowledge an interrupt. Peripherals on machines like this — on these microcontrollers — indicate some condition to software by signaling an interrupt, which causes an interrupt handler to run; it basically interrupts the flow of instructions through the computer. And in our case, a small routine runs in the kernel to handle the interrupt and pass it off to a task outside of the kernel for further processing — in this case, the network stack. So if the stack were getting an interrupt from the network controller and failing to, like, clear the condition, then as soon as it tried to go back to sleep, the kernel would receive another interrupt and would flap the network stack again.
Cliff Biffle:And you'd wind up in this cycle where the network stack basically receives all userland cycles instead of the computer ever idling. But it wasn't clear what could be causing that, because we've had this Ethernet driver for years, and it's generally been pretty reliable. And there weren't any obvious interrupt sources we wouldn't be clearing. So I had to go read the manual again. And buried in the manual, in an unexpected location of the — let me just double-check the page.
Cliff Biffle:Can I hear you?
Bryan Cantrill:Hey, Cliff. Can I just ask you something about your methodology here? When you say I'm gonna go read the manual again.
Matt Keeter:So Yeah.
Bryan Cantrill:The manual for the microcontroller is — yeah — thousands of pages long.
Cliff Biffle:I just looked it up. It's 3,294 pages.
Bryan Cantrill:Which is a good thing, to be clear. And I believe — nice — yeah, I believe in our episode on transparency and hardware-software interfaces, I called out ST as a model in this regard. So, love having this much documentation.
Bryan Cantrill:But when you say, I'm gonna go back and read the manual — and clearly, the documentation on the Ethernet controller is slimmer, but, I mean, it's probably hundreds of pages, is it not? I mean, it's gotta be —
Cliff Biffle:Yeah. The Ethernet chapter is only 300 pages long. And — Oh my god.
Matt Keeter:I think the other problem with the Ethernet chapter is that there are several interrupts which are listed as enabled when the chip is reset, and that is just straight-up a lie. Like, we're going through this and we found — oh, it says these interrupts are enabled, these interrupts are enabled — and they're not enabled when the chip is reset.
Cliff Biffle:Yeah. The chapter is effectively crying wolf. There are three or four different interrupts where it asserts, in plain English, this is enabled by default — but it's not. But it turns out one of those four actually is enabled by default, and the only way to confirm this is with this diagram that's hidden on page, I don't know, 2,000-and-mumble — which I posted on the GitHub issue — that shows that, where every other interrupt has this AND gate that, like, masks it if an interrupt enable isn't on, this one interrupt source instead has an OR gate that doesn't allow you to mask it, ever.
Cliff Biffle:So What
Bryan Cantrill:is — I mean — may I — are we taking questions at this time? Yeah. Yeah. Why? Why?
Bryan Cantrill:Yeah. So that — I mean, that was as confusing to you — that we'd have this particular interrupt be unmaskable?
Cliff Biffle:That's pretty bonkers. I mean, so normally, the sort of convention on an ARM microcontroller like this is that the central standard interrupt controller has enable bits for every interrupt source, and sometimes something like an Ethernet controller will combine a bunch of sources internally into a single bit that goes to the central interrupt controller. So in our case, from the CPU's perspective, Ethernet produces one interrupt, but it's a combination of, like, two dozen possible event sources. So the nice thing to do is to have bits in the peripheral that let you turn those individual interrupt sources on and off. The next-nicest thing to do, if you have some of them that can't be turned off, is to have them connected to something that's somehow optional — maybe a function that has to be turned on.
Cliff Biffle:So there were actually two stanzas in the interrupt handler in the Ethernet driver in Hubris that, you know, basically had a TODO: we don't enable this interrupt, we don't need to handle this. Which —
Bryan Cantrill:Which is half true.
Cliff Biffle:True. Well, actually, it turns out to be all the way true. Folks had flagged that as an OSHA thing we need to look into, which is valid. But it turns out those interrupt sources actually aren't enabled by default, despite what the book says. So those two are fine.
Cliff Biffle:There was a third that we didn't know about — not listed in the Ethernet interrupt section of the manual — that turns out to be the thing that's actually on by default. And, if I can jump to the spoiler, it's the management counters interface on the network interface controller. So it's this block of things which we skipped over because we're like, oh, management counters, that'd be cool someday — but it's probably fine. Right?
Cliff Biffle:Counters — that doesn't sound dangerous. What would go wrong with a counter if we don't do anything? Yeah. So the answer is, one of the counters is responsible for counting packets in and out. And by default, when that counter reaches its halfway value, it lets you know it's gonna overflow by causing an interrupt.
Cliff Biffle:There's no way to turn this off. Actually, it did turn out there's a way to turn it off, but it's really hidden in a different register.
Bryan Cantrill:Yeah. And, I mean, I guess, like, thanks for letting me know — thanks for killing me by letting me know that this is, like, a fire-breathing canary. I'm trying to figure out, again, what the metaphor is for this. And so it fires the interrupt when the counter is at its halfway mark, which also feels pretty aggressive.
Adam Leventhal:Panicking. Come on, man. Like, you're halfway there.
Bryan Cantrill:You're halfway there. I mean, the glass is — like, isn't the counter literally half empty? Like, why are you — what? You're complaining that it's half full.
Adam Leventhal:So, Cliff, can you walk me through this, then? So you get the interrupt, and then what happens? So —
Cliff Biffle:in this case, because there was no software to handle the interrupt, what happens is the kernel interrupt handler runs as usual. The kernel notices that it's from the Ethernet controller and sends a notification to the Ethernet-handling task, the net stack. The net stack wakes from sleep, looks at its notifications, says, oh, I've got an interrupt. Great. Let me go run my interrupt processing code.
Cliff Biffle:So it goes through a whole series of reading device registers and looking at bits in them and trying to figure out what actual hardware event caused this interrupt. And it runs off the end of the code because it didn't think to check the one register for this thing we hadn't turned on. So then it says, great. I've handled the interrupt. This must have been spurious because that does happen.
Cliff Biffle:I'll just re-enable the interrupt — which it does — and go back to sleep. And as soon as it goes back to sleep, the kernel says, hey, you got an interrupt. And then the whole thing starts right over. So the failure was interesting, because what we basically had was a network-activated time bomb in every service processor — and actually in the switch, in the Sidecar switch, and the power shelf controllers.
Cliff Biffle:All of them had this time bomb lurking. We'd never seen it, kind of because we usually update the machines before they hit this counter. But in hindsight, we had had a couple of infrequently used dogfood systems that would mysteriously fall off the network, and we never really knew why. But I think now we know why. Yeah.
Cliff Biffle:It's also fair
Adam Leventhal:Sorry — is it fair to say that this feels like it could be lurking in every consumer of this? Right? Like, this counter is so buried. The documentation does not make it clear.
Adam Leventhal:In some cases, it's wrong. The pathology waits for a gajillion packets or whatever or half a gajillion. Excuse me. Right? Like, could this be lurking in, like, everybody who thought they've properly implemented something on top of this?
Bryan Cantrill:Right. I'm not accusing this construct of being enemy action. But if it were enemy action, what would it look like?
Cliff Biffle:Yeah. So I actually was wondering that too. So I went and checked every open source driver I could find for the Ethernet block on this microcontroller. And it's not a super popular microcontroller, but it's kind of popular. So there were several code bases I could check.
Cliff Biffle:And in general, every code base I found tended to have a commit, usually months after the original version of the code, that sheepishly went in and turned these bits off. So people are hitting this in practice, but they're hitting it faster than us — which then sent me down a whole, like, why are we so late to the party here? Like, why are we getting bit by this at a customer site when other people — like, the Rust embedded HAL driver for this Ethernet controller — actually fixed this in 2022, I think. So —
Bryan Cantrill:Oh, interesting.
Cliff Biffle:Yeah. And I think the reasons are kind of a side effect of the system robustness. If we get a failure like this, the only symptom is that, like, oh, wow, it dropped off the network. In a normal system, having an interrupt storm like this means the entire computer locks up.
Cliff Biffle:All interfaces stop working. You have a brick. And it's a lot more obvious when you're having this class of system failure. And with our systems, it's just like, well, I don't know, maybe the cable to this one PSC is flaky; we should probably reseat the cable and move on. And —
Adam Leventhal:Yeah. And then you reboot it and reseat the cable and the problem's gone? Yep.
Cliff Biffle:Or you can
Bryan Cantrill:also — Cliff, I also do wonder how other folks use this microcontroller. Because, I mean, this is not, like, a network device, you know what I mean? Like, yes, obviously the network's important here, but we are not, like, pounding the network all the time. If we were using this microcontroller for — you could imagine use cases that would hit this much faster than we hit it.
Cliff Biffle:The management network on the rack is completely separate from the customer data network. And the amount of traffic flowing over it is relatively low and it's very controlled. It's all just our things talking to our things. It's also small packets. It's all UDP.
Cliff Biffle:It's very lightweight. So our packet frequency on that network is probably lower than your home WiFi network's. So we'll get to two-to-the-thirty-first packets sent a lot later than most people.
Bryan Cantrill:Yeah. A hundred and sixty-six days later, apparently. I — I realize it's a hundred and eighty-six days — sorry, Matt — whatever the number was.
Bryan Cantrill:Yeah. So, okay. But we've got this really interesting thing in the datasheet that you've discovered. What's next in terms of, like, how can we go explore this hypothesis?
Cliff Biffle:Oh, yeah. So I think I posted in chat saying, I may have found a thing that we're not handling, but I need to look into this more — because, I don't know, I tend to be habitually uncertain about things. But I went digging in the datasheet, and I found a test register in the Ethernet block which lets you force the hardware to set the counters just below overflow. So there's actually — Very helpful. Yeah.
Cliff Biffle:It's there for testing the overflow behavior without waiting for two-to-the-thirty-first packets. So once I saw that — I was worried I was gonna have to run the thing overnight, spamming it with packets, but no — I just went in with a debugger. And, I mean, I know it sounds like, in the kernel debugger, you would consider this a bug, but I used the debugger to write a device register to force this bit on.
Cliff Biffle:And then I pinged it 16 times, and the thing hung up with exactly the behavior we would expect — or, rather, were seeing — in production. So that was a strong indicator. And then if you do humility tasks at that point, you can see all the running tasks, and net is always marked as running. Idle never runs. So it's not yielding the CPU.
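To make the failure loop concrete, here's a minimal, self-contained model of what Cliff is describing — not the real Hubris driver, just a sketch: the only asserted interrupt source is one the handler never checks, so every pass concludes "spurious," re-arms, and immediately wakes again, and the net task never yields. The field names are illustrative:

```rust
// Illustrative model of the interrupt storm. The "MMC counter half full"
// status lives in a register the driver never reads or masks, so it stays
// asserted forever; the handler clears everything it knows about and
// re-enables the interrupt, which fires again immediately.
struct EthState {
    dma_status: u32,     // bits the driver knows how to service and clear
    mmc_half_full: bool, // the hidden status bit: never cleared here
}

fn irq_asserted(eth: &EthState) -> bool {
    eth.dma_status != 0 || eth.mmc_half_full
}

fn main() {
    let mut eth = EthState { dma_status: 0, mmc_half_full: true };

    for wakeup in 1..=5 {
        if !irq_asserted(&eth) {
            break; // would go idle here -- never happens in this failure mode
        }
        // "Handle" the interrupt: clear every status bit we know about...
        eth.dma_status = 0;
        // ...then conclude it was spurious and re-arm. mmc_half_full is
        // untouched, so the combined interrupt line is still high.
        println!("wakeup {wakeup}: no known work, re-arming interrupt");
    }
    println!("net stays runnable forever; idle and lower-priority tasks never run");
}
```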
Bryan Cantrill:Yeah. And, I mean, the great thing is that you've now reproduced this. Because when it's a compute sled — certainly at a customer site, but even a compute sled in a rack here — we don't have a debugger plugged into that thing. Like, the header is populated, but we don't actually — there's not a — so the network is actually our only way to understand what's going on inside that service processor. You have now reproduced this on the bench, where you actually have a debugger attached to it. So we can use the serial wire debug interface, and we can actually understand heaven and earth about what this thing is actually doing.
Cliff Biffle:Yep. We can ask it arbitrary questions even though its network interface is down because the nice thing about these embedded debugging interfaces like JTAG and serial wire debug is that they're not cooperative. They will work even if the computer doesn't want them to. It's also why we don't have them plugged in at a customer site because that would be rude and give us lots of weird powers over their computer. But, on the benches
Bryan Cantrill:As an aside — and I can't remember if we mentioned it before or not — but I do feel like one of the things I love that we have done is that we have used serial wire debug, and this incredible robustness of it. It really is a very robust mechanism. And as you say, it's not an opt-in mechanism. This is how the root of trust can control the service processor on our machine: the root of trust actually has that SWD line, and it can actually control the service processor and force it to attest itself and do all sorts of other things, which is extremely useful.
Cliff Biffle:Yeah. On boot, the root of trust actually stops the service processor — like, alters the contents of its RAM, forces programs into it, and makes them run, and verifies that they ran, in a way that the hardware cannot lie about. So that was Rick's idea a long time ago, and, boy, has it been a delightful hack. It's really quite powerful.
Bryan Cantrill:It's damn good. Yeah. It's a damn good idea. And it's been really, really robust. But it all goes to this road.
Bryan Cantrill:I mean, certainly, understanding Cortex microcontrollers was a first for me at Oxide, and really appreciating the robustness of that SWD interface — it's very, very helpful to have something that's so robust there. So, alright. So you've now reproduced it. We've got the symptoms. So what's the fix?
Bryan Cantrill:I mean, it feels like — what did we do? Do we just actually let this interrupt know that we're actually not — I guess we can't mask it, so we've gotta actually deal with it?
Cliff Biffle:Yeah. So once this interrupt goes off, you have to figure out which of about a dozen different bits in different registers you would have to set to make it stop happening. There's a whole tree process you have to follow of, like, read the top-level register and see which bit is set, and then go read a different register depending on which bit is set, and then, like, yada yada yada. And we could do that.
Cliff Biffle:It also turns out that there are bits you can set in a control register to just turn this feature off. We don't use them.
Bryan Cantrill:Oh, there you go.
Cliff Biffle:But the code I added is about 30 lines of comments and, like, two lines of Rust.
Bryan Cantrill:Like
Cliff Biffle:so: this is actually on by default, and here are the bits we have to set, and here's the code to set them, and now the interrupt shouldn't occur.
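For a sense of the shape of that fix — this is only a hedged sketch, not the actual Hubris change, and the register and field names below are stand-ins rather than the real stm32h7 identifiers — the idea is simply to set the MMC interrupt-mask bits so the counters can never raise the shared Ethernet interrupt:

```rust
/// Illustrative stand-ins for the MMC interrupt-mask registers; the real
/// driver goes through the stm32h7 register crate, and the actual names differ.
struct Reg(u32);
impl Reg {
    fn write(&mut self, v: u32) {
        self.0 = v;
    }
}

struct MmcRegs {
    rx_interrupt_mask: Reg,
    tx_interrupt_mask: Reg,
}

// Assumption for this sketch: setting every mask bit suppresses all counter
// interrupt contributions, including the "half full" one that's on by default.
const MASK_ALL: u32 = 0xFFFF_FFFF;

fn mask_mmc_interrupts(mmc: &mut MmcRegs) {
    mmc.rx_interrupt_mask.write(MASK_ALL);
    mmc.tx_interrupt_mask.write(MASK_ALL);
}

fn main() {
    let mut mmc = MmcRegs { rx_interrupt_mask: Reg(0), tx_interrupt_mask: Reg(0) };
    mask_mmc_interrupts(&mut mmc);
    println!("mmc masks: rx={:#010x} tx={:#010x}", mmc.rx_interrupt_mask.0, mmc.tx_interrupt_mask.0);
}
```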
Bryan Cantrill:Amazing. And then, I mean — Cliff, as one would, you said, look, we can't say conclusively we found the issue, but boy, we found an issue that matches everything symptomatically. It does feel like we've got a high degree of confidence that this is the issue that we saw in production.
Cliff Biffle:Yeah. Like, the simultaneous failure is really my — I mean, okay, I'm a little bit of a ghoul, so one of my favorite parts about this failure is the fact that it was simultaneous. Because if it had hit one sled, we probably wouldn't have had this focus, and we wouldn't have been able to rule out really simple, basic things.
Cliff Biffle:But the fact that it was an approximately simultaneous failure of every machine in this class of machines in a rack was a big wake-up call of, like, it's either a bad packet on the network, or it's a power supply issue, or it's something across all of them. What it actually turned out to be is just the fact that a lot of our management traffic is multicast. A lot of our access patterns over the management network to these compute sleds are uniform, which means if they all start up at about the same time, their packet counters have about the same value.
Matt Keeter:And, yeah, we did the math on this afterwards, where we were seeing, you know, somewhere between 100 and 150 packets per second, multiplied by a hundred and eighty-eight days. This is a case where it actually does line up with a nice round power of two. So that was another confidence builder that this was the right diagnosis.
Cliff Biffle:Right. There was just a conversion
Will Chandler:That's true. That we had to
Cliff Biffle:discover to convert between seconds elapsed and number of packets sent. And then it's a nice power of two.
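Checking that arithmetic as a minimal, self-contained sketch — the numbers come from the conversation (roughly 188.5 days of uptime and a counter that interrupts at 2^31 packets); nothing else is assumed:

```rust
// At what packet rate does the counter hit its half-full point (2^31) after
// roughly 188.5 days? The answer lands comfortably inside 100-150 pps.
fn main() {
    let half_full = 2_f64.powi(31);       // interrupt fires at the halfway mark
    let uptime_secs = 188.5 * 86_400.0;   // ~188.5 days of uptime
    println!("~{:.0} packets/sec", half_full / uptime_secs); // ~132
}
```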
Bryan Cantrill:Yeah. That is absolutely wild. And, I mean, this is one of those — and I feel we've had this, you know, many, many times — where you've got these symptoms, and somehow we are gonna explain these symptoms. And the simultaneity is like, if there's something that's gonna explain that, you know, what could it possibly be? And I mean, Cliff, I felt this was very surprising.
Bryan Cantrill:I mean, ultimately — I mean, of course, it makes complete sense, but it was also not — I mean, we really got there from Gedankenexperiment and then careful reading of the datasheet, not from thinking, like, oh, there's probably some network counter that's interrupting incessantly somewhere. I mean, it was — I don't know. Maybe you were less surprised, but I felt it was definitely surprising.
Cliff Biffle:Yeah. I was honestly mostly focused on the fact that I originally missed the existence of this interrupt when I was writing the Ethernet driver. So, you know, it's sort of a bug I created several years ago without realizing it.
Bryan Cantrill:Well, they're all — I mean, aren't they all? But I also feel that, like — and this is where, because we have not used the kind of vendor source code for this stuff, we don't use the embedded HALs, we have gone our own way — which does require us to own this stuff at a deeper level. But I don't know.
Bryan Cantrill:I mean, yeah.
Cliff Biffle:One of the things I suggested in chat when I was sort of self-post-morteming this was that it probably makes sense for us to periodically go take a look at the other open source drivers for our peripherals. Because, for a bunch of technical reasons, we can't use the vendor drivers. We can't use the net HAL or most of the Rust ecosystem drivers. And it's simply that they generally don't assume they're running on a protected-mode operating system. They don't assume that their interrupts go through an IPC system. And there's a bunch of assumptions they make that are valid for 99% of users but are not valid for us.
Cliff Biffle:So in the absence of being able to just share the bug fixes, we need some kind of communication channel to make sure we at least hear about them.
Bryan Cantrill:Yeah. And — yeah. So there's a question in the chat: have we notified ST that the docs are wrong? No.
Bryan Cantrill:We probably should do that. ST's been — I mean, again, as these things go, pretty receptive. I view ST as kind of a model for pretty good transparency here, and generally pretty good documentation. I mean, I know that, in this case, it was misleading —
Cliff Biffle:or wrong. So the answer here is a little complicated, because ST, to their credit, tends to release very comprehensive docs on their microcontrollers. They also never update them unless — I don't know. I don't know what you have to do to get them to update them. They will release errata sheets that explain that, oh, by the way, that feature we're still advertising isn't implemented.
Cliff Biffle:But what they won't do is release an errata sheet for a thing that just kind of sounds misleading. So it's actually not clear that ST has a policy that would result in this getting changed. This has been reported to them — I posted this on Mastodon, and several people reached out saying, like, oh, yeah. That bug.
Adam Leventhal:Yeah. Man. I remember that bug.
Bryan Cantrill:That's the point.
Cliff Biffle:And they all reached out to ST. ST is aware, but, like, ST's position is, hey, man, we described this in the manual, and, yeah, maybe you don't find the prose as accessible as you'd like, but it's not a bug per se, and, you know, you do keep buying our chips. So I'm maybe a little cynical there, but I don't think that there's an obvious pathway to getting the manual updated. What we can do, though, is publish this in as many forms as we can, which is part of why I took a bunch of the notes and wrote up that public Hubris GitHub issue, in addition to —
Bryan Cantrill:Yeah.
Matt Keeter:Yeah. And and
Bryan Cantrill:and if anyone wants to point us to other podcast episodes from other companies talking about their firmware bugs, we will gladly listen to absolutely all of them — because I do think that this stuff is extremely valuable information to share. And I think it also just underscores the importance of having open source. Because in a closed, proprietary world — and we do work with vendors where all of the source, where the documentation, is proprietary — it makes it a lot harder to find problems like this and to go validate them. So the transparency here is a win and is helpful.
Bryan Cantrill:It's just not totally sufficient in this case. Well, this is awesome. Actually, I do love the fact that we were talking about this — Megan was asking this morning, like, do we need to — should we, like, issue a patch for a previous release on this?
Bryan Cantrill:Like, well, no. Because we know that it actually is gonna take a while to hit this. So by the time they hit this, we'll have them updated to the most recent version. So it's actually very helpful that this — sorry.
Cliff Biffle:Go ahead. Yeah. If a customer doesn't wanna upgrade for another half a year, then we may wanna push out a patch, but we don't think anybody's really, you know, insisting on that. So
Bryan Cantrill:Yeah. Yeah. And then the mystery of sled 23, which is what kinda caused all this — that's gonna be a future podcast episode. So, you know, stay tuned. We know we've been on a bender recently of oversharing fascinating bugs, but we just seem to be hitting more and more content generation.
Bryan Cantrill:Well, this is great. Cliff, again, extraordinary work on this — I'm not sure if there was some hidden work here, but you actually debugged this remarkably quickly once you hit the datasheet. Felt like it went quickly once it was with you. It
Cliff Biffle:took a few hours of concerted work, but I just wanna emphasize that, like, I was just knocking down the pins that Matt and Laura and Will and the other people that were working on this before I got there had set up.
Bryan Cantrill:And, well, yeah, once again, I think this has been a theme on these other bugs too: we've got the baton being passed, everyone taking their turn with the baton. And this was another good one. Really terrific work — very exciting to get this fix in.
Bryan Cantrill:Adam, I for one would like slightly less content generation, even though this is terrific content.
Cliff Biffle:That's right.
Bryan Cantrill:We can ease up, I feel like. You know, if the gods are listening, you know.
Adam Leventhal:That's right. On one hand, debilitating bug. On the other hand, podcast episode solved. But
Cliff Biffle:True podcast
Bryan Cantrill:episode solved.
Adam Leventhal:It is. We're gonna be off for a couple weeks, so we can just pause on some of the debilitating bugs.
Bryan Cantrill:We can. We can pause on some of those. And actually, speaking of upcoming weeks: I'm out next week, but then the week after is gonna be our last episode of the year. So that's wrap-up time. Gonna do our wrap-up of episodes.
Bryan Cantrill:So that's gonna be exciting. Yeah. Which means we will be doing a title and image review.
Cliff Biffle:So just
Bryan Cantrill:one That's right. You know, not too much pressure on anybody. But
Cliff Biffle:I assume that episode's just gonna be chime sounds the whole time.
Bryan Cantrill:You know, I mean, that is actually a bit of an open question. I feel like it is an act of cruelty to have it be just nonstop chimes, but, yeah, it will be a lot of chimes, I think. Or maybe no chimes.
Adam Leventhal:Maybe we'll get them all in at the beginning or something.
Bryan Cantrill:Exactly. We don't actually hate the listener, even though sometimes our audio behavior is — yeah — is — No.
Adam Leventhal:It's a much more complicated ambivalence.
Bryan Cantrill:Yes. It's a much more complicated relationship. Exactly. Well, again, terrific work — Will, Matt, and Cliff, and, as I said, a lot of other folks that were working this problem, Robert and others. But really, just —
Bryan Cantrill:A lot of great work all around, and fun to have another podcast episode — but we don't need to have any more, gods of the world. You can actually — it's okay. Things can actually work now. Now that we've — between future lock and our data corruption episode and — yeah. I also would like to say we've kinda, like, hit on all — why am I even saying this?
Bryan Cantrill:Why am I literally tempting the gods with the — yeah. I'm just gonna shut up now. We know the gods are listening.
Adam Leventhal:Yeah. Were you about to say, like, we've never seen a bug in this subsystem?
Bryan Cantrill:No. No. I wasn't. I sure wasn't. I sure wasn't.
Bryan Cantrill:So, awesome. Thank you very much, especially, for joining us, and great work debugging this. And we'll get the word out there for other folks that may have this part. And let us know — if there are other issues that we should be aware of, please let us know. So thanks, everybody, and we will see you next time.
Bryan Cantrill:So, in two weeks — and bring some of your highlights. We'll be doing our wrap-up. See you next time.