Oxide and Friends

Oxide ships a rack-scale system, so how do we test the manufacturing of the backplane and switches? Previously we've been using a collection of sacrificial servers, but this was unwieldy, expensive, and unscalable: all big problems as we ramp up manufacturing to 100s a month! Enter "Reverso", an extremely simple test fixture that uncovered an extremely complex bug.

In addition to Bryan Cantrill and Adam Leventhal, speakers included Oxide colleagues, Robert "RFK" Keith, Adam "The Hammer" Suczewski, and Matt Keeter.

If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!

Creators and Guests

Host
Adam Leventhal
Host
Bryan Cantrill

What is Oxide and Friends?

Oxide hosts a weekly Discord show where we discuss a wide range of topics: computer history, startups, Oxide hardware bringup, and other topics du jour. These are the recordings in podcast form.
Join us live (usually Mondays at 5pm PT) https://discord.gg/gcQxNHAKCB
Subscribe to our calendar: https://calendar.google.com/calendar/ical/c_318925f4185aa71c4524d0d6127f31058c9e21f29f017d48a0fca6f564969cd0%40group.calendar.google.com/public/basic.ics

Bryan Cantrill:

Okay. We're gonna try this. Can you hear us now?

Adam Leventhal:

Yes. You say split brain and I say nobody can hear you.

Bryan Cantrill:

Isn't that just what the other half of the split brain thinks?

Adam Leventhal:

Yeah. But nobody in chat could hear you.

Bryan Cantrill:

But everyone in this room could hear me. I was not alone. Actually, I'm joined, the litter box is at capacity, friends. Full house. We have a full house.

Adam Suczewski:

Doubled our Adams too.

Bryan Cantrill:

We do double the Adams. Finally, Adam Suczewski has joined us in the litter box, along with a frequent guest, but never in person: man, myth, legend, Matt Keeter. Hello. In the litter box.

Bryan Cantrill:

Are you able to hear all of us, Adam?

Adam Leventhal:

All three of you. Yes.

Bryan Cantrill:

And then, not in the litter box (as if that's pejorative), we've got RFK joining us. Robert Keith, how are you?

Robert "RFK" Keith:

Well, I forgot that it's Hubris-palooza right now and they're all doing their thing.

Bryan Cantrill:

They're all doing their thing. Yeah. No, it is. It's like the Fyre Festival in here. It's just an absolute wreck with the Hubris-palooza notes.

Bryan Cantrill:

It was really fun to have the Hubris team. The Hubris team's been out all week, including Matt, and as a result, Matt does not know which time zone his body is in. I'm very confused. Which should be excellent. Alright.

Bryan Cantrill:

We are here to tell the tale of Reverso. And I mean, how much have you followed the tale of Reverso?

Adam Leventhal:

You know what? I feel like, you know, I talk about, like, the Hawking radiation of these things that I don't particularly understand, these black holes that do a bit of something, and I get some of the vibes. But Reverso, I'll just tell you what I know so you can correct any misconceptions. I know it's something to do with hardware and something to do with manufacturing and something that sure has taken a lot of work. That's what I have.

Bryan Cantrill:

I actually think you're kind of wrong on the last part of it, the lot of work. Yeah. So I would say that Reverso, I think, is a classic Adam Leventhal hardware engineer kind of a thing.

Adam Leventhal:

Okay. Now I'm interested. Okay.

Bryan Cantrill:

Yeah. Exactly. Now you're at the edge of your seat. Now this is because, I would say, if anything, Reverso is deceptively simple. Okay. And this is the kind of thing that you come up with. Like, why can't we just? And as it turns out, it is both a good idea and one that is, as we have learned, slightly fraught, but through no fault of its own. So with that tantalizing lead-in, RFK, you wanna talk about when we did this? Matt and I were talking earlier today and trying to remember. The first idea for Reverso definitely dates back to the development of Sidecar.

Robert "RFK" Keith:

Yeah. I mean, it's a long time ago. Like, a really long time ago. The what happened was

Bryan Cantrill:

for a long time here.

Robert "RFK" Keith:

We have. And what happened was, Sidecar has, as I'm sure people know, a ton of cables inside it. And it's also our

Adam Leventhal:

switch, just to be clear. Yes.

Robert "RFK" Keith:

It's you know, the switch part actually seems to work reliably all the time. The the cabling parts did not. And I mean, that's pretty normal. Right? You gotta be able to test those kind of things.

Robert "RFK" Keith:

But since our system is very difficult to access, once we've got this thing all wired up, we see, you know, failures in particular links. This is all Ethernet, so you'll have one lane that just kinda, you know, doesn't work. And you're like, okay. Well, how do I figure out where this is? Is it inside the sidecar and all the cables in there?

Robert "RFK" Keith:

Is it in the backplane? After the sidecar, is it both? Is it just intermittent? Like, does this you know, does it work at speed, or does it work, you know, very low rates? I don't know.

Robert "RFK" Keith:

Okay. And we're looking at this thing and saying, okay. We could build this, like, massive test fixture to go and, like, test each one of the cables and then install them. And then, like, presumably when they're installed, they're okay. Turns out that's not true.

Robert "RFK" Keith:

Okay. How do we get the cables inside the sidecar? It was like, well, we can loop those back on each other with a cable of their own and just plug them into the back of the thing. Alright. But once we have this all installed into a rack, when everything's been plugged in, you still have to test it.

Robert "RFK" Keith:

Like, okay. Well, how do I do that without having an entire rack full of gimlets? Yeah. Which is really expensive. Like, really, really expensive.

Robert "RFK" Keith:

And we can't just, like, have those in the test facility. That's ridiculous. We're not gonna do that. So Tom and I were like, okay, let's be really dumb. And what if we just, you know, turn the signal around?

Robert "RFK" Keith:

Part of it.

Bryan Cantrill:

Yeah. Sorry. Go ahead. Please.

Adam Suczewski:

Turn it

Robert "RFK" Keith:

around and it just goes right back into the sidecar. We plug this thing in like it's a gimlet, like it's a compute sled, and it just goes right back to itself. And it doubles the length, so, like, it probably won't work at line rate. Probably can, actually. But, you know, that doesn't matter.

Robert "RFK" Keith:

We're just trying to see if it's broken. Like, if the copper's broken, it doesn't matter. Right? So whatever. We're just gonna build this board.

Robert "RFK" Keith:

And it's really it's the same width as a gimlet because it has to fit into a gimlet chassis. Because again, we're trying to be really cheap and dumb. Like, we don't know. We're not gonna build its own fixture. We're just gonna use the same thing that the gimlet goes in and just build more of those.

Robert "RFK" Keith:

No. Like, simple. The simplest thing possible. The board's like maybe two inches long, and it's, you know, the same width as the gimlet. And it's got ExaMAX connectors on it.

Robert "RFK" Keith:

And then right after the ExaMAX connector, normally where you would break these things out and go into an Ethernet chip, it's like just this tiny little loop. It goes right back into the other pin, you know. And then you plug it in like a Gimlet. And our idea at the time was, well, sidecar will just, you know, see itself and link up and, you know, whatever. No problem.

Robert "RFK" Keith:

And so we built it. Tom and I, you know, birthed this board, and then no one cared for like four years.

Bryan Cantrill:

Actually, if you don't mind me asking, how hard? I mean, this seems like a pretty straightforward board, because you're basically gonna turn around every link, and every link that comes in is immediately gonna go out the other side.

Robert "RFK" Keith:

Yes. It's extremely simple. But

Bryan Cantrill:

And yeah. And it should be said that the reason that we can do this is because every compute sled, Gimlet or now Cosmo, whether that's the Milan-based or the Turin-based, is connected to two switches. So you actually have two connections. So you can just come in on one and go out on the other, is the premise of Reverso.

Robert "RFK" Keith:

Exactly. And so it was supposed to be just, like, the simplest possible thing ever. And then because we tried to make it fit into all the things that we had, it had its own little, like, quirks, as everything does. But generally speaking, this is like a completely passive, very, very simple board.

Bryan Cantrill:

Yes. And so, okay, you guys lay the board out. And when did we actually fab the board? I feel like we did fab it immediately.

Bryan Cantrill:

Yeah. We immediately made it. Right.

Robert "RFK" Keith:

Right. And they sat in the factory for, like, three years. It was pretty awesome. We're like, we're gonna use it. And just, like, never got around to doing it. And then, but then eventually

Adam Leventhal:

Right. Did you know this was did you know this was gonna be an airing of grievances? Like, did you did you see this one coming?

Bryan Cantrill:

First of all, I would say at least 30% of our episodes are an airing of grievances. So, yes. I mean, of course, like, not at all a surprise.

Robert "RFK" Keith:

No. I'm not even mad. I'm just like, I can't believe that you let it go that long. Like, I feel like we abandoned it. It's bad.

Robert "RFK" Keith:

Like, we've done we've done it wrong.

Bryan Cantrill:

I feel like, I mean, first of all, Reverso, and correct me if I'm misremembering, was really first thought of as a sidecar test apparatus more so than one for the entire cable backplane. Although, we certainly had the vision of doing it on the whole cable backplane. Yeah. I'm trying to remember, like, why didn't we end up picking it up earlier? I don't know.

Robert "RFK" Keith:

Well, because we thought we could solve the cable backplane breakage issues in a different way. And we had more concerns about the rat's nest inside the sidecar, because we did see a lot of failures in there that were more easily addressed with just, you know, a loopback cable that plugs in directly in the back of the sidecar. Then we kinda mitigated the cable backplane issues in different ways. So we're like, okay.

Robert "RFK" Keith:

You know, this is working, but really, there is no replacement for these fully assembled system tests, unless you wanna go and build 32 gimlets and, like, run everything to validate the backplane. And once it's installed and the sidecars are in and all this, like, that's kind of a difficult problem, or just an expensive one. Right? And these boards are really cheap. Really cheap. It's just

Robert "RFK" Keith:

Material on the frame, which we have chassis for gimlets and sidecars or gimlets and cosmos. It doesn't really matter at that point. And we built this with the intention of, like, hey, when we hit volume, this is gonna matter because we're not gonna be able to have these test flight gimlets as we were calling them for a long time.

Bryan Cantrill:

Maybe this is why. It wasn't

Robert "RFK" Keith:

There's also an insertion count limit on the ExaMAX connectors: you can only, like, insert them 300 times before they break, or they have a signal integrity issue. So you're gonna start seeing errors that actually aren't really errors. You just have bad connectors on your test flight gimlets.

Robert "RFK" Keith:

But those are expensive. So why would I burn out these connectors on things that are, you know, valuable? I could just burn them out on this cheap little board that no one cares about. Like, that's a way better solution in manufacturing than, you know, something you love and care for.

Bryan Cantrill:

Yeah. And I think we got away with it too just because our volumes were like low enough. And also we had like plenty of other problems.

Robert "RFK" Keith:

Totally.

Bryan Cantrill:

We had lots of other, I mean, whenever you manufacture something for the first time, you're going to learn a whole bunch. But we got to the point where sometime last year when things really took off and now we really needed to scale manufacturing quickly. And there were a whole bunch of things that we needed to go do to do that. We really needed Reverso that this was no longer merely nice to have. And I feel like that was okay.

Bryan Cantrill:

That feels like that was the middle of last year, maybe towards the end of last year, that we're like, okay, this is now going to be really load-bearing for our ability to scale manufacturing. Like, if we don't do this, this is going to really impede the volume of racks we can ship.

Robert "RFK" Keith:

Yeah. Sure. My sense of time is completely messed up, but if you say that time, I'll agree with you. Sounds good.

Bryan Cantrill:

And oh, and then also, because Reverso is effectively occupying a cubby slot, and it has no computer in there, we definitely do not want to ship with Reverso. What's the story of the remove-before-flight kind of orange? It feels like it'd be very hard to ship a rack accidentally with these things.

Bryan Cantrill:

What was the origin of that? Do you even know?

Robert "RFK" Keith:

We've done a bunch of those things. And they were on the test flight gimlets too because, I mean, like, as you said, they look the same. But you want that just that. And we're like, okay. We have tape.

Robert "RFK" Keith:

We're just gonna go ahead and put all this on there. But he had also, for other things, like the keys, taken those, like, little tags that they would put on the bottom of, like, F-14 missiles, the tabs you pull before they take off. You know what I'm talking about? Yeah. Yeah.

Robert "RFK" Keith:

Those are on the keys for the rack too. So there's all sorts of stuff like that all over, which is fun. Just to, like, please get rid of these before we send this to somebody. They don't want to have this. This is not what they asked for.

Bryan Cantrill:

They did not ask for a rack full of Reversos. So okay. So we start using it. So this is great. Alright.

Bryan Cantrill:

So now we're gonna use Reverso, and the concept feels so simple. And Matt, is this where, because I feel like you had to do some things initially just to support this at all in software, because software is a little surprised by this.

Matt Keeter:

So one of the first problems we had was that we would get kicked off the network at our contract manufacturer site, through no fault of their own, whenever we activated the Reversos, because all of a sudden packets would come in to the tech port of the rack, loop around the Reverso, and shoot back out the tech port on the front of the rack. And that is a really unpleasant thing to be doing to someone else's network. And so their switches would then just turn off that port. And we would have to go send them an email and be like, sorry about that. Please turn our port on again.

Matt Keeter:

We think we've got the configuration right this time and it only took a couple of tries for that to actually work out.

Bryan Cantrill:

And so, was that changing the SP on the sidecar to configure things differently?

Matt Keeter:

Yeah. Exactly. So the SP on the sidecar is responsible for configuring the network switch that runs the management network, and we have a lot of control over how ports map to VLANs and how everything is interconnected. So we have a special image for just the Reversos that makes the ports a little bit more compatible with the outside world by hiding the weirdness inside the rack. Yeah.

Matt Keeter:

Interesting.

Bryan Cantrill:

And this is kind of in hindsight, maybe this this was an indicator of things to come that the the world was not yet ready for Reverso. The world was not yet ready for the

Matt Keeter:

same MAC address showing up on multiple points.

Adam Suczewski:

Yes. Exactly.

Bryan Cantrill:

And admittedly, Reverso results in rather unexpected behavior in the rest of the world, because it's like, well, hey, the thing that you sent over there, now it's over here.

Adam Suczewski:

So I

Adam Leventhal:

think I have been open on this podcast in the past about being kind of a network aphasic. So every time Ry, our network genius, talks about, like, loops and cycles in networks being bad, obviously, like, I fall asleep immediately. But I think I'm starting to understand how this would be very surprising to a network if things started kind of coming up in places they didn't expect.

Robert "RFK" Keith:

Yeah. It's also kind of funny. Like, in a manufacturing setting, all the stuff that you're running on any piece of equipment is pretty much meant to, like, break rules to test things. And so when the rest of the network around you, like, freaks out, you're

Adam Suczewski:

like, well,

Robert "RFK" Keith:

you know, it's not always my bad. Like, we're supposed to do this.

Bryan Cantrill:

Right.

Robert "RFK" Keith:

That's my feeling, though.

Matt Keeter:

But we got this to a point where we were no longer getting kicked off the network at our contract manufacturer. And that was probably a year or so ago. Like, this was one of those things where we brought it up for testing and then kind of put it on the back burner for quite a while. And, you know, we could send packets in, we could get packets out, we could use the Reverso to prove that the links were good. And we hadn't really productized that until much more recently, when things got more complicated.

Bryan Cantrill:

Yeah. So things got more complicated because now we desperately needed Reverso. And Adam Suczewski, you had joined our budding manufacturing software team. And I think what you knew when being tossed into this inferno was that there was a transient networking error. Do you want to describe the error? Because this is where the second odyssey of Reverso begins.

Adam Suczewski:

Yeah. So the transient network error was: we have a software suite of tests we run on the rack called RackTest. What we were seeing is, we run the RackTest service as we ordinarily would, and there are dropped TCP connections.

Bryan Cantrill:

So where are you running RackTest?

Adam Suczewski:

So each manufacturing station has a desktop computer that's running this driver. That computer is connected over Ethernet to the sidecar tech ports. There's two RJ45 Ethernet ports in each sidecar, so we have four cables between the station computer and the sidecars. The sidecar has the VSC7448, which is the management network switch, which also sends through the tech port traffic.

Adam Suczewski:

The Tofino, the larger, you know, high-speed data network switch. A PCIe connection to the scrimlet, so that's the PCIe-attached compute sled that configures the sidecar. And then running on that, there's the switch zone. So it's the zone for network management that's running the other end of this driver.

Adam Suczewski:

So this is the RackTest agent. So the behavior is, you spin up the service on the station computer, you send the agent to the scrimlet. The scrimlet sends that to all other hosts, if there were any. In this case, we're on the Reverso.

Bryan Cantrill:

And what what are the what does that agent do? So

Adam Suczewski:

the yeah. And

Bryan Cantrill:

this is a rack that includes two sidecars and two scrimlets, because you have to have a computer that controls each of the sidecars. And then the rest of the rack ideally is all Reversos.

Adam Suczewski:

Yeah. So good question. So where I entered the story was starting to look at a piece of software that was already written, already existed, but had these intermittent blips. So RackTest, from the perspective of the station computer, you get effectively a status report that says these things are up or they're down or what have you. And it would show that there are three networks in the Oxide rack: the ignition, so it's the power network, the management network, and the high-speed data network.

Adam Suczewski:

And what we would see is this process of the driver deploying the agent, the agent collecting data, the test passing, and then due to some intermittent failure, everything tearing down, restarting, and cycling through that process. So inside, I mean, what it's basically doing is, on the agent, there's tests of each of the three networks. So ignition, data network, management network. Ignore the ignition case for now, but management network and data network are basically doing a ping-like test. So we're sending out a multicast packet using DLPI, so the Data Link Provider Interface, to each of the network interfaces for each cubby and for each of the network types.

Adam Suczewski:

So the management network and the high-speed data network. And right, if you get your packet back, you know your link is up and you're good. And so you get your packet back.
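
As a rough illustration of the ping-like test described here (this is not RackTest's actual code), the loop over cubbies and networks might be sketched in Python as follows. The `probe` callable is a hypothetical stand-in for "send a multicast frame over DLPI on that cubby's interface and wait for it to loop back through the Reverso"; the cubby count of 32 and every name below are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

NETWORKS = ("management", "data")  # ignition is ignored here, as in the discussion
NUM_CUBBIES = 32                   # assumed cubby count for this sketch

Link = Tuple[str, int]


def run_link_test(
    probe: Callable[[str, int], bool],
    num_cubbies: int = NUM_CUBBIES,
) -> Dict[Link, bool]:
    """Probe every (network, cubby) pair once and record whether the frame came back.

    `probe(network, cubby)` is a hypothetical stand-in for sending a multicast
    frame on that cubby's interface and waiting for it to return.
    """
    results: Dict[Link, bool] = {}
    for network in NETWORKS:
        for cubby in range(num_cubbies):
            results[(network, cubby)] = probe(network, cubby)
    return results


def failed_links(results: Dict[Link, bool]) -> List[Link]:
    """Return the (network, cubby) pairs whose probe never came back, sorted."""
    return sorted(link for link, ok in results.items() if not ok)


def fake_probe(network: str, cubby: int) -> bool:
    """Simulated rack in which one management link has a broken cable."""
    return not (network == "management" and cubby == 7)
```

Calling `failed_links(run_link_test(fake_probe))` on the simulated rack points at the single bad management link, which is the kind of answer a loopback fixture is meant to give without a rack full of real sleds.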

Bryan Cantrill:

So what is the path that packet is gonna take in the functional case?

Adam Suczewski:

Yeah. So in the functional case, let's start, I guess, in the high-speed data network. The switch zone runs the Oxide networking services. So this is Dendrite and Maghemite and things maybe other folks on here can describe in more detail, but it's a set of drivers in the illumos host on the scrimlet that is applying appropriate headers and forwarding packets to the Tofino P4-programmable switch, connected through PCIe. So in the data network case, you're gonna send the packet over PCIe to the Tofino through this set of Dendrite drivers.

Adam Suczewski:

It should go out the Tofino port, up the cable backplane, through the Reverso, back down through the Tofino, and then back in, you know, through that same set of drivers and VLANs and the set of software running in the switch zone. The management network, similar but different. You go, you know, same Dendrite drivers. You go through PCIe to the Tofino. The Tofino has a port to the other chip, the management network switch, the VSC7448, which is also that same switch that we came in from through the tech port.

Adam Suczewski:

But that receives packet go ahead.

Bryan Cantrill:

Yeah. That'll become important later. Yeah. Like, that that becomes important. Check it with it.

Bryan Cantrill:

This is a Chekhov's switch. Is that what you're gonna say?

Adam Suczewski:

Yeah. This is called foreshadowing. So you're back in through the VSC7448, which does some VLAN tagging, then goes out to each of the cubbies through the management network, comes back in, back through the Tofino, back to the switch zone through Dendrite and friends, to your, you know, application software that is running in that switch zone.

Matt Keeter:

Yeah. One of the important things here is that the VLAN tags mean that we have essentially a separate network for every cubby in the system. So packets leave the scrimlet with a particular VLAN tag, and then the VSC7448 sends those packets to one of those cubbies based on the VLAN tag. And then when a Reverso is populated, those packets loop back around and reacquire the VLAN tag associated with that cubby when they reenter the VSC7448.
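
To make the per-cubby VLAN idea concrete, here is a small Python model of a frame leaving the management switch, looping through a Reverso, and being re-tagged on the way back in. It is a sketch under stated assumptions, not the VSC7448's real configuration: the VLAN numbering (`VLAN_BASE`), the cubby count, the choice to strip the tag on egress, and the single-switch simplification are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NUM_CUBBIES = 32   # assumed cubby count
VLAN_BASE = 0x100  # hypothetical numbering; the real VLAN IDs are not given


@dataclass(frozen=True)
class Frame:
    payload: bytes
    vlan: Optional[int] = None  # None models an untagged frame on the cubby link


def vlan_for_cubby(cubby: int) -> int:
    """Each cubby gets its own VLAN, so each cubby is effectively its own network."""
    return VLAN_BASE + cubby


def switch_egress(frame: Frame) -> Tuple[int, Frame]:
    """Pick the cubby port from the VLAN tag and (by assumption) strip the tag."""
    if frame.vlan is None:
        raise ValueError("frame from the scrimlet must carry a VLAN tag")
    cubby = frame.vlan - VLAN_BASE
    if not 0 <= cubby < NUM_CUBBIES:
        raise ValueError(f"no cubby for VLAN {frame.vlan}")
    return cubby, Frame(frame.payload, vlan=None)


def reverso_loopback(cubby: int, frame: Frame) -> Tuple[int, Frame]:
    """A Reverso is passive: whatever arrives at the cubby goes straight back."""
    return cubby, frame


def switch_ingress(cubby: int, frame: Frame) -> Frame:
    """Re-apply the VLAN tag associated with the port the frame arrived on."""
    return Frame(frame.payload, vlan=vlan_for_cubby(cubby))


def round_trip(frame: Frame) -> Frame:
    """Scrimlet -> switch -> cubby (Reverso) -> switch -> scrimlet."""
    cubby, untagged = switch_egress(frame)
    cubby, looped = reverso_loopback(cubby, untagged)
    return switch_ingress(cubby, looped)
```

In this model a frame sent on cubby 3's VLAN comes back on cubby 3's VLAN, so software in the switch zone can tell which cubby's link it has just exercised.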

Bryan Cantrill:

So this is testing a lot. The purpose is to test the links, to verify that the cables are actually working. And so Adam, what you're seeing is that the link test would be adequate, would be fine. But then the entire kind of RackTest would give up the ghost.

Adam Suczewski:

Yeah. Is that right? That is right. Yeah. And so at this point and and this is a system that I was fairly new to and like most of the components I was learning about as I went.

Adam Suczewski:

So the question is, like, how to approach the problem. And so there's kind of the physical representation. So you have the station computer, the sidecar, the scrimlet, the Reverso. You have, like, a logical representation. So is it application software, like RackTest itself, some of the host services like Dendrite, the host OS? Like, is it something in, say, the kernel?

Adam Suczewski:

Is it something in the SP? So Humility, or Hubris, runs the Sidecar SP, which configures a bunch of the networking rules. Is it, like, one of the subcomponents on one of these boards, like the Tofino or the VSC7448? Is it, like, a physical phenomenon? Like, is there something electrical, or is it a connector issue with the board?

Adam Suczewski:

Also, this particular rack under test was remote to us. It was at a contract manufacturer site. So, I mean, you could ask somebody to go and manipulate it for you, but there was a wide range of both physical components and logical components where the problem could be.

Bryan Cantrill:

Yeah. This is, I mean, rapidly turning into more Murder on the Orient Express for you. And as you're learning about the systems, I mean, you've been at the company only a couple of months at this point when you were looking into this. And I'm sure you're thinking, like, how complicated can it possibly be to get a packet from the tech port to the switch zone? And the answer, as you were learning, is really extraordinarily complicated.

Bryan Cantrill:

And I mean, Adam, did you read Paddle to the Sea as a kid? No. Caldecott award winner. This is one of these books I think must have had a very narrow shelf life, because I think maybe someone else born in 1973 would get it. But the premise of Paddle to the Sea is that this native Canadian makes a carving of a canoe called Paddle to the Sea.

Bryan Cantrill:

And it says, I'm Paddle to the Sea and I'm going to the ocean, and basically set me back in the ocean on my way through the Saint Lawrence Seaway. And this was a great book about, you know, this little carving of a canoe getting into all this trouble, and other characters kind of putting it on its way. I feel that a packet through the tech port is like Paddle to the Sea. Like, I am a packet. I'm going to the switch zone. If you find me, please set me back on my way.

Bryan Cantrill:

Because it just goes through so many different complicated paths to get there. It's amazing that it works at all. But Adam, I'm not sure if you were having the same reference for it as you were learning about it.

Adam Suczewski:

I had the same thought, maybe not the same reference. But so Josh had left me a very useful debugging tool for this, which he had called NetWatt. And he had one commit in this repository, which is "sigh". And that was kind of my next, like, bread crumb to go off of.

Bryan Cantrill:

Is this like the opposite of, I have a truly marvelous proof that this margin is too small to contain, Adam? This is like, I have an unknown commit to NetWatt with the commit message "sigh".

Adam Suczewski:

So it was a smaller I was starting to get to a smaller reproduction of the behavior. So I what what this program would do would, you know, rather than running the entire rack test agent, it was a connector and a listener. So there was a part of it that ran on the scrimlet and then part of it that ran on the station computer. The scrimlet side listens for incoming TCP packets on a port and echoes them back. The sender side attempts to send 10 packets per second, listens for them coming back.

Adam Suczewski:

If the round trip time exceeds one millisecond, it prints a line. And if it is under one millisecond, it, you know, prints the number of packets you've received consecutively. So in a healthy network, we have, like, direct connections. I mean, there maybe are more things than we expected, but still nothing that should contribute a millisecond of latency. Right. And what we saw was intermittent elevated round trip times, typically around four hundred milliseconds,

Bryan Cantrill:

which makes no sense.
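
For readers who want to picture the tool being described, here is a minimal Python sketch of the same idea: a listener that echoes TCP data back, and a sender that times each round trip against a one-millisecond threshold. It is not the actual tool; the function names, the 8-byte payload, and the loopback demo are assumptions for illustration, and only the ten-per-second rate and the one-millisecond threshold come from the conversation.

```python
import socket
import threading
import time

THRESHOLD_S = 0.001  # one millisecond, as described above
INTERVAL_S = 0.1     # ten sends per second


def echo_listener(server_sock: socket.socket) -> None:
    """Listener side: accept one connection and echo back every byte received."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def run_sender(addr, count, interval=INTERVAL_S, threshold=THRESHOLD_S):
    """Sender side: time `count` round trips and report the slow ones.

    Returns the list of measured round-trip times in seconds.
    """
    rtts = []
    streak = 0
    with socket.create_connection(addr) as sock:
        # Disable Nagle so each small payload is sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for seq in range(count):
            payload = seq.to_bytes(8, "big")
            start = time.perf_counter()
            sock.sendall(payload)
            reply = b""
            while len(reply) < len(payload):
                chunk = sock.recv(len(payload) - len(reply))
                if not chunk:
                    raise ConnectionError("listener closed the connection")
                reply += chunk
            rtt = time.perf_counter() - start
            rtts.append(rtt)
            if rtt > threshold:
                streak = 0
                print(f"seq {seq}: slow round trip, {rtt * 1000:.1f} ms")
            else:
                streak += 1
                print(f"{streak} consecutive fast round trips")
            time.sleep(interval)
    return rtts


def demo_on_loopback(count=5, interval=0.01):
    """Run both halves in one process over 127.0.0.1 (a stand-in for the rack)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    thread = threading.Thread(target=echo_listener, args=(server,), daemon=True)
    thread.start()
    try:
        return run_sender(server.getsockname(), count, interval=interval)
    finally:
        thread.join(timeout=2)
        server.close()
```

Because this runs over TCP, a dropped segment is not reported as a loss; the stack retransmits it, so a drop shows up as one unusually long round trip rather than a missing reply.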

Adam Suczewski:

Yeah. Yeah. Yeah. Yeah. Well, yeah.

Adam Suczewski:

One, like, so the min RTO, so the minimum retransmit timeout, is four hundred milliseconds. Right. And so that was like, okay. Well, so, some packet's getting dropped. We retransmit, and then this phenomenon happens, you know, periodically.

Adam Suczewski:

Yeah. But

Adam Leventhal:

And this four hundred milliseconds is really surprising in that it's like, this is not going, you know, out to sea like this boat. Right? This is sort of all right in front of you. Right? Like

Bryan Cantrill:

It's all right in front of you.

Adam Leventhal:

Like, when, you know, I was flying the other day and my ping time to the office was 400. I'm like, well, that's that sounds amazing. Right? That that I live

Bryan Cantrill:

in a world. Right? I'm in an aluminum tube at 40,000 feet traveling at 600 miles an hour. Like, that seems reasonable.

Adam Leventhal:

Yes. Right. I like I can understand why some packets might lose their way here and there. However, this is not that.

Bryan Cantrill:

This is not that. No. And so, yeah, we're clearly getting drops somewhere, but only when Reverso is running. Is that right, Adam?

Adam Suczewski:

Yeah. So when running the Reverso agent on the scrimlet, you see a higher frequency of high RTTs, but you still see some even when the RackTest agent is not running.

Bryan Cantrill:

But if there are no Reversos in the rack, though,

Adam Suczewski:

Yeah. So one thing I asked our CM was, one, can you take all the Reversos out, and then do we have a problem? Yes. So we are clean in this case. Yeah.

Adam Suczewski:

And then, can you put one Reverso in instead of 30? Yeah. And you'll see a small fraction of the instances of high RTTs.

Matt Keeter:

So the problem scales linearly with the number of Reversos that are populated, which is terrifying.

Bryan Cantrill:

Which and also, like, that also kind of doesn't make sense, because where we're seeing these timeouts feels like it should be at some level unrelated. Right? I mean, you're putting Reversos in the rack and we are dropping packets in and out of the rack over the tech port. Like, how does the rack know that a Reverso is plugged into it?

Adam Suczewski:

Yeah. And, well, I mean, part of this sort of thing and then even the next couple of debugging steps were that I'd given my unfamiliarity, like, there's like, what what's the information I want to debug? Like, or what what would I want to do if I knew how to do anything? And then given what I know now, what would I do? And so it's actually very easy to say like, hey, can you have you tried taking them out and have

Bryan Cantrill:

you tried just putting one in? Were you prompting yourself like an LLM? Like now Adam, act like an expert.

Adam Suczewski:

Well, I would say another thing is so small modifications to say net what or then I mean the next thing is to, you know, we say the pathology is worse when the RackTest agent is running. So it's like, well, can I get tighter control of that? So I really wanna say, well, where is the TCP packet getting dropped? But initially where I was heading was like, can I binary search the behaviors and then get a more narrow reproduction of the issue?

Bryan Cantrill:

Yeah. Right. Makes sense. And any any fruit there?

Adam Suczewski:

There yeah. There was some fruit. So the main body of the RackTest agent is doing this ping test over the two networks. And so that was one dimension: is it the data network or the management network? What I had seen so basically, the Reverso agent is pinging the data network and the management network.

Adam Suczewski:

Dendrite brings up each of the connections to sleds. It's either what's called the tfport rear, so the Tofino rear port, or one of the named Gimlet VLANs. So there's a number of VLANs for the management network and then Tofino drivers for each of the rear ports. And so that's kind of where Matt jumped in. And we were doing some experiments with behaviors on those two networks.

Matt Keeter:

Yeah. So one of the things that we tested was like, well, what if we just take all the Reversos off of their VLAN? So packets can go in, but they don't actually go out on the Reverso cubbies. And with that change in place, the system works fine. And we're like, oh, great.

Matt Keeter:

We've solved the problem. And then we realized, oh, shit. We've just defeated the whole purpose

Bryan Cantrill:

of Reverso. Right. I just wanna make sure I understood what you did. So that was a you you reversed Reverso. So it was Reverso became Black Holo.

Bryan Cantrill:

Is that right? Yeah. Yeah. Okay.

Matt Keeter:

Packets would go in directed to, like, cubby whatever, and then they would just get dropped at the VLAN at the VSC7448 because the VLANs would not include those ports. Right.

Bryan Cantrill:

But you have now no way of knowing if it's actually like if the link works or not.

Matt Keeter:

Yes. And we started considering desperate options at this point.

Adam Suczewski:

Yes. So that's like the bargaining part,

Robert "RFK" Keith:

which Yeah.

Bryan Cantrill:

Yeah. It's extremely important, I think. Desperation is a very important stage of debugging.

Adam Suczewski:

Yeah. So in this case, you can ask the VSC7448 for the link status. And it can tell you the link is up. So it's not gonna send packets back to the switch zone, but it will say up based on some information about it. Yeah.

Adam Suczewski:

This is just

Matt Keeter:

like at the physical level. Is there an SGMII signal that we're seeing on the link?

Adam Suczewski:

And so I asked some folks more on the, you know, mechanical side or the operation side of manufacturing. It's like, hey, you know, we have this test and we can move migrate to the Reverso test. We can get full test of the high speed data network. And we can get, you know, up status of the management network. Is this, you know, that like, is this, you know, good enough?

Adam Suczewski:

And then I mean, the question is like, well, how many, you know, wires need to be pinched or like, how damn how bad would it need to be for something to present as up and not, you know, support say TCP from the switch zone. And so at that point, it's like, okay, no.

Bryan Cantrill:

Yeah. Yeah. Fine. Yeah. I regret asking.

Bryan Cantrill:

Yeah. I I know. This is like well, this is whenever I mean, it's very helpful when you have an idea that like, I know this is not a great idea, but then it's gonna help with someone else like explore it really earnestly and realize like, no, this is this is actually diverging. We actually need to go understand this goddamn problem.

Adam Suczewski:

Yeah. And then so like the next steps were I mean, a similar experiment is in the switch zone disabling all the data network uplinks. So take all the rear ports down. Yeah. And that basically has no behavior change.

Adam Suczewski:

So if they're if things are good, it will remain good. If things are bad, it will remain bad. Okay. So on on unlikely the like these rear well, I shouldn't say that. It was hard to know exactly what to be suspicious of.

Adam Suczewski:

Like the the presentation of the different inner so this was like a misleading thread for me. Yeah. Was the presentation of different interfaces in the Switch Zone side. It's kinda how I got to the next point. So I so I I basically got to a point where I I understood like how the frequency of pings required over a single management network loop back in order to induce like a TCP disconnect Right.

Adam Suczewski:

From the switch zone. And I had like very, you know, say fine control over that. And I'm like, VLANs seem relevant here because

Bryan Cantrill:

VLANs seem relevant and the management network seems relevant. There's something special about the management network.

Adam Suczewski:

And the presentation of those interfaces from the switch zone is different in, like, dladm output. And Dendrite to me is also like this massive piece of software that I'm not familiar with. And even prior to rack test,

Bryan Cantrill:

there were

Adam Suczewski:

a few known bugs mentioned in the kind of that initial ticket. And we were we were running like a software version from that was from June. So it's like eight months ago or eight months old at the time. And so I'd like some of my I'm like, well, maybe it is like something in this kernel side, dendrite side, some related to VLANs.

Matt Keeter:

Yeah. Yeah. We have a whole guide for debugging the management network which opens with it's not the management network's fault, which is something that I wrote about six months ago and that is called hubris. Because most of the time it's not the management network's fault. Like the management network has been blameless in most of the cases where things are suspicious.

Matt Keeter:

It turns out that it's something up stack. Yeah.

Bryan Cantrill:

That's, you know, in this case hubris, lowercase h, not Hubris.

Matt Keeter:

Yeah. Hubris being the embedded operating system.

Bryan Cantrill:

Right. But Matt is not out here for a different hubrispalooza. Oh, I'm sorry. I thought we were out here for the hubrispalooza or we we saw the it would it kind of reminds me of the, you know, when Bonwick rewrote the kernel memory allocator way back in the day, Adam. And like if you rewrite the kernel memory allocator and this is in the mid nineties.

Bryan Cantrill:

And then ditched a bunch of custom allocators. Yeah. I mean, every kernel data corruption problem is now a bug in the kernel memory allocator, because you always die in the memory allocator. And I mean, Matt, not unlike the management network, one of the things that Jeff ended up doing is adding a ton of very rich debugging support in the allocator so he could very quickly determine whose fault it actually was. And you've done the same thing.

Bryan Cantrill:

Like you had really rich debugging support. I mean the management network and this is that old Dave Pacheco ism about exonerating itself. The management network was very good at exonerating itself.

Matt Keeter:

Yeah. And the guide that I had written that started with "it's not the management network's fault" then went on to tell you how to use all of these debug tools for, like, looking at port counters and looking at statuses and pinging things through different paths to exonerate the management network.

Bryan Cantrill:

And I don't know if you Adam, we will need to get some screenshots of some of the output of Matt's commands, but these things are like gorgeous. You're just like, I almost want to have a a problem with the management network just so I can revel in the glory of this thing explaining to me in detail that it's not its fault with all these matters.

Adam Leventhal:

Because like presentation really matters. So when you when you are given this pristine output, it's like, look how it's not my fault. What are gonna do? Argue with that? Look at the

Bryan Cantrill:

ass here. Saying I'm saying it's very well plated. I'm saying this is debugging output that's very well plated. But it's also just like very rich. Matt had built in a lot of really good debugging support.

Bryan Cantrill:

And in part because like you did have to frequently like the management network is where a symptom in something else will show up.

Matt Keeter:

Yeah. And it's also a relatively complicated chip. Like, one of the tools we have is we parse the C SDK Doxygen comments to generate a bunch of register maps and then turn that into Rust code. And this is all, like, horrifying, horrifying code. But once we have that, there's a subcommand in Humility where you can say, like, I just want to read a register from the VSC7448 by name, and you can type in the name and it will look it up for you and then dump that memory and pretty print it based on the register's field breakdown in the Humility debug info.
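A minimal sketch of the "read a register by name and pretty print its fields" idea, in Python rather than the Rust the real tooling is generated into. The register name, field names, and bit positions below are hypothetical and are not taken from the VSC7448 data sheet or from Humility.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    name: str
    lsb: int    # position of the field's least significant bit
    width: int  # number of bits in the field


# Hypothetical register map, standing in for one generated from SDK comments.
REGISTER_MAP: Dict[str, List[Field]] = {
    "EXAMPLE_PORT_CFG": [
        Field("LEARN_ENA", 0, 1),
        Field("VLAN_ID", 4, 12),
        Field("PORT", 16, 6),
    ],
}


def decode(register: str, value: int) -> Dict[str, int]:
    """Split a raw register value into its named fields."""
    return {
        f.name: (value >> f.lsb) & ((1 << f.width) - 1)
        for f in REGISTER_MAP[register]
    }


def pretty(register: str, value: int) -> str:
    """Render the raw value followed by one line per field."""
    lines = [f"{register} = {value:#010x}"]
    for name, field_value in decode(register, value).items():
        lines.append(f"  {name:<10} {field_value:#x}")
    return "\n".join(lines)


raw = (12 << 16) | (0x130 << 4) | 1
fields = decode("EXAMPLE_PORT_CFG", raw)
print(pretty("EXAMPLE_PORT_CFG", raw))
```

The point of looking things up by name is that the person debugging never has to do the shifting and masking by hand against an 845-page register document.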

Bryan Cantrill:

And how just to give people a so even this thing would feel maybe like it's simple. Like, oh, I've got it. This is an industrial Ethernet switch effectively, which I think is where the 7448 is typically used. But this is a wildly complicated piece of hardware. And how many registers are we talking about, Matt?

Matt Keeter:

So the data sheet is 520 pages, and it's a PDF that includes the registers as, like, an embedded PDF, and the registers are 845 pages of documentation. And that is not sufficient to actually bring up the chip. You then have to actually look at how it's done in the Microchip SDK.

Bryan Cantrill:

And how many different instruction set architectures do we know are on it? Because there's like a MIPS core on there, right? That we wire off. I think it's just got

Matt Keeter:

the MIPS core and then it's probably got some cores hiding in

Bryan Cantrill:

the PHY somewhere. But this is a really complicated and and Adam, I we've got to ring the chime for our the the management network episode that we did with Matt. I mean, what, like two years ago now, maybe more going into just how complicated this thing is. It's only simple in comparison with the

Matt Keeter:

big switch, the Tofino because Tofino is obviously a much bigger, much more complicated switch. But this one is I think it's 40 something ports and like 80 gigabits per second of switching fabric. So it's pretty hefty.

Bryan Cantrill:

It is pretty hefty, and just a reminder that the complexity of a switch is not merely its speed. You can have a switch that's at relatively lower speed, but is wildly complicated. And so when people look at that sidecar switch, it's got both the Tofino and the 7448. So we're looking at Matt's guide. We are learning that it can't possibly be the management network.

Bryan Cantrill:

We've got all these debugging tools to go figure out what it actually is. And what's next?

Adam Suczewski:

Yeah. So next was I sort of described my controlled experiment over reproducing TCP disconnects to to Stu and Robert who are in the office at the time. And so, in contrast to the initial, like, when we run the rack test agent itself, it fails. Like, what we what we had now was when you connect NetWatt from the station computer to the switch zone and, you know, concurrently ping a management network link at because the threshold was a bit under five hundred milliseconds, you'll, you know, after a few, you know, some amount of time, maybe ten seconds, get a disconnect in that TCP connection. Yeah.

Adam Suczewski:

And I kind of explained to And there's real

Bryan Cantrill:

solace in that reproducibility by the way. Yeah. I mean, have often said that bugs may be psychotic or non reproducible, but not both. And we have a bug here, a classic, this bug is psychotic, but it is reproducible.

Adam Suczewski:

Yeah. And it did, I mean, somewhat nice thing too is like the connection that was breaking was disrupted by some completely, you know, some other thing that you wouldn't expect. Like it wasn't within the program itself. Like one thing alone. And so I explained a bunch of this to well, actually, I should say, I explained a bunch of this to Robert, and he said, well, you should go check the VSC7448.

Adam Suczewski:

So you should find the missing packet, and you should start in the VSC7448. And I was like, I don't know. I mean, well, what about the Tofino, or what about Dendrite? Like, isn't the VLAN designation in the switch zone, like, kinda suspicious, that the, you know, links that induce the problem are, like, represented much differently in software than the management network links?

Matt Keeter:

We had one more thing which was suspiciously different between the Tofino and the management network, which was the distribution of MAC and IP addresses Where on the main data network, I think every link had its own IP and MAC address. And on the management network, the scrimlet used the same MAC and IP address for all of the links to all of the cubbies and also to the tech ports.

Adam Suczewski:

Yeah. And then, yeah, Matt had comp like, commented that right before and was part of this whole, payload I was describing to Robert.

Bryan Cantrill:

And Robert, like, it reminds me of, the inner data corruption episode, Adam, you observing about the VAs being like, I don't know. Are they on They seem like these virtual addresses seem similar. And then so then Robert basically, you know, I drew

Adam Suczewski:

the whole architecture diagram on the board for me. So this is the tech ports to the VSC7448 through the Tofino, you know, through Dendrite to the switch zone. And like by the end of that, I'm like, oh, it's like Mhmm. It's in the VSC7448.

Bryan Cantrill:

Like like, what? How I can I've been so naive?

Adam Suczewski:

Yeah. Right. Right. Yeah. And then that's really where, you know, the pre-work that Matt had done for the VSC7448 was coming in.

Adam Suczewski:

As soon like, you know, in hindsight, a lot of things are are up. Everything makes perfect sense. But, you know, looking at the combination of the there was sort of a trail of three useful VSC7448 PRs related to Reverso. Humility debug registers and then the 500 page data sheet. Yeah.

Adam Suczewski:

Like if you come in with the mindset of like, you know, my goal is to find the missing TCP packet. I'm looking for like a counter maybe that is awry. I have duplicate MAC addresses. Yeah. Like traffic that's going out and coming in has the same MAC address like

Bryan Cantrill:

I I do the MAC addresses, that's by design. By design. Yeah.

Adam Suczewski:

So what Robert had explained on the board was like, you know, the role of VLAN tagging between, you know, the switch zone, the Tofino, and the VSC7448, and the tagged ports and access ports and the, like, expected behavior of the VSC7448. Also, like, now it's ringing bells of like, okay, when Eliza did this initial Reverso PR for the tech port loopback that was causing us, you know, grief on the CM network, like

Bryan Cantrill:

Kicked off our CM network.

Adam Suczewski:

Yeah. Like now it's like, okay, all of this is like starting to make sense. Yeah. And at that point, it's like, well, I actually had a notepad and I was looking at, you know, the counters. I think it's RFD 144 that has the whole switch port mapping between the VSC7448, the VLANs, and the yeah.

Adam Suczewski:

To the Tofino Yeah. And the switch zone. And, I mean, yeah. So Monorail is the name for the management network. Monorail status at least was, you know, showing the ports we expect to have traffic having traffic, but then really looking at the, you know, layer two configuration of the VSC7448 was now where we're getting richer information, and at least words that, when paired with the, you know, 500 page PDF, when you go look at the data sheet, it's like kinda screaming off the page.

Bryan Cantrill:

Yeah. So what was our next step? So we've got we're beginning to kinda get a cordon around this thing. What was our

Matt Keeter:

Was it the sticky bits next?

Adam Suczewski:

Yeah. So, I mean, so one, I'm like, I have to give maybe some Claude Code is actually pretty good at digging through your 500 page PDF. Yeah. Say the Hubris source and our VSC7448 source and, you know, spitting out like the useful and relevant registers. Interesting.

Bryan Cantrill:

So using Claude on the data sheet, which was always kind of the I mean, the fantasy for this stuff is like, can I give it this dense text and have it help me find some hypotheses?

Adam Suczewski:

Yes. In this case, yeah. I mean, like, I also have the graveyard of, like, hallucinations, because actually if you say like Yeah. This is there is I was actually doing it right before this episode. I'm there is a problem in, say, the Tofino, in the illumos, you know, driver Yeah.

Adam Suczewski:

For loopback. It will give you like an equally, you know, superficially plausible Right. Theory for like why you're seeing what you're seeing. So like so long as you point it at the right bug or like the right root cause, you know, you may find it, but

Bryan Cantrill:

Yeah. Interesting. And so it was helpful because now you're looking for, it's giving you some specific registers to look at for what kind of things.

Adam Suczewski:

Yeah. So the main what Matt was saying, the sticky bits for local moves. So the chip will track when MAC addresses are moving between ports. And what we were seeing is all MAC addresses are moving between all ports, which now also should ring bells because, you know, what Matt had said is all of these management network MACs, so for all of the scrimlets, for the PowerShelf controller, the sidecar, and the tech ports, from the perspective of the switch zone, yeah, have the same MAC and the same IP.

Matt Keeter:

And it's maybe worth taking a brief sidetrack to talk about what MAC tables are doing in the switch. So when a packet comes into the switch, the switch is configured to do what it calls learning. And so it sees the packet's source MAC address and it sees the port that that packet came in on and it stores that information in a table. And then next time when you try to send a packet, if the destination MAC address is in that table, then it'll say, okay, this packet just needs to go out on port 12 to get to the place with that MAC address. And if it doesn't see anything in the table then it has to flood the packet on all ports until it eventually figures it out.
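The learning behavior described above can be sketched as a toy model. This is a generic learning switch, not the VSC7448's actual logic; the class name, port numbers, and two-letter addresses are made up for illustration.

```python
from typing import Dict, List


class LearningSwitch:
    """Toy layer-2 switch: learn source MACs, forward by destination MAC."""

    def __init__(self, ports: List[int]) -> None:
        self.ports = ports
        self.mac_table: Dict[str, int] = {}  # MAC address -> port it was last seen on

    def receive(self, in_port: int, src: str, dst: str) -> List[int]:
        # Learning: remember which port the source MAC arrived on.
        self.mac_table[src] = in_port
        # Forwarding: a known destination goes out exactly one port.
        if dst in self.mac_table:
            return [self.mac_table[dst]]
        # Unknown destination: flood to every port except the ingress port.
        return [p for p in self.ports if p != in_port]


switch = LearningSwitch(ports=[1, 2, 3])
first = switch.receive(in_port=1, src="aa", dst="bb")  # "bb" unknown, so flood
reply = switch.receive(in_port=2, src="bb", dst="aa")  # "aa" was learned on port 1
again = switch.receive(in_port=1, src="aa", dst="bb")  # "bb" was learned on port 2
print(first, reply, again)
```

Note that the table stores only the most recent port for each address, which is harmless as long as every address lives behind exactly one port.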

Matt Keeter:

But the MAC tables in the VSC7448 also include a field for the VLAN ID. So our assumption going in was, oh, the packets, even though they have the same MAC address, will be fine because they have different VLAN tags, so they won't end up colliding in the table, and it'll learn, you know, this address on this port with this VLAN tag is the upstream scrimlet and the same MAC address on one of these cubby ports has a different VLAN ID.

Bryan Cantrill:

So that's different. Then so when it believes this MAC is moving all around through different ports, how does that explain the dropped packet? Because somewhere in this a packet is getting dropped.

Adam Suczewski:

So the first thing, or like the first change that I applied to the SP, was to just turn off MAC auto learning. So when you get a multicast packet in, if you know the port for the corresponding MAC address, you'll send it to the port. If you don't, you flood. You flood. Right.

Adam Suczewski:

And so if you turn off auto learn, you'll flood every time. Right.

Bryan Cantrill:

So, okay. So with auto learning on, it's effectively sending it out to effectively the wrong port?

Matt Keeter:

That's the question.

Bryan Cantrill:

Yeah. That's Fair.

Adam Suczewski:

And it's like, well, if the VLANs are isolated, like, I yeah. Depends on the configuration of the chip. But, basically, when we did turn off auto learning, actually, the pathology is gone. Like, we're like, every Reverso test working. So persistent TCP connections, you can ping your link and you can ping the data network, everything, you know, problem solved.

Adam Suczewski:

But then I have the same question as you, which is like, but why why are TCP connections affected from the incoming tech port? And this this is when I was like, it's 11PM, and I think this is working now, but I I'm like, now why does it work?

Matt Keeter:

And at this point, the Eye of Sauron is very firmly on the MAC table. It's not behaving as we assumed it was behaving.

Adam Suczewski:

Yeah. And and so what's happening

Bryan Cantrill:

But this is a so but just learning that you can disable MAC auto learning and the pathological behavior goes away. That was obviously a big breakthrough. Yeah.

Adam Suczewski:

That was when I shouted like, you know, mark the time of the clock at 07:48PM as the time it was solved.

Bryan Cantrill:

I do feel like, Adam, I would love to have like a I mean, if we could have omniscience for a moment, collect the sounds of debugging breakthroughs over the years, which I think are absolutely proportional with the amount of time that you have worked on it. I had a bug that I'd been chasing for so long that when I finally nailed it, I let out an absolute guttural howl and someone else in the office thought that, quote, someone had severed a finger. So it sounds like you stayed short of finger severing, it was more of a noting the time. I like that, for posterity. But this was a big, I mean, obviously big breakthrough, not all the way there yet, but it feels like a big breakthrough.

Bryan Cantrill:

We're gonna live.

Adam Suczewski:

Yeah. And so I was excited excited by this. Yeah. Then really the next step is like, yeah, why like why is this the fix and, like, in particular? And, why was it working?

Adam Suczewski:

Like, are there all like So many questions. A lot of questions. And then in in yeah. So maybe I'll explain why. Yeah.

Adam Suczewski:

Yeah. This TCP is getting disrupted because every time you have a, you know, packet coming in from some source, the VSC7448, you know, MAC table is making an association between this MAC and this port. But if the packet goes out and comes in, it's, you know, very shortly after associated with a different port. Right. And so also why was the ping test okay in the first place?

Adam Suczewski:

Well, we're sending multicast packets from the switch zone. So the behavior of that ping test is unaffected. But when the TCP connection comes in from the station computer through the tech port to the VSC7448, it's gonna do a lookup of the MAC table for that destination, which Matt noted is the same as every other MAC Yeah. For the, you know, for each of the Reversos. So the question is like, is your timing for this MAC lookup at a time when you're in an invalid state where you're not gonna get directed back into the switch zone?

Adam Suczewski:

Which now is explaining like how come when we turn up the frequency of Reverso pings to the management network, like do you see increased instances of either round trip times or eventually the the disconnect?

Matt Keeter:

Yeah. So it's basically if a packet has come in through Reverso Cubby more recently than from the actual switch zone, then any packet that you try to send to the switch zone gets sent to that Reverso cubby instead. And then it kind of falls off the edge of the map and you're having a bad time.
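The failure mode described above can be replayed in a few lines: with one MAC address shared by every link and a table keyed on the MAC alone, whichever port most recently sourced a frame wins. The addresses and port numbers here are hypothetical, and the model deliberately ignores VLANs.

```python
from typing import Dict, List, NamedTuple

SHARED_MAC = "02:00:00:00:00:01"   # hypothetical: the single MAC used on every link
STATION_MAC = "02:00:00:00:00:99"  # hypothetical: the station computer
MULTICAST = "33:33:00:00:00:01"    # stands in for the multicast ping destination
SWITCH_ZONE_PORT = 49              # hypothetical port numbers
TECH_PORT = 50
CUBBY_PORT = 7


class Frame(NamedTuple):
    in_port: int
    src: str
    dst: str


def station_egress(frames: List[Frame]) -> List[int]:
    """Replay frames through a MAC-only table; return where each station frame is sent."""
    table: Dict[str, int] = {}
    chosen: List[int] = []
    for frame in frames:
        table[frame.src] = frame.in_port  # learning, keyed on the MAC alone
        if frame.src == STATION_MAC:
            chosen.append(table.get(frame.dst, -1))  # -1 means unknown, so flood
    return chosen


ping_out = Frame(SWITCH_ZONE_PORT, SHARED_MAC, MULTICAST)  # switch zone pings a cubby
loopback = Frame(CUBBY_PORT, SHARED_MAC, MULTICAST)        # Reverso sends it straight back
tcp = Frame(TECH_PORT, STATION_MAC, SHARED_MAC)            # station talks to the switch zone

healthy = station_egress([ping_out, tcp])
broken = station_egress([ping_out, loopback, tcp])
recovered = station_egress([ping_out, loopback, ping_out, tcp])
print(healthy, broken, recovered)
```

In the `broken` ordering the station's TCP segment is forwarded to the cubby port instead of the switch zone. More Reversos, or more frequent pings, mean more loopback frames and therefore more moments when the table points the wrong way.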

Bryan Cantrill:

Yeah. We we kind of, I mean, we we've definitely have invented pathological switch behavior. Although admittedly also like kind of pathological and Reverso itself is is rather unsporting in terms of what it's doing. Although this is where RFK you'd be like, no, no, this is like, sorry. This is a this is doing exactly what it what it should be doing, which is testing the rest of the software stack.

Bryan Cantrill:

So what was the proper fix? It is not merely to disable the MAC auto learning, but we were actually missing a VLAN tag as well?

Matt Keeter:

Yeah. So the proper fix was very dumb. Actually, it was one of the things that Claude picked out as, like, one of the suspicious registers. So remember, you know, twenty minutes ago when I said, oh, the MAC tables use the VLAN information. It's true that they can use the VLAN information.

Matt Keeter:

And it's also true that by default they don't. So the fix ended up being very dumb. We just had to turn on a configuration bit for each VLAN that says, hey, by the way, also use this VLAN tag for MAC table associations.

Bryan Cantrill:

Yeah. Wow. Interesting. And Claude pointed that out or suggested it?

Matt Keeter:

I think it it pointed at the right register.

Adam Suczewski:

Yeah.

Bryan Cantrill:

Yeah. Yeah. That's interesting.

Adam Suczewski:

Yeah. And then the question I had then too was, so if you put on VLAN aware MAC tables, then like why like why why is our ping test still unaffected? Because we still have this thrashing of the same MAC address on the same port or or the same MAC address coming on different ports that is, you know, port 49 associated with the Tofino and then the outgoing port the uplink port to the Reverso Cubby.

Adam Leventhal:

Yeah.

Adam Suczewski:

And I mean, it it's like, the only thing we need to do is send a multicast packet and get it back. So in this case, perhaps we are we are thrashing our table. Yeah. But, you know, it doesn't matter because we we still get the packet back and we we know it's the one we sent.

Matt Keeter:

Yes. The table still thrashes for the cubby VLAN IDs because those appear from both the scrimlet and then from the Reversos coming back in, but it doesn't thrash anymore for the switch zone, which is the thing we actually need for the TCP connections. Right. Yeah. Interesting.
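A small sketch of the difference the fix makes: the same trace of frames replayed through a table keyed on the MAC alone and through one keyed on (VLAN, MAC). The VLAN IDs, addresses, and port numbers are hypothetical, and the assumption that tech port traffic rides its own VLAN is for illustration only.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Optional, Tuple

SHARED_MAC = "02:00:00:00:00:01"   # hypothetical: the single MAC used on every link
STATION_MAC = "02:00:00:00:00:99"  # hypothetical: the station computer
MULTICAST = "33:33:00:00:00:01"
TECH_VLAN, CUBBY_VLAN = 0x130, 0x107          # hypothetical VLAN IDs
SWITCH_ZONE_PORT, TECH_PORT, CUBBY_PORT = 49, 50, 7

Key = Tuple[Optional[int], str]


class Frame(NamedTuple):
    in_port: int
    vlan: int
    src: str
    dst: str


def replay(frames: List[Frame], vlan_aware: bool) -> Tuple[List[int], Counter]:
    """Return the egress port of each station frame and a per-key count of port moves."""
    table: Dict[Key, int] = {}
    moves: Counter = Counter()
    egress: List[int] = []
    for f in frames:
        src_key = (f.vlan if vlan_aware else None, f.src)
        if src_key in table and table[src_key] != f.in_port:
            moves[src_key] += 1  # the address "moved" to a different port
        table[src_key] = f.in_port
        if f.src == STATION_MAC:
            dst_key = (f.vlan if vlan_aware else None, f.dst)
            egress.append(table.get(dst_key, -1))  # -1 means unknown, so flood
    return egress, moves


TRACE = [
    Frame(SWITCH_ZONE_PORT, TECH_VLAN, SHARED_MAC, STATION_MAC),  # switch zone to station
    Frame(SWITCH_ZONE_PORT, CUBBY_VLAN, SHARED_MAC, MULTICAST),   # ping out to a cubby
    Frame(CUBBY_PORT, CUBBY_VLAN, SHARED_MAC, MULTICAST),         # Reverso loops it back
    Frame(TECH_PORT, TECH_VLAN, STATION_MAC, SHARED_MAC),         # station to switch zone
]

mac_only_egress, mac_only_moves = replay(TRACE, vlan_aware=False)
aware_egress, aware_moves = replay(TRACE, vlan_aware=True)
print(mac_only_egress, aware_egress)
```

With VLAN-aware keys the cubby VLAN entry still records a move, matching the thrash described above, but the entry the station's TCP traffic depends on never changes port.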

Adam Suczewski:

And so for completeness, you could say turn off the tables for Reverso-based cubbies, keep it enabled for the others, and then for all, always ensure that we do VLAN tagging for traffic internal to the VSC7448.

Bryan Cantrill:

And so then with this now properly fixed, we can now run with Reverso without issues. We can do the rack test, and with that, case closed.

Adam Suczewski:

Yeah. Basically, you know, the ticking time bomb on this was as we want to scale racks, and the concurrency of rack chassis with the cabled backplane is sort of this function of how many fully loaded gimlets do we have available for testing. Yeah. And 16. So Yeah.

Adam Suczewski:

If you want to build more racks at once and Yeah. Or even, you know, if some of these test units fail, like, what do we do? So, yeah, this is the exciting breakthrough because, I mean, for cost, for, you know, sort of operationally, like, the weight of the sled, like Yeah. The Reversos are, you know, I don't know, 10% of the weight of a full Cosmo, like a lot of things, it unblocks

Matt Keeter:

a lot of

Adam Suczewski:

We could just be a

Bryan Cantrill:

lot faster. Yeah, we can move the line a lot faster. And I mean, this was the I mean, we've been saying this for a little while, but we definitely are in the how fast can you make them problem. And as we tell ourselves, like these are great problems to have, but they are problems. And this is a very thorn I mean, you know, one of the things that kind of highlighted for me is that the well, I mean, we've said this over and over again that like when you see these wisps of smoke, they can come from a coal fire surprisingly deep under the surface.

Bryan Cantrill:

Adam, I don't know if I mean, you ended up learning a whole lot about the management network. Yeah.

Adam Suczewski:

I thought it was it was a it was an effective onboarding experience. Like, I mean, one of the yeah. I, like, I kind of wrote some, you know, reflections of, what would I do differently or, like, what was I kinda, like, how would I, you know, for the next mysterious bug, like, what would I do? But I think the hard part is well, I mean, part of, like, what what do I go read and study and, like, stop and go you know, take a look at versus when to like actively probe the system. Yeah.

Adam Suczewski:

And I think I don't know. It's even in hindsight, it's hard to know because I think the the the volume of things I could have gone and learned and read at a given time was likely that I would have read and looked at the wrong things. So I think like the active learning part was especially useful. Yeah. But then, I mean, I think getting recommended reading from from folk like I think, you know, Stu had sent me, like, the most relevant part of probably the most relevant RFD about twelve hours before the break.

Adam Suczewski:

So I think this, like, you know, probe what you can, like, share this inform and then, like, ping some people who know what they're what they're doing.

Bryan Cantrill:

Well, yeah, I do like you. There is a bit of a Where in the World Is Carmen Sandiego kind of property that I swear you were kind of going around like, you know, the packet mentioned something about the 7448 and then went to the airport. But the kind of being able to pull from I think also like just knowing that it's all knowable. Like we do actually control this whole thing. You know, asterisk, because we are dealing with ultimately these hardware parts that can have surprising behavior.

Adam Leventhal:

Well, Bryan, I mean, I know that this is sort of not the thrust of that, but what you said is really important, that we can understand this whole thing. Part of the reason we built this hardware and software together was so that we could understand the whole thing, so that when customers encounter some problem, we're not just chasing the problem to the next vendor over, but like we actually can debug this thing. I think sometimes that gets lost a little bit in the, like, hardware software co-design, because we enable lots of great stuff. But in particular, one of the things we enable is being able to resolve problems independently.

Bryan Cantrill:

Yeah, absolutely. And I think that you this is yet another I feel like we've had many of these vivid embodiments on them, but this is another vivid embodiment of if we didn't control these components, at some point we would have just given up and been like, this is a weird mystery. We don't get it. We're shipping it. And good is it, I mean, was a relatively, in terms of like the customer impact, this is an issue that does have customer impact.

Bryan Cantrill:

And we did discover some things along the way where we, I mean, we ultimately made the product better, not just by making Reverso better.

Matt Keeter:

Yeah, I mean, this is not a thing that you would see under normal operation, but if someone was trying to poison our MAC tables maliciously, then having the VLAN tags adds another layer of protection there.

Bryan Cantrill:

Right. And I think just in general, because you could have an operation on the tech port that would... I mean, this cleans up our own understanding, even though we don't intend any Reversos to be plugged into production.

Adam Suczewski:

Yeah. In hindsight, it's so obvious. I mean, it's like, well, why didn't I look there sooner? And even pulling up the data sheet, you look at this 500 page PDF and it's screaming off of, like, page 11. It's, like, right there. And, yeah, I mean, I think it is going in knowing, like, I can find this.

Adam Suczewski:

This is, like, a deterministic, reproducible behavior. Like, there's quite a lot of source that we control and, like, we can find it.

Bryan Cantrill:

We can find it. Well, and then this is, as it turns out, like, a big unlock for scaling manufacturing, which has been exciting. And I mean, Adam, you definitely have not wanted for things to do at Oxide with the manufacturing software team, which has grown quite a bit over the last couple of months. Well, this is awesome. Great to have you both physically in the litter box.

Bryan Cantrill:

Enjoy the rest of Hubrispalooza. Thank you. We will have an out loud reading from the network that the management network is not the problem as part of Hubrispalooza, but this is

Matt Keeter:

I'm sure this was the last bug in the management network.

Robert "RFK" Keith:

This is

Matt Keeter:

For now.

Bryan Cantrill:

Yes, exactly. There was an issue with the management network, but it's now been found and fixed. But I think it actually does highlight how important it is to have these really robust components and the debuggability of them. And so, hey, RFK, nice going on Reverso. Do you feel that you're kind of the creator of chaos here? I mean, it must feel great to watch all of this.

Robert "RFK" Keith:

That's just the way hardware engineering, like we just make all your problems and then give it to you and think, okay, well, they'll just go figure that one out.

Adam Suczewski:

Here you go.

Robert "RFK" Keith:

You're welcome.

Bryan Cantrill:

Exactly. Thank you. Thank you for Reverso. And yeah, the Reverso itself has been great. I think that Eric in the chat is saying it's worked better than we anticipated.

Bryan Cantrill:

I think it's been a big, big win. So that that's been

Adam Suczewski:

I'm glad it works,

Robert "RFK" Keith:

you know? Yeah. Because like at the time it got a lot of hate because we're like, why would we ever need this? Because it was like year one or two and we have no product. It's like, we don't need this manufacturing scale thing.

Robert "RFK" Keith:

Uh-huh.

Adam Suczewski:

Now we do.

Bryan Cantrill:

Yeah. I don't regret trying to scale manufacturing ahead of our actual demand because, or rather, I don't regret having not... it's so complicated to scale manufacturing. Actually, you don't know the problems you're going to have until you're actually, like, right in the thick of it, I think, in a lot of this stuff. But it's been fun and, yeah, looking forward to the next bottleneck, and Adam, the next thing you hit, you know you

Adam Suczewski:

can, know you can I'm excited for it. That's it. That's true.

Adam Leventhal:

Good. I can cut that from the recording if you want.

Bryan Cantrill:

That's right. Yeah. Wants to reserve judgment on that. Awesome. Alright.

Bryan Cantrill:

Well, thanks, you two. And RFK, thank you so much for Reverso. And Adam, thanks for getting over your network aphasia to join us on this one. It's a good one. Yeah.

Bryan Cantrill:

Alright. Awesome. Thanks everyone. Take care. See you next time.