Oxide and Friends

Bryan and Adam are joined by a number of members of the Oxide networking team to talk about the networking software that drives the Oxide rack. It turns out that rack-scale networking is hard... and has enormous benefits!

We've been hosting a live show weekly on Mondays at 5p for about an hour, and recording them all; here is the recording from February 27th, 2023.

In addition to Bryan Cantrill and Adam Leventhal, speakers included Ryan Goodfellow, Levon Tarver, Ben Naecker, and Arjen Roodselaar.

Links

Here's (much of) the live chat from the show:

ahl https://github.com/oxidecomputer/oxide-and-friends/blob/master/2021_11_29.md
ahl That's the Sidecar switch episode
bcantrill https://p4.org/
admchl What does "at line rate" mean?
Riking Line rate = As fast as the packets could possibly come. 1Gbit, 10Gbit, 100Gbit, etc
admchl Do you need ASICs to hit that speed? I assume x86_64 is not going to be fast enough for these specialised operations?
levon Yes, the Tofino 2 is the ASIC
bcantrill You need ASICs
bnaecker Yes, you really can't do these kinds of operations on a general purpose CPU.
rng_drizzt Yeah, you need specialized silicon here.
JustinAzoff Right, also often across all ports at the same time in both directions. a 48 port 10gbps switch will have a line rate of 960gbps (10 * 48 * 2)
duckman So the advantage is being able to offload compute to the switch?
bnaecker Yes, and specifically that you can separate the data plane (operations on the packets) from the control plane (decisions about what operations to allow or make).
tahnok What's TCAM?
levon Ternary Content Addressable Memory
bnaecker https://en.wikipedia.org/wiki/Content-addressable_memory#Ternary_CAMs
ryaeng Sure beats logging into a number of Cisco switches and making changes at the console.
admchl This is my favourite episode in a long time, this is all really fascinating.
rng_drizzt the first Sidecar episode was nearly 1.5 years ago 🤯 , right after we cut the first rev
levon That episode blew my mind
duckman This sounds like a big deal on the scale of ebpf
duckman Or bigger
bnaecker It is extremely useful for understanding the processing pipelines. As long as you only run single-packet integration tests 🙂
od0 just want to go out and find things to write P4 code for
JustinAzoff yeah one way to think about that sort of thing is that xdp can be used to run little programs on a nic, where p4 is kind of like that, but running on effectively a nic with 48+ ports
bcantrill https://github.com/oxidecomputer/p4
SyntheticGate sidecar is the "codename" of our switch box
SyntheticGate "gimlet" is our server sled
bcantrill https://github.com/oxidecomputer/propolis
wmf So you have P4 and OPTE in the hypervisor at the same time?
bnaecker OPTE is in the host kernel.
arjenroodselaar The P4 runtime Ry described only exists in the test bed, where it high level simulates the switches. OPTE is part of the production environment.
arjenroodselaar The rough difference between P4 and OPTE is that P4 works on individual packets without much concept of a session (so it can't reason about TCP streams, packet order etc, so no firewall like functionality), while OPTE aims to operate on streams of packets.
JustinAzoff So you can run 100 VMs on a test system and wire them up to your virtual switch compiled by x4c?
arjenroodselaar Correct.
bcantrill OPTE == Oxide Packet Transformation Engine
admchl Gimlet?
rng_drizzt Compute server
rng_drizzt The Sidecar switch is actually just a PCIe peripheral to a Gimlet.
bnaecker The Gimlet managing the Sidecar is often called a "Scrimlet" for "Sidecar attached Gimlet"
Riking and "how do i reconfigure this giant network without hosing my ability to reconfigure this giant network"
ShaunO can identify with that - we seriously struggle to keep our own products inter-operating, let alone anyone else's
levon It can feel like a Sisyphean task.
a172 Setup a much smaller/simpler network in parallel that is accessible from "not your network" that gets you to the management interface.
levon It's a whole new world when you can look at the actual table definitions in P4
rng_drizzt Owning all the layers here is immensely beneficial
levon Those DTrace probes have been very helpful
bnaecker Those probes turned out to be everywhere. They are in: SQL queries, HTTP queries, log messages, Propolis hypervisor state, virtual storage system, networking protocol messages, the P4 emulator, and probably more that I'm forgetting about.
levon For those unfamiliar with the DTrace tool, or the rationale behind leveraging DTrace over other tracing / debugging tools: https://www.cs.princeton.edu/courses/archive/fall05/cos518/papers/dtrace.pdf
bcantrill https://github.com/oxidecomputer/progenitor
ahl some notes on rust codegen: https://github.com/ahl/codegen-template
arjenroodselaar DDM! Bring us home!
a172 it astonishes me how many "cloud" type architectures are built on v4 only or v4 first.
a172 IPv6 is older than Wi-Fi
a172 It solves real problems. PLEASE use it.
nyanotech yessss finally someone realizes broadcast domains are also failure domains
JustinAzoff the worst part of v6 is trying to run dual stack v4+v6, v6 only networks are fairly simple
levon And the bigger the broadcast domain, the more irritating it is to troubleshoot it
bcantrill "Hash and pray"
arjenroodselaar FWIW while DDM is a cool thing we're building, one of the "simple" tasks Tofino does for us is NAT between the networks of our customers and their VPC networks they implement on our platform.
arjenroodselaar Simple NAT is still surprisingly expensive and being able to do that at line rate is pretty nice.
Riking TCP retransmits in steady state seems like an obvious observation point?
arjenroodselaar Yes, you see TCP retransmits.
arjenroodselaar But if you're running say Memcache over UDP and you get a sudden burst of incoming data as a result of a large number of cache queries you drop those packets (because the buffers can't keep up) and you see cache request timeouts.
arjenroodselaar FB did some work on this about 10 years ago to avoid this ingest and dropped packets which hurt your p99 latency.
Riking yeah smartnic is pushing the intelligence to the machine
levon I know someone who basically polled all of the switches for buffer drops in an attempt to divine which paths were dropping packets due to micro-congestion
admchl I feel like I'm in a secret society meeting learning The Hidden Truth behind Reality of The Network
wmf I would argue if the entire hypervisor is on the smart NIC then you're no worse off than the Oxide architecture
a172 I once stumbled on a bug where the vendor's custom protocol for monitoring (because snmp/syslog just cant keep up) had a trace log on the process, that could not be turned off. Some sort of race condition enabled it, and it happened on 1/3 of system boots. It was ~20k logs/s, iirc.
a172 (im going to look up those numbers)
levon I haven't worked with a SmartNIC fast enough to do this well
JustinAzoff We use a FPGA Nic in our products for fast packet capturing. the service that bootstraps it had an issue that caused it to log an error... for every single packet...
JustinAzoff that managed to log the same error something like 250,000 times a second
arjenroodselaar The problem with SmartNICs is that their power features are way less advanced than the power scaling that x86 CPUs do. So you either run them or you don't, and they come with a 50-75W penalty. Unless you can really get useful work done for that 50W budget, a x86 CPU is much more flexible.
arjenroodselaar What we really want is an AMD Epyc SoC with some amount of FPGA fabric. That would let you build whatever makes sense there while still having much of the flexibility with respect to how/where you consume power.
a172 It was enough to mess us up. 250k would have killed us even faster.
JustinAzoff Yeah, it happily wrote that error message until the multi TB data array filled up. We reworked how log rate limiting and log rotation worked after that
a172 I was mostly amused that the process that existed because snmp/syslog couldn't keep up was getting a syslog for every iteration of a loop in the process
a172 of course, if you are sending a packet for every packet you send, that sounds like it quickly becomes an exponential problem.
JustinAzoff and to circle back around, this was code inside of the vendor SDK, that is not open source, that we couldn't fix ourselves. it's one of the only components of our system that we don't control. i wish we had our own NIC (that would probably run something like p4)
levon And thus, this is how we become the way we are (at Oxide)
a172 ours was on production network hardware (wireless controller). There is no hope of having source or any ~~insight~~ true observability into it. (edit: saying there was no insight is a little harsh)
JustinAzoff one thing that came up before was if p4 was like ebpf.. there's actually a ebpf backend for p4 that supports some of the features: https://github.com/p4lang/p4c/blob/main/backends/ebpf/README.md
bcantrill Thanks, all!

If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!

Creators & Guests

Host

Adam Leventhal

Host

Bryan Cantrill

What is Oxide and Friends?

Oxide hosts a weekly Discord show where we discuss a wide range of topics: computer history, startups, Oxide hardware bringup, and other topics du jour. These are the recordings in podcast form.
Join us live (usually Mondays at 5pm PT) https://discord.gg/gcQxNHAKCB
Subscribe to our calendar: https://calendar.google.com/calendar/ical/c_318925f4185aa71c4524d0d6127f31058c9e21f29f017d48a0fca6f564969cd0%40group.calendar.google.com/public/basic.ics

Speaker 1: 00:00

Alright. Well, great to have everyone here. And if folks did not see it or listen to it already, we recorded an On the Metal episode where we recounted our favorite, or some favorite, I should say, not all of them. I mean, there are so many great ones out there.

Speaker 1: 00:15

Some favorite moments. And that was a lot of fun. Thanks for doing that.

Speaker 2: 00:18

Oh, no. Thank you for doing it. I was, like, very honored to be on the show, just because, you know, On the Metal has been great. I loved listening to it. I gotta say I am still tittering over the clarinet solo. Like, it's just hard for me to think about an episode without... anyway, little teaser.

Speaker 1: 00:39

Do you know another actually great moment like that that I had forgotten about, and then I was just relistening to it in preparation for this? It's the episode we did where Arjen had his tweet of a measurement 2 years in the making, and do you remember that Eric was calling in and was outside with these, like, absolutely deafening Midwestern crickets?

Speaker 2: 00:59

Yes.

Speaker 1: 01:00

That were great. And I actually love it. I mean, it's so great, because Eric is talking about how stressful it was and what a rush it was and how satisfying it was. And then meanwhile, you can just visualize him. You can feel the humidity, you know, where he is.

Speaker 1: 01:19

It was just great. Yeah. That was fun. That was a totally fun one. And, you know, Arjen, we had had you on talking about the Sidecar switch in, whenever that was, late 2021, I guess.

Speaker 1: 01:33

Let's hope for her. But what we've not talked about at all, and we'll do that... We've actually got lots of things about Oxide that we've not talked about at all, amazingly enough. But we really had not talked about any of the up-stack networking software, because, you know, making a switch was really one of the things that, I would say, required a lot of technical boldness. We had, I would say, some uncertainty in those earliest days. It felt like there was no option.

Speaker 1: 02:02

Like, integrating third-party switches felt like it was gonna be hugely problematic. But, boy, doing our own switch just seemed ludicrously ambitious.

Speaker 3: 02:11

So small plug, we have a blog post coming out on that soon.

Speaker 1: 02:15

Oh, yeah. I'll be excited to read it, Arjen. And, Arjen, I mean, you were there, you know, in those earliest days, and we had basically come to the conclusion that we needed to do a switch, even though we had no idea what it'd be. Or, as I think I've said many times before, we fortunately did not know how hard it is to do a switch. Otherwise, we might not have done it.

Speaker 3: 02:37

Yes.

Speaker 1: 02:39

And we...

Speaker 3: 02:41

Well, I was also the one who said, well, how hard can a switch be? That's kind of a solved problem at this point. So I definitely hate those words.

Speaker 1: 02:47

I know. I know. I know. I feel like that's, like, mine: "Oh, we're just gonna tweak some reference designs." That's another one.

Speaker 1: 02:52

Yes. I could time machine and slap myself. It's like, you're... oh. No. Exactly.

Speaker 1: 02:57

But, you know, you gotta have, like, a little bit of naivete. And, you know, I think one of the things that is great about having a big, ambitious, bold vision, and then projecting that vision, is people are attracted to that. And we've had a lot of people who've come to Oxide because they see what we're doing, and they say, hey, not only am I interested in that, but that really speaks to me. And there's a part of this that I think I can really help on that is part of my own personal vision. And part of what I love about what we're doing is that for every person at Oxide, there is a part of themselves, a part of their own personal vision, that is in what we're doing.

Speaker 1: 03:36

And that is very true of what we're doing. And, you know, Ryan Goodfellow is here. And, Ry, I mean, I think this is true for a lot of people at Oxide, but it's especially true for you that you saw what we were doing. And I just remember in your materials, like, "I think you're building a switch out of P4, and, boy, that's exactly what I wanna go do." So, Ry, could you talk a little bit about your background and kinda how you got to Oxide, and maybe we'll pick it up from there?

Speaker 4: 04:07

Yeah. So, I mean, that's pretty much exactly it. How I first saw Oxide was, I think I was actually at a conference at Sandia National Labs, and I was there giving a talk on the network test beds I had been building for, like, some government projects and things like that. And Ron Minnich was also at that conference, and we were talking about a few different things. And I think at that point, he had mentioned the On the Metal podcast, which was actually my first exposure to Oxide. And so then, a little bit later, I started to check that out, and then I was like, oh, man.

Speaker 4: 04:38

This is, like, these are my people. This is really awesome. And then I started to look more in-depth into what was available on the Oxide website in terms of what folks were doing, and then I saw the P4 switch sitting there staring me in the face. I think it was in this, like, little subtitle somewhere, hidden on the website that has this big beautiful picture of this rack, and then there's, like, the switch that's sitting in the middle of it. I'm sitting there like, I wonder what that is.

Speaker 4: 05:05

Is that some kind of Mellanox ASIC in there? Is it, like, a Spectrum 2, a Spectrum 3? And then I saw the Tofino. I was like, oh, okay. So this is a fully programmable architecture.

Speaker 4: 05:15

And where I had come from was building large-scale network test beds for, like, research programs and things like this, where people are doing network research and they're doing absolutely, like, bat shit insane things with networking. And we have to write the network code that transports all that crazy stuff that people are doing inside of the networks in a test bed environment to evaluate whether their crazy ideas are gonna work out or not. And it was a very fun and rewarding job, but it also very clearly demonstrated the limits of what we could do with fixed-function networking equipment. Like, when weird stuff would happen, if we were operating a really large-scale EVPN network and absolutely everything looks green across the board. Like, all the routing protocols are green.

Speaker 4: 06:01

Like, everything looks good, but packets are just not moving in the way that they're supposed to be moving. And you just go down this road of terror of, like, all these people are depending on you and your infrastructure to do their jobs every day. And things just aren't working. And you get down to the ASIC, and it's just a black box. And the only thing that you can do is talk to your vendor and be like, what's up, man?

Speaker 4: 06:23

Like, what's going on here? We've done everything we can, and then it turns out to be an ASIC bug. And you get a firmware update, and the problem goes away, and you have no idea what happened. And...

Speaker 1: 06:33

That is so frustrating. And, I mean, these ASICs are extraordinarily complicated. They are historically very proprietary. There's a huge stack in there. So to not be able to get that certainty about what actually happened here is, I mean, obviously, very frustrating.

Speaker 4: 06:53

Yeah. And it just makes the network kind of incomprehensible at a very low level. Like, you have these declarative APIs. You have Linux switchdev. You have all the wonderful work that happened with Cumulus Linux to kind of open up the white-label switching environments and have these nice declarative Linux-flavored APIs.

Speaker 4: 07:10

But at the end of the day, that's all declarative, and you don't have a real good mental model of what is happening to every single packet that's going through the switch. Like, what's the programming model? How do I actually understand what's going on here? If I'm operating a very busy network and I'm running up against soft limits on the ASIC, like, going back to the EVPN example, if I am running into head-end replication limits because I'm blasting a layer 2 broadcast domain over a layer 3 network, what's happening there? What is the fallback mode?

Speaker 4: 07:41

Like, how am I gracefully falling back onto maybe some type of multicast or something like that? And you just don't know. Like, it's not specified well enough. You just don't understand how your networks are operating, which from an operator's perspective is extremely frustrating. And when I saw what Oxide was doing, I was like, that's a step in the right direction.

Speaker 4: 07:59

Like, we can build something out of P4, which from my perspective was mostly in the academic space at that point. I hadn't seen a whole lot coming to market where people were actually using P4 as a mechanism to give operators more comprehensible networks. And I saw this as an opportunity to take a step in that direction and be a part of a really exciting team that was making that a reality.

Speaker 1: 08:22

And, I mean, you mentioned P4. P4 is really at the epicenter of what we're doing here. Can you describe P4 a little bit for folks for whom it may be new?

Speaker 4: 08:31

Yeah. So, P4 is a data plane programming language for switch ASICs and sometimes for NIC ASICs. And so, basically, what P4 allows you to do is describe, in a series of controllers, for every single packet that's going through your switch, what needs to happen to this packet. It allows you to define a set of tables that can be shared with the control plane, whether these are, like, routing tables or NDP tables or things like this, where a control plane that's running a protocol like BGP, or NDP for IPv6, can start to populate these tables. And as packets flow through your data plane in one of these switches, they run through this P4 code, which is essentially operating on a packet-by-packet basis over every single packet.

Speaker 4: 09:23

It runs at the line rate of the switch. Depending on how the switch ASIC is architected, this might be broken up into multiple pipeline stages for every single packet that's moving through the switch. But you more or less have complete control over every single packet that's running through the switch. It is more of a constrained language than something like C or Rust. So there are no loops or anything like that.

Speaker 4: 09:44

And so there are sacrifices in expressiveness that are made for the sake of having some level of determinism and making sure that your pipelines can actually execute at line rate, but it does allow you to have that level of programmability and expression in how your data plane is actually executing.
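
To make the table idea described above concrete, here is a minimal sketch in Python (not P4, and not Oxide's code) of a routing table that a control plane populates and a per-packet function consults. The names `RoutingTable`, `add_route`, `lookup`, and `process_packet`, along with the example prefixes and port numbers, are all hypothetical and chosen only for illustration. Real P4 has no loops; the loop here only stands in for the table lookup that the hardware performs.

```python
import ipaddress


class RoutingTable:
    """Hypothetical longest-prefix-match table shared with a control plane."""

    def __init__(self):
        self._entries = []  # list of (network, egress_port)

    def add_route(self, prefix, egress_port):
        # Control-plane side: a routing protocol would call this.
        self._entries.append((ipaddress.ip_network(prefix), egress_port))

    def lookup(self, dst):
        # Data-plane side: pick the matching entry with the longest prefix.
        addr = ipaddress.ip_address(dst)
        best = None
        for net, port in self._entries:
            if addr.version == net.version and addr in net:
                if best is None or net.prefixlen > best[0].prefixlen:
                    best = (net, port)
        return None if best is None else best[1]


def process_packet(table, packet):
    """Per-packet decision: forward on a table hit, drop on a miss."""
    port = table.lookup(packet["dst"])
    if port is None:
        return ("drop", None)
    return ("forward", port)


table = RoutingTable()
table.add_route("fd00::/8", 1)
table.add_route("fd00:1122::/32", 2)
print(process_packet(table, {"dst": "fd00:1122::5"}))
print(process_packet(table, {"dst": "2001:db8::1"}))
```

The split mirrors the one described in the conversation: the control plane only writes table entries, and the per-packet function only reads them.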

Speaker 3: 10:03

And to drive home real quick the point of how programmable this actually is: there are these generic parsers that you implement. So these switches do not have a concept of an Ethernet frame. An Ethernet frame is really a thing that sort of exists at the level so that the [inaudible] can parse frames into the device. So you have to tell the thing what an Ethernet frame looks like. And, for example, Western Digital has built this cache coherency protocol between CPUs using these, where they just encapsulate basically memory requests in Ethernet frames, and then they use a completely custom thing to parse these requests and then do a memory coherency thing in the switch and then, like, push packets out to specific CPUs.

Speaker 3: 10:46

So this can be as programmable as: you can parse a packet coming in that has 2 integers, and you can parse that as 2 integers. You can add these 2 integers, and then you can emit a packet with the result of that addition. And you can do that at line rate.
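
The two-integer example above can be sketched in a few lines of Python. This is only an illustration of the parse, compute, emit shape of such a program, not P4 and not anything that runs on a switch. The wire format (two unsigned 32-bit integers in network byte order, with a 32-bit result that wraps on overflow) and the function names are assumptions made for the sketch.

```python
import struct

# Assumed wire format: two unsigned 32-bit integers, network byte order.
REQUEST = struct.Struct("!II")
RESPONSE = struct.Struct("!I")


def parse(packet: bytes):
    """Parser stage: extract the two integers from the packet."""
    if len(packet) < REQUEST.size:
        raise ValueError("packet too short")
    return REQUEST.unpack(packet[:REQUEST.size])


def add_and_emit(packet: bytes) -> bytes:
    """Compute stage plus deparser: add, wrap to 32 bits, emit the result."""
    a, b = parse(packet)
    return RESPONSE.pack((a + b) & 0xFFFFFFFF)


reply = add_and_emit(REQUEST.pack(2, 40))
print(RESPONSE.unpack(reply)[0])
```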

Speaker 1: 11:01

So that's amazing.

Speaker 3: 11:02

There's really interesting things you can do with this. Well, limited within that language, but things that you definitely cannot do in software at these speeds, because you can do this at, you know, 6 billion packets per second.

Speaker 1: 11:22

Yeah. And so someone in the chat is asking, what does "at line rate" mean? And, like, at line rate means really, really fast. So, yeah, do you wanna talk about some of the speeds and feeds a little bit, Arjen, in terms of what line rate means to these things?

Speaker 3: 11:33

Well, in this case, we are using the largest and greatest of the Tofino 2 lineup from Intel, which is a 64-port ASIC supporting up to 12 terabits per second of traffic, given appropriate packet sizes. But really, it will do up to 6 billion packets per second with all 4 pipelines enabled and all ports enabled, everything going full tilt. And needless to say, that is a lot of data moving through a single ASIC.
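
The phrase "given appropriate packet sizes" can be unpacked with some back-of-the-envelope arithmetic. The sketch below uses only the two round numbers quoted above (12 terabits per second and 6 billion packets per second) and ignores Ethernet framing overhead such as preamble and inter-frame gap, so it is an approximation rather than a datasheet figure. The function and constant names are hypothetical.

```python
BANDWIDTH_BPS = 12 * 10**12  # 12 Tbit/s, as quoted in the conversation
MAX_PPS = 6 * 10**9          # 6 billion packets/s, as quoted


def throughput_bps(packet_bytes):
    """Achievable bits/s for a packet size, capped by both limits."""
    pps = min(MAX_PPS, BANDWIDTH_BPS / (packet_bytes * 8))
    return pps * packet_bytes * 8


# Packet size at which the packet-rate cap and the bandwidth cap meet.
crossover_bytes = BANDWIDTH_BPS // (MAX_PPS * 8)

print(crossover_bytes)
for size in (64, 250, 1500):
    print(size, throughput_bps(size) / 1e12, "Tbit/s")
```

Under these assumptions the two limits meet at 250-byte packets: smaller packets are bounded by the packet rate, and larger ones by the bandwidth.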

Speaker 1: 12:05

It's a lot of data. And so when we talk about a programmable switch, programmable network infrastructure, the programmability really, as you say, Arjen, goes down to the very, very bottom of the stack. This thing is not born knowing anything about software protocols and so on. All of that stuff is gonna be given to it as P4 programs. And it's extraordinarily powerful. And, I mean, we are big, big believers in P4.

Speaker 1: 12:36

And Yeah.

Speaker 3: 12:36

And it lets you parse into packets up to about 1200, oh, sorry, 500 bytes into a packet. And then you can modify headers, emit new headers, plus your payload. And so you can inject things, you can strip things, you can do operations on things and build new things.

Speaker 3: 12:57

I'm assuming that Ry is gonna talk a little bit more about DDM later. But then, more importantly, all that happens, so things like NAT, or you can add telemetry headers, those are all things that we do here, all at line rate using these prepopulated tables. But you can go as wild as, there is a concept P4 program that takes a DNS packet, and that understands the different segments to get there, so the IP header and the TCP header, etcetera, detects that it is a TCP or UDP packet for a DNS request, and then pulls out the actual request that you're making.

Speaker 3: 13:40

And then you can use the tables as a small lookup table to actually generate DNS responses at line rate. So you can build this ridiculously fast DNS relay or a DNS resolver for potentially your authoritative DNS server, if you wanted to do that. Right.

Speaker 1: 13:57

The problem is no longer DNS, folks. I've got the DNS monster.

Speaker 3: 14:02

Yeah. It is absolutely limited. But the point I wanna drive home is that this is a fully programmable thing. There are definitely limits. But there is very little fixed-function sort of functionality here. Like, you can build tables that prioritize some kind of VPN thing or some kind of layer 2 switching thing or some kind of labeled routing thing, and you can size these tables according to the datasets that you're gonna be working with, within limits of what this ASIC can absorb, because there are different variations with different sizes or different stages, meaning larger or smaller amounts of RAM available, effectively, to do this with.

Speaker 1: 14:45

Which is really powerful. And Ry, I mean, when you say that you'd kind of reached the end of the road with fixed function, I assume that part of it is that these fixed functions ultimately do have fixed area. Someone else has made the decision about how the resources of the silicon are gonna be used. And, really, the future needs to decide that dynamically. Is that a fair statement?

Speaker 4: 15:06

Yeah. That's definitely a part of the fixed-function ASICs that are available today. So, oftentimes, they'll allocate a certain chunk of their TCAM to, like, multicast. There will be a certain chunk of TCAM for, like, access control lists. There will be a chunk of TCAM or SRAM that is dedicated to IPv6 routing or IPv4 routing. I mean, some of the newer ASICs are a little bit tunable in this regard, letting you say how much you want to allocate to certain functions, but you still have to live within those constraints.

Speaker 4: 15:36

And with the Tofino, we can decide pretty much exactly how we want to do that. And so if we have wildly different use cases that are coming at our racks, in terms of, do we need to have a lot of space allocated for NAT, or do we have someone that is trying to use BGP and get full IPv4 routing tables with a million routes? What does that look like? And so we have a lot of latitude there in terms of how we're actually going to be able to handle those different use cases.
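
For readers unfamiliar with TCAM, the sketch below models the two ideas in this exchange: a ternary match (each entry has a value and a mask, and masked-out bits are "don't care") and a table whose capacity is chosen by whoever writes the program rather than fixed by the vendor. It is a plain-Python illustration under those assumptions, not a description of how the Tofino or Oxide's P4 code works. The class name `TernaryTable`, its methods, and the ACL example are hypothetical.

```python
class TernaryTable:
    """Hypothetical software model of a fixed-capacity ternary match table."""

    def __init__(self, capacity):
        self.capacity = capacity  # the size budget chosen for this function
        self.entries = []         # (priority, value, mask, action)

    def insert(self, value, mask, priority, action):
        if len(self.entries) >= self.capacity:
            raise RuntimeError("table full")
        # Store the value pre-masked so matching is a single comparison.
        self.entries.append((priority, value & mask, mask, action))
        self.entries.sort(key=lambda e: e[0], reverse=True)

    def match(self, key):
        # Highest-priority entry whose unmasked bits equal the key's wins.
        for _, value, mask, action in self.entries:
            if key & mask == value:
                return action
        return None


# Example: a tiny ACL keyed on a 16-bit destination port.
acl = TernaryTable(capacity=2)
acl.insert(value=22, mask=0xFFFF, priority=10, action="deny")
acl.insert(value=0, mask=0x0000, priority=1, action="allow")  # wildcard
print(acl.match(22), acl.match(80))
```

In this model, giving more room to one function (say, NAT) and less to another is just a matter of constructing the tables with different `capacity` values.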

Speaker 3: 16:05

And it means the opposite too, which is, we use Geneve labels in our underlay network. There's a complementary standard, or, like, a competing standard, VXLAN. We do not care about VXLAN for that functionality. Our P4 program does not have that built in. And so we do not spend any resources in this case on functionality that we would never use.

Speaker 3: 16:29

So we can try to maximize the use of this ASIC according to what we feel this thing should do, or to support the applications our customers want, rather than what, yeah, like you said, some product definition group or some designers put together 5 years ago, because that's how long ago these things were designed at some point. And so they've made choices that may not be applicable anymore, which means that with this more programmable nature, you can push a switch platform for much longer, potentially, because you can adapt it to your changing workloads or to your changing protocol needs.

Speaker 1: 17:05

And and potentially dynamically too. Right? I mean, this can actually be changed.

Speaker 3: 17:09

Yeah. I mean, you need to reload the data plane. So it definitely takes time to do that. It is disruptive. It's not seamless.

Speaker 3: 17:21

But, yes, you can probably reload this in a couple of seconds. So, yes, it is reloadable in some form. And so if you can absorb the disruption, then, yes, you can do this somewhat dynamically.

Speaker 1: 17:35

So which is extremely cool, and there's just a lot of potential here, and I think we we saw a lot of that potential. But, of course, it also means that it's, like, hey, the good news is that this thing is entirely programmable, and that's very powerful. The bad news is we gotta go program it. Right? So there's a lot that we need to go do.

Speaker 1: 17:52

And, Adam, do you remember Greg Papadopoulos, who was the CTO at Sun? He had this line that I thought was really good. He said that all of the big breakthroughs in system software also have their own programming language associated with them, which, of course...

Speaker 4: 18:08

remember that. Spot on.

Speaker 1: 18:09

Yeah. Right. Of course. We obviously strongly agree to that because of

Speaker 2: 18:12

because I mean, I

Speaker 1: 18:13

I could not agree with you more strongly, because we obviously did that with D and DTrace. But I feel we definitely see that with P4, where P4 really represents a lot of wisdom from a bunch of folks who've thought about this and have kinda done it the old way, done it with fixed function. And P4 really represents a lot of that wisdom. And so now we've gotta go build this whole thing. And so while Arjen and co. are building the actual switch and getting hardware to work, Ry, can you talk about how you and Nils and some of the other folks started getting going on, like, what does it actually mean to build the software stack on this?

Speaker 4: 18:54

Yeah. And so, a really good spot to pick up is actually where the Sidecar episode left off, which I think was about a year ago now, which is crazy to think about. But at that point, I mean, the rack switch was just coming together in terms of the hardware. And there's the Tofino simulator, which is a piece of technology that Intel delivers along with the Tofino that basically allows you to simulate the Tofino at kind of like a hardware level. Like, the representation that it's working off of is, like, a hardware RTL type of level.

Speaker 4: 19:27

Then you can use their compiler. You can compile your P4 targeting the simulator and then run it on top of the simulator to start to build up software infrastructure on top of the Tofino without actually having to run on a Tofino, whether that's on a reference platform or on the highly customized, integrated switch that we're building for the rack. And so because of that, by that point in time, Nils, who's our engineer that's doing a lot of the development for the switch drivers in the operating system and the daemons that run the management point for the switch and our APIs that drive the switch, like, a lot of that had been defined at that point using that Tofino simulator. But one of the things about the Tofino simulator was that since it was representing the ASIC at a hardware level, it was not very fast. And so you would max... yeah, to put it lightly.

Speaker 4: 20:21

Right? You would max out at, like, a few hundred, maybe a thousand packets per second, and then you would see latency start to spike into, like, the tens of seconds. Or if you weren't running TCP, where you had backoff, and you're just running straight-up UDP through this thing, you could see latency spike into the hundreds of seconds, and then it would just kind of grind to a halt. And so it was an absolutely wonderful tool for understanding how our P4 code was executing kind of in the small, so to speak. But the moment that we had to kind of step out of that bubble and start to say, okay.

Speaker 4: 20:52

We actually want to start implementing our network end to end at, like, a system level. Maybe not a rack scale level, but, say, we wanna have six compute sleds represented as virtual machines. We wanna represent the Sidecar that's executing our P4 code and the compute sled that's connected to that Sidecar switch over PCI Express that's driving that switch. Like, we wanna have all of that together in one environment to be able to actually have things working end to end. The simulator wasn't really going to provide us with that capability.

Speaker 4: 21:22

We even tried running it on these super beefy machines, and it was a nonstarter. And so we kind of had to take a step back and say, okay. So how are we gonna start developing things at a system level? How are we gonna start developing our routing protocols? How are we gonna make sure that our NDP implementation that's running through the switch for IPv6 is actually gonna work with the illumos host operating system NDP that's sitting right next to it in the network?

Speaker 4: 21:51

And so that was an interesting point of space where we're just kind of like, you know, what are we gonna do? And then so, I have been writing

Speaker 1: 22:01

Yeah. Just before you get there, I wanna make sure we're giving the simulator its due, because this is a cycle-accurate ASIC simulator, which is something that, first of all, most vendors do not allow off the property. And the fact that we had the software from Intel, and as I told that team, this is the best software that Intel makes. Prove me wrong. Because on the one hand, we were only able to get several hundred packets a second through the thing.

Speaker 1: 22:30

On the other hand, Adam, it goes to one of your favorite lines: we prefer to think of the OASIS as half full rather than half empty. Because it is remarkable that this thing works at all, to the point that what you and team were able to do with the simulator, Ry, was stunning. We got so much working with the simulator that you actually do get to the point where, like, the actual problem with the simulator is that, of course, it's never gonna be much. If you've ever done any kind of cycle-accurate simulation of anything, you know that a 100x degradation is really, really hard to get to, and 1,000x is gonna be much more reasonable. So the fact that we were kinda right in that 100,000x degradation is pretty impressive, honestly.

Speaker 1: 23:23

So Yeah. I mean, it

Speaker 4: 23:24

it's completely mind boggling. And, I mean, when you had TCP running through it and TCP was doing backoff because it was detecting congestion, then you could actually SSH through it.

Speaker 1: 23:33

You already canceled the message.

Speaker 4: 23:35

Yeah. It was working okay. You could do, like, an apt update and, you know, it might take all night, but it would eventually work.

Speaker 2: 23:42

Right. Am I remembering this right that it was it was single single core as well?

Speaker 4: 23:48

There was a multi core version of it. We we had a lot of issues with that. And

Speaker 3: 23:53

Well, because multi-core RTL simulation is a notoriously difficult problem. You very quickly need synchronization primitives that basically undo any advantages that your multiple cores would provide you. And because you need those synchronization primitives, everything else becomes way slower. So in their defense, it is a model of the actual hardware, and it emits very detailed logs as a packet travels through this thing: how each of the parsing steps work, how the lookup steps work. So you get a very detailed trace for every packet going through, like, what it does and how it got to the decision point and how it then figured out which, quote, unquote, port to switch out of.

Speaker 1: 24:37

Which is incredibly cool. You can

Speaker 3: 24:39

just plug

Speaker 1: 24:40

it before.

Speaker 3: 24:40

Yeah. Well, so it's a thing that would run on a Linux machine. And so if you had a Linux machine with several network interfaces, you could attach those network interfaces to the virtual interfaces of that switch and see your actual packets from outside the machine travel into it. Like, you could send them into the machine, run through the model, see all these phases of exactly how the packet was parsed and classified by your P4 program, how the decisions were made, the lookups that happened in the tables, and then how it modified the packet potentially and pushed it out again on one of those real interfaces, and it would just pop up on the other end. And you could Wireshark it on your client machine. And so it is a really impressive tool, but, yes, because of all that accuracy, it is not particularly fast.

Speaker 3: 25:23

So if you

Speaker 1: 25:24

Probably fair to say that we were pushing it harder than anyone else. I don't know that any other No.

Speaker 3: 25:28

Anyone else who would actually do some real stuff will run into that eventually. Like, once you get through debugging your P4 program, and you're trying to run a little bit more traffic, you'll hit that. So everyone hits that eventually, and that's the point where you then need a different solution.

Speaker 1: 25:51

Yeah. So, Ry, I just wanted to inject some praise for the simulator, just because I think it is so extraordinarily impressive. But as you say, it's really impressive that we're able to get it to work at all, and that it does work so well, but can you describe some of the ways in which the deeply suboptimal performance really impedes development of the software we need to develop?

Speaker 4: 26:15

Yeah. I mean, so it is totally a fantastic tool. It's just not the right tool for particular jobs. And the job that we were stepping into at this point in time was system-level, end-to-end type of development. And when you start hooking up a bunch of compute sleds to your simulated switch, then you have a bunch of protocols that are just starting to run all on their own.

Speaker 4: 26:37

Right? You have NDP running for IPv6, which is churning out a few packets per second. You have your routing protocols that are running, that are doing keepalive messages. And so by the time that we're at, like, over, you know, 10 of these things that are hooked up to an emulated switch, we're probably in the neighborhood of at least a few hundred packets per second that are going through this thing. And that's about the limit of what we can get to in terms of pushing packets through the simulator.

Speaker 4: 27:02

And so you can forget about running any type of TCP flows or anything like that once you get to this point. We basically got to the point where it's like, okay. We can stand up the network. We can get the simulator running, and then everything grinds to a halt once, you know, just the basic automated protocols start running. And so that's kind of the decision point that we found ourselves at.

Speaker 1: 27:26

And the challenge that we've got in front of us at that point, and the challenge that we've had, indeed, the entire history of the company, is how do you develop the software that's gonna run on the hardware without the hardware in hand? And even when you have the hardware in hand, you might not have enough of it for everybody. You may not have enough for CI. We all we for at every layer of the stack, we're always asking ourselves, how can we simulate, emulate the layer beneath us?

Speaker 4: 27:52

Yeah. Exactly. And so what we decided to do at this point was actually do the full crazy thing, which is write our own P4 compiler. And this kind of grew out of a nights-and-weekends project that I was doing. We were using P4 at work, and I was trying to use P4 at home, compiling to some RISC-V cores that I was tinkering around with.

Speaker 4: 28:13

And so I had a little bit of forward momentum on this already. And I was like, you know what? We can probably just compile P4 to Rust and then have that Rust code as the implementation of our packet processing pipelines and use it wherever we want. And so just taking a step back real quick, if you're familiar with the P4 ecosystem, this sounds even more crazy, because in the P4 ecosystem we have the p4c compiler that's maintained by the community. There's the behavioral model version 2, which is kind of like the execution substrate that exists around the output of that P4 compiler.

Speaker 4: 28:50

And the question is, why not use that? And that's a very valid question. And for us, it really comes down to, where do we need to execute this P4 code? And in the Sidecar episode, one of the things that Arjen had mentioned was the level of fidelity that we get from the Tofino simulator in terms of how it presents itself to higher layers of the stack, including the operating system's drivers, including the Intel SDK that allows us to manipulate P4 tables in real time. Like, that was all very high fidelity, and high fidelity enough that we could actually run our OS stack and our ASIC management stack on top of all of that.

Speaker 4: 29:32

So that was something that was critically important, that we really didn't want to lose. But then kind of on the other end of the fidelity spectrum, we also have this development environment at Oxide that is really, really important. It's kind of like an Oxide-in-a-box environment, and it's what a lot of our control plane engineers use to evaluate systems in the control plane software substrate. And so, basically, what developers do there is they have an illumos box (illumos is the operating system that is at the core of our product). And they're able to deploy the entire control plane onto that box, which includes the ability to launch VMs.

Speaker 4: 30:09

It has the entire Oxide API, and they can do all of their work in this environment. It's not necessarily super important to have, like, high switch-level fidelity in that environment, but we definitely need the network functions that implement the core of our network in that environment. So these are two very different environments where we need to have the logic of our P4 executing. And what we really want there is to be able to compile freestanding P4 code and use that code in both environments. I'm sorry, Bryan.

Speaker 4: 30:38

Were you saying something?

Speaker 1: 30:39

No. I was just, I mean, this is a problem where that needs to be in a lot of places that are not gonna have specific hardware. So we needed a different solution.

Speaker 4: 30:49

Yeah. And so when you look at the p4c and bmv2 model, right, p4c produces this declarative JSON representation, and bmv2, which is like this Python/C++ thing, basically ingests all that and acts as an execution engine over that kind of declarative representation of a P4 program. But that really just wasn't gonna work for us. And so in order to get the first environment that we wanted, in terms of having that very high fidelity interface to exercise our whole networking stack top to bottom, on the switch itself and on the compute sled that's actually attached to the switch over PCI Express, we decided that we wanted to implement a virtual ASIC inside of a hypervisor.

Speaker 4: 31:34

And so this was a great opportunity to dogfood even more Oxide technology. So if you're coming from, like, the Linux side of things, there's QEMU and KVM. And what we have in the Oxide stack that's kind of analogous to that is bhyve and Propolis, where Propolis is kind of like the user space emulation side of things. And it's written entirely in Rust. And so if we have this x4c P4 compiler that can compile this P4 into Rust, and it can compile into, like, a dynamic library, and we can dynamically load that from other Rust code, then we have this nice substrate that we can actually consider to be like an ASIC inside the hypervisor.

Speaker 4: 32:16

We can expose that ASIC to the guest operating system that's also running illumos. It's running our full switching stack in terms of the OS drivers, in terms of the daemons that actually drive the ASIC. And we have this very high fidelity substrate that, for most purposes in our system software, actually represents what an actual hardware-based system looks like. And so that's what we wound up doing in this environment, and it's actually worked out phenomenally well. So the It's just extraordinary.
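To make the idea of "P4 compiled to Rust and loaded behind a common interface" a little more concrete, here is a minimal sketch. It is not the output of x4c and not an Oxide API: the `Pipeline` trait, the `Packet` struct, and the `L2Forward` type are hypothetical names invented for illustration, and the pipeline is defined inline where a real harness would load it from a dynamic library.

```rust
use std::collections::HashMap;

/// Hypothetical packet representation: raw bytes plus the arrival port.
struct Packet {
    ingress_port: u16,
    data: Vec<u8>,
}

/// Hypothetical interface a compiled P4 program could expose to a harness
/// (for example, a virtual ASIC inside a hypervisor).
trait Pipeline {
    /// Data plane: returns the egress port, or None to drop the packet.
    fn process(&mut self, pkt: &Packet) -> Option<u16>;
    /// Control plane: install an exact-match table entry.
    fn add_entry(&mut self, dst_mac: [u8; 6], port: u16);
}

/// Stand-in for generated code: parse Ethernet, match on destination MAC.
struct L2Forward {
    table: HashMap<[u8; 6], u16>,
}

impl L2Forward {
    fn new() -> Self {
        Self { table: HashMap::new() }
    }
}

impl Pipeline for L2Forward {
    fn process(&mut self, pkt: &Packet) -> Option<u16> {
        // Parser stage: an Ethernet header is 14 bytes; drop anything shorter.
        if pkt.data.len() < 14 {
            return None;
        }
        let mut dst = [0u8; 6];
        dst.copy_from_slice(&pkt.data[0..6]);
        // Match-action stage: a table miss drops the packet.
        let port = *self.table.get(&dst)?;
        // Never send a frame back out the port it arrived on.
        if port == pkt.ingress_port {
            None
        } else {
            Some(port)
        }
    }

    fn add_entry(&mut self, dst_mac: [u8; 6], port: u16) {
        self.table.insert(dst_mac, port);
    }
}

fn main() {
    // The harness only sees the trait object, not the concrete pipeline.
    let mut pipe: Box<dyn Pipeline> = Box::new(L2Forward::new());
    pipe.add_entry([0xaa, 0, 0, 0, 0, 1], 3);

    let mut frame: Vec<u8> = vec![0xaa, 0, 0, 0, 0, 1]; // destination MAC
    frame.extend_from_slice(&[0xbb, 0, 0, 0, 0, 2]); // source MAC
    frame.extend_from_slice(&[0x86, 0xdd]); // EtherType: IPv6

    let pkt = Packet { ingress_port: 1, data: frame };
    match pipe.process(&pkt) {
        Some(port) => println!("forward to port {}", port),
        None => println!("drop"),
    }
}
```

Because the harness depends only on the trait, the same compiled logic could in principle sit behind a high-fidelity emulated device or a lightweight development environment, which is the property described above.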

Speaker 1: 32:50

I mean, it is so rare, I feel, to have this level of fidelity, certainly for an ASIC that is this complicated. I mean, it's just extraordinary to be able to virtualize all of this and be able to develop that software. It's amazing.

Speaker 4: 33:05

Yeah. And so where we're sitting at with this today is, we can get about a gigabit per port, on the the code that's been compiled and is in this harness inside of the hypervisor. I think we're not actually limited by the code itself. We're more limited by the IO paths. So we're using something called, DLPI for our IO path.

Speaker 4: 33:26

That's the Data Link Provider Interface. It's kind of a packet-at-a-time type of interface. And so I think that's where our gigabit limit per port is coming from. But the payoff here is that a gigabit per port is plenty for the kind of environment where we need to actually be able to test things end to end and see how the network is unfolding. And so today, we can do things like test multipath routing algorithms, where our rack has two switches inside of it, and every single sled is connected to both switches.

Speaker 4: 34:00

And so that creates a multipath routing problem. And when we wanna look at that end to end, from the routing protocols that are running on our compute sleds to the ones that are running on the switches and how all that traffic works end to end, we can actually do that now at about gigabit speeds and evaluate how those algorithms are working, and it presents a really nice environment for doing all of this.
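As a small illustration of the multipath problem described here (every sled has a link to each of two switches), the sketch below picks a path per flow by hashing the flow's addresses and ports, a common ECMP-style technique. This is an assumption for illustration only; the transcript does not say which algorithm Oxide uses, and `Flow` and `pick_path` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::Ipv6Addr;

/// Hypothetical flow key: the classic 5-tuple.
#[derive(Hash, Clone, Copy)]
struct Flow {
    src: Ipv6Addr,
    dst: Ipv6Addr,
    src_port: u16,
    dst_port: u16,
    proto: u8,
}

/// Pick one of `n_paths` next hops by hashing the flow key. Packets of the
/// same flow always take the same path, which avoids reordering within a flow.
fn pick_path(flow: &Flow, n_paths: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    flow.hash(&mut hasher);
    (hasher.finish() % n_paths as u64) as usize
}

fn main() {
    // Two uplinks per sled, one to each switch in the rack.
    let switches = ["switch0", "switch1"];
    let mut counts = [0usize; 2];

    // Simulate 1000 TCP flows that differ only in source port.
    for src_port in 40000..41000u16 {
        let flow = Flow {
            src: "fd00::1".parse().unwrap(),
            dst: "fd00::2".parse().unwrap(),
            src_port,
            dst_port: 443,
            proto: 6,
        };
        counts[pick_path(&flow, switches.len())] += 1;
    }

    println!(
        "{}: {} flows, {}: {} flows",
        switches[0], counts[0], switches[1], counts[1]
    );
}
```

An end-to-end test environment like the one described is what lets you check how such a choice behaves with real routing protocols and real traffic, rather than only in isolation.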

Speaker 1: 34:21

Yeah. That is amazing. Maybe now is a good time to get either Levon in here or Ben in here. I mean, Levon, when you came to Oxide, you were also coming from, you know, yet a different networking background, had suffered a lot of these same problems. And when you came aboard, I think one of the first things that you were targeting was, how do I pull all of this stuff together so we can actually use it in CI, we can actually develop on it? Do you want to describe some of the work that's been involved there?

Speaker 5: 34:59

Yeah. So, I think I joined Oxide, like, last July or August or so.

Speaker 1: 35:06

It's it's a time warp thing.

Speaker 5: 35:10

So, like, everyone on this call whose mind is getting blown, that was me. And then they said, hey. You're gonna put all this stuff together and, like, automate it. And I'm like, okay. But, no, one, it was just kind of jaw dropping to see how much work had been done in just two years, from the actual Sidecar switch and Tofino ASIC all the way up to these different components, like OPTE and the P4 compiling pipeline and some of the other components that we're probably gonna get into as we talk about, like, how does the control plane interact with these things.

Speaker 5: 35:53

But what Ry introduced me to was kind of a virtual topology building tool that was also created, that we were going to use to kind of get some of these things sorted out. And, essentially, it was just kind of diving in the ocean and swimming, because I'm learning a new operating system. You know, I'm learning illumos. And there's enough overlap with a lot of other Unix and Linux style systems. But then there are some things that are different.

Speaker 5: 36:26

There are some things that are really cool. And then just kind of hoovering up all of the information, they're blasting at me with a fire hose. But the biggest thing that was so exciting about all of this was coming from the background that I was in before, which was your traditional data center, public cloud, private cloud, automation, it's there are just some real painful things that come with gluing a bunch of disparate vendor infrastructure together. So, like, I I worked at Rackspace for a few years. I worked at Equinix Metal for for a little over a year.

Speaker 5: 37:03

And the thing is, like, you're trying to take several different switches that have several different APIs and fixed-function hardware. And the vendors are changing how these things work. You know, every software update, you have no idea what's going on in the box, just like what I was talking about earlier. And then you're also trying to get it to behave in concert with things that you don't really have deep integration with. Like, you don't have integration with the server part, where you don't know precisely what state it's gonna end up in.

Speaker 5: 37:38

You just kinda know what state you hope it's gonna end up in and everything. And so yeah. And so you're they're like, well, you know, this is what they asked for. Hopefully, it ends up there. And if it ends up there, then the packets will flow.

Speaker 5: 37:48

But if not, you know, well, they're gonna blame the network, and then we gotta go and look and see what happens. So coming into this thing where they're like, oh, yeah. OPTE lives in the kernel, and it's going to make sure that certain things end up a certain way. And,

Speaker 1: 38:04

We talked about OPTE a couple of times. You wanna just describe that briefly?

Speaker 5: 38:09

So the high level, I guess, explanation is it's kind of like the distributed virtual switch that lives in the kernel in the Oxide hypervisors. So, basically, all of these VMs, when they're created, a lot of the rules for how their traffic should be handled will get programmed into OPTE. OPTE will end up leveraging the NICs that we're using, which are the T6, T7? I get lost on some of the

Speaker 1: 38:40

Yeah. T6. Yeah. The T7 one day, but T6 for now. Yeah.

Speaker 1: 38:44

Yeah. The Chelsio NICs.

Speaker 5: 38:46

Yeah. The Chelsio NICs. And so since our Gimlets have the capability to interact with this Tofino-based switch, some of the Gimlets will realize that, hey. I have a hardware connection to the Tofino switch. So as things happen on the Gimlets, I can tell the switch what's happening.

Speaker 5: 39:08

And the switches are ready for this traffic when these VMs come up. So the servers and the switches are integrated at the software level, at the control plane level. And so you're not trying to take a switch OS that knows jack squat about anything that happens on any server, and a server OS that was never designed to know jack squat about anything that's happened on a switch, and try to glue them together with some Python code or Ansible scripts. It's like that nightmare goes away. And it's like the Gimlet is authoritative and says, okay, I'm creating this VM. And the process of configuring the switch is also part of the process of creating this VM.

Speaker 5: 39:51

So either they both succeed or they both fail. And that just creates what I think is going to be a tremendous difference. And I went, I think, off the rails. I just the the whole thing excites me, so I just kinda go wild. But

Speaker 2: 40:05

One clarification, Levon. We've been using the name Gimlet a bunch. So just to be clear, Gimlet is our server. And, not to be confused, with the Sidecar and the Tofino it's kind of, as Bryan likes to say, the CPU gets it its coffee, and it just goes at line rate. Rather than, as in some types of switches, having a general purpose CPU plugged into this Tofino dedicated exclusively to switch purposes, we have all these Gimlets, all these servers, and then two of them are special in that they have this PCIe link that was alluded to.

Speaker 1: 40:46

That's right. We call those Scrimlets. And it is just a PCIe peripheral, as Aaron was saying in the chat. I'm not sure it's the largest PCIe peripheral ever made, but it's definitely much larger than the actual compute sleds that they

Speaker 2: 40:59

And it's just like a x4 link. Right? Like, if you're thinking that this is like a GPU, I mean, there are some analogs there in terms of the CPU just bringing it its coffee and getting out of the way. But the bandwidth between the Tofino and the general purpose CPU, you know, doesn't need to go to the sort of extremes that a GPU needs.

Speaker 1: 41:29

The the kind of the switch operating systems that are out there, the network optics that are out there, I'd have wondered, I have to say many times, how does anyone get all that stuff to to interoperate? Because, I mean, that must be I mean, it is a challenge for us where we are controlling all sides of this. I can't imagine what it's like trying to get, even open source things to work together, let alone proprietary things. And, you know, you just wonder how anything works at all, and then it's kind of less of a surprise to know that. Well, it often doesn't actually.

Speaker 1: 41:59

It often, like, is is broken or it breaks in strange ways or it's or it takes a long time to actually get functional.

Speaker 3: 42:06

There was an interesting tweet, there was this thread going on a couple days ago, about, you know, moving from cloud to on prem and, I don't know, a bunch of stuff back and forth. But there was someone who rightly pointed out that what most people don't really realize is how much work the large hyperscalers have done in order to make their network operate as well as it does. And part of that is all that work, what you're just describing. How do you control this large distributed machine that consists of thousands of switches, to make sure that all the right configuration is in all these tables in order for these things to work? And, having seen a little bit of how that worked at Facebook, now Meta, there's a large body of software to push that around and to get that into those ASICs (they're using a lot of Broadcom ASICs), and they've written that all from scratch.

Speaker 3: 43:03

Because

Speaker 1: 43:03

It's not just, like, not open source. Right? Or in in general, it's like they have

Speaker 2: 43:07

It is it is not even

Speaker 3: 43:08

so much about open source. It is so specific to what they're doing, because it relies on so much existing Facebook infrastructure, because there's basically a whole pub/sub model on top of that in order to push route information around. So they use this, I forgot what the routing protocol was they developed at some point for this. That is basically the complement to BGP. So they're distilling some BGP information, and they have link state information that they compute.

Speaker 3: 43:32

And they push these things together through a PubSub system, and then switches are subscribed to that and then get that that gets pushed into ASICs. It's a it's it's a large and very complicated piece of machinery in order to make that work.

Speaker 1: 43:44

Yeah. And I think that, you know, as we get further and further down the path, I mean, I think we knew that we had to integrate the switch, but boy, this is just not something we've looked back on. I mean, this is absolutely the right decision.

Speaker 2: 43:59

You mean in terms of rather than buying some off the shelf switch on?

Speaker 1: 44:03

Oh my god. Oh my god. I mean, it's a fate too terrible to contemplate. Yeah. But it's I mean,

Speaker 2: 44:09

it's handing someone else our fate. Right? I mean, exactly as Ry was alluding to earlier, it means that, you know, we'd be subject to all of these completely undebuggable problems. And, you know, relying on what would need to be an extremely good partner to bail us out in these incredibly hard situations.

Speaker 3: 44:29

Well, in the end, running any software, just getting the thing cabled up in a sensible manner in the in the like, physically having DAC cables between servers and switches, that would like, we would have not been able to build the the the cable backplane that we have today. That that would just not have existed.

Speaker 2: 44:43

And can I ask, it also seems like, you know, Arjen, you mentioned how Meta was able to build something very, you know, purpose built? Seems like to a degree, we don't have to build all the features and functionality of an Arista switch or a Cisco switch, because we know, you know, everything that's going to be plugged into this switch. So it seems like in some ways, I mean, obviously, there's a ton to build, but we don't need the same kind of long list of features that a general purpose switch might need.

Speaker 3: 45:14

Well, that's also why they decided to do white label switches, and why everyone at that scale who has done white label switches does that. You realize that what they really needed was really high IP performance. They wanna just push as many IP packets around as they possibly can at the lowest possible price. And once you start stripping all these enterprise features that a lot of the switch vendors are providing (you know, Cisco particularly serves that whole market, but then there's Juniper and there's other switch vendors with lots of different things that are maybe really good in enterprise), once you distill that down to, I wanna run a fabric just as fast as we can, then we're gonna layer a lot of the smart functionality more in software on each individual machine. Because the other part that is often overlooked is that the reason the network at Google or the network at Facebook works the way it works is that every host in the network actually participates in that by running some active component that allows a centralized controller to steer traffic from those hosts, so that they can actually make decisions about large flows: where they go, when they go, how they go.

Speaker 3: 46:29

And that is just that doesn't exist in many environments, other than those environments where you have that amount of control over each of these individual pieces. And what yes. Once you have that that control over these pieces, you can very aggressively cut away all the things that you don't need, and you can focus on making it just do the thing that you really want and then do that as fast and as best as it can do that.

Speaker 1: 46:51

Yeah. I mean, it just gives us extraordinary kind of potential, but, of course, there's a lot of integration to go do. I mean, Ben, do you wanna speak? I mean, you've been right at the coal face in terms of actually getting all of the stuff to integrate. Do you wanna talk about some of the adventures there in terms of getting all these things to cooperate?

Speaker 6: 47:13

Yeah. Sure. It's certainly been a challenge. So I think the hardest thing has been, sort of what Ry was talking about earlier, just so much of it is being built at the same time that there is a lot of bootstrapping that's required, where you have to mock out certain interfaces or kind of have hacks that seem to be short lived that end up being much longer lived than you had intended, to kind of make the system work at least to an approximation of the end state that you want. I think though that actually we've been really fortunate to have folks like Ryan and Levon who have done a lot of that integration previously.

Speaker 6: 47:56

I think being able to build up a lot of these simulation tools, you know, being able to virtualize so much of the stack, has really made that actually much, much easier than you might have thought. I think one of the things that's been really useful, at least for, say, OPTE for example, actually is that Ryan Zezeski, before he left a few months ago, did some kernel testing stuff that was really useful, that hadn't been done before. That's been very, very helpful I would say. So, yeah, I mean, I think the best thing has just been having strong tests, really useful simulation and emulation tools for mocking up the parts of the system that you don't wanna consider when you're building some part of it, that you wanna abstract away. That's just been incredibly useful.

Speaker 1: 48:41

And Ben, for the work that you've done, I mean, because I I think we've seen this in lots of other parts of the stack. I think it's especially true here, where you've got a bunch of different components, and you've got kind of one body of work that is crossing a bunch of different components. Could you speak a little bit to that?

Speaker 6: 48:58

Yeah. So, I mean, kind of practically speaking, what we really have is this main program called the sled agent, which comes from another name for the Gimlets. We just often call them compute sleds or just sleds. But this program really is sort of the, I'm not sure what the analogy is here, but it's kind of the part of an individual sled that really is managing all of the individual compute resources that you might have on that machine. And so what we've really tried to do is put in a lot of debuggability, and the ability to introspect what that thing is doing, and really sort of understand how that system is operating and what it's doing in terms of marshaling the hardware resources that are available. So we've been able to put in a lot of, for example, DTrace probes to understand what OPTE is doing, or what the sled agent itself is doing when it tries to provision certain resources, what Propolis, the hypervisor, is doing.

Speaker 6: 50:04

The DTrace probes there have been super useful just in terms of understanding exactly what parts of it are actually running at various points in time. So I think the fact that we've really written all of these different components from scratch, despite the work that that entails, really does mean that we can instrument it in a way that you can't really do with other systems. A common theme here has been how much of a black box certain, you know, vendor supply

Speaker 1: 50:37

And Levon mentioning, like, the critical role that hope was playing for Levon in his lives.

Speaker 6: 50:44

We don't really, I mean, we hope, but we don't need to, because we can also verify. Right? Hope but verify. I mean, you can check that it is operating the way you expect, because we built the entire thing. You can stuff in some extra DTrace probes over here, and make sure that your packets are doing what you expect them to do, that they are going out the interface that you want, and that's been extremely useful.

Speaker 6: 51:05

Well, and

Speaker 1: 51:05

and then also, Ben, the fact that we've got our own control plane. We're not integrating with VMware or OpenStack or what have you. We've got our own control plane, which gives us the latitude to do whatever we need to do upstack to make this whole thing work. Right?

Speaker 6: 51:20

Yeah. That's correct. So, I mean, the brains of the operation are really, yeah, totally under our control, and we can instruct the entire rest of the system to basically do whatever we want. And so that part of the stack is called Nexus, appropriately. It's the central point for everything; all roads are leading there. And so we put all of our decision-making logic into that part of the system, combined with something that Levon was hinting at earlier which we call sagas. These are an abstraction we have for running a bunch of steps that are potentially dependent on previous steps, where you want each one of them to succeed or fail atomically, and then if anything fails, you want to unwind the whole sequence of operations.

Speaker 6: 52:07

By writing things in terms of those sagas at the level of Nexus, you can really do some pretty complicated orchestration, and not fear that you're gonna leave some, you know, half-finished state lying around on some switch somewhere. Right? That will end up routing your packets off into a black hole. You won't really need to worry about that, because we can make sure that we're unwinding the entire thing all the way back to the beginning, which has been very, very useful too.
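To make the saga idea concrete, here is a minimal sketch in Rust of "run steps in order, and on failure undo the completed ones in reverse." It is only an illustration of the pattern as described above, not Oxide's implementation: the `Step`, `run_saga`, and `Fabric` names, and the route and NAT helper functions, are all hypothetical.

```rust
/// One step of a saga: a forward action plus the undo that compensates for it.
/// (Hypothetical types for illustration only.)
pub struct Step<S> {
    pub name: &'static str,
    pub action: fn(&mut S) -> Result<(), String>,
    pub undo: fn(&mut S),
}

/// Run the steps in order. If a step fails, run the undo actions of the steps
/// that already succeeded, in reverse order, and return the error.
pub fn run_saga<S>(state: &mut S, steps: &[Step<S>]) -> Result<(), String> {
    for (i, step) in steps.iter().enumerate() {
        if let Err(e) = (step.action)(state) {
            // Unwind everything that completed before the failing step.
            for done in steps[..i].iter().rev() {
                (done.undo)(state);
            }
            return Err(format!("step '{}' failed: {}", step.name, e));
        }
    }
    Ok(())
}

/// Hypothetical state a networking saga might touch.
#[derive(Default, Debug, PartialEq)]
pub struct Fabric {
    pub routes: Vec<&'static str>,
    pub nat_entries: Vec<&'static str>,
}

pub fn add_route(f: &mut Fabric) -> Result<(), String> {
    f.routes.push("fd00:1::/64");
    Ok(())
}
pub fn del_route(f: &mut Fabric) {
    f.routes.pop();
}
pub fn add_nat(f: &mut Fabric) -> Result<(), String> {
    f.nat_entries.push("10.0.0.5");
    Ok(())
}
pub fn del_nat(f: &mut Fabric) {
    f.nat_entries.pop();
}
/// A step that always fails, standing in for an unreachable switch.
pub fn fail_step(_: &mut Fabric) -> Result<(), String> {
    Err("switch unreachable".to_string())
}
pub fn no_undo(_: &mut Fabric) {}
```

Running `run_saga` over the route step, the NAT step, and then `fail_step` returns an error and leaves the `Fabric` empty again, which is the "no half-finished state on some switch" property described above. A real saga system also has to survive crashes partway through, which this sketch does not attempt.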

Speaker 1: 52:36

Totally. And we've kind of been building each of these components and making each one robust, and then being able to put all of the pieces together. I think we've got new levels of appreciation of both what we've done and the stuff that we've needed to go do to get that foundation working robustly. And then you do kinda wonder, like, what would we have done without all this foundation? Like, alright.

Speaker 1: 52:59

Yeah. I think we know the answer to that. It's not good. We would've struggled without having all this foundation in place.

Speaker 6: 53:06

Yeah. And I think it's interesting, you know, before I joined Oxide, I would have argued that we should not have built so much of this from the beginning, from the ground up.

Speaker 1: 53:14

Oh, interesting. Yeah. I'm sorry. You're I we're we're at sea now, Ben. We actually Yeah.

Speaker 6: 53:19

Exactly. Exactly. We're not

Speaker 1: 53:20

coming back to port.

Speaker 6: 53:21

You'll like the second half of this, Bryan. So I have changed my opinion about this in a lot of ways, and I think part of the reason is exactly what you're describing: when you integrate an existing piece of software, you know, we've all kind of hit this, right, like, okay, there's a bug in this third-party code. Okay. Well, what do I do? Do I fork it?

Speaker 6: 53:43

Do I put a pull request on it? You know, you have to deal with that sort of thing. But even beyond that, the ability to build in all of the instrumentation, all of the debugging, all of the understanding that you want, you can do all of that if you build the whole thing with all of the pieces in mind. Right? I think that's a big thing, to be able to put in the debugging knowing how you're gonna use something. I think that is extremely useful, and I think this has just been, yes.

Speaker 6: 54:17

It's something I would not have expected to be so valuable when I joined Oxide, and I have Yeah.

Speaker 1: 54:22

It's been extraordinary. And I think we've done a pretty good job of this so far, but, you know, we can't just do it our own way for our own sake, obviously. We need to be pretty careful about that. And Ry, maybe it's worth expanding on what you've done in terms of the P4 compiler. I think the fact that we did our own P4 compiler kind of represents that entire Oxide viewpoint, in that we are strong believers in P4. We love the fact that we've got a language that we did not invent; P4, obviously, exists well beyond Oxide. But we are also unafraid to go our own way, provided that that's what makes sense.

Speaker 1: 55:07

We're not gonna go our own way just because we wanna go our own way. But if it makes sense, we're gonna do that. And I also feel that, around here, by the time we're wondering whether it makes sense to go our own way, it's probably time to go our own way. We don't come to that decision lightly, because we've got folks that, you know, know the peril of that, or some fraction of the peril.

Speaker 4: 55:31

Yeah. And we've gotten an absolutely huge amount of mileage out of that. I mean, because we've implemented our own P4 compiler, we can implement this in a way that works with the same debugging tools that we use for everything else up and down the stack. And so one of the really neat things that we do with the P4 compiler is we actually emit static DTrace probes inside of the compiled P4. And so when you wanna understand exactly how your P4 program is executing, you can trace along that execution path in the pipeline and understand exactly how that program is executing.

Speaker 4: 56:14

And we've gotten a huge amount of mileage out of just that, of integrating P4 with DTrace to get visibility and get very fast development cycles. And so, I mean, that alone has just been such a massive win.

Speaker 1: 56:26

Adam, am I the only one tearing up? I hope you're tearing up, Adam. I hope you got

Speaker 2: 56:31

I think I already went through it. I mean, I don't know if you'd seen this already, Bryan, but when I was geeking out with Ry over this codegen stuff and got to see that integration with DTrace, it's just, it was made up.

Speaker 1: 56:46

I mean,

Speaker 4: 56:46

it was

Speaker 2: 56:46

just it was just awesome to see.

Speaker 1: 56:48

It's awesome always because it's also building on the work that you had done with Ben. Whatever that was a long time ago.

Speaker 2: 56:54

Yep. 35 years ago. That's right.

Speaker 1: 56:56

35 years ago. Nobody was

Speaker 4: 56:59

Many dogs are using the, the USDT crate.

Speaker 2: 57:03

Yeah. Yeah. So, Ben, that was one of the, I think it was the first thing we did together, but it was fairly early in your tenure at Oxide, was to

Speaker 6: 57:15

Six months in or something like that. Yeah.

Speaker 2: 57:17

Yeah. So, to build a crate for embedding USDT probes in Rust. I think we got to, like, a pretty nice sort of Rusty spot with it. I think there's more that could be done to make it even more tightly integrated. One of the neat things was, as we were researching this, I don't know if I showed you this, Bryan, but early, early, early in Rust's lifetime, there was an issue saying, like, build in DTrace probes. This was like, you know, a triple-digit Rust issue or whatever, from the earliest days.

Speaker 2: 57:52

Obviously, they didn't get built, but they were thinking about it. It was just neat to see, and fun working on that crate.

Speaker 1: 58:00

Yeah. That is great. And certainly we could not have anticipated, because this was, again, very early in the lifetime of the company, that it would ultimately be so useful because, oh, by the way, we're gonna have a P4 compiler that we're gonna write. It's gonna generate Rust. That's also gonna emit these USDT probes.

Speaker 1: 58:19

It's gonna be really important to debug how our networking protocols work. Like Uh-huh.

Speaker 2: 58:23

Yeah. And I just because it's a theme that we've discerned in the Oxide and Friends show, but I think there were moments when Ben and I were working on this where we kinda ask, you know, are people gonna use this thing? Like, is this gonna be important? Are we gonna, like, is is this gonna solve problems? And I think Ben, like, kind of slapped me because it's like, of course, it's gonna solve problems, you dummy.

Speaker 1: 58:42

But I definitely thought you

Speaker 2: 58:47

but, you know, I think that's, like, whatever. It's always true. These tools that we build are always helpful. And if there's something you take away, it's like, go build those tools.

Speaker 1: 58:58

And I think it's always good that we're kind of like we we are self aware enough to know, like, god, do I why should I really be polishing this turd? I mean, I'm going to put another, like, I'm going to put another sheen on this thing. Is this the right decision? But, I don't know if Ben slapped you. I definitely recall slapping Adam because you're just, is this the right thing you're working on?

Speaker 1: 59:14

I'm like, this is definitely the right thing to be working on, because, I mean, it's on the tin: we're Oxide, we are doing a lot in Rust, and we are gonna continue. And, actually, Ryan, maybe you could speak a little bit to that, because I think you'd done some Rust prior to Oxide, but certainly more Rust at Oxide. How has that experience been in terms of building the P4 compiler and SoftNPU and so on?

Speaker 4: 59:42

It's been, so, I mean, before Oxide, I had done a very small amount of Rust. I'm trying to remember the Rust. I think I wrote some, like, TTY code on Linux in Rust, and just a few very small things. In my group, we had just started to explore using Rust for a few different things. One of our major products was an Internet network emulator, and we had built that in C++ over many years.

Speaker 4: 01:00:10

And we had kicked around the idea of doing that in Rust, but never actually got around to doing that. They may have done that now that I'm not there anymore. But yeah. So very little Rust experience. But, I mean, I can't imagine a more perfect ecosystem to do this work in, a, because, you know, it's compiled natively, so we don't have to worry about any of the performance issues that would come with, like, if you're compiling to, like, Go or something else like that.

Speaker 4: 01:00:40

The ecosystem around code generation in Rust is just flat-out amazing. Like, the quote crate and all of the tools that are available to generate code are, like, I have no doubt that the code generation stage of the compiler that we've written is an order of magnitude less complex than it would be without those strong code generation tools. I mean, Adam, you've used these tools significantly in Dropshot and a lot of the tools that you've put together as well. And, I mean, they're just incredible code generation tools.

Speaker 2: 01:01:18

Absolutely. I mean, I think that in other languages, when I'm generating code, I feel like I'm 100% doing it wrong. And it's gonna be this undebuggable pile. And in Rust, it's like, I'm, you know, only 20% sure I'm doing it wrong. And it's a semi-debuggable pile.

Speaker 2: 01:01:35

But it the amount of time it saves in the sort of elegance and testability of it is just phenomenal.

Speaker 1: 01:01:41

And we are using that up and down the stack. We are using its ability to generate code everywhere. And, you know, we say hygienic macros as kind of a placeholder for it, but it was really so much more than hygienic macros. I mean, as you say, it's the quote crate. It's build.rs.

Speaker 1: 01:01:59

I mean, there's so much that you can go do to actually make this comprehensible and extraordinarily powerful. So, yeah. Right. That is awesome, and no surprise that you're making use of things like quote.

Speaker 4: 01:02:13

Yeah. And one of the tricks that I pulled from Adam's book was in the Progenitor tool that we have, where, basically, you can point a Rust macro at an OpenAPI spec, and you just, like, instantly get all of the code that implements the client for that API spec in the Rust code that you're directly working with. And we wound up doing that same trick for P4 code. So if you wanted to have P4 pipeline code directly used from your Rust code, you just say, use P4, point it at your P4 code. That'll kick in the compiler library, compile it, splat all that generated Rust code directly into your current workspace, and boom, you have access to a P4 program pipeline right then and there.

Speaker 1: 01:02:57

That is really cool. I don't

Speaker 2: 01:02:58

think I yeah.

Speaker 1: 01:02:59

I don't think you I know you've done that. That's awesome.

Speaker 2: 01:03:01

Yeah. And right. It's only hiding like tens of thousands of lines of code that it's emitted behind that macro.

Speaker 4: 01:03:08

Right. Yeah. It's not a small amount of code.

Speaker 1: 01:03:13

But that's really cool, right? Because, I mean, I think this is part of why we're so bullish on P4 and programmable networking, a programmable switch, a programmable fabric. Because when you make that easy to integrate into other programs, you can begin to use P4 in, like, lots and lots of other places that are not just a switch.

Speaker 4: 01:03:34

Oh, yeah. Absolutely. Like, I was doing something the other day where I needed something tcpdump-like or snoop-like, but not quite tcpdump, not quite snoop. I needed a little bit more programmability. I didn't wanna have to go write a whole bunch of, like, header code by hand, but I knew I had the vast majority of the headers that I needed to interact with already in our Sidecar P4 program.

Speaker 4: 01:03:53

So I just grabbed a whole bunch of that code, imported it into my Rust code, wrote, like, another 100 lines of Rust, and I have the exact observability program that I needed. And so it was it was very nice in that regard.

Speaker 1: 01:04:06

That is really, really neat. Yeah, the ability to just quickly spin up, as you say, a new tool. That is really, really nifty.

Speaker 1: 01:04:20

And so, you know, someone kinda asked earlier: what are some of the use cases? Like, when we look forward, what are some of the things that we can go do, controlling this thing end to end? What are some of the things we can deliver for folks who are actually using infrastructure?

Speaker 4: 01:04:38

Sorry. You broke up a little bit there. I think my Internet blipped.

Speaker 1: 01:04:44

Sorry. Folks were asking us about what some of the use cases are for an integrated switch, and what are the kinds of things that we can go do when we can control the stack end to end?

Speaker 4: 01:04:57

Oh, man. You know, you get to kind of move heaven and earth. And there are so many directions to go with that question. I mean, when we're talking about how we integrate with our customers' networks, which I think is the most directly visible thing that we're going to be seeing, we're talking about having strong BGP implementations that can interact with upstream networks, from, you know, the size of standard corporate BGP networks that are kind of EVPN-flavored, to interacting with Internet connection endpoints, connecting directly to CDNs and colos. If we have customers that want to be able to do tunneling, whether it's through, like, WireGuard or Geneve or VXLAN, to be able to get through remote sites to get to other racks or other parts of their network.

Speaker 4: 01:05:56

Like, we can implement all of this as needed, according to demand on the product, and start to build up a more robust network stack that is basically allowing us to evolve the platform in any way that we see fit without being constrained by a fixed-function ASIC. And

Speaker 2: 01:06:15

right, you know, I've heard you get very excited about DDM and about kind of multipathing and the kinds of things we can do when we control all ends of the conversation. Can you talk about that a bit?

Speaker 4: 01:06:29

Yes. That's a very good point, because that's something that this infrastructure actually does allow us to do. That is one of the killer apps for the Oxide network infrastructure. And so DDM stands for delay-driven multipath, and it's our routing protocol that allows the compute sleds to communicate with each other, both within a particular rack and across racks. But before we talk about DDM, we need to talk a little bit about the Oxide network architecture, which is something that Robert Mustacchi has put together and that the network team is collectively working to implement.

Speaker 4: 01:07:11

And there are several features of this networking architecture that allow us to build very robust routing protocols. And something that Arjen was talking about earlier is that, you know, a lot of, like, your Arista and Cisco switches have the alphabet soup of protocols associated with them. But at the end of the day, when you're building a scalable infrastructure, you wanna simplify things, and you really just wanna focus on doing IP routing. And when we're talking about living at layer 3 of the network, we just wanna get packets from host A to host B in the most effective way possible. And so when we look at the Oxide network architecture, we have an underlay/overlay architecture where our compute sleds are essentially communicating over a physical underlay, and then customer instances, and the interfaces that are in those virtual machines, communicate over an overlay network.

Speaker 4: 01:08:05

Our underlay is a pure IPv6 underlay. There's no IPv4 there. And then overlays are built on Geneve, and there can be IPv4 or IPv6 packets encapsulated inside of those Geneve headers. Every single one of our compute sleds is summarized by an IPv6 /64. So that reduces our network fan-out a little bit in terms of the routing tables that we have to consider, and there's no broadcast domain.
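As a rough illustration of the two ideas just mentioned, the sketch below builds the fixed 8-byte Geneve base header defined in RFC 8926 (no options) and extracts the upper 64 bits of an IPv6 address, which is what a "/64 per sled" summary amounts to. The function names and the example VNI are assumptions for illustration; nothing here is taken from Oxide's code, and the episode does not say which VNIs or Geneve options are used.

```rust
use std::net::Ipv6Addr;

/// Build the 8-byte Geneve base header from RFC 8926 with no options.
/// Layout: version (2 bits) + option length (6 bits), flags + reserved (8 bits),
/// protocol type (16 bits), VNI (24 bits), reserved (8 bits).
pub fn geneve_header(vni: u32, protocol_type: u16) -> [u8; 8] {
    assert!(vni < (1 << 24), "VNI is a 24-bit field");
    let p = protocol_type.to_be_bytes();
    let v = vni.to_be_bytes(); // v[0] is zero for any 24-bit value
    [
        0x00, // version 0, option length 0
        0x00, // O and C flags clear, reserved bits zero
        p[0], p[1], // inner protocol, e.g. 0x6558 for an Ethernet frame
        v[1], v[2], v[3], // virtual network identifier
        0x00, // reserved
    ]
}

/// Read the 24-bit VNI back out of a Geneve base header.
pub fn geneve_vni(header: &[u8; 8]) -> u32 {
    u32::from_be_bytes([0, header[4], header[5], header[6]])
}

/// The upper 64 bits of an underlay address. If every sled owns one /64,
/// this value is enough to identify the destination sled.
pub fn sled_prefix(addr: Ipv6Addr) -> u64 {
    (u128::from(addr) >> 64) as u64
}
```

With one /64 per sled, an underlay routing table only needs one entry per sled rather than one per address, which is the fan-out reduction described above.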

Speaker 4: 01:08:35

And so if folks are here that have done large VXLAN deployments or EVPN deployments, every single address that one of the customer instances gets is a /32 for IPv4 or a /128 for IPv6. And so since we have no broadcast domains that we need to push out over our layer 3 networks, this gets rid of at least, like, half of the really, really nasty overlay/underlay problems. And we can really just focus on layer 3 routing. And so with that in mind, what are the goals that we have for layer 3 routing in the rack? And something important here also is that the rack is explicitly multipath.

Speaker 4: 01:09:18

So there are 2 switches in every single rack, and every single compute sled in the rack is connected to both of those switches. And so every time a compute sled is sending a packet, it has to make a decision about what is the best next hop to take. And that's where our routing protocol comes in. And so taking a look at this problem, before we even start to think about specific protocols and things like that, we have to ask ourselves, what are the most important properties of this routing protocol? And first and foremost, we wanna be fault tolerant.

Speaker 4: 01:09:51

We have a multipath physical network, so we really wanna take advantage of that. We want anything in the network within reason to be able to fail, and the network's just gonna keep on ticking. We can lose any link. We can lose any particular port. We could even lose an entire switch.

Speaker 4: 01:10:06

And this network is just gonna continue to function, and the TCP sessions will all stay alive in customer instances and things like this, and we'll be okay. The next thing that we wanted to focus on was flexible topology construction. And this means putting customers in the driver's seat of how they interconnect their racks, and not constraining them to particular topologies. And so, like, one of the routing protocols we looked at early on was RIFT, which is under the IETF standardization process now. It stands for Routing in Fat Trees.

Speaker 4: 01:10:39

There are a lot of really attractive things about that protocol, but it's also very specific to fat trees, and we didn't really wanna constrain ourselves to that particular type of topology. Another thing was scalability. We wanted to be able to scale up to, eventually, like, hyperscale-sized networks, and so that put us squarely in, like, a distance-vector, path-vector routing protocol type of place. And then finally, we wanna do load balancing at the speed of the network, at packet-level granularity. And so when you look at multipath routing and you look at how it's done today, if you just take a simple example of, say, I have an iBGP network, and I have a couple of paths that are coming from my BGP peers that I can make a decision on in the data plane.

Speaker 4: 01:11:24

How do I make that decision? So that typically comes through ECMP today. But we know for data center workloads, there are significant drawbacks with ECMP in terms of both microcongestion and elephant flows. The way that ECMP basically works is that you're taking a hash over parts of the layer 4 packet and deciding based on that hash what your next hop is going to be or what your complete path is going to be. And depending on how the hashing works out, you can have what are called elephant flows, these high-volume flows that just kind of squat on a particular path in the network, and they take up all the bandwidth, and they cause long flow completion times for smaller flows that really need to happen much faster.
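The ECMP behavior described here can be sketched in a few lines of Rust. This is a simplified model, not any switch's actual hash function: the `FlowKey` fields, `ecmp_next_hop`, and `path_load` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever hash real hardware uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::Ipv6Addr;

/// The header fields an ECMP hash is typically computed over (assumed here).
#[derive(Hash, Clone, Copy)]
pub struct FlowKey {
    pub src: Ipv6Addr,
    pub dst: Ipv6Addr,
    pub proto: u8,
    pub src_port: u16,
    pub dst_port: u16,
}

/// ECMP-style choice: hash the flow key and take it modulo the number of
/// equal-cost paths. Every packet of a flow lands on the same path.
pub fn ecmp_next_hop(flow: &FlowKey, num_paths: usize) -> usize {
    assert!(num_paths > 0, "need at least one path");
    let mut hasher = DefaultHasher::new();
    flow.hash(&mut hasher);
    (hasher.finish() % num_paths as u64) as usize
}

/// Total bytes placed on each path for a set of (flow, bytes) pairs.
/// The hash never looks at flow size or at how busy a path already is.
pub fn path_load(flows: &[(FlowKey, u64)], num_paths: usize) -> Vec<u64> {
    let mut load = vec![0u64; num_paths];
    for (flow, bytes) in flows {
        load[ecmp_next_hop(flow, num_paths)] += *bytes;
    }
    load
}
```

Because the path is a pure function of the header fields, a single gigabyte-scale flow stays pinned to one path for its whole lifetime, and any small flow that happens to hash to the same path has to share it. That is the elephant flow problem in miniature.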

Speaker 4: 01:12:07

And then there's microcongestion that can happen when you have things like storage networks and storage fabrics that are sitting on top of these networks, that create congestion in, like, microsecond types of windows that can really degrade the ability to get to the aggregate bandwidth capacity of the network. So these are problems that we really wanted to look at and solve with our routing protocols. But in order to do this, we had to be able to measure the network. Like, ECMP is kind of a hash-and-pray type of approach. Like, you're gonna hash all your packets, and we're just gonna hope that the hashing gives a nice, even distribution.

Speaker 4: 01:12:44

Like, our traffic matrices are even enough so ECMP is gonna work out well. But we know from a lot of research that that's just not the case for a lot of data center workloads. And so what we did was a lot of literature survey and a lot of reading into the ACM SIGCOMM conference congestion control papers and routing papers. And what we came up with was, we drew a lot of inspiration from a 2017 paper called DRILL from the University of Wisconsin, which talks about micro load balancing and how you can do that really effectively without completely murdering TCP in terms of reorder buffers and things like that, and then a 2020 SIGCOMM paper from Google called Swift, which uses delay as a leading indicator of congestion. And this is more of a congestion control paper, saying, you know, how do we get rid of some of the sharper edges of TCP in data center networks?

Speaker 4: 01:13:41

But what we've done is taken these two things and kind of combined them and said, if we can measure the network at line rate, and this is where P4 comes in, because we want to be able to add additional telemetry information to packets as they cross the network, can we, at every single sled inside the rack, for every single destination that it's talking to, say, at this point in time, this is what the delay looks like to that particular destination, and then load balance based on that measurement in such a way that I'm not gonna reorder things so badly that TCP is just gonna have a really, really bad time? And P4 is very instrumental in doing that because it gives us the flexibility to add this telemetry information to packets in IPv6. We're actually using IPv6 extension headers as packets are crossing the network, and we can actually make routing decisions in sub-round-trip time, based on what's happening, and avoid microcongestion within the network, and hopefully reach the aggregate capacity of the network without sacrificing flow completion times. And so we think this is a very novel approach to solving this problem, and it is fundamentally enabled by having a programmable data plane and doing hardware/software co-design that allows us to actually do this in a way that wouldn't completely tank performance.
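The episode does not spell out how DDM turns delay measurements into a path choice, so the following is only a hypothetical sketch of the general idea: keep a smoothed delay estimate per destination and path, and send toward the path whose estimate is currently lowest. The `DelayTable` type, the exponential smoothing, and the `alpha` parameter are all assumptions made for illustration, not a description of the real protocol.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

/// Hypothetical per-sled table of smoothed delay estimates, in microseconds,
/// keyed by (destination prefix, path index).
pub struct DelayTable {
    delay_us: HashMap<(Ipv6Addr, usize), f64>,
    num_paths: usize,
    alpha: f64, // weight given to the newest sample, between 0 and 1
}

impl DelayTable {
    pub fn new(num_paths: usize, alpha: f64) -> Self {
        DelayTable { delay_us: HashMap::new(), num_paths, alpha }
    }

    /// Fold one delay sample (for example, derived from telemetry carried
    /// in a packet header) into the moving average for that path.
    pub fn observe(&mut self, dst: Ipv6Addr, path: usize, sample_us: f64) {
        let estimate = self.delay_us.entry((dst, path)).or_insert(sample_us);
        *estimate = (1.0 - self.alpha) * *estimate + self.alpha * sample_us;
    }

    /// Current estimate; a path with no samples yet reads as zero delay,
    /// so it gets tried before it is ruled out.
    fn delay(&self, dst: Ipv6Addr, path: usize) -> f64 {
        *self.delay_us.get(&(dst, path)).unwrap_or(&0.0)
    }

    /// Pick the path with the lowest estimated delay to `dst`.
    pub fn best_path(&self, dst: Ipv6Addr) -> usize {
        (0..self.num_paths)
            .min_by(|a, b| self.delay(dst, *a).total_cmp(&self.delay(dst, *b)))
            .unwrap_or(0)
    }
}
```

Unlike the hash-based approach, the choice here responds to what the network is doing right now. What this sketch leaves out is the hard part mentioned above: switching paths too eagerly reorders packets within a flow, so a real design has to bound how often and when a flow is allowed to move.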

Speaker 1: 01:15:03

And so

Speaker 4: 01:15:03

that's And you

Speaker 2: 01:15:04

you just touched on it. Right? But, you know, in an environment where I got my switch from one vendor and my NIC from another and my servers from a third, is this attainable, or is it only because of that integration throughout?

Speaker 4: 01:15:24

I mean, it's very much from the integration throughout. I mean, if we didn't have programmability on the switches, if we didn't really understand the data paths and how they're working, from, you know, our host operating system that's taking packets from the virtual machines, kicking them out to the T6s, understanding the paths, or the multiple paths, from the T6s to the Tofinos and how all that's working, being able to do symmetric routing for TCP acknowledgments so we understand latency as it's actually evolving across the network, we wouldn't be able to do any of that.

Speaker 3: 01:15:57

So it's simple. We need to add these telemetry headers into these packets as they traverse the network. If you can't do that in the ASIC, the only place you can do that is at the host. So those are the only points where you can make decisions based on that telemetry data. And so if you wanted this on any other piece of hardware, you would have to wait for some standard to start to exist, and then for every other vendor, or at least one vendor, to implement this in an ASIC as a fixed function.

Speaker 3: 01:16:28

And then hope that they got it right the first time around. Whereas we can freely experiment with this until it works, and then roll it out. And we can even develop it further once it is in the deployed sites of the customers, because we can reprogram this thing over and over again. I mean, sure, we have to take downtime for it, but there's an upgrade path to future versions of this protocol as we learn more and develop it more. And that is just not possible with a fixed ASIC from someone else.

Speaker 1: 01:17:01

Yeah. I mean, it's just extraordinary in terms of what it all adds up to. And I think, you know, with the lead-in on P4, you can see why P4 is so extremely important. Right? Just for all those things you mentioned, we can't wait for fixed function here.

Speaker 1: 01:17:15

We need to be able to do this entirely dynamically. And, you know, Ryan, when you talked about... I mean, I love "elephant flows." But with microcongestion, how do people observe microcongestion today? How do you know that you've got microcongestion in the network other than, like, I am sad, and I don't know why? Because, I mean, I think we're also gonna be able to allow people to actually observe where that microcongestion exists.

Speaker 1: 01:17:42

Right?

Speaker 4: 01:17:43

Yeah. And so, like, with the DRILL paper, for example, one of the ways that they were able to observe microcongestion, and I will have to go back and look, they might have been using P4-programmable switches, but they were able to observe queue depth on the switches as packets were traversing the network. And so they could use that as a proxy for congestion as packets were traversing. And there's been a lot of research in storage fabrics, and this was referenced in the Swift paper as well, if people are interested in taking a look at the papers.

Speaker 4: 01:18:17

And so for people that are building NVMe storage fabrics, where there is a high level of sensitivity to latency, the way that they measure delay in Swift allows them to understand where congestion is building up in the network on a packet-by-packet basis. And so then from that point, you can build a story in terms of, this is how my storage fabric is operating, this is how these latency measurements are unfolding across the network, and we see these extremely short-duration latency events. And even though they're extremely short in terms of their duration, you might have a high frequency of them, and that can completely tank performance as, you know, TCP is kicking in on other parts of the network, reacting to this.

Speaker 4: 01:19:03

And it really pays to be able to get ahead of that, before TCP implementations start reacting to that congestion.

Speaker 1: 01:19:10

Totally. And, you know, when we were talking about the measurement 2 years in the making, talking about all the crazy physics required to deliver this incredible link, Adam and I were kinda joking about how, meanwhile, we're up here waiting for an incorrect timeout or a multi-millisecond timeout. But TCP retransmits are brutal. I mean, the second you've decided that I have had to give up on that packet and I'm gonna have to resend it, now this unbelievable dragster that can go so quickly is waiting milliseconds, tens of milliseconds, hundreds of milliseconds before you retransmit. And that's the stuff that's gonna have major upstack implications. So the ability to minimize that microcongestion and those retransmits is just gonna be extraordinary. And I think it's also emblematic, right, of the rack-level view that we're taking.

Speaker 1: 01:20:00

I mean, one of the things I know that both you and Levon have done in previous lives is look at SmartNICs. And, not to get you going on a rant, but, you know, SmartNICs are kind of the antithesis of our approach, honestly. I mean, maybe that's phrasing it a bit too strongly, but it's not a rack-level approach. And it puts a lot more complexity in places that are already struggling with complexity. Levon, I imagine, requires a lot more hope. Although with SmartNICs, just a matter of, like, actually getting the parts is hard enough.

Speaker 1: 01:20:38

But you're pushing the intelligence to a part that actually doesn't have great visibility. So, I mean, to me, it's like SmartNICs are not the solution to the problem. We've gotta, like, actually think about this from the entire rack, the entire network, to actually solve the entire problem. And I feel that, like, right, this is just a huge lurch forward in that regard.

Speaker 4: 01:20:59

Well, SmartNICs can do

Speaker 3: 01:21:00

some part. Sorry. Like, there's some workloads that do make sense. Like, if you had a NIC that could do line-rate encryption on a per-flow basis, which I don't know if that exists already or not, or if it's tractable, but that would be super interesting, because that you can't do in a switch like this. And you also may not wanna do that on a CPU in software. So that's where a SmartNIC would provide value.

Speaker 3: 01:21:23

But that that's

Speaker 1: 01:21:24

Yeah. And I'll tell you where else a SmartNIC would indisputably provide value. I'm in a very cold room right now, and I could actually use a SmartNIC as a space heater in here.

Speaker 3: 01:21:31

A little space heater. Yeah. Sure. Yeah. A little 75 watts to your room.

Speaker 3: 01:21:36

That's Oh. Yeah. Oh, it makes exactly. But it's warm. On that.

Speaker 3: 01:21:38

It's very comfortable.

Speaker 1: 01:21:39

Oh, absolutely. Yeah. It's, like, warmer than a cat. A little uncomfortable, actually.

Speaker 4: 01:21:44

Yeah. I mean, I think with the SmartNIC thing, there's so many layers of complexity here. Right? So we have a network stack that's running inside of the virtual machine of the guest that the customer is running, and that is going out of that virtual machine through a viona interface to the host operating system stack, or in our case, OPTE, that is then encapsulating that and sending it over UDP to its final destination. So we have the behavior of the internal TCP session, and then we have that, you know, somewhat masked by the outer UDP and Geneve sessions that are kind of encapsulating that and carrying it to where it needs to go.
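For readers unfamiliar with the encapsulation being described, the sketch below builds the fixed 8-byte Geneve header (per RFC 8926) that sits between the outer UDP header and the guest's inner frame. It only illustrates the wire format: it omits Geneve options and the outer IP/UDP headers, and it makes no claim about how OPTE itself is implemented. The function names are hypothetical.

```python
import struct

GENEVE_UDP_PORT = 6081        # IANA-assigned UDP destination port for Geneve
ETHERNET_BRIDGING = 0x6558    # protocol type for an encapsulated Ethernet frame


def geneve_header(vni, protocol=ETHERNET_BRIDGING):
    """Build a Geneve base header with no options for the given 24-bit VNI."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Byte 0: version (2 bits) and option length (6 bits), both zero here.
    # Byte 1: O and C flags plus reserved bits, all zero here.
    # Bytes 2-3: protocol type. Bytes 4-7: VNI (24 bits) then 8 reserved bits.
    return struct.pack("!BBHI", 0, 0, protocol, vni << 8)


def parse_geneve_header(header):
    """Return (protocol, vni) from the first 8 bytes of a Geneve packet."""
    _, _, protocol, vni_and_reserved = struct.unpack("!BBHI", header[:8])
    return protocol, vni_and_reserved >> 8


def encapsulate(inner_frame, vni):
    """Prefix an inner frame with a Geneve header (outer UDP/IP not shown)."""
    return geneve_header(vni) + inner_frame


if __name__ == "__main__":
    packet = encapsulate(b"\x00" * 64, vni=1234)
    print(len(packet), parse_geneve_header(packet))
```

Anything on the path that only looks at the outer headers sees a UDP flow to port 6081, which is the sense in which the inner TCP session's behavior is masked.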

Speaker 4: 01:22:22

And, I mean, I think what we really want is more programmable NICs that can integrate as a whole with our overall network architecture. And so if we can get to a place where our NICs can be programmable on the data plane, similar to how our switch is programmable on the data plane, then I think we're gonna start getting into a pretty good place. But in terms of, like, these SmartNICs with, like, the ARM cores, where you're connecting a computer to a computer through PCI and the programmability is actually quite limited, like, I don't see the benefit of that for our architecture.

Speaker 1: 01:22:57

Yeah. And then, yeah, Levon would say that, like, actually, the SmartNICs are just not fast enough as well. And then you've also got the, can we actually just make them show up too? Lots of challenges with SmartNICs. And not to pick on SmartNICs too much.

Speaker 1: 01:23:11

But, you know, we look at this kind of at the aggregate level and the ability to get DDM actually deployed. What were some of the challenges to actually make all of this stuff work right?

Speaker 4: 01:23:31

Oh, man. So coming to Oxide was my first real brush with P4 programming, and before I knew it, I was writing a compiler, so I really jumped into the deep end. But DDM has a lot of, like, data plane components to it, including being able to modify the tables, which allow us to make decisions at line rate, which in vanilla P4 is not really a thing, but it is a thing in Tofino P4. So they have these particular register types that you can attach to a table key, and that allows you to update that table key over time, which means if we can attach that register state to a router table key, then we can automatically keep a delay associated with that key, er, sorry, that routing entry. And so that was definitely a challenge getting up and going because it wasn't, like, part of P4 proper.
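The idea of keeping delay state alongside each routing entry can be modeled in ordinary software, as in the sketch below. This is a plain Python model for intuition only: it is not P4, it does not use any Tofino register API, and it is not how DDM is implemented. The class names, the smoothing factor, and the choice of an exponentially weighted moving average are all assumptions.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Route:
    prefix: object          # an ipaddress network object
    next_hop: str
    delay_us: float = 0.0   # stands in for the register attached to the entry


class DelayAwareTable:
    """Hypothetical routing table whose entries each carry a delay estimate."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha  # assumed smoothing factor for the moving average
        self.routes = []

    def add(self, prefix, next_hop):
        self.routes.append(Route(ipaddress.ip_network(prefix), next_hop))

    def record_delay(self, next_hop, sample_us):
        # Fold a new delay measurement into every entry using this next hop.
        for route in self.routes:
            if route.next_hop == next_hop:
                route.delay_us = ((1 - self.alpha) * route.delay_us
                                  + self.alpha * sample_us)

    def lookup(self, destination):
        # Longest-prefix match first, then lowest delay among equal prefixes.
        addr = ipaddress.ip_address(destination)
        matches = [r for r in self.routes if addr in r.prefix]
        if not matches:
            return None
        longest = max(r.prefix.prefixlen for r in matches)
        candidates = [r for r in matches if r.prefix.prefixlen == longest]
        return min(candidates, key=lambda r: r.delay_us).next_hop


if __name__ == "__main__":
    table = DelayAwareTable()
    table.add("fd00:1::/64", "switch0")
    table.add("fd00:1::/64", "switch1")
    table.record_delay("switch0", 100.0)
    table.record_delay("switch1", 10.0)
    print(table.lookup("fd00:1::5"))
```

The point made in the discussion is that on the switch this update and lookup happen in the data plane at line rate, whereas a software model like this one would need a control-plane round trip for every update.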

Speaker 4: 01:24:31

It's kind of like this really nice extension of Tofino that allows us to do this. That was a significant challenge. Getting DDM up and running on the endpoints, in the illumos kernel, has been a very interesting adventure of deciding where this is gonna plug into the overall operating system network stack. Because there's so many different choices, and we're not trying to take, like, the DPDK approach where it's just like this ad hoc thing off to the side. Like, we want this to be a part of the operating system network stack in a very efficient way.

Speaker 4: 01:25:06

We're currently talking to, like, our vendors, Chelsio, about, you know, what's the best way to be able to offload some of this, in terms of just pushing it off onto the NICs so we have less overhead in terms of sending, like, acknowledgment messages out, and things like that. And when we are sending acknowledgment messages out, like, can we piggyback on existing TCP sessions if we can snoop that far down into the packet, and would that provide advantages? And so there are so many choices about how to implement this. I mean, the theory is sound behind it in terms of what the potential is to achieve in our networks, but the plethora of choices that we have to make along the way, it's a large mountain of them. And so we're trying to diligently document everything as we go and articulate why we made particular choices, knowing that maybe we need to revisit a choice later on down the road, and kind of, like, building a decision tree, so once we land with our initial implementation, we can try to figure out, like, how did we get here?

Speaker 4: 01:26:13

And if we need to go back and look at something that is not quite performing on the level that we need it to, like, how do we get back to that point?

Speaker 1: 01:26:21

Yeah. And whenever you've got all these many degrees of freedom, it's a challenge. Right? It can be a challenge to figure out what is the right layer of abstraction to tap into. I feel like this is a common challenge for us.

Speaker 1: 01:26:38

And then, as we kinda look to some of the other things, I mean, one of the other things you mentioned is the ability to really express the Oxide rack in a customer's deployment. Could you speak a little bit to that, and what that would look like, and some of the things that we can go build to better integrate this in with a customer's network?

Speaker 4: 01:27:02

Yeah. So, I mean, coming from a background of deploying fairly large network systems, when you do this in, like, a colocation facility or something like that, it's a matter of, you know, you have to set up your firewalls, and then you have to set up your routers that are gonna be interacting with your network provider, your CDN, and then you have to set up all of your management networks to be able to manage all of that stuff just in case your primary network providers go out. And then you log in to your switches and use whatever command line interface your switches are supporting. And, you know, I recently actually did this for Oxide. Like, I'm putting together our colo presence.

Speaker 4: 01:27:44

And so, like, I'm just reminded of how painful all this is. And, like, you're logging in to, like, your firewall BMCs, and it's like this weird Redfish thing that they put in place that doesn't actually work as well as the IPMI thing. And so you wind up, like, scrapping that path, and then you

Speaker 1: 01:28:02

I'm glad. I know that Josh Clulow is gonna be glad that someone else at Oxide is having to suffer with BMCs right now. I feel like Josh has been suffering so much with BMCs. It's like, well, Josh, didn't we start a computer company? It's like, yes.

Speaker 1: 01:28:15

I know. I want to run those computers instead of being stuck running this computer because our computers aren't ready yet. But, yeah.

Speaker 4: 01:28:21

So I've joined Josh in the BMC fun club. And then, you know, you have your switches that you're deploying. And even though you've bought them from the same vendors, they have slightly different CLIs that don't take the exact same commands. And so this painful experience, like, really highlights that with the Oxide rack, you buy it, you ship it to where you're going to deploy it, and then you use this nice API to set everything up. And that includes setting up BGP.

Speaker 4: 01:28:49

Or if you're using static routing, that includes setting up static routing. Like, it's a multipath-first type of setup, where, like, the architecture really encourages people to use both of the switches. I guess they're not top-of-rack; they're, like, middle-of-rack switches. So using both of those middle-of-rack switches, and then using something like BGP to be able to peer with both of those switches and having, like, a good setup. Or you can use static routing, if that's where you want to go.

Speaker 4: 01:29:18

But, like, the overall experience of just deploying this whole unit, where setting up networking is not this out-of-band, like, side thing, it's just a fundamental part of setting up the rack, I think that's just gonna be a really eye-opening experience for people.

Speaker 1: 01:29:34

Well, I think it's just the ability to get the rack in and get it actually functional in a customer's network quickly. And that's part of the challenge. And Arjen, you were making mention of some of the discussions that have been going on around on-prem infrastructure, and folks not really realizing that, actually, with the poor, anemic state of the art today, it takes a long time to get actual vendor gear in and get it working. The hyperscalers don't have this problem because of all the work, Arjen, that you mentioned that they've done on the networking. They are able to roll it in and actually turn it on and turn it up and deploy it. And, Ryan, I think the vision is that we're gonna be able to do something awfully similar with Oxide, and in the time from that rack arriving, the biggest challenge is gonna be the loading dock and making sure that the doors are high enough, which, by the way, is not a small challenge.

Speaker 1: 01:30:27

Just to emphasize that, this thing's a monster. It's pretty big. But once it's physically in the DC, getting this thing connected with the customer's network is hopefully gonna be pretty quick because of all of this programmability that we've been able to build into it.

Speaker 4: 01:30:43

Yeah. And it's something we're putting a huge amount of effort into. I mean, followers of Oxide probably know we have our RFD process, our requests for discussion. And, I mean, our customer network integration RFD has to be something like one of the most discussed RFDs that we have at this point. It's to the point where GitHub has difficulty loading it because there's so many comments.

Speaker 4: 01:31:06

And we've spent a lot of time thinking about this, and we really are trying to make, like, a very turnkey solution. So you can just connect with BGP or whatever dynamic routing protocol you're using to connect to your upstream provider, and, you know, just get going.

Speaker 1: 01:31:22

And, boy, it feels like we are tantalizingly close here to the dream. And, you know, this is a huge testament to everything, Ry, that you and Levon and Ben and Arjen have done. I mean, there's been such a broad team effort to get all this stuff not just conceived of, but there are so many gritty details required to get this thing implemented. And hopefully folks will check out the P4 compiler that you've done. And just P4 in general, I think, has been really at the epicenter of what we've done here in terms of the programmable network. Really, really exciting stuff. And, you know, I remember that when we started the company, one of the concerns we had is, like, boy, if we do our own switch, when we talk to potential customers, we're going to be basically picking a fight with the networking team, because they're going to have their own kind of preferred switch vendor.

Speaker 1: 01:32:25

And I think that what we have found instead, actually, is that we are delighting folks, because they are excited that we've actually built the network that they have wanted to build. And hopefully I'm not overstepping there, but it certainly feels that way, that a lot of these decisions we've made come from your wisdom, and Levon's wisdom, and all of the wisdom that we've accumulated over the years. And it really feels like we're building what people want to be able to run.

Speaker 4: 01:32:51

Yeah. I think in a lot of the customer meetings that I've been in, I've been bracing for that impact. Bracing for the, oh, man, like, here's this weird thing you're trying to put into my network. Like, why is it not just this Arista thing or this Mellanox thing?

Speaker 4: 01:33:07

And we haven't really gotten that feedback yet. I'm still bracing for it every single meeting, but it hasn't arrived yet.

Speaker 1: 01:33:13

It hasn't arrived yet. And in part, like, you know, the switch vendors aren't exactly helping themselves out. So, yeah, it has not arrived, but I think that part of the reason it's not arrived is because people are seeing the value that this is actually gonna bring. So it is very, very exciting stuff.

Speaker 1: 01:33:30

Well, you know, Adam, I know you've got a child to run to, no longer a toddler. A toddler when we started Oxide and Friends. Now really

Speaker 2: 01:33:39

I know. Once a toddler, now a boy who interrupted with his lightsaber demanding I play with him.

Speaker 1: 01:33:44

That's right. And pretty soon, he's gonna be, like, hassling you for the car keys. So, but this has been great. Thank you so much. I mean, again, Ry and Levon and Ben, and Arjen, great to have you here.

Speaker 1: 01:34:00

Great to see this thing actually coming to fruition, and so many of these things beginning to really compound on themselves. And I just feel lucky to be a part of it, as I think we all do. And thank you so much for sharing this with us, and looking forward to getting it out there.

Speaker 4: 01:34:19

Yeah. Absolutely. Wonderful to be here.

Speaker 1: 01:34:21

Awesome. Alright. Thanks, everyone, and we will see you next time.
