Bryan and Adam are joined by a number of members of the Oxide networking team to talk about the networking software that drives the Oxide rack. It turns out that rack-scale networking is hard... and has enormous benefits!
ryaeng Sure beats logging into a number of Cisco switches and making changes at the console.
admchl This is my favourite episode in a long time, this is all really fascinating.
rng_drizzt the first Sidecar episode was nearly 1.5 years ago 🤯, right after we cut the first rev
levon That episode blew my mind
duckman This sounds like a big deal on the scale of ebpf
duckman Or bigger
bnaecker It is extremely useful for understanding the processing pipelines. As long as you only run single-packet integration tests 🙂
od0 just want to go out and find things to write P4 code for
JustinAzoff <@354365572554948608> yeah one way to think about that sort of thing is that xdp can be used to run little programs on a nic, where p4 is kind of like that, but running on effectively a nic with 48+ ports
wmf So you have P4 and OPTE in the hypervisor at the same time?
bnaecker OPTE is in the host kernel.
arjenroodselaar The P4 runtime Ry described only exists in the test bed, where it simulates the switches at a high level. OPTE is part of the production environment.
arjenroodselaar The rough difference between P4 and OPTE is that P4 works on individual packets without much concept of a session (so it can't reason about TCP streams, packet order, etc., so no firewall-like functionality), while OPTE aims to operate on streams of packets.
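The stateless-vs-stateful distinction described above can be sketched in Python. This is purely illustrative (the names `stateless_forward` and `StatefulFirewall` are made up, not Oxide code): a P4-style pipeline is a pure match-action lookup on each packet, while an OPTE-style layer can remember flows and therefore make firewall-like decisions.

```python
def stateless_forward(packet, table):
    """P4-style: a pure match-action lookup with no memory of prior packets."""
    return table.get(packet["dst"], "drop")


class StatefulFirewall:
    """OPTE-style: tracks flows, so it can admit return traffic only for
    connections that were initiated from the inside."""

    def __init__(self):
        self.flows = set()

    def outbound(self, src, dst):
        # Record the flow so its return traffic is recognized later.
        self.flows.add((src, dst))

    def inbound(self, src, dst):
        # Allow only if this is return traffic for a known flow.
        return (dst, src) in self.flows
```

The key point is that `StatefulFirewall.inbound` cannot be expressed as a fixed per-packet table lookup alone; it depends on what earlier packets did.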
JustinAzoff So you can run 100 VMs on a test system and wire them up to your virtual switch compiled by x4c?
rng_drizzt The Sidecar switch is actually just a PCIe peripheral to a Gimlet.
bnaecker The Gimlet managing the Sidecar is often called a "Scrimlet" for "Sidecar attached Gimlet"
Riking and "how do i reconfigure this giant network without hosing my ability to reconfigure this giant network"
ShaunO can identify with that - we seriously struggle to keep our own products inter-operating, let alone anyone else's
levon It can feel like a Sisyphean task.
a172 Set up a much smaller/simpler network in parallel that is accessible from "not your network" that gets you to the management interface.
levon It's a whole new world when you can look at the actual table definitions in P4
rng_drizzt Owning all the layers here is immensely beneficial
levon Those DTrace probes have been very helpful
bnaecker Those probes turned out to be everywhere. They are in: SQL queries, HTTP queries, log messages, Propolis hypervisor state, the virtual storage system, networking protocol messages, the P4 emulator, and probably more that I'm forgetting about.
a172 it astonishes me how many "cloud" type architectures are built on v4 only or v4 first.
a172 IPv6 is older than Wi-Fi
a172 It solves real problems. PLEASE use it.
nyanotech yessss finally someone realizes broadcast domains are also failure domains
JustinAzoff the worst part of v6 is trying to run dual stack v4+v6, v6 only networks are fairly simple
levon And the bigger the broadcast domain, the more irritating it is to troubleshoot it
bcantrill "Hash and pray"
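"Hash and pray" presumably refers to ECMP-style load balancing, where a switch hashes a flow's identifying fields onto one of several equal-cost paths and hopes the traffic spreads evenly. A minimal sketch of the idea (the hash choice and field names are assumptions for illustration):

```python
import zlib


def ecmp_pick(flow, paths):
    """Hash a flow identifier (e.g. a 5-tuple) onto one of the available
    equal-cost paths. Every packet of a flow hashes the same way, so the
    flow is pinned to one path -- for better or worse."""
    h = zlib.crc32(repr(flow).encode())
    return paths[h % len(paths)]
```

Because the mapping is deterministic per flow but effectively opaque, a few unlucky elephant flows can pile onto one path while the others sit idle, hence the "pray".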
arjenroodselaar FWIW while DDM is a cool thing we're building, one of the "simple" tasks Tofino does for us is NAT between the networks of our customers and their VPC networks they implement on our platform.
arjenroodselaar Simple NAT is still surprisingly expensive and being able to do that at line rate is pretty nice.
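Conceptually, the NAT the switch does is a table-driven rewrite of internal (VPC) addresses to external ones. A toy sketch of that mapping, with made-up addresses (this is not how the actual Tofino program is structured):

```python
# Hypothetical mapping: (internal ip, internal port) -> (external ip, external port)
NAT_TABLE = {
    ("172.30.0.5", 4000): ("203.0.113.10", 61000),
}


def nat_outbound(src_ip, src_port, table=NAT_TABLE):
    """Rewrite an internal (ip, port) pair to its external mapping, or
    return None if there is no entry (the packet would not be translated)."""
    return table.get((src_ip, src_port))
```

Doing this in software means touching every packet on the CPU; doing it in the switch ASIC means the rewrite happens at line rate as part of the forwarding pipeline.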
Riking TCP retransmits in steady state seems like an obvious observation point?
arjenroodselaar Yes, you see TCP retransmits.
arjenroodselaar But if you're running, say, Memcache over UDP and you get a sudden burst of incoming data as a result of a large number of cache queries, you drop those packets (because the buffers can't keep up) and you see cache request timeouts.
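The failure mode described above is simple tail drop: a fixed-capacity buffer absorbs a burst until it fills, then silently discards the overflow, which the application only sees later as timeouts. A minimal sketch (hypothetical, just to show the mechanism):

```python
from collections import deque


class PortBuffer:
    """Fixed-capacity egress buffer: once a burst exceeds capacity,
    further packets are tail-dropped and counted."""

    def __init__(self, capacity):
        self.q = deque()
        self.capacity = capacity
        self.drops = 0

    def enqueue(self, pkt):
        if len(self.q) >= self.capacity:
            self.drops += 1  # the sender gets no signal; UDP just loses it
            return False
        self.q.append(pkt)
        return True
```

With TCP the retransmits make the loss visible; with UDP the only symptom is the request that never came back.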
arjenroodselaar FB did some work on this about 10 years ago to avoid these ingest bursts and dropped packets, which hurt your p99 latency.
Riking yeah smartnic is pushing the intelligence to the machine
levon I know someone who basically polled all of the switches for buffer drops in an attempt to divine which paths were dropping packets due to micro-congestion
admchl I feel like I'm in a secret society meeting learning The Hidden Truth behind Reality of The Network
wmf I would argue if the entire hypervisor is on the smart NIC then you're no worse off than the Oxide architecture
a172 I once stumbled on a bug where the vendor's custom protocol for monitoring (because snmp/syslog just can't keep up) had a trace log on the process that could not be turned off. Some sort of race condition enabled it, and it happened on 1/3 of system boots. It was ~20k logs/s, iirc.
a172 (I'm going to look up those numbers)
levon I haven't worked with a SmartNIC fast enough to do this well
JustinAzoff We use an FPGA NIC in our products for fast packet capturing. The service that bootstraps it had an issue that caused it to log an error... for every single packet...
JustinAzoff that managed to log the same error something like 250,000 times a second
arjenroodselaar The problem with SmartNICs is that their power features are way less advanced than the power scaling that x86 CPUs do. So you either run them or you don't, and they come with a 50-75W penalty. Unless you can really get useful work done for that 50W budget, an x86 CPU is much more flexible.
arjenroodselaar What we really want is an AMD Epyc SoC with some amount of FPGA fabric. That would let you build whatever makes sense there while still having much of the flexibility with respect to how/where you consume power.
a172 It was enough to mess us up. 250k would have killed us even faster.
JustinAzoff Yeah, it happily wrote that error message until the multi TB data array filled up. We reworked how log rate limiting and log rotation worked after that
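One common shape for the kind of log rate limiting mentioned above is a token bucket per message key: allow a bounded rate through and count what gets suppressed. A minimal sketch (an assumption about the approach, not the actual rework described):

```python
import time


class LogRateLimiter:
    """Token-bucket limiter: allow at most `rate` messages per second,
    dropping (and counting) the rest. `now` is injectable for testing."""

    def __init__(self, rate, now=time.monotonic):
        self.rate = rate
        self.now = now
        self.tokens = rate  # start with a full bucket
        self.last = now()
        self.dropped = 0

    def allow(self):
        t = self.now()
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens = min(self.rate, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1
        return False
```

A repeated error then costs a few log lines per second plus a suppression counter, instead of 250,000 identical lines per second filling the array.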
a172 I was mostly amused that the process that existed because snmp/syslog couldn't keep up was getting a syslog for every iteration of a loop in the process
a172 of course, if you are sending a packet for every packet you send, that sounds like it quickly becomes an exponential problem.
JustinAzoff and to circle back around, this was code inside of the vendor SDK, that is not open source, that we couldn't fix ourselves. it's one of the only components of our system that we don't control. i wish we had our own NIC (that would probably run something like p4)
levon And thus, this is how we become the way we are (at Oxide)
a172 ours was on production network hardware (wireless controller). There is no hope of having source or any true observability into it. (edit: saying there was no insight is a little harsh)
If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!