Oxide hosts a weekly Discord show where we discuss a wide range of topics: computer history, startups, Oxide hardware bringup, and other topics du jour. These are the recordings in podcast form.
Join us live (usually Mondays at 5pm PT) https://discord.gg/gcQxNHAKCB
Subscribe to our calendar: https://calendar.google.com/calendar/ical/c_318925f4185aa71c4524d0d6127f31058c9e21f29f017d48a0fca6f564969cd0%40group.calendar.google.com/public/basic.ics
Okay.
Adam Leventhal:How do we know if we can be heard?
Bryan Cantrill:We don't. I don't know that we can. Do you know if we can be heard?
Adam Leventhal:Alright. People who are here, can you hear us? You can be heard. You can be heard.
Bryan Cantrill:Alright. Excellent. There's Finch. Is now the time to talk about the very strange pathology that we hit? We did a sound check.
Adam Leventhal:We did. Multiple times.
Bryan Cantrill:Okay. And I described for you this very strange pathology where I could not be heard. Finch was on, and we had a helpful member of the community on, and I could not be heard.
Bryan Cantrill:And this
Adam Leventhal:was true also three weeks ago when we did our last episode.
Bryan Cantrill:This was true three weeks ago too. Yeah. I mean, as if Discord knows that we are about to do an episode about distributed systems, we had the strangest split brain, where we came back on, and, Adam, you and I could hear one another. And Finch, you and our member of the community could hear one another, but our two groups could not hear one another, which was very strange.
Adam Leventhal:Right. Well, then it got odder, which was we restarted it and you, Finch, and the community member could all hear each other, and you could hear me, but the community member could not hear me. But then I turned on my video and they could see me. And that
Bryan Cantrill:was Simon. Yeah. So Simon could see you, but could not hear you. And then we restarted it all and it all worked.
Bryan Cantrill:It's very frustrating. And look, I know we get criticized for having audio problems around here, but this is like a Byzantine generals Discord call.
Finch Foner:We should have just had another person up and then we would have had an odd number.
Bryan Cantrill:There we go. Exactly. But we are not here to discuss audio problems. We are here to discuss distributed systems, and in particular the distributed system that we ship in the Oxide rack. Actually, you know what? Hold on.
Bryan Cantrill:I'm realizing that we're really rushing into the topic. I feel like there is something I did wanna just say very briefly about your UConn
Adam Leventhal:Huskies. I was like, you're not doing Huskies? Because you also know that they're Andrew's UConn Huskies.
Finch Foner:They are Andrew's.
Adam Leventhal:A fellow nutmegger.
Bryan Cantrill:A fellow nutmegger.
Adam Leventhal:Yeah. Exactly.
Andrew Stone:And my sister went to UConn, but even before she went, I was rooting for the Huskies. So there you go.
Bryan Cantrill:Yeah. Okay. Well, this is good to know because every year my sister puts together an extended family March bracket. And every year, I pick the schools that are either Oxide alma maters or spiritually close to Oxide, in Andrew's case as it turns out. Listen, if I could put the Hartford Whalers in my bracket I would, but I can't.
Bryan Cantrill:And so I had I had UConn going deep and it was a great day for Connecticut, Andrew.
Andrew Stone:Yeah. And Michigan, which is where I went to school. So, yeah, that's a double win yesterday. It was a good day.
Adam Leventhal:What a day.
Bryan Cantrill:Yeah. A very good day. And I'd like to say, hey, Andrew, I've got Michigan as well in my bracket. I don't think I'm going to confide who I have winning it all, because it doesn't feel like that's in my self interest right now.
Adam Leventhal:Oh, that seems right. That seems like that's just gonna make enemies. That's just gonna make enemies.
Bryan Cantrill:And I'll tell you who it's not gonna make enemies of: Purdue, who I also had going to the Final Four and who could not get the job done. But yeah. Okay. So we've got that out of our system. That really was wild.
Bryan Cantrill:Dude, that was
Adam Leventhal:So wild. Moments before it went in, before the last shot, I told my son, because we have a foster dog at home, don't lose your mind if they win. And of course, I lost 100% of my mind. All of my mind. It was so great.
Adam Leventhal:Still barking at me.
Bryan Cantrill:The dog is still barking. Exactly. And I always do love when you're giving an admonishment to your children that is clearly actually you giving it to yourself.
Adam Leventhal:That's right. Yeah.
Bryan Cantrill:That's right. That is really great. Well, really exciting moment. Andrew, great for you as well.
Andrew Stone:Yeah. I also had a cat bolt from the room as I went nuts.
Bryan Cantrill:And so Andrew, I guess after you were a Husky fan... I feel like this is one of the first things that you looked at coming to Oxide, namely this problem. So just to set the table, we've got a distributed system that we ship with the Oxide rack, the control plane. And we need to establish somehow mutual trust within this rack. I mean, that was kind of the problem that you walked up to when you joined the company, right? Andrew, this is now years and years and years ago.
Andrew Stone:Yeah, I had no idea, obviously, what I was gonna work on when I joined, but I feel like, although I can find no evidence of this, that I read the words Trust Quorum before I joined and asked somebody about it, but it must have been verbally and not over email. Because I did go back through all my Oxide emails, and the earliest reference I can find was a meeting with Sean that I had on 08/03/2021 to discuss Trust Quorum and what was happening with it. And so that was less than a month after I'd joined the company.
Bryan Cantrill:And so what was the subject of that discussion with Sean circa 2021? And this is like two years before we would ship a rack. So everything's still very much in development at that time.
Andrew Stone:Yeah. So here's the thing. I have no idea because that meeting was not recorded and there are no notes. But then there was a meeting after that, which did have some notes. 09/17/2021 with Keith, Laura, Bryan, Rick, Phil, and Robert.
Andrew Stone:We started discussing what the Trust Quorum was actually going to be and what it meant to be able to access things. How we were gonna actually talk to the ROT, I think, was the big thing there. Initially I didn't know what it meant to establish trust across all the sleds, but what we knew we needed to do was store keys on the ROT and have some way to get them into the system. I don't know if IPCC was a thing yet. I think that Keith had already written about it, the IPCC channel, using the UARTs to talk from the host processor to the service processor, which then talks over the SPI channel to the ROT, but we didn't have a protocol or anything like that. And so we started talking about that.
Andrew Stone:And then a couple months later, it looks like I had started implementing SPDM, and we didn't know at that point even how sleds were going to discover each other. So we had this bootstrapping problem where, okay, the sleds have to talk to each other so they can establish trust in some manner, but we don't even know how they're gonna find each other. And so we started scoping out schemes, and I think Ry had just joined the company at that point, so we didn't have DDM, which is delay driven multipath. And I can talk a little bit about that, but yeah, I joined and was looking at a bunch of things, fault management and stuff, and kind of dove into this because it seemed like an appropriate problem. I'm interested in security and I've worked on consensus algorithms and stuff in the past.
Andrew Stone:Yeah, that's how I took a look.
Bryan Cantrill:All right. So I feel like I'm running after you with a glossary here just to catch people up on some of the acronyms you described. So first of all, you talk about the ROT, the root of trust. We have had Laura talking about vulnerabilities in the NXP part that we use.
Bryan Cantrill:This is the LPC55S69 that we use, and I think we did an episode on the vulnerabilities, for sure, years ago. But I don't think, Andrew, we've spoken too much about why we have a root of trust and how we actually plumb that up the system.
Bryan Cantrill:So do you wanna just describe a little bit what the root of trust is? What IPCC is? What SPDM is? Sorry to make you define terms.
Andrew Stone:Yes. I could try. I am probably not the best person to give a broad overview. I would think that Phil and Laura would do a better job than me, but the basics of it are that we have a root of trust, which is a secure piece of silicon, secure in quotes, that serves to maintain secrets that cannot be extracted from it.
Andrew Stone:So if you think about a YubiKey, it has a private key that's stored on there and never leaves. And then it can sign things and authenticate when interacting with external services. Our root of trust acts like that for an individual sled in the rack. It's capable of storing private information. It has a physically unclonable function, a PUF identifier, which is literally silicon impurities and an algorithm that give it a unique unforgeable identity.
Andrew Stone:And so from there, you can kind of build up this whole unique unforgeable bit of trust. And if you inject signed certificates into that root of trust, the root of trust can act to set up TLS channels or things that require authentication and encryption. The way we do that at manufacturing time is that we have a manufacturing root cert, and then we have an intermediate cert that is in our online key signing service. The root of trust generates a key pair, private RSA 96 bit keys, I think, and it generates a certificate signing request that goes out over some channel during manufacturing time, and it gets signed by our intermediate key signing service. And so the private key never leaves the root of trust chip.
Andrew Stone:And then the certificate gets injected back in and saved. At that point, it has bootstrapped trust, and we know that this is authentic Oxide hardware with a unique identifier, right? The certificate has the serial number of the baseboard, which is our motherboard, embedded in there along with the public key that is tied to the private key of that root of trust that never leaves. And so we have this unique device that can be used to do security things.
Bryan Cantrill:And this all of that apparatus that you described, which is gonna merit its own episode because there's so much that we have to develop. We had to develop all that apparatus for that before we shipped the first rack.
Andrew Stone:Oh, and we had none of this when I joined. In 2021, we had the ROT chip, right? Laura had already found vulnerabilities in it. But we didn't have any of this manufacturing software.
Bryan Cantrill:Only half of the vulnerabilities that you
Andrew Stone:would Yes, yeah.
Bryan Cantrill:Right.
Andrew Stone:It's crazy. Part of it is, well, what don't we know here? We knew we needed to use the ROT chip to establish secure channels between sleds. So what is that protocol going to be? One of the protocols we started looking at was a protocol called SPDM, which I do not remember what it stands for, but it is essentially TLS plus attestations, remote attestations, with the same key being used for both.
Andrew Stone:It's an open protocol by the DMTF, which I also don't remember what that stands for, but the gist of it is it's a protocol that's designed to work on embedded hardware and to establish secure channels over various transport protocols. And so there was no Rust implementation. That's not entirely true. There was an Intel Rust implementation, but it was a std-based one. And we initially thought we might want to have the ROT participate fully in this interactive sled-to-sled protocol, setting up these secure channels.
Andrew Stone:And so we thought we needed some no_std software that would run on the ROT and the SP. So we'd have a component there that would be running. It turns out in the long run we changed strategies, but I spent a long time looking at this SPDM protocol, trying to implement a no_std implementation, got pretty far, had bits of it running on the ROT, and then we went and kinda threw it away. But the gist of it is that we wanted to have some sort of secure channel that was both protected, authenticated and confidential, and that attested to the remote software running on each side in a mutually attested way.
Andrew Stone:So we knew because of the certificates that this was signed by us, and so it's authentic Oxide hardware. And then we wanted the ROT to actually assure that the software running on the whole device, the sled itself, is also authentic Oxide software. And we do that by exchanging these measurements that could be remotely attested and verified by the other sled before any secret information was shared between sleds. So they would trust each other at that point.
Bryan Cantrill:Right. So that is a whole lot of apparatus. Then why did we move away from SPDM? What was the rationale? I don't want to put words in your mouth, but I remember it being like, this is costing us more than it's buying us. What was the thinking?
Andrew Stone:Yep. It's a fine protocol. It really is a fine protocol, but I kept wanting to modify the protocol slightly. I realized, after I was well down this path, that no, we actually didn't need a no_std protocol. I was like, oh, well, maybe I can just use the implementation from Intel, which I think had some no_std bits, but I'm not sure. There were other things going on there with it that I had to change, and they accepted PRs, so it was great. It was a great interaction actually with the folks over there.
Andrew Stone:But I realized that the timing of things just didn't work out. Meaning, we would sign the attestations with a different key than we would use to set up the remote channels. And so we actually wanted two keys. And I was trying to squeeze that down to one key, because SPDM only supports doing both of those things with a single key. The channel setup happens that way.
Andrew Stone:And so I ended up writing something very similar. SPDM is remarkably similar to TLS 1.3. It's nearly identical in certain things, except it has these attestations layered into the initial handshake. And so the transcript that records the handshake includes the measurements. So when you sign that, you're signing the measurements as well. And I realized, okay, I'm building all this out.
Andrew Stone:And then I ended up basically reimplementing a separate protocol that had two separate keys. At that point, I was like, do these things really need to be coupled? Can we just use TLS? And I remember Josh being like, yeah, dummy. He didn't say it that way, but that was implied when I talked to him.
Andrew Stone:And he's like, yeah, I fully agree. Just use TLS and get the gains from using rustls, and then we can layer the attestations over it with the separate keys. So we have this authenticated encrypted channel, and before we actually let the user, which is our sled agents, use this channel, before we release it to them and say it's fully established, we'll do the remote attestations. And so now it's kind of a two-phase protocol.
Andrew Stone:But we moved away from SPDM just because it didn't quite fit. And it was just a tedious protocol to work with also.
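As a rough illustration of the two-phase idea Andrew describes (TLS first, then remote attestation before the channel is released to its user), here is a minimal Python sketch of the state machine. The class and method names are hypothetical and invented for this example; nothing here reflects the actual Oxide implementation, and the TLS handshake and attestation check themselves are stand-ins.

```python
from enum import Enum, auto


class ChannelState(Enum):
    HANDSHAKING = auto()  # phase 1: TLS handshake still in progress
    ENCRYPTED = auto()    # TLS is up, but the peer has not been attested yet
    ESTABLISHED = auto()  # phase 2 passed: channel released to the application
    FAILED = auto()       # attestation was rejected


class TwoPhaseChannel:
    """Hypothetical wrapper that withholds a channel until attestation succeeds."""

    def __init__(self):
        self.state = ChannelState.HANDSHAKING

    def tls_handshake_done(self):
        # Phase 1 complete: the channel is authenticated and encrypted.
        if self.state is not ChannelState.HANDSHAKING:
            raise RuntimeError("unexpected TLS completion")
        self.state = ChannelState.ENCRYPTED

    def attestation_result(self, peer_ok):
        # Phase 2: attestation only makes sense once TLS is established.
        if self.state is not ChannelState.ENCRYPTED:
            raise RuntimeError("attestation before TLS is established")
        self.state = ChannelState.ESTABLISHED if peer_ok else ChannelState.FAILED

    def send(self, payload):
        # Application traffic is refused until both phases have completed.
        if self.state is not ChannelState.ESTABLISHED:
            raise PermissionError("channel not released to the application")
        return payload  # stand-in for writing to the real socket
```

The point of the design is that the caller never sees a channel in the `ENCRYPTED` state: an encrypted but unattested peer is treated the same as no peer at all.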
Bryan Cantrill:Well, and I feel like this is something that has happened to us quite a bit, where we go in wanting to use existing standards where they exist, and existing components. We actually are not trying to reinvent everything on our own all the time, even though I know it sometimes can feel that way. We often have gone down a path and then realized, okay, the value of using a standard, you begin to chip away at that when you have to do different things to actually get it to work for us, and this is just not actually buying us that much.
Adam Leventhal:I mean, Tock in particular in this domain feels like a good example, where for this root of trust and for the service processor, we were looking at, maybe not off the shelf is right, but an existing project that felt well built for this use case. And then it just turned out that by the time we augmented it into what we wanted it to be, we had built our own thing.
Bryan Cantrill:We built our own thing. So you might as well go build. Yeah. And not just Tock, I think, over and over again, but certainly that as well, where we end up going down our own path.
Bryan Cantrill:And you're kind of in the worst of all worlds when you're trying to be a standard but stretching the standard.
Finch Foner:Yes.
Bryan Cantrill:Because now you're actually not able to do what you would wanna be able to go do. You're also not exactly the standard. So you can't even like it's not like you're gonna incorporate
Adam Leventhal:And the ostensible benefits of being standard are minimal. Right? SPDM for this thing...
Andrew Stone:That is inclusive. It doesn't matter at all. We're talking to ourselves.
Adam Leventhal:Exactly. It's all us talking to us. Yeah. The fact that we happen to be speaking this flavor of Esperanto is not benefiting anyone.
Bryan Cantrill:That's right.
Andrew Stone:And I find it funny actually, because we went from a standard to me doing my own thing. And actually, this is how we brought John into the company. He started working on this Sprockets protocol with me and we got a really cool implementation of TLS, essentially. And then we went back to the standard and just used TLS, right? Then we layer attestations on top.
Andrew Stone:And for that, we're also using a standard. So we want a standard, realize it's wrong, try to do our own thing, realize that there's a better standard we could use, and layer two standards on top of each other. And so we're using this CoRIM functionality, which is another acronym that I don't know what it stands for, but it's essentially a manifest or a document that is a way of encoding measurements for remote attestation. It's a standardized form. I didn't work on that; Laura and Phil worked on that.
Andrew Stone:But that is the attestation format that we send over the wire and use. And I don't know if that's the format of the measurements. I mean, they're really just SHA-256 hashes that get sent over the wire, and then we associate them with this CoRIM document, which says which measurements are acceptable for the software that we wanna trust.
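To make the "measurements are SHA-256 hashes checked against a document of acceptable values" idea concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the function names and sample images are hypothetical, and a plain Python set stands in for the reference document. It does not model the real CoRIM encoding or any signature checks.

```python
import hashlib


def measure(image):
    """Return the SHA-256 digest (hex) of a software image, i.e. its measurement."""
    return hashlib.sha256(image).hexdigest()


def measurements_acceptable(reported, reference_values):
    """Accept only a non-empty report whose every measurement is a known-good value."""
    reported = list(reported)
    return len(reported) > 0 and all(m in reference_values for m in reported)


# Hypothetical known-good images; a real reference set would come from a signed manifest.
good_images = [b"bootloader v1", b"host os v1"]
reference_values = {measure(img) for img in good_images}
```

A verifier in this sketch would call `measurements_acceptable(peer_report, reference_values)` and refuse to share anything secret if it returns `False`. Rejecting an empty report is a deliberate choice here, so that a peer cannot pass by reporting nothing.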
Adam Leventhal:And Andrew, after now that we've done the glossary, which has its own appendix glossary, so I would just say popping even farther out. The whole point here is how do customers ensure that the rack is running the software that they wanna run and not some malicious software or some injected software? How can they trust, again, down to the root that all of this software is what they intend? And obviously, the the mechanics you're describing is what gets us there. But that's the that's the core problem and one that the industry has not got its own standard for, I don't think.
Andrew Stone:Yeah. So this is where I feel like I wasted a lot of time. I learned a lot, but I didn't know any of this stuff coming to Oxide. Right? I was just interested in it.
Andrew Stone:And I learned that this is only a subset of the problem. We haven't even talked about Trust Quorum. We just talked about the secure channels that the Trust Quorum protocol is going to operate over. This is just the baseline to get us some secure channels.
Andrew Stone:And so why do we want Trust Quorum in the first place, and what is it? We've got a couple problems we're trying to solve, right? What Adam said, which is we want each sled to be able to trust that it's running the right hardware and software. But before that, why is that the core problem? I think another core problem that leads to this is that we want encrypted storage at rest.
Andrew Stone:But we are also building a rack-scale system that's gonna run potentially in remote data centers without operators being there 24/7. And the rack can reboot due to a power outage, right, or some other reason. And if the rack reboots, the storage is encrypted, but there's not gonna be a human there connected to the tech port with a laptop to type in a password.
Andrew Stone:And so how do we decrypt the drives to boot our control plane software and bring everything back online? That was kind of the introduction, the basis of the protocol. I don't know who came up with this first. I know I talked to Rick about it first, one of our former employees at Oxide, a security, embedded, everything engineer. And Robert mentioned that you guys had done something similar at Joyent, but the suggestion was, hey, do something with Shamir secret sharing.
Andrew Stone:And so we have two types of drives, and the drives that have encrypted ZFS datasets on them are these U.2 form factor drives, and that's what stores all the customer data and the Oxide control plane data, the vast majority of it at least. And those are drives that are
Bryan Cantrill:When you're looking at an Oxide rack, the drives that you are seeing out the front are those U.2 drives? Just to orient folks.
Andrew Stone:Correct. Yep. I mean, obviously correct, Bryan. And then Quick question. I'm not telling you that.
Andrew Stone:And then there are some internal drives, these M.2 form factor drives. Those store local metadata for the sleds, and they also store the installable software. So, mostly non-private things, at least up until Trust Quorum, but sled-local small data. And so we could use that sled-local data to manage a protocol over some transport channels that uses Shamir secret sharing to reconstruct a rack secret, which we can use to derive disk encryption keys and decrypt those U.2 drives. And so what are we protecting against here?
Andrew Stone:Yeah. Go ahead, Bryan.
Bryan Cantrill:Well, I was just saying, no, that's exactly it. What is the threat model here? What are we protecting against? And what's the value of basically having to turn on this distributed system and attest yourself into this trust quorum before you can see these drives? What are we protecting against with that?
Andrew Stone:Yeah. So we don't have a password. Right? And so at this point it's like, well, where do you store it? Let's say you just wanted to store an encryption key on the M.2s.
Andrew Stone:Right? The M.2s are internal to the sleds. They are field replaceable, but you have to take the sled out.
Andrew Stone:But, okay, somebody can steal a drive. And if you had a hard-coded secret on the M.2, somebody could steal a drive and not get any encrypted data off it, but they could just steal the whole sled. A sled is big, but it's not enormously big. And so we wanna protect against that. We wanna protect against what we call casual physical access.
Andrew Stone:And so we don't want anybody being able to like walk into the data center, pull a bunch of drives or a couple of sleds and walk away with them and be able to get any information, any customer information off those drives. And so importantly, what's not in our threat model is somebody stealing an entire rack. If they steal an entire rack and can boot it, they can access all the data, right? They have access. And so if you have five minutes in a data center, you could steal a few things, but you're not gonna steal a three ton rack.
Bryan Cantrill:Yeah. At some point, we are looking for some things from our data center operators. Do prevent someone from walking out with the entire rack.
Adam Leventhal:Yeah. Just check their clothing, their jacket, to see if they've got an entire 2,000 pound rack under it.
Andrew Stone:That's right. Yeah, exactly. And so we have thought about actually making it so that you can't steal a whole rack, but that is not currently part of the protocol. I don't know if it'll ever be, but we have had customers ask questions like, what if somebody steals the rack? I was like, I don't know, what if you hire somebody with guns?
Andrew Stone:You know, that could help. Pay them a good salary. But yeah, so that's the key thing we're protecting against there. And so then one of the constraints is, well, you wanna do that, sure, but you can't type a password. And so we decided to go with Shamir secret sharing.
Andrew Stone:And so we do have these sled-local drives, and we can split up a rack secret such that a piece of that secret, what's called a key share, is stored on each sled's local drives. So there's a different part of the secret on each one. And then a bunch of the sleds have to cooperate and exchange those key shares to recompute a shared rack secret so they can derive their disk encryption keys. That allows the sled to autonomously cold boot without an operator present and still protect against these casual physical attacks. So that's where we were going with this solution.
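The conversation says only that disk encryption keys are "derived" from the rack secret, without naming a key derivation function. As a sketch under that caveat, the example below uses HKDF with SHA-256 (RFC 5869), built from Python's standard `hmac` module. The choice of HKDF, the salt, and the `disk-encryption:` label are assumptions made for illustration, not details from the discussion.

```python
import hashlib
import hmac


def hkdf_sha256(ikm, salt, info, length=32):
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it to `length` bytes."""
    if length > 255 * 32:
        raise ValueError("requested output is too long for HKDF-SHA256")
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract step
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:  # expand step: T(i) = HMAC(PRK, T(i-1) | info | i)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def disk_key(rack_secret, disk_id):
    """Derive a per-disk key; the salt and info label are hypothetical."""
    info = b"disk-encryption:" + disk_id.encode()
    return hkdf_sha256(rack_secret, salt=b"example-rack-salt", info=info)
```

Binding the disk identifier into `info` gives every disk its own key while the only long-term secret is the rack secret, so no per-disk key ever needs to be stored.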
Bryan Cantrill:And Andrew, I think it's RFD 238 that talks about this.
Andrew Stone:238. Yes.
Bryan Cantrill:Which we've made public, so folks can check that out. And so could you just talk about Shamir secret sharing and how that works? I don't know if you wanna get into the math at all, but how does it work to have a shard of the secret throughout the rack?
Andrew Stone:Yeah. So Shamir secret sharing operates via what's called polynomial interpolation. It's kind of an extension of a one-time pad. The way it's largely done, and the way we do it, is we generate a bunch of random points on a polynomial. So if you remember from eighth grade math or whatever, X squared plus X plus one is a polynomial of degree two, because the square is the highest power. If you've got X cubed, it's degree three, whatever.
Andrew Stone:So you have this polynomial, disregarding the details and just saying a polynomial over normal numbers right now, and you can pick points on that polynomial. And if it's a degree two polynomial, you need, I think, three points. So it's the degree plus one. It might be minus one, I can't recall right now. I think it's plus one. That's what you need to reconstruct the rack secret.
Andrew Stone:So if you can get, say, three points on the polynomial, you can reconstruct any other point along the polynomial. And so what you do is you define the secret as the point at X equals zero on some randomly defined polynomial. And then everybody gets a piece of that polynomial, a different point each, and the sleds can exchange their points and gather enough points to recompute the secret, which is the point at X equals zero, where the Y value is the secret. And that's a 32 byte secret for us. So at that point, you gotta ask the question, well, what prevents people from asking, hey, can I get your key shares?
Andrew Stone:Without having a key share, right? And so that's where that whole discussion of the trusted attested channels comes in. A sled won't actually return a key share to just anybody who's asking, in the real Trust Quorum. We'll get to the earlier versions. It won't return a key share until that authenticated channel has been set up. Then the request for it comes over that channel, and then the shares can be returned. And so that allows the sleds to mutually trust each other in a point-to-point fashion and gather enough key shares.
Andrew Stone:And so importantly, if one sled is corrupted, it can ask for shares. But even if you have a handful of them, they wouldn't be able to gather enough shares. If everybody else is operating properly and you have a few sleds that are corrupted, they won't be able to convince enough sleds to hand over enough shares to reconstruct the rack secret. And so there's sort of this online component as well. Now, if you could steal a whole rack or all the M.2s, you could just look at the shares and recompute the secret, right?
Andrew Stone:So that's not within our threat model.
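The polynomial scheme described above can be sketched in a few lines of Python. This is a teaching sketch, not the implementation discussed in the episode: it works over a large prime field (the Mersenne prime 2^521 - 1, chosen here only because it comfortably exceeds a 32-byte secret), whereas production libraries often work byte-wise over GF(256). The function names are hypothetical. It also confirms the count Andrew was unsure about: a polynomial of degree two needs three points, the degree plus one.

```python
import secrets

PRIME = 2**521 - 1  # a Mersenne prime, larger than any 32-byte secret


def split(secret, threshold, n):
    """Split `secret` (bytes) into n shares; any `threshold` of them recover it."""
    if not 1 <= threshold <= n:
        raise ValueError("need 1 <= threshold <= n")
    # The secret is the constant term, i.e. the polynomial's value at x = 0.
    coeffs = [int.from_bytes(secret, "big")]
    coeffs += [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule, modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def interpolate_at_zero(shares):
    """Lagrange-interpolate the polynomial through `shares` and return its value at 0."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


def reconstruct(shares, length=32):
    """Recover the secret bytes from at least `threshold` distinct shares."""
    return interpolate_at_zero(shares).to_bytes(length, "big")
```

For example, `split(secret, threshold=3, n=5)` uses a degree-two polynomial, and any three of the five shares passed to `reconstruct` return the original 32 bytes. With only two shares, the interpolation still produces a number, but it is unrelated to the secret, which is the property that keeps a handful of stolen sleds from being useful.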
Bryan Cantrill:Okay. So this is obviously kind of a huge engineering project. There's a lot of hardware and software that needs to come together for this. And you made reference to this with the real Trust Quorum. How do you bootstrap this whole thing just from an engineering perspective?
Andrew Stone:Yeah. So first, after realizing that we actually have to build a protocol on top of this thing, it's already August 2022, and we're supposed to ship the rack maybe in September at that point. I don't think we were quite ready, but we're like, well, we're not gonna have a Trust Quorum when we ship the rack. And then we realized, well, here's a problem. If we want any encrypted storage, we need to encrypt the ZFS datasets now, because you can't go from an unencrypted dataset to an encrypted dataset in ZFS. You would just have to copy all the data.
Andrew Stone:And that is something we did not wanna do. And so we started thinking about, well, how the hell can we do this without the protocol? And we're sitting in a meeting and we're like, well, we can hard code it. We could just have local keys derived from VPD data, but that's also sled-level data. So you steal one sled and now you can compute the secret, right?
Andrew Stone:That does not solve the problem. And so, well, what if we could do a low-rent version? I remember sitting in this meeting and proposing it. The encrypted channels weren't ready. We didn't have the manufacturing software written to even sign things yet. It was still under wraps, with tons of work ongoing across the stack by multiple people.
Andrew Stone:It's like, okay, we cannot set up the secure channels we wanna set up. That's problem one. We don't actually have a protocol to allow us to securely rotate keys, rotate rack secrets, and tolerate failures. Like, I don't know how to do that. Like, that's a protocol I wanna work on.
Andrew Stone:Don't know how to do it right now. It's like, what if we just didn't do that? But we did the most basic thing possible, which is we generated a rack secret at rack setup time and distributed the key shares split from that secret over plain TCP channels to individual sleds that would then save them to their M.2s. And so those sleds would just boot up and be able to operate over the plain TCP channels.
Andrew Stone:So we don't have that online trust. We don't have any of that attestation. But we do have the encryption at rest, protected against casual physical access. You still won't get anything if you steal a sled or a few drives or a few sleds even, because you just won't have those additional shares. Now, if you could eavesdrop on the bootstrap network, which is like an active physical attack, which we're not protecting against currently, right, then, like, sure, you could get that stuff.
Andrew Stone:But importantly, this distribution of shares happens over the bootstrap network, and that network is not exposed outside of the rack. It's one of our internal rack networks. And so, yeah, we're just like, all right, we won't have any trust, and, like, anybody who can get on the bootstrap network can ask for a share, and we won't have any rotation. But we will have a mechanism to protect against casual physical access. And so now we have this one-time rack secret setup and every sled has key shares on it.
Andrew Stone:And we have this kind of fixed mechanism for decrypting. It's like a password you can never change. And I won't get into more of that, but I'll stop there for a second.
Bryan Cantrill:Well, yeah. I mean, because this is a really gnarly engineering problem. This is a problem we had over and over again at Oxide: the minimum viable product is really, really large. And you have to find ways, because it was not an option to be like, well, we will wait to ship our first rack until we have all of Trust Quorum working. We would not have shipped.
Bryan Cantrill:And we had to ship a product. But also, and I think this is kind of the knife edge that you were on, Andrew, if we have no trust quorum whatsoever, if we effectively don't encrypt the data at rest, then the only way to go from the old world to the new world is gonna be to completely nuke the rack of all of its data. And we really did not wanna do that. So that is unacceptable. Yeah.
Bryan Cantrill:I mean, the balance we're trying to hit is how to be transparent with our earliest customers: here's what this thing is and isn't. Here's what the threat model is and isn't. And here is how we can give ourselves the foundation for our future selves. This is our future selves looking from late 2022, early 2023 forward. Our future selves will be able to bootstrap into something that is more trustworthy. I mean, that's a real knife edge to hit, and I loved your low-rent trust quorum.
Bryan Cantrill:I feel like this is what we tried to do over and over again. It's like, how can I short circuit this problem? How can I stub this problem off in a way that doesn't foreclose on my future optionality? And I think that this is a really vivid example of that, Andrew, because this is, like, a multi-year project.
Andrew Stone:Yeah. I was not gonna be able to come up with the correct, real protocol quickly. Like, I'll get into how that evolved, but I rewrote the entire protocol, RFD 238, at least three times from scratch. So it changed a lot until it ended up in its final form. And I just remember, we just had no idea what we were gonna do.
Andrew Stone:We knew we needed encryption and nobody had any ideas. Obviously I had been the one thinking about this the most. I just kind of came up with it in a meeting, and I'll never forget, Robert immediately after sends me a DM and he's like, I don't love it, but I don't hate it. And that was the biggest compliment I could have ever gotten. It was like, okay, yeah.
Andrew Stone:We're going with it. And we're like, perfect. This will allow us to ship the rack. And, like, I think I can build this in two months. Luckily I did have more time. Like, this was not the long tail of shipping the rack, but it was something that I thought was, like, very doable.
Andrew Stone:It ended up only being a few thousand lines of code.
Bryan Cantrill:And when you say this, you're talking about what we're calling low-rent trust quorum, LRTQ. You'll see this mentioned from time to time. And building that in time to ship the first rack.
Andrew Stone:Yeah. So yeah. Exactly. So, you know, there's some drawbacks. Right?
Andrew Stone:As I mentioned, like, you can only distribute shares over plain TCP channels. There's no key rotation. There's no ability to, like, expunge sleds. So I'll talk about that more with the Real Trust Quorum. There's no remote attestation, so anybody can just ask for a share.
Andrew Stone:And so those are pretty big drawbacks, but the big win is that we have encryption, and we do have this encryption at rest that is not vulnerable to this casual physical attack. And so those are huge wins for what is a very janky kind of minimum viable product. And there's one other thing here that I didn't discuss when I mentioned the low-rent trust quorum. And this was, I think, to pat myself on the back, a bit of creativity, right? Establishing a shared rack secret and sending the shares over plain TCP channels, there's nothing innovative there, right?
Andrew Stone:Like I'm sure people have done that a lot. But what I did differently was that we knew that we might have sleds that would need to be removed from the rack and we'd want to replace them. Or people would want to start with like a 16 sled rack and go to a 32 sled rack. Well, if you have a fixed rack secret, like how can you do that? And there are other ways to do it.
Andrew Stone:And I will say that I did not fully understand the math as I was doing this, but we were using an existing secret sharing library called vsss-rs, which I can also talk more about if it matters. But I realized that what we could do is generate the maximum number of key shares even though we only have at most 32 sleds in a rack. So we could generate 255 key shares and give each sled an encrypted form of a set of the shares. So, say, divide those 255 over 32, split them up evenly over 32 sleds, in an encrypted form derived from that initial rack secret. And that way, if a new sled was inserted and it needed a key share and it came in after the fact, it could say, Hey, I need a key share.
Andrew Stone:I'm starting out in this learning mode. And it will ask for it. And the sled that it asked would go ahead and grab a bunch of key shares from other sleds, recompute the rack secret, decrypt its extra shares, mark one as used, and hand it out to that sled based on its ID. And so that way you could add new sleds to the rack that could participate in the protocol.
Bryan Cantrill:I mean, I do love that. Yeah. Treating the key shares for future sleds that are going to join as something that we're gonna protect with the rack secret, it's a very nice way of bootstrapping all that.
Andrew Stone:And we do end up doing something similar to that in the real trust quorum for key rotations. But yeah, it worked. It does have a drawback though, which is that as you remove sleds, you're removing extra key shares. And so you cannot do the rack of Theseus, as we rediscovered at one point, or Alan rediscovered when trying to do it, or Alan and Angela. And the rack of Theseus is, you know, expunge every sled and add it back to the same rack to participate.
Andrew Stone:So you expunge them one at a time and you add them back and you still have all your data, right? Because data is replicated and the rack is still operating, but the sleds have different identities and different IP addresses, different control plane identities. They're still the same RoTs and everything. But we couldn't do that, because once you expunge sleds, their key shares are now gone. You have to clean slate them.
Andrew Stone:And so the extra shares aren't there. And so for that last sled you wanna insert, nobody's around that could hand out extra key shares. And so you can only get to the last sled. But, you know, in practice, you don't have to do that. That was just something we wanted to do on our dogfood rack.
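The spare-share bookkeeping described above, and why the rack of Theseus runs dry, can be sketched in Rust. This is a rough illustration under stated assumptions, not Oxide's implementation: the `Rack` type, its method names, the round-robin split of spares, and the rule that replacement sleds receive no spares of their own are all hypothetical, and the encryption of spares under the rack secret is left out entirely.

```rust
use std::collections::BTreeMap;

/// Maximum number of shares LRTQ generates, per the discussion above.
const TOTAL_SHARES: u8 = 255;

/// Hypothetical bookkeeping for one rack: which unused spare share indices
/// each original sled still holds. (Encryption of the spares is omitted.)
#[derive(Debug)]
pub struct Rack {
    /// sled id -> unused spare share indices held by that sled
    spares: BTreeMap<u32, Vec<u8>>,
    next_sled_id: u32,
}

impl Rack {
    /// Assumption: shares 1..=n go to the n initial sleds, and the
    /// remaining indices are dealt round-robin as spares.
    pub fn new(initial_sleds: u8) -> Self {
        assert!(initial_sleds > 0 && initial_sleds < TOTAL_SHARES);
        let mut spares: BTreeMap<u32, Vec<u8>> =
            (0..initial_sleds as u32).map(|id| (id, Vec::new())).collect();
        for (i, share) in ((initial_sleds + 1)..=TOTAL_SHARES).enumerate() {
            let holder = (i % initial_sleds as usize) as u32;
            spares.get_mut(&holder).unwrap().push(share);
        }
        Rack { spares, next_sled_id: initial_sleds as u32 }
    }

    /// Total number of spare shares still available in the rack.
    pub fn spare_count(&self) -> usize {
        self.spares.values().map(|pool| pool.len()).sum()
    }

    /// Remove a sled. Any spares it held leave with it.
    pub fn expunge(&mut self, sled: u32) {
        self.spares.remove(&sled);
    }

    /// Add a new sled: the first remaining holder with a spare marks one as
    /// used and hands it over. Returns (new sled id, share index), or None
    /// if nobody has a spare left. Assumption: new sleds hold no spares.
    pub fn add_sled(&mut self) -> Option<(u32, u8)> {
        let share = self.spares.values_mut().find_map(|pool| pool.pop())?;
        let id = self.next_sled_id;
        self.next_sled_id += 1;
        Some((id, share))
    }
}

/// "Rack of Theseus": expunge and replace every original sled, one at a
/// time. Returns how many replacements succeeded.
pub fn rack_of_theseus(initial_sleds: u8) -> usize {
    let mut rack = Rack::new(initial_sleds);
    let mut replaced = 0;
    for sled in 0..initial_sleds as u32 {
        rack.expunge(sled);
        if rack.add_sled().is_none() {
            break;
        }
        replaced += 1;
    }
    replaced
}
```

Under these assumptions, `rack_of_theseus(32)` returns 31: each of the first 31 replacements finds a remaining original sled with a spare, and the last one finds nobody left to hand out a share.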
Adam Leventhal:Yeah. Just to demonstrate it all working in the kind of worst case scenario.
Andrew Stone:Yep. Yep.
Bryan Cantrill:Okay. So you get low-rent trust quorum implemented in time to ship the first racks. And so then time advances here. We get those racks in the field. How do you begin to get an eye on the horizon that is the real trust quorum, the high-rent trust quorum?
Andrew Stone:So nobody cared, it turned out. Like, we have customers, very air-gapped customers, dealing with very sensitive information. They appeared to want other software besides Trust Quorum. So I just worked on a bunch of other stuff, right? Like upgrade.
Andrew Stone:Upgrade was a significant project that I worked on.
Bryan Cantrill:Ring the chime for our our, yeah, our episode with Dave. Exactly. Yeah.
Andrew Stone:Dave and others. That was one thing. But yeah. But then it came back, and it was like, okay. We do have to do this.
Andrew Stone:We have a customer that wants it. But importantly, like I want it. I want to get this thing off my plate. Like it's now been like three plus years of working on it like periodically. And we have this like somewhat gross protocol.
Andrew Stone:It has a bunch of, like, embarrassing security flaws. And so as we're advancing, it turns out that, okay, now we actually have manufacturing software. We've shipped the rack. We have actual keys. We have, like, Phil working on DICE, like, boot stuff.
Andrew Stone:DICE is another acronym whose expansion I forget, but it's essentially a way to do, like, measured boot, which is kind of the opposite of secure boot. It's what we do. We take measurements as we boot the rack, and those measurements are the remote attestations. Those are the things exchanged over these secure channels for sleds to trust each other. So, like, we will boot up to our bootstrap agent running on the host CPU.
Andrew Stone:And at that point, these sleds have to communicate with each other to trust each other. So we've got a bunch more things built, to my point. And now I really want to get into it. And so I started working on, like, okay, how can we rotate keys, and how can we do this in a secure manner? How can we go from LRTQ to trust quorum?
Andrew Stone:And so, a couple of constraints first of all. The two big problems in distributed systems are asynchrony and partial failure, right? Asynchrony meaning you have no idea how long things are gonna take, how long a message is gonna take to get from one point to another. Partial failure means, like, one of your sleds could fail or a network link could fail or whatever while everything else is still working. And so that means things can move on and continue to work while other things have failed.
Andrew Stone:And if that failure becomes resolved, like, what happens? That sled that was rebooted or offline for repair, is that permanently offline? Or can you just insert it and have it relearn the rest of things? And so you gotta build a protocol that deals with that. And so for the real trust quorum, I started basically just working out what this protocol was gonna be like and eventually ended up with the protocol you see in RFD 238, which allows you to do that same kind of trusted dealer mode where there's one node that generates the rack secret and distributes the key shares.
Andrew Stone:Now it's distributing them over these secure channels. But importantly, it allows failures both during the initial distribution and during key rotation. So, like, sleds can be offline, multiple reconfigurations can happen, nodes come and go, rack secrets get rotated, and then that sled can come back online and catch up to where it left off as long as it's a member in the trust quorum. So it tolerates all these faults, and you just have this ability to, like, robustly have sleds be able to learn the rack secret over time and tolerate failures. And so that's a pretty complicated protocol.
Andrew Stone:And, like, I don't think I'm gonna go through the protocol here. It's just too much. It's hard to visualize, but it is a two-phase protocol. So you can look at it as two-phase commit, but the problem with two-phase commit is that two-phase commit doesn't tolerate any failures. And so that was my first cut: going with two-phase commit and then saying, well, how do we recover?
Andrew Stone:We just, like, boot a sled if it goes offline during the reconfiguration. It's like, sure. Yeah. We can do that. But, like, we can do better.
Andrew Stone:So
Bryan Cantrill:And, Andrew, when is it that Finch comes aboard? Because I think, Finch, you came aboard, I wanna say, like, October. When did you... joined in November. It has only been since November. Okay.
Andrew Stone:Yeah. Yeah. Is that right? I had a lot of things that were not done when Finch joined.
Bryan Cantrill:But that was, like, only... Finch, you've been at Oxide for, like, four... are you sure it's four months, not four years? It feels like it's been four years. Does it feel like it's been four years to you? Just out of curiosity.
Finch Foner:It it feels like an enormously long amount of time compared to the amount of time that it's actually been.
Bryan Cantrill:Yeah. That that seems that seems crazy. Okay. So I A lot
Andrew Stone:has gone on.
Bryan Cantrill:A lot has gone on. So, Finch, you joined, and this is a problem that I think you had a natural affinity for: a real background in cryptography and distributed systems, and a real affinity for this problem. Absolutely. Yeah. And I know Andrew was really stoked to have someone aboard that he could really go compare notes with and really work out the details of the protocol and so on.
Bryan Cantrill:So you wanna describe what it was like ramping up on this and coming aboard and joining Andrew?
Finch Foner:Well, I remember that in my call with you, Bryan, where you were offering me the job, you said, And there's this thing called Trust Quorum that you should take a look at because I think this would be a natural fit for the kind of stuff that you're interested in. And I did, and it was exciting. I think what was interesting was coming aboard, and for me at least, the Oxide codebase was the largest codebase that I had ever grappled with. Omicron is enormous by itself. Trying to figure out, okay, I'm joining the flight crew as the airplane is about to land.
Finch Foner:And I am completely unfamiliar with all of the controls that are in the cockpit. And so I think the tricky part was we were in the integration phase. We were trying to figure out, okay, how do we plumb all of this machinery for the Sprockets protocol and for the Trust Quorum secret assembly into sled agent, which is the process that's managing all of the sled-local stuff that's going on. And it's just a huge and somewhat tricky code base to navigate and situate myself in. So at first I felt like I was making very little progress in building up my mental model of how everything is going on. And Andrew did a fantastic job of orienting me.
Finch Foner:And it was just this sort of very nonlinear thing where all of a sudden things started to click into place and I realized, okay, so this plumbs into that plumbs into this. And I'll say I used Claude as a guide throughout this process, which was really useful because I could say, can you trace? I want to write something that does this. Can you trace the 12 layers of abstraction that this thing needs? And it was a really useful way of at least pointing me at which files I need to examine with my own human eyes to understand how everything jigsaws together.
Bryan Cantrill:Yeah. That's really interesting, because I feel that this is one of those uses of Claude that I really encourage everyone to try. I mean, because we've used tools to navigate a source base. And those tools have, like, not really evolved since... well, I mean, like, I do use cscope on Rust.
Adam Leventhal:You're walking right into this rust analyzer intervention. You realize Oh,
Andrew Stone:this is why I
Adam Leventhal:have them here behind me.
Bryan Cantrill:This door seems to have been locked from the other side. I can't seem to leave. What's going on?
Adam Leventhal:Like, I mean, if you ignore the tool we've been using pervasively for the last five years.
Bryan Cantrill:Yes. Yes. If if you ignore that
Adam Leventhal:Have not evolved at all.
Bryan Cantrill:That's right. That's right. I have not evolved at all. Let me rephrase.
Adam Leventhal:You said The whole generation.
Bryan Cantrill:Look, I know. But okay. Like, rust-analyzer.
Adam Leventhal:Sure. Do you?
Bryan Cantrill:You know? I know. I love the idea of loving it.
Adam Leventhal:I do love it. I love it with my whole heart.
Bryan Cantrill:Very good. But it is difficult to come up to speed on a new system. I mean, Rust makes it easier, rust-analyzer makes it easier to understand types and so on, and obviously you're reading block comments. But it is very, very helpful to have something that can analyze the source base, where you can ask it questions. I mean, because I think it is hard to ramp on a new system and it requires a lot of work. I think, Finch, I'm sure you felt this too, that when you come aboard, it feels like everybody already knows this. Yeah.
Bryan Cantrill:And it feels like, oh, they must have just been born knowing it.
Finch Foner:And, you know, they did write it, some of them. And so I'm talking to Andrew and... Well, I guess it was interesting, because there's a mixture of stuff. We were talking together, Andrew, and you were like, Oh, absolutely, the thing does this thing. And then there were a couple of questions that I would ask and you would say, well, it's been a couple of years since I've considered this. And we would both sort of collectively explore what it was that actually transpired.
Andrew Stone:I do remember you asking me a bunch of questions. Like, gotcha questions. And I'd be like, no, no, got you. It already does that.
Adam Leventhal:I'm with you on these tools of, like, can I walk up to this code base and understand what it's doing? And, I mean, just to make sure everyone is getting bingo today, I felt like that was something that I used DTrace for a lot.
Andrew Stone:Yeah. For sure.
Bryan Cantrill:It was.
Andrew Stone:Yeah.
Adam Leventhal:Yeah. I'm in this huge code base. I know that there's some activity located proximate to kind of what I give a shit about here.
Bryan Cantrill:What is actually going on? Exactly. Kind of
Adam Leventhal:what but but doing it at rest, I think that to your point, that's a real novelty of the LLM era.
Bryan Cantrill:Yeah. And it rewards, I think, when you have a well-documented source base, you've got good types, you're using Rust. I mean, there's a bunch of things that make it easier for tools to help one ramp up on a new source base. So, yeah, Finch, it's great to hear that that was helpful for you.
Finch Foner:It was. Yeah. I think the, excuse me, the last bit that I would say I haven't seen everyone doing is you can get Claude to invoke your editor as a tool call. So I would say, give me a tour of all of the call sites of this function, or drill down into all of the abstraction layers, and let me tell you when I want to move on to the next one.
Finch Foner:And Claude would just like be like, okay, I'm gonna invoke Zed on the CLI and just pop you over in your editor to the place that you want to look at anyways, which was, I don't
Andrew Stone:know, that's fun.
Bryan Cantrill:It's like a a Claude chaperoned sewer tour of the of a
Andrew Stone:new source.
Finch Foner:A sled agent.
Adam Leventhal:Yes. Hey, now. There
Andrew Stone:are other bits that are sewer y, but I wouldn't say that this was particularly sewer
Bryan Cantrill:Sewers are an important act of civil engineering, sir. I I I Sewer adjacent? Exactly. Sewer adjacent. No.
Bryan Cantrill:Fine. Just the clean water then. But that's really interesting, because I do also think that you can become numb with all the new abstractions that you're learning, and it can just be helpful to get some guidance through it. And so you're beginning to come up to speed and you're beginning to be like, okay, I can actually see how I can be helpful here. And, Andrew, I remember you describing, it's like, God, it is so great to have someone who I can just talk to about the protocol and we can, like, meaningfully engage on it.
Bryan Cantrill:I think, you know, there's a lot to be said for working with other people on a problem, as it turns out.
Andrew Stone:This is such a lonely job. Like, really, the protocol got very complex and you really had to, like, take a little bit of secondary ownership to deal with it. So, like, by the time Finch came on, the protocol was written, there was a TLA+ spec. The whole thing was actually operating as part of Omicron, but it wasn't wired in for usage. Meaning, like, you could generate key shares, you could rotate them, all the encryption code was written.
Andrew Stone:Everything was written, but, like, you couldn't use it. None of the Nexus code was wired in and none of the sled agent code was wired in. And so Finch came in being like, Okay, there's this protocol. And, like, I don't know anything about Omicron, but I can learn this protocol. And I was like, I think Finch could help me on this sled agent side.
Andrew Stone:And, like, importantly, none of the upgrade code was written either. And so there was a whole bunch of existing bits in sled agent that this needed to be plugged into. And I can let Finch talk about that. Just one second, I did build a tool to help explore this. There's a trust quorum debugger.
Andrew Stone:And so we have a bunch of prop tests that can generate random configurations and different message paths and failures and simulate all sorts of things so we can test the protocol, and we emit traces. And so I built a debugger that allows you to stand up, like, a fake set of nodes and step through their internal state using the real trust quorum code, but not running over a real network. So we can actually pause and use it to step through messages being sent, check the state, take snapshots of the state, diff it. And so you can explore the real protocol running in that manner, but it's not wired into, like, the real Oxide rack at that point. And so, like, this stuff exists, but, like, literally none of it's hooked in and it's still months of work to go.
Andrew Stone:And and at this point, I'm like, shit. This is, like, gonna take a really long time and the pressure has just ramped up. And then we have Finch. And so Finch joins and Finch can talk about what they did.
Finch Foner:Sure. Yeah. So I just wanted to note that one of the things that empowered that simulator to exist is that you had the foresight to write Trust Quorum as a sans-IO core that could then be hooked into a different execution environment. I thought that was pretty cool.
Bryan Cantrill:But, Finch, can you elaborate on that? Because I think that it's a really important technique. Because again, you've got this problem of, like, how do you build the thing before it exists? And
Andrew Stone:Absolutely. Yeah.
Bryan Cantrill:Can you elaborate on that a
Finch Foner:Yeah. I mean, the sans-IO technique shows up in so many different disguises. Like, in Haskell it's called a free monad. But in Rust, I mean, the basic idea is you just encode the entire protocol that you're trying to describe as a side-effect-free state machine: a pure function that takes some kind of materialized enum of actions that can occur, perhaps changes its internal state, and emits potential signals of what effects ought to happen, which are also materialized as maybe an enum or some struct or something that just says, like, here's a description of what I would do given the input that you've just described to me. And doing it this way means that you can hook this up to a prop test suite that fuzzes all sorts of arbitrary inputs and asserts that the outputs match some other oracle implementation. It means that you can do this kind of tracing debugger that Andrew was describing.
Finch Foner:As soon as you take away this tight coupling between the side effects that are being performed like network calls or disk IO or whatever it is that your protocol is actually doing in the real world, and you add this essentially like a dependency injection shim for the inputs and the outputs, you can start to do a huge amount more really easy and quick verification that it obeys the invariants that you actually care about.
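As a minimal sketch of the sans-IO shape Finch describes, here is a toy share collector in Rust. Everything in it (the `Input` and `Output` enums, the `Collector` type, the idea of counting replies against a threshold) is a hypothetical stand-in for illustration and is not taken from the Trust Quorum code. The point is only that `handle` touches no network, clock, or disk: it takes a described event and returns described effects.

```rust
use std::collections::BTreeSet;

/// Events fed into the state machine by whatever environment drives it.
#[derive(Debug, Clone, PartialEq)]
pub enum Input {
    /// Start collecting shares from these peers.
    Start { peers: Vec<u32> },
    /// A peer's share arrived (share bytes are elided in this sketch).
    ShareReceived { from: u32 },
    /// Time passed without enough progress.
    Timeout,
}

/// Descriptions of effects; the state machine never performs them itself.
#[derive(Debug, Clone, PartialEq)]
pub enum Output {
    SendShareRequest { to: u32 },
    SecretReady,
}

/// Hypothetical collector: done once `threshold` distinct peers have replied.
#[derive(Debug)]
pub struct Collector {
    threshold: usize,
    peers: Vec<u32>,
    received: BTreeSet<u32>,
    done: bool,
}

impl Collector {
    pub fn new(threshold: usize) -> Self {
        Collector { threshold, peers: Vec::new(), received: BTreeSet::new(), done: false }
    }

    /// Step function with no I/O: update internal state and return the
    /// effects the caller should carry out.
    pub fn handle(&mut self, input: Input) -> Vec<Output> {
        if self.done {
            return Vec::new();
        }
        match input {
            Input::Start { peers } => {
                self.peers = peers;
                self.request_missing()
            }
            Input::ShareReceived { from } => {
                // Ignore replies from peers we never asked.
                if self.peers.contains(&from) {
                    self.received.insert(from);
                }
                if self.received.len() >= self.threshold {
                    self.done = true;
                    vec![Output::SecretReady]
                } else {
                    Vec::new()
                }
            }
            // On timeout, re-ask only the peers that have not replied yet.
            Input::Timeout => self.request_missing(),
        }
    }

    fn request_missing(&self) -> Vec<Output> {
        self.peers
            .iter()
            .filter(|p| !self.received.contains(*p))
            .map(|&to| Output::SendShareRequest { to })
            .collect()
    }
}
```

A production driver would turn each `Output` into a real network send, while a property test or step debugger can feed `Input` values directly and inspect the returned `Output` values and the state, with no network involved.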
Bryan Cantrill:Yeah, that's really powerful. And so that allowed you to get ramped on this thing quickly because you've got the stuff that's already effectively working and now you're building around it.
Finch Foner:Right. Like, I came in and it was pretty clear from the outset that there's a really high amount of rigor that's been already invested in this artefact. Like, there's the TLA+ spec, there's this set of prop tests, etc. It's like, okay, we're sure that the protocol is really darn correct already. No need to spend a huge amount of effort assuring myself that this is so.
Finch Foner:Let's get to the plumbing. And the plumbing was really interesting because we had to figure out how do you make it so that during the course of an upgrade, some nodes post upgrade will want to be operating in the trust quorum state, where they have already upgraded from low rent trust quorum to full and real trust quorum, and other nodes have not yet upgraded? Because an upgrade of the rack is an upgrade of a distributed system. So you have asynchrony and partial failure and all of this, as you said, Andrew. So there are a couple of levels of indirection that we had to insert to make it possible for some sleds to toggle forward into trust quorum mode, while still being able to communicate according to low rent trust quorum with the sleds that hadn't yet sort of moved across that threshold.
Bryan Cantrill:So these are sleds that are effectively living in both worlds simultaneously. They're living in the new much more improved world, but they also still need to be speaking the older world in order to be able to bring that world along.
Finch Foner:Right. And that's because you might have a sled die in the middle of the upgrade, and it might even need to communicate with other sleds when it comes back up to say, Hey, I'd like to reassemble my share from those other sleds. And maybe this is spoiling something for you, Andrew, so I might hand it off to you to talk about this. But a really cool feature of Andrew's implementation of Shamir secret sharing is that you can reassemble your share as a particular sled if you've happened to lose it from asking others about
Bryan Cantrill:it. Yeah,
Andrew Stone:this is something that for some unknown reason nobody else has done. And I think it's because I just don't think Shamir secret sharing has been used that much. Like, this is something that seems to happen at Oxide a lot. Like, we go deep with something and we tend to end up using a technique more than anybody else has ever used it, even if it's existed for decades. And Shamir secret sharing is used in a bunch of places.
Andrew Stone:Like, I think Vault uses it from HashiCorp, right? And like, there's a bunch of other good uses for it. But like, we have baked it in to be like the core of trust in our system. And so we find new things. And so when I built LRTQ, I had no idea really how it I knew at a high level how it worked, but I really didn't understand the math.
Andrew Stone:So I started digging into it and I realized, oh, I can write my own implementation here. And what I realized is that you use this polynomial interpolation to learn the rack secret at the point x equals zero. When you have k plus one shares, where k is the degree of the polynomial, you can learn the point at x equals zero. Well, it turns out you can actually learn it at any point x. And so if each sled is x equals one, two, three, four, five, and they know uniquely what their x value is, they can go ahead and say, Hey, I don't have a share yet.
Andrew Stone:Or everybody else can say, Hey, here's your share, right? Like you're point X. We know who you are based on your identity and your certificate, but you clearly haven't learned the share yet. So go ahead and ask everybody for the key shares and recompute the value for X equals five. Compute the y value for that point on the curve.
Andrew Stone:And so they can recompute their own. And so our implementation is called GFSS, which stands for Galois Field Secret Sharing. It's a very basic name, but the acronym wasn't taken by any other crates. Yeah. It allows you to not just recompute the secret, but to recompute any share.
Finch Foner:And that allows the system to be self-healing.
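The observation that the same interpolation recovers either the secret (at x = 0) or a lost share (at x = i) can be sketched in a few lines of Rust. This is a toy under stated assumptions, not the GFSS crate: for readability it works over the prime field modulo 2^31 - 1, whereas the name "Galois Field Secret Sharing" and the 255-share limit mentioned earlier suggest a small binary field; the function names are hypothetical; and the polynomial coefficients are fixed here, where a real dealer would draw them at random.

```rust
/// Prime modulus for the toy field (2^31 - 1). Assumption: picked for
/// readability; all inputs are expected to be reduced below P.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Modular exponentiation by repeated squaring.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Multiplicative inverse via Fermat's little theorem (requires a != 0).
fn inv(a: u64) -> u64 { pow(a, P - 2) }

/// Evaluate the polynomial with the given coefficients (constant term
/// first) at x, using Horner's rule. The constant term is the secret.
pub fn eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Lagrange-interpolate the unique polynomial through `points` (which must
/// have distinct x values) and evaluate it at `x`. With x = 0 this yields
/// the secret; with x = i it yields the share belonging to participant i.
pub fn interpolate_at(points: &[(u64, u64)], x: u64) -> u64 {
    let mut result = 0;
    for (i, &(xi, yi)) in points.iter().enumerate() {
        let mut num = 1;
        let mut den = 1;
        for (j, &(xj, _)) in points.iter().enumerate() {
            if i != j {
                num = mul(num, sub(x, xj));
                den = mul(den, sub(xi, xj));
            }
        }
        result = add(result, mul(yi, mul(num, inv(den))));
    }
    result
}
```

For example, with the degree-2 polynomial 1234 + 166x + 94x^2, the shares at x = 1, 2, and 4 are enough for `interpolate_at(&shares, 0)` to return the secret 1234, and for `interpolate_at(&shares, 5)` to return 4414, which is exactly the share a sled at x = 5 would have been dealt.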
Andrew Stone:Yeah. Yeah. It's not fully self-healing, because there's a limit to what you can lose, and I didn't bake it in, but I think we can make it, like, more robust. But what it does mean is, like I talked about with the protocol, key rotations are occurring and, say, a sled was offline, but it's still part of the trust quorum, just because it happened to be down, and an operator didn't wanna remove it because there's data on that sled and they didn't wanna migrate it. Once it came back online, it could ask and learn its share for the latest rack secret, and learn the encrypted rack secrets for prior epochs going back to the last one it knew about, so that it could do its own key rotation.
Andrew Stone:So it learns its own share, it's able to compute a rack secret and participate in the latest trust quorum, but then it can also decrypt older rack secrets so it can rotate itself out asynchronously. Like, none of these key rotations happen atomically. And so Finch worked on some of this key rotation stuff. And I'll kick it back. There's some fascinating things that Finch did to make that all work.
Andrew Stone:But that was like a big chunk of work that I knew at a high level how to do it using ZFS, like change key commands, but I didn't do it. Right? This was another thing where I was like, oh, Finch is here. I can work on the Nexus side and Finch can go do this. And I think this is some of the more interesting work that was done in the last couple months.
Bryan Cantrill:Well, and we are eliding a delightful side quest, Finch, involving the Lua interpreter that one would be forgiven for forgetting is in ZFS.
Finch Foner:I mean, I didn't know it in the first place until I went and tried to solve this problem. So I guess we've talked about key rotation so far. And the question that perhaps you might be asking is: if the keys can be rotated, how does each sled know which version of the key to use to decrypt any given encrypted volume? Each sled has many U.2 drives, which are each encrypted separately with key material that is deterministically derived from the identity of that particular hardware and also from the rack secret.
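As a rough illustration of "deterministically derived from the identity of that particular hardware and also from the rack secret", the sketch below derives a per-disk key with HKDF (RFC 5869) built from Python's standard `hmac` module. The conversation does not say which derivation function Oxide uses, so HKDF, the salt, the label string, and the names `hkdf_sha256` and `disk_key` are all assumptions.

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with HMAC-SHA256: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def disk_key(rack_secret: bytes, disk_identity: bytes) -> bytes:
    """Hypothetical per-disk key: same inputs always give the same key."""
    # The salt and label are made up for this sketch.
    return hkdf_sha256(rack_secret, salt=b"example-salt", info=b"disk-key|" + disk_identity)

key_a = disk_key(b"rack-secret-epoch-1", b"disk-serial-A")
key_b = disk_key(b"rack-secret-epoch-1", b"disk-serial-B")
key_a_rotated = disk_key(b"rack-secret-epoch-2", b"disk-serial-A")
```

Because the rack secret changes with each epoch, the same disk gets a different key after a rotation, which is why each volume needs to record which epoch its current key came from.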
Finch Foner:So when you rotate the key on a particular encrypted volume that's on a particular drive, you can't atomically do this across all the drives that are on a particular sled. You would like to be able to atomically do it on an individual drive, and that's actually where things got a little bit gnarly. We have this ability to set metadata properties on ZFS volumes that are on drives. We have this crypt volume on each individual drive; that's where all the encrypted data goes, encrypted according to a key that's derived from the rack secret. And so we wanted to set this metadata value called Oxide epoch that says which version of the key was used to encrypt this data.
Finch Foner:The problem is that the ZFS change key command, where we shell out to ZFS and say do this to that volume, did not support setting user properties atomically while it's changing the key, which means that we could have an off by one or potentially off by many error if we had some failure exactly at the split moment where we were doing key rotation on a particular drive. So a partial failure can occur at at least three different levels: across sleds, across drives, and between the metadata annotation time and the actual key rotation time on a particular drive. So we wanted
Bryan Cantrill:This is high stakes, too. These failures are high stakes because if you were to bounce at exactly the wrong time, your data would be lost at sea, potentially. I mean, we wouldn't actually know that these things did not line up, and we would not be able to get to your data, potentially.
Finch Foner:Potentially. Although there is sort of a last ditch attempt that you can do, which is trial decryption with every possible key that it could have been, working backwards. And that's actually baked in now, so that if there is some unknown failure, we fall back to trial decryption and emit a loud warning in the logs saying this should never happen, but we were able to recover the data anyway. But we didn't want to have to do trial decryption every time that we do a key rotation and then come back up, because that's potentially linear in the number of key rotations that have ever been performed.
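The fallback described here can be sketched as follows: try the key for the epoch recorded in the volume's metadata first, and only if that fails walk backwards through every known epoch, logging loudly when the fallback succeeds. This is a toy model in Python, not the actual sled code: `FakeVolume`, `key_for`, `unlock`, and the warning text are hypothetical, and a real implementation would attempt to load the key into ZFS instead of comparing fingerprints.

```python
import hashlib
import hmac

def key_for(rack_secret: bytes, disk_id: bytes) -> bytes:
    """Stand-in key derivation (an assumption, not the real scheme)."""
    return hmac.new(rack_secret, disk_id, hashlib.sha256).digest()

class FakeVolume:
    """Stand-in for an encrypted volume: keeps only a fingerprint of its real key."""
    def __init__(self, key: bytes, recorded_epoch: int):
        self._fingerprint = hashlib.sha256(key).digest()
        self.recorded_epoch = recorded_epoch  # metadata that may be stale

    def try_unlock(self, key: bytes) -> bool:
        return hmac.compare_digest(hashlib.sha256(key).digest(), self._fingerprint)

def unlock(volume, rack_secrets, disk_id, warnings):
    """Return the epoch whose key unlocks the volume."""
    recorded = volume.recorded_epoch
    # Fast path: the recorded epoch is right, so one attempt suffices.
    if recorded in rack_secrets:
        if volume.try_unlock(key_for(rack_secrets[recorded], disk_id)):
            return recorded
    # Slow path: trial decryption, newest epoch first. Linear in the number of epochs.
    for epoch in sorted(rack_secrets, reverse=True):
        if epoch == recorded:
            continue
        if volume.try_unlock(key_for(rack_secrets[epoch], disk_id)):
            warnings.append(
                f"recorded epoch {recorded} was wrong; epoch {epoch} unlocked the volume"
            )
            return epoch
    raise ValueError("no known rack secret unlocks this volume")

rack_secrets = {1: b"secret-epoch-1", 2: b"secret-epoch-2", 3: b"secret-epoch-3"}
disk = b"disk-0"
consistent = FakeVolume(key_for(rack_secrets[3], disk), recorded_epoch=3)
# Simulates a crash between the key change and the metadata update.
stale = FakeVolume(key_for(rack_secrets[3], disk), recorded_epoch=2)
```

The slow path is what makes the atomic metadata update worth having: with a trustworthy recorded epoch, the common case is a single unlock attempt.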
Finch Foner:So the solution was, well, we want ZFS change key to be atomic with the ability to set user properties while you do it. And there began a kind of fun side quest, as you were indicating. So one version of doing this would be to alter the implementation of ZFS that we use in Helios, that we ship to the rack, so that we can simply have this property be true: add some extra command line flags to ZFS change key, and there you go. But I've never done kernel development on ZFS, and we wanted to be able to test this as quickly as possible, so I wanted to find another solution in the interim before we were able to engage people who had a lot more expertise in this kind of area. So I asked around in the Oxide storage channel we've got on our internal chat, and I was pointed in the direction of these things called channel programs.
Finch Foner:And a channel program is a Lua script which runs in the context of the ZFS kernel module, and it has access to a global object called ZFS that exposes the entire API surface area, as far as I can tell, or close to it, of ZFS, the command line utility, and allows you to programmatically do a bunch of stuff in that Lua script. Now the advantage of having this execute inside of ZFS itself is it means that the entire Lua script's effects on the file system are either all of them or none of them. You get all of the transaction effects on doing anything to any particular volume affected atomically in one transaction group, which would not be the case if you executed that same Lua code outside of the context of a channel program. So this seemed like exactly what we need to be able to patch in this functionality without actually changing ZFS. There was a bit of a wrinkle though, because Well, there's sort of two different wrinkles.
Finch Foner:One wrinkle is I've never written Lua before. So that was a fun getting started problem, which was actually, again, aided by Claude Code, thank you. And the other problem was the meaning of atomicity in the context of a channel program is whichever effects happen to happen during the course of the channel program's execution are either committed all as a batch, or if there's a hardware failure or something, they don't happen at all from the perspective of the file system. What it doesn't mean is that if the Lua program returns an error at some point during its execution, then you roll back the changes that happened up until that point. So, in order to make it truly atomic from the sense that we care about, which is that even if there's an error that happens while you're performing these operations Say, you go to set the metadata property, but something is misconfigured, and you get an error back from ZFS saying we couldn't set that metadata property, or even if you're setting multiple metadata properties, we don't want to end up with a partially failed, partially committed state.
Finch Foner:So the Lua script itself needs to implement an application level rollback protocol in order to make sure that the operation that we're doing is atomic in the true sense that we actually intended it to be.
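The application-level rollback idea can be sketched independently of ZFS: remember the previous value before each change, and if any step fails, undo the completed steps in reverse order before reporting the error. The real implementation is a Lua channel program running inside ZFS; the Python below is only a model of the control flow, with an in-memory dict standing in for the volume's properties and a made-up failure condition in `set_prop`.

```python
_MISSING = object()  # marks "this property did not exist before"

def set_prop(state: dict, name: str, value: str) -> None:
    """Hypothetical setter that can fail; the failure rule is arbitrary for this sketch."""
    if ":" not in name:
        raise ValueError(f"invalid property name: {name!r}")
    state[name] = value

def apply_all_or_nothing(state: dict, changes) -> None:
    """Apply every (name, value) change, or leave `state` exactly as it was."""
    undo_log = []
    try:
        for name, value in changes:
            previous = state.get(name, _MISSING)
            set_prop(state, name, value)
            undo_log.append((name, previous))  # recorded only after the step succeeded
    except Exception:
        # Roll back completed steps, newest first, then surface the original error.
        for name, previous in reversed(undo_log):
            if previous is _MISSING:
                state.pop(name, None)
            else:
                state[name] = previous
        raise

props = {"example:epoch": "4"}
apply_all_or_nothing(props, [("example:epoch", "5"), ("example:note", "rotated")])
```

In the channel program setting the two layers complement each other: ZFS guarantees that whatever the script did lands in one transaction group or not at all, and the script's own undo log guarantees that what it did is never a half-finished update.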
Bryan Cantrill:So this is gonna be some complicated Lua that you're gonna write, which, I mean, good news, is all technically feasible. And, I mean, you said this, but kind of importantly, this is something that doesn't need to exist for a huge amount of time, because we actually wanna get this properly fixed in ZFS. So we've kind of scheduled that work, but you've got this interim of several days or weeks, which as we've already described is like weeks, months, or years in Oxide time, where you would like to have this functionality so you can go on. So writing this quickly is kind of of the essence.
Finch Foner:Yes, that's right. But writing it quickly and also having the kind of confidence that we would be able to ship it if we wanted to ship Release 19 with full trust quorum, to have the confidence that this is in fact a rigorous solution to the problem and it's not going to have all sorts of brokenness to it. So it comes along with a bunch of tests, a pretty exhaustive test suite, which, you know, I guess ring the bell for engineering rigor with LLMs, was generated in part by Claude. And I feel pretty confident that if we had not been able to go and patch ZFS to support this functionality natively, we really could have shipped this to the rack and it would have held up under pressure.
Bryan Cantrill:Yeah. That's really interesting. And I yeah. Sorry. Andrew, go ahead.
Andrew Stone:I was just gonna say, importantly, though, we actually did test it. We merged this in, and I ran a bunch of tests on both a4x2, which is our simulated platform for running multiple sleds using Propolis, which is our hypervisor. And I also tested it on racklets, which are like four sled racks that are not full racks. They're racklets. Besides the sleds, there's some other funky things going on to make everything work together.
Andrew Stone:But it's part of our engineering stuff. So I was able to validate that this works: add and remove sleds, do all sorts of stuff, and upgrade from LRTQ to Trust Quorum. Like, it worked. I didn't find one problem with it, at least in my limited testing, which is pretty awesome.
Bryan Cantrill:It is. And I mean, the way you were describing this, I love that you're kind of expressing that, you know, Claude is always down for whatever. There's no sense of Claude being like, whoa, what are we doing here? Wow.
Bryan Cantrill:That's like Claude's like, alright. Got it. Lua interpreter in the kernel rollback. Whatever, buddy. Yeah.
Bryan Cantrill:Yeah. Exactly. Which is, I mean, it's the peril of it too, right? In that it will at no point be like, hold on.
Bryan Cantrill:Hold on. Hold on. Where are
Adam Leventhal:we? Sorry.
Bryan Cantrill:What is going on? That's right.
Adam Leventhal:Unlike all of our chat by the way.
Bryan Cantrill:The right.
Adam Leventhal:Lua in the kernel? They all said it at once.
Bryan Cantrill:That's right. But because this facility is well documented, and because you then further have the ability to generate so many tests for it, it really did allow us to get a lot of confidence in this mechanism, and importantly allowed us to build everything else we needed around it. I mean, it really was pretty essential for this.
Finch Foner:Yeah. And that's really the thing: we couldn't see this in action in any kind of way in a4x2 or on a racklet or anything until we actually had the ability to rotate keys on the actual ZFS volumes. So all of the code that I had been writing up until this point, with the abstracted secret retriever that could function in LRTQ or Trust Quorum mode, all of this was done essentially blind. Sure, it compiles. It passes scrutiny when Andrew and I look very, very carefully at it.
Finch Foner:It runs when we run it without producing any errors, but nothing is actually happening until you have this thing actually plugged in and hooked up. So that's what that let us actually do.
Bryan Cantrill:That's very cool. And so that allowed us to get confidence that we could actually pull this whole thing off. So, I mean, Andrew, kind of with this in place and the other work that Finch was doing, I mean, we were getting on the actual glide slope of landing all this stuff.
Andrew Stone:Yeah. At that point, once that worked, in my mind it was done, which is why I've said to multiple people that it felt very anticlimactic when I merged the last PR in, because it was just adding an OMDB command to expunge a sled. Again, just plumbing and text processing. The hard things worked, and I'd been working for a couple of months by that point. And it was really just crossing the t's and dotting the i's.
Andrew Stone:Yeah. So it felt great. When, like, you saw this, was blown away.
Finch Foner:I remember messaging you at the very end when you had merged the very final PR, and I was like, how are you gonna celebrate, Andrew? This is so big. And you were like, I feel like it's already been done.
Bryan Cantrill:Yeah. I I feel that this is often the case that you when you've been grinding on something for so long. And Andrew, this is a problem that you've been working on in some capacity for, I mean, God, like four and a half years. Right?
Andrew Stone:Yes.
Bryan Cantrill:Jesus. Crazy.
Andrew Stone:I didn't realize I had started as early on as I did. Like, that is a long time ago.
Bryan Cantrill:And had done, obviously, many, many, many other things along the way. This is a problem that you had kind of gone into, and then we had rightfully prioritized some other things, and had done a minimum implementation that allowed us to get to a future world that was much better. So this is not like you've been working on this exclusively, but this is something that's just been a very, very, very long journey. And I think that with the relief that you get because it is so long, sometimes you don't get that kind of feeling of exhilaration that you would think that you'd get. Like the feelings of exhilaration
Andrew Stone:That's a 100% true.
Bryan Cantrill:And I think actually the contrary. I know, Andrew, you and I spoke very directly about this, but I do always kind of caution people that when you've got something very large that you're finishing, you almost have a postpartum effect, where you're kind of like, God, I've been shackled to this thing for so long that I actually don't know what to do with myself now that I've got this thing. I'm so used to working on this problem that I'm now having to become accustomed to not having the problem, and it can be a real adjustment. I think you've gotta be very, very self aware when you not only don't feel that sense of exhilaration, but kind of feel a sense of, I mean, ennui is probably phrasing it a little too strongly, Andrew, but you do get this kind of postpartum effect that can be kind of
Andrew Stone:I really wanted to do something immediately. Yeah. I wanted to do something immediately, but I didn't wanna dig into anything big. Or I should say that anything big that I was gonna dig into, I didn't think was ready to be dug into, and I didn't think I had the mental state. And so I was like, okay.
Andrew Stone:Can I do, like, some small stuff to ship stuff? Because, like, at this point, I also didn't wanna take time off. Like, I was like, I just wanna work on some stuff in my hold. It's not that important right now, but that will contribute overall to the product. And that's kind of what I did for a few weeks.
Andrew Stone:It was it was nice. But, like,
Bryan Cantrill:yeah. Yeah. And think it's important to have that stuff because I do think that it it is just when you're working on something that is so it is such a long path. The moments of exhilaration such as they come are when you have surprising good news, which I feel does happen. I'm of course
Andrew Stone:LRTQ was that. So, like, shipping coming up with LRTQ and then, like, merging that in within whatever two months so we could ship the first rack, like, that Yeah. Like, I literally broke down and cried when I when that like merged in. Which I was expecting a similar feeling and I didn't have remotely that feeling when Real Trust Quorum, which is a much, much, much larger project merged in. You know, who knows?
Bryan Cantrill:Yeah. Well, I mean, it was it it's really tremendous. And I mean, now we've kind of it it it's it's kind of amazing now that you we we it took us years to get there. But I mean, like with some of these other very long problems, we can now we have achieved the vision that we've wanted to achieve with respect to Trust Quorum, which is really I mean, it is exhilarating. It is it is really exciting.
Bryan Cantrill:I think now, Andrew, getting a little bit of distance on it too, you can look back and be like, goddamn. The valley floor is way, way, way below us.
Andrew Stone:It is crazy. There are other things we wanna do with Trust Quorum also. They're not critical, but I think Phil scheduled a meeting for a week from Thursday with a few of us to discuss what I've been calling secret share sealing. And this is an interesting problem: we said that the shares right now are individually stored on the M.2 drives, but they're unencrypted. So if somebody could steal all the M.2 drives, they would be able to reconstruct the rack secret.
Andrew Stone:Right? And M.2 drives are tiny. Anybody could probably put those in their pockets. The difference is that they're inside the rack. You have to unplug a sled and unscrew them.
Andrew Stone:It still only takes about thirty seconds, but you could do that. What we wanna do is make it so that you have to boot the RoT on the particular sled that the M.2 is assigned to in order to decrypt the share. So that forces you not just to steal the M.2s, but to steal the entire sled and be able to power it and boot it. And so you'd essentially have to steal the whole rack again.
Bryan Cantrill:Steal the whole rack again.
Andrew Stone:Yeah. That's another hole we'd like to we'd like to plug. And I don't think it's gonna be that challenging.
Bryan Cantrill:That's awesome. Yeah. And I think that it should also be said that you kind of mentioned this in passing, but one of the things, I mean, this aspect of the system has always been important to us in terms of actually delivering a secure system. And we said from the beginning that we're not developing a security product, but we are developing a secure product. And so this aspect of the of the system is really really important to us.
Bryan Cantrill:One of the things that's been exciting to us is it's become really important to a bunch of our customers, which is great. It doesn't always track the ones that you'd expect to be interested in it. As it turns out, people that guard machines with guns are actually just kinda less interested in some of these problems. It's like, yeah, well, we've got people with guns for that. But meanwhile, in the mainstream enterprise world, where people with guns are really not on the menu, there's a lot of interest in this, and that's been great.
Bryan Cantrill:I mean, it's been great to actually go to depth on what we're doing and have this be really meaningful for customers who are deploying the Oxide rack, which again, wasn't always true in the history of Oxide. And, I mean, you referenced that, Andrew: there was kind of a prolonged period where we did put this aside because we had other priorities. Like, for example, we wanted to be able to, you know, update the rack without parking it, and wanted to be able to
Andrew Stone:And that does seem a little bit more important. Yeah. You know? But yeah. It's great.
Andrew Stone:It's great that it's important. And like, what was also great was when I finally got back to working on this, it became a priority. And so that I wasn't, not that I was ever strictly pulled off, that's just not the way things really work at Oxide, but it was prioritized enough that I could focus on it pretty much, 95% of my time. And so that was very handy to me to get to the point where I could merge things in. And when Finch started interviewing, I was like, oh yeah, we're gonna hire Finch.
Andrew Stone:Yep, yep, let's do this. Yeah, nope, Finch is joining. Finch has exactly the experience I need. I was very excited to bring Finch on board.
Bryan Cantrill:Well, and Finch, you were a huge shot in the arm. I mean, I think you know this, but it was it was and Andrew loved the collaboration. The collaboration between the two of you was really very very important to getting this thing all the way in. And it's really satisfying to see.
Finch Foner:It was my pleasure.
Andrew Stone:It not just the work. Yeah.
Finch Foner:I I had a blast.
Bryan Cantrill:Awesome. Well, thank you. Thank you two for joining us and enduring, I mean, the irony of talking about a coherent distributed system and having a very incoherent distributed system in terms of Discord. I think over the course of the chat, there have been people that can hear some subset of us and can't hear others. So apologies to those of you tuning in. Actually, the good news is I think those folks can't hear me right now, so they'll be catching the recording, but we're confident we got a good recording here.
Bryan Cantrill:And meanwhile, Discord, can you please, I mean, kind of in the one job department. And can you also stop with the dialogue boxes that make me feel like I'm an adult in a space for children? Do you have this feeling? No, it's just off putting, Discord. I mean, I don't know if Discord is, but, like, so did you see the
Adam Leventhal:Wait. Are you saying you're not like, podcasts for Gen Xers is not their key demo. Like, is
Bryan Cantrill:Okay. I you know, you you're saying that in a way that was that's very pointed and hurtful, but I I I I know.
Adam Leventhal:Thing feels like it's made for gamers or something.
Bryan Cantrill:It's really I I get it. It like it's made for my own children. I I mean, it's like Okay.
Adam Leventhal:You're really close.
Bryan Cantrill:The okay. Well, okay. So serious question because these I'm gonna read the dialogue box that it gave me when I logged in the Discord today. And I I just need an explainer on this just before we split. So with with everyone finally settled in after coming from end of year breaks, What does that mean first of all?
Bryan Cantrill:Is that like that's a school break, obviously. That I I like It's not end of end of You're likely getting back into your usual routines for vibe in Yeah. That's an apostrophe. I thought that's o g. Okay.
Bryan Cantrill:Just wanna make sure you heard it down. Vibin on Discord as well. You know. Yes. That's a you.
Bryan Cantrill:You know, bringing the guild back together for the new World of Warcraft expansion or playing four player co op in Slay the Spire two's early access release. In caps, yes, finally. This is, can someone please confirm for me that this is like peak millennial? Oh, I mean, this is I will go read this to my own gen z children and they're gonna I I just have a feeling that they are gonna call this very very cringe. Is that
Andrew Stone:Oh absolutely. Without a doubt.
Bryan Cantrill:Okay. Thank you. Thank you. I I I just need I just maybe I'll put it in the chatty bitty so I can translate it into Gen X for me because this is this is definitely. But if they could we could just get coherent audio.
Bryan Cantrill:Then we go.
Adam Leventhal:Suffer through any amount of
Bryan Cantrill:I I wanna get the guild back together to play That's it for the That'll be all. Exactly. But I I can't right now because our audio is split. So thank you for the Gen Zers in the chat who are are confirming that it's it's hella cringe. Oh, and then the millennials too, know, there you go.
Bryan Cantrill:Well, again, thank you. Thank you all. No thanks, Discord, but hopefully we can improve that. But thank you, Andrew and Finch, and thank you for really terrific work on this. Really exciting to see, and expect to see some more episodes from us on some of the security underpinnings we've got in the Oxide rack, because we're kind of due some episodes.
Bryan Cantrill:Yeah. We'll talk about a bunch of stuff. So got a bunch of new stuff to talk about. Awesome. See you next time.
Bryan Cantrill:Can't wait to get the guild back together.