Tractable

In this episode of Tractable, Orb's podcast with engineering leaders, Kshitij Grover dives deep on WarpStream. WarpStream is a cost-effective, Kafka-compatible streaming platform built to be simpler to manage than traditional Kafka deployments, drawing on Richie's previous experience managing large-scale data systems at both Datadog and Uber. WarpStream is architected not only to save companies inter-AZ networking costs but also to give them a stateless system that's easier to manage, with often-acceptable tradeoffs on end-to-end latency. Richie talks about how WarpStream being in the critical path for other companies' infrastructure influences both the GTM strategy and the architecture: serverless pricing is an important mechanic for making WarpStream a foundational building block, and it shapes how the team treats reliability as a mission-critical part of the offering.

What is Tractable?

Tractable is a podcast for engineering leaders to talk about the hardest technical problems their orgs are tackling — whether that's scaling products to deal with increased demand, racing towards releases, or pivoting the technical stack to better cater to a new landscape of challenges. Each Tractable episode is an in-depth exploration of how the core technology underlying the world's fastest growing companies is built and iterated on.

Tractable is hosted by Kshitij Grover, co-founder and CTO at Orb. Orb is the modern pricing platform that solves your billing needs, from seats to consumption and everything in between.

Kshitij Grover: [00:00:00] Hey everyone, welcome to another episode of The Tractable Podcast. I'm Kshitij, co-founder and CTO here at Orb. Today I have with me Richie. Richie is the co-founder at WarpStream. WarpStream is a Kafka-compatible data streaming platform, and WarpStream is doing a ton of cool stuff on top of S3. And Richie worked at Datadog and Uber on lots of big data streaming problems before this. So, really excited to have Richie. Richie, welcome to the show.

Richie Artoul: Thanks for having me, man. Really excited to be here.

Kshitij Grover: Awesome. Well, let's just start a little bit with your background and what inspired you to start WarpStream. As I just said, you've clearly worked on parts of this technology at previous companies.

So give me a little bit of that play-by-play and how you've thought about this problem, perhaps in previous roles in your career.

Richie Artoul: Yeah, you know, I've been working in distributed storage for most of my professional career at this point, so like nine years or something, and I kind of got [00:01:00] started almost entirely in the observability space, too.

So I kind of got started in storage at Uber, you know, when Uber was going through their kind of hyper-growth phase. They had a bunch of internal observability technology. I mean, they still do. And there was an internal metrics platform called M3 that I worked on. That technology, you know, was open source and eventually became the company that is Chronosphere. So they had a distributed open source time series metrics engine and a distributed time series aggregation system. And I worked on that for about three years, you know, mostly on the kind of guts and internals of the database. That was a really different system than WarpStream, too.

That was very much like a, you know, it does all the replication itself, it interacts with raw disks, classic distributed system. Interestingly enough, when we were at Uber, the observability teams actually couldn't afford to use Kafka. Like, the whole company used Kafka, [00:02:00] but with the amount of data we were pumping through, it just wasn't cost effective.

And so, at least at the time, the M3 team created this thing called M3 message, which you can think of as just the in-memory part of Kafka, you know, if the buffers fill up cause there's too much back pressure, then you have to drop metrics and that's life. So that was kind of my first encounter with streaming for these kinds of big data systems.

After that, my co-founder Ryan and I met at a conference, actually, while I was still working at Uber. And Ryan's like a big distributed systems nerd. He reads all the papers and, you know, can explain the exact nuances between Paxos and Raft and all that type of stuff.

At the time he tried to convince me that we should either build a logging database on top of object storage, or we should build Kafka on top of object storage, just like five years ago. And I was like, Kafka on object storage, who cares? That sounds so boring. We don't even use Kafka for our metrics.

It's fine. [00:03:00] Let's do logging, you know, that sounds cool. And so we built, you know, a prototype of a system. And that system eventually became a system called Husky at Datadog, which is essentially the columnar store that powers most of Datadog's storage products. Everything that's not metrics basically goes into Husky: logs, network events, profiling, APM, all that sort of stuff.

And it's basically a giant, you know, you can think like Snowflake, but for observability data, that sort of thing. Very similar architecture to WarpStream, you know, fully stateless, separation of compute and storage, and all that type of stuff. And what was interesting at Datadog, what I learned was, you know, they use Kafka for everything.

There's tons of Kafka pipelines that they use for processing and stuff like that. And what I learned was how nice it is to have this really scalable, durable [00:04:00] buffer, you know, if you can afford to use it. And it just made developing Husky so much easier, cause it's this stable buffer that doesn't change too often.

So you can kind of change all the software around it. If you mess up anything too bad, you can replay from Kafka, right? And if you think about doing the migration, I mean, we had the old system reading from Kafka and Husky reading from Kafka. They both read the same data. You shadow queries and it's really straightforward.

And you didn't have to worry that like, you know, if we had a buggy early build of Husky, that would somehow impact the service that was powering all the real-time stuff. And I was like, wow, this is great. So super nice. And then we kind of finished migrating everything to Husky, the new system, saved tons of money.

Much smoother operations, stateless, you know, it was great. And then when we were done, we were like, man, these Kafka pipelines actually are starting to feel really annoying compared to Husky. Cause, you know, we'd have some customer notify us that like, "Hey, we're having this event tomorrow. We're going to 10x in traffic" or whatever. And on the Husky team, we were [00:05:00] like, we don't care. It'll auto-scale. But then someone would have to go and manually scale the Kafka clusters. And once you scaled them up, someone would have to remember to scale them back down. And scaling down Kafka is not a trivial thing to do. You had to make sure everything was triply replicated. And, you know, you had partition leader election problems and ZooKeeper and all these things. Right. We're just like, this doesn't make sense.

Kshitij Grover: One thing that's interesting about both those experiences is that I think in order to get to the inspiration point for WarpStream, you really have to work on real problems at scale.

Like, do you think that's true? Do you think it would have been hard to even dream up the idea of like, oh, let's build Kafka on object storage, if you hadn't seen organizations at the scale of Uber and Datadog really run into these kinds of economic problems? Because, you know, that's not an experience that a ton of folks have coming into their startup, right?

Richie Artoul: Yeah, I think that's 100 percent true. And I think that's also kind of [00:06:00] why Ryan and I made such a good pair as co-founders, because he's very deep in the weeds on the cutting-edge, theoretical, what's-possible side. Can we do the napkin math? Is this possible? And I had a lot of the experience of just grinding my head against a terrible problem in production and eventually coming out with a really reliable system. I think the two of us together really allowed us to figure out how to operationalize what was essentially a massively new architecture really quickly.

But I agree. And I mean, Kafka is also just, you know, I think I was joking about how usage-based billing is such an unsexy problem, and that's kind of the value in solving it. I think Kafka is kind of similar. It's this very kind of enterprise-y thing, you know, and the value is not obvious, I think, to a lot of people who haven't dealt with some of these problems at scale.

But like, if you've worked on these [00:07:00] problems at scale, being able to afford to use Kafka in the middle of your system somewhere is just huge. You know, besides the cost and operations problems of maybe the open source version, it gives you all these benefits that, yeah, are probably not obvious at smaller scale.

Kshitij Grover: So let's talk a little bit about how WarpStream is built. And I have a follow-up there around whether the trade-offs of WarpStream, of rebuilding Kafka on top of object storage, made sense for Datadog and Uber, and, you know, which companies they make sense for. But maybe first let's just talk about architecturally, what is WarpStream? What's both the current product and perhaps, how are you thinking about it in the medium to long term?

Richie Artoul: Yeah. So, today WarpStream is designed to be a drop-in replacement for Apache Kafka. So it supports the Apache Kafka protocol, you know, if you have an application that works with [00:08:00] Kafka, even like a UI admin tool for Kafka, you can connect it to WarpStream and everything should just work. You might notice a couple of quirks, just places where we do things a little bit differently, but the semantics of your application should be pretty much exactly the same. And the value of that, I think, goes back to the thing at Datadog where we finished migrating everything to Husky.

When we were done, we were like, wow, Kafka is not only operationally really difficult, but also really expensive. The thing that happens when you self-host Apache Kafka and you run it in three AZs is that you spend a ton of money on inter-zone networking. And so while WarpStream is designed to be a drop-in replacement, there are two things it does that are kind of novel. One is that it doesn't have any local disks.

It uses object storage as the one and only storage in the system, not even tiered. And what that means is we never replicate data across availability zones ourselves, which means you never pay those inter-zone networking fees, which can be huge. You know, for something like Orb, for example, people are sending you all these events [00:09:00] at high volumes, and the replication fees can basically dominate the workload.
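[Editor's note: to make those fees concrete, here's a rough back-of-the-envelope sketch of where inter-AZ costs come from in a self-hosted, three-AZ Kafka deployment. Every number below is an illustrative assumption, not a figure from the episode.]

```go
package main

import "fmt"

// Back-of-the-envelope inter-AZ networking cost for self-hosted Kafka with
// brokers spread across three AZs. All rates and the workload are assumed
// for illustration only.
func main() {
	const (
		writeMBps       = 100.0            // sustained produce rate (assumed)
		secondsPerMonth = 30 * 24 * 3600.0 // ~one month
		costPerGB       = 0.02             // $/GB inter-AZ, ~$0.01 billed on each side (assumed)
	)
	gbPerMonth := writeMBps * secondsPerMonth / 1024

	// With replication factor 3, each leader ships two replica copies,
	// both of which land in other AZs.
	replicationGB := gbPerMonth * 2
	// And ~2/3 of the time the producer itself sits in a different AZ
	// from the partition leader.
	produceGB := gbPerMonth * 2 / 3

	total := (replicationGB + produceGB) * costPerGB
	fmt.Printf("~%.0f GB/month crossing AZs -> roughly $%.0f/month\n",
		replicationGB+produceGB, total)
}
```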

And the other is just the stateless nature of it, making it kind of like Husky: it trivially auto-scales based on CPU, you can use smaller nodes because you're not over-provisioning, and you just let your autoscaler take care of most things for you. You know, there's no topic partition leaders. Any agent can write data for any topic or partition. There's no ZooKeeper, there's no KRaft, there's no embedded Raft group. The agents are truly these kind of thick stateless proxies. That's the product today. Where we're going, that's a good question. The thing I always tell the team, really, I don't put this on the website because it's not good positioning in the short term, but the way we think about ourselves is that we're a streaming company and a storage company, not a Kafka company. Today I think the highest value thing we can do for the streaming space and for our customers is give them something [00:10:00] that implements the Kafka protocol, saves them a bunch of money, and makes it easier for them to sleep through the night.

But, you know, longer term, I think there are higher level products and abstractions, that will make a lot of sense to kind of generalize some of the really common things that people do in the streaming space, and kind of offer those as more kind of tightly knit products that are a little more, you know, vertically integrated.

But we'll obviously always offer the Kafka protocol product because it's a really useful and scalable infrastructure primitive. But it's not like our company identity.
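[Editor's note: for readers who want a concrete picture of the leaderless, diskless write path Richie describes above, here's a minimal sketch in Go. The types and the metadata-service interface are invented for illustration; this is the general shape of the idea, not WarpStream's actual code.]

```go
package agentsketch

import "fmt"

// Record is a produced message destined for some topic/partition.
type Record struct {
	Topic     string
	Partition int
	Value     []byte
}

// ObjectStore stands in for S3: durable PUTs, no local disks anywhere.
type ObjectStore interface {
	Put(key string, data []byte) error
}

// MetadataService is the ordering authority. Committing a file is what
// sequences it (and fixes offsets) relative to files written concurrently
// by other agents, so no per-partition leader is needed.
type MetadataService interface {
	CommitFile(key string, counts map[string]int) error // partition -> record count
}

type Agent struct {
	store  ObjectStore
	meta   MetadataService
	buffer []Record
}

// Ingest accepts writes for ANY topic/partition: every agent is
// interchangeable, which is what makes the tier stateless.
func (a *Agent) Ingest(r Record) { a.buffer = append(a.buffer, r) }

// Flush runs on a short timer (say, every ~250ms): records for many
// partitions are packed into ONE object to keep S3 request counts down,
// then the file is committed to the metadata service.
func (a *Agent) Flush(key string) error {
	counts := map[string]int{}
	var blob []byte
	for _, r := range a.buffer {
		p := fmt.Sprintf("%s/%d", r.Topic, r.Partition)
		counts[p]++
		blob = append(blob, r.Value...) // a real encoding would frame each record
	}
	if err := a.store.Put(key, blob); err != nil {
		return err
	}
	a.buffer = a.buffer[:0]
	return a.meta.CommitFile(key, counts)
}
```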

Kshitij Grover: And there's a couple of things you said there. One, of course, is the cost savings of the architecture, right? Like the fact that you don't have to do inter-AZ replication.

But the other thing you touched on is the fact that it is stateless, and in some real way it's a simpler architecture. It's simpler to reason about; you don't have to think about ZooKeeper. Which of those problems do you think is the higher order bit? Or when someone adopts WarpStream today, is it [00:11:00] kind of a barbell, where if you're a really large company, what you really care about is the fact that it's much cheaper?

You don't pay the AZ replication costs. And maybe if you're a smaller team, you just want a product that means you don't get paged and don't have to reason about guarantees that perhaps you're not familiar with, and so the stateless nature helps a lot.

Richie Artoul: Yeah, it's a good question. People ask me that a lot too.

And I think canonical wisdom, right, is that to supersede an established, well-known product, the new thing has to be like 10 times better. I think for critical infrastructure like Apache Kafka, it has to be more than 10 times better. Like, think about how commonly people still run, you know, single-node RDS Postgres databases today.

Right. Because it's just a really good product, and it's taken, you know, 30 years before we're even starting to see some of this NewSQL stuff take off and people offering serverless Postgres and stuff. It's just taken that long [00:12:00] because it's such a critical piece of infrastructure.

It's so battle-tested. It's so well known. You have to be that much better. And I think with WarpStream, it's that too. If any of the three things we did, if it wasn't, you know, five to 10 times cheaper, if it wasn't completely stateless and trivial to operate, and if it didn't implement the Apache Kafka protocol so you can do an essentially seamless migration.

If we hadn't done any one of those three things, I think it would probably have failed, not to say that we're a massive success already. But I think those three things are a requirement for a big disruption in this space. Cause a lot of other people have built a lot of other message brokers since Apache Kafka was released.

Tons of them, actually. And I think either they were more complicated than Kafka, or they were a little bit simpler but not that much cheaper. Right? No one's going to do a massive migration for 40 percent cost savings. That sounds cool, but to a new architecture where I have to learn everything? It's not enough.

I'd just stay with Kafka. Or, you [00:13:00] know, they invent some new protocol, right? Then it's like, well, now my application doesn't work. And so with WarpStream, really, I think what tends to happen is that what resonates with the engineers is the simplicity of the operations. They're like, oh yes, S3.

I know that. I like stateless services, I know how to run those. We've got Kubernetes, we do that all day. The cost savings tend to be the thing that helps them push it through the organization, basically. And then they also really like the Kafka protocol compatibility, because they don't have to rewrite their application.

The cost savings thing, though, is interesting, and it's something I think people undervalue. It's one thing if you take, you know, some product that you're paying for per seat or something and you sell it 10 times cheaper. That's one thing. But making core infrastructure 5 to 10x cheaper means you can do new things, right? It's not just about reducing your spend. Everyone's really focused on that right now. But it means you can use this technology for things that [00:14:00] it was previously cost prohibitive to use. And so suddenly, you know, that Uber metrics use case or whatever suddenly becomes, actually, it's more cost effective to send the data through WarpStream than it is to manually send it across availability zones in memory.

Right. And so enabling the technology to be used in more places, I think that's what I find more exciting. But, you know, obviously, being able to tell people, hey, your existing thing will work today and be cheaper, is really cool. And then, you know, the cost slope per byte, instead of being steep, flattens out, and now you can put more data in it.

Kshitij Grover: So let me actually ask the architecture question one level deeper. You know, this idea: let's take the Kafka protocol, let's build it on object storage, let's put everything on object storage, not have any tiering. And conceptually, that sounds kind of simple.

So what's the core insight? Like, why hasn't someone done this before? Why aren't you incurring a ton of API request cost in, you know, [00:15:00] making requests to S3? Is it that S3's cost model has changed over the last year or two, and that's the timing bet? Or is it that you all have been able to generalize some learnings that maybe you had at these previous companies, and these are just non-obvious technical decisions that maybe other companies haven't thought of?

Richie Artoul: It's all of those things. This idea had been ruminating with my co-founder Ryan; the whole architecture and implementation was his idea, and it had been stewing in his brain for, like, six years. He'd just been wanting to do this forever. I'll start with the kind of primary insight I think we made, and this is where people describe our architecture as aggressive, which is that we just looked at our careers and the workloads we know and all the use cases for Kafka.

And we were like, if you look at it, let's say per byte, maybe not by number of use cases, but by raw bytes transmitted through systems like Kafka, more than 90% of that is not latency [00:16:00] sensitive. It's latency sensitive in the sense that you want things to happen in seconds, but it's not like this message needs to get from the producer to the consumer in five milliseconds.

And I'm willing to pay 10 times more and burn an entire 10-person engineering team to accomplish that. Right? It's like, hey, if the P99 from the producer to the consumer was 1.5 seconds, would your business function just as well? And if it was 10 times cheaper, and you weren't getting paged in the middle of the night? Like, everyone's pretty happy with it.

They take the trade off. It's funny, people are like, no, I want to use it for this use case cause it's great. And it's not latency sensitive. I have this other use case. It's latency sensitive, so we can't use it there. But like literally while I'm talking to them, they'll talk themselves out of it.

They'll be like, ah, 10 times more expensive. Is it really that latency sensitive? Like, I'll see it happen. You know, it's kind of funny. So that was kind of the aggressive bet, which, you know, we believed, but it wasn't super clear when we were pitching investors and talking to potential customers and stuff like that, cause it's not until you put stuff out there that you really know. But that's what we felt. So that was part of it. The other [00:17:00] thing, you're talking about how, you know, accomplishing it sounds simple.

It's this weird kind of knot, right, where if you start from first principles, you're like, okay, the naive way to do this is to just make a file every, you know, 250 milliseconds for every topic partition. And like you said, you immediately run into two major problems. One is, okay, I have multiple agents that are producing data for the same partition, so how do I decide which of those files is first? Right. And then it's like, oh, I could have a leader, but now you've introduced leaders, right? And we don't want that. That's one problem. And then the other thing you do is, okay, well, I can just shove a bunch of data for different topic partitions into one file, which is what we did. But once you do that, a bunch of stuff falls out of it, because I can't just tail the files in object storage and read a partition from start to finish.

So now it's like, oh, okay, I need to track metadata, but it's actually a really high-volume metadata scenario that you need to solve. And then you get into that and you're like, okay, how do I solve the [00:18:00] high-volume metadata scenario, and how do I keep things cost effective? And you just have to keep pulling on this thread.

And they're all solvable problems. There's kind of a little trick for each piece, but by making that one trade-off, you end up with something completely different. And, you know, a lot of times people ask me, what's the difference between WarpStream and, doesn't Kafka have tiered storage?

And I'm like, tiered storage works, but you're still copying all the data across the zones, and you still have to deal with topic partition leaders, and you still have all the problems of Kafka. You're just essentially carving off the storage problem. But even if you get it down to, I'm storing five seconds of data on disk and everything else gets copied to S3, going from five seconds to zero means you have to throw the whole thing away and start from scratch, because you've just introduced a bunch of other things that have to be solved.

And so it's a combination of things. I think it's being willing to be aggressive on that latency trade-off, and being able to work through, from first principles, all of the little problems that the latency trade-off and the [00:19:00] stateless nature introduce. It's hard to commit to making such an aggressive trade-off and being willing to work through those first principles. You know, when we started the company, Ryan was like, here's the napkin math.

It'll work and I'm like, that's cool. But will it work? And he's like, I don't know. And so, that's hard. And you know, I've had that with every new system I've worked on too. I mean, there were periods of time when we were, you know, let's say like a year and a half into Husky before it had really gotten to production, but we'd spent a bunch of time on it and we were still dealing with some like major performance problems and stuff.

And we're just like, is this going to work? Like, you know, you start to doubt yourself. And so it's hard.
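[Editor's note: one way to picture the high-volume metadata problem Richie describes a moment earlier: once records for many partitions are packed into shared objects, the read path needs an index mapping each partition's offsets back to files. A hypothetical sketch, with invented structures:]

```go
package metasketch

import "sort"

// FileRange records which slice of one partition's log a given object holds.
type FileRange struct {
	FileKey     string // object key in S3
	StartOffset int64  // first offset for this partition in the file
	EndOffset   int64  // last offset, inclusive
}

// PartitionIndex is ordered by StartOffset. The commit order in the
// metadata store, not a broker leader, is what fixed these offsets.
type PartitionIndex struct {
	ranges []FileRange
}

// Locate turns "fetch this partition at offset N" into "ranged GET against
// that object": binary-search the committed ranges for the file holding N.
func (p *PartitionIndex) Locate(offset int64) (FileRange, bool) {
	i := sort.Search(len(p.ranges), func(i int) bool {
		return p.ranges[i].EndOffset >= offset
	})
	if i == len(p.ranges) || p.ranges[i].StartOffset > offset {
		return FileRange{}, false
	}
	return p.ranges[i], true
}
```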

Kshitij Grover: I guess how long did it take from that napkin math moment of like, I'm guessing what you're thinking about is, well, you know, here's what we know about S3 latency, you know, on the storage class, here's what we know about network times.

And so here's the end-to-end latency we estimate. Do you have a sense of how far off that estimate was from where WarpStream is in production [00:20:00] today? And have there been step jumps, like, okay, prototype one was 100x what we thought it would be, and then we slowly iterated, maybe switched metadata stores, switched the way that you copy data, and something significantly changed?

Richie Artoul: Yeah. So there's definitely, for sure, stuff like that that we've fixed. If you compare the early prototype to now, the first prototype, if you had like a thousand partitions, would just immediately explode. Especially around scalability of the metadata store, and also making that cost effective so we can offer this truly multi-tenant, serverless, scale-to-zero thing. You know, one thing we considered really early on was using FoundationDB as the metadata store.

And I think if we'd built WarpStream at, like, Datadog, that's what we would have done, because it's there, we know how to operate it, and you can just deploy a FoundationDB cluster because it's going to power this giant other workload, so the costs are minimal. But if some [00:21:00] customers send half a megabyte per second and some send five gigabytes per second, and you want to give them something that looks kind of single-tenant, that isn't just a giant shared FDB cluster, you know, where you're going to have all these noisy neighbor problems...

That's why we built something custom, to deal with that. But going back to the early prototypes, we basically spent the first three months of the company doing nothing but trying to prove that the napkin math was right. We actually managed to do a little bit better on end-to-end latency than we thought we would be able to, because we were a bit conservative in our estimates. But yeah, literally the first three months of the company's life was basically us just trying to prove it would work and de-risk things. Most of the stuff turned out to work. I think we did end up realizing some problems were way harder than we thought they were.

The problem that ended up being way more complicated than we thought, where we didn't anticipate how complex it would be, was actually how we would do compaction. This would be a whole [00:22:00] thing, so I won't get into it now, but the way that we do compaction planning for WarpStream is really weird, and it's basically, I think, a relatively novel algorithm for compaction planning that makes sense because of the semantics of Kafka. If you look at WarpStream, it kind of looks like a traditional LSM on S3, right?

Like, you make small files and compact them into big files. But because of the ordering semantics of Kafka, you have to be really precise about which files go together and when. So that was the part that we ended up spending way more time on than I thought, and we ended up having to scrap our initial implementation and redo it.

You know, I had one or two of those existential, like, is this going to work moments and stuff.
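[Editor's note: Richie doesn't walk through WarpStream's compaction planner here, so the following is only a toy illustration of the constraint he's describing: in an LSM-like layout over S3, Kafka's per-partition offset ordering limits which files can be merged together. All structures are invented for the example.]

```go
package compactsketch

// FileMeta lists, per partition, the offset range a file contains.
type FileMeta struct {
	Key    string
	Ranges map[string][2]int64 // partition -> {startOffset, endOffset}
}

// canMergeAdjacent reports whether two files can be compacted together
// without breaking per-partition offset ordering: for every partition they
// share, one file's range must directly continue the other's. A gap means
// some third file holds the offsets in between and must be merged first,
// which is one reason planning here is harder than a generic LSM merge.
func canMergeAdjacent(a, b FileMeta) bool {
	for p, ra := range a.Ranges {
		rb, ok := b.Ranges[p]
		if !ok {
			continue // partition appears in only one of the two files
		}
		if ra[1]+1 != rb[0] && rb[1]+1 != ra[0] {
			return false // gap or overlap: merging would reorder the partition
		}
	}
	return true
}
```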

Kshitij Grover: So actually, one thing that's interesting about that is if you think about the contract of a system like Kafka, you could almost say that latency is part of the contract, and that's the part of the contract that you've relaxed a little bit, right?

In order to rebuild with the same protocol. But clearly [00:23:00] you're not trying to relax a bunch of other constraints. Has that ever been a thought, like, oh, what if we didn't keep, I don't know, partition order? Or what if we didn't keep this other property? Is it just that that would violate how existing applications use Kafka, and so it's a non-starter? Or is there something where you're like, you know, Kafka has this specific idempotency or exactly-once semantic that we're just not going to support, because we don't think most people need it, in the way that most people probably don't need the end-to-end latency that they're getting?

Richie Artoul: Yeah, we definitely considered that. I mean, there were early designs where we didn't have monotonically increasing uint64 offsets. You know, the first protocol we actually started with, I guess I didn't explain this, was actually not Kafka, it was Kinesis. The Kafka protocol is extremely complex, so we started with Kinesis, which is an HTTP protocol, to prove out the rest of the semantics of the system.[00:24:00]

And so we definitely had thoughts about dropping some Kafka features which are particularly annoying to implement in a stateless architecture. And I think, again, that's another thing: if we had built this at a big company, we would have done it. We would have done the easier thing, because who cares?

I can rewrite my application or I can write a client wrapper, right? I control the system end to end. And you've seen that: Pinterest has a system called MemQ, which is essentially, they wrote client wrappers, and they upload a file to S3 and then push a pointer to it through Kafka. And, you know, it adds additional ZooKeepers and it makes the thing more complex, but at a big company you can afford that complexity to not have to do all the crazy stuff WarpStream does. But the more Ryan and I talked to each other and to customers and looked at the Kafka ecosystem, there's not much you can throw away and still have most applications work. Even the latency trade-off we made, it's a relaxation, but a lot of people tune their Kafka clients to buffer and batch for a long time anyways, [00:25:00] because it reduces costs and improves performance. So that's almost like a knob. But stuff like the idempotent producer, even compacted topics, we're like, compacted topics? Who could possibly be using compacted topics? That was our naivety coming from Datadog.

Cause we don't use Kafka as a KV store, but it turns out a lot of stuff in the Kafka ecosystem uses Kafka as a KV store. After we started the company, we were like, wow, Kafka Connect, Debezium, Kafka Streams, none of this stuff works without compacted topics.

Yeah, I think we would have liked to, I just don't think we really had the choice. All the value in this stuff is like, it's cool, the statelessness and blah. But if we didn't implement the Kafka protocol, it would be like a cool science project, right?

The full support for Kafka protocol compatibility is, to me, what makes it a business. And it's that unsexy part of it that, you know, lets you sell it to people and continue developing it.
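[Editor's note: Richie mentioned above that many teams already tune their Kafka clients to buffer and batch aggressively. For readers who haven't touched those knobs, here's what that looks like with the open source segmentio/kafka-go client; the endpoint, topic, and values are placeholders, not recommendations from the episode.]

```go
package main

import (
	"context"
	"time"

	"github.com/segmentio/kafka-go"
)

// Client-side batching with a standard Kafka client. Because WarpStream
// speaks the Kafka protocol, these are the same knobs people already tune
// against Kafka itself.
func main() {
	w := &kafka.Writer{
		Addr:         kafka.TCP("localhost:9092"), // Kafka or any Kafka-compatible endpoint
		Topic:        "events",
		BatchSize:    10000,                  // accumulate up to this many records...
		BatchTimeout: 500 * time.Millisecond, // ...or flush after this long
		RequiredAcks: kafka.RequireAll,
	}
	defer w.Close()

	// Messages are buffered and shipped in large batches, trading a little
	// end-to-end latency for far fewer, larger requests.
	_ = w.WriteMessages(context.Background(),
		kafka.Message{Value: []byte(`{"event":"example"}`)},
	)
}
```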

Kshitij Grover: Yeah. And we've been talking about the sort of complexity, and even the sort of infrastructure that small teams can afford to reason about or run, [00:26:00] versus large teams who can, you know, do things like spin up a FoundationDB cluster.

And then obviously they're just working with a different set of trade-offs, where they're not building a multi-tenant environment. So tell me a little bit about, and I've heard you talk about this elsewhere, the idea that smaller teams can just build more ambitious products today. Obviously WarpStream, very core technical infrastructure, is a very ambitious product.

Where do you think that leverage is coming from? Is it just the composability of these cloud providers who have built, you know, progressively better primitives?

Richie Artoul: Yeah, I mean, that's what I think. You know, if this were a different interview, I'd be talking about AI right now. I have found AI to be supremely unuseful in building infrastructure software.

But people sleep on what you can do. I mean, we have some Terraform modules, written by basically two people, that stand up production-grade enterprise infrastructure, whatever it [00:27:00] creates. I mean, the way you can program networks in AWS with a couple of lines of Terraform code is like magic, right?

Like, if you think about what you'd have to do to accomplish that level of sophistication in a colo or whatever. Even with WarpStream, our entire control plane and back end, besides essentially our Aurora database for storing what customers we have, there are no disks anywhere in the entire stack.

I could go wipe every container in our back end and it would make a bunch of people very unhappy, but there would be no data loss. You know, everything is built on cloud primitives, like object storage and DynamoDB and stuff like that. And the fact that you can just get those things, pay for them in like a serverless manner and get going.

It's absolutely crazy. I think it's easy to overuse those things and introduce a ton of complexity. I've talked to some people where it's like a six-person startup and they have 40 repos [00:28:00] and 30 microservices.

I'm like, wait. But I think if you kind of know what you're doing, you take a couple of key dependencies on really battle-tested, scalable, good stuff, like object storage, DynamoDB, application load balancers, that sort of thing, and you try and stay within the confines of those primitives.

You can, I mean, you can build really crazy stuff.

Kshitij Grover: And in fact, you all have built two different versions of WarpStream today, where there's a kind of bring-your-own-infrastructure model. My understanding is that in that model, you're deploying WarpStream on, basically, Kubernetes in the customer's VPC. And then there's a fully serverless, hosted-by-WarpStream model.

So let's talk a little bit about the architecture differences, and why choose to build both? Especially when you all are, you know, still pretty early, still trying to understand the viability of the core offering. So why build both from the get-go?

Richie Artoul: [00:29:00] Yeah, that was actually a really tough call, and I wouldn't say we did it from the get go, but we did do it pretty early.

We started with just the BYOC product, because to me, it's the most differentiated product. The fact that you can get something that runs completely in your environment, but is so simple that it's as if you have a SaaS, but you don't. Your data never leaves your cloud account. You have complete data sovereignty; it's all in your bucket.

All the consensus is handled remotely. You don't have to think about that. You don't pay for any inter-zone networking fees, and it can actually be five times cheaper than running open source Kafka in your own environment. To me, that's a very differentiated and clear product. Whereas with the serverless product, once the Kafka protocol is hosted by somebody else, the differentiation becomes less obvious.

You're differentiating on other things, scale to zero, price, and it's not as clear. And so we didn't start with a fully hosted serverless product. But the thing we kept running into was [00:30:00] that, you know, our pitch is that it's the easiest way to do Kafka, the easiest way to do streaming. And even just deploying some stateless containers into your environment...

Maybe you need to talk to your boss, you've got to figure out the Helm chart, you've got to do some Terraform. Whereas, click a button, you know, Kafka is a URL, just get some credentials. That's easy. It just helps us engage with smaller customers. We've gotten a ton of feedback from people where it just doesn't make sense for them to deploy the containers themselves, but they use our serverless product in production and they're really happy with it.

I think we've differentiated the product enough with the scale to zero and the way the pricing works. But it's really just about, I think like kind of that mission of making it as simple and easy as possible. We had to, I think, have that offering. And there's a really nice story, right? Which is like, if you're small scale, where the pricing and the networking fees don't matter, you can get these serverless clusters.

They scale to zero when you're not using them. Even if you do encounter some decent traction and you take off, the pricing is still pretty reasonable. And then when you're [00:31:00] ready, as you mature as an organization, you can always move to BYOC and run your own agents.

Kshitij Grover: So let's actually talk about the pricing part of WarpStream just because that is also a core focus of the value prop, right?

So how do you think about pricing for serverless versus BYOC, where have you put in more energy? And it doesn't have to be exact price points. Like, obviously those will evolve over time as the product evolves. But, yeah, like how did you think about that?

Richie Artoul: So the thing we knew when we started this was we wanted to have very predictable, transparent, and what we considered, you know, fair pricing.

Fair is probably not the right word. Principled is probably a better word. But we did not want to be, I won't name any names, okay, but we did not want to be the really expensive, bougie vendor that only the people who have lots of money to burn can afford to use. We wanted to be [00:32:00] the AWS of this space. You're paying money for it because it's a good product and people need to stay in business, but it is a core primitive that you can scale your business on. It's not like, oh, if I have really high scale, I have to go back to self-hosting. Right. The other way we thought about it: I hate when things are priced per core or per node or something. I hate that stuff, especially for infrastructure. Cause I'm like, you're writing the software, it's your job to make it fast. Why are you charging me because your software is slow? That's dumb. We should both be incentivized to make the software fast.

Right? And scalable. Especially with our architecture, where it's stateless, I tell people all the time: you can have these nodes only handle writes, these nodes only handle reads, and these nodes do background jobs, and now you have full isolation between your writers and readers. And that sounds scammy if I'm telling you that but also saying, I'm going to charge you three times as much to do it, right?

You shouldn't have to pay three times as much to do that. And so what we settled on was a [00:33:00] usage-based pricing model where everything's metered, and for the BYOC product, it's only on three dimensions. If you have a cluster and it's doing stuff, that's $100 a month, metered in 15-minute increments; there's just some overhead for small clusters because of how our metadata store works.

And then we just charge you for how much data you write and how much data you store. And the principled approach I took at the time was, I estimated how much it costs to run a self-hosted Apache Kafka cluster at a certain scale, and that's where our pricing started, basically. And I'm like, you're getting a much better product and it's the exact same price.

But then as you use it at scale, the price tiers down super aggressively, and in public, right? You don't have to go argue with a sales guy for two years to get the price to go down. You can look at it and project: if I'm a successful company and this use case grows...

Here's how much it's gonna cost, and that's public, on our website. And to me, if you're [00:34:00] doing core infrastructure and you're trying to sell to infrastructure engineers, it's got to be like that, right? That's how AWS does its pricing. Well, AWS doesn't have super public tiers, but close, right?

Yeah.
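[Editor's note: as a toy reading of the metered BYOC model Richie describes (a base cluster fee plus charges for data written and stored), here's a sketch of a monthly bill estimate. The dollar rates and the workload are invented placeholders, not WarpStream's published prices.]

```go
package main

import "fmt"

// Toy bill estimate for a metered, three-dimension pricing model.
// All rates below are hypothetical, chosen only to show the arithmetic.
func main() {
	const (
		baseUSDPerMonth   = 100.0 // "if you have a cluster and it's doing stuff"
		writeUSDPerGB     = 0.05  // hypothetical per-GB-written rate
		storageUSDPerGBMo = 0.02  // hypothetical per-GB-stored rate
	)
	writtenGB := 50_000.0 // data produced this month (example workload)
	storedGB := 200_000.0 // average data retained over the month

	bill := baseUSDPerMonth + writtenGB*writeUSDPerGB + storedGB*storageUSDPerGBMo
	fmt.Printf("estimated bill: $%.2f/month\n", bill)
}
```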

Kshitij Grover: Let's talk a little bit about the fact that, you know, with WarpStream, I think regardless of the deployment model, your infrastructure, your metadata store in particular, is in the critical path of any read and write request.

Right. That's a lot of responsibility. And I'm wondering, especially again for something so core to someone else's technical stack, how are you navigating that? Is that an objection you get? Is that something where people run stress tests and you can kind of just prove out reliability?

How do you, maybe from a technical standpoint, think about reliability? And then from a go-to-market sales standpoint, think about the talk track around it?

Richie Artoul: Yeah, it's a really good question, and I don't know if that's [00:35:00] a thing we'll ever be done with; there's no checkbox answer. It's just something that has to be part of our core philosophy, part of our core engineering culture.

And we just have to live and breathe it for the rest of the company's life. From a technical perspective, you know, I've worked on tons of systems like this. Datadog was really critical. I still remember, there were times when Datadog had incidents or things went down, and you would be surprised at the things in the world that have to turn off because they don't have visibility.

And that was a lot of pressure that Datadog bore. Same at Uber, right? People get stranded in the middle of nowhere. You know, you think you're just running some backend, but there's someone drunk, two in the morning, maybe in an unsafe area, you know, they can't get a ride home or whatever.

And so that's been baked into the company culture very early on. From a pure technical perspective of how we do it, you know, there's the obvious stuff about relying on really scalable, rock-solid primitives, having good rate limiting in place, structuring everything so that there's isolation [00:36:00] between tenants.

We're really big fans of the cellular architecture. Datadog is a heavily cellular-based architecture, and WarpStream is, or will be, the same, which is basically, you know, you have these cells, and a cell is like a copy-paste of your entire infrastructure stack. A cell may be multi-tenant, and there's a bunch of constraints within that cell to prevent anyone from using too many of the resources.

But on top of that, cellular boundaries are hard boundaries, and no matter what someone does in one cell, it can never impact another cell. That's really helpful because it reduces the blast radius when things do inevitably go wrong. And beyond all of that, for bigger workloads, we have stuff like single-tenant control plane cells.

So, you know, if the workload's big enough or the customer is willing to pay for it, we can give you an essentially completely isolated stack. And yeah, that sounds crazy, but because of the architecture of the product, it's not [00:37:00] really: you add one more thing to an array in a Terraform provider, and it's not like there are more nodes that we have to rotate or anything.

It mostly manages itself. And also, just because of the simplicity of the control plane, even at really mega scale, it's tenable for customers to self-host the control plane. It's not what I generally recommend to people, because you're probably not going to be nearly as good at operating it as us, but there are certain environments, things need to be air-gapped, whatever it is, where that makes sense. And so it's something we think about all the time. Probably more than half of all our engineering resources go into resource constraints, limits, rate limiting, all that type of stuff.
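[Editor's note: a minimal sketch of the cell-based isolation Richie describes, with invented types. The point is the hard boundary: a tenant resolves to exactly one cell, and nothing routes across cells.]

```go
package cellsketch

import "fmt"

// Cell is a full copy-paste of the infrastructure stack with its own limits.
type Cell struct {
	ID        string
	Endpoint  string
	RateLimit int // per-cell constraints bound the blast radius
}

// Router maps tenants to cells. Assignment happens once, at onboarding.
type Router struct {
	cells      map[string]*Cell
	tenantCell map[string]string // tenantID -> cellID
}

// Resolve returns the single cell that serves a tenant. There is no
// fallback to another cell: overload or failure stays inside one cell,
// no matter what any tenant in it does.
func (r *Router) Resolve(tenantID string) (*Cell, error) {
	cellID, ok := r.tenantCell[tenantID]
	if !ok {
		return nil, fmt.Errorf("tenant %q not assigned to a cell", tenantID)
	}
	return r.cells[cellID], nil
}
```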

On the go-to-market side, a lot of it is about having good answers, right? A lot of times when I talk to people, they're like, okay, so how does the control plane work? And, you know, if the answer is, oh, we have these custom Raft nodes that we run with local disks, and I've got this guy and he's really [00:38:00] smart.

That's not a super satisfying answer. But when I tell people: we run four replicas per tenant of the control plane, and the only dependencies we take are that the network and AWS need to work, and object storage and DynamoDB need to be up. And even if object storage is down, we can keep running for a while, but DynamoDB needs to be up.

They're usually very happy. A lot of these people are running in AWS. They know how rock-solid that infrastructure is. They know that they're not going to run anything more reliably than AWS. And that's a satisfying answer. Sometimes it goes a little bit beyond that.

And I think a lot of it, too, is your brand and reputation as a company. Do people trust you? No one thinks twice about putting all their company infrastructure in AWS, because AWS has a really good brand.

They take incidents seriously. AWS is an extremely multi-tenant environment, but you know that they've written their own hypervisors and they've done everything they possibly could to make it good and scalable. And that just comes with time and brand and reputation.[00:39:00]

And that's something we talk about a lot on the team, too: the way we put ourselves out there in the world. You know, I don't want to be overly serious, but we need to be professional, people need to trust us, and we need to do everything as responsibly and securely as we can.

And not get caught up in the stuff that can happen on the internet when people aren't being professional, basically.

Kshitij Grover: Yeah. Yeah. All right. Well, maybe on a slightly less serious note, last thing I'll ask, what are you most excited about? It could be a technical problem you all are working on, a new feature rollout. What's coming next? Where are you putting at least a lot of the positive energy at WarpStream?

Richie Artoul: Dude, we got so much good stuff coming out. We're sponsoring Kafka Summit in London, March 20th. We're about to roll out, actually, I just launched it, infinite retention for all customers. So you can do that now.

We're going to roll out full support for compacted topics in Kafka. So you can use us as a KV store [00:40:00] if that's what your heart desires. Our implementation of compacted topics is actually really good. I'm really excited about it. We put a lot of engineering effort into making it good. It's not like a half baked early release.

So I'm really excited about that cause that opens up a whole new world of the Kafka ecosystem that can use us. Our serverless product is still kind of in alpha, but we've started getting a couple people in production. I'm looking forward to maturing that. The BYOC product is going to go GA in March, I'm really excited about that too. And I think kind of Q2, Q3, we're going to start looking a little bit beyond the Kafka protocol. And I think that'll be really cool stuff like server side partitioning, native Iceberg integrations, that type of stuff. Stuff that really, like, allows you to do new things that you just can't really do with traditional Kafka today, instead of just being a cheaper, easier-to-run mousetrap. So we got a lot of really cool stuff coming down the pipe.

Kshitij Grover: Awesome. Well, that's very exciting and it was great. Great having you. Thanks for coming on.

Richie Artoul: Thanks for having me, man. This [00:41:00] is great.