Tech on the Rocks

Bonus Episode 15, Season 1

Reinventing Stream Processing: From LinkedIn to Responsive with Apurva Mehta

Summary

In this episode, Apurva Mehta, co-founder and CEO of Responsive, recounts his extensive journey in stream processing—from his early work at LinkedIn and Confluent to his current venture at Responsive.

He explains how stream processing evolved from simple event ingestion and graph indexing to powering complex, stateful applications such as search indexing, inventory management, and trade settlement.

Apurva clarifies the often-misunderstood concept of “real time,” arguing that low latency (often in the one- to two-second range) is more accurate for many applications than the instantaneous response many assume. He delves into the challenges of state management, discussing the limitations of embedded state stores like RocksDB and traditional databases (e.g., Postgres) when faced with high update rates and complex transactional requirements.

The conversation also covers the trade-offs between SQL-based streaming interfaces and more flexible APIs, and how Responsive is innovating by decoupling state from compute—leveraging remote state solutions built on object stores (like S3) with specialized systems such as SlateDB—to improve elasticity, cost efficiency, and operational simplicity in mission-critical applications.

Chapters

00:00 Introduction to Apurva Mehta and Streaming Background
08:50 Defining Real-Time in Streaming Contexts
14:18 Challenges of Stateful Stream Processing
19:50 Comparing Streaming Processing with Traditional Databases
26:38 Product Perspectives on Streaming vs Analytical Systems
31:10 Operational Rigor and Business Opportunities
38:31 Developers' Needs: Beyond SQL
45:53 Simplifying Infrastructure: The Cost of Complexity
51:03 The Future of Streaming Applications

What is Tech on the Rocks?

Join Kostas and Nitay as they speak with amazingly smart people who are building the next generation of technology, from hardware to cloud compute.

Tech on the Rocks is for people who are curious about the foundations of the tech industry.

Recorded primarily from our offices and homes, but one day we hope to record in a bar somewhere.

Cheers!

Nitay (00:01.451)
Apurva, great to have you here with us today on the podcast. Why don't we start with a bit of your background and work history?

Apurva Mehta (00:09.006)
Yeah, thanks, Nitay and Kostas. Nice to be here. I'll give a one-minute intro. I'm currently co-founder and CEO of Responsive. We're building a stream processing solution for event-driven apps, which I'm sure we'll talk about more on this podcast. Before this, I was at Confluent for almost seven years, where I ran the stream processing teams. I was an early engineer there; one of the early things I did was add transactions to Kafka, and then I moved on to

stream processing, and I'm still doing stream processing. Before Confluent, I was at LinkedIn, where one of the first things I did was add some stream processors to do ingestion into the graph database, and later into their search index. So in some sense, since 2013, I've been either building these apps or building the systems that these stream processing apps run on. That's been the majority of my career; long before that, I was at Yahoo doing other stuff.

But that's my intro.

Nitay (01:11.691)
That's amazing. You've really seen the entire streaming space start up and become the huge category it is today. So why don't you take us back a little bit to those early days. What was it like back at LinkedIn and in the early days of Confluent, in terms of what was around at that point streaming-wise? What did you guys create, and why?

Apurva Mehta (01:31.214)
Yeah. So at LinkedIn, I mean, I started in 2013, so Kafka was already a thing there. It was already deployed, right? I think Kafka originated in 2010, 2011, something like that, so this was pretty much two years in. I think it was getting open-source adoption outside of LinkedIn too at that point. But what we were focused on, and I think this is still true of what I'm focused on today: for example, the first thing I did was write a kind of stream processing job

to index into the graph database. LinkedIn has a graph database, which is kind of an index of all connections; at least at that time they had this system. So every time you connect, like you connect with me, that is an event that goes into Kafka, and then it has to be indexed into the graph database. And that graph database is basically the foundation of everything at LinkedIn. Can I see your profile? It depends on how many degrees apart we are, right?

What shows up on your feed is sorted by connection distance on the graph; search rankings; all of it. So that graph database is heavily used, and it has to be indexed at very low latency. And there are two types of indexes. One is just connections: simple. Then there's second degree: I need to be added to all my second-degree connections, which is more complicated and is kept as a cache.

In any case, that was the first one. It was a core application, sophisticated, especially this indexing problem. Now, at that time there was one major use case of Kafka: get the data into Hadoop, with all sorts of jobs running downstream of that, like People You May Know, stuff of that nature. And metrics, monitoring the website, started, I think, using data downstream of Kafka.

And then there were these application teams, effectively, like search infrastructure and the graph database, that used Kafka events to make real-time updates to their databases to drive their products. I would say that use case breakdown is still probably true today.

Apurva Mehta (03:47.758)
And so I continued that work. Then I moved to search at LinkedIn, where we built our own stream processor to do search index updates, which are way more complicated than graph index updates, right? Because there's relevance and all sorts of things that go into having a live search index; it's so much more than just a key-value update. So that was our own stream processor. We called it the live updater, again downstream of Kafka.

And then they moved that to Samza later, or they tried; that was after I left. So that was a very interesting journey. I don't know what they do now, frankly, but that Samza stuff was very difficult for many reasons, which we can get into. And then I moved to Confluent... actually, first of all, let me pause. Any questions, or should I just keep going?

Nitay (04:42.539)
Yeah, I think there are plenty of questions; there's lots of interesting stuff there. In particular, you mentioned complex computations happening in real time to update indexes, right? First the graph database, then search. Why are those computations so complex, and how do the requirements there start to bleed into the stream processing itself?

Apurva Mehta (05:05.304)
Let's take search. Actually, the graph one first: I'm connected with you, I update a key-value store with a list of your connections; it's simple, and vice versa. But now consider, for example, updating the second-degree cache. Based on my connections and based on your connections, my second degree has changed. That is a set operation, a little more complicated.

That's an example of complexity: it's just computationally intensive. And if you have many connection requests or some load spike, you have to manage that. But the thing is, it's not stateful, because you can just look up two lists and do an intersection, so it's not that complicated. It becomes much more complicated when those operations are stateful, when the stream processor has to materialize some intermediate state to perform that computation.
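
To make the stateless case concrete, here is a minimal Java sketch of the kind of set operation described above: when two members connect, one side's new second-degree additions can be derived from two first-degree lists. The class, store, and method names are all hypothetical, purely for illustration; this is not LinkedIn's actual code.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, stateless second-degree update: everything the computation
// needs is looked up fresh per event, so the processor holds no state.
public class SecondDegreeUpdater {

    // Assumed external key-value store: memberId -> first-degree connections.
    private final Map<String, Set<String>> firstDegree;

    public SecondDegreeUpdater(Map<String, Set<String>> firstDegree) {
        this.firstDegree = firstDegree;
    }

    // Called for each "A connected with B" event from Kafka.
    public Set<String> newSecondDegreeFor(String a, String b) {
        Set<String> result = new HashSet<>(firstDegree.getOrDefault(b, Set.of()));
        // B's connections become A's second degree, minus the people A
        // already knows directly, and minus A itself.
        result.removeAll(firstDegree.getOrDefault(a, Set.of()));
        result.remove(a);
        return result;
    }
}
```

The point is that nothing here survives between events: every input is looked up when the event arrives, so the processor itself stays stateless.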

So, search. Let me refresh my memory; I wasn't prepared for this one. Basically, we had the base index in search. Say, people search: you have a LinkedIn profile, all of the details on your profile are indexed, and your connections are indexed in the search. But you may update your profile; you may change stuff. The base index was computed on Hadoop; it was a batch job.

It would basically take a snapshot of your data at a point in time and generate your index entry, so you can show up in search results for various queries. Now, if you change your profile, you basically have to run that entire job again on the delta. I think we cached some stuff on the live side, but we definitely had to produce an index.

So the output of the stream processor was a Lucene snapshot, in real terms. That's actually stateful output: you accumulate a bunch of changes into the Lucene snapshot, and then your search would go to multiple indexes and there'd be a join at the top. But essentially that stream processor was stateful, because you have a snapshot of the index on an ongoing basis. And eventually you'd compact it down to the base and then

Apurva Mehta (07:23.438)
reset, right? So that's what makes it complex: now you have state in the stream processor, and you have to manage failover, consistency of that state, replication of that state, and so on. You can think of the search index at a very high level as an aggregation, right? You're accumulating updates into the state per key, and as events come in, you have to keep doing that accumulation.
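
As a rough illustration of that per-key accumulation pattern, here is a minimal Kafka Streams aggregation. The topic names and the string-concatenation "index entry" are invented stand-ins; a real Lucene index build would be far more involved.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

// Minimal per-key aggregation sketch: fold each profile-update event
// into a per-member "index entry". All names here are hypothetical.
public class LiveIndexAggregation {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "live-index-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> entries = builder
            .<String, String>stream("profile-updates")            // key: memberId
            .groupByKey()
            .aggregate(() -> "",                                  // initial state per key
                (memberId, update, entry) -> entry + "|" + update); // accumulate event into state

        entries.toStream().to("live-index-entries");
        new KafkaStreams(builder.build(), props).start();         // runs until the JVM exits
    }
}
```

The framework materializes that KTable as a state store, and that store is exactly the state whose failover, consistency, and replication have to be managed.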

And that's when it starts getting super complicated, because now you have to manage a stateful system that is scaling. That's the core of it: it becomes a distributed systems problem. And the stream processing framework should manage that for you, because the app team, the person writing the indexing job, shouldn't have to worry about how to make that run well.

And that's the challenge, right? If the stream processing framework doesn't do a good job, you have to solve a lot of these infrastructure issues that a database or some other piece of infrastructure should solve for you. And that's where it gets complicated.

Kostas (08:35.934)
One quick question, and then I'll give the microphone back to Nitay to continue with his questions, but I really have to ask this, because I always do with people in streaming who use the term real time. So what is real time? I mean, back then, right? It's a two-part question. One part is from your experience of what real time meant back then, in the context of updating these indices and the

graph in a product like LinkedIn, which is massive, right? But also, from your experience in all these years of people using Kafka and stream processing to do real time: at the end of the day, what is real time in terms of wall-clock time? Because many people say real time, but it has different semantics depending on the use case. So I'd love to hear your take on that.

Apurva Mehta (09:30.926)
Yeah.

Apurva Mehta (09:34.914)
Yeah, no, I think it's contextual. I don't like using the term real time; I don't know if I said real time at all, we should go back and look at the transcript. I just generally say low-latency, sophisticated computations. That's where stream processing is a good fit. Because real time... you flip a switch in your house, and your light turning on is not a stream processing problem. That's real real time, with low tolerance and very low latency requirements.

If you want to put a number on it, I think stream processing fits where seconds is good enough, where you can tolerate some variance. And it also depends on the output requirements: events happen and something else needs to happen as a result, and the more sophisticated that something else gets, with an up-to-one-to-two-second reaction time, maybe higher,

that's the sweet spot. So it's not like your self-driving car has a stream processor running on board; that's not going to work. But that same self-driving car could be beaming data back, and you could be computing things in the data center about local traffic, local weather, local conditions, that then update each car. I think Tesla is starting to do this, right? But that

could be a stream processing problem, at the data center level, because you can tolerate waiting a couple of seconds to react to changing conditions, as long as the output is good. I don't know if that's a real example, I don't know exactly what they do, but I think it's a good hypothetical contrast. What happens on the car, so the car turns when something happens, that's not stream processing. But the car updating its model based on

changing conditions locally, that could be a stream processing problem, because it's sophisticated and has to be low latency. You don't want the update one hour later; you want the update about changing conditions as soon as possible, but it doesn't have to be instant, and you can tolerate variance. Does that answer your question? I think it's a nuanced topic. But if you have sophisticated responses with

Apurva Mehta (11:59.924)
optimally zero latency, but with some slack allowed, that's where stream processing is a great fit.

Kostas (12:04.522)
Yeah.

No, yeah, I actually love the terminology you're using: that real time is not a good term, and that low-latency requirements is a much better way of talking about this stuff. I think it makes things much clearer. And then, sure,

each use case might have its own requirements for what the latency should be and how much slack it can allow, and all these things. Maybe it's a marketing-caused problem: low latency might be a little harder for the majority of people to understand, while real time is easier. But at the end of the day, I think it causes

problems when you're trying to communicate what a real-time system is. I've seen that when talking with people who are using, say, interactive analytics: is interactive analytics real time? What does that mean? Why is it different from stream processing? When do we need one, and when do we need the other? It gets really complicated in terms of the systems you need.

Right? A few seconds versus milliseconds makes a huge difference in what kind of system you need to have there. So I think it is important to use the right terminology, and I love the way you put it. Awesome, thank you, that's great. I think that's going to help a lot later on, too, when we talk about some of the new stuff. Nitay, the microphone is back to you.

Nitay (13:56.075)
Yeah, thank you. It was a great question, because having built data systems, and in particular data systems for non-technical people at my last company, I have personally seen people in a meeting saying real time the entire meeting: real time, real time, real time. And then I pop up and ask: what do you mean when you say real time, just so we're talking the same language? And they say, you know, anything within two to four hours. And at that point I say, okay,

great, no problem. If that's what you mean by real time, I can go to sleep and do this, right? So some people say it and mean milliseconds; other people say it and mean hours. And we technologists know that moving from milliseconds to seconds to minutes to hours is a complete architectural change with huge ramifications. So it's great to set the context there. Going back to the stateful part a little bit.

That part is super interesting to me, because I think a lot of engineers and developers are familiar with the idea: I should try to make things as stateless as possible; I build stateless microservices; the more stateless something is, the more I can scale it out, and so forth. But it turns out, and I think this is what you're alluding to, there's only a certain set of problems you can solve that way. For some set of problems you have to have stateful computation; you have to

do that and lean into it. So tell us a bit more about that class of problems, why it's hard, and what you did at Confluent, leading into Responsive, obviously, in terms of making that easier for people.

Apurva Mehta (15:36.408)
So you're asking specifically about stateful stream computation: why it's necessary, what makes it hard, and what the industry as a whole has been doing about it. Is that the question? Yeah. So, when would you need stateful stream processing? There are two ways I can talk about it. One, we could talk about the core operators that are stateful; I think that's maybe less interesting. Let's talk about the use cases, and then we can

Nitay (15:39.605)
Mm-hmm.

Nitay (15:48.651)
Mm-hmm.

Apurva Mehta (16:05.23)
go maybe one level deeper. In terms of use cases, if we go back to what we just said about sophisticated computations with low latency: take verticals, like logistics, with inventory management as maybe a subset of logistics.

In that domain, Kafka is growing very fast as a technology across the whole sector. And what people are building off of Kafka is all sorts of tracking of where things are in a supply chain, inventory at different places in a warehouse, in a factory, and so on. That's the broad logistics use case on Kafka.

Okay. And then, within the applications you build for that use case, what is stateful? For example, one of the companies we're talking to builds a platform that factories buy to help manage inventory in a warehouse or a factory.

Workers are in a factory, and there are vending machines of machine parts, or whatever. What their software does is figure out the part level in each vending machine at a given point in time. And if inventory levels are running low, they need to alert people that this needs to be refilled, because once

it's zero, lots of second-order impacts happen: things stop being produced. So that's a set of applications, and those applications have to materialize the current state. Someone took a part out of a machine; now I have to deduct the count there. That is state. That means you need to have the count to start with.

Apurva Mehta (18:30.794)
Then, when the count falls below X, or the rate of flow hits a certain level, that's custom logic: I need to send an alert. Did the alert get a response? No, so I need to send another alert; maybe I need to escalate to a text or a call. There's so much logic there, by the way. For all of those decisions, the app has to remember; it has to know what happened before, fundamentally speaking, to do any of what I just said. So that's state.
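
A toy version of that count-and-alert logic, using the Kafka Streams Processor API, might look like the following. The topic, store name, threshold, and alert format are all invented for illustration; the real platform's logic (alert escalation, response tracking) would carry much more state.

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Hypothetical stateful processor: tracks a per-machine part count and
// emits an alert record when it falls below a threshold.
public class InventoryAlertProcessor implements Processor<String, Integer, String, String> {
    private static final int THRESHOLD = 5;          // invented refill threshold
    private KeyValueStore<String, Integer> counts;
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        // "part-counts" must be attached to this processor in the topology.
        this.counts = context.getStateStore("part-counts");
    }

    @Override
    public void process(Record<String, Integer> record) {
        // key: vending machine id, value: +/- change in part count
        Integer current = counts.get(record.key());
        int updated = (current == null ? 0 : current) + record.value();
        counts.put(record.key(), updated);           // the state the app must remember

        if (updated < THRESHOLD) {
            context.forward(record.withValue(
                "refill " + record.key() + ", only " + updated + " left"));
        }
    }
}
```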

Fundamentally, maintaining that state, what the sequence of events was and what the current state of the world is as a result of that sequence, allows you to do many more things that are significantly more useful. That's the practical example; the theory is basically what I just said. You could have just a sequence and react

to it like a simple automaton. But if you remember the sequence, you can do much more. This inventory example is a clear example of that: you maintain the current state of everything in real time, and if you also remember what you did in response to that state, your responses can get more and more sophisticated, more and more useful. And that's why you do it. First of all, does that make sense?

Yeah. And it's the same attraction in search: maintaining a live index, that's state again. So those kinds of things are super useful, super essential. I would say that for Kafka as a category, or data streaming as a category, to be widely adopted, you need stateful stream processing. Without it, it's kind of a toy. How much can you filter?

Nitay (19:54.443)
Yeah, absolutely.

Apurva Mehta (20:22.83)
You can project stuff, maybe you can branch and route events here and there, but that's about it. So it's more of a toy, less interesting. Stateful stream processing is what actually makes it interesting for a wide variety of applications. And from there we can go to the next part: what makes it hard and what people have been doing about it.

Apurva Mehta (20:50.658)
First of all, let me pause. Any questions or comments?

Kostas (20:53.598)
Yeah, I have a question that might also help us move to the next part. We're talking about state: persisting state, writing and reading state, and doing computation on it. And the first thing that comes to my mind is a transactional database like Postgres, right? That's why we have these systems: so we can persist state and manipulate it.

I'm not talking about the OLAP systems here; they have different latency requirements, obviously. But transactional systems are kind of built for this, right? So why do we need a different category of systems to do stream processing? Why can't we just have a reader on our Kafka topic that reads the data, persists it to something like Postgres, and have a trigger there

that calculates something and writes it to another table, and then another reader reads from that and delivers the result? Why is that not enough, so that we need something else?

Apurva Mehta (22:05.806)
No, I think it is enough; it could be enough for a lot of people, right? A lot of people actually build effectively reactive systems on Kafka using Postgres and some custom thing to do their job. There's nothing fundamentally wrong with that; it could be a very valid solution for a wide range of problems. But I think there are a couple of distinctions

Kostas (22:08.884)
Mm-hmm.

Apurva Mehta (22:34.966)
you know, where you need something more, right?

I would actually recommend that people use Postgres with stream processors as far as they can; unfortunately, most stream processors don't give you that option. The biggest reason not to do it is cost and scale. Streaming systems typically have high update rates.

Stateful operations are really joins and aggregates, which are basically range scans and key lookups. Postgres becomes highly inefficient at scale, because fundamentally you have to index at these high update rates, and you're using a massively complicated index structure, database structure, and query system to do very simple queries. It gets very inefficient very fast. But at low volume, if one database instance can do it, why not?

You can't get simpler than that, right? So it's fine. Or you may be willing to pay that inefficiency because you don't want to add more to your tech stack; that's fine too. So I think the biggest delta is what the operations are and how efficient traditional databases are at them, and in many cases even the capability of those databases to serve those operations at all. Some of the people we work with see 100,000, 200,000, 300,000 events per second. Try indexing that in Postgres; it's not going to happen.

So in some cases it's not feasible, and in some cases it's not practical from an efficiency and cost perspective. Where those concerns don't apply, you can use it. That's number one. But the other thing that makes it challenging is, again, you said transactional. How do you think about transactions? How do you model this? We haven't talked about this side of stream processing at all, but there's this whole correctness, data modeling,

Apurva Mehta (24:34.388)
and operator semantics angle to it, which is a whole half of the problem, really, that I'm not even talking about; we're just talking about the infrastructure. But now that you're talking about Postgres, that's the other thing: Kafka has transactions, and Postgres has transactions, and you need to reconcile those two systems' transactions, which is also a hard problem. So

in theory you can do it; in practice you have to be pretty good to do it correctly, especially with strict transaction requirements like exactly-once processing and so on. But if that problem is solved, I would say Postgres, or any traditional database, is actually not a bad store of state for your streaming computation. It would be preferable for many people, honestly, because there are many benefits to using those databases.
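
One classic way to reconcile the two transaction domains is to commit the consumed Kafka offsets in the same Postgres transaction as the state update, so neither can be applied without the other. Here is a minimal, hypothetical sketch of that pattern (table names and schema invented); real exactly-once pipelines also need to handle restarts, rebalances, and output publishing.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical sketch: state update and consumed-offset commit happen in
// ONE Postgres transaction, so a crash can never apply one without the other.
public class PostgresStateSink {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "pg-state-sink");
        props.put("enable.auto.commit", "false"); // offsets live in Postgres instead
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.IntegerDeserializer");

        try (KafkaConsumer<String, Integer> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/state")) {
            consumer.subscribe(List.of("part-events"));
            db.setAutoCommit(false);
            // On restart, read positions back from the offsets table and
            // consumer.seek() to them (recovery path omitted for brevity).

            while (true) {
                for (ConsumerRecord<String, Integer> rec : consumer.poll(Duration.ofSeconds(1))) {
                    try (PreparedStatement upsert = db.prepareStatement(
                             "INSERT INTO counts(machine, n) VALUES (?, ?) " +
                             "ON CONFLICT (machine) DO UPDATE SET n = counts.n + EXCLUDED.n");
                         PreparedStatement offset = db.prepareStatement(
                             "UPDATE offsets SET pos = ? WHERE topic_partition = ?")) {
                        upsert.setString(1, rec.key());
                        upsert.setInt(2, rec.value());
                        upsert.executeUpdate();
                        offset.setLong(1, rec.offset() + 1);
                        offset.setString(2, rec.topic() + "-" + rec.partition());
                        offset.executeUpdate();
                        db.commit(); // state + offset become visible atomically
                    }
                }
            }
        }
    }
}
```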

Kostas (25:28.778)
Yeah.

Apurva Mehta (25:29.536)
You have all the tooling for inspection, observability, patching; it's far more operationally mature than anything else. But that's what I'd say. Let me pause. Does that answer your question? Any follow-ups?

Kostas (25:36.159)
Yeah.

Kostas (25:41.17)
Yeah, a hundred percent. I do have a quick follow-up. There's another class of database systems out there, and one of them is also used at LinkedIn. You have systems like Pinot, for example, or ClickHouse: systems that can sustain a very high write rate, that are optimized to index really, really fast, and that provide analytical capabilities that might be more sophisticated than what

Postgres can sustain at scale. So how do pure stream processing systems compare to those?

Apurva Mehta (26:22.806)
I think it's use case by use case. Take a use case. Actually, inventory management, I would say, is not a good example here, because generally volumes are low: ultimately something has to happen in the physical world, right? Even if your software platform runs every factory, you don't have very high volume. So maybe that's not a good example to continue with here.

For inventory management, maybe they could use Postgres as a state store, frankly. Okay, so let's put that aside. Now say you can't use Postgres. Your question is: would you use Pinot as a state store, and why not? I think it comes down to what you said: those are analytical systems, right?

Do they perform really well on key and range scans? How efficient are they at those? First of all, I don't know; that's one of the criteria you'd have to evaluate. The bigger question, which I know for a fact they're probably not good at, is transactionality of any kind. One very common use case of these streaming systems and stateful stream processing is in

Kostas (27:27.689)
Mm-hmm.

Apurva Mehta (27:45.176)
fintech. Many big companies settle trades on massive trading platforms in an event-driven fashion with stateful stream processing. The idea, in terms of events: say you sell a share. Your broker has found a buyer, and then there's a settlement process, where the security is transferred and the money is transferred, right?

That settlement process is a stateful stream processing computation. It has to move from one state to another; things have to move around, and at the end it's complete. But it's triggered at the start by an event that a trade happened, that a match was made, and then it's driven by many other events as the settlement process continues. That's how these systems work. I'm not making this up; it's actually a fact. On Wall Street, people do this with Kafka, and even with Kafka Streams.

That is now a highly transactional setup. And trading is very high volume, because there's automation there too. Take those two things together: things like Pinot and these other systems don't have the semantics you need to maintain that transactionality. You want these state changes to happen exactly once. You want responses to events exactly once. You need your state

Kostas (28:47.858)
Mm-hmm.

Kostas (29:09.247)
Yeah.

Apurva Mehta (29:13.176)
to align with those transactions, those exactly-once requirements, which these analytics systems don't need to do and don't do. And that's where it breaks down. When you want volume plus these correctness requirements, that's where they're less good; I think they're basically disqualified. And even if you don't need transactionality, I don't know how efficient they'd be, because they're highly optimized for analytical queries and you're just doing key lookups.

Kostas (29:35.389)
Yeah.

Kostas (29:39.498)
Mm-hmm.

Apurva Mehta (29:41.002)
You're probably spending a lot of money on stuff you don't really need. And I would add availability: are these highly available systems, are they replicated? That's another thing you need. Even if you don't need correctness, you need high availability for these applications. If your stream processor is down, in many cases trade settlement stops, which is unacceptable. So you need highly available state too, which I'm not sure these analytics systems necessarily guarantee.

Kostas (30:00.094)
Mm-hmm.

Kostas (30:06.036)
Yeah, a hundred percent. I wonder if we can think about this problem a bit more from a product perspective: who, at the end of the day, is the user of the system and the data? What I hear, and I'd love your take on this, is that these systems, say Pinot or ClickHouse, these very fast, efficient analytical

systems, have a very core assumption that there's a human at the end working with them. That reminds me of another episode we had with Chris, the first one, where he said that SQL is a leaky abstraction for stream processing. SQL, at the same time, is something built primarily to be used by humans, right? Someone who's doing

interactive queries at the end of the day; they're trying to ask something of the data. And I wonder if, from a product perspective, the big distinction between a stream processing system and the rest of these systems has to do with automation. We want processes that are highly automated: machines, at the end of the day, transacting with each other. Take the trading use case you mentioned: there's no human in the loop, I mean,

there shouldn't be a human in the loop there, because a human cannot react fast enough to do these transactions. Whereas Pinot at LinkedIn, as far as I know, is primarily used by me, the end user, to go see some metrics about my network, do some filters, some projections, that kind of stuff. Which is a very different requirement in terms of

who reacts, what latencies are involved, and what kind of correctness is expected. Does this distinction make sense to you, from your experience, from a more product-oriented point of view when comparing these systems?

Apurva Mehta (32:20.15)
Yeah, from a product perspective. You're saying the consumers of Pinot and these analytics systems are humans, whereas the consumers of streaming systems are more machine-to-machine, more automated: there are fewer humans in the loop, generally speaking. I think that's a good rule of thumb. It gets more complicated now that streaming SQL is becoming more popular: who consumes the output of that?

When you talk about SQL specifically, who's actually writing that SQL, and who's consuming the outputs? But in general, you're right; that's how I generally see core stateful stream processing. It shines for core business apps, the core applications that drive the business, that drive the automation. And the other class is reporting of some sort,

Kostas (33:16.883)
Mm-hmm.

Apurva Mehta (33:18.286)
which, honestly, you probably don't need streaming for, because you can tolerate latency; you don't need high availability; you don't need transactionality. And I would say the biggest thing, which I think is under-discussed, is operability, too. If you have a core business application, anything I mentioned, inventory, trade settlement, and many more, you typically have teams holding pagers, because you cannot tolerate downtime. So these applications need to be

treated like your web servers, like your databases: all of these are mission-critical, pager-carrying systems. And I think that's one of the key requirements, from a product perspective, of any streaming framework you adopt: how well are your app teams set up to operate these things? In terms of simple things, like who gets called when something goes wrong.

With these big, complicated analytical systems, or even cluster-model stream processors, that becomes very foggy. You wouldn't necessarily build a mission-critical thing downstream of a Pinot, because it's such a complicated thing in the middle that, in general, no one really understands it; or you'd need some DBA to be responsible for Pinot's uptime.

But anyway, I'm digressing. What you said is right: humans are on the output of that side, and those systems are great for reporting use cases. For anything that drives the business, core applications, you want state in a stream processor. And then it comes down to the requirements for good operational characteristics, right?

Nitay (35:09.119)
You actually highlighted a really important point there, because as much as we've talked about the types of computations, the types of users, and the other potential data systems that could be used, the last point you just made about the operational rigor that's required is, to me, one of the biggest things that screams business opportunity, right? The fact that you need so much support staff and management and this and that just to make this stuff work means that, hey, some company could come along and make it easy for you.

And so that leads me to: moving on from LinkedIn, tell us a bit about some of the stuff you were building at LinkedIn and how that led you to starting Responsive.

Apurva Mehta (35:49.464)
Confluent, right? I think you meant Confluent. Yeah, so at Confluent I started with Kafka transactions, which was a big addition that unblocked a ton of use cases. And then I moved into the Kafka Streams and KSQL team. KSQL was new when I started at Confluent, and I started on that team. Just to define it: Kafka Streams is a library.

Nitay (35:51.179)
Sorry, sorry. Sorry, man, I meant Confluent and then Responsive.

Apurva Mehta (36:17.326)
An embedded stream processor, that's what I call it: stateful stream processing in a library form factor. You can think of Kafka Streams versus Flink the way you'd think of DuckDB versus Snowflake, for example. One is embedded wherever you want it to be, and the other is a centralized, cluster-oriented model,

Flink in this case. Kafka Streams was always embedded, always targeted at developers. It gives you stateful stream processing APIs, state management, all that stuff, in a very flexible form factor: just a library. And then KSQL was the other thing we worked on, built on top of Kafka Streams, which gives you a SQL interface for writing these streaming applications.

So that's what I did at Confluent. Over the years, our focus was always KSQL. Ever since KSQL launched, there was a lot of interest in streaming SQL, and Confluent, as a result, focused a lot of investment there. Kafka Streams was invested in to the extent that it supported KSQL, not necessarily as a category in its own right.

But I was actually running the teams, and I was on the escalation path for those teams. What we saw again and again was that Kafka Streams was being widely adopted for these core applications; a lot of the examples I gave you are things I saw there. And the pattern was very clear: these app teams wanted to build apps, and those apps were really stream processing apps.

Kafka Streams was a huge component of these apps, because the teams were just writing functions that reacted to events while Kafka Streams handled the state, the fault tolerance, the availability, everything, as a library. So if you think of it as an iceberg, the app is the tip and Kafka Streams is the 90% below the waterline.

Apurva Mehta (38:35.342)
That was very interesting, because it caused a lot of problems. Those app developers are now saddled with running what is effectively a distributed database if they have a stateful application, because Kafka Streams has a local RocksDB, which is partitioned and replicated over Kafka. It has things like a rebalance protocol, which is basically a control plane for the application. And all of this lives in the application's context.

So it's a stream processing framework that has come to the app, and now the app developer has to figure it out. As you can imagine, beyond a certain point it's extremely hard to get that working, which is why we saw all these escalations from that side. So that's what I did at Confluent, and that's really the genesis of what we're doing now.

I should mention that Confluent, since 2019, has done essentially zero marketing of Kafka Streams. There's nothing; it's all KSQL, and now it's all Flink. It's still streaming SQL; that's the strategy. And in spite of that, the growth of this thing is there, the use case adoption is there, and the problems are there. And I think that's where the

idea for Responsive came from. My belief, and I now hold it very strongly, is that stateful stream processing is necessary for a wide range of applications. Businesses in security, logistics, fintech, at the very least, have growing volumes of data and are competing on functionality.

And again, going back to what real time means: if they can respond within one or two seconds with something very sophisticated, they are differentiating their products, fundamentally speaking. If they can do that over an ever wider range of data, that's the trend. So these Kafka applications, I would say, are becoming a bigger part of the product stack, and they're built by app developers.

Apurva Mehta (41:01.08)
The business runs on them; literally, these are typically the entry points to the rest of these products. And that's what we believe now, that's why we started Responsive: these applications are the category on top of Kafka. The one reason you'd pick Kafka is because you need to build these applications, not because you need faster reporting, not because you need anything else.

Maybe you use it as a dumb pipe to fill your lake; that's fine, that's a big use case. Maybe you use it to feed Pinot and do reporting; that's fine, that's a good use case. But the use case for us is this whole new type of system that becomes possible when you have these high-volume event streams. And that's really what we're doing at Responsive: take Kafka Streams, the embedded form factor. Developers have spoken; in my mind it's clear that it's the preferable way to build stream processors

for application developers, for people carrying pagers, when it's business-critical. It has so many advantages to being embedded. So how do you take that form factor and solve the other problems? You want the embedded APIs, you want the embedded form factor, but you don't want to run a distributed system. That's exactly what we're focused on. But anyway, that's getting me to where we are now.

I've done the stream processing journey, and this is the insight, the white space in the market, I would say. We're doubling down: embedded stream processing, business-critical applications, Kafka applications, and making that really, really good. That's what we do now. Let me pause.

Nitay (42:46.919)
Yeah, that makes a lot of sense. I have one last question on this, and then I want to transition to the interesting stuff you mentioned around RocksDB and embedded state, which I know there's been a lot of development on. You mentioned KSQL, you have raw Kafka-level processing, and then Kafka Streams sits somewhere in the middle, and developers actually want that interface, right? KSQL doesn't cut it, and obviously the raw utilities aren't enough either.

So, from what you're building at Responsive, what is the right interface? What do developers actually crave? What do they want to be doing versus what you should be managing, and where does that divide happen?

Apurva Mehta (43:26.126)
Yeah, so on the KSQL front: it was SQL, right? Again, like what Kostas said, and what Chris may have said, I didn't hear the previous podcast, I think that's the biggest limitation. Developers don't write application logic in SQL. You can add UDFs, user-defined functions, but typically that's not what you do. What they want to do, and this is getting to your question,

is effectively write stateful functions and create topologies of those functions. You can model virtually anything like that. It's kind of like the reactive pattern, at scale.
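
As a sketch of "functions wired into a topology," here is how the hypothetical InventoryAlertProcessor from the earlier example could be assembled with the Kafka Streams Processor API. All names are invented; the point is that the developer writes the function, and the framework owns partitioning, state, and failover.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class InventoryTopology {
    public static void main(String[] args) {
        // A state store the framework manages (RocksDB locally, changelogged
        // to Kafka) on behalf of the function that uses it.
        StoreBuilder<KeyValueStore<String, Integer>> store =
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("part-counts"),
                Serdes.String(), Serdes.Integer());

        Topology topology = new Topology();
        topology.addSource("events", "part-events");                       // hypothetical input topic
        topology.addProcessor("alerts", InventoryAlertProcessor::new, "events");
        topology.addStateStore(store, "alerts");                           // attach state to the function
        topology.addSink("out", "refill-alerts", "alerts");                // hypothetical output topic

        System.out.println(topology.describe());                          // inspect the wiring
    }
}
```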

What developers really want, and what I think their expressed preferences show, based on the last several years of adoption in the market, is this: if you're building an app, you want to build an app, and you want APIs that let you build apps. It's not very different from a web server. There are still web app developers right now,

and Next.js-style frameworks make it really easy to write web applications with a good server/client split. The framework in the middle lets you just write functions; you declare something as client or server, and it has routers, it has everything, it takes care of the whole thing. The developer is effectively just writing functions, plus the logic for how requests are routed to those functions. And then things like Vercel just run it,

and they never have to think about it. That is what developers want: to focus on their logic with as little overhead as possible, with the facilities underneath them, front-end routing, and now very native database integrations, and so on, that let them express the logic they want as quickly as possible. That's a very general statement, but it's not surprising. And in stream processing, Kafka Streams is that

Apurva Mehta (45:46.53)
type of model. It's embedded, and it gives you very rich stateful APIs. You can do whatever you want: hook in your own observability, integrate with your own CI, call anything else in your environment. It's just an app, but it's triggered by events in Kafka, and it gives you sophisticated stateful APIs to implement whatever logic you want. And it's so powerful

that, just by doing that, it gets organic adoption. Typically that's how it happens: you're a developer at some company, you have a problem, and Kafka is popular enough now that it's always something you consider; for many businesses it's the starting point.

And then, if you want to build apps, Kafka Streams is so attractive because it's very easy to get started with, very powerful from day zero, and it gives you all the freedom you need to fit your company. So I think that's what developers want. That's what we believe they want, and that's what I think the data shows. And that's where embedded makes a big difference. Anyway, I don't know if I answered your question, but...

I think that's the big difference from KSQL: the APIs are not SQL. And it's a very big difference from anything else in that the embedded form factor affords the freedom that developers want.

Nitay (47:20.811)
Yeah, that makes a lot of sense. There's a lot more we could talk about here, but for the sake of time, maybe we'll come back to it. I wanted to cover one other topic: we've talked a good amount about state, and you mentioned the notion of an embedded RocksDB and so on, which I know is something you've invested a lot in and have partnered with Chris on. So tell us a bit more about the world of state management.

Apurva Mehta (47:42.892)
Yeah, that's a good one. That is one of the problems, right? It's what Kostas said at the very beginning: why not put everything in Postgres? And that could work. RocksDB, I think, is the de facto choice for state in stream processing. Flink has it by default, Kafka Streams has it by default, and those are the two big ones in the space. RocksDB is very attractive because, again, it handles very high write volumes, and

joins and aggregates work really well, because key and range scans are very efficient. So that's why RocksDB. The problem, as I said, starts once you have embedded RocksDB. In general, the default open-source stream processing frameworks are just old, in the sense of their architecture: they have state embedded on the compute nodes, and that removes all degrees of freedom. Now if you need

more CPU, well, you have to copy state everywhere to scale out. It's slow, it's not elastic, it's inefficient. It's basically a custom database without database tools: you can't do migrations or schemas; stateful stream processing app updates are difficult or impossible; people have to solve all of this on their own.

When things go wrong, inspection tools don't exist, state patching tools don't exist. So there are a lot of drawbacks to embedded RocksDB, which shouldn't be surprising. If we had an embedded MySQL, or an embedded whatever, that all app developers were embedding in their web apps, I don't think we'd have an

internet industry, right? But that's really what we have in stream processing by default. So what we're doing is saying: look, you need to separate the state out, and you need to do it while maintaining transactionality and, especially, the throughput requirements. Because, again, as we said, latency is less important for most people; you don't have that low millisecond tolerance. So you can put state remote,

Apurva Mehta (50:02.478)
as long as you maintain throughput. So that's what we think. Once you do that, put the state in a dedicated database, you get all the database tools, you get significantly better elasticity, and developers are no longer running a distributed database. It's significantly simpler operationally and significantly more reliable. It's not even a thesis; it's obvious, if you think about it for two minutes, that it's a superior architecture.

And what we've seen since we deployed this to our customers is a night-and-day difference in reliability. So that's the core of what we've done for Kafka Streams, and it's a very hard problem to solve: transactional remote state at scale. We support MongoDB and Cassandra as remote state today, and we'll probably do DynamoDB sometime this year.

So that's the product today. But we're moving beyond that, because even those existing databases have some of the same problems as Postgres: they're expensive. Mongo has its own replication and its own transaction system; it can do much more than key-value lookups. So it's less complicated than

Postgres, but still more complicated than it needs to be for our use case. So we're continuing to invest in SlateDB, and in RS3, our storage service built on SlateDB, to reduce that remote state to something highly optimized for stream processing, far more elastic and far more cost-efficient than anything out there. That's the journey we're on.

Kostas (51:50.25)
So, Apurva, I have a question. You mentioned the cost factor, which of course is quite important for the buyer. But there's also always a hidden cost, which has to do with maintaining very complex infrastructure; you need people for that. With something like Mongo, you already mentioned some of this: what it takes to use RocksDB in a distributed

fashion, and what it takes to maintain transactionality and consistency and all of that. How important do you see that part being, for the industry to move forward and for companies to actually deliver more value, simplifying the infrastructure they use, beyond

the direct cost of using something like S3 instead of, say, MongoDB to store the data?

Apurva Mehta (52:54.19)
So you're asking: outside of cost, are Mongo and these other databases too complicated as infrastructure, and is that a problem holding things back? Yeah, I think so. It's not just cost; there's also complexity. I mean, you can solve the complexity by just putting it on Atlas, which is very, very expensive.

Kostas (53:12.391)
Mm-hmm.

Apurva Mehta (53:23.278)
That's why I think this trend of object-store-native state, of stateful services built on object stores, makes a lot of sense: S3 is kind of the distributed file system you need. It takes care of replication, or object stores in general do; once they take care of replication and durability, a lot of the problems of a stateful service

Kostas (53:39.263)
Mm-hmm.

Apurva Mehta (53:53.022)
or a stateful database of any kind go away. Number one, that itself is a huge simplification. The other big simplification, in terms of infrastructure, concerns the other big part of databases: transactionality. Databases achieve it with write-ahead logs and MVCC that they maintain on their own, which is a huge part of their complexity. But Kafka is already a transactional write-ahead log,

which the companies we target have already adopted. So one of the big ways to simplify is to ask: can we decompose the database? Kafka is the write-ahead log; we just need to align with Kafka transactions. Kafka gives you single-writer semantics: its protocol ensures you have only one writer per key at any point in time. There are epochs, and there's fencing of writers with stale epochs. I was part of building the transactional protocol back in the day.
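
A toy illustration of the epoch-fencing idea, not the actual Kafka or RS3 protocol: each write carries the writer's epoch, and a compare-and-set against the stored epoch rejects zombie writers that come back with a stale one. All names are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of epoch fencing: a zombie writer holding a stale
// epoch can never overwrite state committed by its successor.
public class FencedStore {
    private record Versioned(long epoch, String value) {}
    private final Map<String, Versioned> store = new ConcurrentHashMap<>();

    // Returns false (fenced) if a writer with a newer epoch owns this key.
    public boolean put(String key, String value, long writerEpoch) {
        boolean[] accepted = {false};
        // compute() gives us an atomic compare-and-set per key.
        store.compute(key, (k, current) -> {
            if (current != null && current.epoch() > writerEpoch) {
                return current;              // stale writer: keep the existing value
            }
            accepted[0] = true;
            return new Versioned(writerEpoch, value);
        });
        return accepted[0];
    }
}
```

An instance that stalls (a GC pause, a network partition) and resumes with epoch 3 after its rebalanced successor wrote with epoch 4 gets false back and knows it must shut down.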

So Kafka already gives you that: single-writer guarantees and the write-ahead log, which is a significant thing in distributed systems. What we've done, what we think is the way to simplify, to your point, is this: S3 for replication and durability, Kafka as the transactional write-ahead log, and in the middle you need something very simple. You need something that can do a compare-and-set and give you fast

key lookups, fast range lookups, and tolerate high write volumes. LSM trees are perfect for that, and they interface well with S3. That's what SlateDB is, frankly speaking: an object-store-native LSM tree that can do local caching. On top of it we built RS3, which does the transactionality alignment with Kafka and Kafka Streams. But that is a dramatically simpler system

that can run in a customer's network. Effectively, it's like what WarpStream did for Kafka: the brokers, the agents in your network, are stateless. RS3 nodes are stateless, because they don't hold any persistent state; everything is in S3. It's a dramatic simplification and a dramatic cost reduction. To your point, it's significantly simpler and cheaper, but I would say it also opens up significant

Apurva Mehta (56:19.566)
other benefits, right? For example, we're just going to do snapshotting with SlateDB, right? So what that means is that you can, you know, you have an app, you can branch the state because again, the log is Kafka, the log is versioned by Kafka Offsets. We store those offsets in RS3 with SlateDB and just that alone allows you to, okay, at a certain point in time, I want to branch the app and do something else, right?

I can take snapshots of my app at points in time, and now I see a bug, I can roll it back and replay. None of this existed in stream processing, by the way, right? But this architecture is significantly simpler, and because this data is S3, you can have so many asynchronous things added to it to significantly change what you do with these apps. So I think all of it is something that is very exciting for me because...

it will open up many doors, right? You can get good disaster recovery, good auditing, good everything, which is a big gap in stream processing today; none of it exists. You can't even debug a state issue easily in Flink. There's no tool. How do you do it? You can't. Flink is so popular, and yet you can't do these simple things. With Kafka Streams it's a little more possible, but still not easy. I think these are the big steps forward we can make with this architecture.
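
A rough sketch of that checkpoint idea; the names below are hypothetical, not RS3's actual API, but they show how a checkpoint reduces to a snapshot reference in S3 plus the Kafka offsets it corresponds to:

```java
import java.util.Map;

// A checkpoint is just metadata: where the frozen SST files live in S3,
// and the Kafka input offsets the state corresponds to.
record Checkpoint(String s3SnapshotPrefix, Map<Integer, Long> offsetsByPartition) {}

// Hypothetical operations this architecture makes cheap:
interface StateSnapshots {
    // Freeze the current SSTs (already immutable in S3) plus the input offsets.
    Checkpoint snapshot(String appId);

    // A branch is a new app reading the same immutable SSTs: no data copy.
    String branch(Checkpoint from, String newAppId);

    // Rollback resets state to the checkpoint, then replays Kafka from its offsets.
    void rollback(String appId, Checkpoint to);
}
```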

Kostas (57:29.94)
Yeah.

Kostas (57:43.178)
Yeah, that makes a lot of sense. So S3, and object storage in general, has been around for a very long time; it's not a new technology. And the paradigm of disaggregating compute from storage is something that, at least in the analytical space, has been done for a while. Snowflake is a good example of that, and I would say even Spark, in the more traditional data processing space. This has been a true force of change in terms of how we can deliver these systems to more people and more companies, to commoditize them in a way. That's how we commoditized the data warehouse at the end of the day. So why did it take streaming this long to start implementing this pattern of, let's say, separating compute from storage?

Apurva Mehta (58:45.469)
Is it that long, though? How long has the data warehouse been around? Data warehouses go back to the 90s, right? Kafka Streams launched in 2016; Flink was available in 2014. So it's less than 10 years compared to 40. I would say we have to have reasonable expectations; it's not that long. In general, this is an emerging space. Stream processing is frankly downstream of Kafka, and before Kafka there

Kostas (58:48.381)
I don't know.

Kostas (58:56.553)
Mm-hmm.

Kostas (59:03.635)
Mm-hmm.

Apurva Mehta (59:10.722)
wasn't really a category; I think the genesis is Kafka itself. However you count it, it's no more than 13 years old. So I don't think it's that long. And these are not easy investments, frankly speaking. Redoing things like what I just described, using Kafka as a write-ahead log because it has transactional functionality you can hook into, and marrying that with what S3 brings. And S3 is what, coming up on 20 years old? It's 18; it launched in 2006.

Kostas (59:15.198)
Yeah. Yeah.

Kostas (59:40.138)
Mm-hmm.

Apurva Mehta (59:40.824)
It's not that old, right? So I mean, I think it's a natural evolution. In the scheme of things, streaming is moving in this direction much faster than other systems evolved.

Kostas (59:55.229)
Yeah, yeah, yeah.

Kostas (59:59.43)
Yeah, yeah. Let me rephrase the question, because what I actually wanted to get at is this: were there certain things in S3 itself, or in object storage as a technology, that needed to happen before this pattern could be applied to stream processing? It's probably an easier problem in the analytical use case, where you have a bunch of Parquet files and you just put them out there; you've had everything you needed there for 10 years now. But maybe S3 needed some more capabilities in its protocol to allow, let's say, this.

Apurva Mehta (01:00:41.59)
Yeah, I think so. They introduced compare-and-set recently, for example. That's something that is useful to us in the API, but not essential, right? We were already going to build around the lack of compare-and-set in S3; now we don't have to. But did something change in S3 to enable this wave of innovation? Not really. We didn't start SlateDB because something happened in S3

that made it possible. We did SlateDB because we knew we needed our own storage system. We knew that the Mongo dependency, over time, was not going to cut it, so we were looking at options. We have an internal doc, by the way, from 2023 that says: look, we can do RocksDB plus FUSE over S3, we can do our own LSM, we can fork RocksDB,

but that's really something the company has to invest in. So we knew that from the beginning. And then I met Chris, and it turned out he had the same idea, and then it happened. With the company only 18 months old, I was not expecting to have an RS3 service in beta at all. It happened because we knew we wanted it eventually;

we just didn't know we'd do it now. It's a big investment. So I would say it's more luck and circumstance than any fundamental technology unlock. What really helped us, I would say, was WarpStream. We said: okay, that makes a lot of sense; can we do the same for state storage? And I think that was a bigger eye-opener than anything technical.
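
For reference, the compare-and-set Apurva mentions is S3's conditional write support. A sketch with the AWS SDK for Java v2, assuming a recent SDK version that exposes the conditional headers, and with illustrative bucket and key names:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Exception;

public class ManifestCas {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            PutObjectRequest put = PutObjectRequest.builder()
                    .bucket("my-state-bucket")    // illustrative names
                    .key("manifest/epoch-000042")
                    .ifNoneMatch("*")             // CAS: succeed only if the key doesn't exist yet
                    .build();
            try {
                s3.putObject(put, RequestBody.fromString("{\"writer\":\"node-a\"}"));
                // This node won the race and owns the new manifest epoch.
            } catch (S3Exception e) {
                if (e.statusCode() == 412) {
                    // Precondition failed: another writer committed first; back off.
                }
            }
        }
    }
}
```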

Nitay (01:02:39.339)
That makes a lot of sense, and it makes me think of an analogy: it's kind of what's happening in the database world with DataFusion, where the database gets sliced into different layers, the storage layer, the operators, the logical plan, the query language, and so on. So, last question here, since we're coming up on time.

Where do you see things going, both with SlateDB and Responsive, and with stateful stream computation and low-latency operations as a whole? What's next?

Apurva Mehta (01:03:13.272)
Yeah, I think what we're going to see is this category of embedded streaming become much more recognized. That's something Responsive is going to invest in, because it's kind of invisible right now: apps are apps, and you don't know you're actually doing stream processing. I think we need to change that story. In terms of what is going to happen, I view this whole space

of streaming applications, Kafka applications, as the wild west, right? It has had very little investment historically, even though it's very popular. So, things like what I just said: you can do stateful application upgrades, you can do instant rollbacks, you can have strong state observability and strong state-patching tooling. You can have true disaster recovery for stateful applications, and better observability at the application level all around.

And we're doing all of that, by the way. As those things roll out, we're going to start seeing far more adoption of these applications. Right now people are effectively given hammers and nails, and they're building castles; a lot of companies have done a lot with very little. What if you could give them significantly better building blocks? They could build faster and build more. I think that's what you're going to see,

because frankly that's what Responsive is doing. I think these are all fundamental changes to the space, and we're looking forward to them.

Nitay (01:04:49.259)
Very cool. That sounds amazing. We'll definitely have to have you back on the podcast to talk about all of that. So Apurva, thank you.

Apurva Mehta (01:04:55.33)
Yeah, that would be fun, I'd like that. Maybe in a year or whatever; we'll see if these predictions pan out. But yeah, it was a lot of fun chatting with you all.

Nitay (01:05:06.261)
Perfect. Sounds great. Happy to do it. Well, thank you for joining us. This has been fantastic.

Apurva Mehta (01:05:14.306)
Yeah, likewise. Thank you for having me and looking forward to the launch of the show.