Tech on the Rocks

Summary
In this episode, Kostas and Nitay are joined by Amey Chaugule and Matt Green, co-founders of Denormalized. They delve into how Denormalized is building an embedded stream processing engine—think “DuckDB for streaming”—to simplify real-time data workloads. Drawing on their extensive backgrounds at companies like Uber, Lyft, Stripe, and Coinbase, Amey and Matt discuss the challenges of existing stream processing systems like Spark, Flink, and Kafka. They explain how their approach leverages Apache DataFusion to create a single-node solution that reduces the complexities inherent in distributed systems.

The conversation explores topics such as developer experience, fault tolerance, state management, and the future of stream processing interfaces. Whether you’re a data engineer, application developer, or simply interested in the evolution of real-time data infrastructure, this episode offers valuable insights into making stream processing more accessible and efficient.

Contacts & Links
Amey Chaugule
Matt Green
Denormalized
Denormalized Github Repo

Chapters
00:00 Introduction and Background
12:03 Building an Embedded Stream Processing Engine
18:39 The Need for Stream Processing in the Current Landscape
22:45 Interfaces for Interacting with Stream Processing Systems
26:58 The Target Persona for Stream Processing Systems
31:23 Simplifying Stream Processing Workloads and State Management
34:50 State and Buffer Management
37:03 Distributed Computing vs. Single-Node Systems
42:28 Cost Savings with Single-Node Systems
47:04 The Power and Extensibility of Data Fusion
55:26 Integrating Data Store with Data Fusion
57:02 The Future of Streaming Systems



What is Tech on the Rocks?

Join Kostas and Nitay as they speak with amazingly smart people who are building the next generation of technology, from hardware to cloud compute.

Tech on the Rocks is for people who are curious about the foundations of the tech industry.

Recorded primarily from our offices and homes, but one day we hope to record in a bar somewhere.

Cheers!

Kostas (00:02.197)
Amey and Matt, welcome to the show. It's very nice to have you here with me and Nitay, and we're very excited to chat about all the great things that you are building and also learn more about your background and your history. So let's start with a little bit of introduction, a small introduction from each one of you. Amey, do you want to go first?

Amey Chaugule (00:24.637)
Yeah, that works. So I'm Amey. I'm working on a company called Denormalized with my co-founder Matt here. We're building an embedded stream processing engine, which, to put it in a one-liner, is DuckDB for stream processing workloads. That's our dream and goal. And as for my background, I, you know,

started out writing Hadoop jobs at Yahoo on their ad tech stack in the early 2010s, then moved to San Francisco, ended up working at a bunch of startups, worked with some early Spark, and then I was at Uber for many years working on their ML infra, which was really

data infra, you know, because ML infra was just the cooler branding, I feel like. That's where I worked quite a bit on stream processing workloads, because Uber had one of the largest Kafka deployments back in the day. I think that still might be true. So I saw the evolution of Flink, because we were actually on something called Samza, tried Spark Streaming, and then worked on Flink quite a bit. One of the things at Uber I can

claim, my claim to fame there, was working on user sessionization: every time you open the app, figuring out the whole context of the session in real time. When you open the app, Uber needs to know where in the product selection cycle you are, and that's how they sometimes intelligently figure out where to position supply in real time. Because, you know, there's a concert that just got over and they see a lot more people opening the app, so they know there's a surge of demand coming in.

That's a pricing signal. So these are lots of complicated, interrelated systems in a two-sided marketplace, which is what Uber is. And for that you do need stateful systems, and we used stateful processing in Flink to be able to achieve that. So I got a good dose of how to handle this at scale. After that, I was a staff engineer at Stripe on the stream

Amey Chaugule (02:35.181)
processing team, and did machine learning platform work at Coinbase before starting Denormalized with Matt. Since I was at Coinbase, the first product we actually had was a real-time data product for crypto users, primarily interested in figuring out fraud and bad actors in real time.

We did YC in the summer of '22. Over a year ago, we switched our focus from that first company. For the first version of our product, we ended up using a lot of open source, Spark, Kafka, what have you, to be able to deliver, because crypto data is still, quote unquote, big data; blockchains can go up to a few terabytes. Even though we were a two-person company, and everyone tells you to stick your data in Postgres, right? That's the Hacker News thing. But no, what we were also doing was generating tens

to hundreds of gigabytes of data, and we needed to ship this data to our customers, you know, sync it into their Snowflakes. We also tried to set up Trino, and then we were like, okay, everyone seems to think this Iceberg thing is the thing. But even a year and a half ago, we found that things like Iceberg needed, you know, a Python environment, and you needed to launch a Spark cluster just to be able to query your

Iceberg tables. Yeah, we were very dissatisfied with the state of data infra, especially for smaller teams, but even in an ergonomic sense. And that was the realization: there's a big shift happening in data infra, especially with DuckDB and the "rebuild everything in Rust" TM movement. So this is our second iteration at this company. And like I mentioned, we're interested in making stream processing workloads

just run from your laptop or a single node, because scaling up, we believe, gives you a lot of headway; there's still a lot of headroom in modern hardware that has not been utilized. And also, stream processing has always stayed a niche area

Amey Chaugule (04:36.422)
of the data world. Every year at Kafka Summit, they talk about how this year is the year of streaming, but we would argue that a lot of it boils down to the fact that it's really difficult. Time is relative and clocks are hard, so it's hard to figure out whether joining streams in real time gave you the correct result or not.

Yeah, so over time we want to solve those problems, but our first step is actually to build the core tech, which we are building with Data Fusion, and we would like to make this open source as well.
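
To make the "clocks are hard" point concrete, here is a minimal sketch of event-time tumbling windows, the primitive under most streaming joins and aggregations. This is illustrative arithmetic, not Denormalized's API; the timestamps and window size are made up.

```rust
// A minimal, illustrative sketch of event-time tumbling windows; none of
// this is Denormalized's actual API, just the arithmetic underneath most
// streaming joins and aggregations.

/// Map an event timestamp (ms since epoch) to the start of its window.
fn window_start(event_time_ms: i64, window_size_ms: i64) -> i64 {
    event_time_ms - event_time_ms.rem_euclid(window_size_ms)
}

fn main() {
    let one_minute = 60_000;

    // Two events 200 ms apart in the real world can land in different
    // one-minute windows, so a windowed join may never pair them up.
    let event_a = 1_700_000_099_900_i64; // just before a window boundary
    let event_b = 1_700_000_100_100_i64; // just after it

    assert_ne!(
        window_start(event_a, one_minute),
        window_start(event_b, one_minute)
    );
    println!("200 ms apart, but in two different join windows");
}
```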

Kostas (05:09.385)
That's amazing. And before I go to Matt, I just have to ask this, okay? Because you used the metaphor of DuckDB for streaming. But when we are talking about embedded data infrastructure in general, should we say DuckDB for something, or should we say SQLite for something? I just sometimes feel like we are doing a disservice to SQLite, which

is probably one of the most successful databases out there, and hands down one of the most used. But nobody talks as much about it, or at least people just take it for granted, right? Compared to something like DuckDB. So, I don't know, what do you think?

Amey Chaugule (05:41.821)
Yeah, it is.

Amey Chaugule (05:57.639)
I think I'm betraying my OLAP, columnar-world bias, right? Because DuckDB, yes, I will fully admit to it. I have used SQLite for my projects, but through the process of building a data infra company we have obviously become much more familiar with the OLTP world and the happenings there. For instance, Matt and I were recently at Systems Distributed, the conference by TigerBeetle, in New York, and

you know what, I feel like these two communities have not cross-pollinated nearly enough, because there are always OLTP guys telling us, oh, you guys are just rebuilding something that was written in a paper in 1980 or something like that, right? And as wizened and graybeard-y as that criticism feels, it rings true. So yes, I would say, you know,

DuckDB or SQLite for streaming. Nowadays in the streaming world there is this idea of Kafka as a protocol, so I'm pretty sure someone will very soon make a Postgres extension that does Kafka as a protocol for Postgres, but yes.

Matt (07:13.44)
I would say that SQLite did not raise $100 million VC rounds, so currently the mindshare seems to be with DuckDB, even though SQLite, I think, is still much larger. A lot of the engineers I talk to who are not privy to the data space have heard of and have used SQLite, but only a few of them have heard of or used DuckDB. So I think to some extent it just depends on who we're trying to speak to.

Kostas (07:36.389)
Yeah, 100%. I think what Amey said about the cross-pollination between the OLTP and OLAP worlds is a very interesting statement. It's definitely true. I don't know, maybe, Nitay, at some point we should just bring some folks from both sides and see what happens. I think it would be interesting to have them on the show talking about that stuff. But Matt, you also have to introduce yourself, so please, please go ahead.

Nitay Joffe (08:00.881)
Yeah, for sure.

Matt (08:03.836)
Yeah, absolutely. So my name is Matt Green. I've been working in and around startups for the last decade or so. I started my first company out of college; it did not go anywhere. I moved out to San Francisco and started to work for some other startups. Notably, I was one of the founding engineers at a company called Booster Fuels that does on-demand gasoline delivery, with actual physical gas trucks. I have actually pumped gasoline as part of my job duties as an engineer before, which was fun.

And I learned a lot about gas safety and regulations. So that was always an interesting time.

While there, I was responsible for basically full-stack engineering. So I was building out the native apps on iOS and Android, as well as the backend services to manage the entire product, and even doing some hardware integrations with the trucks themselves: trying to digitally figure out how much gasoline was pumped and then link that back to a car so that we could charge the customer for the gasoline that was sold. During that time, I kind of had my first taste

of data when we were using a MongoDB backend, as many cool startups were at the time. We had some data analysts who spoke SQL, so we needed to load our MongoDB database into a Postgres database so that they could create the charts and graphs that were needed and slice and dice the data. And that was a surprisingly painful experience,

as one can imagine. After that, I moved on to Lyft, where I worked on the driver incentives team, which was a pretty interesting and intense team that involved taking a bunch of signals from the marketplaces and using those to create incentives in real time to incentivize drivers to stay online or to drive in particular areas. So really doing a lot of like supply smoothing.

Matt (09:59.778)
So that used a lot of real-time data, operated at a pretty high scale, and we were paying out money, so there was not a lot of room for mistakes in terms of off-by-one errors or anything like that. We had to make sure that everything worked properly. That was also where I developed a bunch of opinions about how data systems should work, and about data systems economics. A lot of it was working with data scientists and research scientists who were slinging code in Jupyter notebooks with pandas data frames, and we had to

get their mathematical models that were written in pandas into production, and then had to add things like SLAs on the calculations and failure modes for when those solutions failed to solve for particular ticks in time. So a lot of interesting problems there. After that I joined,

let's see, a startup called Fast, which famously blew up after 14 months and $120 million. There I was one of the founding engineers on the data team, building out the data analytics side of things and working on recommendation systems as well as fraud detection. That was quite the wild ride. And then after that, I decided to start Denormalized with Amey.

Kostas (11:22.63)
How did you meet, by the way, the two of you? How did you get together and decide you'd like to do something together?

Amey Chaugule (11:26.065)
Yeah, so I worked at Uber and Matt worked at Lyft. So that's a clue there. But we met at parties in San Francisco, and we were nerding out about the marketplace dynamics of ride-sharing marketplaces, you know. That was how our friendship started. But yeah, I'll let Matt add.

Matt (11:31.712)
Yeah.

Matt (11:39.132)
Yeah, so.

Matt (11:43.944)
Yeah, well, that's the story. Yeah, we had mutual friends and we would find ourselves off in a corner talking about ride share marketplace dynamics while everybody else was having fun. And so, yeah.

Kostas (11:56.645)
That's amazing. All right, so what are you building today? Tell us a little bit more about what you are coding every day, what the goal is there, and what you are trying to do.

Amey Chaugule (12:08.357)
Yeah. So, again, for the broader audience, stream processing can mean different things to different people; sometimes people think Netflix streaming, right? In the world of real-time data, a stream processing engine is broadly a query engine. Spark is a query engine, Snowflake is a query engine. Stream processing has some dedicated ones, but Spark can do both.

These query engines operate on big data, but under tight latency requirements. The charter is not just to process gigabytes of data, but to do it within millisecond, sometimes microsecond, latency. Historically, Kafka has been one of these real-time systems, such as the ones we were building at Uber and Lyft, and

they have broad application, right? At Uber and Lyft, they were using it for these pricing models. A lot of machine learning models need real-time data coming in. And one of the challenges there is that oftentimes these models are trained using a Spark workload, right? The training job is a Spark query written by someone, but the inference piece of any machine learning model involves: hey, this signal just came in real

time, and we need to run the same physical plan that the training step required, but on just a few events, or a single event, at a time, and under extremely tight latency. So you're not going to be able to run your Snowflake query or the

Spark batch query against the same thing. That's how a lot of stream processing systems found product-market fit over the 2010s. Recommendation systems are another very common use case. But these systems are also used, a lot of the time actually, for rules-based risk and fraud systems. When I was working at Coinbase, I was the technical lead on the machine learning platform, but our primary

Amey Chaugule (14:21.298)
use case was fraud, considering cryptocurrencies are a market rife with it, and Coinbase's one thing is that they have never been hacked and never really lost any customer's money. In those systems you typically have analysts, just your regular data analysts, writing domain-specific rules

that these systems are then supposed to fire off based on what's going on. So stream processing is one part of it; stream processing is crunching the data, but it's often coupled with a data bus. That's where the Kafkas of the world come in. Kafka is not the only data bus, but it's the behemoth in the space. Kafka is often also used for non-real-time things now, basically as a dumb pipe to feed your data lakes and what have you, but

under tight latency requirements is where Kafka is of great use. So as you can imagine, a lot of this world is built around distributed systems. If you feed a few events to your Kafka cluster, the state of the art these days, even just to query what those events look like,

is to fire off a Spark job or a Flink job, so the round-trip time just to see what's happening is a couple of minutes, even if you run the Spark or Flink cluster locally. So in the process of building our initial real-time crypto data product, we saw DuckDB and how much it solved. And personally, I sort of grew up with Spark; there was a Twitter trend where people

were thanking TypeScript for their Lambos or whatever. I never got a Lambo, but hey, Spark kept me employed for a long time. Whatever I have, it's like, thanks Matei, you know.

Amey Chaugule (16:16.743)
So similarly, we believe stream processing has wider applications. But typically companies still need to hire a Flink engineer, and a Flink engineer is not just a data engineer who knows how to write the dataflow; they also know the intricacies of the JVM, how to run two-phase commit for checkpointing. You're effectively a distributed systems engineer running these distributed real-time

data compute programs, right? So we realized this would all be much easier if you do it on a single node. And there's actually a lot of academic work out there showing that a single-node

application workload actually beats most distributed workloads. That's just true; DuckDB proved it. It can beat Spark for workloads up to one terabyte, but for streaming it's even

truer, because you're not streaming one terabyte of data per second, oftentimes, right? So that's what we're building. That's what we mean by an embedded stream processing engine: take away the distributed systems complexities, where you don't have to do consensus all the time, where your checkpointing doesn't involve a full-blown implementation of Chandy-Lamport, the common algorithm that very few people can even implement correctly. Take away that

difficulty of building real-time dataflow programs. We're working in Data Fusion; it's a library and a query engine, which we can talk about later. What we're trying to do is build this single-node stream processing system with the view that you can scale up to the cores available on your machine, and our thesis is that this should

Amey Chaugule (18:13.737)
take care of at least 90 to 95 percent of the workloads. We think it might actually be higher, but I hesitate to say 99 percent because we're at the phase where we're talking to people in the industry, trying to collect the empirical data for this.
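
To make "embedded, scale up to your cores" concrete, here is a minimal sketch of Data Fusion, the library Denormalized builds on, used as an in-process query engine. This is plain batch Data Fusion rather than their streaming layer; the file name is a stand-in, and exact API names shift between Data Fusion versions.

```rust
// A minimal sketch of Data Fusion as an embedded, in-process query engine.
// This is plain batch Data Fusion, not Denormalized's streaming layer, and
// exact API names shift between Data Fusion versions.
use datafusion::error::Result;
use datafusion::execution::context::SessionConfig;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // One process, no cluster: parallelism comes from partitioning work
    // across the local cores instead of across machines.
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    let ctx = SessionContext::new_with_config(
        SessionConfig::new().with_target_partitions(cores),
    );

    // "events.csv" is a stand-in for whatever local data you point it at.
    ctx.register_csv("events", "events.csv", CsvReadOptions::new()).await?;
    ctx.sql("SELECT user_id, COUNT(*) FROM events GROUP BY user_id")
        .await?
        .show()
        .await?;
    Ok(())
}
```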

Nitay Joffe (18:35.645)
I'd be curious to hear: you mentioned earlier that you'd go to these streaming conferences and every year they come and tell you this is the year, this is the year, right? Clearly something is happening now, with you guys leaning into this streaming DuckDB, if you will. So given your experiences, what you've seen at Uber, Lyft, and so on, what tells you that the world needs this now? What gave you guys that

inkling?

Amey Chaugule (19:08.265)
So, you know, it's...

So like I mentioned, streaming actually does have use cases, right? On Hacker News people can be like, no one needs streaming, just put everything in Postgres. But there are real-world use cases that already exist. Flink and Kafka have their raison d'être. There are big companies using these systems for a particular reason, and again, it's generally coupled with computations that require tight latency. What we're seeing today,

especially with the rise of the AI engineer archetype: historically, data infra, machine learning infra, data engineering, machine learning engineering, these were all nebulous terms over the last 10 years. That

audience has expanded broadly over the last two years, I believe, because a lot of people are now using LLMs as a great generalized world model, you can do some prompt engineering, and people are now rediscovering search. There's a fancy new term called RAG, but it's search engineering. So the potential audience for stream processing has definitely increased, but the primitives are still

extremely difficult and cumbersome to use. Even within old-school data infra or data engineering, stream processing became a sort of niche within a niche, right?

Amey Chaugule (20:40.061)
But now that data infra and data engineering niche has expanded to a newer crop of people who are building these AI applications. So we feel like this is the correct time to introduce the modern developments that are already happening in broader data infra to stream processing, so that this newer crop of engineers has better tools to use, better ones than we were using, you know.

Matt (21:10.85)
I have like a slightly...

Nitay Joffe (21:11.049)
Next one, next one. Go ahead, go ahead man.

Matt (21:13.178)
I have a slightly different take on this. I think what Amey is saying is absolutely correct. But I also think that you're seeing a shift, to some extent, in how people are starting to develop backends. You now have this rise of serverless backends. That, mixed with development paradigms on the front end like React.js, which have this approach of emitting events and then processing those events, I think lends itself to, and is very similar or related to, stream processing.

In the big data space, we think about streaming as this very specific portion of large data, where you're emitting a bunch of events, you have to reach consensus, and you're trying to process these things at relatively low latency. But that model of programming and thinking is pretty analogous to what's happening a lot on the front end, and that's the model that a lot of new developers, this next generation of front-end developers, are starting to think in. So these developers, generally speaking, are looking to spin up some service,

like a Node.js application; they write their front-end React code, and then they're emitting events between the front end and the back end. If you think about it, they end up building these reactive applications that sync between the front and the back end, and a reactive application in many ways is just another way to think about a stream processing application. So why now? I think part of it has to do with, as we've talked about, servers getting a lot faster.

So you can actually process more data on a single node. But more than that, I think the time is ripe to disrupt the old generation of streaming technology that was really built around the JVM ecosystem, like Spark and Flink, and make it a much more ergonomic and usable experience for app developers to come in and build their applications. That kind of looks like: you're emitting API requests from a front-end application, you're doing some transformations, and then you're syncing it into your data store. And then you

might just be calling that back, but you can model all of that as basically a bunch of streams. And I think a lot of developers are primed to think that way because of their experience developing applications on the front end.

Nitay Joffe (23:23.389)
So tell us a bit more on that point. You made a couple of really interesting points, in particular one about the developer experience. What is the interface, as you guys think about it? How do people interface with your data store, eventually, or your event processing streaming system? Is it basically just SQL? Is that enough? Obviously with streaming queries you have notions of time and window queries and deeper nuances of what you're doing. How do you think about that?

Matt (23:49.349)
Yeah.

So we have these discussions a decent amount internally. We tend to think that SQL is a fine interface if you're a data analyst, somebody who's just trying to pull some data, make some charts, report on some information. But SQL as an interface for streaming data, and for programming broadly, I think generally lacks a lot of the nuance and power that makes building production systems easy. Anybody that's tried to deploy raw SQL

to a production environment is familiar with all the difficulties: having to maintain this code, and having to

interop between the type system of SQL and whatever programming language you're using. So I think the SQL interface works because the data analyst side is generally a much bigger market to sell into, but it's not the only way to express compute over your data. So we've been talking a lot about the data frame interface, both for batch computing and for stream computing. Something like pandas, or Ibis, which is a newer project.
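
As a rough illustration of the contrast, here is the same aggregation against batch Data Fusion written both ways: SQL as a string, and the data frame builder. A streaming data frame would add time and windows on top of the same verbs. The table and column names are made up, and these function imports move around between Data Fusion versions.

```rust
// The same query through the SQL interface and the data frame interface,
// in batch Data Fusion. Illustrative only: "events"/"user_id" are made-up
// names, and aggregate-function imports vary between versions.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("events", "events.csv", CsvReadOptions::new()).await?;

    // SQL: great for analysts, but stringly typed from the host language.
    let via_sql = ctx
        .sql("SELECT user_id, COUNT(*) FROM events GROUP BY user_id")
        .await?;

    // Data frame: the same plan, composed and type-checked in Rust.
    let via_df = ctx
        .table("events")
        .await?
        .aggregate(vec![col("user_id")], vec![count(col("user_id"))])?;

    via_sql.show().await?;
    via_df.show().await?;
    Ok(())
}
```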

Amey Chaugule (24:58.769)
Yeah, and to add to what Matt said, Nitay, on the developer experience piece: a bold statement I would like to make is that I think the future of data infra is just having TypeScript interfaces for developers to write their applications in, right? And that is just not a market that has really been catered to. That's why we see things like Vercel:

with the serverless compute pieces they offer, the draw for a lot of developers isn't necessarily the underlying technology, but that it's accessible to them in a language such as TypeScript. And you can actually see this already happening in the new crop of the AI engineer archetype, right?

For what it's worth, since we're systems builders, we like typed languages, right? And I'm sure a lot of us tried to pooh-pooh Python when it was trying to muscle into the data space, but that battle was lost. It was clearly lost, as users found things like pandas really, really useful. And I think a similar thing is now beginning to happen

with languages such as TypeScript. As for how we are building: Data Fusion is written in Rust, so our core day-to-day coding is in Rust, but we are deeply thinking about the interfaces we would like to support. For instance, there's a conversation going on in Data Fusion right now about the Python interfaces, because the Python version Data Fusion currently has looks like you're writing your Rust in Python, right?

Polars is a project that comes to mind, which has done a really good job of the Python-Rust interop, something like that. In a platonic ideal, we would have a Polars equivalent in the end target languages that users really care about, while the core of the compute logic stays in a systems language such as Rust.

Kostas (27:04.405)
I have a question, and it's actually very interesting for me, because it happens that we had a similar conversation with someone else a few days ago about streaming, and part of the conversation there was: who cares, in the end, about streaming systems, right? And when we say who cares, we're talking about who the user is. And that's something I observed

even with you two guys talking, right? I hear Amey at the beginning, and I hear a data person, a person who is on the data infrastructure side. And then Matt comes in and talks about the application engineer, right? And these are very different types of people. They understand even common concepts in different ways, just because their jobs are different, right?

It's very interesting to hear that from you guys, because I think that's kind of the problem of streaming systems, in a way: they get dragged into "we're going to use SQL because SQL is what the data analyst or the data engineer understands," or whatever. But then, do these people in the end really care about that stuff, or should they even touch this type of technology?

To your point about TypeScript: if we're talking about application engineers here, why would they care about SQL? I mean, even for the transactional systems they have, they probably use an ORM. They don't even write SQL directly against their database. So my question is, what's your take: in the end, who should be, let's say, the target persona of the streaming systems out there, and

Amey Chaugule (28:39.073)
Mm -hmm, yep. Yep.

Amey Chaugule (28:47.39)
Mm -hmm.

Yep.

Kostas (29:01.665)
are we ever going to solve this thing? Is it going to be the data people, or is it going to be the application and product engineers?

Amey Chaugule (29:09.991)
So I can hearken back, right? Typically at bigger companies, and my most recent industry role was at Coinbase, you have an internal data infra team, right? You're the ones maintaining your Flink as a service or BlockTech as a service. Even the end users in these companies, the people writing the applications on top of streaming, are generally not data people, right? In fact, I guess the meta...

It's not really a criticism, it's just how these things have evolved. The meta point I would make is that a lot of data infra is built by people who

are generally really deep into, say, database transactions and the core technology, but not necessarily deep on the application side of things. When Spark first came about, Matei and the Spark open source folks ended up interning at Twitter and different companies, not because they wanted a job at Twitter, but because they wanted real-world industry use cases to test their ideas against. So there has always been that tension.

Then streaming happened, right? Streaming, like Matt was alluding to, is an event-driven paradigm. And the SQL addition to streaming, which especially happened over the last five to seven years, the idea there was that it was really hard to sell streaming systems to

traditional database users, right? So SQL always felt like a bit of a retrofit onto streaming. And just like writing Spark SQL in batch, you can write your SQL, but it is completely divorced from the physical mechanics of how your query is running. For instance, if I'm running with Presto, I should assume that my query may not return results because it

Amey Chaugule (31:02.025)
timed out or something failed, and Presto's not going to retry. Spark does retry, so something that should take five minutes could take six hours because there was a really bad shuffle. But anyway, the SQL user doesn't care about these specific mechanics of your distributed system, so that tension already existed between SQL users and the data infra teams. To your larger point, I think perhaps it's

not necessary that you should try to have one interface for all. You could have a system that has two interfaces, right, and they expose the use cases each in their own way. And I think Flink and things like Spark actually already do this well enough.

But what ends up happening is they also have this complicated clustered environment they're running under. So whether you use the SQL interface or the data frames interface, you also have to have some knowledge of it. So my hypothesis, if I have to make one, is that it's not that having both a SQL interface and a data frames interface is a bad thing and we should just pick one. It's totally fine to keep both.

It's just: can we make the underlying compute environment transparent enough that a user doesn't have to think about it? Currently, if your underlying compute environment is a distributed cluster that has to be backed by ZooKeeper to make sure it cannot lose your data,

that's where, no matter how good those top-line interfaces are, the experience is still going to be lacking for the end user. And as we know, with data frames the persona is a data scientist or a data analyst, but a lot of the time it could just be an application engineer working within the confines of a company. Similarly with SQL: it could be a data analyst, but it could also just be a team writing an application, and that's the interface they found, right?

Amey Chaugule (33:05.797)
So yeah, so I think making the underlying compute system itself a lot more transparent and easier to reason about probably is the answer, but yeah.

Kostas (33:19.53)
Okay, next question. One of the things you said is: okay, we have to get rid of the cluster, or the distributed nature, let's say, of these systems, right, to simplify things. And we can actually do that, and probably serve, I don't know, 99, 90 percent of the workloads out there. Okay.

And the question I have, and I think this is something that probably confuses a lot of people out there a little bit: usually when we're talking about systems, when we talk about the workload, it's kind of easy to reason about, because we say we have a workload that's five terabytes, or a petabyte, or an exabyte, whatever.

But there is a size attached to it that makes it much more concrete to understand what scale means. With streaming systems, it's kind of hard to say that because you have, in theory, an unbounded stream of data coming in. And sure, latency is one thing that is important, but a streaming system doesn't necessarily have to care about latency.

Streaming is defined primarily by the fact that you have an unbounded data set coming in. And the question now is, okay, let's say we restrict ourselves to one node. As a streaming system, you still maintain and update some kind of state, and that state has some kind of size. So what kind of state size are we talking about here that can be served

Amey Chaugule (35:01.347)
Mm -hmm. Yep.

Kostas (35:11.915)
on a single node, which can be a very beefy one, right? I don't know, we have really big servers right now. And where's the threshold where you have to go distributed?

Amey Chaugule (35:16.286)
Yep.

Amey Chaugule (35:24.061)
Yeah, so that's a very interesting question. In the streaming world, we don't really have an equivalent of a TPC benchmark set, right? Streaming typically cares about records per second and

data ingested per second. State size, how the state grows over time, is generally an exercise left to the operator of the system. Just to tie it into a real-world example: when I was writing my sessionization jobs at Uber,

using Spark Streaming, which was very unstable back in the day, I would get paged at like 4 a.m. every morning and try to understand what was going on. Turns out it's 7 a.m. in New York, and that was the peak traffic time for people in New York, right? How the state grows generally depends on what your application characteristics look like. That said, a streaming system is a continuous compute system that's ingesting some data.

A lot of the goal of your system should then be to pare down the state. Each event is coming in, but generally you aggregate it into state; generally you use streaming windows, that's one of the concepts, right, to actually pare down that state.

The way we think about state currently in our implementation: we're using RocksDB because it's a very industry-trusted and well-used system, right? But there is nothing that stops you from using SQLite, or using Postgres, as one of your state stores. You could even use a remote store, because even if your state is eight terabytes, on each event that comes in you're not accessing the whole eight terabytes. So you can do the unloading and

Amey Chaugule (37:08.619)
loading fairly well. DuckDB actually does this, right? For instance, if you have 16 gigabytes of memory on your laptop, it doesn't mean your Parquet maxes out at 16 gigabytes. It can go up to 10 to 20 times the memory on your laptop, and the way they do it is by buffer managing, correctly spilling state onto the disk and only loading the relevant pages in. So we

haven't done a whole lot of work right now on actually solving the state piece, but right now we're using RocksDB, and the idea is RocksDB would be the one that does the disk spilling and things like that.

As you can imagine, there will be continuous compute applications where the state grows to a terabyte or so. That's where we were thinking you can build the buffer management to bring in the relevant state, because

each continuous compute stream, even if it computes a petabyte of data over some period of time, maybe over a week, at any given moment is still only looking at a few gigabytes, at the top end of some of these applications. So that is how we're thinking about it. And as for when you should use distributed compute: to be honest, we're talking to some of our ex-employers, we're talking to people in the industry, and we're yet to find the hard limit.
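
As a concrete sketch of the state pattern Amey describes, keyed, windowed aggregates kept in RocksDB, where the store rather than the application decides what stays in memory, something like the following. The key layout and names are illustrative, not Denormalized's actual schema.

```rust
// An illustrative sketch of per-key, per-window aggregates kept in RocksDB,
// so the hot slice of state lives in memory and RocksDB spills the rest to
// disk. The key layout and names are made up, not Denormalized's schema.
use rocksdb::{Options, DB};

/// Map an event timestamp (ms) to the start of its tumbling window.
fn window_start(event_time_ms: i64, window_ms: i64) -> i64 {
    event_time_ms - event_time_ms.rem_euclid(window_ms)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/stream_state")?;

    // One incoming event: bump the count for (user, one-minute window).
    let (user, ts_ms) = ("user_42", 1_700_000_099_900_i64);
    let key = format!("count/{}/{}", user, window_start(ts_ms, 60_000));

    let current = db
        .get(&key)?
        .map(|bytes| u64::from_be_bytes(bytes[..8].try_into().unwrap()))
        .unwrap_or(0);
    db.put(&key, (current + 1).to_be_bytes())?;

    Ok(())
}
```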

Kostas (38:23.51)
Mm -hmm.

Amey Chaugule (38:38.873)
Perhaps the Facebooks of the world, because big data is, again, one of those extremely relative things. Maybe the Facebooks or Googles of the world really have streaming workloads where you truly cannot do this any other way. But even at the Ubers of the world,

for most use cases you wouldn't necessarily need a distributed cluster. Sure, I think the place where a distributed cluster actually makes sense is if you have 4,000 partitions on your Kafka topic, an extremely high-partition topic. One way is to have 4,000 threads on your machine, but it might actually make sense to distribute just to get the read throughput.

I think the way we're currently thinking about it is: if your read is more than the 10-gigabit Ethernet card that you have on your machine, if you're saturating that, that is where you should consider truly going for a distributed way to read this.

Kostas (39:43.259)
Yeah, that makes sense. Sorry, one last question and then I'll hand it over, okay, because it's related to the distributed part. When we're talking about distributed systems, there are two things here. One is always that we're thinking about scaling, and scaling horizontally, right? You answered that. But then there's also fault tolerance,

right? Which I think becomes an important thing especially when we are talking about streaming systems, systems that need to be online all the time. And also, when you have some kind of fault there, you have to recover really fast, because data keeps coming, right? So how do you think about fault tolerance in a single-node world?

Amey Chaugule (40:20.457)
Mm -hmm.

Amey Chaugule (40:29.831)
Yep.

Yeah, so this is actually a very nuanced topic, right? In a batch system, when your Spark executor fails, we're like, all right, well, whatever, we'll just spin up another executor. The shuffle was somewhere on the disk, or even if it was not on the disk, your data is bounded, so you can re-query, you can rerun that part of the pipeline and go through it. With streaming systems, whatever they advertise, it's actually really, really hard, even to this day, to productionize fault-tolerant distributed streaming systems.

And I'll tell you why: you generally need full-blown consensus to actually checkpoint these systems.

When these systems are checkpointing, they typically also build up back pressure, because you're not actually processing data at that point in time, which matters depending on the criticality of latency. So a lot of people generally run these systems with checkpointing off. At Uber, for instance, we would run red-green or blue-green deployments; it's been a while since I've been at a big company, so I forget my SRE terms. So being on call

in stream processing systems especially is actually hard. I would actually argue that it's easier to checkpoint a streaming system when it's on a single node, because all your operators are on the same node and you have access to their states. So the way we're thinking about this is: you checkpoint state to durable storage. Your disk is where you would start, but there's no reason why it can't be remote storage on S3.

Amey Chaugule (42:12.874)
The recovery mechanism, again, is that you recover from the checkpoint file; just like in a distributed system, you have checkpoint files. So we actually believe it makes it easier to add

fault tolerance. The notion of fault tolerance in a batch system is: in Presto's case, if something died, the query is cut off and you have to rerun the entire query. In Spark's case, your five-minute query could be running for six hours because it's thrashing on trying to open up executors, and depending on how your scale-up, scale-down setup works, it may or may not succeed. That is fault tolerance for a Spark job: can the query run. But for

streaming systems, you have to recover from a point in time.

For the users of these systems, fault tolerance is not push-button as it stands. So the way we think about it is: yes, you have checkpoint files and you can recover from checkpoint files. And then, can you recover to a point in time? That entirely depends on the upstream sources. If your upstream source is Kafka, it has some retention on the log, so

you can recover. But you could have streaming systems with WebSocket inputs, and at that point you're out of luck. Again, we're trying not to be Kafka-specific, but since the majority of streaming use cases will interop with Kafka, that is how we're thinking about it right now.
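
A rough sketch of that recovery idea, using the rdkafka crate: a checkpoint pairs operator state with the Kafka offsets it reflects, and on restart the consumer seeks back to those offsets and replays. The checkpoint contents, topic, and names here are all illustrative, not Denormalized's format.

```rust
// A sketch of point-in-time recovery against Kafka: the checkpoint stores
// the offsets the state snapshot covers, and on restart we resume from
// just after them. Names and layout are illustrative.
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::{ClientConfig, Offset, TopicPartitionList};
use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pretend this came from the last checkpoint file: one (partition,
    // offset) pair per partition that the state snapshot reflects.
    let checkpointed: Vec<(i32, i64)> = vec![(0, 1042), (1, 977)];

    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "denorm-sketch")
        .set("enable.auto.commit", "false") // offsets live in our checkpoint
        .create()?;

    // Resume each partition just after the last checkpointed offset.
    let mut tpl = TopicPartitionList::new();
    for (partition, offset) in &checkpointed {
        tpl.add_partition_offset("events", *partition, Offset::Offset(offset + 1))?;
    }
    consumer.assign(&tpl)?;

    // From here, poll() replays everything since the checkpoint, which we
    // would apply on top of the restored operator state.
    if let Some(msg) = consumer.poll(Duration::from_secs(1)) {
        let _msg = msg?;
    }
    Ok(())
}
```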


Matt (43:50.654)
And I'll also add that if you really need low latency and high resiliency, you could just run the same job twice and do auto failover.

Kostas (44:00.788)
All right, Nidai, all yours.

Nitay Joffe (44:03.303)
Yeah, I was actually going to ask basically the very same question in terms of fault tolerance. The other thing I was thinking about:

a lot of what you were saying was about the deployment model and the management pain of maintaining these systems. And Matt, I think you alluded before to offering serverless options and so forth. Tying back to the question we had before about who's using the system: we have, as we said, the application engineer, we have the data engineer, and then we really have some DevOps-y person, or some infra engineer, if you will.

And I could see how that person would care deeply about operating these systems, right? Because they're the ones, day in, day out, as SREs, getting alerts, getting paged, and so on and so forth.

But if it's in the cloud, if it's in, you know, the Denormalized cloud and you're offering something, and if that offering is, for example, serverless, or if a different stream processing operator offered a serverless offering, why would they even care at that point whether it's single-node or multi-node? Where does that come in? How do you think about that?

Amey Chaugule (45:11.305)
So you can develop... oh, it's okay. Yeah, I mean, it's code that runs on your laptop, right? You can have the same thing, 100%. For instance, the current process to develop streaming applications is: you

Matt (45:11.408)
Yes. No, go for it.

Amey Chaugule (45:32.679)
check out a few events from Kafka, you somehow create a mock of how your stream should ideally look, right? Maybe you have something good enough; I think Databricks Cloud has a good integration with Spark Streaming, so you can actually just write a streaming query and see some outputs printed to the screen. But that's generally the flow. And then you would put it into a jar,

compile the jar, ship it to your cloud environment, however it is done, and then you actually can run it. So that is where we feel the developer experience is the thing that actually holds you back, right? And well, you can join two streams, but streaming joins are not like normal joins where there's just one correct answer. Streaming joins are really sensitive to

the scheduling of the different processing elements in your system and how time interplays with all of this. So the correctness of those applications is actually hard to figure out currently as a developer. And that is also the pain point we really want

to drill down on from the developer experience side. So that is how we're thinking about it. So yeah, you're right, they don't care; users of MotherDuck Cloud definitely don't care. But it's also nice that the same thing can run on my laptop without running a Docker image, you know.

Matt (46:56.098)
Yeah, what I would also add to that is cost. Our fundamental thesis here, too, is that if you run these stream jobs on a single node, you can save a lot of money. So whether you care about it being run on a single node or not, it still allows us as a service provider to offer a cheaper cost compared to people that are distributing this. And the final bit I'll add is that, I'm sure a lot of people have had this experience, I know I personally have, where what's advertised is not always what you get. The advertising is:

we'll run this for you, we'll run this cluster for you, and it'll just work, TM. And then the moment you try to send some traffic through it, it doesn't work. There is something to be said about how these systems are still really complicated to run, even if you are an expert at them. There are failure modes, there are edge cases, and you want your SaaS product to just work. But there's no guarantee that it will until you've actually battle-tested it yourself.

Nitay Joffe (47:50.365)
I agree completely. They're all leaky abstractions; somewhere something bleeds through, in terms of fault tolerance, or query parameters, and so forth.

Amey Chaugule (47:50.845)
Yeah.

Matt (47:53.762)
Yeah.

Amey Chaugule (47:54.181)
Yeah.

And Nitay, the worst job I wouldn't ever wish upon someone is to be on call for random stateful stream processing jobs someone else created. Typically this is someone at a bigger company

who works in the data infra team and provides streaming as a service for their internal users. The first rule is that your users will use your systems in ways you have never even imagined. That's just going to happen. And when these jobs are coupled with state, that is where the fun pages really come from.

Nitay Joffe (48:35.837)
As they say, it's like onboarding by throwing somebody in a pool of water with no floaties.

Amey Chaugule (48:39.815)
Yeah.

Nitay Joffe (48:43.007)
That makes a lot of sense. Shifting gears slightly. One question I had, you mentioned you guys are leveraging Data Fusion. I'd love to hear a bit more in terms of how you guys are using it, how you guys have been working together with the community, where you'd like to see the Data Fusion project go and what you'd like to see them build and what you see yourself working with and contributing and so forth.

Matt (49:07.987)
So, Data Fusion is a great library. For those who are unaware, Data Fusion is this kind of next-generation query engine library, written in Rust, and it's the kind of thing that can be used to basically build databases. Actually, the way that we met Kostas was through organizing the inaugural San Francisco Data Fusion meetup. We brought out a number of members from the community; a number of the Data Fusion PMC

came out, and we had a nice night of pizza and presentations. Overall, what we've been doing is prototyping our idea in a fork of Data Fusion, and we've recently started to engage the community to see where it makes sense to contribute some of our changes back upstream to the broader community, versus where it makes sense to

create a separate library that depends on Data Fusion in order to implement our streaming use cases. Those conversations are ongoing, but we do plan to open source the entire thing. It's just a question of whether it'll be upstreamed or not.

Amey Chaugule (50:18.099)
Yeah, the roadmap isn't final yet, but streaming has been considered for the roadmap. And to add to that: the whole philosophy behind Data Fusion, you can think of it as kind of like DuckDB, but it's also a query engine that is built for extensibility. It has a lot of entry points where you can easily add operators, where you can easily add your own SQL optimizations, and you can add your own new SQL syntax

too, right? Because DuckDB seems to create their own ergonomic SQL, things like that. It is a very vibrant community; it's already used by about 20 or 30 different projects, including vector databases and things like that. It took us

two and a half months of real coding to get our first prototype out, to where we could compare things. It would have taken us a year to build the same thing without Data Fusion; that just goes without saying. And, you know, we are distributed systems nerds, we're data infra nerds, but we know that

building correct databases is an undertaking that we know enough about to know that we don't know enough. So we stand on the work of experts. That's how we see it.
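
For a taste of those entry points, here is a small sketch of plugging a source into Data Fusion behind its TableProvider abstraction, using the built-in MemTable as a stand-in; a streaming engine would register its own provider the same way. Exact APIs shift between Data Fusion versions.

```rust
// A small sketch of one Data Fusion extension point: any implementation of
// the TableProvider trait can be registered as a table. MemTable is the
// built-in in-memory provider, standing in here for a custom source.
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int64Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("user_id", DataType::Utf8, false),
        Field::new("amount", DataType::Int64, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["a", "b", "a"])) as ArrayRef,
            Arc::new(Int64Array::from(vec![10, 20, 30])),
        ],
    )?;

    // Any TableProvider implementation can be registered like this.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("payments", Arc::new(table))?;

    ctx.sql("SELECT user_id, SUM(amount) FROM payments GROUP BY user_id")
        .await?
        .show()
        .await?;
    Ok(())
}
```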

Nitay Joffe (51:41.033)
Yeah, so I was talking with Kostas the other day about how things like Data Fusion and Arrow and Velox and so forth are going to drive a surge of data projects and companies, because historically it would take millions of dollars and probably five years before you saw a result, and now, as you just said, in a couple of months you can get a full data system up and running, which is amazing and awesome. What would you want to see from the Data Fusion community, or even RocksDB, which you mentioned using as well? What else would you want to see

the open source world developing for you guys to use.

Amey Chaugule (52:13.545)
So, I think RocksDB is a really great project. And actually, just an aside, I just remembered this: when we started, because we got into YC, this was two years ago, I remember the first thing I told Matt was, at some point I'm going to be tempted to build a database, and you should stop me. These words were actually said. But Data Fusion changed that. It made us

dream enough that we could actually pull this off. So that is something Data Fusion can do: it can make your dreams come true. But from the open source community side, I think Data Fusion actually has a good direction. It has a great steering committee. They have been very open and very welcoming to us. So we're actually very


Amey Chaugule (52:58.677)
happy with how that project is progressing, and I think there's a lot more interest in it and it's only bound to grow. I think Apple started supporting Data Fusion Comet, which is a subproject of Data Fusion, an accelerator for Spark. It's not quite a drop-in replacement yet, but it has the potential to go there.

You know, like you guys alluded to: what do you do with state, with growing state? I think these LSM-based systems

were really revolutionary when they came out 10, 15 years ago. But I feel like there is a piece there that is perhaps ripe for revisiting, especially by people far smarter than us, people who know about storage systems and how

they interact, building a more updated version. And perhaps the answer could be within RocksDB or a project like that itself, or a project that builds on a lot of the learnings from these. For instance, at least in the streaming world, a company we didn't talk about is WarpStream; they are Kafka on object storage. We are tackling the compute piece of data infra, and on the storage

piece of data infra there's a lot of movement going on, which is: object storage is the storage, right? So we would be interested in seeing, it would be refreshing to see, new takes on something like RocksDB. There's FoundationDB, I believe, but I personally don't have a lot of experience with it other than looking at some Hacker News posts and things like that.

Nitay Joffe (54:54.281)
There's one project we definitely know well because we had the person on the podcast that you should check out called SlateDB.

Matt (55:00.816)
SlateDB.

Amey Chaugule (55:00.872)
We will.

Nitay Joffe (55:03.123)
It's early, not announced yet, believe.

Matt (55:04.414)
Nice.

Kostas (55:06.928)
Yeah, it's not open source yet, but it's going to be open source, so it might be interesting for you guys. It's like what Amey was describing, actually: it's RocksDB, but for object storage. We'll see how it's going to work out. But I have a question about Data Fusion. Go ahead.

Amey Chaugule (55:23.752)
Yes.

Matt (55:25.833)
Okay.

Amey Chaugule (55:29.351)
Yeah, sorry, I would say that would just solve our durable state issue, right? Because if the whole thing goes down, including your disk, at that point the best you can offer is to store checkpoints to S3, but that will add latency. But sorry, Kostas.

Kostas (55:46.648)
Yeah, so I have a question for you guys about Data Fusion, following up on the questions about the project and the community. What struck me, and that's also my takeaway from the meetup, actually, is seeing so many different types of data systems being built on Data Fusion. It's kind of surprising, in a good way, right?

Primarily because we've all learned that data systems come with different trade-offs. That's why we have different systems out there and not just one database that does everything. But it appears that people find ways to use Data Fusion in very different systems, from time series databases to stream processing to OLAP systems, to all these different things.

So I want to ask you, since you've been working with the code base there, right: my question is, is this the result of,

Kostas (56:58.499)
let's say, the system that has been built, the architecture the system has, or the community? And it doesn't have to be an OR; it might be an AND, right? So what's your take: what's the secret sauce behind this really unique success for a project like this?

Matt (57:20.95)
I personally think it's all of the above. I think Data Fusion has a really great, well-organized, and welcoming community, and that is really hard to do in open source. Then couple that with the fact that Data Fusion is, I think, very sophisticated, so it's fast. Community alone is not enough to warrant this level of excitement and adoption; it has to perform at the end of the day. And from our testing and our work, we see that Data Fusion really

does. And then the third bit, which Amey mentioned, is extensibility: it's really easy to take it and make it do a bunch of different things. So people are having a lot of fun playing around with it, exploring and experimenting. Some of those changes make it upstream, some don't, and new projects get launched. But overall, I think it's all three.

Amey Chaugule (58:12.775)
Yeah, and to add to it: some of the creators of Data Fusion were writing databases at Vertica, so they have a lot deeper experience. Data Fusion pairs the academic world of databases with actual open source design. And yeah, like Matt said, extensibility. DuckDB is open source too, but DuckDB's extension model is kind of like Postgres's extension model: you can write an

Kostas (58:13.097)
Yeah.

Amey Chaugule (58:39.589)
extension plugin, right? Of course, you can look at their code base, but it's not designed with an eye towards how you could change operators, add operators, or change certain things. Data Fusion just increases the surface area of where you can hook in and experiment. At least to our knowledge, it's the first project of its kind.
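(To make that surface area concrete, here is a minimal sketch of one of the simplest hooks: registering a scalar UDF on a DataFusion SessionContext. Table providers, optimizer rules, and physical operators plug in through similar traits. Exact signatures shift between DataFusion releases, so treat this as illustrative rather than definitive.)

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array};
use datafusion::arrow::datatypes::DataType;
use datafusion::common::cast::as_float64_array;
use datafusion::error::Result;
use datafusion::logical_expr::{ColumnarValue, Volatility};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // A scalar UDF that doubles a Float64 column -- the smallest extension
    // point; custom sources and operators register on the same context.
    let double = create_udf(
        "double",
        vec![DataType::Float64],
        DataType::Float64,
        Volatility::Immutable,
        Arc::new(|args: &[ColumnarValue]| {
            let arrays = ColumnarValue::values_to_arrays(args)?;
            let input = as_float64_array(&arrays[0])?;
            let doubled: Float64Array =
                input.iter().map(|v| v.map(|x| x * 2.0)).collect();
            Ok(ColumnarValue::Array(Arc::new(doubled) as ArrayRef))
        }),
    );
    ctx.register_udf(double);

    // The new function is immediately usable from SQL or the DataFrame API.
    ctx.sql("SELECT double(column1) FROM (VALUES (1.0), (2.5))")
        .await?
        .show()
        .await?;
    Ok(())
}
```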

Kostas (59:02.165)
Yeah, 100%, I totally agree on that. So, all right, we're at the end here, so I'll ask one last question, and it's going to be about Data Fusion. What's the system that you haven't seen built with Data Fusion yet that you would be really interested to see?

Amey Chaugule (59:33.873)
I have one in mind already, but yeah, I think... Yeah.

Matt (59:35.648)
Yeah, I have one in mind as well, and it might be the same one. I think it would be integrating an actual data store, like a streaming data store, with Data Fusion for the computation. So kind of like Kafka with Data Fusion built in, which I think would be pretty interesting.

Amey Chaugule (59:55.431)
Yeah, I mean, that is one of the things. I guess since you picked that one, the other one I would say is that I think data systems are going to move towards having to handle multimodal data itself. And sorry, there was some banging there, but that's all right.

You know, having to handle multimodal data, so you should be able to store streams and images and audio in the same thing. Right now people are doing it using object stores. I wonder, can databases reinterpret these binary data sets, and

can you actually bring database algorithms and SIMD compute to work on them within the database's confines, as opposed to databases just being used to store and retrieve binaries? Can you also translate the compute pipelines for these workloads into that? I think that would be very useful for a lot of the newer use cases we're seeing.

Kostas (01:00:54.97)
Super interesting. Nitay, anything from your side before we close the conversation here?

Nitay Joffe (01:01:02.665)
Sure, yeah, last question, I guess, going back to the query language stuff. What would you like to see, or what's your vision for the future here: how do you think people should be building, or rather at least interfacing with, streaming systems? Are DataFrames actually the answer? Is it that kind of API mixed with SQL?

Amey Chaugule (01:01:20.211)
Well... yeah, I mean, there was a recent paper, right? "What Goes Around Comes Around... And Around," and the conclusion is SQL, isn't it? Why are you guys bothering? We have slightly different takes, I think.

Nitay Joffe (01:01:28.348)
Yep.

Amey Chaugule (01:01:34.513)
A lot of times people care about convenience over, you know, the academic or mathematical correctness of some type of compute. I think DataFrames aren't necessarily the final step, but they're a step in a direction, and their adoption is the users of these systems telling us something.

Like I said, there's even a separate world of these durable execution systems, right, that are describing your computations as events. So I think DataFrames are one way, but these event-driven data flow things also have some nice characteristics. And like I mentioned, the problem with interfaces is not necessarily the interface itself, but the underlying compute, the query engine layer.

If you can make that interaction really rock solid, so that the interface user doesn't have to worry about it too much, that is the platonic ideal.

So yeah, I guess the answer I would have is: DataFrames seem to be a step in that direction. It's unclear whether it's the only way, because, you know, a streaming query is a continuous query. Once you start it, it's like a freight train; you can't stop it. So that's still an icky part of a streaming DataFrame that we don't necessarily have the best answers for just yet.
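(To picture the "streaming DataFrame" idea, here is a purely hypothetical sketch of what such an API could look like in Rust. Every type and method below, Context, from_topic, window, start, is invented for illustration and is not Denormalized's actual API; the point is the freight-train property, where starting the query hands control to a pipeline that runs continuously.)

```rust
use std::time::Duration;

// Hypothetical streaming-DataFrame API; all names here are invented
// for illustration, not taken from any real crate.
struct Context;
struct StreamFrame;

impl Context {
    fn new() -> Self { Context }
    // Bind a DataFrame to an unbounded source, e.g. a Kafka topic.
    fn from_topic(&self, _broker: &str, _topic: &str) -> StreamFrame { StreamFrame }
}

impl StreamFrame {
    // Each call adds an operator to a continuous query plan.
    fn filter(self, _predicate: &str) -> Self { self }
    fn window(self, _width: Duration) -> Self { self }
    fn aggregate(self, _exprs: &[&str]) -> Self { self }
    // Starting the query is the point of no return: it runs until cancelled.
    fn start(self) { /* drive the pipeline indefinitely */ }
}

fn main() {
    let ctx = Context::new();
    ctx.from_topic("localhost:9092", "sensor_readings")
        .filter("reading > 42.0")
        .window(Duration::from_secs(60))
        .aggregate(&["avg(reading)", "count(*)"])
        .start(); // continuous: like a freight train, it doesn't stop
}
```

The open design question the conversation circles is exactly what lifecycle controls such a handle should expose once the query is moving.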

Nitay Joffe (01:03:06.28)
Absolutely.

Kostas (01:03:10.666)
Alright guys, thank you so much for spending this hour with us. We really enjoyed the conversation, and I'm sure our audience will too. We are looking forward to having you again in the future. I know that you're working hard to release things, and hopefully also open source stuff, so I'm waiting for that to happen so we can have a conversation after it and see where the journey is taking you.

Nitay Joffe (01:03:42.231)
Likewise, thank you guys, it's been an absolute pleasure. Looking forward to seeing Denormalized grow and expand. And for the audience: how should folks get in touch with you?

Matt (01:03:54.18)
Twitter. I am Matt Green on Twitter, and then LinkedIn also works.

Amey Chaugule (01:03:59.515)
Yeah, and our domain is denormalized, with a z, dot io. We sort of have a landing page, but it's a good excuse for us to put more content out there, which we are working on right now, actually. And I am ASC89 on Twitter, but, you know, LinkedIn is also where you can find us. You can just look up Denormalized on LinkedIn.

That's a good way, because I think LinkedIn is also becoming a place for data conversations, more so than Twitter sometimes these days. But yes.

Kostas (01:04:40.072)
That's true, and we are going to add all the contact details to the landing page of the episode too, so people can check that and find all the links there to connect with you. Thank you so much again, guys, and again, looking forward to having you back on the show really, really soon.

Amey Chaugule (01:04:50.717)
Yep.

Matt (01:04:56.874)
Perfect, thanks so much.

Amey Chaugule (01:04:57.139)
Thank you. Thank you for having us.

Nitay Joffe (01:04:57.407)
Thank you.