Tech on the Rocks


Summary

In this episode, we dive deep into the future of data infrastructure for AI and ML with Nikhil Simha and Varant Zanoyan, two seasoned engineers from Airbnb and Facebook. Nikhil and Varant share their journey from building real-time data systems and ML infrastructure at tech giants to launching their own venture.

The conversation explores the intricacies of designing developer-friendly APIs, the complexities of handling both batch and streaming data, and the delicate balance between customer needs and product vision in a startup environment.

Contacts & Links

Nikhil Simha
Varant Zanoyan
Chronon project

Chapters

00:00 Introduction and Past Experiences
04:38 The Challenges of Building Data Infrastructure for Machine Learning
08:01 Merging Real-Time Data Processing with Machine Learning
14:08 Backfilling New Features in Data Infrastructure
20:57 Defining Failure in Data Infrastructure
26:45 The Choice Between SQL and Data Frame APIs
34:31 The Vision for Future Improvements
38:17 Introduction to Chronon and Open Source
43:29 The Future of Chronon: New Computation Paradigms
48:38 Balancing Customer Needs and Vision
57:21 Engaging with Customers and the Open Source Community
01:01:26 Potential Use Cases and Future Directions

What is Tech on the Rocks?

Join Kostas and Nitay as they speak with amazingly smart people who are building the next generation of technology, from hardware to cloud compute.

Tech on the Rocks is for people who are curious about the foundations of the tech industry.

Recorded primarily from our offices and homes, but one day we hope to record in a bar somewhere.

Cheers!

Kostas (00:03.001)
Nikhil and Varant, welcome. It's very nice to have you here with Nitay. We're very excited to chat with you, learn about your past and, most importantly, what you're up to in the future. So let's start with a quick introduction of yourselves. Nikhil, would you like to go first and tell us a few things about yourself?

Nikhil Simha (00:24.979)
Yeah, absolutely. Thanks for having us. I started working at Amazon and then at Walmart Labs, roughly on the same thing, with a bunch of people who had PhDs in NLP. And they were trying to solve the problem of putting high-quality information on item pages at Walmart and Amazon, so that you don't have to leave those pages to do research

about that item. So there would be like this whole infrastructure that would crawl data from across the web.

Nikhil Simha (01:05.001)
convert that unstructured data into data that can be shown on an item page. So this was a system using humans in the loop and a little bit of NLP magic, before it even worked, you know, back in 2012, to do this task. So things like item descriptions, comparison tables, et cetera. I was the infra guy basically, and then I realized that's where my interest mostly lies. So when I moved to Facebook, I was working on the real-time data systems team, as the second engineer on it.

We built everything from scratch, basically the engine, the scheduler for stream processing and the compiler for converting SQL into stream processing jobs. I did that for about four years and towards the end, I was working on like unifying batch and streaming metrics within Facebook. Now the realization was that the only legit customer for that was machine learning.

I moved to Airbnb at the end of 2017, early 2018, I would say, and I was working on ML data infra ever since. I'd been on the ML infra team for a while, working with Varant here for about six years, but towards the end, I was supporting the feature platform team, the embedding platform team, and ML observability.

Towards the very end, we were spinning up the RAG orchestration.

Kostas (02:45.097)
Alright, that's amazing. Varant, your turn.

Varant Zanoyan (02:50.544)
Yeah, hey, yeah, first of all, thanks for having us on the show. Let's see, so I started working at Palantir back in 2013.

So at the time it was like a forward deployed engineering role. And yeah, there were a lot of us doing that in those days, kind of flying around to client sites and doing kind of on-site engineering to get things to work. So I liked that job. I liked the client-facing side of it. And so when I wound up at Airbnb a few years later, they kind of had a somewhat similar internal role. There was no ML infra at the time. There was just data infrastructure and data engineering. And then, you know, basically if you were on the data eng team, you would work sort of embedded with various product teams across the company to help solve infrastructure and data challenges that they were having. So I was doing that for about a year or so

when I first got to Airbnb in 2015. But then, you know, through that process, we basically saw a lot of repeated challenges that machine learning teams were hitting in trying to get their models to production and trying to actually do useful machine learning projects in the company. And rather than solve them one-off, team by team, that's when we first formed the ML Infra team, shortly after that. And yeah, that basically began...

a pretty long infra-building journey for me, which was a lot of fun. And yeah, I got to work with some great people. That's around the time that Nikhil joined. Yeah, I ended up being at Airbnb about eight and a half years total, start to end, working on that. And most of that time was building out data infrastructure for machine learning, pretty much, which is surprisingly hard to get right, as it turns out.

Kostas (04:38.248)
Alright, there's more to talk about because that's about your past. There's a very interesting future for both of you, but we'll talk about this a little bit later. I have a question for you, Nikhil, because you said something that I found very interesting. You said that when you were at Facebook, Meta, Facebook, whatever, you were working at some point to merge batch and stream processing together, and you figured out that the

customer who cares about this is ML. Can you tell us a little bit more about that? Why is this the case?

Nikhil Simha (05:17.339)
Yeah,

So it started out with a very different problem statement. There were some people who were consuming some metrics, or like leaders of certain organizations, who said these metrics land super late. Like they land after two days, you know, whenever something happens. So you would get like, today is July 12th, right? So you would only see July 10th metrics today. And that's because there were like a series of pipelines you need to run to get to the metrics table.

And that goes into some storage system that gets displayed on dashboards. And there is that huge pipeline there. But like this is the problem that not many people like wanted to do too much about. And like what we figured out is that you could instead run these pipelines every hour. And the key technical problem to solve there was to make them incremental.

Like, let's say I want to know the 90-day average time spent on the Android app of Facebook, right? And I want to do that every day. And if you think about

Kostas (06:17.924)
Mm-hmm.

Nikhil Simha (06:32.127)
today and tomorrow, there is an 89-day overlap in that computation. So just by structuring it in an incremental way, you could speed up a lot of the stages that you would have in the middle. So that was one thing we discovered. I was in the real-time data team, so we were building stream processing engines and whatnot. But someone found that this is the way to do it. And it made total sense. You really don't need streaming engines to

Kostas (06:58.735)
Mm-hmm.

Nikhil Simha (07:01.823)
do any of this. You can write your good old mapping jobs as long as you structure them incrementally. And that gave the speedup that they needed. But then part of this group was this other set of people who were doing fraud detection, site integrity back in the day. And, if you remember 2016, that was all the rage within Facebook.

Nikhil Simha (07:31.347)
They really cared about true real time, I'll use that term. I mean, they have these models that look at sequences of activity and decide if this is malicious or OK to let through. And all the data that goes into these models needs to be super real time, because once you promote bad content, it has done its damage. And especially misinformation

or abuse, both of these are top priority for those things. They cared a lot about all the data being real time. And on the flip side, when they create training data, they want the training data, which is computed in batch, to have the same semantics. So if I have this new feature idea that I want to compute as of a historical point in time, when I have a label that says

this was abusive or this was not, right? At that point in time, you need to recreate the feature at that millisecond, basically, right? So even though you're in this batch job, you're creating these point-in-time correct features for the keys. So that's where I first saw the problem. I didn't end up solving it there, but I recognized the problem statement. And then when I moved to Airbnb, I saw the same problem with fraud detection at Airbnb.

It clicked. And by that time, you know, Varant and Brad, the other engineer who was working on Zipline, had made really good progress on solving the batch side of the problem.

Nitay Joffe (09:09.988)
Those are really interesting solutions. I'm curious what compromises, if you will, or trade-offs you had to make. Like for example, the solution you just described before was essentially going from a single count to a bucketed count, and so you're increasing the storage, but therefore you get a much faster computation, right? So as you move to this kind of lower and lower latency,

Kostas (09:10.346)
Mm-hmm.

Nitay Joffe (09:29.132)
I'm curious, like, I've seen other organizations, you know, start doing approximate things and sketches and all these kinds of things. So I'm curious how you thought about that in terms of the trade-offs, and in terms of the ML folks you were working with, what were they pushing for, what was important versus not, and so forth.

Nikhil Simha (09:44.805)
Yeah, so ML folks were pretty okay with approximate counts and sketches, basically. Right, for those listening, a sketch is something that approximates the value of a precise count. Like, instead of doing a unique count, which requires you to store all unique elements in a set, you can use an approximate

thing called HyperLogLog, you know, if someone's interested in looking it up, to approximate the unique count. It essentially just stores hashes in a bit set somewhere and uses the bit set to estimate how many elements are in the set. Now, ML is totally fine with this because ML models just need a positive signal, not necessarily precise information. Metrics are less fine with this kind of thing.
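
[A minimal, illustrative sketch of the HyperLogLog idea described above: hash each item, use a few bits of the hash to pick a register, and keep only the longest run of leading zeros seen per register. The register count and bias constant here are textbook simplifications, not tied to any system discussed in the episode.]

```python
import hashlib

class HyperLogLog:
    """Toy HyperLogLog: estimates a distinct count with 2**p small registers
    instead of storing every unique element."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                                   # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)        # bias correction for large m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)                            # low p bits choose a register
        rest = h >> self.p                                # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1      # leading-zero run + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        return int(self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.count())  # roughly 100,000, estimated from only 1,024 registers
```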

Nikhil Simha (10:45.287)
Yeah, going back to the question, the batch side, so where we ended up was

Nikhil Simha (10:58.975)
people didn't want the latency of this metrics computation to be two days. They're fine if it is an hour or 15 minutes. They don't need the real-timeness that the machine learning models need. So they just needed to change the pipelines to be incremental and work off of hourly partitions or 15-minute partitions. And that did the job. So we ended up just going with that for the metrics use cases.

For ML, that's where we think the problem to some degree still exists within Airbnb, or within Facebook, if I understand correctly.
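
[A small illustrative sketch, with made-up data, of the incremental structuring described above: keep one partial aggregate per daily partition, and compute a rolling 90-day average by merging those partials, so each day only the newest partition has to be computed from raw events.]

```python
from collections import defaultdict

def daily_partials(events):
    """events: iterable of (day, seconds_spent). Returns {day: [sum, count]},
    one partial aggregate per daily partition."""
    partials = defaultdict(lambda: [0, 0])
    for day, seconds in events:
        partials[day][0] += seconds
        partials[day][1] += 1
    return partials

def rolling_average(partials, end_day, window=90):
    """Average over the `window` days ending at end_day. Only the newest day's
    partial needs to be built from raw events; merging 90 tiny partials is cheap."""
    total = count = 0
    for day in range(end_day - window + 1, end_day + 1):
        s, c = partials.get(day, (0, 0))
        total += s
        count += c
    return total / count if count else 0.0

# toy usage: days are plain integers standing in for date partitions
events = [(d, 60 + d % 30) for d in range(1, 181) for _ in range(5)]
partials = daily_partials(events)
print(rolling_average(partials, end_day=180))   # reuses the 89 days shared with yesterday
```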

Nitay Joffe (11:38.953)
So moving forward a little bit, Varant, I think you said something very interesting when you were talking about Airbnb. You were saying, basically, that data infrastructure for ML is just a hard problem to get right. Airbnb is obviously very successful, a great company, and in particular makes a heavy investment in tech, right? A lot of great software engineers. So hearing that from such a great company, why is it so hard?

Nitay Joffe (12:04.776)
Varant, take it away. Or, Nikhil, if you want to start, we can wait.

Nikhil Simha (12:10.271)
Yeah, just to kind of describe the situation in 2017, what people would do when they need to iterate on their models and add new features is first, they would spin up this elaborate online system that would serve the feature. The model is still not trained on this feature. It's just an old model which does not know anything about this feature, but they would still serve it and log

for X months; in Airbnb's case, that's like three to six months. And when they have enough logged data, they would use that to train a new version of the model. So they're just running this new feature in shadow mode, essentially, for a long time. And building this system, if you need scalability, also takes a couple of months. By that I mean, basically, if you are storing three or four elements and doing an aggregation to get a feature

for any given key, you can do it very easily. But if the element set is very large, like in the thousands, you cannot serve that feature at scale. You need to pull a thousand events every time and do some aggregation. So there is an infra problem there, but there is also this log-and-wait problem, where you log and wait for many months. And all of that meant that any new feature you want to evaluate took six months. Before you even knew whether this feature was going to be of value, you had to spend

a long, long time to capture enough data and build the online system. So for fraud, that was basically not okay. I mean, fraud moves really quickly. It's adversarial, right? By the time you build this model, the fraud has done its damage and, you know, the fraudsters would have moved on to a different strategy. So this was a huge problem at Airbnb, and where we landed was

we need to backfill these new features without even having to serve them online somehow. So yeah, I'll just stop there and see if Varant has anything to add.
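
[An illustrative sketch of the backfill idea, with hypothetical data and windows: given labeled examples with timestamps, recompute a candidate feature as of each label's timestamp from historical raw events, so the value never leaks information from after the label and no months of online logging are needed.]

```python
from datetime import datetime, timedelta

# raw historical events: (user_id, event_time), e.g. past payment attempts
events = [
    ("u1", datetime(2024, 1, 3)), ("u1", datetime(2024, 1, 10)),
    ("u1", datetime(2024, 2, 1)), ("u2", datetime(2024, 1, 20)),
]

# labeled examples: (user_id, label_time, label), e.g. "was this booking fraud?"
labels = [
    ("u1", datetime(2024, 1, 15), 1),
    ("u2", datetime(2024, 1, 25), 0),
]

def backfill_count_feature(events, labels, window_days=7):
    """For each label, count the user's events in the window ending at the label's
    timestamp: a point-in-time correct feature with no future leakage."""
    rows = []
    for user, label_time, label in labels:
        start = label_time - timedelta(days=window_days)
        value = sum(1 for (u, t) in events
                    if u == user and start <= t < label_time)   # strictly before the label
        rows.append({"user": user, "ts": label_time, "events_7d": value, "label": label})
    return rows

for row in backfill_count_feature(events, labels):
    print(row)
```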

Nitay Joffe (14:23.277)
Varant, I was asking you, you had mentioned in your intro that data infrastructure for ML is hard to get right. Anything you wanted to add there on why it's so hard to get right?

Varant Zanoyan (14:31.353)
Yeah, yeah, yeah. So I think one way to look at the whole thing is, you know,

there's data, there's the application data serving side that's somewhat well solved. You know, you have production databases that serve applications, and we know how to scale those; at Airbnb it looks one way and at Facebook it looks another way, but more or less you can serve data to applications. Then you have these data warehouses that kind of evolved for these more analytical workflows, large averages, window aggregations, things like that. But then for machine learning, you're in this world where you need to bridge that gap. You need, oftentimes, big, complex, heavy analytical

workloads to produce training data or evaluation data or anything like that. And then you also need to be able to serve production inference. So you need to kind of bridge this; we call it the online-offline divide. And so I think that is a deceptively challenging problem. And what's funny is that

Yeah, it's like solving that problem is harder than solving both problems individually as point solutions because you need to be able to guarantee this consistency between the two and you really need to be able to

do the difficult scaling work to do point lookups with low latency, and then also efficient batch backfills, and guarantee consistency between the two. And that offline piece, you might look at this and think streaming real-time aggregations, that's pretty tough. And it is, but surprisingly that offline piece, the batch computation of these things for the ML use case, is also surprisingly complex. It's something that Nikhil actually made a lot of really great contributions to.

Varant Zanoyan (16:13.843)
He tried it every which way to get it to be relatively painless.

And yeah, I mean, interestingly, our motivation was not as pure as you might think as engineers to make that piece solid. It was more very selfish, in that if, you know, people's batch jobs are failing, they'll come to your goalie channel. And we didn't want that. You know, we wanted to rest easy and we wanted our lives to be easy. So we put a lot of time into, well, how do we get people this very flexible way to define transformations, which they'll inevitably use to define inherently complex, high-cardinality transformations?

How do we build a compute engine that can do that in a relatively fault-tolerant way, that's tolerant to really big windows, to really large data scale, to highly skewed data sets? That's a really big one, especially for anti-fraud use cases and security use cases. You get a lot of highly skewed data sets where you can't do the easy thing of just filtering out the skewed keys, because you need those to be part of your training. So just getting that batch piece to scale was surprisingly difficult.

I think it took us a lot of iterations before we finally got there.

Kostas (17:29.229)
Can I ask something, Varant? You said there's a selfish reason behind going and building that stuff, which is: I don't want people to come and bother me because things are failing, right? It makes total sense. Can you help us define a little bit what failure means? Because I think when someone thinks of failure, I mean, I don't know, instinctively you think of

Varant Zanoyan (17:41.099)
Yeah.

Varant Zanoyan (17:48.205)
Yeah, yeah.

Kostas (17:59.27)
something breaks, doesn't run, doesn't finish. But I think failure might be a little bit more of a complicated term. So I'd love to hear how you define failure and what types of failure you had to deal with.

Varant Zanoyan (18:16.193)
Yeah, yeah, yeah. So that's actually a really good question. And we've kind of formed this from years and years of goalie duty, basically. So there's a couple of different ones. I guess if I were to try to categorize them, this might be wrong, I'm kind of doing this on the fly, so I might revise this. But you could say...

So, OK: Chronon is basically you, as the human, defining some semantics on top of your data for transformation. So it could be that you accidentally defined the wrong semantics, and so you got an unexpected result. And there we think about how we can make the API more ergonomic and more natural, more likely for you to express what you were trying to do without accidentally expressing something else. Two is the computation crapped out somewhere. So, you know, really what I'm talking about here is

Spark OOMs, at least in Airbnb's deployment of Chronon, which is something that a lot of people are very familiar with. But it could also be a hanging executor. It could also be...

you know, maybe some other kind of issue in the pipeline. Like, you tried to join things on a key with a different type or something like that. But most of the time it's Spark. So, you know, all that family of stuff is what we want to abstract away from the user. Like, if there's an issue, and if you define something

that's, you know, kind of nonsensical, then we should just catch that before the compute job even happens. That should be a compile-time sort of thing that we catch and report in a very natural way. I'm trying to think of a good example of that. I mean, let's take the case where you joined these two things in your semantics, but they have different types. So, did you mean to cast something? Did you mean to use a different column? That could be caught early.

Varant Zanoyan (20:14.447)
Another example might be: you defined a one-year window on this source. But when we start the job, we can check the source data range really quickly, before we run any heavy year-long computation, and say, but the source only goes back three months. So you tried to define a one-year window, but you only have three months of data. So those kinds of things: the more ergonomic we can make it, the more clearly we can express this stuff to the user, the more it can seem like, OK, I'm just defining this transformation, this framework's helping me, and it's giving me sensible results. And the less we can expose them to, like, a

400-line Spark stack trace, the better experience we can offer our users and the lighter goalie burden we can create for ourselves. That's kind of the idea.

Kostas (20:56.899)
No, that makes total sense. Okay, question here because here we're talking about like two completely different levels of...

system that you have to deal with, right? Like, you have the infra on one side, more of the, yeah, we are running out of memory, right? I think every engineer's brain understands that it is a hard problem to solve. It is a real problem, but also, intuitively at least, there's a connection between the problem and the solution: throw more memory at it, it will probably solve it, right? But

Varant Zanoyan (21:30.857)
Yeah, Yep, yep.

Kostas (21:34.24)
When we're talking about semantics here, right? And the ergonomics and the developer experience, things become like much more vague, right? Like it depends on who is your developer. Like a data scientist might be different than a data engineer or like an analyst, right? Or like, not only like in terms of like how they work or like the type of work that they are doing, but also like how they understand the words, right? And like even like the vocabulary they are using that might be like different.

So my question to you guys is, how do you surface these opportunities for creating these ergonomics? What's the process? How did you come up with these things before you even implemented them? Because implementation, I don't know, in my mind at least, might be the easier part compared to identifying all these

opportunities for optimizing the experience. So how do you do it?

Nikhil Simha (22:39.775)
I mean, for us, our users basically yelled at us: these are all the ways in which this thing is failing, and we don't want to deal with this, make it go away. They gave us a really good answer: validate all of these things before you even kick the jobs off.

Varant Zanoyan (22:55.662)
We

Varant Zanoyan (23:02.007)
Yeah, I think we had the starting point of SQL, you know; SQL is a pretty good semantics for data transformation. We used that as a starting point. But the reason why we couldn't use pure SQL was because we had to have this more first-class support for time, like time as a

first-class citizen. So you can express things like point-in-time correct joins, for example, which we can go into more detail on. But we wanted to be able to express things that you can't express with pure SQL. We had SQL as a starting point, though, and that gave us a little bit of grounding in the space of thinking about how to design the semantics, or the DSL, of the system. And then we basically saw people getting the time stuff wrong with our first iteration. And that's a funny one, since you asked about failure modes.

A common failure mode, at least in some implementations of data pipelines, is: my output data is all null and I don't know why.

And so we saw that one happen a lot, and we saw that mismanaging the time variable in our initial API was a very common one. So then we were like, okay, instead of giving people control at that lower level of controlling these timelines, let's map things to more familiar concepts and then infer how to do the time-based computation and aggregation under the hood. And that got us to a much more successful state in terms of people expressing what they wanted to express and things just working.

Kostas (24:25.403)
Mm

Varant Zanoyan (24:25.941)
So, yeah, I realized I was very abstract.

Nikhil Simha (24:26.547)
We should maybe give an example about exactly where.

Kostas (24:30.48)
Yeah, yeah, please do, please do, yeah.

Nikhil Simha (24:35.271)
Yeah, so one mode of failure is that they would have this fact table that has events going back multiple partitions for any given key. And there would be a dim table, where every partition has all the keys and, you know, a snapshot of values within it. Now, they fundamentally have different scanning semantics. Like, if you want to get some information out of a fact table, you need to

look at multiple partitions. For a dim table, one partition is good enough. With slowly changing dimensions, you can reduce even further the amount of information you need to look at within the same partition: only the changed data is enough, and you can incrementally compute the rest. Now, all of this people would try to do by themselves, without truly reasoning about whether a table is dim or fact. Some tables are named dim or fact, but most tables have these semantics that are not exposed through the name.

Right, people would often assume one for the other, define their scanning logic, and get it wrong. And they would see nulls as outputs when they tried to do joins or aggregations and whatnot. Then we realized, you know what, just giving them bare SQL is an issue because of these timelines. Instead, we should ask them for the data model upfront: tell us if it's fact or dim, and we will do the scan for you. And we will move the scan window

according to when it's computed, instead of you hard-coding a ds, a date string, inside your query. That ended up removing that whole class of problems, or user errors, where they would write the wrong scan logic.
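
[A hypothetical sketch of the "declare the data model upfront" idea. This is not the actual Chronon API; it only illustrates how marking a source as an event (fact) table versus a snapshot (dim) table lets the framework derive the scan range, instead of the user hard-coding date strings.]

```python
from dataclasses import dataclass

@dataclass
class EventSource:
    """Fact-style table: each partition holds that day's events, so a windowed
    read must scan every partition in the window."""
    table: str

@dataclass
class SnapshotSource:
    """Dim-style table: each partition is a full snapshot, so the requested
    day's partition alone is enough."""
    table: str

def partitions_to_scan(source, as_of_day: int, window_days: int = 30):
    # illustrative only: partitions are named by integer day for simplicity
    if isinstance(source, EventSource):
        return list(range(as_of_day - window_days + 1, as_of_day + 1))
    if isinstance(source, SnapshotSource):
        return [as_of_day]
    raise TypeError("unknown source type")

print(partitions_to_scan(EventSource("fct_bookings"), as_of_day=100, window_days=3))  # [98, 99, 100]
print(partitions_to_scan(SnapshotSource("dim_listings"), as_of_day=100))              # [100]
```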

Kostas (26:10.426)
Mm-hmm.

Kostas (26:17.611)
That's super interesting. Quick question on that. So you said, Varant, that SQL is great, but lacks support when it comes to time, right? Time is not exactly a first-class citizen there. So how did you end up... because you need to add additional syntax or semantics there. Actually, no, you have to add additional semantics.

The way you can do it is either you extend the syntax, right, to support it, or you create a DSL that can transpile, for example, into SQL or something like that. There are different approaches. Which was the approach that you took?

Varant Zanoyan (27:00.149)
Yeah, yeah, so that's a good question. So we ended up taking a DSL approach, at least in Chronon. The reason why was two things. One, because we wanted to add

first-class support for time. And two, because we really want one definition from which we can power a bunch of different things. So we can power streaming jobs for real-time computation, batch jobs for backfills, and the part of the Lambda architecture that seeds online values for real-time updates and stuff. So basically, you know, we have this DSL, and from that we can compose Flink jobs, Spark streaming jobs, Spark batch jobs, and then also, you know,

things for partial aggregates. Let's do an example, actually, because this is where an example gets helpful. Let's say you're doing a 30-day average. Let's say you're a retailer and you want a 30-day average for each user of their shopping cart checkouts or something like that.

So, you know, for real time, what you basically need is that average decomposed. You need the sum and the count, so that as new things come in, you can update the sum and the count and have the new average be computed. And then offline, ideally, what you want on a day-to-day basis is not looking back 30 days and recomputing 30 days every single day; rather, you'd want to be able to use partial

aggregates, so that you can take only the new data, kind of the same way you're doing it online, and incrementally update things. So we need to be able to take that definition, that average, and under the hood break it into sums and counts, both online and offline in batch, so that we can do things like caching partial aggregates and reusing them in other computations for computational efficiency, and so that when new events come in, we can update the existing thing.
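
[A minimal, made-up sketch of that decomposition: the average is defined once, and the engine works with a mergeable (sum, count) partial. The same prepare/merge/finalize shape serves the streaming update path and lets batch jobs cache and reuse daily partials instead of rescanning 30 days of raw events.]

```python
class AverageAggregator:
    """Average expressed as a mergeable partial aggregate (sum, count)."""

    def prepare(self, value):
        return (value, 1)                       # one raw event becomes a tiny partial

    def merge(self, a, b):
        return (a[0] + b[0], a[1] + b[1])       # partials combine associatively

    def finalize(self, partial):
        s, c = partial
        return s / c if c else 0.0

agg = AverageAggregator()

# batch side: one cached partial per day, merged into a 30-day window
daily_partials = [(1200.0, 10), (900.0, 8), (450.0, 5)]   # pretend these are 3 of 30 days
window = (0.0, 0)
for p in daily_partials:
    window = agg.merge(window, p)

# streaming side: a new checkout event updates the same running partial
window = agg.merge(window, agg.prepare(79.0))
print(agg.finalize(window))   # current average, with no rescan of raw events
```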

Varant Zanoyan (29:00.119)
You know, we need a lot of flexibility in terms of taking a simple definition and breaking it down into various pieces that need to be able to run in all these different environments. So yeah, I mean, that's kind of why we decided to go with the DSL approach. There could be a way to extend SQL to get to a similar thing.

I think both kind of converge in terms of user experience at the end of the day. The other nice thing about the Python DSL is that it is inherently a little bit more composable because it's just Python. You can have your helper functions and you can build your repository the way you want. I think there's a bunch of trade-offs here. At the end of the day, though, both can result in a pretty good user experience. I mean, we went one direction for now.

It's not like I'm 100% sure that's necessarily the optimal thing, but I think with enough work, both of them can be pretty good approaches, honestly.

Nikhil Simha (30:03.069)
We have seen some users... so we initially had more or less a static API, where people would define all of these windows and whatnot, but we have seen people write this in Python. So we asked people to write essentially YAML, right? And inside the YAML, stick their SQL queries. And what we have seen some power users do is write a Python framework on top of this to generate these YAML files.

Kostas (30:03.47)
Yeah.

Nikhil Simha (30:33.727)
And we have seen a lot of that happen. And then it made sense that, you know, let's just make the API Python, make it more data-frame-like, and have people get that ability to generate these configs however they want, easily. So we ended up picking Python.

Kostas (30:51.567)
So why didn't you start with a data frame API from the beginning, instead of picking SQL? The reason I'm asking is because, at least in my mind, SQL is amazing, especially if you want to target a very broad audience out there; you can find, I mean, even business users willing to write SQL. But when it comes to data scientists, ML people in general, I don't know, the first thing that comes to my mind is Python.

And when you think about Python and querying stuff, then you think of data frames and pandas and all these nice things over there. So why did you pick SQL and not just go directly and start with a data frame API?

Nikhil Simha (31:41.097)
So when I joined, we had this YAML thing. I mean, the answer is just that it's basically legacy. The legacy API was... maybe Varant can talk more about the legacy API, right? Before we moved to data frames.

Varant Zanoyan (31:59.063)
Yeah, I think when I said SQL, I think that might've... The reality is that the Python API, if you look at it, is a little bit data-frame-y in some ways and a little bit SQL-y in some ways.

Varant Zanoyan (32:19.1)
The most SQL-y part of the API is basically the column-level extraction.

So you're extracting some information from certain columns, and that's where we give you pure SQL, just to run there. I mean, the nice thing about that, I think, is just that it's easier to make it work in Flink and Spark streaming; that's the main thing. It's just a more transpilable dialect than a straight data frame. But the rest of the DSL

uses SQL concepts like group by, but then again, those exist in data frames too. So it's not exactly SQL, it's not exactly data frame; it's a little bit of both, but it is a Pythonic kind of environment. So to your point, we did want to expose people to Python. That was the tool that we were most interested in. It lives in a machine learning repository that's all Python. We wanted it to be Python. And then there are some places where we just let you write SQL. And I think we let you write SQL, as opposed to writing, I don't know,

or something, is because we wanted to run in Flink. And then otherwise, the SQL concepts that we have, like group by, those are concepts that are both SQL-like and data-frame-like; it's just a data transformation sort of concept. So it ends up looking like a bit of a hybrid between SQL and data frames. That's, I think, what you'll see if you look at it.

Nitay Joffe (33:41.636)
I'm curious to hear, kind of going off of this: you know, one of the things I see with a lot of new projects and languages and DSLs and so on is that you naturally want to start as simple as possible, because you want to lower the learning curve as much as possible for new people, right? So they don't come in and are right away challenged with, I've got to learn a hundred things, right? But then, as the old saying goes, all programming languages start simple and then eventually the community forces them to add generics. And so,

Nikhil Simha (33:41.929)
Thank you.

Varant Zanoyan (33:54.767)
Yeah.

Varant Zanoyan (33:59.863)
Yeah.

Varant Zanoyan (34:07.115)
Yeah.

Nikhil Simha (34:07.867)
you

Nitay Joffe (34:09.007)
what I'm really asking is, kind of, that version for your project: if you didn't have the legacy that you mentioned, and you had everything you know now, with the project up and running and having great usage and feedback from folks, what would be your ideal, your vision of, here's what the language actually should look like, here's how people should actually interoperate with it?

Varant Zanoyan (34:30.829)
That's funny you asked that. Nikhil and I were just talking about this last night, because, I mean, we are kind of rethinking; we have a bit of a clean slate now in our next adventure, and we are asking ourselves that very question. Actually, yeah, Nikhil, you've thought about this a lot. Do you want to talk a bit about the PRQL approach and the sort of fluent API stuff? Might be interesting.

Nikhil Simha (34:51.527)
Yeah, yeah, yeah, absolutely. So I think SQL, as great as it is, lacks first-class support for time. And we experimented with a bunch of ways to add time in, through YAML, then through this data-frame-like thing. But what we want to do next is essentially give more auto-complete to the API.

Instead of users having to guess what the columns are, what the parameters are of this huge Python function that they need to define, right? Instead of that, we want people to be able to... Like in SQL, right, you can imagine if you started with the FROM before the SELECT statement, you could auto-complete the columns. Most people start with a SELECT, right? That makes it hard to auto-complete the columns. So we want to essentially solve that problem at

every stage of the API. So if people start with, say, from this source, and then say .select, we want the auto-complete to work mostly seamlessly. The columns that are within that table should be available to the IDE for suggestion. And similarly... so if you think about it, after FROM, GROUP BY does not happen right away, right? But the IDE does not prevent you from writing that if you're writing SQL.

Right. In terms of execution, what happens is FROM, then WHERE, then SELECT, then GROUP BY. But how people typically write it is SELECT, FROM, WHERE, and then GROUP BY. So what that means is that you essentially cannot show interactively what's happening to the data. Like, if you have used some of these IDEs for Scala or Rust,

on every .map or .reduce method, they show you what the type of the transformation is. Now you can imagine doing that for the data itself. If you have a single row from the source, as you select, we can interactively show what the transformed value is. And after the where, we can show whether it passes the filters or not. Right. And after the group by, we can show the final aggregated value.

Nikhil Simha (37:15.945)
We want that level of interactivity in what we build next, basically. And for that, we were thinking of restructuring our API to be more fluent, essentially.
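
[A hypothetical sketch, not the actual or planned Chronon API, of the fluent, from-first style described above: because the source comes first, every stage knows its columns, so an IDE could auto-complete them, and a sample row can be previewed after each step.]

```python
class Query:
    def __init__(self, rows):
        self.rows = rows                          # list of dicts standing in for a source

    def select(self, **exprs):
        # each keyword argument is a new column computed from the row
        return Query([{name: fn(r) for name, fn in exprs.items()} for r in self.rows])

    def where(self, pred):
        return Query([r for r in self.rows if pred(r)])

    def group_by(self, key, agg):
        groups = {}
        for r in self.rows:
            groups.setdefault(r[key], []).append(r)
        return {k: agg(v) for k, v in groups.items()}

    def preview(self):
        print(self.rows[:1])                      # show a sample row after this stage
        return self

source = Query([{"user": "u1", "price": 30.0},
                {"user": "u1", "price": 50.0},
                {"user": "u2", "price": 20.0}])

result = (source
          .select(user=lambda r: r["user"], price=lambda r: r["price"]).preview()
          .where(lambda r: r["price"] > 25).preview()
          .group_by("user", lambda rows: sum(r["price"] for r in rows) / len(rows)))
print(result)   # {'u1': 40.0}
```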

Nitay Joffe (37:30.226)
And in a way, I imagine it has to be more dynamic, less static, to accomplish that. Like, you have to have a deeper understanding of the underlying data, not just the schema, essentially.

Nikhil Simha (37:38.643)
Yep. Yeah, exactly.

Varant Zanoyan (37:39.375)
Yeah.

Yeah. What's nice is that if you have the input schema and a sample input row from any given source, then at every point the transformation can statically sort of be inferred. So for every output of a statement, you could kind of map that and make a really good developer experience for data transformation.

Nitay Joffe (38:08.618)
That sounds very interesting. Go ahead.

Kostas (38:09.127)
So I think it's a good time to ask them what's the next new thing, right?

Nitay Joffe (38:15.942)
What's next? I was gonna say, we should probably honestly even backtrack a tiny bit, because we've mentioned the project and this and that a little bit, but I don't think we've formally defined it. So can you guys just tell us, for everybody listening, what is this project Chronon that you guys created at Airbnb? Why open source it? And then we'll go into the interesting question of what's next.

Varant Zanoyan (38:17.507)
Yeah. Yeah, yeah.

Varant Zanoyan (38:32.335)
Yeah, yeah, I guess.

Varant Zanoyan (38:40.047)
Yeah, we didn't do a good job of putting the context in there. So if any of the listeners are confused at this point, let me try to help a little bit. So we've been talking a lot about transforming structured data, like the 30-day average purchase price example, that needs to be served online, backfilled offline, whatever, right? Structured data transformation, windowed aggregations, that kind of stuff. Very useful in predictive ML.

That's what Chronon does really well. I think it's probably the best engine in the world for defining those transformations, serving them online, backfilling them offline, and guaranteeing consistency between the two.

Just now we were talking about these compute graphs, right? You take this data, you transform it that way, you pass it to this other group by that aggregates it this way and that way. And so you can imagine that graph for a structured transformation. That's all expressible and computable within Chronon, and it can orchestrate those pipelines for you. I'll talk more about the open source; it's exciting. Airbnb open sourced it along with Stripe, who use it for all of their anti-fraud modeling. And the project has

more momentum now. I think you'll see some other big brands announce their onboarding onto the project soon. But in terms of what's next, let's start with that structured data compute graph. Now imagine you have unstructured data and embedding generation within the same sort of API.

So as opposed to just taking tables with structured columns and producing structured output, you can now take sort of other sort of input types like text, image, video, whatever, produce embeddings and have all of that be within one graph and one layer for orchestrating all of that. And the interesting thing is when you think about embeddings generation, obviously that's like, now you're bringing a model kind of into the graph as well.

Varant Zanoyan (40:36.836)
You are definitely expanding the scope of what this compute layer can do. I think what we've seen, Nikhil and I are pretty excited about this direction, given everything that we've seen in terms of what, like

where things are headed in ML and AI, and what is actually landing the most impact and delivering the most value these days. So I think one example I really like for why these two things need to live together is, let's say you're a generic delivery service, right? You deliver stuff and you wanna build a chat bot basically for customer support, like a classic LLM use case for customer support.

And so let's say, you know, a user comes in and has an issue with a delivery and is requesting a refund, which is arguably one of your most common customer support flows. So, you know, that user has a message that they sent, but to make that message more useful to your chatbot, to be able to actually provide a useful response to it, you're going to need some more context than just, hey, I'm having an issue with this delivery. So, you know, the first classic thing that's very common in RAG,

like, you know, sort of RAG implementations for a use case like this, is policy. It's like, what's our refund policy? So that's where you have your policy documents being chunked and embedded, and you're going to be fetching some of those and adding them to the prompt as context. But then, for this thing to be really, really valuable, you're also going to want to know a few more things about the user, like, in the last seven days, how many refunds has this user requested? That might be interesting. But then also, what about the merchant that they're buying this item from? How many issues has that merchant had in

one day, 10 days, 30 days? Is there a recent uptick or something like that? What about this particular item? Has that had a lot of refund requests or something? And then what about the delivery person? Has that person had, how many issues have they had? How long have they been around for? All of this context is stuff that you'll want to be able to add to that prompt in order for that chatbot to actually be useful and actually have a high chance of resolving this thing effectively. And what's interesting is that the data plumbing...

Varant Zanoyan (42:44.303)
for that whole flow. There's the unstructured piece, which is the document embedding generation stuff. And then there's the rest of that context. And the rest of that context is basically structured aggregation that looks a lot like the kinds of features that you'll have in an anti-fraud model, for example, which Chronon is very, very good at computing. So

essentially, what we're thinking about next is taking what Chronon is already very, very good at and expanding it to do embedding generation, unstructured data processing, and model inference, in this more end-to-end graph, so that you can compose these much more powerful systems in one API. I don't know. Was that a good summary? Did that make sense? Nikhil, tell me if that made sense to you guys.
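
[A hypothetical sketch of the delivery-refund example: the prompt is assembled from retrieved policy chunks on the unstructured side and windowed aggregates like "refunds in the last 7 days" on the structured side. Both lookups are stubbed out here; in a real system they would hit a vector store and an online feature store.]

```python
def retrieve_policy_chunks(query, k=2):
    # stand-in for embedding the query and searching a vector store of policy docs
    chunks = [
        "Refunds are issued for orders reported damaged within 48 hours.",
        "Repeated refund requests may require photo evidence.",
    ]
    return chunks[:k]

def lookup_structured_context(user_id, merchant_id):
    # stand-in for point lookups against precomputed windowed aggregates
    return {
        "user_refunds_7d": 2,
        "merchant_issues_30d": 14,
        "item_refund_rate_30d": 0.08,
    }

def build_prompt(user_message, user_id, merchant_id):
    policy = retrieve_policy_chunks(user_message)
    stats = lookup_structured_context(user_id, merchant_id)
    context = "\n".join(policy) + "\n" + "\n".join(f"{k}: {v}" for k, v in stats.items())
    return f"Context:\n{context}\n\nCustomer message:\n{user_message}\n\nDraft a helpful response."

print(build_prompt("My delivery arrived damaged, I want a refund.", "u1", "m9"))
```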

Nikhil Simha (43:29.545)
Yeah, pretty good, I think, but I mean, I know the context, so I'm not a good judge.

Varant Zanoyan (43:32.751)
Yeah, I guess you're the wrong person to ask. What do you guys think?

Kostas (43:39.672)
No, I think you did a really good job describing both what you're building and what you're trying to solve. And at least for me, it sounds very fascinating, primarily because you're trying to bridge two worlds that have traditionally been separated. And they're separated on many different levels, right?

It's very interesting, I think, for people who are into, let's say, more the data platform systems, right? You have, on one side, anything that has to do with stuff like PyTorch, and then something like ClickHouse, where you get all the state of the art right now in working with structured data, where, let's say, everyone has decided that vectorized computation is the solution, right?

Whatever we use there, it's some kind of vectorized engine that processes the data. And then you go to the other side, and everything is compiled, right? Not only compiled, but it has a completely different core concept, the tensor, which is different from the vector that the vectorized engine on the other side uses. Now, the reason I'm saying this is because you definitely don't necessarily have to go so deep to

unify this, right? But it seems like the industry, for all these years, has kind of been developing in parallel, without these two worlds really caring about each other, right? And suddenly something happens, which is called, I don't know, Sam Altman or whatever, and now we start caring about how these two things will come together, because they have to.

Right. And there are so many different layers. That's why I gave the compute engine example there, because it's really at the low level of the stack. Actually, you can go even lower: there are new file formats being talked about now, with Parquet having been pretty much the dominant thing for all these years, and now we're saying maybe we have to rethink that, because AI and ML are becoming the more dominant use case, right?

Kostas (46:03.495)
But it goes up to the user experience itself, which you also talked a lot about. And going back to what I said, the technical stuff is hard, but there's kind of a clear path there, right? Whereas the higher-level things, like how users should interact, how to reason about these things, how to avoid all the failure modes you were talking about before, I think that's an extremely hard

problem, and a pretty open one right now, because we never really tried to even face this before. So I think what you're trying to do is extremely fascinating for me, at least for these reasons, right? I don't know, Nitay, what's your take?

Nitay Joffe (46:49.715)
Yeah, no, I think the explanation made a lot of sense. I'd tie it back to what you guys said at the beginning: you started your work lives bringing streaming and batch together, and now you're bringing structured and unstructured together, right? So we're seeing, in many ways, different things, you know, velocity of data, variety of data, volume of data, as they say, all these kinds of different things coming together and consolidating. And I think it's super interesting.

As we've seen, the proliferation of data use cases means you get more and more specialized systems, right? And so if I understand what you guys are doing correctly, you're never going to see yourselves as, we will replace this specific database or this specific data system, right? Because there will always be some particular system doing that particular piece. But what you're able to do is orchestrate and bring all those things together. And one of the things that leads me to, as kind of a question, is, you know,

I think there's a huge demand for this, because exactly that proliferation of systems makes it so much more complex to just build an app or a solution, right? Like, I want to do X, and now I have to go and look at this, you know, five-page Google spreadsheet of systems that I can potentially utilize, right? And so I think that has, in a way, brought up, in my opinion, things like, you know,

Varant Zanoyan (48:04.665)
Yeah.

Nitay Joffe (48:11.352)
now on a slight tangent, a system like Temporal or something, right? That's kind of a new way to think about computations. Do you guys see Chronon and what you're doing in the future becoming kind of a new way to think about computation, a new way to build applications, or is it a way to build ML? And I would also ask the same question coming from the users' side, right? You mentioned Stripe; are they seeing this as, man, this is the best fraud platform, or are they seeing this as,

wait a minute, this is our new ML application platform, this is where we're gonna build different use cases, and here's how it's gonna spread, and so on. How do you see that?

Varant Zanoyan (48:47.639)
Yeah, yeah, I think so. Yeah, that's a very, very good question. So let me speak to Airbnb; I don't want to speak for Stripe necessarily, I have a little bit more experience with the former. But I think, you know, what we observed at Airbnb was that, even with Chronon, right, which doesn't have the unstructured data support in the open source version, even with that, what we saw is that it just became a data platform more than an ML or AI tool.

And so, you know, if you go to Airbnb and you see the average reviews for any given listing, like a 4.8-star average, that gets computed on Chronon now. And the reason why people just go there to do that engineering is because it's just the easiest way to do

your data engineering if you want your data to serve something in the application, whether that's a machine learning model or, in this case, the UI with the average review score, or if you want to be able to do anything that's a streaming computation, or any kind of windowed aggregation that you want to be computationally efficient. So it sort of gets used very generally, for ML and non-ML

use cases in that regard, just kind of as a data tool. And that is very much how we see it and the direction that we would like it to go in. But your point on orchestration is spot on. I think when we think about what this replaces in a company, if they were to try to use this, it's not any one given tool. Really what it is, is a shit ton of gluing and annoying integrations and this and that, and issues that you have when you integrate this streaming thing with your KV store and then this batch thing and they're not working together.

Varant Zanoyan (50:24.737)
It's hours and hours of frustration and slowness and edge case issues and data consistency problems and model performance problems that are hard to debug. That's what it's replacing, more so than any one given thing. It's not replacing this database or that streaming platform. It's replacing all the crappy work that needs to be done to glue these things together to actually create an end-to-end application. We want to abstract away all of that glue complexity, so that you come with your...

Actually, Nikhil phrased it this way a week or so ago, and I really liked it. You come with your raw data. You might have some Kafka sources, some S3 locations, you know, whatever that is, right? Kinesis, whatever you're using, PubSub. And you come with that, and then everything that you need to go from there to your application use case, whether it's showing a 4.8 average score for a listing, or a machine learning inference, or running this chatbot or whatever, that pipeline

orchestration is what we can do. And like you said, we're obviously not gonna try to solve database creation; we're gonna use something really good under the hood, and same for a vector store. But we're gonna handle all the data plumbing, because that data plumbing is a pain in the ass, and we can say that having built it a number of times ourselves. We wanna automatically orchestrate all that infrastructure. That's exactly right.

Nitay Joffe (51:48.777)
Does adding the unstructured piece then make the batch and streaming piece ten times harder now? Does it change that? Or are they actually two completely separate dimensions? How do you think about the explosion of permutations of different systems? As you said, you may have Kafka, you may have S3, you may have PubSub; you may have this explosion of options and permutations, right?

Varant Zanoyan (52:11.597)
Yeah, Nikhil, do you want to take this? Nikhil worked more hands-on with this at Airbnb; that's why I'm handing it off to him.

Nikhil Simha (52:18.287)
I think fundamentally there are only a few storage abstractions. There's the columnar data warehouse, then there's the blob store, then there's the event store. And then there are online systems, like databases or services that you can query directly and get data out of with low latency. I think those are the three or four modes, roughly.

And for any of Kafka, Kinesis, PubSub, there is a single interface that can cover all of those. And similarly for S3 and GCS, again, a single interface; MinIO, for example, is something that does that across these things. And for data lakes,

there is BigQuery, where essentially you have these columnar tables, so BigQuery is more vertically integrated, and similarly Databricks. But if you think about it, the layering there is that you have Parquet files, essentially, and they map to a metastore which says which table maps to what set of files in a blob store. So there are all of these things, but the idea is that there is a single interface for the data warehouse.

And.

Nitay Joffe (53:45.606)
Do you find that there's a challenge, or rather, how are you thinking, as you build this abstraction that you talked about, the language, the dynamic autocomplete, about leaky abstractions and exposing the right level of information, right? Because at a high level, I agree with you, right? You're either a blob store, a columnar store, an event store, et cetera.

But somebody would come and say, well, if you're an event store and you're doing a query, are you doing a tumbling window? Are you doing a sliding window? We actually just had a different podcast that was all about streaming, and we went like ten levels deep into that, right? So the rabbit hole goes pretty deep on each one. And at your level, you're constantly making this decision of which of these things we just decide for the user and hide from them, and which of these things, no, we have to actually expose to the developers; they have to be able to choose this.

Nikhil Simha (54:32.773)
Yeah, that's a great question. So we just talked about the storage abstractions. Similarly, there's batch compute, streaming compute, indexing, and serving. Those are the other abstractions. There are a lot of interfaces, and all of them can have many implementations. And if you just try to count how many permutations are possible, it's probably a shit ton, many digits of permutations. And what

we want, the way we are thinking about it, is that we make the choices for the specific implementations of these interfaces, right? Like, you bring your store; you already have an implementation of what your blob store and event store are. We plug in with that, but the compute implementation, or the KV store, or the service, we figure out; we have an opinionated way of picking those.

So that doesn't explode, at least, right? Only the data layer is what's kind of exploded. So yeah, I mean, there is a lot of complexity, like you said, in this space. And one thing that is kind of unavoidable is the data integration complexity. There are many permutations. We have to be very methodical and just do the work of integrating with all of these sources as our customers need them.

Nitay Joffe (56:00.154)
How are you... go ahead.

Nikhil Simha (56:03.057)
Yeah, the leaky abstraction thing is very much on point, I would say. But so far, we have been successful in keeping at least the compute abstractions and storage abstractions really separate. I mean, the storage abstractions do tend to leak sometimes, among themselves we think, but we haven't had that problem with compute.

Nitay Joffe (56:32.678)
That makes sense. It's very interesting. Yeah, I find a lot of projects are in this kind of glue layer; they walk that line, and the decisions you make there end up being very, very key to quick adoption, flexibility of use cases, and so on. It sounds like you guys are thinking about it in a great way. You mentioned that you have to listen to your customers. How are you guys balancing between...

You know, obviously you have the customers that you're going after, you have your underlying vision for your company, right, like the beautiful autocomplete language you mentioned, and you have the open source community that you're interacting with in some way, shape, or form. We'd love to understand how you're balancing between those and what your vision is for those different parts.

Varant Zanoyan (57:21.069)
That is a good question. I would say we don't know exactly yet. I mean, here's where we're at in the startup journey, right? We're like three weeks into doing it full time. We have a lot of stuff that we know we need to build no matter what, and right now, as far as building goes, we're focused on that. And then we're essentially talking to companies and seeing who's the best fit for an early design partner. And then we're gonna...

I mean, we're going to do whatever they need us to do for it to land value and be effective. And I think that's one of the interesting parts of the startup journey. One of the interesting pieces of advice you hear when you're trying to build a software service company is that companies might ask you to do a lot of things, and you have to be careful about what you choose to do. Obviously there's a trade-off between making sure that you're making them happy and also building stuff that generalizes. So that's one of the things we've gotten advice on, but to be honest, we're still pretty early, so

we haven't had to make any really difficult trade-off decisions there yet, because we're still in the early phase of talking to potential customers and seeing what their stacks look like. And I think the good thing is, we have a relatively small sample size of companies that we've talked to so far, but the good news is it does seem like

the way we solve things and the way we thought about these problems and solutions does seem to generalize, at least a bit. So we're optimistic that we won't have to make any really difficult choices there. But yeah, I don't have much of an answer for you. I think it's becoming clear, the longer I try to answer this, that we don't exactly know yet. We have some theories. From an API design and user ergonomics point of view, we've seen a lot of

Airbnb and Stripe type users, so we have a good sense of empathy for those types of users. But as we get out there and talk to more companies, I think one of the interesting questions is whether other

Varant Zanoyan (59:18.979)
types of profiles come forward. I don't know, Kostas, earlier you were mentioning the data analyst who's more familiar with SQL and less familiar with Python. We haven't tried to enable those types of people to do ML that much in the past, but we're certainly open-minded to those sorts of things. And I think that's one of the things that might get more interesting as the journey continues: if we encounter those people and they have a use case, and we can see the value and what's missing to enable them, that's where we might make a choice to

go in a slightly different direction. But we haven't really encountered that much yet; it's probably too early to say.

Nitay Joffe (59:54.666)
Makes sense. Congrats on taking the leap and starting the company. To your previous point, are there any particular use cases or things that you'd love to see explored with your product, with what you're building? Are there things that you're...

Varant Zanoyan (59:58.627)
Thank you.

Varant Zanoyan (01:00:10.189)
Yeah, so here's where we know it has the potential to be very impactful, because we've seen it; we've landed this kind of impact with stuff like it before. It's things like...

Personalization, basically. That could be search results, or that could be content serving or ad serving. Then anti-fraud: Airbnb and Stripe are two very large payments companies. I mean, people don't often think of Airbnb as a payments company, but really it is; it moves billions of dollars around across borders and bank accounts, and there are a lot of fraud use cases there. The third one is customer support. So those are three areas where we know it has a lot of value,

Varant Zanoyan (01:00:52.943)
in terms of very solid use cases that Nikhil and I have seen over and over again. But I think we are very interested to learn more about what else is out there. Because like I said, when people start using this, they find interesting things in terms of it being a more general data platform: it can do metrics, it can do application data orchestration in all kinds of ways. It's like...

So I think there's potentially a lot more out there, and we're very open-minded, I'll put it that way.

Nikhil Simha (01:01:26.68)
I mean, our users at Airbnb have consistently surprised us with what they do with it. So it's probably a reasonable expectation that that will continue to happen.

Varant Zanoyan (01:01:30.351)
Yeah.

Varant Zanoyan (01:01:38.134)
Yeah.

Nitay Joffe (01:01:40.361)
Sounds like the best kind of early users to have, the surprising ones. And I liked your previous example, by the way. I remember from my own company, one of the best pieces of advice I ever heard came from one of my early salespeople. I asked him for one piece of sales advice, what do you tell people? And he said, you know what's the most powerful word a salesperson can say on a sales call? And I said, what? And he said: the word no. And I was like, what do you mean?

Varant Zanoyan (01:01:44.109)
Yeah.

Varant Zanoyan (01:02:06.361)
Hmm.

Nitay Joffe (01:02:08.05)
And he said, well, you have to realize that so many of the folks you're talking to are constantly just hearing: yeah, we can do that, yes, we can do this, yes, yes, yes. And the second you say no, suddenly they realize a couple of things. One, they realize, wait a minute, there's something they won't do. But more importantly, they suddenly believe everything else you just said before.

Varant Zanoyan (01:02:25.359)
Hmm.

Nitay Joffe (01:02:25.98)
Because up till then, they're like, I don't know if I believe it; they're just saying yes to everything. So, kind of to your point: customers will come in and demand many things, but really you have to go back to your vision and your plans and say, okay, which of this maps to and tracks with what we're trying to do.

Varant Zanoyan (01:02:30.755)
Hahaha

Nikhil Simha (01:02:43.241)
Hmm

Varant Zanoyan (01:02:43.489)
Yeah, yeah, interesting. Honestly, that resonates. I feel like I'd be the same way: if I heard too many yeses, I'd be a bit dubious of the whole thing.

Nikhil Simha (01:02:52.191)
Peace.

Kostas (01:02:53.549)
Yeah, and it's also really hard to say no when you are at a very early stage. So that's, I think, one of the biggest struggles that anyone who starts something has, and it's not just about the money. Many people say it's hard to say no to money, right? If a customer comes and they are willing to write a big check and it's against your vision, are you going to do it? But that's,

let's say, the straightforward angle that everyone can understand, the one that comes up first. I think what's happening deeper inside is that, when you decide to go and build a company, you kind of have in your personality this thing of: I want to build something to satisfy the other party, the other side. I'm building for them, right?

You want to get the satisfaction for yourself that you built something that delivers value to them, right? And you look for that, and at the beginning you can't find it, because, okay, you have things in your mind, but the world is always much more complex. It's hard to communicate what you're doing. People might not understand; they don't care; they have better things to do. So when you don't have an established brand out there and people don't just

trust what you're saying, it's much, much harder to convince someone to spend time with what you're building, right? But you need that. You need this, let's say, feeling of achievement: I built something, and this person over there is willing to use it and pay for it. That's amazing, right? And I think for an entrepreneur who decides to go and do this zero-to-one thing, it's a very common trait that you need to have

in order to successfully go and build. And that's much harder to overcome in the end than it is to just say no to a check, which, don't take me wrong, is really important; you need the money, you have to take the money. But it's also hard when people come to you with requests or needs and you have to be like, no, I'm not going to do that, because my vision is something different. It is hard.

Kostas (01:05:07.748)
But the good thing with you guys is that you have a lot of signal, or sources of signal at least. You have the open source project out there, which is great. I'm sure you're talking with people; you're living and breathing in an environment that is rich with people who are willing to try new things and believe in the state of the art. So you're going to have a lot of

these moments, but it's a good problem to have. So try to enjoy it, don't worry that much about saying no, and try to develop the muscle of saying no. One last question from me, because we're really close to the end here: how should people reach out to you, and what should they follow to learn about

your journey, right? You're very, very early. There is the open source project out there, but there's probably more. How should people connect with you so they can keep up to date with whatever exciting things you're building?

Varant Zanoyan (01:06:16.973)
Yeah, well, email works. We have an email for the new company: hello at zipline.ai. We don't have a zipline.ai homepage yet, so if you're listening soon after publishing, that might not be up, but the email should work, and you can get in touch with Nikhil and me that way. Otherwise, yeah, that's probably the easiest, I think.

Nikhil Simha (01:06:41.823)
Another way of saying that neither of us has much of an online presence.

Varant Zanoyan (01:06:46.511)
Yeah, LinkedIn. You can probably find us on LinkedIn; our names are unique enough, so that's also fine. Whatever you prefer. Those are probably the two best options.

Nikhil Simha (01:06:53.567)
Yeah.

Kostas (01:06:59.315)
Yeah, we'll also make sure to put all the contact details on the episode's page, so people can go check there and find how they'd like to connect with you. Nitay, your last question.

Nikhil Simha (01:06:59.401)
We don't post much, you know.

Varant Zanoyan (01:07:07.46)
Yeah.

Nitay Joffe (01:07:13.256)
No, I don't have anything else, honestly. I think this was a very good conversation; I really enjoyed it. It sounds like a really great project with a lot of need, from seeing that across many companies in the community, and obviously given the current times. I'm excited to see the company blossom and grow. It was a pleasure to have you guys on the podcast.

Varant Zanoyan (01:07:36.067)
Thank you so much. It was our pleasure. Yeah, our pleasure.

Nikhil Simha (01:07:36.723)
Thanks for having us.

Kostas (01:07:40.681)
Thank you, guys, and we are looking forward to having you back soon with more updates about what you are building and your adventures. You have to promise that you will come back, because otherwise we are not going to end the episode. Please do that, okay?

Nitay Joffe (01:07:55.623)
Hahaha

Varant Zanoyan (01:07:56.658)
You have our word.

Kostas (01:07:57.499)
Alright, Nikhil and Varant, thank you so much. It was such a great pleasure to talk with you. We learned a lot, and I think the audience will be equally or even more excited about what you are building. They should reach out and make sure they pay attention to what you are doing.

Nitay Joffe (01:08:18.073)
Absolutely. Zipline, check it out. Thank you.

Varant Zanoyan (01:08:18.361)
Thank you.

Kostas (01:08:22.618)
Thank you.

Nikhil Simha (01:08:23.977)
Thank you.