Fascinating conversations with founders, leaders, and experts about product management, artificial intelligence (AI), user experience design, technology, and how we can create the best product experiences for users and our businesses.
Kyle (00:01)
All right, welcome to another episode of Product by Design. I am Kyle. This week we have another awesome guest with us, Sandy. Sandy, welcome to the show.
sandy (00:11)
Great to be here. Thanks for having me on.
Kyle (00:13)
All right. Well, let me do a very brief introduction for you, Sandy. And then you can tell us a little bit more about yourself. But Sandy is a lead engineer, author, and thought leader in the domain of data engineering. So Sandy, that's a super brief introduction. So why don't you tell us a little bit more about yourself?
sandy (00:51)
Awesome. My career has revolved around the world of data in a bunch of different ways. It's been kind of a loop in that I started as a software engineer, building tools that allowed organizations to work with large and complex data sets, but then spent the middle of my career working as a data practitioner. So doing data science, machine learning, data engineering, actually being the kind of person.
who would use the kind of tool that I had been building before. And then finally kind of came full circle: I dealt with a bunch of frustrations in those years as a data practitioner, and that led me to want to refocus on tooling and try to solve some of those problems that I had encountered inside that world. So I've been a data scientist, a software engineer, a product manager, a data engineer, but I'm currently working on Dagster, which is an orchestration and data management tool that helps people in data practitioner roles build better data pipelines and data platforms.
Kyle (02:01)
Awesome. Well, I'm excited to talk a little bit more about all of those things, data orchestration, Dagster, some of your experiences. But before we do, why don't you tell us a little bit more about some of the things you enjoy doing outside of the office and outside of the work that you're doing.
sandy (02:17)
Outside of the office, my biggest hobby right now is surfing, which, you know, can be a mixed bag in the San Francisco Bay Area where I live. It's not the warmest place to go surfing, but there are a bunch of fun spots. I like to read history. And I like plants.
Kyle (02:46)
Awesome. Well, I'll have some questions for you toward the end, and we'd love to hear about anything that you're reading or would recommend. But I'd love to jump into some of your experience and a little bit more about your journey. So maybe you can tell us a little bit more about that. You mentioned being a practitioner and
some of the challenges that you were facing as a practitioner within data. Tell us more about what some of those challenges were, what brought you into data science and data orchestration, and, again, into the current work that you're doing.
sandy (03:36)
Cool, so I'll start maybe further back than you were asking about. But I think the probably original catalyzing event was in college seeing this talk by Peter Norvig, who was a sort of senior researcher at Google. And he had this talk about the unreasonable effectiveness of data. And the thesis was basically that by taking data and lots of data, you could solve these problems that had kind of eluded computer scientists for years.
So the biggest ones that he was focused on were around natural language, where they had tried to build translation models and text classification models and had run up against a wall using purely algorithmic techniques. But just taking massive data sets, like Google-sized data sets, throwing them at those problems, and then using fairly simple techniques was able to kind of burst through that wall and do pretty amazing things. So that's, I think, what
seeded my excitement about this idea that if you can put enough data together and then throw the right computations at it, you can develop this deeper understanding of the world and kind of push forward in ways that you might not otherwise be able to. So that's what led me to my first job, which was at Cloudera, which was a company building big data software.
At the time that meant Hadoop, which was kind of the center of the big data universe. And the idea was that, up to this point, it had been computationally infeasible to process enormous amounts of data because you didn't have software that could do it. And so if we could build software that could, you know, compute a summary or train a machine learning model on a massive data set, then that could kind of
get us to those places that Peter Norvig had been talking about. And so I spent a while working on Hadoop and Spark, which are these open source software projects that were enabling these capabilities, but ended up feeling just kind of disconnected from the actual building cool stuff with data part of the world. I was spending my life in the nitty gritty of...
performance tuning for this software. That was very fun, but it was hard for me to connect it to actual things that were happening in the world. So through a series of jumps, first within that company working kind of as an internal or as an external consultant, and then at other organizations, I moved into the using data software part of the world. So I was taking this software that
was able to process large amounts of data and then helping other organizations use it to understand their data. I eventually moved from there to a healthcare company, Clover Health, and got thrust immediately into the world of health data, which is some of the most complex, wild data in the world. From there I eventually moved to a logistics company that was helping understand data about
trucks moving around the country. So a totally different data space with its own set of complexities. And one of the things that I encountered, which I think was parallel with a lot of my generation of data scientists, was that handling large data sets was only a small part of the problem. So you might have a really small data set or a bunch of really small data sets. And you could handle that using traditional software that had been around forever, like databases like Postgres.
But there was so much of it, and it was so hard to understand the connections between all the data sets and all the entities in those data sets. There were so many different processing steps that needed to happen to get from the mountain of, let's say, health insurance claims data to a number that actually reflected reality in some sort of way, that the challenge moved away from handling large amounts of data to handling very complex data.
And so that was the challenge that I faced when I was in those organizations: just trying to get a handle on all the different data processing steps, many of which were happening on very small amounts of data. And so I did at Clover Health, and then also at KeepTruckin, what I think many data practitioners have done, which is build software to help them do their job,
to help them manage this complexity of all the data that they're dealing with. So I built these internal tools at those companies that basically were a layer to help me write code in Python to define the data sets that I was working with and understand how they relate to each other and then define how to take one and use it to compute the next one. And then eventually,
discovered this very early-stage startup, Dagster Labs, which was called Elementl at the time. I joined it as one of the early employees and was basically working on a more general-purpose and open source version of this thing. So I had spent so much of my time, when I was supposed to be being a data engineer but was having trouble being a data engineer because I didn't have the software I needed, building internal versions of that. I was very excited to help build out
this more generally available version of that thing, so that every organization wouldn't need to build their own internal version of it. So now I'm working as the lead on the Dagster project, which is that open source project that helps organizations orchestrate and manage their data in that way.
Kyle (09:45)
I think that's really interesting, and I'd love to hear some of your thoughts on what it's like building that out. Because it feels like so many great products and great tools that we see today come from this very thing that you're talking about: I have this problem, there's not a tool that's solving it for me, I need to build something to do this. And that feels very much like what you're doing.
So what has that been like for you? And what is it like to build out this tool, Dagster, that is really solving this problem that you had and that so many people are having? What are some of the benefits of that? And maybe what are some of the challenges that you've faced with it as well?
sandy (10:29)
Yeah, it's really interesting. I sort of elided this part of my career, but I also spent some time as a product manager, and it's kind of interesting because I was a product manager working on pretty technical products. I was a product manager at this company called Remix that was helping transit agencies do scheduling for their buses. And so
the whole center of that job as a product manager was trying to understand what it's like to be a transit scheduler as much as possible. And all the effort that I was putting in was learning this job that was completely alien to me. And then even working as a data scientist at this trucking startup, so much of my job was trying to get into the mind of a trucker. What does a trucker think about? Reading books on trucking and understanding that life.
And so coming to work at Dagster on a problem that I knew so much about was almost the complete opposite, where I didn't have to spend any energy to get into the mind of, and, you know, embody what it's like to be, one of these people, because that sort of was my mind already. So, first of all, I think that has worked out really well and has really given me a guiding light in a lot of tough design decisions. I think another thing it does is give you the conviction that you need to rock the boat a little bit. So one thing that happened when we were just listening to early user feedback at the company was people ultimately
kind of ended up requesting things that looked a lot like existing tools, because that's what they were used to. And I think if I had been coming from this pure product management perspective, I would have kind of indexed very heavily on that and been like, OK, that's what the users want. Let's build it for them. But having had this personal experience and this personal set of opinions that I had developed from doing this work, I was able to sort of get past that and be like, no, let's not necessarily do the obvious thing. Let's like,
push a little further, even though it feels more risky, because I have that conviction. I think the downside is you're very biased. You have your own personal experience and you understand it very well, and it takes a lot of extra effort to incorporate the experiences of other people that are in the same space as you but might have been a little bit different. So I think a lot of the journey at Dagster Labs has been
trying to find the places where I may be overly opinionated and need to build something a little bit more general than I would have expected, in order to handle a much wider set of use cases that I hadn't personally dealt with in my time as a data practitioner.
Kyle (13:43)
Yeah, I think that makes a lot of sense, and it's one of those very difficult balances: having a ton of experience and being very opinionated about something, but then also trying to build something that applies to all of the other potential users out there and their experiences, and incorporating that for
not just the immediate use cases, kind of like you're talking about, but what's coming down the road. So I'm interested: how are people using the tools that you're building and what Dagster Labs is creating? What are some of the main use cases, and what have you seen both as you've built it and coming down the pike as well, as far as
what you're creating and some of the importance of this data orchestration that companies need and that you're solving for?
sandy (14:49)
Sure, yeah, so I'll give maybe the 10,000-foot view of data pipelines and data platforms just to try to contextualize all this stuff because it can get pretty nitty gritty. So ultimately organizations build data assets. And data assets is the sort of broad term that we and many others use to refer to sort of like...
packaged nuggets of understanding that are derived from the data they have in their organization and then useful in all sorts of other ways in their organization. So to make this concrete, you might have a feature in your application, which is a recommender. And then that recommendation system in your application depends on a data asset, which is a machine learning model that is essentially an algorithm that provides recommendations as well as
large data sets that machine learning model interacts with that describe and summarize the behavior of users in that application. Another example of a data asset might be a data set or a table that's used to feed a common report. So maybe the executive team needs to understand how sales are doing every month so they can make decisions about how to reallocate resources.
So the two super-high-level places where organizations use data are, one, in their products and applications, and two, to make decisions. And data assets are just ultimately the objects that they build that allow them to power their products and make decisions with data. Data assets don't come fully formed out of the ground. And so
this is where Dagster and data pipelines come in. So if you're building a recommendation model or if you're building a canonical data set that describes all your different sales across the world, you have to start somewhere. And where you start is normally source data. So that might be in all sorts of systems. It might be produced by your application. If your application is an e-commerce application, you might have
tracking of every time someone buys something on your website or clicks on your website. If you're modeling healthcare, you might be getting information from all sorts of different hospital chains that you work with. Or if you're modeling trucking, you might be getting information from gas stations that you're partnered with around the country about fuel usage.
And so ultimately you need to sort of get from point A, which is all this source data, which is often very messy, often in different formats, often sort of difficult to understand in all sorts of ways to these ideally pristine data assets that are actually useful for your business. And so a data pipeline is basically what gets you from point A to point B. A data pipeline will do all sorts of different stuff.
So it might take data that's in one system. So maybe you have data that's thrown into Amazon S3, like an object store, as a CSV file or an Excel file. And then it might move that into a different system, like a data warehouse, like Snowflake. And then it might also do transformations on that data. So even within the same system, it might take data inside your data warehouse, like Snowflake. And...
combine a bunch of tables into one, or filter out some bad records, or group by some field, any sort of thing you can imagine. And along the way in getting from point A to point B, there's often what we call intermediate data assets, which are useful as well. So you might at the end of your pipeline have a hyper-specific data asset that's useful for some particular application, like
recommendation, but on the way there, you might have a more general asset, like a canonical list of what all your products are, because you can't recommend products without knowing what your products are. But that canonical list of all your products is going to be useful not just in recommendation but in a bunch of other applications as well, like dashboards and reports and everything. So your data pipeline builds these intermediate assets along the way.
And when you zoom out, you end up with this kind of network of data assets. So starting at point A with your really messy data assets, with all your sort of intermediate assets in between, and then finally at the end with your production usable data assets that are going to power your business. And a data pipeline is basically that network or a part of that network. So when we talk about data pipeline, we're talking about
data moving from one place to another and being transformed, often repeatedly. So if you have some report that's based on your sales numbers, well, you're going to get new sales every day, and often you're going to need to repeatedly update those data assets to take in and represent the latest state of the world. So that finally brings us to Dagster. Dagster is software that helps people
orchestrate and manage their data pipelines. So Dagster isn't the software that would run, that would be responsible for, the nitty gritty of each individual data transformation. You might use a data warehouse for that. You might use Python code. You might use a shell script. What Dagster does is tie it all together. So Dagster understands that network of all your transformations and understands how to be continuously
running the data processing steps along that network to keep the data at the end of those pipelines up to date. Does that give some picture of what we do and how we help out?
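To make the point A to point B idea concrete, here is a minimal sketch, in plain Python, of the kind of pipeline Sandy describes: messy source data, an intermediate cleaned asset, and a final summary asset that could feed a report. The file name and column names are hypothetical stand-ins, not anything from the episode.

```python
import csv
from collections import defaultdict

# Point A: messy source data, e.g. a CSV export of raw sales events.
# The file name and columns here are hypothetical.
def load_raw_sales(path="raw_sales.csv"):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Intermediate asset: a cleaned, canonical list of sales records.
def clean_sales(raw_rows):
    cleaned = []
    for row in raw_rows:
        # Drop records missing required fields, and normalize types.
        if not row.get("product_id") or not row.get("amount"):
            continue
        cleaned.append({
            "product_id": row["product_id"],
            "amount": float(row["amount"]),
            "month": row.get("month", "unknown"),
        })
    return cleaned

# Point B: a production-usable asset, e.g. monthly sales totals for a report.
def monthly_sales_summary(cleaned_rows):
    totals = defaultdict(float)
    for row in cleaned_rows:
        totals[row["month"]] += row["amount"]
    return dict(totals)

if __name__ == "__main__":
    raw = load_raw_sales()                 # point A
    cleaned = clean_sales(raw)             # intermediate asset
    print(monthly_sales_summary(cleaned))  # point B
```

An orchestrator's job, as Sandy describes it, is to run steps like these in dependency order, on a schedule, across many such pipelines, so the assets at the end stay up to date without someone rerunning a script by hand.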
Kyle (21:18)
It does. And I think that's a really great picture of all of the different pieces and how this starts to come together. I'm interested in your take, because as data continues to increase and becomes more and more critical to businesses and to the way that everybody's operating, we see what feels like the complexity of this increasing and increasing. And that doesn't seem to be getting less
in the future. How do you see that changing and increasing as we go forward? And how is it possible to continue to manage? I mean, you mentioned Dagster as a tool for managing the pipelines of the data, which is a critical part, but how have you seen this change over the past, and how do you see it changing going forward,
as the complexity increases? How is this going to affect our ability to manage all of this data infrastructure?
sandy (22:25)
Yeah, so first of all, I 100% agree with you that the amount of data and the kinds of data are just monotonically increasing. One big dimension of that is the increase in unstructured data. So if you looked at the data that organizations were keeping track of and managing maybe 20 years ago, or even more recently,
it largely revolved around data that could fit neatly into a table. But a couple of things started happening. One was just the internet started producing massive amounts of data, and that data didn't often fit into a clearly tabular form. And then two, this of course ties into the world of AI, which has been a big topic of conversation recently. But we found better ways to use unstructured data.
So when I talk about unstructured data, I mean images, text, audio. As the applications for those kinds of data increase, it also increases the demand for storing and digesting that kind of data. And that creates this further complexity in data platforms.
I think maybe the second part of your question was, OK, given that this is happening, how do we deal with it? I think there's no one-size-fits-all answer to that. What we and I normally talk about in this realm is being thoughtful about your data platform. Every organization has some data platform, whether they've thought about it or not.
So if you have a Google Drive folder where you put spreadsheets, that's a data platform. It might not be the most scalable data platform, and it's probably not the best data platform for an organization like Google. But if you're running a consulting business, that could be the perfect data platform for you. I think the challenges kind of
arise when organizations grow without thinking intentionally about their data platforms. And a data platform ultimately amounts to a set of technologies, but also a set of practices that you standardize within your organization for how you want to deal with data. So it comes down to being very thoughtful about those practices and trying to implement some degree of standardization, while thinking about where there needs to be diversity, because
there are lots of different kinds of data and many different ways to process data. You can't say all of my data is going to be in the data warehouse and I'm only going to run SQL, which is a common data processing language, over that data. There needs to be variety and diversity in some elements of the data platform. But in some cases, choosing early to standardize how you do something can save you a lot of headache in the future in terms of
managing and taming all of that data.
Kyle (25:47)
I think that's really great advice, because I've been in organizations where we've made those decisions early, or at least early enough that it's benefited us, and then others where it's come much later and we've had to go back and try to fix the mess that has happened as decisions have been made kind of piecemeal throughout. And you get to the point where it's just like, I have no idea
what to do with everything, because we've made decisions as we've gone and now we have all of this data, and our option is either to try to do the best we can or to go back and fix everything, and that's going to be a big project. I'm interested in your take. You mentioned that the best option is to create early practices. What are some other thoughts around good
data hygiene, as far as what organizations or companies should do in order to create some of these best practices, or at least good practices and processes, for managing their data, especially as they grow and especially as the data that we have grows significantly?
sandy (27:04)
Totally. Yeah, and I very much agree with you on that. There's like a Humpty Dumpty factor: once it's broken, you can't put it all back together again. Okay, so the number one thing that we think about, that we advocate for in the world of Dagster, is trying to manage the complexity of data in the same way that you might manage the complexity of software. So,
software gets enormously complex. Think about everything from, on the front-end side, all the different states that a button can be in, all the way to, if you're writing TurboTax, just modeling the tax code. So over the years, software engineers have developed kind of a tool set for dealing with the massive complexity of software.
One of the biggest elements in this tool set is version control: being able to have an auditable log of all the changes that you've made to your software over time, which anyone can understand and which allows you to revert changes that cause problems. Another is a review process, so the ability to look at changes and understand them before you merge them. And then there's the ability to test things out before they hit production: the ability to try things out with your software without saying, OK, I'm going to push this out to all my users, and then they'll be my testers. So what we encourage, and I think what has really
informed Dagster's design and philosophy, is this notion of managing data using these same software practices. So the non-softwarey way of doing things would be, OK, you want to try something out with a new data set, you want to create a new data set, so you just write out a new table in your data warehouse or upload a new Excel file or Google Sheet to your Google Drive.
And you end up with, okay, version one, and then version one underscore final, and version one underscore final final, etc. Whereas the software approach that we recommend is using code to define the data that you expect to exist. This is core to the Dagster model. We call it software-defined data assets. And so if you want to create a table in your data warehouse that, let's say,
has a summary of all your sales statistics over the course of your business, you would write out code that generates that table, which you're going to have to do anyway. To create that table, you'll probably have to run a SQL statement or some code that actually generates it. But take that code and actually check it into your Git repository. And then...
in a structured way, annotate that code so it's clear what it does: the point of this code is to generate my sales forecast table, and it generates the sales forecast table by looking at the raw sales statistics table and this other sales table. And then once you've written that code, use all the version control and review and testing practices to manage it.
So yeah, I think the high level is define your data in software and then use software engineering practices to help manage the complexity of that data.
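As a rough illustration of that software-defined asset idea, here is a minimal sketch using Dagster's `@asset` decorator. The asset names and their contents are hypothetical stand-ins; the point is that each data set is declared in code, its upstream dependencies are visible from the function signature, and the whole definition can live in Git and go through review and tests like any other software.

```python
from dagster import asset, materialize

@asset
def raw_sales():
    # In a real pipeline this might read from a warehouse or an API;
    # here it's a hypothetical in-memory stand-in.
    return [
        {"region": "east", "amount": 120.0},
        {"region": "west", "amount": 80.0},
    ]

@asset
def sales_summary(raw_sales):
    # Depends on raw_sales: Dagster infers the edge from the parameter name.
    totals = {}
    for row in raw_sales:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

if __name__ == "__main__":
    # Materialize the small asset graph locally, in dependency order.
    materialize([raw_sales, sales_summary])
```

Because these definitions are just code, they can be checked into a Git repository, reviewed before merging, and exercised in tests before they ever touch production data, which is the set of practices Sandy is advocating.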
Kyle (31:09)
Yeah, that's really good advice: to take the same mindset that we use in software engineering, or in UX and design, and really apply it to data engineering as well, and to understand what some of these changes and best practices are. I think that's really great. I'm interested as well: you've written some books
about these topics. Tell us about the books you've written. What led you to write about these topics in depth, what are they called, and tell us more about them.
sandy (31:53)
Yes, this goes back a while. I talked about the loop of my career and how I started out working on tools for data people, then worked as a data person myself, and now have come back to tools. So the book that I co-authored was called Advanced Analytics with Spark. And Spark now is, and it was at the time, this very popular technology for dealing with
large data sets. And what was particularly exciting about it is there had been technologies like MapReduce that let you deal with large data sets, but they were very difficult to use in a data science context. You had to write a lot of code in languages that data scientists were not familiar with to be able to process that data. And they didn't really support an iterative workflow. So you couldn't try something out, see how it worked.
and then try out the next thing, without a pretty slow loop between all of those. So what was very exciting about Spark at the time was that it was the opportunity to work with big data but also to do data science. Spark is an open source software project that I was very heavily involved with contributing to at Cloudera. And then, as I talked about before, I sort of
had made the jump to being a data scientist and wasn't only contributing to that software. So I had this kind of perspective that allowed me to see both sides of it: all the exciting things and all the difficulties of this new software, as well as the applications of it, having seen it used at various organizations. So I wrote this book called Advanced Analytics with Spark, basically talking about how to do data science
with Spark. I worked on it with some of my coworkers at Cloudera. Yeah, it was a time-consuming but really, really fun process.
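For readers who haven't used Spark, here is a small sketch of the kind of iterative workflow Sandy describes, using PySpark's DataFrame API. The file path and column names are hypothetical; the idea is that each step returns a result you can inspect before deciding what to try next.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iterative-exploration").getOrCreate()

# Load a (hypothetical) large CSV of user events.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# First idea: how many events does each user generate?
per_user = events.groupBy("user_id").count()
per_user.show(5)

# Iterate on that result directly, without resubmitting a whole batch job.
active_users = per_user.filter(F.col("count") > 10)
active_users.show(5)

spark.stop()
```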
Kyle (34:04)
That's really interesting. What does it take to create kind of a technical project and technical book like that? Obviously it would take a lot of time and probably a lot of passion for the topic itself. What was the effort like and the collaboration like in order to bring something like that to fruition?
sandy (34:28)
Yeah, so we worked with O'Reilly, which is a textbook company that produces a lot of these, and they were really helpful and awesome. I think the hard part is always taking where you are, which is very nitty gritty and perhaps steeped in the technical specifics of the software and applications of the particular users that you spent time with,
and then trying to think about where your readers are and kind of connect those. And this is something that is a huge part of my job even to this day. I'm not working on any books right now, but I write a ton of documentation, a ton of blog posts. And the hard part is always sort of like connecting the dots between that super nitty gritty technical place that you are and the things that your readers know and care about.
Kyle (35:28)
Yeah, absolutely. Would you write another one? Or was it more of a "that was a great thing to do, but I think I'm going to leave it at that"?
sandy (35:39)
Yeah, it's a good question. There is not yet an O'Reilly book on Dagster, so maybe at some point we'll need to write one of those. It's very time consuming. It was a pretty popular book, but it's not like an extreme money-making endeavor, so it has to be a bit of a labor of love.
I don't know, if I could find the time I would do it again.
Kyle (36:12)
Yeah, okay. No, that definitely seems fair. I'm interested too in where you see things going within your field over the next three to five years. Obviously things have changed rapidly, not just over the past few years but even over the past year. Where do you see things going now, especially with the amount of change that's happening, broadly speaking, within technology,
but also, kind of like we've talked about, within the amount of data and everything else that's changing in the world? What has changed for you? What is changing within the profession? And where do you see that taking the field for everybody?
sandy (36:59)
Man, there are so many different directions. One that, as I mentioned before, often comes up is AI, which is this huge thing that's changing so many parts of the field. Ultimately, data engineering is the basis for any AI model, so I think there's a general increase in using data for those kinds of applications. As I touched on a little bit earlier, I think that
moves the center of gravity of data engineering a little bit away from relational data, which has really been the heart of it for a long time, towards more unstructured data and different sorts of computational environments that better support training and doing inference with machine learning, AI, and large models. Another one is that I think the industry has been on
an interesting roller coaster for a while.
If you look at where things were, like, 10 years ago, the set of technologies that were used to process data was super fragmented. There was this big question of, okay, what's going to be the big data processing engine? Fundamentally, where are you going to store data, and how are people going to process it?
It's a bit of a pendulum. There have been moments since then where it seemed like, is everyone just going to use Snowflake? Snowflake became so, so popular and so big as a data storage company. From my perspective, working in open source data, it was a little bit sad that the winner might be this totally non-open-source company that stores all of its data behind its servers, where you can't even run a query
on your own computer without connecting to their services. But I think since then, the pendulum has been swinging the other way a bit. Especially as Databricks has risen and challenged Snowflake in its core data warehousing use case, things have opened up a bit more. There's been more of a return to storing data in widely accessible formats. So instead of, you know,
just giving it to some data warehouse, storing it in these formats like Iceberg or Delta Lake, which allow a variety of different computational tools to come and process it and do different things with it. In the 80s, all the data processing functionality was packaged into
a single relational database. And since then there's been this question that has repeatedly come up about how bundled or unbundled that functionality should be. Should you use the same software to store your data that you use to process your data? And I think, at least right now, maybe this is the pendulum, or maybe it's a longer trend where Snowflake is just a bit of a blip, but the direction is towards some of that unbundling, where
you're allowed to have a larger diversity and heterogeneity in the tools that are used to store and process data, which I think is great.
Kyle (40:29)
Yeah, that's really interesting. And I'll be interested to see how that trend plays out, because I've been at companies, kind of like you mentioned, where Snowflake was the dominant thing and it was what we were using, and others where it seemed like things might be trending more the way you described. So it'll be interesting how that plays out. I'm interested, though: where do you see
open source playing into this? Is that a rising factor? And what have you seen changing within open source as far as some of the tools and options?
sandy (41:39)
Yeah, I mean, one of the things that I love about the world of data is that open source has been such a huge part of it for such a long time. As I talked about, there's been some ebbs and flows to that. And I think ultimately companies that build open source need to have some sort of financially viable business model. But...
I still think open source technology is the core of data infrastructure and will probably stay that way for some time. Dagster is open source, and so that kind of reflects our values there. But as I talked about before, a lot of the most popular data storage formats are open source. The processing layer is a bit more all over the place. There's both open source and proprietary data processing tools.
But I think open source has a very strong presence in data infrastructure, in part because data organizations really don't want to be too tied down or don't want to feel like their future is totally tied to the future of some company whose IP they have no ability to interact with. And that matters because data has such gravity.
You can't really afford to have your data disappear. It's kind of the core of your business. So open source is very appealing for those reasons.
Kyle (43:08)
Yeah. Anytime we have a conversation about data, it feels like it always comes back to that as literally the most important part of anything that we're talking about. And so who is owning it and who is controlling it always seems to be at the center of a lot of those conversations. So I'm interested in any lessons that you've learned from
especially the recent work that you've been doing, whether it is with Dagster or in some of the data science work. Anything that has been surprising to you, as far as the building of the products or the use of the data or the data orchestration, that maybe you didn't expect or that has been particularly useful for
the company or yourself.
sandy (44:09)
Sure, yeah. I think one of the things that I and we think a lot about is the gradient of technical skills and technical backgrounds of the people who use our technology. There's always this question of, OK, who do we want to build for? Especially because we build open source software, we have a lot of enthusiasts who are very technical.
But then also, I think, relative to other areas of software, the people that work with data often are not that technical. Data is an important part of their jobs, but their job is finance or sales ops or something like that. And so there's always this question of who are we building for and how do we make that work. I think there's no silver bullet there, but the principles that we think about the most are
gradual disclosure of complexity, which is basically: when you're coming in early, you should ideally be able to see a simple mental model, but then as you poke through and try to do more complex things, you get a smooth ramp up to the more complicated aspects of the product that allow you to do them. So, you know, if you talk to some Dagster users, not everyone will say that we've succeeded on every bit of that, but that's
kind of one of our main design goals. And then the related piece, which is very tightly connected, is layering: having general abstractions at the bottom that support more specific abstractions on top.
Kyle (45:54)
That's really interesting. And I'm interested too: what advice would you have for somebody who's looking to get into data science or data engineering now, either as a career or maybe looking to upskill and potentially move up? What are some of the things they should be thinking about in order to do that?
sandy (46:19)
Yeah, so I think it'll go back to some of the pieces that I talked about earlier, so I hope this isn't too repetitive. One of them is really thinking about data engineering as software engineering. Fundamentally, to produce data, you need to write software that produces that data, and getting comfortable with the basic shape of how software engineering works ultimately allows you to build in a
higher-velocity and sustainable way. And the second one, which is related, is thinking and being very intentional about the data platform that you're implicitly building. Anytime you're the first person in an organization doing something with data, you're implicitly building a data platform. So try to build that data platform in a way that will allow it to support the strain and the scale as more data and more use cases naturally fall into it.
Kyle (47:22)
Yeah, I think that's really great advice. Well, Sandy, this has been a really, really intriguing and I think great conversation. We do have a couple last questions that we like to ask towards the end of our discussions, but before we do that, where can people find out more about you, about the things that you're working on, and obviously about Dagster as well?
sandy (47:48)
Yeah, you can find me on Twitter at s_ryz. For Dagster, check out dagster.io. And I'm also on LinkedIn as Sandy.
Kyle (47:59)
Awesome. Well, we will put the links to all of those things in the show notes as well, so you can check those out. We'd like to kind of wrap up with a couple of questions, and these do not necessarily have to be product or data engineering related, but they certainly can be if you want them to be. You referenced at the very beginning that you're very much into reading about history. So do you have anything that you would like to recommend, as far as things that you're reading
or watching or listening to, that you are particularly enjoying?
sandy (48:34)
Sure, yeah. Let's see, on the history subject, I've been listening a lot to the podcast Empire. Their first season was on the British Empire in India, and now the second season is on the Ottoman Empire. And at the same time, I'm reading The Shortest History of India, to get another perspective on that part of the world. I find that stuff super interesting;
I had almost no exposure to it before. This is not on the topic of history, but I'm also reading Werner Herzog's autobiography, which is really fun. It's kind of all over the place, but it's a really interesting life. And then the last one is, as kind of a New Year's resolution, I've been pushing myself to try to read things in genres that I'm not used to reading. So I'm reading The Sun Also Rises by Ernest Hemingway.
Kyle (49:34)
Nice. Okay. Very, very nice. We'll put links to those in the show notes as well. Those are some absolutely fantastic recommendations; I'll have to put some of those on my list too. And then finally, are there any products that you have been using that you would like to recommend? Either digital or physical products, it could be either.
sandy (49:56)
Man, let's see. I use Notion all the time. I have such a love-hate relationship with it. Let's see, I also use Surfline a lot, which maybe will be exciting to the surfers out there. If you're a surfer, you'll probably already know about it, and if you're not, then you probably don't care about it. Yeah, aside from those,
the two biggest products in my life, Notion and Surfline, I don't know if I have any exciting recommendations.
Kyle (50:29)
Okay. I use Notion all the time as well; I pretty much run so much out of Notion. I'd be interested: what do you love and what do you hate?
sandy (50:40)
Man, we should do a whole other episode on that. So many times its behavior is maddening to me. Yeah, check out my Twitter for a long stream of rage tweets about Notion.
Kyle (50:55)
Okay, we'll link it there. I'm gonna have to go check it out. I probably have more of a love than hate relationship, but I do have moments where I get a little bit outraged at some of the things. So yeah.
sandy (51:07)
That's great. Yeah, I mean, you know, it's an awesome piece of software. I feel like it's
opened up so many things, but maybe my expectations for it are what create my difficulties with it.
Kyle (51:30)
Yeah, I could totally see that. Well, Sandy, this has been an amazing conversation. I appreciate all of your insight, and maybe we'll do the next one on why we both love and hate Notion. Awesome. Well, thank you, and thank you everyone for listening.
sandy (51:33)
Awesome.
Looking forward to it. Really great to chat, Kyle. Thanks for having me on.