R for the Rest of Us

In this episode, I’m joined by Will Landau, a statistician and software developer currently working with Eli Lilly and Company. Will specializes in Bayesian methods, high-performance computing, and reproducible workflows. He is the creator of the {targets} R package, a pipeline tool for reproducible computation in statistics and data science. The package became part of rOpenSci in early 2021.

Will talks about his journey into R and using it for open source projects. He gives a detailed account of {targets} - its origin and how it works as a reproducible analysis pipeline tool.

Check out the YouTube version of this podcast
Important resources mentioned: Get started with the {targets} R package in four minutes
Learn more about Will by visiting his website and connect with him on LinkedIn and X (@wmlandau).

Subscribe to our newsletter: https://rfortherestofus.com/newsletter

What is R for the Rest of Us?

You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting, to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.

David Keyes: 00:00

Hi. I'm David Keyes, and I run R for the Rest of Us. You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.

David Keyes: 00:25

I'm joined today by Will Landau. Will is a statistician and software developer in the life sciences. He earned his PhD in statistics at Iowa State in 2016, and he specializes in Bayesian methods, high-performance computing, and reproducible workflows. Will is the creator of the targets R package, a reproducible analysis pipeline tool, which became part of rOpenSci in early 2021. Will, welcome, and thanks for joining.

Will Landau: 00:53

Glad to be here. Thanks for having me.

David Keyes: 00:56

Well, I wanna start out just by asking some kind of basic questions. I'm curious how you first got into R.

Will Landau: 01:03

Well, I got into R because at the time it was pretty much right in front of me. I was in undergrad, in my 3rd or 4th year of college, and I was just discovering how much I love to code. I was just getting into statistics, and those gradually started to intersect when our homework assignments used R. There was a mix of reactions. People come to statistics from all kinds of backgrounds: economics, computer science, and a lot of other fields.

Will Landau: 01:41

Some people really love to code and some people struggle with it a bit. I think that for me, it hit me right at the right time. And I loved it. I loved how the computational problems intersected with the statistical ones. And I wanted to carry that as a focus through whatever I did after that.

Will Landau: 02:04

And in grad school at Iowa State, I got really lucky. There was just such a strong R community. There were a lot of people doing really cool things in visualization and package development, and professors Di Cook and Heike Hofmann were there. To give you an idea, they were the mentors of Hadley when he was at Iowa State. They kept that community going, and the grad students around me kept that community going. There was just a lot of collegiality, a lot of excitement around the stuff that we were doing in R. There were just a lot of positive vibes and interesting projects, and I couldn't really help but grow to like it every year that I used it.

Will Landau: 02:56

And now at work, I'm among colleagues who use R all the time. We're basically an R shop, and what I do falls under classical experimental design a lot of the time. So it's still a good fit for the job.

David Keyes: 03:15

And I know you work, and I should have mentioned this before, at Eli Lilly. For obvious reasons, we can't talk about the details of what you do. You talked briefly, in broad strokes, about the work that you do there, but I wonder if you can also talk about the daily work that you do with R, especially on the open source side, and what that looks like.

Will Landau: 03:43

Sure. Well, there are 2 sides that I quite like to the open source work that I do. One is statistical. Not all of the models that I develop or contribute to are available as open source packages, but some are, and there's this great collaboration that I'm part of. There's this ASA group called openstatsware, and we have a work stream to implement a Bayesian MMRM. And we're developing this package.

Will Landau: 04:20

We're pulling all kinds of knowledge from across companies, and I've taken the lead on the implementation of this package. It's called brms.mmrm. It's very much a group effort. And that's a model that comes up a lot in my line of work. It's this repeated measures model of experimental data.

Will Landau: 04:42

And it's non-competitive. It's used by so many of us that we just want to get a solid implementation that we all agree on. And that's been extremely rewarding. With that package, we have a CRAN release or 2, and we have done a lot of the interface work. And we're going to move on to the deeper statistical pieces, like borrowing from historical data sources, pretty soon.

Will Landau: 05:17

And the other side of that is the infrastructure and the workflow tools that I work on, like targets, which we'll get into. These are tools just to make your life easier and to manage the analysis workflows that go into either developing a model like that or running really any kind of data analysis project. There are things that I do that a lot of people may not be able to relate to, and I like to be upfront about that. But we all face a common set of workflow problems no matter what we're using to analyze our data, whether it's a Bayesian model, a machine learning method, or just some data transformation and an ETL.

David Keyes: 06:10

Yeah. And I'll say, speaking for myself, hearing you talk about the Bayesian work that you do, that's not something that I'm even familiar with. I wouldn't even know where to start with that. But as I was mentioning to you before we started, targets as a package is something I've heard about and that has intrigued me, but it's not something that I've ever had the opportunity to use, or I guess I've chosen not to take the opportunity to use it. So I'm curious if you could, at a very high level, give me an overview of what targets is and talk about where the idea for it originally came from.

Will Landau: 06:52

Great. So targets is what I'd call a reproducible analysis pipeline tool. In a typical project, you have some datasets that you want to analyze, you'll produce model objects for the fitted models, and then you'll summarize those models. Or maybe you're not doing modeling. Maybe you start from the datasets and you want to run some visualizations to explore the data, to see what a scatterplot of one variable looks like against another.

Will Landau: 07:26

And maybe you want to summarize those results in a bunch of downstream reports in R Markdown or Quarto. There are these depths of flow in a data analysis project. And if we think of a data analysis as a project, when it becomes big enough in that sense, that's where targets comes in, and that's where it starts to help. And hopefully when we get into an example, we can start to think about what kinds of projects targets was originally intended for.

Will Landau: 08:00

But the idea came when I needed a package like that to do my dissertation work in grad school. That's the first time I started to want a tool like targets. I was developing this large Bayesian model for this agronomy problem. And on top of all the writing and all the model development, I needed to run this model on some pretty big datasets as far as genomics was concerned. And each time that I started to run the model, I had to wait 3 or 4 hours for it to complete.

Will Landau: 08:39

And I had a bunch of those analyses. And meanwhile, I had these other moving parts, like the dissertation that I was writing that depended on those results. It was all a lot to keep track of, on top of the runtime of these models, while going about my day putting together something that ended up being a dissertation. And my advisor, Jarad Niemi, watched me struggle through all of this the whole time.

Will Landau: 09:16

And he mentioned to me that I should be using a tool called Make to organize and wrap my head around this stuff. I was a bit too far into the project to transition to a thing like that at the time. But when that was all wrapped up and I defended, I immediately started asking: okay, how do I make this easier for other grad students like me? How do I make this easier for my future self? And how do I make this easier for anybody who's developing any kind of project, whether it's like a dissertation, or something a little bit more lightweight, or something that maybe has less of a runtime burden but just has hundreds of artifacts that you want to produce, like an ETL workflow?

Will Landau: 10:12

And that's when I started searching for existing tools. There's this developer, scientist, and statistician by the name of Rich FitzJohn, who had developed something that was almost what I was looking for. It was a package called remake that he had developed between 2014 and 2016. That package is like the Make tool, except designed completely for R, and the concepts that he introduced there just completely blew my mind: the way that he took something that is usually language agnostic and made it completely focused on R, and extremely friendly.

Will Landau: 10:58

And the reason that I originally got into developing pipeline toolkits was that he had moved on to other things. At a certain point, he was no longer interested in developing remake. So at that point, I created a project called drake, which was kind of like remake, but initially it attempted to pursue a greater degree of scale in a couple of ways. That was when I was really first getting my start as an R package developer. And that was a journey. I learned an absolute ton from it.

Will Landau: 11:35

But drake got to the point where I was faced with the hard limitations of my original design choices. And so at a certain point, it became time to take the lessons that I learned from developing drake and start something completely new. That became targets, which I started to develop in 2020.

Will Landau: 11:58

And I was secretly working on it on nights and weekends for about 6 months. And then I open sourced it and released it to rOpenSci. And the rest, as they say, is recent history.

David Keyes: 12:14

Yeah. That's great. So from my kind of naive understanding, I think about, for example, the types of projects that we do, which don't actually involve any modeling. Typically what they involve is taking some raw data. We usually have an R script file where we take the raw data and do cleaning and transformation on it. Then we typically save that, usually as an RDS file, and then we do a lot of reporting.

David Keyes: 12:45

So a lot of parameterized reporting, that type of thing. Then our R Markdown or Quarto files read in that cleaned RDS file and use that for the reporting. And we've developed this structure. We basically follow the structure that R packages use, where we'll put our raw data in a folder called data-raw, and then our clean data goes in the data folder. But what it sounds like you're saying is that targets is essentially a formalization of some of the processes that we have informally put together. For example, maybe it takes a while to run the code to clean your data. The way we deal with that is we run it once and save the result as an RDS file.

David Keyes: 13:40

But with targets, you could set it up so that it always knows: have we run the cleaning and transformation code? Is that up to date? And then if that code changes, it'll rerun it. Right? So I guess it's really formalizing that process. It both speeds up your runtime and makes sure that everything is up to date.

David Keyes: 14:11

Am I properly understanding it, at a high level?

Will Landau: 14:15

Yeah. That's a great description.

Will Landau: 14:17

And I like your use of the word transformation in particular, because this is where we start to think about a project-oriented workflow, like the one that you described with working with data and generating reports, as a series of transformations. You're transforming those datasets into reports, or maybe you're transforming those datasets into human-readable summaries and then transforming those human-readable summaries into those reports. This mapping is a really useful aspect of the whole process to think about and to be pedantic about: what exactly is going on in each one of those transformations, what is the set of inputs that you need, and what is the one output that the transformation produces in each case? I think that's how targets makes this whole process a bit more pedantic. And then there's the process of skipping a step that's already up to date and only running it if the upstream code or dependencies changed.

Will Landau: 15:33

For somebody who is running these extreme kinds of Bayesian or machine learning models, it's a really critical time saver, and it's the kind of problem that you don't know you're going to have until you're faced with a kind of analysis that takes forever to run, that you might have to stop in the middle. But there's something for everyone in that, even if you're just churning away at a couple of datasets and producing some reports. You might have a couple of those, or you might have a few hundred datasets and a few hundred reports. And to be able to see right in front of you that everything is up to date is vouching for the reproducibility of the project.

Will Landau: 16:21

It does so in the most first-principles sense that I can really think of. It's saying that if you were to reproduce or recreate this analysis from scratch, you would get the same results. And that's a kind of reproducibility that was missing, I think, for a long time. And it complements the kind of reproducibility that you find in being able to describe in your own words exactly what's going on, let's say inside an R Markdown or a Quarto report for the analysis.
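To make the idea of skipping up-to-date steps concrete, here is a minimal sketch in R. It is an illustration under stated assumptions rather than a pipeline from the conversation: the target names (raw, model, slope) and the toy data are hypothetical, and tar_dir() is used only so that the example runs in a temporary directory. In a real project, the list of targets would normally live in a file called _targets.R at the project root.

```r
library(targets)

tar_dir({  # run everything in a temporary directory
  # Write a small _targets.R file. Each target is one transformation:
  # a set of inputs goes in, and exactly one output comes out.
  tar_script({
    list(
      tar_target(raw, data.frame(x = 1:10, y = (1:10)^2)),  # hypothetical data
      tar_target(model, lm(y ~ x, data = raw)),             # depends on raw
      tar_target(slope, unname(coef(model)["x"]))           # depends on model
    )
  }, ask = FALSE)

  tar_make()  # first run: builds raw, model, and slope
  tar_make()  # second run: nothing changed, so the targets are skipped

  # Names of the targets that would rerun on the next tar_make().
  outdated <- tar_outdated()
  print(outdated)
})
```

If the code of one target were edited, the next tar_make() would rerun that target and whatever depends on it, and leave the rest alone.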

David Keyes: 17:00

Yeah. That makes sense. Well, I have a bunch of questions about targets, but I actually think what would be most useful at this point is to have you give us a brief demo, and then I can ask those questions. So I'll let you put your screen up, and then we can watch that.

David Keyes: 17:18

Hey. David here. Just wanted to let you know that at this point in the conversation, we switched to a screencast. Now obviously, showing code doesn't work very well in an audio podcast. So if you wanna see the rest of this conversation, check out the video version of this podcast on YouTube.

David Keyes: 17:34

You can find a link to that in the show notes. So I'm curious, like, in your mind, how complex does a project need to be to benefit from using targets?

Will Landau: 17:50

That's a hard question because there really isn't a hard and fast rule, and it can depend on personal preference. But I would say anything even a little bit complicated, or even a pipeline with 4 or 5 steps, or even 3 or 4 computationally intense steps, is worth doing in targets once you experience the time savings. And I didn't even get to the features in targets that abstract away files as R objects. You mentioned that it's common to manually save the output data as an RDS file and then read it back later. In targets, there are more convenient functions to access data in that cache, or data store.

Will Landau: 18:46

And that in itself is a convenience that's hard to move away from, even for really small workflows, once you've experienced it. And projects can also be complex in a bunch of different ways. Sometimes the computation time is a burden. Sometimes you just have a lot of data objects to output and a lot of downstream tasks that read in those data objects. And targets can smooth things out in either one of those cases.
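As a rough illustration of reading results from the data store instead of managing RDS files by hand, here is a minimal sketch. The helper clean_data() and the target names are hypothetical, and tar_dir() again just keeps the example inside a temporary directory.

```r
library(targets)

tar_dir({  # run everything in a temporary directory
  tar_script({
    # Hypothetical cleaning helper: drop rows with a missing score.
    clean_data <- function(data) data[!is.na(data$score), ]
    list(
      tar_target(raw, data.frame(id = 1:4, score = c(10, NA, 30, 40))),
      tar_target(clean, clean_data(raw))
    )
  }, ask = FALSE)

  tar_make()  # runs the pipeline and saves each result in the data store

  # No saveRDS() or readRDS() calls needed: ask for the target by name.
  clean_copy <- tar_read(clean)
  print(clean_copy)
})
```

tar_read() returns the stored value; targets also provides tar_load(), which assigns the value to a variable with the same name as the target.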

David Keyes: 19:19

Yeah. For the work that we do, it'd probably be less about the runtime issues. It's not that it's gonna take a huge amount of time, although there are some cases where that happens. It's more about making sure everything is up to date. For example, we create some reports, and then the client says, oh, yeah.

David Keyes: 19:39

Here's some new data. Well, the way we do it right now, we have to make sure manually that someone goes in and reruns the data importing, cleaning, and transformation code so that it spits out that new RDS file, so that when we generate the reports again, they use the up-to-date data. It seems like using targets would make it so that every time we run tar_make(), it will just check: okay, is everything up to date? And before it generates those reports, it just automatically runs that code.

David Keyes: 20:14

So I can definitely see the benefit of it.

Will Landau: 20:17

Yeah. It's one of those things where you don't realize how much mental bookkeeping you're doing just to remember the state of the project. It's much easier to have a tool automatically tell you, and it's hard to move away from that once you've experienced it.
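The scenario David describes, where a client delivers new data, can be sketched roughly as follows. This is an assumption-laden example, not anyone's actual project: raw.csv, the target names, and the numbers are all made up. The key piece is format = "file", which tells targets to track the contents of the file itself.

```r
library(targets)

tar_dir({  # run everything in a temporary directory
  # Hypothetical first data delivery from a client.
  write.csv(data.frame(score = c(1, 2, 3)), "raw.csv", row.names = FALSE)

  tar_script({
    list(
      tar_target(raw_file, "raw.csv", format = "file"),  # track the file contents
      tar_target(clean, read.csv(raw_file)),
      tar_target(report_value, mean(clean$score))
    )
  }, ask = FALSE)

  tar_make()
  before <- tar_read(report_value)

  # The client sends new data, so the file on disk changes.
  write.csv(data.frame(score = c(10, 20, 30, 40)), "raw.csv", row.names = FALSE)

  stale <- tar_outdated()  # the tool reports which targets are out of date
  print(stale)

  tar_make()               # reruns only what the changed file invalidates
  after <- tar_read(report_value)
})
```

Nobody has to remember that the cleaning step needs a rerun; tar_outdated() answers the bookkeeping question and tar_make() acts on it.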

David Keyes: 20:32

Yeah. That absolutely makes sense. Because a lot of times, what we have to do too is explain to the client how everything works, because they, in some cases, will actually take over the project and then run it themselves. And it's only when we're actually sitting down and going through it: okay, first you do this, then you do that. We'll have a really long README sometimes.

David Keyes: 20:52

And it seems like, in some ways, targets is a really long README in code, in a way that can facilitate people continuing to run the code in a way that is efficient and hopefully makes more sense to them than the long READMEs that we give them.

Will Landau: 21:11

Yeah. That's a great point. And I've seen users of targets insert dependency graph visualizations in their READMEs. There's this JavaScript library called mermaid.js that produces really nice static graphs, and there's this tar_mermaid() function in targets to produce one of those visualizations of the dependency graph, but in a static format. And it's possible to embed that into an R Markdown document and share that.

Will Landau: 21:43

And with the description field of each target, which TJ Mahr and Noam Ross nudged me to implement in targets, I think that could be even more powerful for communicating how the pipeline is set up for somebody else, either someone who's just consuming the results or someone sharing in the development.
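Here is a minimal sketch of both ideas: per-target descriptions and a static mermaid.js graph. The target names and description text are hypothetical, and the description argument assumes a recent release of targets (older versions do not have it).

```r
library(targets)

tar_dir({  # run everything in a temporary directory
  tar_script({
    list(
      tar_target(
        raw,
        data.frame(score = c(1, 2, 3)),
        description = "Raw survey scores (hypothetical)"
      ),
      tar_target(
        avg,
        mean(raw$score),
        description = "Average score used in the report"
      )
    )
  }, ask = FALSE)

  # tar_mermaid() returns the mermaid.js graph definition as lines of text.
  diagram <- tar_mermaid()
  writeLines(diagram)
})
```

The printed text can be pasted into a mermaid code chunk in a README or an R Markdown or Quarto document, where it renders as a diagram of the pipeline.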

David Keyes: 22:06

Yeah. That makes sense. I have a couple last questions for you. First, I've heard people describe the efficiency gains that you get from switching to targets as kind of the equivalent of the gains you get if you switch from an R script based workflow to using something like R Markdown or Quarto. I'm curious if that comparison resonates with you.

Will Landau: 22:29

It does. And I remember when I discovered that I could use R Markdown to do homework assignments in grad school. I just remember how much nicer it was than having an R script over here, with maybe some comments on the code, and then my description in my own words and prose in another place to show the other parts of my work and my thought process. To have those woven together was, in a lot of situations, just really nice, and it solved a problem that I didn't know I had until the problem was actually solved. And targets, I think, is similar.

Will Landau: 23:14

Not everybody is immediately faced with these crushingly long run times. And so maybe the friction is just enough to fly under the radar, but it's it's still a problem that you may not know you have until it's actually solved. And although those are different problems, they're hidden in similar ways.

David Keyes: 23:36

Yeah. That makes sense. Great. Well, one last question. I'm curious what changes you kind of anticipate making to targets in the future.

David Keyes: 23:44

I know you talked about adding the description field. Are there other things you're thinking about as you move forward with targets?

Will Landau: 23:52

So in general, most of targets, I think, is in a really good spot. I think that there will always be bug fixes and little features and just general maintenance to do. That's just part of every actively maintained package. And I'm in it for the long haul. Over the past year, I've focused most of the development on trying to make targets a viable cloud computing pipeline tool.

Will Landau: 24:28

And that kind of work has been really important to me and my colleagues. That's kind of where the project is at the moment.

David Keyes: 24:38

Great. Well, Will, thank you very much. This has been really useful. It's given me a lot to think about. Honestly, it's inspired me to think about potentially trying out targets.

David Keyes: 24:48

So thank you again for taking the time to chat. I really appreciate it.

Will Landau: 24:51

Glad to be here. Thanks for having me on.

David Keyes: 24:53

That's it for today's episode. I hope you learned something new about how you can use R. Do you know anyone else who might be interested in this episode? Please share it with them. If you're interested in learning R, check out R for the Rest of Us.

David Keyes: 25:06

We've got courses to help you no matter whether you're just starting out with R or you've got years of experience. Do you work for an organization We work with clients to make high quality data visualization, beautiful reports made entirely with R, interactive maps, and much, much more. And before we go, one last request. Do you know anyone who's using R in a unique and creative way? We're always looking for new guests for the R for the Rest of Us podcast.

David Keyes: 25:41

If you know someone who would be a good guest, please email me at david@rfortherestofus.com. Thanks for listening, and we'll see you next time.
