R for the Rest of Us

In this episode, I speak with Miles McBain, a data scientist and R package developer from Brisbane, Australia, about patterns and anti-patterns in data analysis reuse. Miles shares his journey from a generalist software developer to a data science specialist, his passion for R, and the evolution of his coding practices. We delve into the intricacies of code reuse in data analysis, discussing common pitfalls to avoid, the benefits of creating reusable code packages, the process of breaking down large codebases, and how teams can evolve their coding practices to enhance efficiency and maintainability.

Important resources mentioned:

Patterns and anti-patterns of data analysis reuse blog post by Miles McBain
Jenny Bryan’s talk ‘Code smells and feels’ presented during useR! 2018

Connect with Miles McBain:

Website: milesmcbain.xyz/
GitHub: @milesmcbain
Mastodon: https://fosstodon.org/@milesmcbain

Subscribe to our newsletter: https://rfortherestofus.com/newsletter

What is R for the Rest of Us?

You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting, to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.

David Keyes: 00:00

Hi. I'm David Keyes, and I run R for the Rest of Us. You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.

David Keyes: 00:25

Well, I am delighted to be joined today by Miles McBain. Miles is a computer scientist turned data scientist, R package developer, and open source enthusiast. So, Miles, welcome, and thanks for joining.

Miles McBain: 00:39

Thank you. Thanks for the invitation.

David Keyes: 00:41

I know you're based in Australia. Where exactly in Australia are you located?

Miles McBain: 00:45

Coming to you from Brisbane. So that was where we had useR! 2018, and I was very proud to be part of that. That's great.

David Keyes: 00:52

So tell me a bit about your background. I'm curious kind of how you came to use R, what your daily use of R looks like today.

Miles McBain: 01:01

Yeah. I was a software developer for a little while. I was working in Townsville in North Queensland, and I moved to Brisbane. And I guess I became kind of aware that I was a bit of a generalist in software development, and I didn't have any kind of specialization. And, you know, I felt like I wasn't really in control of the direction of my career because I was just sort of this generalist, and I got moved around from team to team and different things.

Miles McBain: 01:22

And this was maybe 2012 or something like that, shortly after that famous quote about data scientist being the sexiest job of the 21st century or something. And yeah, I was tossing up between two specializations at the time. I could either go into cybersecurity or go into data science, and I decided to go with data science mainly because I had really fond memories of statistics and probability, which I felt was a little bit unusual. A lot of people didn't enjoy those subjects, but for some reason, I did. And so, yeah, I decided, yep, gonna do my master's in stats.

Miles McBain: 01:54

And then early on in the course, a lot of the stuff was to do with using Minitab and GUI stats programs. And since I was already a programmer, I was just like, no, this doesn't feel good. I tried a few different things, and I already knew Python at that time, actually. And I remember I tried to use Python, and I think maybe the Python ecosystem at that time was a bit immature or whatever, but I just remember trying to pip install packages and just getting this spinning "resolving dependencies" kind of thing, that classic Python environment hell. And I was like, this doesn't feel really good.

Miles McBain: 02:26

And actually one of the programming languages I learned and really liked up until that point was Ruby. I don't know if you've ever used Ruby, but Ruby had this excellent gem install thing which was way better than pip. It just always worked. Then I saw some buzz about R, and in particular, people were making these really nice looking ggplots and stuff. And, yeah, I sat down, I tried R and installed R packages. It just worked.

Miles McBain: 02:47

Right? And it was that same gem install experience, and I was like, great. I can work with this. So that was pretty much how I came to R. And as I learned more about it, it seemed very quirky and a little bit inconsistent, but at the same time I was aware that it had a lot fewer rules compared to other programming languages, and that appealed to me.

Miles McBain: 03:06

A lot of the things that I create, I always like trying to probe it, like, what is the limit here? What will R actually let me do?

David Keyes: 03:13

Well, that's interesting, actually. If I can pick up on that, anecdotally, I've heard people who come from a more computer science-y background, which I know is where you come from as well, get to R, and they're like, what is this? I don't like this. And you see there's a whole common trope of computer science developers ragging on R for being quirky. But you had the opposite reaction.

David Keyes: 03:31

You kind of like that. Yeah. I'm curious why that may have been the case.

Miles McBain: 03:35

Well, it just felt very productive, I guess. It felt like whatever I could imagine that I wanted to do, there was less stopping me. At first, it felt really weird to be working with no scalars, only vectors, but then you realize how powerful it is that nearly everything you wanna do with data is already vectorized. It feels super productive. It's like, oh, I don't have to think about that now.

Miles McBain: 03:56

The other thing was, like, immutable by default. In a lot of other programming languages, there's always this thing about, oh, should this be mutable or should this be immutable? And you can get yourself into a real tangle if you forget which mode you're in when you're calling functions and stuff like that, whether they modify their arguments or not. But with R, that whole question just goes away. And I didn't really realize it at that time, as I was a bit early in my programming journey, but I later came to realize that this idea of immutability, and the simplifications and guarantees it creates, is a core benefit of the functional programming style.
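
A minimal base R sketch of the two properties described here: vectorized operations, and functions that do not modify their arguments. The values and the helper name add_flag are invented for illustration. Some R objects (environments, R6 objects, data.table tables) are deliberate exceptions to the behaviour shown.

```r
# Vectorized arithmetic: the whole vector is converted without an explicit loop.
temps_c <- c(18.5, 21.0, 25.3)
temps_f <- temps_c * 9 / 5 + 32

# add_flag() is a hypothetical helper. It changes its argument internally,
# but the caller's object is left untouched (copy-on-modify semantics).
add_flag <- function(df) {
  df$flag <- TRUE
  df
}

original <- data.frame(id = 1:3)
flagged <- add_flag(original)

names(original)  # only "id": the original was not modified
names(flagged)   # "id" and "flag": the change lives in the returned copy
```

Because the function can only communicate through its return value, there is no need to track which calls mutate their inputs.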

David Keyes: 04:26

Are there other kind of things you think coming from your more computer sciency background gives you a different perspective than other R users?

Miles McBain: 04:37

I suppose, one, I've done a fair bit of teaching of R to people who are coming from a science-y background. And I think probably one of the main advantages that you have coming from a more computer science background is you're really at home with the idea of creating your own functions and procedures for things. And R, at its core, is a functional programming language. And so everything's geared towards this idea that you're going to create your own procedures and you're going to pass them around. And that's part of how you create these abstractions. S3 methods and things like that are just effectively procedures with attributes.

Miles McBain: 05:07

I feel like people who come to R from a more science-y background are less in tune with the idea that, hey, you can just make your own procedures for things. And they're more like, okay, teach me the procedure and I will follow it. So they write these scripts that are long series of instructions, and this copy-pasted code, which I'm sure we'll get to soon. But they don't realize that you can just write a recipe for that and reuse that everywhere. And I felt like I was a bit more primed to take advantage of that because I had come from a computer science background and already knew the power of creating functions for things.

David Keyes: 05:40

Yeah. That makes a lot of sense. You've always struck me as way more on the computer science-y end of the R users who I come across. So it's interesting to hear your perspective there.

Miles McBain: 05:53

Yeah. Well, you picked it correctly. That is where I started from. Right. I wouldn't say I identify heavily with that now, though.

Miles McBain: 05:59

Yeah.

David Keyes: 05:59

Yeah. But I think even just in terms of your perspective, I mean, we're gonna talk about this blog post that you wrote. I think there are elements in there that come from having that perspective, even some of the things you were just talking about in terms of being able to, by default, just make your own functions, but then also realizing what the limits are or what issues you might run into by doing this. I think that's something that people who aren't necessarily coming from a computer science-y background wouldn't necessarily anticipate in the ways that you clearly have. So maybe we can actually dive into that.

David Keyes: 06:36

So the article that you wrote is called "Patterns and anti-patterns of data analysis reuse." Yeah. I'm curious, maybe starting out, what do you consider an anti-pattern of data analysis reuse? What does that mean to you?

Miles McBain: 06:50

Well, I feel like Jenny Bryan did a really good job of introducing the idea of patterns and anti-patterns to the R community with her talk, "Code smells and feels." I think she referenced a bit of Martin Fowler's work from the software world, you know, patterns and anti-patterns. And if you haven't seen that talk, it's absolutely spectacular, and I highly recommend people go and check it out. That was at useR! 2018, by the way.

David Keyes: 07:10

It was. Okay.

Miles McBain: 07:11

So it's this idea that there are these things that software frameworks and programming languages might lead you to do, and they might seem like a good idea at the time, but there are hidden costs. And it doesn't necessarily mean bugs. It could be performance penalties. Or a classic one in concurrency is deadlocking, where there are certain anti-patterns in the ways that you manage parallel processing that mean your program can effectively get locked, with two parts of the program each waiting for a piece of information from the other. And there are patterns to avoid that happening.

Miles McBain: 07:44

So an anti-pattern, yeah, is this idea that there are ways you can design your systems and design your code that can lead to bad outcomes, and there are ways you can design for better outcomes. I just kind of took that concept because I saw a similar thing happening with code reuse. And I feel like there's a discipline that is kind of sitting between data science, data engineering, and software development. I wanna call it data science engineering, but it's not like plumbing the data. It's like, okay.

Miles McBain: 08:12

So we are analysts. We are doing data analysis every day. We've got lots of projects and context switching. How do we bring some sort of engineering view to design that rather than just having this organically created mess? I feel like you have to be in that organically created mess a few times before you start to realize, like, the patterns that create that organic mess and ways you can get around them.

Miles McBain: 08:35

And so this was the kind of idea. It's like, yeah, there are things that I've seen people do and I myself have done. They seem like a good idea at the time in terms of managing your code and redoing the same sorts of analysis over and over again. But I guess the main theme is that quite often complexity is not managed very well. And so complexity can ramp up, and it might be to do with adding more people to the team, or doing more context switching than you were, and all of a sudden, the strategy you were using falls apart.

Miles McBain: 09:03

And, yeah. The analogy I like to use is the technical debt idea, and for people who aren't familiar with that, it's kind of like servicing your car. Right? Most people have the concept that, okay, you get your car serviced every 3 months because if you don't, the wheels might fall off or something catastrophic might happen, and then you have no car for a long time. With technical debt or process debt or complexity debt in code reuse, a similar thing can happen: if you don't address it and acknowledge it and take steps to mitigate it in an ongoing way, then at some point, the complexity will get very hard to manage and potentially catastrophic things will happen.

David Keyes: 09:41

Yeah. And you talk about kind of 4 stages that you've identified. I'll walk through them, and then you tell me what I interpreted correctly or incorrectly. So the first stage that you talk about is copying and pasting code from one project to the next. So say you're working on one project, you write some code, then you work on another project.

David Keyes: 09:59

You're like, oh, hey. That code that I wrote for that last project could apply here. Let's move it over there. But then you realize if you make changes in project number 2, then you also should probably go back and make those changes in project number 1. Did I get the first step?

David Keyes: 10:17

Is there anything you'd add there?

Miles McBain: 10:19

So I have these multiple copies of copies. And to me, I think I even said it in the article, it almost looks like a virus replicating. And these copies acquire mutations. And then what becomes unclear at times is which one I should use.

Miles McBain: 10:35

So I've taken some methodology, some data analysis stuff that I did, and then I use that on project A, and then I copy-pasted that on project B. And now project C needs to ramp up. Do I use project A or do I use project B? Or do I copy-paste from both and try to merge them together? So copy-pasting definitely saves you time because it saves you writing the code and saves you having to think.

Miles McBain: 10:57

But the hidden complexity cost comes later when you have all these subtly different but similar versions of the same thing, and you're trying to decide which you should use. And I think in the article, I used a real example of bugs that you squashed suddenly reappearing. And you're like, I've copied the wrong version that had the bug, and now I've reintroduced the bug into our workflow, and that's kind of frustrating.

David Keyes: 11:20

Yeah. And so then you talk about the next stage being to kind of make a template project with to dos sprinkled throughout. Can you talk about how that differs from the just copying and pasting?

Miles McBain: 11:32

Yeah. As soon as you start to realize how copy pasting fails to manage complexity, you recognize the problem, which is, like, there's too many versions all spread out, and they're all different, and what you need to do is to centralize. It's a stage I went through, and it's because package development feels hard, modules or whatever feel hard, and it's like, no. I don't need all that complexity. I can just create a template for myself.

Miles McBain: 11:53

And I think this is something that people are reasonably familiar with from just creating text documents and stuff. If you've ever had to do a mail merge to create a document that had to go to multiple people, or you've ever had to create the same document for different purposes, you create a structure, and then you have, within that structure, the places where you're going to place the content that differs. It's a fairly obvious idea, and it can work pretty well. And so teams try to apply that to their code, so they create this template.

Miles McBain: 12:20

And then the problem with the templates is that we want to ask more and more of our templates. So we wanna have this one template, but actually, if it could just do this, then we'd be right. And then we put in the ability to do that. Oh, if the template could just do this. And that happens lots and lots of times. And then you have this situation where you're starting to inject a lot of complexity into the template itself.

Miles McBain: 12:47

And the template is becoming this pseudo programming framework where you might be using the mustaches, the templating stuff, to parameterize your templates. And then you might have special syntax that replaces things you put in there with other things, or computes things at template build time. And now you've effectively built something like a programming language almost, except you didn't do it and you didn't design it. It kind of just grew. And yeah.

Miles McBain: 13:14

Like, templates can get really complicated. And particularly if you're working with a few different people, there's always this tension of, well, I want the template to do this for me, so I'm gonna put this feature in the template. Templates are kinda hard to test because you have this explosion of parameters and things. And so if I add a feature to the template, I'm gonna test that feature works, while I might not test all the cases that affect what you need to do with it. So you have this thing of people stepping on each other's toes, changing the template and not realizing that that breaks how someone else was using the template.

Miles McBain: 13:45

That's the sort of complexity you find yourself in there.

David Keyes: 13:48

Yeah. So it sounds like with both approaches, the simple copying and pasting and templates, the main issue is that the code just kind of metastasizes and gets way more complex

Miles McBain: 14:01

Yes.

David Keyes: 14:01

Than you originally intended.

Miles McBain: 14:04

And

David Keyes: 14:04

that complexity then becomes its own beast that you have to manage. And so any of the benefits that you might gain from not having to rewrite the code from scratch are either mitigated or reduced by the fact that you have to maintain that complexity. Is that accurate?

Miles McBain: 14:21

That's right. And you have to maintain this templating framework that you created, which in itself is not gonna be as well designed as, like, the R programming language. So it's like, well, why didn't we just write it in code to start with? Because we didn't have a way to manage it. And that's where we get to the next stage, which is like, okay.

Miles McBain: 14:39

Can we create our own kind of like personal universe of functions and packages?

David Keyes: 14:44

Well, but even before that, in your article you actually talk about the next step being to make one package,

Miles McBain: 14:52

like a

David Keyes: 14:52

single package. One benefit being that it forces better practices than using something like a template or copying and pasting. Can you talk about what you mean by that?

Miles McBain: 15:02

Yeah. Okay. So I think immediately, as soon as you create a package, and I highly encourage people who want to centralize code to do that, you're sort of triggered. I mean, if you're reading any material, like, you know, Hadley's and Jenny's and the other contributors' great R Packages book, then immediately you're confronted with all sorts of stuff about, oh, okay, there's stuff in here about what to do with the documentation. So you're encouraged to document your work properly and to have nice HTML documentation that people can consume, or you're encouraged to do some level of unit testing, which wasn't really potentially even possible with the template, and you might be encountering that idea for the first time.

Miles McBain: 15:37

And you have multiple people contributing to this code base with the unit tests being written. That might even lead you down the path of, like, continuous integration where the unit tests have to pass in order to make the changes or something like that. So there's this path that creating packages puts you on, and, yeah, a lot of that can only improve the output.

David Keyes: 15:56

So how do you decide at what point, when you've written some code, it makes sense to put it into a package? I mean, obviously, it sounds like you tend to go that direction. So if, say, you're working with someone and they're deciding, you know, is this code that's worth putting into a package? What's your rubric for making that decision with them?

Miles McBain: 16:15

I have a very low threshold for that, I guess. First of all, it's like, is this thing reusable? And sometimes there's a little bit of work to, like, see the reusable parts versus, like, the parts that are only specific to what they're doing at that time. So sometimes they'll say, no, no, I can't put this into a package because it's too specific to the project that I'm working on now. But often that's just a case of like refactoring what they're doing a little bit where you're like, okay.

Miles McBain: 16:40

Hang on. So we can separate the project specific stuff from the domain specific stuff, and we can make that into a package. There's ways you can go about that. You can have functions that take arguments or take even other functions that tell them how to do the domain specific part or the very specific part of what they need to do. And that takes a little bit of practice.

Miles McBain: 16:57

Right? But it's about looking at what's happening and seeing like, okay. So what is the part that is genuinely reusable, and what is the part that is specific to this thing? And I think rather than like looking at code that's reusable, it's actually a bit about like looking at concepts that are reusable, you know. So an example might be like, okay.

Miles McBain: 17:15

It seems like we are always, like, writing these, like, same or similar like SQL queries to get this like stuff from this database. Maybe we could like wrap over those with some parameters and that would simplify our code rather than having to have essentially the same SQL copy paste and modify it slightly for this specific case. Maybe we can, like, parameterize that and then wrap that up in a package. Something like that. That's a really simple example.
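
As a rough sketch of that idea, the helper below wraps one recurring query behind a few parameters. The function name incident_query, the incidents table, and its columns are all hypothetical, and the `?` placeholder syntax is an assumption that depends on the database backend (some backends use `$1`, `$2` instead).

```r
# Hypothetical wrapper: returns a parameterized SQL statement and its values
# instead of a hand-edited copy of the SQL text.
incident_query <- function(incident_type, from, to) {
  stopifnot(is.character(incident_type), length(incident_type) == 1)
  list(
    sql = paste(
      "SELECT * FROM incidents",
      "WHERE type = ? AND occurred_on BETWEEN ? AND ?"
    ),
    params = list(
      incident_type,
      as.character(as.Date(from)),  # fails early on an unparseable date
      as.character(as.Date(to))
    )
  )
}

q <- incident_query("structure_fire", "2020-01-01", "2020-12-31")

# With an open DBI connection `con` (not created here), the query could be
# run with bound parameters rather than pasted-in values:
# DBI::dbGetQuery(con, q$sql, params = q$params)
```

Every project then calls the same function with different arguments, so a fix to the query is made once rather than in each pasted copy.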

Miles McBain: 17:39

Yeah. So the concepts would be, like, our core datasets. So, okay, when I was at the Queensland Fire Service, our core datasets were about incidents. So if we're talking about incidents that are happening, and the location where they happened, and the type of thing that they were, that's a concept that is reusable across all our work.

Miles McBain: 17:59

And in my current work, I work with like not for profits, like charities. So, like, something that would be, like, reusable across them. It would be like, okay. Charities are always, at some stage going to ask people for money. Right?

Miles McBain: 18:14

They have a variety of ways that they want to determine how much that should be. And so the concept of asking people for money, that's shared across all charities. And so there can be a package that's like calculate how much to ask someone. So I guess what I'm saying is often it's not like stare at the code and see, like somehow extract the, like, common bits. It's more like identify the concepts that the code is wrapping up and identify the shared concepts and then pull up the code for that.
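
One way to sketch a shared concept with a project-specific part is a function that takes another function as an argument. Everything here is hypothetical: calculate_ask, its strategy argument, and the gift amounts are invented for illustration and are not taken from any real package.

```r
# Shared concept: "how much should we ask this donor for?"
# The project-specific rule is passed in as a function.
calculate_ask <- function(past_gifts, strategy = median) {
  stopifnot(is.numeric(past_gifts), is.function(strategy))
  strategy(past_gifts)
}

gifts <- c(10, 25, 50)

# One charity might ask for the donor's typical gift...
typical_ask <- calculate_ask(gifts)

# ...another might ask for 10% more than the largest past gift.
stretch_ask <- calculate_ask(gifts, strategy = function(g) max(g) * 1.1)
```

The package owns the concept and the checks, while each project supplies only the rule that differs.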

David Keyes: 18:42

That makes a ton of sense. One thing I was wondering about too is, you know, I talk to people sometimes and I'll encourage them to build a package, and they'll be like, oh, that's too much work. I'm gonna spend so much time just putting it together. Is it really worth it? Is it really gonna save me time to do that?

David Keyes: 19:01

How do you answer that type of question?

Miles McBain: 19:03

Well, it's just not a lot of work. I mean, there are people like, I think, Jim Hester, I can remember a great talk from him, like, create an R package in 20 minutes. It's literally like devtools, create project, or create package or something like that. I can't actually remember what it is right now, but it's a one-liner. And you will get a package skeleton, and then you can create your functions, run check, and then you've got a package.

Miles McBain: 19:26

So there's actually not a lot to it these days. I think the difficulty is actually feeling confident and understanding what you're doing. Yeah. So that's the work. The work is like understanding, what even is an R package?

Miles McBain: 19:37

What do I need to do? Why do I need to do it? So it's more like understanding the structure. Like, this is where the code goes, in the R folder, and this is where the tests go, in the tests folder, and this is what happens when I run check. So it's more like a kind of understanding deficit rather than, oh, I have to do a ton of work now to write this package.

Miles McBain: 19:56

And the good thing about that is like once you understand how to make one package, you understand how to make 20 packages. It's the same process every time. So that would be how I'd encourage people. I would say, look. There's a little bit of learning to do upfront to make your first one.

Miles McBain: 20:09

But after you make your first one, every single other package will feel very easy and very quick.

David Keyes: 20:14

Yeah. Yeah. I mean, I've always been surprised. I remember when I was first starting to make packages. I don't remember exactly what it was.

David Keyes: 20:21

And it seemed very scary from the outside, but once I actually dove in and did it, I was like, oh, this is basically just creating functions with a few additional things tacked on top of that.

Miles McBain: 20:33

Yeah. And so that is actually a thing you identified there. Creating the functions is sometimes the hard thing, and we talked about that a little bit before, about how people coming from a science background are like, oh, you know, functions, like, wow. What is that? It seems a bit mysterious, and particularly the thing that breaks people's brains is functions being passed to other functions and stuff like that.

Miles McBain: 20:55

But once you get over the hurdle of what is a function, it's a pretty core idea that gets reused everywhere, including in package development. So maybe if we're, like, trying to help people get to the point of creating packages, it's like, well, hang on. Are you comfortable with functions?

David Keyes: 21:08

Yeah. Yeah. Maybe this is specific to me, but one thing I see when people are learning to write functions, you were sort of hinting at this, is they'll end up writing massive functions, like, say, to clean my data, and it has, like, 200 lines of code. Yeah. I'm curious, like, how you explain the benefit of kind of breaking that into, you know, say, multiple functions.

David Keyes: 21:32

How do you talk to people about why that may or may not be the best approach?

Miles McBain: 21:37

Yeah, I might have even talked about that in the article. I can't remember but the first package that you write is a do everything package. I feel like that'll be self revealing in a way because that massive function that they write with all those parameters, it's basically a template. That's what I wrote in the article. So the the issue that you'll have is, like, maintaining that thing is gonna be really challenging.

Miles McBain: 21:58

All these different parameters and combinations of parameters, you're not really gonna be able to test it, and you're gonna have trouble. And because of that, you're gonna have that situation with the template where you're like, oh, okay. Someone changed the function and that now breaks my work. So I'll just use this different version of the function that I know works for mine, and you're back to, like, we have multiple copies of the same thing.

Miles McBain: 22:19

So I guess I'd point out that the more complicated you make the function, the more likely it is that it won't be able to be reused.

David Keyes: 22:28

That's interesting.

Miles McBain: 22:30

Yeah. And I don't know. This might be a bit high level for people trying to create their first functions. But the thing I think is, the reason why creating functions is a good and useful thing is because it lets you express what you're trying to do in terms of the domain. So using my Queensland fire example, I can write functions that talk about calculating response times and finding locations of things that fall within areas.

Miles McBain: 22:56

And it's like the code that I am writing has function names that say that, like find things in this and time between this and whatever. So for someone in our domain, they might not even know r, but they can be like, oh, yeah. I see what you're doing. Because I understand that core to this domain is like points and times and road networks and all that sort of stuff. And that is represented there in the code.
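
A small sketch of what such a domain vocabulary might look like in base R. The function names, column names, and data are hypothetical, and a rectangular bounding box stands in for the real spatial and road-network logic.

```r
# Hypothetical incident data; columns are assumptions for illustration.
incidents <- data.frame(
  id = 1:3,
  called_at = as.POSIXct(
    c("2024-01-01 10:00:00", "2024-01-01 11:30:00", "2024-01-02 09:15:00"),
    tz = "UTC"
  ),
  arrived_at = as.POSIXct(
    c("2024-01-01 10:08:00", "2024-01-01 11:42:00", "2024-01-02 09:20:00"),
    tz = "UTC"
  ),
  x = c(1, 5, 2),
  y = c(1, 5, 3)
)

# Minutes between the call and arrival on scene, one value per incident.
response_time_minutes <- function(incidents) {
  as.numeric(difftime(incidents$arrived_at, incidents$called_at, units = "mins"))
}

# Incidents whose location falls inside a rectangular area.
incidents_within <- function(incidents, xmin, xmax, ymin, ymax) {
  inside <- incidents$x >= xmin & incidents$x <= xmax &
    incidents$y >= ymin & incidents$y <= ymax
  incidents[inside, , drop = FALSE]
}

# The analysis now reads in the language of the domain.
nearby <- incidents_within(incidents, xmin = 0, xmax = 3, ymin = 0, ymax = 4)
mean_response <- mean(response_time_minutes(nearby))
```

Someone who knows the domain but not R can still follow the last two lines from the function names alone.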

Miles McBain: 23:20

So that's kind of the core power of creating a little vocabulary for yourself out of functions: functions can now represent the domain knowledge really clearly. And so for people just starting out, I don't know if there's an easy way to demonstrate that. But I saw Hadley Wickham give a pretty good workshop on stringr where he was like, this is what it looks like if you're just throwing regexes at this thing, you know, whereas if we break this down into a series of verbs that you could call as a kind of workflow on your string, it's much more clear than just regex, you know.

David Keyes: 23:55

Yeah.

Miles McBain: 23:57

So I think, I don't know, that was a pretty good example because regex is about as hard as it gets to parse. Right? Maybe something like that might be good.

David Keyes: 24:04

Yeah. So we've been talking about making a package and the benefit that that offers compared to copying and pasting or making a template. But in your article, you actually talk about potential downsides of making a single package, which in many ways follows the same path as the other two approaches, which is it metastasizes. It explodes. It tries to do everything.

David Keyes: 24:28

So can you talk a little bit about the downside, or what that looks like when a package kind of blows up, and what your solution to that is?

Miles McBain: 24:38

Yeah. I mean, what it looks like when a package blows up, and this has happened in my last two jobs, I've seen this exact same thing happen, so I'm not sure if it necessarily happens everywhere or if it's just places I'm involved with. But the namespace becomes so big that people kind of carve out their own little niche inside the namespace. It's almost like there are sub-packages within the package. Right? And people are like, well, this is the area that I know about.

Miles McBain: 25:02

And so I'm not really going to step too much outside of that. Like these are the functions that I use and the ones that I know about. And if I can't find what I'm looking for in here, I'll just assume that it doesn't exist. Yeah. And then I'll be like, okay.

Miles McBain: 25:13

I better add this to the package. And so you get this situation happening where you end up with, like, a bunch of similar functions in the package that do similar things, and even have, like, similar kinds of, like, code in them. And you're like, oh, okay. This is not ideal. So basically, there's a kind of, like, just seeing the edges of this massive thing are hard.

Miles McBain: 25:38

And so understanding what it can and can't do is sometimes challenging. So people just make assumptions because it seems too hard to figure it out. And then you get to other things like, well, if we're doing the right thing and we're writing tests for this thing, then the test suite is gonna be enormous. And so then it's gonna take a while to run. And so then people might go, you know what?

Miles McBain: 25:58

I'll just, like, skip running that right now because it takes a while to run. And then test failures accumulate. And then you wanna submit one little tiny change to a function, and you do the right thing and you run the test suite. And there's like a dozen test failures, and a lot of them are stuff that you've got no idea about. So then that gums up the works, slows everything, you know.

Miles McBain: 26:19

So basically, I think once things get to a certain size, they almost encourage the accumulation of technical debt. And the thing that encouraged the accumulation of that technical debt was the fact that the test suite was too large, so it wasn't getting run regularly. Or the namespace was so large that no one was taking the time to read it and understand it properly.

David Keyes: 26:42

Yeah. And so your solution is to make what you call a verse of packages. So, you know, similar to the

Miles McBain: 26:48

[inaudible] Yeah.

David Keyes: 26:50

Right. So could you maybe give an example of a verse of packages that you've seen that work well together? I guess I'm wondering specifically, like, how have you seen the different types of packages broken down, and what does each of them do typically?

Miles McBain: 27:07

Oh, yeah. So I think in that article, I linked to a blog post by Emily Riederer, and she had a really nice way of thinking about it, which is: just imagine each package is like a specialist team member that you don't have. What happens in a data science team, at least in my experience, is you have to wear a few hats because you don't necessarily have the cloud compute specialist or the database specialist, you know, you don't have someone who can figure that all out and package it up for you. So you end up having to do that. And what you wanna do is just cram: how do I provision cloud services?

Miles McBain: 27:42

How do I store stuff in the cloud? How do I, you know, do all that stuff? And then you wanna, like, put that into a package and then you wanna forget about it because it's not your, like, your core work. Right? But Right.

Miles McBain: 27:50

You do need that functionality. So I think that's a really good way to think about it. So I think about my last job and, you know, we had a data package. It was called, like, QFERS data, and that was the package I was talking about before where you can just have at your fingertips, like, give me the incidents, give me the station locations. You have functions that give you all that stuff.
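As a rough sketch of what "give me the incidents, give me the station locations" could look like as functions in a data package: every name and value below (`get_incidents()`, `get_station_locations()`, the columns, the in-memory tables) is hypothetical and invented for illustration; nothing here reflects the actual package described. A real version would presumably query a database or API rather than a hard-coded data frame.

```r
# Hypothetical stand-in for a real data source (database, API, files).
.incident_store <- data.frame(
  incident_id = 1:4,
  station = c("A", "A", "B", "C"),
  date = as.Date(c("2023-01-02", "2023-02-10", "2023-02-11", "2023-03-05"))
)

# "Give me the incidents", optionally restricted to a date range.
get_incidents <- function(from = NULL, to = NULL) {
  out <- .incident_store
  if (!is.null(from)) out <- out[out$date >= as.Date(from), ]
  if (!is.null(to)) out <- out[out$date <= as.Date(to), ]
  out
}

# "Give me the station locations" (coordinates are made up).
get_station_locations <- function() {
  data.frame(
    station = c("A", "B", "C"),
    lon = c(153.02, 153.10, 152.95),
    lat = c(-27.47, -27.50, -27.40)
  )
}

# Analysis code then reads in terms of the domain, not the storage details.
recent <- get_incidents(from = "2023-02-01")
merge(recent, get_station_locations(), by = "station")
```

The design point is that callers never see where the data lives, so the storage can change without touching any analysis code.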

Miles McBain: 28:10

We had a visualization package that had commonly used visualization layers. So these are things for interactive maps and for ggplots, like little geoms and layers that help us build up our maps really quickly. We did a lot of mapping in that job. There might be more domain-specific stuff. So we were creating Shiny applications and we needed to have the ability to cache things on AWS rather than locally.

Miles McBain: 28:37

So we have, like, a little cache module that we wrote, a little domain-specific thing that makes our particular use case easier. So it's stuff like that, I guess. Data and vis and all that stuff are pretty obvious domains, but then I guess they could also be broken down further, always, depending on how big that package gets and how much of that thing you're doing. So like I said, we combined static ggplot vis with web-ish vis in one package. And part of the reason we did that is because, whether you were creating an interactive map or a static map, we wanted them to look identical.

Miles McBain: 29:10

So we needed some level of parity between the features. But you can see another team going, like, actually, interactive vis is its own thing and static vis is its own thing, and they're two separate packages.

David Keyes: 29:20

Yeah. That makes sense. What criteria do you use to decide when to take one large package and break it up into multiple packages?

Miles McBain: 29:32

Oh, yeah. I mean, I think it's just vibe. It's a feeling, but it's also when I start to see the technical debt being created, because that's a signal that things have gotten too big and too complicated. So if I start to see cases where the test suite hasn't been run when someone's contributed a feature, or there is duplication or, I wanna say, non-orthogonality. But I don't know.

Miles McBain: 29:55

People in computer science talk about orthogonality for design, but they often use it in the wrong way. And people that come from a statistics background actually will understand it better where, if you have an orthogonal basis, you don't have this kind of correlation between features and functions. And so you can see a sort of non-orthogonality or correlation creeping into the package where there are things that do almost the same thing but not quite. And, you know, there is maybe one concept in the middle that could be pulled out and now both these things would be separated. So, yeah, it's looking and seeing the signs of the bad stuff happening.
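One way to picture "one concept in the middle that could be pulled out" is the sketch below. The data and all function names are hypothetical, written in base R purely to illustrate the pattern; they are not taken from any package discussed in this episode.

```r
# Hypothetical incident data, invented for illustration.
incidents <- data.frame(
  station = c("A", "A", "B", "B", "B"),
  response_mins = c(5, NA, 8, -1, 12)
)

# Before: two near-duplicates, each carrying its own copy of the cleaning rule.
mean_response_v1 <- function(df) {
  ok <- df[!is.na(df$response_mins) & df$response_mins >= 0, ]
  tapply(ok$response_mins, ok$station, mean)
}
max_response_v1 <- function(df) {
  # The copied rule has drifted: ">" here versus ">=" above.
  ok <- df[!is.na(df$response_mins) & df$response_mins > 0, ]
  tapply(ok$response_mins, ok$station, max)
}

# After: the shared concept ("what counts as a valid response") is named once.
valid_responses <- function(df) {
  df[!is.na(df$response_mins) & df$response_mins >= 0, ]
}

# The summary step no longer knows anything about cleaning.
summarise_response <- function(df, summary_fn) {
  ok <- valid_responses(df)
  tapply(ok$response_mins, ok$station, summary_fn)
}

summarise_response(incidents, mean)
summarise_response(incidents, max)
```

After the split, changing the validity rule touches one function and changing the summary touches another, which is roughly the independence that "orthogonal" is pointing at.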

David Keyes: 30:31

Yeah. I mean, it seems like it's a very similar process between deciding when to go from that kind of template to a package, and then when to take a single package and break it into a collection of

Miles McBain: 30:43

of packages. That's exactly right. At least for me, anyway, I think it is a very similar process of looking at code and deciding to put it in a function and deciding how many functions that should be. And then looking at a set of functions and deciding whether they should be a package and how many packages they should be. Yeah.

Miles McBain: 30:56

It's about trying to, like, make the complexity of dealing with that manageable.

David Keyes: 31:00

Yeah. That makes sense. So kind of wrapping up, in the article you're a bit downbeat. You say it's sort of inevitable that people or organizations will go through the stages you outlined. I'm curious why you think it's inevitable.

Miles McBain: 31:14

Well, I'm not exactly downbeat. But I am saying there is an inevitability to it, I think. In some cases, there's always an inevitability to things and it's good to acknowledge that. Like, I'll get back to your question in a second. I promise.

Miles McBain: 31:28

But, like, when I first started using Git, I used a GUI tool. And people always say, no, I use Git on the command line. I use Git on the command line. But I think actually there's kind of an inevitable reaction to the complexity of Git, that in the beginning it feels good to hide away that complexity. And it's actually just better to acknowledge that inevitability rather than trying to beat people over the head with the command line and say they should use that from the start.

Miles McBain: 31:51

And that's sort of the perspective I'm taking here. And as we've discussed, even initially, the complexity in understanding what a function is seems hard and overwhelming. So people are kind of on a path. And as they progress down that path, they're gonna have different solutions available to them to manage this problem of reusing data analysis. And what complicates it is teams need to do this together.

Miles McBain: 32:11

So you might have some people on the team who are like, let's go. Functions, packages. You might have other people on the team that are like, oh, we don't need that. We don't need that. Template's fine.

Miles McBain: 32:20

And so the team is kind of on a journey together of everyone getting on the same level of understanding about what's the best way to do something. So those two things combine, to me, to mean that there's always gonna be a bit of a journey that people and teams need to go on together to arrive at this place, and you'd be very lucky to get dropped into a team of experienced, seasoned data scientists who could all be like, yeah. Perfect. Let's just make packages and let's go. Mhmm.

Miles McBain: 32:44

So I'm not trying to be downbeat about that or fatalistic about that. I'm more like, I think it's good to acknowledge and not try and, like, set the bar impossibly high

David Keyes: 32:53

Right.

Miles McBain: 32:54

for people, and not feel bad that we have to go on this journey.

David Keyes: 32:58

Yeah. I mean, do you think there's value in going on the journey because people then see the downsides and appreciate the good sides of the approach that you lay out at the end of your article?

Miles McBain: 33:07

I think so. And by reading the article and having conversations like this and just getting that content out there, I feel like you can at least short circuit the journey a little bit. People might read and be like, what's this guy talking about? You know, templates are fine. But then they'll hit that point and be like, oh, wait.

Miles McBain: 33:21

This is what they were talking about. I'm in that complexity hell, and I made the template too complicated. This is exactly it. And then they might reach that realization a little bit faster, I guess. That's all I'm hoping for.

David Keyes: 33:33

Yeah. That's great. Well, this was really enlightening. And for me, the types of folks that I tend to work with are very much on the first one or two steps. So hopefully, this will give some people some food for thought in terms of what might come after that.

David Keyes: 33:50

So, thanks again, Miles, for taking the time to chat. I really appreciate it.

Miles McBain: 33:54

Yeah. Thanks very much.

David Keyes: 33:56

That's it for today's episode. I hope you learned something new about how you can use R. Do you know anyone else who might be interested in this episode? Please share it with them.

David Keyes: 34:06

If you're interested in learning R, check out R for the Rest of Us. We've got courses to help you no matter whether you're just starting out with R or you've got years of experience. Do you work for an organization that needs help communicating effectively with data? Check out our consulting services at rfortherestofus.com/consulting. We work with clients to make high quality data visualization, beautiful reports made entirely with R, interactive maps, and much, much more.

David Keyes: 34:34

And before we go, one last request. Do you know anyone who's using R in a unique and creative way? We're always looking for new guests for the R for the Rest of Us podcast. If you know someone who would be a good guest, please email me at david@rfortherestofus.com. Thanks for listening, and we'll see you next time.
