R for the Rest of Us

In this episode, I chat with Nick Tierney, a statistician, data scientist, and creator of the {naniar} R package for exploring missing data. Nick reflects on his journey from a psychology undergrad to a PhD in statistics, and how open-source tools—and a deeply curious mindset—shaped his path.

We discuss Nick’s early struggles with R, the importance of community, and his evolution into package development and consulting. He also walks through the why and how of testing R packages, making the case for better, more reliable code.

Resources mentioned:

  • {naniar} package

  • Books: (1) The R Book by Michael Crawley, and (2) R Packages by Hadley Wickham and Jenny Bryan

Connect with Nick on Bluesky and GitHub

What is R for the Rest of Us?

You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting, to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.

David Keyes:

Hi, I'm David Keyes, and I run R for the Rest of Us. You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways. I'm delighted to be joined today by Nick Tierney.

David Keyes:

Nick is a statistician, data scientist, and research software engineer with a PhD in statistics. He recently transitioned to private consulting, offering services in data analytics, modeling, code review, R package development, teaching, and mentoring. Previously, Nick worked at the Kids Research Institute Australia and as a lecturer in business analytics at Monash University. Nick is a passionate advocate for open source software, having developed numerous R packages that improve data analysis workflows. An outdoor enthusiast, Nick hiked the entire Pacific Crest Trail in 2023, completing a five-month journey from Mexico to Canada.

David Keyes:

Nick, thanks for joining. I'm really excited to speak with you today.

Nick Tierney:

Thank you so much. It's really good to be here. And yeah, it's funny to hear your own bio, that I wrote, read aloud, but yeah, thank you.

David Keyes:

Yeah, no problem.

Nick Tierney:

I've been looking forward to this, yeah.

David Keyes:

Great. So you have a blog that I have read over the years, it's called Credibly Curious. I'm curious, maybe first of all, if you could talk about what's the significance behind that name?

Nick Tierney:

Yeah. I've always been someone who's had a lot of questions. Like, you know, I think it's the classic thing, kids always say why. I said why a lot, and, you know, at my parents' parent-teacher interviews, the teachers would be like, he just asks a lot of questions. And they'd be like, yeah, at home we actually encourage that.

Nick Tierney:

They're like, he just asks a lot. And so I feel like I have often been someone who's like, oh, I enjoy being curious. Then there's a quote from Albert Einstein, I think, that's like, I have no special talents, I'm only passionately curious. And that really resonated with me, and so that's something that I feel like I've identified with a lot. Yeah, I started a blog in 2012, 2013.

Nick Tierney:

Yeah, 2013, when I first started my PhD, I wanted to sort of document things I learnt along the way. And I'd been sort of inspired by other blogs like Rob Hyndman's Hyndsight blog. Hyndsight is such a good name for Rob Hyndman, because it's about hindsight, looking back at things, and also his name's Hyndman and he does forecasting work. It's just like, that's a good name. Well done. And I wanted something that was about curiosity. So I thought about Passionately Curious or something to tie back to the Einstein quote, but Credibly Curious felt like, oh, I'm doing a PhD, and at that time I had an undergrad in psychology. Well, I'd just finished my honours year, so I was like, I have some credibility.

Nick Tierney:

And so, yeah, it just felt like and then had a little Yeah.

David Keyes:

Yeah, that makes sense. So why don't you talk a little bit now maybe about your background? I know you did your undergrad in psychology and then you did your PhD in statistics. And I guess I'm curious in particular, like how you were exposed to R and how you came to use it as regularly as you do now.

Nick Tierney:

Yeah, I was really inspired in my final year of psych by a course called public health psychology. It was the first time I'd seen the stats that we'd learnt about, you know, Treatment A and Treatment B, applied at a population level, and it had just never occurred to me somehow that you could do that. Instead of an n of 20 or 30, you have like an n of a million or several million. And the impact of that was really exciting and I found it quite inspiring. So I wanted to pursue health and stats stuff, and I just kept asking around for people who knew of jobs in that space.

Nick Tierney:

And I ended up landing a research assistant job at Queensland University of Technology with Kerrie Mengersen. Basically, I remember I sat in her office and she just started piling up these books. She was like, so this will be good, this will be good. And one of them was The R Book by Crawley, I think. It's like, it's huge.

Nick Tierney:

You'd hurt yourself if you dropped it. And I remember spending like a day trying to read in a CSV or something, and getting a professor of stats to help me read a CSV file was like asking to use a jet plane to use the overhead light. Yeah, yeah. I was just like, sorry, this is a... But yeah, I struggled with it.

Nick Tierney:

It was so hard. It was just something with the file path. I can't remember what the problem was. It was probably like my working directory was set somewhere else, or something that's hard to debug. Yeah.

Nick Tierney:

The group that I was in was the Bayesian Research and Applications group, and there were probably about 15 PhD students and probably another five or 10 postdocs who all pretty much exclusively used R. So from the start, I was in this really supportive little group of people who I could go to and be like, I don't understand this thing. Someone would usually have time to sit down and very kindly explain it to me. So I got really lucky, in that I feel like for a lot of people when they learn R, it's a bit of a solo journey, whereas everyone in my group was using it all the time. So that helped me get through that initial steep learning curve, I guess.

David Keyes:

Yeah.

Nick Tierney:

Yeah, and then we did a data viz section in our lab group meetings and someone was like, there's this thing called ggplot. And everyone was like, oh man, where we'd have to teach people how to use for loops to make separate subplots, this is going to be awesome. And then I was like, who's this Hadley Wickham guy? And I sort of started reading all of his stuff, his packages and his papers, and I basically just got really into it. I could sort of see his philosophy in that, and then I had the opportunity to meet him at a stats conference in Australia that he presented at, and seeing him talk about dplyr was like, wow, that's huge. So I think the initial guidance of the group that I was in was incredibly important, and also getting early exposure to ideas from people like Hadley Wickham and also Yihui Xie about R Markdown and knitr.

Nick Tierney:

Things just felt like right and I had so much frustration with coding because it just felt like things seemed harder than they should be.

David Keyes:

Yeah.

Nick Tierney:

And these ideas from Hadley and Yihui seemed to resonate with me, in that they felt, quote, like, right. Yeah. It felt like an easier path.

David Keyes:

I mean, that was my experience when I remember learning R. And I did the classic solo thing, just learning on my own. And I didn't kind of realize initially there was a base R approach versus a tidyverse approach. I mean, you don't know that, right, when you're starting out, if you're learning on your own, you're just like, I want to learn R. But it was only when I came across the tidyverse that I was like, oh, this actually clicks now. I get the design principles behind it.

David Keyes:

That's made a huge difference for me as well.

Nick Tierney:

Yeah. Absolutely.

David Keyes:

Why don't we fast forward a little bit and talk about the work you did. First of all, what is the Kids Research Institute Australia and what kind of work did you do and how did you use R there?

Nick Tierney:

Yes, the Kids Research Institute Australia, they have been through a couple of name changes over the past couple of decades. People in Australia might have heard of them as the Telethon Kids Research Institute. They are a research group that primarily focuses on research outcomes for children. Some other similar ones in Australia are like the Murdoch Children's Research Institute, or QIMR, which is the Queensland... I think the Berghofer Medical Research Institute. It's effectively a research institute attached to a hospital, and they do work on cancer and treatments and alcohol and mental health and this kind of thing and drug abuse, and also vaccine development and that kind of thing.

Nick Tierney:

It was a really amazing environment because we're actually in the hospital and I'll be at my computer and then on my left was a big fishbowl of lab. There'd be people in white coats, like pipetting and looking at microscopes and doing stuff.

David Keyes:

Wow.

Nick Tierney:

And it was really cool to have that connection, because you could look over and you're like, oh, I'm coding and doing stats-y things. Then over there, there's some people doing science and writing in notebooks and doing the science there. The group that I was in was called the Infectious Disease Ecology and Modelling Group. So that's a group that's led by Nick Golding. And it's work around infectious diseases like COVID and influenza, but also things around malaria and other vector-borne diseases, so things that are transmitted by things like mosquitoes.

Nick Tierney:

Yeah, that's a bit about Kids Research Institute.

David Keyes:

Yeah. The way I think I initially came across your work was with the naniar package, which is one of a couple of packages that you've written that deal with missing data. It seems like missing data is something that you particularly care about. I'm curious where that... well, first of all, is that true? And second of all, if that is true, why?

David Keyes:

Why do you care so much about missing data?

Nick Tierney:

Yeah. I give a bit of credit to my last year in psych for this, actually. In our fourth-year stats unit, the lecturer had one or two lectures on missing data. And I just remember it blowing my mind. They said, there's missing data, and I'm like, what do you mean? You conduct your experiment and you have the data.

Nick Tierney:

And they're like, yeah, but sometimes you just don't have the data. And I was like, what? That's crazy. What do you do? Like, of course there's missing data, but it had just blown my mind, because all the examples that we'd been given were all, like, perfect.

Nick Tierney:

Of course. You've conducted your experiment and you've also had an analysis plan, and then you basically crank the handle of an ANOVA and out comes a p-value and you make some inference on that, which is how they teach a lot of the modelling in psych. It's a lot about trying to think carefully about designing your experiment alongside your statistical model, so that you can make a more informed decision and don't end up p-hacking and stuff like this. So that idea was planted there about missing data existing.

Nick Tierney:

And then the first dataset I got on my PhD had something like 7,000 rows and I was like, wow, I'm working with big data and I totally wasn't. And then I thought it was pretty I thought that was a lot of rows. And then then about 60% of the data was missing. And we had like probably 50 or 60 columns. So, it was a reasonably large dataset.

Nick Tierney:

And I just remember being so... I was like, wow, this is wrong, you know? I have all this missing data. And it just felt like there weren't really good guidelines or ways to get at this. Like, how do you even start looking at it?

Nick Tierney:

How do you break it down? And I guess a lot of my work on missing data then was just kind of trying to unpick this problem and just trying to make it easier. And I often say sometimes that I kind of hate missing data, because it's so annoying and all of this development was kind of frustrated. It was like frustration driven. And I guess there's a passion sort of underlying that.

Nick Tierney:

Like, I want to make this easier to solve so I can think about this more clearly and in terms of the things that matter. Because missing data, I think I've over the years come to this idea that basically one of the things that makes missing data hard is that it's adding an entire new dimension to the data that's unknown, and that is what makes it hard to grapple with. So how do you reason about that? But yeah, it was the fact that there was like 60% of my data that was missing. I'd misheard something someone said about decision trees.

Nick Tierney:

I'd just learned about classification and regression trees, and I thought they were awesome. And so I was like, are they using that to impute missing data? And I had misheard the word impute, and I just thought that they had said something about using it to explore missing data. So I applied a decision tree to predict which variables were important for predicting the number or the proportion of missing values in a given row. And that ended up being really, really useful for exploring the data, because we found out really quickly that there was actually one key variable, which was what type of data this was.

Nick Tierney:

And basically, there'd just been like a bad merge of the data sets of human level stuff and machine level stuff. So humans don't have like a dust recording and a dust recording machine doesn't have a pulse. And so, when you join all this data, you get huge amounts of missing data.

David Keyes:

Sure.

Nick Tierney:

And so, quickly, we could explore that. But it's like, those are some of the things that led to me kind of exploring missing data and like, just wanting it to be easier and wanting to understand why 60% is missing. It turns out there was a key variable that explained most of it.
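The exploration Nick describes can be sketched in a few lines of R. This is a minimal illustration and not his actual analysis: the data are simulated, the column names (`source`, `site`, `pulse`, `temp`, `dust`) are invented, and `rpart` is simply one readily available tree implementation. The idea is to compute the proportion of missing values in each row, then fit a regression tree to see which variable best explains it.

```r
library(rpart)

# Simulated "bad merge": human-level rows have no dust reading,
# machine-level rows have no pulse or temperature. All names are illustrative.
human <- data.frame(
  source = "human",
  site   = rep(c("a", "b"), 10),
  pulse  = seq(60, 79),
  temp   = 36 + (0:19) / 10,
  dust   = NA_real_
)
machine <- data.frame(
  source = "machine",
  site   = rep(c("a", "b"), 10),
  pulse  = NA_real_,
  temp   = NA_real_,
  dust   = (1:20) / 10
)
combined <- rbind(human, machine)
combined$source <- factor(combined$source)
combined$site   <- factor(combined$site)

# Proportion of measured values missing in each row.
measured <- c("pulse", "temp", "dust")
combined$prop_miss <- rowMeans(is.na(combined[, measured]))

# Regression tree: which variables predict how much of a row is missing?
fit <- rpart(prop_miss ~ source + site, data = combined)
print(fit)

# Average missingness by the variable the tree splits on.
tapply(combined$prop_miss, combined$source, mean)
```

In this toy data the tree splits on `source`, which mirrors the "key variable" in the story: the missingness is structural, a consequence of stacking two kinds of records.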

David Keyes:

Yeah. And that's kind of... I mean, full transparency, I've never actually used naniar, although I've suggested it to other people. But what it does, and tell me if I'm wrong, is it allows you to kind of easily see... I know there are ways to make charts pretty easily, some graphs that show the percentage of missing observations, for example. Is that accurate?

Nick Tierney:

Yeah. So, the work I did in my PhD was then picked up in my postdoc with Di Cook at Monash University. And so, in the work we did there, there's a lot of convenience helpers to explore how many missings you have in a variable, how many missings you have in a row, how many common sets of missings you have. So like, what are the times when variable A and B go missing together, or A, B and C, or just A. And then you can kind of arrange those so they're easy to view.
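The three summaries mentioned here (missings per variable, missings per row, and which variables go missing together) can be sketched in base R. This is a toy illustration with made-up data, not naniar's implementation; naniar provides helpers such as `miss_var_summary()`, `miss_case_summary()`, and `gg_miss_upset()` for the same questions.

```r
# Toy data; the column names a, b, c are illustrative.
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 6, 7),
  c = c(8, 9, 10, 11)
)

miss <- is.na(df)

# How many missings in each variable, and in each row.
n_miss_var <- colSums(miss)
n_miss_row <- rowSums(miss)

# Missingness patterns: which sets of variables go missing together.
pattern <- apply(miss, 1, function(r) paste(names(df)[r], collapse = "+"))
pattern[pattern == ""] <- "none"

# Arrange the patterns from most to least common.
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```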

Nick Tierney:

But yes, there's a lot of helpful data viz. There's also this idea of what we call the shadow matrix, which is binding a copy of the data column-wise. So you create an entire copy of the data and then you name it. If you have A, B and C, you name it A_NA and B_NA and so on. This is useful and computationally expensive, but it stores an initial copy of the data.

Nick Tierney:

So then if you do any imputation and other things, you have this other way to compare if something is missing or not, or was missing, and so you know if it was imputed. Because one of the issues with imputing data is you don't know then if it was missing, because it's now complete. And that was based off some earlier data viz work, I think done in the late 80s, by Deborah Swayne and Andreas Buja. So that was one of the really great things about working with Di. There was a lot of really good background knowledge on the problem of missing data and trying to visualise it and what things people had done.

Nick Tierney:

So that was some of the theory. Underlying it were these other things that, once you use them, you could then apply to creating useful data visualisations and also understanding imputations and other things. But there are a lot of convenience helpers in naniar, which I think are really useful.
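As a rough illustration of the shadow matrix idea, here is a base R sketch on toy data. It assumes logical shadow columns for simplicity and uses naive mean imputation purely as a stand-in; naniar's own `bind_shadow()` builds the equivalent structure with its own encoding.

```r
# Toy data; column names are illustrative.
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Shadow matrix: one column per variable, TRUE where the value was missing.
shadow <- as.data.frame(is.na(df))
names(shadow) <- paste0(names(df), "_NA")

# Bind column-wise so every row carries its own missingness record.
with_shadow <- cbind(df, shadow)

# Naive mean imputation, purely for illustration.
imputed <- with_shadow
imputed$a[is.na(imputed$a)] <- mean(df$a, na.rm = TRUE)
imputed$b[is.na(imputed$b)] <- mean(df$b, na.rm = TRUE)

# The data are now complete, but the shadow columns still record
# which values were filled in.
imputed[imputed$a_NA, ]
```

The cost is a data frame with twice as many columns, which is the "useful and computationally expensive" trade-off mentioned above.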

David Keyes:

Yeah. So, I mean, you know, obviously one of the things that you do a lot is develop packages. So the reason, I mean, I invited you to chat just about many things, but one of the things I wanna spend some time on is about writing tests in packages. Because I know it's something that's really important, but it's also something I have actually never done. I have created R packages, like we work with clients, we create R packages, but some of my consultants who I work with have written tests, but I've never done it myself.

David Keyes:

So I'm curious, first of all, if you can tell me like, what is a test and what's your spiel for like, why it's valuable to write tests when making a package?

Nick Tierney:

Yeah. So, I guess I wanna start this by saying that a lot of my thoughts on this are basically things I've learned from people like Hadley Wickham and his book R Packages. And so, I feel like a lot of what I'm saying feels similar to what he said. So, I guess firstly thanks to Hadley Wickham and also Jenny Bryan for the R Packages book, but also, yeah, these ideas I think are common in a few places. But a test is a way to ensure that your code has a stated output.

Nick Tierney:

So that might be, you might have an expectation that the function that you wrote will always return a data frame. Or you might have an expectation that it's going to return variables with certain names, or that it returns at least one row. And then also, it might be other cases, like I assume that if I give it bad input, it's going to error. Or I assume if I give it this weird input, it's going to tell me something specific about how to deal with what I've given it. So it's about ensuring that your code is reliable and that it works the way that you expect it to work.
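The expectations listed here map fairly directly onto testthat, the testing package covered in the R Packages book. The sketch below is illustrative only: `count_missing()` is a hypothetical function invented for this example, not part of naniar or any other package.

```r
library(testthat)

# Hypothetical helper, invented for illustration only.
count_missing <- function(data) {
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame.", call. = FALSE)
  }
  data.frame(
    variable = names(data),
    n_miss   = unname(colSums(is.na(data)))
  )
}

test_that("count_missing() returns a data frame with the expected shape", {
  out <- count_missing(airquality)
  expect_s3_class(out, "data.frame")           # always a data frame
  expect_named(out, c("variable", "n_miss"))   # variables with certain names
  expect_gte(nrow(out), 1)                     # at least one row
})

test_that("count_missing() errors informatively on bad input", {
  expect_error(count_missing("not a data frame"), "must be a data frame")
})
```

In a package these tests would normally live in a file under `tests/testthat/` rather than next to the function definition.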

David Keyes:

Can you give me an example of like something that would, you know, say a function in a package that would go wrong that having a test would be able to help you identify that issue in advance? Does that make sense?

Nick Tierney:

Yep. Something that would go wrong... so, the reason for tests existing in this formal way of stating all these things... because one of the things I found weird about it was that, I wrote this function, I know it returns a data frame. So why do I need a test that it returns a data frame?

David Keyes:

Mhmm.

Nick Tierney:

It seems kind of like I'm just it's like almost a tautology.

David Keyes:

Right.

Nick Tierney:

The reason is that if you make changes to your code, you want to be sure that it's still behaving the way it was. And so I guess one example of this might be, I wanted to check the grouping. So if you use a group_by in dplyr, I wanted to make sure that the group_by characteristic was still held on to when I did stuff with naniar. And so I wrote tests to make sure it still inherits from, I think that's grouped_df, it gains an extra class when you do a group_by statement. And there were some changes to dplyr that then meant that that was dropping, which meant that my code was breaking. And that was something that was able to get picked up.

Nick Tierney:

So it's kind of like insurance, I guess, is one way to look at it.
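A test of the kind described, checking that grouping survives a function call, might look roughly like this. It is a sketch, not naniar's actual test suite: `add_n_miss_sketch()` is a hypothetical stand-in written for this example, and the check relies on dplyr's `group_by()` adding the `grouped_df` class.

```r
library(dplyr)
library(testthat)

# Hypothetical stand-in for a package function; not naniar's implementation.
add_n_miss_sketch <- function(data) {
  data[["n_miss"]] <- rowSums(is.na(data))
  data
}

test_that("grouping is preserved after adding the missingness count", {
  grouped <- group_by(airquality, Month)
  out <- add_n_miss_sketch(grouped)

  # group_by() adds the grouped_df class; the function should not drop it.
  expect_s3_class(out, "grouped_df")
  expect_equal(group_vars(out), "Month")
})
```

If a future change, in this function or in a dependency, caused the grouping to be dropped, this test would fail the next time the suite ran, which is the insurance being described.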

David Keyes:

So it's almost like, in some ways, it's less about making sure that your code does what you want at the moment you write it, although maybe there's some of that, but also ensuring that in the future, when you write other functions that maybe interact with that function, or functions in other packages change, your code will still work the way you intended it to. Do you think that's a good way of putting it?

Nick Tierney:

Yes. And I also think, yeah, it's kind of like future proofing, but also you kind of do tests when you write your code anyway. Like when you write some code, you kind of play around with the console and be like, okay, like what does this do? What does this return? And trying to remember that for every function that you do, if every time you made a change and you wanted to make sure the function still did the same thing, then that would be like, it's just a lot of extra stuff to carry around in your head.

Nick Tierney:

So, it's kind of like trying to take those ideas and formalise them So then you don't have to think about it again. So there's also this idea of test driven development, which is you write all of your tests before you write the function. So then all of your tests fail. And then you know the function is complete once all the tests pass. So, this is like you expect a certain number of columns or you expect a data frame or you expect a certain class of input or output.

Nick Tierney:

And once you state all of those expectations, it then means that you're really clear on exactly what you want the function to do and exactly how you want it to behave. That makes sense. Which is a bit more of an extreme framework. I don't think it's as popularised in R, but in other programming languages, for a lot of people, it's how they do everything. Another thing I want to say about tests, actually, is basically that it forces you to really examine your code.

Nick Tierney:

It almost makes you have you heard of the idea of rubber ducking? Yeah. Yeah. So it kind of forces you to do that again, is kind of like you have to really sit through and explain it again and kind of almost like this Feynman technique of explaining to someone younger than you or someone who doesn't know

David Keyes:

as much.

Nick Tierney:

Explain the concepts again, which forces you to kind of revisit things in a different way.

David Keyes:

Yeah, yeah. Well, let's actually have you help me to learn a bit about writing tests. So let's have you, yeah, put your screen up and we'll talk about this internal package that we use for various things and talk about the best way to add some tests to it.

Nick Tierney:

Cool.

David Keyes:

Hey, David here. Just wanted to let you know that at this point in the conversation, we switched to a screencast. Now, obviously, code doesn't work very well in an audio podcast. So if you wanna see the rest of this conversation, check out the video version of this podcast on YouTube. You can find a link to that in the show notes.

David Keyes:

Well, let me just ask you a couple kind of final questions. So this is a great introduction to how tests work. I'm curious, in your doing consulting work now, you're out on your own, this the type of thing you're doing in your consulting work? Are you spending time writing tests? What else are you doing as part of your consulting work?

Nick Tierney:

Yeah, so related to that is package review. So I recently did some work for some folks at NOAA through Openscapes, through Julie Lowndes' group. So I did a package review for them where I recorded myself going through the process of changing code, with sort of live commentary on the changes I would make and why I'd make those changes. And that's nice, I find, because it's not just here's the updates to your code, it's sharing the process as well. In terms of package development and that sort of thing, establishing tests is also helping out future you.

Nick Tierney:

There's a quote from someone, I can't find it, but it's like, you're always collaborating with your future self. So if you can set up tests, then you're helping yourself in the future if something comes up and there's some bug. Yeah. The other work I do in my consulting is modelling development and also some mentoring. So I have a mentee at the moment.

Nick Tierney:

We meet once every so often and we just discuss where they're at with their goals and that sort of thing. It's almost like coaching, and providing a sense of, what do you know and what do you want to know, and maybe there's some ways I can help you get there or explain some concepts.

David Keyes:

I mean, it sounds like it's recreating what you had when you were learning that that

Nick Tierney:

Totally. Yeah, it's absolutely the thing that I was... yeah, just like, it's a thing that I really care about and I want to make work for people. It's not something I'll do a lot of, but it's something I care about and I want to make that approachable for people. But yeah, those are some of the things I've done.

Nick Tierney:

I recently wrote a targets pipeline for someone. They weren't able to share their data with me, but they could share the key structure of the data. And then I was able to simulate some data that had the same structure, do a bunch of tidymodels analysis, explain the variables, and do partial dependence plots and other things to see what variables are important for predicting your outcome. And then they were able to just slot in their own data at the top, and the pipeline ran, and then they had a report at the end. So those are some of the things that I've been working on so far.

David Keyes:

That's awesome. So if people want to learn more about you, the type of work you do, what are the best ways to get in touch with you?

Nick Tierney:

Yeah, the best way to get in touch is through my email or my blog. You can find my email on my blog, in my bio or on my consulting page. So that's njtierney.com. And my email is my name, nicholas.tierney@gmail.com. So that's the best way to get in touch.

David Keyes:

Great. Well, Nick, thank you so much for chatting with me about R in general, showing me about tests. It's been really informative, so I appreciate it.

Nick Tierney:

Great. Thanks, David.

David Keyes:

That's it for today's episode. I hope you learned something new about how you can use R. Do you know anyone else who might be interested in this episode? Please share it with them. If you're interested in learning R, check out R for the Rest of Us.

David Keyes:

We've got courses to help you no matter whether you're just starting out with R or you've got years of experience. Do you work for an organization that needs help communicating effectively with data? Check out our consulting services at rfortherestofus.com/consulting. We work with clients to make high quality data visualization, beautiful reports made entirely with R, interactive maps, and much, much more. And before we go, one last request.

David Keyes:

Do you know anyone who's using R in a unique and creative way? We're always looking for new guests for the R for the Rest of Us podcast. If you know someone who would be a good guest, please email me at david@rfortherestofus.com. Thanks for listening, and we'll see you next time.