R for the Rest of Us

In this episode, I chat with Crystal Lewis about data management and her recently published book titled ‘Data Management in Large-Scale Education Research’. Crystal, a freelance research data management consultant, shares insights on good planning and systematic implementation of practices that are key to effective data management. She discusses the importance of automated data validation, and outlines a structured approach to data cleaning. Additionally, Crystal reflects on her experience writing an open-source book with Bookdown and navigating the publishing process. 

Important resources mentioned:
Learn more about Crystal Lewis by visiting her website and connect with her on X (@Cghlewis), LinkedIn, GitHub, and Fosstodon.

Subscribe to our newsletter: https://rfortherestofus.com/newsletter

What is R for the Rest of Us?

You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting, to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.

David Keyes:

Hi. I'm David Keyes, and I run R for the Rest of Us. You may think of R as a tool for complex statistical analysis, but it's much more than that. From data visualization to efficient reporting to improving your workflow, R can do it all. On this podcast, I talk with people about how they use R in unique and creative ways.

David Keyes:

I am delighted to be joined today by Crystal Lewis. Crystal is a freelance research data management consultant who helps other people to understand how to organize, document, and share their data. She's been wrangling data in the field of education for over 10 years and is also a co-organizer for R-Ladies St. Louis. Crystal, thanks for joining me today.

Crystal Lewis:

Thanks for having me, David. Good to see you.

David Keyes:

Yeah. Good to see you too. And I should also say you have done work for R for the Rest of Us, which has been great, because we love having someone who enjoys data management and data cleaning as much as you do. Absolutely. So maybe you could start out by just telling us, like, what does a freelance research data management consultant do?

David Keyes:

What does what does that mean?

Crystal Lewis:

That's a great question. And it's a great question because I don't know if there's a lot of other people that have that specific title, at least no one that I know. But, essentially, the work I do falls into 3 kinds of buckets. So one, I work with mostly university faculty who have these sort of large-scale, federally funded grants in the world of education, and they're looking for help. So one is I do help with data wrangling.

Crystal Lewis:

So maybe they've collected data over a couple of years, and they haven't really done much with it. And so it's been sitting there, and it's very messy, and they need to analyze or share it. And so they need someone to kinda come in and quickly help them organize, and clean up and document that data. So that's one bucket of what I do. And I also do consulting work.

Crystal Lewis:

So that's more, hey. I got a new grant. I need to think. I need to plan for how I'm gonna manage data throughout the life cycle of this project. I need someone to help me think through that process.

Crystal Lewis:

Can we set up some meetings and do consultations around that? And then my 3rd bucket is training. So maybe you have a lab, and you've got 15 people in that lab, and you're all just kind of doing things in different ways, and you need someone to come in and do a training around how can we kind of standardize our practices. And so that's the other bucket that I do. And so I've been doing this for 2 years, and I've gotten to meet lots of different, cool researchers at different universities across the US.

Crystal Lewis:

And, I've really enjoyed the work. So

David Keyes:

Cool. And when you talk to people who maybe are less familiar with your work, how do you kind of succinctly describe what data management is? Because, you know, as you just said, like, there can be kind of different things that you do. So what's the kind of definition that you use to describe data management?

Crystal Lewis:

Yeah. Sure. So I would say that data management is any process associated with, the collection, the organization, the storage, and the preservation of data. So that's a lot. Right?

Crystal Lewis:

So I think a lot of times when I talk to people, their first reaction to the word data management is something that happens after collection. Data cleaning. Right? I think that's what most people I talk to, they're like, oh, data management. So you mean how you clean data?

Crystal Lewis:

And I'm like, well, actually, data management begins long before you ever collect data in the planning process. So you'll want to, if possible. Because sometimes you're given the dataset. Right? And you don't have that luxury of planning.

Crystal Lewis:

But if you're collecting your own original data, you wanna start that planning process long before your project ever starts. And so you plan things like, you know, what do I want the dataset to actually look like in the end? What kind of quality assurance, quality control procedures can I implement during data collection to get the kind of data that I want?

Crystal Lewis:

For instance, one of my favorite tools to create during a planning process is the data dictionary. I talk a lot about data dictionaries. Anyone who's heard me talk about data management has heard me talk about data dictionaries. And that tool is an excellent planning tool because you can lay out exactly what you want your final dataset to look like.

Crystal Lewis:

What are the variables going to be? How am I going to name them? What variable types are they going to be? You know, numeric, character, date, what are the allowable values of those? And then you can use that throughout your project to kind of plan for where you're going.

Crystal Lewis:

You can build your tools to align with what you expect, you know, and you can use that in your data cleaning and your data validation process as well. And so I try to get people to think more about, data management as a process, not just something that just happens at the end of a project.
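A data dictionary like the one Crystal describes can be sketched in R as a small table that cleaning and validation code reads from later; the variable names, types, and allowed values below are hypothetical stand-ins:

```r
library(tibble)

# A minimal, hypothetical data dictionary: one row per planned variable.
data_dictionary <- tribble(
  ~variable,    ~type,     ~description,                ~allowed_values,
  "stu_id",     "numeric", "Study-assigned student ID", "1000-9999",
  "grade",      "numeric", "Student grade level",       "1-5",
  "svy_date",   "date",    "Survey completion date",    "YYYY-MM-DD",
  "read_score", "numeric", "Reading assessment score",  "0-100"
)

# Later, cleaning code can compare a dataset against it, for example
# flagging planned variables that never made it into the data:
# setdiff(data_dictionary$variable, names(survey_data))
```

Keeping the dictionary in code like this is one way to make it usable at every stage Crystal mentions: tool building, cleaning, and validation.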

David Keyes:

Yeah. That makes a ton of sense. I mean, I'm, you know, thinking about projects that we have worked on, or possibly you have worked on, where people are always like, well, can we clean this data in R? And the answer is always yes. Of course, you can.

David Keyes:

But, like, in many cases, the more thought you can put into things up front, the less kind of coding, you know, reshaping, whatever, you have to do on the back end. So it seems like what you're saying is a lot of the data management work that you do is both on the back end in terms of that data cleaning, but also on the front end, helping people to avoid getting into the situation where they have to have, you know, a thousand lines of R code to do that data cleaning. Is that accurate?

Crystal Lewis:

It's very accurate because not only does it create less work, but it's possible that, depending on how you collected the data, if you didn't fully plan how to collect your data, you might have data that you can't even use. Right? Like, maybe you can't even interpret what somebody inputted into your tool. Maybe it was, like, an open-ended text value and you're like, I have no idea what this means. So you might even lose data, not just have more work.

Crystal Lewis:

So it's definitely worth putting that work in.

David Keyes:

Yeah. I'm curious, when you first start talking to a new client, are there any kind of main points that you highlight? I know you talked about data dictionaries. Are there other kind of main points that you highlight with them when thinking about, like, how to manage their data?

Crystal Lewis:

Yeah. So, I mean, a lot of what I talk about is this planning process. It's, like, putting structures in place so that your whole project is more organized. So we talk about the data dictionary. Some of it is just, like, basic file organization.

Crystal Lewis:

Right? Like, making sure that you develop a project structure that people know how to use and to find things, and file naming maybe, so that you're using the correct version of files. And some of it's more, I'd almost even call it project management, that ultimately affects data management. So things like assigning roles and responsibilities so that, like, everybody on your team knows exactly what they're doing, and pieces of data management don't get dropped because someone's like, oh, I thought someone over here was doing it. But, no, actually, it's my responsibility.

Crystal Lewis:

And so putting all those systems into place early on leads to better data. Right? So I try to talk to people about all these different processes that they can implement that might help them get better data ultimately.

David Keyes:

Yeah. It's interesting how much of, you know, getting good data ultimately is really about, like, the kind of non-data pieces that happen around it.

Crystal Lewis:

So much. I feel like 50% of what I talk about is just project management stuff. But if you think about it, like, every person that works with the dataset impacts that data. So it's not always, like, the data person. So, like, a project coordinator is the person organizing the collection of data. And then there's data collectors out in the field, and the way they collect data impacts the data.

Crystal Lewis:

And so all these people have a hand in the data and everyone impacts the quality of it. So

David Keyes:

Yeah. That makes sense. So you recently published a book. It's called Data Management in Large Scale Education Research. Now, obviously, you know, what you're talking about can go beyond education research.

David Keyes:

But I'm curious what makes large scale education research so in need of good data management to the point that you would write a whole book about it?

Crystal Lewis:

Yeah. That's fair. Yeah. And there's a lot of existing materials out there around research data management, but a lot of them are, you know, discipline agnostic, or they're, like, STEM focused. And what makes education research so unique is it's typically human participant data, so you've got this whole issue of privacy and confidentiality.

Crystal Lewis:

And then a lot of these education projects are what I would call large scale. So they're typically these federally funded, multiyear, longitudinal projects, and you're collecting data on many different types of participants. So you've got, like, parents and teachers and students and schools. And so you have all these levels of information coming in. And then not only is it a lot of data, but you're often collecting data in a lot of different ways.

Crystal Lewis:

So we've got people collecting data on paper in the field, electronic surveys, people assessing classrooms and collecting data on tablets. They're getting data from school districts. It's all these different types of things you have to think about. And not only think about, but track and make sure that you're not losing things, and it's all quality data. And so it's just a lot more to consider than, I would say, in, like, maybe normal, science-y lab types of projects, if that makes sense.

David Keyes:

Yeah. Can you give an example of a project? You know, what does the project do? What does the data look like? That type of thing. Just to help us kind of find the context.

Crystal Lewis:

Most of the projects I work with are essentially kind of what I just mentioned. They'll be funded by, like, a federal funder, like IES, which is the Institute of Education Sciences, or NIH, or NSF, or NIJ. They're big federal funders. And typically, not all the projects, but a lot of the projects are evaluating some sort of program in schools, the efficacy of a program. And so they are not only implementing some sort of program in schools.

Crystal Lewis:

That's like a whole different arm of the project, implementing the program. But the other arm is evaluating. And so to evaluate, they typically collect, yeah, things like I mentioned, like surveys, observations. So people actually go in the classroom sometimes and will collect observation data on teachers or students or classrooms as a whole, and they will collect school district data.

Crystal Lewis:

So we'll actually ask the district for individual-level data on students, and assessments. So, like, different types of math or reading assessments will be collected. And all that data is collected in various ways. Right? And so there's so much to think about, because sometimes you're having to hand enter data from paper.

Crystal Lewis:

Because if you're working with little students, a lot of times your data is gonna be on paper. Kids can't go into Qualtrics and do these things. Right? So, so you've got the paper data. You gotta think about the quality of that.

Crystal Lewis:

You've got surveys going out through REDCap or Qualtrics. You've got the district data coming in, which has its own, you know, array of messiness. And so there's just a lot to think about in these projects.

David Keyes:

So I'm imagining a project would be something like, you know, a new technique for teaching kids reading or something like that.

Crystal Lewis:

Exactly.

David Keyes:

That's funded. And then, you know, like you said, there's the implementation side where the teachers or whoever else is involved in actually doing it. But then there's the evaluation side, which is looking at, like, does this new method of teaching reading actually lead to better outcomes, however those are defined. Is that kind of a

Crystal Lewis:

That's exactly it.

David Keyes:

A typical

Crystal Lewis:

And a lot of the projects I work on, not all of them, obviously, but a lot of them that I work on are randomized controlled trials. So a portion of, you know, the participants are getting the intervention, and a portion are not, or they're getting a different intervention. And then there's this whole kind of pre- and post-assessment of the intervention throughout the study. And a lot of times, there's cohorts. So another whole layer of complexity.

Crystal Lewis:

Right? So we've got longitudinal, and we've got cohorts of new people coming in every year. It's so complex. And that's why data management is so important for these projects.

David Keyes:

So I'm curious, when you talk about data management in this context, what does this look like? Is it typically a set of Excel files? I mean, I know it varies, obviously, project to project, but I'm curious kind of what it concretely looks like.

Crystal Lewis:

So, yes, that varies greatly. So most of the tools that I've interacted with are, REDCap, if you've ever used that. It's more common, I would say, in health fields, but education researchers use it too. It's a fairly secure tool to collect data. REDCap and Qualtrics are 2 big tools that people use, and then paper data.

Crystal Lewis:

And so the paper data, people enter that in a variety of systems. So some people come back to the office. They enter that in REDCap. They might enter it into Excel. They might enter it into a database like Access or FileMaker or something like that.

Crystal Lewis:

And so and then eventually what happens is everything gets exported. Right? And that varies. It could be a CSV. It could be an Excel file.

Crystal Lewis:

It could be an SPSS file. And then that all is saved into their file system, and then the data wrangling from there happens. And so, eventually, what your ultimate goal is is to have linking variables across all those. So, you know, primary and foreign keys, so you can kind of link all that data as needed across students, across teachers, across schools, and so forth.
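The primary and foreign key linking Crystal mentions can be sketched with dplyr joins; the tables, IDs, and variables below are hypothetical:

```r
library(dplyr)

# Hypothetical tables exported from different tools, linked by study IDs.
students <- tibble(
  stu_id     = c(101, 102, 103),  # primary key of `students`
  tch_id     = c(1, 1, 2),        # foreign key pointing at `teachers`
  read_score = c(72, 85, 90)
)
teachers <- tibble(
  tch_id        = c(1, 2),        # primary key of `teachers`
  tch_years_exp = c(4, 11)
)

# Link student-level data to teacher-level data on the shared key.
student_level <- students |>
  left_join(teachers, by = "tch_id")
```

The same pattern extends upward, joining teachers to schools, and so forth, as long as each level carries a consistent key.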

David Keyes:

Got it. Yeah. I mean, I'm just thinking, this is not something I planned to ask you, but, like, I think about in the context of projects that we work on where, you know, it is a series of Excel files for the most part.

David Keyes:

And one of the challenges there, especially, you know, you just talked about, like, needing to link multiple sources of data. It can be complex because, I mean, just as a simple example, like, we're working on a project where we have trainings that are done, and there's an identifier for the cohort for the training. Yeah. And sometimes when it gets read in, it gets read in as numeric, and sometimes it gets read in as...

Crystal Lewis:

Oh, yeah.

David Keyes:

Character.

Crystal Lewis:

Character.
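The numeric-versus-character join problem David describes can be sketched like this, with hypothetical table and column names, and one common fix: coercing both keys to the same type before joining.

```r
library(dplyr)

# Hypothetical: the same cohort ID read in as numeric in one file
# and as character in another.
trainings  <- tibble(cohort = c(1, 2),     topic = c("intro", "advanced"))
attendance <- tibble(cohort = c("1", "2"), n_attended = c(14, 9))

# Joining on mismatched key types errors in dplyr, so coerce both
# keys to one type (here, character) before the join.
trainings |>
  mutate(cohort = as.character(cohort)) |>
  left_join(attendance, by = "cohort")
```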

David Keyes:

And so then sometimes when we try and join, it's complicated. I mean, obviously, again, you can do it because we work in R. You can always do it, but it's complicated. I guess in my mind, I'm always like, oh, if I could just get people to work in, like, a database, then it would be better. Can you talk to me?

David Keyes:

Like, because I'm sure you've seen cases where people do use a database, you know, because I think, like, oh, well, then you can say this variable has to be of this data type. What are the pros and cons of working in that way versus, you know, kind of a collection of Excel files?

Crystal Lewis:

Yeah. So if you use a tool like REDCap, you could essentially build this kind of database in the back end where things are linked all within REDCap. Although, I say that, but actually REDCap has its own limitations. But you could build everything into a system where you can collect everything in one system and then link it within the system. But like I mentioned, with all these different types of data that you're collecting, it's quite tricky to use one tool for everything, because one tool just doesn't end up meeting the needs of every type of data that you need to collect.

Crystal Lewis:

And with that said, I tell people it's actually okay to use different tools. Even if it's not a database, if you use Qualtrics, for instance, you can build that validation into Qualtrics. I would not recommend entering data into Excel, because we know the limitations of Excel. But with Qualtrics, right, you can say, I only want this variable to be numeric. I only want this variable to be in the range of 1 to 100, or something like that.

Crystal Lewis:

So most tools besides Excel, which you can't, other than with more complicated kinds of formulas, you can build that validation in there. So I tell people, as long as you're building that data validation in and you're being consistent across your tools and building things consistently, it's okay to have different tools and then export out to one file format where you're eventually gonna link those together.

David Keyes:

That makes sense. And I saw you recently posted, here's a quote from what you posted: tools aren't the key to better data management; good planning and consistent implementation of practices are. Can you talk more about what you mean there?

Crystal Lewis:

Yeah. I I just I hear this whole thing about tools a lot. People like, we just need better tools. We just need better tools for data collection. We just need better tools that allow us to do exactly what we want to do.

Crystal Lewis:

To me, we have tools, and they may not all be perfect. But we can just do the pieces that I'm talking about, where we plan, we build in data validation. Like, when I talk to people about data management, I'm talking to people who use SPSS, SAS, Stata. Everybody uses a variety of things. And I'm not like, well, you know what? If you don't use R, you're out of luck.

Crystal Lewis:

Like, I love R, of course. But I want people to know that, like, whatever tool you use, you can still do good data management. It's just you need to do things in a more systematic, planful way.

David Keyes:

Yeah. That makes sense. So let's actually talk, you just brought up data validation. And, you know, I think you were talking about it in the context of, I don't know, say you are using Excel

Crystal Lewis:

Yeah.

David Keyes:

Where, you know, you set, like, the values in this column are numeric and can only have a range from 1 to 100 or whatever. But also, I think in your book, you talk about doing data validation kind of, like, on the back end, like, after you import your data into R. What does that look like?

Crystal Lewis:

Yeah. So when I use data validation in the context of data management, typically I am talking about the data cleaning process. But I will also sometimes mention how you can build it into your workflow as well. And so when I talk about data validation in the context of the data cleaning process, what I'm trying to suggest to people is that they need to do what I've heard other people call one last final data sanity check. Right?

Crystal Lewis:

So you've done your data cleaning, and you've even been doing checks, really, you've been doing checks throughout your data cleaning process to make sure everything's going okay. But before you are, like, just putting a seal of approval on that dataset for use, you need to do this one final data validation check. And by one, I mean, it's multiple checks. Right? And I think that the one package that actually really changed or transformed the way I think about data validation is the pointblank package.

Crystal Lewis:

I don't know if you use that much in your work. And there's a lot of data validation packages out there, obviously, and they're all probably great. But for some reason, pointblank is the one that just really impacted me. So before finding pointblank, for data validation or data checks, I would typically just run summary statistics, you know, and see if different variables fall outside of a range or, you know, if their type wasn't as expected.

Crystal Lewis:

But it was all done with my eyes. Right? So I'm just, like, looking at it and scanning, like

David Keyes:

Right.

Crystal Lewis:

The thing about humans is you miss things. Right? And so until I really started building checks, like validation checks, into my data using the pointblank package, but, again, you can use any package, that's when it became more systematic. And so then I started building these actual checks into all my cleaning syntax files, and I would build that around, remember the data dictionary? I would build it around the data dictionary.

Crystal Lewis:

Right? So every variable in my data dictionary has these checks that I need to confirm, and I would build all of my validation around, you know, are these variables meeting their specific type? Are they within the right ranges? You know, are there duplicates in the variables? There shouldn't be duplicates and things like that.

Crystal Lewis:

And so those would be my final data validation checks before I, you know, allow people to use it for analysis or something like that.
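The kind of final validation pass Crystal describes can be sketched with pointblank, with checks mirroring a data dictionary; the dataset, column names, and ranges below are hypothetical:

```r
library(pointblank)
library(tibble)

# Hypothetical cleaned dataset to validate before sharing.
svy <- tibble(
  stu_id     = c(1001, 1002, 1003),
  grade      = c(2, 3, 2),
  read_score = c(88, 74, 95)
)

# Build an agent with one step per data dictionary constraint,
# then interrogate() runs all the checks against the data.
agent <- create_agent(tbl = svy) |>
  col_is_numeric(columns = vars(stu_id, grade, read_score)) |>        # expected types
  col_vals_between(columns = vars(read_score), left = 0, right = 100) |>  # allowed range
  col_vals_between(columns = vars(grade), left = 1, right = 5) |>
  rows_distinct(columns = vars(stu_id)) |>                            # no duplicate IDs
  interrogate()

agent  # printing the agent shows a pass/fail report for each check
```

Because each step maps to one row of the data dictionary, the validation report doubles as a record that the dataset meets its documented constraints.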

David Keyes:

That makes sense. And I'm thinking about a project that I've worked on. It's called Oregon by the Numbers. I've done it for, I don't know, like, 5 or 6 years. I import data every year that gets sent to me by the client.

David Keyes:

And, I mean, just as an example of a simple check that Thomas, one of the other consultants who I work with, helped me with, we use the assertr package.

Crystal Lewis:

Yeah.

David Keyes:

And I know there's lots of other packages.

Crystal Lewis:

Yeah. I think they're very similar.

David Keyes:

Yeah. Exactly. Functionally, I think it's really similar. I mean, just as an example, we need to make sure that when the data comes in, all the data has one column that's, like, the location, and it's almost always by county.

David Keyes:

And so we run a check that's, like, does this column only have, you know, the names of these 36 counties? Yeah. If it doesn't, something's wrong. You know? Usually, what that means is, like, it was misspelled in the original data that was given to us.

David Keyes:

And that's the kind of thing that, like, I got worried about after I did this for several years, because I was like, oh, man. Like, what if someone misspells this and then, like, something, you know, is off in all the visualizations? Because I'm producing, like, hundreds of visualizations for them. And so having that validation, like, I still worry, but I worry a little bit less because I feel like there's at least that level of check. And I imagine it's a similar situation in terms of the data that you work with, just knowing that

Crystal Lewis:

Yes.

David Keyes:

There is that validation in place.
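The county check David describes might look something like this with assertr; the county names and data here are shortened, hypothetical stand-ins:

```r
library(assertr)
library(dplyr)

# Hypothetical stand-ins: in practice this would list all 36 counties.
expected_counties <- c("Baker", "Benton", "Clackamas")

county_data <- tibble(
  county = c("Baker", "Benton", "Clakamas"),  # note the misspelling
  value  = c(10, 20, 30)
)

# assert() lets valid data pass through the pipeline untouched, and
# errors with a report of the offending rows if any value falls
# outside the expected set.
county_data |>
  assert(in_set(expected_counties), county)
```

Here the misspelled "Clakamas" would halt the pipeline with a report pointing at the failing row, instead of silently flowing into downstream visualizations.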

Crystal Lewis:

Yeah. And I do recommend, like, I talked about that being the last thing that you do in the data cleaning process. But I do recommend that if you're collecting longitudinal data, it's possible that you're reusing that syntax every wave that you collect that same data. And so I do recommend also adding, just like you said, adding that to the top of the script, so that when you're reading in the data, like, you expect it to have the same format every wave. But what if something changed?

Crystal Lewis:

A variable was added or something like that, you know, changed? It's good to check that before you start cleaning your data and it's not doing what you think it's doing. Right? So
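That kind of top-of-script guard can be as simple as a few assertions run before any cleaning; the file name and expected columns here are hypothetical:

```r
# Hypothetical guard at the top of a reused cleaning script:
# confirm the new wave's file still has the structure the rest
# of the script assumes before any cleaning runs.
raw <- readr::read_csv("w3_teacher_survey.csv", show_col_types = FALSE)

expected_names <- c("tch_id", "svy_date", "q1", "q2", "q3")

stopifnot(
  identical(names(raw), expected_names),  # no added, dropped, or reordered variables
  is.numeric(raw$tch_id)                  # key still read in as numeric
)
```

If a wave's export adds or renames a variable, the script stops here rather than cleaning data it misunderstands.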

David Keyes:

Yeah. That makes sense. So I asked you about data validation, which is, in some ways, like, jumping to the end of the data cleaning process. Although, like you said, you can also do it, like, right when you're reading your data. But, you know, as you said before, data management is more than data cleaning, but data cleaning is an important part.

David Keyes:

So when you think about data cleaning, what are the kind of most common steps that you think about? Because I think data cleaning can mean very different things to different people. Yeah. So from your perspective, what are those most common steps?

Crystal Lewis:

Yeah. So in the book, I talk about sort of 3 phases of a dataset. So the first phase is your raw data, and this is your untouched data that comes directly from a source. So maybe you're downloading it from somewhere or you're extracting it in another way. Then the second phase is your clean data set.

Crystal Lewis:

And so even with some of these best practices in data management implemented throughout your project, often you still need to do some additional processing to get your data in a format that you feel comfortable using or sharing. And so this clean data set is minimally altered, for any future purpose. So we're not gonna get deep into what variables you need for a specific analysis or a certain structure you need for a specific visualization. That happens later. The second phase is just kind of getting it into a general clean format.

Crystal Lewis:

And we'll talk about that more in a minute. And then the 3rd phase is when you create your datasets for a specific purpose. So you are creating a dataset for a specific analysis, and maybe you're creating a bunch of variables specific to that analysis, or you need a dataset for a specific visualization, and so you're restructuring the data for that specific visualization. Those are all created from your general clean data set. And so what I talk about in the book, and what I do in my own work, is clean data for general purposes, for sharing with future users.

Crystal Lewis:

And so in many ways, creating that general clean data set is a very personalized process. So the way I might clean a data set for an electronically collected teacher survey might look very different from the way I would clean a dataset that I received from a school district that has student records in it. With that said, all datasets, when they are cleaned for general purposes, still need to ultimately meet a set of data quality criteria. So in the book, I review 7 data quality criteria that all data sets should meet. And so let me walk you through those real quick.

Crystal Lewis:

So the first criterion that we're working towards is that the data set should be analyzable. And what I mean by that is that it should make a rectangle. It should be machine readable. So that means that the first row of your data should be your variable names, and the remaining cells should be made up of values. And also that you have a unique identifier that identifies unique cases in your data as well.

Crystal Lewis:

And that also your data meets a series of organizational rules that we would expect. The second indicator is that your data should be complete. So if you've collected the data, it should exist in your data set. You shouldn't be missing anybody. No one should have gotten dropped along the way.

Crystal Lewis:

And, also, you shouldn't have duplicate information, if that shouldn't exist in your data. And same with variables, not just cases, but also variables. So making sure that you didn't accidentally drop variables when you were downloading your raw data or something like that. The 3rd indicator is interpretable. And what I mean by that is that your variable names should be both human and machine readable.

Crystal Lewis:

So the variable names should make sense to people who are using your data, and they should also not include things like special characters or other things that machines have a difficult time interpreting. The 4th indicator is valid. And by that, I mean that your variables should conform to the constraints that you laid out in your data dictionary. So if your variables are supposed to be numeric, that they are actually numeric in your dataset. If your variables are supposed to fall within a certain range, that they actually fall within that range in the dataset, and things like that.

Crystal Lewis:

The next indicator is accurate. Now accuracy is a tough thing to judge. You don't always know whether something someone reported is actually accurate or not. But you can check things within a dataset or across datasets to see if there's accuracy or not. So consider something like, if a student is in 2nd grade in a dataset, then they should be associated with a 2nd grade teacher.

Crystal Lewis:

And so you can check things like that, for consistency across variables. The next indicator is consistent. By consistency, I mean 2 things. So one, values should be consistent within a variable. So maybe all dates should be in the same format.

Crystal Lewis:

And I also mean consistency across datasets. So if you're collecting the same form across several waves of data collection, then you should be collecting the variables in the same way. And then the last indicator is de-identified. So if you promised confidentiality to your participants, then their direct and indirect identifiers should be protected in the dataset. There should be no names, no emails, no social security numbers, no things that directly identify participants.

Crystal Lewis:

Those should be removed and replaced with a unique random ID that you've assigned to them for your project. And indirect identifiers should also be considered: what combinations of variables might be able to identify someone, especially when you're publicly sharing data. So to that end, using that data quality indicator list, I go through a series of cleaning steps to create a dataset that meets those criteria. And in the book, I give a checklist of 19 steps that you wanna consider in a data cleaning process. And, again, not every dataset will need to go through all 19 steps.
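One common way to do that random ID assignment in R can be sketched like this, with hypothetical names and values:

```r
library(dplyr)

# Hypothetical roster with direct identifiers.
roster <- tibble(
  name  = c("Ada Lee", "Ben Ortiz"),
  email = c("ada@example.org", "ben@example.org"),
  score = c(88, 92)
)

# Assign each participant a random study ID once, keep the
# name-to-ID crosswalk stored securely and separately, and
# share only the de-identified data.
set.seed(123)
crosswalk <- roster |>
  distinct(name) |>
  mutate(stu_id = sample(1000:9999, n()))

deidentified <- roster |>
  left_join(crosswalk, by = "name") |>
  select(stu_id, score)  # names and emails dropped
```

The crosswalk is what lets you link new waves of data to the same IDs without the identifiers ever appearing in the shared files.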

Crystal Lewis:

You know, we might have a fairly clean dataset that only needs a couple of those steps to be wrangled into an acceptable format. But others might need all 19 steps. It really just depends. And so just to give you a glimpse of what that checklist looks like, you know, step number 1 is always to access your data, whether you're importing it into R or something like that. Step number 2 is always to review your data, kinda like what we talked about with the data validation question.

Crystal Lewis:

You wanna make sure you know what you're working with and what kind of work needs to be done. And then from there, I list out a series of other steps that don't necessarily have to go in a specific order. It's more just kind of checking through to make sure you've considered each of these types of steps. So we've got, like, renaming variables. We've got recoding variables, restructuring data, de-identifying data, merging.

Crystal Lewis:

All those kinds of things are considered in the checklist. And so which steps you actually use is going to depend on your specific dataset. But, ultimately, we will end up with datasets that all meet the data quality criteria that I just spoke about. And that's what's really cool about using the data quality indicator list as your guide while you're cleaning.
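
The first two checklist steps (access, then review) could be sketched like this in R; a temp file stands in for a hypothetical raw export so the sketch is self-contained:

```r
# Write a tiny made-up raw file so the example can actually run
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, grade = c(2, 2, 3)), tmp, row.names = FALSE)

raw <- read.csv(tmp)   # step 1: access/import the data

str(raw)               # step 2: review structure before cleaning
summary(raw$grade)     # and check values against the data dictionary
```

Which of the remaining steps (renaming, recoding, restructuring, de-identifying, merging) apply would then follow from what this review turns up.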

David Keyes:

For those of us who work in R and work with the tidyverse, I think it's really easy to kind of conflate data cleaning and data tidying. But it sounds like that's not what you're describing. To me, it doesn't sound like tidying at all. That sounds very much like just the cleaning end.

Crystal Lewis:

Yeah. Because when you think of tidying, you think of Hadley's, you know, exact format and how a file should look. Right? But in education research in particular, data doesn't always fit into that tidy format. You know?

Crystal Lewis:

Sometimes people need their data to be in this really wide format where, like, variables repeat over waves of data. And so I'm not necessarily trying to tidy data in a way that would make it easiest for an analysis or easiest for a graph. I'm trying to clean it in the way that it needs to be for education purposes, for anybody that uses the cleaned-up data, and then they can restructure it as needed or something like that.

David Keyes:

That makes sense. So you talked before and you said, you know, that tools are not the key to better data management, and I assume also to better data cleaning. But Yeah. I know that you like R in particular. So I'm curious, you know, what is it about R that you think makes it uniquely good for data cleaning?

Crystal Lewis:

Yeah. Well, obviously, because it's open source. I'm a freelance worker now, so buying proprietary software is not necessarily something I wanna add to the budget. So that's one. But also just sharing data with clients, it's really helpful to have an open-source program so I don't have to be like, oh, you don't have Stata?

Crystal Lewis:

Well, then you can't open my dataset, or whatever it is. And so it's nice to be able to share that with anyone and know they will be able to access it. But also, I love the tidyverse. So I started learning R in 2011, and I learned base R. And I was like, no.

Crystal Lewis:

It did not stick with me. And so it wasn't until many years later, when the tidyverse started becoming more popular and I started learning that, that I was like, this is intuitive. This makes sense to me. And not only that, but I was able to access all the functions I needed to clean the type of data I work with. I really didn't have to start building, like, new functions to meet the needs of what I would probably need to do.

Crystal Lewis:

And then I guess the third thing is, if I need to do something more complex, which just happened with some really complicated data, then I have the ability to build more complex functions if I need to, which I don't know if I would have in some more proprietary tools. Like, I don't know if SPSS allows you to, you know, write your own complex functions or not. So all those, and the community, of course. I've learned so much from everybody in the R community. I'd say, like, I don't know, 40% of what I learned is kind of just, like, self-learning and, you know, going through different reading materials. And then, like, 60% is just learning from, like, what other people share, which is amazing.

David Keyes:

Right. Yeah. But it's interesting because I thought one of the things you would say would be because it's a code-based language that, you know

Crystal Lewis:

And it is.

David Keyes:

In terms of data

Crystal Lewis:

It's just... I forgot that.

David Keyes:

Oh, okay. Sorry.

Crystal Lewis:

Yeah. No. That's just me not thinking about it. So, yes, a 100%. I guess why it slipped my mind is I've used SPSS.

Crystal Lewis:

I've used Stata and I've used SAS, and I coded in all of them. And so

David Keyes:

That's true.

Crystal Lewis:

It wasn't necessarily that R would give me that uniqueness, because I did write syntax in all of those programs. But, yes, that's huge, obviously. Like, no matter what program you use, you want your processes to be reproducible, and so, yes, being able to do that is huge.

David Keyes:

Yeah. And I guess that I mean, that comparison makes sense if you're comparing it to something like Excel, like a point and click tool

Crystal Lewis:

Yes. Excel. Yes. Yes. Don't play with Excel.

Crystal Lewis:

And then, yeah. I always tell people, like, some people are not ready to make the leap from Excel to a tool like R. And what I tell people is, if you're using Excel, you need to be so thorough in documenting exactly what your process is in order for people to be able to reproduce what you've done. And even with that, you'll never be able to fully meet that reproducibility standard. Alright.

Crystal Lewis:

So

David Keyes:

Yeah. That makes sense. So let's talk briefly about your book. You wrote your book in a way that it has, you know, a free online version. It's open source.

David Keyes:

So can you talk about how you wrote the book and and put it online?

Crystal Lewis:

Yeah. So I wrote the book in Bookdown, and I think I started with that because I wasn't a 100% sure where I was going as far as, well, was I gonna integrate any code at all, or anything like that? And so Bookdown was the tool I knew.

Crystal Lewis:

Most people who had written open-source books, that's what they used. And so I was familiar with it. I also knew a lot of people who I could reach out to for help if I needed to. A lot of people have GitHub repos with their Bookdown code, and so it was something I felt comfortable trying to dig into. In hindsight, I probably didn't have to do that because I ended up not really putting any code in the book.

Crystal Lewis:

But I don't regret it because it did create a really beautiful product to put on the Internet, and so I think it worked out in the end. And it's got, you know, cool features like searchability and things like that. So I'm glad I did it. It was a tricky process, and I'm very thankful for people like you and several other people who who I was able to reach out to for assistance as needed. But

David Keyes:

That's interesting to hear too, that you, you know, you wrote a book with Bookdown that doesn't have code in it. And

Crystal Lewis:

Which is not common.

David Keyes:

Yeah. But it speaks to how, you know, powerful a tool like Bookdown is, or, you know, same thing if you're writing a book with Quarto. Like, you don't need to have code. I mean, I have, like, an internal R for the Rest of Us handbook that's written as a Quarto book. It has zero code.

Crystal Lewis:

Yeah.

David Keyes:

But it's pretty straightforward to make it work. And I think it sounds like that was the kind of benefit for you

Crystal Lewis:

as well. And it kept... like, I had this whole workflow going for all my data cleaning, where, you know, you're pushing to GitHub and you're versioning things, and it allowed me to have that same kind of workflow in my writing as well, which is nice.

David Keyes:

Yeah. That makes sense. How did you work with your publisher to go from, you know, the Bookdown version to something that they would be able to review? Yeah.

Crystal Lewis:

Good question. So if I was more savvy with Bookdown (and don't get me wrong, Bookdown was great for exactly what I needed), I might have had an easier time. So, CRC provides you a style file that kinda matches up to their style.

Crystal Lewis:

And so, typically, what you would do is you would integrate that style file into your Bookdown project and then you would print to a PDF. I've had several people try to help me with this, and I was never able to make it work. And so I ended up rendering to a Word document and having to kind of split out each chapter as its own unique document. So it took a little bit more time, but it worked great for me and my purposes. So it's not a big deal.

Crystal Lewis:

But, yeah. So if you're a little more savvy than I am, you can do it pretty easily: integrate that style file in there and just export it right out.

David Keyes:

That sounds amazing. I was thinking about my experience, because I had to, unfortunately, export to Word documents Uh-huh. And then do the editing process with my publisher. And then Integrate. Because I really wanted it to live online, I had to take the finalized versions and put them back into what started as a Bookdown project and then became a Quarto book project.

David Keyes:

And it was it was a lot of work to do that.

Crystal Lewis:

I think that you and I had a similar experience there. And, at the end of the day, I was happy to do it because I felt more comfortable about it than the PDF process that I couldn't get to work very well. But I'm pretty sure, for the people who got the PDF process to work, it was much quicker and smoother.

David Keyes:

Yeah. Yeah. Any other kind of benefits or drawbacks that you think about when it comes to writing a book in that way?

Crystal Lewis:

Honestly, I don't even know what other tools I would wanna use to publish an open-access book online. I'm just not even that familiar, because everyone I know uses Bookdown and Quarto. And, also, everyone I know that does open-access books is doing some sort of code-based book. And so I don't know a lot of people who aren't doing code, like, something like I did, and publishing open access. And that's probably why I don't know a lot of other tools that I would use.

David Keyes:

That makes sense. Well, if people wanna read your book and or find out more about you and the work that you do, where would be good places for them to go?

Crystal Lewis:

Yeah. So it's online. It's at... well, maybe we can link to it. So it's basically a shortened version of "data management in education research" dot com. And then I also have a website, if you wanna learn more about my work.

Crystal Lewis:

I also have a blog on there where I share posts about various data management topics, and then I share slides from talks I've given and things like that. So you can kind of read various parts of my work on either of those websites.

David Keyes:

Great. And we'll link both of those down below in the show notes so people can check

Crystal Lewis:

those out. Yeah.

David Keyes:

Great. Well, Crystal, thank you very much for taking the time to chat with me today. I really appreciate it.

Crystal Lewis:

It was super fun chatting with you, David.

David Keyes:

That's it for today's episode. I hope you learned something new about how you can use r. Do you know anyone else who might be interested in this episode? Please share it with them. If you're interested in learning r, check out R for the Rest of Us.

David Keyes:

We've got courses to help you no matter whether you're just starting out with R or you've got years of experience. Do you work for an organization that needs help communicating effectively with data? Check out our consulting services at rfortherestofus.com/consulting. We work with clients to make high-quality data visualization, beautiful reports made entirely with R, interactive maps, and much, much more. And before we go, one last request.

David Keyes:

Do you know anyone who's using R in a unique and creative way? We're always looking for new guests for the R for the Rest of Us podcast. If you know someone who would be a good guest, please email me at david@rfortherestofus.com. Thanks for listening, and we'll see you next time.