Data Dialogues

Data science is critical to companies that want to turn their data into actionable insights. But what can they do to strengthen their data science program and attract the right talent? Jennifer Priestley, Professor of Statistics and Data Science at Kennesaw State University, shares practical ways that companies can collaborate with universities to solve their business problems.

Show Notes

This episode focuses on how companies can collaborate with universities to solve business problems and strengthen their own data science programs. Jeff Dugger, Principal Data Scientist at Equifax & University Research Director, interviews Jennifer Priestley, Professor of Statistics and Data Science at Kennesaw State University, about how universities structure their data science programs -- and how they rely on the private sector to stay relevant.

Skip ahead to these highlights:

1:00 - About Kennesaw State University’s School of Data Science and Analytics
3:25 - The challenges Jennifer faced when launching one of the first university data science programs
6:00 - How to bring different backgrounds together for an effective data science program
11:00 - How data science programs differ from so-called “spoke” programs
17:05 - It’s all about adding value to the organization
18:20 - Understanding model results
21:03 - The real-time feedback loop with the private sector
26:33 - The need for communications skills - how do you tell the story of your data?
28:30 - Top 3 takeaways for companies to build and strengthen their data science programs

What is Data Dialogues?

A podcast where innovative business leaders discuss data: how to think about it, how to use it and how it can help us all make better business decisions every day. As they tell their stories of trials and triumphs, you’ll gain key insights to leverage in your own day-to-day operations.

Data Dialogues Podcast
Episode 9: Kennesaw State University

Jeff Dugger (00:02):
Welcome to another episode of Data Dialogues. I'm your host, Jeff Dugger. I'm excited to be joined by Dr. Jennifer Priestley at Kennesaw State University, School of Data Science and Analytics, where she is professor of statistics and data science. Dr. Priestley is a pioneer in bringing data science programs into universities. Today, she will share with us how universities are collaborating with industry to shape the field of data science and to help companies strengthen their data science programs. Welcome Jennifer, and thanks for joining us.

Jennifer Priestley (00:32):
Thank you so much for having me, Jeff.

Jeff Dugger (00:35):
For our listeners who may not be familiar with KSU and its School of Data Science and Analytics, can you introduce us to the work you do as a professor with KSU, and more specifically, with the focus on the center for statistics and analytical research?

Jennifer Priestley (00:49):
Yeah, absolutely. So it's my favorite topic actually. So the School of Data Science and Analytics at Kennesaw State is housed within the College of Computing and Software Engineering. And within our school, we have four entities. We have the undergraduate minor in applied statistics and data analysis that serves about 150 - 200 undergraduates at any given point in time. We have our Master of Science in Applied Statistics and Analytics. And in that program, we have about 50 students, again, at any given point in time. We have a PhD program in data science, and we are proud to say we were the first university in the country to launch a formal PhD in data science. And then finally we have the center for statistics and analytical research that you just made reference to. So the center for statistics and analytical research really is sort of the heartbeat of the School of Data Science and Analytics at KSU.

Jennifer Priestley (01:50):
That is the primary point of intersection between what we do in the classroom and what we do in terms of collaborating with private sector organizations. So right now we have about eight to 10 different private sector collaborations. And in the context of those collaborations, students have an opportunity to work on real-world problems, real-world data. Specifically in the context of the PhD program, they're working on novel challenges in specific areas of research and thought leadership in the data science space. But importantly, the center for statistics and analytical research is the point where the students really learn what it means to be a data scientist. So, I like to say that you can study to be a doctor in the classroom, right? You can study all the textbooks around biology and biochemistry, but you don't actually learn how to be a doctor until you start walking up and down the hallway of the hospital. Right? So, similarly, our students are learning the theory behind mathematics and statistics and computer science, but I would argue they're not really learning how to be data scientists until they start working with organizations and they start hearing what those challenges are. What's actually happening on the ground and how to put what they're learning in the classroom in action.

Jeff Dugger (03:13):
Well, let's dive into that a little further. As you mentioned, being a pioneer in bringing data science into the academic world, having launched one of the first data science PhD programs at a university, what were some of the challenges involved in getting that off the ground?

Jennifer Priestley (03:29):
Oh my gosh. How much time do you have? So, it's incredible to think about, but if we go back just 15 years ago, certainly 20 years ago, we didn't even have this academic discipline that we now call data science. There were no university programs that were formally teaching students at any level, undergraduate, master's or doctorate, how to extract, transform, load and clean data and then apply different modeling techniques to translate that data into information, to solve problems. There, you know, there were no programs that did that. And when you start thinking about all of the different traditional academic disciplines that go into that core set of skills, again, you've got computer science, you've got statistics, statistical modeling, you've got applied mathematics. Increasingly you also have the different humanities disciplines about the societal impacts of what this process ultimately results in.

Jennifer Priestley (04:43):
And so in a traditional university setting, I like to say that universities do silos better than any other organization on the planet. And it's not traditional to have the computer science department and the math department and the statistics department working collaboratively with the business school and with the college of humanities and with engineering. And yet that's exactly what you have to do in order to have an effective data science program. You have to have this interdisciplinary approach where you bring these different academic disciplines together. So I think the biggest challenge that we faced, and I think you'll probably hear this from other universities as well, is to bring together faculty and departments from traditionally siloed areas of the university to have this common program. This singular program, where we have this input from all these different perspectives on data science. But ultimately that's what you have to do to really have a world-class program. So without question, that was the biggest challenge.

Jeff Dugger (05:58):
Yeah, it sounds like a bit of a wild west with people coming from a whole bunch of different fields. I know personally I work with data scientists with backgrounds in engineering and political science and even forestry. And you mentioned others like mathematics and statistics and computer science. So you're bringing some civilization to this wild west environment. And what are some of the core fields, what are some of the core commonalities that you extract out of these various disciplines to make this work? Algorithms, math, tools, et cetera.

Jennifer Priestley (06:38):
All of the above. It's such a great question. So, you know, I like to say that if data science was going to have a mascot, it would be the Platypus. We would be the Fighting Platypi. And the reason I say that is because it's very easy to classify a computer scientist, right? It's very easy to classify a mathematician, but much like historically people that are engaged in taxonomy of the animal kingdom have kind of struggled to, you know, where do you put a Platypus, right? Because it's a mammal, but it lays eggs and it's got webbed feet and a duck bill, but it's got fur, right? So it kind of crosses all of these traditional barriers that we've historically set up for what it means to be a particular type of animal, a particular type of creature.

Jennifer Priestley (07:30):
And so when we talk about data science, we, you know, again, to kind of reiterate that point about interdisciplinarity, a data scientist is not a computer scientist, but they have to understand computer science. A data scientist is not a statistician, but they have to understand statistics. A data scientist is not a mathematician, but obviously they have to understand mathematics. So within the context of what I've seen across the country with different academic programs and how data science has been evolving, I like to say that programs take kind of one of two forms very broadly. We see data science programs that are what I'll call hub-based programs and then programs that are what I'll call spoke-based programs. So what do I mean by that? So hub programs are programs where students are studying to become scientists of data. And those are the programs that typically are going much deeper into the programming and into the computer science.

Jennifer Priestley (08:33):
These are the students that are really learning how to work with massive amounts of unstructured data. They're going to be excelling in artificial intelligence and machine learning. And they are fairly agnostic in terms of their area of application, whether they're ultimately going to be applying data science to healthcare, to financial services, or to manufacturing or to marketing or whatever the area of application is, because their area of study is data science. Their major is data science, and then they'll have some area of application. And so again, those programs tend to be more computationally strong. Again, ours is housed in the college of computing and software engineering. Our students go very deep in the math and into the computer science, but importantly, they have to have some area of application.

Jennifer Priestley (09:31):
I would draw a contrast to data science and analytics programs that are what I'll call spoke-based programs, which means that the students study something else. So they're majoring in finance, or they're majoring in engineering, or they're majoring in biology or healthcare. And then they pick up one or two classes in analytics. And so these programs tend to be maybe less deep in the computational aspects of data science and much more focused on more black box software. Really focused on being able to interpret the results and communicate the results within the context of whatever their focus is. Again, be that financial services or healthcare or retail or whatever the area is that they're really focused on. So, again, to summarize across the entire academic kind of ecosystem, data science has evolved as a unique academic discipline in one of two forms. Either they're hub programs, where students are really learning how to be true scientists of data, or spoke programs, where the students are really going deep in something like finance or healthcare, and then picking up one or two analytics classes as part of their training.

Jeff Dugger (10:55):
Great. So as a scientist of data, then data is at the heart of everything we do. One item I've noticed in my own experience as a data scientist and dealing with students, as I work with some students, including from KSU, is a lot of focus on algorithms and math and code. But there seems to be a big shock for people getting into data science at the beginning of how much work there is to go into cleaning up the data. How to mitigate the impact of imperfect data. How to interpret the results. I would imagine that is one thing that is unique and fundamental to this field of data science, that the other fields in the spokes may or may not emphasize as much as you'll get in the hub. What would you say about that?

Jennifer Priestley (11:50):
Well, it's such a spot on point, Jeff. This whole idea of working with real data. So let me give you a very specific example of this. So I teach a class in the spring every year called STAT 8330, and this is a binary classification class, and the students are actually working with some very real, very messy, very imperfect data that actually came to us from Equifax. So Equifax very graciously gave us this massive dataset that we can now use for academic purposes, which by the way, is a footnote to all of your listeners. We, as data science faculty, can't do what we need to do if we don't have real data. And we can go out and get things from Kaggle or things from the US government or things off of the internet, but I tell you what, nothing is better.

Jennifer Priestley (12:46):
There's nothing that works better in the classroom than having somebody from an organization actually walk in and say, here's my data. Here are my pain points. And here's what a day in the life of a data scientist within my organization looks like. And again, here's what that data really looks like. That becomes a very real experience for the students in a way that just pulling down data from Kaggle just doesn't do. But coming back to your question, so, you know, we have this massive, very messy data set that was such a gift. So the students walk in, and these are predominantly doctoral students, but we have some master's students too. And they walk in and, you know, they may not necessarily have an expectation that they're going to be building these binary classification models on day one or two, but they certainly think by day three, they're going to be writing some really cool code to do the classification.

Jennifer Priestley (13:44):
And the reality is they don't actually get to the point of building models until probably week six or even later, because they have to spend so much time transforming the variables to achieve things like monotonicity. To find the optimized binning solution. To determine how to best transform the variable, given the context of the dependent variable. And this whole real-world issue that they have to deal with, with coded values and erroneous values and values that just don't make sense within the context of the business problem. And so again, they have to spend a lot of time thinking through computation strategies and, how do I bin this data, and how do I do these transformations? And then of course, they've got a ton of data.
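As an editorial aside, the binning-and-monotonicity check described here can be sketched in a few lines. This is a minimal illustration, not the course's actual code: it assumes pandas is available, and the column names (`score`, `default`) and the toy data are hypothetical.

```python
# Quantile-bin a continuous predictor and check whether the event rate
# moves monotonically across bins -- the kind of check students perform
# before a variable goes into a binary classification model.
import pandas as pd

def bin_event_rates(df, predictor, target, n_bins=5):
    """Quantile-bin a predictor and return the event rate per bin."""
    binned = pd.qcut(df[predictor], q=n_bins, duplicates="drop")
    return df.groupby(binned, observed=True)[target].mean()

def is_monotonic(rates):
    """True if event rates rise or fall consistently across bins."""
    return rates.is_monotonic_increasing or rates.is_monotonic_decreasing

# Toy data: higher scores are associated with a higher event rate.
df = pd.DataFrame({
    "score": range(100),
    "default": [1 if i > 70 else 0 for i in range(100)],
})
rates = bin_event_rates(df, "score", "default")
print(is_monotonic(rates))  # True for this toy data
```

A variable that fails this check would typically be re-binned or transformed before modeling.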

Jennifer Priestley (14:38):
They've got thousands of potential predictors, which obviously won't work. So they have to go through a variable reduction, a data reduction process. And the point that we continue to drive home to the students is there's a lot of wrong ways to do this, but importantly, there's a lot of right ways to do this too. And you just have to be able to think through the decisions that you're making now in step five. How those decisions in terms of how you're cleaning and transforming this data are ultimately going to impact your results in step 50. And so I think if nothing else, you know, the coding is easy. Building a logistic regression model ultimately is pretty easy. I think what the students actually learn in that class are the latent skills of thinking about how I'm cleaning and pre-processing and engineering the data in the first couple of weeks, and how that then affects the model in weeks 8, 9, 10.

Jennifer Priestley (15:47):
And then obviously at the end, in terms of building their profitability function or however it is that they're going to be judged. So it's like opening that aperture, so that you're not just thinking about achieving a 98% global classification rate. That's not where we're going to be placing our focus. Our focus is going to be much broader in terms of this entire continuum of going from raw, messy data all the way through the data cleansing, data prep optimization, and then model building, and then ultimately the classification process. So, hopefully that answered your question. But that is a really important aspect, I think, of data science that gets lost sometimes. And it's certainly something that we tend to place a lot of emphasis on at Kennesaw State. Just this idea of how to go from raw data through the data cleansing and data engineering process, because that is not trivial. And that has so much more impact on ultimately your ability to add value to an organization than just being able to download some Python code.

Jeff Dugger (17:03):
Yes. And in my experience, I would say that's probably the one item that really surprises a lot of newcomers to the field. There's a lot of focus on all the fancy algorithms, the AI, the neural networks, the machine learning, but everybody seems to forget if they're new to this field, that it's all about the data.

Jennifer Priestley (17:23):
Oh, absolutely. And, you know, it's all about getting the answer right. Right? I mean, it's all about adding value to the organization. So let's not start with the most complex option first, right? Let's start with the simplest option and then work our way up. So that's another challenge that I have found, especially with doctoral students, is they like to beat their chest about how complicated their neural nets are and their deep learning processes. And you know, how they now understand all of these different Keras packages and all these different things that they've been able to pull down off of GitHub. And okay. Yay. But ultimately, if you can solve this problem using a simple regression model, let's do that first. And then let's talk about becoming more sophisticated. But let's not go after the most complex solution right out of the gate.
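The "simplest option first" advice above can be made concrete with a small sketch: before reaching for deep learning, fit a logistic regression baseline and see how far it gets you. This example assumes scikit-learn is available and uses synthetic data purely for illustration.

```python
# Fit a logistic regression baseline on synthetic data. Only if this
# baseline falls short does it make sense to escalate to gradient
# boosting, neural nets, and so on.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"baseline accuracy: {baseline.score(X_test, y_test):.2f}")
```

A baseline like this also gives you a yardstick: any more complex model has to beat this number to justify its added opacity and maintenance cost.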

Jeff Dugger (18:17):
I would agree with that. I remember learning that lesson when I was fresh out of a PhD as a beginning engineer and had to have that taught to me by an experienced older engineer. Look for the simplest solution first. It's all about solving the problem. So a related topic would be understanding the results that you're getting out of the model. How does one know that the results make sense, that they're not crazy? For example let's be a little extreme and say your model tells you, you have a 99% accurate prediction. How do you instill in the students that there's probably something wrong there?

Jennifer Priestley (18:59):
Well, it's funny, you should bring that up because we have this problem every year. And I, you know, I have to laugh because it's almost always a math major. Not to pick on math majors, but between you and me, we'll pick on math majors. So every year when I teach that same binary classification class, there's always that one student that thinks they've cracked the code, right? And they'll come up and they'll say, you know, wow, Dr. Priestley, I had the highest classification rate ever. I win the prize! And, you know, we have to explain to them that I'm willing to bet you a Diet Coke, which is a pretty important form of currency in my world. I'm willing to bet you a Diet Coke that you have one of the outcome variables that you're using as a predictor.

Jennifer Priestley (19:48):
I don't think I've ever been wrong on that. Any time students come back with these really high classification rates, that's almost always the case. So again, I think that that experience in the classroom, as embarrassing as it is sometimes for the students, what a gift to have that mistake in the context of your academic program. Such that you'll know to look for it after you graduate, so you don't make that mistake on the job, which I'm sad to hear does sometimes happen. So the first thing that we have the students really check for as they're building their models is to make sure that they're not using anything that would be known only post hoc as a potential predictor. And like I said, if they learn nothing else from my class, I hope that they take that very important lesson with them after they graduate.
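The leakage mistake described here is easy to reproduce. In this sketch (synthetic data, scikit-learn assumed), a feature derived from the outcome itself, i.e. something known only post hoc, drives cross-validated accuracy to near 100%, which is exactly the tell-tale sign to check for.

```python
# Demonstrate target leakage: an honest model vs. one whose feature set
# secretly contains a noisy copy of the outcome variable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

honest = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# "Leaky" feature: a near-copy of the outcome, known only post hoc.
X_leaky = np.column_stack([X, y + rng.normal(scale=0.01, size=500)])
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

print(f"honest: {honest:.2f}, leaky: {leaky:.2f}")
```

When a model's accuracy looks too good to be true, auditing the feature list for anything only knowable after the outcome is the first check to run.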

Jeff Dugger (20:44):
I think that's a very important point to make because I've heard of that as well. So it must be a fairly common mistake. [inaudible]

Jennifer Priestley (20:51):
Although hopefully you haven't heard that from any of our students.

Jeff Dugger (20:54):
Ah, no, I can’t say that I have.

Jennifer Priestley (20:56):
Ok, good! Ok, good!

Jeff Dugger (20:59):
So you mentioned earlier working with Equifax and working with datasets and real-world data and industry. What kind of feedback loop do you have from industries that are looking to hire data scientists? The kind of feedback that drives your curriculum design, your training, and your teaching?

Jennifer Priestley (21:17):
Yeah, absolutely. So, you know, I'll just start with some big, broad brush strokes. What we do at Kennesaw State at the undergraduate level, master's level, and even at the doctorate level is heavily integrated with our local economic drivers. So, ultimately we are a public university, right? I mean, we are funded by the taxpayers of the state of Georgia. And so we need to ensure that those investments from the taxpayers of the state of Georgia, in terms of what we're doing in the classroom, are creating value for the state of Georgia. And then obviously it all comes back to the taxpayer. So, yes, absolutely. We reach out and integrate what we do in the classroom at all levels with the big organizations that are really driving the economy of the state of Georgia.

Jennifer Priestley (22:13):
So when we first launched our master's program, our undergraduate program, the first thing we did is we called organizations like Equifax and Delta Airlines and Coca-Cola and the Home Depot and all of the big Fortune 100 and Fortune 500 companies that were in and around the Atlanta area. We brought them into a big conference room and we said, bring us your job ads. Tell us what you need. What is it that you're going to be hiring for? And importantly, not just what you're going to be hiring for today, but what are you looking for 1, 3, 5 years in the future? You know, somebody very famously asked Wayne Gretzky: What makes you the greatest? What makes you the greatest hockey player? And he said, I don't skate to where the puck is. I skate to where the puck's going to be.

Jennifer Priestley (23:03):
And so I think in terms of how we've been thinking about data science and analytics as an academic discipline, we've always kept that philosophy at the front of how we've approached our curriculum design. We don't want to just be designing for the needs of today, for the state of Georgia and for the economic drivers of our local economy, and then more broadly of our national economy. We want to ensure that we're going to be graduating people who are going to have skills that are going to be in demand long after the day that they graduate. So just in terms of big, broad brush strokes, integrating with all of the big companies in and around Atlanta has always been on the forefront of what we do.

Jennifer Priestley (23:52):
I said that the Center for Statistics and Analytical Research is really the heartbeat of what we do in the school of data science. And, again, it's through the center for statistics and analytical research that we have these very strong ties with the private sector. And so, we very quickly and regularly get feedback at the micro level and at the macro level, from organizations that we work with through the center, in terms of what we're doing well and where we're falling short. And it's real time feedback. You know, if a student has recently graduated and went to work for an organization, we get that feedback right away in terms of how well people are doing and, again, where they're falling short. So, if we do get feedback that we're falling short in something, we then go back and take a look at the curriculum.

Jennifer Priestley (24:44):
Was this an issue with that particular graduate, or is there a systemic gap in our curriculum such that we need to go back and take a look at it? And I can tell you one issue that continues to come up, not just in our data science program, but I think consistently with data science programs across the country. And that is this whole idea of communication, of soft skills. Of students not just understanding the mathematics and the statistics and the computer science and all these hard skills that we've been talking about, but also, importantly, how do you communicate those results to a non-technical audience? Both in terms of your ability to stand up and give a presentation, your ability to write, your ability to communicate to people who are technical and non-technical, but also people across the continuum of the administration within your organization, right?

Jennifer Priestley (25:42):
So from the CEO all the way down, you need to be able to communicate and throttle the depth of your messages in terms of computational complexity. You need to be able to throttle that based upon your audience. So I think those latent skills related to communication are really an important part of why we partner with so many organizations in the private sector. It's not just about reviewing our curriculum and making sure that what we're teaching in the classroom is consistent with what these organizations are going to need. Importantly, it's also through conversations beyond just their faculty, actually working with and talking to practitioners of data science, that students develop their communication skills, which is so critical.

Jeff Dugger (26:31):
You make a very, very important point there. The ability of people to tell stories about their data, about their models of their data, about what the data and the results mean. I also like to instruct the students that work with us that the stories are not just important for communicating to upper management or other colleagues who may not be experts. They're also very important for solving some of the issues we talked about earlier. Understanding your data, understanding the results you're getting out of your models. Because when you have to turn what you're doing into a story, it forces you to think clearly about what it is you're doing and not just throw data into Python and hope for some really good results. So that's one way I try to sell students on learning to tell a story from the beginning: it will help them solve their problem, as well as communicate what they want to communicate to the people who matter in their world.

Jennifer Priestley (27:34):
It's such a great point. And I think that ties back to part of the previous conversation about how you open up the aperture for data scientists, right? Beyond just the computational skills, the tools and all of the packages, but actually thinking more broadly about how these results are going to be used. And, what role does parsimony play here? I mean, should I be willing to sacrifice a couple of points of classification accuracy if I can go from 50 predictors down to 10 predictors or five predictors? What does that actually cost? What's the loss function there in terms of that trade-off? So, yes, absolutely. I think having those types of conversations, that understanding of how these results are going to be used, again, kind of opens up the aperture and helps the students think more broadly about how their work fits into a larger picture.
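The parsimony trade-off described here can be measured directly: compare a model using all available features against one restricted to a handful of the most informative ones. This is a hedged sketch on synthetic data; scikit-learn is assumed, and `SelectKBest` is just one of several reasonable reduction approaches.

```python
# Quantify the parsimony trade-off: cross-validated accuracy with all
# 50 features vs. with only the 5 most informative features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=5, random_state=0)

full = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
small = cross_val_score(
    make_pipeline(SelectKBest(f_classif, k=5),
                  LogisticRegression(max_iter=1000)),
    X, y, cv=5).mean()
print(f"50 features: {full:.2f}, 5 features: {small:.2f}")
```

Printing both numbers side by side is exactly the "what does that trade-off actually cost" question: if the accuracy gap is small, the leaner model is usually cheaper to explain, deploy, and maintain.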

Jeff Dugger (28:29):
I think you've given us a lot of great insights today. Jennifer, what would be the top three takeaways that companies can use to build and strengthen their data science programs?

Jennifer Priestley (28:41):
So if I had to just pick three, three big points for people to take away in no particular order, the first would be that universities are truly working hard to pivot their curriculum to try to meet the needs of the private sector for data scientists. I know everybody has openings for data scientists, and they're having trouble hiring. Universities are working hard to make sure that what we're doing in the classroom is relevant. But importantly, we can't do what we do unless we're partnering with the private sector. So I would encourage all of your listeners to reach out to their local university to determine, is there a way that maybe I can sponsor a project? Is there a way that I can sponsor a capstone class? Can we, de-identify some of the data that we use that really represent kind of a day in the life for our analysts?

Jennifer Priestley (29:39):
Is there a way that I can do that and bring that into the classroom for some real-world classroom exercises? So I guess that's the first point. That we can't do what we do without partnership with the private sector. The second is that, back to that idea of communication skills, right? That students have to learn how to communicate beyond just learning the programming and the math and the computer science, and the way they do that is through talking and writing and engaging. And so, as you're looking to reach out to a university for partnership, also ask if there's an opportunity for you to meet with students on a one-on-one basis, potentially to be a mentor. We have a great mentoring program that we've set up through our center for statistics and analytical research that's headed up by Bill Franks.

Jennifer Priestley (30:32):
He's just done an amazing job with that center, and in the context of the work that he's put together, he's bringing in people from the private sector to actually be mentors to help students with their communication skills. So I suspect that a lot of universities have something similar. So, seek out opportunities to, again, engage with your local university. And then the third is just more broadly related to that idea of hub programs versus spoke programs. You know, just about every major university across the country has some type of initiative in data science. Some, like I said, are going deeper into data science, really helping students become scientists of data. And then some are more aligned as spokes, where the students aren't necessarily going into the deep nuances of programming, but they're learning how to work with black boxes. And they're taking those results and then tying them back to the original business problem. So neither of those approaches is wrong, but I would encourage your listeners, if they are going to be reaching out to a university program for the purposes of partnership, do a little bit of investigating and try to understand what kind of program does this university have. Is it a spoke program or is it a hub program? Because that'll set expectations on both sides in terms of what that collaboration looks like.

Jeff Dugger (31:59):
All right. Jennifer, thank you so much for joining us today and for sharing your rich knowledge with us. If our listeners are interested in more information, where can they find you?

Jennifer Priestley (32:09):
Ooh, they can reach us at datascience.kennesaw.edu.

Jeff Dugger (32:18):
Thanks again for joining us today.

Jennifer Priestley (XXXX):
Great. Thank you, Jeff. Thank you for having me.