The Sci-Files on Impact 89FM, Season 9, Episode 19
Emily Bolger on Understanding the Role of Machine Learning and Text Analysis in Systematic Literature Reviews
On this week's episode of The Sci-Files, your hosts Mari and Dimitri interview Emily Bolger. Emily Bolger is a 5th year PhD Candidate in the Department of Computational Mathematics, Science, and Engineering. She works in the Computing Education Research Lab (CERL) with Dr. Danny Caballero. Her dissertation research uses Natural Language Processing to identify and synthesize themes in Instructional Change Strategies in Undergraduate STEM Literature. Systematic literature reviews critically collect and evaluate findings from a specific area of research. In collaboration with her colleagues, the analysis seeks to identify themes in undergraduate STEM education specifically focused on literature highlighting instructional and curriculum strategies. Extending previous work conducted about 15 years ago, the team is repeating the analysis with new literature and assessing the integration of machine learning tools. With developments in Natural Language Processing, the field behind tools like ChatGPT, there are many techniques available for assisting our researchers in extracting information from the literature. The team explores how machine learning methods can provide new insights to traditional methods in systematic literature reviews.
Emily also works with her colleagues in CERL to develop curriculum materials for CMSE’s undergraduate Data Science and Computational Modeling courses, particularly assignments that focus on data ethics and algorithmic bias.
If you're interested in discussing your MSU research on the radio or nominating a student, please email Mari and Dimitri at thescifileswdbm@gmail.com. Check The Sci-Files out on Twitter and Instagram!
What is The Sci-Files on Impact 89FM?
The Sci-Files is hosted by Mari Dowling and Dimitri Joseph. Together they highlight the importance of science, especially student research at Michigan State University.
Dimitri Joseph:
WDBM East Lansing.
Mari Dowling:
Welcome to The Sci-Files, an Impact 89FM series that explores student research here at Michigan State University. We're your co-hosts, Mari Dowling
Dimitri Joseph:
And Dimitri Joseph.
Mari Dowling:
Today on Pi Day, we are joined by Emily Bolger from the Department of Computational Mathematics, Science, and Engineering, who is here to talk to us about her research. Hi, Emily. Thank you for joining us. Could you tell us a little bit about yourself?
Emily Bolger:
Hi, everyone. Thanks for having me on. My name is Emily Bolger. I'm a PhD candidate in the department of CMSE. I do computing education research, and so that means that I use data science and machine learning methods to study education, particularly the students in our department.
Mari Dowling:
So first, I guess machine learning is a term that I feel like we've heard a lot, you know, in recent years as a technology and a tool also. Can you tell us a little bit about what that is?
Emily Bolger:
Yeah, that's a great question. Especially with the development of artificial intelligence and AI, the lines have kind of blurred about like what is machine learning, what is data science, what is artificial intelligence. I like to think of machine learning as using models to either predict or assess a set of data. So in our introduction to computational modeling and data science course in our undergraduate program, we really teach students how to think about data, how to critically assess data, and how to use models to either make predictions or learn more about that data. So data science and machine learning kind of go in tandem with each other.
Emily Bolger:
It kind of depends on what type of model you are using and sort of the sophistication of the model that you are using. So, like, things like linear regression are much more in the data science world, whereas things like neural networks are much more in the machine learning type world.
Dimitri Joseph:
Thank you. I appreciate that background on what machine learning is. But you mentioned that you apply it in a completely different way than I'm used to hearing machine learning or AI being applied. You use it for education. How do you use machine learning to advance education research?
Emily Bolger:
So I will preface with our research group kinda goes in two separate directions. And so work that some of my colleagues do is more qualitative analysis. And so they do a lot of interview studies with our students, trying to strengthen our curriculum and teach them in ways that are easier for them and help progress their learning. My dissertation work is a little bit different. I use what's called natural language processing.
Emily Bolger:
That is the field behind ChatGPT, DALL-E, all of those really popular AI tools. And what I'm doing is using those models to look at literature in science education, and we're trying to identify themes in that literature over the last ten to fifteen years to try and understand what strategies are being used in undergraduate classrooms. Are they effective? Are they not effective? In tandem with that, I am working with collaborators across the United States who are doing a more traditional systematic literature review.
Emily Bolger:
So they are like reading the papers and looking at the data and identifying themes. And the point of my dissertation is to figure out how machine learning can help them in that process and try and synthesize that information in a way that's useful to them.
Dimitri Joseph:
Yeah. Machine learning isn't as new as it once was, right?
Emily Bolger:
Right.
Dimitri Joseph:
And now, hearing that your colleagues are doing it the old way, it just seems antiquated and just too slow. It seems very inefficient, and I think we need the computers to help us become more efficient and speed up that process.
Mari Dowling:
I think it's a really valuable application of machine learning because we tend to focus on the way that it's stealing people's art and copyright and a lot more of the negative aspects. And I think there are definitely beneficial ways where it can be used as a tool. And this looks like a really, really valuable way where it can save researchers a lot of time. Kind of building off of that, I was wondering if you could tell us a little bit about what natural language processing is.
Emily Bolger:
Yeah. Another great question. It kind of gets thrown in with data science and machine learning and all that sort of world. Natural language processing has a long history. It's been talked about more recently with the development of large language models and ChatGPT and all the sort of machine learning AI.
Emily Bolger:
But it really comes from linguistic theory and studying, like, word formation and how words relate to each other. And so the history of large language models is based off a hypothesis back in the 1950s called the distributional hypothesis. And the idea is that words that have similar meanings often appear in similar contexts. And so the role of machine learning is to sort of take that framework and figure out how we can make machines emulate that sort of understanding. So we use something called text embeddings.
Emily Bolger:
Text embeddings take words or groups of words and represent them with a vector of numbers. And so if two words have similar meaning, the common example I give is cat and dog. Cats and dogs are both pets, they're both animals, they have some sort of relationship between them. Their representations with vectors of numbers will be similar, but cheetah and cat will be even closer, because cats and cheetahs are both felines, they come from the same biological upbringing, so their vector representations will be even closer. Once we have numbers, we can do math on them, which is how we get the ChatGPTs and the LLMs of the world.
Emily Bolger:
So yeah, natural language processing is really built on that linguistic theory of how we can study text and the developments of machine learning have helped us figure out how we can study text with numbers and make it go faster as y'all were talking about.
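The cat, dog, and cheetah comparison above can be made concrete with a small sketch. The four-number vectors below are invented for illustration and are not output from any real embedding model. Cosine similarity is assumed here as the closeness measure; it is a common choice, but the episode does not name one.

```python
import math

# Hypothetical 4-number "embeddings", hand-picked for illustration only.
# Real models learn vectors with hundreds of numbers from large text collections.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.5, 1.0, 0.0],
    "cheetah": [1.0, 0.0, 1.0, 0.0],
    "dog": [1.0, 0.5, 0.0, 0.0],
    "snake": [0.3, 0.1, 0.0, 1.0],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity(word_a, word_b):
    """Look up two toy vectors by word and compare them."""
    return cosine_similarity(TOY_EMBEDDINGS[word_a], TOY_EMBEDDINGS[word_b])


if __name__ == "__main__":
    for pair in [("cat", "cheetah"), ("cat", "dog"), ("cheetah", "snake")]:
        print(pair, round(similarity(*pair), 3))
```

With these made-up numbers, cat and cheetah score higher than cat and dog, and cheetah and snake score lowest, which matches the ordering described in the conversation.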
Mari Dowling:
So how do they manage to convert these linguistic ideas, like a cheetah or a cat, into a numerical, mathematical vector, as you say?
Emily Bolger:
That's a great question, yeah. So I understand some of that, but that is an entire field in and of itself, how it gets represented. The models are trained on data, and they sort of continually update the model based on examples that you give it. And a lot of times, traditional natural language processing uses classification.
Emily Bolger:
And so whether it's like a yes or a no answer, as a person, you have an understanding of whether the classification is right or wrong. So similarly, in machine learning and natural language processing, you can train the model to identify what is correct and what is not correct. And so the underlying idea of text embeddings, at least how I understand it, is very similar. Right? You can sort of train the model to think about, yes, cat and dog are similar to each other, and yes, cheetah and cat are similar to each other, but snake and cheetah are not similar to each other.
Emily Bolger:
The data that I work with and the sort of pretrained models and the results that I work with are based on lots of work and lots of mathematical modeling by people who have determined and sort of corrected the model into what is, quote unquote, correct and what is not correct.
Dimitri Joseph:
And what I'm hearing is that you use this optimized model that's able to understand the patterns that are within the language and text. And then from there, it can make predictions about what text may come next following the previous text, or what words are associated with each other, just based off of pattern recognition. So now I'm just curious to know, how do you apply these models that have been optimized to your research?
Emily Bolger:
You're exactly correct. Especially with ChatGPT, I like to tell people that it is a black box, but it's not magic. It is based on statistical methods, and it's just predicting what it thinks the next word is. I specifically use text embeddings, and so our collaborators have done a number of iterations. Right?
Emily Bolger:
They're doing a traditional systematic literature review where they search the web and search things like EBSCOhost to identify about 9,000 articles. From there, they did what's called the title and abstract screening, and they decreased that set to about 700, and they are now doing a full text analysis on a subset of 200 articles. The plan is to integrate the machine learning pieces into both the title and abstract screening and the full text. So currently, my analysis is looking at those 700 abstracts that they identified, and we are using text embeddings to take a single abstract and represent it with a single vector of numbers. From there, we're using something called HDBSCAN, which is an unsupervised clustering method.
Emily Bolger:
So you take the abstract, you represent it with a vector of numbers, you technically reduce it down to two dimensions because those vectors are like 500 numbers long. And when you reduce it down to two dimensions, you basically have like a point cloud. So if you're thinking about a traditional x y axis, you just have a bunch of points in space. But there's some sort of relationship between those points in space and there's some sort of higher density of points in some spaces versus others. And so that's where we pull in unsupervised clustering to try and help us identify through mathematical methods where the clusters are.
Emily Bolger:
What we're doing right now is taking a deeper dive into what those clusters actually mean. What papers are in that group? Is there a sort of synthesis there? Is there not a synthesis? And how does that help our researchers identify maybe new themes or new topics that they're looking at in their analysis?
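The clustering step described above can be sketched on a toy point cloud. The episode names HDBSCAN; the code below swaps in plain DBSCAN, a simpler density-based relative, written in standard-library Python so it stays self-contained. The points, `eps`, and `min_pts` are made up for illustration and stand in for abstracts that have already been embedded and reduced to two dimensions.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1  # points[i] is a core point, so start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point, keep expanding
    return labels


# Hypothetical 2D points: two dense groups plus one isolated point.
TOY_POINTS = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05),
    (5.0, 5.0), (5.1, 5.2), (4.9, 5.1), (5.2, 4.9),
    (2.5, 9.0),
]

if __name__ == "__main__":
    print(dbscan(TOY_POINTS, eps=0.5, min_pts=3))
```

Points in dense regions share a cluster label, and the isolated point is marked as noise (-1). That labeled output is the kind of result a researcher would then inspect to see which papers landed together.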
Dimitri Joseph:
So as a novice, what I'm hearing is there's this database of maybe 200 different papers.
Emily Bolger:
Yeah. There's a few hundred papers. Yep.
Dimitri Joseph:
And you're applying your techniques to cluster them, and you're looking for what are the different clusters of these papers.
Emily Bolger:
Exactly.
Dimitri Joseph:
So have you found anything yet, or have you not reached that point?
Emily Bolger:
We're sort of in the middle of it right now. Something that I think is really interesting is that when we did it initially, the clusters we were noticing were by discipline. So they're looking at science as a whole field, and so the initial clustering would pick up like an engineering cluster or like a mathematics cluster or a statistics cluster, which makes sense. Right? That's sort of like a proof of concept that the text embeddings piece that we're doing is noticing that papers that talk about similar disciplines are similar and is putting them together.
Emily Bolger:
And so, again, good proof of concept that the text embeddings and the methodology that we're using are capturing things that we're thinking about. The challenge is that our researchers are not necessarily interested in the discipline. And they are, right? They care about if we're doing analysis or doing instructional changes in statistics or mathematics, but they care more about the strategy that was used in the classroom.
Emily Bolger:
So think of things like a flipped classroom or active learning or remote learning. And so something that we're working on right now is trying to, I hate using the word manipulate, but manipulate the data in a way to focus in on those topics. And so one thing we've experimented with is removing the disciplinary words from the abstracts and repeating the analysis. And we're seeing some pretty cool findings. Now we're getting clusters like an active learning cluster, or we're getting professional development for faculty.
Emily Bolger:
So that sort of change in the data helps the model refocus on the type of topics and things that we're thinking about.
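One way to picture the word-removal step is a small text filter. The word list and the sample abstract below are hypothetical; the episode does not list the terms the team removed, only that they were chosen together with education researchers.

```python
import re

# Hypothetical disciplinary terms, for illustration only.
DISCIPLINE_WORDS = frozenset(
    {"engineering", "mathematics", "statistics", "physics", "chemistry", "biology"}
)


def remove_discipline_words(abstract, words=DISCIPLINE_WORDS):
    """Drop whole-word, case-insensitive matches and tidy the leftover spacing."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in sorted(words)) + r")\b",
        re.IGNORECASE,
    )
    stripped = pattern.sub("", abstract)
    return re.sub(r"\s+", " ", stripped).strip()


if __name__ == "__main__":
    sample = (
        "We studied a flipped classroom in an introductory Statistics "
        "course for engineering majors."
    )
    print(remove_discipline_words(sample))
```

After filtering, the sample keeps "flipped classroom" but no longer mentions a discipline, so a downstream embedding has less disciplinary signal to group on.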
Mari Dowling:
Kind of along those lines, I understand why you're doing that kind of, as you called it, manipulation, and I understand why you called it that as well. Have you thought about applications where that might be misused and how, you know, you might prevent that kind of thing going forward?
Emily Bolger:
Ethics is a hot topic in our group and there are a lot of divisive opinions on it. And our group is thinking about it in a very thoughtful way and so we don't do that lightly. And we don't just, like, remove words all willy-nilly. We talked with our researchers and the folks who are well versed in the education space. They've helped us decide what words to remove and we are still not throwing out that analysis.
Emily Bolger:
The analysis where it's clustering on disciplinary words, we're still looking at that, because they're so enamored by the engineering cluster, because there weren't a lot of engineering papers when they did this analysis about ten to fifteen years ago. And so they wanna do like an analysis just on that subcluster. I think it's important to think about how the data and the choices that you're making with the data do affect the analysis and the results. But I think that we're doing it in a really thoughtful way. And this sort of framework is one that you could totally pass off to another group.
Emily Bolger:
It's a very well established framework in the field, but we don't make those decisions lightly. And I think that's the advice that I give to other people in the field: think about the choices that you're making. We're doing what's right for the analysis that we want to do. And like I said, we're still looking at both analysis and results pipelines, but don't just do it because you feel the need to do it.
Mari Dowling:
And I think your point about being able to manipulate your data analysis to get what you want is a problem that you saw before all of this machine learning was applied in research anyway, and, you know, it's always going to be a thing. So it's important to keep in mind, and I think it's great that you guys are thinking about it mindfully.
Emily Bolger:
Yeah, like I said, we're really trying to keep our experts who are in the education field a part of all of the decisions we make in the analysis, and I think that's really critical.
Dimitri Joseph:
A word that I like instead of manipulate is refine.
Emily Bolger:
Yeah. That's a better word.
Mari Dowling:
You're fine-tuning, because that is what you're...
Emily Bolger:
And that's exactly right. Yeah, that's why I don't like the word manipulate. It makes it sound like I'm manhandling the data and totally changing it. I'm just trying to get the model to focus on a different part of the data that is important to our researchers.
Dimitri Joseph:
What were the inclusion criteria for the datasets that you considered?
Emily Bolger:
I was not fully a part of that process, so I don't know all of the details, but I know some of the highlights. They are focusing on papers that were written and published in the US. We're looking at data from, I believe it's 2013 to 2023. Everything has to be peer reviewed. There's some discrepancies about whether conference proceedings count for that or not, and they're also looking at a bunch of science fields, so your math, engineering, but I think this time they've also included things like sociology and things that are a little bit more social science rather than hard science.
Emily Bolger:
I will say, I didn't say this in the beginning, but this analysis was done by three of my collaborators back in 2011. And they analyzed papers from, I believe it was '95 to 2010, something like that. So, they are redoing the analysis because of the boom of literature and science that's happened, and the development of more scientific fields and so, yeah, it's much more about which fields are in there.
Dimitri Joseph:
Were there search words to specifically focus on education?
Emily Bolger:
Yes. They definitely had specific search criteria. Things like instructional change strategies were also there.
Mari Dowling:
So this question is more directed towards you in how you think about things in your research, but I've noticed that compared to myself, for example, as a social scientist, you tend to approach questions and your data very analytically, thinking about things in perspectives of data points and numbers and graphs, and I think that's really interesting. And I was wondering if maybe you could give us some insight about how you approach it, your thoughts on this.
Emily Bolger:
Yes. So short answer, yes. That's how I think about my data. I think about it in numbers and graphs. I see patterns, I see relationships, but that's what I was taught to think of.
Emily Bolger:
I grew up loving math, like I was the math geek growing up and so that's just kind of how my brain works and the classes I've taken have helped me train my brain to think in patterns and relationships and all that sort of stuff. But it's really interesting that you say that because that's not how my collaborators' brains work at all. They very much are focused in like education and reading literature and they're way more intelligent than I am, especially in that space. And so every time I come to them with a finding, I have to figure out a way to explain it in a way that makes sense. And they're asking similar questions to what y'all are asking.
Emily Bolger:
Right? Like, okay, you're taking words and putting it to numbers, but what does that even mean? How did you get there? So yeah, I've had a lot of practice trying to convey what's in my brain to someone else who isn't as mathematically focused.
Mari Dowling:
Very interesting insight. I think it's really valuable, you know, it's the value of interdisciplinary work, right, in being able to advance an idea from different directions.
Emily Bolger:
Exactly, and that is the entire focus of our group and our research and our collaboration. We are coming at it from a very analytical perspective and they're coming at it from a very theoretical perspective and we're trying to figure out how to mesh these two. Interdisciplinary work is the way of the world now. Everyone is using computing in some way, whether they like it or not, we all walk around with our computers and even if you're using Google Docs to write up a paper, right, you're using computing in some way. So the whole goal of our work is to figure out how to mesh those two approaches and where the lines blur.
Emily Bolger:
And it's so funny to me that you talk about how like clear cut the machine learning and the data is because that is so true. But working with our collaborators, it's helped to make those lines a little bit fuzzier. So, yeah, it's been a really eye opening process for me as well.
Dimitri Joseph:
What are some of the ultimate goals of this teamwork?
Emily Bolger:
I think it's really just to figure out where machine learning can fit in. And I often talk with the folks in the computer science world who are really passionate about innovation and putting computers everywhere. And there's a lot of talk about computers replacing the work of people. And like we were talking about in the beginning, using computers to make things more efficient. And I think we're coming at it a little bit from that perspective, but our goal is to really figure out how we can decrease the workload and help our researchers who want to continue to do this work.
Emily Bolger:
We're not looking to take machines and just say, hey, we'll do systematic literature review for you, and the machines will figure it all out because we've proven that the machines can't do all of the work right now. And so the goal of the collaboration is to really figure out how machines can give them new insights into their work. So for this project, it's specifically trying to identify themes in the literature and they are also looking for themes, so the goal is to compare and contrast themes that the humans have found versus the machines. But I think our larger goal is just to really figure out how you can make those two coexist together in a way that is actually helpful to the education researchers, not totally replacing their work at all.
Mari Dowling:
So then kind of building off of that, do you have any insights you can share with us about how you might be able to apply machine learning techniques to literature reviews to help scientists be more efficient in their work?
Emily Bolger:
I think machines can come in at a bunch of stages. I know our collaborators have used it during their screening process, so there's a lot of software tools now to help recognize patterns of papers that you've selected and sort of give you like a percentage of if they think you should select this paper. I think those are pretty well established and definitely speed up their process and the screening. I think in the more information extraction stage, which is where we are at, I think there's still a lot to be developed.
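As a rough picture of the screening aids mentioned above, the sketch below scores a candidate abstract by how much its vocabulary overlaps with abstracts a reviewer has already selected, and reports that as a percentage. This is a toy under stated assumptions: real screening software presumably relies on trained models rather than raw word overlap, and the function names and sample texts here are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def screening_score(candidate, selected):
    """Percent similarity between a candidate and the pooled selected abstracts."""
    profile = Counter()
    for text in selected:
        profile.update(bag_of_words(text))
    return round(100 * cosine(bag_of_words(candidate), profile), 1)


# Hypothetical abstracts a reviewer has already marked as relevant.
SELECTED = [
    "active learning in undergraduate physics courses",
    "flipped classroom strategies for undergraduate calculus",
]

if __name__ == "__main__":
    for text in [
        "active learning strategies for undergraduate biology",
        "galaxy formation observed with radio telescopes",
    ]:
        print(screening_score(text, SELECTED), text)
```

A candidate that shares vocabulary with the selected set gets a higher score than one that shares none, which is the pattern-matching idea in miniature. The final decision still sits with the reviewer.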
Emily Bolger:
I think my biggest insight is talk to the people you're working with. Right? The machines are powerful, the machine learning tools are powerful, LLMs, large language models like ChatGPT, are really powerful and they're only getting more powerful as the days go on. But they're only powerful if they work with and work for the people that want to use them. And so my insight and maybe my takeaway is to talk with the people who want to use them, right, and build the tool for them and for their application because every systematic literature review is different and every process and analysis is different.
Mari Dowling:
Great insight. Thank you so much. Yeah.
Dimitri Joseph:
Thank you.
Mari Dowling:
Looking back at your experience in graduate school as a whole, have there been any other meaningful experiences outside of your graduate work that have enriched your experiences here?
Emily Bolger:
Yeah, my biggest one is probably Graduate Women in Science. They are a national organization, so there's a bunch of chapters across the States, but I've been very involved with the one here housed at MSU, the Mid-Michigan chapter. I've been the president for the last couple years, but I've been involved ever since I started in graduate school here. Our chapter focuses a lot on social and professional development for our graduate students. So we're hosting a book club this year.
Emily Bolger:
Well, not me, Alyssa Saunders, our social and professional development chair is hosting the book club graciously for us. And we also do like, Valentine's Day events, just things for folks to get together. Our secondary goal is outreach. So, Cynthia and Jada, who are our fantastic outreach chairs, go to a bunch of elementary schools and middle schools. They have amazing community partners in the area to talk about science with young kiddos and give them the opportunity to experience science in an authentic way.
Emily Bolger:
Our biggest outreach event of the year, which just happened a couple weeks ago, is Girls Math and Science Day. So we invite middle schoolers to come to MSU and participate in activities led by MSU researchers. It's a really wonderful event. This year I think they had like 150 students come, so coming back from COVID, really big numbers again. And so it's a great organization to be a part of.
Emily Bolger:
Our e-board spans a wide breadth of disciplines, so it's become a nice collaborative effort across MSU.
Mari Dowling:
Yeah, and I'm sure for the young girls as well, being able to meet women across a bunch of different professions and seeing all these different opportunities that they can have, I think that's really great.
Emily Bolger:
Exactly, yeah. And especially having people from MSU participate from such a variety of disciplines, it gives them ideas of what they can do and ideas of what they can study and what their life can look like as a scientist.
Mari Dowling:
Thank you so much. We really enjoyed having you here.
Emily Bolger:
Thanks for having me.
Dimitri Joseph:
Thank you. Thank you for inspiring the next generation of scientists.