Behind The Bots

In this interview, criminology professor Rachel Lovell and data scientist Jiaxin Du discuss their AI project analyzing bias in police reports of sexual assault cases. They share how they used natural language processing and statistical machine learning methods to uncover troubling patterns in thousands of police reports. Their findings point to implicit bias against certain victims based on race, age, and other factors. The project demonstrates the power of AI to identify systemic bias and has major implications for improving policing practices and achieving justice for victims.

DONATIONS

https://clevelandrapecrisis.org/support/donate-now/
https://give.rainn.org/a/donate


THE PROJECT

https://www.sciencedirect.com/science/article/abs/pii/S0047235223000788
https://www.sciencedirect.com/science/article/abs/pii/S0047235223000776
https://sites.google.com/view/nlp-for-rape-reports/lexicon


RACHEL LOVELL

https://expertise.csuohio.edu/csufacultyprofile/detail.cfm?FacultyID=r_e_lovell


JIAXIN DU

https://www.linkedin.com/in/jiaxin-du-a3861134/


FRY-AI.COM

https://www.fry-ai.com/subscribe
https://twitter.com/lazukars
https://twitter.com/thefryai

Creators & Guests

Host
Ryan Lazuka
The lighthearted artificial intelligence journalist. Building the easiest-to-read AI email newsletter. Daily Twitter threads about AI.

What is Behind The Bots?

Join us as we delve into the fascinating world of Artificial Intelligence (AI) by interviewing the brightest minds and exploring cutting-edge projects. From innovative ideas to groundbreaking individuals, we're here to uncover the latest developments and thought-provoking discussions in the AI space.

I am an assistant professor of criminology at
Cleveland State. My background, my PhD is in

criminology, or in essence sociology. I
specialize primarily in gender-based violence.

So sexual assault, human trafficking, intimate
partner violence. Since 2009, I've worked

primarily in the applied field, working with
primarily funded projects. As a methodologist,

as an evaluator, working a lot with criminal
justice and public health, evaluating a variety

of projects. And so I have a lot of funded grants
as a result of that. And so the combination of those

things is kind of what brought about, or what was
the impetus for this grant. So I'm Jiaxin Du,

currently a PhD candidate at Texas A&M
University. My background is information

systems, and also data science and informatics. So
with this project, I mainly focus on the natural

language processing part, using large language
models to process the data and understand

how those information systems, those
information technology changes in the

departments, and the reports in their various forms
change. And I also focus on the statistical part, to

get evidence of how police write their reports.
Very cool. So how did you two meet and start working

together? So you can probably guess from the two
introductions how we pair up there,

right? So I have the background specifically
within sexual assault and the connections,

specifically within the Cleveland Police
Department and Cuyahoga County Prosecutor's Office

to get access to the data and the research
questions and so forth. And I got connected to

Jiaxin's doctoral advisor when he was a faculty
member at Kent State. When I was at Case Western

Reserve University, and then as faculty often do,
we moved around a whole bunch. And so then his

advisor went to New Jersey, and then Texas A&M, and
then I went to Cleveland State. So we all moved

around a whole bunch, but the connection there is
that the grant, this project was funded through a

grant that I was the principal investigator on. I
wrote the grant, came up with the idea for the

project, but knew enough about the methods to know
that we could probably do this, but not

necessarily knowing what we would really find
with that. I hadn't found anything that had been

done. As we sort of talked about, AI and machine
learning with criminology data and criminal

justice data is fairly new. They're really behind
the curve in almost all forms of technology, but

certainly with machine learning. And especially
when I wrote the grant, which was in

2017, there wasn't much out
there in the criminal justice world around that.

So, Jiaxin was the grad student
I worked most closely with, and he

did the day-to-day analysis on all the data. So,
and what I think is a really unique pair is I think

you really need often two people, one that
specializes really well in the data, and one that

specializes really well in the topic to be able to
do that. There were often some challenges because I

know the topic really well, but I may not know the
method really well, and he knows the method really

well, but doesn't necessarily know how to
translate that method to the topic. So, I think

we made a really good pair, so that's why I thought
he would make a really good addition to the

conversation of being able to have both of those
sides to this discussion. Awesome, and how did you

get involved in criminology, and it sounds like
you have a big interest in that, and that's what

sort of got you to write the grant in the first
place? Yeah, so starting in late 2014, I was

working at Case Western Reserve University, and
we were approached by then-county prosecutor,

Tim McGinty, who had just formed a task force to
investigate and follow up on untested rape kits.

So, he formed an initiative to address untested
rape kits. They had started in the county to

inventory them and submit them, and this
initiative was formed to investigate the DNA

testing, to follow up on the testing, through
investigations and prosecutions, and so he

reached out to us to be his research partner on
that project with an initial pilot project that

was funded actually through the Cuyahoga County
Prosecutor's Office. At the time, we didn't know

it was a pilot project because we didn't know more
funding would be coming, but then the federal

initiative came along in 2015, and so we were ahead
of the curve on being able to get some of that money

and go in with the Cuyahoga County Prosecutor's
Office to be their research partner. Since then,

we have been their research partner on a very large
number of grants since 2015, and so the data from

that and the work that we've done since 2015 really
led up to the work that we were seeing here because

as part of that work, we were reading those reports
as humans and kind of detecting stuff that we

thought might be there. We were like, hmm, why is that in a
police report? Like why are they saying it this

way? Like why are they talking about this? Why did
they mention these things but then don't follow up

with certain things? And so we were noticing that
as humans. I knew we could likely get access to a

very large number of reports, which is one of the
things we really needed to do this project. I was

beginning to learn a lot more around what the
capabilities were around AI and machine

learning, what we could do with all this text data.
And so those forces all kind of came

together, and I thought, well, we already
know the outcomes of these cases because these are

older cases, these are cold cases. So if we could
get, let's say, what ended up being about

6,000 rape reports, and we have the police reports
from those rape reports, and we know what happened

with those cases, could we see if the language in
those reports was predicting what happened

with those cases, whether they
closed early or not, or how they closed, or how

different kinds of cases were written for certain
types of victims, for younger victims, older

victims, black victims, different kinds of
things. And so that was sort of the idea behind

that. And we called it signaling,
basically because that's what we thought it was.

Like when we were reading them, we were like,
they're not outright stating it, but it seems like

they're kind of trying to talk about certain
victims in certain ways that would indicate that

maybe a police officer shouldn't
really spend a lot of time following up. So that's

kind of the idea of where we got started with it.
Okay. Yeah. One of the things that stood out in your

article, in the article about your work is when the
police officer was interviewing somebody, you

know, a rape victim at the time, she was laughing
and he sort of blew it off because he thought she was

just making fun of something, but
then the article went into more detail. It's

like, well, that's maybe how she was handling it at
that time, because it's a traumatic event.

And that happens a lot, that cases like that
get blown off because the officer at the time

doesn't think it's serious. And also the initial
report is like very important to the

investigation. Is that true as well? Yeah. And
there really hadn't been any research to really

try to talk about that. There'd been a lot of
research about all these other characteristics,

but, you know, we know, especially, you know, the
power of language and the power of words. And that

as someone in criminology, I know that police
investigators often don't actually see a victim

first. They see a report first. So it's a patrol
officer or someone who responds to the call first,

who goes and takes that initial report and then
details all that information and then forwards

that information to a detective to follow up. So
the detective first reads the

report and often doesn't actually talk to the
victim before reading the report. So we had the

idea that like, yes, that what was contained in
that would make a big difference, especially in a

place like Cleveland where, you know, an urban
jurisdiction where there's often heavy case

loads. So they're going to be making decisions
about which cases to follow up on based upon the

contents of those reports. And so, yes, we had read
those and, you know, oftentimes officers seem

confused about victims' behaviors or what they
expect those behaviors to be. And so they're

trained to sort of note that in reports, like note
most pertinent facts or note this. And so noting

something that like seems odd to them would seem
normal to them, right? So if it seems odd that

someone is laughing, it would seem like, well,
that seems odd. So I should note that. But now

we know that that's actually a
fairly typical response, or that people don't have

a stereotypical response to trauma, and noting
those types of things and many other things could

have also played a role in that. Yeah, very
interesting. So when you got the grant,

what was the game plan? Once you
got the money to do everything, how did it look on

your end? Like, did you, was the first step to get
all the reports and then sort of analyze all the

data? Was there a thesis behind the project you're
doing? Things like that. Yeah, so we knew we could

get access to, like I said, almost 6,000 rape
reports, which was really amazing. Often

researchers just don't get access to that. And the
reason why we could is because we were already

researching them with the County Prosecutor's
Office. Like we already had access to them. We were

already working with the Cleveland Police
Department and the Prosecutor's Office. And so we

had already had our permissions there. We already
had access to their systems. However, and I think

this might be of interest to some of your readers or
listeners, and I guess readers on the... Yeah,

that's right. Yeah, the way people think
criminal justice systems work and how they

actually do are two very different things. Like,
CSI has given them a very weird sense of how

police actually use their electronic management
systems. And so they're very outdated. So the way

that the Prosecutor's Office electronic
management system is set up, which is what we had

access to, was that they put these reports in
basically like a folder. And each case had its own

folder. So that means that we had to go into all
6,000 files, find the report. Physically, like

literally go into each of the 6,000. Or actually
there was more, there was about 6,400. Not all of

them had the reports that we needed and other sorts
of things. So the full sample, full population was

6,400. And we had to like find the, go into the
folder for that report. We had a full list of cases,

find that folder, pull that report out, save it,
save it, save it. That took forever, right? Like

scan it, like you scan it into a computer,
pretty much? You save it as a PDF into Box, into,

like, you know. And then we gave it over to Jiaxin,
and we told him, okay, can you automate

turning these from PDFs into text files,
right? And what are all these, are all these

documents at different locations too? Or are they
all in one area in Cleveland? Like did you have to go

all over for the 6,400? No, luckily they were all in
one place. They were all in the prosecutor's

office database. So we just had to have a full list
of all their files, which we had. And so we had to

just go one by one into their system and look them up
and get the police reports out and then save them

into our system. And then, you know, but these
cover over two decades worth of reports. So some of

the reports are a mess. Some of them are like copies
of copies of copies. Some of them are handwritten

field notes. Like, you know, you can imagine like
what, over 20 years of police reports might look

like. Reading the handwriting would probably be a
difficult one. Yeah, some of them were so bad that

we actually had to use dictation software,
where we just read them into dictation

instead of typing them. We had to do a
variety of things to try to do that. But he was able,

in the amazing way that only computer scientists
and other sorts of really smart techie people can,

to convert those in a much faster, automated sort
of way. Jiaxin, do you want to talk about that? You used

Python to do that, correct? Yes, and the OCR, the AI
technology, can recognize handwriting. Even

if it's images or copies of copies, the AI can
recognize that. So it really accelerated this

process. Seems like everybody's using Python
these days. Rachel, you sort of spearheaded this.

So did you give direction to everybody on what to
do? And then how many people were helping you at

this time? Because it sounds like it's a big amount
of work to do. Yeah, so I had a team of about four or

five people. Not everyone worked full time on the
projects, like we had other projects as well. But

yeah, so we had four or five people on the team
working at various times to convert all of that.
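The conversion workflow described above (one folder per case, a saved PDF in each, then an automated pass to plain text) can be sketched roughly as follows. This is only an illustration under stated assumptions: the interview says Python and OCR were used but does not name the OCR tool, so the `extract_text` argument is a hypothetical placeholder the caller would supply, and the function and file names are invented for the example.

```python
from pathlib import Path
from typing import Callable, List


def convert_case_folders(
    source_root: Path,
    output_dir: Path,
    extract_text: Callable[[Path], str],
) -> List[Path]:
    """Write one .txt file per PDF found under source_root.

    Returns the PDFs whose extracted text came back empty, so a person
    can handle them by hand (for example, by typing or dictating them).
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    needs_manual_review: List[Path] = []
    for pdf_path in sorted(source_root.rglob("*.pdf")):
        # Hypothetical OCR step: the caller decides which engine to use.
        text = extract_text(pdf_path)
        if not text.strip():
            needs_manual_review.append(pdf_path)
            continue
        # Name the output after the case folder plus the file name,
        # e.g. case_001/report.pdf becomes case_001_report.txt.
        relative = pdf_path.relative_to(source_root).with_suffix("")
        out_name = "_".join(relative.parts) + ".txt"
        (output_dir / out_name).write_text(text, encoding="utf-8")
    return needs_manual_review
```

Returning the files that produced no text, rather than silently skipping them, mirrors the point made in the interview that some reports were too degraded for automation and needed manual handling.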

Then we had to go through a very extensive like
quality control process because the conversion

didn't always convert correctly. And as you could
probably guess, police officers

didn't use spell checker or didn't always write
well. They were

typing these up, sometimes with literal
typewriters, or it was before much of the software

that would be able to fix problems
grammatically. So sometimes they weren't

conversion errors, they were just typos. And also
police reports are full of abbreviations. Like

that's how they write reports, right? So they're
like, so even trying to get the computer to

understand a gazillion different types of police
abbreviations took a little bit, right? Because

we had to like work through what does LT mean? Does
it mean lieutenant or does it mean light? Like in

light skinned when they're talking about a
suspect? Very interesting. So did you have police

officers help you convert all those
abbreviations as well? Like when you had a

question, you just asked somebody? Yeah,
sometimes. So since I was working with a lot of the

individuals from that task force or former
Cleveland police officers, some I would ask, some we

looked up, some I just know because I've been
reading police reports for so long. And so yeah,

through a whole series of them, we had a running
sort of Excel list of all the

possible abbreviations and what something
could mean. And actually, as part of our final

report for this, we actually documented a long, sort of
detailed process. And we wrote up all this stuff

because what we wanted to do was kind of
memorialize this process for other researchers

who would maybe go through this, because with a lot of
electronic management systems, or they call them

LERMS systems for law enforcement,
they have the

electronic data, the text data, in their systems,
but their systems don't allow them to share or

print anything off other than PDFs. So there
is really no sharing of electronic data. Does that

make sense? Even if, yeah, if it was shareable,
like are you the only person that could have access

to this data or can the general public go in and get
this data too if they wanted to? The data

specifically for this project? Correct, yeah. Or
I guess, or, and then on top of that, maybe police

reports in general, like can people go in and if
they want to research on their own, can they do

that? Well, it depends on the laws
in each state. So in Ohio, at least, incident

reports are a matter of public record. So you can
get those, but no police department would give you

6,000 of them because they would have to pull them
one at a time. So they would deny that records

request, because
they would literally have to go in. They

can't just say, give me all reports for
all years, and then print them all. They'd have to

go to each one and pull the report and then
send it, you know. So you'd

have to create over 6,000 requests, pretty
much, because you'd have to do one for every one.

Right, or some variation. I mean, I think you could
probably request it in smaller batches, but

they will deny FOIA requests for larger ones
because it's just too time consuming for them to

do. But that had already been done for us
because they were already all PDFs. We just had to

like take them and make them text files. To get to
your other question, the data from this as a

requirement of the federal grant is archived with
the National Criminal Justice Data Archive, but

our data, you know,
is some 4 million words from rape reports that

really cannot be fully
de-identified. Jiaxin did a really

great job to try to de-identify the data, but the
more you de-identify the more you remove the

substantive words. Right, so you could say remove
all numbers, right? Because that would remove

addresses, but then if you remove all numbers,
then you remove all dates, you remove all first

person, second, you know, like you're removing
all the substantive information that then makes

the report make sense. Okay. So they actually have
a very sensitive version of an archive called an

enclave, where it is archived, but someone has to
get special permission and you have to go through a

certain, you know, a pretty rigorous process and
you have to physically go there to access that

data. So that's where you're going to go to for the
next year. Gotcha. So it would be a little bit

different, but you, if someone really, if some AI
expert out there wants to go look at this on top of

what Rachel's done, they could do that. Great. And
then so you had, now at this point, you've got all

the data into the system. Jiaxin has got this data
as well. Like how the heck do you go about at this

point, like filtering out what you want? It sounds
like the first thing you guys looked at is bias, if

that's correct. Yeah. So we started with
sentiment analysis because of the idea that

police reports aren't supposed to have
sentiment. I mean, by the structure of them, they

aren't supposed to have opinions. And, you know,
we knew that sentiment analysis and other types of

natural language processing were really built
using, you know, internet data or

consumer data,
when you're talking about how much you like an

iPhone or other sorts of things. The
structure of it was not really designed for

something so formulaic as a police report. So we
knew the technique could fairly easily be

applied. And Jiaxin did that really well. You
have these public dictionaries, right?

Jiaxin, you can talk about those dictionaries if
you want, for your audience, folks that

might know more about those dictionaries. But
there are public dictionaries that you can use that

will score how much opinion or sentiment is in your
text. But those weren't designed for police

reports. Police reports are some of the most
formulaic data that you can possibly get. And so we

found it was very formulaic. However, it took a
little bit for us to kind of weed through the

formulaic to get to the sort of meaty part of those
reports. I can speak about the methodology as

well. So two years ago, there was no ChatGPT.
There were plenty of large models, but

those large models are black boxes, which did not add
knowledge to the bias analysis of the police

reports. So instead, we took very traditional
natural language processing methods, mostly

statistical methods. That means normal
documents, normal police reports, will usually say rape,

rape, rape. And if some weird words
appear, and you see there's a very low probability

that those words would appear, we can calculate those
probabilities and say, okay, here are the strange

words or phrases we've detected. And those words
pop up, and Rachel and those experts have

more knowledge, so they can analyze why
these weird words appear. So once the weird

word appears, you sort of just gave that to Rachel
at that point and said, hey, there's an anomaly

here? Yes, and we looked at those depressing reports
together, but Rachel is very positive and she

leads this team very well and she has all the
knowledge to analyze those. Yeah, so we did find

some really interesting things preliminarily
around that, like tangentially related words. So

things like basketball, which was in one of
those. It didn't appear in a whole lot of

reports, but we were like, why is basketball like
showing up as being predictive, right? So we

looked up the reports of like, okay, what's the
context of the word basketball, right? It's not

happening a lot, but it seems to be quite
predictive of a more negative. What, like when you

say predictive, so basketball, the word
basketball actually came up in the report, is that

what was flagged? Mm-hmm. Okay, gotcha. And so we
were, because we knew which cases didn't go

forward, right? So we were like, oh, it's not
happening very often, but when this word appears,

it seems like it's signaling something, but what
it was actually doing is it was sort of

tangentially related to the nature of the crime.
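One simple way to see how an uncommon word can look "predictive" is to compare the share of stalled cases among reports that contain the word against the share across all reports. The sketch below is an illustration only, not the study's actual method: the function name, the `min_reports` threshold, and the idea of a boolean "did not go forward" label are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def outcome_gap_by_word(
    reports: List[Tuple[str, bool]], min_reports: int = 2
) -> Dict[str, float]:
    """For each word, return (stalled rate among reports containing it)
    minus (stalled rate across all reports).

    `reports` holds (text, did_not_go_forward) pairs. Words appearing in
    fewer than `min_reports` reports are left out as too rare to compare.
    """
    if not reports:
        return {}
    containing: Counter = Counter()
    stalled_with_word: Counter = Counter()
    for text, did_not_go_forward in reports:
        # Count each word once per report, however often it repeats.
        for word in set(text.lower().split()):
            containing[word] += 1
            if did_not_go_forward:
                stalled_with_word[word] += 1
    base_rate = sum(1 for _, stalled in reports if stalled) / len(reports)
    return {
        word: stalled_with_word[word] / containing[word] - base_rate
        for word in containing
        if containing[word] >= min_reports
    }
```

A large positive gap only says the word co-occurs with stalled cases. As the basketball example shows, a subject-matter expert still has to read the reports to decide whether the word signals bias or simply reflects a pattern in who is victimized and where.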
So it wasn't basketball, nothing about

basketball. It was that they were at basketball
courts or leaving basketball courts or had just

got done playing basketball or other sorts of
things. So as a victimologist, what that says to me

is that, it's picking up the fact that these are
kids, that they're youth, they're adolescents,

and they're playing basketball, right? Like the
fact that, so it's not basketball, it's picking up

on a common factor, you know.
It's picking up that they're

kids or adolescents that are outdoors
and outside. So it's picking up the fact that, you

know, girls in particular in certain
neighborhoods are more likely to be sexually

assaulted when they're outdoors, outside by
strangers. And we have published some work on

that. And so where they were coming to and from
didn't matter as much as the fact that the word was

saying, oh, this is a risk factor just
because they're outdoors and outside in certain

neighborhoods. And that happened to be a common
theme. So like, so that was one of the things where

it was like, oh, well, this isn't really a
signaling word. This is just picking up like

victimology patterns in the data. I'd imagine,
I mean, my mind is just sort of

going on a tangent here, but
don't words come up that sort of tie specific cases

together as well? Like if there's a guy that's, you
know, unfortunately raped 10 or 12, 20 girls, I

would think that the report might have words that
come up that sort of corroborate that as well. Does

that actually happen or is that just- No, we'd like
to think so. But the reports often are

not detailed enough to do that. We
even have some publications on one rapist who

was connected to 22 rapes by DNA. So like, you know,
like these are stranger rapes. They would never

have been connected. They're so very different.
They wouldn't have been connected had it not been

for DNA. And he was so different in all of them, they
wouldn't have connected. So like the report

certainly, if you read them, you wouldn't have
thought that they were connected to the same

person. He wasn't, he didn't say the same things.
He didn't act the same way. So I think you may be able

to pick up some if it's something really unique or a
particularly unique MO or like particularly

graphic or particularly like extreme fetish or
extremely like gratuitous violence or other

sorts of things that might stand out. But those are
actually, you know, a much more uncommon type of

rape. So then you guys, you get these, you get these
common words that come out and then, like what does

that lead to? So those words pop up. Is there
anything else in Jiaxin's work that sort of

helps you to put things together to see which cases
maybe got pushed under the rug that should be

prosecuted to this day or like brought them back to
the forefront? Yeah, so we went through a whole

series of things where we were able to get this
sentiment and the subjectivity of the reports

using these public dictionaries. But then we
had to sort of make sense of what that meant. Like

what does it mean to have subjectivity in a report?
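Dictionary-based scoring of the kind mentioned here generally works by looking up each word of a text in a lexicon and averaging the values found. The sketch below is a minimal illustration under stated assumptions: the interview does not name the public dictionaries used, so `TOY_LEXICON` and every number in it are made up for the example, and `score_report` is a hypothetical helper rather than the project's code.

```python
import re
from typing import Dict, Tuple

# Made-up entries for illustration only: word -> (polarity, subjectivity).
# Polarity runs from -1 (negative) to 1 (positive); subjectivity from 0 to 1.
TOY_LEXICON: Dict[str, Tuple[float, float]] = {
    "scared": (-0.6, 0.9),
    "forced": (-0.7, 0.5),
    "known": (0.0, 0.3),
    "runaway": (-0.4, 0.6),
}


def score_report(
    text: str, lexicon: Dict[str, Tuple[float, float]] = TOY_LEXICON
) -> Dict[str, float]:
    """Average the polarity and subjectivity of the words found in the lexicon.

    `coverage` is the fraction of tokens the lexicon recognized, which shows
    how much of a formulaic report the dictionary actually speaks to.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[token] for token in tokens if token in lexicon]
    if not hits:
        return {"polarity": 0.0, "subjectivity": 0.0, "coverage": 0.0}
    return {
        "polarity": sum(p for p, _ in hits) / len(hits),
        "subjectivity": sum(s for _, s in hits) / len(hits),
        "coverage": len(hits) / len(tokens),
    }
```

A score like this is just a number per report. It says nothing about why a report scored the way it did, which is the interpretive question raised in this part of the conversation.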
Like in a report of rape, what does it mean

for an officer? Because you
have to remember that in a report of rape, you know, a

report of a crime, an officer is often writing in
third person. So the victim states this, and this

happened, in these cases, to her. So
that's the subjectivity part, like they're

personalizing the part of what happened to the
victim, as compared to, you know, victim is a known

runaway, victim did this, victim did that. That's
not a personalization. But the third person as

well, like they're writing these reports in the
third person, but really it's a mask for

what they're thinking, like they could write it in
first person. Yeah, because they have their

subjective bias when they're writing a report. And
some of the things that were

the most damning in the reports were very short
reports. And they were the ones that were written

very factually, which is not really what we
expected, where it was just a series of short

victim-blaming statements, where it was like,
victim is a known runaway, victim is a known crack,

you know, cocaine abuser, victim can't
remember this, victim doesn't know this, victim

doesn't, you know. So it's like a whole series of
statements about what the victims

know, do, or say, or statements about, you know, some
characteristic of the victim. And those were

really, and especially when they're short, were
really the most damning of all the reports. And the

sort of best reports, best in terms of better
outcomes, were the ones where they

wrote about the real nature of a rape, which is
that she was scared. You know, they

wrote about the statutory elements of a rape. So, you
know, police are actually also supposed to

include the elements of a crime. And so, you know,
you're getting all those details about what

happened to whom and how much, but also what
actually, you know, she was scared, she did this, he

forced, you know, like you're capturing a lot of
those details. And then those were some of the more

successful cases, basically writing as if
they cared, that they wanted that case to go

forward. It's also from the textual analysis. So
we did all the textual analysis you can

think of on the police reports: the length of
the reports, how many sections they write,

because for police reports, there are different
sections, from the narratives to the follow-ups

and all the different things. And we
also calculated how many typos they make. So all those

grammar issues and sentiment, subjectivity, and
also all the bag-of-words models, all of them. And we

picked up those signaling parts and wrote that
in the reports. Very cool. So I had a question,

Rachel or Jiaxin, whichever one of you thinks
would be better to answer this, or a little bit of both is

fine. So we touched on a little bit about how AI is
being implemented into the project. I just want to

know with a little bit more detail, maybe behind
the scenes, what kind of AI-powered

resources are you using? How are they helping? And
then do you see any limits with the sort of AI

sources that you're using? So when we talk about
AI, we generally think about deep learning. At the time

we wrote the reports, 2020, models like BERT,
those language models, were getting popular. But

the pain point with those deep learning AI models is that they're
black boxes, although there are some explainable

techniques developed later. And we found, though,
that I can just send this document or send the

sentences to the language model, and we label them.
Okay, so this sentence is biased. That sentence is

not. And the AI model can detect, okay, so this is
the biased sentence, but the model cannot tell you

why. So that is really a pain point. And also there
are limitations. So it's still a limitation in the

AI community, like the length of the document you
can input to a model. At the time, we could only do a

segment. We could not do the whole report. Some
reports generally are 500 words, 2,000 words, and

the model cannot handle them. It's still a problem
nowadays. So instead of those deep learning

models, in the end we took the more traditional
machine learning methods like Naive Bayes and

support vector machines. Those are more
traditional, but they can just treat all the

documents as a bunch of numbers and run the
statistics and have a very good explanation,

because then we will have the words or phrases we
tried. So unigram means one word. Bigram

means a two-word phrase. Trigram means a three-word
phrase. So we chunk

the documents into different lengths of phrases.
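To make the n-gram idea concrete, here is a small, self-contained sketch of splitting text into one-, two-, and three-word phrases and feeding the counts to a Naive Bayes classifier. It is written from scratch for illustration and is not the project's code: the class name, the whitespace tokenizer, and add-one smoothing are assumptions, and the support vector machines also mentioned in the interview are not covered.

```python
import math
from collections import Counter, defaultdict
from typing import List


def ngrams(text: str, n: int) -> List[str]:
    """Return every run of n consecutive words, e.g. n=2 gives bigrams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text: str, max_n: int = 3) -> List[str]:
    """Unigrams, bigrams, and trigrams of a text, all in one list."""
    features: List[str] = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(text, n))
    return features


class NgramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, texts: List[str], labels: List[str]) -> "NgramNaiveBayes":
        self.doc_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(featurize(text))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        self.totals = {
            label: sum(counts.values())
            for label, counts in self.feature_counts.items()
        }
        return self

    def log_likelihood(self, feature: str, label: str) -> float:
        count = self.feature_counts[label][feature]
        return math.log((count + 1) / (self.totals[label] + len(self.vocab)))

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        known = [f for f in featurize(text) if f in self.vocab]

        def score(label: str) -> float:
            prior = math.log(self.doc_counts[label] / total_docs)
            return prior + sum(self.log_likelihood(f, label) for f in known)

        return max(self.doc_counts, key=score)

    def top_features(self, label: str, other: str, k: int = 5) -> List[str]:
        """Phrases that most favor `label` over `other` (log-likelihood ratio)."""
        def ratio(feature: str) -> float:
            return self.log_likelihood(feature, label) - self.log_likelihood(feature, other)

        return sorted(self.vocab, key=ratio, reverse=True)[:k]
```

The `top_features` method is the reason a model like this is easier to explain than a black-box network: the fitted model can list the actual words and phrases that push a report toward one outcome, which an expert can then read in context.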
Then we see which kinds of phrases are more predictive in our

final analysis. So yeah, we really wanted to look
at, like we thought using both methods adds to this

much bigger picture, which is why we did something
really unique with our publications. Normally

you wouldn't submit two papers at the same time to
the same journal. And I, it was because I

approached the editor of the journal and I was
like, you know, we can't possibly put this into one

paper. It's a much, you know, this is a big, you
know, it's a big study. It's too much there, but you

don't, you can't really do one without the other
because one of the papers is looking at this

sentiment, right? The, the meat, you know, these
like trying to score the value, the words and the

opinions. And one of them is really looking at the
actual words and phrases, the trigrams in them and

saying, what are the actual phrases that are
predicting, that are most common in cases of

runaway, runaway victims or cases in the
unfounded, and unfounded cases are where an officer is

supposed to investigate a case and rule that a
crime didn't occur. So they're basically saying,

you know, something didn't, rape didn't happen. Or
with cases that, you know, a prosecutor said they

would indict, which are the most successful cases
in our sample. So you really kind of needed both

because one just tells you a score, but doesn't
tell you what the text is, right? And one tells you

what the text is, but doesn't tell you anything
about like how it goes in with the other part. And so

we wanted to at the same time to tell the bigger
story. And so that's why we submitted both at the

same time. And I think that it's a very
complementary sort of viewpoint using

different, you know, different methods. So you,
you have, you have a score for every report. Okay.

We have multiple scores. Yeah, we have multiple
scores actually. Yeah. It's not just one aspect of

them. But is it, is it sort of like when you, when you
guys analyze this data, Jiaxin started analyzing

the data with, with deep learning and other AI
things as well. Did, Jiaxin, did you sort of, you

discovered a lot of this stuff on your own. It
wasn't like through Rachel, because you, you saw

the data, right? It was coming to you. So you sort of
helped Rachel as well with this. It wasn't just

like coming through Rachel saying, I need
this, this and this, the data sort of presented

itself to you. And it made a lot of revelations, I
guess, to both of you at that time, to help you

understand what's going on here. Yeah. So I had
zero criminal justice background. Okay.

Although my father is a police officer, but that is
a different story. He did not teach me any of those.

Okay. But I know all the, I didn't know that, Jiaxin.
I don't think you ever told me that in all these

years. That's a big, big, big piece there. And I'm
not. Yeah, that's a 25,000 or report. So I know all

the natural language processing techniques and
those are just programming. So I can try all the

established methods and I can also develop some
new things for the police reports. I think that's

one fascinating thing about those automation or
computer science is we can just start trying them.

And after we got all the results, we can try to
explain them also. And because I touched the data

so I can read them myself and I think, okay, so here
is the distribution of the data. And here are some

common typos. So I can choose, I think the best
method to it. I can choose based on the number of

words, the length of the reports, if there is
language style, because the language style

means, do they really care about grammar? And it's
like states, victim states, victim stated, are

those two different things? And we discussed
this. Okay. So those tenses, those different ways to

manipulate the language, do those reflect the
police officer's opinion? And this is really new

and only interdisciplinary research can find
those research questions here. I will give a

really great example, because he's probably not
going to brag on himself very much. But initially

in the proposal, I had written to do topic
modeling, which was the idea of kind of taking

topics or kind of, you know, getting to the text.
And we did try that, but it wasn't working out very

well. The topics were kind of all like kind of
weren't showing up very well. And it was Jiaxin's

idea, because of his knowledge of the methods much
more than me. He's like, here, I think here you

should be using trigrams instead of this other
thing. And we can do it around these predictive

outcomes. And so I think, you know, I knew about the
technique of topic modeling. It wasn't working

very well. So his ability to sort of pivot and find a
technique that would work better for these data

really shows that complement of having our, you
know, our pair that we do. Yeah, that's awesome.

It's awesome. It seems like you guys have really
complementary strengths. Like you said, at the

beginning, that really helps this. I was
wondering about what's your next step in

development? Like what are you currently working
on right now? So we had the two papers come out at the

same time. And we have a big report that we wrote as
part of the requirement for the grant that's

archived already and it's online. And so we're
working on at least two, maybe three more papers.

So some of the bigger findings that we found were
the runaway cases. Some of the worst

written reports were about runaway victims. And
we knew that going in because we had read a lot of

those reports, and those reports are some of the
saddest reports you've ever read. So we knew that

as a human reading them. We saw that and showed that
quantitatively. And so we want to really, you

know, show all of our methods there. So we're going
to write a paper on that. We want to dig in deeper to

the unfounded cases because we have such a large
sample. We did find a very interesting race effect

showing African-American victims' cases had, you know,
sort of worse outcomes, worse scoring. But 65

percent of the victims are African-American
because this is based primarily on Cleveland data

and in Cleveland, you know, African-Americans are
disproportionately likely to be victims of

crime. So we want to really dig in deeper there. I
think it has something to do with the victimology

patterns because we've published on subsamples
of these data, meaning that it's not just the race

of the victim, it's a combination of the race of the
victims and how they were raped, like were they

raped outside by a stranger. Like so it's because
you wouldn't find a pattern just for 65 percent

of your cases. That's way too much. Like
the effect would be washed out. So I want to dig in a

little bit deeper there. And then we also have an
idea. Jiaxin, do you want to talk about the idea

for sort of an extension of sort of a completely. So
we want to publish more because there are more

findings there that we didn't have opportunities
to publish off of. But Jiaxin, do you want to talk

about other ideas? Yes, I had discussed the
applications. So since we did this research on

police reports, we want to help police officers
write better reports. So our team is developing

training materials for police officers. And
personally, I'm interested in, can police officers

have like a dedicated model only for police
officers to help them write reports. You cannot

just connect to the Internet and allow ChatGPT to
revise your reports. But you can have something in

your police department to help you write reports.
So that's kind of my. I think a lot of police

officers are under, especially in Cleveland,
because I see it living here. They're under a lot of

stress, a lot of pressure, and this can be sort of
something that someone can look at and be like, oh,

the police officers suck. They should have done
better on this. But the training needs to be there.

That's probably a huge thing to help them out,
because they've got so much pressure of violent

crime that some things get pushed under the table
or some things got to give. So that's awesome that

you're doing the potential training for police
officers on writing the reports. One other thing

too is it sounds like you guys are doing all the
research on this. Is it possible to have someone

come along and open up old cases to prosecute
people based on these biases that the police

officer, like you can see objectively is
happening from your research? You mean like

prosecuting the police officers or prosecuting
people if you're finding the crime. Correct. Or

reopen the case or whatever the terminology is. So
I think, and also because I have other projects

with the Cuyahoga County Prosecutor's Office, the cold case
unit who are working on sexually motivated,

unsolved, sexually motivated homicides and
rapes that have DNA forensic evidence connected

to them. So I think there are some really great
opportunities to advance ways of thinking about

how police can access data and really put those
connections together in the ways that you're

talking about. Because logically that's what you
would want. Like you're thinking, why can't you

just search? Where's all the other cases in this
area and those sorts of things? But the systems are

often too cumbersome or that's not really how they
work in their day-to-day lives. But I think that they

could. And so really kind of getting folks up to
speed so that they can see those connections or see

even evidence connections or things that haven't
been tested. In other words, making the data, the

text available instead of pictures, which are what
they work with, they work with PDFs, which are just

pictures and file folders. And all their rich data
is just stuck in a picture that's not really

available to them. Yeah. It's the police. Like I
know, I mean, I'm nowhere near the experience you

guys have on this. But just dealing with police
reports in Cleveland, it's like you said, it's an

antiquated system and it's not opened up. So it makes
it hard for someone to come in and find out, find

data, analyze it. And then, you know, like I just
imagine if I, my wife or someone in my family was a

victim of rape, I'd want to go in, see your data and
maybe be like, try to tie the pieces together

myself to help the police out. So maybe that will
happen. But one day the data might open up

internally in the police departments to make it
easier for the detectives to actually analyze

data in a more objective way. So it's really
awesome what you guys are doing. I was just

wondering, and this is maybe what you're going to
get at is what is your vision for this project in the

long term? You know, you guys are putting a lot of
research into this, developing these systems.

How do you see these being used in the long term? Or
at least hope that they are used? Well, I mean, you

know, the work that I do, the best part of the work
that I do is that, is the connections to the

community and to police, you know, like, is the
access that I have to individuals to be able to give

the information from research to translate into
practice because of the connections. So you're

not just doing research and then it gets put behind
a paywall and an article that no one reads. So for

this one, we really wanted to do something
different. So, you know, for example, we created

infographics that we put online and other sorts of
things that were like, here are some recommendations.

So for example, I sent those out to some, like, just
as like beginning ones, like to folks that I know

who train law enforcement officers in, around
sexual assault. And I was like here, just use

these, see if they, you know, send them to your
folks. See if this is useful or helpful. Put it in

connection to the research. Show them the
connections there of where you're not, you're not

trying to tell officers, like you hate, you know,
you're terrible, you hate victims, because

that may not, that may be the case, but it's likely
not. It's that this writing appears in a way that

they, whether intentionally or not. And so, and
they were taught to write this way. In fact, most

officers are still taught to write in this very,
like, perfunctory sort of way that removes the

nature or the reality of rape from the crime
report. And so, we're really trying to get this now

into the hands of those that are doing this
training to make it very accessible to

individuals. But I think on the tech side, I think
there's a lot of things. I mean, right now,

basically, the technology is the same as it was
before, which is you just have an open text box. I

mean, before it was a typewriter and now you just
have an open text box, but that's how officers

write. Like there are no prompts. There are no
ways. There's no structure to it. There's no,

like, fields that, you know, like it is literally
an open text box where they're just like, right,

whatever you want. Just, you know, and so you can
imagine all the different ways that you would get a

police report from just an open text box where
there is no structure. And I think really with

technology, there's no reason to have to do that
because we can easily put prompts, you can put

structure. You can even use the stuff around
ChatGPT to help automate some of it and then have

that rich text and that rich detail about the rapes
where the officer puts that in. But at least

there's some standardization around the
elements of the crime. And it could prompt you to

really say, like, hey, you didn't include this.
Hey, you should rephrase this. Hey, we're going to

help you write some of this. So it's almost like a
combination of like Grammarly and ChatGPT or

something. Of course, I'm not a software
developer, nor do I want to start my own software

development company. But like that's
sort of what I envision would be very useful. OK,

and I think maybe some people in our audience might be
interested in writing something like that. So

that would be helpful. But yeah, but like you do a
survey online, right? There's certain prompts

and then, depending on what you write in the
first prompt, the second question might be different

for the next one, you know, so something
like things like that. One thing that I know I keep

harping on. But is there something like from your
research so far, is there any cases that came up

that you're like, wow, this one got pushed under
the rug? It should definitely be reopened. Has

anything like that come in front of you guys
that you sort of pushed to the police at that point

to say, hey, look at this one? Nothing because we
didn't. We we didn't look at the ones where they

weren't open already because the task force had
already opened them all. So we were actually

looking at them after the task force had already done
that. That's how we had access to the data. So in

fact, we had even a better viewpoint. We had the
viewpoint once DNA testing had shown it. And once

you had the ability of hindsight, so you had
forensic and you had hindsight and then you could

look back to that case and go, well, no wonder this
case didn't go anywhere. But also look at and so

there's that example in the Cleveland.com story
that we cite where, you know, there's that the guy

and there was that one report where it's like
victim's clothes aren't dirty and disheveled. And

you're like, well, why is that in there? Like
there's no context. Why are you saying that? Like

there's no context. Maybe it's because it was a
backyard rape, an outdoor backyard rape where

you're like, OK, well, maybe there's some context
for why you would say that. But you didn't like you

didn't say why you why that's important. So it
seems like you're blaming the victim or you're

not believing her. And so we call those like
unqualified statements. Like you put that in

there, but there's nothing to prompt the officer
to say, like, hey, you've got to explain why you're

saying this because that is not by itself
enough information. And so and so the victim got a

rape kit, the rape kit wasn't tested. Fast forward
many years later, I think it's eight years later,

the rape kit actually gets tested. P.S. It tests to
a guy who had just actually been arrested in the

act of raping a woman in the snow. A police officer
actually had to pull him off of a victim as he's

raping her in the snow in Cleveland. And he hit to
another rape kit and he had, you know, I think how

many he had five other sexual offenses in his
criminal history. So when you put that case now,

that report in the context of his, of the bigger
one, you're like, that's how these cases are going

to slip through the cracks because that report was
easily dismissed. Is it because it had that in

there? Maybe, maybe not. But like if we wrote that
better, maybe that, maybe that victim would have

had better justice to begin with. And maybe he
wouldn't have gone on to, you know, commit all

these other acts of rape. So it's almost like all
the research you're doing is to help like future

cases from here on out because the writing
of the reports is super important. Do you see this

potential, this sort of model having potential
beyond rape cases, maybe murder cases, theft, and

things like that? Yeah. Yeah. I think this, I think
if you're going to find signaling that was our

supposition, if you're going to find signaling in
any type of crime, it's going to be in rape reports

for all the reasons we can all imagine. But I think
you can, you can get the structure of writing

better reports. You can get at some aspect of bias
and some aspect of improving report writing for

almost all types of crimes because even the sort of
statutory elements of crimes, there's some good

research looking at machine learning. Not
somebody else has done some really good research

about like how very few police reports actually
are good at capturing this, the actual elements of

a crime, which is needed when a prosecutor
actually charges the crime because there's not

enough information to know what specific things
to charge and how many to charge with. And so like

even some of the more basic just legal stuff, not
necessarily even bias, but I think there's great

potential for lots of other types of crimes and
reporting on that and just learning ways to, ways

to analyze criminal justice data, to find trends,
to find connections, to find social network

analysis of even text stuff of trying to link words
and phrases to locations to, I think, there's some

really cool things that could be done with that as
well. If you mapped like phrases to locations to

people to victims, like, you know what I mean, and
some really relational sorts of things, you could

find some really cool stuff. Well, that sounds
awesome. Was there anything that we didn't cover

today that maybe you wanted to talk about? Or is
there a way that we can follow this development

further? Well, Jiaxin and I are going to be
presenting, we're going to be submitting a

proposal for ACJS, which is the Academy of
Criminal Justice Sciences, which will be in

March. The presentations will be in March. And
we're going to still be continuing to do this. I'm a

director of a research center and so we'll be in
Cleveland at Cleveland State. So we'll be

certainly distributing things on our social
media about that. And Jiaxin is about to finish his

PhD. And so we're excited for him to finish and go on
the market. So I'm sure he'll have some stuff

out as well on both his PhD, which is not on this
topic, as well as some of the stuff on ours as well.

Is there any like any kind of like donation sites or
anything like that that we can put in the video

below that people can link? Yeah, I would say two
things, especially when you talk about rape, I

always try to say for resources, so to maybe make a
link to the Cleveland, especially the Cleveland

Rape Crisis Center or the National Hotline.
There is a National Hotline, so especially for

people who may hear this or may be victims or may
know of individuals who are victims or may have

family members who are victims. If, you know, they
want to find out more information or, you know,

stuff that they heard is troubling. There are
national hotlines. The Cleveland Rape Crisis

Center has a chat line so they can chat 24 hours a
day. So there's some resources I want to be able to

provide individuals. And I would say, you know, if
someone, you know, donations or other things

should certainly go to victims organizations
that provide support for survivors. And so

organizations such as locally the Cleveland
Rape Crisis Center or RAINN nationally, which is

R-A-I-N-N, would be able to be those ones
that I would want to most emphasize. Yeah, well,

thank you so much for coming out today. Anybody
watching, subscribe to Ryan's and my free

newsletter. It's a weekday newsletter. It covers
the latest and greatest in artificial

intelligence. And then we do deep dive articles on
Sundays as well, going into projects like this

that are really awesome and emerging in the AI
space.