Recsperts - Recommender Systems Experts | #16: Fairness in Recommender Systems with Michael D. Ekstrand

In episode 16 of Recsperts, we hear from Michael D. Ekstrand, Associate Professor at Boise State University, about fairness in recommender systems. We discuss why fairness matters and provide an overview of the multidimensional fairness-aware RecSys landscape. Furthermore, we talk about tradeoffs, methods and receive practical advice on how to get started with tackling unfairness.

In our discussion, Michael outlines the difference and similarity between fairness and bias. We discuss several stages at which biases can enter the system as well as how bias can indeed support mitigating unfairness. We also cover the perspectives of different stakeholders with respect to fairness. We also learn that measuring fairness depends on the specific fairness concern one is interested in and that solving fairness universally is highly unlikely.

Towards the end of the episode, we take a look at further challenges as well as how and where the upcoming RecSys 2023 provides a forum for those interested in fairness-aware recommender systems.

Enjoy this enriching episode of RECSPERTS - Recommender Systems Experts.

(00:00) - Episode Overview
(02:57) - Introduction Michael Ekstrand
(17:08) - Motivation for Fairness-Aware Recommender Systems
(25:45) - Overview and Definition of Fairness in RecSys
(46:51) - Distributional and Representational Harm
(53:59) - Relationship between Fairness and Bias
(01:04:43) - Tradeoffs
(01:13:36) - Methods and Metrics for Fairness
(01:28:06) - Practical Advice for Tackling Unfairness
(01:32:24) - Further Challenges
(01:35:24) - RecSys 2023
(01:38:29) - Closing Remarks

Links from the Episode:

General Links:

Follow me on Twitter: https://twitter.com/MarcelKurovski
Send me your comments, questions and suggestions to marcel@recsperts.com
Podcast Website: https://www.recsperts.com/

What is Recsperts - Recommender Systems Experts?

Recommender Systems are the most challenging, powerful and ubiquitous area of machine learning and artificial intelligence. This podcast hosts the experts in recommender systems research and application. From understanding what users really want to driving large-scale content discovery - from delivering personalized online experiences to catering to multi-stakeholder goals. Guests from industry and academia share how they tackle these and many more challenges. With Recsperts coming from universities all around the globe or from various industries like streaming, ecommerce, news, or social media, this podcast provides depth and insights. We go far beyond your 101 on RecSys and the shallowness of another matrix factorization based rating prediction blogpost! The motto is: be relevant or become irrelevant!
Expect a brand-new interview each month and follow Recsperts on your favorite podcast player.

Note: This transcript has been generated automatically using OpenAI's whisper and may contain inaccuracies or errors. We recommend listening to the audio for a better understanding of the content. Please feel free to reach out if you spot any corrections that need to be made. Thank you for your understanding.

Fairness is what's called an essentially contested construct, which to a first approximation for my non-philosopher interpretation of what that means, means humans will never stop arguing about it.
Who is actually getting benefit from the system, and is it actually achieving my goal of ensuring that everybody has access to the ability to market their goods to get readers for their content, etc.?
If your system is unfair to particular groups of artists, they might just leave, in which case there's less content on your platform, which makes it less appealing to customers, and so you shrink the pie.
Bias is a mismatch between what we have and what we're supposed to represent.
And then some biases give rise to fairness problems, and so I use fairness in the social, normative sense: something is unfair when it misaligns with our ethical and social principles of what it means to be fair.
And a lot of times it's caused by bias.
Sometimes bias can be useful for correcting a fairness problem.
Approaching it with the perspective of humility and good faith and being forthright about the limitations that we see, that does mean that then reviewers need to respect the limitations and not just see it as a list to reject.
But if we're forthright about, here's what I did, here's the limitations, then as a community, we can have a conversation about how to do better in the next project.
Thinking clearly about the limitations of my current work has proven to be a very fruitful basis for developing next work.
Hello and welcome to RECSPERTS - Recommender Systems Experts.
In today's episode, everything will be about fairness in recommender systems, how to define fairness, how to assess and ensure it.
We will have references to multi-stakeholder and multi-objective scenarios.
And for this huge and important topic, I have invited a luminary and long-time known Recspert in the community.
And I'm very glad and delighted to welcome Professor Michael D. Ekstrand from Boise State University to the show.
Hello Michael, welcome to the show.
Hello, thank you for having me.
Yeah, I was very happy that you were open to that.
And I guess we have a very interesting and a very important topic for today's episode.
And as I said, I'm very glad to welcome you to the show.
Professor Michael Ekstrand is an associate professor for computer science at Boise State University.
And he obtained his PhD from the University of Minnesota in 2014, where he was part of the GroupLens Research lab.
And he is also a co-director of the People and Information Research Team, called PIReT.
He is also the co-author of the corresponding chapter in the Recommender Systems Handbook, the chapter on fairness in recommender systems.
As a researcher and as a teacher of recommender systems, I think it goes without saying that he has published lots of papers at many different conferences, namely SIGIR, UMAP, The Web Conference, CIKM and of course the recommender systems conference.
He not only published lots of papers at RecSys and other conferences, but he is also greatly involved with the community.
So Michael Ekstrand was co-chair for several different sections of the recommender systems conference.
He was program co-chair in 2022, when RecSys was held in Seattle, and also general co-chair in Vancouver.
And not only this, but he also is organizer and senior advisor for the corresponding workshop on fairness, accountability and transparency in recommender systems.
And last but not least, there's also a dedicated conference, the ACM conference on fairness, accountability and transparency, where Michael serves as executive committee member.
I mean, lots of things about you, but I guess you are the better person to describe yourself.
So I guess there might be some points missing.
So can you touch on a couple of points that introduce yourself to our listeners?
See, I think that captures a lot of it.
I've been spending a lot of my teaching and research work on recommender systems in general.
And then as I transitioned to an independent research career, I really started working a lot on understanding issues, particularly of fairness, but also other aspects of social impact of recommender systems.
I have been working on bridging that out more and more into information retrieval broadly, in addition to core RecSys stuff.
I spent a lot of time trying to work on building community to support research on these topics through the FAccTRec workshop, through leadership in both the RecSys and FAccT communities, and through other things like the TREC Fair Ranking Track.
I want to not only do good work on this topic myself, but I want there to be a thriving community that has the social infrastructure, the intellectual infrastructure and the data and computational infrastructure to really do a lot of work in a lot of different places over the next years on trying to make sure that these systems are fair and non-discriminatory and also just making good recommendations in general.
I've also been involved with co-teaching the recommender systems MOOC on Coursera, and with my work on the LensKit software for supporting recommender systems research and education.
I want to see healthy communities with the resources they need for studying these topics.
It's nice that you bring up the Coursera course because there is some personal anecdote I have to share here.
When I started, I guess, six or seven years ago with my master's thesis on recommender systems, I was implementing a vehicle recommender for a marketplace in Germany.
My very first starting point into recommender systems was actually doing the course that you, along with Joseph Konstan, held on Coursera.
That was kind of my first touch point with recommender systems.
So a personal thank you to you and to Joseph Konstan here.
Thank you.
We're always happy to hear how it's helped people get started on their careers or deepen their knowledge.
There are so many stories you've heard like that.
It was really a great resource.
Yeah, Michael, so fairness in recommender systems. You already said that you not only want to do good work, good research, and good teaching there, but also want to do community building to support people working on that topic, who may then, further along in their careers, end up, for example, in industry and can make use of their competencies on that topic to promote fairness in large-scale industrial applications, where I guess the effects of unfairness have a great impact on people, right?
So when I started studying recommender systems in the late 2000s, my first RecSys was 2009.
A lot of people in the community were talking about Chris Anderson's book, The Long Tail, which had the premise that there's a long tail of products that are now available, and that personalization, moving from just keyword search to personalization, is key to enabling people to find the products of interest in this long tail of products.
And it has this potential of before e-commerce and recommendation.
If you wanted to, say, make some niche artistic widgets, and you wanted that to be financially viable, you needed to have a large enough market in the craft shows you could drive to and in the people you could reach through mail advertising, magazine advertising, et cetera.
But with e-commerce, the world, or at least the nation, depending on what international taxes you want to deal with, is your marketplace, and you can sell your widgets anywhere.
And so the domain from which potential customers can come is greatly expanded.
And recommendation then is the key to basically matchmaking between people who are looking for things and the creators, whether it's the creators of physical products, the creators of artistic and cultural content that are producing the things that they're interested in hearing and maybe even things that they don't know yet that they're interested in.
And so when I started that very much in the ethos, there was this idea of recommender engines as a powerful vehicle for economic opportunity.
And as I internalized it also equity of that economic opportunity of like, there's the possibility here to create and sustain much more diverse ecosystems of creators because with a larger potential marketplace, more niche products can actually have enough of a market to be viable.
But the question is, do we actually achieve that?
And as I've been spending all of this time studying user perception of recommendation, studying and doing the work on the Coursera course and also my in-person teaching, kind of the question that started bothering me is, so I'm teaching people to build all of these systems and I'm studying how to build them.
Are they good for society?
Are they delivering on their promise or are they only delivering that economic opportunity to a few?
And we've been concerned in recommender systems for a long time with popularity bias, where the few who get a lot of attention just keep getting more attention and a self-reinforcing feedback loop.
But also then is the opportunity being distributed equitably across different people or is it concentrated along some of the same kinds of gender and racial and ethnic lines that economic opportunity, wealth and power have historically accumulated?
And so that sent me on this mission to try to figure out.
Was the notion of fairness embedded in your thought process on recommender systems from the very beginning, kind of already very naturally, or was there some kind of an inflection point where you really said, hey, there is something fundamentally going wrong in how we view these systems and look at them and optimize them, and where you said, okay, I want to do it differently? How has that evolved in your personal process?
So there was definitely an inflection point as I was finishing grad school.
Throughout grad school was a time of significant growth for me personally and just in terms of my understanding of the world.
And I became substantially more aware of how discrimination and inequity function.
And so as I was wrapping up my PhD, I was starting to think about how these things I was learning about society might connect with my work and the work of my peers.
I wasn't aware at that time that the fairness, accountability and transparency community was starting, because I think I finished in 2014 and I think that was the year of the first FAT/ML workshop.
But those were the pieces I was starting to put together.
And so I wanted to start to see, okay, so like these societal problems I'm learning about, do they show up in the recommendation technology?
And then over the next couple, two or three years developed that into what would be my primary, not my only research direction, but the direction of my personal primary research program.
Okay, I see.
From that point, so how did you drive that topic?
Was it already that during your PhD thesis, you turned your focus into a direction that addresses fairness in a wider manner?
Or how did you kind of operationalize that insight that you get from what you experienced and what you said?
It took a lot of time.
So my PhD didn't really have fairness in it aside from the usual kinds of diversification and maybe some popularity kinds of things.
But then as I was setting myself up as a faculty member and looking at what's going to be my new big thing that I'm going to work on as my own, that's where I tried and poked at a few different things, but really settled on that being the question that I wanted to pursue.
Then I had one of my master's students start working on a thesis that would give us preliminary results that we could hopefully use to try to get a grant, and work on developing that out.
And I had one failed grant proposal on it in those first couple of years, but kept working on it.
Particularly, I tried to figure out where could I start because there's many different places, but where's the one that I could practically start?
What's a domain where I can get some of the relevant data?
Settled through that on books and looking at gender fairness in book authors.
I mean, it's an important problem, but of the set of important problems, it's one where I could start to find data, where I could start to develop and pursue the questions in that kind of, in a setting that also would work with the recommendation tools I'd spent so long building.
So I had built myself the ability to do a lot of different experiments fairly quickly.
So then how do I build on that with available resources to start to make first progress?
And then from there, it kind of exploded.
So that, my RecSys '18 paper on author gender, which we then expanded for UMUAI, took, I think, three years from starting to work on it to finally getting it published, with some very helpful feedback from some negative reviews along the way.
But that was always intended to be the first thing.
It's like, okay, this, I can get some data.
I can start to work on this.
We can start to work on the problem space.
And then we studied it more deeply, and there were some others who were working on it at similar times: at about the same time I was starting to think about it, Toshihiro Kamishima did some of the earliest fair recommendation work with his item independence work.
Robin Burke was also starting to work on it from a multi-stakeholder perspective. But we came to realize just how complex the problem space is: there's not a simple fairness we can write down, particularly with the multi-stakeholder nature of recommendation and information retrieval in general.
The problem is very, very complex.
And so from the starting point, then we started to work on, okay, so what does the map of the problem look like?
And also recognizing there's some significant limitations in how we were approaching the author gender work in that we weren't measuring what we were hoping to measure.
And so how do we actually go and improve that?
And it's been a lot of growth and expansion.
And now we've got a map of a lot of territory, but there's a lot of things to go work on.
Yeah.
And I guess the map that you are already alluding to is also the map that we will try to lay out today to make that whole topic a bit more concrete for our listeners.
So during my initial research and preparation for this episode, and also for a talk that I will be giving myself tomorrow, I have seen that the topic of fairness in recommender systems, or fairness in the broader sense of information access systems, as the book that you co-authored calls it, is not that clear-cut.
And this is also stated throughout the chapters and subchapters that you wrote there: there's no universal notion.
The question for me is really starting with the why.
So for some that might be quite self-evident, for others it might not, but I guess as always it's nice to first understand the why to then go into the depths and the width of the whole topic.
So why do we need fairness in recommender systems, and what does fairness even mean when we need it?
So I think there's a couple of directions we can go on the why.
One is if we start from the general goal of recommendation, like in particularly that long tail kind of vision of we want recommendation to achieve its potential to drive equitable economic opportunity and give a lot of different people the opportunity to have a viable career or at least a revenue positive hobby.
Then if we're going to pursue that in a socially responsible way and in a good faith way, then we kind of eventually are compelled to ask, okay, who is actually getting benefit from the system?
And is it actually achieving my goal of ensuring that everybody has access to the ability to market their goods to get readers for their content, et cetera.
And you can go from that direction of that's our goal.
And then we want to see where does it break down.
We might want to look at some of the lines that are connected with kind of the classic anti-discrimination legislation kinds of things of race, gender, religion, sexual orientation, et cetera.
We can also then look, there are a number of groups that become very relevant endogenously to recommendation and information retrieval.
We might want to ask, does a new author who just published their first book, do they get a good opportunity to have people see their book or are the recommendation slots primarily going to established authors on their fourth or fifth book?
And so you can also look at the popularity of like, are we giving all of the slots to the most popular things?
RecSys has studied that a lot.
So you can work backwards from the goal of we want recommendation to provide economic opportunity to all people.
You can also get to the why from the historical discrimination angle, studying how discrimination has played out in society and how it has played out differently in different cultures and societies.
And then asking: we see evidence that this has shown up in many different facets of society, such as education, employment, and housing; does it show up in recommendation or search?
And you'll find people going from both directions to get to this set of questions.
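The question raised above, whether recommendation slots mainly go to established authors rather than first-time ones, can be made concrete with a small measurement. The following is only a minimal sketch, not a method described in the episode: the function name, the group labels, and the logarithmic position discount are all illustrative assumptions.

```python
import math
from collections import defaultdict


def exposure_share(rankings, item_group):
    """Return each provider group's share of position-discounted exposure.

    rankings: list of ranked item-id lists, one list per consumer.
    item_group: dict mapping item id -> group label (hypothetical labels).
    Assumption: attention decays as 1 / log2(rank + 1), the discount used in DCG.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            totals[item_group[item]] += 1.0 / math.log2(rank + 1)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no recommendations were shown at all
    return {group: value / grand_total for group, value in totals.items()}


# Toy data: two consumers, three books, one of them a debut title.
rankings = [["a", "b", "c"], ["a", "c", "b"]]
item_group = {"a": "established", "b": "debut", "c": "established"}
shares = exposure_share(rankings, item_group)
```

Whether a given share counts as unfair depends on what it is compared against (catalog share, relevance, or something else), which is exactly the kind of choice the conversation treats as context-dependent.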
Okay.
So as you mentioned, the first point about driving equitable opportunities, do you think that this is something that is already coined as a goal for major platforms that employ recommender systems?
So I'm a bit unclear about whether this is really something that every platform you might think of is striving for or has in mind in the first place.
So I do share that this makes sense from, let's say, a societal or ethics point of view, but is this also something that aligns with, let's say, the business perspective of many big players?
So I think it can definitely align with some of them and some are more explicit about it than others.
And some, I think it is implicit in their stated goals, or sometimes it's more explicit.
So I have for one example, if you go to LinkedIn's about page, their vision is to create economic opportunity for every member of the global workforce.
And so fairness kind of falls out from that of like, okay, are we providing it to every member of the global workforce?
And so yeah, I think you definitely do have some firms where it is explicitly aligned and others where it's implicitly aligned, where, if you think through the implications of how you would have to test whether they're really achieving their stated vision, it becomes necessary.
Because I see that in some scenarios.
So when I was approaching that, I laid out three stages for myself. In some cases, there might be legislative or regulatory conditions that just force you to pay respect to these concerns and to fairness. In other areas, it's more voluntary: you say, okay, business ethics is important to us, and we want to have a positive impact.
So we do it.
And in some other scenarios, so not implying that this is all kind of mutually exclusive, I guess it's not.
But in some other scenarios, there might be really platforms that have multiple stakeholders in their interests, which are just not healthy or sustainable platforms, once they don't kind of cater to the needs of their different stakeholders.
Would you agree?
Or what is your perspective on this?
Yeah, and there's several different arguments. In trying to make things less unfair than they were previously, I'm perfectly fine with arguments of convenience, even if for me the moral argument is what wins. Arguments that are persuasive, so long as they're true: let's go with them.
I do credit Fernando Diaz, one of my collaborators with some of my thinking here, he's the one I've learned a lot of this from in terms of thinking through motivational levers.
But one of the things we talked about in the tutorial that we gave at SIGIR and RecSys is that there's a few different arguments.
So one is the moral or ethical argument.
And in some firms, you'll be able to get buy in on that.
One is the regulatory argument.
This is clearest in regulated domains, and the regulation doesn't even necessarily have to be clearly established, because finding out through an expensive lawsuit that, oh, your job recommendation platform is partly liable for your clients' discriminatory practices would be a very expensive lesson to learn, both in terms of the legal costs and in terms of the PR.
Another argument is PR: by being proactive and getting ahead of things, if you find and fix a fairness problem before someone else finds it and writes about it publicly, you can save yourself some PR problems.
There's also the market vulnerability argument.
Because if there is a significant segment of your potential customer base you're not providing good service to, then that's a market weakness.
If somebody else can figure out how to provide a better service to that set of customers, they can win their business.
That might also be a beachhead from which they can pivot to compete more directly with your business overall.
Any systematic underserving can be a potential market weakness.
There was a paper that did a simulation study of what happens if you've got an underserved population and you have attrition based ... I don't remember the details.
I'm trying to reconstruct it from my memory on the fly here, but basically if you've got attrition based on service quality, what does that do to your user base over time?
You lose customers.
You can have a market weakness here.
There's a lot of different reasons why a firm might want to care about this besides regulators telling them they have to.
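The attrition argument above can be illustrated with a toy calculation. This is a minimal sketch and not a reconstruction of the paper mentioned, whose details are not given in the conversation; the group names, retention rates, and the assumption that retention depends only on service quality are all hypothetical.

```python
def simulate_attrition(initial_users, retention, periods):
    """Track expected user counts per group under quality-driven attrition.

    initial_users: dict group -> starting number of users.
    retention: dict group -> per-period probability that a user stays,
        assumed here to be higher for groups that receive better service.
    periods: number of time steps to simulate.
    """
    history = [dict(initial_users)]
    for _ in range(periods):
        previous = history[-1]
        history.append({g: previous[g] * retention[g] for g in previous})
    return history


# Two equally sized groups; the underserved group churns faster (made-up rates).
history = simulate_attrition(
    initial_users={"well_served": 1000.0, "underserved": 1000.0},
    retention={"well_served": 0.98, "underserved": 0.90},
    periods=12,
)
final = history[-1]
underserved_share = final["underserved"] / sum(final.values())
```

Even a modest gap in per-period retention compounds, so the underserved group ends up as a shrinking fraction of a shrinking user base.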
Okay.
Definitely good point.
I was convinced before, but I'm even more convinced right now and maybe others will also be.
Last year at the EvalRS workshop at CIKM, I hope that's correct, you gave a talk, and the name of the talk was a bit funny because it was called, please correct me there, How to Hunt a Kraken, right?
Something like that.
Okay.
Okay.
Kind of similar.
The Kraken was actually the map that we want to lay out, the map of fairness and that's not as easy as one might think to lay out that map.
After being convinced about the why it's necessary, let's try to approach the problem and say or try to define what is actually fairness, what does it translate to?
Is there even a unique, universal notion of fairness or how do we make it kind of concrete for us?
So yeah, that's a very good question.
Early on, there was a lot of work on trying to figure out fairness definitions.
How do we define what might be fair?
It turns out then in recommendation, there's so many different ways.
And a lot of the research on fairness in machine learning has focused on binary classification tasks.
And even in there, there's a great deal of complexity about fairness, but recommender system settings, information access settings, are much more complex.
So there was a lot of work on trying to write down fairness definitions, but there has been a shift that I have seen in the fairness community, just through what's being published, through the hallway conversations, et cetera, that I have, away from setting fairness as an abstract goal towards identifying and dealing with specific concrete forms of unfairness.
And there's a few reasons for this.
One is we have, there are various trade-offs between different fairness concepts.
There are impossibility results that show that multiple types of fairness in certain cases cannot be satisfied simultaneously.
There's a line of papers showing how exactly which mathematical fairness definitions are appropriate, depend very heavily on context, depend very heavily also on the assumptions that we make going into the problem.
Focusing on a concrete unfairness or a concrete discriminatory harm: in our book we call them fairness-related harms, to distinguish them from all the other kinds of harms that we could think about.
That gives us focus, because we have all of these different metrics and we have impossibility results, but if we have a specific harm in mind, then that guides us through figuring out what might be the best thing to apply.
The other reason is one that Andrew Selbst and collaborators laid out in an excellent paper at FAccT 2019 on fairness and abstraction in sociotechnical systems. They lay out five of what they call traps: how computer scientists in particular, as we like to abstract everything, find the abstract problem and solve that, are led astray when we're trying to work on problems like fairness.
One of the things that they identify in there is that fairness is, they say, contextual: what it means to be fair is different in different contexts.
It's procedural, a lot of people's intuitive notions of fairness rely on how the decision was made as much as what its outcome was.
Then third and very critically, contestable.
How we navigate fairness in society is done through contestation, people arguing that an action was unfair, and maybe we have a public debate, maybe we have a lawsuit. But also, fairness is what's called an essentially contested construct, which, to a first approximation, in my non-philosopher interpretation of what that means, means humans will never stop arguing about it.
So basically with that, it's effectively inherently impossible to get to a universally agreed definition of this is what it means to be fair,
one that's going to cover all of the contexts, that people are all going to agree on, and everything.
So we focus on more of a bottom-up approach of targeting the specific harms.
Yeah, we can agree that this harm is unfair or maybe we can't agree, but enough of us can agree.
From that bottom up approach, we can hopefully build general things.
So with that, we have to figure out how do we locate a recommender systems or an information access fairness problem.
The first thing really is to think about who are you concerned about being harmed?
And this is where the multi-stakeholder thing comes in because in a lot of AI fairness work, the stakeholder we're concerned about is relatively clear.
If we're concerned about say in a resume screening system that's being used in an employment pipeline, we want to make sure that it's treating the job applicants fairly and that's fairly obvious.
But in an information access system, we have the users, the consumers of the system. I use the term consumer instead of user typically for this because, from a human-computer interaction perspective, lots of different people use the system.
Creators use the system to publish their work, but here we're talking about people who are using the system in the context of finding information, products, etc.
And so those are the consumers.
We probably want to be fair to them in some way.
We have the providers of the information, the people who are making the widgets, the musicians who are writing and recording songs, the authors of books and news articles, etc.
We want to make sure they're being treated fairly.
In some domains, we have subjects of the material.
So I think about this most in news.
News articles are about events and people and places.
And you might, if you're doing a statewide or nationwide news discovery service, you might want to make sure that the visibility and coverage is somehow fair across say different regions or different cultural subgroups within the area so that a wide variety of people have their concerns reported on and people are aware of the issues affecting them rather than concentrating all of the attention on the concerns and the issues of a few.
So that's where subject fairness can come up.
It can come up in a variety of places, but basically where you're concerned about what the content is about being fairly represented in some way.
There are then other stakeholders like publishers, music labels, a movie has so many different people involved in it, etc.
And then the system owner, the platform owner, has various interests, including some of the motivations we just talked about, around what might be fair.
On a multi-vendor marketplace like Amazon or eBay, you also have different vendors that are selling on the platform, and several of them might even be selling the same product.
Are you being fair to your different vendors?
And this is where some of the discussion around self-preferencing comes in: to what extent is it happening, and is it a problem?
You can self-preference the products; you can also self-preference the vendor.
And so there's many different stakeholders.
The first question is, who are you concerned about being unfair to?
And you might try to go for multiple at the same time, but identifying who that is.
The second piece that I, when I try to write it down on a list at least, the second piece is on what basis are you concerned that they are receiving unfair treatment?
And this is where the idea of individual versus group fairness comes in.
And so the algorithmic fairness literature has this concept of individual and group fairness, and individual fairness sets the goal that similar individuals should be treated similarly.
It actually does not make any claims about how dissimilar individuals should be treated.
And that framing of fairness can be traced all the way back to Aristotle, who said that like cases should be treated alike.
In that sense, so talking about the individual fairness, I mean, there's also that notion, I guess you will come to it about sensitive or protected and other non-protected attributes.
So when we talk about individuals and individual fairness, then we are talking about similarity in a sense that this similarity should disregard protected or sensitive attributes.
Yes.
There's a good argument that actually you can deal with some sensitive attribute problems by correcting for them in your similarity function.
But the key idea of individual fairness is that the similarity function, it needs to be an unbiased estimate of similarity with respect to the task.
So it's not a general similarity, but similarity with respect to task in say a lending example, two people with the same ability to repay the loan, that's similarity with respect to task, should have the same probability of getting the loan.
Because the bank doesn't have infinite money, there might need to be a random process in there, but in expectation, two people with the same ability to pay the loan or the same ability to do the job should have the same chance.
It then gets complicated in the details because how do you actually measure similarity with respect to task?
And a lot of the systemic bias can actually then be in whatever measure we're using for similarity.
One example would be, say for college admissions in the United States, we have the SAT, we have the ACT scores.
It's not the only thing it measures, but there is a significant correlation between SAT scores and parents' socioeconomic status.
And so if you have two students with the same, to the extent that raw academic ability is a thing, the same raw academic ability, but one of them comes from a wealthier family, they'll probably have a higher SAT score.
And so if you treat SAT score as your measure to determine similarity with respect to the task of completing college, then you're going to still bake in that disparity.
But as an intuitive principle, it makes a lot of sense.
It also translates fairly well to information retrieval because we have a concept of similarity with respect to task, relevance.
We have a lot of extended debates about what relevance means in particular contexts, but we have this idea of if two documents are both relevant to the user's information need with all that entails, they're similar with respect to the task of meeting the user's information need.
And so that opens a lot of opportunity for trying to pursue fairness in RecSys and IR.
And then on the other side, there is group fairness, I guess you mentioned, right?
Group fairness is where the sensitive attributes come in, where we're concerned about disparate experiences, I'll use that word to be vague for a moment, between different groups.
So those might be, say, the protected characteristics in US discrimination law.
Those might be the various group affiliations that are considered in the discrimination law of other jurisdictions.
They might be some of these endogenous groups like new versus established authors, less popular recording artists, et cetera.
But then within that, there's a lot of different ways that we can look at how they might be experiencing this, getting different experiences.
Really, really roughly, they can be grouped into approximately three categories, at least the ones that are concerned with decisions and distribution of resources, where we can look, the system might be actually treating the groups differently explicitly in that like there is a gender field that is being used as a part of the decision process.
It might be impacting the groups differently in that they have different success rates in their job application, or they have different levels of exposure.
So just like, oh, we see that the system is recommending more men than women.
Okay, so based on the outcome.
Yeah.
And particularly, it's just based on the raw outcome without trying to disentangle very many things in it.
The third category is what's been called equality of opportunity or disparate mistreatment, which is that the different groups experience errors at the same rate.
So they can allow, like you can have a difference in loan approval rates, but you can't have a difference in false positive rates or false negative rates.
And that allows there to be differences between the groups, but it says, you know, we shouldn't be systematically more likely to wrongly deny someone a loan.
They all have their place.
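As a rough illustration of the difference between looking at raw outcomes and looking at error rates, here is a minimal sketch. It assumes a toy lending log in which the ground truth (ability to repay) is known for everyone, which is rarely the case in practice; the function name `group_rates` and all of the data are hypothetical.

```python
from collections import defaultdict

def group_rates(records):
    """Per-group approval, false-negative and false-positive rates.

    records: iterable of (group, qualified, approved) tuples, where
    `qualified` is an assumed-known ground truth.
    """
    counts = defaultdict(lambda: {"n": 0, "approved": 0, "qualified": 0,
                                  "fn": 0, "unqualified": 0, "fp": 0})
    for group, qualified, approved in records:
        c = counts[group]
        c["n"] += 1
        c["approved"] += bool(approved)
        if qualified:
            c["qualified"] += 1
            c["fn"] += not approved      # qualified but wrongly denied
        else:
            c["unqualified"] += 1
            c["fp"] += bool(approved)    # unqualified but approved
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "approval_rate": c["approved"] / c["n"],
            "false_negative_rate": c["fn"] / c["qualified"] if c["qualified"] else None,
            "false_positive_rate": c["fp"] / c["unqualified"] if c["unqualified"] else None,
        }
    return rates

# Hypothetical toy log: group B has a larger share of unqualified applicants.
records = (
    [("A", True, True)] * 3 + [("A", True, False)] * 1 +
    [("A", False, True)] * 1 + [("A", False, False)] * 3 +
    [("B", True, True)] * 3 + [("B", True, False)] * 1 +
    [("B", False, True)] * 3 + [("B", False, False)] * 9
)
rates = group_rates(records)
for group in sorted(rates):
    print(group, rates[group])
```

In this toy log the two groups differ in approval rate but have identical false-positive and false-negative rates, so an outcome-based check would flag a disparity while an error-rate check would not.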
Okay, I see.
Depending on the problem you're solving, the attributes or characteristics don't necessarily need to be as discrete as those examples that you brought up with regards to the anti-discrimination laws, but they can also be more continuous.
So for example, judging by, let's say we do have companies on a marketplace where they are competing with each other for talent and you have larger and smaller companies that you might differentiate by the number of employees or the revenue they are making.
And those larger ones are just getting better recommendations for potential hires than the smaller ones are.
And then this would be kind of an example for, I guess, the last category that you brought up, because the smaller ones are just getting less relevant recommendations, which then might be a difference in the error rate somehow, if you want to translate the irrelevance into error, right?
Yeah, yes.
Okay.
There's a lot of rich space to explore and then also understanding what happens when you're targeting, when you're thinking about multiple groups at the same time.
And for some of that, I really highly recommend Sorelle Friedler, Suresh Venkatasubramanian and Carlos Scheidegger's paper in Communications of the ACM on the impossibility of fairness, because it lays out a theoretical framework for thinking about the different types, individual and group, and the assumptions that go into correcting them.
It's a very, very good paper.
I recommend it thoroughly.
It's nice that you bring that up because in that regard, I also sometimes personally just think that there are some, let's say cases where we do actually have data and we might want to jump on faster.
Like for example, talking about gender bias, because we have more data on this, even though the data might not include, let's say, non-binary gender identifications or something like that.
But if we treat it like kind of binary, then we have more data there.
But then like you said, if you also want to control for fairness in terms of sexual orientation, race, and so on, then you are reducing your groups, which are the combination of several categories or attributes, down to such small groups that you don't even have significant results there.
This is one of the problems, the combinatorial complexity of several attributes.
Yes, it is a challenge.
And we do have that.
So if we're trying to consider multiple attributes, there's kind of two ways we can go.
One is we can consider them separately.
I want to be gender fair and I want to be race fair, et cetera.
But that doesn't capture intersectional concerns.
And if we want to start looking at intersectional concerns, the naive way to do it, which is a good way to start, and Anna Lauren Hoffmann makes a good argument that this is not complete, but it's a good place to start, is to do the Cartesian product.
But as soon as you start doing that, you do get significant combinatorial explosion.
And that manifests in a couple of challenges.
One is just the computational complexity of doing the math.
So we did have eight different attributes ranging from three to I think 20 levels per attribute for the TREC Fair Ranking Track last year.
And so with that Cartesian product, a single article's vector was I believe multiple megabytes.
And so I had to get creative to not run out of memory.
But also then you have a lot of intersection cells that might have little or zero data.
And the more dimensions you have, the more that problem arises.
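To make the combinatorial explosion concrete, here is a minimal sketch of the naive Cartesian-product approach. The attribute names, levels, items and the helper `intersection_cells` are all hypothetical and much smaller than the eight-attribute setting mentioned above; the point is only that the number of cells is the product of the level counts, and that many cells end up empty.

```python
from collections import Counter
from itertools import product

def intersection_cells(attribute_levels, labelled_items, min_count=1):
    """Count items per intersectional cell and list the sparse cells.

    attribute_levels: dict mapping attribute name -> list of levels
    labelled_items:   list of dicts mapping attribute name -> level
    min_count:        cells with fewer items than this count as sparse
    """
    names = list(attribute_levels)
    # Each item falls into exactly one cell: the tuple of its levels.
    counts = Counter(tuple(item[n] for n in names) for item in labelled_items)
    # The full Cartesian product of levels, including cells with no data.
    cells = list(product(*(attribute_levels[n] for n in names)))
    sparse = [cell for cell in cells if counts[cell] < min_count]
    return len(cells), counts, sparse

# Hypothetical attributes and labels, for illustration only.
attribute_levels = {"gender": ["f", "m", "nb"], "region": ["north", "south"]}
labelled_items = [
    {"gender": "f", "region": "north"},
    {"gender": "m", "region": "north"},
    {"gender": "m", "region": "south"},
    {"gender": "m", "region": "south"},
]
n_cells, counts, sparse = intersection_cells(attribute_levels, labelled_items)
print(n_cells, "cells,", len(sparse), "of them empty:", sparse)
```

Even with two small attributes, half of the cells here have no data; adding attributes multiplies the number of cells while the amount of data stays the same.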
But you do have challenges.
And so dealing with, I mean, dealing with gender is a difficult one.
We actually just published a paper last month specifically on that at the CHIIR conference.
There, often we don't have a lot of data on non-binary subjects in the data set.
Also there's good reasons why one wouldn't want to have that data depending on the setting.
But there's also like even if the data you have is only binary, making sure that say the statistical methods that you use aren't limited to that.
That if you do get better data or richer data, you can keep going.
You can improve.
And so like I think there is value in starting with the data that you have being clear about its limitations and not sweeping them under the rug, but using the data you have to make progress and then looking to improve the data and the methods.
So kind of an interim summary there might be, yeah, keep in mind that if you improve on the one side, you might harm the other, but that should not prevent you from getting started.
And once you do, and start seeing first results, also check whether you have harmed other categories or other perspectives on different attributes or something like that.
Yeah.
And there's also a like what data you have internally versus actually publishing the data like it's in terms of statistical errors.
Like if the errors in labels are rolled up into a statistical inference, you're not publishing a list of here's people with their labels and those labels are wrong.
It's an ever evolving thing.
And so this shift I talked about of shifting from fairness as an abstract goal to specific harms, I'm sure that's not the last shift we're going to have, paradigmatic shift we're going to have in fairness research.
As a community, we're continually growing and learning to improve and do better in terms of how we deal with a lot of things, how we deal with and reason about sensitive attributes and complex attributes, how we deal with data quality, data provenance, et cetera.
There's a lot of growth that happens.
And we'll do like, I've done stuff I wouldn't do again now.
Okay.
So I think that that approaching it with the perspective of humility and good faith and being forthright about the limitations that we see, that does mean that then like reviewers need to respect the limitations and not just see it as a list to reject.
But if we're forthright about here's what I did, here's the limitations, then as a community, we can have a conversation about how to do better in the next project.
Yeah, definitely.
That sounds reasonable.
I mean, in terms of that map, I guess there's also some additional point maybe you can touch on, which is the distributional and representational harm.
So what is this about?
Yeah.
So if we've got the first two questions of who's being harmed and the second on what basis, then the third is how.
And so in her NeurIPS keynote, Kate Crawford identified a couple of different categories of harm.
Distributional harms are those that are concerned with the distribution of a resource where the resource is not being distributed fairly.
And so in a recommender system, one of the obvious resources to be concerned about is exposure.
So when the piece of content is exposed in a recommendation list that carries with it both direct economic benefits in terms of people will listen, they'll click, they'll buy, they'll click the article and then the author can sell the ad impressions.
But it also brings indirect benefits, reputational benefits of if someone's work is known, then maybe they'll have more opportunities for commissions of future work.
And so exposure is the mechanism by which the recommender system or the search engine or whatever facilitates opportunity for those kinds of benefits.
And so we can think of exposure as a resource that the system is distributing.
And then we can ask, is that being distributed equitably?
That was one of the key points of the expected exposure paper that Fernando and some others, including myself, wrote at CIKM 2020 as a way to optimize the equitable distribution of exposure.
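One way to picture exposure as a resource that the system distributes is to add up a position-based weight for every slot a provider group receives. The sketch below is only an illustration and not the metric from the paper mentioned above: it assumes a simple logarithmic position discount, and the function `group_exposure_shares`, the item IDs and the group labels are hypothetical.

```python
import math
from collections import defaultdict

def group_exposure_shares(rankings, item_group):
    """Share of position-discounted exposure received by each provider group.

    rankings:   list of ranked item-id lists, one per recommendation list
    item_group: dict mapping item id -> provider group
    Assumption: attention decays as 1 / log2(rank + 1), with rank starting at 1.
    """
    exposure = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            exposure[item_group[item]] += 1.0 / math.log2(rank + 1)
    total = sum(exposure.values())
    if total == 0:
        return {}
    return {group: value / total for group, value in exposure.items()}

# Hypothetical data: item "a" always takes the first slot.
rankings = [["a", "b", "c"], ["a", "c", "b"]]
item_group = {"a": "major", "b": "indie", "c": "indie"}
shares = group_exposure_shares(rankings, item_group)
print(shares)
```

Whether a given split counts as equitable is a separate, normative question; the sketch only measures who gets how much under the assumed discount.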
But that's one concrete resource.
We can also think of quality of recommendations as being a resource, like user side utility.
And we want that to be equitably distributed.
There's a really important difference between those two though, in that the provider side exposure is what economists call a rivalrous or a subtractable good.
That is one person obtaining the good means another person can't get that unit of the good, because I can only put one news article in the first slot of my news recommendation list.
Someone else might get the first slot on the next list, but that specific impression can only go to one article.
Whereas consumer side utility is not a rivalrous good.
If I have good recommendations, that's not the reason you have bad recommendations.
And so the way we should measure them differs.
Provider side exposure, we probably want to look at how it's balanced and we do need to trade off, move some exposure from some providers to others, if the number of overall lists we're producing is constant.
But on consumer side utility, we would want to look at, okay, we don't want to go make things worse for the users who have a good experience.
How do we make things better for the users who are having a bad experience?
But those are two kinds of resources that can be distributed.
Exposure can also be distributed to subjects.
Representational harm is where there isn't necessarily a resource involved, but the system is representing the user or representing the data subject in a way that reproduces societal harms.
And this can come up in a whole bunch of different ways.
One is it just might be wrong in its representation and it might be disproportionately more likely to be wrong for some groups than others.
They can be both individual and group.
It can also be stereotypes of various forms, some of which can be quite harmful.
And so both Latanya Sweeney and Safiya Noble have done work on understanding how people are being represented back in search engine results.
So what happens when you search for black girls or white girls in Google?
What happens when you search for a name that in American society is typically read as black versus white?
And Latanya Sweeney found there that names associated with black communities were substantially more likely to have the search results include things like criminal background checks.
And so it was representing a higher level of criminality back in terms of what it was producing.
And that is what I would class as a representational harm.
We've also been looking at this in our SIGIR eCom paper last year, led by my PhD student Amifa Raj.
We were looking at gender stereotypes and particularly in children's toys and products.
There's been a lot of research on early childhood development, education, etc. around the influence of gender stereotypes on children's development, their self-perception, etc.
And it has an effect and it can have a notably adverse effect in terms of their perceptions of how they fit in the world.
So there's a lot of different mechanisms by which children get exposed to these stereotypes.
They're kind of unavoidable in our society.
But we wanted to ask, is the search system or the recommender system also contributing to the propagation?
Because kind of like once you have this concept in society, it gets propagated through a whole bunch of channels.
Yeah, yeah, I see.
Because stereotypes show up in places.
And so we were wanting to understand, are they showing up in search and recommendation, especially for kids products?
And the short answer is yes.
But that was another representational harm we were concerned about.
And that's where what it's representing is: what is the kind of toy that a boy would want to play with or a girl would want to play with?
Or what kind of a child is more likely to want to play with a superhero toy?
And then associating a gender dimension with that.
And so there are a lot of questions around how do our systems represent the world to us, represent us to the world and to ourselves?
And how does that interact with and reproduce the kinds of social ways or those are aligned?
There is one specific term that has been growing largely in frequency over the past minutes, which is actually bias.
And the different stages at which bias might enter the whole process, which gets you from the real world, let's say to the recommendations and back into the real world.
So there's a world that you already alluded to is already biased.
There's societal bias.
So the distinction between what I guess you wrote in that chapter, the world could or should be and how the world is.
And then this is going to be captured in data.
Data is kind of the resource for our algorithms.
And these algorithms are then providing some output, which people are going to interact with, which then also might be subject to exposure bias, position bias, or several others.
And so on and so forth.
Fairness and bias.
How are these related?
So how they're related probably depends on the author.
Many authors will just use bias for an unfairness problem.
I try to avoid doing that.
Sometimes I will informally.
But when I'm trying to be clear about our distinctions, I try to use the term bias in something much closer to its statistical sense, where bias is a mismatch between what we have and what we're supposed to represent or what it's supposed to represent.
So in the sense that for an unbiased estimator, its expected value is the parameter.
And so under this understanding, data is biased when it has a systematic, and this is where we distinguish it from noise.
There's always going to be errors.
If those errors are just evenly distributed across the data set, it's just noise.
But the data is biased then when there is a systematic skew between what the data is supposed to represent and what's actually in the data.
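The distinction between noise and systematic skew can be shown with a few made-up numbers. This is a minimal sketch with hypothetical values and a hypothetical helper `mean_error`: errors that cancel out on average behave like noise, while errors that consistently point in one direction are bias in the statistical sense used here.

```python
from statistics import mean

def mean_error(measured, truth):
    """Average signed error: about zero for pure noise, offset for bias."""
    return mean(m - t for m, t in zip(measured, truth))

# Hypothetical "true" values and two hypothetical sets of measurements.
truth = [10.0, 20.0, 30.0, 40.0]
noisy = [11.0, 19.0, 31.0, 39.0]    # errors +1, -1, +1, -1 cancel out
biased = [12.0, 21.0, 33.0, 42.0]   # errors +2, +1, +3, +2 all point one way

noise_offset = mean_error(noisy, truth)
bias_offset = mean_error(biased, truth)
print(noise_offset, bias_offset)
```

Both measurement sets are wrong on every data point; only the second one is wrong in a consistent direction.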
And this I also owe to the impossibility of fairness paper, which lays this out, and actually they define bias in terms of these kinds of skews where the bias is the skew.
And so whether that's a skew in a statistical estimator, whether that's a skew in data, the data bias can both arise in what data points are in the data.
We may, say, be more likely to observe some groups than others.
And so our data is not representative.
It's biased with respect to the population distribution.
The other way that it can be biased is maybe the ability to observe is the same, but our measurement is biased with respect, like the actual measurement we stick in a field is biased with respect to what it purports to measure.
Like the SAT example I gave, we can also see this with like, if we're using clicks as a measure for satisfaction, there's been a lot of work on like position bias and things.
But if we just have a click log, that's not an unbiased measure of what people like.
Yeah.
Yeah.
Even though it sometimes might be hard to really say what is it that people really like.
Yeah.
But that's how I use the term bias.
And then some biases give rise to fairness problems.
And so I use fairness in the kind of social normative context: something's unfair when it misaligns with our ethical and social principles of what it means to be fair.
And a lot of times it's caused by bias.
Sometimes bias can be useful for correcting a fairness problem.
Okay.
Because that brings me to some questions that just arise in my mind, which is, is bias necessarily bad?
Or if we even want to, let's say counteract bias, and I guess we will come to that in a minute, do we necessarily want to minimize it down to zero, however we quantify it?
Because I'm actually thinking about that topic or that problem of popularity bias that you encounter in recommender systems.
And for me, there might be some good reason that to some certain degree describes the popularity let's say to a reasonable extent, or let's just say some objects, items of recommendation are just more popular because they have a higher utility for the average user.
So in that sense, I actually don't want to minimize bias because by that I would also kind of disregard the utility that items have for users.
So what is that kind of interplay between bias and fairness?
Yeah, so bias, bias is not always bad.
And it's also it's a thing.
And it's also in many cases, it's not going to be something we can fully get rid of.
But that's one of the reasons why I try to use it in this statistical sense.
It's just a fact about the data, whether it's bad or not depends on our normative understanding.
And it depends on our goals because if the bias interferes with our ability to deliver good recommendations, if it interferes with our ability to deliver fair recommendations, then it's bad.
But it's kind of the more consequential side of it.
But in terms of say the popularity bias, yes, items differ significantly in their quality.
And popularity and quality correlate.
And Theodore Sturgeon, a sci-fi editor and essayist wrote, I don't remember the exact percentage, I think it was either 90 or 95 percent of everything is crud.
And so if that's true, then you've got this long tail of stuff that is bad.
If we just recommend bad things to users, nobody's going to come use our systems.
But there's two components of popularity, of observed popularity.
One is the quality that contributes to popularity of people liking better stuff.
Actually there's three components.
So one is the quality.
One is social popularity of people hearing about it from their friends, et cetera.
And then the third is the system's amplification of popularity.
And where popularity bias really comes in to be a problem is in that third one.
The system should recommend, like it should probably, it should recommend more popular items than unpopular items for most systems, not all.
But how much more?
And so popularity bias, it isn't that it's recommending more than even, it's that it's recommending more than it should.
The influence of popularity on the user's likelihood to like the product is being exaggerated.
Unfortunately we don't know where, for most systems, we don't know where between zero and what the system is doing, popularity should be.
That's kind of the disentangling that I do on popularity bias specifically, but it does.
This comes into a lot of other things.
Like we don't want to recommend bad content just because it was by a particular creator, at least after the first few runs.
We do need to make sure that our signals for whether or not content is good are unbiased.
Because if, and this can be through either the way we're measuring or people's biases in their own behavior in terms of like, oh, they're less likely to click results by certain authors.
Well, that's partially an indication of interest and that might be partially an indication of their bias.
And then the difficult question, and in working on fairness we cannot get away from making normative judgment calls based on ethical and normative reasoning, is: to the extent that it is a bias on the part of the user, a discriminatory bias on the part of the user, do we want the system to learn and replicate that bias?
And typically, if we don't, the way we have to get around that is with a correcting bias.
So the data is a biased observation of the construct.
We then add another bias to get it back.
That does depend on assumptions, but it's something that we can't get away from.
I would like to say there, like in dealing with assumptions, it's not making assumptions versus not making assumptions.
Because if we take the data as is, then we're making the assumption that the data is a sufficiently good representation of the world as we want it to be as a result of our system existing.
That's a very interesting assumption, but either direction, you're making an assumption.
So you can't, like there's not a neutral in that sense.
Like as system developers, we need to think about what effect we want our system to have in the world and what role we want it to play.
This can also then get into, so we've got these stages of bias from Shira Mitchell and collaborators' paper on assumptions in algorithmic fairness of, yeah, the world as it could and should be, like what the world would look like if there was no racism, sexism, and other discriminatory bigotries at all.
And everybody just treated everybody fairly.
The world that we have, and then we have the data that we're observing from that world.
If we want the recommender system to be an agent of change, to nudge the world as it is, to be closer to the world as it could and should be, our data won't teach us to do that.
It's a value judgment to say we're going to put these biases into the system so that the recommendations to the extent that they influence people nudge the world closer to a more equitable state.
I see.
And maybe nudge the world closer to that long tail vision distributed across all people.
And so it's a complex interplay that we have to be, this is also where we have to be clear about what problem we want to solve.
Which brings us back to that normative starting point you need to clarify first before answering the question whether something is fair or unfair.
I was thinking about another thing that you were already touching on a bit when we were talking about provider side and consumer side fairness aspects in terms of exposure, reminding me of my first year at university and my microeconomics course when you were alluding to rivalrous and non-rivalrous goods, which is about trade-offs.
So trade-offs is a prominent topic that is arising when someone is concerned with fairness or debiasing and so on and so forth.
Am I going to promote the utility of a system in terms of the goals for one group and necessarily hurting that satisfaction of utility or something like that for another group?
I mean in the provider side scenarios you said, okay, if I have exposure for the items that are provided by my providers, I'm taking something from one provider and providing that exposure to another.
So this is quite clear, but are there always trade-offs?
I mean, if you also go a bit higher and say, okay, I do have a recommender system and the first thing I did is I only optimized for the relevance of recommendations, even though we know relevance not necessarily always translates into satisfaction of users or of consumers.
But if I then say, okay, now I also want to counteract the heavy popularity bias that I have in my system and thereby I'm introducing another objective and then I start optimizing for two objectives which are relevance and some quantification of popularity bias.
So is this necessarily leading to a trade-off?
So necessarily leading to, let's say, worse recommendations or is the topic of trade-offs a bit overestimated?
So I think the topic of trade-offs is overestimated.
I think there definitely are going to be trade-offs in some places.
The provider side of distribution of exposure is one of them.
But in a lot of places, we often jump to a trade-off, but it's not always clear that that's an accurate understanding of what's happening.
So for example, there's a lot of discussion often about diversity and accuracy trade-offs.
But when we look at user perceptions, for example, in my RecSys 14 paper, we found that user perceived diversity and user satisfaction, which was indistinguishable from user perceived accuracy in that experiment, were positively correlated.
And so in some cases, I think when we're seeing a trade-off, we might not actually be seeing a trade-off in the underlying construct.
And so we're trading off our relevance metric.
But the relevance metric has multiple components: the signal of the actual construct it's measuring, the bias of the measurement and the noise of the measurement. So are we actually trading off signal or are we trading off bias or noise?
Because if all we're trading off is bias or noise, the actual utility to the user might be increasing, but we don't, when we're just looking at, say, I've got NDCG and diversity, we don't have the data to detect that because NDCG is not an unbiased estimator of what's going to happen when you actually ask users to use the system.
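For readers less familiar with the metric, here is a minimal sketch of NDCG computed from logged relevance labels. It uses a common formulation with a logarithmic discount; the function names are hypothetical, and the ideal ranking is built only from the items in the list, which is itself a simplifying assumption. The metric can only be as good as the labels fed into it, so any bias or noise in those labels carries straight through.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k):
    """NDCG@k for one list.

    ranked_gains: relevance labels of the items in ranked order, taken
    from logged data (for example clicks or ratings).
    Assumption: the ideal ranking reorders only the items in this list.
    """
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_gains[:k]) / ideal_dcg

perfect = ndcg_at_k([3, 2, 1], k=3)   # already in the ideal order
late_hit = ndcg_at_k([0, 0, 1], k=3)  # the only labelled hit sits in slot 3
print(perfect, late_hit)
```

A drop in this number after a fairness intervention says that the ranking agrees less with the logged labels; whether users are actually worse off is a separate question.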
And so I think with trade-offs, we really want to understand when there's an actual trade-off in the meaningful experience of the stakeholders.
Do users feel like they're getting less useful recommendations?
Are recommendations less useful as evidenced by users buying fewer products or listening to less music?
And so that's where I think we might need to be concerned about a trade-off and we might decide we need to eat it anyway, but the offline metrics when they trade off, there's so much noise and bias in so many of the metrics that I think we need a lot more evidence to claim that we're actually trading off the constructs that we're trying to measure.
I do think it's possible for there to be trade-offs in a number of spaces.
Say if there's a trade-off between provider and consumer fairness.
So first that's going to depend on precisely what kind of provider and consumer fairness we're talking about, but one could definitely envision situations where that could happen.
I think it's definitely an empirical question of the extent to which it is happening in any particular case.
I don't think, at least from any of the reading I've done, the knowledge isn't currently out there to make general statements about when that trade-off might arise or not.
I think though even some things that look like trade-offs in the long term may not be.
So if you redistribute some exposure to make the system more fair to different providers, give niche providers or underrepresented providers more visibility.
If you hold user activity fixed, that's trading off between different providers.
If users don't think the system is useful and they go away, then it's hurting everybody.
If users either find the fairer system more useful or it gets a reputation and users are like, yeah, I want to use that because they're doing a better job of providing equality of opportunity, then user activity isn't holding constant.
The total user activity may be increasing, in which case the providers that we've reallocated some of their exposure over might still be better off than they were before because they're getting a slightly smaller fraction of a larger pie.
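The arithmetic behind a slightly smaller fraction of a larger pie can be spelled out with made-up numbers. Everything here is hypothetical, including the helper `provider_exposure` and the assumption that total activity grows by 20 percent after exposure is reallocated.

```python
def provider_exposure(share, total_activity):
    """Absolute exposure a provider receives: its share times the total."""
    return share * total_activity

# Hypothetical numbers: the provider's share drops from 40% to 36%,
# while total user activity grows from 1000 to 1200 impressions.
before = provider_exposure(share=0.40, total_activity=1000)
after = provider_exposure(share=0.36, total_activity=1200)
print(before, after, after > before)
```

With these numbers the provider goes from roughly 400 to roughly 432 impressions despite the smaller share. If activity had stayed at 1000, the same reallocation would have been a straight loss, which is why the outcome is an empirical question.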
I think when we're thinking about trade-offs, it's helpful to think in terms of the big picture of understanding what's actually happening in terms of the usefulness and in terms of the value to users, to providers, to the business that's actually happening.
That's not just a function of the recommendation algorithm.
It's a function of the entire sociotechnical context in which it operates.
That actually reminds me of a paper by Rishabh Mehrotra and his colleagues at Spotify, I guess it was in 2018, where they provided some counterfactual evaluation of satisfaction for users.
They really distinguished these three terms: satisfaction, diversity, and relevance.
Relevance is different from satisfaction.
If we only optimize for relevance, that basically gives us high satisfaction.
But down the road they assessed several methods of introducing fairness into the system by promoting diversity, and later did this in a personalized manner, because some users have a lower propensity for diverse content and some users have a higher one.
They were actually able to say that in the end they achieved basically higher satisfaction, if I'm correct, while promoting fairness at the same time, even though they had less relevant recommendations, because relevance is not necessarily the same as satisfaction.
It reminds me a lot of that when thinking about trade-offs.
Yeah, at the end of the day, we want to deliver a good experience to our users.
We want that either as an end in itself, or so that they'll remain happy, satisfied customers who tell their friends to use our service.
We then also couple that with fairness as an organizational goal, if we want to do that, and with wanting to provide opportunity to content creators of whatever form is relevant to our platform.
And the metrics are proxies, usually noisy, biased, limited proxies, for those kinds of goals.
The other thing you can have on that is if your system is unfair to particular groups of artists, they might just leave, in which case there's less content on your platform, which makes it less appealing to customers.
And so you shrink the pie.
Mm-hmm.
It can also go the other way.
I already liked that additional perspective: by promoting fairness in your system you can grow the pie, but by deciding, whether implicitly or explicitly, not to do it, you might also, as you said, shrink the pie.
Moving in the direction of actually promoting fairness, what are, from your perspective, the metrics and methods to assess certain notions of fairness, if that is something that you could accept as an intermediate term?
Yeah, so I think we have made some significant progress on measuring and addressing some kinds of fairness concerns.
So for provider-side exposure, we've got the equity of exposure construct, and we've got the amortized attention construct.
Our SIGIR paper last year looked at several different fair ranking constructs, at their sensitivity to configuration and to edge cases in the data, and came away really recommending expected exposure and attention-weighted rank fairness for a single-ranking setting as ones that are relatively robust to what's happening with the data you're trying to measure with them.
There's also a wonderful paper by Meike Zehlike on the normative principles underlying different fair ranking constructs.
And so for provider-side exposure fairness, we've kind of settled now that we want to look at exposure, and we're still working on the details of how you make that measurement interpretable.
But in terms of a construct, that seems to have a lot of promise and gets close to measuring that provider side exposure piece that we care a lot about.
And then other things can be useful, say, as optimization proxies, but when it comes down to it, at least as I see the landscape, if you want to measure provider-side exposure, something in that exposure family is a good way to go about doing it.
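To make the idea of measuring provider-side exposure a bit more concrete, here is a minimal Python sketch. It is not the exact metric from any of the papers mentioned; it only assumes a DCG-style logarithmic position discount as the browsing model and compares each provider group's share of exposure with a target distribution. The function names, the group labels, and the 50/50 target are all hypothetical choices for illustration.

```python
import math
from collections import defaultdict


def log_discount(rank):
    """Assumed browsing model: DCG-style attention weight for a 1-based rank."""
    return 1.0 / math.log2(rank + 1)


def group_exposure(ranking, group_of, discount=log_discount):
    """Share of position-weighted exposure each provider group gets in one ranking."""
    totals = defaultdict(float)
    for rank, item in enumerate(ranking, start=1):
        totals[group_of[item]] += discount(rank)
    norm = sum(totals.values())
    if norm == 0:
        return {}
    return {group: value / norm for group, value in totals.items()}


def exposure_gap(shares, target):
    """Total variation distance between observed and target shares (0 means on target)."""
    groups = set(shares) | set(target)
    return 0.5 * sum(abs(shares.get(g, 0.0) - target.get(g, 0.0)) for g in groups)


# Hypothetical example: two "major" providers ranked above two "niche" providers.
ranking = ["a", "b", "c", "d"]
group_of = {"a": "major", "b": "major", "c": "niche", "d": "niche"}
shares = group_exposure(ranking, group_of)
print(shares)
print(exposure_gap(shares, {"major": 0.5, "niche": 0.5}))
```

Which target distribution is the right one (equal shares, shares proportional to relevance, shares matching the catalog) is a normative choice that the code cannot make for you.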
On the consumer side, if we want to, say, measure consumer-side utility, that's the easiest thing to quantify with whatever utility metric we have, whether it's just computing nDCG or looking at our other metrics like time on site, click-through rates, or things like query reformulation rates.
So there's a question of how you measure it, but then also how you turn it into an overall metric or goal.
There's a really promising approach that was published at KDD in the last two or three years. And I'm not remembering the authors offhand.
I mentioned it in my EvalRS CIKM talk.
We will bring that into the show notes.
So some approaches just look at, okay, we've got two groups and we're going to take the difference in the utility, and that's going to be our unfairness.
But that sets up a trade-off between the groups.
And so what they did was take the mean or the sum over the groups of the log of the total utility that each group got.
And so that gives you a metric where the most efficient way to increase the metric is to increase utility for the group with the least because the log puts a diminishing return on the groups that are well-served.
And so that struck me as a way of quantifying the distribution of consumer side utility that doesn't fall into the trap of treating a non-rivalrous good as rivalrous.
Because you don't have to hurt anybody's utility to improve this metric.
In fact, you can't.
If you hurt somebody's utility, the metric goes down.
Yeah, yeah, right.
It just emphasizes that you get your biggest delta by improving things for the group that has the least utility.
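The log-of-utility idea described above can be sketched in a few lines of Python. This is only an illustration of the principle as stated in the conversation (mean over groups of the log of each group's total utility), not the published metric itself; the function name, the `eps` guard against a log of zero, and the example numbers are assumptions.

```python
import math


def mean_log_group_utility(group_utility, eps=1e-9):
    """Mean over groups of log(total utility of that group).

    eps is a hypothetical guard so a group with zero utility does not break the log.
    """
    logs = [math.log(u + eps) for u in group_utility.values()]
    return sum(logs) / len(logs)


# Hypothetical totals: group A is well served, group B is poorly served.
base = {"A": 10.0, "B": 1.0}
help_a = {"A": 11.0, "B": 1.0}   # one extra unit of utility for the well-served group
help_b = {"A": 10.0, "B": 2.0}   # the same extra unit for the least-served group
hurt_a = {"A": 9.0, "B": 1.0}    # taking utility away from anyone

print(mean_log_group_utility(help_a) - mean_log_group_utility(base))
print(mean_log_group_utility(help_b) - mean_log_group_utility(base))
print(mean_log_group_utility(hurt_a) - mean_log_group_utility(base))
```

The same extra unit of utility moves the metric more when it goes to the least-served group, and lowering any group's utility lowers the metric, so there is no way to improve it by taking utility away from someone.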
I'm really happy that I finally got you up to the point where you are not able to bring up all the authors and the corresponding conference and its year of a paper that you are bringing up.
So I guess it was my 15th trial.
So much for the metrics. As we have already alluded to the stages at which bias might enter the system, these different stages, I guess, might also be the intervention points for debiasing methods.
Can you briefly walk us through that or what might be the major categories there to watch out for?
So yeah, bias can creep in at any stage, starting from differences in the world. There can be bias, say, in our content representations, and there can be bias in what content we've acquired.
There can be bias in the user feedback signals.
There can be bias in the expert judgments we're using to taxonomize the content, all of these things.
Then there can be biases actually in the models themselves.
Maybe a little bit more likely, there can be biases in the objective functions.
Like, oh, the model might be fine.
It's gradient descent, but trained on that loss function and that training data, it induces a bias. And then it goes all the way out to the final user responses, which feed back into the system.
Bias can enter at all of these different stages.
At the end of the day, you still have to measure fairness at the end, on the final task, because it would be nice if fairness composed, but we cannot assume fairness composes.
And Dwork and Ilvento had a paper on the composition of fairness.
I think it was just called on compositional fairness or something like that.
That's looking at this compositional problem.
They were arguing that you always have to go measure your fairness at the end application, but there are a lot of intervention sites that you can look at with that still as your end evaluation.
And so you can work on improving data or debiasing data, like the techniques that have been developed, say, to remove position bias in click-through data.
One could envision trying to develop similar techniques for debiasing with regard to social biases.
You can pay for data, like hiring professional annotators to produce higher quality annotations and category labels for a segment of content producers whose content is not being appropriately cataloged and represented to be able to be recommended well in the system.
For changing between the world as it is and the way it could be, you could develop grant programs for underrepresented content creators to produce new content in the world.
Um, so there's a lot of intervention sites there.
There's intervention sites around the models.
And also, a typical deployed system is a multi-stage model.
And most of the fair recommendation research has focused on it as a single stage or just focused on the final ranking stage.
There was a paper presented by Twitter folks at FAccTRec last year on the role of the candidate selection stage in the fairness of the final recommendations.
But you can try to intervene in terms of improving your raw data.
You can try to intervene in terms of how your training process uses data.
Maybe you're changing your sampling strategy.
You can change your objective function.
You can put in regularizers that correct for some of the kinds of biases that you're trying to remove.
This can actually be useful for representational harms.
There are some things you can do with GANs.
And so one thing I saw, there was a paper at FAT/ML 2017 on this by some of the Google folks, where they had a GAN in which the discriminator was attempting to predict the user's sensitive attribute from their embedding.
Oh, okay.
And so you do that and you train an embedding.
If you can learn an embedding from which you cannot predict the user, say the user's gender, then it's really hard to get a gender stereotype on that embedding.
Okay.
Makes sense.
And so if you're concerned about it, using adversarial learning can be an approach for dealing with representational harm, and there are other adversarial approaches to fairness objectives as well.
So you prevent the system from learning either some of the associations from a representation perspective or some of the associations that are going to cause distributional harms that you want to avoid.
Just two references here.
So one is actually the episode that I had with Felice Merra on adversarial recommender systems, which is something that the listeners might want to look up.
And the other one: as you're bringing up how to predict the user's gender from the embeddings without having access to the actual gender attribute, it reminds me of the episode that we had with Manel Slokom, where part of her work was concerned with the fact that one can try to predict gender from the ratings that users provided.
So actually not from the embeddings, but from the ratings, which also did not explicitly contain this, but there we said, okay, action is more, let's say, male associated, and drama or something like that was rather female associated.
Yeah.
And I very much don't like gender prediction as a task.
Like I don't want to build systems that predict gender, but as a discriminator to build systems from which it cannot be predicted, that's something a little bit different.
Yeah, yeah, definitely, definitely there.
It makes sense.
Yeah.
And I just looked up that adversarial fair representations paper, and it was Alex Beutel et al.
Alex Beutel, Jilin Chen, Zhe Zhao, and Ed Chi. It's on arXiv, but they presented it at FAT/ML in 2017.
So that's a really interesting kind of approach.
You can look at approaches where, yeah, as I said, you're just changing the objective function.
Maybe you're adding a fairness regularizer of some kind.
There's many different approaches for doing that.
You can also then do fairness through re-ranking, where you take the model as is, but then you re-rank its outputs to satisfy a fairness objective in addition to your base relevance objective.
This can be done using the same kinds of re-rankers that you'd use for diversification and things like that.
And in that sense, like provider fairness and subject fairness wind up mathematically looking very much like diversity.
But you can do that re-ranking technique.
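As a rough illustration of that re-ranking idea, here is a small Python sketch of a greedy, diversification-style re-ranker. It is an assumed toy design, not a method from any specific paper: at each position it picks the candidate with the best relevance score plus a bonus for groups that are still below a target share of the list. The function name `fair_rerank`, the `weight` parameter, and the target shares are hypothetical.

```python
def fair_rerank(candidates, group_of, target_share, k, weight=0.5):
    """Greedily build a top-k list from (item, score) pairs.

    Each pick maximizes score + weight * deficit, where deficit is how far
    the item's group would still be below its target share of the list.
    """
    remaining = dict(candidates)
    counts = {group: 0 for group in target_share}
    result = []
    while remaining and len(result) < k:
        length_after_pick = len(result) + 1

        def gain(item):
            group = group_of[item]
            deficit = target_share.get(group, 0.0) - counts.get(group, 0) / length_after_pick
            return remaining[item] + weight * deficit

        best = max(remaining, key=gain)
        result.append(best)
        counts[group_of[best]] = counts.get(group_of[best], 0) + 1
        del remaining[best]
    return result


# Hypothetical base-model output: three "major" items outscore two "niche" items.
candidates = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.5)]
group_of = {"a": "major", "b": "major", "c": "major", "d": "niche", "e": "niche"}
target = {"major": 0.5, "niche": 0.5}

print(fair_rerank(candidates, group_of, target, k=4, weight=0.0))  # pure relevance order
print(fair_rerank(candidates, group_of, target, k=4, weight=1.0))  # fairness-aware order
```

With `weight=0.0` the sketch reduces to sorting by score, which makes it easy to compare against the base ranking. A re-ranker like this can only choose among the candidates it is given, so the length and composition of the candidate list matter as much as the re-ranking logic.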
And so there's a range of interventions at every stage of this pipeline.
If you've got a multi-stage model, you can put them at multiple stages.
You just can't assume that they're necessarily going to compose.
So making one component more fair is hopefully useful, but you can't just assume that doing so is going to make the final recommendations more fair.
You still have to check.
Because, as you said, there are multiple stages, and there might also be business rules that intervene after your re-ranking, and so on and so forth.
So lots of things that still can get you into trouble.
Regarding re-ranking, which I have come across a couple of times and which seems to be quite a prominent method: isn't this already a bit too late in the process?
If my model has learned some representation from the data, it has already embedded the bias, and that might pose a problem if I want to use the same model, or the embeddings it produces, to represent my users or items across several use cases.
So I'm a big fan of doing it relatively early in the process, let's say in the sampling or in the regularization stage when I work with multiple objectives, rather than in the re-ranking stage.
That's not only because of the risk, but also because re-ranking comes with some inefficiency: I would need to create more candidates for the re-ranking step, because if I come up with only 10 recommendations per user, I just might not have enough candidates to re-rank to fulfill a certain quota of, let's say, underrepresented items.
So what is your take on this?
Yeah, so I think it does depend.
I mean, it depends a lot on your system architecture.
Yeah, especially if you're just adding it as another re-ranking stage on top of a small list, you're probably going to run into problems, because if the bias affected what made it into that short list, you can't correct for that with a re-ranker.
If you're providing longer lists, then the re-ranker is better able to deal with it.
So if it's like, yes, it's biased, content from this group is not making it into the top 10, but it's still showing up in the top 100, then by re-ranking that 100-item list you can still get it back.
But I think it depends, like I said, on the system architecture.
And if you're in a multi-stage ranking environment anyway, maybe you don't necessarily add it as a new re-ranking stage, but you add it as an additional objective to your existing heavy re-ranker.
That's already processing your candidates: you get your candidate set, you have another objective in the heavy re-ranker model, and that then can be where you put in fairness.
I think from a research perspective, the re-ranking is interesting because it's something we can apply to a wide range of models to show generalized techniques.
An actual implementation of that technique might just fold it into the ranker instead of making it a separate stage, but it lets us show and then understand how the re-ranking logic interacts with different base models.
There's a variety of reasons I think to pursue it, but I think when you go to actually putting it in production, your system architecture is probably going to be as big a variable as any of the other external concerns in terms of where does it make sense to put this.
It might be as a new stage.
It might be as a new objective in an existing stage, and it might be in the data pre-processing. Ultimately you may want to bake it in throughout, so every stage is doing fairness-aware things, but that might not be the thing you want to do first.
You might look at where the easiest place to intervene is, try that first and see how it does, and then start to work on baking it in throughout the system architecture and throughout the data lifecycle.
Good point.
This is already, I guess, very beneficial advice that you have for RecSys practitioners.
What might be other suggestions that you have for practitioners who are now more convinced of fairness in recommender systems than they were before this episode?
So what might be your recommendations for them for best practices?
So I think the first thing would be to not get paralyzed by the complexity.
It is a complex space with a lot of different things.
Pick a problem that matters to you or that matters to your community or matters to your firm and start trying to work on it.
Pick a concrete concern, like: I am concerned that my system is unfair in this way to this group of stakeholders. Then work on figuring out how to measure that and address that, and then work on the next one.
It's kind of a cumulative process of, oh, we've addressed these unfairnesses.
We're now working on another one.
We maybe are learning some generalized things about how to make the system more fair overall.
There is some interesting work on fairness without demographics that can try to tackle that in a general way, but you still have to measure whether it succeeded in removing the harm you care about.
And so pick a corner and start working on it, figure out how to measure it, make progress, document what you did, document its limitations.
I talk a lot about limitations, and part of that is just about the honesty and transparency of the research output, but some of it is also about research productivity, because thinking clearly about the limitations of my current work has proven to be a very fruitful basis for developing the next work.
And so, yeah, find a corner that you care about, work on it, share what you learned, and then work on another one.
There are plenty of places, and that then allows you to make progress on something.
You're not trying to solve all of the problems.
You're trying to solve this one.
It's a lot easier to make progress on that, and then you can move to another one, and you can move towards more generalized solutions.
Great.
And also make it possible for others to learn from your work, and not just chase the next fourth-decimal improvement on some mean average precision.
Yeah.
The other thing then is to be clear about the normative basis for the problem that you're trying to solve.
Why from an ethical perspective, from a legal perspective, from a business interest perspective, do you care about this problem?
What's the basis for it being a problem?
And that will provide clarity in terms of evaluating, am I actually measuring the problem that I care about and am I actually then solving the problem that I care about?
We actually had this in our book work, because what we measured was the fraction of the books in the recommendation list that were written by women, which is a very good thing to measure.
It looks at overall representation of women in the book recommendation space.
But my original motivation for it was equality of opportunity: do you get the exposure needed in order to, say, get enough readers that you get a second book contract?
And so as we worked through the work and reflected on it, we kind of identified, yeah, we were measuring something interesting, but we weren't measuring the thing we originally cared about.
And that then motivated my participation in the expected exposure work, because that got us to a point where that gives us a metric that does get us a lot closer to the original normative goal.
And so being clear, what's the problem you care about?
And then having a robust argument for what the problem is that you're trying to solve, why that's a problem and how that affects your assessment of what the system's doing.
I guess some great advice on an operational and on a more strategic level that people can go with and put into practice and also put into research.
And research is something that I also want to start my conclusion a bit with, because I mean, it really sounds like there's still a lot of things to do and many open questions in that field.
What is it for you that you think is something that we might want to solve next or what is the most crucial thing that we haven't solved so far sufficiently?
I mean, there might not be a single thing. Or what is it that concerns you for the future in terms of fairness-aware RecSys research?
So a couple of things.
One is broadening from the distributional harms into the representational harms.
We've been making some progress.
My student's paper is one example, but there's still a lot that's very, very unknown about how we measure different kinds of representational harms, whether they're stereotypes or other representational harms, in a way that we can actually detect.
There's a lot of work to do in that broad space.
The other one, I think, from a long-term perspective, is how we communicate the results and then engage a broad community of stakeholders in discussion about those results and about the design of future evaluations, from a governance and accountability perspective, because it's one thing for us to devise a fairness metric and improve the system on it.
But do the people who are affected agree that that captures what it means to be fair to them?
Do they perceive the improvement as being more fair?
If the systems are going to have public trust, they not only need to be fair, but people need to believe that they're fair and understand that they're fair in a way that aligns with what they expect from a fair system.
And we're going to run into different people having different ideas and contested constructs.
So we're not going to have a system that everybody agrees is fair.
But it's not just an aspect of communicating what we think is fair to persuade people, but having that robust conversation so that at the end of a set of conversations and a set of analyses, we have a system that we and the stakeholders agree is treating them more fairly.
There's a lot of work that needs to be done to make that kind of vision, of systems that the public trusts are fair, a reality.
Okay, so it seems like there is still a lot of work to do there for both the people who are rather focused on academia, but also those that are working in practice.
And I mean, they are not necessarily separate from each other.
Great and interesting and really important things that you are bringing up there.
So thanks for sharing.
Yeah, as we are approaching the deadlines of the upcoming RecSys, with RecSys 2023 being held in Singapore, I just wanted to bring up a funny thing about you, because you have been a chair in different functions for RecSys.
And I have seen that, I guess since 2016, and I haven't looked further back, every second year, the America year as I would call it, you have been chair for something at RecSys.
I'm not sure whether you are aware of it, but it was publicity co-chair in 2016, general co-chair in 2018, doctoral consortium co-chair in 2020, and program co-chair in 2022.
So can we assume you won't be chair this year? And what are you going to do this year?
So I'm not on the organizing committee this year.
So this year, my organizing work is continuing to work on FAccTRec.
And so it's happening again.
Papers are due on August 3rd.
And this year we, of course, welcome all the kinds of work that have been at FAccTRec.
Talking about particular geographies is not a limit on other work.
Please send work from the U.S. and EU.
But we are hoping that with the location in Singapore, we can see some work on recommendation, fair recommendation, and accountable recommendation in Asia Pacific contexts, because a lot of the fairness literature has focused on U.S. and European contexts.
But those are not the only contexts in which recommendation is deployed.
And so please consider submitting your work.
And if you are working on these issues in Asia Pacific, we would dearly love to hear from you.
When is the submission deadline?
The submission deadline is August 3rd.
OK, so still plenty of time.
Yes, yes.
There's another workshop that might also be interesting to folks interested in this topic: the NORMalize workshop on normative values in designing and evaluating recommender systems, which is looking at these questions of what the values and the human goals are that we bring to what we want from recommender systems.
And there's overlap between that and FAccTRec.
We're in discussion with the organizers of that workshop.
But they're not congruent topics.
But I think that one might also be interesting to folks in this discussion of what is it that we want our systems to do in the world?
What world do we want and how do we want recommendation to help us get there?
Great.
I mean, there is nothing more to add at that point.
I was just very astonished when clicking through the RecSys page and scrolling through the workshops; I guess I have never seen as many workshops as there are going to be this time.
So almost 20 workshops for this year's RecSys.
So it's tremendous.
Yeah.
Before we finish the episode, Michael, is there someone that you would like me to invite to the show as well?
I mean, I guess you were mentioned before as well, but are there people that you would enjoy having in this show as well?
Oh, so that's a good question.
One person who comes to mind would be Sole Pera, to talk about her work on recommendation and search for children and for other groups of users that we don't often consider.
Mm hmm.
Mm hmm.
That's great that you are bringing up Sole Pera because actually, and here's the proof.
There's my list of people I'm currently approaching.
And do you see the list?
Yes, I do.
Yes, I do.
Because I do have to organize the upcoming episodes and reach out to some people, and I have been scrolling through, and then I came across her work, and now you are just saying the name that was written on my list.
So we are perfectly aligned there.
Who else?
So Alain Starke or Martijn Willemsen, to talk about, well, a lot of things around decision making and recommendation for behavior change.
Great.
They've done some fascinating work around recommendation for energy savings.
Yeah.
And a variety of other fascinating topics as well.
So they will be very, very good on how recommendation interacts with people's decision-making processes and the choices that they make in their lives.
Okay, great.
No, I mean, I have already transitioned to the point where I'm not asking for a single person.
So it's good that you brought up three.
This also makes that final recommendation round a bit more fair, because I have already seen that people have difficulties nominating just a single person.
There's no reason why it should not be more than a single person.
So thanks for that.
And yeah, overall, Michael, I really enjoyed the episode with you.
As I said, I'm going to talk about that topic at a meetup tomorrow.
And I feel at the same time, small and big, small in terms of all the knowledge that you kind of shared during this episode and that I learned along with what I have maybe known to a certain degree before, which was little, but now it's more.
So I feel also bigger than before.
So thanks for sharing all your knowledge, not only with me, but especially with all the listeners of this episode.
And yeah, I wish you all the best for your future research endeavors.
I'm looking forward to meeting you at this year's RecSys.
And thanks for all of that.
Thank you very much for having me.
It was a pleasure.
Thank you.
See you.
Bye.
Bye.
Thank you so much for listening to this episode of RECSPERTS, Recommender Systems Experts, the podcast that brings you the experts in recommender systems.
If you enjoy this podcast, please subscribe to it on your favorite podcast player and please share it with anybody you think might benefit from it.
Please also leave a review on Podchaser.
And last but not least, if you have questions, a recommendation for an interesting expert you want to have in my show or any other suggestions, drop me a message on Twitter or send me an email to marcel@recsperts.com.
Thank you again for listening and sharing and make sure not to miss the next episode because people who listen to this also listen to the next episode.
See you.
Goodbye.