TalkRL: The Reinforcement Learning Podcast

Taylor Killian on the latest in RL for Health, including Hidden Parameter MDPs, MIMIC-III and Sepsis, Counterfactually Guided Policy Transfer and lots more!

Show Notes

Taylor Killian is a Ph.D. student at the University of Toronto and the Vector Institute, and an Intern at Google Brain.

Featured References 

Direct Policy Transfer with Hidden Parameter Markov Decision Processes
Yao, Killian, Konidaris, Doshi-Velez 

Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes
Killian, Daulton, Konidaris, Doshi-Velez 

Transfer Learning Across Patient Variations with Hidden Parameter Markov Decision Processes
Killian, Konidaris, Doshi-Velez 

Counterfactually Guided Policy Transfer in Clinical Settings
Killian, Ghassemi, Joshi 

Additional References 

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP🎙️Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

Robin:

This is TalkRL Podcast. All reinforcement learning all the time. Interviews of brilliant folks from across the world of RL. I'm your host, Robin Chauhan. Taylor Killian is a PhD student at the University of Toronto and the Vector Institute.

Robin:

He works as an intern at Google Brain and in his own words, is an aspiring researcher slash scientist. Taylor Killian, thanks so much for joining us today.

Taylor:

Yeah. I'm really excited to have the opportunity to talk and share what I'm working on, and also how I've gotten to where I am.

Robin:

Super excited to chat with you. So how do you describe your research interests?

Taylor:

It's a great question. It's been under constant evolution, but in a directed fashion. You know, I had the opportunity quite early in my adult life to serve as a religious representative for my church, where I spent a full 2 years talking with people about my beliefs and sharing what they mean to me, and I was always fascinated by how people received that information and what they did with it. Some people would act in ways that I felt were counter to what they professed to be their beliefs, versus those who acted in line with their beliefs, and there's a lot of uncertainty in people's decision making. And after finishing that time, I returned to my undergraduate institution and thought I wanted to do behavioral science, but in an analytical way, because I was very interested in math and I felt like I was good at it.

Taylor:

But, probably fortunately for me, they didn't have a behavioral science track in the psychology department at my university, and so I was forced at the time to put decision making on the back burner as I progressed through my undergrad. But after graduating and getting a job as a computational scientist, that question kept coming back: how do we make decisions in situations where there is a high level of uncertainty, or where we might have some prior context? A lot of those questions in my own mind came from sort of a neuroscience or behavioral science background, but I'm quite analytical in my thinking, and given my limited exposure to the world, I thought that had to be within applied math. And what is there within applied math to study that has to do with decision making?

Taylor:

And I was fortunate to get the opportunity to pursue a master's degree at Harvard while I was working, and I approached a faculty member and said, hey, I'm really interested in applied math, but about decision making. And she said, oh, that sounds like reinforcement learning, and I have some projects along those lines. Are you interested in health care?

Taylor:

And, my father is a doctor, and I had sworn to never be a medical doctor in my life, just given the stress that I observed in his life; it didn't seem like that was the path for me. But I said, yeah, you know, I'm interested in health care. I think it's a valuable area to be pursuing research solutions to some of the major problems. And that kind of became the introduction to where I am now as a researcher who is trying to develop stable and robust methods within reinforcement learning, motivated by or applied to health care problems. And so all of that was, I think, preamble to a quick answer. I apologize for editorializing a little bit, but the quick answer about what my particular research interests are is: within the construct of decision making under uncertainty, are there ways that we can develop robust, reliable, or generalizable approaches within reinforcement learning, in highly uncertain or partially observed settings?

Robin:

Awesome. So this is your episode. I encourage you to editorialize all you want. That's totally a bonus for us. As listeners, we wanna know what you're thinking.

Robin:

This is great. From looking at your Google Scholar page, you have some work in physics related to fluid dynamics, and then machine learning with radio sensors, and then some core ML stuff like capsule networks. So did you have, like, different chapters in the past where you focused on these different things, and is your current direction the future for you?

Taylor:

I really struggle with the word chapters, because there's a connotation that that door is closed. In some of these circumstances, the door is definitely closed. Like, I'm probably never going to return to working in experimental fluid dynamics. A lot of that I did during my undergraduate as a research assistant, in this applied and computational math program that I designed for myself. I had the fortune of working with Tad, who's now at Utah State University, who pioneered really fascinating imaging techniques for experimental fluid dynamics, and he needed people to help develop computational models.

Taylor:

And I had the interest but also the ability to join him, and that prepared me in a way to take the job that I was ultimately offered at MIT Lincoln Laboratory, which is where I did more radar processing, because that is the heritage of Lincoln Laboratory: it was one of the early inventors and adopters of radio frequency sensing, with a heritage that comes from the MIT Radiation Laboratory, which spun out shortly after World War 2 into what is now known as Lincoln Laboratory. I was not fully aware of the type of work that I'd be getting myself into, coming into a predominantly electrical engineering business. But it was great, and I learned a lot. That stage of my career really taught me a lot about what I was interested in and what I wanted to do, and I was really fortunate that I was given the opportunity as part of my employment to return to school, to really flesh out what those research and professional interests are. After I finished my degree, I needed to return to work full time to fulfill my obligations to them, and that's where we were kind of forced to do more low-hanging fruit from a government perspective; they didn't quite have the appetite for sequential decision making in the way that I was proposing.

Taylor:

And so we were looking at model robustness for vision-type applications. And that's where the capsule network work came from.

Robin:

Cool. Okay. So looking at your health related papers, some of the ones that I looked at, I get the sense that you really do dig deep into some diverse angles on this topic in machine learning for health. Can you tell us how you think about your research roadmap going forward? Like, do you have a very specific path in mind?

Robin:

Or are you doing a lot of exploration?

Taylor:

I think the way that I, at least, have diagnosed my inability to get on a very specific path is that there are too many good ideas out there that need to be solved, or just fascinating problems that I see. I mean, let me backtrack a little bit: my training, from the earliest days of my aspirational scientific career, has been in interdisciplinary settings where I come with a set of expertise, or growing expertise, and I'm working with experts from a different area, and we come together to solve a common problem. That's been a standard for my entire career, from back when I was in undergrad through my employment to today, and I find it unfortunate when people refuse to work in interdisciplinary fashions. I think machine learning, and AI in general, is naturally an interdisciplinary field, and I'm really grateful to be a part of it. That is not to say that I don't have specific directions in mind.

Taylor:

A lot of the diversity in my research has come through just taking advantage of the opportunities right ahead of me, or working on that good idea that just can't be put away. More specifically, within a health care context, as I mentioned earlier, one of my core research interests is in generalization and robustness. Currently, machine learning models applied to health care problems are not robust, and they are not reliable, much less generalizable. And one of the core research focuses that I have throughout my PhD, though it's a big enough problem that I think it's hopefully going to be a hallmark of my career, is developing suitable methods or model frameworks that allow for distributed processing, but also model transfer between hospitals. Yeah.

Taylor:

I have family that live in very rural settings where their access to health care technology is quite limited. My professional career has brought me to large urban settings where we have research hospitals and fantastic opportunities for health care, and I would hate for any technology I developed to not be accessible and available to my family that live in less opportune areas. So that is one of the major directions I'm going for in my career: can we develop things that can operate reliably outside of the environment that they were trained in? And along the way, there are little battles or fires that you have to put out.

Taylor:

You have to develop a way to handle uncertainty. You have to handle partial observability, or changing action sets or changing feature sets, depending on the particular application within health care. You might get very diverse types of distribution shift between these environments. So along the way, there's always gonna be some really good idea, in a collaborative fashion, that I'm gonna be working on. But ultimately, the direction is making reinforcement learning reliable and functional within an off-policy or partially observed setting.

Taylor:

And, so from, like, a technical standpoint, that's probably where I sit within RL, but I'm pretty open to lots of different things.

Robin:

So from my point of view, you have this amazing ability to innovate with a real breadth of different methods, kind of the opposite of the one trick pony. So how do you balance learning new things versus applying what you already know? Like, how did you come to this breadth? And I'm talking both on the ML side and maybe the health side too.

Taylor:

Yeah. You know, first, it's very generous for you to say that I've been innovative. I think it's more desperation than anything: you come to a problem and you have an idea of how it should work. And I've been relatively new to this field; like, I didn't know anything about machine learning until I started my master's degree. That's, thinking back, now 4 years ago, and I had very rudimentary skills in programming at that time.

Taylor:

And so I've approached research in a sponge-like manner, sort of just trying to draw insight from lots of different areas. And I think that in order to solve some of these more challenging problems, we need to look at the ways that things have worked in other places, from the health care perspective. I think this is important for anybody who's trying to apply any machine learning, much less reinforcement learning, to the real world: you have to talk with experts. You have to understand what they've done and what the relevant problems are. It's an unfortunate crutch of ours in the research community to pay lip service to an application to motivate our work and give it some meaning.

Taylor:

And I do appreciate the efforts by my colleagues within the reinforcement learning community when they talk about off-policy reinforcement learning in particular and motivate it by, oh, this could be useful for health care. That's good and that's important, and we need to make strides on some of these important technical problems and the challenges that we face with them. But if we're doing that in a vacuum and in isolation, without knowledge of the actual practices of the doctors who would be using the technology, then we're wasting our time and we're wasting their time. And we're developing solutions and putting hype around them that, if adopted, would potentially be harmful and quite dangerous. I think it's important to recognize our own limitations, but then also pick up the expertise and the best practices of those we want to work with.

Taylor:

And I think by synthesizing the best practices of various fields... you know, I struggle with imposter syndrome like anybody, and it's probably made worse by the fact that I try to do this synthesis: I don't feel like I'm getting to be good at any one thing, but rather, and this is my doubts talking, that I'm becoming mediocre at a lot of things, or at least knowledgeable about what might be out there, without having any dedicated experience. But that's partially why I chose to get a PhD: to be able to slow down a little bit and do an in-depth focus on a particular area of research so that I could become proficient and good at that area, and then expand as I move forward in my career.

Robin:

So can we talk about the health and clinical setting in a little more depth, for people who maybe understand RL but have focused on Atari or OpenAI Gym? Can you help us understand and appreciate what is really different here about RL in health in general, and in a clinical setting?

Taylor:

Yeah. I've been asked this question a lot, just by colleagues and friends, and I think it's really important to preface the comments I'll give in response with the motivation I have for doing health care research as a reinforcement learning researcher: the majority of open problems within reinforcement learning, such as sparse rewards, credit assignment, the exploration-exploitation trade-off, off-policy RL, and the list kind of goes on and on in terms of these big open challenges within the reinforcement learning community, all of those are present in spades in health care problems. Health care is characterized, at least in the way that I observe it, as an inherently sequential process where an expert decision maker, with their best understanding of the setting, the environment, the patient, the multiple patients they're seeing with their various compounding conditions and symptoms, takes that information and makes the best decision they can, and then they observe the response, and then they adjust, adapt, and try something again. And they do this hopefully toward the patient improving and leaving the hospital, or having a stable and healthy life if it's in a less acute setting, or in the unfortunate circumstances where the clinician is unable to provide care adequate to have the patient survive.

Taylor:

And in some cases it's unavoidable, right? Patients' and individuals' health decays to a point where there's not much that can be done. The standard of care at that point changes quite drastically, and it's remarkable to see the types of approaches that clinicians take when they hit these sort of dead-end situations within health care. But in terms of more directly answering your question, how does it differ from your traditional, simulator based RL research? The major difference is that we have a fixed dataset. We are unable to explore, and we're unable to do online evaluation of the policies that we learn from this fixed data.

Taylor:

And so that opens up a big can of worms in terms of off-policy evaluation and off-policy reinforcement learning in general. This is a largely unsolved area of research. There have been some fantastic efforts from a variety of groups throughout the world looking at off-policy evaluation and ways to improve it. In particular I'd like to highlight the work that's been coming out of Finale Doshi-Velez's group, with Emma Brunskill as a collaborator, as well as Nathan Kallus from Cornell Tech. These two groups, among others.

Taylor:

I mean, there's significant effort from David Sontag, for example, and others; I could list a lot of names of people who are looking at off-policy evaluation and making it stable, so that in these settings where we cannot explore and we cannot evaluate new policies online, we know how reliable the outcomes are that we suggest we can achieve with reinforcement learning. And that's under the traditional sense of: we are learning policies to suggest actions. There's an alternative approach, however, that I've been investigating in collaboration with Mehdi Fatemi from Microsoft Research based in Montreal, of looking at it in sort of an inverse direction: using this sequential framework where we can distill long term outcomes to our current decision, can we choose which actions to avoid instead? We have some preliminary work under review along these lines right now, and it's sort of making the statement of, this is how RL in health care is different: we can't take the traditional approach because we can't explore.

Taylor:

We can't experiment with actions, but we can use the same framework to describe what's going on and to maybe identify optimal behavior from the observed experts, the clinicians who have helped generate the data that we use. And so, I feel like I'm kind of belaboring the point a little bit here, but the major difference is just in data access, as well as being able to test and evaluate the solutions that you find.
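[Editor's note: to make the off-policy evaluation problem Taylor describes concrete, here is a minimal sketch, not from the conversation or any specific paper, of per-trajectory weighted importance sampling over a fixed dataset of clinician trajectories. The data format and the availability of behavior-policy probabilities are assumptions made for illustration.]

```python
import numpy as np

def wis_estimate(trajectories, eval_policy, gamma=0.99):
    """Weighted importance sampling estimate of an evaluation policy's value
    from a fixed, previously collected dataset (no further exploration).

    trajectories: list of episodes, each a list of (state, action, reward, behavior_prob)
    tuples, where behavior_prob is an estimate of the clinician (behavior) policy's
    probability of the logged action -- assumed available for illustration.
    eval_policy(state, action) -> probability of that action under the policy we evaluate.
    """
    weights, returns = [], []
    for traj in trajectories:
        rho = 1.0          # cumulative importance ratio for this trajectory
        g = 0.0            # discounted return of the logged trajectory
        for t, (s, a, r, b_prob) in enumerate(traj):
            rho *= eval_policy(s, a) / max(b_prob, 1e-8)
            g += (gamma ** t) * r
        weights.append(rho)
        returns.append(g)
    weights = np.array(weights)
    returns = np.array(returns)
    # Normalizing by the sum of weights (WIS) trades a little bias for much lower variance.
    return np.sum(weights * returns) / max(np.sum(weights), 1e-8)
```

Estimators like this are exactly where the stability concerns above show up: when the evaluation policy strays far from clinician behavior, the importance weights explode and the estimate becomes unreliable.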

Robin:

So in Atari or OpenAI Gym, reward is really simple. But here, how do we think of reward in the health setting? Like, are we trying to get the best expected average health outcome across the population? Or should we be trying to avoid the worst?

Taylor:

That opens up a really interesting can of worms when you talk about expected health outcomes, because there's a plethora of work within the ML fairness community showing that optimizing expected outcomes is incredibly biased and unfair towards marginalized or minority communities. And this is particularly challenging in health care. There's some work from a couple of years ago that Irene Chen published with David Sontag out of MIT, where she looked at where the discrimination within a health care decision framework would be coming from. She looked at a cross section of different groups and peoples within the MIMIC dataset based on this expected positive outcome and found that women and racial minorities were provided with much worse expected outcomes because they are not adequately accounted for in the training. So it's difficult to say that a reward in health care, from an RL perspective, could be this mean or median type performance, and this is where I think the holy grail that we're all striving for within the machine learning for health care community is personalized medicine, looking at an individual by individual basis: can we provide adequate care and treatment selection that is tailored to a particular situation or a particular patient condition?

Taylor:

And the motivation, or at least how that informs the design of rewards: it's often better to use hard and fast binary rewards. For a hospital acute care setting, that's a pretty easy thing to define: whether somebody survives and is discharged, or is allowed to leave the hospital, or they unfortunately expire and succumb to their symptoms. So that binary plus one / minus one reward is pretty easy to define. But the other types of reward definition you might find, if you're looking at a long term care scenario or somebody who's trying to manage diabetes, for example, that reward design is gonna be largely informed by the particular problem they're working on. So, back to this diabetes example, you might want to maximize the time that the patient is in a healthy glucose range, or minimize the times that their glucose levels swing out of range, for instance when they have too much blood sugar, which is quite dangerous.

Taylor:

Right? And so you will design your rewards based on the particular problem. A good example of somebody who has done significant work looking at defining rewards is Niranjani Prasad, who just recently graduated from Princeton, in her work with her advisor, Barbara Engelhardt. The one paper I have in mind is where Niranjani looked at removing a patient from a ventilator, something we're all very aware of right now in this age of the coronavirus pandemic: when is the appropriate time to remove somebody from a ventilator? She designed a very specific reward function that helps describe the clinical problem of removing somebody too early or too late from a ventilator.

Taylor:

And she has some follow-up work that she recently published this summer, looking at what an admissible reward is in a health care problem: does it cover the right physiological characteristics of the problem, is it attainable from a clinical practice perspective, etcetera? So I think the short answer is that it's nuanced. The reward definition within health care can be as simple as a binary outcome versus some continuous measure of a long term process.
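[Editor's note: as a concrete illustration of the two reward styles contrasted above, a binary terminal outcome for acute care versus a continuous time-in-range signal for chronic care, here is a minimal sketch. The thresholds and function names are illustrative assumptions, not values taken from any of the papers discussed.]

```python
def acute_care_reward(terminal: bool, survived_to_discharge: bool) -> float:
    """Sparse +1/-1 reward given only at the end of an ICU trajectory."""
    if not terminal:
        return 0.0
    return 1.0 if survived_to_discharge else -1.0

def glucose_time_in_range_reward(glucose_mg_dl: float,
                                 low: float = 70.0,
                                 high: float = 180.0) -> float:
    """Dense per-step reward for chronic glucose management:
    +1 while the patient stays inside an (illustrative) healthy range,
    with a penalty that grows the further they drift outside it."""
    if low <= glucose_mg_dl <= high:
        return 1.0
    # Penalize excursions proportionally to how far out of range they are.
    excess = min(abs(glucose_mg_dl - low), abs(glucose_mg_dl - high))
    return -excess / 100.0
```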

Robin:

Okay, cool. Just a little bit more on that. I guess what's not clear to me is, if you have a distribution of outcomes, let's say in the long term care setting, your policy could be shifting the mean of that distribution, but it could also be changing the variance. So different policies might have different types of tails. I just wonder if that's something you think about: do we do, like, a maximin thing, trying to make the worst case outcome for the population as good as possible, versus the more expected, more average outcome? I take your point about fairness across different subgroups, and I think that question applies to them too, no matter how you split it.

Taylor:

Yeah. You know, I'm not super aware of this approach being undertaken within a health care context yet. I know there is some work within the causal inference literature applied to health care with a machine learning focus that has been doing this; some of the work from Susan Murphy has been thinking about this. But I would also point to some interesting work that came out this summer from Sergey Levine's group that is doing this max-min approach to Q-learning, where their algorithm is titled conservative Q-learning, which is very creative and I appreciate that. And then there's another paper that just came out a couple of weeks ago from Kamyar Ghasemipour, a friend of mine who's been working as a student researcher at Google with Shane Gu, and EMaQ is the name of their approach, where they take this expected positive reward in this off-policy sense and then marginalize against some of the subgroup type challenges.

Taylor:

And their setting, well, both Sergey Levine's paper and this EMaQ paper are looking at robotics specifically, but some of the characteristics could potentially be applied to some of these long term, continuous type problems within a health care setting for sure. But there really hasn't been a whole lot that I'm aware of that has explicitly looked at domain shift within the expected rewards as a response to optimizing a policy. Right?
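[Editor's note: for readers unfamiliar with the conservative Q-learning idea mentioned above, here is a minimal sketch of its core regularizer, pushing Q-values down on out-of-distribution actions while keeping them up on logged actions, written against a plain PyTorch Q-network. The network interface, batch format, and alpha value are illustrative assumptions, and this omits the rest of the full algorithm.]

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, batch, target_q, alpha=1.0, gamma=0.99):
    """One-step loss combining a standard Bellman error with a CQL-style penalty.

    batch: dict of tensors with keys 'obs', 'action', 'reward', 'next_obs', 'done'.
    q_net(obs) -> tensor of Q-values for every discrete action.
    target_q: a frozen copy of q_net used for bootstrapping.
    """
    q_all = q_net(batch["obs"])                                    # [B, num_actions]
    q_data = q_all.gather(1, batch["action"].long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_q(batch["next_obs"]).max(dim=1).values
        td_target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q

    bellman_error = F.mse_loss(q_data, td_target)

    # Conservative penalty: logsumexp over all actions (a soft maximum over
    # possibly-unseen actions) minus the Q-value of the action actually taken.
    conservative_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return bellman_error + alpha * conservative_penalty
```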

Robin:

Thanks. Those are great tips. Okay. So what about model based RL in this setting? My impression is that a lot of model based RL looks at domains that are really quite deterministic.

Robin:

Either they have no noise, or maybe they have very simple noise. So how do models differ in these health settings? Are they still useful? Maybe they're useful in different ways. Is it possible to do planning with models in noisy environments like this?

Taylor:

I think that's an open research question that's yet to be determined. Some of my prior work has been within model based RL in health care, where we try to adapt the model based on local features of the task or the environment; we're gonna talk about this later, so I won't get too far into it. But in general, I think there's a danger in thinking that model based RL is the solution. I have continually found myself thinking this, and I think it has its use cases for sure, but like you pointed out, a lot of those use cases are in more simplistic settings where you have deterministic transition behavior and very low noise environments. Extrapolating to a health care scenario, what is your model of, right? Like, how well calibrated can a model of the human body be, when we know so little about it? Even with the centuries of medical research that have produced a lot of great understanding and insight about medical practice, and also just our physiology.

Taylor:

There's still a lot we don't know, and in model based RL, the performance of your policy is largely driven by the accuracy of your model, or at least how well it can describe the environment around you. There have been papers in the recent past, under the imitation learning framework, that look at what happens if you have a suboptimal model or a suboptimal demonstrator, but when you add additional layers of complexity, such as nondeterminism in the transition statistics, or going fully off-policy, a lot of those solutions don't really work that well.

Robin:

So let's move on to hidden parameter MDPs. This topic is related to your master's thesis. Is that right?

Taylor:

Yeah. Yeah. So the HiP-MDP was the core foundation of my master's thesis. I was fortunate to be able to repurpose the paper we published on it as my master's thesis, with additional explanatory introduction chapters about Gaussian processes and Bayesian neural networks. But, yeah, the HiP-MDP, it's something that I really enjoy talking about because, one, it means a lot to me.

Taylor:

It was the first, like, real research project that I was able to start and finish as a machine learning researcher, and the fact that we got it published makes me feel at least that it was successful; that other people have been building on top of it is another way to, I guess, deem in my eyes that it's been successful.

Robin:

So what is a HiP-MDP, and why is it useful and interesting?

Taylor:

You know, George Konidaris, who is our collaborator on this project and one of the originators of the HiP-MDP with Finale Doshi-Velez, my adviser at Harvard. Yeah, he might not like me saying this, but I view the HiP-MDP as an abstraction of the POMDP, where you're given a family of related MDPs or tasks and the major differentiation is a perturbation in how the transition dynamics are observed. The HiP-MDP parameterizes that variation in the transition dynamics, because if you can accurately or correctly prescribe what the individual MDP's or individual task's transition dynamics are, you should be able to optimally solve that problem given prior observed behavior from other tasks within that same family. As an illustrative example, if you've learned to write with a pen, you can most likely write with any other pen that you pick up, no matter how heavy it is, no matter what the tip is like.

Taylor:

As long as it has ink, you will likely be able to pick it up and write with it, and that's because our body and our mind have been trained to handle these types of variations, where a reinforcement learning agent hasn't necessarily. Right? It's not necessarily robust to slight perturbations in the weight of an object, or how the mechanics of a moving arm might change if the tolerances on a motor are off by a little bit. So what was originally proposed in their original paper, which was put on arXiv in 2013 and finally published in 2016, was that if you can filter or estimate, among all of the prior observations of this family of tasks, to find something that's similar to what you're observing now, and parameterize it that way, then you should be able to accelerate your learning in the current task. And so my work during my master's degree was trying to make that approach scalable and more robust, because, as I said, they used a filtering procedure where the prior over that filtering procedure was seeded through an Indian buffet process, which is really difficult to scale.

Taylor:

At least in the setting that they were using, to establish basis functions over the transition dynamics they're observing. And so one of the insights that Finale and George came up with and proposed to me when I was starting the project was: can we take these filtering weights that are being used to linearly combine these learned basis functions of our transition statistics, and use them as input to a parametric model? Or, in the original setting, can we use them as input to a Gaussian process? Because they were still interested in these nonparametric statistical basis functions at the time.

Taylor:

We found that GPs aren't a really great setting, or at least not with the understanding that we had of them at that time; this is late 2016. It was better to move into a probabilistic framework where we still wanted to be able to do inference over these estimated hidden parameters that connect the family of tasks together, but still be somewhat scalable to higher dimensional problems and to more data. And so that's where we replaced the Gaussian process with a Bayesian neural network, to be a stand-in transition model that we could then optimize based on the individual tasks we're observing, as a function of these hidden parameters. And so, I feel like I've been meandering a little bit. The HiP-MDP, in summary, is a method by which we can describe small perturbations in observed dynamics between related tasks.

Taylor:

And so from a health care perspective, a task would be treating a patient from a cohort that all have, you know, AIDS, for example; that was a simulated problem that we addressed in our paper. When you observe some new patient, what about their physiology can you learn from their observed response to the medication that you give, and can that then be used to help inform the type of medication you want to give them in the future? This was done, at least within this construct of hidden parameter Markov decision processes, by estimating and optimizing the hidden parameters for that individual patient. And then after we solve the problem for that one patient, we would take the observed statistics and these hidden parameters and keep them, along with our updated transition model, the Bayesian neural network, to be prepared for the next patient that would come in. And then the hope here is that if you find somebody who's similar to what you've observed in the past, it will be easier to update and optimize their hidden parameters and then get a quicker or more efficient and effective policy downstream.
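[Editor's note: to make the mechanics above concrete, here is a minimal sketch, not code from the papers, of the core HiP-MDP idea: a shared transition model that takes a per-task hidden parameter vector as extra input, with the hidden parameters for a new task or patient fit by gradient descent on that task's observed transitions. The papers use a Bayesian neural network; the plain MLP here is a simplifying assumption, as are all names and dimensions.]

```python
import torch
import torch.nn as nn

class HiPTransitionModel(nn.Module):
    """Shared dynamics model T(s, a, theta) -> next state, where theta is a
    low-dimensional hidden parameter vector identifying the task/patient."""

    def __init__(self, state_dim, action_dim, hidden_param_dim, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + hidden_param_dim, width),
            nn.ReLU(),
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, state_dim),
        )

    def forward(self, state, action, theta):
        return self.net(torch.cat([state, action, theta], dim=-1))

def infer_hidden_params(model, transitions, hidden_param_dim, steps=200, lr=1e-2):
    """Given a few observed (s, a, s') transitions from a *new* task,
    keep the shared model frozen and optimize only theta for that task."""
    for p in model.parameters():
        p.requires_grad_(False)  # freeze the shared transition model
    theta = torch.zeros(1, hidden_param_dim, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    s, a, s_next = transitions  # each of shape [N, dim]
    for _ in range(steps):
        pred = model(s, a, theta.expand(s.shape[0], -1))
        loss = ((pred - s_next) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach()
```

Once theta stabilizes for the new patient, the frozen shared model plus that theta can be used to simulate or plan for them, which is the acceleration described above.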

Taylor:

And so

Robin:

So this sounds a bit related to contextual MDPs, but that's a slightly different concept, right? Could you help us compare and contrast those two concepts?

Taylor:

Yeah. It definitely sits within the same idea. I view the contextual MDP as a specialized use case of what, thanks to recent research, has been termed a generalized hidden parameter Markov decision process. The contextual MDP has largely been used in multi-armed bandit settings, where the reward is fixed per task, and that specific context, the reward being fixed or the particular user being different, is known to the learning algorithm. Where the HiP-MDP differs is that it doesn't assume that knowledge. It just observes that there's been a change, and we assume in the construction of the HiP-MDP that you will know when you are approached with a new task, and it's the algorithm's job to figure out and learn an approximation for that context.

Taylor:

And so this generalized hidden parameter MDP paper that I've referenced is from Christian Perez and Theofanis Karaletsos, who were at Uber AI at the time, and it was presented at AAAI this past winter.

Robin:

So can observables give us hints about the hidden parameters, like the demographics of the patient maybe? Or are we assuming that we don't get any hints about these hidden parameters except for what happens with the transitions?

Taylor:

Yeah. So I think, in practice, if this were scaled and improved to be usable in a real setting, demographics would absolutely be a part of that contextual process of learning the underlying or hidden parameters that you can't observe. Demographics such as race or gender, height, weight, age, etcetera, you can go down this list; those things do help give some understanding or context, but there's still broad variance within these demographic groups. So I would view demographic information as a head start for learning some actual physiological context, but ultimately it just has to be about the data, about the observed transitions and how the patient responds to medication.

Taylor:

And, you know, in an ideal setting, that's how health care works: doctors come with their training and their understanding of the medical literature as well as the practice of medicine, and they use that to inform their initial diagnosis and treatments, but they adjust and they adapt, or at least the best ones do, and they adapt in a hopefully compassionate way. And I think that's the way we're trying to develop machine learning methods: to have this built in, at least conceptual, understanding of a problem and develop a solution that adapts. And, you know, this might be overthinking it, but in a compassionate way, in a fair way, in a way that is equitable across the cross section of the demographic.

Robin:

So you talked a little bit about how you improved the HiP-MDP solution, and maybe the setting, with your first paper. I wonder if you could walk us through the set of HiP-MDP papers: what different types of progress were made, in terms of both the setting and the solution?

Taylor:

Yeah, I am happy to do that. So the original paper by Finale and George just set up the problem and introduced the framework. Their early work did bear some similarities to a few other prior pieces of literature that I'm kind of spacing on right now, but it was very slow. What they did was very slow, and it couldn't scale to problems of higher than 4 dimensions.

Taylor:

And I kind of chuckle when I say that because, in our updated paper, we didn't look at anything greater than 6 dimensions, but we added 2 factors of variation in the state space. What we did in my first paper looking at hidden parameter Markov decision processes was develop a scalable, or at least a functional, approach to learning these hidden parameters, and we did that by virtue of inference through a Bayesian neural network. What was pretty apparent to us as we were doing that research is that it was still computationally inefficient and really expensive, because we would need to simulate thousands of episodes using that model in order to infer what those hidden parameters were. It worked for what we were trying to do, but there's no way that approach would work in a real setting. And after I finished my master's degree, I had to go back to work full time, so I didn't get a chance to really participate in the next step of this, but luckily Finale had a brand new PhD student, named Jiayu Yao, start right at the same time I graduated.

Taylor:

And Jiayu was fascinated by the idea of the HiP-MDP and was interested in making it more computationally feasible, without needing to run thousands of simulated episodes of an environment in order to estimate these hidden parameters. Her idea was to distill the optimal policies from prior tasks into a generalized policy class, in the same way that we were distilling all the transition functions into this learned Bayesian neural network parameterized by the hidden parameters, which gives you the change of behavior. She said, can we learn those hidden parameters using our transition model, but without relying on that transition model being absolutely correct? We just need it to be good enough to get us a stable set of hidden parameters, and then use those hidden parameters to parameterize the policy class and get the differentiated behavior in this general policy. Unfortunately, while it is great work and it worked really well, we have not yet been able to convince reviewers that we did a good enough job.

Taylor:

But we did put the paper... oh, we didn't put the paper on arXiv yet. We still have some things in the works to hopefully improve it, looking at more of a theoretical bent, and Finale has had some undergraduate mathematics students looking at more of the theory behind these hidden parameter Markov decision processes, specifically with this direct policy transfer framework. But I do have a version of the paper that we presented at an ICML workshop 2 years ago on my website, and it has been cited by other researchers, so at least it's making some contribution in that fashion.
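[Editor's note: as a rough illustration of the direct policy transfer idea described above, a single policy conditioned on the inferred hidden parameters, so a new task only needs its theta estimated rather than a policy retrained, here is a minimal sketch. The architecture and interfaces are assumptions, not the paper's actual method or code, and it reuses the hypothetical HiPTransitionModel and infer_hidden_params from the earlier sketch.]

```python
import torch
import torch.nn as nn

class HiPConditionedPolicy(nn.Module):
    """One policy shared across the task family, modulated by the hidden
    parameters theta inferred for the current task or patient."""

    def __init__(self, state_dim, num_actions, hidden_param_dim, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + hidden_param_dim, width),
            nn.ReLU(),
            nn.Linear(width, num_actions),
        )

    def forward(self, state, theta):
        logits = self.net(torch.cat([state, theta], dim=-1))
        return torch.distributions.Categorical(logits=logits)

# Hypothetical usage on a new task: infer theta from a handful of observed
# transitions (using the shared transition model), then act with the frozen
# generalized policy -- no per-task policy optimization.
# theta = infer_hidden_params(shared_model, early_transitions, hidden_param_dim=5)
# action = policy(state, theta.expand(state.shape[0], -1)).sample()
```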

Robin:

This seems like it's gonna be huge, basically, for real world RL. I can't imagine it just being limited to the health care setting. It seems like it would touch everything.

Taylor:

Yeah. I have similar thoughts about it. I think this approach to adaptation and generalization in RL is really appealing. We see that with the meta learning community within RL, which has been doing fantastic work looking at ways to do adaptation online: as you are learning in a new task, can you adapt a policy class to work optimally?

Taylor:

However, I do stress, at least in my own mind, that meta learning, and even my own work, is fitting to a single distribution; it's really difficult to get any of these things to work outside of the observed task class that you have in your training set. There have been some efforts in the meta RL community looking at out of distribution adaptation, but I haven't found any of the papers to be overly convincing. One additional limitation of our work is that we only looked at the perturbation of the transition dynamics. There are additional factors of variation in an RL problem that you can account for, and this was the major focus of the generalized HiP-MDP paper from Christian Perez and his collaborators: they factored the hidden parameters to account for different forms of variation, so variation in the reward structure, variation in the transition structure, and I think they had another factor of variation, but it's escaping me right now.

Taylor:

And that has also been a feature of some additional follow-on work. One particular paper that I have yet to read, but that I've been in a lot of discussions with Amy Zhang about, who's the lead author: she took the HiP-MDP framework along with the block MDP framework, which is something she has inherited from John Langford and has been looking at on her own, for quick adaptation but also for synthesizing policies. They're addressing different factors of variation that you might observe among a family of tasks. So there's a lot of really exciting and fun work in the days to come, looking outside of a meta RL perspective, because I'm still not overly convinced that's the right approach; we're using a lot of computation to fit to the same distribution. But I think the insights we're gaining in that line of research are really informing creative modeling strategies and approaches within a more traditional RL framework.

Robin:

So it sounds like this area has a lot of potential, and it's not fully solved yet.

Taylor:

Yeah. That's right. There's a lot that can be done, and I'm excited that there are a lot of researchers looking at it. Well, I shouldn't say a lot; there have been efforts in the recent past that indicate people are interested in this type of problem.

Robin:

I'm gonna move to another recent paper of yours, Counterfactually Guided Policy Transfer in Clinical Settings. What's going on in this paper? You're tackling domain shift in RL using causal models. Is that the idea?

Taylor:

That's the major technical direction of the paper. I think it's a little easier to stomach by describing the motivation. As I referenced earlier, there is a lot to do in order to make models within machine learning and health care transferable and generalizable between medical institutions. One of the major challenges of this model transfer is that practices vary between hospitals. The types of measurements they take and the type of instrumentation they have at different hospitals confound the transfer problem, but the major confounding factor that limits the ability to transfer models between hospitals is that the patient population is completely different, and it can vary quite widely, with various conditions or underlying syndromes that the patient population has.

Taylor:

You can think, for example, of the overall structure of a transfer learning problem: you'll have some source setting or source dataset that you use to train your model, and you want to apply that somewhere else with minimal adaptation, some adaptation, or no adaptation, depending on how confident you are. In the health care setting, that large source environment could feasibly be a research hospital in a large urban environment where you do have some population diversity, but the patient cohort in your dataset will be pretty different from a regional diabetes clinic, for example, where you might have had only a minority of patients within your source setting that had diabetes, with particular practice and care taken to accommodate them. But when you go to a diabetes clinic, that's the majority of the population all of a sudden. This patient population might also skew a bit older, and there might be other demographic differences.

Taylor:

And by blindly applying a model from a large research hospital to a regional clinic, you're going to miss a lot of that variation and, as I said earlier, potentially do a lot of harm and be overconfident in the policy or treatment strategy learned from the major hospital when applying it to the smaller setting. So that was the primary motivation for our work: looking at a way to address this form of domain shift within the underlying data distribution. We did this with a simulated cohort of, again, simulated patients that had sepsis, and one of the factors of variation you can set in defining these simulated patient cohorts is the percentage or proportion of that population that is diabetic. We used a simulator that was developed by David Sontag's group out of MIT and was featured in a paper that Michael Oberst and he published at ICML last summer. So we took their simulator and built sort of a wrapper around it that allowed us to vary the proportion of patients within it being more or less diabetic than the source environment, and then studied algorithmic solutions or improvements to some off-policy RL settings, with counterfactual inference, to address this type of domain shift in the patient population itself.
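[Editor's note: to give a feel for the experimental setup described above, here is a minimal sketch of how one might construct source and target cohorts with different diabetic proportions around a sepsis simulator and compare a source-trained policy on both. The `SepsisSimulator` interface, the proportions, and the evaluation loop are hypothetical stand-ins, not the actual code from the paper or from Oberst and Sontag's simulator.]

```python
import numpy as np

def rollout_return(env, policy, max_steps=20):
    """Run one simulated patient episode with a fixed policy and return its reward sum."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total

def evaluate_cohort(policy, diabetic_proportion, n_patients=1000, seed=0):
    """Average return of a policy on a cohort where each simulated patient is
    diabetic with the given probability (the factor of variation being shifted)."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_patients):
        is_diabetic = rng.random() < diabetic_proportion
        env = SepsisSimulator(diabetic=is_diabetic)  # hypothetical wrapper interface
        returns.append(rollout_return(env, policy))
    return float(np.mean(returns))

# Hypothetical comparison: a policy trained on a mostly non-diabetic source
# cohort, evaluated under a target cohort that is mostly diabetic.
# source_value = evaluate_cohort(source_policy, diabetic_proportion=0.2)
# target_value = evaluate_cohort(source_policy, diabetic_proportion=0.8)
```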

Robin:

Seems like we're so early on in combining causal models and reinforcement learning. I think some people still don't even think that's important to do, but it's really exciting to see you with one of the early papers combining these two fields. Do you see this combination being a big deal going forward?

Taylor:

I do. You know, I think there's a really good history of people, specifically within the health care context of machine learning research, who have been looking at causal inference. Professors that come to mind are Suchi Saria, Susan Athey, Susan Murphy, and the list kind of goes on and on. David Sontag has been looking at this. Elias Bareinboim has been looking specifically at the fundamental theoretical underpinnings of reinforcement learning and causal inference and the connection between them.

Taylor:

But, you know, I believe quite ardently, actually, that any future solution we have for generalization within RL needs to account for causal structure, especially in an off-policy or offline setting where you have a fixed dataset; we need to learn a lot from our colleagues in the statistics departments and the public health and epidemiology world about how to do good causal inference. And I think Judea Pearl, Bernhard Schölkopf, and Jonas Peters, sorry, that was the name I was trying to say, these 3 researchers, among all the ones I've named, have been doing a really great job of introducing some of these concepts within machine learning, and now a lot of the effort is drawing the coherent connections for usability.

Taylor:

And, you know, is it feasible to make the assumptions that we do make in order to make these things work? People have their bones to pick with the way that machine learning researchers use causal language and causal frameworks, and I think they're valid in raising those concerns. It's upon us, as a community who want to use these tools, to listen and to learn, and that's been something that Shalmali Joshi, my primary collaborator on this paper, as well as my adviser, Marzyeh Ghassemi, and I have been doing: we've been listening, and we've been talking with experts in this field to try to get a better sense of what we're doing right and what we could be doing better. I think it's an exciting future, assuming that we can succeed in scaling the approaches that we present in this paper to more realistic scenarios. Right now, most of the causal inference and reinforcement learning literature, or at least the bridging between these two areas, has been in fixed or discrete settings.

Taylor:

I think the only work that has looked at slightly continuous settings has been Susan Murphy's work developing a monitor and providing mobile interruptions to somebody's day, you know, wearing a smartwatch, for example: oh, hey, you should get up and walk, or, oh, hey, your heart rate's too high, slow down.

Taylor:

Her project, the fully funded study that she's been on, is known as HeartSteps. They're probably one of the only projects, or at least sets of research, out there that's been looking outside of the more controllable discrete settings. And I think there's a lot of development that needs to be done, both on the statistics side and also on the modeling side from a machine learning perspective, about how to expand and adapt to more continuous and realistic settings. And that's actually some work that I'm quite excited to get started on later this year.

Robin:

Awesome. Sounds like there's a lot of background you need to cover?

Taylor:

There's a lot that I don't understand yet, and I'm trying to learn from my collaborators, who know far more than I do.

Robin:

I wanna just add, I love HeartSteps. I think Susan Murphy's work is so fascinating, and I learned a lot from reading about that. Yeah. I wanna move on to talk more about MIMIC-III and sepsis. Okay.

Robin:

So MIMIC-III and the sepsis problem seem to come up a lot in ML for health. I think you made a comment that it's kind of like the MNIST for ML for health. And I understand this is ICU data from a teaching hospital. Is that right? Can you tell us more about the problem and the dataset?

Taylor:

Yeah. So the data is collected from the Beth Israel Deaconess Medical Center in Boston, which is part of the Harvard Medical School system of teaching and research hospitals. Leo Celi and his collaborators at MIT thought, you know, we have this really rich dataset of electronic medical records that we can use to inform better decision making, but also improve medical practice. Leo is a practicing acute care doctor, and he saw within his own workplace, in the intensive care unit, the potential benefits of developing this type of dataset to be used by the community. They've gone through substantial efforts to de-identify it, to clean it, and to present it to be used by anybody, as long as they go through adequate ethics training and follow the protocols defined by the consortium that has hosted and prepared the dataset for use. They actually have just finished a new version of MIMIC, version 4, and it's being rolled out this summer, to include a much larger set of patients. Another improvement to the data is that they now have chest X-rays fully integrated for all the patients that have them.

Taylor:

They have improved or increased the availability of the notes from doctors and clinical teams. And another thing that some of my colleagues are quite excited about is that they are also including pharmacology and medication reports, which is something they haven't had historically within the MIMIC dataset. And why MIMIC and why sepsis, or at least why that has become sort of this focus: sepsis is a really poorly understood problem. So it has a lot of potential gains, but it also introduces a lot of difficulty, where we fall into the trap as machine learning researchers of saying we've solved it, we've done it, but a doctor looks at the solutions and says, oh, we knew all that already; it's just a harder problem than you thought.

Taylor:

Why it's been used so widely within the machine learning for health care community is, one, the availability of MIMIC, but also that sepsis is one of the conditions within the hospital that gets really dedicated monitoring. So there is a richness to the data that's there, as well as consistent measurements. You don't have as much missingness, or at least unobserved information about a patient, such as their heart rate or their respiratory rate, their blood levels, and the list goes on as you consider the vitals, because these patients are at the most danger of dying within the hospital. In fact, sepsis is one of the leading causes of in-hospital death, and sepsis itself isn't a single diagnosable condition; it is an umbrella term for large scale systemic shutdown of the body's organs in response to infection and pain. Sepsis can be detected by a variety of measures, one of which is rising lactic acid levels within the blood, which is a physiological response that our bodies have to infection and pain, and it can manifest in multiple different ways.

Taylor:

And so if you have access to the MIMIC dataset and the notes, and you look through the patient cohort who have sepsis, and sadly those that succumb to their sepsis, there's a variety of scenarios or conditions that may lead to a patient becoming septic. The ones that stick out to me: an individual had surgery after an accident, their sutures got infected, that infected their blood, and they became septic. Another particular patient I have in mind got an infection during chemotherapy. These are really rare and unsettling situations that pop up when you look in aggregate at this hospital data. It's not a happy time to read case reports about somebody who passed away, and it's even more difficult when you look at the clinical decisions that were made and you say, oh, in retrospect, they could have seen this and they could have changed that. Ziad Obermeyer has a really nice way of describing this phenomenon.

Taylor:

He says, in retrospect, we can be the best at anything. The challenge is diagnosing, or at least identifying, the signal as it's happening, or even better, before it happens. And I think that is the large motivation behind a lot of the machine learning for health care research; in particular, solving the sepsis problem is only one really small piece of this overall health care puzzle that we have. It just happens that, thanks to the efforts of the MIMIC team, we have this dataset available, and it has, unfortunately, influenced a lot of larger data collection practices. So, recently, a team in Europe just published a large health dataset for intensive care, but it's focused on sepsis. Or, when we talk to our clinical collaborators at local hospitals here in Toronto, they kind of back away. They say, oh, we don't have the data to support sepsis.

Taylor:

And we're like, no, no, no. We don't want to focus on sepsis. We want to focus on problems that you're facing, but we're going to benchmark against this sepsis condition in this open dataset.

Taylor:

And, you know, once we have those types of conversations with our clinical collaborators, I think that, one, we learn what they're really interested in, and two, they see the limitations of our current practice within machine learning. And it helps kind of bring us to equal terms, where they see that, you know, the problem hasn't been solved and we're not just there to be engineers, but we're there to help them in actual clinical research, which opens a lot of really great partnerships and doors when you have this common understanding.

Robin:

So from my brief readings, I only ever encounter the phrase MIMIC-III in the context of sepsis. But is MIMIC-III really all about sepsis, or is it a much broader dataset?

Taylor:

It's definitely a broader dataset. Like I was sort of saying, the frequency of records for a septic patient makes it an easier problem to look at, and the community has defined really good data extraction code to pull out the sepsis patients, but there's a large variety of conditions and people within the MIMIC dataset, all constrained to the intensive care unit. And so it's acute care and the challenges that come with that, you know, given that these are short timelines. These are people in very dire circumstances, and some of the recording for these patients is quite sporadic because of that. You know, doctors are working feverishly to treat and care for people who are on the brink of dying.

Taylor:

And so sepsis has become a little bit easier because it has very systematic protocols for measuring and monitoring the patients, and so I think that's just why the majority of the papers that we see from the community that use MIMIC utilize the sepsis framework. But that doesn't mean you can't use this data if you're interested in solving something else. So the mechanical ventilation weaning paper from Niranjani Prasad that I referenced earlier, that looks at the septic cohort, but they don't look at treating sepsis. Right? They're looking at a subproblem within that cohort.

Taylor:

But I am aware of research where people look at the same septic cohort to do diabetes management and recognition within a clinical setting. You know, there's mental health type research that has been done within the context of the MIMIC or septic cohort as well. Right? There are a lot of interesting parallels that can be drawn within the data that don't focus on sepsis, but at its core, I think the lowest hanging fruit of the problem, just given data availability, is the septic cohort.

Robin:

So when we look at, say, when deep RL first started with Atari and how DQN did with Atari back then, and how agents today are doing, like with Agent57 and MuZero, some people are saying, or I sometimes wonder, you know, have we solved Atari? Is Atari just solved and not that useful anymore? How would you comment in terms of where we are on that journey with MIMIC-III and sepsis? I guess we're a long ways away.

Robin:

Are we a long ways from solving it? What would it even mean to solve this problem?

Taylor:

Yeah. I don't know. You know, to be completely honest, I don't know if it's possible clinically to describe a solution. Like, with the language that we're used to using, is a solution something that is attainable? I think there's always gonna be some context-driven exception to any one clinical practice, given the patient characteristics and the situation at hand, right? So what we've seen, or at least there have been some published medical papers from China looking at the coronavirus pandemic, and 100% of the patients who died in their hospitals were observed to be septic, whereas among those that recovered, it was something like 40 to 60% who were septic at one point.

Taylor:

Right? So, like, it takes on different contexts and forms, because if you're treating a patient in the hospital currently with, you know, the COVID-19 virus, the sepsis is gonna be a factor that you consider, but largely you're just focused on treating the current symptoms of the virus. And so that largely changes, I think, the texture of the problem. There have been efforts to make generalizable deep RL solutions to the septic problem, and I think that they're ill-guided in a lot of ways. And I don't want to really delve too deeply into them because I really respect the effort and the researchers who did this. You know?

Taylor:

So this is Aniruddh Raghu's paper, or set of papers, from, you know, 2017, 2018, and then the AI Clinician paper that was published in Nature Medicine in 2018. Like, they did great work introducing deep RL into the question, at least getting the attention of the reinforcement learning community looking at sepsis. But I think that we do ourselves a disservice when we take a traditionalist RL approach to health care, and I think that's what a lot of people have been trying to do by applying a deep Q-network just to learn optimal action selection strategies. And, you know, Finale Doshi-Velez and Omer Gottesman, who was one of her now graduated students, wrote this really great opinion piece for Nature Medicine giving guidelines for using reinforcement learning in health care, and they identify all of the pitfalls of using a traditionalist approach on this fixed health dataset. All of this is to say, I don't know what a solution looks like, especially without running clinical trials.
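(For readers who want to picture the "traditionalist" recipe Taylor is cautioning against, here is a minimal, hypothetical sketch of fitting a deep Q-network to a fixed dataset of logged clinical transitions. This is a generic illustration, not the code from the Raghu or AI Clinician papers; the network size, discount factor, and batch format are assumptions.)

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the "traditionalist" recipe: fit a Q-network to a
# fixed set of logged transitions (state, action, reward, next_state, done).
# Naively applying this to offline clinical data is exactly what the
# guidelines papers warn about.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # Q-values for every discrete treatment action

def td_step(q, q_target, batch, optimizer, gamma=0.99):
    state, action, reward, next_state, done = batch
    with torch.no_grad():
        # Bootstrapped target from a frozen target network
        target = reward + gamma * (1 - done) * q_target(next_state).max(dim=1).values
    pred = q(state).gather(1, action.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```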

Taylor:

You know, I think that in the best case scenario, we convince either the NIH or some other health regulation body that we have done a good enough job, and I hope we would feel confident and assured that we've done a good enough job capturing all these factors of variation in medical practice, that we could run a full clinical trial. Like, we cannot assume that we've solved anything in health care, or really in the real world, without actually experimenting, and that's really dangerous territory for machine learning within health care: we need to be sure that there are reversible decisions in case our models are wrong. And the current status of a lot of this is that we are not there, and we're nowhere close. Part of my motivation for coming to Toronto is that there are doctors here who are willing to work with us to develop clinical trials for algorithmically driven decision tools, not for full treatment, which would presuppose a solution to the sepsis problem, but for smaller problems that will free up the doctors' mental capacity and time to be a little bit more attentive to their patients, or provide opportunity to develop new or novel treatment solutions. And that is the goal that I think is realizable within the next 5 to 10 years, depending on the use case and depending on the problem.

Taylor:

You know, for example, there is a group here in Toronto at the Children's Hospital that is looking at triage for the child patients that come in, identifying the right wings or departments in the hospital to put them in, and that itself is a reinforcement learning problem in that you want to look at the long-term effects or outcomes of that initial triage decision being made. And that is some exciting work that some colleagues of mine are getting started on. And I think that is a really feasible, or at least ethical, approach to trying to develop some algorithmic aid for a health care setting. When it comes down to, you know, health and recovery, I get a little nervous about thinking that we're close, or even putting a prediction on how close we may be, but I think that we're hopefully getting there within my lifetime, and I am excited to be a part of the efforts to at least make that somewhat realizable.

Robin:

I'm glad you raised AI Clinician. There's been quite a bit of back and forth about how close we were to using a solution like that, and critiques about handling uncertainty.

Taylor:

Yeah. I think the criticisms of that work are both fair and unfair. I think there's some sour grapes by some of the critics of that work because they wanted to be the first to do this. But a lot of their criticisms were about modeling uncertainty and the robustness of the approach, as well as, I mean, even the types of treatments that the AI Clinician suggested, which are very, very severe and very risky. And I think that is the large challenge with doing RL in health care: you kind of get fixated, or at least the agents get fixated, on the actions that cause the most change and, you know, devolve into suggesting those really serious actions immediately, when they're not necessary.

Taylor:

And so that's been something that we've been trying to diagnose in some of this work that I alluded to with Mehdi Fatemi: you know, why is it choosing those types of actions, and can we identify when those actions are useful or when those actions are potentially detrimental?

Robin:

So you presented a workshop spotlight related to this topic, titled Learning Representations for Prediction of Next Patient State, at the ACM Conference on Health, Inference, and Learning 2020. Can you tell us a little bit about that?

Taylor:

Yeah. So this conference is brand new. It was headed up by my adviser, Marzyeh, and her close collaborators. And the reason why I'm sort of prefacing this is I just wanna say that we do have probably the coolest conference acronym out there, CHIL. Right?

Taylor:

And so I tried, with no success, to convince Marzyeh to contact Netflix to get a sponsorship, especially because Ben and Jerry's now has an ice cream flavor, Netflix and Chilled. Right? It just would have been fantastic to have that. You know, we don't have to take ourselves overly seriously all the time. But this paper that I presented as a workshop spotlight was looking at, within sort of the sequential framework of decision making in health care, you know, what's the right way to learn a representation.

Taylor:

A lot of reinforcement learning methods in health care just take the data as an isolated sequence and say, okay, well, this is time step t, we have all of these observations, we'll just use that data as raw input into our agent. That is fine, and it's worked well, you know, given the results that we do have in the literature, but it's not necessarily realistic, because a clinician will use some historical understanding of, you know, a patient's blood pressure, for example, in providing some treatment.

Taylor:

And there have only been two papers in the literature, at least that I'm aware of, that have done any sort of recurrent modeling to construct a, you know, time-dependent history, in sort of a hidden state, of what the patient condition is. And even then, those two papers just sort of blindly use a recurrent neural network, and there hasn't been a systematic or rigorous study on what the appropriate representation is for a patient. And so that's what our work was trying to do, and we're hoping to get this up on arXiv within the next couple months, just because we got good comments from, you know, a failed submission of this paper to a conference this summer, of where to improve it, and we're going to do that. But what we did is we took several different models that would embed sequential observations in a recurrent fashion, and then investigated what those representations look like and what they allow us to do, and the auxiliary task that we use with these representations is predicting the next observation. So given a sequence up to time t, can you predict the observation of your time-varying vitals in the patient at time t plus 1?

Taylor:

And what we wanted to make sure we did when learning these representations was to constrain the representation learning to be clinically meaningful. So rather than just learn some latent variable that, you know, happens to achieve good accuracy on your test setting, we wanted to at least maintain some semantically meaningful information. And so what we did is we constrained the learning process by maximizing the Pearson correlation between the learned representation and the known and measured acuity scores, which are just a measure of how severe a patient's condition is, while learning these representations. And what we were able to find here, and this is just sort of an analysis study, we're not developing any new model.
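(A minimal sketch of the kind of setup described here, under stated assumptions: a GRU encoder trained to predict the next set of vitals, with an added term rewarding correlation between the hidden state and a measured acuity score such as SOFA. The model class, the scalar summary of the hidden state, and the loss weighting are illustrative guesses, not the workshop paper's actual implementation.)

```python
import torch
import torch.nn as nn

class NextObsEncoder(nn.Module):
    """GRU encoder whose hidden state summarizes a patient's history and
    predicts the next set of time-varying vitals."""
    def __init__(self, obs_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, obs_dim)

    def forward(self, obs_seq):
        # obs_seq: (batch, T, obs_dim) -> hidden states: (batch, T, hidden_dim)
        hidden, _ = self.rnn(obs_seq)
        return hidden, self.head(hidden)

def pearson_corr(x, y, eps=1e-8):
    # Pearson correlation between two flattened tensors.
    x, y = x - x.mean(), y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm() + eps)

def loss_fn(model, obs_seq, acuity, corr_weight=1.0):
    """Next-observation MSE plus a term rewarding correlation between a
    (crude) scalar summary of the representation and measured acuity scores."""
    hidden, pred = model(obs_seq)
    # Predict o_{t+1} from the history up to time t
    mse = ((pred[:, :-1] - obs_seq[:, 1:]) ** 2).mean()
    summary = hidden.mean(dim=-1)              # (batch, T) scalar summary per step
    corr = pearson_corr(summary.flatten(), acuity.flatten())
    return mse - corr_weight * corr            # minimize error, maximize correlation
```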

Taylor:

We're not presenting any new way of doing these embeddings, but by learning these representations with this constraining process, these simpler RNN-based methods are able to more or less separate out the types of patients that we would see. Right? Your patients that survive, who have less severe conditions, versus those that have more severe conditions and who do ultimately succumb to their treatment or their symptoms. And why this is important is that if we're thinking about applying a reinforcement learning agent to these learned representations, we want to help maintain some of this clinically meaningful information, but then also give it a head start in seeing that these representations are separable between the two patient cohorts. We're excited to start applying learned policies to this type of learned representation.

Taylor:

And so, you know, this workshop spotlight that I did, as well as the paper that we're going to be publishing, or at least putting on the arXiv server soon, is just largely a first step at saying, hey, you know, all of this state construction business that we've been doing in RL for health care, much less machine learning for health care, is probably not well informed, and we can do better. And so that's trying to start this conversation of, you know, what's the appropriate way to represent a patient's condition given sequential observations, and how can we use that to do better reinforcement learning downstream.

Robin:

So you've developed a range of innovative solutions for using RL in these clinical settings. Can you say more about what the path would be to getting them out there, to helping real people? Like, is this decades away? Could it be around the corner? What does the path look like?

Taylor:

Yeah. So in terms of reinforcement learning, I think we've got quite a ways to go. But, and I probably am not speaking perfectly precisely here, a lot of these standard or more traditional machine learning approaches, we have evidence of them working really well in some health care settings, and those have been in direct collaboration with clinicians and hospitals. And so there's Kat Heller, who's now at Google but was at Duke. She and her students were able to develop a really great sort of causal inference based solution to managing patients in the hospital.

Taylor:

So Brett Nestor, who is a student with me at the University of Toronto, has been working with St. Michael's Hospital here in Toronto on developing prediction methods in the general internal medicine department. Can you predict whether or not a patient will need to be transferred to the intensive care unit? Because that is a very difficult process that takes a lot of coordination, and if they can know a day in advance that somebody's gonna be coming to the ICU, they can make that transition better and maintain the patient's health much more easily in that transition to the intensive care unit.

Taylor:

Another example has been, you know, Susan Murphy's work, where she's probably the only researcher that has had an actual clinical trial with machine learning approaches under the hood. Suchi Saria has been working on this as well. But in each one of these success stories of applying machine learning to health care in practice and production, it's always been in collaboration. Like I said earlier, we can't operate in a vacuum within health care, and having invested clinicians who understand the methodology is really important. And, you know, there are some really great doctors and radiologists that we're affiliated with and collaborate with, who are helping us always see the big picture of what they're doing. So specifically, I'm talking about Judy Gichoya, who's at Emory Hospital in Atlanta, Georgia, and then Luke Oakden-Rayner, who's based out of Australia.

Taylor:

They are really great critics of everything that's being done, to make sure that we're doing it in an appropriate fashion. And, you know, I have friends and colleagues out of Harvard Medical School who are constantly helping us remember that there is understood practice when we approach making strides in technology within health care, but it needs to be motivated and informed by what can actually be done.

Robin:

So there's good reasons why the medical establishment probably doesn't really follow the philosophy of move fast and break things, which is maybe something that's...

Taylor:

I think there's good reasons why. There's probably some bad reasons why too, if we're gonna be completely honest. The challenge with regulation boards is that they're humans, right, and these humans are also doctors that have their own individual biases and preferences. And, you know, it's the reality that we need to deal with, and it's upon us as researchers to convince them that we are being careful and that we're thinking through the challenges that they care about too.

Taylor:

And so this is kind of the excitement of being at the leading edge of any type of problem in any type of industry: you get to develop a lot of patience, but you also learn a lot in the same frame. And I think that that's why we're here on Earth in general, to develop and learn and to become better versions of ourselves. And I think that when we work in interdisciplinary settings, we are exposed to more opportunities to improve ourselves.

Robin:

Taylor, do you have any comments on other things that are happening in the world of RL that you find really interesting or important these days, outside of your own work?

Taylor:

There's a lot, and I feel like I've probably taken too much time to describe, like, feelings that I have and thoughts I have. One really quick thing that I'm excited about is that I am really grateful that there has been an added interest in applying reinforcement learning to the real world, with the challenges in modeling and, you know, architecture and learning that come with that. And so I wouldn't say we're in the renaissance of offline RL yet, but I think that what we're seeing coming from various groups and labs throughout the world is that there is a dedicated interest in making these things work in the real world. And, you know, there are some success stories that I've been made aware of, that I know are not public, where reinforcement learning has actually been used in the real world to great effect, and done so in a robust and stable and safe manner.

Taylor:

And I think it's really exciting to envision, or at least hypothesize, how much more progress we're gonna be making in the near term.

Robin:

Taylor Killian, this has been an absolute pleasure, and thank you so much for giving us a window into the really important and fascinating work that you've been doing. Thanks so much for joining us.

Taylor:

You know, I appreciate the invitation. I think that what you do with this podcast is fascinating, in that you balance between young researchers as well as established experts. And speaking as a, you know, consumer of your podcast, but now as a guest, I really appreciate that balance, because I think that it's important for young and new ideas to get exposure, as well as just to get the experience of being out there, and so I am really grateful for the opportunity.

Robin:

Notes and links for this episode are at talkrl.com. If you like this show, I need your support. You can help in a few ways. 1, subscribe on your favorite podcast platform. Subscriptions make a big difference.

Robin:

2, follow us on Twitter at talkRL podcast. We love retweets. 3, give us a 5 star rating on Apple Podcasts. If you don't think we deserve 5 stars, let us know on Twitter what we could do better.

Taylor:

Talk RL.