TalkRL: The Reinforcement Learning Podcast

Sven Mika of Anyscale on RLlib present and future, Ray and Ray Summit 2022, applied RL in Games / Finance / RecSys, and more!

Show Notes

Sven Mika is the Reinforcement Learning Team Lead at Anyscale, and lead committer of RLlib. He holds a PhD in biomathematics, bioinformatics, and computational biology from Witten/Herdecke University. 

Featured References

RLlib Documentation: RLlib: Industry-Grade Reinforcement Learning

Ray: Documentation

RLlib: Abstractions for Distributed Reinforcement Learning
Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica


Episode sponsor: Anyscale

Ray Summit 2022 is coming to San Francisco on August 23-24.
Hear how teams at Dow, Verizon, Riot Games, and more are solving their RL challenges with Ray's RLlib.
Register at raysummit.org and use code RAYSUMMIT22RL for a further 25% off the already reduced prices.

Creators & Guests

Host
Robin Ranjit Singh Chauhan
🌱 Head of Eng @AgFunder 🧠 AI:Reinforcement Learning/ML/DL/NLP 🎙️ Host @TalkRLPodcast 💳 ex-@Microsoft ecomm PgmMgr 🤖 @UWaterloo CompEng 🇨🇦 🇮🇳

What is TalkRL: The Reinforcement Learning Podcast?

TalkRL podcast is All Reinforcement Learning, All the Time.
In-depth interviews with brilliant people at the forefront of RL research and practice.
Guests from places like MILA, OpenAI, MIT, DeepMind, Berkeley, Amii, Oxford, Google Research, Brown, Waymo, Caltech, and Vector Institute.
Hosted by Robin Ranjit Singh Chauhan.

Sven:

There's a rise in interest in RL in finance. We have JPM, for example, as well as other companies that we're seeing moving into the space and trying RL on financial decision making. Ray was actually developed because of the need to write a reinforcement learning library. TalkRL.

Robin:

TalkRL podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host, Robin Chauhan. A brief message from Anyscale, our sponsor for this episode.

Robin:

Reinforcement learning is gaining traction as a complementary approach to supervised learning, with applications ranging from recommender systems to games to production planning. So don't miss Ray Summit, the annual user conference for the Ray open source project, where you can hear how teams at Dow, Verizon, Riot Games, and more are solving their RL challenges with RLlib, the Ray ecosystem's open source library for RL. Ray Summit is happening August 23rd and 24th in San Francisco. You can register at raysummit.org and use the code RAYSUMMIT22RL for a further 25% off the already reduced prices of $100 for keynotes only, or $150 to add a tutorial from Sven.

Robin:

These prices are for the first 25 people to register. I can say from personal experience: I've used Ray's RLlib and I have recommended it to consulting clients. It's easy to get started with, but it's also highly scalable and supports a variety of advanced algorithms and settings. Now on to our episode. Sven Mika is the Reinforcement Learning Team Lead at Anyscale and lead committer of RLlib.

Robin:

He holds a PhD in Biomathematics, Bioinformatics, and Computational Biology from Witten/Herdecke University. Thank you, Sven, for joining us today.

Sven:

Hey Robin, nice to meet you, nice to be here. Thanks for having me.

Robin:

Great to have you. So can we start with Anyscale? What does Anyscale do?

Sven:

Yeah. So Anyscale is the startup behind the Ray open source library, which is a Python package that is supposed to make distributed computing very easy. And the Ray package comes with several, what we call libraries, mostly related to machine learning. For example, RLlib for reinforcement learning, Ray Serve for model serving, and so on. And the idea, or the bet that we're making at Anyscale, and our philosophy, is that distributed computing is really hard.

Sven:

It's normally something that, as a software developer, you would like to outsource somehow in your work. If you want to write, for example, a distributed machine learning application, you would probably not want to worry about this aspect of your work. So the idea is to have a platform, the Anyscale platform, where you can very easily bring up a cluster (right now we support Amazon or GCP) and then run, preferably of course, your Ray applications, but not just restricted to Ray, any distributed application on the platform. And the idea is to have this OSS, the open source Ray system, that will draw users into becoming customers for Anyscale, for this Anyscale platform. So we're roughly a 100 people right now.

Sven:

We've collected more than $100 million in investor money so far, and we have been around for roughly 3 years, I believe. I joined Anyscale two and a half years ago as the first RL person. Since roughly a year ago, we grew into a larger team of 5 full time RL engineers. And my team is responsible for developing and maintaining this RLlib library within the Ray open source system.

Robin:

We're here to talk about RLLib mostly, but RLLib is based on Ray. So can you tell us a bit about Ray to get started?

Sven:

Yeah. So with Ray, you can specify what we call tasks, so these are functions that you can tag with a Python decorator. And then you can call the function, let's say a thousand times, and the function gets executed on different nodes, based on the resources that you have in the cluster, in parallel, and you can collect the results in parallel. And this works locally, for example with multiprocessing, but also on a cluster with different machines. And this is the easy case, where you have a function. The harder case is when you have a class, and then we call this an actor.
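
For readers following along, here is a minimal sketch of that task pattern (standard Ray API; the function and the numbers are just illustrative):

```python
import ray

ray.init()  # start Ray locally; on a cluster, this would connect to the head node

@ray.remote
def square(x):
    # Tagged with @ray.remote, this function becomes a Ray "task" that can run
    # on any node in the cluster that has free resources.
    return x * x

# Launch, say, a thousand calls; each returns a future (object reference) immediately.
futures = [square.remote(i) for i in range(1000)]

# Collect all results; ray.get blocks until the parallel tasks are done.
results = ray.get(futures)
```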

Sven:

So the class has a state, and you can tag it the same way, with this @ray.remote tag that you put on your class. And then you have these actors run in the cloud, on the different machines, using different types of resources that you can specify. For example, by default it's just a CPU, but you can also of course think about actors that utilize GPUs. And then you can kind of ping these actors by calling their methods.
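
And the actor case, again as a minimal sketch (the class itself is made up; the decorator and resource arguments are standard Ray):

```python
import ray

@ray.remote(num_gpus=1)  # ask Ray to place each actor on a node with one free GPU
class Counter:
    def __init__(self):
        self.count = 0  # actor state lives in the worker process hosting the actor

    def increment(self, by=1):
        self.count += by
        return self.count

counter = Counter.remote()          # Ray schedules the actor somewhere in the cluster
ref = counter.increment.remote(5)   # "ping" the actor by calling its methods remotely
print(ray.get(ref))                 # -> 5
```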

Sven:

Kind of like, think about a microservice that you would like to utilize or request data from. And RLlib utilizes Ray in such a way that the most common case where RLlib taps into Ray is environment parallelization. So instead of just having a single environment that you step through, collecting some data, and then you learn on that data, RLlib by default is already distributed. So for example, if you take the PPO algorithm, our default configuration for that algorithm uses 2 of these actors, or 2 of these, we call them rollout workers. That's the class, and then we make it a Ray actor by decorating it.

Sven:

And each of these rollout workers has environment copies, either 1 or more, for batched forward passes. And so PPO can collect samples much faster than a single-environment PPO would be able to. This is a very simple example. We have other algorithms like, you know, A2C, A3C, and then newer ones that work in a similar way and have sometimes very complex execution patterns, where you not just collect samples from the environment in parallel, but you also already calculate gradients on the workers and send the results back for the update.
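
As a rough sketch, the knobs Sven mentions map to config settings like these (assuming a Ray 2.x-era RLlib; exact names may differ slightly between versions):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    # 2 rollout-worker actors (the RLlib default for PPO), each holding 1 env copy;
    # raising these numbers scales sample collection across the cluster.
    .rollouts(num_rollout_workers=2, num_envs_per_worker=1)
)
algo = config.build()
result = algo.train()  # one iteration: parallel sampling on the workers, then a PPO update
```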

Sven:

To serve these really complex execution patterns that RL requires, Ray is the perfect tool. And as a matter of fact, Ray was actually developed because of the need to write a reinforcement learning library. So the RISELab at Berkeley wanted to build a reinforcement learning library, and then they figured, we need some nice tool that helps us with unifying and taking the difficulty away from distributed computing.

Robin:

There's such a wide variety of settings. What are the settings that are best suited for RLlib, in terms of off-policy, on-policy, model-based, model-free, multi-agent, offline, etcetera?

Sven:

Yeah. So RLlib really has no particular limitation on any of these, except for, like, limited support for really model-based RL. Also, what we see a lot, where the traction of the whole Ray ecosystem comes in for the users, is that Ray has not just the RL library, but also the other machine learning libraries. For example, Ray Tune for hyperparameter tuning, Ray Serve for model serving, or Ray Train for supervised learning. And there's a lot of interest right now in combining all of these.

Sven:

Like, for example, if you want to do what's called online learning: you train some model, either with supervised learning or with reinforcement learning. You deploy it into production. You see what happens, kind of evaluate it there, because you cannot do it in the simulator; you need to see what it's doing in production. You collect more data in production, and then you use that new data to retrain and kind of repeat the whole process.

Sven:

So that's one of the other strengths of RLlib, because it's so well integrated with these other machine learning libraries that Ray comes with.

Robin:

So we featured authors of other major RL libraries on the show: Antonin Raffin and Ashley Hill, who wrote Stable Baselines, and Pablo Samuel Castro, who wrote Dopamine. How do you situate RLlib in a landscape with these other types of RL libraries?

Sven:

Yeah. That's a great question. We've thought about this ourselves, like, a lot, and we did some surveys trying to find out what people use and why they use libraries other than RLlib, and where RLlib stands in this ecosystem of RL libraries. And yes, Stable Baselines is probably still the go-to tool for when you start with RL and when you want to just understand maybe some algorithm, because the implementation is a little simpler.

Sven:

You have only one environment, kind of like a single-batch setup. RLlib, yes, it's the heavier version of an RL library, because of the scalability and the other really nice features that, we believe, make it stand out from the crowd, which are, for example, multi-agent support and strong offline RL support. We support both TensorFlow and PyTorch. All types of models.

Sven:

You can plug in an LSTM. We have those off the shelf that you can just use. Attention nets, regular CNNs, and MLPs. So that's where we see RLlib: in the place where you have larger workloads. We have customers that use a 100 workers, so they step through a 100 environments in parallel, where the environment sits maybe on some server that they have to connect to.
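
Plugging in one of those off-the-shelf models is mostly a model-config change. A hedged sketch, using standard RLlib model config keys with illustrative values:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")  # or "tf2" -- RLlib supports both
    .training(
        model={
            "fcnet_hiddens": [256, 256],  # the default MLP stack
            "use_lstm": True,             # wrap the model in RLlib's built-in LSTM
            "lstm_cell_size": 128,
            # "use_attention": True,      # or use the built-in attention net instead
        }
    )
)
```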

Sven:

These really complex, large scale, distributed workloads are, yeah, predestined for using RLlib, or the other way around, RLlib is really designed for supporting these. We are trying to tackle the problem of complexity and the problem of this steep learning curve that people tell us we have, and which we realize as well, of course, through different projects that we're working on right now.

Sven:

So we had a huge push over the last half year in simplifying APIs. That's one topic, and I can go a little more into detail if you'd like. We simplify the APIs, make the code more structured, more transparent, more self-explanatory. And the other larger item that we have on our list is better documentation: more examples, maybe creating some YouTube videos on how to do cool stuff with RLlib.

Sven:

Like how to set up typical things, typical experiments with RLlib, and so on.

Robin:

Well, I have no doubt that you'll get that and more done. But to be fair to Stable Baselines, they do support vectorized environments where you can run many environments, but I believe they are limited to a single machine, a limitation which Ray, of course, doesn't have. Right. How big can these runs get with RLlib?

Sven:

Yeah. So again, as I mentioned before, we have seen users use 100 and more workers. We've run experiments with 250, I believe, for example on IMPALA; some benchmarks use these. So these run on, yeah, really large clusters, with, like, one head node that has a couple of GPUs, and then you have dozens of smaller CPU machines that these environment workers run on.

Sven:

And we've seen these workloads used also by our users slash customers in a meaningful way. The other axis that comes in here for scaling is the hyperparameter tuning axis. So this could be like a single job, right, where you say I have a 100 environment workers, and on the head node, for learning, for updating my model, I use a couple of GPUs. But you can also then scale this further on another axis and say I have 8 different hyperparameter sets that I would like to try, or different model architectures.

Sven:

So again, by combining RLlib with other Ray libraries, Ray Tune in this case, you can see that this becomes an even larger job. And then, sure, you can run hyperparameter suites in sequence, but you would also like to parallelize here.
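
Combining the two axes looks roughly like this with Ray Tune (the Ray 2.x Tuner API; environment, worker counts, and the learning-rate grid are purely illustrative):

```python
from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=4)                            # axis 1: env workers per trial
    .training(lr=tune.grid_search([1e-3, 5e-4, 1e-4, 5e-5]))    # axis 2: 4 parallel trials
)

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"training_iteration": 50}),
)
results = tuner.fit()  # Tune schedules the trials across the cluster in parallel
```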

Robin:

Can you tell us about some of the use cases you've seen for RLlib, what the customers are doing with it?

Sven:

Yeah, I can talk about that. That's actually one of the most exciting parts of working on RLlib. Our rough idea is that we have 2 major bets that we're taking right now for the next couple of months to work on, which are the gaming industry as well as the RecSys sector. Let me talk about the gaming industry. We have 2 customers that have already presented publicly about what they're doing, so I can talk about this here.

Sven:

Which are Wildlife Studios as well as Riot Games. And the interesting thing is that they use RLlib for very different setups or use cases. Wildlife is building, or has built, an in-game item sales recommender system that basically figures out prices that the players would probably pay for some items that they can buy in their games. And they have used RLlib for that, also in an online fashion, kind of like training with RLlib offline.

Sven:

Using offline RL, then deploying into production, using different OPE methods to figure out what could be the best model, using Ray Serve for the price serving, and then collecting more data, bringing it all back, and kind of repeating the cycle. And then Riot Games does this classic thing where they have these multiplayer adversarial games, where 2 different teams play against each other. And one of the main challenges for these kinds of game studios is that they have to make sure the games are fair, and there's no imbalance in the game, maybe because you're picking a different character or different weapons and all these things, so that the game doesn't get boring.
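
Going back to the Wildlife pricing loop for a moment, a rough sketch of the offline-training-plus-OPE piece could look like this (the env name and data path are placeholders, and the estimator classes follow the Ray 2.x offline API, so treat the details as an assumption):

```python
from ray.rllib.algorithms.cql import CQLConfig
from ray.rllib.offline.estimators import ImportanceSampling, WeightedImportanceSampling

config = (
    CQLConfig()
    .environment("pricing_env")                             # placeholder; only needed for the spaces
    .offline_data(input_="/data/logged_pricing_episodes")   # train purely from logged data
    .evaluation(
        evaluation_interval=1,
        # Off-policy estimation: score candidate policies on the logged data
        # before anything gets deployed behind Ray Serve.
        off_policy_estimation_methods={
            "is": {"type": ImportanceSampling},
            "wis": {"type": WeightedImportanceSampling},
        },
    )
)
algo = config.build()
```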

Sven:

So that's the big challenge. Normally they would use testers that would play the games a couple of times, and this is very expensive and very tedious. So it would be much nicer to have a bot that can play the game against itself using self-play, and then learn how to, kind of like, figure out automatically where these exploits could lie or where these imbalances could be located. For example, they figured out that one card in one of their card games was very powerful, and they had to reduce the value of the card by 1. And that completely fixed, like, this imbalance.
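
The self-play setup Sven describes builds on RLlib's multi-agent API. A minimal, hypothetical sketch (the env and agent ids are made up, and real self-play would add a callback that periodically copies the learning policy's weights into the frozen opponent):

```python
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.policy import PolicySpec

config = (
    PPOConfig()
    .environment("two_team_game_env")  # placeholder multi-agent env with two competing sides
    .multi_agent(
        policies={"main": PolicySpec(), "opponent": PolicySpec()},
        # Map each agent id coming out of the env to one of the two policies.
        policy_mapping_fn=lambda agent_id, *args, **kwargs: (
            "main" if agent_id.startswith("team_0") else "opponent"
        ),
        policies_to_train=["main"],  # only "main" learns; "opponent" keeps frozen weights
    )
)
```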

Robin:

When I think of recommender systems, I often think of the one-step case, the bandit case. Is that what you're talking about here? Or also the full RL setting, the multi-step case?

Sven:

Yeah. Correct. I'm actually talking about both. So we still see a lot of companies trying bandits, as a single-step, kind of, optimize-the-very-next-reward setup. But you also have these companies that think about the long term effects of the recommendations, of engaging the user.

Sven:

Maybe it has some negative effect that you always make some recommendations, and the user clicks on them, engages with them, and maybe kind of gets tired of the content. So these considerations kind of slowly creep into their, yeah, into their view of the problems that they want to solve. So these session-based, long-range, yeah, kind of delayed-reward settings that you can only handle with classic RL. And yeah, there's a lot of movement right now. A lot of companies want to try RL for use cases where before they used either some non machine learning methods or, like, supervised learning.

Sven:

Now I think they've figured out that this end-to-end setup of RL is really nice: it just gives you an action, and then you can just use that without having to program more logic into the system. But it's very hard. I feel like one of the challenges here in RecSys, maybe to explain this, is if you have to recommend several items. So think about YouTube: you go to YouTube and you have these 8, or there's a couple of slots that are filled with recommended videos. It's quite crazy what this means for the action space.

Sven:

It kind of explodes easily if you think about the number of videos that YouTube has, billions I think, and you have to pick 12 of those. That makes your action space quickly explode if you don't have, like, a nice preselection method that you probably want to put there, which has nothing to do with RL. So you have to be careful there. It's really a big challenge. I find it really difficult.
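
To make that explosion concrete with purely illustrative numbers: even a heavily pre-filtered candidate pool is enough to blow up an ordered slate action space.

```python
import math

n_candidates = 10_000   # assume an already pre-filtered candidate pool (YouTube's full corpus is far larger)
slate_size = 12         # slots to fill on the page

# Number of ordered slates: n! / (n - k)!
n_slates = math.perm(n_candidates, slate_size)
print(f"{n_slates:.2e}")  # ~1e48 possible actions -- hence the separate pre-selection stage
```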

Sven:

That's just one problem. The other problem is the user model. Like, how do you do this all without a simulator? Maybe you have a simulator, but then how do you program the user into the simulator, the user behavior? Especially the long term behavior, the long term effects of what you do with the user, what you recommend to the user, into this model.

Sven:

I find it extremely challenging, an extremely challenging problem. So, other use cases that we're seeing: there's a rise in interest in RL in finance. We have JPM, for example; I can say this because they also publicly spoke about using RLlib, as well as other companies that we're seeing moving into the space and trying RL on financial decision making, buy/sell decisions, and so on. And then another one is, we have seen some attempts in self-driving cars, robotics.

Sven:

It does feel like some verticals are further ahead than others. Another one is logistics, which is also further ahead, or this whole process optimization sector, where maybe you have some factory, like a chemical factory, and you would like to optimize different processes there through RL. That's also very far ahead already. But yeah, the different verticals have made different amounts of progress in moving into RL. And the only problem that we see right now still is that it's not at large scale yet.

Sven:

So we see single companies starting to think about it, but I don't think we are quite at the point where really everyone wants to do it right now. But I think we are close. So maybe it's another year. It's hard to tell, and this is one of the difficulties for us at Anyscale: to predict when this point will happen, where everything really goes exponential. It's quite a challenge.

Robin:

Can you say a bit about the roadmap for RLlib? What do you see in the future for RLlib?

Sven:

As I already mentioned before, one important project that we're currently working on, and I would say we're maybe 50, 60% done with it, is API simplification. We have realized that Stable Baselines, for example, is a much nicer library to use and easier to learn, and we really respect that, and we would like RLlib to have that feel as well. So we're trying to get rid of old, complicated, unintuitive APIs. I can give you an example: our algorithms' configurations used to be Python dictionaries, where you didn't really know what the different keys were, and we had some users tell us that one of the main locations where they would hang out on the Internet would be the RLlib documentation page where all the config options were listed. And so instead, we now have config classes with type-safe properties that you can set.

Sven:

The class comes with different helpful methods, for example to set training related parameters, or rollout related parameters, resource related parameters, and so on. So it's much more structured, it's much more IDE friendly, you can see the different docstrings in your IDE and immediately know what setting you need to adjust to make the algorithm, yeah, hopefully learn a little faster. That's one change we did. We are currently also exploring making our models easier to customize. Before, every algorithm kind of had its own, yeah, model API; like, a DQN would have this Q-model thing, and it would have certain methods that you call for handling the dueling head, for example.

Sven:

Now we're trying to unify all that, so that in between the different algorithms you can use the same subclasses, for example for a Q-network, or for a policy network, or for a value function network, or for a transition dynamics network. That will all be unified, and this will make it easier to plug and play these different, maybe PyTorch or TensorFlow, models that you have flying around anyway if you do research on these things, and then the algorithms will just recognize any of those. So it will be much more pluggable and much more intuitive to set up different, arbitrarily complex models for your algorithms.
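
A sketch of the before-and-after of that config change (Ray 2.x-style; the specific values are illustrative):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Old style: an untyped dict whose keys you had to look up on the documentation page.
# config = {"env": "CartPole-v1", "lr": 5e-5, "train_batch_size": 4000, "num_gpus": 0}

# New style: a typed config object with grouped, documented setter methods.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(lr=5e-5, train_batch_size=4000)  # training-related parameters
    .rollouts(num_rollout_workers=2)           # rollout/sampling-related parameters
    .resources(num_gpus=0)                     # resource-related parameters
)
algo = config.build()
```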

Robin:

So I understand RLlib supports both PyTorch and TensorFlow. How does that work?

Sven:

Yeah. Great question. It makes things much more complicated, yes. We don't have an internal RLlib-specific deep learning framework.

Sven:

No, we just basically do everything twice. But it's simpler than that. So the top concept in RLlib is the algorithm, which is completely framework agnostic. It determines when things should happen: when should I sample, when should I put the samples into my replay buffer, when should I sample from the replay buffer?

Sven:

When should I update my policy network from that data? Completely framework agnostic; you just pass around abstract objects and concepts. And then, one level below, we have what we call the policy, and that one is framework specific. So we have a TensorFlow policy superclass and a Torch policy superclass, and the different algorithms, for example PPO and DQN, have their own loss functions, which are part of this policy, written in 2 ways, in TensorFlow and in PyTorch. The problem of TensorFlow 1 with the sessions, and TensorFlow 2 with eager and not using sessions and placeholders, we solved by kind of automating this away.

Sven:

So you really only have to write 1 TensorFlow loss function to support both these versions. But yes, for each algorithm we have to write both loss functions. But that's mostly it. And then the other thing you have to be careful about, of course, is the default models. So we have a set of different default models, MLPs, some simple CNN setup, as well as an LSTM setup. Of course those we also provide in both TensorFlow and PyTorch.

Sven:

But the main work is really the loss functions. If you want to implement a new algorithm, for our users it doesn't matter; they just implement one of these, and then of course their algorithm only exists in that one world. But for the built-in algorithms that come with RLlib, we went through the work and implemented the loss functions in both frameworks. But it's not as bad as it sounds. It's not as much work, I think, as people would fear it is.

Sven:

So that's a good thing.

Robin:

Does the TensorFlow side still get a lot of use? It seems on the research side, PyTorch is more common?

Sven:

I think so. A lot of industry users are still in TensorFlow. They believe in the speed, in the performance. We have seen weird stuff with Torch lately, that sometimes it runs super fast depending on the machine you are on, whether you're on an AMD processor, and also the GPU CUDA versions play a big role. But a lot of industry still uses TensorFlow, the TensorFlow versions of the algorithms. Also, sometimes they don't really care; they use everything out of the box, so they don't really have their own models.

Sven:

They just use everything that comes with RLlib anyways. So in this situation they can easily switch and compare, which is also very nice. But traditionally we have seen that TensorFlow still has some edge over PyTorch, performance wise. From time to time we also look into PyTorch and see why, or, like, how we can make it faster, and there are these JIT tricks that you can apply to make it faster, similar to how you would use TensorFlow 2 with the eager tracing. We're still kind of working on that one. But yeah, we still see a lot of people in TensorFlow, definitely.

Sven:

I think in the research area, probably PyTorch has the edge now, but I think in industry it's still pretty undecided up to this point.

Robin:

We had Jordan Terry on recently, who maintains Gym, and they were talking about RL systems that have both the agent and the environment in the GPU, so the entire RL clock cycle is happening within the GPU. Are there any plans to do anything like that with RLlib?

Sven:

Yeah. We've seen this come up in discussions with Jordan himself, but also with our users and customers or potential customers: the need to do exactly that. So, to have your environment on the GPU, because maybe the environment has a complex image observation space, and then to not have to copy all the data from the GPU of the environment back to the CPU, send it through the Ray object store, and then basically move it back to the GPU for learning updates. We have seen this question a lot, and we started thinking about it. We started experimenting with it.

Sven:

Another possible setup is to even think about having different GPUs. So you have the environment on 1 GPU, or on several, and then a central GPU for learning. How can you realize direct GPU-to-GPU communication to also speed this up, to at least avoid the CPU copy? And we have come to the conclusion that this is more like a problem that we should solve in Ray Core, via the object store.

Sven:

So the object store, this is the thing that Ray works with that is basically available from all the different nodes in a cluster. Things go into the object store as read only; they get serialized, put there, and then you can pass the reference around. And then, with the reference, on the other side of the cluster you can pull it out of the object store. But before stuff goes to the object store, it all gets copied to the CPU. So that's currently the problem, and we're trying to solve that.
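
In code, the object store round-trip Sven is describing looks roughly like this today (standard Ray API; the array is just an example payload):

```python
import ray
import numpy as np

ray.init()

big_array = np.zeros((4096, 4096), dtype=np.float32)

# Serialize the object once into Ray's shared, read-only object store.
ref = ray.put(big_array)

@ray.remote
def consume(array):
    # A task elsewhere in the cluster receives the reference; Ray resolves it by pulling
    # the object out of the store -- which today always means a copy through CPU memory.
    return float(array.sum())

result = ray.get(consume.remote(ref))
```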

Sven:

If we can say: this particular data, yes, it should go to the object store, but please leave it on the GPU, or send it directly to this other GPU in the cluster. We're currently working on this, it's on our roadmap, but we still have to figure out a lot of details related to RLlib. For example, yeah, what does this mean for the environments? Then we may need to add JAX support, because you have the nice NumPy-like JAX API that you can then use. And yeah, we currently don't support JAX.

Sven:

But this is on our roadmap, and this may happen pretty soon. Yes.

Robin:

Can you tell us a bit about the team behind RLlib?

Sven:

Yeah. Great question. So, as I mentioned before, our team size is roughly 5 full time engineers, and we had a couple of interns that already finished their fantastic projects. One finished yesterday: Rohan, who worked on off-policy evaluation, a nice new API for this, as well as the doubly robust implementation that we have right now in RLlib. And the other one, Charles, is working on the decision transformer implementation.

Sven:

It's quite a challenge for myself. I'm working remotely from Germany, and most of the other people are in San Francisco. But we have a pretty solid pipeline for, like, planning, and we work in sprints. So we plan ahead every 2 weeks on what everyone should work on for the next 2 weeks. And then we do pretty solid quarterly planning, where we come up with lots of thoughts on what direction RLlib should go in.

Sven:

What's needed by the users, by the important customers, and also what things are we predicting to happen next. Like, is gaming gonna be the next big thing in the next 6 months or so? So all this goes into the planning, and then we come up with quite detailed, kind of like by-the-engineering-week sort of planning of what everyone should work on, distribute this among the engineers, and then during the quarter make sure that we help each other out.

Sven:

If there are roadblocks, or if, like, someone gets stuck somewhere, the whole team helps. So it's working really quite well. We have only been together for a couple of months now, I have to say. For the last two and a half years, since I joined Anyscale, most of the time I was more or less working alone on RLlib, maybe with some help from interns in between.

Sven:

Also, in the beginning Eric was still there, who then moved into the Ray Core team. But this RLlib team, this really larger team that's working professionally, full time, on RLlib, has only been around for a couple of months now, since the beginning of this year. And I feel like it's working really well. Like, we're getting a lot of stuff done. RLlib is changing quite a lot right now as we go towards Ray 2.0, and I'm really happy about the, yeah, intermediate results so far.

Sven:

I really look forward to, yeah, all the nice changes that are to come.

Robin:

Can I ask you about testing? How do you test RLlib?

Sven:

Yeah. Yeah. Testing, that's actually one of the pain points we discussed recently, in a, like, how-can-we-be-more-efficient discussion. Testing right now, yeah: we have a CI, we use Buildkite for our CI tests, and the RLlib suite, when you branch from master and then you push an update to your PR, takes about, yeah, more than an hour to fully run, because we have all the unit tests, and we have what we call the CI learning tests, which are, like, smaller learning tests on CartPole and Pendulum environments.

Sven:

For all the different algorithms, for all the different frameworks, sometimes even different settings and different model types. So that's roughly an hour, and it shouldn't take that long. There are a lot of things that we can do better to maybe cut it in half. One of these things is, RLlib is not dependent on building stuff. RLlib is really just source files; it's a pure Python library.

Sven:

We should be able to just pull a ready container and just run it in there. That's something we can heavily optimize. And then we have daily release tests that we run, and those are, like, heavier or harder-task learning tests, on Atari, on MuJoCo, also for the different algorithms, for the different frameworks, TensorFlow and PyTorch, and those take several hours to run on, like, expensive GPU machines.

Sven:

But also there we made a lot of progress lately. We added a lot of algorithms, so we have much more trust in ourselves right now, which is very good and very important. But, yeah, it's a huge pain point for us as a team of OSS developers.

Robin:

So I understand RLlib is not your first RL library. Can you tell us about how you got here?

Sven:

Yeah. Sure. So my journey to RL actually started with games. I was looking at the Unreal Engine in, I think it was 2016, and I had this great idea of writing a system where you have, like, a game world, and then you have some characters in there, and the characters would kind of learn how to interact and play the game. And I didn't even know much about RL back then.

Sven:

But this idea of, like, creating kind of a background story by just making the characters become smarter and smarter and act in this world, it got me into RL, and then I figured, oh, this is the perfect method to use to solve this kind of problem. So that's when I started learning what RL is, and also writing my own libraries up front, or some of my own algorithms. And I started by joining the Tensorforce group, the Tensorforce open source project; that was in 2017. Tensorforce is another open source RL library.

Sven:

And then, with some of these people from the Tensorforce OSS GitHub repo, we started RLgraph. This was 2018, 2019; we published a paper comparing ourselves, comparing RLgraph with RLlib, and that paper then got attention from the Anyscale team. And at the end of 2019, I believe, Anyscale reached out to me, and also to other people from the RLgraph team, and asked us whether we would like to work for them. Yeah. And then Surreal, Surreal is a smaller library that I came up with.

Sven:

I was always obsessed with the idea of making it really, really simple to implement an algorithm. Like, it shouldn't be harder than: you read the paper, you see the pseudocode, you understand the pseudocode, and then you just use as many lines as are in the pseudocode to code the algorithm in the library. It's a tough goal to achieve, but that should be the ideal, in my opinion, for any RL library, or the ultimate goal. And that was Surreal, so I tried to really implement the algorithms in a very, kind of, dense but easy to read fashion. And I had some algorithms in there.

Sven:

I think PPO, SAC, DQN. Yeah. So that was just, like, some small side project. It was also related to games. It had a module in there where you could plug the Unreal Engine into this RL library, and then learn some smaller problems.

Sven:

Very similar to ML-Agents for Unity; I wanted to do the same with the Unreal Engine, and that was Surreal.

Robin:

Sven, is there anything else you want to share with our audience while you're here?

Sven:

Yeah. Sure. So we have our Ray Summit coming up at the end of August, August 23rd and 24th, in San Francisco. This is the 3rd Ray Summit, and the first one that's actually in person.

Sven:

I'm super excited to be there. I'll fly in at the end of August to give a tutorial on RLlib. And there are other cool talks about how people use Ray in industry, how people use Ray's libraries in industry. Of course, also a lot of talks on RLlib. Yeah.

Sven:

If you're interested, sign up and join us for the summit. That would be awesome.

Robin:

Sven Mika, it's been great chatting with you and learning about RLlib. Looking forward to Ray Summit, and thanks so much for sharing your time and your insight with the TalkRL audience today. Thank you, Sven.

Sven:

Thanks a lot, Robin. It's a pleasure to be here.