Tom Mitchell literally wrote the book on machine learning. In this series of candid conversations with his fellow pioneers, Tom traces the history of the field through the people who built it. Behind the tech are stories of passion, curiosity, and humanity.
Tom Mitchell is the University Founders Professor at Carnegie Mellon University, a Digital Fellow at the Stanford Digital Economy Lab, and the author of Machine Learning, a foundational textbook on the subject. This podcast is produced by the Stanford Digital Economy Lab.
00:00:00 Tom Mitchell: Welcome to Machine Learning: How Did We Get Here? I'm Tom Mitchell, your podcast host, and today's episode is an interview with Tom Dietterich. Tom has made many technical contributions to the field of machine learning over the years, including work on error-correcting output codes, hierarchical reinforcement learning, multi-instance learning, and other areas. In addition to his technical work, Tom has also been very active professionally, for example serving as president of AAAI, the primary artificial intelligence professional society, and also serving as editor of the Machine Learning Journal. I hope you enjoy this conversation with Tom. Today I'm glad to have with me Tom Dietterich, one of the pioneers of machine learning. I've known Tom for many decades. Tom, thanks for taking the time to do this.
00:01:19 Tom Dietterich: Oh my pleasure. Yeah.
00:01:21 Tom Mitchell: So let me start just by asking what got you into this field in the first place. There were so many things you could have done at that period of time.
00:01:29 Tom Dietterich: Well, partly I think it was accidental. I had applied to graduate school and got admitted to the University of Illinois at Urbana-Champaign. There was an assistant professor there, Ryszard Michalski, who was working on what we would now call inductive logic programming: trying to learn, basically, a binary classifier to distinguish between positive and negative examples of structural descriptions. These were essentially annotated graphs. You could think of chemical molecules, or, as he always had it, his trains going east and west, where there was a string of cars and they had different contents in them. So it was not just a feature vector. He was basically extending ideas from Karnaugh maps to find a DNF-style expression where each conjunct was itself a subgraph, and you had to do graph isomorphism checks. The goal was to find the smallest graph that was a subgraph of all the positives and none of the negatives. And if you couldn't find that, then you had to find a disjunction of such graphs, such that at least one of them was a subgraph of each positive and none of them were subgraphs of any of the negatives. So you had to do subgraph isomorphism calculations, which was very expensive. Still is. So the main question was how to make those methods efficient. Anyway, I got interested in it. And as an undergraduate I had taken a class in political philosophy, where the professor, for his science and engineering students, had us read Thomas Kuhn's Structure of Scientific Revolutions, and also Marx, and said: Karl Marx claimed he was a scientist. Was he? And that was kind of the prompt.
But that introduced me to Kuhn and paradigms and all this stuff, and got me very interested in the philosophy of science and in the fundamental question, and I can't remember which philosopher stated it, of how we get from our common-sense notion of the physics of the world to quantum mechanics through a series of rational steps. This, of course, was something Herb Simon was very interested in also; he claimed it had to be normal problem solving of some kind, because he didn't believe in miracles or dreams or things like that. So I was really interested in that. Obviously there's a huge gap between figuring out how to mechanize science and finding subgraphs of structural descriptions, but there was some connection there, and that was the thing that hooked me in. Then, as I was working on that as a master's student, I eventually discovered things like the DENDRAL project that you had worked on at Stanford. So after my master's I switched and went to Stanford to work with Bruce Buchanan, following in your footsteps and indeed inheriting some of your code, which was very well written, by the way.
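An editorial sketch of the subgraph-covering task Dietterich describes above. This is a toy illustration, not Michalski's actual method: it replaces the expensive subgraph isomorphism test with simple containment of labeled edge sets, and the "trains" data below is invented for the example.

```python
from itertools import combinations

def contains(graph, pattern):
    # Toy stand-in for the subgraph test: a pattern matches a graph if its
    # labeled edges are a subset of the graph's labeled edges. Real
    # structural descriptions need full subgraph isomorphism checks,
    # which is what made these methods so expensive.
    return pattern <= graph

def smallest_common_pattern(positives, negatives):
    # Enumerate candidate patterns drawn from the first positive example,
    # smallest first, and return one contained in every positive and in
    # no negative. If none exists, a disjunction of patterns (a DNF of
    # subgraphs) would be needed instead.
    edges = sorted(positives[0])
    for size in range(1, len(edges) + 1):
        for combo in combinations(edges, size):
            pattern = frozenset(combo)
            if all(contains(g, pattern) for g in positives) and \
                    not any(contains(g, pattern) for g in negatives):
                return pattern
    return None

# Invented "trains" data: each edge is a (car, cargo) attribute.
pos = [frozenset({("car1", "coal"), ("car2", "oil")}),
       frozenset({("car1", "coal"), ("car2", "wood")})]
neg = [frozenset({("car1", "iron"), ("car2", "oil")})]
print(smallest_common_pattern(pos, neg))  # frozenset({('car1', 'coal')})
```

The brute-force enumeration over candidate patterns shows why efficiency was the central research question.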
00:05:01 Tom Mitchell: Thanks. That's really interesting. It's interesting to me how many people at that stage, in the nineteen seventies, nineteen eighties, and even earlier, kind of had a philosophical bent, or at least a philosophy interest that overlapped with their motivation for getting into the field. When you went to Stanford, your advisor was Bruce Buchanan, from whom we also have a video interview, and he himself is a philosopher by training, with a PhD in philosophy. So tell us about the early days when you went to Stanford. What did you do? What was going on?
00:05:50 Tom Dietterich: Well, of course, what was happening at Stanford was knowledge-based systems. The idea was that we would interview domain experts and represent their knowledge symbolically, usually as rules in something like first-order logic or Horn clauses, and then problem solving was logical inference in Horn clauses, which was very efficient. At that time, as you know, Ted Shortliffe had built this system called MYCIN, which diagnosed bacterial infections of the blood, and he had to introduce certainty factors to deal with uncertainty in the diagnostic process. While I was there at Stanford, there were also the beginnings of the whole Bayesian network formalization of reasoning under uncertainty. My fellow students, particularly Eric Horvitz, were taking classes over in Engineering-Economic Systems, a separate department at Stanford, and importing those ideas into computer science. So that was part of it. In those days there was that, and then there was logic programming; Prolog had made a big splash. I was also very much influenced by Mike Genesereth, who had half a dozen students all looking at various questions of how to do something more sophisticated than just the simple backward chaining in Prolog. And Ed Feigenbaum was the head of that big organization in those days, called the Heuristic Programming Project. Ed wasn't really interested in learning. He said the right way to formulate this is knowledge acquisition: we're trying to get knowledge into the computer. So it was kind of lonely in some sense.
But as you know, Bruce gives his students a lot of freedom, and I was still very much enamored of this idea of automated science. So I was looking for a domain that was real enough to be interesting, but not so real as to be too difficult, which is always the challenge in a PhD thesis: how do you do something that's doable and yet interesting? I decided I would have a sort of artificial scientist that would study the Unix command shell and try to figure out what was going on inside Unix. Just by playing with ls and cp, moving and copying files, changing the protections on files with chmod and chgrp, could it understand what was going on? This was formulated so that the theory being constructed is itself a computer program that basically says, here's what's happening inside the command. But I didn't really appreciate how challenging that was, because I don't think we could even do this today. You're talking about a huge latent space: the entire file system and all of its protections and attributes and the contents of files. It's a massive space, and it's not directly visible; you only see it through things like cat and ls. So how do you make sense of that? You really are trying to do state estimation at the same time as you're trying to update your model of what the actions do, what the operators do. So it's sort of a giant, partially observable Markov decision process. And even when we work with POMDPs nowadays, we assume we have the MDP and the actions, and we know what our sensors are doing, so we can basically solve the state estimation problem. So there's still a huge challenge there. Obviously I did not solve it in my thesis.
I ended up working on just the state estimation part, with, say, a buggy version of the ls command: could it realize there was a bug in there and try to fix it?
00:10:18 Tom Mitchell: Pretty interesting. Today, I guess, if someone tried to do that, you would use a language model to read the documentation for the commands, too.
00:10:30 Tom Dietterich: Well, that would really help, yes. But, you know, when God created the earth, he didn't give us any documentation. So we have to figure that out.
00:10:40 Tom Mitchell: Right. And at that point in time, it would have been impossible for a computer to do anything with the documentation in the first place.
00:10:47 Tom Dietterich: Yeah, right.
00:10:48 Tom Mitchell: Okay, cool. So, as you were doing all this... I think of the nineteen eighties, which is the period of time when you were doing all this.
00:10:59 Tom Dietterich: The seventies, right.
00:11:00 Tom Mitchell: ...as really one of the more exploratory decades for the field. I mean, your topic of trying to discover how Unix works is a pretty amazing idea. What did the field look like to you at the time? Did you have a picture of there being different kinds of problems being worked on, different approaches? Did you feel like you had to make a decision about what kind of paradigm to follow, or was it a matter of: here I am in the middle of this laboratory that's got a bunch of people doing things, and I'm relying on them? What did the field itself look like to you?
00:11:53 Tom Dietterich: I'd say it was really chaotic. I attended that very first machine learning workshop, which I think you were one of the core organizers of at CMU, and there were probably thirty people in the room and probably thirty completely different talks. I had done an algorithm comparison paper that I published at IJCAI in seventy-nine, I think, just before that workshop, in which I was executing by hand these very simple algorithms for this kind of subgraph learning problem and comparing how many subgraph isomorphism calculations they had to do. It was like the first attempt to actually compare multiple machine learning algorithms that were more or less trying to do the same thing; there were a couple of them there. I think John Anderson was there talking about cognitive models. You were there talking about the beginnings of EBL and the LEX system for symbolic integration in calculus. The most interesting talk, I thought, was Ross Quinlan's talk on ID3, where he was trying to take these exhaustively enumerated chess endgames and learn decision trees that would completely, losslessly compress those giant tables into a small decision tree. A really important thing people should understand about those days is that we believed there was a right answer for our machine learning problems. And it would often happen that I would run, say, Michalski's algorithms, and they would not get the right answer; they would not get the logical expression we thought was the right answer. They would get something that was actually equally accurate on the training data.
And actually it worked pretty well, although we didn't really have the idea of a separate test set in those days. It was not a statistical field. We were coming out of, really, the John McCarthy program of programs with common sense, which didn't have a lot to do with common sense but was about representing everything in logic and using logical inference as the execution engine. And so there was a correct program we were supposed to be learning: the correct logical statements. One of the big, I don't know if it's a missed opportunity, but it just hits me so hard: when I was in the PhD program over in computer science, about one hundred meters away was the statistics department, and Trevor Hastie and Rob Tibshirani were PhD students at exactly the same time as me. We never met, but even if we had been together at some party or something, we would not have realized that our two fields were going to collide within five years. We just had no idea we were working on the same thing. And it really wasn't until the Skytop meetings, the third machine learning conference or workshop, which you organized, that we started to get a glimpse of that. In particular, you invited Leslie Valiant to give a talk. He had just published this paper in CACM, "A Theory of the Learnable," introducing the idea of probably approximately correct learning. I think I initially found it very puzzling; I did not get what was going on. But the most important thing about that talk was he was saying a successful learning algorithm is one that, with high probability, outputs an answer that is approximately correct.
So it was giving us the freedom to let go of the idea that there was one right answer. Our methods could be very accurate and yet not perfect; they didn't have to be perfect. And even more, ten percent of the time we could output junk. That was okay too, right? Because I only have to be approximately correct with high probability. And under those constraints, he could show that you only need a polynomial amount of data and a polynomial amount of time. We weren't so concerned about the time, but the data turned out to be really important. The idea that you could learn with a modest amount of data was really the whole name of the game. Of course, now we seem to be using exponential amounts of data, but we'll come back to that later. Anyway, I sort of date the advent of statistical machine learning from that paper. Because up until then, even ID3 was lossless compression; it was not inductive learning.
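The "polynomial amount of data" point can be made concrete. For a consistent learner over a finite hypothesis space, a standard corollary of the PAC definition gives the sample size below; this is a textbook bound stated in the terms Valiant's framework made possible, not a formula from the talk itself.

```python
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    # Consistent-learner bound for a finite hypothesis space H: with
    #   m >= (1/epsilon) * (ln|H| + ln(1/delta))
    # training examples, the learned hypothesis has error below epsilon
    # with probability at least 1 - delta: "probably approximately correct".
    return math.ceil(
        (math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon
    )

# A space of 2^20 hypotheses needs only a few hundred examples for
# 90% accuracy with 95% confidence: polynomial, not exponential.
m = pac_sample_bound(hypothesis_space_size=2**20, epsilon=0.1, delta=0.05)
print(m)  # 169
```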
00:16:54 Tom Mitchell: That's so interesting. You know, when I saw ID3, it blew me away, because I didn't even have the concept of looking at all the data at once. All the work we had been doing was: here comes another training example, incrementally; now modify your hypothesis and wait for the next example. By the way, Ross said that workshop you mentioned was the first time he ever gave a talk about decision tree learning. But when I saw it, I thought, oh yeah, right, it's okay to look at all the data at once. Up until then, for some reason, I, and I think many people, were in the mode of incremental, one-by-one examples.
00:17:44 Tom Dietterich: Yeah.
00:17:44 Tom Dietterich: Well, Pat Winston certainly had kind of set that stage. And even the perceptron algorithm was like that: it was an online algorithm. And I think when you were working on mass spectrometry, you got one spectrum at a time. Those were extremely expensive data points to be collecting. So I think that all contributed to that mindset.
00:18:07 Tom Mitchell: Yeah.
00:18:07 Tom Dietterich: In fact, I'm really disappointed now that we don't seem to have online algorithms anymore. The whole problem of continual learning in deep learning is essentially unsolved. We don't know how to update these models in any way that has good behavior.
00:18:27 Tom Mitchell: Yeah, okay. So let's move forward then. You've been in the field now for many decades, and you've seen paradigms come and go. You've seen dead ends; you've seen surprising breakthroughs. Share with us your picture of the period from the eighties until, say, now. What were the major developments that got us from that, in your words, chaotic stage (I have to agree) to where we are today?
00:19:08 Tom Dietterich: Well, let's see. There was the advent of neural networks, which was also right around the mid eighties, right? I can't remember when the PDP book by Rumelhart et al came out, but it must have been around then.
00:19:25 Tom Mitchell: Eighty-six, I think. Yeah.
00:19:27 Tom Dietterich: Right. That was a very important year, because I think Ross's ID3 paper came out in eighty-six, and the CART book that Breiman et al wrote came out around that same time. So that was probably the watershed year. But everything still had a pretty ad hoc feel to it; we had these different algorithms. One of the big shifts was really going back to feature vectors. That was the other thing that was outrageous about what Ross was doing: he was just working with feature vectors, and we all thought, oh, you need things that look more like logic, with relations in them. He was basically feature-engineering those away, and then he could make progress. And of course the neural network work was almost entirely feature-vector based as well. For me, a breakthrough was realizing that the people doing neural networks were trying to solve exactly the same problem as the machine learning community at that point, because those really were two separate communities. People maybe don't realize that, but the computer science pathway had come through logical representations, decision trees, rules, and things like that, whereas the NIPS community was more in signal processing and was comfortable with floating point numbers and linear transformations and all that kind of thing. So I remember coming to CMU and giving a talk. I don't know when that was. Eighty-three? No, it can't have been that early, because I was working on error-correcting codes.
Well, I had written a paper comparing how well decision tree algorithms could solve the NETtalk problem, where Terry Sejnowski and his student had built a system to learn how to pronounce English, going from text to speech. It made a great demo, because you could actually make the computer talk out loud using a special box called a DECtalk machine that Digital Equipment Corporation made. So I bought one of those machines, and I did a comparison to see how well we could do with decision trees. The answer was: not as well. The neural networks were doing slightly better. But that sort of said, oh, there's this abstract problem, and we're trying to solve it with a bunch of different strategies. And that becomes the paradigm: the supervised learning problem with a separate test set. Although even that work didn't use a separate test set in the original paper. But Pat Langley was particularly clear-sighted about this, and I think also the Irvine gang, about tightening up our methodology and saying we need separate test sets; we want unbiased (well, we didn't know the terminology) estimates of our accuracy. So we started measuring those and basically becoming statisticians. For me, that culminated around the year two thousand in publishing a paper on how to do statistical tests comparing supervised classification algorithms. So that got pretty hardcore. Then there was the Holmdel gang at Bell Labs, which was Yann LeCun and lots of other people, who were working on MNIST and handwriting recognition: digit recognition, reading checks, reading zip codes off of envelopes, and things like this.
They were making rapid progress, and that was very exciting. And it was on real applications, not made-up things like figuring out how Unix works. I mean, the Stanford group was working on real problems too, but almost everybody else in machine learning was working on toys, right? We had to know what the right answer was in order to evaluate our methods, so that sort of forced it. Then kernel methods actually came out of that same lab, because of Vladimir Vapnik. We discovered him, I think, through some weird chain that involved Judea Pearl; I'm not sure of the exact sequence. People were reading this Soviet literature from the nineteen sixties and seventies that existed in translation from Russian into English in very obscure journals. Basically, Vapnik and Chervonenkis had built a theory of machine learning that was very sophisticated and applied equally well to discrete hypothesis spaces of the type the PAC learning people had been analyzing and to continuous spaces, so things like perceptrons. Well, everything goes back to the perceptron convergence theorem, actually. And they were also closely reading Duda and Hart. Duda and Hart was not being taught, at least in my experience at Stanford, which was crazy.
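On the "becoming statisticians" point above: one of the paired tests examined in Dietterich's later paper on comparing classifiers is McNemar's test, which looks only at the test examples where the two classifiers disagree. A minimal version follows; the counts below are made up for illustration.

```python
def mcnemar_statistic(n01, n10):
    # McNemar's test on paired classifier errors:
    #   n01 = test examples only classifier A got wrong
    #   n10 = test examples only classifier B got wrong
    # Under the null hypothesis that the two classifiers have the same
    # error rate, the statistic (with continuity correction) is
    # approximately chi-squared with 1 degree of freedom; values above
    # 3.84 reject the null at the 5% level.
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

stat = mcnemar_statistic(n01=40, n10=15)
print(round(stat, 2))  # 10.47, so the two classifiers differ significantly
```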
00:25:22 Tom Mitchell: You know, what's interesting is, Tom Cover was teaching a course on information theory at Stanford, which I took a few years before you were there.
00:25:34 Tom Dietterich: Yeah, Cover was over there. Yeah.
00:25:38 Tom Mitchell: I don't think the Duda and Hart book was quite available yet, but there was a clear connection between that book and the kinds of things Cover was talking about in his information theory course. Anyway, a diversion; back to your story.
00:25:54 Tom Dietterich: Yeah. So I think that was another missed opportunity, because if we had read that book earlier... it came out in seventy-three, I think, so it had been around. It started with linear discriminant analysis and quadratic discriminant analysis. It was, again, the same problem of supervised learning, coming very much from a statistics and signal processing perspective, although the back half of the book, about computer vision, was extremely AI-ish, with more structured representations and stuff like that. Nobody reads that part anymore, because it was the front part that really laid out a beautiful, clean theory of how to train classifiers, particularly for the case of modeling the distribution of each class as a conditional Gaussian. And there's a footnote in there that evidently set things off. I don't know if you've talked to Isabelle Guyon, but she would be an interesting person, because she can tell you the SVM story. She was at Holmdel, and she and her husband were very carefully reading Duda and Hart, almost like the Bible, just thinking deeply about each sentence. That set them on a path, and then there was some interaction with Vapnik, who was also there, that resulted in the SVM paper. So that really brought in functional analysis, which was something most computer scientists weren't studying, though I had had a little bit of it as a math major as an undergraduate. It showed that, basically, for every classifier model we were building there was an equivalent kernel, so you could look at the learned decision boundary as being linear in the data after going through a kernel. And that had a huge impact.
It was probably the only time in the history of machine learning when theory led the way and practice followed. Because at that same time there had been work at Wisconsin. Grace Wahba was at Wisconsin, and she had been doing a lot of this kernel stuff and reproducing kernel Hilbert spaces in statistics, as sort of a voice in the wilderness. And Olvi Mangasarian, an operations research person also at Wisconsin, had been looking at ways you could take the classification problem and solve it with linear programming: how could you come up with a loss function that would be convex and linear, so you could solve it using mathematical optimization tools? So those ideas were all floating around together. And SVMs, by introducing the hinge loss instead of log loss or squared loss or these other things, were able to formulate the problem as a convex problem and then solve it using off-the-shelf quadratic programming techniques. That lasted for about a year, until we started to say, oh, there are ways of exploiting the structure of this particular problem to make it much faster. Of course, then the computer scientists started to really crank, because we know how to take something and make it run fast. So that was a very exciting time, I think in the early nineties. At the same time, Leo Breiman and Trevor Hastie and, I'm blanking on some of the other names, Jerry Friedman started coming to NIPS. I remember following Jerry around in one of the NIPS poster sessions, which only had maybe eighty posters.
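Stepping back to the hinge loss point above: the convexity is easy to see in code. This sketch minimizes the regularized hinge loss by subgradient descent on a tiny invented dataset; the original SVM work solved the equivalent quadratic program instead, so this illustrates the loss, not their algorithm.

```python
def train_hinge(data, epochs=100, lr=0.1, lam=0.01):
    # Subgradient descent on the convex regularized hinge loss:
    #   L(w) = lam * ||w||^2 + mean over (x, y) of max(0, 1 - y * (w . x))
    # Every term is convex in w, so there are no bad local minima.
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(dim):
                grad = 2 * lam * w[i]
                if margin < 1:          # inside the margin: hinge is active
                    grad -= y * x[i]
                w[i] -= lr * grad
    return w

# Invented linearly separable data; the last feature is a constant bias.
data = [((1.0, 2.0, 1.0), 1), ((2.0, 3.0, 1.0), 1),
        ((-1.0, -2.0, 1.0), -1), ((-2.0, -1.0, 1.0), -1)]
w = train_hinge(data)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
         for x, _ in data]
print(preds)  # [1, 1, -1, -1], matching the labels
```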
I mean, this was a small meeting. They would look at a poster and say, oh, you know, we tried something like that back in nineteen sixty-two, but we could never get it to work; that kind of thing. So they really brought in some ideas. And then Leo started publishing a series of papers, basically on ensembles and bias and variance. He had this really interesting paper, something about heuristics of instability, pointing out that if you could find ways of stabilizing your learning algorithm, both numerically and in terms of its behavior, reducing the variance of the answer could improve your accuracy. One way of stabilizing is to take an ensemble and then average all the members of the ensemble, and so that's what he pursued first. He was also trying to understand what was going on with these boosting algorithms, because that was another example of theory leading practice: Rob Schapire's PhD thesis at MIT introduced boosting purely as a theoretical construct, to try to prove the equivalence of weak and strong learning. But then they turned it into a practical algorithm. And not only that, they wrote an experimental paper where they did something no one had done before, which was to compare the performance of this method on twenty-six different data sets and show a scatter plot: on one axis the performance of one algorithm, on the other axis the performance of the other, with all the points above the diagonal line. It was amazing how much better the boosting algorithm was than just growing, say, one decision tree. So that was a very exciting time, realizing that we could take basically any learning algorithm and improve it by creating an ensemble of something.
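The variance-reduction intuition behind ensembles can be seen in a small simulation: if each member of an ensemble is independently correct with probability better than chance, a majority vote is right far more often than any single member. The 0.7 accuracy and the independence assumption below are idealized for illustration, not from any real experiment.

```python
import random

def majority_vote_accuracy(p_correct, n_members, trials=10_000, seed=0):
    # Simulate an ensemble whose members are each independently correct
    # with probability p_correct, and measure how often the majority
    # vote is correct. Independence is the idealized assumption here;
    # bagging and boosting work partly by decorrelating the members.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        votes = sum(rng.random() < p_correct for _ in range(n_members))
        if votes > n_members // 2:
            wins += 1
    return wins / trials

single = majority_vote_accuracy(0.7, 1)
ensemble = majority_vote_accuracy(0.7, 11)
print(single, ensemble)  # the 11-member vote is right over 90% of the time
```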
One of the other things that was really fun about the whole boosting story was that they had an initial theory in those papers that the reason boosting worked was that it was extremely efficient at fitting the data. But then Leo Breiman showed that that was wrong. Through a series of experiments, he showed that even after you had a perfect fit to the data, if you continued boosting, accuracy continued to improve, which was a bit like the double descent story we see now. And I remember Ross Quinlan had a paper, I think at AAAI one of those years, saying, something's fishy here: I can create an ensemble of decision trees, and I know that if I vote them, that's actually equivalent to one gigantic decision tree, but a bunch of the leaves of that gigantic tree will have no data to support them. How can this possibly be working? He was just outraged. And I think that's still a good question; maybe we still don't know the answer. Because you have more leaves than you have data points, there's some sense in which you have to be overfitting. In any case, now we had this general tool of ensembles, and we just started applying it to everything. That lasted for maybe four or five years, I guess, until we had kind of exhausted it. And then there was another line of work: Rich Sutton and Andy Barto, again really coming from outside machine learning. In fact, I would say the history of machine learning for those twenty years, maybe between nineteen eighty and two thousand, was mostly about importing ideas from other fields. The NIPS people came in with signal processing and so on.
And the neuroscientists were bringing in ideas that go back to Hubel and Wiesel, while Sutton and Barto were coming from the animal learning literature into reinforcement learning. But they had made the connection to dynamic programming and had developed this algorithm, TD(λ). And one of the things that students today might find amusing is that the conferences were small enough in those days that we would actually hold a meeting of the entire program committee. We would get together in some hotel rooms, sit down, and discuss each of the submitted papers. Two people, or maybe four, had reviewed each one, and we would have a little discussion and decide whether to accept it or not. We would also have a whole bunch of side conversations at these meetings; they were incredibly wonderful from a networking point of view. And in one of those side conversations, Rich Sutton is explaining to Tom Dietterich and David Haussler what he does, and we're, like, completely baffled: what is this thing you're doing, this reinforcement learning stuff? So that became very interesting. In my own personal story, I had been executive editor of the Machine Learning Journal; I think I was the third or fourth, coming in after Pat Langley. And as a PhD student, I had also written for Ed Feigenbaum's grand plan, the Handbook of Artificial Intelligence, where he was having graduate students write chapters on each subfield within AI.
And so I took on the subfield of learning and inductive inference, which was what it was called then — not really "machine learning" yet. And so I read what I thought at that point was the entire literature on machine learning, which was maybe a hundred papers or something. That's because we didn't know about statistics; we didn't know about neuroscience. I think there were ten pages there devoted to perceptrons and control theory, and the other one hundred and seventy-eight pages were about all the other stuff we were doing, including explanation-based learning. Anyway, I was invited to write a chapter for the Annual Review of Computer Science. The Annual Reviews series of books was very prominent in other fields of science; they tried doing computer science and eventually gave up, because we were just too slow to produce our materials for them. But I wrote a chapter for that on machine learning and included reinforcement learning for the first time. And as you know, when you're teaching this stuff, or when you have to write it up in a textbook, for example, you have to go learn it really well. And so that forced me to read the reinforcement learning literature. And I tried to form a marriage between reinforcement learning and traditional AI search. I had a grant from NASA to work on scheduling for the space shuttle. And there was backgammon: Gerry Tesauro had applied TD(λ) to make a decent backgammon player, using a neural network as the representation of the value function. And it was called what? TD —
00:37:39 Speaker 3: TD-Gammon, yeah.
00:37:41 Tom Dietterich: And that made quite a splash; it was very nice. So I basically took the TD-Gammon idea but applied it to try to learn the search heuristic for a scheduler. In these job shop scheduling problems, you have a bunch of resources — like a forklift or a test rig — that are needed for multiple tasks. So you have to introduce delays in the schedule while subtasks wait, first for their prerequisite tasks to be finished and also for these resources to be available. The resources are serially reusable: you assign a resource to a task, the task takes a certain amount of time, and then it releases that resource and something else can use it. So you get a search space over all of those possible changes. You start with a schedule that is as short as possible, ignoring all the resource constraints, and then you gradually have to lengthen it to make room to resolve those resource constraints. And it worked quite nicely — it beat the stuff they were using at NASA at the time. So that was fun. So this gets us up to maybe around two thousand, which I think was, in some sense, the peak — well, at least in my own career, that was sort of the peak, because we had ensembles, we had support vector machines, we had reinforcement learning. It was a time of just amazing progress, I think, and excitement. But then in my own career, that was when two things happened. First, I was having a sort of mid-career crisis, in the sense that I knew how to write papers and publish them and get grants, but I was still not seeing any real-world impact. I mean, I had worked on hierarchical reinforcement learning and made a beautiful theoretical framework, but it was completely unrealistic.
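The job-shop setting described here can be made concrete in a tiny sketch. This is a hypothetical miniature, not NASA's scheduler: each task has a duration, a prerequisite list, and one serially reusable resource, and a task's start is delayed until both its prerequisites are finished and its resource is free (the task names, durations, and resource names are invented for illustration).

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    duration: int
    resource: str                       # one serially reusable resource per task
    prereqs: list = field(default_factory=list)

def schedule(tasks):
    """Greedy list scheduling: assumes the prerequisite graph is acyclic."""
    start, finish = {}, {}
    free_at = {}                        # when each resource is next available
    done = set()
    while len(done) < len(tasks):
        for t in tasks:
            if t.name in done or any(p not in done for p in t.prereqs):
                continue
            earliest = max([finish[p] for p in t.prereqs], default=0)
            s = max(earliest, free_at.get(t.resource, 0))  # wait for the resource
            start[t.name], finish[t.name] = s, s + t.duration
            free_at[t.resource] = s + t.duration           # resource released
            done.add(t.name)
    return start, finish

tasks = [Task("a", 3, "rig"), Task("b", 2, "rig"),
         Task("c", 1, "forklift", prereqs=["a"])]
start, finish = schedule(tasks)
```

Task "b" could start at time 0 but is pushed to time 3 because the rig is busy — exactly the kind of resource-driven lengthening described above, and the kind of decision a learned heuristic would guide in a real search.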
It was very theoretical, very elegant, but not practical. And so I was looking for some directions where I could have more impact, and I pursued two of those. One was that I got interested in ecology and started collaborating with ecologists on my campus, because Oregon State University is like a superpower in ecology. Ed Feigenbaum had told me, always collaborate with the smartest people you can find on campus. And so I found those people and started collaborating with them. And that turned into a series of projects: computer vision for recognizing insects, wildfire management using reinforcement learning, invasive species management using reinforcement learning, bird migration using probabilistic graphical models. And Carla Gomes at Cornell and I got two of those NSF Expeditions grants and created something called the Institute for Computational Sustainability. It produced a lot of wonderful graduate students who are now — well, you have one of them at CMU. So that was a big success. The other direction was that I got involved in DARPA projects. DARPA, in maybe two thousand three, created this thing called the PAL program — Personalized Assistant that Learns. And the vision was, could we create essentially a personal assistant, an agent working on your behalf? Very much the vision that we're now talking about: it could manage your calendar, help you make presentations, help you find stuff — do all your information management, personal information management. What else? Attend meetings, take transcripts, identify action items. So many of these things have now come to pass with large language models, but this was the first concerted effort, and they spent more than one hundred million dollars, I think, on this effort over five years.
I got involved in year two, because SRI International had taken on some ridiculous — they were taking on most of it. They had something like twenty-six subcontracts, and it was a management nightmare. I was brought in to manage the seventeen subcontracts having to do with machine learning, or learning in general. And that mainly consisted of me weaving stories out of all the different things the subcontractors were doing, briefing the DARPA director, and keeping the money flowing. So I was a storyteller, and I was doing some research of my own that actually led to a spin-out company called Smart Desktop. We were trying to make the Windows desktop more task oriented. The Windows desktop is still very much application oriented: you open Word, you do some stuff; you open Outlook, you do some stuff. And the idea was, if I'm working on project one, then I should have the websites for project one, and the folders, and the email, and everything for that project all together in some coherent way. And it was cool. We spun it out as a company called Smart Desktop; we built four beta releases and tested them, and it just wasn't good enough. We could help people up to some point, and then there would be workflow stuff that fell outside our scope. We weren't thinking about workflows; we were just thinking about information access. And there would be some workflow stuff where we just couldn't help them. And there was an added cost to them of keeping the computer up to date with what tasks they were doing. The benefit wasn't worth the cost. So the company was acquired — we actually made a little bit of money off of it — but it was acquired just to get the people. So those were my two sort of application-oriented things.
And then I made a transition from being a faculty member in the classroom to being a project manager. So I was writing joint DARPA grants, funding a bunch of my colleagues, and managing those, and I did that pretty much until I retired in twenty sixteen — well, until twenty twenty-three, in fact. So that was also the point where I lost track of the field, in the sense that — well, it was also becoming impossible — but when I was executive editor of the journal, and then when I was writing these review articles and so on, I was able to more or less keep track of what was going on. Now it's completely impossible. But when I switched to being a project manager, I didn't have time to read all the things that I needed to read. So I have some regrets there, but, I don't know — you want impact, or you want to be heads down doing cool little technical things. You can't do both.
00:44:59 Tom Mitchell: Well, I also think you're a little too modest in saying you lost touch with the field, since I know that you know a lot about what's going on. I want to ask you about your experience with the company you just described, trying to build an agent to help people. What are the key lessons from that that you think about when you look at today's attempts — advertised attempts, at least — to give us LLM-based agents that will help us? What are the lessons?
00:45:38 Tom Dietterich: I think you've got to understand the workflows of people's knowledge work. It's not enough to give them a fork if they need a knife as well — they've got to eat the whole meal. And that was our failure: we had really beautiful forks, but we didn't have the other utensils that they needed. And there is a cost, I think — you have to learn these new tools and so on. It's very natural as technologists that we become technology first, and we say, oh, we have this beautiful thing that you're going to love, try it out. You can do lovely demos, and it's super cool. But when you're actually supporting someone's work, and maybe you're only supporting twenty-five percent of it, and you're imposing costs on them, then it's not clear that it's really going to win. Of course, the other thing is that, as you may recall, one of my slogans is "machine learning means always having to say you're sorry" — a take on the Love Story line, "love means never having to say you're sorry." What it means is that the methods are not perfect; they will make mistakes. We gave ourselves that permission back when PAC learning came along. And so you've got to figure out how you're going to deal with the failure rates of these things. That's the other problem with the agents, I think: they're going to make mistakes sometimes, and how are you going to deal with that? When we were doing information access, the mistake was just that we didn't retrieve the right documents, or they weren't properly labeled, so they weren't at the fingertips of the user — but the user was completely in control.
We were just doing information management. And I guess in the larger PAL story, one of the things I thought was very cool was a calendar management system. But boy, are people's preferences complicated — when you want to take meetings with particular people, who has priority. Lots of stuff that is also extremely sensitive, from the standpoint of not offending people, and of confidentiality. And one of the things that really struck me in the SRI project was how much privacy you needed to be willing to give up. For these AI systems to be useful to you, they had to have complete knowledge of your work life and maybe also your personal life, because they'd have to know that the reason you don't want an appointment on Thursday at five p.m. is that you have a date that night, or you have a doctor's appointment because you're having some big medical problems. And do you want your colleagues to know about that? Is your agent going to negotiate with somebody else's agent and talk about this kind of stuff? It's just rife with privacy problems. The other thing that was an important lesson from PAL was that there's not enough memory on the computer to keep track of all the stuff. Our PAL agents could only run for about four days before they completely filled the disk, because we were saving everything. And now we see — oh, maybe Microsoft wants to take a screenshot of your screen every ten seconds, or every minute, and then feed that to a model and try to figure out what you're doing. Even with the tremendous scale in storage — the idea that I can hold a terabyte in my hand, when that used to fill a whole floor of a building — it's not enough. And so, knowing what you can forget —
Memory management in that sense turned out to be a key thing, and we didn't have any forgetting infrastructure in our system. So that will probably be another big challenge. So those are the first things off the top of my head: acquiring the subtle preferences of the user — it may take them longer to express those preferences than it would take to do the task themselves. I mean, I think this is why all travel agent tools so far have pretty much failed: people's preferences are so complicated when they're planning trips, and the amounts of money involved are so high, that the potential cost of an error is not only money — maybe you get stuck in an airport for twenty-four hours because of some sort of screw-up. That is a very high cost. And machine learning means always having to say you're sorry.
00:50:45 Speaker 3: Yeah.
00:50:47 Tom Mitchell: So I want to ask you about your view of the field today. Is machine learning over? What is the state of affairs, and what needs to be done, if anything?
00:50:58 Speaker 3: Well, uh.
00:51:00 Tom Dietterich: So I think machine learning is in a state of crisis, in the Thomas Kuhn sense. It's hard to imagine a field that has had as much money invested in it — in taking the basic ideas of statistical learning and scaling them up to planetary scale, with incredible amounts of data and computing time and labeling. And I feel like it's just not good enough. We still see that these large language models look a lot more like a kernel method — smoothing and interpolating between the training data — than like a method that has somehow learned the general rules that would extend beyond the training data, that could extrapolate. And I remember asking Leo Breiman back in maybe nineteen ninety-eight, what do statisticians know about when it's safe to extrapolate beyond the data? And he said, "Don't." And I think that's the fundamental weakness of statistical learning: it's primarily interpolative. And yet one of the powers of science and engineering is that we learn invariants — relationships that do extrapolate, that are universal, or universal over some very wide range. And we find that these large language models don't seem to be able to learn that. They can read about it; they can answer a question and tell you the general rule; but they can't apply that rule. I think it's amazing that you can ask these LLMs, what are the rules of chess? Is this a legal move? And so on — and they will answer all those questions perfectly. But if you then ask them to play chess, they'll start making illegal moves. So they somehow can't take knowledge that is in linguistic form and turn it into action in the world.
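The "primarily interpolative" point has a classic toy illustration — this sketch is mine, not from the conversation. A Nadaraya-Watson kernel smoother fits the invariant y = 2x essentially perfectly inside the training range, but its prediction flattens toward the nearest training targets the moment you query outside it.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.5):
    """Gaussian-kernel weighted average of training targets."""
    sq_dist = (x_query[:, None] - x_train[None, :]) ** 2
    w = np.exp(-sq_dist / (2 * bandwidth ** 2))
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

x = np.linspace(0.0, 5.0, 200)
y = 2.0 * x                      # the true invariant: y = 2x

inside = nadaraya_watson(x, y, np.array([2.5]))[0]   # within the training range
outside = nadaraya_watson(x, y, np.array([8.0]))[0]  # beyond it: truth is 16
```

`inside` comes out essentially exactly 5.0, while `outside` stays near 10 (the largest training target) instead of the true 16 — the smoother interpolates beautifully and extrapolates not at all, which is Breiman's "don't" in one picture.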
Of course, we could train them to do that, but the thing is, you can write a very small computer program that does this perfectly, and they're somehow not learning that program. So I think we're running into the limits of statistical learning and what it's capable of doing. And I think the challenge for the next generation is, how do we learn invariants? There's been a community, maybe for the last ten years, trying to figure out how to bring ideas from causality into machine learning. Judea Pearl and colleagues, Joe Halpern — they've developed this very nice theory of causality. And Bernhard Schölkopf and his institute in Germany have been doing lots of great work along these lines. So I think there are chances that we can develop causal machine learning methods that will learn these invariants that extend beyond the training data, and maybe that will give us something. So I think that's a big challenge. There's also learning from the kinds of partial, imperfect feedback that we get with agents actually working in the world, and building world models — it's kind of the same idea. How do we build those world models so that the system understands when it is signing a contract? It's outputting text, but that text has all kinds of legal ramifications; the effects of that action are not just that certain tokens have been emitted. We saw that a couple of years ago when some auto dealership — in Oregon, in fact, I think — put an LLM up there, and it started offering two-for-one offers on cars and stuff, because it didn't understand. And it would even say, this is a real, legally binding offer.
I mean, people had jailbroken it and gotten it to do those sorts of things, but obviously the system had no understanding of the full effects of its actions. I think another weakness we see is that our models don't have a theory of knowledge, in the sense of why they believe something is true. Our contract with machine learning in the past was: we give you clean training data, and you, the algorithm, believe it all and output an answer. Or maybe we give you data that's been corrupted with a small amount of zero-mean Gaussian noise, and you average that away and give us something. But basically, the algorithms trust their data. When we're training on all of YouTube and all of Reddit, you can't trust all this data. And so the system will output stuff that's false, because it is not reasoning about its data sources and asking which ones are more trustworthy, or saying, I should be cross-referencing these against each other. And indeed, a dream would be that with the language tools we have now, could we read the entire history of science and the scientific literature and trace back why we believe everything we believe? What were the experiments and all the antecedents? We'll probably find lots of holes when we do that — that we believe stuff we shouldn't, or that there's a lot of other stuff we should know that we've lost, because it just got lost in the literature. Or maybe this is just my desire to be able to understand everything again, by having an LLM that reads the whole literature for me. So I think these are the kinds of grand challenges that we face. Of course, there's the problem of uncertainty quantification too. The LLMs can more or less characterize their aleatoric uncertainty.
You can have them generate lots of different answers, which they do naturally, and look at how diverse those answers are; if there's a big spread, the model's uncertain. But one thing they have a lot of trouble reasoning about is whether they had any training data at all relevant to the question you're asking, because the training data is too big to keep around, so they can't actually look that up. A related challenge is the problem of attribution: why do you believe this? Tracing that back to which training data points contributed to the answer that was just output. And I'm excited about the Allen Institute for AI's OLMo system, where you have all the training data available, so we can study these questions in that context. This is also related to retrieval-augmented methods: how can we build the right kind of indices that would let us find the training points that had the biggest impact, and really do the credit assignment back to those? So there are tons of questions to work on; it's super, super difficult. And of course, all of machine learning, and the vast majority of AI, has assumed that there's only one agent in the universe — so all the change in the universe is either spontaneous or caused by our one agent. Aside from the multi-agent systems community and the game theory people, that's been the dominant assumption in AI. And we're rapidly going to find, with our agentic systems talking with each other and negotiating with each other, being attacked by other agents, untrustworthy agents, that the dynamics of a multi-agent world — and creating, I don't know, agent institutions and law and police and all these kinds of things — it's going to be a very complicated world out there.
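The generate-many-answers-and-measure-the-spread idea can be sketched in plain Python. This is an illustrative stand-in, not anything from PAL or OLMo: the hard-coded answer lists play the role of repeated LLM samples, and `disagreement` scores their spread as a normalized entropy.

```python
import math
from collections import Counter

def disagreement(answers):
    """Normalized entropy of the answer distribution:
    0.0 = all samples agree, 1.0 = every sample is different."""
    counts = Counter(answers)
    n = len(answers)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n) if n > 1 else 0.0

# Stand-ins for five sampled LLM answers to the same question
confident = disagreement(["42", "42", "42", "42", "42"])
uncertain = disagreement(["42", "17", "9", "3", "0"])
```

A unanimous set of samples scores 0.0 and a maximally diverse set scores 1.0; as the passage notes, this captures aleatoric spread but says nothing about whether any relevant training data existed in the first place.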
And again, figuring out who to trust, what to learn from — these are huge challenges.
00:59:05 Tom Mitchell: Those are some great thesis topics. I often think of future-work kinds of things like you're identifying in terms of PhD thesis topics, and I think there are a lot in what you just said. Final question: if there's a new PhD student just entering right now, and they just heard you talk about these different thesis topics, what advice do you have for them on how to get started? How to conduct themselves?
00:59:44 Tom Dietterich: I guess a lot of my advice really comes from Allen Newell, whom I know mostly secondhand, but my understanding was that one of the things he emphasized was that part of your education is building a toolbox of techniques, and it's important to fill that box with as many tools as you can — and in particular, to have some that are going to be your secret weapons, the things that you are better at than anybody else. Like, you are an absolute whiz at convex optimization, or convex analysis, or whatever it is. I mean, you cannot know too much mathematics. One of my role models that I wish I had emulated more is Michael Jordan, because even after he was a faculty member, he continued to take math classes until he pretty much knew an enormous amount of mathematics, and he was then able to bring that in and, again, import ideas. I think he played a role in bringing variational methods from statistical physics into probabilistic reasoning, which was another important tool for us. So first of all, build a good toolbox, and identify what you think your secret weapon is — which is not always obvious. I never fully realized that my secret weapon is probably explaining to other people what's happening in the field, or explaining to people what they themselves are doing, as a kind of understanding and framing thing. Maybe I should have spent more time writing textbooks and less time trying to do original research of my own. So that requires a lot of self-knowledge about what your own strengths are. Then, another piece of advice — this is not really for starting graduate students, though.
As a starting graduate student — Donald Knuth used to say, and this is maybe more for theory students: you work on a problem that takes you a minute to solve, then a problem that takes you an hour to solve, then a problem that takes you a day to solve. And when you work on a problem that takes a year to solve, you graduate. So the general idea is that you shouldn't try to solve some deep problem, like learning how the universe works, in one step; do smaller things and then gradually take on bigger things. But I also think you need to keep your eye on the prize and on the fundamental questions. I spent three years working part time at a pharmaceutical startup, where we were trying to do machine learning for drug design back in the early nineties — way too early. One of my colleagues there was Tomás Lozano-Pérez from MIT, and one day he walked by a conference room where I was working with some other people, and he listened for a while, and he said, "You guys realize you're searching the space of Turing machines, right?" What he was basically saying was, we were down in the mechanisms, randomly flailing around. And he said, there's an awful lot of Turing machines out there; you're not going to find the one that works. And we see a huge amount of that right now with neural network architectures, where people are tweaking this and that — am I going to use BatchNorm, or put this kind of shortcut connection in, or do this kind of thing? That is not the route to success, I think. I mean, occasionally it is — and maybe that's unfortunate, because the story I've heard is that the transformer model was basically developed more or less in that way, by trying out various things. But I think we need to spend more time understanding the fundamentals.
The theory of machine learning is itself in disarray right now, but I think the theory people are starting to catch up, and we're seeing some interesting things there. Representation is the most important thing in AI, and in computer science in general, and so we need to understand it much more deeply — with deep learning, what representations are being learned, and how do they work? So: don't search the space of Turing machines. Think about what representations we want, and try to understand what the real task is that we're solving. Don't take at face value the paradigms that are set up — supervised learning or self-supervised learning or whatever — but think about what task we're really trying to solve.
01:04:49 Tom Mitchell: That's terrific. Tom Dietterich, thank you so much. I agree that you're an excellent explainer, and this was just a fantastic capsule summary of the history of the field. So thank you so much for sharing that.
01:05:09 Tom Dietterich: Well, thanks for the discussion, and good luck with this project.
01:05:14 Matty Smith: Tom Mitchell is the University Founders Professor at Carnegie Mellon University. Machine Learning: How Did We Get Here? is produced by the Stanford Digital Economy Lab. If you enjoyed this episode, subscribe wherever you listen to podcasts.