Tom Mitchell literally wrote the book on machine learning. In this series of candid conversations with his fellow pioneers, Tom traces the history of the field through the people who built it. Behind the tech are stories of passion, curiosity, and humanity.
Tom Mitchell is the University Founders Professor at Carnegie Mellon University, a Digital Fellow at the Stanford Digital Economy Lab, and the author of Machine Learning, a foundational textbook on the subject. This podcast is produced by the Stanford Digital Economy Lab.
Tom Mitchell:
Welcome to Machine Learning: How Did We Get Here? I'm Tom Mitchell, your podcast host. Now, many people ask, how did we get to this point where today we have these amazing AI systems? I have a one-sentence answer to that question: we tried for fifty years to write intelligent programs by hand, but we discovered about a decade ago that it was actually much easier and much more successful to use machine learning methods to instead train them to become intelligent.

So the real question is, how did machine learning get here? What were the successes along the way, and the failures? Who were the people involved? What were they thinking? What even made them want to get into this field in the first place?

This first episode will set the stage for the podcast. It is a recording of a lecture I gave this month, February twenty twenty-six, at Carnegie Mellon University, and it attempts to cover in one hour the seventy-five-year history of the field of machine learning. Most of the rest of the episodes in the podcast are interviews with various pioneers in the field who made very significant contributions along the way.

Before we start, I want to thank Carnegie Mellon University and also the Stanford University Digital Economy Lab for supporting the podcast. And I want to thank Maddie Smith, our podcast producer. I hope you enjoy the podcast.
Tom Mitchell:
If we're going to talk about machine learning, it's only fair to start with the first people who talked about how on earth learning is possible at all, which were the philosophers. As early as Aristotle, people were asking how it is that we can look at examples of things and learn their general essence, in his words. About a century later, there was a school of philosophers called the Pyrrhonists who really zeroed in on the problem of induction and how it can be justified. When we say induction, what we really mean is the process of coming up with a general rule from looking at specific examples. And so they asked questions like: if all of the swans we've seen so far in our life are white, should we conclude that all swans are white? What would be the justification for that? Maybe there's a black swan out there that we haven't seen.

That debate went on for some time. Around thirteen hundred, William of Ockham suggested something that we now call Occam's razor, the policy that we should prefer the simplest hypothesis. So, indeed, if all the swans we've seen so far are white, then the simplest hypothesis is that all swans are white. That was his prescription. Later on, around sixteen hundred, Francis Bacon brought up the importance of data collection, of actively experimenting to collect data that could falsify hypotheses that weren't correct. And then in the seventeen hundreds, the philosopher David Hume really nailed the problem of induction. He argued very persuasively that it's impossible to generalize from examples unless you make some additional assumption. And he pointed out that even the assumption that the future will be like the past is itself not provable; it is just a guess that we use. So his point was that people do induction, but it's a habit. It's not a justified, rational, provably correct process.
Tom Mitchell:
So the philosophers had plenty to say by the nineteen forties, when computers became available. Alan Turing, who's often called the father of computing, suggested that maybe computers could learn. He said: instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates a child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain. So he had the idea that maybe computers could learn, but he did not have an algorithm by which they would learn. That waited until the nineteen fifties, when there were two important seminal events.
Tom Mitchell:
One was a computer program written by an IBM researcher named Art Samuel, and his program learned to play checkers. I'll just read you a couple of sentences from the abstract of his paper. He said: two machine-learning procedures have been investigated in some detail using the game of checkers. Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program. And then he went on to point out: the principles of machine learning verified by these experiments are, of course, applicable to many other situations.

So he had really one of maybe the first demonstrations of a program that learned to do something interesting, and he understood that the techniques he was using were very general. Now, how did he get the computer to learn to play checkers? His program learned an evaluation function that would assign a numerical score to any checkers position, and that score would be higher the better the checkers position was from your point of view as you're playing the game. That evaluation function would then be used to control a look-ahead search for which move to take. The evaluation function was a linear weighted combination of board features that he made up, things like how many checkers on the board are mine, how many are yours, and so forth. So his program learned, and what it learned was that evaluation function. How did it learn it? By playing games against itself. And he points out that in eight to ten hours, it could learn well enough to beat him.
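The evaluation-function idea can be sketched in a few lines of Python. This is an illustrative reconstruction, not Samuel's actual program: the feature names are made up, and the update rule shown, nudging the direct evaluation toward a look-ahead score obtained during self-play, is a simplified stand-in for his training procedure.

```python
# Illustrative sketch (not Samuel's actual code): a linear evaluation
# function over hand-crafted board features, adjusted during self-play so
# that a position's direct score moves toward its look-ahead score.

FEATURES = ["my_pieces", "your_pieces", "my_kings", "your_kings"]  # made up

def evaluate(weights, features):
    """Score a position as a linear weighted combination of its features."""
    return sum(weights[f] * features[f] for f in FEATURES)

def update(weights, features, lookahead_score, lr=0.005):
    """Move the direct evaluation toward the better-informed look-ahead
    score, in the spirit of Samuel's self-play training."""
    error = lookahead_score - evaluate(weights, features)
    for f in FEATURES:
        weights[f] += lr * error * features[f]
    return weights

weights = {f: 0.0 for f in FEATURES}
position = {"my_pieces": 8, "your_pieces": 6, "my_kings": 1, "your_kings": 0}
for _ in range(200):
    weights = update(weights, position, lookahead_score=1.0)
print(round(evaluate(weights, position), 2))  # direct score now near 1.0
```

The point of the sketch is only that everything learnable lives in the weights; the features themselves were fixed by hand.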
Tom Mitchell:
Those ideas persisted through the decades. They became reused over and over, including in the computer programs that finally beat the world chess champion, the world backgammon champion, and the world Go champion. So those ideas were really seminal.
Tom Mitchell:
A second thing that happened in the fifties was the invention of the first early version of neural networks by Frank Rosenblatt from Cornell. He was interested in neuroscience: how can neurons in the brain be used to learn? And he ended up building a simple, at least by today's standards, neural network that consisted of one layer of neurons. There would be a receptive field input, say an image, and then the neurons would respond to that and produce an output set of neuron firings. What got learned in that case were the connection strengths between the input to the neuron and the probability that it would fire. And the way he trained it was what we now call supervised learning: you show an input and what the output should be. And he had schemes for updating those weights to fit the data.
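That supervised update scheme, in the simplified perceptron form usually taught today, might look like this. The dataset (logical AND, which is linearly separable) and the learning rate are illustrative choices, not taken from Rosenblatt's work.

```python
# A minimal perceptron-style supervised learning sketch: show an input with
# its desired output, and nudge each connection weight whenever the unit's
# prediction is wrong. The tiny AND dataset here is illustrative.

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train(examples, lr=0.1, epochs=20):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND)
print([predict(weights, bias, x) for x, _ in AND])  # [0, 0, 0, 1]
```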
Tom Mitchell:
Now, the importance of this work is that it catalyzed a whole bunch of work in the nineteen sixties, for the next decade, looking at different algorithms for tuning the weights of perceptron-style systems.
Tom Mitchell:
That work proceeded for a decade or so, and at the end of the nineteen sixties, two MIT scientists, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons. Unfortunately, that book proved that a single-layer perceptron, which was the only thing we knew how to train at that point, could never represent many of the functions we wanted to learn. It could only represent linearly separable functions, not even exclusive-or, where the output should be one if the first input is a one and the second is a zero, or vice versa, but zero if both inputs are one. You can't represent even that simple function with a perceptron, no matter how you train it.
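That claim about exclusive-or is easy to check directly: a brute-force scan over candidate weights and thresholds finds no single threshold unit that reproduces the XOR truth table. The grid of candidate values below is an arbitrary illustration; the underlying impossibility holds for all real-valued weights.

```python
# Brute-force check of Minsky and Papert's point: no single threshold unit
# (a perceptron) computes XOR, because XOR is not linearly separable.

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def unit(w1, w2, b, x1, x2):
    """One linear threshold unit."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def some_unit_computes_xor(grid):
    """Does ANY weight/bias setting from the grid match the XOR table?"""
    return any(
        all(unit(w1, w2, b, x1, x2) == y for (x1, x2), y in XOR.items())
        for w1 in grid for w2 in grid for b in grid
    )

grid = [i / 4 for i in range(-20, 21)]  # candidate values in [-5, 5]
print(some_unit_computes_xor(grid))    # False: no setting works
```

By contrast, a single unit handles OR easily (for example, weights one and one with bias minus one half), which is exactly the linearly separable case.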
Tom Mitchell:
So this really put the kibosh on work on perceptrons following the publication of this book.
Tom Mitchell:
Now, if we're not able, or don't want to spend our time figuring out how to learn perceptrons, then what's next? Well, it turned out one of Minsky's PhD students, Patrick Winston, published his thesis the next year, and Winston suggested that instead of learning perceptron-type representations of information, we should learn symbolic descriptions. In his thesis, he showed how his program could learn descriptions of different physical structures like an arch or a tower. He would train the program by showing it line drawings of positive and negative examples of, in this case, arches. The program would then process those incrementally arriving examples to produce a symbolic description of the different parts and the relations among them. For example, an arch could be two rectangles which don't touch each other, but which jointly support a roof of any shape.
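The flavor of such a symbolic description can be sketched as a set of required relations, where a new example matches the concept if it exhibits all of them. The part and relation names below are illustrative, not Winston's actual notation, and his real system did considerably more (notably learning from near misses).

```python
# Illustrative sketch of a learned symbolic concept: a set of required
# parts and relations. An example is an instance if it satisfies them all.

ARCH = {
    ("support", "left_post", "roof"),
    ("support", "right_post", "roof"),
    ("does_not_touch", "left_post", "right_post"),
}

def matches(description, example_relations):
    """True if the example exhibits every required relation."""
    return description <= example_relations

arch_example = ARCH | {("shape", "roof", "triangle")}  # extra detail is fine
near_miss = {                                          # posts touch: not an arch
    ("support", "left_post", "roof"),
    ("support", "right_post", "roof"),
    ("touches", "left_post", "right_post"),
}
print(matches(ARCH, arch_example), matches(ARCH, near_miss))  # True False
```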
Tom Mitchell:
So this was an important step because it shifted the focus onto learning a much richer kind of representation, symbolic descriptions. And this became the new paradigm which dominated the nineteen seventies.
Tom Mitchell:
During the seventies, there were a number of people working on learning symbolic descriptions. My favorite is the Meta-DENDRAL program, developed by Bruce Buchanan at Stanford. This program, again, was a symbolic learning program. What it learned was rules that would predict how molecules would shatter inside a mass spectrometer, and therefore predict what the mass spectrum of a new molecule would be. Those rules symbolically described a subgraph of atoms within the molecular graph, and they would say: if you find this subgraph, then specific bonds in that subgraph are likely to fragment when you put the molecule in a mass spectrometer. This was an important step forward. I asked Bruce Buchanan how well it worked. What was this program actually able to do?
Bruce Buchanan:
Well, for one small class of steroid molecules, the ketoandrostanes and estranes, if you will, we had fewer than a dozen spectra, and we were able to tease out the rules that determine how a new ketoandrostane would fragment in a mass spectrometer. We were able to publish that set of rules in a refereed chemistry journal. And it was, to our knowledge, the first time that the result of a machine learning program, symbolic learning, had been published in a refereed journal.
Tom Mitchell:
So that was an important milestone for machine learning, really the first time that a program discovered knowledge useful enough to get published in that domain. Now, on a personal note, I was a PhD student at Stanford at the time, and Bruce became my PhD advisor, so my PhD thesis was also built around this same data set. For my thesis I developed a system called version spaces, which was the first symbolic learning algorithm where you could prove that it would converge, and furthermore, that the learner would know when it had converged, so it would know it was done. It did that by maintaining not just one hypothesis that it would modify, but by keeping track of every hypothesis consistent with the data it had seen. And this also opened up the possibility of what we call today active learning. It made it easy for the system to play twenty questions with the teacher: it could ask the teacher, please label this example, so that it could reduce the set of hypotheses as quickly as possible.
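The version-space idea can be illustrated on a toy hypothesis space small enough to enumerate outright. This is a sketch of the principle only (track every hypothesis consistent with the data, and detect convergence when exactly one remains), not the candidate-elimination algorithm from the thesis; the attributes and values are made up.

```python
from itertools import product

# Toy version space: each hypothesis is a pair of constraints over two
# attributes, where "?" means "any value". Attribute names are illustrative.

SIZES, COLORS = ["small", "large"], ["red", "blue"]
HYPOTHESES = list(product(SIZES + ["?"], COLORS + ["?"]))

def predicts(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def version_space(examples):
    """All hypotheses consistent with every labeled example seen so far."""
    return [h for h in HYPOTHESES
            if all(predicts(h, x) == label for x, label in examples)]

examples = [(("large", "red"), True), (("small", "red"), False)]
vs = version_space(examples)
print(vs)  # two hypotheses still consistent
# Convergence is detectable: the learner is done when one hypothesis remains.
examples.append((("large", "blue"), False))
print(version_space(examples))  # a single hypothesis: converged
```

An active learner would pick its next query to split the remaining version space as evenly as possible, which is what makes the twenty-questions behavior natural in this framework.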
Tom Mitchell:
So by the end of the seventies, there seemed to be enough work going on in the field that it was time to hold a meeting. The first workshop on machine learning was held here at CMU, in Wean Hall, a couple of buildings in that direction. It was organized by Jaime Carbonell, who was an assistant professor here at the time; Richard Michalski, who was a more senior professor at Illinois; and myself, at the time an assistant professor at Rutgers University. And so we held this meeting and pulled together some people. One of the people who attended was a student of Richard Michalski's named Tom Dietterich, and Tom went on to make many contributions in the field of machine learning. So I asked Tom, what was the field like in nineteen eighty?
Tom Dietterich:
I'd say it was really chaotic. You know, I attended that very first machine learning workshop; I think you were one of the core organizers at CMU. There were probably thirty people in the room, and probably thirty completely different talks. I remember I was talking about a sort of algorithm-comparison paper that I had published at IJCAI seventy-nine, I think, just before that workshop, in which I was executing by hand these very simple algorithms for a kind of subgraph-learning problem and comparing how many subgraph-isomorphism calculations they had to do. It was like the first attempt to actually compare multiple machine learning algorithms that were more or less trying to do the same thing; there were a couple of them there. And, you know, I think John Anderson was there talking about cognitive models. You were there talking about the beginnings of EBL and the LEX system for calculus, symbolic integration. The most interesting talk, I thought, was Ross Quinlan's talk on ID3, where he was trying to take these reverse-enumerated chess endgames and learn decision trees that would completely, exactly, losslessly compress those giant tables into a small decision tree.

A really important thing people should understand about those days is that we believed there was a right answer for our machine learning problems. It would often happen that I would run the algorithms and they would not get the right answer, not the logical expression that we thought was the right answer. They would get something that was actually equally accurate on the training data. And it actually worked pretty well, although we didn't really have the idea of a separate test set in those days. It was not a field of statistics. We were coming out of, really, the John McCarthy program of programs with common sense, which didn't have a lot to do with common sense, but was about: we're going to represent everything in logic, and we're going to use logical inference as the execution engine.
Tom Mitchell:
So there's Tom's take on what things were like. He mentioned that he thought the most interesting talk was Ross Quinlan's. I agree; I thought that was the most interesting talk. Ross's talk presented the idea that we should learn decision trees. A decision tree is something where you classify an example by putting it at the root of the tree and then sorting it down to a leaf based on its features, and the leaf tells you what the output classification label should be.
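Classification by a decision tree, as just described, is a short loop: test one feature at each internal node, follow the matching branch, and read the label off at the leaf. The tree below is hand-written for illustration, not one learned by ID3.

```python
# A minimal decision-tree classifier. Internal nodes are (feature, branches)
# tuples; a bare string is a leaf label. The tree itself is illustrative.

TREE = ("outlook", {
    "sunny": ("humidity", {
        "high": "no",
        "normal": "yes",
    }),
    "overcast": "yes",
    "rainy": ("windy", {
        "true": "no",
        "false": "yes",
    }),
})

def classify(tree, example):
    """Sort the example down from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

print(classify(TREE, {"outlook": "sunny", "humidity": "normal", "windy": "false"}))
# prints "yes"
```

What a learner like ID3 produces is exactly such a tree: it chooses which feature to test at each node so that the examples are split as informatively as possible.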
Tom Mitchell:
That tree is what gets learned. So I asked Ross how he came up with this idea.
JR Quinlan:
I had done a PhD under a psychologist, Earl Hunt, and part of his work involved decision trees, which I learned about, of course, as a student, but then put in the back of my mind for fifteen years or so. Then I was at Stanford on sabbatical at the same time as Donald Michie. Michie was teaching a course on learning, which I sat in on, and he had a challenge for the class: to work out a way of predicting a win in a very simple chess endgame, king and rook versus king and knight. I remembered Earl Hunt's work on decision trees, and I thought, well, maybe that would be the way to go. So I developed a thing called ID3, which was just a simple decision tree program. No pruning, just a straight decision tree. And that seemed to solve the problem pretty well, up to about ninety-five percent, and I got that up to one hundred the next year. The first real time I talked about this was at that conference you organized, the workshop in nineteen eighty at Carnegie Mellon in Pittsburgh. You, Richard, and Jaime all set up that workshop, and I gave a talk there on decision tree learning.
Tom Mitchell:
So there's Ross's story. He got the idea of decision trees from his thesis advisor many years earlier, but it turns out Ross was the one who came up with the algorithm that actually, successfully discovered useful decision trees. And that whole idea of decision tree learning became very important in the field; by twenty ten, it was probably one of the most commercially used approaches in machine learning.
Tom Mitchell:
So in the early eighties, there were various experiments like these, trying to build machine learning systems, but really no theory, no theory that could tell us, for example, how many examples we would have to present to a learner in order for it to reliably learn. That changed in nineteen eighty-four, when Les Valiant published a paper on what he calls probably approximately correct learning. It was really the first practical theory to tell us how many examples you would need. In particular, the number of examples you need depends on three things. First, the complexity of your hypothesis space: for example, if you're going to learn decision trees of depth two, that's a lot less complex than learning decision trees of depth twelve. Second, the error rate you're willing to tolerate in the final hypothesis: one percent error, five percent error. Third, the failure probability you're willing to put up with: if you do collect that many randomly provided training examples, there is still some probability that you'll fail. You can't guarantee that you won't fail, but you can reduce that probability.
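The classic PAC bound for a finite hypothesis space captures exactly these three dependencies. This is the standard textbook bound for a consistent learner, shown as a sketch rather than Valiant's original statement verbatim.

```python
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    """Classic PAC bound for a finite hypothesis space H and a consistent
    learner: m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so
    that, with probability at least 1 - delta, every hypothesis consistent
    with the training data has true error at most epsilon."""
    return math.ceil(
        (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon
    )

# A richer hypothesis space, a smaller tolerated error, or a smaller
# failure probability all drive the required number of examples up.
print(pac_sample_bound(2**10, 0.05, 0.05))  # shallow trees: fewer examples
print(pac_sample_bound(2**20, 0.05, 0.05))  # deeper trees: more examples
```

Notice that the dependence on the hypothesis space is only logarithmic, which is why even very large hypothesis spaces can be learnable from a reasonable number of examples.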
Tom Mitchell:
So this was a breakthrough in the area of theoretical characterization of algorithms. I asked Les what he thought was the key idea there.
Leslie Valiant:
It's a kind of a model of computation, but it makes sense because it's got some applications. The particular result which persuaded people that there was something there is this: take a conjunctive normal form formula. From NP-completeness, we already knew at the time that there's some hardness in it, because if someone gave you the formula, it was computationally difficult to find out whether it's equivalent to a formula which is always zero, which is never satisfiable. On the other hand, this conjunctive normal form formula with three variables in each clause was PAC-learnable. And so it was a bit striking that something which is very hard is learnable. This highlighted the difference between computing and learning, because with the learning model, the idea was that there was a distribution of inputs. You learned from this distribution, but you only have to be good on this distribution when you have to predict. So if, for example, in this formula there were some inputs which are very rare, then the learner wouldn't have to know about them. In this sense, this was easier than the NP-completeness.
Tom Mitchell:
I was actually quite surprised at that answer. What he's saying, put another way, is that for this one kind of hypothesis, conjunctive normal form, which is a kind of logical expression, if your hypotheses are of that form, then it's easier to learn them than it is to compute with them. When he says compute, what he means is the cost of answering the question: can you find a positive example of this formula? It was known at the time that the computational cost of answering that question was exponential in the size of the formula. And then he discovered that learning a formula, if somebody gives you positive and negative examples, takes only polynomial, rather than exponential, time.

So I agree with him that that's a fascinating theoretical fact, but that would not be the answer I would give about why this revolutionized the field of machine learning. It revolutionized the field, in my view, because he was really the first person to come up with a framing, a new framing of the machine learning problem, that even allowed this kind of theoretical analysis. In particular, his framing included assumptions like: the training data would come from some source that gives you random examples according to some probability distribution, and later, when you want to test your hypothesis on new data, you get more random examples from that same source. He reframed the problem in a way that made theory possible. The consequence was that he catalyzed a huge amount of theoretical work in machine learning, which continues to this day and just keeps branching further and further. There are now conferences specifically devoted to this kind of theoretical work.
Tom Mitchell:
So the eighties were really a very generative decade; there were a lot of things going on. Another thing going on was that some people were looking at human learning and how that might inspire our models of AI and machine learning. One such effort was here at CMU, by Allen Newell and his two PhD students, John Laird and Paul Rosenbloom. They built a system they called Soar, which was really one of the first AI agents designed to capture the full breadth of what humans do: play games, solve problems, many different tasks. So they framed their machine learning problem as one of getting a general agent to learn. Their architecture had very interesting properties that I think are relevant today, now that agents are again a topic of hot activity. I won't go into the details, but in the podcast there's an interview with John Laird, who goes into detail on this.
Tom Mitchell:
Another item that can't be overlooked in the eighties was the rebirth of neural networks. Remember, at the end of the sixties, Minsky and Papert published that book that killed off work on perceptrons. Well, in the mid eighties, people finally came up with an algorithm that could train not just one-layer perceptrons, but multilayer perceptrons, and that allowed learning functions that were highly non-linear. Dave Rumelhart, Jay McClelland, and Geoff Hinton were three of the ringleaders of this effort. So I asked Geoff about that period: now we're up to the mid eighties, when really neural nets are reborn. Is that the right word? How would you put it?
Geoffrey Hinton:
Backprop with backpropagation?
Geoffrey Hinton:
I mean, we didn't invent it.
Geoffrey Hinton:
It was invented by several different
Geoffrey Hinton:
groups, but we showed that it
Geoffrey Hinton:
really worked to learn
Geoffrey Hinton:
representations.
Geoffrey Hinton:
And as you know, sort of one of
the big problems in AI is how do
Geoffrey Hinton:
you learn new representations?
Geoffrey Hinton:
How do you avoid having to put
them all in by hand?
Geoffrey Hinton:
And my particular example,
Geoffrey Hinton:
which was the family trees
Geoffrey Hinton:
example, where you take all the
Geoffrey Hinton:
information in some family
Geoffrey Hinton:
trees, you convert it into
Geoffrey Hinton:
triples of symbols like John
Geoffrey Hinton:
has-father Mary.
Geoffrey Hinton:
And then you train a neural
Geoffrey Hinton:
net to predict the last term in
Geoffrey Hinton:
a triple.
Geoffrey Hinton:
Given the first two terms.
Geoffrey Hinton:
So it's just like the big
language models.
Geoffrey Hinton:
You're predicting the next word
given the context.
Geoffrey Hinton:
It's just much simpler.
Geoffrey Hinton:
I had one hundred and twelve
Geoffrey Hinton:
total examples, of which one
Geoffrey Hinton:
hundred and four were training
Geoffrey Hinton:
examples and eight were test
Geoffrey Hinton:
examples, which is a bit less
Geoffrey Hinton:
than the trillion examples they
Geoffrey Hinton:
have nowadays,
Geoffrey Hinton:
but it was the same idea.
Geoffrey Hinton:
You convert a symbol into a
feature vector.
Geoffrey Hinton:
You then have the feature
vectors of the context interact
Geoffrey Hinton:
via a hidden layer.
Geoffrey Hinton:
They then predict the features
Geoffrey Hinton:
of the next symbol, and from
Geoffrey Hinton:
those features you guess what
Geoffrey Hinton:
the next symbol should be, and
Geoffrey Hinton:
you try and maximize the
Geoffrey Hinton:
probability of predicting the
Geoffrey Hinton:
next symbol.
Geoffrey Hinton:
And you then backpropagate
Geoffrey Hinton:
through the feature interactions
Geoffrey Hinton:
and through the process of
Geoffrey Hinton:
converting a symbol into
Geoffrey Hinton:
features.
Geoffrey Hinton:
And that way you learn
feature vectors to represent the
Geoffrey Hinton:
symbols and how these vectors
should interact to predict the
Geoffrey Hinton:
features of the next symbol.
Geoffrey Hinton:
And that's what these big
language models do.
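What Geoff describes maps almost line for line onto code. Below is a minimal sketch in Python with NumPy, using a made-up toy family and tiny dimensions rather than the original 1986 setup: each symbol gets a learned feature vector, the two context vectors interact through a hidden layer, and backpropagation adjusts both the weights and the embeddings themselves to predict the triple's third term.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up toy family, stored as (person, relation, answer) triples.
triples = [
    ("colin", "father", "james"),
    ("colin", "mother", "victoria"),
    ("charlotte", "father", "james"),
    ("charlotte", "mother", "victoria"),
]
vocab = sorted({s for t in triples for s in t})
idx = {s: i for i, s in enumerate(vocab)}
V, D, H = len(vocab), 4, 8            # vocab size, embedding dim, hidden dim

E = rng.normal(0, 0.5, (V, D))        # one learned feature vector per symbol
W1 = rng.normal(0, 0.5, (2 * D, H))   # context features -> hidden layer
W2 = rng.normal(0, 0.5, (H, V))       # hidden layer -> next-symbol logits

def forward(a, b):
    x = np.concatenate([E[idx[a]], E[idx[b]]])
    h = np.tanh(x @ W1)
    logits = h @ W2
    p = np.exp(logits - logits.max())
    return x, h, p / p.sum()          # softmax over the vocabulary

def total_loss():
    return -sum(np.log(forward(a, r)[2][idx[t]]) for a, r, t in triples)

loss_before = total_loss()
lr = 0.1
for _ in range(2000):                 # plain SGD with hand-written backprop
    for a, r, t in triples:
        x, h, p = forward(a, r)
        dlogits = p.copy()
        dlogits[idx[t]] -= 1.0        # gradient of softmax + cross-entropy
        dh = (W2 @ dlogits) * (1 - h ** 2)
        dx = W1 @ dh
        W2 -= lr * np.outer(h, dlogits)
        W1 -= lr * np.outer(x, dh)
        E[idx[a]] -= lr * dx[:D]      # the feature vectors are learned too
        E[idx[r]] -= lr * dx[D:]

loss_after = total_loss()
print(loss_before, loss_after)        # the training loss drops sharply
```

Exactly as in the quote, the gradient flows back through the feature interactions and through the symbol-to-feature conversion, so the embeddings are learned rather than designed by hand.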
Tom Mitchell:
So there's Geoff in the mid
Tom Mitchell:
nineteen eighties, working on
Tom Mitchell:
backpropagation.
Tom Mitchell:
Another personal note in
Tom Mitchell:
nineteen eighty six, while this
Tom Mitchell:
was going on, I came to spend a
Tom Mitchell:
year at CMU as a visiting
Tom Mitchell:
professor.
Tom Mitchell:
And I got to meet Allen Newell
at the time.
Tom Mitchell:
And Allen said, hey, do you want
to team teach a course?
Tom Mitchell:
We'll teach a course on
Tom Mitchell:
architectures for intelligent
Tom Mitchell:
agents.
Tom Mitchell:
And of course I said yes.
Tom Mitchell:
The opportunity to teach with
Allen.
Tom Mitchell:
And he said, by the way, there
Tom Mitchell:
will be another, uh, an
Tom Mitchell:
assistant professor working with
Tom Mitchell:
us.
Tom Mitchell:
The three of us will team teach
it.
Tom Mitchell:
That's Geoff Hinton.
Tom Mitchell:
So Allen, Geoff and I team
Tom Mitchell:
taught in spring of nineteen
Tom Mitchell:
eighty six.
Tom Mitchell:
Uh, this course was one of the
best experiences of my career up
Tom Mitchell:
to that point.
Tom Mitchell:
And so it was a large part of
the reason why I ended up
Tom Mitchell:
staying at CMU.
Tom Mitchell:
But when I came, I was here
Tom Mitchell:
for about a year, and then Geoff
Tom Mitchell:
moved on.
Tom Mitchell:
He moved up to the University of
Toronto and started
Tom Mitchell:
building up a group there.
Tom Mitchell:
One of the people who joined his
group was a person named Yann
Tom Mitchell:
LeCun, who went on to win the
Turing Award jointly with Geoff
Tom Mitchell:
and Yoshua Bengio for their work
in neural networks.
Tom Mitchell:
So I asked Yann about this
period.
Yann LeCun:
And then, mid nineteen
Yann LeCun:
eighty seven, I moved to Toronto
Yann LeCun:
to do a postdoc with Geoff, and I
Yann LeCun:
completed this, the
Yann LeCun:
simulator.
Yann LeCun:
Geoff thought I was not doing
Yann LeCun:
anything because I was just
Yann LeCun:
basically hacking, you know, all
Yann LeCun:
the time,
Yann LeCun:
and this, this
system was kind of
Yann LeCun:
interesting because we had to
build a front end language to
Yann LeCun:
interact with it.
Yann LeCun:
And that language was the Lisp
Yann LeCun:
interpreter that Leon and I
Yann LeCun:
wrote.
Yann LeCun:
And so we're using Lisp, even
though as a front end to kind of
Yann LeCun:
a neural net simulator.
Yann LeCun:
And I, you know, implemented
Yann LeCun:
weight-sharing abilities
Yann LeCun:
and all that stuff and started
Yann LeCun:
experimenting with what became
Yann LeCun:
convolutional nets.
Yann LeCun:
You know, when I was a postdoc
in Toronto, early nineteen
Yann LeCun:
eighty eight, roughly, and
started to get really good
Yann LeCun:
results on, you know, very
simple shape recognition, like,
Yann LeCun:
handwritten characters
that I had drawn with my mouse or
Yann LeCun:
something like that.
Yann LeCun:
Right.
Tom Mitchell:
So, as you just heard, Yann was
Tom Mitchell:
experimenting with can we apply
Tom Mitchell:
neural networks to the problem
Tom Mitchell:
of character recognition,
Tom Mitchell:
written characters.
Tom Mitchell:
People were experimenting with
many different uses of neural
Tom Mitchell:
nets at the time.
Tom Mitchell:
My favorite, the one I would
vote the application of the decade,
Tom Mitchell:
was done, surprisingly, in the
Tom Mitchell:
area of self-driving
cars.
Tom Mitchell:
There was a PhD student here at
CMU named Dean Pomerleau.
Tom Mitchell:
He trained a neural network
Tom Mitchell:
where the input was an image
Tom Mitchell:
taken by a camera looking out
Tom Mitchell:
the front windshield of a
Tom Mitchell:
vehicle.
Tom Mitchell:
And the output of the neural
Tom Mitchell:
network was the steering command
Tom Mitchell:
telling the car which direction
Tom Mitchell:
to steer.
Tom Mitchell:
So I asked Dean about that work.
Tom Mitchell:
How much training data did you
have?
Dean Pommerleau:
So the interesting thing was, to
Dean Pommerleau:
begin with, it was all batch
Dean Pommerleau:
training.
Dean Pommerleau:
So I'd drive, I'd have a person
drive the vehicle along Schenley
Dean Pommerleau:
Park, uh, Flagstaff Hill Path,
and then I would go off and
Dean Pommerleau:
crunch it overnight.
Dean Pommerleau:
But in the end, what we were
Dean Pommerleau:
able to do is, uh, real time
Dean Pommerleau:
learning.
Dean Pommerleau:
So one drive up the hill with a
Dean Pommerleau:
human behind the wheel steering
Dean Pommerleau:
and the neural network, learning
Dean Pommerleau:
to pair camera
Dean Pommerleau:
images with the steering command
Dean Pommerleau:
that the human was giving was
Dean Pommerleau:
able to, uh, train it in about
Dean Pommerleau:
five minutes to, uh, take over
Dean Pommerleau:
and steer on its own from there
Dean Pommerleau:
on, on that road and on similar
Dean Pommerleau:
roads.
Dean Pommerleau:
So it was one of the first real
time, real world vision
Dean Pommerleau:
applications of, uh, of
artificial neural networks going
Dean Pommerleau:
beyond just Flagstaff Hill, you
know, the little paths on there.
Dean Pommerleau:
And we went out on, on real
roads first through the golf
Dean Pommerleau:
course, Schenley Golf Course, on
the, uh, on the road there.
Dean Pommerleau:
And then we, we went on, you
know, the local highways, in
Dean Pommerleau:
fact, the longest as part of my
PhD, the longest trip we did
Dean Pommerleau:
was, I think, about one hundred
miles at the time from basically
Dean Pommerleau:
up, uh, I-79 from Pittsburgh all
the way up to Erie.
Dean Pommerleau:
Uh, and it drove basically the,
the whole way.
Dean Pommerleau:
And it was getting up to
fifty five miles per hour after
Dean Pommerleau:
we got a faster vehicle.
Tom Mitchell:
It turns out he didn't ask for
permission.
Tom Mitchell:
So so this was all happening in
the nineteen eighties.
Tom Mitchell:
Really, it was a decade of
Tom Mitchell:
amazing invention and innovation
Tom Mitchell:
and exploration.
Tom Mitchell:
Another important thing that
Tom Mitchell:
happened in that decade was the
Tom Mitchell:
development of reinforcement
Tom Mitchell:
learning.
Tom Mitchell:
The way to understand that is to
first realize that supervised
Tom Mitchell:
learning was the kind of
standard way of framing the
Tom Mitchell:
machine learning question.
Tom Mitchell:
When Dean talked about training
Tom Mitchell:
his system, he would input an
Tom Mitchell:
image.
Tom Mitchell:
He had people drive the car, so
he got a lot of training
Tom Mitchell:
examples of the form.
Tom Mitchell:
Here's the image and here's the
correct steering command.
Tom Mitchell:
So he could tell the neural
network: for this input,
Tom Mitchell:
here's the correct output.
Tom Mitchell:
That's called supervised
learning.
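The supervised framing Tom describes can be sketched in a few lines. The data below is synthetic: random "images" and a fabricated steering signal standing in for Dean's camera frames and the human driver's commands; only the shape of the problem is the same. The model is a simple least-squares linear fit rather than ALVINN's actual network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: each "image" is a flattened vector of 32 pixels, and
# the steering angle is a hidden linear function of the pixels plus noise
# (playing the role of the human driver's command).
n_pixels, n_frames = 32, 500
true_w = rng.normal(size=n_pixels)          # hypothetical "road direction" signal
images = rng.normal(size=(n_frames, n_pixels))
steering = images @ true_w + rng.normal(scale=0.1, size=n_frames)

# Supervised learning: fit a map from input to correct output over the pairs.
w, *_ = np.linalg.lstsq(images, steering, rcond=None)

# The fitted model predicts the steering command for an unseen image.
test_img = rng.normal(size=n_pixels)
error = abs(test_img @ w - test_img @ true_w)
print(error)                                 # close to zero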
Tom Mitchell:
But reinforcement learning
reframes the problem.
Tom Mitchell:
It takes into account that
sometimes we don't know what the
Tom Mitchell:
right output is.
Tom Mitchell:
For example, if you're learning
to play chess, you might not
Tom Mitchell:
have a person who tells you at
every step given this board
Tom Mitchell:
position, here's the right move.
Tom Mitchell:
Instead, you might have to wait
until the end of the game after
Tom Mitchell:
you've made many moves to get
the feedback signal that says
Tom Mitchell:
you lost or you won, and then
you have to figure out what to
Tom Mitchell:
do about that because you
actually took many moves.
Tom Mitchell:
So that's what reinforcement
learning is about.
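The delayed-feedback framing can be made concrete with a toy problem, assuming a five-state corridor rather than chess: the only reward in the whole task arrives at the far end, long after the early moves that earned it, and a standard tabular Q-learning update (a textbook method, not anything specific to Sutton and Barto's own code) propagates that signal backwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# A five-state corridor: start at state 0, the goal is state 4, and the
# only reward arrives on reaching the goal.
n_states = 5
Q = np.zeros((n_states, 2))          # action 0 = step left, action 1 = step right
alpha, gamma, eps = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for _ in range(500):                 # episodes
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy action choice: mostly exploit, sometimes explore.
        a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0    # delayed reward, goal only
        # Q-learning update: bootstrap from the best value of the next state,
        # which is how the end-of-episode reward creeps back to early moves.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)
print(policy[:4])                    # the agent heads right in every state
```

No example ever says "the right move in state 0 is right"; the agent works that out from the terminal reward alone, which is the distinction from supervised learning.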
Tom Mitchell:
And Rich Sutton and Andy Barto
were instrumental in kind of
Tom Mitchell:
framing that problem and, and
working on it.
Tom Mitchell:
They recently won the Turing
Award for this work.
Tom Mitchell:
So I asked Rich how
Tom Mitchell:
reinforcement learning fit into
Tom Mitchell:
the field.
Rich Sutton:
The field of machine learning
Rich Sutton:
has always been dominated
Rich Sutton:
by the more straightforward
Rich Sutton:
supervised approach.
Rich Sutton:
There was, as I
mentioned at the very beginning,
Rich Sutton:
the rewards and penalties were
very much a part of it.
Rich Sutton:
But then, as
Rich Sutton:
things became more clear and
Rich Sutton:
better defined, the
Rich Sutton:
focus of the learning
Rich Sutton:
problem became pattern
Rich Sutton:
recognition and supervised
Rich Sutton:
learning.
Rich Sutton:
And, this fellow, the
strange, uh, fellow Harry Klopf,
Rich Sutton:
recognized this more than
other people and
Rich Sutton:
wrote some reports and
ultimately a book, saying
Rich Sutton:
that something had been lost.
Rich Sutton:
And Andy Barto and I
picked up on his work and
Rich Sutton:
and eventually realized that he
was right, that something had
Rich Sutton:
been left out, and in some sense
it was obvious that something
Rich Sutton:
had been left out.
Rich Sutton:
From the point of view of
Rich Sutton:
psychology, where I'd been
Rich Sutton:
studying how animals learn and
Rich Sutton:
animals learn.
Rich Sutton:
Really in both ways, in both a
Rich Sutton:
supervised way and a
Rich Sutton:
reinforcement way.
Rich Sutton:
And so, we picked up on
that and made that into a well
Rich Sutton:
defined area.
Rich Sutton:
When was that?
Rich Sutton:
That would have been in the
eighties.
Tom Mitchell:
And then finally, you wrote a
book on it in ninety eight.
Tom Mitchell:
So then it became a clear, uh,
subfield of machine learning.
Rich Sutton:
Yeah.
Rich Sutton:
But the key thing is why.
The way I say it to
Rich Sutton:
myself is: why is
reinforcement learning,
Rich Sutton:
why is it powerful?
Rich Sutton:
Potentially powerful.
Rich Sutton:
It's powerful because it's
learning.
Rich Sutton:
It's really learning from
experience.
Rich Sutton:
Learning from the normal data
Rich Sutton:
that an animal or a person would
Rich Sutton:
get.
Rich Sutton:
And it doesn't require a
Rich Sutton:
prepared special data like you
Rich Sutton:
of course do in supervised
Rich Sutton:
learning.
Tom Mitchell:
So during the eighties, there
were a lot of other really
Tom Mitchell:
interesting things going on.
Tom Mitchell:
Uh, people experimenting with
Tom Mitchell:
the idea that maybe machines
Tom Mitchell:
should learn by simulating
Tom Mitchell:
evolution.
Tom Mitchell:
There was an entire set of
conferences on something called
Tom Mitchell:
genetic algorithms, genetic
programming, which had to do
Tom Mitchell:
with that sort of thing.
Tom Mitchell:
Uh, a cluster of work on
Tom Mitchell:
studying human learning and
Tom Mitchell:
other areas.
Tom Mitchell:
But we don't have time for all
of those.
Tom Mitchell:
Let's move on to the nineteen
Tom Mitchell:
nineties, when, again, there was
Tom Mitchell:
a, I would say, a sea change in
Tom Mitchell:
terms of the style of work that
Tom Mitchell:
went on.
Tom Mitchell:
The theme of the nineteen
nineties was really the
Tom Mitchell:
integration of statistical and
probabilistic methods into the
Tom Mitchell:
field of machine learning.
Tom Mitchell:
And a lot of that took the
Tom Mitchell:
grounded form of learning a new
Tom Mitchell:
kind of object, which people
Tom Mitchell:
called either graphical models
Tom Mitchell:
or Bayes nets.
Tom Mitchell:
But what got learned in that
Tom Mitchell:
case was, again, a network where
Tom Mitchell:
each node would represent a
Tom Mitchell:
variable.
Tom Mitchell:
For example, maybe you would be
interested in predicting whether
Tom Mitchell:
somebody has lung cancer.
Tom Mitchell:
You'd make that a variable and
maybe you'd have evidence like
Tom Mitchell:
are they a smoker?
Tom Mitchell:
Do they have a normal or
abnormal X-ray result?
Tom Mitchell:
You'd make those variables.
Tom Mitchell:
And then the edges in the graph
represent probabilistic
Tom Mitchell:
dependencies among the variables
in a way such that in the end,
Tom Mitchell:
the whole graph represents the
full joint probability
Tom Mitchell:
distribution over the entire
collection of variables.
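Tom's lung-cancer example can be written out directly. The probabilities below are invented for illustration; the point is that a small graph (Smoker to Cancer to Abnormal X-ray) plus local conditional tables defines the full joint distribution, and that joint supports inference by enumeration.

```python
from itertools import product

# Invented numbers for a three-variable net: Smoker -> Cancer -> AbnormalXray.
p_smoker = {True: 0.3, False: 0.7}
p_cancer = {True: {True: 0.05, False: 0.95},    # P(cancer | smoker)
            False: {True: 0.01, False: 0.99}}   # P(cancer | non-smoker)
p_xray = {True: {True: 0.9, False: 0.1},        # P(abnormal x-ray | cancer)
          False: {True: 0.2, False: 0.8}}       # P(abnormal x-ray | no cancer)

def joint(s, c, x):
    # The edges of the graph dictate this factorization of the full joint.
    return p_smoker[s] * p_cancer[s][c] * p_xray[c][x]

# The eight joint probabilities sum to one ...
total = sum(joint(s, c, x) for s, c, x in product([True, False], repeat=3))

# ... and support inference by enumeration: P(cancer | abnormal x-ray).
num = sum(joint(s, True, True) for s in [True, False])
den = sum(joint(s, c, True) for s, c in product([True, False], repeat=2))
posterior = num / den
print(total, posterior)   # seeing the abnormal x-ray raises belief in cancer
```

What the learning algorithms of the nineties figured out was how to fill in tables like these (and sometimes the graph structure itself) from data.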
Tom Mitchell:
So that's what got learned; how
it got learned
Tom Mitchell:
waited for some algorithms to be
discovered.
Tom Mitchell:
Judea Pearl came up with the
Tom Mitchell:
idea of how to represent these,
Tom Mitchell:
but one of the key people
Tom Mitchell:
involved in inventing the
Tom Mitchell:
learning algorithms was
Tom Mitchell:
Daphne Koller, a professor at
Tom Mitchell:
Stanford, one of the most
Tom Mitchell:
active researchers in terms of
Tom Mitchell:
designing algorithms for
Tom Mitchell:
learning these.
Tom Mitchell:
So I asked her, why do we need
graphical models?
Daphne Koller:
Graphical models, for me,
emerged by realizing that the
Daphne Koller:
problems that we needed to solve
to address most real world
Daphne Koller:
applications went beyond:
Daphne Koller:
you have a vector representation
Daphne Koller:
of an input and a single,
Daphne Koller:
oftentimes binary or at best
Daphne Koller:
continuous output.
Daphne Koller:
There was so much more
opportunity to think about
Daphne Koller:
richly structured environments,
richly structured problems.
Daphne Koller:
So even if you think about
problems like understanding what
Daphne Koller:
is in an image, that's not a
single label problem of there is
Daphne Koller:
a dog, because images are
complex and there's
Daphne Koller:
interrelationships between the
different objects you want it to
Daphne Koller:
get beyond the yes/no "is there
a dog in this image" to something
Daphne Koller:
that is much more rich.
Daphne Koller:
There's a dog and a Frisbee and
Daphne Koller:
a beach and three kids building
Daphne Koller:
a sandcastle.
Daphne Koller:
You have a rich input and a rich
output.
Daphne Koller:
Thinking about these richly
Daphne Koller:
structured domains gave rise to
Daphne Koller:
we have to think about multiple
Daphne Koller:
variables.
Daphne Koller:
We have to think about the
Daphne Koller:
interactions between those
Daphne Koller:
variables and leverage that
Daphne Koller:
structure both in our input and
Daphne Koller:
output space in order to get to
Daphne Koller:
much better conclusions and deal
Daphne Koller:
with problems that really
Daphne Koller:
matter.
Tom Mitchell:
So this work on training
Tom Mitchell:
graphical models was really part
Tom Mitchell:
of a bigger theme that decade,
Tom Mitchell:
which was just the integration
Tom Mitchell:
of statistical methods with what
Tom Mitchell:
had been pretty much statistics
Tom Mitchell:
free machine learning up to that
Tom Mitchell:
point.
Tom Mitchell:
Another person who was
Tom Mitchell:
instrumental in that was
Tom Mitchell:
Berkeley professor named Mike
Tom Mitchell:
Jordan.
Tom Mitchell:
I asked him about the
Tom Mitchell:
relationship between statistics
Tom Mitchell:
and machine learning.
Michael I. Jordan:
So anyway, by the time I
wanted to move to Berkeley, I
Michael I. Jordan:
was realizing that I was missing
the whole statistics community,
Michael I. Jordan:
that, uh, it was just separate
from machine learning, as maybe
Michael I. Jordan:
you kind of remember, there was
occasionally a little leakage,
Michael I. Jordan:
but it was way too separate.
Michael I. Jordan:
And and nowadays we're often
seeing, you know, people will
Michael I. Jordan:
run a machine learning method,
but then it's not calibrated.
Michael I. Jordan:
It, you know, has bias and
all that.
Michael I. Jordan:
And that's the thing
statisticians have talked about
Michael I. Jordan:
for a long, long time.
Michael I. Jordan:
And so nowadays I think it's a
given that, yeah, they're,
Michael I. Jordan:
they're kind of two parts, two
sides of the same coin.
Michael I. Jordan:
Machine learning is maybe a
little more engineering in order
Michael I. Jordan:
to build a system and make it do
great things in the world.
Michael I. Jordan:
And statistics is a little bit
more, well, let's be cautious.
Michael I. Jordan:
Let's say we're going to do like
clinical trials.
Michael I. Jordan:
Let's make sure that the the
answer is really trustable, but
Michael I. Jordan:
those are two sides of the same
coin, and I think that's
Michael I. Jordan:
probably pretty much clear now.
Michael I. Jordan:
But for a long time there was a
resistance.
Michael I. Jordan:
Everyone said this is a brand
new field, this is different.
Michael I. Jordan:
And I kept and again annoying
colleagues by saying, no, I
Michael I. Jordan:
don't believe it is.
Michael I. Jordan:
So anyway, long story short, it
is.
Tom Mitchell:
It is remarkable to me that
Tom Mitchell:
the field of machine learning
Tom Mitchell:
went through most of the
Tom Mitchell:
nineteen eighties, kind of
Tom Mitchell:
without even noticing that
Tom Mitchell:
statistics existed.
Michael I. Jordan:
I mean, people like Leo Breiman
Michael I. Jordan:
were around to help make the
Michael I. Jordan:
passage.
Michael I. Jordan:
So ensemble methods, they were
kind of invented by Leo in the stat
Michael I. Jordan:
literature, but they were
independently invented in the
Michael I. Jordan:
machine learning literature.
Michael I. Jordan:
And is that machine learning or
statistics?
Michael I. Jordan:
Well, clearly it's both and it
needs both perspectives.
Michael I. Jordan:
And yes, in the nineteen
nineties, the EM algorithm,
Michael I. Jordan:
you know, the graphical models,
Michael I. Jordan:
so yeah, the nineties, it was a
real flourishing of that.
Tom Mitchell:
So Mike mentioned that one of
the themes was ensemble methods.
Tom Mitchell:
So anyway, I think that's
Tom Mitchell:
actually a very nice example of
Tom Mitchell:
how machine learning theory and
Tom Mitchell:
statistical theory kind of
Tom Mitchell:
intertwined.
Tom Mitchell:
The idea of ensemble learning is
Tom Mitchell:
instead of learning one
Tom Mitchell:
hypothesis, let's learn multiple
Tom Mitchell:
ones.
Tom Mitchell:
For example, instead of learning
Tom Mitchell:
a decision tree, you might learn
Tom Mitchell:
a whole forest of decision
Tom Mitchell:
trees.
Tom Mitchell:
And then when it comes to
Tom Mitchell:
classifying a new example, you
Tom Mitchell:
give it to all of the
Tom Mitchell:
classifiers and you let them
Tom Mitchell:
vote and you take the vote of
Tom Mitchell:
the classifiers.
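The arithmetic behind voting is worth seeing once. The simulation below assumes idealized, fully independent classifiers that are each right 70 percent of the time; real ensembles like random forests get their diversity from resampling the training data, and their errors are only partly independent, so the gain in practice is smaller but still real.

```python
import random

random.seed(0)
n_trials, n_classifiers, p_correct = 10000, 21, 0.7

single = majority = 0
for _ in range(n_trials):
    # Each classifier is independently right with probability 0.7.
    votes = [random.random() < p_correct for _ in range(n_classifiers)]
    single += votes[0]                           # track one classifier alone
    majority += sum(votes) > n_classifiers // 2  # track the majority vote

single_acc = single / n_trials
majority_acc = majority / n_trials
print(single_acc, majority_acc)   # the majority vote is far more accurate
```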
Tom Mitchell:
Well, that turned out to be very
Tom Mitchell:
successful and commercially very
Tom Mitchell:
important.
Tom Mitchell:
But it also is a beautiful
Tom Mitchell:
example where, there's a
Tom Mitchell:
pretty interesting theory around
Tom Mitchell:
that.
Tom Mitchell:
And initially, Yoav Freund and
Robert Schapire, uh, in the early
Tom Mitchell:
nineties, uh, started working on
a theory and methods for doing
Tom Mitchell:
this kind of ensemble learning.
Tom Mitchell:
Leo Breiman, who was a
statistician, recognized that
Tom Mitchell:
this echoed some of the themes
of resampling in statistics.
Tom Mitchell:
And those two things, uh, kind
Tom Mitchell:
of came together in a very
Tom Mitchell:
successful way.
Tom Mitchell:
So in the nineties and the first
Tom Mitchell:
decade of the two thousands,
Tom Mitchell:
there were many other things
Tom Mitchell:
going on.
Tom Mitchell:
The development of things
called support vector machines,
Tom Mitchell:
kernel methods, which were
mathematical techniques for
Tom Mitchell:
learning very nonlinear
classifiers that were actually
Tom Mitchell:
commercially important and
opened the door in many cases to
Tom Mitchell:
machine learning for
non-numerical data, data like
Tom Mitchell:
images or text.
Tom Mitchell:
There was work on manifold
learning.
Tom Mitchell:
There was also growing
Tom Mitchell:
commercialization during that
Tom Mitchell:
decade.
Tom Mitchell:
More and more companies were
Tom Mitchell:
starting to use machine learning
Tom Mitchell:
commercially.
Tom Mitchell:
But for me, the theme of that
first decade of the two thousands
Tom Mitchell:
was really a growing awareness
by many people that, you know,
Tom Mitchell:
maybe we have good enough
machine learning algorithms that
Tom Mitchell:
the bottleneck to more accuracy
is not the algorithm.
Tom Mitchell:
Maybe we need more data and more
computation.
Tom Mitchell:
And this idea was crystallized
in this beautiful paper written
Tom Mitchell:
in two thousand and nine by
three authors at Google, called
Tom Mitchell:
The Unreasonable Effectiveness
of Data, which really
Tom Mitchell:
highlighted cases where,
if you want better
Tom Mitchell:
results, keep your same
algorithm, get more data.
Tom Mitchell:
And that was kind of a theme of
what was going on at the time,
Tom Mitchell:
but things really broke open in
the year twenty twelve.
Tom Mitchell:
In twenty twelve, the
computer vision community had
Tom Mitchell:
been using a data set created by
Fei-Fei Li called ImageNet to
Tom Mitchell:
test out different vision
algorithms, see who could do the
Tom Mitchell:
best job of labeling which
object was the primary object in
Tom Mitchell:
an image, and the ImageNet data
set was very large.
Tom Mitchell:
In twenty twelve, Geoff Hinton
and some of his students entered
Tom Mitchell:
the competition and they blew
away the competition.
Tom Mitchell:
What's interesting is they were
the only neural network approach
Tom Mitchell:
in the competition.
Tom Mitchell:
By that time, by the way, neural networks were
Tom Mitchell:
very scarce in the field of
Tom Mitchell:
machine learning.
Tom Mitchell:
They had been displaced really
Tom Mitchell:
by more recent probabilistic
Tom Mitchell:
methods, and only a smallish
Tom Mitchell:
number of researchers were even
Tom Mitchell:
still working on neural
Tom Mitchell:
networks.
Tom Mitchell:
But, nevertheless, this
happened.
Tom Mitchell:
So I asked Geoff about that.
Geoffrey Hinton:
When Fei-Fei
came up with the ImageNet
Geoffrey Hinton:
dataset, Yann realized they
could win that competition, and
Geoffrey Hinton:
he tried to get graduate
students and postdocs in his lab
Geoffrey Hinton:
to do it, and they all declined.
Geoffrey Hinton:
And Ilya, Ilya Sutskever
realized that backprop
Geoffrey Hinton:
would just kill ImageNet.
Geoffrey Hinton:
He wanted Alex to work
on it, and Alex actually didn't really
Geoffrey Hinton:
want to work on it.
Geoffrey Hinton:
Alex had already been
Geoffrey Hinton:
working on small images and
Geoffrey Hinton:
recognizing small images in
Geoffrey Hinton:
CIFAR-10, and Ilya pre-processed
Geoffrey Hinton:
everything for Alex to make it
Geoffrey Hinton:
easy.
Geoffrey Hinton:
And I bought Alex two Nvidia
Geoffrey Hinton:
GPUs to have in his bedroom at
Geoffrey Hinton:
home.
Geoffrey Hinton:
Alex then got on with
got on with it, and he was an
Geoffrey Hinton:
absolutely wizard programmer.
Geoffrey Hinton:
He wrote amazing code on
Geoffrey Hinton:
multiple GPUs to do convolution
Geoffrey Hinton:
really efficiently.
Geoffrey Hinton:
Much better code than anybody
else had ever written.
Geoffrey Hinton:
I believe. And so it's a
combination of Ilya realizing we
Geoffrey Hinton:
really had to do this.
Geoffrey Hinton:
I know Ilya was involved in the
design of the net and so on, and
Geoffrey Hinton:
Alex's programming skills.
Geoffrey Hinton:
And then I added a few ideas,
like use rectified linear units
Geoffrey Hinton:
instead of sigmoid units and use
little patches of the images.
Geoffrey Hinton:
I mean, big patches of the
images.
Geoffrey Hinton:
So you can translate things
Geoffrey Hinton:
around a bit to get some
Geoffrey Hinton:
translation invariance, as well
Geoffrey Hinton:
as using convolution, and
Geoffrey Hinton:
use dropout.
Geoffrey Hinton:
So that was one of the first
applications of dropout.
Geoffrey Hinton:
And that helped about one
percent.
Geoffrey Hinton:
It helped.
Geoffrey Hinton:
And then we beat the best vision
systems.
Geoffrey Hinton:
The best vision systems were
sort of plateauing at twenty
Geoffrey Hinton:
five percent errors.
Geoffrey Hinton:
That's errors for getting the
right answer in your
Geoffrey Hinton:
top five bets.
Geoffrey Hinton:
And we got like fifteen
percent, fifteen or sixteen,
Geoffrey Hinton:
depending on how you count it.
Geoffrey Hinton:
So we got almost half the error
rate.
Geoffrey Hinton:
And what happened then was what
Geoffrey Hinton:
ought to happen in science but
Geoffrey Hinton:
seldom does.
Geoffrey Hinton:
So our most vigorous opponents,
like Jitendra Malik and
Geoffrey Hinton:
Zisserman, Andrew Zisserman,
looked at these results and
Geoffrey Hinton:
said, okay, you were right.
Geoffrey Hinton:
That never happens in science.
Geoffrey Hinton:
And, slightly irritatingly,
Andrew Zisserman then switched
Geoffrey Hinton:
to doing this.
Geoffrey Hinton:
He had some very good postdocs
or students working with him.
Geoffrey Hinton:
Simonyan, after about
Geoffrey Hinton:
a year, they were making better
Geoffrey Hinton:
networks than us. But that was,
Geoffrey Hinton:
as far as
Geoffrey Hinton:
the general public was
concerned,
Geoffrey Hinton:
really the start of this big
Geoffrey Hinton:
swing towards deep learning in
Geoffrey Hinton:
twenty twelve.
Tom Mitchell:
So that event, that competition
Tom Mitchell:
and the fact that the neural
Tom Mitchell:
network approach totally
Tom Mitchell:
dominated all the other
Tom Mitchell:
approaches really was a wake up
Tom Mitchell:
call to both the computer vision
Tom Mitchell:
community, in which within a couple
Tom Mitchell:
of years everybody was using
Tom Mitchell:
neural networks.
Tom Mitchell:
But it was also a wake up call
to the machine learning
Tom Mitchell:
community, who had kind of
scoffed at neural networks for
Tom Mitchell:
several decades, that neural
networks were back.
Tom Mitchell:
And so people started again, now
Tom Mitchell:
experimenting with this new
Tom Mitchell:
generation of deep neural
Tom Mitchell:
networks.
Tom Mitchell:
That just meant that instead of
having two layers, they could
Tom Mitchell:
have many layers, dozens of
layers, because training
Tom Mitchell:
algorithms were available and so
was the computation.
Tom Mitchell:
People started experimenting with
these and primarily on
Tom Mitchell:
perceptual style problems.
Tom Mitchell:
In fact, by twenty sixteen,
Tom Mitchell:
neural nets had taken over not
Tom Mitchell:
only computer vision, but in
Tom Mitchell:
twenty sixteen, some scientists
Tom Mitchell:
from Microsoft showed that they
Tom Mitchell:
had been able to train a neural
Tom Mitchell:
network to finally reach human
Tom Mitchell:
level speech
Tom Mitchell:
recognition performance
for individual words in a widely
Tom Mitchell:
used data set called the
Switchboard data set.
Tom Mitchell:
So people were experimenting
with neural nets for visual
Tom Mitchell:
data, speech data, radar, lidar,
all kinds of sensory data.
Tom Mitchell:
People started also asking,
Tom Mitchell:
well, can we apply these to text
Tom Mitchell:
data?
Tom Mitchell:
And the answer was yes.
Tom Mitchell:
And people started inventing
various architectures, things
Tom Mitchell:
with names like long short term
memory and others to analyze
Tom Mitchell:
sequences of text and applying
them to problems like machine
Tom Mitchell:
translation, translating English
into French, and so forth.
Tom Mitchell:
And, uh, that kind of
worked.
Tom Mitchell:
And then in twenty seventeen,
Tom Mitchell:
a very important paper was
Tom Mitchell:
published.
Tom Mitchell:
The name of the paper was
Attention is All You Need.
Tom Mitchell:
And what that was referring to
was a subcircuit in a
Tom Mitchell:
neural network called an
attention mechanism that had
Tom Mitchell:
recently been invented and
developed and was trainable.
Tom Mitchell:
But that attention mechanism
Tom Mitchell:
was used in this paper, and it
Tom Mitchell:
advanced the state of the art in
Tom Mitchell:
machine translation.
Tom Mitchell:
But even more importantly for us
today, it introduced the
Tom Mitchell:
transformer architecture based
on this attention mechanism.
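The attention mechanism itself is compact enough to write out. Below is a minimal sketch of scaled dot-product attention as described in that paper, run on random toy matrices: each position's query is scored against every key, and the softmax of those scores weights an average of the values.

```python
import numpy as np

def attention(Q, K, V):
    # Score every query against every key, scaled by sqrt of the key width.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each row of scores into a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the values.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 4, 8                       # four "tokens", eight features each
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))        # (4, 8); each weight row sums to 1
```

In a real transformer Q, K, and V are learned linear projections of the token embeddings, and many such heads run in parallel per layer; the core computation is just this.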
Tom Mitchell:
And it's that transformer
Tom Mitchell:
architecture that underlies GPT
Tom Mitchell:
and pretty much all of the large
Tom Mitchell:
language models that were
Tom Mitchell:
released around twenty twenty
Tom Mitchell:
two.
Tom Mitchell:
So that was a major event.
Tom Mitchell:
Now, around the same time, Yann
Tom Mitchell:
LeCun, remember the guy who was
Tom Mitchell:
a postdoc with Geoff in nineteen
Tom Mitchell:
eighty seven?
Tom Mitchell:
Yann had become the head of AI
research at Facebook.
Tom Mitchell:
And so he was in a very
interesting position because he
Tom Mitchell:
was both an academic.
Tom Mitchell:
He retained his NYU
professorship and at the same
Tom Mitchell:
time he had a foot in the
commercial world directing the
Tom Mitchell:
AI strategy for Facebook.
Tom Mitchell:
So I asked Yann about this period
Tom Mitchell:
and what it looked like to him
Tom Mitchell:
from being inside both
Tom Mitchell:
worlds.
Tom Mitchell:
His first part of his answer was
Tom Mitchell:
that he said for him, a key
Tom Mitchell:
development was realizing that
Tom Mitchell:
you didn't have to wait for
Tom Mitchell:
people to label all your
Tom Mitchell:
training data, that you could do
Tom Mitchell:
something called self-supervised
Tom Mitchell:
learning.
Tom Mitchell:
For example, just take data like
a string of words and remove a
Tom Mitchell:
word and force
the program to predict what that
Tom Mitchell:
removed word was.
Tom Mitchell:
So there's no human labeling you
have to do for that.
Tom Mitchell:
You can use the whole web and
Tom Mitchell:
you get a lot of training
Tom Mitchell:
examples.
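The word-removal trick Tom describes needs almost no machinery to set up; a few lines turn raw text into labeled-looking training pairs with no human in the loop (the sentence below is just a stand-in corpus).

```python
# A stand-in corpus; any raw text works, which is the whole point.
text = "we tried for fifty years to write intelligent programs by hand"
words = text.split()

examples = []
for i, target in enumerate(words):
    # Remove one word; the surrounding context becomes the input and the
    # removed word becomes the label, with no human annotation involved.
    context = words[:i] + ["[MASK]"] + words[i + 1:]
    examples.append((" ".join(context), target))

print(len(examples))   # one free training example per word in the corpus
print(examples[3])     # the fourth example masks the word "fifty"
```

Scale the same loop up to the whole web and you have the training set for a large language model.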
Tom Mitchell:
So that self-supervised
learning was a key development.
Tom Mitchell:
But then here's his description
of what came next.
Yann LeCun:
So the idea that self-supervised
learning could really kind of
Yann LeCun:
bring something to the table
there, I think was kind of a
Yann LeCun:
big change of mindset.
Yann LeCun:
And then there was
Transformers, of course.
Yann LeCun:
Right.
Yann LeCun:
Before that, there was some
Yann LeCun:
demonstration that, you
know, you could basically match
Yann LeCun:
the performance of classical
systems for tasks like
Yann LeCun:
translation, language
translation using large neural
Yann LeCun:
nets like LSTM.
Yann LeCun:
So this was the work by Ilya
Sutskever when he was at Google.
Yann LeCun:
It was this big sequence to
sequence model with LSTMs, some
Yann LeCun:
gigantic model you could train
to do translation.
Yann LeCun:
And it kind of works at the same
Yann LeCun:
level, if not better in some
Yann LeCun:
cases, than the then-classical
Yann LeCun:
translation methods.
Yann LeCun:
Then a few months later,
Yann LeCun:
Yoshua Bengio and Kyunghyun Cho,
Yann LeCun:
who is now a colleague at NYU,
Yann LeCun:
uh, showed that you could change
Yann LeCun:
the architecture and use this
Yann LeCun:
attention mechanism
Yann LeCun:
that they proposed to
basically get really good
Yann LeCun:
performance on translation with
much smaller models than what
Yann LeCun:
Ilya had been proposing.
Yann LeCun:
And the entire industry jumped
Yann LeCun:
on this, Chris Manning's
Yann LeCun:
group at Stanford, kind of, you
Yann LeCun:
know, used that architecture and
Yann LeCun:
basically
Yann LeCun:
won the WMT competition for a
Yann LeCun:
particular, uh, type of
Yann LeCun:
translation.
Yann LeCun:
And the entire industry jumped
on it.
Yann LeCun:
So within a few months after
that, like, you know, all the
Yann LeCun:
big players, uh, in translation,
were using attention type
Yann LeCun:
architectures for translation.
Yann LeCun:
And that's when, the
transformer paper came out.
Yann LeCun:
Attention is all you need.
Yann LeCun:
So basically, if you build a
neural net just with those kinds
Yann LeCun:
of attention circuits, you don't
need much else.
Yann LeCun:
And it ends up working super
well.
Yann LeCun:
And that's what started the, you
Yann LeCun:
know, the transformer
Yann LeCun:
revolution.
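The attention circuit being described reduces to a small calculation: compare a query against a set of keys, turn the scores into weights, and average the values. Below is a minimal pure-Python sketch of scaled dot-product attention for a single query, the core operation of the transformer; it is an illustrative toy, not anyone's production code.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector:
    score each key against the query, softmax the scores into
    weights, and return the weighted average of the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                          # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key more closely, so the output
# leans toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Stacking many of these circuits, with learned projections for queries, keys, and values, is essentially what "you don't need much else" refers to.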
Yann LeCun:
And then after that came
BERT, which also came out of
Yann LeCun:
Google, which was this idea of
using self-supervised learning,
Yann LeCun:
where I take a sequence of
words, corrupt it by removing
Yann LeCun:
some of the words, and then
train this
big neural net to reconstruct
Yann LeCun:
the words that are missing.
Yann LeCun:
Predict the words that are
missing.
Yann LeCun:
And again, people were
Yann LeCun:
amazed by like how how good the
Yann LeCun:
representations learned by the
Yann LeCun:
system were for all kinds of NLP
Yann LeCun:
tasks.
Yann LeCun:
And that really, uh, you know,
kind of captured the imagination
Yann LeCun:
of a lot of people.
Yann LeCun:
And then after that, the
next revolution was, oh,
Yann LeCun:
actually, the best thing to do
is you remove the encoder, you
Yann LeCun:
just use a decoder.
Yann LeCun:
You just train a system: you
feed it a sequence and train it
Yann LeCun:
to reproduce the input sequence
on its output.
Yann LeCun:
Because the architecture of the
decoder is strictly causal, a
Yann LeCun:
particular output is not
connected to the corresponding
Yann LeCun:
input, only to the ones to the
left of it.
Yann LeCun:
Implicitly, you're training the
Yann LeCun:
system to predict the next word
Yann LeCun:
that comes after a sequence of
Yann LeCun:
words.
Yann LeCun:
That's the GPT architecture that
Yann LeCun:
was, you know, promoted by
Yann LeCun:
OpenAI.
Yann LeCun:
And that turned out to be
more scalable than BERT.
Yann LeCun:
In the sense that you can
Yann LeCun:
train gigantic networks on
Yann LeCun:
enormous amounts of data, and
Yann LeCun:
you get these emergent
properties.
Yann LeCun:
And that's what gave us LLMs.
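The decoder-only setup just described can be illustrated by the training pairs it implicitly creates: since each output position only sees the tokens to its left, training the model to reproduce its input amounts to next-word prediction. A toy sketch of that idea, not any lab's actual code:

```python
def next_word_pairs(tokens):
    """GPT-style implicit supervision: each output position can only
    see the tokens to its left, so reproducing the sequence means
    predicting each token from the prefix before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs(["the", "cat", "sat", "down"])
# pairs[0] asks the model to predict "cat" from ["the"], and so on.
```

One pass over a sequence of length n thus yields n - 1 prediction problems, which is part of why this recipe scales so well with data.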
Tom Mitchell:
So that brings us up to today
with Transformers.
Tom Mitchell:
And you can see this
strange, wandering path of
Tom Mitchell:
progress and exploration over
decades.
Tom Mitchell:
So before we leave, let's just
Tom Mitchell:
take a look at that history and
Tom Mitchell:
treat it as a case study of how
Tom Mitchell:
scientific progress was made in
Tom Mitchell:
this field.
Tom Mitchell:
What are the main themes we see?
Tom Mitchell:
Well, I think the first one is
progress happens in waves.
Tom Mitchell:
It's paradigm after paradigm,
right?
Tom Mitchell:
First there were perceptrons,
Tom Mitchell:
but those got thrown away and
Tom Mitchell:
replaced by the learning of
Tom Mitchell:
symbolic representations,
Tom Mitchell:
eventually to be replaced by
Tom Mitchell:
neural nets, which were replaced
Tom Mitchell:
by probabilistic methods and so
Tom Mitchell:
forth.
Tom Mitchell:
So there's wave after wave of
paradigm.
Tom Mitchell:
Another theme is that a lot of
Tom Mitchell:
these ideas really came from
Tom Mitchell:
other fields.
Tom Mitchell:
Even the very notion of
Tom Mitchell:
perceptrons came from somebody
Tom Mitchell:
who was fundamentally a
Tom Mitchell:
neuroscientist interested in how
Tom Mitchell:
neurons in the brain could even
Tom Mitchell:
learn stuff.
Tom Mitchell:
PAC learning.
Tom Mitchell:
You heard Les Valiant talk.
Tom Mitchell:
He's very much a
Tom Mitchell:
computational complexity
Tom Mitchell:
researcher who found that this
Tom Mitchell:
was an interesting theoretical
Tom Mitchell:
result.
Tom Mitchell:
Bayesian networks were heavily
Tom Mitchell:
influenced by statistics, and so
Tom Mitchell:
forth.
Tom Mitchell:
Many of these advances really
Tom Mitchell:
were new framings of the
Tom Mitchell:
problem.
Tom Mitchell:
So, uh, Winston's work on
Tom Mitchell:
symbolic learning was really a
Tom Mitchell:
reframing of what the problem
Tom Mitchell:
was.
Tom Mitchell:
The work on reinforcement
Tom Mitchell:
learning is really changing the
Tom Mitchell:
definition of what the training
Tom Mitchell:
signal even is for these
Tom Mitchell:
systems.
Tom Mitchell:
So that's another theme that you
see.
Tom Mitchell:
And finally, I think like a lot
Tom Mitchell:
of scientific fields, machine
Tom Mitchell:
learning is really a blend of
Tom Mitchell:
technical forces and social
Tom Mitchell:
forces.
Tom Mitchell:
Certainly in the long term,
Tom Mitchell:
the cold, hard facts of what
Tom Mitchell:
works best come out and those
Tom Mitchell:
methods win.
Tom Mitchell:
But in the shorter term, the
Tom Mitchell:
question of who works on what
Tom Mitchell:
kinds of problems is very much
Tom Mitchell:
influenced by the personalities
Tom Mitchell:
of people.
Tom Mitchell:
Their ability to persuade other
Tom Mitchell:
people to jump in and start
Tom Mitchell:
working with them on their
Tom Mitchell:
problems.
Tom Mitchell:
So these are some of the themes
you see.
Tom Mitchell:
And I think if you look around
at other fields, sometimes you
Tom Mitchell:
see similar themes.
Tom Mitchell:
Finally, what are the lessons
from all this for researchers?
Tom Mitchell:
I think the first lesson really
is question authority.
Tom Mitchell:
Because really, if you think
Tom Mitchell:
about the major advances, many
Tom Mitchell:
of those came from just, uh,
Tom Mitchell:
going against what was currently
Tom Mitchell:
the conventional wisdom in the
Tom Mitchell:
field.
Tom Mitchell:
Inventing a new framing or
Tom Mitchell:
taking a radically different
Tom Mitchell:
approach.
Tom Mitchell:
Another lesson: don't drag your
feet.
Tom Mitchell:
I've seen decade after decade,
new paradigms emerge in the
Tom Mitchell:
field, and every single time
that happens, existing
Tom Mitchell:
researchers take longer than
they need to to recognize the
Tom Mitchell:
benefits of the new paradigm.
Tom Mitchell:
And the most guilty people are
the senior researchers.
Tom Mitchell:
You can probably explain that by
Tom Mitchell:
taking into account who has the
Tom Mitchell:
most to lose if there's a new
Tom Mitchell:
paradigm replacing the current
Tom Mitchell:
approach.
Tom Mitchell:
Another lesson: learn to
Tom Mitchell:
communicate, and learn to follow
Tom Mitchell:
through.
Tom Mitchell:
You heard Geoff Hinton talking
Tom Mitchell:
about the development of
Tom Mitchell:
backpropagation in the
mid-eighties.
Tom Mitchell:
You heard him say we didn't
invent backpropagation, but we
Tom Mitchell:
showed that it was important.
Tom Mitchell:
And actually, to be fair, they
Tom Mitchell:
thought they were inventing
Tom Mitchell:
backpropagation.
Tom Mitchell:
They actually reinvented
Tom Mitchell:
it, but they had no idea that
Tom Mitchell:
somebody had invented it before,
Tom Mitchell:
because whoever did that didn't
Tom Mitchell:
succeed in waking up the
Tom Mitchell:
research community to the fact
Tom Mitchell:
that they had a really good
Tom Mitchell:
idea.
Tom Mitchell:
I don't know why.
Tom Mitchell:
Maybe they didn't put in the
Tom Mitchell:
effort or succeed in
Tom Mitchell:
communicating.
Tom Mitchell:
Maybe they dropped it after they
Tom Mitchell:
did it and went some other
Tom Mitchell:
direction so that they didn't
Tom Mitchell:
follow through to provide the
Tom Mitchell:
evidence.
Tom Mitchell:
But that kind of thing happens
Tom Mitchell:
frequently.
Tom Mitchell:
Successful researchers are good
Tom Mitchell:
communicators, and they follow
Tom Mitchell:
through to push the field to pay
Tom Mitchell:
attention.
Tom Mitchell:
The final lesson, I think, is
Tom Mitchell:
the philosophers were actually
Tom Mitchell:
right.
Tom Mitchell:
Today, despite these amazing
capabilities of our learning
Tom Mitchell:
systems, we don't have a proof,
or anything like a rational
Tom Mitchell:
justification, of why you can
generalize from examples to
Tom Mitchell:
general rules that work well.
Tom Mitchell:
We don't really understand at
this very fundamental level why.
Tom Mitchell:
And I think that if we did pay
more attention to that question,
Tom Mitchell:
we might have a better chance to
develop algorithms that
Tom Mitchell:
outperform what we have today.
Tom Mitchell:
So I'll stop there.
Tom Mitchell:
Thank you very much.
Speaker 12:
Tom Mitchell is the University
Speaker 12:
Founders Professor at Carnegie
Speaker 12:
Mellon University.
Speaker 12:
Machine Learning: How Did We Get
Here?
Speaker 12:
is produced by the Stanford
Digital Economy Lab.
Speaker 12:
If you enjoyed this episode,
Speaker 12:
subscribe wherever you listen to
Speaker 12:
podcasts.