Tom Mitchell literally wrote the book on machine learning. In this series of candid conversations with his fellow pioneers, Tom traces the history of the field through the people who built it. Behind the tech are stories of passion, curiosity, and humanity.
Tom Mitchell is the University Founders Professor at Carnegie Mellon University, a Digital Fellow at the Stanford Digital Economy Lab, and the author of Machine Learning, a foundational textbook on the subject. This podcast is produced by the Stanford Digital Economy Lab.
Tom Mitchell:
Welcome to Machine Learning: How Did We Get Here? I'm Tom Mitchell, your podcast host. Now, many people ask, how did we get to this point where today we have these amazing AI systems? I have a one-sentence answer to that question: we tried for fifty years to write intelligent programs by hand, but we discovered about a decade ago that it was actually much easier and much more successful to use machine learning methods to instead train them to become intelligent.

So the real question is, how did machine learning get here? What were the successes along the way, and the failures? Who were the people involved? What were they thinking? What even made them want to get into this field in the first place?

This first episode will set the stage for the podcast. It is a recording of a lecture I gave this month, February twenty twenty-six, at Carnegie Mellon University, and it attempts to cover in one hour the seventy-five-year history of the field of machine learning. Most of the rest of the episodes in the podcast are interviews with various pioneers in the field who made very significant contributions along the way.

Before we start, I want to thank Carnegie Mellon University and also the Stanford University Digital Economy Lab for supporting the podcast. And I want to thank Maddie Smith, our podcast producer. I hope you enjoy the podcast.
Tom Mitchell:
If we're going to talk about machine learning, it's only fair to start with the first people who talked about how on earth learning is possible at all, which were the philosophers. As early as Aristotle, people were asking how it is that we can look at examples of things and learn their general essence, in his words. About a century later, there was a school of philosophers called the Pyrrhonists who really zeroed in on the problem of induction and how it can be justified. When we say induction, what we really mean is the process of coming up with a general rule from looking at specific examples. And so they asked questions like: if all of the swans we've seen so far in our life are white, should we conclude that all swans are white? What would be the justification for that? Maybe there's a black swan out there that we haven't seen.

That debate went on for some time. Around thirteen hundred, William of Ockham suggested something that we now call Occam's razor, the policy that we should prefer the simplest hypothesis. So, indeed, if all the swans we've seen so far are white, then the simplest hypothesis is that all swans are white. That was his prescription. Later on, around sixteen hundred, Francis Bacon brought up the importance of data collection, of actively experimenting to collect data that could falsify hypotheses that weren't correct. And then in the seventeen hundreds, the philosopher David Hume really nailed the problem of induction. He argued very persuasively that it's impossible to generalize from examples unless you make some additional assumption. And he pointed out that even the assumption that the future will be like the past is itself not provable; it is just a guess that we use. So his point was that people do induction, but it's a habit. It's not a justified, rational, provably correct process.
Tom Mitchell:
So the philosophers had plenty to say by the nineteen forties, when computers became available. Alan Turing, who's often called the father of computing, suggested that maybe computers could learn. He said: instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates a child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain. So he had the idea that maybe computers could learn, but he did not have an algorithm by which they would learn. That waited until the nineteen fifties, when there were two important seminal events.
Tom Mitchell:
One was a computer program written by an IBM researcher named Art Samuel, and his program learned to play checkers. I'll just read you a couple of sentences from the abstract of his paper. He said: two machine-learning procedures have been investigated in some detail using the game of checkers. Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program. And then he went on to point out: the principles of machine learning verified by these experiments are, of course, applicable to many other situations.

So he had really one of maybe the first demonstrations of a program that learned to do something interesting, and he understood that the techniques he was using were very general. Now, how did he get the computer to learn to play checkers? His program learned an evaluation function that would assign a numerical score to any checkers position, and that score would be higher the better the checkers position was from your point of view as you're playing the game. That evaluation function would then be used to control a look-ahead search for which move to take. The evaluation function was a linear weighted combination of board features that he made up, things like how many checkers on the board are mine, how many are yours, and so forth. So his program learned, and what it learned was that evaluation function. How did it learn it? By playing games against itself. And he points out that in eight to ten hours, it could learn well enough to beat him.
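The evaluation-function idea can be sketched in a few lines of Python. This is an illustrative reconstruction, not Samuel's actual program: the feature names are made up, and the update rule shown, nudging the direct evaluation toward a look-ahead score obtained during self-play, is a simplified stand-in for his training procedure.

```python
# Illustrative sketch (not Samuel's actual code): a linear evaluation
# function over hand-crafted board features, adjusted during self-play so
# that a position's direct score moves toward its look-ahead score.

FEATURES = ["my_pieces", "your_pieces", "my_kings", "your_kings"]  # made up

def evaluate(weights, features):
    """Score a position as a linear weighted combination of its features."""
    return sum(weights[f] * features[f] for f in FEATURES)

def update(weights, features, lookahead_score, lr=0.005):
    """Move the direct evaluation toward the better-informed look-ahead
    score, in the spirit of Samuel's self-play training."""
    error = lookahead_score - evaluate(weights, features)
    for f in FEATURES:
        weights[f] += lr * error * features[f]
    return weights

weights = {f: 0.0 for f in FEATURES}
position = {"my_pieces": 8, "your_pieces": 6, "my_kings": 1, "your_kings": 0}
for _ in range(200):
    weights = update(weights, position, lookahead_score=1.0)
print(round(evaluate(weights, position), 2))  # direct score now near 1.0
```

The point of the sketch is only that everything learnable lives in the weights; the features themselves were fixed by hand.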
Tom Mitchell:
Those ideas persisted through the decades. They became reused over and over, including in the computer programs that finally beat the world chess champion, the world backgammon champion, and the world Go champion. So those ideas were really seminal.
Tom Mitchell:
A second thing that happened in the fifties was the invention of the first early version of neural networks by Frank Rosenblatt from Cornell. He was interested in neuroscience: how can neurons in the brain be used to learn? And he ended up building a simple, at least by today's standards, neural network that consisted of one layer of neurons. There would be a receptive field input, say an image, and then the neurons would respond to that and produce an output set of neuron firings. What got learned in that case were the connection strengths between the input to the neuron and the probability that it would fire. And the way he trained it was what we now call supervised learning: you show an input and what the output should be. And he had schemes for updating those weights to fit the data.
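That supervised update scheme, in the simplified perceptron form usually taught today, might look like this. The dataset (logical AND, which is linearly separable) and the learning rate are illustrative choices, not taken from Rosenblatt's work.

```python
# A minimal perceptron-style supervised learning sketch: show an input with
# its desired output, and nudge each connection weight whenever the unit's
# prediction is wrong. The tiny AND dataset here is illustrative.

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train(examples, lr=0.1, epochs=20):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND)
print([predict(weights, bias, x) for x, _ in AND])  # [0, 0, 0, 1]
```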
Tom Mitchell:
Now, the importance of this work is that it catalyzed a whole bunch of work in the nineteen sixties, for the next decade, looking at different algorithms for tuning the weights of perceptron-style systems.
Tom Mitchell:
That work proceeded for a decade or so, and at the end of the nineteen sixties, two MIT scientists, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons. Unfortunately, that book proved that a single-layer perceptron, which was the only thing we knew how to train at that point, could never represent many of the functions we wanted to learn. It could only represent linearly separable functions, not even exclusive-or, where the output should be one if the first input is a one and the second is a zero, or vice versa, but zero if both inputs are one. You can't represent even that simple function with a perceptron, no matter how you train it.
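That claim about exclusive-or is easy to check directly: a brute-force scan over candidate weights and thresholds finds no single threshold unit that reproduces the XOR truth table. The grid of candidate values below is an arbitrary illustration; the underlying impossibility holds for all real-valued weights.

```python
# Brute-force check of Minsky and Papert's point: no single threshold unit
# (a perceptron) computes XOR, because XOR is not linearly separable.

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def unit(w1, w2, b, x1, x2):
    """One linear threshold unit."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def some_unit_computes_xor(grid):
    """Does ANY weight/bias setting from the grid match the XOR table?"""
    return any(
        all(unit(w1, w2, b, x1, x2) == y for (x1, x2), y in XOR.items())
        for w1 in grid for w2 in grid for b in grid
    )

grid = [i / 4 for i in range(-20, 21)]  # candidate values in [-5, 5]
print(some_unit_computes_xor(grid))    # False: no setting works
```

By contrast, a single unit handles OR easily (for example, weights one and one with bias minus one half), which is exactly the linearly separable case.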
Tom Mitchell:
So this really put the kibosh on work on perceptrons following the publication of this book.
Tom Mitchell:
Now, if we're not able, or don't want to spend our time figuring out how to learn perceptrons, then what's next? Well, it turned out one of Minsky's PhD students, Patrick Winston, published his thesis the next year, and Winston suggested that instead of learning perceptron-type representations of information, we should learn symbolic descriptions. In his thesis, he showed how his program could learn descriptions of different physical structures like an arch or a tower. He would train the program by showing it line drawings of positive and negative examples of, in this case, arches. The program would then process those incrementally arriving examples to produce a symbolic description of the different parts and the relations among them. For example, an arch could be two rectangles which don't touch each other, but which jointly support a roof of any shape.
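The flavor of such a symbolic description can be sketched as a set of required relations, where a new example matches the concept if it exhibits all of them. The part and relation names below are illustrative, not Winston's actual notation, and his real system did considerably more (notably learning from near misses).

```python
# Illustrative sketch of a learned symbolic concept: a set of required
# parts and relations. An example is an instance if it satisfies them all.

ARCH = {
    ("support", "left_post", "roof"),
    ("support", "right_post", "roof"),
    ("does_not_touch", "left_post", "right_post"),
}

def matches(description, example_relations):
    """True if the example exhibits every required relation."""
    return description <= example_relations

arch_example = ARCH | {("shape", "roof", "triangle")}  # extra detail is fine
near_miss = {                                          # posts touch: not an arch
    ("support", "left_post", "roof"),
    ("support", "right_post", "roof"),
    ("touches", "left_post", "right_post"),
}
print(matches(ARCH, arch_example), matches(ARCH, near_miss))  # True False
```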
Tom Mitchell:
So this was an important step because it shifted the focus onto learning a much richer kind of representation, symbolic descriptions. And this became the new paradigm which dominated the nineteen seventies.
Tom Mitchell:
During the seventies, there were a number of people working on learning symbolic descriptions. My favorite is the Meta-DENDRAL program, developed by Bruce Buchanan at Stanford. This program, again, was a symbolic learning program. What it learned was rules that would predict how molecules would shatter inside a mass spectrometer, and therefore predict what the mass spectrum of a new molecule would be. Those rules symbolically described a subgraph of atoms within the molecular graph, and they would say: if you find this subgraph, then specific bonds in that subgraph are likely to fragment when you put the molecule in a mass spectrometer. This was an important step forward. I asked Bruce Buchanan how well it worked. What was this program actually able to do?
Bruce Buchanan:
Well, for one small class of steroid molecules, the ketoandrostanes and estranes, if you will, we had fewer than a dozen spectra, and we were able to tease out the rules that determine how a new ketoandrostane would fragment in a mass spectrometer. We were able to publish that set of rules in a refereed chemistry journal. And it was, to our knowledge, the first time that the result of a machine learning program, symbolic learning, had been published in a refereed journal.
Tom Mitchell:
So that was an important milestone for machine learning, really the first time that a program discovered knowledge useful enough to get published in that domain. Now, on a personal note, I was a PhD student at Stanford at the time, and Bruce became my PhD advisor, so my PhD thesis was also built around this same data set. For my thesis I developed a system called version spaces, which was the first symbolic learning algorithm where you could prove that it would converge, and furthermore, that the learner would know when it had converged, so it would know it was done. It did that by maintaining not just one hypothesis that it would modify, but by keeping track of every hypothesis consistent with the data it had seen. And this also opened up the possibility of what we call today active learning. It made it easy for the system to play twenty questions with the teacher: it could ask the teacher, please label this example, so that it could reduce the set of hypotheses as quickly as possible.
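The version-space idea can be illustrated on a toy hypothesis space small enough to enumerate outright. This is a sketch of the principle only (track every hypothesis consistent with the data, and detect convergence when exactly one remains), not the candidate-elimination algorithm from the thesis; the attributes and values are made up.

```python
from itertools import product

# Toy version space: each hypothesis is a pair of constraints over two
# attributes, where "?" means "any value". Attribute names are illustrative.

SIZES, COLORS = ["small", "large"], ["red", "blue"]
HYPOTHESES = list(product(SIZES + ["?"], COLORS + ["?"]))

def predicts(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def version_space(examples):
    """All hypotheses consistent with every labeled example seen so far."""
    return [h for h in HYPOTHESES
            if all(predicts(h, x) == label for x, label in examples)]

examples = [(("large", "red"), True), (("small", "red"), False)]
vs = version_space(examples)
print(vs)  # two hypotheses still consistent
# Convergence is detectable: the learner is done when one hypothesis remains.
examples.append((("large", "blue"), False))
print(version_space(examples))  # a single hypothesis: converged
```

An active learner would pick its next query to split the remaining version space as evenly as possible, which is what makes the twenty-questions behavior natural in this framework.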
Tom Mitchell:
So by the end of the seventies, there seemed to be enough work going on in the field that it was time to hold a meeting. The first workshop on machine learning was held here at CMU, in Wean Hall, a couple of buildings in that direction. It was organized by Jaime Carbonell, who was an assistant professor here at the time; Richard Michalski, who was a more senior professor at Illinois; and myself, at the time an assistant professor at Rutgers University. And so we held this meeting and pulled together some people. One of the people who attended was a student of Richard Michalski's named Tom Dietterich, and Tom went on to make many contributions in the field of machine learning. So I asked Tom, what was the field like in nineteen eighty?
Tom Dietterich:
I'd say it was really chaotic. You know, I attended that very first machine learning workshop; I think you were one of the core organizers at CMU. There were probably thirty people in the room, and probably thirty completely different talks. I remember I was talking about a sort of algorithm-comparison paper that I had published at IJCAI seventy-nine, I think, just before that workshop, in which I was executing by hand these very simple algorithms for a kind of subgraph-learning problem and comparing how many subgraph-isomorphism calculations they had to do. It was like the first attempt to actually compare multiple machine learning algorithms that were more or less trying to do the same thing; there were a couple of them there. And, you know, I think John Anderson was there talking about cognitive models. You were there talking about the beginnings of EBL and the LEX system for calculus, symbolic integration. The most interesting talk, I thought, was Ross Quinlan's talk on ID3, where he was trying to take these reverse-enumerated chess endgames and learn decision trees that would completely, exactly, losslessly compress those giant tables into a small decision tree.

A really important thing people should understand about those days is that we believed there was a right answer for our machine learning problems. It would often happen that I would run the algorithms and they would not get the right answer, not the logical expression that we thought was the right answer. They would get something that was actually equally accurate on the training data. And it actually worked pretty well, although we didn't really have the idea of a separate test set in those days. It was not a field of statistics. We were coming out of, really, the John McCarthy program of programs with common sense, which didn't have a lot to do with common sense, but was about: we're going to represent everything in logic, and we're going to use logical inference as the execution engine.
Tom Mitchell:
So there's Tom's take on what things were like. He mentioned that he thought the most interesting talk was Ross Quinlan's. I agree; I thought that was the most interesting talk. Ross's talk presented the idea that we should learn decision trees. A decision tree is something where you classify an example by putting it at the root of the tree and then sorting it down to a leaf based on its features, and the leaf tells you what the output classification label should be.
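Classification by a decision tree, as just described, is a short loop: test one feature at each internal node, follow the matching branch, and read the label off at the leaf. The tree below is hand-written for illustration, not one learned by ID3.

```python
# A minimal decision-tree classifier. Internal nodes are (feature, branches)
# tuples; a bare string is a leaf label. The tree itself is illustrative.

TREE = ("outlook", {
    "sunny": ("humidity", {
        "high": "no",
        "normal": "yes",
    }),
    "overcast": "yes",
    "rainy": ("windy", {
        "true": "no",
        "false": "yes",
    }),
})

def classify(tree, example):
    """Sort the example down from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

print(classify(TREE, {"outlook": "sunny", "humidity": "normal", "windy": "false"}))
# prints "yes"
```

What a learner like ID3 produces is exactly such a tree: it chooses which feature to test at each node so that the examples are split as informatively as possible.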
Tom Mitchell:
That tree is what gets learned. So I asked Ross how he came up with this idea.
JR Quinlan:
I had done a PhD under a psychologist, Earl Hunt, and part of his work involved decision trees, which I learned about, of course, as a student, but then put in the back of my mind for fifteen years or so. Then I was at Stanford on sabbatical at the same time as Donald Michie. Michie was teaching a course on learning, which I sat in on, and he had a challenge for the class: to work out a way of predicting a win in a very simple chess endgame, king and rook versus king and knight. I remembered Earl Hunt's work on decision trees, and I thought, well, maybe that would be the way to go. So I developed a thing called ID3, which was just a simple decision tree program. No pruning, just a straight decision tree. And that seemed to solve the problem pretty well, up to about ninety-five percent, and I got that up to one hundred the next year. The first real time I talked about this was at that conference you organized, the workshop in nineteen eighty at Carnegie Mellon in Pittsburgh. You, Richard, and Jaime all set up that workshop, and I gave a talk there on decision tree learning.
Tom Mitchell:
So there's Ross's story. He got the idea of decision trees from his thesis advisor many years earlier, but it turns out Ross was the one who came up with the algorithm that actually, successfully discovered useful decision trees. And that whole idea of decision tree learning became very important in the field; by twenty ten, it was probably one of the most commercially used approaches in machine learning.
Tom Mitchell:
So in the early eighties, there were various experiments like these, trying to build machine learning systems, but really no theory, no theory that could tell us, for example, how many examples we would have to present to a learner in order for it to reliably learn. That changed in nineteen eighty-four, when Les Valiant published a paper on what he calls probably approximately correct learning. It was really the first practical theory to tell us how many examples you would need. In particular, the number of examples you need depends on three things. First, the complexity of your hypothesis space: for example, if you're going to learn decision trees of depth two, that's a lot less complex than learning decision trees of depth twelve. Second, the error rate you're willing to tolerate in the final hypothesis: one percent error, five percent error. Third, the failure probability you're willing to put up with: if you do collect that many randomly provided training examples, there is still some probability that you'll fail. You can't guarantee that you won't fail, but you can reduce that probability.
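The classic PAC bound for a finite hypothesis space captures exactly these three dependencies. This is the standard textbook bound for a consistent learner, shown as a sketch rather than Valiant's original statement verbatim.

```python
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    """Classic PAC bound for a finite hypothesis space H and a consistent
    learner: m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so
    that, with probability at least 1 - delta, every hypothesis consistent
    with the training data has true error at most epsilon."""
    return math.ceil(
        (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon
    )

# A richer hypothesis space, a smaller tolerated error, or a smaller
# failure probability all drive the required number of examples up.
print(pac_sample_bound(2**10, 0.05, 0.05))  # shallow trees: fewer examples
print(pac_sample_bound(2**20, 0.05, 0.05))  # deeper trees: more examples
```

Notice that the dependence on the hypothesis space is only logarithmic, which is why even very large hypothesis spaces can be learnable from a reasonable number of examples.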
Tom Mitchell:
So this was a breakthrough in the area of theoretical characterization of algorithms. I asked Les what he thought was the key idea there.
Leslie Valiant:
It's a kind of a model of computation, but it makes sense because it's got some applications. The particular result which persuaded people that there was something there is this: take a conjunctive normal form formula. From NP-completeness, we already knew at the time that there's some hardness in it, because if someone gave you the formula, it was computationally difficult to find out whether it's equivalent to a formula which is always zero, which is never satisfiable. On the other hand, this conjunctive normal form formula with three variables in each clause was PAC-learnable. And so it was a bit striking that something which is very hard is learnable. This highlighted the difference between computing and learning, because with the learning model, the idea was that there was a distribution of inputs. You learned from this distribution, but you only have to be good on this distribution when you have to predict. So if, for example, in this formula there were some inputs which are very rare, then the learner wouldn't have to know about them. In this sense, this was easier than the NP-completeness.
Tom Mitchell:
I was actually quite surprised at that answer. What he's saying, put another way, is that for this one kind of hypothesis, conjunctive normal form, which is a kind of logical expression, if your hypotheses are of that form, then it's easier to learn them than it is to compute with them. When he says compute, what he means is the cost of answering the question: can you find a positive example of this formula? It was known at the time that the computational cost of answering that question was exponential in the size of the formula. And then he discovered that learning a formula, if somebody gives you positive and negative examples, takes only polynomial, rather than exponential, time.

So I agree with him that that's a fascinating theoretical fact, but that would not be the answer I would give about why this revolutionized the field of machine learning. It revolutionized the field, in my view, because he was really the first person to come up with a framing, a new framing of the machine learning problem, that even allowed this kind of theoretical analysis. In particular, his framing included assumptions like: the training data would come from some source that gives you random examples according to some probability distribution, and later, when you want to test your hypothesis on new data, you get more random examples from that same source. He reframed the problem in a way that made theory possible. The consequence was that he catalyzed a huge amount of theoretical work in machine learning, which continues to this day and just keeps branching further and further. There are now conferences specifically devoted to this kind of theoretical work.
Tom Mitchell:
So the eighties were really a very generative decade; there were a lot of things going on. Another thing going on was that some people were looking at human learning and how that might inspire our models of AI and machine learning. One such effort was here at CMU, by Allen Newell and his two PhD students, John Laird and Paul Rosenbloom. They built a system they called Soar, which was really one of the first AI agents designed to capture the full breadth of what humans do: play games, solve problems, many different tasks. So they framed their machine learning problem as one of getting a general agent to learn. Their architecture had very interesting properties that I think are relevant today, now that agents are again a topic of hot activity. I won't go into the details, but in the podcast there's an interview with John Laird, who goes into detail on this.
Tom Mitchell:
Another item that can't be overlooked in the eighties was the rebirth of neural networks. Remember, at the end of the sixties, Minsky and Papert published that book that killed off work on perceptrons. Well, in the mid eighties, people finally came up with an algorithm that could train not just one-layer perceptrons, but multilayer perceptrons, and that allowed learning functions that were highly non-linear. Dave Rumelhart, Jay McClelland, and Geoff Hinton were three of the ringleaders of this effort. So I asked Geoff about that period: now we're up to the mid eighties, when really neural nets are reborn. Is that the right word? How would you put it?
Geoffrey Hinton:
Backprop with backpropagation?
Geoffrey Hinton:
I mean, we didn't invent it.
Geoffrey Hinton:
It was invented by several different
Geoffrey Hinton:
groups, but we showed that it
Geoffrey Hinton:
really worked to learn
Geoffrey Hinton:
representations.
Geoffrey Hinton:
And as you know, sort of one of
the big problems in AI is how do
Geoffrey Hinton:
you learn new representations?
Geoffrey Hinton:
How do you avoid having to put
them all in by hand?
Geoffrey Hinton:
And my particular example,
Geoffrey Hinton:
which was the family trees
Geoffrey Hinton:
example, where you take all the
Geoffrey Hinton:
information in some family
Geoffrey Hinton:
trees, you convert it into
Geoffrey Hinton:
triples of symbols like John
Geoffrey Hinton:
has-father Mary.
Geoffrey Hinton:
And then you train a neural
Geoffrey Hinton:
net to predict the last term in
Geoffrey Hinton:
a triple.
Geoffrey Hinton:
Given the first two terms.
Geoffrey Hinton:
So it's just like the big
language models.
Geoffrey Hinton:
You're predicting the next word
given the context.
Geoffrey Hinton:
It's just much simpler.
Geoffrey Hinton:
I had one hundred and twelve
Geoffrey Hinton:
total examples, of which one
Geoffrey Hinton:
hundred and four were training
Geoffrey Hinton:
examples and eight were test
Geoffrey Hinton:
examples, which is a bit less
Geoffrey Hinton:
than the trillion examples they
Geoffrey Hinton:
have nowadays,
Geoffrey Hinton:
but it was the same idea.
Geoffrey Hinton:
You convert a symbol into a
feature vector.
Geoffrey Hinton:
You then have the feature
vectors of the context interact
Geoffrey Hinton:
via a hidden layer.
Geoffrey Hinton:
They then predict the features
Geoffrey Hinton:
of the next symbol, and from
Geoffrey Hinton:
those features you guess what
Geoffrey Hinton:
the next symbol should be, and
Geoffrey Hinton:
you try and maximize the
Geoffrey Hinton:
probability of predicting the
Geoffrey Hinton:
next symbol.
Geoffrey Hinton:
And you then backpropagate
Geoffrey Hinton:
through the feature interactions
Geoffrey Hinton:
and through the process of
Geoffrey Hinton:
converting a symbol into
Geoffrey Hinton:
features.
Geoffrey Hinton:
And that way you learn
feature vectors to represent the
Geoffrey Hinton:
symbols and how these vectors
should interact to predict the
Geoffrey Hinton:
features of the next symbol.
Geoffrey Hinton:
And that's what these big
language models do.
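What Geoff describes maps almost line for line onto code. Below is a minimal sketch in Python with NumPy, using a made-up toy family and tiny dimensions rather than the original 1986 setup: each symbol gets a learned feature vector, the two context vectors interact through a hidden layer, and backpropagation adjusts both the weights and the embeddings themselves to predict the triple's third term.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up toy family, stored as (person, relation, answer) triples.
triples = [
    ("colin", "father", "james"),
    ("colin", "mother", "victoria"),
    ("charlotte", "father", "james"),
    ("charlotte", "mother", "victoria"),
]
vocab = sorted({s for t in triples for s in t})
idx = {s: i for i, s in enumerate(vocab)}
V, D, H = len(vocab), 4, 8            # vocab size, embedding dim, hidden dim

E = rng.normal(0, 0.5, (V, D))        # one learned feature vector per symbol
W1 = rng.normal(0, 0.5, (2 * D, H))   # context features -> hidden layer
W2 = rng.normal(0, 0.5, (H, V))       # hidden layer -> next-symbol logits

def forward(a, b):
    x = np.concatenate([E[idx[a]], E[idx[b]]])
    h = np.tanh(x @ W1)
    logits = h @ W2
    p = np.exp(logits - logits.max())
    return x, h, p / p.sum()          # softmax over the vocabulary

def total_loss():
    return -sum(np.log(forward(a, r)[2][idx[t]]) for a, r, t in triples)

loss_before = total_loss()
lr = 0.1
for _ in range(2000):                 # plain SGD with hand-written backprop
    for a, r, t in triples:
        x, h, p = forward(a, r)
        dlogits = p.copy()
        dlogits[idx[t]] -= 1.0        # gradient of softmax + cross-entropy
        dh = (W2 @ dlogits) * (1 - h ** 2)
        dx = W1 @ dh
        W2 -= lr * np.outer(h, dlogits)
        W1 -= lr * np.outer(x, dh)
        E[idx[a]] -= lr * dx[:D]      # the feature vectors are learned too
        E[idx[r]] -= lr * dx[D:]

loss_after = total_loss()
print(loss_before, loss_after)        # the training loss drops sharply
```

Exactly as in the quote, the gradient flows back through the feature interactions and through the symbol-to-feature conversion, so the embeddings are learned rather than designed by hand.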
Tom Mitchell:
So there's Geoff in the mid
Tom Mitchell:
nineteen eighties, working on
Tom Mitchell:
backpropagation.
Tom Mitchell:
Another personal note in
Tom Mitchell:
nineteen eighty six, while this
Tom Mitchell:
was going on, I came to spend a
Tom Mitchell:
year at CMU as a visiting
Tom Mitchell:
professor.
Tom Mitchell:
And I got to meet Allen Newell
at the time.
Tom Mitchell:
And Allen said, hey, do you want
to team teach a course?
Tom Mitchell:
We'll teach a course on
Tom Mitchell:
architectures for intelligent
Tom Mitchell:
agents.
Tom Mitchell:
And of course I said yes.
Tom Mitchell:
The opportunity to teach with
Allen.
Tom Mitchell:
And he said, by the way, there
Tom Mitchell:
will be another, uh, an
Tom Mitchell:
assistant professor working with
Tom Mitchell:
us.
Tom Mitchell:
The three of us will team teach
it.
Tom Mitchell:
That's Geoff Hinton.
Tom Mitchell:
So Allen, Geoff and I team
Tom Mitchell:
taught in spring of nineteen
Tom Mitchell:
eighty six.
Tom Mitchell:
Uh, this course was one of the
best experiences of my career up
Tom Mitchell:
to that point.
Tom Mitchell:
And so it was a large part of
the reason why I ended up
Tom Mitchell:
staying at CMU.
Tom Mitchell:
But when I came, I was here
Tom Mitchell:
for about a year, and then Geoff
Tom Mitchell:
moved on.
Tom Mitchell:
He moved up to the University of
Toronto and started
Tom Mitchell:
building up a group there.
Tom Mitchell:
One of the people who joined his
group was a person named Yann
Tom Mitchell:
LeCun, who went on to win the
Turing Award jointly with Geoff
Tom Mitchell:
and Yoshua Bengio for their work
in neural networks.
Tom Mitchell:
So I asked Yann about this
period.
Yann LeCun:
And then, mid nineteen
Yann LeCun:
eighty seven, I moved to Toronto
Yann LeCun:
to do a postdoc with Geoff, and I
Yann LeCun:
completed this, the
Yann LeCun:
simulator.
Yann LeCun:
Geoff thought I was not doing
Yann LeCun:
anything because I was just
Yann LeCun:
basically hacking, you know, all
Yann LeCun:
the time,
Yann LeCun:
and this, this
system was kind of
Yann LeCun:
interesting because we had to
build a front end language to
Yann LeCun:
interact with it.
Yann LeCun:
And that language was the Lisp
Yann LeCun:
interpreter that Leon and I
Yann LeCun:
wrote.
Yann LeCun:
And so we're using Lisp, even
though as a front end to kind of
Yann LeCun:
a neural net simulator.
Yann LeCun:
And I, you know, implemented
Yann LeCun:
weight-sharing abilities
Yann LeCun:
and all that stuff and started
Yann LeCun:
experimenting with what became
Yann LeCun:
convolutional nets.
Yann LeCun:
You know, when I was a postdoc
in Toronto, early nineteen
Yann LeCun:
eighty eight, roughly, and
started to get really good
Yann LeCun:
results on, you know, very
simple shape recognition, like,
Yann LeCun:
handwritten characters
that I had drawn with my mouse or
Yann LeCun:
something like that.
Yann LeCun:
Right.
Tom Mitchell:
So, as you just heard, Yann was
Tom Mitchell:
experimenting with can we apply
Tom Mitchell:
neural networks to the problem
Tom Mitchell:
of character recognition,
Tom Mitchell:
written characters.
Tom Mitchell:
People were experimenting with
many different uses of neural
Tom Mitchell:
nets at the time.
Tom Mitchell:
My favorite, the one I would
vote the application of the decade,
Tom Mitchell:
was done, surprisingly, in the
Tom Mitchell:
area of self-driving
cars.
Tom Mitchell:
There was a PhD student here at
CMU named Dean Pomerleau.
Tom Mitchell:
He trained a neural network
Tom Mitchell:
where the input was an image
Tom Mitchell:
taken by a camera looking out
Tom Mitchell:
the front windshield of a
Tom Mitchell:
vehicle.
Tom Mitchell:
And the output of the neural
Tom Mitchell:
network was the steering command
Tom Mitchell:
telling the car which direction
Tom Mitchell:
to steer.
Tom Mitchell:
So I asked Dean about that work.
Tom Mitchell:
How much training data did you
have?
Dean Pommerleau:
So the interesting thing was, to
Dean Pommerleau:
begin with, it was all batch
Dean Pommerleau:
training.
Dean Pommerleau:
So I'd drive, I'd have a person
drive the vehicle along Schenley
Dean Pommerleau:
Park, uh, Flagstaff Hill Path,
and then I would go off and
Dean Pommerleau:
crunch it overnight.
Dean Pommerleau:
But in the end, what we were
Dean Pommerleau:
able to do is, uh, real time
Dean Pommerleau:
learning.
Dean Pommerleau:
So one drive up the hill with a
Dean Pommerleau:
human behind the wheel steering
Dean Pommerleau:
and the neural network, learning
Dean Pommerleau:
to pair camera
Dean Pommerleau:
images with the steering command
Dean Pommerleau:
that the human was giving was
Dean Pommerleau:
able to, uh, train it in about
Dean Pommerleau:
five minutes to, uh, take over
Dean Pommerleau:
and steer on its own from there
Dean Pommerleau:
on, on that road and on similar
Dean Pommerleau:
roads.
Dean Pommerleau:
So it was one of the first real
time, real world vision
Dean Pommerleau:
applications of, uh, of
artificial neural networks going
Dean Pommerleau:
beyond just Flagstaff Hill, you
know, the little paths on there.
Dean Pommerleau:
And we went out on, on real
roads first through the golf
Dean Pommerleau:
course, Schenley Golf Course, on
the, uh, on the road there.
Dean Pommerleau:
And then we, we went on, you
know, the local highways, in
Dean Pommerleau:
fact, the longest as part of my
PhD, the longest trip we did
Dean Pommerleau:
was, I think, about one hundred
miles at the time from basically
Dean Pommerleau:
up, uh, I-79 from Pittsburgh all
the way up to Erie.
Dean Pommerleau:
Uh, and it drove basically the,
the whole way.
Dean Pommerleau:
And it was getting up to
fifty five miles per hour after
Dean Pommerleau:
we got a faster vehicle.
Tom Mitchell:
It turns out he didn't ask for
permission.
Tom Mitchell:
So so this was all happening in
the nineteen eighties.
Tom Mitchell:
Really, it was a decade of
Tom Mitchell:
amazing invention and innovation
Tom Mitchell:
and exploration.
Tom Mitchell:
Another important thing that
Tom Mitchell:
happened in that decade was the
Tom Mitchell:
development of reinforcement
Tom Mitchell:
learning.
Tom Mitchell:
The way to understand that is to
first realize that supervised
Tom Mitchell:
learning was the kind of
standard way of framing the
Tom Mitchell:
machine learning question.
Tom Mitchell:
When Dean talked about training
Tom Mitchell:
his system, he would input an
Tom Mitchell:
image.
Tom Mitchell:
He had people drive the car, so
he got a lot of training
Tom Mitchell:
examples of the form.
Tom Mitchell:
Here's the image and here's the
correct steering command.
Tom Mitchell:
So he could tell the neural
network: for this input,
Tom Mitchell:
here's the correct output.
Tom Mitchell:
That's called supervised
learning.
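The supervised framing Tom describes can be sketched in a few lines. The data below is synthetic: random "images" and a fabricated steering signal standing in for Dean's camera frames and the human driver's commands; only the shape of the problem is the same. The model is a simple least-squares linear fit rather than ALVINN's actual network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: each "image" is a flattened vector of 32 pixels, and
# the steering angle is a hidden linear function of the pixels plus noise
# (playing the role of the human driver's command).
n_pixels, n_frames = 32, 500
true_w = rng.normal(size=n_pixels)          # hypothetical "road direction" signal
images = rng.normal(size=(n_frames, n_pixels))
steering = images @ true_w + rng.normal(scale=0.1, size=n_frames)

# Supervised learning: fit a map from input to correct output over the pairs.
w, *_ = np.linalg.lstsq(images, steering, rcond=None)

# The fitted model predicts the steering command for an unseen image.
test_img = rng.normal(size=n_pixels)
error = abs(test_img @ w - test_img @ true_w)
print(error)                                 # close to zero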
Tom Mitchell:
But reinforcement learning
reframes the problem.
Tom Mitchell:
It takes into account that
sometimes we don't know what the
Tom Mitchell:
right output is.
Tom Mitchell:
For example, if you're learning
to play chess, you might not
Tom Mitchell:
have a person who tells you at
every step given this board
Tom Mitchell:
position, here's the right move.
Tom Mitchell:
Instead, you might have to wait
until the end of the game after
Tom Mitchell:
you've made many moves to get
the feedback signal that says
Tom Mitchell:
you lost or you won, and then
you have to figure out what to
Tom Mitchell:
do about that because you
actually took many moves.
Tom Mitchell:
So that's what reinforcement
learning is about.
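The delayed-feedback framing can be made concrete with a toy problem, assuming a five-state corridor rather than chess: the only reward in the whole task arrives at the far end, long after the early moves that earned it, and a standard tabular Q-learning update (a textbook method, not anything specific to Sutton and Barto's own code) propagates that signal backwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# A five-state corridor: start at state 0, the goal is state 4, and the
# only reward arrives on reaching the goal.
n_states = 5
Q = np.zeros((n_states, 2))          # action 0 = step left, action 1 = step right
alpha, gamma, eps = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for _ in range(500):                 # episodes
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy action choice: mostly exploit, sometimes explore.
        a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0    # delayed reward, goal only
        # Q-learning update: bootstrap from the best value of the next state,
        # which is how the end-of-episode reward creeps back to early moves.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)
print(policy[:4])                    # the agent heads right in every state
```

No example ever says "the right move in state 0 is right"; the agent works that out from the terminal reward alone, which is the distinction from supervised learning.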
Tom Mitchell:
And Rich Sutton and Andy Barto
were instrumental in kind of
Tom Mitchell:
framing that problem and, and
working on it.
Tom Mitchell:
They recently won the Turing
Award for this work.
Tom Mitchell:
So I asked Rich how
Tom Mitchell:
reinforcement learning fit into
Tom Mitchell:
the field.
Rich Sutton:
The field of machine learning
Rich Sutton:
has always been dominated
Rich Sutton:
by the more straightforward
Rich Sutton:
supervised approach.
Rich Sutton:
There was, as I
mentioned at the very beginning,
Rich Sutton:
the rewards and penalties were
very much a part of it.
Rich Sutton:
But then, as
Rich Sutton:
things became more clear and
Rich Sutton:
better defined, the
Rich Sutton:
focus of the learning
Rich Sutton:
problem became pattern
Rich Sutton:
recognition and supervised
Rich Sutton:
learning.
Rich Sutton:
And, this fellow, the
strange, uh, fellow Harry Klopf,
Rich Sutton:
recognized this more than
other people and
Rich Sutton:
wrote some reports and
ultimately a book, saying
Rich Sutton:
that something had been lost.
Rich Sutton:
And Andy Barto and I
picked up on his work and
Rich Sutton:
and eventually realized that he
was right, that something had
Rich Sutton:
been left out, and in some sense
it was obvious that something
Rich Sutton:
had been left out.
Rich Sutton:
From the point of view of
Rich Sutton:
psychology, where I'd been
Rich Sutton:
studying how animals learn and
Rich Sutton:
animals learn.
Rich Sutton:
Really in both ways, in both a
Rich Sutton:
supervised way and a
Rich Sutton:
reinforcement way.
Rich Sutton:
And so, we picked up on
that and made that into a well
Rich Sutton:
defined area.
Rich Sutton:
When was that?
Rich Sutton:
That would have been in the
eighties.
Tom Mitchell:
And then finally, you wrote a
book on it in ninety eight.
Tom Mitchell:
So then it became a clear, uh,
subfield of machine learning.
Rich Sutton:
Yeah.
Rich Sutton:
But the key thing is why.
The way I say it to
Rich Sutton:
myself is: why is
reinforcement learning,
Rich Sutton:
why is it powerful?
Rich Sutton:
Potentially powerful.
Rich Sutton:
It's powerful because it's
learning.
Rich Sutton:
It's really learning from
experience.
Rich Sutton:
Learning from the normal data
Rich Sutton:
that an animal or a person would
Rich Sutton:
get.
Rich Sutton:
And it doesn't require a
Rich Sutton:
prepared special data like you
Rich Sutton:
of course do in supervised
Rich Sutton:
learning.
Tom Mitchell:
So during the eighties, there
were a lot of other really
Tom Mitchell:
interesting things going on.
Tom Mitchell:
Uh, people experimenting with
Tom Mitchell:
the idea that maybe machines
Tom Mitchell:
should learn by simulating
Tom Mitchell:
evolution.
Tom Mitchell:
There was an entire set of
conferences on something called
Tom Mitchell:
genetic algorithms, genetic
programming, which had to do
Tom Mitchell:
with that sort of thing.
Tom Mitchell:
Uh, a cluster of work on
Tom Mitchell:
studying human learning and
Tom Mitchell:
other areas.
Tom Mitchell:
But we don't have time for all
of those.
Tom Mitchell:
Let's move on to the nineteen
Tom Mitchell:
nineties, when, again, there was
Tom Mitchell:
a, I would say, a sea change in
Tom Mitchell:
terms of the style of work that
Tom Mitchell:
went on.
Tom Mitchell:
The theme of the nineteen
nineties was really the
Tom Mitchell:
integration of statistical and
probabilistic methods into the
Tom Mitchell:
field of machine learning.
Tom Mitchell:
And a lot of that took the
Tom Mitchell:
grounded form of learning a new
Tom Mitchell:
kind of object, which people
Tom Mitchell:
called either graphical models
Tom Mitchell:
or Bayes nets.
Tom Mitchell:
But what got learned in that
Tom Mitchell:
case was, again, a network where
Tom Mitchell:
each node would represent a
Tom Mitchell:
variable.
Tom Mitchell:
For example, maybe you would be
interested in predicting whether
Tom Mitchell:
somebody has lung cancer.
Tom Mitchell:
You'd make that a variable and
maybe you'd have evidence like
Tom Mitchell:
are they a smoker?
Tom Mitchell:
Do they have a normal or
abnormal X-ray result?
Tom Mitchell:
You'd make those variables.
Tom Mitchell:
And then the edges in the graph
represent probabilistic
Tom Mitchell:
dependencies among the variables
in a way such that in the end,
Tom Mitchell:
the whole graph represents the
full joint probability
Tom Mitchell:
distribution over the entire
collection of variables.
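Tom's lung-cancer example can be written out directly. The probabilities below are invented for illustration; the point is that a small graph (Smoker to Cancer to Abnormal X-ray) plus local conditional tables defines the full joint distribution, and that joint supports inference by enumeration.

```python
from itertools import product

# Invented numbers for a three-variable net: Smoker -> Cancer -> AbnormalXray.
p_smoker = {True: 0.3, False: 0.7}
p_cancer = {True: {True: 0.05, False: 0.95},    # P(cancer | smoker)
            False: {True: 0.01, False: 0.99}}   # P(cancer | non-smoker)
p_xray = {True: {True: 0.9, False: 0.1},        # P(abnormal x-ray | cancer)
          False: {True: 0.2, False: 0.8}}       # P(abnormal x-ray | no cancer)

def joint(s, c, x):
    # The edges of the graph dictate this factorization of the full joint.
    return p_smoker[s] * p_cancer[s][c] * p_xray[c][x]

# The eight joint probabilities sum to one ...
total = sum(joint(s, c, x) for s, c, x in product([True, False], repeat=3))

# ... and support inference by enumeration: P(cancer | abnormal x-ray).
num = sum(joint(s, True, True) for s in [True, False])
den = sum(joint(s, c, True) for s, c in product([True, False], repeat=2))
posterior = num / den
print(total, posterior)   # seeing the abnormal x-ray raises belief in cancer
```

What the learning algorithms of the nineties figured out was how to fill in tables like these (and sometimes the graph structure itself) from data.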
Tom Mitchell:
So that's what got learned; how
it got learned
Tom Mitchell:
waited for some algorithms to be
discovered.
Tom Mitchell:
Judea Pearl came up with the
Tom Mitchell:
idea of how to represent these,
Tom Mitchell:
but one of the key people
Tom Mitchell:
involved in inventing the
Tom Mitchell:
learning algorithms was
Tom Mitchell:
Daphne Koller, a professor at
Tom Mitchell:
Stanford, one of the most
Tom Mitchell:
active researchers in terms of
Tom Mitchell:
designing algorithms for
Tom Mitchell:
learning these.
Tom Mitchell:
So I asked her, why do we need
graphical models?
Daphne Koller:
Graphical models, for me,
emerged by realizing that the
Daphne Koller:
problems that we needed to solve
to address most real world
Daphne Koller:
applications went beyond:
Daphne Koller:
you have a vector representation
Daphne Koller:
of an input and a single,
Daphne Koller:
oftentimes binary or at best
Daphne Koller:
continuous output.
Daphne Koller:
There was so much more
opportunity to think about
Daphne Koller:
richly structured environments,
richly structured problems.
Daphne Koller:
So even if you think about
problems like understanding what
Daphne Koller:
is in an image, that's not a
single label problem of there is
Daphne Koller:
a dog, because images are
complex and there's
Daphne Koller:
interrelationships between the
different objects you want it to
Daphne Koller:
get beyond the yes/no "is there
a dog in this image" to something
Daphne Koller:
that is much more rich.
Daphne Koller:
There's a dog and a Frisbee and
Daphne Koller:
a beach and three kids building
Daphne Koller:
a sandcastle.
Daphne Koller:
You have a rich input and a rich
output.
Daphne Koller:
Thinking about these richly
Daphne Koller:
structured domains gave rise to
Daphne Koller:
we have to think about multiple
Daphne Koller:
variables.
Daphne Koller:
We have to think about the
Daphne Koller:
interactions between those
Daphne Koller:
variables and leverage that
Daphne Koller:
structure both in our input and
Daphne Koller:
output space in order to get to
Daphne Koller:
much better conclusions and deal
Daphne Koller:
with problems that really
Daphne Koller:
matter.
Tom Mitchell:
So this work on training
Tom Mitchell:
graphical models was really part
Tom Mitchell:
of a bigger theme that decade,
Tom Mitchell:
which was just the integration
Tom Mitchell:
of statistical methods with what
Tom Mitchell:
had been pretty much statistics
Tom Mitchell:
free machine learning up to that
Tom Mitchell:
point.
Tom Mitchell:
Another person who was
Tom Mitchell:
instrumental in that was
Tom Mitchell:
Berkeley professor named Mike
Tom Mitchell:
Jordan.
Tom Mitchell:
I asked him about the
Tom Mitchell:
relationship between statistics
Tom Mitchell:
and machine learning.
Michael I. Jordan:
So anyway, by the time I
wanted to move to Berkeley, I
Michael I. Jordan:
was realizing that I was missing
the whole statistics community,
Michael I. Jordan:
that, uh, it was just separate
from machine learning, as maybe
Michael I. Jordan:
you kind of remember, there was
occasionally a little leakage,
Michael I. Jordan:
but it was way too separate.
Michael I. Jordan:
And and nowadays we're often
seeing, you know, people will
Michael I. Jordan:
run a machine learning method,
but then it's not calibrated.
Michael I. Jordan:
It, you know, has bias and
all that.
Michael I. Jordan:
And that's the thing
statisticians have talked about
Michael I. Jordan:
for a long, long time.
Michael I. Jordan:
And so nowadays I think it's a
given that, yeah, they're,
Michael I. Jordan:
they're kind of two parts, two
sides of the same coin.
Michael I. Jordan:
Machine learning is maybe a
little more engineering in order
Michael I. Jordan:
to build a system and make it do
great things in the world.
Michael I. Jordan:
And statistics is a little bit
more, well, let's be cautious.
Michael I. Jordan:
Let's say we're going to do like
clinical trials.
Michael I. Jordan:
Let's make sure that the the
answer is really trustable, but
Michael I. Jordan:
those are two sides of the same
coin, and I think that's
Michael I. Jordan:
probably pretty much clear now.
Michael I. Jordan:
But for a long time there was a
resistance.
Michael I. Jordan:
Everyone said this is a brand
new field, this is different.
Michael I. Jordan:
And I kept and again annoying
colleagues by saying, no, I
Michael I. Jordan:
don't believe it is.
Michael I. Jordan:
So anyway, long story short, it
is.
Tom Mitchell:
It is remarkable to me that
Tom Mitchell:
the field of machine learning
Tom Mitchell:
went through most of the
Tom Mitchell:
nineteen eighties, kind of
Tom Mitchell:
without even noticing that
Tom Mitchell:
statistics existed.
Michael I. Jordan:
I mean, people like Leo Breiman
Michael I. Jordan:
were around to help make the
Michael I. Jordan:
passage.
Michael I. Jordan:
So ensemble methods, they were
kind of invented by Leo in the stat
Michael I. Jordan:
literature, but they were
independently invented in the
Michael I. Jordan:
machine learning literature.
Michael I. Jordan:
And is that machine learning or
statistics?
Michael I. Jordan:
Well, clearly it's both and it
needs both perspectives.
Michael I. Jordan:
And yes, in the nineteen
nineties, the EM algorithm,
Michael I. Jordan:
you know, the graphical models,
Michael I. Jordan:
so yeah, the nineties, it was a
real flourishing of that.
Tom Mitchell:
So Mike mentioned that one of
the themes was ensemble methods.
Tom Mitchell:
So anyway, I think that's
Tom Mitchell:
actually a very nice example of
Tom Mitchell:
how machine learning theory and
Tom Mitchell:
statistical theory kind of
Tom Mitchell:
intertwined.
Tom Mitchell:
The idea of ensemble learning is
Tom Mitchell:
instead of learning one
Tom Mitchell:
hypothesis, let's learn multiple
Tom Mitchell:
ones.
Tom Mitchell:
For example, instead of learning
Tom Mitchell:
a decision tree, you might learn
Tom Mitchell:
a whole forest of decision
Tom Mitchell:
trees.
Tom Mitchell:
And then when it comes to
Tom Mitchell:
classifying a new example, you
Tom Mitchell:
give it to all of the
Tom Mitchell:
classifiers and you let them
Tom Mitchell:
vote and you take the vote of
Tom Mitchell:
the classifiers.
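The arithmetic behind voting is worth seeing once. The simulation below assumes idealized, fully independent classifiers that are each right 70 percent of the time; real ensembles like random forests get their diversity from resampling the training data, and their errors are only partly independent, so the gain in practice is smaller but still real.

```python
import random

random.seed(0)
n_trials, n_classifiers, p_correct = 10000, 21, 0.7

single = majority = 0
for _ in range(n_trials):
    # Each classifier is independently right with probability 0.7.
    votes = [random.random() < p_correct for _ in range(n_classifiers)]
    single += votes[0]                           # track one classifier alone
    majority += sum(votes) > n_classifiers // 2  # track the majority vote

single_acc = single / n_trials
majority_acc = majority / n_trials
print(single_acc, majority_acc)   # the majority vote is far more accurate
```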
Tom Mitchell:
Well, that turned out to be very
Tom Mitchell:
successful and commercially very
Tom Mitchell:
important.
Tom Mitchell:
But it also is a beautiful
Tom Mitchell:
example where, there's a
Tom Mitchell:
pretty interesting theory around
Tom Mitchell:
that.
Tom Mitchell:
And initially, Yoav Freund and
Robert Schapire, uh, in the early
Tom Mitchell:
nineties, uh, started working on
a theory and methods for doing
Tom Mitchell:
this kind of ensemble learning.
Tom Mitchell:
Leo Breiman, who was a
statistician, recognized that
Tom Mitchell:
this echoed some of the themes
of resampling in statistics.
Tom Mitchell:
And those two things, uh, kind
Tom Mitchell:
of came together in a very
Tom Mitchell:
successful way.
Tom Mitchell:
So in the nineties and the first
Tom Mitchell:
decade of the two thousands,
Tom Mitchell:
there were many other things
Tom Mitchell:
going on.
Tom Mitchell:
The development of things
called support vector machines,
Tom Mitchell:
kernel methods, which were
mathematical techniques for
Tom Mitchell:
learning very nonlinear
classifiers that were actually
Tom Mitchell:
commercially important and
opened the door in many cases to
Tom Mitchell:
machine learning for
non-numerical data, data like
Tom Mitchell:
images or text.
Tom Mitchell:
There was work on manifold
learning.
Tom Mitchell:
There was also growing
Tom Mitchell:
commercialization during that
Tom Mitchell:
decade.
Tom Mitchell:
More and more companies were
Tom Mitchell:
starting to use machine learning
Tom Mitchell:
commercially.
Tom Mitchell:
But for me, the theme of that
first decade of the two thousands
Tom Mitchell:
was really a growing awareness
by many people that, you know,
Tom Mitchell:
maybe we have good enough
machine learning algorithms that
Tom Mitchell:
the bottleneck to more accuracy
is not the algorithm.
Tom Mitchell:
Maybe we need more data and more
computation.
Tom Mitchell:
And this idea was crystallized
in this beautiful paper written
Tom Mitchell:
in two thousand and nine by
three authors at Google, called
Tom Mitchell:
The Unreasonable Effectiveness
of Data, which really
Tom Mitchell:
highlighted cases where,
if you want better
Tom Mitchell:
results, keep your same
algorithm, get more data.
Tom Mitchell:
And that was kind of a theme of
what was going on at the time,
Tom Mitchell:
but things really broke open in
the year twenty twelve.
Tom Mitchell:
In twenty twelve, the
computer vision community had
Tom Mitchell:
been using a data set created by
Fei-Fei Li called ImageNet to
Tom Mitchell:
test out different vision
algorithms, see who could do the
Tom Mitchell:
best job of labeling which
object was the primary object in
Tom Mitchell:
an image, and the ImageNet data
set was very large.
Tom Mitchell:
In twenty twelve, Geoff Hinton
and some of his students entered
Tom Mitchell:
the competition and they blew
away the competition.
Tom Mitchell:
What's interesting is they were
the only neural network approach
Tom Mitchell:
in the competition.
Tom Mitchell:
By that time, by the way, neural networks were
Tom Mitchell:
very scarce in the field of
Tom Mitchell:
machine learning.
Tom Mitchell:
They had been displaced really
Tom Mitchell:
by more recent probabilistic
Tom Mitchell:
methods, and only a smallish
Tom Mitchell:
number of researchers were even
Tom Mitchell:
still working on neural
Tom Mitchell:
networks.
Tom Mitchell:
But, nevertheless, this
happened.
Tom Mitchell:
So I asked Geoff about that.
Geoffrey Hinton:
When Fei-Fei
came up with the ImageNet
Geoffrey Hinton:
dataset, Yann realized they
could win that competition, and
Geoffrey Hinton:
he tried to get graduate
students and postdocs in his lab
Geoffrey Hinton:
to do it, and they all declined.
Geoffrey Hinton:
And Ilya, Ilya Sutskever
realized that backprop
Geoffrey Hinton:
would just kill ImageNet.
Geoffrey Hinton:
He wanted Alex to work
on it, and Alex actually didn't really
Geoffrey Hinton:
want to work on it.
Geoffrey Hinton:
Alex had already been
Geoffrey Hinton:
working on small images and
Geoffrey Hinton:
recognizing small images in
Geoffrey Hinton:
CIFAR-10, and Ilya pre-processed
Geoffrey Hinton:
everything for Alex to make it
Geoffrey Hinton:
easy.
Geoffrey Hinton:
And I bought Alex two Nvidia
Geoffrey Hinton:
GPUs to have in his bedroom at
Geoffrey Hinton:
home.
Geoffrey Hinton:
Alex then got on with
got on with it, and he was an
Geoffrey Hinton:
absolutely wizard programmer.
Geoffrey Hinton:
He wrote amazing code on
Geoffrey Hinton:
multiple GPUs to do convolution
Geoffrey Hinton:
really efficiently.
Geoffrey Hinton:
Much better code than anybody
else had ever written.
Geoffrey Hinton:
I believe. And so it's a
combination of Ilya realizing we
Geoffrey Hinton:
really had to do this.
Geoffrey Hinton:
I know Ilya was involved in the
design of the net and so on, and
Geoffrey Hinton:
Alex's programming skills.
Geoffrey Hinton:
And then I added a few ideas,
like use rectified linear units
Geoffrey Hinton:
instead of sigmoid units and use
little patches of the images.
Geoffrey Hinton:
I mean, big patches of the
images.
Geoffrey Hinton:
So you can translate things
Geoffrey Hinton:
around a bit to get some
Geoffrey Hinton:
translation invariance, as well
Geoffrey Hinton:
as using convolution, and
Geoffrey Hinton:
use dropout.
Geoffrey Hinton:
So that was one of the first
applications of dropout.
Geoffrey Hinton:
And that helped about one
percent.
Geoffrey Hinton:
It helped.
Geoffrey Hinton:
And then we beat the best vision
systems.
Geoffrey Hinton:
The best vision systems were
sort of plateauing at twenty
Geoffrey Hinton:
five percent errors.
Geoffrey Hinton:
That's errors for getting the
right answer in your
Geoffrey Hinton:
top five bets.
Geoffrey Hinton:
And we got like fifteen
percent, fifteen or sixteen,
Geoffrey Hinton:
depending on how you count it.
Geoffrey Hinton:
So we got almost half the error
rate.
Geoffrey Hinton:
And what happened then was what
Geoffrey Hinton:
ought to happen in science but
Geoffrey Hinton:
seldom does.
Geoffrey Hinton:
So our most vigorous opponents,
like Jitendra Malik and
Geoffrey Hinton:
Zisserman, Andrew Zisserman,
looked at these results and
Geoffrey Hinton:
said, okay, you were right.
Geoffrey Hinton:
That never happens in science.
Geoffrey Hinton:
And, slightly irritatingly,
Andrew Zisserman then switched
Geoffrey Hinton:
to doing this.
Geoffrey Hinton:
He had some very good postdocs
or students working with him.
Geoffrey Hinton:
Simonyan, after about
Geoffrey Hinton:
a year, they were making better
Geoffrey Hinton:
networks than us. But that was,
Geoffrey Hinton:
as far as
Geoffrey Hinton:
the general public was
concerned,
Geoffrey Hinton:
really the start of this big
Geoffrey Hinton:
swing towards deep learning in
Geoffrey Hinton:
twenty twelve.
Tom Mitchell:
So that event, that competition
Tom Mitchell:
and the fact that the neural
Tom Mitchell:
network approach totally
Tom Mitchell:
dominated all the other
Tom Mitchell:
approaches really was a wake up
Tom Mitchell:
call to both the computer vision
Tom Mitchell:
community, in which within a couple
Tom Mitchell:
of years everybody was using
Tom Mitchell:
neural networks.
Tom Mitchell:
But it was also a wake up call
to the machine learning
Tom Mitchell:
community, who had kind of
scoffed at neural networks for
Tom Mitchell:
several decades, that neural
networks were back.
Tom Mitchell:
And so people started again, now
Tom Mitchell:
experimenting with this new
Tom Mitchell:
generation of deep neural
Tom Mitchell:
networks.
Tom Mitchell:
That just meant that instead of
having two layers, they could
Tom Mitchell:
have many layers, dozens of
layers, because training
Tom Mitchell:
algorithms were available and so
was the computation.
Tom Mitchell:
People started experimenting with
these and primarily on
Tom Mitchell:
perceptual style problems.
Tom Mitchell:
In fact, by twenty sixteen,
Tom Mitchell:
neural nets had taken over not
Tom Mitchell:
only computer vision, but in
Tom Mitchell:
twenty sixteen, some scientists
Tom Mitchell:
from Microsoft showed that they
Tom Mitchell:
had been able to train a neural
Tom Mitchell:
network to finally reach human
Tom Mitchell:
level speech
Tom Mitchell:
recognition performance
for individual words in a widely
Tom Mitchell:
used data set called the
Switchboard data set.
Tom Mitchell:
So people were experimenting
with neural nets for visual
Tom Mitchell:
data, speech data, radar, lidar,
all kinds of sensory data.
Tom Mitchell:
People started also asking,
Tom Mitchell:
well, can we apply these to text
Tom Mitchell:
data?
Tom Mitchell:
And the answer was yes.
Tom Mitchell:
And people started inventing
various architectures, things
Tom Mitchell:
with names like long short term
memory and others to analyze
Tom Mitchell:
sequences of text and applying
them to problems like machine
Tom Mitchell:
translation, translating English
into French, and so forth.
Tom Mitchell:
And, uh, that kind of
worked.
Tom Mitchell:
And then in twenty seventeen,
Tom Mitchell:
a very important paper was
Tom Mitchell:
published.
Tom Mitchell:
The name of the paper was
Attention is All You Need.
Tom Mitchell:
And what that was referring to
was a subcircuit in a
Tom Mitchell:
neural network called an
attention mechanism that had
Tom Mitchell:
recently been invented and
developed and was trainable.
Tom Mitchell:
But that attention mechanism
Tom Mitchell:
was used in this paper, and it
Tom Mitchell:
advanced the state of the art in
Tom Mitchell:
machine translation.
Tom Mitchell:
But even more importantly for us
today, it introduced the
Tom Mitchell:
transformer architecture based
on this attention mechanism.
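The attention mechanism itself is compact enough to write out. Below is a minimal sketch of scaled dot-product attention as described in that paper, run on random toy matrices: each position's query is scored against every key, and the softmax of those scores weights an average of the values.

```python
import numpy as np

def attention(Q, K, V):
    # Score every query against every key, scaled by sqrt of the key width.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each row of scores into a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the values.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 4, 8                       # four "tokens", eight features each
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))        # (4, 8); each weight row sums to 1
```

In a real transformer Q, K, and V are learned linear projections of the token embeddings, and many such heads run in parallel per layer; the core computation is just this.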
Tom Mitchell:
And it's that transformer
Tom Mitchell:
architecture that underlies GPT
Tom Mitchell:
and pretty much all of the large
Tom Mitchell:
language models that were
Tom Mitchell:
released around twenty twenty
Tom Mitchell:
two.
Tom Mitchell:
So that was a major event.
Tom Mitchell:
Now, around the same time, Yann
Tom Mitchell:
LeCun, remember the guy who was
Tom Mitchell:
a postdoc with Geoff in nineteen
Tom Mitchell:
eighty seven?
Tom Mitchell:
Yann had become the head of AI
research at Facebook.
Tom Mitchell:
And so he was in a very
interesting position because he
Tom Mitchell:
was both an academic.
Tom Mitchell:
He retained his NYU
professorship and at the same
Tom Mitchell:
time he had a foot in the
commercial world directing the
Tom Mitchell:
AI strategy for Facebook.
Tom Mitchell:
So I asked Yann about this period
Tom Mitchell:
and what it looked like to him
Tom Mitchell:
from being inside both
Tom Mitchell:
worlds.
Tom Mitchell:
His first part of his answer was
Tom Mitchell:
that he said for him, a key
Tom Mitchell:
development was realizing that
Tom Mitchell:
you didn't have to wait for
Tom Mitchell:
people to label all your
Tom Mitchell:
training data, that you could do
Tom Mitchell:
something called self-supervised
Tom Mitchell:
learning.
Tom Mitchell:
For example, just take data like
a string of words and remove a
Tom Mitchell:
word and force
the program to predict what that
Tom Mitchell:
removed word was.
Tom Mitchell:
So there's no human labeling you
have to do for that.
Tom Mitchell:
You can use the whole web and
Tom Mitchell:
you get a lot of training
Tom Mitchell:
examples.
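The word-removal trick Tom describes needs almost no machinery to set up; a few lines turn raw text into labeled-looking training pairs with no human in the loop (the sentence below is just a stand-in corpus).

```python
# A stand-in corpus; any raw text works, which is the whole point.
text = "we tried for fifty years to write intelligent programs by hand"
words = text.split()

examples = []
for i, target in enumerate(words):
    # Remove one word; the surrounding context becomes the input and the
    # removed word becomes the label, with no human annotation involved.
    context = words[:i] + ["[MASK]"] + words[i + 1:]
    examples.append((" ".join(context), target))

print(len(examples))   # one free training example per word in the corpus
print(examples[3])     # the fourth example masks the word "fifty"
```

Scale the same loop up to the whole web and you have the training set for a large language model.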
Tom Mitchell:
So that self-supervised
learning was a key development.
Tom Mitchell:
But then here's his description
of what came next.
Yann LeCun:
So the idea that self-supervised
learning could really kind of
Yann LeCun:
bring something to the table
there, I think was kind of a
Yann LeCun:
big change of mindset.
Yann LeCun:
And then there was
Transformers, of course.
Yann LeCun:
Right.
Yann LeCun:
Before that, there was some
Yann LeCun:
demonstration that, you
know, you could basically match
Yann LeCun:
the performance of classical
systems for tasks like
Yann LeCun:
translation, language
translation using large neural
Yann LeCun:
nets like LSTM.
Yann LeCun:
So this was the work by Ilya
Sutskever when he was at Google.
Yann LeCun:
It was this big sequence to
sequence model with LSTMs, some
Yann LeCun:
gigantic model you could train
to do translation.
Yann LeCun:
And it kind of works at the same
Yann LeCun:
level, if not better in some
Yann LeCun:
cases, than the then-classical
Yann LeCun:
translation methods.
Yann LeCun:
Then a few months later,
Yann LeCun:
Yoshua Bengio and Kyunghyun Cho,
Yann LeCun:
who is now a colleague at NYU,
Yann LeCun:
uh, showed that you could change
Yann LeCun:
the architecture and use this
Yann LeCun:
attention mechanism
Yann LeCun:
that they proposed to
basically get really good
Yann LeCun:
performance on translation with
much smaller models than what
Yann LeCun:
Ilya had been proposing.
Yann LeCun:
And the entire industry jumped
Yann LeCun:
on this, Chris Manning's
Yann LeCun:
group at Stanford, kind of, you
Yann LeCun:
know, used that architecture and
Yann LeCun:
basically
Yann LeCun:
won the WMT competition for a
Yann LeCun:
particular, uh, type of
Yann LeCun:
translation.
Yann LeCun:
And the entire industry jumped
on it.
Yann LeCun:
So within a few months after
that, like, you know, all the
Yann LeCun:
big players, uh, in translation,
were using attention type
Yann LeCun:
architectures for translation.
Yann LeCun:
And that's when, the
transformer paper came out.
Yann LeCun:
Attention is all you need.
Yann LeCun:
So basically, if you build a
neural net just with those kinds
Yann LeCun:
of attention circuits, you don't
need much else.
Yann LeCun:
And it ends up working super
well.
Yann LeCun:
And that's what started the, you
Yann LeCun:
know, the transformer
Yann LeCun:
revolution.
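The attention circuit being described reduces to a small calculation: compare a query against a set of keys, turn the scores into weights, and average the values. Below is a minimal pure-Python sketch of scaled dot-product attention for a single query, the core operation of the transformer; it is an illustrative toy, not anyone's production code.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector:
    score each key against the query, softmax the scores into
    weights, and return the weighted average of the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                          # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key more closely, so the output
# leans toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Stacking many of these circuits, with learned projections for queries, keys, and values, is essentially what "you don't need much else" refers to.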
Yann LeCun:
And then after that came
BERT, which also came out of
Yann LeCun:
Google, which was this idea of
using self-supervised learning,
Yann LeCun:
where I take a sequence of
words, corrupt it by removing
Yann LeCun:
some of the words, and then
train this
big neural net to reconstruct
Yann LeCun:
the words that are missing.
Yann LeCun:
Predict the words that are
missing.
Yann LeCun:
And again, people were
Yann LeCun:
amazed by like how how good the
Yann LeCun:
representations learned by the
Yann LeCun:
system were for all kinds of NLP
Yann LeCun:
tasks.
Yann LeCun:
And that really, uh, you know,
kind of captured the imagination
Yann LeCun:
of a lot of people.
Yann LeCun:
And then after that, the
next revolution was, oh,
Yann LeCun:
actually, the best thing to do
is you remove the encoder, you
Yann LeCun:
just use a decoder.
Yann LeCun:
You just train a system: you
feed it a sequence and train it
Yann LeCun:
to reproduce the input sequence
on its output.
Yann LeCun:
Because the architecture of the
decoder is strictly causal, a
Yann LeCun:
particular output is not
connected to the corresponding
Yann LeCun:
input, only to the ones to the
left of it.
Yann LeCun:
Implicitly, you're training the
Yann LeCun:
system to predict the next word
Yann LeCun:
that comes after a sequence of
Yann LeCun:
words.
Yann LeCun:
That's the GPT architecture that
Yann LeCun:
was, you know, promoted by
Yann LeCun:
OpenAI.
Yann LeCun:
And that turned out to be
more scalable than BERT.
Yann LeCun:
In the sense that you can
Yann LeCun:
train gigantic networks on
Yann LeCun:
enormous amounts of data, and
Yann LeCun:
you get these emergent
properties.
Yann LeCun:
And that's what gave us LLMs.
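The decoder-only setup just described can be illustrated by the training pairs it implicitly creates: since each output position only sees the tokens to its left, training the model to reproduce its input amounts to next-word prediction. A toy sketch of that idea, not any lab's actual code:

```python
def next_word_pairs(tokens):
    """GPT-style implicit supervision: each output position can only
    see the tokens to its left, so reproducing the sequence means
    predicting each token from the prefix before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs(["the", "cat", "sat", "down"])
# pairs[0] asks the model to predict "cat" from ["the"], and so on.
```

One pass over a sequence of length n thus yields n - 1 prediction problems, which is part of why this recipe scales so well with data.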
Tom Mitchell:
So that brings us up to today
with Transformers.
Tom Mitchell:
And you can see this
strange, wandering path of
Tom Mitchell:
progress and exploration over
decades.
Tom Mitchell:
So before we leave, let's just
Tom Mitchell:
take a look at that history and
Tom Mitchell:
treat it as a case study of how
Tom Mitchell:
scientific progress was made in
Tom Mitchell:
this field.
Tom Mitchell:
What are the main themes we see?
Tom Mitchell:
Well, I think the first one is
progress happens in waves.
Tom Mitchell:
It's paradigm after paradigm,
right?
Tom Mitchell:
First there were perceptrons,
Tom Mitchell:
but those got thrown away and
Tom Mitchell:
replaced by the learning of
Tom Mitchell:
symbolic representations,
Tom Mitchell:
eventually to be replaced by
Tom Mitchell:
neural nets, which were replaced
Tom Mitchell:
by probabilistic methods and so
Tom Mitchell:
forth.
Tom Mitchell:
So there's wave after wave of
paradigm.
Tom Mitchell:
Another theme is that a lot of
Tom Mitchell:
these ideas really came from
Tom Mitchell:
other fields.
Tom Mitchell:
Even the very notion of
Tom Mitchell:
perceptrons came from somebody
Tom Mitchell:
who was fundamentally a
Tom Mitchell:
neuroscientist interested in how
Tom Mitchell:
neurons in the brain could even
Tom Mitchell:
learn stuff.
Tom Mitchell:
PAC learning.
Tom Mitchell:
You heard Les Valiant talk.
Tom Mitchell:
He's very much a
Tom Mitchell:
computational complexity
Tom Mitchell:
researcher who found that this
Tom Mitchell:
was an interesting theoretical
Tom Mitchell:
result.
Tom Mitchell:
Bayesian networks were heavily
Tom Mitchell:
influenced by statistics, and so
Tom Mitchell:
forth.
Tom Mitchell:
Many of these advances really
Tom Mitchell:
were new framings of the
Tom Mitchell:
problem.
Tom Mitchell:
So, uh, Winston's work on
Tom Mitchell:
symbolic learning was really a
Tom Mitchell:
reframing of what the problem
Tom Mitchell:
was.
Tom Mitchell:
The work on reinforcement
Tom Mitchell:
learning is really changing the
Tom Mitchell:
definition of what the training
Tom Mitchell:
signal even is for these
Tom Mitchell:
systems.
Tom Mitchell:
So that's another theme that you
see.
Tom Mitchell:
And finally, I think like a lot
Tom Mitchell:
of scientific fields, machine
Tom Mitchell:
learning is really a blend of
Tom Mitchell:
technical forces and social
Tom Mitchell:
forces.
Tom Mitchell:
Certainly in the long term,
Tom Mitchell:
the cold, hard facts of what
Tom Mitchell:
works best come out and those
Tom Mitchell:
methods win.
Tom Mitchell:
But in the shorter term, the
Tom Mitchell:
question of who works on what
Tom Mitchell:
kinds of problems is very much
Tom Mitchell:
influenced by the personalities
Tom Mitchell:
of people.
Tom Mitchell:
Their ability to persuade other
Tom Mitchell:
people to jump in and start
Tom Mitchell:
working with them on their
Tom Mitchell:
problems.
Tom Mitchell:
So these are some of the themes
you see.
Tom Mitchell:
And I think if you look around
at other fields, sometimes you
Tom Mitchell:
see similar themes.
Tom Mitchell:
Finally, what are the lessons
from all this for researchers?
Tom Mitchell:
I think the first lesson really
is question authority.
Tom Mitchell:
Because really, if you think
Tom Mitchell:
about the major advances, many
Tom Mitchell:
of those came from just, uh,
Tom Mitchell:
going against what was currently
Tom Mitchell:
the conventional wisdom in the
Tom Mitchell:
field.
Tom Mitchell:
Inventing a new framing or
Tom Mitchell:
taking a radically different
Tom Mitchell:
approach.
Tom Mitchell:
Another lesson: don't drag your
feet.
Tom Mitchell:
I've seen decade after decade,
new paradigms emerge in the
Tom Mitchell:
field, and every single time
that happens, existing
Tom Mitchell:
researchers take longer than
they need to to recognize the
Tom Mitchell:
benefits of the new paradigm.
Tom Mitchell:
And the most guilty people are
the senior researchers.
Tom Mitchell:
You can probably explain that by
Tom Mitchell:
taking into account who has the
Tom Mitchell:
most to lose if there's a new
Tom Mitchell:
paradigm replacing the current
Tom Mitchell:
approach.
Tom Mitchell:
Another lesson: learn to
Tom Mitchell:
communicate, and learn to follow
Tom Mitchell:
through.
Tom Mitchell:
You heard Geoff Hinton talking
Tom Mitchell:
about the development of
Tom Mitchell:
backpropagation in the
mid-eighties.
Tom Mitchell:
You heard him say we didn't
invent backpropagation, but we
Tom Mitchell:
showed that it was important.
Tom Mitchell:
And actually, to be fair, they
Tom Mitchell:
thought they were inventing
Tom Mitchell:
backpropagation.
Tom Mitchell:
They actually reinvented
Tom Mitchell:
it, but they had no idea that
Tom Mitchell:
somebody had invented it before,
Tom Mitchell:
because whoever did that didn't
Tom Mitchell:
succeed in waking up the
Tom Mitchell:
research community to the fact
Tom Mitchell:
that they had a really good
Tom Mitchell:
idea.
Tom Mitchell:
I don't know why.
Tom Mitchell:
Maybe they didn't put in the
Tom Mitchell:
effort or succeed in
Tom Mitchell:
communicating.
Tom Mitchell:
Maybe they dropped it after they
Tom Mitchell:
did it and went some other
Tom Mitchell:
direction so that they didn't
Tom Mitchell:
follow through to provide the
Tom Mitchell:
evidence.
Tom Mitchell:
But that kind of thing happens
Tom Mitchell:
frequently.
Tom Mitchell:
Successful researchers are good
Tom Mitchell:
communicators, and they follow
Tom Mitchell:
through to push the field to pay
Tom Mitchell:
attention.
Tom Mitchell:
The final lesson, I think, is
Tom Mitchell:
the philosophers were actually
Tom Mitchell:
right.
Tom Mitchell:
Today, despite these amazing
capabilities of our learning
Tom Mitchell:
systems, we don't have a proof,
or anything like a rational
Tom Mitchell:
justification, of why you can
generalize from examples to
Tom Mitchell:
general rules that work well.
Tom Mitchell:
We don't really understand at
this very fundamental level why.
Tom Mitchell:
And I think that if we did pay
more attention to that question,
Tom Mitchell:
we might have a better chance to
develop algorithms that
Tom Mitchell:
outperform what we have today.
Tom Mitchell:
So I'll stop there.
Tom Mitchell:
Thank you very much.
Speaker 12:
Tom Mitchell is the University
Speaker 12:
Founders Professor at Carnegie
Speaker 12:
Mellon University.
Speaker 12:
Machine Learning: How Did We Get
Here?
Speaker 12:
is produced by the Stanford
Digital Economy Lab.
Speaker 12:
If you enjoyed this episode,
Speaker 12:
subscribe wherever you listen to
Speaker 12:
podcasts.