1
00:00:00,240 --> 00:00:02,279
Tom Mitchell:
Welcome to machine learning.

2
00:00:02,279 --> 00:00:03,839
Tom Mitchell:
How did we get here?

3
00:00:03,839 --> 00:00:07,719
Tom Mitchell:
I'm Tom Mitchell, your podcast
host.

4
00:00:07,719 --> 00:00:12,599
Tom Mitchell:
Now many people ask, how did we
get to this point where today we

5
00:00:12,599 --> 00:00:16,280
Tom Mitchell:
have these amazing AI systems?

6
00:00:16,280 --> 00:00:20,160
Tom Mitchell:
I have a one sentence answer to
that question.

7
00:00:20,160 --> 00:00:22,820
Tom Mitchell:
We tried for fifty years to

8
00:00:22,820 --> 00:00:25,679
Tom Mitchell:
write by hand intelligent

9
00:00:25,679 --> 00:00:27,899
Tom Mitchell:
programs, but we discovered

10
00:00:27,899 --> 00:00:30,480
Tom Mitchell:
about a decade ago that it was

11
00:00:30,480 --> 00:00:32,079
Tom Mitchell:
actually much easier and much

12
00:00:32,079 --> 00:00:35,039
Tom Mitchell:
more successful to use machine

13
00:00:35,039 --> 00:00:37,159
Tom Mitchell:
learning methods to instead

14
00:00:37,159 --> 00:00:38,719
Tom Mitchell:
train them to become

15
00:00:38,719 --> 00:00:40,399
Tom Mitchell:
intelligent.

16
00:00:40,399 --> 00:00:44,320
Tom Mitchell:
So the real question is, how did
machine learning get here?

17
00:00:45,399 --> 00:00:49,200
Tom Mitchell:
What were the successes along
the way and the failures?

18
00:00:49,200 --> 00:00:50,640
Tom Mitchell:
Who were the people involved?

19
00:00:50,640 --> 00:00:52,359
Tom Mitchell:
What were they thinking?

20
00:00:52,359 --> 00:00:53,799
Tom Mitchell:
What even made them want to get

21
00:00:53,799 --> 00:00:55,960
Tom Mitchell:
into this field in the first

22
00:00:55,960 --> 00:00:56,799
Tom Mitchell:
place?

23
00:00:58,200 --> 00:01:02,219
Tom Mitchell:
This first episode will set the
stage for the podcast.

24
00:01:02,219 --> 00:01:06,219
Tom Mitchell:
It is a recording of a lecture I
gave this month in February

25
00:01:06,219 --> 00:01:11,060
Tom Mitchell:
twenty twenty six at Carnegie
Mellon University, and it

26
00:01:11,060 --> 00:01:16,620
Tom Mitchell:
attempts to cover in one hour a
seventy five year history of the

27
00:01:16,620 --> 00:01:19,540
Tom Mitchell:
field of machine learning.

28
00:01:19,540 --> 00:01:21,500
Tom Mitchell:
Most of the rest of the episodes

29
00:01:21,500 --> 00:01:23,500
Tom Mitchell:
in the podcast involve

30
00:01:23,500 --> 00:01:27,200
Tom Mitchell:
interviews with various pioneers

31
00:01:27,200 --> 00:01:29,219
Tom Mitchell:
in the field, who made very

32
00:01:29,219 --> 00:01:31,819
Tom Mitchell:
significant contributions along

33
00:01:31,819 --> 00:01:33,340
Tom Mitchell:
the way.

34
00:01:33,340 --> 00:01:38,060
Tom Mitchell:
Before we start, I want to thank
Carnegie Mellon University and

35
00:01:38,060 --> 00:01:42,980
Tom Mitchell:
also the Stanford University
Digital Economy Lab for

36
00:01:42,980 --> 00:01:44,659
Tom Mitchell:
supporting the podcast.

37
00:01:44,659 --> 00:01:48,939
Tom Mitchell:
And I want to thank Maddie
Smith, our podcast producer.

38
00:01:48,939 --> 00:01:51,140
Tom Mitchell:
I hope you enjoy the podcast.

39
00:02:06,569 --> 00:02:08,530
Tom Mitchell:
If we're going to talk about

40
00:02:08,530 --> 00:02:10,669
Tom Mitchell:
machine learning, it's only fair

41
00:02:10,669 --> 00:02:12,969
Tom Mitchell:
to start with the first people

42
00:02:12,969 --> 00:02:15,129
Tom Mitchell:
who talked about how on earth is

43
00:02:15,129 --> 00:02:16,530
Tom Mitchell:
learning possible,

44
00:02:16,530 --> 00:02:18,490
Tom Mitchell:
which were the philosophers.

45
00:02:18,490 --> 00:02:23,110
Tom Mitchell:
And so as early as Aristotle, he
was talking about the question

46
00:02:23,110 --> 00:02:28,449
Tom Mitchell:
of how is it that people could
look at examples of things and

47
00:02:28,449 --> 00:02:30,689
Tom Mitchell:
learn their general essence,

48
00:02:30,689 --> 00:02:35,810
Tom Mitchell:
in his words. About a century
later, there was a school of

49
00:02:35,810 --> 00:02:39,849
Tom Mitchell:
philosophers called the
Pyrrhonists, who really zeroed

50
00:02:39,849 --> 00:02:44,969
Tom Mitchell:
in on the problem of induction
and how it can be justified.

51
00:02:44,969 --> 00:02:46,449
Tom Mitchell:
When we say induction, what we

52
00:02:46,449 --> 00:02:50,050
Tom Mitchell:
really mean is the process of

53
00:02:50,050 --> 00:02:52,569
Tom Mitchell:
coming up with a general rule

54
00:02:52,569 --> 00:02:54,289
Tom Mitchell:
from looking at specific

55
00:02:54,289 --> 00:02:55,449
Tom Mitchell:
examples.

56
00:02:55,449 --> 00:02:57,060
Tom Mitchell:
And so they talked about

57
00:02:57,060 --> 00:02:59,229
Tom Mitchell:
questions like, well, if all of

58
00:02:59,229 --> 00:03:01,430
Tom Mitchell:
the swans we've seen so far in

59
00:03:01,430 --> 00:03:03,849
Tom Mitchell:
our life are white, should we

60
00:03:03,849 --> 00:03:05,550
Tom Mitchell:
conclude that all swans are

61
00:03:05,550 --> 00:03:06,189
Tom Mitchell:
white?

62
00:03:06,189 --> 00:03:08,270
Tom Mitchell:
What would be the justification
for that?

63
00:03:08,270 --> 00:03:13,030
Tom Mitchell:
Maybe there's a black swan out
there that we haven't seen.

64
00:03:13,030 --> 00:03:16,069
Tom Mitchell:
And, uh, that debate went on for

65
00:03:16,069 --> 00:03:18,090
Tom Mitchell:
some time. Around thirteen

66
00:03:18,090 --> 00:03:18,990
Tom Mitchell:
hundred,

67
00:03:18,990 --> 00:03:23,110
Tom Mitchell:
William of Ockham, uh, suggested

68
00:03:23,110 --> 00:03:24,490
Tom Mitchell:
something that we now call

69
00:03:24,490 --> 00:03:27,110
Tom Mitchell:
Occam's razor, the policy that

70
00:03:27,110 --> 00:03:29,430
Tom Mitchell:
we should prefer the simplest

71
00:03:29,430 --> 00:03:31,110
Tom Mitchell:
hypothesis.

72
00:03:31,110 --> 00:03:35,389
Tom Mitchell:
So, indeed, if all the swans
we've seen so far are white,

73
00:03:35,389 --> 00:03:39,710
Tom Mitchell:
then the simplest hypothesis is
all swans are white.

74
00:03:39,710 --> 00:03:42,310
Tom Mitchell:
That was his prescription.

75
00:03:42,310 --> 00:03:46,389
Tom Mitchell:
Later on, around sixteen
hundred, Francis Bacon brought

76
00:03:46,389 --> 00:03:50,349
Tom Mitchell:
up the importance of data
collection, of actively

77
00:03:50,349 --> 00:03:55,629
Tom Mitchell:
experimenting, to collect data
that could falsify hypotheses

78
00:03:55,629 --> 00:03:57,810
Tom Mitchell:
that weren't correct.

79
00:03:57,810 --> 00:04:01,729
Tom Mitchell:
And then in the seventeen
hundreds, the philosopher David

80
00:04:01,729 --> 00:04:06,810
Tom Mitchell:
Hume really kind of nailed
the problem of induction.

81
00:04:06,810 --> 00:04:11,949
Tom Mitchell:
He argued very persuasively that
it's really impossible to

82
00:04:11,949 --> 00:04:16,449
Tom Mitchell:
generalize from examples if you
don't have some additional

83
00:04:16,449 --> 00:04:18,689
Tom Mitchell:
assumption that you're making.

84
00:04:18,689 --> 00:04:22,209
Tom Mitchell:
And he pointed out that even the
assumption that the future will

85
00:04:22,209 --> 00:04:27,990
Tom Mitchell:
be like the past is itself not a
provable assumption; it's just a

86
00:04:27,990 --> 00:04:29,730
Tom Mitchell:
guess that we use.

87
00:04:29,730 --> 00:04:35,250
Tom Mitchell:
So his point was that people do
induction, but it's a habit.

88
00:04:35,250 --> 00:04:41,170
Tom Mitchell:
It's not a justified, rational,
provable, correct process.

89
00:04:42,970 --> 00:04:46,850
Tom Mitchell:
So they had plenty to say. Around
the nineteen forties, when

90
00:04:46,850 --> 00:04:49,209
Tom Mitchell:
computers became available,

91
00:04:49,209 --> 00:04:50,930
Tom Mitchell:
Alan Turing, who's often called

92
00:04:50,930 --> 00:04:54,009
Tom Mitchell:
the father of computing, uh,

93
00:04:54,009 --> 00:04:55,990
Tom Mitchell:
suggested that maybe computers

94
00:04:55,990 --> 00:04:57,189
Tom Mitchell:
could learn.

95
00:04:57,189 --> 00:05:02,750
Tom Mitchell:
He said instead of trying to
produce a program to simulate

96
00:05:02,750 --> 00:05:07,470
Tom Mitchell:
the adult mind, why not rather
try to produce one which

97
00:05:07,470 --> 00:05:09,629
Tom Mitchell:
simulates a child's?

98
00:05:09,629 --> 00:05:11,250
Tom Mitchell:
If this were then subjected to

99
00:05:11,250 --> 00:05:12,470
Tom Mitchell:
an appropriate course of

100
00:05:12,470 --> 00:05:14,970
Tom Mitchell:
education, one would obtain the

101
00:05:14,970 --> 00:05:16,430
Tom Mitchell:
adult brain.

102
00:05:16,430 --> 00:05:19,750
Tom Mitchell:
So he had the idea that maybe
computers could learn.

103
00:05:19,750 --> 00:05:24,250
Tom Mitchell:
But he did not have an algorithm
by which they would learn. That

104
00:05:24,250 --> 00:05:27,910
Tom Mitchell:
waited until the nineteen
fifties, when there were two

105
00:05:27,910 --> 00:05:30,910
Tom Mitchell:
important seminal events.

106
00:05:30,910 --> 00:05:32,949
Tom Mitchell:
One was a computer program

107
00:05:32,949 --> 00:05:35,589
Tom Mitchell:
written by an IBM researcher

108
00:05:35,589 --> 00:05:38,589
Tom Mitchell:
named Art Samuel, and his

109
00:05:38,589 --> 00:05:39,790
Tom Mitchell:
program learned to play

110
00:05:39,790 --> 00:05:40,350
Tom Mitchell:
checkers.

111
00:05:40,350 --> 00:05:42,009
Tom Mitchell:
I'll just read you a couple

112
00:05:42,009 --> 00:05:43,870
Tom Mitchell:
sentences from the abstract of

113
00:05:43,870 --> 00:05:45,110
Tom Mitchell:
this paper.

114
00:05:45,110 --> 00:05:48,550
Tom Mitchell:
He said two machine learning
procedures have been

115
00:05:48,550 --> 00:05:53,899
Tom Mitchell:
investigated in some detail
using the game of checkers.

116
00:05:53,899 --> 00:05:55,439
Tom Mitchell:
Enough work has been done to

117
00:05:55,439 --> 00:05:57,860
Tom Mitchell:
verify the fact that a computer

118
00:05:57,860 --> 00:06:00,180
Tom Mitchell:
can be programmed so that it

119
00:06:00,180 --> 00:06:02,660
Tom Mitchell:
will learn to play a better game

120
00:06:02,660 --> 00:06:04,899
Tom Mitchell:
of checkers than can be played

121
00:06:04,899 --> 00:06:06,060
Tom Mitchell:
by the person who wrote the

122
00:06:06,060 --> 00:06:07,899
Tom Mitchell:
program.

123
00:06:07,899 --> 00:06:09,740
Tom Mitchell:
And then he went on to point out

124
00:06:09,740 --> 00:06:11,139
Tom Mitchell:
the principles of machine

125
00:06:11,139 --> 00:06:13,300
Tom Mitchell:
learning verified by these

126
00:06:13,300 --> 00:06:15,180
Tom Mitchell:
experiments are, of course,

127
00:06:15,180 --> 00:06:17,259
Tom Mitchell:
applicable to many other

128
00:06:17,259 --> 00:06:18,899
Tom Mitchell:
situations.

129
00:06:18,899 --> 00:06:22,019
Tom Mitchell:
So he had really one of maybe

130
00:06:22,019 --> 00:06:24,339
Tom Mitchell:
the first demonstrations of a

131
00:06:24,339 --> 00:06:26,180
Tom Mitchell:
program that learned to do

132
00:06:26,180 --> 00:06:28,019
Tom Mitchell:
something interesting.

133
00:06:28,019 --> 00:06:29,699
Tom Mitchell:
And he understood that the

134
00:06:29,699 --> 00:06:31,259
Tom Mitchell:
techniques he was using were

135
00:06:31,259 --> 00:06:32,899
Tom Mitchell:
very general.

136
00:06:32,899 --> 00:06:36,300
Tom Mitchell:
Now, how did he get the computer
to learn to play checkers?

137
00:06:36,300 --> 00:06:37,860
Tom Mitchell:
His program learned an

138
00:06:37,860 --> 00:06:40,480
Tom Mitchell:
evaluation function that would

139
00:06:40,480 --> 00:06:43,779
Tom Mitchell:
assign a numerical score to any

140
00:06:43,779 --> 00:06:46,819
Tom Mitchell:
checkers position, and that

141
00:06:46,819 --> 00:06:48,160
Tom Mitchell:
score would be higher, the

142
00:06:48,160 --> 00:06:49,740
Tom Mitchell:
better the checkers position

143
00:06:49,740 --> 00:06:50,500
Tom Mitchell:
was,

144
00:06:50,500 --> 00:06:54,040
Tom Mitchell:
from your point of view as
you're playing the game. And

145
00:06:54,040 --> 00:06:56,639
Tom Mitchell:
then you would use that to
control a search,

146
00:06:56,639 --> 00:07:01,560
Tom Mitchell:
a look-ahead search for which
move to proceed to take. That

147
00:07:01,560 --> 00:07:07,759
Tom Mitchell:
evaluation function was a linear
weighted combination of board

148
00:07:07,759 --> 00:07:09,519
Tom Mitchell:
features that he made up,

149
00:07:09,519 --> 00:07:13,100
Tom Mitchell:
things like how many checkers
are on the board that are mine,

150
00:07:13,100 --> 00:07:17,399
Tom Mitchell:
how many are on the board that
are yours, and so forth.
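
The evaluation function described here can be sketched in a few lines of Python. This is only an illustration, not Samuel's actual program: the feature names, the weight values, and the one-step look-ahead are all assumptions made for the example, whereas the original program searched deeper and learned its weights through self-play.

```python
# Hypothetical board features and hand-picked weights, for illustration only.
WEIGHTS = {
    "my_pieces": 1.0,
    "your_pieces": -1.0,
    "my_kings": 1.5,
    "your_kings": -1.5,
}


def evaluate(features, weights=WEIGHTS):
    """Score a position as a linear weighted combination of its features."""
    return sum(weights[name] * value for name, value in features.items())


def choose_move(candidate_positions, weights=WEIGHTS):
    """One-step look-ahead: pick the move whose resulting position scores highest."""
    return max(
        candidate_positions,
        key=lambda move: evaluate(candidate_positions[move], weights),
    )


# Two hypothetical moves and the feature values of the positions they lead to.
positions = {
    "advance": {"my_pieces": 12, "your_pieces": 12, "my_kings": 0, "your_kings": 0},
    "capture": {"my_pieces": 12, "your_pieces": 11, "my_kings": 0, "your_kings": 0},
}
best = choose_move(positions)
```

In a learning setting the weights would be adjusted over many games rather than fixed by hand as they are here.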

151
00:07:17,399 --> 00:07:19,360
Tom Mitchell:
So his program learned.

152
00:07:19,360 --> 00:07:22,439
Tom Mitchell:
What it learned was that
evaluation function.

153
00:07:22,439 --> 00:07:24,040
Tom Mitchell:
How did it learn it?

154
00:07:24,040 --> 00:07:27,160
Tom Mitchell:
By playing games against itself.

155
00:07:27,160 --> 00:07:31,079
Tom Mitchell:
And he points out that in eight
to ten hours, it could learn

156
00:07:31,079 --> 00:07:32,920
Tom Mitchell:
well enough to beat him.

157
00:07:34,000 --> 00:07:37,560
Tom Mitchell:
Those ideas persisted through
the decades.

158
00:07:37,560 --> 00:07:42,160
Tom Mitchell:
They became reused over and
over, including in the computer

159
00:07:42,160 --> 00:07:46,560
Tom Mitchell:
programs that finally beat the
World Chess Champion and the

160
00:07:46,560 --> 00:07:51,279
Tom Mitchell:
World Backgammon Champion and
the World Go Champion.

161
00:07:51,279 --> 00:07:54,420
Tom Mitchell:
So those ideas were really
seminal.

162
00:07:54,420 --> 00:07:55,819
Tom Mitchell:
A second thing that happened in

163
00:07:55,819 --> 00:07:58,699
Tom Mitchell:
the fifties was the invention of

164
00:07:58,699 --> 00:08:01,379
Tom Mitchell:
the first early version of

165
00:08:01,379 --> 00:08:03,819
Tom Mitchell:
neural networks by Frank

166
00:08:03,819 --> 00:08:06,079
Tom Mitchell:
Rosenblum, I'm sorry,

167
00:08:06,079 --> 00:08:09,139
Tom Mitchell:
Frank Rosenblatt from Cornell,

168
00:08:09,139 --> 00:08:12,079
Tom Mitchell:
and he was interested in

169
00:08:12,079 --> 00:08:13,699
Tom Mitchell:
neuroscience.

170
00:08:13,699 --> 00:08:20,019
Tom Mitchell:
How can the neurons in the
brain be used to learn?

171
00:08:20,019 --> 00:08:25,339
Tom Mitchell:
And he ended up building a
simple, uh, at least by today's

172
00:08:25,339 --> 00:08:32,379
Tom Mitchell:
standards, simple neural network
that consisted of, uh, one layer

173
00:08:32,379 --> 00:08:37,379
Tom Mitchell:
of neurons where, uh, there
would be a receptive field, uh,

174
00:08:37,379 --> 00:08:43,120
Tom Mitchell:
input, say an image, and then
the neurons would respond to

175
00:08:43,120 --> 00:08:48,100
Tom Mitchell:
that and produce an output set
of neuron firings.

176
00:08:48,100 --> 00:08:50,690
Tom Mitchell:
What got learned in that case

177
00:08:50,690 --> 00:08:52,450
Tom Mitchell:
were the connection strengths

178
00:08:52,450 --> 00:08:54,950
Tom Mitchell:
between the input to the neuron

179
00:08:54,950 --> 00:08:56,610
Tom Mitchell:
and the probability that it

180
00:08:56,610 --> 00:08:57,570
Tom Mitchell:
would fire.

181
00:08:58,570 --> 00:09:00,289
Tom Mitchell:
And the way he trained it was

182
00:09:00,289 --> 00:09:01,990
Tom Mitchell:
what we now call supervised

183
00:09:01,990 --> 00:09:03,370
Tom Mitchell:
learning.

184
00:09:03,370 --> 00:09:07,889
Tom Mitchell:
You show an input and what
the output should be.

185
00:09:07,889 --> 00:09:13,250
Tom Mitchell:
And he had schemes for updating
those weights to fit the data.

186
00:09:13,250 --> 00:09:15,049
Tom Mitchell:
Now, the importance of this

187
00:09:15,049 --> 00:09:18,129
Tom Mitchell:
work is that it catalyzed a

188
00:09:18,129 --> 00:09:19,529
Tom Mitchell:
whole bunch of work in the

189
00:09:19,529 --> 00:09:21,450
Tom Mitchell:
nineteen sixties, for the next

190
00:09:21,450 --> 00:09:23,570
Tom Mitchell:
decade, looking at different

191
00:09:23,570 --> 00:09:26,129
Tom Mitchell:
algorithms for tuning the

192
00:09:26,129 --> 00:09:28,950
Tom Mitchell:
weights of perceptron style

193
00:09:28,950 --> 00:09:30,009
Tom Mitchell:
systems.

194
00:09:31,649 --> 00:09:34,490
Tom Mitchell:
That work proceeded for a

195
00:09:34,490 --> 00:09:37,690
Tom Mitchell:
decade or so, and at the end of

196
00:09:37,690 --> 00:09:40,629
Tom Mitchell:
the nineteen sixties, two MIT

197
00:09:40,629 --> 00:09:42,970
Tom Mitchell:
scientists, Marvin Minsky and

198
00:09:42,970 --> 00:09:45,450
Tom Mitchell:
Seymour Papert, wrote a book

199
00:09:45,450 --> 00:09:47,289
Tom Mitchell:
called Perceptrons.

200
00:09:47,289 --> 00:09:49,549
Tom Mitchell:
But unfortunately, that book

201
00:09:49,549 --> 00:09:52,389
Tom Mitchell:
proved that a single layer

202
00:09:52,389 --> 00:09:54,370
Tom Mitchell:
perceptron, which is the only

203
00:09:54,370 --> 00:09:55,730
Tom Mitchell:
thing we knew how to train at

204
00:09:55,730 --> 00:09:58,549
Tom Mitchell:
that point, uh, could never even

205
00:09:58,549 --> 00:10:02,470
Tom Mitchell:
represent many, many

206
00:10:02,470 --> 00:10:03,830
Tom Mitchell:
functions that we wanted to

207
00:10:03,830 --> 00:10:04,549
Tom Mitchell:
learn.

208
00:10:04,549 --> 00:10:09,450
Tom Mitchell:
It could only represent linear
functions, not even, uh,

209
00:10:09,450 --> 00:10:13,669
Tom Mitchell:
exclusive or, you know, where
the input could be...

210
00:10:13,669 --> 00:10:14,870
Tom Mitchell:
the output would be one

211
00:10:14,870 --> 00:10:18,789
Tom Mitchell:
if input one is a one and the
other is a zero, or if it's a

212
00:10:18,789 --> 00:10:20,070
Tom Mitchell:
zero and a one,

213
00:10:20,070 --> 00:10:23,269
Tom Mitchell:
but the output would have to be
zero if they were both one.

214
00:10:23,269 --> 00:10:25,149
Tom Mitchell:
You can't even represent that

215
00:10:25,149 --> 00:10:28,149
Tom Mitchell:
simple function with a

216
00:10:28,149 --> 00:10:29,950
Tom Mitchell:
perceptron no matter how you

217
00:10:29,950 --> 00:10:31,190
Tom Mitchell:
train it.
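
The exclusive-or limitation can be seen in a small experiment. The sketch below uses the classic perceptron update rule on a single threshold unit with two inputs; the function name, the epoch limit, and the data layout are assumptions for illustration, not details taken from the book. Training on AND reaches an error-free pass, while training on exclusive or never does, because no single line separates its positive and negative examples.

```python
def train_perceptron(examples, epochs=1000):
    """Perceptron rule for a two-input threshold unit.

    Returns ((w1, w2, bias), converged), where converged is True if some
    full pass over the examples produced no classification errors.
    """
    w1 = w2 = bias = 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in examples:
            output = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            delta = target - output
            if delta != 0:
                # Nudge the weights toward the correct answer.
                w1 += delta * x1
                w2 += delta * x2
                bias += delta
                errors += 1
        if errors == 0:
            return (w1, w2, bias), True
    return (w1, w2, bias), False


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, and_converged = train_perceptron(AND_DATA)
_, xor_converged = train_perceptron(XOR_DATA)
```

Raising the epoch limit does not help in the exclusive-or case; the failure comes from what the unit can represent, not from how long it is trained.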

218
00:10:31,190 --> 00:10:33,070
Tom Mitchell:
So this really kind of put the

219
00:10:33,070 --> 00:10:36,649
Tom Mitchell:
kibosh on work on perceptrons,

220
00:10:36,649 --> 00:10:38,750
Tom Mitchell:
uh, following the publication of

221
00:10:38,750 --> 00:10:39,629
Tom Mitchell:
this book.

222
00:10:40,669 --> 00:10:44,190
Tom Mitchell:
Now, if we're not going to be
able or don't want to spend our

223
00:10:44,190 --> 00:10:49,769
Tom Mitchell:
time figuring out how to learn
perceptrons, then what's next?

224
00:10:49,769 --> 00:10:50,889
Tom Mitchell:
Well, it turned out one of

225
00:10:50,889 --> 00:10:53,809
Tom Mitchell:
Minsky's PhD students, Patrick

226
00:10:53,809 --> 00:10:55,049
Tom Mitchell:
Winston,

227
00:10:55,049 --> 00:10:56,769
Tom Mitchell:
the next year published his

228
00:10:56,769 --> 00:10:59,450
Tom Mitchell:
thesis, and Winston suggested

229
00:10:59,450 --> 00:11:00,529
Tom Mitchell:
that instead of learning

230
00:11:00,529 --> 00:11:03,769
Tom Mitchell:
perceptron type representations

231
00:11:03,769 --> 00:11:05,809
Tom Mitchell:
of information, we should learn

232
00:11:05,809 --> 00:11:07,809
Tom Mitchell:
symbolic descriptions.

233
00:11:07,809 --> 00:11:13,409
Tom Mitchell:
And so his program, uh, in his
thesis, he showed how his

234
00:11:13,409 --> 00:11:18,210
Tom Mitchell:
program could learn descriptions
of different physical structures

235
00:11:18,210 --> 00:11:20,850
Tom Mitchell:
like an arch or a tower.

236
00:11:20,850 --> 00:11:25,009
Tom Mitchell:
And he would train the program
by showing it line drawings of

237
00:11:25,009 --> 00:11:31,570
Tom Mitchell:
positive and negative examples
of, uh, in this example, arches.

238
00:11:31,570 --> 00:11:35,690
Tom Mitchell:
And then the program would
process those incrementally

239
00:11:35,690 --> 00:11:40,809
Tom Mitchell:
arriving examples to produce a
symbolic description that would

240
00:11:40,809 --> 00:11:44,529
Tom Mitchell:
describe the different parts and
relations among them.

241
00:11:44,529 --> 00:11:49,169
Tom Mitchell:
For example, an arch could be
two rectangles which don't touch

242
00:11:49,169 --> 00:11:54,629
Tom Mitchell:
each other, but which jointly
support a roof of any shape.

243
00:11:55,789 --> 00:11:57,789
Tom Mitchell:
So this was an important step

244
00:11:57,789 --> 00:11:59,769
Tom Mitchell:
because it shifted the focus

245
00:11:59,769 --> 00:12:02,629
Tom Mitchell:
onto learning a much richer kind

246
00:12:02,629 --> 00:12:04,870
Tom Mitchell:
of representation, symbolic

247
00:12:04,870 --> 00:12:05,870
Tom Mitchell:
descriptions.

248
00:12:05,870 --> 00:12:08,529
Tom Mitchell:
And this became the new paradigm

249
00:12:08,529 --> 00:12:10,669
Tom Mitchell:
which dominated the nineteen

250
00:12:10,669 --> 00:12:12,669
Tom Mitchell:
seventies.

251
00:12:12,669 --> 00:12:13,950
Tom Mitchell:
So during the seventies, there

252
00:12:13,950 --> 00:12:15,190
Tom Mitchell:
were a number of people working

253
00:12:15,190 --> 00:12:16,830
Tom Mitchell:
on learning symbolic

254
00:12:16,830 --> 00:12:18,110
Tom Mitchell:
descriptions.

255
00:12:18,110 --> 00:12:23,950
Tom Mitchell:
My favorite is the metaphor
program, developed by Bruce

256
00:12:23,950 --> 00:12:26,590
Tom Mitchell:
Buchanan at Stanford.

257
00:12:26,590 --> 00:12:30,549
Tom Mitchell:
This program, again, was a
symbolic learning program.

258
00:12:30,549 --> 00:12:35,990
Tom Mitchell:
What it learned was rules that
would predict how molecules

259
00:12:35,990 --> 00:12:40,570
Tom Mitchell:
would shatter inside a mass
spectrometer, and therefore

260
00:12:40,570 --> 00:12:44,750
Tom Mitchell:
predict what the mass spectrum
of a new molecule would be.

261
00:12:44,750 --> 00:12:48,539
Tom Mitchell:
And those rules again described,

262
00:12:48,539 --> 00:12:51,080
Tom Mitchell:
symbolically described a

263
00:12:51,080 --> 00:12:54,259
Tom Mitchell:
subgraph of atoms within the

264
00:12:54,259 --> 00:12:56,259
Tom Mitchell:
molecular graph.

265
00:12:56,259 --> 00:13:01,179
Tom Mitchell:
And the rules would say, if you
find this subgraph, then

266
00:13:01,179 --> 00:13:07,059
Tom Mitchell:
specific bonds in that subgraph
are likely to fragment when you

267
00:13:07,059 --> 00:13:10,299
Tom Mitchell:
put this in a mass spectrometer.

268
00:13:10,299 --> 00:13:13,100
Tom Mitchell:
And this was an important step
forward.

269
00:13:13,100 --> 00:13:16,340
Tom Mitchell:
I asked Bruce Buchanan, how well
did it work?

270
00:13:16,340 --> 00:13:20,740
Tom Mitchell:
What was this program able to do
in terms of, did it work?

271
00:13:21,019 --> 00:13:29,179
Bruce Buchanan:
Well, for one small class of
steroid molecules, the

272
00:13:29,179 --> 00:13:31,580
Bruce Buchanan:
ketoandrostanes, if you will,

273
00:13:31,580 --> 00:13:35,659
Bruce Buchanan:
uh, we had, uh, fewer than a

274
00:13:35,659 --> 00:13:38,860
Bruce Buchanan:
dozen spectra, and we were able

275
00:13:38,860 --> 00:13:42,700
Bruce Buchanan:
to tease out the rules that

276
00:13:42,700 --> 00:13:47,679
Bruce Buchanan:
determine, uh, how a new

277
00:13:47,679 --> 00:13:51,320
Bruce Buchanan:
ketoandrostane would fragment in a

278
00:13:51,320 --> 00:13:55,200
Bruce Buchanan:
mass spectrometer, and we were

279
00:13:55,200 --> 00:13:57,360
Bruce Buchanan:
able to publish that set of

280
00:13:57,360 --> 00:13:59,259
Bruce Buchanan:
rules in a refereed chemical

281
00:13:59,259 --> 00:14:02,080
Bruce Buchanan:
journal, chemistry

282
00:14:02,080 --> 00:14:02,679
Bruce Buchanan:
journal.

283
00:14:02,679 --> 00:14:03,360
Bruce Buchanan:
Sorry.

284
00:14:04,559 --> 00:14:07,320
Bruce Buchanan:
Uh, and it was, to our

285
00:14:07,320 --> 00:14:10,840
Bruce Buchanan:
knowledge, the first time that

286
00:14:10,840 --> 00:14:12,879
Bruce Buchanan:
the result of a machine learning

287
00:14:12,879 --> 00:14:15,720
Bruce Buchanan:
program, symbolic learning, had

288
00:14:15,720 --> 00:14:17,960
Bruce Buchanan:
been published, uh, in a

289
00:14:17,960 --> 00:14:19,159
Bruce Buchanan:
refereed journal.

290
00:14:19,279 --> 00:14:22,399
Tom Mitchell:
So that was an important
milestone for machine learning,

291
00:14:22,399 --> 00:14:25,559
Tom Mitchell:
really, the first time that a
program discovered some

292
00:14:25,559 --> 00:14:31,000
Tom Mitchell:
knowledge that was useful enough
to get published in that domain.

293
00:14:31,000 --> 00:14:33,139
Tom Mitchell:
Now, it turned out, personal note,

294
00:14:33,139 --> 00:14:35,360
Tom Mitchell:
I was a PhD student at Stanford

295
00:14:35,360 --> 00:14:37,759
Tom Mitchell:
at the time, and Bruce became my

296
00:14:37,759 --> 00:14:41,039
Tom Mitchell:
PhD advisor, so my PhD thesis

297
00:14:41,039 --> 00:14:45,539
Tom Mitchell:
was also built around this same

298
00:14:45,539 --> 00:14:46,940
Tom Mitchell:
data set.

299
00:14:46,940 --> 00:14:52,019
Tom Mitchell:
And for my thesis I developed a
system called Version Spaces

300
00:14:52,019 --> 00:14:55,220
Tom Mitchell:
that was the first symbolic
learning algorithm where you

301
00:14:55,220 --> 00:14:59,500
Tom Mitchell:
could prove that it would
converge, and furthermore, that

302
00:14:59,500 --> 00:15:02,940
Tom Mitchell:
the learner would know when it
had converged, so it would know

303
00:15:02,940 --> 00:15:04,259
Tom Mitchell:
it was done.

304
00:15:04,259 --> 00:15:06,659
Tom Mitchell:
And it did that by maintaining

305
00:15:06,659 --> 00:15:08,740
Tom Mitchell:
not just one hypothesis that it

306
00:15:08,740 --> 00:15:10,720
Tom Mitchell:
would modify, but by keeping

307
00:15:10,720 --> 00:15:13,860
Tom Mitchell:
track of every hypothesis

308
00:15:13,860 --> 00:15:15,899
Tom Mitchell:
consistent with the data that it

309
00:15:15,899 --> 00:15:16,860
Tom Mitchell:
had seen.
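
The idea of tracking every hypothesis consistent with the data can be illustrated on a toy problem. The sketch below is a simplification under stated assumptions: the two-attribute domain, the wildcard notation, and the brute-force enumeration are invented for the example, and a practical version-space learner would represent the set more compactly rather than listing every member.

```python
from itertools import product

# Hypothetical toy domain: each instance is a (color, shape) pair.
ATTRIBUTE_VALUES = [("red", "blue"), ("round", "square")]
WILDCARD = "?"  # matches any value of that attribute


def all_hypotheses():
    """Every conjunction of attribute constraints, with wildcards allowed."""
    options = [values + (WILDCARD,) for values in ATTRIBUTE_VALUES]
    return list(product(*options))


def matches(hypothesis, instance):
    """True if the hypothesis classifies the instance as positive."""
    return all(h == WILDCARD or h == x for h, x in zip(hypothesis, instance))


def update(version_space, instance, label):
    """Keep only the hypotheses that agree with the newly labeled example."""
    return [h for h in version_space if matches(h, instance) == label]


version_space = all_hypotheses()
training_data = [
    (("red", "round"), True),
    (("blue", "round"), False),
    (("red", "square"), True),
]
for instance, label in training_data:
    version_space = update(version_space, instance, label)

# The learner knows it is done when exactly one hypothesis remains.
converged = len(version_space) == 1
```

Because the whole set is kept, the learner can tell the difference between "one hypothesis fits so far" and "only one hypothesis can fit", which is what lets it know it has converged.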

310
00:15:16,860 --> 00:15:20,360
Tom Mitchell:
And this also opened up the
possibility of what we call

311
00:15:20,360 --> 00:15:22,340
Tom Mitchell:
today active learning.

312
00:15:22,340 --> 00:15:24,779
Tom Mitchell:
It made it easy for the system

313
00:15:24,779 --> 00:15:26,700
Tom Mitchell:
to play twenty questions with

314
00:15:26,700 --> 00:15:27,860
Tom Mitchell:
the teacher.

315
00:15:27,860 --> 00:15:32,200
Tom Mitchell:
Uh, it could ask the teacher,
please label this example so

316
00:15:32,200 --> 00:15:37,500
Tom Mitchell:
that in a way, uh, it could
reduce the set of hypotheses as

317
00:15:37,500 --> 00:15:39,539
Tom Mitchell:
quickly as possible.
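
The "twenty questions" strategy can be sketched as choosing the example on which the remaining hypotheses disagree most evenly, so that either answer from the teacher removes about half of them. The threshold hypotheses, the function names, and the simulated teacher below are assumptions made for illustration; they are not taken from the thesis.

```python
def predict(threshold, x):
    """A hypothetical hypothesis: label x positive when it reaches the threshold."""
    return x >= threshold


def best_query(version_space, candidates):
    """Pick the candidate whose predicted labels split the version space most evenly."""
    def imbalance(x):
        positives = sum(predict(t, x) for t in version_space)
        return abs(2 * positives - len(version_space))
    return min(candidates, key=imbalance)


def learn(true_threshold, thresholds, candidates):
    """Query a simulated teacher until one hypothesis is left.

    Returns (learned_threshold, number_of_queries).
    """
    version_space = list(thresholds)
    queries = 0
    while len(version_space) > 1:
        x = best_query(version_space, candidates)
        label = predict(true_threshold, x)  # the teacher labels the example
        version_space = [t for t in version_space if predict(t, x) == label]
        queries += 1
    return version_space[0], queries


learned, asked = learn(true_threshold=6, thresholds=range(1, 9), candidates=range(0, 9))
```

With eight candidate hypotheses, each well-chosen question halves the set, so three questions are enough here; labeling examples in arbitrary order could take more.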

318
00:15:39,539 --> 00:15:43,370
Tom Mitchell:
So by the end of the seventies,
there seemed to be enough work

319
00:15:43,370 --> 00:15:47,690
Tom Mitchell:
going on in the field that it
was time to hold a meeting.

320
00:15:47,690 --> 00:15:50,450
Tom Mitchell:
And so we organized the first

321
00:15:50,450 --> 00:15:53,009
Tom Mitchell:
workshop in machine learning. It was

322
00:15:53,009 --> 00:15:56,509
Tom Mitchell:
held here at CMU at Wean Hall,

323
00:15:56,509 --> 00:15:57,690
Tom Mitchell:
a couple of buildings that

324
00:15:57,690 --> 00:16:00,129
Tom Mitchell:
direction, and it was organized

325
00:16:00,129 --> 00:16:02,929
Tom Mitchell:
by Jaime Carbonell, who was an

326
00:16:02,929 --> 00:16:04,529
Tom Mitchell:
assistant professor here at the

327
00:16:04,529 --> 00:16:05,370
Tom Mitchell:
time,

328
00:16:05,370 --> 00:16:07,889
Tom Mitchell:
Richard Michalski, who was a more

329
00:16:07,889 --> 00:16:10,730
Tom Mitchell:
senior professor at Illinois, and

330
00:16:10,730 --> 00:16:12,809
Tom Mitchell:
myself. I was at the time an

331
00:16:12,809 --> 00:16:15,690
Tom Mitchell:
assistant professor at Rutgers

332
00:16:15,690 --> 00:16:17,490
Tom Mitchell:
University.

333
00:16:17,490 --> 00:16:21,490
Tom Mitchell:
And so we held this meeting,
pulled together some people.

334
00:16:21,490 --> 00:16:26,090
Tom Mitchell:
One of the people who attended
was a student of Richard

335
00:16:26,090 --> 00:16:29,210
Tom Mitchell:
Michalski named Tom Dietterich.

336
00:16:29,210 --> 00:16:30,889
Tom Mitchell:
And Tom went on to make many

337
00:16:30,889 --> 00:16:33,210
Tom Mitchell:
contributions in the field of

338
00:16:33,210 --> 00:16:34,610
Tom Mitchell:
machine learning.

339
00:16:34,610 --> 00:16:39,009
Tom Mitchell:
And so I asked Tom, what was the
field like in nineteen eighty?

340
00:16:38,850 --> 00:16:40,909
Tom Dietterich:
I'd say it was really chaotic.

341
00:16:40,909 --> 00:16:43,809
Tom Dietterich:
You know, I

342
00:16:43,809 --> 00:16:45,629
Tom Dietterich:
attended that very first machine

343
00:16:45,629 --> 00:16:46,909
Tom Dietterich:
learning workshop that was

344
00:16:46,909 --> 00:16:47,549
Tom Dietterich:
organized.

345
00:16:47,549 --> 00:16:48,830
Tom Dietterich:
I think you were one of the core

346
00:16:48,830 --> 00:16:50,830
Tom Dietterich:
organizers at CMU, and there

347
00:16:50,830 --> 00:16:52,029
Tom Dietterich:
were probably thirty people in

348
00:16:52,029 --> 00:16:54,669
Tom Dietterich:
the room and, uh, and probably

349
00:16:54,669 --> 00:16:55,990
Tom Dietterich:
thirty completely different

350
00:16:55,990 --> 00:16:56,750
Tom Dietterich:
talks.

351
00:16:56,750 --> 00:17:01,909
Tom Dietterich:
You know, I remember, I was talking

352
00:17:01,909 --> 00:17:06,910
Tom Dietterich:
about, I had done a sort of algorithm
comparison paper

353
00:17:06,910 --> 00:17:10,829
Tom Dietterich:
that I published at IJCAI
seventy nine, I think.

354
00:17:10,829 --> 00:17:16,069
Tom Dietterich:
So just before that workshop,
in which I was, by

355
00:17:16,069 --> 00:17:18,809
Tom Dietterich:
hand, executing these very simple
algorithms for this kind of

356
00:17:18,809 --> 00:17:23,269
Tom Dietterich:
subgraph learning problem, uh,
and comparing how many subgraph

357
00:17:23,269 --> 00:17:24,750
Tom Dietterich:
isomorphism calculations they
had to do.

358
00:17:24,750 --> 00:17:26,150
Tom Dietterich:
But it was like the first

359
00:17:26,150 --> 00:17:27,410
Tom Dietterich:
attempt to actually compare

360
00:17:27,410 --> 00:17:28,349
Tom Dietterich:
multiple machine learning

361
00:17:28,349 --> 00:17:29,269
Tom Dietterich:
algorithms that were more or

362
00:17:29,269 --> 00:17:30,230
Tom Dietterich:
less trying to do the same

363
00:17:30,230 --> 00:17:30,990
Tom Dietterich:
thing.

364
00:17:30,990 --> 00:17:35,910
Tom Dietterich:
There were a couple of them
there, and, you

365
00:17:35,910 --> 00:17:39,700
Tom Dietterich:
know, I think John Anderson
was there talking about, you

366
00:17:39,700 --> 00:17:41,569
Tom Dietterich:
know, cognitive models.

367
00:17:41,569 --> 00:17:43,529
Tom Dietterich:
You were there talking about

368
00:17:43,529 --> 00:17:45,009
Tom Dietterich:
the beginnings of EBL and the

369
00:17:45,009 --> 00:17:47,470
Tom Dietterich:
LEX system for

370
00:17:47,470 --> 00:17:49,529
Tom Dietterich:
calculus, symbolic

371
00:17:49,529 --> 00:17:50,369
Tom Dietterich:
integration.

372
00:17:50,369 --> 00:17:56,329
Tom Dietterich:
You know, I remember
the most interesting talk I

373
00:17:56,329 --> 00:18:01,210
Tom Dietterich:
thought was Ross Quinlan's talk
on, on ID3, where he was

374
00:18:01,210 --> 00:18:04,529
Tom Dietterich:
trying to take these
reverse-enumerated chess endgames

375
00:18:04,529 --> 00:18:06,410
Tom Dietterich:
and learn decision trees

376
00:18:06,410 --> 00:18:08,009
Tom Dietterich:
that would completely,

377
00:18:08,009 --> 00:18:11,730
Tom Dietterich:
exactly, losslessly,

378
00:18:11,730 --> 00:18:14,230
Tom Dietterich:
basically compress those

379
00:18:14,230 --> 00:18:16,609
Tom Dietterich:
giant tables into a small

380
00:18:16,609 --> 00:18:17,930
Tom Dietterich:
decision tree.

381
00:18:17,930 --> 00:18:20,609
Tom Dietterich:
A really important thing people
should understand in those days

382
00:18:20,609 --> 00:18:24,890
Tom Dietterich:
was we believed there
was a right answer for our

383
00:18:24,890 --> 00:18:26,490
Tom Dietterich:
machine learning problems.

384
00:18:26,490 --> 00:18:29,269
Tom Dietterich:
And we would,

385
00:18:29,269 --> 00:18:30,490
Tom Dietterich:
it would often happen that I

386
00:18:30,490 --> 00:18:33,730
Tom Dietterich:
would run like the algorithms

387
00:18:33,730 --> 00:18:35,009
Tom Dietterich:
and it would not get the right

388
00:18:35,009 --> 00:18:35,690
Tom Dietterich:
answer.

389
00:18:35,690 --> 00:18:38,759
Tom Dietterich:
It would not get the, the
logical expression that we

390
00:18:38,759 --> 00:18:39,880
Tom Dietterich:
thought was the right answer.

391
00:18:39,880 --> 00:18:42,920
Tom Dietterich:
It would get something that was
really, actually equally

392
00:18:42,920 --> 00:18:44,680
Tom Dietterich:
accurate on the training data.

393
00:18:45,839 --> 00:18:47,119
Tom Dietterich:
And actually it worked

394
00:18:47,119 --> 00:18:48,480
Tom Dietterich:
pretty well although we

395
00:18:48,480 --> 00:18:49,799
Tom Dietterich:
didn't really have a set idea of

396
00:18:49,799 --> 00:18:51,079
Tom Dietterich:
a separate test set in those

397
00:18:51,079 --> 00:18:51,599
Tom Dietterich:
days.

398
00:18:51,599 --> 00:18:54,160
Tom Dietterich:
I mean, it was not a field of
statistics.

399
00:18:54,160 --> 00:18:56,400
Tom Dietterich:
It was, the idea was right.

400
00:18:56,400 --> 00:19:00,099
Tom Dietterich:
We were coming out of the,
really the John McCarthy program

401
00:19:00,099 --> 00:19:02,920
Tom Dietterich:
of programs with common sense,
which didn't have a lot to do

402
00:19:02,920 --> 00:19:05,799
Tom Dietterich:
with common sense, but was about
we're going to represent

403
00:19:05,799 --> 00:19:09,279
Tom Dietterich:
everything in logic, and we're
going to use logical inference

404
00:19:09,279 --> 00:19:11,039
Tom Dietterich:
as the execution engine.

405
00:19:11,240 --> 00:19:14,240
Tom Mitchell:
So there's Tom's take on what
things were like.

406
00:19:14,240 --> 00:19:18,200
Tom Mitchell:
He mentioned that he thought
the most interesting talk was

407
00:19:18,200 --> 00:19:19,920
Tom Mitchell:
Ross Quinlan's talk.

408
00:19:19,920 --> 00:19:23,759
Tom Mitchell:
I agree, I thought that was the
most interesting talk.

409
00:19:23,759 --> 00:19:26,680
Tom Mitchell:
Ross's talk presented the idea

410
00:19:26,680 --> 00:19:28,839
Tom Mitchell:
that we should learn decision

411
00:19:28,839 --> 00:19:29,839
Tom Mitchell:
trees.

412
00:19:29,839 --> 00:19:33,859
Tom Mitchell:
A decision tree is something
where you classify your example

413
00:19:33,859 --> 00:19:38,259
Tom Mitchell:
by putting it at the root of the
tree, and then you sort it down

414
00:19:38,259 --> 00:19:42,660
Tom Mitchell:
to a leaf in the tree based on
its features, and the leaf tells

415
00:19:42,660 --> 00:19:46,500
Tom Mitchell:
you what the output
classification label should be.

416
00:19:46,500 --> 00:19:48,099
Tom Mitchell:
That's what gets learned.

417
00:19:48,099 --> 00:19:49,500
Tom Mitchell:
What gets learned?

418
00:19:49,500 --> 00:19:53,900
Tom Mitchell:
So I asked Ross how he came
up with this idea.

419
00:19:56,420 --> 00:20:01,700
JR Quinlan:
I had done a PhD under a
psychologist, Earl Hunt.

420
00:20:01,700 --> 00:20:05,740
JR Quinlan:
And part of his work involved
decision trees, which I learned

421
00:20:05,740 --> 00:20:10,539
JR Quinlan:
about, of course, as a student,
but then put in the back of my

422
00:20:10,539 --> 00:20:14,339
JR Quinlan:
mind for fifteen years or so.

423
00:20:14,339 --> 00:20:16,980
JR Quinlan:
And then I was at Stanford on

424
00:20:16,980 --> 00:20:18,980
JR Quinlan:
sabbatical at the same time as

425
00:20:18,980 --> 00:20:19,299
JR Quinlan:
Donald

426
00:20:19,299 --> 00:20:25,779
JR Quinlan:
Michie was teaching a course on
learning, and he had a challenge

427
00:20:25,779 --> 00:20:29,460
JR Quinlan:
for the class on which, you
know, I sat in on the class and

428
00:20:29,460 --> 00:20:37,920
JR Quinlan:
the challenge was to work
out a way of predicting a win in

429
00:20:37,920 --> 00:20:40,319
JR Quinlan:
a very simple chess endgame.

430
00:20:40,319 --> 00:20:43,119
JR Quinlan:
King rook versus king knight.

431
00:20:43,119 --> 00:20:47,720
JR Quinlan:
So I remembered Earl Hunt's work
on decision trees, and I

432
00:20:47,720 --> 00:20:50,680
JR Quinlan:
thought, well, maybe that would
be the way to go.

433
00:20:50,680 --> 00:20:55,400
JR Quinlan:
So I developed a thing called
ID3, which was just a simple

434
00:20:55,400 --> 00:20:57,279
JR Quinlan:
decision tree program.

435
00:20:57,279 --> 00:21:01,079
JR Quinlan:
No pruning, just straight
decision tree.

436
00:21:01,079 --> 00:21:02,960
JR Quinlan:
And then, uh, that seemed

437
00:21:02,960 --> 00:21:04,119
JR Quinlan:
to solve the problem pretty

438
00:21:04,119 --> 00:21:06,039
JR Quinlan:
well, up to about ninety five

439
00:21:06,039 --> 00:21:07,440
JR Quinlan:
percent.

440
00:21:07,440 --> 00:21:10,799
JR Quinlan:
And then I got that up to one
hundred the next year.

441
00:21:10,799 --> 00:21:14,640
JR Quinlan:
And then remember, the first
real time I talked about this

442
00:21:14,640 --> 00:21:16,359
JR Quinlan:
was at that conference.

443
00:21:16,359 --> 00:21:20,660
JR Quinlan:
You organized the workshop in
nineteen eighty at Pittsburgh,

444
00:21:20,660 --> 00:21:22,359
JR Quinlan:
at Carnegie Mellon.

445
00:21:22,359 --> 00:21:27,720
JR Quinlan:
You, Ryszard and Jaime all,
all set up that workshop.

446
00:21:27,720 --> 00:21:31,559
JR Quinlan:
And then I gave a talk on, uh,
decision tree learning.

447
00:21:32,559 --> 00:21:34,640
Tom Mitchell:
So there's Ross's story.

448
00:21:34,640 --> 00:21:37,099
Tom Mitchell:
He got the idea of decision

449
00:21:37,099 --> 00:21:39,700
Tom Mitchell:
trees from his thesis advisor

450
00:21:39,700 --> 00:21:42,339
Tom Mitchell:
many years earlier, but it turns

451
00:21:42,339 --> 00:21:44,700
Tom Mitchell:
out Ross was the one who came up

452
00:21:44,700 --> 00:21:46,599
Tom Mitchell:
with the algorithm that actually

453
00:21:46,599 --> 00:21:48,859
Tom Mitchell:
successfully discovered useful

454
00:21:48,859 --> 00:21:50,059
Tom Mitchell:
decision trees.

455
00:21:50,059 --> 00:21:54,420
Tom Mitchell:
And that whole idea of decision
tree learning became very

456
00:21:54,420 --> 00:21:56,339
Tom Mitchell:
important in the field.

457
00:21:56,339 --> 00:21:59,660
Tom Mitchell:
By twenty ten, it was probably

458
00:21:59,660 --> 00:22:02,220
Tom Mitchell:
one of the most commercially

459
00:22:02,220 --> 00:22:04,460
Tom Mitchell:
used approaches in machine

460
00:22:04,460 --> 00:22:06,140
Tom Mitchell:
learning.

461
00:22:06,140 --> 00:22:10,339
Tom Mitchell:
So in the early eighties, there
were various experiments like

462
00:22:10,339 --> 00:22:14,359
Tom Mitchell:
these trying to build machine
learning systems, but really no

463
00:22:14,359 --> 00:22:19,200
Tom Mitchell:
theory, no theory that could
tell us, for example, how many

464
00:22:19,200 --> 00:22:23,220
Tom Mitchell:
examples would we have to
present to a learner in order

465
00:22:23,220 --> 00:22:25,819
Tom Mitchell:
for it to reliably learn?

466
00:22:25,819 --> 00:22:27,880
Tom Mitchell:
And that changed in nineteen

467
00:22:27,880 --> 00:22:31,619
Tom Mitchell:
eighty four, when Les Valiant

468
00:22:31,619 --> 00:22:34,690
Tom Mitchell:
published a paper on what he

469
00:22:34,690 --> 00:22:37,289
Tom Mitchell:
calls probably approximately

470
00:22:37,289 --> 00:22:39,450
Tom Mitchell:
correct learning.

471
00:22:39,450 --> 00:22:42,410
Tom Mitchell:
And the idea is it really

472
00:22:42,410 --> 00:22:44,789
Tom Mitchell:
was the first practical theory

473
00:22:44,789 --> 00:22:47,609
Tom Mitchell:
to tell us how many examples you

474
00:22:47,609 --> 00:22:48,730
Tom Mitchell:
would need.

475
00:22:48,730 --> 00:22:52,329
Tom Mitchell:
And in particular,

476
00:22:52,329 --> 00:22:53,970
Tom Mitchell:
the number of

477
00:22:53,970 --> 00:22:56,289
Tom Mitchell:
examples you need depends on

478
00:22:56,289 --> 00:22:57,529
Tom Mitchell:
three things.

479
00:22:57,529 --> 00:23:00,450
Tom Mitchell:
The complexity of your
hypothesis space.

480
00:23:00,450 --> 00:23:04,450
Tom Mitchell:
For example, if you're going to
learn decision trees of depth

481
00:23:04,450 --> 00:23:09,009
Tom Mitchell:
two, that's a lot less complex
than if you're learning decision

482
00:23:09,009 --> 00:23:10,809
Tom Mitchell:
trees of depth twelve.

483
00:23:11,849 --> 00:23:14,009
Tom Mitchell:
So it depends on how complex

484
00:23:14,009 --> 00:23:16,849
Tom Mitchell:
your hypotheses are, depends on

485
00:23:16,849 --> 00:23:18,430
Tom Mitchell:
the error rate you're willing to

486
00:23:18,430 --> 00:23:20,130
Tom Mitchell:
tolerate in the final

487
00:23:20,130 --> 00:23:21,369
Tom Mitchell:
hypothesis.

488
00:23:21,369 --> 00:23:24,289
Tom Mitchell:
One percent error, five percent
error.

489
00:23:24,289 --> 00:23:27,630
Tom Mitchell:
It also depends on the
probability you're willing to

490
00:23:27,630 --> 00:23:28,970
Tom Mitchell:
put up with that,

491
00:23:28,970 --> 00:23:31,190
Tom Mitchell:
if you do choose that many

492
00:23:31,190 --> 00:23:34,089
Tom Mitchell:
randomly provided

493
00:23:34,089 --> 00:23:36,390
Tom Mitchell:
training examples,

494
00:23:36,390 --> 00:23:38,829
Tom Mitchell:
the probability that you'll
still fail.

495
00:23:38,829 --> 00:23:40,569
Tom Mitchell:
You can't guarantee that you

496
00:23:40,569 --> 00:23:42,430
Tom Mitchell:
won't fail, but you can reduce

497
00:23:42,430 --> 00:23:44,349
Tom Mitchell:
that probability.

498
00:23:44,349 --> 00:23:48,609
Tom Mitchell:
So this was a breakthrough in
the area of theoretical

499
00:23:48,609 --> 00:23:51,589
Tom Mitchell:
characterization of algorithms.

500
00:23:51,589 --> 00:23:57,670
Tom Mitchell:
So I asked Les what he
thought was the key idea there.

501
00:23:59,589 --> 00:24:02,589
Leslie Valiant:
It's a kind of a model of
computation.

502
00:24:02,589 --> 00:24:04,950
Leslie Valiant:
But it yeah, it makes sense

503
00:24:04,950 --> 00:24:06,150
Leslie Valiant:
because it's got some

504
00:24:06,150 --> 00:24:06,829
Leslie Valiant:
applications.

505
00:24:06,829 --> 00:24:12,470
Leslie Valiant:
So that's the particular
result which persuaded

506
00:24:12,470 --> 00:24:15,869
Leslie Valiant:
people that there was
something there, is this result

507
00:24:15,869 --> 00:24:20,329
Leslie Valiant:
that if you take a
conjunctive normal form formula,

508
00:24:20,329 --> 00:24:23,789
Leslie Valiant:
which, you know, from NP
completeness at the time, we

509
00:24:23,789 --> 00:24:26,470
Leslie Valiant:
already knew there's some
hardness in it, because if

510
00:24:26,470 --> 00:24:29,609
Leslie Valiant:
someone gave you the formula, it was
computationally difficult to

511
00:24:29,609 --> 00:24:32,490
Leslie Valiant:
find out whether it's a null,
it's the equivalent of a formula

512
00:24:32,490 --> 00:24:38,210
Leslie Valiant:
which is always zero,
which is never satisfiable.

513
00:24:38,210 --> 00:24:40,269
Leslie Valiant:
On the other hand, this was

514
00:24:40,269 --> 00:24:42,730
Leslie Valiant:
kind of this, uh, conjunctive

515
00:24:42,730 --> 00:24:44,390
Leslie Valiant:
normal form formula with three,

516
00:24:44,390 --> 00:24:48,289
Leslie Valiant:
variables in each

517
00:24:48,289 --> 00:24:48,809
Leslie Valiant:
clause.

518
00:24:48,809 --> 00:24:50,970
Leslie Valiant:
Uh, so this was PAC learnable.

519
00:24:50,970 --> 00:24:53,769
Leslie Valiant:
And so this was a bit striking
that something which is very

520
00:24:53,769 --> 00:24:55,089
Leslie Valiant:
hard is learnable.

521
00:24:55,089 --> 00:24:56,950
Leslie Valiant:
But then this, this

522
00:24:56,950 --> 00:24:57,930
Leslie Valiant:
highlighted the difference

523
00:24:57,930 --> 00:25:00,490
Leslie Valiant:
between, uh, computing and uh,

524
00:25:00,490 --> 00:25:03,009
Leslie Valiant:
and learning because so with the

525
00:25:03,009 --> 00:25:04,769
Leslie Valiant:
learning model, the idea was

526
00:25:04,769 --> 00:25:06,009
Leslie Valiant:
that there was a distribution of

527
00:25:06,009 --> 00:25:07,130
Leslie Valiant:
inputs.

528
00:25:07,130 --> 00:25:09,690
Leslie Valiant:
And you learned from this
distribution, but you only have

529
00:25:09,690 --> 00:25:13,089
Leslie Valiant:
to be good on this distribution
when you have to predict.

530
00:25:13,089 --> 00:25:15,650
Leslie Valiant:
So if, for example, in this
formula, there were some very

531
00:25:15,650 --> 00:25:19,210
Leslie Valiant:
rare ones which are so very
rare, then the learner wouldn't

532
00:25:19,210 --> 00:25:20,450
Leslie Valiant:
have to know about that.

533
00:25:20,450 --> 00:25:23,849
Leslie Valiant:
So in this sense this was easier
than the NP completeness.

534
00:25:24,809 --> 00:25:28,089
Tom Mitchell:
So I was actually quite
surprised at that answer.

535
00:25:28,089 --> 00:25:29,130
Tom Mitchell:
What he's saying,

536
00:25:29,130 --> 00:25:32,079
Tom Mitchell:
put another way, is that what was

537
00:25:32,079 --> 00:25:33,640
Tom Mitchell:
really interesting there is that

538
00:25:33,640 --> 00:25:35,980
Tom Mitchell:
for this one kind of hypothesis,

539
00:25:35,980 --> 00:25:38,720
Tom Mitchell:
conjunctive normal form, which

540
00:25:38,720 --> 00:25:40,839
Tom Mitchell:
is a kind of

541
00:25:40,839 --> 00:25:42,640
Tom Mitchell:
logical expression,

542
00:25:42,640 --> 00:25:47,839
Tom Mitchell:
if your hypotheses are of that
form, then it's easier to learn

543
00:25:47,839 --> 00:25:50,799
Tom Mitchell:
them than it is to compute them.

544
00:25:50,799 --> 00:25:54,920
Tom Mitchell:
When he says compute them, what
he means is the cost of

545
00:25:54,920 --> 00:25:59,359
Tom Mitchell:
answering the question, can you
find a positive example of this?

546
00:26:01,039 --> 00:26:06,400
Tom Mitchell:
And it was known at the time
that the computational cost of

547
00:26:06,400 --> 00:26:09,640
Tom Mitchell:
answering that question, is
there a positive example of this

548
00:26:09,640 --> 00:26:15,599
Tom Mitchell:
formula, was exponential in the
size of the formula.

549
00:26:15,599 --> 00:26:17,359
Tom Mitchell:
And then he discovered that

550
00:26:17,359 --> 00:26:19,799
Tom Mitchell:
learning a formula, if somebody

551
00:26:19,799 --> 00:26:20,920
Tom Mitchell:
gives you positive and

552
00:26:20,920 --> 00:26:23,240
Tom Mitchell:
negative examples, only takes

553
00:26:23,240 --> 00:26:26,140
Tom Mitchell:
polynomial, less than exponential,

554
00:26:26,140 --> 00:26:27,279
Tom Mitchell:
time.

555
00:26:27,279 --> 00:26:29,150
Tom Mitchell:
So I agree with him that that's

556
00:26:29,150 --> 00:26:31,720
Tom Mitchell:
a fascinating theoretical fact,

557
00:26:31,720 --> 00:26:34,299
Tom Mitchell:
but that would not be the answer

558
00:26:34,299 --> 00:26:36,059
Tom Mitchell:
I would give about why this

559
00:26:36,059 --> 00:26:37,500
Tom Mitchell:
revolutionized the field of

560
00:26:37,500 --> 00:26:38,900
Tom Mitchell:
machine learning.

561
00:26:38,900 --> 00:26:42,500
Tom Mitchell:
It revolutionized the field, in
my view, because he was the

562
00:26:42,500 --> 00:26:47,619
Tom Mitchell:
first person, really to be able
to come up with a framing, a new

563
00:26:47,619 --> 00:26:52,200
Tom Mitchell:
framing of the machine learning
problem that even allowed this

564
00:26:52,200 --> 00:26:54,500
Tom Mitchell:
kind of theoretical analysis.

565
00:26:54,500 --> 00:26:56,460
Tom Mitchell:
In particular, his framing

566
00:26:56,460 --> 00:26:58,539
Tom Mitchell:
included assumptions like the

567
00:26:58,539 --> 00:27:00,180
Tom Mitchell:
training data would come from

568
00:27:00,180 --> 00:27:03,859
Tom Mitchell:
some source that would give you

569
00:27:03,859 --> 00:27:05,660
Tom Mitchell:
random

570
00:27:05,660 --> 00:27:07,220
Tom Mitchell:
examples according to some

571
00:27:07,220 --> 00:27:09,420
Tom Mitchell:
probability distribution.

572
00:27:09,420 --> 00:27:13,940
Tom Mitchell:
And then later, when you wanted
to test your hypothesis on new

573
00:27:13,940 --> 00:27:18,539
Tom Mitchell:
data, you would get more random
examples from that same source.

574
00:27:18,539 --> 00:27:20,900
Tom Mitchell:
And so he reframed the problem

575
00:27:20,900 --> 00:27:22,059
Tom Mitchell:
in a way that made theory

576
00:27:22,059 --> 00:27:23,099
Tom Mitchell:
possible.

577
00:27:23,099 --> 00:27:26,339
Tom Mitchell:
The consequence of that was he

578
00:27:26,339 --> 00:27:27,970
Tom Mitchell:
catalyzed a huge amount of

579
00:27:27,970 --> 00:27:30,359
Tom Mitchell:
theoretical work in machine

580
00:27:30,359 --> 00:27:33,579
Tom Mitchell:
learning that continues to this day

581
00:27:33,579 --> 00:27:37,000
Tom Mitchell:
and just keeps branching further and

582
00:27:37,000 --> 00:27:37,759
Tom Mitchell:
further.

583
00:27:37,759 --> 00:27:42,099
Tom Mitchell:
There are conferences
specifically designed to cover

584
00:27:42,099 --> 00:27:44,119
Tom Mitchell:
theoretical computer science.

585
00:27:45,400 --> 00:27:49,240
Tom Mitchell:
So the eighties was really a
very generative decade.

586
00:27:49,240 --> 00:27:51,400
Tom Mitchell:
There are a lot of things going
on.

587
00:27:51,400 --> 00:27:54,400
Tom Mitchell:
Another thing that was going on was
some people were looking at

588
00:27:54,400 --> 00:28:00,900
Tom Mitchell:
human learning and how that
might inspire our models of AI

589
00:28:00,900 --> 00:28:02,680
Tom Mitchell:
and machine learning.

590
00:28:02,680 --> 00:28:06,599
Tom Mitchell:
One such effort was here at CMU

591
00:28:06,599 --> 00:28:08,880
Tom Mitchell:
by Allen Newell and his two PhD

592
00:28:08,880 --> 00:28:11,000
Tom Mitchell:
students, John Laird and Paul

593
00:28:11,000 --> 00:28:12,480
Tom Mitchell:
Rosenbloom.

594
00:28:12,480 --> 00:28:15,440
Tom Mitchell:
They took the approach of.

595
00:28:15,440 --> 00:28:16,859
Tom Mitchell:
They built a system they called

596
00:28:16,859 --> 00:28:20,279
Tom Mitchell:
Soar, which was really one of

597
00:28:20,279 --> 00:28:23,779
Tom Mitchell:
the first AI agents designed to

598
00:28:23,779 --> 00:28:26,329
Tom Mitchell:
capture the full breadth of what

599
00:28:26,329 --> 00:28:29,299
Tom Mitchell:
humans do: play games, solve

600
00:28:29,299 --> 00:28:33,420
Tom Mitchell:
problems, many different tasks.

601
00:28:33,420 --> 00:28:35,660
Tom Mitchell:
So they framed their machine

602
00:28:35,660 --> 00:28:37,160
Tom Mitchell:
learning problem as one of

603
00:28:37,160 --> 00:28:38,759
Tom Mitchell:
getting a general agent to

604
00:28:38,759 --> 00:28:39,460
Tom Mitchell:
learn.

605
00:28:39,460 --> 00:28:43,259
Tom Mitchell:
And their architecture had very
interesting properties that I

606
00:28:43,259 --> 00:28:45,180
Tom Mitchell:
think are relevant today,

607
00:28:45,180 --> 00:28:51,779
Tom Mitchell:
now that agents are again a
topic of hot activity. I won't

608
00:28:51,779 --> 00:28:55,599
Tom Mitchell:
go into the details, but in the
podcast there's an interview

609
00:28:55,599 --> 00:28:59,859
Tom Mitchell:
with John Laird who goes into
detail on this.

610
00:28:59,859 --> 00:29:01,819
Tom Mitchell:
Another item that can't be

611
00:29:01,819 --> 00:29:03,779
Tom Mitchell:
overlooked in the eighties was

612
00:29:03,779 --> 00:29:06,140
Tom Mitchell:
really the rebirth of neural

613
00:29:06,140 --> 00:29:07,339
Tom Mitchell:
networks.

614
00:29:07,339 --> 00:29:09,420
Tom Mitchell:
Remember, at the end of the sixties,

615
00:29:09,420 --> 00:29:11,140
Tom Mitchell:
Minsky and Papert published that

616
00:29:11,140 --> 00:29:13,980
Tom Mitchell:
book that killed off work on

617
00:29:13,980 --> 00:29:15,779
Tom Mitchell:
perceptrons?

618
00:29:15,779 --> 00:29:17,960
Tom Mitchell:
Well, in the mid eighties,

619
00:29:17,960 --> 00:29:21,460
Tom Mitchell:
finally, people came up with an

620
00:29:21,460 --> 00:29:23,380
Tom Mitchell:
algorithm that could train not

621
00:29:23,380 --> 00:29:26,170
Tom Mitchell:
just one layer perceptrons, but

622
00:29:26,170 --> 00:29:28,329
Tom Mitchell:
multilayer perceptrons.

623
00:29:28,329 --> 00:29:30,589
Tom Mitchell:
And that allowed learning

624
00:29:30,589 --> 00:29:32,130
Tom Mitchell:
functions that were highly

625
00:29:32,130 --> 00:29:33,930
Tom Mitchell:
non-linear.

626
00:29:33,930 --> 00:29:36,690
Tom Mitchell:
And Dave Rumelhart, Jay

627
00:29:36,690 --> 00:29:39,049
Tom Mitchell:
McClelland and Geoff Hinton were

628
00:29:39,049 --> 00:29:41,450
Tom Mitchell:
three of the ringleaders of this

629
00:29:41,450 --> 00:29:42,410
Tom Mitchell:
effort.

630
00:29:42,410 --> 00:29:45,890
Tom Mitchell:
So I asked Geoff about that
period.

631
00:29:45,890 --> 00:29:48,670
Tom Mitchell:
Now we're up to the mid eighties

632
00:29:48,670 --> 00:29:51,789
Tom Mitchell:
when really neural nets are

633
00:29:51,789 --> 00:29:52,609
Tom Mitchell:
reborn.

634
00:29:52,609 --> 00:29:54,329
Tom Mitchell:
Is that the right word?

635
00:29:54,329 --> 00:29:55,009
Tom Mitchell:
How would you.

636
00:29:55,170 --> 00:29:57,289
Geoffrey Hinton:
Backprop. With backpropagation,

637
00:29:57,289 --> 00:29:58,410
Geoffrey Hinton:
I mean, we didn't invent it.

638
00:29:58,410 --> 00:30:00,049
Geoffrey Hinton:
It was invented by several different

639
00:30:00,049 --> 00:30:03,029
Geoffrey Hinton:
groups, but we showed that it

640
00:30:03,029 --> 00:30:04,369
Geoffrey Hinton:
really worked to learn

641
00:30:04,369 --> 00:30:05,609
Geoffrey Hinton:
representations.

642
00:30:05,609 --> 00:30:08,170
Geoffrey Hinton:
And as you know, sort of one of
the big problems in AI is how do

643
00:30:08,170 --> 00:30:10,490
Geoffrey Hinton:
you learn new representations?

644
00:30:10,490 --> 00:30:12,930
Geoffrey Hinton:
How do you avoid having to put
them all in by hand?

645
00:30:12,930 --> 00:30:16,309
Geoffrey Hinton:
And my particular example,

646
00:30:16,309 --> 00:30:17,650
Geoffrey Hinton:
which was the family trees

647
00:30:17,650 --> 00:30:18,890
Geoffrey Hinton:
example, where you take all the

648
00:30:18,890 --> 00:30:20,130
Geoffrey Hinton:
information in some family

649
00:30:20,130 --> 00:30:22,170
Geoffrey Hinton:
trees, you convert it into

650
00:30:22,170 --> 00:30:24,630
Geoffrey Hinton:
triples of symbols like John has

651
00:30:24,630 --> 00:30:25,990
Geoffrey Hinton:
father Mary.

652
00:30:25,990 --> 00:30:28,230
Geoffrey Hinton:
And then you train a neural

653
00:30:28,230 --> 00:30:29,910
Geoffrey Hinton:
net to predict the last term in

654
00:30:29,910 --> 00:30:30,309
Geoffrey Hinton:
a triple,

655
00:30:30,309 --> 00:30:32,029
Geoffrey Hinton:
given the first two terms.

656
00:30:32,029 --> 00:30:33,950
Geoffrey Hinton:
So it's just like the big
language models.

657
00:30:33,950 --> 00:30:36,549
Geoffrey Hinton:
You're predicting the next word
given the context.

658
00:30:36,549 --> 00:30:38,509
Geoffrey Hinton:
It's just much simpler.

659
00:30:38,509 --> 00:30:41,490
Geoffrey Hinton:
I had one hundred and twelve

660
00:30:41,490 --> 00:30:43,390
Geoffrey Hinton:
total examples, of which one

661
00:30:43,390 --> 00:30:44,829
Geoffrey Hinton:
hundred and four were training

662
00:30:44,829 --> 00:30:46,109
Geoffrey Hinton:
examples and eight were test

663
00:30:46,109 --> 00:30:48,150
Geoffrey Hinton:
examples, which is a bit less

664
00:30:48,150 --> 00:30:49,470
Geoffrey Hinton:
than the trillion examples they

665
00:30:49,470 --> 00:30:50,670
Geoffrey Hinton:
have nowadays,

666
00:30:50,670 --> 00:30:52,670
Geoffrey Hinton:
but it was the same idea.

667
00:30:52,670 --> 00:30:56,589
Geoffrey Hinton:
You convert a symbol into a
feature vector.

668
00:30:56,589 --> 00:30:59,549
Geoffrey Hinton:
You then have the feature
vectors of the context interact

669
00:30:59,549 --> 00:31:02,349
Geoffrey Hinton:
via a hidden layer.

670
00:31:02,349 --> 00:31:04,029
Geoffrey Hinton:
They then predict the features

671
00:31:04,029 --> 00:31:06,230
Geoffrey Hinton:
of the next symbol, and from

672
00:31:06,230 --> 00:31:07,309
Geoffrey Hinton:
those features you guess what

673
00:31:07,309 --> 00:31:08,829
Geoffrey Hinton:
the next symbol should be, and

674
00:31:08,829 --> 00:31:09,789
Geoffrey Hinton:
you try and maximize the

675
00:31:09,789 --> 00:31:10,950
Geoffrey Hinton:
probability of predicting the

676
00:31:10,950 --> 00:31:12,150
Geoffrey Hinton:
next symbol.

677
00:31:12,150 --> 00:31:13,470
Geoffrey Hinton:
And you then backpropagate

678
00:31:13,470 --> 00:31:16,390
Geoffrey Hinton:
through the feature interactions

679
00:31:16,390 --> 00:31:17,390
Geoffrey Hinton:
and through the process of

680
00:31:17,390 --> 00:31:18,349
Geoffrey Hinton:
converting a symbol into

681
00:31:18,349 --> 00:31:19,269
Geoffrey Hinton:
features.

682
00:31:19,269 --> 00:31:22,779
Geoffrey Hinton:
And that way you learn
feature vectors to represent the

683
00:31:22,779 --> 00:31:26,849
Geoffrey Hinton:
symbols and how these vectors
should interact to predict the

684
00:31:26,849 --> 00:31:28,450
Geoffrey Hinton:
features of the next symbol.

685
00:31:28,450 --> 00:31:30,970
Geoffrey Hinton:
And that's what these big
language models do.

686
00:31:31,130 --> 00:31:34,210
Tom Mitchell:
So there's Geoff on the mid

687
00:31:34,210 --> 00:31:35,569
Tom Mitchell:
nineteen eighties work on

688
00:31:35,569 --> 00:31:36,930
Tom Mitchell:
backpropagation.

689
00:31:36,930 --> 00:31:38,890
Tom Mitchell:
Another personal note in

690
00:31:38,890 --> 00:31:40,650
Tom Mitchell:
nineteen eighty six, while this

691
00:31:40,650 --> 00:31:44,210
Tom Mitchell:
was going on, I came to spend a

692
00:31:44,210 --> 00:31:45,970
Tom Mitchell:
year at CMU as a visiting

693
00:31:45,970 --> 00:31:47,329
Tom Mitchell:
professor.

694
00:31:47,329 --> 00:31:51,009
Tom Mitchell:
And I got to meet Allen Newell
at the time.

695
00:31:51,009 --> 00:31:54,089
Tom Mitchell:
And Allen said, hey, do you want
to team teach a course?

696
00:31:54,089 --> 00:31:55,609
Tom Mitchell:
We'll teach a course on

697
00:31:55,609 --> 00:31:57,670
Tom Mitchell:
architectures for intelligent

698
00:31:57,670 --> 00:31:58,769
Tom Mitchell:
agents.

699
00:31:58,769 --> 00:32:00,329
Tom Mitchell:
And of course I said yes.

700
00:32:00,329 --> 00:32:02,210
Tom Mitchell:
The opportunity to teach with
Allen.

701
00:32:02,210 --> 00:32:03,289
Tom Mitchell:
And he said, by the way, there

702
00:32:03,289 --> 00:32:04,990
Tom Mitchell:
will be another, uh, an

703
00:32:04,990 --> 00:32:06,609
Tom Mitchell:
assistant professor working with

704
00:32:06,609 --> 00:32:07,089
Tom Mitchell:
us.

705
00:32:07,089 --> 00:32:08,650
Tom Mitchell:
The three of us will team teach
it.

706
00:32:08,650 --> 00:32:10,250
Tom Mitchell:
That's Geoff Hinton.

707
00:32:10,250 --> 00:32:11,990
Tom Mitchell:
So Allen, Geoff and I team

708
00:32:11,990 --> 00:32:13,309
Tom Mitchell:
taught in spring of nineteen

709
00:32:13,309 --> 00:32:14,569
Tom Mitchell:
eighty six.

710
00:32:14,569 --> 00:32:20,049
Tom Mitchell:
Uh, this course was one of the
best experiences of my career up

711
00:32:20,049 --> 00:32:21,170
Tom Mitchell:
to that point.

712
00:32:21,170 --> 00:32:24,599
Tom Mitchell:
And so it was a large part of
the reason why I ended up

713
00:32:24,599 --> 00:32:26,920
Tom Mitchell:
staying at CMU.

714
00:32:26,920 --> 00:32:29,759
Tom Mitchell:
But when I came, I was here

715
00:32:29,759 --> 00:32:31,920
Tom Mitchell:
for about a year, and then Geoff

716
00:32:31,920 --> 00:32:33,200
Tom Mitchell:
moved on.

717
00:32:33,200 --> 00:32:38,920
Tom Mitchell:
He moved up to the University of
Toronto and started

718
00:32:38,920 --> 00:32:40,759
Tom Mitchell:
building up a group there.

719
00:32:40,759 --> 00:32:45,880
Tom Mitchell:
One of the people who joined his
group was a person named Yann

720
00:32:45,880 --> 00:32:50,640
Tom Mitchell:
LeCun, who went on to win the
Turing Award jointly with Geoff

721
00:32:50,640 --> 00:32:54,519
Tom Mitchell:
and Yoshua Bengio for their work
in neural networks.

722
00:32:54,519 --> 00:32:57,440
Tom Mitchell:
So I asked Yann about this
period.

723
00:32:58,440 --> 00:33:01,500
Yann LeCun:
And then, mid nineteen

724
00:33:01,500 --> 00:33:03,400
Yann LeCun:
eighty seven, I moved to Toronto

725
00:33:03,400 --> 00:33:06,500
Yann LeCun:
to do a postdoc with Geoff, and I

726
00:33:06,500 --> 00:33:08,940
Yann LeCun:
completed this, the

727
00:33:08,940 --> 00:33:09,559
Yann LeCun:
simulator.

728
00:33:09,559 --> 00:33:10,720
Yann LeCun:
Geoff thought I was not doing

729
00:33:10,720 --> 00:33:11,759
Yann LeCun:
anything because I was just

730
00:33:11,759 --> 00:33:13,640
Yann LeCun:
basically hacking, you know, all

731
00:33:13,640 --> 00:33:14,440
Yann LeCun:
the time,

732
00:33:14,440 --> 00:33:19,940
Yann LeCun:
and this, this
system was kind of

733
00:33:19,940 --> 00:33:23,140
Yann LeCun:
interesting because we had to
build a front end language to

734
00:33:23,140 --> 00:33:23,859
Yann LeCun:
interact with it.

735
00:33:23,859 --> 00:33:25,059
Yann LeCun:
And that language was the Lisp

736
00:33:25,059 --> 00:33:26,240
Yann LeCun:
interpreter that Leon and I

737
00:33:26,240 --> 00:33:26,819
Yann LeCun:
wrote.

738
00:33:26,819 --> 00:33:31,180
Yann LeCun:
And so we're using Lisp, even
though as a front end to kind of

739
00:33:31,180 --> 00:33:32,740
Yann LeCun:
a neural net simulator.

740
00:33:32,740 --> 00:33:35,480
Yann LeCun:
And I, you know, implemented

741
00:33:35,480 --> 00:33:37,619
Yann LeCun:
weight sharing abilities

742
00:33:37,619 --> 00:33:38,740
Yann LeCun:
and all that stuff and started

743
00:33:38,740 --> 00:33:40,660
Yann LeCun:
experimenting with what became

744
00:33:40,660 --> 00:33:41,859
Yann LeCun:
convolutional nets.

745
00:33:41,859 --> 00:33:46,160
Yann LeCun:
You know, when I was a postdoc
in Toronto, early nineteen

746
00:33:46,160 --> 00:33:48,660
Yann LeCun:
eighty eight, roughly, and
started to get really good

747
00:33:48,660 --> 00:33:51,900
Yann LeCun:
results on, you know, very
simple shape recognition, like,

748
00:33:51,900 --> 00:33:55,420
Yann LeCun:
handwritten characters
that I had drawn with my mouse or

749
00:33:55,420 --> 00:33:56,180
Yann LeCun:
something like that.

750
00:33:56,180 --> 00:33:56,779
Yann LeCun:
Right.

751
00:33:57,180 --> 00:34:00,220
Tom Mitchell:
So, as you just heard, Yann was

752
00:34:00,220 --> 00:34:02,579
Tom Mitchell:
experimenting with can we apply

753
00:34:02,579 --> 00:34:04,180
Tom Mitchell:
neural networks to the problem

754
00:34:04,180 --> 00:34:06,460
Tom Mitchell:
of character recognition,

755
00:34:06,460 --> 00:34:07,779
Tom Mitchell:
written characters.

756
00:34:07,779 --> 00:34:11,340
Tom Mitchell:
People were experimenting with
many different uses of neural

757
00:34:11,340 --> 00:34:12,780
Tom Mitchell:
nets at the time.

758
00:34:12,780 --> 00:34:17,739
Tom Mitchell:
My favorite, the one I would
vote application of the decade

759
00:34:17,739 --> 00:34:20,519
Tom Mitchell:
was done in the area,

760
00:34:20,519 --> 00:34:24,119
Tom Mitchell:
surprisingly, of self-driving
cars.

761
00:34:24,119 --> 00:34:28,440
Tom Mitchell:
There was a PhD student here at
CMU named Dean Pomerleau.

762
00:34:28,440 --> 00:34:30,320
Tom Mitchell:
He trained a neural network

763
00:34:30,320 --> 00:34:32,840
Tom Mitchell:
where the input was an image

764
00:34:32,840 --> 00:34:34,599
Tom Mitchell:
taken by a camera looking out

765
00:34:34,599 --> 00:34:36,480
Tom Mitchell:
the front windshield of a

766
00:34:36,480 --> 00:34:37,559
Tom Mitchell:
vehicle.

767
00:34:37,559 --> 00:34:38,760
Tom Mitchell:
And the output of the neural

768
00:34:38,760 --> 00:34:41,460
Tom Mitchell:
network was the steering command

769
00:34:41,460 --> 00:34:43,480
Tom Mitchell:
telling the car which direction

770
00:34:43,480 --> 00:34:44,960
Tom Mitchell:
to steer.

771
00:34:44,960 --> 00:34:48,360
Tom Mitchell:
So I asked Dean about that work.

772
00:34:48,360 --> 00:34:51,400
Tom Mitchell:
How much training data did you
have?

773
00:34:52,039 --> 00:34:53,659
Dean Pomerleau:
So the interesting thing was, to

774
00:34:53,659 --> 00:34:55,039
Dean Pomerleau:
begin with, it was all batch

775
00:34:55,039 --> 00:34:55,440
Dean Pomerleau:
training.

776
00:34:55,440 --> 00:35:00,280
Dean Pomerleau:
So I'd drive, I'd have a person
drive the vehicle along Schenley

777
00:35:00,280 --> 00:35:05,559
Dean Pomerleau:
Park, uh, Flagstaff Hill Path,
and then I would go off and

778
00:35:05,559 --> 00:35:06,840
Dean Pomerleau:
crunch it overnight.

779
00:35:06,840 --> 00:35:08,360
Dean Pomerleau:
But in the end, what we were

780
00:35:08,360 --> 00:35:10,519
Dean Pomerleau:
able to do is, uh, real time

781
00:35:10,519 --> 00:35:10,960
Dean Pomerleau:
learning.

782
00:35:10,960 --> 00:35:14,400
Dean Pomerleau:
So one drive up the hill with a

783
00:35:14,400 --> 00:35:16,619
Dean Pomerleau:
human behind the wheel steering

784
00:35:16,619 --> 00:35:19,320
Dean Pomerleau:
and the neural network, learning

785
00:35:19,320 --> 00:35:21,760
Dean Pomerleau:
to pair images with camera

786
00:35:21,760 --> 00:35:23,380
Dean Pomerleau:
images with the steering command

787
00:35:23,380 --> 00:35:25,340
Dean Pomerleau:
that the human was giving was

788
00:35:25,340 --> 00:35:27,739
Dean Pomerleau:
able to, uh, train it in about

789
00:35:27,739 --> 00:35:30,900
Dean Pomerleau:
five minutes to, uh, take over

790
00:35:30,900 --> 00:35:32,360
Dean Pomerleau:
and steer on its own from there

791
00:35:32,360 --> 00:35:34,179
Dean Pomerleau:
on, on that road and on similar

792
00:35:34,179 --> 00:35:34,659
Dean Pomerleau:
roads.

793
00:35:34,659 --> 00:35:38,179
Dean Pomerleau:
So it was one of the first real
time, real world vision

794
00:35:38,179 --> 00:35:43,960
Dean Pomerleau:
applications of, uh, of
artificial neural networks going

795
00:35:43,960 --> 00:35:47,860
Dean Pomerleau:
beyond just Flagstaff Hill, you
know, the little paths on there.

796
00:35:47,860 --> 00:35:51,340
Dean Pomerleau:
And we went out on, on real
roads first through the golf

797
00:35:51,340 --> 00:35:55,539
Dean Pomerleau:
course, Schenley Golf Course, on
the, uh, on the road there.

798
00:35:55,539 --> 00:35:58,860
Dean Pomerleau:
And then we, we went on, you
know, the local highways, in

799
00:35:58,860 --> 00:36:03,380
Dean Pomerleau:
fact, the longest as part of my
PhD, the longest trip we did

800
00:36:03,380 --> 00:36:09,119
Dean Pomerleau:
was, I think, about one hundred
miles at the time from basically

801
00:36:09,119 --> 00:36:13,619
Dean Pomerleau:
up, uh, I-79 from Pittsburgh all
the way up to Erie.

802
00:36:13,619 --> 00:36:17,849
Dean Pomerleau:
Uh, and it drove basically the,
the whole way.

803
00:36:17,849 --> 00:36:21,929
Dean Pomerleau:
And it was getting up to
fifty five miles per hour after

804
00:36:21,929 --> 00:36:23,650
Dean Pomerleau:
we got a faster vehicle.

805
00:36:24,090 --> 00:36:26,889
Tom Mitchell:
It turns out he didn't ask for
permission.

806
00:36:28,769 --> 00:36:32,769
Tom Mitchell:
So this was all happening in
the nineteen eighties.

807
00:36:32,769 --> 00:36:34,489
Tom Mitchell:
Really, it was a decade of

808
00:36:34,489 --> 00:36:38,050
Tom Mitchell:
amazing invention and innovation

809
00:36:38,050 --> 00:36:39,849
Tom Mitchell:
and exploration.

810
00:36:39,849 --> 00:36:42,809
Tom Mitchell:
Another important thing that

811
00:36:42,809 --> 00:36:45,889
Tom Mitchell:
happened in that decade was the

812
00:36:45,889 --> 00:36:48,210
Tom Mitchell:
development of reinforcement

813
00:36:48,210 --> 00:36:49,010
Tom Mitchell:
learning.

814
00:36:49,010 --> 00:36:54,650
Tom Mitchell:
The way to understand that is to
first realize that supervised

815
00:36:54,650 --> 00:36:57,969
Tom Mitchell:
learning was the kind of
standard way of framing the

816
00:36:57,969 --> 00:36:59,730
Tom Mitchell:
machine learning question.

817
00:36:59,730 --> 00:37:01,389
Tom Mitchell:
When Dean talked about training

818
00:37:01,389 --> 00:37:03,849
Tom Mitchell:
his system, he would input an

819
00:37:03,849 --> 00:37:04,289
Tom Mitchell:
image.

820
00:37:04,289 --> 00:37:07,590
Tom Mitchell:
He had people drive the car, so
he got a lot of training

821
00:37:07,590 --> 00:37:09,409
Tom Mitchell:
examples of the form:

822
00:37:09,409 --> 00:37:13,530
Tom Mitchell:
Here's the image and here's the
correct steering command.

823
00:37:13,530 --> 00:37:16,510
Tom Mitchell:
So he could tell the neural
network, for this input,

824
00:37:16,510 --> 00:37:18,389
Tom Mitchell:
here's the correct output.

825
00:37:18,389 --> 00:37:20,630
Tom Mitchell:
That's called supervised
learning.

826
00:37:20,630 --> 00:37:24,989
Tom Mitchell:
But reinforcement learning
reframes the problem.

827
00:37:24,989 --> 00:37:27,949
Tom Mitchell:
It takes into account that
sometimes we don't know what the

828
00:37:27,949 --> 00:37:29,469
Tom Mitchell:
right output is.

829
00:37:29,469 --> 00:37:34,510
Tom Mitchell:
For example, if you're learning
to play chess, you might not

830
00:37:34,510 --> 00:37:38,070
Tom Mitchell:
have a person who tells you at
every step given this board

831
00:37:38,070 --> 00:37:40,309
Tom Mitchell:
position, here's the right move.

832
00:37:40,309 --> 00:37:43,750
Tom Mitchell:
Instead, you might have to wait
until the end of the game after

833
00:37:43,750 --> 00:37:48,369
Tom Mitchell:
you've made many moves to get
the feedback signal that says

834
00:37:48,369 --> 00:37:52,510
Tom Mitchell:
you lost or you won, and then
you have to figure out what to

835
00:37:52,510 --> 00:37:55,309
Tom Mitchell:
do about that because you
actually took many moves.

836
00:37:55,309 --> 00:37:58,269
Tom Mitchell:
So that's what reinforcement
learning is about.

837
00:37:58,269 --> 00:38:03,769
Tom Mitchell:
And Rich Sutton and Andy Barto
were instrumental in kind of

838
00:38:03,769 --> 00:38:06,949
Tom Mitchell:
framing that problem and, and
working on it.

839
00:38:06,949 --> 00:38:09,869
Tom Mitchell:
They recently won the Turing
Award for this work.

840
00:38:09,869 --> 00:38:12,389
Tom Mitchell:
So I asked Rich how

841
00:38:12,389 --> 00:38:14,269
Tom Mitchell:
reinforcement learning fit into

842
00:38:14,269 --> 00:38:15,269
Tom Mitchell:
the field.

843
00:38:16,969 --> 00:38:18,250
Rich Sutton:
The field of machine learning

844
00:38:18,250 --> 00:38:21,250
Rich Sutton:
has always been dominated

845
00:38:21,250 --> 00:38:22,489
Rich Sutton:
by the more straightforward

846
00:38:22,489 --> 00:38:24,449
Rich Sutton:
supervised approach.

847
00:38:24,449 --> 00:38:30,329
Rich Sutton:
There was, as I
mentioned at the very beginning,

848
00:38:30,329 --> 00:38:34,170
Rich Sutton:
the rewards and penalties were
very much a part of it.

849
00:38:34,170 --> 00:38:38,789
Rich Sutton:
But then the focus, as

850
00:38:38,789 --> 00:38:40,130
Rich Sutton:
things became more clear and

851
00:38:40,130 --> 00:38:41,730
Rich Sutton:
better defined and it

852
00:38:41,730 --> 00:38:43,969
Rich Sutton:
became more clear, the learning

853
00:38:43,969 --> 00:38:45,889
Rich Sutton:
problem then became pattern

854
00:38:45,889 --> 00:38:47,130
Rich Sutton:
recognition and supervised

855
00:38:47,130 --> 00:38:48,769
Rich Sutton:
learning.

856
00:38:48,769 --> 00:38:54,610
Rich Sutton:
And this fellow, the
strange, uh, fellow Harry Klopf,

857
00:38:54,610 --> 00:38:59,130
Rich Sutton:
recognized this more than
other people and

858
00:38:59,130 --> 00:39:03,730
Rich Sutton:
wrote some reports and
ultimately a book, saying

859
00:39:03,730 --> 00:39:05,449
Rich Sutton:
that something had been lost.

860
00:39:05,449 --> 00:39:11,489
Rich Sutton:
And Andy Barto and I
picked up on his work and

861
00:39:11,489 --> 00:39:13,610
Rich Sutton:
eventually realized that he
was right, that something had

862
00:39:13,610 --> 00:39:16,420
Rich Sutton:
been left out, and in some sense
it was obvious that something

863
00:39:16,420 --> 00:39:17,079
Rich Sutton:
had been left out.

864
00:39:17,079 --> 00:39:17,920
Rich Sutton:
From the point of view of

865
00:39:17,920 --> 00:39:19,000
Rich Sutton:
psychology, where I'd been

866
00:39:19,000 --> 00:39:20,780
Rich Sutton:
studying how animals learn and

867
00:39:20,780 --> 00:39:21,800
Rich Sutton:
animals learn

868
00:39:21,800 --> 00:39:23,159
Rich Sutton:
really in both ways, in both a

869
00:39:23,159 --> 00:39:24,579
Rich Sutton:
supervised way and a

870
00:39:24,579 --> 00:39:26,320
Rich Sutton:
reinforcement way.

871
00:39:26,320 --> 00:39:31,619
Rich Sutton:
And so, we picked up on
that and made that into a well

872
00:39:31,619 --> 00:39:35,119
Rich Sutton:
defined area in the.

873
00:39:36,559 --> 00:39:37,599
Rich Sutton:
When was that?

874
00:39:37,599 --> 00:39:40,079
Rich Sutton:
That would have been in the
eighties.

875
00:39:40,079 --> 00:39:42,880
Tom Mitchell:
And then finally, you wrote a
book on it in ninety eight.

876
00:39:42,880 --> 00:39:47,280
Rich Sutton:
So then it became a clear, uh,
subfield of machine learning.

877
00:39:49,519 --> 00:39:50,039
Rich Sutton:
Yeah.

878
00:39:50,039 --> 00:39:55,860
Rich Sutton:
But the key thing is, the way I say it to

879
00:39:55,860 --> 00:39:59,239
Rich Sutton:
myself is that why is
reinforcement learning off?

880
00:39:59,239 --> 00:40:00,239
Rich Sutton:
Why is it powerful?

881
00:40:00,239 --> 00:40:02,400
Rich Sutton:
Potentially powerful.

882
00:40:02,400 --> 00:40:05,079
Rich Sutton:
It's powerful because it's
learning.

883
00:40:05,079 --> 00:40:07,239
Rich Sutton:
It's really learning from
experience.

884
00:40:07,239 --> 00:40:09,400
Rich Sutton:
Learning from the normal data

885
00:40:09,400 --> 00:40:11,000
Rich Sutton:
that an animal or a person would

886
00:40:11,000 --> 00:40:11,760
Rich Sutton:
get.

887
00:40:11,760 --> 00:40:13,079
Rich Sutton:
And it doesn't require a

888
00:40:13,079 --> 00:40:15,679
Rich Sutton:
prepared special data like you

889
00:40:15,679 --> 00:40:16,820
Rich Sutton:
of course do in supervised

890
00:40:16,820 --> 00:40:17,420
Rich Sutton:
learning.

891
00:40:18,980 --> 00:40:21,980
Tom Mitchell:
So during the eighties, there
were a lot of other really

892
00:40:21,980 --> 00:40:24,219
Tom Mitchell:
interesting things going on.

893
00:40:24,219 --> 00:40:25,860
Tom Mitchell:
Uh, people experimenting with

894
00:40:25,860 --> 00:40:27,780
Tom Mitchell:
the idea that maybe machines

895
00:40:27,780 --> 00:40:29,519
Tom Mitchell:
should learn by simulating

896
00:40:29,519 --> 00:40:30,860
Tom Mitchell:
evolution.

897
00:40:30,860 --> 00:40:34,219
Tom Mitchell:
There was an entire set of
conferences on something called

898
00:40:34,219 --> 00:40:38,219
Tom Mitchell:
genetic algorithms, genetic
programming, which had to do

899
00:40:38,219 --> 00:40:39,619
Tom Mitchell:
with that sort of thing.

900
00:40:39,619 --> 00:40:42,880
Tom Mitchell:
Uh, a cluster of work on

901
00:40:42,880 --> 00:40:45,179
Tom Mitchell:
studying human learning and

902
00:40:45,179 --> 00:40:46,099
Tom Mitchell:
other areas.

903
00:40:46,099 --> 00:40:48,860
Tom Mitchell:
But we don't have time for all
of those.

904
00:40:48,860 --> 00:40:50,440
Tom Mitchell:
Let's move on to the nineteen

905
00:40:50,440 --> 00:40:52,960
Tom Mitchell:
nineties, when, again, there was

906
00:40:52,960 --> 00:40:56,039
Tom Mitchell:
a, I would say, a sea change in

907
00:40:56,039 --> 00:40:58,239
Tom Mitchell:
terms of the style of work that

908
00:40:58,239 --> 00:40:59,340
Tom Mitchell:
went on.

909
00:40:59,340 --> 00:41:02,719
Tom Mitchell:
The theme of the nineteen
nineties was really the

910
00:41:02,719 --> 00:41:08,539
Tom Mitchell:
integration of statistical and
probabilistic methods into the

911
00:41:08,539 --> 00:41:10,380
Tom Mitchell:
field of machine learning.

912
00:41:10,380 --> 00:41:12,199
Tom Mitchell:
And a lot of that took the

913
00:41:12,199 --> 00:41:15,360
Tom Mitchell:
grounded form of learning a new

914
00:41:15,360 --> 00:41:16,920
Tom Mitchell:
kind of object, which people

915
00:41:16,920 --> 00:41:19,320
Tom Mitchell:
called either graphical models

916
00:41:19,320 --> 00:41:20,480
Tom Mitchell:
or Bayes.

917
00:41:20,480 --> 00:41:21,840
Tom Mitchell:
Bayes nets.

918
00:41:21,840 --> 00:41:23,960
Tom Mitchell:
But what got learned in that

919
00:41:23,960 --> 00:41:27,360
Tom Mitchell:
case was, again, a network where

920
00:41:27,360 --> 00:41:29,679
Tom Mitchell:
each node would represent a

921
00:41:29,679 --> 00:41:30,880
Tom Mitchell:
variable.

922
00:41:30,880 --> 00:41:34,599
Tom Mitchell:
For example, maybe you would be
interested in predicting whether

923
00:41:34,599 --> 00:41:36,679
Tom Mitchell:
somebody has lung cancer.

924
00:41:36,679 --> 00:41:41,079
Tom Mitchell:
You'd make that a variable and
maybe you'd have evidence like

925
00:41:41,079 --> 00:41:42,800
Tom Mitchell:
are they a smoker?

926
00:41:42,800 --> 00:41:46,599
Tom Mitchell:
Do they have a normal or
abnormal X-ray result?

927
00:41:46,599 --> 00:41:48,559
Tom Mitchell:
You'd make those variables.

928
00:41:48,559 --> 00:41:53,119
Tom Mitchell:
And then the edges in the graph
represent probabilistic

929
00:41:53,119 --> 00:41:59,719
Tom Mitchell:
dependencies among the variables
in a way such that in the end,

930
00:41:59,719 --> 00:42:04,420
Tom Mitchell:
the whole graph represents the
full joint probability

931
00:42:04,420 --> 00:42:08,760
Tom Mitchell:
distribution over the entire
collection of variables.

932
00:42:08,760 --> 00:42:14,139
Tom Mitchell:
So that's what got learned, and
how it got learned

933
00:42:14,139 --> 00:42:17,340
Tom Mitchell:
waited for some algorithms to be
discovered.

934
00:42:17,340 --> 00:42:19,340
Tom Mitchell:
One of the key people who was

935
00:42:19,340 --> 00:42:21,559
Tom Mitchell:
involved in inventing those

936
00:42:21,559 --> 00:42:23,579
Tom Mitchell:
algorithms, although Judea

937
00:42:23,579 --> 00:42:25,099
Tom Mitchell:
Pearl came up with the idea of

938
00:42:25,099 --> 00:42:27,420
Tom Mitchell:
how to represent these,

939
00:42:27,420 --> 00:42:29,420
Tom Mitchell:
Daphne Koller, a professor at

940
00:42:29,420 --> 00:42:32,099
Tom Mitchell:
Stanford, was one of the most

941
00:42:32,099 --> 00:42:34,380
Tom Mitchell:
active researchers in terms of

942
00:42:34,380 --> 00:42:36,300
Tom Mitchell:
designing algorithms for

943
00:42:36,300 --> 00:42:37,579
Tom Mitchell:
learning these.

944
00:42:37,579 --> 00:42:40,699
Tom Mitchell:
So I asked her, why do we need
graphical models?

945
00:42:41,099 --> 00:42:46,440
Daphne Koller:
Graphical models, for me,
emerged by realizing that the

946
00:42:46,440 --> 00:42:50,219
Daphne Koller:
problems that we needed to solve
to address most real world

947
00:42:50,219 --> 00:42:52,860
Daphne Koller:
applications went beyond

948
00:42:52,860 --> 00:42:55,260
Daphne Koller:
you have a vector representation

949
00:42:55,260 --> 00:42:57,239
Daphne Koller:
of an input and a single,

950
00:42:57,239 --> 00:42:59,460
Daphne Koller:
oftentimes binary or at best

951
00:42:59,460 --> 00:43:01,139
Daphne Koller:
continuous output.

952
00:43:01,139 --> 00:43:04,519
Daphne Koller:
There was so much more
opportunity to think about

953
00:43:04,519 --> 00:43:08,059
Daphne Koller:
richly structured environments,
richly structured problems.

954
00:43:08,059 --> 00:43:13,210
Daphne Koller:
So even if you think about
problems like understanding what

955
00:43:13,210 --> 00:43:17,849
Daphne Koller:
is in an image, that's not a
single label problem of there is

956
00:43:17,849 --> 00:43:20,510
Daphne Koller:
a dog, because images are
complex and there's

957
00:43:20,510 --> 00:43:23,289
Daphne Koller:
interrelationships between the
different objects. You wanted to

958
00:43:23,289 --> 00:43:27,289
Daphne Koller:
get beyond the yes/no "is there
a dog in this image" to something

959
00:43:27,289 --> 00:43:29,730
Daphne Koller:
that is much more rich:

960
00:43:29,730 --> 00:43:31,809
Daphne Koller:
There's a dog and a Frisbee and

961
00:43:31,809 --> 00:43:33,730
Daphne Koller:
a beach and three kids building

962
00:43:33,730 --> 00:43:34,849
Daphne Koller:
a sandcastle.

963
00:43:34,849 --> 00:43:37,929
Daphne Koller:
You have a rich input and a rich
output.

964
00:43:37,929 --> 00:43:39,210
Daphne Koller:
Thinking about these richly

965
00:43:39,210 --> 00:43:41,989
Daphne Koller:
structured domains gave rise to

966
00:43:41,989 --> 00:43:43,090
Daphne Koller:
we have to think about multiple

967
00:43:43,090 --> 00:43:43,489
Daphne Koller:
variables.

968
00:43:43,489 --> 00:43:44,250
Daphne Koller:
We have to think about the

969
00:43:44,250 --> 00:43:45,329
Daphne Koller:
interactions between those

970
00:43:45,329 --> 00:43:47,329
Daphne Koller:
variables and leverage that

971
00:43:47,329 --> 00:43:49,769
Daphne Koller:
structure both in our input and

972
00:43:49,769 --> 00:43:52,630
Daphne Koller:
output space in order to get to

973
00:43:52,630 --> 00:43:53,969
Daphne Koller:
much better conclusions and deal

974
00:43:53,969 --> 00:43:54,969
Daphne Koller:
with problems that really

975
00:43:54,969 --> 00:43:55,769
Daphne Koller:
matter.

976
00:43:56,849 --> 00:43:58,650
Tom Mitchell:
So this work on training

977
00:43:58,650 --> 00:44:00,849
Tom Mitchell:
graphical models was really part

978
00:44:00,849 --> 00:44:03,289
Tom Mitchell:
of a bigger theme that decade,

979
00:44:03,289 --> 00:44:05,010
Tom Mitchell:
which was just the integration

980
00:44:05,010 --> 00:44:09,670
Tom Mitchell:
of statistical methods with what

981
00:44:09,670 --> 00:44:11,949
Tom Mitchell:
had been pretty much statistics

982
00:44:11,949 --> 00:44:13,550
Tom Mitchell:
free machine learning up to that

983
00:44:13,550 --> 00:44:14,389
Tom Mitchell:
point.

984
00:44:14,389 --> 00:44:15,550
Tom Mitchell:
Another person who was

985
00:44:15,550 --> 00:44:17,889
Tom Mitchell:
instrumental in that was

986
00:44:17,889 --> 00:44:20,630
Tom Mitchell:
a Berkeley professor named Mike

987
00:44:20,630 --> 00:44:21,510
Tom Mitchell:
Jordan.

988
00:44:21,510 --> 00:44:22,550
Tom Mitchell:
I asked him about the

989
00:44:22,550 --> 00:44:24,670
Tom Mitchell:
relationship between statistics

990
00:44:24,670 --> 00:44:25,309
Tom Mitchell:
and machine learning.

991
00:44:25,150 --> 00:44:27,730
Michael I. Jordan:
So anyway, by the time I
wanted to move to Berkeley, I

992
00:44:27,730 --> 00:44:30,670
Michael I. Jordan:
was realizing that I was missing
the whole statistics community,

993
00:44:30,670 --> 00:44:33,909
Michael I. Jordan:
that, uh, it was just separate
from machine learning, as maybe

994
00:44:33,909 --> 00:44:35,869
Michael I. Jordan:
you kind of remember, there was
occasionally a little leakage,

995
00:44:35,869 --> 00:44:38,110
Michael I. Jordan:
but it was way too separate.

996
00:44:38,110 --> 00:44:40,869
Michael I. Jordan:
And nowadays we're often
seeing, you know, people will

997
00:44:40,869 --> 00:44:43,150
Michael I. Jordan:
run a machine learning method,
but then it's not calibrated.

998
00:44:43,150 --> 00:44:46,510
Michael I. Jordan:
It's not, you know, it has bias and
all that.

999
00:44:46,510 --> 00:44:48,150
Michael I. Jordan:
And that's the thing
statisticians have talked about

1000
00:44:48,150 --> 00:44:49,309
Michael I. Jordan:
for a long, long time.

1001
00:44:49,309 --> 00:44:52,550
Michael I. Jordan:
And so nowadays I think it's a
given that, yeah, they're,

1002
00:44:52,550 --> 00:44:56,150
Michael I. Jordan:
they're kind of two parts, two
sides of the same coin.

1003
00:44:56,150 --> 00:44:58,389
Michael I. Jordan:
Machine learning is maybe a
little more engineering in order

1004
00:44:58,389 --> 00:45:01,309
Michael I. Jordan:
to build a system and make it do
great things in the world.

1005
00:45:01,309 --> 00:45:03,550
Michael I. Jordan:
And statistics is a little bit
more, well, let's be cautious.

1006
00:45:03,550 --> 00:45:05,469
Michael I. Jordan:
Let's say we're going to do like
clinical trials.

1007
00:45:05,469 --> 00:45:09,130
Michael I. Jordan:
Let's make sure that the
answer is really trustable, but

1008
00:45:09,130 --> 00:45:11,130
Michael I. Jordan:
those are two sides of the same
coin, and I think that's

1009
00:45:11,130 --> 00:45:13,610
Michael I. Jordan:
probably pretty much clear now.

1010
00:45:13,610 --> 00:45:15,210
Michael I. Jordan:
But for a long time there was a
resistance.

1011
00:45:15,210 --> 00:45:17,849
Michael I. Jordan:
Everyone said this is a brand
new field, this is different.

1012
00:45:17,849 --> 00:45:21,210
Michael I. Jordan:
And I kept, again, annoying
colleagues by saying, no, I

1013
00:45:21,210 --> 00:45:22,650
Michael I. Jordan:
don't believe it is.

1014
00:45:22,650 --> 00:45:24,369
Michael I. Jordan:
So anyway, long story short, it
is.

1015
00:45:24,889 --> 00:45:28,489
Tom Mitchell:
It is remarkable to me that

1016
00:45:28,489 --> 00:45:30,110
Tom Mitchell:
the field of machine learning

1017
00:45:30,110 --> 00:45:31,849
Tom Mitchell:
went through most of the

1018
00:45:31,849 --> 00:45:34,230
Tom Mitchell:
nineteen eighties, kind of

1019
00:45:34,230 --> 00:45:35,730
Tom Mitchell:
without even noticing that

1020
00:45:35,730 --> 00:45:37,329
Tom Mitchell:
statistics existed.

1021
00:45:37,170 --> 00:45:38,809
Michael I. Jordan:
I mean, people like Leo Breiman

1022
00:45:38,809 --> 00:45:40,309
Michael I. Jordan:
were around to help make the

1023
00:45:40,309 --> 00:45:41,130
Michael I. Jordan:
passage.

1024
00:45:41,130 --> 00:45:44,849
Michael I. Jordan:
So ensemble methods, they were
kind of invented by Leo in the stat

1025
00:45:44,849 --> 00:45:46,889
Michael I. Jordan:
literature, but they were
independently invented in the

1026
00:45:46,889 --> 00:45:47,849
Michael I. Jordan:
machine learning literature.

1027
00:45:47,849 --> 00:45:50,289
Michael I. Jordan:
And is that machine learning or
statistics?

1028
00:45:50,289 --> 00:45:53,530
Michael I. Jordan:
Well, clearly it's both and it
needs both perspectives.

1029
00:45:53,530 --> 00:45:55,909
Michael I. Jordan:
And yes, in the nineteen
nineties, the EM algorithm,

1030
00:45:55,909 --> 00:46:00,349
Michael I. Jordan:
you know, the graphical models,
they were they had, they had uh,

1031
00:46:00,349 --> 00:46:03,369
Michael I. Jordan:
so yeah, the nineties, it was a
real flourishing of that.

1032
00:46:03,809 --> 00:46:07,239
Tom Mitchell:
So Mike mentioned that one of
the themes was ensemble.

1033
00:46:07,239 --> 00:46:08,659
Tom Mitchell:
So anyway, I think that's

1034
00:46:08,659 --> 00:46:11,239
Tom Mitchell:
actually a very nice example of

1035
00:46:11,239 --> 00:46:14,400
Tom Mitchell:
how machine learning theory and

1036
00:46:14,400 --> 00:46:16,659
Tom Mitchell:
statistical theory kind of

1037
00:46:16,659 --> 00:46:18,000
Tom Mitchell:
intertwined.

1038
00:46:18,000 --> 00:46:20,239
Tom Mitchell:
The idea of ensemble learning is

1039
00:46:20,239 --> 00:46:21,579
Tom Mitchell:
instead of learning one

1040
00:46:21,579 --> 00:46:24,159
Tom Mitchell:
hypothesis, let's learn multiple

1041
00:46:24,159 --> 00:46:24,880
Tom Mitchell:
ones.

1042
00:46:24,880 --> 00:46:26,480
Tom Mitchell:
For example, instead of learning

1043
00:46:26,480 --> 00:46:28,880
Tom Mitchell:
a decision tree, you might learn

1044
00:46:28,880 --> 00:46:30,800
Tom Mitchell:
a whole forest of decision

1045
00:46:30,800 --> 00:46:31,639
Tom Mitchell:
trees.

1046
00:46:31,639 --> 00:46:32,699
Tom Mitchell:
And then when it comes to

1047
00:46:32,699 --> 00:46:35,800
Tom Mitchell:
classifying a new example, you

1048
00:46:35,800 --> 00:46:37,420
Tom Mitchell:
give it to all of the

1049
00:46:37,420 --> 00:46:39,519
Tom Mitchell:
classifiers and you let them

1050
00:46:39,519 --> 00:46:42,039
Tom Mitchell:
vote and you take the vote of

1051
00:46:42,039 --> 00:46:43,519
Tom Mitchell:
the classifiers.

1052
00:46:43,519 --> 00:46:45,400
Tom Mitchell:
Well, that turned out to be very

1053
00:46:45,400 --> 00:46:48,280
Tom Mitchell:
successful and commercially very

1054
00:46:48,280 --> 00:46:49,320
Tom Mitchell:
important.

1055
00:46:49,320 --> 00:46:51,159
Tom Mitchell:
But it also is a beautiful

1056
00:46:51,159 --> 00:46:53,679
Tom Mitchell:
example where there's a

1057
00:46:53,679 --> 00:46:55,519
Tom Mitchell:
pretty interesting theory around

1058
00:46:55,519 --> 00:46:56,280
Tom Mitchell:
that.

1059
00:46:56,280 --> 00:47:02,960
Tom Mitchell:
And initially, Yoav Freund and
Robert Schapire, uh, in the early

1060
00:47:02,960 --> 00:47:08,659
Tom Mitchell:
nineties, uh, started working on
a theory and methods for doing

1061
00:47:08,659 --> 00:47:10,380
Tom Mitchell:
this kind of ensemble.

1062
00:47:10,380 --> 00:47:14,300
Tom Mitchell:
Leo Breiman, who was a
statistician, recognized that

1063
00:47:14,300 --> 00:47:19,579
Tom Mitchell:
this echoed some of the themes
of resampling in statistics.

1064
00:47:19,579 --> 00:47:21,960
Tom Mitchell:
And those two things, uh, kind

1065
00:47:21,960 --> 00:47:23,739
Tom Mitchell:
of came together in a very

1066
00:47:23,739 --> 00:47:25,739
Tom Mitchell:
successful way.

1067
00:47:25,739 --> 00:47:27,539
Tom Mitchell:
So in the nineties and the first

1068
00:47:27,539 --> 00:47:29,400
Tom Mitchell:
decade of the two thousands,

1069
00:47:29,400 --> 00:47:31,579
Tom Mitchell:
there were many other things

1070
00:47:31,579 --> 00:47:32,900
Tom Mitchell:
going on.

1071
00:47:32,900 --> 00:47:36,460
Tom Mitchell:
The development of things
called support vector machines,

1072
00:47:36,460 --> 00:47:42,820
Tom Mitchell:
kernel methods, which were
mathematical techniques for

1073
00:47:42,820 --> 00:47:48,059
Tom Mitchell:
learning very nonlinear
classifiers that were actually

1074
00:47:48,059 --> 00:47:53,079
Tom Mitchell:
commercially important and
opened the door in many cases to

1075
00:47:53,079 --> 00:47:56,900
Tom Mitchell:
machine learning for
non-numerical data, data like

1076
00:47:56,900 --> 00:47:59,380
Tom Mitchell:
images or text.

1077
00:47:59,380 --> 00:48:02,460
Tom Mitchell:
There was work on manifold
learning.

1078
00:48:02,460 --> 00:48:04,380
Tom Mitchell:
There was also growing

1079
00:48:04,380 --> 00:48:06,800
Tom Mitchell:
commercialization during that

1080
00:48:06,800 --> 00:48:07,880
Tom Mitchell:
decade.

1081
00:48:07,880 --> 00:48:09,400
Tom Mitchell:
More and more companies were

1082
00:48:09,400 --> 00:48:11,760
Tom Mitchell:
starting to use machine learning

1083
00:48:11,760 --> 00:48:13,599
Tom Mitchell:
commercially.

1084
00:48:13,599 --> 00:48:17,480
Tom Mitchell:
But for me, the theme of that
first decade of the two thousands

1085
00:48:17,480 --> 00:48:23,579
Tom Mitchell:
was really a growing awareness
by many people that, you know,

1086
00:48:23,579 --> 00:48:28,039
Tom Mitchell:
maybe we have good enough
machine learning algorithms that

1087
00:48:28,039 --> 00:48:33,400
Tom Mitchell:
the bottleneck to more accuracy
is not the algorithm.

1088
00:48:33,400 --> 00:48:37,199
Tom Mitchell:
Maybe we need more data and more
computation.

1089
00:48:37,199 --> 00:48:41,639
Tom Mitchell:
And this idea was crystallized
in this beautiful paper written

1090
00:48:41,639 --> 00:48:46,920
Tom Mitchell:
in two thousand and nine by
three authors at Google, called

1091
00:48:46,920 --> 00:48:50,679
Tom Mitchell:
The Unreasonable Effectiveness
of Data, which really

1092
00:48:50,679 --> 00:48:55,199
Tom Mitchell:
highlighted cases where,
if you want better

1093
00:48:55,199 --> 00:48:59,199
Tom Mitchell:
results, keep your same
algorithm, get more data.

1094
00:48:59,199 --> 00:49:04,519
Tom Mitchell:
And that was kind of a theme of
what was going on at the time,

1095
00:49:04,519 --> 00:49:10,739
Tom Mitchell:
but things really broke open in
the year twenty twelve.

1096
00:49:10,739 --> 00:49:16,380
Tom Mitchell:
In twenty twelve, the
computer vision community had

1097
00:49:16,380 --> 00:49:23,139
Tom Mitchell:
been using a data set created by
Fei-Fei Li called ImageNet to

1098
00:49:23,139 --> 00:49:26,659
Tom Mitchell:
test out different vision
algorithms, see who could do the

1099
00:49:26,659 --> 00:49:31,619
Tom Mitchell:
best job of labeling which
object was the primary object in

1100
00:49:31,619 --> 00:49:35,940
Tom Mitchell:
an image, and the ImageNet data
set was very large.

1101
00:49:35,940 --> 00:49:41,380
Tom Mitchell:
In twenty twelve, Geoff Hinton
and some of his students entered

1102
00:49:41,380 --> 00:49:45,539
Tom Mitchell:
the competition and they blew
away the competition.

1103
00:49:45,539 --> 00:49:50,260
Tom Mitchell:
What's interesting is they were
the only neural network approach

1104
00:49:50,260 --> 00:49:52,780
Tom Mitchell:
in the competition. By that time,

1105
00:49:52,780 --> 00:49:55,380
Tom Mitchell:
by the way, neural networks were

1106
00:49:55,380 --> 00:49:58,000
Tom Mitchell:
very scarce in the field of

1107
00:49:58,000 --> 00:49:59,739
Tom Mitchell:
machine learning.

1108
00:49:59,739 --> 00:50:02,090
Tom Mitchell:
They had been displaced really

1109
00:50:02,090 --> 00:50:04,650
Tom Mitchell:
by more recent probabilistic

1110
00:50:04,650 --> 00:50:09,409
Tom Mitchell:
methods, and only a smallish

1111
00:50:09,409 --> 00:50:11,210
Tom Mitchell:
number of researchers were even

1112
00:50:11,210 --> 00:50:12,489
Tom Mitchell:
still working on neural

1113
00:50:12,489 --> 00:50:13,730
Tom Mitchell:
networks.

1114
00:50:13,730 --> 00:50:17,010
Tom Mitchell:
But, nevertheless, this
happened.

1115
00:50:17,010 --> 00:50:18,849
Tom Mitchell:
So I asked Geoff about that.

1116
00:50:19,449 --> 00:50:21,929
Geoffrey Hinton:
And Yann realized when Fei-Fei
came up with the ImageNet

1117
00:50:21,929 --> 00:50:25,289
Geoffrey Hinton:
dataset, Yann realized they
could win that competition, and

1118
00:50:25,289 --> 00:50:27,849
Geoffrey Hinton:
he tried to get graduate
students and postdocs in his lab

1119
00:50:27,849 --> 00:50:30,090
Geoffrey Hinton:
to do it, and they all declined.

1120
00:50:31,769 --> 00:50:38,369
Geoffrey Hinton:
And Ilya, Ilya Sutskever
realized that backprop

1121
00:50:38,369 --> 00:50:40,369
Geoffrey Hinton:
would just kill ImageNet.

1122
00:50:40,369 --> 00:50:45,449
Geoffrey Hinton:
He wanted Alex to work
on it, and Alex actually didn't really

1123
00:50:45,449 --> 00:50:46,489
Geoffrey Hinton:
want to work on it.

1124
00:50:46,489 --> 00:50:47,769
Geoffrey Hinton:
Alex had already been

1125
00:50:47,769 --> 00:50:49,090
Geoffrey Hinton:
working on small images and

1126
00:50:49,090 --> 00:50:50,730
Geoffrey Hinton:
recognizing small images in

1127
00:50:50,730 --> 00:50:53,889
Geoffrey Hinton:
CIFAR-10, and Ilya pre-processed

1128
00:50:53,889 --> 00:50:55,010
Geoffrey Hinton:
everything for Alex to make it

1129
00:50:55,010 --> 00:50:55,929
Geoffrey Hinton:
easy.

1130
00:50:55,929 --> 00:50:58,429
Geoffrey Hinton:
And I bought Alex two Nvidia

1131
00:50:58,429 --> 00:51:00,829
Geoffrey Hinton:
GPUs to have in his bedroom at

1132
00:51:00,829 --> 00:51:01,869
Geoffrey Hinton:
home.

1133
00:51:01,869 --> 00:51:07,110
Geoffrey Hinton:
Alex then
got on with it, and he was an

1134
00:51:07,110 --> 00:51:08,429
Geoffrey Hinton:
absolutely wizard programmer.

1135
00:51:08,429 --> 00:51:11,269
Geoffrey Hinton:
He wrote amazing code on

1136
00:51:11,269 --> 00:51:13,230
Geoffrey Hinton:
multiple GPUs to do convolution

1137
00:51:13,230 --> 00:51:14,110
Geoffrey Hinton:
really efficiently.

1138
00:51:14,110 --> 00:51:16,469
Geoffrey Hinton:
Much better code than anybody
else had ever written, I believe.

1139
00:51:16,469 --> 00:51:24,550
Geoffrey Hinton:
And so it's a
combination of Ilya realizing we

1140
00:51:24,550 --> 00:51:26,269
Geoffrey Hinton:
really had to do this.

1141
00:51:26,269 --> 00:51:29,070
Geoffrey Hinton:
Ilya was involved in the
design of the net and so on, but

1142
00:51:29,070 --> 00:51:31,349
Geoffrey Hinton:
Alex's programming skills.

1143
00:51:31,349 --> 00:51:35,909
Geoffrey Hinton:
And then I added a few ideas,
like use rectified linear units

1144
00:51:35,909 --> 00:51:40,750
Geoffrey Hinton:
instead of sigmoid units and use
little patches of the images.

1145
00:51:40,750 --> 00:51:42,389
Geoffrey Hinton:
I mean, big patches of the
images.

1146
00:51:42,389 --> 00:51:43,590
Geoffrey Hinton:
So you can translate things

1147
00:51:43,590 --> 00:51:44,769
Geoffrey Hinton:
around a bit to get some

1148
00:51:44,769 --> 00:51:46,750
Geoffrey Hinton:
translation invariance, as well

1149
00:51:46,750 --> 00:51:50,030
Geoffrey Hinton:
as using convolution, and

1150
00:51:50,030 --> 00:51:50,909
Geoffrey Hinton:
use dropout.

1151
00:51:50,909 --> 00:51:53,150
Geoffrey Hinton:
So that was one of the first
applications of dropout.

1152
00:51:53,150 --> 00:51:55,349
Geoffrey Hinton:
And that helped about one
percent.

1153
00:51:55,349 --> 00:51:57,469
Geoffrey Hinton:
It helped.

1154
00:51:57,469 --> 00:52:00,250
Geoffrey Hinton:
And then we beat the best vision
systems.

1155
00:52:00,250 --> 00:52:03,309
Geoffrey Hinton:
The best vision systems were
sort of plateauing at twenty

1156
00:52:03,309 --> 00:52:05,250
Geoffrey Hinton:
five percent errors.

1157
00:52:05,250 --> 00:52:07,650
Geoffrey Hinton:
That's errors for getting the
right answer in your

1158
00:52:07,650 --> 00:52:08,969
Geoffrey Hinton:
top five bets.

1159
00:52:08,969 --> 00:52:14,329
Geoffrey Hinton:
And we got like fifteen
percent, fifteen or sixteen,

1160
00:52:14,329 --> 00:52:15,889
Geoffrey Hinton:
depending on how you count it.

1161
00:52:15,889 --> 00:52:18,050
Geoffrey Hinton:
So we got almost half the error
rate.

1162
00:52:19,289 --> 00:52:21,650
Geoffrey Hinton:
And what happened then was what

1163
00:52:21,650 --> 00:52:22,849
Geoffrey Hinton:
ought to happen in science but

1164
00:52:22,849 --> 00:52:24,809
Geoffrey Hinton:
seldom does.

1165
00:52:24,809 --> 00:52:29,409
Geoffrey Hinton:
So our most vigorous opponents,
like Jitendra Malik and

1166
00:52:29,409 --> 00:52:32,889
Geoffrey Hinton:
Zisserman, Andrew Zisserman,
looked at these results and

1167
00:52:32,889 --> 00:52:35,210
Geoffrey Hinton:
said, okay, you were right.

1168
00:52:35,210 --> 00:52:37,329
Geoffrey Hinton:
That never happens in science.

1169
00:52:37,329 --> 00:52:40,289
Geoffrey Hinton:
And, slightly irritatingly,
Andrew Zisserman then switched

1170
00:52:40,289 --> 00:52:41,889
Geoffrey Hinton:
to doing this.

1171
00:52:41,889 --> 00:52:45,130
Geoffrey Hinton:
He had some very good postdocs
or students working with him,

1172
00:52:45,130 --> 00:52:50,170
Geoffrey Hinton:
like Simonyan, and after about

1173
00:52:50,170 --> 00:52:51,329
Geoffrey Hinton:
a year, they were making better

1174
00:52:51,329 --> 00:52:54,250
Geoffrey Hinton:
networks than us, but that was

1175
00:52:54,250 --> 00:52:55,289
Geoffrey Hinton:
really the.

1176
00:52:57,289 --> 00:52:59,039
Geoffrey Hinton:
As far as the general public was
concerned,

1177
00:52:59,039 --> 00:53:01,000
Geoffrey Hinton:
that was the start of this big

1178
00:53:01,000 --> 00:53:02,219
Geoffrey Hinton:
swing towards deep learning in

1179
00:53:02,219 --> 00:53:03,480
Geoffrey Hinton:
twenty twelve.

1180
00:53:04,159 --> 00:53:08,860
Tom Mitchell:
So that event, that competition

1181
00:53:08,860 --> 00:53:10,400
Tom Mitchell:
and the fact that the neural

1182
00:53:10,400 --> 00:53:13,920
Tom Mitchell:
network approach totally

1183
00:53:13,920 --> 00:53:15,119
Tom Mitchell:
dominated all the other

1184
00:53:15,119 --> 00:53:17,199
Tom Mitchell:
approaches really was a wake up

1185
00:53:17,199 --> 00:53:19,679
Tom Mitchell:
call to both the computer vision

1186
00:53:19,679 --> 00:53:22,019
Tom Mitchell:
community, where within a couple

1187
00:53:22,019 --> 00:53:24,079
Tom Mitchell:
of years everybody was using

1188
00:53:24,079 --> 00:53:25,159
Tom Mitchell:
neural networks.

1189
00:53:25,159 --> 00:53:28,880
Tom Mitchell:
But it was also a wake up call
to the machine learning

1190
00:53:28,880 --> 00:53:33,599
Tom Mitchell:
community, who had kind of
scoffed at neural networks for

1191
00:53:33,599 --> 00:53:38,800
Tom Mitchell:
several decades, that neural
networks were back.

1192
00:53:38,800 --> 00:53:41,400
Tom Mitchell:
And so people started again, now

1193
00:53:41,400 --> 00:53:43,119
Tom Mitchell:
experimenting with this new

1194
00:53:43,119 --> 00:53:44,960
Tom Mitchell:
generation of deep neural

1195
00:53:44,960 --> 00:53:45,880
Tom Mitchell:
networks.

1196
00:53:45,880 --> 00:53:48,960
Tom Mitchell:
That just meant that instead of
having two layers, they could

1197
00:53:48,960 --> 00:53:53,320
Tom Mitchell:
have many layers, dozens of
layers, because training

1198
00:53:53,320 --> 00:53:57,500
Tom Mitchell:
algorithms were available and so
was the computation.

1199
00:53:57,500 --> 00:54:01,420
Tom Mitchell:
People started experimenting with
these, primarily on

1200
00:54:01,420 --> 00:54:04,300
Tom Mitchell:
perceptual style problems.

1201
00:54:04,300 --> 00:54:07,300
Tom Mitchell:
In fact, by twenty sixteen,

1202
00:54:07,300 --> 00:54:09,739
Tom Mitchell:
neural nets had taken over not

1203
00:54:09,739 --> 00:54:12,559
Tom Mitchell:
only computer vision, but in

1204
00:54:12,559 --> 00:54:16,300
Tom Mitchell:
twenty sixteen, some scientists

1205
00:54:16,300 --> 00:54:19,099
Tom Mitchell:
from Microsoft showed that they

1206
00:54:19,099 --> 00:54:20,739
Tom Mitchell:
had been able to train a neural

1207
00:54:20,739 --> 00:54:24,699
Tom Mitchell:
network to finally reach human

1208
00:54:24,699 --> 00:54:25,820
Tom Mitchell:
level

1209
00:54:25,820 --> 00:54:31,739
Tom Mitchell:
speech recognition performance
for individual words in a widely

1210
00:54:31,739 --> 00:54:35,739
Tom Mitchell:
used data set called the
Switchboard data set.

1211
00:54:35,739 --> 00:54:39,139
Tom Mitchell:
So people were experimenting
with neural nets for visual

1212
00:54:39,139 --> 00:54:45,019
Tom Mitchell:
data, speech data, radar, lidar,
all kinds of sensory data.

1213
00:54:45,019 --> 00:54:46,900
Tom Mitchell:
People started also asking,

1214
00:54:46,900 --> 00:54:49,739
Tom Mitchell:
well, can we apply these to text

1215
00:54:49,739 --> 00:54:51,219
Tom Mitchell:
data?

1216
00:54:51,219 --> 00:54:53,179
Tom Mitchell:
And the answer was yes.

1217
00:54:53,179 --> 00:54:58,639
Tom Mitchell:
And people started inventing
various architectures, things

1218
00:54:58,639 --> 00:55:04,199
Tom Mitchell:
with names like long short term
memory and others to analyze

1219
00:55:04,199 --> 00:55:09,840
Tom Mitchell:
sequences of text and applying
them to problems like machine

1220
00:55:09,840 --> 00:55:14,760
Tom Mitchell:
translation, translating English
into French, and so forth.

1221
00:55:14,760 --> 00:55:19,880
Tom Mitchell:
And, uh, that kind of
worked.

1222
00:55:19,880 --> 00:55:22,840
Tom Mitchell:
And then in twenty seventeen,

1223
00:55:22,840 --> 00:55:25,320
Tom Mitchell:
a very important paper was

1224
00:55:25,320 --> 00:55:26,559
Tom Mitchell:
published.

1225
00:55:26,559 --> 00:55:30,639
Tom Mitchell:
The name of the paper was
Attention Is All You Need.

1226
00:55:30,639 --> 00:55:36,800
Tom Mitchell:
And what that was referring to
was a subcircuit in a

1227
00:55:36,800 --> 00:55:39,960
Tom Mitchell:
neural network called an
attention mechanism that had

1228
00:55:39,960 --> 00:55:45,159
Tom Mitchell:
recently been invented and
developed and was trainable.

1229
00:55:45,159 --> 00:55:50,400
Tom Mitchell:
But that attention mechanism

1230
00:55:50,400 --> 00:55:53,835
Tom Mitchell:
was used in this paper, and it

1231
00:55:53,835 --> 00:55:55,989
Tom Mitchell:
advanced the state of the art in

1232
00:55:55,989 --> 00:55:57,710
Tom Mitchell:
machine translation.

1233
00:55:57,710 --> 00:56:02,909
Tom Mitchell:
But even more importantly for us
today, it introduced the

1234
00:56:02,909 --> 00:56:07,869
Tom Mitchell:
transformer architecture based
on this attention mechanism.

1235
00:56:07,869 --> 00:56:09,469
Tom Mitchell:
And it's that transformer

1236
00:56:09,469 --> 00:56:13,110
Tom Mitchell:
architecture that underlies GPT

1237
00:56:13,110 --> 00:56:15,670
Tom Mitchell:
and pretty much all of the large

1238
00:56:15,670 --> 00:56:17,630
Tom Mitchell:
language models that were

1239
00:56:17,630 --> 00:56:19,989
Tom Mitchell:
released around twenty twenty

1240
00:56:19,989 --> 00:56:20,590
Tom Mitchell:
two.

1241
00:56:21,909 --> 00:56:24,510
Tom Mitchell:
So that was a major event.

1242
00:56:24,510 --> 00:56:27,389
Tom Mitchell:
Now, around the same time, Yann

1243
00:56:27,389 --> 00:56:29,150
Tom Mitchell:
LeCun, remember the guy who was

1244
00:56:29,150 --> 00:56:32,570
Tom Mitchell:
a postdoc with Geoff in nineteen

1245
00:56:32,570 --> 00:56:34,429
Tom Mitchell:
eighty seven?

1246
00:56:34,429 --> 00:56:39,630
Tom Mitchell:
Yann had become the head of AI
research at Facebook.

1247
00:56:39,630 --> 00:56:43,710
Tom Mitchell:
And so he was in a very
interesting position because he

1248
00:56:43,710 --> 00:56:45,190
Tom Mitchell:
was both an academic.

1249
00:56:45,190 --> 00:56:50,429
Tom Mitchell:
He retained his NYU
professorship and at the same

1250
00:56:50,429 --> 00:56:54,329
Tom Mitchell:
time he had a foot in the
commercial world directing the

1251
00:56:54,329 --> 00:56:56,969
Tom Mitchell:
AI strategy for Facebook.

1252
00:56:56,969 --> 00:56:59,750
Tom Mitchell:
So I asked Yann about this period

1253
00:56:59,750 --> 00:57:02,170
Tom Mitchell:
and what it looked like to him

1254
00:57:02,170 --> 00:57:04,969
Tom Mitchell:
from being inside both

1255
00:57:04,969 --> 00:57:05,769
Tom Mitchell:
worlds.

1256
00:57:05,769 --> 00:57:07,650
Tom Mitchell:
The first part of his answer was

1257
00:57:07,650 --> 00:57:10,530
Tom Mitchell:
that he said for him, a key

1258
00:57:10,530 --> 00:57:13,210
Tom Mitchell:
development was realizing that

1259
00:57:13,210 --> 00:57:15,809
Tom Mitchell:
you didn't have to wait for

1260
00:57:15,809 --> 00:57:17,769
Tom Mitchell:
people to label all your

1261
00:57:17,769 --> 00:57:20,050
Tom Mitchell:
training data, that you could do

1262
00:57:20,050 --> 00:57:22,670
Tom Mitchell:
something called self-supervised

1263
00:57:22,670 --> 00:57:23,409
Tom Mitchell:
learning.

1264
00:57:23,409 --> 00:57:28,690
Tom Mitchell:
For example, just take data like
a string of words and remove a

1265
00:57:28,690 --> 00:57:33,369
Tom Mitchell:
word and force
the program to predict what that

1266
00:57:33,369 --> 00:57:35,090
Tom Mitchell:
removed word was.

1267
00:57:35,090 --> 00:57:38,130
Tom Mitchell:
So there's no human labeling you
have to do for that.

1268
00:57:38,130 --> 00:57:39,449
Tom Mitchell:
You can use the whole web and

1269
00:57:39,449 --> 00:57:40,650
Tom Mitchell:
you get a lot of training

1270
00:57:40,650 --> 00:57:42,010
Tom Mitchell:
examples.

1271
00:57:42,010 --> 00:57:47,369
Tom Mitchell:
So that self-supervised
learning was a key development.

1272
00:57:47,369 --> 00:57:50,650
Tom Mitchell:
But then here's his description
of what came next.

1273
00:57:50,929 --> 00:57:53,849
Yann LeCun:
So the idea that self-supervised
learning could really kind of

1274
00:57:53,849 --> 00:57:56,110
Yann LeCun:
bring something to the table
there, I think was kind of a

1275
00:57:56,110 --> 00:58:03,349
Yann LeCun:
big sort of
change of mindset.

1276
00:58:03,349 --> 00:58:07,909
Yann LeCun:
And then there was
Transformers, of course.

1277
00:58:07,909 --> 00:58:08,630
Yann LeCun:
Right.

1278
00:58:08,630 --> 00:58:13,989
Yann LeCun:
So before that,
there was some

1279
00:58:13,989 --> 00:58:17,909
Yann LeCun:
demonstration that, you
know, you could basically match

1280
00:58:17,909 --> 00:58:21,449
Yann LeCun:
the performance of classical
systems for tasks like

1281
00:58:21,449 --> 00:58:26,309
Yann LeCun:
translation, language
translation using large neural

1282
00:58:26,309 --> 00:58:27,269
Yann LeCun:
nets like LSTM.

1283
00:58:27,269 --> 00:58:30,710
Yann LeCun:
So this was the work by Ilya
Sutskever when he was at Google.

1284
00:58:30,710 --> 00:58:34,750
Yann LeCun:
He had this big sequence to
sequence model with LSTMs and

1285
00:58:34,750 --> 00:58:38,469
Yann LeCun:
some gigantic model where you
can train it to do

1286
00:58:40,389 --> 00:58:41,110
Yann LeCun:
translation.

1287
00:58:41,110 --> 00:58:43,510
Yann LeCun:
And it kind of works at the same

1288
00:58:43,510 --> 00:58:44,750
Yann LeCun:
level, if not better in some

1289
00:58:44,750 --> 00:58:47,510
Yann LeCun:
cases, than the then-classical

1290
00:58:47,510 --> 00:58:50,530
Yann LeCun:
translation

1291
00:58:50,530 --> 00:58:51,690
Yann LeCun:
methods.

1292
00:58:51,690 --> 00:58:53,070
Yann LeCun:
Then a few months later,

1293
00:58:53,070 --> 00:58:57,230
Yann LeCun:
Yoshua Bengio and Kyunghyun Cho,

1294
00:58:57,230 --> 00:58:58,869
Yann LeCun:
who is now a colleague at NYU,

1295
00:58:58,869 --> 00:59:01,889
Yann LeCun:
uh, showed that you could change

1296
00:59:01,889 --> 00:59:03,550
Yann LeCun:
the architecture and use this

1297
00:59:03,550 --> 00:59:05,130
Yann LeCun:
attention mechanism

1298
00:59:05,130 --> 00:59:09,889
Yann LeCun:
that they proposed
to basically get really good

1299
00:59:09,889 --> 00:59:12,869
Yann LeCun:
performance on translation with
much smaller models than what

1300
00:59:12,869 --> 00:59:14,889
Yann LeCun:
Ilya had been proposing.

1301
00:59:14,889 --> 00:59:16,170
Yann LeCun:
And the entire industry jumped

1302
00:59:16,170 --> 00:59:18,130
Yann LeCun:
on this. Chris Manning's

1303
00:59:18,130 --> 00:59:20,010
Yann LeCun:
group at Stanford, kind of, you

1304
00:59:20,010 --> 00:59:22,250
Yann LeCun:
know, used that architecture and

1305
00:59:22,250 --> 00:59:24,789
Yann LeCun:
basically beat, you know,

1306
00:59:24,789 --> 00:59:27,409
Yann LeCun:
won the WMT competition for a

1307
00:59:27,409 --> 00:59:30,190
Yann LeCun:
particular, uh, type of

1308
00:59:30,190 --> 00:59:31,050
Yann LeCun:
translation.

1309
00:59:31,050 --> 00:59:32,570
Yann LeCun:
And the entire industry jumped
on it.

1310
00:59:32,570 --> 00:59:35,489
Yann LeCun:
So within a few months after
that, like, you know, all the

1311
00:59:35,489 --> 00:59:40,349
Yann LeCun:
big players, uh, in translation,
were using attention type

1312
00:59:40,349 --> 00:59:43,690
Yann LeCun:
architectures for translation.

1313
00:59:43,690 --> 00:59:48,010
Yann LeCun:
And that's when the
transformer paper came out.

1314
00:59:48,010 --> 00:59:49,519
Yann LeCun:
Attention Is All You Need.

1315
00:59:49,519 --> 00:59:52,880
Yann LeCun:
So basically, if you build a
neural net just with those kind

1316
00:59:52,880 --> 00:59:56,960
Yann LeCun:
of attention circuits, you don't
need much else.

1317
00:59:56,960 --> 00:59:59,519
Yann LeCun:
And it ends up working super
well.

1318
00:59:59,519 --> 01:00:01,760
Yann LeCun:
And that's what started the, you

1319
01:00:01,760 --> 01:00:02,639
Yann LeCun:
know, the transformer

1320
01:00:02,639 --> 01:00:03,760
Yann LeCun:
revolution.

1321
01:00:03,760 --> 01:00:06,079
Yann LeCun:
And then after that came
BERT, which also came out of

1322
01:00:06,079 --> 01:00:08,699
Yann LeCun:
Google, which was this idea of
using self-supervised learning,

1323
01:00:08,699 --> 01:00:12,519
Yann LeCun:
where I take a sequence of
words, corrupt it, remove some

1324
01:00:12,519 --> 01:00:16,199
Yann LeCun:
other words, and then train this
big neural net to reconstruct

1325
01:00:16,199 --> 01:00:17,239
Yann LeCun:
the words that are missing.

1326
01:00:17,239 --> 01:00:19,239
Yann LeCun:
Predict the words that are
missing.

1327
01:00:19,239 --> 01:00:21,559
Yann LeCun:
And again, people were

1328
01:00:21,559 --> 01:00:23,559
Yann LeCun:
amazed by how good the

1329
01:00:23,559 --> 01:00:24,840
Yann LeCun:
representations learned by the

1330
01:00:24,840 --> 01:00:27,679
Yann LeCun:
system were for all kinds of NLP

1331
01:00:27,679 --> 01:00:28,880
Yann LeCun:
tasks.

1332
01:00:28,880 --> 01:00:32,400
Yann LeCun:
And that really, uh, you know,
kind of captured the imagination

1333
01:00:32,400 --> 01:00:33,920
Yann LeCun:
of a lot of people.

1334
01:00:33,920 --> 01:00:37,719
Yann LeCun:
And then after that, the
next revolution was, oh,

1335
01:00:37,719 --> 01:00:40,599
Yann LeCun:
actually, the best thing to do
is you remove the encoder, you

1336
01:00:40,599 --> 01:00:42,239
Yann LeCun:
just use a decoder.

1337
01:00:42,239 --> 01:00:46,519
Yann LeCun:
And you just train a system,
you feed it a sequence, and you

1338
01:00:46,519 --> 01:00:49,260
Yann LeCun:
just train it to reproduce the
input sequence on its output,

1339
01:00:49,260 --> 01:00:52,579
Yann LeCun:
and because the architecture of
the decoder is strictly causal,

1340
01:00:52,579 --> 01:00:55,500
Yann LeCun:
because a particular output
is not connected to the

1341
01:00:55,500 --> 01:00:57,900
Yann LeCun:
corresponding input, it's only
connected to the ones to the

1342
01:00:57,900 --> 01:00:58,659
Yann LeCun:
left of it,

1343
01:00:58,659 --> 01:00:59,940
Yann LeCun:
implicitly you're training the

1344
01:00:59,940 --> 01:01:01,860
Yann LeCun:
system to predict the next word

1345
01:01:01,860 --> 01:01:03,260
Yann LeCun:
that comes after a sequence of

1346
01:01:03,260 --> 01:01:04,099
Yann LeCun:
words.

1347
01:01:04,099 --> 01:01:06,860
Yann LeCun:
That's the GPT architecture that

1348
01:01:06,860 --> 01:01:08,840
Yann LeCun:
was, you know, promoted by

1349
01:01:08,840 --> 01:01:10,019
Yann LeCun:
OpenAI.

1350
01:01:10,019 --> 01:01:13,300
Yann LeCun:
And that turned out to be
more scalable than BERT,

1351
01:01:13,300 --> 01:01:15,579
Yann LeCun:
in the sense that you can

1352
01:01:15,579 --> 01:01:16,699
Yann LeCun:
train gigantic networks on

1353
01:01:16,699 --> 01:01:18,219
Yann LeCun:
enormous amounts of data and you

1354
01:01:18,219 --> 01:01:20,179
Yann LeCun:
get some sort of emergent

1355
01:01:20,179 --> 01:01:20,980
Yann LeCun:
property.

1356
01:01:20,980 --> 01:01:22,659
Yann LeCun:
And that's what gave us LLMs.

1357
01:01:23,500 --> 01:01:27,380
Tom Mitchell:
So that brings us up to today
with Transformers.

1358
01:01:27,380 --> 01:01:32,340
Tom Mitchell:
And you can see this very
strange evolution and wandering

1359
01:01:32,340 --> 01:01:39,019
Tom Mitchell:
path of progress and
exploration over decades.

1360
01:01:39,019 --> 01:01:42,199
Tom Mitchell:
So before we leave,

1361
01:01:42,199 --> 01:01:44,980
Tom Mitchell:
let's just take a look

1362
01:01:44,980 --> 01:01:47,760
Tom Mitchell:
at that history and say, what if

1363
01:01:47,760 --> 01:01:50,480
Tom Mitchell:
this is a case study of how

1364
01:01:50,480 --> 01:01:52,599
Tom Mitchell:
scientific progress was made in

1365
01:01:52,599 --> 01:01:53,519
Tom Mitchell:
this field?

1366
01:01:53,519 --> 01:01:56,800
Tom Mitchell:
What are the main themes we see?

1367
01:01:56,800 --> 01:02:01,159
Tom Mitchell:
Well, I think the first one is
progress happens in waves.

1368
01:02:01,159 --> 01:02:03,760
Tom Mitchell:
It's paradigm after paradigm,
right?

1369
01:02:03,760 --> 01:02:06,360
Tom Mitchell:
First there were perceptrons,

1370
01:02:06,360 --> 01:02:09,119
Tom Mitchell:
but that got, uh, thrown away

1371
01:02:09,119 --> 01:02:11,099
Tom Mitchell:
and replaced by symbolic

1372
01:02:11,099 --> 01:02:13,280
Tom Mitchell:
representations being learned,

1373
01:02:13,280 --> 01:02:15,199
Tom Mitchell:
eventually to be replaced by

1374
01:02:15,199 --> 01:02:17,320
Tom Mitchell:
neural nets, which were replaced

1375
01:02:17,320 --> 01:02:19,440
Tom Mitchell:
by probabilistic methods and so

1376
01:02:19,440 --> 01:02:19,719
Tom Mitchell:
forth.

1377
01:02:19,719 --> 01:02:24,000
Tom Mitchell:
So there's wave after wave of
paradigms.

1378
01:02:24,000 --> 01:02:26,199
Tom Mitchell:
Another theme is that a lot of

1379
01:02:26,199 --> 01:02:28,659
Tom Mitchell:
these ideas really came from

1380
01:02:28,659 --> 01:02:30,320
Tom Mitchell:
other fields.

1381
01:02:30,320 --> 01:02:31,519
Tom Mitchell:
Even the very notion of

1382
01:02:31,519 --> 01:02:33,820
Tom Mitchell:
perceptrons came from somebody

1383
01:02:33,820 --> 01:02:35,880
Tom Mitchell:
who was fundamentally a

1384
01:02:35,880 --> 01:02:39,159
Tom Mitchell:
neuroscientist interested in how

1385
01:02:39,159 --> 01:02:41,539
Tom Mitchell:
neurons in the brain could even

1386
01:02:41,539 --> 01:02:43,679
Tom Mitchell:
learn stuff.

1387
01:02:43,679 --> 01:02:44,480
Tom Mitchell:
PAC learning.

1388
01:02:44,480 --> 01:02:47,309
Tom Mitchell:
You heard Les Valiant talk.

1389
01:02:47,309 --> 01:02:48,670
Tom Mitchell:
He's very much a

1390
01:02:48,670 --> 01:02:50,230
Tom Mitchell:
computational complexity

1391
01:02:50,230 --> 01:02:53,150
Tom Mitchell:
researcher who found that this

1392
01:02:53,150 --> 01:02:55,349
Tom Mitchell:
was an interesting theoretical

1393
01:02:55,349 --> 01:02:56,630
Tom Mitchell:
result.

1394
01:02:56,630 --> 01:02:58,030
Tom Mitchell:
Bayesian networks were heavily

1395
01:02:58,030 --> 01:03:00,110
Tom Mitchell:
influenced by statistics and so

1396
01:03:00,110 --> 01:03:00,750
Tom Mitchell:
forth.

1397
01:03:01,829 --> 01:03:04,090
Tom Mitchell:
Many of these advances really

1398
01:03:04,090 --> 01:03:06,630
Tom Mitchell:
were new framings of the

1399
01:03:06,630 --> 01:03:08,349
Tom Mitchell:
problem.

1400
01:03:08,349 --> 01:03:11,750
Tom Mitchell:
So, uh, Winston's work on

1401
01:03:11,750 --> 01:03:13,389
Tom Mitchell:
symbolic learning was really a

1402
01:03:13,389 --> 01:03:15,510
Tom Mitchell:
reframing of what the problem

1403
01:03:15,510 --> 01:03:16,269
Tom Mitchell:
was.

1404
01:03:16,269 --> 01:03:17,670
Tom Mitchell:
The work on reinforcement

1405
01:03:17,670 --> 01:03:19,710
Tom Mitchell:
learning is really changing the

1406
01:03:19,710 --> 01:03:21,949
Tom Mitchell:
definition of what the training

1407
01:03:21,949 --> 01:03:24,349
Tom Mitchell:
signal even is for these

1408
01:03:24,349 --> 01:03:25,750
Tom Mitchell:
systems.

1409
01:03:25,750 --> 01:03:28,550
Tom Mitchell:
So that's another theme that you
see.

1410
01:03:30,869 --> 01:03:32,989
Tom Mitchell:
And finally, I think like a lot

1411
01:03:32,989 --> 01:03:35,829
Tom Mitchell:
of scientific fields, machine

1412
01:03:35,829 --> 01:03:38,969
Tom Mitchell:
learning is really a blend of

1413
01:03:38,969 --> 01:03:42,050
Tom Mitchell:
technical forces and social

1414
01:03:42,050 --> 01:03:43,630
Tom Mitchell:
forces.

1415
01:03:43,630 --> 01:03:46,449
Tom Mitchell:
Certainly in the long term,

1416
01:03:46,449 --> 01:03:48,369
Tom Mitchell:
the cold, hard facts of what

1417
01:03:48,369 --> 01:03:51,969
Tom Mitchell:
works best come out and those

1418
01:03:51,969 --> 01:03:53,610
Tom Mitchell:
methods win.

1419
01:03:53,610 --> 01:03:56,010
Tom Mitchell:
But in the shorter term, the

1420
01:03:56,010 --> 01:03:57,449
Tom Mitchell:
question of who works on what

1421
01:03:57,449 --> 01:04:00,150
Tom Mitchell:
kinds of problems is very much

1422
01:04:00,150 --> 01:04:02,590
Tom Mitchell:
influenced by the personalities

1423
01:04:02,590 --> 01:04:03,969
Tom Mitchell:
of people.

1424
01:04:03,969 --> 01:04:06,090
Tom Mitchell:
Their ability to persuade other

1425
01:04:06,090 --> 01:04:09,050
Tom Mitchell:
people to jump in and start

1426
01:04:09,050 --> 01:04:10,849
Tom Mitchell:
working with them on their

1427
01:04:10,849 --> 01:04:11,849
Tom Mitchell:
problems.

1428
01:04:11,849 --> 01:04:14,090
Tom Mitchell:
So these are some of the themes
you see.

1429
01:04:14,090 --> 01:04:17,889
Tom Mitchell:
And I think if you look around
at other fields, sometimes you

1430
01:04:17,889 --> 01:04:19,650
Tom Mitchell:
see similar themes.

1431
01:04:20,769 --> 01:04:26,449
Tom Mitchell:
Finally, what are the lessons
from all this for researchers?

1432
01:04:26,449 --> 01:04:31,610
Tom Mitchell:
I think the first lesson really
is question authority.

1433
01:04:31,610 --> 01:04:33,090
Tom Mitchell:
Because really, if you think

1434
01:04:33,090 --> 01:04:35,570
Tom Mitchell:
about the major advances, many

1435
01:04:35,570 --> 01:04:39,110
Tom Mitchell:
of those came from just

1436
01:04:39,110 --> 01:04:41,690
Tom Mitchell:
going against what was currently

1437
01:04:41,690 --> 01:04:43,550
Tom Mitchell:
the conventional wisdom in the

1438
01:04:43,550 --> 01:04:44,230
Tom Mitchell:
field.

1439
01:04:45,550 --> 01:04:47,849
Tom Mitchell:
Inventing a new framing or

1440
01:04:47,849 --> 01:04:49,949
Tom Mitchell:
taking a radically different

1441
01:04:49,949 --> 01:04:50,909
Tom Mitchell:
approach.

1442
01:04:52,150 --> 01:04:55,150
Tom Mitchell:
Another lesson: don't drag your
feet.

1443
01:04:55,150 --> 01:04:59,869
Tom Mitchell:
I've seen decade after decade,
new paradigms emerge in the

1444
01:04:59,869 --> 01:05:05,750
Tom Mitchell:
field, and every single time
that happens, existing

1445
01:05:05,750 --> 01:05:11,789
Tom Mitchell:
researchers take longer than
they need to to recognize the

1446
01:05:11,789 --> 01:05:14,550
Tom Mitchell:
benefits of the new paradigm.

1447
01:05:14,550 --> 01:05:18,710
Tom Mitchell:
And the most guilty people are
the senior researchers.

1448
01:05:18,710 --> 01:05:21,090
Tom Mitchell:
You can probably explain that by

1449
01:05:21,090 --> 01:05:22,750
Tom Mitchell:
taking into account who has the

1450
01:05:22,750 --> 01:05:25,469
Tom Mitchell:
most to lose if there's a new

1451
01:05:25,469 --> 01:05:27,269
Tom Mitchell:
paradigm replacing the current

1452
01:05:27,269 --> 01:05:28,110
Tom Mitchell:
approach.

1453
01:05:30,030 --> 01:05:31,670
Tom Mitchell:
Another lesson: learn to

1454
01:05:31,670 --> 01:05:33,550
Tom Mitchell:
communicate and learn to follow

1455
01:05:33,550 --> 01:05:34,590
Tom Mitchell:
through.

1456
01:05:34,590 --> 01:05:36,070
Tom Mitchell:
You heard Geoff Hinton when he

1457
01:05:36,070 --> 01:05:38,150
Tom Mitchell:
was talking about in the mid

1458
01:05:38,150 --> 01:05:39,469
Tom Mitchell:
eighties, the development of

1459
01:05:39,469 --> 01:05:41,190
Tom Mitchell:
backpropagation.

1460
01:05:41,190 --> 01:05:46,210
Tom Mitchell:
You heard him say we didn't
invent backpropagation, but we

1461
01:05:46,210 --> 01:05:48,530
Tom Mitchell:
showed that it was important.

1462
01:05:48,530 --> 01:05:51,650
Tom Mitchell:
And actually, to be fair, they

1463
01:05:51,650 --> 01:05:52,889
Tom Mitchell:
thought they were inventing

1464
01:05:52,889 --> 01:05:54,610
Tom Mitchell:
backpropagation.

1465
01:05:54,610 --> 01:05:57,090
Tom Mitchell:
They actually reinvented

1466
01:05:57,090 --> 01:05:58,489
Tom Mitchell:
it, but they had no idea that

1467
01:05:58,489 --> 01:06:00,690
Tom Mitchell:
somebody had invented it before,

1468
01:06:00,690 --> 01:06:04,329
Tom Mitchell:
because whoever did that didn't

1469
01:06:04,329 --> 01:06:06,329
Tom Mitchell:
succeed in waking up the

1470
01:06:06,329 --> 01:06:08,369
Tom Mitchell:
research community to the fact

1471
01:06:08,369 --> 01:06:09,889
Tom Mitchell:
that they had a really good

1472
01:06:09,889 --> 01:06:11,170
Tom Mitchell:
idea.

1473
01:06:11,170 --> 01:06:12,130
Tom Mitchell:
I don't know why.

1474
01:06:12,130 --> 01:06:14,010
Tom Mitchell:
Maybe they didn't put in the

1475
01:06:14,010 --> 01:06:16,050
Tom Mitchell:
effort or succeed in

1476
01:06:16,050 --> 01:06:17,170
Tom Mitchell:
communicating.

1477
01:06:17,170 --> 01:06:18,690
Tom Mitchell:
Maybe they dropped it after they

1478
01:06:18,690 --> 01:06:19,730
Tom Mitchell:
did it and went some other

1479
01:06:19,730 --> 01:06:21,530
Tom Mitchell:
direction so that they didn't

1480
01:06:21,530 --> 01:06:22,889
Tom Mitchell:
follow through to provide the

1481
01:06:22,889 --> 01:06:23,889
Tom Mitchell:
evidence.

1482
01:06:23,889 --> 01:06:25,849
Tom Mitchell:
But that kind of thing happens

1483
01:06:25,849 --> 01:06:28,250
Tom Mitchell:
frequently, and successful

1484
01:06:28,250 --> 01:06:30,050
Tom Mitchell:
researchers are good

1485
01:06:30,050 --> 01:06:31,849
Tom Mitchell:
communicators, and they follow

1486
01:06:31,849 --> 01:06:34,269
Tom Mitchell:
through to push the field to

1487
01:06:34,269 --> 01:06:35,409
Tom Mitchell:
pay attention.

1488
01:06:36,570 --> 01:06:38,489
Tom Mitchell:
The final lesson, I think, is

1489
01:06:38,489 --> 01:06:40,170
Tom Mitchell:
the philosophers were actually

1490
01:06:40,170 --> 01:06:41,800
Tom Mitchell:
right.

1491
01:06:41,800 --> 01:06:47,000
Tom Mitchell:
We really today, despite these
amazing capabilities of our

1492
01:06:47,000 --> 01:06:52,719
Tom Mitchell:
learning systems, we don't have
a proof or anything like a

1493
01:06:52,719 --> 01:06:57,659
Tom Mitchell:
rational justification of why
you can generalize from examples

1494
01:06:57,659 --> 01:07:02,400
Tom Mitchell:
to get these general rules that
work well. Despite the success

1495
01:07:02,400 --> 01:07:03,360
Tom Mitchell:
that we have,

1496
01:07:03,360 --> 01:07:07,920
Tom Mitchell:
we don't really understand at
this very fundamental level why.

1497
01:07:07,920 --> 01:07:12,099
Tom Mitchell:
And I think that if we did pay
more attention to that question,

1498
01:07:12,099 --> 01:07:16,599
Tom Mitchell:
we might have a better chance to
develop algorithms that

1499
01:07:16,599 --> 01:07:19,920
Tom Mitchell:
outperform what we have today.

1500
01:07:19,920 --> 01:07:21,320
Tom Mitchell:
So I'll stop there.

1501
01:07:21,320 --> 01:07:22,639
Tom Mitchell:
Thank you very much.

1502
01:07:28,719 --> 01:07:29,800
Speaker 12:
Tom Mitchell is the Founders

1503
01:07:29,800 --> 01:07:31,159
Speaker 12:
University Professor at Carnegie

1504
01:07:31,159 --> 01:07:32,360
Speaker 12:
Mellon University.

1505
01:07:32,360 --> 01:07:33,679
Speaker 12:
Machine Learning: How Did We Get
Here?

1506
01:07:33,679 --> 01:07:36,199
Speaker 12:
is produced by the Stanford
Digital Economy Lab.

1507
01:07:36,199 --> 01:07:37,320
Speaker 12:
If you enjoyed this episode,

1508
01:07:37,320 --> 01:07:38,500
Speaker 12:
subscribe wherever you listen to

1509
01:07:38,500 --> 01:07:39,440
Speaker 12:
podcasts.