1
00:00:00,000 --> 00:00:03,397
Welcome to another episode of Computer Vision Decoded.

2
00:00:03,397 --> 00:00:15,854
I'm really excited about this episode because it's going to solve a lot of questions that
we get about structure from motion and 3D reconstruction when it comes to COLMAP and just

3
00:00:15,854 --> 00:00:21,266
figuring out how to do some of the basics of 3D reconstruction from imagery.

4
00:00:21,266 --> 00:00:28,351
And as always, I have Jared Heinly, our in-house computer vision expert, to walk us through
what happens

5
00:00:28,351 --> 00:00:37,659
when you run software like COLMAP to get camera poses, 3D reconstruction, and kind of
break down how that all works at a tangible level.

6
00:00:37,659 --> 00:00:48,240
So when you walk away from this episode, you should have a better understanding of this
black box of COLMAP and other 3D reconstruction software that follows the same workflow.

7
00:00:48,240 --> 00:00:51,978
So as always, Jared, thanks for joining me and welcome to the episode.

8
00:00:51,978 --> 00:00:52,709
Yeah, thank you.

9
00:00:52,709 --> 00:00:55,310
Let's just get to what we're all here for.

10
00:00:55,310 --> 00:01:07,953
Let's learn about COLMAP and I don't wanna say specifically COLMAP, but we're gonna use it
as the basis for this episode to have something for someone to follow along and since it's

11
00:01:07,953 --> 00:01:10,274
open source and free,

12
00:01:10,370 --> 00:01:20,799
they can download COLMAP and do this on their own PC without, you know, having to pay for
some third party software that they won't learn as much through.

13
00:01:20,799 --> 00:01:24,722
So Jared, let's just start off with, I'm gonna share my screen.

14
00:01:24,722 --> 00:01:34,871
I have some images and we want to turn these images into a 3D model or just at least know
where these cameras are in relation to each other.

15
00:01:34,871 --> 00:01:36,925
I'm gonna be doing some screen shares.

16
00:01:37,127 --> 00:01:43,352
If you're listening to the audio only, I'll do my best to talk about what we have on the
screen.

17
00:01:43,533 --> 00:01:44,900
But if I start out

18
00:01:44,889 --> 00:01:51,759
here, I have a picture of a, well, it was a fountain that used to work in front of the
Oregon State Capitol.

19
00:01:51,759 --> 00:01:54,201
I took this one sunny day last year.

20
00:01:54,302 --> 00:02:01,808
And if I flip through the images, I basically walked around this fountain and got a

21
00:02:01,808 --> 00:02:02,899
bunch of good angles.

22
00:02:02,899 --> 00:02:07,492
In fact, I believe I used a video and extracted a bunch of images.

23
00:02:07,492 --> 00:02:15,238
And at some points there's some sun issues, things like that, but it was good enough for
me to get a 3D model.

24
00:02:15,238 --> 00:02:23,511
So Jared, what's the first step someone would take then to turn this into a 3D model, know
where the cameras are, things like that?

25
00:02:23,511 --> 00:02:24,542
Yeah, yeah.

26
00:02:24,542 --> 00:02:26,602
Well, you hinted at it right there at the very end.

27
00:02:26,602 --> 00:02:27,684
Know where the cameras are.

28
00:02:27,684 --> 00:02:29,743
I guess I try to refine my language.

29
00:02:29,743 --> 00:02:33,517
A lot of times when I say camera, sometimes I mean, you know, image and camera.

30
00:02:33,517 --> 00:02:35,868
I use those words interchangeably sometimes.

31
00:02:35,868 --> 00:02:42,233
But you said that you walked around with a single camera, you know, your phone or DSLR
or whatever it may be.

32
00:02:42,233 --> 00:02:47,627
And from that video, maybe you extracted frames, you know, images, or you took photos yourself.

33
00:02:47,627 --> 00:02:48,503
And so.

34
00:02:48,503 --> 00:02:56,536
you have multiple images taken by a single physical camera, but you were moving around
that scene, moving around that object.

35
00:02:56,536 --> 00:03:01,320
And so that camera was occupying different physical 3D points in space.

36
00:03:01,320 --> 00:03:06,262
And then these images were captured from those different 3D points, and so from those
different 3D perspectives.

37
00:03:06,262 --> 00:03:09,582
So, you know, as humans, we just do this naturally.

38
00:03:09,582 --> 00:03:15,822
Like as you just flipped through those photos there, you know, and as you kind of orbited
around,

39
00:03:15,822 --> 00:03:16,692
that fountain.

40
00:03:16,692 --> 00:03:19,143
It's like, our brains are immediately like, yeah, okay.

41
00:03:19,143 --> 00:03:21,944
I can see that the ground is a little bit closer.

42
00:03:21,944 --> 00:03:23,454
Here's this foreground fountain.

43
00:03:23,454 --> 00:03:25,295
I see the trees in the background.

44
00:03:25,295 --> 00:03:33,118
I see some other structures in the background and I'm immediately, I can see that, yep,
you are moving to the left and sort of this clockwise motion.

45
00:03:33,118 --> 00:03:36,399
This thing's, you know, near; those other things are far.

46
00:03:36,399 --> 00:03:43,831
Our brains are immediately doing all of that 3D reasoning, but in order to have software
do this, in order to have a computer,

47
00:03:43,831 --> 00:03:49,143
generate a 3D reconstruction or 3D representation of what's in these photos.

48
00:03:49,263 --> 00:03:53,983
It has to figure out, it has to do all of that math and it doesn't know how to do that
reasoning by default.

49
00:03:53,983 --> 00:03:57,477
It has to figure out, where were you standing when that photo was taken?

50
00:03:57,477 --> 00:03:58,827
Where was the camera positioned?

51
00:03:58,827 --> 00:04:00,008
How was it angled?

52
00:04:00,008 --> 00:04:03,711
What was the zoom level of the current lens?

53
00:04:03,711 --> 00:04:07,642
And so it's doing all, it has to figure out where everything was oriented.

54
00:04:07,642 --> 00:04:09,013
And that's typically one of the first processes.

55
00:04:09,013 --> 00:04:11,954
It's kind of figuring out how are things related to each other?

56
00:04:11,958 --> 00:04:20,363
And once we kind of know how they're related, then figure out what is the 3D geometry that
describes that relationship.

57
00:04:20,363 --> 00:04:21,203
Mm-hmm.

58
00:04:21,244 --> 00:04:27,037
And so it goes through, so I don't have it on my screen, but I will pull it up in a
second.

59
00:04:27,037 --> 00:04:33,952
But COLMAP has in their tutorial information, a good kind of diagram.

60
00:04:33,952 --> 00:04:38,816
Let me bring that up, but it basically shows the workflow that it goes through.

61
00:04:38,816 --> 00:04:46,381
So if I go to the actual website for COLMAP and you go look at their tutorial, you can see
that.

62
00:04:46,381 --> 00:04:49,713
So let's just pull that up on my screen as well.

63
00:04:58,376 --> 00:04:59,713
While he's pulling that up, just to jump in with a little bit of personal history about
COLMAP. So I did my PhD back at UNC Chapel Hill.

64
00:04:59,713 --> 00:05:01,757
So I was there from 2010 to 2015.

65
00:05:01,757 --> 00:05:06,901
And while I was there, Johannes Schoenberger, he came to UNC for two years to do his
master's.

66
00:05:06,901 --> 00:05:10,088
And so Johannes, he's the author of COLMAP.

67
00:05:10,088 --> 00:05:13,465
But at the time when he was there, COLMAP didn't exist.

68
00:05:13,545 --> 00:05:17,457
Johannes had worked on previous structure-from-motion software and had

69
00:05:17,457 --> 00:05:23,148
built, he'd worked with some, I believe those drones, so aerial photography, 3D
reconstruction.

70
00:05:23,148 --> 00:05:26,250
And so he had built a pipeline that he had called MavMap.

71
00:05:26,250 --> 00:05:33,212
I'll probably get this wrong, but I've been like, you know, mobile aerial vehicle, MavMap,
like mapper, mobile aerial vehicle mapper.

72
00:05:33,212 --> 00:05:42,398
And so, but he was looking to generalize that to move beyond just aerial photography and
to do more general purpose image collections.

73
00:05:42,398 --> 00:05:53,038
And so it was this idea of image collections where he came up with COLMAP, collection
mapper, to say, I want to take a collection of images and generate a 3D reconstruction

74
00:05:53,038 --> 00:05:53,418
from it.

75
00:05:53,418 --> 00:05:55,629
So he was working on that while he was at UNC.

76
00:05:55,629 --> 00:06:04,489
I may have been one of the first people to actually use COLMAP in my final PhD project.

77
00:06:04,489 --> 00:06:08,546
I had processed 100 million images on a single PC.

78
00:06:08,546 --> 00:06:14,308
And I was doing this feature matching extraction, but then I needed some way to
reconstruct them.

79
00:06:14,348 --> 00:06:20,712
And our lab had some other software that could do 3D reconstruction, but Johannes had just
written this first version of COLMAP.

80
00:06:20,752 --> 00:06:22,214
And so I said, great, let's use that.

81
00:06:22,214 --> 00:06:26,486
And that was efficient, that was fast, and it did exactly what we needed to do.

82
00:06:26,486 --> 00:06:31,358
And so that helped get my paper across the goal line there at the very end.

83
00:06:31,838 --> 00:06:36,882
And since then, Johannes has gone off to ETH Zurich and now to other companies,

84
00:06:36,882 --> 00:06:39,117
and continued to open source COLMAP.

85
00:06:39,117 --> 00:06:43,160
And now it's used all over the world and has won him some awards for it.

86
00:06:43,160 --> 00:06:50,347
You know, interestingly, GLOMAP came out last year and he had his fingers in that as well.

87
00:06:50,347 --> 00:06:51,700
So it's not over.

88
00:06:51,700 --> 00:06:57,597
I still see COLMAP being updated on a semi-regular basis as well.

89
00:06:57,597 --> 00:07:02,546
Although it came out several years ago, it's not static.

90
00:07:02,546 --> 00:07:06,466
No, no, because it is such an important step.

91
00:07:06,466 --> 00:07:22,166
The task that COLMAP solves and similarly GLOMAP, figuring out the 3D pose, pose is
position plus orientation, figuring out the 3D pose of images is a key step in so many 3D

92
00:07:22,166 --> 00:07:22,806
pipelines.

93
00:07:22,806 --> 00:07:27,346
If you want to understand the world in 3D, you got to figure out where these images were
taken from.

94
00:07:27,346 --> 00:07:31,786
And that's the key task that COLMAP solves for a lot of people.
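
A pose, as described here, is a position plus an orientation. The sketch below is one way to write that down, assuming a world-to-camera convention (a rotation `R` and translation `t` that map world points into the camera frame), which is our understanding of how COLMAP stores poses. The function names are hypothetical and only for illustration.

```python
import math

def rotation_about_y(angle_rad):
    """3x3 rotation matrix for a turn about the vertical (y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def world_to_camera(R, t, X):
    """Map a world point into the camera frame: X_cam = R * X_world + t."""
    return [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]

def camera_center(R, t):
    """Position of the camera in world coordinates: C = -R^T * t."""
    return [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]

# Orientation: camera turned 90 degrees about the vertical axis.
R = rotation_about_y(math.radians(90))
# Translation part of the world-to-camera transform (assumed values).
t = [0.0, 0.0, 5.0]

C = camera_center(R, t)                  # where the photo was taken from
center_in_cam = world_to_camera(R, t, C)  # should land on the camera origin
```

The camera center maps to the origin of its own camera frame, which is a quick sanity check that the position and orientation are being combined consistently.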

95
00:07:32,039 --> 00:07:32,949
Okay.

96
00:07:32,949 --> 00:07:34,170
That makes sense.

97
00:07:34,170 --> 00:07:38,171
I had no idea also that COLMAP stood for collection mapper.

98
00:07:38,171 --> 00:07:42,073
I mean, that makes sense, but I thought maybe it was a long acronym.

99
00:07:42,073 --> 00:07:44,723
Okay.

100
00:07:44,723 --> 00:07:55,868
Well, I have this diagram up then if you're watching, you can see it on the screen, but if
you're listening, it's basically a workflow of how images go from just a collection of

101
00:07:55,868 --> 00:07:56,858
images.

102
00:07:57,200 --> 00:08:00,692
to a 3D reconstruction and you've got camera poses.

103
00:08:00,692 --> 00:08:03,713
And I'm gonna show this in COLMAP on my screen as well.

104
00:08:03,713 --> 00:08:10,897
But this diagram just shows the different phases or steps that you go through
to get from pictures to 3D.

105
00:08:11,018 --> 00:08:14,459
And it starts out with feature extraction.

106
00:08:14,459 --> 00:08:20,273
And if you actually go to the tutorial as well, so if I share just the tutorial page,
that diagram makes sense.

107
00:08:20,273 --> 00:08:22,834
But the minute you start diving into it,

108
00:08:22,974 --> 00:08:34,943
you have a wall of text that most people won't make it through very well unless they
are perhaps a computer science major, someone like Jared who does this academically or for

109
00:08:34,943 --> 00:08:35,954
a job.

110
00:08:36,254 --> 00:08:39,136
I look at this and I'm like, okay, some of this makes sense.

111
00:08:39,136 --> 00:08:41,238
A lot of this is beyond me.

112
00:08:41,238 --> 00:08:42,340
So we're gonna break that down.

113
00:08:42,340 --> 00:08:44,882
So yeah, again, starting out with feature extraction.

114
00:08:44,882 --> 00:08:45,763
So what is that step?

115
00:08:45,763 --> 00:08:47,364
What are we... we're taking the images and

116
00:08:47,364 --> 00:08:49,923
Sounds like something's happening there with features.

117
00:08:50,029 --> 00:08:51,399
Yeah, yeah, absolutely.

118
00:08:51,399 --> 00:08:59,072
So, and just to take a step back here too, so like you said, this is sort of a workflow, a
sequence of steps that goes into generating a reconstruction.

119
00:08:59,072 --> 00:09:04,554
So you had those images input, there's a sort of first block of steps that's labeled
correspondence search.

120
00:09:04,554 --> 00:09:08,956
After that, we have incremental reconstruction and then finally we end up with a final
reconstruction.

121
00:09:08,956 --> 00:09:16,374
But yeah, so within correspondence search, our goal for correspondence search is to figure
out the 2D relationship

122
00:09:16,374 --> 00:09:17,954
between that collection of images.

123
00:09:17,954 --> 00:09:20,393
So we're not even talking about real 3D yet.

124
00:09:20,393 --> 00:09:31,054
There might be some hints at 3D in these steps, but we haven't done any reasoning to
really understand which photos are where in 3D space.

125
00:09:31,054 --> 00:09:37,335
So it's just about 2D understanding, 2D matching, 2D correspondence between this
collection of images.

126
00:09:37,335 --> 00:09:42,915
So with that in mind, first step is feature extraction.

127
00:09:43,035 --> 00:09:45,195
So the goal there is

128
00:09:45,195 --> 00:09:49,675
to identify unique landmarks within a photograph.

129
00:09:49,775 --> 00:10:01,735
And these unique landmarks, the intent of that is if I can identify a unique landmark, a
2D point in one photo, hopefully I can identify that same point in another photo, and

130
00:10:01,735 --> 00:10:03,475
another photo, and another photo.

131
00:10:03,475 --> 00:10:15,175
And if I can identify and follow or track that 2D point between multiple images, now I can
use that as a constraint later on when I do the 3D reconstruction.

132
00:10:15,219 --> 00:10:26,365
I can say, hey, however these images are positioned, that point that they saw, that pixel
should converge to a common 3D position in space.
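
The constraint described here, that the viewing rays from different images should meet at one 3D position, can be sketched with two rays. This is a minimal illustration using the midpoint of the shortest segment between two rays; it is not necessarily the triangulation method COLMAP uses, and the camera positions and ray directions below are made-up values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def scale(a, s):
    return [x * s for x in a]

def closest_point_between_rays(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + u*d2."""
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel, depth cannot be recovered")
    s = (b * e - c * d) / denom   # distance parameter along the first ray
    u = (a * e - b * d) / denom   # distance parameter along the second ray
    p1 = add(c1, scale(d1, s))
    p2 = add(c2, scale(d2, u))
    return scale(add(p1, p2), 0.5)

# Two camera positions one step left and right of the origin (assumed values).
cam1, cam2 = [-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
# Each image only gives a direction out into the scene, not a depth.
ray1, ray2 = [1.0, 0.0, 4.0], [-1.0, 0.0, 4.0]

landmark = closest_point_between_rays(cam1, ray1, cam2, ray2)
```

With noisy real data the rays rarely intersect exactly, which is why the midpoint (or a least-squares estimate over many rays) is used rather than an exact intersection.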

133
00:10:26,365 --> 00:10:30,827
And so it's adding a sort of a viewing constraint saying, you know, each image saw a 2D
point.

134
00:10:30,827 --> 00:10:32,288
I don't know the depth of that point.

135
00:10:32,288 --> 00:10:34,419
So all it sort of gives me is a viewing ray.

136
00:10:34,419 --> 00:10:39,232
So along this direction out into the scene, I saw this unique landmark.

137
00:10:39,232 --> 00:10:42,133
Now I've seen that same landmark in many other photos.

138
00:10:42,334 --> 00:10:44,935
I want to identify that and add that as a constraint,

139
00:10:45,001 --> 00:10:47,452
that this is most likely a single 3D point.

140
00:10:47,452 --> 00:10:56,958
So feature extraction is the automatic identification of typically
thousands or tens of thousands of these unique landmarks in an image.

141
00:10:56,958 --> 00:11:00,200
A lot of times there are different flavors of feature detection.

142
00:11:00,200 --> 00:11:04,603
The one used in COLMAP is SIFT, Scale-Invariant Feature Transform.

143
00:11:04,603 --> 00:11:09,315
What it does is it looks for, I call it a blob-style detector.

144
00:11:09,315 --> 00:11:14,260
where it's looking for a patch of pixels that has high contrast to its background.

145
00:11:14,260 --> 00:11:21,436
So it could be something that's light-colored, surrounded by dark, or vice versa,
something that's dark, surrounded by light.

146
00:11:21,436 --> 00:11:24,830
It's going to look for these at multiple scales.

147
00:11:24,830 --> 00:11:27,152
That's why it's scale invariant, so multiple resolutions.

148
00:11:27,152 --> 00:11:32,226
So this could be something that's very small, or something that's larger in the image.
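
The blob-style idea described here, a patch that contrasts with its surroundings and is searched for at several sizes, can be sketched as follows. This is a deliberately simplified center-versus-surround box filter on a tiny synthetic image, not the difference-of-Gaussians detector that SIFT actually uses; the image and the radii are made-up values.

```python
def box_sum(img, cy, cx, r):
    """Sum of pixel values in the (2r+1) x (2r+1) box centered at (cy, cx)."""
    return sum(img[y][x]
               for y in range(cy - r, cy + r + 1)
               for x in range(cx - r, cx + r + 1))

def blob_response(img, cy, cx, r):
    """Mean of the central patch minus mean of the ring of pixels around it."""
    outer = 2 * r + 1
    inner_n = (2 * r + 1) ** 2
    outer_n = (2 * outer + 1) ** 2
    inner_sum = box_sum(img, cy, cx, r)
    ring_sum = box_sum(img, cy, cx, outer) - inner_sum
    return inner_sum / inner_n - ring_sum / (outer_n - inner_n)

def strongest_blob(img, radii=(0, 1, 2)):
    """Search every position and scale; return (score, row, col, radius)."""
    h, w = len(img), len(img[0])
    best = None
    for r in radii:
        outer = 2 * r + 1
        for cy in range(outer, h - outer):
            for cx in range(outer, w - outer):
                # abs() so that dark-on-light scores as high as light-on-dark
                score = abs(blob_response(img, cy, cx, r))
                if best is None or score > best[0]:
                    best = (score, cy, cx, r)
    return best

# 15x15 dark image with one bright 3x3 patch in the middle.
image = [[0.0] * 15 for _ in range(15)]
for y in range(6, 9):
    for x in range(6, 9):
        image[y][x] = 1.0

best = strongest_blob(image)
```

The strongest response is at the center of the bright patch, at the radius that matches the patch size, which is the sense in which the detector picks both a location and a scale.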

149
00:11:32,607 --> 00:11:37,675
But once it's found that sort of high-contrast landmark,

150
00:11:37,675 --> 00:11:46,275
it will then extract some representation of the appearance of the area around that
landmark.

151
00:11:46,275 --> 00:11:47,895
So it'll say, hey, I found something interesting.

152
00:11:47,895 --> 00:11:52,195
So maybe it's a doorknob on a door.

153
00:11:53,075 --> 00:11:57,975
So it'll say, that doorknob is a different color than the background, the rest of the
door.

154
00:11:58,275 --> 00:12:00,755
And so now I want to describe that doorknob.

155
00:12:00,755 --> 00:12:05,575
So I'm gonna look... I don't wanna look just at the doorknob itself, I'm gonna look around it
and say, here's my doorknob.

156
00:12:05,575 --> 00:12:07,855
And then, oh, there's this wood pattern.

157
00:12:07,867 --> 00:12:08,969
on the door around it.

158
00:12:08,969 --> 00:12:12,291
And so it's going to come up with a representation for that.

159
00:12:12,291 --> 00:12:18,858
So what SIFT actually does or what different feature representations are, that could be a
whole podcast in and of itself.

160
00:12:18,858 --> 00:12:26,185
But at a conceptual level, you just think about it, it summarizes what that looks like at
a rough level.

161
00:12:26,185 --> 00:12:28,607
It says, OK, I saw something dark in the middle.

162
00:12:28,607 --> 00:12:32,658
And then there was this rough pattern around its vicinity.

163
00:12:32,658 --> 00:12:33,418
Mm-hmm.

164
00:12:33,418 --> 00:12:42,703
Okay, so then I'm bringing up COLMAP, and unfortunately I had already run the
project because I didn't want us to have to sit and watch things go and a lot of these

165
00:12:42,703 --> 00:12:43,713
things run really fast.

166
00:12:43,713 --> 00:12:46,664
So SIFT is fast if you can run it on GPU.

167
00:12:46,665 --> 00:12:52,547
I can't necessarily show it, but, by the way, I think it maxes at 10,000 by default.

168
00:12:52,547 --> 00:12:58,550
But if you have COLMAP and you kind of want to follow along the first thing you do is set
up a new project and

169
00:12:59,206 --> 00:13:06,322
that part's pretty easy, but then you just go to processing and feature extraction and you
get to pick a camera model.

170
00:13:06,322 --> 00:13:07,244
Why is that important?

171
00:13:07,244 --> 00:13:10,031
Why is picking a camera model important for this?

172
00:13:10,031 --> 00:13:20,298
Well, this is important and this ends up being really important later on when we start
thinking about the geometry of these images and what kind of camera and lens was used.

173
00:13:20,298 --> 00:13:26,141
Because these camera models are, it is defining the geometry of that camera.

174
00:13:26,141 --> 00:13:30,714
So this, right now you have a simple radial camera to select it.

175
00:13:30,714 --> 00:13:35,446
And so underneath of it, sort of in gray scale, are some parameters listed.

176
00:13:35,446 --> 00:13:36,859
It says, simple radial.

177
00:13:36,859 --> 00:13:40,459
has F, Cx, Cy, and K.

178
00:13:40,959 --> 00:13:50,419
And so you kind of have to know from the computer vision literature that F is your focal
length, Cx and Cy, that's the principal point, that's defining where is the center of

179
00:13:50,419 --> 00:13:57,239
my image or where is the optical axis of my lens and how is that aligned with the image
center.

180
00:13:57,239 --> 00:14:01,579
So a lot of times I was kind of say, hey, hand wavy, what's the center of my image?

181
00:14:01,659 --> 00:14:04,757
And then that K is a single

182
00:14:04,757 --> 00:14:15,177
radial distortion term. So it's assuming, a lot of times lenses introduce a little bit of
curvature effect, you know, curvature distortion, to them. And so we're gonna use a single

183
00:14:15,177 --> 00:14:27,329
mathematical term, a single, you know, polynomial term, to represent the distortion in that
lens. This is great for a lot of just, you know, general cameras, but

184
00:14:27,489 --> 00:14:38,318
If you know that your lens has a little bit more distortion, maybe you're using a wide angle
camera, a GoPro or a drone that has a wider field of view and some distortion.

185
00:14:38,318 --> 00:14:45,372
If you have a really wide angle camera, something where you can see a lot of distortion,
then you might want one of these fisheye versions.

186
00:14:45,372 --> 00:14:48,487
They have simple radial fisheye or the normal fisheye.

187
00:14:48,487 --> 00:14:52,110
There's even, I think, at the very bottom of the list, there's one called FOV.

188
00:14:52,110 --> 00:14:53,034
That's one that's...

189
00:14:53,034 --> 00:14:55,994
really great for super wide angle.

190
00:14:56,025 --> 00:15:05,666
But a lot of times for a normal camera like your iPhone in your pocket or your DSLR or
your point and shoot or whatever it ends up being, your simple radial or your radial

191
00:15:05,666 --> 00:15:18,606
models are nice because they assume that you've got a single focal length, your pixels are
square, so I don't need more than one F term.

192
00:15:18,790 --> 00:15:25,415
You want to model your principal point with Cx and Cy, and here the radial model adds an extra
lens distortion term.

193
00:15:25,415 --> 00:15:27,897
So now instead of just K, we have K1 and K2.

194
00:15:27,897 --> 00:15:30,104
So that's two radial distortion terms.

195
00:15:30,104 --> 00:15:34,821
We can do a little better job of estimating the distortion of our lens.
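
To make the parameters concrete, here is a sketch of how a simple radial camera (f, cx, cy, k) and a radial camera (f, cx, cy, k1, k2) could turn a 3D point in front of the camera into a pixel. The formulas follow the common polynomial radial distortion model, which matches our understanding of COLMAP's definitions, but treat them as an assumption and check the COLMAP camera model documentation before relying on them. The numeric values are invented.

```python
def project_simple_radial(point_cam, f, cx, cy, k):
    """Project a camera-frame 3D point with one focal length and one distortion term."""
    x, y, z = point_cam
    u, v = x / z, y / z                 # ideal (undistorted) image-plane coordinates
    r2 = u * u + v * v                  # squared distance from the optical axis
    factor = 1.0 + k * r2               # single polynomial distortion term
    return (f * u * factor + cx, f * v * factor + cy)

def project_radial(point_cam, f, cx, cy, k1, k2):
    """Same idea with two distortion terms, as in the radial model."""
    x, y, z = point_cam
    u, v = x / z, y / z
    r2 = u * u + v * v
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return (f * u * factor + cx, f * v * factor + cy)

# Assumed intrinsics for a 1920x1080 image.
f, cx, cy = 1000.0, 960.0, 540.0

on_axis = project_simple_radial((0.0, 0.0, 2.0), f, cx, cy, k=-0.1)
no_distortion = project_simple_radial((1.0, 0.0, 2.0), f, cx, cy, k=0.0)
with_distortion = project_simple_radial((1.0, 0.0, 2.0), f, cx, cy, k=-0.1)
two_terms = project_radial((1.0, 0.0, 2.0), f, cx, cy, k1=-0.1, k2=0.0)
```

A point on the optical axis always lands on the principal point, and a negative k pulls off-axis points toward the center, which is the barrel-style curvature mentioned above. With k2 set to zero the radial model reduces to the simple radial one.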

196
00:15:35,162 --> 00:15:43,348
And so COLMAP asks for this right away because what it's doing is it has that, you know,
part of that project creation process is you create a database.

197
00:15:43,348 --> 00:15:46,680
And so that's going to be, you know, a collection of data stored on disk.

198
00:15:46,680 --> 00:15:56,671
And so this process of feature extraction is when COLMAP goes through all of your images,
extracts features, but then also creates those image entries in the database.

199
00:15:56,671 --> 00:16:01,833
And so it needs to know what style of camera is going to be associated with that image.

200
00:16:01,833 --> 00:16:02,674
Mm hmm.

201
00:16:02,674 --> 00:16:06,714
And we could go deep on a bunch of buttons on here.

202
00:16:06,714 --> 00:16:13,481
I don't want to. If you just run this with the defaults and simple radial, and you're using a smartphone or
something, you'll be OK.

203
00:16:13,481 --> 00:16:16,603
But, you know, like here it's thinking I have all these different cameras.

204
00:16:16,603 --> 00:16:19,664
There's options where you can say it's always one camera.

205
00:16:19,664 --> 00:16:23,566
So it just assumes then every image is from the same camera, which is great.

206
00:16:23,586 --> 00:16:27,008
There are also options for masks.

207
00:16:27,248 --> 00:16:29,149
I'll just bring up a mask on my screen.

208
00:16:29,149 --> 00:16:30,690
This is me masked.

209
00:16:31,042 --> 00:16:38,065
This is a mask, not necessarily the mask you would use, but basically there's a picture of
me.

210
00:16:38,065 --> 00:16:39,366
This might be the wrong picture.

211
00:16:39,366 --> 00:16:41,997
And I've then got a mask as a separate file.

212
00:16:41,997 --> 00:16:45,568
And then if you kind of like combine the two, you end up with me masked out.

213
00:16:45,568 --> 00:16:49,150
And that's like a way to say, you want me not to be in this result.

214
00:16:49,150 --> 00:16:54,408
You can mask out things specifically if you want perhaps just an object to be
reconstructed.

215
00:16:54,408 --> 00:16:57,411
You want to mask out a background, things like that.

216
00:16:57,411 --> 00:17:00,744
That we could go deep into, but there's all these options, right?

217
00:17:00,744 --> 00:17:03,136
To help get the right key points.

218
00:17:03,136 --> 00:17:12,023
So if I go to this database, so I ran this already and I have this database manager where
I can kind of jump into things and I pick one of these and I'm just gonna hit show image.

219
00:17:12,023 --> 00:17:15,886
It's gonna bring up the image and I can make this nice and big on my screen.

220
00:17:15,886 --> 00:17:20,330
What we're seeing now is an image of the fountain.

221
00:17:20,330 --> 00:17:25,654
on the backside of it right now with all these red circles, which are key points, not
necessarily all the features, right?

222
00:17:25,654 --> 00:17:27,159
It's just some of the

223
00:17:27,159 --> 00:17:28,602
ones that I think it matched on.

224
00:17:28,602 --> 00:17:30,803
Is that wrong or am I on the wrong?

225
00:17:31,858 --> 00:17:32,562
Yeah.

226
00:17:32,562 --> 00:17:38,450
Some software packages, they may show you all of them or may show you just the ones that
have been matched.

227
00:17:38,831 --> 00:17:41,514
I'm not sure with this specific viewer right now.

228
00:17:41,514 --> 00:17:43,974
So yeah, and I'm not a hundred percent clear either.

229
00:17:43,974 --> 00:17:45,294
I didn't read the documentation.

230
00:17:45,294 --> 00:17:46,534
All I know is this is visualizing.

231
00:17:46,534 --> 00:17:54,334
So this is an idea of key points where you'll notice there's no key points where you have
a lot of low contrast, not a lot of visual variation.

232
00:17:54,334 --> 00:18:03,414
So I'm on my screen and there's a part where it shows the street and there's just not much
going on there versus there's a lot of points on the fountain which has all these ornate

233
00:18:03,414 --> 00:18:06,654
decorations on it. In the background, there's trees and

234
00:18:06,746 --> 00:18:08,537
buildings that it's latching onto.

235
00:18:08,537 --> 00:18:18,394
So it makes sense that where you have less variation, you're going to have fewer features.
The sky is also another one where there's no variation, but this

236
00:18:18,394 --> 00:18:20,556
nice tree behind this thing, it caught a lot on.

237
00:18:20,556 --> 00:18:23,698
So it doesn't mean it matched on those because you might not see those.

238
00:18:23,698 --> 00:18:28,462
So if I then, I'm going to close this and then you can look at show overlapping images.

239
00:18:28,462 --> 00:18:28,813
So

240
00:18:28,813 --> 00:18:31,426
you know, if I click here, you can look at the matches.

241
00:18:31,426 --> 00:18:41,788
You're going to see then this kind of correspondence matches where it's finding key points
between two images and they show these green lines basically saying these two images have

242
00:18:41,788 --> 00:18:44,830
matching features that it believes are the same points.

243
00:18:44,830 --> 00:18:45,050
Right.

244
00:18:45,050 --> 00:18:46,070
Is that what we're seeing?

245
00:18:46,070 --> 00:18:47,031
Exactly, exactly.

246
00:18:47,031 --> 00:18:54,115
So this has now sort of moved to the second and third bubbles within that correspondence
search block.

247
00:18:54,115 --> 00:19:01,899
So back to that correspondence search, the first step was the feature extraction, which
was just the identification of these key points in each of the images.

248
00:19:01,899 --> 00:19:03,430
So it wasn't even trying to compare images yet.

249
00:19:03,430 --> 00:19:07,162
We're just saying for each image, let me find those key points.

250
00:19:07,162 --> 00:19:14,194
As Jonathan said, by default, if you've got a GPU enabled version of COLMAP and you've got
a nice GPU in your computer,

251
00:19:14,194 --> 00:19:19,457
it will use the GPU implementation, that graphics processor, which makes it go a lot
faster.

252
00:19:19,558 --> 00:19:26,465
So once we've extracted those key points or features, again, I use those terms
interchangeably a lot, the key point and the feature.

253
00:19:26,465 --> 00:19:29,847
Now we want to match images together.

254
00:19:30,027 --> 00:19:34,180
And that's to discover which images show similar content.

255
00:19:34,180 --> 00:19:41,429
And so the result of that is going to be the set of correspondences, the set of features
saying the features in this image matched

256
00:19:41,429 --> 00:19:42,479
to the features in this image.

257
00:19:42,479 --> 00:19:49,992
And so those were those green lines that Jonathan had shown just prior, saying that, you
know, not all of the key points from one image matched to the other.

258
00:19:49,992 --> 00:19:55,224
There was some subset, but we're trying to discover what those matches are.

259
00:19:55,925 --> 00:20:01,451
In this diagram, we said that, you know, we had feature extraction, matching, and then
geometric verification.

260
00:20:01,591 --> 00:20:10,615
Matching and geometric verification, a lot of times will go hand in hand, you know, so you
run matching and then you immediately run geometric verification after that.

261
00:20:10,615 --> 00:20:23,961
So the intention there is your matching is just trying to figure out which features look
similar between two images, but it's not trying to do any sort of 2D or 3D reasoning.

262
00:20:23,961 --> 00:20:31,388
So it may think that the top of the tree in one image looks like the top of another tree

263
00:20:31,388 --> 00:20:34,749
in another image, but they're in completely different parts of the image and it doesn't
even make sense.

264
00:20:34,749 --> 00:20:43,688
Like it may confuse things, or especially if you have a building with some sort of
repetitive pattern on it, the same brick repeated over and over again, but you have some

265
00:20:43,688 --> 00:20:48,101
sort of unique windows or unique artwork that appears on that wall.

266
00:20:48,101 --> 00:20:53,184
For feature matching, it may end up matching incorrect parts of the image to each other.

267
00:20:53,204 --> 00:20:58,256
So matching does its best to try to figure out what matches, but it might be wrong.
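
The matching step described here, comparing how features look without any geometric reasoning, can be sketched as a nearest-neighbor search over descriptors. The ratio test below (reject a match when the best candidate is not clearly better than the second best) is a widely used heuristic and our assumption about a sensible default, not a statement of COLMAP's exact matcher. The 4-number descriptors are toy values; real SIFT descriptors have 128 numbers.

```python
import math

def match_descriptors(desc_a, desc_b, max_ratio=0.8):
    """Return (index_a, index_b) pairs whose best match clearly beats the runner-up."""
    matches = []
    for i, da in enumerate(desc_a):
        # Distance from this descriptor to every descriptor in the other image.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        # Ambiguous matches (e.g. repeated bricks) have best close to second.
        if best <= max_ratio * second:
            matches.append((i, j))
    return matches

image_a = [(1.0, 0.0, 0.0, 0.0),   # distinctive feature
           (0.0, 1.0, 0.0, 0.0),   # distinctive feature
           (0.5, 0.5, 0.0, 0.0)]   # looks equally like two things in image B
image_b = [(0.9, 0.1, 0.0, 0.0),
           (0.1, 0.9, 0.0, 0.0),
           (0.0, 0.0, 1.0, 0.0)]

matches = match_descriptors(image_a, image_b)
```

The third feature in image A is dropped because it resembles two features in image B about equally. Matches that survive can still be wrong, which is why a geometric check follows.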

268
00:20:58,302 --> 00:21:08,665
It's geometric verification's job to come in and clean those up, to figure out, well, now that
I have this initial set of matching key points between my two images, which ones actually

269
00:21:08,665 --> 00:21:13,386
make sense based on our knowledge of geometry and how cameras move.

270
00:21:13,386 --> 00:21:19,249
And so that's where sometimes knowing what kind of camera model you have
can be helpful.

271
00:21:19,249 --> 00:21:23,522
Knowing if you expect a lot of distortion or if it's a fisheye lens, that can help.

272
00:21:23,522 --> 00:21:24,568
But sometimes

273
00:21:24,568 --> 00:21:27,688
some methods don't even try to use that information.

274
00:21:27,688 --> 00:21:30,868
They'll just look at the 2D to 2D relationships.

275
00:21:31,348 --> 00:21:41,688
And so some key words that you might see would be estimating a homography, a
homography being like a perspective transform, or an essential matrix or a fundamental matrix.

276
00:21:41,688 --> 00:21:52,428
So each of these sort of relationships, each of these matrices is a way to describe how a
point in one image matches to a location in another image or a set of locations in another

277
00:21:52,428 --> 00:21:52,929
image.
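
For reference, the three relationships named here are usually written as below. This is standard textbook notation rather than anything shown in the episode: x and x' are assumed to be a matching pair of points in homogeneous pixel coordinates, and K and K' are assumed to be the intrinsic matrices of the two cameras. A homography sends a point to a single location, while the fundamental and essential matrices send a point to a line of possible locations, which is the "set of locations" mentioned above.

```latex
\begin{aligned}
\mathbf{x}' &\simeq H\,\mathbf{x}
  && \text{(homography: point to point)} \\
\mathbf{x}'^{\top} F\,\mathbf{x} &= 0
  && \text{(fundamental matrix: point to epipolar line)} \\
E &= K'^{\top} F\,K
  && \text{(essential matrix: the same constraint for calibrated cameras)}
\end{aligned}
```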

278
00:21:52,929 --> 00:22:04,789
And so we're trying to estimate, is there a valid camera motion that we can imagine to get
a set of points in one image to move to the set of points in the other image?

279
00:22:04,789 --> 00:22:11,460
And so that's what geometric verification is doing, just figuring out those 2D
relationships between images.
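
To make the verification idea concrete, here is a minimal Python sketch, not COLMAP's actual code. It assumes a candidate homography has already been estimated (in practice that is done robustly, for example with RANSAC) and simply keeps the matches that agree with it. The function names and the 4-pixel threshold are hypothetical choices for illustration.

```python
import math


def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def find_inliers(H, matches, max_error_px=4.0):
    """Keep the matches whose transfer error under H is within the threshold.

    max_error_px is an illustrative value, not a recommended setting.
    """
    inliers = []
    for src, dst in matches:
        px, py = apply_homography(H, src)
        error = math.hypot(px - dst[0], py - dst[1])
        if error <= max_error_px:
            inliers.append((src, dst))
    return inliers


# Hypothetical example: the second image is the first shifted 10 px to the right.
H_shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
putative_matches = [
    ((0.0, 0.0), (10.0, 0.0)),
    ((5.0, 5.0), (15.0, 5.5)),
    ((20.0, 3.0), (30.0, 3.0)),
    ((1.0, 1.0), (200.0, 90.0)),  # a wrong match, e.g. a repeated brick
]
verified = find_inliers(H_shift, putative_matches)
```

The last match is far from where the homography predicts it should land, so it is dropped; the other three survive as geometrically verified matches.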

280
00:22:11,460 --> 00:22:15,200
And somewhere in my logs, you can see some hints of that.

281
00:22:15,200 --> 00:22:16,320
So it's just running.

282
00:22:16,320 --> 00:22:18,980
It's showing all kinds of text on your screen.

283
00:22:18,980 --> 00:22:24,240
And I'm sure some of that is in there; it's showing bundle adjustment on my screen right now.

284
00:22:24,240 --> 00:22:29,581
But at one point, it's talking about some of that, the matches, and running different

285
00:22:29,581 --> 00:22:32,021
algorithms in the background to get that.

286
00:22:32,021 --> 00:22:39,083
So and then if I click on, like, one of these points that it created, it
shows you where you have multiple matches on a

287
00:22:39,083 --> 00:22:44,918
specific point and things you can do to kind of get different views and get hints of what
we're talking about here.

288
00:22:44,918 --> 00:22:51,873
But, so one thing we didn't really talk about when you're matching these images too, that
there's different options as well.

289
00:22:51,873 --> 00:22:54,875
So when I go through here, I'm processing, I've got my key points.

290
00:22:54,875 --> 00:22:59,258
It goes fast on a GPU because it's able to like look at all the different images all at
once, right?

291
00:22:59,258 --> 00:23:02,611
The images don't depend on each other when you're extracting features.

292
00:23:02,611 --> 00:23:06,393
But then you get to the point where you need to do your matching.

293
00:23:06,393 --> 00:23:13,249
This is where it's all CPU driven, because it's kind of either a sequential or exhaustive,
but it's not able to look at every image all at once.

294
00:23:13,249 --> 00:23:20,606
But there's options here where if I go to this button here, it's not displaying on my
screen correctly for some reason.

295
00:23:20,606 --> 00:23:21,527
there we go.

296
00:23:21,527 --> 00:23:26,082
You can do an exhaustive, sequential, vocab tree, spatial.

297
00:23:26,082 --> 00:23:29,808
There's these different styles you can pick, or I'm gonna say styles.

298
00:23:29,808 --> 00:23:32,569
different algorithms you can pick to match these.

299
00:23:33,149 --> 00:23:42,701
My understanding has always been, if you have a random collection of images, like someone walked
around and it's not necessarily that one image is taken and then for your next image you moved

300
00:23:42,701 --> 00:23:45,532
over and took one of just the same part of the scene.

301
00:23:45,532 --> 00:23:49,353
But, I don't know, maybe you just walk around taking pictures in every which direction.

302
00:23:49,393 --> 00:23:58,726
Exhaustive is what you want to use because it's gonna, you can explain this, but it's
gonna, like, kind of try to match every image to every other image, versus sequential, where

303
00:23:58,726 --> 00:23:59,450
you're saying,

304
00:23:59,450 --> 00:24:01,742
No, no, no, each image was taken in sequence.

305
00:24:01,742 --> 00:24:03,804
So I see the fountain from one spot.

306
00:24:03,804 --> 00:24:05,667
I moved a few feet, took another photo of it.

307
00:24:05,667 --> 00:24:08,600
They should be sequentially somewhat matching between each other.

308
00:24:08,600 --> 00:24:10,021
Does that sound correct?

309
00:24:10,021 --> 00:24:12,163
Am I making the right assumption?

310
00:24:12,193 --> 00:24:13,053
You're exactly right.

311
00:24:13,053 --> 00:24:13,824
You're exactly right.

312
00:24:13,824 --> 00:24:21,890
Yeah, once you've extracted the key points from each single image, now you want to figure out
which pairs of images are related to each other.

313
00:24:21,890 --> 00:24:27,336
So the simplest, most naive way is to say, well, let me match every single image to every
single other one.

314
00:24:27,336 --> 00:24:33,552
Let me look at all of them, order n squared: every single combination of pairs of images that I can
imagine.

315
00:24:33,552 --> 00:24:35,763
And so that's what exhaustive matching is doing.

316
00:24:35,763 --> 00:24:38,425
So exhaustive matching, like you said, it's great when...

317
00:24:38,425 --> 00:24:41,328
You have sort of an unsorted random collection of images.

318
00:24:41,328 --> 00:24:47,251
And it works especially well if you have on the order of a few hundred images.

319
00:24:47,251 --> 00:24:53,129
Because it is doing this every image to every other image, that quickly gets expensive in
terms of time.

320
00:24:53,129 --> 00:24:57,345
Like that's going to take a lot of time to compute if you try to do this on thousands of
images.

321
00:24:57,345 --> 00:24:59,737
You could still do it; you'd just have to wait a long time.

322
00:24:59,737 --> 00:25:03,576
But yeah, it's great because it's going to try to discover every single.

323
00:25:03,576 --> 00:25:05,421
pair of matching images that it can.

324
00:25:05,421 --> 00:25:05,917
Mm-hmm.

325
00:25:05,917 --> 00:25:15,092
And so that's where the sequential is nice if you have something like you said there in
the fountain sequence where you know, hey, these are frames from a video or my images.

326
00:25:15,092 --> 00:25:18,213
Maybe I was taking photos, but I'm taking them in order.

327
00:25:18,213 --> 00:25:23,285
Like, I started here, took a photo, took a few steps, took another photo, took a few more
steps, took another photo.

328
00:25:23,285 --> 00:25:27,077
And so there is some sort of sequential information to those photos.

329
00:25:27,077 --> 00:25:31,747
You know that images taken near each other in that list show

330
00:25:31,747 --> 00:25:34,074
similar content, and that's what sequential is.

331
00:25:34,074 --> 00:25:37,634
It'll leverage that information to help the matching be more efficient.
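
A quick way to see why exhaustive matching gets expensive is to count the image pairs each mode would try. The sketch below is illustrative Python, not COLMAP code; `overlap` is a hypothetical parameter meaning how many following images each image is matched against.

```python
from itertools import combinations


def exhaustive_pairs(num_images):
    """Every image against every other image: n * (n - 1) / 2 pairs."""
    return list(combinations(range(num_images), 2))


def sequential_pairs(num_images, overlap):
    """Each image against only the next `overlap` images in capture order."""
    pairs = []
    for i in range(num_images):
        last = min(i + overlap, num_images - 1)
        for j in range(i + 1, last + 1):
            pairs.append((i, j))
    return pairs


# A few hundred images is manageable; a few thousand is a hundred times more work.
pairs_300 = len(exhaustive_pairs(300))    # 44850 pairs
pairs_3000 = len(exhaustive_pairs(3000))  # 4498500 pairs
```

Ten times as many images means roughly a hundred times as many pairs for exhaustive matching, while the sequential count only grows in proportion to the number of images.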

332
00:25:37,634 --> 00:25:40,474
And then I don't really understand vocab tree.

333
00:25:40,474 --> 00:25:49,185
I do know that if you want to do an exhaustive style match, not sequential, but you have,
let's say 800 images, I've always heard use a vocab tree.

334
00:25:49,185 --> 00:25:51,847
Yeah, yeah, that's exactly right.

335
00:25:51,847 --> 00:25:57,922
So the vocab tree, you might have heard it called a vocabulary tree or image retrieval style
matching.

336
00:25:57,922 --> 00:26:04,716
Yeah, what it's doing behind the scenes is it uses an image lookup data structure.

337
00:26:04,716 --> 00:26:10,508
So it takes all the images, comes up with a really compact summarization.

338
00:26:10,508 --> 00:26:21,793
of the kinds of things that are in each image and then provides a way that I can say, hey,
for this given image, what other images in my data set are likely to have the same kinds

339
00:26:21,793 --> 00:26:23,393
of things in them?

340
00:26:23,393 --> 00:26:31,826
You know, it's not a guarantee, but it just says, you know, if I have one image and I've got
10,000 other images that I can match to, I can ask it, well, hey, I don't want to look at

341
00:26:31,826 --> 00:26:32,356
all 10,000.

342
00:26:32,356 --> 00:26:36,914
Can you at least give me a sorted list of the ones that are most likely to match?

343
00:26:36,914 --> 00:26:37,304
Mm-hmm.

344
00:26:37,304 --> 00:26:41,201
what the vocab tree option does for you is it returns that ranked list.

345
00:26:41,201 --> 00:26:48,073
Then, instead of matching all 10,000, I can choose to match the best 50 or the best 100 or
whatever my threshold is.
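
The ranked-list idea can be sketched in a few lines of Python. This is a simplified stand-in, not how COLMAP's vocabulary tree is implemented: it assumes each image has already been summarized as a small histogram of "visual words" and ranks the other images by cosine similarity. A real vocabulary tree quantizes descriptors hierarchically and weights the words, which this sketch skips. All image names and histogram values are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_name, histograms, top_k):
    """Return the top_k most similar images to the query, best first."""
    query = histograms[query_name]
    scored = [
        (cosine_similarity(query, hist), name)
        for name, hist in histograms.items()
        if name != query_name
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical compact summaries (counts of four visual words per image).
summaries = {
    "fountain_01": [5, 0, 2, 1],
    "fountain_02": [4, 1, 2, 0],
    "tree_01": [0, 6, 0, 3],
    "wall_01": [1, 1, 0, 7],
}
shortlist = rank_candidates("fountain_01", summaries, top_k=2)
```

Only the images on the shortlist would then go through full feature matching, which is where the time savings come from.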

346
00:26:48,235 --> 00:26:49,095
Yep.

347
00:26:49,137 --> 00:26:49,535
Yep.

348
00:26:49,535 --> 00:26:50,736
more efficient.

349
00:26:50,736 --> 00:26:57,931
Yeah, once you get beyond three to 400 images, exhaustive should not be your option.

350
00:26:57,931 --> 00:27:02,346
You should go to the vocab tree unless they're all sequentially taken and then always use
sequential.

351
00:27:02,346 --> 00:27:04,888
Well, not always, but that's probably your default.

352
00:27:04,888 --> 00:27:06,579
So if I'm taking a video,

353
00:27:06,771 --> 00:27:09,613
and then extracting images, sequential is always gonna work.

354
00:27:09,613 --> 00:27:13,525
Well, always gonna be your first option if you wanna be as fast as possible.

355
00:27:13,525 --> 00:27:18,129
And then in here, I know you can pick loop detection.

356
00:27:18,129 --> 00:27:20,240
it's trying to, we've talked about that before, right?

357
00:27:20,240 --> 00:27:23,813
It's trying to detect, have you come back to an area, correct?

358
00:27:23,813 --> 00:27:24,297
And.

359
00:27:24,297 --> 00:27:27,629
And that will do it using the vocab tree option.

360
00:27:27,629 --> 00:27:39,174
Like, so if I do loop detection, so under the sequential tab, if I do loop detection and
then specify a vocab tree path there at the bottom, that will enable it to say, as I'm

361
00:27:39,174 --> 00:27:48,157
processing through all those video frames, you know, every 10th frame, every 50th frame,
every hundredth frame, whatever you set it to, you can have it go and then do a vocabulary

362
00:27:48,157 --> 00:27:52,408
tree retrieval, do that image retrieval step to try to discover.

363
00:27:52,408 --> 00:27:55,708
loop closures within some of that data.

364
00:27:55,708 --> 00:27:57,983
Okay, so we have these options.

365
00:27:58,065 --> 00:28:00,221
I always just say, and then there's spatial and transitive.

366
00:28:00,221 --> 00:28:01,573
We haven't talked about that.

367
00:28:01,575 --> 00:28:04,125
Does spatial have to do with GPS or?

368
00:28:04,125 --> 00:28:04,335
right.

369
00:28:04,335 --> 00:28:14,311
So it just says, you know, for each image, assuming the images have embedded geotags,
so GPS data embedded in the EXIF, it will say for each image, just find other images with

370
00:28:14,311 --> 00:28:16,982
similar GPS and match to those.
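
A minimal Python sketch of that spatial idea follows, assuming the geotags have already been read out of the EXIF as (latitude, longitude) pairs in degrees. The function names, the neighbor count, and the distance cutoff are hypothetical; this is not COLMAP's implementation.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def spatial_neighbors(geotags, name, max_neighbors, max_distance_m):
    """Nearest images by GPS position, closest first, within a distance cutoff."""
    origin = geotags[name]
    candidates = sorted(
        (haversine_m(origin, position), other)
        for other, position in geotags.items()
        if other != name
    )
    return [other for dist, other in candidates[:max_neighbors] if dist <= max_distance_m]


# Hypothetical drone geotags: b and c are tens of meters from a, d is about 1 km away.
drone_tags = {
    "a": (35.0000, -78.0),
    "b": (35.0001, -78.0),
    "c": (35.0002, -78.0),
    "d": (35.0100, -78.0),
}
to_match = spatial_neighbors(drone_tags, "a", max_neighbors=3, max_distance_m=100.0)
```

Image d is never compared against image a, because the GPS says they are too far apart to plausibly see the same thing.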

371
00:28:18,383 --> 00:28:18,733
Yep.

372
00:28:18,733 --> 00:28:25,712
A lot of people here listening probably are taking drone images and spatial is the one I always
use.

373
00:28:25,712 --> 00:28:32,664
That's a great option because a lot of times that drone is looking straight down or it's not
looking at completely random directions.

374
00:28:32,664 --> 00:28:36,798
There is some order and structure to that drone data and so the spatial.

375
00:28:36,798 --> 00:28:41,393
A lot of the drones people are using nowadays have really good GPS on them.

376
00:28:41,393 --> 00:28:50,303
I'm thinking of the enterprise versions of, like, a DJI drone, which are getting really good GPS even
without an RTK attachment.

377
00:28:50,303 --> 00:28:53,466
it's not gonna throw a bunch of error into there.

378
00:28:53,466 --> 00:28:54,397
And then what's transitive?

379
00:28:54,397 --> 00:28:55,999
That's the one I don't think I've ever touched.

380
00:28:55,999 --> 00:28:57,201
I don't even know what that means.

381
00:28:57,201 --> 00:29:02,524
Yeah, that's a way to densify a set of existing matches.

382
00:29:02,524 --> 00:29:05,876
So suppose you had gone and run one of the existing modes.

383
00:29:05,876 --> 00:29:17,602
Say you ran, okay, maybe not exhaustive, but, like, if you had run sequential or run your
spatial or run your vocab tree, but then you wanted to go back and create a more complete

384
00:29:17,602 --> 00:29:26,959
set of connections between images, what Transitive will do is it'll look at your database
and it'll say, hey, if image A matched to B and image B,

385
00:29:26,959 --> 00:29:32,967
matched to image C, but I didn't try to match image A directly to C, let me go ahead and
do that now.

386
00:29:32,967 --> 00:29:38,894
And so it goes back and finds these transitive links between images and attempts to do
that matching.
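
The transitive idea described here (A matched B, B matched C, so try A against C) can be sketched as below. This is an illustrative Python sketch, not COLMAP's code; it only proposes the extra pairs and leaves the actual feature matching out.

```python
from itertools import combinations


def transitive_candidates(matched_pairs):
    """Propose unmatched pairs (a, c) that share a common matched neighbor."""
    already_matched = {frozenset(pair) for pair in matched_pairs}

    # Build a neighbor list for every image that appears in a match.
    neighbors = {}
    for a, b in matched_pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    candidates = set()
    for linked in neighbors.values():
        # Any two images linked through the same middle image are candidates.
        for a, c in combinations(sorted(linked), 2):
            if frozenset((a, c)) not in already_matched:
                candidates.add((a, c))
    return sorted(candidates)


# Hypothetical result of a sequential run with an overlap of one image.
existing = [("A", "B"), ("B", "C"), ("C", "D")]
extra_pairs = transitive_candidates(existing)  # [("A", "C"), ("B", "D")]
```

Each proposed pair still has to go through matching and geometric verification; some of them will fail, and the ones that pass add connections to the graph.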

387
00:29:38,894 --> 00:29:45,893
So what that does, that just creates a stronger set of connections between images, which
will help COLMAP out during the reconstruction phase.

388
00:29:45,893 --> 00:29:51,777
Okay, so I feel like this gives me, or the listener slash viewer, a good
idea.

389
00:29:51,777 --> 00:29:53,597
There's different options.

390
00:29:53,778 --> 00:29:56,958
Pick the one that makes sense for the data set you have.

391
00:29:56,958 --> 00:30:03,621
You might get the best results out of exhaustive as far as error, but you might be waiting
a day.

392
00:30:03,621 --> 00:30:07,794
I've heard people say, I set this and now it's telling me it'll be ready in 28 hours.

393
00:30:07,794 --> 00:30:09,141
Well, probably not the right mode.

394
00:30:09,141 --> 00:30:10,827
You probably should have used a vocab tree.

395
00:30:10,827 --> 00:30:13,821
You know, I always say find the right one.

396
00:30:13,821 --> 00:30:14,710
Start with sequential.

397
00:30:14,710 --> 00:30:18,869
If you have sequential images at least, you'll probably get a good result there.

398
00:30:18,869 --> 00:30:20,312
I also want to

399
00:30:20,312 --> 00:30:28,858
mention, back in the diagram under correspondence search, you know,
they do break it down into feature extraction, feature matching, and then geometric

400
00:30:28,858 --> 00:30:30,219
verification.

401
00:30:30,719 --> 00:30:37,384
Those geometric verification options show up on those matching
settings screens that we just saw.

402
00:30:37,384 --> 00:30:42,748
For each of those tabs at the bottom, there was the general settings or general options.

403
00:30:42,748 --> 00:30:46,042
And a lot of those general options are related to.

404
00:30:46,042 --> 00:30:56,708
geometric verification saying, when I'm matching these points and I want to then verify
it, what sort of pixel error do I expect or what is the minimum number of inliers or an

405
00:30:56,708 --> 00:30:57,948
inlier ratio?

406
00:30:57,948 --> 00:31:03,201
And so those inliers are the number of geometrically verified matches between a pair of
images.

407
00:31:03,201 --> 00:31:09,019
And so that's where geometric verification kind of comes into play within this COLMAP
workflow.

408
00:31:09,019 --> 00:31:09,430
Okay.

409
00:31:09,430 --> 00:31:11,011
so let's move this along.

410
00:31:11,011 --> 00:31:14,792
Then I do want to point out, I'm going to show COLMAP one more time.

411
00:31:14,792 --> 00:31:19,513
At this point, you've run both your feature extraction and feature matching.

412
00:31:19,513 --> 00:31:21,264
You will still see nothing on your screen.

413
00:31:21,264 --> 00:31:25,075
Well, you will see logs, but you will not see these camera poses, which I have.

414
00:31:25,075 --> 00:31:27,310
So I have a point, I have this sparse point cloud.

415
00:31:27,310 --> 00:31:36,318
I have these red camera positions around it and none of this shows up because at this
point we haven't, we haven't created a point cloud.

416
00:31:36,460 --> 00:31:38,341
We haven't projected anything yet.

417
00:31:38,341 --> 00:31:45,944
So we're moving from correspondence search to, if I bring up that diagram one more time,
we're moving on to incremental reconstruction.

418
00:31:45,944 --> 00:31:50,272
And that's where we start to see fun things happening on a COLMAP GUI screen.

419
00:31:50,272 --> 00:31:54,038
If you're running on a GUI, you'll start to see camera poses show up.

420
00:31:54,038 --> 00:31:56,240
So the first step is initialization.

421
00:31:56,240 --> 00:31:57,090
What is that?

422
00:31:57,090 --> 00:31:59,412
So is that just starting?

423
00:31:59,753 --> 00:32:00,713
Yeah, that's what it is.

424
00:32:00,713 --> 00:32:05,447
I mean, it's the starting process for this incremental reconstruction.

425
00:32:05,468 --> 00:32:11,403
So incremental reconstruction is just one style to attempt to do 3D reconstruction.

426
00:32:11,403 --> 00:32:18,149
so the core idea here is that, like you said, we don't have any 3D information yet.

427
00:32:18,149 --> 00:32:22,013
So we're going to start with the minimum amount that we need, which is a pair of images.

428
00:32:22,013 --> 00:32:27,431
So let's start with a pair of images and then figure out what is the 3D relationship.

429
00:32:27,431 --> 00:32:31,503
between those images as well as what 3D points did they see in the scene.

430
00:32:31,503 --> 00:32:34,005
And so we're going to create this two view reconstruction.

431
00:32:34,005 --> 00:32:41,989
Take that pair of images, triangulate an initial set of 3D points, and then we use that as
the initialization for the rest of the reconstruction.

432
00:32:41,989 --> 00:32:50,344
And so everything after that is going to figure out, based on these initial two images and
some points, how can I add a third image to that and how does it relate?

433
00:32:50,344 --> 00:32:54,286
Now that I have these three, how can I add a fourth and a fifth and a sixth?

434
00:32:54,286 --> 00:32:57,197
And so you just keep adding images one at a time.

435
00:32:57,307 --> 00:33:00,453
to grow a larger and larger reconstruction.

436
00:33:00,453 --> 00:33:04,039
But initialization is just, what is that initial pair?

437
00:33:04,039 --> 00:33:09,247
Which two images am I gonna start with to build this entire reconstruction?

438
00:33:09,575 --> 00:33:10,215
Okay.

439
00:33:10,215 --> 00:33:12,426
And then it kind of goes into a circle.

440
00:33:12,426 --> 00:33:20,981
So if you look at this, I say circle, the diagram on the screen shows image registration,
triangulation, bundle adjustment, outlier filtering.

441
00:33:20,981 --> 00:33:24,503
And then if you follow the lines, you notice you're really doing a loop.

442
00:33:24,503 --> 00:33:26,904
So it's looping through that process.

443
00:33:26,965 --> 00:33:30,236
And then also this dashed line showing reconstruction.

444
00:33:30,236 --> 00:33:36,660
So it's kind of probably looping through that and adding to the reconstruction while it's
going or, okay.

445
00:33:36,660 --> 00:33:37,700
Exactly right.

446
00:33:37,700 --> 00:33:38,340
Exactly right.

447
00:33:38,340 --> 00:33:43,323
So it's that initialization that picks the first pair of images.

448
00:33:43,323 --> 00:33:50,286
But once I have my pair of images, now I'm going to enter in this loop that starts with
image registration.

449
00:33:50,286 --> 00:33:58,090
So image registration is a fancy name to say, how can I add a new image to my existing
reconstruction?

450
00:33:58,090 --> 00:34:00,931
And so what it's going to look at is...

451
00:34:00,987 --> 00:34:11,927
based on the 3D points that have already been triangulated, it's going to ask, what's the
best next image in my data set that also saw those points?

452
00:34:12,047 --> 00:34:22,587
And then once I find that image, you know, via the set of feature matches, so say,
you know, if I've matched image one and two and triangulated that, well, image two matched

453
00:34:22,587 --> 00:34:26,887
to image three, well then image three is seeing the same points in the scene.

454
00:34:26,887 --> 00:34:28,267
So let me add image three.

455
00:34:28,267 --> 00:34:30,477
And so there it's a 2D to 3D,

456
00:34:30,477 --> 00:34:40,920
registration process, a 2D-to-3D pose estimation process, where I take the 2D points in that
third image and I want to align those 2D points with the 3D points that have been

457
00:34:40,920 --> 00:34:41,661
triangulated.

458
00:34:41,661 --> 00:34:47,944
So you might hear that as image registration or the perspective-n-point problem, or pose
estimation.

459
00:34:47,944 --> 00:34:53,188
There's a few different words for what this process is, but you're adding a new image to
the reconstruction.

460
00:34:53,188 --> 00:34:55,336
And so that's the image registration step.
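
The "what's the best next image" question can be sketched as a simple count. In this illustrative Python, each image is assumed to come with the set of feature-track IDs it observes, and the next image is the unregistered one that sees the most already-triangulated points. The names are hypothetical and the real selection in COLMAP is more involved; the camera pose itself would then be solved from those 2D-to-3D correspondences (the perspective-n-point step), which is not shown.

```python
def choose_next_image(observations, registered, triangulated):
    """Pick the unregistered image that observes the most triangulated points.

    observations: dict mapping image name -> set of track IDs seen in that image.
    registered:   set of image names already in the reconstruction.
    triangulated: set of track IDs that already have a 3D point.
    """
    best_image, best_count = None, 0
    for image in sorted(observations):
        if image in registered:
            continue
        visible = len(observations[image] & triangulated)
        if visible > best_count:
            best_image, best_count = image, visible
    return best_image


# Hypothetical state right after initialization from img1 and img2.
tracks_seen = {
    "img1": {1, 2, 3, 4},
    "img2": {1, 2, 3, 5},
    "img3": {2, 3, 5, 6},
    "img4": {6, 7},
}
next_image = choose_next_image(tracks_seen, {"img1", "img2"}, {1, 2, 3})
```

Here img3 is chosen because it sees two of the three triangulated points, while img4 sees none of them yet and has to wait until more points exist.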

461
00:34:55,711 --> 00:35:08,486
I do know when I ran this, I can always take a video and kind of project it onto this in
post, but when it's creating this reconstruction, instead of taking image one and then

462
00:35:08,486 --> 00:35:18,327
image two and then image three and kind of building off that, I'll notice it'll pick, if
you look at my, if you're watching this in video, you'll notice it took two loops.

463
00:35:18,327 --> 00:35:21,087
and some of the images are like right above each other almost.

464
00:35:21,087 --> 00:35:24,067
I held the phone at like above my head and then I held it down at chest level.

465
00:35:24,067 --> 00:35:28,407
So I have two loops and there's a lot of common key points, common features.

466
00:35:28,407 --> 00:35:39,067
So as it's building this up, started at this, kind of where I started walking around this
fountain, but it's using images from further along in the video extraction, or sorry, the

467
00:35:39,067 --> 00:35:40,067
images I had.

468
00:35:40,067 --> 00:35:46,879
So it used like image one and image 180 because those are next to each other.

469
00:35:46,879 --> 00:35:48,413
and had a lot of strong feature matches.

470
00:35:48,413 --> 00:35:51,725
So they don't necessarily use images in sequence of how you took them.

471
00:35:51,725 --> 00:35:55,120
It's ones that had strong correlation, correct?

472
00:35:55,120 --> 00:35:55,740
a great point.

473
00:35:55,740 --> 00:35:56,540
That's a great point.

474
00:35:56,540 --> 00:36:00,320
Yeah, it isn't just going to go, you know, one, two, three, four, five, six.

475
00:36:00,320 --> 00:36:02,121
You know, it's not going to do them in order.

476
00:36:02,121 --> 00:36:04,061
You know, it's going to start with that pair of images.

477
00:36:04,061 --> 00:36:08,281
It's going to look through all of the images in your collection and find the pair.

478
00:36:08,281 --> 00:36:13,861
And it might not be the consecutive pair, but find the pair of images that maximizes some
criteria.

479
00:36:13,861 --> 00:36:16,561
You know, it's a pair of images that has strong connectivity.

480
00:36:16,561 --> 00:36:18,361
So there were a lot of feature matches.

481
00:36:18,361 --> 00:36:22,161
But I also want to make sure that that pair of images has

482
00:36:22,165 --> 00:36:23,605
you know, differences in viewpoint.

483
00:36:23,605 --> 00:36:30,505
I don't want two images that were taken at the exact same position in space because that
gives me no 3D information.

484
00:36:30,505 --> 00:36:33,965
I need, you know, we talked about this in the last episode, this concept of a baseline.

485
00:36:33,965 --> 00:36:35,505
I need some sort of translation.

486
00:36:35,505 --> 00:36:38,665
I need some motion between two images.

487
00:36:38,665 --> 00:36:41,645
Or maybe it was in our depth map episode.

488
00:36:41,645 --> 00:36:47,065
And we talked about this, you know, in that we need motion between images in order to
estimate depth.

489
00:36:47,065 --> 00:36:48,925
So the initialization could look for the same thing.

490
00:36:48,925 --> 00:36:51,103
It wants lots of matches between the images,

491
00:36:51,103 --> 00:36:54,927
but it also wants a strong amount of motion between them.

492
00:36:54,927 --> 00:36:59,543
So it's gonna pick whichever pair of images maximizes that criteria.

493
00:36:59,543 --> 00:37:05,900
And once it has that, then it'll start adding other images that are strongly connected to
those initial ones.
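
As a rough illustration of "lots of matches, but also enough motion", here is a small Python sketch. It assumes each candidate pair already has a verified match count and a typical triangulation angle as a proxy for baseline; the field names and the 16-degree minimum are hypothetical, and COLMAP's real criteria are more detailed than this.

```python
def choose_initial_pair(pair_stats, min_angle_deg=16.0):
    """Among pairs with enough viewpoint change, take the one with the most inliers."""
    eligible = [p for p in pair_stats if p["angle_deg"] >= min_angle_deg]
    if not eligible:
        return None
    best = max(eligible, key=lambda p: p["num_inliers"])
    return best["pair"]


# Hypothetical numbers in the spirit of the fountain capture: neighboring frames
# match very well but barely move, while frames from the two loops differ in height.
candidates = [
    {"pair": (1, 2), "num_inliers": 900, "angle_deg": 1.5},
    {"pair": (40, 41), "num_inliers": 700, "angle_deg": 2.0},
    {"pair": (1, 180), "num_inliers": 650, "angle_deg": 18.0},
]
initial_pair = choose_initial_pair(candidates)
```

The pair (1, 2) has the most matches but almost no baseline, so the sketch skips it and starts from (1, 180) instead.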

494
00:37:05,900 --> 00:37:09,323
And yeah, it won't necessarily do it in the order that you captured those images.

495
00:37:09,323 --> 00:37:12,584
It's gonna be in the order in which those connections are strongest.

496
00:37:12,584 --> 00:37:15,437
And I was seeing mostly yours.

497
00:37:15,437 --> 00:37:20,411
I was seeing like the first photo and then somewhere further along where I came and did a
loop.

498
00:37:20,411 --> 00:37:26,977
I saw those two photos start together because I think, as you're talking
about, the baseline was better.

499
00:37:26,977 --> 00:37:32,622
There was more parallax, because these are pretty closely spaced images I took from
picture to picture.

500
00:37:32,622 --> 00:37:35,484
So not a lot has changed versus the next loop.

501
00:37:35,484 --> 00:37:39,776
I'm looking at the exact same part of the fountain, but I have a different

502
00:37:39,776 --> 00:37:44,899
elevation and angle, so there's a lot of parallax movement between those images.

503
00:37:44,899 --> 00:37:54,194
So it was matching those better as opposed to image one to image two, it's more of image
one to image 180, because that baseline was probably better.

504
00:37:54,194 --> 00:38:03,670
So you gotta, the fun thing is when you run this in the GUI, this COLMAP, you gotta
watch those build and you're gonna see the point cloud just start to generate in front of

505
00:38:03,670 --> 00:38:03,890
you.

506
00:38:03,890 --> 00:38:07,394
And you get an understanding then of what it's doing in these logs that are.

507
00:38:07,394 --> 00:38:10,188
looping through this process over and over.

508
00:38:10,188 --> 00:38:14,663
And you can kind of see it just iteratively add to the scene and build and refine.

509
00:38:14,663 --> 00:38:24,348
When it's doing this incremental reconstruction, is it refining the camera poses as it
goes, or is it just saying, here's the camera poses, there's where they are?

510
00:38:24,713 --> 00:38:26,813
No, there's refinement, there's refinement.

511
00:38:26,813 --> 00:38:30,613
And a lot of times that refinement is called bundle adjustment.

512
00:38:30,833 --> 00:38:32,934
That's a key word that's used commonly in the literature.

513
00:38:32,934 --> 00:38:39,893
I remember the first time I heard the word bundle adjustment, I was a first year grad
student and I had no idea what the person was talking about.

514
00:38:39,893 --> 00:38:44,134
I was like, what, a bundle of sticks, a bundle of what, a straw, what is going on?

515
00:38:44,134 --> 00:38:45,794
But no, a bundle adjustment.

516
00:38:45,794 --> 00:38:48,890
So it's the idea of refining.

517
00:38:48,890 --> 00:38:51,390
the 3D points as well as the camera positions.

518
00:38:51,390 --> 00:39:00,610
And so you end up with just a bundle of constraints, a bunch of constraints saying, these
2D points in these images all triangulated and all saw the same 3D point in this scene,

519
00:39:00,610 --> 00:39:03,390
but I've got a bunch of images and I've got a bunch of points.

520
00:39:03,390 --> 00:39:08,030
How can I optimize the alignment of all of this data?

521
00:39:08,030 --> 00:39:10,090
And that's what bundle adjustment is.
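
Written out, the optimization described here is usually posed as minimizing reprojection error. The notation below is the standard textbook form, not something quoted from COLMAP: P_i is assumed to bundle the pose and intrinsics of camera i, X_j is 3D point j, x_ij is where point j was detected in image i, pi is the projection of a 3D point into an image, O is the set of (image, point) observations, and rho is an optional robust loss.

```latex
\min_{\{P_i\},\,\{\mathbf{X}_j\}}
\sum_{(i,j)\in\mathcal{O}}
\rho\!\left( \left\lVert \mathbf{x}_{ij} - \pi\!\left(P_i, \mathbf{X}_j\right) \right\rVert^{2} \right)
```

Each term is one of the constraints in the "bundle": a detected 2D point should land where its 3D point projects, and the cameras and points are adjusted together until that holds as well as possible across all images.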

522
00:39:10,090 --> 00:39:15,990
So yeah, so as COLMAP is running, it's doing that image registration process.

523
00:39:15,990 --> 00:39:17,516
It'll add a new image.

524
00:39:17,516 --> 00:39:23,870
It then runs triangulation, which creates new 3D points based on that new image and other
images that are already there.

525
00:39:24,011 --> 00:39:27,663
But then it'll do bundle adjustment, which will say, how can I refine that?

526
00:39:27,663 --> 00:39:31,256
And there's two styles of bundle adjustment that I believe COLMAP uses.

527
00:39:31,256 --> 00:39:34,199
One of them is local bundle adjustment, the other is global.

528
00:39:34,199 --> 00:39:43,156
So a lot of times what you will see is, we had already reconstructed a thousand images and
we're adding that a thousand and first.

529
00:39:43,222 --> 00:39:50,202
When I add that thousand and first, trying to do a bundle adjustment using all thousand
images, that takes a long time.

530
00:39:50,462 --> 00:39:59,863
And so we recognize that, well, that first image, that thousand and first, that next image
that I'm adding, well, it's off in the corner of the reconstruction.

531
00:39:59,863 --> 00:40:02,663
It's far away from the other side of the reconstruction.

532
00:40:02,663 --> 00:40:05,023
These things aren't really related to each other.

533
00:40:05,023 --> 00:40:07,043
So I can run a local bundle adjustment.

534
00:40:07,043 --> 00:40:11,363
Let me just optimize only those cameras and points that are near.

535
00:40:11,363 --> 00:40:15,503
that new image that I just added or those new points that I've triangulated.

536
00:40:15,503 --> 00:40:17,783
And so that's a way to sort of do this local refinement.

537
00:40:17,783 --> 00:40:21,263
And I can do that every single time I add a new image.

538
00:40:21,263 --> 00:40:25,403
And then periodically, COLMAP will run a global bundle adjustment.

539
00:40:25,403 --> 00:40:26,483
So there's some settings there.

540
00:40:26,483 --> 00:40:35,063
I think every, you know, once the reconstruction has increased in size by 10% or you've
added every, you know, 500 images or something, there's certain criteria, especially at

541
00:40:35,063 --> 00:40:39,864
the end of the reconstruction, COLMAP will run a global bundle adjustment, which says,

542
00:40:39,864 --> 00:40:41,635
let's optimize everything.

543
00:40:41,635 --> 00:40:43,416
Let's optimize the points.

544
00:40:43,416 --> 00:40:45,437
Let's optimize the camera poses.
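
The local-versus-global schedule can be sketched as a simple check that runs after each new image is added. The thresholds below just mirror the rough numbers mentioned in the conversation (about 10% growth, or a few hundred new images); they are hypothetical defaults for illustration and should not be read as COLMAP's actual settings.

```python
def needs_global_ba(num_images, num_images_at_last_global,
                    growth_ratio=1.1, max_new_images=500):
    """Decide whether to run a full (global) bundle adjustment now.

    Local bundle adjustment is assumed to run after every added image;
    this only gates the more expensive global pass.
    """
    grew_by_ratio = num_images >= growth_ratio * num_images_at_last_global
    grew_by_count = num_images - num_images_at_last_global >= max_new_images
    return grew_by_ratio or grew_by_count


# With 1000 images at the last global pass, adding one image does not trigger
# another global pass, but growing to 1200 images does.
after_one_more = needs_global_ba(1001, 1000)  # False
after_growth = needs_global_ba(1200, 1000)    # True
```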

545
00:40:45,437 --> 00:40:50,739
And something we haven't mentioned is it will also be optimizing the camera parameters.

546
00:40:50,739 --> 00:40:58,803
So back when we picked that camera model and we said, we're going to use a camera model
that has a focal length term and a principal point, CX and CY, or maybe has some radial

547
00:40:58,803 --> 00:41:04,876
distortion terms, well, during bundle adjustment COLMAP will be optimizing those parameters
as well.

548
00:41:04,876 --> 00:41:07,078
to figure out, what is the field of view of my camera?

549
00:41:07,078 --> 00:41:13,173
That's the focal length or how much lens distortion was there in order to achieve that
alignment.

550
00:41:13,519 --> 00:41:23,176
Would it run those if you, cause we didn't cover this earlier on, but let's say you do
have a camera model calibration file.

551
00:41:23,176 --> 00:41:25,137
So you're saying, I know this.

552
00:41:25,217 --> 00:41:33,724
I think DJI, again, in their enterprise-level drones, will give you this
information on their lenses because they've been calibrated.

553
00:41:33,724 --> 00:41:34,965
And it's in the EXIF data.

554
00:41:34,965 --> 00:41:36,245
Well, will that change?

555
00:41:36,245 --> 00:41:38,197
Does it do like a refinement on top of that?

556
00:41:38,197 --> 00:41:40,398
Or does it just say, no, no, no, you gave us that.

557
00:41:40,398 --> 00:41:41,829
We won't change that.

558
00:41:41,842 --> 00:41:43,162
That's an option.

559
00:41:43,162 --> 00:41:50,862
I think under the reconstruction options or under the bundle adjustment options, there are
ways to say, hey, do I want to refine my focal length?

560
00:41:50,862 --> 00:41:53,982
Do I want to refine my distortion terms?

561
00:41:53,982 --> 00:41:57,603
So you could enable or disable that setting.

562
00:41:57,603 --> 00:42:02,663
To that point, I do believe that COLMAP will parse the EXIF data in those images.

563
00:42:02,663 --> 00:42:04,763
And if it sees that, yeah, there is a focal length.

564
00:42:04,763 --> 00:42:07,856
Because a lot of times, an image will contain

565
00:42:07,856 --> 00:42:16,316
you know, that, this was taken with a 10 millimeter lens or a 24 millimeter lens, you
know, and so COLMAP can parse that data to take an initial guess at what it thinks that

566
00:42:16,316 --> 00:42:20,856
focal length is, you know, what's the field of view of the camera, and can use that as
initialization.
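
To make the "initial guess from EXIF" idea concrete, here is a minimal sketch of turning a focal length in millimeters into a focal length in pixels and a horizontal field of view. This is not COLMAP's actual code. The function names are hypothetical, and the sensor width is an assumption that has to come from somewhere (for example a camera database or a 35mm-equivalent focal length tag).

```python
import math


def focal_mm_to_pixels(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimeters to pixels using the sensor width."""
    return focal_mm * image_width_px / sensor_width_mm


def horizontal_fov_degrees(focal_px, image_width_px):
    """Horizontal field of view implied by a pixel focal length."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_px)))


# Assumed example: 24 mm lens on a 36 mm wide sensor, 6000 px wide image.
focal_px = focal_mm_to_pixels(24.0, 36.0, 6000)
fov_deg = horizontal_fov_degrees(focal_px, 6000)
print(focal_px, round(fov_deg, 2))
```

A value like this is only a starting point. Refining it during bundle adjustment is what closes the gap between "close" and "close enough for a sharp reconstruction".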

567
00:42:20,936 --> 00:42:30,936
But a lot of times there is a benefit to refining that, because the initial guess may be close,
but might not be close enough to get a really sharp reconstruction.

568
00:42:31,316 --> 00:42:34,576
So, okay, so I got a lot more appreciation for what's happening here.

569
00:42:34,576 --> 00:42:36,676
I tell people to run this on their computer.

570
00:42:36,676 --> 00:42:41,937
You don't need the highest spec computer to run a small data set and learn how this works.

571
00:42:41,937 --> 00:42:48,628
I ran this on my older computer, which doesn't have, you know, 24 cores or anything, and
it still ran fairly quick.

572
00:42:48,628 --> 00:42:51,780
I'd say there's some things you gave me some notes.

573
00:42:51,780 --> 00:42:56,334
I think we covered most of it, but then from here you can do things.

574
00:42:56,334 --> 00:43:04,680
So I ran this through, you can hit automatic reconstruction, it'll create all this, but
then you can hit bundle adjustment, which is that global one at the end.

575
00:43:04,680 --> 00:43:08,674
And then you can build a dense reconstruction, which we're not really gonna cover on this
episode.

576
00:43:08,674 --> 00:43:14,508
This is just kind of like, here's how we got that workflow I showed to get the camera
poses, the sparse point cloud.

577
00:43:14,508 --> 00:43:16,505
And then from there you can use it for.

578
00:43:16,505 --> 00:43:17,685
more downstream tasks, right?

579
00:43:17,685 --> 00:43:24,949
So I could use this for, again, doing a dense 3D reconstruction where I
want to get millions of points on this scene.

580
00:43:25,029 --> 00:43:29,731
Or I can use this as the basis for initializing 3D Gaussian splatting.

581
00:43:29,731 --> 00:43:36,915
There's just different things you can use once you've got camera positions and a
sparse point cloud.

582
00:43:36,915 --> 00:43:40,997
I'm showing also on my screen, I talk about, you have these kind of magenta lines.

583
00:43:40,997 --> 00:43:42,957
This is showing kind of your

584
00:43:43,043 --> 00:43:44,293
these images matched.

585
00:43:44,293 --> 00:43:50,546
If I double click on one, it'll show that kind of information about the keypoints and which
ones match to it.

586
00:43:50,546 --> 00:43:53,457
But you can just click around and learn things.

587
00:43:53,457 --> 00:43:55,998
Double click on different parts of the scene.

588
00:43:55,998 --> 00:44:01,260
It'll show you the point and which different cameras made up that point.

589
00:44:01,260 --> 00:44:06,322
And it's a good tool to kind of learn how this works because it's very visual on the
screen.

590
00:44:06,322 --> 00:44:08,723
Lots of data, lots of options.

591
00:44:08,743 --> 00:44:12,696
You can even create animations in this if you really want to show off what you learned.

592
00:44:12,696 --> 00:44:14,557
There is one thing we didn't really talk about.

593
00:44:14,557 --> 00:44:15,358
Well, there's a couple of things.

594
00:44:15,358 --> 00:44:20,002
So incremental reconstruction, everyone always complains, I bought the newest GPU.

595
00:44:20,002 --> 00:44:21,843
This should be really fast.

596
00:44:21,843 --> 00:44:23,104
Why is this running so slow?

597
00:44:23,104 --> 00:44:24,665
My GPU is not even being used.

598
00:44:24,665 --> 00:44:29,151
And it says it's taking five hours to run my thousand-image data set.

599
00:44:29,151 --> 00:44:29,692
Why is that?

600
00:44:29,692 --> 00:44:32,083
Why can't we use a GPU for this incremental reconstruction?

601
00:44:32,083 --> 00:44:36,173
Or I know we can, but why can't we in COLMAP the way it's configured?

602
00:44:36,173 --> 00:44:37,775
Yeah, yeah, because COLMAP.

603
00:44:37,775 --> 00:44:42,568
Yeah, a lot of these algorithms are not easy to parallelize on a GPU.

604
00:44:42,568 --> 00:44:49,752
So a GPU works well when you're doing the exact same operation on millions of things, you
know, because that's what a GPU does.

605
00:44:49,752 --> 00:44:55,145
Its job is to draw pixels to a screen on your monitor, on your desktop.

606
00:44:55,145 --> 00:44:57,206
And so you've got millions of pixels on your screen.

607
00:44:57,206 --> 00:45:01,129
And so that GPU is processing a million pixels at once and figures out what to draw.

608
00:45:01,129 --> 00:45:09,557
And so for tasks like feature extraction, where I've got, again, millions of pixels and
I want to figure out which ones have features in them,

609
00:45:09,557 --> 00:45:12,220
a GPU is great. Or feature matching.

610
00:45:12,220 --> 00:45:15,473
I've got tens of thousands of features in one image, tens of thousands of the other.

611
00:45:15,473 --> 00:45:22,851
I want to figure out which features match with each other. Then again, that's great for a GPU.
For incremental reconstruction,

612
00:45:22,851 --> 00:45:25,485
It's like I'm operating on one image at a time.

613
00:45:25,485 --> 00:45:35,905
and I have to just solve a math equation and do some linear algebra to figure out what's
the 3D position or pose of that image. That's not a very parallelizable task.

614
00:45:35,905 --> 00:45:41,665
And so it's not very easy to adapt some of these algorithms to the GPU.

615
00:45:42,105 --> 00:45:47,385
I will say then another thing too that contributes to it is COLMAP is very flexible.

616
00:45:47,405 --> 00:45:52,186
There's a lot of algorithms, a lot of switches, a lot of different techniques that you can
use.

617
00:45:52,186 --> 00:45:56,539
And to implement all of those on the GPU would just take a lot of time.

618
00:45:56,539 --> 00:45:58,940
It's nice having software that's flexible.

619
00:45:58,940 --> 00:46:09,415
With COLMAP being open source, a bunch of people contributing to it, it's nice having a
flexible platform where people can easily dive in, make changes, add their own algorithm,

620
00:46:09,415 --> 00:46:11,406
plug it in, tweak things and play with it.

621
00:46:11,406 --> 00:46:17,040
So having that sort of more general purpose CPU-based implementation is helpful.

622
00:46:17,040 --> 00:46:17,674
But yeah.

623
00:46:17,674 --> 00:46:21,134
To get back to the core, it really is primarily just around the algorithms.

624
00:46:21,134 --> 00:46:27,346
A lot of these algorithms are not parallelizable or not well suited for processing on a
GPU.

625
00:46:27,563 --> 00:46:28,364
That makes sense.

626
00:46:28,364 --> 00:46:31,725
I and I want to explain it or someone's trying to explain it.

627
00:46:31,725 --> 00:46:41,171
It's like your CPU is a really good detective at solving clue by clue, one thing at a time,
versus a GPU, which can just point out all the clues all at once.

628
00:46:41,171 --> 00:46:44,293
But you really need that like hard math equation.

629
00:46:44,293 --> 00:46:48,956
You need really fast cores to try to solve those things one at a time.

630
00:46:48,956 --> 00:46:49,736
And it's incremental.

631
00:46:49,736 --> 00:46:50,419
So think about it.

632
00:46:50,419 --> 00:46:53,819
It's like you can't solve all these all at once as is.

633
00:46:53,819 --> 00:46:54,180
So

634
00:46:54,180 --> 00:46:57,805
That's something that people just have to keep in mind: don't get frustrated.

635
00:46:57,805 --> 00:47:00,439
It's just how this technology works today.

636
00:47:00,439 --> 00:47:01,240
And there's GLOMAP.

637
00:47:01,240 --> 00:47:04,184
So how does GLOMAP make this all of a sudden magically fast?

638
00:47:04,739 --> 00:47:09,403
Yeah, so GLOMAP is a different style for that reconstruction process.

639
00:47:09,403 --> 00:47:16,811
So GLOMAP is, you know, a global mapper, so global reconstruction versus incremental
reconstruction.

640
00:47:16,811 --> 00:47:27,418
Here in COLMAP, as we just talked about, it uses an incremental reconstruction,
you know, one image at a time, whereas global reconstruction tries to figure out the

641
00:47:27,418 --> 00:47:30,862
3D poses of all of the images all at once.

642
00:47:31,126 --> 00:47:34,768
So GLOMAP still has that same correspondence search step.

643
00:47:34,768 --> 00:47:41,750
To run GLOMAP, you've still got to extract keypoints, extract features from your images, you've
got to match them, and you've got to run your geometric verification.

644
00:47:41,750 --> 00:47:48,514
But once you have that web of connectivity between your images, you can then run global
reconstruction techniques.

645
00:47:48,514 --> 00:47:51,177
And so there's a few different steps there.

646
00:47:51,177 --> 00:47:54,719
In GLOMAP, they run rotation averaging first.

647
00:47:54,719 --> 00:47:57,201
So the idea with that is that you...

648
00:47:57,201 --> 00:48:01,541
look at all of the feature matches between your pairs of images.

649
00:48:01,801 --> 00:48:07,721
For each pair, you estimate how much rotation occurred between that pair of images.

650
00:48:07,721 --> 00:48:08,761
So that gives you a constraint.

651
00:48:08,761 --> 00:48:20,021
But now if I look at all of the rotations that are estimated between all of the pairs, can
I come up with a consistent orientation for all of my images that satisfies each of those

652
00:48:20,021 --> 00:48:20,841
pairwise constraints?

653
00:48:20,841 --> 00:48:24,785
So can I arrange the orientations of my images

654
00:48:24,785 --> 00:48:27,765
so that all of those pairwise rotations make sense.

655
00:48:27,765 --> 00:48:29,665
And so that's what rotation averaging does.

656
00:48:29,665 --> 00:48:34,445
So it's not even looking at position, it's just trying to rotate all of the images.
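
The idea of finding one orientation per image that agrees with all the pairwise rotations can be sketched in a toy planar setting, where each camera's orientation is a single angle. This is only an illustration under that simplifying assumption, not GLOMAP's actual solver, and the function names are hypothetical. Each camera's angle is repeatedly replaced by the circular mean of what its neighbors predict for it, with camera 0 held fixed as the reference.

```python
import math


def average_rotations(num_cams, relative, iterations=100):
    """Toy planar rotation averaging.

    relative maps (i, j) to theta_ij, meaning angle_j is roughly
    angle_i + theta_ij. Returns one absolute angle per camera.
    """
    angles = [0.0] * num_cams
    for _ in range(iterations):
        for k in range(1, num_cams):  # camera 0 stays fixed as the reference
            sum_cos = sum_sin = 0.0
            for (i, j), theta in relative.items():
                if j == k:
                    predicted = angles[i] + theta
                elif i == k:
                    predicted = angles[j] - theta
                else:
                    continue
                sum_cos += math.cos(predicted)
                sum_sin += math.sin(predicted)
            if sum_cos != 0.0 or sum_sin != 0.0:
                # Circular mean of the neighbors' predictions.
                angles[k] = math.atan2(sum_sin, sum_cos)
    return angles


# Made-up ground truth used to synthesize noise-free pairwise rotations.
true_angles = [0.0, 0.3, 0.6, -0.4]
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
relative = {(i, j): true_angles[j] - true_angles[i] for i, j in pairs}
estimated = average_rotations(len(true_angles), relative)
print([round(a, 4) for a in estimated])
```

Note that only orientations are solved here. Positions are left for the later global positioning step, and a densely connected set of pairs like the one above is what gives every camera several independent constraints.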

657
00:48:34,445 --> 00:48:44,385
And once they're rotated in 3D space, then it does a global positioning step, which
simultaneously solves both the camera positions as well as some of the 3D points.

658
00:48:44,385 --> 00:48:48,145
And so it kind of throws all of the cameras into a big soup, a big mess.

659
00:48:48,145 --> 00:48:53,721
It gives them a bunch of random initializations and then defines these constraints saying,
well, these images,

660
00:48:53,721 --> 00:49:01,732
saw these common points, how can I rearrange all of these images so that they line up
and see those common points?

661
00:49:01,732 --> 00:49:04,324
So it's similar to bundle adjustment.

662
00:49:04,324 --> 00:49:16,208
It's the same idea of taking a bunch of images that see points and refining them, but it uses a
different formulation, a different set of constraints that is better suited to random

663
00:49:16,208 --> 00:49:17,479
unknown camera positions.

664
00:49:17,479 --> 00:49:20,216
And so that's this global positioning problem that they solve.

665
00:49:20,216 --> 00:49:21,930
So that gets you pretty close.

666
00:49:22,054 --> 00:49:26,778
So once you've run your rotation averaging, your global positioning, you get a
reconstruction that's pretty close.

667
00:49:26,778 --> 00:49:34,924
And then you can run bundle adjustment, an actual high quality refinement using bundle
adjustment, and then you have your 3D reconstruction.

668
00:49:34,924 --> 00:49:39,528
So it skips a lot of this incremental slow process that wasn't parallelizable.

669
00:49:39,528 --> 00:49:47,213
The rotation averaging and global positioning, that's a little better suited to
parallelization and is more efficient because you're not having to do this one after the

670
00:49:47,213 --> 00:49:48,474
other after the other.

671
00:49:48,685 --> 00:49:53,097
Yeah, and I have it on my screen here, the project page where it kind of shows what you were
talking about.

672
00:49:53,097 --> 00:49:56,099
And this last one is showing it all happening at once.

673
00:49:56,099 --> 00:50:00,131
Where it all just kind of resolves at once.

674
00:50:00,131 --> 00:50:07,372
I do want to say that, to me, there's a low, what's the
right word.

675
00:50:07,372 --> 00:50:15,447
It's not, you're not going to be wasting a lot of your time to give this a shot, to see if
this works well for your project, because you don't have to wait a lot of time for it to

676
00:50:15,447 --> 00:50:16,031
do.

677
00:50:16,031 --> 00:50:17,503
the incremental reconstruction.

678
00:50:17,503 --> 00:50:27,650
It doesn't work well with all scenes, as I've found, but because you know within minutes if
it's gonna work well or not, it's worth a shot and you get to learn what scenes work well

679
00:50:27,650 --> 00:50:28,051
with it.

680
00:50:28,051 --> 00:50:30,354
You've done some tests as well, Jared.

681
00:50:30,354 --> 00:50:33,396
You can't get too tight in on a bunch of little things.

682
00:50:33,396 --> 00:50:41,903
I feel like you need more of a global view or the example images have a lot of features
and aren't really close tight in on little

683
00:50:41,903 --> 00:50:43,500
features in a scene.

684
00:50:43,618 --> 00:50:43,838
Mm-hmm.

685
00:50:43,838 --> 00:50:44,729
Mm-hmm.

686
00:50:44,729 --> 00:50:56,138
Yeah, you want, from my experience with GLOMAP and other global structure from motion
global reconstruction techniques, they work best when you have a lot of connections

687
00:50:56,138 --> 00:50:57,619
between your images.

688
00:50:57,620 --> 00:51:04,060
So it's not you just walking through a cave or walking down a city street and never
returning back.

689
00:51:04,200 --> 00:51:05,822
It likes a lot of loop closures.

690
00:51:05,822 --> 00:51:11,146
It likes a lot of connectivity, a lot of different vantage points and overlap and diverse
content.

691
00:51:11,146 --> 00:51:12,247
And so it...

692
00:51:12,287 --> 00:51:20,903
It takes the strength of those diverse and dense connections and very quickly figures out
how to arrange them to produce that final reconstruction.

693
00:51:21,066 --> 00:51:27,229
And that's probably why in my experience when I have these more broader view shots, it
works well because I have a lot of connections.

694
00:51:27,229 --> 00:51:36,912
I have a lot of unique features. But you get too close in on one little object, or,
I think, inside: I've done some indoor scenes that haven't turned out because

695
00:51:36,912 --> 00:51:39,723
you have a lot of just blank white walls with not a lot of features.

696
00:51:39,723 --> 00:51:41,913
So it's just not able to do that.

697
00:51:42,494 --> 00:51:43,214
So all right.

698
00:51:43,214 --> 00:51:46,687
Well, this is something I had on my screen just to

699
00:51:46,687 --> 00:51:47,868
to kind of show some examples.

700
00:51:47,868 --> 00:51:55,594
If you're listening, I will make sure I'll link in the show notes as well, GLOMAP and
COLMAP, but GLOMAP's an interesting one you can look at.

701
00:51:55,594 --> 00:51:59,247
It drops on top of COLMAP.

702
00:51:59,247 --> 00:52:02,259
So even getting it running isn't a large lift.

703
00:52:02,259 --> 00:52:04,991
And you see Johannes in the list of names.

704
00:52:04,991 --> 00:52:07,103
So you can see he's still working on these things.

705
00:52:07,103 --> 00:52:09,925
I think this is interesting because it does make things go faster.

706
00:52:09,925 --> 00:52:12,104
And if you look in the results that...

707
00:52:12,104 --> 00:52:17,015
They are in the same range of accuracy as you get with incremental reconstruction using
COLMAP.

708
00:52:17,015 --> 00:52:19,976
So it's not saying, well, this is fast, but it's not nearly as good.

709
00:52:19,976 --> 00:52:22,158
It's fast and it is good if you have a good result.

710
00:52:22,158 --> 00:52:30,020
But you find out really quick because I've noticed that the results either are absolutely
all over the place or you have a really good sparse point cloud.

711
00:52:30,020 --> 00:52:31,721
And so you know if it's good or not.

712
00:52:31,721 --> 00:52:35,833
In fact, you'll see cameras all over the place where everything's kind of like this weird
looking cube.

713
00:52:35,833 --> 00:52:38,174
And that's how you know it didn't work.

714
00:52:38,174 --> 00:52:39,744
But you will know.

715
00:52:39,744 --> 00:52:40,794
based off of your output.

716
00:52:40,794 --> 00:52:43,874
Yep, I've gotten a few Borg cubes.

717
00:52:43,874 --> 00:52:45,294
That's what I think they look like.

718
00:52:45,294 --> 00:52:49,694
But I think I've gotten a few cubes as my results.

719
00:52:49,754 --> 00:52:50,418
yeah.

720
00:52:50,577 --> 00:52:53,229
Well, I think we covered this all really well.

721
00:52:53,229 --> 00:53:04,078
I hope at the end of this, people will go try COLMAP or go, I mean, even if they use other
software, it will follow relatively the same sort of process.

722
00:53:04,078 --> 00:53:06,340
Maybe there's other ways this is done.

723
00:53:06,340 --> 00:53:13,157
I'm sure there are, but this is the standard kind of method; most at least follow this
sort of style.

724
00:53:13,157 --> 00:53:15,640
And now there's all this machine learning stuff that's different.

725
00:53:15,640 --> 00:53:24,667
But as far as classical 3D reconstruction from imagery, this is a very well known and
reused pipeline for a lot of projects.

726
00:53:24,667 --> 00:53:27,228
Yeah, and it's a great, like you said, like just go and try that.

727
00:53:27,228 --> 00:53:28,539
I can't stress that enough.

728
00:53:28,539 --> 00:53:29,909
Just try it.

729
00:53:29,909 --> 00:53:38,343
You know, if you're either, one, just new to computer vision and want to understand how
3D reconstruction works, you know, or maybe you kind of understand it, but don't, you

730
00:53:38,343 --> 00:53:42,334
know, but want to get a better insight of how things work behind the scenes.

731
00:53:42,334 --> 00:53:48,077
A tool like COLMAP is great just to, you know, throw some images at it, run a
reconstruction and then start poking around.

732
00:53:48,077 --> 00:53:54,277
There's a lot of neat visualizations that Jonathan showed where you can look at a point
and see which images saw it or in an image.

733
00:53:54,277 --> 00:53:55,668
What did it match to?

734
00:53:55,668 --> 00:54:05,222
There's other debug visualizations where you can look at sort of the match graph or the
match matrix and see the different patterns or ways that images are matching to each

735
00:54:05,222 --> 00:54:05,932
other.

736
00:54:05,932 --> 00:54:16,698
So it's a nice way to get in, get your hands dirty, and see how this process works, turning
pixels into 2D information and then into final 3D results.

737
00:54:16,698 --> 00:54:17,560
And that...

738
00:54:17,560 --> 00:54:21,142
mapping from 2D to 3D and all the information that goes into that.

739
00:54:21,142 --> 00:54:25,263
So it's a great way to get in there and get an intuition for how this all works behind the
scenes.

740
00:54:25,405 --> 00:54:26,616
Yes, definitely.

741
00:54:26,616 --> 00:54:36,741
And I would say the most important part when you're trying to run this is picking the
right matching strategy because that can be the difference between waiting hours and an

742
00:54:36,741 --> 00:54:37,912
hour or minutes.

743
00:54:37,912 --> 00:54:42,604
So, well, thanks, Jared, for this episode and kind of covering all this stuff.

744
00:54:42,604 --> 00:54:47,316
I hope this was tangible enough for people to go try it and having the visuals up.

745
00:54:47,316 --> 00:54:51,168
If you're listening, go find this video on the EveryPoint

746
00:54:51,168 --> 00:54:52,849
YouTube channel.

747
00:54:52,849 --> 00:54:55,351
We have a playlist of all of our episodes.

748
00:54:55,351 --> 00:54:59,034
I haven't named it yet, but I'm sure COLMAP will be in the name.

749
00:54:59,034 --> 00:55:03,718
It'll be, I can't remember what episode we're on, but it's like 15 or 16.

750
00:55:03,718 --> 00:55:08,342
It'll be a great way for you to learn this.

751
00:55:08,342 --> 00:55:13,146
If you're getting into this, because, I didn't go over these,
but we have questions.

752
00:55:13,146 --> 00:55:18,870
I see them every day, either on my videos or on Reddit or Discord.

753
00:55:18,870 --> 00:55:21,190
There's these different communities that are all

754
00:55:21,190 --> 00:55:26,043
using projects that require COLMAP to run to start, think 3D Gaussian splatting.

755
00:55:26,043 --> 00:55:32,360
And it's just obvious that this is something that people just know they have to use, but
have no idea what's happening.

756
00:55:32,360 --> 00:55:38,109
They just know they threw a bunch of images at it and something came out and then they're
going to do something else with it.

757
00:55:38,109 --> 00:55:42,792
But they have no appreciation for the sausage making of COLMAP.

758
00:55:42,792 --> 00:55:46,916
And if you know what each step is, you can get better results in my opinion.

759
00:55:46,916 --> 00:55:48,275
Just play with it.

760
00:55:48,275 --> 00:55:51,088
See what works, learning what those different options are.

761
00:55:51,088 --> 00:55:55,051
If you don't know what an option is as well, jump on our YouTube channel, ask a
question.

762
00:55:55,051 --> 00:56:01,979
I will be watching and trying to respond as intelligently as possible on those and give
you a good answer.

763
00:56:01,979 --> 00:56:04,261
So Jared, any other parting thoughts you want on this?

764
00:56:04,261 --> 00:56:06,453
You said go get, give it a try.

765
00:56:06,453 --> 00:56:10,161
Any other tips you would give people? Take good sharp imagery?

766
00:56:10,161 --> 00:56:10,722
do it.

767
00:56:10,722 --> 00:56:11,658
Just do it yourself.

768
00:56:11,658 --> 00:56:12,360
Get out and try it.

769
00:56:12,360 --> 00:56:15,407
Take your own photos and see how they turn out.

770
00:56:15,407 --> 00:56:16,727
Yeah, take your own photos.

771
00:56:16,727 --> 00:56:21,739
Don't go use the, like, open source data sets, because you know those are going to work.

772
00:56:21,739 --> 00:56:26,830
You know, those are great for testing, but not great for learning on your own data.

773
00:56:26,830 --> 00:56:27,950
So, right.

774
00:56:27,950 --> 00:56:28,621
Well, thank you.

775
00:56:28,621 --> 00:56:33,742
And again, if you're listening, this will be on all major podcast players.

776
00:56:33,742 --> 00:56:42,775
Please, if you can, subscribe to our channel or to one of our podcast episodes.
That'll mean a lot to us, to know that we're making the right content and that you guys care

777
00:56:42,775 --> 00:56:44,913
about learning about this information.

778
00:56:44,913 --> 00:56:52,793
And as always, let us know in the comments as well on our YouTube channel, if there is
something here that you would like us to go deeper in, maybe we can get someone like

779
00:56:52,793 --> 00:56:57,213
Johannes on one of these episodes to go super deep if you want to.

780
00:56:57,333 --> 00:57:01,833
Anyways, well, thanks Jared for being on this episode and I'll see you guys in the next
episode.