Computer Vision Decoded

In this episode, Jonathan Stephens and Jared Heinly delve into the intricacies of COLMAP, a powerful tool for 3D reconstruction from images. They discuss the workflow of COLMAP, including feature extraction, correspondence search, incremental reconstruction, and the importance of camera models. The conversation also covers advanced topics like geometric verification, bundle adjustment, and the newer GLOMAP method, which offers a faster alternative to traditional reconstruction techniques. Listeners are encouraged to experiment with COLMAP and learn through hands-on experience.

This episode is brought to you by EveryPoint. Learn more about how EveryPoint is building an infinitely scalable data collection and processing platform for the next generation of spatial computing applications and services: https://www.everypoint.io

Creators and Guests

Host
Jared Heinly
Chief Scientist at @EveryPointIO | 3D computer vision researcher (PhD) and engineer
Host
Jonathan Stephens
Chief Evangelist at @EveryPointIO | Neural Radiance Fields (NeRF) | Industry 4.0

What is Computer Vision Decoded?

A tidal wave of computer vision innovation is quickly having an impact on everyone's lives, but not everyone has the time to sit down and read through a bunch of news articles and learn what it means for them. In Computer Vision Decoded, we sit down with Jared Heinly, the Chief Scientist at EveryPoint, to discuss topics in today’s quickly evolving world of computer vision and decode what they mean for you. If you want to be sure you understand everything happening in the world of computer vision, don't miss an episode!

Welcome to another episode of Computer Vision Decoded.

I'm really excited about this episode because it's going to solve a lot of questions that
we get about structure from motion and 3D reconstruction when it comes to COLMAP and just

figuring out how to do some of the basics of 3D reconstruction from imagery.

And as always, I have Jared Heinly our in-house computer vision expert to walk us through
what happens

when you run software like COLMAP to get camera poses, 3D reconstruction, and kind of
break down how that all works at a tangible level.

So when you walk away from this episode, you should have a better understanding of this
black box of COLMAP and other 3D reconstruction software that follows the same workflow.

So as always, Jared, thanks for joining me and welcome to the episode.

Yeah, thank you.

Let's just get to what we're all here for.

Let's learn about COLMAP and I don't wanna say specifically COLMAP, but we're gonna use it
as the basis for this episode to have something for someone to follow along and since it's

open source and free,

they can download COLMAP and do this on their own PC without, you know, having to pay for
some third-party software that they wouldn't learn as much from.

So Jared, let's just start off with, I'm gonna share my screen.

I have some images and we want to turn these images into a 3D model or just at least know
where these cameras are in relation to each other.

I'm gonna be doing some screen shares.

If you're listening to the audio only, I'll do my best to talk about what we have on the
screen.

But if I start out

here, I have a picture of a, well, it was a fountain that used to work in front of the
Oregon State Capitol.

I took this one sunny day last year.

And if I flip through the images, I basically walked around this fountain and got a

bunch of good angles.

In fact, I believe I used a video and extracted a bunch of images.

And at some points there's some sun issues, things like that, but it was good enough for
me to get a 3D model.

So Jared, what's the first step someone would take then to turn this into a 3D model, know
where the cameras are, things like that?

Yeah, yeah.

Well, you hinted at it right there at the very end.

Know where the cameras are.

I guess I'll try to refine my language.

A lot of times when I say camera, I mean image, you know. I

use those words interchangeably sometimes.

But you said that you walked around with a single camera, you know, your phone or DSLR
or whatever it may be.

And from that video, maybe you extracted frames, you know, images, or you took photos yourself.

And so.

you have multiple images taken by a single physical camera, but you were moving around
that scene, moving around that object.

And so that camera was occupying different physical 3D points in space.

And then these images were captured from those different 3D points and from those
different 3D perspectives.

So, you know, as humans, we just do this naturally.

Like as you just flipped through those photos there, you know, and as you kind of orbited
around,

that fountain.

It's like, our brains are immediately like, yeah, okay.

I can see that the ground is a little bit closer.

Here's this foreground fountain.

I see the trees in the background.

I see some other structures in the background and I'm immediately, I can see that, yep,
you are moving to the left and sort of this clockwise motion.

This thing's, you know, near that other things are far.

Our brains are immediately doing all of that 3D reasoning, but in order to have software
do this, in order to have a computer

generate a 3D reconstruction or 3D representation of what's in these photos.

It has to figure out, it has to do all of that math and it doesn't know how to do that
reasoning by default.

It has to figure out, where were you standing when that photo was taken?

Where was the camera positioned?

How was it angled?

What was the zoom level of the current lens?

And so it's doing all, it has to figure out where everything was oriented.

And that's typically one of the first processes.

It's kind of figuring out how are things related to each other?

And once we kind of know how they're related, then figure out what is the 3D geometry that
describes that relationship.

Mm-hmm.

And so it goes through, so I don't have it on my screen, but I will pull it up in a
second.

But COLMAP has in their tutorial information, a good kind of diagram.

I'll let me bring that up, but it basically shows the workflow that it goes through.

So if I go to the actual website for COLMAP and you go look at their tutorial, you can see
that.

So let's just pull that up on my screen as well.

While he's pulling that up, just to jump in with a little bit of personal history about
COLMAP. So I did my PhD back at UNC Chapel Hill.

So I was there from 2010 to 2015.

And while I was there, Johannes Schoenberger, he came to UNC for two years to do his
masters.

And so Johannes, he's the author of COLMAP.

But at the time when he was there, COLMAP didn't exist.

Johannes had worked on previous structure-from-motion software and had

built, he'd worked with some, I believe those drones, so aerial photography, 3D
reconstruction.

And so he had built a pipeline that he had called MavMap.

I'll probably get this wrong, but I've been like, you know, mobile aerial vehicle, MavMap,
like mapper, mobile aerial vehicle mapper.

And so, but he was looking to generalize that to move beyond just aerial photography and
to do more general purpose image collections.

And so it was this idea of image collections where he came up with COLMAP, collection
mapper, to say, I want to take a collection of images and generate a 3D reconstruction

from it.

So he was working on that while he was at UNC.

I may have been one of the first people to actually use COLMAP in my final PhD project.

I had processed 100 million images on a single PC.

And I was doing this feature matching and extraction, but then I needed some way to
reconstruct them.

And our lab had some other software that could do 3D reconstruction, but Johannes had just
written this first version of COLMAP.

And so I said, great, let's use that.

And that was efficient, that was fast, and it did exactly what we needed to do.

And so that helped get my paper across the goal line there at the very end.

And since then, Johannes has gone off to ETH Zurich and now to other companies,

and continued to open source COLMAP.

And now it's used all over the world and has won him some awards for it.

You know, interestingly, GLOMAP came out last year and he had his fingers in that as well.

So it's not over.

I still see COLMAP being updated on a semi-regular basis as well.

Although it came out several years ago, it's not static.

No, no, because it is such an important step.

The task that COLMAP solves and similarly GLOMAP, figuring out the 3D pose, pose is
position plus orientation, figuring out the 3D pose of images is a key step in so many 3D

pipelines.

If you want to understand the world in 3D, you got to figure out where these images were
taken from.

And that's the key task that COLMAP solves for a lot of people.

Okay.

That makes sense.

I had no idea also that COLMAP stood for collection mapper.

I mean, that makes sense, but I thought maybe it was a long acronym.

okay.

Well, I have this diagram up then if you're watching, you can see it on the screen, but if
you're listening, it's basically a workflow of how images go from just a collection of

images.

to a 3D reconstruction and you've got camera poses.

And I'm gonna show this in COLMAP on my screen as well.

But this diagram just shows the different phases, or steps, that you go through
to get from pictures to 3D.

And it starts out with feature extraction.

And if you actually go to the tutorial as well, so if I share the just the tutorial page,
that diagram makes sense.

But the minute you start diving into it,

you have a wall of text that most people won't make it through very well unless they
are perhaps a computer science major, or someone like Jared who does this academically or for

a job.

I look at this and I'm like, okay, some of this makes sense.

A lot of this is beyond me.

So we're gonna break that down.

So yeah, again, starting out with feature extraction.

So what is that step?

what are we, we're taking the images and.

Sounds like something's happening there with features.

Yeah, yeah, absolutely.

So, and just to take a step back here too, so like you said, this is sort of a workflow, a
sequence of steps that goes into generating a reconstruction.

So you had those images as input, and there's a sort of first block of steps that's labeled
correspondence search.

After that, we have incremental reconstruction and then finally we end up with a final
reconstruction.

But yeah, so within correspondence search, our goal for correspondence search is to figure
out the 2D relationships

between that collection of images.

So we're not even talking about real 3D yet.

There might be some hints at 3D in these steps, but we haven't done any reasoning to
really understand which photos are where in 3D space.

So it's just about 2D understanding, 2D matching, 2D correspondence between this
collection of images.

So with that in mind, first step is feature extraction.

So the goal there is

to identify unique landmarks within a photograph.

And these unique landmarks, the intent of that is if I can identify a unique landmark, a
2D point in one photo, hopefully I can identify that same point in another photo, and

another photo, and another photo.

And if I can identify and follow or track that 2D point between multiple images, now I can
use that as a constraint later on when I do the 3D reconstruction.

I can say, hey, however these images are positioned, that point that they saw, that pixel
should converge to a common 3D position in space.

And so it's adding a sort of a viewing constraint saying, you know, each image saw a 2D
point.

I don't know the depth of that point.

So it sort of gives me a viewing ray.

So along this direction out into the scene, I saw this unique landmark.

Now I've seen that same landmark in many other photos.

I want to identify that and add that as a constraint, because that

is most likely a 3D point.

So feature extraction is the automatic identification of typically
thousands or tens of thousands of these unique landmarks in an image.

A lot of times there are different flavors of feature detection.

The one used in COLMAP is SIFT, Scale-Invariant Feature Transform.

What it does is it looks for, I call it a blob-style detector.

where it's looking for a patch of pixels that has high contrast to its background.

So it could be something that's light-colored, surrounded by dark, or vice versa,
something that's dark, surrounded by light.

It's going to look for these at multiple scales.

That's why it's scale invariant, so multiple resolutions.

So this could be something that's very small, or something that's larger in the image.

But once it's found that sort of high-contrast landmark,

it now will then extract some representation of the appearance of the area around that
landmark.

So it'll say, hey, I found something interesting.

So maybe it's a doorknob on a door.

So it'll say, that doorknob is a different color than the background, the rest of the
door.

And so now I want to describe that doorknob.

So I'm gonna look, and I don't wanna look just at the doorknob itself, I'm gonna look around it
and say, here's my doorknob.

And then, oh, there's this wood pattern.

on the door around it.

And so it's going to come up with a representation for that.

so what SIFT actually does or what different feature representations are, that could be a
whole podcast in and of itself.

But at a conceptual level, you just think about it, it summarizes what that looks like at
a rough level.

says, OK, I saw something dark in the middle.

And then there was this rough pattern around its vicinity.
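To make that concrete, here is a minimal sketch of feature extraction using OpenCV's SIFT. COLMAP has its own SIFT implementation (with a GPU path), so treat this purely as an illustration of the idea; the image filename and the feature cap are hypothetical.

```python
# Conceptual sketch of SIFT feature extraction with OpenCV.
# COLMAP uses its own SIFT code internally; this only illustrates the idea.
import cv2

img = cv2.imread("fountain_0001.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
sift = cv2.SIFT_create(nfeatures=10000)  # cap the number of detected features

# keypoints: 2D locations (plus scale/orientation); descriptors: 128-dim appearance vectors
keypoints, descriptors = sift.detectAndCompute(img, None)

print(f"{len(keypoints)} keypoints, descriptor array shape {descriptors.shape}")
```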

Mm-hmm.

Okay, so then I'm bringing up COLMAP, and I've unfortunately already run the
project, because I didn't want us to have to sit and watch things go, and a lot of these

things run really fast.

So SIFT is fast if you can run it on GPU.

I can't necessarily show it, but I think it maxes out at 10,000 features by default.

But if you have COLMAP and you kind of want to follow along the first thing you do is set
up a new project and

that part's pretty easy, but then you just go to processing and feature extraction and you
get to pick a camera model.

Why is that important?

Why is picking a camera model important for this?

Well, this is important and this ends up being really important later on when we start
thinking about the geometry of these images and what kind of camera and lens was used.

Because these camera models define the geometry of that camera.

So right now you have the simple radial camera model selected.

And so underneath it, sort of grayed out, are some parameters listed.

It says simple radial

has F, Cx, Cy, and K.

And so you kind of have to know from the computer vision literature that F is your focal
length; Cx and Cy, that's the principal point, where is the center of

my image, or where is the optical axis of my lens and how is that aligned with the image
center.

So a lot of times I was kind of say, hey, hand wavy, what's the center of my image?

And then that K is a single

radial distortion term. So it's assuming, a lot of times lenses introduce a little bit of a
curvature effect, you know, curvature distortion, and so we're gonna use a single

mathematical term, a single, you know, polynomial term, to represent the distortion in that
lens. This is great for a lot of just, you know, general cameras, but

if you know that your lens has a little bit more distortion, maybe you're using a wide angle
camera, a GoPro or a drone that has a wider field of view and some distortion.

If you have a really wide angle camera, something that you can see a lot of distortion,
then you might want one of these fisheye versions.

They have simple radial fisheye or the normal fisheye.

There's even, I think, at very bottom of list, there's one called FOV.

That's one that's...

really great for super wide angle.

But a lot of times for a normal camera like your iPhone in your pocket or your DSLR or
your point and shoot or whatever it ends up being, your simple radial or your radial

models are nice because they assume that you've got a single focal length, your pixels are
square, so I don't need more than one F term.

You model your principal point with Cx, Cy, and here the radial model adds an extra
lens distortion term.

Now instead of just K, we have K1 and K2.

So that's two radial distortion terms.

We can do a little better job of estimating the distortion of our lens.
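As a rough illustration of what those parameters mean, here is a small sketch of a SIMPLE_RADIAL-style projection: one focal length f, a principal point (cx, cy), and one radial distortion term k. It follows the general convention described above; COLMAP's exact distortion formulas may differ slightly, and the numbers are made up.

```python
import numpy as np

def project_simple_radial(X_cam, f, cx, cy, k):
    """Project a 3D point in camera coordinates to a pixel with a
    SIMPLE_RADIAL-style model: one focal length, a principal point,
    and a single radial distortion term. A sketch of the idea only."""
    x, y = X_cam[0] / X_cam[2], X_cam[1] / X_cam[2]   # perspective division
    r2 = x * x + y * y                                # squared radius from the optical axis
    d = 1.0 + k * r2                                  # single-term radial distortion factor
    return np.array([f * d * x + cx, f * d * y + cy])

# Illustrative values: a point 5 m in front of the camera, slightly off-axis
print(project_simple_radial(np.array([0.5, -0.2, 5.0]), f=1500.0, cx=960.0, cy=540.0, k=0.01))
```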

And so COLMAP asks for this right away because what it's doing is it has that, you know,
part of that project creation process is you create a database.

And so that's going to be, you know, a collection of data stored on disk.

And so this process of feature extraction is when COLMAP goes through all of your images,
extracts features, but then also creates those image entries in the database.

And so it needs to know what style of camera is going to be associated with that image.

Mm hmm.

And we could go deep into a bunch of buttons on here.

I don't want to. If you just run this on the defaults with simple radial, and you're using a smartphone or
something, you'll be OK.

But, you know, like here it's thinking I have all these different cameras.

There's an option where you can say it's always one camera.

So it just assumes then every image is from the same camera, which is great.

There are also options for masks.

I just bring up a mask on my screen.

This is me masked.

This is a mask, not necessarily the mask you would use, but basically there's a picture of
me.

This might be the wrong picture.

And I have a mask as a separate file.

And then if you kind of like combine the two, you end up with me masked out.

And that's like a way to say you don't want me to be in this result.

You can mask out things specifically if you want perhaps just an object to be
reconstructed.

You want to mask out a background, things like that.

We could go deep into it, but there's all these options, right?

To help get the right key points.
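For reference, masking is exposed through the feature extraction options. Below is a hedged sketch of invoking it from the command line via Python: the flag name and the mask naming convention (one PNG per image, where black pixels are ignored) follow COLMAP's documentation but may differ between versions, and all paths are hypothetical.

```python
import subprocess

# Hypothetical layout: for each image "IMG_0001.jpg" there is a mask
# "masks/IMG_0001.jpg.png" in which black (zero) pixels are ignored
# during feature extraction. Check your COLMAP version's docs for the
# exact option names.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", "project/database.db",
                "--image_path", "project/images",
                "--ImageReader.mask_path", "project/masks"], check=True)
```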

So if I go to this database, so I ran this already and I have this database manager where
I can kind of jump into things and I pick one of these and I'm just gonna hit show image.

It's gonna bring up the image and I can make this nice and big on my screen.

What we're seeing now is an image of the fountain.

on the backside of it right now with all these red circles, which are key points, not
necessarily all the features, right?

It's just some of the

ones that I think it matched on.

Is that wrong, or am I on the wrong track?

Yeah.

some software packages, they may show you all of them or may show you just the ones that
have been matched.

I'm not sure with this specific viewer right now.

so yeah, and I'm not a hundred percent clear either.

I didn't read the documentation.

All I know is this is visualizing them.

So this is an idea of key points where you'll notice there's no key points where you have
a lot of low contrast, not a lot of visual variation.

So I'm on my screen and there's a part where it shows the street and there's just not much
going on there, versus there's a lot of points on the fountain, which has all these ornate

decorations on it, and in the background there's trees and

buildings that it's latching onto.

So it makes sense that where you have less variation, you're going to have fewer features.
The sky is also another one where there's no variation, but this

nice tree behind the fountain, it caught onto a lot.

So it doesn't necessarily mean it matched on those points, because you might not see those in the other images.

So then, I'm going to close this, and then you can look at show overlapping images.

So

you know, if I click here, you can look at the matches.

You're going to see then this kind of correspondence matches where it's finding key points
between two images and they show these green lines basically saying these two images have

matching features that it believes are the same points.

Right.

Is that what we're seeing?

exactly, exactly.

So this is now sort of moved to the second and third bubbles within that correspondence
search block.

So back to that correspondence search, the first step was the feature extraction, which
was just the identification of these key points in each of the images.

So it wasn't even trying to compare images yet.

We're just saying for each image, let me find those key points.

as Jonathan said, by default, if you've got a GPU enabled version of COLMAP and you've got
a nice GPU in your computer,

it will use the GPU implementation, that graphics processor, which makes it go a lot
faster.

So once we've extracted those key points or features, again, I use those terms
interchangeably a lot, the key point and the feature.

Now we want to match images together.

And that's to discover which images show similar content.

And so the result of that is going to be the set of correspondences, the set of features
saying the features in this image matched

to the features in this image.

And so those were those green lines that Jonathan had shown just prior, saying that, you know,
not all of the key points from one image matched to the other.

There was some subset, but we're trying to discover what those matches are.

In this diagram, we said that, you know, we had feature extraction, matching, and then
geometric verification.

Matching and geometric verification, a lot of times will go hand in hand, you know, so you
run matching and then you immediately run geometric verification after that.

So the intention there is your matching is just trying to figure out which features look
similar between two images, but it's not trying to do any sort of 2D or 3D reasoning.

So it may think that, the top of the tree in one image looks like the top of another tree.

in another image, but they're in completely different parts of the image and it doesn't
even make sense.

Like it may confuse things, or especially if you have a building with some sort of
repetitive pattern on it, the same brick repeated over and over again, but you have some

sort of unique windows or unique artwork that appears on that wall.

For feature matching, it may end up matching incorrect parts of the image to each other.

So matching does its best to try to figure out what matches, but it might be wrong.

It's geometric verification's job to come in and clean those up, to figure out, well, now that
I have this initial set of matching key points between my two images, which ones actually

make sense based on our knowledge of geometry and how cameras move.

And so that's where knowing what kind of camera model you have
can be helpful.

Knowing if you expect a lot of distortion, or if it's a fisheye lens, that can help.

But sometimes

some methods don't even try to use that information.

They'll just look at the 2D to 2D relationships.

And so some keywords that you might see would be estimating a homography, a
homography being like a perspective transform, or an essential matrix or a fundamental matrix.

So each of these sort of relationships, each of these matrices is a way to describe how a
point in one image matches to a location in another image or a set of locations in another

image.

And so we're trying to estimate, is there a valid camera motion that we can imagine to get
a set of points in one image to move to the set of points in the other image?

And so that's what geometric verification is doing, just figuring out those 2D
relationships between images.
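Putting matching and geometric verification together, here is a conceptual sketch using OpenCV rather than COLMAP's internals: nearest-neighbor descriptor matching with a ratio test, followed by RANSAC estimation of a fundamental matrix to keep only the geometrically consistent matches. Filenames and thresholds are illustrative.

```python
import cv2
import numpy as np

# Hypothetical pair of images from the fountain sequence
img1 = cv2.imread("fountain_0001.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("fountain_0002.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Feature matching: nearest-neighbor search with Lowe's ratio test
matcher = cv2.BFMatcher(cv2.NORM_L2)
raw = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in raw if m.distance < 0.8 * n.distance]

# Geometric verification: keep only matches consistent with a fundamental
# matrix estimated by RANSAC (i.e. a plausible relative camera motion)
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 4.0, 0.999)

print(f"{len(good)} putative matches, {int(inlier_mask.sum())} geometrically verified inliers")
```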

And somewhere in my logs, you can see some hints of that.

So it's just running.

It's showing all kinds of text on your screen.

And I'm sure some of that is in there; it's showing bundle adjustment on my screen right now.

But at one point it's talking about some of that, the matches, and running different

algorithms in the background to get that.

So then if I click on one of these points that it created, it

shows you where you have multiple matches on a

specific point, and things you can do to kind of get different views and get hints of what
we're talking about here.

But one thing we didn't really talk about when you're matching these images, too, is that
there's different options as well.

So when I go through here, I'm processing, I've got my key points.

It goes fast on a GPU because it's able to like look at all the different images all at
once, right?

They don't depend on each other when you're extracting features.

But then you get to the point where you need to do your matching.

This is where it's all CPU driven, because it's kind of either a sequential or exhaustive,
but it's not able to look at every image all at once.

But there's options here where if I go to this button here, it's not displaying on my
screen correctly for some reason.

there we go.

You can do an exhaustive, sequential, vocab tree, spatial.

There's these different styles you can pick, or, I'm gonna say, not styles,

different algorithms you can pick to match these.

My understanding always is, if you have a random collection of images, like someone walked
around and it's not necessarily that one image is taken and then for your next image you moved

over and took a shot of the same part of the scene.

But I don't know, maybe you're just walking around taking pictures in all different directions.

Exhaustive is what you want to use because it's gonna, you can explain this, but it's
gonna, like, try to get every image to match to every image, versus sequential where

you're saying,

No, no, no, each image was taken in sequence.

So I see the fountain from one spot.

I moved a few feet, took another photo of it.

They should be sequentially somewhat matching between each other.

Does that sound correct?

Am I at the right assumption?

you're exactly right.

You're exactly right.

yeah, once you've extracted the key points from a single image, now you want to figure out
which pairs of images are related to each other.

So the simplest, most naive way is to say, well, let me match every single image to every
single other one.

Let me look at all order n squared, every single combination of pairs of images that I can
imagine.

And so that's what exhaustive matching is doing.

So exhaustive matching, like you said, it's great when...

You have sort of an unsorted random collection of images.

And it works especially well if you have on the order of a few hundred images.

Because it is doing this every image to every other image, that quickly gets expensive in
terms of time.

Like that's going to take a lot of time to compute if you try to do this on thousands of
images.

You could still do it and just have to wait a long time.

But yeah, it's great because it's going to try to discover every single

pair of matching images that it can.

Mm-hmm.

And so that's where the sequential is nice if you have something like you said there in
the fountain sequence where you know, hey, these are frames from a video or my images.

Maybe I was taking photos, but I'm taking them in order.

Like, I started here, took a photo, took a few steps, took another photo, took a few more
steps, took another photo.

And so there is some sort of sequential information to those photos.

I know that images taken near each other in that list show

similar content, and that's what sequential is.

It'll leverage that information to help the matching be more efficient.

And then I don't really understand vocab tree.

I do know that if you want to do an exhaustive style match, not sequential, but you have,
let's say 800 images, I've always heard use a vocab tree.

Yeah, yeah, that's exactly right.

So the vocab tree, you might have heard it called a vocabulary tree or image retrieval style
matching.

Yeah, what it's doing behind the scenes is it uses an image lookup data structure.

So it takes all the images, comes up with a really compact summarization.

of the kinds of things that are in each image and then provides a way that I can say, hey,
for this given image, what other images in my data set are likely to have the same kinds

of things in them?

It's not a guarantee, but it just says, you know, if I have one image and I've got
10,000 other images that I can match to, I can ask it, well, hey, I don't want to look at

all 10,000.

Can you at least give me a sorted list of the ones that are most likely to match?

Mm-hmm.

what the vocab tree option does for you is it returns that ranked list.

then, instead of matching all 10,000, I can choose to match the best 50 or the best 100 or
whatever my threshold is.

Yep.

Yep.

more efficient.

Yeah, once you get beyond three to 400 images, exhaustive should not be your option.

You should go to the vocab tree unless they're all sequentially taken and then always use
sequential.

Well, not always, but that's probably your default.

So if I'm taking a video,

and then extracting images, sequential is always gonna work.

Well, it's always gonna be your first option if you wanna be as fast as possible.

And then in here, I know you can pick loop detection.

It's trying to, we've talked about that before, right?

It's trying to detect whether you've come back to an area, correct?

And that will do it using the vocab tree option.

Like, so if I do loop detection, so under the sequential tab, if I do loop detection and
then specify a vocab tree path there at the bottom, that will enable it to say, as I'm

processing through all those video frames, you know, every 10th frame, every 50th frame,
every 100th frame, whatever you set it to, you can have it go and then do a vocabulary

tree retrieval, do that image retrieval step, to try to discover

loop closures within some of that data.

Okay, so we have these options.

And then there's spatial and transitive.

We haven't talked about that.

Does spatial have to do with GPS or?

right.

So it just says, you know, for each image, assuming the images have embedded geotags,
so GPS data embedded in the EXIF, it will say for each image, just find other images with

similar GPS and match to those.

Yep.

A lot of people here listening probably are taking drone images, and spatial is the one I always
use.

That's a great option because a lot of times that drone is looking straight down, or it's not
looking in completely random directions.

There is some order and structure to that drone data, and so spatial works well.

A lot of the drones people are using nowadays have really good GPS on them.

I'm thinking the enterprise versions of, like, a DJI drone are getting really good GPS even
without an RTK attachment.

So it's not gonna throw a bunch of error in there.

And then what's transitive?

That's the one I don't think I've ever touched.

I don't even know what that means.

Yeah, that's a way to densify a set of existing matches.

So suppose you had gone and run one of the existing modes.

You ran, okay, maybe not exhaustive, but like if you had run sequential or run your
spatial or run your vocab tree, but then you wanted to go back and create a more complete

set of connections between images. What transitive will do is it'll look at your database
and it'll say, hey, if image A matched to image B and image B

matched to image C, but I didn't try to match image A directly to C, let me go ahead and
do that now.

And so it goes back and finds these transitive links between images and attempts to do
that matching.

So what that does, that just creates a stronger set of connections between images, which
will help COLMAP out during the reconstruction phase.
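If you are running COLMAP from the command line instead of the GUI, those strategies map to separate commands. Here is a hedged sketch, driven from Python for convenience: the command names come from COLMAP's CLI, but flag names can shift between versions, and all paths are hypothetical.

```python
import subprocess

db = "project/database.db"      # hypothetical paths
images = "project/images"

# Pick ONE matcher depending on your data:
subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)      # unordered, a few hundred images
# subprocess.run(["colmap", "sequential_matcher", "--database_path", db], check=True)    # video frames / ordered captures
# subprocess.run(["colmap", "vocab_tree_matcher", "--database_path", db,
#                 "--VocabTreeMatching.vocab_tree_path", "vocab_tree.bin"], check=True)   # large unordered sets
# subprocess.run(["colmap", "spatial_matcher", "--database_path", db], check=True)       # GPS-tagged (e.g. drone) images
# subprocess.run(["colmap", "transitive_matcher", "--database_path", db], check=True)    # densify existing matches
```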

Okay, so I feel like this gives me a good idea then, or gives the listener slash viewer an
idea.

There's different options.

Pick the one that makes sense for the data set you have.

You might get the best results out of exhaustive as far as error, but you might be waiting
a day.

I've heard people say, I set this and now it's telling me it'll be ready in 28 hours.

Well, probably not the right mode.

You probably should have used a vocab tree.

You know, I always say find the right one.

Start with sequential.

If you have sequential images, at least, you'll probably get a good result there.

I also want to

mention, back in the diagram under correspondence search, you know,
they do break it down into feature extraction, feature matching, and then geometric

verification.

That geometric verification, those options show up on that matching, those matching
settings screens that we just saw.

For each of those tabs at the bottom, there was the general settings or general options.

And a lot of those general options are related to

geometric verification saying, when I'm matching these points and I want to then verify
it, what sort of pixel error do I expect or what is the minimum number of inliers or an

inlier ratio?

And so those inliers are the number of geometrically verified matches between a pair of
images.

And so that's where geometric verification kind of comes into play within this COLMAP
workflow.

Okay.

so let's move this along.

Then I do want to point out, I'm going to show COLMAP one more time.

At this point, you've run both your feature extraction and feature matching.

You will still see nothing on your screen.

Well, you will see logs, but you will not see these camera poses, which I have.

So I have a point, I have this sparse point cloud.

I have these red camera positions around it and none of this shows up because at this
point we haven't, we haven't created a point cloud.

We haven't projected anything yet.

So we're moving from correspondence search to, if I bring up that diagram one more time,
we're moving on to incremental reconstruction.

And that's where we start to see fun things happening on a COLMAP GUI screen.

If you're running on a GUI, you'll start to see camera poses show up.

So the first step is initialization.

What is that?

So is that just starting?

Yeah, that's what it is.

I mean, it's the starting process for this incremental reconstruction.

So incremental reconstruction is just one style to attempt to do 3D reconstruction.

so the core idea here is that, like you said, we don't have any 3D information yet.

So we're going to start with the minimum amount that we need, which is a pair of images.

So let's start with a pair of images and then figure out what is the 3D relationship.

between those images as well as what 3D points did they see in the scene.

And so we're going to create this two view reconstruction.

Take that pair of images, triangulate an initial set of 3D points, and then we use that as
the initialization for the rest of the reconstruction.

And so everything after that is going to figure out, based on these initial two images and
some points, how can I add a third image to that and how does it relate?

Now that I have these three, how can I add a fourth and a fifth and a sixth?

And so you just keep adding images one at a time.

to grow a larger and larger reconstruction.

But initialization is just, what is that initial pair?

Which two images am I gonna start with to build this entire reconstruction?
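Conceptually, the two-view initialization looks something like the OpenCV-based sketch below (COLMAP's actual implementation differs): estimate an essential matrix from the verified matches of the chosen pair, recover the relative rotation and translation, and triangulate an initial set of 3D points. The intrinsics K are assumed to be a rough guess, and the inputs are hypothetical.

```python
import cv2
import numpy as np

def two_view_initialization(pts1, pts2, K):
    """Sketch of two-view initialization: relative pose from an essential
    matrix, then triangulate an initial set of 3D points.
    pts1, pts2: Nx2 float arrays of verified matching pixel coordinates.
    K: 3x3 intrinsics guess (e.g. focal length from EXIF, principal point at center)."""
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Camera 1 at the origin, camera 2 at the recovered relative pose
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Triangulate the inlier correspondences into homogeneous 3D points
    inl = mask.ravel().astype(bool)
    X_h = cv2.triangulatePoints(P1, P2, pts1[inl].T, pts2[inl].T)
    X = (X_h[:3] / X_h[3]).T          # Nx3 initial sparse point cloud
    return R, t, X
```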

Okay.

And then it kind of goes into a circle.

So if you look at this, I say circle, the diagram on the screen shows image registration,
triangulation, bundle adjustment, outlier filtering.

And then if you follow the lines, you notice you're really doing a loop.

So it's looping through that process.

And then also this dashed line showing reconstruction.

So it's kind of probably looping through that and adding to the reconstruction while it's
going or, okay.

Exactly right.

Exactly right.

So it's that initialization that picks the first pair of images.

But once I have my pair of images, now I'm going to enter in this loop that starts with
image registration.

So image registration is a fancy name to say, how can I add a new image to my existing
reconstruction?

And so what it's going to look at is...

based on the 3D points that have already been triangulated, it's going to ask, what's the
best next image in my data set that also saw those points?

And then, once I find that image, you know, via the set of feature matches, and so say, you
know, if I've matched images one and two and triangulated that, well, image two also matched

to image three, well then image three is seeing the same points in the scene.

So let me add image three.

And so there it's a 2D to 3D,

registration process, 2D, 3D pose estimation process, where I take the 2D points in that
third image and I want to align those 2D points with the 3D points that have been

triangulated.

So you might hear that as image registration, or the perspective-n-point problem, or pose
estimation.

There's a few different words for what this process is, but you're adding a new image to
the reconstruction.

And so that's the image registration step.
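A minimal sketch of that 2D-to-3D registration step, again using OpenCV's PnP solver rather than COLMAP's own code: given the already-triangulated 3D points that the new image's features matched to, estimate the new camera's pose with RANSAC. Variable names and the threshold are illustrative.

```python
import cv2
import numpy as np

def register_image(pts3d, pts2d, K, dist=None):
    """Register a new image via 2D-3D pose estimation (PnP + RANSAC).
    pts3d: Nx3 already-triangulated points matched by the new image's features.
    pts2d: Nx2 pixel locations of those features in the new image.
    K: 3x3 intrinsics guess; dist: distortion coefficients (None if unknown)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, dist,
        reprojectionError=4.0)          # pixel error threshold for counting inliers
    if not ok:
        raise RuntimeError("registration failed")
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers             # pose of the new camera + verified 2D-3D matches
```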

I do know when I ran this, I can always take a video and kind of project it onto this in
post, but when it's creating this reconstruction, instead of taking image one and then

image two and then image three and kind of building off that, I'll notice it'll pick, if
you look at my, if you're watching this in video, you'll notice it took two loops.

and some of the images are like right above each other almost.

I held the phone at like above my head and then I held it down at chest level.

So I have two loops and there's a lot of common key points, common features.

So as it's building this up, started at this, kind of where I started walking around this
fountain, but it's using images from further along in the video extraction, or sorry, the

images I had.

So it used like image one and image 180 because those are next to each other.

and had a lot of strong feature matches.

So it doesn't necessarily use images in the sequence of how you took them.

It's ones that had strong correlation, correct?

a great point.

That's a great point.

Yeah, it isn't just going to go, you know, one, two, three, four, five, six.

It's not going to do them in order.

You know, it's going to start with that pair of images.

It's going to look through all of the images in your collection and find the pair.

And it might not be the consecutive pair, but find the pair of images that maximizes some
criteria.

You know, it's a pair of images that has strong connectivity.

So there were a lot of feature matches.

But I also want to make sure that that pair of images has

you know, differences in viewpoint.

I don't want two images that were taken at the exact same position in space because that
gives me no 3D information.

I need, you know, we talked about this in the last episode, this concept of a baseline.

I need some sort of translation.

I need some motion between two images.

Or maybe it was in our depth map episode.

And we talked about this, you know, in that we need motion between images in order to
estimate depth.

So the initialization could look for the same thing.

It wants lots of matches between the images,

but it also wants a strong amount of motion between them.

So it's gonna pick whichever pair of images maximizes that criteria.

And once it has that, then it'll start adding other images that are strongly connected to
those initial ones.

And yeah, it won't necessarily do it in order that you capture those images.

It's gonna be in the order in which those connections are strongest.

And I was seeing mostly that in mine.

I was seeing like the first photo and then somewhere further along where I came and did a
loop.

I saw those two photos start together because, I think, as you're talking
about, the baseline was better.

There was more parallax, because these are pretty closely spaced images I took from
picture to picture.

So not a lot has changed, versus the next loop.

There I'm looking at the exact same part of the fountain, but I have a different

elevation and angle, so there's a lot of parallax movement between those images.

So it was matching those better; as opposed to image one to image two, it's more image
one to image 180, because that baseline was probably better.

So the fun thing is, when you run this in the GUI, this COLMAP GUI, you gotta
watch those build and you're gonna see the point cloud just start to generate in front of

you.

And you get an understanding then of what it's doing in these logs that are

looping through this process over and over.

And you can kind of see it just iteratively add to the scene and build and refine.

When it's doing this incremental reconstruction, is it refining the camera poses as it
goes, or is it just saying, here's the camera poses, there's where they are?

Now there's refinement, there's refinement.

And a lot of times that refinement is called bundle adjustment.

That's a key word that's used commonly in the literature.

I remember the first time I heard the word bundle adjustment, I was a first year grad
student and I had no idea what the person was talking about.

I was like, what, a bundle of sticks, a bundle of what, a straw, what is going on?

But no, a bundle adjustment.

So it's the idea of refining.

the 3D points as well as the camera positions.

And so you end up with just a bundle of constraints, a bunch of constraints saying, these
2D points and these images all triangulated and all saw the same 3D point in this scene,

but I've got a bunch of images and I've got a bunch of points.

How can I optimize the alignment of all of this data?

And that's what bundle adjustment is.
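In code, the objective being minimized looks roughly like the toy sketch below: stack the reprojection errors of every observation and hand them to a nonlinear least-squares solver. COLMAP minimizes this kind of objective with the Ceres solver and a much more careful parameterization, so treat this purely as an illustration of the idea; all the inputs are hypothetical.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, observations, K):
    """Toy bundle adjustment residuals.
    params packs [camera poses as rvec+tvec (6 values each) | 3D points (3 each)].
    observations is a list of (cam_idx, pt_idx, observed_xy) tuples."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    residuals = []
    for cam_idx, pt_idx, observed_xy in observations:
        rvec, tvec = cams[cam_idx, :3], cams[cam_idx, 3:]
        projected, _ = cv2.projectPoints(pts[pt_idx:pt_idx + 1], rvec, tvec, K, None)
        residuals.append(projected.ravel() - observed_xy)   # 2D reprojection error
    return np.concatenate(residuals)

# x0 stacks the initial camera poses and 3D points; least_squares jointly refines both.
# result = least_squares(reprojection_residuals, x0,
#                        args=(n_cams, n_pts, observations, K))
```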

So yeah, so as COLMAP is running, it's doing that image registration process.

It'll add a new image.

It then runs triangulation, which creates new 3D points based on that new image and other
images that are already there.

But then it'll do bundle adjustment, which will say, how can I refine that?

And there's two styles of bundle adjustment that I believe COLMAP uses.

One of them is local bundle adjustment, the other is global.

So a lot of times what you will see is, we had already reconstructed a thousand images and
we're adding that a thousand and first.

When I add that thousand and first, trying to do a bundle adjustment using all thousand
images, that takes a long time.

And so we recognize that, well, that first image, that thousand and first, that next image
that I'm adding, well, it's off in the corner of the reconstruction.

It's far away from the other side of the reconstruction.

These things aren't really related to each other.

So I can run a local bundle adjustment.

Let me just optimize only those cameras and points that are near

that new image that I just added or those new points that I've triangulated.

And so that's a way to sort of do this local refinement.

And I can do that every single time I add a new image.

And then periodically, COLMAP will run a global bundle adjustment.

So there's some settings there.

I think every, you know, once the reconstruction has increased in size by 10% or you've
added, you know, every 500 images or something, there's certain criteria, and especially at

the end of the reconstruction, COLMAP will run a global bundle adjustment, which says,

let's optimize everything.

Let's optimize the points.

Let's optimize the camera poses.

And something we haven't mentioned is it will also be optimizing the camera parameters.

So back when we picked that camera model and we said, we're going to use a camera model
that has a focal length term and a principal point, CX and CY, or maybe has some radial

distortion terms during bundle adjustment, COLMAP will also be optimizing those parameters
as well.

to figure out, what is the field of view of my camera?

That's the focal length or how much lens distortion was there in order to achieve that
alignment.

Would it run those if you, cause we didn't cover this earlier on, but let's say you do
have a camera model calibration file.

So you're saying, I know this.

I think DJI, again, in their enterprise-level drones, will give you this
information on their lenses because they've been calibrated,

and it's in the EXIF data.

Would it then change that?

Does it do like a refinement on top of that?

Or does it just say, no, no, no, you gave us that,

we won't change that?

That's an option.

I think under the reconstruction options or under the bundle adjustment options, there are
ways to say, hey, do I want to refine my focal length?

Do I want to refine my distortion terms?

So you could enable or disable that setting.

To that point, I do believe that COLMAP will parse the EXIF data in those images.

And if it sees that, yeah, there is a focal length.

Because a lot of times, an image will contain

you know, that, this was taken with a 10 millimeter lens or a 24 millimeter lens, you
know, and so COLMAP can parse that data to take an initial guess at what it thinks that

focal length is, you know, what's the field of view of the camera, and can use that as
initialization.

But a lot of times there is benefit to refining that, because it may be close,
but it might not be close enough to get a really sharp reconstruction.
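As a quick illustration of that initial guess, converting an EXIF focal length in millimeters into pixels just needs the sensor width; the numbers below are made up.

```python
def focal_mm_to_pixels(focal_mm, sensor_width_mm, image_width_px):
    """Convert an EXIF focal length in mm into a focal length in pixels,
    the kind of initial guess a reconstruction tool can start from."""
    return focal_mm / sensor_width_mm * image_width_px

# Illustrative numbers only: a 24 mm lens on a 36 mm-wide full-frame sensor,
# image 6000 px wide, gives an initial focal length of 4000 px.
print(focal_mm_to_pixels(24.0, 36.0, 6000))
```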

So, okay, so I got a lot more appreciation for what's happening here.

I tell people run this on their computer.

You don't need the highest spec computer to run a small data set and learn how this works.

I ran this on my older computer, which doesn't have, you know, 24 cores or anything, and
it still ran fairly quick.

I'd say there's some things you gave me some notes.

I think we covered largely most of it, but then from here you can do things.

So I ran this through, you can hit automatic reconstruction, it'll create all this, but
then you can hit bundle adjustment, which is that global one at the end.

And then you can build a dense reconstruction, which we're not really gonna cover on this
episode.

This is just kind of like, here's how we got that workflow I showed to get the camera
poses, the sparse point cloud.

And then from there you can use it for.

more downstream tasks, right?

So I could use this for, again, doing a dense 3D reconstruction, where I
wanna get millions of points on this scene.

Or I can use this as the basis for initializing 3D Gaussian splatting.

There's just different things you can use once you got camera positions and a point cloud,
sparse point cloud.

I'm showing also on my screen, I talk about, you have these kind of magenta lines.

This is showing kind of how

these images matched.

I double clicked on one, it'll show that kind of information of the key points and which
ones match to it.

But you can just click around and learn things.

Double click on different parts of the scene.

It'll show you the point and which different cameras made up that point.

And it's a good tool to kind of learn how this works because it's very visual on the
screen.

Lots of data, lots of options.

You can even create animations in this if you really want to show off what you learned.

There is one thing we didn't really talk about.

Well, there's a couple of things.

So incremental reconstruction, everyone always complains, I bought the newest GPU.

This should be really fast.

Why is this running so slow?

My GPU is not even being used.

And it says it's taken five hours to run my thousand image data set.

Why is that?

Why can't we use a GPU for this incremental reconstruction?

Or I know we can, but why can't we in COLMAP the way it's configured?

Yeah, yeah, because COLMAP.

Yeah, a lot of these algorithms are not easy to parallelize on a GPU.

So a GPU works well when you're doing the exact same operation on millions of things,
because that's what a GPU does.

Its job is to draw pixels to a screen on your monitor, on your desktop.

And so you've got millions of pixels on your screen.

And so that GPU is processing a million pixels at once and figures out what to draw.

And so for tasks like feature extraction, where I've got, again, millions of pixels and
I want to figure out which ones have features in them,

a GPU is great. Or feature matching:

I've got tens of thousands of features in one image, tens of thousands in the other.

I want to figure out which features match with each other. Then again, that's great for a GPU.
For incremental reconstruction,

it's like I'm operating on one image at a time,

and I have to just solve a math equation and do some linear algebra to figure out what's
the 3D position or pose of that image, and that's not a very parallelizable task.

And so it's not very easy to adapt some of these algorithms to the GPU.

I will say then another thing too that contributes to it is COLMAP is very flexible.

There's a lot of algorithms, a lot of switches, a lot of different techniques that you can
use.

And to implement all of those on the GPU would just take a lot of time.

It's nice having software that's flexible.

With COLMAP being open source, a bunch of people contributing to it, it's nice having a
flexible platform where people can easily dive in, make changes, add their own algorithm,

plug it in, tweak things and play with it.

so having that sort of more general purpose CPU based implementation is helpful.

But yeah.

To get back to the core, it really is primarily just around the algorithms.

A lot of these algorithms are not parallelizable or not well suited for processing on a
GPU.

That makes sense.

The way I want to explain it, or the way someone explained it to me:

it's like your CPU is a really good detective, solving clue by clue, one thing at a time,
versus a GPU, which can just point out all the clues all at once.

But you really need that like hard math equation.

You need really fast cores to try to solve those things one at a time.

And it's incremental.

So think about it.

It's like you can't solve all of these all at once as is.

So

That's something that people just have to keep in mind that don't get frustrated.

It's just how this technology works today.

And there's GLOMAP.

So how does GLOMAP make this all sudden magically fast?

Yeah, so GLOMAP is a different style for that reconstruction process.

So GLOMAP stands for global mapper, you know, so global reconstruction versus incremental
reconstruction.

Instead of, here in COLMAP, we just talked about how it uses an incremental reconstruction and,
you know, one image at a time, whereas global reconstruction tries to figure out the

3D poses of all of the images all at once.

So GLOMAP still has that same correspondence search step.

to run GLOMAP, you still got to extract key points, extract features from your image, you
got to match them, got to run your geometric verification.

But once you have that web of connectivity between your images, you can then run global
reconstruction techniques.

And so there's a few different steps there.

In GLOMAP, they run rotation averaging first.

So the idea with that is that you...

look at all of the feature matches between your pairs of images.

For each pair, you estimate how much rotation occurred between that pair of images.

So that gives you a constraint.

But now if I look at all of the rotations that are estimated between all of the pairs, can
I come up with a consistent orientation for all of my images that satisfies each of those

pairwise constraints?

So can I arrange the orientations of my images

so that all of those pairwise rotations make sense.

And so that's what rotation averaging does.

So it's not even looking at position, it's just trying to rotate all of the images.

And once they're rotated in 3D space, then it does a global positioning step, which
simultaneously solves both the camera positions as well as some of the 3D points.

And so it kind of throws all of the cameras into a big soup, a big mess.

It gives them a bunch of random initializations and then defines these constraints saying,
well, these images

all saw these common points, so how can I rearrange all of these images so that they line up
and see those common points?

So it's similar to bundle adjustment

in the idea of taking a bunch of images that see points and refining them, but it uses a
different formulation, a different set of constraints that is better suited to random

unknown camera positions.

And so that's this global positioning problem that they solve.

So that gets you pretty close.

So once you've run your rotation averaging, your global positioning, you get a
reconstruction that's pretty close.

And then you can run bundle adjustment, an actual high quality refinement using bundle
adjustment, and then you have your 3D reconstruction.

So it skips a lot of this incremental slow process that wasn't parallelizable.

The rotation averaging and global positioning, that's a little better suited to
parallelization and is more efficient because you're not having to do this one after the

other after the other.
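In practice, GLOMAP plugs into the same workflow: correspondence search is still done with COLMAP's tools, and only the reconstruction step changes. Below is a hedged sketch, with command names taken from the COLMAP and GLOMAP documentation (they may change between versions) and hypothetical paths.

```python
import subprocess

db = "project/database.db"       # hypothetical paths
images = "project/images"
out = "project/sparse_glomap"

# Correspondence search is still COLMAP (feature extraction + matching),
# then GLOMAP replaces the incremental mapper with its global pipeline.
subprocess.run(["colmap", "feature_extractor", "--database_path", db, "--image_path", images], check=True)
subprocess.run(["colmap", "sequential_matcher", "--database_path", db], check=True)
subprocess.run(["glomap", "mapper", "--database_path", db, "--image_path", images,
                "--output_path", out], check=True)
```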

Yeah, and I have it on my screen here, the project page, where it kind of shows what you were
talking about,

and this last figure is showing it all happening at once.

It all just kind of resolves at once.

I do want to say that, to me, it's low risk, if that's the
right word.

You're not going to be wasting a lot of your time to give this a shot, to see if
this works well for your project, because you don't have to wait a lot of time for it to

do

the incremental reconstruction.

It doesn't work well with all scenes, as I've found, but because you know within minutes if
it's gonna work well or not, it's worth a shot and you get to learn what scenes work well

with it.

You've done some tests as well, Jared.

You can't get too tight in on a bunch of little things.

I feel like you need more of a global view or the example images have a lot of features
and aren't really close tight in on little

features in a scene.

Mm-hmm.

Mm-hmm.

Yeah, you want, from my experience with GLOMAP and other global structure from motion
global reconstruction techniques, they work best when you have a lot of connections

between your images.

So it's not you just walking through a cave or walking down a city street and never
returning back.

It likes a lot of loop closures.

It likes a lot of connectivity, a lot of different vantage points and overlap and diverse
content.

And so it...

It takes the strength of those diverse and dense connections and very quickly figures out
how to arrange them to produce that final reconstruction.

And that's probably why, in my experience, when I have these broader view shots, it
works well, because I have a lot of connections.

I have a lot of unique features. Whereas when you get too close in on one little object,
like, I think I've done some indoor scenes that haven't turned out because

you have a lot of just blank white walls with not a lot of features.

So it's just not able to do that.

So all right.

Well, this is something, as I say, I had on my screen just

to kind of show some examples.

If you're listening, I will make sure I'll link in the show notes as well, GLOMAP and
COLMAP, but GLOMAP's an interesting one you can look at.

It drops on top of COLMAP.

So even getting it running isn't like a large lift.

And you see Johannes in the list of names.

So you can see he's still working on these things.

I think this is interesting because it does make things go faster.

And if you look at the results,

they are in the same range of accuracy as you get with incremental reconstruction using
COLMAP.

So it's not saying, well, this is fast, but it's not nearly as good.

It's fast and it is good if you have a good result.

But you find out really quick because I've noticed that the results either are absolutely
all over the place or you have a really good sparse point cloud.

And so you know if it's good or not.

In fact, you'll see cameras all over the place where everything's kind of like this weird
looking cube.

And that's how you know it didn't work.

But you will know

based off of your output.

Yep, I've gotten a few Borg cubes.

That's what I think they look like.

But I think I've gotten a few cubes as my results.

yeah.

Well, I think we covered this all really well.

I hope at the end of this, people will go try COLMAP or go, I mean, even if they use other
software, it will follow relatively the same sort of process.

I don't think you could do it too differently, but maybe there's other ways this is done.

I'm sure there is, but this is the standard kind of method that most at least follow this
sort of style.

And now there's all this machine learning stuff that's different.

But as far as classical 3D reconstruction from imagery, this is a very well known and
reused pipeline for a lot of projects.

Yeah, and it's great, like you said, to just go and try that.

I can't stress that enough.

Just try it.

You know, if you're either new to computer vision and want to understand how
3D reconstruction works, or maybe you kind of understand it but you

want to get better insight into how things work behind the scenes,

A tool like COLMAP is great just to, you know, throw some images at it, run a
reconstruction and then start poking around.

There's a lot of neat visualizations that Jonathan showed where you can look at a point
and see which images saw it or in an image.

What did it match to?

There's other debug visualizations where you can look at sort of the match graph or the
match matrix and see the different patterns or ways that images are matching to each

other.

So it's a nice way to get in, get your hands dirty, and see how this process of turning
pixels into 2D information into final 3D results works.

And that...

mapping from 2D to 3D and all the information that goes into that.

So it's a great way to get in there and get an intuition for how this all works behind the
scenes.

Yes, definitely.

And I would say the most important part when you're trying to run this is picking the
right matching strategy because that can be the difference between waiting hours and an

hour or minutes.

So, well, thanks, Jared, for this episode and kind of covering all this stuff.

I hope this was tangible enough for people to go try it, and that having the visuals up helped.

If you're listening, go find this video on the EveryPoint

YouTube channel.

We have a playlist of all of our episodes.

I haven't named this one yet, but I'm sure COLMAP will be in the name.

I can't remember what episode we're on, but it's like 15 or 16.

It'll be a great way for you to learn this.

If you're getting into this area, because I see it every day, I didn't go over these,
but we get questions.

I see them every day, either on my videos or on Reddit or Discord.

There's these different communities that are all

using projects that require COLMAP to run to start, think 3D Gaussian splatting.

And it's just obvious that this is something that people just know they have to use, but
have no idea what's happening.

They just know they threw a bunch of images at it and something came out and then they're
going to do something else with it.

But they have no appreciation for the sausage making of COLMAP.

And if you know what each step is, you can get better results in my opinion.

Just play with it.

See what works and learn what those different options are.

If you don't know what an option is, jump on our YouTube channel, ask a
question.

I will be watching and trying to respond as intelligently as possible on those and give
you a good answer.

So Jared, any other parting thoughts you want on this?

You said go give it a try.

Any other tips you would give people? Take good, sharp imagery?

do it.

Just do it yourself.

Get out and try it.

Take your own photos and see how they turn out.

Yeah, take your own photos.

Don't go use, like, the open source data sets, because you know those are going to work.

You know, those are great for testing, but not great for learning on your own data.

So, right.

Well, thank you.

And again, if you're listening, this will be on all major podcast players.

Please, if you can, subscribe to our channel or to one of our podcast episodes;
that'll mean a lot to us and let us know that we're making the right content and that you guys care

about learning about this information.

And as always, let us know in the comments as well on our YouTube channel if there is
something here that you would like us to go deeper in, maybe we can get someone like

Johannes on one of these episodes to go super deep if you want to.

anyways, well, thanks Jared for being on this episode and I'll see you guys in the next
episode.