Computer Vision Decoded

In this episode of Computer Vision Decoded, Jonathan Stephens and Jared Heinly explore the concept of depth maps in computer vision. They discuss the basics of depth and depth maps, their applications in smartphones, and the various types of depth maps. The conversation delves into the role of depth maps in photogrammetry and 3D reconstruction, as well as future trends in depth sensing and machine learning. The episode highlights the importance of depth maps in enhancing photography, gaming, and autonomous systems.

Key Takeaways:
  • Depth maps represent how far away objects are from a sensor.
  • Smartphones use depth maps for features like portrait mode.
  • There are multiple types of depth maps, including absolute and relative.
  • Depth maps are essential in photogrammetry for creating 3D models.
  • Machine learning is increasingly used for depth estimation.
  • Depth maps can be generated from various sensors, including LiDAR.
  • The resolution and baseline of cameras affect depth perception.
  • Depth maps are used in gaming for rendering and performance optimization.
  • Sensor fusion combines data from multiple sources for better accuracy.
  • The future of depth sensing will likely involve more machine learning applications.

Episode Chapters
00:00 Introduction to Depth Maps
00:13 Understanding Depth in Computer Vision
06:52 Applications of Depth Maps in Photography
07:53 Types of Depth Maps Created by Smartphones
08:31 Depth Measurement Techniques
16:00 Machine Learning and Depth Estimation
19:18 Absolute vs Relative Depth Maps
23:14 Disparity Maps and Depth Ordering
26:53 Depth Maps in Graphics and Gaming
31:24 Depth Maps in Photogrammetry
34:12 Utilizing Depth Maps in 3D Reconstruction
37:51 Sensor Fusion and SLAM Technologies
41:31 Future Trends in Depth Sensing
46:37 Innovations in Computational Photography

This episode is brought to you by EveryPoint. Learn more about how EveryPoint is building an infinitely scalable data collection and processing platform for the next generation of spatial computing applications and services. Learn more at https://www.everypoint.io

Creators and Guests

Host
Jared Heinly
Chief Scientist at @EveryPointIO | 3D computer vision researcher (PhD) and engineer
Host
Jonathan Stephens
Chief Evangelist at @EveryPointIO | Neural Radiance Fields (NeRF) | Industry 4.0

What is Computer Vision Decoded?

A tidal wave of computer vision innovation is quickly having an impact on everyone's lives, but not everyone has the time to sit down and read through a bunch of news articles and learn what it means for them. In Computer Vision Decoded, we sit down with Jared Heinly, the Chief Scientist at EveryPoint, to discuss topics in today’s quickly evolving world of computer vision and decode what they mean for you. If you want to be sure you understand everything happening in the world of computer vision, don't miss an episode!

Jonathan Stephens (00:00)
Welcome to Computer Vision Decoded, where we dive into the quickly evolving world of computer vision.

In today's episode, we're gonna dive into depth. Well, depth maps and why they're important for computer vision.

Hi, I'm Jonathan Stephens, and today we're joined by my co-host and in-house expert, Jared Heinly, who knows everything about this topic, at least everything I can ever ask,

Jared. All right. Yeah, well, so just to start off this episode, I'm sure we're gonna have a mix of people who understand depth maps, and people who've probably never even heard of them, but have probably used them in their life and not even known it.

Jared Heinly (00:25)
Thanks, Jonathan.

Jonathan Stephens (00:41)
Jared, let's just start out with like what's depth in computer vision terms and depth maps and just kind of lay the basics for what we're going to talk about.

Jared Heinly (00:51)
Yep, absolutely. So depth, depth just means how far away is something. At least that's how we're talking about it here in computer vision. So from the vantage point of a person or a sensor or a camera, how far away is something from that sensor? And that typically could be a measurement like meters or feet, saying that point in space is...

10 meters away or 5.7 feet or something like that. But it's a distance away from some sort of reference point. Now you mentioned the word depth map. And so a depth map is a particular representation of depths in an environment. So just like an image, if you pull out your camera and take a photo, you get a nice...

color photo. That photo has pixels, typically millions of pixels, and each pixel has its own color. Those pixels are arranged in a 2D grid, you know, so you have X and Y. It's a two-dimensional photo. And there is some geometry to that photo: the camera was at one point in space, and then each pixel is the color along a particular viewing direction. So for every pixel, you can say, well, from the camera's viewing

Jonathan Stephens (01:45)
Mm-hmm.

Jared Heinly (02:13)
point out in that direction, that point was red, or maybe the next pixel's orange and the one beside it's yellow. And so you get all these color readings. A depth map is really similar to an image in that it's arranged in a 2D grid. You could say it has pixels, but instead of color at each pixel, it has a depth value telling you how far away that object is at that particular point in the image.
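
A quick way to picture what Jared is describing: a color image and its depth map share the same 2D grid of pixels, one holding color and the other holding distance. A minimal sketch in Python (the image size and values here are just placeholders):

```python
import numpy as np

# A color image and its depth map share the same 2D pixel grid.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)    # three color values per pixel
depth = np.zeros((480, 640), dtype=np.float32)   # one distance (e.g. meters) per pixel

# The same (row, column) index gives both the color seen along that viewing
# direction and how far away the surface is in that direction.
color_at_pixel = rgb[240, 320]
meters_away = depth[240, 320]
```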

Jonathan Stephens (02:40)
Okay, so is it something that would be baked in? Like, for a depth map, is it baked into your color image, or is it a separate file you would deal with? Or both?

Jared Heinly (02:50)
Great question. It can be both.

So it really depends then on the file format or the representation that you're using to store those depths. Some file formats do that. You have an RGBD, you know, red, green, blue, depth representation for every single pixel, so that those depths are baked into the image. Other times you'll separate them out, you know, because you might not even have a color sensor. You may only get depth,

and you would only just then have a 2D grid of depths. Just because you talked about file formats, I mean, there are other file formats besides depth maps. And a lot of times it depends on what kind of sensor you're using or what kind of technologies you're using. So for example, imagine you were using a laser scanner. Now, I don't know much about the internal hardware of a laser scanner, and I think, John, you may know more than I do.

Jonathan Stephens (03:19)
Good point.

Yeah.

Jared Heinly (03:44)
But you've got a laser that's moving around, spinning around, reflecting off of a mirror sometimes. And so you'll get distance readings in many different directions. So those individual distance readings, maybe those are reported as a point cloud or in some other file format, because that laser scanner may not be looking at every single pixel in an image.

Jonathan Stephens (03:44)
Maybe.

Jared Heinly (04:11)
It may just be looking at every degree or every tenth of a degree as it rotates around. So that could be stored as just an XYZ point saying, well, in this direction, I know there was something at that depth, so I can convert that to sort of an XYZ position. It can get more complex because sometimes laser scanners can have multiple returns. And so how do you store multiple depths along a viewing ray? There are file formats to support that.
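
As a rough sketch of that "convert a depth reading to an XYZ position" step: assuming a simple pinhole camera with focal lengths fx, fy and principal point cx, cy (the intrinsics and depth values below are hypothetical), a per-pixel depth map can be back-projected into a camera-frame point cloud.

```python
import numpy as np

def depth_map_to_points(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map (in meters) into camera-frame XYZ points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column/row coordinates
    z = depth
    x = (u - cx) * z / fx                            # pinhole camera model
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no valid depth

# Hypothetical usage with made-up intrinsics.
depth = np.random.uniform(0.5, 5.0, size=(480, 640)).astype(np.float32)
cloud = depth_map_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```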

Jonathan Stephens (04:37)
I know with like a LiDAR station, you end up with a recorded location from which the scan is taken, your station, you know, a GPS coordinate, and then you've got your depths away from that scanner. But then you have another scanner, so it's kind of like you can have multiple points into the scene, but you have to know which scanner position each point is represented from, or else you don't really know the depth, you know,

to that scene, unless it's to, like, I guess, an origin point. Is that what you... yeah, okay.

Jared Heinly (05:08)
Yeah, yeah, yeah.

If you had multiple, like, from a single scanning position's perspective, yeah, you may have multiple points, but then if you start bringing in multiple scanners, or you pick up that scanner, move it, place it somewhere else, do a new scan, well, now you have to align those into a common coordinate system to reason about those XYZ points. There's also...

Jonathan Stephens (05:15)
Mm-hmm.

Mm-hmm.

Jared Heinly (05:33)
If you have an ultrasound, that's returning depth, but maybe it's more of a scan line instead of looking at a 2D area. Maybe it's a slice through that volume if you're getting an ultrasound of the body or some other environment. There are even more machine-learning-based depth representations. So if I want to represent an entire volume in front of the camera and not just...

Jonathan Stephens (05:43)
Mm-hmm.

Jared Heinly (05:59)
a per-pixel value, if I want to represent like a volume, where I'm scanning some sort of density field or fog or haze or transparent objects, maybe I want to represent the density or depths at various areas, and maybe I'll use something like a voxel grid or an octree. And I said machine learning based, so maybe you're using like a NeRF or even a Gaussian splat as some sort of intermediate representation of the depths, of the distances from a camera at a particular point in space.

Jonathan Stephens (06:16)
Mm-hmm.

Yeah,

so there are a bunch of different applications, which we'll kind of jump into in a few minutes, but you can represent it in different ways depending on the hardware you're using to capture it and also perhaps what you're trying to use it for. Okay, okay. And that's why I said everyone probably uses depth maps and doesn't know it. I just think of the one application everyone doesn't even realize they use it for, and that's the photography on an iPhone or an

Jared Heinly (06:39)
Yep. Yep. Yep.

Jonathan Stephens (06:52)
Android or one of these phones that have a lot of computational photography. So you take a picture of a person and it puts it in portrait mode, and that person's beautifully exposed and the backdrop's bokeh. I'm sure it's doing some sort of depth map to pull out that person and say, only apply the computational overlay to make them look better to that person, because they're close in and it can kind of pick them out, right? So

And that way it's probably using some sort of, I don't know if it's RGBD, but it's got a layer. And I know if I bring up Apple's keynote where they talk about computational photography, and I think Google shows it too, they have all these different, like 27 different layers of data. And one of them is a depth map that they use for doing all this processing within milliseconds.

Jared Heinly (07:21)
Absolutely.

Yeah, yeah, yeah. The iPhone's a great example because there are...

Jonathan Stephens (07:49)
Okay.

Jared Heinly (07:53)
I'm gonna claim there are five different kinds of depth maps that the iPhone interacts with and/or generates. So...

Jonathan Stephens (08:00)
Okay, let's jump into those. That's a good place, let's jump into that. So like everyone's

got a smartphone, and I'm sure the different manufacturers all probably use multiple versions. So even though we're talking about the iPhone, it's probably not specific to it.

Jared Heinly (08:16)
Yep, yep, yeah. And you'll find these on other, exactly, you're exactly right. A lot of the things that Apple's doing is not necessarily unique to Apple. And it can be found on any modern mobile device or even some just standalone cameras, a DSLR or a point and shoot or a mirrorless

Jonathan Stephens (08:23)
Mm-hmm.

Yeah,

So, all right, so there's five types, five. Okay.

Jared Heinly (08:32)
yeah. And there may be more.

I'll think of them as we go along. But the first one is what you mentioned, portrait mode on an iPhone. The primary way that that was done is via stereo vision, stereo views, multi-view stereo. And so the intuition there is, in order to estimate the depth of something, well, you can use

Jonathan Stephens (08:52)
Mm-hmm.

Jared Heinly (09:00)
two eyes, two vantage points. So just like how our eyes are, you know, a certain distance apart: when I look at something out in the scene, depending on how far away it is, it's gonna move, there's parallax. If I close one eye, then close the other, I can see things slightly shift back and forth. Or if you move your head side to side, things that are closer to you move more than things that are far away. And so you can use these two vantage points to figure out how far away something is.

Jonathan Stephens (09:03)
Mm-hmm.

Jared Heinly (09:29)
And so one way that Apple's using depth maps is by using two cameras that are side by side. So usually like the wide angle and the telephoto, or the wide, the normal, and the ultra wide, to have two vantage points on the scene. Use that to estimate some depth, and that helps them figure out what is background versus what is foreground, and which part of the image should be blurred versus not blurred. So that's one way they do it. Now in portrait mode, Apple is also doing some other things. They do some more intelligent blending

Jonathan Stephens (09:39)
Mm-hmm.

Jared Heinly (09:58)
to enhance that blur, but it is driven sort of there by depth estimation. So you've got portrait mode. Another opportunity where Apple can use depth maps, again this would also still be for portrait mode, whether that's on your selfie camera or the outward-facing one, is via monocular depth estimation.

Jonathan Stephens (10:00)
Mm-hmm.

Jared Heinly (10:20)
So ideally, yes, if you've got two cameras side by side, I can actually do some depth estimation using that. But if all that I have is a single camera, I can use machine learning to entirely estimate the depth of the image. And what Apple's done there is they've trained this machine learning model on a bunch of images, a bunch of images of people with various backgrounds, and that machine learning model automatically predicts that this face is in front of

Jonathan Stephens (10:33)
Mm-hmm.

Jared Heinly (10:49)
whatever's right behind them, which then is in front of whatever's behind them, even further back. So that machine learning model will predict what the different depths are in the image. And then Apple can use that to automatically segment foreground from background, apply the blur appropriately.
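
For the two-camera route described first (two side-by-side views, parallax, foreground versus background), a classical baseline technique is block matching on a rectified stereo pair. This is a generic OpenCV sketch of that idea, not Apple's actual pipeline; the filenames, parameters, and the 70th-percentile foreground cutoff are placeholders.

```python
import cv2
import numpy as np

# Rectified stereo pair: corresponding points lie on the same image row.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching: for each pixel in the left image, slide a small window along the
# same row of the right image and record the horizontal shift (disparity) that
# matches best. Bigger disparity = more parallax = closer to the camera.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # OpenCV stores fixed-point disparities

# A crude "portrait" split: call the closest ~30% of valid pixels foreground.
valid = disparity > 0
foreground = disparity > np.percentile(disparity[valid], 70)
```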

Jonathan Stephens (10:50)
Mm-hmm.

They

talk about that, by the way; it now even recognizes my cat and my dog. They'll say it's an animal, you know, and apply specific filters. So I'm assuming that it's doing certain things. I notice that the fur on a photo of my extremely fluffy cats looks much better than it did on an iPhone camera from three models ago. And I'm assuming it's doing some sort of upscaling even

on that machine learning of the edges, because I don't think the depth maps are super high resolution always, right? They're not 20-megapixel depth maps. So it's probably doing something to get that. Because even the LiDAR on these phones, such as I have, because I have a Pro phone, isn't that high resolution, to like pull out strands of hair.

Jared Heinly (11:39)
Correct.

Yeah.

Yeah, yeah, absolutely. So you just touched on sort of the third, and then maybe even fourth, types of depth maps that Apple uses. So one, you mentioned LiDAR. So yeah, if you have an iPhone 12 Pro or newer, or some of the latest iPad Pros, they've got a LiDAR sensor in there. And that LiDAR sensor is sending out light; it's specifically a time-of-flight sensor. So it's measuring how long it takes those beams of light to shoot out and reflect back.

Jonathan Stephens (12:11)
Mm-hmm.

Jared Heinly (12:23)
There's crazy math and crazy hardware going on there. But that physical light that's being sent out is very low resolution. It's like 12 by 12, 24 by 24, a low-resolution grid of light, but then Apple's doing upsampling. So it takes those depths, combined with whatever that high-resolution imagery is, that 12-megapixel image from the camera, and can upsample it to that

resolution using machine learning, saying, well, hey, here's what it looks like at low-resolution depth, but here's all of the edges and the texture and the color and the patterns that it sees in the high-resolution photo. And so it figures out the best mapping and kind of fills in the details to give you that high-resolution depth map. So there are kind of two things there: you have the time-of-flight LiDAR sensor, plus this concept of depth upsampling. That's another way that Apple is using it.
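
The upsampling idea can be sketched as a joint bilateral filter: start from a plain resize of the coarse depth grid, then let the high-resolution photo decide where the depth edges belong. This is a generic illustration of the concept, not Apple's implementation; the guide image is assumed to be an 8-bit BGR photo and the filter parameters are made up.

```python
import numpy as np
import cv2

def joint_bilateral_upsample(depth_lr, image_hr, radius=4, sigma_spatial=2.0, sigma_color=10.0):
    """Upsample a coarse depth grid to image resolution, weighting neighbors by how
    similar they look in the high-resolution guide image."""
    h, w = image_hr.shape[:2]
    gray = cv2.cvtColor(image_hr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # Start from a plain bilinear upsample of the coarse depth.
    depth_up = cv2.resize(depth_lr.astype(np.float32), (w, h), interpolation=cv2.INTER_LINEAR)

    weighted_sum = np.zeros_like(depth_up)
    weight_total = np.zeros_like(depth_up)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted_depth = np.roll(depth_up, (dy, dx), axis=(0, 1))
            shifted_gray = np.roll(gray, (dy, dx), axis=(0, 1))
            # Spatial closeness times color similarity in the guide image.
            w_spatial = np.exp(-(dx * dx + dy * dy) / (2 * sigma_spatial ** 2))
            w_color = np.exp(-((gray - shifted_gray) ** 2) / (2 * sigma_color ** 2))
            weight = w_spatial * w_color
            weighted_sum += weight * shifted_depth
            weight_total += weight
    return weighted_sum / weight_total   # depth edges now follow color edges, not the coarse grid
```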

Jonathan Stephens (13:04)
and

I mean,

I remember when the iPhone 12 came out and they were talking about LiDAR, and a lot of people said, I don't know what that's even for. And then they were trying to say, like, even in low-light situations, you're able to grab focus in a very short amount of time, because it's probably able to sense the depth to the intended object and perfectly focus on that subject for you. Versus if it's dark and you've maybe got

Jared Heinly (13:26)
Yeah.

Jonathan Stephens (13:38)
multiple cameras on most of these phones now, still it's a little bit slower. You can't just shoot out an infrared light and say, okay, I know roughly where this is. Yeah.

Jared Heinly (13:49)
And then, you know, continuing to think about Apple and the work that they've done: we mentioned that LiDAR sensor and time of flight, but I will mention also the Face ID sensor, because that's also using depth. But there it's not using time of flight, it's using a structured light sensor. And so what happens there is, at the very top of the phone, you actually have a projector, and then you have a...

Jonathan Stephens (13:59)
Mm-hmm.

Jared Heinly (14:13)
a camera, and so that projector is sending out random dots, a random pattern of light. And then there's an offset from that, you know, a small baseline, and there's a camera that then looks at that light pattern and looks to see, how did those dots shift? And again, I mentioned before this concept of parallax: things that are closer are gonna move farther than things that are farther away. And so it looks at how those patterns of dots have shifted and warped and moved in the image.

And so it uses that to infer the depth of the scene, in this case, the depth of your face. And so it knows if it's your face or someone else's. And so that's another way. And even the first Xbox Kinect, a common off-the-shelf sensor, was using structured light. I think some of the newer hardware now has transitioned to time of flight. But structured light is a very common way to estimate depth in a scene, sending out this light pattern and

Jonathan Stephens (14:46)
Mm-hmm.

interesting.

Yeah.

Jared Heinly (15:11)
seeing where it is.

Jonathan Stephens (15:12)
Interesting. Yeah. And it feels like a lot of that Kinect sensor, a lot of that was to capture where a human was, so when they did movements, it could replicate that movement. And I feel like nowadays, with computer vision's machine learning component, a lot of the pose estimation, not camera poses, but human body poses, a lot of that's happening now because it's just built to detect

Jared Heinly (15:20)
Yeah.

Jonathan Stephens (15:37)
like the human body edges and predict where the human is. I see stuff of hands doing this and crazy movements, and it's able to track it pretty well even with occlusions. Some of this depth mapping might actually be replaced by machine learning. Instead of using sensors, we're just using lots of training, lots of training data.

Jared Heinly (15:48)
Yeah. Yeah. Yeah.

Yeah,

absolutely. And that's a great insight, because before, where we needed this high-accuracy 3D data, now we can infer it, just because the machine learning models have gotten so advanced that we can estimate body pose, like you said. That Kinect sensor, yeah, it was originally intended for games, so that people playing on an Xbox could use their body as the controller and play various games. At least at our university, a lot of researchers jumped on that and said, well, hey, I can use

Jonathan Stephens (16:12)
Mm-hmm.

Mm-hmm.

Jared Heinly (16:30)
an Xbox Kinect sensor for computer vision. It's a real cheap off-the-shelf depth sensor and they used that to enable many different applications and research projects.

Jonathan Stephens (16:39)
Mm-hmm. Okay,

so I have a camera that's not a Kinect sensor. Maybe it's an RGBD camera. It basically looks like a Kinect sensor in the fact that it's the size of, perhaps, a candy bar. And it's got a left and a right camera and then a high-resolution center camera. And as I'm filming, it's able to get me live depth maps and...

Jared Heinly (16:52)
Is this like the OAK?

Yeah.

Jonathan Stephens (17:02)
live 4K video footage. And so I'm guessing that these left and right cameras, which as far as I can tell are black and white, I don't think there's color coming through them, are doing the depth for me. But it also only goes out to, I think, 20 feet. And I say 20 feet in quotations because, when you get around 18, 20 feet, your depth data, I've noticed, is like jumping around. Why is that? Why is...

You know, like, what's that limiting factor for getting a depth map to only 20 feet on that camera? Like, is there just something there?

Jared Heinly (17:33)
Yeah, it entirely comes down to two things. One, the distance between those cameras, which in computer vision we call the baseline. So it's how far apart are those two cameras in your stereo vision system. So that's the one factor. And the second factor is the resolution of those cameras. How many megapixels, how many pixels are in those side cameras? A lot of times the biggest impact

Jonathan Stephens (17:41)
Mm-hmm.

Okay.

Jared Heinly (18:00)
on depth perception is that baseline. Being able to have cameras that are farther apart lets you see or perceive motion or depths that are farther away. And so there is a trade-off there. You have to match your depth sensor to your environment. If I'm trying to reconstruct the depths of a tabletop scene, where I'm dealing with at most six feet or two meters of depth, then yeah, having those cameras just a few centimeters apart is plenty sufficient.

Jonathan Stephens (18:09)
Mm-hmm.

Mm-hmm.

Jared Heinly (18:29)
But yeah, if I want to go out 20 feet, 100 feet, 1,000 feet, I need to have those cameras pretty far apart in order to actually see that parallax at distance.
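
That trade-off follows from the standard stereo relation depth = focal_length_in_pixels × baseline / disparity. The numbers below are made up rather than the specs of any real camera, but they show why a fixed one-pixel matching error costs less than a centimeter up close and tens of centimeters near the far end of the range, which is roughly the "jumping" Jonathan describes.

```python
focal_px = 800.0      # hypothetical focal length, in pixels
baseline_m = 0.15     # hypothetical ~6 inch baseline between the two cameras

def depth_from_disparity(disparity_px):
    return focal_px * baseline_m / disparity_px

for depth_m in [1.0, 3.0, 6.0]:
    disparity = focal_px * baseline_m / depth_m               # disparity observed at that depth
    error = depth_from_disparity(disparity - 1.0) - depth_m   # effect of a one-pixel mismatch
    print(f"{depth_m:.1f} m away: disparity {disparity:6.1f} px, "
          f"a 1 px error shifts the depth estimate by {error:.2f} m")
```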

Jonathan Stephens (18:36)
Yeah.

I want to say this

sensor has maybe six inches of spread between the cameras. So it gives you a lot more than an iPhone or smartphone camera spread has. And I know, even on an iPhone, I keep referencing an iPhone because that's what I have, but if you put it in portrait mode and someone steps too far away, it'll say you're too far away from the subject. And I'm guessing that's just because it's unable to, you know, that baseline can't support it.

Jared Heinly (18:49)
Yep. Yep.

Mm-hmm.

Jonathan Stephens (19:08)
So I'm beyond eight feet or whatever it is for that phone. And I know the LiDAR gives me more distance, but still there's a limit. It's like, sorry, you can't be 15 feet away. So.

Jared Heinly (19:09)
Yeah. Yeah.

Yeah, yeah.

To touch on that a little bit, there are actually different kinds of depth maps, and I'm going to dive into that for a sec. Up until this point, I think maybe I've been predominantly talking about what I'll call absolute depth maps, where every single pixel has a real-world value that's immediately interpretable. So for example, this pixel means there's a depth at

5.7 meters or 2.3 feet. There is a real world depth associated with it. And so that's what I would just call, that's absolute depth. It's immediately useful and immediately measurable. There's another style of depth map called, what I'll call a relative depth or a relative depth map, which means there are still depths, but I don't know the scale of those depths. So when I see a value, when I see the value 2.0,

Jonathan Stephens (19:51)
Mm-hmm.

Jared Heinly (20:16)
in my depth map. I don't know, was that two feet, two meters? It's unknown, because there's some unknown scale factor that can be applied to it. But what is important is that the ratio between depths is still preserved. So that if I see a 1.0 and a 2.0 on my depth map, I know that that 2.0 value is twice as far away as the 1.0. And so those ratios are preserved, but it's just that the overall scale, I don't know if I'm looking at a really small scene or a really big scene.

because there's that unknown scale. Now, why is that important? Yeah, yeah. And so it depends sometimes on the application. In order to get absolute depth, I need to know some properties typically about the camera, or I need to know some properties about

Jonathan Stephens (20:49)
Yeah, why would I care? Why? How does that help me if I didn't know that?

Jared Heinly (21:15)
that baseline. So again, in this sort of stereo vision setup, if I have two cameras observing a scene, if I know the focal length of those cameras and I know the baseline, i.e., the distance between them, then I can use that to compute the scale factor and figure out how far away things are in the scene. But if I don't know, if I had two cameras, I had two photos, but I didn't know how far apart those photos were in 3D space, then I have sort of unknown scale. I don't know how far apart the cameras are, therefore I don't know

how deep things actually are in the scene. So I can still solve some computer vision tasks with that. I could still ask and figure out, what is the foreground? What is the background? And I can do that using the relative values, but I couldn't precisely tell you, hey, is this object five meters away or 10 meters away? I couldn't tell you with that. But I can still do background blurring. I can still do some 3D reconstruction. It won't be scaled. It won't be in some sort of known unit.
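
One generic way to "upgrade" a relative depth map, if you do have even a handful of trusted absolute measurements to compare against (say, a few LiDAR or laser readings), is to solve for the single unknown scale factor by least squares. This is a sketch of that idea, not something tied to any specific product or pipeline:

```python
import numpy as np

def apply_absolute_scale(relative_depth, absolute_samples, mask):
    """Scale a relative depth map so it best matches known absolute depths.

    relative_depth:   HxW map with an unknown global scale
    absolute_samples: HxW map of trusted metric depths, valid only where mask is True
    """
    r = relative_depth[mask]
    a = absolute_samples[mask]
    scale = (r * a).sum() / (r * r).sum()   # least-squares fit of a ≈ scale * r
    return scale * relative_depth
```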

Jonathan Stephens (22:04)
Mm-hmm.

I can see that

how that's helpful. Perhaps, let's just start with 3D reconstruction. You could just say, hey, I want the foreground object. And if it knows the relative distance to something is two, three X further away, that's not the intended object, so it'll mask it. You could do a mask with that or whatever you're doing. You say, don't include that in my 3D reconstruction data. Okay, now I'm starting to see where that's helpful.

Jared Heinly (22:41)
Yeah. Yeah.

Jonathan Stephens (22:42)
Because

I was thinking, if you don't know how far away it is, what good is that?

Jared Heinly (22:46)
Yeah, and there are plenty of applications where you need absolute depth. Autonomous driving: I've got a pair of cameras in my car. If my autonomous driving is driven based on vision, it's really important for your car to know how far away is that car in front of me. And so having absolute depth there is important. But like you said, there's other applications where it's not as important. If I just want to figure out foreground versus background, I don't need the absolute depth there to do that.

Jonathan Stephens (22:50)
Mm-hmm.

Yeah.

Mm-hmm.

Jared Heinly (23:15)
To continue to riff on the relative depth: there's another similar representation that is called a disparity map. So instead of representing the depth of a pixel, it represents how much that pixel has shifted between those two photos. And that pixel shift is called disparity. So I've got two images side by side and I find a pixel in the left image and I ask,

hey, how far did it move? How much parallax occurred? That's the disparity, and I can store that as the pixel value: this shifted three pixels, this shifted five pixels. And so there, higher disparities mean that something is closer, again coming back to that concept of parallax, whereas low-disparity pixels, pixels that did not move much between those photos, those are your more distant objects. So that's another way to represent the relative depths. And again, if you know your focal length and your baseline, you can upgrade

those disparities into absolute depths. And maybe just one last kind of depth map is even more abstract. I'll call this the depth-ordering depth map, because maybe you don't actually care about the ratios. I don't need to know that my face is exactly three times closer to the camera than the background behind me. I just want to know that my face is closer than the background. And so there,

you can assign just numerical values to say, what's in front? You know, make that number one; what's behind it, make that number two; what's further back, make that three; and what's even farther back, make that number four. And so you can just have these sort of integer values telling you the relative ordering of the depths within the image. So that's another way to do that foreground/background segmentation.
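
That last, most abstract flavor is easy to sketch: collapse whatever depth values you have, absolute or relative, into a handful of front-to-back layers. The choice of four layers here is arbitrary.

```python
import numpy as np

def depth_ordering(depth, num_layers=4):
    """Turn a depth map into coarse ordinal layers: 1 = closest, num_layers = farthest."""
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_layers + 1)[1:-1])
    return np.digitize(depth, edges) + 1
```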

Jonathan Stephens (25:06)
I see that

in 3D Gaussian splatting, because again, the way that works is you've got a bunch of 3D Gaussians with some sort of semi-opaque appearance to them, and you have to order things. The way it renders is it renders them in order and then accumulates the color as they're stacking on each other in their appearance. You have to know the depth

Jared Heinly (25:23)
Yup.

Jonathan Stephens (25:35)
of the scene just to order those Gaussians. And so it's almost doing what you're talking about, where I just need to make sure these Gaussians are ordered in depth from the pixels that saw them in the images. And then there you go, you start to build your scene. And as you get more images, you know, they should all align in order. And I've noticed a lot of times in these scenes, you'll have something that should be way in the background that's really not that far away in the 3D Gaussian scene, because you're just mimicking appearance.

Jared Heinly (25:46)
Exactly.

Jonathan Stephens (26:05)
You don't care that there's a tree that is probably half a mile away when you can just have a Gaussian right behind the next thing that looks like a tree, because with the scale and size of the tree, they can mimic appearance without actually having to know the size in real-world coordinates.

Jared Heinly (26:25)
Yep, yep, exactly, exactly. And yeah, just touching on Gaussian splats and the graphics space: depth maps, what you just described there, needing to know the ordering of things in order to render them properly, that comes up in computer graphics, where a lot of times instead of depth map, you may hear the word Z-buffer or Z-ordering. You need to know the depth ordering of the objects so that when you draw them on the screen,

Jonathan Stephens (26:42)
Mm-hmm.

Mm-hmm.

Jared Heinly (26:53)
you draw them in the correct order, so that you get proper transparency effects, or so that some background object doesn't accidentally show up in front of your foreground object. And so yeah, there are a lot of applications for this too in the graphics space.
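
The Z-buffer idea itself is just "keep the nearest thing that lands on each pixel." A toy point rasterizer, for illustration only (it assumes the 3D samples have already been projected to pixel coordinates):

```python
import numpy as np

def rasterize_with_zbuffer(pixels, depths, colors, height, width):
    """Toy Z-buffer: keep only the nearest sample drawn at each pixel."""
    zbuffer = np.full((height, width), np.inf)
    image = np.zeros((height, width, 3), dtype=np.uint8)
    for (u, v), z, color in zip(pixels, depths, colors):
        if 0 <= v < height and 0 <= u < width and z < zbuffer[v, u]:
            zbuffer[v, u] = z        # this sample is closer than anything drawn there so far
            image[v, u] = color
    return image, zbuffer
```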

Jonathan Stephens (27:02)
Mm-hmm.

Yeah, just

back to talking about games, depth maps are really important, because I do know that one way to run some games on a lower-end GPU is to say, with that Z-buffer, what's the last thing in the order of things you should be rendering? Well, we could just apply a fog at one point, and then everything beyond that just ends up not being rendered. It's a point where you need to stop rendering objects in the scene.

You might have a massive level too, right? But it knows everything in the scene, in the game, so it's saying, okay, it's too far out, stop. Versus, I know in some video games, you can turn up your settings and all of a sudden you're seeing further and further, like that fog goes away or moves further out. I know that from playing video games with my son. A lot of times his little Nintendo Switch won't see nearly as far as the gaming PC I've got here.

So again, people are dealing with depths and they don't even know it, you know, games, photography, probably less so in self-driving cars, but it does make me think about my car. I have what's called ADAS, which is, you know, your advanced driver-assistance systems, where all of a sudden someone brakes in front of me. I know I probably have some sort of depth sensing

going on. I have this little thing in my windshield now on my newer car that will perceive when a car is coming at me too fast and it'll apply the brakes. Or, even newer, something I'd never seen before until I had my new car: I'll be sitting at a stoplight not paying attention, because my kids are asking me a question perhaps, and the car in front of me starts going, and it'll beep at me and say the leading car is now moving, you need to start moving too. And I'm guessing that's doing some sort of live depth map in front of me, saying, okay, there's a car object.

Jared Heinly (28:32)
Yeah.

Yeah.

Jonathan Stephens (28:54)
Is it still there? Is it moving away? Is it moving too close to you too fast? And so all those little safety and alert systems are all based on, I'm sure, depth.

Jared Heinly (28:58)
Yeah.

Yeah, absolutely. And there are different technologies: some manufacturers chose to go the radar route, you know, and measure that to estimate the depth of the vehicle in front of you. Other manufacturers use a camera system. I think there are some Subarus where you have two cameras at the top of the windshield, and that is a stereo pair of cameras that you can use to estimate a depth map, by comparing pixels between those two images and seeing how far those pixels have moved.

Jonathan Stephens (29:09)
Mm-hmm.

Mm-hmm.

Probably in

combination of all these things because I have like the sonar on the bumpers for when you're parking, which is probably some close range, but it also has a camera embedded in the windshield right behind the rear view mirror. And I'm pretty sure those are for more of your further distance things. Yeah, it's interesting what they're doing. And they probably don't need to be super high resolution. I don't know, but you're probably not.

Jared Heinly (29:35)
yep. Yep.

Jonathan Stephens (29:58)
getting 8K video feeds on these things, trying to sense the world.

Jared Heinly (30:02)
No,

no, because also then you end up with sort of compute power requirements. As you have more and more pixels, 4K, 8K, et cetera, you just need a lot of horsepower to process all those pixels. So yeah, that's where it comes back to that trade-off of, what do you want the resolution of your cameras to be versus how far apart do you want the cameras to be? And you can balance that to sort of meet your performance design requirements.

Jonathan Stephens (30:08)
Mm-hmm.

Mm-hmm.

Okay, so I've got a pretty good understanding. If you've hung on the episode this long and you're really into depth maps: there are different types of depth maps, different ways to represent them. Like, you know, there's absolute and there's relative, and you're saying, like, the Z-ordering or disparity maps. And then you can represent them volumetrically in a 3D fashion, like a point cloud, or just as a 2D image, right? And then

there are different ways to get them, right? So you can use cameras, you can use structured light sensors. It's all some sort of vision-based system, but something that's either sending out an active light, or just using machine learning, or pairs of cameras and triangulating. So we have all that going together, and then we're seeing it in photography and games and robotics. So it's kind of happening everywhere. That's where we're at so far in this episode. So that's quite a bit.

You're an expert in 3D reconstruction. So how are depth maps used, for example, in photogrammetry? Like, I wanna make a 3D model of something, and I know we're triangulating points based off of knowing where the camera is, but I know at one point in that pipeline, depth maps play a role in getting you a final, let's say, mesh or dense point cloud. How are they involved in that process?

Jared Heinly (31:50)
Yeah, absolutely. So you're exactly right. In a photogrammetry pipeline, typically you start out with what I'll call a sparse point cloud. So you've got multiple images, you've extracted 2D features from each of those images. And then once you've extracted those features from the images, you find matches to try to figure out which features in one image match to features in another image. And these are just 2D points, you know, so in an image that may have millions of pixels,

Jonathan Stephens (32:15)
Mm-hmm.

Jared Heinly (32:19)
I'm only extracting 10,000 of these 2D points per image. And so if there are 10,000 points for one image, I figure out which points they match to in the other image, and use that to triangulate points in the scene. So that sparse point cloud, yes, while it may represent the shape of the object or the shape of the scene, it's very incomplete. It has a lot of holes. I haven't used every single pixel

Jonathan Stephens (32:45)
Mm-hmm.

Jared Heinly (32:45)
to

estimate the scene. I've only used these 10,000 points per image. So where depth maps get used is when I want to create a denser representation of the scene. If I want to create a dense point cloud or some other mesh representation of the scene, a lot of times you'll do that via a depth map. And so a depth map will now change that sparse representation, where we only had 10,000 points per image.

Now I'm going to estimate, typically, a per-pixel depth. So for every single pixel in that image, which now could be millions, figure out what the depth is. And then once I have these millions of depths per image, now I can use those to triangulate or reconstruct perhaps a much denser point cloud, or I can use those depth maps plus some fusion or other algorithms to compute the mesh representation, or to feed a NeRF or a splat, a Gaussian splat,

but I can use that richer, dense depth information to construct a higher fidelity representation of the scene.
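
A rough sketch of that fusion step: once each image has an estimated pose (a rotation R and translation t, camera-to-world) and a per-pixel depth map, every valid pixel can be lifted into a shared world-space point cloud. The intrinsics and poses here are assumed to come from the earlier sparse stage; this is an illustration, not any particular photogrammetry package.

```python
import numpy as np

def fuse_depth_maps(depth_maps, intrinsics, poses):
    """Fuse per-image depth maps into one dense world-space point cloud.

    depth_maps: list of HxW arrays (0 where no depth was estimated)
    intrinsics: list of (fx, fy, cx, cy) tuples
    poses:      list of (R, t) camera-to-world transforms
    """
    world_points = []
    for depth, (fx, fy, cx, cy), (R, t) in zip(depth_maps, intrinsics, poses):
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        valid = depth > 0
        z = depth[valid]
        x = (u[valid] - cx) * z / fx
        y = (v[valid] - cy) * z / fy
        cam_points = np.stack([x, y, z], axis=-1)   # points in this camera's frame
        world_points.append(cam_points @ R.T + t)   # rotate/translate into the world frame
    return np.concatenate(world_points, axis=0)
```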

Jonathan Stephens (33:50)
Okay, so you're not using them right away, but then once you get to the point where you want perhaps the highest detail possible, they aid in figuring that all out.

Jared Heinly (34:00)
Typically, yes. Yeah, a lot of times you start with the images, you do the sparse features to get the camera poses, the camera poses and this sparse set of 3D points, and then depth maps come in at the end to fill in the details.

Jonathan Stephens (34:02)
Okay.

Okay.

Okay. And could you shortcut processes if you had depths up front? For example, I have an RGBD camera and it's a good camera. Does that help speed anything up, or does it not really matter? Or maybe, since you're not computing depth maps because you already have them, it's just speeding things up by not having to compute them.

Jared Heinly (34:35)
Yeah.

Yeah, I mean, the most obvious thing is that if you already have depth maps up front, well, then you don't need to spend time later computing them. I can just use those depth maps to immediately skip that step and say, great, let me go ahead and do my dense fusion or my mesh representation or whatever it may be. Having depth maps up front, like you said, with your RGBD camera, does open up some interesting possibilities for the type of 3D reconstruction you do initially.

Jonathan Stephens (34:50)
Mm-hmm.

Jared Heinly (35:06)
Before, I said in this photogrammetry pipeline, I have these images, I extract 2D features, I need to use these features to estimate camera poses. There's another line of work, typically I see this in the SLAM literature, where if I have a mobile device or a vehicle, a robot, whatever it is, that has a depth sensor on it and it's moving through the space, I can use that depth map to do localization and positioning. So instead of trying to align these 2D features,

I can align depth maps to each other directly to estimate how much the camera has moved between frames. And I can build up a dense representation of the scene as my map as I go along. So you can sort of skip the 2D feature extraction and use the depth map as your main source of alignment.
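
One classic way to do that depth-map-to-depth-map alignment is iterative closest point (ICP) on the back-projected points: repeatedly match each point to its nearest neighbor in the previous frame and solve in closed form for the rigid motion that best aligns the matches. A compact sketch, not tied to any particular SLAM system:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(source, target):
    """One ICP iteration: nearest-neighbor matching plus a closed-form rigid fit (Kabsch)."""
    _, idx = cKDTree(target).query(source)
    matched = target[idx]
    src_centered = source - source.mean(axis=0)
    dst_centered = matched - matched.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_centered.T @ dst_centered)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = matched.mean(axis=0) - R @ source.mean(axis=0)
    return R, t

def align_frames(source_points, target_points, iterations=20):
    """Estimate the rigid motion between two depth frames given as Nx3 point clouds."""
    R_total, t_total = np.eye(3), np.zeros(3)
    current = source_points.copy()
    for _ in range(iterations):
        R, t = icp_step(current, target_points)
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total             # how the camera moved between the two frames
```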

Jonathan Stephens (35:40)
interesting.

Mm-hmm.

Yeah, it makes me wonder, you have different types of SLAM sensors, or I don't know if they're hardware. I think the lowest end is just obviously camera-based, using inertial sensors, things like that, on top of that. But then I see, not to throw out a brand necessarily, but I think GeoSLAM has got this nice handheld scanner, which has got a LiDAR puck on there and like a 360 camera. So I'm wondering if it's using

the LiDAR for depths, because that moves really fast. It's spinning, it's getting you live depths in, like, less than a second. And then there are even the more advanced ones. I've had NavVis, their VLX, where it's like an exoskeleton you put on. It's got a scanner on top, a scanner on your chest, cameras in different locations. And so you're probably getting depth maps from multiple scanners, getting you more accuracy. It's doing a lot there, I'm sure.

Jared Heinly (36:30)
Yeah.

Yeah.

Yep.

Yeah, yeah. And then all these setups you described, where you have a depth sensor plus a camera or a 360 camera or multiple cameras, that just opens up the possibility for all sorts of different algorithms or fusion of information. Because you said SLAM: simultaneous localization and mapping. We touched on it in our last episode, but it's this concept of moving through a space, figuring out where you are, but then also mapping the environment.

Jonathan Stephens (37:07)
Mm-hmm.

Mm-hmm.

Jared Heinly (37:24)
There are so many different ways to do it when you have a depth sensor, a camera, or a combination of the two. So for instance, with that combination, maybe you're using the depth sensor to do some mapping, but when you want to identify, have I returned to a place that I've seen before, this loop closure, maybe the camera ends up being more reliable in that case, sometimes more so than the depth. So you can...

Jonathan Stephens (37:47)
interesting, yep.

Jared Heinly (37:51)
leverage the strengths and weaknesses of both sensors to help either handle the mapping task or the localization task or some of that alignment.

Jonathan Stephens (37:59)
Mm-hmm.

And so, talking about sensor fusion, I don't know why I say that in quotes, but at EveryPoint we've also used sensor fusion, when we used to have an app out there, and now we support Recon 3D and some other applications that use sensor fusion. That depth mapping that we get from the LiDAR on these Apple devices definitely is helping, but it's not the full scan. So

we're using photogrammetry to get those fine details, because the depth maps off the LiDAR on an iPhone are good for mixed reality applications, augmented reality, but they're not really meant to do, like, a full 3D scan, right? But then you'll run into an area where we're scanning, for example, a car or a room, where you have lots of areas where there's a definite surface there, but with

photogrammetry, you're not going to be able to compute a very good depth, because we don't have any unique features on big portions of surfaces. And so we're using an active scanner to get us a depth there, and using photogrammetry to fill in, you know, getting our fine details where you probably do have a lot of texture. And then you kind of get this really great, complete output. Is that a good description of what we've been doing?

Jared Heinly (39:14)
No,

exactly right. And you're exactly right there, because what we've done with the Apple LiDAR, combined with the imagery that it gives us, is trying to leverage the best of both worlds. Using the camera, the camera can see, you know, to infinity, and so we use those images to estimate the positions of the camera and get some initial sparse 3D model. We can even use that to estimate some depths. So we can use that to estimate a photogrammetry-based reconstruction of the scene.

Jonathan Stephens (39:26)
Mm-hmm.

Jared Heinly (39:43)
But there are times when LiDAR is better. You mentioned the case of the car; that car is shiny. That's incredibly difficult for image-based depth map estimation, because depending on the viewing angle, the appearance of that shiny surface, that reflective surface, looks different. And so that breaks a lot of the assumptions of image-based depth reconstruction. You also mentioned building interiors, a wall, just a white blank wall that's painted a solid color. That also is a really, really hard...

environment for image based 3D reconstruction because there it's relying on sort of the unique appearance of pixels and figuring out where those pixels in one image ended up in another image. And if all the pixels look white, it can't tell them apart. But that's where, because LiDAR is an active sensor, it's sending out that light measuring how long it took to come back. It's able to recover the depths of these challenging surfaces without any issue. But as we've said, the LiDAR, at least on the iPhone, it's low resolution.

Jonathan Stephens (40:20)
Mm-hmm.

Mm-hmm.

Jared Heinly (40:43)
It has a limited range, five meters or the newer phones can go out to seven meters. But again, if you're out in a big outdoor environment, five or seven meters, that's not enough. You're not going to get everything. And so there we fall back on the image-based reconstruction and combine them together to get that best result.

Jonathan Stephens (41:00)
Mm-hmm.

Yeah, interesting. All right. Well, just to kind of move on: okay, so now we've talked about how we're using them, and some sensor fusion, things like that, and how they speed up workflows or maybe enhance them, in the case of EveryPoint not so much speeding up the workflow as giving a more complete output. Where do you think this is going next? I mean, we've seen, since I started working at EveryPoint,

the number of sensors we have available just in our pocket to get depth maps has gone through the roof. I mean, I've got so many more cameras than my first iPhone, and my car's got sensors galore on it, and the prices have gone down. But what do you see in the future? Is it more sensors, cheaper? Or is it us finally just saying, like, screw the sensors, we're just gonna do this off machine learning? It's hard to predict the future, but it's always fun. Like, what's your prediction in the next, let's say,

five years and then 10 years.

Jared Heinly (42:02)
man, well, I don't know how to distinguish between those two,

but absolutely, all the trends you just touched on, I mean, we're gonna keep running at those and they're not going away. The proliferation of sensors is gonna keep going up. Everyone's got a phone, a camera in their pocket now. And the number of cameras that are installed in places, in stores or parking lots or cities, that's gonna keep increasing,

Jonathan Stephens (42:20)
Mm-hmm.

Jared Heinly (42:30)
likely, just because cameras are cheap and easy to install. And so you're going to have more sensors in this space. But I think the biggest gains are around the machine learning, like you touched on. Even how we mentioned with the Kinect sensor: in order to track the pose of a person, a human, where are their arms, where are their fingers, where are their legs? Before, we needed 3D hardware to measure it. Let's measure the depth of the scene, figure out

Jonathan Stephens (42:52)
Mm-hmm.

Jared Heinly (42:58)
where are all these body parts? Whereas now, just using an image, using a normal video plus machine learning, gets amazing results. So machine learning is going to keep getting better and better at understanding the scene. And I think that's going to be your biggest impact: just continued applications, continued improvements of machine learning to understand the real world and make sense of this visual representation, these images and video.

Jonathan Stephens (43:07)
Mm-hmm.

Yeah. Along those lines, we keep seeing compute power come down per dollar, if you want to think of it that way. And now I'm seeing these onboard compute systems for robotics. And they're getting, you know, you can get ones that can do machine learning for a couple hundred dollars, like in that Raspberry Pi price range, to

put these on there, and they can do pretty fast and heavy computing. I'm seeing that already start to translate. What used to be only the highest-end drones would have some sort of avoidance system, to now there's not a drone that DJI sells, except for maybe a toy drone, the Tello or something like that, that doesn't have some sort of sensor. Because the cost to

Jared Heinly (44:18)
Yeah.

Jonathan Stephens (44:21)
put a processor in that can handle that, and the cost of the sensors, they're all just kind of getting lower and lower and lower as we've made more and more of these. And so it's kind of like everything is all of a sudden getting smarter because we're able to do that. And it's using machine learning now, right? So perhaps it doesn't have to have as many, you don't have to put as many actual physical sensors on a drone, and it can do some of this with just, like, knowing where things were based off of where they were seen, I guess.

Jared Heinly (44:38)
Yeah.

Jonathan Stephens (44:50)
That's a good topic. We could make a whole episode out of it, but we saw a drone shared online this last week, and it had a LiDAR system on top and it was navigating at 50 miles an hour through a forest. I think it's creating, like Skydio does, an occupancy grid around it, like an area in space. Is that a depth map, you assume? Like, is that like a depth map

Jared Heinly (45:04)
Yeah.

Jonathan Stephens (45:19)
function or is that something entirely different?

Jared Heinly (45:22)
Yeah, again, it comes down to the specifics of that hardware. I think in that particular case, it might have been like a 360 LiDAR. And so they're getting depths from all directions, which could be hard to represent in a single depth map. So they could be using multiple depth maps, or they could be using a point cloud. But in some way, they're getting a representation of depth and then fusing that and converting that into that sort of 3D grid. You mentioned this occupancy grid, or a voxel grid, but some sort of 3D representation of the world.

Jonathan Stephens (45:36)
Mm-hmm.

Mm-hmm.

Jared Heinly (45:52)
But then, again, they're using depths; whatever that representation is, it could be depth maps, could be something else, but it's highly important. Yeah. And it depends, again, it comes down sometimes to what the specific sensor provides. You know, some LiDAR sensors

Jonathan Stephens (45:59)
Yeah, yeah, it's hard. Well, they're not going to tell you exactly, maybe unless they did it in a research paper, but

Jared Heinly (46:14)
may give you a depth map, because all of their sensing ability is in a particular direction, sort of mimicking the point of view of a camera, like the field of view of what a camera would see. But other LiDAR sensors, they're spinning around 360 degrees. That's not a normal camera. So they may store it in a depth map; they may not. It's different things. Coming back, though, to what you said before about what's coming in the next five to ten years:

Jonathan Stephens (46:22)
Mm-hmm.

Mm-hmm.

Okay.

Jared Heinly (46:43)
I'm also looking back five, ten years at just human creativity in finding ways to extract depth from images. There are so many different things that we have relied on or have discovered to help either develop algorithms, discover depth, or train machine learning models. You mentioned video games and fog before; that reminds me, there's a depth estimation technique

that relies on fog or haze in images in order to infer depth. Suppose you're looking, and this only really works on outdoor photos, but if you're in an outdoor environment where you can see miles off into the distance, kilometers off into the distance, there was a paper that leveraged what they called the dark channel prior, which said that in a haze-free environment, with no smoke, no noise, no haze, within a particular patch of pixels,

Jonathan Stephens (47:19)
Mm-hmm.

Jared Heinly (47:41)
at least one of those red, green, or blue channels should have a near-zero value, should be fully dark. So that pixel has a high saturation: red is bright and green is dark, or blue is bright and green is dark, or vice versa. But when you introduce haze, everything averages out to gray. And so you can measure that lack of saturation, you can measure the amount of grayness in the pixels, to estimate their depth. And so that was one

Jonathan Stephens (47:47)
Mm-hmm.

Jared Heinly (48:11)
creative way that I've seen people estimate the depth of an outdoor image without actually having a depth sensor. It was just, hey, from a single photo, let's estimate how much haze is in there, what the gray value of all these different pixels is, and use that to figure out how far away things are. So cool. And there was another, okay, in a completely different direction: if you have a smartphone. Now, this one has three cameras on it, but suppose it only had one camera.

Jonathan Stephens (48:23)
That's so creative.

Mm-hmm.

Jared Heinly (48:38)
When someone goes to take a photo or takes a short video, you can use the subtle hand shake from the person, even though they're not moving very far, just them moving ever so slightly. And if you record even just a few seconds of data, so let's say you took a three-second video of someone trying to hold their phone still, but there's still that subtle motion, you can use just the redundancy of all of those video frames, that three seconds' worth of data,

and some very careful math to estimate then the depths of those pixels or the dominant things in the image just on that really, really, really subtle motion. That was something else I thought was pretty cool.
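
Going back to the haze trick for a moment, a rough sketch of the dark-channel-prior idea (after He et al.) looks like the following. The patch size, the omega factor, and the scattering coefficient beta are all unknowns in practice, so the output is a relative depth at best; this is an illustration of the concept, not the paper's full method.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def depth_from_haze(image, patch=15, omega=0.95, beta=1.0):
    """Relative depth from haze: grayer, less saturated patches are treated as farther away.

    image: float RGB array in [0, 1] of an outdoor, hazy scene.
    """
    # Dark channel: minimum over the color channels, then minimum over a local patch.
    dark = minimum_filter(image.min(axis=2), size=patch)
    # Atmospheric light: average color of the haziest-looking pixels (a common approximation).
    haziest = np.argsort(dark.ravel())[-100:]
    A = image.reshape(-1, 3)[haziest].mean(axis=0)
    # Transmission drops as more haze accumulates along the viewing ray.
    dark_normalized = minimum_filter((image / A).min(axis=2), size=patch)
    transmission = 1.0 - omega * dark_normalized
    # Haze model: transmission = exp(-beta * depth), so depth is proportional to -log(transmission).
    return -np.log(np.clip(transmission, 1e-3, 1.0)) / beta
```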

Jonathan Stephens (49:16)
Mm-hmm.

I mean, again, I know Apple does that, and Google Pixel phones, all of them do, but you take a photo and I think people think they're taking, snap, one photo, one very small instant in time. Well, if you take a Live Photo on an iPhone, you're actually recording like several seconds, I think, but even when you don't have Live Photo on, I am certain it's actually taking several different photos

and doing a bunch of different averaging and using a combination of properties from all those photos. It's like what you're talking about. You don't even know it's happening, but they're using the data from movement and all kinds of things that probably enhance it.

Jared Heinly (49:44)
Absolutely.

Yeah, that's it.

Computational photography, there

is so much that goes into taking a single photo. There's so much in software that you can do to make that single photo look amazing. Like what you just said, it's not just that single photo. It's a bunch of photos, whether it's to do high dynamic range, HDR, whether it's to do the motion analysis to figure out what's foreground, what's background, getting the focus right. You mentioned before focus stacking, focus racking, where...

you can estimate depth just by changing the focus of the camera, and with those different focus distances, figure out at what depth, or at what point in my focus, did that pixel become sharpest. And you can use that then to get a depth ordering. It may not tell you absolute depth, but you can get that relative depth, that depth ordering, to say, hey, well, this pixel was sharp when I focused close, versus this pixel was sharp when I focused far.
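
The focus idea can be sketched as "depth from focus": take a stack of frames focused at different distances, score per-pixel sharpness in each, and record which focus setting wins at each pixel. That gives a depth ordering rather than metric depth; the Laplacian sharpness measure below is just one reasonable choice, and the frames are assumed to be aligned grayscale images ordered from near focus to far focus.

```python
import numpy as np
import cv2

def depth_ordering_from_focus(focus_stack):
    """Per pixel, return the index of the frame in which that pixel was sharpest
    (0 = sharp when focused near, higher = sharp when focused farther away)."""
    sharpness = [np.abs(cv2.Laplacian(frame, cv2.CV_64F)) for frame in focus_stack]
    return np.argmax(np.stack(sharpness, axis=0), axis=0)
```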

Jonathan Stephens (50:36)
Mm-hmm.

Mm-hmm.

Yeah,

what's interesting, too, is how quickly that all moves and how few sensors you need. Because I remember, several years ago, the Lytro camera came out, and it was like a camera that had 20 lenses on it. I don't know exactly how it worked, but they're all at slightly different angles and stuff. And you take a photo and then it's able to do a bunch of computational photography so you could then change it. The claim was, you take one photo, you can focus it later.

Jared Heinly (51:00)
Mmm, yep.

Jonathan Stephens (51:18)
Right? And it's like, you can do all that stuff, but fast forward to 2025, or even before that, and you have two or three lenses on an iPhone or an Android, and you're doing all that with a whole lot fewer lenses. They've found a bunch of ways to not have to use 20 lenses on one camera. It looked really cool, but it also fairly quickly became obsolete.

Jared Heinly (51:18)
Yeah, yep.

Yep. Yep.

Yeah. And, but,

and again, it is trade-offs. I mean, physically having 20 lenses or physically having 20 sensors is, you know, physically better. You know, if I need perfect measurements, or I need true hardware measurements from all of these different perspectives, nothing's going to beat actually having that hardware. But like we just said, with the trends of machine learning, machine learning is getting better and better and better, you know, and now machine learning

Jonathan Stephens (51:52)
Mm-hmm.

Mm-hmm.

Jared Heinly (52:13)
can just predict what it would have looked like to have those 20 sensors or those 20 lenses. It can recompose your photo for you. It can refocus it. It can do image inpainting. It can modify the subject of your scene just using computational photography and machine learning.

Jonathan Stephens (52:16)
Mm-hmm. Yeah.

Yeah. I mean, that's a

good note to, like, end on. I remember the first time I ever did light field. I don't know if you've ever done a light field experience on, like, a really nice headset, a VR headset. I did one where you were, like, in the Sistine Chapel or something. You move your head left and right and the light would trace through the columns and the stained glass and all that. And it felt like you were there. Then you watch the documentary on how they captured these demo scenes, and they...

They didn't do any tricks. The trick was they had lots of cameras all taking images in sync; it was basically a camera array, and it gave you this really cool effect. But I'm sure if you did that with machine learning, you probably still wouldn't be hitting quite the peak as if you had a bunch of these high-end film studio cameras in an array. Then you have the least amount of, I don't know, guessing going on,

because it's hard data. And also, InstantSplat. Yeah. Well, I've been playing with InstantSplat, which is making a Gaussian splat from three, or I think you can even use two, images. And so you can kind of move through space and it makes a Gaussian splat of a scene with just a couple of images. And everyone's like, what's so big about that? It's like, well, you can have two pretty-far-apart images and it will just fill in what it's

Jared Heinly (53:28)
Yep. Yeah, nothing beats data, but machine learning is getting better.

Jonathan Stephens (53:57)
supposed to look like as you transition between those two. And in normal, original Gaussian splatting, it would look pretty poor between your two ground-truth images. What this is doing is just saying, hey, we're gonna predict what it looks like at each camera spot as you move along. And it's not perfect. If you look at it, you can always pick out little problems here and there where it would have been better if I had just gotten more images, but it's amazing what they're able to do. And I think it runs in like 60 seconds as well.

You know, we're doing pretty well at predicting scenes between ground truth nowadays. It's fascinating. So, all right, well, is there anything else you want to add, anything else on this topic?

Jared Heinly (54:34)
Yeah, it is, it is.

That's all I got.

That's all I got. Depth is such an important thing. We've just touched on it, you know, in terms of graphics, in terms of computer vision, or just understanding the real world. And yeah, we can get at it with hardware and physical camera setups, but machine learning is getting better and better. So it's an exciting, exciting time.

Jonathan Stephens (54:51)
Mm-hmm.

If

for those few listeners who may be deciding to go and pursue a career in computer vision, is this something that you learn in, like, your computer vision 101 classes?

Jared Heinly (55:14)
Yeah, so, like, when I was in grad school, I had to take an intro to computer vision, you know, during my first year there. And yes, depth estimation is one of the topics in there. So you would learn about a photogrammetry pipeline, learn about structure from motion, learn about SLAM. And then you would learn about various ways to estimate depth. And there are so many more that I haven't even touched on. I haven't touched on shape from, say, shading, or photometric stereo. There are so many other ways that people have invented

Jonathan Stephens (55:20)
Mm-hmm.

Jared Heinly (55:44)
to figure out the depth of scenes and really complex objects that could be hard to measure otherwise. So it's a big field.

Jonathan Stephens (55:51)
We

might need to find an expert in one of these more nuanced methods and bring them on. It would be interesting to hear what's the latest in research on this. We can go deep then in a follow-up episode. These are important. You get better photography, it keeps your cars from crashing, at least the newer cars.

It keeps your drones from crashing, all sorts of different things you can do with it, and it makes your games look better. Who doesn't mind that? I mean, because there are virtual depth maps even if it's not a real environment, you've still got a depth map; it's probably going on within a game. So anyways, well, this has been a very insightful episode for me. I, again, have been working with depth maps for a long time and I'm still always learning about it. So that's why we have this podcast, so people can learn and

Jared Heinly (56:21)
That's the same, it keeps your games running. Yep.

Jonathan Stephens (56:42)
keep up and learn about the more nuanced things beyond just, I know I take images and I can make 3D things or calculate poses of humans. So thanks for your time here, Jared. And if you're listening to this podcast, you can subscribe to us on any major podcast player, or again, we'll be posting this on our YouTube channel, the EveryPoint YouTube channel, for the Computer Vision Decoded podcast. And I'd love for you guys to subscribe. And as always, comment if you found this insightful,

or if there's something that we touched on you want us to go deeper in, we did get a couple of good comments on our last episode and I'm always excited because then we will definitely think about what's next for you guys and what you're interested in. So thanks a lot Jared and we'll see you guys in the next episode.