Computer Vision Decoded

In this episode of Computer Vision Decoded, we sit down with Jared Heinly, Chief Scientist at EveryPoint, to discuss announcements made last week at Apple’s WWDC conference and find out what they mean for the computer vision community.

Show Notes

In this inaugural episode of Computer Vision Decoded we dive into the recent announcements at WWDC 2022 and find out what they mean for the computer vision community. We talk about what Apple is doing with their new RoomPlan API and how computer vision scientists can leverage it for better experiences. We also cover the enhancements to video and photo capture during an active ARKit Session.

00:00 - Introduction
00:25 - Meet Jared Heinly
02:10 - RoomPlan API
06:23 - Higher Resolution Video with ARKit
09:17 - The importance of pixel size and density
13:13 - Copy and Paste Objects from Photos
16:47 - CVPR Conference Overview

Follow Jared Heinly on Twitter
Follow Jonathan Stephens on Twitter

Learn about RoomPlan API Overview
Learn about ARKit 6 Highlights
CVPR Conference

Episode sponsored by: EveryPoint

Creators & Guests

Host
Jared Heinly
Chief Scientist at @EveryPointIO | 3D computer vision researcher (PhD) and engineer
Host
Jonathan Stephens
Chief Evangelist at @EveryPointIO | Neural Radiance Fields (NeRF) | Industry 4.0

What is Computer Vision Decoded?

A tidal wave of computer vision innovation is quickly having an impact on everyone's lives, but not everyone has the time to sit down and read through a bunch of news articles and learn what it means for them. In Computer Vision Decoded, we sit down with Jared Heinly, the Chief Scientist at EveryPoint, to discuss topics in today’s quickly evolving world of computer vision and decode what they mean for you. If you want to be sure you understand everything happening in the world of computer vision, don't miss an episode!

Welcome everyone to our first episode of Computer Vision Decoded, where we sit down with Jared Heinly, the Chief Scientist at EveryPoint, to discuss topics in today's quickly evolving world of computer vision. In today's episode, we're gonna cover announcements made last week at Apple's WWDC conference and find out what they mean for the computer vision community. But first, let's welcome Jared Heinly to the show. Jared, please tell us about yourself.

Yeah, thanks Jonathan. Happy to be here. I'm the Chief Scientist here at EveryPoint. My specialty is in computer vision. EveryPoint is a computer vision company, and as Chief Scientist, I lead and direct the computer vision efforts here. As for my background, I've been doing computer vision for 12 years now. I did my PhD at UNC Chapel Hill. There I focused on large-scale 3D reconstruction from crowdsourced photo collections. What that means is I would scrape the internet; I would write Python scripts to scrape the internet for publicly available photos of tourist locations. I'd say, okay, let's go to Berlin, or let's go to Paris, or let's go to even the entire world, and try to download as many photos as possible.

So millions or hundreds of millions of photos, and then I would generate 3D reconstructions from that. A lot of my work dealt with how you do really large-scale 3D computer vision. How can you handle not just tens or hundreds of photos, but millions, tens of millions, hundreds of millions of photos? How do you scale up algorithms to that magnitude? Then, when you're dealing with such large amounts of data, there are all the different edge cases you run into, where it's not just photos from a lab setting; you've got photos out in the real world that are blurry, confusing, weird, taken with weird cameras from weird angles and in weird conditions. How do you write algorithms that can be robust to all of that diversity? So I've taken that knowledge of large-scale computer vision in the real world and I'm now applying it here at EveryPoint.

Interesting. So you're an expert at taking images and creating 3D reconstructions, but not in a lab setting or with a perfect theoretical dataset. That's really interesting, and that's what we've always found with these camera tools in our pockets like the iPhone. They're great, but usually things are messy: you have bad lighting, things like that. So that dovetails into what we want to talk about today, Apple's WWDC conference. They had a couple of announcements I specifically wanted to talk about with you in regards to ARKit and some of its evolving functionality, because those ARKit announcements have always helped us give tools to our users as time has gone on. The first one I want to jump into is called RoomPlan, and I can give a quick overview for everyone who has not heard about RoomPlan or followed us. Basically, it allows you to scan a room from a stationary position, or you can walk around, and get a planimetric view of that room.

As you can see in this video I have up here, you're creating this floor plan, and it's actually even detecting couches, tables, a fireplace, things like that, and creating this great interactive augmented model. So for you as a computer vision scientist, how do you see this tool helping users scan environments? What was your reaction when you first saw it, and what does it mean for you, especially for capturing things in the wild where everything's not always perfect?

The first thing I thought when I saw that was just that it makes total sense. As Apple has been working on ARKit and augmented reality experiences in general, it keeps getting more advanced. We keep moving up the stack of complexity and abstraction. What I mean by that is when ARKit first came out, its job was to say, for every video frame as the user is moving around, figure out what the position of the camera is, what the pose of that camera is as it's moving through the space, and that's it. It's understanding some sparse points in the scene and the camera position, so that's the basic first level of understanding. Eventually it starts adding in plane detection.

There are some points in the scene that all look flat; maybe that's a table, maybe that's a wall. So you start understanding these individual geometric primitives in the scene. With the introduction of the LiDAR sensor, Apple also introduced some semantic understanding: being able to say, "this looks like a chair, this looks like a wall, this looks like a door". There were something like six different classes initially, a limited subset of semantic categories of objects that it could detect. What this now does is bring all of that together into one cohesive experience, saying not only can we identify these flat surfaces in the scene, but we understand how they work together. For example: "These are two walls that join at a corner. This is a door that's inside of a wall."

You've combined both the geometry and the semantics of that environment into a single representation. What I'm excited about, how this is useful for computer vision applications, is that it's guiding the user to create that experience. One of the things that's hard when you are in a constrained environment is that it's really easy to think "oh yeah, I've scanned everything in the room" without realizing that I missed the table in the corner, I missed this thing on the side, or I didn't actually scan underneath the table in the middle of the room. So having that quick and easy way to build up a floor plan enables additional follow-on applications: "now I've got a rough map of what this space looks like, let me use that to help guide users in that space".
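For developers curious what that workflow looks like in code, here is a minimal sketch using the session-level RoomPlan API Apple introduced at WWDC22. The type and delegate names follow Apple's RoomPlan documentation, but treat this as an illustration under those assumptions, not production code; verify the exact signatures against the current SDK.

```swift
// A minimal sketch (not production code) of driving a RoomPlan capture with the
// session-level API that shipped with iOS 16. Double-check signatures against the SDK.
import UIKit
import RoomPlan

final class RoomScanController: UIViewController, RoomCaptureSessionDelegate {
    private let captureSession = RoomCaptureSession()

    override func viewDidAppear(_ animated: Bool) {
        super.viewDidAppear(animated)
        captureSession.delegate = self
        // Start scanning; RoomPlan builds the parametric room model as the user walks around.
        captureSession.run(configuration: RoomCaptureSession.Configuration())
    }

    func finishScan() {
        captureSession.stop()
    }

    // Called as RoomPlan's live understanding of the room improves.
    func captureSession(_ session: RoomCaptureSession, didUpdate room: CapturedRoom) {
        // Geometry (walls, doors, windows) and semantics (tables, sofas, ...) arrive together.
        let tables = room.objects.filter { $0.category == .table }.count
        print("walls: \(room.walls.count), doors: \(room.doors.count), tables: \(tables)")
    }

    // Called when scanning ends; the raw data can be post-processed into a final model,
    // or used as the rough map that guides a follow-on capture of the same space.
    func captureSession(_ session: RoomCaptureSession, didEndWith data: CapturedRoomData, error: Error?) {
        if let error { print("capture ended with error: \(error)") }
    }
}
```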

That makes sense. That matches what we've personally experienced working with a lot of users at EveryPoint. It's really easy to tell someone how to walk around an object; if they want to scan a toy figurine on their table, it doesn't take a lot of effort and know-how. But like you said, scanning a room is a lot more complex, so how can we quickly guide them through that process? That actually brings up the second topic I want to talk about. Now we've got this great room scan, we've built this room, and you can better guide a user, perhaps in subsequent scans of that room, because now we know something about the space we're in. Apple also announced the ability to take higher resolution video while running ARKit. Jared, can you tell us what the limitation (on the iPhone) has been in the past versus what they're doing here now?

Oh yeah, in my role I say I'm both engineer and scientist. The scientist side of me is excited about RoomPlan; the engineer side of me is excited about the enhancements to the camera. Previously, whenever you were using ARKit to run an ARKit session on device, it locked things down. Apple, the iOS operating system, locks down that camera, so I'm stuck using roughly 1080p-class video; it's 1920 by 1440. You have essentially no control over the camera settings, you can't use HDR video, and you can't take photos. It's locked into this video experience. That's what it used to be.

Now Apple has unlocked that camera. When I'm running an ARKit session, that ARKit session can use 4K video, it can use HDR video, and you can even take still photos while that video session is running. It's really opened the door for developers to take greater control over that video capture session. Now why is that useful? That's useful if, as a developer, you know something about the experience or the application that you're working on. For instance, if I know that I'm going to be trying to scan objects that are really close to the camera, maybe I want to take control of the camera's focus and lock that focus at a near distance so that the camera isn't focus-hunting near, then far, then near, then far. I can fine-tune my application to meet the demands of what I'm trying to do. Same thing with the camera exposure: if I know I'm going to be in a low-light environment, or I'm scanning dark objects, I might want to overexpose, set the exposure higher than I would by default, in order to get better exposure control over the objects that I'm interested in.
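Here is a rough sketch of what those ARKit 6 camera options look like in practice: opting into the 4K and HDR video formats, locking focus on the underlying capture device, and grabbing a high-resolution still mid-session. The API names are taken from Apple's ARKit 6 documentation, but the lens position value is purely illustrative and the whole thing should be treated as a sketch, not a definitive implementation.

```swift
// A rough sketch of the ARKit 6 camera enhancements discussed above (iOS 16+).
// Verify names and availability against the current SDK before relying on them.
import ARKit
import AVFoundation

func configureSession(_ session: ARSession) {
    let configuration = ARWorldTrackingConfiguration()

    // Opt into the 4K video format where the device supports it.
    if let fourKFormat = ARWorldTrackingConfiguration.recommendedVideoFormatFor4KResolution {
        configuration.videoFormat = fourKFormat
    }

    // Allow HDR video when the chosen format supports it.
    if configuration.videoFormat.isVideoHDRSupported {
        configuration.videoHDRAllowed = true
    }

    session.run(configuration)

    // Take direct control of the camera hardware while ARKit keeps tracking,
    // e.g. to lock focus at a near distance for close-up object scanning.
    if let device = ARWorldTrackingConfiguration.configurableCaptureDeviceForPrimaryCamera {
        do {
            try device.lockForConfiguration()
            device.setFocusModeLocked(lensPosition: 0.8) { _ in }  // illustrative lens position
            device.unlockForConfiguration()
        } catch {
            print("Could not lock camera configuration: \(error)")
        }
    }
}

// A still photo can also be captured mid-session at a higher resolution than the video feed.
func capturePhoto(from session: ARSession) {
    session.captureHighResolutionFrame { frame, _ in
        if let frame {
            print("Captured still frame at \(frame.camera.imageResolution)")
        }
    }
}
```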

Interesting. In all the time I've interacted with computer vision and photogrammetry in general, I've also learned from you that higher resolution imagery doesn't always equate to better imagery. So they are going from 1920 by 1440 to a much higher resolution, and they did touch on in the conference session that in the past they were doing pixel binning to get you more information and better performance in darker scenes, things like that. And so I've always had that question: if I just get a camera with more megapixels, will I just get better results? Can you touch on that? While this is going to give you more options, do we always need more megapixels? Is there more to it than just how many pixels we have?

Yeah, that's a great point. So yes, more megapixels can mean better results, but it really depends. With video especially, one of the problems that I frequently run into is the quality of those pixels. A lot of times that comes down to motion blur. If I'm indoors in medium or dim lighting, yes, the overall video looks crisp, but at the pixel level there is a bit of motion blur; the color has been blurred or smeared from one pixel to its neighbors. And so when I go from that 1080p up to 4K, as opposed to that blur being one or two pixels, now that blur is three or four pixels. So the value of an individual pixel isn't any greater; it's actually less, because that color has now been blurred across multiple pixels as opposed to a single one.

So with motion blur, more pixels just kind of means more blur, because it's just amplified. The other thing you talked about there, the pixel binning, is also important. With that lower resolution video, Apple is able to take multiple pixels on the sensor and average together the intensity, the photons as they come in and hit that area on the sensor, to output a single color for that pixel. And what that gives you is greater robustness to noise, so lower noise in those dim, low-light environments. Moving to 4K, your options for pixel binning are reduced or removed entirely, and so now you're going to see a greater amount of noise, because the area that can capture those photons is reduced. That higher noise in the pixels makes them less useful. So for me, sometimes even when I'm capturing 4K video, I'm going to downsample it by a lot to try to simulate pixel binning, because that higher number of pixels doesn't actually give me greater quality.
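To make the "downsample to simulate binning" idea concrete, here is a tiny, hypothetical helper that averages 2x2 blocks of a grayscale buffer into one output pixel. In practice you would use vImage or Core Image on the real camera buffers; this just shows the averaging itself and why it suppresses per-pixel noise.

```swift
// Self-contained illustration of 2x2 "binning" by averaging: each output pixel is the
// mean of four source pixels, trading resolution for lower noise.
func binned2x2(_ pixels: [Float], width: Int, height: Int) -> [Float] {
    precondition(pixels.count == width * height, "buffer size must match dimensions")
    let outWidth = width / 2
    let outHeight = height / 2
    var out = [Float](repeating: 0, count: outWidth * outHeight)

    for y in 0..<outHeight {
        for x in 0..<outWidth {
            // Sum the four source pixels that map to this output pixel.
            let x0 = 2 * x, y0 = 2 * y
            let sum = pixels[y0 * width + x0]
                    + pixels[y0 * width + x0 + 1]
                    + pixels[(y0 + 1) * width + x0]
                    + pixels[(y0 + 1) * width + x0 + 1]
            // Averaging four samples roughly halves the standard deviation of
            // independent sensor noise, at the cost of spatial resolution.
            out[y * outWidth + x] = sum / 4
        }
    }
    return out
}
```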

That makes sense. We're seeing this in the non-Apple world as well; manufacturers, for example Samsung with their 108 megapixel cameras. I have yet to actually buy one and try it and see what it can do for us. But what I hear from most people is they actually don't want that; what they want is the pixel-binned output at a much lower resolution. For one, no one wants the file size of a 108 megapixel photo, but also the noise goes away, or not necessarily goes away, but you get a much cleaner, crisper photo by averaging out those pixels. So I find that very interesting. I always like to remind people too: more pixels isn't always better. In fact, I've noticed the size of the pixels makes a difference. People say, "well, with more pixels I can capture things that are further away", but should you instead just be getting closer to the object in your capture?

So there's always that. Thank you for touching on those capabilities. The only other thing I thought was interesting, and that might overlap with computer vision, is the new copy and paste feature that people are loving, where you can copy an object, especially on an iPad, and then paste it into another photo or onto a black backdrop. I've played around with it; it's pretty amazing. I'm pretty sure that's all machine learning happening in the background, with lots of training to figure out what certain objects are, because I noticed it does a really good job with people, pets, things like that, and not so good on other things. And I'm guessing there's some depth being built into that image, either from the LiDAR sensors or just from the image itself that they're able to extract depth from. I know in photogrammetry a big technique people use is masking their photos: if you just want the result of an object, you can mask out the photo. I can see people using this for that now. Is that something you think people could use as auto masking, where you could get a better quality output by not having the background to deal with?

Yeah, that's definitely helpful. As you said, when you're trying to do a reconstruction of a single object, there are times when that background can hurt the reconstruction. The biggest case I see a lot of the time is turntable reconstruction. If I have an object on a turntable, the object is rotating, but the rest of the scene behind it remains static. That's going to completely confuse photogrammetry, because photogrammetry assumes that the entire scene is static, but now I've got one moving object in relation to a static scene. In that case, if I mask the background, even though the object is spinning, photogrammetry thinks that the object is static and that it's the camera that's actually moving around the object. So that's where I've seen masking be really, really helpful a lot of the time.

It's also helpful if I'm in a big scene and only care about a small portion of it. If I mask out the background, that can really help limit the computation time to the areas that I'm interested in. There has been a lot of literature and work on foreground-background segmentation, like you mentioned. There are techniques that can do it purely from the image alone, and it gets a lot easier, or better, when you have depth information to separate foreground depths from background depths, and you can use the color information as well to help figure out the interesting part of the scene. So while those works have been out there, Apple providing it out of the box just lowers the barrier to entry. Now a greater number of developers can get that foreground information for free, which makes masking in photogrammetry applications a lot easier.
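One concrete route to this kind of auto-masking, sketched below, is Vision's person segmentation request (available since iOS 15), which produces a per-pixel foreground mask you could apply to photos before feeding them to a photogrammetry pipeline. This covers the "people" case only and is not Apple's subject-lifting feature itself; treat it as an assumption-laden sketch rather than the method discussed in the episode.

```swift
// Sketch: generate a person mask with Vision and black out the background before
// running structure-from-motion / photogrammetry on the image.
import Vision
import CoreImage
import CoreVideo

func foregroundMask(for image: CGImage) throws -> CVPixelBuffer? {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate                          // favor mask quality over speed
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // Single-channel mask: ~255 on the subject, ~0 on the background.
    return request.results?.first?.pixelBuffer
}

func maskedImage(_ image: CIImage, mask: CVPixelBuffer) -> CIImage {
    var maskImage = CIImage(cvPixelBuffer: mask)
    // Scale the mask up to the source resolution before compositing.
    let scaleX = image.extent.width / maskImage.extent.width
    let scaleY = image.extent.height / maskImage.extent.height
    maskImage = maskImage.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

    // Keep foreground pixels, replace background with black (could also export as alpha).
    return image.applyingFilter("CIBlendWithMask", parameters: [
        kCIInputMaskImageKey: maskImage,
        kCIInputBackgroundImageKey: CIImage(color: .black).cropped(to: image.extent)
    ])
}
```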

Interesting. Yeah, that's one thing Apple has been doing: just giving us free tools. Every year you get new hardware, you get new software, and just trying to get the maximum that you can out of those has been quite an interesting journey we've been on. They tend to lead on making things easy for us to develop. That's all I planned to cover today on WWDC. There's quite a bit from that conference that was not talked about here that I'm sure could dovetail into computer vision. I know they announced new hardware, the M2 for the MacBook Air, and those sorts of enhancements. So if you're watching this show and want to learn more, you can always do your own research; there are a ton of videos you can watch and tons of developer sessions.

Speaking of conferences, our next episode will be two weeks from now; we're gonna do one of these episodes roughly every couple of weeks. And it sounds like you're going to CVPR. Can you give everyone a quick brief of what that is? Most people I talk to who aren't computer vision scientists don't know it exists, what it is, or even what CVPR stands for. If you could give us a quick rundown of what you'll be doing and why you're going, then that'll be what we talk about in our next episode.

Yeah, definitely. So it's the Conference on Computer Vision and Pattern Recognition; that's the CVPR part of it. CVPR is the premier computer vision conference, and it's held annually. Every summer it's hosted in the United States, and it moves around the country. It's the top conference on computer vision, and it's been growing like crazy the past few years. When I first started going, there were a thousand, two thousand people; the last time they had it in person, which was two or three years ago I think, there were something like 10,000 people. It's just a really big conference. What I'm excited about is that it's a great place to get in touch with the state of the art of computer vision. I not only get to see my own specialty, 3D computer vision, where I get to talk with PhD students, other researchers, academics, and people in industry who are working on these kinds of problems.

We get to talk about the latest ideas and what the state of the art is, and I also get to see the state of the art in other computer vision areas, things that aren't necessarily my forte. I get to hear and learn from the top of the top about what they're working on, find new ideas, and figure out how I can apply an idea I hadn't even heard of before to help solve a problem that I've been thinking about for what I'm working on. So it's just a great place to connect with other people, share ideas, and learn what's current in computer vision.

So that's what we'll cover in two weeks: the top highlights that you saw from the show. I've been lucky to know you for several years now and get to see all that. I do know that you are fairly active on Twitter, especially during CVPR; I would say if we were to graph your activity, it skyrockets the week you're there in person. How can we follow you online on Twitter? What's your handle?

Yeah, my Twitter handle is just my name, Jared Heinly. I'll be tweeting out anything interesting that I see. That's one of the things I love doing when I'm there at CVPR: when I find papers or posters or ideas that I think are interesting and applicable, I share those out with the community.

I'll also note that you've already been retweeting a few papers that have been submitted. That's one thing I always find interesting this time of year: the volume of papers being submitted leading up to CVPR ramps up considerably. Being able to see that and get an inside view of what a computer vision scientist finds interesting is fascinating, because a lot of this is pretty far out on the edge of technology, but you're able to give us a nice curated list of what's interesting, at least in your world. So I'll be very interested to see what you have to say in two weeks. With that, thanks everyone for coming. We will have this episode airing every two weeks; you can find us on YouTube and LinkedIn, and we're also considering putting this out as a podcast for an audio stream, if you don't care about watching anything and just want to passively listen. So again, this is Computer Vision Decoded with Jared Heinly, and you can find us on all of the streaming services: YouTube, LinkedIn, and pretty soon on podcasts. Thank you, and I'll see you in the next episode. Thanks, Jared, for your time. Thank you.