Computer Vision Decoded

In this episode of Computer Vision Decoded we dive into Jared Heinly's recent trip to the CVPR conference. We cover: what the conference is about, who should attend, what the emerging trends in computer vision are, how machine learning is being used in 3D reconstruction, and what NeRFs are for.

00:00 - Introduction
00:36 - What is CVPR?
02:49 - Who should attend CVPR?
08:11 - What are emerging trends in Computer Vision?
14:34 - What is the value of NeRFs?
20:55 - How should you attend as a non-scientist or academic?

Follow Jared Heinly on Twitter
Follow Jonathan Stephens on Twitter

CVPR Conference

Episode sponsored by: EveryPoint

Creators & Guests

Host
Jared Heinly
Chief Scientist at @EveryPointIO | 3D computer vision researcher (PhD) and engineer
Host
Jonathan Stephens
Chief Evangelist at @EveryPointIO | Neural Radiance Fields (NeRF) | Industry 4.0

What is Computer Vision Decoded?

A tidal wave of computer vision innovation is quickly having an impact on everyone's lives, but not everyone has the time to sit down and read through a bunch of news articles and learn what it means for them. In Computer Vision Decoded, we sit down with Jared Heinly, the Chief Scientist at EveryPoint, to discuss topics in today’s quickly evolving world of computer vision and decode what they mean for you. If you want to be sure you understand everything happening in the world of computer vision, don't miss an episode!

Jonathan Stephens (00:00):
Welcome everyone to episode two of Computer Vision Decoded, where we sit down with Jared Heinly, chief scientist at EveryPoint, to discuss topics in today's quickly evolving world of computer vision. In today's episode, we're going to discuss Jared's recent trip to CVPR. Jared's going to give us a quick overview of the conference and a few key findings from it. So Jared, first off, welcome to the show. Can you tell us what CVPR is? I'm assuming half of the audience might never have heard of this conference.

Jared Heinly (00:36):
Yeah, thanks. Happy to be here. So CVPR stands for Conference on Computer Vision and Pattern Recognition; that's where you get the acronym. CVPR is the top computer vision conference that's held each year in the United States. It moves around different parts of the US, but it's typically held during the summer months. There are both academics and industry there, but it's the big place where all the latest and greatest computer vision research is being presented.

Jonathan Stephens (01:14):
All right. And this year it was in New Orleans. My biggest question is: did you have alligator?

Jared Heinly (01:21):
Yes, I did. I had alligator two ways. I had it one day as an appetizer, fried alligator bites. It was really tender, really juicy; I was kind of surprised. And then another evening I had some alligator sausage. Nice and tasty, pretty good.

Jonathan Stephens (01:42):
Yeah. Well, if you've never been to that part of the country, or Florida, you have to try fried alligator. Most people don't realize alligators are actually more of a nuisance than anything in those states. So anyways, let's jump back into computer vision, because I know that's what everyone's here for. So at the show there are papers, there are oral presentations, there's all sorts of different information. And Jared, you have a very robust Twitter feed from this show, so if you want to go back and review it, just follow Jared at @JaredHeinly. But I noticed that a lot of the papers and presentations were very academic or scientific. If I'm a non-scientist and non-engineer, would it be worth me going to this conference and checking it out? Or do you feel like it's really geared toward the computer vision and pattern recognition research community? Would people outside of academia and scientific study get something out of it?

Jared Heinly (02:49):
Yeah, that's a good question. You definitely could get something out of it. First and foremost, CVPR is a place to present academic work. Throughout the year, PhD students, researchers, and professors will have been working on papers, working on new methods within the field of computer vision. Those are written up in papers, which are then presented at the conference, either through short oral presentations or as posters that are displayed, which people can then walk around, ask questions about, and see. So primarily it's a way to discuss and present academic work. Some of these works are easy to see: you could walk up and there's a very easy method, like, oh hey, we took this problem,

Jared Heinly (03:40):
here's some key insight, here's this new solution. So there are opportunities there to learn and see new things that are very understandable. Some of the works are maybe a bit more intimidating: I walk up and I'm like, I have no idea what this math means or what this problem is; I've never heard of this before. It takes a bit of understanding to ask questions and see: why is this useful? What's novel here? What was new? So parts of it are a little more difficult, but it is an opportunity to get a sense of what the computer vision community at large is working on. There were thousands of posters presented during the conference,

Jared Heinly (04:26):
so there's just a lot of information there. It's also interesting that there is a big industry presence; there's an industry expo that happens in parallel. That's primarily a recruiting event: companies are there to attract new talent. Sometimes they're also selling; a camera company might sell a new camera, or a depth device, or other kinds of sensors that they want people to be buying and using as part of their work or their research. So it's an opportunity to get a taste of, hey, here's what academics are working on, and here are the kinds of talent that industry's looking for or the things that industry is tackling.

Jonathan Stephens (05:09):
All right. So it sounds like the expo is more geared for someone like me, specifically if you want to make connections in the industry with other companies that may be doing things with computer vision, but not necessarily dive into those academic papers. I'm pretty sure if I looked at some of these papers, I would look at the pretty pictures, maybe ask a couple questions, and they would wonder why this guy's here. It's not really for me. And that's good to know, because we see a lot of these interesting conferences and you always wonder: is it worth the time and money to go? So it sounds like if you want to go and learn about what's newest in computer vision, it's more geared for the scientific community and less geared for, you know, a project manager who just happens to have computer vision in a project. So that's helpful. And you've said before that you've gone to this conference for several years. Last year it wasn't in person, but you've been attending for, you know, at least half a decade? Longer?

Jared Heinly (06:09):
Longer, yeah. My first was 2011; I was starting my PhD, and 2011 was the first CVPR I attended. I didn't attend every year since then, but this has been maybe my sixth or seventh time there. Well, it hasn't been in person the past two years, but I've been there many times and have seen it grow over the years. When I attended back in 2011, there were around 1,500 or 1,600 attendees, which at the time felt like a big conference. The last time it was held in person was 2019, and there were over 9,000 people in attendance, which is huge. This year, with travel restrictions and COVID, there was lower attendance, 5,600 people in person, but it's a big conference.

Jonathan Stephens (07:03):
It's a testament to the industry as a whole, how it's growing and becoming a part of so many products and everyone's lives. I don't think everyone understands how computer vision is touching the lives of almost everyone on this planet in some way or another, whether right in a product they're using or invisibly behind some product. So it's really exciting to see this entire industry grow. And as you've seen the industry grow, of course things keep changing; what was a hot topic in 2011 might no longer be a hot topic, maybe we've mastered it. What are some research trends or areas of research that have emerged in the last year or two that you've seen at this show, that perhaps you weren't seeing so much of a couple years ago, either because they didn't exist or because there just wasn't the market? Of course, industry money pushes research in certain directions as well. So what did you see as emerging trends there?

Jared Heinly (08:11):
Yeah, the biggest thing I saw was the prevalence of NeRFs, neural radiance fields. There was a blog post, I think that Frank Dellaert posted, noting there were over 50 NeRF papers at the conference, which was just impressive. Every corner you turned, there was another adaptation, another way to improve NeRF. For people who aren't familiar with NeRFs: it's a 3D representation of the world. You take photos as input, and it learns this volumetric representation that you can then use to render really nice images from vantage points that you hadn't captured. So if I took a bunch of images of an object, say 10 images, I can now render an infinite number of views of that object from different angles.
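(For readers who want to see the core idea concretely: at the heart of a NeRF is a small neural network that maps a 3D position and a viewing direction to a color and a volume density. Below is a minimal PyTorch sketch of that mapping, not code from any particular paper; the layer sizes are arbitrary, and real NeRFs also apply a positional encoding to the inputs, which is skipped here for brevity.)

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal sketch of the NeRF scene function: (position, view direction) -> (color, density)."""

    def __init__(self, hidden=256):
        super().__init__()
        # 3 position coords + 3 view-direction coords as raw inputs.
        # (Real NeRFs first apply a positional encoding to these.)
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density
        )

    def forward(self, xyz, view_dir):
        out = self.mlp(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative volume density
        return rgb, sigma

# Training fits this network to a set of posed photos; once trained,
# querying it along rays from any new camera pose renders a novel view.
model = TinyNeRF()
rgb, sigma = model(torch.rand(8, 3), torch.rand(8, 3))  # 8 sample points
```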

Jared Heinly (09:08):
There are a lot of works focused around that, and it's really taken off in recent years. Coupled with that emerging trend, another big presence, which I think has always been there but which I happened to notice more this year, was the intersection between computer vision and understanding humans, and I'll say understanding humans in a 3D sense. Like, hey, from a video, can I recover the 3D pose of a person? Where are their arms? Where are their legs? Where's their head? How are their fingers positioned? There was even a work on reconstructing a person's tongue in their mouth. That was geared toward AR/VR, where people are trying to create virtual avatars, really recover the complex geometry of a human, and create a lifelike replica.

Jared Heinly (10:03):
So, a lot of works dealing with digitizing humans and understanding how humans interact with spaces. There was also a lot of work around egocentric video: first-person views from a camera mounted to your head, or mounted to you, that's watching what you're doing and trying to learn from that vantage point. So human understanding continued to be a big trend. But in general, there were a lot of works dealing with the 3D side of computer vision, which I was excited about; I'm biased toward 3D. A lot of works dealing with 3D rendering, 3D reconstruction, and just marrying machine learning with 3D techniques.

Jared Heinly (10:56):
Talking about trends, I'd say a few years ago a big trend was saying: hey, we've got this powerful tool, deep learning, let's use it to solve anything we can. We don't need geometry; the machine learning model can learn it. We can learn the geometry of a lens, we can learn the geometry of the camera, let's just do end-to-end learning. That is still important and still being worked on. But where I'm seeing a lot of advancements happen now is in saying: well, we can use machine learning as a tool to solve these tasks, but if we also add in our understanding of physics, our understanding from the computer graphics literature from decades ago, and combine it with machine learning in an efficient, intelligent way, then the results get so much better. That for me was really exciting: looking at how 3D continues to be very relevant and useful in a lot of these works.

Jonathan Stephens (11:57):
Yeah, you've mentioned to me several times in the past that computer vision and machine learning are merging as disciplines. As a computer vision scientist, you no longer tend to just know 3D reconstruction; that machine learning component is definitely becoming something you at least need to understand at a core level. Maybe not be an expert in it, but it is something that plays into most people's technology stack when they're doing computer vision processes. I find that very interesting. I'm going to jump back to NeRFs. If anyone knows me, they know I've become Mr. NeRF, because I've jumped into these, and I found out that you can actually play with them at a very tangible level. You don't need NVIDIA's $10,000 or $20,000 GPU for science research; you can actually do it on a relatively new RTX 3080, and even some of the older GPUs work fine.

Jonathan Stephens (13:03):
So I'm seeing a lot of implementations. I looked through a lot of those papers and noticed they all have their different focuses. Some are just to visualize the world in a better way. Some are alternate ways to take camera shake and reduce it down to nothing, even extreme camera shake: how can you get a smooth trajectory path for that camera and make it look like you have a still shot? Or removing objects, things like that. But I also noticed NVIDIA announced they're doing 3D modeling in the way you would expect from a photogrammetry package. So we're starting to see this in different areas of use. And a lot of people keep asking me: really, why are people even using NeRFs?

Jonathan Stephens (13:53):
Because we can do all this with photogrammetry, or they just don't see the point of it. Can you help the non-scientific community here? I think that's who struggles; as a scientist, people will say, of course this makes sense to research. And of course it's not in an iPhone or a Samsung Galaxy Note or one of the newest devices, because it's not there yet; we're only two, two and a half years in. How do you see this technology playing out, maybe in an everyday product or in some form that's not just an academic paper showing how you took a kid on a bike out of the scene as you drove through it?

Jared Heinly (14:34):
Yeah. So a NeRF is a representation of the world. It's a way to digitize and store information about reality. To take a step back: it's similar to, if I think about, okay, how can I model the room that I'm in? I could have a CAD model, which is going to represent the vertices, the walls, the planes, the high-level geometric properties of this room. That's one way to digitize reality. But maybe that CAD model is just the lines and the points, so there's no color information. So we say, okay, it would be really nice to have color, so let's take that CAD model and convert it to a textured mesh.

Jared Heinly (15:20):
Okay, we've taken all those polygons and turned them into triangles, and now we've got this triangle mesh with texture coordinates associated with it. Now we've got color and geometry. Well, what about reflections? Is this a shiny surface or a rough surface? Okay, so now we start adding material properties to that textured mesh. But then what about fog? If I want to model, hey, there's a fire and there's smoke, or there's fog, a triangle mesh can't represent smoke or fog, because a mesh is a specific surface: there's empty space, here's my triangle, and then there's more empty space. You need some other way to represent fog or other effects. And so you can keep trying to switch to different implementations or different representations.

Jared Heinly (16:11):
And that's where NeRF comes in. NeRF is just one style of representation to digitize reality. What's nice is it can model complex interactions. It's able to mimic reflections: as you look at an object from different angles, you're going to get different colors, and it's able to understand and represent that the appearance of an object changes based on the direction you look at it from. Think of a mirror: as I move my head and look at the mirror from different angles, I'm going to see different things. A NeRF is able to capture that. A NeRF also represents sort of a volumetric grid, so it's able to capture effects like density or fog, saying that, okay, as I'm looking through this space, there's a particular density.

Jared Heinly (17:05):
So I'm able to represent volumetric effects inside of a NeRF. There are pros and cons to a NeRF. To me, it's akin to: I've got a point cloud, I've got a mesh, I've got a CAD model, I've got a NeRF. Each of them has pros and cons in what it's able to represent, as well as how easy it is to manipulate. And that's where I see things right now, with NeRFs being rather new. We're saying: hey, okay, we've proposed this representation, now how can we use it? That's what a lot of this research is. How can I use it to represent really large scenes? That's a field of research. How can I use it to get more accurate color, or be more efficient?

Jared Heinly (17:50):
How can I compute a NeRF more quickly? How can I manipulate a NeRF? If I want to go in and manually edit a NeRF, to say, oh, I want to inject a sphere here, or move this, or warp it, or recolor it, how can I do that? Just as there's been a lot of work dealing with the rendering and manipulation of CAD models or meshes, there's a lot of work happening now on the rendering, manipulation, and creation of NeRFs.
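(A concrete look at the volumetric side Jared describes: a NeRF image is rendered by sampling color and density at points along each camera ray and compositing them. The sketch below shows the standard volume-rendering accumulation for a single ray, written in PyTorch for illustration; the tensor shapes and the helper name are our own, not from any specific paper.)

```python
import torch

def composite_ray(rgb, sigma, deltas):
    """Accumulate sampled colors along one ray into a final pixel color.

    rgb:    (N, 3) colors at N sample points along the ray
    sigma:  (N,)   volume densities at those points
    deltas: (N,)   distances between consecutive samples

    Because color is weighted by density, the same math handles a hard
    surface (one dense sample) and fog or smoke (many soft samples).
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)           # opacity of each sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)  # light surviving past each sample
    trans = torch.cat([torch.ones(1), trans[:-1]])     # transmittance *before* each sample
    weights = alpha * trans                            # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)         # (3,) final pixel color
```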

Jonathan Stephens (18:21):
Interesting. That really helps break it down for normal people, or people like me, I would say. To really bring it down with an analogy, and this might be the wrong analogy: when cars came on the scene over a hundred years ago, they were slow. A horse was a way better mode of transportation, and I'm sure a lot of people said these are just loud, noisy machines that break down and spew all kinds of smoke into the air, just a fad. And now look: do we all take horses to work? No. Not to say that a NeRF will replace some current technology, though it might; it might just become another really great way to get, as you said, a 3D representation of a scene or an object.

Jonathan Stephens (19:18):
That's why I have been keeping up with it: you don't want to wake up one day and realize you're at the back of the technology stack because you dismissed this new technology, and lo and behold, NeRFs have matured to the point where they're more efficient or better at producing some sort of 3D representation of a scene that you were trying to produce a different way that took more time and perhaps more compute power. We've already seen the time and compute power coming way down. I'm also hearing from people who are working on the visualization side. Right now you make a NeRF and your training times have come way down, but it still takes a lot of compute power to then trace light through a scene and visualize it.

Jonathan Stephens (20:05):
So I noticed there's probably some research you saw there on how we can visualize this better, how we can render it on a computer that isn't high end. I've definitely seen that progress, and I look forward to keeping up with it with you as we all learn the latest and greatest. All right, I guess I'm getting near the end of the questions I had about this conference. It's grown and gotten much larger. You pointed out that there are NeRFs, and there's machine learning being applied to make 3D reconstruction faster and more robust. And it's a great event for people who want to go recruit. I'm guessing if you're trying to build a computer vision team, you guys are not easy to hire.

Jonathan Stephens (20:55):
It's not like there are thousands and thousands of you graduating each year from a bunch of different universities. So that alone is a good reason to go, if you're trying to recruit or make connections in the industry. But on top of all of that: if you were to go next year and you're a non-scientist, would you recommend just going to the expo? Do you feel like they'd maybe still get something from the papers? The whole point of this show is to decode computer vision, so would it be worth it for someone like me to, say, attend with you or attend with their computer vision team, who could be their liaison to this world of computer vision and decode what they're seeing? Would that be a good strategy, perhaps, if you're an executive at a company and you want to learn more about what's going on in trends?

Jared Heinly (21:52):
Yeah, I think that's a good recommendation. If you've got a computer vision team, go with them and use them as your liaison. But I also give this advice to any student or person at their first CVPR: don't be afraid to ask questions. The computer vision field is really large, and no one is an expert in the entire breadth of the field. So I would say, when you're there, take advantage of the poster sessions. The poster sessions are my absolute favorite part of CVPR, because that's when you get to interact with the authors of those works. And people love talking about what they've accomplished, what they've done.

Jared Heinly (22:36):
I try to model this, and I try to encourage other people to do it: take advantage of the poster session. Walk up to a poster if you see something interesting in the title, or something interesting on the poster itself, some interesting graphic. Even if you have no idea what's going on, just say: hey, I'm new to this field, can you explain to me in 30 seconds what the high-level idea is, what the key takeaways or insights were? And the authors can give you those insights, because a lot of times with these works, yes, there may have been a lot of formulas on the poster, and the method took a lot of engineering to pull off, but there really are one, two, or three key insights that the authors discovered and were able to use to enable the breakthrough in what they're publishing.

Jared Heinly (23:27):
And a lot of times those key insights are very understandable. So just going and asking questions during that poster time is, to me, the most valuable part. I'll also call out the tutorials: usually at these conferences, like this time, the first two days are the tutorials and workshops, and then there were four days of the main conference, which is where the oral presentations and posters were. But take advantage of those tutorials. The purpose of a tutorial during those first two days is to teach people new concepts. There are typically tutorials around machine learning or 3D reconstruction, whatever is relevant at the time. So that's a great forum to attend and learn more.

Jared Heinly (24:12):
For instance, this year there was a tutorial on NeRFs, which really walked through the basics of the basics: it defined the terminology, what a radiance field is, how we got here. And then it built on top of that, saying, okay, with this basic knowledge we can add more. It's like a compressed mini lecture that walks you through the topic and tries to bring you up to speed so that you can better approach and understand the work that's happening. So there are a lot of opportunities to learn and ask questions.

Jonathan Stephens (24:44):
Yeah. How often do you have a chance to stand face to face with the lead researcher on a concept at the paper sessions? That's what I've noticed: you see these specific people, I would call them computer vision science celebrities in some cases, where they are perhaps the foremost researcher in a specific domain of computer vision, and they're presenting a new paper, and there's no other way you're going to have a chance to ask them a question. Maybe if you're in academia you might, but most of the time they're heads down doing research and you can't just ask them questions about that paper. I've also noticed, as I've read through a lot of these papers and gotten better at reading them, that they're all structured the same way. It's definitely a formula, or format, they have to follow. The one thing people always tend to skip over is the assumptions. And I don't know if it's assumptions or risks, but there's always some shortcoming. People say, oh, this is amazing with NeRF, right? And then there will be some assumptions and some shortcomings they pull out. For example, they'll say this took 30 hours using four very expensive GPUs for 10 seconds of output video.

Jared Heinly (26:06):
Mm-Hmm

Jonathan Stephens (26:06):
So, yeah, it looks great, but they're also saying, you know, we're assuming you have that hardware and the time to wait for the output, or they're assuming the imagery you have is perfect. And that's something I want to talk about; I'm going to allude to the next episode of Computer Vision Decoded, which is about the challenges of computer vision in the wild, and I know you have a lot of background on that. People don't realize, when you see these really great models of, let's say, a shoe (everyone likes to see shoes and rocks), that a lot of times those are put on a pedestal with a white background, or on a turntable, and everything's perfect. It's not in the wild, and there's a lot going on to get those captures just right.

Jonathan Stephens (26:50):
If you're doing something for archaeological preservation, you can control a lot of the environment. But if you're trying to put a lot of this into practice, for example with autonomous vehicles, everything's wild. You're not driving on a racetrack where everything's always the same; even the light changes dynamically as you're driving. So I like to point out that these papers are assuming a lot of things. Always look for that, and ask those hard questions; the authors will be excited, because they get to talk about what they've been working on, perhaps for years, on that paper. So anyways, with that, any last remarks you want to add on CVPR? I'll give you that last chance.

Jared Heinly (27:36):
To me, I'm just excited for computer vision in general. Of course the field keeps growing; it's really active. And with my own bias, I love taking computer vision and applying it to 3D problems. That's where I see that 3D computer vision is alive and well and thriving, in the ways it's being applied across the board. I love seeing the breadth of problems that people are tackling using 3D computer vision, using really practical insights and applying them to new problems, to solve things that had never been solved before.

Jonathan Stephens (28:14):
Great. Well, thanks for breaking down the CVPR conference. I'm excited to watch your Twitter feed next year; it's your Super Bowl event on Twitter. And again, I'm going to throw this out there: it's @JaredHeinly, just like you see on the screen if you're watching this on YouTube. That's J-A-R-E-D-H-E-I-N-L-Y for the people listening on the podcast. You can follow him on Twitter; that's where he's most active. He does a great job of taking these papers and distilling them into just a few sentences, because that's all you have on Twitter. You can go back and see, not all of the papers, because there were hundreds if not thousands of papers, but some great ones, at least if you like what Jared's saying here. It's a good way to see what he's interested in and learn a little about what he saw at the show. So again, Jared, thanks, and we'll look forward to seeing everyone at the next episode. We'll do this every two weeks, and again, the next episode will be computer vision in the wild. All right, thank you everyone. Thanks Jared. Thanks.