Welcome to Chuck Yates Got A Job with Chuck Yates. You've now found your dysfunctional life coach, the Investor Formerly known as Prominent Businessman Chuck Yates. What's not to learn from the self-proclaimed Galactic Viceroy, who was publicly canned from a prominent private equity firm, has had enough therapy to quote Brene Brown chapter and verse and spends most days embarrassing himself on Energy Finance Twitter as @Nimblephatty.
0:18 Welcome, great. All right. Now, my understanding is you've never done a podcast before. This is, this is my first time. This is my first time. So we'll just see, we'll see how it goes. This
0:30 is cool. So you said something the other day when we were talking to a potential client, and you said something wild, and that's what I wanted you to come on and just chat with me about. You
0:42 said when you get to an organization, and I forget exactly how you said it, but you said you usually gravitate towards the gnarliest, jankiest stuff out there. Yeah, man. I think something
0:55 I picked up over the course of my career, like to be honest, it's probably just a job security thing, which is I'd like to come in and find the things that nobody else wants to work on that are
1:07 scary, but important. And because if you can solve problems like that, nobody wants to get rid of you because I don't want to deal with that problem, you know. And so that's, that's kind of
1:18 always been my instinct. I, I kind of... Unfortunately, that's been my approach as a boyfriend, so I'm not sure that's been really good, but I'm glad it's worked out for you. Yeah,
1:29 I think I've got a little bit of that myself. Yeah. Fair enough. So, so you came in, you're not an oil and gas guy. Real quick, give me your background. So I'm a full stack engineer. I've
1:42 been in the industry, been in tech for 15 years. Kind of
1:48 came in, took a kind of circuitous route into tech. I was a DJ and an English teacher in the Czech Republic and then kind of stumbled into it. And then I've worked in front end. I spent a
2:01 lot of time in the web development world, then moved into enterprise, moved into architecture, and kind of worked my way up the stack, and then came here to Kineshius when
2:16 this opportunity came up. I liked the opportunity to work on something a little bit more broad. Gotcha. And where you've kind of run to is documents. And did you ever believe you
2:33 would see a world where you buy a billion-dollar asset and you get handed 27 hard drives that have a bunch of terabytes of data on there, and you get asked what's on it, and people say, I don't know.
2:47 It's incredible coming just like knowing nothing about this and then seeing kind of looking under the hood and seeing like how all this stuff works and just it's like incomprehensible from the outside
3:00 just like, wait, that can't be how that actually works. But then, yeah, you get to just dive into it, because sometimes it's fun. This is, this is a frequent thing that I do with like a
3:11 lot of prototyping work. You just come into a world that you know nothing about, but you also don't have any assumptions. So you just come in and, okay, how does this work? I'm gonna just, okay,
3:22 there's these documents, there's these number of pages, these are the tables, this is how it works. And then just try to reverse engineer, what would actually be required to get data out of
3:31 documents like this that would actually be useful. And that's what I've been doing since I came here, basically. So I'm gonna go kind of, I'm gonna sound boomer on you, but for the record, I'm
3:39 Gen X. Okay, that is an important distinction. That is a very important distinction. But you know, the way it used to work is you had, in effect, you know, Iron Mountain paper storage or whatever.
3:54 And you had these storage things, and as you bought assets, you basically just took over the rental payment of the storage. And then the next guy that bought the asset would take over the rental
4:07 storage. People didn't even have the keys to know where this data was.
4:13 I mean, we've got clients that are some of the most sophisticated energy companies on the planet, you go walk around and they've got bankers boxes all over their offices and stuff. So, yeah. Now,
4:25 I mean, getting that stuff digitized is like sometimes a victory. Absolutely. Yeah. And it feels like it feels like a problem that everyone, at least that was my experience when I came in, where
4:42 everyone had a perception that, Oh, this is a solved problem. You just take the documents and you get the data out. Then when we started exploring the details of that and what was actually
4:51 involved, it was like, Oh,
4:57 people don't really know how to do this yet, and the tools that exist to do this are not really up to the task. And I did a lot of research on, like, I read a bunch of research papers and dug into,
5:08 like, what is the industry standard for getting this unstructured data out when it can be as crazy and chaotic as things are in oil and gas? And the answer was, it didn't really exist. And
5:19 that's what drew me toward it was like, okay, if there isn't really a solution here, there's an opportunity to maybe come up with something. So what have you done? Like
5:30 maybe not definitely don't give a specific client name or anything, but just give me some examples of things you're doing to help solve that problem of holy cow, we've got a ton of data there. Yeah,
5:43 so I think, yeah, without getting too mired in the details,
5:47 it's not, I think a lot of times when people are thinking about the AI world, they think about things just as like one-off solutions. But when you're talking about data processing, it's a whole
5:58 pipeline with so many individual steps. And it's about making improvements at every little step along the way and kind of widening the scope of documents that you can process. So it's like,
6:11 it's not just like, oh, go hack out some amazing solution and ship it. It's like 50 different little problems that all have to improve together to get to the full solution. So like one
6:26 of the problems that we were just working on was categorization, which is like, whatever you wanna do with a document, step one is which kind of document do we have? And if you can't do that right,
6:39 basically anything that comes after that is useless. So I just spent a lot of time with one of our clients. They gave us like a big treasure, a big trove of documents. And it was, can we discover,
6:50 can we find patterns and similarities between these documents just using like without any information from them? And so I kind of built a little tool that allows us to experiment with clustering
7:02 techniques
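To make the idea concrete for readers: the kind of unsupervised grouping described here can be sketched roughly as bag-of-words vectors plus cosine similarity, with a greedy threshold to form clusters. This is an illustrative stand-in, not the actual tool from the episode; the sample documents, the tokenizer, and the 0.5 threshold are all assumptions for the sketch.

```python
# Illustrative sketch of unsupervised document clustering (not the actual
# tool discussed in the episode): bag-of-words term-frequency vectors,
# cosine similarity, and greedy threshold-based grouping. A real pipeline
# would likely use embeddings and a proper clustering algorithm.
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Bag-of-words term frequencies for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Greedily put each document into the first cluster whose
    representative is similar enough, else start a new cluster."""
    vecs = [vectorize(d) for d in docs]
    clusters: list[list[int]] = []
    for i, v in enumerate(vecs):
        for group in clusters:
            if cosine(v, vecs[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical sample documents, loosely echoing the document types
# mentioned in the conversation.
docs = [
    "net cash flow forecast reserve report proved developed producing",
    "reserve report forecast of net cash flow proved undeveloped",
    "joint operating agreement between operator and non-operator parties",
]
print(cluster(docs))  # the two reserve-report-like docs group together
```

The interesting property, and the one the conversation highlights, is that nothing here reads a title or a label: documents group together purely because their content overlaps.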
7:06 And we just tried a few different ways of attacking it, and what ended up coming out surprised me. It is really, really good at identifying exactly which documents there
7:19 are. You showed me that demo, and the thing that struck me, 'cause, just for the listeners out there, I mean, we wound up with here are well logs, here are reserve reports, here are AFEs,
7:34 here are joint operating agreements. And the thing that struck me about 'em is one, it did it really well, and I forget how many different categories you had, 37 or 52 or whatever it was. They
7:50 were very, very different looking documents. I mean, and so, but it was literally pulling out the characteristics of the documents, not, you know, going, Oh, it wasn't looking for the title
8:06 that said reserve report. It was actually going, Oh, okay. These are forecasted cash flows. I bet this is a reserve report. Exactly. Yeah. And that that's what's fun about this sort of
8:17 research and building technology like this is just getting to come in and like, okay, this is the goal. And then you just try a bunch of different methods. You combine systems together and then
8:28 like there was, I remember it was like two a.m. on a Wednesday and I was playing with it and it just started working 200% better. I made one change and
8:41 it just started behaving completely differently in a way that was like almost unpredictable. And like that's what makes this kind of stuff fun. It's like, oh, yeah, like, like it was exactly
8:50 what you said, where there was, I think it was like a royalty assignment, where there were like eight different versions of a royalty assignment from 40 years of time that looked so different. It grouped
9:04 them all together, and I was kind of shocked. Yeah. No, and so, I mean, when we think through what people can do with it, there's the quintessential, we just bought this, we got a hard
9:14 drive, we have all this data on it, we don't know what to do. But I think the other thing that it does, well, there are multiple things. The other thing it does is, historically, you didn't
9:24 wanna digitize a lot of documents 'cause you had to pay a fortune, 'cause somebody had to put a header on everything: this is a reserve report, blah, blah, blah, this date, and everything. I mean,
9:37 we can have a high school kid just scan in documents all day, put it through your machine, and do a really good job of figuring out what's there, right? Exactly. Exactly.
9:50 And like, there's still some tuning and work involved to get it there.
9:55 I think one of the things that's kind of fun about what we've worked on so far is that what I built for this client had no input from the client.
10:04 We just let it run. But with input from the
10:06 client, where they say, actually, we wanna group these documents together, and we care about these documents and not these others, we can tune it that much more specifically to exactly what their needs
10:17 are and exactly the structure that they care about. So it's like, it just gets better over time. Yeah, and getting somebody to 90% is huge when we're talking about the number
10:29 of documents 'cause the other thing we've seen out there is folks will sit there and buy an acquisition and they'll leave the file folder structure of that acquisition as is. And so,
10:45 in some cases, if you bought Company X, they put all the well logs in one folder, right? Company Y that they bought does it by well or whatever. Potentially, we're gonna have the ability to
10:58 just jam all that stuff through, put everything into buckets, and, in effect, redo the file folder structure the way the company wants. Exactly, and find, like one of the things we've already seen
11:09 is find files that are completely mislabeled in the wrong folder, have the wrong name, like, and you can just see it and be like, okay, that's not where that belongs. And that was a huge problem
11:23 in other solutions that we use, like we've had some categorization tools that look at those file paths and use that as the way to figure out what kind of document it is. And so there's so many
11:33 mistakes. Things just aren't where they're supposed to be. You know, God rest Mary Jane Kasurik, who was my librarian in junior high. She told me I was gonna need to know what the Dewey Decimal
11:44 System was, and that's all I can think right now, 'cause that's what we're doing: the Dewey Decimal System. Pretty much that. It's pretty much that, exactly. So anyway, well, that's cool.
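The re-bucketing idea above, classify each file and rebuild the folder structure while flagging files whose current folder disagrees with the prediction, can be sketched like this. Everything here is illustrative: `classify` is a hypothetical stand-in for the real categorization model, and the keyword rules and sample paths are invented for the example.

```python
# Illustrative sketch of re-bucketing files by predicted document type and
# flagging files whose existing folder disagrees with the classifier.
# `classify` is a hypothetical stand-in; the real system classifies by
# document content, not file names.
from pathlib import PurePosixPath

def classify(filename: str) -> str:
    """Hypothetical keyword-based classifier, for illustration only."""
    name = filename.lower()
    if "log" in name:
        return "well_logs"
    if "afe" in name:
        return "afes"
    return "unclassified"

def rebucket(paths: list[str]) -> tuple[dict[str, str], list[str]]:
    """Return (new path for each file, files whose current folder
    disagrees with the predicted category)."""
    moves: dict[str, str] = {}
    mismatches: list[str] = []
    for p in paths:
        path = PurePosixPath(p)
        category = classify(path.name)
        moves[p] = str(PurePosixPath(category) / path.name)
        if path.parent.name != category:
            mismatches.append(p)
    return moves, mismatches

moves, flagged = rebucket([
    "companyX/well_logs/smith_1_log.pdf",  # folder matches prediction
    "companyY/misc/jones_afe.pdf",         # an AFE filed under 'misc'
])
print(flagged)  # only the mislabeled AFE is flagged
```

The point of the sketch is the contrast the conversation draws: older tools trusted the file path as the label, while a content-based classifier can both rebuild the structure and surface files that were filed in the wrong place.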
11:58 That's cool, we've got, you know,
12:01 'cause we sit there and we talk AI and everybody wants AI. You know, I'm kind of using this analogy: if you have a pyramid to solve a problem, the bottom 40% of that pyramid is, I just
12:16 gotta go find some data, I'm gonna start rifling through stuff. The next 30% is, wow, it's these three things that matter, right? And then the next level, 20%, is I'm gonna do some calculations,
12:30 I'm gonna do some math, whatever. And then that upper 10% is, all right, should I do this workover? Is that a good risk-reward trade-off? Everybody wants AI to do that. It actually does the 90%
12:44 really well. I mean, I think the 10% is still human beings, still your subject matter experts. And what you're talking about is literally that first one.
12:56 And it is, I think that metaphor is perfect 'cause it's the foundation. It's like, and it's kind of the least sexy part of it a lot of the time, because there's just an assumption that it'll just
13:07 work. But one of the things that I think people are starting to discover as they work with AI is it's so much data, and it's moving around in such unpredictable ways, that if
13:18 you don't have really, really reliable data extraction, it's so hard to know that anything's even wrong. Because AI is really good at making things look reasonable. So it takes garbage
13:32 data and cleans it up and makes it look right, and so you're like, everything's working perfectly. So you have to have really reliable systems that track everything through the process to make sure
13:43 that it's coming in clean. 'Cause like from the
13:47 start of working on this project, the assumption is this is data that people will make decisions with that have extreme stakes, tens of millions of dollars if that data is wrong. So we need to have the
13:59 reliability and accuracy to be able to absolutely nail that from the beginning. And that's a big problem, but that's what we're facing. Do you have any thoughts on kind of how we deal with
14:12 duplicates? So that's actually kind of an interesting one that the classification can help with. That kind of feels like the next step after we put everything in buckets. Yeah, so dealing with
14:26 duplicates is like, you can do it a couple of different ways. One thing you can do is there's something called hashing, which is like, you just take the content of the files and there's a
14:35 fingerprint that you can pull from the file contents. And if the fingerprint is the same, it's a duplicate, and you just get rid of it. But it's also, I think, the clustering, the categorization
14:46 piece starts to get to that, where you can see, are these so similar that they are actually the same file? We had a couple cases where we were seeing duplicates of files where one
14:58 was a PDF, one was an Excel file, and one was an image file, and they were all duplicates of the same thing. And because it was clustering them together, we could be like, it was pretty obvious
15:07 that, oh, these are duplicates, and then you just get rid of them. Yeah, 'cause I think the bigger problem we're gonna be dealing with, with clients, is: Chuck downloaded from the shared drive, changed
15:19 a file, saved it on his hard drive, probably didn't save it back. Chuck cannot be trusted. Yes. So it's that problem that we're gonna have to deal with and do rules, and it may be like, hey, if we
15:34 find a file from Chuck and it's about finance, he's our finance dude, we're gonna take that file as source of record versus Clay's, 'cause Clay's the programming guy. I think that's probably right.
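The hashing approach mentioned above can be sketched in a few lines. This is a minimal illustration under one assumption worth flagging: a content hash only catches exact byte-for-byte copies, so a PDF and an Excel export of the same report hash differently, which is exactly where the clustering piece comes in. The file names and contents are invented for the example.

```python
# Illustrative sketch of duplicate detection by content hashing.
# Files with identical bytes produce identical SHA-256 fingerprints;
# this catches exact copies only. A PDF and an Excel version of the
# same report would NOT match here, which is where similarity-based
# clustering helps.
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 fingerprint of a file's raw contents."""
    return hashlib.sha256(data).hexdigest()

def dedupe(files: dict[str, bytes]) -> dict[str, str]:
    """Map each duplicate file path to the first path seen with the
    same fingerprint."""
    seen: dict[str, str] = {}        # fingerprint -> first path seen
    duplicates: dict[str, str] = {}  # duplicate path -> original path
    for path, data in files.items():
        fp = fingerprint(data)
        if fp in seen:
            duplicates[path] = seen[fp]
        else:
            seen[fp] = path
    return duplicates

# Hypothetical files: one exact byte-for-byte copy.
files = {
    "logs/well_a.las": b"well log A",
    "backup/well_a_copy.las": b"well log A",
    "logs/well_b.las": b"well log B",
}
print(dedupe(files))  # {'backup/well_a_copy.las': 'logs/well_a.las'}
```

The "Chuck edited his copy" problem discussed above is the hard case this sketch does not solve: one changed byte changes the whole fingerprint, so near-duplicates need similarity comparison or source-of-record rules instead.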
15:47 And so yeah, we're gonna have to do those types of things. You know, the big problem with shared drives is, why is Aunt Bessie's chocolate chip cookie recipe on the shared drive? I mean, that,
16:03 that'll be pretty easy to deal with. But the duplicates, 'cause one potential client said they have 39 terabytes of data and they believe 29 of the terabytes are duplicates. Yep,
16:16 that checks out. Yeah. So that's going to happen. Well, that's cool, man. I'm glad, uh, glad you're on the team, dude. This is, uh, it's been fun working with you. Same, man, it's,
16:25 it's a, it's a fun challenge. And, uh, that's what it is. It's like, I wanted to find something that I could drive myself a little crazy with and like get really into it. And here we are,
16:34 now it's happening. Nice. The, uh, my favorite story on you is you went out to the field and, uh, they pointed at the well test. That's a well test! That's a well test. So
16:47 great. I mean, when you spend like 30 hours staring at a document over and over, and then, that's actually what that's recording. Yeah, it was kind of a moment. That's cool. That's cool. I just
17:00 really want to circle back to how important of a problem this is. 'Cause to me, this feels like maybe one of the most foundational things. If you want to do any sort of work in AI that is based on
17:13 data, the quality and reliability of that data and your ability to figure out what's going on and get the data out, it's really the whole ball game. And I think that is being a little bit neglected
17:25 right now, and so that, I think, is an opportunity for us. And, you know, the thing I get all the time is, well, why can't I just do search in Copilot, right? And you cherry-pick the
17:37 data in, so you've in effect done your data selection, you drop it into Copilot, and it works pretty well. But it's maybe 80% of the way there. And that last mile, that last 20%, and
17:53 to do it at scale, is really freaking hard. And we have to do all of these things that we're doing to be able to get to that hundred percent. Because like you said, when you file the wrong number with the
18:06 Railroad Commission, they get pretty pissy about that. Exactly. If you're making decisions about like where they're gonna drill a well or like what, you know, at least that's the frame that's been in my
18:15 mind. Because you also have to decide like sometimes the quality of the data doesn't matter that much. Sometimes it can be a little messed up and everything still works. But that is not the case
18:24 that we're dealing with here. We have that level of requirement. So that makes it hard, but it also makes it fun.