Chuck Yates Got A Job

John and Bobby broke down how RAG actually works and why the real battle isn’t the LLM, it’s the chaos sitting in PDFs, Excel files, post-job reports, random folder structures, and handwritten scans. They showed how extraction, chunking, and embeddings fall apart when data is messy, and why clean structure, good metadata, and consistent organization matter way more than people want to admit. They also hit on how tools like MCP and text-to-SQL let AI pull from WellView, production systems, and databases in one place, instead of everyone living in 20 different apps. The takeaway was simple: AI gets powerful fast when the data is ready, but if the inputs are junk, you’ll just get faster junk back.

Click here to watch a video of this episode.


Join the conversation shaping the future of energy.
Collide is the community where oil & gas professionals connect, share insights, and solve real-world problems together. No noise. No fluff. Just the discussions that move our industry forward.
Apply today at collide.io

https://twitter.com/collide_io
https://www.tiktok.com/@collide.io
https://www.facebook.com/collide.io
https://www.instagram.com/collide.io
https://www.youtube.com/@collide_io
https://bsky.app/profile/digitalwildcatters.bsky.social
https://www.linkedin.com/company/collide-digital-wildcatters

What is Chuck Yates Got A Job?

Welcome to Chuck Yates Got A Job with Chuck Yates. You've now found your dysfunctional life coach, the Investor Formerly known as Prominent Businessman Chuck Yates. What's not to learn from the self-proclaimed Galactic Viceroy, who was publicly canned from a prominent private equity firm, has had enough therapy to quote Brene Brown chapter and verse and spends most days embarrassing himself on Energy Finance Twitter as @Nimblephatty.

0:00 So Bobby and I are here to talk about the data side of RAG because, in my opinion, it's the most important,

0:08 but Bobby and I have both come from the data side. He's much more trained than I am on, on all things data, but we'll just kind of jump into this. I just want to set the stage with kind of the

0:21 steps of RAG for everybody that isn't aware. But so the first thing you've got to do is identify the documents

0:30 Yeah, I was like, what's RAG?

0:33 Probably good call. So if you're not familiar, I do all things data at Collide. My background is I started as a frack engineer out in the field and then moved into the office doing technical support.

0:45 Over time, I ended up working for a surface and downhole gauge company where we were doing a ton of high-frequency time series data and, you know, Excel wasn't cutting it. And so I had to learn

0:58 how to deal with data that way. And that's kind of how I got into it. And at that same time, I started working with Bobby at that same company. And so, anyway, I've been here at DW, or Collide,

1:12 for over three years now. And so, I'm super excited for all of y'all being here today. I'll let you introduce yourself. Yeah. And I'm Bobby Neeland, currently running my own independent consulting for

1:24 data analytics, Drill Down Analytics being the name of it. But I've been in the industry for 11 years now. I started out at ConocoPhillips as a reservoir tech and

1:35 along the way learned some software, data, cloud, all the things, wore a lot of hats, you know, at RDS with John, and then worked at University Lands. And most recently at Grayson Mill Energy

1:50 where I basically built out the data team there and the data infrastructure, and we sold to Devon, and I did my time at Devon, and now I've turned over to bringing the good news of data to a lot of companies across oil

2:02 and gas. And I don't think we mentioned it, but John and I co-host Energy Bytes together as well. So.

2:11 Okay, so RAG stands for retrieval augmented generation. I've got a slide after this that actually shows what that means, but the main steps for RAG data ingestion kind of boil down to these six

2:25 things right here. So you've got to identify the documents, the type of documents, then we classify them. So based off of

2:32 the type of document, how are we going to classify it? Is it a drilling report? Is it a post job report? Is it a contract, those sort of things? Then we have to go through and extract all the

2:42 information from those documents. Next, we go through and we chunk that data, which we've got slides and we'll talk about, then that data goes and gets embedded into a vector database, and then

2:54 we can start searching on top of all of that
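To make those six ingestion steps concrete, here is a minimal Python sketch of the pipeline. The library choices (pypdf, sentence-transformers) and the helper names are illustrative assumptions, not the actual Collide stack, and a real pipeline would also handle layout, images, and tables as discussed below.

```python
# Minimal RAG ingestion sketch: identify -> classify -> extract -> chunk -> embed -> store.
# Library and model choices here are assumptions for illustration only.
from pathlib import Path

from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here


def classify_document(text: str) -> str:
    """Toy keyword classifier; in practice this would be an LLM or a trained model."""
    lowered = text.lower()
    if "drilling report" in lowered:
        return "drilling_report"
    if "post job" in lowered:
        return "post_job_report"
    return "contract" if "agreement" in lowered else "unknown"


def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap; other strategies are covered later in the talk."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]


def ingest(pdf_path: str) -> list[dict]:
    reader = PdfReader(pdf_path)                                     # 1. identify / open
    text = "\n".join(p.extract_text() or "" for p in reader.pages)  # 3. extract
    doc_type = classify_document(text)                               # 2. classify
    records = []
    for i, piece in enumerate(chunk(text)):                          # 4. chunk
        records.append({
            "id": f"{Path(pdf_path).stem}-{i}",
            "text": piece,
            "embedding": model.encode(piece).tolist(),               # 5. embed
            "metadata": {"source": pdf_path, "doc_type": doc_type},  # 6. store in a vector DB
        })
    return records
```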

2:57 I also want this to be collaborative. So if you guys have questions at any point along the way, please feel free to ask. But this is just a very basic representation of both the pipeline up

3:07 here and then the actual rag down here. But at the top, you can see we take a bunch of our documents and we chunk them and we embed them and put them in a vector database. And then down here, when

3:18 you submit your query, the query itself gets embedded, finds the top K matches from the vector database, and then passes those down into the language model to generate your response.
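Here is the query-time half of that diagram as a rough sketch, using the records produced by the ingestion sketch above. The cosine ranking stands in for a real vector database and call_llm is a hypothetical client; the point is that the language model only ever sees the top-K chunks, not the raw documents.

```python
# Query side of RAG: embed the question, pull the top-K chunks, then (and only then) call the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k(query: str, records: list[dict], k: int = 5) -> list[dict]:
    q = model.encode(query)
    return sorted(records, key=lambda r: cosine(q, np.array(r["embedding"])), reverse=True)[:k]


def answer(query: str, records: list[dict]) -> str:
    context = "\n\n".join(r["text"] for r in top_k(query, records))
    prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # hypothetical LLM client; the model never touches the full corpus
```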

3:31 I think that's one of the biggest misconceptions, and it's the biggest difference between a RAG and a traditional model as well. It's like the

3:42 language model in a RAG isn't involved until the very last step, essentially. So the language model is not touching the data; it's being fed data from your initial search, and that's what it's

3:53 constrained to, essentially. John, can you simplify that? Yeah, exactly. Next slides.

4:01 So I rolled all of these things into one just to give you, I've got a bunch of screenshots and real-world examples in here, but all data are not created equally. So obviously we have our traditional

4:14 things that we think of for RAG like PDFs and documents and power points and things like that. But of course, our industry loves to work out of Excel files and CSVs. Also, another big piece is

4:25 audio and video. People don't think about that as data that you can search across generally, but the speech-to-text technology has gotten incredibly

4:38 better, faster, and cheaper. And so now we can actually use video content or audio content and transcribe that. And now we have text data that we can also add to these datasets. But you know,

4:50 it's not just identifying the file types. It's for each of these files, as you can kind of tell. We've got to go through and we've got to identify the layout. We've got to identify if images,

4:59 we've got to extract the text. Also, if it's an academic paper or a textbook or something like that, it's going to have LaTeX in it, which is another fun one that's the - those are basically

5:10 formulas. That is a different thing that you've got to extract. We've got tables that are also from CSVs and from PowerPoints and DocXs, which are really, really, really fun. And then, of

5:21 course, transcripts. And then taking that a step further, we've got to go and generate a bunch of metadata around that. We generate captions and descriptions of the images from the documents and

5:32 from the diagrams, and then we recognize all the text and stuff. So this is before we've done any chunking, any embedding. This is the hard part for RAG, in all honesty, or one of the big hard

5:44 parts. It takes a ton of work. And so to give you some examples, over here, we've got this wiki article, right? So there's a bunch of LaTeX mixed in with words and a fairly structured format.

5:55 Over here, we've got an academic paper, you know, from arXiv or wherever this is from,

6:00 which, you know, okay, cool, it's a text document, but oh, we've got columns in this document, and then the next page of this document has columns in it, so you have to understand the layout

6:09 and be aware of what the layout is so that you can parse and extract the data correctly and with the right context. Another fun one, here's a drilling survey up here that's in a PowerPoint or in a

6:21 PDF document, but just a giant table that continues on for pages, and then probably the easiest and simplest form is like a contract where it's just a text file, but even the contracts that we've

6:34 got have tables and different things in them as well. And so it's just, it gets very complicated very quickly. Everyone thinks, oh, it's a PDF, PDF is a PDF. PDF is just a dumpster fire for

6:45 unstructured data in all honesty. There's no structure to it, there's no format,

6:50 there's nothing repeatable or anything like that

6:55 For example, these are the first three pages of a report that Buck passed us that's in his data set. We've got this crazy well schematic that everyone outside the industry has no idea what the hell that

7:05 is. You've got a giant plot, and then you've got a bunch of tables. And so this is just the first three pages of this document that's probably dozens of pages long, but I just say all of this to

7:17 show you how complicated this really gets once you get into the data side of things. I mean, John, just hold up there too, but you can go back, but I mean, we're in an age now where we're trying to

7:26 get this output into LLMs. How was this even being managed before? I mean,

7:33 that's a very valid point, right? Like Bobby and I were having this conversation earlier and it's like nothing on the data side has changed now that AI has come along, other than the fact that it

7:42 just highly exacerbates the problem that we have with bad data, bad structure, bad organization, and that sort of thing. That's what you're going to have. I mean, yeah, but I think even

7:51 before there was

7:54 parsing PDFs with Python or previous OCR, like Tesseract, or things that people were using with some success, but very rarely. But a lot of it would go back to the manpower thing, just

8:06 taking these and hand entering these into WellView or stuff like that, whereas like now, if we're able to reliably get this out with the newer OCR AI technologies and make it searchable, you know,

8:17 we're cutting out a lot of steps there. Yeah, there's no reason anymore in our industry for us to be manually entering data from a report directly into a database, into WellView, into whatever, wherever

8:29 that data needs to live. Humans hate doing that. We get tired, we don't, you know, that's the last thing that someone wants to do. So the opportunities for fat-fingering or screwing that up go

8:39 up exponentially the more time they're doing that. But ultimately, if it's a repetitive repeatable thing, then this is how we should do this. We should let AI pull it out of the document once it's

8:50 processed and map it to where it needs to go. So the future from that perspective is very exciting. Not to mention, a lot of this is work that people just don't like. Everyone's worried about AI taking

9:02 their jobs; before that ever happens, it's going to make their jobs much nicer, much easier, much more fun, because it's going to limit a lot of the BS they have to do on a day-to-day basis. But to

9:11 Bobby's point, like, historically speaking, you'd have to set up all these OCR templates. If anyone's ever worked with OCR, it can be very frustrating, because, you know, if the formatting changes

9:23 at all, it throws it all off. Right now, we've got OCR paired with vision models.

9:29 So you're using computer vision and OCR in the extraction process, and the stuff now is much more robust. And Bobby and I have been working on some stuff, and he was like, I think I can just get Gemini to

9:42 feed me the data right out, and I was like, you probably can, because Gemini is really good at classification and things like that. But then he ran into the same issues I ran into, that their API doesn't

9:51 match what their frontend does. So the results are questionable, but.

9:57 That being said, the OCR piece, being able to extract data from previously unextractable documents, is very exciting, and the technology around that in the last two, three years has exploded.
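As a rough illustration of that layered approach, here is a sketch that tries a traditional text-layer and table extraction first and only falls back to OCR when a page has nothing extractable (a scan or handwriting). pdfplumber and pytesseract are stand-ins chosen for the example, not necessarily what is used in production.

```python
# Extraction sketch: text layer and tables first, OCR fallback for scanned or handwritten pages.
import pdfplumber
import pytesseract  # requires the Tesseract binary to be installed locally


def extract_pages(pdf_path: str) -> list[dict]:
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            tables = page.extract_tables()  # list of row lists for each detected table
            if not text.strip():
                # No text layer: rasterize the page and OCR it instead.
                image = page.to_image(resolution=300).original
                text = pytesseract.image_to_string(image)
            pages.append({"page": number, "text": text, "tables": tables})
    return pages
```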

10:09 Another wonderful data format, a revenue statement, clearly built by someone who has never dealt with data at any kind of scale in their entire life, or intentionally built this way so that they

10:22 could upsell the CSV service that they also provide with this, but.

10:30 Oh,

10:32 what's cool, post job report. That one's actually really easy, right? Very straightforward, very fun, it's from Universal. Oh, wait a minute, there's another totally different format that

10:41 Universal also uses in this beautifully bastardized, pivoted, messed up table that continues on for pages and pages and pages.

10:51 These are, this is like the real world that we're dealing with from a RAG perspective. This is where things get really tricky, just trying to get all of that structured. This is the output from

11:03 the extraction for that last document. And I pull this up just to show you how many different pieces of data we're pulling from this. I can't even read those numbers, but hundreds to thousands of

11:15 different segments from just one post job report. One thing I will suggest is, if you work with companies that provide it in this format, they actually have it in a database. And if you can get ahead

11:29 of it, I mean, the problem now is we're trying to play catch up, right? But going forward, lean on your vendors to provide at the very least CSV copies of it, but ideally access to a database,

11:40 access to an API to get that data into a more structured format because it exists somewhere. They don't, there's not someone I, well, I hope there's not someone hand keying that information into

11:50 these PDFs, but. Well, I mean, you remember when we met with, what's his name, back in the RDS days. He was tasked with building a machine learning model for predictive frack stuff,

12:02 screenouts, et cetera. And he spent the first nine months building a machine learning model that just went in and normalized the headers from the past 10 years of their frack data because just the

12:12 header data is not normalized, there's no structure, there's no consistency across it. And so, yes, any time you can get structured, organized data, get it.
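A header-normalization pass like the one described is often just a mapping table plus a cleanup rule. Below is a minimal sketch; the frac column names and aliases are made-up examples.

```python
# Header normalization sketch: map vendor-specific frac column names onto one canonical schema.
import pandas as pd

CANONICAL = {
    "treating pressure": "treating_pressure_psi",
    "tp (psi)": "treating_pressure_psi",
    "slurry rate": "slurry_rate_bpm",
    "rate (bpm)": "slurry_rate_bpm",
    "prop conc": "prop_conc_ppa",
    "proppant concentration": "prop_conc_ppa",
}


def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    mapping = {
        col: CANONICAL.get(col.strip().lower(), col.strip().lower().replace(" ", "_"))
        for col in df.columns
    }
    return df.rename(columns=mapping)
```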

12:24 Oh, handwriting, this is my favorite one. So this is a really cool example. These are some old, old files that we've got from a client. They scanned in all these vendor boxes worth of data.

12:37 Generally speaking now, the rule kind of that I've seen is if you can read the PDF, even if it's handwriting, there's a very good chance that we can extract the data from it. That being said,

12:48 here's another wonderful example. I have no idea what any of this stuff says. Very hard to read, but my favorite part of it is that it extracted the 'butter and pain'

13:00 piece over here that is, you know, I don't even know what they're trying to say right there, but it doesn't say 'butter and pain.' So this is just an example, though, you know, as with all

13:09 things data, if you feed it junk, you're going to get junk on the back end. So you can't just dump all of your data into a RAG model and hope it's just going to magically

13:19 work. Yeah, I mean, I think that's where, I mean, we're going to get more into, like, file structures and all that myself. But I mean, I guess for the people here, I

13:30 mean, is this something that Collide itself can extract, or are you using other models to extract to get the data in? Yeah, we're using other tools to extract it and get the data in. So what

13:41 are some of the ones that have worked well for you guys? It genuinely depends. That's also the tricky part, right? Like, for these post-job reports, for example,

13:53 while I was working on this, Mistral came out with their OCR model, and I was super excited about it, and Mistral makes really good products. And I got a couple files in, everything was looking

14:05 great, and then I went and dumped it into just a CSV to look at it all, and it would just skip a page. For no reason, it would pull in like the headers and then none of the data, like five pages

14:17 in. And I still have no explanation as to why it did that, but that's what happened with a lot of these. Actually, with these specifically, some of the best tooling that I found were just the

14:27 traditional table-from-PDF libraries that have existed for years that aren't vision-enabled or anything like that. So that's another piece of this: just because it's new and shiny doesn't mean it's

14:40 better than what has been around. Yeah, I will say too, I mean, I'm sure Bill will like this, but I think Document Intelligence and everything on Azure is very good. I mean, especially

14:50 when you have a consistent structure to it, I mean, it works really well. And I mean, at that point, you can start using it within, like, basically a data pipeline to get data from an

15:00 unstructured format into a structured one. Yeah, and there's a bunch of new services out there that do this, that, you know, are Y Combinator-backed or whatever. But most of them honestly appear to be wrappers

15:11 around Document Intelligence or something else. And so, yes, we use a lot of Document Intelligence internally as well, but again, I just wanna point this out: not all of our data is good. There is

15:25 nasty stuff in there. And so, we have to be cognizant of what we put into these things if we expect the outputs to

15:35 be perfect. Chunking, fun. So there's a million different ways to chunk. Again, chunking is the concept of taking, let's say, one page from a document and breaking it up into contextually

15:47 relevant chunks. You can chunk by page, so that would be like fixed-size chunking; you can chunk semantically, you can chunk recursively, you can chunk based off the document structure itself.

15:58 So like typically most everything gets extracted into Markdown or HTML. And so because it's in Markdown or HTML, I

16:05 now have headers, sub headers, all of those things. And so based off of that, I can then chunk the data where the context is relevant, just based off the headers of the document. And that works

16:16 really well for things like contracts because they're pretty much structured the same every time. But a drilling report where the first three pages are completely different from each other, not as great. So

16:26 you have to, there's different ways to do this and there's different times to use different chunking strategies and things like that. But the chunking strategy in and of itself, like Ariana, I

16:35 know we have that. Initially, when we first deployed with Kraken, our chunks were way too big and he was getting a bunch of bad answers. We went back and re-evaluated that. We changed the

16:44 chunking strategy and the chunk size, and now he has much better, much more reliable answers. But again, there's not a cut-and-dry

16:53 answer that works under every scenario; it's very scenario dependent.
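As a sketch of the document-structure strategy John describes, the snippet below splits extracted Markdown on its headers so each chunk keeps its section context, and falls back to fixed-size splitting when a section runs too long. The size threshold is an arbitrary example, not the tuned value used with any client.

```python
# Structure-aware chunking sketch: split extracted Markdown at its headers,
# falling back to fixed-size chunks when a section is oversized.
import re


def chunk_by_headers(markdown: str, max_chars: int = 1500) -> list[str]:
    # Split just before any Markdown header (#, ##, ###, ...) so the header stays with its section.
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            chunks.extend(section[i:i + max_chars] for i in range(0, len(section), max_chars))
    return chunks
```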

16:57 And then here's just another table that I found of the advantages and disadvantages of each of those.

17:04 Moving on to the embeddings. This is where things get really fun. So embeddings are essentially the

17:12 vector representation of words and their relationships to each other. And this is, you know, shown on a 3D plot, but these are in hundreds of dimensions in real life. And so what you can kind of see

17:25 here though is if they query kitten, kitten is most closely related to cat, which is also kind of related to dog, but not really wolf and definitely not related to apple or banana. And that's kind

17:37 of what you're seeing here is all you're doing when you're vectorizing things is giving number values to each of the words in the relationships across a bunch of different dimensions. Same thing over

17:50 here, you can kind of see, I can see what that is. Yeah. So here's all the vectors related to 'boy', and the way that we actually figure out, you know, what is matching closest is cosine distance,

18:02 literally the cosine distance between the dots in the vector database.
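A tiny worked example of that: the vectors below are fabricated three-dimensional stand-ins (real embeddings have hundreds or thousands of dimensions), but the cosine math is the same, and "kitten" lands closest to "cat".

```python
# Cosine similarity over toy embedding vectors; cosine distance is just 1 minus this value.
import numpy as np

vectors = {
    "kitten": np.array([0.90, 0.80, 0.10]),
    "cat":    np.array([0.85, 0.75, 0.15]),
    "dog":    np.array([0.60, 0.90, 0.20]),
    "banana": np.array([0.05, 0.10, 0.95]),
}


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = vectors["kitten"]
for word, vec in vectors.items():
    print(f"kitten vs {word}: {cosine_similarity(query, vec):.3f}")
# "cat" scores highest and "banana" lowest, matching the kitten/cat/banana intuition above.
```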

18:09 Lastly, that brings us to search. So, you know, I want to go search all of this data now that we've chunked it and embedded it and built a bunch of stuff around it. But before I get into

18:18 search, I'm going to talk about MCP real quick, because it's kind of pivotal to all of this and it's not going anywhere. MCP is essentially a way for a language model to use an API call. And

18:30 so this opens up a ton of opportunities, especially for enterprises where I want to keep my internal data private, but I want to pull in things like public well data or real-time crude prices or,

18:43 you know, my Gmail or my Slack, or do a web search, or GitHub, or whatever else is out there. If there's an API to it, you can set up an MCP server. So it allows you to ultimately

18:55 contextualize your data with external data without leaking any of your private data to whatever service you're using. And so what that ultimately looks like is, a query comes in, it goes through a

19:07 router based off of your question. And so if your question can be answered via your documents, it goes and looks in your vector store for that. If it's coming from a database like Postgres or SQL,

19:18 it can go query for that. If it's coming from third-party software, your SaaS tools, or wherever else, it can actually make those API calls there and then bring all of that data

19:28 back to one place, contextualize it, and give you one succinct answer, instead of you having to go five, 10 different places to get single bits of information to try and put together. So I'm super

19:40 excited about this. We're slowly starting to roll some of these out into our stack as well, so be on the lookout for that. But again, very similar concept, but this is just a diagram that I made

19:50 that kind of represents, you know. something that might be more

19:55 related to what we're doing. But the user query

19:58 comes in, so you search for something. This is kind of similar to actually the W-10 workflows, minus the external data. But your query comes in, it searches your RAG, it also searches your

20:11 production database, and then it does, what is that? A well database MCP, so then it pulls public well data and real-time oil prices from another MCP server, and then it can pull all of that data

20:26 directly into your answer from one query. And so this is where everything's going, this is, in my opinion, this is where LLMs make the most sense, is sitting as a layer on top of our existing

20:38 SaaS tools and products and databases. 'Cause ultimately, no one likes going 20 different places to find the data that they need. No one likes having all those tabs open, all of that stuff, and

20:50 so you can literally layer an LLM on top of that, as long as there's access via a database or an API call, and have one single pane of glass across your entire organization, across all of your data.
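A rough sketch of that router layer is below. run_sql, call_mcp_tool, vector_search, and call_llm are hypothetical helpers, and the keyword routing is a toy stand-in for whatever a real router does; the point is the fan-out to RAG, SQL, and MCP tools, and the single combined answer.

```python
# Router sketch: send one query to the right backend (vector store, SQL, or an MCP-style tool)
# and hand everything back to the LLM in one place. All helper functions here are hypothetical.
def route(query: str) -> str:
    q = query.lower()
    if any(word in q for word in ("price", "crude", "wti")):
        return "external_api"   # e.g. an MCP server wrapping a market-data API
    if any(word in q for word in ("production", "volume", "boe")):
        return "sql"            # production database
    return "rag"                # default: the document vector store


def answer(query: str) -> str:
    backend = route(query)
    if backend == "sql":
        # Placeholder well ID; a real workflow would resolve it from the question.
        rows = run_sql("SELECT prod_date, oil_bbl FROM production WHERE well_id = ?", ["PLACEHOLDER-WELL-ID"])
        context = str(rows)
    elif backend == "external_api":
        context = call_mcp_tool("get_crude_price", {"benchmark": "WTI"})
    else:
        context = "\n".join(chunk["text"] for chunk in vector_search(query, k=5))
    return call_llm(f"Context:\n{context}\n\nQuestion: {query}")
```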

21:01 But yeah, that's all I had for that. I know Bobby's got some questions. Yeah, so I mean, I think, probably for a lot of people here, most people aren't even using AI

21:12 or RAG yet, necessarily, internally. What do they need to be thinking about to get their data AI-ready? I mean, what are you seeing? Where have people been really prepared, at

21:21 least on one side, whether it's file structure or the databases or whatever, that actually makes it easier to deploy AI like this? For sure, no, so that's the thing. Like, people, you know, one of

21:32 the

21:34 bubbles in that first

21:37 schematic was metadata. And so metadata is incredibly important in your vector database and in your RAG because that allows it to contextualize things even more. And so if you have an organized

21:50 folder structure of, like, asset, and then within that asset, you've got drilling, reservoir, production, whatever, we will absolutely use that to help contextualize the data

22:04 and pull that out as metadata for the documents that we have in there. And it makes the searches that much better because it's there. One of the things we're working on internally is building out

22:13 agents and stuff that will go through and pull out specific types of information. Think of, like, dates, right? Everyone asks the LLM about different things regarding time. Show me a timeline, show

22:25 me a history, et cetera. Well, if we're not pulling dates out and using that metadata, then it's not gonna have the best idea of exactly how to structure that, how to order that, all of those

22:35 things. And so there's a very real future where we have an agent that just goes into every single document. It looks at the file name to see if there's a date there. If there's not, then it goes

22:43 into the document and parses it to find it there. And then if it finds it, it adds it automatically as a piece of metadata in the database and moves on. And that's how we're able to start doing these

22:51 things at scale, without having to have a horde of people, you know, constantly manually managing these things.
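That date agent can start as something very simple; here is a minimal sketch that checks the file name first and falls back to the document text, with the regex patterns as illustrative examples only.

```python
# Metadata-agent sketch: look for a date in the file name, then in the body text,
# and tag whatever is found as document metadata.
import re
from datetime import datetime

DATE_PATTERNS = [
    (r"\d{4}-\d{2}-\d{2}", "%Y-%m-%d"),
    (r"\d{2}-\d{2}-\d{4}", "%m-%d-%Y"),
]


def find_date(text: str) -> str | None:
    for pattern, fmt in DATE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            try:
                return datetime.strptime(match.group(0), fmt).date().isoformat()
            except ValueError:
                continue
    return None


def tag_document(filename: str, body: str) -> dict:
    doc_date = find_date(filename) or find_date(body)  # file name first, body second
    return {"filename": filename, "doc_date": doc_date}
```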

23:03 But the big thing is, yeah, you know, if you have crappy data, or if it's not very structured, not very organized, it's going to be much harder to do than if, you know, it's all digital, things like that. But the more information the data

23:15 has,

23:17 excuse me, which is what metadata is, the better, generally speaking. Yeah, and I think a lot of this was about unstructured data, but even getting back into the structured data world, I mean, I

23:27 think having good models, whether you're using stuff like Databricks or SQL or whatever it is. Um, I mean, I'm pretty sure you guys get the questions about text

23:38 to SQL, or being able to ask questions in natural language and have it go, you know, query the database. Like, what do you see in there, and what do people need to have in place to make that

23:46 as efficient as possible? Yeah, so, one: if you move to a cloud version of a software service, make sure you still have access to the data that they are putting in the cloud so that, if you want to

23:59 integrate it with tools like this, you can. That has come up, that's fine. But, you know, the big,

24:08 sorry, what was the question again? I mean, I guess, if you want to be ready to be able to do, like, say, text-to-SQL... So on the SQL side, obviously access is the most important piece of that, some

24:21 programmatic way that, you know, something external can access that data, whether that's an API or a SQL query, but there's two ways to do text to SQL. You can hard-code the queries. So for

24:32 example, with, like, the W-10 regulatory workflow, we're hard-coding the calls into the production database because it's the exact same thing every single time. We're looking for the same table, the

24:42 same fields, just different wells. So that one is just hard-coded in there. The other way to do this is to have it be a tool call, where the LLM says, oh, I need data from the production

24:54 database, I'm going to go call that production database. What it's fed is essentially a description of the database and then descriptions of every single field in that database. And then it's

25:07 generating, it's writing its own SQL query basically based off of that and then making that call. So that's really cool, but it's also less controllable than just hard-coding the query itself. That

25:19 being said, it's getting better and better. And of course, the better you understand those tables and the more data you have about those tables and those fields and those tables, the better it

25:28 will work. Yeah, and I think that's where, and it always sounds good and usually documentation is out of date as soon as you write it. But I think where you can have proper data documentation,

25:40 data dictionaries, anything like that, where you feed that context in. Or say you're using tools, like I use a tool called dbt a lot, where it has relationships between the models, and/or ERDs

25:52 or any of these kinds of things, where it knows the primary and foreign key relationships. I think that's the context that LLMs need for that. Or even just building out star schemas; everything old

26:00 is new again, but now semantic models are a thing, and it was a thing, what, 20 years ago, with SSAS, with OLAP cubes and everything else. And now that's the exact same engine that's running behind

26:13 Power BI on the back end, but that's why I say Microsoft's able to do some of this text-to-SQL, or ask a question of a Power BI data set and get the answer back, because it actually has the

26:22 context to, and knowledge of, those data sets. Yeah, and so in my experience, it's much better today, and this has probably changed a bit since I last looked, but historically speaking, it's

26:34 much more effective and reliable if it's just calling a single table than if you're trying to get it to call a bunch of tables and do a bunch of joins. The more complicated your SQL query, the

26:44 harder and the less accurate or repeatable I think it can be, but there's ways to combat that, like flattening the data ahead of time.
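For the tool-call flavor of text-to-SQL, a minimal sketch looks something like the following. The schema description, table, and call_llm client are all illustrative assumptions; the guardrail reflects the point above that a generated query is less controllable than a hard-coded one.

```python
# Text-to-SQL tool call sketch: describe the table to the model, let it write the query, run it.
import sqlite3

SCHEMA_DESCRIPTION = """
Table: production
  well_id   TEXT  -- well identifier
  prod_date DATE  -- production date
  oil_bbl   REAL  -- daily oil volume, barrels
  gas_mcf   REAL  -- daily gas volume, mcf
"""


def text_to_sql(question: str, conn: sqlite3.Connection) -> list[tuple]:
    prompt = (
        "Write a single SQLite SELECT statement (no commentary) for this schema.\n"
        f"{SCHEMA_DESCRIPTION}\nQuestion: {question}"
    )
    sql = call_llm(prompt)  # hypothetical LLM client that returns raw SQL text
    if not sql.strip().lower().startswith("select"):
        raise ValueError("Refusing to run anything other than a SELECT")  # basic guardrail
    return conn.execute(sql).fetchall()
```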

26:54 And John and Bobby, two questions I get asked a lot when we're out talking to folks, and I'll give the example. One client we're talking to has 39 terabytes

27:05 of data, and 29 of the terabytes are duplicates.

27:11 What are strategies to deal with that? And then the second question we get, particularly in E&P: we're out there making acquisitions every day, we buy a company, we get their data on their file

27:26 sharing system that looks nothing like how we organize stuff, and we just kind of keep it the same way. Are there strategies, AI use cases, to deal with either one of those? What was your first one?

27:42 The first one's

27:45 39 terabytes and 29 are duplicates Yeah, so duplicates are actually -

27:50 That problem is straightforward: you essentially hash the document, which is creating this giant string, aka a hash, of all the contents of that document. And so you can then just compare

28:03 hashes to see if you have duplicates or things like that. But it also brings up a good point, like on the contracts side, right? There's multiple versions of contracts or they get amended and

28:14 things like that. So like the date and the version, those are pieces of metadata that we want to extract to make sure that we can keep those in order 'cause when Ariane asks about a contract, he

28:24 wants the most updated version of that contract unless he specifies otherwise. And so that's on us to make sure that we do that. But again, that boils back down to the metadata and the structure of

28:34 the underlying data set.
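The hashing step itself is only a few lines; here is a minimal sketch, streaming file bytes so it works on large shares.

```python
# Duplicate detection sketch: identical content hashes mean identical files,
# so the duplicate terabytes can be flagged without anyone opening them by hand.
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # read 1 MB at a time
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root: str) -> dict[str, list[Path]]:
    seen: dict[str, list[Path]] = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            seen.setdefault(file_hash(path), []).append(path)
    return {h: paths for h, paths in seen.items() if len(paths) > 1}
```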

28:38 Your mapping problem, as far as acquiring data and then getting it into your structure, there's ways to do that using language models. It's not gonna be, like, an exact science that will work on

28:51 everyone's data sets, but it's ultimately a mapping exercise. And, you know, if you say, okay, well, we have a drilling folder and they

29:04 call it operations, that could literally just be a mapping exercise without an LLM being involved. But if you wanted to get more granular, you can use LLMs to classify documents and say, tell me

29:16 what type of document this is, is it a contract, is it whatever, and now you've got a much better idea of exactly what's in there without a human having to go and open every single file. One thing I

29:29 will say about that is, for whatever reason, our industry is really gung-ho about shoving a bunch of stuff into one PDF that may or may not be related to the other contents that are also within that PDF,

29:42 and so that's another piece of this that gets kind of tricky, and I've noticed this in either, like, M&A-type stuff or just, like, oh, that's from the bankers boxes that we've had

29:53 sitting around for 25 years.
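Here is a sketch of the page-level classification idea mentioned above, so a combined "everything for this asset" PDF can be split downstream. The label list and call_llm client are assumptions for illustration; the page text would come from whatever extraction step is already running.

```python
# Per-page classification sketch: ask an LLM what each page is, then group pages by type.
PAGE_TYPES = ["drilling_report", "post_job_report", "legal", "well_schematic", "other"]


def classify_page(page_text: str) -> str:
    prompt = (
        f"Classify this page as one of {PAGE_TYPES}. Reply with the label only.\n\n"
        f"{page_text[:2000]}"
    )
    label = call_llm(prompt).strip()  # hypothetical LLM client
    return label if label in PAGE_TYPES else "other"


def split_combined_pdf(pages: list[str]) -> dict[str, list[int]]:
    """Group page numbers by detected document type so the PDF can be split into real documents."""
    groups: dict[str, list[int]] = {}
    for number, text in enumerate(pages, start=1):
        groups.setdefault(classify_page(text), []).append(number)
    return groups
```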

29:56 Yeah, it's,

30:02 it's just, it's a tricky problem. But using the language model, you could literally have a different document type for each page of one PDF, if that makes sense, right? Like if you just had

30:14 a third party scan in a bunch of documents and they weren't paying attention: oh, these are all the documents from this asset. But within that one document, you've got drilling reports, you've

30:24 got post-job reports, you've got legal, you've got all these things. Well, that document should be split up into multiple documents instead of just one, because contextually, all those things

30:34 in that document are very different. And so that's a tricky problem. Yeah, and how is the progression, we talked a little bit about this, of it learning

30:46 the context? So you do one project, you learn a bunch of contextual things about the oil and gas industry, the energy industry. How is it possible for it to learn that context just like a young engineer or

30:57 young tech would, such that when it goes to the next project, it uses that context and then adds on, and it starts to build on itself, such that it eventually gets to a point where you tell it a

31:09 mission and tell it to go figure it out, like you would an actual person? I know we're not there now, but is that possible, or do you see a path to that? I mean, we're seeing people do this already,

31:22 right? Like, I mean, we've been training Google's computer vision models for decades with CAPTCHA. No one realizes that CAPTCHA is literally just us doing computer vision

31:33 identification for Google, and we have been for decades, but that's why their vision is so good, because they've got a ton of data on that. So as far as this stuff goes, we're seeing that, like, we use

31:44 Cursor or we use all these different coding tools, and, you know, the more that you use it, the more it understands about you, because it has the history of what you've

31:54 searched. You know, when I initially started using ChatGPT, I would ask it, what's the formula, or how do I calculate flush volume? It would give me the volume of the toilet. And now, when I ask

32:04 it that, it knows that I work in energy and I actually mean the volume of a well, without me having to say well in the query. And so, but what that is, is they're just using the data, the

32:16 historic data that they've got from your interactions and from all of your searches, to refine the system prompt, basically, whenever you ask your questions. With Collide, we absolutely have an intention

32:27 of building, you know, allowing the profile of each user to be tweaked to what that user is based off what they're searching or based off what we know about them and things like that. But, you

32:39 know, it's a chicken and egg thing. You've got to have enough data in order to then go and train it to do those things, but it still involves training. But what a lot of the tech companies are

32:47 doing, or trying to do, is this federated learning thing where everyone is doing different things at the same time, that goes back to the main model, that gets updated much faster because it's

32:56 learning across all kinds of variances and different types of data and all of that stuff and then gets pushed back down once that model gets better. And so that's probably the closest we are to it

33:08 today is, but again, it's still training itself. It's ultimately just training itself, right? Instead of having a human in the loop, they've built all these pipelines and stuff on the

33:18 back end. Like, right now, if you use the copy button in ChatGPT, it sends an event back to OpenAI, basically telling them that that response was a good response, from your actions, without you having

33:30 to go in and click the thumbs-up button or whatever. But those are the really interesting ways that they're able to use user behavior to train and improve these things. Is that kind of what Colin

33:41 was saying, that's the reinforcement learning side of it? Yeah, so we just brought someone on from Microsoft that I'm super excited about, because they're going to do a lot of work

33:53 on this. It seems like if you could mimic a young engineer that does the learning and figuring here, and takes the context so it can just figure it all out, that would be the whole thing

34:04 to be able to do. Absolutely. Yeah, and I think, to Bobby's point, that's exactly kind of where we are headed. You want to build this

34:18 knowledge graph, right, and these nodes and these neural networks that over time can self-train through reinforcement learning to fine-tune models through all of these things. So that, to find two

34:34 particular things: one is more specificity within oil and gas models, and then also, over time, taking your personal data, or a different operator's data, or something else, and

34:47 then just continuing to train that machine. That's why compute's going to become a really important conversation here, because to do that very effectively you need more and more heavy equipment.

34:58 Otherwise you can't do it, because there's so much data behind this. And that was one of the fundamental reasons why you wanted to go out and get someone very much trained in traditional ML, AI,

35:10 from, like, even Microsoft backgrounds, and

35:13 start

35:15 building more of that out, beyond just documents. Yeah, I think there's just one last thing to drive home here, and that's that traditional data management and just data practices

35:31 are actually more important maybe than they've ever been. And I think when all this came out, there was, you know, the hype cycle, and like, oh, wait, this is actually going to cover up

35:41 all of our warts, right? I can just give you a huge S3 bucket of all my data and it'll figure it out, right? Or an Azure blob, right? Well, that's not the case. I'm not

35:54 gonna argue with you about it, but it's actually even more important now that you have your data organized and in order, so that you can provide the proper context into the

36:03 AI models. 'Cause again, I think people thought it was gonna be a silver bullet: oh, finally, it's fine that I was behind on data, 'cause now I can just give it everything I've got

36:11 and it'll just figure it out for me. But we're definitely not there yet

36:15 But, you know, if you were to pay attention to OpenAI and Claude and all the big foundational companies, they want you to think that it's that. Like, they do a great job of making it look like

36:25 magic. Or like, you know, ChatGPT used to be terrible at doing any kind of math. And then they figured out that it was bad at doing math, so now they let it write a Python script and they execute

36:36 the Python code and then they give that back to the language model. But like, if you don't understand that, if you're not paying attention, you're just like, oh, well, it just did math.

36:44 That's great. They built a function for it to do that explicitly. They built the function for it to go search the web. There's all these things happening under the hood that aren't

36:54 just part of it out of the box. They're explicitly built for certain problems that they're trying to solve. And so that's another piece of this. It's like, yeah, people's

37:06 expectations are set off of what the foundational models can do. But then if you go to anybody else, those are things that either have to be built or have to be added to the existing stack that

37:16 aren't just native out of the box.

37:20 Yeah, and the proverbial crap in, crap out is actually getting worse, 'cause they can hallucinate on top of the crap. Yeah, that happens. But that's another benefit of RAG: constraining it to your

37:32 data, it might hallucinate, but it'll be grounded in a document and not just made up out of thin air. But yeah,

37:41 the data problem has existed since ML came on the scene, right? And that's why, you know, from the oil and gas perspective, it took us five to 10 years to really get into production

37:55 with useful ML models, because everyone was like, oh yeah, we'll just throw a machine learning model at it. Then they got down the path and, like, oh, half of our data sucks. Or none of

38:05 it's normalized or whatever it may be. And so they had to take the time and spend the money to invest in getting the data where it is structured, it's cleaned, organized. And then the models work

38:17 and they work great. It's the same thing with natural language models, right? So, and that's honestly one of the talks that I do. I tell everyone, machine learning came around because we

38:27 generated so much machine-type data, numeric data, that we as humans couldn't have processed all of it, and we needed a machine to help with that. Large language models are the exact same thing,

38:37 but for text. On the topic of context, do you find that materializing the context, like a dictionary or some type of table, is, like, your go-to? Or kind of, like, how do you provide it? Like, I

38:51 think it's hard to rely on a person for all of that, so to me, something like that works out. But

39:02 what is that, actually? Yeah, no, I mean, the nice thing with RAG is that we can just give it a dictionary as a data set, and now it has that in it. And then we're working on

39:13 fine-tuning models that have that built in directly, so that they can recognize all of those different fun jargon terms that we have in our industry and things like that. So, is there one

39:26 that you share, or that you define, perhaps?

39:30 No, it's a dictionary that we build from a bunch of different customers across the industry. But yeah, that's another thing with RAG. It's like, you don't have to train these things. You just

39:40 give it the context.
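Giving it the context can be as simple as prepending a glossary to the system prompt (or storing the dictionary as just another RAG document). The terms and definitions below are illustrative examples only, not Collide's actual dictionary.

```python
# Context-over-training sketch: hand the model an industry glossary instead of fine-tuning it.
GLOSSARY = {
    "flush volume": "fluid volume needed to displace a frac stage from surface down to the perforations",
    "screenout": "premature end of a frac stage when proppant bridges off and pressure spikes",
    "WOR": "water-oil ratio",
}


def system_prompt_with_glossary() -> str:
    glossary_lines = "\n".join(f"- {term}: {meaning}" for term, meaning in GLOSSARY.items())
    return (
        "You are an oil and gas assistant. Interpret industry jargon using this glossary:\n"
        + glossary_lines
    )
```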

39:43 And it doesn't have to be RAG. It can be your ChatGPT.

39:49 If you don't include the context that you're trying to get the answer to, it's not going to do a very good job generating an answer, because that's how it generates the answer:

39:59 based off of what you ask it. And so, like, if you're doing API calls, if you're doing any kind of coding, give it the documentation, link to the GitHub, like, there's so

40:09 many ways to get context into these things now that make the answers infinitely better, infinitely faster, and way more useful. You're spending way less time iterating and telling it to fix things.

40:21 Also, that's a terrible follow-up prompt: if it generates code that doesn't work, just saying 'fix it' does not help it very much. You need to tell it what is wrong or what it needs to fix. My

40:33 favorite story when we're out there talking: one of the big pitches of AI is you can talk to all the legacy software, you can talk to all the disparate data sources, and that's truly where the magic

40:45 happens. And anytime they ask me, Oh, can you talk to this? I always say, Yes. John rolls his eyes in the meeting, and I watch what color red he turns to know how hard it's gonna be to actually

40:59 talk to that software. So,

41:03 we have a great relationship.