Making artificial intelligence practical, productive & accessible to everyone. Practical AI is a show in which technology professionals, business people, students, enthusiasts, and expert guests engage in lively discussions about Artificial Intelligence and related topics (Machine Learning, Deep Learning, Neural Networks, GANs, MLOps, AIOps, LLMs & more).
The focus is on productive implementations and real-world scenarios that are accessible to everyone. If you want to keep up with the latest advances in AI, while keeping one foot in the real world, then this is the show for you!
Welcome to the Practical AI podcast, where we break down the real world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're in the right place. Be sure to connect with us on LinkedIn, X, or Bluesky to stay up to date with episode drops, behind the scenes content, and AI insights. You can learn more at practicalai.fm.
Jerod:Now onto the show.
Sponsor:Well, friends, when you're building and shipping AI products at scale, there's one constant: complexity. Yes. You're wrangling models, data pipelines, deployment infrastructure, and then someone says, let's turn this into a business. Cue the chaos. That's where Shopify steps in, whether you're spinning up a storefront for your AI powered app or launching a brand around the tools you built.
Sponsor:Shopify is the commerce platform trusted by millions of businesses, powering 10% of all US ecommerce, from names like Mattel and Gymshark to founders just like you. With literally hundreds of ready to use templates, powerful built in marketing tools, and AI that writes product descriptions and headlines for you, and even polishes your product photography, Shopify doesn't just get you selling, it makes you look good doing it. And we love it. We use it here at Changelog.
Sponsor:Check us out at merch.changelog.com. That's our storefront, and it handles the heavy lifting too. Payments, inventory, returns, shipping, even global logistics. It's like having an ops team built into your stack to help you sell. So if you're ready to sell, you are ready for Shopify.
Sponsor:Sign up now for your $1 per month trial and start selling today at shopify.com/practicalai. Again, that is shopify.com/practicalai.
Daniel:Welcome to another fully connected episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined by Chris Benson, my cohost, who is a principal AI research engineer at Lockheed Martin. In these fully connected episodes where it's just Chris and I, we try to dig into a few topics or deep dive into some learning resources that will help you level up your AI and machine learning game. Looking forward to this one, Chris.
Daniel:I think in reflecting before the episode, both of us are going into American Thanksgiving, which is tomorrow as we're recording this, but going in with a lot of gratitude for the year. A lot happens in life, and it's a nice time to kind of reflect and see the blessings that we have at Thanksgiving. And, yeah, what a blessing to just keep doing this show for going on eight years now.
Chris:It's been a moment.
Daniel:Having a lot of fun, stepping on a few mines along the way, but having fun generally. And I think, yeah, thankful to our listeners as well, just to take a moment to say thank you for sticking with us all these years. Chris and I have a lot of cool plans for the coming year and there's energy behind the show, lots of ideas going on that we'll talk about soon. But yeah, thank you to our listeners for sticking with us.
Chris:Couldn't say it better. Thank you to the listeners for sticking with us. And I gotta say, these fully connected shows in a lot of ways are so much fun. They're among my very favorites, because in a typical episode we get to talk to these most amazing guests, you know, where you're talking to some of the smartest people in the world, and being able to kind of understand how they see it and learn. And I know our listeners go along for the ride on that.
Chris:But I also love when we just talk, you know. It's the Wednesday afternoon before Thanksgiving for you and me as we're recording this. I know people will be listening to it just after Thanksgiving. But it's a lot of fun just to jump into the conversation. And I know we have some fun things to hit today. So I'm relaxed and looking forward to it, Daniel.
Daniel:Yeah, yeah, for sure. And I don't know a more exciting topic for the Thanksgiving dinner table than document processing, which is what I kind of brought forward today. I guess what I was realizing, Chris, is we talk a lot about large language models. We have talked a lot about computer vision type of things on the show, maybe not as much recently, but over the years. We've talked about all the kind of chatbot stuff and all of that, but I think kind of lurking below the surface of a lot of work in industry is document processing.
Daniel:As the years have gone along and we've kind of entered into the generative AI kind of revolution, there has been also this kind of stream of innovations in relation to processing documents in an automated way with models. And of course that reaches very practical places in terms of everyday business work, right? I think often the most valuable workflows that people have day to day or maybe the most annoying ones is, you know, this person sends me an email with this document. I've gotta extract this or do this or, create a summary of that. Or, I have new documents that are, you know, regulations related to compliance and I need to process them and get them, you know, into somewhere.
Daniel:And that's really kind of at the center of a lot of what happens in businesses day to day. So yeah, I thought, as we hopefully aren't yet in a coma after eating too much turkey, we could use this time when we're alert to talk about some of that.
Chris:Great point there. And, like, I kind of hate the name document processing. Before everyone out there goes, oh my God, they're talking about document processing, and goes to sleep: this is pretty cool stuff. And it's
Daniel:important, because it's widespread.
Chris:Yeah, absolutely. And it is productive, and we pride ourselves, you know, on bringing that practical, productive, and accessible approach to it. And I think that's really important. I think one of the differences in the conversations we have on the show versus some other shows is the other ones tend to chase the headlines and the glam and stuff, and we're really focused on getting people into this technology so that they can use it day to day in a fun way. And so, before you turn off and go, oh, I'm tuning out for turkey, not document processing:
Chris:This is pretty cool stuff. As Daniel said, this work has been going on all along; it just doesn't get the headlines anymore like it used to. And so it's really worth diving into and saying, hey, look at what's possible now versus the last time we talked about it.
Daniel:Yeah. And probably what initially prompted this is, of course, I mean, we've been working with some of these models internally, but also DeepSeek did release a DeepSeek OCR model, which people have been talking a lot about, which represents at least part of this stream of work that's been going on around document processing models. Now, just so people kind of have, I guess, a little bit of background or jargon kind of where we're headed, my thought is we really need to kind of pick apart some of these different kinds of modeling, how they fit in and where they're practical, maybe where they're not practical. And in particular, there is OCR, which has been around for the longest, I guess, in terms of the things that we'll talk about, which is optical character recognition. That's what that stands for.
Daniel:Then there are language vision models, or LVMs, which are a more recent development. Then there are, I guess, document structure type of models, kind of like Docling, people might've heard of Docling. And then finally, there's this latest model, DeepSeek OCR, which is different from what people might think of in terms of OCR. And so there's these different kind of categories or families of methodologies here. And there's really, like you say, Chris, a lot happening in these different areas, but that's kind of where we're headed in the conversation, I guess, for those listening, as we kind of pick apart some of these things.
Daniel:I don't know, Chris. I mean, I kind of remember OCR being around for a very long time. I mean, neither one of us, I think, grew up with computers that had OCR on them, or with computers in general. I do remember in grad school, you know, processing some papers or other things and applying some type of OCR in some tools on these. But yeah, what's your history there?
Chris:Yeah, well, I mean, early OCR was really not very good, and this was certainly before kind of the current generation of AI. And I'm using generation very broadly here, like the last fifteen years. And it's come a long way with these new technologies and stuff. I know when I was younger, some of the kind of pre AI OCR technologies just were not working for me, like it was almost costing me more effort than it was worth. So things have changed dramatically.
Chris:I mean, it's so good now and there's so many approaches to it as we're gonna dive into.
Daniel:Yeah, yeah. And I think that maybe a good starting point for that, if we just start with OCR, is really thinking about the processing pipeline and the different components that are involved in it, because that really drives what compute is needed, how fast it is, how performant it is, you know, and it kind of distinguishes it as a category. So if we just start with OCR, I think we could do that. Now, just by way of reference in terms of how things are processed through a kind of quote classical OCR model or a typical OCR model, these would be things like Tesseract or PaddleOCR, these sorts of technologies that we're thinking of. What happens is an image is input and then, ideally, text or characters are output.
Daniel:If we just contrast that, because everyone's talking about LLMs now, the typical processing pipeline with LLMs is, you know, not images coming in, but text coming in. That text is split apart into tokens. Those tokens are assigned indices within a vocabulary. That kind of array of indices is embedded into a dense representation, often by a transformer based model. And then what is predicted on the output side is an array of probabilities corresponding to different tokens, such that you can know what is the most probable next token coming out of the model. So you kind of have text come in, that text is split apart into tokens, that's embedded, and then output are these probabilities of the next token.
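To make that flow concrete, here's a minimal sketch of the text-in, next-token-probabilities-out pipeline, assuming the Hugging Face transformers library and using GPT-2 purely as a small stand-in model:

```python
# Text -> token indices -> embeddings (inside the model) -> probabilities
# over the next token, as described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Document processing is"
inputs = tokenizer(text, return_tensors="pt")  # text -> array of vocabulary indices
with torch.no_grad():
    logits = model(**inputs).logits            # one row of scores per input position

next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # probabilities over the vocab
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {prob:.3f}")       # most probable next tokens
```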
Daniel:So if we just contrast that with the OCR model, first of all, we have a different type of input, right? We have an image, and that image is made of pixels. And the output is actually not dissimilar to the LLM's: there is an output of probabilities at the end. It's just an output of probabilities of characters.
Daniel:So what happens is if you look at a big image, it might have regions of characters in it or words. And what happens in the OCR model is you take that big image with a lot of characters. There might be some pre processing on the image, like a resizing or something. But then there is one kind of model that detects the areas or regions where there are characters or text, text regions. And then you take each of these text regions and you put it through something like a convolutional neural network or an LSTM. And that then outputs, through a sequence model, a probability of characters, or the probability of what character corresponds to that region, right?
Daniel:So essentially that OCR model is really just looking at that big image, determining where there are characters or text regions, and then, for each of those, predicting what that character or text region is, right? So that's how the processing goes, which in some ways seems almost brute force, right? You're splitting it apart into all of these regions, right?
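As a rough sketch of that detect-then-recognize flow, here's what it looks like with a classical engine, assuming Tesseract is installed along with the pytesseract and Pillow bindings; the input file name is hypothetical:

```python
# Classical OCR: find text regions in a page image, then recognize the
# characters in each one.
from PIL import Image
import pytesseract

page = Image.open("scanned_page.png")  # hypothetical input scan

# image_to_data runs both stages and reports each detected region:
# a bounding box (the "where") plus recognized text and a confidence
# score (the "what").
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data["text"]):
    if word.strip():
        box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        print(box, word, data["conf"][i])
```

Note how lightweight this is: it runs comfortably on a CPU, which is exactly the efficiency point made below.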
Chris:As you were talking though, I was also thinking back over the history of the show and we're talking like, this is the first time I think you've said LSTM you know, in
Daniel:a while
Chris:in a bit. Yeah. How many years has it been since we talked about that and recurrent neural networks, you know, which were also involved, and then kind of transformers also starting to bridge the gap there. Wow. Taking us back a little ways there.
Daniel:We're good. Taking us back. Yeah. So this is really, in a lot of ways, a brute force type thing. You're really splitting apart that image into these different regions, and then for each kind of trying to detect which character.
Daniel:Now, similar to what you were saying, we're talking about maybe convolutional models or architectures, maybe LSTMs, which is a long short term memory recurrent type of network. Traditionally, in these tools like the OCR tools, these are rather small models by today's standards. And as such, even though you're kind of brute forcing all of these characters, they are fairly efficient in terms of where you can run them. So I can run one easily on my laptop. I can run it on a CPU.
Daniel:I don't have to have a large GPU.
Chris:True. You know, it's interesting how that evolution and the different kind of branches of possibility, in terms of how you might approach the problem, have developed. Do you have any kind of thoughts around, as we went from LSTMs and got to convolutionals and then transformers started making an impact on that? You know, maybe after we come out of the break, we can talk a little bit about how those evolved and why the different selections became kind of primary over time.
Sponsor:Well, friends, it is time to let go of the old way of exploring your data. It's holding you back. But what exactly is the old way? Well, I'm here with Mark DePuy, co founder and CEO of Fabi, a collaborative analytics platform designed to help big explorers like yourself. So, Mark, tell me about this old way.
Sponsor:So the old way, Adam, if you're a product manager or a founder and you're trying to get insights from your data, you're wrestling with your Postgres instance or Snowflake or your spreadsheets, or maybe you don't even have the support of a data analyst or data scientist to help you with that work. Or if you are, for example, a data scientist or engineer or analyst, you're wrestling with a bunch of different tools, local Jupyter Notebooks, Google Colab, or even your legacy BI to try to build these dashboards that, you know, someone may or may not go and look at. And in this new way that we're building at Fabi, we are creating this all in one environment where product managers and founders can very quickly go and explore their data regardless of where it is. It can be in a spreadsheet, it can be in Airtable, it can be in Postgres, Snowflake. Really easy to do everything from an ad hoc analysis to much more advanced analysis if, again, you're more experienced.
Sponsor:With Python built in right there, and our AI Assistant, you can move very quickly through advanced analysis. And the really cool part is that you can go from ad hoc analysis and data science to publishing these as interactive data apps and dashboards, or better yet, delivering insights as automated workflows to meet your stakeholders where they are in, say, Slack or email or spreadsheets. If this is something that you're experiencing, if you're a founder or product manager trying to get more from your data, or if you're on a data team today and you're just underwater and feel like you're wrestling with your legacy, you know, BI tools and notebooks, come check out the new way and come try out Fabi.
Sponsor:There you go. Well, friends, if you're trying to get more insights from your data, stop wrestling with it. Start exploring it the new way with Fabi. Learn more and get started for free at fabi.ai. That's fabi.ai.
Sponsor:Again, fabi.ai.
Daniel:Yeah, Chris. So you were just kind of getting into, I guess maybe why, assuming we have OCR, right? That does work in the sense that you can predict characters, you can pick out these text regions. So OCR models have obviously got better over the years. So why is there a need for something else?
Daniel:Why is there a transition to maybe other architectures or other things? So what I would say is, if you think about that process of the image coming in and you splitting apart those text regions, you kind of end up with all of this plain text output, and any sort of logic around the reconstruction of that document, especially related to the layout of the document, is problematic, I would say. And I would say these are often highly dependent on the actual quality of the pixels that are input. Remember, the pixels are input here, and often the images are kind of resized on the inputs to these models, or they need to be, just in terms of the input size. So you've got this combination of problems: not having an understanding of the layout, but also requiring kind of clean scans of the documents, if you will, which is definitely a drawback of this approach, I would say.
Chris:Yeah, I mean, I can remember back in the day with the traditional OCR, I mean, that was not just a problem, but it was constant. You would use OCR on a document and you had to pretty meticulously go through the document afterwards to correct a lot of the errors in it. That didn't change really until we got past the traditional into more of the vision based models. So definitely seeing the progression there.
Daniel:Yeah, yeah. And I mean, that kind of naturally transitions us into one of the things that is now a part of our world and helps with at least a part of that problem, the structure and layout problem, which are what are called document structure models. One of the most popular of these is called Docling. And there's different families of these. Docling, it might be confusing, because there's some models that are kind of labeled as Docling models.
Daniel:There's also a toolkit called Docling that IBM released, which isn't actually just one model. It's a whole series of pipelines and options around document processing. But one of the core concepts here, whether it's in use in that library or in reference to a model, is that a document structure model, in terms of what it does, actually doesn't do any OCR. It doesn't detect text. It doesn't convert images to text and this sort of thing.
Daniel:What it does is it tries to predict the structure of the document, or a structured representation of the document. Because remember, with OCR, we don't really have that, right? We just have the prediction of these characters in these different, you know, croppings of the image. And so with Docling or a similar document structure model, what happens is you have that document that's input, a document or an image. And then what happens is that a kind of parser extracts layout primitives. So that might be like rectangles or certain shapes or vectors or fonts.
Daniel:And then a layout model, again kind of part of this document structure model, makes predictions for what those regions should be classified as. So things like titles or paragraphs or headings or tables, etcetera. And then the output of the model, rather than predicting characters again (so I'm not getting text out of this, I'm not getting characters or text), is a structured output representation of the document, usually in kind of JSON, markdown, or HTML format, which basically tells me, okay, you put in this document, over here is a table, over here is a title.
Daniel:This region corresponds to a heading. There's a paragraph over here. And that way, when you have these more complex documents, maybe two column papers or white papers with a bunch of tables or data sheets or that sort of thing, you kind of have this structure laid out. You have the classification of that structure. And so actually a Docling model, or this type of document structure model, is often used in combination with an OCR model.
Daniel:And it would kind of go like: document comes in, you detect all the structure of the document, right? Over here's a table, here's a paragraph, here's a title. Okay, well now let me send that title bit into an OCR model and then actually get the text associated with the title, right? And so now you've overcome a little bit of that limitation of the raw OCR by applying this structure on top, and you can reconstruct the document as a markdown document with all the tables and titles and that sort of thing.
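A minimal sketch of that combined structure-plus-OCR flow, using the Docling toolkit mentioned here; the API shown follows Docling's published quickstart, and the file name is hypothetical:

```python
# Docling runs layout analysis (structure) and OCR (text) in one pipeline
# and hands back a structured representation of the document.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("datasheet.pdf")  # layout detection + text extraction

# Export the recovered structure: headings, paragraphs, tables, etc.,
# reassembled in reading order as Markdown.
print(result.document.export_to_markdown())
```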
Chris:It's funny, as you kind of describe going through the process there, as a very loose analogy, for those of us in the audience who are programmers like you and I, it reminds me a little bit of the way programming languages are compiled into this tree structured format. It's called an abstract syntax tree, an AST, you know, where regardless of what the originating language is, it kind of captures the essence of what the program is before it's compiled into machine code or whatever your target is. But it kinda feels like Docling is doing a similar thing, at a higher level obviously, in terms of capturing all that structure out of the doc.
Daniel:Yeah. Yeah. It would be like the OCR model has an output of character probabilities. Right? The LLM has an output of token probabilities.
Daniel:The document structure model actually has an output of this tree structure, or the tree representation of the structure of the document. So it's that kind of processing pipeline where you pick apart these layout primitives and then you classify each one. So really the main piece of this is the classification of each of these elements and then assembling that into this tree structure, which, yeah, is certainly very useful. I think it's worth noting that this does help you handle more complicated documents. Again, though, it doesn't solve the text extraction piece.
Daniel:So you still kind of need to add that piece in. And often this is more computationally heavy than just raw OCR, which can run on CPUs often. I think I've run Docling models also on CPU or in constrained environments. I think Hugging Face released a small Docling model, which is also geared towards that side of things. Obviously, you have the same trade offs with any kind of model size.
Daniel:The smaller ones maybe don't have the same level of performance, but will run on more constrained environments. The larger ones maybe have higher performance, but they might need a GPU to run.
Chris:As we talk about this, would you say that Docling is still a very modern and current way of doing things, given the fact that Hugging Face is releasing models? And are there use cases where you would not necessarily want to go to this, in your view? Like, I get the benefits that we've talked about, but where might you say it's not the right fit?
Daniel:Yeah. I would say that you really kind of want to use this when you need to preserve the structure of the documents that are input and you maybe have complex structures, again, like the data sheets or multi column or mix of columns and other things. This is really useful at that point. But if you just have like a raw scan that's relatively clean and all of it's just text and you need to detect all of that text, then maybe an OCR model is totally sufficient and the structure model is overkill, right? But yeah, I would say this is still very much in widespread use now and quite powerful.
Daniel:We've used it on a few different projects as well with good success. It is still a model that, I would say, even though it's a little bit more computationally expensive than OCR, and we'll talk about language vision models and DeepSeek OCR here in a second, is not at the level of computation of those types of models. Which means you could still embed it kind of within your application or something, maybe run it on a commodity GPU, that sort of thing. So it is still really useful in those ways as well.
Chris:Thinking a little bit about different use cases: we still today, like if you go and use different types of Office tools, and I don't necessarily mean Microsoft Office, but that genre of productivity tools, and you're doing file format changes and stuff across them. I know recently, I think about a week ago, I was trying to move a Keynote into a PowerPoint context, and you would think in 2025 we would have gotten past that, but it didn't go smoothly. Do you think this is something that is either used at some level, or could be used, in terms of trying to capture that kind of complex structure and get it into a different format without losing the gist of what the communication was?
Chris:Am I on target there?
Daniel:Yeah. Yeah. I think the limitation, I guess, is in how rich that description is. Right? Like you might get these labels like heading, title, paragraph, table, etcetera.
Daniel:But ultimately, if you were to need to reconstruct that, you have to decide how you are going to render a table, how you are going to render a title, which may be very different than the original, let's say the Keynote presentation, if you're going and putting it in Google Slides or something like that. So actually, I think that rendering piece is still a quite challenging one. What I would say, and maybe this is a generalization because we've actually used Docling models in other ways than what I'm about to say, but one of the very frequent uses of these models is for the processing of documents that are feeding into, let's say, a RAG, or retrieval augmented generation, pipeline. Why would that be?
Daniel:It's sort of because the cleaner and more context relevant you can make those chunks of text going into your RAG system, the better results you're gonna get in the responses from the RAG system. And so if you're just processing your documents that have some complex structure using OCR, all the text might get jumbled up, and thus the knowledge and the context in the document is kind of jumbled up. Even though all the pieces are there, they might be out of order or something like that. In the case of RAG, you actually don't need to render anything. You just need to parse it well and preserve the structure, right?
Daniel:So actually, I think Docling or these document structure models are a really good way to do that document processing for input to RAG pipelines, because there you probably just need things to be represented well in markdown or some similar text format, not in a cool PDF that you recreate or something like that.
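As a sketch of that RAG-ingestion idea, one could convert a document with Docling and then chunk the structured Markdown by heading so each chunk stays coherent; the splitting below is deliberately naive, and the file name is hypothetical:

```python
# Structure-aware chunking for RAG: convert with Docling, then split the
# Markdown at headings so each chunk keeps its local context together.
from docling.document_converter import DocumentConverter

markdown = DocumentConverter().convert("policy.pdf").document.export_to_markdown()

chunks, current = [], []
for line in markdown.splitlines():
    if line.startswith("#") and current:  # a new heading closes the previous chunk
        chunks.append("\n".join(current))
        current = []
    current.append(line)
if current:
    chunks.append("\n".join(current))

print(f"{len(chunks)} structure-aware chunks ready for embedding")
```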
Chris:Yeah. I'm just thinking, it wasn't too far back, a year, a year and a half ago, that RAG was all new, and now it is so embedded into our workflows at lots and lots of organizations out there.
Daniel:Yes.
Chris:And I'm thinking about the fact that I wonder how many people out there are using Docling in that capacity, you know, as that input to that workflow. And having the contextual aspect of the information saved structurally in that way would probably yeah, I agree with you. I mean, that makes perfect sense intuitively, that you would definitely have a RAG system able to give you better answers on that. Have you seen that in that use case much out there, or is that very much one off?
Chris:What's your gut feeling about that?
Daniel:Yeah, definitely. I would say in particular toolkits like Docling, the toolkit, and there's other ones like MarkItDown, which I think is a toolkit from Microsoft. We've used those over and over in RAG systems, and I know other people do as well. Certainly people also use vision models, which we'll talk about here in a second. But I would say, again, in the RAG system, you wanna preserve that structure.
Daniel:You don't want things out of order, but you really don't care how they're rendered necessarily. You just need to preserve the structure and ordering. And so that works out really good for RAG systems.
Sponsor:So most design tools lock you behind a paywall before you do anything real. And Framer, our sponsor, flips that script. With Design Pages, you get a full featured professional design experience, from vector workflows to 3D transforms to image exporting, and it's all completely free. And for the uninitiated, Framer has already built the fastest way to publish beautiful production ready websites, and now it is redefining how we design for the web. With their recent launch of Design Pages, which is a free canvas based design tool, Framer is more than a site builder.
Sponsor:It is a true all in one design platform from social media assets to campaign visuals to vectors to icons, all the way down to a live site. Framer is where ideas go live start to finish. So if you're ready to design, iterate, and publish all in one tool, start creating for free today at framer.com/design and use our code practical AI for a free month of Framer Pro. Again, framer.com/design.
Daniel:All right, Chris. Well, there's a couple of, I guess, variations on the next types of models. Maybe it would be helpful to talk about language vision models or vision language models first and then talk about DeepSeek OCR, which is kind of a different kind of animal. It's not OCR like we talked about before. It's not a vision model like we're about to talk about.
Daniel:But the vision model is actually more similar to the LLM than the OCR model, I think. So a language vision model, what that means is that the input to the model can actually be an image and a text prompt. And so this is often how it works. Like if you go into a multimodal kind of chat thing and you upload an image and say, hey, what's going on in here? Who is this in this image?
Daniel:Or what product is this in this photo? Or all of those sorts of things. You want to ask about the image, or you want to give it as extra context to the language model. So the language vision model actually takes an image and/or text. And then the output is similar to the large language model in the sense that it's just going to output a stream of probable tokens.
Daniel:So in one sense, this is not document processing, but it could be used for that. But it doesn't have to be used for it. So it could be used just to enhance the chat experience, or to have a multimodal experience, or to reason over images, right? Or to even classify images. It's kind of a general purpose reasoner over images.
Daniel:And what happens is you kind of take a large language model and you add kind of a vision transformer into the mix. And the vision transformer takes the image and converts it into an embedding. The transformer piece of the LLM takes your text and converts it into an embedding. And then you smash both of those embeddings together into a vision plus text embedding. And that's what's used to generate the probability of the tokens on the output.
Daniel:So again, an image or text coming in, text going out the other end. And where this plugs into document processing is I could upload an image of a document, right? And just say as my prompt, hey, reconstruct the table in this image, right? And maybe that works. And it actually works quite well depending on, of course, the model and what image you put in and that sort of thing.
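A sketch of that image-plus-prompt-in, text-out usage, assuming a recent transformers release that ships the image-text-to-text pipeline; the Qwen model ID is one plausible choice, the file path is hypothetical, and the exact message format can vary by model and version:

```python
# A vision language model used for document work: an image plus a text
# prompt in, generated text out.
from transformers import pipeline

vlm = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice_page.png"},  # hypothetical scan
        {"type": "text", "text": "Reconstruct the table in this image as Markdown."},
    ],
}]
result = vlm(text=messages, max_new_tokens=512)
print(result[0]["generated_text"])  # output structure may differ slightly by version
```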
Chris:I'm kind of curious, as you're going through that fusion process between the text and the image: do you have any insight on whether those operate kind of in parallel and come together at some point? Or how that fusion is able to generate the better outcome? Is it one of those things we just know it does, or do you have any insight there?
Daniel:Yeah,
Daniel:I think the key thing here is that, at least in my understanding, and our listeners can correct me if I'm spewing nonsense here, part of it is that, yes, there are these two pieces. And so the image input goes through the vision transformer. The text goes through different layers of a transformer network. Those embeddings are generated, they're smashed together, but that whole system is jointly trained together towards the output, right?
Daniel:So it's not like you train the one-
Chris:That makes sense.
Daniel:And then
Daniel:you train the other one and then you hope they work well together. It's kind of like you join them together at the hip to start with. You train the whole system on many, many, many of these kinds of inputs and outputs. And obviously it's not interpretable in the sense of knowing how or why it outputs certain things, but it is able to recreate that probable output. And that would be, I would say, a major contrast with something like using Docling plus OCR, because then you actually do get a human observable structure of a document out and text corresponding to that.
Daniel:With the language vision model, you toss an image and text in and text comes out. And there's no real interpretable connection between the structure or content of that text on the output and any region in the images or specific characters in the images. It's all just related via the semantics of those embeddings, not any sort of structure or anything like that.
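Here's a toy sketch of that "smash the embeddings together" step. Every module and dimension below is illustrative rather than any particular model's: a stand-in vision encoder yields patch embeddings, a projection maps them into the LLM's hidden size, and the decoder would then attend over one joint sequence:

```python
# Toy fusion of vision and text embeddings, as described above.
import torch
import torch.nn as nn

hidden = 768
vision_encoder = nn.Linear(3 * 16 * 16, 512)   # stand-in for a ViT: one flattened patch -> one vector
projector = nn.Linear(512, hidden)             # align vision features with the text embedding space
text_embedding = nn.Embedding(50_000, hidden)  # stand-in for the LLM's token embeddings

patches = torch.randn(1, 196, 3 * 16 * 16)     # a 14x14 grid of flattened 16x16 RGB patches
token_ids = torch.randint(0, 50_000, (1, 12))  # a 12-token text prompt

vision_tokens = projector(vision_encoder(patches))  # (1, 196, hidden)
text_tokens = text_embedding(token_ids)             # (1, 12, hidden)

# One joint sequence: this is what the decoder attends over, and the whole
# stack is trained jointly end to end.
joint_sequence = torch.cat([vision_tokens, text_tokens], dim=1)
print(joint_sequence.shape)  # torch.Size([1, 208, 768])
```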
Chris:It's fascinating. It sounds like when you I'm just kind of once again thinking back over the whole conversation and the maturity that's evolving in this capability. And so I guess as we've kind of hit that point, like what's the next step in it? Where do you see things going?
Daniel:Well, I think the, or at least a next step, it might not be where everything is going, but I think a next step is kind of represented by what DeepSeek has done with DeepSeek OCR. So there are many language vision models or vision language models, I've heard it both ways. There's the one we use, the Qwen 2.5 vision language model. We've used that one quite a bit, really great model.
Daniel:I mean, the reality is the best of these are all coming out of China, at least at the moment of this recording, in terms of the vision language model side of things. So there have been these models over time, but they have limitations in the sense that most of these vision language models still assume a fixed resolution of the input image. And they still require huge training data sets and that sort of thing. But I think one of the main limitations is this fixed resolution size, right? So no matter the size of your document, how it's structured, all of that, you're gonna get this fixed resolution, which often does kind of create problems.
Daniel:And so what DeepSeek OCR has kind of done is they actually have a different processing pipeline. It doesn't take the whole image just as a whole image, but actually what happens is it takes the input image and then it splits it apart into these kind of image tokens, if you will. So small vision tokens that are kept at their higher resolution, and they're combined with the whole image, right? So you take the whole image, you combine it with these vision tokens, or a global full resolution view, I think they call it. So you get this global page plus these tiles. And each of these tiles, which are kind of vision tokens, are smashed together with the global page.
Daniel:And the idea is that you actually don't lose information. It's a way of representing this image or this document in a kind of compact token sequence where you are not limited by the resolution of your document. And so what that means is that DeepSeek OCR, at least in terms of how it seems right now, does a good job at preserving certain shapes of characters, line breaks, alignments, very tiny mathematical equations or notation, right? You get sort of little dots or a caret above mathematical notation. And so really DeepSeek OCR is kind of taking some of these ideas to the next level and preserving a lot of that information from the larger document into these kind of full resolution tiles, which can then be processed through the model.
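A rough sketch of that tiling idea as described here: keep a small global view of the whole page for layout context, plus native-resolution tiles that preserve fine detail. The tile size and global view size are illustrative choices, not DeepSeek's actual numbers:

```python
# Global view + full-resolution tiles, as described above.
from PIL import Image

page = Image.open("page.png")           # hypothetical document scan
global_view = page.resize((512, 512))   # coarse view: overall layout, reading order

tile = 512
tiles = []
for top in range(0, page.height, tile):
    for left in range(0, page.width, tile):
        box = (left, top, min(left + tile, page.width), min(top + tile, page.height))
        tiles.append(page.crop(box))    # native-resolution crops: tiny fonts survive

print(f"1 global view + {len(tiles)} full-resolution tiles")
```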
Chris:Could you talk a little bit about, when we're talking about resolution, could you kind of level set: what does resolution mean in this context? As we're talking about specific resolutions and then a multi resolution thing, can you kind of clarify what that is?
Daniel:Yeah, yeah. So, just reducing it to thinking about a single page, right? If I have a single page of a document and I represent that as an image, it might be however many pixels by however many pixels, right? Let's say 1,000 pixels by 1,500 pixels. But in a vision language model, typically, regardless of what image you input, it's going to resize it to whatever, 256 by 256.
Daniel:And if you imagine taking that larger page and smashing it down into 256 by 256, you're gonna lose little handwriting or diagrams or code or equations or little tiny fonts or footnotes, etcetera, all of that stuff. And so what DeepSeek is saying is, well, let's not lose all of that context, but let's also not have to keep everything in the same resolution. Let's tile this image. And now we have the original resolution of the document within each tile, but we also don't lose the ordering or the context of where that tile fits, because we have the global view of the page.
Daniel:And so it's kind of like when we put text into a transformer, we actually don't lose the ordering either, right? We understand where text is related to other text. And this is kind of a similar concept, where we're not losing any of the resolution, but we're also not losing the structure of where these tiles are placed, if you will.
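And a tiny illustration of the fixed-resolution problem itself: squeezing a full page into the small square input a typical fixed-resolution model expects discards most of the pixels. The file and sizes are hypothetical:

```python
# What a fixed-resolution model effectively sees of a large page.
from PIL import Image

page = Image.open("page.png")        # say, 1000 x 1500 pixels
vlm_input = page.resize((256, 256))  # the fixed-resolution input

kept = (256 * 256) / (page.width * page.height)
print(f"keeping roughly {kept:.1%} of the original pixels")  # tiny fonts won't survive
```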
Chris:That makes perfect sense. And so it's kind of the natural progression, if we're going back a few years and talking about the way convolutional neural networks were working, and the fact that you were constantly having to reduce the size down. But that created problems in terms of doing analysis of what was in the pictures, you know, identification of whatever, and the lack of resolution could sometimes make that a challenge. Yes. And this solves that in a particular way.
Daniel:Yeah. Yeah. Which the kind of current, or last, I don't know what generation we're in, bulk of vision language models at the moment do not solve, because they still force this kind of fixed resolution. Now, at the same time, DeepSeek OCR is also a larger model. It does require GPUs to run, but this is only the kind of first generation of these.
Daniel:Similar to vision language models, large language models, I'm sure there will be a gradual shrinking of these models at higher performances as more and more people train them. And who knows if this is the right approach, kind of quote right approach to go down. But it is interesting. One of the things I find interesting here, Chris, is we talk a lot about large language models and for the most part, they all operate the exact same way. And we've been talking about them operating the exact same way for some time.
Daniel:But if you look at the progression of these models, these multimodal models, as we've gone through this conversation, they all do operate in quite different ways. And so there's a lot of, to your point at the beginning, from my perspective, maybe from a nerdy perspective, document processing is very much not boring because there's actually such a diversity and such innovation going on here with much more diversity on the model side and the technical side than what you see in large language models.
Chris:And not only that, but our listeners have come through this with us. This is probably not something most of them have been hitting on lately. And so not only have they earned their Thanksgiving meal for tomorrow, if they're in The US at least, by the time they've done this, but maybe coming out of the holidays they can go back into the office and kind of give an upgrade to the RAG system and be wizards at how effective RAG is being for their organization. Because I definitely learned a bit along the way here about that. I have a whole bunch of use cases in mind now.
Chris:I'm thinking, oh gosh, we can go back and do this and that and the other. So fantastic explanation of these different approaches and kind of the timeline of how they developed. So thanks for doing that.
Daniel:Yeah, of course. And happy Thanksgiving again, Chris. Happy Thanksgiving to all our listeners. Hope you enjoy your Tofurky.
Chris:There you go. And even if you're outside The US, we are thankful for you listening in. Whatever holidays you celebrate, we hope they're very good going over the next few months here.
Jerod:Alright, that's our show for this week. If you haven't checked out our website, head to practicalai.fm and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show. Check them out at predictionguard.com.
Jerod:Also, thanks to Breakmaster Cylinder for the beats and to you for listening. That's all for now, but you'll hear from us again next week.