Practical AI

Episode 306, Season 1

Optimizing for efficiency with IBM’s Granite


We often judge AI models by leaderboard scores, but what if efficiency matters more? Kate Soule from IBM joins us to discuss how Granite AI is rethinking AI at the edge—breaking tasks into smaller, efficient components and co-designing models with hardware. She also shares why AI should prioritize efficiency frontiers over incremental benchmark gains and how seamless model routing can optimize performance. 


Creators & Guests

Host: Chris Benson
Guest: Kate Soule

What is Practical AI?

Making artificial intelligence practical, productive & accessible to everyone. Practical AI is a show in which technology professionals, business people, students, enthusiasts, and expert guests engage in lively discussions about Artificial Intelligence and related topics (Machine Learning, Deep Learning, Neural Networks, GANs, MLOps, AIOps, LLMs & more).

The focus is on productive implementations and real-world scenarios that are accessible to everyone. If you want to keep up with the latest advances in AI, while keeping one foot in the real world, then this is the show for you!

Jerod:

Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love the changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for the changelog wherever you get your podcasts. Thanks to our partners at fly.io.

Jerod:

Launch your AI apps in five minutes or less. Learn how at fly.io.

Chris:

Welcome to another episode of the Practical AI Podcast. This is Chris Benson. I am your cohost. Normally, Daniel Whitenack is joining me as the other cohost, but he's not able to today. I am a principal AI research engineer at Lockheed Martin.

Chris:

Daniel is the CEO of Prediction Guard. And with us today, we have Kate Soule, who is director of technical product management for Granite at IBM. Welcome to the show, Kate.

Kate:

Hey, Chris. Thanks for having me.

Chris:

So I wanted to... I know we're gonna dive shortly into what Granite is, and some of our listeners are probably already familiar with it. Some may not be. But before we dive into that, since we're talking about AI models, which is what Granite is, and the world of LLMs and generative AI, I'm wondering if you could start off talking a little bit about your own background, how you arrived at this, and we'll get into a little bit about what IBM is doing, why it's interested, and how it fits into the landscape here for those who are not already familiar with it.

Kate:

Perfect. Yeah. Thanks, Chris. So I lead the technical product management for Granite, which is IBM's family of large language models produced by IBM Research. And so I actually joined IBM and IBM Research a number of years ago, before large language models really became popular.

Kate:

You know, they had a bit of a Netscape moment right back in November 2022. So I've been working at the lab for a little while. I am a little bit of an odd duck, so to speak, in that I don't have a research background. I don't have a PhD. I come from a business background.

Kate:

I worked in consulting for a number of years, went to business school, and joined IBM Research and the AI lab here in order to get more involved in technology. You know, I've always kind of had one foot in the tech space. I was a data scientist for most of my tenure as a consultant and always, you know, thought that there were a lot of exciting things going on in AI. And so I joined the lab and basically got to work with a lot of generative AI researchers before large language models really kind of became big. And about two and a half years ago, with a lot of this technology we were working on, all of a sudden we started to find and see that there were tremendous business applications.

Kate:

You know, OpenAI really demonstrated what could happen if you took this type of technology and force fed it enough compute to make it powerful. It could do some really cool things. So from there, we worked as a team to spin up a program and offering at IBM for our own family of large language models that we could offer our customers and the broader open source ecosystem.

Chris:

I'm curious... one of the things that we've noticed over time is that different organizations are positioning these large language models within their product offerings in very unique ways. And we could go through some of your competitors and say they do it this way. How do you guys see that in terms of, you know, how large language models fit into your product offering? Is there a vision that IBM has for that?

Kate:

Yeah. You know, I think the fundamental premise of large language models is that they are a building block that you get to build on and reuse in many different ways, right, where one model can drive a number of different use cases. So, you know, from IBM's perspective, that value proposition resonates really clearly. We see a lot of our customers, and our own internal offerings, where there's a lot of effort on data curation and collection and kind of creating and training bespoke models for a specific task. And now with large language models, we get to kind of use one model, and with very little labeled data, all of a sudden the world's your oyster.

Kate:

There's a lot you can do. And so that's a bit of the reason why we have centralized the development of our language models within IBM Research, not a specific product. It's one offering that then feeds into many of our different products and downstream applications. And it allows us to kind of create this building block that we can then also offer customers to build on top of as well. And for open source ecosystem developers, you know, we think there's a lot of different applications for that one offering.

Kate:

And so, you know, that's a little bit, from the organizational side, why it's kind of exciting, right, that we get to do this all within research. We don't have a P&L, so to speak. We're doing this to create, ultimately, a tool that can support any number of different use cases and downstream applications.

Chris:

Very cool. And you mentioned open source. So I wanna ask you, because that's always a big topic among organizations: if I remember correctly, Granite is under an Apache 2 license. Is that correct?

Kate:

That's correct.

Chris:

I'm just curious, because we've seen strong arguments on both sides. Why is Granite under an open source license like that? What was the decision from IBM to go that direction?

Kate:

Yeah. Well, there were kind of two levels of decision making that we had to make when we talked about how to license Granite. One was open or closed. So are we gonna release this model, release the weights out into the world, so that anyone can use it regardless of whether they spend a dime with IBM? And, ultimately, IBM, you know, believes strongly in the power of open source ecosystems.

Kate:

A huge part of our business is built around Red Hat and being able to provide open source software to our customers with enterprise guarantees. And we felt that open AI was a far more responsible environment to develop and to incubate this technology as a

Chris:

whole. And when you say open AI, you mean open source AI?

Kate:

Open source AI. Just making sure. Very important clarification. So that was why we released our models out into the open.

Kate:

And then the question was under what license? Because there are a lot of models, and there are a lot of licenses. And a bit of the, like, moment that everyone's seeing is you have a Gemma license for a Gemma model. You've got a Llama license for a Llama model. Everyone's coming up with their own license. You know, it kind of, in some ways it makes sense.

Kate:

Models are a bit of a weird artifact. They're not code. You can't execute them like on their own. They're not software. They're not data per se, but they are kind of like a big bag of numbers at the end of the day.

Kate:

So, like, you know, with some of the traditional licenses, I think some people didn't see a clear fit, and so they came up with their own. There are also all these different kinds of potential risks that you might wanna solve for with a license for a large language model that are different from the risks that you look at with software or data. But at the end of the day, IBM really wanted just to keep this simple, like a no nonsense license that we felt would be able to promote the broadest use from the ecosystem without any restrictions. So we went with Apache 2, because that's probably the most widely used and just easy to understand license that's out there. And I think it really speaks also to where we see models being important building blocks that are further customized.

Kate:

So we really believe the true value in generative AI is being able to take some of these smaller open source models and build on top of it and even start to customize it. And if you're doing all that work and building on top of something, you wanna make sure there are no restrictions on all that IP you've just created. And so that's ultimately why we went with Apache two point zero.

Chris:

Understood. And one last follow-up on licensing and then I'll move on. It's partially just a comment. IBM has a really strong legacy in the AI world, and decades of software development along with that. I know Red Hat, with the acquisition some years back, has been strong on open source, and IBM both before and after has been as well.

Chris:

I'm just curious, did that make it any easier, do you think, to go with open source? Like, hey, we've done this so much that we're gonna do that with this thing too, even though it's a little bit newer, you know, in context. Culturally, did it seem easier to get there than for some companies that possibly really struggle with that, that don't have such a legacy in open source?

Kate:

I think it did make it easier. I think there are always going to be, like, any company going down this journey has to take a look at, wait, we're spending how much on what, and you're gonna give it away for free, and come up with their own kind of equations on how this starts to make sense. And I think we've just experienced as a company that the software and offerings we create are so much stronger when we're creating them as part of an open source ecosystem than something that we just keep close to the vest. So, you know, it was a much easier business case, so to speak, to make and to get the sign off that we needed. Ultimately, our leadership was very supportive in order to encourage this kind of open ecosystem.

Chris:

Fantastic. Turning a little bit: as IBM was diving into this realm and starting out, obviously you have a history with Granite, and you guys are on 3.2 at this point. That means that you've been working on this for a period of time. But as you're diving into this very competitive ecosystem of building out these open source models, which are big and expensive to make, and where you're looking for an outsized impact in the world, how do you decide how to proceed with what kind of architecture you want? You know, how did you guys think about it? Like, you're looking at competitors.

Chris:

Some of them are closed source, like OpenAI's. Some of them, like Meta AI, you know, have Llama and that series. As you're looking at what's out there, how do you make a choice about what is right for what you guys are about to go build? You know? Because that's one heck of an investment to make.

Chris:

And I'm kinda curious, when you're looking at that landscape, how you make sense of that in terms of where to invest.

Kate:

Yeah. Absolutely. So, you know, I think it's all about trying to make educated bets that kind of match the constraints that you're operating with and your broader strategy. So, you know, early on in our generative AI journey, when we were kind of getting the program up and running, we wanted to take fewer risks. We wanted to learn how to do, you know, common architectures, common patterns before we started to get more, quote, unquote, innovative in coming up with net new additions on top.

Kate:

So early on... and also, you have to keep in mind, this field has just been changing so quickly over the past couple of years, so no one really knew what they were doing. Like, we look at how models were trained two years ago and the decisions that were made, and the game was all about as many parameters as possible and having as little data as possible to keep your training costs down. And now we've totally switched. The general wisdom is as much data as possible and as few parameters as possible to keep your inference costs down once the model's finally deployed.

Kate:

So the whole field's been going through a learning curve. But I think early on, our goal was really working on trying to replicate some of the architectures that were already out there, but innovate on the data. So really focusing on how do we create versions of these models that are being released that deliver the same type of functionality, but that were trained by IBM as a trusted partner, working very closely with all of our teams to have a very clear and ethical data curation and sourcing pipeline to train the models. So that was kind of the first major innovation aim that we had, and it was actually not on the architecture side. Then as we started to get more confident, as the field started to, I don't wanna say mature, because we're still, again, in very early innings, but we started to coalesce around some shared understandings of how these models should be trained and what works or doesn't.

Kate:

Then our goal really started to focus, from an architecture side, on how can we be as efficient as possible? How can we train models that are going to be economical for our customers to run? And so that's where you've seen us focus a lot on smaller models for right now, and we're working on new architectures. So for example, mixture of experts. There are all sorts of things that we are really focusing on, with kind of the mantra of how do we make this as efficient as possible for people to further customize and to run in their own environments.

Chris:

So that was a fantastic start to laying it out as we dive into Granite itself. You know, in your last comments, you talked about kind of the smaller, more economical models, so that you're getting efficient inference on the customer side. You mentioned a phrase which some people may know, some people may not: mixture of experts. As we dive into what Granite is and its versions going forward here, maybe start with mixture of experts and what you mean by that.

Kate:

Absolutely. So if we think of how these models are being built, they're essentially billions of parameters, small little numbers that are basically encoding information. And, you know, to draw a really simple comparison, if you have a linear regression, like you've got a scatter plot and you're fitting a line, y = mx + b, then m is a parameter in that equation, right? So it's that, except on the scale of billions. With mixture of experts, what we're looking at is, do I really need all 1 billion parameters every single time I run inference?

Kate:

Can I use a subset? Can I have kind of little expert groupings of parameters within my large language model, so that at inference time I'm being far more selective and smart about which parameters get called? Because if I'm not using all, you know, 8 billion or 20 billion parameters, I can run that inference far faster. So it's much more efficient. And so really, it's just getting a little bit more nuanced. Instead of, like, a lot of the early days of generative AI, which was just throw more compute at it and hope the problem goes away.

Kate:

We're now trying to, like, figure out how we can be far more efficient in how we build these models.
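
As a rough illustration of the idea Kate describes, here is a minimal sketch of a mixture-of-experts layer in plain Python/NumPy. The sizes, the top-2 routing, and the expert layout are all illustrative, not Granite's actual architecture; the point is only that a router picks a small subset of experts per token, so most of the layer's parameters sit idle at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64        # token embedding size (illustrative)
NUM_EXPERTS = 8    # total experts in the layer
TOP_K = 2          # experts actually used per token

# Each expert is a small feed-forward weight matrix.
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.02 for _ in range(NUM_EXPERTS)]
# The router scores each expert for a given token.
router = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token through only TOP_K of NUM_EXPERTS experts."""
    scores = token @ router                                      # (NUM_EXPERTS,)
    top = np.argsort(scores)[-TOP_K:]                            # best-scoring experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()    # softmax over the chosen few

    # Only TOP_K experts' parameters are touched here; the other
    # NUM_EXPERTS - TOP_K experts contribute nothing to this token.
    out = np.zeros(HIDDEN)
    for w, idx in zip(weights, top):
        out += w * (token @ experts[idx])
    return out

token = rng.standard_normal(HIDDEN)
print(moe_layer(token).shape)  # (64,) -- same output shape, far fewer parameters used
```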

Chris:

So I appreciate the explanation on mixture of experts. And that makes a lot of sense in terms of trying to use the model efficiently for an inference by reducing the number of parameters. I believe right now you guys have, is it 10 billion, as the model sizes in terms of the parameters? Or have I gotten that wrong?

Kate:

We've actually got a couple of sizes. So you're right, we've got 10 billion. But speaking of those mixture of expert models, we actually have a couple of tiny MoE models. MoE stands for mixture of experts.

Kate:

So we've got an MoE model with only a billion parameters and an MoE model with 3 billion parameters. But they respectively use far fewer parameters at inference time. So they run really, really quick, designed for more local applications like running on a CPU.

Chris:

So when you make the decision to have different size models in terms of the number of parameters and stuff, do you have different use cases in mind for how those models might be used? And is there one set of scenarios where you would put your 8 billion and another that would be the 3 billion that you mentioned?

Kate:

Yeah. Absolutely. So if we think about it, when we're kind of designing the model sizes that we wanna train, a huge question that we're trying to solve for is, you know, what are the environments these models are gonna be run on? And how do I, you know, maximize performance without forcing someone to have to, like, buy another GPU to host it? So, you know, there are models, like those small MoE models, that were actually designed much more for running on the edge, locally, or on a computer, like just a local laptop.

Kate:

We've got models that are designed to run on a single GPU, which is like our 10 billion parameter models. Those are standard architecture, not MoE. And we've got models on our road map that are looking at how we can kind of max out what a single GPU could run, and then how we can max out what a box of GPUs could run, so if you've got eight GPUs stitched together. So we are definitely thinking about those different kind of tranches of compute availability that customers might have, and each of those tranches could relate to different use cases.

Kate:

Like, obviously, if you're thinking about something that is local, you know, there's all sorts of IoT type use cases that that could target. If you are looking at something that has to be run on, you know, a box of eight GPUs, you're looking at something where you have to be okay with having a little bit more latency, you know, the time it takes for the model to respond, but where the use case also probably needs to be a little bit higher value, because it costs more to run that big model. And so you're not gonna run, like, a really simple, you know, help me summarize this email task hitting, you know, eight GPUs at once.

Chris:

So as you talk about kind of the segmentation of this family of models and how you're doing that, I know you guys have a white paper, which we'll be linking in the show notes for folks to go and take a look at, either during or after they listen here. And you talk about some of the models having experimental chain of thought reasoning capabilities. And I was wondering if you could talk a little bit about what that means.

Kate:

Yeah. So we're really excited with the latest release of our Granite models. Just this February, we released Granite 3.2, which is an update to our 2 billion parameter model and our 8 billion parameter model. And one of the kind of superpowers we give this model in the new release is we bring in an experimental feature for reasoning. And so what we mean by that is there's this new concept, relatively new concept, in generative AI called inference time compute. What that really equates to, just to put it in plain language, is if you think longer and harder about a prompt, about a question, you can get a better response.

Kate:

I mean, this works for humans. This is how you and I think, but the same is true for large language models. And thinking here, you know, there's a bit of a risk of anthropomorphizing the term, but it's where we've landed as a field, so I'll run with it for now. It really means generate more tokens. So have the model think through what's called a chain of thought, you know, generating logical thought processes and sequences of how the model might approach answering, before triggering the model to then respond. And so we've trained Granite 8B 3.2 to be able to do that chain of thought reasoning natively, to take advantage of this new inference time compute area of innovation.

Kate:

And what we've done is we've made it selective. So if you don't need to think long and hard about, you know, what is two plus two, you turn it off, and the model responds faster, just with the answer. If you are giving it a more difficult question, you know, pondering the meaning of life, you might turn thinking on, and it's gonna think through it a little bit first before answering, with, in general, a longer, more chain of thought style approach, where it's explaining kind of step by step why it's responding the way it is.
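
The on/off "thinking" switch Kate mentions can be pictured roughly like this. This is a generic sketch, not Granite's actual chat template or API: the `thinking` flag, the system prompts, and the `generate` stub are placeholders standing in for whatever control a given serving stack exposes.

```python
THINK_PROMPT = (
    "Think through the problem step by step inside <thought> tags, "
    "then give the final answer."
)
DIRECT_PROMPT = "Answer directly and concisely."

def generate(messages: list[dict]) -> str:
    """Placeholder for a real model call (e.g. a chat-completions request)."""
    return "<model output would go here>"

def ask(question: str, thinking: bool = False) -> str:
    # The only difference between the two modes is the instruction we prepend,
    # which trades extra generated tokens (latency and cost) for a reasoned answer.
    system = THINK_PROMPT if thinking else DIRECT_PROMPT
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    return generate(messages)

# Cheap question: no need to "think".
print(ask("What is 2 + 2?", thinking=False))
# Harder question: turn thinking on and accept the longer, slower response.
print(ask("Plan a three-step rollout for migrating a RAG pipeline.", thinking=True))
```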

Chris:

Do you anticipate, and I've seen this done by different organizations in different ways, that your inference time compute capability is going to be there on all the models, with you turning it on and off? Or do you anticipate that some of the models in your family are more specialized in that and it's always on, versus others? You kinda mentioned the on and off, so it sounded like you might have it in all of the above.

Kate:

Yeah. You know, right now it's marked as an experimental feature. I think we're still learning a lot about how this is useful and what it's going to be used for, and that might dictate what makes sense moving forward. But what we're seeing is that it's kind of universally useful: one, to try and improve the quality of the answers, but two, as an explainability feature. Like, if the model is going through and explaining more of how it came up with a response, that helps a human better understand the response.

Kate:

So you know, I think it is something we're heavily considering just baking into the models moving forward, which is a different approach, right, than some models which are just focused on reasoning. I don't think we're going to see that for very long. You know, I think more and more we're going to see more selective reasoning. So, like, Claude 3.7 came out. They're actually doing a really nice job of this.

Kate:

It can think longer or harder about something, or just think for a short amount of time. So I think we're going to see increasingly more folks move in that direction. But, you know, we're still, again, in early innings. I'll say it again. So we're going to learn a lot over the next couple of months about where this is having the most impact. And I think that could have some structural implications for how we design our roadmap moving forward.

Chris:

Gotcha. There has been a larger push in the industry towards smaller models. So, you know, kinda going back over the recent history of LLMs, you saw initially just the number of parameters exploding and the models becoming huge. And, obviously, you know, we talked a little bit about the fact that that's very expensive on inference

Kate:

Yeah.

Chris:

To run these things. And especially over the last, I don't know, year, year and a half, there's been a much stronger push, especially with open source models. We've seen a lot of them on Hugging Face pushing to smaller. Do you anticipate, as you're thinking about this capability of being able to reason, that that's going to drive smaller model use toward models like what you guys are creating? Where you're saying, okay, there are these large models out there, you know, Claude has the big models as an option, or a Llama model that's very large. Are you guys anticipating kinda pulling a lot more mindshare towards some of the smaller ones?

Chris:

And do you anticipate that you're going to continue to focus on these smaller, more efficient ones, where people can actually get them deployed out there without breaking the bank of their organization? How does that fit in?

Kate:

Yeah. So look. One thing to keep in mind is, even without thinking about it, without trying, we're seeing small models are increasingly able to do what it took a big model to do yesterday. So you look at what a tiny, you know, 2 billion parameter model, our Granite 2B model, for example, outperforms on numerous benchmarks, you know, Llama 2 70B, which is a much larger but older generation model. I mean, it was state of the art when it was released.

Kate:

But the technology is just moving so quickly. So, you know, we do believe that by focusing on some of the smaller sizes, ultimately we're gonna get a lot of lift just natively, because that is where the technology is evolving. Like, we're continuing to find ways to pack more and more performance into fewer and fewer parameters and expand the scope of what you can accomplish with a small language model. I don't think that means we're ever going to get rid of big models. I just think if you look at where we're focusing, we're really looking at, if you think of the eighty twenty rule, like 80% of the use cases can be handled by a model of, you know, maybe 8 billion parameters or less.

Kate:

That's what we're targeting with Granite, and we're really trying to focus in. We think that there's definitely still always gonna be innovation and opportunity and complex use cases that you need larger models to handle. And that's where we're really interested to see, okay, how do we potentially expand the Granite family, focusing on more efficient architectures like mixture of experts, to target those larger and more complex model sizes, so that you still get a little bit more of a practical implementation of a big model. Recognizing that, again, there's always gonna be those outliers, those really big cases. We just don't think there's gonna be as much business value, frankly, behind those compared to really focusing and delivering value on the small to medium model space.

Chris:

I think that's one thing Daniel and I have talked quite a bit about, and we would agree with that. I think the bulk of the use cases are for the smaller ones. While we're at it, you know, we've been talking about various aspects of Granite a bit, but could we take a moment and have you go back through the Granite family and kind of talk about each component in the family, what it's called, what it does, and just kinda lay out the array of things that you have to offer?

Kate:

Absolutely. So the Granite model family has the language models that I went over, so between 1 billion and 8 billion parameters in size. And again, we think those are like the workhorse models. You know, for 80% of the tasks, we think you can probably get away with a model that's 8 billion parameters or less.

Kate:

We also, with 3.2, recently released a vision model. So these models are for vision understanding tasks. That's important. It's not vision or image generation, which is where a lot of the early, like, hype and excitement on generative AI came from, like DALL-E and those.

Kate:

We're focused on models where you provide an image in a prompt, and then the output is text, the model's response. So really useful for things like image and document understanding. We specifically prioritized a very large amount of document and chart Q&A type data in the training dataset, really focusing on performance on those types of tasks. So you can think of, you know, having a picture or an extract of a chart from a PDF and being able to answer questions about it. We think there's a lot of opportunity.

Kate:

So RAG is a very popular workflow in enterprise, right? Retrieval augmented generation. Right now, all of the images in your PDFs and documents basically get thrown away. But we really are working on, can we use our vision model to actually include all of those charts, images, figures, and diagrams to help improve the model's ability to answer questions in a RAG workflow?
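
To make the "image in, text out" shape concrete, here is a hedged sketch of sending a chart image plus a question to a vision-language model served behind an OpenAI-compatible endpoint (as vLLM and similar servers expose). The URL, model name, and file path are placeholders, and the exact payload shape may differ depending on how the model is actually hosted.

```python
import base64
import requests  # assumes the `requests` package is installed

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder server
MODEL = "granite-vision"                                # placeholder model name

def ask_about_image(image_path: str, question: str) -> str:
    # Encode the chart or page image so it can travel inside a JSON request.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# e.g. ask_about_image("revenue_chart.png", "Which quarter had the highest revenue?")
```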

Kate:

So we think that's gonna be huge. So lots of use cases on the vision side. And then we also have a number of kind of companion models that are designed to work in parallel with a language model or a vision language model. So we've got our Granite Guardian family of models, and these are what we call guardrails. They're meant to sit right in parallel with the large language model that's running the main workflow, and they monitor all the inputs that are coming into the model and all the outputs that are being provided by the model, looking for potential adversarial prompts, jailbreaking attacks, harmful inputs, and harmful or biased outputs.

Kate:

They can detect hallucinations in model responses. So it's really meant to be a governance layer that can sit and work right alongside Granite. It could actually work alongside any model. So, you know, even if you've got an OpenAI model, for example, that you've deployed, you can have Granite Guardian work right in parallel. And, you know, ultimately just be a tool for responsible AI.
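
The guardrail pattern Kate describes, a checker sitting in parallel with whatever model runs the main workflow, looks roughly like this. The `guardian_flags` and `main_model` functions are stand-ins; in practice Granite Guardian is itself a model that classifies the text, not a keyword filter like the stub below.

```python
def guardian_flags(text: str) -> bool:
    """Stand-in for a Granite Guardian call that returns True if the text is risky.

    A real deployment would send `text` to the Guardian model and parse its
    verdict; this stub only exists so the wrapper below is runnable.
    """
    return any(bad in text.lower() for bad in ("ignore previous instructions",))

def main_model(prompt: str) -> str:
    """Placeholder for the model doing the actual work (Granite or any other)."""
    return f"(answer to: {prompt})"

def guarded_call(prompt: str) -> str:
    # 1. Screen the input before it ever reaches the working model.
    if guardian_flags(prompt):
        return "Request blocked: the input looks unsafe."

    # 2. Run the main workflow model.
    answer = main_model(prompt)

    # 3. Screen the output before it reaches the user.
    if guardian_flags(answer):
        return "Response withheld: the output failed a safety check."
    return answer

print(guarded_call("Summarize this contract clause for me."))
```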

Kate:

And, you know, the last models I'll talk about are our embedding models, which again are meant to, you know, assist a model in a broader generative AI workflow. So in a RAG workflow, you'll often need to take large amounts of documents or text and convert them into what are called embeddings that you can search over in order to retrieve the most relevant info and give it to the model. So our Granite embedding models are used for that embedding step. These are meant to do that conversion, and they can support a number of different, similar kinds of search and retrieval style workflows, working directly with the Granite large language model.
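
A minimal sketch of the embedding step Kate describes in a RAG flow: turn documents into vectors once, embed the query, and return the closest chunks to hand to the language model. The `embed` function here is a toy, hash-seeded stand-in, just so the retrieval math runs; in a real pipeline it would call an embedding model (Granite's or anyone else's).

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for an embedding model: deterministic pseudo-vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

documents = [
    "Granite Guardian monitors model inputs and outputs for risky content.",
    "The vision models answer questions about charts and PDF pages.",
    "Time series models forecast future values from past observations.",
]
# The index is built once and searched many times.
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = doc_vectors @ q                 # cosine similarity (vectors are unit length)
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

# The retrieved chunks would then be pasted into the prompt of the generator model.
print(retrieve("How do I answer questions about a chart in a PDF?"))
```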

Chris:

Gotcha. I know there was some comment in the white paper also about time series. Yes. Can you talk a little bit to that for a second?

Kate:

Absolutely. So I mentioned Granite is multimodal in that it supports vision. We also have time series as a modality, and I'm really glad you brought these up, because these models are really exciting. So we talked about our focus on efficiency. These models are, like, one to two million parameters in size.

Kate:

That is teeny tiny in today's generative AI context. Even compared to other forecasting models, these are really small generative AI based time series forecasting models, but they are right now delivering top marks when it comes to performance. So as part of this release, we just submitted our time series models to Salesforce's time series leaderboard called GIFT. They're the number one model on GIFT's leaderboard right now. And we're really excited.

Kate:

They've got over 10 million downloads on Hugging Face. They're really taking off in the community. So it's a really excellent offering in the time series modality for the Granite family.
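
For readers unfamiliar with how these tiny forecasters are used, the input/output shape is simple: hand the model a window of recent observations and get back a short horizon of predicted values. The sketch below fakes the model itself with a naive seasonal repeat; the window and horizon lengths are illustrative, not the Granite models' actual configuration.

```python
import numpy as np

CONTEXT_LEN = 512   # how much history the model sees (illustrative)
HORIZON = 96        # how far ahead it predicts (illustrative)
SEASON = 24         # pretend hourly data with a daily cycle

def forecast(context: np.ndarray, horizon: int = HORIZON) -> np.ndarray:
    """Stand-in for a pretrained time series model.

    A real call would pass `context` to the forecasting model; this naive
    baseline just repeats the last observed seasonal cycle.
    """
    last_cycle = context[-SEASON:]
    reps = int(np.ceil(horizon / SEASON))
    return np.tile(last_cycle, reps)[:horizon]

# Synthetic history: a daily cycle plus noise.
t = np.arange(CONTEXT_LEN)
history = (np.sin(2 * np.pi * t / SEASON)
           + 0.1 * np.random.default_rng(1).standard_normal(CONTEXT_LEN))

prediction = forecast(history)
print(prediction.shape)  # (96,) -- the next 96 steps
```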

Chris:

Okay. Well, thank you for going through kind of the layout of the family of models that you guys have. I actually wanna go back and ask a quick question. You talked a bit about Guardian providing guardrails and stuff. And that's something that, if you take a moment to dive into it, I think we often tend to focus kind of on, you know, the model and it's gonna do X, you know, whatever.

Chris:

I love the notion of integrating these guardrails that Guardian represents into a larger architecture, you know, to address kind of the quality issues surrounding the inputs and the outputs on that. How did you guys arrive at that? You know, it's pretty cool. I love the idea that not only is it there for your own models, obviously, but that you could have an end user go and apply it to something else that they're doing, maybe from a competitor or whatever. How did you decide to do that?

Chris:

And I think that's a fairly unique thing that we don't tend to hear as much from other organizations.

Kate:

Yeah. You know, so Chris, one of the values, again, of being in the open source ecosystem is we get to, like, build on top of other people's great ideas. So we actually weren't the first ones to come up with it. There are a few other guardrail type models out there. But, you know, IBM has quite a large presence, especially IBM Research, in the security space.

Kate:

And there are challenges in security that are very similar to those with large language models and generative AI, so, you know, it's not totally new. And what I think we've learned as a company and as a field is that you always need layers of security when it comes to creating a robust system against, you know, potential adversarial attacks, and dealing with even the model's own innate safety alignment itself. So, you know, when we saw some of the work going on in the open source ecosystem on guardrails, I think it was kind of a no brainer from our perspective: this is another great way to add an additional layer of security and safety on that generative AI stack, to better improve model robustness. And it fits, you know, IBM's hyper focus on what is the practical way to implement generative AI. So what else is needed beyond efficiency? We need trust.

Kate:

We need safety. Let's create tools in that space. So, you know, a number of different reasons all made it very clear and easy to go and pursue. And we were actually able to build on top of Granite. So Granite Guardian is a fine tuned version of Granite that's laser focused on these tasks of detecting and monitoring inputs going into the model and outputs going out.

Kate:

And the team has done a really excellent job, first starting with basic harm and bias detectors, which I think are pretty prevalent in other guardrail models that are out there. But now we've really started to kind of make it our own and innovate. So some of the new features that were released in the 3.2 Granite Guardian models include hallucination detection. Very few models do that today, specifically hallucination detection with function calling. So if you think of an agent, you know, whenever an LLM agent is trying to access or submit external information, it'll make what's called a tool call.

Kate:

And so when it's making that tool call, it's providing information based off of the conversation history, saying, you know, I need to look up Kate Soule's information in the HR database. This is her first name. She lives in Cambridge, Mass, X, Y, Z. And we wanna make sure the agent isn't hallucinating when it's filling in those pieces of information it needs to use to retrieve. Otherwise, you know, if it made up the wrong name or said Cambridge, UK instead of Cambridge, Mass, the tool will provide the incorrect response back, but the agent will have no idea.

Kate:

And it will keep operating with utmost certainty that it's operating on correct information. So, you know, it's just an interesting example of, you know, some of the observability we're trying to inject into responsible AI workflows, particularly around things like agents, because there's all sorts of new safety concerns that really have to be taken into account to make this technology practical and implementable.
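
The tool-call hallucination check Kate walks through can be pictured as: before executing a tool call, verify that every argument the agent filled in is actually grounded in the conversation so far. Granite Guardian does this with a trained model; the substring check below is only a crude, runnable illustration of where such a detector sits in the loop.

```python
def grounded(value: str, conversation: str) -> bool:
    """Crude grounding test: did this argument value appear in the conversation?"""
    return value.lower() in conversation.lower()

def check_tool_call(tool_call: dict, conversation: str) -> list[str]:
    """Return the arguments that look hallucinated (not grounded in the dialogue)."""
    return [
        f"{name}={value!r}"
        for name, value in tool_call["arguments"].items()
        if not grounded(str(value), conversation)
    ]

conversation = (
    "User: Can you pull up Kate Soule's HR record? "
    "She works out of Cambridge, Massachusetts."
)
tool_call = {
    "name": "lookup_hr_record",
    # 'Cambridge, UK' never appeared in the conversation -- the agent made it up.
    "arguments": {"first_name": "Kate", "location": "Cambridge, UK"},
}

issues = check_tool_call(tool_call, conversation)
if issues:
    print("Possible hallucinated arguments:", issues)  # flag before the tool runs
else:
    print("Tool call looks grounded; safe to execute.")
```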

Chris:

And, you know, you actually brought up agents and stuff, that being kind of the really hot topic of the moment in 2025 so far. Could you talk a little bit about Granite and agents and how you guys are thinking about it? You've gone through one example right there, but if you could expand on that a little bit: how is IBM thinking about positioning Granite? How do agents fit in? What does that ecosystem look like?

Chris:

You know, you've started to talk about security a bit. Could you kinda weave that story for us a little bit?

Kate:

Absolutely. So, yeah, obviously IBM is all in on agents, and there's just so much going on in the space. A couple of key things that I think are interesting to bring up. So one is looking at the open source ecosystem for building agents. So we actually have a really fantastic team located right here in Cambridge, Massachusetts that is working on an agent framework and broader agent stack called BeeAI, like a bumblebee.

Kate:

And so we're working really closely with them on how we kind of co-optimize a framework for agents with a model, in order to be able to have all sorts of new tips and tricks, so to speak, that you can harness when building agents. So I don't wanna give too much away, but I think there are a lot of really interesting things that IBM's thinking about with agent framework and model co-design. And that unlocks, you know, so much potential when it comes to safety and security, because there need to be parts, for example, of an agent that the agent developer programs that you never want the user to be able to see. There are parts of data that an agent might retrieve as part of a tool call that you don't want the user to see. So for example, an agent that I'm working with might have access to anybody's HR records, but I only have permission to see my HR records.

Kate:

So how can we design models and frameworks with those concepts in mind, in order to better demarcate types of sensitive information that should be, you know, hidden, and in order to protect information so that the model knows these types of instructions can never be overwritten, no matter what type of, like, later adversarial attacks somebody might try, saying, you're not Kate's agent, you're, you know, a nasty bot and your job is to do X, Y, and Z? Like, how do we prevent those types of attack vectors through model and agent framework co-design? So I think there's a lot of really exciting work there. More broadly, though, you know, I think even on more traditional ideas and implementations of agents, not that there's a traditional one.

Kate:

This is so new. But for more classical agent implementations, we're working, for example, with IBM Consulting. They have an agent and assistant platform where Granite is the default model for the agents and assistants that get built. And so that allows IBM all sorts of economies of scale.

Kate:

If you think about it, we've now got 60,000 consultants out in the world using agents and assistants built off of Granite in order to be more efficient and to help them with their client and consulting projects. So we see a ton of what we call client zero. IBM is our, you know, first client in that case, of how do we even internally build agents with Granite in order to improve IBM productivity.
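
Returning to Kate's earlier point about parts of an agent's context that the end user should never see, one way to picture it is tagging every piece of context with a visibility level and filtering before anything is shown. This is a hypothetical sketch, not the BeeAI framework's actual design; the field names and levels are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    content: str
    visibility: str  # "user" = safe to show, "internal" = model/developer only

# What the agent is working with internally.
context = [
    ContextItem("System: never reveal other employees' records.", "internal"),
    ContextItem("Tool result: salary history for employee #4521.", "internal"),
    ContextItem("Your PTO balance is 12 days.", "user"),
]

def user_view(items: list[ContextItem]) -> list[str]:
    """Only user-visible items ever leave the agent boundary."""
    return [item.content for item in items if item.visibility == "user"]

def model_view(items: list[ContextItem]) -> list[str]:
    """The model sees everything, including instructions it must not disclose."""
    return [item.content for item in items]

print(user_view(context))        # ['Your PTO balance is 12 days.']
print(len(model_view(context)))  # 3
```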

Chris:

Very cool. I'm kinda curious, as you guys are looking at this array of considerations that you've just been going through, and as there is more and more push out into edge environments, and you've already talked a little bit about that earlier. As we're starting to wind down: as things push a bit out of the cloud and out of the data center, and as we have been migrating away from these gigantic models into a lot more smaller, hyper efficient models, often ones that are doing better on performance and stuff, and we see so many opportunities out there, could you talk a little bit about kind of where Granite might be going with that, or where it is now, and kind of what the thoughts about Granite at the edge might look like?

Kate:

Yeah. So I think with Granite at the edge, there are a couple of different aspects. One is how we can think about building with models so that we can optimize for smaller model sizes. So when I say building, I mean building prompts, building applications, so that we're not, you know, designing prompts how they're written today, which I like to call the YOLO method, where I'm gonna give 10 pages of instructions all at once and say, go and do this, and hope to God, you know, the model follows all those instructions and does everything beautifully. Like, small models, no matter how much this technology advances, probably aren't going to get, you know, perfect scores on that type of approach.

Kate:

So how can we think about broader kind of programming frameworks for dividing things up into much smaller pieces that a small model can operate on? And then how do we leverage model and hardware co design to run those small pieces really fast? So, you know, I think there's a lot of opportunity, you know, across the stack of how people are building with models, the models themselves, and the hardware that the model is running on. That's going to allow us to push things much further to the edge than we've really experienced so far. It's gonna require a bit of a mind shift again.

Kate:

Like, right now, I think we're all really happy that we could be a bit lazy when we write our prompts and just, like, you know, write kind of word vomit prompts down. But

Chris:

Right.

Kate:

I think if we can get a little bit more, like, kind of software engineering mindset in terms of how you program and build, it's gonna allow us to break things into much smaller components and push those components even farther to the edge.
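
A small sketch of the shift Kate describes, away from one "YOLO" mega-prompt and toward a program that hands a small model one narrow step at a time. The `small_model` stub stands in for any compact, locally runnable model; the steps themselves are illustrative.

```python
def small_model(instruction: str, text: str) -> str:
    """Placeholder for a call to a small local model given ONE narrow instruction."""
    return f"[{instruction}] {text[:40]}..."

def process_report(report: str) -> str:
    # Instead of one 10-page prompt, each step is a small, checkable unit of work
    # that a compact model (or even different small models) can handle on its own.
    chunks = [report[i:i + 200] for i in range(0, len(report), 200)]

    summaries = [small_model("Summarize this chunk in one sentence", c) for c in chunks]
    combined = small_model("Merge these sentences into a short summary",
                           " ".join(summaries))
    action_items = small_model("List any action items", combined)
    return f"Summary: {combined}\nAction items: {action_items}"

print(process_report("Quarterly infrastructure review... " * 30))
```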

Chris:

That makes sense. That makes a lot of sense. I guess, kind of a final question for you as we talk about this. Any other thoughts? You talked a little bit about where you think things are going right there. Anything that you have to add to that, in terms of the industry or specific to Granite, where you think things are going, what the future looks like, when you're kinda winding up for the day and you're at that moment where your mind wanders a little bit?

Chris:

Anything that appeals to you that kinda goes through your head?

Kate:

So I think the thing I've been most obsessed about lately is, you know, we need to get to the point as a field where models are measured by, like, how good their efficiency frontier is, not by, like, you know, did they get 0.01 higher on a metric or benchmark. So I think we're starting to see this: with the reasoning with Granite, you can turn it on and off; with the reasoning with Claude, you can pay more, you know, and have longer thoughts or shorter thoughts. But I really wanna see us get to the point, and I think the table is set for this, we've got the pieces in place, to really start to focus in on how can I make my model as efficient as possible, but as flexible as possible, so I can choose anywhere that I wanna be on that performance cost curve.

Kate:

So if my task isn't, you know, very difficult, and I don't wanna spend a lot of money on it, I'm gonna route this, with very little thinking, to a small model, and I'm gonna be able to achieve, you know, acceptable performance. And if my task is really high value, you know, I'm gonna pay more. And I don't need to, like, think about this. It's just going to happen, either from the model architecture, from being able to reason or not reason, or from routing that might be happening behind an API endpoint to send my request to a more powerful model or to a less powerful but cheaper model. I think with all of that, you know, we need to get to the point where no one's having to think about this or solve for it and design it.

Kate:

And I really wanna see these curves, and I wanna be able to see us push those curves as far to the left as possible, making things more and more efficient, versus, like, here's a number on the leaderboard, like, I spent another, you know, X gazillion dollars on compute in order to move that number up by 0.02. And, you know, that's science. Like, I'm ready to move beyond that.
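
The routing behavior Kate wants to make invisible to the user might look like this behind an endpoint: estimate how demanding the request is, then pick a point on the cost/performance curve (small model, or a bigger model with thinking turned on). The difficulty heuristic and the two model stubs below are purely illustrative, not any provider's actual router.

```python
def small_model(prompt: str) -> str:
    return f"(fast, cheap answer to: {prompt[:30]}...)"

def large_model(prompt: str, thinking: bool) -> str:
    mode = "with extended thinking" if thinking else "direct"
    return f"(slower, pricier answer {mode} to: {prompt[:30]}...)"

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic: longer prompts and reasoning-ish words score higher.

    A production router would more likely be a learned classifier or a policy
    tied to the use case's value, not a keyword count.
    """
    hard_words = ("prove", "plan", "analyze", "multi-step", "trade-off")
    return len(prompt) / 500 + sum(w in prompt.lower() for w in hard_words)

def route(prompt: str) -> str:
    difficulty = estimate_difficulty(prompt)
    if difficulty < 0.5:
        return small_model(prompt)                  # cheap path, acceptable quality
    if difficulty < 1.5:
        return large_model(prompt, thinking=False)  # middle of the curve
    return large_model(prompt, thinking=True)       # pay more for the hard stuff

print(route("What's our office address?"))
print(route("Analyze the trade-off between MoE and dense models for edge "
            "deployment, and plan a multi-step migration."))
```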

Chris:

Fantastic. Great conversation. Thank you so much, Kate Soule, for joining us on the Practical AI podcast today. Really appreciate it. A lot of insight there.

Chris:

So thanks for coming on. Hope we can get you back on sometime.

Kate:

Thanks so much, Chris. Really appreciate you having me on the show.

Jerod:

All right. That is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find 29 reasons. Yes.

Jerod:

29 reasons why you should subscribe. I'll tell you reason number 17. You might actually start looking forward to Mondays.

Kate:

Sounds like somebody's got a case of the Mondays.

Jerod:

28 more reasons are waiting for you at changelog.com/news. Thanks again to our partners at fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.