Chain of Thought | AI Agents, Infrastructure & Engineering


Show Notes

Building trustworthy, scalable AI isn't just about models; it's about navigating a complex ecosystem of tools and regulations. 

Join hosts Conor Bronsdon and Atindriyo Sanyal as they explore these challenges with Dr. Maryam Ashoori, Head of Product for watsonx AI at IBM. To meet these challenges, Maryam explains how watsonx simplifies the AI stack, automates pipelines, and empowers enterprises to scale their AI operations while optimizing costs rapidly.

Maryam also explores IBM's strategy for leveraging open-source and commercial models, enabling the potential of agentic systems. Plus, she shares insights from a recent survey of 1,000 developers, revealing key takeaways about the current landscape for enterprise AI implementation, and what results mean for both developers and the enterprises they support.


Chapters

00:00 Introducing Dr. Maryam Ashoori

01:13 Overview of IBM's AI Strategy

01:47 Enterprise AI Challenges and Solutions

04:40 IBM's Approach to AI Models and Tooling

09:52 Simplifying the AI Stack

12:20 Challenges in Agentic AI

15:55 Importance of Data Management and Lineage

21:11 IBM's Strategy for Gen AI Products

23:43 Scaling Challenges with Agents

27:40 Effective Agent Evaluation Systems

35:18 Gaps and Opportunities in AI Tooling

41:35 Success Stories with watsonx

44:00 Closing Remarks


Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash


Follow Today's Guest(s)

watsonx.ai


Check out Galileo

Try Galileo

What is Chain of Thought | AI Agents, Infrastructure & Engineering?

AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.

Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes weekly.

Conor Bronsdon is an angel investor in AI and dev tools, Technical Ecosystem Lead at Modular, and previously led growth at AI startups Galileo and LinearB.

Disclaimer: All views, opinions and statements expressed on this account are solely my own and are made in my personal capacity. They do not reflect, and should not be construed as reflecting, the views, positions, or policies of Modular. This account is not affiliated with, authorized by, or endorsed by Modular in any way.

[0:00] Speaker:
I think the market is underestimating how complicated the modern AI stack is and what developers need to master and harvest in order to deliver on those potentials that generative AI is promising.

[0:20] Conor Bronsdon:
Welcome, everyone. I am your host, Conor Bronsdon, here with my fellow co-host, Atindriyo Sanyal, CTO and co-founder at Galileo. Atin, great to see you. Great to have you back on the mic with me. Always great to be here, Conor. Yeah, we've got an exciting conversation planned here. We're joined by IBM's Dr. Maryam Ashoori, Head of Product for watsonx.ai. Maryam, welcome to the show. Great to have you here. Thanks for having me.

[0:44] Conor Bronsdon:
We're really excited about it because you've got over fifteen years of experience in developing data-driven technologies, and now you're responsible for leading watsonx.ai, IBM's AI development studio. Our listeners may know you as an expert in enterprise AI with two master's degrees in artificial intelligence and a PhD in systems design engineering from the University of Waterloo,

[1:05] Conor Bronsdon:
where you are also an adjunct professor, which is fantastic. I have to admit, I've always wanted to teach a university class. That's really cool. So it's such a pleasure to have you on the show. Let's jump right into your work at IBM. Can you provide an overview of IBM's AI strategy and how watsonx is fitting into that vision or driving it?

[1:24] Speaker:
Everything that we do is a reflection of the market. And it's been exciting to just watch how the market has been evolving over the past, I would say, twenty-four months at this point. Looking back, I feel like in 2023 or even early 2024, most of the market was exploring and investigating with generative AI. They were looking for a wow factor

[1:48] Speaker:
and moment. But at this point, the majority of the enterprise market has moved past that moment. They have moved toward production and scale. And that's the area we've been focused on designing for since day one: what are the requirements for enterprise production and scale when it comes to generative AI, or AI in general? And it's been interesting to see how consistent the market has been in terms of the challenges. The top three challenges that I've been seeing, which were basically the guiding principles for the design of our platform, center around, one, ensuring a responsible implementation of AI.

[2:29] Speaker:
The second one is cost-performance optimization. You're talking about the scale of enterprise. Very large general-purpose models with large compute are not necessarily delivering what you need. So you need optimization, especially of the cost. And the last area of focus, which has accelerated with agents over the past year, is how can we increase and capitalize on the

[2:58] Speaker:
ROI of generative AI, bringing this generative AI potential to every single corner of the enterprise, which is really taking advantage of automation to bring that level of productivity to every single corner. So these three areas, optimization, ensuring a responsible implementation of the technology, and bringing automation to every single corner of the enterprise, have been the driving force for what we are designing as part of the platform for watsonx. Atin, I know a lot of these ideas resonate with you. Do you want to chime in here?

[3:34] Speaker:
I think there's a lot of similarity between the kind of market that IBM's going after, big enterprise, and ours. Even though we're a much smaller company, my experience with our customers is very similar to what Maryam just described. We spent a lot of time in 2022 and 2023 where AI was really just in simmer mode. And the last few months have really seen a Cambrian explosion of agentic applications and the maturity of some of these orchestration frameworks.

[4:12] Speaker:
So very similar observations on my side as well. But one follow-up question for you, Maryam: given what you've seen in big enterprise, how does IBM think about build versus buy when it comes to open-source models as well as bigger models like those from OpenAI and Anthropic? Does IBM look at it as leveraging these models in their platform, or do you focus more on building foundational models from scratch?

[4:41] Speaker:
Yeah, so let's talk about, for example, the cost-performance challenge that we just brought up as one of the challenges. Cost-performance optimization. If you want an optimized solution, what are you gonna need? You're gonna need an LLM or a collection of LLMs, and you're gonna need a bunch of model customization tooling that allows you to take advantage of your proprietary data about your users, your domain-specific

[5:10] Speaker:
data to build something that is differentiated from the market, but is also delivering the performance that you want for your target use case for a fraction of the cost, right? So that's basically the problem that we are trying to solve. So in this story, in order to deliver on this challenge, what you're gonna need is to have access to a series of state of the art models, right?

[5:33] Speaker:
Different sizes, different architectures, different license terms. We have global customers. Maybe there are some license restrictions in one part of the world versus others. There are GPU restrictions in some parts of the world. It's like, what are the requirements for running these models? Right? So you wanna have access to all of those. And it's the same for the tooling for model customization. There is a range of approaches. You might just need prompt engineering and RAG, or in some cases you need full fine-tuning, parameter-efficient fine-tuning, or even alignment tuning, right? So the needs are different.

[6:09] Speaker:
We wanna make sure that there is optionality and there are choices for our customers. And in terms of choices and flexibility, we've also been looking into where these options are coming from, right? So for example, for models, we've been leveraging the state-of-the-art open-source models. The first day that a state-of-the-art model comes out, we have it out on the same day, so our users have access to it. But also some of them are commercial models. For example, we established partnerships with Meta and Mistral over the last year. So Mistral Large is an example of a commercial model that we have available in our platform.

[6:47] Speaker:
Or a customer may come in and say, hey, I have a custom model that I made myself on premises or on another platform and I want to bring it in, import it, so it's not in your catalog. So we have also expanded to let you import your own custom models. So basically we're delivering a range of options because we strongly and firmly believe that one single model is not the solution for a range of use cases. It's a mix and match. So that's the model part. The same goes for the tooling part. We wanna make sure that we are integrated with ecosystems. So for example, you mentioned agents.

[7:22] Speaker:
Where are the developers these days? They are experimenting. They are exploring. Some are exploring with CrewAI. Some are exploring with LlamaIndex. Some are exploring with LangGraph. We are bringing all of those integrations together to make sure that optionality is available for experimentation. But also, we care about production and scale, security, privacy, deployment,

[7:45] Speaker:
like availability of the service, robustness, all of those in production, not just experimentation.

[7:51] Conor Bronsdon:
Maryam, I'd love to understand more about how watsonx as a platform is differentiating from other AI platforms in the market. You've talked a bit here about enabling open source, trying to make sure that you access other ecosystems. What are the other ways that watsonx is different?

[8:10] Speaker:
Very good question. So back to the optionality and choices that we talked about, we wanna make sure that our customers are not locked in, and that's one of the foundations that we have been focusing on: being hybrid. So in our platform today, you can take the whole platform and deploy it on basically the platform of your choice, either on premises or the cloud of your choice, right? Versus being locked into one single model provider. So hybrid is one of them. The second one that we are really

[8:43] Speaker:
serious about is trust. Ensuring a responsible implementation of AI. You saw that with tooling and models. For models, we started training our own models from scratch. That's our Granite collection, because we wanted to be comfortable standing behind those models and provide indemnification for our customers. Right? And we've been very transparent in terms of how the models are trained.

[9:06] Speaker:
And that's on the model side of the house. The same story for tooling. We've been heavily investing in our governance and observability platform, watsonx.governance. Every step of the way, like who touched the model to do what, we automatically document the lineage of what happened. But also, we are putting guardrails in place. Guardrails on input, guardrails on output, guardrail orchestrators.

[9:32] Speaker:
Right? Basically monitoring everything. So I would say that trust is very close to our heart and very critical for us, especially because we have a lot of customers coming from highly regulated environments like finance, insurance, health care, you can imagine. It's a serious thing for them. So trust is the second angle. The third angle is we recognize the complexity of the stack.

[9:58] Speaker:
I think the market is underestimating how complicated the modern AI stack is and what developers need to master and harvest in order to deliver on those potentials that generative AI is promising. Right? So from our perspective, we've been looking into simplifying the stack as much as we can, integrating behind the scenes, integration between different components of the platform, but also integration with the ecosystem. For example, we mentioned CrewAI, we mentioned LangGraph. We are integrated behind the scenes. So a developer comes in, they don't need to be worried about

[10:35] Speaker:
maintaining code coming from a third party or learning it. It's one single SDK. They get the job done. Right? So simplicity is the next factor that I would mention. And last but not least is automating, providing guidance as much as we can. We acknowledge that, for example, in this work, there are lots of developers getting into AI application development when

[11:00] Speaker:
the depth of AI knowledge and skills may not be there. For example, we asked a thousand developers. We ran a survey; they were based in the US, and they were building AI applications. We asked how comfortable they are with GenAI. And surprisingly, of the AI app developers that we talked to, only 24% described themselves as knowledgeable and skilled in GenAI.

[11:23] Speaker:
So there is a big gap in AI skills among developers. Right? So we've been trying to help because of that. And why is it important? When you think about cost-performance optimizations or any sort of optimizations, you need to have a knowledge of AI. Like, what parameter is that? Hyperparameter optimization is one angle. Like, what is the right model to use? What is the right technique for model customization

[11:48] Speaker:
tool to use? So we've also been heavily investing in automating those pipelines. So for example, for RAG, it's like, how can I automate multiple RAG pipelines with different parameters and show developers which one performs better so they can just pick that? Right?
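A minimal sketch of that kind of automated pipeline comparison, in Python. Nothing here is the watsonx API: the toy corpus, the word-overlap retriever, the `hit_rate` metric, and the parameter grid (`chunk_size`, `top_k`) are all hypothetical stand-ins chosen so the example runs on its own.

```python
from itertools import product

# Toy corpus and evaluation set (hypothetical data for illustration only).
DOCS = [
    "Granite models are trained by IBM. They ship with client indemnification.",
    "Guardrails filter model input and output. Governance records model lineage.",
]
EVAL_SET = [
    ("who trains granite models", "IBM"),
    ("what records model lineage", "Governance"),
]

def chunk(text, size):
    """Split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, chunks, top_k):
    """Rank chunks by naive word overlap with the question (stand-in retriever)."""
    q = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))
    return ranked[:top_k]

def hit_rate(chunk_size, top_k):
    """Fraction of eval questions whose expected answer appears in the retrieved text."""
    chunks = [c for d in DOCS for c in chunk(d, chunk_size)]
    hits = 0
    for question, expected in EVAL_SET:
        context = " ".join(retrieve(question, chunks, top_k))
        hits += expected.lower() in context.lower()
    return hits / len(EVAL_SET)

def sweep(chunk_sizes, top_ks):
    """Run every parameter combination and return the results sorted best-first."""
    results = [
        {"chunk_size": cs, "top_k": k, "hit_rate": hit_rate(cs, k)}
        for cs, k in product(chunk_sizes, top_ks)
    ]
    return sorted(results, key=lambda r: -r["hit_rate"])

if __name__ == "__main__":
    for row in sweep(chunk_sizes=[4, 8], top_ks=[1, 2]):
        print(row)
```

A real sweep would swap in an actual retriever and generator and a richer metric; the point is only the shape: enumerate configurations, score each on the same evaluation set, and surface the ranking to the developer.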

[12:05] Conor Bronsdon:
And that's the fourth area that we've been heavily investing in. Atin, I see you nodding along to a lot of that.

[12:11] Speaker:
Yeah, I do agree on a lot of things that Maryam mentioned, especially around simplicity, sort of meeting developers where they already are in familiarity. And scale is certainly a very big challenge, which is, I would say, unsolved in agentic workflows because we're in the prototypical phase of agents and it's all very exciting. But one additional observation I have made is when it comes to agents,

[12:38] Speaker:
the interesting pattern of development we've seen, and this comes from my own experience having worked on the first version of Siri over a decade ago, a lot of the software engineering paradigms are coming back into play. And this includes frameworks like LangGraph and LangChain, which are essentially incorporating software paradigms and design patterns that have been known to work for traditional software systems.

[13:03] Speaker:
They've kind of been augmented and sprinkled with LLMs, and there's newer and fewer paradigms which need to be learned, like how do you do rethinking or how do you incorporate any agentic specific things. I almost see this as a sort of an amalgamation of what we've already learned around scaling traditional software, sort of meeting the new world of GenAI and the GenAI components that they bring.

[13:29] Speaker:
And I'm very optimistic and excited about the next twelve to fifteen months when all this will truly come together. And the real challenge will be some of the more fundamental things which Maryam mentioned, like trust and accuracy. Especially from an evaluation perspective, I can talk about it through the lens of evaluation and observability. We've always had problems of

[13:55] Speaker:
observability in traditional software systems, like how do you root-cause something in an effective way. The same challenges are here, except that there are a few newer paradigms that you're dealing with, with LLMs in the mix. So I would say that a lot of things are coming together, and the challenge for building high-quality, effective agents is really high-quality observability,

[14:19] Speaker:
high quality evaluations. And that's something that is very bread and butter to Galileo, and we focus a lot on that day in and day out.

[14:27] Conor Bronsdon:
That seems like it aligns with your perspective on agents as well, Maryam, and enabling developers to really go and explore this technology.

[14:36] Speaker:
No, he's right on. Going back to the beginning, when I started to talk about how the market has been evolving, it's interesting to see that last year they were experimenting with LLMs. They moved to production. Now we are going through the same thing with agents. The market is exploring. They are looking for the wow factor. But when they go to production and scale,

[15:01] Speaker:
they cannot go to production without observability and governance. It's essential. You've got to have transparency and traceability of actions, and monitor that for agents. It's like, what action was taken? Under what circumstances? And can I control that, right? And also over time, not just one time. So you're gonna need to see that tracing of information at build time and run time and after that, over time, right?

[15:31] Speaker:
So these are the areas that I don't think enterprises have been heavily looking into, but once they are past that excitement about agents, these are some of the essential elements that they need to seriously follow up on. Absolutely. And I'd also want to add to what Maryam said earlier about versioning and lineage. I think that's a topic which is not talked about enough. I'm drawing parallels to the age of MLOps

[16:00] Speaker:
when feature stores were created and evangelized. As part of that, model monitoring solutions also came. Lineage and versioning, the data management side of AI, is extremely critical, and it will be all the more critical as agentic systems start to slowly scale, because at the end of the day, data is the fuel that powers any system. That is something that's never changed in machine learning,

[16:28] Speaker:
and that will still be the same for agentic applications. So you need that layer of data management and lineage, combined with observability, to be able to actually root-cause when something goes wrong and to trace the lineage back. A lot of problems in AI essentially boil down to the data, and it's the same story here with GenAI and even with agents.

[16:51] Speaker:
So that layer of data management is very critical. I had an insurance company come to me and say, Maryam, it doesn't matter what the model customization approach is. If you don't have lineage and governance, I just can't use this, because I need to know exactly what version of what model was trained on what version of data. And now with synthetic data generation everywhere, we need to know what was synthetic data and what was real, retain that data, and be able to audit it and trace it back. That was an example of highly

[17:24] Speaker:
regulated industries. They just can't go to production without that knowledge.
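The requirement described here (which model version was trained on which data version, and which data was synthetic) can be sketched as a small lineage record in Python. This is an illustrative assumption, not the watsonx.governance data model: the class names `DatasetVersion` and `LineageRecord`, their fields, and the `audit` helper are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Tuple
import hashlib
import json

@dataclass(frozen=True)
class DatasetVersion:
    name: str
    version: str
    synthetic: bool  # True if generated, False if collected from real sources

@dataclass(frozen=True)
class LineageRecord:
    model_name: str
    model_version: str
    base_model: str
    datasets: Tuple[DatasetVersion, ...]

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the record, usable as an audit key."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def audit(record: LineageRecord) -> dict:
    """Answer the auditor's question: which data, synthetic or real, trained this model?"""
    label = lambda d: f"{d.name}@{d.version}"
    return {
        "model": f"{record.model_name}@{record.model_version}",
        "fingerprint": record.fingerprint(),
        "synthetic": [label(d) for d in record.datasets if d.synthetic],
        "real": [label(d) for d in record.datasets if not d.synthetic],
    }

if __name__ == "__main__":
    # Hypothetical model and dataset names, for illustration only.
    record = LineageRecord(
        model_name="claims-classifier",
        model_version="1.2.0",
        base_model="base-llm",
        datasets=(
            DatasetVersion("claims", "2024-06", synthetic=False),
            DatasetVersion("claims-augmented", "v3", synthetic=True),
        ),
    )
    print(json.dumps(audit(record), indent=2))
```

Making the records immutable and hashing them means a stored fingerprint can later show that the lineage entry for a deployed model has not been altered.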

[17:29] Conor Bronsdon:
So I'd love to know more about what's made for successful implementations in these highly regulated industries. Obviously, IBM is an expert here. It's something that for decades now IBM has done successfully. How are you making sure that you translate that success to this new GenAI agentic era?

[17:42] Speaker:
At the foundations of all of those, we really have the models and we have the tooling stack, right? Let's go back to that. On the model side, there is a deep need for a model that they can trace back, one that is transparent in terms of the training data that went into it. Right?

[18:07] Speaker:
And for only a portion of the models out there can you actually get access to that information. So from our perspective, at least with our Granite models, we wanted to make sure that, one, we establish a trustworthy governance process around the training of the model so that we can document the lineage of what happens and filter out data like, for example, toxic information and copyrighted information and all of those

[18:31] Speaker:
to the best of our knowledge and keep updating that. But also be transparent with the market about how they are trained and provide client indemnification. So basically we are like, okay, this is our strategy to cover you on the model side, right? On the tooling side, it's hand in hand. It's like you grab the pre-trained models, but then through the input and instructions,

[18:56] Speaker:
you can nudge the model to potentially create an undesirable output, right, independent of what the model was. So it's essential for you to automatically document the lineage of that and stop that. For agents, it amplifies. Why? Because for the LLMs of last year, the absolute worst thing that could happen was the LLM creating some content that was inappropriate. Agents

[19:22] Speaker:
can take actions. Actions have consequences. What if the agent decides to take an action to connect to a sensitive structured database and delete some sensitive customer information, or grab it and combine it with other workloads? Then we are talking about amplified impacts. And that's why it's essential to not only document that lineage and track that, but also stop that.

[19:54] Speaker:
Proactively detect those actions that are not appropriate and stop them, or make sure that a human is in the loop and no action is taken automatically when the stakes are high. These are the areas that we've been looking into and that we've been trying to surface and educate the market about. These are the consequences. The stack might not be mature yet, but it will get there. This is a stack that is evolving rapidly in the market around agents.
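A minimal Python sketch of the pattern described here: log every proposed agent action, and let high-stakes ones run only with human approval. The policy sets, the `Action` fields, and the `approve` callback are hypothetical assumptions for illustration, not any product's guardrail API.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = 1
    HIGH = 2

@dataclass
class Action:
    tool: str        # e.g. "sql", "search"
    operation: str   # e.g. "read", "delete"
    target: str      # resource the action touches

# Hypothetical policy: destructive operations and sensitive targets are high stakes.
HIGH_RISK_OPERATIONS = {"delete", "update", "export"}
SENSITIVE_TARGETS = {"customer_db"}

def classify(action: Action) -> Risk:
    """Flag an action as high risk if it is destructive or touches sensitive data."""
    if action.operation in HIGH_RISK_OPERATIONS or action.target in SENSITIVE_TARGETS:
        return Risk.HIGH
    return Risk.LOW

def gate(action: Action, approve, audit_log: list) -> bool:
    """Record every proposed action; allow high-risk ones only if a human approves."""
    risk = classify(action)
    allowed = risk is Risk.LOW or bool(approve(action))
    audit_log.append({"action": action, "risk": risk.name, "allowed": allowed})
    return allowed

if __name__ == "__main__":
    log = []
    deny_all = lambda action: False  # stand-in for a human reviewer who declines
    print(gate(Action("search", "read", "public_docs"), deny_all, log))  # low risk
    print(gate(Action("sql", "delete", "customer_db"), deny_all, log))   # needs approval
    print(log)
```

The audit log is written whether or not the action runs, so the trail covers blocked attempts as well as executed ones.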

[20:24] Speaker:
And I think that's the part where the technology providers and the technology consumers, hand in hand, are trying to figure out what that uncharted territory is. What are the gaps, and let's resolve them. Might be a dumb question for you, Maryam, but I'm curious to know from IBM's perspective. A lot of the things that you've talked about are sort of platform challenges and tooling challenges to enable developers.

[20:52] Speaker:
And Galileo, for example, is essentially an evaluation and observability platform. So a lot of the problems you're talking about, we think about day in, day out. But IBM's also a very big company, and it has the ability to also enable developers to build great GenAI products, which is kind of the layer above the platform. So I'm curious to know from your perspective, is IBM's strategy to

[21:17] Speaker:
just provide the tools to build any kind of application for developers, or also to provide products, GenAI products, to your users? Yeah, very good question. At the end of the day, the goal is to solve customer problems. So for every single client, we sit together with them to identify what the gap is. Sometimes the answer is to take a product and use it out of the box. Sometimes the gap is that you need a custom-built approach,

[21:49] Speaker:
build it. And not every single customer has the talent to actually build that custom solution. The majority of customers at this point are interested in out-of-the-box solutions that solve the problem. So at IBM, we also looked into our own platform. We said, hey, we have a platform that is providing GenAI technologies, but we have a series of software products that can benefit from that technology. So we have a category of

[22:17] Speaker:
software products that are enriched with generative AI technologies. One example is the Environmental Intelligence Suite. It's a product that you can use for disaster management. Behind the scenes, you can use foundation models. We have geospatial models we built with NASA that you can use for the purpose of disaster management, right? Out of the box, you can use that. You don't really need to know what the foundation models are or what the tools are.

[22:46] Speaker:
There is a second category of software that is powered by GenAI. These are the new sets of products that we can now bring to the market. One example of that is watsonx Code Assistant. Behind the scenes, there's a Granite code model powering the whole product, which the developer can use for the purpose of productivity. So within IBM, we are actually thinking about four different areas. One is the platform providing these technologies to the market. The second one is the existing products that can be enriched by and benefit from generative AI. The third one is the new products that are powered by GenAI.

[23:26] Speaker:
And the fourth area is the services that we offer to sit together with the customer and figure out, hey, is one of these helping you, or is there something else where we can pull in a custom solution to solve your problems? Thanks for explaining that. That makes a lot of sense.

[23:41] Conor Bronsdon:
I'm curious to dig into the agentic side more. So you brought up that there's this scaling challenge happening with agents where getting to production, particularly in these highly regulated industries, can be challenging. And some of the ways that we've been tracking on this, and trying to understand the impact of agents, align very closely, I think, with your perspective. It's like,

[24:05] Conor Bronsdon:
did the LLM planner for this agent select the right tool and start on the right path? Did those tools work? Are they having errors? Are they actually advancing towards the ultimate goal, meaning does the trace reflect this action advancement? And then, of course, completion: does the final action align with the agent's original instructions? And you brought up this challenge around

[24:32] Conor Bronsdon:
that final completion metric, saying, look, we have to be really careful, particularly in these highly regulated industries. We actually need to achieve this goal. We can't risk this action completion being incorrect. We have to ensure that these agents are doing not just the right job, but doing it well. How are you approaching that with your customers?

[24:55] Speaker:
Yeah. There are four areas that we are looking into. The first one is the LLM itself. An agent is powered by an LLM. So all the concerns that we have with LLMs in terms of hallucination, lack of explainability, transparency, all of those are applicable here. Even the guardrails in terms of filtering HAP and jailbreaks are all applicable to agents here. So that is coming into the picture. The second category is what I call agent guardrails.

[25:24] Speaker:
So these are the guardrails that are specific to agents that you wanna develop. Like, for example, the faithfulness of the action that was taken. There is a series of metrics that stem from the specific action calling and reasoning. Right? Those guardrails. So that's the second category that we wanna make sure that we have a solid story around. The third one is something that I call agent

[25:47] Speaker:
evaluation. So this agent evaluation is a superset of the LLM evals that we had in the past, plus all the consequences of tool calling that are coming into the picture. Accuracy, performance. Now you're talking about external APIs and the data that you're getting back from that API. How are you gonna evaluate the quality of the response that you got back? Right? Are you gonna have some sort of content filtering on those? How are you gonna deal with those? So that's the third area that I'm looking into, agent eval. And last but not least is one area that Atin mentioned, observability and governance

[26:27] Speaker:
throughout the whole life cycle. Build time all the way to run time and after that, over time, checking what's going on with the agents. These four areas are the ones that we need to look into deeply and see if they are a requirement for the customer. If they are, what level of maturity do they need in order to go to production, and what is the custom subset of metrics that they need to track? If it's not provided by the existing ones, how can we expand it or

[27:00] Speaker:
build shared metrics in the market around those to address those?
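The checks raised in this exchange (did the planner pick the right tool, did the tool calls error, did the final result meet the goal) can be sketched as per-trace metrics in Python. This is a simplified illustration under stated assumptions, not Galileo's or IBM's metric definitions: the `Step` fields, the reference `expected_tool` labels, and the `goal_check` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    tool: str            # tool the planner chose at this step
    expected_tool: str   # tool a reviewer or reference trace marks as correct
    error: bool          # whether the tool call failed

def evaluate_trace(steps: List[Step], final_answer: str,
                   goal_check: Callable[[str], bool]) -> dict:
    """Score one agent run on tool selection, tool errors, and goal completion."""
    n = len(steps)
    return {
        # Share of steps where the planner chose the reference tool.
        "tool_selection": sum(s.tool == s.expected_tool for s in steps) / n if n else 0.0,
        # Share of tool calls that returned an error.
        "tool_error_rate": sum(s.error for s in steps) / n if n else 0.0,
        # Whether the final output satisfies the caller-supplied goal check.
        "completed": bool(goal_check(final_answer)),
    }

if __name__ == "__main__":
    # Hypothetical two-step trace for a refund request.
    trace = [
        Step(tool="lookup_order", expected_tool="lookup_order", error=False),
        Step(tool="send_email", expected_tool="issue_refund", error=True),
    ]
    report = evaluate_trace(trace, "Refund issued for order 123",
                            goal_check=lambda answer: "refund issued" in answer.lower())
    print(report)
```

In practice the reference labels and the goal check are the hard part; they might come from human review or from an LLM acting as a judge, and the metrics would be tracked over many traces rather than one.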

[27:05] Conor Bronsdon:
Yeah. Agentic evaluations are definitely an area where we're spending time researching and improving our products as well. Atin, I'm curious if you wanna speak to your perspective on it, because I think there's a ton of ground to cover here. As Maryam and you have both highlighted, there are misconceptions about agentic AI, where folks just think, oh, this is gonna solve my problem, and they don't think about the entire structure that has to go around this to consistently solve the right problem in the right way. May from Writer said to us a couple months ago, you know, this is not magic. We have to actually put guardrails in place. We have to build the right systems. So, Atin, what's your perspective on this?

[27:42] Speaker:
Yeah, my perspective is potentially slightly orthogonal to what Maryam was saying, although I do agree with what she was saying, double-clicking into essentially what an effective end-to-end agentic evaluation system comprises. I kind of flip the words. People talk about agentic evals. I see it as: we need to build evaluation agents, because every piece of evaluation essentially needs to be agentic by nature,

[28:13] Speaker:
because they need to adapt to the different variables in your system once you productionize it, data being a key variable because data is always changing and there's no one size fits all metric or any kind of statistical measure that will give you a good sense of whether the entire agentic system or parts of it are doing well or not. What needs to happen is you need to curb the false positives very quickly

[28:39] Speaker:
and give the instructions to the agent, which is the evaluation agent, to not make those same mistakes so that the agent itself evolves with your product. So that's kind of my view in that you need to solve the problem end to end. One is you need to break it down into different components because each component needs to be evaluated in a different way. There are different measures for the health of your RAG system, which include retrieval quality, ranking quality,

[29:06] Speaker:
and then there's the output which can be measured by a different set of metrics. But each of those metrics can't be one and done. They need to be adapted through human feedback, which is where there's a massive human feedback component. But the challenge for a platform play here becomes how do you take that human feedback in a minimal and least intrusive manner

[29:27] Speaker:
and incorporate it into your ecosystem so that the agent doesn't make that same mistake ever again. So it's a lot of these things that need to come together to build a cohesive platform, and we at Galileo kind of see them as self-evolving evaluation agents, which is kind of like a multilayered system, at the bottom of which are the fundamental metrics, the statistical measures that give you some sense of the leading indicators of good or bad. But on top of that, there's this automation

[29:56] Speaker:
layer that you have to build, which can be seen as agentic in itself, which adapts to the data and any other variable in the system, whether it's the usage patterns which are evolving over time or the data that's changing over time. And how do you make those evaluation paradigms adapt to that? And I'm glad you brought it up, Atin. I think it's two ways,

[30:19] Speaker:
both of them, not this or that. So for example, in agent evaluation, one area that we are looking into is LLM-as-a-judge: if it's a chain of thought, say, break it down into different pieces, multiple nodes, and at every node evaluate whether the right action was taken. And if it wasn't, automatically regenerate the prompt

[30:49] Speaker:
template that can cover that and fix it. So an LLM behind the scenes is part of the process in this case to do the agent evaluation. So I think it goes two ways: using agents for evaluations, but also agent eval itself. No, absolutely. I totally agree. And the other side of the coin is the scale and cost, which we talked about initially. And the same challenges apply to evaluation

[31:18] Speaker:
as well. To your point about using LLM-as-a-judge and chain of thought, they do give you great results, but behind the scenes, chain of thought is a very expensive process which involves, not to get too technical, multiple forward passes across the deep learning model, and those costs add up because they're running on very expensive hardware. So one of the challenges that we've been working through is how do you build effective

[31:44] Speaker:
but cheaper evaluations which need not necessarily use the most complex paradigms all the time. Chain of thought is great, but one of our discoveries has been that in many circumstances, from an evals perspective, chain of thought might not be needed. And you can get high-accuracy evals without chain of thought. And we've published about it as well. That's the other side of the coin, which is how do you surface highly accurate evaluations

[32:13] Speaker:
at a cheaper price? Exactly, looking at what is the use case, what is the best way to tackle that. Funny enough, last week we released the new Granite, Granite 3.2, Granite Reasoning. For that one, you can toggle thinking on and off. So if you don't need that chain-of-thought reasoning, because it's very expensive, you turn it off. And for the use cases where you actually need it, you can turn it on. But I'm with you, it's a combination of models and evals, and at the end of the day, cost-performance optimization.
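The trade-off discussed here (evaluate every node of an agent trace, but reserve the expensive chain-of-thought judge for cases that need it) can be sketched roughly as below. This is a minimal illustration, not Galileo's or IBM's implementation: `Step`, `cheap_judge`, `expensive_judge`, `evaluate_trace`, and the thresholds are all hypothetical names and values, and both judges are stand-in functions rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One node in an agent trace: what the agent did and what it observed."""
    action: str
    observation: str


# A judge maps a step to a score in [0, 1]. Both judges below are stand-ins.
Judge = Callable[[Step], float]


def cheap_judge(step: Step) -> float:
    """Hypothetical low-cost check with no chain of thought (keyword heuristic)."""
    if "error" in step.observation.lower():
        return 0.0
    if step.observation.strip() == "":
        return 0.5  # nothing to look at, so the cheap check is uncertain
    return 1.0


def expensive_judge(step: Step) -> float:
    """Placeholder for a chain-of-thought LLM judge; swap in a real model call."""
    return 0.0 if step.action == "noop" else 1.0


def evaluate_trace(
    steps: List[Step],
    cheap: Judge,
    expensive: Judge,
    low: float = 0.25,
    high: float = 0.75,
) -> List[Tuple[bool, bool]]:
    """Return (passed, escalated) per step.

    The expensive judge runs only when the cheap score falls strictly
    between `low` and `high`, i.e. when the cheap check is not confident.
    """
    results = []
    for step in steps:
        score = cheap(step)
        escalated = low < score < high
        if escalated:
            score = expensive(step)
        results.append((score >= 0.5, escalated))
    return results


trace = [
    Step("search", "3 documents found"),
    Step("call_api", "Error: timeout"),
    Step("noop", ""),
]
results = evaluate_trace(trace, cheap_judge, expensive_judge)
```

In this toy trace only the third step is escalated, so the costly judge runs once instead of three times. Where the thresholds sit is the cost-versus-accuracy knob.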

[32:47] Conor Bronsdon:
Love it. This has been a fantastic part of the discussion. I'd love to keep diving into agentic AI more broadly, as it's obviously kind of the theme of 2025 thus far in AI. What are the misconceptions, Maryam, that you see different businesses having about how we should be approaching agentic AI? Let me tell you a fun fact.

[33:11] Speaker:
Those two master's degrees that you mentioned, it was actually in multi-agent systems twenty years ago. So I did agents and multi-agent systems before deep learning. So when you talk to me, I'm looking at the definition of agents that have reasoning capabilities. I'm in that category of people who don't believe the LLMs of today can do reasoning, because in my mindset, that's very different from

[33:38] Speaker:
the culture of agents that I grew up with. Right? And I think this is also a misconception in the market. What's commonly known as agents in the market is basically an LLM with function calling and tool calling, which is, don't get me wrong, a great opportunity to bring GenAI to every single corner of the enterprise, because now you can connect them together. Right? But the promise of agents is really that autonomous reasoning

[34:07] Speaker:
and planning and making decisions and taking actions. I don't think we are there yet. We have some sort of preliminary planning in place, but I think that's the opportunity that over the next six to nine months I'm personally super excited about. I feel like once we get comfortable with our technology stack collectively in the market to a point that these models can actually make

[34:32] Speaker:
sound decisions, then that's the point where it unlocks a lot of use cases for all of us. And I think that's a misconception in the market, that it does it today, but it's not really capable of doing that today. I have a follow-up question, Maryam, for you. This might be a little more general, but in the evolving tool stack that you see today and, you know, the clear increase in capabilities of what we can do with LLMs,

[35:03] Speaker:
what is the number one thing that you find is maybe a gap in the tooling ecosystem or the platform, anything that can enable developers to build effective, high-quality agents? If there's a gap at all that you see, what is top of mind? Yeah, I would say that there are three areas that are both gap and opportunity, right? These are the active areas that they care about. One, back to our optimization,

[35:31] Speaker:
flexibility and choices. When you talk to developers, I talk to my developers. I surveyed them and I said, how many tools are you using on average on a daily basis to create an AI application? More than half of them said five to 15 tools. Five to 15 tools in a market that is evolving rapidly, where new tools are coming up fast. I said, how much time are you willing to dedicate to learning a new tool to figure out if it's the right tool to integrate into your stack or not? They said not more than two hours.

[36:05] Speaker:
So they are craving new technologies that are easy to learn and master. Less than two hours, bring it in, integrate. This is a major need. Need for simplicity, need for abstraction, need for integration behind the scenes to address the optionality that they need. And I think that's the area that we should all work on collectively. The second area is innovation. The craving for innovation.

[36:32] Speaker:
That world where researchers develop something, then next year they publish a paper, and then the year after product picks it up: it's all gone. Right? Now we are talking about research. What do you have today? What were you working on last night? Can I bring it into products? The speed of innovation, that's the expectation from developers. Even if they don't have the AI background,

[36:57] Speaker:
they expect us to have this state-of-the-art technology as it becomes available to the market for them in a way that they can consume. And that's a major gap. Like, it's not an easy thing because they need to go all over the stack. It's not just the application layer. They need to go to the model layer, GPU level, runtime level, optimize everything. And I think our role is to simplify that and tackle those three

[37:24] Speaker:
areas: simplicity, innovation, and optionality and choices. I couldn't agree more. I think I've had the same observation, where we've kind of reached the point where the excitement for a new model that comes out is starting to saturate and dwindle over time. Right? It's like, hey, there's a new model. Great. It probably does the same thing. And there are

[37:50] Speaker:
step-function improvements on some academic benchmarks, which the citizen developer or someone who just wants to build something awesome wouldn't really care about. And on the other side, there's tool fatigue and this proliferation of, hey, here's another tool that can help you build something. But if that tool is not simple, to Maryam's point... I, as a developer, am trying to keep pace with the innovation and really trying to build something awesome, so give me tools that can help me on the way and not serve as impediments.

[38:22] Speaker:
So I see that as a challenge as well. Yeah. This pacing challenge also

[38:26] Conor Bronsdon:
speaks to something that Maryam brought up earlier in the conversation, talking about this survey of a thousand developers you did where, what was it, 23-ish percent said, yeah, I feel proficient right now. The pace of this industry and of the tooling around the industry, everything you need to learn, it's moving so rapidly. There's always a new model coming out. There's always a new tool. It's got some new innovation,

[38:47] Conor Bronsdon:
and it's making it really hard for folks to feel fully comfortable. I can see a world where this just continues to accelerate, and it becomes harder and harder to keep pace. So I'm curious how you both think developers and their technical leaders should be thinking about this challenge.

[39:05] Speaker:
My suggestion is always to know your problem. What is the problem that you're trying to solve? And establish a point of view on the market, because then they are able to differentiate what is noise and what is worth their time. They have to be very careful about their time. They have limited time to spend on different pieces of technology. And the best thing that they can do is understand their problem and have an established point of view to be able to evaluate what is noise and what is worth the time. I was in a hackathon recently

[39:42] Speaker:
where I saw a demo of essentially taking a research paper, like the Transformers paper, for example, which was the demo. You upload it and you create a podcast out of the paper. And it blew my mind because like, just turning the time back to when I used to do hackathons very, very often, this kind of stuff was unthinkable to be built in ten hours. So the technology and the ingredients are there for us to build something amazing.

[40:14] Speaker:
The only impediments are really the cost, the platform, and just the simplicity of tooling that can enable me to accelerate and build something very real, as opposed to the tooling itself being an impediment. And the noise is really one of the big challenges. My last advice is take advantage of that. I was asking the developers, how much AI-assisted coding are you using? And basically

[40:40] Speaker:
a good portion of them, I don't remember the numbers now, maybe it was 49%. It was a sizable share. They said they use it very often. And on average, they said they are saving one to two hours per day out of that coding. And four percent of the developers that I talked to said they are saving more than four hours a day. Like, if they really know how to use AI for that purpose.

[41:06] Speaker:
And I'm like, this is massive, like, get on that. Take advantage of AI for the sake of your own productivity.

[41:15] Conor Bronsdon:
Maryam, this has been a really fantastic conversation. Honestly, I'm looking at the talk track we built in advance, and I feel like we only got through half our questions because you and Atin are diving into, like, so many interesting threads here and having these great back-and-forths. I'm like, well, I can't move us forward. They're having too great of a conversation here. But I want to close out with a couple of highlights

[41:35] Conor Bronsdon:
that we didn't have a chance to chat about. So, one, do you have a good example of a success story you can share with the watsonx platform? Because you've talked about some really interesting examples here, some challenges with regulated industries, but I also know there are some big successes that have happened. So I'd love to know, and maybe inspire our audience with, what does success look like? The point that I like to call out is,

[41:59] Speaker:
and that's a misconception, that Gen AI is applicable to every single enterprise use case. It's not. We need to understand what Gen AI is capable of addressing and what are the use cases that it unlocks. And then when you look across enterprises, there are thousands of those examples that can come up. So for example, for generative AI, if you look into content-grounded question answering as the number one use case for generative AI, classification,

[42:28] Speaker:
information extraction, code generation that we were just talking about, summarization, right? Those are some of the applications. But perhaps the top one that I like to highlight is content-grounded question answering, and most specifically in customer care. Because now we can basically equip whoever has historically been providing the answers to our customers

[42:52] Speaker:
with knowledge that is generated from a body of information that can be quickly verified. And that was the story of last year. That was RAG. Now with agents, we can say, hey, LLM, if you don't find the answer in that body of information, you don't need to say, I don't know. Go fire up a web search API and see what you can find and come back. And we always have a human in the loop to verify that.
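The agentic RAG pattern described here (answer from the knowledge base when possible, otherwise fall back to a web search tool, with a human verifying) might look something like the sketch below. Everything in it is illustrative: `Answer`, `lookup_knowledge_base`, `answer_question`, and the injected `web_search` callable are hypothetical names, retrieval is a toy keyword match, and the choice to flag only fallback answers for human review is an assumption of this sketch rather than something stated in the conversation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Answer:
    text: str
    source: str         # "knowledge_base", "web_search", or "none"
    needs_review: bool  # True when a human should verify before it is used


def lookup_knowledge_base(question: str, kb: Dict[str, str]) -> Optional[str]:
    """Toy retrieval: return the first passage whose key appears in the question."""
    q = question.lower()
    for key, passage in kb.items():
        if key in q:
            return passage
    return None


def answer_question(
    question: str,
    kb: Dict[str, str],
    web_search: Callable[[str], List[str]],
) -> Answer:
    """Try the grounded knowledge base first, then fall back to a search tool."""
    passage = lookup_knowledge_base(question, kb)
    if passage is not None:
        return Answer(passage, "knowledge_base", needs_review=False)

    # Fallback step: instead of answering "I don't know", call the tool.
    hits = web_search(question)
    if hits:
        return Answer(hits[0], "web_search", needs_review=True)

    return Answer("I don't know.", "none", needs_review=True)


kb = {"refund": "Refunds are issued within 14 days."}


def fake_search(query: str) -> List[str]:
    """Stand-in for a real web search API."""
    return ["Result from the web"]


grounded = answer_question("How do I get a refund?", kb, fake_search)
fallback = answer_question("What is the weather?", kb, fake_search)
empty = answer_question("What is the weather?", kb, lambda q: [])
```

Passing the search function in as a parameter keeps the sketch independent of any particular search provider and makes the fallback path easy to test with a stub.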

[43:20] Speaker:
So it's just amazing to think about the potential consequences for every single industry of this very simple example of agentic RAG, or whatever you wanna call it. Right? And because of that, I'm super excited. It's not just one example use case for one specific industry. It's impacting every single one of us, as consumers, as enterprise businesses. That's the beauty of GenAI. We just have to figure out how to really deliver on the potential of

[43:54] Speaker:
what they have promised to deliver with our stack and our capabilities,

[43:59] Conor Bronsdon:
empowering developers to build those. Absolutely. And I think that's why conversations like this one are so important. So Atin, Maryam, thank you so much for joining me today. Maryam, where can folks learn more about your work or the work you're doing at IBM? watsonx.ai. That's my product. Perfect. That is an easy one to remember, and we will, of course, link it, along with everything else we discussed today, in the show notes. Go check out the research we've talked about. If you've got that survey you mentioned, the one you sent out to developers, and some of that's public, we'd love to share that as well. Wonderful. And if you're listening and you are checking out the show notes,

[44:30] Conor Bronsdon:
maybe just go ahead and leave us a review, and hit that subscribe button on whatever platform you're listening on. It really helps us bring more guests like Maryam on the show. If you're developing an AI product or leading a team, you can reach out to me or to Atin directly: Conor Bronsdon or Atindriyo Sanyal. You can find us both on LinkedIn. We love having these conversations, so reach out if you'd like to be on the show or have a guest to suggest. Maryam, just thank you so much. This has been so fantastic. It was a ton of fun. Thank you.

[44:56] Conor Bronsdon:
Thanks all.