Chain of Thought | AI Agents, Infrastructure & Engineering


Show Notes

Most AI agents are built backwards, starting with models instead of system architecture.

Aishwarya Srinivasan, Head of AI Developer Relations at Fireworks AI, joins host Conor Bronsdon to explain the shift required to build reliable agents: stop treating them as model problems and start architecting them as complete software systems. Benchmarks alone won't save you. 

Aish breaks down the evolution from prompt engineering to context engineering, revealing how production agents demand careful orchestration of multiple models, memory systems, and tool calls. She shares battle-tested insights on evaluation-driven development, the rise of open source models like DeepSeek v3, and practical strategies for managing autonomy with human-in-the-loop systems. The conversation addresses critical production challenges, ranging from LLM-as-judge techniques to navigating compliance in regulated environments.

Connect with Aishwarya Srinivasan:

LinkedIn: https://www.linkedin.com/in/aishwarya-srinivasan/

Instagram: https://www.instagram.com/the.datascience.gal/

Connect with Conor: https://www.linkedin.com/in/conorbronsdon/

00:00 Intro — Welcome to Chain of Thought

00:22 Guest Intro — Aish Srinivasan of Fireworks AI

02:37 The Challenge of Responsible AI

05:44 The Hidden Risks of Reward Hacking

07:22 From Prompt to Context Engineering

10:14 Data Quality and Human Feedback

14:43 Quantifying Trust and Observability

20:27 Evaluation-Driven Development

30:10 Open Source Models vs. Proprietary Systems

34:56 Gaps in the Open-Source AI Stack

38:45 When to Use Different Models

45:36 Governance and Compliance in AI Systems

50:11 The Future of AI Builders

56:00 Closing Thoughts & Follow Aish Online

What is Chain of Thought | AI Agents, Infrastructure & Engineering?

AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.

Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes weekly.

Conor Bronsdon is an angel investor in AI and dev tools, Technical Ecosystem Lead at Modular, and previously led growth at AI startups Galileo and LinearB.

Disclaimer: All views, opinions and statements expressed on this account are solely my own and are made in my personal capacity. They do not reflect, and should not be construed as reflecting, the views, positions, or policies of Modular. This account is not affiliated with, authorized by, or endorsed by Modular in any way.

[0:00] Aishwarya Srinivasan:
You cannot just look at a model benchmark and feel that, hey, this is a good enough model for me to use in my workflow, and then just tie the models together with each other. That's the worst thing that you can do, because I can break any model with a bunch of quirky, twisted prompts. So that's where it's important to understand that, hey, how do you architect it as software?

[0:25] Conor Bronsdon:
Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon. And today, we're diving into one of the critical challenges facing our industry: taking AI agents from promising prototypes to reliable production systems that operate at scale. And we have the perfect guest joining us. Aish Srinivasan is a renowned AI expert, Head of AI Developer Relations at Fireworks AI, and a powerful voice for building AI the right way. You also may know Aish from her,

[0:54] Conor Bronsdon:
what, 600,000 followers on LinkedIn that she has developed and the incredible content that she shares there. Aish, welcome to the show. It is so good to see you.

Aishwarya Srinivasan:
Thank you so much, Conor. I'm very glad to be here.

Conor Bronsdon:
Yeah. We've had the opportunity to collaborate a couple of times already, but we haven't actually sat down for a recorded one-on-one conversation like this. So I think this will be a fantastic way for us to dive deeper into some of the things that I know we're really excited about with LLMs and small language models, evaluation-driven development, and so much more. But let's start with the bedrock of all of this. It's one thing to build a really cool agent demo. We've seen a lot of them. I know you have in particular.

[1:31] Conor Bronsdon:
But it's another thing entirely to deploy it to potentially millions of users responsibly. When we talk about responsible AI in the context of agents at scale, what are the top risks that you're seeing?

[1:46] Aishwarya Srinivasan:
I think when we're going from building a prototype to building something in production, one of the biggest differences comes when we are seeing LLMs as tools versus LLMs as part of tools. What I mean by that is since maybe the end of 2022, 2023, we have seen a lot of applications which are so-called wrappers around LLMs. And the reason it's being called such is at the end of the day, it is a little bit of engineering,

[2:20] Aishwarya Srinivasan:
but most of it is LLM calls. That's what's happening. We are expecting that this LLM is going to be the magic box where we're going to ask it any sort of questions, follow-ups, and it's going to get us the end product. And I think that has quickly changed where people have realized that, hey, it's way more of a software engineering challenge rather than just a model building challenge.

[2:45] Aishwarya Srinivasan:
That is one part of it, sure. That is why all the frontier model labs are working towards it. But when it comes to building an application that uses generative AI as a brain for it, things get way more complicated. So when people talk about responsible AI, right, one of the biggest risks or challenges that I've seen, or one of the issues that gets tackled, is around how much autonomy is the right autonomy.

[3:13] Aishwarya Srinivasan:
Because of the pure nondeterministic nature of the model, it sometimes is very hard to go through all of the edge cases and understand where a model might fail. So people are obviously going with eval data sets, trying to understand that, hey, like this is how it performs on my eval set. But at the end of the day, the question is how good is the benchmark? Why have you chosen that model? How good is the eval set that you have built?

[3:44] Aishwarya Srinivasan:
It's as good as you think it is, but there are no gold standards for you to quantify how good that is. So at the end of the day, it only gets as good as you're testing it in your development environment. But as soon as it goes into production, there is a lot more variability that it encounters. And it's very simple, right? Like the exact same way we have seen chatbots grow from,

[4:08] Aishwarya Srinivasan:
hey, select between these four options to more language specific chatbots, which use something like Dialogflow at the end of the day, to what we have right now, which is a completely autonomous LLM that's able to converse without any boundaries to it, right? As soon as that happens, it's very hard to keep it within a specific territory. And that's where it needs to be very well planned where you add a human in the loop.

[4:38] Aishwarya Srinivasan:
So that's one of the top challenges, right? Like how much autonomy is too much autonomy, and which are the places where we need to have a human in the loop. And there's no right answer to it. It all depends on what your user journey looks like with that particular toolkit, where the possible points of risk are, and which are the places where it makes most sense for you to have a human in the loop or have a more deterministic

[5:02] Aishwarya Srinivasan:
review or checkpoint at that point in time. So that's one of the biggest risks that I've seen. Now, the second thing that I've seen is with these kinds of models, because even the testing is not very rigorous or it's not very deterministic, it's not very quantified, people are still going with this vibe eval where it's like, hey, I'm just going to run the model. I know this particular coding question off the top of my head, I'm just going to run it through the model and see how it performs. Oh, it does good. But how many times can you really check? How rigorously can you check for

[5:39] Aishwarya Srinivasan:
these models and how they perform and how they are going to perform when you land them in your users' hands. That stress testing is something which is still a big question mark. The third thing that I've seen is around reward hacking, because in the systems that we are building, we are using a lot of reinforcement models. So whether it be for fine tuning or human in the loop, we are using a lot of reinforcement models.

[6:06] Aishwarya Srinivasan:
And it does sometimes come back and bite you in the way that the agents try to find those shortcuts or they reverse engineer how it can maximize the rewards that you've put inside the system. And that can sometimes come back and bite you. So these are like the top three things that I have seen particularly with respect to challenges which are aligned to responsible AI agentic systems.
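The first risk above, deciding where a human sits in the loop, can be sketched as a small policy that routes each agent action by risk. This is only an illustration under stated assumptions: the action names, the risk scores, and the 0.4 threshold are hypothetical and do not come from the conversation. A real system would derive them from its own user journey and compliance needs.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"


# Hypothetical actions and risk scores, for illustration only.
ACTION_RISK = {
    "answer_faq": 0.1,
    "update_account": 0.6,
    "issue_refund": 0.9,
}


@dataclass
class AutonomyPolicy:
    # Assumed threshold: actions at or below it run without review.
    max_auto_risk: float = 0.4

    def decide(self, action: str) -> Decision:
        # Unknown actions are treated as maximally risky, so they
        # always go to a human rather than running unchecked.
        risk = ACTION_RISK.get(action, 1.0)
        if risk <= self.max_auto_risk:
            return Decision.AUTO_APPROVE
        return Decision.HUMAN_REVIEW


policy = AutonomyPolicy()
for action in ("answer_faq", "issue_refund", "delete_database"):
    print(action, "->", policy.decide(action).value)
```

Defaulting unknown actions to human review is a deliberate choice in this sketch: the agent gains autonomy only for steps someone has explicitly scored.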

[6:33] Conor Bronsdon:
You brought up a ton of great points there. I mean, the whole topic of evaluation, we can definitely dive into. And I think the other piece that really struck me is this idea of finding the right context for LLMs. And there's a reason that we're hearing the phrase context engineering talked about so much more. You know? Folks were talking about, oh, prompt engineering, and, of course, folks are still adapting prompts.

[6:56] Conor Bronsdon:
But actually enabling our systems with the right context can be really challenging. You have a deep background in data centric AI. How are you thinking through the right data approaches that will actually enable AI systems and agents with the context they need to be successful and hopefully help address some of these risks and failure modes before they make it to production?

[7:24] Aishwarya Srinivasan:
Absolutely. I think data centric behavior is not something which is specific to machine learning. That's what my background is in. I have been working in machine learning since much before Gen AI became cool. So data centric AI is something that I've been doing before. And it's very well applicable to what we are calling context engineering right now. And let me like maybe for the viewers, if they've not heard about it or are not very

[7:53] Aishwarya Srinivasan:
well versed with what it is, the entire shift that we have seen from prompt engineering to context engineering has happened because of the very reason that I was talking about. When we're building systems, it's not just an LLM prompt anymore. It's not an API call anymore where you're just getting the output and that becomes your product, right? Like that's not the end result that you're looking for. You're trying to build a system,

[8:20] Aishwarya Srinivasan:
which means that you're going to have a combination of different modalities of models. You're going to have different combinations of smaller models, mid-sized models, larger models. Some of them are going to be fine tuned. Some of them are going to be distilled. Some of them are going to be proprietary. Some of them are going to be open source. And all of this combines with the data that it is getting access to, the memory systems that you're building,

[8:44] Aishwarya Srinivasan:
the tool calls that you're giving it access to. So all of these combine and formulate a more software engineering problem again, right? And that's where this entire context engineering is coming into the picture because now that models are conversing with each other, it's not just about the user prompt and the system prompt anymore. It's about all the logs and traces that are going back and forth between each and every model to model conversation,

[9:12] Aishwarya Srinivasan:
between say a traditional machine learning model and your large language model, anything that's happening between the data set that it's trying to access, between the tools that it has called, and everything in between. So it's becoming a huge clutter of logs and traces and how you're going to think about the fallback if anything in the system breaks down. And that's somewhere where it's very important to think about

[9:39] Aishwarya Srinivasan:
what data is fueling the back end of all of these systems and where is it pulling the information from. So starting from the very first thing, right, like even today, like when I'm talking to my friends working at frontier labs, there are folks, and this is still a super, super manual process, who have to go and download data manually because they have to sanitize it. They have to make sure that it is very high quality.

[10:09] Aishwarya Srinivasan:
Because at the end of the day, if you're trying to fine tune a model or you're trying to build a coding agent, you need to train it with data that is really high quality. As soon as you feed a large language model bad code when you're building a foundation model, it is going to mess up the output. It's going to fail on the benchmarks. So this process is still very manual: going in and trying to find the most sanitized, cleanest code, with high programming standards, that is used to train these foundation models. That's step one, which is pre-training.

[10:45] Aishwarya Srinivasan:
The same goes for when you're doing post-training for these models. Whenever organizations are trying to fine tune or distill these models and trying to customize them for the use cases that they have, it comes back to how are they gathering that data? Does it reflect the right kind of data for the use case they're building for? The same thing again propagates to when you're trying to test it for evals.

[11:12] Aishwarya Srinivasan:
If you're testing it with the wrong data or bad data or data which is not diverse enough, you're not going to get real results of how your actual model is performing. So again, it all comes back to the quality of data that you're using. And going back again to something like reinforcement fine tuning or reinforcement learning from human feedback,

[11:35] Aishwarya Srinivasan:
not every human feedback is going to be the right feedback for your model. How do you distinguish between what's good feedback that you've received from the user and what's bad feedback? Because if you end up training it with every single piece of feedback, then you're going to introduce vulnerabilities. So all of these, if we think about the core of or the foundation of where all of these problems stem from or what's the right

[12:05] Aishwarya Srinivasan:
key to solving all of these problems, it comes down to how do you get the right quality data? How do you get the most correctly annotated data that you can use to train your models?
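The question of telling good feedback from bad is left open in the conversation. One common heuristic, used here purely as an assumption and not as the guest's method, is to keep a feedback label only when enough independent raters agree on it. The `Feedback` type, the `filter_by_agreement` function, and both thresholds are hypothetical names and values chosen for this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    example_id: str
    rater_id: str
    label: str  # e.g. "good" or "bad"


def filter_by_agreement(feedback, min_raters=3, min_agreement=0.75):
    """Return {example_id: majority_label} for examples whose ratings
    are both numerous enough and consistent enough to trust."""
    votes = {}
    for item in feedback:
        # One vote per rater per example; a rater's later vote overwrites
        # an earlier one.
        votes.setdefault(item.example_id, {})[item.rater_id] = item.label

    accepted = {}
    for example_id, by_rater in votes.items():
        labels = list(by_rater.values())
        if len(labels) < min_raters:
            continue  # too few independent opinions
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[example_id] = label
    return accepted
```

Examples that fail the filter are not necessarily wrong. In this sketch they are simply held back from training until more ratings arrive or a subject matter expert reviews them.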

[12:15] Conor Bronsdon:
And as teams work through this challenge in different domains, I think we've seen an obvious spike in one domain here: in software engineering, there's so much good quality data out there. There's also a lot of bad code. Please do not ever train a model off my GitHub. Not a good idea. But there's plenty of great code out there that is very well labeled, very well commented,

[12:36] Conor Bronsdon:
excellently done, that can be used to train these models. And it's interesting because we also have developers and PMs who are working on these AI models who can act as subject matter experts for that human feedback you mentioned. And so I think we're seeing that use case move ahead particularly fast because not only do you have subject matter experts who are already on hand, you don't have to go find a bunch of lawyers, for example, to double check the data. They're just already working at the company. But you also have a vast

[13:09] Conor Bronsdon:
supply of potentially very strong open source code, and many other projects. And this is where companies can also leverage their own code bases to define things like, hey, these are gonna be the right standards we want. This is the approach we want. But this gets more complicated as we expand the use cases we're considering, whether that's something along the lines of,

[13:31] Conor Bronsdon:
you know, broader writing and chat with the chatbots versus code specific. You know, quite quickly we see that concerns around fairness, transparency, and safety start to come in. And with agents making autonomous decisions, whether that's in customer support or many other use cases, this is a key area of concern for many enterprises in particular.

[13:59] Conor Bronsdon:
Maybe you're not worried about it in a demo, but as you actually start putting this out to thousands, hundreds of thousands, millions of people, this becomes a problem. I know we've thought about this from a standpoint of, okay, working with a lot of our customers around providing them in-production guardrails they can apply to block things like personally identifiable information,

[14:19] Conor Bronsdon:
safety concerns, and also, of course, customize to their needs. But beyond simply guardrailing, as useful as that is, what are the other considerations you think teams should be taking into their processes as they try to ensure that their agents are actually delivering on the promise that they've set out to solve?
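As a rough illustration of the kind of production guardrail mentioned here, the sketch below redacts two kinds of PII from text before it reaches a model or a user. The regular expressions and the `redact_pii` helper are assumptions made for this example and are deliberately narrow. Real PII detection needs much broader coverage (names, addresses, locale-specific formats) and is usually handled by a dedicated service.

```python
import re

# Simple, assumed patterns for illustration; not a complete PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text):
    """Replace each match with a placeholder such as [EMAIL] and
    return the redacted text plus the kinds of PII that were found."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{kind}]", text)
        if count:
            found.append(kind)
    return text, found


redacted, kinds = redact_pii("Reach me at jane@example.com or 555-123-4567.")
print(redacted)  # Reach me at [EMAIL] or [US_PHONE].
print(kinds)     # ['EMAIL', 'US_PHONE']
```

Returning the list of detected kinds alongside the redacted text means the same check can feed a log or dashboard, not only block content.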

[14:40] Aishwarya Srinivasan:
So the short answer is trying to quantify anything and everything in that system that you're building. Because that's what comes in handy when you're trying to scale things. And that's what comes in handy when you have to go back and look at where things broke and why things broke. Part of what we talk about with tracing or logs or understanding how the agent behavior changes with different user prompts,

[15:11] Aishwarya Srinivasan:
it all comes back to, hey, how do you define what is good or bad? So it's all about like those metric definitions and all of those quantifiable elements that really helps you build that reliability in any particular system. So one of the examples is when we are looking at something as simple as a copilot, like a coding copilot, and we're talking about, hey, what level of autonomy

[15:41] Aishwarya Srinivasan:
do we want to give this particular model? Do we let it write the code for me? Should I let it also review the code for me? Should I also let it write unit tests for me? Should I also let it push the code and like merge the code for me? For any of these steps, there are different levels of quantification that you want to do, right? Like at every given step, is it doing right? Is it not doing right? At any given point in time, do I need a human reviewer or do we need a different model maybe to judge

[16:15] Aishwarya Srinivasan:
the response of that, like having an active critic system right there. So all of those are very important to understand where are those guardrails, where do they need to sit, and how much autonomy do you want to build into that system. And as I said, there's no right answer to how much autonomy. It is completely dependent on the use case. You were talking about a use case where you have PII.

[16:40] Aishwarya Srinivasan:
It would be completely different if in my case, there is no PII, right? The level of autonomy versus the level of human intervention that you want to have in any particular system is very subjective to what's the level of risk that you have, what is the level of scrutiny or compliance that you're building that under. So, for example, in like going back to the coding example,

[17:03] Aishwarya Srinivasan:
right? Like if there is a point where I have automated the entire pipeline, right from where it understands what I wanted to do, it codes it out, it writes unit tests for me, it does a code review, it pushes and merges the code into the main branch. If something breaks, how exactly do I go back and see where it broke and why it broke? And can I really reverse

[17:28] Aishwarya Srinivasan:
each and every step that happened along the way? And that is really important when you're trying to think about fallback options. If something breaks, how do you trace it back and how do you make it right? Because if you have not logged all of those things, if you have not had like checkpoints along the way for how a model is performing, how is it responding and how is it

[17:51] Aishwarya Srinivasan:
merging any code into your code base, then it's going to be extremely hard for you to go figure out what broke, why it broke, and how do you keep it from happening again. So all through this, what's really important is how you quantify each and every step that your model is taking. How do you quantify each and every step of your system? And how do you have evaluations

[18:17] Aishwarya Srinivasan:
running at every given checkpoint? So that really helps you to build something which is scalable. And at the same time, you're not just falling back on subject matter experts to review it and see if it's right or not. And it's something that can be logged and it can be something that you can look at on a dashboard and see what's going wrong. So instead of you having to manually go through all the logs and figure out what went wrong, you have all of that available in a very quantified manner, and you're able to pick and choose. And you can really pinpoint and say why things broke, and how do you not let that happen again.

[18:57] Conor Bronsdon:
I think you're spot on that we have to treat observability of our AI systems as fundamental, because we need to be able to unpack and root-cause what is actually going on, what's broken. Do we have an agentic tool error problem, or is it a context window issue? Is it simply an efficiency piece, and we need to

[19:20] Conor Bronsdon:
improve the actual ability of our LLM to make calls more rapidly? Or maybe we just simply haven't paid our API bill and we're getting stuck. It really depends. I think this is why we see, as we get past testing, that so often evaluations for agents become obsolete if they haven't been customized to an organization's needs. And a lot of engineers are starting off of either open datasets for their evals and scores,

[19:52] Conor Bronsdon:
or based off of out-of-the-box metrics, like the ones we provide for agents, for example, action completion. And those are all great and useful, but it's once you actually fine tune those for your use case and customize them for your use case that I think they really start to shine. Are you seeing AI builders actually take on this customization challenge, or do you feel like we're still at a point where there's a lot of confusion for builders

[20:18] Conor Bronsdon:
about what it means to effectively evaluate AI systems?
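The idea of logging every step and running an evaluation at every checkpoint, so a failure can be traced to the step that caused it, can be sketched as below. This is a minimal illustration under assumptions: `StepRecord`, `PipelineTrace`, and `run_pipeline` are hypothetical names, and the steps and checks are stand-ins for real model calls and real evals.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepRecord:
    name: str
    output: str
    passed: bool
    duration_s: float


@dataclass
class PipelineTrace:
    records: List[StepRecord] = field(default_factory=list)

    def first_failure(self) -> Optional[StepRecord]:
        # The earliest failed checkpoint is where root-cause analysis starts.
        return next((r for r in self.records if not r.passed), None)


def run_pipeline(steps, task):
    """Run (name, run_fn, check_fn) steps in order, recording each output
    and halting at the first step whose check fails."""
    trace = PipelineTrace()
    current = task
    for name, run_fn, check_fn in steps:
        start = time.perf_counter()
        current = run_fn(current)
        passed = check_fn(current)
        trace.records.append(
            StepRecord(name, current, passed, time.perf_counter() - start)
        )
        if not passed:
            break  # do not let a bad output flow into later steps
    return trace


# Stand-in steps: each lambda pair is (do the work, evaluate the result).
steps = [
    ("write_code", lambda t: t + " -> code", lambda out: "code" in out),
    ("write_tests", lambda out: out + " -> tests", lambda out: False),
    ("merge", lambda out: out + " -> merged", lambda out: True),
]
trace = run_pipeline(steps, "task")
failure = trace.first_failure()
print(failure.name if failure else "all checks passed")
```

Because every step's output and check result is stored, the records can be aggregated into the kind of dashboard described above instead of being read back from raw logs by hand.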

[20:22] Aishwarya Srinivasan:
I feel it depends on who we are talking about. I wouldn't really generalize by saying that, hey, everybody is confused or everybody is doing it the right way. I think it's different when it comes to deeply technical professionals. I think people who have worked with machine learning operational systems have a good understanding of how they should be defined, even though there is a slight difference between how traditional machine learning models work versus how large language models work. There's a lot to do with context window, memory, etcetera, which is quite different. And the model nature is also quite different.

[21:02] Aishwarya Srinivasan:
But I think people who have had deep expertise running machine learning models at scale have been able to adapt to that change well. Compared to that, for folks who are coming from a nontechnical, non-machine-learning background, now that the barrier to entry for building anything with LLMs has become so low, I think that's the area of vulnerability that I'm seeing, where

[21:32] Aishwarya Srinivasan:
a lot of early founders who don't have a deep technical background, who have not really built systems at scale, who don't have a good understanding of ML system design, are coming in and seeing that, hey, I have a bunch of these vibe coding tools out there where I can spin up a system in minutes, and they don't really go through that thorough analysis

[21:55] Aishwarya Srinivasan:
of how systems originally used to be built. Having solid PRDs before you build a product, having a solid understanding of what the user journey looks like, what each and every step looks like in the process, talking about the flow of data. That's also part of context engineering, right? Like how do we evaluate each and every step of how the data flows

[22:22] Aishwarya Srinivasan:
inside a large language model and through the system. People who have not done that or people who are coming from non machine learning backgrounds, I feel primarily are confused, are figuring out because they have a solid business need of why they need to build this product, but they lack that engineering expertise.

[22:44] Conor Bronsdon:
It sounds like you think not only from a context engineering perspective, but also perhaps even an evaluation driven development perspective, which is terminology that really has only come into vogue over the last year, and I think maybe harkens back to this idea of test driven development from traditional software engineering. What's your perspective on the decision making process for

[23:06] Conor Bronsdon:
folks who are perhaps newer to AI and maybe don't have that deep machine learning background? How should they be thinking through when to add evaluation and observability to their systems? Is this something that, you know, has to be a P0? You start doing it from the very start. Or does it depend on the use case more so? What's your perspective?

[23:26] Aishwarya Srinivasan:
It depends on how soon they want to build something in production. If it's sooner, then definitely that should be your day one thought of how you are planning to build evals into your system. But if it's something where you're like, I'm building a prototype. I am going to go pitch that to IC and see how things go. That's something that you can

[23:50] Aishwarya Srinivasan:
think about later. So it depends on how soon you think your system can go from your laptop to a cloud hosted environment and be available for people to use.

[24:00] Conor Bronsdon:
And the risk factor you talked about earlier too, for sure, where like, hey, do you have PII that may be surfaced from that application? Yeah, you might need some guardrails in place. You may need to have live evals. Of course, this monitoring is dependent upon what use case folks have. I think this also comes back to your points about responsible AI and

[24:26] Conor Bronsdon:
considering, as you put it, from first principles, okay, what am I doing with this? Let me make sure I actually have this set out. Let me not just vibe code my way into potentially a mess, but actually think through the architecture and the goals that I'm setting forth for whatever agent we're building. You also mentioned a specific evaluation technique that is no longer new, but I think is very central to how evals are being conducted today.

[24:50] Conor Bronsdon:
And it's no longer focused on humans as evaluators or at least not first line evaluators. Most folks are using LLM as a judge, at least to some extent within their systems. And you mentioned very kindly in our prep call for this that you enjoyed the insights in our ebook on the topic, Mastering LLM as a Judge. And I'd love to get your perspective. Is this an approach that you see every team should be applying at scale,

[25:18] Conor Bronsdon:
or are there inherent risks to leveraging LLMs? And do we need to be considering the biases and guardrailing them in some way?

[25:33] Aishwarya Srinivasan:
There definitely are risks that come with using an LLM as a judge because, you know, on one hand, you're saying that, hey, I'm using an LLM, and it can sometimes not produce the right kind of output. So you are inherently defining the nature of how the model looks. And then you are saying that, hey, with the same exact characteristics, we're putting another model to go and judge how this previous model was doing. So it's in

[26:00] Aishwarya Srinivasan:
a way, if you're trying to say that, hey, this model is blind, then you're taking another blind model to go and check how the previous model is doing. So that's not really going to help in some use cases, but they are great in some other use cases. So one of the things is LLM as a judge, it's great if you are trying to measure something which is not very quantifiable.

[26:31] Aishwarya Srinivasan:
So the reason I say that is if you have an LLM output, a particular result, and when you're building that system, you have a good understanding of what that range of output could look like or what that range of answers could look like and what a right or what a wrong answer could look like. If there is like a statefulness in that system, then going with traditional

[26:59] Aishwarya Srinivasan:
evals is the best thing to do. But in a lot of cases, that's not how things are. One of the examples that I can talk about, right, is something as simple as, hey, be my writing assistant. It is such a subjective use case. And the reason it is, is because the way I write is very different from how somebody else writes or how a third person writes or how somebody who is a native English speaker writes or how somebody who is proficient in Mandarin writes. It's very, very different. So when

[27:36] Aishwarya Srinivasan:
the use case is that undefined or difficult to define, where the boundaries are not very specific, having an LLM as a judge can actually help, because you're using a different model, which also has a wide knowledge base in it, to go and judge how the previous model did. Now, obviously, even for LLM as a judge, you can fine tune it to a certain extent, because if you are using

[28:08] Aishwarya Srinivasan:
this LLM plus LLM as a judge combo for, say, a storyboard writing use case or a poem writing use case or a go writing use case, for example, there are still some boundaries that you want the models to follow. So in those scenarios, you can fine tune the LLM as a judge model to also perform in a specific way. But inherently, when the output range can be different for

[28:35] Aishwarya Srinivasan:
every single user, that's when it's helpful to use LLM as a judge. It does come with its own challenges, the same way an LLM comes with its challenges. But for any of those scenarios, it's rightfully one of the best ways to deal with it, rather than not having anything at all.

[28:56] Conor Bronsdon:
Yeah. It's typically better to do something, even if it's not the perfect thing, when it comes to evals and observability. Having at least something in place is gonna start to give you telemetry that will prove valuable as you fine tune and adjust your approach down the line. But let's shift our focus a little here. I know something that is extremely important to you and that you're passionate about is the open source community,

[29:19] Conor Bronsdon:
both as an area where you've spent time professionally and spent time in your free time. And of course, as we discussed earlier, it's an incredibly important dataset for LLMs to train off of as well. How do you think the proliferation of both data and now powerful open source models has fundamentally changed the game for teams looking to build AI agents?

[29:49] Aishwarya Srinivasan:
So funny thing is, I'm very vocal about the open source community. I advocate for it a lot. And one of the biggest questions that I get all the time is, if open source is so good, why do we even need proprietary models? And it's a very valid question. And it's something that I keep asking myself every time a new open source model comes out. And to some extent, I would say that

[30:19] Aishwarya Srinivasan:
after the DeepSeek V3 model came, we did see that inflection point where the performance-to-price ratio of what you can really build in a very cost-efficient and very customized manner, something that you have full control over with an open source model compared to a proprietary model, just went up. With DeepSeek V3, people were like, oh, we never thought an open source model could be as performant as this one is. And then it was followed up with the R1 model. There were different variations of the R1 model. Recently, we have had V3.1.

[30:59] Aishwarya Srinivasan:
And subsequently, we have seen various other labs also produce their own open source models. And they are among the most used models when I look at agentic AI, at the use cases that companies are building on, both enterprises as well as startups. They are heavily invested in using open source models for certain use cases. Again, I'm not saying that the industry has completely shifted away from using proprietary models. There are obviously certain reasons around

[31:28] Aishwarya Srinivasan:
developer experience, around how easy it is to get up and running on proprietary models. A lot of times, it is also around model quality. For example, if you're looking at the vision capabilities of the model, proprietary models do give you really, really good performance. So it's a mix of reasons why the industry is going and looking at open source models: from a customization perspective,

[31:53] Aishwarya Srinivasan:
from a cost of use perspective, from a running it on-prem perspective. There are a bunch of reasons there. And what we are seeing is that a lot of the companies which had started pivoting into pure proprietary models, including OpenAI, including Meta, including Google as well, have now started thinking that, hey, no, we need to invest in open source models as well. That is one

[32:24] Aishwarya Srinivasan:
of the edges that they are getting. And rightfully, that's why you're seeing all of the new models coming out. The latest was OpenAI's gpt-oss model. And that's why we're seeing an equal level of investment going on in both, to some extent. But the top models, if you see, are still coming out of the Kimi models, the Qwen models, the DeepSeek models, etcetera. And

[32:49] Aishwarya Srinivasan:
they are really, really capable. And we are seeing that shift where people realize that for certain use cases, where out of the box models are not the right solution for me, where I need more control over how my models are being trained, where I need more control over how the data that I am pushing into the system is being used, where I care about a zero data retention policy,

[33:16] Aishwarya Srinivasan:
places where I care about customizing the model to the size and to the speed that I care about, in all of those use cases, people are switching to open source models because they just have that autonomy to customize the model the way they want.

[33:36] Conor Bronsdon:
From your vantage point, where do you see the biggest gaps in the open source AI stack? Vision, which you mentioned earlier, is an area where proprietary models are doing extremely well. Which gaps, if they were filled, would unlock a new wave of scaled applications that are powered off of open source LLMs?

[33:59] Aishwarya Srinivasan:
I think vision is definitely one of the areas where it is harder for the open source models, at least the ones that are out there, compared to the proprietary models which are available in the market. Apart from that, yes, there is some level of performance gap in how open source models look on the benchmarks that we have publicly available.

[34:26] Aishwarya Srinivasan:
That being said, again, in most of the use cases where I'm seeing enterprises or startups use these open source models, they are not using them out of the box. They are either using supervised fine tuning methods or reinforcement learning fine tuning, or, if they don't have a dataset or they are not pulling in the data to fine tune it, they are using synthetic data to do that. So

[34:51] Aishwarya Srinivasan:
at the end of the day, I feel like people have a clear choice: when they are building a composable system where they require very, very specific outputs from specific models, they are choosing a combination of models. So in most of the use cases that I'm seeing, they're never running on one particular model. One of the use cases that I can talk about is with Notion. We had

[35:19] Aishwarya Srinivasan:
the head of AI engineering from Notion join us for our Dev Day, and she was talking about the use cases that they're running on Fireworks. For the parts of Notion AI where latency was really important to them and quality of the responses was very, very important to them, they were not able to get that from the proprietary models. So in those use cases, they did fine tune the models

[35:41] Aishwarya Srinivasan:
on the Fireworks AI platform, and they were running them with very, very low latency in their tool. Now, are they using just one model? No, they're using a combination of multiple different models. And that's the concept of compound AI systems that we've also been speaking about, which is integrating multimodal models together, integrating smaller and larger models together,

[36:04] Aishwarya Srinivasan:
integrating models with specialized performance metrics together, and having them build that end to end system. So I don't see a future where all the proprietary model shops are going to shut down because there are open source models. I think it's about the pros and cons of how these systems are being designed and what the model architecture looks like. What are the specific nuances that a model is good at?
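
The compound-system idea described above, where each request goes to whichever model fits its modality and latency needs, can be sketched roughly as follows. This is only an illustration: the model names, latency figures, and quality scores are invented placeholders, not numbers from the conversation or from any real catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical catalog entry; every value used below is made up."""
    name: str
    modalities: frozenset
    typical_latency_ms: int
    quality: float  # stand-in for a score from your own evals


def pick_model(catalog, modality, latency_budget_ms):
    """Return the highest-quality model that supports the modality within budget."""
    candidates = [
        m for m in catalog
        if modality in m.modalities and m.typical_latency_ms <= latency_budget_ms
    ]
    if not candidates:
        raise LookupError(f"no model handles {modality!r} within {latency_budget_ms} ms")
    return max(candidates, key=lambda m: m.quality)


# Placeholder catalog: a small tuned model, a large general model, a vision model.
CATALOG = [
    ModelSpec("small-tuned-text", frozenset({"text"}), 80, 0.78),
    ModelSpec("large-general-text", frozenset({"text"}), 900, 0.90),
    ModelSpec("vision-model", frozenset({"text", "image"}), 1200, 0.88),
]

print(pick_model(CATALOG, "text", 200).name)    # small-tuned-text
print(pick_model(CATALOG, "text", 2000).name)   # large-general-text
print(pick_model(CATALOG, "image", 2000).name)  # vision-model
```

In a real system the quality field would presumably come from task-specific evaluations rather than a public benchmark, which is the point being made in this part of the conversation.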

[36:34] Aishwarya Srinivasan:
I mean, I can share a personal example, right? I have my preferences on when I use ChatGPT versus when I use Claude, versus when I use Perplexity, versus when I use Gemini. I use all four of them, but for slightly different use cases, because sometimes I like the way ChatGPT writes certain things for me, whereas I like certain things about the way Gemini writes for me. So

[37:00] Aishwarya Srinivasan:
at the end of the day, when the model gets so large and when we are trying to just measure it on top of a benchmark, it's not like an accurate way to judge the performance of a model because at the end of the day, it is not an objective task that we are measuring it against. It is a subjective task and there's no one right answer. There's no four right answers. There could be multiple different right answers. There could be multiple different ways of framing the same thing.

[37:30] Aishwarya Srinivasan:
It is also subjective to the user who's interfacing with these models. So that's where all of these nuances come into play. And yeah, I mean, that's why we are seeing that some companies are focusing on building 600 billion parameter models, while we are also seeing 20 billion and 4 billion variations of the same models.

[37:48] Conor Bronsdon:
I love that you're bringing up both model variation and personal model variation, where I would term that as using different models for different use cases, something I will say I personally do as well. Like, for example, if I'm needing help writing or problem solving, I actually prefer Claude over GPT often. I find it's a little more effective at that.

[38:13] Conor Bronsdon:
I'm curious, from your perspective, do you mind sharing an example or two of when you might switch models if you had a particular set of tasks?

[38:24] Aishwarya Srinivasan:
So, giving you another example, right? I already spoke about using different tools, which is ChatGPT versus Claude versus Perplexity versus Gemini, depending on if I'm needing its help to write a technical blog or something which is a more personable text, or if I'm needing it to help me write code, or am I asking it to help me write a

[38:52] Aishwarya Srinivasan:
blog around the code? So all of these are very nuanced, specific use cases. But I want to give you a use case where it's essentially the same task, but I find different results when I'm using Cursor Agent versus when I'm using Claude Code. So what I've realized is, when I have a very, very defined scope of what I'm trying to build, when I know that in one or two shots of prompting I am going to get the result that I'm looking for, that's how well defined I've written out the problem,

[39:31] Aishwarya Srinivasan:
Cursor Agent does a good job for me. It's able to understand what I'm trying to build, and it is able to do a very good job at it. But at times what I've seen is, let's say my problem statement is not very well defined. And let's say I give a mildly ambiguous task to Cursor Agent and I have to do a bunch of follow ups on it. It starts breaking.

[39:56] Aishwarya Srinivasan:
It tries to fix one thing, ends up breaking 10 other things, and then I have to do another prompt to fix that one thing that it broke, and then it ends up breaking a few other things. So it takes a lot of back and forth to get the right result. And by the time I do that, it has hallucinated a bunch of import statements at the top. So what I've realized is, in cases where I have a very, very defined

[40:21] Aishwarya Srinivasan:
scope of, this is exactly what I want to build, these are the buttons that I want in this particular app, this is the tool that I want you to use for the front end, if I have something that's that specifically defined, then Cursor, or Cursor Agent, works out better for me. But when it's more collaborative, Claude Code seems to be working better for me. So again, going back to the point, right, it's so specific to how your users are going to interact with the model.

[40:50] Aishwarya Srinivasan:
Somebody who is an engineering professional would interact with your model differently compared to how a PM would interact with your model versus how a completely nontechnical professional would interact with your model. So that's where I keep coming back to for your use case. How can you define that user journey? How can you quantify the way the users are going to interact with

[41:19] Aishwarya Srinivasan:
your model? Can you build a simulation out of it? That's one question that I typically ask. Hypothetically, if you're trying to build a particular product and I tell you that, hey, you don't have time to go gather user feedback for this, I need you to get this out in the next couple of hours, or the next couple of days, let's say, being more realistic.

[41:40] Aishwarya Srinivasan:
Can you build a simulation environment where you can point exactly to how your users are going to be using or interacting with this particular model? So if that's something you can define, that's great, because you can build test cases around it and you can think about all the possible edge cases that could come up in that scenario. But a lot of times people cannot do that, because you don't know how your tool is going to end up being used, who's going to use it, how they're going to use it. Are they going to try to deliberately break it or not? So all of those unknowns

[42:13] Aishwarya Srinivasan:
come into play. And that's where it becomes harder for the designers to architect a system like that.

[42:20] Conor Bronsdon:
I love that answer. And I think it's very important to double click on it a bit. Because there's so much customization and personalization that needs to happen depending on your use case, depending on the size of companies you're working with, depending on the type of users you're working with, that just saying even something as

[42:46] Conor Bronsdon:
specific as, oh, I work with AI engineers, may not actually get you where you need to go. You need to understand what they're trying to achieve. Perhaps you need to understand if these are new AI engineers or folks who've been doing machine learning for years, for example. As you brought up earlier, there's a big difference in how folks who are newer to the field versus those who have been around for a while

[43:09] Conor Bronsdon:
are approaching things as fundamental as evaluations. And another crucial area you brought up earlier is the idea of compliance. And you mentioned zero data retention policies, for example. With the accessibility that models have to our data and their systems, including making their own decisions about what to do with that data sometimes if they're in an agentic environment,

[43:36] Conor Bronsdon:
there's definite risk. This is why you're seeing risk management teams at large enterprises have very many sleepless nights. What do you see as the unique governance and security challenges that teams need to consider when they make the choice between building skilled agents on open source models, like we've been talking about here, versus proprietary ones? Are there pros and cons that you would encourage them to consider?

[43:58] Aishwarya Srinivasan:
With proprietary models, I think one of the biggest challenges that comes up is, as far as I understand, and again, there could be differences in policies across different model providers, that in most cases it isn't a zero data retention policy. They do capture

[44:21] Aishwarya Srinivasan:
the data that goes in and out of the model in order to make their models better. So that's one part of it. Second, it can become harder for you to really peel the onion and see where things are breaking and how things are being processed by these models, compared to what you can do when things are open source. You can really tweak the model to a T when you're using an open source model and you're trying to customize that model for your specific use case. Now,

[44:59] Aishwarya Srinivasan:
so one of the things that you were mentioning, Conor, is around compliance and how it would play out in a regulated environment, right? So one of the very basic challenges that comes in when you're trying to build a system like this is, when users are interacting with your model, how do you control the access that the model gets every time it's interacting with a different user? So that's the traditional

[45:26] Aishwarya Srinivasan:
role based access management that you want to build into your model. And how much information does that model particularly need to save in its short term memory? Or what's the exact level of anonymization that you want to happen in a particular piece of data being held by a model? So in all of these scenarios, and again, these are the typical best practices that I've seen across the board,

[45:53] Aishwarya Srinivasan:
is where people try to bring in multi agent systems. They say that, hey, like we have a primary agent which has access to five different things. That's the primary agent that interacts with the user. We're not going to give it access to certain data sets by default, because we know that it could be a point of failure. So what we're going to do is have it send a call to our secondary agent

[46:19] Aishwarya Srinivasan:
that has access to the specific dataset that you're trying to pull from. And based on the parameters that are being shared from the user request, plus whatever processing has happened at the primary agent stage, the secondary agent, one of the secondary agents, checks whether the request matches the criteria under which it should be pulling in that information and sharing it back to the primary agent.
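
As a rough illustration of the primary and secondary agent split described above, here is a minimal Python sketch. The class names, the role check, and the billing data are all hypothetical, and real agents would wrap model calls rather than plain dictionary lookups; the sketch only shows where the access decision sits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user_role: str
    dataset: str
    record_id: str


class SecondaryAgent:
    """Guards one dataset and re-checks the caller's role on every request."""

    def __init__(self, dataset, records, allowed_roles):
        self.dataset = dataset
        self._records = dict(records)
        self._allowed_roles = frozenset(allowed_roles)

    def fetch(self, request):
        if request.dataset != self.dataset or request.user_role not in self._allowed_roles:
            return None  # criteria not met: release nothing
        return self._records.get(request.record_id)


class PrimaryAgent:
    """User-facing agent with no direct handle on any protected dataset."""

    def __init__(self, secondaries):
        self._secondaries = {agent.dataset: agent for agent in secondaries}

    def handle(self, request):
        agent = self._secondaries.get(request.dataset)
        record = agent.fetch(request) if agent else None
        if record is None:
            return "Sorry, I can't share that."
        return f"Here is what I found: {record}"


# Hypothetical setup: only the "finance" role may read billing records.
billing = SecondaryAgent("billing", {"inv-1": "Invoice total: 120 USD"}, {"finance"})
primary = PrimaryAgent([billing])

print(primary.handle(Request("finance", "billing", "inv-1")))
print(primary.handle(Request("support", "billing", "inv-1")))
```

The design choice worth noticing is that the primary agent never holds the records itself, so a prompt that confuses the user-facing agent still has to pass the secondary agent's own check.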

[46:44] Aishwarya Srinivasan:
Now again, it could be something where it does share it back with the primary agent, but the primary agent does not store that information in either its short term or long term memory. It is just something which it's communicating and then just forgetting about. So all of these are nuanced engineering setups that folks need to make. And that's exactly where

[47:06] Aishwarya Srinivasan:
I come back to this: hey, when you are talking about building an AI agent, as an architect or as a product manager or as a leader who's thinking about building a system, I think you need to go way beyond how a model performs. You cannot just look at a model benchmark and feel that, hey, this is a good enough model for me to use in my workflow, and then just tie models together with each other. That's the worst thing that you can do, because none of these models understand. I can break any model with a bunch of

[47:39] Aishwarya Srinivasan:
quirky, twisted prompts. So that's where it's important to understand that, hey, how do you architect it as a software? It is not a model that you are surfacing to your user. It is a software. It is a system that you're surfacing to your user, which requires it to have that nuanced approach at every single point of communication between the models, between the model and any tools that it is trying to access,

[48:05] Aishwarya Srinivasan:
between the model and the dataset that it is trying to access, etcetera.

[48:10] Conor Bronsdon:
You have such a unique position, given how central your content creation is and how central you are to much of the AI movement today. I know you talk to many leaders and builders on a daily basis who share their insights with you and are seeking to share with you, here are the new things we're looking at. And then, of course, you're doing so much important work with Fireworks as well. Based off of all that context that you're gathering for, I guess, your personal mental model,

[48:39] Conor Bronsdon:
what are you seeing that you think AI builders are either not paying enough attention to right now or on the horizon?

[48:49] Aishwarya Srinivasan:
How do you take any particular model that's running in development, any system that's running in development, to production level? And when that happens, it's very different from when you're working with traditional machine learning models. And this is the same exact challenge that we had when we moved from something as simple as a logistic regression model or deterministic

[49:14] Aishwarya Srinivasan:
systems to traditional machine learning models. And the same level of shift is what we are seeing from traditional machine learning models to generative AI models. And it's about the level of complexity in how the model performs. So while the cloud computing fundamentals and the system design fundamentals stay the same, it is still not the same, because of how the model performs. It is because of how

[49:42] Aishwarya Srinivasan:
input goes into the model and how the output changes every single time that you're running inference. So one of those beautiful blogs that I saw recently was from Thinking Machines Lab, and they talk about this, which is the nondeterministic nature of large language models. For a large part, people's misconception was that an LLM is nondeterministic, or that it answers differently,

[50:13] Aishwarya Srinivasan:
when you're asking different kinds of questions. Sure, it does. If I just say, summarize my query into something which is two lines shorter, it's going to respond in a different manner. But even with the same queries, at times you see the LLM inference engine responding in a different manner. And that was coming from how the plumbing around the large language models looked, rather than just the nature of the model itself.
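
The conversation does not name the mechanism behind that "plumbing" effect. One commonly cited ingredient, offered here as an assumption rather than as a summary of the blog, is that floating-point addition is not associative: if the serving stack adds the same numbers in a different order, for example because requests were batched differently, the result can differ in the last bits. A minimal Python sketch of just that arithmetic fact:

```python
def accumulate(values):
    """Add floats strictly left to right, like one fixed reduction order."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = accumulate(values)             # (0.1 + 0.2) + 0.3
backward = accumulate(reversed(values))  # (0.3 + 0.2) + 0.1

# Same inputs, different order, different last bits.
print(forward, backward, forward == backward)  # 0.6000000000000001 0.6 False
```

A difference this small can still flip which token scores highest when two candidates are nearly tied, so identical prompts can diverge from that point onward.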

[50:44] Aishwarya Srinivasan:
So these are some of the things that are still being uncovered about the performance of the model, about how the behavior changes in different scenarios. And it is still something which people are discovering and sharing, and it's still something that not a lot of practitioners understand. So I think it's always a challenge to architect these systems based on how we are changing the brains of these systems

[51:14] Aishwarya Srinivasan:
from something which is more deterministic to traditional ML to large language models. So now that we have a different kind of brain, with a different kind of thinking capability, the architecture also needs to change. The architecture also needs to evolve. Part of it would be objective evaluation. Part of it would be subjective evaluation. And the way that we define all of these things is one of the biggest

[51:39] Aishwarya Srinivasan:
areas that people are still continuing to work on.

[51:40] Conor Bronsdon:
Ash, thank you so much for joining me today and sharing your thoughts on the importance of evals, open source, and so much more. It's been such a distinct pleasure having you on the show, and I really appreciate you taking the time. Where can our listeners go to follow you and the important work you're doing?

[52:03] Aishwarya Srinivasan:
Well, LinkedIn is where I'm most active. I did start recently on Instagram as well, because I realized there's only so much volume of content that I can post on LinkedIn. And while I have been catering my LinkedIn to a more technical audience, maybe mid level to senior level professionals, to help more students and enthusiasts, beginner level folks, I've started sharing tidbits

[52:26] Aishwarya Srinivasan:
and bite sized videos on Instagram just to get people started in the field.

[52:28] Conor Bronsdon:
That's awesome. I'm gonna have to go follow your Instagram. I'm excited to check out some of those videos. And we'll be sure to link Ash's Instagram and LinkedIn in the show notes. Ash, it's been such a pleasure having you on. And a reminder to everyone who's listening, while you're going to follow Ash on LinkedIn or Instagram, your platform of choice,

[52:51] Conor Bronsdon:
please don't forget to subscribe to our new Chain of Thought newsletter on LinkedIn for more insights on building with AI. And, hey, you know what? If you haven't already, if you're watching us on YouTube, you know, like and subscribe there, or on Apple Podcasts, Spotify, however it is, we always appreciate that as well. So thank you so much for joining us, Ash, and thank you for all the insights. Thank you. Thanks, Conor. Listeners, that's all for us this week. Have a good one.