This week, a panel of experts (Mehmet Murat Ezbiderli, ServiceTitan; Grant Ledford, Indeed; and Vinnie Giarrusso, Twilio) join Atin Sanyal (CTO, Galileo) and Conor Bronsdon (Developer Awareness, Galileo) to explore the challenges and opportunities of deploying GenAI at enterprise scale in a conversation that's a wake-up call for any business leader looking to harness the power of AI.
Together, Atin & Conor break down key considerations like performance, cost, and model selection, emphasizing the need for robust evaluation frameworks and a shift in developer mindset.
Atin then sits down with our panel of AI engineering experts to discuss their firsthand experiences with enterprise AI, including the trade-offs of building AI systems, the evolving tools and frameworks available, and the impact these technologies are having on their organizations.
Chapters:
00:00 Enterprise Scale Deployment
05:17 Cost, Performance, and Model Selection
08:59 Building and Integrating GenAI Systems
15:26 Emerging Enterprise Use Cases
18:12 Predictions for AI in 2025
27:28 Panel Discussion: Deploying AI at Enterprise Scale
31:19 Gen AI Solutions and Challenges
33:12 Building & Deploying Traditional Infrastructure vs GenAI Infrastructure
34:36 How to Assemble Your GenAI Stack
40:39 Today's Best GenAI Use Cases
48:15 Enterprise AI Trends for 2025
50:36 Closing Remarks and Future Outlook
Follow:
Atin Sanyal: https://www.linkedin.com/in/atinsanyal/
Mehmet Murat Ezbiderli: https://www.linkedin.com/in/mehmet-murat-ezbiderli-b894a49/
Grant Ledford: https://www.linkedin.com/in/grant-ledford-36b146a5/
Vinnie Giarrusso: https://www.linkedin.com/in/vinniegiarrusso/
Show notes:
Watch all of Productionize: https://www.galileo.ai/genai-productionize-2-0
AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.
Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes bi-weekly.
Conor Bronsdon is an angel investor in AI and dev tools, Head of Technical Ecosystem at Modular, and previously led growth at AI startups Galileo and LinearB.
Conor Bronsdon: [00:00:00] Welcome back to Chain of Thought, everyone. I'm Conor Bronsdon, Head of Developer Awareness for Galileo, and I'm delighted to be joined by Atindriyo Sanyal, co-founder and CTO at Galileo.
Atin, great to see you.
Atin Sanyal: Great to see you too, Conor. I'm very excited. It's close to the end of the year, and I'm happy to talk about what the year has been like for Gen AI and what the next year holds.
Conor Bronsdon: there's been so much innovation this year, and then there's also been. Signals of areas where we need to improve as an industry and beyond the hype and the headlines, there is this growing reality that the challenges and immense potential of deploying GenAI solutions at enterprise scale are myriad as businesses are striving to harness the power of AI questions have arisen around how can we effectively integrate into our operations?
What are the key considerations for successful deployment, which I know we'll talk [00:01:00] about. And how can we mitigate the risks and maximize the returns? and to discuss those questions, Atin actually sat down with leaders from Indeed, ServiceTitan, and Twilio, whom you'll hear from later in the episode. but first, Atin, I know this is a topic you've thought a lot about, that you talk about with customers all the time behind the scenes at Galileo.
It's obviously one thing to build a demo. It's completely different to deploy it across a global enterprise.
Conor Bronsdon: What are some of the unique challenges that you're seeing around deploying at enterprise scale?
Atin Sanyal: Yeah, I think 2024 was kind of the year when we first started productionizing a lot of the prototypes and MVPs that were built the year prior. And a lot of new discoveries were made, or realizations rather, which in hindsight I don't think are surprising. Of course, there's a myriad of challenges that a lot of enterprises faced while trying to get their apps into production, including the toolscape being [00:02:00] immature, and there's a long tail of issues. But the ones that stood out for me: number one was this discrepancy around performance. It's one thing to build an app that works well, as far as your latencies, cost, and accuracy go, in a pre-production or development environment. But often these applications falter in real-world scenarios, when the app actually meets the wild, for various reasons. Data is dynamic, data is changing over time, and it doesn't match what you initially evaluated your GenAI apps with; the quality of the data is very different in production. This year was kind of the zero-to-one for productionizing these applications, and a lot of the data these applications generated or got queried with was [00:03:00] often data that was not seen while the developers were building them in their isolated environments. Then there was, of course, performance measurement. One of the main realizations was that a lot of the metrics being used were these prehistoric, old-school metrics built to evaluate erstwhile language modeling applications, which no longer apply in the LLM era. We've spoken about it for a while now, but it really came to light: the need for a rigorous evaluation framework that's not only providing you state-of-the-art, newer ways to evaluate applications, but is also customizable, where you essentially need the tools to create and define your own evaluation criteria, because there's no one-size-fits-all solution to performance as far as evaluation goes. And of course, that is very bread and butter to Galileo, so we've done tons of innovation around this particular point. Then there's integrations and cost. We saw a lot of legacy systems trying to [00:04:00] integrate the new, unstable toolscape around GenAI, and of course it demanded very careful planning to align these new capabilities with existing workflows. That was a big challenge. And then cost: these systems are not cheap at all, especially at enterprise scale. One of the key realizations was that you can tinker around and build a small MVP, but when it comes to actually hitting the wild and getting 10 million queries a day, your costs go through the roof. You need to think about newer patterns of querying these kinds of models which can mitigate cost. But there's also evaluation cost. So one of the challenges going into next year is how do you continuously keep reducing the cost of not only building the application but also testing it and evaluating it, because there are LLMs in the mix across the board, and costs can go through the roof if you don't pay attention to it. [00:05:00]
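To make the "define your own evaluation criteria" point concrete, here is a minimal sketch of a bespoke, dataset-driven check; `call_app`, the dataset, and the scoring rule are all hypothetical placeholders, not Galileo's actual API.

```python
# A minimal sketch of a custom evaluation criterion run over a dataset.
# `call_app` and the scoring rule stand in for whatever your app exposes.

def call_app(query: str) -> str:
    """Placeholder for your deployed GenAI app."""
    return "You can get a refund within 30 days."

def contains_required_terms(output: str, required: list[str]) -> float:
    """A bespoke, domain-specific criterion: fraction of required terms present."""
    hits = sum(term.lower() in output.lower() for term in required)
    return hits / len(required)

dataset = [
    {"query": "Summarize the refund policy.", "required": ["refund", "30 days"]},
    {"query": "What is the warranty term?", "required": ["10 years"]},
]

scores = [contains_required_terms(call_app(row["query"]), row["required"])
          for row in dataset]
print(f"mean custom score: {sum(scores) / len(scores):.2f}")
```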
Conor Bronsdon: Absolutely. And I think it's something you'll hear from a lot of critics, where they talk about, oh, it's too costly, it doesn't make sense. And Grant, Mehmet, and Vinnie talk about this in the panel with you later in this episode. They talk about the importance of making the right trade-offs across your AI stack.
Cost, Performance, and Model Selection
Conor Bronsdon: Atin, how do you think through those key considerations of cost, performance, and model selection as you have to really fine tune for enterprise scale?
Atin Sanyal: Yeah, that's a good question, because those are sort of the three main things which have come up. As you holistically evaluate your application, there's the overall cost of your system, there's performance in terms of accuracy but also latency, and then, yeah, which model is right for your use case.
Often we've seen the consideration of, I'm going to use LLM APIs versus I'm going to try to build some kind of an on-prem LLM solution. Often, there are cloud-[00:06:00]based solutions which have offered you the flexibility of building something very quickly, but they've proven to be very expensive at scale. Meanwhile, there's the more recent trend towards using some of the open source models and hosting them on-prem or in your cloud, using a custom inference engine like SageMaker or bespoke inference engines. But then there's also a whole host of inferencing services out there.
Now, what we've seen is that in the world of using and hosting your own models, there's the larger upfront cost of development and figuring out the right configurations, but at scale, the costs are significantly lower than using OpenAI or some kind of LLM provider out of the box. So that's certainly one consideration around cost. There's also a trend towards training versus inference, where you fine-tune a model for your particular use case. [00:07:00] This is a muscle that's emerging more and more, and we certainly saw some trends of that in 2024, where the question of, hey, should I fine-tune my model for this particular task, which can be significantly cheaper than prompting, also came up as one of the key considerations with a few of our customers. Latency was a very big factor, and latency was interesting because there are a couple of ways of addressing the latency issue. You have a lot more flexibility towards addressing latency as well as cost if you have control of the model. For example, if you take a Llama and host it yourself, there are various techniques you can use, like pruning and sparsification, to lower your inference-time latencies, which many of the inference-as-a-service businesses offer out of the box now, and they're constantly working towards low-latency, cheap ways of hosting these LLMs. But you [00:08:00] get less flexibility if you use a third-party LLM provider.
So this has been one of the key considerations for decision makers and leaders: hey, do I go ahead and just use OpenAI or Anthropic or Gemini out of the box? Or should I take into account that I want to give my developers the power to control the cost and the latency through various measures, by hosting a Llama 3.1 or 3.2, which today are performing almost as well as some of the OpenAI-like providers? So that's been a paradigm shift that I've observed, especially in the second half of this year.
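To make that self-hosting flexibility concrete, here is a sketch of one such lever: loading an open-weights model with 4-bit quantization via Hugging Face `transformers` and `bitsandbytes`. The model name and settings are illustrative, and it assumes a GPU plus `pip install transformers accelerate bitsandbytes`.

```python
# Sketch: one lever you get when you host an open-weights model yourself is
# quantization, which shrinks the weights and cuts inference cost and latency.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # gated model; any causal LM works
quant_config = BitsAndBytesConfig(load_in_4bit=True)  # roughly 4x smaller weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

inputs = tokenizer("Summarize: latency matters at scale.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

None of these knobs are available when you call a closed API, which is the flexibility gap Atin describes.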
Conor Bronsdon: I totally agree. I think it's so easy to say, oh, let's just go with Anthropic, let's use their API, let's get this up in production. And then it's really easy also to see this ballooning cost effect once you go from, hey, we're just running a basic demo, to, oh, this is actually in the hands of thousands, if not more, of folks who are now using it.
Building and Integrating GenAI Systems
Conor Bronsdon: [00:09:00] And it feels like we're really having to develop new AI development best practices. Many software engineers are still used to building and deploying deterministic software systems. What changes are you seeing in how engineers need to approach building enterprise deployments with non-deterministic systems?
Atin Sanyal: I think the first thing, for a developer starting to integrate with these non-deterministic systems, the first key thing is to really understand what the non-determinism is, and really just make sense of it.
This requires a mindset shift: for example, from thinking about unit and integration tests, which have traditionally been the tools that software developers use to harness their systems and test them in a holistic way. Those tests are deterministic, very specific conditions that developers write to protect and validate their systems. Now that's kind of gone [00:10:00] away, and that has led to this need for a shift in your mindset. So understanding the non-determinism and what it means, going from this mindset of very specific deterministic conditions to thinking of the outputs of GenAI systems as probabilistic outcomes, and asking how you build a testing and evaluation framework around probability distributions, is one shift that has been the need of the hour for developers. But also, I think there's been a change towards having developers incorporate interdisciplinary skills, more data-science-y skills: thinking about datasets, how you have datasets with and without ground truth, and using them as your test harness rather than bespoke functions that you write as unit tests. So this has been an interesting play, where not only have the software developers had to upskill themselves to think in a slightly more [00:11:00] data-science-y way, where traditionally data scientists would think that way, but it has also caused a lot of interesting collaboration between developers and data science. Where both teams exist in an organization, the overall GenAI effort has been more of a holding-hands-together effort, built in unison, rather than traditional software engineering, or even traditional ML, which used to be done only in silos, where the data science team would kind of throw a model over the wall and the developer wouldn't really care about the ins and outs of it or how it was built.
You'd just deploy it. I think those lines have kind of blurred, while both cohorts, data scientists as well as software engineers, have upskilled themselves to understand each other's skills a little bit better.
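One way to picture the mindset shift Atin describes: instead of asserting an exact output, a test can sample the app repeatedly and assert on the pass rate over that distribution. A toy sketch, with `run_app` standing in for a real LLM call:

```python
# Sketch: a probabilistic test. Rather than `assert output == expected`,
# sample the non-deterministic system many times and assert on the pass rate.
import random

random.seed(0)  # make the sketch reproducible

def run_app(query: str) -> str:
    # Stand-in for a non-deterministic LLM call: right about 90% of the time.
    return "Paris" if random.random() < 0.9 else "London"

def passes(output: str) -> bool:
    return "paris" in output.lower()

def test_pass_rate(n_samples: int = 50, threshold: float = 0.8) -> None:
    hits = sum(passes(run_app("Capital of France?")) for _ in range(n_samples))
    rate = hits / n_samples
    assert rate >= threshold, f"pass rate {rate:.2f} below threshold {threshold}"
    print(f"pass rate {rate:.2f} >= {threshold}")

test_pass_rate()
```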
Conor Bronsdon: As you point out, we are seeing this almost blurring of the lines between where data scientists and software engineers live, and increased [00:12:00] collaboration. How can companies strike a balance between leveraging familiar tools, whether for their data scientists or their software engineers, and embracing these new technology paradigms? Are there particular toolkits and frameworks that you think AI devs in particular should be applying?
Atin Sanyal: That's a great question. I think this year we saw a lot of new technologies come out, and that's constantly changing, right? New SDKs, new frameworks, better libraries; the toolscape is still maturing. One of the key considerations for an engineer has been to minimize that kind of disruption. Most folks have taken a rather cautious approach, doing an almost incremental integration into their legacy systems, because no one wants to uproot their system and build a GenAI system from scratch. The method by which you can do this kind of incremental integration really ties into the [00:13:00] product roadmaps. That's what we've seen with many of our customers, where the roadmaps are built in such a way that you think of GenAI as an extension of the existing capabilities of your apps, and you slowly integrate, taking a more cautious approach to baking newer capabilities into your legacy systems. But then there's also the need for building experimentation in as a paradigm in your software. Just as an example, things like feature flags or A/B experimentation frameworks give you this ability to quickly swap in and swap out certain pieces of your software based on how the app is being used in the real world, and to really understand the effects of the changes that you're making. That is something that's emerged as a good practice, and I believe it should be taken into 2025. As these systems mature more and more, all of this calls for very robust [00:14:00] evaluation tooling. Really, a good evaluation framework gives you all these capabilities out of the box. In the end, as a developer, you want your app to perform well. That's all you want. So the decision of whether to use this prompt or that prompt, this model or that model, this agentic framework or that one: often the answer lies in the data. You can make a lot of assumptions and hold opinions, but at the end of the day, you'll have to deploy something, see how it works, and make those data-driven decisions. That'll be very critical. What I'm describing to you are the ingredients of an ideal observability framework that gives the developer the power to make these data-driven decisions as their app evolves.
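A minimal sketch of the feature-flag pattern Atin mentions, with stable per-user bucketing and a per-request log line so variant comparisons can be made from data later; the variant names and `call_llm` are invented placeholders:

```python
# Sketch: flag-gated experimentation for swapping prompts or models per request.
import hashlib
import json
import time

VARIANTS = {
    "control":   {"model": "llama-3.1-8b",  "prompt": "Answer concisely: {q}"},
    "candidate": {"model": "llama-3.1-70b", "prompt": "Think step by step, then answer: {q}"},
}

def pick_variant(user_id: str, rollout_pct: int = 10) -> str:
    # Stable bucketing: the same user always lands in the same arm.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < rollout_pct else "control"

def call_llm(model: str, prompt: str) -> str:
    return f"[{model}] placeholder answer"  # stand-in for a real inference call

def handle(user_id: str, question: str) -> str:
    arm = pick_variant(user_id)
    cfg = VARIANTS[arm]
    answer = call_llm(cfg["model"], cfg["prompt"].format(q=question))
    # Log the arm with every request so later evaluation can compare variants.
    print(json.dumps({"ts": time.time(), "user": user_id, "variant": arm}))
    return answer

print(handle("user-123", "How do I reset my password?"))
```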
Conor Bronsdon: So, let's say I'm an AI developer or AI engineering leader who's listening to this conversation. I'm pulling a lot of great insights from what you're saying about how to apply frameworks, the trade-offs you have to make, the decisions you have to think through. [00:15:00] Maybe I haven't actually thought through what I want to build with this yet. Maybe I'm simply being given a directive from leadership: hey, we need AI, it's going to keep our investors happy. You'll hear our panelists later discuss a range of use cases, from virtual assistants to systems for technicians and beyond.
In your experience, Atin, which enterprise use cases are showing the most promise right now that our audience should be thinking about?
Emerging Enterprise Use Cases
Atin Sanyal: Being in this GenAI landscape this year and really seeing firsthand the innovation that's happening in enterprise, I think we saw certain use cases emerge as some of the obvious ones to go to, while we also saw a lot of expansion. Of course, there's content creation and customer service.
They're super big use cases, and they saw a lot of maturity compared to 2023; a lot of the stuff that was built actually got productionized. HR tooling, like AI tools [00:16:00] to screen resumes and build employee training programs, is another emerging use case. Legal came up as one of the main winners.
I think we saw a lot of very good tooling companies come out for legal AI this year, and that market is certainly set for disruption, with use cases including analyzing contracts. And in general, because we are still in the zero-to-one phase, where we don't really know exactly how to build a true Six Sigma GenAI application, people are still saying, hey, I have a lot of unstructured data lying around, language data, text data, in the form of PDFs, slide decks, documents, and I just need a way to unlock this unstructured data and make sense of it.
Even in an offline setting: just help me make sense of it to make good business decisions. That's [00:17:00] something that large enterprises especially have been trying to really unlock. A whole host of companies, including leaders we've spoken to at Writer and various other companies, are going in that direction, trying to unlock the power of unstructured data. And then there's the future, right? There's multimodal, which has come up; already people are building image-to-text systems and trying to bake agents in, to make something more holistic beyond just querying text data and getting a textual answer. So we've seen a lot of progress on all these fronts. That said, I think 2025 will be the year where you'll start seeing a few things actually get to success. The success has not yet been achieved. People are still in that mode of: hey, I've cautiously productionized something, I need robust evaluation tooling to make decisions and improve my system. The jury is still out on the top three [00:18:00] winning use cases in each industry, but a lot of those realizations, I think, will happen next year.
Predictions for AI in 2025
Conor Bronsdon: What other predictions do you have about how we'll see enterprises use AI in 2025?
Atin Sanyal: Yeah, if you've followed this industry and followed the trends, there are certain obvious things which will likely happen. One of them is that these models will have enhanced reasoning capabilities. We already saw a preview of that with, for example, OpenAI's o1 models. People haven't realized how good they are. There are a lot of questions around how you'll use enhanced reasoning capabilities in practical production systems, and whether or not they'll even yield something beneficial, but we'll see a lot of answers to that in 2025, and you'll start seeing the reasons why you can achieve step-[00:19:00]function improvements in your GenAI system if you simply have ingredients which are upleveled, for example, better reasoning baked into these LLMs. Then there's the general reduction of the cost of intelligence. We clearly saw eight-billion-parameter models perform as well as the hundred-billion-plus ones that launched in 2023. So a massive reduction in the cost of intelligence, and everyone benefits from that. There are also agents and multimodal. I think they kind of took off a little bit this year, and they're still about where these LLMs were in 2023, so still maturing; the stack is maturing. But we'll really start seeing these GenAI systems go from just being chatbots interacting with people to actually taking action. We've already started seeing very interesting agents being built and productionized, and we'll start seeing that happen at [00:20:00] scale. Multimodal is also, I think, another area that will see a lot of progress in 2025. And in general, what I see is that the quality of these systems will start improving. They'll start working at scale, slowly, and what we call low-hanging fruit today will be upleveled. What that means is, today we say, hey, I use these LLMs as a virtual assistant and it helps me be more efficient. Today, that's the low-hanging fruit. Tomorrow, the definition of low-hanging fruit will be, hey, I have these agents and they do these things for me. We'll just be upleveled in general. So I'm very excited about seeing the ceiling rather than the floor, because there will be newer opportunities, harder problems to solve with mature agents and mature agentic systems. That'll pave the path for the year after, 2026 and beyond. It'll be exciting to see what we can do. As a [00:21:00] company working on the cutting edge, it's always exciting to see what lies three to four years ahead. So in general, I'm excited to work with our enterprises to see what's practical, uplift them, and, yeah, just excited to see what innovations we do next year.
Conor Bronsdon: Something you and I have talked about before, and I think we maybe both see on the horizon here, is this opportunity for agents to leverage open source models as well. Because to your point earlier, it can at times be cost-prohibitive to leverage OpenAI's or Anthropic's APIs if you're trying to have all these calls happen for a multi-agent system that's actually solving problems for you. Do you expect to see more agent frameworks leveraging open source models in 2025 to improve efficiency?
Atin Sanyal: Absolutely. I think the key factors for that are cost and latency. Typically, an agentic system, [00:22:00] because its goal is to perform tasks, involves multiple calls to LLMs and multiple interactions. So this compound system is a lot more complex than a simple chatbot-based system. And that's where the restrictions of an API-based solution come in, where often you'd find yourself being very restricted. There's a multitude of benefits that open source models offer, including their weights being fine-tunable, and ways to sparsify and quantize them so they run at less than half the cost. So that really opens up the avenue for you to build systems that can think and rethink. You can do a lot more with an open source model, or open-weights model rather, that you deploy on your own than with, say, a closed-box API. The reason this trend will likely come to fruition is, number one, the quality of the open source models is [00:23:00] extremely high compared to what it was a year ago. And number two, people are trying to figure out and tap into the possibilities of the kinds of software frameworks you can build around these LLMs. As the toolscape starts maturing and the stack starts evolving, people will eventually figure out how to build very dynamic compound systems, which involve not only vector stores and RAG but also feedback loops, with multiple agents at different places and multiple LLMs. It will almost look like a complex graph that takes in some data at the beginning and performs some complex action at the end. And the flexibility that open source models offer is a lot higher than using some closed-box LLM as a service.
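To sketch the "think and rethink" loop Atin describes, here is a toy agent that keeps calling a model and feeding tool results back until it emits a final answer. `local_llm` is a hard-coded stand-in for a self-hosted open-weights endpoint, and the TOOL/FINAL protocol is invented purely for illustration:

```python
# Sketch: a minimal agent loop. The model can loop through tool calls
# (multiple LLM invocations per task), which is why per-call cost matters.

def local_llm(prompt: str) -> str:
    # Placeholder: a real system would call a self-hosted Llama endpoint.
    if "TOOL_RESULT" in prompt:
        return "FINAL: It is 7."
    return "TOOL: add 3 4"

TOOLS = {"add": lambda a, b: str(int(a) + int(b))}

def run_agent(task: str, max_steps: int = 5) -> str:
    prompt = f"Task: {task}"
    for _ in range(max_steps):
        reply = local_llm(prompt)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        _, name, *args = reply.split()        # e.g. "TOOL: add 3 4"
        result = TOOLS[name](*args)
        prompt += f"\nTOOL_RESULT: {result}"  # feed the result back: the loop
    return "gave up"

print(run_agent("What is 3 + 4?"))
```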
Conor Bronsdon: Completely agree. It's definitely an exciting time. That said, I do think there are some folks who are inflating the hype a little bit. Personally, [00:24:00] I don't think we are close to AGI, for example, at least under a transformer-based model, unless we have some sort of major breakthrough in how the technology works.
Do you agree with that? Or what's your perspective on where we are on AGI?
Atin Sanyal: It's hard to have a yes-or-no answer for that. I do know that there's incredible innovation happening at various places, including some of the closed source model service providers, places like Anthropic and OpenAI. They're doing phenomenal work. If you look at their roadmap... I mean, I don't work there, so I don't know.
But based on what data points I can connect and surmise, it's very clear that the capabilities of these models, just the LLMs themselves, are going to go much beyond language and generating output text. They're doing incredible work to have these models have actions baked in. And just based on [00:25:00] that, I can say we'll make tremendous progress in the next year or two towards what could be called AGI. But then again, people think of different things when they think of AGI, the extreme being Terminator and all the fantasy movies. That's likely not going to happen, is my take, but we will see. Again, going back to what I was saying, there's the floor, which will get uplifted next year, but the ceiling is where the excitement really is, because that lays out the possibilities for the next three to five years of innovation in this space. The ceiling will certainly be very, very high next year, because this year there were improvements on two fronts. There were large enterprise and mid-market businesses who started adopting what was coming out of these innovation factories and figuring out how to practically use LLMs and LLM systems. But there was also a lot of work done in the factory itself, and a lot of that is going to come out. [00:26:00] Essentially, all this is to say that the capabilities of GenAI models will be far greater than what we are imagining 2025 to hold.
And I'm super excited about that.
Conor Bronsdon: And to your point, there are all these gains happening within the industry, within these very smart teams, around quality and efficiency, looking at open source in particular. But it's not quite as exciting to talk about "oh, we improved quality here, we improved efficiency there" as it is to talk about "oh, we can get to AGI" or "oh, we're going to build the next Skynet" or whatever else. So a lot of that kind of gets papered over by how the media is discussing this. So I appreciate you focusing in on the opportunities here: we don't have to get to AGI tomorrow to make this all worthwhile.
There are all these gains we can have. Atin, thank you so much for the time. I really appreciate it.
Atin Sanyal: Thank you. This was great.
Conor Bronsdon: Yeah, and listeners, stay tuned.
After the break, you'll hear more from Atin, along with our panelists.
Conor Bronsdon: Unlock evaluation [00:27:00] intelligence for your AI team with Galileo, the industry-leading platform for evaluating GenAI applications.
Trusted by AI leaders like HP, Databricks, Comcast, Twilio, Headspace, CollegeVine, and many more, and powered by our proprietary Luna Evaluation Suite, Galileo enables evaluation-driven development and observability for your AI systems. Learn more at Galileo.ai.
Panel Discussion: Deploying AI at Enterprise Scale
Conor Bronsdon: Next up, we've got our Applied AI: Lessons from Deploying AI at Enterprise Scale panel from Productionize. You'll hear from Mehmet Murat Ezbiderli, Principal Software Architect at ServiceTitan; Grant Ledford, Senior Software Engineering Manager at Indeed; and Vinnie Giarrusso, Principal Software Engineer at Twilio, along with our favorite here, Atin Sanyal, Co-Founder and CTO of Galileo, as our moderator.
Atin: Thank you to all three of you for joining me here. Today I'll try to cover a bunch of topics related [00:28:00] to some of the realizations around deploying GenAI at scale, particularly this year.
But maybe I'll just start off with the different kinds of GenAI solutions and how they compare with traditional app development. Maybe this is a question for all, so we can do a round robin: what type of GenAI solutions are you and your team leveraging today, and how have you gone about developing them?
Maybe Mehmet, we could start with you.
Mehmet Murat Ezbiderli: Yeah, sure thing. Before getting into some of the applications, maybe I should give a quick overview of what the company does. We are a vertical SaaS company, and we focus on the trades: from plumbing to HVAC, electrical, garages, roofing, and other trades.
Our customers are big and small, from small shops to enterprises, who install your dishwashers, fix your electrical issues, and such. And one of the mottos we have is "changing lives." It's the first company I've seen where there's some real backing behind that. As for the type of applications, there's a range.
Some of them are live, some [00:29:00] of them are, as we discussed, in POC mode. We just had Canteen this year, our customer conference, where we announced some products. One of them is Second Chance Leads, which is effectively where we do a reassessment of the conversation between a customer and a customer support person on calls, to see if that conversation resulted in a lead. If the CSR marks it as a non-lead, we reassess the conversation and say, hey, actually this conversation was supposed to book a job, so can we give a call back? And within about five minutes, we actually call the customer back and turn it into a lead.
There have been some revenue ROI numbers reported by our customers around that. We also just announced our virtual assistants, for example, which are like CSR, I don't want to say replacements, counterparts or helpers, to offload some of the contact center management processes, from being able to answer calls after office hours to booking jobs for customers. We also have assist features
that help the technicians who go on [00:30:00] site, assisting them with questions they might have. Literally asking things like, hey, how do I install this part? Or, how much warranty is left on this furnace? So there are enormous opportunities there that we can directly translate into ROI.
But I want to quickly touch on your other point, which is how this differs from traditional application development. The most obvious thing is the deterministic versus non-deterministic nature of it. With traditional AI, yes, there was a semi-level of determinism: we could actually train a model, hold out test data, and get some confidence that, okay, this model is going to perform fairly well. With GenAI, we don't have that.
We just make some assumptions about the way the GenAI, our LLM server, is going to perform. We put as many guardrails on it as possible, but even from the calls we've received, it's quite obvious that there is a high potential for the agents to hallucinate.
They can make up fake numbers, they can point to nonexistent websites, and things like that. So the [00:31:00] primary challenge is shifting from "I do my best and test the hell out of my system" to "now I need to put guardrails in place, and I don't know when it's going to break." So I need to be
able to run the same test feedback loop over and over, and I also need to prevent that from happening in production. That's the paradigm shift we're realizing.
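One toy version of the guardrail idea Mehmet describes: flag numbers and URLs in a response that never appear in the source context, a common hallucination pattern. A production check would be far more robust, but the shape is the same:

```python
# Sketch: a production guardrail that runs on every response, flagging
# "facts" (numbers, URLs) in the answer that aren't grounded in the context.
import re

def ungrounded_facts(answer: str, context: str) -> list[str]:
    """Numbers and URLs in the answer that never appear in the source context."""
    candidates = re.findall(r"https?://\S+|\d[\d,.]*", answer)
    cleaned = [c.rstrip(".,;") for c in candidates]
    return [c for c in cleaned if c not in context]

context = "The furnace was installed in 2021 with a 10 year warranty."
good = "Your warranty runs 10 years from 2021."
bad = "See https://example.com/warranty; about 14 years remain."

assert ungrounded_facts(good, context) == []
print(ungrounded_facts(bad, context))  # ['https://example.com/warranty', '14']
```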
Atin: That's super interesting.
Panel Insights: Gen AI Solutions and Challenges
Atin: I know Vinnie, you're also working on some early-stage advanced agentic systems, which are very similar.
So maybe I'll just pass the same question on to you.
Vinnie Giarrusso: So at Twilio, we're building an AI builder platform. For us, we're not building any one particular type of application, and we're not really building any one particular type of solution. What we're offering is the ability for our customers to do that, the ability for our customers to say, this is my use case,
I want to build this on top of your platform. And Twilio is accustomed to doing that. We provide APIs; we have so many users and so many use cases. It's honestly incredible sometimes to come to work and find out about some of these things. But that presents [00:32:00] a really big challenge for us. As Mehmet was saying, non-determinism is a thing.
It's like the main thing here. How do we do this safely? How do we determine a set of guardrails that works across an unknown set of use cases? What is the baseline level of trust for a GenAI, and how do we get there? That's the question we're asking ourselves today.
And when it comes to multimodal, we have a lot of different challenges there. We have not solved prompt injection for text, right? But we have streaming voice-to-voice, we have speech-to-speech models that OpenAI just worked on with us, we have video coming. There's a whole bunch of other vectors coming soon that the industry is rushing towards,
and we still have a lot of questions to answer, even for just the most basic of use cases.
Atin: Absolutely. This is fascinating. Maybe Grant, since you're leading some AI infra teams, I'd love to understand, maybe two questions for you on a related note. Number one, what is [00:33:00] the difference you're seeing between building and deploying traditional infrastructure versus GenAI infrastructure?
And how do you think the new infrastructure is melding in with the old infrastructure?
Building & Deploying Traditional Infrastructure vs GenAI Infrastructure
Grant Ledford: Yeah. I work in a very similar space to what Vinnie mentioned, as a platform team for building out these applications, but mine's internal to Indeed, to let other Indeed product teams build these out. And one big piece I noticed in the differences
is actually something that I think both Vinnie and Mehmet alluded to earlier, which is that builder shift, right? We have a broader audience of people looking at building applications with these tools, beyond just a traditional data science background, and
the tools that are needed have to change and accommodate a larger swath of background context.
So that's one. The other is that without that kind of shared foundational knowledge, a one-to-one mapping of "oh yeah, here's how you test an LLM," similar to the concept [00:34:00] of a unit test, may not actually exist, and people may not have that shared foundational understanding.
That's the other big one I've run into: helping people map out what the right steps are and what options are available to them.
Atin: Shifting gears a little bit to talking about the GenAI stack itself. Maybe Mehmet, a question for you: when it came to the trade-offs you had to make while using your existing infrastructure and putting in new elements like vector stores, et cetera, what were some of those trade-offs you made while assembling the GenAI stack?
How to Assemble Your GenAI Stack
Mehmet Murat Ezbiderli: Yeah, so the trade-offs can be in multiple dimensions, and as you know, software is all about trade-offs. But I guess one dimension is really the cost, right? LLMs are still quite expensive. A new LLM comes up which is more fancy and shiny, which has much stronger reasoning capabilities and much larger context windows, and you want that because it solves your problem much better, but then you need to be very mindful of [00:35:00] your costs.
In our observations, for an actual virtual assistant, I don't know the exact numbers, but the LLM cost was at least about half of the operational overhead, and that includes all the compute and everything. So it's quite a big item. And that's why we typically need to ask, okay, especially in a RAG context where you want to answer questions, how large of a window do we really want to use?
I mean, we could shove everything into the context window and ask the questions, but then you're basically spending all that money per request. And with RAG, as most of us know, large context beats RAG in pretty much all the contextual Q&A benchmarks, but RAG is,
I don't want to say cheap, but much more affordable, much more reasonable, because the vector databases are much cheaper. So then there's the question of what do I use for what, right? How much RAG do I use, and in what scenarios? There is no great answer.
Sometimes you use one, [00:36:00] sometimes you use the other. But there's also new research coming up, such as Self-Route, where you actually ask the LLM, can you answer this question for me with a RAG window? And if it says, I cannot, then you build the large context. So there are staggered approaches emerging.
The other trade-off is maybe that some models are great, they're fantastic, they reason well, but you don't need them for the task at hand. For example, if you're just doing summarization or some sentiment detection, or maybe some simple RAG decorator, 3.5 works great. Whereas 4.0 is something you might want for virtual assistants, with reasoning and contextual awareness and things like that. And I especially want to highlight the importance of golden datasets here, in the sense of: which approach is better? Again, there's no deterministic outcome.
You need to run and replay your tests as many times as needed and look at your benchmarks, saying, okay, there's a higher likelihood that this approach [00:37:00] is better. And even that's not the end of the story; you still need to do your due diligence of staggered deployments, real-time monitoring, all that stuff.
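The Self-Route pattern Mehmet mentions can be sketched in a few lines: try the cheap RAG window first, and pay for the long context only when the model declines. `ask_llm` and the retrieval step are hypothetical placeholders:

```python
# Sketch: self-route between a cheap RAG window and an expensive long context.

UNANSWERABLE = "unanswerable"

def ask_llm(question: str, context: str) -> str:
    # Placeholder for a real LLM call that is instructed to reply
    # "unanswerable" when the provided context is insufficient.
    return "10 years" if "warranty" in context else UNANSWERABLE

def retrieve_chunks(question: str) -> str:
    return "chunk about pricing"  # toy retrieval miss, to force the fallback

FULL_DOCUMENT = "...pages of manuals... the warranty is 10 years ..."

def self_route(question: str) -> str:
    cheap = ask_llm(question, retrieve_chunks(question))  # small RAG window first
    if cheap != UNANSWERABLE:
        return cheap
    return ask_llm(question, FULL_DOCUMENT)  # pay for long context only on demand

print(self_route("How long is the warranty?"))  # falls back, prints "10 years"
```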
Atin: Vinnie, maybe, talking about toolkits and SDKs and frameworks a little bit. Number one, what toolkits are you using? Is your stack fully Python, or are you using other frameworks? What kinds of libraries have you found to be most beneficial? And as a user who's starting out, zero to one, building end-to-end GenAI apps, how should one go about thinking about what toolkits and frameworks to use?
Vinnie Giarrusso: Yeah, really good question. I would say that you should use the tools that you're most comfortable with. In the early keynote, Vikram was saying that for a lot of ML engineers, historically, Python is the place to be. TypeScript is quickly coming up.
Our team had this discussion as well. When we first started, we had a working POC built on Vercel. It was completely outside of the Twilio infrastructure, but it was just to get things going. When we decided, [00:38:00] okay, we're going to move this into a real thing,
we had these discussions: what languages are we going to use? What trade-offs does that mean? If we choose TypeScript, are we losing out on some good Python libraries that don't offer a TypeScript package, or whose TypeScript package just doesn't have the same level of rigor put into it as the Python package?
So we made some trade-offs right away. We already had a fully featured POC written in TypeScript, and we knew that a lot of engineers at Twilio would be more familiar with TypeScript than Python. So we just kept it like that, and we decided that at some point in the future, we'd pull the parts out of TypeScript that fit better in other languages.
And that's exactly what we did. We pulled our chunking and embedding out into a completely separate service, and that service runs Python, because it's just easier to do there. It's more efficient, it's faster, it's better; the library and tooling are better there. In other areas, we kept TypeScript: our LLM chain is still LangChain in TypeScript.
It's been that way since day [00:39:00] one. Our public API that we put in front of everything is written in Go. It really depends on what your team is familiar with. That being said,
I will also say, don't be afraid of new things. If you're a TypeScript engineer, it's okay to get your hands into some Python. It's all right to learn about embedding models, and it's okay to learn how these things work and how they interact, because honestly, you're going to need to. So I would say, use the tooling that you're familiar with, but keep your eyes open for what's around the corner and what might be useful for you and your team.
Atin: Absolutely. That's certainly a pattern I've seen with our customers as well. On the topic of hallucinations, I'm curious to get your platform take on this. You're building a platform that other users are building apps on top of, and building general infrastructure.
How do you think about reducing the risk of hallucinations, particularly in the post-deployment scenario, once the GenAI app is out there in the wild? [00:40:00]
Grant Ledford: Yeah. So I think part of this is having, well, we have a system built out to help people run some initial safety checks as part of their observability, and then giving them a lever.
Once you get up and running, you're going to need to customize this. You're going to find edge cases in your application.
You're going to get more data that will lead you to want to tweak these systems, these basic integrations. So having that configurability for teams to say, hey, we have a base semantic similarity check, or something that says, hey, this looks like things we know are hallucinated,
that kind of context adherence check as part of our evaluation set:
how can we monitor that in production? That's been a big part of what we're looking at on the platform side: how can we give those initial starting points for people, say, hey, here's the right way, you can use these tools, and then you can extend them as you actually get up and running.
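A crude sketch of the kind of configurable context-adherence check Grant describes as a starting point: score each answer sentence by word overlap with the retrieved context, and flag low-overlap sentences for review. A real check would use embeddings or an LLM judge; this only shows the shape of the lever:

```python
# Sketch: flag answer sentences with low lexical overlap against the context.
import re

def adherence(sentence: str, context: str) -> float:
    s = set(re.findall(r"\w+", sentence.lower()))
    c = set(re.findall(r"\w+", context.lower()))
    return len(s & c) / len(s) if s else 1.0

context = "Indeed lists remote data science roles paying 120k to 180k."
answer = "Remote data science roles pay 120k to 180k. All of them include equity."

for sent in re.split(r"(?<=[.!?])\s+", answer):
    score = adherence(sent, context)
    flag = "FLAG" if score < 0.6 else "ok"
    print(f"{flag} ({score:.2f}): {sent}")  # the second sentence gets flagged
```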
Today's Best GenAI Use Cases
Atin: You know, we've kind of spent a lot of time, just as an industry, prototyping and putting early versions of these applications out. I'm curious to know all three of your [00:41:00] perspectives: what kinds of GenAI applications or sub-features that you've put out have started to show the most substantial wins for the business?
Maybe Vinnie, we could start with you.
Vinnie Giarrusso: I think a lot of the use cases we're seeing today are pretty much what you might expect. We see a lot of customer service use cases; that would probably be our primary one. So when we look at what we want to do and how we want to do it going into production, we're really looking towards those most common use cases.
And we talked about trade-offs earlier; the inverse of making trade-offs is deciding which ones you're not going to trade off on. From the beginning, we had a very strong idea about how our customers might want to use a product like this. We weren't exactly sure, but we definitely had things we knew we weren't going to trade off on. We're not going to trade off on security, for example.
We're not going to trade off on anything that's going to make our application less trustworthy for our customers. So when it [00:42:00] comes to specific use cases and the ROI, ours varies wildly. But customer use cases, contact centers, that's definitely at the top of the list for sure.
Atin: Got it. No, that's very interesting. Mehmet, I'm curious to know your take on this. Any workflows you've seen be a hit with customers of ServiceTitan?
Mehmet Murat Ezbiderli: Yeah. I mean, in transparency, we are still evolving in the game, right? There were a lot of POCs last year, and we've just started to roll some of those out.
The main application we've rolled out is what I mentioned earlier, Second Chance Leads, which is directly an ROI generator, because you basically detect a lead that was already missed. There's been good feedback from our customers about the ROI there. In terms of the other types of applications we're looking to ship, which we've announced and which our customers are very interested in,
with potentially a lot of ROI, it's the virtual assistants, obviously. If you're able to answer a call about [00:43:00] business hours, that's a win for the business, that's a win for the shop. In a spike season, when there's a heat wave coming in, all the lines are booked.
If your assistant can not only keep the customer on the line and help them wait, but actually help them get their questions answered, find the right type of job, find the right appointment slot, and actually perform the booking, that's a huge ROI expectation. Again, that's not yet rolled out.
The other ROI is basically in the call center scenario: guiding and mentoring the CSR. Once they've had a call, how did the call really go? Did the CSR ask the right types of questions, show empathy, follow the protocol? Being able to assess that and show it to the mentor or the manager,
so they can actually work with those CSRs. What you gain is a higher skill set and maybe time savings. And finally, there are the assist features we just announced, which are [00:44:00] about helping a technician or an accounting person or whatever domain, being able to answer their domain-specific questions, or maybe showing some tailored recommendations at certain times.
Those are the things we'll be looking at. Hopefully over the next couple of years, there's going to be almost an explosion of ROI.
Atin: Vinnie, I know Twilio is of course doing agentic stuff, but also leveraging multimodal. I'm curious to know what sort of challenges you've seen in the early going as you've been building out these applications; maybe you could focus on evals, that would be interesting.
Vinnie Giarrusso: I mean, a lot of the time the evals don't exist yet. When we're talking about speech-to-speech, there are not a lot of good options for testing that out, because it is very much in the super, super early days. So when we think about how we're going to run evals on this, for example, there's obviously a very large set of technical considerations.
We're moving from a model where [00:45:00] it's HTTP requests, client-server, where you just send one back and forth and wait for the next one, to setting up a WebSocket connection, streaming, and waiting for streams to come back. How does that interact with the current systems we have today?
That's where we're looking. And for evals in particular, we are concerned with the one-to-one mapping of input and output, but as we move to multimodal, we have other sorts of considerations in there. Task-based evals are going to be something we're particularly interested in. It might be really difficult for us, across text and video and voice, to have the same level of rigor between all three of those when we create our evals.
But if we look at the ends, that might be a little bit easier for us to start out with, before we start merging into the middle and getting into the nitty-gritty of: did it stream back the right thing? Did we catch the stream halfway through? Did we catch the interrupt, and what happened when we interrupted it?
There are [00:46:00] just a ton of technical considerations that don't have solutions yet. So we try to find the areas where there is maybe half a solution that we can help with, and work with people to find the rest of it.
Atin: Maybe related to that, Grant, since you're building a platform: given the subjectiveness of the output quality, and the lack of objective metrics around a lot of these evolving workflows, how do you set expectations with
your customers, who are the builders, and then finally the eventual customers of Indeed? How do you set expectations around, number one, what a successful GenAI platform can do, and how do you set up your users for success, essentially?
Grant Ledford: Yeah. So part of this is figuring out what goes into the toolkit, then guiding people on how they can use those tools and what levers they can pull to accomplish different goals.
Mehmet mentioned [00:47:00] earlier some of the pieces around how different models might work better for different use cases: maybe a smaller model if you're trying to do something that needs faster latency, or a larger-context-window model that might let you build a solution without having to build a RAG-based solution. Then you get into things like prompt caching.
So giving people that suite of, hey, here are your options and the levers you can pull, and what success metrics they give you, is one piece. But the other, and this is actually more of a broader product piece, is how do you measure success for your application as a whole once you've built something with this?
And being able to actually keep a tight feedback loop on that, so you know not just, is my model doing what I expected it to, and is my application with this model doing what I expected it to, but is it actually moving a business metric we care about. So my team has worked really closely with a couple of our early adopters here
to help measure and get that feedback loop of: did this move what you expected it to move?
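Grant's "levers" can be pictured as a simple per-request router; the model names and thresholds below are invented purely for illustration:

```python
# Sketch: pick a model per request based on latency needs and context size,
# rather than using one frontier model for everything.

def choose_model(task: str, context_tokens: int, latency_budget_ms: int) -> str:
    if latency_budget_ms < 500:
        return "small-fast-model"       # cheap, low latency
    if context_tokens > 32_000:
        return "long-context-model"     # may avoid building a RAG pipeline
    if task in {"summarize", "classify"}:
        return "small-fast-model"       # simple tasks don't need frontier models
    return "large-reasoning-model"

print(choose_model("summarize", 2_000, 2_000))  # small-fast-model
print(choose_model("assist", 80_000, 5_000))    # long-context-model
print(choose_model("assist", 4_000, 5_000))     # large-reasoning-model
```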
Atin: Just in the interest of time, one last super short question for all three of you, [00:48:00] sort of final thoughts as we enter 2025.
Mehmet, we can start with you. What's your personal take on what the next year looks like for enterprise GenAI? What trends do you think we'll see, and which use cases will be the top ones?
Enterprise AI Trends for 2025
Mehmet Murat Ezbiderli: Okay. So I think I'm going to call back to the earlier conversation, where there was a POC stage and now the actual productionizing stage. I would expect some of the POCs that have been done, and some of the enhancements, to actually prove themselves and become productionized services.
If you look at Salesforce, they just announced Agentforce, for example. So I see a strong inclination and direction towards using assistants. Eventually, you'll have an assistant, a copilot, wherever you go, from your personal life to business scenarios. Those scenarios, I think, are going to become more and more common.
I also see a lot of buzz around assistants working together to accomplish complex workflows. If I have a complex business problem, instead of me going out knocking [00:49:00] on ten doors, I just assign it to my assistants, and they go and talk to each other and figure out the right way.
Is it going to be next year? I think there are some things we'll see next year that are incremental steps, but I see that sort of automated running of a business on the horizon for sure.
Atin: Fantastic. Vinnie, maybe a one-liner from you.
Vinnie Giarrusso: I think we're probably going to see a little bit of a tempering of the excitement around AI. Obviously there's huge potential here, and most companies are asking, how are we going to use this?
But I think over the next year, what we're probably going to see is a more modest, more measured approach to getting this into your companies and getting it into a state where you can trust it just as much as any other piece of software. I hope that's the case. I'm also excited about all the other stuff, but I'd love to get some stability too.
And Grant.
Grant Ledford: Yeah, I think there are two. I'd say one is compound systems, and maturing that infrastructure as a whole, so you can actually build out more complex use cases and have [00:50:00] confidence that those work well. And the other is that, as we saw in the keynote, there are a ton of players in the tooling and infrastructure space for building LLM-based applications, and as the winners in each subcategory get identified, it'll be easier to get all those pieces to play well together.
So I think that's something that will happen more over the next year as well: there will be some clear front-runners that get better tie-in and adoption.
Atin: Absolutely. I'm certainly very excited to see all of this come together and see something truly useful that works at scale. Thank you, all three of you, so much.
This has been a fantastic chat.
Closing Remarks and Future Outlook
Conor Bronsdon: That's it for this week. Thanks so much for listening. Make sure to subscribe to Chain of Thought wherever you get your podcasts, and we'll see you next week.