Chain of Thought | AI Agents, Infrastructure & Engineering


Show Notes

Intercom was spending $250K/month on a single summarization task using GPT. Then they replaced it with a fine-tuned 14B parameter Qwen model and saved almost all of it. In this episode, Intercom's Chief AI Officer, Fergal Reid, walks through exactly how they made that call, where their approach has changed over time, and how all of their efforts built their Fin customer service agent.

Fergal breaks down how Fin went from 30% to nearly 70% resolution rate and why most of those gains came from surrounding systems (custom re-rankers, retrieval models, query canonicalization), not the core frontier LLM. He explains why higher latency counterintuitively increases resolution rates, how they built a custom re-ranker that outperformed Cohere using ModernBERT, and why he believes vertically integrated AI products will win in the long term.

If you're deciding between fine-tuning open-weight models and using frontier APIs in production, you won't find a more detailed decision process walkthrough.

🔗 Connect with Fergal:
Twitter/X: https://x.com/fergal_reid
LinkedIn: https://www.linkedin.com/in/fergalreid/
Fin: https://fin.ai/

🔗 Connect with Conor:
YouTube: https://www.youtube.com/@ConorBronsdon
Newsletter: https://conorbronsdon.substack.com/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/

🔗 More episodes: https://chainofthought.show

CHAPTERS

0:00 Intro

0:46 Why Intercom Completely Reversed Their Fine-Tuning Position

8:00 The $250K/Month Summarization Task (Query Canonicalization)

11:25 Training Infrastructure: H200s, LoRA to Full SFT, and GRPO

14:09 Why Qwen Models Specifically Work for Production

18:03 Goodhart's Law: When Benchmarks Lie

19:47 A/B Testing AI in Production: Soft vs. Hard Resolutions

25:09 The Latency Paradox: Why Slower Responses Get More Resolutions

26:33 Why Per-Customer Prompt Branching Is Technical Debt

28:51 Sponsor: Galileo

29:36 Hiring Scientists, Not Just Engineers

32:15 Context Engineering: Intercom's Full RAG Pipeline

35:35 Customer Agent, Voice, and What's Next for Fin

39:30 Vertical Integration: Can App Companies Outrun the Labs?

47:45 When Engineers Laughed at Claude Code

52:23 Closing Thoughts

TAGS

Fergal Reid, Intercom, Fin AI agent, open-weight models, Qwen models, fine-tuning LLMs, post-training, RAG pipeline, customer service AI, GRPO reinforcement learning, A/B testing AI, Claude Code, vertical AI integration, inference cost optimization, context engineering, AI agents, ModernBERT reranker, scaling AI teams, Conor Bronsdon, Chain of Thought

What is Chain of Thought | AI Agents, Infrastructure & Engineering?

AI is reshaping infrastructure, strategy, and entire industries. Host Conor Bronsdon talks to the engineers, founders, and researchers building breakthrough AI systems about what it actually takes to ship AI in production, where the opportunities lie, and how leaders should think about the strategic bets ahead.

Chain of Thought translates technical depth into actionable insights for builders and decision-makers. New episodes bi-weekly.

Conor Bronsdon is an angel investor in AI and dev tools, Head of Technical Ecosystem at Modular, and previously led growth at AI startups Galileo and LinearB.

FINAL TRANSCRIPT
================
Speakers: Conor, Fergal
Duration: 53:30
Total Words: 9612
Generated: 2026-02-25

---

[0:04] Conor:
Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon, Head of Technical Ecosystem at Modular. And today we are talking about something every AI product team is wrestling with right now. When do you fine-tune? When do you use frontier models? And how the heck do you make these decisions when everything changes quarter by quarter, if not week by week at times with some of these recent model releases? My guest today is Fergal Reid. Fergal is the Chief AI Officer at Intercom, where he has scaled their AI team from just 10 people to 55 in about two and a half years, and he's planning to double that team size again next year as they continue to heavily invest in AI, including with their agent Fin, which we'll talk a bit about. But what makes this conversation particularly valuable is that Fergal and his team have completely reversed their position on a fundamental strategic question. Just two years ago, Intercom's take was relatively simple: look, frontier LLMs are improving so fast, we don't really need to spend that much time fine-tuning. Just get really good at prompt engineering, providing context, and ride the wave of better models. Today, they've instead heavily invested in post-training and fine-tuning open-weight models, with fine-tuned Qwen models running at scale in production, replacing GPT-4 for key tasks, and saving significant money in the process. So what changed? How did Intercom and Fergal make these decisions? And what does that tell us about where the industry is heading? Fergal, welcome to Chain of Thought. It's great to see you.

[1:32] Fergal:
Thanks for having me, Conor. Delighted to be here.

[1:34] Conor: [OVERLAP]
Yeah, I think this is gonna be a really fun conversation because often in these talks, we don't focus on a use case and really talk through the trade-offs that are made along the way as much as I think we ought to. It's very common, and I think listeners will tell me this occasionally, like, hey, look, this is maybe a little too high level. You just talked about what's exciting, what's interesting, and those conversations can be useful, but I think it's important to drill down and really explore why decisions were made at a company like Intercom, and I think it's going to be a fascinating test case. So let's just start with that elephant in the room. You know, two and a half years ago, your position was essentially don't bother fine tuning. The frontier models are improving too fast. Focus elsewhere. Now you have, as we said, heavily invested in post-training open weight models. Walk me through what changed your mind. How did you get to where you are today?

[2:27] Fergal: [OVERLAP]
Yeah, absolutely, Conor. So things really changed out there in the external world. So again, two and a half years ago, we were just starting to build Fin, and we were building Fin initially with GPT-4, GPT-3.5 at the time. It didn't follow instructions as well, it used to hallucinate more, and GPT-4 was this threshold for us where we could give it instructions to try and constrain it to just answering from text that a business customer of ours controlled, which everyone calls RAG now. And so we were very early to that. And there was really so much to do at the time by just getting better and better at prompting these large models. The difference between a badly prompted GPT-4 and one with a well-engineered prompt was night and day in terms of the actual performance, in terms of the actual accuracy. And so there was just this period of time for maybe a year, year and a half where models got better so fast, they got cheaper, they got more efficient. Models like Claude Sonnet 3 and 3.5 came along that were really much better at, like, following instructions. And we could do a lot more by just getting very good at prompt engineering, tuning, back testing, building more and more into the core prompts. And at the time, there were other people in the space talking a lot about, you know, training their own models, or maybe taking a model and, you know, post-training or mid-training it with this mix of data curated for their specific area. And we looked at that and we did a few experiments. I remember we ran an A/B test with a fine-tuned version of, I think it was text-davinci-003, and the GPT-3.5 Turbo models. We ran experiments with fine-tuning. We fine-tuned models in the GPT-3.5 Turbo series, and we did A/B tests in production of those, but we just didn't see the improvement overall versus instead just taking these models and getting better and better at prompting them. And so that was really the case for a long time. We made this very deliberate decision: no, we're not going to invest a ton of money and time into fine-tuning our own models. We're just going to piggyback on this wave of, it felt like every quarter or two quarters, there was a new leading frontier model that was dramatically better at customer service and at the sort of things we wanted to do. And really in the last sort of six months, nine months, that's kind of changing. Now, I do think the frontier models are continuing to improve and they're getting better and better at, like, being agentic and following instructions. But in our area, our domain of customer service, let's say for our core RAG tasks, it feels like the models are saturating the intelligence level that we need for that core task. And instead, what's more important to us is really fine-grained control over when the models do and don't behave in certain ways. So one thing we care a lot about at Intercom with Fin is, when should you escalate? When should you say, hey, I want to involve a human in this? And that's something that we care a lot about, very fine control over. And so we spent a long time hand-engineering our prompt with certain times when you should escalate and when you shouldn't, giving customers even the ability to provide guidance to that. But we've been able to get a level of control over and above that again by actually post-training and fine-tuning some custom models.
And so I really think that when your application matures, maybe when you no longer need frontier levels of intelligence, it's a good idea to think about doing your own post-training, because you can reduce cost a lot and also you can get much more fine-grained control over the exact policy that the LLM follows. So both of those reasons have been very effective for us.

[6:38] Conor:
When did you start to question that original approach that you were taking? Was there a specific moment where you said, actually, maybe we need to change up how we're approaching this? Or was it more gradual, maybe even correlated to simple team capacity growth and the ability to say, oh, look, now we can do both?

[6:55] Fergal:
It's definitely related to team capacity growth. Now the causality goes both ways around there, which is that we hire more people because we want to do more post-training, and we do that because we think there's more value to it, there's more upside to it. And I would say that, like, you know, around the time of DeepSeek, that was really when the open-weight models started to become close to the closed-weight models and started to become almost good enough to do some of our core tasks. And so for a long time, you know, Fin underneath the hood, there's the hardest prompt of Fin, which we use Claude Sonnet for in production, which is that kind of core "answer the end user's question" prompt. But then there's always been a set of ancillary prompts that are very important to the overall experience that we've had for a long time. And we would always try and run those prompts on the smallest, cheapest, lowest-latency model that we could to get the best end user experience. And so an example of this would be: when you ask a question to Fin, the first thing we've always done is we've always had an LLM that will try to summarize and canonicalize the query that the end user has asked. And we found it's worth using an LLM to do that before going to retrieval, because end users ask questions in ways that sometimes, you know, can be very colloquial, they can vary a lot from person to person, they can be inaccurate. And so we put it through an LLM to sort of, like, hey, can you summarize this query and canonicalize this query before going to the retrieval model in the RAG stack? And the retrieval models are powerful, but they're a lot less powerful than a modern LLM. So doing that pre-summarization or pre-canonicalization phase has always improved the accuracy of our overall retrieval system, which then in turn enhances the accuracy of what the LLM generates at the end. And we very early on realized that we didn't need to run GPT-4 or Claude Opus or something very heavy like that to do that sort of canonicalization piece. So we were using models like GPT-4.1 to do that, which is a model that's a pretty, pretty good kind of cost-performance trade-off. And we were spending a lot of money on that summarization task. I don't remember exactly, but it could have been like a quarter million dollars a month on just that summarization task. And so when you're at that, and now you see, you know, the Qwen models are coming out and the Qwen 3 models are very good, and there's a 14-billion-parameter Qwen 3 model that's really good. And we were able to go and take that model and post-train that model to do that summarization task, and we were able to replace that and save almost all of those, you know, several hundred thousand a month on inference just by doing that one thing. And then we get more fine-grained control over it. So it's probably a combination of, you know, our product maturing, also the gap between the open-weight models and the closed-weight models closing for the specific tasks we were interested in. And, you know, obviously the Qwen 3 series is an extremely capable and powerful series of open-weight models. I think a lot of people are kind of using them in production for those kind of smaller or ancillary tasks.
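
A rough illustration of the canonicalization step Fergal describes: a small model rewrites the raw, colloquial message into a clean question before retrieval ever sees it. The prompt wording and the `complete` / `retrieve` helpers below are hypothetical placeholders, a minimal sketch rather than Intercom's actual implementation.

```python
# Minimal sketch of a query canonicalization step before retrieval.
# `complete()` and `retrieve()` stand in for a small fine-tuned LLM endpoint
# and the RAG retrieval layer; both are assumptions for illustration.

CANONICALIZE_PROMPT = """Rewrite the end user's message as a single, \
self-contained support question. Remove greetings, filler, and typos. \
Keep product terms exactly as written.

Message: {message}
Canonical question:"""

def canonicalize_query(message: str, complete) -> str:
    """Summarize/canonicalize a colloquial user message with a small LLM."""
    prompt = CANONICALIZE_PROMPT.format(message=message.strip())
    canonical = complete(prompt, max_tokens=64, temperature=0.0)
    # Fall back to the raw message if the model returns something unusable.
    return canonical.strip() or message.strip()

def answer_with_rag(message: str, complete, retrieve):
    canonical = canonicalize_query(message, complete)
    # Retrieval sees the cleaned-up question, not the raw colloquial text.
    chunks = retrieve(canonical, top_k=20)
    return canonical, chunks
```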

[10:24] Conor:
I'd love to understand more about how you trained these Qwen models to be so effective here. So obviously, we've seen this explosion in open source AI; particularly, I think, DeepSeek has led the way a lot in driving the edge forward. And Alibaba, obviously, with the Qwen models, has done a fantastic job as well, kind of following through there. And it's interesting, I'll say, to see the juxtaposition between open source models coming out of China and then elsewhere, and then how it's just propagating. That's a whole longer topic we can dive into if we have time. But I'd love to understand specifically: okay, how did you take that open-weight model and actually tune it to what you needed for your summarization tasks and, it sounds like, other tasks?

[11:09] Fergal:
It's definitely involved. So we work primarily on Amazon, on AWS. And so we use large EC2 instances. I

[11:25] Fergal:
think our standard instance at the moment is one node of 8x H200s. So we've got a lot of RAM. That's kind of what we have set up. We have a kind of AI infra team that has set up a lot of infrastructure to make it easy for a scientist to log in and run a distributed training job on those large GPUs. And, you know, we went through a process here, and we started out experimenting with, like, you know, LoRA and parameter-efficient fine-tuning methods like that. And maybe Unsloth, I think, we used a lot at the start. But over time, we sort of invested more and more in this, and now we use, I guess, more distributed training. And most of what we do these days is a combination of full supervised fine-tuning, unquantized supervised fine-tuning. And then for some of the harder things we're doing as well, we use reinforcement learning, and we have models in production that we've trained that way, and we've found that GRPO has worked well for us for some of the things we've been able to do. And we've invested a lot in doing things like building evaluation models. You know, we invested a lot in building a resolution model, because we care a lot about: has this answer sufficiently resolved the end-user query? We've invested in building a resolution model that does a good job at proxying the real signal we get from an end user, whether something has been resolved or not. And then we use that as an input into our reinforcement learning setup. And so our reinforcement learning setup now tends to be, you know, a multi-dimensional setup. One thing is, like, the resolution model. Another thing is a whole bunch of things we care about in terms of, like, are you using bullet points the right amount, not too much? How long is your answer? How short is your answer? And we also use other open-weight LLMs as judges as well for part of that too. We've a pretty sophisticated reinforcement learning setup. And now, you know, the actual setting up of your objective function and stuff is relatively the smaller part of it. The harder part has been just the infrastructure. Probably if I look at our investment overall, we have a lot of heads in getting good at running these models at scale on AWS, and that's where we've put the bulk of the investment, although both are important.
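
Fergal describes a multi-dimensional reward: a learned resolution model, formatting checks (bullet usage, length), and open-weight LLM judges. The sketch below shows roughly how such a composite reward could be wired into a GRPO-style trainer. The component weights, thresholds, and the `resolution_model` / `llm_judge` callables are assumptions for illustration, not Intercom's actual setup.

```python
# Sketch of a composite reward for GRPO-style RL on generated answers.
# resolution_model(), llm_judge(), weights and thresholds are illustrative.

def formatting_score(answer: str, max_bullets: int = 5,
                     min_words: int = 20, max_words: int = 180) -> float:
    """Cheap heuristic checks: not too many bullets, sensible length."""
    bullets = sum(1 for line in answer.splitlines()
                  if line.lstrip().startswith(("-", "*", "•")))
    words = len(answer.split())
    bullet_ok = 1.0 if bullets <= max_bullets else 0.0
    length_ok = 1.0 if min_words <= words <= max_words else 0.0
    return 0.5 * bullet_ok + 0.5 * length_ok

def composite_reward(question: str, answer: str,
                     resolution_model, llm_judge) -> float:
    """Blend a learned resolution proxy, formatting checks, and an LLM judge."""
    r_resolution = resolution_model(question, answer)  # 0..1: would this resolve?
    r_format = formatting_score(answer)                # 0..1: style constraints
    r_judge = llm_judge(question, answer)              # 0..1: open-weight judge
    return 0.6 * r_resolution + 0.2 * r_format + 0.2 * r_judge

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO samples several answers per question and normalizes rewards
    within that group to get the advantage signal."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```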

[14:09] Conor:
But it sounds like the trade-off decision you've made is like, sure, we've added headcount to make sure that we can run this system effectively. But the feeling is that if you were, say, using GPT models still for these same tasks, not only would the model inference be significantly more expensive, but you'd still have to be doing this fine-tuning to get what you're doing effectively. So I'm curious what has made Qwen in particular work so well for you. I mean, obviously there's the cost perspective, I think that's a huge piece, but it sounds like there are other reasons you're also focusing on the Qwen family.

[14:42] Fergal: [OVERLAP]
Yeah, so the Qwen models, I guess we realized pretty early on after the Qwen models came out. Firstly, they benchmarked well, but then whenever a model benchmarks well, you've always got to look at it yourself and try it in some of your own back tests and some of your own evaluations, because some people accidentally contaminate the benchmarks or some people just...

[15:02] Conor: [OVERLAP]
I would argue it's fairly common.

[15:04] Fergal:
Yes, maybe common. Yeah. And then even people who maybe don't, you know, it's so easy for a researcher or a scientist trying to optimize a benchmark to get good at teaching the model how to do the kind of task that's in the benchmark and then suffer generalization failure. So, you know, we saw the Qwen 3 models and obviously they, you know, put a ton of tokens through them. So it's really generous that they were released open weight, given the large training expense to create them. And then when we back tested them, they worked really well on sort of our standard back tests. And suddenly we were very interested in this, because we were like, these seem like legitimately performant models. And then we started to do production tests of them. We always like to do a production test of a well-prompted but untrained model before we go and start post-training for our thing, just to get some sort of a sense of, like, hey, roughly how good is this? Is it close to what we're trying to do? And yeah, the Qwen 3 models performed well on our back tests for our simpler tasks. And again, Fin still runs a frontier model, Sonnet 4, and I think we're in the process of moving to Sonnet 4.5 in production at the moment for the hardest prompts. But for those easier ones, all that surrounding stuff we find is a major part of the actual performance of the system. If you look at Fin, Fin is really this cluster of 10 or 15 different problems and machine learning systems. And, you know, we have improved the resolution rate of Fin: at launch it was about thirty percent, now it's heading towards seventy. And the vast majority of that improvement has not been in the core, hardest frontier LLM call. That has improved and it's been great. But it's been in that surrounding cloud of systems: the retrieval model, the re-rank model. We built a custom re-ranker, built almost from scratch using ModernBERT, added our own cost function, trained it on a whole bunch of data, and that's worked really, really well. And so, you know, over time, we've been getting a percentage point here, a percentage point there across the ten or fifteen different parts of the system. It really adds up. So, you know, I guess we've been invested in this area for a while. And so, yeah, we were just excited. As open-weight models get better and better and we get the ability to fine-tune them and control them, we're excited to see if we can turn that into more resolutions.
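
For the re-ranker idea, a minimal sketch of the general approach Fergal names, a cross-encoder built on ModernBERT and fine-tuned on (query, chunk, resolved) pairs with Hugging Face `transformers`, is below. The data format, loss choice, and hyperparameters are assumptions; Intercom's actual cost function and training setup are not described in detail here.

```python
# Sketch: a ModernBERT cross-encoder re-ranker trained on resolution signals.
# The (query, chunk, resolved) data format and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)

def collate(batch):
    queries = [ex["query"] for ex in batch]
    chunks = [ex["chunk"] for ex in batch]
    labels = torch.tensor([float(ex["resolved"]) for ex in batch])
    enc = tokenizer(queries, chunks, truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    enc["labels"] = labels
    return enc

def train(pairs, epochs=1, lr=2e-5):
    """pairs: list of {'query', 'chunk', 'resolved'} built from production feedback."""
    loader = DataLoader(pairs, batch_size=32, shuffle=True, collate_fn=collate)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for batch in loader:
            labels = batch.pop("labels")
            logits = model(**batch).logits.squeeze(-1)
            loss = loss_fn(logits, labels)  # "did this chunk lead to a resolution?"
            loss.backward()
            optim.step()
            optim.zero_grad()

def rerank(query, chunks, top_k=5):
    """Score each retrieved chunk against the query and keep the best ones."""
    model.eval()
    with torch.no_grad():
        enc = tokenizer([query] * len(chunks), chunks, truncation=True,
                        padding=True, max_length=512, return_tensors="pt")
        scores = model(**enc).logits.squeeze(-1)
    ranked = sorted(zip(chunks, scores.tolist()), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k]]
```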

[17:47] Conor:
And I think I should be explicit here, since I don't know that we stated this earlier and maybe alluded to it, but Fin is Intercom's fantastic AI agent for all your customer service needs. That's what we're talking about here. And there's quite a bit of information on their website you can find about it. Super interesting stuff. I want to bring up one particular thing you said, Fergal, which is this idea of, you know, when a measure becomes a target, it ceases to be a good measure, Goodhart's law. And I think Goodhart's law is really interesting here, in part because of that contamination issue we alluded to. But I am curious if there are particular benchmarks that you're looking at when you are deciding which models to prioritize. So you mentioned that there's a few of them you look at, as well as some internal benchmarking. I'd love to understand a bit more about how you're evaluating these models and what external benchmarks you're using, and, if you're able to share at all, what internal ones.

[18:38] Fergal:
Yeah, so I mean, in terms of external benchmarks, like everybody else, we look at the model cards they release and you get some sort of a sense there, but then you also have to worry about contamination or people overfitting to the benchmarks. In terms of internally, we have a suite of back tests that we've curated over time. And it includes things like, every time we have a customer that reports a hallucination, we capture that, and then that goes into our back test. So we have a pretty good suite of real, in-the-wild hallucinations that we can use, and we can basically quickly quantify, hey, in a customer support setting, in a RAG setting, is this model likely to hallucinate or not? We have that and we have a whole bunch of other quality things. We care a lot about, like, you know, how often will the model give an answer to the question? And then we use LLM judges for, did they get the right answer or not? So we started to build benchmarks and back tests internally like that. But then nothing beats testing in production. We test everything in production. This has been our philosophy for years at this point. And what we really care about is resolutions. And in particular, we tend to run an A/B test in production with a combination: we look at the soft resolutions and the hard resolutions. And a soft resolution is when Fin has given an answer, thinks the answer is correct, and the end user hasn't said anything, and Fin has said, do you want to talk to a human if you didn't get what you wanted? And the end user has, like, disappeared and not said anything. We count that as a soft resolution. And that's just going to happen a lot. People will get their answer. It's customer service. They'll go away. They won't bother thanking the bot. But typically about 30 to 40% of the time, whenever you give an answer to an end user, they will do what's called a hard resolution, which is where they will be like, yep, that has actually resolved my question. They'll say something like, thanks. They say, yes, I got what I want. And that tends to be about 30, 40% of the rate of the soft resolutions. And that's our North Star. That's our ultimate ground truth.
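
Those definitions translate fairly directly into metrics code. The toy sketch below, over a hypothetical log of conversation outcomes ("hard", "soft", "escalated"), computes the rates plus the hard-to-soft ratio Fergal treats as a sanity check; the field names and labels are made up for illustration.

```python
# Toy metric computation over conversation outcomes.
# Outcome labels ('hard', 'soft', 'escalated') and field names are illustrative.
from collections import Counter

def resolution_metrics(conversations):
    """conversations: iterable of dicts like {'outcome': 'hard' | 'soft' | 'escalated'}."""
    counts = Counter(c["outcome"] for c in conversations)
    total = sum(counts.values())
    hard = counts.get("hard", 0)
    soft = counts.get("soft", 0)
    return {
        "hard_resolution_rate": hard / total,
        "soft_resolution_rate": soft / total,
        "overall_resolution_rate": (hard + soft) / total,
        # Sanity check from the conversation: this ratio should stay roughly
        # constant across model changes, or you may be building a deflection machine.
        "hard_to_soft_ratio": hard / soft if soft else float("inf"),
    }

print(resolution_metrics([
    {"outcome": "soft"}, {"outcome": "soft"}, {"outcome": "hard"},
    {"outcome": "escalated"},
]))
```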

[20:54] Conor:
And are you using that for RLHF?

[20:56] Fergal:
Not directly because it requires a human in the loop and when we're doing RLHF it's offline rather than in production. We don't have an open loop RL in production.

[21:15] Conor:
It'd be too risky as far as what you're actually getting back out of it.

[21:18] Fergal:
Um, we might get there eventually. So our workflow is about, like, take the model, make the model great at what you want it to do. And that's the context in which we tend to be tuning these models. So we're evaluating them on back tests, and then we're taking a release candidate and we're A/B testing that in production. So our RL is, I guess, offline RL, you know? Uh, yeah.

[21:47] Conor:
I think it's understandable given how broad your dataset is and some of the pieces. I'm just curious.

[21:53] Fergal: [OVERLAP]
I think there probably aren't that many people doing full RL open loop in production. I think

[22:00] Conor: [OVERLAP]
That's my take

[22:00] Fergal: [OVERLAP]
maybe

[22:00] Conor: [OVERLAP]
as well.

[22:02] Fergal:
Cursor or someone said they were doing it, which was pretty interesting. It

[22:06] Conor:
Yeah.

[22:07] Fergal:
makes sense if you've got a very dynamic, fast-moving signal. So this is where you'd get into bandit algorithms and things like that in the past. If you were running a news site and you were using LLMs to predict who's interested in what news articles, or if you're running X, Twitter, if you're running the recommendation algorithm for that, where it's a very dynamically changing signal, I guess you'd want to do that. For us in customer service, it can change fast. It can change month on month, but not day on day. And so an offline model training setup is fine for us. We have these other parts of our stack, so like our retrieval model, where we use in-production signals. Our retrieval model is trained very heavily on direct real user feedback, where the real user is like, that resolved my question, and then the retrieval model learns, yep, this sort of article is good at resolving that question. It's a little bit more intermediate, a little bit more complicated for training our LLMs. But yeah, our gold standard has always been those sort of resolutions. And then we always A/B test any model before putting it in production. And we check to see that the soft resolutions have gone up, but also that the hard resolutions don't go down, basically. We need to avoid building a deflection machine. And so we check that the hard resolution rate, you know, at least doesn't decrease and ideally increases. Basically, the ratio of soft to hard resolutions should be roughly constant for every model change. And we've done this for a long time. And, you know, Fin is at significant scale, so we can get very highly statistically powered A/B tests. This is a bit dated now, but just to give you an idea: when we switched from GPT to Claude Sonnet, that was a multi-million end-user interaction A/B test. So we're at substantial volume now. Fin is resolving more than a million conversations successfully a week, I think possibly a good bit over that. And so we're at enough volume that we can just afford to A/B test in production with real end-user signals. And I would say that that is the gold standard and nothing else comes close, in that every back test and every evaluation that we've ever built, no matter how carefully we've built it, it's possible that the ground truth in production deviates from it. And we care a lot about, like, a tenth of a percentage point of resolution. So we need very highly statistically powered testing setups in order to do this. And we just find the weirdest things happen. One great example of this is latency. For a long time, we always believed that decreasing latency would be a better end user experience. We still believe that. But we now know that increased latency almost always leads to more resolutions. And you might be like, oh, well, I can understand how it would lead to more deflections, but increased latency tends to lead to more hard resolutions,

[25:30] Conor:
Interesting. OK,

[25:32] Fergal: [OVERLAP]
right?

[25:32] Conor: [OVERLAP]
I'd

[25:32] Fergal: [OVERLAP]
Yeah.

[25:32] Conor: [OVERLAP]
love to unpack that a bit because I really like that you have these kind of two north stars of like we want more hard resolutions, but we also want to ensure we're not losing soft resolutions by doing that. And you keep that ratio. I mean, it must mean that you have. I mean, frankly, I think there's a lot of people listening who maybe are a little jealous of like how strong of a data set you get in production here. So I would love to understand a bit more of, you know, that deflection rate and how that's all coming together.

[25:57] Fergal: [OVERLAP]
The point of the latency thing is just, it's very unintuitive. You really need to measure in production. You would never build that latency signal, which confounds everything. You would never build that into your backtest, because suddenly, a longer answer might seem like it's got a higher resolution rate. But the reason it's got a higher resolution rate is because there's more latency before you serve it to the end user. More latency may increase the end user's perception of work that the bot

[26:24] Conor: [OVERLAP]
How

[26:24] Fergal: [OVERLAP]
has done.

[26:24] Conor: [OVERLAP]
fascinating.

[26:25] Fergal: [OVERLAP]
and something you get, you get a lot of confounding. And so over time, you learn to tease out these confounders. But I really think there's two types of AI products. All models are wrong, some are useful. There's two types of AI product in the world at the moment. One is where it's like, we have hand-engineered our prompt with a data scientist for this specific business, this sort of forward-deployed engineering model. And almost everybody's doing that, because you get a lot of accuracy quickly with that. And there's a second type of product, which is what we've invested in and we've built, which is: build this scientific system where you don't have to hand-engineer it per customer. And if you go down that second route, well, you don't have to hand-engineer it per customer, but I think the bit everybody misses is that means you now have this scientifically optimizable system that will get better and better over time. Whereas if you branch your prompts and you hand-prompt per customer, you can't build this system that gets better and better over time. You just end up with technical debt. You end up with a slightly different prompt for every customer. And I think, as we all know at this point, prompts leak: if you make a change in one part of a prompt, it affects the performance of something somewhere else. And so you really want your prompts to be as standardized as possible, you want your systems to be as standardized as possible, with well-isolated, well-encapsulated pieces that can be tested separately from each other, but always checking the overall system accuracy. And that's really what we've built. And I really think that over time, people are going to realize you don't want a product that has been built for you by someone just hand-hacking the prompt. It seems great, right? In week one it's like, oh, I asked for this feature and then suddenly they turned it around. Yeah, they did, but you're no longer on a standard product. I think everyone would realize that if a SaaS vendor said, hey, we made a great product for you, we branched the database just for you, you'd be like, oh, how are they going to maintain that? They're clearly not going to maintain that. And I think people have not made that connection yet. But yeah, we believe that in the long run the future belongs to people who are building products the second way: a standardized, well-designed AI system that then has this level of testing and rigor, and that improves over time through scientific, standardized testing. That is where we are placing our bets anyway. Hopefully it will work out for us.
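
To make that "scientifically optimizable system" concrete, the shipping decision Fergal described a few minutes earlier, soft resolutions up, hard resolutions not down, at high statistical power, can be sketched as a guardrailed two-proportion test. This uses `statsmodels`; the one-sided framing, thresholds, and example counts are assumptions for illustration, not Intercom's actual analysis pipeline.

```python
# Sketch: A/B guardrail check on resolution rates with two-proportion z-tests.
# Thresholds and the one-sided guardrail framing are illustrative assumptions.
from statsmodels.stats.proportion import proportions_ztest

def ab_check(control, treatment, alpha=0.05):
    """control/treatment: dicts with 'n', 'soft', 'hard' counts from the A/B test."""
    # Did soft resolutions go up? (one-sided: treatment > control)
    _, p_soft = proportions_ztest(
        count=[treatment["soft"], control["soft"]],
        nobs=[treatment["n"], control["n"]],
        alternative="larger",
    )
    # Guardrail: did hard resolutions go down? (one-sided: treatment < control)
    _, p_hard_drop = proportions_ztest(
        count=[treatment["hard"], control["hard"]],
        nobs=[treatment["n"], control["n"]],
        alternative="smaller",
    )
    return {
        "soft_improved": p_soft < alpha,
        "hard_regressed": p_hard_drop < alpha,
        "ship": p_soft < alpha and p_hard_drop >= alpha,
    }

# Detecting ~0.1 percentage point effects needs very large n for power,
# which is why multi-million interaction tests matter.
print(ab_check({"n": 2_000_000, "soft": 700_000, "hard": 250_000},
               {"n": 2_000_000, "soft": 706_000, "hard": 251_000}))
```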

[28:52] Conor:
Thanks to Galileo for sponsoring this episode. Their new 165-page comprehensive guide to mastering multi-agent systems is freely available on their website at galileo.ai and provides you the lens you need to understand when multi-agent systems add value versus single-agent approaches, how to design them efficiently, and how to build reliable systems that work in production. Download it for free at the link in the show description to discover how to continuously improve for your AI agents, identify and avoid common coordination pitfalls, master context engineering for agent collaboration, measure performance with multi-agent metrics, and much more. I'd love to understand how this correlates to your team building strategy and kind of the stages with which you've continued to develop this approach. Obviously, you move from this prompt focus to, you know, a post-training setup. What was the MVP for your post-training? Whereas, you know, where are you today? And how is the team growing alongside of that? Because I think it'll be really informative for other AI leaders who are listening and thinking through, okay, how do I need to scale up my team? How should I be thinking through these stages of enabling, you know, a much more scientific approach?

[30:09] Fergal:
The scientific piece is really cultural, and it was in the team from the very start. In a previous role, I worked for a company called Optimizely, an A/B testing SaaS company, and so I've always sort of believed in A/B testing and experimentation. And, you know, I think we really took that into the AI group. And so all our scientists will go and run tests in production very fast and very rapidly, and they're kind of responsible for checking and making sure. And when we interview, our interview loops always contain a lot about, you know, randomized controlled trials, a scientific process. We really want people coming to the team as scientists who are familiar with the core fundamentals of science, because we just use that day in, day out to make a better product for end users. And so we ended up with a pretty technical structure in the AI group; the group skews very tech-heavy. Our scientists tend to be mostly applied scientists, and they'll be good at, like, you know, writing prompts. They'll typically have a background in machine learning, often a PhD or master's in machine learning, three to five years of industrial work. And a lot of people have come in who maybe worked in recommender systems, things like that, in the past, where you see a lot of that same sort of testing and mature development process. And then we also have a large cohort of engineers, and the engineers are typically backend engineers. We kind of draw about half of them from Intercom, half of them are new hires. And they tend to be our most experienced kind of backend and systems engineers, but who are then kind of crossing into machine learning and AI as a discipline. So they've learned prompt engineering and we've taught them prompt engineering. And then also, you know, it's pretty common for them to be able to run an A/B test, interpret the results of that, be able to run a classifier, use scikit-learn to do some NLP stuff or whatever we need to do. Again, as the LLMs get cheaper and cheaper, they become our go-to tool more and more. But yeah.

[32:14] Conor: [OVERLAP]
And it also seems like you're doing a lot of context engineering, since I know you invested in RAG quite early. Talk

[32:19] Fergal: [OVERLAP]
Yeah.

[32:20] Conor: [OVERLAP]
to me a bit more about how that factors into this pipeline you've developed.

[32:24] Fergal: [OVERLAP]
Yeah, I think we must have been one of the first production RAG deployments. Certainly, we launched on GPT-4 launch day with GPT-4. And it was really like, before GPT-4, when you tried to do RAG, it was very difficult to get the previous generation of models to actually respect the RAG instructions reliably. So I remember when we launched Fin, I had made this decision to go in this RAG direction. And we had a whole lot of competitors at the start who were just serving the naked model. And so it'd be like, hey, here's a chatbot, and we're just going to ask GPT-3.5 Turbo for the answer. And sometimes it would get it right and sometimes it wouldn't. And we were like, oh, this is really a bad product. The business doesn't have any control over it. And I think we've since seen people standardize on RAG, but we were certainly one of the first kind of production RAG deployments. And back in those early days, yeah, context engineering was a huge part of what you did. I mean, it still is, but I guess probably in the first year after Fin launched, so for a lot of 2023, we were really realizing how important it was to assemble the right context and pass it to the LLM. We probably spent a lot of time in early 2024 doing that as well. And we're now at the point where, you know, we have quite a complex, sophisticated pipeline for getting the right context into Fin. This starts with our chunking. We take all the content that Fin might possibly use, and that content is chunked by LLMs. Then, in production, we take the query the end user has and, as I mentioned, we canonicalise that query. Then we have a custom retrieval model. It's a fine-tuned version of Snowflake's, but fine-tuned on actual production resolution signals. And we have a custom re-ranker model, a completely custom re-ranker, using ModernBERT as a building block. And those together are responsible for a lot of the accuracy of Fin. And when we built our custom re-ranker, we outperformed the best Cohere re-ranker at the time, which we were previously using. And we're really happy with that. And that was just because we have, I think, tons of high-quality data of, has this actually resolved the query or not. And we had to get our customers' permission in order to use that, and we did comms and stuff where we were like, hey, if you want to opt out of training here, you can. But I think, you know, it totally makes sense to me as a customer: for a retrieval model, a re-ranker model, there's very little downside to participating in that, and it makes the system better. So yeah, all of those things together really feed into this whole system that's responsible for providing, in our case, Claude Sonnet 3.5 or Claude Sonnet 4, with the right context that we need, and then with a well-engineered prompt. There's a whole lot of other stuff that goes into that prompt, some of which is customer-specific, where the customer has used our product to provide further guidance to it. And that's how Fin generates its answers. And that's how we've been able to get such high-quality answers.
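
Pulled together, the pipeline he describes, chunk offline, then canonicalize, retrieve, re-rank, and assemble the prompt at request time, might look roughly like the sketch below. Every function here is a hypothetical placeholder for the component he names, in the order he names them, not Intercom's code.

```python
# End-to-end sketch of the RAG pipeline described in the conversation.
# All components are hypothetical placeholders wired in the stated order.

def build_index(documents, chunk_with_llm, embed, index):
    """Offline: chunk customer content with an LLM and index the embeddings."""
    for doc in documents:
        for chunk in chunk_with_llm(doc):
            index.add(embedding=embed(chunk), payload=chunk)

def answer(user_message, canonicalize, index, embed, rerank, frontier_llm,
           customer_guidance=""):
    """Online: canonicalize -> retrieve -> re-rank -> prompt the frontier model."""
    query = canonicalize(user_message)                 # small fine-tuned LLM
    candidates = index.search(embed(query), top_k=50)  # fine-tuned retriever
    context = rerank(query, candidates, top_k=8)       # custom ModernBERT re-ranker
    prompt = (
        "Answer using ONLY the context below. If the answer is not there, "
        "offer to hand off to a human.\n\n"
        f"Customer guidance: {customer_guidance}\n\n"
        "Context:\n" + "\n---\n".join(context) +
        f"\n\nQuestion: {query}\nAnswer:"
    )
    return frontier_llm(prompt)
```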

[35:35] Conor:
And as you continue to build out your team over the next year, you're going to get even more human power to support building the system out. Can you talk a bit about the goals you have for taking the next step with Fin? And then, conversely, I'd be curious what systemic risks you're seeing, that you're paying attention to and trying to avoid from the start.

[35:58] Fergal: [OVERLAP]
Yeah, the second one is a hard question. It's a fast-moving field. In terms of our goals, there's several big initiatives we have. One is we're building something we call a customer agent, which we've publicly announced before, which is essentially that we think that the future here is not that you'll have different agents from different vendors all in the one conversation facing your end user, because that's just not going to be a good experience. They're going to, like, fight with each other. Will the sales agent from one vendor hand off to the customer service agent from the other one? There's going to be substantial overlap in terms of what they do, in terms of the systems they need to connect to. We also don't see a good future in some orchestrator agent that sits outside them and referees them all. That just feels like a pain. So we really see a world where, to get good experiences, you really want one agent or one set of agents from one vendor: one beautifully integrated system. And so that's really what we're trying to build. We call that customer agent. And so that means that we need to build roles for Fin. Fin needs to be able to act when the conversation is a sales conversation. It needs to be able to act when it's a customer service conversation, as it does today, and probably other things as well, like customer success. It probably needs to be able to act well in the e-commerce setting. And so, you know, that is one area of investment for us. Then there are the other investments we're continuing to follow through on. We have a pretty mature Fin Voice product built using OpenAI's Realtime API. I think we were one of the early customers of that API. Very good API for low latency, high quality. We're now in version 4 of our tasks, or procedures, product, which is this product that takes actions in external systems. It took us a while to crack that. It was always relatively easy to take actions. Fin, from launch back in March 2023, could call APIs. But to really nail the product experience that made it easy for one of our customers to set up actions took a really long time. I feel we've just gotten there, really, in the last six to nine months. We're starting to see real adoption and product-market fit, and people now using Fin where it's like, you say something to Fin and it takes an action that, like, opens a garage door in the real world or something like that. It's really quite cool, and people are using it for very consequential, real tasks. So we'll be following through on all of those investments. We'll be following through on continuing to invest in, you know, better and better AI models. And again, we've gotten good at post-training and it's worked better than we thought. Like, you can take an open-weight model, and if you have enough data and you have the expertise, and it's a bit of work, you can get real benefits on your actual task.
Again, when we beat Cohere's model at the re-ranking task, and Cohere is a great company, we had previously done a big evaluation to find the best re-ranker we could, and we were using, I think, Cohere's Rerank 3, I may be misremembering the name. And then we trained our own re-ranker from scratch using ModernBERT, on a lot of data, and then even out of sample, even cross-customer, it beat that re-ranker at this task. We were like, wow, it beat it substantially, much more so than we expected, and we wrote a blog post on it. There is a lot of value to fine-tuning for a specific task, or training from scratch for a specific task. And so we still believe that the future is probably vertically integrated AI products. And I think we're starting to see that. Like, I've been using Claude Code with Opus 4.5 recently, and it's just a great experience. And yeah,

[39:56] Conor: [OVERLAP]
Oh, completely agreed on the ability of Claude Code with its models. Like, I mean, I've used Opus 4.5 with Cursor. It's a good experience, but having it within the Claude Code setup, it just seems to flow so much better.

[40:10] Fergal: [OVERLAP]
And they are going to have post-trained or mid-trained Opus 4.5 in that harness. And we see this. We basically see there is absolutely a value to vertical integration, where the model has been not just prompted, but post-trained or mid-trained for the specific application, to get the exact right trade-off. You end up trying to have a prompt that does a particular task, and then you realize it's doing the task wrong, so you put a special case in to stop it doing that wrong task, and that works fine for a while. Then you're in production for a while, you've got like 50 special cases in your prompt and you're overloading your prompt. And you just realize, no, I could, like, post-train these 50 special cases in, and then I wouldn't need to put them in the prompt anymore. My prompt is leaner and lighter. And so there is absolutely a value to that vertical integration. Is the value big enough to, like, carry the day and to be the thing that everybody cares about? I don't know, but I do think you're starting to see people at the model layer, obviously, like, you know, Thinking Machines came out with this, like, API to kind of do this. And I think you're going to see lots of people provide more and more interfaces. You're going to see lots of frontier labs providing interfaces to, like, specialize and fine-tune and post-train, would be my guess, yeah.
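
One way to picture the "fold 50 prompt special cases into post-training" idea: each hand-written rule becomes a small family of supervised examples demonstrating the desired behavior, and the rule text then comes out of the prompt. The sketch below is purely illustrative of that conversion; the data shapes and field names are assumptions, not how Intercom or Anthropic actually do it.

```python
# Illustrative only: turning hand-written prompt special cases into SFT examples.

special_cases = [
    {"rule": "If the user asks about refunds older than 90 days, escalate to a human.",
     "example_inputs": ["I bought this 5 months ago, can I get my money back?"],
     "desired_output": "I'm sorry, refunds past 90 days need a teammate's review. "
                       "I'm looping in a human agent now."},
    # ... in a long-lived production prompt, dozens of these tend to accumulate
]

def to_sft_records(cases, system_prompt):
    """Expand each rule into (messages, target) pairs for supervised fine-tuning."""
    records = []
    for case in cases:
        for user_msg in case["example_inputs"]:
            records.append({
                "messages": [
                    {"role": "system", "content": system_prompt},  # rule text removed
                    {"role": "user", "content": user_msg},
                ],
                "target": case["desired_output"],
            })
    return records

dataset = to_sft_records(special_cases, system_prompt="You are a support agent.")
```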

[41:52] Conor:
I think this is also maybe an important point to highlight, because one of the common questions in, I guess, the prior era of software was, you know, why can't Google just do this? Why can't Microsoft just do this? And now we're again asking, why can't Google just do this? Why can't OpenAI just do this? You know, what if their model just gets good enough? And the answer I think we're increasingly seeing is: A, there's the unique dataset angle of, look, we have this unique data advantage that others don't. Obviously that's been something that's been kind of talked about by the data and AI industry for 10-plus years, but we're really seeing that come to fruition now as models get better and better. But this idea of a vertical stack, a harness built for your task, is clearly starting to differentiate. I mean, you have to stay on top of it. You have to continue to develop it. You can't, you know, rest on your laurels. But it seems like there is a real case for defensibility that is arising with Fin and other, you know, production use cases that have gone far enough down the road.

[42:54] Fergal:
So I'll say a couple of things, I'll speak quite candidly here. I'll say a couple of things. I'd firstly say that something like Claude Code is a case for defensibility and verticalization coming from the model layer, right? Which is that, like, given that you have the best model, no one else can build a harness that's as good for your model as you can. So

[43:14] Conor: [OVERLAP]
Yeah.

[43:14] Fergal: [OVERLAP]
it's a way of the model layer getting up the stack. And so that's quite strategically threatening if you're not a model layer company. And I've been keeping an eye on Claude Code for a while as an exemplar of this: a great product. Okay. It doesn't necessarily mean that the vertical integration works the other way around. And so a company like us, we're coming from up the stack, and then we're trying to vertically integrate to get the same defensibility, but coming from the top down. And so we have been doing stuff like, and you know, I would say we have some real, legitimate, durable vertical integration and defensibility in Fin today, in

[43:54] Conor: [OVERLAP]
I would agree.

[43:55] Fergal: [OVERLAP]
that Yeah, in that, like, you know, we do a lot of tokens in Fin. And when everyone was releasing their token numbers recently, we counted it up. We had, like, over 10 trillion tokens of Fin in production. So that's just in production. So we have a lot of tokens. And, like, you know, if we can make 30% of those, 50% of those, use a lighter model, use our own custom model, cheaper, faster, better, I think it's great. It's real margin, it's real lower latency. And then we now have specific examples where you get better performance than we could possibly get by prompting. So I think we're starting to see it. It's certainly the direction we're betting, and I'm betting that we'll be able to create durable value for us and for our customers by vertically integrating in this way. But weighed against that is, like, the bitter lesson, right? Which is that, like, okay, data: how much data do you really need to train the horizontal model? You know, often these things are quadratic. You've got to go, like, 10 times more data to get, you know, two times more performance or whatever it is. And so, you know, people climb that curve pretty fast

[45:07] Conor: [OVERLAP]
Can I

[45:08] Fergal: [OVERLAP]
and

[45:08] Conor: [OVERLAP]
just throw a compute at it instead?

[45:09] Fergal: [OVERLAP]
Yeah, some of that you can control with compute. Like, you know, you can kind of get enough data to hit diminishing returns on more data pretty fast. You know, we even see that when we do post-training with Fin. You need a lot of data, but you don't need, like, you know, billions of data points. You can get pretty far with, like, tens of millions of data points, basically. And so, you know, I do think that the large model providers definitely do have a hand to play, and I think a really strong one. And so, yeah, I do think it's possible there's still a bitter lesson dynamic here. But we have a hand to play too, I think, like, vertically integrating. I think, you know, with the open models, intelligence will saturate for different tasks. Like, for the RAG task of, you know, here is a whole lot of data, give the right answer to that question, you only need a certain level of intelligence in order to do that. You know, like, a human customer support agent can do it. Like, I don't know, a hundred-IQ human or a hundred-and-ten-IQ human is going to be, like, as good as a hundred-and-forty-IQ human, basically, at doing that task, maybe better, who knows. I think it's going to be the same for the LLMs: you get to a certain point where, yeah, the LLM is intelligent enough to do that specific task. And what matters now is the expertise, how it's tuned. Is it great at customer service? Has someone gone and tuned it to be absolutely great at customer service? With prompting you can get so far; we find you can get a little bit further with post-training. And so I think there's some defensibility there. That's certainly something we're trying to do. We're also trying to go broad. We've got customer agent. We think there's going to be a lot of defensibility in terms of building something that is great at not just customer service, but at customer service, at sales, at customer success. I think we can build a unified horizontal agent that we think is legitimately the best in the world at those things. And that feels pretty defensible. I feel like you need a lot of expertise. And then you build a product around that too, to make it easy to use, with reporting. That feels like a durable offering to us, or at least the best we can get. If someone comes out with a singularity or something, who knows what happens.

[47:34] Conor: [OVERLAP]
I

[47:34] Fergal: [OVERLAP]
But

[47:34] Conor: [OVERLAP]
mean, yeah, it's

[47:35] Fergal: [OVERLAP]
yeah,

[47:35] Conor: [OVERLAP]
hard to plan for that.

[47:36] Fergal:
it's hard to plan for that. But short of that, that's our strategic direction. And we have a lot of validation on the bet so far. So we'll see. We'll see.

[47:46] Conor:
It's a really exciting story and I appreciate you taking me through it, kind of from inception to where you are today. It's been a super interesting conversation and I honestly wish we had another half hour, an hour, because I feel like we had three other topics that I'd kind of bullet-pointed out as, like, oh, we should dive into this too. And, you know, Fergal, you've been so fascinating on the post-training front and data pipelines that I know we're not going to get to everything. But I did want to ask a couple of final questions here, because, you know, you brought up Claude Code. I think you and I are both big fans of it. And my understanding is you were the first user of Claude Code at Intercom, and now it's integrated throughout your development pipeline, even, I believe, running autonomously on some bug tickets. Can you tell me a bit about that journey and what made you push for adoption company-wide?

[48:30] Fergal: [OVERLAP]
Yeah, I don't remember exactly why I started to use Claude Code, but I saw that Anthropic had released this new thing in beta. I checked it out, npm install, and it really kind of took me out for a week. Every evening after the kids were in bed, I was, like, coding. I don't normally do a lot of IC work, I do a little bit but not much, but I was just coding with Claude Code and I was like, holy cow, this thing is amazing, this is

[49:00] Conor: [OVERLAP]
It's

[49:00] Fergal: [OVERLAP]
like a step

[49:00] Conor: [OVERLAP]
kind of

[49:00] Fergal: [OVERLAP]
change

[49:00] Conor: [OVERLAP]
joyful.

[49:01] Fergal: [OVERLAP]
Yeah, it's kind of joyful. It felt magical. And I really spent a lot of time that week, you know, awake at night after coding late at night, thinking about the future of the software industry and how horizontally disruptive it is going to be over time. And, you know, I reached out to the Anthropic team and, like, you know, I was slacking them. I was like, this is amazing, you built an amazing product here. And I think they featured me, I gave them a quote for the page. I

[49:30] Conor: [OVERLAP]
That's

[49:30] Fergal: [OVERLAP]
think

[49:31] Conor: [OVERLAP]
cool.

[49:31] Fergal: [OVERLAP]
They did something really cool with Claude Code, and I think it's an amazing product. And then when I went internally in Intercom and I did an AI group all-hands, and I was telling people about it, some people laughed at me. They were kind of like, oh, come on, we can do this already in Cursor. And some people thought I was serious, some people laughed at me, some people were like, oh, you're getting excited about it, but it's just the same as Cursor. And it took a lot of explaining, because I know that Cursor and the other IDEs have done a lot of work to become more agentic since then. But at that point in time, it's like, no, it's different, that when you put it in, like, the full-auto, dangerously-skip-permissions mode, which is like the only way to use it, on a secure machine, hopefully. But when you do that, you know, it just goes in a way that other things didn't. And so, you know, I kind of had to go and take several people in the company and, like, sit down with them and install Claude Code for them, like get them to sit down: here's a laptop, look at this, check it out. I did that literally for one or two of the execs. And then that started the ball rolling. Once I kind of aligned those people and showed them the value of it, Intercom started an initiative to try and build a team responsible for taking this on. This is going to change our space. Intercom is really good, our CTO and other people are really good at, hey, let's be ambitious, let's try it, let's see how much we can get out of this. And so we ended up with a good team of people whose job it is to make Intercom engineering faster overall. And then they started taking on a role of building agents that live as part of the Intercom system, the Intercom software engineering system, doing things like automatically attempting to triage and identify the source of an issue when a customer submits an issue. It doesn't always work, but when it's a hit, it maybe saves time on a critical issue where many minutes count in terms of having a great customer experience.

[51:36] Conor: [OVERLAP]
Yes.

[51:36] Fergal: [OVERLAP]
So yeah, it's been cool. It's been cool to see. And honestly, it feels like we're still just at the start of these things are going to show up all over software.

[51:44] Conor:
Yeah, I think there's a lot of exciting stuff that's going to happen in the coming years around coding in particular, because we have just such an amazing dataset for it with everything that's happened with GitHub. And then, obviously, there's so much focus on it. There are even companies like poolside, which we've previously had on the podcast, and I highly recommend folks listening check out that episode with Eiso and Jason Warner, the two co-founders, talking about how they think code is going to drive AGI. There's a ton of great stuff happening on coding. And I agree, I think Claude Code is really, like, a magical experience at times. And so it's a wonderful note to end on, because I think it hopefully highlights the opportunity and the excitement and what's possible here. And Fergal, thank you so much for spending the time with me today. It's been such a pleasure chatting with you, and I really appreciate it.

[52:29] Fergal:
Real pleasure, Conor. Thanks for having me.

[52:31] Conor:
And for folks who want to find out more about Fin, I believe fin.ai is the best place to go. There's a ton more there. Definitely recommend checking it out. Super interesting stuff that Intercom's doing here. And if you enjoyed this episode, we would love to hear from you, whether it's in the comments on Spotify. Let us know: should we have asked Fergal questions we didn't? Should we have dived into topics we didn't cover? Or did you really love something we talked about? We'd love to hear from you. We'd love to know what your experience is like here too, because that customer feedback is also valuable to us. I'm not yet training a model off it, maybe I should be, but I certainly am training my own mental model. So whether it's on YouTube, on Spotify, on LinkedIn, or DMing me on Twitter, or just @-ing me publicly even, I'll take it. I would love to hear from you, listeners, and thank you so much for joining us today. And Fergal, one more time, I really appreciate it. Thanks for coming on.

[53:19] Fergal:
that's gone.