You've Been a Bad Agent

Matt's Big Move to Lisbon
Cloudflare Code Mode: 2,500 Tools in 1,000 Tokens
Code Mode vs Bash vs SQLite: What Actually Works for Agents
2025 Reflections: The Year of Claude Code and MCP
2026 Preview: Files Are Back
Quick Fire: Best Model, Best Lab, Biggest Surprises

Creators and Guests

Host
Matt Carey
Agents and MCP at Cloudflare
Host
Wilhelm Klopp
building @kolo_ai

What is You've Been a Bad Agent?

Wil and Matt discuss tech, startups, and building really cool things with AI. Sometimes joined by (actual expert) friends.

Matthew Carey (00:03.032)
Hey boss man. Sorry, my audio is probably not wonderful today because I'm in a booth in the office with no microphone, no headphones, no nothing.

Wilhelm (00:03.323)
What up? Look at you.

Wilhelm (00:17.405)
in a booth like in a sort of American diner booth, eating a mean stack of buttermilk pancakes.

Matthew Carey (00:21.805)
I wish.

Matthew Carey (00:26.296)
Dude, that sounds banging. Mate, I have eaten no fiber today. I've just had coffee and pain au chocolat.

Wilhelm (00:35.089)
That's how it should be. We went for a bike ride with a bunch of Frenchies. Is that an okay term to use? It was amazing. They were all eating... we were like getting on the BART and they were all smashing pains au chocolat. Like it was like 7 a.m. and it was like five guys just speaking in French, smashing pains au chocolat, like, so good. And then they shared them at the end and they were delicious. I need to ask them where it was actually, where they got them from. All right, let's roll the intro.

Matthew Carey (00:42.754)
Yeah.

Matthew Carey (01:00.447)
Nice.

Wilhelm (01:13.474)
It goes so hard, goes so hard. How's it going man, how are you?

Matthew Carey (01:19.534)
Yeah, I'm good, dude. I'm good. Uh, kinda? My movers came this morning. Oh, my morning was wild. I was at the office at five to seven this morning to pick up a bunch of like lava lamps and hats and all that sort of good stuff. And then, I don't know if you've seen on Twitter, but we...

Wilhelm (01:23.503)
Are you deep in moving stress? How's that going?

Wilhelm (01:42.749)
As you do. This is just a day in the life of a Cloudflare employee. You just gotta pick up the lava lamps in the morning. Like, that's just what you do.

Matthew Carey (01:47.574)
Yeah, and we... I had to... you know, otherwise the internet stops, so someone's gotta do it. Yeah, so then we went to Soho, we booked out this like coffee shop.

Wilhelm (01:54.119)
Exactly.

Matthew Carey (02:03.342)
and yeah, we basically just fed London's techies coffee and pain au chocolat all morning. I stayed for a couple hours and then I had to leave and go and meet my, greet my movers, who picked up all my furniture, and then I basically just left again immediately and went back to the coffee shop. Yeah, lots of cycling across London, and it meant that I haven't actually eaten anything, which feels kind of weird, I'm not gonna lie.

Wilhelm (02:10.8)
and hats.

Wilhelm (02:17.245)
Mm.

Wilhelm (02:28.145)
Damn, that sounds like a packed day. Exciting day, big day, stressful day.

Matthew Carey (02:32.814)
Not super stressful, but like big day, big day.

Wilhelm (02:39.931)
So when are you moving?

Matthew Carey (02:46.804)
Sunday, but like all of my stuff just went today. Yeah, yeah, all of my stuff just went today, like, you know, in a... in a...

Wilhelm (02:49.263)
in two days.

Wilhelm (02:58.471)
That's mad. Yeah, okay, I mean, I did this like nine months or 10 months ago. It's a lot. What's your sleeping setup for the next two nights?

Matthew Carey (03:07.566)
We have some like spare bedding and stuff and like the bed was still in the flat like because it was owned by the landlord So we can have a bed. We just don't have like our mattress, which just kind of sucks So makeshift scenario, but it'll be fine

Wilhelm (03:17.643)
got it.

Wilhelm (03:24.219)
Right. So where do you put the bedding? You put the bedding on the slats?

Matthew Carey (03:30.222)
Oh, it's gonna be fun, isn't it? No, no, no, I think we have some better solution than that. Yeah, it's gonna be a little bit janky for the next couple of days, next night. We're actually gonna have a little party tonight because my flat is completely empty. But all we have is all the alcohol we had in our flat and drinks. We actually have loads of non-alcoholic stuff as well. We just have loads of drinks that have accumulated over the last two, almost three years. So we're just like...

Wilhelm (03:52.731)
Yeah, totally.

Matthew Carey (03:59.586)
gonna get everyone over and drink it all.

Wilhelm (04:02.182)
From what I've seen, I've never had the pleasure of being at your place, but what I've seen from the backgrounds is it looks like the perfect spot for an East London rave and sounds like you're fully equipped.

Matthew Carey (04:11.8)
Yes, we are less furniturely challenged now. Should be good.

Wilhelm (04:16.796)
Oh man, I wish I could be there. That sounds great.

Matthew Carey (04:21.826)
Yeah, it should be good. It should be good.

Wilhelm (04:25.18)
That's awesome. There is so much to talk about. The world is not slowing down. It's crazy. So I would love... I think the stuff I sent you like maybe a week ago around 2025 reflections and 2026 predictions would make for a really, really good episode. I'm not sure we can cover it in like an hour.

Matthew Carey (04:29.014)
Yeah, where should we start? Where should we start?

Matthew Carey (04:45.646)
Mmm.

Matthew Carey (04:49.536)
Yeah, let's not do that now, because yeah, we're probably gonna have to cut this one a little bit even shorter than an hour.

Wilhelm (04:55.59)
Okay, how much time have you got?

Matthew Carey (04:57.71)
I dunno, probably should be leaving in like, four minutes.

Wilhelm (05:02.116)
Okay, cool, cool. All right, I mean, there's other stuff we can talk about. Maybe we can also split it into two. Like, I mean, I think it's gonna be a crazy time for us constantly, so I'm not sure we'll ever have that much amazing time to prepare perfect thoughts. So why don't we talk about just normal stuff that's been happening, and then also maybe do a little bit of looking back at 2025, because maybe this is like the last time we'll chat in January, so it's still appropriate to do

Matthew Carey (05:14.818)
Mm.

Wilhelm (05:31.773)
2025 reflections. But we can just do it off the cuff. And then we can do predictions, which, I mean, obviously this podcast is entirely about the future, so I think we will have no trouble at all doing predictions or whatever. We can do that next time. All right. But why don't we start? So one big thing you shipped recently that everyone is very excited about, but I actually don't know that much about, is you shipped like a code mode for Cloudflare.

Matthew Carey (05:34.188)
Yeah, that's cool. Let's do that.

Wilhelm (06:00.941)
I think, right?

Matthew Carey (06:02.06)
Yeah.

Wilhelm (06:04.122)
What? Tell me.

Matthew Carey (06:06.19)
Um, yeah. So now I'm talking about it, I need to write a blog post and stuff. I need to do all of the bits involved around shipping, like documentation, and getting it all sorted, and all that sort of stuff. But the essence of it is: I just stuffed the whole Cloudflare API into one MCP server. And it uses about a thousand tokens of context. Everything else is discovered on demand, and it works really well.

Wilhelm (06:14.428)
Mm.

Wilhelm (06:36.939)
That's nice. Whereas an MCP server without code mode that covers the whole Cloudflare API would probably just blow the context window. It would be like 100k tokens.

Matthew Carey (06:47.924)
so it's...

Oh, like way more. So the Cloudflare OpenAPI spec is 2.3 million tokens.

Wilhelm (06:58.364)
Amazing, yeah.

Matthew Carey (07:01.294)
and there's 1,500 and something tools... sorry, 2,500 and something tools on the Cloudflare API. It's like almost kind of dumb. I mean, it's very dumb. The OpenAPI spec is a JSON file, right? So I let the model...

Wilhelm (07:09.434)
That's mad. Okay, so how does it work?

Wilhelm (07:22.108)
A very verbose JSON file as well, I might add. OpenAPI is not concerned with concision.

Matthew Carey (07:26.7)
Yeah, so no. So you need to do something like progressive disclosure. CLIs are really good at this: if you want the model to use a CLI, it can just do like dash dash help on any CLI command, and it'll get a help page that teaches it how to use it. MCP tools traditionally didn't really have that, because people just dumped them in the context window and blew up the context window.

Wilhelm (07:36.412)
Mm-hmm.

Matthew Carey (07:54.712)
But I think progressive disclosure should be a thing. And you can do it on the client, by doing something like tool search over all connected servers. But also you can do it on the server. Yeah.

Wilhelm (07:54.714)
Mm-hmm.

Wilhelm (08:07.632)
Which I believe Claude Code does now by default when you hit a threshold in terms of your MCP tools, I think. Like maybe 20k: if the MCP tools you have configured are over 20k, it will enable this tool search thing, which has some trade-offs, right? Like, it is a bit slower because it needs to search first. And that's like a pass through the model, right? An inference pass before it actually calls any tools.

Matthew Carey (08:15.489)
Yeah.

Wilhelm (08:35.702)
but it is mainstream enough, good enough, to ship into Claude Code by default.

Matthew Carey (08:43.17)
Yeah, it has some like massive trade-offs.

Yeah, it has some huge trade-offs. I'm not sure how much I like it, because you're at the mercy of the search tool. So initially it's going to feel kind of ropey until they train on the search tool, I think. So the way I did it was I moved all of that logic to the server. And so you export two MCP tools: a search tool and an execute tool. And this is not a new idea. This is...

like, I think ACI.dev did something like this in the summer, which was really, really cool. They had like 600 tools for different SaaS platforms all in one MCP server. But you don't really need that if you have an OpenAPI spec. So if you have an OpenAPI spec,

Wilhelm (09:28.078)
Mm-hmm.

Matthew Carey (09:37.334)
It has a logical consistency to it anyway. The model understands what an API spec looks like. You can give it the type of the API spec. It's got some consistency, which is really helpful. And it follows the structure. So why don't you use the structure by telling the model the structure and then just letting it write code to extract the bits it needs? So that's what I did. Rather than make my own search tool, I just let the model write code, run the code in the sandbox.

Matthew Carey (10:07.138)
The sandbox in this case is like a dynamic Worker isolate. So it's like a really fast, lightweight V8 isolate. It spins up in like one millisecond, so there's no overhead to this, which is really sick. Yeah, it's really, really sick. And then, and then the...

Wilhelm (10:10.812)
Mmm.

Wilhelm (10:14.684)
Mmm.

Wilhelm (10:18.694)
That's so cool, Yep.

Matthew Carey (10:28.748)
And then for the execute, you create like a little client that's sort of like a tool. I don't know. In this case, you just get the model to call fetch on whatever it wants to, because it has the OpenAPI spec. It can just call fetch, and you create like a little custom fetch client in a closure that handles all of the authentication, basically.
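
The closure pattern Matt describes can be sketched roughly like this. This is an illustrative reconstruction, not Cloudflare's actual code: the function names, the injectable fetch implementation, and the Function-constructor "sandbox" are all assumptions for the sake of the example (a real deployment would run the code in a V8 isolate).

```javascript
// Illustrative sketch only, not Cloudflare's implementation.
// The closure captures the base URL and token, so model-written code
// gets a plain `fetch` and never sees the credential.
function makeAuthedFetch(baseUrl, token, fetchImpl = fetch) {
  return (path, init = {}) =>
    fetchImpl(new URL(path, baseUrl).toString(), {
      ...init,
      headers: { ...(init.headers ?? {}), Authorization: `Bearer ${token}` },
    });
}

// A stand-in "execute" tool: run model-written code with only `fetch`
// in scope. A Function constructor stands in for the real isolate here;
// it is NOT a safe sandbox for untrusted code.
async function executeTool(code, authedFetch) {
  const fn = new Function("fetch", `return (async () => { ${code} })();`);
  return fn(authedFetch);
}
```

So when the model emits something like `const r = await fetch('/zones'); return r.status;`, the auth header is added transparently and the token never enters the model's context.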

Wilhelm (10:52.156)
Okay, that's great. That's awesome. Damn. Yeah, I should try it. How can we all try it? Where do we go?

Matthew Carey (10:58.83)
It's like public on my GitHub. It's currently not the official Cloudflare API MCP, but hopefully sometime soon it will become official. There's some work that needs doing before then, basically.

Wilhelm (11:02.097)
Okay.

Wilhelm (11:11.75)
Sick, and it sounds like you're in the middle of that work. Actually, I should have used this yesterday. I was onboarding a new domain onto Cloudflare. Is that something you can do with the API, or is that something you have to do with the UI?

Matthew Carey (11:16.396)
Yeah.

Matthew Carey (11:28.718)
Yeah, so most of the UI actions just call the API underneath. I don't think there are any UI actions which don't call the API. I'm sure I'll be corrected on this, but pretty much everything uses the Cloudflare API under the hood. For instance, Wrangler only covers a small amount, well, quite a large amount actually of the Cloudflare API.

Wilhelm (11:42.044)
Mm.

Wilhelm (11:52.966)
Mm-hmm.

Matthew Carey (11:53.964)
but there are still gaps in Wrangler, right? The cool thing about going to the API is you have the whole capacity of Cloudflare, right? So you can create DNS records, you can add zones.

Wilhelm (11:57.009)
Mm-hmm.

Wilhelm (12:07.805)
Right, of course, Wrangler wraps the API, but with this you go straight to the API. There's no Wrangler. Got it, got it. Yeah, makes sense.

Matthew Carey (12:14.22)
Yeah. Yeah. Yeah. And so, like, Wrangler does a bunch of really nice stuff.

But yeah, you just bypass all of that, for better or for worse, really. And the same with the UI, right? You bypass all of the forms and the pages and the navigation. You just let the model handle all of it. So you can do everything, really. It's quite wild. Like, you can even deploy Workers that you haven't written, because the model can write them on demand, which is kind of crazy.

Wilhelm (12:23.952)
Mm-hmm.

Wilhelm (12:47.469)
Yeah, okay, damn, should play around with this. That's really cool.

Matthew Carey (12:50.67)
So you can deploy whole websites that you just write on demand, that you never store on your own computer, that you just, it just, yeah, it just runs.

Wilhelm (12:54.894)
Yeah.

Wilhelm (12:59.779)
Yeah, yeah, it just YOLOs it. Damn. I mean, yeah. Okay. I have so many thoughts on this. How does the search work? Is it like a full-text search over the OpenAPI JSON, or is there like separate modes for discovering like the endpoints versus the...

Matthew Carey (13:04.853)
It's crazy.

Matthew Carey (13:15.822)
It's so dumb. You're overthinking it. It's so dumb. You let the model write code, but you give it access to the JSON file. You say: you have access to const spec. Then you let it write code.
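
To make that concrete, here's a toy version: a miniature spec object standing in for Cloudflare's 2.3-million-token one, and the sort of throwaway search code a model might write against it. Everything here is invented for illustration.

```javascript
// A miniature OpenAPI-style spec; the real one is ~2.3M tokens.
const spec = {
  paths: {
    "/zones": {
      get: { summary: "List zones" },
      post: { summary: "Create a zone" },
    },
    "/zones/{zone_id}/dns_records": {
      get: { summary: "List DNS records" },
      post: { summary: "Create a DNS record" },
    },
    "/accounts": { get: { summary: "List accounts" } },
  },
};

// The kind of ad-hoc code a model writes when told "you have access to
// const spec": walk the paths and keep operations matching a keyword.
function findEndpoints(spec, keyword) {
  const needle = keyword.toLowerCase();
  const hits = [];
  for (const [path, ops] of Object.entries(spec.paths)) {
    for (const [method, op] of Object.entries(ops)) {
      if (op.summary.toLowerCase().includes(needle)) {
        hits.push(`${method.toUpperCase()} ${path}: ${op.summary}`);
      }
    }
  }
  return hits;
}
```

No embeddings, no search index: the model just writes a loop, and only the handful of matching lines come back into context.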

Wilhelm (13:27.195)
right, nice, nice. Right, that's where the actual code mode comes in. Yeah, nice. That's awesome. It's interesting. Actually, there was a phase where we had a version of code mode in a very first edition of like the Z agent. But at that point, it was like: do we do this code mode, which didn't have a name at the time, or do we copy what Claude Code does? Like have...

Matthew Carey (13:46.093)
Mmm.

Wilhelm (13:56.688)
grep, have ls, have bash, or just have one single tool in our agent that is just create code and execute the code. And...

Matthew Carey (13:58.519)
Yeah.

Matthew Carey (14:09.326)
Yeah, no, I think that's some fun stuff, so just to dig into it a tiny bit: there was an eval, and actually Vercel just released a thing today... Like a few months ago, or a month or so ago, Braintrust initiated a bunch of evals, where Vercel said that, well...

like, a blog post by Vercel said that bash was all you need, basically. Just bash, and you can call a file system with bash, and you can just use bash plus a file system, and that's all you need for agents. And basically this eval that Braintrust did, which was based on the whole GitHub archive, said that bash was okay.

File system tools were better than bash, which makes sense, and it's why Claude Code uses file system tools. Actually, they're much better at reading files in a nice format than bash is, which is kind of interesting. But what was even better was SQLite. So storing all your data in SQLite, in a structured query language. It's kind of interesting. And so...

Wilhelm (14:54.459)
Mm.

Wilhelm (15:11.611)
What?

Wilhelm (15:16.283)
Okay, can you send me the link to that Braintrust thing, or maybe I'll just find it? That sounds interesting. I haven't seen that Braintrust eval thing, but I'm skeptical.

Matthew Carey (15:22.424)
Yeah, Vercel actually just released a blog post saying, like, kind of... I haven't fully read it yet, but almost going back on what they said recently. I'll find it for you. I don't know if it was today or yesterday or... Vercel engineering.

Wilhelm (15:46.233)
I saw also that the CTO has a... is it called "testing if bash is all you need"? Yeah, that looks like it. Okay, cool, yeah, I'll check this out. That looks really good.

Matthew Carey (15:52.044)
Yes.

Matthew Carey (15:59.79)
Where did you find that? Ah, "testing if bash is actually all you need". Yeah, when did this come out? This came out the 22nd, yesterday. Amazing. Amazing, amazing, amazing. Yeah, so what we actually found... so we've been playing with code mode instead of...

Wilhelm (16:01.402)
I'll send you the link. Yeah. Yesterday.

Matthew Carey (16:24.686)
So what you can do is, instead of just having SQL tools, or having a bash tool, or file system tools, you can have a code tool which can do any of the above. Like, code can write bash commands. It could do that, right? You can have all of these things. And what we found was it was the most token efficient by a very, very, very long way.

Wilhelm (16:40.228)
Totally.

Wilhelm (16:43.864)
Mm-hmm

Wilhelm (16:49.518)
Totally, I can see that, yeah.

Matthew Carey (16:50.25)
even more so than the native SQL. And so that's kind of interesting. I think the bash... we haven't published this yet, so it's not great to talk about.

We need to clean it up and make it more interesting. I'm very bullish about code mode as a way of reducing the strain on a context window. And I think if you want to get to some very autonomous, very smart agent that can do loads of things, one is you would need a very strong model to get the smart bit. And potentially you need to go in a loop to get the autonomous bit.

Wilhelm (17:07.386)
You

Wilhelm (17:21.562)
Mm.

Wilhelm (17:32.888)
Right. Yep.

Matthew Carey (17:37.328)
And to do loads of things, you need to have access to loads of external systems, and for that I think you need progressive disclosure,

Wilhelm (17:38.712)
Yeah, yeah, right.

Matthew Carey (17:48.396)
which you get with all of these things, because they're all just like generalizable query languages, really. They're all generalizable languages. And I think you should lean into doing something that the model is trained on, and the model is trained on a lot of JavaScript. So it's really good at writing hacky JavaScript, and it manages to do that with a very small amount of tokens, because code, like...

code is, by its very nature, a very concise way of representing discrete knowledge, like a discrete control flow. Like, that's what code is designed to do, right? It's like a concise way of...

Wilhelm (18:22.682)
For sure, yeah, yeah. That's a great point. You need a smart model. And also, I imagine it depends. Like, I like what you're saying around when you want to do a lot of things. So like when the agent kind of goes into execution mode, right? Then it's especially powerful. I'd imagine when you're more in the sort of explore phase, you're less concerned with

finding additional stuff in the context window, because it is useful to explore, like when you're exploring, right, to see files, to see stuff that takes up more space. But then when you're actually executing, and you're churning through code, and you want to do that as fast as possible, then code makes a lot of sense.

Matthew Carey (18:52.972)
Yeah.

Matthew Carey (19:00.867)
Yeah.

Matthew Carey (19:11.822)
Yeah, definitely. So yeah, I'm very bullish about code mode, just in general. I think getting the model to write something that it's trained on just kind of makes sense to me, like...

Wilhelm (19:25.22)
Totally. Yeah, yeah.

Matthew Carey (19:27.028)
getting the model to write JavaScript, and also, very importantly, running that code in like a sandbox where you don't have access to env vars, you don't even have direct access to a file system. All of this stuff is really cool. And where it comes in...

Wilhelm (19:36.026)
Hmm.

Matthew Carey (19:47.062)
Like, it's not going to be the most simple solution for client-side stuff, like stuff running in a sandbox already. There, it's probably better just to use bash or SQLite with a file system.

Wilhelm (20:00.42)
Wait, can you explain the SQLite thing to me? I don't really get where SQLite comes in.

Matthew Carey (20:03.638)
Yeah, so what they did, it feels a little bit contrived, but what they did is they loaded all of the files into... because they're GitHub, they're Git blobs, so they just loaded all of those like JSON files into... actually, it's not blobs at all, it was just JSON files. They loaded all of those files into SQLite with like a very basic schema, and...

Wilhelm (20:11.918)
I see.

Wilhelm (20:23.353)
Mm-hmm.

Matthew Carey (20:27.5)
Yeah, and what that showed was just that the model is very good at arranging complex SQLite queries to extract the data that it needs, right, with very low token cost. And so... yeah, and so code mode is similar to that. Obviously, it's not gonna be as good for the dataset that these guys had, because this was a huge dataset, but for a dataset which is simply,

Wilhelm (20:37.56)
Right, yeah. Yep.

Matthew Carey (20:57.544)
here is a million tools, or here is like, yeah, 500,000... like an API spec with 500,000 tools of all the most popular APIs in the world. I think code mode could be pretty good there.

Wilhelm (21:12.942)
Yep, yep, yep, nice. Exciting stuff. And yeah, it's just wild how quickly everything is moving. Yeah.

Matthew Carey (21:21.614)
It feels very meta to be like: oh, we're not having the agent call tools, we're having the agent call some other intermediary thing. Like, we're having the agent call a tool that writes some other intermediary thing, that then we take and run in some sort of sandboxed compute environment, and that gets us our answer. I think it's quite funny, but it...

Wilhelm (21:29.326)
Yeah.

Matthew Carey (21:48.18)
follows on the trend where agents seem to do things very similarly to humans. In this way, like, SQL was invented so you could do queries on data, right? That's literally its job. And so it makes sense that it would be good for agents to use to do queries on data.

Wilhelm (22:14.297)
Do you know... oh, a fun fact: the airport that I now fly out of, I'm going there later today actually, its airport code is SQL. It's called San Carlos, and it's SQL. And right at one end of the runway is the Oracle headquarters. And we have all these procedures around flying a certain way around the Oracle headquarters, because on one of the approaches you actually get very close to it. It's like a big office park.

Matthew Carey (22:24.173)
That's really funny.

Matthew Carey (22:32.386)
haha

Wilhelm (22:44.181)
But then also sometimes you are supposed to fly over it, because flying over offices is way less of a consideration for airport noise. Lots of local residents around the airport always complain about noise, so you want to avoid houses and fly over offices. So SQL is taking on a new dimension in my life that I didn't expect.

Matthew Carey (23:06.51)
That's very funny.

Wilhelm (23:08.057)
Alright man, I'm gonna just ask you some questions for this 2025 review stuff. Let's see how much we get through. So, I want you to cast your mind back a year. It's the beginning of 2025, we just finished 2024. DeepSeek R1 had just come out. I think o3 was all the rage, or o3 was so powerful OpenAI wasn't even gonna release it, and then they did decide to release it.

Wilhelm (23:37.37)
And then the latest model was Claude 3.5 new, so Claude 3.6. I think Claude 3.7 came out early... like early February maybe, sometime in February, maybe early March, something like that. Claude Code didn't exist yet. Do you remember what your feelings about everything were at the time? Like, where did you think things were heading?

Matthew Carey (23:42.728)
Mmm.

Matthew Carey (23:53.187)
Yeah.

Wilhelm (24:04.985)
I'm not sure either of us wrote down predictions for 2025, but like, what do you think your predictions were at the beginning of 2025?

Matthew Carey (24:07.586)
So.

Matthew Carey (24:12.942)
We were all on this rage because DeepSeek had just come out, it was R1. The reasoning models were going to be insane, open-source reasoning models were going to wipe factors of value off OpenAI and Anthropic. I think there was talk there about like...

Wilhelm (24:18.713)
Mm-hmm.

Wilhelm (24:27.777)
Mm-hmm. Mm-hmm, mm-hmm, mm-hmm.

Matthew Carey (24:34.926)
there were still... people were still very bullish on this general-purpose model that gets stronger and stronger. I think before that, and I think this comes in peaks and troughs, but before that there was a bit of almost FUD around that. And people were thinking that especially the smaller labs were going to have to specialize and be more...

Wilhelm (24:43.513)
Mmm.

Wilhelm (25:02.553)
Mmm.

Matthew Carey (25:02.806)
Yeah, just like more constrained in what they did, and they were going to make models that were really good at a certain domain. And then when DeepSeek R1 came out, we were all like: oh my god, no, actually open-source labs can also generalize... sorry, yeah, even smaller labs can also generalize and make big models that are multi-purpose and multi-use and all that sort of stuff. So I think that was part of a very big model craze, right? Like, o3 came out and then...

Wilhelm (25:26.999)
Yep. Yep.

Wilhelm (25:31.127)
Yep. Yep.

Matthew Carey (25:32.8)
I'd be- I'm so interested to know how big Opus 4.5 is, but it smells huge. Like, it's- it's so much better than what came before. Yeah.

Wilhelm (25:37.646)
Mmm.

Wilhelm (25:43.63)
Yeah, it is. But at the same time, it's cheaper than Opus 4, right? So they cut the price quite a bit. So something is interesting there. But yeah, it's almost like 2024 was the dawn of the reasoning models, I think, at least in my head canon. The whole reason... You know the meme, like, what did Ilya see? Why did he leave?

Matthew Carey (25:51.19)
It probably is smaller. Yeah, it probably is small.

Matthew Carey (26:00.398)
Hmm.

Matthew Carey (26:09.173)
Yeah.

Wilhelm (26:10.105)
late 2023, I think, was like the dawn of the reasoning models, like the idea that we could have this huge leap of progress by introducing reasoning. Which in some ways... I mean, who am I to say it was obvious, but I remember in 2023 thinking: wait, instead of having the model just output stuff directly, immediately, instantly, can we just have the model do some more thinking, you know, and then come back to me after it's thought, rather than just stream of consciousness,

Matthew Carey (26:17.89)
Yeah.

Wilhelm (26:38.937)
yeeting back tokens at me. But then clearly OpenAI figured out how to make that work, and in 2024 we got o1. And I don't think we got an Anthropic model in 2024 that actually had reasoning. I think the first Anthropic reasoning model was Claude Sonnet 3.7, which came, yeah, like early... again, spring 2025. So 3.5, or 3.5 new...

Matthew Carey (27:02.318)
Yeah, was it not 3.5? Was it not 3.5?

Wilhelm (27:08.089)
Like, I think I remember Sandeep saying, actually at the meetup where I first met you, at AI demo days... he said 3.0... sorry, this Sonnet 3.5, or 3.6 as it's colloquially known, I guess. He said it's the first one that seems like it does reasoning. But it didn't have explicit reasoning. Like, it seemed like it was doing some reasoning, but it was...

Matthew Carey (27:16.076)
Mmm.

Matthew Carey (27:30.637)
Mm.

Wilhelm (27:36.529)
But there wasn't like a thinking flag. It was giving you stuff back immediately. And then, but then we did have o1 and o1 Pro, which was this super weird, super slow model that had a strange place in the dimension of intelligence and speed. And then we had R1 and so forth.

Matthew Carey (27:52.558)
Hmm.

Wilhelm (28:04.524)
But yeah, it seemed like reasoning had just kind of taken off. And now obviously every model has reasoning baked into it, right?

Matthew Carey (28:09.196)
And I think we're starting to see the birth of almost like new labs... new, what do you call them? Labs? I don't know. New data studios, almost. Like people trying to build huge datasets for reinforcement learning, like the creation of reasoning traces and all of that sort of cool stuff. And there was quite a lot of optimism around models, and there was a huge amount of pessimism around...

infrastructure and things like that. And I think that has actually... like, infrastructure companies... so I remember Nvidia took a big dive after R1, and then this year there was like...

Wilhelm (28:39.658)
yeah?

Wilhelm (28:46.552)
Mmm.

Matthew Carey (28:49.836)
Obviously you've seen, since the summer, Google's stock price has gone massively up. It does seem like people are trying to diversify from just Nvidia GPUs, and that there's more of a thought about that process. I don't know. It's still the biggest company in the world though, I think, by market cap.

Wilhelm (29:09.56)
Yeah, yeah, yeah, I think so. I remember at the beginning of... so I was kind of building a startup adjacent to agents in 2024. And I don't know, maybe I'm still kind of building that. But the thinking at the time was like: oh, we would build an agent, like a coding agent. And obviously the kind of leader, in air quotes, in the space at the time was Devin. And the thinking was like: okay, Devin, it's like this

incredibly crack team of like math Olympiad winners. Like, they are so smart. It's like, imagine the smartest physics, mathsy friend you've ever had, and these guys are like 10 times better. And you need that level of brain power to build a coding agent. And I don't know if we've seen... I don't know how Devin actually works. And it's very much like a cloud coding agent, right? Not a local coding agent. But I imagine it was doing all this

fancy stuff, like custom models, fine-tuning. So that was 2024. And then the thing I kept being told in 2024 was: oh yeah, every single model lab, I guess at the time referring mostly to OpenAI and Anthropic, they want to build a coding agent in 2025. That's their goal. They have that clear goal. They want to build like a software engineering coworker. But at the time,

Matthew Carey (30:05.646)
actually.

Wilhelm (30:29.816)
that sounded like they basically wanted to build Devin in-house. And it makes sense, right? Like, it's a huge market and it unlocks further stuff. Like, it makes sense why they wanted to do this. But what we didn't yet know was what shape it would take. And I think I was going into 2025 thinking: okay, they're all going to build essentially what looks like a Devin. And then the big surprise was they built something that looked more like a Claude Code. Like, no secret sauce, really. You know, you can...

decompile Claude Code or whatever and look at the code, and it's just tools running locally in a loop with a really strong model that's been trained on the tools, which is maybe the special sauce. But that to me... and I remember when Sonnet 3.7 and Claude Code were announced in the same blog post, I remember thinking: okay, Claude Code, interesting, like a CLI for Claude. Okay, makes some sense. But I care a lot more about the model, like this new model,

Matthew Carey (31:07.182)
Yeah.

Matthew Carey (31:20.397)
Yep.

Wilhelm (31:28.352)
rather than Claude Code. But now we know, obviously, that this paradigm of calling tools in a loop locally, not in the cloud, with a strong model like 3.7 already was. Obviously, now we're three generations beyond that. It's incredibly powerful. Incredibly powerful. And then all the other labs followed suit.

Codex CLI came out very quickly after that. Copilot CLI is a thing now. Everyone has adopted this paradigm. And I would say it's actually much more powerful than the cloud coding agents that came afterwards, right? So like Codex cloud or whatever it's called now, Codex web, Claude Code on the web, they all kind of suck. Like it's just, it's hard to get the environment set up. It's like hard to have observability.

Maybe it's just a different way of working or whatever, but the local way has much more product-market fit, and I enjoy it a lot more.
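The "tools running locally in a loop" paradigm being described can be sketched roughly like this. Everything below is illustrative, not how Claude Code actually works: the model is a scripted stub (a real agent would call an LLM API here), and the tool names and message shapes are invented.

```python
# Minimal sketch of the tools-in-a-loop agent pattern (all names hypothetical).
# Hypothetical local tools the agent can execute:
TOOLS = {
    "list_files": lambda args: ["main.py", "README.md"],
    "read_file": lambda args: f"contents of {args['path']}",
}

def fake_model(messages):
    """Stub standing in for an LLM API call. It 'decides' to call a tool
    once, then produces a final answer from the tool result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "list_files", "args": {}}
    return {"answer": f"The repo contains {len(tool_results[0]['content'])} files."}

def agent_loop(user_prompt, max_turns=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        action = fake_model(messages)
        if "answer" in action:  # model produced a final answer, stop looping
            return action["answer"]
        # Otherwise execute the requested tool locally and feed the result back.
        result = TOOLS[action["tool"]](action["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_turns")

print(agent_loop("What files are in this repo?"))  # The repo contains 2 files.
```

The whole trick is just that loop: the model's output either requests a tool or ends the conversation, and tool results are appended back into the context before the next model call.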

Matthew Carey (32:30.688)
Yeah, yeah, no, I would somewhat agree with you. I think-

We saw the paradigm of training on your tools with deep research and I think deep research was a step. I can't exactly remember when that came out, but it was a step.

Wilhelm (32:48.274)
Mm. Mm-hmm.

Matthew Carey (32:53.526)
It was wild, like deep research was the moment when we were like, models can work for a really long time if they're trained on the data. And yeah, let's just like.

Wilhelm (33:03.926)
Right. Right.

Wilhelm (33:10.134)
In a way, deep research was the original Ralph Wiggum loop.

Matthew Carey (33:13.898)
It like literally was. It literally was. Yeah. Yeah. I remember being super excited about deep research and just thinking like this paradigm feels like it's going to go forwards. And I think Claude Code to some extent proved that right. I don't think we realized how much Anthropic were going to train on those tools. And I don't think they maybe even realized how much they were going to train on those tools, for better or for worse.

Wilhelm (33:16.536)
You

Wilhelm (33:38.508)
Totally. And clearly, so Claude Code, I think, like, I don't know exactly when internally at Anthropic they started using it. But from what I've read, it sounds like they had a version of Claude Code working internally at about the time of Sonnet 3.5, both v1 and v2, so about 3.5 and 3.6. And already it was working well enough for them to, like, be very intrigued by the idea.

But I feel like at that point they wouldn't have trained on the tools yet. So even before they did it, it already started kind of showing signs of interestingness.

Matthew Carey (34:18.408)
Yes. Yeah. Yeah. I think like, devs want to interact with their tools in a way that feels like it's customizable, and in an environment that they're used to. And you never really got that from, from like a standalone application. I think putting the model in the environment that you are in

Wilhelm (34:29.719)
Mm.

Matthew Carey (34:46.818)
was like a big unlock, like putting the model and giving it the tools to interact with that environment. It's like, it feels obvious in retrospect, like all of these things, but.

Wilhelm (34:47.071)
Mm-hmm. Yep.

Wilhelm (34:58.029)
Yeah, it's crazy, isn't it?

Matthew Carey (34:59.072)
And then like having the right tool definitions and then training on those tool definitions. Like, 100% Sonnet 3.7 was trained on something, like they had training data for those tool definitions that they used in Claude Code, because it was just so much better immediately. And like I'd done, I'd put a model in a loop with tools like a year and a half before Claude Code came out and it wasn't anywhere near as good, you know, like, like

Wilhelm (35:06.635)
Right. Yep.

Wilhelm (35:17.035)
Hehehe.

Mm-hmm.

Totally.

Matthew Carey (35:25.804)
It's that we go back to the code mode thing, like you've got to give the model something that's in distribution, and your wacky tools are not in distribution.

Wilhelm (35:31.192)
Right. Yeah, exactly. Yeah. By the way, I spent an hour or whatever coming up with names for this thing, because I think this idea of like your wacky tools are not in distribution is very, it's very real. Like it's totally like, that is the truth. But I feel like it's not obvious that it's the case, because every model lab

offers tool calling APIs, right? Tool calling infrastructure. So I feel like there's kind of an implicit promise there from the labs that's like, if you define tools, the model will call them at some point, right? But there is no promise, yeah. So I did a tweet about this a few days ago. I tried to come up with like a name for this problem.

Matthew Carey (36:11.436)
Yeah, I'm not sure there is a promise from the lab there.

Matthew Carey (36:22.488)
Okay.

Wilhelm (36:23.927)
OK, wait, here, let me just get the list up, and then you can tell me which is your favorite. Because it is weird, right? Everyone has tool calling APIs, but there's no guarantee that the model will ever call them. OK, so unseen tool generalization gap. That's the first idea. Sorry?

Matthew Carey (36:44.566)
Yeah, where's that? Where are you reading that from?

Wilhelm (36:50.071)
I just like it. It's just something I put on like a tweet a few days ago. So that's my first suggestion. Second one Monkey see but monkey don't want to call

Matthew Carey (37:01.656)
Monkey see but monkey don't do.

Wilhelm (37:04.084)
Yeah. Stranger tool, danger fool. Or maybe actually the better way is stranger danger tool fool.

Matthew Carey (37:14.754)
Okay.

Wilhelm (37:16.093)
Okay, and then the last one, tool curious but invoke abstinent.

Matthew Carey (37:23.054)
Okay, that's crazy.

Wilhelm (37:27.319)
Yeah, I have stayed true to it. Any of these have PMF with you?

Matthew Carey (37:31.52)
No, no, I like monkey see, monkey don't do.

Wilhelm (37:35.701)
Monkey see but monkey don't do. Maybe it'll have to be that one. Okay.

Matthew Carey (37:38.145)
Yeah.

Wilhelm (37:45.367)
Yeah, I think I spent a lot of last year with this problem of like, how do I get the model to call my epic tool that I spent so much time making great and you just can't force the model to do it. So you have to find another way. But I think some progress has been made on that front or I feel like I've made some progress on that front. Much more simple question for 2025 review. Walk me through like how your daily driver models shifted throughout the year.

Matthew Carey (37:45.378)
Yep.

Matthew Carey (38:15.95)
Oh, I'm not sure I can even remember back that far. What were we at, beginning of last year? Were we at Sonnet 3.5? Yeah, we must have been. Maybe even Sonnet 4?

Wilhelm (38:29.632)
Yeah, I mean, I think at the beginning of the year we were still on Sonnet 3.5 v2.

Matthew Carey (38:34.284)
Sonnet 3.5 v2, so Sonnet 3.6. Because I remember when we chatted with Torsten, that was around the point, was that? No, maybe it wasn't. Maybe we were just chatting about Amp and we were chatting about them using Sonnet 3.5 rather than Sonnet 3.6.

Wilhelm (38:41.376)
Mm-hmm.

Wilhelm (38:53.812)
Mm-hmm.

Matthew Carey (38:54.254)
There was a moment there where we definitely talked about that. I don't know. I think I've always been on the latest Sonnet model until now, where I'm on the latest Opus model. And there was a moment, I can't exactly remember where it was, near the middle of the year, where I felt like Opus 4 was so overpowered compared to Sonnet 4. So I went to Opus 4, but it was just super, super expensive. And then they made it cheaper with Opus 4.1. Sorry, phone call.

Wilhelm (39:14.71)
Mm-hmm. Yup.

Wilhelm (39:24.214)
So you're basically always on the latest, like, the most powerful Anthropic model that exists. Yeah. And is this across the board for all your LLM use or just for coding models?

Matthew Carey (39:25.419)
I'm gonna go.

Matthew Carey (39:32.652)
Yeah, trying. Definitely trying to be, yeah.

Matthew Carey (39:43.618)
Just for coding models. I think for like general purpose models, OpenAI had some really good stuff.

Yeah, yeah, for general purpose models, they had some really good stuff. Like, I think for personal stuff, I was using 4o for ages and then 4.1 mini for a while, I think even.

Wilhelm (40:11.215)
Interesting. Like on, TAT

Matthew Carey (40:15.082)
No, no, no, just like when I was like instrumenting my own, just like my own scripts and stuff. Yeah.

Wilhelm (40:20.816)
I see. This is like what you use in your code to power LLM stuff. Okay.

Matthew Carey (40:27.212)
Yeah, yeah, and like throughout, like with StackOne we were using a lot of model usage like that.

Wilhelm (40:37.419)
Mm-hmm.

Matthew Carey (40:39.702)
Yeah, because we always used OpenAI embeddings and we were using those models. And then, I just like, for personal, I've always been using the Claude models for ages, for ages. And I was using them inside Cursor. And now I'm mostly using them in Claude Code or in OpenCode.

Wilhelm (40:51.146)
Mm-hmm.

Wilhelm (40:58.518)
Mm-hmm.

Yeah, that's another thing that came on the scene in '25, right? OpenCode. What's your, yeah, I haven't, I really haven't used it that much. I've just been like using Claude Code for all this. Interestingly, also, I feel like I completely stopped using deep research, and tasks I previously would have given to deep research, I now just give to Claude Code, like in the terminal.

Matthew Carey (41:05.099)
Yeah.

Matthew Carey (41:26.86)
Yeah, seriously. my god, sorry, my noise is so loud. How do I turn on focus? I've got like the glass. There we go. What's it called?

Wilhelm (41:33.686)
I think you can...

Wilhelm (41:38.548)
you can option click on the timestamp at the top right and then it puts you in do not disturb.

Matthew Carey (41:42.05)
Does that actually work? Okay, I did it, I did it, I did it, I did it.

Wilhelm (41:45.61)
Nice.

Matthew Carey (41:49.08)
Sorry, you were saying.

Wilhelm (41:49.471)
Yeah, I'm just Claude Code maxing basically. No, no, I haven't really played much with OpenCode, or I don't really know what it would get me. I don't know what, yeah, what's your OpenCode workflow?

Matthew Carey (42:01.198)
It's just a direct replacement for Claude Code, if Claude goes down or I'm like, yeah, I want to play with some different models. Also, like, we're playing with it a lot inside Sandbox. So if you want to like run something remotely, we're playing with OpenCode inside Cloudflare Sandbox quite a lot. I know Nourish has been working a lot on the Sandbox SDK to make that really nice, really clean. We also have like

Wilhelm (42:05.695)
Okay.

Wilhelm (42:10.23)
Mmm.

Wilhelm (42:14.644)
Mm-hmm.

Matthew Carey (42:31.328)
some cool instrumentation of open code internally where like our MCP servers are automatically added and there's some good stuff there.

Wilhelm (42:34.55)
Mm-hmm.

Wilhelm (42:39.858)
Nice, nice. Right. It's kind of obviously a lot more hackable than Claude Code.

Matthew Carey (42:46.082)
Yes.

Wilhelm (42:48.404)
Nice. And do you find that the performance, like say you're just using Opus 4.5, is what it can do similar, or like is the performance similar in Claude Code and OpenCode? Because I feel like I would be worried that it would just be worse.

Matthew Carey (42:54.754)
Mm-hmm.

Matthew Carey (43:05.398)
Yeah, yeah, I mean, their tool definitions are not entirely correct, so it goes off-piste a little bit more. See, the thing is, all of this stuff is kind of fixable. I think it's tough because obviously they don't make the model.

Wilhelm (43:14.048)
Okay.

Wilhelm (43:22.568)
Yeah, exactly. That's kind of like why I want to use the thing that is from the people who make the model, because they will know the best. Yeah.

Matthew Carey (43:29.526)
Yes, yeah. I think using OpenCode is quite nice because, instead of creating your own tool definitions and your own system prompt and wiring it all up together, you have a pre-wired agent as a server, pre-made. And I think that's a really good use of OpenCode.

Wilhelm (43:51.766)
But you can use the Claude Code SDK for that as well, right? Or I mean, I guess it's less hackable that way.

Matthew Carey (43:56.078)
You can, but you're essentially just running the Claude binary and passing it in messages and taking them out as JSON. Like it's not, it's, yeah, it's not the same.

Wilhelm (44:03.423)
sure. Yep.

Wilhelm (44:09.907)
because it's less hackable.

Matthew Carey (44:11.688)
Yeah, yeah, you don't have, it's also just like much less developer friendly. Like, you're running a binary in the process you're in, you're not like starting a server and making it persistent and making requests to it.

Wilhelm (44:13.779)
or less.

Mm.

Wilhelm (44:25.663)
Fair, fair. Yeah. Although Claude Code seems to have got this concept of sessions recently, which seems interesting, although I haven't played with it yet. Yeah, okay, interesting. I think for me it's similar in terms of coding model use, like it's kind of the latest whatever Anthropic offers.

Matthew Carey (44:33.613)
Yep.

Matthew Carey (44:45.986)
Hmm.

Wilhelm (44:47.997)
I think there was briefly a time when Gemini 3 came out where I was like, damn, I should probably change my daily driver model to Gemini 3. But then I was very glad Opus 4.5 came out a week later, and I just had to like procrastinate for a week and then the solution presented itself. There's a great, do you know this? There's a book by like a big Harvard or Stanford professor called The Art of Procrastination. Have you ever heard of this?

Matthew Carey (45:09.569)
So fun.

Matthew Carey (45:14.946)
Yeah. No.

Wilhelm (45:18.613)
It's a book basically about how people who are big procrastinators should just embrace the procrastination and not fight it all the time. And one of them is like, he lays out very clearly a bunch of steps. It's actually quite a short read and some very actionable stuff. But one of them is just like, sometimes if you procrastinate on a task long enough, it just will resolve itself without you needing to do anything.

Matthew Carey (45:27.085)
Hmm.

Matthew Carey (45:31.416)
Okay.

Matthew Carey (45:43.469)
Yeah.

Yeah.

Wilhelm (45:47.796)
So like procrastination was the right choice there. Or he will say like, I think one example he gave is like, as like a professor, he constantly has to like grade student papers and he just procrastinates on that all the time. But by procrastinating, what he does instead is like, he'll like fiddle with his like mail server settings or whatever. So he'll learn a lot about like mail servers while he's procrastinating. And then his colleagues will be like,

Matthew Carey (46:09.516)
Yeah.

Matthew Carey (46:12.888)
You

Wilhelm (46:15.913)
How do you ever have time to learn about random things like mail servers? I'm so busy grading my student papers. And he's just like, well, I'm just procrastinating. And then I get all my student grading done in a rush at the end. So it just shows some examples of how procrastination can be a very good thing. Sidebar over. It's good, right? And then, yeah, so to get back to the model stuff.

Matthew Carey (46:35.35)
I rate that. I rate that so much.

Wilhelm (46:42.037)
Anyway, for me, actually, my daily driver for just random queries is just like ChatGPT on the web or the mobile app. I have the action button on the iPhone bound to just open straight into a new conversation in ChatGPT. So it's just very, very easily accessible.

Matthew Carey (47:02.242)
We're going to have to cut this out. need to get this. I'm so sorry.

Wilhelm (47:05.941)
You need to get the call? Okay.

Wilhelm (47:14.205)
I should just play some music.

Wilhelm (47:31.893)
just play the intro again.

Wilhelm (00:02.1)
We're back. Okay, wait. Yeah, I think I can just cut this together or whatever.

Yeah, we should be wrapped soon anyway. Is it the movers calling? Or no?

Matthew Carey (00:09.804)
Yeah, I'm gonna have to wrap. yeah, just organizing all of that stuff is just like, ugh, such hassle. But, it's cool.

Wilhelm (00:19.08)
Yeah, it's a busy time. I mean, thanks for being down to do the pod at all today, to be honest.

Matthew Carey (00:23.182)
It's really nice. It's nice to hear, it's nice to see you, dude, and like hear some thoughts. What are you most excited about for this year coming? I know we're not going to do a full like year dive, but come on, talk to me.

Wilhelm (00:35.1)
Yeah, I'll give a hint. And actually there was also a very brief Vercel blog post about this, which kind of hinted at this. But I think representing the data you have, in whatever domain you're in, in a like file-system-native way. So folders and files, like bringing whatever your

Matthew Carey (00:41.687)
Yeah.

Wilhelm (01:05.064)
domain is, like if it's support tickets or if it's polls or if it's like traces, like invocations, runtime traces, representing that in a nested folder and file structure so that Claude can discover it and grep it and glob it and ls it and all this stuff, I think will be incredibly powerful. So I think it's

to me, and we can expand on this, Nusha, is like it's the year of human readable text files in nested folder structures.

Matthew Carey (01:48.596)
The year of human... Okay, I do love how we're going back to like a Linux era or Unix era of like everything is a file. I'm still not sold on it. I think, yeah, I'm really, I'm still not sold on it, and I think with the Vercel blog post about expressing stuff as files, there's gonna be lots of domains that are not expressible as a file system.

Wilhelm (01:54.346)
He

Wilhelm (01:57.876)
Mm-hmm.

Matthew Carey (02:13.53)
And I think we have to find a way to still let an agent reason about it. A file system is very easy, right? Because you have progressive disclosure with read files, list files, all of that. But you can get progressive disclosure on other stuff as well, especially on like, I think people need to work out memory a little bit more.

Wilhelm (02:21.544)
Mm-hmm. Mm-hmm. yep, yep, yep, yep, yep, yep.

Matthew Carey (02:37.036)
But I don't think so. It's very easy to dump stuff in files and just be like, have at it, and that is memory. I'm not sure.
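The progressive-disclosure idea being debated here can be sketched concretely. This is a hypothetical illustration, not anyone's actual product: the support-ticket domain, folder layout, and fields are all invented. The point is that nesting data as plain files lets an agent list first and read only what looks relevant, the way a coding agent uses ls and grep before opening a file.

```python
# Sketch: a hypothetical support-ticket domain laid out as nested text files.
from pathlib import Path
import tempfile

tickets = [
    {"id": "T-1001", "status": "open", "subject": "Login fails on Safari"},
    {"id": "T-1002", "status": "closed", "subject": "Billing page 500"},
]

root = Path(tempfile.mkdtemp()) / "tickets"
for t in tickets:
    # Nest by status, so a bare directory listing already discloses structure.
    path = root / t["status"] / f"{t['id']}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# {t['id']}\nstatus: {t['status']}\n\n{t['subject']}\n")

# Progressive disclosure: list a subfolder (cheap) before reading files (expensive).
listing = sorted(p.name for p in (root / "open").iterdir())
print(listing)  # ['T-1001.md']

# Grep-style discovery across the whole tree, like an agent's grep tool would do.
matches = [p.name for p in root.rglob("*.md") if "Safari" in p.read_text()]
print(matches)  # ['T-1001.md']
```

Whether this beats a purpose-built memory store is exactly the open question in the conversation; the sketch only shows why file trees are so easy to hand to an agent.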

Wilhelm (02:37.534)
Mm-hmm.

Wilhelm (02:43.826)
Yeah, I guess. I think like, in the past we were intentionally designing websites or whatever, right? I think now we'll be intentionally designing text files. Maybe the text file as a format will get an upgrade, in whatever shape that might take or might look like. But I think the agents are just good at text files. And I think you're right, not all domains can be expressed like this, but I think so many can. And I think that's where a lot of interesting opportunity lies. But yeah, I...

Matthew Carey (03:10.776)
Yeah.

Wilhelm (03:13.482)
I think we can talk in much more depth about this in the future. Okay, I think we should probably wrap pretty soon. I have one last topic, I don't think we've... Yeah, we've mentioned this, but not as the star of the show. I think, so one of the prompts I had for like... Oh, actually, let's do this and then let's do like a quick fire round of like best, like superlatives basically. I think one of the weirdest things of 2025 in my mind...

Matthew Carey (03:24.302)
Go on, hit it. Let's go.

Matthew Carey (03:42.188)
Hmm.

Wilhelm (03:42.44)
was actually MCP. Like, weirdest technology, weirdest to understand, weirdest to make sense of. And is it? So I actually learned about MCP when it was first released in like late 2024. I was like, my God, this is so cool. Like, I love everything about this. And then Zed actually was one of the launch partners of MCP, because David's a big Zed fan, right? So I was like, damn, that's very timely.

Matthew Carey (04:05.091)
Yeah.

Wilhelm (04:13.36)
It's funny, there's an analyst, guy who used to work at Andreessen Horowitz called Benedict Evans, who has an interesting take on this. I actually don't think most of his AI takes are that good or I don't agree with him that much, but he is very good. I think he's a very good macro analyst of big trends, big shifts. He talks about technology adoption S-curves all the time. He creates a massive slide deck that's fun to flick through every year.

Matthew Carey (04:19.182)
Hmm. Looks like.

Wilhelm (04:41.821)
What he said about MCP is hilarious. He said, developers love middleware and MCP is middleware. That's it. Which I'm not sure is completely off. But I think, go on, yeah.

Matthew Carey (04:56.542)
yeah. Yeah, I... It's not completely off. I think developers love standardization because...

especially when you're developing with LLMs, even before that, it meant that you knew how to do something in one place and you could directly apply that to another. If you know how to call a REST API, you know how to get data from any REST API and that is quite a well-used standard. If you know how to use gRPC, you know how to use gRPC, but if you know how to use REST, it doesn't mean you know how to use gRPC.

Wilhelm (05:25.587)
Yep, yep, yep, yep.

Matthew Carey (05:34.006)
or GraphQL, for instance. Like, these things are different, but they are standards. And previously we had no standard for sending tools over the web, or prompts or resources or whatever. And so, like, I think having a standard was a good idea, because the idea of remote tools was always going to be something that people wanted to do if tools became the dominant interface that

Wilhelm (05:49.257)
Yeah.

for sure.

Wilhelm (06:00.456)
Yeah.

Matthew Carey (06:03.454)
the agents access the world through. I'm still not convinced that... tools, like, it's still weird. I still find it weird that we call them tools.

Because it's like, so, it's like treating the models like humans. Like you're giving them a tool belt and saying, go, go do your job, you know? Like, here's your tool belt and here we load some tools where you can access the world. I just find it so funny.

Wilhelm (06:19.209)
Mmm.

Wilhelm (06:24.839)
Yeah, I agree. But I think this is part of why MCP is weird to me, like both what you just described there and also like, it's not really true that like, you can learn MCP as a standard and then you can just like call all these tools like the same way you can with like REST or like understanding SQL or whatever, right? Because of what we talked about earlier, like maybe the model will just not call the tool. So it has great means of calling arbitrary tools via MCP, but it won't do it, right? So it's like, that's what I mean by it's like just a bit weird.

Like it's, and I would actually argue most of the reason why MCP has been, it's been such a big year for MCP and such a successful year with everyone embracing it, right? Which, by the way, that wasn't clear at the beginning of 2025, that Google and OpenAI and everyone would. Like the whole OpenAI, sorry, the ChatGPT apps model is built entirely on MCP, which is very interesting and kind of cool, but

I think the reason why MCP grew so much in the first half of last year is because it finally gave everyone an outlet for like, AI is here, how does my company do AI? It turns out we can just build this little new server thing where we define all of our tools and we'll catch the AI wave, right? Which makes sense. And it's like great that we could all

Matthew Carey (07:43.15)
Yeah.

Wilhelm (07:52.434)
direct our energy towards writing MCP servers and tool calls. But then the problem is the models don't call the tools and there's all these issues, right? So it's like, it's just weird, man. Like it's just strange. And then obviously you talk to the MCP people about, like it'd be great if the models called the tools more. You know, that seems like the main issue with MCP. And they'd be like, well, that's not really an MCP concern. That's an MCP client concern.

That's something in the models. You don't have to call an MCP tool just because the model wants to call it. An MCP client like Zed or Claude Code or anything that calls tools, you could just call tools whenever. You can make code mode and have code mode call tools from inside it. There's no law that says an MCP tool call only happens when the LLM does a tool calling thing in its output.

But no one builds clients that way. There's not really a client that exists right now that does really creative things with tool calling, until maybe now with the tool search or whatever. So it's just weird, man. It's just like the whole thing is just unusual or fun, great, but like weird.
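The code-mode idea being described can be sketched like this. Everything here is invented for illustration: the tool registry is a local dict standing in for what a real client would dispatch over MCP, the "model-generated" snippet is hand-written, and a real client would sandbox the execution rather than call `exec` directly.

```python
# Sketch of "code mode": the model emits code, and that code calls tools,
# instead of the model emitting a tool-call blob in its output.

# Hypothetical tool registry; a real client would route these over MCP.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 14},
    "send_email": lambda to, body: f"sent to {to}",
}

def call_tool(name, **kwargs):
    # Local dispatch standing in for an MCP tools/call request.
    return TOOLS[name](**kwargs)

# Pretend the model wrote this. From the model's perspective nothing here is
# a "tool call" -- it's just code, which stays in distribution for the model.
model_generated_code = """
weather = call_tool("get_weather", city="Lisbon")
result = f"It's {weather['temp_c']}C in {weather['city']}"
"""

namespace = {"call_tool": call_tool}
exec(model_generated_code, namespace)  # a real client would sandbox this
print(namespace["result"])  # It's 14C in Lisbon
```

This is exactly the "client's choice" point: the client decides when tool calls happen, and code mode just moves that decision into generated code rather than the model's structured output.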

Matthew Carey (09:11.084)
Yeah, definitely weird. I never really thought about MCP as a way for companies that weren't AI native to get on an AI hype train. I haven't actually heard that before. But it rings true. It really does ring true. Yeah, that's smart.

Wilhelm (09:11.113)
promising.

Wilhelm (09:31.389)
Like, why was everyone building MCP servers? Like, I think it's because everyone felt this like, crap, I'm getting left behind on AI. Like every company is hiring Accenture to build an AI strategy, right? Obviously, does Accenture know what the AI strategy is? I don't know, probably not. But like, it's the dream of like any company, right? Like, we just need to package up our unique value that we provide,

Matthew Carey (09:39.948)
Hmm.

Wilhelm (10:00.84)
whether that's like creating a to-do list or checking out on Instacart or whatever, and package it into a neat little tool call, and then the money will just start flowing. Everyone who's using models will just use our amazing technology via MCP, and happy days. But it's just, it's not really, it's not really. All right, quick fire round. You have...

Matthew Carey (10:07.906)
Mm.

Matthew Carey (10:20.898)
Yeah, it was smart, wasn't it?

Matthew Carey (10:25.87)
I just gotta run. Oh cool, let's do a quick fire, let's do a quick fire and then I gotta run. Opus 4.5, 100%, like insanely, insanely, insanely good. Like I don't have anything. Yeah, it's ridiculous.

Wilhelm (10:30.652)
Best, best model 2025.

Wilhelm (10:37.32)
Agreed. It's not fair, is it? Best lab of 2025.

Matthew Carey (10:48.31)
like best in what sense?

Wilhelm (10:51.228)
Oof, it's quickfire round.

Matthew Carey (10:55.148)
I have to go with Anthropic then.

Wilhelm (10:57.32)
Okay, yeah, we'd have to agree as well. Most surprising thing in 2025.

Matthew Carey (11:11.182)
I think I'm gonna give it to MCP taking off as a standard and people adopting it. Definitely. And then the second one is this obsession with file systems that people have. And the third one is Bun being acquired. Sorry. Crazy.

Wilhelm (11:18.472)
Mm-hmm.

Wilhelm (11:24.86)
Yeah, yeah.

Ooh, yep, yep, yep. Which in a way is related to running stuff. It's almost like Claude Code. Like the Bun thing could not have happened without Claude Code, right? So it's like, mm.

Matthew Carey (11:35.19)
I... Maybe it extrapolates, but like, I think that, I know this is quick fire, but I don't think that Claude Code being a success was a surprise. Yeah. No.

Wilhelm (11:48.7)
You don't think it was a surprise?

Matthew Carey (11:52.824)
Like, definitely not.

Wilhelm (11:53.033)
I think the surprise was that we're all going back to CLIs and local and files. And it's very clear that we keep heading that way. I think actually I'll sneak in one thing which I want to discuss at length with you. Which is, you know the whole thing about cattle and pets? We had this huge shift over the past 10 years of: you should treat your servers as cattle, not as pets. We're now kind of shifting back to pets a little bit.

Matthew Carey (12:16.055)
Hmm

Wilhelm (12:22.374)
because of cloud code, essentially, because of long running things and you need to be authed and all that stuff. And maybe we're maybe finding a new paradigm that's kind of in the middle, actually. You still want to obviously be able to spin up a new thing very quickly, but you kind of, the cattle thing doesn't work as well anymore. And also it's slower, right? The cattle problem was always...

A deploy will take at minimum five minutes if you're building a new Docker thing and then you're spinning up the new Docker thing and you're draining the traffic and all the stuff. That just doesn't work well when an agent needs to write the code, make the request, then test if it's working, then make changes. You can't have five minutes in between that, rebuilding Docker files and starting up a new box or whatever. Anyway, that's both an interesting shift and a prediction.

Matthew Carey (12:52.621)
Yep.

Wilhelm (13:15.858)
That's all I have.

Matthew Carey (13:18.995)
It's been a while, dude.

Matthew Carey (13:24.969)
Yeah, mate, hopefully we'll get back into some more rhythm and I'll actually be awake for this one, for the next one. was lovely.

Wilhelm (13:31.176)
Mate, you've earned it. Well, the next pod you'll be in Lisbon, right? End of an era.

Matthew Carey (13:36.952)
Yeah, yeah, yeah, it'll be good. We can do it next week. I think Wednesday will be awesome. Let's do it.

Wilhelm (13:43.11)
Okay, sweet, Yeah, I'll hit you up. Yeah, first year of the pod.

Matthew Carey (13:51.458)
fucking crazy. Lovely to see you.

Wilhelm (13:52.87)
Fucking crazy. It's a pleasure. Enjoy the party tonight.

Matthew Carey (13:57.28)
If I make it, I might...

Wilhelm (13:57.756)
Give my best to everyone, if you make it.

Matthew Carey (14:02.534)
Alright, catch you in a bit. Bye!

Wilhelm (14:04.477)
Bye, peace.