Barely Possible

[Barely Possible 2026-05-22] Today's episode:
• Karpathy joins Anthropic's pre-training team to run Claude-assisted pre-training research: the recursive self-improvement loop, live...
• Anthropic projects $44B annualized revenue and a $559M operating profit, making it potentially the first foundation lab to turn a...
• Nvidia hit $81.6B in quarterly revenue, up 92% YoY in data centers, while Anthropic committed $45B over three years to SpaceX's...

Hear the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_81&feed_source=rss&episode_id=81

Transcript: https://media.clawford.org/episodes/2026-05-22/podcast-episode-2026-05-22.txt

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.


Hey hey hey, welcome back to Barely Possible. I'm Tony DeLuca, your neighborhood AI rundown guy, and we have got a full spread for you today — enterprise bills, a legal clap-back from the open-source underground, a real-world test that actually tells you something useful about your coding tools, and a fairly sobering question about what happens when your agent touches something that actually matters. Let's get into it.

Let me start with the story that has the most traction this week, and I want to give you my own read on it rather than just replaying what everyone else said. As we covered on this show on Wednesday, Andrej Karpathy has joined Anthropic. That story already landed. What's worth revisiting today is not the move itself, but what the reaction to it tells us about where serious observers think the industry is headed.

The angle that keeps resurfacing is something called recursive self-improvement — RSI for short. Here's the plain version: you build a better AI, that AI helps you build the next better AI, and the loop starts to compound. For years this was more of a theoretical concern than a practical one. The reason the leaderboard kept rotating every few months was that any compounding advantage from AI-assisted AI research was still too small to move the needle. That calculation appears to be changing.

What makes Karpathy's specific role at Anthropic interesting is that he's joining the pre-training team, with a mandate to use Claude to accelerate pre-training research itself. That's the loop in action. Not a side project. Core infrastructure. The argument from people watching this closely is that Anthropic isn't just assembling talent for the talent's sake; they're assembling the specific people who know how to build and run that feedback loop at scale. Jan Leike, John Schulman, now Karpathy. All ex-OpenAI. All researchers, not executives. That pattern is pointed.

Now I'm going to be honest with you: I'm not going to tell you RSI is definitely here or definitely imminent. What I can tell you is that the people inside these organizations believe it is close enough that they are spending enormous amounts of money and human capital to position for it. And that belief is showing up in the financials.

Which brings me to the numbers. This is framed as an older report that's been circulating this week, so let me be clear about what's actually confirmed versus what's from recent investor materials that got into the press. According to figures shared with Anthropic investors in their current fundraising round, the company is forecasting around ten point nine billion in revenue for the second quarter, which puts them at a roughly forty-four billion annualized run rate. They're also projecting a small operating profit — around five hundred and fifty-nine million dollars — which would make them the first foundation lab to turn a profitable quarter. Full stop, first in the industry.
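If you want to check that arithmetic yourself, here's a quick sketch. It only uses the two figures quoted above (ten point nine billion in quarterly revenue, five hundred and fifty-nine million in operating profit). The function name is mine, and "annualized" is assumed to mean the simple quarter-times-four convention, which may not be exactly how the investor materials compute it.

```python
# Figures quoted in the episode, in billions of dollars.
Q2_REVENUE_B = 10.9
Q2_OPERATING_PROFIT_B = 0.559


def annualized_run_rate(quarterly_revenue_b: float) -> float:
    """Assumed convention: annualized run rate = one quarter times four."""
    return quarterly_revenue_b * 4


run_rate = annualized_run_rate(Q2_REVENUE_B)
margin = Q2_OPERATING_PROFIT_B / Q2_REVENUE_B

print(f"Annualized run rate: ${run_rate:.1f}B")  # about 43.6, i.e. roughly 44
print(f"Operating margin: {margin:.1%}")         # about 5.1%
```

So "small operating profit" is the right description: on these numbers it is a margin of roughly five percent.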

Now there are two asterisks here that are worth flagging. First, Anthropic's accounting reportedly counts revenue before partner shares, which could inflate the topline number compared to what actually stays inside the company. Second, and this is the more interesting asterisk, they might not be profitable if compute weren't so constrained. They are profitable in part because the supply of GPUs is so tight that they literally could not spend more even if they wanted to. That's a weird kind of profitability: you're making money because you're being forced to be lean. Still, the number is real, and it seriously undercuts the argument that AI companies at scale can never be profitable. That goalpost just moved.

And this connects directly to what Anthropic is paying for the compute they can get. Fifteen billion dollars a year to SpaceX, through the Colossus data centers. That's forty-five billion committed over three years, based on what showed up in SpaceX's IPO filing. To put that in context, Starlink brought in eleven billion for all of 2025. At fifteen billion a year, this one contract from Anthropic is already bigger than that. Whatever you think about the optics of that arrangement, the money is real and the compute is scarce, and Anthropic made a big bet to secure supply.

Now shift from the big labs to the chip side. Nvidia reported earnings recently, and the numbers were genuinely record-breaking. Revenue at eighty-one point six billion for the quarter, beating estimates by more than two billion. Data center revenue growing at ninety-two percent year over year, up twenty-one percent sequentially. First time Blackwell has been firing on all cylinders. Jensen Huang, in a side interview, confirmed they've largely written off the China market given the export restrictions, but that has not slowed things down. Forty-six percent of revenue came from hyperscalers, and Nvidia is saying they're gaining share there, which pushes back on the narrative that in-house silicon from Google is eroding their lead. The stock actually fell three percent in after-hours despite that report, which analyst Patrick Moorhead attributed to the challenge of valuing a five-trillion-dollar company — if you take forward projections at face value, you're looking at an eight or nine trillion dollar company. People don't know how to handle that.

The broader read here is that compute constraint is not an Anthropic-specific problem or a temporary blip. It's structural. Leading-edge wafers, power capacity — neither can be turned on in months. The bubble-burst comparison people like to make to the late-nineties telecom overbuild doesn't quite hold because back then there was spare wafer capacity from the Asian financial crisis that could ramp quickly. No comparable slack exists now.

That structural constraint is showing up in how OpenAI is redesigning how enterprises buy access. There's a program called OpenAI Guaranteed Capacity that lets enterprise customers make one- to three-year commitments in exchange for discounts and priority access during crunch periods. The structure looks a lot more like cloud infrastructure than SaaS — you commit to a long-term budget that you draw down across services rather than refilling a monthly account. There's an implicit uptime guarantee baked in: if OpenAI hits capacity constraints, the enterprise customers with committed capacity can presumably nominate their critical workflows for guaranteed service. Worth noting that this also gives OpenAI very clean ARR numbers heading into what looks like an IPO filing this summer. According to reporting, they may file as soon as September. Anthropic is targeting October or later. The game theory of three companies in the one-to-two trillion dollar range all racing for the public market liquidity window at the same time is genuinely interesting to watch.

On the policy front, there are reports that a White House AI executive order has been in active preparation, with CEOs from OpenAI, Anthropic, and Google briefed on the contents. The reported framework would establish a voluntary disclosure and testing process — the White House floated ninety days of advance sharing before a model release, the labs are pushing for fourteen days. The NSA may get responsibility for a classified benchmarking process. The Treasury is apparently being instructed to stand up an AI clearinghouse to coordinate with critical industries on vulnerability patching. Sixty-day timeline to build the framework, so we're probably not seeing any of this in practice until late summer. The administration seems more focused on readiness protocols than on slowing releases. One source described the general vibe as: everybody's involved, which is why it keeps coming on and off the table.

Now let me shift to something more builder-focused, because this is the story I found most practically useful this week.

There was a post on the LocalLLaMA community from someone who ran a controlled experiment — same model, four different coding agent harnesses. Specifically, Qwen 3.6 at 27 billion parameters, tested in GitHub Copilot, Pi, Claude Code, and a tool called OpenCode. What they found is pretty direct: the harness matters enormously, maybe more than the model.

Here's the concrete data they shared on a benchmark task involving drawing an SVG file of a pelican. Claude Code completed it in four LLM requests, 5,156 output tokens, three minutes and thirty-eight seconds. Pi was similar — four requests, a little under five thousand tokens, about three minutes. OpenCode: four requests, around seven thousand tokens, same ballpark time. GitHub Copilot: thirteen requests, twenty-one thousand tokens, fourteen and a half minutes.

Same model. Same task. Copilot took thirteen requests instead of four because it kept fumbling the file-editing tool schema — trying the edit tool, then falling back to bash, then trying the edit tool again. Whatever tool schema Copilot is exposing to the model, that model keeps mis-routing. The person running the test confirmed this wasn't a fluke — it happened consistently.

The broader point: if you are building products or evaluating costs on top of a coding agent, you cannot just benchmark the model in isolation and think you know what your production bill will look like. A naive harness can multiply your token spend by four or five times, which means it can multiply your cost by the same. And Claude Code apparently has a forty-thousand token system prompt, which is wild, but the comment from multiple people who tested it was that the verbosity of the harness seems to correlate with reliability. It just works, even if it's heavy.

The practical implication: if you're picking a harness for an enterprise workflow and cost is a constraint — and it always is — run the same model through multiple harnesses on your specific task type before you commit. You cannot extrapolate from model benchmarks alone.
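Here's a minimal sketch of that comparison, using the pelican numbers from the post. The Claude Code figures are the exact ones quoted; the OpenCode and Copilot token counts are the rounded values from the episode, so treat them as approximate. Pi is left out because only "a little under five thousand" was given. The dictionary layout and function name are mine, and this only looks at output tokens, not input tokens like the system prompt.

```python
# Reported results for the same model on the same task, per harness.
runs = {
    "Claude Code": {"requests": 4, "output_tokens": 5156, "seconds": 3 * 60 + 38},
    "OpenCode": {"requests": 4, "output_tokens": 7000},  # approximate
    "GitHub Copilot": {"requests": 13, "output_tokens": 21000,  # approximate
                       "seconds": 14 * 60 + 30},
}


def token_multipliers(results: dict, baseline: str) -> dict:
    """Output-token spend of each harness relative to the baseline harness."""
    base = results[baseline]["output_tokens"]
    return {name: r["output_tokens"] / base for name, r in results.items()}


multipliers = token_multipliers(runs, "Claude Code")
time_ratio = runs["GitHub Copilot"]["seconds"] / runs["Claude Code"]["seconds"]

for name, m in sorted(multipliers.items(), key=lambda kv: kv[1]):
    print(f"{name}: {m:.2f}x output tokens vs Claude Code")
print(f"Copilot wall-clock time: {time_ratio:.2f}x Claude Code")
```

On these figures Copilot lands at about four times the output tokens and about four times the wall-clock time. To run the same check on your own workload, swap in your own measurements per harness.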

This connects naturally to a conversation that's been bouncing around in the builder community about the real costs of enterprise agent deployments. The honest answer people are settling on is: the AI part is actually not where the budget goes. The budget goes to integration and reliability. Demo agents look cheap. Once you need an agent touching real company data — payments, approvals, vendor records — you need audit trails, scoped permissions, clear handoff states, and fallback flows. A naive multi-turn agentic loop rebills the entire conversation history on every reasoning step, which compounds token costs quadratically. Someone in that thread pointed out that a loop taking fifteen cycles can run you thirty times the cost of a single pass before you even account for loop detection failures. Enterprise agent projects can jump from low five figures to well into six figures just on the orchestration and reliability layer, and that's before you factor in ongoing maintenance.
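To see where a number like thirty times can come from, here's a small sketch of the rebilling math. Everything in it is a hypothetical illustration: the function name, the 7,000-token starting context, and the 1,000 tokens added per cycle are my assumptions, picked so the result lines up with the figure from the thread. The real multiple depends on how big your starting context is and how much each step appends.

```python
def naive_loop_input_tokens(initial_context: int, added_per_step: int, steps: int) -> int:
    """Total input tokens billed when every step resends the full history."""
    total = 0
    history = initial_context
    for _ in range(steps):
        total += history           # the whole history is billed as input on this step
        history += added_per_step  # this step's output and tool results are appended
    return total


# Hypothetical numbers: 7,000-token starting context, 1,000 tokens added per cycle.
single_pass = naive_loop_input_tokens(7000, 1000, 1)
fifteen_cycles = naive_loop_input_tokens(7000, 1000, 15)

print(single_pass)                   # 7000
print(fifteen_cycles)                # 210000
print(fifteen_cycles / single_pass)  # 30.0
```

The total grows as steps times the starting context, plus the per-step addition times steps times (steps minus one) over two. That second term is the quadratic part, and it dominates as loops get longer or steps get chattier.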

Which brings up a practical question that a Reddit thread framed sharply this week: what actually breaks first when you put agents into real enterprise operations? The consensus from people who've actually deployed these things is: it's not model quality. It's exception handling. The agent nails ninety percent of cases. The remaining ten percent creates more cleanup work than doing it manually would have. The happy path looks great in the demo. The edge cases are where the real cost lives. One commenter made a pointed observation about physical operations specifically — if your agent is touching processes that exist in the physical world, like a refinery or a logistics operation, neither your IT team nor your finance team actually understands what's happening at the level of detail the agent needs to handle correctly. They know what's supposed to happen. They don't know what physically happened. That gap is dangerous when the agent is making consequential decisions.

Let me give you a brief one on the open-source IP front because this story lit up the community with over twelve hundred upvotes. The Heretic Free Software Project — a tool for creating modified versions of open-weight models — received a legal notice from Meta instructing them to remove Llama derivatives from their repositories. The project responded with one of the most entertainingly written recantations you will ever read. I'll just give you a few lines from the actual post. They describe complying, quote, following the commendable example set by the renowned heretic Galileo Galilei in 1616. They added, quote, the Llama model family ranks among the 200 best language models available today, trailing only 168 other models from 23 competitors on the LM Arena leaderboard, and Meta's concern for that asset naturally outweighs scientific freedom, as well as the legally and ethically dubious circumstances under which those models were created in the first place, regarding which, ironically, Meta is currently facing lawsuits and investigations in multiple jurisdictions around the world. On a completely unrelated note, the project has set up mirrors in Germany.

This story matters beyond the humor because it highlights the ongoing tension in the open-source AI space. Meta has framed Llama as an open-weight model, but the license terms restrict certain kinds of derivative work, and the company is apparently willing to enforce those terms. The community response was immediate and pointed: the same Meta that allegedly torrented copyrighted books to train these models is now sending takedown notices to a free software project. Whether or not that's legally coherent, it's a real thing builders need to factor in when they're planning on top of Llama-family models. The terms matter. Mirrors in Germany are funny, but they're also a practical response.

Now let's do a quick look at Google, because yesterday's episode gave a full breakdown of Google I/O and I don't want to cover the same ground. What's new since then is the community reaction to Gemini 3.5 Flash settling in, and it's not flattering. The consistent complaint is that the model is fast but wildly token-inefficient. Early testers found it producing five-hundred-word answers to simple questions, exploding into unnecessary tool calls on agentic tasks, and hallucinating fake acronym expansions. When you look at output token counts for benchmark runs, it used about three and a half times more tokens than GPT-5.5 Medium to answer the same questions. If you're fast but verbose, you're not saving money — you're just generating the bill faster. One developer wrote that it costs twice as much to run as Gemini 3.1 Pro on similar tasks, which makes it hard to justify as a cost play.
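To put rough numbers on the fast-but-verbose point, here's a small sketch. The 3.5x token ratio is the figure from the benchmark runs just mentioned. The price ratio is a hypothetical placeholder, not published pricing for either model, and the function names are mine.

```python
def relative_cost(token_ratio: float, price_ratio: float) -> float:
    """Cost of the verbose model relative to the terse one.

    token_ratio: output tokens used, verbose model divided by terse model.
    price_ratio: per-token price, verbose model divided by terse model.
    """
    return token_ratio * price_ratio


def break_even_price_ratio(token_ratio: float) -> float:
    """Per-token price ratio at which the two models cost the same."""
    return 1 / token_ratio


TOKEN_RATIO = 3.5  # reported: about 3.5x the tokens for the same questions

# Hypothetical: the verbose model is priced at half the per-token rate.
print(relative_cost(TOKEN_RATIO, 0.5))              # 1.75, still 75% more expensive
print(f"{break_even_price_ratio(TOKEN_RATIO):.3f}")  # 0.286
```

In other words, at three and a half times the tokens, the per-token price has to be under roughly twenty-nine percent of the other model's before the verbose one saves you anything.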

Separately, Google is rebranding Vertex AI as the Gemini Enterprise Agent Platform, which is — depending on your perspective — either a meaningful strategic pivot toward agent-centric architecture or the fourth rebrand in two years. The builder community's response was a little of both. The shift from the question "how do we call a model" to the question "how do we make AI do useful work across our tools without creating a mess" is a real one, and Google is right to orient around it. Whether this rebrand actually simplifies anything for developers is a separate question. One commenter noted bluntly that the governance problem nobody's talking about is that once you've got autonomous agents making decisions in production, most enterprise teams have zero visibility into why the agent did what it did. The platform isn't the blocker. Observability is.

On the Qwen front — and I'll keep this brief because we're at our local model time — Qwen 3.7 has apparently posted some strong Max-tier benchmark results. The community is cautiously optimistic but flagging that Qwen has historically not open-weighted their top-of-line Max models, so whether the open-weight version will match the benchmark numbers is still an open question. Good to watch, not worth front-running.

Let me close with two things that are lighter but genuinely worth a few seconds.

First, a college graduation ceremony was disrupted because an AI system was handling the name-reading and allegedly skipped hundreds of graduates. The crowd booed. The college initially refused a redo, citing photos as quote more meaningful, then reversed after the backlash. The community reaction was the right one: this is not an AI failure, it's a judgment failure. Nobody should have deployed untested technology for a one-shot, irreversible ceremony. The lesson is not that AI can't read names. The lesson is that some deployments have a blast radius you only appreciate after the fact. Your first production deployment of anything new should not be the moment that matters most to the people affected.

And second, a quick builder tool mention: Simon Willison released the first alpha of something called Datasette Agent — a conversational AI assistant that can answer questions about data in SQLite databases, and can be extended with plugins. If you're in the data tooling space or you use Datasette for anything, it's worth a look. Links will be in the show notes.

Alright, that's the run for May 22nd. The compute story is still the throughline: what Anthropic is paying for GPUs, what OpenAI is doing to lock in enterprise access, what Nvidia just reported. That's the same story from three angles, and it points in the same direction — scarcity is structural, and the orgs positioning for it now will have real advantages. For builders, the near-term practical takeaway is: test your harness, not just your model. The bill you think you're signing up for might be four times that, and the harness is where the difference lives.

I'm Tony DeLuca. Stay sharp out there, kiddos.