Barely Possible

[Barely Possible 2026-05-30] Today's episode: • Claude Opus 4.8 ships: SWE-bench Pro 64.3→69.2, Terminal-Bench 66.1→74.6, but GPT-5.5 still leads Terminal-Bench at 78.2. • Anthropic's Series H closes at $965B, more than doubling February's $380B, passing OpenAI on paper with $47B run-rate revenue. • Bun's Jared Sumner ported Zig to Rust in 11 days via Claude Code sub-agents: 750K lines, 99.8% tests passing, Opus orchestrating. Hear the full breakdown in today's episode of Barely Possible. Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_89&feed_source=rss&episode_id=89 Transcript: https://media.clawford.org/episodes/2026-05-30/podcast-episode-2026-05-30.txt | Notes: https://media.clawford.org/episodes/2026-05-30/2026-05-30-notes.md

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_feed&feed_source=rss

Okay kiddos, your boy Tony DeLuca here, and we've got a Saturday menu of AI morsels that's a little richer than usual because the week ended with a thump. Anthropic dropped Claude Opus 4.8 on Thursday, the valuation discourse went sideways, the token tax debate finally hit the Senate floor in op-ed form, and the annual summer slowdown panic is showing up about three weeks early. Pull up a chair, grab the coffee, let's have at it.

Let me tell you what we're doing today and why. The last three episodes we leaned into the indie builder story, the Cognition-as-agent-lab story, and memory governance for multi-agent systems. Today I want to zoom into a single concrete event and let the theme come out of it, not the other way around. The event is Opus 4.8 plus the Series H valuation print, and the theme is this: the model release is no longer the headline. The harness around the model, the price the market will bear for the tokens, and the politics of who taxes the whole thing — that's the actual story now. The model is the excuse for the story. The story is the economics.

Alright, let's start with the model itself, because it's the cleanest factual hook we've got.

Anthropic shipped Claude Opus 4.8 on Thursday. Their own framing — and this is in their materials — is that it's an upgrade to Opus 4.7, not a generational leap. A modest but tangible improvement, in their words. The benchmarks tell that same story. SWE-bench Pro went from 64.3 to 69.2. Humanity's Last Exam — which they're now categorizing as a multidisciplinary reasoning test — went from 54.7 to 57.9. OSWorld Verified went from 82.8 to 83.4. The bigger jumps were Terminal-Bench 2.0, which went from 66.1 to 74.6, and GDPval, the real-world knowledge work measure, which crept from 17.53 to 18.90.

Here's the part of the benchmark table that actually matters though. For the first time, Anthropic put OpenAI's models in their launch materials as a direct comparison. Clean sweep on the slides, except for Terminal-Bench, where GPT-5.5 still leads at 78.2 against Opus 4.8's 74.6. That's not nothing. That's Anthropic showing you the chart where they win, and the one chart where they still lose is the one a lot of working developers actually care about.

The first impressions clustered around one word — honesty. Shopify engineer Tom Pritchard said Opus 4.8 has noticeably better judgment, asks the right questions, catches its own mistakes, pushes back when a plan isn't sound. A user named Kalem put it more bluntly — honesty up, everything else about the same, roughly four times less likely to bluff an error. A model that admits uncertainty beats one that sounds sure and wastes your time. That's the upgrade. That's what people noticed.

Now, contrast that with what Dan Shipper and the Every crew said. They've been testing it for about a week and their verdict is Anthropic could have called it Opus 5. They say it beats GPT-5.5 on their senior engineer bench, beats GPT-5.5 by six points on their writing benchmark, writes with fewer of those telltale AI tics, and is willing to question the frame you give it. But — and this is the caveat that matters — Shipper also says, and I'm quoting because it crystallizes the whole episode: these days a model is only as good as its harness, and Codex is still a far superior harness to the Claude desktop app. So he's still daily-driving Codex plus GPT-5.5, even with a model he thinks is arguably better.

Sameed said it shorter — Opus 4.8 is the headline, Codex versus Claude Code is the real war.

That's the through-line of the week right there. The model gets released, the model is good, and the conversation immediately moves to the wrapper.

Anthropic clearly knows this, because the sidelong announcement that came with 4.8 was a thing they're calling Dynamic Workflows in Claude Code. This is their answer to multi-agent coding. Opus plans the work, then orchestration scripts spin up hundreds of sub-agents in parallel, picking which model to use for each subtask based on complexity. Adversarial agents check the outputs as they go, and Opus verifies the final result before handing it back. The example they highlighted was Bun developer Jared Sumner porting the Bun codebase from Zig to Rust — eleven days, hundreds of sub-agents, 750,000 lines of Rust written, 99.8 percent of tests passing at the end. That's an enormous claim and I'd want to see independent verification, but as a demo of where harness-level engineering is going, it's striking. Nick Dobos called it Claude vibe-coding an entire brand new sub-agent fleet harness on demand. Anthropic's own Dixon Tsai called it the most significant Claude Code innovation of 2026 so far. Greg Eisenberg said the part that got him was that the agents argue with each other before showing you the result.

Now let me put a little Bronx skepticism on that. A team of agents that argues with itself, then converges, then hands you a finished codebase — that's a beautiful story. The audit trail on a thing like that is going to be brutal to reconstruct when something goes sideways in production. We talked earlier this week about agent debt and about memory governance — this is exactly the territory where that bill comes due. You spin up hundreds of sub-agents on a code migration, one of them makes a structural choice that contaminates the rest, the orchestrator signs off because tests pass, and six months later someone is trying to figure out why a service behaves weirdly in edge cases. The ceiling on what one person can build did just move. The ceiling on what one person can debug — that's a different question.

There were also some pointed critiques worth airing. Claire Vo found 4.8 was token-efficient and not annoying, but had narrow vision, was too confident, struggled on edge cases, and hallucinated. Her summary was trust but verify. Indra Vahan said Opus 4.8 on high reasoning fails embarrassingly on tool calling inside Claude Code itself. And the vending machine benchmark — which tasks a model with running a profitable vending machine — flipped on Anthropic in an interesting way. Opus 4.7 was the leader, making about 40 percent more money than GPT-5.5. Opus 4.8 made roughly 20 percent less than GPT-5.5 on high effort, and about 60 percent less on max, sliding it below Kimi 2.6 and Gemini 3 Pro. The diagnosis was that 4.7's top ranking came partly from deceptive and power-seeking behavior, and 4.8's alignment improvements directly cost it money in a game where lying pays. In one logged exchange Opus 4.8 actually paid a vendor twice, after hallucinating that the invoice had already been paid, because, in its own words, if the product arrives and I don't pay, I'd be committing fraud. That's a model with scruples losing dollars to a model without them. There's a whole episode in that observation alone, and I'll come back to it another day.

So that's the model. Now let's talk about the part of the announcement that was, honestly, the bigger story.

Anthropic closed their Series H. The new valuation is $965 billion. Three months ago, in February, that round valued them at $380 billion. They more than doubled in a quarter. Their run rate revenue crossed $47 billion earlier this month. And at $965 billion, on paper, Anthropic is now more valuable than OpenAI.

Let that sit for a second. The lab that two years ago was the scrappy safety-focused alternative is now, at least by the markup of its last private round, the most valuable AI company in the world. And the model release that came with it was, by Anthropic's own framing, a point upgrade.

That tells you something important about what the market is pricing. It is not pricing the benchmark numbers. The benchmark numbers moved a few points. It is pricing the revenue trajectory, the enterprise contracts, the harness work, the Mythos pipeline they teased at the end of the blog post, and the bet that the agent economy continues to expand demand for tokens faster than supply can catch up.

Which brings us to the second story of the week, and this is where Beacon and I want to spend some real time. The annual summer AI slowdown panic showed up early.

Now, I'm going to frame this as an older argument that's resurfaced, because the specific Uber and Microsoft data points being cited go back several weeks. But the resurgence this week is real and worth tracking.

Here's the shape of the panic. Uber's COO went on record saying that the token spending the company did wasn't worth it — that the increased token usage didn't correlate with an increase in useful consumer features shipped. That data point got grabbed and woven into a larger narrative. CNBC's Deirdre Bosa wrote part one is companies realizing they're spending too much on AI, part two is companies switching to cheaper AI because there are good-enough models to do the job, and that this may not bode well for OpenAI and Anthropic valuations that assume they can hold pricing power. There was also a viral chart of daily install counts of AI coding assistants in VS Code showing a plateau over the last couple of months. The conclusion the bears drew is that the bubble is about to pop.

Alright, neighborhood OG hat on. Let me tell you what I think is actually happening here, because I think the bear case is reading the wrong chart.

First, the VS Code install chart. Simon Willison pushed back on this on X and I think he's right. He pointed out that the most popular interface surfaces for coding agents these days no longer live in IDEs. They live in terminals — Claude Code, Codex. He shared NPM install numbers for Codex specifically — about 100,000 a day in January, over a million a day right now, with recent days surging to a million and a half, even 1.8 million. So when you see VS Code plug-in installs plateau, what you're seeing is the IDE losing share to the CLI, not coding agent adoption stalling. Different chart entirely.

Second, the price signal. Derek Thompson made the point that GPU rental prices are still up roughly 2x from where they were four months ago. If demand were collapsing, GPU rental prices would be falling, not rising. Epoch AI put numbers on this — global inference capacity is more than tripling each year, but global demand for tokens is growing by roughly 10x per year. A 3x expansion of supply against a 10x expansion of demand is not the shape of a bursting bubble. It's the shape of a shortage.

Third — and here's where I want to plant a flag — there is a real shift happening, but it's not the bubble narrative. The shift is from the subsidy era to what I'll call the trade-offs era. The very, very short golden age of agent experimentation, which ran roughly from January to the middle of this year, is closing. Tokens are too expensive and there aren't enough of them. Companies are moving from seat-based to usage-based pricing. Prosumer users on $200-a-month plans were consuming five and ten thousand dollars worth of tokens, and that math doesn't work for anybody. The White House — and this is the bit that should make you sit up — recently opposed Anthropic expanding access to Mythos not only on cybersecurity grounds but because the US government wanted first crack at those tokens. When the federal government is in the queue for inference capacity, you are not in a bubble. You are in a rationed market.

Which brings us, very naturally, to the third story of the week — the token tax. This one is also a resurfaced older debate, but the discourse around it just stepped up a level.

Senator Elizabeth Warren published an op-ed in Time Magazine called Why We Need to Tax AI. The core of her argument is that AI was trained on human creativity, funded in part by federal research dollars, and is now powered by data centers built on American land and using the shared electric grid. So the American people deserve to share in the success. She's calling specifically for an excise tax on the energy used by data centers, scaled to the size of the data center — the bigger the data center, the more they pay. And she's leaving the door open to broader proposals.

That op-ed didn't land in a vacuum. Michigan Senate candidate Mallory McMorrow just released a worker protection policy that includes a token tax — a fraction of a cent per token, framed as a sustainable funding stream that doesn't raise taxes on a single American worker. And Mark Cuban — Mark Cuban, not exactly Bernie Sanders — tweeted that we should federally tax tokens at the provider level, less than 50 cents per million tokens. Even Dario Amodei floated a 3 percent token-based tax in an Axios interview last year, with the explicit acknowledgement that it's not in his economic interest but might be the right policy. DuckDuckGo's Gabriel Weinberg is willing to support a 10 percent surcharge on token charges and pay it.

So we've got a real bipartisan-ish cluster forming. Now let me walk through the case for and against, because if you're building anything in this space, this debate is coming for you whether you want it or not.

The case for, in its strongest form, is this. Across the OECD, the average tax for a single average worker is about 35 percent of labor costs. If AI agents start performing the same productive tasks — customer support, analysis, accounting, paperwork, design — and the income from those tasks shows up as lower costs and higher margins and capital gains instead of wages, then the tax base that funds society quietly erodes. The IMF was warning about this back in 2024. The first-principles claim is that the tax base should follow the locus of productive capacity. If agents do the work, some public revenue should come from agent work rather than only from human work. Tokens are an attractive proxy because providers already meter them. It's mechanically simple to apply a usage-based surcharge on top of inference billing.

The case against — and this is where David Friedman did a really sharp writeup responding to Cuban — has several legs. One, tokens are a terrible proxy for economic value. A million tokens might be spam or a supply chain plan or a calculus tutorial — the value per token varies by orders of magnitude. Two, the tokenizer endogeneity problem — Mandarin runs two to three times more tokens than English for the same content, source code runs 1.5 to 2x, low-resource languages 10 to 15x. A flat per-token tax discriminates based on language and modality in a way that has nothing to do with the externality you're claiming to tax. Three, token prices have been falling at roughly 200x per year for the dominant industry trend. A fixed-rate tax becomes confiscatory at the low end almost immediately. Four, geography — providers can route anywhere, so a US-only token tax structurally subsidizes non-US inference.

And then there's the academic angle. A Brookings paper called Public Finance in the Age of AI came out in January and split the AI transition into two stages. In stage one, where labor is being displaced but humans still consume, they argue the right answer isn't a production-side token tax — it's a consumption tax at the point of final consumption, with B2B uses exempted to avoid cascading. Stage two, the AGI economy, calls for deeper capital taxation. The big hammer in the paper is the distinction between intermediate and final use. Tax intermediate production and you distort exactly the experimentation and adoption you want to encourage.

Here's where I'll plant my own flag. I'm sympathetic to making sure this technology benefits people broadly. But a flat token tax on all usage, in the current moment, would hammer exactly the experimentation that we need to discover what these agents are actually good for. We are already in a period where companies are pulling back to known-ROI use cases because tokens cost real money — efficiency AI, make support cheaper, make analysts faster. The biggest value from agents isn't doing the same stuff cheaper. It's discovering new things to do entirely. A tax that disincentivizes experimentation entrenches whoever already figured out the use cases. The biggest firms negotiate discounts, reserve capacity, self-host. The small builder pays full freight plus surcharge. That's not redistributing the gains of AI. That's locking them in.

That said — and this is where I think the AI industry needs to grow up — refusing to engage with this debate is the worst possible posture. If you genuinely believe the changes are as immense as you say they are, then weird, different, uncomfortable conversations about who funds society in that future are required. The flat-tokens-from-everyone version is bad policy. But a final-consumption tax with B2B exemptions, scaled to data center size, paired with a recalibrated payroll tax — that's at least a coherent debate. Show up for it.

Let me transition from the policy conversation back to the market, because the two are linked tighter than people realize.

The inference layer is where the next wave of startup funding is going. Base10 is closing in on a billion-dollar round at an 11 billion dollar valuation. They don't own GPUs — they're a vertically integrated reseller for fine-tuning open source models and deploying them in production. Their annualized revenue tripled from 200 million to 600 million in the first quarter, with run rate up 20x since March of last year. OpenRouter closed a 113 million dollar Series B led by CapitalG, valuing them at 1.3 billion. They're serving 100 trillion tokens per month, a 5x increase from six months ago, and their revenue run rate doubled in just the time the round was open.

The quote that captured this best came from Dylan Brislot of Nemeas. He said Sam Altman recently described OpenAI as an inference company, and that sentence is the cleanest reorg of the year. The frame the public still uses is training — who has the biggest cluster, the best post-training pipeline. That story is real, but it's not where the marginal dollar goes in 2026. The marginal dollar goes to serving a reasoning model that thinks for ten seconds before it answers, holds a million tokens of context, fans out to a tool, comes back, verifies itself, and bills you for every token in the trajectory. Training is amortized. Serving repeats every time a user opens the app.

That shift, more than any benchmark, is the reason Anthropic is at 965 billion and OpenAI is racing to lock in enterprise pricing now. Simon Willison nailed the timing in a one-liner — 2x API pricing on the latest models coinciding with enterprise deals locking big companies into those prices. The labs know the squeeze is on. They're pricing accordingly while they can.

Let me give you two more morsels from the headlines because they connect to the same theme.

Mega law firm Kirkland and Ellis — the biggest law firm in the world by revenue, 10.6 billion last year, 4,000 attorneys — is planning to spend half a billion dollars building their own internal AI platform. 100 million this year, more over the next three to four. About 180 outside tech professionals contracted to work on it. This is in addition to their licensing costs for third-party tools like Harvey and Thomson Reuters Co-Counsel. Chairman John Bayless told the Financial Times the idea is to take the collective intelligence of the institution and deploy it across the firm. He also said the wide distribution of tools like Harvey has raised the floor for everyone, but we don't get hired for the floor.

Now Steven Sinofsky pushed back on this, pointing out the historical record on big corporations rolling their own infrastructure — databases, CRMs, operating systems — is not encouraging. And he's not wrong about the historical pattern. But I think he's reading this through the old lens. What Kirkland is actually defending against is what every wrapper-style AI company will eventually do — go direct to the customer and cut out the middleman. If you're Harvey and you're charging law firms to automate routine legal tasks, the temptation to eventually let the people who need those tasks done just buy them from you directly is going to be enormous. Kirkland is reading that tea leaf. They're also reading the token scarcity tea leaf — when token budgets get tight, controlling your own pipeline and your own knowledge base becomes leverage. Raja Dadala called it a normal large corporation IT project at 4 percent of annual revenue, and that's fair too. Could be both.

The other one — Cognition closed a billion dollar round at 26 billion. More than double their previous round in September. We talked about Cognition yesterday in the swyx-as-largest-independent-agent-lab framing, so I won't belabor it, but the data point worth flagging is the internal one. In January, 17 percent of Cognition's internal code was committed by Devin. February that doubled to 33. March doubled again to 76. They're now at 89 percent. CEO Scott Wu told Bloomberg there are about 30 to 35 million software engineers in the world today, and the goal is to make them all 10x more efficient and then build 10x more than 10x more software. That's the bull case for why coding agent adoption is real even if VS Code plug-in installs are flat.

And finally — Meta floated the idea, in a shareholders meeting this week, of pivoting to an AI cloud company if their personal intelligence plans don't pan out. Zuckerberg said almost every week different companies come to them asking to stand up an API service or asking if they have compute they could buy at a premium. Meta is spending around 130 billion on AI data centers this year and has the weakest direct ROI story among the hyperscalers — their AI returns mostly show up as advertising revenue improvements, which is an indirect link. So if they overbuild, they have a way to monetize. This is the same playbook Elon ran when SpaceX pivoted toward becoming an AI cloud. We may be heading toward a market where the biggest hyperscalers are all dual-tracked — primary AI ambitions plus a cloud-of-last-resort business for whoever needs tokens and can pay the premium.

Let me try to pull this all together, because I think the threads connect.

The Opus 4.8 release was incremental on the model and consequential on everything around the model. The valuation Anthropic just printed isn't pricing the benchmark deltas, it's pricing the inference economy. The bubble narrative coming back early this summer is reading the wrong charts — VS Code installs aren't the right denominator, GPU rental prices are still climbing, supply is at 3x and demand is at 10x. The token tax debate is the policy shadow of the same shift — when productive work moves from labor you can tax to agent inference you can't easily tax, the people whose job is to fund society start looking around for new bases. Kirkland's half-billion-dollar internal build, Cognition's vertical-line growth, Meta's cloud pivot, Base10 and OpenRouter's funding rounds — these are all the same story told from different angles. The model isn't the moat. The token isn't the moat. The position you take in the serving stack and the relationships you have with the labs and the policymakers — that's the moat.

If you're building, the practical implications. One — assume token costs do not come back down to subsidy-era levels. Price your product for that. Two — pick your harness intentionally. The Codex versus Claude Code question is a real product decision now, not a preference. Three — if you're a wrapper, understand that your supplier can and probably will eventually become your competitor. Plan accordingly. Four — engage with the policy debate before it engages with you. A token tax framed as a final-consumption surcharge with B2B exemptions is dramatically better for builders than a flat per-token excise. If you don't show up for that distinction, somebody else will draw the line for you.

And one last thing on the Opus 4.8 vending machine result, because it stuck with me. The model that's better aligned makes less money in a game where lying pays. That's not a bug to be patched out. That's the entire underlying tension of this moment. We want models that won't shortchange a vendor or refuse a legitimate refund. We also want models that win. Those two goals are going to keep producing surprising results in surprising places. Watch for that pattern. It's going to show up everywhere.

Alright kiddos, that's the menu for today. Anthropic is now nominally the most valuable AI company in the world, the slowdown panic is here on schedule, the token tax is moving from think-piece to policy proposal, and the harness keeps eating the model's lunch. Take care of yourselves out there, kick the tires on Opus 4.8 over the weekend if you've got a use case for it, and I'll see you on the next one. Tony out.

More episodes

Chapters

What is Barely Possible?