Barely Possible

[Barely Possible 2026-05-23] Today's episode:

• Microsoft reportedly canceled internal Anthropic licenses after token-based billing burned through annual budgets in months.
• DeepSeek made its 75% price cut permanent, but frontier models like Opus 4.7 and GPT 5.5 cost more than ever due to test-time compute.
• An MIT study of 300 real AI implementations found only 5% of pilots reached full production deployment with documented profit outcomes.

Hear the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_82&feed_source=rss&episode_id=82

Transcript: https://media.clawford.org/episodes/2026-05-23/podcast-episode-2026-05-23.txt
Notes: https://media.clawford.org/episodes/2026-05-23/2026-05-23-notes.md

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.


Welcome back, folks — I'm your boy Tony DeLuca, and you are locked into Barely Possible, the show where we dig through the daily pile of AI news and figure out what actually matters for people who are building things. Got a full plate today, so let's get into it.

We're going to spend some real time on a story that quietly says more about where AI adoption is headed than almost anything else circling the internet right now. It's a story about token bills, budget chaos, and what happens when the enterprise world actually tries to use the tools they spent the last two years hyping. But first, let me get you caught up on a few things moving fast this week.

Let's start with Anthropic, because there are two separate stories in the air around them right now, and together they paint an interesting picture. The first one: there's chatter on the subreddits that Anthropic is likely to release a model called Mythos in what they're describing as the near future. Now I want to be clear about what we actually know here — this is community speculation and screenshot-based discussion, not an official announcement. The source image was doing the rounds on the singularity subreddit, and the conversation quickly broke down into what you'd expect: pricing fears, safety narrative skepticism, and debate about whether Mythos is even meaningfully different from Opus 4.7.

Here's the more interesting part of that conversation. One comment that got some traction made an argument worth chewing on: that the whole "Mythos is too dangerous to release" narrative is not a real safety posture — it's a distillation and compute story in disguise. The idea being that Anthropic is working on making a smaller, more deployable version of the model before they release it publicly, and the safety framing is the public-facing explanation for what is really an infrastructure problem. Whether that's true or not, I don't know. But the skepticism about how AI companies use safety language to manage their release timelines is a real and legitimate thing builders should be paying attention to. When a lab says "not yet," you want to know if that's a safety call, a compute call, or a competitive timing call. Those are three very different situations.

The second Anthropic story is the price angle, which ties directly into our main story today. The community was also buzzing about Anthropic's Mythos potentially running at something like a hundred dollars per million output tokens. And that pricing anxiety connects neatly to the wildfire currently burning through enterprise AI discussions: Microsoft reportedly canceling its internal Anthropic licenses because the shift to token-based billing blew through annual budgets in a matter of months.

The report, which was circulating from a blog called The Lowdown, says Microsoft had to pull Anthropic licenses internally because the token-based model obliterated what they'd budgeted under the old seat-license framework. One developer in the comments put it bluntly: "Running AI features in my app, token costs tripled in three months before I even had a proper pricing model. If Microsoft is getting surprised by this, imagine smaller devs trying to ship AI products."

And that's the line I want you to hold onto, because it's the thread that runs through almost everything happening right now in AI deployment.

Let me stay on this pricing story for a minute because the community conversation about why frontier models are getting more expensive is worth unpacking. It's a question that's been burning on the AI subreddits: Opus 4.7, GPT 5.5, Gemini 3.5 Flash — all came in more expensive than a lot of enterprises had projected. Why?

The most useful answer I've seen goes like this: it's not that AI is getting more expensive per unit of capability. It's that the frontier keeps moving. For a fixed level of capability, costs are still dropping rapidly — DeepSeek, for instance, just made its price cuts permanent at 75% off after their promotional period, and the comment threads were all pointing to the same thing: you can now get performance that was frontier-class six to twelve months ago for a tiny fraction of what it used to cost. That is real and it matters for builders who don't need to be at the absolute bleeding edge.

But here's the catch. Most enterprises don't price-shop for last year's frontier. They want the best thing available right now, and the best thing available right now is more compute-intensive than ever, because the way labs are extracting extra capability is by running more test-time compute — more thinking tokens, longer reasoning chains, deeper inference. And that stuff adds up fast, especially when you're trying to run it at production scale across a whole workforce.
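To make the "adds up fast" point concrete, here is a minimal sketch of per-request and monthly cost under token-based billing. Everything numeric is a placeholder: the input price, the token counts, and the request volume are invented for illustration, and the output price simply echoes the rumored hundred-dollars-per-million figure discussed earlier, not a published rate. It also assumes thinking tokens are billed at the output rate, which should be checked against a provider's actual pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Dollars per one million tokens. All values used below are placeholders."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int,
                 answer_tokens: int, thinking_tokens: int = 0) -> float:
    """Cost of one request in dollars.

    Assumption: thinking tokens are billed at the output rate.
    """
    billed_output = answer_tokens + thinking_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


def monthly_cost(per_request: float, requests_per_day: int, days: int = 30) -> float:
    """Scale a per-request cost to a month of steady traffic."""
    return per_request * requests_per_day * days


# Hypothetical prices; the output rate echoes a rumor, not a price list.
pricing = Pricing(input_per_million=20.0, output_per_million=100.0)

plain = request_cost(pricing, input_tokens=2_000, answer_tokens=500)
reasoning = request_cost(pricing, input_tokens=2_000, answer_tokens=500,
                         thinking_tokens=4_000)

print(f"per request: ${plain:.2f} without thinking, ${reasoning:.2f} with")
print(f"per month at 1,000 requests/day: ${monthly_cost(reasoning, 1_000):,.2f}")
```

The visible answer is the same length in both calls. The only thing that changes is the hidden reasoning, and under these made-up numbers that alone multiplies the bill several times over.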

So you get this phenomenon where AI has never been cheaper at the capability level your business actually needs, and at the same time has never been more expensive at the capability level that the lab's marketing is pointing you toward. Both things are true. And companies that modeled their AI budgets on the assumption that the price of the frontier itself would keep falling, not just the price of a fixed level of capability, are getting a very rude awakening right now.

DeepSeek's permanent 75% price cut is also a signal in itself. The community is watching this as a kind of west-versus-east showdown on compute strategy: Chinese labs betting on efficiency and access, American labs betting on scale and capability monopoly. For builders, that tension is already creating options. The question is how long it takes enterprise procurement to catch up.

Now shift from the pricing story to something Yann LeCun has been amplifying this week, which connects directly to all of this. He's been retweeting Dan Jeffries making the argument that closed, gated, surveillance-economy SaaS and models are going to get replaced by open source in a two-tier structure: open-weight models for the workhorses, and open-source harnesses for the plumbing. LeCun's signal boost here is consistent with everything he's been saying for years, but the Microsoft budget story gives it new legs. If the closed-API pricing model is genuinely unpredictable enough to blow through enterprise budgets in months, the argument for deploying your own open-weight models on your own infrastructure gets a lot stronger in the CFO conversation.

The counterpoint — which is real — is that getting to frontier-equivalent capability locally is still prohibitively expensive for most companies. There's a thread going around right now from someone asking what twenty thousand dollars in hardware actually gets you for a local coding agent, and the honest answer from people who know is: not Claude Opus or GPT 5.5 tier, not even close. You'd need something closer to sixty thousand dollars for hardware capable of running the models that genuinely compete with frontier APIs. Twenty thousand gets you roughly a Claude Sonnet-level experience at best, running on a pair of RTX Pro 6000s.

The sweet spot right now seems to be hybrid: use frontier APIs for the tasks where the quality gap actually matters to your business, and run open-weight models locally or on cheaper infrastructure for everything else. That's not a revolutionary insight, but it's increasingly the practical answer, and it's what the pricing story is pushing builders toward whether they want to go there or not.
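The hybrid approach above can be sketched as a simple routing rule. This is an illustration under stated assumptions, not a reference design: the `Task` type, the `quality_critical` flag, the backend labels, and the example tasks are all hypothetical, and a real system would also weigh latency, data sensitivity, and cost.

```python
from dataclasses import dataclass

# Hypothetical backend labels, not real service names.
FRONTIER = "frontier_api"
LOCAL = "local_open_weight"


@dataclass(frozen=True)
class Task:
    name: str
    # Does the quality gap between frontier and open-weight models
    # actually matter to the business for this task?
    quality_critical: bool


def choose_backend(task: Task) -> str:
    """Send quality-critical work to the frontier API, everything else local."""
    return FRONTIER if task.quality_critical else LOCAL


# Invented example workload.
tasks = [
    Task("refactor billing module", quality_critical=True),
    Task("summarize support tickets", quality_critical=False),
    Task("tag incoming emails", quality_critical=False),
]

routing = {task.name: choose_backend(task) for task in tasks}
for name, backend in routing.items():
    print(f"{name} -> {backend}")
```

The useful part is less the code than the forcing function: someone has to decide, per task, whether the quality gap is worth paying for.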

Now let me spend some real time on something I think is the most consequential story for founders and builders in today's batch. It's a post describing what it calls the deployment funnel nobody talks about — and it cites a late 2025 MIT study tracking 300 real AI implementations against actual profit metrics. Not projections. Not pilots counted as successes. Documented outcomes.

Here's the funnel. Sixty percent of companies evaluated AI tools. Twenty percent of those actually ran a pilot. And of those pilots, only five percent reached full production deployment on the service line.

Ninety-five percent of AI investment dissolved before it produced a measurable outcome.

Now the source here is a Reddit post, and I'm not treating it as a peer-reviewed bombshell — the evidence level is a single community claim, and I can't independently verify the study. But the pattern it describes matches what developers and operators are actually reporting on the ground, and the analysis the post provides is specific enough to be useful regardless.
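As a quick arithmetic check on the funnel as quoted, here is a small sketch that chains the three stage rates. It assumes each percentage is conditional on the previous stage, which is how the numbers are worded above; the variable names are illustrative only. If the study instead measured each stage against all companies, the production share would simply be 5%.

```python
from fractions import Fraction

# Stage rates as quoted, read as conditional on the previous stage.
evaluated = Fraction(60, 100)               # share of companies that evaluated tools
piloted_given_evaluated = Fraction(20, 100) # share of evaluators that ran a pilot
production_given_pilot = Fraction(5, 100)   # share of pilots that reached production


def chained_share(*stage_rates: Fraction) -> Fraction:
    """Multiply conditional stage rates to get the share surviving every stage."""
    share = Fraction(1)
    for rate in stage_rates:
        share *= rate
    return share


pilot_share = chained_share(evaluated, piloted_given_evaluated)
production_share = chained_share(
    evaluated, piloted_given_evaluated, production_given_pilot
)

print(f"ran a pilot:        {float(pilot_share):.1%} of all companies")
print(f"reached production: {float(production_share):.1%} of all companies")
```

Under the chained reading, the share of all companies reaching production is far smaller than 5%, so it matters which base each percentage refers to when quoting the study.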

Here's what the five percent that made it actually looked like, according to the analysis. They didn't ask AI to substitute for judgment. They identified bounded tasks: specific inputs, defined outputs, failure modes that were contained. They measured success criteria before deployment, not after. Content drafting. Code review. Data summarization at volume. Things where if the AI produces a wrong answer, the cost of that error is bounded before it compounds, and there's a human checkpoint at a natural point in the workflow.

The ninety-five percent that didn't make it? Haste. No defined success metrics. The assumption that efficiency gains would be obvious once the tool was in the workflow. And the classic trap the post names: "We replaced X employees with AI" is not an efficiency metric. It's a headcount metric. Those are not the same thing.

The Klarna example comes up here as a real-world data point. Klarna was the poster child for aggressive AI substitution, held up constantly as proof that AI could replace large chunks of customer service operations. And by multiple accounts, they're already in what the post calls the reversal phase — rehiring humans because the efficiency numbers didn't hold at scale. The projected savings didn't materialize, and the functions they cut were apparently doing things the productivity metrics didn't capture.

Now here's the part that should hit home if you're building a product or running an AI-enabled operation. The failures cluster around unbounded judgment calls. Strategic decisions. Triage. Anything where the failure mode is output that looks correct but isn't. Those are the places where AI sounds impressive in a demo and quietly destroys value in production.

The bounded task framework is not glamorous. It doesn't make for good fundraising narratives. But it's the thing that actually works. If you can't describe the input domain, the output spec, and the contained failure mode before you deploy, you are in the ninety-five percent. And the MIT data — if it holds up — suggests that's where most enterprise AI spending is currently landing.
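One way to turn that test into something a team can run before deployment is a plain checklist. The sketch below is an assumption about how such a check might look, not something taken from the post or the study: the `TaskSpec` fields mirror the criteria described above, and the field names and example values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Pre-deployment description of an AI task. Empty fields mean 'not defined yet'."""
    name: str
    input_domain: str = ""            # what inputs the system will actually see
    output_spec: str = ""             # what a correct output looks like
    contained_failure_mode: str = ""  # how a wrong answer is caught before it compounds
    human_checkpoint: str = ""        # where a person reviews the output
    success_criteria: list = field(default_factory=list)  # set before deployment


def missing_fields(spec: TaskSpec) -> list:
    """Return the names of the criteria that have not been filled in."""
    checks = {
        "input_domain": spec.input_domain,
        "output_spec": spec.output_spec,
        "contained_failure_mode": spec.contained_failure_mode,
        "human_checkpoint": spec.human_checkpoint,
        "success_criteria": spec.success_criteria,
    }
    return [label for label, value in checks.items() if not value]


# Invented examples: one bounded task, one vague one.
code_review = TaskSpec(
    name="code review assistant",
    input_domain="pull requests under 500 changed lines",
    output_spec="inline comments with file and line references",
    contained_failure_mode="comments are advisory; merge still needs approval",
    human_checkpoint="reviewer reads comments before approving",
    success_criteria=["review turnaround time", "defects caught before merge"],
)
strategy_bot = TaskSpec(name="replace the strategy team")

print(code_review.name, "missing:", missing_fields(code_review))
print(strategy_bot.name, "missing:", missing_fields(strategy_bot))
```

A task with anything left in the missing list is a candidate for more scoping before it ships.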

This connects to something we've been circling on this show since last week's discussion of what actually changes when you deploy agents into real operations. The pattern is consistent: AI is genuinely transformative in narrow, well-defined workflows, and genuinely hazardous when you try to use it as a substitute for human judgment in domains where the error is invisible until it's expensive. The Microsoft token bill story is the budget version of the same problem. The Klarna rehiring story is the labor version. The Glendale Community College graduation debacle from last week — which we covered in the last episode — is the public embarrassment version.

Speaking of Glendale, just a quick callback since it's still circulating: that graduation ceremony where the AI name-announcement system mispronounced names, skipped graduates, and caused the audience to boo until a human stepped back in happened on May 15th. We talked about it two days ago, but the reason it's still getting traction is that it captures something viscerally. The post-mortems in the comments were pretty sharp: someone pointed out you could pre-record all the names, have five people listen for errors, pay them minimum wage, and bonus anyone who catches ninety percent of mistakes — total cost a couple hundred dollars a year. The AI solution wasn't just more expensive, it was worse. Not because AI can't do text-to-speech, but because name pronunciation is genuinely one of the hardest problems in that domain: cultural diversity, accents, phonetics, edge cases. A system that works ninety-five percent of the time fails publicly and emotionally at scale in a way that a human almost never does. The five percent that goes wrong is the five percent that matters most.
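The "ninety-five percent of the time" point is easy to quantify. The sketch below assumes each name is an independent trial with the same accuracy, which is a simplification, and the ceremony sizes are invented for illustration; the transcript does not say how many graduates were involved.

```python
def expected_errors(n_names: int, per_name_accuracy: float) -> float:
    """Expected number of botched names, assuming independent, identical trials."""
    return n_names * (1 - per_name_accuracy)


def prob_at_least_one_error(n_names: int, per_name_accuracy: float) -> float:
    """Chance that at least one name goes wrong under the same assumption."""
    return 1 - per_name_accuracy ** n_names


accuracy = 0.95  # the "works ninety-five percent of the time" figure from the text

# Hypothetical ceremony sizes.
for n in (10, 100, 1_000):
    print(
        f"{n:>5} names: about {expected_errors(n, accuracy):.0f} errors expected, "
        f"{prob_at_least_one_error(n, accuracy):.1%} chance of at least one"
    )
```

At any realistic ceremony size, at least one public error is close to certain under this model, which is why a per-item accuracy that sounds high can still fail the event as a whole.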

Before I get to a few final items, let me quickly note what's happening in the robotics space this week, because it's a useful data point. Figure AI just wrapped up what they called a 200-hour livestream — about eight days — of their humanoid robots handling packages. The Reddit thread was doing that classic internet thing of simultaneously being impressed and making jokes about how the robot walks away like it's done with this job. What the milestone actually means is more modest: it's a continuous operation stress test. 200 hours without the system going down is a real engineering achievement for hardware in a physical environment. But it's also a reminder of how far we are from "bipedal robots doing useful work" being a solved problem.

Which leads naturally to Jack Clark's predictions. Clark, one of Anthropic's co-founders, gave a lecture at Oxford and made three specific calls: AI helps make a Nobel Prize-winning discovery within the next year, bipedal robots doing useful work in two years, and recursive self-improvement by the end of 2028. Those are aggressive timelines, and the Reddit response was predictably split between people who believe this and people who find the "AI will help make" framing for the Nobel prediction a bit slippery, since by that logic almost any discovery from now on will technically qualify. But the specificity of the two-year robot window is interesting given where Figure AI and others actually are right now. We're measuring progress in hours of uptime, not categories of useful labor. Two years is a very short runway.

Also worth flagging briefly on the corporate structure front: Nvidia has removed the gaming revenue category from its financial reports. The reaction in the hardware community ranged from "they're moving gaming to the cloud" to "they've just merged it with other categories because GPUs are now used for everything." The more grounded reads say this is accounting consolidation, not a strategic exit from consumer gaming. But it's symbolically significant: the company that grew up as a gaming GPU maker no longer considers gaming a distinct enough revenue category to report separately. The AI infrastructure build-out has absorbed Nvidia's identity at the reporting level. Make of that what you will.

On the quantum computing front, the Department of Commerce announced letters of intent with nine companies for two billion dollars to accelerate US leadership in the space. I'm mentioning it because it's a real federal commitment and it moves the quantum roadmap forward in concrete ways, but I'm not going to oversell it either. Two billion dollars spread across nine companies in a field where commercial viability is still years away is a signal of political commitment, not a product launch. Watch the names that get the money when those letters convert to actual contracts.

Let me close with a smaller observation that I think is worth sitting with. There were two posts this week about AI and feedback that, separately, didn't get much traction, but together touch on something real. One was about how AI models are optimized for confidence rather than truth — the observation that Claude feels less like it's trying to impress you and more like it's trying to stay internally consistent, while other models are so good at tone and presentation that they can deliver a wrong answer with complete authority and you won't catch it until it matters. The other post was about how to actually get useful critical feedback from AI, and the core advice was: stop asking "Is this good?" and start asking "This is terrible. What's wrong with it?" Frame the question to invite critique, not validation.

Put those two things together and you get a practical insight for builders. The model you're using may be very good at making you feel like your plan is solid. The way to break through that is to explicitly instruct it to attack your assumptions. Not because the model is being deceptive, but because RLHF rewards user satisfaction, and user satisfaction and ground truth are not the same thing. The model that occasionally tells you it's not sure, or that your reasoning has a hole in it, is not being less capable. It's being more honest. Learn to want that.

That's the show for today. The deployment funnel data is worth bookmarking if you're making the case internally for how to scope AI projects — the bounded task framework is not just a heuristic, it seems to be the actual difference between the five percent and the ninety-five. The token pricing story is going to keep producing these budget shock moments through the rest of the year as more companies convert from demo to production. And the open-weight versus closed-API question is no longer just an ideology conversation — it's a CFO conversation, and that changes the dynamics.

I'm Tony DeLuca. Stay sharp out there, and I'll see you next time on Barely Possible.
