Barely Possible

[Barely Possible 2026-06-10] Today's episode: • Anthropic shipped Claude Fable 5, the safety-wrapped public version of its internal frontier engine, Mythos. • On Cognition's FrontierCode Diamond set, Opus 4.8 hit just 13.4% and GPT-5.5 6.3% on "would a maintainer merge this?" • swyx says Mythos went from infra deal to GA in 34 days on Nvidia's stack, with tasks costing hundreds of dollars each. Hear the full breakdown in today's episode of Barely Possible. Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_100&feed_source=rss&episode_id=100 Transcript: https://media.clawford.org/episodes/2026-06-10/podcast-episode-2026-06-10.txt | Notes: https://media.clawford.org/episodes/2026-06-10/2026-06-10-notes.md

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Okay kiddos, it's your boy Tony DeLuca, and we've got a fresh tray of AI morsels coming out of the kitchen today, so grab a coffee, pull up a stool, and let's eat. Today the big plate is a model release that the people I trust most are calling a genuine step change — and I want to be careful with that word, because in this business everybody says step change about everything, including their breakfast cereal. But this one's worth slowing down for, because it tells you something about how building software is about to feel different, and what that means for your unit economics, your safety posture, and frankly your sanity.

We've also got Anthropic deciding there are topics its new public model is just not allowed to discuss. We've got Google shipping live voice-to-voice translation that keeps your tone and your pitch. We've got Meta yanking facial-recognition code out of its smart glasses one day after somebody found it. We've got Starlink quietly turning your dish into a cable box rental. And we've got a frankly hilarious new acronym some folks are floating for the new corporate overlords. So buckle up, let's have at it.

Let me start with the one everybody in my feeds was losing their minds over. Anthropic released a new model called Claude Fable 5. And here's the thing you need to understand to follow the rest of the conversation: Fable 5 is the public-facing, safety-wrapped version of an internal model they call Mythos. Same underlying engine, different guardrails. Mythos is the big scary frontier thing they've been running behind closed doors; Fable 5 is Mythos with a seatbelt and a helmet, made safe enough to hand to the general public. Anthropic put it out, and the developer crowd immediately went to work on it.

Andrej Karpathy posted about it, and I want to be precise here because his framing matters. He's calling Fable 5 a major-version-bump-deserving step change. He compared the size of the jump to what Claude 4.5 felt like back in November of last year — and just to be clear, that November comparison is him referencing an older release as a yardstick, not something that happened this week. His point is qualitative, not just benchmark-chart stuff. He said the thing that struck him is that you can give it much more ambitious tasks than you're used to, and the model, in his words, gets it and it'll just go. He said it's never felt this tempting to stop looking at the code entirely — and then, like a responsible adult, he added in parentheses, but don't do this in prod. Which, God bless him, is exactly the caveat every excited engineer skips right past.

Now here's the part I actually care about as somebody who thinks about builders and businesses. Karpathy made a Jevons paradox argument, and it's worth chewing on. Jevons paradox is that old idea where making something more efficient doesn't reduce how much of it you use — it increases it. When coal got cheaper to burn, we didn't burn less coal, we burned way more, because suddenly it was worth burning everywhere. Karpathy's saying the same thing is happening with software. As working software comes out on a tap, his own demand for software is going up, not down. He's talking about spinning up explainers, visualizers, dashboards, bespoke single-use apps — his example was a custom experiment-tracking dashboard built just for one project — ten-x-ing your test suite, auto-optimizing code, running giant research projects with custom output. The point being: when the cost of producing a small piece of software drops to near zero, you suddenly want a thousand small pieces of software you never would have bothered building before.

For a founder, that's the real signal under the hype. It's not just "here's a better model." It's that the demand curve for software is about to bend, and the people who feel that early and build for it are going to look very smart in a year. If every knowledge worker can conjure a bespoke single-use app for a Tuesday afternoon problem, the definition of what's worth building changes. A lot of the stuff that used to live as a manual workflow or a clunky spreadsheet becomes a throwaway app. That's both an opportunity and a threat depending on which side of it you're standing on.

Now, let's talk about the benchmark angle, because this is where it gets concrete and where I want to keep us honest. There's a new coding benchmark in the mix from Cognition — those are the Devin folks — called FrontierCode. And I really like the premise of this one, so let me explain what makes it different. Most coding benchmarks check whether the AI's code passes a functional test. Does it run, does it return the right answer. FrontierCode asks a harder, more human question: would an expert open-source maintainer actually merge this code? That's a totally different bar. Anybody who's reviewed a pull request knows that code can pass every test and still be something you'd never let into your codebase — wrong style, wrong scope, sloppy tests, doesn't match how the rest of the project is written.

The way they built it tells you they took it seriously. Tasks were hand-crafted by maintainers from thirty-six major repositories, with over forty hours invested per task. Forty hours per task. They used unit tests, rubrics, scope checks, and some novel verifiers — one's called mutagent — to grade quality, style, test effectiveness, and how well the code adheres to the existing codebase. And here's the number that should keep everybody humble. On the hardest fifty-task set, which they call Diamond, Claude Opus 4.8 topped out at thirteen-point-four percent. GPT-5.5 came in at six-point-three percent. Open models lower than that.

Thirteen percent. Let that sit. The best model in the world, on the question of would a real maintainer merge this, is getting it right about one time in eight. They also claim FrontierCode reduced misclassification errors by eighty-one percent versus an earlier benchmark called SWE-Bench Pro, meaning it's a cleaner measuring stick. So you've got two things happening at once that you have to hold in your head together. On one hand, Karpathy and the early users are saying this Mythos-class model feels like a genuine leap for long, hard, multi-hour problem-solving. On the other hand, the toughest real-world merge-quality benchmark says even the frontier is failing seven times out of eight. Both of those are true. The model is a leap, and the gap to actually shippable, mergeable, production-grade code is still enormous. Don't let anybody sell you only one half of that.

There's a person named swyx, who runs in the Cognition orbit, and he added some useful color on the long-running-task side. His claim is that on the Diamond set, neither Opus 4.8 nor GPT-5.5 meaningfully scales with effort. Meaning you throw more test-time compute at them, more thinking time, and they don't really get better at these hardest problems. But he says the Mythos and Fable post-training specifically aimed that compute at very long-running problems. We're talking tasks equivalent to dozens of human hours, at hundreds of dollars per task, which he says is the first time that's been measured at all. Hundreds of dollars per task. Hold onto that number, because it connects to a story I'm gonna get to in a minute about whether anybody can afford to actually run these things.

One more swyx note that's genuinely interesting on the business side. He pointed out it was thirty-four days between signing some infrastructure deal and launching this Mythos-class model to the world, built on the Nvidia stack, and he said building on that stack means, quote, you can just do things. Thirty-four days from deal to a frontier model in general availability. That's a velocity number. For anybody who remembers when shipping a model was a year-long expedition, that compression is its own story. And it's available now inside Cognition's Devin, he says, at only one-point-four-x the usage credits. So they've priced it to actually be used, not just admired in a demo.

That's a clean bridge to my next point, because Anthropic didn't just ship a powerful model. They shipped a powerful model with a list of things it flat-out refuses to talk about. And that's where this gets philosophically spicy.

Fable 5 comes with guardrails that block responses in high-risk areas — specifically cybersecurity, biology, and chemistry. Ars Technica's headline put it plainly: these topics are too dangerous to let its Fable 5 model talk about. Now, I want to be fair to Anthropic here, and then I want to poke at it, because both are warranted. The fair part: this is the same company that has been out front on AI safety, the one that's publicly floated the idea of a coordinated, verifiable pause on frontier AI. If you genuinely believe your Mythos-class model is approaching the kind of capability where it could meaningfully help somebody with a bioweapon or a serious cyber-intrusion, then yeah, putting hard refusals around bio, chem, and cyber before you hand it to nine hundred million people is the responsible move. You don't get to claim you take catastrophic risk seriously and then ship the unlocked thing to the whole internet.

Here's the poke. Karpathy, who liked the model, also said the safeguards are configured to be a little too trigger-happy for launch, and he hopes they get tuned over time. And that's the rub for builders. A model that refuses too much is a model that breaks your legitimate workflows. If you're a security researcher, half your job is cybersecurity questions. If you're in a biotech or a chemistry-adjacent startup, you've got perfectly mundane, perfectly legal reasons to ask the model about biology and chemistry. A guardrail that can't tell the difference between a grad student and a bad actor just becomes friction, and friction sends your developers to whatever competitor will actually answer the question.

And that's the genuinely hard part of this, the part nobody has solved. There's a real tension between "we made the most capable coding and reasoning model ever" and "it won't discuss three enormous fields of human knowledge." For an enterprise picking a model to standardize on, this isn't an abstract ethics debate; it's a procurement question. How often does this thing refuse work I actually need done? Where's the line, who decides where the line is, and can I tune it for my use case? Anthropic is betting that being the safe, trusted, won't-help-you-build-a-weapon vendor is a feature, especially with an IPO on the horizon where you'd very much like regulators and big risk-averse enterprise buyers to love you. It might well be the right bet. But every refusal is also a small invitation for a customer to go try Fable's competitor. That's the trade, right out in the open.

Now let me pull a thread I dangled earlier — those hundreds-of-dollars-per-task numbers — because there's a counter-story running underneath all this frontier excitement, and it's one builders should be paying very close attention to. TechCrunch ran a piece by Russell Brandom asking a deceptively simple question: can tech companies learn to love cheaper AI models? The framing is this — if the same AI workloads can be handled by cheaper models without hurting quality, that's a massive shift in the economics of AI.

And this is the quiet war that matters more for your burn rate than any benchmark. On one side, you've got the frontier — Mythos-class, hundreds of dollars per long-running task, glorious capability, the kind of thing Karpathy gets giddy about. On the other side, you've got the boring, beautiful question every CFO eventually asks: do I actually need the Ferrari for this errand? Because most production AI workloads are not dozens-of-human-hours frontier research problems. Most of them are summarize this, classify that, draft this email, route this ticket. And if a model that costs a tenth as much does ninety-eight percent as well on those, the rational move for a lot of companies is to run the cheap model for the bulk and save the frontier monster for the genuinely hard stuff.

We've talked on this show before — I think it was just a couple days back — about what some folks dubbed the Tokenpocalypse, this shift from the token-subsidy era, where the labs were eating the cost of your tokens, to the era where you actually pay what the compute costs. That's the backdrop here. When tokens were subsidized, who cared which model you used, it was all cheap. Now that the meter's running for real, model selection becomes a line item. And the smart builders are already doing this — routing the easy stuff to cheap models, reserving the expensive frontier calls for where they earn their keep. If you're building anything that runs at scale and you're sending every single request to the most expensive model out of habit, you are quietly lighting money on fire. The frontier release and the cheaper-models question are not separate stories. They're the same story from two ends — capability is rocketing up, and so is the pressure to not pay frontier prices for non-frontier work.
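To put rough numbers on that routing idea, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the prices, the 90/10 mix of easy and hard requests, and the names `pick_tier`, `total_cost`, and `ROUTINE_KINDS` are all hypothetical, and a real router would need a smarter way to judge difficulty than a fixed list of task kinds.

```python
from dataclasses import dataclass

# Hypothetical per-request prices in dollars; not real vendor pricing.
CHEAP_COST = 0.01
FRONTIER_COST = 0.10  # ten times the cheap tier, mirroring "costs a tenth as much"

# Task kinds treated as routine, borrowed from the examples in the episode.
ROUTINE_KINDS = {"summarize", "classify", "draft_email", "route_ticket"}


@dataclass
class Request:
    kind: str


def pick_tier(request: Request) -> str:
    """Send routine work to the cheap tier and everything else to the frontier tier."""
    return "cheap" if request.kind in ROUTINE_KINDS else "frontier"


def total_cost(requests, always_frontier=False):
    """Sum the cost of a batch, either routed by kind or sent entirely to the frontier tier."""
    total = 0.0
    for r in requests:
        tier = "frontier" if always_frontier else pick_tier(r)
        total += FRONTIER_COST if tier == "frontier" else CHEAP_COST
    return total


# An assumed mix: 90 routine requests and 10 genuinely hard ones.
batch = [Request("summarize")] * 90 + [Request("long_research")] * 10
routed = total_cost(batch)
unrouted = total_cost(batch, always_frontier=True)
print(f"routed: ${routed:.2f}, all-frontier: ${unrouted:.2f}")
```

Under these made-up numbers the routed batch comes to $1.90 against $10.00 for sending everything to the frontier tier. The sketch leaves out the quality side of the trade, the "ninety-eight percent as well" part, which is the thing you would actually have to measure on your own workload before trusting a split like this.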

Now let's shift from model behavior to something more consumer-facing, because Google shipped something I think is genuinely useful and a little eerie. Gemini 3.5 Live Translate — instant voice-to-voice translation. The hook isn't just that it translates; it's that it preserves the speaker's tone, pacing, and pitch. So it's not a flat robot reading your words in another language. It tries to sound like you, in their language, in something close to real time. And Google's wrapping the output in their SynthID watermarks for security, so there's a hidden signal that says this audio came out of an AI system.

Let me give you the practical read. The babel-fish dream — talk to anybody, anywhere, no friction — has been the holy grail forever, and we keep inching toward it. Preserving tone and pitch is a bigger deal than it sounds, because so much of human communication is in the how, not just the what. If the translation keeps your warmth, your hesitation, your emphasis, the conversation feels human instead of transactional. That's the upside.

The eerie part, and the reason I'm glad SynthID is baked in, is that a system which can speak in another language while preserving your tone and pitch is, by definition, a system that's pretty good at sounding like a specific person. The same machinery that makes a translation feel personal is the machinery you'd worry about for voice cloning and impersonation. So the watermark isn't a nice-to-have, it's load-bearing. If you're building voice products, the lesson is that provenance and watermarking are going from optional to table stakes. The moment your tool can convincingly produce a human voice, you own a responsibility for proving where that audio came from. Google clearly knows this, which is why the watermark shipped in the same breath as the feature.

Which brings me to a company that apparently did the opposite of thinking carefully — Meta. Ars Technica, carrying reporting from Wired by Dhruv Mehrotra and Dell Cameron, has the story: one day after researchers discovered it, Meta pulled facial-recognition code from its smart glasses. And Meta won't say why, or whether it's coming back.

Let me walk you through why this one should make the hair on your neck stand up a little. Smart glasses that can recognize faces are the single most loaded privacy scenario in consumer tech. The nightmare everybody's been narrating for years is you walk down the street, somebody's glasses clock your face, pull up your name, your job, your socials, in real time, and you never consented to any of it. So the discovery that there was facial-recognition code sitting in Meta's smart glasses, and the immediate, one-day, no-explanation removal of it, tells you a few things at once. One, this capability was at least built and present, whether or not it was switched on. Two, Meta moved fast enough to yank it that they clearly understood exactly how radioactive it is. And three, the refusal to explain (won't say why, won't say if it's coming back) is its own tell. That's not the posture of a company that says "oops, that code got in by accident, here's our policy." That's the posture of a company keeping its options open.

For builders in the wearables and AR space, the takeaway is that face recognition on always-on cameras is the third rail, and everybody knows it. The fact that the code was there at all suggests the temptation is enormous, because honestly it's a killer feature and a privacy catastrophe at the same time. Watch this space. The question isn't whether somebody ships face recognition in glasses; it's who's reckless enough to be first, and what regulators do the morning after.

Let me do a quick lap through a few more plates before the big sit-down conversation, because there's good stuff in the back half.

Starlink — SpaceX's actual cash cow, more than the rockets — is changing how it charges you. They're introducing a ten-dollar-a-month hardware rental fee, moving away from the one-time purchase of the dish, and they also raised service prices by five to ten bucks. Jon Brodkin at Ars framed it exactly right: Starlink is taking a page from the cable companies. And isn't that the whole arc of every disruptor? You come in as the rebel who's gonna free us from the cable guys, and then the minute you've got the customers locked in and the network effects humming, you discover the recurring rental fee. The dish you used to own, now you rent. It's not evil, it's just gravity. Once you dominate a market, the pricing creep starts, and the monthly rental is the oldest trick in the telecom book. For anybody building a hardware business — the recurring-revenue temptation is real, and so is the customer resentment when you flip from sell to rent.

Next one, and this is a fun bit of corporate naming. TechCrunch's Julie Bort wrote a piece arguing it's not FAANG anymore, it's MANGOS. For the youngsters, FAANG was the old acronym for the big tech darlings — Facebook, Apple, Amazon, Netflix, Google. Her argument is that with SpaceX, Anthropic, and OpenAI all eyeing massive public debuts, we're about to have a new class of corporate overlords and we need a new fruit to describe them. MANGOS. I'm not gonna pretend I can fully reverse-engineer the acronym from the teaser, and I'm not gonna invent letters I can't back up. But the substance under the cuteness is real: the center of gravity in big tech is shifting from the old social-and-search incumbents toward the AI labs and the infrastructure players. When SpaceX, Anthropic, and OpenAI are all heading toward the public markets, the index that defines tech is genuinely getting rewritten. The acronym is a joke. The reshuffle is not.

On the security beat, a couple quick ones for the builders who keep the lights on. CISA gave US federal agencies three days — three days — to patch a VPN bug that's under active attack by a ransomware gang. The vendor, Check Point, said hackers broke into dozens of organizations by exploiting a bug in several of its products used across the government. When CISA hands down a three-day deadline, that's not routine maintenance, that's the fire alarm. If you're running anything Check Point-adjacent, you already know what your weekend looks like.

And separately, Dan Goodin at Ars has a story about Microsoft fixing a zero-day while locked in what's described as a heated rivalry with the researcher who disclosed it — somebody going by Nightmare Eclipse — and it looks like a second zero-day from the same researcher got patched too. I bring this up less for the specific bug and more for the texture: the relationship between big vendors and the independent researchers poking holes in their stuff is often genuinely adversarial, and that friction is part of how the holes actually get fixed. It's messy, it's personal sometimes, and it's load-bearing for everybody's security.

Quick hardware-and-space lightning round. NASA assigned a crew for Artemis III and set what Eric Berger described as an aggressive timeline to fly it — the actual return-humans-to-the-Moon-surface mission. Commonwealth Fusion put out five peer-reviewed papers making the physics case for its four-hundred-megawatt reactor, which is the boring, essential work of turning fusion from a vibe into a design somebody might actually finance. GM is jumping into the energy-storage race, developing a new sodium-ion battery chemistry aimed at everything from AI data centers to its own factories — and that data-center-battery angle matters, because the dirty secret of the AI boom is that it's bottlenecked on power as much as on chips. And on the consumer side, there's code in the iOS 27 developer beta referencing fold state and screen angle, which has everybody convinced Apple's foldable iPhone is finally close. We'll see. Apple's been close on a lot of things.

Now, let me sit with one piece for a minute that I think is the right note to end the thinking part on, because it cuts against all the breathless model-release energy in a useful way. There's an essay published on Rick Rubin's Tetragrammaton — and to be straight with you, I don't have a firm date on when it went up, so I'm not going to pretend it's hot off the presses — by Ian Rogers of Ledger, called, roughly, why AI will be many things but never human. And the argument is one I think every founder building AI products should tattoo somewhere visible.

Rogers' case is that AI is going to become pervasive, persuasive, and highly personality-like. Note that word — personality-like. Not a person. Something that wears the costume of a person convincingly. And his core warning is that it remains a tool rather than a person, and the real danger isn't the AI itself — it's humans forgetting that distinction and starting to grant it moral or legal status.

And you connect that back to everything we covered today, and it clicks. Google's translator that keeps your tone and pitch — more personality-like. Voice assistants getting conversational — more personality-like. Karpathy saying the model gets it and you stop wanting to look at the code — that's the seduction Rogers is warning about. When the tool gets this good at seeming to understand you, the human instinct is to anthropomorphize it, to trust it like a colleague, to stop checking its work, to grant it a kind of standing it hasn't earned. And that's exactly when you ship the unmerged code to prod. That's exactly when you let the friendly robot voice make a decision it shouldn't.

So here's how I'd tie the whole tray together. The frontier is genuinely leaping — Fable 5 is real, the long-running-task progress is real, the demand for software is about to bend upward in a way that rewards the builders who feel it early. And at the very same time, the merge-quality benchmark says even the best model fails seven times out of eight on the question of would a real expert accept this. The capability is dazzling, and the gap to trustworthy is still wide. The skill in 2026 isn't getting excited about the model — everybody can do that. The skill is holding the excitement and the skepticism in the same hand. Use the leap. Route the cheap stuff to cheap models. Watermark your voices. Keep a human reviewing the code. And whatever you do, don't mistake the costume for the person.

That's the menu for today. I'm Tony DeLuca, this has been Barely Possible, and as always — stay curious, stay a little skeptical, and don't ship to prod without looking at the code. Catch you next time.
