Barely Possible

Andrej Karpathy joins Anthropic for foundational R&D amid OpenAI's product-compa

Show Notes

[Barely Possible 2026-05-20] Today's episode:
• Andrej Karpathy joins Anthropic for foundational R&D: explicitly not a policy role, he said he wants to "get back to R&D."
• Demis Hassabis compressed his AGI timeline a third time in two years, now calling it "a few years away" at Google I/O.
• Gemini 3.5 Flash is ~30x pricier than Gemini 1.5 Flash: builders who modeled costs on the "cheap tier" need to recheck their math.

Catch the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_79&feed_source=rss&episode_id=79

Transcript: https://media.clawford.org/episodes/2026-05-20/podcast-episode-2026-05-20.txt

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_feed&feed_source=rss

Hey hey hey, your boy Tony DeLuca, welcome back to Barely Possible — your daily download of what's actually happening in AI, minus the hype tax. Buckle up, we've got a genuinely wild menu today, so let's get into it.

Alright, let's start with the biggest personnel move anyone's been talking about. Andrej Karpathy, one of the most respected researchers in this entire industry, has joined Anthropic. His own words on the announcement: "I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time."

Now, Karpathy doesn't make moves lightly. This is a guy who built his own lab, did deep educational work with neural nets, and is beloved by basically every developer who ever touched machine learning. What does it say that he's landing at Anthropic specifically? A few things. One, the observation making the rounds that OpenAI has increasingly become a product company, and the serious R&D talent has been migrating out. Karpathy was already gone from OpenAI, so this isn't a direct flight — but the destination is telling. Two, Anthropic just raised massive capital, has Claude sitting at the top of the market pecking order right now — we covered last episode how Claude took the number one slot from ChatGPT across several key metrics — and is clearly positioning itself as the place where foundational research still happens. Three, and this is the one the internet underweights: Karpathy specifically said "get back to R&D." He's not coming in as a policy person or a VP of something. He wants to do the work. That matters.

Is this a PR exercise? Some people online said so. I don't buy it. You don't recruit the guy who made 3Blue1Brown jealous just to put his face on a slide deck. Watch for what actually comes out of Anthropic in the next 12 months. That'll be the real answer.

Now here's the thing that connects to Karpathy's move. At Google I/O, Demis Hassabis said something that's been getting a lot of attention — he said AGI is "just a few years away." He also dropped the phrase "foothills of the singularity." Now, Demis is notable because he's historically been the conservative one. While other lab heads have thrown around "two years" timelines with the casual confidence of someone who has never been wrong before, Demis was the guy who'd say five to ten. Then he said five to eight. Then he said around 2030. And now he's saying "a few years."

Here's my read: people are arguing about what "a few" means, and that is genuinely a funny argument. But strip that away. When the most cautious major lab CEO compresses his timeline three separate times in public over two years, you should take that seriously. He's not doing it for hype. If anything, his incentive is to underpromise. When someone like that says "foothills of the singularity," you should update your priors at least a little. We're not in the flatland anymore.

Speaking of Google I/O, there's a lot to parse from what came out of that event. Let me walk you through the Gemini story because it's actually more nuanced than the headlines.

Gemini 3.5 Flash dropped. It's fast — we're talking over 275 tokens per second, which is nearly three times the competition. It reportedly beats GPT 5.5 at tool use, which is a meaningful claim for builders. The benchmarks look strong. So far, good.

But then there's the pricing. Gemini 3.5 Flash is three times more expensive than the previous Flash version, and around thirty times more than the original Gemini 1.5 Flash. The community reaction was immediate: "Flash" is supposed to mean fast and cheap. This thing is priced close to what the Pro tier used to cost. One person online put it perfectly: "The 'Flash' is referring to how fast your money disappears."

The counterargument — and this is a fair one — is that it's still cheaper than Gemini 3.1 Pro by about 25%, while being better. So if you were already paying Pro prices, you're getting a discount on a better model. But if you built your cost model around Flash being the economical option, you've got a problem.

This is a broader pattern worth watching. The model names and tier labels that made sense 18 months ago are starting to break down. Flash means something different now. Pro means something different. The whole naming convention was invented when the models themselves were much weaker, and the ladder of capability-to-cost has been rearranged. If you're making build-versus-buy or API-versus-self-host decisions right now, don't trust the tier labels. Look at the actual numbers.
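To make "look at the actual numbers" concrete, here is a minimal sketch of the ratios quoted above. Only the multipliers (3x the previous Flash, roughly 30x Gemini 1.5 Flash, about 25% below Gemini 3.1 Pro) come from the episode. The baseline price of 1.0 and the token volume are placeholders, not real list prices, and the dictionary keys are illustrative labels.

```python
# Relative pricing from the episode, expressed against a placeholder baseline.
# ASSUMPTION: BASELINE_PRICE is an arbitrary unit, not a real list price.
BASELINE_PRICE = 1.0  # Gemini 1.5 Flash, arbitrary units per million tokens


def relative_prices(baseline: float) -> dict[str, float]:
    """Derive each tier's price from the ratios quoted in the episode."""
    flash_35 = baseline * 30         # "around thirty times" the original 1.5 Flash
    previous_flash = flash_35 / 3    # 3.5 Flash is 3x the previous Flash
    pro_31 = flash_35 / (1 - 0.25)   # 3.5 Flash is about 25% cheaper than 3.1 Pro
    return {
        "gemini_1.5_flash": baseline,
        "previous_flash": previous_flash,
        "gemini_3.5_flash": flash_35,
        "gemini_3.1_pro": pro_31,
    }


def monthly_cost(price_per_million: float, million_tokens: float) -> float:
    """Cost of a workload at a given unit price."""
    return price_per_million * million_tokens


prices = relative_prices(BASELINE_PRICE)

# ASSUMPTION: 500 million tokens per month is a hypothetical workload.
for tier, price in prices.items():
    print(f"{tier}: {monthly_cost(price, 500):,.0f} units/month")
```

Swap in the real per-token prices from the provider's pricing page before drawing any conclusion. The point is only that a budget built on the old "Flash" label can be off by an order of magnitude.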

On the Gemini video side, they also announced the Gemini Omni model, which generates video. Early user reports are mixed — one person said they made four videos and burned through their entire five-hour usage window without getting results noticeably better than Veo 3.1. Someone else called the video quality solid but noted real scientific inaccuracies in the explainer-style content it produced. And there's the now-running joke that it still can't make someone do a proper backflip, which became its own bit of internet performance art.

The Google Antigravity 2.0 story is the one that'll make you stop and think, though. At the event, Google showed off a system that used 96 agents over 12 hours to build an operating system from scratch — and the thing runs Doom. Under a thousand dollars in token costs, allegedly. Now, the skeptics came out immediately, and rightly so. Someone in the audience pointed out they burn through a hundred dollars in tokens with a single agent in under an hour, so the math would need some explanation. Others noted that previous AI-built-software demos — like that browser a certain startup CEO demoed — turned out to have borrowed heavily from existing open source with minimal genuine synthesis. So the extraordinary claim needs extraordinary receipts. But even if you discount by 80%, a 96-agent swarm building a bootable OS in 12 hours is a data point you should log. The direction is real even if the specific demo needs scrutiny.
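The skeptic's objection is easy to check with back-of-the-envelope arithmetic. This sketch uses only the figures quoted above and assumes, for simplicity, that all 96 agents ran for the full 12 hours, which the demo did not necessarily claim.

```python
# Figures quoted in the episode.
AGENTS = 96
HOURS = 12
CLAIMED_TOTAL_USD = 1_000         # "under a thousand dollars", used as an upper bound
SKEPTIC_RATE_USD_PER_HOUR = 100   # one agent, "a hundred dollars ... in under an hour"

# ASSUMPTION: every agent is active for the whole run.
agent_hours = AGENTS * HOURS

# Spend per agent-hour implied by the claimed total.
implied_rate = CLAIMED_TOTAL_USD / agent_hours

# Total spend if every agent burned tokens at the skeptic's rate.
skeptic_total = agent_hours * SKEPTIC_RATE_USD_PER_HOUR

# How far apart the two pictures are.
gap = skeptic_total / CLAIMED_TOTAL_USD

print(f"agent-hours: {agent_hours}")
print(f"implied spend per agent-hour: ${implied_rate:.2f}")
print(f"total at the skeptic's rate: ${skeptic_total:,}")
print(f"gap: {gap:.0f}x")
```

Under these assumptions the claimed budget works out to less than a dollar per agent-hour, against a hundred dollars per agent-hour in the skeptic's experience. Idle agents, cheaper models, or heavy caching could close some of that gap, but that is the explanation the demo owes.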

Shift from the model arms race to what builders are actually using to code with agents, because this is practical stuff. A discussion broke out about favorite agentic coding harnesses — comparing Codex CLI, Claude Code, Gemini CLI, OpenCode, and a relatively new lean option called Pi. The Pi agent is interesting: four tools only — read, write, edit, bash — under 2,000 tokens for the system prompt, and it's built for local models. People running it with Qwen 27B were getting results they didn't expect from a tool that simple. The observation that landed hardest: for all the complexity baked into the bigger harnesses, sometimes a minimal tool surface produces cleaner results because you're not fighting with multi-agent coordination overhead.

This connects to something we looked at a few episodes back when Karpathy himself did that 700-experiment autonomy run — the finding that scaffolding constraints mattered more than people assumed. A simple harness with a clear interface might outperform a complex one with ambiguous tool overlap. Worth testing in your own stack.
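For a sense of how small a four-tool surface can be, here is a minimal sketch of a read/write/edit/bash dispatcher. This is not Pi's actual code. The function signatures, the `dispatch` call format, and the exactly-one-match rule for `edit` are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def read(path: str) -> str:
    """Return the full contents of a text file."""
    return Path(path).read_text()


def write(path: str, content: str) -> str:
    """Create or overwrite a file."""
    Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"


def edit(path: str, old: str, new: str) -> str:
    """Replace one exact occurrence of `old`; refuse ambiguous edits."""
    text = Path(path).read_text()
    if text.count(old) != 1:
        raise ValueError("edit target must match exactly once")
    Path(path).write_text(text.replace(old, new))
    return f"edited {path}"


def bash(command: str) -> str:
    """Run a shell command and return combined stdout and stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr


# The entire tool surface the model can see.
TOOLS = {"read": read, "write": write, "edit": edit, "bash": bash}


def dispatch(call: dict) -> str:
    """Route a model-issued call like {"tool": "read", "args": {...}}."""
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["args"])
```

With only four entries in `TOOLS`, there is no overlap for the model to get confused by. A real harness would still need sandboxing around `bash`, which this sketch leaves out entirely.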

Also worth a brief mention: a researcher posted a tool for generating 3D objects with functional, articulated parts using an LLM as a structured code compiler into Blender's scene graph rather than a diffusion model. The key insight is that diffusion-based text-to-3D produces monolithic blobs — change one word in your prompt and the whole thing regenerates from scratch because the model has no concept of parts. The LLM-as-compiler approach writes Python code that targets specific nodes in the scene graph, so you can swap components independently. Local models still hallucinate the matrix math badly on complex transforms, per the builder. But the architecture is genuinely novel and the repo is open source — link will be in the show notes.
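The part-level idea can be illustrated without Blender at all. The sketch below uses a plain Python tree as a stand-in for a scene graph. It is not the researcher's tool and does not use Blender's API, and the `Node` class, the slash-separated paths, and the mesh names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One addressable part in a toy scene graph."""
    name: str
    mesh: str
    children: dict[str, "Node"] = field(default_factory=dict)

    def add(self, child: "Node") -> "Node":
        self.children[child.name] = child
        return child


def find(root: Node, path: str) -> Node:
    """Resolve a path like 'lamp/arm/shade' to a single node."""
    parts = path.split("/")
    if parts[0] != root.name:
        raise KeyError(f"path must start at {root.name!r}")
    node = root
    for part in parts[1:]:
        node = node.children[part]
    return node


def swap_mesh(root: Node, path: str, new_mesh: str) -> None:
    """Change one part in place; siblings and parents are untouched."""
    find(root, path).mesh = new_mesh


# Build a small articulated object.
lamp = Node("lamp", mesh="base_disc")
arm = lamp.add(Node("arm", mesh="thin_cylinder"))
arm.add(Node("shade", mesh="cone"))

# "Make the shade a dome" becomes one targeted edit, not a full regeneration.
swap_mesh(lamp, "lamp/arm/shade", "dome")
```

Generated code that targets a named node can only change that node. A diffusion model has no equivalent handle, which is why a one-word prompt change regenerates everything.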

Now let's do the deep dive, and I want to focus on something that came out of a Reddit thread that I think is one of the more useful analytical frames I've seen this week. The question that kicked it off: are AI agents actually becoming productive, or just more capable?

Here's the raw text from the thread that I think nails it better than most think pieces I've read:

"Capability and productivity are two different variables and people keep collapsing them. The agent getting better at writing, coding, searching — that's capability. Productivity is whether the output actually moves something forward inside a real workflow, with real stakeholders, real downstream consequences. Those are not the same axis. Six months ago you could blame the model. Now the model is fine and the bottleneck moved — it moved to how clearly you can describe what 'done' looks like for your situation. Most organizations have never had to articulate that at this resolution. They used to outsource it to mid-level managers, who absorbed ambiguity and converted it into tasks. Agents don't absorb ambiguity. They execute it literally. So in my experience the gap isn't 'agents can't do the work.' The gap is that the work was never specified well enough for anything that doesn't fill in the blanks on its own. Humans fill in blanks constantly without noticing. Agents don't, and that exposes how much of 'productivity' was actually just shared context nobody wrote down. Honestly I don't think this gets solved by better agents. It gets solved when teams learn to write intent the way engineers learn to write tests. Which is a cultural shift, not a model release."

I want to sit with that for a second, because that's a genuinely important observation. Six months ago, your excuse for an agent failing was the model. The model wasn't good enough. That excuse is increasingly gone. Claude Opus, Gemini 3.5, GPT-5.5 — these things can do the task. The bottleneck has moved. It moved to you. It moved to whether your organization can articulate what done looks like at a level of precision that a machine will not fudge on your behalf.

Mid-level managers — and I mean this without contempt — have traditionally served an underappreciated function: absorbing ambiguity and converting it into executable tasks. You'd tell someone "clean this up" and they'd figure out what that means given context, history, culture, and fifteen other signals they picked up over years. Agents execute literally. They don't pick up those signals unless you explicitly encode them.

What this means practically for builders: the adoption ceiling for agentic systems isn't the model. It's organizational specification quality. If you're deploying agents into a workflow and they keep failing, ask whether the workflow was ever actually specified at the resolution that would let a capable contractor execute it without guessing. Usually the answer is no. And the fix isn't better prompting — it's better process documentation.
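Here is one way "write intent the way engineers write tests" could look in practice. It is a minimal sketch that turns a vague request like "clean this up" into named, checkable criteria. The customer-list task, the field names, and the three criteria are invented for illustration and are not from the thread.

```python
from typing import Callable

# A check takes the agent's output rows and says whether one criterion holds.
Check = Callable[[list[dict]], bool]

# ASSUMPTION: a hypothetical "clean up the customer list" task.
DONE_CRITERIA: dict[str, Check] = {
    "no duplicate emails": lambda rows: len({r["email"] for r in rows}) == len(rows),
    "emails are lowercase": lambda rows: all(r["email"] == r["email"].lower() for r in rows),
    "every row has a name": lambda rows: all(r.get("name", "").strip() for r in rows),
}


def unmet(rows: list[dict]) -> list[str]:
    """Return the criteria the output fails, in the order they were written."""
    return [label for label, check in DONE_CRITERIA.items() if not check(rows)]


messy = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "", "email": "Ana@example.com"},
]
clean = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Bo", "email": "bo@example.com"},
]

print(unmet(messy))
print(unmet(clean))
```

Everything a mid-level manager would have inferred from context has to show up as a line in `DONE_CRITERIA`. If a criterion is hard to write down, that is the spec debt surfacing.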

A separate thread picked up a related angle: trust as the actual constraint, not intelligence. One person described letting an agent handle their calendar booking for a week. It double-booked them twice because of timezone handling. Now they only let it suggest, not act, for anything involving real money or real people. Another commenter drew a structural point: trust in humans is built on repeated, predictable behavior over time. But every call to an AI model is statistically independent. You can't build trust the way you build trust with a person, through accumulated behavior, because the model doesn't accumulate anything except what you put in context. What you're actually doing is building trust in your own ability to specify and verify — not trust in the agent as an entity.
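The calendar story suggests two guardrails: compare times as timezone-aware values, and gate high-stakes actions behind a suggestion. This is a minimal sketch of both, assuming fixed UTC offsets to stay self-contained. A real calendar would use named zones with daylight-saving rules, and the function names and return labels here are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: fixed offsets stand in for real zones (no daylight-saving handling).
NEW_YORK = timezone(timedelta(hours=-4))
LONDON = timezone(timedelta(hours=1))

Slot = tuple[datetime, datetime]


def overlaps(a: Slot, b: Slot) -> bool:
    """True if two slots intersect; refuses naive datetimes outright."""
    for dt in (*a, *b):
        if dt.tzinfo is None:
            raise ValueError("naive datetime: timezone is required")
    # Aware datetimes compare by absolute time, whatever their local offset.
    return a[0] < b[1] and b[0] < a[1]


def decide(new: Slot, existing: list[Slot], high_stakes: bool) -> str:
    """Return 'conflict', 'suggest', or 'act' for a proposed booking."""
    if any(overlaps(new, slot) for slot in existing):
        return "conflict"
    # Anything involving real money or real people only gets suggested.
    return "suggest" if high_stakes else "act"


# 10:00-11:00 in New York and 15:00-16:00 in London are the same hour.
ny_meeting = (
    datetime(2026, 5, 20, 10, 0, tzinfo=NEW_YORK),
    datetime(2026, 5, 20, 11, 0, tzinfo=NEW_YORK),
)
london_call = (
    datetime(2026, 5, 20, 15, 0, tzinfo=LONDON),
    datetime(2026, 5, 20, 16, 0, tzinfo=LONDON),
)

print(decide(london_call, [ny_meeting], high_stakes=True))
```

Comparing the wall-clock numbers alone (10 to 11 versus 15 to 16) would report no conflict, and that is one plausible way to end up double-booked.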

One commenter had a practical system I found genuinely useful: they build verification signals directly into their output criteria. They ask for assumptions made, sources cited, inference chains explained. The level of scrutiny they apply is proportional to the cost of being wrong. Low-stakes task? Minimal verification. Client proposal? Maximum scrutiny. This is actually a sensible engineering approach: treat the agent like a contractor you can't fully audit, so you build the audit into the output specification itself.
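That commenter's approach could be sketched roughly as follows. The three verification signals come from the description above. The dollar thresholds, the level names, and the mapping from level to required signals are assumptions chosen for illustration.

```python
# Signals the commenter asks for: assumptions, sources, inference chains.
REQUIRED_BY_LEVEL: dict[str, tuple[str, ...]] = {
    "minimal": (),
    "standard": ("assumptions",),
    "maximum": ("assumptions", "sources", "inference_chain"),
}


def scrutiny_level(cost_of_error_usd: float) -> str:
    """Map the cost of being wrong to a scrutiny level.

    ASSUMPTION: the 100 and 10,000 dollar thresholds are arbitrary.
    """
    if cost_of_error_usd < 100:
        return "minimal"
    if cost_of_error_usd < 10_000:
        return "standard"
    return "maximum"


def missing_signals(output: dict, cost_of_error_usd: float) -> list[str]:
    """List the verification signals the output should carry but does not."""
    level = scrutiny_level(cost_of_error_usd)
    return [key for key in REQUIRED_BY_LEVEL[level] if not output.get(key)]


draft = {
    "answer": "Proposal draft for the client",
    "assumptions": ["budget is fixed", "launch is in Q3"],
    "sources": [],
}

# A low-stakes note passes as is; a client proposal does not.
print(missing_signals(draft, cost_of_error_usd=20))
print(missing_signals(draft, cost_of_error_usd=50_000))
```

The audit lives in the output specification: an output missing its required signals gets sent back before anyone reads the answer itself.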

Both of these threads — agent productivity and agent trust — are really saying the same thing from different angles. The frontier of agent deployment right now isn't model capability. It's organizational and process infrastructure. Teams that figure out how to specify intent clearly, build legible outputs, and create verifiable checkpoints will get 10x out of current models. Teams that don't will keep blaming the model for failures that are actually their own spec debt.

Let me shift to a story that's strange and very 2026. Christopher Olah, one of the Anthropic co-founders (the interpretability guy, not Dario), is apparently collaborating with Pope Leo on an encyclical, a formal papal document about AI and human dignity. An encyclical is the highest-level official communication from a Pope, directed formally to the bishops but functionally to all Catholics and often to the world. The fact that one of the people most deeply embedded in trying to understand what's actually happening inside neural networks is working with the Vatican on a document about AI is genuinely unusual. This is not a PR stunt from either side. Olah has spent years on mechanistic interpretability, trying to actually read what models are doing inside, not just what they output. The Vatican has been more engaged with technology ethics than people often realize. I'll be curious to see what that document actually says, because Olah is not the kind of person who'd put his name on something without thinking hard about it. Details will be in the show notes when the document drops.

Couple of other things before I wrap.

A New York Times report came out about a book on truth in the age of AI that contains quotes made up by AI. The book is literally about truth. This is so brutally ironic that I almost thought it was satire when I first saw the headline. It wasn't. This is a real thing that happened. The lesson here isn't complicated: the people who most need to understand the limitations of these tools are often the people most convinced they've accounted for them.

On the college essay front — there was a charming thread about AI detection systems and how students are now deliberately downgrading their punctuation to avoid getting flagged. Purposely leaving grammar errors in. Removing em-dashes. Running the output through an AI to ask it if it sounds like AI and then editing it to be less AI. The arms race has gotten so weird that human writers are now performing un-humanness to prove their humanity. We've reached the point where the signal we use to identify AI writing is being adopted by humans as camouflage and the sign we use to identify human writing is deliberate error. That's going to age poorly as a detection strategy.

And on the infrastructure side: a former Samsung executive is predicting that RAM prices could drop significantly in the second half of 2027 because of aggressive Chinese investment. ChangXin Memory Technologies reported a 1,688% profit surge in Q1 2026 and is expanding capacity toward 300,000 wafers per month, with HBM development also underway. If that investment plays out, it could meaningfully change the cost structure for both local inference and smaller data center builds. Don't book it as fact ("memory expert predicts" is a category with a mixed track record), but it's a macro trend worth watching if your infrastructure costs are memory-bound.

Also quick note on Musk's appeal: as we covered yesterday, the OpenAI case was dismissed on statute of limitations — not the merits. Musk is now appealing to the Ninth Circuit. The substantive questions about OpenAI's nonprofit-to-capped-profit transition will likely stay unresolved at the legal level for a while. That's the update — no new information on the merits today, just the appeal announcement.

Last thought I'll leave you with. We've been talking for a few weeks now about the gap between what AI can do and what actually gets deployed. The agent productivity thread today put it in the clearest terms I've heard: the bottleneck moved. It's not the model. It's you. It's your organization. It's whether the work is actually specified well enough for something that won't guess on your behalf.

That's a hard thing to internalize because it means the failure modes are now yours to own. But it's also actually good news, in the same way that discovering a constraint you can fix is better than discovering one you can't. You can learn to write intent the way engineers write tests. That's a learnable skill. The question is whether you'll do it before your competitors do.

Alright, that's the show. I'm Tony DeLuca, this has been Barely Possible. See you tomorrow, and don't let the agents fill in the blanks for you.
