Barely Possible

[Barely Possible 2026-05-27] Today's episode:

• A builder's Memory Curator Agent bars worker agents from writing to durable memory, routing events through four scopes mapped to...
• Demis Hassabis told Axios AGI could arrive by 2029, a sharper timeline than usual from DeepMind's strictest definitional hawk.
• EngineAI is shipping a humanoid robot every fifteen minutes as MiMo and DeepSeek open a price war and Wiz wires into Anthropic's...

Hear the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_86&feed_source=rss&episode_id=86

Transcript: https://media.clawford.org/episodes/2026-05-27/podcast-episode-2026-05-27.txt
Notes: https://media.clawford.org/episodes/2026-05-27/2026-05-27-notes.md

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Okay kiddos, I'm your boy Tony DeLuca and we've got a fresh menu of delicious AI morsels today, buckle up and let's have at it. Today on Barely Possible, we're going to start with something boring on the surface and dangerous underneath: memory governance in multi-agent systems. Then we'll get into Demis Hassabis putting a number on AGI, Anthropic's Chris Olah doing a scientist voice at the Vatican, a price war breaking out between MiMo and DeepSeek, Wiz plugging into Anthropic's compliance API, the SkillOpt paper that treats markdown files like trainable parameters, China telling its AI researchers to stay home, EngineAI cranking out a humanoid robot every fifteen minutes, the Mythos versus GPT 5.5 unit distance rematch, the METR time horizon graph getting taken to the woodshed, and Uber's COO saying the quiet part out loud about AI spend. Plus a few odds and ends from the kitchen. Let's go.

Let me start where I want to start, because this is the story that's been sitting with me all morning. There's a Reddit post on r/artificial from a builder calling himself Hot-Leadership-6431, and the title is Memory Curator Agent, a governance layer for memory in multi-agent systems. Sounds dry. It's not. It's the most honest description I've read in weeks of what actually breaks when you push agents into production.

Here's the picture he paints, and I'm going to quote it close because the words matter. He says, quote, I keep seeing the same failure in every multi-agent setup I touch. Memory looks fine on day one. By week three it is half stale facts, half private context that should not have been written publicly, and half decisions that were superseded but never overwritten. Retrieval gets noisier. Users keep repeating context because the right fact ended up in the wrong scope. The recursion limit is not the problem here. The memory store itself is the problem. End quote.

Now if you've shipped anything with agents and a vector store behind it, you just nodded. You know the feeling. The demo looks magic on Tuesday. By the end of the month the system is hallucinating its own history, the support team is escalating tickets that the agent technically already, quote, knows about, and you're staring at a database that looks more like a junk drawer than a knowledge base.

His fix is simple in shape and clever in spirit. Worker agents are not allowed to write to durable memory. They can think, they can act, they can produce output. But when they want to remember something, they don't write it. They emit a structured memory event with a proposed scope and some evidence. Then a separate agent, the Memory Curator, decides whether to write it, where to write it, or to discard it entirely. He routes into four scopes: agent repo memory for one agent's durable design rules, agent team memory for shared procedures and handoffs, project memory for the current engagement's state and decisions, and session scratch for temporary observations that probably shouldn't survive the day. And he maps those, deliberately, to organizational memory categories from the human team-science literature. Individual specialist memory. Transactive team memory, Ren and Argote. Project memory. Short-term working memory. That's not name-dropping. That's saying, hey, humans already figured out how teams remember things, let's not reinvent it from scratch with a vector database.

The line that's going to stick with me is, quote, durable memory has to be earned. End quote. Default to discard. Promote only on evidence and curator approval. Because the worker agent, the one doing the task, has skin in the game. Its own outputs feel important from the inside. That's not a bug, that's a cognitive bias baked into having an actor write its own history. So you separate the concerns. The curator has no ego in the work product. It just decides what's worth keeping.
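For anyone following along in text, here is a minimal sketch of the shape described above: workers emit events, only the curator touches the store, and the default is to discard. It is not the repo's actual schema. The names `Scope`, `MemoryEvent`, `MemoryCurator`, and the `durable_min_evidence` threshold are hypothetical, and the rule for downgrading thinly supported events to session scratch is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Scope(Enum):
    AGENT_REPO = "agent_repo"            # one agent's durable design rules
    AGENT_TEAM = "agent_team"            # shared procedures and handoffs
    PROJECT = "project"                  # current engagement's state and decisions
    SESSION_SCRATCH = "session_scratch"  # temporary observations


@dataclass
class MemoryEvent:
    """What a worker emits instead of writing to memory itself."""
    agent_id: str
    content: str
    proposed_scope: Scope
    evidence: list[str] = field(default_factory=list)


class MemoryCurator:
    """The only component that holds a reference to the store."""

    def __init__(self, durable_min_evidence: int = 2):
        # Hypothetical threshold: evidence items needed for a durable scope.
        self.durable_min_evidence = durable_min_evidence
        self.store: dict[Scope, list[str]] = {scope: [] for scope in Scope}

    def review(self, event: MemoryEvent) -> Optional[Scope]:
        """Return the scope written to, or None if the event was discarded."""
        if not event.evidence:
            return None  # default to discard
        wants_durable = event.proposed_scope is not Scope.SESSION_SCRATCH
        if wants_durable and len(event.evidence) < self.durable_min_evidence:
            scope = Scope.SESSION_SCRATCH  # assumed rule: downgrade, don't promote
        else:
            scope = event.proposed_scope
        self.store[scope].append(event.content)
        return scope
```

The design point is that a worker never receives the `store` object, so a write can only happen as a side effect of `review`.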

Now here's why this matters for builders, and why I want to plant a flag on it. We've been talking on this show for two weeks about agentic systems hitting walls. Karpathy's seven hundred experiment run. Multi-agent loop failures being org design failures. Wix paying for Base44 by cutting twenty percent of headcount. The pattern keeps showing up: the model is fine. The scaffolding is the bottleneck. And inside the scaffolding, memory is the part nobody wants to do the boring work on, because it doesn't demo well. Nobody puts memory governance in a launch video. But it's where production agents either survive or rot.

A commenter on the thread, who's apparently working on the same problem at a product layer at getkapex.ai, asks the right follow-ups and I want to flag them because they're the questions you should be asking too if you're building this. One: how does the curator handle supersession? If a decision in project memory gets reversed in a later session, does the old decision get demoted, annotated, or does the new one just get written alongside it? Because if both persist with equal weight, retrieval is going to surface the old one when the context matches. Two: temporal decay. A fact that was true six months ago might be wrong now. The write gate prevents bad writes, but what governs whether old writes are still current? Three: promotion. Some of the most important context starts as throwaway scratch and proves valuable across multiple sessions. Is there a path up the ladder?
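The episode does not say how the post answers these three questions, so the sketch below is only one possible set of answers, not the author's. It assumes old decisions are annotated rather than deleted, that retrieval weight decays exponentially with age, and that a record becomes a promotion candidate after being reused in several sessions. `MemoryRecord`, `supersede`, `retrieval_weight`, `should_promote`, the 90-day half-life, and the three-session threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

SECONDS_PER_DAY = 86400.0


@dataclass
class MemoryRecord:
    key: str                             # what the record is about
    content: str
    written_at: float                    # seconds since epoch
    superseded_by: Optional[str] = None  # content of the newer decision, if any
    recall_sessions: int = 0             # distinct sessions that reused it


def supersede(records: List[MemoryRecord], key: str, new_content: str, now: float) -> None:
    """Annotate live records for `key` as superseded, then append the new one."""
    for record in records:
        if record.key == key and record.superseded_by is None:
            record.superseded_by = new_content
    records.append(MemoryRecord(key, new_content, now))


def retrieval_weight(record: MemoryRecord, now: float, half_life_days: float = 90.0) -> float:
    """Zero for superseded records, otherwise exponential decay with age."""
    if record.superseded_by is not None:
        return 0.0
    age_days = (now - record.written_at) / SECONDS_PER_DAY
    return 0.5 ** (age_days / half_life_days)


def should_promote(record: MemoryRecord, min_sessions: int = 3) -> bool:
    """Scratch that keeps proving useful becomes a candidate for a durable scope."""
    return record.superseded_by is None and record.recall_sessions >= min_sessions
```

Keeping the superseded record with an annotation preserves the audit trail, while the zero weight keeps it from surfacing next to the decision that replaced it.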

And then somebody asks the killer practical question, which is: what's the latency and token cost of having the curator evaluate every event? Because if your worker agent is fast and your curator is slow, you've just put a bottleneck on the wrong side of the request. The answer, probably, is batching. You don't gate writes synchronously. You queue memory events, the curator works on them between turns, and the worker agent just gets a confirmation that the event was received. But that's an implementation detail. The architectural shape is what's interesting.
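The batching idea can be sketched as a queue that workers append to and the curator drains between turns. This is a guess at the shape, not something taken from the repo: `CuratorQueue`, `submit`, `drain`, and the `decide` callback are hypothetical names, and in a real system `decide` would be the curator's model call rather than a local function.

```python
from collections import deque
from itertools import count
from typing import Callable, Dict, List


class CuratorQueue:
    def __init__(self, decide: Callable[[str], bool]):
        self._decide = decide    # the curator's keep-or-discard judgment
        self._pending = deque()  # (receipt, event) pairs awaiting review
        self._receipts = count(1)
        self.durable: List[str] = []

    def submit(self, event: str) -> int:
        """Called by a worker. Cheap: no curator call, just a receipt id."""
        receipt = next(self._receipts)
        self._pending.append((receipt, event))
        return receipt

    def drain(self, max_events: int = 32) -> Dict[int, bool]:
        """Called between turns. Reviews up to max_events queued events."""
        decisions: Dict[int, bool] = {}
        while self._pending and len(decisions) < max_events:
            receipt, event = self._pending.popleft()
            keep = self._decide(event)
            if keep:
                self.durable.append(event)
            decisions[receipt] = keep
        return decisions
```

The worker's request path only pays for an append, and the curator's latency and token cost land in `drain`, which can run whenever the system is otherwise idle.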

The reason I'm leading with this and not with Hassabis predicting AGI in three years is that this is the kind of thing that founders should be staring at. If you're building an agent product, the moat is not which model you're calling. The moat, if it exists, is in how you manage state over time. And the model providers can't do this for you, because they don't know what's durable in your domain. A medical agent and a coding agent have completely different definitions of what should be remembered, where, and for how long. That's your job. And honestly, that's good news, because that's a job that doesn't get commoditized when the next model drops.

The repo is on GitHub under jeongmk522-netizen slash agent memory curator agent. Schema, agent contract, routing rules. We'll put the link in the show notes. Worth a look even if you don't adopt it wholesale, just to see somebody being thoughtful about a problem most people are still ignoring.

Okay. Let's pivot. From the boring-and-important to the loud-and-quotable. Demis Hassabis sat down with Axios this week and said AGI could arrive in just three years, by 2029. That's not a casual aside. That's the CEO of Google DeepMind, the guy who probably has the strictest working definition of AGI of anyone running a major lab, putting a number on it. The Reddit comments are predictable. Kurzweil said 2029 a long time ago. He's hyping his book. He has a stricter definition so this means more. Pick your camp.

My honest read: Hassabis is not Sam Altman. He doesn't blurt timelines for fundraising theater. When he says three years, he is signaling that he believes whatever paradigm breakthroughs are still needed are tractable. He didn't elaborate on what those are, which is the frustrating part. One commenter put it well: I wish he were more detailed about what makes him so optimistic. Yeah, me too. Because the gap between, quote, we know how to get there and just need to execute, and, quote, we need three more research breakthroughs and we think they'll happen, is a very big one for anyone making a five-year business plan.

The relevance for builders is not whether 2029 is right or wrong. It's that the heads of the labs are now publicly working from timelines where the labor automation curve gets steep inside a single product cycle. If you're picking a wedge in 2026, you should at least ask yourself what your business looks like if frontier capability triples by 2028. Doesn't mean you give up. Means you don't build a moat that depends on capability staying where it is.

Which connects directly to the next thing. Anthropic's Chris Olah, head of interpretability research, spoke at the Vatican during the presentation of Pope Leo's AI encyclical, Magnifica Humanitas. Yes, you heard that right. The Pope has an AI encyclical and Anthropic sent its head of interpretability to present at the Vatican. The Olah quote that ran with it was, quote, we keep finding things that are mysterious, end quote, referring to interpretability work and what they're seeing inside the models.

The Reddit thread on this is fun, because half the commenters are saying nothing new here but the venue, and the other half are saying this is alarmist rhetoric for regulatory capture. Both can be true. The substantive part is that Anthropic is, again, leaning into the message that large scale labor displacement is a real possibility and that the public should hear about it. Whatever you think of their motives, they've been consistent on that line for two years. The Trump administration doesn't love it. The general public doesn't love it. They keep saying it anyway.

The venue is the news. Because when a frontier AI lab is presenting at the Vatican, alongside an encyclical, you've moved from technology story to civilizational story. You can argue about whether that's earned. You can't argue that it's not happening.

Let's get into something more practical. The price war is on. A post on r/singularity flagged that MiMo 2.5 Pro now costs the same as DeepSeek V4 Pro. According to commenters who've used both, MiMo is, quote, considerably better and more efficient than DeepSeek, especially for coding, though not at the level of GLM 5.1. The skeptics are asking the obvious question: is this a real breakthrough that lets them price aggressively, or are they just eating the loss to take share?

Probably both. We've watched this movie before in cloud. You burn money to lock in developers, you optimize the kernels for a year, the unit economics catch up. The interesting builder takeaway is that the floor for capable coding models is dropping fast. If you're paying frontier rates for a workflow that a mid-tier Chinese model can handle for a fifth of the cost, your CFO is going to notice. And if your product depends on being able to pass token cost through to the customer, you've got a problem coming.

Which is a perfect bridge to Uber. Andrew Macdonald, Uber's COO, told Business Insider that it's getting, quote, harder to justify, end quote, the money they're spending on AI tokens because they can't draw a clean line from spend to meaningful new features. This is the first time, as far as I can tell, that a Fortune 500 operator has said it that directly in public.

Now, the Reddit dunks on this are funny. Uber is a taxi company, what features are they even building? Fair, sort of. But the deeper point sticks. When companies start measuring token usage as an internal productivity metric, you get Goodhart's law in five minutes flat. Engineers will inflate their numbers, the dashboard will look great, and the actual product won't move. We talked about this on Saturday's episode in the context of token-maxxing as a cultural problem. It's now a balance-sheet problem. Uber spent the money. The features didn't ship. Now the COO is saying so on the record. Expect more of this. The honeymoon between CFOs and AI line items is ending, and the people who survive the next twelve months are the ones who can show real outcomes, not token throughput.

The one place where AI spend is still defensible without much hand-waving is security tooling, and Wiz announced an integration with Anthropic's Compliance API. This one is a small story but it's a real one. Wiz is the cloud security company everybody knows, the one Google tried to buy. Anthropic exposing a compliance API that Wiz can plug into means enterprises can finally see Claude usage in the same dashboard where they manage every other piece of cloud risk. Who's using it. What data is going through it. Where the audit trail is.

Not sexy. Necessary. If you're a founder selling into enterprise, you should be tracking which model providers are wiring themselves into the cloud governance stack and which ones are still asking customers to trust them on a Notion page. Anthropic, again, is being consistent here. They show up at the Vatican talking about risk, and they ship the compliance plumbing that lets risk officers actually do their job. Whether that's marketing or principle, the customer doesn't care. The customer cares that it works.

Now let's get into something a little more technical, but worth a few minutes because it changes how some of you should think about prompt engineering. A paper called SkillOpt got passed around on r/LocalLLaMA. The framing is great. They treat markdown skill files, the kind of thing a lot of you have been writing by hand to give an agent procedural know-how, as trainable parameters. You use a frontier model to propose bounded edits to the file, add, delete, or replace. Every edit is gated against a held-out validation set. Only strict improvements get accepted. Ties are rejected. Rejected edits become negative signal for the next round.

What's interesting in the results. Best skills converge with one to four accepted edits out of many more proposals. The edit budget per step matters. Four to eight works best. Take the cap off and performance collapses, because the model starts making big sweeping rewrites that overfit. The median final skill is around nine hundred and twenty tokens. And here's the part that should make your eyebrows go up: a skill optimized on Codex transferred to Claude Code with zero modification and gained 59.7 points on SpreadsheetBench. GPT 4.1 nano with an optimized skill roughly matched frontier on procedural benchmarks.
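The accept-or-reject loop described above can be sketched in a few lines. This is not the paper's code. It assumes each proposed edit is gated individually, that `propose` stands in for the frontier-model call, and that `score` stands in for the held-out validation run. `Edit`, `apply_edit`, and `optimize_skill` are hypothetical names.

```python
from typing import Callable, List, Tuple

# (op, old_text, new_text) with op in {"add", "delete", "replace"}
Edit = Tuple[str, str, str]


def apply_edit(skill: str, edit: Edit) -> str:
    op, old, new = edit
    if op == "add":
        return skill + "\n" + new
    if op == "delete":
        return skill.replace(old, "", 1)
    if op == "replace":
        return skill.replace(old, new, 1)
    raise ValueError(f"unknown op: {op}")


def optimize_skill(
    skill: str,
    propose: Callable[[str, List[Edit]], List[Edit]],
    score: Callable[[str], float],
    steps: int = 10,
    edit_budget: int = 4,
) -> Tuple[str, float, int]:
    """Return (final skill text, its validation score, number of accepted edits)."""
    best_score = score(skill)
    rejected: List[Edit] = []  # fed back to the proposer as negative signal
    accepted = 0
    for _ in range(steps):
        # Cap how many edits a single step may try.
        for edit in propose(skill, rejected)[:edit_budget]:
            candidate = apply_edit(skill, edit)
            candidate_score = score(candidate)
            if candidate_score > best_score:  # strict improvement; ties are rejected
                skill, best_score = candidate, candidate_score
                accepted += 1
            else:
                rejected.append(edit)
    return skill, best_score, accepted
```

Because the gate is a strict `>`, an edit that changes the wording without moving the validation score is rejected, which is one way a final skill can end up with only a handful of accepted edits.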

In plain English, what they're showing is that the markdown prompt sitting in your repo is doing more work than you think, and you can treat it like something you optimize rigorously instead of something you tweak by feel. The catch is real though. Their validation gate requires a graded answer. Works great for code and spreadsheets. Breaks for open-ended tasks. And as a commenter pointed out, the token cost to optimize a skill can run into thousands of dollars at scale, especially for complex agentic tasks where each evaluation burns reasoning tokens. So this is not free. But it's a methodology, not just a vibe, and that connects back to where we started the episode. The harness is the work. The skill file is part of the harness. Optimizing it deserves the same engineering discipline as optimizing your model choice.

Link to the paper is in the show notes.

Alright, let's swing to the geopolitics desk for a minute. A report passed around on r/LocalLLaMA, originally from the International Business Times, says China has been clamping down on overseas travel for AI talent at Alibaba and DeepSeek. The community reaction is split, predictably. One camp says this is bad for open-source AI out of China because researchers can't travel, share, collaborate. The other camp says it's evidence the talent is valuable enough that Beijing wants to protect it from being poached. A third camp is reflexively cynical about the framing and accuses the original poster of running narrative for Western labs.

What I'd actually pay attention to is more boring and more important. China has been the source of the most aggressive open-weight releases of the last year. DeepSeek, Qwen, GLM, MiMo. If the people doing that research start having travel restrictions, the question is not whether they defect. The question is whether the research culture shifts toward more closed publication. That would matter to every builder using those models. Because right now, a non-trivial percentage of you are running open Chinese weights in your products. If the open-release cadence slows down, your stack changes.

No conclusions yet. The original report is thin on sourcing. But worth watching.

While we're on China, a quick note from the robotics side. EngineAI claims they're producing one humanoid robot every fifteen minutes at their Shenzhen manufacturing base. That's thirty five thousand humanoids a year from that one line, with another ten thousand per year line planned in Zhengzhou. Add in Unitree, Leju, AgiBot, and others, and the rough math is China is positioning to put roughly a hundred thousand humanoid robots out the door annually.
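The thirty-five-thousand figure only works out if the line runs around the clock, every day of the year. That is an assumption behind the arithmetic, not something EngineAI is quoted as saying here. A quick check, with a hypothetical two-shift schedule for comparison:

```python
def annual_output(minutes_per_unit: float, hours_per_day: float = 24.0,
                  days_per_year: int = 365) -> int:
    """Units per year for a line that finishes one unit every `minutes_per_unit`."""
    units_per_hour = 60 / minutes_per_unit
    return round(units_per_hour * hours_per_day * days_per_year)


print(annual_output(15))                    # round the clock, every day
print(annual_output(15, hours_per_day=16))  # hypothetical two-shift schedule
```

The first call gives 35,040, which matches the quoted number. The second gives 23,360, so the headline rate is sensitive to how many hours the line actually runs.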

Look, I'm not in the camp that thinks these are about to be folding your laundry. The top Reddit comment captured it: stop showing me robots doing gymnastics, show me a robot clearing a dinner table. Fair. But the production rate is the real news. The bottleneck on humanoid robotics for the last decade has been unit cost and supply chain. Build one in a research lab for a million dollars and ship it on a press release. Build thirty five thousand a year and now the per-unit economics start to make sense, the parts ecosystem matures, and somebody, eventually, gets the manipulation problem solved well enough to do useful work. If you're a founder thinking about embodied AI, the question to ask is not when will the robot fold laundry. The question is when does the per-unit BOM cost drop below the line where it makes economic sense to put one in a warehouse. And on that question, China is moving faster than anyone else.

Okay, let's do a math story, because I love these and they keep happening. Last weekend we talked about Google DeepMind's agent autonomously solving nine of Erdős's open problems. This week, a thing called Mythos, which is using Claude Code under the hood, solved the unit distance problem that was recently handled by GPT 5.5. Some people are calling it a cute, simple proof. Some people are saying the proof is actually weaker and doesn't refute the same version of the problem that OpenAI's internal model did. The truth is probably in the middle, as usual with these announcements.

The Reddit comment I want to flag, from a user called TFenrir, is the one to actually internalize. Quote, I view basically every demonstration of capability as a demo for all of the big providers within the next six months. The competition is nuts and the capabilities are basically in lock step, end quote. That's the right frame. Whenever you see one lab announce a flashy capability, the other labs are at most six months behind. Sometimes weeks behind. Sometimes already there but quieter about it. So if you're betting your product on the fact that only one provider can do a particular thing today, you're betting on a moat made of fog.

The more interesting question, and one of the commenters raised it, is why are we getting more attention for one Erdős problem than for Mythos finding a thousand vulnerabilities? Because qualitative wins are easier to communicate than statistical ones. One famous problem solved, the math community can verify it, the story writes itself. A thousand vulnerabilities found in obscure codebases, you need a year of follow-up reporting to know which ones mattered. That's a media gap, not a capability gap. But it does mean we as builders should be reading the second story more carefully than the first one.

Now here's a deflation story to balance out the hype. A long thread on r/MachineLearning surfaced a critical essay by Nathan Witkin, a research writer at NYU Stern's Tech and Society Lab, published in the Substack publication Transformer. Witkin tears into the famous METR AI time horizons graph. You've seen this graph. It's the one that shows AI models doubling the length of tasks they can complete autonomously every few months. It gets cited everywhere as proof of exponential capability growth.

Witkin's critique is brutal and worth knowing. Some of the human baseline data isn't measured, it's guesstimated by the authors. When METR did measure human task time, they paid benchmarkers hourly, which incentivized them to take longer. The sample of human benchmarkers was biased toward the authors' friends and former colleagues. Humans familiar with a codebase finished tasks five to eighteen times faster than the strangers METR used as baselines, which means the comparison is inflated. And some tasks had published solutions online, which means there's training data contamination on the model side.
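To get a feel for the five-to-eighteen-times figure, here is a back-of-the-envelope calculation. The 60-minute task is a made-up number for illustration, not something from METR's data or Witkin's essay, and `familiar_minutes` is a hypothetical helper.

```python
def familiar_minutes(baseline_minutes: float, speedup: float) -> float:
    """Time a codebase-familiar human would need, given a stranger's baseline."""
    return baseline_minutes / speedup


baseline = 60.0  # hypothetical task, timed on someone new to the codebase
fast = familiar_minutes(baseline, 18)
slow = familiar_minutes(baseline, 5)
print(f"{fast:.1f} to {slow:.1f} minutes")
```

Under that assumption, a task labeled as an hour of human work is closer to three to twelve minutes for someone who knows the code, which is the sense in which the comparison is inflated.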

The Reddit response is, predictably, half-defensive. METR staff have publicly addressed many of these concerns. The graph is directionally correct even if it's imprecise in absolute terms. Capabilities are still improving, possibly exponentially. All of that is probably true. But the post's bigger point lands. Quote, the field's central pathology is to aggressively overindex on a mix of anecdotal data from power-users alongside a long list of benchmarks even more compromised than METR's, end quote.

For founders, this matters in a specific way. When somebody hands you a slide deck with the METR curve on it and says we'll have human-level coding by 2027, you should not throw out the deck, but you should be honest with yourself that the underlying measurement is contested. The trend is real. The slope is uncertain. Build a product strategy that doesn't require the steepest interpretation to be true.

Let me hit a few quick ones before we wrap.

On the local model side, there's a fun post from somebody running a 2013 Mac Pro, the trash can model, with a pair of D700 GPUs that finally got Vulkan support under a new Linux kernel. They're running Qwen 3.5 9B at eleven tokens per second on twelve gigs of VRAM. The fun line in the post, and I'll quote because it's earned, was that Qwen's planning output beat Claude Sonnet 4.6 on a complex C# .NET 10 app. Sonnet struggled, Qwen googled the docs. Take it with a grain of salt, it's one user's experience. But the broader signal is real. The gap between local open-weight models and frontier closed models, for many production tasks, is narrower than the marketing suggests. Especially if you're willing to do the harness work we keep talking about.

There was also a discussion about Qwen 3.6 27B running on a Strix Halo box with an HTML5 game console project. The user gave it three reference files and asked for a Breakout clone. One-shot playable, controls made sense, console API worked, single follow-up to fix one glitch. Their words: this is something nothing but Opus could handle until recently. Local capability is moving fast.

CXMT, the Chinese memory maker, started selling DDR5 to Corsair, which might pressure consumer RAM prices but probably won't help GDDR or HBM pricing because the wafers are all going to data center HBM demand. So RAM gets cheaper, GPUs don't. Cold comfort if you're trying to build a local rig.

There's a developer who built a tool called Coffer that adds a save button to ChatGPT, Claude, and Gemini responses and stores them locally in a searchable Markdown vault. Free Chrome extension. Useful if you keep losing good answers in long threads. Link in the show notes. A small thing, but the fact that this is even a viable product tells you how bad memory inside chat UIs still is. Which, by the way, ties right back to where we started this episode. Memory is the unsolved problem at every layer of the stack. Worker agents, multi-agent teams, individual users, all of us are drowning in context we can't structure.

One more story I want to spend a minute on, because it's actually a culture story disguised as a developer story. A modder named pardeike, who makes mods for the game RimWorld with about two million Steam subscribers across his catalog, posted on r/singularity. He found that users in the official RimWorld Discord are uninstalling his mods the moment they hear he's used AI to help update them. Not because the mods are worse. Not because there are bugs. On sheer principle.

He called the reaction religious and got dragged for it. But the dynamic is real and it's spreading. There's a population of users, not all young, not all loud, who have made an identity decision that products touched by AI are off-limits. If you're a builder, this is the other side of the AI economic story. We talk endlessly about how AI is going to expand markets. We talk less about the slice of customers who are doing the opposite, who are actively narrowing what they'll buy. That slice is small today. It will not stay small. If your product is enthusiastic about its AI-ness, you're picking a side. That's fine. But pick consciously, not by default.

Alright, let's bring it home.

The through-line of today's episode, if I had to pull one out, is that the discipline of AI engineering is moving from glamour to plumbing. The Memory Curator post, the SkillOpt paper, Wiz integrating with Anthropic's compliance API, Uber's COO admitting they can't justify the token spend, the METR graph getting properly critiqued. These are all stories about people finally doing the boring engineering work around models instead of just shouting about model releases. The Vatican story, the Hassabis AGI prediction, the humanoid robot factories, the math proofs, those are the loud stories. They get the headlines. But the work that actually wins, the moats that actually hold, are being built right now in the unglamorous middle layer. Memory governance. Skill optimization. Compliance APIs. Audit trails.

If you're a founder listening, I'd close my notebook on one question today. What's the boring infrastructure piece in your stack that you've been deferring because it doesn't demo well? Because that's where your competitors are not looking, and that's where the durability of your product gets decided.

That's the show. Links in the notes for the Memory Curator repo, the SkillOpt paper, the Witkin essay on METR, and a couple of the local model threads. I'm Tony DeLuca. Be kind, be skeptical, and I'll catch you tomorrow.
