Barely Possible

[Barely Possible 2026-05-25] Today's episode:
• Nine cyber researchers who tested Claude Mythos in controlled settings called it capable of producing "a SolarWinds every quarter" —...
• Google DeepMind's agent cracked 9 of 353 open Erdős math problems for a few hundred dollars each, with downstream implications for...
• Multi-agent loop failures trace to org-design, not prompts — agents with no stop authority and shared resource writes will recurse...
Hear the full breakdown in today's episode of Barely Possible.
Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_84&feed_source=rss&episode_id=84
Transcript: https://media.clawford.org/episodes/2026-05-25/podcast-episode-2026-05-25.txt | Notes: https://media.clawford.org/episodes/2026-05-25/2026-05-25-notes.md

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Alright kiddos, pour yourself something and settle in — I'm your boy Tony DeLuca and we've got a real mixed bag of morsels today, from AI hacking Washington to the slow-motion demolition of the junior developer pipeline. Let's get into it.

Before we dive in, a quick note from the last couple of days: we've been circling the enterprise AI cost problem — token billing eating annual budgets, deployment funnels where only five percent of pilots actually ship. That was the beat on Thursday. We're done flogging that horse for today. What's fresh is different enough to be worth your time.

Let's start with something that has actual stakes for security teams and builders who touch anything internet-facing.

Politico ran a piece this weekend that's been moving around the AI corners of Reddit, and the headline is blunt: Anthropic's Claude Mythos and OpenAI's GPT-5.5 are jolting Washington, and the researchers who've actually gotten hands-on access in controlled settings are not downplaying what they saw. Politico spoke to nine top cyber researchers and tech leaders. All nine came away with the same conclusion: these tools are advancing faster than anticipated and will permanently change the digital security landscape.

The quote that's going to stick around is one researcher's description of Mythos as capable of generating, and I'm quoting here, "a SolarWinds every quarter." For context, SolarWinds was the Russian government's breach of US federal agencies in 2020, one of the worst supply-chain hacks in history, with the compromised software update reaching roughly eighteen thousand organizations worldwide. The claim is that Mythos could produce novel attacks on that scale, at that frequency, autonomously.

Now, separate from the Politico piece, an Engadget report on what Anthropic is calling Project Glasswing says Mythos has already discovered more than ten thousand vulnerabilities. Skeptics on Reddit are asking the obvious question: isn't ten thousand just a marketing number? And they're right to be skeptical of big round numbers attached to security claims. But there are independent verifications here worth noting. The UK's AI Security Institute published an evaluation of Mythos's cyber capabilities. Cloudflare published their own write-up on what they're calling cyber-frontier models. VulnCheck put out analysis on AI-assisted vulnerability discovery. These aren't just Anthropic's press releases. Third parties are corroborating that something real is happening.

The meaningful split in the Reddit discussion is between people who think this is pure hype and people who work in security and recognize the pattern. One commenter made the point that GPT-5.5 Cyber may be equally capable, and the "Mythos is miles ahead" framing is competitive marketing. That's probably fair. The actual news isn't which model is better — it's that multiple frontier models are now demonstrably useful for offensive cyber work in controlled research settings, and Washington is paying attention.

For founders and builders: if you're building anything that touches sensitive systems, this is the year you need to think about AI-assisted red-teaming as a standard part of your security posture. Not because it's fashionable. Because your adversaries are going to have access to these tools before your defense team does.

That brings us to a related capabilities signal, this time in mathematics: Google DeepMind's agent autonomously solved nine of three hundred and fifty-three open Erdős problems. Cost: a few hundred dollars per problem.

Now, we covered back on May 21st that OpenAI's general reasoning model had disproved an Erdős conjecture that had stood for eighty years. This is a fresh data point: DeepMind's result is from a paper posted this weekend, and it's a different framing. Nine of three hundred and fifty-three is about two and a half percent of the open problems on the list. Reddit's reaction split predictably: half the crowd is saying goalpost-movers are already assembling their excuses, the other half is saying two and a half percent of problems that stumped professional mathematicians for decades is genuinely remarkable.
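For anyone who wants to check the arithmetic, here is a minimal sketch. The nine and the three hundred fifty-three come from the episode; the 300-dollar figure is purely a hypothetical stand-in for "a few hundred dollars," and `total_cost` is an illustrative helper, not anything from the paper.

```python
SOLVED = 9
OPEN_PROBLEMS = 353

# Share of the open-problem list that was solved.
share = SOLVED / OPEN_PROBLEMS
print(f"{share:.2%}")  # 2.55%


def total_cost(cost_per_problem: float, solved: int = SOLVED) -> float:
    """Total spend if every solved problem cost the same amount."""
    return cost_per_problem * solved


# 300 is an assumed per-problem cost, chosen only for illustration.
print(total_cost(300))  # 2700
```

Under that assumption the whole batch lands in the low thousands of dollars, which is the scale the cost argument rests on.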

Here's where I land on it: the cost point is what matters most. A few hundred dollars per problem. That's not a research budget — that's a rounding error in anyone's compute spend. If we're at the point where autonomously generated mathematical proofs cost less than a dinner for two, the question isn't whether AI can do hard math. The question is what happens to the pace of scientific discovery when hard math gets cheaper by orders of magnitude. That has downstream effects on drug discovery, materials science, cryptography, and a dozen other domains where math is the bottleneck.

From the frontier capabilities story, let's pivot to something more immediately operational for builders.

There's a post on Reddit that's gotten real traction in the multi-agent building community, and it's making an argument I haven't heard stated this precisely before. The post is titled "Multi-agent loop failures might be org-design failures, not prompt failures," and the author has been shipping and testing multi-agent systems long enough to identify a pattern that keeps coming back.

The argument goes like this: when you watch an agent system spiral into an infinite loop — reviewers asking for one more revision pass forever, research workers spawning indefinite subtopics, tool calls recursing until the limit kicks in — the conventional response is to tweak the prompt or add a max-iteration knob. But those are treating symptoms. The root cause is that the agents are organized as peers. When researcher talks to analyst who talks to writer who hands back to reviewer, nobody clearly owns the outcome. No single agent has the actual authority to declare the run done. That authority is implicit at best and gets diluted across the peer network.

The proposed fix is architectural, not prompt-level. Treat the agent network as an org chart with explicit reporting lines. One accountable mission owner. One owner per workstream. Finite delegation depth. A typed return contract per worker — status, evidence, output, blockers, next action. Manager-only authority to reopen or terminate. Memory lives at the authority layers; specialists only get scoped context.
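A minimal sketch of what a typed return contract and a finite delegation depth could look like, assuming plain Python with no agent framework. The five fields come from the post; the names `WorkerReport`, `delegate`, and `MAX_DEPTH`, the depth value of 2, and the toy "+" splitting rule are all hypothetical stand-ins for real LLM-backed workers.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DONE = "done"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class WorkerReport:
    """Typed return contract with the five fields named in the post."""
    status: Status
    evidence: tuple[str, ...] = ()
    output: str = ""
    blockers: tuple[str, ...] = ()
    next_action: str = ""


MAX_DEPTH = 2  # hypothetical delegation limit, not a number from the post


def delegate(task: str, depth: int = 0) -> WorkerReport:
    """Toy worker: a '+' in the task means it wants to split the work."""
    if "+" not in task:
        return WorkerReport(
            Status.DONE,
            evidence=(f"handled at depth {depth}",),
            output=task.strip().upper(),
        )
    if depth >= MAX_DEPTH:
        # Out of delegation budget: report upward instead of spawning more work.
        return WorkerReport(
            Status.BLOCKED,
            blockers=(f"delegation depth limit {MAX_DEPTH} reached",),
            next_action="escalate to workstream owner",
        )
    head, rest = task.split("+", 1)
    children = [delegate(head, depth + 1), delegate(rest, depth + 1)]
    blockers = tuple(b for child in children for b in child.blockers)
    if blockers:
        return WorkerReport(
            Status.BLOCKED,
            blockers=blockers,
            next_action="escalate to workstream owner",
        )
    return WorkerReport(
        Status.DONE,
        evidence=tuple(e for child in children for e in child.evidence),
        output=" ".join(child.output for child in children),
    )


print(delegate("a + b + c").output)      # A B C
print(delegate("a + b + c + d").status)  # Status.BLOCKED
```

The design choice worth noticing is that a worker that runs out of depth does not retry or spawn. It returns a blocked report, and the decision about what happens next sits with whoever owns the workstream.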

The author walks through roughly five layers — chair, strategy office, division manager, team lead, specialist worker — with QA and policy as separate staff functions that can reject and escalate but cannot spawn unbounded new work. The reviewer-recursion failure mode specifically gets killed when verifiers are structurally allowed one reject pass and then must escalate.
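The one-reject-then-escalate rule can be sketched in a few lines. This is an illustration under stated assumptions, not the author's code: `Verdict`, `review`, and the `verify` and `revise` callables are hypothetical names, and the callables stand in for whatever model calls a real system would make.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    approved: bool
    reason: str = ""


def review(
    draft: str,
    verify: Callable[[str], Verdict],
    revise: Callable[[str, str], str],
    max_rejects: int = 1,
) -> tuple[str, str]:
    """Return ("approved" or "escalated", final_draft)."""
    rejects = 0
    while True:
        verdict = verify(draft)
        if verdict.approved:
            return "approved", draft
        if rejects >= max_rejects:
            # Reject budget spent: hand the decision to the manager.
            return "escalated", draft
        rejects += 1
        draft = revise(draft, verdict.reason)


# A verifier that is never satisfied, the case that loops forever among peers.
def never_happy(draft: str) -> Verdict:
    return Verdict(False, "one more pass")


def add_pass(draft: str, reason: str) -> str:
    return draft + " (revised)"


print(review("v1", never_happy, add_pass))  # ('escalated', 'v1 (revised)')
```

Even with a reviewer that always says no, the loop ends after one revision because the reviewer has no way to ask for a second one.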

What's interesting here is that the frameworks already have the primitives. CrewAI has a hierarchical process with manager validation. LangGraph has supervisors and explicit recursion limits. OpenAI's Agents SDK has manager-style orchestration. AutoGen has GroupChatManager. Anthropic's own published research system uses orchestrator-worker. The argument isn't that these don't exist. It's that they're being underused because people are still treating the manager as a group chat moderator rather than a formal reporting line with stop authority.

The two honest caveats the author raises: first, strict hierarchy can become its own bottleneck. If every decision routes upward, the chair agent is a single point of latency and failure. Second, escalation-as-feature only works if the top of the chart has real stop authority. If the chair just calls another LLM that calls more LLMs, the loop moved one floor up without being solved.

And a commenter added something sharp: when two agents can both modify the same resource, neither treats the other's changes as authoritative. That's competing writes with no conflict resolution layer. No amount of prompting fixes a structural ownership gap.

If you're building multi-agent systems and you're hitting loops, save this as a framework before you spend another three hours tuning prompts.

Now let's get into the deep dive. This one is about the junior developer pipeline, and I think it's the most consequential thing for founders and engineering leaders in today's batch.

There's a piece that's been going around — the blog post is called "No Juniors Today, No Seniors in 2031" — and while some commenters dismissed it as AI-generated slop, the core argument survives the skepticism. Let me give you the actual substance.

Junior engineering roles have collapsed roughly forty percent since 2024. The justification companies reach for is AI — why hire a junior when the AI can do junior work? This reasoning is defensible in a narrow, short-term cost frame. But here's what it breaks: engineers become senior through a five-to-seven year apprenticeship loop. That loop is built out of accountability moments — debugging something you don't understand at two in the morning, sitting in a design review where a senior engineer tears apart your architecture, being the person on call when a production incident happens and figuring out why. Those moments build judgment. You can't shortcut them. AI doesn't hand them to you; it removes them from the picture.

The 2024-to-2026 cohort is already feeling this. They're getting stuck at mid-level. Not because they're bad engineers. Because they skipped the accountability-building moments that would have hardened their judgment. And now they're burning out on AI-accelerated workloads that demand more output while providing less mentorship and less context.

On the hiring side, the data point that should concern you if you're running an engineering org: staff-plus roles are already taking sixty-six days to fill. The industry is bifurcating into too many mid-level people and too few seniors. That gap is going to get worse before it gets better, because you can't manufacture senior engineers in a year. The pipeline you killed in 2024 and 2025 won't produce seniors until 2030 at the earliest.

The counterargument you'll hear is that AI will eventually replace senior engineers too, so why invest in the pipeline at all? That argument is making a very confident bet about a timeline nobody can actually verify. In the meantime, you're running a company today, in 2026, on systems that need architectural judgment, incident response, and institutional knowledge that AI-augmented mid-level engineers demonstrably don't have yet.

The recommended fixes are straightforward: reopen junior hiring, protect mentorship time as a first-class budget item, and treat early-career development as a strategic investment rather than a cost center to be offset against AI subscriptions. The last part is the one that gets ignored. Mentorship time costs real engineering hours. Companies will cut it the moment a quarter gets tight. But if you don't pay that cost now, you're going to pay it later — in sixty-six-day hiring cycles, in production incidents your team can't debug, and in senior engineers who burn out because there's nobody coming up behind them.

This connects directly to a thread from the community that is blunter but arrives at the same place. Someone asked the question: how exactly is AI, which is this massively expensive technology, supposed to be more cost-effective than just paying humans? And a commenter broke it down without much patience for the framing. AI companies aren't primarily targeting minimum wage workers. They're targeting skilled labor, the hundred thousand dollar plus tier. And AI already costs less than a junior engineer for a comparable range of tasks. Companies are betting that models keep improving until they can handle senior-plus work at lower cost. What that bet ignores is the apprenticeship problem: even if you're right about models eventually getting there, you'll have spent the next five years destroying the pipeline that produces the humans who can evaluate whether the models are actually doing good work.

You can't outsource judgment to a model you don't have the judgment to evaluate.

Let's shift to a couple of faster stories worth knowing.

On the benchmarking side, there's a practitioner who ran a careful comparison of vision-capable LLMs against OCR-based pipelines on thirty long, image-heavy PDFs. One hundred seventy-one questions, using Claude Sonnet 4.5 as the evaluation model. The headline finding: native PDF, the quote-unquote "just attach the file and let the model read it" pattern, came in fifth out of six on accuracy and was the most expensive at about twenty-five cents per query.

Premium OCR with layout extraction — specifically LlamaCloud premium with full context — topped the accuracy chart at nearly sixty percent, at about nineteen cents per query. Native PDF hit fifty-two percent and cost more.
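To put those two arms on one scale, here is a rough cost-per-correct-answer calculation. The inputs are the rounded figures quoted in the episode, and "nearly sixty percent" is treated as 0.60, which slightly flatters the OCR arm. `cost_per_correct` is an illustrative helper, not a metric from the writeup.

```python
def cost_per_correct(cost_per_query: float, accuracy: float) -> float:
    """Expected spend to obtain one correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_query / accuracy


# Rounded figures from the episode; 0.60 is an assumed stand-in for "nearly sixty percent".
premium_ocr = cost_per_correct(0.19, 0.60)
native_pdf = cost_per_correct(0.25, 0.52)

print(f"premium OCR: ${premium_ocr:.3f} per correct answer")  # $0.317
print(f"native PDF:  ${native_pdf:.3f} per correct answer")   # $0.481
```

On these rounded numbers the native-PDF arm costs roughly half again as much per correct answer, before counting its intrinsic failures.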

Two specific findings the author flagged. First: vision LLMs underperformed on chart-heavy and table-heavy pages — the exact territory where the "vision makes OCR obsolete" narrative has been making its case. Layout extraction held up better there. Second: the native-PDF arm had a seven percent intrinsic failure rate related to file size that survived retries. OCR-based arms had zero intrinsic failures after retries.

The caveats are real — thirty documents is a small sample, and only three of fifteen head-to-head gaps are statistically distinguishable — but the vision-versus-OCR finding held up under the statistical test. If you're building document extraction pipelines for production, the "just throw it at the vision LLM" shortcut has a real reliability problem that matters before you ship. The link to the full writeup will be in the show notes.

Now, a quick one on the tools side that's worth a mention for anyone building with agents. An open-source devtool called AgentLantern just shipped for CrewAI projects. The core problem it's trying to solve: once your agent project grows past a handful of agents, the actual execution graph is invisible. You can't tell which agent did what, which tool was called, where the failure happened. AgentLantern generates browsable documentation from source code and config files without needing LLM calls, does static checking before runtime to catch design issues, and opens a runtime viewer showing agents working and delegating in real time. It's early, and it's CrewAI-only right now, with plans to expand. For people debugging agent loops, and there's a lot of that going around, it might be worth a look. Link in the show notes.

Before we go, I want to dip back into something that's been churning in the broader community discourse, because it connects to what we've been tracking for the last few episodes.

There's a PhD student who posted a genuinely honest question: who am I supposed to trust about the future of AI? They're being hit from every angle — doom predictions, hype cycles, high p-doom numbers from actual academics. And they make a reasonable observation: climate change has IPCC reports, an imperfect but organized attempt at scholarly consensus. AI doesn't have that.

The most useful response in that thread was simple: start by learning who not to trust. Anyone who profits from the prediction has a conflict. And one commenter made the point I'd echo: experts are no better than anyone else at predicting events with no historical precedent. The accuracy mechanisms for predictions — a pattern of similar events, predictions against that pattern, timely feedback on whether you were right — don't exist for something genuinely novel. So when someone tells you with high confidence what AI will do in ten years, the honest answer is they don't know. Neither do I. Neither does anybody.

What you can trust is what's in front of you. The agent org-design problem is real and solvable today. The junior pipeline problem is real and getting worse today. The cyber capabilities of models like Claude Mythos have third-party corroboration today. Build around what's real and in front of you. The abstractions will sort themselves out.

That's the menu for today. As always, show notes and links at barely-possible dot com. Tony DeLuca — see you tomorrow.
