Barely Possible

AI-enabled zero-day exploit marks first criminal weaponization of AI for cybercrime

Show Notes

[Barely Possible 2026-05-12] Today's episode:
• Google confirmed the first criminal AI-generated zero-day exploit; OpenAI announced "Daybreak" cyber defense the same day.
• Anthropic is weighing a $900B pre-money round of up to $50B — above OpenAI's $852B March valuation — ahead of a fall IPO.
• Cerebras IPO Thursday: demand 20x available shares, implied market cap projected above $50B on day one per PolyMarket.
Hear the full breakdown in today's episode of Barely Possible. Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_71&feed_source=rss&episode_id=71 Transcript: https://media.clawford.org/episodes/2026-05-12/podcast-episode-2026-05-12.txt

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_feed&feed_source=rss

Alright kiddos, pull up a chair — I'm your boy Tony DeLuca and we've got a full plate of AI news today, so let's not waste daylight.

May 12th. Monday. We've got agents getting wallets, AI finding zero-day exploits before the good guys do, a leaked Google video model, a semiconductor shake-up that's reshuffling the whole deck, and a deep dive into one of the more interesting practitioner debates I've seen in a while — what file format you use when you're handing off work between your AI tools, and why it says something bigger about how knowledge work is actually changing. We're getting into all of it.

But before any of that, let me pick up a thread from last week, because it connects to what's happening today.

We talked last episode about the Claude Mythos time-horizon data — that METR finding showing reliability at long-horizon tasks roughly doubling every 45 days. The implication was: agents are getting meaningfully more capable at running unsupervised, and the rate of improvement is not slowing down. Hold that thought. Because everything we're covering today is downstream of that basic shift. Agents that can transact autonomously, agents that can find exploits, agents that can hold context across sessions — these are all extensions of that same capability curve finally hitting practical infrastructure.

Okay. Let's start with the cybersecurity story, because it's arguably the most consequential headline of the day and the one that's going to be hardest to ignore.

Google published a report saying cybercriminals have created a zero-day exploit using AI — described as the first documented case of artificial intelligence being used to find and weaponize a software vulnerability for an illicit enterprise. Not a state actor with a hundred-million-dollar budget. Criminals. And Sam Altman, same day, announced OpenAI is launching something called Daybreak — their effort to accelerate cyber defense and continuously secure software. His framing was direct: AI is already good at cybersecurity and is about to get very good at it. The pitch is to start working with companies now to help them continuously secure themselves.

Let's sit with the timing here for a second, because it's not subtle. Google says bad actors have already crossed a line — they've used AI to find a zero-day. OpenAI, same day, says we're standing up a defense program. You could read this as a coincidence. You could also read it as two major AI shops doing what any experienced security practitioner would recognize: the offensive capability becomes public, and the defensive commercial pitch follows immediately after.

Here's what this actually means for builders. A zero-day found by AI is faster, cheaper, and more scalable than one found by human researchers. It can be retargeted. It can be pointed at any stack. The specific target in the Google report is less important than the category: AI-assisted vulnerability discovery is no longer theoretical. If you're running production infrastructure, any system that relies on the assumption that discovery takes time is now less safe than it was two weeks ago. The patch cycle gets shorter, not longer. The window between discovery and exploitation compresses. That's the new operating environment.

And to be real about Daybreak — I don't know yet whether it's a genuine defense product or a branding exercise riding the news cycle. But the announcement matters because it signals that the big AI shops are now positioning in the security market in a serious way, not just as a side feature but as a core commercial offering. Watch who they partner with, what the actual product surface looks like, and whether enterprise security buyers treat it as a real vendor or as a curiosity.

The broader point is one we've circled before on this show in a different context: AI makes both attack and defense cheaper. But offense historically scales faster than defense in any new capability regime, because offense requires only one success and defense requires stopping everything. That asymmetry gets worse, not better, when you add AI to the offense side.

Now let's move from security to money, and there is a lot of money moving around right now.

Anthropic is reportedly weighing a fundraising round that would value the company at around $900 billion pre-money. That's the number circulating, sourced to the Financial Times. For context: Anthropic last raised in February at a $380 billion valuation. Revenue has risen significantly since then. The FT reports the round could raise as much as $50 billion. If that comes together, Anthropic would be valued above OpenAI, which last raised at $852 billion in March.

One of the investor quotes making the rounds: "People are ready to throw any dollar amount at Anthropic. It's just about when Anthropic want to pop their heads up and say we're ready." Another investor specifically cited the SpaceX compute deal as de-risking the investment, calling it resolution of the biggest bottleneck. Terms haven't been finalized. The round is described as a potential last private raise before what's expected to be a fall IPO.

And then there's Cerebras — yes, the chipmaker, not the lab — coming up on its IPO this Thursday, and the numbers are wild. Originally priced at $115 to $125 per share. As of the latest reporting, they're considering bumping that to $150 to $160. That would push the implied valuation past $34 billion, up from the $26 billion previously expected. Demand among institutional investors has been described as 20 times the number of available shares. On PolyMarket, the implied market cap at close on day one is projected above $50 billion. Some analysts are very bullish. Others — specifically some NVIDIA-focused tech analysts — are saying they wouldn't touch it with a hundred-foot pole, citing fundamental execution risk on scaling. We'll see Thursday.

Also worth noting on the semiconductor side: TSMC reported its slowest sales growth in six months — 17.5% annualized in April, roughly half what analysts had forecast. The interpretation isn't that AI demand is softening; it's that TSMC is simply out of advanced fab capacity. TSMC is sold out. Not slowing. Sold out. The bottleneck isn't demand, it's physical manufacturing capacity. And upstream components like high-bandwidth memory are hitting their own constraints simultaneously.

That context matters for a side story that flew under the radar: Apple has signed a preliminary agreement for Intel to manufacture some of its chips. This deal has apparently been in negotiation for over a year, with the White House applying real pressure on Apple to work with Intel rather than relying entirely on TSMC. The Commerce Secretary has been personally running meetings with tech companies to drum up Intel business. Apple used to be TSMC's top customer. Now, because everyone in the AI build-out is competing for the same fab time, Apple's negotiating leverage in Taiwan has weakened. So they diversify. Intel gets a customer. And you start to see a possible scenario where the semiconductor landscape looks meaningfully different in two years than it does today.

AMD and Intel have both popped roughly 25% over the past week on deal-making momentum. NVIDIA, by comparison, was up only around 8% — which in normal circumstances would be great, but in a week where double-digit gains were the baseline, looks relatively muted. One analyst at Mizuho is talking about a changing of the guard in AI hardware. Memory suppliers specifically are the ones printing money right now — when capacity constraints hit and pricing surges, and your fixed costs barely move, the economics are extremely favorable.

One more item in the weird-but-plausible category before we move on: there are apparently serious conversations happening about installing micro data centers on the exteriors of newly built homes, as nodes in a distributed computing cluster. Pulte Group, a major housing developer, is in a testing phase with partners including NVIDIA and a California startup called SPAN. The argument is that for batch processing and non-time-sensitive workloads, the home environment works surprisingly well, assuming you can solve power, connectivity, and heat. Whether homeowners and regulators go for this is a completely different question — the same community opposition that's blocked big data centers doesn't automatically evaporate when you scale it down to a house-sized unit. But the fact that this is being seriously explored is itself a signal about how deep the compute shortage runs.

Now. Let's shift over to something a little more unusual. There's a data story from ICLR 2026 that circulated over the weekend and I want to spend a minute on it, because it's the kind of thing that gets a shrug in the US but should probably get more attention.

ICLR — the International Conference on Learning Representations — is one of the top three AI and machine learning research conferences in the world. The breakdown of contributions by country for 2026: China, including Hong Kong, accounted for more than 50 percent of total contributions. Europe came in around 20 percent. The US and everyone else split the remainder. Google, as an institution, is at 1.3 percent.

Now, you can debate what conference contribution counts tell you about who's winning the actual AI race. Conference papers aren't the same as deployed capability, and there are a lot of very capable people who don't publish at top venues. But ICLR is a genuine proxy for where foundational ML research is happening. And the answer, by that measure, is China. By a lot.

Singapore got a shout-out for punching well above its weight. Which is true. But the headline is the China number. The US policy debate about AI export controls and chip restrictions has largely been framed as slowing Chinese capability development. But if you're looking at where the research talent and research output is concentrated, the picture is complicated. Half the world's top ML research is being produced by Chinese institutions. That's not a thing you export-control your way out of.

Now let's shift from research geopolitics to something happening right at the surface of daily practice.

Consensus NLP raised $30 million to build what they're calling an AI OS for Research, with 2.5 million researchers reportedly starting their work on the platform. It's a niche angle compared to the Anthropic fundraising numbers, but it's meaningful. Vertical AI platforms built specifically for research workflows — literature review, synthesis, citation management — are a real product category. The funding validates that there's enterprise money willing to pay for AI that understands domain-specific research work, not just general-purpose assistants.

Okay. Let's talk about one of the more interesting practitioner conversations that went viral over the past few days, and then I want to go deep on why it matters for how you're building right now.

The conversation started with a piece called "The Unreasonable Effectiveness of HTML" by Tariq Shahipar, who works on Claude Code at Anthropic. It got around 10 million views. That's not typical for a technical essay about file formats. The question it poses is: should you be using HTML instead of Markdown when your AI agents are producing documents, specs, plans, and handoffs?

The short version of Tariq's argument: Markdown has become the default format for agent-to-human communication. It's simple, portable, easy to edit. But as agents do more complex work, Markdown gets limiting. He finds that he won't actually read a Markdown file longer than about 100 lines. He can't get anyone else in his org to read one either. HTML, by contrast, can carry visual hierarchy, color, tables, diagrams, interactive elements — it renders natively in a browser, it's easy to share as a link, and it can be interactive in ways Markdown can't. He's arguing that for specs, planning docs, code review documents, design exploration — HTML is the better format.

The cynic response came fast: Tariq works for Anthropic, HTML costs way more tokens than Markdown, and more tokens means more revenue for Anthropic. That's a fair observation to put on the table. Whether it's the full story or not, it's worth knowing.

But here's where it gets more interesting than just a format debate.

Section two, today's deep dive.

I want to get into the actual substance of what the HTML-versus-Markdown argument reveals about how work is changing for knowledge workers in the agent era. Because I think the file format is almost a side issue. The thing underneath it is more important.

The argument that got real traction — and it's not from Tariq, it's from the responses — is about what kind of document you're actually creating when you're working with agents. There are three questions that a commenter named Smart Ape framed really cleanly: who reads it, who edits it, and how long does it live? If an AI model in a future session is reading it, that's a vote for Markdown. If a human is reading it, that's a vote for HTML. If it's going to be edited many times, Markdown. If it's going to be written once, HTML. If it lives forever and gets indexed, Markdown. If it's ephemeral, HTML. When those three votes line up, you have your answer. When they split, you're in hybrid territory.
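If you want that three-question framework in a form you can actually apply, here's a minimal sketch in Python. The function name and the string labels are my own choices, not anything from the original discussion; it just tallies the three votes described above and reports a hybrid when they split.

```python
def choose_format(reader, edit_frequency, lifespan):
    """Tally the three votes: who reads it, who edits it, how long it lives.

    reader:          "model" or "human"
    edit_frequency:  "many" (edited repeatedly) or "once" (write-once)
    lifespan:        "long" (indexed, lives forever) or "ephemeral"
    """
    markdown_votes = 0
    markdown_votes += reader == "model"           # model-read: vote Markdown
    markdown_votes += edit_frequency == "many"    # edited often: vote Markdown
    markdown_votes += lifespan == "long"          # long-lived: vote Markdown
    if markdown_votes == 3:
        return "markdown"
    if markdown_votes == 0:
        return "html"
    return "hybrid"  # the votes split: you're in hybrid territory
```

A session handoff that a future model reads, gets edited every session, and lives in the repo comes out "markdown"; a one-off design review for human eyes comes out "html".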

That framework is useful for the immediate question. But the deeper insight — and this is the one I want to sit with — is about what the agent era is doing to the structure of knowledge work itself.

Here's the thing. Before AI agents, your job as a knowledge worker was to produce an output. You went from blank page to finished thing as directly as possible. The in-between states — the draft, the outline, the brainstorm — were transitions. You moved through them as fast as you could to get to the deliverable. Time spent in-between was time not spent finishing.

In the agent era, that's inverted. Your job increasingly is not to produce the output. Your job is to create the conditions under which the agent can produce the output. You're not the producer. You're the staging director. And that means you now live, professionally, in what you might call the in-between space — the liminal workspace between brainstorming and building, between deciding and implementing, between one session and the next.

The handoff document — whether it's a Markdown file or an HTML file — is the primary artifact of that in-between space. And here's what makes it genuinely hard: most real projects exist in a state of mixed doneness. Some parts are locked. Decided. Non-negotiable. Other parts are totally open — the agent should explore, generate options, surface trade-offs. And a bunch of parts are semi-decided: you're leaning in a direction, but it's provisional, subject to the agent pushing back and finding something better.

The problem with a Markdown file for conveying that mixed state is that it's text-flat. You can write the words "this is decided" versus "this is open," but you're adding meta-commentary about the document inside the document itself. That gets noisy. And when an AI writes a handoff document — when you ask Claude to summarize the conversation and hand it off to the next session — it tends to present everything as ordained. It flattens provisional decisions into firm decisions. It makes everything look equally resolved, even when it isn't.

That's not a small problem. Because when the building agent on the receiving end treats a provisional decision as locked, it constrains its own range. You lose exactly the thing you were hoping the agent would bring: better judgment on the parts you hadn't fully figured out yet. Over-specify, and you kill the agent's range. Under-specify, and it flails. The skill is in calibrating how much structure to impose so that what's left unspecified is something the agent can actually productively resolve.

That calibration problem is real and it's new. It didn't exist when you were the one doing the work, because your own brain held the nuance of what was locked versus provisional versus open. You didn't need to encode it explicitly. The agent does.

Now, HTML's argument is that it gives you native tools to encode that mixed doneness without meta-commentary. Visual hierarchy, tabs, color-coded status, expandable sections — these are ways the document itself can communicate which parts are firm and which are open, without needing prose annotations. Whether that's actually better in practice depends on the situation. For short-lived, human-readable specs — probably yes. For long-lived indexed documentation that the model reads — probably Markdown is still the right call.
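To make "encoding mixed doneness without meta-commentary" concrete, here's one possible sketch: each section carries an explicit status field, and a tiny renderer turns that status into a color-coded badge rather than prose annotations. The section contents, status names, and colors are all invented for illustration.

```python
from html import escape

# Hypothetical handoff structure: each section carries an explicit status
# ("locked", "leaning", "open") instead of burying it in prose.
SECTIONS = [
    ("Auth flow", "locked", "Use OAuth device flow. Non-negotiable."),
    ("Storage layer", "leaning", "Postgres, but open to SQLite for v1."),
    ("Retry policy", "open", "Explore options and surface trade-offs."),
]

STATUS_COLOR = {"locked": "#c62828", "leaning": "#f9a825", "open": "#2e7d32"}

def render_handoff(sections):
    """Emit an HTML fragment where doneness is visual, not meta-commentary."""
    parts = []
    for title, status, body in sections:
        badge = (f'<span style="color:{STATUS_COLOR[status]}">'
                 f"[{status.upper()}]</span>")
        parts.append(f"<h3>{badge} {escape(title)}</h3><p>{escape(body)}</p>")
    return "\n".join(parts)
```

The point isn't this particular markup; it's that the status lives in the structure, so a receiving agent (or human) can tell locked from provisional without parsing hedged sentences.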

But the debate itself is pointing at something real: the craft of working with agents is not just prompt engineering. It's something more like information architecture for the in-between. How do you represent partial state? How do you hand off mixed doneness between sessions? How do you keep the receiving agent from over-anchoring on decisions that were still provisional when you handed off? Those are genuinely new skills, and we are in the early innings of figuring out good answers.

One thing worth watching: there was a commenter who pointed out that a well-built HTML document can include interactive elements that let you tweak parameters and copy the changes directly back into a prompt. That's a feedback loop embedded in the document itself. Not just a spec, but an interface. That's a more interesting version of the argument than "HTML is prettier than Markdown."

The token cost concern is real and shouldn't be waved away. If you're running hundreds of agent sessions a day and every handoff document is a 50-kilobyte HTML file instead of a 10-kilobyte Markdown file, that adds up. You need to be honest with yourself about whether the richer format is actually being used, or whether you're paying a 5x token premium for a document the agent skims the same way it would have skimmed the Markdown.

The honest answer is probably: know your use case. Ephemeral, human-facing, rich visualization? HTML has a strong case. Long-running, frequently-edited, model-consumed? Markdown is still your friend. And for that specific class of document where you need to represent mixed doneness — the planning brief, the session handoff, the work-in-progress spec — experiment with both and watch what the receiving agent actually does with it.

Now. A couple of other stories worth your time before we close.

There's a community post circulating that's getting decent attention among builders, about a team that built a self-optimizing LLM stack. Month one, they were manually picking models for each task. Month one cost: $420. Month two, they built a feedback loop — every request traced with inputs, outputs, model used, tokens, cost, latency, and a quality score. The router clustered similar requests and learned which model actually performs best for each cluster. After three weeks of trace data, they fine-tuned a 7B model on their specific workloads — that smaller model took over classification, tagging, and summarization at 95% agreement with GPT-5.1 at 2% of the cost. Month two bill: $73. Month three, they changed nothing. The bill dropped another 12% because the router had more data and was making better decisions.
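The routing idea in that post can be sketched in a few lines. This is my own simplified reconstruction, not the team's actual code: model names and per-token prices are hypothetical, and clustering is assumed to have already happened upstream. The router picks the cheapest model whose observed quality on a given request cluster clears a floor, and falls back to the frontier model when no cheap option has proven itself.

```python
from collections import defaultdict

# Hypothetical per-1K-token pricing; not real vendor rates.
MODEL_COST = {"frontier": 0.010, "mid": 0.002, "local-7b": 0.0002}

class TraceRouter:
    """Route each request cluster to the cheapest model whose observed
    quality on that cluster clears a threshold, learned from traces."""

    def __init__(self, quality_floor=0.9):
        self.quality_floor = quality_floor
        # traces[cluster][model] -> list of quality scores from past runs
        self.traces = defaultdict(lambda: defaultdict(list))

    def record(self, cluster, model, quality):
        self.traces[cluster][model].append(quality)

    def pick(self, cluster):
        candidates = []
        for model, scores in self.traces.get(cluster, {}).items():
            avg = sum(scores) / len(scores)
            if avg >= self.quality_floor:
                candidates.append((MODEL_COST[model], model))
        if not candidates:
            return "frontier"  # nothing cheap has proven itself: go big
        return min(candidates)[1]  # cheapest model that clears the floor
```

Note that `pick` trusts the recorded quality scores completely, which is exactly where the commenter's warning below bites.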

The comment worth flagging: "The interesting part isn't the model routing, it's whether your quality scoring stays trustworthy over time. Feedback loops get weird fast once the system starts optimizing against the evaluator itself."

That caveat is correct. This kind of self-improving loop is genuinely powerful, but it has a failure mode where the system learns to game its own quality signal — it starts producing outputs that score well on your evaluator without actually being high-quality outputs. The guardrails the commenter listed are real requirements: trusted eval sets, human-reviewed samples, drift checks, rollback paths. Without them, you get a very efficient system that's confidently wrong.

That said, the core architecture here — production traces over benchmarks, specific routing decisions based on real workload clusters, incremental fine-tuning on validated outputs — this is the right direction for anyone running AI at meaningful scale. Benchmarks tell you what a model is generally capable of. Your production traces tell you what it does on your actual workload. Those are not the same thing, and the gap between them is often where your costs are hiding.

Now let's talk about a story that didn't get enough coverage, and it's directly relevant if you're thinking about how the economics of building on top of agents change over time.

AWS launched something called Amazon Bedrock AgentCore Payments in partnership with Coinbase and Stripe. The summary: your AI agent can now have a wallet and spend money autonomously during task execution. You fund it, set a session spending limit — say $5 per run — and the agent can pay for APIs, data sources, and other agents mid-execution without interrupting for human approval.

The protocol making this work is called x402. It's open source, developed by Coinbase, and it revives the HTTP 402 status code — "Payment Required" — which has technically existed in the HTTP spec for decades but was never widely implemented. The flow is: agent requests a resource, server responds with 402 plus a price, agent signs a USDC micropayment, gets the content, keeps going. Settlement happens in roughly 200 milliseconds on Base, the Coinbase L2 chain, at a fraction of a cent per transaction.
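The request/402/pay/retry loop can be illustrated with an in-process toy. To be clear about assumptions: this does not use the real x402 wire format or any Coinbase SDK; the server and the "signed USDC payment" are stubs, and the price and budget numbers are made up. It only shows the shape of the handshake and the session budget check.

```python
# Toy simulation of the x402-style handshake described above. The real
# protocol settles signed USDC payments on Base; here both sides are
# in-process stubs, and the "signature" is a placeholder string.

PRICE_USDC = 0.001  # hypothetical per-request price

def server(request, payment=None):
    """Stub resource server: quote a price, then serve once paid."""
    if payment is None:
        return 402, {"price_usdc": PRICE_USDC}  # HTTP 402 Payment Required
    if payment["amount"] >= PRICE_USDC and payment["signature"]:
        return 200, {"body": "the data you paid for"}
    return 402, {"price_usdc": PRICE_USDC}

def agent_fetch(budget_usdc):
    """Agent side: request, read the quoted price, pay if within budget."""
    status, info = server("GET /resource")
    if status != 402:
        return info, budget_usdc
    price = info["price_usdc"]
    if price > budget_usdc:
        raise RuntimeError("quoted price exceeds session budget")
    payment = {"amount": price, "signature": "0xstub"}  # stand-in for a signed transfer
    status, body = server("GET /resource", payment)
    assert status == 200
    return body, budget_usdc - price
```

The session-limit idea from AgentCore maps onto the `budget_usdc` check: the agent refuses to pay, rather than pausing for human approval, when a quote would blow the run's budget.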

The protocol has processed over 169 million payments since launch, across 590,000 buyers and 100,000 sellers. Coinbase also launched something called the Bazaar MCP server inside AgentCore Gateway — essentially an app store for x402-enabled services. Agents can discover and pay for services on their own, within budget constraints you set.

Why does this matter for builders? Because the pricing model for software is about to bifurcate. There are going to be products built for humans — subscriptions, seats, dashboards — and products built for agents — pay-per-call, x402 endpoints, micropayment APIs. Right now, if you're building a data API or a research tool, your monetization model assumes a human on the other end who signed up for a plan and has a billing relationship with you. That model breaks when the consumer is an autonomous agent running 500 tasks a day at fractions of a cent each. Traditional payment rails aren't built for that. x402 fills that gap, at least in principle.

The honest take here is that this is preview infrastructure. It's not production-ready magic. The agentic economy is still in very early days. But the builders who figure out agent-native pricing now are going to have a real advantage over those retrofitting subscriptions later. If you're building any kind of specialized data service, tool, or API right now, the question you should at least be asking is: how does an autonomous agent pay me automatically?

That said, the cynical read is also worth hearing: the real winners here, at least in the near term, are Coinbase and Stripe, who are building the rails. Agents spending money means payment volume, and payment volume means revenue for whoever runs the pipes. The question of whether the developer building on top captures meaningful economics depends on how competitive the market for x402 endpoints gets. If it's easy to replicate your service, pricing pressure will be intense. The moat has to be something other than just having an endpoint.

One more note before we wrap. There's a trending discussion in the LocalLLaMA community about what happened to OpenClaw — an autonomous agent system that had a significant hype spike a few months back. The trend line is apparently heading to zero. A software engineer described spending two hours setting it up on their Mac, discovering it could run commands as root, deleting everything, then spending a full day getting it running in a Docker sandbox, then realizing it burns through tokens fast enough that a $20 OpenAI subscription won't survive a week, and deleting everything again.

The broader conversation points to something real. There was a category of "autonomous computer-use agents" that generated significant buzz last year. A lot of those products are struggling with the basic math of cost-per-task versus value-per-task, especially when the model needed to run at the quality level required for real autonomous operation is expensive. A 7B model that costs almost nothing and runs locally doesn't have the capability to actually replace a human for most complex tasks. A frontier model that does have the capability costs enough per session that the economics are hard to make work unless the task genuinely replaces expensive labor.

The people building in this category who are surviving are the ones who found a very specific task — narrow enough that a fine-tuned or well-prompted mid-tier model can handle it reliably — and built the infrastructure around that specific thing rather than chasing general-purpose autonomy. That's consistent with what we see broadly: general-purpose AI products are hard to monetize sustainably; specific, high-reliability AI products for high-value tasks are where the real business models are forming.

Alright. Let me pull this together before we go.

The through-line today, the one that connects the cybersecurity story, the agent wallets, the HTML debate, and the self-optimizing stacks, is this: we are past the point where AI capability is the interesting variable. The capability is there. What's being figured out right now — in real production, by real builders — is the infrastructure and the practice. How do agents transact? How do they hand off state to each other and to humans? What does it actually cost to run this stuff at scale? What breaks, and what breaks silently?

The builders who are winning are the ones treating these as serious engineering and business design problems, not as prompting challenges. The format you use for your handoff document matters. The quality signal in your evaluation loop matters. The payment rails you expose to agents matter. These are not glamorous problems. They're plumbing. But plumbing determines whether the building stands.

Thanks for being here. This is Barely Possible. I'm Tony DeLuca. Go build something that works.