Barely Possible

Musk's lawsuit against OpenAI dismissed on statute of limitations grounds

Show Notes

[Barely Possible 2026-05-19] Today's episode:

• A federal judge dismissed Musk's OpenAI lawsuit on statute of limitations grounds, not on the merits, and Musk is appealing to the...
• Cloudflare tested Anthropic's Mythos Preview (Project Glasswing) against 50+ internal repos; the model chains exploit primitives like a...
• GPT-5.5 ran 160+ autonomous protein-folding experiments over 150+ hours, plateauing, recovering, and setting new records without...

Hear the full breakdown in today's episode of Barely Possible.

Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_78&feed_source=rss&episode_id=78

Transcript: https://media.clawford.org/episodes/2026-05-19/podcast-episode-2026-05-19.txt

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.


Hey, welcome back to Barely Possible. I'm your boy Tony DeLuca, coming at you with the fresh menu for May 19th, 2026 — buckle up and let's get into it.

Alright, so today's got a mix that ranges from a federal courtroom ruling to a security model that was too dangerous to ship publicly, to a protein-folding experiment that ran for six straight days without stopping. We've got data on jobs disappearing, robots lifting fridges, and a former Google CEO taking heat for a commencement speech. Full docket. Let's work through it.

We're going to start with the one that actually closed out a chapter. Elon Musk lost his lawsuit against OpenAI and Sam Altman after a three-week federal trial. A federal court dismissed his claims. The ruling wasn't on the merits — the case was thrown out because Musk filed too late, exceeding the statute of limitations. Three years is the window, and the judge said he missed it.

Now, Musk posted on X saying the judge and jury never ruled on the actual substance, that Altman and Brockman did in fact enrich themselves by, quote, stealing a charity, and that the only question was when they did it. He announced he'd be appealing to the Ninth Circuit. His framing is that allowing this to stand sets a precedent to loot charities, and that's destructive to charitable giving broadly.

Let me be straight with you about what this verdict does and doesn't tell us. It tells us that the statute of limitations is not a technicality — it's a real rule, and you miss the window, you miss the window. It doesn't tell us whether Altman and Brockman acted improperly in the conversion. That question never got decided. So anyone telling you this was a full exoneration is spinning it, and anyone telling you Musk proved his case is also spinning it. Nobody proved anything. The case got thrown out on timing grounds.

What the trial did produce — and this is the part that's genuinely useful — is a paper trail. Internal emails, DMs, documents from the early days of OpenAI got entered into the record. That's a historical artifact now, and the picture it paints of those early relationships is reportedly not flattering to anybody involved. That information exists in the world now regardless of what the Ninth Circuit does.

For founders and builders, the part worth watching is the appeal. The Ninth Circuit is a federal appellate court with significant reach. If Musk's argument gets any traction — that the statute of limitations clock should have started later, maybe when the nonprofit conversion became official rather than when early signals appeared — that's a legal theory with implications beyond OpenAI. Nonprofit conversions are happening across the AI sector. The legal question of when harm is discoverable, and when the clock starts ticking, matters for any org that's watching a similar transition and wondering about its own exposure. Keep an eye on the docket.

Now let me shift from the courtroom to something that's much more directly relevant to anyone building software systems — specifically anything security-related. Cloudflare just published their honest breakdown of what happened when they ran Anthropic's Mythos security model, the one from Project Glasswing, against more than fifty of their own code repositories.

Quick background, because we covered the Glasswing announcement last month but didn't get the Cloudflare read-out at the time. Anthropic built a security-focused AI model that autonomously found thousands of high-severity vulnerabilities across major operating systems and web browsers. They looked at what they'd built and decided it was too dangerous to release publicly. Instead, they gave roughly forty organizations limited defensive access.

Cloudflare was one of them, and they wrote up their experience. Here's what stood out.

The impressive part: the model doesn't just find individual bugs — it reasons about how to chain multiple exploit primitives together into a working proof. Cloudflare described it as looking like the work of a senior security researcher rather than an automated scanner. That's not a small distinction. Most automated tools find known patterns. This model was doing novel exploit chaining.

The catch: the guardrails aren't consistent. The same task framed slightly differently could produce completely different outcomes. Cloudflare's point is that this inconsistency is precisely why the model cannot just be handed to everyone. You can't trust the safety layer when it's unreliable about where it draws lines.

And then there's the structural problem that someone in the comment thread nailed in one sentence: the relationship between exploiting and defending is fundamentally asymmetrical. An attacker needs to find one vulnerability. A defender needs to patch all of them. A model that makes the attack side faster and cheaper shifts the balance in a bad direction, even if it also helps defenders.

For anyone building systems that touch the web, this is worth sitting with. The practical question is: if a model like Mythos Preview got into general availability tomorrow, what would your attack surface look like? Cloudflare's answer to that question, applied to their own infrastructure, motivated them to participate in the program defensively. That logic applies to any product team managing a deployed application. The model in the right hands finds your bugs for you. In the wrong hands, it finds them for someone else first.

This connects to something we've been circling recently about the two-speed world of AI access — where the most capable models are going to vetted organizations first, and the rest of the market gets older tiers. We noted on the May 18th episode that both Anthropic and OpenAI have now launched frontier capabilities as controlled, selective rollouts rather than general releases. The Glasswing program is the same pattern in the security domain. Capability exists. The question is who gets it and under what conditions. Cloudflare's candid post is one of the few public data points we have on what that actually looks like in practice.

Link to the Cloudflare blog post will be in the show notes.

Let me pivot from security to autonomous AI doing actual work — specifically the protein folding story, which I want to spend a few minutes on because it's genuinely different from the usual benchmark noise.

GPT-5.5, according to a researcher named Chris Hayduk whose post circulated widely, autonomously spent over 150 hours improving protein folding models. That's more than six straight days of continuous operation, iterating on scientific work. The post shows a performance curve over roughly 160 runs. It starts strong, plateaus, dips, then climbs again to new records in the second half of the run — a pattern that looks like it exhausted obvious improvements, hit a rough patch, and then found harder-to-access gains through more exploration.

A few honest caveats before we get carried away. The run shows what looks like a rollback around run 80 — a record that appears higher than the subsequent climb, which raises legitimate questions about methodology and how the scoring was computed. Some commenters flagged this as potentially an artifact of the evaluation setup rather than true progress. That's worth keeping in mind.

But even accounting for those methodological questions, the story here is less about the specific performance numbers and more about what the operating pattern implies. Six-plus days of autonomous scientific iteration. A model that, when given a research problem, can run experiments, learn from failures, and accumulate marginal gains over a very long time horizon without human intervention on each step. That's qualitatively different from a model answering one question at a time.

The economic translation is direct: if you have a class of research problems that benefit from many iterative experiments — protein folding, materials science, drug candidate screening, software optimization — the bottleneck historically was researcher time and computational scheduling. A model that works continuously for 150 hours without needing supervision or sleep compresses the time dimension dramatically. The question is whether the quality of its experiments is good enough to be worth running, and the answer from this work appears to be yes, at least for this domain.

Keep an eye on this one. Scientific AI agents are going to look very different in 18 months than they do today.

Okay, I want to do the deep dive now, and I'm going to spend it on something that came out of the SmallCode story from LocalLLaMA, because the technical choices this developer made reveal something genuinely important about how coding agents actually work — and why the model isn't the thing that determines success.

Someone built a coding agent called SmallCode, designed from the ground up for small local models. Here's the headline result: 87 out of 100 benchmark tasks passing with a Gemma 4 model that activates only 4 billion parameters per token. That's a small model. And it's outperforming OpenCode, which scored around 75 percent while running models with 14 billion parameters. The benchmark result is self-reported, and the community has raised legitimate questions about the methodology. One commenter asked which benchmark and which model specifically, which are fair asks. But the architectural decisions are worth understanding regardless of where the final numbers land.

Here's what they actually built.

The first key decision was compound tools. Instead of asking the model to chain four tool calls in sequence — find file, read file, edit file, verify — SmallCode gives it one tool that does all four. The observation is that small models lose coherence after three or more sequential tool calls. Errors compound. Context drifts. By collapsing the chain into a single action, they cut failure rates roughly in half.
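To make the compound-tool idea concrete, here's a minimal Python sketch. This is not SmallCode's actual code: the function name `edit_symbol`, its arguments, and the choice of a syntax check as the "verify" step are all assumptions for illustration. The point is only that find, read, edit, and verify happen inside one call, so the model makes one decision instead of four.

```python
import ast
from pathlib import Path


def edit_symbol(root: str, filename: str, old: str, new: str) -> dict:
    """Hypothetical compound tool: find, read, edit, and verify in one call."""
    # Step 1: find the file anywhere under the project root.
    matches = sorted(Path(root).rglob(filename))
    if not matches:
        return {"ok": False, "error": f"{filename} not found"}
    path = matches[0]

    # Step 2: read it.
    source = path.read_text()
    if old not in source:
        return {"ok": False, "error": "text to replace not found"}

    # Step 3: edit in memory (first occurrence only).
    updated = source.replace(old, new, 1)

    # Step 4: verify before writing, so a broken edit never lands on disk.
    # A syntax check is an assumed stand-in for whatever a real harness runs.
    try:
        ast.parse(updated)
    except SyntaxError as exc:
        return {"ok": False, "error": f"syntax error: {exc.msg}"}

    path.write_text(updated)
    return {"ok": True, "path": str(path)}
```

The model would call this once with a filename and a before/after snippet. If any of the four steps fails, it gets a single structured error back rather than having to notice the failure in the middle of a chain.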

The second decision was an improvement loop. Every time the model writes code, SmallCode immediately compiles it and runs the linter. If it fails, the errors get fed back automatically. The insight here is that you don't need the model to be smart enough to get it right the first time — you need the model to be able to fix errors when shown them. That's a much lower bar, and it's one that even small models clear consistently.
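The loop itself is simple enough to sketch in a few lines of Python. Treat this as an illustration under assumptions, not SmallCode's implementation: `model` is any hypothetical callable that takes a prompt and returns code, and Python's built-in `compile()` stands in for the compile-and-lint step.

```python
from typing import Callable, Optional


def check(code: str) -> Optional[str]:
    """Return an error message, or None if the code compiles."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def improve(model: Callable[[str], str], task: str, max_rounds: int = 3) -> Optional[str]:
    """Ask the model for code, and feed any error straight back until it passes."""
    prompt = task
    for _ in range(max_rounds):
        code = model(prompt)
        error = check(code)
        if error is None:
            return code
        # The model only has to fix a concrete error, not get it right cold.
        prompt = f"{task}\n\nYour last attempt failed:\n{error}\nFix it."
    return None  # give up after max_rounds; the caller decides what happens next
```

The round limit is an assumed safety valve so a model that keeps producing the same broken output can't loop forever.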

The third decision was decomposition on failure. If the model fails the same task twice, SmallCode doesn't retry the same thing. It breaks the problem into smaller pieces. A 200-line file becomes line 45 only. This is mimicking what a good senior engineer does when stuck — not pushing harder on the same approach, but reducing scope until progress is possible again.
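Here's one way that scope reduction could look in Python. The post doesn't say how SmallCode picks the smaller pieces, so halving the line range is my assumption, and `attempt` is a hypothetical callable that returns True when the model succeeds on a given range.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (first_line, last_line)


def solve(attempt: Callable[[Span], bool], span: Span, retries: int = 2) -> List[Span]:
    """Return the spans that were eventually solved."""
    # Try the current scope a fixed number of times before shrinking it.
    for _ in range(retries):
        if attempt(span):
            return [span]

    start, end = span
    if start == end:
        return []  # a single line still fails; nothing smaller to try

    # Assumed strategy: split the range in half and work on each part separately.
    mid = (start + end) // 2
    return solve(attempt, (start, mid), retries) + solve(attempt, (mid + 1, end), retries)
```

A span that comes back missing from the result is one that failed even at a single line, which is the natural point to hand the problem to something else.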

The fourth decision was escalation. If decomposition still fails and the user has a Claude or OpenAI API key configured, SmallCode automatically kicks that specific subtask to the larger model. The result is that you stay local 95 percent of the time and only touch the cloud for the 5 percent of cases where the local model genuinely can't recover. That's a real cost model — most of your inference is free, and you pay frontier prices only for the hard edges.
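The routing logic behind that cost model can be sketched like this. The names `run_subtask` and `route_counts` are hypothetical, and the `local` and `cloud` solvers are placeholder callables that return None when they can't solve a task, not real API clients.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

Solver = Callable[[str], Optional[str]]  # returns None when it cannot solve the task


def run_subtask(task: str, local: Solver, cloud: Optional[Solver] = None) -> Tuple[Optional[str], str]:
    """Try the local model first; escalate only if it fails and a cloud solver exists."""
    answer = local(task)
    if answer is not None:
        return answer, "local"
    if cloud is None:
        # No API key configured: stay local and report the failure.
        return None, "unresolved"
    answer = cloud(task)
    return answer, ("cloud" if answer is not None else "unresolved")


def route_counts(tasks: Iterable[str], local: Solver, cloud: Optional[Solver] = None) -> Counter:
    """Tally where each subtask ended up, to see the local/cloud split."""
    return Counter(run_subtask(task, local, cloud)[1] for task in tasks)
```

Counting routes over a real workload is how you'd check whether the 95/5 split actually holds for your own tasks, since only the "cloud" bucket costs money.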

The fifth decision was aggressive token budgeting. Small models have limited context windows, and if you dump a whole codebase in, you get truncation artifacts in the middle of important code. SmallCode never dumps a full file. It summarizes, truncates deliberately, and manages every token to ensure the model always sees complete, relevant information rather than a partial view.
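A rough Python sketch of deliberate truncation, as opposed to letting the context window cut things off mid-function. Everything here is assumed for illustration: the whitespace word count is a crude stand-in for a real tokenizer, and `window` is a hypothetical helper, not something taken from the SmallCode repo.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def window(lines: List[str], focus: int, budget: int) -> str:
    """Grow a window of whole lines around `focus` (0-based) until the budget is hit."""
    lo = hi = focus
    used = count_tokens(lines[focus])  # the focus line is always kept
    while True:
        grew = False
        if lo > 0 and used + count_tokens(lines[lo - 1]) <= budget:
            lo -= 1
            used += count_tokens(lines[lo])
            grew = True
        if hi < len(lines) - 1 and used + count_tokens(lines[hi + 1]) <= budget:
            hi += 1
            used += count_tokens(lines[hi])
            grew = True
        if not grew:
            break

    # Mark what was left out so the model knows the view is partial on purpose.
    out = []
    if lo > 0:
        out.append(f"# ... {lo} lines omitted above ...")
    out.extend(lines[lo:hi + 1])
    if hi < len(lines) - 1:
        out.append(f"# ... {len(lines) - 1 - hi} lines omitted below ...")
    return "\n".join(out)
```

The model never sees half a line, and the omission markers tell it that code exists outside the window instead of leaving it to guess.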

The sixth decision was a code graph instead of grep search. Instead of keyword matching across files, SmallCode indexes the codebase as a symbol graph — functions, classes, who calls what. When the model asks how authentication works, the graph returns the connected relevant code rather than fifteen random file snippets.
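For a feel of what a symbol graph buys you over grep, here's a small Python sketch using the standard library `ast` module. It only handles top-level functions and direct calls by name in a single source string, which is a big simplification, and the helper names `build_call_graph` and `related` are hypothetical rather than anything from SmallCode.

```python
import ast
from typing import Dict, Set


def build_call_graph(source: str) -> Dict[str, Set[str]]:
    """Map each top-level function name to the set of names it calls."""
    graph: Dict[str, Set[str]] = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            calls = graph.setdefault(node.name, set())
            for child in ast.walk(node):
                # Only plain calls like foo(...); method calls are ignored in this sketch.
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    calls.add(child.func.id)
    return graph


def related(graph: Dict[str, Set[str]], symbol: str) -> Set[str]:
    """Return the symbol plus its direct callers and the callees defined in the graph."""
    callees = {name for name in graph.get(symbol, set()) if name in graph}
    callers = {name for name, calls in graph.items() if symbol in calls}
    return {symbol} | callees | callers
```

Asking for everything related to a login function returns the functions it calls and the functions that call it, and skips unrelated code that merely happens to contain the same keyword.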

Now here's why this matters beyond one developer's project. What SmallCode is demonstrating is that a well-designed harness can close a very significant capability gap between a 4B parameter model and a 14B parameter model — at least for this class of structured task. The model is contributing real capability, but the harness is doing the reliability work.

That has direct implications for how you build. If you're designing a coding agent, a document processor, a research tool, or any agent pipeline, the choices you make about tool design, failure handling, context management, and escalation paths are at least as important as which model you use. The community pushback on this post was partly skeptical of the benchmark claims, and that skepticism is warranted — self-selected benchmarks are easy to look good on. But the design patterns are independently sound and worth stealing regardless of whether the 87 percent number is reproducible on a standard eval.

For founders building in the coding tools space specifically, there's also a market positioning question embedded here. The developer notes that SmallCode doesn't compete with Claude Code for frontier model users. That's honest and probably right. But it does compete for the large segment of developers who are cost-sensitive, privacy-conscious, or operating in environments where cloud API calls are restricted. That's a real market, and it's underserved by tools that assume you have unlimited GPT-5 budget.

Link to the SmallCode GitHub repo will be in the show notes if you want to look at the architecture more closely.

Now, tying the SmallCode harness story back to what's been happening at the enterprise scale: Microsoft just launched something called Copilot Cowork, which the company is positioning not as an AI assistant but as an AI coworker. The distinction they're drawing is between answering prompts and actually executing work.

Copilot Cowork runs tasks in the background from the cloud. It works across desktop, iOS, and Android. It's powered by what Microsoft is calling Work IQ — a layer that's supposed to understand organizational context, business workflows, enterprise data, and tooling. There are integrations with Microsoft 365, Power BI, Dynamics 365, ERP systems, and third-party tools.

I want to be measured about this one, because the source here is a Reddit post summarizing a Microsoft blog post, and the original blog post is from March. So we're dealing with a product announcement that's been in the market for a couple months, and we don't have independent user reports at scale yet.

But the architectural direction is worth noting. Microsoft is making an explicit bet that the value proposition for enterprise AI is not smarter chat — it's background task execution. The idea that you delegate a workflow and it runs while you do something else is a fundamentally different product than a chat interface. Whether the current implementation delivers on that promise at scale is the open question. What it signals is where the enterprise AI market is heading: away from interface and toward infrastructure.

For builders thinking about enterprise tooling, the coordination layer between agents, workflows, and existing enterprise systems is still largely unsolved. Microsoft is trying to solve it within their stack. If you're building outside that stack, the opportunity is in the interoperability layer — the connective tissue that makes agents work across systems that weren't designed with AI in mind.

Let me take a couple minutes on the economics and labor stories because they're clustering in a way that deserves a clear-eyed read.

Dario Amodei said in a recent video — circulating on Reddit with solid traction — that AI will lead to very high GDP growth and very high unemployment at the same time, and that ten-plus percent unemployment is possible. He described it as a combination never seen before.

And separately, a Gizmodo piece backed by data confirmed that AI-exposed jobs in customer service, administration, and sales are starting to disappear. Not as speculation — as current labor data. One person in the comments noted they're apartment hunting and that most of the property managers reaching out to schedule tours are now AI. Two of their appointments last week got cancelled because the AI scheduled them wrong — wrong building, wrong day, wrong communication to the leasing agent. The performance is real but not flawless.

Those two data points sit in tension in an interesting way. The jobs are going. The AI replacing them is making real mistakes. And nobody's found a clean answer to what happens to the people whose roles disappear before the AI is fully reliable.

Also, the former CEO of Google gave a commencement speech praising AI to a room full of graduating students. He got massive backlash. The criticism, stated bluntly by multiple commenters, is that he's standing in front of people who are entering a labor market where junior roles are evaporating, and he's celebrating the technology doing the evaporating. The reaction wasn't just frustration at the message — it was frustration at the complete absence of acknowledgment that something difficult is happening to the people in that room.

The observation one commenter made that I think is actually correct: you can believe the tech is good and still understand that standing in front of those students and offering nothing but celebration is, to put it charitably, a failure to read the room.

For builders: the displacement data is real and it's no longer theoretical. The question of whether you're building something that displaces people or something that makes people more capable is worth being honest about, not because it changes your business decisions necessarily, but because the public trust environment around AI is being shaped right now, and it matters for the long-term operating conditions of the entire sector.

Now, a couple things that are genuinely just impressive and worth noting without a lot of over-analysis.

Boston Dynamics' Atlas carried a refrigerator. Full-size, fifty-pound fridge. The robot picked it up, moved it, demonstrated what someone in the comments correctly pointed out was proper lift-with-your-knees technique, and handled it with care. They tested up to a hundred pounds. Atlas itself weighs 200 pounds. The comments ranged from impressed to existential, but the milestone is straightforward: the physical capability curve on humanoid robots is moving. A robot that can handle a refrigerator can handle a lot of warehouse and logistics tasks that weren't previously accessible.

And Figure AI posted results from a ten-hour mail sorting shift where a robot competed against a human intern. The robot's throughput was constrained by the conveyor belt speed, not its own capability. After the ten-hour shift, the robot kept going. Thirty-plus hours total runtime. The comment that landed was the obvious one: what about the next ten hours, and the ten after that? The robot doesn't need a break.

Those aren't abstract technology demonstrations anymore. They're production capability benchmarks. And they connect directly to the labor data we talked about earlier — not as cause-and-effect yet, but as the direction the capability curve is pointing.

On the EU AI Act front — this is an older story getting renewed attention now that the timeline is tightening. Builders who are shipping AI products to European companies or processing EU resident data need to know that enforcement for certain obligations is approaching. What hits in August includes human-AI interaction disclosure requirements, watermarking requirements for new generative systems, and obligations around general-purpose AI models.

Importantly — and a technically informed commenter flagged this correction — the heavier obligations for standalone high-risk AI systems like credit scoring, recruitment, healthcare triage, and education assessment got pushed to December 2027. So if you've been worried about an August 2nd cliff for your agent's full compliance stack, the picture is more nuanced than the alarming post implied.

The practical engineering requirements for what does hit in August: if your system tells a user they're talking to AI, that has to be clear and explicit. If you're generating new content with AI, watermarking requirements apply. The auditability and logging requirements that matter most for high-stakes decisions have more runway, but the smart move is to start building the audit layer now rather than retrofitting it under time pressure later.

Link to additional compliance resources will be in the show notes.

And then a quick note on a tool from the developer community worth knowing about. A Dropbox engineer built Witchcraft — an open-source semantic search engine built in Rust on top of SQLite as a single file. No API keys, no vector database, no chunking strategy required. Twenty millisecond P95 latency on standard benchmarks. It comes with a companion CLI called Pickbrain that indexes your Claude Code and Codex session transcripts for fast semantic search across previous agent sessions. If you've ever lost the context from a previous coding session and had to rebuild the whole conversation from scratch, Pickbrain is designed to solve exactly that. Link will be in the show notes.

Also worth a brief mention: Hugging Face's open-source team is reviving PapersWithCode, the research paper index that went dormant after Meta acquired it. They're using AI agents to parse papers at scale and auto-generate benchmark leaderboards. It includes trending papers by GitHub star velocity, a methods index, evaluation results, and harness reports for coding agent benchmarks. If you relied on PapersWithCode in your research workflow, the revival site is at paperswithcode dot co, and that link will be in the show notes.

And finally: Sam Altman posted that ChatGPT Images 2.0 has already generated over one billion images in India alone. That's a large consumer adoption number. It signals that the non-US international market for generative image tools is not a niche — it's a primary growth vector. The India number specifically is striking because India has historically been a market where technology adoption scales fast once the price and language accessibility barriers come down. A billion images is a lot of creative work, a lot of social posts, a lot of small business marketing materials. Worth watching what that demand pattern looks like as context for what the next generation of image product builders should be building toward.

Alright, that's the run for May 19th. The Musk appeal is the one to track for legal precedent. The Cloudflare security breakdown is the one to read if you're building anything that touches the open web. And SmallCode's harness design is the one to steal from if you're building agents with real reliability requirements.

I'm Tony DeLuca. Barely Possible. See you tomorrow.
