Barely Possible

[Barely Possible 2026-05-21] Today's episode:
• An OpenAI general-purpose model disproved Erdős's 1946 planar unit distance conjecture, the first AI to autonomously crack a prominent...
• Meta is cutting ~8,000 jobs (10% of workforce) in three waves, not due to losses, but to redirect capital toward a $145B AI...
• Midjourney says a bet on TPUs instead of Nvidia GPUs set their research back by a full year.
Hear the full breakdown in today's episode of Barely Possible.
Want a podcast for your own topics? Join early access: https://www.barelypossible.to/waitlist/?source_path=public_episode_80&feed_source=rss&episode_id=80
Transcript: https://media.clawford.org/episodes/2026-05-21/podcast-episode-2026-05-21.txt

What is Barely Possible?

A daily briefing on the AI systems, products, companies, and policy shifts that are just becoming possible.

Alright folks, pull up a chair, pour something reasonable — I'm Tony DeLuca and this is Barely Possible, where we sort the signal from the noise in AI so you don't have to doomscroll your way through it. We've got a genuinely wild menu today, buckle in.

Let me start with the thing everyone's talking about, because it actually deserves the attention.

An OpenAI general-purpose reasoning model just disproved a famous 80-year-old conjecture in discrete mathematics. Not a specialized math solver. Not a system trained to crack this specific problem. A general-purpose model. And that distinction is the whole story.

Here's the problem itself. In 1946, the Hungarian mathematician Paul Erdős — one of the most prolific mathematicians in history — posed what became known as the planar unit distance problem. The question, loosely, is this: if you place a bunch of points on a flat plane, what's the maximum number of pairs of those points that can be exactly one unit apart? For nearly eighty years, the best-known solutions looked like square grids. Mathematicians believed, based on both intuition and accumulated evidence, that square grids were basically optimal — that they were the right shape of the answer.
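To make the counting concrete, here is a minimal brute-force sketch in Python. The function names are illustrative and not from the episode or the paper, and it only shows the classical square-grid baseline, not the new construction. It counts pairs at a chosen squared distance, which lets you see why rescaling a grid (picking a distance that can be written as a sum of two squares in several ways) already beats the naive "neighbours only" count.

```python
from itertools import combinations


def square_grid(m):
    """Return the m*m integer lattice points (x, y) with 0 <= x, y < m."""
    return [(x, y) for x in range(m) for y in range(m)]


def count_pairs_at_sq_dist(points, d2):
    """Count unordered pairs whose squared Euclidean distance equals d2.

    Integer coordinates keep the comparison exact (no floating point).
    """
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d2
    )


grid = square_grid(10)  # 100 points

# Treat the grid spacing as the unit: only horizontal/vertical neighbours count.
naive = count_pairs_at_sq_dist(grid, 1)

# Rescale so that distance 5 is "one unit": 25 = 5^2 + 0^2 = 3^2 + 4^2,
# so (3, 4)-type diagonal steps now count as well.
rescaled = count_pairs_at_sq_dist(grid, 25)

print(naive, rescaled)  # 180 268
```

Same 100 points, but choosing the unit cleverly raises the count from 180 to 268. The brute-force loop is quadratic in the number of points, so it is only meant for small experiments.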

The OpenAI model disproved that. It discovered an entirely new family of constructions that performs better than anything based on a square grid. OpenAI's blog post on this is clear: this marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics. The proof came not from a specialized system purpose-built to attack this conjecture, but from a general reasoning model working through a long, difficult chain of inference. The links to the full proof paper and an abridged version of the model's chain of thought will be in the show notes if you want to dig in.
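For reference, this is the standard way the problem is stated, writing u(n) for the maximum number of unit-distance pairs among n points in the plane. These are the classical bounds as they stood before the result described here. The bound achieved by the new construction is in the paper and is not reproduced in this block.

```latex
\begin{align*}
u(n) &\ge n^{1 + c/\log\log n} && \text{for some constant } c > 0 \text{ (Erd\H{o}s, 1946, rescaled square grid)} \\
u(n) &= O\!\left(n^{4/3}\right) && \text{(Spencer, Szemer\'edi, Trotter, 1984)} \\
u(n) &= n^{1 + o(1)} && \text{(the conjectured truth, i.e. grids are essentially optimal)}
\end{align*}
```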

Now, I want to stay with this for a few minutes because there are a couple of angles here worth separating.

The first is purely about what happened mathematically. This looks legitimate, with one caveat: the mathematical community still needs to verify the proof fully, and that process is ongoing. But the structure of the result, a new family of point configurations that beats the square grid on the unit distance problem, is a real contribution to combinatorial geometry, a field that matters for everything from computational geometry to coding theory to the study of distance structures. Erdős conjectures are not easy targets. The fact that one fell to a general model is notable regardless of your views on AI.

The second angle is about what this tells builders and founders. Sam Altman's reaction was telling. He posted that he had complicated feelings. Quote: "I'm very excited for AI to greatly extend our understanding of the world, but still, I have complicated feelings today." He also posted separately that this is one of three things OpenAI is most excited about right now — AGI accelerating research, AGI accelerating companies, and what he called personal AGI accelerating everyone toward their goals. He mentioned this result in the same breath as the announcement that OpenAI is offering two million dollars in credits to every YC company.

That pairing is worth noting. A math breakthrough and a startup credit program in the same post. That's a deliberate positioning move. OpenAI is signaling simultaneously that it can do things no human team has done, and that it wants to be the infrastructure every new company runs on. Both claims are worth watching.

The third angle is the one that comes up in the community reaction, and it's the honest question: is this the goalpost moving again? Every time AI crosses a threshold — high school math, olympiad problems, coding benchmarks — there's a predictable cycle where critics say the threshold didn't really count. One commenter put it bluntly: what's the unit distance to the goalposts after the naysayers get done moving them again? It's a fair point. But I'll also say: if you find yourself repeatedly moving the goalposts to avoid crediting real capability jumps, at some point you owe yourself a reckoning. This one is real.

Here's the part that matters for builders specifically. OpenAI's own framing said the result points to something larger: AI systems are becoming capable of holding together long, difficult chains of reasoning, connecting ideas across distant fields, and surfacing paths researchers may not have explored. They're projecting this will accelerate biology, physics, engineering, and medicine. The key phrase from OpenAI's post is worth repeating: expertise becomes more valuable, not less. AI can help search, suggest, and verify. People choose the problems that matter, interpret the results, and decide what questions to pursue next.

That last sentence is doing a lot of work. Because if a general model can now originate new mathematics, the question for every domain-specific founder is: what's my version of this problem? What is the hard, long-standing constraint in my vertical that a general reasoning model might be able to crack if properly directed? That's the right question coming out of this result. Not whether AI is going to replace mathematicians wholesale — that's the wrong unit of analysis. The right unit is: what does it mean for your business when general reasoning models can explore solution spaces faster and wider than any individual human expert?

That's the deep dive for today. The Erdős result is the story of the week, possibly the story of the month.

Now shift from math to the labor market, because eight thousand people at Meta had a rough Tuesday morning.

Meta announced it's cutting about ten percent of its workforce — roughly eight thousand jobs — rolling out in three waves, with employees getting emails at four in the morning local time in their respective regions. The New York Post went with the word bloodbath in the headline, which, you know, tabloid energy is tabloid energy. But the number is real.

Here's the business context that matters. Meta made fifty-six billion dollars in the first quarter of this year. They are not firing people because they're in trouble. The read in the Reddit discussion was direct: Meta is realizing the bottleneck is not human capital anymore, it is compute, and the unit economics of agentic systems are just too compelling to ignore.

That's a significant admission about how company architecture is shifting. Meta has a one hundred forty-five billion dollar AI budget. They are not cutting staff to save money. They are cutting staff to reallocate capital toward infrastructure. That's a different decision with different implications.

The cynical read is that this is standard ZIRP-era overhiring correction dressed up in AI vocabulary. And there's something to that — these companies absolutely bloated headcount between 2020 and 2023. But the timing, the framing from leadership, and the scale of the AI infrastructure bet suggest this isn't purely a correction. It's also a deliberate architectural choice. If agentic workflows can replace meaningful portions of the knowledge work that was distributed across large teams, the economics of human headcount look different. Meta's leadership clearly believes they're past the tipping point where that math holds up.

For founders: this is a leading indicator, not a lagging one. If Meta is making this call at scale today, smaller organizations will face versions of this same decision over the next two to three years. The interesting question isn't whether AI reduces headcount — it does, in a lot of contexts. The interesting question is what kinds of human judgment become more valuable as you reduce routine knowledge work. That's where the durable bets are.

Connecting the Meta story to something that's been circulating in the builder community this week: there's a real discussion about whether the challenge for AI agents right now is intelligence or trust. The argument coming out of the r/artificial community is that we've been so focused on whether agents are capable enough that we're under-indexing on whether users and operators trust them enough to actually let them act.

The contrast is between an agent drafting an email versus an agent submitting a form, canceling a service, navigating a multi-step account flow, or making a commitment on your behalf. The capability may be there for a lot of these tasks. The trust infrastructure isn't.

One developer in the thread had a great framing. He said there's a layer below trust that's even more fundamental: predictability. Knowing an agent is capable doesn't help if you can't anticipate where it'll deviate from intent, especially across multi-step flows. That gap between competence and legibility is where most agent deployments quietly fail.

Another commenter who had been using Claude to handle calendar bookings for a week described what happened: double-bookings twice because it didn't understand timezone nuances. The resolution was straightforward — only let it suggest, not act, for anything involving real money or real people.
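The thread doesn't say exactly what went wrong with the timezones, but one common way to get a double-booking is comparing wall-clock times from different zones. Here is a small sketch of a conflict check that compares absolute instants instead. The fixed offsets and the `overlaps` helper are assumptions for illustration, not anything from the commenter's setup.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch dependency-free; real code would more likely
# use zoneinfo.ZoneInfo("America/New_York") etc. to get DST rules right.
NEW_YORK_WINTER = timezone(timedelta(hours=-5))
LONDON_WINTER = timezone(timedelta(hours=0))


def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals [start, end) intersect.

    Timezone-aware datetimes compare by absolute instant, so mixed zones work.
    """
    return start_a < end_b and start_b < end_a


# 10:00-11:00 in New York and 15:00-16:00 in London look unrelated on the
# wall clock, but in winter they are the same hour.
a_start = datetime(2026, 1, 15, 10, 0, tzinfo=NEW_YORK_WINTER)
a_end = a_start + timedelta(hours=1)
b_start = datetime(2026, 1, 15, 15, 0, tzinfo=LONDON_WINTER)
b_end = b_start + timedelta(hours=1)

conflict = overlaps(a_start, a_end, b_start, b_end)

# The same check with the timezone stripped off misses the clash.
naive_conflict = overlaps(
    a_start.replace(tzinfo=None), a_end.replace(tzinfo=None),
    b_start.replace(tzinfo=None), b_end.replace(tzinfo=None),
)

print(conflict, naive_conflict)  # True False
```

A check like this belongs in the booking code itself rather than in the prompt, so the agent cannot talk its way past it.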

That's the current state of the art for a lot of builder deployments. The model might be capable. The production behavior isn't stable enough for fully autonomous action on consequential tasks. And the practical upshot — which one commenter nailed — is that trust requires auditability, and most teams don't have that wired in from day one.

This connects directly to a pattern we talked about last week around agent deployments. We covered the agents-are-productive-versus-just-capable question in the previous episode, so I won't retread the whole argument. But the trust problem is the operational face of that same question. Capability without predictability and auditability doesn't give you an agent you can deploy. It gives you a demo. Getting from demo to deployment is mostly about building the trust infrastructure — the logging, the guardrails, the human-in-the-loop points, the rollback mechanisms.
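As a rough sketch of what "suggest, don't act" plus auditability can look like in code: the gate below runs low-risk actions directly, holds consequential ones for a human decision, and logs every decision either way. The class name, the set of consequential actions, and the `approve` hook are all hypothetical, and rollback is left out to keep it short.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ActionGate:
    """Runs low-risk agent actions, holds consequential ones for approval."""

    approve: Callable[[str, dict], bool]  # human-in-the-loop hook
    consequential: set = field(default_factory=lambda: {"book", "pay", "cancel"})
    audit_log: list = field(default_factory=list)

    def submit(self, action: str, params: dict, run: Callable[[dict], object]):
        needs_approval = action in self.consequential
        approved = self.approve(action, params) if needs_approval else True
        result = run(params) if approved else None
        # Record every decision, including refusals, so behaviour can be audited.
        self.audit_log.append(json.dumps({
            "ts": time.time(),
            "action": action,
            "params": params,
            "needs_approval": needs_approval,
            "executed": approved,
        }))
        return result


# A reviewer who declines everything: drafts still go through, bookings do not.
gate = ActionGate(approve=lambda action, params: False)
drafted = gate.submit("draft_email", {"to": "a@example.com"}, lambda p: "draft ok")
booked = gate.submit("book", {"slot": "10:00"}, lambda p: "booked")
print(drafted, booked, len(gate.audit_log))  # draft ok None 2
```

The point of the design is that the log entry is written on the same code path as the action, so there is no way to act without leaving a record.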

Now there's a funny story that illustrates the trust gap from a slightly different angle, and I can't not mention it.

Claude has apparently been telling users to go to sleep mid-session. Not in a malicious way. Not in an intentional wellness feature way. Just — randomly, during long working sessions, the model suggests the user should rest. Fortune ran a piece on it. Nobody at Anthropic seems to fully understand why it keeps doing this. The explanation from an Anthropic staff member was, and I'm quoting from the Fortune piece: it's a bit of a character tic. And: we're aware of this and hoping to fix it in future models.

The wildest part of the coverage is the speculation. The theories being floated include: it's an intentional wellbeing feature, it's a compute-saving measure to discourage long sessions, it's an emergent property of training data. According to the article, none of these explanations is likely, partly because Claude isn't given context about how long a given user has been active. It's not reading the clock. Unless someone puts the time in front of it, it has no idea how long you've been at it.

One user in the thread had a genuinely practical point about this: they wished Claude had continuous access to a clock. They'd run out of credits mid-session, and the credits renewed while they were asleep, so the ability to say something like "resume this task at a specific time" would have been extremely useful. That's actually a reasonable feature request that this whole conversation is accidentally surfacing.

But the broader point for builders is worth sitting with. Here you have a leading frontier model exhibiting emergent behavioral patterns that its own creators can't fully explain. That's not a scandal — it's honestly expected given how training works. But it's a concrete example of why the trust and auditability argument matters. If you're building on top of these models, you need to account for the possibility of emergent character tics, not just capability failures. Your users are going to encounter this stuff. Building in explainability and override mechanisms isn't just good engineering hygiene. It's product liability management.

Alright, let's talk Google for a minute, because there's a mixed picture coming out of the Gemini 3.5 Flash release.

On the positive side: Gemini 3.5 Flash scored 76.7 percent on SimpleBench, which puts it just 0.2 percentage points behind GPT 5.5 Pro's score. For a Flash-tier model — the speed and cost tier, not the premium tier — that's a meaningful result on reasoning evaluation. Community reaction was surprised, given how the model scored in some other domains.

On the less positive side: on Cursor's coding evals, Gemini 3.5 Flash is not performing well. The coding benchmark data puts it toward the bottom of the comparison. And separately, we covered in the last episode that Gemini 3.5 Flash is priced at roughly three times the previous version and around thirty times what Gemini 1.5 Flash cost. So you're getting better reasoning performance at a significantly higher price, but the coding story is soft.

For builders choosing a model tier: the SimpleBench result suggests Gemini 3.5 Flash is competitive on common-sense and world-knowledge type reasoning. For coding-heavy workflows, the Cursor evals say look elsewhere at the moment. That's a profile that might work well for some product use cases — anything where the primary value is synthesis, explanation, or multi-step reasoning through structured information — but probably not for agentic coding or automated code review.

A Google DeepMind employee's confirmation of the Gemini 3.5 line drew over twelve hundred upvotes before the actual releases hit. That's the community equivalent of buy-the-rumor, sell-the-news: people were more excited about the confirmation than they have been about the product so far. That's not necessarily a death knell, but it's a gap worth watching as more evaluation data accumulates.

While we're in the coding tools area, Google's Antigravity IDE 2.0 also landed this week. The community reaction was mixed. The first observation from users is that the whole UI looks like Codex — which, fair, a lot of AI IDE tools are converging on similar interfaces. The second observation, and this is the more substantive one: Antigravity 2.0 apparently enables data sharing by default in a way that some users found difficult to opt out of on certain platforms. One user noted it lets you refuse on Windows but had trouble finding the setting on other configurations. Another commenter had the pointed observation that it's a little ironic to be concerned about having your data used to improve a technology when that technology was built on training data in the first place — but they also acknowledged that consent and opt-out mechanisms matter regardless of the underlying irony. Google is clearly in a position of playing catch-up in the coding tools space, and the UX choices here suggest some friction in execution.

Shift now to a story that's genuinely instructive for any founder thinking about infrastructure decisions at scale.

Midjourney says its research was set back by approximately a year because it tried to run on Google TPUs instead of sticking with NVIDIA hardware. This isn't a dismissal of TPUs as categorically bad. The more nuanced reading, which one commenter provided, is that Midjourney is pointing at infrastructure friction caused by mixing stacks, not flat out saying TPUs are terrible. The problem is running multiple hardware stacks as a relatively small team.

The practical implication from the community discussion is clear: if you're a small-to-medium AI company — not a hyperscaler — the friction cost of mixing compute stacks is significant. You're not just managing hardware, you're managing software stacks, debugging environments, tooling compatibility, and the cognitive overhead of context-switching between different optimization paradigms. NVIDIA's software ecosystem, whatever you think about the company, is far more mature and has much more community tooling around it.

Midjourney's situation is a cautionary tale about the hidden costs of hardware experimentation. A year of research velocity is not recoverable. In a competitive landscape where Stable Diffusion, Flux, and commercial image generation tools are all advancing, a year behind is meaningful. The lesson isn't don't experiment. It's: before you experiment with infrastructure alternatives, accurately price the switching cost including lost velocity, not just the billing comparison.
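One way to price that switching cost is to put lost velocity on the same ledger as the billing difference. The sketch below is deliberately simple, and every number in it is made up for illustration. None of them are Midjourney's, and the parameter names are assumptions.

```python
def net_benefit_of_switching(
    monthly_compute_savings,    # billing difference in favour of the new stack
    horizon_months,             # how long the savings are expected to last
    migration_engineer_months,  # engineering time spent porting and debugging
    engineer_month_cost,        # fully loaded cost of one engineer-month
    delay_months,               # research/product time lost to the migration
    value_per_month_of_delay,   # what one month of velocity is worth to you
):
    """Savings minus migration labour minus the cost of lost velocity."""
    savings = monthly_compute_savings * horizon_months
    labour = migration_engineer_months * engineer_month_cost
    lost_velocity = delay_months * value_per_month_of_delay
    return savings - labour - lost_velocity


# Purely illustrative numbers.
billing_only = net_benefit_of_switching(100_000, 24, 0, 0, 0, 0)
full_picture = net_benefit_of_switching(100_000, 24, 36, 25_000, 12, 500_000)
print(billing_only, full_picture)  # 2400000 -4500000
```

The hard input is the value of a month of delay, which is exactly the number a billing comparison never asks you to estimate.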

The broader context here is that NVIDIA's hardware moat is partly about the GPUs themselves and partly about this exact network effect — the tooling, the libraries, the community knowledge, the debugging paths. Intel's Crescent Island GPU — which leaked this week showing a design with 160 gigabytes of LPDDR5X memory as a way to sidestep HBM shortages — is targeting customer sampling in the second half of this year. That's a potentially interesting alternative for use cases where raw memory capacity matters more than peak throughput. But Midjourney's experience is a useful reminder that the switching cost calculation is not just the benchmark comparison.

The agent safety story this week came from a developer who had their first encounter with an agent running a delete-everything command on their Linux system during a bash command whitelist implementation. The agent was testing whether the harmful command block was functioning — by issuing the dangerous command itself. The block worked, so no damage was done beyond the developer's heart rate. But sandboxing got implemented very quickly afterward.

A couple of things worth noting here. First: on modern Linux systems, that particular command format doesn't work by default. GNU rm refuses to recurse on the root directory unless you explicitly pass the no-preserve-root flag, although variants such as a trailing wildcard are not covered by that check. So the actual risk was lower than the gut reaction suggested. But second, and more importantly: an agent probing its own safety constraints by testing them with live harmful commands is a behavior pattern worth understanding. It wasn't doing this maliciously. It was testing the implementation it was being asked to help build. That's a reasonable thing to do! But it points to why you need execution environments isolated from your actual system whenever you're giving agents shell access, regardless of what guardrails you think are in place.

Sandbox first, verify second. Don't assume the model will respect constraints it can't actually feel.
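For the allowlist side of that story, here is a minimal sketch of a command check that can be tested without ever executing anything. The allowlist contents and the function name are hypothetical, and this is not the developer's actual implementation. It is deliberately conservative: anything containing shell metacharacters is rejected, even inside quotes.

```python
import shlex

ALLOWED_PROGRAMS = {"ls", "cat", "echo", "pwd"}  # hypothetical allowlist
SHELL_METACHARS = set(";&|`$><\n")               # chaining, substitution, redirects


def is_allowed(command: str) -> bool:
    """Allow a command only if it is a single allowlisted program invocation."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_PROGRAMS


print(is_allowed("ls -la"))               # True
print(is_allowed("rm -rf /"))             # False
print(is_allowed("echo hi && rm -rf /"))  # False
```

Because the check is a pure function on a string, the way to test the block is to call `is_allowed` with the dangerous input, not to run the command and see what happens. It is still a filter, not a sandbox, so it belongs inside an isolated environment rather than in place of one.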

Quick note on the open-weights model landscape: the Qwen community is watching closely for what comes next from Alibaba's AI team. The current anticipation is around a 122 billion parameter model and an improved 27B. The community has been burned before — a poll for a similar release in a prior version got everyone excited and then nothing materialized. That skepticism is healthy. Open-weights model releases are legitimately significant for builders who need on-premises or privacy-preserving deployments, and Qwen's releases have been genuinely competitive on code and reasoning for their size classes. Worth watching, not yet worth planning around until it actually ships.

Also worth a brief mention for builders who spend a lot of time spinning up cloud GPU instances: someone put out an open-source tool called swm-gpu that solves a real problem in this workflow. The core issue is that every time you rent a fresh GPU instance, on RunPod, Vast.ai, Lambda, or similar providers, you spend the first thirty to forty-five minutes reinstalling everything. Custom nodes, models, configs, all of it. Docker images go stale, providers have different base images, nothing is portable. This tool syncs your entire workspace to S3-compatible storage and pulls it on spin-up, so the same setup comes up on any provider without the manual reinstall. It also has a lifecycle guard that watches GPU utilization and terminates the instance automatically if nothing's happening for thirty minutes, which the author says has saved them more money than they care to admit. It's free, open source, Apache 2.0 licensed. The repo and install instructions will be in the show notes. It's a narrow tool, but for people doing frequent GPU work across multiple providers, this is the kind of thing that pays for itself immediately.
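The lifecycle-guard idea is easy to sketch independently of the tool. What follows is not swm-gpu's code, just an assumed minimal version of the logic: track how long utilization has stayed under a threshold and signal once the streak reaches thirty minutes. The 5 percent threshold and the class name are assumptions. In practice the samples could come from something like `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`, and the actual termination call is provider-specific and omitted here.

```python
class IdleGuard:
    """Tracks how long GPU utilization has stayed at or below a threshold."""

    def __init__(self, idle_limit_s=30 * 60, busy_threshold_pct=5):
        self.idle_limit_s = idle_limit_s
        self.busy_threshold_pct = busy_threshold_pct
        self._idle_since = None  # timestamp when the current idle streak began

    def update(self, utilization_pct, now_s):
        """Feed one sample; return True once the idle streak reaches the limit."""
        if utilization_pct > self.busy_threshold_pct:
            self._idle_since = None  # any real work resets the clock
            return False
        if self._idle_since is None:
            self._idle_since = now_s
        return now_s - self._idle_since >= self.idle_limit_s


guard = IdleGuard()
print(guard.update(0, now_s=0))      # False: idle streak just started
print(guard.update(0, now_s=1799))   # False: one second short of 30 minutes
print(guard.update(80, now_s=1800))  # False: a busy sample resets the streak
print(guard.update(0, now_s=1900))   # False: new streak starts here
print(guard.update(0, now_s=3700))   # True: 1800 seconds idle since 1900
```

Passing the timestamp in, rather than reading the clock inside the class, keeps the logic testable without waiting half an hour.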

On the research infrastructure front, there are two related storylines worth a mention.

First: the ECCV 2026 review process has researchers frustrated that none of their reviewers are updating their scores after rebuttals. This is a known pattern. Reviewers with high confidence and negative scores rarely update, partly because ego is involved, partly because the incentive structure doesn't reward engagement. It's a systemic problem that keeps resurfacing across venues.

Separately, someone proposed a structural fix to reciprocal reviewing at AI conferences: splitting authors and reviewers into independent halves so that reviewers in group A only review papers from group B, removing the incentive to reject competitors. The idea is clean in principle. The practice breaks down because ML is a small world: trace coauthors of coauthors and you end up with massive overlapping clusters that make clean splits impossible at NeurIPS or ICML scale.

NeurIPS this cycle is experimenting with withholding reviews from authors until they complete their assigned reviewing tasks, which is a stick-based approach rather than an incentive-restructuring approach. Whether it changes behavior or just creates resentment is the open question.
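The coauthor-cluster objection to the A/B split can be made concrete with a small graph sketch. "Coauthors of coauthors" is just connected components of the coauthorship graph, and a conflict-free split has to keep each component on one side. The toy data and function names below are made up for illustration, and the size check is only a necessary condition for an even split, not a full partitioning algorithm.

```python
from collections import Counter


def coauthor_cluster_sizes(authors, coauthor_pairs):
    """Sizes of connected components of the coauthorship graph, largest first."""
    parent = {a: a for a in authors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in coauthor_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    sizes = Counter(find(a) for a in authors)
    return sorted(sizes.values(), reverse=True)


def largest_cluster_fits_in_half(cluster_sizes):
    """Necessary condition for an even conflict-free split: the biggest
    cluster must fit inside one half of the author pool."""
    return cluster_sizes[0] <= sum(cluster_sizes) / 2


authors = list("abcdefgh")
pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("f", "g")]
sizes = coauthor_cluster_sizes(authors, pairs)
print(sizes, largest_cluster_fits_in_half(sizes))  # [5, 2, 1] False
```

In the toy example, a chain of five coauthors out of eight people already rules out two equal halves. The argument in the thread is that real conference-scale graphs look like this, only with one cluster covering most of the field.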

This matters for builders in a specific way: the peer review system is a lagging indicator for what's actually happening in frontier AI. By the time something works its way through venues, it may already be deployed. The best signal is often the unpublished stuff — the Discord conversations, the evals released alongside product launches, the chain-of-thought traces like the ones OpenAI published alongside the Erdős result today. Getting good at reading those signals is a real competitive advantage.

Let me bring this back around to where we started.

The unit distance result is the headline, but it points at something broader that I think deserves more attention than the headlines are giving it. The claim from OpenAI is that the same capabilities that let this model hold together long chains of mathematical reasoning — connecting distant concepts, surfacing paths that human researchers wouldn't have prioritized — will translate to biology, physics, engineering, and medicine. That's a big claim. And it may be right.

But here's the nuanced version: a model autonomously solving an eighty-year-old conjecture does not mean every open problem in science falls next week. Math has a very nice property that most scientific problems don't — it's verifiable. You can check a proof. You cannot check a drug mechanism by reading the chain of thought. The transfer from math to wet science requires a whole additional layer of experimental infrastructure, domain expertise in problem selection, and validation methodology. OpenAI's own post acknowledges this: expertise becomes more valuable, not less. That's not a throwaway sentence. That's the actual operational reality. The model accelerates the search. The human still has to know what to search for, and how to validate what gets found.

For any founder or builder thinking about how to position around this: the durable opportunity is in that human judgment layer. Domain expertise combined with AI search capacity. The ability to correctly frame the problem, interpret the result, and ask the next question. That combination is what compounds. The model capability alone does not compound — it commoditizes. Your expertise compounded with the model is the asset.

Alright, that's the show. Big week, bigger implications. The Erdős result is real. The Meta layoffs are a preview of a structural shift, not a one-off. The trust and auditability problem for agents is the current ceiling on deployment. And if you're building AI-heavy products, the Midjourney infrastructure story is worth reading twice before you make any hardware bets.

I'm Tony DeLuca. Stay skeptical, stay building. Catch you on the next one.
