The Harness

S&P 500 holds the profitability line

Show Notes

The S&P 500 held its profitability line against OpenAI, Anthropic, and SpaceX, forcing AI labs on the IPO path to prove GAAP earnings rather than just ARR growth. Sakana AI formally launched a Recursive Self-Improvement Lab, while Princeton's ICML 2026 research finds frontier models haven't improved meaningfully in reliability: the gap between capability and consistency is widening. A statistical analysis of 36 rsync releases found no evidence that Claude-assisted coding degraded code quality, a data point against a widely circulating narrative that has no empirical basis.

What is The Harness?

A daily summary of what's interesting in the AI industry, with a focus on what it means for people building harness experiences that actually get used.

Good morning, it's Saturday, June sixth.

In today's briefing we see Sakana AI formalizing its Recursive Self-Improvement Lab in Tokyo, Google releasing quantization checkpoints for Gemma that fit sub-gigabyte devices, and new benchmarking systems finally measuring what frontier models actually can't do reliably at scale.

First up, today in the big model news:

OpenAI

OpenAI shipped ChatGPT Lockdown Mode, a network-layer control that restricts outbound connections to block prompt-injection-driven data exfiltration. It's a direct response to multi-tenant isolation failures, identified as the highest-severity risk in agentic cloud products. For enterprise teams deploying agents into shared cloud infrastructure, network isolation is now a control you need to audit independently, because model-level safety alone can't stop a prompt injection from exfiltrating data at the platform layer.
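For teams who want a picture of what a network-layer control looks like, here is a minimal sketch of an outbound allowlist check. It is not how Lockdown Mode is implemented; the host names and the `egress_allowed` helper are hypothetical, and the only assumption is that outbound requests pass through a single checkpoint you control.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration; not OpenAI's configuration.
ALLOWED_HOSTS = {"api.internal.example.com", "docs.example.com"}


def egress_allowed(url: str, allowed_hosts: set[str] = ALLOWED_HOSTS) -> bool:
    """Return True only for https URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme == "https" and host in allowed_hosts


if __name__ == "__main__":
    # An injected instruction asking the agent to post data elsewhere is refused
    # here regardless of what the model decided to do.
    print(egress_allowed("https://docs.example.com/page"))
    print(egress_allowed("https://attacker.example.net/?q=secret"))
```

The point of the design is that the check sits outside the model, so it holds even when the model has been talked into making the request.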

Anthropic

Task success rates on engineering workflows have been doubling repeatedly, and Claude models now author over eighty percent of merged code in Anthropic's own stack. The doubling curves are a public disclosure that lines up with Sakana AI's parallel work on recursive self-improvement, but the specific point is that internal recursive improvement loops may be compounding faster than product roadmaps account for. For product teams shipping AI-driven engineering tools, expect your competitive window to narrow as internal model improvement cycles accelerate, because both Anthropic and Sakana AI now measure and optimize for recursive task completion in ways that weren't tracked two years ago.

In other lab news today, Sakana AI formally launched a Recursive Self-Improvement Lab in Tokyo, consolidating two years of foundation work including the AI Scientist, Darwin Gödel Machine, ShinkaEvolve, and ALE-Agent. The move signals something institutional: recursive self-improvement, or RSI, is no longer a theoretical conversation at frontier labs; it's a staffed org with a scoped mandate and a compute-efficiency constraint. For research teams evaluating AI timelines and capability curves, the institutionalization of recursive improvement at both Sakana and Anthropic suggests internal loops may be moving faster than release cycles, because when labs staff dedicated units to accelerate capability via self-improvement rather than hiring, the compounding dynamics change fundamentally.

In the local model developments world:

Google released Quantization-Aware Training checkpoints for the full Gemma four lineup. The E two B text model now fits in under one gigabyte of memory, and a mobile-specific format uses two-bit quantization for token-generation layers, matched to mobile accelerator architecture. Ecosystem support arrived immediately via Ollama, vLLM, and LM Studio. For product teams shipping on-device generative features in mobile applications, sub-gigabyte models mean a complete inference stack can now run without network calls, because checkpoints this small move in-app generation from theoretical to deployable today.
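To see why bit width matters so much for on-device memory, here is a back-of-the-envelope sketch. The parameter count is an illustrative round number, not a published figure for any Gemma checkpoint, and the estimate covers weights only: it ignores the KV cache, activations, and any layers kept at higher precision.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    illustrative_params = 2_000_000_000  # assumption for the example only
    for bits in (16, 8, 4, 2):
        gb = weight_memory_gb(illustrative_params, bits)
        print(f"{bits}-bit weights: about {gb:.2f} GB")
```

Halving the bits per weight halves the weight footprint, which is why quantization-aware training, rather than a smaller architecture, is what brings a model of this class under a mobile memory budget.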

In the harness, tools and orchestration world:

Princeton's ICML twenty twenty-six paper tested frontier models including GPT five point five, Gemini three point one Pro, Gemini three point five Flash, and Claude Opus four point seven. The finding: they're not meaningfully more reliable than earlier generations on verifiable tasks. The specific failures are answer leakage, agent benchmark gaming, and low outcome consistency at scale. Two new benchmarking systems are measuring this gap more precisely. Agents' Last Exam, with over one thousand economically valuable tasks, reports a two point six percent full pass rate on its hardest tier, while SWE-Marathon tests agent coherence over one-billion-token budgets. For teams building production agents, the evaluation gap is now the limiting factor, not model capability, because benchmarking systems sophisticated enough to catch real-world failure modes mean you need audit frameworks, not just anecdotes, to claim an agent is production-ready.
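One way to make "outcome consistency" concrete in your own evals is to run each task several times and compare how often an agent passes at least once against how often it passes every time. This is a minimal sketch, not the metric used by the Princeton paper or either benchmark; the function name and input shape are assumptions.

```python
from collections import defaultdict


def consistency_report(runs):
    """Summarize repeated trials given (task_id, passed) pairs.

    pass_any: share of tasks passed in at least one trial (capability).
    pass_all: share of tasks passed in every trial (consistency).
    """
    by_task = defaultdict(list)
    for task_id, passed in runs:
        by_task[task_id].append(bool(passed))
    if not by_task:
        raise ValueError("no runs supplied")
    total = len(by_task)
    pass_any = sum(any(results) for results in by_task.values()) / total
    pass_all = sum(all(results) for results in by_task.values()) / total
    return {"pass_any": pass_any, "pass_all": pass_all, "gap": pass_any - pass_all}


if __name__ == "__main__":
    trials = [
        ("a", True), ("a", True),
        ("b", True), ("b", False),
        ("c", False), ("c", False),
        ("d", True), ("d", True),
    ]
    print(consistency_report(trials))
```

A large gap means the agent can do the task but won't do it reliably, which is the failure a single-run leaderboard score hides.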

In AOB:

The S&P five hundred held its profitability line against SpaceX, OpenAI, and Anthropic, refusing to fast-track IPO entry without meeting its own GAAP earnings requirements. SpaceX reported a four point nine four billion dollar loss in twenty twenty-five; OpenAI and Anthropic are burning capital at frontier scale. Nasdaq and FTSE Russell loosened their criteria; S&P's holdout matters most given the seven point five trillion dollars in passive fund flows. For AI lab strategy and investor relations, expect the IPO calculus to shift from ARR growth to credible paths toward positive earnings, because passive index inclusion will not subsidize losses at that scale.

A statistical analysis of thirty-six rsync releases found no evidence that Claude-assisted development increased bugs: severity-weighted defect rates across multiple releases show no degradation. The headline drove engagement today; the data didn't support it. For engineering teams evaluating AI coding tools in your infrastructure, the right unit of measurement is severity-weighted defect rates across releases, not individual incidents or community sentiment, because that's the only way to separate signal from noise as AI coding becomes standard practice.
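If you want to track the same kind of number for your own releases, a severity-weighted defect rate can be sketched in a few lines. The weights, the per-thousand-changed-lines normalization, and the field names here are assumptions for illustration; the rsync analysis may have weighted and normalized differently.

```python
# Hypothetical severity weights; choose values that match your own triage policy.
SEVERITY_WEIGHTS = {"low": 1, "medium": 3, "high": 10}


def weighted_defect_rate(defect_severities, changed_lines):
    """Severity-weighted defects per 1,000 changed lines for one release."""
    if changed_lines <= 0:
        raise ValueError("changed_lines must be positive")
    score = sum(SEVERITY_WEIGHTS[severity] for severity in defect_severities)
    return 1000 * score / changed_lines


def mean_rate(releases):
    """Average the per-release rate over a group of releases."""
    if not releases:
        raise ValueError("no releases supplied")
    rates = [weighted_defect_rate(r["defects"], r["changed_lines"]) for r in releases]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    before = [{"defects": ["low", "high"], "changed_lines": 2000}]
    after = [{"defects": ["medium"], "changed_lines": 1500}]
    print(mean_rate(before), mean_rate(after))
```

Comparing two group means like this is only a starting point; deciding whether a difference is real still needs a proper statistical test across enough releases.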

OpenAI's Lockdown Mode and Anthropic's containment engineering research point at the same architectural conclusion: prompt injection is structural, not fixable at the model layer alone. Cloudflare shipped AI Gateway spend limits this week: per-model, per-user budget enforcement with fallback routing. The production AI security stack is being built layer by layer: model safety, platform isolation, network controls, spend controls. For enterprise buyers evaluating deployment risk, audit all four independently, because a complete security posture requires you to treat model safety and platform isolation as separate attack surfaces that both need defense.
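The spend-control layer is the easiest of the four to picture in code. What follows is a minimal sketch of per-model, per-user budget enforcement with fallback routing; it is not Cloudflare's AI Gateway API, and the `SpendLimiter` class, its fields, and the model names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpendLimiter:
    """Tracks spend per (user, model) and routes to a fallback when over budget."""

    limits: dict  # (user, model) -> budget in dollars
    spent: dict = field(default_factory=dict)

    def route(self, user, preferred, fallback, estimated_cost):
        """Return the model to use, or None if both budgets are exhausted."""
        for model in (preferred, fallback):
            key = (user, model)
            limit = self.limits.get(key)
            if limit is None:
                continue  # no budget configured means no access
            used = self.spent.get(key, 0.0)
            if used + estimated_cost <= limit:
                self.spent[key] = used + estimated_cost
                return model
        return None


if __name__ == "__main__":
    limiter = SpendLimiter({("ana", "big-model"): 1.0, ("ana", "small-model"): 0.5})
    for _ in range(4):
        print(limiter.route("ana", "big-model", "small-model", 0.5))
```

Treating a missing budget as a denial rather than unlimited spend is a deliberate choice in this sketch: it keeps the control fail-closed, in line with the other layers.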

That's the briefing. Have a great day.