The Harness

Anthropic documents three production containment failures

Show Notes

Gemma 4 12B drops open-source with a breakthrough encoder-free architecture that runs multimodal inference on 16GB VRAM, and Ideogram 4.0 goes open-weight as the top-ranked open image model — the local AI infrastructure stack is quietly becoming commodity. Anthropic published a production engineering post documenting three major containment failures, including a credential-exfiltration attack that succeeded 24 of 25 times, confirming that OS-level isolation is doing more safety work than model training. Berkeley's CS failure rates hit 35% as the first well-documented AI skills gap comes due — students who completed prerequisites under open-AI policies arrived at exams unable to defend their reasoning.

What is The Harness?

A daily summary of what is interesting and new in the AI industry, with a focus on what it means for people building harness experiences that actually get used.

Good morning, it's Thursday, June fourth.

In today's briefing we see Google's Gemma 4 introducing an encoder-free architecture that fits multimodal inference into sixteen gigabytes of VRAM, Ideogram releasing its image model as open-weight with top-tier quality rankings, and Anthropic documenting production containment failures that show OS-level isolation doing more security work than model training.

First up, today in the big model news:

Google + DeepMind - Gemini

Google released Gemma 4 in a twelve-billion-parameter version with a deliberate architectural choice: eliminating vision and audio encoders entirely. Raw signals project directly into token space through a single matrix multiplication and positional embedding. The result: a twelve-billion-parameter model fitting in sixteen gigabytes of VRAM with near-parity performance to the twenty-six-billion MoE variant on standard benchmarks. The model shipped with day-zero integration into llama.cpp, vLLM, Ollama, and SGLang under Apache two-point-oh. For teams building on-device multimodal inference, this architecture template changes the cost equation, because collapsing the modality stack into the backbone eliminates the encoder integration tax that made local multimodal expensive.
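As a rough illustration of the projection step just described, the sketch below splits a tiny grayscale grid into patches, multiplies each flattened patch by a single weight matrix, and adds a positional embedding. This is not Gemma's code: the patch size, the model width, the random stand-in weights, and the function names `patchify` and `project` are all assumptions made for the example.

```python
import random


def patchify(image, patch):
    """Split an H x W grid into flattened patch vectors, in row-major order.

    Assumes H and W are multiples of `patch` (hypothetical simplification).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def project(patches, weight, pos_emb):
    """Map each patch to a token: one matrix multiply plus a positional embedding."""
    d_model = len(weight[0])
    tokens = []
    for i, vec in enumerate(patches):
        tokens.append([
            sum(v * weight[k][j] for k, v in enumerate(vec)) + pos_emb[i][j]
            for j in range(d_model)
        ])
    return tokens


# Toy sizes and random values stand in for learned parameters.
rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]
patches = patchify(image, patch=2)
d_model = 3
weight = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(len(patches[0]))]
pos_emb = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in patches]
tokens = project(patches, weight, pos_emb)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of width 3
```

The point of the sketch is only that no separate encoder network sits between the raw signal and the backbone; the real model's dimensions and positional scheme may differ.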

Anthropic - Claude

Anthropic published how it contains Claude across products: ephemeral containers for claude.ai, OS-level sandboxing for Claude Code, and full VM isolation for Claude Cowork. Three production failures emerged: a config parser running before consent dialogs, a phishing attack exfiltrating AWS credentials twenty-four of twenty-five times despite safety training, and legitimate allowlist entries becoming exfiltration channels. The finding: operating system primitives held; Anthropic's custom components failed. For product teams shipping agents into enterprise, OS-level isolation should be the primary security layer, because across all three incidents the operating system primitives were the part that held, not the safety training or the custom components.
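To make the allowlist point concrete, here is a generic sketch, not a reconstruction of the incident Anthropic described. It assumes a simple hostname allowlist and shows that such a check cannot tell a routine request from one that carries a secret to an approved host. The host names, the `host_is_allowed` function, and the fake secret value are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts an agent is permitted to reach.
ALLOWED_HOSTS = {"packages.example.org", "storage.example.com"}


def host_is_allowed(url: str) -> bool:
    """Return True when the URL's hostname is on the allowlist."""
    host = urlsplit(url).hostname or ""
    return host in ALLOWED_HOSTS


benign = "https://packages.example.org/simple/requests/"
leaky = "https://storage.example.com/upload?note=FAKE_SECRET_VALUE"
blocked = "https://attacker.example.net/collect"

for url in (benign, leaky, blocked):
    # The check only sees the destination, never what the request carries.
    print(host_is_allowed(url), url)
```

The second URL passes the same check as the first, which is why a host allowlist on its own is a weak last line of defense.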

In the local model developments world:

Ideogram released version four-point-zero as open-weight. It's a nine-point-three-billion-parameter Diffusion Transformer using Qwen three-VL-eight-billion-Instruct as text encoder. The model ships in nf4 and fp8 quantized versions on Hugging Face, supports native two-K resolution with layout control, and ranks first among open-weight models on Design Arena. The trade-off: non-commercial license. For design teams, open-weight image generation reaches quality parity with closed models, though the non-commercial license means weighing cloud dependency against restricted local deployment.
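For a sense of what the two quantized versions mean in memory terms, the back-of-envelope sketch below estimates weight storage for the nine-point-three-billion-parameter diffusion transformer alone. It assumes four bits per parameter for nf4 and eight for fp8, and it ignores the text encoder, activations, and quantization metadata, so real usage will be higher. The helper name `weights_gb` is hypothetical.

```python
def weights_gb(params: float, bits_per_param: int) -> float:
    """Weights-only size in decimal gigabytes: params * bits / 8 bytes."""
    return params * bits_per_param / 8 / 1e9


DIT_PARAMS = 9.3e9  # parameter count quoted in the briefing

for name, bits in (("nf4", 4), ("fp8", 8)):
    print(f"{name}: about {weights_gb(DIT_PARAMS, bits):.2f} GB of weights")
```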

Multimodal capability crossed from experimental to commodity this week. Miso One brought one-shot voice cloning to one hundred and ten milliseconds latency in an eight-billion-parameter open-weights model. Alibaba's Fun-Realtime-TTS topped the Artificial Analysis Speech Arena. Google's Magenta RealTime two added low-latency music generation. For AI PMs building full-stack local agents, multimodal capability is now a commodity infrastructure layer, because voice, image, and music inference run on commodity hardware with acceptable latency.

In the harness, tools and orchestration world:

A case study from Harvey and LangChain showed tiered model architecture, with GLM five-point-one as worker and Opus four-point-seven as verifier, achieving eighteen percent versus fourteen percent all-pass rates on legal tasks while cutting costs from nine hundred fifty-four dollars to three hundred sixty-eight dollars per hundred tasks. That's a two-point-six-times cost reduction with quality improvement. For product teams building agentic workflows, tiered model routing is moving from research to production, because the economics now justify the routing complexity.
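The worker-and-verifier pattern can be sketched as a short loop: a cheaper model drafts, a stronger model checks, and rejected drafts go back with feedback. The case study's actual orchestration is not described here, so `run_tiered`, the stub models, and the retry limit are hypothetical. The cost figures at the end are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Worker = Callable[[str, str], str]                 # (task, feedback) -> draft
Verifier = Callable[[str, str], Tuple[bool, str]]  # (task, draft) -> (ok, feedback)


@dataclass
class TieredResult:
    answer: str
    attempts: int
    accepted: bool


def run_tiered(task: str, worker: Worker, verifier: Verifier,
               max_attempts: int = 3) -> TieredResult:
    """Let the worker draft and the verifier accept or return feedback."""
    feedback, answer = "", ""
    for attempt in range(1, max_attempts + 1):
        answer = worker(task, feedback)
        ok, feedback = verifier(task, answer)
        if ok:
            return TieredResult(answer, attempt, True)
    # Out of attempts: return the last draft, flagged as not accepted.
    return TieredResult(answer, max_attempts, False)


# Stub models standing in for real API calls (assumptions for the example).
def stub_worker(task: str, feedback: str) -> str:
    return f"{task}: draft" + (" with citation" if feedback else "")


def stub_verifier(task: str, answer: str) -> Tuple[bool, str]:
    if "citation" in answer:
        return True, ""
    return False, "add a citation"


result = run_tiered("summarize clause 4", stub_worker, stub_verifier)
print(result.accepted, result.attempts)

# Cost per hundred tasks, as quoted in the briefing.
cost_ratio = 954 / 368
print(f"{cost_ratio:.1f}x cheaper")
```

In this toy run the verifier rejects the first draft and accepts the second, and the quoted costs work out to roughly two-point-six times.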

Uber capped coding-tool spend at one thousand five hundred dollars per month per tool per engineer. This is the first major public enterprise ceiling for AI dev-tool spend. Because the cap applies per tool, it also treats multi-tool stacks as normal practice. For vendors and enterprises adopting coding tools, expect pricing conversations to anchor around one thousand five hundred dollars per tool per month, because a ceiling that was theoretical now has a concrete enterprise precedent.

In AOB:

UC Berkeley's CS ten hit a thirty-five-point-three percent failure rate in spring twenty twenty-six, up from under ten percent historically. Faculty identified the mechanism: students completed prerequisites under open-AI policies without building math foundations, then failed assessed work they couldn't defend without the tool. For hiring teams recruiting junior engineers in two to three years, the talent pipeline is being distorted, because a cohort is entering with tool-dependent reasoning in domains requiring independent verification.

Sixteen mathematicians published a declaration drawing over one hundred thirty signatories, warning that AI-generated proofs could overwhelm peer review and erode attribution. The concern extends to corporate encroachment in university mathematics research. For professional communities developing standards for AI-generated work, institutional verification infrastructure becomes a competitive advantage, because whoever builds trusted validation tooling owns the niche.

That's the briefing. Have a great day.