Anthropic goes two-front on AI security
A daily summary of what is interesting and happening in the AI industry, with a focus on what it means for people building harness experiences that actually get used.
Good morning, it's Wednesday, June third.
In today's briefing, Microsoft formally enters as a frontier AI lab with the MAI family and custom silicon, Anthropic expands its defensive AI program for enterprise security teams, and there are significant moves in on-device inference and verification economics.
First up, today in the big model news:
Anthropic - Claude
Anthropic published a year-long MITRE ATT&CK study tracking eight hundred and thirty-two malicious accounts across AI systems. The finding: the share of medium-to-high-risk actors jumped from thirty-three percent to fifty-six percent in six months, and sixty-seven percent of banned accounts used AI to write malware. Simultaneously, Anthropic is expanding Project Glasswing, its defensive AI program, to one hundred and fifty more organizations across fifteen countries. The first fifty Glasswing partners found over ten thousand high and critical vulnerabilities in their own environments. For enterprise security teams, the read is that AI-enabled threat sophistication is advancing on six-month cycles, because threat actors are operationalizing AI-assisted attack orchestration faster than traditional security tooling can defend against it.
In other lab news today, Microsoft entered as a frontier AI lab with significant technical substance. The central announcement was MAI-Thinking-1, a mixture-of-experts model with thirty-five billion active parameters, trained on eight thousand one hundred and ninety-two GB200 GPUs, scoring ninety-seven percent on AIME twenty twenty-five and fifty-three percent on SWE-Bench Pro. What made the release notable was the transparency: a one hundred and nine-page technical report disclosing the data composition, including fifty percent code, seventeen point five percent STEM, and seventeen point five percent math, along with a hard claim of no synthetic data and no distillation. Independent researchers called it one of the most transparent frontier model reports at scale. That reads as a deliberate positioning move against OpenAI and Anthropic, both of which have reduced disclosure over the past year. The model also runs more efficiently on Microsoft's MAIA two hundred custom silicon, with thirty percent better performance-per-dollar and one point four times the performance-per-watt versus raw GB200, marking the first credible custom-chip story outside Google TPUs and AWS Trainium. For product teams evaluating full-stack economics in agent deployments, expect custom silicon to become table stakes at frontier scale, because margin structure fundamentally changes when you control the entire stack from silicon through model through runtime.
Microsoft's Build conference also presented Windows as the execution platform for agents: Project Solara as a desktop AI companion plus wearable badge, Project Scout as an always-on work agent, and the Surface RTX Spark Dev Box with one hundred and twenty GB of unified memory, one hundred and ten TOPS, and twenty CPU cores. MAI-Code-1-Flash, a five billion parameter model, is rolling out in GitHub Copilot for VS Code with no setup required, outperforming Claude Haiku four point five by sixteen percentage points on SWE-Bench Pro. Microsoft shipping a coding model that explicitly benchmarks against a competitor's product by name is new. For product teams shipping integrated coding workflows, the signal is that competitive advantage now flows through bundled harness quality as much as raw model capability, because the market is shifting toward evaluating complete end-to-end performance rather than selecting by model alone.
Running alongside Build coverage, H Company released Holo three point one, a local computer-use model family spanning zero point eight billion to thirty-five billion parameters. The thirty-five billion model hits seventy-nine point three percent on AndroidWorld, the current leading benchmark for on-device task automation, and is available in GGUF, FP8, and NVFP4 formats. For teams building autonomous workflows on consumer and edge hardware, capable computer-use no longer requires cloud dependency, because the performance threshold on local devices has crossed into production territory.
In the harness, tools, and orchestration world:
DeepSeek V4 Flash maintains ninety-four to ninety-six percent agreement with Opus four point seven on legal verification tasks at eighteen times lower cost per criterion and roughly one thousand times cheaper in batch mode. The concrete number: three thousand two hundred RL rollouts that cost eighteen thousand dollars dropped to eighteen dollars after optimization. For product teams architecting agent verification pipelines, accuracy and cost are no longer in tension, because batch economics have shifted verification from a constraint on pipeline design to a negligible cost factor.
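For anyone who wants to check that arithmetic, here is a minimal sketch in Python. It assumes the eighteen thousand dollar and eighteen dollar figures are totals for the same three thousand two hundred rollouts; the variable names are illustrative and do not come from the source.

```python
# Figures quoted in the briefing (assumed to be totals for the same workload).
rollouts = 3200
cost_before_usd = 18_000
cost_after_usd = 18

# How many times cheaper the optimized run is.
reduction_factor = cost_before_usd / cost_after_usd

# Average cost of a single rollout before and after optimization.
per_rollout_before = cost_before_usd / rollouts
per_rollout_after = cost_after_usd / rollouts

print(f"reduction: {reduction_factor:.0f}x")
print(f"per rollout before: ${per_rollout_before:.3f}")
print(f"per rollout after:  ${per_rollout_after:.6f}")
```

On those assumptions, the drop works out to the same roughly one thousand times figure quoted for batch mode, and the per-rollout cost falls from a few dollars to well under a cent.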
In AOB:
Trump signed a downsized AI executive order, replacing mandatory ninety-day pre-release review with a voluntary thirty-day framework with no enforcement. For product teams, the practical effect is freedom from federal guardrails through the end of the year, because the voluntary framework carries no penalty for non-participation. State-level action is accelerating separately. Florida's product liability suit against OpenAI and Maryland's AI pricing ban are both active, with multiple attorneys general watching for precedent.
Stanford Law published a blind study of nearly three thousand responses finding AI wins seventy-five percent of head-to-head matchups against law professors on contract law questions. Professors rated AI responses as potentially harmful three point five percent of the time versus twelve percent for peer answers. For legal product companies and teams evaluating vertical AI, the read is that credentialed-expert benchmarking is now table stakes in procurement conversations, because AI has crossed into outperforming domain experts in controlled peer-reviewed settings.
That's the briefing. Have a great day.