Cloudflare opens Project Think. Agents can now be paying customers.
New Agents SDK adds durable execution, sub-agents and sandboxed code execution. Agents can now register Cloudflare accounts and deploy code without a human in the loop.
Working notes, dated digests, and the occasional opinion. No newsletter, no ghostwriters. Every post is signed by the engineer who wrote it.
-- no BEGIN/COMMIT: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_users_email_lower
  ON users (lower(email))
  WHERE deleted_at IS NULL;
-- the agent suggested this with deleted_at = false.
-- it was wrong. boolean column doesn't exist.
-- 7 minutes lost. caught in review.
Coding agents passed the hype curve sometime last quarter. Eighty percent of teams use them. Acceptance rates jumped from 20% to 60%. So why does our bench still write migrations, indexes and SQL by hand? A short defense of the unsexy 30%.
V4 Flash and V4 Pro hit Hugging Face. Roughly 90% of GPT-5.4 quality at a fraction of the cost. The economics question for self-hosting just shifted again.
Six tests we run on every new model, including one we stole from a security review checklist. The short version: SWE-bench scores tell you almost nothing about whether an agent will be safe in your CI.
~ Lilit Petrosyan
New default in Claude Code. Anthropic also rolled the cyber-safeguards developed for the unreleased Mythos model into Opus 4.7.
Partitioning, the autovacuum knobs nobody documents, and the one extension we ship by default. Notes from a recent migration off a managed plan that was costing more than two engineers.
~ Hovhannes Davtyan
First model from Meta Superintelligence Labs ships proprietary. Capex guide for 2026 raised to $115–$135B. The Llama strategy as we knew it is over.
Apache 2.0, four sizes from 2.3B to 31B, and the 31B variant ranks #3 globally on Arena. We priced out the 8B variant on three GPU configurations, including one that fits under a desk.
~ Mariam Asatryan
Four months out. Three things every team shipping into the EU should already have on a Jira board, and one obligation most teams misread.
On why "fractional QA" almost never works, and what we ask before we agree to embed a tester. Includes the four questions we use to know whether to staff a single QA or a pair.
~ Tigran Mkrtchyan
First Gartner agent-specific report ships. Headline figure is the failure rate, but the more interesting number is buried on page 14: median time-to-cancel.
Curated by the bench, written by humans. Unsubscribe in one click.