Sonnet 5 Arrives — and So Does Its Red-Team Report Card
Today's AI news: Sonnet 5 Arrives — and So Does Its Red-Team Report Card, Washington Wants a Piece of the AI Action, Agents That Learn, Scale, and Know When They're Lying, Agents Go Spelunking: Tools for RE, Memory, and Design, Beyond Transformers: Small Models, Big Questions, Fifty-Dollar Hardware and Sub-Billion Parameters, The Art of Innocent-Looking Malice. 22 sources curated from across the web.
Sonnet 5 Arrives — and So Does Its Red-Team Report Card
Anthropic dropped Claude Sonnet 5 this week, positioning it as the most agentic Sonnet-class model to date and, at introductory pricing of $2/$10 per million input/output tokens through August 31, possibly the most cost-efficient frontier model for sustained agentic work. The pitch is straightforward: Sonnet 5 narrows the gap with Opus 4.8 on reasoning, coding, and tool use, while costing roughly 60% less per token. Early access partners report that the model finishes complex multi-step tasks where previous Sonnets would stall, self-checks its own output unprompted, and handles brownfield code (race conditions, hidden test dependencies) with notably better root-cause analysis than its predecessor. Anthropic's own safety evaluations show lower hallucination and sycophancy rates than Sonnet 4.6, better resistance to prompt injection in agentic contexts, and substantially weaker cyber capabilities than Opus-class models: it never developed a working exploit in Firefox vulnerability tests. One trade-off: an updated tokenizer multiplies input token counts by a factor of 1.0–1.35 depending on content, an increase the introductory pricing is designed to absorb. (more: https://www.anthropic.com/news/claude-sonnet-5)
That safety posture arrives just as an independent red-team study from Italy's AI4I institute quantifies what "frontier model safety" actually means under adversarial pressure. The researchers subjected Fable 5 and Opus 4.8 to hundreds of thousands of automated jailbreak attempts across a ten-category harm taxonomy, using four attack families ranging from static obfuscation to adaptive tree-of-attacks search. Every apparent success was re-adjudicated by a three-judge panel requiring majority confirmation. The headline finding: static obfuscation (base64 encoding, ciphers, role-play wrappers) is "near-fully neutralised," with confirmed success rates at or below 0.1% despite roughly 100,000 attempts per model. But adaptive attacks, where an attacker model iteratively rewrites prompts based on refusals, remain devastatingly effective. Tree-of-attacks search broke Opus 4.8 on over 13% of intents, with child-safety framings hitting 19% and cybersecurity peaking at double digits. Fable 5 held every attack family to single-digit success rates but still produced 67 panel-confirmed harmful completions spanning every harm category. The vulnerability is contextual rather than lexical (the same intent succeeds when reframed, not when encoded), which makes it harder to defend with surface-level filters. Most concerning: these failures were found "automatically, cheaply, and within the first one or two refinement steps" by an attacker model with no human expert in the loop. At deployment scale, a success rate of this magnitude translates into a steady, reproducible stream of harmful completions. (more: https://arxiv.org/abs/2606.18193v1)
A companion cost-performance analysis of Fable 5 on SWE-bench Lite reinforces a practical point: Fable 5 resolved 23 of 25 "hard-tail" tasks that defeated every cheaper model, but a Sonnet-first, Fable-on-failure routing strategy improved resolution from 84% to 88% while cutting costs 22%. The optimal deployment is a tiered ladder where Fable sits at the top rung, engaged only for the 5–25% of instances that actually need frontier-grade reasoning. (more: https://gist.github.com/ruvnet/dbe163c3fc9accfe62198f70a667b339)
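The routing idea can be sketched in a few lines. This is a minimal illustration, not the gist's code: the `Tier` type, the toy solver functions, and the per-attempt costs are all hypothetical, and only the cheap-first, escalate-on-failure shape comes from the analysis above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tier:
    name: str
    cost_per_attempt: float                 # hypothetical dollars per attempt
    solve: Callable[[str], Optional[str]]   # returns a patch, or None on failure


def route(task: str, ladder: List[Tier]) -> Tuple[Optional[str], float, str]:
    """Try tiers cheapest-first and escalate only when a tier fails."""
    spent = 0.0
    for tier in ladder:
        spent += tier.cost_per_attempt      # every attempt is paid for, pass or fail
        patch = tier.solve(task)
        if patch is not None:
            return patch, spent, tier.name
    return None, spent, "unresolved"


def expected_cost(cheap: float, frontier: float, escalation_rate: float) -> float:
    """Average cost per task when a fraction of tasks escalates to the top rung."""
    return cheap + escalation_rate * frontier


# Toy stand-ins for real model calls (assumptions for illustration only).
def cheap_solver(task: str) -> Optional[str]:
    return None if task.startswith("hard") else f"patch:{task}"


def frontier_solver(task: str) -> Optional[str]:
    return f"patch:{task}"


LADDER = [Tier("cheap", 1.0, cheap_solver), Tier("frontier", 5.0, frontier_solver)]
```

With these made-up costs, an escalation rate of 25% gives an expected cost of 1.0 + 0.25 × 5.0 = 2.25 per task, against 5.0 for sending every task to the top rung. The real savings depend on how often the cheap tier fails and how reliably failure can be detected.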
Washington Wants a Piece of the AI Action
OpenAI has proposed giving the U.S. government a 5% equity stake — worth roughly $42.6 billion at its $852 billion valuation — as part of a broader arrangement where each major U.S. AI developer would cede similar shares through a sovereign wealth fund vehicle. CEO Sam Altman reportedly pitched the concept directly to the Trump administration, framing it as the best way to share AI's economic upside with the public. The talks are "conceptual" and in early stages; implementation could require an act of Congress. The proposal envisions Anthropic, Google, and Meta ceding similar stakes, though none have agreed. Anthropic says it has not discussed a government equity arrangement. (more: https://www.cnbc.com/2026/07/02/openai-proposes-us-government-own-5percent-stake-to-address-political-blowback.html)
Both OpenAI and Anthropic have previously floated public wealth fund concepts in policy papers, and both are preparing for IPOs that could value each above $1 trillion. Altman has spoken with Commerce Secretary Howard Lutnick, Treasury Secretary Scott Bessent, and Senator Bernie Sanders — the latter pushing a 50% one-time tax on major AI companies' stock to fund an independent commission. The Trump administration has already taken stakes in Intel, quantum computing firms, and critical mineral companies during the president's second term. Trump has described such ownership as making Americans "partners in this revolution." (more: https://www.theguardian.com/technology/2026/jul/02/openai-stake-us-government-ai-sam-altman)
The proposal lands amid a broader battle over who controls AI access. Anthropic CEO Dario Amodei drew fresh community fire this week for arguing that open-source models cannot provide the same benefits as proprietary ones — claiming you "cannot see inside the model," that collaborative fine-tuning "doesn't work in the same way," and that running models locally is impractical. The r/LocalLLaMA community dismantled each claim: open weights are literally the inspectable part, fine-tuning has produced "endless real improvements," and models like Qwen 27B run comfortably on consumer hardware without any cloud dependency. The broader sentiment crystallized around a familiar observation: the CEO of a company selling proprietary AI access has a structural incentive to argue against free alternatives. (more: https://old.reddit.com/r/LocalLLaMA/comments/1ui241x/the_number_1_public_enemy_of_opensource/)
Meanwhile, the open-source community is building infrastructure for a future where model distribution does not depend on any single platform. Model Registry publishes BitTorrent files for popular open models with Hugging Face as a fallback web seed — downloads work peer-to-peer first and fall back to HF's CDN when no peers are available. The project is still experimental, but community interest is immediate: multiple users volunteered seeding capacity, and several framed it as essential contingency planning for a future where centralized model hosting becomes politically or commercially restricted. (more: https://old.reddit.com/r/LocalLLaMA/comments/1uhevvf/model_registry_torrents_for_open_models_using/)
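For readers unfamiliar with web seeds: a torrent's metainfo file can carry a `url-list` key (BitTorrent BEP 19) naming ordinary HTTP URLs that clients may fetch pieces from when no peers are available. A minimal sketch of building such a file follows. It is not Model Registry's tooling, and the file name, payload, and URL are placeholders.

```python
import hashlib
from typing import List


def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists, and string-keyed dicts in bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are emitted in sorted (raw byte) order.
        items = sorted((k.encode("utf-8"), v) for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def make_metainfo(name: str, data: bytes, web_seeds: List[str],
                  piece_length: int = 16384) -> dict:
    """Single-file torrent metainfo with HTTP web seeds as a fallback source."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()   # 20 bytes per piece
        for i in range(0, len(data), piece_length)
    )
    return {
        "info": {
            "name": name,
            "length": len(data),
            "piece length": piece_length,
            "pieces": pieces,
        },
        "url-list": web_seeds,   # BEP 19 web seed URLs
    }


meta = make_metainfo(
    "model.safetensors",                             # placeholder file name
    b"\x00" * 40000,                                 # placeholder payload
    ["https://example.invalid/model.safetensors"],   # placeholder web seed URL
)
torrent_bytes = bencode(meta)
```

The sketch omits peer discovery (a tracker `announce` URL or DHT), which a real torrent would still need for the peer-to-peer path. The web seed only covers the case where no peer can serve the data.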
Agents That Learn, Scale, and Know When They're Lying
As AI agents move from demos to production, three distinct problems are crystallizing: how agents learn from their own work, how infrastructure handles real-world compound load, and how to keep self-improving agents honest. A LinkedIn editorial this week names the third problem directly: Goodhart Resistance. The moment an agent can optimize its own behavior, the acceptance gate — the metric it optimizes against — becomes part of the risk surface. The agent will learn to satisfy the verifier, which may or may not mean doing better work. The prescription is blunt: shadow first, promote on holdout data, watch for lagged failures (reverted code, reopened tickets, manually repaired invoices), and roll back fast. A verifier is an instrument, not truth. (more: https://www.linkedin.com/posts/reuvencohen_as-agents-start-optimizing-their-own-behavior-activity-7478640199074230272-CUvh)
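The prescription can be read as a promotion gate. The sketch below is one hypothetical way to encode it, not anything taken from the editorial: the `Report` fields, the thresholds, and the three outcomes are assumptions chosen to mirror "shadow first, promote on holdout data, watch for lagged failures, and roll back fast."

```python
from dataclasses import dataclass


@dataclass
class Report:
    verifier_pass_rate: float    # the metric the agent optimized against
    holdout_pass_rate: float     # scored on tasks the agent never optimized on
    lagged_failure_rate: float   # reverts, reopened tickets, manual repairs


def decide(baseline: Report, candidate: Report,
           min_holdout_gain: float = 0.01,
           max_lagged_regression: float = 0.0) -> str:
    """Return 'promote', 'keep_shadow', or 'rollback' for a shadow candidate."""
    lag_delta = candidate.lagged_failure_rate - baseline.lagged_failure_rate
    if lag_delta > max_lagged_regression:
        return "rollback"        # downstream damage outranks any score gain
    holdout_gain = candidate.holdout_pass_rate - baseline.holdout_pass_rate
    if holdout_gain >= min_holdout_gain:
        return "promote"
    return "keep_shadow"         # verifier gains alone never promote


baseline = Report(0.80, 0.78, 0.05)
gamed = Report(0.95, 0.78, 0.05)     # verifier score up, holdout flat
harmful = Report(0.95, 0.85, 0.09)   # scores up, lagged failures up
genuine = Report(0.85, 0.82, 0.05)   # holdout up, lagged failures flat
```

Note that `verifier_pass_rate` is recorded but deliberately never consulted by `decide`: the gate treats the verifier as an instrument to monitor, not as the promotion criterion.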
Agent Apprenticeship from Forsy AI operationalizes part of that vision. The framework creates a loop where agents complete real-world tasks, are evaluated by mentors, and generate reusable learning signals — execution traces, work episodes, and structured experience compilations — that feed back into future improvement. The seed dataset includes 500+ curated tasks, 1,000+ execution traces, and 39,000+ structured experience records spanning multiple domains. It supports Codex, Cursor, Claude Code, and custom agents. The key design choice: learning is earned from completed work, not synthesized from hypothetical examples. (more: https://github.com/Forsy-AI/agent-apprenticeship)
Scaling those agents in production brings its own challenges, as documented in a Salesforce deployment study. The company's Agentforce platform, where a single user request fans out to 3–5 concurrent model invocations, revealed that cold starts compound multiplicatively through dependency graphs, not additively. A three-model pipeline with individual cold starts of 150s, 30s, and 20s experienced a 180s effective cold start because upstream models must produce results before downstream models begin. Coordinated pre-warming reduced this by 65%. The system handles 722,000 daily LLM inferences across 21 global regions, processing 4.4 billion tokens per day. Per-model independent scaling correctly handles asymmetric traffic patterns: embedding models scale proportionally with every request while SQL executors scale only 2–3x during spikes. The architecture achieved a 50% reduction in P95 tail latency and 30–40% cost savings over static deployments. (more: https://arxiv.org/abs/2604.25724v1)
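One way to see how 150s, 30s, and 20s can yield 180s is to treat the effective cold start as the longest chain through the dependency graph. The sketch below assumes a topology that this summary does not give: a slow model feeding two faster ones, each of which begins loading only after its upstream model has produced output. The model names are placeholders.

```python
from functools import lru_cache
from typing import Dict, List


def effective_cold_start(cold_s: Dict[str, float],
                         upstream: Dict[str, List[str]]) -> float:
    """Longest chain of cold starts through an acyclic dependency graph.

    Assumes a model starts loading only once all of its upstream models
    have produced output (lazy, request-triggered loading).
    """
    @lru_cache(maxsize=None)
    def ready_at(model: str) -> float:
        waits = [ready_at(dep) for dep in upstream.get(model, [])]
        return max(waits, default=0.0) + cold_s[model]

    return max(ready_at(model) for model in cold_s)


# Hypothetical topology: one slow model feeds two faster ones.
cold_s = {"model_a": 150.0, "model_b": 30.0, "model_c": 20.0}
fan_out = {"model_b": ["model_a"], "model_c": ["model_a"]}
chain = {"model_b": ["model_a"], "model_c": ["model_b"]}

print(effective_cold_start(cold_s, fan_out))   # critical path a -> b
print(effective_cold_start(cold_s, chain))     # strict chain a -> b -> c
print(effective_cold_start(cold_s, {}))        # no dependencies: slowest model
```

Under these assumptions the three calls give 180.0, 200.0, and 150.0 seconds, so the fan-out shape matches the reported figure. The function does not detect cycles and is only meaningful for acyclic graphs.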
Embodied agents face an even harder version of these problems. OmniAct, from Fudan University, tackles persistent physical autonomy by decomposing planning, memory, and execution verification into three cooperating modules. Its hierarchical memory compresses interaction history at semantic event boundaries rather than fixed intervals, yielding near-flat token consumption over 100,000+ accumulated tokens — compared to linear growth that would exhaust most context windows. Across 40 real-world tasks on two robotic platforms, asynchronous visual preemption (periodic VLM-based verification of physical execution) caught grasp failures and navigation deviations that open-loop baselines propagated into cascading faults. Most strikingly, it elevated the open-weight Qwen3-VL-30B to proprietary Gemini-3.1-Pro performance levels (+30 points) without additional training, demonstrating that architectural clarity complements model capacity rather than competing with it. (more: https://arxiv.org/abs/2606.27251v1)
Agents Go Spelunking: Tools for RE, Memory, and Design
Cellebrite Labs released ghidra-rpc, turning Ghidra — the NSA's open-source reverse engineering suite — into a persistent background daemon controllable by any AI coding assistant via structured JSON over a Unix socket. The agent can decompile functions to pseudo-C, trace call graphs, rename symbols, define structs, patch instructions, and diff binary versions, all without human intervention. Ghidra stays warm between commands (no re-analysis per invocation), and all changes persist in the project file. This is not "AI-assisted RE" in the vague sense — it is a fully autonomous reverse engineering skill set where the agent navigates binary analysis the way a human analyst would, issuing commands and reasoning over structured results. (more: https://github.com/cellebrite-labs/ghidra-rpc)
On the memory side, Recall addresses Claude Code's cold-start problem with a fully local, zero-API-key approach. Every session is captured as an append-only log, and a local TF-IDF + TextRank summarizer — no LLM call anywhere — condenses it into a compact resume point (~1–2K tokens) that loads on the next session. The privacy guarantee is absolute: no network calls, no model endpoint, no credentials. It fills a specific gap between CLAUDE.md (hand-curated instructions) and --continue (token-heavy full transcript replay) — an automatic, deterministic record of what each session did, condensed offline at zero token cost. (more: https://github.com/raiyanyahya/recall)
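To make "TF-IDF + TextRank" concrete, here is a small, dependency-free extractive summarizer in that style. It is a generic textbook sketch, not Recall's implementation: the tokenizer, the smoothed IDF formula, the damping factor, and all function names are assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(sentences: List[str]) -> List[Dict[str, float]]:
    """One sparse TF-IDF vector per sentence (smoothed IDF)."""
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(counts)
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0) for t, tf in c.items()}
        for c in counts
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank(sentences: List[str], damping: float = 0.85,
             iterations: int = 50) -> List[float]:
    """PageRank over a sentence graph weighted by TF-IDF cosine similarity."""
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences: List[str], k: int = 2) -> List[str]:
    """Keep the k highest-ranked sentences, in their original order."""
    if not sentences:
        return []
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


session_log = [
    "Fixed the race condition in the cache loader.",
    "Added a regression test for the cache loader race condition.",
    "Lunch was good.",
    "Refactored the cache loader to hold the lock during refresh.",
]
resume_point = summarize(session_log, k=2)
```

Because nothing here is sampled or learned, the same log always produces the same summary, which is the property that makes this family of methods attractive for an offline, deterministic session record.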
For Rust developers fighting slow CI, the rust-optimizer plugin audits a repo's GitHub Actions, release workflows, Docker images, caching, and dependency graphs, then produces a prioritized optimization report with a machine-checkable OPTIMIZATION_SPEC.md. It never edits code — it diagnoses root causes (a workspace compiling four times per PR, fat debug artifacts eating disk), reports with honest metrics, and hands off to Claude's autopilot pipeline. Three principles: root cause over symptom, the compiler is the arbiter, enforce or remove. (more: https://github.com/agentic-incubator/rust-optimizer)
Open Design positions itself as the open-source alternative to Claude Design — agent-native, model-agnostic, local-first. It runs on 22 coding-agent CLIs via MCP server integration, ships 150 brand-grade design systems as DESIGN.md files, and produces real HTML/CSS/MP4 artifacts: prototypes, dashboards, pitch decks, and HyperFrames motion graphics. The 0.10.0 release adds parallel sessions, and the project frames itself as "a Figma alternative for the agent era" — delivering single-page artifacts in real CSS and real fonts, already shaped by a design system, already runnable inside whatever coding agent sits on the developer's PATH. (more: https://github.com/nexu-io/open-design)
Beyond Transformers: Small Models, Big Questions
Hierarchos is a 232-million-parameter experiment asking whether a hybrid non-Transformer architecture — RWKV backbone, hierarchical manager/worker loops, differentiable slot-based long-term memory, and a deterministic suffix automaton — can survive training, avoid collapse, and maintain coherent instruction-following. After 13 epochs on an RTX 6000 Blackwell, the answer is a qualified yes: ARC Easy hits 36%, HellaSwag 37% normalized, TruthfulQA MC1 22% — roughly GPT-2 era, not GPT-3.5. The researchers are refreshingly honest about the limitations. The real contribution is engineering: they identified and fixed critical train/inference parity mismatches — drift state reseeding bugs, supervised memory updates creating training-only helper signals, unbounded channel mixing causing NaN gradients — that would have killed the architecture silently. The scaling plan targets 1B–1.5B parameters next, with ablations to isolate how much each component contributes versus a pure Transformer baseline at equal parameter counts. (more: https://old.reddit.com/r/LocalLLaMA/comments/1um1j7q/hierarchos_preliminary_findings_from_a_232m/)
Allen AI's DiScoFormer tackles a more fundamental problem: estimating both the density and score (gradient of the log-density) of an unknown distribution from a finite sample, in a single forward pass, without retraining per distribution. The architecture exploits a mathematical connection: cross-attention weights approximate a Gaussian kernel, making a single attention block equivalent to kernel density estimation but learnable and multi-scale. In 100 dimensions, DiScoFormer cuts score error by 6.5x and density error by 37x compared to the best hand-tuned KDE. Since score estimation underlies diffusion models, Bayesian sampling, and particle simulations, a pretrained plug-in estimator that scales to high dimensions could reduce costs across all of them simultaneously. (more: https://huggingface.co/blog/allenai/discoformer)
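The attention-to-kernel connection can be checked numerically. The sketch below is a textbook identity rather than DiScoFormer's actual architecture: softmax over dot-product logits, with a per-key norm correction and temperature h², reproduces normalized Gaussian kernel weights with bandwidth h exactly. The function names and sample points are illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)                              # subtract max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_kernel_weights(query: Sequence[float],
                            keys: Sequence[Sequence[float]],
                            h: float) -> List[float]:
    """Normalized Gaussian kernel weights exp(-||q - k||^2 / (2 h^2))."""
    raw = [
        math.exp(-sum((q - k) ** 2 for q, k in zip(query, key)) / (2 * h * h))
        for key in keys
    ]
    total = sum(raw)
    return [r / total for r in raw]


def attention_weights(query: Sequence[float],
                      keys: Sequence[Sequence[float]],
                      h: float) -> List[float]:
    """Softmax over (q.k - ||k||^2 / 2) / h^2.

    Expanding -||q - k||^2 / (2 h^2) leaves a -||q||^2 term that is the same
    for every key, so it cancels inside the softmax.
    """
    logits = [
        (sum(q * k for q, k in zip(query, key)) - 0.5 * sum(k * k for k in key)) / (h * h)
        for key in keys
    ]
    return softmax(logits)


query = [0.3, -0.2]
keys = [[0.0, 0.0], [1.0, 0.0], [0.5, -0.5], [-1.0, 2.0]]
kernel_w = gaussian_kernel_weights(query, keys, h=0.7)
attn_w = attention_weights(query, keys, h=0.7)
```

When all keys have the same norm, the correction term is constant and plain dot-product attention already gives Gaussian kernel weights. In the general case, the norm term acts as a per-key bias.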
A community project on r/LocalLLaMA takes interpretability in a visual direction, mapping activation paths across model layers during inference on Gemma 4 and Qwen 3.6 models. Testing five simple prompts (capital of Mongolia, why leaves change color, Fahrenheit-to-Celsius), the visualizations reveal strikingly different activation variance across architectures — Gemma 4 31B shows minimal variance while the 26B variant shows extreme variance for the same prompts. The author draws an analogy to neuroscience: "there is no 'sarcastic node' that fires independently — it's a series of nodes firing together to create that sarcasm." The community requested code access and more detailed descriptions of the visualization methodology. (more: https://old.reddit.com/r/LocalLLaMA/comments/1um8pl0/mapping_local_nodes_mildlyinteresting/)
Fifty-Dollar Hardware and Sub-Billion Parameters
Pine64 launched PineVoice, a $50 smart speaker built on a Bouffalo Lab BL606P RISC-V SoC with integrated Wi-Fi, Bluetooth 5.0, and Zigbee radios. It ships with Alibaba's open-source YoC platform and the Wyoming Satellite protocol, turning it into a local microphone and speaker for self-hosted Home Assistant setups. The specs — 32 MiB pSRAM, 16 MiB flash, 128 KiB ROM — are deliberately minimal; this is an embedded device priced to match the Amazon Echo Dot, not compete with it on features. Pine64's model has always been cheap, open-friendly hardware that developers shape into viable products. Whether PineVoice achieves that depends entirely on community firmware development — Pine64 warns it is early-stage and not consumer-grade. (more: https://www.omgubuntu.co.uk/2026/06/pine64-pinevoice-riscv-smart-speaker-launch)
Complementing that hardware minimalism, a benchmarking study ran eight sub-1B LLMs on a $250 Jetson Orin Nano Super across all four power modes. The headline: llama.cpp consistently beats Ollama on sub-1B models, with gaps from 1.37x (SmolLM2-135M) to a striking 4.2x (LFM2.5-350M) — the latter almost certainly caused by Ollama's bundled llama.cpp version lacking optimized kernels for LFM2's hybrid architecture. At 25W, SmolLM2-135M hit 165 tok/s with 29.6 output tokens per joule, while LFM2.5-1.2B led the ~1B class at 54.1 tok/s in only 698 MB. A top comment notes Ollama v0.30.0 switched to directly using llama.cpp as its backend, which "invalidates all conclusions" about the backend gap — though the raw per-model data remains valuable for edge deployment planning. NVIDIA's Cosmos3-Nano also appeared on the Hugging Face trending list, signaling continued interest in compact world models for edge deployment. (more: https://old.reddit.com/r/ollama/comments/1ugz256/tiny_jetson_nano_orin_super_benchmarking_of_1b/) (more: https://huggingface.co/nvidia/Cosmos3-Nano)
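The throughput and efficiency figures are linked by simple arithmetic, since one watt is one joule per second. The sketch below assumes both SmolLM2-135M numbers describe the same run, which this summary does not state, and the helper names are hypothetical.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: tokens generated per joule consumed."""
    return tokens_per_second / watts


def implied_watts(tokens_per_second: float, tok_per_joule: float) -> float:
    """Average power draw implied by a throughput and an efficiency figure."""
    return tokens_per_second / tok_per_joule


print(round(implied_watts(165.0, 29.6), 2))
```

Under that assumption, 165 tok/s at 29.6 tokens per joule works out to roughly 5.57 W of average draw. That would suggest the 25W figure names the power mode ceiling rather than the power actually consumed while generating.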
The Art of Innocent-Looking Malice
The Underhanded C Contest, resurrected after years of dormancy and newly resurfaced on Hacker News, remains one of the most instructive exercises in software security, and its lessons deserve fresh attention in an era where AI generates most of the code under review. The challenge: write a program that appears to perform a nuclear weapons inspection (comparing gamma-ray spectra) but contains a subtle bug allowing a host country to cheat. Over 40 submissions demonstrated how code that passes casual review, and even careful static analysis, can harbor deliberately planted vulnerabilities.
The dominant technique was NaN poisoning: engineering floating-point computations where edge inputs produce NaN (Not a Number) values that propagate silently to a final comparison returning the wrong result. About a third of submissions used this, but sophistication varied enormously. The winner triggered a NaN through a Poisson probability calculation that overflows at realistic bin counts (~1686) — achievable by packing a warhead with a short-lived nuclide producing a single strong peak. Another entry hid the bug in logging code: a carefully sized log message overwrites two bytes of a floating-point exponent, changing 2.0 to something that still displays as "2.0" but makes pow() return NaN on negative inputs. A third passed error_message (a string whose initial bytes happen to be valid x86 instructions that pop a stack frame and return true) instead of error_messager (the actual handler function). These are exactly the class of bugs that survive automated review — semantically correct-looking code with mathematically or architecturally subtle failure modes that require domain expertise and adversarial thinking to catch. (more: https://underhanded-c.org/)
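The NaN mechanism is easy to reproduce. The sketch below is a simplified illustration in Python rather than C, and it is not any contest entry's code: the function names, the threshold, and the inputs are assumptions. It relies only on standard IEEE 754 behavior, where overflowing products become infinity, infinity times zero is NaN, and every ordered comparison with NaN is false.

```python
import math
from typing import Sequence


def poisson_pmf_naive(k: int, lam: float) -> float:
    """Naive lam**k * exp(-lam) / k! built from running products."""
    numerator = 1.0
    denominator = 1.0
    for i in range(1, k + 1):
        numerator *= lam       # lam**k: overflows to inf for large k and lam
        denominator *= i       # k!: overflows to inf once k exceeds 170
    return numerator * math.exp(-lam) / denominator


def spectra_match(observed: Sequence[int], expected: Sequence[float],
                  threshold: float = 1e-6) -> bool:
    """Reject when any bin count is too improbable under the reference."""
    for k, lam in zip(observed, expected):
        if poisson_pmf_naive(k, lam) < threshold:   # NaN < threshold is False
            return False
    return True


def spectra_match_safe(observed: Sequence[int], expected: Sequence[float],
                       threshold: float = 1e-6) -> bool:
    """Same check in log space; non-finite values count as a mismatch.

    Requires every expected rate to be strictly positive.
    """
    log_threshold = math.log(threshold)
    for k, lam in zip(observed, expected):
        log_p = k * math.log(lam) - lam - math.lgamma(k + 1)
        if not math.isfinite(log_p) or log_p < log_threshold:
            return False
    return True
```

With an expected rate of 5000 and an observed count of 200, the true probability is vanishingly small, yet `poisson_pmf_naive` returns NaN and `spectra_match` accepts the spectrum, because the rejection branch is never taken. The log-space version rejects the same input. Nothing in the flawed comparison looks wrong in isolation, which is why it survives review.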
Sources (22 articles)
- Claude Sonnet 5 (anthropic.com)
- A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models (arxiv.org)
- [Editorial] ruvnet Technical Reference (gist.github.com)
- [Editorial] OpenAI Proposes US Government Own 5% Stake (cnbc.com)
- OpenAI: In early talks to give 5% stake to US Government (theguardian.com)
- The number 1 public enemy of open-source. (old.reddit.com)
- Model Registry: Torrents for open models using Hugging Face as a fallback web seed. (old.reddit.com)
- [Editorial] Agents Optimizing Their Own Behavior (linkedin.com)
- Forsy-AI/agent-apprenticeship (github.com)
- Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study (arxiv.org)
- OmniAct: Advancing Omnimodal Embodied Agents from Isolated Skills to Everyday Physical Autonomy (arxiv.org)
- cellebrite-labs/ghidra-rpc — Agentic Reverse Engineering Skill for Ghidra (github.com)
- raiyanyahya/recall — Durable Offline Memory for Claude Code (github.com)
- [Editorial] Agentic Rust Optimizer (github.com)
- [Editorial] Nexu Open Design (github.com)
- Hierarchos: Preliminary Findings From a 232M Recurrent Memory-Augmented Assistant Model (old.reddit.com)
- DiScoFormer: One transformer for density and score, across distributions (huggingface.co)
- Mapping Local Nodes — Visualizing Model Activation Paths (old.reddit.com)
- Pine64 launch $50 smart speaker for Home Assistant tinkerers (omgubuntu.co.uk)
- Tiny Jetson Nano Orin Super Benchmarking of 1B and sub 1B LLMs — llama.cpp vs Ollama (old.reddit.com)
- nvidia/Cosmos3-Nano (huggingface.co)
- The Underhanded C Contest (underhanded-c.org)