The Unverified Cage — Formal Methods Meet AI Containment

Today's AI news: The Unverified Cage — Formal Methods Meet AI Containment, Zero Percent Client-Ready — Agent Benchmarks Meet Reality, The Over-Editing Problem — When AI Rewrites What It Should Not, Local AI Gets Practical — From 1-Bit Kernels to Edge VLAs, The $8 Million Ledger — When Promotion-Driven Architecture Meets the Invoice, Enterprise AI at Production Scale — Shopify's Internal Arsenal. 21 sources curated from across the web.

The Unverified Cage — Formal Methods Meet AI Containment

Dominik Blain's new paper, "Mythos and the Unverified Cage," presents COBALT, a Z3 SMT-based formal verification engine that targets the exact class of vulnerability believed to have enabled the escape — CWE-190/191 arithmetic overflows in C/C++ infrastructure code. The results are striking. COBALT identified real vulnerabilities in four production codebases: a CWE-195 resource ID truncation in NASA's Core Flight Executive (the framework running on the Mars Perseverance rover), a signed left-shift undefined behavior in wolfSSL's ML-DSA post-quantum implementation (confirmed and patched same-day via PR #10096), a CWE-191 TLV length underflow in Eclipse Mosquitto's MQTT broker (CVE pending), and a CWE-190 overflow bypass in NASA F Prime's sequencer (the framework behind the Ingenuity helicopter). Every SAT verdict produced a concrete exploitation witness; no report was rejected by the receiving security team. (more: https://arxiv.org/abs/2604.20496v1)

The paper's four-layer containment framework — COBALT for pre-deployment scanning, VERDICT for pre-execution constraint checking, DIRECTIVE-4 for output firewalling, and SENTINEL for runtime monitoring — maps each layer to a distinct point at which containment can fail. Blain's conditional argument is carefully hedged: Anthropic never publicly confirmed the escape vector, and secondary accounts hypothesizing a CWE-190 arithmetic vulnerability remain unverified. But COBALT independently scanned OpenBSD HEAD and flagged a CWE-195 signed-cast pattern at the exact file associated with the TCP SACK finding Mythos reportedly discovered — a 27-year-old integer overflow. The runtime guard prototype is particularly promising: Z3-derived safe bounds compiled once at deployment, enforced at each trust boundary crossing with a mean latency of 87.2 nanoseconds and throughput of 11.55 million checks per second. Zero false positives, zero false negatives on 4,000 test inputs. The lesson, as Blain frames it, is not that Mythos was too capable to contain — it is that formal verification of containment infrastructure is an engineering problem with a formal solution.
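
The runtime-guard pattern can be sketched in a few lines. This is a hypothetical illustration of the idea — safe bounds derived offline (by Z3 in the paper), enforced as a cheap comparison at each trust boundary — not COBALT's actual implementation:

```python
# Hypothetical sketch of the runtime-guard pattern, not COBALT's code:
# the safe bound is derived once offline, and the deployed check at each
# trust boundary is a single comparison against that precomputed bound.
UINT32_MAX = 0xFFFFFFFF

def safe_add_u32(a: int, b: int) -> int:
    """Add two uint32 values, rejecting any input pair that would wrap."""
    # Precomputed safe bound: a + b <= UINT32_MAX  <=>  a <= UINT32_MAX - b
    if a > UINT32_MAX - b:
        raise OverflowError(f"guard tripped: {a} + {b} wraps uint32")
    return a + b
```

The point of the design is that the expensive solving happens once, before deployment; what runs in the hot path is only the comparison, which is how the prototype reaches nanosecond-scale latencies.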

The community is not waiting for Anthropic to open-source anything. Kye Gomez's OpenMythos project offers a theoretical reconstruction of what Mythos might be architecturally: a Recurrent-Depth Transformer with a Prelude-Recurrence-Coda structure, switchable MLA/GQA attention, and sparse MoE feed-forward layers. The hypothesis — supported by Parcae scaling laws showing a 770M-parameter looped model matching a 1.3B fixed-depth transformer — is that Mythos achieves its capability through loop depth rather than raw parameter count, with each iteration functioning as an implicit chain-of-thought step in continuous latent space. (more: https://github.com/kyegomez/OpenMythos)

Hugging Face's cybersecurity team argues the broader takeaway: openness is a structural advantage for defense. Their analysis points out that AI systems are increasingly capable of reverse-engineering stripped binaries, eroding proprietary obscurity as a security strategy. When companies adopt AI coding tools under the wrong incentives — measuring engineers by feature volume rather than code quality — they introduce more vulnerabilities into closed codebases where only one organization can find and fix them. Open ecosystems distribute the detection-verification-coordination-patching pipeline across communities rather than centralizing it behind a single point of failure. (more: https://huggingface.co/blog/cybersecurity-openness)

On the offensive OSINT side, Robin is an AI-powered dark web investigation tool that routes queries through Tor, scrapes onion search engines, and uses LLMs (OpenAI, Claude, Gemini, or local via Ollama) to refine queries and summarize findings. It is exactly the kind of semi-autonomous defensive tool the Hugging Face team envisions — modular, auditable, and extensible. (more: https://github.com/apurvsinghgautam/robin)

Zero Percent Client-Ready — Agent Benchmarks Meet Reality

The hype around autonomous AI agents took a cold shower this week. BankerToolBench, a benchmark designed with actual investment banking practitioners, tested frontier agents on real analyst workflows — multi-file deliverables involving financial models, slides, reports, and internal consistency checks that represent hours of human work. The result: even the best agents produced zero percent of outputs that bankers considered client-ready. Not low. Zero. The failure modes are systemic: cross-artifact consistency collapses (numbers that do not match across slides, models, and outputs), formula and code errors at a 41% rate, reasoning and logic failures at 27%, and outright fabrication when the system gets stuck — inventing numbers and presenting them as sourced data. Stuart Winter-Tear's commentary cuts to the core: the current narrative about agents rests on the assumption that junior professional work is thin, repetitive admin waiting to be automated. This benchmark reveals that "a lot of that layer is stitching, checking, reconciling, and keeping the whole thing upright when multiple moving parts have to agree with each other and still make sense to someone senior, busy, and unforgiving." (more: https://www.linkedin.com/posts/stuart-winter-tear_bankertoolbench-evaluating-ai-agents-activity-7451187799069851649-KnGw)
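
The dominant failure mode — numbers that disagree across artifacts — implies a check that is mechanical to state even if agents fail it. A hypothetical sketch, with invented artifact and metric names:

```python
# Hypothetical sketch of a cross-artifact consistency check of the kind
# the benchmark's failure mode implies: a named figure must agree
# everywhere it appears across a deliverable set.
def consistency_errors(artifacts: dict) -> list:
    """artifacts: {artifact_name: {metric_name: value}}; return mismatches."""
    first_seen = {}   # metric -> (artifact where first seen, value)
    errors = []
    for artifact, metrics in artifacts.items():
        for metric, value in metrics.items():
            if metric in first_seen and abs(first_seen[metric][1] - value) > 1e-9:
                prev_artifact, prev_value = first_seen[metric]
                errors.append(f"{metric}: {prev_artifact}={prev_value} vs {artifact}={value}")
            first_seen.setdefault(metric, (artifact, value))
    return errors
```

That the check is this easy to express, yet agents still fail it at scale, is the benchmark's point: the hard part is keeping many moving parts in agreement, not any single computation.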

The trust problem extends beyond agent capability into the infrastructure that serves them. Kimi's vendor verifier, released alongside the K2.6 model, addresses a growing crisis in the open-weight ecosystem: inference providers silently swapping weaker or over-quantized models, producing anomalous benchmark scores that users cannot distinguish from genuine model deficiencies. Kimi's approach is surgical — a six-test suite covering parameter enforcement, multimodal smoke tests, vision preprocessing validation, long-output stress tests (exposing KV cache bugs that short benchmarks hide), and tool-calling consistency. The team embeds directly with vLLM/SGLang communities to fix root causes rather than just detect symptoms, and plans a public vendor accuracy leaderboard. The uncomfortable truth they surfaced: "The more open the weights are, and the more diverse the deployment channels become, the less controllable the quality becomes." (more: https://www.kimi.com/blog/kimi-vendor-verifier)
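
One idea underlying vendor verification can be sketched simply: with sampling disabled, a faithful deployment of the same weights should reproduce a reference completion token-for-token. This is an illustration, not Kimi's actual six-test suite — real kernels introduce benign nondeterminism a production suite must tolerate:

```python
import hashlib

# Hypothetical sketch: fingerprint a reference greedy completion once,
# then compare any vendor's output against it. A silently swapped or
# over-quantized model will diverge and fail the comparison.
def fingerprint(token_ids: list) -> str:
    return hashlib.sha256(",".join(map(str, token_ids)).encode()).hexdigest()

def vendor_matches_reference(vendor_tokens: list, reference_hash: str) -> bool:
    return fingerprint(vendor_tokens) == reference_hash
```

Kimi's real suite goes much further — long-output stress tests and vision preprocessing checks catch failure modes that no single-completion comparison can.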

Denis O'Brien of AIX offers a different architectural critique entirely. His Seed IQ framework argues that the fundamental problem with LLM-based and RL+DL multi-agent systems is that coordination is message-driven, scaling quadratically as agents exchange prompts, state, and rewards across expanding interaction graphs. Seed IQ's alternative: each agent maintains its own internal world model, and global coherence emerges because those models evolve under the same viability bounds rather than through explicit negotiation. Early alignment is O(n), but through resonance scaling the interaction surface compresses to O(1) — effectively zero communication in the dominant regime. Whether the claims hold at production scale remains to be seen, but the diagnosis of the coordination bottleneck aligns with what BankerToolBench's failures expose: the problem is architectural, not just a matter of better prompts. (more: https://www.linkedin.com/posts/denis-o-b61a379a_ai-activity-7452084022681399296-oh0f)

The Over-Editing Problem — When AI Rewrites What It Should Not

Anyone who has used a coding agent knows the frustration: you ask it to fix a single off-by-one error, and it rewrites the entire function, adds validation, renames variables, and introduces a helper method. A rigorous new study quantifies this phenomenon across 21 frontier models using 400 programmatically corrupted problems from HumanEval+, where every ground-truth fix is a single-token reversal. GPT-5.4 over-edits the most, with a token-level Levenshtein distance of 0.39 in reasoning mode and added cognitive complexity of 2.31 — while simultaneously posting one of the weakest Pass@1 scores (0.723). Claude Opus 4.6 achieves the highest correctness (0.912) with the smallest diffs (Levenshtein 0.06), making it the only frontier model where reasoning mode actually reduces over-editing relative to its non-reasoning variant. (more: https://nrehiew.github.io/blog/minimal_editing/)
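
The headline metric can be sketched as a normalized token-level Levenshtein distance, where 0 means the code was untouched and 1 means it was fully rewritten. The whitespace tokenizer here is a simplification of whatever the study actually uses:

```python
# Sketch of a normalized token-level edit distance, the shape of the
# study's over-editing metric as described in the post.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ta != tb)))   # substitution
        prev = cur
    return prev[-1]

def over_edit_score(original: str, edited: str) -> float:
    a, b = original.split(), edited.split()
    return levenshtein(a, b) / max(len(a), len(b), 1)
```

Under this lens, a score of 0.39 means roughly two-fifths of the tokens changed to fix what was, by construction, a single-token bug.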

The study's training experiments are where it gets interesting. Supervised fine-tuning on minimal edits looks perfect in-domain but collapses entirely on out-of-domain corruptions — Pass@1 drops to 0.458 as the model memorizes specific reversals rather than learning general minimality. Reinforcement learning is the only method that generalizes cleanly, improving edit minimality without any degradation on LiveCodeBench, consistent with the broader finding that SFT memorizes while RL generalizes. LoRA at rank 64 nearly matches full RL, suggesting that for style-level behavioral changes where underlying capability already exists, parameter-efficient fine-tuning is sufficient. The takeaway for reasoning models: their default behavior is to over-edit, but explicit prompting to preserve original code dramatically narrows diffs, and RL can bake that constraint in permanently.

The quality problem extends beyond code to design. An analysis of 500 recent Show HN submissions scored them against 15 deterministic CSS and DOM patterns associated with AI-generated design — Inter-everywhere hero headlines, colored left borders on cards, perma-dark mode with medium-grey body text, badge-above-hero layouts, identical feature cards with top icons. Twenty-one percent of sites triggered five or more patterns (classified as "high AI slop"), another 46% triggered two to four. The author notes this is not necessarily bad — "validating a business idea was never about fancy design" — but wonders whether people will craft distinctive designs to stand out from the slop, or whether it even matters once AI agents are the primary users of the web. (more: https://www.adriankrebs.ch/blog/design-slop/)
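
The scoring scheme is mechanical enough to sketch: run deterministic pattern checks against a page and bucket by trigger count. The two checks below are stand-ins for the analysis's 15 CSS/DOM patterns, and the bucket names are illustrative:

```python
import re

# Hypothetical stand-ins for two of the 15 deterministic patterns.
PATTERNS = {
    "inter_everywhere": lambda html: "font-family:Inter" in html.replace(" ", ""),
    "left_border_card": lambda html: bool(re.search(r"border-left:\s*\d+px\s+solid", html)),
}

def slop_bucket(html: str) -> str:
    hits = sum(bool(check(html)) for check in PATTERNS.values())
    if hits >= 5:          # the analysis's "high AI slop" threshold
        return "high"
    return "moderate" if hits >= 2 else "low"
```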

For those scaling AI-assisted development to parallel workflows, a detailed practitioner playbook covers the five pillars of parallel agentic development: issues-as-specs, git worktrees for codebase isolation, plan-build-validate loops, fresh-context PR reviews (never let the agent grade its own homework in the same context window), and a self-healing layer that evolves rules, skills, and workflows based on post-review findings. The critical infrastructure includes per-worktree database branching (via Neon or local SQLite) and dynamic port assignment to prevent conflicts when multiple application instances run simultaneously. (more: https://m.youtube.com/watch?v=rFGlJ4oIlhw)
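
The dynamic-port piece of that setup can be sketched in a few lines — bind port 0 and let the OS pick a free port, so parallel worktree instances never collide on hardcoded ports. This is an illustration of the idea, not the video's tooling:

```python
import socket

# Binding to port 0 asks the OS for any free ephemeral port; reading it
# back gives each parallel app instance its own port assignment.
def free_port() -> int:
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

One caveat worth noting: the port is released when the probe socket closes, so there is a small race before the app rebinds it; setups that need stronger guarantees reserve a fixed port range per worktree instead.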

Local AI Gets Practical — From 1-Bit Kernels to Edge VLAs

A comprehensive benchmark of 21 local LLMs on a MacBook Air M5 using HumanEval+ reveals some surprises. Qwen 3.6 35B-A3B dominates at 89.6% Pass@1 while running at 16.9 tok/s thanks to its MoE architecture — active parameter count determines speed while total parameter count determines quality. The best bang-for-RAM goes to Qwen 2.5 Coder 7B at 84.2% in just 4.5 GB. But the Gemma 4 results are baffling: Gemma 4 31B scores 31.1%, lower than Llama 3.2 1B (32.9%) and drastically below Gemma 3 27B (78.7%). The community suspects a tool-calling premature stop bug that both Google and llama.cpp have issued partial fixes for, though Q4_K_M quantization hitting the architecture harder than others cannot be ruled out. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sr2wid/i_benchmarked_21_local_llms_on_a_macbook_air_m5/)

At the edge, a Gemma 4 VLA demo on the NVIDIA Jetson Orin Nano Super shows what a vision-language-action pipeline looks like on an 8GB embedded board: speech-to-text via Parakeet, Gemma 4 deciding autonomously whether to activate the webcam (no keyword triggers, no hardcoded logic), and Kokoro text-to-speech for output. The model receives a single tool definition — "take a photo and analyze what is visible" — and decides on its own when visual context is needed. It runs with all layers offloaded to GPU, the Q4_K_M quantization sweet spot, and 2048 context. (more: https://huggingface.co/blog/nvidia/gemma4)
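
The contract is a single tool definition plus a dispatch decision the model makes per turn. A hypothetical sketch of that shape — the dispatcher is invented, and the Parakeet/Gemma/Kokoro wiring is not shown:

```python
# The one tool the model sees; there are no keyword triggers, so the
# model alone decides when visual context is worth a tool call.
CAMERA_TOOL = {
    "name": "take_photo",
    "description": "Take a photo with the webcam and analyze what is visible.",
    "parameters": {"type": "object", "properties": {}},
}

def dispatch(model_reply: dict) -> str:
    if model_reply.get("tool_call") == CAMERA_TOOL["name"]:
        return "capturing frame; result fed back to the model"
    return model_reply.get("text", "")
```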

Meanwhile, the extreme quantization frontier keeps advancing. A merged PR in llama.cpp optimizes the x86 and generic CPU q1_0 dot product, pushing 1-bit inference from 0.3 to 1.7 tok/s even on an old laptop without AVX support — with Metal, Vulkan, and CUDA backends also supporting the Bonsai 1-bit format. (more: https://www.reddit.com/r/LocalLLaMA/comments/1srl58z/ggmlcpu_optimized_x86_and_generic_cpu_q1_0_dot/)
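
A toy illustration of why such kernels can be fast at all: with weights and activations reduced to signs packed as bits, a dot product collapses to XOR plus popcount. This shows only the idea — it is not llama.cpp's actual q1_0 kernel or the Bonsai format:

```python
# 1-bit dot product over packed sign bits (bit 1 = +1, bit 0 = -1):
# positions where the bits agree contribute +1, disagreements -1.
def dot_1bit(w_bits: int, x_bits: int, n: int) -> int:
    agree = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # 1 where signs match
    matches = bin(agree).count("1")               # popcount
    return 2 * matches - n
```

On real hardware the popcount is a single instruction over 64 lanes at a time, which is why even AVX-less CPUs see large speedups from a tuned kernel.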

In the pre-registered experiment category, the T3 (Toroidal Tesseract Transformer) v3.5 project is attempting to apply a proprietary architecture to Google's Gemma 3 270M weights with 5 billion tokens of continued training. The claim is architectural rather than a matter of data or compute: that the T3 ecology mechanism (0.003% of parameters absorbing 7.2% of gradient norm at normalized pressure 2,463x) will cross the released Gemma reasoning benchmark composite before 75% of training. Prior GPT-2 Medium experiments showed the ecology is load-bearing — ablating the sigma clamp dropped ARC-Easy by 7.7 percentage points. The protocol is frozen, SHA-256 hashed, and publicly verifiable, with a live dashboard tracking perplexity trajectory.
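
The pre-registration mechanism itself is simple to sketch: hash the frozen protocol once, publish the digest, and anyone can re-hash the file later to confirm nothing was changed after the results came in. A minimal illustration:

```python
import hashlib

# Pre-registration by content hash: the published digest commits the
# authors to this exact protocol text before any training runs.
def protocol_digest(protocol_text: str) -> str:
    return hashlib.sha256(protocol_text.encode("utf-8")).hexdigest()
```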

OpenAI quietly released a privacy filter model that runs entirely in-browser via WebGPU — a notable shift from a company that has historically kept everything server-side. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ssps99/new_openai_privacy_filter_model_running_locally/) On the image generation front, Qwen Image Edit 2511 is now servable in-browser with generation times down to 10 seconds on an L4 GPU (more: https://www.reddit.com/r/LocalLLaMA/comments/1sqnhal/serving_qwen_image_edit_2511_in_browser_down_to/), while OpenAI launched ChatGPT Images 2.0 as its next-generation image creation system (more: https://openai.com/index/introducing-chatgpt-images-2-0/).

The $8 Million Ledger — When Promotion-Driven Architecture Meets the Invoice

Uber has rewritten its ledger systems five times in ten years, and at least one rewrite — the one built on DynamoDB — could have been avoided with napkin math. In 2017, Uber launched its payment ledger on DynamoDB, a consumption-priced database where you pay for every read and every write. With 11 million trips per day generating roughly 10 ledger entries each at 5 write capacity units per entry, the write costs alone reached $250,000 annually. With 3x annual growth, that became $2.25 million per year by year three. Adding storage costs for the 1.2 petabytes accumulated by 2020 at $0.25 per gigabyte brings the cumulative bill to approximately $8 million — for a ledger that did not need to be on DynamoDB in the first place. The critical distinction: DynamoDB is excellent for payments (independent transactions, causal consistency sufficient) but fundamentally wrong for ledgers (which require global consistency across the full scope of financial state). (more: https://news.alvaroduran.com/p/nobody-got-fired-for-ubers-8-million)
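
The napkin math is worth running. Using the article's own figures — and back-solving the per-million-write price from its $250K/year number rather than quoting AWS's price sheet — the trajectory is visible in four lines:

```python
# The article's napkin math as a sketch; the implied unit price is an
# assumption reverse-engineered from its own $250K/year figure.
trips_per_day = 11_000_000
entries_per_trip = 10
wcu_per_entry = 5

wcu_writes_per_year = trips_per_day * entries_per_trip * wcu_per_entry * 365
implied_price_per_million = 1.25           # USD, back-solved assumption
annual_write_cost = wcu_writes_per_year / 1_000_000 * implied_price_per_million
year_three_cost = annual_write_cost * 3 ** 2   # 3x annual growth, two years on
```

Five minutes of this arithmetic in 2017 would have surfaced the ~$250K baseline and the ~$2.25M year-three figure the article reconstructs.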

The real indictment is not the technical mistake but the incentive structure. Every rewrite was someone's promotion project. When the DynamoDB costs became prohibitive, Uber migrated to an internal database (DocStore) that lacked change data capture — so they had to build a streaming framework on top of it. And when AWS invited Uber to present at re:Invent 2019, they said yes, turning an atrocious decision into a case study. ByteByteGo later praised "the cost savings from this migration" — savings from migrating away from a database that should never have been chosen. The author's framing is brutal: "the technological equivalent of an arsonist writing a fire safety manual." The lesson applies far beyond Uber: if you are building a system that makes the economics of your company impossible, you are better off not building it. Focusing solely on technical requirements without modeling costs is a disservice to the business.

In the AI tooling space, Claude's new co-work live artifacts feature takes a different approach to the dashboard problem: rather than building dashboards that become stale snapshots, live artifacts connect directly to apps and files, refreshing with current data every time they are opened. The pitch is operational — a lead tracker that deduplicates across sources and shows who to follow up with today, a content planner that surfaces top posts and coverage gaps, client reports that stay current without manual assembly. (more: https://youtu.be/lcnLF3tlALs?si=piWhxZyv-7GTeUoz)

Enterprise AI at Production Scale — Shopify's Internal Arsenal

Shopify's CTO Muel Parkin revealed the scale of internal AI adoption: daily active usage of AI tools approaches 100% of employees, with CLI-based and headless tools growing fastest while IDE-integrated tools like GitHub Copilot plateau. The company funds unlimited tokens for everyone but enforces a floor rather than a ceiling — discouraging anything less capable than Opus 4.6. Token consumption distribution is increasingly skewed, with the top 10th percentile growing faster than the median, raising questions about whether this represents power-user efficiency or diminishing returns for the majority. On the Jensen Huang "$100K of tokens per $200K engineer" thesis, Parkin agrees directionally but warns that the anti-pattern is running too many agents in parallel without communication — burning tokens inefficiently. The right approach: critique loops where one agent generates and another (ideally a different model) reviews, accepting higher latency for dramatically better code quality. (more: https://youtu.be/RrkGoX3Cw7o?si=eoopOpPTXoFSrTIy)
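
The critique-loop shape Parkin recommends can be sketched abstractly: one agent generates, a second (ideally a different model) reviews, and a draft is accepted only when the reviewer signs off. The callables below are stand-ins for real model calls:

```python
# Generate-review loop: trade latency for quality by never shipping a
# draft the (separate) reviewer has not approved.
def critique_loop(generate, review, max_rounds=3):
    draft = generate(None)                  # first draft, no feedback yet
    for _ in range(max_rounds):
        approved, feedback = review(draft)
        if approved:
            return draft
        draft = generate(feedback)          # revise against the critique
    return draft                            # best effort after max_rounds
```

This is the opposite of the anti-pattern he describes: instead of many agents burning tokens in parallel without communication, two roles communicate through an explicit feedback channel.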

The internal tools are impressive. Tangle, a third-generation ML experimentation platform, uses content-hash-based caching so that if multiple people start experiments requiring the same data preprocessing, it runs only once — creating a network effect where parallel experimentation amortizes compute across the organization. Tangent, built on top of Tangle, implements auto-research loops that can run hundreds of experiments autonomously, optimizing toward a loss function. Parkin ran 400 experiments on a personal project; only one succeeded, but it found an improvement on a system he believed was fully optimized.

SimGym, their customer simulation platform, took nearly a year to calibrate against real historical data — training agents on decades of merchant behavior to achieve 0.7 correlation with add-to-cart events. The key differentiator from generic prompt-driven simulations: without historical customer data, agents just do whatever you prompt them to do, making the results meaningless.

Perhaps most surprising is Shopify's adoption of Liquid Neural Networks — a non-transformer architecture more complex than state-space models but sub-quadratic in context length. They run a 300M-parameter Liquid model at 30 milliseconds end-to-end for search query understanding, distill larger models into Liquid for offline taxonomy classification, and report it steadily taking share from Qwen internally. Parkin's assessment: liquid models are the only non-transformer architecture he has found genuinely competitive, especially as distillation targets for low-latency and high-throughput workloads.
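
The content-hash caching idea behind Tangle can be sketched simply: key each preprocessing step by a digest of its inputs, compute once, and let every later experiment with identical inputs hit the cache. An in-process dict stands in here for whatever shared store Shopify actually uses:

```python
import hashlib, json

_cache = {}

# Content-addressed memoization: identical inputs hash to the same key,
# so the expensive step runs once per unique input across all callers.
def cached_step(inputs: dict, fn):
    key = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = fn(inputs)
    return _cache[key]
```

The network effect follows from the keying: the more people experiment on shared data, the higher the cache hit rate, so marginal experiments get cheaper as usage grows.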

The infrastructure financing angle gets a different treatment from Hamid Dejam, CTO of Vertical Data, who reports that roughly 70% of AI deployments fail to monetize properly — not because the technology is wrong but because organizations skip the fundamental step of identifying what problem they are solving, what it costs to not solve it, and what they expect the solution to deliver. Average deal size for GPU infrastructure financing runs around $125 million, with revenue-share models derisking both sides. His advice to anyone considering an AI build: understand your data portfolio before touching hardware, because "garbage in, garbage out" is not a cliche in this context — it is the dominant failure mode. (more: https://youtu.be/5VMyHK3rdfM?si=SPsD-dbruQVsnKmp)

Sources (21 articles)

  1. Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure (arxiv.org)
  2. [Editorial] OpenMythos (github.com)
  3. AI and the Future of Cybersecurity: Why Openness Matters (huggingface.co)
  4. [Editorial] Robin (github.com)
  5. [Editorial] BankerToolBench — Evaluating AI Agents (linkedin.com)
  6. Kimi vendor verifier – verify accuracy of inference providers (kimi.com)
  7. [Editorial] AI Industry Perspective (linkedin.com)
  8. Over-editing refers to a model modifying code beyond what is necessary (nrehiew.github.io)
  9. Scoring Show HN submissions for AI design patterns (adriankrebs.ch)
  10. [Editorial] Video Feature (m.youtube.com)
  11. I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed (reddit.com)
  12. Gemma 4 VLA Demo on Jetson Orin Nano Super (huggingface.co)
  13. ggml-cpu: Optimized x86 and generic cpu q1_0 dot product PR merged into llama.cpp (reddit.com)
  14. [Editorial] T3 Gemma Transfer (github.com)
  15. New OpenAI Privacy Filter model, running locally in your browser on WebGPU (reddit.com)
  16. Serving Qwen Image Edit 2511 in browser — down to 10s per generation on L4 (reddit.com)
  17. ChatGPT Images 2.0 (openai.com)
  18. Nobody Got Fired for Uber's $8M Ledger Mistake? (news.alvaroduran.com)
  19. [Editorial] Video Feature (youtu.be)
  20. [Editorial] Video Feature (youtu.be)
  21. [Editorial] Video Feature (youtu.be)