AI Security at Machine Speed
Today's AI news: AI Security at Machine Speed, The Local Inference Arms Race, Agents Learn to Spawn Themselves, GPU Hoarding and Counting What AI Actually Does, The Claude Ecosystem Matures, When Helpfulness Becomes a Liability. 22 sources curated from across the web.
AI Security at Machine Speed
Mozilla just published the most detailed account yet of what happens when you point a frontier AI model at a 20-year-old codebase and tell it to find bugs. The numbers are staggering: 271 vulnerabilities discovered by Claude Mythos Preview in Firefox 150 alone, with 180 rated sec-high. Across all April 2026 releases, Mozilla shipped fixes for 423 security bugs — a volume that required over 100 engineers writing patches, triaging, and managing the release pipeline. The disclosed sample bugs span a jaw-dropping range: a 15-year-old use-after-free in the <legend> element triggered by orchestrating recursion stack depth limits and cycle collection; a 20-year-old XSLT bug where reentrant key() calls cause a hash table rehash that frees its backing store while a raw pointer is still live; IPC race conditions enabling sandbox escapes via IndexedDB refcount manipulation; and an HTML table rowspan=0 overflow bypassing a 16-bit layout bitfield that fuzzers missed for years. (more: https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox)
The methodology matters as much as the results. Mozilla's pipeline evolved from early static analysis experiments with GPT-4 and Sonnet 3.5 — which suffered crippling false-positive rates — into an agentic harness that dynamically generates and runs proof-of-concept test cases. They parallelized jobs across ephemeral VMs, each assigned a target file, with results written to a shared bucket. The inner loop stayed remarkably simple: "there is a bug in this part of the code, please find it and build a testcase." Their advice to every software project: start now, iterate on prompts, and build the pipeline so you're ready when the next model drops. Perhaps most telling, the models repeatedly attempted prototype pollution escapes in the privileged parent process — and were consistently blocked by Firefox's prior decision to freeze prototypes by default. Defense-in-depth actually working as intended.
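Mozilla hasn't published the harness itself, but the fan-out shape is simple enough to sketch. A minimal illustration with placeholder `ask_model` and `run_testcase` stubs standing in for the real model call and the ephemeral-VM test run:

```python
import concurrent.futures
import json
import pathlib

PROMPT = "there is a bug in this part of the code, please find it and build a testcase"

def ask_model(prompt: str, context: str) -> dict:
    # Placeholder: call your frontier model of choice here.
    return {"summary": "stub finding", "testcase": "print('poc')"}

def run_testcase(code: str) -> bool:
    # Placeholder: execute the PoC in an isolated VM and check for a crash.
    # Actually running the testcase is what filters out the false positives
    # that sank the early static-analysis attempts.
    return False

def analyze_file(target: str) -> dict:
    """One job per target file: find a candidate bug, then try to reproduce it."""
    candidate = ask_model(PROMPT, context=pathlib.Path(target).read_text())
    return {"file": target,
            "finding": candidate["summary"],
            "reproduced": run_testcase(candidate["testcase"])}

def run_sweep(targets: list[str], out_dir: str = "results") -> None:
    """Fan out across workers; each result lands in shared storage."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        for result in pool.map(analyze_file, targets):
            name = pathlib.Path(result["file"]).name + ".json"
            (out / name).write_text(json.dumps(result, indent=2))
```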
On the offensive side, Moak AI has built what may be the first autonomous CVE exploitation pipeline that works at production scale. Five specialized agents — Collector, Researcher, Builder, Exploiter, and Judge — take a CVE number as input and output a working exploit validated against a dockerized environment. The Researcher spawns a multi-model sub-agent swarm combining Claude, GPT, and Gemini that rotate roles between runs to reduce groupthink. Against 178 Known Exploited Vulnerabilities published after all models' knowledge cutoffs, the pipeline successfully exploited 174 — a 97.8% success rate. It now runs live against newly disclosed CVEs, providing real-time exploitability assessment within hours of disclosure. The implications are uncomfortable: the disclosure-to-weaponization timeline is collapsing to hours while most defenders still operate on quarterly patch cycles. (more: https://moak.ai/#how-it-works)
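Moak hasn't released code and the pipeline's internals are only described at a high level, but the anti-groupthink trick generalizes to any multi-model swarm: keep the role labels fixed and rotate which model fills them on each run. A hypothetical sketch (the sub-agent role names are invented for illustration):

```python
MODELS = ["claude", "gpt", "gemini"]            # the swarm's model pool
ROLES = ["hypothesize", "critique", "verify"]   # invented sub-agent roles

def role_assignment(run_index: int) -> dict[str, str]:
    """Rotate model-to-role bindings each run so no single model's
    biases dominate any one step of the research loop."""
    shift = run_index % len(MODELS)
    rotated = MODELS[shift:] + MODELS[:shift]
    return dict(zip(ROLES, rotated))

for run in range(3):
    print(run, role_assignment(run))
# 0 {'hypothesize': 'claude', 'critique': 'gpt', 'verify': 'gemini'}
# 1 {'hypothesize': 'gpt', 'critique': 'gemini', 'verify': 'claude'}
# 2 {'hypothesize': 'gemini', 'critique': 'claude', 'verify': 'gpt'}
```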
The security problems aren't limited to legacy code. Researchers at RedAccess analyzed thousands of web applications built with the AI coding tools Lovable, Replit, Base44, and Netlify, finding over 5,000 with virtually no security or authentication. Around 2,000 exposed what appeared to be sensitive data — hospital work assignments with doctor PII, corporate strategy presentations, customer chatbot logs with full names and contact information, cargo shipping records. Some exposed apps even let visitors gain administrative privileges. Security researcher Dor Zvi compared the wave to the S3 bucket epidemic, but noted a more fundamental problem: these tools let anyone in an organization "generate an app without going through any development cycle or any security check." (more: https://www.wired.com/story/thousands-of-vibe-coded-apps-expose-corporate-and-personal-data-on-the-open-web)
The attack surface extends to the AI infrastructure itself. Ollama, the popular local inference server, has a critical unauthenticated memory leak vulnerability — dubbed "Bleeding Llama" — that affects instances running with default configurations (more: https://www.reddit.com/r/LocalLLaMA/comments/1t4zhh4/bleeding_llama_critical_unauthenticated_memory/). And if all this weren't enough, fresh Linux kernel vulnerabilities including Copy Fail 2 have prompted at least one security researcher to suggest a moratorium on installing new software for a week, warning that "right now would be one of the best times for a supply chain attack via NPM to hit hard" (more: https://xeiaso.net/blog/2026/abstain-from-install/).
The Local Inference Arms Race
The race to squeeze more tokens per second out of consumer hardware is accelerating on every front — runtimes, engines, and quantization techniques are all evolving simultaneously. The biggest immediate win for most users is Multi-Token Prediction (MTP) support landing for Qwen 3.6 27B in llama.cpp. A community member converted the model with the relevant PR, uploaded GGUF quants, and published a comprehensive hardware guide: on an M2 Max 96GB, MTP delivers a 2.5x speed increase to 28 tok/s. Combined with 8-bit KV cache quantization, a 48GB Mac can now run the full 262K native context window. The post includes detailed tables for every Apple Silicon and NVIDIA tier from 12GB to 96GB, with quant recommendations, maximum context sizes, and vision support flags — the kind of practical guide that actually gets people running models instead of just benchmarking them. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t57xuu/25x_faster_inference_with_qwen_36_27b_using_mtp/)
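The 262K-on-48GB claim is easy to sanity-check with back-of-envelope KV-cache arithmetic: going from fp16 to 8-bit halves the cache footprint. The model dimensions below are illustrative placeholders, not Qwen 3.6 27B's published config:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, bytes_per_elt: float) -> float:
    """Approximate KV cache size: one K and one V vector per layer,
    per KV head, per token position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elt
    return total_bytes / 2**30

# Hypothetical dims for a ~27B model with grouped-query attention.
args = dict(n_layers=48, n_kv_heads=8, head_dim=128, context=262_144)
print(f"fp16 KV: {kv_cache_gib(**args, bytes_per_elt=2):.0f} GiB")  # ~48 GiB
print(f"q8_0 KV: {kv_cache_gib(**args, bytes_per_elt=1):.0f} GiB")  # ~24 GiB
```

Under those assumed dimensions, the full-precision cache alone would swallow a 48GB machine, while the 8-bit version leaves room for the quantized weights: exactly the regime the guide describes.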
On the engine side, two purpose-built projects are challenging the llama.cpp generalist approach with model-specific optimization. Atlas, now open source, is a pure Rust + CUDA inference engine with hand-tuned kernels for NVIDIA's Blackwell SM120/121 architecture. On a single DGX Spark (GB10), it hits 130 tok/s peak on Qwen3.5-35B with NVFP4 and MTP — roughly 3x vLLM's throughput in the team's own tests. The team is working with Spectral Compute on a Strix Halo port and has AMD hardware incoming. Community benchmarks suggest a more modest 17% speed advantage over AWQ in spark-vllm-docker, but it's a genuine new contender in a space where llama.cpp has been the default. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t5p2yv/the_gb10_solution_atlas_is_now_open_source_the/)
Antirez — yes, the Redis creator — takes the opposite approach with ds4.c, a Metal-only inference engine built exclusively for DeepSeek V4 Flash. The thesis: this 284B-parameter MoE model deserves a dedicated engine because its compressed KV cache makes it "a first-class disk citizen," its thinking traces are roughly one-fifth the length of comparable models and proportional to problem complexity, and its million-token context window is practical on local hardware. The 2-bit quantization is deliberately asymmetric: only routed MoE experts are quantized (IQ2_XXS up/gate, Q2_K down) while shared experts and routing weights stay at full precision. On a Mac Studio M3 Ultra with 512GB, Q2 achieves 37 tok/s generation. It ships with both OpenAI and Anthropic-compatible API endpoints and works out of the box with coding agents. Built openly with GPT 5.5 assistance. (more: https://github.com/antirez/ds4)
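The asymmetric scheme reads naturally as a tensor-name routing table where the first match wins. This is a paraphrase of the description above with invented name patterns (ds4's actual tensor naming may differ):

```python
import fnmatch

# Only routed MoE experts get 2-bit treatment; shared experts and
# routing weights stay at full precision (patterns are illustrative).
QUANT_POLICY = [
    ("*.routed_expert.*.up_proj",   "IQ2_XXS"),
    ("*.routed_expert.*.gate_proj", "IQ2_XXS"),
    ("*.routed_expert.*.down_proj", "Q2_K"),
    ("*.shared_expert.*",           "F16"),
    ("*.router.*",                  "F16"),
]

def quant_for(tensor_name: str) -> str:
    for pattern, qtype in QUANT_POLICY:
        if fnmatch.fnmatch(tensor_name, pattern):
            return qtype
    return "F16"  # anything unmatched stays at full precision

print(quant_for("blk.7.routed_expert.42.up_proj"))  # IQ2_XXS
print(quant_for("blk.7.shared_expert.down_proj"))   # F16
```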
DFlash takes a different tack on the speed problem: block diffusion models trained as lightweight speculative drafters for existing LLMs. The project now supports 15+ model families — Qwen3.x, Gemma 4, Kimi K2.5, MiniMax, and the open GPT models — with integrations into vLLM v0.20.1+, SGLang, Transformers, and MLX. Draft models predict up to 15 tokens in parallel, letting the target model verify them in a single forward pass; a sketch of that draft-verify loop follows this paragraph (more: https://github.com/z-lab/dflash). Meanwhile, Heretic 1.3 — the leading open-source tool for removing safety guardrails from language models, now at 20,000 GitHub stars and 13 million model downloads — adds reproducible runs (byte-for-byte identical models from recorded configurations), a built-in benchmarking system based on lm-evaluation-harness, reduced peak VRAM usage, and support for latest-generation architectures including Qwen3.5 and Gemma 4 (more: https://www.reddit.com/r/LocalLLaMA/comments/1t4hwup/heretic_13_released_reproducible_models/).
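The draft-verify mechanic DFlash plugs into is standard speculative decoding: accept drafted tokens until the target disagrees, then take the target's token and resume drafting. A toy greedy-acceptance version (production engines use probabilistic acceptance and one batched verification pass rather than per-token calls):

```python
def speculative_step(draft_block, context, target_next_token):
    """Accept drafted tokens while they match the target's choices;
    on the first mismatch, substitute the target's token and stop."""
    accepted = []
    for tok in draft_block:
        verified = target_next_token(context + accepted)
        if verified != tok:
            accepted.append(verified)  # correction: target overrides the draft
            break
        accepted.append(tok)           # agreement: a drafted token comes for free
    return accepted

# Toy demo: the "target model" just continues an arithmetic sequence.
target = lambda ctx: ctx[-1] + 1
print(speculative_step([1, 2, 9, 4], [0], target))  # [1, 2, 3]: two free tokens, one fix
```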
Agents Learn to Spawn Themselves
A new paper from Carnegie Mellon and Amazon AGI Labs formalizes something the agentic coding community has been doing ad hoc: teaching models to recursively delegate sub-tasks to copies of themselves. Recursive Agent Optimization (RAO) trains a single LLM policy that runs at every node of a dynamically generated execution tree. Unlike hand-designed orchestration, the model itself learns when delegation helps, how to formulate sub-tasks, whether to run children sequentially or concurrently, and how to aggregate their outputs. The training signal is clever: a local node reward combines each agent's own success with a delegation bonus from its children's average performance, weighted by depth-level inverse frequency to prevent deep nodes from dominating gradients. On TextCraft-Synth (a Minecraft-style crafting benchmark), recursive agents trained only on medium tasks achieved 88% success on hard tasks versus 20% for single-agent baselines. On Oolong-Real, a long-context aggregation task requiring 55K+ tokens, a recursive Qwen3-VL-30B trained with only 32K context approached frontier model performance. When tasks decompose cleanly, recursive agents achieve up to 2.5x wall-clock speedup through concurrent sub-agent execution. (more: https://arxiv.org/abs/2605.06639v1)
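The summary above paraphrases the reward rather than reproducing the paper's exact formulation, but the shape is straightforward: a node earns its own success plus a bonus proportional to its children's average reward, scaled by the inverse frequency of its depth so the many nodes at deep levels don't swamp the gradient. A sketch under those assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    success: float                  # this agent's own task success in [0, 1]
    depth: int
    children: list["Node"] = field(default_factory=list)

def node_reward(node: Node, depth_counts: dict[int, int],
                bonus_weight: float = 0.5) -> float:
    """Assumed RAO-style local reward: own success plus a delegation
    bonus from the children's mean reward, downweighted by how many
    nodes share this depth (the inverse-frequency term)."""
    bonus = 0.0
    if node.children:
        bonus = sum(node_reward(c, depth_counts)
                    for c in node.children) / len(node.children)
    return (node.success + bonus_weight * bonus) / depth_counts[node.depth]

tree = Node(0.8, 0, children=[Node(1.0, 1), Node(0.5, 1)])
print(node_reward(tree, depth_counts={0: 1, 1: 2}))  # 0.8 + 0.5 * mean(0.5, 0.25)
```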
The RAO paper's conclusion — "inference-time scaffolds should not merely be designed around models; models should be trained to use them" — aligns with Anthropic's emerging concept of "dreaming" for agents. The idea: agents that review their own behavior between sessions, compress signal from noise, refine delegation patterns, and change future execution based on historical outcomes. This moves beyond simple persistent memory toward what one commentator described as "governed cognition" — structural awareness and traceability over time. Whether this materializes as a product feature or stays at the research stage remains to be seen, but it directly addresses the critique that current agents "remain episodic, stateless between sessions, and don't get better with use" (more: https://www.linkedin.com/posts/reuvencohen_anthropic-introducing-dreaming-for-agents-share-7458132809287798784-WVQF).
Training agents also requires environments to train them in. Qwen's WebWorld series — 32B, 14B, and 8B models fine-tuned from Qwen3 on over one million real-world web interaction trajectories — provides exactly that. The models support long-horizon simulation (30+ steps), multiple state representations (A11y Tree, HTML, XML, Markdown), and chain-of-thought reasoning for transition prediction. Agents trained on WebWorld-synthesized trajectories gained +9.9% on MiniWob++ and +10.9% on WebArena; used for inference-time lookahead search, WebWorld outperformed GPT-5 as a world model (more: https://www.reddit.com/r/LocalLLaMA/comments/1t6c6vs/qwenwebworld_32b14b8b_qwen3_finetune/). And once trained agents need to run safely against real data, Tilde.run — from the team behind lakeFS — offers a transactional versioned filesystem where every agent run is an atomic transaction: on clean exit, changes commit; on failure, nothing changes. Code from GitHub, data from S3, and documents from Drive appear as a single ~/sandbox, with every outbound call policy-checked and logged (more: https://tilde.run/).
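Lookahead with a world model reduces to propose, simulate, score: roll each candidate action forward inside the model instead of against the live site, then commit to the first action of the best imagined trajectory. A generic sketch with stub functions (WebWorld's actual interface isn't shown here):

```python
def lookahead_choose(state, candidate_actions, world_model, policy, value_fn,
                     horizon: int = 3):
    """Simulate each candidate action `horizon` steps deep in the world
    model, score the imagined end state, and return the best first action."""
    best_first, best_value = None, float("-inf")
    for first in candidate_actions:
        sim = world_model(state, first)           # predicted next page state
        for _ in range(horizon - 1):
            sim = world_model(sim, policy(sim))   # policy continues the rollout
        score = value_fn(sim)
        if score > best_value:
            best_first, best_value = first, score
    return best_first

# Toy demo on a 1-D "environment": actions add to state, goal is reaching 10.
wm = lambda s, a: s + a
print(lookahead_choose(0, [2, 5, 8], wm, policy=lambda s: 1,
                       value_fn=lambda s: -abs(10 - s)))  # 8 (8 + 1 + 1 = 10)
```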
GPU Hoarding and Counting What AI Actually Does
Anthropic has secured access to over 220,000 NVIDIA GPUs at SpaceX's Colossus 1 data center, adding more than 300 megawatts of compute capacity within a single month. The deal arrives on top of the multi-gigawatt commitments already documented — 5GW of TPU capacity from Google Cloud, a gigawatt deal with Microsoft, and 900MW via Stargate Abilene. The community reaction splits predictably: optimists see faster responses, higher rate limits, and a signal that a significant model launch is imminent; skeptics note the Musk connection and wonder what's next for Grok. At this point, more than half of all AI data center capacity under construction is being built for just two companies. The strategic read is straightforward — Anthropic is positioning for an inference-heavy future of agents and multimodal features that will consume orders of magnitude more compute than chat. (more: https://www.reddit.com/r/ClaudeAI/comments/1t5j05c/anthropic_just_secured_a_reserve/)
All those GPUs ultimately need to produce measurable business value, and a new CFO-oriented framework proposes four layers of AI measurement that most companies are getting wrong. Layer 1 (Consumption) tracks tokens and compute — where nearly everyone is stuck. Layer 2 (Work) counts what AI did: records updated, workflows triggered, code accepted. Layer 3 (Outcomes) captures verified results: tickets resolved, leads qualified. Layer 4 (Business Impact) measures P&L effects. Salesforce has introduced the Agentic Work Unit (AWU), reporting 2.4 billion to date with 771 million in Q4 alone — but that's still Layer 2. Intercom prices Fin at $0.99 per resolution with reversal logic if the customer returns, putting it cleanly at Layer 3. ServiceNow's CFO cited roughly $100 million in internal headcount savings — a Layer 4 claim more credible because the vendor reports on itself. Microsoft, despite 15 million paid Copilot seats, remains stuck at Layer 1, reporting access and engagement rather than work or outcomes. The practical takeaway: if you only track tokens, a 30% escalation rate that silently destroys gross profit is invisible in your dashboard. (more: https://www.thesaascfo.com/the-four-layers-of-ai-measurement-a-cfos-framework)
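That last point is pure arithmetic, and worth making concrete. The numbers below are invented for illustration, not taken from any vendor's pricing:

```python
def gross_profit_per_ticket(price_per_resolution: float, escalation_rate: float,
                            human_cost_per_escalation: float,
                            compute_cost_per_ticket: float) -> float:
    """Layer-3/4 view of what a Layer-1 dashboard hides: escalated tickets
    earn nothing under per-resolution pricing but still cost compute
    plus a human touch."""
    revenue = (1 - escalation_rate) * price_per_resolution
    escalation_cost = escalation_rate * human_cost_per_escalation
    return revenue - compute_cost_per_ticket - escalation_cost

# Hypothetical per-resolution pricing with an $8 human escalation cost.
for esc in (0.10, 0.30):
    profit = gross_profit_per_ticket(0.99, esc, 8.00, 0.05)
    print(f"escalation {esc:.0%}: {profit:+.2f} per ticket")
# escalation 10%: +0.04 per ticket
# escalation 30%: -1.76 per ticket
```

A token dashboard shows identical consumption in both rows; only Layer 3 and above reveal that the second scenario loses money on every ticket.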
At the hardware edge, Cognitum is shipping a $131 AI device with a WASM runtime, Ed25519 cryptographic identity, on-device vector store, OTA firmware updates, and MCP protocol integration — self-learning hardware designed to work without the cloud. Each device gets a unique cryptographic identity at manufacturing time, sandboxed execution for self-contained apps called "cogs," and multi-device coordination for fleet deployments. It slots into the Model Context Protocol (MCP) ecosystem with both cloud and local plugins, positioning itself as physical infrastructure for the agentic stack rather than another dev board (more: https://github.com/cognitum-one).
The Claude Ecosystem Matures
Five months ago, a developer got tired of starting every Claude session from scratch. So they built iai-mcp — a local daemon that captures every conversation, organizes it into three memory tiers (working, episodic, semantic), and feeds the right context back when a new session starts. It stores everything verbatim, runs neural embeddings locally, encrypts at rest with AES-256, and consolidates memory in the background during idle time. The published numbers: verbatim recall above 99%, retrieval under 100ms, session-start cost under 3,000 tokens. After five months of daily use with Claude Code, the developer reports it knows coding style, project structures, and preferences that were never explicitly saved — picked up from conversation patterns. It's now open source under MIT. (more: https://www.reddit.com/r/ClaudeAI/comments/1t5yhio/my_claude_dreams_at_night_and_remembers/)
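The three tiers map naturally onto a retrieval budget at session start: working memory first, then the most relevant episodic and semantic entries until the ~3K-token budget is spent. A hypothetical sketch of that shape (the project's real schema and ranking aren't published in the post):

```python
from dataclasses import dataclass

@dataclass
class Memory:
    tier: str         # "working" | "episodic" | "semantic"
    text: str
    relevance: float  # similarity score from the local embedding index

def build_session_context(memories: list[Memory], token_budget: int = 3000) -> str:
    """Fill the session-start context tier by tier, most relevant first,
    skipping anything that would blow the budget. Four characters per
    token is a rough heuristic, not the daemon's actual accounting."""
    tier_order = {"working": 0, "episodic": 1, "semantic": 2}
    ranked = sorted(memories, key=lambda m: (tier_order[m.tier], -m.relevance))
    picked, used = [], 0
    for m in ranked:
        cost = len(m.text) // 4 + 1
        if used + cost <= token_budget:
            picked.append(m.text)
            used += cost
    return "\n".join(picked)
```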
The ecosystem's personality layer is getting attention too. Claude Skins brings visual theming to the Claude Code CLI — terminal colors, ASCII art banners, status line customization, voice/personality presets, and tool feedback sounds. Nine skins ship out of the box, from "Nebula" (offensive security scanner aesthetic) to "Brutalist" (pure monochrome, zero decoration, maximum terseness). Skins are YAML-defined, hot-swappable without restart, and integrate via Claude Code's hooks system (more: https://github.com/basicScandal/claude-skins). For those building agents rather than skinning them, Marktechpost's growing collection of 60+ AI agent tutorials and implementations now covers everything from multi-agent swarm orchestration to procedural memory systems to cybersecurity AI agents — a practical onboarding resource for a framework ecosystem that's maturing faster than documentation can keep up (more: https://github.com/Marktechpost/AI-Agents-Projects-Tutorials).
When Helpfulness Becomes a Liability
Stuart Winter-Tear frames the sycophancy problem not as a model quirk but as an organizational threat: "sycophancy is not simply agreement, but a boundary failure between social alignment and epistemic integrity." The concern isn't chatbots being too polite — it's AI moving into strategy, governance, board packs, and investment analysis where "confidence arrives before judgement." A model that softens corrections, preserves mood, and hands back a cleaner version of what the user already wanted to hear isn't helping anyone think. It's laundering flawed premises through better prose. As one commenter noted, what's needed isn't better prompting but "structural collision detection" — a governance layer that flags when AI systems operate on contradictory assumptions or when a new deployment undermines an existing decision. The antidote to sycophancy turns out to be architectural, not behavioral. (more: https://www.linkedin.com/posts/stuart-winter-tear_when-helpfulness-becomes-sycophancy-ugcPost-7458408233750253568-cCyR)
On a lighter research note, Vibe-Cast's JEPA exploration offers an accessible entry point into Joint Embedding Predictive Architecture — Meta AI's self-supervised learning approach that learns representations by predicting masked embeddings rather than reconstructing pixels. The educational implementation walks through patch-based encoding, masking strategies, predictor networks, and momentum-updated encoders, providing a hands-on path into a paradigm that keeps winning empirically even as debate continues over whether it will deliver on LeCun's vision of a post-generative AI future (more: https://github.com/mondweep/vibe-cast/tree/claude/explore-jepa-orphan-adYwi).
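The core objective is compact enough to show end to end: predict the target encoder's embeddings of masked patches from a context view, train only the context encoder and predictor, and update the target encoder by momentum. A toy PyTorch version with linear encoders, assuming nothing about Vibe-Cast's actual code:

```python
import torch
import torch.nn as nn

dim, n_patches, batch = 64, 16, 8
context_enc = nn.Linear(dim, dim)   # trained by gradient descent
target_enc = nn.Linear(dim, dim)    # updated only by EMA, never by gradients
predictor = nn.Linear(dim, dim)     # maps context embeddings to target space
target_enc.load_state_dict(context_enc.state_dict())
for p in target_enc.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(
    list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-3)

patches = torch.randn(batch, n_patches, dim)   # stand-in for image patch tokens
mask = torch.rand(batch, n_patches) < 0.5      # patches hidden from the context view

ctx = context_enc(patches * ~mask.unsqueeze(-1))  # masked view in, embeddings out
with torch.no_grad():
    tgt = target_enc(patches)                     # full view through target encoder
pred = predictor(ctx)

# The loss lives in embedding space (no pixel reconstruction),
# and only masked positions contribute.
loss = ((pred - tgt)[mask] ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():  # momentum update keeps the target slow and stable
    for q, k in zip(context_enc.parameters(), target_enc.parameters()):
        k.mul_(0.996).add_(q, alpha=0.004)
```

Real I-JEPA uses vision-transformer encoders and positional mask tokens in the predictor; the zeroed-out patches here are a simplification of the same idea.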
Sources (22 articles)
- [Editorial] Behind the Scenes: Hardening Firefox (hacks.mozilla.org)
- [Editorial] Moak AI — AI Security Tool (moak.ai)
- [Editorial] Thousands of Vibe-Coded Apps Expose Corporate and Personal Data (wired.com)
- Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama (reddit.com)
- Maybe You Shouldn't Install New Software for a Bit (xeiaso.net)
- 2.5x Faster Inference with Qwen 3.6 27B Using MTP — Complete Hardware Guide (reddit.com)
- Atlas Is Now Open Source — Pure Rust+CUDA Inference Engine for Blackwell (reddit.com)
- antirez/ds4 — DeepSeek 4 Flash Local Inference Engine for Metal (github.com)
- [Editorial] dflash — Flash Inference Tool (github.com)
- Heretic 1.3: Reproducible Abliterated Models, Integrated Benchmarking, Reduced VRAM (reddit.com)
- Recursive Agent Optimization — RL for Recursive Agent Spawning (arxiv.org)
- [Editorial] Anthropic Introducing Dreaming for Agents (linkedin.com)
- Qwen WebWorld 32B/14B/8B — Open Web World Model for Agent Training (reddit.com)
- Tilde.run — Agent Sandbox with Transactional Versioned Filesystem (tilde.run)
- Anthropic Just Secured a Reserve — 220,000 GPUs via SpaceX Colossus (reddit.com)
- [Editorial] The Four Layers of AI Measurement — A CFO's Framework (thesaascfo.com)
- [Editorial] Cognitum — AI Knowledge Platform (github.com)
- iai-mcp: Persistent Claude Memory Daemon — 5 Months of Daily Use, Now Open Source (reddit.com)
- [Editorial] Claude Skins — Custom Claude Code Personas (github.com)
- [Editorial] AI Agents Projects & Tutorials Collection (github.com)
- [Editorial] When Helpfulness Becomes Sycophancy (linkedin.com)
- [Editorial] Vibe-Cast JEPA — Joint Embedding Predictive Architecture Exploration (github.com)