White-Box Red-Teaming Arrives for Agentic AI
Today's AI news, curated from 21 sources across the web:
- White-Box Red-Teaming Arrives for Agentic AI
- Local Inference Pushes Past the Million-Token Barrier
- Small Models, Big Wins: When Distillation Beats the API
- Agentic Infrastructure: From Persistent Memory to Deterministic Workflows
- AI, Jobs, and the Governance Vacuum
- Generative Media: Diffusion Speedruns and Creative Synthesis
White-Box Red-Teaming Arrives for Agentic AI
The attack surface of an AI agent is not a port or a buffer — it is language itself. VotalAI, presenting at RSA Conference 2026, has open-sourced a white-box red-teaming framework that takes this seriously: it reads your application's source code, maps every tool, role, and guardrail, and then generates LLM-powered attacks across 47 categories — from forged JWTs and RBAC bypass through steganographic exfiltration, memory poisoning, and multi-agent confused-deputy attacks. (more: https://github.com/sundi133/wb-red-team)
The key design choice is the white-box premise. Black-box testers treat an agent endpoint like a chatbot and hope something breaks. VotalAI's codebase analyzer bundles source files, sends them to an LLM, and extracts a structured map of tools, permissions, regex-based guardrails, sensitive data locations, and authentication weaknesses. Every attack module then uses that map to generate targeted, application-specific payloads. In benchmarking, the team claims white-box attacks find 3–5x more vulnerabilities than equivalent black-box approaches. The framework runs adaptive rounds: each round's results inform the next, letting the system evolve its strategy mid-test. A demo run against a reference agentic app found 14 vulnerabilities in 238 attacks, including critical JWT forgery and role-escalation bugs — the kind of findings that would take a human pentester days to surface manually. (more: https://votal.ai/white-box-red-teaming-for-agentic-ai-an-open-source-framework-for-testing-llms-ai-agents)
The framework's 47-category taxonomy is worth studying on its own. It includes agentic-specific attack classes that most traditional scanners do not even conceptualize: agentic_workflow_bypass (checkpoint injection, approval-gate forgery), agent_reflection_exploit (poisoning ReAct reasoning loops with fake Thought/Action/Observation blocks), and cross_session_injection (corrupting persistent memory and shared vector stores). For anyone building production agents with tool access, this taxonomy is a practical threat model. The tool supports OpenAI, Anthropic, and OpenRouter backends, making it provider-agnostic for the attack generation itself.
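The white-box flow described above (extract a structured map of the application, then template targeted payloads per attack category) can be sketched in a few lines. All names here are hypothetical illustrations, not the wb-red-team API; a real run would send the map to an LLM rather than fill in a string template.

```python
from dataclasses import dataclass, field

@dataclass
class CodeMap:
    """Structured map a codebase analyzer might extract."""
    tools: list[str]               # tool names the agent can call
    roles: dict[str, list[str]]    # role -> permitted tools
    guardrails: list[str]          # regex-based input filters
    sensitive_paths: list[str] = field(default_factory=list)

def generate_attacks(code_map: CodeMap, categories: list[str]) -> list[dict]:
    """Build one targeted attack seed per (category, tool) pair."""
    attacks = []
    for category in categories:
        for tool in code_map.tools:
            attacks.append({
                "category": category,
                "target_tool": tool,
                # stand-in for LLM-generated, application-specific payloads
                "payload": f"[{category}] invoke {tool} while evading "
                           f"{len(code_map.guardrails)} known guardrails",
            })
    return attacks

code_map = CodeMap(
    tools=["search_db", "send_email"],
    roles={"viewer": ["search_db"]},
    guardrails=[r"(?i)ignore previous"],
)
batch = generate_attacks(code_map, ["rbac_bypass", "memory_poisoning"])
print(len(batch))  # 2 categories x 2 tools = 4 payloads
```

The adaptive-rounds idea then amounts to feeding each round's findings back into the next call to the generator.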
While agentic AI security tooling matures, the human side of data protection is having a worse week. A whistleblower complaint alleges that a former DOGE employee claimed access to two highly sensitive Social Security Administration databases and planned to share the information with a private employer — an unprecedented breach allegation at an agency serving over 70 million Americans. The SSA's internal watchdog is investigating. No exploit code needed; just a person with access and, allegedly, questionable intentions. (more: https://www.washingtonpost.com/politics/2026/03/10/social-security-data-breach-doge-2/)
Local Inference Pushes Past the Million-Token Barrier
Running a 120-billion-parameter model with a one-million-token context window on a single desktop machine sounds like a 2028 prediction. It is happening now. A LocalLLaMA user benchmarked Nemotron 3 Super (Q4_K_M quantization) on an M1 Ultra with 128 GB unified memory, processing the full million-token prefill in approximately 3 hours and 20 minutes. At the start the model processes 255 tokens per second; at one million tokens of depth, that drops to 49 t/s for prefill and 8 t/s for generation — a steep but non-catastrophic degradation curve enabled by Nemotron's hybrid Mamba-2 architecture, which makes memory scaling sublinear compared to pure-attention models. The same user benchmarked Qwen 3.5 122B (pure MoE transformer) under identical conditions and found it starts faster at 391 t/s but degrades more aggressively, hitting 59 t/s at 250K depth. The Mamba-2 hybrid wins the long-context race decisively. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rrlbhm/processing_1_million_tokens_locally_with_nemotron/)
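A back-of-envelope check of those numbers is instructive. Assuming, purely for illustration, that prefill throughput decays linearly from 255 t/s at depth 0 to 49 t/s at depth 1M, the total prefill time has a closed form:

```python
import math

# Total time = integral over depth of dn / r(n), with r(n) = r0 - k*n
# decaying linearly from r0 (depth 0) to r1 (depth 1M).
r0, r1, total_tokens = 255.0, 49.0, 1_000_000
k = (r0 - r1) / total_tokens
prefill_seconds = math.log(r0 / r1) / k   # closed form of the integral
print(f"{prefill_seconds / 3600:.1f} h")  # ~2.2 h under linear decay
```

The reported wall time was roughly 3 h 20 min (~12,000 s), noticeably longer, which suggests throughput falls faster than linearly through the middle of the context.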
This long-context capability reframes the RAG-vs-stuff-the-window debate. If the entire Lord of the Rings plus The Hobbit fits in the prompt with room to spare, the engineering overhead of chunking strategies, embedding models, vector databases, and rerankers starts to look like unnecessary complexity for bounded datasets. The counterarguments remain valid — the "rereading tax" of processing a 250K-token manual on every query, attention dilution burying needles in haystacks, and the simple fact that enterprise data lakes measure in terabytes, not megabytes. For bounded, high-reasoning tasks (legal contract analysis, codebase auditing), long context wins on simplicity. For infinite datasets, the vector database stays. The practical answer, as usual, is routing between both. (more: https://youtu.be/UabBYexBD4k?si=R9GZ3Fgnm5VfA0DN)
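The routing decision sketched above reduces to a couple of checks. The 250K-token budget and 1,000-query threshold below are illustrative assumptions, not recommendations from the benchmark:

```python
LONG_CONTEXT_BUDGET = 250_000  # assumed usable window for stuffing

def route(corpus_tokens: int, queries_expected: int) -> str:
    """Pick between stuffing the window and RAG for a given workload."""
    if corpus_tokens > LONG_CONTEXT_BUDGET:
        return "rag"                  # corpus can never fit the window
    # the "rereading tax": stuffing pays full prefill on every query, so
    # high query volume over a static corpus favors RAG + retrieval
    if queries_expected > 1_000:
        return "rag"
    return "stuff_the_window"

print(route(180_000, 50))    # bounded corpus, few queries -> stuff_the_window
print(route(5_000_000, 10))  # terabyte-scale data -> rag
```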
On the edge-inference front, Embedl released FlashHead, a drop-in replacement for the language-model head that delivers up to 40% faster token generation on NVIDIA Jetson devices without sacrificing reasoning quality, stacking on top of W4A16 quantization. On an AGX Thor, the Cosmos-Reason2-2B model jumps from 88.3 t/s (W4A16) to 128.2 t/s with FlashHead enabled. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rrttkx/flashhead_up_to_40_faster_multimodal_reasoning_on/)
Architecture choices matter beyond speed. A careful benchmark of DeepSeek V3.2 Speciale shows that running it with dense attention (as llama.cpp currently does, lacking a native sparse-attention implementation) produces a 17% accuracy decrease on lineage-512 tasks and 22% on lineage-1024 versus the intended sparse (DSA) attention path. The model's sparse attention selects only the top 2,048 tokens per head; forcing it to attend to the full context overwhelms it with noise it was never trained to handle. Until llama.cpp adds proper DSA support, running DeepSeek V3.2 locally means accepting measurably degraded reasoning on complex tasks. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rq8otd/running_deepseek_v32_with_dense_attention_like_in/)
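The sparse-attention mechanism at issue can be sketched as a top-k mask over attention logits: each head keeps only its k highest-scoring key positions and drops the rest before softmax. Shapes and k below are toy-sized (DSA selects the top 2,048), and this is a conceptual illustration, not DeepSeek's implementation:

```python
import numpy as np

def topk_attention(scores: np.ndarray, k: int) -> np.ndarray:
    """scores: (heads, q_len, kv_len) raw attention logits."""
    masked = np.full_like(scores, -np.inf)
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]  # top-k per row
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, -1), -1)
    # softmax over the surviving positions; masked entries become 0
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = topk_attention(rng.normal(size=(2, 4, 16)), k=4)
print((w > 0).sum(axis=-1))  # exactly k nonzero weights per query row
```

Running such a model with dense attention is the equivalent of deleting the masking step: every position keeps nonzero weight, including the noise the model was never trained to downweight.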
A companion paper on activation outliers in transformer quantization provides the theoretical explanation for why naive INT8 quantization crushes model quality. The researchers show that structured activation outliers — not random noise — drive PTQ failure. In BERT-base fine-tuned on QNLI, global W8A8 quantization drops validation accuracy from 89.66% to 54.33%. The culprit: kurtosis in activation distributions reaches 271 by Layer 11 (a Gaussian has kurtosis of 3), and the top 1% of channels concentrate 55% of total activation energy. Uniform min-max scaling allocates nearly all dynamic range to these few channels, crushing everything else. Mixed-precision (keeping critical layers in FP16) recovers to 89.42%, while per-embedding-group quantization only partially recovers. Surprisingly, percentile-based calibration makes things worse — the dominant channels encode signal, not noise. The practical takeaway: channel-aware precision allocation is necessary, and simply clipping outliers destroys information. On an RTX 3050, none of the INT8 methods produced measurable latency improvements, a reminder that quantization benefits depend entirely on hardware kernel support. (more: https://arxiv.org/abs/2603.04308v1)
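The range-allocation failure is easy to reproduce in miniature. In this toy sketch (not the paper's setup), one heavy channel forces per-tensor min-max INT8 scaling to round the seven well-behaved channels mostly to zero, while per-channel scaling preserves them:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(scale=0.1, size=(1024, 8)).astype(np.float64)
acts[:, 0] *= 500  # a single outlier channel, as in high-kurtosis layers

def quantize(x, scale):
    """Symmetric INT8 fake-quantization with the given scale(s)."""
    return np.clip(np.round(x / scale), -127, 127) * scale

per_tensor = quantize(acts, np.abs(acts).max() / 127)
per_channel = quantize(acts, np.abs(acts).max(axis=0) / 127)

# measure error on the 7 normal channels, which min-max scaling crushes
err = lambda q: np.abs(q - acts)[:, 1:].mean()
print(f"per-tensor error:  {err(per_tensor):.4f}")
print(f"per-channel error: {err(per_channel):.6f}")
```

The per-tensor error on the normal channels is orders of magnitude larger, which is the mechanism behind the 89.66% to 54.33% accuracy collapse: the outlier channels eat the dynamic range, and the signal everywhere else rounds away.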
Small Models, Big Wins: When Distillation Beats the API
A systematic comparison from Distil Labs pits fine-tuned Qwen3 models (0.6B to 8B parameters) against frontier APIs — GPT-5 nano/mini/5.2, Gemini Flash, Claude Haiku/Sonnet/Opus, Grok 4 — across nine datasets spanning classification, function calling, QA, and open-book QA. The headline: a 0.6B model hits 98.7% on smart-home function calling versus Gemini Flash at 92.0%. On Text2SQL, Qwen3-4B reaches 98.0% versus Claude Haiku at 98.7%, but at $3 per million requests versus $378. Classification tasks (Banking77, TREC) are "basically solved" — distilled models land within 0–1.5 percentage points of the best frontier option. Where frontier still wins clearly is open-ended reasoning requiring world knowledge: HotpotQA shows 92.0% versus Haiku's 98.0%. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rozrmn/finetuned_qwen3_slms_068b_beat_frontier_llms_on/)
The throughput numbers are equally striking: Qwen3-4B on a single H100 sustains 222 requests per second for Text2SQL with p50 latency of 390ms and only 7.6 GiB VRAM (BF16, no quantization). FP8 adds 15% throughput and cuts VRAM by 44% with no accuracy loss. The practical decision framework: distill when you have structured tasks, well-defined schemas, high volume, or data-sovereignty needs; call frontier APIs for broad world knowledge or low volume. The cost differential is 100x or more at scale.
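The cost arithmetic from the figures above is worth making concrete. The 50M monthly volume is an illustrative assumption:

```python
def monthly_cost(price_per_million_requests: float, requests: int) -> float:
    return price_per_million_requests * requests / 1_000_000

volume = 50_000_000  # assumed 50M requests/month
distilled = monthly_cost(3, volume)    # Qwen3-4B Text2SQL figure
frontier = monthly_cost(378, volume)   # Claude Haiku figure
print(f"distilled ${distilled:,.0f}/mo vs frontier ${frontier:,.0f}/mo "
      f"({frontier / distilled:.0f}x)")
```

At that volume the gap is $150 versus $18,900 per month, which is where the "100x or more at scale" claim comes from.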
On the research side, EvoKernel from Shanghai Jiao Tong University tackles a harder version of this problem: getting LLMs to generate kernel code for hardware they have essentially zero training data for. The "data wall" is stark — GPT-5.2 achieves 92% correctness on CUDA L1 tasks but collapses to 14% on Ascend C (NPU) tasks. EvoKernel formulates kernel synthesis as a reinforcement-learning task over self-evolving memory, using value-driven retrieval (learned Q-values) to select which past experiences to feed the generator. Starting from 11.0% correctness, EvoKernel pushes GPT-5.2 to 83.0% on NPU benchmarks, with a median 3.60x latency speedup over initial drafts through iterative refinement. The cross-task memory transfer is the most interesting finding: experiences from simpler operators genuinely accelerate learning on complex ones, suggesting that value-guided experience accumulation can substitute for training data in cold-start domains. (more: https://arxiv.org/abs/2603.10846v1)
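The value-driven retrieval loop can be sketched as: rank stored experiences by a learned Q-value, feed the top ones to the generator, and nudge each entry's value toward the observed reward (did the drafted kernel compile and pass). The structures below are hypothetical simplifications of EvoKernel's memory, not its actual schema:

```python
def retrieve(memory: list[dict], k: int = 2) -> list[dict]:
    """Select the k highest-value past experiences."""
    return sorted(memory, key=lambda e: e["q"], reverse=True)[:k]

def update(entry: dict, reward: float, lr: float = 0.5) -> None:
    """TD-style update: move the Q-value toward the observed reward."""
    entry["q"] += lr * (reward - entry["q"])

memory = [
    {"op": "add",     "snippet": "...", "q": 0.9},
    {"op": "matmul",  "snippet": "...", "q": 0.4},
    {"op": "softmax", "snippet": "...", "q": 0.7},
]
picked = retrieve(memory)
print([e["op"] for e in picked])  # ['add', 'softmax']
update(memory[1], reward=1.0)     # matmul draft compiled and passed
print(memory[1]["q"])             # moved toward 1.0, now ~0.7
```

The cross-task transfer finding then corresponds to experiences from simple operators accumulating high Q-values and being retrieved when drafting complex ones.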
Agentic Infrastructure: From Persistent Memory to Deterministic Workflows
Stripe is now shipping over 1,300 pull requests per week that are entirely AI-written — human-reviewed, but with zero human-authored lines. The enabling architecture is not a more powerful model but a more disciplined harness. Stripe's internal system, Minions, implements "blueprints" — workflows that interleave agentic nodes (where an LLM reasons and codes) with deterministic nodes (where linting, type-checking, and test execution run deterministically, with guaranteed outcomes). The system curates context before the agent ever sees a task, pulling from a 500-tool MCP server called "tool shed" and selecting a relevant subset. Each run executes in an isolated AWS EC2 instance, pre-loaded with Stripe's codebase and lint caches. The test validation loop pulls from Stripe's three-million-test CI suite and gives the agent a maximum of two retry cycles before escalating to a human. Shopify has open-sourced a similar system called Roast. The pattern is clear: the industry is moving toward systems that control agents, not agents that control systems. (more: https://youtu.be/NMWgXvm--to)
The developer tooling ecosystem around Claude Code has matured to the point where a top-10 list is genuinely useful. Key recommendations include the Supabase CLI (not the MCP server — CLIs outperform MCPs because they are purpose-built for terminal-native agents), the Anthropic skill-creator skill (which now supports A/B testing of skill performance), the GSD framework for spec-driven project scaffolding, the NotebookLM CLI for offloading research to Google's free tool, and the Playwright CLI for browser automation. The throughline: CLI tools plus purpose-built skills consistently outperform MCP-based equivalents because they avoid the overhead of the MCP protocol layer. (more: https://youtu.be/OFyECKgWXo8)
The memory problem for AI agents gets a dedicated solution in SAGE, an open-source (Apache 2.0) desktop application that gives any MCP-compatible AI persistent, encrypted, locally-stored memory. Every memory passes through four validators (Sentinel, Dedup, Quality, Consistency) with a BFT quorum (3/4) required to commit — real signed vote transactions through CometBFT consensus. Memories strengthen with corroboration and decay without reinforcement. The team cites published research showing agents with governed memory measurably improve over time, while memoryless agents show zero improvement regardless of session count. SAGE supports Claude, ChatGPT, DeepSeek, Gemini, Cursor, and Ollama, and includes a Chrome extension for free-tier ChatGPT users. (more: https://l33tdawg.github.io/sage)
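The commit rule and the decay behavior described above can be sketched in a few lines. The validator names come from the article; the half-life math is an illustrative assumption, not SAGE's implementation:

```python
VALIDATORS = ["Sentinel", "Dedup", "Quality", "Consistency"]
QUORUM = 3  # 3-of-4 BFT quorum required to commit a memory

def commit(votes: dict[str, bool]) -> bool:
    """A memory commits only if at least QUORUM validators vote yes."""
    return sum(votes[v] for v in VALIDATORS) >= QUORUM

def decay(strength: float, days_since_reinforced: int,
          half_life_days: int = 30) -> float:
    """Uncorroborated memories weaken exponentially over time."""
    return strength * 0.5 ** (days_since_reinforced / half_life_days)

print(commit({"Sentinel": True, "Dedup": True,
              "Quality": True, "Consistency": False}))  # True: 3/4 quorum
print(round(decay(1.0, 60), 2))  # 0.25 after two assumed half-lives
```

Corroboration would reset the decay clock and raise the strength, which is how memories "strengthen with corroboration and decay without reinforcement."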
At the infrastructure layer, Common Ground Core (CGC) goes further, proposing a "sociotechnical operating system" for multi-agent coordination based on cybernetics principles. State lives in Postgres with CAS locks; NATS JetStream serves as a pure wakeup doorbell; an immutable CardBox creates unforgeable cognitive lineage for every tool call and reasoning step. Agents can fork and join hundreds of child nodes through the L1 kernel while the system handles concurrent convergence. The key design decision: humans participate as asynchronous nodes under the same protocol as AI agents, not as top-level prompters. The project is in preview with no ACL and an RCE warning, but the architecture — protocol-first, worker-agnostic, zero-brain-split — represents a thoughtful approach to the coordination-collapse problem that plagues naive multi-agent setups. (more: https://github.com/Intelligent-Internet/CommonGround)
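The CAS discipline mentioned above is the core of how concurrent agents avoid clobbering shared state: a writer commits only if the version it read is still current, otherwise it must re-read and retry. CGC does this against Postgres; this in-memory sketch just illustrates the protocol:

```python
state = {"version": 0, "value": "draft"}

def cas_update(expected_version: int, new_value: str) -> bool:
    """Commit only if nobody has written since we read expected_version."""
    if state["version"] != expected_version:
        return False  # someone else won the race: caller must re-read
    state["version"] += 1
    state["value"] = new_value
    return True

v = state["version"]
print(cas_update(v, "agent-A edit"))  # True: first writer wins
print(cas_update(v, "agent-B edit"))  # False: stale version, must retry
```

In a real deployment the check-and-increment would be a single atomic statement (e.g. a conditional UPDATE), so hundreds of forked child nodes can converge without a coordinator holding locks.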
AI, Jobs, and the Governance Vacuum
Atlassian is cutting 1,600 jobs — roughly 10% of its workforce — to "self-fund further investment in AI and enterprise sales." CEO Mike Cannon-Brookes insists Atlassian does not follow the philosophy of replacing people with AI, then immediately adds: "It would be disingenuous to pretend AI doesn't change the mix of skills we need or the number of roles required in certain areas." The severance bill: $225–236 million. About 900 of the cuts target developers and software roles across the US, Australia, and India. CTO Rajeev Rajan is stepping down. Atlassian's stock has fallen over 50% since January, part of a broader SaaS selloff driven by the market narrative that AI threatens per-seat licensing models. Whether AI is the cause or the cover, the result is the same for the people being walked out. (more: https://www.heise.de/en/news/Atlassian-CEO-AI-doesn-t-replace-people-here-but-we-re-firing-them-anyway-11208758.html)
The emotional toll is not limited to those receiving severance packages. NetworkChuck, a tech YouTuber with five million subscribers, posted a remarkably candid video about nearly quitting his channel due to AI-induced burnout and anxiety. A UC Berkeley study he cites reports 62% of AI workers experiencing burnout, anxiety, or decision paralysis by month six. The Matt Schumer article that went viral — "If your job happens on a screen, AI is coming for significant parts of it" — crystallized a fear many were already stuffing down. The rebuttal data is real too: 95% of organizations see no measurable ROI from AI; the Yale Budget Lab says AI labor displacement "remains largely speculative"; every previous tech revolution overestimated the speed of economic transformation. The resolution NetworkChuck arrives at is pragmatic: learning AI is still essential, but IT fundamentals (networking, security, systems) are what make you better with AI and valuable when AI fails. The anxiety is not speculative. The job losses are not speculative. The timeline for mass displacement remains genuinely unknown. (more: https://m.youtube.com/watch?v=dbMXi9q78Tk)
Meanwhile, Debian's attempt to establish policy on AI-generated contributions ended in a deliberate non-decision. Lucas Nussbaum proposed a general resolution requiring disclosure of AI-generated code, accountability for technical merit and license compliance, and prohibition against using AI tools with embargoed security data. The ensuing debate was illuminating: Russ Allbery argued that "AI" is too amorphous for durable policy and urged specificity around LLMs; Simon Richter raised the onboarding problem — AI-assisted drive-by contributions are "a missed opportunity to onboard a new contributor"; Matthew Vernon centered the ethical dimension of organizations that "hoover up content with scant regard to copyright"; Ted Ts'o countered that gatekeeping AI-using contributors is even more self-defeating than the risks. Nussbaum ultimately withdrew the GR, noting the discussion had been civil enough that case-by-case handling under existing policies was sufficient for now. Given how fast the underlying technology is shifting, kicking the can may be the least disruptive option. (more: https://lwn.net/SubscriberLink/1061544/125f911834966dd0/)
Generative Media: Diffusion Speedruns and Creative Synthesis
Photoroom's PRX team trained a competitive text-to-image model in 24 hours on 32 H200 GPUs for approximately $1,500 — a budget that would have been absurd two years ago. The recipe stacks several recent innovations: x-prediction formulation that eliminates the VAE entirely and trains directly in pixel space (patch size 32, 256-dim bottleneck), TREAD token routing that randomly skips tokens through contiguous transformer blocks, REPA representation alignment using DINOv3 as teacher, perceptual losses (LPIPS + DINO features) applied on pooled full images at all noise levels, and the Muon optimizer. Training starts at 512px then fine-tunes at 1024px. The result is not flawless — texture glitches, occasional anatomy issues — but the remaining artifacts look like undertraining, not architectural flaws. All code and configs are open-sourced. The democratization curve for image generation is steep: what cost millions in compute two years ago now costs less than a used laptop. (more: https://huggingface.co/blog/Photoroom/prx-part3)
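The "no VAE, train directly in pixel space" choice implies a patchify step: a 512x512 RGB image becomes (512/32)^2 = 256 tokens, each a flattened 32x32x3 patch projected down to the model width. This sketch uses a random projection purely for shape illustration; the patch size and 256-dim bottleneck figures come from the article, everything else is assumed:

```python
import numpy as np

P, D = 32, 256  # patch size and bottleneck width from the article
img = np.random.rand(512, 512, 3).astype(np.float32)

# split rows and cols into 32px tiles, then gather each tile as one token
patches = img.reshape(512 // P, P, 512 // P, P, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, P * P * 3)  # (256, 3072) flattened patches

proj = (np.random.randn(P * P * 3, D) * 0.02).astype(np.float32)
seq = tokens @ proj  # (256, D) token sequence fed to the transformer
print(seq.shape)
```

Skipping the VAE means the model both reads and predicts in this pixel-patch space, which is what makes losses like LPIPS applicable directly during training.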
On the integration side, a new OpenWebUI tool brings LTX 2.3 video generation directly into the chat interface via ComfyUI workflows, supporting both text-to-video and image-to-video with adjustable resolution, frame count, and FPS. (more: https://www.reddit.com/r/OpenWebUI/comments/1rrt109/new_ltx23_tool_for_openwebui/)
CodeSpeak, from the creator of Kotlin, proposes a fundamentally different interface between humans and code generators: a formal language where plain-text specifications 5–10x smaller than the equivalent source code produce production-ready implementations via LLM. The pitch is that maintaining specs is easier than maintaining code, and when your model outputs code rather than latents, you can plug in the entire classical software engineering toolbox for verification. Whether this is the future of programming or an elaborate prompt-engineering framework with syntax highlighting remains to be seen, but the pedigree is worth watching. (more: https://codespeak.dev/)
For pure creative joy, Loopmaster published a step-by-step tutorial on building Roland's TB-303 synthesizer from scratch — the instrument responsible for the entire acid house genre. Starting from raw oscillators, adding the characteristic diode-ladder low-pass filter, shaping the cutoff with envelopes instead of LFOs, layering accents and slides, and finishing with distortion, the tutorial turns synthesis education into interactive code. The "acid" sound emerges incrementally, and the values in the accent and slide parameters change the output dramatically. Infinite sounds from finite parameters — a good metaphor for what the rest of this edition's tools are trying to achieve. (more: https://loopmaster.xyz/tutorials/tb303-from-scratch)
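A tiny taste of the tutorial's first steps can fit in plain Python: a sawtooth oscillator run through a one-pole low-pass filter whose cutoff is swept downward by a decaying envelope, the most basic version of the 303's filter-sweep character. Sample rate, frequencies, and decay values here are illustrative, not the tutorial's:

```python
import math

SR = 8000  # assumed sample rate, kept low for speed

def saw(freq: float, n_samples: int) -> list[float]:
    """Naive sawtooth oscillator in [-1, 1]."""
    return [2 * ((i * freq / SR) % 1.0) - 1 for i in range(n_samples)]

def env_lowpass(samples, cutoff_start=2000.0, decay=0.999):
    """One-pole low-pass with an exponentially decaying cutoff envelope."""
    out, y, cutoff = [], 0.0, cutoff_start
    for x in samples:
        alpha = 1 - math.exp(-2 * math.pi * cutoff / SR)
        y += alpha * (x - y)  # one-pole low-pass step
        out.append(y)
        cutoff *= decay       # the envelope sweeps the cutoff downward
    return out

note = env_lowpass(saw(110, SR // 2))  # half a second of A2 with a sweep
print(len(note))
```

Adding resonance, accent, slide, and distortion on top of this skeleton is exactly the incremental path the tutorial walks.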
Sources (21 articles)
- [Editorial] (github.com)
- [Editorial] (votal.ai)
- Whistleblower: DOGE member took Social Security data to new job (washingtonpost.com)
- Processing 1 million tokens locally with Nemotron 3 Super on a M1 ultra (reddit.com)
- [Editorial] (youtu.be)
- FlashHead: Up to 40% Faster Multimodal Reasoning on Top of Quantization (reddit.com)
- Running DeepSeek V3.2 with dense attention (like in llama.cpp) makes it a bit dumber (reddit.com)
- Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs (arxiv.org)
- Fine-tuned Qwen3 SLMs (0.6-8B) beat frontier LLMs on narrow tasks (reddit.com)
- Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis (arxiv.org)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- [Editorial] (l33tdawg.github.io)
- Intelligent-Internet/CommonGround (github.com)
- Atlassian CEO: AI doesn't replace people here, but we're firing them anyway (heise.de)
- [Editorial] (m.youtube.com)
- Debian decides not to decide on AI-generated contributions (lwn.net)
- PRX Part 3 — Training a Text-to-Image Model in 24h! (huggingface.co)
- New LTX2.3 Tool for OpenWebui (reddit.com)
- Kotlin creator's new language: a formal way to talk to LLMs instead of English (codespeak.dev)
- Building a TB-303 from Scratch (loopmaster.xyz)