Securing the Agentic Stack

Today's AI news: Securing the Agentic Stack, MCP Goes Creative, Opening the Black Box, Multimodal Intelligence Gets Practical, The $100K Self-Hosting Question, Engineering at Scale, The Walking Problem. 22 sources curated from across the web.

Securing the Agentic Stack

A position paper from NVIDIA and Johns Hopkins researchers lays out an architectural blueprint for defending AI agents against indirect prompt injection — the attack class where malicious instructions hidden in emails, web pages, or tool outputs hijack agent behavior. The paper's central claim is uncomfortable but honest: static security policies are insufficient. When a coding agent hits a deprecated API and needs to replan, or a debugging agent discovers the crash evidence lives in a log file nobody anticipated, the security boundary has to flex. But the moment plans and policies become dynamic, untrusted environmental data can influence them — and that is the fundamental tension the paper tries to resolve. (more: https://arxiv.org/abs/2603.30016v1)

The proposed architecture splits agent execution into six components: an orchestrator that generates plans and policies, a plan/policy approver with optional human escalation, an executor, a policy enforcer, the environment, and a feedback loop enabling iterative updates. The key insight lives in Position 2: when LLMs must participate in security decisions (and the authors argue they must, because the complexity exceeds what rule-based checks can express), the system should feed the model only narrowly scoped, structured artifacts — typed diffs, proposed plan changes, provenance traces — never raw attacker-controlled text. Two concrete proposals follow. First, decouple instruction recognition from instruction-following: make the model explicitly verbalize which instructions it intends to follow, then trace provenance and apply system-level policy. Second, use an LLM to synthesize step-specific programmatic validators for environment responses rather than trusting the executor to handle adversarial content directly (a sketch of this follows below).

The paper's sharpest contribution may be its benchmark critique. AgentDojo, the most popular agent-security benchmark, contains only 6 of 97 tasks that require replanning. Every benchmark uses static, non-adaptive attack payloads. The authors argue this creates "a false sense of both utility and security" and call for pluggable adaptive attackers — RL-trained adversary models or genetic-algorithm-style payload refinement — as a minimum standard.
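
The validator-synthesis proposal is concrete enough to sketch. Below is a minimal illustration; the names (llm_generate, Step, run_step) are hypothetical stand-ins, since the paper specifies the pattern rather than an implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class Step:
    action: str      # e.g. "read_log_file"
    expected: str    # natural-language description of a valid response

def llm_generate(prompt: str) -> str:
    # Stand-in for a narrowly scoped, trusted LLM call. For the demo it returns
    # a canned validator; a real system would sandbox and review the output.
    return ("def is_valid(text):\n"
            "    return bool(re.fullmatch(r'[\\x20-\\x7e\\s]{1,65536}', text))")

def synthesize_validator(step: Step):
    code = llm_generate(
        f"Write is_valid(text) -> bool accepting only responses matching: "
        f"{step.expected}. Pure function, no side effects."
    )
    scope: dict = {}
    exec(code, {"re": re}, scope)   # sandboxing elided in this sketch
    return scope["is_valid"]

def run_step(step: Step, execute):
    is_valid = synthesize_validator(step)
    raw = execute(step.action)      # raw, attacker-influenced environment output
    if not is_valid(raw):
        raise ValueError(f"response for {step.action} failed validation")
    return raw  # only validated data may be summarized for the planner
```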

On the tooling side, Vercel open-sourced DeepSec, a security harness that runs Opus 4.7 and GPT 5.5 at maximum effort against production codebases. The pipeline — static regex scan, agent investigation, validation pass to reduce false positives, git-metadata-based triage — scales to 1,000+ concurrent sandboxes on Vercel's own infrastructure. The false positive rate sits at 10–20%, which Vercel considers acceptable given the severity of true positives. Dub.co's founder noted that DeepSec "surfaced the kind of issues we'd actually want a security engineer to flag" — a bar most automated scanners still fail to clear. The tool works with off-the-shelf models; the "cyber" fine-tunes from Anthropic and OpenAI are supported but not required, as DeepSec's built-in classifier checks for refusals after each research step and finds them to be a non-issue for both models. (more: https://vercel.com/blog/introducing-deepsec-find-and-fix-vulnerabilities-in-your-code-base)
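
The pipeline's shape is easy to sketch, even though DeepSec's internals beyond the blog post are not public. The function names and regex patterns below are illustrative assumptions, not Vercel's code.

```python
import re

# Cheap, high-recall prefilter patterns (illustrative only)
PREFILTER = [
    (re.compile(r"\beval\s*\("), "possible code-injection sink"),
    (re.compile(r"(?i)(api[_-]?key|secret)\s*=\s*['\"]\w+"), "hardcoded credential"),
]

def static_scan(files: dict) -> list:
    """First pass: regex scan over raw source text."""
    hits = []
    for path, src in files.items():
        for pattern, label in PREFILTER:
            if pattern.search(src):
                hits.append({"path": path, "label": label})
    return hits

def agent_investigate(hit) -> bool:
    return True   # stand-in for the LLM investigation step

def validate_finding(hit) -> bool:
    return True   # stand-in for the second-model false-positive filter

def triage(files: dict) -> list:
    candidates = static_scan(files)
    investigated = [h for h in candidates if agent_investigate(h)]
    return [h for h in investigated if validate_finding(h)]
```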

MCP Goes Creative

Anthropic shipped Claude for Creative Work with nine official Model Context Protocol (MCP) connectors, including a flagship Blender integration. These are not copy-paste plugins — each connector maintains persistent context within the host application's own data model, letting Claude read live project state and execute actions directly. The Blender connector can manipulate objects, materials, and render settings without leaving the 3D viewport. This is one of the first production-scale deployments of that pattern, and if it holds, it probably becomes the template for how agents integrate with domain-specific professional software going forward. Community response immediately surfaced the wish list: CAD tools (FreeCAD for Linux users locked out of Fusion), building plan generation, and video editing. (more: https://www.reddit.com/r/ClaudeAI/comments/1t48vtx/anthropic_ships_claude_for_creative_work_with/)

The MCP ecosystem continues to grow in parallel. n8n released a new MCP server designed specifically for agentic coding tools like Claude Code and Codex. The key design choice: TypeScript-based validation instead of JSON. A prompt describing a desired automation gets parsed for intent, the MCP server resolves the required node types, the coding agent writes TypeScript that compiles and validates before being converted to JSON and deployed to a running n8n instance — cloud or self-hosted. For workflows that benefit from deterministic orchestration rather than end-to-end agent reasoning, this is a pragmatic middle ground. (more: https://www.youtube.com/shorts/Op-OpCjs0KE)

The governance question keeps surfacing alongside adoption. Architecture Decision Records (ADRs) — structured documents capturing what was decided, why, and what it supersedes — are finding new relevance in agentic engineering. With the Ruflo ADR plugin for Claude Code, each ADR is stored as a graph node in AgentDB, linked through relationships like "depends on," "amends," and "supersedes." The plugin scans diffs against accepted ADRs and flags mismatches, closing the drift gap between code and intent that compounds rapidly when autonomous agents are modifying a codebase. (more: https://www.linkedin.com/posts/reuvencohen_how-to-use-adrs-in-ruflo-adrs-are-quietly-share-7457413347890798592-uN5I)
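
The graph structure is simple to picture. Below is a minimal sketch with illustrative dataclasses; AgentDB's actual schema is not public here.

```python
from dataclasses import dataclass, field

# The three edge types named in the post
RELATIONS = {"depends_on", "amends", "supersedes"}

@dataclass
class ADR:
    id: str
    decision: str
    rationale: str
    edges: list = field(default_factory=list)  # (relation, target ADR id)

    def link(self, relation: str, other: "ADR") -> None:
        assert relation in RELATIONS
        self.edges.append((relation, other.id))

# Usage: a later decision that narrows an earlier one
base = ADR("adr-007", "All service calls go through the gateway", "single audit point")
amend = ADR("adr-012", "Batch jobs may bypass the gateway", "throughput ceiling")
amend.link("amends", base)
```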

Opening the Black Box

The Qwen team released Qwen-Scope, a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family spanning 2B to 35B MoE. SAEs decompose a model's internal activations into interpretable features — think of them as a dictionary of the model's concepts. Instead of opaque floating-point vectors, you can identify specific feature IDs that correspond to things like "legal talk," "Python code," or "refusal." This is the largest official interpretability release for an open-weight model to date: Gemma Scope from Google only covered models up to 9B parameters, while Qwen-Scope includes a dense 27B variant. (more: https://www.reddit.com/r/LocalLLaMA/comments/1szrbub/qwenscope_official_sparse_autoencoders_saes_for/)

The interactive Space demo shows what happens when a model unexpectedly switches from English to Chinese mid-response: feature comparison reveals exactly which feature ID spiked. With that ID in hand, you can steer the model by suppressing or amplifying specific features during generation — what practitioners call "surgical abliteration." This is considerably more precise than the brute-force "mean difference" approach that previous community efforts relied on. The Qwen team released everything under Apache 2.0, though their caution statement explicitly prohibits using the tools to remove safety filters — a tension the community has already noticed. Google maintains its own parallel effort with Gemma Scope for Gemma 2 and 3. Meanwhile, GuideLabs released steerling-8b, a purpose-built steerable 8B model that represents a different approach to the same goal: baking steerability into the model weights rather than discovering it post-hoc through SAE analysis. (more: https://huggingface.co/guidelabs/steerling-8b)
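
The steering mechanic itself is compact. Below is a sketch that suppresses or amplifies a single SAE feature by rescaling its decoder direction in the residual stream; tensor shapes are illustrative and the SAE's bias terms are omitted.

```python
import torch

def steer(resid: torch.Tensor, W_enc: torch.Tensor, W_dec: torch.Tensor,
          feature_id: int, scale: float = 0.0) -> torch.Tensor:
    """resid: [batch, d_model]; W_enc: [d_model, n_features]; W_dec: [n_features, d_model]."""
    acts = torch.relu(resid @ W_enc)           # SAE feature activations
    f = acts[:, feature_id:feature_id + 1]     # activation of the target feature
    direction = W_dec[feature_id]              # its decoder direction in residual space
    # Rescale the feature's contribution: scale=0 suppresses, scale>1 amplifies.
    return resid + (scale - 1.0) * f * direction
```

In practice this runs inside a forward hook at a chosen layer during generation, which is what makes the intervention "surgical": only one feature's direction is touched, rather than a whole mean-difference vector.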

On the quantization front, NVIDIA published Gemma-4-31B-IT-NVFP4, a 4-bit floating-point quantized version of Google's latest 31B model. FP4 quantization represents a different tradeoff from the integer-based Q4 formats common in the llama.cpp ecosystem: floating-point exponent bits preserve more dynamic range, at the cost of coarser precision at the grid's extremes. Whether this translates to measurably better output quality at the 31B scale — a sweet spot for consumer GPUs with 24GB VRAM — is the question worth tracking. (more: https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4)
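
A worked comparison makes the tradeoff concrete. The magnitude set below is the standard E2M1 grid that FP4 formats build on; the int4 side assumes a symmetric quantizer, and per-block scaling is omitted on both sides.

```python
# E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit -> 15 distinct values
fp4_magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
fp4_levels = sorted({s * m for m in fp4_magnitudes for s in (-1, 1)})
int4_levels = list(range(-7, 8))   # symmetric int4: 15 evenly spaced levels

print(fp4_levels)   # [-6.0, -4.0, -3.0, ..., -0.5, 0.0, 0.5, ..., 4.0, 6.0]
print(int4_levels)  # [-7, ..., 7]: uniform spacing, no extra resolution near 0
```

The FP4 grid packs fine 0.5-wide steps near zero, where most weight values live, and stretches out to coarse 2.0-wide steps near ±6; the int4 grid spends its levels uniformly.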

The Qwen 3.6 family continues to produce both impressive results and frustrating failure modes. One user running the 27B model for tax accounting on a modest Ryzen 9 laptop with 60GB RAM reports accuracy matching Claude on structured tasks like converting PDF tax forms to Excel — slower, but correct. The thesis: local models are entering the phase where they handle domain-specific professional workflows, not just coding tasks. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t0g1tg/qwen_36_27b_neo_code_q4_km_i_matrix_is_badass/)

But Qwen 3.6 27B also has a documented tendency to fall into self-affirming thinking loops — the model's internal reasoning degenerates into increasingly absurd self-reinforcement ("I will be concise. I will be precise. I will be accurate. I will be reliable...") that never terminates. This is not a new phenomenon at the 27B scale: Qwen 3.5 27B exhibited similar degenerate behavior on coding benchmarks, declaring existing tests as passing and quitting without writing code. Community consensus points to setting presence penalty to 1.5 (per Qwen's own recommendation) and using fixed chat templates from the community as mitigations, though neither is a complete fix. For anyone deploying these models in production loops, loop detection remains a mandatory guardrail. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sz9f6f/qwen3627budq6_k_xlgguf_sometimes_gets_stuck_in_a/)
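
A loop guard can be as simple as a sliding-window repetition check over the token stream. A minimal sketch, with illustrative thresholds:

```python
def is_looping(tokens: list, window: int = 16, repeats: int = 4) -> bool:
    """True if the last `window` tokens repeat `repeats` times back-to-back."""
    need = window * repeats
    if len(tokens) < need:
        return False
    tail = tokens[-need:]
    pattern = tail[-window:]
    return all(tail[i * window:(i + 1) * window] == pattern for i in range(repeats))

# Usage inside a streaming loop:
# if is_looping(generated_ids): break  # cut the "I will be concise..." spiral
```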

Multimodal Intelligence Gets Practical

NVIDIA's Nemotron 3 Nano Omni is a 30B-A3B omni-modal understanding model built to reason jointly across text, images, video, and audio — and it is efficient enough to deploy. The architecture combines a hybrid Mamba-Transformer Mixture-of-Experts backbone (23 Mamba state-space layers for long-context efficiency, MoE for conditional capacity, Transformer attention for global expressivity) with a C-RADIOv4-H vision encoder and Parakeet-TDT-0.6B-v2 audio encoder. Each encoder connects through lightweight MLP projectors into a shared embedding space where all modality tokens are interleaved and processed jointly. The model targets five workload classes: complex multi-page document intelligence (100+ pages), speech understanding, joint audio-video reasoning, agentic GUI computer use, and cross-modal multi-step reasoning. (more: https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence)

The design innovations target real-world bottlenecks. Dynamic resolution processing at native aspect ratio handles images from 512×512 up to 1840×1840, critical for dense documents and screenshots. Conv3D temporal compression fuses consecutive frame pairs into "tubelets," halving the vision token count for video. Efficient Video Sampling then drops redundant static tokens at inference, keeping only frames where something has actually changed. The result: 9× higher throughput and 7.4× higher system efficiency on multi-document tasks compared to other open omni models at a fixed per-user interactivity threshold. NVIDIA's training recipe includes multimodal RL with intentionally unanswerable questions — forcing the model to learn to abstain rather than hallucinate — and 11.4M synthetic QA pairs from real PDFs that produced a 2.19× accuracy gain on MMLongBench-Doc.
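
The token arithmetic is worth making explicit. Only the tubelet halving comes from the post; the per-frame token count and EVS keep rate below are assumptions for illustration.

```python
frames = 128
tokens_per_frame = 256                           # assumed patch-token count per frame
baseline = frames * tokens_per_frame             # 32,768 vision tokens
after_tubelets = baseline // 2                   # Conv3D fuses frame pairs -> 16,384
evs_keep_rate = 0.4                              # assumed share of non-static tokens
after_evs = int(after_tubelets * evs_keep_rate)  # 6,553 tokens reach the LLM
print(baseline, after_tubelets, after_evs)
```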

The local voice stack gained another piece with vibevoice.cpp, a pure C++ ggml port of Microsoft's VibeVoice speech model from the LocalAI team. The entire pipeline — TTS with voice cloning from a 30-second reference clip and 7B-parameter ASR with speaker diarization — runs with zero Python at inference. On CUDA Q4_K (GB10), a 68-second sample processes in 28 seconds at a real-time factor of 0.41; CPU-only on a Ryzen 9 pushes to 2.2× real time. The 17-minute audio test case still requires 26GB peak RSS on CPU, and streaming output is not yet implemented, but backends for CUDA, Metal, Vulkan, and hipBLAS make deployment radically simpler than the upstream Transformers/vLLM stack. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t48fkt/vibevoicecpp_microsoft_vibevoice_tts_longform_asr/)
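
For readers unfamiliar with the metric, the real-time factor falls out of the reported numbers directly. Note that the quoted CPU figure is ambiguous under this convention; an RTF of 2.2 would mean slower than real time.

```python
# RTF = processing time / audio duration; RTF < 1 means faster than real time.
audio_s, processing_s = 68.0, 28.0
rtf_cuda = processing_s / audio_s   # 0.41 -- matches the reported CUDA number
print(round(rtf_cuda, 2))
```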

On the research side, a paper from National Tsing Hua University introduces VI-NBFNet, a neural beamforming network that uses lip movement features extracted from a pretrained visual speech recognition model to steer a four-microphone array. The system jointly learns audiovisual features and spatial covariance matrices end-to-end, achieving PESQ 2.088 and 8% word error rate on Whisper-turbo for moving speakers — roughly half the error rate of the next-best baseline — while remaining notably compact at 7.15M parameters. (more: https://arxiv.org/abs/2603.05270v1)

The $100K Self-Hosting Question

A startup founder burning $1,500 to $4,000 per day on Claude Opus 4.7 API credits posted an open question to the LocalLLaMA community: what does the optimal $100K self-hosted inference server look like for agentic coding? The thread crystallized the current build-vs-buy debate with unusual clarity. The EPYC + RTX 6000 Pro path dominated: one practitioner running 4× 6000 Pros (768GB total VRAM) reports serving MiniMax M2.7 FP8 "all day long" at high speed, with Qwen3.5 397B NVFP4 hitting hundreds of tokens per second under concurrent load. "With Claude CLI hooked up to those bad boys it feels like SOTA in a box." Eight 6000 Pros fit within $100K if you pare back system RAM. The consistent warning: do not buy Macs for inference workloads — memory bandwidth makes them uncompetitive for production throughput — and open-weight models still trail Opus quality meaningfully enough that the investment thesis requires careful validation. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t30566/i_will_soon_have_100k_to_build_an_inhouse_llm/)

Tenstorrent's TT-QuietBox 2 represents the ASIC alternative. The desktop workstation packs two liquid-cooled Blackhole cards, each carrying two Blackhole ASICs with 240 Tensix cores and 64GB GDDR6 at 16 GT/s — 1 TB/s aggregate memory bandwidth across 128GB of VRAM, plus 256GB DDR5 system memory, with 800G Ethernet between ASICs internally. Community response is cautious: a single RTX Pro 6000 delivers 1.79 TB/s on its own, so the raw bandwidth competition is not yet won. But at roughly $10K for a complete system with experimental ASIC silicon, the price-per-exploration-dollar is compelling if Tenstorrent expands model support. (more: https://www.reddit.com/r/LocalLLaMA/comments/1szpwwt/tenstorrent_ttquietbox_2_specifications_blackhole/)

For those unwilling to pick a single vendor, llama.cpp's -DGGML_BACKEND_DL=ON flag now enables simultaneous CUDA and ROCm backend loading — genuine mixed-vendor inference in a single process. One user reports running MiniMax M2.7 Q4 across an NVIDIA GPU (83.6 GB) and an AMD GPU (40.3 GB) simultaneously, with prefill speed as the primary advantage. The setup requires manual CMakeLists.txt edits and careful flag management, but the fact that it works at all collapses a boundary the community had accepted as permanent. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t0bkaf/cuda_rocm_simultaneously_with_dggml_backend_dlon/)

A Blackwell + M3 Ultra RDMA cluster with roughly 2TB of aggregate RAM is being assembled for tinygrad driver testing, with community-requested benchmarks of DeepSeek V4 and MiMo 2.5 on the schedule. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t24qle/tinygrad_driver_testing/) MiMo 2.5 itself drew pointed criticism for requiring GPUs in multiples of four — the checkpoint's TP=4-interleaved fused qkv_proj architecture makes anything below four GPUs a non-starter on SGLang and vLLM, though GGUF via llama.cpp avoids the restriction entirely. Community commentary was blunt: the release looks like "a dump of the files they use for their internal proprietary serving stack," with missing weights, garbage model index files, and nonstandard tensor padding. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t0f9g8/mimo_25_requires_at_least_4_gpus_am_i_reading/)

Engineering at Scale

Stripe published the full story behind rubyfmt, the zero-configuration Ruby autoformatter that reformatted 25 million lines of code in a single Saturday morning commit. The project started as a bar conversation at RubyConf 2018 and spent six years evolving from a single-file Ruby script to a Rust binary that linked an entire Ruby VM just to walk parse trees in memory. The speed budget was strict: 100ms to format all but the largest files, because Ruby's 158ms startup time made invoking it normally a non-starter for save hooks. The Rust rewrite produced genuinely unusual engineering: Serde deserializing Ruby VALUE objects directly from memory by introspecting C types through Ruby's API, with the schema unchanged — only the deserializer swapped out. (more: https://stripe.dev/blog/formatting-an-entire-25-million-line-codebase-overnight-the-rubyfmt-story)

The payoff came in 2024 when Stripe flipped the opt-in model — only a tiny minority of files were held back — and merged a diff so large GitHub could not render it. Today, 100% of Stripe's 42 million lines of Ruby run through rubyfmt. One engineer who arrived from Python captured the result: "I sometimes receive code format nits on code reviews... I haven't thought about this problem in years." In 2025, rubyfmt migrated to Prism, the new official Ruby parser that instantiates parse trees without a Ruby VM at all, shrinking the binary by megabytes. The engineering lesson generalizes: the hardest part of building a formatter is not the formatting — it is parsing a language whose grammar was never designed for external tooling.

Crabbox launched as an open-source remote testbox runner targeting maintainers and AI agents. The architecture — a Go CLI on the laptop, a Cloudflare Worker broker that owns provider credentials and serializes lease state, and managed runners on Hetzner or AWS — lets developers sync a dirty local checkout to cloud compute, run a command, and stream output back. Cost guardrails enforce per-lease and monthly spend caps with live EC2 Spot pricing, and GitHub Actions hydration turns a leased box into an ephemeral Actions runner so the repo's own workflow handles runtime setup. The OpenClaw plugin exposes the full lifecycle as agent tools, making crabbox_run and crabbox_warmup native to AI-assisted development workflows. (more: https://github.com/openclaw/crabbox)

The Open WebUI ecosystem got a significant polish with Inline Visualizer v2.1.0, which adds pre-styled bare HTML tags, a 9-color accent palette, and expanded chart types including stacked areas, radar, and KPI cards with sparklines. Every bug reported after the v2.0 launch was fixed, including a particularly insidious issue where visualizations stayed blank when the model's "Thinking" section was expanded — a problem affecting Bedrock-hosted Haiku and any provider wrapping responses in reasoning blocks. (more: https://www.reddit.com/r/OpenWebUI/comments/1t3enpo/inline_visualizer_v210_prestyled_tags_9color/)

The Walking Problem

Firgelli Automations published a comprehensive engineering guide on humanoid robot actuators that reads like a field manual for the mechanical problems everyone in robotics knows about but few document this thoroughly. The central constraint is brutal: a commercially viable humanoid takes roughly 5,000 steps per hour, each delivering 2–3× body weight in impact force through the leg actuators. At a sustained 84 steps per minute, an 8-hour shift accumulates over 40,000 impacts — compressing years of standard industrial wear into weeks. These impacts arrive in 50–100 milliseconds, faster than any sensor loop can react, which means the actuator must be mechanically back-drivable to absorb the energy. Self-locking designs like industrial lead screws force the gearbox to absorb 100% of the shock, causing immediate shear failure.
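
The wear arithmetic checks out. A quick calculation, assuming an illustrative 80 kg robot and the midpoint of the guide's 2 to 3× impact multiplier:

```python
steps_per_min = 84
shift_min = 8 * 60
impacts_per_shift = steps_per_min * shift_min   # 40,320 -- the "over 40,000" figure

robot_mass_kg, g, multiplier = 80, 9.81, 2.5    # mass and multiplier are assumptions
peak_force_n = robot_mass_kg * g * multiplier   # ~1,962 N through the leg, per step
print(impacts_per_shift, round(peak_force_n))
```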

The guide documents what it calls the "mass penalty spiral." A 200-gram error at the ankle — choosing a cheaper, heavier actuator — compounds exponentially upward through the kinematic chain. The knee must grow to handle the heavier foot, the hip must grow to handle the heavier knee, the battery must grow to power larger motors. A component-level mistake becomes a system-level catastrophe. The only escape is enforcing density targets from day one: greater than 10 Nm/kg specific torque for rotary actuators, greater than 4,000 N/kg specific force for linear ones. "There is no 'upgrading later.' The mass budget is set by the actuators."

Every major humanoid manufacturer — Tesla, Figure, Apptronik — independently converged on the same split architecture. Strain wave gears (harmonic drives) handle rotational joints like shoulders and wrists: zero backlash, 50:1 to 100:1 reduction in a single flat stage, though the flexing metal generates heat and fatigues. Planetary roller screws handle impact-bearing joints like knees and ankles: their threaded rollers make line contact with 10–15× more surface area than ball screws, keeping peak Hertzian stress below yield threshold under repeated walking impacts. A ball screw rated for 10 million cycles might fail at 100,000 under walking loads — the rating assumes smooth, unidirectional force that never exists in a humanoid leg. The reflected inertia trap explains the gear ratio choices: a 100:1 gearbox multiplies motor inertia by 10,000× at the output, making the leg feel like a brick rather than a spring. Thermal management is the final gatekeeper separating lab demonstrations from commercial products — continuous torque is typically only 25–30% of peak in air-cooled actuators, and "peak torque impresses investors. Continuous torque — the ability to work all day without overheating — is what makes a product." (more: https://www.firgelli.com/pages/humanoid-robot-actuators)
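
The reflected-inertia claim is easy to verify: inertia reflects through a gearbox as the square of the ratio. The motor rotor inertia below is an assumed illustrative value.

```python
motor_inertia = 2e-5                   # kg*m^2, typical small BLDC rotor (assumed)
ratio = 100
reflected = motor_inertia * ratio**2   # 0.2 kg*m^2 at the joint output
print(reflected / motor_inertia)       # 10000.0 -- the 10,000x multiplication
```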

Sources (22 articles)

  1. Architecting Secure AI Agents: System-Level Defenses Against Indirect Prompt Injection (arxiv.org)
  2. [Editorial] DeepSec: Find and Fix Vulnerabilities in Your Code Base (vercel.com)
  3. Anthropic ships Claude for Creative Work with nine MCP-native connectors (reddit.com)
  4. n8n Just Got a New Tool (and it can SUPERCHARGE Claude Automations) (youtube.com)
  5. [Editorial] How to Use ADRs in Ruflo (linkedin.com)
  6. Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (reddit.com)
  7. guidelabs/steerling-8b — Steerable 8B Model (huggingface.co)
  8. nvidia/Gemma-4-31B-IT-NVFP4 — NVIDIA FP4 Quantized Gemma 4 (huggingface.co)
  9. Qwen 3.6 27B Neo Code Q4_KM running tax accounting on a Ryzen laptop (reddit.com)
  10. Qwen3.6-27B gets stuck in a self-affirming thinking loop (reddit.com)
  11. NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents (huggingface.co)
  12. vibevoice.cpp — Pure C++ ggml port of Microsoft VibeVoice with voice cloning and long-form ASR (reddit.com)
  13. Visual-Informed Speech Enhancement Using Attention-Based Beamforming (arxiv.org)
  14. $100K budget to build an in-house LLM server for agentic coding (reddit.com)
  15. Tenstorrent TT-QuietBox 2 Specifications (Blackhole) (reddit.com)
  16. CUDA + ROCm simultaneously with -DGGML_BACKEND_DL=ON (reddit.com)
  17. Tinygrad Driver testing on Blackwell + M3 Ultra RDMA cluster (reddit.com)
  18. MiMo 2.5 requires at least 4 GPUs — model accessibility locked behind hardware (reddit.com)
  19. Formatting a 25M-line codebase overnight — the rubyfmt story (stripe.dev)
  20. openclaw/crabbox — warm a box, sync the diff, run the suite (github.com)
  21. Inline Visualizer v2.1.0 — Pre-styled tags, 9-color accent palette, and chart catalog for Open WebUI (reddit.com)
  22. Humanoid Robot Actuators (firgelli.com)