MoE Models Hide Refusal in Places You Can't Weight-Bake Away
Today's AI news: MoE Models Hide Refusal in Places You Can't Weight-Bake Away, Activation Watermarks, Egress Firewalls, and the Trust Gap, The Gemma 4 Ecosystem Matures: From Benchmarks to Real-Time Voice, Voice Clones, AI Charts, and the Content Authenticity Problem, Agent Infrastructure: From DIY Hell to Managed Platforms, Infrastructure Shifts: Safetensors Governance, S3 Files, and MoE Anomaly Detection. 22 sources curated from across the web.
MoE Models Hide Refusal in Places You Can't Weight-Bake Away
The abliteration community just discovered something that should make safety researchers sit up: Mixture-of-Experts models don't encode refusal the way dense models do, and the standard techniques for removing it only half-work. A researcher adapted FailSpy's abliteration technique to Qwen3.5-397B-A17B running at 4-bit on a Mac Studio M3 Ultra (512GB), targeting PRC-specific censorship while attempting to preserve Western safety refusals. The results revealed two separable refusal subspaces — Chinese-political and Western-safety refusals live in different directions in activation space, and you can surgically remove one without touching the other. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sdkb68/abliterating_qwen35397b_on_a_mac_studio_revealed/)
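The directions themselves come from a simple contrast. A minimal sketch of the difference-of-means computation at the heart of FailSpy-style abliteration, using synthetic stand-in activations (all shapes and numbers here are illustrative, not from the post):

```python
import numpy as np

def refusal_direction(h_refuse, h_comply):
    """Difference-of-means direction at one layer: average hidden states on
    prompts the model refuses, subtract the average on prompts it answers,
    then normalize to a unit vector."""
    d = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
    return d / np.linalg.norm(d)

rng = np.random.default_rng(1)
base = rng.standard_normal((32, 16))   # stand-in hidden states (prompts x d_model)
shift = np.zeros(16)
shift[0] = 3.0                         # synthetic "refusal" offset along dim 0
d_hat = refusal_direction(base + shift, base)
print(int(np.argmax(np.abs(d_hat))))   # 0: the planted dimension dominates
```

Computing two such directions from two contrast sets (political prompts vs. safety prompts) is what lets the researcher remove one subspace while leaving the other intact.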
The more consequential finding is architectural. On dense models, orthogonalizing output projections (o_proj, down_proj) is functionally equivalent to projecting the refusal direction out of the residual stream at inference time. On MoE models, weight-baking removes political refusals but not safety refusals. The inference-time hook removes both. The hypothesis: safety refusals route through specialized "safety experts" via the MoE router, and the routing decision happens before the output projection, so orthogonalizing down_proj doesn't catch it. This is a genuinely new observation — it means MoE safety alignment is structurally more resilient to weight-level tampering than dense model alignment, but only accidentally. The 397B model proved fragile: exactly one working setting (top-16 directions), with top-18 causing stuck repetition loops.
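On a dense model, the equivalence the researcher relies on is plain linear algebra: projecting the refusal direction out of a layer's output is the same as folding that projection into the layer's weights. A toy numpy sketch, with illustrative shapes only:

```python
import numpy as np

def project_out(h, direction):
    """Inference-time hook: remove the refusal component from activations."""
    d = direction / np.linalg.norm(direction)
    return h - np.outer(h @ d, d)

def orthogonalize(W, direction):
    """Weight-baking: fold (I - d d^T) into an output projection W (y = W @ x)."""
    d = direction / np.linalg.norm(direction)
    return W - np.outer(d, d @ W)

rng = np.random.default_rng(0)
d = rng.standard_normal(8)               # hypothetical refusal direction
W = rng.standard_normal((8, 5))          # toy down_proj-style weight
x = rng.standard_normal(5)

y_hook = project_out((W @ x)[None, :], d)[0]   # projected at inference time
y_baked = orthogonalize(W, d) @ x              # projection baked into weights
print(np.allclose(y_hook, y_baked))            # True on a dense layer...
# ...but in an MoE block the router fires *before* down_proj, so baking the
# projection into down_proj cannot undo an expert-selection decision.
```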
A separate effort abliterating India's Sarvam 30B and 105B — reasoning-capable MoE models — uncovered another wrinkle: reasoning models have two refusal circuits, not one. The <think> block and the final answer can disagree, with the model reasoning toward compliance in its chain-of-thought and then refusing anyway in the response. Perhaps most striking, a single English-computed refusal direction removed refusal across Malayalam, Hindi, and Kannada, suggesting refusal is pre-linguistic — encoded in the model's internal representation before language-specific layers. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sg5770/finally_abliterated_sarvam_30b_and_105b/)
Activation Watermarks, Egress Firewalls, and the Trust Gap
A new arXiv paper proposes a fundamentally different approach to LLM safety monitoring: instead of wrapping models with external guard classifiers that adversaries can study and evade, embed a secret keyed watermark directly into the model's internal activations. The method fine-tunes the LLM so that whenever it produces policy-violating output, its hidden states align with a secret direction that only the provider knows. Detection requires a single cosine similarity check per token — negligible overhead compared to running a separate guard model. Against adaptive attackers who know the monitoring algorithm but not the secret key, activation watermarking outperforms Llama Guard and Qwen Guard by up to 29% AUROC. Guard models collapse under sophisticated attacks (AutoDAN pushes their ASR above 68%), while the watermark's keyed randomization makes transfer attacks unreliable. The paper formalizes this as a security game and demonstrates multi-entity monitoring: the provider can track which specific piece of sensitive knowledge was accessed, with 95%+ attribution accuracy and fewer than one false alarm per 1,000 benign requests. (more: https://arxiv.org/abs/2603.23171v1)
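Detection in this scheme really is one dot product per token. A toy sketch of the keyed check (the dimension, threshold, and the synthetic "fine-tuned" alignment are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
D_MODEL = 64

# secret keyed direction, known only to the provider (dimension is made up)
secret = rng.standard_normal(D_MODEL)
secret /= np.linalg.norm(secret)

def token_score(hidden, key):
    """One cosine similarity per token: the entire detection cost."""
    return float(hidden @ key / np.linalg.norm(hidden))

def flag_sequence(hiddens, key, threshold=0.4):
    """Flag a generation when the mean per-token score crosses the threshold."""
    return float(np.mean([token_score(h, key) for h in hiddens])) > threshold

benign = [rng.standard_normal(D_MODEL) for _ in range(20)]
# simulate a model fine-tuned to align violating outputs with the key
violating = [rng.standard_normal(D_MODEL) + 5 * secret for _ in range(20)]

print(flag_sequence(benign, secret), flag_sequence(violating, secret))  # False True
```

An attacker who knows this algorithm but not `secret` has no direction to steer away from, which is the keyed-randomization argument in miniature.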
On the infrastructure side, iron-proxy tackles a related problem from the opposite direction: what happens when the AI agent itself is the untrusted workload? Built in Go, it's a MITM egress proxy with a built-in DNS server that enforces default-deny at the network boundary. Sandboxed workloads — CI pipelines, AI coding agents, containers running code you don't fully trust — can only reach domains you explicitly allowlist. The clever part is secret injection: real API keys never enter the sandbox. Workloads use proxy tokens, and iron-proxy swaps in real credentials at egress. If the sandbox is compromised, the attacker gets tokens that are worthless outside the proxy. Every request produces a structured JSON audit trail with full transform pipeline results, and the whole thing ships as a single binary with a single YAML config. For enforcement, it offers three tiers: DNS-only (easy to bypass), DNS + nftables (blocks hardcoded IPs), and TPROXY (kernel-level interception, catches everything). (more: https://github.com/ironsh/iron-proxy)
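The token-swap idea is easy to picture. A toy Python illustration of the pattern (iron-proxy itself is written in Go; every name, token, and field below is invented):

```python
# the sandbox only ever holds proxy tokens; the real keys live proxy-side
PROXY_TOKENS = {"tok_openai_abc123": "sk-real-key-never-in-sandbox"}
ALLOWLIST = {"api.openai.com"}

def egress(host, headers):
    """Default-deny egress with credential swap at the boundary."""
    if host not in ALLOWLIST:
        raise PermissionError(f"default-deny: {host} is not allowlisted")
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if token in PROXY_TOKENS:
        # swap the worthless sandbox token for the real credential here
        headers = {**headers, "Authorization": f"{scheme} {PROXY_TOKENS[token]}"}
    return headers

out = egress("api.openai.com", {"Authorization": "Bearer tok_openai_abc123"})
print(out["Authorization"])  # Bearer sk-real-key-never-in-sandbox
```

A compromised sandbox exfiltrates `tok_openai_abc123`, which is useless anywhere the proxy isn't.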
The trust question extends to what agents can actually read. The Agent Reading Test is a benchmark that doesn't test whether AI coding agents can reason — it tests whether they can reliably consume web documentation. Ten pages, each designed around a specific failure mode: 150K-character pages with canary tokens at strategic positions to map truncation limits, 80K of inline CSS before real content, client-side rendered pages that return empty shells, tabbed content where only the first variant is visible, HTTP 200 pages with "not found" messages, and cross-hostname redirects. A perfect score of 20 is unlikely for any current agent; the expected range is 14-18. The uncomfortable implication: agents that confidently cite documentation may be working from truncated, garbled, or incomplete versions of it. (more: https://agentreadingtest.com)
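The truncation probe is straightforward to reproduce. A hypothetical sketch of the canary technique (token format and offsets invented, not taken from the benchmark):

```python
def canary_page(size=150_000, marks=(0.25, 0.50, 0.75, 0.99)):
    """Build a `size`-character page with canary tokens at fractional offsets;
    which canaries an agent can quote back maps its truncation limit."""
    page = list("x" * size)
    canaries = {}
    for frac in marks:
        token = f"CANARY_{round(frac * 100):02d}"
        pos = round(frac * size)
        page[pos:pos + len(token)] = token   # splice the token in place
        canaries[token] = pos
    return "".join(page), canaries

text, canaries = canary_page(1000)
print(len(text), text.index("CANARY_99"))  # 1000 990
```

An agent that reports `CANARY_50` but not `CANARY_75` has silently truncated somewhere between the 500th and 750th character of what it claims to have read.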
These infrastructure and trust concerns crystallize in a practical post cataloging five patterns that consistently appear in AI systems that work in development but fail in production: no evaluation framework (iterating by feel), no confidence thresholding, prompts optimized on demo data, retrieval quality buried in end-to-end metrics, and integration layer underscoping — the async handling, graceful degradation, and output validation that typically runs 40-60% of total production effort but never shows up in demos. The litmus test: "Can you show me what the user sees when the AI call fails?" Teams who've built for production answer immediately. (more: https://www.reddit.com/r/learnmachinelearning/comments/1sesr3e/five_patterns_i_keep_seeing_in_ai_systems_that/)
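Two of the five patterns (no confidence thresholding, no visible failure path) collapse into one small piece of code. A hedged sketch, not from the post; the threshold, statuses, and messages are placeholders:

```python
def answer(query, llm_call, threshold=0.7):
    """Wrap an AI call with an explicit failure path and a confidence gate,
    so 'what does the user see when the call fails?' has a concrete answer."""
    try:
        text, confidence = llm_call(query)
    except Exception:
        return {"status": "error",
                "message": "The assistant is unavailable right now; please retry."}
    if confidence < threshold:
        return {"status": "low_confidence",
                "message": "Not confident enough to answer; escalating to a human."}
    return {"status": "ok", "message": text}

# stub model calls exercising the happy path and the gated path
print(answer("q", lambda q: ("42", 0.95))["status"])   # ok
print(answer("q", lambda q: ("42", 0.30))["status"])   # low_confidence
```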
The Gemma 4 Ecosystem Matures: From Benchmarks to Real-Time Voice
The ecosystem around Gemma 4 is already producing concrete tools. Parlor is perhaps the most striking: a fully on-device, real-time multimodal AI application that takes audio and video input and talks back, running entirely on an M3 Pro laptop. It pairs Gemma 4 E2B for understanding speech and vision with Kokoro for text-to-speech, achieving end-to-end latency of 2.5-3.0 seconds including speech recognition, response generation (~83 tokens/sec on GPU), and TTS. The creator's motivation is practical — self-hosting a free voice AI on a home server to help people learn English, with hundreds of monthly active users. Six months ago, this required an RTX 5090 for voice models alone. The fact that it runs on a consumer laptop with 3GB of RAM for the model represents a genuine step change. (more: https://github.com/fikrikarim/parlor)
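The reported numbers hang together as a budget. Back-of-envelope arithmetic, where only the ~83 tok/s decode speed comes from the project; the ASR and TTS shares and the reply length are assumptions:

```python
asr = 0.6                      # speech recognition, assumed share of the budget
tokens, tok_per_s = 100, 83    # assumed reply length; decode speed from the post
tts = 0.8                      # text-to-speech, assumed share of the budget
total = asr + tokens / tok_per_s + tts
print(round(total, 2))         # about 2.6 s, inside the reported 2.5-3.0 s window
```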
On the quantization front, the Gemma-4-E2B-it model quantized to NVFP4 (W4A4) using NVIDIA Model Optimizer is putting up surprising numbers on a single DGX Spark. At 89 tokens/sec single-user decode, it ranks #2 across ~57 models on the Spark Arena leaderboard — behind only Qwen3.5-0.8B, which is 2.5x smaller. More interesting is the concurrency behavior: at 10 concurrent sessions, the E2B model actually gains rank, taking #1 in both token generation and prompt processing at 4K-16K depths. The Per-Layer Embeddings architecture uses sliding window attention on 28 of 35 layers (window=512), keeping KV cache tiny even under load. The quantized model fits in 7.5GB on disk with vision and audio towers preserved in BF16. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sdumpp/gemma4e2bit_nvfp4_on_dgx_spark_1_in_9_spark_arena/)
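The KV-cache claim checks out on a napkin. With 28 of 35 layers windowed at 512 tokens, the cache footprint at 16K context is a fraction of an all-global-attention layout (counting cached token slots only; bytes per token cancel out of the ratio):

```python
layers, swa_layers, window, ctx = 35, 28, 512, 16_384
full_kv = layers * ctx                   # baseline: every layer caches all 16K
mixed_kv = (layers - swa_layers) * ctx + swa_layers * window
print(round(full_kv / mixed_kv, 1))      # roughly 4.4x smaller KV cache at 16K
```

That shrinking cache is a plausible reason the model gains rank as concurrency rises: ten sessions' worth of KV still fits where one session's would otherwise.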
Unsloth published a comprehensive Gemma 4 fine-tuning guide that doubles as a bug report. They found and fixed several upstream issues: Gemma-4 E2B and E4B share KV state across layers, and disabling the cache (use_cache=False, which every QLoRA tutorial recommends) causes the KV-shared layers to recompute locally, producing garbage logits. The fix is simple — keep use_cache=True — but the failure mode is silent and devastating: the model outputs gibberish without any error message. They also caught a gradient accumulation bug inflating losses, and a num_hidden_layers=0 misconfiguration in the 31B and 26B variants that crashes the cache on the first attention forward. E2B LoRA works on 8-10GB VRAM; 31B QLoRA fits in 22GB. Unsloth claims 30% less memory than FA2 setups with no accuracy loss. (more: https://unsloth.ai/docs/models/gemma-4/train)
The community quantization debate is settling into practical consensus: Bartowski GGUF quantizations are generally preferred for consistency and long-context agent coding sessions, while Unsloth has the edge in marketing and training workflows. One user reports Bartowski IQ2_M for the 26B model running at 65 tokens/sec on an RTX 3060 12GB without hallucinations. The 26B-A4B at Q6_K_XL fits most gaming GPUs with partial RAM offloading. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sdu8oz/bartowski_vs_unsloth_for_gemma_4/)
Meanwhile, inferrs enters the inference server space as a Rust-based alternative to vLLM and llama.cpp. It ships as a single binary with OpenAI, Anthropic, and Ollama-compatible APIs, supports CUDA, ROCm, Metal, Vulkan, and CPU backends, and features a "TurboQuant" KV cache management strategy alongside PagedAttention support. The pitch: vLLM's features without Python's overhead, llama.cpp's lightweight profile with streaming and multi-backend support. (more: https://github.com/ericcurtin/inferrs)
Voice Clones, AI Charts, and the Content Authenticity Problem
An AI "singer" named Eddie Dalton — created by content creator Dallas Little — now occupies eleven spots on the iTunes top 100 singles chart and holds the #3 album. Little writes songs, records them with AI, and invented Dalton's look, sound, and videos entirely through prompt-driven generation. The singles sit at positions 3, 8, 15, 22, 42, 44, 51, 58, 60, 68, and 79. One track, "Another Day Old," has 1.2 million YouTube views. But the numbers don't add up: Luminate reports just 6,900 total track sales, no radio airplay, and no streaming presence. How an AI act with under 7,000 sales holds eleven chart positions simultaneously is a question nobody at Apple seems interested in answering. The broader point: chart infrastructure was designed for a world where making and distributing music had real costs. When production cost drops to zero and distribution is unlimited, the integrity assumptions baked into these systems stop holding. (more: https://www.showbiz411.com/2026/04/05/itunes-takeover-by-fake-ai-singer-eddie-dalton-now-occupies-eleven-spots-on-chart-despite-not-being-human-or-real-exclusive)
The supply side of that equation keeps expanding. Mistral released Voxtral TTS, an open-weight text-to-voice model claiming to clone any voice from three seconds of audio and beat ElevenLabs on quality benchmarks. Community reception was lukewarm — the local install doesn't actually support voice cloning, the non-English output has an extremely hard accent despite multilingual claims, and the release was largely a reannouncement of something shown weeks earlier. (more: https://www.reddit.com/r/LocalLLaMA/comments/1selwtz/mistral_introduces_voxtral_tts_an_openweight/) VoxCPM2 enters the same space with three modes: voice design (create a brand-new voice), controllable cloning (clone with style guidance), and ultimate cloning (reproduce every vocal nuance through audio continuation), claiming state-of-the-art results on Seed-TTS-eval and other zero-shot TTS benchmarks. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sg89kl/new_tts_model_voxcpm2/)
On the music generation side, Ace Step released the 1.5 XL models — turbo, base, and SFT variants — a week after the software update, having apparently forgotten to publish the model weights with the initial release. (more: https://www.reddit.com/r/LocalLLaMA/comments/1semfx5/ace_step_15_xl_models_available/)
Agent Infrastructure: From DIY Hell to Managed Platforms
Anthropic's Claude Managed Agents entered public beta, and the pitch is explicit: stop building agent infrastructure yourself. The platform provides secure sandboxing, state management, credential scoping, long-running sessions (hours, even if you disconnect), error recovery, tracing, and multi-agent coordination where one agent can spin up others for parallel work. The gap it targets is real — sandboxing, checkpointing, scoped permissions, and orchestration typically take months to build in-house. The community response was mixed: some builders see it as Amazon Bedrock AgentCore with Claude as the model, while others are more pointed — "just give us a working model, stop nerfing our usage time and model quality." The complaint about quality degradation in managed contexts echoes a persistent concern: infrastructure is useless if the model running inside it has been throttled. (more: https://www.reddit.com/r/Anthropic/comments/1sfzu1e/anthropics_new_claude_managed_agents_public_beta/)
At the other end of the spectrum, botctl treats AI agents as system services managed through a declarative configuration. Write a BOT.md file with YAML frontmatter for settings and markdown body for the prompt, and botctl spawns Claude with your tools and workspace, running on a schedule. Every run saves its session for resumption, you can send messages to redirect running bots, and config changes are picked up automatically on the next run cycle — no restarts, no deploys. A skills system lets you search, install, and share reusable capability modules from GitHub. It's the systemd for AI agents: boring, practical infrastructure that treats autonomous processes as first-class system primitives. (more: https://botctl.dev/)
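A BOT.md might look something like this (every field name below is a guess inferred from the description, not botctl's documented schema):

```markdown
---
# YAML frontmatter: settings (field names assumed, not from botctl's docs)
name: changelog-bot
schedule: "0 9 * * 1"          # hypothetical cron-style weekly run
tools: [git, gh]
workspace: ~/projects/myapp
---

You are a release-notes assistant. Each run, summarize the commits
merged since the last run and append them to CHANGELOG.md.
```

The shape is the point: settings live in data, the prompt lives in prose, and both are versionable in the repo like any other service definition.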
Feynman positions itself as an open-source AI research agent. Point it at a topic and it dispatches four bundled agents — Researcher, Reviewer, Writer, Verifier — across papers, web, repos, and docs. The deep research workflow runs multi-agent investigation with parallel researchers, synthesis, and verification. A /replicate command attempts to reproduce paper experiments on local or cloud GPUs via Modal or RunPod integration. Every output is source-grounded with direct URLs, and it includes an /audit command that compares paper claims against public codebases. Built on Pi for the agent runtime and alphaXiv for paper search, it's the kind of specialized tool that only makes sense once agent frameworks mature enough to support it. (more: https://github.com/getcompanion-ai/feynman)
The Claude Code Video Toolkit takes the "what can Claude Code automate?" question in an unexpected direction: full video production. It provides skills, commands, templates, and tools that let Claude Code orchestrate everything from script to final render — AI voiceover via Qwen3-TTS, image generation via FLUX.2, music via ACE-Step, talking-head animation via SadTalker, and video generation via LTX-2. Projects track through a multi-session lifecycle (planning → assets → review → audio → editing → rendering), with automatic reconciliation of planned intent versus what files actually exist. Cloud GPU work runs on Modal ($30/month free tier) or RunPod. The author's use case is sprint review videos, but the framework is generalizable to any "explainer" format. (more: https://github.com/digitalsamba/claude-code-video-toolkit)
Infrastructure Shifts: Safetensors Governance, S3 Files, and MoE Anomaly Detection
Safetensors has joined the PyTorch Foundation as a foundation-hosted project under the Linux Foundation, alongside DeepSpeed, vLLM, and PyTorch itself. The move is a governance milestone: the format that eliminated arbitrary code execution from model distribution now has vendor-neutral trademark, repository, and governance. For users, nothing changes — same format, same APIs, same Hub integration. For the ecosystem, it signals that safetensors is no longer a Hugging Face project that happens to be open source; it's community infrastructure with formal contributor pathways. The roadmap ahead targets problems the whole ecosystem shares: device-aware loading (direct to CUDA/ROCm without CPU staging), first-class Tensor Parallel and Pipeline Parallel loading APIs, and formalized support for FP8, GPTQ, AWQ, and sub-byte integer types. The real news buried in the announcement: the team is working with PyTorch to make safetensors a serialization format for torch models inside PyTorch core. If that ships, pickle-based model distribution finally has an expiration date. (more: https://huggingface.co/blog/safetensors-joins-pytorch-foundation)
AWS announced S3 Files, which integrates Amazon EFS into S3 and allows any existing S3 data to be accessed as a network-attached filesystem. Mount any bucket or prefix inside an EC2 VM, container, or Lambda function, and your tools see a local directory. Changes propagate back to S3. The design story is the interesting part: the team spent months trying to unify file and object semantics into a single invisible abstraction, failed, locked their senior engineers in a room over Christmas 2024, and eventually realized the boundary between file and object is the feature, not the problem. They adopted a "stage and commit" model borrowed from git: changes accumulate in EFS and commit back to S3 roughly every 60 seconds. Lazy metadata hydration means mounting a bucket with millions of objects starts immediately. For high-throughput reads, "read bypass" reroutes directly to S3 via parallel GETs, achieving 3 GB/s per client. The edges are honest: renames are expensive (S3 has no native rename), explicit commit control isn't there at launch, and some object keys can't be represented as POSIX filenames. With S3 Tables (Iceberg), S3 Vectors, and now S3 Files, the object store is quietly becoming a polyglot data platform. (more: https://www.allthingsdistributed.com/2026/04/s3-files-and-the-changing-face-of-s3.html)
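The stage-and-commit model is simple enough to sketch. A toy version (the ~60-second interval and the tier names come from the article; the class and its methods are invented):

```python
import time

class StageAndCommit:
    """Toy model of the reported design: writes land in a fast file-semantics
    staging tier (EFS) and are committed to the object store on an interval."""
    def __init__(self, interval=60.0):
        self.staging, self.committed = {}, {}
        self.interval, self.last = interval, time.monotonic()

    def write(self, key, data):
        self.staging[key] = data            # file write: fast, local semantics
        self.maybe_commit()

    def maybe_commit(self, force=False):
        if force or time.monotonic() - self.last >= self.interval:
            self.committed.update(self.staging)   # batched PUTs back to S3
            self.staging.clear()
            self.last = time.monotonic()

fs = StageAndCommit()
fs.write("logs/a.txt", b"hello")
print("logs/a.txt" in fs.committed)   # False: staged, not yet committed
fs.maybe_commit(force=True)
print("logs/a.txt" in fs.committed)   # True: now visible as an object
```

The git analogy holds: the staging tier is the index, the interval commit is the push, and the gap between them is exactly the file/object boundary the team decided to expose rather than hide.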
On the research side, MoECLIP brings Mixture-of-Experts architecture to zero-shot anomaly detection — detecting defects in images of objects the model has never seen during training. The core problem: existing CLIP-based anomaly detectors apply the same transformation to every image patch regardless of whether it's background, object body, or anomaly. MoECLIP dynamically routes each patch to a specialized LoRA expert based on its characteristics. Two mechanisms prevent experts from learning the same thing: Frozen Orthogonal Feature Separation (FOFS) forces experts into non-overlapping input subspaces, while a simplex equiangular tight frame (ETF) loss maximizes angular separation of expert outputs. Across 14 benchmark datasets spanning industrial manufacturing and medical imaging (brain MRI, liver CT, colon polyps), MoECLIP achieves state-of-the-art performance with improvements of 3.0 AUROC on image-level and 1.1 AUROC on pixel-level detection. The Grad-CAM visualizations show genuine specialization: one expert focuses on anomalies, another on object body, a third on background, each with distinct routing patterns. The practical appeal for industrial quality inspection and medical screening is obvious — zero-shot means no per-factory or per-pathology training data required. (more: https://arxiv.org/abs/2603.03101v1)
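The ETF loss targets a known geometric optimum: k unit vectors cannot be more mutually separated than a simplex equiangular tight frame, where every pairwise cosine equals -1/(k-1). A sketch of the construction such a loss pushes expert outputs toward (our reading of the objective; the paper's exact formulation may differ):

```python
import numpy as np

def simplex_etf(k, dim):
    """k unit vectors in `dim` dimensions with maximal equal pairwise
    separation: every off-diagonal cosine equals -1/(k-1)."""
    assert dim >= k
    # center the identity to get the simplex geometry, scale rows to unit
    # norm, then rotate into `dim` dimensions via a random orthonormal basis
    M = np.sqrt(k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)
    U = np.linalg.qr(np.random.default_rng(0).standard_normal((dim, k)))[0]
    return (U @ M).T                      # rows are the k ETF vectors

E = simplex_etf(4, 16)                    # e.g. 4 experts in a 16-dim space
G = E @ E.T                               # Gram matrix of the expert targets
print(np.allclose(np.diag(G), 1.0), np.allclose(G[0, 1], -1 / 3))  # True True
```

Penalizing deviation of expert-output cosines from -1/(k-1) is what keeps the experts from collapsing onto the same representation, complementing FOFS's separation on the input side.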
Sources (22 articles)
- Abliterating Qwen3.5-397B on a Mac Studio revealed that MoE models encode refusal differently than dense models — safety refusals route through expert selection and survive weight-baking (reddit.com)
- Finally Abliterated Sarvam 30B and 105B! (reddit.com)
- Robust Safety Monitoring of Language Models via Activation Watermarking (arxiv.org)
- iron-proxy — Egress Firewall for Untrusted Workloads (github.com)
- Agent Reading Test — Can AI Agents Be Trusted With What They Read? (agentreadingtest.com)
- Five Patterns I Keep Seeing in AI Systems That Work in Dev but Fail in Production (reddit.com)
- Parlor — Real-time AI (audio/video in, voice out) on M3 Pro with Gemma E2B (github.com)
- Gemma-4-E2B-it NVFP4 on DGX Spark — #1 in 9 Spark Arena categories (reddit.com)
- Unsloth Gemma 4 Fine-Tuning Guide (unsloth.ai)
- Bartowski vs Unsloth for Gemma 4 (reddit.com)
- inferrs — Rust-Based Local LLM Inference Engine (github.com)
- AI singer now occupies eleven spots on iTunes singles chart (showbiz411.com)
- Mistral Introduces Voxtral TTS: Open-Weight Text-to-Voice Model — Clones Any Voice From 3 Seconds, Beats ElevenLabs (reddit.com)
- New TTS Model: VoxCPM2 — Voice Design, Controllable Cloning, Ultimate Cloning (reddit.com)
- Ace Step 1.5 XL Models Available (reddit.com)
- Anthropic's Claude Managed Agents Public Beta — Production Agent Infrastructure (reddit.com)
- botctl — Process Manager for Autonomous AI Agents (botctl.dev)
- Feynman — AI Learning Companion (github.com)
- Claude Code Video Toolkit (github.com)
- Safetensors is Joining the PyTorch Foundation (huggingface.co)
- S3 Files — AWS Reimagines Object Storage (allthingsdistributed.com)
- MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection (arxiv.org)