Fable 5 Falls in Days: Anthropic's Classifier Architecture Meets Reality
Today's AI news: Fable 5 Falls in Days: Anthropic's Classifier Architecture Meets Reality, Supply Chain: Greyware, IDE Sabotage, and the DLP Gap, Agent Tooling Matures: From Bewilderment to CLI, Inference: Squeezing the KV Cache and Breaking the Sequential Bottleneck, Beyond Commits: Version Control, WASM, and Platform Control, When Should AI Refuse? Evidence Scoring and Emergent Fairness. 22 sources curated from across the web.
Fable 5 Falls in Days: Anthropic's Classifier Architecture Meets Reality
Anthropic launched Fable 5 on June 9 with a bold claim: over a thousand hours of external red-teaming had produced zero universal jailbreaks. The architecture behind that confidence was unusual — Fable 5 and the restricted Claude Mythos 5 share the same weights, separated by a layer of safety classifiers that silently route flagged queries in cybersecurity, biology, chemistry, and distillation to the weaker Opus 4.8.
The claim lasted days. Red-teamer Pliny the Liberator published a coordinated multi-agent bypass he called "a pack hunt," producing step-by-step stack buffer overflow exploitation for x86 Linux — including disabling ASLR, writing vulnerable C server code, and compiling without protections — along with the Birch reduction pathway for methamphetamine precursors. The attack vectors are not novel individually — Unicode and Cyrillic homoglyph substitution, long-context smuggling, taxonomy framing, fiction wrapping — but their orchestration was. Pliny's decomposition-and-recomposition technique, extracting sensitive technical details in benign chunks across multiple turns and reassembling them, proved most effective. As he noted, "getting uplift on the process itself, like Birch reduction method or reductive amination, is much more doable" than requesting a named compound directly. A jailbroken Opus instance assisted in the backend, creating a multi-model attack chain that single-model safety evaluations cannot catch by design. (more: https://cybersecuritynews.com/anthropics-claude-fable-5-jailbroken)
The leak of Fable 5's approximately 120,000-character system prompt to GitHub added insult to injury. The CL4R1T4S repository exposes the internal safety instructions, tool definitions, copyright compliance rules, and behavioral framing Anthropic uses to govern the model — a blueprint for anyone designing future bypasses. The classifier-based architecture was meant to reduce friction for legitimate users while containing risk. Instead it created a binary: either the classifier fires and the weaker model answers anyway, or it does not and Fable responds unconstrained. Research from earlier this year showed safety mechanisms occupy roughly 0.005% of model parameters and can be pruned with 30 lines of code, jumping jailbreak success rates to 76-80%. Classifiers sitting in front of that thin layer were always going to be a speed bump, not a wall. The episode also highlights a structural problem with agentic pipelines: when one jailbroken model can assist another in evading controls, single-model safety evaluations become fundamentally insufficient. (more: https://github.com/elder-plinius/CL4R1T4S/blob/main/ANTHROPIC%2FCLAUDE-FABLE-5.md)
Supply Chain: Greyware, IDE Sabotage, and the DLP Gap
Chainguard introduced source code scanning for a category they call "greyware" — packages that honestly declare harmful functionality in their READMEs, pass every existing malware scanner, and do exactly what they say. The five npm examples are instructive: a Canva/Leonardo account fraud bot, a backdoor disguised as an AI coding assistant, a Solana token miner, a penetration testing toolkit, and a Chrome credential harvester that decrypts SQLite databases via macOS Keychain access. All remain available on npm with thousands of downloads each. All passed standard cooldown periods. Chainguard's four-pillar approach — publisher behavior analysis, artifact inspection, source-to-publish diff comparison, and sandboxed execution — processes over 100,000 packages daily and has blocked 52,000+. The greyware category is the important part: these packages do not obfuscate, do not inject, do not hide. They bet that nobody reads the README before running npm install, and they are right often enough to matter. Think of Unroll.me a decade ago — an email unsubscription service that buried data-selling in its terms of service. Same pattern, new ecosystem, higher stakes. (more: https://www.chainguard.dev/unchained/the-expanding-threat-landscape-chainguard-now-scans-source-code-for-traditional-malware-and-greyware)
At the other end of the pipeline, PyCharm's built-in "Full Line Code Completion" plugin, a local deep learning model that suggests entire lines, offers verify=False when it sees a requests.Session() being instantiated, which disables certificate verification for HTTPS requests. Seth Larson reported it to JetBrains, waited 90 days, and found the behavior unchanged in the next version. JetBrains could not decide whether it was a security vulnerability (they referred him to their bug bounty policy) or not (they asked him not to publicize). The classification ambiguity matters: without a CVE angle, the report stays deprioritized. Every code generation model likely has variants of this, but the IDE context makes it worse: the suggestion appears inline, at the moment of typing, in the tool developers trust implicitly. (more: https://sethmlarson.dev/are-insecure-code-completions-a-vulnerability)
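One way to catch this class of suggestion before it ships is a small static check. The sketch below is not from Larson's post or from JetBrains; it is a minimal illustration using Python's standard ast module, and the function name find_disabled_tls_verification is made up for this example. It flags both the keyword form (verify=False in a call) and the attribute form (session.verify = False).

```python
import ast


def find_disabled_tls_verification(source: str) -> list[int]:
    """Return sorted line numbers where TLS verification is explicitly disabled."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            # Keyword form, e.g. session.get(url, verify=False)
            for kw in node.keywords:
                if (
                    kw.arg == "verify"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is False
                ):
                    hits.append(node.lineno)
        elif isinstance(node, ast.Assign):
            # Attribute form, e.g. session.verify = False
            if isinstance(node.value, ast.Constant) and node.value.value is False:
                for target in node.targets:
                    if isinstance(target, ast.Attribute) and target.attr == "verify":
                        hits.append(node.lineno)
    return sorted(hits)


# Illustrative input only; the code is parsed, never executed.
SAMPLE = '''import requests
session = requests.Session()
session.verify = False
resp = session.get("https://example.com", verify=False)
ok = session.get("https://example.com")
'''

if __name__ == "__main__":
    print(find_disabled_tls_verification(SAMPLE))
```

A check like this only sees literal False values; a variable that happens to hold False at runtime would slip through, so it complements code review and does not replace it.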
Prompt Gate (ShieldNet-360) takes the opposite approach: instead of scanning what developers pull in, it intercepts what they push out. The open-source Go+Electron agent blocks unauthorized AI tools at DNS level and inspects content sent to approved tools through a layered DLP pipeline. Matching is Aho-Corasick plus regex, preceded by an adversary-resistant normalization pass that folds approximately 50 Cyrillic and Greek homoglyphs to Latin ASCII, strips zero-width characters, decodes inline base64, and runs NFKC before the pattern matcher fires. A per-tab multi-piece correlator reassembles secrets split across consecutive pastes. Launch benchmark on a million lines of real source code: precision 100% (zero false positives), recall 73%, under 1ms per scan, no ML dependency. The recall gap is honest, since exhaustive secret format detection is unsolved, but the false-positive rate matters more for adoption. (more: https://github.com/ShieldNet-360/prompt-gate)
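To make the normalize-then-match idea concrete, here is a rough Python sketch. It is not Prompt Gate's code (the project is written in Go). The homoglyph table is a small hypothetical subset, the order of the normalization steps is an assumption, and a single regex for the AWS access key ID format stands in for the real Aho-Corasick plus regex matcher.

```python
import base64
import binascii
import re
import unicodedata

# Small illustrative homoglyph table (assumption: the real table has ~50 entries).
HOMOGLYPHS = str.maketrans({
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K", "\u041c": "M",
    "\u041e": "O", "\u0420": "P", "\u0421": "C", "\u0422": "T",  # Cyrillic capitals
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c",
    "\u0445": "x",  # Cyrillic lowercase
    "\u0391": "A", "\u0392": "B", "\u0395": "E", "\u0399": "I", "\u039a": "K",
    "\u039f": "O",  # Greek capitals
})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")  # stand-in for the real pattern set


def normalize(text: str) -> str:
    """Apply NFKC, strip zero-width characters, fold homoglyphs to ASCII."""
    text = unicodedata.normalize("NFKC", text)
    text = ZERO_WIDTH.sub("", text)
    return text.translate(HOMOGLYPHS)


def decode_base64_runs(text: str) -> str:
    """Append the decoded form of any inline base64 run that is valid UTF-8."""
    decoded_parts = []
    for match in BASE64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded_parts.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError, ValueError):
            continue  # not real base64, or not text: ignore
    return " ".join([text, *decoded_parts])


def scan(text: str) -> list[str]:
    """Return secret-shaped strings found after normalization."""
    return AWS_KEY.findall(decode_base64_runs(normalize(text)))


if __name__ == "__main__":
    # AWS's documented example key, obfuscated with a Cyrillic A and a zero-width space.
    print(scan("\u0410KI\u200bAIOSFODNN7EXAMPLE"))
```

The point of normalizing first is that the pattern set stays small: the matcher only ever needs to know the canonical ASCII form of each secret.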
Staris occupies the validation layer between scanners and pentests: continuous AppSec validation that proves exploitability with working exploits and ships PR-ready patches at release cadence. The claimed 99% noise reduction comes from executing each vulnerability candidate rather than pattern-matching it. Every finding includes an execution trace, reproduction steps, and a code-level patch. At $4,900 per cycle or $25K/year for continuous validation, it undercuts traditional pentesting by roughly 40% while running at release cadence instead of annually. (more: https://staris.tech)
Agent Tooling Matures: From Bewilderment to CLI
Six months ago, Andrej Karpathy described the agentic tooling landscape as "a powerful alien tool handed around with no manual." This week's crop suggests the manual is being written.
Gadi Evron's approach to agent-driven security scanning is deliberately brute-force: scan the entire project, then file by file, function by function, with an independent executor and a judge, concurrency N=15 with full isolation, Karpathy auto-research style looping. The instruction: "continue trying no matter what, always check what you tried before." Token-heavy but effective. The key insight is abandoning deterministic harness complexity in favor of raw coverage with built-in deduplication. It does not replace purpose-built tools like raptor or OpenAnt that provide reachability verification and threat model adjustment, but as a complement it is "beyond effective." (more: https://www.linkedin.com/posts/gadievron_recently-i-started-asking-agents-to-force-find-share-7471042303424524288-950D)
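The shape of that loop can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the post shares no code, and demo_executor and demo_judge are hypothetical stand-ins for LLM agent calls. Only the concurrency limit of 15 and the "check what you tried before" deduplication come from the description above.

```python
import asyncio

MAX_CONCURRENCY = 15  # N=15, as described in the post


async def scan_all(targets, executor, judge, max_concurrency=MAX_CONCURRENCY):
    """Run executor on every target, dedupe findings, keep those the judge confirms."""
    semaphore = asyncio.Semaphore(max_concurrency)
    seen = set()      # findings already examined, so repeats are skipped
    confirmed = []

    async def scan_one(target):
        async with semaphore:
            for finding in await executor(target):
                if finding in seen:
                    continue
                seen.add(finding)
                if await judge(finding):
                    confirmed.append(finding)

    await asyncio.gather(*(scan_one(t) for t in targets))
    return sorted(confirmed)


async def demo_executor(target):
    # Hypothetical stand-in for an agent: reports one finding per file.
    file_name = target.split("::")[0]
    return [f"unchecked input in {file_name}"]


async def demo_judge(finding):
    # Hypothetical stand-in for an independent judge agent.
    return "vendor/" not in finding


if __name__ == "__main__":
    targets = ["auth.py::login", "auth.py::logout", "vendor/lib.py::parse"]
    print(asyncio.run(scan_all(targets, demo_executor, demo_judge)))
```

Two function-level scans of the same file produce the same finding here, and the seen set ensures the judge is only asked once.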
Ken is a pure Go port of semble's hybrid code search — BM25 lexical plus Model2Vec semantic embeddings plus reciprocal rank fusion — compiled to a single static binary. No cgo, no Python, no GPU, no API keys, air-gapped friendly. The numbers are specific and reproducible: 97% recall@10 in hybrid mode on semble's 1,251-query benchmark while consuming approximately 46x fewer tokens than grep+Read (4,120 vs 189,773 median on natural language queries). MCP-compatible, drop-in replacement with identical tool schemas. The MCP server auto-fetches a 60MB model on first run, serving BM25 until the download completes then upgrading to hybrid. Database schema indexing — Postgres, SQLite, MySQL introspection alongside code — means an agent answering "how does authentication work" gets the Go function, the SQL it runs, and the table definition in one ranked list. (more: https://github.com/townsendmerino/ken)
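Reciprocal rank fusion, the step that merges the lexical and semantic result lists, is simple enough to show in full. This is a generic Python sketch of the standard formula (each document scores the sum of 1 / (k + rank) over the lists it appears in), not Ken's Go implementation. The constant k=60 is the value commonly used in the literature and is an assumption here, as are the file names.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["auth.go", "session.go", "README.md"]       # hypothetical lexical results
    semantic_hits = ["schema.sql", "auth.go", "session.go"]  # hypothetical embedding results
    print(reciprocal_rank_fusion([bm25_hits, semantic_hits]))
```

Because the formula only uses ranks, the two retrievers never need comparable score scales, which is what makes it a convenient glue between BM25 and embedding search.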
Open Brain (OB1) tackles persistent cross-tool memory. The architecture is Supabase plus vector search plus MCP, designed so Claude, ChatGPT, Cursor, and whatever ships next month share the same memory of the user. Community traction is real: 20 merged PRs from multiple contributors covering imports from ChatGPT, Obsidian, Gmail, and X/Twitter, plus entity extraction workers, provenance chains, a Kubernetes self-hosted deployment, and Next.js dashboards. The extension system teaches through practical builds that compound — household knowledge through professional CRM through job hunt, each wired into the others. A 27-minute setup walkthrough covers the full system end to end. (more: https://github.com/NateBJones-Projects/OB1) (more: https://vimeo.com/1174979042/f883f6489a)
Google's Agents CLI demonstrates the convergence point: install a CLI plus skills, and the coding agent drives scaffold, evaluate, deploy, and monitor without the developer touching documentation directly. Skills are documentation injected into the agent's context: the capability is the CLI, the instructions are the skills. The demo takes an ask-your-data agent from idea to deployed GCP endpoint without a single manually-typed terminal command, including sandboxed code execution in production. The 4-second first-token threshold for customer-facing agents still means production traffic cannot go through a 100-agent swarm, so frameworks like ADK remain the right choice when latency and cost per token matter. (more: https://www.youtube.com/watch?v=1wfY7GCVvh0)
Claude Code's ultracode feature spawns 100+ agents via dynamic workflows for complex tasks — token-heavy, best reserved for hard problems. (more: https://www.youtube.com/shorts/kmX8h8EYgqE) Community repos worth noting: NotebookLM-Pi for connecting NotebookLM to Claude Code, Caveman for concise responses (studies show more concise frontier model outputs can be more accurate), and the Codex plugin for adversarial cross-review between Claude Code and Codex. (more: https://www.youtube.com/shorts/_Wn7b5TLUGU)
Inference: Squeezing the KV Cache and Breaking the Sequential Bottleneck
KVarN from Huawei delivers KV-cache quantization that does not trade throughput for capacity — the usual catch. The technique walks each fixed-size token tile through four stages: raw cache, Hadamard rotation along the channel dimension to spread per-channel outliers, iterative variance normalization (Sinkhorn-like, alternating column- and row-wise in log space), and asymmetric round-to-nearest at low bit-width. The shipped preset uses 4-bit keys and 2-bit values. On Qwen3-32B at 16K context with TP=2, KVarN matches FP16 accuracy, beats its throughput, and delivers approximately 4x KV-cache capacity. For context, vLLM's TurboQuant blog reports 40-52% lower throughput for 2.3-3.7x capacity; KVarN claims the upper-right corner those methods cannot reach. MLA support makes it the first vLLM-compatible sub-8-bit method for latent attention models like GLM-4.7-Flash (2.77x KV capacity at parity accuracy). Also supports hybrid Mamba/linear-attention models and speculative decoding with MTP. Calibration-free, one-flag install as a vLLM fork with JIT-compiled Triton kernels. (more: https://github.com/huawei-csl/KVarN)
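Two of the four stages are easy to illustrate in isolation. The pure-Python sketch below is not KVarN's Triton code and skips the variance-normalization stage entirely. It shows an orthonormal Hadamard rotation spreading a single channel outlier across all channels, followed by asymmetric round-to-nearest quantization at a chosen bit-width. The function names and the sample vector are made up for the example.

```python
import math


def hadamard_rotate(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_asymmetric(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] using a min/max range."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, zero):
    """Recover approximate floats from codes."""
    return [c * scale + zero for c in codes]


if __name__ == "__main__":
    channel = [0.1, -0.2, 0.05, 8.0, -0.1, 0.2, 0.0, -0.05]  # one large outlier
    rotated = hadamard_rotate(channel)
    codes, scale, zero = quantize_asymmetric(rotated, bits=2)
    print(max(abs(v) for v in channel), max(abs(v) for v in rotated))
    print(codes)
```

The rotation matters because a min/max quantizer spends its few levels on the full range: one outlier at 8.0 would otherwise force every small value into the same bucket. Since the normalized transform is its own inverse, applying hadamard_rotate again after dequantization returns to the original basis.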
Orthrus attacks the sequential bottleneck: a dual-view architecture sharing the autoregressive model's exact KV cache with a diffusion decoder for parallel token generation. The diffusion view proposes multiple tokens simultaneously; an intra-model consensus mechanism verifies them against the autoregressive distribution, guaranteeing strictly lossless output. O(1) memory overhead from sharing the KV cache rather than maintaining a separate draft model. Benchmarks on Qwen3 backbones: 4.25x to 5.36x average speedup across 1.7B to 8B parameters, outperforming EAGLE-3 and DFlash on acceptance length while maintaining throughput at 40K context lengths where DFlash degrades rapidly. On MATH-500, the 8B variant delivers approximately 6x speedup over the Qwen3-8B baseline with strictly lossless performance, while diffusion-only alternatives like Fast-dLLM-v2 suffer significant accuracy drops. Only 16% of parameters are fine-tuned; the base LLM stays frozen. MLX inference on Apple Silicon is included. (more: https://github.com/chiennv2000/orthrus)
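The lossless guarantee rests on a verification step of the kind used in speculative decoding. The sketch below shows the generic greedy version in Python: accept drafted tokens for as long as they match what the autoregressive model would have produced, then substitute the model's own token at the first mismatch. Orthrus's intra-model consensus mechanism may differ in its details; next_token and toy_model are hypothetical stand-ins, and a real system would check all draft positions in one parallel forward pass rather than a loop.

```python
def verify_draft(prefix, draft, next_token):
    """Return the tokens to emit: matching draft tokens plus one model token."""
    accepted = []
    context = list(prefix)
    for proposed in draft:
        expected = next_token(context)
        if proposed != expected:
            # First mismatch: emit the model's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
        context.append(proposed)
    # Entire draft accepted: the model contributes one extra token.
    accepted.append(next_token(context))
    return accepted


def toy_model(context):
    # Hypothetical deterministic "model": next token is last token plus one, mod 10.
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    print(verify_draft([3], [4, 5, 9, 0], toy_model))
```

Every emitted token is one the autoregressive model would have chosen itself, so the output matches plain greedy decoding; the speedup comes from how many draft tokens survive per verification pass.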
Qwen3.5-122B-A10B landed on HuggingFace — 122B total parameters with only 10B active per token, continuing Alibaba's MoE scaling push. (more: https://huggingface.co/Qwen/Qwen3.5-122B-A10B) GLM-OCR appeared on trending models as an application-specific variant targeting optical character recognition. (more: https://huggingface.co/zai-org/GLM-OCR)
Beyond Commits: Version Control, WASM, and Platform Control
Zed's DeltaDB makes the boldest claim in developer infrastructure this week: version control that captures every operation between commits, not just snapshots. Where Git records a commit, DeltaDB records a stream of fine-grained deltas with stable identities. A message and the edit it produced sit side by side, anchored to deltas rather than line numbers, surviving as code moves underneath. The motivation: "Increasingly, the conversation that generates the code is becoming the true source of our software." CRDT-based conflict-free replicated worktrees allow multiple agents and humans to edit simultaneously across machines. Git and CI remain for checks and external integration; DeltaDB handles the continuous agent-developer conversation that occurs between commits. Pull requests, review threads, and inline comments exist to reattach discussion to code after the fact because the discussion and code lived in separate places. DeltaDB puts them in the same place and lets the ceremony disappear. An early-access release is weeks away. (more: https://zed.dev/blog/introducing-deltadb)
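The anchoring idea is easy to see in miniature. The Python sketch below is a toy, not DeltaDB's data model and not a CRDT: it gives each inserted line a stable identity and attaches a comment to that identity, so the comment still resolves after edits shift the line numbers. The Buffer class and its methods are invented for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Buffer:
    """Toy text buffer whose lines carry stable ids (hypothetical model)."""
    lines: list = field(default_factory=list)          # list of (delta_id, text)
    _ids: count = field(default_factory=lambda: count(1))

    def insert(self, index, text):
        """Insert a line and return its stable id."""
        delta_id = next(self._ids)
        self.lines.insert(index, (delta_id, text))
        return delta_id

    def line_number(self, delta_id):
        """Resolve a stable id to its current 1-based line number, if present."""
        for number, (current_id, _) in enumerate(self.lines, start=1):
            if current_id == delta_id:
                return number
        return None


if __name__ == "__main__":
    buf = Buffer()
    buf.insert(0, "def login():")
    anchor = buf.insert(1, "    check_password()")
    comment = {"anchor": anchor, "text": "why no rate limit?"}
    print(buf.line_number(comment["anchor"]))  # position before the edit
    buf.insert(0, "import auth")
    print(buf.line_number(comment["anchor"]))  # position after the edit
```

A line-number anchor would now point at the wrong line; the id-based anchor follows the code.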
The WebAssembly Component Model 1.0 roadmap crystallized at the Bytecode Alliance Plumbers Summit. Five work areas define the path: a lazy ABI replacing eager allocation (zero-copy forwarding, drop-if-unused semantics), native browser implementation (Mozilla benchmarks show up to 2x speedup on DOM mutation-heavy workloads), simplified specification via guest and host C-ABIs, ecosystem documentation, and WIT expressivity features including optional imports, nullable types, and resource inheritance. The transition is non-breaking: lazy ABI ships as opt-in in 0.3.x, becomes default at 1.0. Cooperative threads are implemented at the Component Model level below WASI, with pthreads support largely done and LLVM patches landed. Stream splicing — zero-copy stream-to-stream forwarding — ships in an early P3 follow-up. The critical dependency is LLVM multivalue support at the C ABI level, which may be the longest lead-time item on the entire roadmap. (more: https://bytecodealliance.org/articles/the-road-to-component-model-1-0)
macOS 27 "Golden Gate" beta hides Asahi Linux partitions from the boot picker and Startup Disk preferences. Data remains intact — but invisible and unbootable. Asahi filed a bug report and advises users to avoid the beta entirely. In parallel, Linux 7.2 enables boot support for Apple M3 devices, though far from end-user usable. The juxtaposition is sharp: the open-source community reverse-engineers Apple Silicon support one kernel version at a time while a single firmware update can break it. (more: https://www.phoronix.com/news/macOS-27-Beta-Breaks-Asahi)
When Should AI Refuse? Evidence Scoring and Emergent Fairness
SIEVES (Selective Prediction through Visual Evidence Scoring) addresses a deployment problem accuracy benchmarks ignore: when should a vision-language model abstain rather than guess? The approach requires reasoner models to produce localized visual evidence alongside answers, then trains a selector to estimate localization quality across three axes — correctness, spatial accuracy, and coherence. On five out-of-distribution benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, AdVQA), SIEVES improves coverage by up to 3x compared to non-grounding baselines while maintaining user-defined risk levels. Critically, the selector transfers to proprietary reasoners without weight or logit access — tested on o3 and Gemini-3-Pro — providing coverage boosts beyond accuracy alone. The principle: a model that can point to where it found the answer is more likely to have actually found it. Model-agnostic, no benchmark-specific training required. (more: https://arxiv.org/abs/2604.25855v1)
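The metric behind "coverage at a user-defined risk level" can be computed directly. The Python sketch below is a generic selective-prediction calculation, not the paper's evaluation code: rank answers by selector score, then find the largest fraction that can be answered while the error rate among answered questions stays within the risk budget. The scores and labels are invented, and ties between scores are ignored for simplicity.

```python
def coverage_at_risk(scores, correct, max_risk):
    """Largest answered fraction whose error rate stays at or below max_risk."""
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    errors = 0
    for answered, (_, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            errors += 1
        if errors / answered <= max_risk:
            best = answered / len(ranked)
    return best


if __name__ == "__main__":
    selector_scores = [0.9, 0.8, 0.7, 0.6, 0.5]    # hypothetical selector outputs
    is_correct = [True, True, False, True, False]  # hypothetical ground truth
    print(coverage_at_risk(selector_scores, is_correct, max_risk=0.25))
```

A better selector pushes wrong answers toward the bottom of the ranking, which is exactly what lets coverage grow without breaching the risk budget.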
The Arrow's Impossibility paper reframes multi-agent fairness through social choice theory using a controlled hospital triage framework. Two LLM agents negotiate over three structured debate rounds — one aligned to a specific ethical framework via RAG, the other either unaligned or adversarially prompted to favor demographics over clinical need. The central finding: neither agent's allocation is ethically adequate in isolation, yet their joint final allocation satisfies fairness criteria that neither reaches alone. Aligned agents moderate bias through contestation rather than override, restoring access for marginalized groups without fully converting a biased counterpart. Even explicitly aligned agents exhibit intrinsic biases toward certain ethical frameworks, consistent with known left-leaning tendencies in LLMs. The connection to Arrow's theorem is precise: no aggregation mechanism simultaneously satisfies all desiderata of collective rationality. Multi-agent deliberation navigates rather than resolves this constraint. The practical implication for production systems is direct: fairness may need to be designed as an emergent property of agent interaction — with the system, not the individual model, as the appropriate unit of evaluation. For anyone building multi-agent pipelines for high-stakes decisions, the question is no longer "is each agent fair?" but "does the interaction protocol produce fair outcomes?" (more: https://arxiv.org/abs/2604.13705v1)
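For readers who want a feel for why aggregation is hard, the classic Condorcet cycle is the smallest example. It is a simpler relative of Arrow's result, not the paper's method or proof. In the Python sketch below, three hypothetical agents rank three options, every pairwise majority vote is decisive, and yet no option beats all the others.

```python
def majority_prefers(ballots, a, b):
    """True if a strict majority of ballots rank a above b."""
    wins = sum(1 for ballot in ballots if ballot.index(a) < ballot.index(b))
    return wins > len(ballots) / 2


def condorcet_winner(ballots):
    """Return the option that beats every other head to head, or None."""
    candidates = ballots[0]
    for candidate in candidates:
        others = [c for c in candidates if c != candidate]
        if all(majority_prefers(ballots, candidate, other) for other in others):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical rankings from three agents over options A, B, C.
    ballots = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]
    print(majority_prefers(ballots, "A", "B"))
    print(majority_prefers(ballots, "B", "C"))
    print(majority_prefers(ballots, "C", "A"))
    print(condorcet_winner(ballots))
```

Majority voting here prefers A to B, B to C, and C to A, so the group preference is cyclic even though each individual ranking is consistent. That is the kind of constraint a deliberation protocol has to navigate and cannot remove.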
Sources (22 articles)
- [Editorial] Anthropic's Claude Fable 5 Jailbroken (cybersecuritynews.com)
- [Editorial] CL4R1T4S Fable 5 Jailbreak Techniques (github.com)
- [Editorial] Chainguard Scans Source Code for Malware and Greyware (chainguard.dev)
- Are insecure code completions in PyCharm a vulnerability? (sethmlarson.dev)
- ShieldNet-360/prompt-gate (github.com)
- [Editorial] Staris Tech (staris.tech)
- [Editorial] Gadi Evron on Forcing Agents to Find (linkedin.com)
- townsendmerino/ken (github.com)
- [Editorial] OB1 Project (github.com)
- [Editorial] Vimeo Feature (vimeo.com)
- Google's Agents CLI: The CLI + Skills Combination to Ship AI Agents EASILY (youtube.com)
- ultracode is the most powerful claude code feature in months (youtube.com)
- Top 3 Underrated Open Source Repos Nobody Talks About (youtube.com)
- huawei-csl/KVarN (github.com)
- chiennv2000/orthrus (github.com)
- Qwen/Qwen3.5-122B-A10B (huggingface.co)
- zai-org/GLM-OCR (huggingface.co)
- Software is made between commits (zed.dev)
- The Road to the WASM Component Model 1.0 (bytecodealliance.org)
- macOS 27 Beta breaks the ability to boot Asahi Linux (phoronix.com)
- SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring (arxiv.org)
- Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration (arxiv.org)