Supply Chain Under Siege

Today's AI news: Supply Chain Under Siege, Confused Deputies and Composition Failures, Building the AI-Native Security Organization, Multi-Agent Coding and the Write-Only Paradigm, Mutation Testing: Verifying the Verifiers, The Knowledge Work Reckoning, Local AI: Models and Metal, NebulFog: When Demo Day Meets Reality. 22 sources curated from across the web.

Supply Chain Under Siege

LiteLLM versions 1.82.7 and 1.82.8 landed on PyPI carrying credential-stealing payloads — not a theoretical risk but a live compromise of one of the most widely deployed AI proxy gateways, used by thousands of teams to route calls across OpenAI, Anthropic, Bedrock, and dozens of other providers. The malicious packages triggered on Python invocation, harvested API keys, cloud credentials, and Kubernetes secrets, then exfiltrated them to an external endpoint. The GitHub issue tracking the incident has already drawn 137 repository watchers, and the community response ranges from forensic to existential: if the middleware sitting between your agents and every LLM provider gets owned, your entire stack is compromised in one pip install. (more: https://github.com/BerriAI/litellm/issues/24512)

The commentary was pointed. As one observer put it, the industry spent years debating which frontier model is smartest and which agent framework is best, while "the real boss fight was apparently: can your stack survive pip install?" The remediation checklist is unglamorous but essential — package provenance verification, build isolation, runtime egress controls, credential minimization, SBOMs, artifact signing — essentially treating AI infrastructure the way we should have been treating all infrastructure. For teams reconsidering their proxy layer dependency, alternatives like Requesty offer an OpenAI-compatible gateway with PII scanning, automatic failover in under 20ms, and EU data residency, though the real lesson is architectural: any single point of credential aggregation is a single point of catastrophic failure. (more: https://www.linkedin.com/posts/ownyourai_litellm-just-reminded-the-entire-ai-industry-activity-7442249976409174016-A6pX) (more: https://www.requesty.ai)
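
The provenance item on that checklist is mechanically simple. As a minimal sketch (file name and digest are illustrative), this is the same digest comparison pip performs in hash-checking mode before anything executes:

```python
import hashlib

def verify_artifact(path: str, pinned_sha256: str) -> bool:
    """Compare a downloaded package artifact against a pinned digest
    before installation, streaming the file to keep memory flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == pinned_sha256
```

In a requirements file this takes the form `somepkg==1.2.3 --hash=sha256:...` (names hypothetical), installed with `pip install --require-hashes -r requirements.txt`; a payload swapped in at the registry then fails the digest check instead of running on import.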

The LiteLLM compromise did not happen in isolation. Sysdig's Threat Research Team revealed that TeamPCP — the group behind the Trivy GitHub Actions compromise on March 19 — subsequently poisoned Checkmarx's ast-github-action using the same payload, same AES-256+RSA-4096 encryption, same tpcp.tar.gz filename, and the same credential-harvesting kill chain targeting Runner.Worker memory, AWS IMDS, and Slack/Discord webhooks. The only meaningful differences were the vendor-specific typosquat domains (scan.aquasecurtiy[.]org for Trivy, checkmarx[.]zone for Checkmarx), a deliberate deception technique that makes exfiltration traffic look like legitimate vendor calls in CI logs. The cascading mechanism is the key insight: stolen credentials from one compromised action enable poisoning of additional actions in affected repositories, creating a self-propagating supply chain attack. Both domains returned clean verdicts from threat intelligence feeds at exploit time. The only defense that caught both waves was runtime behavioral detection — correlating IMDS access with subsequent binary data upload to external domains — because the attacker must ultimately execute observable system calls regardless of entry vector.  (more: https://www.sysdig.com/blog/teampcp-expands-supply-chain-compromise-spreads-from-trivy-to-checkmarx-github-actions)
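
That correlation is worth making concrete. A toy sketch of the detection logic (event fields, window, allowlist, and thresholds are all hypothetical; real detection runs on syscall and flow telemetry, e.g. via Falco rules):

```python
from dataclasses import dataclass

IMDS_IP = "169.254.169.254"   # AWS instance metadata service
ALLOWED = {"api.github.com"}  # hypothetical egress allowlist
WINDOW_S = 300                # correlation window in seconds

@dataclass
class NetEvent:
    ts: float      # epoch seconds
    dst: str       # destination host or IP
    tx_bytes: int  # bytes sent

def correlate(events):
    """Flag uploads to unapproved domains that follow an IMDS read:
    the behavioral pattern that caught both TeamPCP waves regardless
    of which typosquat domain the payload used."""
    alerts, last_imds = [], None
    for e in sorted(events, key=lambda e: e.ts):
        if e.dst == IMDS_IP:
            last_imds = e.ts
        elif (last_imds is not None
              and e.ts - last_imds <= WINDOW_S
              and e.dst not in ALLOWED
              and e.tx_bytes > 10_000):
            alerts.append(e)
    return alerts
```

The point is that this rule needs no threat-intel verdict on the destination domain, which is exactly why it still fired when the feeds returned clean.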

Rounding out the supply chain theme: hackers planted a top Google search result for "Claude plugins" that redirected to a malicious page, exploiting the trust surface where developers search for AI tooling extensions. The SEO poisoning attack underscores that supply chain compromise extends beyond package registries into the discovery layer itself. (more: https://www.reddit.com/r/Anthropic/comments/1s2e81m/a_top_google_search_result_for_claude_plugins_was/)

Confused Deputies and Composition Failures

A detailed writeup from Origin HQ walks through a privilege escalation pattern in multi-agent coding workflows that should give pause to anyone running Claude Code and Codex in the same repository. The attack leverages the classic confused deputy problem: each agent understands its own configuration as sensitive and resists direct self-modification, but treats the other agent's config files as ordinary project files it can freely write. Split the payload across .codex/instructions.md (containing instructions targeting Claude) and .claude/CLAUDE.md (containing instructions targeting Codex), and neither agent sees the full attack plan. Codex happily rewrites Claude's settings to weaken its approval model; Claude happily rewrites Codex's config to disable its sandbox. The sandbox boundary, designed to prevent direct self-modification, doesn't help because the other agent's config lives inside the writable workspace. (more: https://www.originhq.com/blog/escaping-the-sandbox-confused-deputies)

This matters because multi-agent development is increasingly normal — one agent for large refactors, another for fast coding — and the security model for each was designed in isolation. Anthropic's separate context file (.claude/CLAUDE.md vs. CLAUDE.md) actually makes the split-payload attack easier. Both Claude Code and Codex offer mechanisms to move security-sensitive configuration out of the workspace (system-level managed settings), but neither ships with a default-deny posture, and the configuration surface is riddled with pitfalls. The broader point: once multiple principals share a writable environment, confused deputy problems emerge whether the principals are humans, processes, or AI agents.
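
One cheap mitigation that follows directly from the writeup: treat agent configuration surfaces as protected paths in CI, so neither agent can silently rewrite the other's settings. A sketch, with the path conventions assumed from the article (adjust to your layout):

```python
# Directories and files that hold agent configuration; writes here
# should require human sign-off rather than landing silently.
AGENT_CONFIG_PREFIXES = (".claude/", ".codex/")
AGENT_CONFIG_FILES = ("CLAUDE.md", "AGENTS.md")

def flag_config_writes(changed_paths):
    """Return the changed files that touch an agent config surface."""
    return [p for p in changed_paths
            if p.startswith(AGENT_CONFIG_PREFIXES)
            or p.split("/")[-1] in AGENT_CONFIG_FILES]
```

Wired into a pre-merge check that fails when the returned list is non-empty, this closes the cross-agent write path even while both configs remain inside the workspace.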

Meanwhile, Alex Polyakov's team built an AI agent that cracked all eight levels of the new Gandalf CTF for Agents — and the agent had zero prior knowledge of AI security attack techniques, inventing them from first principles using a 15-lens adversarial analysis framework. The composition failures it exploited are instructive: the output filter checks for the password as a complete string, so outputting one letter per line bypasses it (Lens 7, Composition Failure). Intent classifiers trained on "tell me the password" patterns miss extraction wrapped in novel-writing scenes (Lens 15, Distribution Boundary). Most damning: even with every filter active, the output filter itself becomes an oracle — when it blocks a response, it confirms the password was present, enabling metadata reconstruction through letter counts, vowel counts, and Scrabble scores (Lens 5, Oracle Construction). The takeaway for anyone building guardrails: defense layers that are independently secure can compose insecurely, and any component that behaves differently based on secret state leaks that state. (more: https://www.linkedin.com/posts/alex-polyakov-cyber_aisecurity-promptinjection-redteaming-activity-7442191503374004224-o7s0)
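
Both failure modes are easy to reproduce. A minimal sketch (the secret and the filter are stand-ins, not Gandalf's actual implementation):

```python
SECRET = "POTION"  # stand-in for the level password

def output_filter(text: str) -> bool:
    """True = block. Matches the secret only as a contiguous
    substring, the composition failure behind the bypass."""
    return SECRET in text

# Same information, different shape: sails through unblocked.
letter_per_line = "\n".join(SECRET)  # "P\nO\nT\nI\nO\nN"

# Oracle construction: the block/allow decision itself reveals
# whether a guessed string is the secret, so the filter leaks the
# very state it is supposed to protect.
def filter_oracle(guess: str) -> bool:
    return output_filter(f"the password is {guess}")
```

Any guardrail whose observable behavior is a function of the secret admits this oracle; the fix is to make refusals indistinguishable from ordinary denials.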

Building the AI-Native Security Organization

Five talks from recent security conferences paint a remarkably convergent picture of where AI-native security operations are headed — and how far most organizations still have to go to get there. Dan Guido of Trail of Bits delivered the most concrete transformation case study: after roughly a year of aggressive AI adoption, the company maintains 94 plugins containing 201 specialized agents and over 400 reference files encoding 14 years of audit knowledge. Bug discovery jumped from 15 per week to 200, with 20% of all bugs reported to clients now initially discovered by AI. Sales revenue averages $8 million per representative, roughly double the consulting industry benchmark. The organizational design is as important as the tooling: a maturity matrix that makes AI proficiency a first-class professional capability (with consequences for staying stuck at level zero), internal hackathons that force engineers into bypass-permissions mode to learn real sandboxing constraints, a curated skill marketplace that prevents random plugin installation, and a mandatory seven-day package cooldown policy enforced via MDM. (more: https://youtu.be/kgwvAyF7qsA?si=VfvPVI1CfXHH5wWD)

Rob Lee of SANS demonstrated Claude Code running on the SIFT forensic workstation, producing a complete incident response report — system profile, attack chain, malware deployment, persistence mechanisms, chronological timeline, MITRE ATT&CK overlays, and remediation recommendations — from a single memory image in 18 minutes. A full C-drive analysis with comprehensive PDF report took 14 minutes. The configuration investment: roughly 90 minutes of skill-building via markdown files rather than full MCP servers. Lee is now organizing a SANS-sponsored hackathon (April 1 through May 15, $22,000 prize pool) to accelerate forensic MCP engineering and solve the context rot problem that limits sustained multi-system analysis. (more: https://youtu.be/OsUg3TlAqjQ?si=R6YW5C86fXRcBX7y)

Paul and Ryan from OpenAI articulated a philosophy that amounts to "security guardrails are free — just ask for them." Their approach: commit a text file describing what you want checked, write a tool call that returns zero or one, feed it into your CI framework. No vendor contracts, no frameworks. They demonstrated a product with roughly a million lines of code — about 250,000 of which are prompts — where zero humans write code. Every PR gets reviewed by an agent against a threat model checked into the repo. Every AppSec review finding gets distilled not into point fixes but into executable guardrails (cheap bespoke lints, dependency scanners forking 16 agents daily) that statically disallow entire classes of mistakes. The punchline: they spend approximately 50% of their token budget producing code and 50% refining it through automated review, and the mandatory package cooldown policy was implemented by copy-pasting a colleague's PR diff and telling Codex "make it so." (more: https://youtu.be/U2O14Jd3MBU?si=5cZv8i5x7Ux1ompg)
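
The zero-or-one contract really does need almost no machinery. A hedged sketch of such a check, with the rule-file format and patterns invented for illustration (not OpenAI's actual tooling):

```python
import re

# Hypothetical rules file committed to the repo, one regex per line,
# '#' starts a comment. Example guardrails.txt:
#   aws_secret_access_key\s*=    # no hardcoded AWS secrets
#   eval\(                       # no eval on request data

def check(rules_path: str, source: str) -> int:
    """Return 0 when the source violates no rule, 1 otherwise:
    the zero-or-one contract any CI framework can consume."""
    with open(rules_path) as f:
        rules = [line.split("#")[0].strip() for line in f]
    for pattern in filter(None, rules):
        if re.search(pattern, source):
            return 1
    return 0
```

Run it over each changed file in CI and gate the merge on the exit code; distilling an AppSec finding into a new rule is then a one-line diff.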

Josh Saxe, formerly leading AI security work at Meta, made the evaluation case: classical ML metrics (precision, recall, F-score) assume oracle-quality ground truth labels, but SOC analysts disagree at double-digit rates on whether alerts are true positives, and determining whether a binary is malware reduces to the halting problem. Even a 1% label-flip rate causes measured accuracy to plummet. His thesis: autonomous cyber defense systems should be evaluated the way we evaluate human security engineers — through rubric-based assessment of reasoning quality, evidence gathering, first-principles analysis, and decision justification, not just binary outcome accuracy. With a well-calibrated LLM judge and as few as 100 labeled samples, teams can hill-climb all evaluation dimensions toward a deployment bar. (more: https://youtu.be/rO2yA52U_i4?si=1VH0rGeQUV1htCrO)
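
The label-noise claim can be made concrete with one line of algebra, under illustrative assumptions (1% prevalence of true positives, labels flipped independently at the stated rate): even a flawless detector gets scored at 50% recall.

```python
def measured_recall(prevalence: float, flip_rate: float) -> float:
    """Recall credited to a *perfect* detector when ground-truth
    labels are independently flipped with probability flip_rate.
    Counted TPs: true positives whose label survived the flip.
    Counted FNs: true negatives flipped to 'positive' -- the perfect
    detector rightly predicted negative, so they score as misses."""
    tp = prevalence * (1 - flip_rate)
    fn = (1 - prevalence) * flip_rate
    return tp / (tp + fn)

# 1% malicious prevalence, 1% label noise: measured recall is 0.5.
print(measured_recall(0.01, 0.01))
```

At 50% prevalence the same 1% noise barely registers, which is why label noise is devastating precisely in security settings, where positives are rare.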

Google's Heather Adkins and Four (DeepMind's CISO) presented Big Sleep and Code Mender — an end-to-end autonomous vulnerability discovery and patching engine targeting zero false positives. Big Sleep recreates Project Zero's expertise through an agentic reasoning loop: deep codebase understanding, vulnerability hypothesis formation, testing via debugger and Python interpreter, and verified exploit generation as proof. The system has found deep memory safety bugs in open-source projects — every finding includes working proof-of-vulnerability code. Code Mender then generates candidate patches, verifying them through fuzzing, formal verification, differential testing, and LLM review before submission. The 178 autonomously generated fixes now in open-source projects mark the beginning of what Adkins calls the "vuln apocalypse" — an impending flood of discovered vulnerabilities that will overwhelm traditional patching workflows. (more: https://youtu.be/B_7RpP90rUk?si=RXKs-QEddJhscQWC)

Multi-Agent Coding and the Write-Only Paradigm

Tim Sehn of DoltHub spent a week and $3,000 using Gas Town — a multi-agent coding orchestrator built on Dolt's version-controlled database — to build DoltLite, a drop-in SQLite replacement with Git-style version control powered by Prolly Tree storage. The result: 18,000 lines of new C code, all 87,000 SQLite acceptance tests passing, and a working implementation of commits, branches, merges, conflicts, reverts, cherry-picks, tags, and audit tables. Sehn hasn't written C in 20 years. The key factors: two reference implementations to crib from (Dolt and SQLite), a clean interface boundary, an extensive existing test suite, deep domain expertise, and a close-to-unlimited budget. Gas Town's parallelism was most useful in the early implementation phase; as the project matured, it became hand-to-hand combat with a single Claude session, debugging performance issues that required telling Claude to "deep research" and "spend as much time as you need" before it finally identified the missing edit-in-place optimization. (more: https://www.dolthub.com/blog/2026-03-24-a-week-in-gas-town)

The write-only code paradigm deserves scrutiny. Sehn explicitly warns: "The first time you stare at an 8,500-line Git diff and try to parse what happened, you've lost." The code literally contains magic numbers with no explanation. DoltLite is "by agents for agents — humans, code review at your own risk." The economics are revealing: Gas Town burns roughly $100/hour, and Sehn suspects tools like it will kill the $200/month Claude Code Max subscription because the token consumption is unsustainable at flat-rate pricing. His team's first question was whether DoltLite is maintainable — a question that applies to every multi-agent codebase.

On the framework side, Clemens Hoenig released Lazy Fetch, a CLI companion for Claude Code built after analyzing 18 agentic engineering frameworks. Its philosophy distills to a single claim: context is the real engineering problem, not architecture or coordination. It provides a structured task loop across five phases (Read, Plan, Implement, Validate, Document), a context engine that outputs @file references, persistence between sessions, and an MCP server with 15 tools. (more: https://www.linkedin.com/posts/hoenig-clemens-09456b98_lazy-fetch-activity-7442146427448750080-cUgi) Meanwhile, the RuVector project published a comprehensive GitHub issue outlining a "Second Brain with Pi" architecture — a portable, self-learning vector database running on Raspberry Pi hardware, featuring GNN layers, dynamic mincut, edge cognition, and neural meshes as part of a broader agentic infrastructure stack. (more: https://github.com/ruvnet/RuVector/issues/295)

Mutation Testing: Verifying the Verifiers

Senko's blog post on mutation testing for AI-generated code addresses a gap that keeps widening as vibe-coded applications proliferate: how do you know your tests actually catch bugs? The workflow is elegant in its simplicity. First, ask AI to write tests against existing code (the usual approach). Second, have a separate AI session review the test suite for tautological tests — tests that verify mocks, frameworks, or themselves rather than actual behavior. Third — the mutation testing step — have yet another AI session (with no access to the tests) identify realistic places to introduce bugs, then run the test suite against each mutation. The result: 98% code coverage looked reassuring until 25% of the mutations produced bugs the tests missed entirely. (more: https://blog.senko.net/improving-ai-generated-tests-using-mutation-testing)

The technique is old, but AI makes it dramatically cheaper. Where human developers would balk at manually designing dozens of plausible mutations, an LLM generates them in seconds — and because it hasn't seen the tests, the mutations aren't biased toward tested paths. The critical operational detail: context isolation between sessions prevents the mutation-generating AI from cheating by examining what's already tested. The four-step recipe (TDD if feasible, review tests for meaninglessness, mutation test the survivors, fix the gaps) is the most practical AI code quality framework published this week.

The Knowledge Work Reckoning

Daniel Miessler published a 6,000-word argument that AI will replace most knowledge work — and that this is good news. The piece is less provocative than its headline suggests and more structurally rigorous than most entries in the genre. His core framework: knowledge workers operate on four capability layers (knowledge, understanding, intelligence, creativity), and expertise is not a separate layer but the combination of the first three plus experience. AI already matches or exceeds median human performance on knowledge, understanding, and intelligence. Creativity — the layer humans reflexively claim as their moat — accounts for only 4% of US work activities at median human level, per McKinsey. The real gap is not capability but capture: most expert knowledge lives in people's heads, passed brain-to-brain, and dies when Cliff retires. The "articulation gap" closes permanently every time someone writes a skill, documents an SOP, or publishes a process — it's a ratchet that only turns one direction. (more: https://danielmiessler.com/blog/exactly-why-and-how-ai-will-replace-knowledge-work)

Miessler's "Lattice" architecture — a unified daemon where every individual, team, department, and company broadcasts SOPs, metrics, work items, and budgets via queryable APIs — is the prescriptive component. It addresses the transparency problem that makes most companies "a giant soup sandwich": CEOs and CFOs have almost no real-time visibility into what work is being done, at what quality, at what cost. Whether or not the Lattice ships as described, the diagnosis rings true for anyone who has endured the consultant-audit-OKR-reset cycle.

On a different axis of organizational measurement, Ibrahim Bashir's "South Star Metrics Revisited" provides a diagnostic toolkit for when north star metrics go wrong: detrimental (metric improves, customer suffers), out-of-reach (team can't meaningfully move the number), incomplete (optimizing one funnel stage while ignoring the rest), pressure (metric so urgent it crowds out all future investment), inconsequential (metric that expired but nobody sunset it), nonsensical (taken to logical extreme, the metric destroys value), and incongruent (two metrics in the same portfolio fight each other). The Windows Update case — forcing reboots to hit 90% adoption, creating one of the most universally hated user experiences in software history — remains the canonical detrimental metric example. The framework matters for AI teams specifically because agent-driven optimization can hit any of these failure modes at machine speed. (more: https://open.substack.com/pub/runthebusiness/p/south-star-metrics-revisited?r=v5uaz)

Local AI: Models and Metal

Google Research published TurboQuant, a KV cache quantization technique that achieves comparable performance to FP16 KV cache at sub-4-bit precision on long-context benchmarks. The approach draws conceptual parallels to the Burrows-Wheeler Transform used in zip compression — reordering data to expose redundancy before quantizing. This is KV cache only, not model weight quantization, but the implications for local inference are significant: KV cache memory is the binding constraint for long-context workloads, and a 4x reduction means the difference between fitting a context window and hitting OOM. Nvidia published a competing approach using tricks from image compression at even higher compression rates, suggesting this is an active research front. (more: https://www.reddit.com/r/LocalLLaMA/comments/1s31kvq/turboquant_from_googleresearch/)
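
The arithmetic behind that constraint, for an assumed Llama-70B-class shape (80 layers, 8 KV heads under GQA, head_dim 128, 128k context; numbers illustrative):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits):
    """Size of the K and V caches across all layers for one sequence.
    The leading 2 counts the separate K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits // 8

fp16 = kv_cache_bytes(80, 8, 128, 131_072, 16)
q4   = kv_cache_bytes(80, 8, 128, 131_072, 4)
print(fp16 / 2**30, q4 / 2**30)  # 40.0 10.0 (GiB)
```

Dropping 30 GiB of cache per full-length sequence is the difference between serving a 128k context on a single consumer GPU and not serving it at all.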

Qwen3.5 is earning a reputation as the "working dog" of open-weight models — agentic-first, context-hungry, and borderline useless without a substantial system prompt. The 27B variant reportedly doesn't become useful below 3,000 tokens of input context, spending up to 5,000 tokens just orienting itself. This makes sense if the models were trained agentic-first: they expect to know their environment, tools, and modality before producing output. Community reports suggest the 122B MoE variant at Q6_K is the sweet spot for serious local work, while the 35B MoE is considered notably weaker than benchmarks suggest. The practical takeaway: invest in your system prompt, give explicit constraints and objectives, and treat these models as specialists that want a job description, not a greeting. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ryljps/qwen35_is_a_working_dog/)
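
What "a job description, not a greeting" can look like in practice, as a hypothetical sketch (the role, tools, paths, and constraints are all invented; adapt them to your stack):

```python
# A job-description-style system prompt for an agentic-first model,
# passed as the first message of an OpenAI-compatible chat request.
SYSTEM = """Role: senior Python refactoring agent.
Environment: repo mounted at ./src; run tests with `pytest -q`.
Tools: file_read, file_write, run_tests.
Objective: remove duplicated parsing logic under src/ingest/.
Constraints: no new dependencies; keep all tests green; ask before
deleting any public function.
Output: unified diffs only."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Start with src/ingest/csv_loader.py."},
]
```

The structure matters more than the wording: environment, tools, objective, and constraints up front, so the model spends its orientation tokens on your problem rather than guessing its situation.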

A comprehensive llama.cpp benchmark shootout across the RTX 5090, DGX Spark (GB10), AMD AI395 (Strix Halo), and dual AMD R9700 cards reveals the current hardware landscape for local inference. The RTX 5090 is unmatched when models fit in 32GB VRAM — 5,989 t/s prompt processing and 205 t/s generation on Qwen3.5 35B MoE — but fails completely on 70B+ models. The AMD AI395 with 98GB unified memory emerges as the dark horse, the only non-enterprise node capable of running Qwen3.5 122B MoE at nearly 20 t/s generation (with the critical tip that -mmp 0 is required to force models into RAM for the iGPU). The DGX Spark's numbers appear suspiciously low — community members report 1,742 t/s prompt processing versus the benchmarker's 605, suggesting a configuration issue. ROCm dominates prompt processing on AMD hardware while Vulkan sometimes edges ahead on text generation for MoE models, though Vulkan proved less stable under extreme load. Two GMKtec Evo-X2 boxes connected via USB4 running the full Qwen3.5 397B at IQ4_NL achieved roughly 13 t/s — the bleeding edge of consumer-grade frontier-scale inference. (more: https://www.reddit.com/r/LocalLLaMA/comments/1s3170r/benchmark_the_ultimate_llamacpp_shootout_rtx_5090/)

NebulFog: When Demo Day Meets Reality

The NebulFog Singularity AI security hackathon results offer a sobering meta-lesson about the gap between building and presenting. The top-scoring teams — AgentRange and DoYouKnowWhatYouBuiltLastSummer, both at 8.3 — exist in a quantum state of Schrödinger's excellence: their scores suggest strong work, but complete absence of observable demo content means the judges evaluated ghosts. The de facto winner by evidence is Genomics at 7.9, which demonstrated docker-based fuzzing harnesses for genomic file format parsers and a full-chain prompt injection flowing from BED track files through to model output. Over 15 teams experienced catastrophic presentation failures — OBS placeholder screens, muted cameras, corrupted audio — suggesting systematic infrastructure issues rather than individual incompetence. (more: https://nebulafog.ai/singularity-results.html)

The substantive entries reveal where the AI security community is focusing energy: multi-agent penetration testing orchestration (the top two entries), context-aware tool-call interception for agent security (SCTX at 7.5, blocking credential exfiltration by scoring tool calls based on surrounding behavioral context), and genomic file format fuzzing with embedded prompt injection. The event's sharpest lesson: in a hackathon, a working demo beats a brilliant architecture, clear audio beats perfect code, and sometimes the best strategy is knowing when to stop talking. Teams that used their full 600-second allocation didn't score higher than those who focused on a tight 282-second demonstration — judges valued signal density over comprehensive coverage, a principle that applies well beyond hackathons.

Sources (22 articles)

  1. Tell HN: Litellm 1.82.7 and 1.82.8 on PyPI are compromised (github.com)
  2. [Editorial] (linkedin.com)
  3. [Editorial] option to litellm (requesty.ai)
  4. [Editorial] (sysdig.com)
  5. A Top Google Search Result for Claude Plugins Was Planted by Hackers (reddit.com)
  6. [Editorial] (originhq.com)
  7. [Editorial] (linkedin.com)
  8. [Editorial] (youtu.be)
  9. [Editorial] (youtu.be)
  10. [Editorial] (youtu.be)
  11. [Editorial] (youtu.be)
  12. [Editorial] (youtu.be)
  13. [Editorial] (dolthub.com)
  14. [Editorial] (linkedin.com)
  15. [Editorial] Second Brain with Pi (github.com)
  16. [Editorial] (blog.senko.net)
  17. [Editorial] (danielmiessler.com)
  18. [Editorial] (open.substack.com)
  19. TurboQuant from GoogleResearch (reddit.com)
  20. Qwen3.5 is a working dog. (reddit.com)
  21. [Benchmark] The Ultimate Llama.cpp Shootout: RTX 5090 vs DGX Spark vs AMD AI395 & R9700 (reddit.com)
  22. [Editorial] (nebulafog.ai)