Supply Chain Siege: TeamPCP's Multi-Ecosystem Rampage

Today's AI news: Supply Chain Siege: TeamPCP's Multi-Ecosystem Rampage, Anthropic Expands the Agent Platform, Squeezing Every Bit: Quantization Meets Kernel Co-Design, Benchmarking Intelligence and Safety, Agent Infrastructure: Memory, Knowledge, and the Framework Debate, Defensive Tooling and Protocol Abuse. 17 sources curated from across the web.

Supply Chain Siege: TeamPCP's Multi-Ecosystem Rampage

The most comprehensive supply chain attack in recent memory now has a detailed timeline, and it reads like a masterclass in cascading compromise. Security researcher ramimac has published a full reconstruction of TeamPCP's multi-week campaign spanning GitHub Actions, Docker Hub, npm, PyPI, and OpenVSX — a single threat actor systematically weaponizing the trust relationships that hold modern software delivery together (more: https://ramimac.me/teampcp).

The chain begins with incomplete containment. In late February 2026, a misconfigured GitHub workflow in Aqua Security's Trivy repository — the widely-used vulnerability scanner — allowed theft of a personal access token. Aqua detected the initial breach and rotated credentials, but the rotation "wasn't atomic and attackers may have been privy to refreshed tokens." That gap proved catastrophic. On March 19, TeamPCP pushed a malicious v0.69.4 tag to Trivy, referencing imposter commits that spoofed real contributor identities (including Guillermo Rauch's). The payload fetched malicious Go files from a typosquatted C2 domain (scan.aquasecurtiy.org) and injected credential-stealing code into build artifacts. Within hours, 75 of 76 Trivy tags were hijacked, malicious binaries landed on GitHub Releases, Docker Hub, and npm, and the compromise spread laterally to tfsec, trivy-action, and traceeshark via a compromised service account.
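The typosquatted C2 domain differs from the legitimate one by a single transposition, which is exactly what a small edit-distance check catches. A minimal defensive sketch (not the campaign's tooling, and the trusted-domain list here is illustrative) flags outbound domains that are near, but not equal to, known-good ones:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

TRUSTED = {"aquasecurity.org"}  # illustrative allowlist

def looks_typosquatted(domain: str, max_dist: int = 2) -> bool:
    """Flag domains close to, but not identical to, a trusted domain."""
    return any(0 < edit_distance(domain, t) <= max_dist for t in TRUSTED)
```

The transposed `scan.aquasecurtiy.org` host sits at Levenshtein distance 2 from the real domain, well inside the threshold an egress filter could apply.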

The lateral movement was surgical. Stolen npm tokens fed a self-propagating worm (CanisterWorm) that enumerated publishable packages and infected 28 of them in under 60 seconds, using an Internet Computer Protocol canister as C2. Checkmarx's KICS and ast-github-action got the same treatment — all 35 KICS tags and all 91 AST tags force-pushed to malicious commits with credential stealers exfiltrating to checkmarx.zone. LiteLLM fell next: a PyPI token harvested via Trivy in CI/CD enabled publication of malicious versions 1.82.7 and 1.82.8, complete with persistence mechanisms and a 50-path credential sweeper. The attacker's own fork bomb bug caused a crash in an automated development environment, which is how the malware was actually discovered — a reminder that attackers debug in production too. TeamPCP then pushed malicious Telnyx packages to PyPI with WAV steganography payloads — executables hidden inside audio file frames to evade static analysis. Perhaps most alarming: the kamikaze.sh payload evolved through at least four versions, progressing from Kubernetes-focused DaemonSet deployment to a full SSH/Docker worm scanning local subnets, with a targeted wiper for Iranian systems detected via timezone or locale settings. TeamPCP claimed via vxunderground to have exfiltrated 54GB of data. CISA added the campaign to its Known Exploited Vulnerabilities catalog, giving federal agencies 21 days to remediate.

Hackaday's weekly security roundup contextualizes this alongside two other developments worth noting: Google's disclosure of Darksword, a second significant iOS exploit chain discovered in the wild (following the Coruna chain just weeks prior), and the FBI's second alert about AVRecon malware infecting nearly 400,000 end-of-life consumer routers from Netgear, TP-Link, D-Link, and Zyxel — devices modern enough for Wi-Fi 5 but abandoned by manufacturers (more: https://hackaday.com/2026/03/27/this-week-in-security-second-verse-worse-than-the-first/). The Trivy attack, Hackaday notes, may be the most successful supply chain compromise in spreading malicious packages across multiple registries simultaneously. GitHub's immutable releases — the one Trivy version that wasn't compromised — represent the clearest remediation path, but adoption remains negligible.

Anthropic Expands the Agent Platform

Anthropic is steadily transforming Claude from a chat interface into something closer to an autonomous operating system. The latest move: Claude can now directly use your computer — opening apps, navigating browsers, filling spreadsheets — as a research preview in Claude Cowork and Claude Code (more: https://www.reddit.com/r/Anthropic/comments/1s2gp5r/you_can_now_enable_claude_to_use_your_computer_to/). This is not a theoretical capability demo; it is a shipping feature that puts an AI agent in the driver's seat of your desktop environment.

Paired with this is a new cloud scheduling system that lets Claude run tasks autonomously on Anthropic's infrastructure, even when your machine is off. You configure a prompt, connect GitHub repositories and MCP connectors (Slack, Linear, Google Drive), set a cadence, and Claude executes — cloning repos, pushing to claude/-prefixed branches, and creating sessions you can review after the fact. The system supports hourly, daily, weekday, and weekly frequencies, with environments that provide network access, secrets, and setup scripts. Each repository is cloned fresh per run, and by default Claude can only push to branches it namespaces — a sensible guardrail, though one that can be disabled per-repo (more: https://code.claude.com/docs/en/web-scheduled-tasks). The practical use cases are real: overnight CI failure analysis, weekly dependency audits, daily PR review. The authorization boundary question — what happens when a scheduled agent with repository write access and MCP connectors makes a mistake at 3 AM — is the elephant in the room that Anthropic's documentation acknowledges only through branch naming restrictions.
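The shape of such a scheduled task, and the default branch guardrail, can be pictured roughly as follows (a hypothetical sketch; the field names are illustrative, not Anthropic's actual schema):

```python
# Hypothetical representation of a scheduled-task definition.
scheduled_task = {
    "prompt": "Analyze overnight CI failures and summarize root causes",
    "repos": ["org/service-api"],          # cloned fresh per run
    "connectors": ["slack", "linear"],     # MCP connectors
    "cadence": "daily",                    # hourly | daily | weekday | weekly
    "push_branches": "claude/*",           # default namespace guardrail
}

def allowed_push(branch: str, pattern: str = "claude/*") -> bool:
    """Default guardrail: the agent may only push to claude/-prefixed branches."""
    return branch.startswith(pattern.rstrip("*"))
```

The guardrail is only a prefix check on branch names, which is why disabling it per-repo widens the blast radius so sharply.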

Meanwhile, Anthropic CEO Dario Amodei has predicted that AI could handle end-to-end software development within 6 to 12 months — a claim met with the internet's characteristic blend of tracker websites (monthssincelastaiclaim.fun) and pointed skepticism (more: https://www.reddit.com/r/AINewsMinute/comments/1s2b0fc/anthropic_ceo_predicts_ai_could_handle_endtoend/). The top-voted response on Reddit came from an engineer at "a bleeding edge SF AI tech company" who reports that "the best brightest agents can't even solve basic problems in our codebase without 6 rounds of human reviews." The gap between "the engineer does not write code anymore" and "the engineer does not have to instruct the AI how to write the code" remains enormous. The product announcements, however, suggest Anthropic is betting the gap closes faster from the tooling side than the model side — give the agent persistent access to real environments rather than waiting for reasoning to become flawless.

Squeezing Every Bit: Quantization Meets Kernel Co-Design

Google Research's TurboQuant represents a genuinely novel approach to the KV cache bottleneck that plagues long-context inference. The core insight: rather than applying quantization directly to raw vectors (which requires storing per-block scaling factors that partially defeat the purpose of compression), first rotate the data to simplify its geometry, then quantize (more: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/).

The system works in two stages. PolarQuant converts standard Cartesian coordinates into polar form — replacing "go 3 blocks East, 4 blocks North" with "go 5 blocks at 37 degrees." Because the angle distributions are predictable and concentrated, the method eliminates the expensive per-block calibration step that traditional quantizers require. On top of this, QJL (Quantized Johnson-Lindenstrauss) applies a 1-bit error-correction pass using random projections — the mathematical equivalent of a high-speed checksum that corrects bias without adding memory overhead. The result: 3-bit KV cache quantization on Gemma and Mistral with zero accuracy loss on long-context benchmarks, no training or fine-tuning required, and up to 8x speedup in attention computation on A100 GPUs. For vector search applications, TurboQuant achieves superior recall compared to state-of-the-art baselines that use larger codebooks with dataset-specific tuning — meaning it generalizes better while being cheaper to run.
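The rotate-then-quantize intuition is easy to see in two dimensions (a toy sketch of the geometry, not Google's implementation): convert a Cartesian vector to polar form, snap the angle to a few uniform levels, and reconstruct. The radius is preserved exactly; only the angle is coarsened.

```python
import math

def to_polar(x: float, y: float) -> tuple[float, float]:
    """Cartesian -> (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)

def quantize_angle(theta: float, bits: int = 3) -> float:
    """Snap an angle in [-pi, pi] to one of 2**bits uniform levels."""
    step = 2 * math.pi / (2 ** bits)
    return round(theta / step) * step

def roundtrip(x: float, y: float, bits: int = 3) -> tuple[float, float]:
    """Quantize a 2D vector in polar form and map it back to Cartesian."""
    r, theta = to_polar(x, y)
    q = quantize_angle(theta, bits)
    return r * math.cos(q), r * math.sin(q)
```

Because angle distributions after rotation are concentrated and predictable, uniform levels suffice, which is what lets PolarQuant skip per-block calibration.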

The paper landed and someone immediately shipped it. A practitioner pushed a real TurboQuant-packed KV cache implementation into vLLM's Triton attention path and benchmarked it on a ZGX GB10: 4 million token KV cache, 1 million token context window, 716.7 tokens/second with Qwen3.5-35B AWQ (more: https://www.linkedin.com/posts/ownyourai_i-stayed-up-till-2-am-finished-turboquant-share-7443239890034794497-FaHA). The community response was immediate: questions about stacked compression artifacts (AWQ weights plus KV quantization), thermal throttling on consumer hardware, and whether the approach extends to dense (non-MoE) models. The "make sure you own your AI" coda in the original post captures a real tension: these optimizations make frontier-class inference feasible on local hardware, which has implications for who controls the deployment stack.

On the complementary side of the efficiency equation, AutoKernel automates GPU kernel optimization using the same agentic loop that Karpathy's autoresearch pioneered for LLM training. Point it at any PyTorch model, and it profiles bottleneck kernels, extracts them as standalone Triton or CUDA C++ implementations, then runs an autonomous edit-benchmark-keep/revert cycle — approximately 40 experiments per hour, 320 overnight. The orchestrator uses Amdahl's law to prioritize: a 1.5x speedup on a 60%-of-runtime kernel beats a 3x speedup on a 5% kernel. It supports 9 kernel types (matmul, flash attention, RoPE, fused MLP, etc.) across both Triton and native CUDA C++ backends, with a 5-stage correctness harness that prevents the agent from "optimizing" by producing garbage (more: https://github.com/RightNow-AI/autokernel). The integration with KernelBench — Stanford's 250-problem benchmark for AI-generated GPU kernels — provides standardized scoring, and a HuggingFace Kernels export path makes sharing optimized kernels trivial. The design philosophy is notable: correctness first, single file to modify, TSV logging. No infrastructure, no dashboards — just a loop that makes your kernels faster while you sleep.
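The Amdahl's-law prioritization is easy to make concrete (a sketch of the scheduling arithmetic, not AutoKernel's actual code):

```python
def overall_speedup(runtime_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only one kernel gets faster."""
    return 1.0 / ((1.0 - runtime_fraction) + runtime_fraction / kernel_speedup)

# 1.5x on a kernel that is 60% of runtime...
big = overall_speedup(0.60, 1.5)    # 1.25x end-to-end
# ...beats 3x on a kernel that is only 5% of runtime.
small = overall_speedup(0.05, 3.0)  # only ~1.03x end-to-end
```

Ranking candidate kernels by this end-to-end figure, rather than by raw per-kernel speedup, is what keeps the agent's 40-experiments-per-hour budget pointed at the bottleneck.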

Benchmarking Intelligence and Safety

ARC-AGI-3 marks a philosophical shift in how we measure machine intelligence. Where previous ARC iterations tested static puzzle-solving, version 3 is interactive: agents must explore novel environments, acquire goals on the fly, build adaptable world models, and learn continuously across multiple steps — with sparse feedback and no natural-language instructions to lean on. A 100% score means an AI agent can beat every game as efficiently as a human. The benchmark is explicitly designed to resist brute-force memorization and pre-loaded knowledge, providing replayable runs and a developer toolkit for transparent evaluation (more: https://arcprize.org/arc-agi/3). The framing is blunt: "As long as there is a gap between AI and human learning, we do not have AGI." ARC-AGI-3 makes that gap measurable by testing intelligence across time — planning horizons, memory compression, belief updating — rather than just checking final answers.

At the opposite pole of the evaluation spectrum, LABSHIELD asks not whether AI can reason abstractly, but whether it can act safely in the physical world. This multimodal benchmark evaluates 33 state-of-the-art models (including GPT-5, Gemini-3, Claude-4, and Qwen3-VL) on 164 laboratory tasks spanning four operational complexity levels and four safety tiers, grounded in OSHA standards and the Globally Harmonized System for chemical hazard classification. The evaluation uses synchronized multi-view RGB-D streams from a robotic platform across workbench, fume hood, and sink scenarios (more: https://arxiv.org/abs/2603.11987v1).

The findings are sobering. Models show an average 32% drop between general-domain multiple-choice accuracy and semi-open safety performance in professional laboratory settings. GPT-4o and Claude-4 Sonnet achieve identical MCQ accuracy (73.2%) but diverge sharply on safety reliability — Claude-4 Sonnet scores 51.2 versus GPT-4o's 41.7 on the composite safety score, driven by better hazard identification (Unsafe Jaccard: 33.0% vs 24.3%). High-risk scenarios (S2/S3) expose systematic failures where models consistently underestimate severe hazards despite their catastrophic potential. Most striking: embodied-specific models show no significant safety advantage over general-purpose multimodal models, even at larger scales. The researchers identify "perceptual blindness to transparent media" as a critical bottleneck — attention maps show models disproportionately focusing on high-contrast opaque objects while ignoring safety-critical glassware and liquid interfaces. The paper explicitly argues that MCQ performance is a "poor proxy for embodied safety," echoing the broader measurement crisis where benchmark scores fail to predict real-world reliability.
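The Unsafe Jaccard metric quoted above compares a model's predicted hazard set against ground truth; the score itself is plain intersection-over-union (a sketch, with made-up hazard labels rather than the paper's taxonomy):

```python
def jaccard(predicted: set[str], ground_truth: set[str]) -> float:
    """Intersection-over-union of two hazard label sets."""
    if not predicted and not ground_truth:
        return 1.0  # both empty: perfect agreement by convention
    return len(predicted & ground_truth) / len(predicted | ground_truth)

# A model that spots 2 of 3 true hazards and hallucinates 1 extra scores 0.5:
score = jaccard({"flammable", "corrosive", "spill"},
                {"flammable", "corrosive", "unsealed_flask"})
```

The metric punishes both missed hazards and hallucinated ones, which is why the 33.0% versus 24.3% gap is meaningful despite identical MCQ accuracy.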

Agent Infrastructure: Memory, Knowledge, and the Framework Debate

The question of how AI agents should remember, share knowledge, and get composed is rapidly splitting into distinct infrastructure layers — and the tooling arriving this week covers all of them.

SAGE (Sovereign Agent Governed Experience) tackles the data layer with an approach borrowed from distributed systems: every memory write goes through BFT (Byzantine Fault Tolerant) consensus validation before committing. Running on CometBFT with four in-process validators (sentinel, dedup, quality, consistency), SAGE treats agent memory as institutional infrastructure rather than a flat file. Memories carry confidence scores, decay over time, and are cryptographically signed with Ed25519. The v5.1 release adds self-healing name reconciliation, agent-to-agent messaging via sage_pipe, and three memory modes — full (every turn), bookend (boot + reflect), or on-demand (zero automatic token usage) — letting operators control exactly how much context budget goes to memory (more: https://github.com/l33tdawg/sage). The RBAC system enforces per-agent isolation by default, with domain-level read/write permissions and multi-org federation. Four accompanying research papers report that memory-equipped agents outperform memoryless ones in a 50-vs-50 study, with a cumulative learning correlation of rho=0.716 versus 0.040 without memory. The architecture is serious infrastructure — and arrives at a moment when the 2026 OWASP Top 10 for LLMs explicitly calls out "memory persistence poisoning" as an emerging threat category.
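The confidence-decay idea can be sketched as exponential decay with a half-life (illustrative only; the field names and the 30-day half-life are assumptions, not SAGE's actual schema):

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    confidence: float   # score assigned at write time, in [0, 1]
    age_days: float

def decayed_confidence(m: Memory, half_life_days: float = 30.0) -> float:
    """Exponential decay: effective confidence halves every half_life_days."""
    return m.confidence * 0.5 ** (m.age_days / half_life_days)

m = Memory("CI flake traced to shared runner cache", confidence=0.9, age_days=60.0)
# After two half-lives the effective confidence drops from 0.9 to 0.225.
```

Decay of this kind gives stale memories a natural eviction pressure without requiring an explicit delete, which matters when memory writes are consensus-gated and expensive.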

At the knowledge layer, Mozilla AI's cq tackles a problem that has the recursive elegance of a spider eating its parent. LLMs trained on Stack Overflow's corpus effectively killed Stack Overflow (from 200,000+ monthly questions at peak to 3,862 in December 2025). Now agents run into the same problems in isolation because their training data is stale — so agents need their own Stack Overflow. cq provides exactly this: before an agent tackles unfamiliar work, it queries the commons; if another agent has already learned that Stripe returns 200 with an error body for rate-limited requests, your agent knows that before writing a single line of code. Novel discoveries get proposed back, and other agents confirm or flag staleness. Knowledge earns trust through use, not authority. The working system includes plugins for Claude Code and OpenCode, an MCP server for local knowledge stores, a team API for organizational sharing, and human-in-the-loop review (more: https://blog.mozilla.ai/cq-stack-overflow-for-agents/).

A video editorial from a practitioner with experience building on both traditional frameworks (Pydantic AI, LangGraph) and newer SDKs (Claude Agent SDK, Codex SDK) crystallizes the orchestration-layer tradeoff. The batteries-included SDKs are powerful — an entire agent in a single TypeScript file with built-in tool support, sub-agents, skills, and MCP servers — but significantly slower, more token-heavy, and non-deterministic due to reasoning overhead. For production agents serving multiple users at scale, traditional frameworks still win on speed, cost, and control. The decision framework is straightforward: if it is just you and delay is acceptable, use the SDK; if multiple people need it, it needs to scale, or sub-second response times matter, build with a framework (more: https://youtu.be/gmaHRwijOXs?si=25R4dqa9wrMiylst). At the minimalist end of this spectrum, nanoAgent demonstrates that a functional AI agent — bash execution, file read/write, iterative tool calling — fits in roughly 100 lines of Python using OpenAI function calling, with recent hardening for malformed JSON and unknown tool references (more: https://github.com/sanbuphy/nanoAgent).
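The hardening nanoAgent added, tolerating malformed JSON and unknown tool names, amounts to a defensive dispatch step like this (a sketch in the spirit of the project, not its actual code):

```python
import json
import subprocess

TOOLS = {
    "read_file": lambda path: open(path, encoding="utf-8").read(),
    "bash": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True).stdout,
}

def dispatch(tool_name: str, raw_args: str) -> str:
    """Run one model-requested tool call, surviving bad JSON and bad names."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        return f"error: malformed arguments JSON ({e.msg}); please resend"
    tool = TOOLS.get(tool_name)
    if tool is None:
        return f"error: unknown tool '{tool_name}'; available: {sorted(TOOLS)}"
    try:
        return str(tool(**args))
    except Exception as e:  # report to the model instead of crashing the loop
        return f"error: {type(e).__name__}: {e}"
```

The key design choice is that every failure is returned to the model as tool output rather than raised, so the agent loop keeps iterating and the model can self-correct.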

Defensive Tooling and Protocol Abuse

ClawdSecbot addresses a gap that the TeamPCP campaign makes painfully concrete: what happens between an AI agent and the LLM endpoint it depends on? This desktop security tool monitors local AI bot processes (initially targeting Openclaw-type agents), intercepts API traffic through a protection proxy, analyzes requests and responses for dangerous operations in real time, and confines bot processes within OS-level sandboxes — macOS Seatbelt or Linux seccomp — with auto-recovery if the sandbox is bypassed. The architecture is Flutter desktop over a Go shared library communicating via FFI, with a plugin system for supporting different bot types and an LLM protocol translation layer that proxies OpenAI-compatible requests across seven providers including Anthropic, DeepSeek, and Ollama (more: https://github.com/secnova-ai/ClawdSecbot). The audit logging captures every request, tool call, risk detection, and token usage with full traceability. It is the kind of defensive instrumentation that should exist for any agent granted persistent system access — and the kind that almost nobody currently runs.

VMPacker operates in the older but equally active domain of binary protection. This ARM64 ELF virtual machine protection system translates native instructions into custom VM bytecode with randomly mapped opcodes, per-instruction XOR encryption, reversed execution order, and indirect dispatch via runtime-constructed jump tables. It supports 121 ARM64 instructions with 100% base A64 coverage across 63 VM opcodes, with a multi-layer protection stack: OpcodeCryptor, bytecode reversal, token-based entry trampolines, and function-pointer jump tables filled at runtime on the stack to break IDA cross-references (more: https://github.com/LeoChen-CoreMind/VMPacker). The modular interface-driven design is extensible to x86 and RISC-V architectures and PE/Mach-O binary formats. In a landscape where AI-assisted deobfuscation tools are becoming operational — binary analysis frameworks running as MCP servers, live SOC agents that automatically reverse-engineer ELFs and DLLs — the arms race in binary protection continues to escalate.
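Two of those layers, randomly mapped opcodes and per-instruction XOR with reversed execution order, are simple to illustrate in isolation (a toy sketch over a made-up four-op ISA; real VMPacker operates on ARM64 encodings):

```python
import random

NATIVE_OPS = ["LOAD", "STORE", "ADD", "BRANCH"]

def build_opcode_map(seed: int) -> dict[str, int]:
    """Assign each native op a randomly shuffled VM opcode (per-build mapping)."""
    codes = list(range(len(NATIVE_OPS)))
    random.Random(seed).shuffle(codes)
    return dict(zip(NATIVE_OPS, codes))

def pack(program: list[tuple[str, int]], opmap: dict[str, int],
         key: int) -> list[tuple[int, int]]:
    """Translate to VM bytecode, XOR each instruction, reverse the order."""
    encoded = [(opmap[op] ^ key, operand ^ key) for op, operand in program]
    return encoded[::-1]

def unpack(bytecode: list[tuple[int, int]], opmap: dict[str, int],
           key: int) -> list[tuple[str, int]]:
    """What the embedded VM dispatcher undoes at runtime, one step at a time."""
    inv = {v: k for k, v in opmap.items()}
    return [(inv[op ^ key], operand ^ key) for op, operand in bytecode[::-1]]
```

Each layer is trivially reversible on its own; the protection comes from stacking several of them behind runtime-constructed jump tables, so static tools never see the cleartext mapping.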

And then there is DOOM Over DNS, which compresses the entirety of shareware DOOM into approximately 1,964 DNS TXT records across a Cloudflare zone, reassembles the WAD file at runtime from public DNS queries using a PowerShell script, and loads the .NET game engine DLLs directly into memory without touching disk. The WAD file never persists locally. Cloudflare serves the chunks globally, cached at the edge, for free. "They are not a file storage system," the README notes. "Nobody at the IETF was thinking about them being used as a file storage system when they wrote RFC 1035. And yet here we are" (more: https://github.com/resumex/doom-over-dns). It is a playful hack, but it doubles as a proof of concept for why DNS traffic deserves more scrutiny than most network policies give it — the same channel that serves DOOM chunks could just as easily serve C2 payloads, as multiple real-world sandbox escape techniques have demonstrated.
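The underlying trick, splitting a blob into TXT-record-sized chunks and reassembling by sequential lookup, fits in a few lines (a sketch of the encoding scheme; the record naming and base64 chunking here are assumptions, not the repo's exact format):

```python
import base64

CHUNK = 255  # a single TXT character-string holds at most 255 bytes (RFC 1035)

def to_txt_records(data: bytes, zone: str) -> dict[str, str]:
    """Base64-encode a blob and spread it across numbered TXT records."""
    b64 = base64.b64encode(data).decode()
    return {f"c{i // CHUNK}.{zone}": b64[i:i + CHUNK]
            for i in range(0, len(b64), CHUNK)}

def from_txt_records(records: dict[str, str], zone: str) -> bytes:
    """Reassemble by looking up c0., c1., ... until a name is missing."""
    parts, i = [], 0
    while (chunk := records.get(f"c{i}.{zone}")) is not None:
        parts.append(chunk)
        i += 1
    return base64.b64decode("".join(parts))
```

Swap the dict lookup for a real resolver query and any CDN-backed DNS zone becomes free, edge-cached blob storage, which is precisely why the same pattern shows up in C2 tooling.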

Sources (17 articles)

  1. [Editorial] ramimac/teampcp (ramimac.me)
  2. This Week in Security: Second Verse, Worse Than the First (hackaday.com)
  3. You can now enable Claude to use your computer to complete tasks ! (reddit.com)
  4. Schedule tasks on the web (code.claude.com)
  5. Anthropic CEO predicts AI could handle end-to-end software development in 6–12 months (reddit.com)
  6. TurboQuant: Redefining AI efficiency with extreme compression (research.google)
  7. [Editorial] TurboQuant Deep Dive (linkedin.com)
  8. RightNow-AI/autokernel (github.com)
  9. ARC-AGI-3 (arcprize.org)
  10. LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories (arxiv.org)
  11. [Editorial] l33tdawg/sage (github.com)
  12. Show HN: Cq – Stack Overflow for AI coding agents (blog.mozilla.ai)
  13. [Editorial] Video Submission (youtu.be)
  14. sanbuphy/nanoAgent (github.com)
  15. secnova-ai/ClawdSecbot (github.com)
  16. LeoChen-CoreMind/VMPacker (github.com)
  17. DOOM Over DNS (github.com)