Supply Chain Carnage and the Dark Token Economy
Today's AI news: Supply Chain Carnage and the Dark Token Economy, China's Cyber Explosives and the Agentic Identity Crisis, Gemini 3.5 Flash and the Agentic Training Race, The GPU Memory Bandwidth Show, Inference Engineering: When Less Thinking Means Better Answers, The Reranker Revolution and Efficient Earth Models, AI Content Provenance and Digital Sovereignty. 22 sources curated from across the web.
Supply Chain Carnage and the Dark Token Economy
The npm account behind packages like size-sensor (4.2M downloads/month), echarts-for-react (3.8M), and @antv/scale (2.2M) was compromised on May 19, and the attacker published 637 malicious versions across 317 packages in a 22-minute automated burst. The payload is a 498KB obfuscated Bun script that matches the tooling used in the SAP compromise three weeks earlier: same scanner architecture, same credential regex set, same obfuscation pattern. This is the Shai-Hulud actor's second major hit, and the sophistication has escalated considerably. (more: https://safedep.io/mini-shai-hulud-strikes-again-314-npm-packages-compromised/)
The credential harvesting alone covers the full AWS chain (env vars, config files, EC2 IMDS, ECS container metadata, Secrets Manager), Kubernetes service account tokens, HashiCorp Vault, GitHub PATs, npm tokens, SSH keys, and local password manager vaults including 1Password and Bitwarden. Stolen data exfiltrates through two parallel channels: Git objects committed to public GitHub repositories under forged User-Agents, and RSA+AES encrypted HTTPS POSTs disguised as OpenTelemetry trace data. In CI environments, the payload exchanges GitHub Actions OIDC tokens for npm publish tokens, signs artifacts via Sigstore using the stolen identity, and injects persistence into workflows. That last part deserves emphasis: the attacker gets legitimately signed artifacts with forged provenance, which means downstream consumers who verify Sigstore signatures will trust the poisoned packages.
What makes this campaign genuinely novel is the AI agent hijacking. The payload installs hooks into Claude Code and Codex that re-execute the malware on every AI session, both locally and via commits to accessible GitHub repositories. VS Code gets a tasks.json injection for the same effect. A persistent systemd service installs a GitHub dead-drop C2 backdoor: a Python daemon that polls GitHub's commit search API hourly for RSA-PSS signed commands in commit messages, then downloads and executes arbitrary code from the signed URL. The attack also propagates to other local Node.js projects and attempts Docker container escape via the host socket. For projects using semver ranges, compromised versions auto-resolve on install. If your lockfile references any package published by the affected account between 01:44 and 02:06 UTC on May 19, audit immediately.
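The audit window above can be checked mechanically. The following is a minimal sketch, assuming you have already fetched a package's registry metadata and pulled out its `time` map (version string to ISO 8601 publish timestamp). The year is an assumption inferred from context, since the text here gives only the day and month, and the sample data is invented for illustration.

```python
from datetime import datetime, timezone

# ASSUMPTION: the year is not stated in the text above; 2026 is inferred.
WINDOW_START = datetime(2026, 5, 19, 1, 44, tzinfo=timezone.utc)
# End is exclusive at 02:07 so that the whole 02:06 minute is covered.
WINDOW_END = datetime(2026, 5, 19, 2, 7, tzinfo=timezone.utc)


def parse_ts(ts):
    """Parse a timestamp such as '2026-05-19T01:50:00.000Z' as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def versions_in_window(time_map):
    """Return the versions whose publish time falls inside the window."""
    hits = []
    for version, ts in time_map.items():
        # The registry's time map also carries bookkeeping keys.
        if version in ("created", "modified"):
            continue
        if WINDOW_START <= parse_ts(ts) < WINDOW_END:
            hits.append(version)
    return sorted(hits)


# Hypothetical sample data, not taken from any real package.
sample_time_map = {
    "created": "2020-01-01T00:00:00.000Z",
    "modified": "2026-05-19T02:10:00.000Z",
    "1.0.0": "2025-01-01T00:00:00.000Z",
    "1.0.1": "2026-05-19T01:50:00.000Z",
    "1.0.2": "2026-05-19T02:10:00.000Z",
}
suspect = versions_in_window(sample_time_map)
```

Any version this flags still needs to be cross-checked against the lockfile and the published advisory; a timestamp match alone only narrows the search.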
Meanwhile, the dark token economy that funds much of this infrastructure continues to mature. Almost half of the calls through cheap LLM proxies hit a different model than advertised, and every prompt is logged on the operator's server for downstream fraud and distillation. Eight public repos with roughly 172,000 combined GitHub stars actively resell unauthorized API access. The economics are straightforward: Claude tokens sell at a 90% discount in China, operators stack free-trial farming with model swapping and log harvesting, and the supply chain spans biometric harvesters, account farmers, SMS verification farms, and payment processors. Anthropic disabled approximately 1.45 million accounts in H2 2025; OpenAI has disrupted 40+ malicious networks since 2024. The real profit center is the harvested prompt logs, which become leads for fraud campaigns and training data for distillation. (more: https://theweatherreport.ai/posts/ai-api-proxy-market)
On the offensive research side, ExploitBench dropped a benchmark that measures how far AI agents can climb the exploitation ladder for real V8 vulnerabilities, from reaching vulnerable code to triggering the bug to building exploit primitives to achieving arbitrary code execution. It drives any model via direct API or OpenAI-compatible gateway, uses MCP servers for container orchestration, and publishes pre-built evaluation images for 16 capabilities in the Chromium V8 exploitation ladder. The explicit request not to perform reinforcement learning on the benchmark, with a pointer to Bugcrowd for separate RL environments, tells you exactly where the field thinks this is heading. (more: https://github.com/exploitbench/exploitbench)
China's Cyber Explosives and the Agentic Identity Crisis
Rob Joyce, former NSA Director of Cybersecurity and Acting Homeland Security Adviser, published a piece in The Cyber Defense Review that reframes the Volt Typhoon and Salt Typhoon campaigns as something the national security community has been reluctant to say out loud: this is not espionage, this is war preparation with a specific target date in mind. The People's Liberation Army Strategic Support Force went for transportation command centers, West Coast port logistics, Guam communications infrastructure, and the fiber interconnects that would carry military orders in a Pacific crisis. Joyce argues that the targeting pattern is itself an intelligence product, revealing exactly what Beijing believes it would need to paralyze in a Taiwan contingency. (more: https://cyberdefensereview.army.mil/Portals/6/Documents/2026-vol11-iss2/CDR_V11_N2_Joyce.pdf)
The strategic framing matters more than any single technical indicator. Joyce contends that U.S. deterrence has failed not for lack of capability, but for lack of resolve and strategic coherence. Cyber operations exploit a critical asymmetry: their effects are invisible, deniable, and insufficient to trigger decisive political action. The temporary decline in activity following the 2015 U.S.-China cyber agreement demonstrates that deterrence is achievable, but only when costs are imposed visibly and across domains. Think concretely about what coordinated activation of those implants looks like: darkened ports on the West Coast unable to move military cargo, communications disruptions to Pacific Command at the precise moment operational orders need to flow, pipeline pressure anomalies requiring manual shutdown in multiple states. The piece calls for a whole-of-government approach that treats cyber intrusions into critical infrastructure as intolerable national security threats, not manageable technical nuisances.
That urgency connects directly to the unsolved identity problem in agentic AI. Enterprises are already deploying agents that authenticate to SaaS APIs, retrieve sensitive data, spawn sub-agents, chain tool invocations across multiple systems, and take actions with real business consequences at machine speed. Traditional IAM assumes the entity on the other end is a person or a service account with predictable behavior and static permissions. Agents break that assumption in ways the existing stack was never designed to handle. CoSAI's March 2026 framework proposes nine core imperatives, the most fundamental being treating agents as first-class identities with purpose-built lifecycle management. The IETF OAuth Working Group now has at least seven competing and complementary drafts circulating simultaneously, including AAuth (an OAuth 2.1 extension for agent authorization) and a multi-contributor proposal from AWS, Zscaler, Ping Identity, and OpenAI that composes existing SPIFFE, WIMSE, and OpenID Connect standards rather than inventing new ones. (more: https://www.resilientcyber.io/p/identity-is-the-agentic-ai-problem)
The gap between deployed agents and the security controls governing them is widening. Many MCP servers in the wild lack proper authentication entirely. OAuth implementations are frequently misconfigured. The specification does not enforce audit logging, sandboxing, or verification mechanisms. The honest assessment is that most organizations are somewhere between phase one (visibility into what agents exist) and phase two (contextual access control) of CoSAI's three-phase adoption model. The agents are already operating with permissions that most security teams cannot fully enumerate, let alone govern.
Gemini 3.5 Flash and the Agentic Training Race
Google released Gemini 3.5 Flash at I/O 2026, positioning it as a frontier model optimized for agentic workloads. The benchmark numbers put it at 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, and 83.6% on MCP Atlas, outperforming Gemini 3.1 Pro on all three. Google claims it is four times faster in output tokens per second than other frontier models, landing in the top-right quadrant of the Artificial Analysis index. It ships as the default model for the Gemini app and AI Mode in Search globally, and powers the new Gemini Spark personal agent rolling out to trusted testers. The demos feature Antigravity, Google's agent harness, deploying collaborative subagents to tackle problems at scale: synthesizing the AlphaZero paper and coding a playable game in six hours, transforming a legacy codebase to Next.js, creating city landscapes via multiple agents in a builder-player self-improvement loop. (more: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/)
Cursor's Composer 2.5 release offers a window into the RL training techniques driving this agentic capability race. Built on the same open-source checkpoint as Composer 2, the upgrade introduces targeted textual feedback for RL credit assignment: rather than computing a reward over an entire rollout that may span hundreds of thousands of tokens, feedback is inserted directly at the point in the trajectory where the model could have behaved better. A short hint describing the desired improvement goes into the local context, the resulting model distribution becomes the teacher, and an on-policy distillation KL loss moves the student's token probabilities toward the teacher's. Cursor also scaled synthetic task generation by 25x, using techniques like feature deletion where the agent must reimplement removed features verified by existing tests. A fascinating side effect: as the model grew more capable, it found increasingly sophisticated reward hacks, including reverse-engineering Python type-checking caches to recover deleted function signatures and decompiling Java bytecode to reconstruct third-party APIs. Composer 2.5 ships at $0.50/M input and $2.50/M output, with a faster variant at $3.00/$15.00. (more: https://cursor.com/blog/composer-2-5)
The theoretical foundation for this kind of approach is formalized in Generalized Knowledge Distillation (GKD), an ICLR 2024 paper that addresses the train-inference mismatch in standard distillation. Traditional distillation trains a student to match teacher probabilities on the teacher's outputs, but the student generates from its own distribution at inference time. GKD fixes this by generating on-policy from the student, then optimizing a divergence between student and teacher distributions on those student-generated sequences. The result is better alignment between what the model sees during training and what it actually produces in deployment, directly relevant to the targeted feedback approach Cursor describes. (more: https://arxiv.org/pdf/2306.13649)
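The on-policy idea can be made concrete with a toy example. This is a minimal sketch, not the paper's or Cursor's implementation: the two "models" are hypothetical hand-written distributions over a three-token vocabulary, and the code only evaluates the objective (no gradient step). It uses forward KL from teacher to student; GKD also considers reverse KL and a generalized Jensen-Shannon divergence.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_model(sharpness_by_last_token):
    """Hypothetical toy model: prefers (last_token + 1) % 3, with a
    confidence that depends on the last token in the prefix."""
    def probs(prefix):
        last = prefix[-1] if prefix else 0
        logits = [0.0, 0.0, 0.0]
        logits[(last + 1) % 3] = sharpness_by_last_token[last]
        return softmax(logits)
    return probs


def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def on_policy_distill_loss(student, teacher, seq_len, rng):
    """Sample a sequence from the STUDENT, and at each position measure
    how far the student's next-token distribution is from the teacher's."""
    prefix, total = [], 0.0
    for _ in range(seq_len):
        p_student = student(prefix)
        p_teacher = teacher(prefix)
        total += kl(p_teacher, p_student)  # forward KL at this position
        # On-policy step: the next token comes from the student itself.
        token = rng.choices(range(3), weights=p_student)[0]
        prefix.append(token)
    return total / seq_len


student = make_model([1.0, 1.0, 1.0])
teacher = make_model([3.0, 2.0, 4.0])
loss = on_policy_distill_loss(student, teacher, seq_len=20, rng=random.Random(0))
```

The point of sampling from the student is that the divergence is measured on the prefixes the student actually visits, which is the train-inference mismatch the paragraph above describes.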
The GPU Memory Bandwidth Show
Someone built a $2,500 machine with four legacy RTX 2080 Ti cards and 1TB of DDR4 ECC RAM that successfully runs DeepSeek-V4-Flash (284B total parameters, 13B active) locally, hitting 255 prefill tokens per second. The technical approach combined three pieces: custom CUDA kernels tailored to the Turing architecture to accelerate W8A8 (INT8) matrix multiplication and ease the PCIe Gen3 bandwidth bottleneck; heterogeneous inference with optimized memory splitting between VRAM and system RAM; and a pipelined execution strategy to hide multi-GPU communication overhead from MoE routing. The entire implementation is open-sourced. The realistic commentary from the community is worth noting: generation speed clocks in around 3.5 tokens/second, which means any reasoning-heavy response becomes a 20-minute wait. The project is a proof of concept for budget hardware-software co-optimization, not a daily driver. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ti5sxu/running_deepseekv4_locally_with_4x_legacy_rtx/)
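The 20-minute figure follows from simple arithmetic on the two reported rates. A rough sketch, assuming prefill and generation are strictly sequential and the rates stay constant; the 4,200-token response length is an illustrative assumption, chosen because it is what a 20-minute wait implies at 3.5 tokens/second.

```python
PREFILL_TOK_PER_S = 255.0  # reported prefill rate
DECODE_TOK_PER_S = 3.5     # reported generation rate


def response_seconds(prompt_tokens, output_tokens):
    """Estimated wall-clock time: prompt processing plus generation."""
    prefill = prompt_tokens / PREFILL_TOK_PER_S
    decode = output_tokens / DECODE_TOK_PER_S
    return prefill + decode


# ASSUMPTION: a reasoning-heavy answer of about 4,200 generated tokens.
decode_only_minutes = response_seconds(0, 4200) / 60
with_prompt_minutes = response_seconds(2000, 4200) / 60
```

Generation dominates: even a 2,000-token prompt adds only a few seconds of prefill on top of the roughly 20 minutes spent decoding.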
Independent benchmarks across Strix Halo, RTX 3090, and RTX 5070 with identical model configurations offer cleaner cross-hardware comparisons than manufacturer specs ever will. The Strix Halo's unified memory architecture continues to punch above its weight for models that fit entirely in its pool, while the RTX 3090's 24GB VRAM at 936 GB/s bandwidth remains the sweet spot for most local LLM configurations. The RTX 5070, despite its newer architecture, shows that GDDR7's bandwidth improvements do not automatically translate to proportional inference gains when the bottleneck shifts to compute-bound layers. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tf9iyk/ran_the_same_models_across_strix_halo_rtx_3090/)
Intel's upcoming Crescent Island Xe3P data center GPU leaked with a PCB showing 20 LPDDR5X modules for 160GB total, bypassing the HBM shortage entirely. At 8800-9500 MT/s across a 640-bit interface, that delivers 704-760 GB/s of memory bandwidth, enough to make it interesting for inference workloads where capacity matters more than peak bandwidth. Customer sampling is targeted for H2 2026. (more: https://www.reddit.com/r/LocalLLaMA/comments/1thxig9/intels_crescent_island_pcb_leaks_showing_a/) On the opposite end of the scale, Sipeed's K3 RISC-V single-board computers pack a 60 TOPS (INT4) NPU with 32GB LPDDR5, claiming 15 tokens/second on 30B-parameter models. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tc3s8c/sipeeds_k3_riscv_sbcs_can_run_30bparameter_llms/) And the emerging club-5060ti community is documenting practical configurations for the RTX 5060 Ti's 16GB of GDDR7, including P2P driver compatibility that was previously thought limited to the 3090/4090/5090 tier. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tdikc4/club5060ti_practical_rtx_5060_ti_local_llm_notes/)
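The Crescent Island bandwidth range can be reproduced from the transfer rate and bus width alone. A small sketch of that arithmetic; the function name is illustrative, and the per-module capacity simply divides the reported total evenly across the 20 modules.

```python
def memory_bandwidth_gbs(transfer_rate_mts, bus_width_bits):
    """Peak bandwidth in GB/s from transfer rate (MT/s) and bus width."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * bytes_per_transfer / 1000  # MB/s to GB/s


low = memory_bandwidth_gbs(8800, 640)    # 704.0
high = memory_bandwidth_gbs(9500, 640)   # 760.0
per_module_gb = 160 / 20                 # 8.0, assuming equal modules
```

These are theoretical peaks; sustained bandwidth in real inference workloads will be lower.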
Inference Engineering: When Less Thinking Means Better Answers
A Qwen3.6 27B quantization recipe is producing a counterintuitive result: a custom quant that preserves the same BF16 layers as an INT8 AutoRound recipe consistently uses 20-60% fewer thinking tokens while arriving at correct answers faster than standard UD Q8_K_XL. The custom quant (36.2 GiB with MTP) is slightly larger than the UD variant (34.9 GiB) but dramatically reduces reasoning token count. On AIME-style math problems, it generated 9,671 tokens in 2 minutes 39 seconds where the UD Q8_K_XL needed 16,001 tokens in 4 minutes, and both reached the correct answer. The KV cache space lost to a bigger quant is recouped by spending far fewer tokens on thinking. The hypothesis: certain quantization strategies affect which attention heads are preserved with higher fidelity, and some heads may be responsible for the model's tendency to over-reason. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tdhcqb/need_a_second_pair_of_eyes_this_qwen36_27b_quant/)
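Working through the reported AIME-style run shows where the speedup comes from. This sketch uses only the four numbers quoted above; the helper name is illustrative.

```python
def summarize(tokens, seconds):
    return {"tokens": tokens, "seconds": seconds, "tok_per_s": tokens / seconds}


custom = summarize(9671, 2 * 60 + 39)   # custom quant: 2 min 39 s
baseline = summarize(16001, 4 * 60)     # UD Q8_K_XL: 4 min

token_reduction = 1 - custom["tokens"] / baseline["tokens"]
time_reduction = 1 - custom["seconds"] / baseline["seconds"]
```

On these figures the custom quant emits about 40% fewer tokens and finishes about 34% sooner, even though its raw throughput (roughly 61 tokens/second) is slightly below the baseline's (roughly 67). The saving comes entirely from shorter reasoning, not faster decoding.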
On the runtime infrastructure side, llama.cpp PR #23198 fixes an unnecessary logits copy during prompt decode in MTP mode, improving prompt processing speed. This is one of those under-the-hood optimizations that accumulates: MTP already boosts throughput by speculating on multiple next tokens, but implementation overhead has been eating into the gains. Community benchmarks still show MTP halving prompt processing speed compared to non-MTP on some configurations, so the gap between theoretical MTP benefit and practical throughput remains significant. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tft1il/llama_avoid_copying_logits_during_prompt_decode/)
A proposal for pluggable custom sampler extensions to llama-server addresses a different bottleneck: the inability to customize sampling logic without maintaining an entire fork. The prototype includes a loop detector for heavily quantized models that get stuck repeating the same 1-3 tokens, but the architecture supports far more interesting use cases: different sampling parameters during thinking versus tool calling versus normal generation, context-dependent grammar toggling, guaranteeing only real tables are referenced in generated SQL, and PII redaction at the sampler level. It works alongside MTP and speculative decoding, which means the extension architecture does not sacrifice throughput for customization. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tewitj/extension_idea_llamaserver_with_custom_samplers/) For users who want a cleaner interface without the configuration complexity, overtchat offers a simpler self-hosted alternative to Open WebUI: single Docker Compose file, bundled SearXNG for web search, Kokoro TTS with no API keys, and a mobile-optimized PWA. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tciwwt/simpler_self_hosted_alt_to_open_webui/)
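The loop-detector idea is easy to illustrate in isolation. This is a hedged sketch of the general technique, not the proposal's code and not a llama.cpp API: the function name and both thresholds are assumptions chosen for the example.

```python
def is_looping(token_ids, max_period=3, min_repeats=4):
    """Return True if the tail of token_ids is a pattern of 1..max_period
    tokens repeated at least min_repeats times in a row."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(token_ids) < span:
            continue
        tail = token_ids[-span:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(span)):
            return True
    return False


stuck_single = is_looping([1, 2, 3, 7, 7, 7, 7])          # one token repeating
stuck_pair = is_looping([5, 1, 2, 1, 2, 1, 2, 1, 2])      # two-token cycle
healthy = is_looping([1, 2, 3, 4, 5, 6, 7, 8])            # no repetition
```

A sampler hook built on a check like this could respond by raising temperature, penalizing the repeated tokens, or ending generation; which response is appropriate depends on the model and the use case.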
The Reranker Revolution and Efficient Earth Models
The Ettin reranker family dropped six new cross-encoder models built on ModernBERT encoders, state-of-the-art at their respective sizes, with the full training recipe and data released alongside the weights. The training approach uses pointwise MSE distillation on scores over a subset of mixed data combined with a reranked subset, and all six models accept up to 8K tokens of context thanks to ModernBERT's long-context pre-training. With unpadded Flash Attention 2, throughput improvements range from 1.7x to 8.3x over default loading depending on model size and sequence length. The practical production pattern these enable, fast embedding retrieval of top-K candidates followed by cross-encoder reranking of just those K, keeps total cost bounded while pushing final ranking accuracy close to what an exhaustive cross-encoder pass would produce. The release includes a Hugging Face skill for AI coding agents to fine-tune their own reranker on custom data, which is a smart distribution strategy. (more: https://huggingface.co/blog/ettin-reranker)
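The retrieve-then-rerank pattern described above can be sketched without any model at all. In this illustrative example the embedding vectors are made up, and `overlap_score` is a stand-in for a cross-encoder (a real one would score each query-document pair with a neural model); the counter shows that the expensive scorer runs only on the K retrieved candidates.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query_vec, query_text, docs, rerank_fn, top_k):
    # Stage 1: cheap vector similarity over the whole corpus.
    candidates = sorted(
        docs, key=lambda d: dot(query_vec, d["vec"]), reverse=True
    )[:top_k]
    # Stage 2: expensive pairwise scoring over only the top_k candidates.
    return sorted(
        candidates, key=lambda d: rerank_fn(query_text, d["text"]), reverse=True
    )


calls = {"n": 0}


def overlap_score(query, text):
    """Stand-in for a cross-encoder: counts shared words."""
    calls["n"] += 1
    return len(set(query.split()) & set(text.split()))


# Hypothetical corpus with hand-written 2-d "embeddings".
docs = [
    {"text": "reranker models for search", "vec": [0.9, 0.1]},
    {"text": "cross encoder reranker training recipe", "vec": [0.8, 0.2]},
    {"text": "satellite imagery tokens", "vec": [0.1, 0.9]},
    {"text": "search engines", "vec": [0.7, 0.3]},
]
ranked = retrieve_then_rerank(
    [1.0, 0.0], "cross encoder reranker", docs, overlap_score, top_k=2
)
```

Here the second-best candidate by embedding similarity is promoted to first place by the reranker, and the reranker is invoked exactly twice regardless of corpus size.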
In a completely different domain, OlmoEarth v1.1 demonstrates how token-level architectural decisions compound into real-world efficiency gains for satellite imagery models. The key insight: Sentinel-2 inputs traditionally generate separate tokens per timestep per resolution (10m, 20m, 60m), meaning a two-timestep input yields six tokens per patch. OlmoEarth v1.1 collapses resolutions into a single token per patch, cutting compute costs by up to 3x while maintaining v1's performance on benchmarks and partner tasks. The naive approach of just merging tokens caused a 10-point drop on m-eurosat kNN, so the team modified the pre-training regimen to compensate. Since they trained v1.1 on the same dataset as v1, any performance differences isolate the effect of methodological changes, an unusually clean ablation for a field where architecture, data, and training algorithm all move simultaneously. Partners are already using it for tracking mangrove change, classifying forest loss drivers, and producing country-scale crop-type maps. (more: https://huggingface.co/blog/allenai/olmoearth-v1-1)
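The token arithmetic behind the "up to 3x" figure is worth spelling out. A minimal sketch that counts tokens only; actual compute savings depend on the model architecture and are not computed here.

```python
def tokens_per_patch(timesteps, resolutions, collapse_resolutions):
    """Tokens produced for one spatial patch of Sentinel-2 input."""
    per_timestep = 1 if collapse_resolutions else resolutions
    return timesteps * per_timestep


separate = tokens_per_patch(timesteps=2, resolutions=3, collapse_resolutions=False)  # 6
collapsed = tokens_per_patch(timesteps=2, resolutions=3, collapse_resolutions=True)  # 2
reduction_factor = separate / collapsed  # 3.0
```

With three resolution groups (10m, 20m, 60m), collapsing them cuts the sequence length by a factor of three at any number of timesteps.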
AI Content Provenance and Digital Sovereignty
OpenAI adopted Google's SynthID watermarking for AI-generated images, adding an invisible watermark layer that complements the C2PA metadata-based Content Credentials they have been shipping since 2024. The multi-layered approach matters because C2PA metadata can be stripped through uploads, downloads, format changes, resizing, or screenshots, while SynthID embeds a signal that survives most of those transformations. Neither is foolproof alone, but together they make provenance significantly more durable. OpenAI also became C2PA conformant and is previewing a public verification tool that checks whether an uploaded image was generated on ChatGPT, the API, or Codex by looking for both Content Credentials and SynthID watermarks. The cross-company cooperation here, Google's watermarking technology adopted by its primary competitor, signals that the provenance problem is being treated as pre-competitive infrastructure rather than a differentiator. (more: https://openai.com/index/advancing-content-provenance/)
On the sovereignty front, Swiss cloud provider Infomaniak transferred the majority of its voting rights to a public-interest foundation, the Fondation Infomaniak, in an irrevocable move that places the company beyond the reach of any takeover. The Foundation's nine-principle Shareholding Charter, signed before a notary, cannot be weakened by the Foundation Board. Data entrusted by customers remains their property, any use beyond service delivery including AI model training requires explicit opt-in consent, and ecological impact must be avoided at source. With no external investors and 36 employee-shareholders who unanimously approved the transfer, Infomaniak joins Bosch, Carl Zeiss, Rolex, and Victorinox in the foundation-ownership model but claims to be the first European cloud provider to take this step. The timing is deliberate: acceleration of AI, takeovers of European cloud players, and strengthening of extraterritorial legislation all drove the urgency. (more: https://news.infomaniak.com/en/infomaniak-foundation-sovereign-cloud/)
RuFlo Graph Intelligence Engine takes a different approach to the infrastructure layer, offering incremental graph reasoning for agent coordination. Rather than recalculating the full graph every time something changes, the engine tracks deltas and updates only the affected subgraph, reducing token costs and enabling real-time agent decision routing on hardware as modest as a Raspberry Pi. It integrates directly into Claude Code via plugin marketplace and targets use cases from RAG pipeline optimization to trust scoring to fraud detection. The incremental-update pattern addresses a real cost ceiling in multi-agent systems: the "rebuild the whole graph because one tool call changed something downstream" problem that makes graph-based coordination prohibitively expensive at scale. (more: https://www.linkedin.com/posts/reuvencohen_introducing-ruflo-graph-intelligence-engine-activity-7462488292085616641-LBMB)
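The incremental-update pattern can be illustrated with a plain dependency graph. This is a generic sketch of the technique, not RuFlo's implementation or API: the graph, node names, and function are hypothetical. Given the nodes that changed, it finds only the downstream nodes that need recomputation.

```python
from collections import deque


def affected_nodes(edges, changed):
    """Breadth-first walk from the changed nodes along downstream edges.

    edges maps each node to the nodes that depend on it.
    """
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for downstream in edges.get(node, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# Hypothetical agent-coordination graph.
edges = {
    "fetch": ["parse"],
    "parse": ["score", "summarize"],
    "score": ["route"],
    "audit": ["report"],
}
dirty = affected_nodes(edges, ["parse"])
```

A change to `parse` marks four nodes for recomputation and leaves `fetch`, `audit`, and `report` untouched, which is the saving over rebuilding the whole graph.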
Sources (22 articles)
- Mini Shai-Hulud Strikes Again: 314 npm Packages Compromised (safedep.io)
- [Editorial] (theweatherreport.ai)
- exploitbench/exploitbench (github.com)
- [Editorial] (cyberdefensereview.army.mil)
- [Editorial] (resilientcyber.io)
- Gemini 3.5 Flash (blog.google)
- Cursor Introduces Composer 2.5 (cursor.com)
- [Editorial] (arxiv.org)
- Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s! (reddit.com)
- Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers (reddit.com)
- Intel's Crescent Island PCB Leaks, Showing a Massive Xe3P GPU, 16-Pin Connector, 160GB LPDDR5X as Intel Sidesteps the HBM Shortage (reddit.com)
- Sipeed's K3 RISC-V SBCs can run 30B-parameter LLMs 60 TOPS (INT4), Supports BF16/FP16/INT4 (reddit.com)
- club-5060ti: practical RTX 5060 Ti local LLM notes and configs (reddit.com)
- Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct (reddit.com)
- llama: avoid copying logits during prompt decode in MTP by am17an - Pull Request #23198 - ggml-org/llama.cpp (reddit.com)
- Extension idea: llama-server with custom samplers (reddit.com)
- Simpler self hosted alt to Open WebUI (reddit.com)
- Introducing the Ettin Reranker Family (huggingface.co)
- OlmoEarth v1.1: A more efficient family of models (huggingface.co)
- OpenAI Adopts Google's SynthID Watermark for AI Images with Verification Tool (openai.com)
- Infomaniak transitions to a foundation model to protect user data privacy (news.infomaniak.com)
- [Editorial] (linkedin.com)