The Worm Returns: Supply Chain Attacks Escalate
Today's AI news: The Worm Returns: Supply Chain Attacks Escalate, Zero-Day Discovery Leaves the Frontier, Architecting the Agentic Future, Local Inference Keeps Pushing, Developer Tools Go Open, Industry Shifts and Unusual Experiments. 22 sources curated from across the web.
The Worm Returns: Supply Chain Attacks Escalate
The shai-hulud threat actor is back, and this time the sandworm burrowed into PyTorch. Semgrep disclosed that the PyPI package lightning -- the widely-used PyTorch Lightning deep learning framework -- was compromised in versions 2.6.2 and 2.6.3, published April 30. Anyone running pip install lightning during the affected window pulled down a hidden _runtime directory containing a 14.8 MB obfuscated JavaScript payload that executes on import. The malware steals credentials, authentication tokens, environment variables, and cloud secrets across all three major providers (AWS via IMDSv2 and Secrets Manager enumeration, Azure via DefaultAzureCredential and Key Vault, GCP via GoogleAuth and Secret Manager). On GitHub Actions runners, it dumps Runner.Worker process memory to extract every secret marked isSecret:true. The cross-ecosystem propagation is particularly nasty: if the malware finds npm publish credentials, it injects a dropper into every package that token can publish, bumps the patch version, and republishes -- silently worming from PyPI into npm. (more: https://semgrep.dev/blog/2026/malicious-dependency-in-pytorch-lightning-used-for-ai-training/)
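For teams triaging exposure, the indicators reduce to a quick local check. A minimal sketch, assuming only the details in Semgrep's writeup (the compromised version numbers and the hidden _runtime directory in the package's file manifest); a hit means "inspect this machine", not proof of compromise by itself:

```python
# A triage sketch, assuming the indicators in Semgrep's writeup: compromised
# "lightning" releases 2.6.2/2.6.3 and a hidden _runtime directory shipped
# inside the package. A hit means "inspect this machine", not proof by itself.
from importlib import metadata

COMPROMISED_VERSIONS = {"2.6.2", "2.6.3"}

def check_lightning() -> None:
    try:
        version = metadata.version("lightning")
    except metadata.PackageNotFoundError:
        print("lightning is not installed")
        return
    if version in COMPROMISED_VERSIONS:
        print(f"WARNING: lightning {version} is a known-compromised release")
    # Look for the hidden payload directory in the package's file manifest.
    for path in metadata.files("lightning") or []:
        if "_runtime" in path.parts:
            print(f"suspicious packaged file: {path}")

if __name__ == "__main__":
    check_lightning()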
What makes this iteration notable beyond the credential theft is its abuse of developer tooling for persistence. The malware writes a SessionStart hook into .claude/settings.json with matcher: "*", firing every time a developer opens Claude Code in the infected repository -- no user action required beyond launching the session. A parallel .vscode/tasks.json hook triggers on folder open for VS Code users. Both invoke a Bun runtime bootstrapper that downloads bun-v1.3.13 if absent, executes the full payload, and cleans up from /tmp. This may be among the first documented instances of malware weaponizing Claude Code's hook system in a real-world attack. If a GitHub token with write access is present, a bonus payload pushes a workflow named "Formatter" that dumps all repository secrets via ${{ toJSON(secrets) }} as a downloadable Actions artifact. Any machine that imported the malicious package should be treated as fully compromised; look for commit messages prefixed with EveryBoiWeBuildIsAWormyBoi and repos with the description "A Mini Shai-Hulud has Appeared."
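Both persistence vectors live in checked-in JSON, so they are straightforward to audit for. A hedged sketch based on the fields named above; it assumes strict JSON (commented JSONC task files would need a more tolerant parser), and any hit warrants manual review rather than automatic conclusions:

```python
# An audit sketch for the two persistence vectors described above. Field
# names follow the article: hooks.SessionStart in .claude/settings.json,
# and runOn: folderOpen in .vscode/tasks.json.
import json
from pathlib import Path

def audit_repo(repo: Path) -> list[str]:
    findings: list[str] = []
    claude_settings = repo / ".claude" / "settings.json"
    if claude_settings.is_file():
        hooks = json.loads(claude_settings.read_text()).get("hooks", {})
        for entry in hooks.get("SessionStart", []):
            findings.append(f"SessionStart hook (matcher={entry.get('matcher')!r})")
    vscode_tasks = repo / ".vscode" / "tasks.json"
    if vscode_tasks.is_file():
        for task in json.loads(vscode_tasks.read_text()).get("tasks", []):
            if task.get("runOptions", {}).get("runOn") == "folderOpen":
                findings.append(f"auto-run VS Code task: {task.get('label')!r}")
    return findings

for finding in audit_repo(Path(".")):
    print(finding)
```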
Meanwhile, a separate disclosure highlighted the fragile state of kernel vulnerability coordination. CVE-2026-31431 (CopyFail), a local privilege escalation in the Linux kernel introduced in version 4.14 back in 2017, was fixed in 6.18.22 and 6.19.12 -- but longterm kernels from 6.12 down to 5.10 remain unpatched, and the fix does not backport cleanly. Sam James of Gentoo posted a workaround patch, noting bluntly: "for Linux kernel vulnerabilities, unless the reporter chooses to bring it to the linux-distros ML, there is no heads-up to distributions." The structural gap between kernel maintainers and downstream distributions remains a persistent weakness in the open-source security model. (more: https://www.openwall.com/lists/oss-security/2026/04/30/10)
On the lighter side of tool misbehavior, Theo Browne flagged that Claude Code refuses requests or charges extra if recent commits mention "OpenClaw" in a JSON blob -- even in an empty repository with no code. The behavior appears to be an artifact of competitive-product detection leaking into billing logic, and while Anthropic has not commented publicly, it is a reminder that the tools developers rely on carry their own opaque policy layers. (more: https://twitter.com/theo/status/2049645973350363168)
Zero-Day Discovery Leaves the Frontier
The prevailing narrative around AI-driven vulnerability discovery has been that it requires frontier models with restricted access -- Anthropic's Mythos program being the marquee example. Niels Provos, the original author of the 1998 OpenBSD TCP SACK implementation that Anthropic's red team flagged as their headline finding, decided to test that claim. Using his open-source framework built around finite-state-machine workflows, he replicated the OpenBSD discovery and then autonomously found new zero-day vulnerabilities in four widely-used open-source projects -- using commercial models like Claude Sonnet and open-weight models like Z.AI's offering. The framework employs a central orchestrator agent that routes to specialized agents based on an append-only execution journal, with tiered harness construction: single-function isolation, multi-component harnesses, and full end-to-end VM validation. (more: https://www.provos.org/p/finding-zero-days-with-any-model)
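The journal-driven routing pattern is simple enough to sketch. A toy version follows, with agent names and the routing rule invented for illustration; in the real framework each agent is backed by an LLM with tools, and the append-only journal is what makes runs resumable and auditable:

```python
# A toy version of the routing pattern: the orchestrator inspects an
# append-only journal and dispatches the next specialist. Agent names and
# the routing rule are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Journal:
    entries: list[dict] = field(default_factory=list)  # append-only, never edited

    def append(self, agent: str, result: str) -> None:
        self.entries.append({"agent": agent, "result": result})

PIPELINE = ("recon", "harness_builder", "fuzz_runner", "triage")

def route(journal: Journal) -> str:
    """Choose the next specialist from what the journal already records."""
    done = {entry["agent"] for entry in journal.entries}
    for agent in PIPELINE:
        if agent not in done:
            return agent
    return "report"

journal = Journal()
while (agent := route(journal)) != "report":
    journal.append(agent, f"{agent} finished")  # stub for a real agent run
print("journal:", [entry["agent"] for entry in journal.entries])
```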
The economics are striking. A single investigation consumes roughly 10 million tokens on Sonnet ($30) or Opus ($150). Z.AI's hosted open-weight model averaged 27 million tokens per run but at comparable cost per investigation due to lower per-token pricing. One discovery -- an integer truncation flaw in a foundational library dormant for eighteen years -- required manual confirmation of exploitability primitives, during which the model's AUP guardrails blocked further collaboration after two of seven planned exploit-development steps. Provos frames this friction as the core asymmetry: "a defender doing legitimate work hit friction that a well-resourced adversary using uncensored open-weight models would not." That refusal, two steps into a seven-step plan, is exactly the kind of gate that does not exist for attackers running uncensored models locally. The framework is open-source, and Provos encourages security engineers to contribute -- particularly toward easier onboarding. The implication is clear: the capability floor for automated vulnerability discovery is now low enough that commodity models can clear it, and the scaffolding matters more than the model.
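The blended per-million-token rates back out directly from those totals; the open-weight figure below assumes "comparable cost" means roughly the Sonnet total:

```python
# Backing out blended per-million-token rates from the figures in the post.
# The open-weight cost assumes "comparable" means roughly the Sonnet total.
runs = {
    "Sonnet":      (10e6, 30.0),   # (tokens per investigation, reported cost $)
    "Opus":        (10e6, 150.0),
    "open-weight": (27e6, 30.0),   # assumption, see comment above
}
for model, (tokens, cost) in runs.items():
    rate = cost / (tokens / 1e6)
    print(f"{model:>11}: ${rate:5.2f} per million tokens (blended)")
```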
Architecting the Agentic Future
The Agentics Foundation published the OIA Model -- Open Intelligence Architecture -- a nine-layer reference architecture for organizing intelligent systems in enterprise environments. Modeled explicitly on the OSI networking stack (with bottom-up numbering from Layer 0 Physical Compute through Layer 9 Human and Browser Interface), it addresses concerns that prior frameworks were not designed for: persistent memory across sessions, autonomous action exceeding explicit instruction, and consequential outputs requiring auditable judgment. Two paired state-holding layers bracket the operational core: Layer 3 (Agent Data Substrate) holds the data systems operate on, while Layer 8 (Continuity Fabric) maintains the cognitive state systems become -- memory, learned adaptations, and verifiable decision trails. Six cross-layer spans (security, sovereignty, auditability, identity, energy, provenance) cut horizontally across all layers. The specification is in active public review with eleven numbered decisions open for challenge, and the Foundation explicitly invites pushback by decision number rather than scattered marginalia. (more: https://oia.agentics.org)
The timing is not accidental. The OIA document cites three developments that made a reference architecture operationally urgent: Anthropic's Mythos system card signaling that offensive capability has outpaced defensive readiness, agentic deployment across regulated sectors without shared vocabulary producing architectural fragmentation, and disclosures of unanticipated model behaviors including strategic concealment. Whether the OIA Model gains adoption or becomes another aspirational framework remains to be seen, but the gap it names is real -- ask any enterprise architect deploying production AI what reference model they use, and you will get a different answer every time.
A research paper from Sant'Anna School of Advanced Studies provides the theoretical complement. "Agentic Microphysics" proposes that safety analysis must shift from individual model evaluation to population-level dynamics -- what happens when one agent's output becomes another agent's input under specific protocol conditions. The core argument: individually aligned agents can participate in emergent information cascades, tacit collusion, and collective deception. The paper introduces a generative methodology borrowed from computational social science (Epstein and Axtell's "growing artificial societies" tradition) and applies it to a controlled experiment where LLM agents interact through a shared news feed. The central empirical finding is that feed position, not visible social proof, governs collective attention -- agents select almost exclusively from top-ranked items regardless of endorsement counts. The safety implication: an adversary manipulating ranking can induce synchronized attention shifts across heterogeneous agent populations, a common architectural failure mode the authors describe as exploitable by design. (more: https://arxiv.org/abs/2604.15236v1)
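The position-bias result is easy to reproduce in toy form. A sketch in the paper's generative spirit, with invented parameters rather than the authors' actual protocol:

```python
# A toy reproduction of the position-bias result: agents choose from a
# ranked feed with probability decaying in position, and endorsement counts
# play no role. Decay exponent and feed length are invented parameters.
import random

def pick(feed_len: int, decay: float = 1.5) -> int:
    """Position-biased choice: weight ~ 1 / rank**decay; endorsements unused."""
    weights = [1.0 / (rank + 1) ** decay for rank in range(feed_len)]
    return random.choices(range(feed_len), weights=weights)[0]

random.seed(0)
picks = [pick(feed_len=20) for _ in range(10_000)]
top3_share = sum(p < 3 for p in picks) / len(picks)
print(f"share of attention captured by the top 3 slots: {top3_share:.0%}")
# An adversary who controls which items occupy those slots shifts nearly all
# collective attention without touching any individual agent's alignment.
```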
On the practitioner side, Rohit Sharma articulated a problem that anyone running agentic workloads on hosted APIs recognizes: the "unprovability gap" between knowing model quality degraded and being able to prove it. Anthropic published a postmortem disclosing three product-layer changes that caused Claude Code degradation -- a lower default reasoning effort, a caching bug clearing thinking history, and a system prompt capping inter-tool-call text. Sharma notes that the dimensions that matter (reasoning depth, initiative, willingness to push back) are subjective and high-dimensional, with no eval suite that captures "the model stopped challenging my bad idea." Every agentic system on a hosted API carries a silent dependency on quality consistency that no SLA currently covers. (more: https://www.linkedin.com/posts/rohit0221_agenticai-lessonsfromthetrenches-writtenbyhuman-activity-7455528406684983297-Fstl)
Irene Abezgauz offered a practical framework for the organizational side of this transition: a 20-30-50 split for PM teams in the hybrid AI era. Twenty percent of capacity goes to AI tools setup and maintenance (prompts, evals, workflows, integrations), thirty percent to agentic workforce management (reviewing and synthesizing agent output -- "closer to leading a team than to IC work"), and fifty percent to core PM work that stays human: strategy, prioritization, stakeholder alignment, customer conversations. The key insight: agents reduce mechanical work linearly but judgment work stays the same or increases, and cutting headcount proportional to total time saved causes strategy degradation that only surfaces in outcomes months later. (more: https://www.linkedin.com/posts/ireneabezgauz_human-work-time-allocation-in-the-hybrid-activity-7455523988199325696-wZUw)
Local Inference Keeps Pushing
Salvatore Sanfilippo (antirez, of Redis fame) shipped experimental llama.cpp support for DeepSeek v4 Flash, with a GGUF quantization that fits in 128 GB of RAM. The approach is surgical: routed experts are quantized to 2 bits using two different 2-bit quant schemes to balance error and size, while shared experts and all non-expert weights stay at Q8. The result runs at 17-21 tok/s on a MacBook M3 Max -- firmly in the "usable" zone for a model that, even at 2-bit quantization, subjectively feels stronger than Qwen 3.6 27B. Community response was immediate: one contributor got it running on an RTX 6000 96GB with CUDA, and a ROCm port appeared for AMD hardware, though prefill performance lagged at around 110 tok/s with output at 8 tok/s. The bottleneck is graph splits -- 856 at batch size 512 versus the fewer-than-10 typical of pure transformer models -- causing massive CPU-GPU synchronization overhead. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sw3stb/llamacpp_deepseek_v4_flash_experimental_inference/)
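The arithmetic behind the recipe is worth making concrete. A rough sizing sketch, where the parameter split is a placeholder rather than the model's real numbers, and the bits-per-weight figures approximate 2-bit and Q8 GGUF formats once quantization scales are counted:

```python
# Rough sizing for the mixed-quant recipe: routed experts near 2 bits/weight,
# everything else at Q8 (~8.5 bits/weight once scales are counted). The
# parameter split is a placeholder; substitute the model's real numbers.
def gguf_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

routed_experts_b = 380.0   # hypothetical routed-expert parameters (billions)
shared_and_dense_b = 25.0  # hypothetical shared experts + attention + embeddings

total_gib = gguf_gib(routed_experts_b, 2.1) + gguf_gib(shared_and_dense_b, 8.5)
print(f"estimated GGUF size: {total_gib:.0f} GiB")  # target: under 128 GB RAM
```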
On the hardware acceleration front, native NVFP4 support in llama.cpp build b8967 delivered 43-68% faster prompt processing on an RTX 5090 running Qwen3.6-27B-NVFP4, with the largest gains at shorter context (1.7x at pp512) tapering to 1.43x at 32K context. Token generation speed was unchanged -- the improvement is purely in the prefill path, meaning faster time-to-first-token for RAG workloads, document analysis, and code-heavy prompts. The accuracy tradeoff remains real: NVFP4 quants show measurably higher perplexity than imatrix quants at the same VRAM budget, and the quantization-aware distillation checkpoints from Nvidia that would close that gap exist for only a handful of its own models. (more: https://www.reddit.com/r/LocalLLaMA/comments/1syxckc/llamacpp_benchmark_native_vs_non_native_nvfp4_on/)
Speculative decoding continues to find its sweet spot in constrained-output workflows. One developer replaced Gemini 2.5 Flash-Lite API calls with local Gemma-4-31B (Q6_K_L) paired with Gemma-4-E2B (Q8_0) as a draft model, achieving 120-200 tok/s on an RTX 5090 for structured JSON extraction tasks in Lithuanian. The acceptance rate for speculative tokens jumps to 0.7+ on strict schema extraction versus 0.3-0.4 on open-ended generation, making the headline speed achievable specifically for the narrow-output-space workloads where local inference most directly competes with API calls. The setup uses 31.5 GB VRAM. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sw782p/speculative_decoding_with_gemma431b_gemma4e2b/)
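The acceptance-rate gap explains most of the headline number. Under the standard speculative-sampling estimate, a draft of length k with per-token acceptance probability a yields (1 - a^(k+1)) / (1 - a) tokens per target-model verification step; plugging in the quoted rates, with k = 5 as an assumed draft length:

```python
# The standard speculative-sampling estimate: a draft of length k with
# per-token acceptance probability a yields (1 - a**(k+1)) / (1 - a) tokens
# per target-model verification step. k = 5 is an assumed draft length; the
# acceptance rates are the ones quoted in the post.
def tokens_per_step(a: float, k: int = 5) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for label, a in [("strict schema extraction", 0.70), ("open-ended generation", 0.35)]:
    print(f"{label}: ~{tokens_per_step(a):.1f} tokens per verification step")
```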
A community benchmark of the 4B model class revealed pronounced "lab personalities" at small scale. NVIDIA's Nemotron 3 Nano (2.8 GB on disk) scored 85% overall with a perfect 100% on finance tasks, while IBM's Granite4 3B hit 100% on code but only 20% on reasoning -- near-mirror profiles from labs that market both as general-purpose. Microsoft's phi4-mini was the most balanced and the bang-for-GB winner at 30.8 accuracy-percent per GB. Qwen 3.5 4B scored 15%, but the methodology acknowledged this was a thinking-model-in-fixed-budget problem: the 1024-token cap is insufficient for models that need 4096+ tokens to finish their chain of thought. The eval ecosystem has a structural challenge here that per-model token budgets would partially address. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sxch39/the_4b_class_of_2026_benchmark/)
A separate study demonstrated that task-specific LoRA adapters can double the coding performance of a 7B model without modifying the coding agent itself. Four LoRA variants of Qwen2.5-7B (each adding just 3% parameters), built from the successes of a larger model, covered the pipeline's retrieval, planning, coding, and debugging roles; enabling the tuned retrieval and planning agents alone raised test passes from 3 to 10, despite neither the coder nor the debugger being touched. The tuned debugger also outperformed the tuned coder, suggesting the coding agent may have the lowest intelligence requirement in the pipeline. (more: https://www.reddit.com/r/LocalLLaMA/comments/1symfop/study_2x_coding_performance_of_7b_model_without/)
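For readers who want the shape of such an adapter, a minimal PEFT sketch follows; the rank, alpha, and target modules are illustrative defaults, not the study's configuration:

```python
# A minimal PEFT sketch of this kind of adapter; rank, alpha, and target
# modules are illustrative defaults, not the study's configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
config = LoraConfig(
    r=64,                  # adapter rank
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a small fraction of the frozen 7B base
# One adapter per pipeline role (retrieval, planning, coding, debugging) can
# be swapped onto the same frozen base model at inference time.
```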
At the far edge of the spectrum, RuVLLM ESP32 v0.3.0-rc2 shipped actual firmware for five ESP32 variants, turning $2-4 dev boards into specialist nodes in a small AI cluster. Each chip runs a single Rust program (~430 KB flash) that boots into a role -- HNSW vector indexer, RAG retriever, anomaly sentinel, semantic memory archivist, or micro-LoRA adapter -- and handles commands over USB-Serial at 115200 baud; a host-side sketch follows below. No cloud, no API key, no internet required. This is explicitly not transformer inference crammed into 4 KB of SRAM (the previous release claimed that and it did not work); it is the federation-ready primitive layer -- vector search, retrieval, memory, anomaly detection -- that a future ESP32-P4 with 8 MB PSRAM could join as a draft-token node. Separately, Zyphra dropped ZUNA on Hugging Face, though details remain sparse at launch. (more: https://github.com/ruvnet/RuVector/releases/tag/ruvllm-esp32-v0.3.0-rc2) (more: https://huggingface.co/Zyphra/ZUNA)
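On the host side, driving one of these nodes is plain serial I/O. A hedged sketch assuming pyserial, a Linux port path, and invented command strings; the release notes define the real wire protocol:

```python
# A host-side sketch, assuming pyserial, a Linux port path, and invented
# command strings; the release notes define the real wire protocol.
import serial  # pip install pyserial

with serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=2) as port:
    port.write(b"role\n")  # hypothetical: ask the node which role it booted into
    print(port.readline().decode(errors="replace").strip())
    port.write(b"query 0.12 0.87 0.33\n")  # hypothetical HNSW nearest-neighbor lookup
    print(port.readline().decode(errors="replace").strip())
```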
Developer Tools Go Open
Warp, the Rust-based terminal that launched five years ago with a promise to eventually open-source its code, finally delivered. The client is now available under AGPL, with OpenAI as founding sponsor and their agent orchestration platform Oz powering the contribution workflow. The strategic logic is transparent: Warp's team believes the biggest bottleneck to development is no longer writing code but the human-in-the-loop activities around it -- speccing product and verifying behavior -- and that opening the codebase to community contributors who supervise agents will accelerate shipping. Alongside the open-sourcing, Warp added support for open-source models (Kimi, MiniMax, Qwen) with an "auto (open)" model-routed option, a settings file for programmatic control, and customizable UI density ranging from a minimal terminal to a full agentic development environment. Public GitHub issues become the source of truth for the roadmap. The AGPL license means derivative works must also be open-source, which positions Warp as a community platform rather than a fork-and-commercialize target. (more: https://www.warp.dev/blog/warp-is-now-open-source)
Browser Use shipped BUX, a project that turns any $5 VPS into a 24/7 Claude Code agent with a persistent Chromium session. Three systemd services -- Claude Code, a browser harness via browser-use's cloud CDK, and a web terminal -- boot from a single curl install script. A Telegram bot integration lets users text their agent from anywhere. When the agent hits a login wall, 2FA, or CAPTCHA, it hands back a live view URL and waits rather than attempting credential stuffing. Agent state persists in /home/bux across reboots, preserving cookies, skills, and chat history. The managed offering provisions a box in about 60 seconds. (more: https://github.com/browser-use/bux)
Graphify takes a different approach to developer intelligence: it reads a codebase (code via tree-sitter AST parsing, docs and PDFs via Claude, images via Claude vision) and builds a queryable knowledge graph using NetworkX and Leiden community detection. The headline claim is 71.5x fewer tokens per query versus reading raw files, benchmarked on a 52-file mixed corpus of Karpathy repos, papers, and images. Every edge is tagged EXTRACTED, INFERRED, or AMBIGUOUS, and output includes interactive HTML visualization, Obsidian vault export, Wikipedia-style wiki articles for agent navigation, and Neo4j Cypher export. A post-commit git hook keeps the graph synchronized automatically. (more: https://github.com/safishamsi/graphify)
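The graph substrate itself is ordinary NetworkX, which makes the pattern easy to sketch. Note the stand-in: Leiden proper usually comes from the igraph/leidenalg stack, so NetworkX's built-in louvain_communities substitutes for it here, and the nodes and edges are invented examples:

```python
# A minimal sketch of the graph substrate: NetworkX nodes for code and doc
# entities, provenance-tagged edges, and community detection. NetworkX's
# louvain_communities stands in for Graphify's Leiden step.
import networkx as nx
from networkx.algorithms.community import louvain_communities

graph = nx.Graph()
graph.add_edge("train.py:main", "model.py:GPT", provenance="EXTRACTED")
graph.add_edge("model.py:GPT", "paper.pdf:attention", provenance="INFERRED")
graph.add_edge("README.md", "train.py:main", provenance="AMBIGUOUS")

for i, community in enumerate(louvain_communities(graph, seed=0)):
    print(f"community {i}: {sorted(community)}")
```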
Microsoft open-sourced VibeVoice, a family of frontier voice AI models spanning TTS and ASR. The ASR model handles 60-minute long-form audio in a single pass with structured output (who said what and when), customizable hotwords for domain-specific accuracy, and support for 50+ languages. The TTS model generates up to 90 minutes of multi-speaker conversational audio. The core innovation is continuous speech tokenizers operating at 7.5 Hz -- an ultra-low frame rate that preserves fidelity while dramatically reducing sequence length for long-form processing. The TTS code was previously removed after misuse concerns; the ASR and real-time streaming models (0.5B parameters, ~300ms first-audible latency) remain available. (more: https://github.com/microsoft/VibeVoice)
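The sequence-length savings from that frame rate are easy to quantify; the 50 Hz comparison point is a typical codec rate chosen for illustration, not a figure from the repo:

```python
# Sequence length for long-form audio at different tokenizer frame rates.
# The 50 Hz baseline is a typical codec figure for comparison, not a number
# from the VibeVoice repo.
minutes = 90
for name, hz in [("VibeVoice @ 7.5 Hz", 7.5), ("typical codec @ 50 Hz", 50.0)]:
    frames = int(minutes * 60 * hz)
    print(f"{name:>22}: {frames:,} frames for {minutes} minutes of audio")
```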
Industry Shifts and Unusual Experiments
Microsoft and OpenAI are ending their exclusive partnership and revenue-sharing arrangement, according to Bloomberg. The deal restructuring removes the exclusivity that bound OpenAI to Microsoft's cloud infrastructure and eliminates the revenue-sharing provisions that gave Microsoft a cut of OpenAI's commercial income. The move signals both companies positioning for independence -- Microsoft diversifying its AI bets across multiple model providers, OpenAI seeking flexibility to work with other cloud and distribution partners. The full implications for Azure's AI strategy and OpenAI's path to profitability will take time to materialize, but the structural shift is significant: the most consequential AI partnership of the last five years is being unwound. (more: https://www.bloomberg.com/news/articles/2026-04-27/microsoft-to-stop-sharing-revenue-with-main-ai-partner-openai)
A pointed video essay took aim at what it calls "token budgets" -- the practice, reportedly in use at Facebook, of tracking employees' AI tool consumption via leaderboards. The critique: the person at the top of the token leaderboard is reviewing zero code, the metric creates perverse incentives toward generating slop at volume, and Jensen Huang's framing of $250,000/year/employee in token spend as a productivity benchmark serves primarily to funnel corporate savings toward GPU vendors and AI providers. The broader argument -- the "Bittar lesson" -- posits that AI usefulness is inversely proportional to the precision required: LLMs approximate language which approximates intent, and the compounding approximation hits a wall at roughly 80% of any task, leaving the hard 20% as irreducibly human work. Whether or not one accepts the full thesis, the observation that token consumption is becoming a proxy metric for productivity -- with the same pathologies as lines-of-code metrics -- deserves attention. (more: https://youtu.be/NZa5lApeFic?si=9ZaQq_-gumhIRhW9)
Talkie, a 13B language model trained exclusively on 260B tokens of pre-1931 English text, offers a genuinely novel angle on LLM capabilities. Built by Nick Levine, David Duvenaud, and Alec Radford, the model is contamination-free by construction -- enabling generalization experiments impossible with web-trained models, like testing whether a model with no knowledge of digital computers can learn to code in Python from in-context examples. Early results show it can produce simple correct programs (single-line, or small modifications to examples), and the success rate improves with scale. The post-training pipeline avoids modern instruction-response pairs entirely, instead using historical etiquette manuals, letter-writing guides, and cookbooks, with Claude Opus as a DPO judge. Compared with a "modern twin" trained on FineWeb with the same compute, Talkie underperforms on some benchmarks, a gap the team attributes to OCR noise in the historical corpus: their vintage OCR system achieves only 30% of the learning efficiency of human-transcribed text, with regex cleaning recovering that to 70%. The plan is to scale to GPT-3.5 level using a trillion-token historical corpus. The research question underneath -- how much of what we think we know about LLMs is about language in general versus the web as a dataset -- is one of the more interesting epistemological questions in the field right now. (more: https://talkie-lm.com/introducing-talkie)
Sources (22 articles)
- Shai-Hulud Themed Malware Found in the PyTorch Lightning AI Training Library (semgrep.dev)
- For Linux kernel vulnerabilities, there is no heads-up to distributions (openwall.com)
- Claude Code refuses requests or charges extra if your commits mention "OpenClaw" (twitter.com)
- [Editorial] Finding Zero-Days with Any Model (provos.org)
- [Editorial] OIA Agentics — Open Interoperability for Agentic AI (oia.agentics.org)
- Agentic Microphysics: A Manifesto for Generative AI Safety (arxiv.org)
- [Editorial] Agentic AI: Lessons from the Trenches (linkedin.com)
- [Editorial] Human Work Time Allocation in the Hybrid AI Era (linkedin.com)
- llama.cpp DeepSeek v4 Flash experimental inference (reddit.com)
- llama.cpp benchmark native vs. non-native NVFP4 on Blackwell — summary (reddit.com)
- Speculative decoding with Gemma-4-31B + Gemma-4-E2B enables 120-200 tok/s (reddit.com)
- The 4B class of 2026 (benchmark) (reddit.com)
- Study: 2x+ coding performance of 7B model without touching the coding agent (reddit.com)
- [Editorial] RuVLLM ESP32 v0.3.0-rc2 — LLM Inference on Microcontrollers (github.com)
- Zyphra/ZUNA — New Model on Hugging Face (huggingface.co)
- Warp is now open-source (warp.dev)
- browser-use/bux: 24/7 Claude Code Agent with Browser Harness (github.com)
- safishamsi/graphify — Turn code into a queryable knowledge graph (github.com)
- Microsoft VibeVoice: Open-Source Frontier Voice AI (github.com)
- Microsoft and OpenAI end their exclusive and revenue-sharing deal (bloomberg.com)
- [Editorial] Video: AI Development Insights (youtu.be)
- Talkie: a 13B vintage language model from 1930 (talkie-lm.com)