Supply Chain Under Siege

Today's AI news: Supply Chain Under Siege, Europe's Digital Security Reckoning, Agent Orchestration Grows Up, Model Architecture: From Monolith to Modular, Local AI Performance Wars, Intent, Evaluation, and the Personality Problem. 22 sources curated from across the web.

Supply Chain Under Siege

The supply chain worm known as Shai-Hulud (named, with apparent literary flair, after the sandworms of Dune) has burrowed deeper. Socket Security confirmed that newly compromised packages now include @opensearch-project/opensearch (versions 3.5.3 through 3.8.0, pulling 1.3 million weekly downloads), mistralai 2.4.6 on PyPI, and guardrails-ai 0.10.1 on PyPI. The guardrails-ai package executes malicious code on import, downloading a payload from git-tanstack[.]com and running it without integrity verification. The attackers left a calling card: the domain displayed a message signed "With Love TeamPCP," alongside a boast about stealing credentials for over two hours. (more: https://x.com/SocketSecurity/status/2054048025081737446?s=20)

Meanwhile, model repositories are getting the same treatment as package registries. A fake "model" called Open-OSS/privacy-filter appeared on Hugging Face, presenting itself as an OpenAI privacy filter. It is actually a Python-based dropper that downloads a malicious PowerShell command, which spawns another PowerShell instance, which downloads a Rust-compiled infostealer targeting Chrome credentials, WinSCP sessions, and more. The chain is loader.py to base64-encoded URL to PowerShell batch file to yet another base64 PowerShell script to the final compiled binary: layers of obfuscation that would be comical if the package hadn't accumulated 244,000 downloads before being reported. The accounts liking the model show patterns consistent with coordinated bot activity, some already closed for violations yet still boosting new malware. At press time, the repo remained live. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t6febk/warning_openossprivacyfilter_malware/)

The absurdity of modern supply chain security is best captured by a satirical incident report making the rounds. "CVE-2024-YIKES" chronicles a fictional cascade where a compromised npm dependency leads to credential theft, which enables a supply chain attack on a Rust compression library, which gets vendored into a Python build tool, which ships malware to approximately four million developers, before being inadvertently patched by an unrelated cryptocurrency mining worm. Every individual failure mode is real: hardware 2FA keys going missing, Google AI Overviews linking to phishing sites, Dependabot auto-merging malicious PRs because CI passes after the malware installs the test runner. "Total machines saved by a cryptocurrency worm: also estimated 4.2 million. Net security posture change: uncomfortable." (more: https://nesbitt.io/2026/02/03/incident-report-cve-2024-yikes.html)

On the defense side, Root Evidence's Robert Hansen proposes a provocative strategy: use AI to rewrite third-party libraries entirely, eliminating CVEs by removing the vulnerable code from your codebase. The tool, Lib-Theseus (after the ship of Theseus paradox), is a Claude Code / Codex skill that identifies dependencies, studies their actual usage, generates tests, writes replacement code, and cleans up artifacts. In testing, it rewrote 14 of 18 modules (78%), with only platform-critical packages like Electron left untouched. The upsides are clear: no vulnerable library means no CVE, no future supply chain compromise, and potentially fewer licensing headaches. The downsides are equally real: npm audit cannot find vulnerabilities it does not know exist, and AI-generated replacements could introduce novel bugs. (more: https://www.linkedin.com/pulse/removing-3rd-party-cves-rootevidence-posyc/?trackingId=kJWpzc9eSQaWvjS3pkhobA==)

Europe's Digital Security Reckoning

The Internet Cleanup Foundation launched SecurityBaseline.eu, a transparency platform auditing the cybersecurity posture of European government websites across 32 countries, 67,000 local governments, and roughly 200,000 domains. The project expands the Dutch "Basisbeveiliging" initiative, which has been measuring government website security for over a decade, to the entire EU and EEA. Three months before launch, the foundation sent tens of thousands of advance notification emails to affected governments. The results suggest those emails were largely ignored. (more: https://internetcleanup.foundation/2026/05/european-governments-3000-tracking-sites-1000-phpmyadmins-and-99pct-poorly-encrypted-email-introducing-securitybaseline-eu/)

Three findings stand out. First, 3,081 government websites place tracking cookies on visitors without consent, a straightforward GDPR violation. YouTube embeds are the primary offender at 2,077 tracking cookies, followed by Google Ads at 842, Facebook at 293, and TikTok at 20. Slovakia leads at nearly 10% of government sites, followed by Greece at 8% and Portugal at 7.6%. The foundation notes these are mostly side effects of embedding modern web components without understanding the surveillance implications.

Second, over 1,070 phpMyAdmin database administration portals are publicly reachable across EU government domains. France leads with 513 exposed instances, Poland with 499, Hungary with 368. Two of these panels sit at CSIRT addresses: the organizations theoretically responsible for preventing exactly this kind of exposure. The foundation found zero financial contributions from any EU government to the phpMyAdmin open-source project, despite widespread dependency on it.

Third, 99% of European governmental email fails current encryption best practices. Only the Netherlands (58% compliance) and Denmark (44%) show promising numbers. There are no EU-level TLS standards for government email, and the country-specific guidelines that do exist are incompatible. The platform rebuilds its 1,827 maps nightly across 21 metrics, coloring each region traffic-light style. Right now, the continent is overwhelmingly red.
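
What does checking email encryption posture actually involve? As one narrow illustration, the stdlib-only sketch below probes a single indicator, the MTA-STS policy file from RFC 8461; the domain is a placeholder, and the platform's 21 metrics go well beyond this one signal (a full check would also verify the _mta-sts DNS TXT record, DANE, and certificates).

```python
import urllib.request

def fetch_mta_sts_policy(domain: str, timeout: float = 5.0) -> str | None:
    """Fetch the RFC 8461 MTA-STS policy file, if the domain publishes one."""
    url = f"https://mta-sts.{domain}/.well-known/mta-sts.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None  # endpoint unreachable or not serving a policy

policy = fetch_mta_sts_policy("example.gov")  # placeholder domain
print("MTA-STS enforced" if policy and "mode: enforce" in policy
      else "no enforced MTA-STS policy found")
```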

Agent Orchestration Grows Up

The gap between "agents that work" and "agents you would trust in production" is where the interesting engineering happens. A research paper from Huawei Noah's Ark Lab and UCL pushes multi-agent coordination into genuinely novel territory by modeling it as organizational design. OneManCompany (OMC) separates a portable "Talent" identity (role, skills, working principles) from a runtime "Container" (LangGraph, Claude Code, or script-based), connected through six typed interfaces that mirror an OS kernel: process management, memory, filesystem, I/O, IPC, and security. A community-driven Talent Market provides on-demand recruitment through three channels: curated open-source agents, prompt-sourced personas assembled with matched skills, and dynamically assembled agents from cloud skill libraries. (more: https://arxiv.org/abs/2604.22446v1)
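
The paper's actual interfaces are not reproduced here, but the separation is easy to picture. The sketch below paraphrases it in Python; the names (Talent, Container, the method signatures) are invented for illustration, not OMC's real API.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Talent:
    """Portable identity: moves between runtimes unchanged."""
    role: str
    skills: list[str]
    principles: list[str] = field(default_factory=list)

class Container(Protocol):
    """Runtime host (LangGraph, Claude Code, or a plain script)."""
    def spawn(self, talent: Talent) -> str: ...               # process management
    def read_file(self, path: str) -> bytes: ...              # filesystem
    def send(self, agent_id: str, message: str) -> None: ...  # IPC
    # memory, I/O, and security interfaces elided
```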

The most provocative element is OMC's HR pipeline. Agents undergo one-on-one feedback after each task, participate in project retrospectives that produce updated SOPs, and face formal performance reviews every three projects. Fail three consecutive reviews and you enter a Performance Improvement Plan. Fail one more and it is automated offboarding: the container is deprovisioned and the capability gap is flagged for re-recruitment. On PRDBench (50 project-level software tasks), OMC achieved an 84.67% success rate, a 15-percentage-point improvement over prior state of the art, at about $6.91 per task. Cross-domain case studies ranged from $1.57 for audiobook production to $16.26 for an automated academic research survey.
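
That review cadence reduces to a small state machine. A toy rendering, assuming the thresholds are exactly as described (a formal review every three projects, PIP after three consecutive failures, offboarding after one more failure while on a PIP):

```python
from dataclasses import dataclass

@dataclass
class AgentRecord:
    fails: int = 0          # consecutive failed formal reviews
    on_pip: bool = False    # Performance Improvement Plan
    offboarded: bool = False

def apply_review(rec: AgentRecord, passed: bool) -> None:
    """Run after every third project, per the cadence described above."""
    if passed:
        rec.fails, rec.on_pip = 0, False
        return
    rec.fails += 1
    if rec.on_pip:
        rec.offboarded = True   # deprovision container, flag capability gap
    elif rec.fails >= 3:
        rec.on_pip = True
```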

GammaLab's Harmonist takes a complementary approach: instead of organizational metaphors, it enforces protocol compliance mechanically. Every code-changing turn runs through IDE hooks that verify required reviewers executed, memory was updated, and protocol was satisfied, returning a structured followup message if anything is missing. The LLM literally cannot ship code that skipped review. Supply-chain integrity is baked in via a SHA-256 manifest for every file, with the upgrade process verifying hashes before copying and refusing tampered agent definitions outright. A prompt-injection scanner checks agent markdown for override attempts, exfiltration patterns, and policy subversion, all running on pure Python stdlib and bash, zero external dependencies. (more: https://github.com/GammaLabTechnologies/harmonist)
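
A minimal sketch of the manifest idea, stdlib-only in keeping with Harmonist's zero-dependency constraint; the manifest format and function names here are assumptions, not the project's actual code.

```python
import hashlib, json, pathlib, shutil, sys

def verify_and_copy(src_dir: str, dst_dir: str, manifest_path: str) -> None:
    """Refuse the whole upgrade if any file fails its SHA-256 check."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())  # {relpath: hexdigest}
    src, dst = pathlib.Path(src_dir), pathlib.Path(dst_dir)
    for relpath, expected in manifest.items():
        actual = hashlib.sha256((src / relpath).read_bytes()).hexdigest()
        if actual != expected:
            sys.exit(f"refusing upgrade: {relpath} hash mismatch")
    for relpath in manifest:  # copy only after every file has verified
        target = dst / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src / relpath, target)
```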

On the practical deployment side, the community consensus around OpenAI's Codex-at-scale approach was telling: "The most reliable safety mechanism isn't prompt-level constraints; it's tool scope. An agent that can only write to specific directories and run specific commands fails safely when it tries to exceed scope." Hard boundaries beat soft instructions every time. (more: https://www.reddit.com/r/OpenAI/comments/1t8aezx/how_openai_runs_its_codex_coding_agent_safely_at/) DeepSwarm offers a narrower but practical tool: task-agnostic parallel API worker orchestration with tiered model delegation (frontier model plans, cheaper model executes), auto-optimized worker counts, and 99.95% API success rates across 31,000 calls. (more: https://github.com/amanning3390/deepswarm) Daniel Miessler's PAI project, described as "an OS on top of Claude Code" with 12,000 GitHub stars, ripped out RAG entirely in favor of plain text files navigated with ripgrep, reasoning that with sufficient context windows, search becomes navigation. Its three-tier memory hierarchy (WORK, KNOWLEDGE, LEARNING) compounds across sessions while the system itself deliberately shrinks as models improve (what the developers call "bitter-pilled engineering"). (more: https://www.linkedin.com/posts/paoloperrone_daniel-miessler-built-an-operating-system-activity-7459732498499985408--RxF)
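
The Codex-at-scale point about tool scope is concrete enough to sketch. A minimal, hypothetical version of a scoped write tool, with the allowed root as a placeholder; the point is that the boundary lives in the tool, not in the prompt.

```python
from pathlib import Path

ALLOWED_ROOTS = [Path("/workspace/repo").resolve()]  # placeholder scope

def scoped_write(path: str, content: str) -> str:
    """A write tool that fails safely when asked to exceed its scope."""
    target = Path(path).resolve()  # normalizes ".." escapes and symlinks
    if not any(target.is_relative_to(root) for root in ALLOWED_ROOTS):
        return f"denied: {target} is outside the writable scope"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {len(content)} bytes to {target}"
```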

Model Architecture: From Monolith to Modular

Mixture-of-Experts models have been pitched as a way to get big-model performance at small-model cost, but in practice, removing experts kills performance because routing learns low-level lexical patterns rather than semantic domains. Allen AI's EMO (Emergent MOdularity) fixes this with a deceptively simple change: during pretraining, all tokens in a document must choose active experts from a shared pool. This single constraint causes expert groups to naturally specialize by domain (code, law, biomedical, web content) instead of clustering around prepositions and punctuation. (more: https://huggingface.co/blog/allenai/emo)
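
The constraint itself is small enough to show. A toy numpy rendering of pool-restricted routing, with expert counts matching the blog's figures but the pool size fixed at 32 purely for illustration; EMO's real router and pool-sampling schedule are in the paper.

```python
import numpy as np

n_experts, k_active = 128, 8
rng = np.random.default_rng(0)

# One shared pool per document: every token in the document must route
# through it, the constraint that pushes experts toward domain specialization.
doc_pool = rng.choice(n_experts, size=32, replace=False)

def route(router_logits: np.ndarray, pool: np.ndarray) -> np.ndarray:
    """Top-k expert choice for one token, restricted to the document pool."""
    masked = np.full(n_experts, -np.inf)
    masked[pool] = router_logits[pool]  # experts outside the pool unreachable
    return np.argsort(masked)[-k_active:]

print(sorted(route(rng.normal(size=n_experts), doc_pool).tolist()))
```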

The numbers are striking. EMO uses 1 billion active parameters from 14 billion total (8 of 128 experts per token, trained on 1 trillion tokens). At 12.5% of experts retained, performance drops only about 3%. A standard MoE at the same subset size degrades to near-random. At 25%, the drop is roughly 1%. Expert selection is remarkably cheap โ€” a single few-shot example suffices to identify a good subset for any domain. Global load balancing across documents rather than within a micro-batch proved essential for training stability, and the pool size is randomly sampled during training rather than fixed. The practical implication is significant: deploy only the expert subset needed for a specific task, achieving memory-accuracy tradeoffs standard MoEs cannot offer. Model, baseline, and training code are all open.

If EMO rethinks model structure, Thinking Machines Lab rethinks model interaction entirely. Their "interaction models" research preview replaces turn-based exchanges with continuous 200-millisecond micro-turns that interleave input processing and output generation across audio, video, and text simultaneously. There are no artificial turn boundaries: the model tracks implicit turn states, interjects proactively, translates live, and executes concurrent tool calls while conversing. The architecture pairs a real-time interaction model (276B MoE, 12B active) with an asynchronous background model for deep reasoning, achieving state-of-the-art combined performance on both intelligence and interactivity benchmarks. Existing models, including GPT Realtime-2, cannot meaningfully perform the visual proactivity tasks the team introduces, such as counting exercise repetitions or responding to visual cues without being prompted. (more: https://thinkingmachines.ai/blog/interaction-models)
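
The shape of the control loop is worth sketching, though everything below is a paraphrase of the public description, not the preview's actual architecture: a fixed time slice, input ingested every slice, output optional.

```python
import asyncio, time

async def micro_turn_loop(get_input, step_model, emit, slice_ms: int = 200):
    """Ingest whatever arrived in the last slice, optionally emit, repeat."""
    state = None
    while True:
        t0 = time.monotonic()
        chunk = get_input()            # audio/video/text gathered this slice
        state, output = step_model(state, chunk)
        if output is not None:         # the model may stay silent or interject
            emit(output)
        elapsed = time.monotonic() - t0
        await asyncio.sleep(max(0.0, slice_ms / 1000 - elapsed))
```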

Local AI Performance Wars

A single command-line flag is saving people thousands of dollars. A llama.cpp user discovered that increasing the micro-batch size (-ub) dramatically improves prompt processing for partially offloaded MoE models. On an RTX 3090 running gpt-oss-120b, pushing -ub from the default 512 to 8192 boosted prefill throughput from 380 tok/s to 2,090 tok/s, a 5.5x improvement, while token generation dropped only 7%. The trade-off: larger batches need more GPU workspace, requiring a few extra MoE layers on the CPU via --n-cpu-moe. Bigger batches mean fewer kernel launches, keeping the GPU saturated during prefill, while generation speed barely changes because it is memory-bandwidth bound on CPU expert weights. "One of the reasons I bought a DGX Spark was better prompt processing. If I had known about this trick, I might not have." (more: https://www.reddit.com/r/LocalLLaMA/comments/1tany5t/drastically_improve_prompt_processing_speed_for/)
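
For reference, a representative invocation under those settings; the model path and the --n-cpu-moe layer count are placeholders to tune for your VRAM, while -ub (--ubatch-size) is the flag the post raises from its default of 512.

```bash
# Placeholder model path; bump --n-cpu-moe until the larger -ub workspace fits.
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 30 -ub 8192
```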

Speculative decoding is pushing consumer GPUs into territory that previously demanded multi-card setups. Gemma 4 26B (a MoE with 4 billion active parameters) hit 578 output tokens per second on a single RTX 5090 using DFlash speculative decoding in vLLM, a 2.56x speedup over the 228 tok/s baseline. The optimal setting was 13 speculative tokens with max batched tokens of 8192; interestingly, the lowest mean latency configuration was not the best serving configuration due to worse p95 tail latency. Community pushback noted DFlash performance degrades above 20K context and that random dataset benchmarks overstate acceptance rates versus real workloads, but for short-context inference, the throughput is undeniable. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t796qe/gemma_4_26b_hits_600_toks_on_one_rtx_5090/)
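
For readers new to the mechanism, here is a greedy toy of the generic draft-and-verify loop. This illustrates speculative decoding in general, not DFlash's specific drafting scheme, which the thread does not detail; draft_next and target_next are stand-in callables.

```python
def speculative_step(draft_next, target_next, prefix: list[int], k: int = 13) -> list[int]:
    """Draft k tokens with the cheap model, keep the run the target agrees with."""
    ctx, drafted = list(prefix), []
    for _ in range(k):
        tok = draft_next(ctx)       # cheap model proposes sequentially
        drafted.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        # Real engines verify all k drafts in one batched target forward
        # pass, which is where the speedup comes from.
        if target_next(ctx) != tok:  # first disagreement ends the run
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(list(prefix) + accepted))  # >= 1 token/step
    return accepted
```

The 13-token setting above corresponds to k: draft deeper and you amortize more target passes, but the acceptance rate, and thus the realized speedup, falls off as drafts diverge from the target.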

AMD's viability as a full training platform gets a concrete data point with ZAYA1-8B, a frontier MoE model pretrained entirely on 1,024 AMD MI300X nodes with Pensando Pollara interconnect. The architecture introduces Markovian RSA (a novel reinforcement learning technique) and uses 0.8 billion active parameters from 8 billion total. Whether it beats Qwen-3.5-9B is debatable, but the training stack proof-of-concept matters: "The hardest part is always the first run for a new lab. And given that they're running on an AMD stack, they had an even bigger hill to climb." (more: https://www.reddit.com/r/LocalLLaMA/comments/1t5nll0/zaya18b_frontier_intelligence_density_trained_on/)

On the tooling front, ggerganov merged llama-eval directly into llama.cpp, providing standardized evaluation against AIME, AIME2025, GSM8K, and GPQA without the API-key-or-transformers dependency that plagues most benchmarking setups: "saves everyone from setting up their own janky benchmark pipeline that measures different things." (more: https://www.reddit.com/r/LocalLLaMA/comments/1tb0uln/examples_add_llamaeval_by_ggerganov_pull_request/) And at the opposite extreme, someone got Karpathy's TinyStories-260K running on a stock Game Boy Color: INT8 weights, fixed-point math, bank-switched cartridge ROM, KV cache in SRAM. It is extremely slow and the output is gibberish, but transformer prefill and autoregressive generation genuinely execute on 1998 hardware with 32KB of work RAM. As one commenter put it: "Pointless. Therefore, indispensable." (more: https://www.reddit.com/r/LocalLLaMA/comments/1tbi2n3/i_got_a_real_transformer_language_model_running/)
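
The fixed-point arithmetic is the least exotic part and easy to demonstrate. A Q8.8 example; the post does not specify its exact format, so this is purely illustrative of the style of math an 8-bit CPU without an FPU would use.

```python
def q88(x: float) -> int:
    """Encode a float as Q8.8 fixed point in 16-bit two's complement."""
    return int(round(x * 256)) & 0xFFFF

def q88_to_float(v: int) -> float:
    return (v - 0x10000 if v & 0x8000 else v) / 256

def q88_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 values: widen to signed, multiply, shift back by 8."""
    sa = a - 0x10000 if a & 0x8000 else a
    sb = b - 0x10000 if b & 0x8000 else b
    return (sa * sb >> 8) & 0xFFFF

print(q88_to_float(q88_mul(q88(0.5), q88(-1.25))))  # -0.625
```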

Intent, Evaluation, and the Personality Problem

A new benchmark called IntentGrasp puts numbers on something practitioners have long suspected: AI models are substantially worse at understanding what users actually want than at generating fluent responses. Across 20 models from seven families, all scored below 60% on the full benchmark and below 25% on the harder "Gem Set." The paper distinguishes declared intent (what someone types) from enacted intent (what someone does), arguing that as agents turn language into action, misinterpretation does not produce a wrong answer; it produces a correctly executed wrong goal. The good news: intentional fine-tuning substantially improves performance, meaning intent comprehension is trainable, not a fixed ceiling. The commercial implications are clear: organizations wanting agents to handle more valuable work need better intent interpretation before they widen authority: "clarification loops, context grounding, task boundaries, escalation points, and stop conditions." (more: https://www.linkedin.com/posts/stuart-winter-tear_intentgrasp-a-benchmark-for-intent-understanding-ugcPost-7459864899524075520-DfRE)
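
One of those mitigations, the clarification loop, is straightforward to sketch. Everything here (the threshold, the round limit, the estimate_intent callable) is a hypothetical placeholder for whatever interpreter model a system actually uses.

```python
from dataclasses import dataclass

@dataclass
class IntentReading:
    goal: str
    confidence: float  # 0..1, from the intent-interpreter model

def act_or_clarify(request: str, estimate_intent, execute, ask_user,
                   threshold: float = 0.8, max_rounds: int = 3):
    """Refuse to act until the agent's reading of the request clears a bar."""
    for _ in range(max_rounds):
        reading = estimate_intent(request)
        if reading.confidence >= threshold:
            return execute(reading.goal)  # enacted matches declared intent
        request += "\n" + ask_user(f"Did you mean: {reading.goal}?")
    return None  # stop condition: escalate rather than guess
```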

The evaluation problem runs deeper than intent. A candid post on r/learnmachinelearning confesses what many teams will not: "Most of what I've been calling 'evals' has been vibe checks." The author nearly shipped Haiku over Sonnet based on testing six prompts, only to find in a proper held-out evaluation that Sonnet was winning by a meaningful margin on the cases that mattered most. A real eval requires a held-out dataset, a scoring rubric, and a comparison table: "when an engineer asks 'why are we picking this' you have numbers not vibes." The community response confirmed how endemic the problem is: "6 prompts and a gut feeling being called evals is way more common than people want to admit." (more: https://www.reddit.com/r/learnmachinelearning/comments/1tb5uwr/rant_the_realization_that_most_of_what_ive_been/)
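
The fix the post lands on fits in a dozen lines. A minimal harness, assuming you supply the held-out dataset, the generation callables, and a rubric-based score function (exact match, LLM-as-judge, whatever your task demands):

```python
def run_eval(models: dict, dataset: list[dict], score) -> None:
    """models: name -> generate(input) callable; score: (output, expected) -> float."""
    results = {name: [] for name in models}
    for example in dataset:  # held out: never used for prompt tuning
        for name, generate in models.items():
            results[name].append(score(generate(example["input"]),
                                       example["expected"]))
    print(f"{'model':<12}{'mean':>8}{'n':>6}")  # the comparison table
    for name, scores in results.items():
        print(f"{name:<12}{sum(scores)/len(scores):>8.3f}{len(scores):>6}")
```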

Claude specifically is developing what users describe as "radical over-honesty mixed with absolute certainty about timelines it doesn't understand": confidently estimating months of team effort for things already running in another terminal. The training data still assumes software is built through committees and sprint planning; meanwhile practitioners are shipping entire frameworks between breakfast and dinner. The irony: the same model lecturing about ambition is often fully capable of building the thing if you ignore the lecture and keep prompting. (more: https://www.linkedin.com/posts/reuvencohen_claude-has-developed-a-fascinating-new-personality-share-7459949962026627072-WZpJ) The practical remedy, according to Ofer Maor, is structured team-wide AI literacy: weekly show-and-tell sessions where people demonstrate what they have built, pushing everyone past the "event horizon" of AI savviness where it becomes hard not to be a fanatic. (more: https://www.linkedin.com/posts/ofermaor_the-ai-revolution-activity-7459984889619308547-Dwzg)

On the infrastructure side, FaultLine tackles LLM memory reliability by building a searchable knowledge graph from conversations with explicit validation gates. Facts are validated against existing knowledge, stored in a user-controlled database, and injected as grounding context before the model responds. Short-term vector search handles recent context while long-term relational storage locks in validated facts across sessions. The project draws on several open research problems (write gates, fact lifecycle management, self-building ontologies) that remain unshipped in production systems, making it an early test of whether academic memory governance concepts actually work outside the lab. (more: https://www.reddit.com/r/OpenWebUI/comments/1talf31/faultline_llm_memory_with_a_bouncer_at_the_door/)
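
A write gate in miniature, with a deliberately naive contradiction check standing in for whatever validation rules FaultLine actually applies; the schema and function are assumptions for illustration.

```python
import sqlite3

def gated_write(db: sqlite3.Connection, subject: str,
                predicate: str, obj: str) -> bool:
    """Admit a fact only if it doesn't contradict one already validated."""
    row = db.execute("SELECT object FROM facts WHERE subject=? AND predicate=?",
                     (subject, predicate)).fetchone()
    if row is not None and row[0] != obj:
        return False  # bounced at the door: conflicting stored fact
    db.execute("INSERT OR REPLACE INTO facts VALUES (?, ?, ?)",
               (subject, predicate, obj))
    return True

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (subject TEXT, predicate TEXT, object TEXT, "
           "PRIMARY KEY (subject, predicate))")
print(gated_write(db, "user", "home_city", "Berlin"))  # True: new fact admitted
print(gated_write(db, "user", "home_city", "Madrid"))  # False: contradiction bounced
```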

Sources (22 articles)

  1. [Editorial] (x.com)
  2. WARNING: Open-OSS/privacy-filter MALWARE (reddit.com)
  3. Incident Report: CVE-2024-YIKES (nesbitt.io)
  4. [Editorial] (linkedin.com)
  5. SecurityBaseline.eu (internetcleanup.foundation)
  6. From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company (arxiv.org)
  7. GammaLabTechnologies/harmonist (github.com)
  8. How OpenAI runs its Codex coding agent safely at scale (reddit.com)
  9. amanning3390/deepswarm (github.com)
  10. [Editorial] (linkedin.com)
  11. EMO: Pretraining mixture of experts for emergent modularity (huggingface.co)
  12. [Editorial] (thinkingmachines.ai)
  13. Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models (reddit.com)
  14. Gemma 4 26B Hits 600 Tok/s on One RTX 5090 (reddit.com)
  15. ZAYA1-8B: Frontier intelligence density, trained on AMD (reddit.com)
  16. examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp (reddit.com)
  17. I got a real transformer language model running locally on a stock Game Boy Color! (reddit.com)
  18. [Editorial] (linkedin.com)
  19. Rant: The realization that most of what ive been calling "evals" has been vibe checks. (reddit.com)
  20. [Editorial] (linkedin.com)
  21. [Editorial] (linkedin.com)
  22. FaultLine - LLM memory with a bouncer at the door (reddit.com)