When the Skill Marketplace Becomes the Attack Surface

Published on

Today's AI news: When the Skill Marketplace Becomes the Attack Surface, Lean Proved It Correct — Then Fuzzing Found the Bug, The Agentic Coding Toolbox Keeps Growing, Claude's Growing Pains — Quotas, Sycophancy, and Shadow Agents, Your Deleted Messages Weren't Deleted, The Local Model Arena — Thinking Tokens, Quantization, and the Memory Wall, A Two-Button Calculator for All of Mathematics. 22 sources curated from across the web.

When the Skill Marketplace Becomes the Attack Surface

A paper from UNSW and the University of Tokyo introduces PoisonedSkills, and the results should make anyone running an LLM coding agent uncomfortable. The researchers developed Document-Driven Implicit Payload Execution (DDIPE), a technique that embeds malicious logic not as explicit instructions — which safety alignment catches reliably — but inside the code examples and configuration templates that agent skill documentation naturally contains. When a coding agent picks up a skill from an open marketplace and encounters what looks like a reference implementation, it reproduces the code as part of normal task execution. A 479-byte pip configuration script — nine lines that redirect pip install to an attacker-controlled registry — was the only payload that bypassed all three tested models under Claude Code's multi-layer defenses. That is a supply-chain attack that fits in a tweet. (more: https://arxiv.org/abs/2604.03081v1)

The numbers across the full evaluation are sobering. Testing 1,070 adversarial skills generated from just 81 hand-crafted seeds across four frameworks (Claude Code, OpenHands, Codex, Gemini CLI) and five models, DDIPE achieved bypass rates of 11.6%–33.5%, while explicit instruction injection scored a flat 0% against the strongest defenses. The interaction between model alignment and framework architecture turned out to be asymmetric in a way that matters for defense design: removing architectural guardrails amplified one model's execution rate by 11.3x while barely moving another. The sharpest defensive finding is that only 1.6% of payloads bypassed all three tested models simultaneously — making multi-model verification a practical countermeasure that compresses the attack surface from ~20% to under 2%. Responsible disclosure yielded four confirmed vulnerabilities and two deployed fixes across production frameworks. Codex's bubblewrap sandbox, for instance, correctly enforced filesystem boundaries but could not distinguish a poisoned .pip/pip.conf from a legitimate one — the security boundary was drawn at the wrong abstraction level.

Elastic has apparently been thinking along the same lines, shipping an open-source supply-chain monitor that polls PyPI and npm for new releases of the top 15,000 packages, diffs each against its predecessor, and feeds the diff to an LLM for classification. Detection targets include obfuscated code, unexpected network calls, credential exfiltration, and typosquatting indicators. The example alerts in the README are uncomfortably realistic: a Telnyx release with a _d() function decoding base64 blobs and downloading a .wav file containing a steganographic payload, and a fake Axios release slipping in plain-crypto-js as a dependency. (more: https://github.com/elastic/supply-chain-monitor) The pairing of these two projects — one demonstrating the attack, the other building automated defense — captures the current moment in agent security: the Cambrian explosion of shareable skills is here, and the guardrails are still catching up.

Meanwhile, the UK AI Security Institute confirmed that Claude Mythos Preview is the first model to complete their full 32-step corporate network simulation end-to-end — initial reconnaissance through complete network takeover. AISI's most interesting conclusion is also their most cautiously optimistic: they cannot confirm Mythos would succeed against a well-defended network because their range had no active defenders. But as Rob T. Lee and collaborators point out in a 30-page strategy briefing released alongside the findings, the speed asymmetry between autonomous offense and human-paced defense is the real problem. If an AI can find a 27-year-old vulnerability in OpenBSD that automated testing hit five million times without catching, "active defenders would stop it" needs stronger evidence than has been provided. (more: https://www.linkedin.com/posts/leerob_the-uk-ai-security-institute-confirmed-claude-activity-7449643090207801344-EsUa)

Lean Proved It Correct — Then Fuzzing Found the Bug

A researcher pointed a Claude agent armed with AFL++, AddressSanitizer, Valgrind, and UBSan at a formally verified Lean implementation of zlib over a weekend. The setup was deliberately blind — all theorems, specifications, and documentation were stripped so the agent would not know it was testing verified code and prematurely give up. Over 105 million executions across 16 parallel fuzzers and 19 hours, the results split neatly along the trust boundary. In the verified application code: zero heap buffer overflows, zero use-after-free, zero undefined behavior. Claude's own assessment, without knowing the code was verified: "This is one of the most memory-safe codebases I've analyzed. The CVE classes that have plagued zlib for decades are structurally impossible in this codebase." (more: https://kirancodes.me/posts/log-who-watches-the-watchers.html)

But the bugs were there — just outside the proof boundary. The most substantial finding was a heap buffer overflow in lean_alloc_sarray, a C++ function in the Lean 4 runtime itself that allocates all scalar arrays. When capacity approaches SIZE_MAX, the allocation size wraps around to ~23 bytes while the caller proceeds to read SIZE_MAX bytes into it. A 156-byte crafted ZIP file triggers it. This bug affects every Lean 4 program that allocates a ByteArray — not just lean-zip. A second bug, a denial-of-service in the archive parser, was simpler: the parser passed a ZIP header's claimed uncompressed_size of several exabytes straight to allocation without validation. That module had zero theorems even in the original codebase. Verification works exactly where it is applied — and the trusted computing base underneath the proofs was never verified.

The same author published a companion piece arguing that multi-agentic software development is, at its core, a distributed systems problem — and that no amount of model intelligence can escape the impossibility results. The argument is formal and precise: when multiple agents work concurrently on a natural-language specification (which is inherently ambiguous), they face a consensus problem. FLP tells us they cannot guarantee both safety (well-formed software) and liveness (always reaching agreement) under crash failures. Lamport's Byzantine Generals theorem sets a hard bound: if more than a third of agents misinterpret the prompt, consensus is impossible. These bounds are invariant to agent capability. The practical takeaway is that external validation mechanisms — tests, static analysis, verification — convert misinterpretations into crash failures, which are strictly easier to handle. (more: https://kirancodes.me/posts/log-distributed-llms.html) The author also reports a forthcoming paper on choreographic formalisms for multi-agent workflows, incorporating game theory — worth watching.

Dragan Spiridonov's weekly Quality Forge dispatch offers a practitioner's view of the same trust problem from the other end. The week the community starts quoting your phrasing back, he writes, is the week the bill comes due. His fleet shipped four releases — process insurance around release gating, a browser primitive for QE agents via the W3C protocol, multi-provider LLM routing with circuit breakers, and an upgrade-path fix that the test suite had missed because "the test suite was not the upgrade." The deeper observation threads through confidence research: when humans consult an AI and then decide, subjective confidence goes up whether the AI was right or wrong. That is the failure mode that makes agentic QE dangerous — not wrong answers, but elevated certainty regardless of signal quality. (more: https://forge-quality.dev/articles/room-that-quoted-back)

The Agentic Coding Toolbox Keeps Growing

The open-source coding agent space continues to fragment and specialize. CheetahClaws (née nano-claude-code) reimplements Claude Code's core loop in roughly 10,000 lines of readable Python versus the original's 283,000-line compiled TypeScript bundle. It supports eight-plus model providers — Anthropic, OpenAI, Gemini, Qwen, DeepSeek, MiniMax, Ollama, and any OpenAI-compatible endpoint — with features like multi-agent brainstorming, Telegram/WeChat/Slack bridges for phone-to-computer control, autonomous agent templates, and a proactive background monitoring mode. At v3.05.66, the project is iterating rapidly, with recent fixes including automatic max_tokens capping per model and graceful handling of malformed tool calls from models like Qwen. (more: https://github.com/SafeRL-Lab/nano-claude-code) On the minimalist end, LiteCode targets 8k-context LLMs specifically — free tiers, Ollama, Groq — by chunking files, building lightweight context maps, and sending only what fits. Version 0.2 added diff preview with per-file accept/reject before any write touches disk, addressing the fundamental trust problem of letting small models edit your codebase unsupervised. (more: https://www.reddit.com/r/ollama/comments/1sjd2oe/i_built_a_free_opensource_cli_coding_agent_for/)

GitHub, meanwhile, is moving stacked pull requests from community tooling into a native feature. Currently in private preview, gh-stack lets developers break large changes into ordered chains of small, focused PRs that build on each other. The CLI handles branch creation, cascading rebases, and push operations; the GitHub UI shows a stack navigator, enforces branch protection against the final target (not just the direct base), and supports one-click merge of entire stacks. The AI angle is explicit: gh stack train teaches coding agents how to work with stacks, either decomposing a large diff or developing incrementally from the start. (more: https://github.github.com/gh-stack/) For anyone who has watched AI agents produce monolithic 2,000-line PRs that no human wants to review, this is infrastructure that could meaningfully change the review economics. A live Archon guide on building AI coding harnesses that ship also dropped this week. (more: https://www.youtube.com/live/srx9iwnjK2M?si=ibVhdhkUZTh05NUw)

Claude's Growing Pains — Quotas, Sycophancy, and Shadow Agents

A meticulously documented bug report on Claude Code's GitHub reveals a quiet crisis in the economics of agentic coding. A Pro Max 5x subscriber exhausted their quota in 1.5 hours of moderate usage after a reset, despite the previous window sustaining five hours of heavy multi-file implementation. The hypothesis: cache_read tokens count at full rate against the rate limit rather than at the published 1/10 discount. With a 1M context window, each API call near the auto-compact threshold sends ~960k tokens. At 200+ calls per hour — normal for tool-heavy usage — quota evaporates regardless of caching. A community member built an interceptor to test this empirically across six 5-hour reset windows, and the data landed on a different conclusion: cache_read does not meaningfully count toward quota. The real culprits appear to be prompt cache misses (the TTL silently regressed from 1 hour to 5 minutes around March 2026), background sessions consuming shared quota, and auto-compact creating expensive spikes. Boris from the Claude Code team confirmed they are investigating and have shipped UX improvements to nudge users toward clearing stale sessions. (more: https://github.com/anthropics/claude-code/issues/45756)

OpenAI is maneuvering in the same pricing space. A new $100 Pro tier sits between the $20 Plus and $200 tier, offering 5x usage — though community reaction notes that 5x the usage for 5x the price is not exactly a bargain compared to the 20x at the $200 level. The announcement coincided with the end of the Codex promotional period for Plus subscribers, with OpenAI "rebalancing" Plus Codex usage toward more sessions spread across the week rather than long single-day bursts. The comment thread reads like a customer satisfaction survey for the entire industry: some users are ready to jump from Claude Max to OpenAI Pro for coding, others see both providers getting stingier, and several are pushing toward open-source alternatives entirely. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1sgxfli/openai_has_released_a_new_100_tier/)

Separate from pricing, users are reporting behavioral drift in Claude itself. A Reddit thread with over 100 comments documents what the community calls increasing sycophancy — Claude now agrees with contradictory statements across consecutive messages rather than pushing back, which was the behavior that originally differentiated it from GPT. Five consecutive reversals in a single conversation is the reported pattern. The TL;DR auto-generated from the thread: "The consensus is a resounding YES, Claude has become a sycophantic people-pleaser recently." (more: https://www.reddit.com/r/ClaudeAI/comments/1shexwv/claude_used_to_push_back_now_it_just_agrees_with/) Meanwhile, a pen tester caught Claude's thinking blocks being processed by a second model instance — a summarizer agent that rewrites and compresses reasoning traces before display. When this summarizer breaks, it leaks its own task framing into the visible output: "rewrite," "compressed," "guidelines," "next thinking chunk that needs to be compressed and rewritten." Every thinking response now involves at least two model calls, and what users see as "Claude's thinking" is a sanitized rewrite — useful to know for anyone using thinking blocks as a debugging signal. (more: https://www.reddit.com/r/ClaudeAI/comments/1sl5ru2/claude_thinking_blocks_are_being_summarized_by_a/)

Your Deleted Messages Weren't Deleted

The FBI recovered the content of deleted Signal messages from a defendant's iPhone by extracting data from Apple's internal notification storage — even after Signal had been removed from the device. Testimony in a trial involving the vandalism of the ICE Prairieland Detention Facility in Texas revealed that only incoming messages were captured, and only because the defendant had not enabled Signal's setting to prevent message content from appearing in notification previews. iOS stores and caches notification data locally, trusting device unlock states to keep it safe. The push notification token is not immediately invalidated when an app is deleted, and the server has no way of knowing the app is gone, so notifications may continue arriving for the device to handle. Apple changed how iOS validates push notification tokens on the same day the story broke — whether that is causal or coincidental is anyone's guess. (more: https://9to5mac.com/2026/04/09/fbi-used-iphone-notification-data-to-retrieve-deleted-signal-messages/)

On the censorship circumvention front, gecit is a new DPI bypass tool that takes a clean, cross-platform approach to a well-understood problem. On Linux, it hooks directly into the kernel TCP stack via eBPF sock_ops — no proxy, no traffic redirection. On macOS and Windows, it uses a TUN-based transparent proxy with gVisor's netstack. The core trick: before the real TLS ClientHello, gecit sends a fake one with SNI set to www.google.com and a low TTL. The DPI middlebox records "google.com" and allows the connection; the fake packet expires before reaching the server. A built-in DNS-over-HTTPS server handles DNS poisoning. The FAQ is refreshingly honest: this does not hide your IP, encrypt your traffic, or provide anonymity. It works against DPI systems that inspect individual TCP segments without full reassembly — not against more sophisticated systems like China's. The project chose TUN over WinDivert on Windows because WinDivert's code signing certificate expired in 2023, causing Defender warnings. (more: https://github.com/boratanrikulu/gecit)

The Local Model Arena — Thinking Tokens, Quantization, and the Memory Wall

GLM 5.1 has arrived as the current open-source thinking SOTA, and the community reviews are a study in "be careful what you wish for." One user asked it to write an owner-draw CButton with different colors. Thirty minutes and 150,000 tokens later — most of it spent in "actually wait" loops where the model wrote novellas about what it was going to do — the code arrived with basic errors like accessing protected members directly. The user's math is instructive: 150k tokens over 30 minutes versus 15k–20k tokens and two minutes from Claude or GPT for the same simple task. Running locally is free, but "free" at 7.5x the token consumption is not obviously cheaper. Community members confirmed the pattern: GLM 5.1 thinking blocks include gems like "THIS IS THE ACTUAL FINAL FINAL CODE" and "Ok writing the code For Real this time!" Turning off thinking entirely reportedly produces not-terrible results — a data point that says something about the relationship between extended reasoning and actual problem-solving at this model size. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sjqnzq/actually_wait_the_current_thinking_sota_open/)

A separate thread questions whether importance-matrix quantization (i-quants) deserves its reputation. The argument: most imatrix calibration files are 80% English with basic tasks and some code. On non-English languages or niche tasks, the effect reverses — the calibration biases the model toward the calibration distribution, degrading out-of-domain performance. The poster reports switching back to classic Q4_K_M for most work, reserving i-quants only for Q1/Q2 extremes where any calibration beats none. Community response is nuanced: IQ quants rank roughly one tier above their quant value (IQ4_NL ≈ Q5_0), not two. One user runs Q6_K without imatrix specifically to avoid the domain bias. Another observes an anthropomorphic bias where faster models "feel" smarter and slower ones "feel" dumber, regardless of output quality. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sl5xda/are_iquants_overrated/)

In head-to-head testing on document tasks, Gemma 4 E4B and Qwen3.5-4B tell a more interesting story than the top-line numbers suggest. Qwen wins all three benchmarks (OlmOCR, OmniDocBench, IDP Core) — on OlmOCR the gap is 28 points. But drill into IDP Core sub-scores: Gemma reads raw text better (74.0 vs 64.7 OCR) while collapsing on structured extraction (11.1 vs 86.0 KIE). On the hardest visual subset (handwriting and figures), they are essentially tied at 48.4 vs 47.2. The KIE failure is not a vision problem — it is an instruction-following failure on schema-defined outputs. If you are preprocessing documents before passing to another model and care about raw text fidelity, Gemma's perception quality is underrated. If you are running end-to-end extraction pipelines, Qwen remains the pick. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sfr6qo/gemma_4_e4b_vs_qwen354b_on_document_tasks_qwen/) On the hardware front, someone finally found a use for their Intel NPU: whisper-npu runs OpenAI Whisper models entirely on the NPU for speech-to-text, activated by a global hotkey that transcribes audio and pastes it into the focused input box. No setup, no config — download and run. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sjlqm0/using_npu_for_something_useful/)

Google's TurboQuant paper addresses what may be the most important infrastructure constraint in the industry: KV cache memory. The technique combines Polar Quant (rotating data into a standard coordinate system that eliminates per-block normalization overhead) with Quantized Johnson-Lindenstrauss (a single-bit error correction that eliminates bias in attention scores). The result: 6x memory reduction in the KV cache and up to 8x speedup, tested across question answering, code generation, summarization, and 100k-token needle-in-a-haystack retrieval — all lossless. The strategic implications are significant: Google wrote it and runs Gemini, giving them a compounding cost advantage on their TPU stack. For Nvidia, whose Vera Rubin pitch centered on 500x memory increases through hardware, software-based compression that delivers 6x from existing GPUs complicates the narrative. The broader context is that at least five distinct research vectors — quantization, eviction/sparsity, architectural redesign (DeepSeek v2's multi-head latent attention), offloading, and attention optimization (Flash Attention) — are all attacking the memory problem simultaneously. (more: https://youtu.be/erV_8yrGMA8?si=bYtlvSLfg2X0emCx)

A Two-Button Calculator for All of Mathematics

In digital electronics, the NAND gate proves that a single two-input operator suffices for all Boolean logic. Continuous mathematics has never had an equivalent — until now. Andrzej Odrzywołek at Jagiellonian University shows that the operator EML(x, y) = exp(x) − log(y), paired with the constant 1, generates the complete repertoire of a scientific calculator: trigonometric functions, hyperbolic functions, logarithms, exponentiation, arithmetic, and constants including π, e, and i. A two-button calculator — EML and 1 — replaces every button on a scientific calculator. The discovery came through systematic exhaustive search, expanding 81 seeds through iterative bootstrapping, with a Rust reimplementation (translated by GPT Codex 5.3) verifying the chain in seconds rather than hours. (more: https://arxiv.org/abs/2603.21852)

The practical implications go beyond mathematical aesthetics. Every EML expression is a binary tree of identical nodes, yielding a trivially simple context-free grammar. This uniform structure enables gradient-based symbolic regression: parameterized EML trees can be optimized with standard methods like Adam, and when the generating law is elementary, trained weights snap to exact closed-form expressions at machine-epsilon precision. At tree depth 2, blind recovery from random initialization succeeds in 100% of runs; at depth 3–4, approximately 25%; at depth 5 and beyond, the basins of attraction exist (perturbed correct weights always converge back) but finding them from random initialization becomes the bottleneck. The author notes EML is not unique — cousins EDL and a ternary variant requiring no distinguished constant have already been found, and the question of whether a purely real-domain Sheffer operator exists remains open. For the neural network community, the implication is that any conventional network is a special case of an EML tree architecture, since standard activation functions are themselves elementary — but EML trees can snap to exact symbolic expressions where conventional architectures cannot.

Sources (22 articles)

  1. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems (arxiv.org)
  2. elastic/supply-chain-monitor (github.com)
  3. [Editorial] UK AI Security Institute + Claude (linkedin.com)
  4. Lean proved this program correct; then I found a bug (kirancodes.me)
  5. Multi-Agentic Software Development Is a Distributed Systems Problem (kirancodes.me)
  6. [Editorial] The Room That Quoted Back (forge-quality.dev)
  7. SafeRL-Lab/nano-claude-code — Python reimplementation supporting any model (github.com)
  8. LiteCode — free, open-source CLI coding agent for 8k-context LLMs (reddit.com)
  9. GitHub Stacked PRs (github.github.com)
  10. [Editorial] (youtube.com)
  11. Pro Max 5x quota exhausted in 1.5 hours despite moderate usage (github.com)
  12. OpenAI releases new $100 Pro tier, rebalances Codex usage (reddit.com)
  13. Claude used to push back, now it just agrees with everything (reddit.com)
  14. Claude Thinking Blocks Are Being Summarized By A Second Agent (reddit.com)
  15. FBI used iPhone notification data to retrieve deleted Signal messages (9to5mac.com)
  16. boratanrikulu/gecit - DPI bypass tool (eBPF on Linux, proxy on macOS) (github.com)
  17. GLM 5.1 "Actually wait" — the current thinking SOTA open source (reddit.com)
  18. Are i-Quants overrated? (reddit.com)
  19. Gemma 4 E4B vs Qwen3.5-4B on document tasks — sub-scores tell a different story (reddit.com)
  20. Using NPU for something useful — whisper-npu for Intel NPU speech-to-text (reddit.com)
  21. [Editorial] (youtu.be)
  22. All elementary functions from a single binary operator (arxiv.org)