Eighteen Years of Sleeping Code

Published on June 2, 2026

Today's AI news: Eighteen Years of Sleeping Code, When Graph Structure Becomes a Liability, The Undocumented Machine, Squeezing Every Last Flop, Open Models and New Plumbing, NVIDIA's Full-Stack Play, The Cannes That Wasn't. 22 sources curated from across the web.

Eighteen Years of Sleeping Code

An 18-year-old heap buffer overflow in NGINX's ngx_http_rewrite_module has been sitting in production since 2008, and it allows full unauthenticated RCE. CVE-2026-42945 stems from a mismatch between the two passes of NGINX's script engine: the length-calculation pass runs on a freshly zeroed sub-engine where is_args is 0, so it returns the raw capture length. The copy pass sees is_args = 1 from the parent engine and calls ngx_escape_uri with NGX_ESCAPE_ARGS, expanding each escapable byte to three bytes. The copy therefore overflows the undersized heap buffer with attacker-controlled URI data. Exploitation uses cross-request heap feng shui: spraying ngx_pool_t structures via POST bodies to corrupt an adjacent pool's cleanup pointer, redirecting it to a fake cleanup handler that invokes system(). Every NGINX Open Source build from 0.6.27 through 1.30.0, and NGINX Plus R32 through R36, is affected. The vulnerability was autonomously discovered by DepthFirst's security analysis system along with three additional memory corruption issues (CVE-2026-42946, CVE-2026-40701, CVE-2026-42934), all from a single onboarding of the NGINX source (more: https://github.com/DepthFirstDisclosures/Nginx-Rift).
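The sizing mismatch is easy to see in a toy model. The sketch below is not NGINX code: escape_args, length_pass, copy_pass, and the ESCAPABLE byte set are hypothetical stand-ins that only mimic the described behaviour, where the length pass runs with is_args = 0 and the copy pass runs with is_args = 1.

```python
# Toy model of a two-pass sizing mismatch; not NGINX source code.
# ESCAPABLE is an illustrative subset, not the real NGX_ESCAPE_ARGS table.
ESCAPABLE = set(b" #%&+?")


def escape_args(data: bytes) -> bytes:
    """Percent-encode each escapable byte, turning 1 byte into 3."""
    out = bytearray()
    for b in data:
        if b in ESCAPABLE:
            out += b"%%%02X" % b
        else:
            out.append(b)
    return bytes(out)


def length_pass(capture: bytes, is_args: int) -> int:
    """Size the destination buffer for the capture."""
    return len(escape_args(capture)) if is_args else len(capture)


def copy_pass(capture: bytes, is_args: int) -> bytes:
    """Produce the bytes that would be written into the buffer."""
    return escape_args(capture) if is_args else capture


def overflow_bytes(capture: bytes) -> int:
    """Bytes written past the buffer when the passes disagree on is_args."""
    allocated = length_pass(capture, is_args=0)   # freshly zeroed sub-engine
    written = len(copy_pass(capture, is_args=1))  # parent engine state
    return written - allocated
```

In this model every escapable byte in the capture adds two bytes beyond the allocation, so the overflow length is under the sender's control.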

The IoT security narrative got another chapter courtesy of a Honeywell X2S smart thermostat teardown. The device packs a 200 MHz Renesas Cortex-M33 with TrustZone and a Realtek RTL8721DM Wi-Fi/BLE SoC — two brains, two Winbond flash chips, all encrypted. Cracking the Realtek turned out to be trivially easy using its own RSIP decrypt-on-the-fly feature, which led to the discovery of a TLS certificate issue enabling man-in-the-middle attacks and a seeding bug that makes session key recovery possible. The Renesas firmware still needs decrypting, but the partial results are already damning enough to confirm that the "S" in IoT still stands for "security" (more: https://hackaday.com/2026/05/26/honeywell-x2s-smart-thermostat-firmware-reverse-engineering/).

On the defensive side, a new open-source tool called lockbit-rescue exploits the documented keystream-reuse weakness in LockBit 3.0 ("Black") to decrypt files without the attacker's private key. LockBit's file encryption uses a modified Salsa20 cipher where the same keystream gets reused across files in an encryption batch. The tool groups encrypted files by their RSA-encrypted KEK fingerprint, picks the longest-named file in each group as a known-plaintext oracle (the original filename is apLib-compressed UTF-16LE in the footer), XORs it against ciphertext to recover keystream bytes, and decrypts every other file in the group whose footer fits within the recovered coverage. Real-world recovery rates run 5–40% depending on original filename lengths, and a Phase 2 brute-force extension can push further by climbing a ladder of intermediate files. It's resumable, read-only on source files, and includes a libmagic-based verification sweep (more: https://github.com/Saddytech/lockbit-rescue).
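The known-plaintext step works for any stream cipher that reuses a keystream, and can be sketched in a few lines. This is a minimal illustration, not the tool's code: the function names are hypothetical, and the real footer parsing, apLib decompression, and KEK grouping are left out.

```python
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def recover_keystream(ciphertext: bytes, known_plaintext: bytes) -> bytes:
    """ciphertext = plaintext XOR keystream, so XORing again yields the keystream."""
    return xor_bytes(ciphertext, known_plaintext)


def decrypt_with_keystream(ciphertext: bytes, keystream: bytes) -> Optional[bytes]:
    """Decrypt only when the recovered keystream covers the whole ciphertext."""
    if len(ciphertext) > len(keystream):
        return None  # outside recovered coverage
    return xor_bytes(ciphertext, keystream)
```

The coverage check is why the longest known plaintext in a group matters: it determines how many keystream bytes are recovered, and therefore how many other files can be decrypted.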

When Graph Structure Becomes a Liability

A new paper on the Elliptic Bitcoin Dataset delivers one of the cleaner methodological takedowns in recent fraud-detection literature. The consensus that GCN, GraphSAGE, GAT, and EvolveGCN beat feature-only baselines on Elliptic has been cited for years — GCN at F1=0.70, GraphSAGE+SSL at 0.75, EvolveGCN at 0.77. Those numbers are artifacts of the evaluation protocol, not evidence that graph structure helps detect fraud. Every prior Elliptic study trains with transductive message passing, meaning the full graph — including test-period nodes and edges — is visible at every training forward pass. Several setups labeled "inductive" still leak test-period statistics through batch normalization and neighborhood aggregation.

When the authors run a strictly inductive protocol (encoder trained only on the time-step ≤ 34 subgraph, with 10 seeds and per-timestep reporting), Random Forest on the raw 165-dimensional features reaches F1=0.821 and beats every GNN tested. GraphSAGE, the strongest graph encoder under strict inductive evaluation, manages only F1=0.689 ± 0.017. A paired controlled experiment quantifies the leakage directly: across 10 matched seeds holding architecture, optimizer, loss, and seed constant, GraphSAGE scores F1=0.294 transductively versus F1=0.689 inductively — a 39.5-point gap (Cohen's d = 15.8, p = 2.6 × 10⁻¹²) explained entirely by training-time exposure to test-period adjacency.
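The core of a strictly inductive split is simple to state in code. The sketch below is an assumption about how such a filter could look, not the paper's implementation; the function name and data layout are hypothetical, and only the cutoff of time step 34 comes from the text above.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def inductive_training_subgraph(
    node_time: Dict[Hashable, int],
    edges: List[Edge],
    cutoff: int = 34,
) -> Tuple[Set[Hashable], List[Edge]]:
    """Keep only nodes at or before the cutoff and edges between two such nodes."""
    train_nodes = {n for n, t in node_time.items() if t <= cutoff}
    train_edges = [
        (u, v) for u, v in edges if u in train_nodes and v in train_nodes
    ]
    return train_nodes, train_edges
```

An edge with even one endpoint in the test period is dropped, so no test-period adjacency can reach the encoder during training. Normalization statistics would need the same treatment, fitted on the training nodes only.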

The edge-shuffle ablation is perhaps the most devastating finding. Randomly shuffled edges outperform the real transaction graph by 8.9 F1 points, and removing edges entirely still beats real edges by 2.5 points. The original graph's mean degree of 2.3 produces neighborhoods that are simultaneously sparse and semantically misleading under temporal shift — each fraud node is surrounded by licit counterparties, so message passing actively pulls fraud representations toward the licit manifold. The paper's feature attribution confirms where the discriminative signal actually lives: local features account for 82.4% of total Random Forest importance, and the top five features are all local. A concatenation hybrid of GraphSAGE embeddings and raw features drops from the previously reported F1=0.807 to 0.699 under the clean protocol, and the GNN contributes only +0.018 F1 over a matched-capacity MLP — real, reliable, and practically inconsequential. For anyone building fraud pipelines on transaction graphs: test under strict inductive protocol before you invest in graph machinery, because the feature engineering you already have might be the strongest signal in the room (more: https://arxiv.org/abs/2604.19514v1).

The Undocumented Machine

Someone went spelunking through Claude Code's node_modules and surfaced a surprisingly deep configuration surface area that the official docs never mention. The auto-mode permission system is internally called the "YOLO Classifier" (that's the actual variable name in yoloClassifier.ts), and it accepts plain English environment descriptions — strings like "this is a staging server, destructive operations are acceptable" — that the classifier reads to decide what gets auto-approved. PreToolUse hooks can return updatedInput to rewrite commands mid-flight (so git push silently becomes git push --dry-run), permissionDecision to force allow/deny without user prompts, and additionalContext to inject text into the conversation. Three undocumented hook fields change behavior fundamentally: once: true fires a hook exactly once then auto-removes it, async: true runs hooks in the background without blocking, and asyncRewake: true runs non-blocking but wakes the model and blocks if the hook exits with code 2 — non-blocking on the happy path, blocking when something's wrong. Custom agents support persistent memory across invocations (user, project, or local scopes), omitClaudeMd: true for "fresh eyes" reviews, and a criticalSystemReminder_EXPERIMENTAL field that gets re-injected at every turn even after context compaction. Two undocumented settings.json fields — autoMemoryEnabled and autoDreamEnabled — activate a compound learning loop: sessions produce memories, dreams consolidate them every 24 hours, and consolidated memories inform future sessions (more: https://buildingbetter.tech/p/i-read-the-claude-code-source-code).
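A hook that performs the git push rewrite might look roughly like the sketch below. The field names updatedInput, permissionDecision, and additionalContext come from the article; the shape of the incoming event (tool_name, tool_input.command) and the hookSpecificOutput envelope are assumptions here and should be checked against the current hooks reference before use.

```python
import json
import shlex
import sys


def handle_pre_tool_use(event: dict) -> dict:
    """Return a hook response; an empty dict means the hook has no opinion."""
    tool_input = event.get("tool_input", {})
    command = tool_input.get("command", "")
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes: leave the command alone
        return {}
    is_push = event.get("tool_name") == "Bash" and argv[:2] == ["git", "push"]
    if not is_push or "--dry-run" in argv:
        return {}
    return {
        "hookSpecificOutput": {  # envelope shape is an assumption
            "hookEventName": "PreToolUse",
            "permissionDecision": "allow",
            "updatedInput": {**tool_input, "command": command.strip() + " --dry-run"},
            "additionalContext": "git push was rewritten to a dry run by a hook.",
        }
    }


def main() -> None:
    """Entry point when installed as a hook script: JSON in, JSON out."""
    json.dump(handle_pre_tool_use(json.load(sys.stdin)), sys.stdout)
```

Keeping the decision logic in a pure function makes the rewrite testable without a live session; a hook script would simply call main().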

The configuration deep-dive arrives at the exact moment organizations are discovering what unrestricted AI coding tool access actually costs. Microsoft reportedly pulled internal Claude Code licenses after the token bill ran past anything anyone had budgeted for — the tool didn't fail, it worked so well that engineers leaned on it constantly. The post promoting this disclosure also introduces SpectralQuant, a KV-cache compression library claiming 6x compression on production H200s where the cache eats roughly 64% of GPU memory. On Mistral 7B at 5.95x compression, it reports 40.2 tokens/sec versus 18.9 for the uncompressed baseline — same model, same accuracy, roughly double the speed (more: https://www.linkedin.com/posts/ashgopi_microsoft-just-pulled-internal-claude-code-ugcPost-7467234522502254592-GGUQ). Meanwhile, the local-AI community is pushing in the opposite direction entirely: building agents that optimize their own prompts and tool usage over time using feedback loops, turning the cost problem into an engineering problem that runs on hardware you already own (more: https://www.reddit.com/r/LocalLLaMA/comments/1toejzp/turning_local_agents_into_selfoptimizing_agents/).

Squeezing Every Last Flop

A blog series culminating in "A 10 year old Xeon is all you need" lays out exactly what it takes to run Gemma 4 26B on a single Intel Xeon E5-2620 v4 from 2016: 8 cores, DDR3, no GPU. The author's magic spell requires 25 flags across speculative decoding (--spec-type mtp --draft-max 3 --spec-autotune), CPU-aware MoE routing (--cpu-moe --merge-up-gate-experts), memory pinning (--mlock --run-time-repack), and Flash Attention ported to CPU (--flash-attn on --mla-use 3). The argument for speculative decoding is stronger on CPU than on GPU: CPU compute is cheap relative to streaming verifier weights through cache, and the drafter's working set fits in L3 while the verifier spills out of every cache level. Runtime repacking reorganizes weight matrices to align with CPU cache layout, paying a small startup penalty for maximum bandwidth during generation. The result: an 82 GB footprint in DDR3 generating text at reading speed on hardware that shipped before the model's architecture had been invented (more: https://point.free/blog/gemma-4-on-a-2016-xeon/). In the same vein, a community member reports 10.33 tokens/sec on Qwen 3.5 35B running on a $300 laptop (more: https://www.reddit.com/r/LocalLLaMA/comments/1tpfw50/inferencing_at_1033_ts_on_qwen_35_35b_on_a_300/).
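For readers new to the technique, greedy speculative decoding can be sketched with two toy next-token functions. This is a simplified illustration, not the inference engine's implementation: the names are hypothetical, the verifier is called once per position rather than in one batched pass, and the bonus token that real engines emit after a fully accepted draft is omitted. Only the draft length of 3 mirrors the --draft-max flag above.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(prompt: List[int], model: NextToken, max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    prompt: List[int],
    draft: NextToken,
    verifier: NextToken,
    max_new_tokens: int,
    draft_max: int = 3,
) -> List[int]:
    """Draft proposes a short run; verifier keeps the matching prefix."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # The cheap draft model proposes up to draft_max tokens.
        proposal: List[int] = []
        for _ in range(min(draft_max, target_len - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # The verifier checks each position and corrects the first mismatch.
        accepted: List[int] = []
        for tok in proposal:
            expected = verifier(tokens + accepted)
            accepted.append(expected)
            if tok != expected:
                break
        tokens.extend(accepted)
    return tokens
```

The output always matches what the verifier alone would have produced; the draft only changes how many verifier steps can be grouped together, which is where the savings come from when streaming verifier weights dominates the cost.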

On the research side, Huawei's OSC framework tackles W4A4 quantization by systematically characterizing where activation outliers actually cluster. Their key insight: outliers aren't uniformly distributed — they exhibit "token-persistent structural clustering" where high-magnitude values consistently occupy fixed channels across tokens, with clustering density exceeding 60% in attention and FFN up-projection inputs. OSC exploits this by building an offline lookup table that identifies the single most statistically significant outlier channel per quantization group, extracting it to a high-precision buffer before calculating scaling factors. The result is a dual-path computation: a low-precision 4-bit GEMM path and a compact high-precision 16-bit branch GEMM. For W2 down-projection inputs where outliers are more diffused (20–35% clustering), OSC falls back to FP8. On Qwen3-8B and Qwen3-30B, OSC restricts the average accuracy drop to 2.19 and 1.12 points respectively while achieving 1.78x speedup over the W8A8 baseline (more: https://arxiv.org/abs/2604.12782v1).
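The effect of pulling one outlier channel out of a quantization group can be shown with a toy dot product. This is a hedged sketch of the general idea, not the OSC kernels: the function names are hypothetical, plain floats stand in for the 16-bit branch, and the outlier index is passed in directly rather than read from an offline lookup table.

```python
from typing import List, Tuple


def quantize_int4(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization with one scale for the whole group."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs else 1.0
    return [max(-8, min(7, round(v / scale))) for v in values], scale


def dot_quantized(x: List[float], w: List[float]) -> float:
    """Low-precision path: integer dot product, rescaled afterwards."""
    qx, sx = quantize_int4(x)
    qw, sw = quantize_int4(w)
    return sum(a * b for a, b in zip(qx, qw)) * sx * sw


def dot_outlier_separated(x: List[float], w: List[float], outlier_idx: int) -> float:
    """Dual path: the outlier channel stays in high precision, the rest go to 4-bit."""
    x_rest = [v for i, v in enumerate(x) if i != outlier_idx]
    w_rest = [v for i, v in enumerate(w) if i != outlier_idx]
    high_precision = x[outlier_idx] * w[outlier_idx]
    return dot_quantized(x_rest, w_rest) + high_precision
```

With activations such as [0.5, -0.3, 40.0, 0.2], the single large value sets the scale for the whole group and the small values round to zero; removing it first lets the scale fit the remaining channels.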

NVIDIA's Lightning OPD addresses a different efficiency bottleneck: the live teacher server required during on-policy distillation. Standard OPD trains a student model to match a stronger teacher's token-level distribution, but keeping the teacher running throughout training is expensive. Lightning OPD identifies a previously overlooked condition called "teacher consistency" — the SFT-stage and OPD-stage teachers must be the same model. Violating this introduces irreducible gradient bias causing convergence to a suboptimal fixed point regardless of training duration. With teacher consistency enforced, teacher log-probabilities can be precomputed once over SFT rollouts and reused, eliminating the live server entirely. Starting from Qwen3-8B-Base, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours — a 4.0x speedup over standard OPD (more: https://arxiv.org/abs/2604.13010v1). Separately, the quantization precision debate got concrete data: Qwen3.6 shows a measurable quality jump from Q4 to Q6 quantization specifically for coding agent workloads, suggesting that the "good enough" threshold for agentic use cases sits higher than for conversational ones (more: https://www.reddit.com/r/LocalLLaMA/comments/1tpebhw/qwen36_huge_quality_gain_from_q4_to_q6_for_coding/).

Open Models and New Plumbing

JetBrains released Mellum2, a 12B-parameter Mixture-of-Experts model that activates only 2.5B parameters per token, under Apache 2.0. It's the first model from an IDE vendor designed explicitly as infrastructure for AI-powered development tools rather than as a standalone assistant. JetBrains positions it as a "focal model" — fast and well-scoped for high-frequency tasks inside larger AI systems: routing, RAG post-processing, tool selection, context compression, and agent subtasks like planning and validation. Benchmarks show competitive performance against similar-sized models with over 2x faster inference, targeting the latency-sensitive slots in multi-model architectures where calling a frontier model for every intermediate step is wasteful (more: https://huggingface.co/blog/JetBrains/mellum2-launch). Tencent's Hy-MT2 also moved to Apache License 2.0, continuing the trend of major labs opening their model weights under permissive terms (more: https://www.reddit.com/r/LocalLLaMA/comments/1to6g1d/tencent_hymt2_is_now_under_apache_license_20/).

The tooling layer beneath models continues filling in. LogicPipe tackles multi-device LLM inference by splitting transformer layers across ranks, using DAG scheduling with dependency-aware task dispatch and contextual KV-cache reuse between points — a structured approach to pipeline parallelism on heterogeneous edge hardware (more: https://github.com/fxyz666/LogicPipe). dlmserve launches as the first dedicated serving engine for diffusion language models, a class of models that has lacked the vLLM/SGLang-equivalent infrastructure that autoregressive models enjoy (more: https://www.reddit.com/r/LocalLLaMA/comments/1to95ja/oss_dlmserve_first_serving_engine_for_diffusion/). Genspark AI ships an open-source "Super Agent" framework with 80+ built-in tools, multi-LLM routing, and a planner-executor architecture that decomposes goals into DAGs of subtasks dispatched to specialist agents — positioned as a self-hosted alternative to the commercial Genspark.ai service (more: https://github.com/veryyoldman/Genspark-AI). And Qwen released Qwen-Image-Bench, a new benchmark for evaluating visual language models (more: https://www.reddit.com/r/LocalLLaMA/comments/1tpww8m/qwenqwenimagebench_hugging_face/).
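Dependency-aware dispatch over a DAG, the pattern both LogicPipe and the planner-executor design rely on, can be sketched with Python's standard graphlib module. This is a generic illustration with hypothetical task names, not code from either project.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def dispatch_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; every task in a wave has all its dependencies done.

    `dependencies` maps each task to the set of tasks it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # a real scheduler would mark tasks done as they finish
    return waves
```

Tasks within a wave have no dependencies on each other, so they can be handed to different ranks or specialist agents in parallel.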

NVIDIA's Full-Stack Play

NVIDIA's Cosmos 3 represents a significant architectural leap from its predecessors. Where earlier Cosmos releases required separate models for world generation (Predict), controlled generation (Transfer), scene understanding (Reason), and policy generation (Policy), Cosmos 3 unifies everything into a single Mixture-of-Transformers architecture. The model processes text, image, video, audio, and action modalities in one forward pass, splitting the input sequence into an autoregressive subsequence for reasoning and a diffusion subsequence for generation. AR and DM tokens use separate parameter sets per layer but interact through joint attention, letting the same model serve as a VLM, video generator, forward/inverse dynamics model, or robot policy without architectural changes. Two sizes ship: Cosmos 3 Nano (16B, workstation-grade) and Cosmos 3 Super (64B, datacenter-scale), both with Diffusers integration and post-training scripts for domain adaptation. NVIDIA also released accompanying synthetic data generation datasets covering robotics, physics simulation, autonomous driving, warehouse operations, and human motion — the kind of training data that physical AI has chronically lacked (more: https://huggingface.co/blog/nvidia/cosmos-3-for-physical-ai).

At Computex in Taipei, Jensen Huang unveiled RTX Spark, a chip designed for what NVIDIA calls "the era of personal AI agents." Lenovo, HP, Dell, Microsoft Surface, Asus, and MSI will ship Windows PCs with RTX Spark beginning this autumn, with Acer and Gigabyte following. Forrester's Charlie Dai called it a "paradigm shift" from "component supplier" to "architecture owner in the PC market," putting NVIDIA in direct competition with Intel, AMD, and Qualcomm. Analyst Ian Fogg tempered expectations, noting the machines will likely carry a significant price tag and target workstation-class buyers (more: https://www.bbc.com/news/articles/crmp9mppvzro). NVIDIA also released LocateAnything, a vision-language grounding model using parallel box decoding that claims 10x faster inference than Qwen3-VL for object detection and spatial understanding tasks (more: https://www.reddit.com/r/LocalLLaMA/comments/1tpvldv/nvidia_locateanything_fast_and_highquality/).

The Cannes That Wasn't

Higgsfield, a San Francisco startup valued at $1.3 billion, announced it had premiered a fully AI-generated feature film at Cannes. The Wall Street Journal covered it. The founder posted on LinkedIn about how "for decades, Cannes has been the room where new cinema gets legitimized." Then Cannes said it never happened. A festival spokesperson confirmed to Futurism that "Hell Grind was not screened as part of the official Festival de Cannes program." The film screened at the Marché du Film, a separate commercial marketplace that will screen anything that pays the fee (it has screened Sharknado). Calling it a Cannes premiere is roughly equivalent to buying an ad in the New York Times and describing yourself as a Times journalist.

The underlying work was real. Higgsfield made a 95-minute action film in two weeks using AI video generation tools including Google's Veo 3, at a total cost of $500,000 ($400,000 in compute). Each prompt averaged 3,000 words. The first 25 minutes required 16,181 initial video generations to produce 253 final shots. Maintaining visual consistency across feature length required detailed style prefixes defining lighting, camera type, and physics behavior for every prompt. "You can't go into AI and say make me a 95-minute cool video," one content lead acknowledged. But director John Washburn's response to the festival claims was more direct: "The suggestion that paying for a screening at some random theatre in the same town and at the same time as a major festival is somehow the same thing as being selected by that festival is misleading at best. Spurious bullshittery, really." The Wall Street Journal later added a correction clarifying the distinction. The pattern is familiar — real capability dressed in borrowed prestige, traveling faster than anyone can check (more: https://firethering.com/hell-grind-ai-film-cannes-premiere-higgsfield/).

Sources (22 articles)

DepthFirstDisclosures/Nginx-Rift (github.com)
Honeywell X2S Smart Thermostat Firmware Reverse-Engineering (hackaday.com)
Saddytech/lockbit-rescue (github.com)
When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift (arxiv.org)
Claude Code – Everything You Can Configure That the Docs Don't Tell You (buildingbetter.tech)
[Editorial] (linkedin.com)
Turning local agents into self-optimizing agents (reddit.com)
A 10 year old Xeon is all you need (point.free)
Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop (reddit.com)
OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension (arxiv.org)
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation (arxiv.org)
Qwen3.6 huge quality gain from Q4 to Q6 for coding agent (reddit.com)
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains (huggingface.co)
Tencent Hy-MT2 is now under Apache License 2.0 (reddit.com)
fxyz666/LogicPipe (github.com)
[OSS] dlmserve - first serving engine for diffusion language models (reddit.com)
veryyoldman/Genspark-AI (github.com)
Qwen/Qwen-Image-Bench · Hugging Face (reddit.com)
Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action (huggingface.co)
Nvidia announces new AI chip for personal computers (bbc.com)
Nvidia LocateAnything - Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding. (10x faster than Qwen3-VL) (reddit.com)
The $500K AI Film That "Premiered at Cannes" Was Not in the Official Festival (firethering.com)