Nightmare Eclipse: One Grudge, Six Zero-Days

Today's AI news: Nightmare Eclipse: One Grudge, Six Zero-Days; Benchmarks, Agents, and the Enterprise Reality Gap; Agentic Coding: Dynamic Workflows and the Slow Code Movement; DeepSeek V4 & The Pricing Power Myth; The KV Cache Compression Race; Custom Inference Engines: AMD and the Edge; Platform Control: APIs, Labels, and the Data Arms Race. 22 sources curated from across the web.

Nightmare Eclipse: One Grudge, Six Zero-Days

Not all threat actors chase money or further national objectives. Some just want to burn the house down. Nightmare-Eclipse — a solo security researcher operating under anime-decorated aliases like "Chaotic Eclipse" and "Dead Eclipse" — has released six Windows zero-day exploits since April 2026 in what Barracuda's threat intelligence team describes as an escalating retaliatory campaign against Microsoft. The motivation, per Eclipse's own blog: Microsoft violated an agreement and "left me homeless with nothing." Whether that's a former employee, contractor, or external researcher remains unverified. What isn't in question is the damage — Huntress Labs has confirmed exploitation activity linked to Russian-geolocated infrastructure, and CISA has added BlueHammer (CVE-2026-33825) to its Known Exploited Vulnerabilities catalog. (more: https://blog.barracuda.com/2026/05/19/nightmare-eclipse-zero-days-grudge)

The exploit chain reads like a product roadmap for endpoint compromise. BlueHammer and RedSun are two separate privilege-escalation paths through Windows Defender, the tool that's supposed to stop attacks, not enable them. UnDefend then blinds Defender while making the system appear healthy. GreenPlasma offers an alternate SYSTEM escalation path through broader Windows internals, and MiniPlasma targets the Cloud Files Mini Filter Driver. Then there's YellowKey, which has drawn the most attention because it targets BitLocker. A detailed video breakdown explains the mechanism: Windows 11's recovery environment automatically replays NTFS file-system transactions from external volumes. A malicious USB containing crafted FSTX transactions deletes winpeshl.ini (the file telling recovery which application to load), causing a drop to a command shell that already has the TPM-unsealed BitLocker key in memory. The mitigation is relatively straightforward for informed administrators (enable TPM PIN, or remove the autofstx registry key), but Eclipse claims the bypass works even with TPM+PIN and simply has not published that proof-of-concept. Eclipse has promised RCE drops and "a big surprise" for June Patch Tuesday, plus a dead-man's switch for automatic disclosure. The researcher has been banned from both GitHub and GitLab. (more: https://www.youtube.com/watch?v=H-SgQP3Hif0)

Benchmarks, Agents, and the Enterprise Reality Gap

The DeepSWE benchmark made waves this week by claiming Claude Opus "cheats" — but the details are more nuanced. When the prompt and repository state don't match, Opus 4.7 explores recent changes with git log and recovers the gold solution from .git history. Is that cheating, or exactly what a senior engineer would do when the codebase doesn't match the spec? Several commenters pointed out that if .git is present and no constraint prohibits its use, git log is environment use, not rule violation. The deeper concern: Sonnet 4.6 on "high" effort beats Opus 4.6 on "max" effort in some configurations, and GPT-5.4 Mini outranks Kimi K2.6, which doesn't match practitioner experience. The real lesson is that evaluation harnesses need to declare their environment contract alongside the score. (more: https://www.reddit.com/r/LocalLLaMA/comments/1toychi/new_deepswe_benchmark_finds_claude_opus_cheats/)

ITBench-AA, a collaboration between Artificial Analysis and IBM, offers a more controlled view. Across 59 Kubernetes incident-response tasks, the best model (Claude Opus 4.7 at max effort) scores just 47%. GPT-5.5 follows at 46%, Qwen3.7 Max at 42%. These are real SRE scenarios requiring log reading, dependency tracing, and root-cause identification. More turns don't help: Gemini 3.1 Pro Preview averages 83 turns per task but scores only 30%, while Gemma 4 31B averages 58 turns and scores 37%, at $0.14/task versus Gemini's $2.23. Models that over-investigate surface upstream fault-injection mechanisms as false positives, tanking their precision scores. Open-weight models sit on the cost frontier: GLM-5.1 Reasoning scores 40% at $1.23/task, matching Gemini 3.5 Flash's score for less than Flash's $1.70. (more: https://huggingface.co/blog/ibm-research/itbench-aa)
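
The "cost frontier" framing can be made concrete with the four cost and score pairs quoted above. This is a minimal sketch, assuming the $0.14 figure belongs to Gemma 4 31B and the $2.23 figure to Gemini 3.1 Pro Preview, and that "matching" means Gemini 3.5 Flash also scored 40%. The function name `pareto_frontier` is illustrative, and the three top-scoring models are left out because their per-task costs are not quoted.

```python
# (model, cost per task in USD, score in %) as quoted in the text above.
# Assumption: $0.14 is Gemma's cost and $2.23 is Gemini 3.1 Pro Preview's.
RESULTS = [
    ("Gemini 3.1 Pro Preview", 2.23, 30),
    ("Gemma 4 31B", 0.14, 37),
    ("GLM-5.1 Reasoning", 1.23, 40),
    ("Gemini 3.5 Flash", 1.70, 40),
]


def pareto_frontier(results):
    """Return models that no rival beats on cost and score at the same time."""
    frontier = []
    for name, cost, score in results:
        dominated = any(
            other_cost <= cost
            and other_score >= score
            and (other_cost < cost or other_score > score)
            for _, other_cost, other_score in results
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    print(pareto_frontier(RESULTS))
```

Under these assumptions the two open-weight entries are the only ones left on the frontier, which is consistent with the claim in the paragraph.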

A rigorous ACM paper deepens this picture through controlled ablation of compound LLM agent design across 3,475 episodes in a cyber-defense POMDP. The headline: programmatic state abstraction — giving the agent a deterministic, structured summary instead of raw observations — delivers the largest performance improvement per token, up to 76% better outcomes at near-zero marginal cost. Hierarchy helps when applied sparingly (bounded specialists with strict I/O contracts), but distributing deliberation tools across a hierarchy produces what the authors term a "deliberation cascade." Independent self-critique loops in each sub-agent amplify uncertainty rather than resolving it, degrading performance for all six tested models while doubling token costs. The design principle: invest in context engineering and bounded task decomposition, not deeper per-agent reasoning. This aligns with the ITBench-AA finding that more turns actively hurt — the failure mode is the same whether the over-reasoning happens within one agent or across a hierarchy. (more: https://arxiv.org/abs/2605.16205v1)
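
To illustrate what "programmatic state abstraction" means in practice, here is a minimal sketch of a deterministic summarizer that turns raw events into a compact structure an agent could read instead of the raw stream. The event schema (`host`, `kind`, `severity`) and the function name are hypothetical and are not taken from the paper.

```python
from collections import Counter


def summarize_observations(raw_events, top_n=3):
    """Collapse raw event dicts into a small, deterministic state summary.

    Hypothetical schema: each event has 'host', 'kind', and 'severity' keys.
    """
    # Count only high-severity events per host (threshold is an assumption).
    high = Counter(e["host"] for e in raw_events if e["severity"] >= 3)
    # Most affected hosts first; ties break alphabetically for determinism.
    ranked = sorted(high.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "event_count": len(raw_events),
        "event_kinds": sorted({e["kind"] for e in raw_events}),
        "hot_hosts": [host for host, _ in ranked[:top_n]],
    }
```

The point of the design is that the summary costs no model tokens to produce and is identical for identical inputs, so the agent spends its budget on the decision rather than on parsing.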

On the theoretical side, a new paper claims to have found the mathematical thread connecting a decade of robustness techniques (PGD adversarial training, RLHF, data augmentation), arguing that they all implicitly estimate the same "deployment nuisance covariance matrix." When estimated correctly and applied as a geometric penalty, sycophancy on Qwen2.5-7B drops from 38.5% to 13.5%, beating standard PGD adversarial training by 14.8 points on clean accuracy. The kicker from Theorem G: if the matrix misses even one direction where real-world data varies, the model will actively exploit that blind spot. More data and bigger models won't fix a geometry problem. (more: https://www.reddit.com/r/learnmachinelearning/comments/1tnw9p4/10_years_of_ai_robustness_tricks_pgd_rlhf_data/)

Agentic Coding: Dynamic Workflows and the Slow Code Movement

Anthropic launched dynamic workflows in Claude Code, a feature that lets the agent write orchestration scripts running tens to hundreds of parallel subagents in a single session. The showcase example: Jarred Sumner used dynamic workflows to port Bun from Zig to Rust — roughly 750,000 lines of code — in eleven days with 99.8% of existing tests passing. One workflow mapped Rust lifetimes across the Zig codebase; another wrote every .rs file with paired reviewers per file; a fix loop drove the build until clean. Anthropic is transparent about the tradeoff: "dynamic workflows consume meaningfully more usage than a typical Claude Code session." Enterprise plans have them off by default, and the first trigger shows what's about to run and asks for confirmation. (more: https://claude.com/blog/introducing-dynamic-workflows-in-claude-code)

Nolan Lawson published a thoughtful counterpoint: using AI to write better code more slowly. His technique runs Claude, Codex, and Cursor Bugbot in parallel against every PR, then has a synthesizer deduplicate and rank findings. The false-positive rate is "near zero," and it consistently surfaces criticals and highs that predate the PR. Velocity doesn't increase — the review sends him down side-quest rabbit holes fixing pre-existing bugs — but codebase health improves steadily. A commenter's caveat is worth repeating: "The agents find the bugs; you still need to understand them." (more: https://nolanlawson.com/2026/05/25/using-ai-to-write-better-code-more-slowly/)
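
The deduplicate-and-rank step can be sketched deterministically. This is not Lawson's synthesizer, which the post does not specify at this level. It is a minimal stand-in that assumes each reviewer emits findings with a file path, line, and severity, and that two findings on the same file and line describe the same bug. All names here are illustrative.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    reviewer: str
    path: str
    line: int
    severity: str
    message: str


def synthesize(findings):
    """Merge findings that hit the same file and line, then rank them."""
    merged = {}
    for f in findings:
        entry = merged.setdefault(
            (f.path, f.line),
            {"path": f.path, "line": f.line, "severity": f.severity,
             "reviewers": set(), "messages": []},
        )
        entry["reviewers"].add(f.reviewer)
        entry["messages"].append(f.message)
        # Keep the most severe rating any reviewer assigned.
        if SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[entry["severity"]]:
            entry["severity"] = f.severity
    # Most severe first; among equals, findings flagged by more reviewers first.
    return sorted(
        merged.values(),
        key=lambda e: (SEVERITY_ORDER[e["severity"]], -len(e["reviewers"]),
                       e["path"], e["line"]),
    )
```

Matching on file and line is a simplification. A real synthesizer would likely need fuzzier matching, since reviewers often point at neighbouring lines for the same defect.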

On r/ClaudeAI, practitioners debating overnight autonomous coding have reached pragmatic consensus: it works for tightly scoped tasks with machine-checkable stop conditions — failing tests that should pass, dependency bumps, mechanical refactors. The pro move is an orchestration script that commits after each step, writes a handoff summary, and rehydrates a fresh session to prevent context drift. Anything architectural? "It'll drift and commit very confidently to the wrong abstraction." The adversarial multi-agent setup — separate agents for requirements, architecture, implementation, and review — is what makes overnight runs trustworthy for some practitioners, but even veterans acknowledge the morning triage cost. (more: https://www.reddit.com/r/ClaudeAI/comments/1tpwt5k/overnight_autonomous_coding/)
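
The commit-per-step, handoff, and fresh-session pattern described above can be sketched as a small control loop. The three callables are hypothetical hooks the caller would supply (for example, a wrapper that launches an agent session, a test runner, and a `git commit` call). Nothing here reflects a specific tool's API.

```python
def run_overnight(steps, run_step, checks_pass, commit, max_attempts=2):
    """Run scoped steps in order; stop at the first step that cannot pass its check.

    run_step(step, handoff) -> str   runs one fresh agent session, returns a summary
    checks_pass(step) -> bool        machine-checkable stop condition (e.g. tests)
    commit(step, summary) -> None    records the step, e.g. via `git commit`
    All three callables are hypothetical hooks supplied by the caller.
    """
    handoff = ""  # summary carried into the next fresh session
    completed = []
    for step in steps:
        for _ in range(max_attempts):
            summary = run_step(step, handoff)
            if checks_pass(step):
                commit(step, summary)
                handoff = summary
                completed.append(step)
                break
        else:
            # The check never passed: stop rather than build on a broken state.
            return completed, step
    return completed, None
```

Returning the failed step gives the morning triage a single place to start, and because every passing step was committed separately, the work before it can be kept or reverted independently.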

DeepSeek V4 & The Pricing Power Myth

DeepSeek V4 Pro's pricing tells the story: $0.435 per million input tokens, $0.87 output. That's 28.7× cheaper than Claude Opus on output and 11.5× cheaper than GPT-5.5 on input. The Reddit post framing this as "popping the American AI bubble" is provocative, but the thesis — that AI pricing power is eroding faster than Wall Street expects — has teeth. When a model is competitive at 1/20th the cost, margin compression becomes a question of when, not if. (more: https://www.reddit.com/r/OpenAI/comments/1tm49d0/deepseek_just_popped_the_american_ai_bubble/)
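
The quoted multiples can be turned back into absolute numbers with simple arithmetic. The sketch below back-calculates the comparison prices implied by the ratios in the paragraph (roughly $24.97 per million output tokens and $5.00 per million input tokens). These are derived figures, not prices quoted by the source. The workload in the last line is an arbitrary illustration.

```python
DEEPSEEK_INPUT = 0.435   # USD per million input tokens (from the text)
DEEPSEEK_OUTPUT = 0.87   # USD per million output tokens (from the text)

# Back out the comparison prices implied by the quoted multiples.
implied_opus_output = DEEPSEEK_OUTPUT * 28.7
implied_gpt_input = DEEPSEEK_INPUT * 11.5


def job_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD for a job, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    print(round(implied_opus_output, 2), round(implied_gpt_input, 2))
    # Illustrative workload: 2M input tokens and 500K output tokens.
    print(job_cost(2_000_000, 500_000, DEEPSEEK_INPUT, DEEPSEEK_OUTPUT))
```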

The argument gains force when you realize V4 Flash already runs locally on consumer hardware. A self-described "vibe coder" got the 284B-total / 13B-active MoE running at 8.4 tok/s on three RTX 3090s with 128GB system RAM. The catch: mainline llama.cpp doesn't support V4's architecture yet (Compressed Sparse Attention, Sinkhorn-normalized hyperconnections, 256-expert routing), so it lives on forks. Most HuggingFace GGUFs were quantized against an older fork with incompatible metadata, so 12 missing metadata keys must be added and ~393 tensors renamed before they load. The poster published a one-pass Python patcher that stream-copies the 150GB weight blob with a rewritten header in four minutes on NVMe. The key flag: --cpu-moe keeps all 256 expert FFNs in system RAM while 16GB of non-expert weights fit across the GPUs. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tptuph/deepseek_v4_flash_at_84_toks_on_33090_patching/)
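
For readers unfamiliar with why a header rewrite is enough, the sketch below parses the fixed-size GGUF preamble and shows the counter a patcher would have to update when adding metadata keys. It assumes the little-endian GGUF v2+ layout (4-byte magic, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count). It is not the poster's patcher, and the function names are illustrative.

```python
import struct

# magic, version, tensor_count, metadata_kv_count (little-endian, GGUF v2+)
GGUF_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(blob):
    """Parse the fixed-size GGUF preamble from the first bytes of a file."""
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack_from(blob, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def bump_kv_count(blob, added_keys):
    """Return a new fixed header whose metadata count reflects added keys.

    A real patcher must also write the new key/value records and the
    renamed tensor names; this only shows the counter update.
    """
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack_from(blob, 0)
    return GGUF_HEADER.pack(magic, version, tensor_count, kv_count + added_keys)
```

Because the tensor data sits after the metadata and tensor-info sections, a patcher can rewrite those leading sections and then copy the weight bytes through unchanged, provided it preserves the data alignment the format expects.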

The third thread connects the dots: an r/LocalLLaMA discussion on why AI companies spread FUD about their own products. The "critihype" thesis: if the "critique" is that AI is too powerful, too dangerous, too close to AGI — that's marketing, not critique. The fear serves a specific purpose: encouraging regulation that creates barriers to entry before open models render centralized gatekeeping irrelevant. "Without regulation, they have no moat. With regulation, they have a regulatory apparatus they can capture." Whether that's tinfoil-hat territory or pattern recognition, the economic logic is hard to dismiss when a 284B model runs on three consumer GPUs at pennies per million tokens. The three threads form a causal chain that this publication has covered in pieces but never unified: pricing pressure makes local inference viable, local inference viability erodes the API moat, and the erosion of the moat makes regulatory capture the only durable defense. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tp92dn/why_are_the_ai_companies_spreading_fud_about_ai/)

The KV Cache Compression Race

Running long-context models locally means confronting a brutal memory constraint: at 128K context, the KV cache can dwarf the model weights. This publication has tracked KV cache quantization methods across six editions, and the community consensus has been bleak: sub-4-bit approaches consistently degraded quality to the point where FP8 emerged as the practical default. OSCAR may be the first method to crack the accuracy wall below 3 bits. It proposes a clean solution: capture Q/K/V activations on a calibration set, compute attention-aware covariance matrices per layer, then derive rotations that align quantization with the directions attention actually uses. The result is INT2 KV cache storage — roughly 7× compression versus BF16 — with single-digit percentage-point accuracy drops. On a 5-benchmark mean across reasoning and coding tasks, Qwen3-8B under OSCAR scores 69.42% versus 70.84% for BF16, while QuaRot-INT2 collapses to 10.14% and naive INT2 hits zero. OSCAR is the only INT2 method that doesn't collapse on reasoning, and it's integrated into SGLang with a pre-computed "RotationZoo" on HuggingFace so users can skip calibration entirely. (more: https://github.com/FutureMLS-Lab/OSCAR)
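
The memory pressure is easy to quantify. The sketch below computes raw KV cache size for an assumed model shape of 32 layers, 8 KV heads, and head dimension 128, chosen only for illustration. It ignores quantization metadata such as per-group scales.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bits_per_value):
    """Bytes for keys plus values across all layers, ignoring quantization metadata."""
    values = 2 * layers * kv_heads * head_dim * context_len  # 2 = keys and values
    return values * bits_per_value // 8


# Assumed shape for illustration: 32 layers, 8 KV heads, head dimension 128.
bf16 = kv_cache_bytes(32, 8, 128, 128 * 1024, 16)
int2 = kv_cache_bytes(32, 8, 128, 128 * 1024, 2)

if __name__ == "__main__":
    print(bf16 / 2**30, "GiB at BF16")
    print(int2 / 2**30, "GiB at INT2")
```

For this shape the cache is 16 GiB at BF16 and 2 GiB at INT2, a nominal 8× reduction. The gap to the roughly 7× quoted above is plausibly storage overhead for scales and similar metadata, though the source does not break this down.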

Shard takes a different approach: PCA plus INT4 on K heads (effectively low-rank once you undo RoPE), and Hadamard rotation plus vector quantization on V heads. The result is ~10× KV compression on Llama-3.1-8B without measurable NIAH or LongBench degradation. Community reception was characteristically blunt — "stop vibe coding for llama 3.1, make it for Qwen 3.6 27B" — but the split K/V treatment is a genuinely novel design choice (more: https://www.reddit.com/r/LocalLLaMA/comments/1tnvo7r/shard_getting_to_10_kv_cache_compression/). LightVLM bundles KV compression with INT4 weights and visual token pruning for vision-language models, squeezing LLaVA-1.5-7B to 6.2GB peak memory at 71.9 tok/s on an RTX 4090 — down from 15.1GB at FP16 (more: https://github.com/cortsdine/LightVLM). And Speech Tokenizer Arena applies the same side-by-side comparison philosophy to discrete speech tokenizers, benchmarking EnCodec, DAC, SpeechTokenizer, and HuBERT units on reconstruction quality, bitrate, and downstream WER — the kind of standardized comparison the speech community has lacked (more: https://github.com/andraiming/speech-tokenizer-arena).
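
Why a Hadamard rotation helps before quantization can be shown in a few lines. The sketch below is a plain orthonormal Walsh-Hadamard transform, not Shard's implementation. It demonstrates that a single outlier's energy is spread evenly across coordinates, which shrinks the value range a quantizer has to cover, and that the rotation is exactly invertible.

```python
import math


def hadamard_rotate(vec):
    """Apply the orthonormal Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [x * scale for x in out]


# One large outlier dominates the range of the original vector.
v = [8.0, 0.0, 0.0, 0.0]
rotated = hadamard_rotate(v)         # energy is now spread evenly
restored = hadamard_rotate(rotated)  # the normalized transform is its own inverse
```

Here the largest magnitude drops from 8 to 4 while the vector norm is unchanged, so a fixed number of quantization levels covers the rotated values at finer resolution.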

Custom Inference Engines: AMD and the Edge

hipEngine is a new ROCm-native inference engine built from scratch with custom HIP/C++ kernels — no PyTorch dependency — that runs Qwen 3.6 MoE competitively with llama.cpp on RDNA3. On a Radeon 7900 XTX, prefill throughput with ParoQuant hits 2,839 tok/s at 4K context versus 2,177 for llama.cpp HIP. With near-lossless INT8 KV cache, the full 256K context window fits in under 24GB. On Strix Halo (Ryzen AI MAX+ 395), it already beats llama.cpp HIP on decode at most context lengths with "minimal targeted optimization." The engine ships with 100+ custom kernels, initial GGUF support, and an AGPLv3 license. The GGUF path is strategically important — it means existing quantized models work without custom training, unlike ParoQuant which can take days to quantize. The developer notes the kernel optimization was "the result of hundreds (thousands?) of rounds of AI-assisted generation," making hipEngine itself an artifact of the agentic coding workflows discussed above. Community interest centers on multi-GPU support and expanding beyond Qwen to Gemma 4 and other architectures. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tmq4s6/hipengine_fast_native_qwen_36_inference_for_rdna3/)

At the extreme edge, a developer built a from-scratch C++ engine to run MiniCPM-V 4.6 entirely on the Orange Pi AIPro — a $149 board with Huawei's Ascend 310B NPU. Stock aclnnMm gave 2.88 tok/s; custom AscendC cube kernels for M=1 matmul, 16-chunk lm_head tiling, and vectorized causal-conv1d brought it to 5.90 tok/s — a 2× speedup. Both text generation and SigLIP vision tower run natively on the NPU with zero PyTorch dependency. The theoretical floor at 44 GB/s bandwidth is 32ms per step; current latency sits at 170ms, with fused INT4/INT8 dequantization as the next optimization target. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tmy4g9/wrote_a_custom_c_engine_for_minicpmv_46_on_orange/)
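
The bandwidth floor quoted above follows from a standard roofline argument: a memory-bound decode step must read its working set once, so the minimum step time is bytes read divided by bandwidth. The sketch below inverts that relation using only the figures in the paragraph. The implied working-set size is a derived number, not one stated by the source.

```python
BANDWIDTH_BYTES_PER_S = 44e9  # memory bandwidth, from the text
FLOOR_S = 0.032               # theoretical floor per decode step, from the text
MEASURED_TOK_PER_S = 5.90     # measured decode throughput, from the text

# floor = bytes_per_step / bandwidth, so bytes_per_step = bandwidth * floor.
implied_bytes_per_step = BANDWIDTH_BYTES_PER_S * FLOOR_S

# One token per step, so step latency is the reciprocal of throughput.
measured_step_s = 1 / MEASURED_TOK_PER_S

# How far the current engine sits above the bandwidth floor.
headroom = measured_step_s / FLOOR_S

if __name__ == "__main__":
    print(implied_bytes_per_step / 1e9, "GB read per step (implied)")
    print(measured_step_s * 1000, "ms per step (measured)")
    print(headroom, "x above the floor")
```

The measured throughput corresponds to about 170 ms per step, matching the latency quoted, and leaves a bit more than 5× of headroom before the bandwidth limit.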

Platform Control: APIs, Labels, and the Data Arms Race

Volkswagen has permanently disabled the unofficial API that Home Assistant's volkswagencarnet integration relied on, requiring a "client assertion" only approved commercial partners can provide. Existing tokens work until expiry; new authorizations are blocked. The Android app and browser login remain functional — this isn't a service outage, it's a deliberate lockout of third-party automation. A commercial API exists (used by Tronity) but is paid and gated. For the Home Assistant community, this is the automaker equivalent of the robot vacuum that bricked when its owner blocked telemetry: your car, their data. (more: https://github.com/robinostlund/homeassistant-volkswagencarnet/issues/967)

YouTube is moving in a more transparent direction, rolling out automatic detection and labeling of AI-generated video content. Starting May 2026, if a creator doesn't disclose photorealistic AI use and YouTube's systems detect it, a label is applied automatically — displayed as an overlay and below the player. Creators can contest incorrect detections, but disclosures are permanent for content made with YouTube's own tools (Veo, Dream Screen) or carrying synthetic-media provenance metadata. The policy explicitly notes "a disclosure label alone does not change how a video is recommended or whether it's eligible to earn money" — a line that will be tested as AI-generated content scales. (more: https://blog.youtube/news-and-events/improving-ai-labels-viewers-creators/)

A 103-billion-token Usenet corpus spanning 1980–2013 is being pitched as "pre-web, human-only, zero AI contamination." It sounds valuable until the community starts poking holes: multiple commenters questioned the ethics of licensing other people's public posts for commercial training without consent, others noted AI labs likely already scraped Usenet, and the "free preview, paid full access" model "goes against the spirit of Usenet and the pre-algorithm internet" (more: https://www.reddit.com/r/LocalLLaMA/comments/1tphhqk/i_built_a_103btoken_usenet_corpus_19802013_preweb/). Meanwhile, Clark Browser takes the automation-versus-detection arms race to the source level. Rather than papering over headless-browser tells with JavaScript injections, it patches Chromium where the values originate — navigator.webdriver returns false, plugins show five PDF-viewer entries, WebGL strings match real GPUs, UA Client Hints are coherent. It passes Cloudflare challenges, SannySoft, and BrowserLeaks in testing, ships prebuilt binaries under MIT, and offers 30+ --fingerprint-* CLI switches for deterministic identity generation (more: https://github.com/clark-labs-inc/clark-browser).

Sources (22 articles)

  1. [Editorial] Barracuda Nightmare Eclipse Zero-Days (blog.barracuda.com)
  2. [Editorial] Video (youtube.com)
  3. New DeepSWE benchmark finds Claude Opus cheats (reddit.com)
  4. ITBench-AA: Frontier Models Score Below 50% on Enterprise IT Tasks — by Artificial Analysis and IBM (huggingface.co)
  5. Context, Reasoning, and Hierarchy: Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP (arxiv.org)
  6. 10 years of AI robustness tricks (PGD, RLHF, Data Augmentation) are actually computing the same hidden matrix (reddit.com)
  7. [Editorial] Dynamic Workflows in Claude Code (claude.com)
  8. Using AI to write better code more slowly (nolanlawson.com)
  9. Overnight autonomous coding with Claude Code (reddit.com)
  10. DeepSeek just popped the American AI bubble. (reddit.com)
  11. DeepSeek V4 Flash at 8.4 tok/s on 3×3090 — patching GGUF metadata for cchuter's fork (reddit.com)
  12. Why are the AI Companies spreading F.U.D. about AI? (reddit.com)
  13. OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (github.com)
  14. Shard — getting to 10× KV cache compression (reddit.com)
  15. LightVLM: Efficient inference toolkit for vision-language models (github.com)
  16. Speech Tokenizer Arena: Side-by-side benchmarking for discrete speech tokenizers (github.com)
  17. hipEngine: Fast native Qwen 3.6 inference for RDNA3 — ROCm-native open-source engine (reddit.com)
  18. Custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro — 2× speedup via custom AscendC kernels (reddit.com)
  19. Volkswagen blocks Home Assistant by requiring client assertion (github.com)
  20. YouTube to automatically label AI-generated videos (blog.youtube)
  21. 103B-token Usenet corpus (1980-2013) — pre-web, human-only, zero AI contamination (reddit.com)
  22. Clark Browser: AI-native browser that finds a way (github.com)