Application Security's Rough Week

Published on May 26, 2026

Today's AI news: Application Security's Rough Week, Subscription Fraud's Maturing Toolkit, LLMs Enter the SOC, The Harness Is the Product, Beyond Vectors: Agent Memory Gets a Graph, Ternary Training, DNA Models, and Small-Model Pragmatism. 22 sources curated from across the web.

Application Security's Rough Week

Next.js v16.2.5 dropped quietly, but the security fixes it carried were anything but quiet. A researcher reverse-engineered the commit diff between v16.2.4 and v16.2.5 using ProjectDiscovery's Neo tool and published proof-of-concept exploits for all twelve patched advisories. The haul includes six High-severity issues: a React server-action stream denial-of-service (CVE-2026-23870), an App Router prefetch middleware bypass (CVE-2026-44575), a next-resume connection exhaustion attack (CVE-2026-44579), a dynamic-route and middleware mismatch (CVE-2026-44574), a WebSocket upgrade SSRF affecting self-hosted deployments (CVE-2026-44578), and a Pages Router i18n data-route bypass (CVE-2026-44573). The four moderate-severity items hit CSP nonce parsing, next/script XSS via beforeInteractive, an image optimizer decompression bomb, and RSC/HTML cache confusion. Two low-severity issues round it out: weak _rsc cache-busting hashes and x-nextjs-data redirect cache poisoning. Each PoC directory ships a runnable exploit script plus a vulnerable minimal app for reproduction. If you run Next.js in production and haven't upgraded, attackers just got a detailed instruction manual for your attack surface (more: https://github.com/dwisiswant0/next-16.2.4-pocs).

The lesson that client-side code cannot be trusted got a painful refresher from India's Central Board of Secondary Education. A student — fresh off his own Class 12 exams — poked at the On-Screen Marking portal that thousands of examiners use to grade answer sheets digitally and found a catastrophe. A hardcoded master password sat in plain text inside the minified Angular bundle, bypassing OTP entirely. The OTP itself was theater: the server returned the code in the auth response, and the browser graded its own test. No route guards existed — seeding a few values into browser storage dropped you onto the dashboard without authentication. The password-change API never verified the current password, and a systemic IDOR across the entire API meant changing one ID in localStorage let you act as any user. Combined, an attacker could take over any examiner account and edit marks for a national exam affecting millions of students. The researcher reported everything to CERT-In, got a boilerplate acknowledgment, and watched the flaws sit unpatched for months (more: https://ni5arga.com/blog/posts/hacking-cbse/).

Meanwhile, scammers have been abusing Microsoft's own account-alert@accountprotection.microsoft.com address — the one used for legitimate two-factor authentication codes — to send phishing emails. Anti-spam nonprofit Spamhaus confirmed the activity has persisted for months, noting that "automated notification systems should not allow this level of customization." Microsoft acknowledged the issue after TechCrunch's inquiry and says it is "actively investigating," but the loophole remains open (more: https://techcrunch.com/2026/05/21/scammers-are-abusing-an-internal-microsoft-account-to-send-spam/).

Subscription Fraud's Maturing Toolkit

The Gpt-Agreement-Payment repository — an end-to-end replay tool for obtaining ChatGPT Plus and Team subscriptions through automated Stripe, PayPal, GoPay, and QRIS billing agreement flows — has matured considerably since it first surfaced. The project now ships a Docker-based WebUI with a 14-step configuration wizard, N-worker concurrency with advisory-lock mutual exclusion for OTP stages, and a 12-path self-healing daemon designed for weeks of unattended operation. The hCaptcha visual solver alone runs roughly 4,000 lines covering twelve known challenge types, falling back through a VLM → CLIP → OpenCV decision ladder. A PayPal guest checkout path using OpenAI promo campaign long-links adds a fourth payment vector alongside the original billing agreement, GoPay wallet linking, and QRIS QR-code flows. The concurrency model is particularly well-engineered: multiple workers sharing a phone pool use advisory locks to serialize only the OTP stage while running pre- and post-OTP steps fully in parallel, with database atomic claims preventing double-allocation of promo links or inventory emails.

What makes this more than a script kiddie toolkit is the anti-fraud research documentation. The author reports that batch-created accounts show approximately 2% next-day survival. OpenAI's defenses operate on at least two layers: a probe layer that detects fraud signals (IP string-level fingerprinting, batch correlation across account creation timestamps, Stripe runtime fingerprint drift) and a separate ban layer that acts on delayed schedules, catching clusters that real-time blocking misses. The runtime.version and js_checksum fields in Stripe's client-side telemetry drift every few weeks and require manual re-alignment — a moving target that makes fully automated fraud unsustainable at scale. The repo is explicitly positioned for CTF, bug bounty, and authorized security research; the author's own data demonstrates why the economics of subscription fraud don't pencil out for actual criminals. That 2% survival rate is the anti-fraud system working as designed (more: https://github.com/DanOps-1/Gpt-Agreement-Payment).

LLMs Enter the SOC

A new research paper from the Austrian Institute of Technology introduces CAM-LDS, the Cyber Attack Manifestation Log Data Set — a purpose-built benchmark for testing whether LLMs can interpret security logs. The dataset covers seven fully scripted attack scenarios executing 81 distinct MITRE ATT&CK techniques across 13 tactics, collected from 18 log sources in a reproducible Linux environment. Unlike red-team exercise datasets where labeling is noisy and human error is common, CAM-LDS uses AttackMate playbooks for deterministic, repeatable execution with keystroke-level realism. The scenarios span video server exploitation, Linux malware deployment, lateral movement, network attacks, credential sniffing, client-side social engineering, and Docker container escape — covering the full kill chain from reconnaissance through exfiltration and impact.

The paper's analysis of how attacks manifest in logs reveals sobering numbers. Of 243 attack steps, 47.3% leave explicit command traces in audit logs and 17.3% show indicators in other log sources, but 32.1% produce events that don't directly reference the executed command, and 18.5% leave no observable traces at all after filtering. Signature-based IDS (Wazuh and Suricata with default rules) detected only 43 of 198 log-producing attack steps at Low-High severity. The illustrative LLM evaluation using ChatGPT 5.2 in a zero-shot setting is where it gets interesting: correct MITRE ATT&CK techniques appeared in the top prediction for roughly one-third of attack steps, and within the top-10 predictions for another third. Classification accuracy correlated strongly with command observability, high event frequencies, and the presence of IDS alerts. The bottom third — subtle attack steps with indirect manifestations — remained largely opaque to the model. The dataset, infrastructure, attack scripts, and prompts are all open-source (more: https://arxiv.org/abs/2603.04186v1).
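The numbers above are easy to sanity-check, and the top-1 / top-10 scoring can be expressed as a top-k hit rate. The sketch below is illustrative only: `top_k_hit_rate` and the toy predictions are assumptions for demonstration, not the paper's evaluation code or data.

```python
def top_k_hit_rate(predictions: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of attack steps whose true technique ID is in the first k predictions."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(truth in ranked[:k] for ranked, truth in zip(predictions, truths))
    return hits / len(truths)


# Figures quoted in the text.
TOTAL_STEPS = 243
NO_TRACE_SHARE = 0.185

# Steps that leave at least some trace: 243 * (1 - 0.185) rounds to 198.
log_producing_steps = round(TOTAL_STEPS * (1 - NO_TRACE_SHARE))

# Signature-based IDS coverage: 43 of 198, roughly 22%.
ids_detection_rate = 43 / log_producing_steps

# Toy ranked predictions per attack step (made up for illustration).
toy_predictions = [
    ["T1059", "T1021"],
    ["T1046", "T1003"],
    ["T1110", "T1059"],
]
toy_truths = ["T1059", "T1003", "T1021"]

top1 = top_k_hit_rate(toy_predictions, toy_truths, k=1)
top2 = top_k_hit_rate(toy_predictions, toy_truths, k=2)
```

Read this way, the default-rule IDS catches roughly one in five of the steps that leave any trace at all, which is the baseline the LLM results are measured against.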

On the offensive tooling side, goLoL is a new Go-based Windows scanner that pulls the latest LOLBAS (Living Off The Land Binaries and Scripts) catalog, resolves which binaries actually exist on the local filesystem, and filters techniques by the current privilege tier — standard user, administrator, or SYSTEM. It maps each technique to MITRE ATT&CK IDs with example commands, providing a privilege-aware attack surface inventory in a single scan (more: https://github.com/aaron-kidwell/goLoL). And for the reverse engineering crowd, Oxidizer has merged into the angr decompiler framework as the first decompiler that actually understands Rust — not C-with-Rust-syntax, but real Rust with drop_in_place, Option<T>, bounds checks, and Result<T, E> recovery through a pipeline of FLIRT-based compiler fingerprinting, fCFG simplification, Retypd type inference, and high-level Rust pseudocode output (more: https://www.linkedin.com/posts/mitkox_normal-rust-cult-followers-rewrite-it-in-activity-7463557322452373507-2qz9).

The Harness Is the Product

A sweeping 30-page survey from UIUC, Meta, and Stanford frames the shift that practitioners are living through: code is no longer just what agents produce — it is the operational substrate through which they reason, act, and verify. The paper introduces "code as agent harness," organizing the landscape into three layers: the harness interface (code for reasoning, action, and environment modeling), harness mechanisms (planning, memory, tool use, feedback-driven control), and scaling the harness across multi-agent systems with shared code artifacts. The taxonomy covers everything from Claude Code and Codex to embodied robots and scientific discovery pipelines, arguing that the bottleneck of autonomy is increasingly the reliability of the system connecting model outputs to long-horizon actions — not the reasoning ability of the base model itself (more: https://arxiv.org/pdf/2605.18747).

The practical evidence keeps stacking up. OpenAI's Codex just shipped an experimental "goals" feature that amounts to a built-in Ralph loop with budget-aware graceful shutdown. Give it /goal, define completion criteria, and it runs autonomously for hours (creating assets, writing code, running Playwright verification scripts), with four exit paths depending on budget, completion state, crashes, or manual pause. One user reported a 50-hour continuous run. The key differentiator over hand-rolled loops is integrated billing management: when approaching token limits, Codex injects a budget-limit markdown file and wraps up gracefully with a status report rather than simply dying (more: https://www.youtube.com/watch?v=nOFordZCyzs).

A practitioner running a fully local agent stack on DGX Spark + Mac Mini + A5000 hardware offers the field report that crystallizes why harness engineering matters more than model selection. The single biggest win: stop giving local models 39 tools. A fast router model (phi4:14b, ~50ms) classifies intent and routes to specialists that see only 3-6 relevant tools each. Most pipelines use deterministic code-controlled workflows — not ReAct loops — where the model only extracts parameters and synthesizes text. A 30B model choosing between 6 tools is reliable; the same model choosing between 39 "wanders, picks wrong tools, and burns iterations." Memory moved from flat JSONL (which accumulated 14 near-duplicate facts) to FalkorDB with a graph schema featuring SUPERSEDES chains for fact evolution (more: https://www.reddit.com/r/ollama/comments/1tn3l7y/harnesses/).
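The router-plus-specialists pattern described above can be sketched in a few lines. This is a hedged illustration rather than the practitioner's code: the specialist names, tool names, and the keyword-based `keyword_router` (a stand-in for the fast router model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Specialist:
    name: str
    tools: tuple[str, ...]  # the only tools this specialist's model ever sees


# Hypothetical specialists, each with a small tool set (3-6 tools).
SPECIALISTS = {
    "calendar": Specialist("calendar", ("list_events", "create_event", "delete_event")),
    "files": Specialist("files", ("read_file", "write_file", "search_files", "list_dir")),
    "web": Specialist("web", ("search_web", "fetch_url", "summarize_page")),
}
FALLBACK = "web"


def keyword_router(message: str) -> str:
    """Stand-in for the fast router model: returns an intent label."""
    text = message.lower()
    if any(word in text for word in ("meeting", "calendar", "schedule")):
        return "calendar"
    if any(word in text for word in ("file", "folder", "directory")):
        return "files"
    return FALLBACK


def route(message: str, classify: Callable[[str], str] = keyword_router) -> Specialist:
    """Classify intent first, then hand off to a specialist with a narrow tool list."""
    intent = classify(message)
    # Unknown labels fall back instead of exposing every tool at once.
    return SPECIALISTS.get(intent, SPECIALISTS[FALLBACK])


chosen = route("Schedule a meeting with the team tomorrow")
```

The design choice is that the downstream model's choice space stays small: whatever the router decides, the specialist never has to pick from the full tool catalog.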

DeepSeek Reasonix takes a different but complementary approach: an append-only, cache-first coding agent loop engineered specifically around DeepSeek's byte-stable prefix cache. By never reordering or compacting message history, long sessions hold 90%+ cache hit rates and input-token cost collapses to roughly one-fifth of standard pricing. It supports Model Context Protocol (MCP) servers via one-line mount and sandboxes all tools to the launch directory (more: https://esengine.github.io/DeepSeek-Reasonix/). Archon extends the integration surface by adding a Jira adapter — comment in a ticket, get real-time AI agent responses — joining existing adapters for GitHub, Slack, Discord, and Telegram (more: https://www.youtube.com/watch?v=qyB52HIiou8).
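The "roughly one-fifth" figure follows from simple blending of cached and uncached token prices. The sketch below assumes cached input tokens cost one-tenth of the uncached price; that ratio is an illustrative assumption, not a quoted DeepSeek price, and `effective_input_cost` is a hypothetical helper.

```python
def effective_input_cost(hit_rate: float, cached_price_ratio: float) -> float:
    """Blended input-token cost as a fraction of the uncached price.

    hit_rate: share of input tokens served from the prefix cache (0..1).
    cached_price_ratio: cached-token price divided by uncached price (assumed).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if cached_price_ratio < 0.0:
        raise ValueError("cached_price_ratio must be non-negative")
    uncached_share = 1.0 - hit_rate
    return uncached_share + hit_rate * cached_price_ratio


# 90% of tokens hit the cache; cached tokens assumed to cost 10% of full price.
blended = effective_input_cost(hit_rate=0.90, cached_price_ratio=0.10)
# 0.10 (misses at full price) + 0.09 (hits at a tenth) = about 0.19
```

Under that assumption the blended cost lands just under one-fifth, and the same formula shows why reordering or compacting history is expensive: every lost cache hit moves a token from the cheap term to the full-price term.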

A merged PR in llama.cpp addresses a pain point anyone doing local agentic coding has hit: context reprocessing. Tools like opencode rewrite conversation history to optimize context, forcing llama.cpp to reprocess the entire prompt (potentially 70k tokens). The checkpoint fix ensures that only the changed portion gets reprocessed, making local agentic coding measurably more responsive. The author reports two weeks of daily use with significant improvements (more: https://www.reddit.com/r/LocalLLaMA/comments/1tn0jyp/server_fix_checkpoints_creation_by_jacekpoplawski/). And on the human side of the harness equation, a detailed argument for evolving from "prompt engineering" to an "AI question method" makes the case that frontier models (Opus 4.7, GPT-5.5) are now senior partners rather than junior ones — they need questions that convey intent with directional focus while leaving room for exploration, not step-by-step task specifications (more: https://youtu.be/ogTLWGBc3cE?si=o9np_D9OoKUerOjv).
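Why history rewrites hurt can be shown with a simplified model of prefix reuse. This is not llama.cpp's checkpoint implementation; it only assumes that cached work is reusable up to the first token that differs, and the function names are hypothetical.

```python
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the longest shared prefix between the cached and new prompts."""
    length = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        length += 1
    return length


def tokens_to_reprocess(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Tokens that must be evaluated again under pure prefix reuse."""
    return len(new_tokens) - reusable_prefix_len(cached_tokens, new_tokens)


cached = list(range(100))            # stand-in for an already-processed prompt

appended = cached + [100, 101]       # a normal turn: new tokens at the end
rewritten = [999] + cached[1:]       # a client edits something near the start

cost_append = tokens_to_reprocess(cached, appended)    # only the 2 new tokens
cost_rewrite = tokens_to_reprocess(cached, rewritten)  # all 100 tokens
```

With pure prefix reuse, an edit near the start of a 70k-token prompt invalidates nearly everything after it, which is the cost the checkpoint fix is meant to avoid.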

Beyond Vectors: Agent Memory Gets a Graph

A deep dive into Neo4j's agent-memory SDK highlights a fundamental limitation of vector indices for long-lived agents: they have no concept of identity. If "Karpathy" appears in ten documents, a vector index cannot unify those references, link them to entities, or know whether today's mention refers to the same person it embedded yesterday. The proposed fix is a three-tier graph memory system anchored to a POLE+O ontology (Person, Object, Location, Event, Organization). Short-term messages use :NEXT chains, long-term entities get deduplicated, and reasoning traces store past thought patterns for one-shot retrieval — essentially reinforcement learning at the database level. Entity extraction uses a confidence ladder: spaCy and GLiNER handle high-confidence cases, LLMs fire only on ambiguity, and an identity gate auto-merges at scores above 0.95 while creating pending :SAME_AS edges for the 0.85-0.95 band. The key design principle: false merges are silent and unrecoverable, false splits are recoverable. Retrieval runs as a single Cypher query fusing vector similarity with multi-hop traversals (more: https://www.reddit.com/r/learnmachinelearning/comments/1tifghn/a_vector_index_cant_tell_if_todays_karpathy_is/).
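The identity gate reduces to a three-way threshold decision. The sketch below uses the 0.95 and 0.85 thresholds quoted above; treating scores below 0.85 as a new entity, and treating exactly 0.95 as pending rather than merged, are assumptions, and `identity_gate` is a hypothetical name rather than the SDK's API.

```python
from enum import Enum

AUTO_MERGE_ABOVE = 0.95   # from the text: auto-merge at scores above 0.95
PENDING_FROM = 0.85       # from the text: pending :SAME_AS for the 0.85-0.95 band


class Decision(Enum):
    MERGE = "merge"                      # fold the mention into the existing entity
    PENDING_SAME_AS = "pending_same_as"  # keep both nodes, add a reviewable edge
    NEW_ENTITY = "new_entity"            # assumed behavior below the band


def identity_gate(score: float) -> Decision:
    """Decide what to do with a candidate match given its similarity score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > AUTO_MERGE_ABOVE:
        return Decision.MERGE
    if score >= PENDING_FROM:
        # A wrong split can be repaired later; a wrong merge cannot,
        # so the uncertain band never merges automatically.
        return Decision.PENDING_SAME_AS
    return Decision.NEW_ENTITY
```

The asymmetry in the thresholds encodes the stated design principle: when in doubt, the gate prefers a recoverable split over a silent, unrecoverable merge.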

On the model adaptation front, an NTK-mirror dual approach claims to make fine-tuning as we know it obsolete. The core idea: one SGD step has an exact dual — a small signed controller on the forward pass that reproduces the supervised update without touching any weight. On Qwen2.5-7B plus GSM8K, a 50k-scalar controller matches LoRA-r8 accuracy in 22 seconds of fitting, infers 100% faster, and forgets 3.4x less (4.7pp vs 16.3pp drop on held-out code generation). Each controller is a 200KB file that can be retrieved at inference and installed via forward hooks with zero token overhead — no KV cache footprint, no quadratic attention cost. Two controllers trained on different tasks compose with just +0.005 NLL drift, versus 17% degradation under equivalent LoRA composition. The package is pip-installable and works on any Hugging Face causal LM, with a preprint promised for next week (more: https://www.linkedin.com/posts/leochlon_ai-machinelearning-llm-activity-7463999124816695297-5xy7).

The broader trajectory these point toward is ambient intelligence — AI systems that sense presence through walls, monitor environments passively, learn patterns continuously, and respond without cameras, cloud dependence, or constant user interaction. The framing positions this as AI escaping the screen to interact with "the invisible world that was always there waiting to become programmable" (more: https://www.linkedin.com/posts/reuvencohen_ambient-intelligence-is-the-next-big-thing-share-7463578300289048576-W1Ef).

Ternary Training, DNA Models, and Small-Model Pragmatism

BitCPM-CANN presents the first end-to-end 1.58-bit (ternary) training system on Huawei's Ascend NPU platform, scaling up to 8B parameters. Four models (0.5B through 8B) were trained using quantization-aware training aligned with MiniCPM4 counterparts in architecture and pre-training data. The results are striking: the 1B, 3B, and 8B variants retain 95.7%–97.2% of full-precision performance across 11 benchmarks spanning commonsense reasoning, domain knowledge, and mathematics. The 0.5B variant retains 90.1%, with the gap concentrated in mathematics — indicating that capacity, not the ternary quantizer, is the bottleneck at sub-billion scales. Training overhead is a mere 4.5% (148 vs. 155 TFLOP/s per NPU), with up to 8x weight memory reduction at inference. The political subtext is hard to miss: this proves ternary training is viable as a default configuration entirely outside the CUDA ecosystem (more: https://www.reddit.com/r/LocalLLaMA/comments/1tmf63y/bitcpmcann_native_158bit_large_language_model/).
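For readers unfamiliar with ternary weights, here is a minimal sketch of one common quantizer: the absmean scheme from BitNet b1.58. The summary above does not say which quantizer BitCPM-CANN uses, so this is an assumption for illustration, written in plain Python rather than as a training-time kernel.

```python
import math


def ternary_quantize(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, 1}."""
    if not weights:
        raise ValueError("weights must not be empty")
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Approximate reconstruction: each weight becomes -scale, 0, or +scale."""
    return [q * scale for q in quantized]


# Three possible values carry log2(3) bits, hence the "1.58-bit" label.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

# Throughput figures quoted in the text: 148 vs. 155 TFLOP/s per NPU.
training_overhead = (155 - 148) / 155   # about 4.5%

q, s = ternary_quantize([0.9, -0.05, -1.2, 0.4, 0.0])
```

In quantization-aware training the full-precision weights are kept for the optimizer and only the forward pass sees the ternary values; that part is omitted here.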

Hugging Face's Carbon release brings modern LLM training techniques to genomics. Carbon-3B matches the current state of the art (Evo2-7B) while running 275x faster. Three choices stand out: deterministic 6-mer tokenization instead of BPE (which "doesn't behave well on DNA"); a factorized loss function that replaced cross-entropy mid-training, because token-level cross-entropy scores a 6-mer with 5 of 6 nucleotides right the same as one with 0 of 6, which is brittle for genomic sequences; and a curated, staged data mixture of functional DNA and mRNA (more: https://www.reddit.com/r/LocalLLaMA/comments/1thsw7b/carbon_decoding_the_language_of_life/).
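Deterministic 6-mer tokenization is simple enough to sketch directly. The version below assumes non-overlapping windows, a base-4 ID scheme, and dropping any trailing remainder; Carbon's actual tokenizer may differ on all three, and the function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"
K = 6
VOCAB_SIZE = len(NUCLEOTIDES) ** K  # 4**6 = 4096 possible 6-mers


def kmer_to_id(kmer: str) -> int:
    """Read a k-mer as a base-4 number (A=0, C=1, G=2, T=3)."""
    token_id = 0
    for base in kmer:
        # str.index raises ValueError on anything outside ACGT (e.g. 'N').
        token_id = token_id * len(NUCLEOTIDES) + NUCLEOTIDES.index(base)
    return token_id


def tokenize(sequence: str) -> list[int]:
    """Split into non-overlapping 6-mers; a trailing partial window is dropped."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % K
    return [kmer_to_id(seq[i:i + K]) for i in range(0, usable, K)]


def nucleotide_matches(predicted: str, target: str) -> int:
    """Per-nucleotide agreement, the signal a whole-token loss throws away."""
    return sum(p == t for p, t in zip(predicted, target))


tokens = tokenize("AAAAAATTTTTTAC")
near_miss = nucleotide_matches("ACGTAC", "ACGTAA")  # 5 of 6 correct
```

Unlike a learned BPE vocabulary, the same sequence always maps to the same IDs. The `nucleotide_matches` helper shows the information lost when a 6-mer is scored as a single right-or-wrong token: a prediction that is one base off and a prediction that is entirely wrong look identical.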

The AI content detection arms race continues with Slop Hammer, a Chrome extension running a Qwen 3.5 0.8B model fine-tuned on Pangram's EditLens dataset. The model runs entirely locally in under one second on an M1 MacBook, outputting probability distributions across four detection buckets rather than a single binary score. The limitations are real — GPT-5.5 output already confuses it, and community testing reports abundant false positives on complex human writing — but the approach of distilling detection into a 400MB local model running in a browser extension is notable for its accessibility (more: https://www.reddit.com/r/LocalLLaMA/comments/1tngkav/ai_content_detector_based_on_qwen_08b_finetuned/). On the creative tooling front, a pipeline that uses LLMs as structured code compilers to generate multi-part 3D objects with functional articulated parts through Blender's scene graph — exporting clean GLB files with working pivot axes instead of the monolithic mesh blobs that diffusion-based systems produce — shows what happens when you treat generation as code compilation rather than pixel prediction. Local models still hallucinate Blender's internal matrix math, but the architecture is model-agnostic (more: https://www.reddit.com/r/LocalLLaMA/comments/1thucyj/a_tool_i_built_to_generate_3d_objects_with/). And Audiomass ships a free, open-source multitrack audio editor that runs entirely in the browser — drag, drop, cut, paste, pitch-shift, all client-side with no install and no upload (more: https://audiomass.co/?multitrack=1).

Sources (22 articles)

dwisiswant0/next-16.2.4-pocs (github.com)
Exposing Critical Vulnerabilities in CBSE's On-Screen Marking Portal (ni5arga.com)
Scammers are abusing an internal Microsoft account to send spam links (techcrunch.com)
DanOps-1/Gpt-Agreement-Payment (github.com)
CAM-LDS: Cyber Attack Manifestations for Automatic Interpretation of System Logs and Security Alerts (arxiv.org)
aaron-kidwell/goLoL (github.com)
[Editorial] (linkedin.com)
[Editorial] (arxiv.org)
Codex Just Became THE BEST Long Running Agentic Harness (youtube.com)
Harnesses (reddit.com)
DeepSeek reasonix, DeepSeek native coding agent with high caching and low cost (esengine.github.io)
Archon + Jira: Drag a Ticket, Get a Pull Request (Live Build) (youtube.com)
server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp (reddit.com)
[Editorial] (youtu.be)
A vector index can't tell if today's "Karpathy" is the same one it saw yesterday. Here's the fix (reddit.com)
[Editorial] (linkedin.com)
[Editorial] (linkedin.com)
BitCPM-CANN: Native 1.58-Bit Large Language Model Training on Ascend NPU (reddit.com)
Carbon: Decoding the Language of Life (reddit.com)
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset (reddit.com)
A tool I built to generate 3D objects with functional, articulated parts. It's on github, and is mostly LLM-agnostic. (reddit.com)
Show HN: Audiomass – a free, open-source multitrack audio editor for the web (audiomass.co)