AI-Powered Vulnerability Discovery

Today's AI news: AI-Powered Vulnerability Discovery, Agent Reliability and the Verification Gap, Inside the Agent Loop, Agentic Commerce and the Messy Middle, Benchmarks Under Siege, The Open-Weight Race, Local AI Beyond the Chat Window. 22 sources curated from across the web.

AI-Powered Vulnerability Discovery

The security community's zero-day discovery toolchain has crossed a threshold: it is no longer a frontier-model capability but an orchestration problem that any reasonably capable model can solve for the price of a nice dinner. Niels Provos demonstrated this with IronCurtain, an open-source framework that structures vulnerability discovery as a finite-state machine defined in plain YAML. An orchestrator agent acts as a strategic router, dispatching specialized sub-agents based on an append-only execution journal. Using commercial models (Opus 4.6, Sonnet 4.6) and open-weight models (GLM 5.1), Provos replicated Anthropic's finding of the 1998 OpenBSD TCP SACK vulnerability and discovered new zero-days in widely deployed software, with each scan costing $30-150 per codebase. (more: https://www.linkedin.com/posts/clintgibler_finding-zero-days-with-any-model-by-niels-activity-7457855243591086081-eBlE)
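
For flavor, here is a minimal sketch of what a YAML-defined discovery state machine and its router loop could look like. The schema, state names, and dispatch hook are illustrative assumptions, not IronCurtain's actual format:

```python
# Hypothetical YAML FSM plus router loop; schema and names are invented.
import yaml

FSM_YAML = """
states:
  recon:  {agent: code_mapper, next: {entry_points_found: triage}}
  triage: {agent: vuln_hunter, next: {candidate_found: verify, exhausted: done}}
  verify: {agent: poc_builder, next: {poc_confirmed: done, poc_failed: triage}}
initial: recon
"""

def run(fsm, dispatch):
    journal = []                       # append-only execution journal
    state = fsm["initial"]
    while state != "done":
        spec = fsm["states"][state]
        outcome = dispatch(spec["agent"], journal)   # sub-agent does the work
        journal.append({"state": state, "agent": spec["agent"], "outcome": outcome})
        state = spec["next"].get(outcome, "done")    # router picks the next state
    return journal

fsm = yaml.safe_load(FSM_YAML)
```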

Gadi Evron, who co-created raptor (the Recursive Autonomous Penetration Testing Observation Robot, which chains frontier LLMs with semgrep, CodeQL, and AFL++), published a four-step practical guide for getting started. Step one: the Carlini loop ("I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with ${FILE}"). Step two: a skill file to manage the audit process. Step three: raptor for autonomous offensive operations. Step four: OpenAnt, the open-source detect-then-attack pipeline from Knostic that functions as the community alternative to Anthropic's Claude Code Security and OpenAI's Codex Security. Evron's central message is blunt: these cyber-reasoning capabilities have been present in models for 6-9 months, with 800+ verified findings and 200+ assigned CVEs across heavily scrutinized codebases. The time for defenders to harden their code was yesterday. Community responses added useful caution: the false positive rate will humble you fast, and your credibility with open-source maintainers is a one-time spend, so validate before you even think about disclosure. (more: https://www.linkedin.com/posts/gadievron_so-youd-like-to-get-started-finding-0days-share-7447929000364023808-ySQ8)
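
Step one is simple enough to sketch as a bare driver that re-issues the Carlini prompt per file. ask_model below is a placeholder for whatever model client you use; triaging and validating the answers is the part that actually takes time:

```python
# Bare "Carlini loop" driver; ask_model is a hypothetical model client.
from pathlib import Path

PROMPT = ("I'm competing in a CTF. Find me an exploitable vulnerability "
          "in this project. Start with {file}")

def carlini_loop(repo, ask_model, suffixes=(".c", ".cc", ".py")):
    findings = []
    for f in sorted(Path(repo).rglob("*")):
        if f.suffix in suffixes:
            answer = ask_model(PROMPT.format(file=f))
            findings.append((str(f), answer))   # still needs manual validation
    return findings
```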

On the defense side, Cartography, now a CNCF sandbox project, pulls infrastructure assets and their relationships into a Neo4j graph database across 30+ platforms including AWS, GCP, Azure, Kubernetes, GitHub, Okta, and Entra ID. The questions it answers are exactly the ones defenders need when attack surfaces are being scanned by agents: which identities have access to which datastores across multiple tenants, what the network paths in and out are, which compute instances are exposed, and, newly relevant, what AI agents are running in production and what permissions they have. When every codebase can be scanned for $30, knowing your own attack surface in graph form is no longer optional. (more: https://github.com/cartography-cncf/cartography)
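
In practice those questions become graph queries. A sketch using the official Neo4j Python driver follows; the node labels and relationship pattern are illustrative, so check your Cartography schema version for the exact names:

```python
# "Which identities can reach which datastores?" as a Cypher query.
# AWSPrincipal, S3Bucket, and the relationship types are assumptions.
from neo4j import GraphDatabase

QUERY = """
MATCH (p:AWSPrincipal)-[:POLICY|STATEMENT*1..3]->(s:S3Bucket)
RETURN p.arn AS principal, collect(s.name) AS reachable_buckets
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY):
        print(record["principal"], "->", record["reachable_buckets"])
driver.close()
```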

Agent Reliability and the Verification Gap

Coding agents have a verification gap that nobody wants to name out loud. The agent says "fixed." The agent says "passes." The agent says "ready to merge." And there is no primitive in the stack that proves any of it. Christopher Royse built what he calls a Reality Loop: a Rust-based software repair engine that revokes the agent's ownership of its own success criteria entirely. The rules are unforgiving: success is defined solely by the SWE-bench Lite Docker evaluator placing the target instance in resolved_ids. The model (Gemma 4 E4B, BF16 GGUF) is loaded in-process via Rust bindings and cryptographically pinned, with SHA-256 hash, file size, and identity substrings all verified before a single token is generated. Tool calls are constrained by a GBNF grammar injected into the sampler chain, making malformed syntax mechanically impossible. Every persisted artifact undergoes byte-equal readback verification. Thirty-eight fail-closed gates mean any unmet precondition aborts with a typed error and no silent fallbacks. Running alongside is ME-JEPA, a Joint-Embedding Predictive Architecture that predicts next latent states and flags surprise events when observed reality diverges from predictions. The Reality Loop owns external truth; ME-JEPA owns internal calibration. Neither can claim success without the other's confirmation. (more: https://www.linkedin.com/posts/christopher-royse-b624b596_ai-machinelearning-llm-ugcPost-7457860171118071808-unTD)
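
The pinning gate is straightforward to picture. Re-expressed in Python for brevity (the Reality Loop itself is Rust), with the helper names and the identity-marker window as assumptions, the fail-closed shape looks like this:

```python
# Fail-closed model pinning: any mismatch raises a typed error, no fallbacks.
import hashlib
from pathlib import Path

class GateFailure(Exception):
    """Typed error: an unmet precondition aborts the run."""

def verify_model(path, expected_sha256, expected_size, identity_substrings):
    data = Path(path).read_bytes()
    if len(data) != expected_size:
        raise GateFailure(f"size {len(data)} != pinned {expected_size}")
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise GateFailure("SHA-256 mismatch: refusing to load model")
    head = data[:4096]                      # assumed window for identity markers
    for marker in identity_substrings:      # e.g. b"gemma" in the GGUF header
        if marker not in head:
            raise GateFailure(f"identity marker {marker!r} not found")
    return data  # only now may a single token be generated
```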

The double-execution problem shows up wherever agents touch side-effecting operations. The llm-nano-vm project demonstrates the failure mode cleanly: in a refund pipeline where an LLM decides retry logic, stateless agents produced double refunds in ~20% of benchmark runs (605 out of 3,000 across three independent trials). The FSM Runtime fix wraps execution in an append-only trace and runs invariant checks before every side-effecting call, so the second refund simply cannot execute regardless of what the model decides. Zero double refunds across all 3,000 runs, at the cost of 2x tokens and ~4x latency. The trace scanning overhead is O(N) per call, with O(N²) growth in long-running agents, but for payment flows the structural guarantee versus the ~20% error rate is not a close call. (more: https://www.reddit.com/r/learnmachinelearning/comments/1t06ige/stateless_llm_agents_cause_20_doublerefunds_in/)
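
The structural fix is small enough to show whole: an append-only trace plus an invariant check before every side-effecting call. The names below are illustrative, not llm-nano-vm's API:

```python
# Append-only trace + pre-call invariant check; illustrative names.
class InvariantViolation(Exception):
    pass

class FSMRuntime:
    def __init__(self):
        self.trace = []  # append-only, never mutated in place

    def effect(self, op, key, fn):
        # O(N) scan per call: has this exact side effect already run?
        if (op, key) in self.trace:
            raise InvariantViolation(f"{op} already executed for {key}")
        result = fn()                # the real call, e.g. the refund API
        self.trace.append((op, key))
        return result

rt = FSMRuntime()
rt.effect("refund", "order-123", lambda: "refund issued")
# A model-decided retry hits the invariant, not the payment API:
# rt.effect("refund", "order-123", ...)  -> InvariantViolation
```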

HALO (Hierarchical Agent Loop Optimization) from Context Labs takes a complementary approach: rather than constraining the agent at runtime, it improves the harness the agent runs inside. HALO collects OpenTelemetry-compatible execution traces, feeds them into a specialized Reasoning Language Model that identifies systemic failure patterns across runs, generates a diagnostic report, and pipes that into a coding agent to produce harness patches. The cycle repeats. On the AppWorld benchmark, HALO drove Sonnet 4.6 from 73.7% to 89.5% on dev tasks and from 62.5% to 73.2% on held-out test tasks, improvements that generalized rather than overfit. The key insight: general-purpose harnesses like Claude Code overfit to individual trace errors, while a specialized RLM generalizes to harness-level problems. (more: https://github.com/context-labs/HALO)
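
Only the loop's shape is given in the description, but it is compact enough to sketch; collect_traces, diagnose, and patch_harness stand in for components the repo provides:

```python
# The HALO cycle as a shape sketch; the three callables are placeholders.
def halo_cycle(harness, benchmark, collect_traces, diagnose, patch_harness, rounds=3):
    for _ in range(rounds):
        traces = collect_traces(harness, benchmark)   # OpenTelemetry-compatible traces
        report = diagnose(traces)                     # RLM finds systemic failure patterns
        harness = patch_harness(harness, report)      # coding agent patches the harness
    return harness
```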

Inside the Agent Loop

A research team from MBZUAI published what appears to be the first academic architecture analysis of Claude Code, based on source-level examination of the TypeScript codebase (v2.1.88). The paper identifies five human values driving the architecture (human decision authority, safety, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core agent loop is a simple while-true cycle: call the model, run tools, repeat. But most of the engineering lives in the surrounding infrastructure: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (Model Context Protocol, plugins, skills, and hooks), subagent delegation, and append-oriented session storage. The paper contrasts this with OpenClaw, a multi-channel personal assistant gateway, showing how the same design questions produce different answers depending on deployment context: per-action safety evaluation versus perimeter-level access control, a single CLI loop versus an embedded gateway runtime. The authors also surface an uncomfortable finding: while the architecture amplifies short-term capability, it offers limited mechanisms that explicitly support long-term human skill preservation, citing research showing developers in AI-assisted conditions score 17% lower on comprehension tests. (more: https://arxiv.org/pdf/2604.14228)
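
Stripped of that surrounding infrastructure, the loop the paper describes fits in a dozen lines. call_model and run_tool below are placeholders, not Claude Code's internals:

```python
# The while-true core: call the model, run tools, repeat until no tool calls.
def agent_loop(messages, call_model, run_tool):
    while True:
        reply = call_model(messages)            # one model turn
        messages.append(reply)
        if not reply.get("tool_calls"):         # plain answer: the loop ends
            return reply
        for call in reply["tool_calls"]:
            result = run_tool(call)             # permissions, sandboxing, etc. live here
            messages.append({"role": "tool", "content": result})
```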

A window into OpenAI's approach came via GPT-5.5 leaking its chain of thought during a Codex session. The leaked output reads like telegraphic notes: "Implemented the narrower fix in Homm3ImportUnitPreviewModelHook.cs? Need absolute path. Need know cwd absolute. v:... Use markdown. final with path." Community consensus landed on two explanations: either GPT-5.5's reasoning has been RL-tuned for extreme terseness to minimize token spend (the "caveman mode" hypothesis, which originated as a prompting trick on r/LocalLLaMA months earlier), or a smaller summarization model is compressing the chain of thought before it reaches the main model. Either way, the production economics are visible: shorter reasoning chains mean faster inference and lower cost, even when the occasional leak reveals the machinery behind the curtain. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t27wja/gpt_55_just_leaked_its_chain_of_thought_to_me_in/)

Agentic Commerce and the Messy Middle

Cloudflare and Stripe announced a protocol that lets agents autonomously create Cloudflare accounts, start paid subscriptions, register domains, and receive API tokens to deploy code: zero to production without a single dashboard visit. The protocol has three components: discovery (agents query a catalog of available services), authorization (Stripe attests to user identity, and Cloudflare provisions an account automatically if none exists), and payment (Stripe includes a tokenized payment method with a default $100/month spend cap per provider). Raw credit card numbers never touch the agent. Any platform with signed-in users can act as the orchestrator, playing the same role Stripe does. This starts to standardize the cross-product integrations that platforms have been building in bespoke, one-off ways for years; much as OAuth standardized delegated access, this protocol extends the pattern into payments and account creation, with agents as first-class participants. (more: https://blog.cloudflare.com/agents-stripe-projects/)
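
As a rough mental model, the three phases might compose like the sketch below. Every name is invented for illustration; none of this is the published API:

```python
# Hypothetical client-side flow for discovery -> authorization -> payment.
def provision_via_agent(orchestrator, user, provider="cloudflare"):
    services = orchestrator.discover(provider)         # 1. query the service catalog
    grant = orchestrator.authorize(user, provider)     # 2. Stripe attests identity;
                                                       #    account created if none exists
    payment = grant.payment_token(cap_usd_month=100)   # 3. tokenized method with spend cap;
                                                       #    raw card never reaches the agent
    return services, grant.api_token, payment
```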

The practical question of what happens when an agent session dies mid-task gets a clean answer from agent-session-resume, a reusable skill for continuing work across AI coding agent sessions (Claude Code, Codex, Antigravity, OpenCode). Instead of asking the next agent to guess what happened, the skill forces it to produce a handoff checkpoint first (the prior goal, what is done, what is open, and the next action), then classifies each task as DONE, PARTIALLY DONE, or NOT DONE before proceeding. It is the kind of mundane infrastructure primitive that saves hours of duplicated work and confused context. (more: https://github.com/hacktivist123/agent-session-resume)
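
A checkpoint along those lines can be modeled as a small record; the field names below paraphrase the README's description rather than quote its exact schema:

```python
# Hypothetical handoff checkpoint record; field names are paraphrased.
from dataclasses import dataclass, field

STATUSES = ("DONE", "PARTIALLY DONE", "NOT DONE")

@dataclass
class Handoff:
    goal: str                                        # what the prior session was doing
    done: list = field(default_factory=list)         # completed work
    open: list = field(default_factory=list)         # unresolved items
    next_action: str = ""                            # where the new session starts
    task_status: dict = field(default_factory=dict)  # task -> one of STATUSES
```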

But tool access is only half the adoption story. Robert Glaser argues that individual AI productivity gains do not automatically become organizational gains, and most companies are now in the "messy middle": AI use is everywhere, uneven, partially hidden, and not connected to organizational learning. One team uses Copilot as autocomplete; another runs Claude Code in tight loops with tests and reviews. A senior engineer delegates root-cause analysis to an agent and gets a valid solution in an hour instead of two weeks; a junior person produces polished code without understanding the architectural assumptions smuggled in. Glaser's proposed solution centers on "Loop Intelligence": instrumenting which AI-assisted workflows actually produce organizational learning versus just more output. The metric that matters is not token-to-output but token-to-learning: which decisions improved, which root-cause analyses got sharper, which product ideas were killed earlier because a prototype made the weakness obvious. The warning is sobering: if people believe the organization is measuring whether they use enough AI, they will game the signals, hide experiments, and keep their best workflows private, the worst possible version of adoption. (more: https://www.robert-glaser.de/when-everyone-has-ai-and-the-company-still-learns-nothing/)

Benchmarks Under Siege

HalluHard, a new multi-turn hallucination benchmark from ELLIS Institute Tübingen, was designed specifically because existing benchmarks are too easy, too narrow, and too gameable. It features 950 seed questions across legal cases, research, medical guidelines, and coding, with a verification pipeline that extracts claims, retrieves evidence via web search, and fetches full-text sources including PDFs. The findings are uncomfortable: even the strongest configurations (Claude-Opus-4.5 and GPT-5.2 with web search enabled) maintain hallucination rates around 30%. Models hallucinate more in later conversation turns as they condition on their own earlier mistakes, with 3-20% of incorrect references reappearing. A "dangerous middle zone" emerges where models encounter niche facts with partial training traces: the questions feel answerable, so models fill in missing specifics rather than abstaining. Content-grounding failures (claims unsupported by cited sources) are far more common than reference-grounding failures (citing nonexistent sources), and while web search reduces reference errors, it does not solve the content problem. (more: https://halluhard.com)
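
The pipeline's shape, with the heavy components stubbed out, also shows where the two failure categories diverge; all helpers below are placeholders for the benchmark's real machinery:

```python
# Claim-level verification: reference failures vs content failures.
def verify_answer(answer, extract_claims, web_search, fetch_fulltext, supports):
    graded = []
    for claim in extract_claims(answer):
        sources = [fetch_fulltext(hit) for hit in web_search(claim)]  # incl. PDFs
        if not sources:
            graded.append((claim, "reference_failure"))  # cites nothing real
        elif not any(supports(src, claim) for src in sources):
            graded.append((claim, "content_failure"))    # cited but unsupported
        else:
            graded.append((claim, "grounded"))
    return graded
```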

Facebook Research's ProgramBench takes a different approach to cheat-proof evaluation: give the agent only a target executable and some usage documentation, then ask it to rebuild the program from scratch, choosing language, designing abstractions, architecting the entire system. No internet access, no decompilation, 200 tasks with 6 million lines of behavioral tests filtered for quality. The results are humbling. Closed-source models top the leaderboard but still fail on the majority of tasks, and open-source models fare worse: they tend to be overfitted to patterns like SWE-bench rather than genuinely capable of end-to-end program synthesis. Agents almost universally declared they were done and submitted, even when the programs were broken. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t4j4s9/programbench_can_we_really_rebuild_huge_binaries/)

The Open ASR Leaderboard is taking a more pragmatic approach to the benchmaxxing problem: adding private evaluation datasets from Appen and DataoceanAI that models cannot train on. The leaderboard now includes separate scripted, conversational, US-accent, and non-US-accent metrics, but intentionally withholds per-split scores to prevent targeted optimization. Private datasets are excluded from the default average WER, but toggling them on reveals how much benchmaxxing inflates public-set performance. When a measure becomes a target, it ceases to be a good measure, and these three projects represent the first serious engineering efforts to build Goodhart-resistant evaluation infrastructure. (more: https://huggingface.co/blog/open-asr-leaderboard-private-data)
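
The toggle is ultimately a question of which splits enter the average. A toy version with invented numbers shows how the headline WER can move:

```python
# Default average excludes private splits; toggling them in exposes the gap.
def average_wer(split_wers, include_private=False):
    rows = [wer for wer, private in split_wers.values()
            if include_private or not private]
    return sum(rows) / len(rows)

splits = {"librispeech": (2.1, False), "earnings22": (11.3, False),
          "appen_private": (16.8, True), "dataocean_private": (14.2, True)}
print(average_wer(splits))                         # public-only headline number
print(average_wer(splits, include_private=True))   # the benchmaxxing-resistant view
```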

The Open-Weight Race

The perennial question of how far open-weight models lag the frontier is getting more nuanced answers. Community consensus places the gap at 6-8 months for models that fit on consumer hardware (~27-35B parameters), though the picture depends heavily on what "open source" means. Models like DeepSeek V4, Kimi 2.6, and MiMo 2.5 Pro, which are open-weight but require datacenter hardware, are roughly on par with last generation's frontier (Opus 4.5/GPT-5.3 level). For the ~30B class that runs on consumer GPUs, Qwen3.6-35B-A3B with its gated DeltaNet hybrid attention stands out: slightly better than Claude Haiku 4.5 on benchmarks, and efficient thanks to its MoE architecture activating only 3B of 35B parameters per forward pass. The catch: hardware requirements keep climbing while prices remain high, and the gap that matters most to practitioners is less about benchmarks and more about whether a model crosses the usefulness threshold for their specific application. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t2gu2h/does_the_6_months_gap_still_hold/)

A head-to-head comparison put Claude Code running Opus 4.7 against OpenCode running Qwen3.6:27b on a single-prompt task: build a playable cozy roguelite. Both shipped working games. Opus used more generated tokens despite being tuned for conciseness, possibly due to extended reasoning overhead. Reproduction attempts showed the A3B MoE variant producing cleaner output than the dense 27B, likely because a self-correction pass caught missing parameters before submission. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t4vb57/claude_code_opus_47_vs_opencode_qwen3627b_both/)

For practitioners going deeper than running open weights, a DeltaNet LoRA experiment on Qwen3.6-35B-A3B exposed a practical trap: 75% of the model's layers use gated DeltaNet (linear attention) rather than standard self-attention, so every LoRA tutorial targeting q_proj/k_proj/v_proj matches almost nothing. The correct targets are linear_attn.in_proj_qkv and linear_attn.in_proj_z (see the config sketch below). The full pipeline (generate ~2,000 coding samples at temp=1.6, filter to the 1,796 that compile, train LoRA at r=16 on a Modal H200 for ~$6, merge for ~$1) cost $7 total. Results were inconclusive (128/130 vs 126/130 base), but the pipeline works, and the DeltaNet LoRA targeting knowledge fills a gap that cost the author days of debugging. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t5zndd/finetuned_qwen3635ba3b_deltanet_experiment/)

Zyphra's ZAYA1-8B joins the growing roster of efficient small models, though documentation remains sparse at time of writing. (more: https://huggingface.co/Zyphra/ZAYA1-8B)
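
The targeting fix translates to a few lines of peft configuration. The module names follow the post; lora_alpha is an assumption, since the post only states r=16:

```python
# LoRA aimed at gated DeltaNet projections instead of q/k/v.
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,                   # assumed; the post only gives r
    target_modules=[
        "linear_attn.in_proj_qkv",   # DeltaNet layers: 75% of the model
        "linear_attn.in_proj_z",
    ],
    task_type="CAUSAL_LM",
)
```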

Local AI Beyond the Chat Window

OmniVoice is generating the kind of excitement that usually precedes disappointment but might be earned this time: one-shot voice cloning built on Qwen 3 as its base model, running locally on Apple Silicon with results that users are calling comparable to ElevenLabs. The community is already producing Tobias Fünke reading rap about kittens, which is roughly the canonical first test for any voice synthesis tool. Whether quality holds on sample sequences longer than 3-10 seconds, where the character of a voice depends on specific word pronunciations, remains the open question. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t4rst5/i_know_this_isnt_technically_an_llm_but_omnivoice/)

A community-curated guide to local AI tools beyond standard LLM chat reveals how broad the ecosystem has grown. Applio handles voice-to-voice translation. Ultimate-TTS-Studio converts text to audio using locally-running models. Meetily provides real-time closed captioning, which is oddly hard to find despite being an obvious STT application. Spotify's basic-pitch converts audio to MIDI. The guide challenges the Whisper default: Parakeet 0.6B, VibeVoice, and CohereTranscribe are more accurate, hallucinate less, and run faster for English, since Whisper's YouTube training data introduces hallucinations on clean audio. Discovery remains the real bottleneck: no equivalent of Ollama exists for audio production pipelines. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t4vmgu/common_and_obscure_models_and_ways_to_find_them/)

Underneath the application layer, ruvllm_sparse_attention implements O(N log N) sparse attention in pure Rust for edge LLM inference on Raspberry Pi 5 + Hailo-10H clusters. The design uses four candidate families (local window, global anchors, log-stride coverage, and block landmarks) to approximate full attention while visiting less than 0.9% of pairs at 32K tokens; a sketch of the candidate-set construction follows below. With FP16 KV cache and GQA support, Mistral-7B fits in the Hailo-10H's 8GB DDR4 with room to spare (536MB KV at seq=8192), and rayon parallelism delivers ~4x prefill throughput across the Pi 5's four Cortex-A76 cores. (more: https://gist.github.com/ruvnet/7736317d1311a83137a39e804d7868ea)

Meanwhile, Bun, the JavaScript runtime built on Zig's zero-cost abstractions, is being ported to Rust, signaling that even Zig's strongest advocates are feeling the gravitational pull of Rust's ecosystem, tooling, and contributor pipeline. (more: https://github.com/oven-sh/bun/commit/46d3bc29f270fa881dd5730ef1549e88407701a5)
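
Here is that candidate-set construction re-expressed in Python for clarity (the implementation itself is Rust); window size, anchor count, and block size are illustrative defaults:

```python
# Four candidate families for one query position q; parameters are assumed.
def candidate_keys(q, window=128, anchors=16, block=256):
    cand = set(range(max(0, q - window), q + 1))   # local window
    cand.update(range(min(anchors, q + 1)))        # global anchors (prefix tokens)
    step = 1
    while q - step >= 0:                           # log-stride coverage
        cand.add(q - step)
        step *= 2
    cand.update(range(0, q + 1, block))            # block landmarks
    return sorted(cand)

# At 32K tokens the last query attends to a few hundred keys:
q = 32768 - 1
print(len(candidate_keys(q)) / 32768)  # ~0.0085, i.e. under 0.9% of pairs
```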

Sources (22 articles)

  1. [Editorial] (linkedin.com)
  2. [Editorial] (linkedin.com)
  3. [Editorial] (github.com)
  4. [Editorial] (linkedin.com)
  5. (reddit.com)
  6. context-labs/HALO (github.com)
  7. [Editorial] (arxiv.org)
  8. GPT 5.5 just leaked its chain of thought to me in codex, and it looks like an idea from 5 months ago in this sub. (reddit.com)
  9. Agents can now create Cloudflare accounts, buy domains, and deploy (blog.cloudflare.com)
  10. hacktivist123/agent-session-resume (github.com)
  11. When everyone has AI and the company still learns nothing (robert-glaser.de)
  12. [Editorial] (halluhard.com)
  13. ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) (reddit.com)
  14. Adding Benchmaxxer Repellant to the Open ASR Leaderboard (huggingface.co)
  15. Does the "6 months gap" still hold? (reddit.com)
  16. Claude Code @ Opus 4.7 vs OpenCode @ qwen3.6:27b. Both shipped a playable cozy roguelite. (reddit.com)
  17. Fine-tuned Qwen3.6-35B-A3B DeltaNet experiment (reddit.com)
  18. Zyphra/ZAYA1-8B (huggingface.co)
  19. I know this isn't technically an LLM but OmniVoice is FUCKING AMAZING. (reddit.com)
  20. Common and Obscure Models and Ways to Find Them [ Human Written ] (reddit.com)
  21. [Editorial] (gist.github.com)
  22. Bun is being ported from Zig to Rust (github.com)