AI-Powered Offensive Security Hits the Institutional Phase

Published on

Today's AI news: AI-Powered Offensive Security Hits the Institutional Phase, The Attack Surface You Installed Yourself, Safety Alignment Is One Neuron Deep, When Humans Are the Vulnerability, AMD's PCIe Play and the Training Frontier, The Open-Source AI Stack Diversifies. 22 sources curated from across the web.

AI-Powered Offensive Security Hits the Institutional Phase

The question of whether AI can find real vulnerabilities has been answered — the field now has over 800 verified findings and 200+ assigned CVEs across heavily scrutinized codebases. The question that matters now is how to evaluate, measure, and standardize AI-driven security work. XBOW's evaluation of Anthropic's Mythos Preview model marks a significant data point in that transition. Across XBOW's web exploit benchmark, Mythos Preview cut false negatives by 42% compared to Opus 4.6, and by 55% when given source code access. Token-for-token, it "hones in on the vulnerability with absolutely unprecedented precision," according to XBOW's team of ten security experts. The model showed particular strength in native-code vulnerability discovery and reverse engineering, reasoning through unusual firmware and embedded systems contexts that required more than rote pattern matching. (more: https://xbow.com/blog/mythos-offensive-security-xbow-evaluation)

But XBOW's evaluation also surfaced important limitations. Mythos Preview's judgment was "mixed" — it rejected false positives well but sometimes lost true positives when evidence didn't formally satisfy its criteria. On command safety benchmarks, it scored 77.8% accuracy compared to Haiku 4.5's 90.1%, prioritizing the letter of rules over their spirit. And when normalized by estimated cost — Mythos Preview is expected to be 5x the price of Opus — the cost-effectiveness picture shifts significantly. XBOW's data suggests that in many cases, giving a cheaper model more iterations beats giving Mythos Preview one expensive shot. The real value proposition, XBOW argues, is combining Mythos's source-code analysis power with live-site validation infrastructure rather than treating it as a standalone silver bullet.

On the audit framework side, SPECA introduces specification-anchored security auditing — deriving typed security properties from natural-language specifications rather than scanning for known bug patterns. In the Sherlock Ethereum Fusaka audit contest (366 submissions, 10 implementations), SPECA recovered all 15 in-scope vulnerabilities and discovered 4 novel bugs confirmed by developer fix commits, including a cryptographic invariant violation missed by every human auditor in the contest. On the RepoAudit C/C++ benchmark, it matched the best published precision at 88.9% while surfacing 12 author-validated candidates beyond ground truth. Crucially, all false positives traced to three interpretable root causes mapped to specific pipeline phases — not the usual opaque "the model thought this was a bug." (more: https://github.com/NyxFoundation/speca)

Whether any of this measurement actually means anything is the question the Berryville Institute of Machine Learning is raising. BIML's new paper, "No Security Meter for AI," argues that AI benchmarks are "fundamentally broken" — contaminated by their own publication, measuring narrow performance rather than actual capability. Co-author Gary McGraw draws on three decades of software security history: "In the late 1990s, pen testing was treated as a security meter, but it's actually just a badness-ometer that tops out at 'who knows?' Benchmark scores for AI security are even worse." Recent research confirms the concern — eight prominent AI agent benchmarks can be exploited to achieve near-perfect scores without completing a single task. (more: https://www.einnews.com/pr_news/912023996/new-biml-research-finds-critical-flaws-in-ai-security-measurement-false-confidence-in-benchmark-scores)

Meanwhile, the dnsmasq maintainer Simon Kelley is living the downstream reality of AI-powered vulnerability discovery. CERT released six CVEs for serious, long-standing dnsmasq bugs — all found in what Kelley describes as a "tsunami of AI-generated bug reports" that shows "no signs of stopping." He spent months triaging reports, weeding out duplicates ("so many duplicates!"), and making judgment calls about which bugs needed vendor pre-disclosure versus immediate public fix. His candid assessment: "given the number of times 'good guys' have found these bugs, there's no doubt that 'bad guys' have been able to do the same, so long embargoes seem kind of pointless." (more: https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2026q2/018471.html)

The Attack Surface You Installed Yourself

AI coding tools have become the most trusted environment on a developer's machine — and the most attractive target. A vulnerability in Claude Code, now fixed in version 2.1.118, demonstrated how that trust can be weaponized. Security researcher joernchen discovered that Claude Code's settings parser eagerly scanned the entire command line for flags, without distinguishing between actual CLI options and values passed as arguments to other flags. Combined with the deeplink handler for claude-cli:// URIs, this created a clean injection path: an attacker could craft a deeplink that injected arbitrary settings — including shell command execution via SessionStart hooks — into the spawned Claude Code instance. If the deeplink's repo parameter pointed to a repository the user had already trusted, execution happened without any warning prompt. (more: https://0day.click/recipe/2026-05-12-cc-rce)

The deeper systemic risk comes from what these tools accumulate during normal use. GhostType, a forensic scanner built by a red team practitioner, highlights the credential exposure problem hiding in plain sight across AI coding tools. Every Claude Code session is stored in plaintext — one conversation per file, human-readable, every message and tool call preserved. Cursor keeps data in SQLite databases that can grow beyond 1 GB. Codex stores state in plaintext JSONL rollout files. A GitHub issue documents a researcher finding five distinct secrets across 34 session files after just 30 days of normal usage. GhostType scans these histories with 30+ regex patterns and heuristic detection, extracting AWS keys, GitHub PATs, Stripe secrets, database connection strings, and more. The kill chain is almost boring: get local access, read the files, use the credentials. No exploitation required — you just log in. (more: https://betheadversary.com/posts/ghosttype)

The irony is pointed: the tools developers use to write more secure code are quietly accumulating a record of every credential ever pasted into a session. No encryption, no ACLs beyond standard filesystem permissions, no canary mechanism, no detection. The assumption seems to be that local access means game over anyway — but that assumption breaks when the files contain credentials that reach far outside the local machine into cloud environments, CI/CD pipelines, and production databases.

Safety Alignment Is One Neuron Deep

A prevailing assumption in AI safety is that alignment emerges from broad reorganization of model weights — that safety is distributed diffusely enough that no single component is decisive. An Apple research team just demonstrated that assumption is wrong. Across seven models spanning two families (Qwen3 and Llama-3.1) and scales from 1.7B to 70B parameters, they show that suppressing a single MLP neuron — one unit out of hundreds of thousands to over two million — is sufficient to bypass safety alignment entirely, achieving a 91.7% average attack success rate on JailbreakBench. No training, no fine-tuning, no prompt engineering required. (more: https://arxiv.org/pdf/2605.08513)

The paper identifies two mechanistically distinct systems: "refusal neurons" that gate whether harmful knowledge is expressed, and "concept neurons" that encode the harmful knowledge itself. Suppressing any one identified refusal neuron bypasses safety across diverse harmful requests. As a proof of concept, amplifying a single "suicide neuron" causes models to inject suicide-related content into otherwise innocent prompts across multiple scales. Perhaps most concerning: these refusal neurons exist in base models prior to alignment training, suggesting alignment modulates existing neurons rather than creating safety mechanisms from scratch. A single refusal neuron's activation also serves as an effective harmful-prompt detector, achieving AUROC comparable to Llama-Guard-3-8B — a dedicated safety classifier. Safety, it turns out, concentrates into single points of failure at both the gate and the knowledge layer.

This mechanistic fragility connects to a broader theoretical argument Andriy Burkov has been making about the fundamental limits of LLM-based agents. An LLM optimizes next-token probability, not expected utility for the user — those are different objectives. For narrow, constrained tasks, the imitation of rational deliberation can be good enough. For general-purpose problem solving, the gap becomes fatal: LLMs lack stable preferences, calibrated beliefs, and causal models of the world. "They are not failing because the prompts are insufficiently clever," Burkov writes. "They are failing because we are asking a simulator of rational agency to be a rational agent." Community pushback notes humans aren't expected-utility maximizers either — but the distinction matters when we hand these systems destructive capabilities. (more: https://www.linkedin.com/posts/andriyburkov_if-you-dont-understand-this-you-will-not-activity-7454736391957221376-MR83)

On the constructive side, new research on Risk-sensitive Alignment via Dominance (RAD) proposes replacing the scalar expected-cost constraints in standard Safe RLHF with first-order stochastic dominance constraints on the full cost distribution. Rather than just constraining the average harm, RAD ensures the learned policy assigns less probability to high-cost outcomes across the entire distribution. The framework uses optimal transport theory with Sinkhorn iterations for efficient optimization, and introduces quantile-weighted dominance constraints that provide universal control over spectral risk measures — enabling practitioners to tune risk profiles for specific deployment contexts. Empirically, RAD-aligned models produced higher proportions of safe responses and showed stronger robustness on out-of-distribution harmlessness evaluations than standard Safe RLHF baselines. (more: https://arxiv.org/abs/2603.10938v1)

When Humans Are the Vulnerability

The Akhter twin brothers didn't need a zero-day. Muneeb and Sohaib, both 34, had already served prison time for wire fraud involving computers. After their release, they worked their way back into tech — both landing jobs at the same DC firm selling software and services to 45 federal clients. When they were fired via Teams call on February 18, 2025, at 4:50 PM, Sohaib's VPN access was terminated. Muneeb's was not. By 4:56 PM, Muneeb was issuing database commands. By 4:58 PM: "DROP DATABASE dhsproddb." At 4:59 PM, he asked an AI tool how to clear SQL Server system logs. Within an hour, 96 databases containing US government information were gone, 1,805 EEOC files were on a USB drive, and federal tax information for 450+ people was exfiltrated. (more: https://arstechnica.com/tech-policy/2026/05/drop-database-what-not-to-do-after-losing-an-it-job/)

The brothers' running commentary during the destruction is darkly instructive. "I see you cleaning out their database backups," Sohaib observed. "Alright — if you have good plausible deniability." When Sohaib suggested blackmail, Muneeb showed more sense about evidence than about ethics: "No, you do not do that, that's proof of guilt, man." Then they reinstalled their laptop operating systems with help from a co-conspirator. The feds raided three weeks later, finding seven firearms and 370 rounds of ammunition at Sohaib's home — all illegal given his prior convictions. Muneeb pled guilty in April 2026; Sohaib went to trial and was convicted. The employer, later identified as Opexus, acknowledged that "the terminations were not handled in an appropriate manner" and that "additional diligence should have been applied" in hiring. The failure was total: background checks that missed prior convictions, access controls that survived only half a termination, and no monitoring of a known-risk employee's database activity.

The Ghost SIM attack presented at Black Hat SecTor demonstrates a different kind of access control failure — this time at the mobile network level. Researchers showed that by extracting essential SIM card information (IMSI, file system data) without needing the secret key (Ki), and copying it to a programmable SIM card, an attacker can successfully register on a mobile network if no authentication procedure is triggered. Testing across seven operators in multiple European and Asian countries revealed vulnerabilities spanning 2G through 5G, with most operators failing to authenticate on initial registration in at least one network domain. Post-exploitation capabilities include receiving SMS messages — potentially bypassing 2FA for messaging applications. The attack requires physical access to an unlocked phone or SIM card, but the researchers documented multiple known Android vulnerabilities (lock screen bypasses, undocumented AT commands) that can facilitate that access. (more: https://i.blackhat.com/SecTor-2025/Sector-25-Cabrera-Gallego-Ghost-SIM-Attack.pdf)

AMD's PCIe Play and the Training Frontier

AMD introduced the Instinct MI350P, bringing CDNA 4 architecture to PCIe form factor with configurations offering 144GB or 288GB of HBM3E memory and 3.6 TB/s memory bandwidth. No pricing or availability yet, but the 288GB configuration represents a significant memory density advantage for inference-heavy workloads where VRAM is the binding constraint. The community reaction was predictable enthusiasm tempered by wallet anxiety — "I wish I can afford 4x 288G cards to run TB models locally" — but the real significance is AMD continuing to fill its accelerator lineup across form factors. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t6b2x8/amd_intros_instinct_mi350p_accelerator_cdna_4/)

Whether AMD hardware actually works for real ML workloads remains the perennial question, and a hackathon project provides a concrete data point. MedQA demonstrates LoRA fine-tuning of Qwen3-1.7B on medical question-answering data using an AMD Instinct MI300X, with the entire HuggingFace ecosystem — Transformers, PEFT, TRL, Accelerate — running on ROCm with exactly three environment variables changed from a CUDA setup. The 192GB HBM3 eliminated the need for any quantization workarounds. Training took five minutes on 2,000 samples, and the resulting model produces both answer selections and clinical reasoning explanations. The war stories section is instructive: bfloat16 caused NaN loss (fixed by switching to fp16), and no ROCm build of bitsandbytes exists (irrelevant given the MI300X's memory headroom). (more: https://huggingface.co/blog/lablab-ai-amd-developer-hackathon/medqa)

On the optimizer front, STAM (Stable Training with Adaptive Momentum) proposes making Adam's β1 parameter runtime-adaptive based on gradient residual statistics rather than leaving it at the universal default of 0.9. The core idea — that β1 should respond to training dynamics rather than sitting fixed — is directionally sound. But rigorous community analysis found significant methodological problems: the signal STAM computes is actually a kurtosis proxy (not variance), the control law conflates three distinct gradient regimes that require opposite responses, and the evaluation uses single-seed MNIST/CIFAR runs at 10-80 update steps against dated baselines (Adam 2014, RMSProp 2012), omitting Lion, Muon, Sophia, and other competitive 2024-2026 optimizers. Prior benchmarking has established that supposed 2x gains of alternative optimizers over AdamW tend to shrink to roughly 1x at billion-parameter scales. The durable contribution is identifying that β1=0.9 has no principled default — a question worth its own research cycle, separated from STAM's specific implementation. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t6yra2/a_new_generation_of_ai_models_and_one_of_the_most/)

The Open-Source AI Stack Diversifies

Anthropic quietly restructured Claude Code's programmatic access model, replacing fuzzy usage limits with explicit monthly credits for SDK, claude -p, GitHub Actions, and third-party orchestration workflows. Power users who had been running recursive agent loops, CI/CD automation, and background pipelines under the old regime — sometimes extracting thousands of dollars in effective compute from a $200 plan — are not pleased. "This isn't expanded SDK support. It's a rate limiter wearing a party hat," wrote one critic. The structural point is hard to argue with: every consumer-priced infrastructure tool that ends up subsidizing commercial automation eventually meters it. The question is whether this pushes more developers toward vendor-neutral harnesses and open-weight models. (more: https://www.linkedin.com/posts/reuvencohen_wtf-anthropic-just-quietly-nerfed-the-single-share-7460493837770092544-jvwZ). Between Anthropic's rushed and buggy new features, caps on use, price changes, and consistent ill-timed outages, it's not surprising that that codex, gemini, and open-weight models are seeing a surge in usage.

The open-source RAG space continues maturing. ForgeRAG combines BM25 + vector search for candidate retrieval, knowledge graph multi-hop for conceptual connections, and LLM tree navigation for structure-aware reasoning — fused into answers with pixel-precise, traceable citations. On the UltraDomain benchmark it achieves a 55.5% overall win rate against LightRAG. The key architectural distinction: where graph-based approaches like LightRAG synthesize answers from knowledge graph summaries (higher hallucination risk), ForgeRAG grounds every claim in original source text with exact page and bounding-box references. It supports multiple backends (SQLite/PostgreSQL, ChromaDB/Qdrant/Milvus, NetworkX/Neo4j) and any LiteLLM-compatible provider. (more: https://github.com/deeplethe/ForgeRAG)

AI-generated video is crossing from novelty to practical utility. Hyperframes, an open-source pipeline combining Claude Code for scripting and composition, HyperFrames for HTML/GSAP rendering, and either Kokoro (free, local TTS) or ElevenLabs (paid), can produce polished vertical YouTube Shorts from a single topic prompt. The workflow researches the topic, drafts paced narration with SSML break tags, generates TTS with word-level timestamps, edits an HTML composition, lints it, and opens a browser preview — all in 20-30 minutes. Three templates ship by default, and creating custom brand templates is a documented process. The output isn't perfect — voice inflection can be awkward, transitions occasionally stiff — but for explainers, team updates, and community content, it's approaching practical utility. (more: https://github.com/coleam00/hyperframes-ai-video-generation) (more: https://m.youtube.com/watch?v=Ya51a1EJPZk)

Computron offers a self-hosted AI assistant running on Ollama in a single Docker container with full Google Workspace integration — Gmail, Calendar, Drive, Contacts — through natural conversation. The security model is notable: you create your own OAuth app, all credentials are encrypted at rest with AES-256-GCM, and the agent process never sees raw credentials — they're decrypted only by the integration supervisor and injected into isolated broker processes. Permissions are granular per capability. (more: https://www.reddit.com/r/ollama/comments/1t7quky/computron_selfhosted_ai_assistant_now_with_full/) Separately, an Italian team is building open-source Microsoft Office extensions for Open WebUI — AI-assisted document creation in Word, spreadsheet automation in Excel, presentation editing in PowerPoint, and email drafting in Outlook — all designed to keep everything open and compatible with the self-hosted ecosystem. (more: https://www.reddit.com/r/OpenWebUI/comments/1tc7ee1/opensource_microsoft_office_extensions_for_open/)

The sheer volume of open-source AI activity is visible in the numbers: new GGUF uploads on Hugging Face nearly doubled in two months. Community reaction is mixed — "most of them are vibe-tuned slop" and "they all look like UUIDs now" — with calls for better filtering and discoverability. The real concern is sustainability: rapid upload growth eventually means someone pays for storage, and the odds favor users over corporations. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t9zmbb/new_gguf_uploads_on_hf_nearly_doubled_in_2_months/) Meanwhile, the AI regulation debate continues its predictable pattern: US industry lobbyists argue any regulation means losing to China, while China's official position states "safety first, innovation second." Community skepticism runs in both directions — dismissing China's statements as propaganda while noting the US just struck a deal to review AI models for national security before public release, prompting comparisons to the very censorship the industry claims to oppose. (more: https://www.reddit.com/r/OpenAI/comments/1t92wpg/big_ai_lobbyists_if_you_regulate_us_at_all_we/)

Sources (22 articles)

  1. [Editorial] XBOW Mythos Offensive Security Evaluation (xbow.com)
  2. SPECA: Specification-to-Checklist Agentic Auditing Framework (github.com)
  3. [Editorial] BIML Research Finds Critical Flaws in AI Security Measurement (einnews.com)
  4. CERT Releases Six CVEs for Serious Dnsmasq Vulnerabilities (lists.thekelleys.org.uk)
  5. [Editorial] Claude Code RCE Vulnerability (0day.click)
  6. [Editorial] GhostType: Adversarial AI Attack Technique (betheadversary.com)
  7. [Editorial] Arxiv Research Paper 2605.08513 (arxiv.org)
  8. [Editorial] Andrey Burkov on Fundamental ML Understanding (linkedin.com)
  9. Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control (arxiv.org)
  10. Twin Brothers Wipe 96 Government Databases Minutes After Being Fired (arstechnica.com)
  11. [Editorial] Ghost SIM Attack — Black Hat SecTor Research (i.blackhat.com)
  12. AMD Introduces Instinct MI350P Accelerator: CDNA 4 Comes to PCIe Cards (reddit.com)
  13. MedQA: Fine-Tuning a Clinical AI on AMD ROCm — No CUDA Required (huggingface.co)
  14. STAM: Stable Training with Adaptive Momentum — A New Optimizer for AI Training (reddit.com)
  15. [Editorial] Anthropic Quietly Nerfed Claude Code (linkedin.com)
  16. ForgeRAG: Production-Ready RAG with Structure-Aware Reasoning (github.com)
  17. [Editorial] Hyperframes: Open-Source AI Video Generation (github.com)
  18. [Editorial] Video Editorial Submission (m.youtube.com)
  19. Computron — Self-Hosted AI Assistant with Full Google Workspace Integration (reddit.com)
  20. Open-Source Microsoft Office Extensions for Open WebUI (reddit.com)
  21. GGUF Uploads on Hugging Face Nearly Doubled in 2 Months (reddit.com)
  22. AI Lobbyists Say Regulation Means Losing to China — But China Says 'Safety First, Innovation Second' (reddit.com)