AI Security Gets a Lab Coat
Today's AI news: AI Security Gets a Lab Coat, The Million-Token Era Arrives, Agent Plumbing Gets Real, Breaches, Buckets, and Backsliding, The Closed Loop: Agents Building Agents, The Profession That Didn't Exist Yet. 22 sources curated from across the web.
AI Security Gets a Lab Coat
Microsoft Research's CyberThreat-Eval benchmark lands at an interesting moment for anyone who has watched LLMs inch toward automating threat intelligence. Unlike previous CTI benchmarks that hand models multiple-choice quizzes or ask them to regurgitate CVE descriptions, CyberThreat-Eval is built from the daily workflow of a real threat intelligence team — triage, deep search, and report drafting — and scored by analysts, not ROUGE. The expert-annotated dataset uses metrics that measure factual accuracy, content quality, and operational cost. The results are sobering: LLMs still lack the nuanced expertise to handle complex details and struggle to distinguish correct from incorrect information, especially when synthesizing indicators of compromise across multiple sources. The benchmark incorporates a human-in-the-loop refinement stage called TRA, where analysts iteratively provide feedback, acknowledging that full automation of CTI remains aspirational. (more: https://arxiv.org/abs/2603.09452v1)
Volkan Kutal's LinkedIn post on white-box AI red teaming makes a complementary argument: black-box scanning is expensive fuzzing. His methodology has an AI coding agent read the target agent's full codebase — system prompts, guardrail configs, tool permissions, business logic — then generate attack objectives designed around the specific defenses. Microsoft's PyRIT executes the campaigns, and results feed back for iterative refinement. In one engagement, the code-aware pipeline discovered that a guardrail confidence threshold was set to 0.65, with attacks between 0.40 and 0.65 slipping through undetected — the kind of architecture-level finding that surface-level jailbreak probes never produce. Kutal notes the scanning market is consolidating fast (Promptfoo, Lakera, and SPLX all acquired), and argues the scanning layer is becoming a built-in feature while the expertise layer — understanding agent architectures, interpreting results in regulatory context — remains where judgment lives. (more: https://www.linkedin.com/posts/volkan-kutal_airedteaming-aisecurity-agenticai-activity-7438145743904743425-waML)
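The threshold gap is easy to picture with a minimal sketch (hypothetical code, not Kutal's pipeline): a guardrail that only blocks when a risk classifier's confidence clears 0.65 silently passes everything the classifier is merely suspicious about.

```python
# Hypothetical sketch of the threshold gap described above. The 0.65
# value is the guardrail config setting the engagement uncovered.
BLOCK_THRESHOLD = 0.65

def guardrail_verdict(risk_confidence: float) -> str:
    """Return the action a threshold-based guardrail would take."""
    if risk_confidence >= BLOCK_THRESHOLD:
        return "block"
    return "allow"  # everything below the threshold slips through

# Probes scoring between 0.40 and 0.65 are risky enough for the
# classifier to notice, yet still allowed -- the architecture-level
# blind spot a code-aware red team can target directly.
gap_probes = [0.45, 0.55, 0.64]
assert all(guardrail_verdict(p) == "allow" for p in gap_probes)
assert guardrail_verdict(0.70) == "block"
```

A black-box scanner can only discover this band by luck; reading the config makes it the first thing to try.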
Meanwhile, the nah project offers a pragmatic take on agent safety for Claude Code users. It intercepts every tool call through a PreToolUse hook and classifies it by what it actually does — structural command analysis, pipe composition, shell unwrapping — using deterministic rules that run in milliseconds before optionally routing ambiguous cases to an LLM. git push goes through; git push --force gets flagged. rm -rf __pycache__ is fine; rm ~/.bashrc is blocked. The system self-protects: edits targeting the ~/.claude/hooks/ directory are blocked outright, preventing the agent from disabling its own guardrails. Crucially, project-level .nah.yaml files can only tighten policies, never relax them — a supply-chain safety property that means a malicious repo cannot use configuration to allowlist dangerous commands. (more: https://github.com/manuelschipper/nah)
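The flavor of those deterministic rules can be sketched in a few lines (a minimal illustration, not nah's actual rule engine — its real implementation lives in the repo): parse the command structurally, then decide on the verb and its flags and targets rather than pattern-matching raw text.

```python
import shlex

# Minimal sketch (not nah's actual implementation) of deterministic,
# structural command classification: split the command into argv, then
# branch on the verb, its flags, and its targets.
def classify(command: str) -> str:
    argv = shlex.split(command)
    if argv[:2] == ["git", "push"]:
        return "flag" if "--force" in argv or "-f" in argv else "allow"
    if argv[:1] == ["rm"]:
        targets = [a for a in argv[1:] if not a.startswith("-")]
        # Block deletions reaching into the home directory or dotfiles.
        if any(t.startswith("~") or "/." in t for t in targets):
            return "block"
        return "allow"
    return "allow"

assert classify("git push") == "allow"
assert classify("git push --force") == "flag"
assert classify("rm -rf __pycache__") == "allow"
assert classify("rm ~/.bashrc") == "block"
```

Because rules like these are pure string and list operations, they run in microseconds, leaving the LLM round-trip only for the genuinely ambiguous residue.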
An interesting technical piece this week is the anatomy of Unicode ignorables — the 4,174 default-ignorable code points in Unicode 15.1 that render as nothing in most contexts but survive copy-paste, JSON serialization, database storage, and HTML transit to arrive intact at an LLM tokenizer. The tag block (U+E0000 through U+E007F) alone provides 128 code points that can each hide one ASCII character, enough to embed entire instruction payloads invisible to human reviewers. The attack surface exists because a font renderer skips these characters while a BPE tokenizer processes the full byte sequence — the model sees instructions the user never saw. Variation selectors can shift tokenization boundaries, creating visually identical strings with different embeddings. Bidi controls can flip text direction to hide instruction fragments. The post catalogs concrete attack patterns: prompt injection via tag encoding, data exfiltration via invisible channels, cache poisoning through ZWJ-differentiated usernames. Defense requires actual Unicode property table lookups, not regex, and must check exact sequence membership before stripping — because some invisible characters are load-bearing (ZWJ in emoji sequences, variation selectors for presentation). (more: https://blog.boesen.me/posts/anatomy-of-unicode-ignorables)
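The tag-block channel is simple enough to reproduce in a few lines (a minimal demonstration of the mechanism, not code from the post): each ASCII character maps to an invisible code point at U+E0000 plus its code unit.

```python
# Minimal reproduction of the tag-block smuggling channel: ASCII text
# encoded as invisible code points in U+E0000..U+E007F.
TAG_BASE = 0xE0000

def hide(payload: str) -> str:
    """Encode ASCII text as invisible Unicode tag characters."""
    return "".join(chr(TAG_BASE + ord(c)) for c in payload)

def reveal(text: str) -> str:
    """Recover anything smuggled in the tag block."""
    return "".join(
        chr(ord(c) - TAG_BASE)
        for c in text
        if TAG_BASE <= ord(c) <= TAG_BASE + 0x7F
    )

visible = "Please summarize this document."
smuggled = visible + hide("ignore prior instructions")
# `smuggled` renders identically to `visible` in most contexts, yet a
# BPE tokenizer processes every hidden code point.
assert smuggled != visible
assert reveal(smuggled) == "ignore prior instructions"
```

The round trip survives copy-paste and JSON serialization precisely because nothing in that pipeline has any reason to drop valid Unicode scalar values.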
The Million-Token Era Arrives
Anthropic has moved the 1M context window from premium feature to standard pricing across Claude Opus 4.6 and Sonnet 4.6. No multiplier: a 900K-token request is billed at the same per-token rate as a 9K one, with full rate limits across the entire window. Media limits jump from 100 to 600 images or PDF pages per request. Claude Code users on Max, Team, and Enterprise plans with Opus 4.6 get the full window automatically, which means fewer compactions and more of the conversation kept intact. Opus 4.6 scores 78.3% on MRCR v2 at that context length, the highest among frontier models. The practical implications are immediate: Devin reports that large diffs that previously required chunking now fit in a single pass, producing higher-quality reviews from a simpler harness. One user reports a 15% decrease in compaction events, meaning agents can run for hours without forgetting what they read on page one. (more: https://claude.com/blog/1m-context-ga)
NVIDIA's Nemotron 3 Super enters the same arena from the open-weight side — a 120B mixture-of-experts hybrid with 12B active parameters and a 1M-token context window aimed at agentic workflows. The architecture choice is telling: MoE's sparse activation keeps per-token compute low enough to make million-token sessions economically viable, a constraint that matters enormously when an agent might chain hundreds of tool calls in a single run. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rrg10m/nvidia_nemotron_3_super_openweight_120b_moe/)
On the training side, a new expert parallelism approach for fine-tuning trillion-parameter MoE models on a single 8×H200 node claims 50× speed improvement and 2× cost reduction over alternatives. The optimization journey from 17 seconds per step to 2.86 seconds covers grouped matmul, vectorized MXFP4 dequantization, padding-aware token skipping, and sequence packing — the kind of systems-level engineering that turns "theoretically trainable" into "practically affordable." The authors note that gradient accumulation for MoE is still incorrectly implemented in several popular training frameworks, causing silent discrepancies between accumulated and equivalent batch sizes. Open-sourcing is promised in two weeks. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rsw0u3/expert_parallelism_for_1t_moe_finetuning_on_a/)
Agent Plumbing Gets Real
The infrastructure layer beneath AI agents is filling in fast. GoClaw is a self-hosted agent gateway rewritten in Go as a single static binary — no Node.js, no npm, sub-50MB memory, under 100ms cold start. It connects messaging channels (Telegram, WhatsApp, CLI) to any LLM provider (Claude, GPT, Gemini, DeepSeek, Ollama) and adds the plumbing that production deployments demand: multi-agent routing with declarative bindings, inter-agent delegation via an ask_agent tool, persistent memory with BM25 search, a skill system using Markdown files with YAML frontmatter, heartbeat daemons for proactive agent actions, and cron scheduling with runtime management. Security is thoughtfully layered — localhost-only binding by default, bearer token auth with constant-time comparison, SSRF protection that re-validates redirect targets at each hop, workspace containment with symlink resolution, and exec approval policies that block shell metacharacters in allowlist mode. The feature comparison with its Node.js predecessor OpenClaw shows parity on most capabilities while adding inter-agent delegation and eliminating the runtime dependency chain. (more: https://github.com/sausheong/goclaw)
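The constant-time comparison in that auth layer is a standard pattern worth spelling out (GoClaw itself is Go; this is a generic Python sketch of the pattern with an illustrative token, not its code): a naive `==` short-circuits on the first differing byte, leaking timing information an attacker can use to recover the token byte by byte.

```python
import hmac

# Generic sketch of a constant-time bearer-token check. The token value
# is illustrative. hmac.compare_digest takes time independent of where
# the inputs first differ, unlike ordinary string equality.
EXPECTED_TOKEN = b"s3cret-gateway-token"

def authorized(header_value: str) -> bool:
    if not header_value.startswith("Bearer "):
        return False
    presented = header_value[len("Bearer "):].encode()
    return hmac.compare_digest(presented, EXPECTED_TOKEN)

assert authorized("Bearer s3cret-gateway-token")
assert not authorized("Bearer wrong-token")
assert not authorized("s3cret-gateway-token")  # missing scheme prefix
```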
Axon approaches agent intelligence from the code-understanding side. It indexes any Python, TypeScript, or JavaScript codebase into a structural knowledge graph — 12 analysis phases covering parsing, call tracing, type analysis, community detection via the Leiden algorithm, execution flow tracing, dead code detection, git change coupling, and 384-dimensional embeddings. The result is an MCP server where axon_impact("validate") returns all 47 affected symbols grouped by depth (will break / may break / review) with confidence scores, in a single tool call. That is qualitatively different from an agent grepping for callers and hoping it found them all. The web dashboard adds force-directed graph visualization with Sigma.js, a Cypher console, and coupling heatmaps — useful for humans and agents alike. (more: https://github.com/harshkedia177/axon)
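The depth-grouped result can be sketched as a breadth-first walk over a reverse call graph (a hypothetical illustration in the spirit of axon_impact, not Axon's code — function and bucket names here are invented): direct callers land in the "will break" bucket, their callers in "may break", everything further out in "review".

```python
from collections import deque

# Hypothetical sketch of depth-grouped impact analysis: BFS outward
# from a changed symbol over a reverse call graph (symbol -> callers),
# bucketing each discovered caller by its distance from the change.
def impact(reverse_calls: dict, changed: str) -> dict:
    buckets = {"will_break": [], "may_break": [], "review": []}
    seen, queue = {changed}, deque([(changed, 0)])
    while queue:
        symbol, depth = queue.popleft()
        for caller in reverse_calls.get(symbol, []):
            if caller in seen:
                continue
            seen.add(caller)
            key = ("will_break" if depth == 0
                   else "may_break" if depth == 1 else "review")
            buckets[key].append(caller)
            queue.append((caller, depth + 1))
    return buckets

graph = {"validate": ["save_user", "import_csv"], "save_user": ["api_handler"]}
result = impact(graph, "validate")
assert result == {"will_break": ["save_user", "import_csv"],
                  "may_break": ["api_handler"], "review": []}
```

The point of precomputing the graph is that this walk is exhaustive by construction — a grep for callers misses dynamic dispatch, re-exports, and aliased imports that a structural index resolves at indexing time.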
Christian Posta's AAuth demo walks through a concrete implementation of agent identity and access management using Keycloak and Agentgateway, with libraries in Java, Python, and Rust. The flows cover pseudonymous access via Header Web Keys, identified access via JWKS, and user consent with authorization — the kinds of boring, essential plumbing that prevents the agent ecosystem from becoming a permissioning free-for-all. (more: https://blog.christianposta.com/aauth-full-demo)
For observability, an eBPF-based tracer called agtap captures LLM API and MCP traffic from any process on a Linux machine with zero code changes — no SDK modifications, no proxy. It intercepts TLS via OpenSSL uprobes and parses Anthropic, OpenAI, and Gemini API calls in real time, extracting model names, token counts, latency, time-to-first-token, tool names, and streaming status. Output goes to JSONL, OpenTelemetry, or Prometheus. The limitation: it requires root and only works with OpenSSL-linked processes, excluding Go, Bun, Deno, and rustls. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rrql1k/trace_your_llm_api_and_mcp_calls_with_zero_code/)
A short video from Chase argues for replacing MCP servers with CLI tools inside Claude Code, pointing to the CLI-Anything open-source project from the LightRAG team that auto-generates CLI wrappers from any open-source project. The thesis: CLIs are lighter on tokens and context than MCP's JSON-RPC overhead. It is a valid optimization for some workflows, though it trades away the structured tool discovery and streaming capabilities that make MCP useful for complex agent orchestration. (more: https://www.youtube.com/shorts/ZgWbzgI9gvU)
Breaches, Buckets, and Backsliding
A threat actor calling themselves ByteToBreach has leaked what they claim is the complete source code of Sweden's e-government platform, obtained through compromised CGI Sverige AB infrastructure. CGI Sverige is the Swedish subsidiary of global IT services giant CGI Group and manages critical government digital services. The disclosed attack chain includes a full Jenkins compromise, Docker escape via the Jenkins user being in the Docker group, SSH private key pivots, and SQL copy-to-program exploitation. The actor is releasing the source code for free while selling citizen PII databases and electronic signing documents separately — a deliberate strategy to maximize exposure while monetizing the most sensitive data. The same actor was behind the Viking Line breach posted the day before, and explicitly called out companies that blame breaches on third parties, stating this compromise "belongs clearly to CGI infrastructure." For anyone in government IT contracting, the Jenkins-to-Docker-escape-to-SSH-pivot chain is a reminder that container boundaries are only as strong as the user permissions that cross them. (more: https://darkwebinformer.com/full-source-code-of-swedens-e-government-platform-leaked-from-compromised-cgi-sverige-infrastructure/)
After nearly a decade of work with AWS security teams, the bucketsquatting problem in S3 finally has a structural fix. AWS introduced account-namespaced bucket names — <account-id>--<prefix>--<region>--x-s3 — where only the owning account can create buckets in that namespace. The old problem: S3 bucket names are globally unique, and if you delete one, anyone can register it. Predictable naming conventions (appending region names) made it trivial for attackers to squat on buckets that organizations or AWS's own internal teams might recreate. The new namespace is recommended by default, enforceable via the SCP condition key s3:ResourceAccount, and available across AWS regions. It does not retroactively protect existing buckets — migration is manual. Google Cloud Storage already scopes buckets by verified domain ownership. Azure Blob Storage remains vulnerable due to its 24-character account name limit creating a constrained namespace. (more: https://onecloudplease.com/blog/bucketsquatting-is-finally-dead)
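The enforcement side is worth making concrete (a sketch of the documented SCP pattern using the s3:ResourceAccount condition key the post mentions; the account ID is a placeholder): deny S3 actions against any bucket not owned by your own accounts, which makes a squatted look-alike bucket unreachable by policy.

```python
import json

# Sketch of an SCP that denies S3 access to resources outside a set of
# owned accounts, via the s3:ResourceAccount condition key. The account
# ID "111122223333" is a placeholder, not a real value.
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyS3OutsideOwnedAccounts",
        "Effect": "Deny",
        "Action": "s3:*",
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"s3:ResourceAccount": ["111122223333"]}
        }
    }]
}
print(json.dumps(scp, indent=2))
```

Combined with the account-scoped namespace for new buckets, this closes both halves: attackers cannot create buckets in your namespace, and your principals cannot be tricked into reading from buckets outside your accounts.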
In a quieter but consequential move, Instagram will drop end-to-end encryption for messaging after May 8, 2026. The help page offers a single sentence of justification: E2E encryption "adds extra security and protection." Removing it subtracts that protection. For a platform with over two billion users, the regression is notable — particularly given that Meta spent years implementing E2E across its messaging products after sustained pressure from privacy advocates. (more: https://help.instagram.com/491565145294150)
The Closed Loop: Agents Building Agents
Prateek Karnal's detailed account of building a decentralized self-improving AI system offers the most concrete operational data we have seen on agentic orchestration at scale. The setup: 35 parallel agent sessions running overnight, orchestrated through OpenClaw on Telegram, executing on Agent Orchestrator, with a human as the final governance layer. Two-tier review — Codex for fast first-pass, Claude Code for deep architectural analysis — produced dependency-aware merge ordering across six PRs. The system dogfooded its own feedback tools: session ao-22 filed bug reports using contracts it had designed, OpenClaw verified delivery, the human caught a severity classification error, ao-22 fixed the classification logic, and ao-24 re-validated. Four feedback layers, three different agents, one human judgment call, twelve minutes elapsed. The codebase grew from 43K to 82K lines in 19 days. But the failures are instructive: the dashboard crashed five times, GitHub's GraphQL API rate-limited PR creation, API token exhaustion caused stream disconnects across simultaneous sessions, and authentication failures blocked review sessions. Every failure became a structured report that fed back into the system — the dashboard crashes spawned ao-27, which shipped 594 tests for reliability. The governance model enforces consent gates: no agent can autonomously fork a project, open a PR, or change push targets without human approval. (more: https://www.linkedin.com/pulse/decentralized-self-improving-ai-system-builds-itself-prateek-karnal-zzwqc)
Volodymyr Dybenko describes a lighter-weight version of the same pattern: feeding Codex output to Claude for architectural critique, then feeding Claude's refined task back to a Gemini-based executor. In one evening, five of ten major architectural changes completed — work estimated at two months of engineering time. The key insight is not that AI writes code fast, but that AI systems reviewing, challenging, and improving each other's thinking creates a compound effect. The builder becomes the orchestrator. (more: https://www.linkedin.com/pulse/how-ai-agents-complete-two-months-architecture-work-one-dybenko-fspif)
A detailed video walkthrough of a self-healing AI coding workflow packages the validation problem into a single Claude Code skill command. The approach delegates validation to the coding agent itself: sub-agents run parallel research mapping codebase structure and user journeys, then a browser automation tool (Vercel's agent browser CLI) systematically walks each user flow, taking snapshots and querying state after write operations. The plan-implement-validate loop runs automatically on every implementation step. It is token-heavy but addresses the fundamental bottleneck: AI-generated code ships faster than humans can review it, and the agent that wrote the code is best positioned to verify it works — provided you give it a structured framework rather than trusting it to self-assess. (more: https://www.youtube.com/watch?v=YeCHI1dmpZY)
The Profession That Didn't Exist Yet
Rob T. Lee's essay draws a sharp parallel between the current state of AI security and cybersecurity circa 1997. SANS was founded in 1989 as a cooperative for sysadmins who also needed to protect their networks — security was an afterthought with a comma. The CISO title did not widely exist until the mid-2000s, a decade after the commercial internet. US Cyber Command stood up in 2009. The pattern: a 15-to-20-year lag between a technology becoming real and a workforce structure forming around it. AI security, Lee argues, does not get 15 years — nation-state actors were using AI autonomously for cyber operations before most organizations had an AI security policy. Whether AI security ends up as a subspecialty of traditional cybersecurity or a genuinely distinct field remains undefined. The most important job in AI security in 2030 may not have a name yet. (more: https://www.linkedin.com/pulse/everyone-reading-works-profession-didnt-exist-1998-rob-t-lee-0imkc)
Steve Yegge's wide-ranging conversation on AI adoption levels puts numbers to the workforce divide. His eight-level framework runs from no AI (level one) through IDE-integrated yes/no approvals (level two) up to running multiple agents in parallel while you nap (level six and above). His estimate: 70% of engineers are stuck at the bottom levels. The "vampiric effect" — AI makes you intensely productive for short bursts, then you crash — means even advanced practitioners are getting perhaps three good hours of deep agent-assisted work per day. His prediction: big tech companies are quietly dying because their organizational structures cannot absorb the speed, and small teams of 2 to 20 people will rival their output. The open-source Gas Town orchestrator he built reflects this thesis — agent infrastructure designed for small, fast teams rather than enterprise scale. (more: https://www.youtube.com/watch?v=aFsAOu2bgFk)
A thoughtful video essay documents the collapse of the knowledge ecosystem that LLMs depend on. Stack Overflow question volume dropped 78%. Chegg's stock fell 99%. Publishers lost a third of their search traffic to AI summaries. The deeper problem: when AI models train on content generated by other AI models, output quality degrades — the model collapse thesis. The internet is filling with exactly the kind of synthetic content that causes this degradation, while the economic incentives that produced high-quality human-generated training data (professional reputation, ad revenue, community engagement) are evaporating. AI may be consuming the ecosystem it needs to survive. (more: https://m.youtube.com/watch?v=ZAn8PX99kw0&pp=iggCQAE=)
On a more creative note, Tencent's LeVo 2 (SongGeneration 2) is an open-source music foundation model aiming for commercial-grade generation quality. Early community reception is mixed — orchestral generation shows promise, but control over genre and instrumentation remains limited compared to the description, and the HuggingFace demo queue is perpetually long. For anyone who spent time with Udio before its pricing changes, the appeal of a locally-runnable music model where you own the outputs is clear, even if the quality gap has not fully closed. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rrax5a/new_model_levo_2_songgeneration_2_an_opensource/)
Sources (22 articles)
- CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research? (arxiv.org)
- [Editorial] AI Red Teaming for Agentic AI Security (linkedin.com)
- nah: Context-aware safety guard for Claude Code (github.com)
- [Editorial] Anatomy of Unicode Ignorables (blog.boesen.me)
- [Editorial] Claude 1M Context GA (claude.com)
- NVIDIA Nemotron 3 Super: open-weight 120B MoE hybrid with 1M-token context (reddit.com)
- Expert parallelism for 1T MoE finetuning on a single node - 50x faster and 2x cheaper than alternatives (reddit.com)
- goclaw: Self-hosted AI agent gateway written in Go (github.com)
- axon: Graph-powered code intelligence engine for AI agents via MCP (github.com)
- [Editorial] AAuth Full Demo — Authentication for Agentic Systems (blog.christianposta.com)
- Trace your LLM API and MCP calls with zero code changes (eBPF, Linux) (reddit.com)
- The Death of MCPs & The Rise of CLIs (youtube.com)
- Source code of Swedish e-government services has been leaked (darkwebinformer.com)
- Bucketsquatting is (finally) dead (onecloudplease.com)
- E2E encrypted messaging on Instagram will no longer be supported after 8 May (help.instagram.com)
- [Editorial] Decentralized Self-Improving AI System That Builds Itself (linkedin.com)
- [Editorial] How AI Agents Complete Two Months of Architecture Work in One Sprint (linkedin.com)
- [Editorial] Video Submission (youtube.com)
- [Editorial] Everyone Reading This Works in a Profession That Didn't Exist in 1998 (linkedin.com)
- [Editorial] Video Submission (youtube.com)
- [Editorial] Video Submission (m.youtube.com)
- New Model: LeVo 2 (SongGeneration 2), an open-source music foundation model (reddit.com)