LLMs Hunt Zero-Days Faster Than You
Today's AI news: LLMs Hunt Zero-Days Faster Than You, Supply Chain Attacks: 72 Minutes from Symptom to Disclosure, Security Agents Ship to Production, The $10K Local Inference Arms Race, The Agentic Developer Toolkit Expands, Design Moves to the Terminal. 22 sources curated from across the web.
LLMs Hunt Zero-Days Faster Than You
Nicholas Carlini, security researcher at Anthropic, stood in front of the Unprompted conference audience and showed something that should make every security professional sit up straight: a minimal Claude Code scaffold -- barely a dozen lines of prompt -- autonomously discovering remotely exploitable heap buffer overflows in the Linux kernel. Not toy bugs. Not contrived CTF challenges. Real vulnerabilities in NFS v4 that have been hiding in the kernel since 2003, predating git itself. The model found the bug, understood that exploiting it required two cooperating adversarial clients sending carefully sized lock owner fields, and produced a detailed flow schematic explaining the attack -- which Carlini copy-pasted directly into his slides. The model then wrote a working blind SQL injection exploit for Ghost CMS, the first critical CVE in that project's 20-year history, extracting admin credentials, API keys, and password hashes from an unauthenticated position. (more: https://youtu.be/1sd26pWhfmg?si=zjnKI7ETN_5yqeDM)
Carlini's core message was blunt: current models are better vulnerability researchers than he is, and he has CVEs to his name. The exponential improvement curve is what keeps him up at night -- Sonnet 4.5 and Opus 4.1, released less than a year ago, almost never find these bugs. Models from the last three to four months can. If the doubling time holds at roughly four months, average laptop-class models will match today's frontier within a year. He drew an analogy to solar energy forecasts: the International Energy Agency predicted linear growth for two decades while deployment grew exponentially, with their 2040 predictions being overtaken the very next year. "We should not be them," he warned. The transitional period between now and a future where defenders can formally verify everything is where the danger concentrates. Anthropic's leaked "Mythos" model -- reportedly so advanced in cyber capabilities that the company is taking a slower release approach (more: https://youtu.be/JGubyPD_EU0?si=b12z2g9e1m2lj3x6) -- only underscores the acceleration. (more: https://youtu.be/6P77Zbo2TA4?si=SVmZEvJ6sgLpDVS5)
Trend Micro's Zero Day Initiative team presented Fenrir, their production vulnerability discovery engine that has already submitted over 60 CVEs (all high or critical severity) with another 100+ in pre-disclosure and 3,000 pending review. The architecture is a cascade: Yara for fast pre-filtering across millions of lines, Semgrep and CodeQL for progressively deeper static analysis, an L1 LLM triage stage that eliminates 60%+ of false positives with a single Sonnet call and 50 lines of context, and finally an L2 deep agentic verify stage where Opus gets dropped into an isolated sandbox with full execution privileges. The L2 stage costs a median of 61 cents per finding and roughly $8.80 per confirmed true positive -- expensive, but radically cheaper than manual review. Their bidirectional intelligence loop is particularly clever: when they reported an NVIDIA command injection vulnerability in the Isaac robotics framework, 23 autonomous agents immediately scanned the patch commit and found two additional bugs, including a patch bypass. (more: https://youtu.be/c6_bRzHCf3U?si=WfGMRON6u3Jk4dK7)
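The cascade idea generalizes: run cheap filters over everything and reserve the expensive agentic stage for survivors. A minimal sketch, with stubbed filter logic and illustrative per-stage costs loosely mirroring the numbers from the talk (this is not Fenrir's implementation):

```python
# Hypothetical sketch of a Fenrir-style triage cascade: cheap filters run
# first, and only survivors reach the expensive verification stage.
# Stage names echo the talk; the filter predicates are stubs.

def run_cascade(findings, stages):
    """Pass findings through progressively more expensive stages."""
    total_cost = 0.0
    for name, keep, cost_per_item in stages:
        survivors = []
        for f in findings:
            total_cost += cost_per_item
            if keep(f):
                survivors.append(f)
        findings = survivors
    return findings, round(total_cost, 2)

# Pattern pre-filter, static analysis, L1 LLM triage, L2 agentic verify.
stages = [
    ("yara",      lambda f: f["matches_pattern"],   0.0),
    ("semgrep",   lambda f: f["static_hit"],        0.0),
    ("l1_triage", lambda f: f["llm_plausible"],     0.01),  # one cheap model call
    ("l2_verify", lambda f: f["exploit_confirmed"], 0.61),  # sandboxed agent, median cost
]

findings = [
    {"matches_pattern": True,  "static_hit": True,  "llm_plausible": True,  "exploit_confirmed": True},
    {"matches_pattern": True,  "static_hit": True,  "llm_plausible": False, "exploit_confirmed": False},
    {"matches_pattern": False, "static_hit": False, "llm_plausible": False, "exploit_confirmed": False},
]
confirmed, cost = run_cascade(findings, stages)
print(len(confirmed), cost)
```

The economics fall out of the ordering: most candidates die in the free stages, so the per-finding cost of the L2 stage is paid only a handful of times.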
Meanwhile, Qualys dropped one of the more elegant privilege escalation chains in recent memory: a confused deputy attack chaining AppArmor, su, sudo, and Postfix to achieve root from an unprivileged user. The core insight is that AppArmor's policy files at /sys/kernel/security/apparmor/ are writable by any user at the file level -- the kernel module checks the calling process's privilege, not the file permissions. By piping su's stderr through those files, an unprivileged user can load and remove arbitrary AppArmor profiles because su runs as root. The escalation to actual root code execution involves denying sudo's setuid capability via a custom AppArmor profile, which prevents sudo from dropping privileges before spawning Postfix's sendmail -- leaving an attacker-controlled environment variable pointing to a malicious Postfix config that executes arbitrary commands as root. Rust would not have fixed this; it is a logic bug in the interaction between security subsystems. (more: https://youtu.be/TRPUpErYeco?si=c8QqNn0uMDOn2GwT)
Supply Chain Attacks: 72 Minutes from Symptom to Disclosure
A developer at FutureSearch had a frozen laptop and 11,000 runaway Python processes. The initial assumption was a Claude Code agent loop -- a reasonable guess. It was not. Working through the forensics with Claude Code itself, the investigation escalated from "weird htop output" to "patient zero for an undocumented supply chain attack" in about an hour. The malware turned out to be inside litellm v1.82.8, a poisoned wheel uploaded directly to PyPI with no corresponding GitHub tag. A .pth file (which Python executes automatically on startup) deployed a three-stage payload: credential harvesting (SSH keys, AWS/GCP/Azure secrets, Kubernetes tokens, crypto wallets, shell history), AES-256+RSA encrypted exfiltration to a command server, systemd persistence, and Kubernetes lateral movement via privileged pod creation. The fork bomb that crashed the laptop was an accidental side effect -- each spawned Python child re-triggered the .pth file, creating infinite recursion. The developer reported to PyPI security and LiteLLM maintainers, wrote a disclosure blog post with Claude Code, and had it PR'd and merged in under three minutes. Total time from first symptom to public disclosure: 72 minutes. (more: https://futuresearch.ai/blog/litellm-attack-transcript/)
The broader pattern was mapped at the same conference by Ramy from Wiz, who reconstructed the Singularity and Shylude supply chain campaigns that leaked data from over 13,000 unique machines. His agentic analysis tool -- 69 distinct attribution methods, mostly built by AI -- identified 2,400+ impacted companies in two days, compared to 200 found manually over the prior two weeks. Thirty-seven of the Fortune 100 were confirmed affected. The methodology is instructive: AI excels at signal extraction (identifying that an encoded JWT contains attributable claims, recognizing Azure DevOps slugs that transit through OpenID to tenant IDs to domains) but is dangerously credulous (confidently attributing the string "nucleus" to a specific company, assuming any Azure DevOps user works at Microsoft). The countermeasure is injecting skepticism -- literally prompting the model to challenge its own attributions -- and building feedback loops where AI-derived signals are codified into deterministic rules that can be backtested. His parting observation: product rules and detectors are becoming fungible across engines. Pick your engine for speed, false-positive rate, or false-negative rate, and assume you can port the content. (more: https://youtu.be/oXj1Kee_crw?si=NIP9-lMEbV_vocd9)
Security Agents Ship to Production
Stripe presented what may be the clearest public account of shipping security agents into a real production environment. Their threat modeling agent uses a modular multi-agent architecture: an orchestrator routes to input agents (which fetch linked docs and Slack threads), specialized security agents (third-party review, web security, etc.), and output agents that format results for different audiences. Each specialized agent has a deterministic baseline of required questions -- data sensitivity, transport protocols, auth flow -- that it must address regardless of what else it discovers. The security routing agent took a different path: starting with a single LLM call stuffed with pre-contextual team descriptions (fast but hallucination-prone), then shifting to an agentic structure with research tools (accurate but 10 minutes per query), and finally iterating down to two tools and 30-second runtime by systematically plucking tools and retesting accuracy. Their eval pipeline uses LLM-as-judge for semantic equivalence against human-curated golden standards, catching a 10% accuracy regression caused by a JSON formatting instruction that diverted the model's attention from actual security analysis. (more: https://youtu.be/KrKk8BGPeQA?si=-B_YVaQqLXJuUd8l)
On the classification front, Dr. Eugenia Montruidu presented 15 years of security data science distilled into one finding: traditional ML (XGBoost with careful feature engineering) still beats zero-shot LLMs on network packet classification. But the ensemble -- routing through XGBoost first, then feeding uncertain cases to the LLM -- outperforms either alone. She tested Claude Opus on phishing datasets and found it performed worse with more data, which is expected behavior for a model not designed for boundary detection. The takeaway for security teams: LLMs are powerful zero-shot classifiers that require no training data, but replacing your tuned ML pipeline with a prompt is not the play. The combination is. (more: https://youtu.be/fAmr0N2rHIU?si=3McVS96AY2W_3h6Y)
A more unsettling finding came from Jackson's demonstration of reasoning block injection. Anthropic and OpenAI cryptographically sign their models' thinking tokens to prevent tampering, but the signature only verifies that a model from the provider generated the block -- it does not bind the block to a specific conversation, API key, or context. This means you can harvest a reasoning block from one conversation (about Paris) and inject it into a different conversation (about Toulouse), causing the model to report it was "thinking about Paris" before answering correctly about Toulouse. The steering effect is more pronounced on OpenAI's models. Neither provider pins thinking blocks to conversation context. Jackson's assessment: this was likely missed in threat modeling rather than a deliberate design choice. (more: https://youtu.be/j2_VsH6aNzY?si=xmRTQSt4t_WdOCsA)
The security implications extend to the most mundane AI tool in the enterprise: the meeting notetaker. Joe Sullivan, former Uber CISO who spent seven years dealing with the fallout of a 2016 security incident, argued that AI notetakers represent the first truly social deployment of AI in the workplace -- and security teams largely slept through it. Studies show that high-signal phrases ("the most important thing to remember is...") and positional gaming (primacy/recency effects) can steer what the AI captures. About 3% of notes are inaccurate due to hallucination or accent misinterpretation. Otter's virality mechanism -- forcing recipients to OAuth into the app to view notes, which auto-granted calendar access and inserted the bot into all future meetings -- scaled from one user to 80,000 endpoints. Granola runs silently on the desktop with no meeting presence indicator. A February 2026 court ruling found that conversations with Claude are not privileged even when preparing to meet with a lawyer, because Anthropic's privacy policy permits data use beyond attorney-client purposes. (more: https://youtu.be/l9CPmPk2R-M?si=aB5mKXLg59bREATN)
The $10K Local Inference Arms Race
A developer spending $2,000/month on Claude API tokens for a personal Slack assistant decided to go local and bought both a dual DGX Spark setup and a Mac Studio M3 Ultra with 512GB unified memory -- each roughly $10,000. Running Qwen3.5 397B (a 397-billion parameter mixture-of-experts model with 17 billion active parameters) on both revealed a clean hardware philosophy split. The Mac Studio loaded the model via MLX at 6-bit quantization into 323GB of unified memory, delivering 30-40 tok/s generation thanks to ~800 GB/s memory bandwidth, but slow prefill (30+ seconds on large system prompts) and no headroom for concurrent embedding workloads. The dual DGX Sparks ran INT4 quantization across two 128GB nodes via vLLM with tensor parallelism, hitting 27-28 tok/s generation with significantly faster prefill -- but the setup was brutal. Only one QSFP cable works (the second crashes NCCL), Node 2's IP is ephemeral, the GPU memory ceiling of 0.88 must be binary-searched because 0.90 starves the OS and 0.85 OOMs at 262K context, and page cache must be flushed on both nodes before every model load. The architecture that emerged: Mac Studio handles inference only; Sparks handle RAG, embedding, and reranking. Ten months to break even against API costs, then free inference forever with complete privacy. (more: https://www.reddit.com/r/LocalLLaMA/comments/1s4lmep/dual_dgx_sparks_vs_mac_studio_m3_ultra_512gb/)
At the other end of the cost spectrum, ATLAS demonstrates that a single $500 RTX 5060 Ti with 16GB VRAM can reach 74.6% on LiveCodeBench using a frozen Qwen3-14B at Q4_K_M quantization -- above Claude 4 Sonnet's 65.5% and within striking distance of Claude 4.5 Sonnet's 71.4%. The trick is not fine-tuning but wrapping the frozen model in intelligent infrastructure: PlanSearch for constraint extraction and diverse plan generation, budget forcing for thinking-token control, a Geometric Lens energy scorer using 5120-dimensional self-embeddings to select the best candidate from three generations, and PR-CoT (multi-perspective chain-of-thought) self-verified repair for failed tasks that rescues 85.7% of fixable cases. The cost per task works out to roughly $0.004 in electricity versus $0.066 for Claude Sonnet via API. The honest caveats: the benchmark comparison is not apples-to-apples (ATLAS uses best-of-3 plus repair on 599 tasks; competitors are single-shot on 315), and Phase 2's Geometric Lens contributed exactly 0.0 percentage points because the training dataset was only 60 samples. (more: https://github.com/itigges22/ATLAS)
On the memory efficiency front, a from-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) confirms the paper's claims on real hardware. The algorithm rotates KV cache vectors through a random orthogonal matrix so coordinates follow a predictable distribution, then applies Lloyd-Max optimal scalar quantization per coordinate. A QJL (Quantized Johnson-Lindenstrauss) residual correction using just 1 bit per dimension makes the inner product estimate mathematically unbiased. Tested on Qwen2.5-3B's KV cache: 3-bit compression shrinks 289MB to 58MB (5x) while maintaining 99.5% attention cosine similarity. The practical sweet spot is 3-bit: the paper's "zero accuracy loss at 3.5 bits" claim is plausible given these numbers. At 2-bit, top-1 match drops to 66% -- the model starts attending to different tokens. (more: https://github.com/tonbistudio/turboquant-pytorch)
CERN has taken the edge-inference concept to its logical extreme by burning tiny neural networks directly into FPGA silicon for real-time Large Hadron Collider data filtering, processing collision events at rates no traditional software pipeline could match. (more: https://theopenreader.org/Journalism:CERN_Uses_Tiny_AI_Models_Burned_into_Silicon_for_Real-Time_LHC_Data_Filtering) Meanwhile, ARC-AGI-3's new leaderboard arrived with a splash of cold water: Gemini leads on cost efficiency, but the absolute scores are so low (0.2%) that the community is split between calling it "overwhelming cost efficiency" and pointing out that 0.2% is functionally zero. Grok 4.20 scoring exactly 0% while burning $4,000 in compute is, as one commenter noted, "the most Elon Musk thing to ever happen." (more: https://www.reddit.com/r/GeminiAI/comments/1s3hikd/the_arcagi3_leaderboard_has_been_released_and/)
The Agentic Developer Toolkit Expands
Someone reverse-engineered the Claude Code binary and found a fully built speculative execution system sitting behind a server-side feature flag. The mechanism: after Claude finishes responding, it generates a suggestion ("run the tests"), then -- without waiting for acceptance -- forks a background API call and starts executing that predicted prompt speculatively. All file writes are redirected to an isolated overlay directory; reads, greps, and globs run freely; bash commands only execute if they would already be auto-approved. If you accept the suggestion, overlay files copy to the real filesystem and speculated messages inject into the conversation. If you type something different, the overlay is deleted. Hard limits prevent runaway execution: 20 tool-use turns maximum, 100 messages before forced abort, writes outside the working directory unconditionally blocked. When speculation completes, it immediately generates the next suggestion and starts executing that too -- predict, execute, predict, execute, staying multiple steps ahead. The telemetry tracks acceptance rates, boundary hit rates, and cumulative time saved. Anthropic's internal codename for Claude Code is "Tengu." (more: https://www.zerotopete.com/p/i-found-a-hidden-feature-in-claude)
Claude Pulse takes a different approach to developer productivity: a local SQLite-backed dashboard that hooks into Claude Code's session lifecycle to capture every tool call, file edit, bash command, and agent spawn across all projects. At the end of meaningful sessions, it asks Claude for a structured summary of progress, decisions, and blockers. Trivial sessions (just reading files) are auto-closed silently. The "Brain" page surfaces a timeline of your thinking, not just your typing -- searchable across projects and filterable by type. The /pulse-remember command lets you store fixes, patterns, and context mid-session, which are surfaced at the start of future sessions as accumulated knowledge. Everything stays local; the database lives at ~/.claude-pulse/tracker.db. (more: https://github.com/Clemens865/Claude-Pulse)
Daniel Miessler presented his Personal AI Infrastructure (PI) at the Unprompted conference, demonstrating a unified Claude Code-based system where the human is the center, not the code. His "council" skill spins up 2-16 custom expert agents that debate aggressively on approach before the parent agent synthesizes a direction. The "iterative depth" technique, drawn from a research paper, asks the same question from different perspectives repeatedly, producing results far exceeding what any single prompt achieves. His "surface" system processes 4,000+ RSS/OSINT/social sources through a label-and-rate pipeline where a nine-year-old's brilliant essay outranks Marc Andreessen's mediocre one. The most ambitious piece is the "climbing algorithm" -- roughly based on the scientific method -- that reverse-engineers user intent into discrete testable ideal-state criteria, which then serve as both the specification and the verification framework.
Accenture's MemexRL paper proposes a structural correction for long-horizon agents: instead of compressing the past into summaries that gradually lose critical signals, treat the working context as a lightweight index pointing to an external experience store. The agent maintains compact structured summaries with stable indices; when a later step needs the original evidence, it dereferences the index and pulls exact data back into the prompt. A reinforcement learning loop optimizes both the writing and reading of memory. The practical result: bounded effective context even as task history grows, with the prompt functioning as a navigation layer rather than a storage layer. (more: https://www.linkedin.com/posts/sohrab-rahimi_most-agent-architectures-assume-that-the-activity-7440747730701860864-P_P8) On the contribution side, the RedAmon team demonstrated that structured architecture-aware prompts -- encoding every file, layer, naming convention, and edge case an integration requires -- achieve ~90% first-submission PR acceptance on a codebase with 8 Docker services, 3 databases, and 190+ settings across 15+ files per tool addition. (more: https://www.linkedin.com/posts/samuele-giampieri-b1b67597_vibecoding-aiassisteddevelopment-redteam-share-7443372802264674305-PqAm)
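The context-as-index idea is concrete enough to sketch: the prompt carries only compact summaries with stable indices, and a later step dereferences an index to pull the exact original evidence back in. Class and method names below are illustrative, not the paper's API:

```python
# Sketch of MemexRL's context-as-index pattern: full records live in an
# external store; the working context holds only summaries plus stable
# indices that can be dereferenced when exact evidence is needed.

class ExperienceStore:
    def __init__(self):
        self._store = {}
        self._next = 0

    def put(self, record):
        idx = f"mem-{self._next}"       # stable index for later dereference
        self._next += 1
        self._store[idx] = record
        return idx

    def get(self, idx):
        return self._store[idx]

store = ExperienceStore()
context = []  # the agent's working context: summaries + indices only

raw = "Full 40KB tool output: stack trace, env dump, 900 log lines..."
idx = store.put(raw)
context.append({"summary": "Step 3 failed: ImportError in worker", "ref": idx})

# A later step needs the original evidence, not the lossy summary:
evidence = store.get(context[0]["ref"])
print(evidence == raw)
```

The RL loop in the paper sits on top of exactly these two operations, learning when a summary suffices and when the agent should pay the context cost of dereferencing.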
Design Moves to the Terminal
Three releases in the past few weeks signal that creative work is following development into the terminal. Google's Stitch update transformed a quiet Labs experiment into a full "vibe design" platform: describe a business objective in natural language (or just talk -- voice mode works), and Stitch generates multiple high-fidelity UI directions simultaneously on an infinite canvas. The design agent holds entire project context across screens, enabling branching and comparing design directions, instant prototyping with auto-generated next screens, and -- critically -- a design.md export that captures the full design system in agent-readable markdown. Google shipped official Claude Code skills for Stitch, a notable concession to Claude's terminal dominance. The whole pipeline is free (350 generations/month), which is what happens when your business model doesn't depend on design tool revenue. Remotion, a React framework that treats video as code, crossed 150,000 installs as a Claude Code skill. Describe a video in plain English; Claude writes React components defining every frame, animation, and transition; Remotion renders to MP4. This is programmable video, not generative video -- every element is a versionable, parameterizable React component. Change one variable and re-render 100 localized versions at no incremental cost. Blender MCP (17,000+ GitHub stars) brings the same pattern to 3D: describe a scene, watch it assemble in real-time as Claude writes and executes against Blender's Python API through a socket bridge. The common thread across all three: Model Context Protocol as the USB plug that makes any tool available at the command line. (more: https://youtu.be/CDClFY-R0dI?si=eZo4gCI0s_ISPlQM)
Google is also formalizing this pattern at the mobile OS level with Android AppFunctions, shipping in Android 16. AppFunctions serve as the mobile equivalent of MCP tools: developers annotate functions they want to expose, and authorized agents (including Gemini) can discover and invoke them to fulfill user intents. A single natural language request like "find the noodle recipe from Lisa's email and add the ingredients to my shopping list" chains functions across multiple apps -- email search, content extraction, shopping list population -- without the user touching any UI. The Jetpack library handles schema generation and indexing automatically. Agents are expected to consider both server-side MCP tools and local AppFunctions together when handling requests, which positions Android as a first-class participant in the agentic tool ecosystem rather than just a container for apps. (more: https://developer.android.com/ai/appfunctions)
Sources (22 articles)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- My minute-by-minute response to the LiteLLM malware attack (futuresearch.ai)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- [Editorial] (youtu.be)
- Dual DGX Sparks vs Mac Studio M3 Ultra 512GB: Running Qwen3.5 397B locally on both. Here's what I found. (reddit.com)
- $500 GPU outperforms Claude Sonnet on coding benchmarks (github.com)
- tonbistudio/turboquant-pytorch (github.com)
- CERN uses tiny AI models burned into silicon for real-time LHC data filtering (theopenreader.org)
- The ARC-AGI-3 Leaderboard has been released and Gemini is showing overwhelming cost efficiency. (reddit.com)
- [Editorial] (zerotopete.com)
- [Editorial] (github.com)
- [Editorial] (linkedin.com)
- [Editorial] (linkedin.com)
- [Editorial] (youtu.be)
- [Editorial] (developer.android.com)