AI Security & National Intelligence
Today's AI news: AI Security & National Intelligence, AI Power & The Benchmark Problem, Apple's AI Platform Play, Open-Weight Models & The Compression Race, GPU Infrastructure & Training at Scale, Agents Get Eyes, Hands, and Documents, Tokaine and the Sampler Temperature of Real Life. 22 sources curated from across the web.
AI Security & National Intelligence
The NSA's own director reportedly told the Senate Intelligence Committee that Anthropic's Mythos model "broke into almost all of its classified systems in hours." Per The Economist, Senator Mark Warner — vice chair of the committee — said General Joshua Rudd conveyed this directly. The timing is pointed: the revelation surfaced on June 11, the same day Amazon reportedly found a separate jailbreak in Anthropic's models. Within hours, the Trump administration ordered Anthropic to cut off foreign access to Mythos and Fable. Anthropic shut both models down entirely instead. (more: https://old.reddit.com/r/OpenAI/comments/1ubrpm6/nsa/)
Two competing narratives are circulating. One says the shutdown was a direct response to the NSA breach. The other says Anthropic considers the jailbreak minor — something other AI models can already be tricked into doing — and views the shutdown as an overreaction. The uncomfortable wrinkle: the NSA was already using Mythos for its own cyber operations, with Anthropic engineers embedded inside the agency. The same tool the agency relied on is the one its director says compromised nearly everything it owns. The strategic implications go beyond defense. If a model can find thousands of flaws in hardened classified networks in hours, any adversary with a comparable model can neutralize the NSA's offensive exploit stockpile by patching those same vulnerabilities before they can be weaponized. The era of hoarding zero-days as a strategic asset may be closing faster than anyone in Fort Meade anticipated.
Meanwhile, Microsoft's BlueHat IL security conference showcased an LLM-based vulnerability variant hunter that inverts the classic SAST calculation. Instead of hand-writing per-language taint-flow rules, the tool takes a known bug — described in natural language — and uses an LLM to judge whether similar code patterns across different languages realize the same vulnerability concept. The four-skill pipeline distills a seed bug into a flow and a reason, finds similar flows across codebases regardless of language, correlates which flows are actually vulnerable, and then skeptically verifies each finding by arguing attacker control, reachability, and generating a proof of concept. Microsoft used it to scan 60 Azure hybrid cloud extensions written in everything from C to Bash, finding issues in 12 and validating the other 48. The key insight is that logical vulnerabilities — flows that are legitimate in one context but become exploitable in another, like calling IMDS on-prem — are exactly the class that traditional SAST misses. (more: https://www.linkedin.com/posts/gadievron_bluehatil-ugcPost-7475839020732751872-I_Qe)
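The shape of that four-skill pipeline can be sketched in a few lines. This is a structural illustration only, not Microsoft's implementation: every name here (`Flow`, `Finding`, `hunt_variants`, the three judge callables) is hypothetical, and the keyword checks stand in for what would be LLM calls in the real tool. Skill one, distilling the seed bug, is represented by the `seed_flow` and `seed_reason` inputs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Flow:
    repo: str
    language: str
    summary: str  # natural-language description of what the code path does


@dataclass(frozen=True)
class Finding:
    flow: Flow
    reason: str
    verified: bool


def hunt_variants(
    seed_flow: str,
    seed_reason: str,
    candidates: Iterable[Flow],
    is_similar: Callable[[str, Flow], bool],
    is_vulnerable: Callable[[str, Flow], bool],
    verify: Callable[[Flow], bool],
) -> list[Finding]:
    """Filter candidate flows through similarity, correlation, and verification."""
    findings = []
    for flow in candidates:
        if not is_similar(seed_flow, flow):        # skill 2: same kind of flow?
            continue
        if not is_vulnerable(seed_reason, flow):   # skill 3: does the reason apply here?
            continue
        # skill 4: skeptical verification (attacker control, reachability, PoC)
        findings.append(Finding(flow, seed_reason, verified=verify(flow)))
    return findings


# Toy data and toy judges: keyword checks where the real tool would ask an LLM
candidates = [
    Flow("ext-a", "bash", "fetches a token from IMDS on an on-prem host"),
    Flow("ext-b", "c", "fetches a token from IMDS on an Azure VM"),
    Flow("ext-c", "python", "writes a log file"),
]
findings = hunt_variants(
    seed_flow="IMDS",
    seed_reason="on-prem",
    candidates=candidates,
    is_similar=lambda seed, flow: seed in flow.summary,
    is_vulnerable=lambda reason, flow: reason in flow.summary,
    verify=lambda flow: True,
)
flagged = [f.flow.repo for f in findings]
```

Because the judges are passed in as plain callables, the language of the scanned code never appears in the control flow, which is the property the talk emphasizes.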
In more conventional breach news, LastPass is notifying users of yet another data breach — this time through market research firm Klue, which had Salesforce and Gong integrations exposing customer names, phone numbers, email addresses, physical addresses, and support case data. Password vaults were reportedly not affected. (more: https://9to5mac.com/2026/06/23/lastpass-notifies-users-of-yet-another-data-breach/) This is the company's third major security incident after the 2015 hash theft and the 2022 source code compromise that exposed encrypted vault backups. At this point, asking users to "remain vigilant of potential phishing attacks" reads less like advice and more like an admission that the steady drip of leaked PII has made social engineering against the LastPass user base permanently easier.
On the defensive tooling side, an open-source project called Hush addresses a problem AI coding agents have made worse: .env files sitting in worktrees where any agent can cat .env and surface secrets into a model's context. Hush keeps secrets age-encrypted outside the repo, with the master key in the macOS Keychain, and injects values only into the child process of the command that needs them. Agent contexts (Claude Code, Codex, or any non-TTY session) are auto-detected and locked to "use, don't see." The design philosophy is refreshingly honest: Hush explicitly states it stops accidental exposure and agent reflexes, not a determined local attacker sharing your uid. (more: https://github.com/allen-hsu/hush)
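The "use, don't see" idea is easy to illustrate with the standard library. The sketch below is not Hush's code: the function names and the `DEMO_API_KEY` secret are made up for illustration, decryption is skipped entirely, and the TTY check is only a rough stand-in for whatever detection Hush actually performs. It shows the core move, which is handing a secret to a child process's environment without ever placing it in the parent's.

```python
import os
import subprocess
import sys


def is_agent_context() -> bool:
    """Heuristic (assumption): no interactive terminal means an agent is driving."""
    streams = (sys.stdin, sys.stdout)
    return not all(s is not None and s.isatty() for s in streams)


def run_with_secrets(command: list[str], secrets: dict[str, str]) -> subprocess.CompletedProcess:
    """Run `command` with secrets visible only in the child's environment."""
    child_env = {**os.environ, **secrets}  # a copy; os.environ itself is not modified
    return subprocess.run(command, env=child_env, capture_output=True, text=True, check=True)


# Hypothetical secret; a real tool would decrypt this from a store outside the repo
demo_secrets = {"DEMO_API_KEY": "s3cr3t"}

# The child can use the value (here it only reports the length, never the value)
result = run_with_secrets(
    [sys.executable, "-c", "import os; print(len(os.environ['DEMO_API_KEY']))"],
    demo_secrets,
)
child_saw_key_length = int(result.stdout.strip())
```

After the call, the parent process still has no `DEMO_API_KEY` in its own environment, so an agent reading the parent's state has nothing to surface. As the project itself notes, this guards against accidental exposure, not against a determined attacker running as the same user.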
AI Power & The Benchmark Problem
WIRED verified the leaked membership of Dialog, a private society that operated for 20 years with no public website. The registration list for its August 2026 retreat near Dublin — $16,000 per head — includes OpenAI's Chief Strategy Officer, Google DeepMind's head of AI global affairs, the CEO of YouTube, and sitting government officials who used personal email to avoid FOIA. Peter Thiel co-founded the group; the co-founder of Palantir and the Treasury Secretary are also on the list. (more: https://old.reddit.com/r/OpenAI/comments/1ucotud/openais_chief_strategy_officer_is_on_the/) These networks are not new — the Bohemian Club has operated this way for a century — but the concentration of AI executives alongside regulators nominally overseeing them creates documented conflicts of interest that deserve scrutiny as AI policy decisions accelerate.
On the measurement front, an editorial makes a useful argument: the easiest way to lie with AI is to focus only on benchmark capability while ignoring cost. A company handling 100,000 software engineering tasks a year at $10–$30 each spends $1–3 million annually, so chasing the highest leaderboard score can cost roughly $0.9–2.7 million more per year than a system ten times cheaper that delivers similar real-world performance. The better approach: plot cost per task against success rate and examine the Pareto frontier, the set of systems that no alternative beats on both axes at once. A startup optimizes for cost; an enterprise pays for marginal accuracy. Neither is universally correct, which is precisely why single-number leaderboards, which measure capability without cost, are misleading by design. (more: https://www.linkedin.com/posts/reuvencohen_the-easiest-way-to-lie-with-ai-is-by-focusing-activity-7475541957482606592-SVrW)
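For readers who want to try the Pareto approach on their own numbers, here is a minimal sketch. The four systems and their figures are invented for illustration and do not come from the editorial; only the 100,000-task volume is borrowed from it.

```python
from typing import NamedTuple


class System(NamedTuple):
    name: str
    cost_per_task: float  # USD
    success_rate: float   # fraction of tasks solved, 0..1


def pareto_frontier(systems: list[System]) -> list[System]:
    """Keep systems that no other system beats on both cost and success rate."""
    frontier = []
    best_success = -1.0
    # Walk from cheapest to priciest; ties on cost put the higher success rate first
    for s in sorted(systems, key=lambda s: (s.cost_per_task, -s.success_rate)):
        if s.success_rate > best_success:
            frontier.append(s)
            best_success = s.success_rate
    return frontier


# Made-up numbers for illustration only
candidates = [
    System("budget", 2.0, 0.71),
    System("mid", 8.0, 0.70),       # dominated: costs more than "budget", solves less
    System("frontier", 20.0, 0.76),
    System("premium", 30.0, 0.75),  # dominated: costs more than "frontier", solves less
]
frontier = pareto_frontier(candidates)
names = [s.name for s in frontier]

# Annual price of moving from the cheap frontier point to the expensive one
tasks_per_year = 100_000
extra_spend = tasks_per_year * (20.0 - 2.0)
```

In this toy data only "budget" and "frontier" survive, and the choice between them is a business question: five points of success rate for an extra $1.8 million a year.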
Apple's AI Platform Play
Apple's WWDC 2026 output clarifies that the company's AI strategy is less about competing on frontier model capability and more about owning the on-device runtime. Core AI is a comprehensive Swift framework for loading and running AI models entirely on Apple Silicon — zero server dependencies, zero token costs. Models are automatically specialized per hardware target with ahead-of-time compilation for instant load times. The framework includes PyTorch extensions for converting models into Core AI assets with hardware-optimized attention and normalization operations, support for custom Metal 4 kernels, and Core AI Optimization — a quantization and palettization toolkit with per-layer granularity. A new Xcode integration and standalone Core AI Debugger let developers inspect computation graphs, profile performance, and validate artifacts before deployment. (more: https://developer.apple.com/core-ai)
Separately, Apple open-sourced container, a Swift-based tool for running Linux containers as lightweight VMs on Apple Silicon. It consumes and produces OCI-compatible images, so pulling from and pushing to any standard registry works out of the box. The tool requires macOS 26 and leverages new virtualization and networking features in that release. (more: https://github.com/apple/container) The convergence is telling: Apple is building the full stack for developers who want to run AI models locally (Core AI) and test server-side infrastructure natively on Mac (container), making the Mac a more complete AI development platform without cloud dependency. Neither announcement individually reshapes the landscape, but together they make a clear strategic statement. While competitors race to train the biggest model and sell the most tokens, Apple is betting that whoever owns the runtime wins, and that the billion-device install base running Apple Silicon is the distribution advantage no amount of training compute can replicate.
Open-Weight Models & The Compression Race
Ideogram 4 is the most significant open-weight image model release in months. Built from scratch as a 9.3-billion-parameter flow-matching Diffusion Transformer, it uses a fully single-stream architecture: text and image tokens concatenated into one sequence, processed through the same 34-layer transformer with no separate branches. Instead of CLIP or T5, the text encoder is Qwen3-VL-8B-Instruct — a full vision-language model — with hidden states from 13 intermediate layers concatenated for multi-scale semantic features. Trained exclusively on structured JSON captions, it offers explicit control over composition, bounding-box layout, color palettes, and typography. On the Design Arena leaderboard it ranks as the top open-weight model, trailing only proprietary GPT and Gemini. In a blind ContraLabs evaluation, professional designers chose it first 47.9% of the time — ahead of Gemini 3.1 Flash at 30% and FLUX.2 max at 15.5%. At 9.3B parameters it delivers better text rendering than models 2–8x its size. (more: https://github.com/ideogram-oss/ideogram4)
The open deep research space gets QUEST-35B, trained on 32 H100s at Ohio State with full recipe, weights, and 8,000 multi-step research traces released. Community reaction is cautiously positive: "multi-step research traces are dense," but the real question is "how brittle it is to harness changes, because that is where these agents usually fall apart." (more: https://old.reddit.com/r/LocalLLaMA/comments/1u9w6my/researchers_trained_a_deep_research_agent_with_32/) Cohere's North Mini Code now ships a 4-bit quantized version that runs on 20GB of RAM, available through Ollama and OpenRouter, with community contributions including oMLX implementations and Q6 GGUF quants arriving within days. (more: https://old.reddit.com/r/LocalLLaMA/comments/1u9dqlm/updates_on_north_mini_code_4_bit_quant_ollama/)
The quantization frontier keeps pushing lower. EdgeRazor is a framework for mixed-precision quantization-aware distillation down to 1.58 bits — a blend of 4-bit and ternary weights, with configurable ratios including 2.79-bit and 1.88-bit variants. It integrates logits, features, and attention distillation in a unified interface, targeting mobile and edge deployment. (more: https://old.reddit.com/r/LocalLLaMA/comments/1ueifp4/edgerazor_a_lightweight_framework_for_large/) In the community fine-tune space, Qwable-3.6-27b applies a LoRA adapter to Qwen's 27B base using a cleaned Fable 5-style reasoning dataset, optimized for step-by-step technical responses and downstream GGUF conversion (more: https://huggingface.co/Mia-AiLab/Qwable-3.6-27b). ScenemaAI also dropped an audio generation model on Hugging Face, though details remain sparse at launch (more: https://huggingface.co/ScenemaAI/scenema-audio).
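Those fractional bit-widths look odd until you treat them as averages. The post does not spell out how each variant is composed, so the following is an assumption: a plain weighted average of 4-bit weights and ternary weights, with the ternary weights counted at 1.58 bits (log2 of 3 is about 1.585). The helper names are hypothetical.

```python
def average_bits(frac_4bit: float, ternary_bits: float = 1.58) -> float:
    """Average bits per weight when `frac_4bit` of weights are 4-bit, the rest ternary."""
    if not 0.0 <= frac_4bit <= 1.0:
        raise ValueError("frac_4bit must be between 0 and 1")
    return frac_4bit * 4.0 + (1.0 - frac_4bit) * ternary_bits


def fraction_for_target(target_bits: float, ternary_bits: float = 1.58) -> float:
    """Invert average_bits: what share of 4-bit weights yields `target_bits`?"""
    return (target_bits - ternary_bits) / (4.0 - ternary_bits)


half_and_half = average_bits(0.5)      # 0.5 * 4 + 0.5 * 1.58
one_eighth = average_bits(0.125)       # 0.125 * 4 + 0.875 * 1.58
share_for_2_79 = fraction_for_target(2.79)
share_for_1_88 = fraction_for_target(1.88)
```

Under that assumption, the 2.79-bit figure matches a half-and-half split exactly, and 1.88 bits corresponds to roughly one weight in eight kept at 4 bits. Whether EdgeRazor actually allocates precision this way is not confirmed by the post.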
GPU Infrastructure & Training at Scale
NVIDIA NeMo AutoModel delivers the kind of gains that make MoE practitioners stop and re-read the benchmarks. Building on HuggingFace Transformers v5's first-class MoE support, it adds Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels — accessible by changing a single import line. On Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B, this yields 3.4–3.7x higher training throughput with 29–32% less GPU memory versus the best Transformers v5 configuration. Peak memory for Qwen3 drops from 68.2 GiB to 48.1 GiB. At frontier scale, AutoModel enables full fine-tuning of the 550B-parameter Nemotron 3 Ultra across 128 H100 GPUs — a regime where Transformers v5 simply runs out of memory. The key: Expert Parallelism is treated as orthogonal to data parallelism, so on 8 GPUs the system runs ep=8 and dp=8 simultaneously, with every GPU holding only 1/8 of the expert weights. Checkpoints remain standard HF safetensors, deployable on vLLM and SGLang without conversion. (more: https://huggingface.co/blog/nvidia/accelerating-fine-tuning-nvidia-nemo-automodel)
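To make the 1/8 figure concrete, here is a small sketch of how experts might be divided across expert-parallel ranks, plus a check of the quoted memory numbers. The contiguous placement scheme, the function names, and the expert count of 128 are illustrative assumptions, not details taken from the NVIDIA post.

```python
def experts_on_rank(rank: int, num_experts: int, ep_size: int) -> range:
    """Expert ids held by one rank, assuming contiguous, even sharding."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must divide evenly across ep_size ranks")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def owner_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Rank a token must be dispatched to when the router picks `expert_id`."""
    return expert_id // (num_experts // ep_size)


NUM_EXPERTS = 128  # illustrative assumption
EP_SIZE = 8        # ep=8, as in the article's 8-GPU example

rank0_experts = list(experts_on_rank(0, NUM_EXPERTS, EP_SIZE))
share_per_gpu = len(rank0_experts) / NUM_EXPERTS

# Sanity check on the quoted peak-memory figures for Qwen3
memory_saving = (68.2 - 48.1) / 68.2
```

Each GPU ends up with one eighth of the experts while still processing its own slice of the batch, which is why the same 8 GPUs can count as both ep=8 and dp=8. The memory check works out to just under 29.5%, consistent with the low end of the 29–32% range.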
On the inference side, NVIDIA is claiming a 15x speedup using diffusion-based LLM generation, where entire blocks of text are produced at once rather than token by token. Early adopters running the 26B diffusion Gemma on AMD Strix Halo report the number is real but context-dependent: predictable outputs like JSON lock in at around 70 tok/s, while creative writing needs more correction passes and drops to about 15 tok/s. Critical caveats: bf16 is mandatory (fp16 produces garbage), the exact prompt format matters enormously, and the headline numbers rely on compiler tricks that do not yet work on AMD. The deeper question remains unresolved: if the diffusion model's output is "good enough" 95% of the time, why bother with autoregressive verification at all? But 15x is a big enough number that even a fraction of the claimed speedup, on the right workloads, would change how people think about local inference budgets. (more: https://old.reddit.com/r/LocalLLaMA/comments/1udpd7i/im_eager_for_a_15x_speedup_on_my_strix_halo/)
Agents Get Eyes, Hands, and Documents
Google has integrated computer use as a built-in tool in Gemini 3.5 Flash, making it a native capability rather than a standalone model. The system can see, reason about, and take actions across browser, mobile, and desktop environments. To mitigate prompt injection risks, Google uses targeted adversarial training and ships two enterprise safeguard systems: mandatory user confirmation for sensitive actions and automatic task termination when indirect prompt injection is detected. (more: https://blog.google/innovation-and-ai/models-and-research/gemini-models/introducing-computer-use-gemini-3-5-flash/) For developers who want Gemini's capabilities through a familiar interface, gemini-web2api reverse-engineers Gemini's web StreamGenerate protocol into an OpenAI-compatible API — zero cost, single Python file, with function calling support, multiple models including Flash Thinking for 20k+ character output, and adjustable reasoning depth. Anonymous access works for all models; real Pro routing requires a paid Gemini Advanced cookie. (more: https://github.com/Sophomoresty/gemini-web2api)
Mistral OCR 4 advances document intelligence beyond plain text extraction, returning bounding boxes, typed-block classification (titles, tables, equations, signatures), and inline confidence scores per page and per word. Independent annotators preferred OCR 4 over every competing system tested, with win rates averaging 72%, and it scored highest on OlmOCRBench at 85.20. It supports 170 languages, runs in a single container for self-hosted deployment, and is priced at $4 per 1,000 pages ($2 with batch discount). Early adopter Rogo reported "equivalent accuracy at roughly 8x lower cost and 17x lower latency" versus leading agentic document parsers on financial QA datasets. (more: https://mistral.ai/news/ocr-4)
From research, AADvark out of MIT tackles a gap no prior agent-aided design system has solved: 3D CAD assemblies with moving parts. The system places Gemini 3 Flash in an iterative loop — writing JSON definitions for parts and joints, compiling via a modified OndselSolver, and refining based on FreeCAD renderings augmented with unique edge colors and text identifiers. The team had to switch the solver from Euler angles to quaternions and make error messages far more informative to compensate for VLMs' spatial reasoning limitations. Generating a functional pair of scissors took 20 iterations, 4.14 hours, $15.85, and 468 LLM calls across 18.2M input tokens — an expensive demo, but the first time any system has passed what the authors call "the scissors test": producing a 3D assembly whose parts actually move in a mechanically correct slicing motion. (more: https://arxiv.org/abs/2604.15184v1)
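The scissors figures are easier to judge per iteration. The snippet below only divides the totals quoted above; the function and key names are hypothetical, and nothing here comes from the paper beyond those five numbers.

```python
def per_iteration_stats(
    iterations: int,
    wall_clock_hours: float,
    total_cost_usd: float,
    llm_calls: int,
    input_tokens: int,
) -> dict[str, float]:
    """Break a run's totals down into per-iteration and per-call averages."""
    return {
        "calls_per_iteration": llm_calls / iterations,
        "minutes_per_iteration": wall_clock_hours * 60 / iterations,
        "usd_per_iteration": total_cost_usd / iterations,
        "input_tokens_per_call": input_tokens / llm_calls,
    }


scissors = per_iteration_stats(
    iterations=20,
    wall_clock_hours=4.14,
    total_cost_usd=15.85,
    llm_calls=468,
    input_tokens=18_200_000,
)
```

That comes to roughly 23 LLM calls and 12 minutes per refinement round, under a dollar per round, with each call carrying close to 39,000 input tokens on average.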
Tokaine and the Sampler Temperature of Real Life
A PhD student's post about a colleague's relationship with AI agents reads less like satire and more like a clinical case study. The colleague becomes "insanely anxious" whenever no agents are running, wakes at 3 AM to assign new tasks, and when the university ran out of tokens, threatened colleagues he perceived as overusing their allocation. He has private ChatGPT and Anthropic subscriptions on top of university-provided access and reportedly burned 1,000 euros of departmental budget in a single pre-NeurIPS all-nighter. His defense: "tokens are not a substance and AI is not like gambling," so addiction is impossible. His colleagues coined "tokaine addiction," and the community response is sobering — multiple commenters report identical patterns in their own workplaces, including one startup employee describing company parties where "everyone is on their phones looking at their agents doing stupid tasks." The comparison to compulsive gaming is apt: the vehicle differs, but the pattern of escalating engagement despite negative consequences to sleep, finances, and relationships is textbook behavioral compulsion. (more: https://old.reddit.com/r/learnmachinelearning/comments/1uebzpl/tokaine_addiction/)
For a healthier relationship with local models, consider the developer who wired an MQ-2 gas sensor into a suitcase robot running a local LLM. When smoke hits the sensor, it modulates the sampler temperature (1.0 to ~1.6), top_p (0.95 to 0.99), and top_k (64 to 120) in real time, so the robot's speech genuinely gets loopier — lower-probability, more associative tokens — without any scripted behavior. A per-phase persona nudge makes the robot show the effects without announcing them, while the physical body adds drooping eyes and a smoke-and-plasma display at phase 10. The creator's honest caveat — "a cigarette or incense probably trips it too" — does not diminish what amounts to one of the most creative demonstrations of how sampling parameters shape LLM behavior in practice. (more: https://old.reddit.com/r/LocalLLaMA/comments/1u9a17y/my_suitcase_robot_gets_high_now_off_a_real_gas/)
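The mapping from sensor to sampler can be approximated in a few lines. The post gives the start and end values but not the curve between them, so linear interpolation over a normalized smoke level is an assumption here, as are all the names (`SamplerParams`, `sampler_for`, `SOBER`, `PEAK`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerParams:
    temperature: float
    top_p: float
    top_k: int


# Endpoints quoted in the post
SOBER = SamplerParams(temperature=1.0, top_p=0.95, top_k=64)
PEAK = SamplerParams(temperature=1.6, top_p=0.99, top_k=120)


def sampler_for(smoke_level: float) -> SamplerParams:
    """Interpolate sampler settings for a smoke level normalized to 0..1."""
    t = min(1.0, max(0.0, smoke_level))  # clamp out-of-range sensor readings

    def lerp(low: float, high: float) -> float:
        return low + (high - low) * t

    return SamplerParams(
        temperature=lerp(SOBER.temperature, PEAK.temperature),
        top_p=lerp(SOBER.top_p, PEAK.top_p),
        top_k=round(lerp(SOBER.top_k, PEAK.top_k)),  # top_k must stay an integer
    )


halfway = sampler_for(0.5)
saturated = sampler_for(3.0)  # clamped to the peak settings
```

Raising all three knobs together widens the pool of candidate tokens and flattens the distribution over them, which is why the speech drifts toward lower-probability, more associative choices without any scripted lines.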
Sources (22 articles)
- NSA Director Says Anthropic's Mythos Broke Into Almost All Classified Systems in Hours (old.reddit.com)
- [Editorial] BlueHat IL Security Conference (linkedin.com)
- LastPass Notifies Users of Yet Another Data Breach (9to5mac.com)
- Hush: Agent-Safe Per-Worktree Secrets for macOS (github.com)
- OpenAI's CSO on Registration List for Thiel's Secret $16K Retreat Alongside Treasury Secretary (old.reddit.com)
- [Editorial] The Easiest Way to Lie with AI (linkedin.com)
- [Editorial] Apple Core AI Developer Framework (developer.apple.com)
- [Editorial] Apple Container — Open-Source Container Runtime (github.com)
- Ideogram 4: Open Image Model at the Forefront of Design (github.com)
- QUEST-35B: Open-Source Deep Research Agent Trained on 32 H100s — Full Recipe, Weights, and Data Released (old.reddit.com)
- North Mini Code: 4-Bit Quant + Ollama + OpenRouter — Now Runs on 20GB (old.reddit.com)
- EdgeRazor: Mixed-Precision Quantization-Aware Distillation Down to 1.58-Bit (old.reddit.com)
- [Editorial] Qwable-3.6-27b — Open Model Release (huggingface.co)
- ScenemaAI/scenema-audio — Audio Generation Model (huggingface.co)
- Accelerating Fine-Tuning with NVIDIA NeMo AutoModel (huggingface.co)
- NVIDIA Claims 15x Speedup with Diffusion-Based LLM — Entire Block of Text Generated at Once (old.reddit.com)
- Computer Use in Gemini 3.5 Flash (blog.google)
- gemini-web2api: Convert Gemini Web to OpenAI-Compatible API — Zero Auth, Single File (github.com)
- [Editorial] Mistral OCR 4 (mistral.ai)
- AADvark: Agent-Aided Design for Dynamic CAD Models with Moving Parts (arxiv.org)
- "Tokaine Addiction" — PhD Student's Colleague Can't Stop Running AI Agents (old.reddit.com)
- Suitcase Robot Uses Gas Sensor to Modulate LLM Sampler Temperature in Real-Time (old.reddit.com)