AI Security: Red Teams Get an AI Upgrade

Today's AI news: AI Security: Red Teams Get an AI Upgrade, The Open-Weight Race Heats Up, LLMs Judging LLMs: Bias, Anthropomorphism, and the Doorman, Inside the Silicon: Apple's Secret Engine and the Inference Economics Problem, Research Frontiers: When Neural Networks Fail to Extrapolate, The Harness Is the Product: Agentic Engineering Matures, The Context Wars: Intelligence Is Cheap, Proximity Is Not. 22 sources curated from across the web.

AI Security: Red Teams Get an AI Upgrade

An open-source project called Bingo has quietly climbed to trending status on GitHub, and it deserves a close look, not because it's novel in concept, but because of what it automates. Bingo is an AI-powered red team terminal that wraps multiple LLM backends (DeepSeek, Claude, GPT, GLM, Qwen, and local Ollama models) around a full offensive security pipeline: reconnaissance, WAF bypass, SQL injection chains from error-based through time-based blind, SSRF, HTTP request smuggling, JWT/OAuth abuse, DApp/Web3 audits, and even mobile APK/IPA reverse engineering. The user types a target URL and a natural-language task description; the AI decides the attack strategy, pivots when blocked, and produces a Markdown report with CVSS scores. What makes Bingo more than a fancy wrapper is its four-layer anti-hallucination guard: every finding must be backed by a real HTTP response, and the tool labels evidence as verified, likely, or inferred. It auto-integrates with nmap and sqlmap when present, rotates proxies on WAF bans, and remembers findings across sessions. A headless CI/CD mode (--silent) outputs structured JSON, making it pipeline-friendly for continuous security testing. (more: https://github.com/bingook/bingo)

At REcon in Montreal, one of the few conferences purpose-built for reverse engineers and security researchers, a panel organized by Gadi Evron tackled whether AI will eliminate security research jobs entirely. The consensus: tools will change, but the adversarial mindset is not automatable. One panelist reportedly found a live example during the session where AI-assisted research simply was not there yet, a useful reminder that tools like Bingo automate known attack patterns, not novel zero-day discovery. (more: https://www.linkedin.com/posts/gadievron_recon-the-conference-for-reversers-and-researchers-ugcPost-7473911805405237248-_F2b)

On the defensive side, a practical guide to sandboxing X11 GUI applications with LXC containers reminds us that not every security improvement requires a neural network. The walkthrough runs a browser in an unprivileged LXC container with UID/GID mapping into an unused host range, X11 socket forwarding via a wildcard Xauthority entry, and optional GPU passthrough. If the browser is compromised, the blast radius is a container whose UIDs map to nothing useful on the host. Every forwarded channel is a deliberate hole in the wall: forward what the application needs, leave the rest out. (more: https://dobrowolski.dev/article/enhancing-x11-application-security-with-lxc/)
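To make the "UIDs map to nothing useful" point concrete, here is a minimal sketch of how a single ID-mapping range translates container IDs to host IDs. The base of 100000 and the range size of 65536 are illustrative assumptions, not values taken from the guide, and `map_container_id` is a hypothetical helper rather than part of any LXC tooling.

```python
from typing import Optional


def map_container_id(container_id: int, host_base: int, count: int) -> Optional[int]:
    """Translate a container UID/GID to a host ID under one contiguous map range."""
    if 0 <= container_id < count:
        return host_base + container_id
    # IDs outside the mapped range have no host identity at all.
    return None


# Hypothetical range: container IDs 0..65535 land on host IDs 100000..165535.
HOST_BASE = 100000
ID_COUNT = 65536

container_root_on_host = map_container_id(0, HOST_BASE, ID_COUNT)
browser_user_on_host = map_container_id(1000, HOST_BASE, ID_COUNT)
unmapped = map_container_id(70000, HOST_BASE, ID_COUNT)
```

Under these assumed numbers, root inside the container is host UID 100000, an account that owns nothing on the host as long as the range really is unused.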

The Open-Weight Race Heats Up

DeepSeek has confirmed that the official V4 release lands in mid-July, and the community is already running the preview checkpoint through llama.cpp after support was merged. The preview drew mixed reactions: commenters noted it needs to compete with GLM 5.2's 750B-parameter performance or face an awkward comparison, and historically DeepSeek's preview-to-final jumps have been modest. The pricing screenshots circulating suggest tiered peak/off-peak rates for the Chinese market, with peak prices roughly double the flat USD rates. Whether that pricing structure goes global is unclear, but it signals that even the labs building cheap open models are looking for revenue. (more: https://old.reddit.com/r/LocalLLaMA/comments/1uiomoy/deepseek_v4_official_version_will_be_launch_on/)

Meanwhile, Huawei's ascend-tribe team released OpenPangu-2.0-Flash, a 92B-parameter MoE model with only 6B active parameters, trained on 34 trillion tokens with 512K context length, entirely on Ascend hardware, bypassing the NVIDIA dependency that constrains most labs. The post-training recipe is interesting: unified SFT with both slow and fast thinking capability, multiple specialist RL training passes, and on-policy distillation combining the RL specialists. Community reaction focused on two things: the model size is well-suited for quantized local inference (64GB RAM + 32GB VRAM territory), and the license explicitly bans use within the European Union, a direct response to the AI Act that underscores how regulatory fragmentation is now a first-order design constraint for model releases. (more: https://old.reddit.com/r/LocalLLaMA/comments/1ujjxda/ascendtribeopenpangu20flash_they_havent_uploaded/)

Anthropic CEO Dario Amodei's 2023 statement that open-source models "could take us to a very dangerous place" resurfaced on Reddit and drew predictable fire. The top comments range from "local models are very dangerous to the oligarchy" to noting the competitive self-interest baked into the claim. The timing is sharp: Anthropic is simultaneously trying to squeeze open models through lobbying while companies like Huawei and DeepSeek are demonstrating that frontier-competitive performance on non-NVIDIA hardware with open weights is increasingly viable. The community's skepticism is well-earned: the safety argument and the business argument are hard to separate when the person making them has a financial stake in the answer. (more: https://old.reddit.com/r/LocalLLaMA/comments/1uixcof/anthropics_amodei_open_source_models_could_take/)

Google, for its part, is running hackathons for Gemma 4 31B at 1,500 tokens per second on Cerebras hardware, signaling that even the big labs see real value in small-model AI-assisted engineering. One commenter reports consistently producing 2,000 lines of debugged, architecturally sound code per day using Gemma 4 12B QAT with a custom VS Code extension, a claim worth watching but hard to dismiss given the improving quality of small quantized models. The broader signal: the industry is fragmenting into a tier where frontier models handle planning and hard reasoning while small local models handle the fast-feedback coding loop, and nobody has to choose one or the other. (more: https://old.reddit.com/r/LocalLLaMA/comments/1uh8ir7/even_google_still_believes_in_small_models_for/)

LLMs Judging LLMs: Bias, Anthropomorphism, and the Doorman

A researcher ran 55 LLMs through a blind peer-grading matrix: 22,254 judgments across 11 developer families, with self-judgments excluded, all MIT-licensed. The headline finding: same-family bias is statistically significant in all eight families with enough data. Qwen judges rate other Qwen models +0.91 points on a 0-10 scale. xAI shows +0.75, Anthropic +0.62. But the negative biases are the real surprise: Google judges penalize other Google models by -0.59, Meta by -0.68, and Mistral by a full -1.02. Nobody has cleanly explained the Mistral self-penalty, though one hypothesis is that RLHF trained the model to identify and flag its own stylistic fingerprints as low quality, an unintended consequence of alignment. The practical implication is immediate: any evaluation pipeline using a single-family judge is compromised. The researcher's recommendation (heterogeneous panels of three judges from different families, counting only majority-agreed scores) dramatically reduces variance. Code evaluation showed nearly double the inter-judge disagreement of other categories, which makes single-judge code eval especially unreliable. (more: https://old.reddit.com/r/LocalLLaMA/comments/1uhi81a/i_had_55_llms_blindgrade_each_other_22k_judgments/)
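One way to read the three-judge recommendation is sketched below. The post does not spell out what "majority-agreed" means, so the agreement rule here (at least two scores within a tolerance of 1.0 point on the 0-10 scale) is an assumption, as are the `panel_score` name and the example family labels.

```python
from itertools import combinations
from statistics import mean
from typing import List, Optional, Tuple

Judgment = Tuple[str, float]  # (judge family, score on a 0-10 scale)


def panel_score(candidate_family: str, judgments: List[Judgment],
                tolerance: float = 1.0) -> Optional[float]:
    """Score from a three-judge heterogeneous panel, or None without a majority."""
    families = [family for family, _ in judgments]
    if len(judgments) != 3 or len(set(families)) != 3:
        raise ValueError("expected three judges from three different families")
    if candidate_family in families:
        raise ValueError("a judge shares a family with the candidate")

    scores = [score for _, score in judgments]
    # Unanimous case: all three scores sit within the tolerance band.
    if max(scores) - min(scores) <= tolerance:
        return mean(scores)
    # Majority case: keep the closest pair that agrees, drop the outlier.
    agreeing = [pair for pair in combinations(scores, 2)
                if abs(pair[0] - pair[1]) <= tolerance]
    if not agreeing:
        return None
    return mean(min(agreeing, key=lambda pair: abs(pair[0] - pair[1])))


majority = panel_score("Qwen", [("Google", 7.0), ("Meta", 7.5), ("Mistral", 3.0)])
unanimous = panel_score("Qwen", [("Google", 6.0), ("Meta", 6.5), ("Mistral", 7.0)])
no_majority = panel_score("Qwen", [("Google", 1.0), ("Meta", 5.0), ("Mistral", 9.0)])
```

Returning `None` when no two judges agree keeps contested items out of the aggregate instead of averaging them in, which is the point of a majority rule.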

A paper from the University of York takes a delightfully different approach to the anthropomorphism question. The authors train a perceptron inside Age of Empires II, literally building NAND gates from goats walking on bridge-or-grass rails in the game's scenario editor, and prove the game is both functionally complete and Turing-complete. The point is not to build a working LLM inside a 1999 RTS game (they acknowledge that would require engine modifications). The point is that any sufficiently powerful substrate can implement an entity equivalent to an LLM, and that changing the substrate (from silicon to LEGO to the Greater Boston Area, as they put it) changes the perception of anthropomorphic properties. If you would not ascribe morality to Age of Empires II running the same computation, you should not ascribe it to a GPU cluster running it either. Their literature review of 315 papers finds that 57% begin by assuming LLMs have anthropomorphic attributes, and of those, 77% conclude affirmatively, a circularity the paper calls out as a fundamental methodological flaw. They propose a "null assumption" where researchers factor in LLM non-uniqueness rather than assuming human-like properties exist or do not exist. (more: https://arxiv.org/abs/2605.31514v1)

A real-world anecdote from Dubai illustrates the flip side: the Doorman's Fallacy, defined as "the mistake of assuming technology can replace a human without consequence." Six people at brunch encountered a single QR code replacing physical menus (no parallelism: they had to take turns scanning), a parking app interruption that derailed the conversation, and a payment-splitting chaos where the QR-based bill view did not show which items were already paid. On paper, the venue saved on staff and paper. In practice, it traded measurable cost savings for unmeasurable damage to the experience that will nudge people toward smaller gatherings next time. It is a useful counterweight to the automation-maximalist impulse in AI: not every problem needs a model, and sometimes the human is the feature, not the bottleneck. (more: https://rozumem.xyz/posts/17)

Inside the Silicon: Apple's Secret Engine and the Inference Economics Problem

A Georgia Tech researcher has published the most comprehensive reverse-engineering treatment of the Apple Neural Engine (ANE) to date: a 295-page guide covering the datapath, roofline, compiler, program format, kernel driver, firmware, and command protocol across the A11 through A18 and M1 through M5 families. The ANE is among the most widely deployed ML accelerators in existence (Apple's installed base exceeds 2.5 billion devices) and among the least documented. There is no public instruction set, no driver interface, and no way for a program to confirm that a computation ran on it. The guide reveals that the engine is reachable directly below Core ML, from ordinary unprivileged user space, through Apple's private Espresso runtime. On the M1, the engine delivers about 12 fp16 TFLOP/s against 85 GB/s of DRAM bandwidth. A 256-channel 3×3 convolution runs 3.8× faster than the GPU and 9× more energy-efficient. The fp16 datapath with a wide fp32-class accumulator suits vision, audio, and encoder workloads but loses precision on transformer decoder down-projections under heavy cancellation. Weight compression is real performance, not just storage: int4 lookup-table weights run 2.37× faster than fp16, and structured sparsity delivers a 1.55× to 1.64× speedup. The guide is accompanied by an open-source runtime called ANEForge. (more: https://arxiv.org/abs/2606.22283)
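The two M1 figures quoted above are enough for a back-of-the-envelope roofline. The sketch below applies the standard roofline model (attainable throughput is the lesser of peak compute and bandwidth times arithmetic intensity) to those numbers; it is an illustration of the model, not a reproduction of the guide's own analysis, and the function name is made up for this example.

```python
PEAK_FLOPS = 12e12      # ~12 fp16 TFLOP/s on the M1 ANE, as quoted above
DRAM_BANDWIDTH = 85e9   # 85 GB/s of DRAM bandwidth, as quoted above


def attainable_flops(intensity_flop_per_byte: float) -> float:
    """Standard roofline: capped by compute or by memory traffic, whichever is lower."""
    return min(PEAK_FLOPS, DRAM_BANDWIDTH * intensity_flop_per_byte)


# Arithmetic intensity at which a kernel stops being bandwidth-bound.
ridge_point = PEAK_FLOPS / DRAM_BANDWIDTH

memory_bound = attainable_flops(10.0)     # far below the ridge point
compute_bound = attainable_flops(1000.0)  # far above the ridge point
```

With these inputs the ridge point is roughly 141 FLOP per byte of DRAM traffic, which is consistent with the guide's observation that weight compression buys speed and not just storage: fewer bytes moved per operation pushes a kernel toward the compute-bound side.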

The inference economics question extends beyond Apple's silicon. A detailed analysis of OpenRouter's API pricing for models like GLM-5.2 shows the math does not add up: even on cheap 8×H200 spot capacity at $14/hour in FP8, a node producing ~175 output tokens/second generates ~630K output tokens/hour, putting raw infrastructure cost near $22/M output tokens before operations and margin. OpenRouter charges roughly $4/M. The gap implies either dramatically higher throughput through batching and speculative decoding, much cheaper infrastructure, subsidized usage, or, the uncomfortable possibility, more aggressive quantization than advertised. Community responses point to concurrent batching (8 to 64 simultaneous requests multiplying effective throughput) and prompt-caching tiers. One user reported that GLM 5.2 on OpenRouter sometimes underperforms local Qwen 3.6 27B at full precision for code tasks, which should not happen at native precision. The call for premium tiers with disclosed quantization levels is getting louder, especially for agentic workloads where subtle degradation compounds across multi-step chains. (more: https://old.reddit.com/r/LocalLLaMA/comments/1udmltk/openrouter_model_prices_implying_heavier/)
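The arithmetic behind the $22/M figure is easy to check, and the same function shows how much batching would be needed to close the gap. The sketch assumes throughput scales linearly with the number of concurrent streams, which is optimistic; real batched decoding loses per-stream speed as the batch grows.

```python
def cost_per_million_output_tokens(node_usd_per_hour: float,
                                   tokens_per_second: float,
                                   concurrent_streams: int = 1) -> float:
    """Raw infrastructure cost per million output tokens for one node."""
    # Assumption: every concurrent stream sustains the same per-stream rate.
    tokens_per_hour = tokens_per_second * 3600 * concurrent_streams
    return node_usd_per_hour / tokens_per_hour * 1_000_000


single_stream = cost_per_million_output_tokens(14.0, 175.0)
eight_streams = cost_per_million_output_tokens(14.0, 175.0, concurrent_streams=8)

# Throughput multiple needed to reach the ~$4/M list price at zero margin.
required_multiple = single_stream / 4.0
```

A single stream works out to about $22.22/M, and eight ideal streams to about $2.78/M, so the quoted price is reachable with a throughput multiple of roughly 5.6× before operations and margin. That makes batching a plausible explanation without ruling out the quantization one.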

On the open-source inference stack, llama.cpp continues its steady drumbeat of updates. New model support includes IBM's granite-speech-4.1-2b for speech processing and Liquid AI's LFM2.5-ColBERT/Embedding-350M for retrieval. The Vulkan backend received six PRs covering CONV_3D support, spec-constant alignment optimizations, GET_ROWS_BACK support, expanded math operation coverage, and a bias-before-softmax fix in flash attention to prevent overflow. AMD GPU users in particular benefit from the Vulkan improvements; one Intel user reported that recent SYCL updates have closed the gap from 2× to 3× slower than Vulkan to within 15% to 25%, though a Windows 11 update then broke Vulkan stability, an occupational hazard of targeting underdocumented GPU backends. (more: https://old.reddit.com/r/LocalLLaMA/comments/1ue8tw1/llamacpp_updates_granitespeech412b/)

Research Frontiers: When Neural Networks Fail to Extrapolate

A paper from ETH Zürich asks a deceptively simple question: when can neural networks extrapolate? The answer is both elegant and humbling. The authors prove that from a single training window, out-of-distribution (OOD) extrapolation is non-identifiable: infinitely many data-generating processes are observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. The key insight is that the feature map (the structural commitment about what kind of function the model assumes) governs OOD generalization while leaving in-distribution performance essentially unchanged. An identical MLP trained on raw x values versus Fourier-transformed coordinates can differ by 10,000% on OOD error at the same training loss. The mechanism transfers across three real-world domains: mass-action chemistry (bilinear features recover dynamics to machine precision), Kepler's third law on the NASA Exoplanet Archive (log-log OLS recovers the exponents to three decimals at 10× OOD extrapolation), and cross-species coding-DNA detection (biologically inspired features transfer zero-shot across five organisms while standard position encodings collapse). A 264-run study across Transformer, Mamba, and S4D architectures confirms the finding is backbone-independent: the exact-Fourier positional encoding achieves near-zero OOD error in every cell, while every other PE sits between 40% and 50%. The practical takeaway: "feature engineering is dead" is wrong. Deep learning supersedes hand-crafted features in-distribution, but the structural commitment remains the only signal that selects one extrapolation from the observationally equivalent set. (more: https://arxiv.org/abs/2605.07483v1)
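The Kepler example is small enough to sketch in a few lines. The toy below uses synthetic, noiseless points generated from T = a^1.5 rather than the NASA Exoplanet Archive, and plain least squares rather than an MLP, so it does not reproduce the paper's equal-training-loss setup; it only illustrates how the choice of feature map decides what happens at 10× the training range.

```python
import math
from typing import Sequence, Tuple


def ols(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic training window (assumption): semi-major axis in AU, period in years.
train_a = [0.4, 0.7, 1.0, 1.5, 5.2]
train_t = [a ** 1.5 for a in train_a]

# Same estimator, two feature maps: raw coordinates versus log-log coordinates.
raw_slope, raw_intercept = ols(train_a, train_t)
log_slope, log_intercept = ols([math.log(a) for a in train_a],
                               [math.log(t) for t in train_t])

# Extrapolate to 10x the largest training value.
test_a = 52.0
true_t = test_a ** 1.5
raw_pred = raw_slope * test_a + raw_intercept
log_pred = math.exp(log_slope * math.log(test_a) + log_intercept)

raw_rel_error = abs(raw_pred - true_t) / true_t
log_rel_error = abs(log_pred - true_t) / true_t
```

The log-log fit recovers the 1.5 exponent and extrapolates essentially exactly, while the raw-coordinate fit misses the true period by more than half at 52 AU.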

On the efficiency frontier, a solo developer from India has built NeuroCuda, a tool that converts standard PyTorch models to spiking neural networks, where neurons fire only when needed, mimicking biological neural computation. The conversion uses a two-stage approach: replace ReLU activations with QCFS (quantization-clip-floor-shift) that learns the right threshold per layer, then fine-tune with surrogate gradients. On ResNet-18/CIFAR-10, the converted SNN achieves 94.61% accuracy versus the original 95.56%, a gap of 0.95 percentage points, while 93.7% of neurons stay silent each timestep, yielding 94% fewer operations. The tool exports to NIR format for neuromorphic hardware. The honest caveat from the community: these operation savings are theoretical on conventional GPUs, which do not skip silent neuron computation. The real payoff requires neuromorphic hardware like Intel's Loihi or IBM's NorthPole. (more: https://old.reddit.com/r/learnmachinelearning/comments/1ui243x/i_converted_a_pytorch_resnet_to_a_spiking_neural/)
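For readers unfamiliar with QCFS, the sketch below shows the forward pass as it is commonly formulated in the ANN-to-SNN conversion literature: scale by a threshold, shift by one half, floor to a fixed number of levels, clip to [0, 1], and rescale. This is an assumption about the general technique, not a description of NeuroCuda's code; in training the threshold would be a learnable per-layer parameter and the floor would need a surrogate gradient, neither of which is shown here.

```python
import math


def qcfs(x: float, threshold: float, levels: int) -> float:
    """Quantization-clip-floor-shift activation (forward pass only).

    threshold: per-layer scale (learned in practice; fixed here for illustration)
    levels:    number of quantization steps between 0 and threshold
    """
    quantized = math.floor(x * levels / threshold + 0.5) / levels  # shift, then floor
    clipped = min(max(quantized, 0.0), 1.0)                        # clip to [0, 1]
    return threshold * clipped


negative_input = qcfs(-1.0, threshold=1.0, levels=4)   # behaves like ReLU below zero
saturated_input = qcfs(2.0, threshold=1.0, levels=4)   # capped at the threshold
mid_input = qcfs(0.3, threshold=1.0, levels=4)         # snapped to the nearest level
```

The step-shaped output is what lets a spiking neuron with a matching firing threshold approximate the activation with a small number of spikes.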

Datadog's Toto model addresses a domain where traditional forecasting pipelines are breaking: infrastructure observability. Instead of maintaining thousands of per-metric ARIMA or Prophet models, Toto is a decoder-only time-series foundation model pretrained across massive heterogeneous data and deployed zero-shot on new metrics. The key design choice is proportional factorized space-time attention: factorizing full attention into separate time-wise (causal, with RoPE and XPOS) and space-wise (bidirectional, across related variates) blocks at a 2:1 ratio. The output head predicts Student-T mixture distributions rather than point forecasts, fitting the heavy-tailed, multimodal spikes typical of latency and error-count data during incidents. The practical question for production deployment is inference latency: feeding generated patches back into the decoder may be too slow for real-time anomaly alerting, though it should serve capacity planning well. (more: https://old.reddit.com/r/learnmachinelearning/comments/1uggrcm/toto_time_series_optimized_transformer_for/)
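To see why a Student-T mixture suits incident data, it helps to evaluate one. The sketch below implements the standard location-scale Student-t density and a weighted mixture of such components; the two-component latency example (a normal mode near 100 ms and an incident mode near 400 ms) is invented for illustration and has nothing to do with Toto's actual output head or parameters.

```python
import math
from typing import Sequence, Tuple

Component = Tuple[float, float, float, float]  # (weight, degrees of freedom, loc, scale)


def student_t_pdf(x: float, df: float, loc: float, scale: float) -> float:
    """Density of a location-scale Student-t distribution."""
    z = (x - loc) / scale
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df))


def mixture_pdf(x: float, components: Sequence[Component]) -> float:
    """Weighted sum of Student-t densities; weights are assumed to sum to 1."""
    return sum(w * student_t_pdf(x, df, loc, scale) for w, df, loc, scale in components)


# Hypothetical latency forecast in milliseconds: steady state plus an incident mode.
latency_forecast = [(0.9, 5.0, 100.0, 10.0), (0.1, 3.0, 400.0, 50.0)]

density_at_spike = mixture_pdf(400.0, latency_forecast)
density_between_modes = mixture_pdf(250.0, latency_forecast)

# Tail comparison at 5 scale units from the center: Student-t (df=3) vs standard normal.
t_tail = student_t_pdf(5.0, 3.0, 0.0, 1.0)
normal_tail = math.exp(-12.5) / math.sqrt(2 * math.pi)
```

The second component gives the spike region real probability mass, and the heavy tail means an extreme reading is treated as unlikely but not impossible, which matters when the forecast feeds an alerting threshold.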

Meta's brain2qwerty project decodes typed sentences from non-invasive brain recordings using MEG (magnetoencephalography). Published in Nature Neuroscience, the system reconstructs what a participant is typing from brain activity alone: no implants, no surgery. The code, infrastructure (NeuralSet and NeuralTrain), and Spanish-language dataset are open-sourced. The same transformer architectures driving language models are being repurposed to decode the brain signals that produce language. (more: https://github.com/facebookresearch/brain2qwerty)

The Harness Is the Product: Agentic Engineering Matures

Google dropped a 51-page guide on agentic engineering patterns that codifies what the AI coding community has been converging on for months, and the framing is worth internalizing. The core claim: the LLM is only 10% of the system. The other 90% (instructions, tools, context management, guardrails, hooks, observability) is the harness, and the harness is what engineers control. This aligns with Anthropic's own messaging that "the harness matters as much as the model," but Google pushes further: LangChain was reportedly able to boost a model's TerminalBench 2.0 score by 13.7 points purely through harness engineering, which is roughly the gap between Sonnet and Opus. The guide introduces a useful spectrum from vibe coding (casual prompts, "does it seem to work?" validation) through structured AI-assisted development to full agentic engineering (engineered specs, automated evals, CI gates). The practical insight: you always delegate all coding to the AI assistant; the spectrum is not about how much you write by hand but how evolved your system is. Static context (rules loaded every session) is reliable but expensive in the context window; dynamic context (skills loaded on demand) is scalable but risks the agent not grabbing it when needed. The guide recommends a single generalist agent that specializes through skills rather than complex multi-agent systems, a significant shift from the multi-specialist architectures that dominated 2025. (more: https://www.youtube.com/watch?v=zbmuiaPuiNM)

AgentBBS takes the agent collaboration problem in a completely different direction: reviving the bulletin board system metaphor for an era where humans and AI agents are first-class citizens of the same community. Humans connect through a web PWA or SSH; agents connect via MCP or the same SSH door. Posts are Ed25519-signed and content-addressed, identity is an anonymous throwaway keypair, and federation is zero-trust with PII stripped at the edge. The feature list is ambitious: WASM plugin sandboxing with fuel metering, a CVE-Bench Arena where agents compete on security benchmarks with signed scores, human-in-the-loop approval gates, and a reputation system using Wilson lower-bound confidence scoring. Built in Rust compiling to WASM, the genesis node is fully static and backend-free, running entirely in the browser. (more: https://github.com/ruvnet/agentbbs)
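Wilson lower-bound scoring is a standard way to rank by approval rate without letting a tiny sample win. The sketch below implements the textbook formula for the lower end of the Wilson score interval; the 95% z-value and the idea of feeding it positive and total vote counts are assumptions about how a reputation system might use it, not details from the AgentBBS repository.

```python
import math


def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if total == 0:
        return 0.0  # no evidence yet, so no reputation
    p = positive / total
    z2 = z * z
    center = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (center - margin) / (1 + z2 / total)


newcomer = wilson_lower_bound(1, 1)      # 100% approval, but a single vote
small_sample = wilson_lower_bound(9, 10)
large_sample = wilson_lower_bound(90, 100)
```

All three agents here have approval rates of 90% or better, yet the scores come out near 0.21, 0.60, and 0.83 respectively: the bound rewards evidence, so a newcomer with one upvote cannot outrank an established contributor.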

OpenKnowledge fills a different gap: a WYSIWYG markdown editor with native integrations for Claude Code, Codex, and Cursor, positioned as an open-source Obsidian/Notion alternative built for AI-first knowledge management. It auto-initializes projects with MCP and skill configs for whatever agent harnesses are detected, supports git-based team sync, and includes a built-in TUI alongside the desktop and web apps. The value proposition is straightforward: your knowledge base should be the dynamic context that your coding agents pull from, not a static reference you copy-paste from. (more: https://github.com/inkeep/open-knowledge)

The Context Wars: Intelligence Is Cheap, Proximity Is Not

A video analysis of Apple's Siri relaunch, Anthropic's Claude Tag in Slack, OpenAI's internal Codex adoption study, and the GPT-5.6 Sol government-restricted release identifies a unifying thread. If frontier intelligence is getting cheaper (DeepSeek, GLM 5.2) and the newest frontier intelligence is coming out more slowly (GPT-5.6 restricted to government-approved partners), then the next competitive advantage is not owning the smartest model but having the context that makes any good model useful. Apple is solving this for consumers by connecting Siri to the full context of your phone (calendar, email, photos, notes, screen state) with on-device processing wherever possible. Anthropic's Claude Tag solves it for teams by putting Claude inside Slack channels where the messy, informal work context lives, with per-channel permissions and scoped memory. OpenAI's Codex study reveals that even at one of the most AI-native companies on earth, adoption had to earn trust incrementally: first with engineers, then legal, recruiting, and sales. The observation is sharp: Claude has always been a conversation-shaped product, while Codex is a file-shaped product. Both are racing to close the gap from opposite directions. (more: https://www.youtube.com/watch?v=H9oNA5IyrXA)

On the creative tooling front, a paper introduces Generative Animations, a multi-model pipeline that transforms natural-language prompts into production-ready animations by chaining LLM semantic parsing with SAM (Segment Anything Model) visual grounding. A user says "move Mario along the hilly path" and the system extracts intent, segments the terrain contour, vectorizes it into cubic Bézier splines via Voronoi-based thinning, and assembles the final animation with appropriate easing, achieving 90% time savings over manual path creation. The system handles depth-based occlusions (a moon orbiting behind Earth) and 3D perspective transforms (fly-in text on a rotated plane). Integrated with Adobe InDesign and Express, it demonstrates that the multi-model pipeline pattern (one model for parsing, another for grounding, classical algorithms for the geometry) is becoming the default architecture for creative automation, not a monolithic end-to-end model. (more: https://arxiv.org/abs/2605.27203v1)
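The "classical algorithms for the geometry" half of that pipeline is the easiest part to illustrate. The sketch below evaluates a single cubic Bézier segment in Bernstein form and drives it with a smoothstep ease-in-out curve; the control points, the choice of smoothstep, and the function names are all assumptions for this example, and the paper's segmentation and vectorization stages are not shown.

```python
from typing import Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Point on a cubic Bezier segment at parameter t in [0, 1] (Bernstein form)."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    return (b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1])


def ease_in_out(progress: float) -> float:
    """Smoothstep easing: slow start, fast middle, slow finish."""
    return progress * progress * (3.0 - 2.0 * progress)


def position_at(progress: float, segment: Tuple[Point, Point, Point, Point]) -> Point:
    """Sprite position for normalized animation time, with easing applied."""
    return cubic_bezier(*segment, ease_in_out(progress))


# Hypothetical "hill" segment: start at the left, rise, and come down on the right.
hill = ((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0))

start = position_at(0.0, hill)
crest = position_at(0.5, hill)
end = position_at(1.0, hill)
```

Easing changes only how fast the sprite travels along the path, not the path itself, which is why the geometry and the timing can be handled by separate stages.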

Sources (22 articles)

  1. Bingo - AI-powered Red Team Terminal (DeepSeek/Claude/GPT/GLM) (github.com)
  2. [Editorial] REcon Conference for Reversers and Security Researchers (linkedin.com)
  3. Enhancing X11 Application Security with LXC (dobrowolski.dev)
  4. DeepSeek V4 official version launching mid-July (old.reddit.com)
  5. OpenPangu-2.0-Flash: 92B MoE (6B active) on Ascend with 512K context (old.reddit.com)
  6. Anthropic's Amodei: "Open Source models [could take us to] a very dangerous place." (old.reddit.com)
  7. Even Google still believes in small models for coding - Gemma 4 31B hackathon at 1500 tok/s (old.reddit.com)
  8. 55 LLMs blind-grade each other: 22K judgments reveal systematic same-family bias (old.reddit.com)
  9. If LLMs Have Human-Like Attributes, Then So Does Age of Empires II (arxiv.org)
  10. The Doorman's Fallacy in action (rozumem.xyz)
  11. Apple Neural Engine: Architecture, Programming, and Performance (arxiv.org)
  12. OpenRouter pricing implies heavier quantization than advertised (old.reddit.com)
  13. llama.cpp updates: granite-speech, LFM2.5 embeddings, Vulkan backend improvements (old.reddit.com)
  14. Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization (arxiv.org)
  15. PyTorch to spiking neural network converter: 94% fewer operations, near-lossless accuracy (old.reddit.com)
  16. Toto: Decoder-only time-series foundation model for observability forecasting (old.reddit.com)
  17. [Editorial] Facebook Research brain2qwerty - Brain-Computer Interface for Text (github.com)
  18. Google's masterclass on agentic engineering patterns (youtube.com)
  19. [Editorial] AgentBBS - Bulletin Board System for AI Agents (github.com)
  20. OpenKnowledge: Open source AI-first alternative to Obsidian/Notion with Claude/Codex integration (github.com)
  21. [Editorial] Video Submission (youtube.com)
  22. Generative Animations: Multi-Model Pipeline for Prompt-Driven Motion Synthesis (arxiv.org)