AI Security and Safety Frameworks
Published on
Today's AI news: AI Security and Safety Frameworks, Local AI Development Tools and Frameworks, AI Model Performance and Comparison, AI Usage Economics a...
The security discourse around AI systems is undergoing a fundamental shift. Owais Drera's release of Agent-Slayer, a red-teaming framework, crystallizes what security researchers have been warning about: the real enterprise threat isn't chatbot jailbreaking—it's Excessive Agency, classified as LLM08 in vulnerability taxonomies. Agent-Slayer demonstrates how autonomous agents can be hijacked through Indirect Prompt Injection, with a chilling chain of exploitation: a poisoned text file placed in an authorized directory can override system logic when processed, forcing the agent to execute destructive operations like unauthorized file deletion (more: https://www.linkedin.com/posts/owais-drera-590750378_github-owaisdreraagent-slayer-activity-7419782518985486336-7WE3).
The defensive recommendations center on what Drera terms a "Zero-Trust Tooling model." The Principle of Least Agency dictates never granting agents more power than necessary—if reading is the only requirement, delete capabilities shouldn't exist. Human-in-the-Loop (HITL) protocols require manual approval for destructive actions, while Intent Capsules use structured delimiters to prevent LLMs from confusing documents with commands. This framework addresses what the UK's National Cyber Security Centre (NCSC) has identified as a deeper problem: LLMs fundamentally don't enforce security boundaries between instructions and data within prompts (more: https://www.linkedin.com/posts/resilientcyber_prompt-injection-activity-7420165497230454784-NOHa).
The comparison of prompt injection to SQL injection—a framing that initially seemed intuitive—is increasingly viewed as dangerously misleading. SQL injection was a parsing problem with known fixes; prompt injection is an incentive problem embedded in how LLMs function. As the NCSC analysis notes, LLMs are "inherently confusable" because they simply predict the next token without distinguishing between authoritative instructions and potentially malicious data. This manifests acutely in Model Context Protocol (MCP) deployments, where internal servers built as one-offs often lack guardrails. Some practitioners suggest creating a "prompt stack" with distinct layers for Instructions, Tools, Data, Outcomes, and Identity to lock down specific layers without excessive constraints.
Anshuman Bhartiya's experimental work with Claude Skills demonstrates one path forward. By integrating OWASP's Top 10 guide for Agentic Applications into threat modeling workflows, SecureVibes—a multi-agent security platform—showed dramatic improvements. Without skills, the vanilla STRIDE approach identified 30 threats but misclassified prompt injection as "Tampering." With skills augmentation, the system correctly categorized agentic-specific vulnerabilities (more: https://www.linkedin.com/posts/anshumanbhartiya_lets-talk-about-threat-modeling-and-skills-activity-7418130148312674305-arTh). Meanwhile, Reuven Cohen's Prime Radiant takes a different approach entirely, removing confidence scores from safety decisions. Instead of asking models how certain they feel—a metric that measures fluency rather than correctness—the system checks structural coherence through graph-based representations, producing deterministic, auditable decisions in sub-millisecond timeframes (more: https://www.linkedin.com/posts/reuvencohen_introducing-prime-radiant-a-real-time-activity-7420466084006223873-hOct).
The stakes are illustrated by Wiz Research's discovery of CodeBreach, a vulnerability that threatened AWS's entire console supply chain. Two missing characters in a Regex filter allowed unauthenticated attackers to infiltrate build environments and leak privileged credentials from key repositories including the AWS JavaScript SDK. The pattern echoes recent supply chain attacks like tj-actions and last July's attempted compromise of the Amazon Q VS Code extension (more: https://www.wiz.io/blog/wiz-research-codebreach-vulnerability-aws-codebuild).
Anthropic's recently published "advanced tool use" pattern—moving intermediate computation outside the model's context window—has inspired tooling that makes the approach viable for local models. The mcpx fork addresses a practical problem: MCP's upfront tool schema loading burns 40-50k tokens before work begins, devastating for 32k context models running locally. The solution treats tools as runtime-discoverable through bash commands rather than API-layer definitions, reducing overhead from approximately 47,000 tokens to around 400 (more: https://www.reddit.com/r/LocalLLaMA/comments/1qhgm0r/bringing_anthropics_advanced_tool_use_pattern_to/).
The implementation includes daemon mode for maintaining stateful connections—browser sessions, database handles—and globally disabled tools functioning like .gitignore for MCP. Critics note that similar compute-saving could theoretically come from spawning separate LLM instances without context, but on resource-constrained hardware like Macs, the speed benefits of the lighter approach remain compelling.
The push to democratize local AI extends beyond developers. Offloom, now with a Steam page, aims to bring local AI to non-technical users through a polished chatbot interface supporting document and web search RAG, image generation, text-to-speech via PocketTTS, and toggleable think modes. The 12GB VRAM recommendation suggests serious capability despite the consumer-friendly packaging (more: https://www.reddit.com/r/LocalLLaMA/comments/1qjl8wl/steam_page_is_live_time_for_nontechnical_folks_to/). For developers frustrated by opaque abstractions, Iris Agent offers a lightweight Python framework emphasizing transparent reasoning flows and explicit tool usage—designed for learning rather than competing with heavyweight frameworks (more: https://www.reddit.com/r/LocalLLaMA/comments/1qguuu5/built_a_lightweight_python_agent_framework_to/).
The native-devtools-mcp project extends MCP's reach to desktop application testing, mimicking Chrome DevTools protocol for native apps with Windows and Android support planned (more: https://www.reddit.com/r/LocalLLaMA/comments/1qhwxu5/nativedevtoolsmcp_an_mcp_server_for_testing/). Meanwhile, GPU-Hot provides real-time monitoring for NVIDIA GPUs with 30+ metrics streamed via WebSockets—particularly useful for tracking thermals and throttling during intensive local inference (more: https://www.reddit.com/r/ollama/comments/1qj3qkv/hi_folks_ive_built_an_opensource_project_that/).
The practical differences between similarly-sized models for agentic coding are proving dramatic. Devstral Small 2, despite being a dense model under 8B parameters (not MoE as some initially assumed), has emerged as the stability leader for development workflows. Users report it rarely encounters errors and reliably follows instructions. GLM 4.7 Flash, by contrast, demonstrates strong chain-of-thought reasoning in theory but consistently gets stuck in loops during actual use (more: https://www.reddit.com/r/LocalLLaMA/comments/1qikoi5/devstral_small_2_vs_glm_47_flash_for_agentic/).
The GLM issues appear to be implementation bugs rather than fundamental model problems—the community is actively working on fixes, with specific instructions from Unsloth regarding repeat penalties for GGUF versions. Some users suggest Aider may provide better "hand-holding" for these smaller models compared to Roo Code. The broader lesson: benchmark performance and real-world agentic capability remain frustratingly disconnected, and deployment environment matters enormously.
In specialized domains, HeartMuLa-oss-3B represents a new family of open-source music foundation models, accompanied by HeartCodec and HeartTranscriptor components. The 3B parameter model targets audio and music understanding, a domain where open alternatives to proprietary systems remain scarce (more: https://huggingface.co/HeartMuLa/HeartMuLa-oss-3B).
A provocative analysis on r/ChatGPTCoding frames $200/month AI subscribers as strategic loss leaders—users that OpenAI and Anthropic must win even at 10x subsidized costs. The reasoning: these power users serve as evangelists influencing their employers and communities, while simultaneously functioning as cheap researchers who push models to their limits, discovering capabilities the providers themselves don't fully understand (more: https://www.reddit.com/r/ChatGPTCoding/comments/1qgg33n/the_value_of_200_a_month_ai_users/).
The Uber/DoorDash comparison looms large. Just as ride-sharing and delivery services used venture capital to subsidize growth before "enshittification" set in, AI providers may be building dependency before extracting value. Skeptics argue AI isn't as essential as transportation—at $2000/month, few would see value, and corporate firewalling of models would simply end consumer participation. Proponents counter with lock-in theory: companies accustomed to rapid, 24/7 feature development won't easily return to traditional developer timelines, regardless of price increases or quality degradation.
The diminishing returns argument adds another dimension. As one commenter noted: "Would you pay 10x as much for an 81% correct model over an 80% one?" The Pareto principle suggests much of AI's current value proposition rests on being "good enough and stupid cheap" rather than exceptional.
Technical issues compound the economic uncertainty. Reports of Claude users instantly hitting usage limits have spawned investigation into potential "permanent memory leaks"—where accumulated system-side context might consume quota invisibly. If confirmed, this would mean users are paying for context they didn't request and can't see (more: https://www.reddit.com/r/ClaudeAI/comments/1qi1s0f/claude_permanent_memory_leak_this_could_be_the/).
Yann LeCun's involvement with Logical Intelligence and their Kona system—demonstrated solving Sudoku deterministically—has reignited debates about what constitutes progress toward AGI. Reuven Cohen offers a sharp critique: Sudoku is the perfect showcase for energy-based models precisely because its rules are static, its space finite, and correctness binary. Encode constraints, define an energy surface, minimize it, converge. This is constraint optimization in a closed world, not reasoning in the wild (more: https://www.linkedin.com/posts/reuvencohen_yann-lecun-claiming-agi-again-and-this-time-activity-7420241622916939776-gbLX).
The limitation surfaces when environments become unstable—when signals contradict, objectives collide, and ground truth arrives late or never. Energy-based models require well-specified worlds upfront; when reality drifts, intelligence stalls unless humans rewrite objectives. Cohen's alternative approach emphasizes hybrid mathematical design: RuVector combines learned embeddings, graph structure, and exact algorithms with incremental learning that updates continuously while coherence bounds constrain changes. Disagreement becomes measurable signal rather than hidden drift.
From a different angle, Charles H. Martin, PhD, describes experiencing "cognitive liberation" through LLM tooling. For decades, serious scientific work meant constant context-switching—wrestling with infrastructure, chasing library changes, debugging glue code. Now, the ability to hold entire mathematical and conceptual landscapes in mind without being dragged into tooling minutiae represents a fundamental shift. The willingness to pay OpenAI $200/month to never use Stack Overflow again captures something real about how AI is changing knowledge work (more: https://www.linkedin.com/posts/charlesmartin14_%F0%9D%97%9C%F0%9D%98%83%F0%9D%97%B2-%F0%9D%97%BB%F0%9D%97%B2%F0%9D%98%83%F0%9D%97%B2%F0%9D%97%BF-%F0%9D%97%B3%F0%9D%97%B2%F0%9D%97%B9%F0%9D%98%81-%F0%9D%98%80%F0%9D%97%BC-%F0%9D%97%B3%F0%9D%97%AE-activity-7410533713505079297-qvtE). Engineering practitioners report similar experiences—workflows that took weeks now complete in days, with AI serving as a 24/7 consultant for installation, usage, and debugging issues. The answers aren't always right, but with humans in the loop, problems resolve quickly.
SEO Research MCP brings Ahrefs data directly into AI-powered IDEs, enabling competitor backlink research, keyword generation, and traffic analysis without context-switching. The tool covers domain rating, anchor text, edu/gov links, keyword difficulty with full SERP breakdown, and traffic estimation (more: https://github.com/egebese/seo-research-mcp). The educational disclaimer—requiring users to comply with third-party terms of service—reflects ongoing tensions around AI tools that interface with external services.
The browser-use agent-sdk takes minimalism to its logical extreme with a philosophy that's almost aggressive: "An agent is just a for-loop. The simplest possible agent framework. No abstractions. No magic." The 300-line-per-provider implementation argues that agent frameworks fail not because models are weak, but because action spaces are incomplete. The bitter lesson applied: the less you build, the more it works (more: https://github.com/browser-use/agent-sdk).
Rust's ecosystem received significant updates with crates.io's security and publishing improvements. Crate pages now display security advisories from the RustSec database, showing known vulnerabilities and affected version ranges. Trusted Publishing expanded beyond GitHub Actions to support GitLab CI, and crate owners can now enforce Trusted Publishing to disable traditional API token-based publishing entirely—reducing unauthorized publish risks from leaked tokens. The workflow_dispatch and schedule GitHub Actions triggers are now blocked from Trusted Publishing following multiple ecosystem security incidents. A new published_at field in crate index entries enables future Cargo features like cooldown periods and historical dependency resolution replay (more: https://blog.rust-lang.org/2026/01/21/crates-io-development-update/).
Google's slow-building wall against cell tower spoofing attacks has reached a new milestone with Android's hidden detection capabilities. "Stingrays"—IMSI (international mobile subscriber identity) catchers—mimic legitimate cell towers to harvest device information and force phones onto older, unencrypted protocols, enabling call interception and SMS reading. Once primarily law enforcement tools, they're increasingly accessible to malicious actors (more: https://www.howtogeek.com/theres-a-hidden-android-setting-that-spots-fake-cell-towers/).
The defensive progression spans several Android versions: Android 12 (2021) introduced the ability to disable 2G connectivity—Stingrays' preferred network due to weak security. Android 14 added options to disable legacy encryption that facilitates SMS and call interception. Android 15 addressed the problem more directly with notifications when networks request device identifiers or attempt protocol downgrades. The "hidden setting" framing reflects that these protections exist but remain buried in menus most users never explore.
For those building storage infrastructure, TerabyteDeals provides current $/TB comparisons across enterprise drives (Western Digital Ultrastar, Seagate Exos), NAS drives (IronWolf, WD Red), and surveillance drives (WD Purple, Seagate SkyHawk). The listings span 4TB to 32TB capacities across SATA and SAS interfaces, with renewed options offering cost savings for non-critical applications (more: https://terabytedeals.com).
Steve Yegge's Gas Town project—whatever it ultimately becomes—has established formal community infrastructure with a dedicated website and Discord for daily announcements and progress updates. The v0.5.0 release addresses serious stability problems including a memory leak spawning "hundreds of Claude Code instances," alongside improved hooks support and dozens of bug fixes and community contributions (more: https://www.linkedin.com/posts/steveyegge_gas-town-hall-activity-7420008043712622592-Oh43).
The companion Beads project hits v0.49.0 as its final release on the SQLite+JSONL backend before migrating to Dolt—a change Yegge promises users won't notice without being told. The community engagement model, with aggressive moderation promised, reflects lessons learned from open-source project management: clear communication channels and transparent development progress reduce confusion while channeling energy productively. Early adopters report achieving stable, deterministic AI coding workflows through ATDD/TDD approaches with seven agents, running Claude Code in loops to reset context windows for extended autonomous sessions—the kind of real-world experimentation that validates (or breaks) new tools.
Sources (22 articles)
- [Editorial] https://www.linkedin.com/posts/owais-drera-590750378_github-owaisdreraagent-slayer-activity-7419782518985486336-7WE3 (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/resilientcyber_prompt-injection-activity-7420165497230454784-NOHa (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/anshumanbhartiya_lets-talk-about-threat-modeling-and-skills-activity-7418130148312674305-arTh (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/charlesmartin14_%F0%9D%97%9C%F0%9D%98%83%F0%9D%97%B2-%F0%9D%97%BB%F0%9D%97%B2%F0%9D%98%83%F0%9D%97%B2%F0%9D%97%BF-%F0%9D%97%B3%F0%9D%97%B2%F0%9D%97%B9%F0%9D%98%81-%F0%9D%98%80%F0%9D%97%BC-%F0%9D%97%B3%F0%9D%97%AE-activity-7410533713505079297-qvtE (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/reuvencohen_introducing-prime-radiant-a-real-time-activity-7420466084006223873-hOct (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/reuvencohen_yann-lecun-claiming-agi-again-and-this-time-activity-7420241622916939776-gbLX (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/steveyegge_gas-town-hall-activity-7420008043712622592-Oh43 (www.linkedin.com)
- [Editorial] https://www.wiz.io/blog/wiz-research-codebreach-vulnerability-aws-codebuild (www.wiz.io)
- Bringing Anthropic's "advanced tool use" pattern to local models with mcpx (www.reddit.com)
- native-devtools-mcp - An MCP server for testing native desktop applications (www.reddit.com)
- Built a lightweight Python agent framework to avoid “black box” abstractions, feedback welcome (www.reddit.com)
- devstral small 2 vs glm 4.7 flash for agentic coding (www.reddit.com)
- Steam page is live! Time for non-technical folks to enjoy local AI too (for free). (www.reddit.com)
- Hi folks, I’ve built an open‑source project that could be useful to some of you (www.reddit.com)
- The value of $200 a month AI users (www.reddit.com)
- Claude Permanent Memory Leak - This could be the cause of issue 16157 - instally hitting usage limits (www.reddit.com)
- browser-use/agent-sdk (github.com)
- egebese/seo-research-mcp (github.com)
- There's a hidden Android setting that spots fake cell towers (www.howtogeek.com)
- TerabyteDeals – Compare storage prices by $/TB (terabytedeals.com)
- Crates.io: Development Update (blog.rust-lang.org)
- HeartMuLa/HeartMuLa-oss-3B (huggingface.co)