Local LLM Performance and Optimization
The relentless march of performance improvements in llama.cpp continues to reshape what's possible with locally-run language models. According to recent discussion on the LocalLLaMA subreddit, NVIDIA engineers have contributed a substantial batch of optimizations to the popular open-source inference engine, with particularly notable gains in GPU token sampling, concurrent CUDA streams for QKV projections, and MMVQ kernel optimizations that pre-load data into registers to hide memory latency (more: https://www.reddit.com/r/LocalLLaMA/comments/1q5dnyw/performance_improvements_in_llamacpp_over_time/). Model loading times have improved by up to 65% on DGX Spark systems and 15% on consumer RTX GPUs—meaningful gains for anyone iterating quickly on local deployments.
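For anyone tracking these gains on their own hardware, a quick throughput measurement is easy to set up. A minimal sketch, assuming the llama-cpp-python bindings rather than the C++ binaries the Reddit thread benchmarks, and a placeholder GGUF path:

```python
# Minimal throughput check using the llama-cpp-python bindings (an assumption;
# the thread benchmarks llama.cpp's own binaries). The model path is a
# placeholder for any local GGUF file.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
    verbose=False,
)

prompt = "Explain KV caching in one paragraph."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")
```

Running the same measurement before and after pulling a new build is usually enough to see whether a given batch of optimizations actually lands on your hardware.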
The Blackwell architecture introduces native MXFP4 support through fifth-generation Tensor Cores, delivering approximately 25% faster prompt processing for compatible models. However, llama.cpp maintainer Remove_Ayys offered a sobering reality check: "the work put in by NVIDIA engineers specifically mostly benefits NVIDIA GPUs. Something like FP4 tensor cores for example also just doesn't exist on most hardware." AMD users aren't entirely left behind—ROCm translates CUDA to HIP at compile time, capturing some benefits automatically, and AMD engineers are actively contributing their own optimizations. A dedicated fork at github.com/iacopPBK/llama.cpp-gfx906 targets Mi50 architecture specifically, though upstreaming such work remains challenging given maintainer bandwidth constraints.
Quantization method selection has become increasingly consequential as the ecosystem matures. A comprehensive benchmark of 4-bit quantization methods in vLLM on Qwen2.5-32B using an H200 revealed stark performance differences: Marlin kernels achieved 712 tokens per second compared to 461 for baseline FP16—quantized yet faster (more: https://www.reddit.com/r/LocalLLaMA/comments/1q7ysj2/we_benchmarked_every_4bit_quantization_method_in/). GPTQ without Marlin kernels actually underperformed FP16 at 276 tokens per second, while AWQ showed anomalously slow results at 67 tokens per second, suggesting possible configuration issues. BitsandBytes demonstrated the smallest quality degradation and requires no pre-quantized weights, while GGUF showed paradoxically poor perplexity but strong HumanEval scores. The community correctly noted that vLLM's GGUF support remains experimental and unoptimized.
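The kind of head-to-head the thread describes can be reproduced with vLLM's offline Python API. A rough sketch under assumptions: the checkpoint name and quantization string are illustrative and vary by vLLM version; the original benchmark used Qwen2.5-32B on an H200:

```python
# Sketch of a quantization throughput comparison with vLLM's offline API.
# Model name and quantization string are assumptions; swap them per run
# (e.g. unquantized FP16 vs. a GPTQ checkpoint served via Marlin kernels).
import time
from vllm import LLM, SamplingParams

prompts = ["Summarize the tradeoffs of 4-bit quantization."] * 32
params = SamplingParams(max_tokens=128, temperature=0.0)

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", quantization="gptq_marlin")

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tok/s")
```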
For users attempting to scale beyond single-GPU constraints, the RPC implementation in llama.cpp reveals hard limitations. One user's experiment with four 3090 GPUs across two PCs connected via 50Gbit networking showed token generation dropping from 50 to 38 tokens per second when using RPC—even when the GPUs communicated over localhost on the same machine (more: https://www.reddit.com/r/LocalLLaMA/comments/1q9yd1w/llamacpp_rpc_experiment/). The bottleneck isn't network bandwidth but serialization overhead and kernel-level abstractions. RDMA support could help by bypassing the kernel for direct memory access, but llama.cpp doesn't currently implement it. For multi-GPU inference on models exceeding single-GPU VRAM, alternatives like vLLM with Ray or ExLlamaV3 with TabbyAPI may prove more suitable (more: https://www.reddit.com/r/LocalLLaMA/comments/1q6pq8m/gpu_inference_with_model_that_does_not_fit_in_one/).
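As a sketch of that alternative path, vLLM's tensor parallelism shards each layer's weights across GPUs instead of streaming activations through an RPC layer. The model name below is a placeholder, and multi-node setups additionally require a Ray cluster:

```python
# Sketch of the vLLM alternative mentioned above: tensor parallelism splits
# each layer across GPUs. Assumes two local GPUs; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # any model too large for one GPU
    tensor_parallel_size=2,                     # shard weights across 2 GPUs
    # pipeline_parallel_size=2,                 # optional: also split by layer blocks
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```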
The intersection of AI and cybersecurity has reached an inflection point where threat hunting workflows can now be expressed as reusable patterns for both human operators and AI agents. The Open Threat Research team has documented how their nearly decade-old Threat Hunter Playbook project is evolving to incorporate "Agent Skills"—packaged workflows, instructions, and supporting resources that allow AI agents to discover, load, and apply hunting procedures consistently (more: https://blog.openthreatresearch.com/evolving-the-threat-hunter-playbook-planning-hunts-with-agent-skills). The approach mirrors patterns emerging across the ecosystem: Anthropic's Claude skills, OpenAI's Custom GPTs, and GitHub Copilot's skill framework all share similar architectural DNA. The key insight is that threat hunting cannot be free-form—it requires structure during planning, execution, and reporting phases where discipline prevents speed from becoming noise.
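The post does not publish a schema, and each vendor's skill format differs, but the general shape of a packaged hunting procedure might look something like this purely illustrative sketch:

```python
# Illustrative only: not the Threat Hunter Playbook's actual format. This just
# shows the shape of a packaged hunting procedure an agent could discover,
# load, and apply consistently.
from dataclasses import dataclass, field

@dataclass
class HuntingSkill:
    name: str
    description: str            # what the hunt looks for, used for discovery
    data_sources: list[str]     # telemetry the hunt requires
    steps: list[str]            # ordered, human-readable procedure
    queries: dict[str, str] = field(default_factory=dict)  # named, reusable queries

skill = HuntingSkill(
    name="suspicious-remote-service-creation",
    description="Hunt for services installed remotely over SMB.",
    data_sources=["Windows Security 4697", "Sysmon Event ID 13"],
    steps=[
        "Scope hosts and time window before querying.",
        "Run the baseline query and diff against known admin tooling.",
        "Document findings and escalate confirmed hits.",
    ],
    queries={"baseline": "SELECT * FROM service_events WHERE remote = true"},
)
```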
This structured approach to AI-assisted security couldn't come at a more critical time. A watershed moment in cybersecurity has occurred with the first reported AI-orchestrated cyber espionage campaign: Chinese state-sponsored hackers deployed Claude agents equipped via Model Context Protocol with browser capabilities and open-source penetration testing tools to target approximately thirty organizations globally (more: https://maggiegray.us/p/the-age-of-ai-for-offensive-cyber). The agents autonomously conducted network reconnaissance, discovered an SSRF vulnerability, wrote custom exploit chains, harvested credentials, and exfiltrated sensitive data. Anthropic researchers estimate human operators performed only 10-20% of the exploitation work. This coincides with a significant shift in U.S. government posture—National Security Council officials have called to "destigmatize" offensive cyber operations, and the forthcoming National Cybersecurity Strategy will reportedly focus on imposing real costs through proactive takedowns.
The defensive side is rapidly developing countermeasures. Research on LLM fingerprinting demonstrates that attackers can identify which model powers an application with 95% accuracy using approximately eight queries—even when system prompts, RAG, or chain-of-thought implementations obscure the underlying model (more: https://www.linkedin.com/posts/resilientcyber_llm-fingerprinting-activity-7415849264452739072-H9fw). The implications are significant: once an attacker knows the specific model, they can craft targeted adversarial inputs, jailbreaks, or prompt injection attacks exploiting known vulnerabilities. Effective countermeasures remain challenging; the researchers advise assuming your LLM stack is fingerprintable and implementing defense in depth.
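Conceptually, fingerprinting reduces to sending a small set of fixed probes and matching the answers against responses previously collected from candidate models. The sketch below illustrates that idea only and is not the researchers' method; `ask` stands in for any chat endpoint:

```python
# Conceptual sketch of model fingerprinting (not the paper's method): compare
# answers to fixed probes against stored answers from known candidate models.
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable

PROBES = [
    "Repeat this exactly: zx12-qq",
    "What is 17 * 23?",
    "Complete the phrase: 'The quick brown fox'",
]

def fingerprint(ask: Callable[[str], str], signatures: dict[str, list[str]]) -> str:
    """Return the candidate model whose stored probe answers best match `ask`."""
    answers = [ask(p) for p in PROBES]
    scores = Counter()
    for model, stored in signatures.items():
        for got, expected in zip(answers, stored):
            scores[model] += SequenceMatcher(None, got, expected).ratio()
    return scores.most_common(1)[0][0]
```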
Meanwhile, the Kimwolf botnet has compromised an estimated 2 million devices globally, primarily through residential proxy network exploitation and Android Debug Bridge scanning (more: https://www.linkedin.com/posts/johnbruggeman_kimwolf-tldr-whattodo-activity-7413983885392396289-xsd4). The related Aisuru botnet is responsible for the largest publicly disclosed DDoS attack at 29.7 terabits per second. Android TV boxes marketed as offering "free" streaming access are particularly vulnerable—the "free" content is enabled by malware that resells users' internet connections. A critical LangChain vulnerability (CVE-2025-68664) with a CVSS score of 9.3 affects langchain-core's 847 million downloads, enabling serialization injection attacks that can extract environment secrets or trigger code execution (more: https://www.linkedin.com/posts/clintgibler_cybersecurity-ai-activity-7407102282120462337-6URK).
On the offensive security tooling front, multiple independent developers have achieved competitive results building autonomous penetration testing agents. The "deadend-cli" project achieved 77.55% success on XBOW validation benchmarks after six months of development—comparable to solutions requiring cloud dependencies (more: https://xoxruns.medium.com/feedback-driven-iteration-and-fully-local-webapp-pentesting-ai-agent-achieving-78-on-xbow-199ef719bf01). The architecture employs feedback-driven iteration: when tasks fail, the agent refines plans, changes tools, and continues iterating rather than giving up. This enabled solving blind SQL injection challenges where other implementations failed. The fully local execution model with custom sandboxed tools addresses the obvious concern of running offensive security tools through cloud APIs (more: https://www.linkedin.com/posts/yass-99637a105_i-spent-the-last-couple-of-months-building-activity-7415098924224499714-lCDV).
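The loop itself is simple to express. A hedged sketch, with hypothetical `plan`, `execute`, and `refine` hooks standing in for the agent's LLM calls and sandboxed tools (deadend-cli's actual implementation details are not reproduced here):

```python
# Sketch of a feedback-driven iteration loop. `plan`, `execute`, and `refine`
# are hypothetical callables wrapping the agent's LLM and tool sandbox.
def run_task(task: str, plan, execute, refine, max_iters: int = 10):
    current_plan = plan(task)
    for attempt in range(max_iters):
        result = execute(current_plan)
        if result.success:
            return result
        # Failure is treated as signal, not a dead end: feed the error and
        # partial output back in, possibly switching tools or payloads.
        current_plan = refine(task, current_plan, feedback=result.output)
    return None
```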
A provocative thesis is circulating among AI infrastructure thinkers: the industry may have spent $200 billion solving the wrong problem. In a recent interview, OpenAI co-founder Ilya Sutskever stated, "I don't think current hardware is a limitation. It's just not the case" (more: https://www.linkedin.com/posts/stephenbklein_the-age-of-pretend-the-ai-industry-just-spent-activity-7415779694509219842-8OkK). The person who helped build GPT isn't worried about compute—he's worried about ideas. The real constraint, according to this analysis, traces back to a design decision made in 1945: the von Neumann architecture's separation of memory and processing.
The numbers paint a stark picture of the data movement problem. Accessing off-chip memory consumes approximately 200 times more energy than the computation itself. Roughly 80% of Google TPU energy goes to electrical connections rather than mathematical operations. A 70-billion-parameter model moves approximately 140 GB of data just to generate a single token—the actual matrix multiplication is trivial by comparison. While training costs for frontier models have scaled roughly 750 times every two years, memory bandwidth has grown only 1.6 times. Peak hardware FLOPS increased 60,000 times while memory bandwidth crawled along. The result: faster engines forced to drink through a straw.
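The 140 GB figure is straightforward to sanity-check: a dense 70-billion-parameter model in 16-bit precision must stream every weight once per generated token, which also bounds single-stream decode speed by memory bandwidth. A back-of-the-envelope check, with the bandwidth figure assumed to be roughly that of an H100 SXM:

```python
# Back-of-the-envelope check of the data-movement claim above.
params = 70e9
bytes_per_param = 2                                  # FP16/BF16
bytes_per_token = params * bytes_per_param
print(f"{bytes_per_token / 1e9:.0f} GB per token")   # ~140 GB

hbm_bandwidth = 3.35e12                              # ~3.35 TB/s, roughly an H100 SXM
print(f"~{hbm_bandwidth / bytes_per_token:.0f} tok/s ceiling at batch size 1")
```

At those numbers the ceiling sits in the low tens of tokens per second for single-stream decoding, which is why batching, quantization, and memory-centric architectures dominate the optimization conversation.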
IBM Research is exploring in-memory computing that processes data where it resides, with early results suggesting 100-1,000 times energy efficiency gains—a potential paradigm shift rather than incremental optimization. Yet most AI discourse remains fixated on NVIDIA earnings calls and GPU supply chains. A parallel line of thinking suggests the future lies in "agentic chips" built on fundamentally different principles (more: https://www.linkedin.com/posts/reuvencohen_most-people-talk-about-gpus-as-if-they-are-activity-7415778737486483456-7DQK). Where GPUs are numeric engines executing identical instructions across massive parallel data—multiply, add, accumulate—agentic architectures would feature many small autonomous cores running bounded loops, waking only when events arrive, reasoning locally, and emitting signals or staying silent. The governing principle shifts from "compute is default, control is afterthought" to "control is first principle, compute is permissioned." GPUs as muscle, agentic chips as nervous systems. Whether this represents genuine architectural insight or speculative marketing remains to be seen, but the underlying criticism of current approaches resonates with Sutskever's skepticism.
The challenge of making AI agents genuinely useful for software development increasingly centers on code intelligence—the ability to understand, navigate, and reason about codebases at a structural level rather than treating source files as undifferentiated text. GitNexus represents an ambitious attempt at building a fully client-side code intelligence engine that runs entirely in the browser, with no data leaving the machine except calls to the LLM provider (more: https://www.reddit.com/r/LocalLLaMA/comments/1q5t0hr/building_opensource_zero_server_code_intelligence/). The architecture parses repositories into graph structures using Abstract Syntax Trees, generates embeddings via in-browser models, and stores everything in a WebAssembly-powered graph database. Users can visualize the codebase structure while AI agents query the graph via Cypher, perform semantic search, and highlight relevant nodes.
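GitNexus itself runs its parsers and WASM graph store in the browser, but the AST-to-graph pass it describes can be illustrated with Python's own `ast` module. A rough, single-file analogue:

```python
# Rough analogue of an AST-to-graph pass using Python's ast module (GitNexus
# parses other languages in the browser; this only shows the shape of the idea).
import ast

def code_graph(source: str, path: str) -> list[tuple[str, str, str]]:
    """Return (caller, relation, callee) edges for functions defined in one file."""
    tree = ast.parse(source, filename=path)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for call in ast.walk(node):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    edges.append((node.name, "CALLS", call.func.id))
    return edges

print(code_graph("def a():\n    b()\n\ndef b():\n    pass\n", "demo.py"))
# [('a', 'CALLS', 'b')]
```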
The privacy-first approach addresses a genuine pain point for developers under strict security restrictions who cannot send proprietary code to cloud services. Potential downstream uses include exposing an MCP server from the browser itself for Cursor or Windsurf to perform codebase-wide audits and blast radius detection of code changes. One commenter captured the appeal: "would love to use this to inform my claude code agents as their standard goto source for looking up stuff. in combination with skills like 'analyze 3 hops into that direction' or something like that."
A related tool, claudemem, tackles a specific limitation in AI coding assistants: the gap between exact-match search (grep/glob) and semantic understanding (more: https://github.com/MadAppGang/claudemem). When developers search for "where do we handle auth tokens" or "error retry logic," exact matching fails. Claudemem uses tree-sitter to parse code into semantically meaningful chunks—functions and classes rather than arbitrary line counts—generates embeddings via OpenRouter, and stores everything locally in LanceDB. The search combines keyword matching with vector similarity. The tool runs as an MCP server, providing Claude Code with semantic search capabilities that auto-index changes. Benchmarks on real code search tasks demonstrate meaningful improvements over baseline approaches.
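The hybrid scoring idea is easy to sketch in isolation. The weighting scheme and the in-memory chunk store below are assumptions made for illustration; claudemem itself uses tree-sitter chunks, OpenRouter embeddings, and LanceDB:

```python
# Toy hybrid search: blend a keyword score with cosine similarity over chunk
# embeddings. The alpha weight and in-memory chunk store are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(query, query_vec, chunks, k=5, alpha=0.6):
    """chunks: list of dicts with 'text' and 'vec'. Higher score = better match."""
    terms = query.lower().split()
    scored = []
    for c in chunks:
        keyword = sum(t in c["text"].lower() for t in terms) / len(terms)
        semantic = cosine(query_vec, c["vec"])
        scored.append((alpha * semantic + (1 - alpha) * keyword, c["text"]))
    return sorted(scored, reverse=True)[:k]
```

Blending the two signals is what lets a query like "error retry logic" land on a function named `backoff_and_resend` even when no keyword overlaps.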
The enterprise development workflow represents another frontier. One developer has open-sourced what he calls an "AI Coding Factory"—an internal software delivery platform where AI agents follow corporate R&D rules not approximately but by the book (more: https://www.linkedin.com/posts/ownyourai_saturday-morning-build-note-last-night-i-activity-7415690932068552705-BKUt). An ideation agent writes INVEST-compliant user stories, a dev agent implements Clean Architecture patterns, a QA agent blocks merges below 85% coverage, a security agent scans and rejects vulnerable PRs, and a DevOps agent integrates with Azure DevOps and GitHub Actions. The key observation: "The Definition of Done stopped being negotiable." Everything runs locally on the developer's own infrastructure using .NET 8 and "boring enterprise patterns." The author frames it as "private Lovable, but for .NET enterprise apps"—emphasizing that cloud AI is aligned with the company that owns it, not with users.
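The gates are the interesting part: merges are blocked mechanically rather than by convention. A toy illustration, with thresholds and check names assumed rather than taken from the author's pipeline:

```python
# Toy version of non-negotiable merge gates; thresholds and check names are
# assumptions, not the author's actual pipeline.
def definition_of_done(coverage: float, vulnerabilities: int, stories_invest: bool) -> list[str]:
    failures = []
    if coverage < 0.85:
        failures.append(f"QA gate: coverage {coverage:.0%} below 85% threshold")
    if vulnerabilities > 0:
        failures.append(f"Security gate: {vulnerabilities} open finding(s)")
    if not stories_invest:
        failures.append("Ideation gate: user stories are not INVEST-compliant")
    return failures  # an empty list means the merge may proceed

print(definition_of_done(coverage=0.82, vulnerabilities=1, stories_invest=True))
```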
A new theoretical framework published on arXiv offers a mathematically grounded explanation for why language models hallucinate and, more importantly, how to predict and prevent these failures (more: https://arxiv.org/abs/2509.11208). The researchers observe that LLMs perform near-Bayesian inference yet violate permutation invariance on exchangeable data—a fundamental inconsistency. Their resolution: transformers minimize expected conditional description length (cross-entropy) over orderings rather than permutation-invariant description length, making them "Bayesian in expectation, not in realization."
The practical implications emerge from three key theoretical results. First, a Quantified Martingale Violation bound shows order-induced deviations scale with sequence length. Second, the Expectation-level Decompression Law links information budgets to reliability for simple predicates—essentially quantifying how much context is needed to support a given claim. Third, deployable planners enable principled answer/abstain decisions. Empirically, hallucinations dropped by roughly 9% per additional nat of information, and a pre-specified audit achieved near-zero hallucinations through calibrated refusal at 24% abstention.
The framework reframes hallucinations as "predictable compression failures"—the model has insufficient information budget to reliably generate the requested output. This connects to a practical toolkit called Pythea that detects procedural hallucinations before they ship (more: https://github.com/leochlon/pythea/tree/main/strawberry). The core insight: LLMs hallucinate because they compress—the answer may be in the context, but the model doesn't route to it correctly. The toolkit detects these failures mathematically using only API outputs and logprobs, computing whether observed bits fall below required bits for justified confidence. It catches evidence-independent answers (training data bleed), partial evidence, multi-source conflation, lying comments where code contradicts documentation, and interpretive leaps stated as fact.
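A minimal illustration of that budget check, assuming nothing about Pythea's internals beyond the description above: score the same answer with and without the supporting evidence in the prompt, and abstain when the evidence contributes too few nats:

```python
# Illustrative decision rule only -- not Pythea's exact computation. Compare the
# log-probability of an answer scored with and without the supporting context;
# if the context adds too little information, the answer is evidence-independent
# (training-data bleed or a guess), so abstain.
def answer_or_abstain(logp_with_context: list[float],
                      logp_without_context: list[float],
                      required_nats: float = 3.0) -> str:
    """Each argument is the per-token logprob (natural log) of the same answer,
    scored once with the evidence in the prompt and once without it."""
    gained = sum(logp_with_context) - sum(logp_without_context)
    return "answer" if gained >= required_nats else "abstain"

# Example: the evidence barely moves the answer's likelihood -> refuse.
print(answer_or_abstain([-0.2, -0.1, -0.3], [-0.3, -0.2, -0.4]))  # 'abstain'
```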
The Cloud Security Alliance has published guidance that cuts through the AI hype with a deceptively simple question: "Why? What is our desired outcome?" (more: https://cloudsecurityalliance.org/blog/2026/01/09/the-first-question-security-should-ask-on-ai-projects). The author, who spent 2025 transitioning from talking about AI security to advising organizations on active projects, observes enterprises adopting AI more rapidly than any major technology in the past 25 years—often driven by genuine fear of missing out rather than clear business objectives.
The symptoms of FOMO-driven adoption are recognizable: organizations kicking off projects without defining goals in concrete terms, outside consultants promising 75% headcount reductions for functions where generative AI couldn't possibly help, teams automating workflows that had already been automated as thoroughly as possible. The recommended intervention is straightforward but powerful: ask "Why are we using AI? What is our desired business outcome? How do we measure success? How do we measure failure?" followed by "How will this specific use of AI enable that desired outcome?"
This has direct security implications. Pushing into specifics about purpose and results forces deeper discussion on architecture, human interaction, data access, and other elements security teams need to understand for proper risk assessment. A concrete example: "We want a chatbot to reduce low-level customer service interactions requiring a representative" leads naturally to questions about what data is needed, whether that data is permitted for use with the particular AI service or model, and what guardrails prevent the chatbot from making unauthorized commitments. The alternative—security teams trying to assess risk on vaguely defined AI initiatives—produces theater rather than protection.
How does one person end up with hundreds of customers across dozens of countries, including a substantial portion of the Fortune 50, with zero employees? The answer, according to one practitioner, lies in agentic workflows rather than magic (more: https://www.linkedin.com/posts/reuvencohen_people-ask-how-one-guy-ends-up-with-hundreds-activity-7415789489882677248-ao_M). The approach centers on building leverage through capability scaling rather than hour scaling—agents run research, draft deliverables, prep client context, and keep projects moving forward while the human orchestrator focuses on understanding how pieces fit together.
A critical enabler is persistent memory across conversations. AI systems carry long-term memory across dozens of daily conversations, remembering context, decisions, loose threads, and why something mattered months ago. This eliminates the constant cognitive overhead of reloading context—the past is indexed so thinking can focus forward. The giving-more-than-expected strategy includes free sessions, events, public idea sharing, and open-source software with hundreds of thousands of monthly downloads. Most users will never pay; the point is reach and trust at scale.
Conversion remains deliberately frictionless: book time from the website, credit card, done. The hourly rate is kept purposefully at $500—low enough that anyone willing to pay is serious, while enterprise retainers provide the primary revenue. The underlying philosophy: pick a direction and execute for 12 to 24 months, ship weekly, write daily, treat learning like a production line. The author suggests two years as the magic number for achieving genuine success at anything. Notably, he acknowledges that one human team member handles calendar management and "keeps the system humane"—a reminder that even highly automated operations benefit from human coordination.
Sources (20 articles)
- [Editorial] https://github.com/MadAppGang/claudemem (github.com)
- [Editorial] https://blog.openthreatresearch.com/evolving-the-threat-hunter-playbook-planning-hunts-with-agent-skills (blog.openthreatresearch.com)
- [Editorial] https://maggiegray.us/p/the-age-of-ai-for-offensive-cyber (maggiegray.us)
- [Editorial] https://www.linkedin.com/posts/resilientcyber_llm-fingerprinting-activity-7415849264452739072-H9fw (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/stephenbklein_the-age-of-pretend-the-ai-industry-just-spent-activity-7415779694509219842-8OkK (www.linkedin.com)
- [Editorial] https://github.com/leochlon/pythea/tree/main/strawberry (github.com)
- [Editorial] https://arxiv.org/abs/2509.11208 (arxiv.org)
- [Editorial] https://www.linkedin.com/posts/reuvencohen_most-people-talk-about-gpus-as-if-they-are-activity-7415778737486483456-7DQK (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/reuvencohen_people-ask-how-one-guy-ends-up-with-hundreds-activity-7415789489882677248-ao_M (www.linkedin.com)
- [Editorial] https://cloudsecurityalliance.org/blog/2026/01/09/the-first-question-security-should-ask-on-ai-projects (cloudsecurityalliance.org)
- [Editorial] https://www.linkedin.com/posts/ownyourai_saturday-morning-build-note-last-night-i-activity-7415690932068552705-BKUt (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/johnbruggeman_kimwolf-tldr-whattodo-activity-7413983885392396289-xsd4 (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/clintgibler_cybersecurity-ai-activity-7407102282120462337-6URK (www.linkedin.com)
- [Editorial] https://xoxruns.medium.com/feedback-driven-iteration-and-fully-local-webapp-pentesting-ai-agent-achieving-78-on-xbow-199ef719bf01 (xoxruns.medium.com)
- [Editorial] https://www.linkedin.com/posts/yass-99637a105_i-spent-the-last-couple-of-months-building-activity-7415098924224499714-lCDV (www.linkedin.com)
- We benchmarked every 4-bit quantization method in vLLM 👀 (www.reddit.com)
- Building opensource Zero Server Code Intelligence Engine (www.reddit.com)
- Gpu inference with model that does not fit in one GPU (www.reddit.com)
- Llama.cpp rpc experiment (www.reddit.com)
- Performance improvements in llama.cpp over time (www.reddit.com)