🖥️ Progress in Local LLMs: Speed, Context, Vision

The rapid evolution of local large language model (LLM) serving continues to break technical barriers once considered out of reach for consumer-grade hardware. The llama-server project now demonstrates Gemma 3 27B running with vision capabilities and a staggering 100,000-token context window on a single 24GB GPU. Performance optimizations, like sliding window attention (SWA), enable throughput up to 35 tokens per second on an RTX 3090 and maintain impressive coherence even at extreme context lengths. Even older enterprise GPUs like the P40, with less VRAM and compute, remain surprisingly viable—dual P40s can achieve nearly 16 tokens per second, underscoring the democratization of local AI compute (more: [url](https://www.reddit.com/r/LocalLLaMA/comments/1kzcalh/llamaserver_is_cooking_gemma3_27b_100k_context)).

This leap in context size isn’t just a technical flex; it unlocks practical workflows. Users can now drop entire codebases or voluminous documents into a prompt and expect coherent, contextually aware responses. However, such feats require aggressive quantization: a Q4 key-value (KV) cache is necessary to fit 100K context in 24GB of VRAM, while the higher-precision Q8 cache pushes requirements beyond 30GB. Vision support, while powerful, further increases memory demands. The community is actively documenting optimal configurations for various hardware setups, making these advances accessible to more users.
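As a rough illustration of the kind of launch configuration involved, the sketch below starts llama-server with a 100K context and a 4-bit quantized KV cache. The flag names follow recent llama.cpp builds, but the file names and exact values are placeholders rather than a verified recipe from the post.

```python
# Minimal sketch of a long-context llama-server launch for Gemma 3 27B with vision.
# Flag names follow recent llama.cpp builds; paths and values are placeholders.
import subprocess

cmd = [
    "llama-server",
    "-m", "gemma-3-27b-it-Q4_K_M.gguf",    # quantized model weights (placeholder filename)
    "--mmproj", "mmproj-gemma-3-27b.gguf",  # vision projector for image input (placeholder)
    "-c", "100000",                         # 100K-token context window
    "-ngl", "99",                           # offload all layers to the GPU
    "-fa",                                  # flash attention; some builds use --flash-attn
    "--cache-type-k", "q4_0",               # 4-bit quantized K cache
    "--cache-type-v", "q4_0",               # 4-bit quantized V cache (needs flash attention)
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```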

Yet, not all is seamless. While performance and capabilities surge ahead, VRAM management remains a pain point. Ollama, a popular local LLM runner, has been reported to retain VRAM after model inference, only releasing it upon process termination. Although this is not a critical issue—resources are freed when the program exits—it is symptomatic of the rough edges that persist as the local AI stack matures (more: [url](https://www.reddit.com/r/ollama/comments/1l8r68f/ollama_not_releasing_vram_after_running_a_model)).
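One commonly cited workaround, assuming Ollama's documented keep_alive parameter behaves as described in its API docs, is to request an explicit unload instead of waiting for the process to exit:

```python
# Ask the local Ollama server to unload a model (and free its VRAM) without
# killing the process. The endpoint and parameter follow Ollama's REST API
# documentation; the model name is a placeholder.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:27b", "keep_alive": 0},  # keep_alive=0 requests an immediate unload
    timeout=30,
)
```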

Handling massive context windows on limited hardware requires both clever engineering and new algorithms. KVzip, a recently released method for query-agnostic KV cache eviction, promises transformative improvements: 3–4× memory reduction and halved decoding latency for supported models like Qwen3, Gemma3, and LLaMA3. By compressing the key-value cache—essential for maintaining history in transformer models—KVzip enables longer conversations and larger documents to be handled without ballooning memory costs. The method is model-agnostic and open-sourced, inviting experimentation and integration into local LLM workflows (more: [url](https://www.reddit.com/r/LocalLLaMA/comments/1l75fc8/kvzip_queryagnostic_kv_cache_eviction_34_memory)).
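To make the idea concrete, here is a deliberately simplified, generic sketch of KV cache eviction; it is not KVzip's query-agnostic scoring method, only an illustration of how dropping low-importance cache entries shrinks memory and speeds up decoding.

```python
# Illustrative sketch of KV cache eviction in general, NOT the KVzip algorithm:
# score each cached position by some importance measure and keep only the top
# fraction of keys/values. KVzip's actual scoring is described in the linked repo.
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor, keep_ratio: float = 0.3):
    """keys/values: (seq_len, head_dim). Returns compressed caches."""
    # Toy importance score: L2 norm of each key vector (a real method would use
    # a learned or attention-derived criterion).
    scores = keys.norm(dim=-1)
    k = max(1, int(keys.shape[0] * keep_ratio))
    idx = scores.topk(k).indices.sort().values  # keep surviving positions in original order
    return keys[idx], values[idx]

keys, values = torch.randn(1024, 128), torch.randn(1024, 128)
small_k, small_v = evict_kv(keys, values, keep_ratio=0.25)  # ~4x smaller cache
```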

For users who routinely interact with large projects or datasets, tools like llmcontext automate the process of gathering an entire codebase into a single, LLM-friendly text file (excluding binaries and with metadata for non-text assets). This enables users to leverage the expanded context windows now possible in high-end models—though, as always, users must be vigilant to avoid leaking sensitive information in these aggregated dumps (more: [url](https://www.reddit.com/r/ChatGPTCoding/comments/1l8qnn9/llmcontext_attach_you_whole_project_in_large)).
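The core mechanic is simple enough to sketch: walk the project tree, skip junk directories and binary files, and concatenate everything else with path headers. The snippet below is a minimal stand-in, not llmcontext's actual format or filtering.

```python
# Minimal sketch of the idea behind tools like llmcontext: dump a project into
# one LLM-friendly text file, with headers per file and size notes for binaries.
from pathlib import Path

SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}

def dump_project(root: str, out_file: str = "context.txt") -> None:
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file() or SKIP_DIRS & set(path.parts):
                continue
            try:
                text = path.read_text(encoding="utf-8")  # binary files fail to decode
            except (UnicodeDecodeError, PermissionError):
                out.write(f"\n===== {path} (binary, {path.stat().st_size} bytes) =====\n")
                continue
            out.write(f"\n===== {path} =====\n{text}\n")

dump_project(".")
```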

The combination of context window advances, memory optimization, and practical tooling is redefining what “local” LLMs can achieve. Where once only cloud-based giants could process vast knowledge bases or entire software repositories in a single session, the capability is now within reach for solo developers and small teams.

The Model Context Protocol (MCP) is rapidly emerging as the de facto standard for integrating LLMs with diverse data sources and tools. MCP acts as a universal interface—akin to USB-C for hardware—allowing LLM hosts (such as IDEs, chatbots, or AI agents) to connect with a rich ecosystem of data and functionality providers (more: [url](https://github.com/Ta0ing/MCP-SecurityTools)).

Open source projects are accelerating MCP adoption. One lightweight MCP server, now available as a 90MB Docker container, enables seamless context sharing between AI tools. It supports self-hosting, persistent context storage via SQLite, and automatic document re-embedding when users switch embedding models. This persistence means users can recall information with semantic search across any connected AI tool, not just by text match. Extensions for platforms like Obsidian and browsers are in the works, pointing toward a future where one’s digital knowledge is always at hand, securely and privately (more: [url](https://www.reddit.com/r/LocalLLaMA/comments/1l0uccd/i_built_a_lightweight_private_mcp_server_to_share)).
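For a sense of how small such a server can be, the sketch below wires a SQLite-backed remember/recall pair into the official Python MCP SDK's FastMCP helper. It is a toy: the real project layers embeddings, semantic search, and automatic re-embedding on top of this kind of skeleton.

```python
# Toy MCP server in the spirit of the project above, using the official Python
# SDK's FastMCP helper and a SQLite store. Semantic search is replaced by a
# naive substring match to keep the sketch self-contained.
import sqlite3
from mcp.server.fastmcp import FastMCP

db = sqlite3.connect("context.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT)")
mcp = FastMCP("shared-context")

@mcp.tool()
def remember(text: str) -> str:
    """Persist a piece of context so any connected AI tool can recall it later."""
    db.execute("INSERT INTO notes (text) VALUES (?)", (text,))
    db.commit()
    return "stored"

@mcp.tool()
def recall(query: str) -> list[str]:
    """Naive substring recall; the real server would use vector search."""
    rows = db.execute("SELECT text FROM notes WHERE text LIKE ?", (f"%{query}%",)).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport, for MCP hosts like IDEs or chat clients
```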

The protocol is also seeing specialization. ckanthony/openapi-mcp auto-generates MCP tool definitions from OpenAPI or Swagger specifications, allowing AI agents to access any documented API with zero manual integration. This means that if an API provides a standard spec, it can be instantly exposed as an MCP tool, complete with secure API key handling and schema generation—a significant step toward agentic AI that can safely and flexibly interact with the broader software ecosystem (more: [url](https://github.com/ckanthony/openapi-mcp)).
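The essential transformation is easy to picture: each documented operation in the spec becomes a named tool with a description and a JSON-schema input. The sketch below shows that mapping in miniature; the real project also handles authentication, request dispatch, and far richer schema translation.

```python
# Hedged sketch of the core mapping openapi-mcp performs: OpenAPI operations
# become MCP-style tool definitions (name, description, JSON-schema input).
import json

def tools_from_openapi(spec_path: str) -> list[dict]:
    spec = json.load(open(spec_path))
    tools = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            props = {
                p["name"]: p.get("schema", {"type": "string"})
                for p in op.get("parameters", [])
            }
            tools.append({
                "name": op.get("operationId", f"{method}_{path}"),
                "description": op.get("summary", ""),
                "inputSchema": {"type": "object", "properties": props},
            })
    return tools
```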

Security-focused MCP servers, like MCP-SecurityTools, aggregate tools and best practices for integrating security data sources (e.g., Shodan, FOFA, VirusTotal) into LLM workflows. This modularity and standardization across domains—from code to cybersecurity—demonstrates the protocol’s versatility and the community’s appetite for composable, interoperable AI systems (more: [url](https://github.com/Ta0ing/MCP-SecurityTools)).

Tool use is becoming a central pillar of LLM-powered workflows, both for agents and end-users. Ollama’s latest update introduces streaming responses with tool calling, allowing LLMs to interact with external tools in real-time as they generate outputs. This capability is crucial for building agents that can fetch data, perform calculations, or trigger workflows mid-conversation, rather than waiting for a full response cycle (more: [url](https://www.reddit.com/r/LocalLLaMA/comments/1kxubqe/ollama_now_supports_streaming_responses_with_tool)).
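A hedged sketch of what that looks like with the Ollama Python client is below; exact chunk fields can differ between client versions, and the weather helper is a stand-in for a real tool.

```python
# Streaming + tool calling with the Ollama Python client, following the pattern
# in Ollama's announcement. Field names are per recent client versions and may
# vary; get_weather is a hypothetical stand-in tool.
import ollama

def get_weather(city: str) -> str:
    """Stand-in tool: a real agent would call an actual weather API here."""
    return f"Sunny in {city}"

stream = ollama.chat(
    model="qwen3",  # any tool-capable local model
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=[get_weather],  # recent clients accept plain Python functions as tools
    stream=True,
)

for chunk in stream:
    if chunk.message.content:               # tokens arrive as they are generated
        print(chunk.message.content, end="", flush=True)
    for call in chunk.message.tool_calls or []:
        print("\n[tool requested]", call.function.name, call.function.arguments)
```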

On the research front, a new multi-turn tool-calling base model tailored for reinforcement learning (RL) agent training has been released. While details are sparse, the availability of such a base model on Hugging Face suggests an increasing focus on teaching LLMs to use tools effectively over multiple steps—a key requirement for robust, autonomous agents (more: [url](https://www.reddit.com/r/LocalLLaMA/comments/1l7v9gf/a_multiturn_toolcalling_base_model_for_rl_agent)).

For developers building production chatbots with retrieval-augmented generation (RAG), the choice of user interface remains contentious. Chainlit offers deep customization but lacks features like chat history and document upload out of the box. OpenWebUI, while feature-rich, can be rigid and may mishandle complex input chunking for RAG. The community consensus is far from settled; some advocate for customizing existing UIs, while others prefer building bespoke solutions atop flexible frameworks like Chainlit (more: [url](https://www.reddit.com/r/LocalLLaMA/comments/1kva7sp/chainlit_or_open_webui_for_production)). Meanwhile, GUI RAG interfaces that can handle many documents simultaneously remain elusive—most are limited to just a handful, highlighting ongoing usability challenges (more: [url](https://www.reddit.com/r/LocalLLaMA/comments/1ktgz28/gui_rag_that_can_do_an_unlimited_number_of)).

Text embedding and reranking models are fundamental for search, retrieval, and RAG. The Qwen3 Embedding series, spanning from 0.6B to 8B parameters, achieves state-of-the-art scores on the MTEB multilingual leaderboard and supports over 100 languages—including programming languages. The reranker variant (Qwen3-Reranker-4B) excels at ranking search results, code retrieval, and bitext mining, and can be finely controlled with user instructions for specialized tasks (more: [url](https://huggingface.co/Qwen/Qwen3-Reranker-4B)).
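Getting started with the embedding side is straightforward via sentence-transformers, following the pattern shown on the model card; the prompt name and model id below should be double-checked against the card itself.

```python
# Hedged sketch of retrieval with the smallest Qwen3 embedding model through
# sentence-transformers, per the model card's usage pattern.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
queries = ["how to evict a KV cache"]
docs = ["KVzip compresses the key-value cache.", "nftables replaces iptables."]

q_emb = model.encode(queries, prompt_name="query")  # instruction-aware query encoding
d_emb = model.encode(docs)
print(model.similarity(q_emb, d_emb))  # similarity matrix used to rank docs per query
```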

MiniMax-M1 pushes scale much further, introducing a 456B-parameter hybrid-attention MoE (Mixture-of-Experts) model with a native 1 million token context window—eight times larger than DeepSeek R1’s. Despite its massive scale, MiniMax-M1 leverages a “lightning attention” mechanism to reduce computational cost by 75% compared to DeepSeek R1 at a 100K-token generation length. Trained with advanced reinforcement learning and a novel CISPO algorithm, MiniMax-M1 sets new benchmarks for long-context reasoning and software engineering tasks, establishing itself as a foundation for next-generation agentic models (more: [url](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k)).

Semantic search for scientific literature is also advancing. Tools like arxivxplorer.com leverage OpenAI embeddings to offer rapid, high-quality semantic search across arXiv, bioRxiv, and medRxiv, making it easier for researchers to surface relevant work in an ever-expanding sea of publications (more: [url](https://arxivxplorer.com)).
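The underlying recipe is the familiar embed-and-rank loop. The sketch below uses the OpenAI embeddings API and cosine similarity as a generic stand-in; arXiv Xplorer's actual index, model choice, and ranking are its own.

```python
# Generic embedding search: embed a corpus and a query, then rank by cosine
# similarity. Requires OPENAI_API_KEY in the environment; abstracts are dummies.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

abstracts = ["We study KV cache compression...", "A silicon pixel detector with..."]
corpus = embed(abstracts)
query = embed(["long-context transformer memory reduction"])[0]

# Cosine similarity ranks abstracts against the query.
scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
print(sorted(zip(scores, abstracts), reverse=True)[0])
```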

On the audio front, the release of ConversationTTS marks a significant moment for open speech synthesis research. Trained on an unprecedented 200,000 hours of multilingual speech (with plans to scale to 500,000), ConversationTTS supports multi-speaker, conversational text-to-speech (TTS). The model uses speaker labels to distinguish voices in dialogue and is built atop leading open architectures (UniAudio, CSM, Moshi, RSTNet). All code, checkpoints, and data are released under a permissive license for non-commercial use, inviting researchers to push the frontiers of natural, expressive synthetic speech (more: [url](https://github.com/Audio-Foundation-Models/ConversationTTS)).

As AI adoption spreads globally, unique local challenges and preferences come to the fore. A survey among Indian developers, students, and AI enthusiasts seeks to understand barriers to LLM access, weighing local deployment (e.g., via Ollama) against cloud-based solutions. Factors like API costs, cultural context, and infrastructure limitations shape adoption patterns, with the goal of making AI tools more accessible and inclusive for underrepresented communities (more: [url](https://www.reddit.com/r/learnmachinelearning/comments/1l7fg3a/help_shape_the_future_of_ai_in_india_survey_on)).

The foundation of all this AI progress remains the underlying hardware and operating system. Recent deep dives into Linux features, such as cgroups v2 and nftables, demystify the mechanics of process isolation, resource control, and modern firewall management. While cgroups v2 brings a cleaner, more unified design for containerization and resource allocation, nftables replaces iptables with a more efficient, extensible, and user-friendly configuration model—though enterprise inertia means many systems still use legacy interfaces (more: [url1](https://fzakaria.com/2025/05/26/linux-cgroup-from-first-principles), [url2](https://ewpratten.com/blog/learning-nftables)).
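The cgroups v2 interface in particular is just plain files, which makes it easy to poke at directly. The sketch below creates a group, caps its memory, and moves the current process into it; it assumes root privileges and a unified hierarchy mounted at /sys/fs/cgroup, and delegation details vary under systemd.

```python
# cgroup v2 from first principles: the kernel exposes control files under
# /sys/fs/cgroup; creating a directory creates a group, and writing to its
# files configures it. Run as root on a cgroup v2 system.
import os

cg = "/sys/fs/cgroup/demo"
os.makedirs(cg, exist_ok=True)

with open(os.path.join(cg, "memory.max"), "w") as f:
    f.write("256M")               # hard memory limit for the group

with open(os.path.join(cg, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))     # move the current process into the group
```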

On the hardware side, a persistent issue with Intel 13th/14th gen CPUs has come to light: physical degradation of clock tree circuitry causes subtle, accumulating errors, manifesting as crashes or decompression failures in demanding applications like Unreal Engine games. While firmware updates attempt to mitigate the issue, the root cause is hardware-level degradation, with no complete fix short of hardware replacement. This serves as a stark reminder that as software pushes boundaries, hardware reliability remains a critical—and sometimes fragile—foundation (more: [url](https://fgiesen.wordpress.com/2025/05/21/oodle-2-9-14-and-intel-13th-14th-gen-cpus)).

Rounding out the week, two research papers illustrate the relentless march of hardware innovation. A Swiss-Italian-Japanese collaboration demonstrates a 100-micron silicon pixel detector paired with a SiGe HBT amplifier achieving 106 picosecond time resolution for minimum ionizing particles—remarkable performance for tracking and spectroscopy applications (more: [url](https://arxiv.org/abs/1511.04231v1)).

Meanwhile, physicists in France and Japan experimentally probe the 0-π quantum transition in carbon nanotube Josephson junctions. Their work reveals universal phase-dependent behavior and orbital degeneracy effects, advancing the understanding of quantum coherence and electronic correlations in nanoscale superconducting devices (more: [url](https://arxiv.org/abs/1601.03878v1)).

Finally, Databricks has launched a Free Edition, democratizing access to its industry-standard data and AI platform. With the demand for AI and ML skills surging—Forbes notes 74% annual job growth in the sector—this move allows students and hobbyists to gain hands-on experience with real-world tools, bridging the gap between theory and professional practice. Free Edition also includes self-paced training and industry certifications, lowering the barrier to entry in the increasingly competitive AI job market (more: [url](https://www.databricks.com/blog/introducing-databricks-free-edition)).

Sources (22 articles)

  1. llama-server is cooking! gemma3 27b, 100K context, vision on one 24GB GPU. (www.reddit.com)
  2. I built a lightweight, private, MCP server to share context between AI tools (www.reddit.com)
  3. KVzip: Query-agnostic KV Cache Eviction — 3~4× memory reduction and 2× lower decoding latency (www.reddit.com)
  4. Ollama now supports streaming responses with tool calling (www.reddit.com)
  5. Chainlit or Open webui for production? (www.reddit.com)
  6. Ollama not releasing VRAM after running a model (www.reddit.com)
  7. Help Shape the Future of AI in India - Survey on Local vs Cloud LLM Usage (Developers/Students/AI Enthusiasts) (www.reddit.com)
  8. llmcontext: Attach you whole project in large context chats (www.reddit.com)
  9. ckanthony/openapi-mcp (github.com)
  10. Audio-Foundation-Models/ConversationTTS (github.com)
  11. Ta0ing/MCP-SecurityTools (github.com)
  12. Learning (The Basics of) Nftables (ewpratten.com)
  13. Semantic search engine for ArXiv, biorxiv and medrxiv (arxivxplorer.com)
  14. Databricks Free Edition (www.databricks.com)
  15. Linux Cgroup from First Principles (fzakaria.com)
  16. Oodle 2.9.14 and Intel 13th/14th gen CPUs (fgiesen.wordpress.com)
  17. 100ps time resolution with thin silicon pixel detectors and a SiGe HBT amplifier (arxiv.org)
  18. 0-$\pi$ quantum transition in a carbon nanotube Josephson junction: universal phase dependence and orbital degeneracy (arxiv.org)
  19. MiniMaxAI/MiniMax-M1-40k (huggingface.co)
  20. Qwen/Qwen3-Reranker-4B (huggingface.co)
  21. A multi-turn tool-calling base model for RL agent training (www.reddit.com)
  22. GUI RAG that can do an unlimited number of documents, or at least many (www.reddit.com)