Local GPUs stretch their legs: Caches meet long contexts
Local GPUs stretch their legs
On the practical edge of home AI, three posts this week show how far a single workstation can go for media and massive models. Video2X 6.x, a long-running open-source video upscaler/interpolator, landed a major C/C++ rewrite with a faster pipeline, cross-platform support, and a Windows GUI installer. It now runs Anime4K v4, Real-ESRGAN, Real-CUGAN, and RIFE via Vulkan-powered ncnn backends, with installers/packages across Windows/Linux, Docker/Podman images, and even a Colab notebook. Community users note the usual speed/quality trade-offs (Anime4K is fast but anime-focused, while ESRGAN brings higher quality at slower speeds) and point out that CUDA-native tools may outperform Vulkan on NVIDIA cards. It's a real, usable refresh, not hype, but don't expect CUDA-class throughput from a Vulkan backend on a 3090. Still, the "old Python version that dumped PNG frames" era is firmly over. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nykzv3/video2x_6x_opensource_upscaler_frame/)
At the other extreme, a user got Qwen3-VL-235B (both Instruct and Thinking variants) running via AWQ quantization on vLLM across 8x RTX 3090s. With tensor parallelism (size 8), expert parallel, 32K max length, and ~95% VRAM utilization, they report around 78.5 tokens/s prefill and ~46-47 tokens/s generation, using ~18.5 GB of the 24 GB per GPU. It's not llama.cpp territory (everything lives in VRAM here), but it's a clean demonstration that AWQ with vLLM can serve very large multimodal models on commodity GPUs if you have enough of them. As a bonus: the "Thinking" variant kept its internal reasoning separate from the chat history, which helped control context growth. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nul4ti/running_qwen3vl235b_thinking_instruct_awq_on_vllm/)
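For anyone who wants to try a comparable setup, the reported numbers map onto vLLM's offline API roughly as sketched below. The model repo id is a placeholder for whichever AWQ quant you use, and exact flags and defaults vary by vLLM version.

```python
# Rough sketch of an 8-GPU AWQ deployment with vLLM (not the poster's exact
# script). The model id below is a placeholder; substitute the AWQ repo you
# actually downloaded, and check your vLLM version's supported flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-235B-Instruct-AWQ",  # placeholder repo id
    quantization="awq",
    tensor_parallel_size=8,        # one shard per RTX 3090
    max_model_len=32768,           # 32K context, as reported
    gpu_memory_utilization=0.95,   # ~95% VRAM utilization, as reported
)

params = SamplingParams(max_tokens=128, temperature=0.7)
out = llm.generate(["Summarize the benefits of tensor parallelism."], params)
print(out[0].outputs[0].text)
```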
Finally, an early Granite 4.0 anecdote captured what "long context" looks like in practice. One user set the Granite 4 H Tiny Q8 context window to 1M in LM Studio on a 3090 with 48 GB of system RAM and saw 50-60 tokens/s, but others observed capability drop-offs beyond ~64K on some long-context tasks. Speed is exciting; fidelity at depth still needs careful, task-specific testing. "No free lunch" remains a good default for ultra-long contexts. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwpshs/granite_4_h_tiny_q8_in_rtx_3090_its_a_context_king/)
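If you want to probe depth fidelity on your own hardware, a quick (and unscientific) needle-in-a-haystack check is easy to script against any OpenAI-compatible local endpoint. The base URL and model name below assume LM Studio's defaults and are placeholders; adjust for your setup.

```python
# Minimal needle-in-a-haystack probe for long-context fidelity. Assumes an
# OpenAI-compatible server (LM Studio's default shown); not a rigorous benchmark.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
filler = "The quick brown fox jumps over the lazy dog. " * 2000   # ~20K+ tokens of padding
needle = "The secret deployment code is AZURE-HERON-42."

for depth in (0.1, 0.5, 0.9):                    # plant the fact early, middle, late
    cut = int(len(filler) * depth)
    doc = filler[:cut] + needle + filler[cut:]
    reply = client.chat.completions.create(
        model="granite-4.0-h-tiny",              # whatever name your server exposes
        messages=[{"role": "user",
                   "content": doc + "\n\nWhat is the secret deployment code?"}],
    )
    print(f"depth {depth}: {reply.choices[0].message.content.strip()}")
```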
Caches meet long contexts
Serving stacks are catching up to the reality that most LLM workloads reuse context. LMCache, an open project designed to reduce repetitive computation, reports up to 15x faster context generation and 3x higher chat throughput by offloading and reusing KV caches in DRAM/disk rather than evicting them under GPU memory pressure. NVIDIA's Dynamo has integrated LMCache, and adopters range from vLLM stacks to Bloomberg and AWS. The project's team is clear: long-context use cases (RAG on long docs, accumulating chat history, agent workflows) drive the biggest wins; normal inference benefits too, just less dramatically. The punchline is simple: don't pay to prefill the same tokens twice. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw74ec/we_built_this_opensource_llm_inference_project_to/)
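The underlying idea is close to memoizing prefill work by prefix. The toy sketch below is illustrative only; it is not LMCache's code or API, just the shape of the win.

```python
# Toy illustration of KV-cache reuse: key expensive prefill work by a hash of
# the shared prefix so a repeated document or chat history is only built once.
# (Conceptual only; LMCache manages real KV blocks across GPU/DRAM/disk tiers.)
import hashlib

_kv_store: dict[str, dict] = {}   # stand-in for a DRAM/disk-tiered KV store

def expensive_prefill(prefix: str) -> dict:
    # placeholder for the GPU work of computing KV entries for `prefix`
    return {"n_tokens": len(prefix.split()), "kv": f"kv[{len(prefix)} chars]"}

def get_kv(prefix: str) -> dict:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _kv_store:          # miss: pay the prefill cost once
        _kv_store[key] = expensive_prefill(prefix)
    return _kv_store[key]             # hit: reuse, don't recompute

shared_doc = "A long RAG document shared across many queries. " * 500
for question in ("Summarize it.", "List the key dates.", "Any risks?"):
    kv = get_kv(shared_doc)           # only the first call does the heavy work
```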
Those economics matter because open models are leaning into longer sequences and structured tool use. IBM's Granite 4.0 line includes long-context, Apache 2.0-licensed instruct models with improved instruction-following and tool-calling. Architecturally, Granite-4.0-H-Tiny (7B) and H-Small (32B) use a decoder-only MoE with GQA and Mamba2 blocks, 128K sequence length, NoPE positional encoding, and shared experts. Benchmarks cover general reasoning, math, code, multilingual, tool-calling, and safety; the larger H-Small leads as expected. The docs explicitly position these as enterprise-ready assistants and RAG generators. (more: https://huggingface.co/ibm-granite/granite-4.0-h-tiny) (more: https://huggingface.co/ibm-granite/granite-4.0-h-small) (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw2wd6/granite_40_language_models_a_ibmgranite_collection/)
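Trying the smaller model locally follows the standard transformers flow; the sketch below assumes a recent transformers release with Granite 4.0 support and enough GPU or CPU memory for the 7B H-Tiny.

```python
# Minimal local test of Granite-4.0-H-Tiny via Hugging Face transformers.
# Assumes a transformers version that includes the Granite 4.0 hybrid
# Mamba2/MoE architecture; quantized GGUF builds are a separate route.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-tiny"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Give three uses for a 128K context window."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```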
Nous Research's Hermes 4 70B, built on Llama-3.1-70B, pushes "hybrid-mode" reasoning with optional explicit thinking traces that can be toggled on or off, trading latency for deliberation on harder problems. (more: https://huggingface.co/NousResearch/Hermes-4-70B)
The takeaway: long contexts are useful, but caching and careful context management determine whether they're fast and affordable. Even in single runs, users observed zero prefix cache hits; production traffic is where reuse pays off. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nul4ti/running_qwen3vl235b_thinking_instruct_awq_on_vllm/)
Document AI and better RAG
ByteDance open-sourced Dolphin, a two-stage "analyze-then-parse" document model for images and PDFs. It first produces a reading-order layout, then parses elements (text, tables, formulas) in parallel via heterogeneous anchor prompting, outputting structured JSON/Markdown. The project includes Hugging Face and "original" inference paths, and notes recent vLLM and TensorRT-LLM accelerations, practical for high-throughput parsing pipelines. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nvvdws/dolphin_analyzethenparse_document_image_model/)
On the application side, Eclaire is an MIT-licensed personal data assistant that ingests bookmarks, photos, documents, and notes; runs OCR/text extraction (Docling or LibreOffice for docs; readable/markdown/PDF for pages); tags and classifies; and supports scheduled tasks. It's tested locally on Ollama with Qwen3-14B for the assistant and Gemma3-4B for multimodal workers, but models are swappable. Crucially, all extracted content is saved under per-user data directories, making the knowledge base inspectable and portable, which is handy for debugging and privacy. (more: https://www.reddit.com/r/ollama/comments/1nvdmud/eclaire_opensource_privacyfocused_ai_assistant/)
Under the hood, Retrieval-Augmented Generation keeps advancing. A new arXiv paper proposes KeyKnowledgeRAG (K^2RAG), a framework that blends dense+sparse retrieval, knowledge graphs, pre-summarization to shrink corpora, query decomposition, reranking, and quantized generators to cut VRAM and latency. The authors survey hybrid systems like Blended RAG and SPECTER2+TF-IDF, noting limits on vague queries and metadata dependency. K^2RAG's key bets: do more work before generation to reduce "needle-in-a-haystack" failures, retrieve truly relevant passages, and keep the model's job to synthesis/presentation. In plain terms: smarter retrieval and smaller, quantized generators beat brute-force long context. (more: https://arxiv.org/abs/2507.07695v1)
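As a rough illustration of the hybrid retrieval piece (not the paper's implementation), fusing a lexical score with a dense similarity can be as simple as a weighted sum; the scoring functions and weighting here are stand-ins.

```python
# Hybrid dense+sparse retrieval in miniature: combine a crude keyword-overlap
# score with a cosine similarity over embeddings. Illustrative only; K^2RAG
# layers summarization, query decomposition, and reranking on top of this.
import math

def sparse_score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)            # fraction of query terms present

def dense_score(q_vec: list[float], p_vec: list[float]) -> float:
    dot = sum(a * b for a, b in zip(q_vec, p_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in p_vec))
    return dot / norm if norm else 0.0           # cosine similarity

def fused_score(query, passage, q_vec, p_vec, alpha=0.5):
    # alpha balances lexical vs. semantic evidence; tune per corpus
    return alpha * sparse_score(query, passage) + (1 - alpha) * dense_score(q_vec, p_vec)
```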
This aligns with what users see in practice: fast long-context models are great, but quality degrades as contexts swell; better selection and ordering often win over sheer window size. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwpshs/granite_4_h_tiny_q8_in_rtx_3090_its_a_context_king/)
Agents, red teams, and side channels
One practitioner ran a controlled experiment in which an AI agent followed hidden instructions embedded in a document and made destructive git changes. The test raises the obvious but urgent question: who is responsible when an autonomous agent trusts untrusted content? The thread splits between "the user who clicked allow," leadership pushing "AI-first" without guardrails, engineering deploying without test coverage, and security as the usual scapegoat. A sensible baseline emerges: don't blindly delegate destructive commands, and never assume content can't smuggle instructions. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ny5345/what_happens_if_ai_agents_start_trusting/)
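That baseline is cheap to make concrete: gate anything matching destructive patterns behind explicit human confirmation before the agent's command ever reaches a shell. The patterns and policy below are illustrative, not exhaustive.

```python
# Minimal guardrail sketch: block agent-proposed commands that look destructive
# unless a human explicitly approves. Patterns are examples, not a complete list.
import re

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+(push\s+--force|reset\s+--hard|clean\s+-fd)\b",
    r"\bdrop\s+table\b",
]

def run_agent_command(cmd: str) -> None:
    if any(re.search(p, cmd, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS):
        answer = input(f"Agent wants to run a destructive command:\n  {cmd}\nAllow? [y/N] ")
        if answer.strip().lower() != "y":
            print("Blocked.")
            return
    print(f"(would execute) {cmd}")   # swap in subprocess.run(...) in a real tool

run_agent_command("git reset --hard origin/main")   # prompts before running
```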
For researchers building red-team datasets, an adjacent concern is how to probe bad behavior without getting banned. The pragmatic answers: use local models (llama.cpp won't ban you) or secure explicit permission before testing on commercial APIs. Register-and-proxy games aren't ethical or necessary if serious safety work can be done offline. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nze43a/how_can_i_test_bad_behavior_in_model_apis_without/)
Meanwhile, Tom's Hardware highlights a different attack surface: "mic-e-mouse," which uses high-performance gaming mice as improvised microphones. By converting acoustic vibrations captured by mouse sensors into speech via AI, attackers could eavesdrop without a conventional mic. It's a reminder that sensor fusion and side channels are the norm now; "no microphone present" won't mean "no audio capture possible" much longer. (more: https://www.tomshardware.com/tech-industry/cyber-security/high-performance-mice-can-be-used-as-a-microphone-to-spy-on-users-thanks-to-ai-mic-e-mouse-technique-uses-mouse-sensors-to-convert-acoustic-vibrations-into-speech)
Developer workflows, tools, and taste
Coding with AI is increasingly a two-step: plan with one tool, implement with another. One thread captures a growing pattern: use Codex for planning and Claude Sonnet 4.5 (or GLM 4.6) for code generation, often in two terminals or panes to keep a human in the loop. Some wire them together via Model Context Protocol (MCP), letting Claude Code call Codex for plan reviews, but others prefer manual mediation to avoid cascading errors. Either way, specificity wins: detailed prompts (role, goals, constraints) and a Product Requirements Document/Tech Spec first can save multiple iterations later. (more: https://www.reddit.com/r/ClaudeAI/comments/1nwryh1/plan_with_codex_code_with_sonnet_45_whats_your/) (more: https://www.reddit.com/r/ChatGPTCoding/comments/1ny1axy/how_do_i_help_codex_critique_my_ideas_rather_than/)
If models seem too agreeable, tune the system prompt to discourage conflict avoidance and force critical analysis; several users report success with "sharp mode" instructions that explicitly suppress agreement bias and demand uncertainty tagging. It's not magic, but it's effective when paired with a "measure thrice, cut once" habit and good context management. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1ny1axy/how_do_i_help_codex_critique_my_ideas_rather_than/) (more: https://www.reddit.com/r/ClaudeAI/comments/1nwryh1/plan_with_codex_code_with_sonnet_45_whats_your/)
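As one hedged example (the wording is illustrative, not a quoted prompt from the thread), a "sharp mode" system message wired into an OpenAI-compatible call might look like this:

```python
# Example "sharp mode" system prompt; the text is an illustration of the idea,
# not the exact instructions users shared. Works with any chat-completions API.
from openai import OpenAI

SHARP_MODE = (
    "You are a critical reviewer. Do not agree by default. Challenge weak "
    "assumptions, list concrete risks and failure modes, and tag any claim "
    "you are not certain of with (uncertain). Prefer blunt, specific feedback."
)

client = OpenAI()   # or set base_url to a local server
resp = client.chat.completions.create(
    model="gpt-4.1",   # substitute your planning model
    messages=[
        {"role": "system", "content": SHARP_MODE},
        {"role": "user", "content": "Critique this plan: rewrite the backend in one weekend."},
    ],
)
print(resp.choices[0].message.content)
```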
Tooling quality still matters. Linus Torvalds blasted "completely crazy Rust format checking," complaining that rustfmt's heuristics make maintainability worse by collapsing multi-line use lists into single lines unpredictably: "It's automated tooling that is literally making bad decisions for the maintainability." Whatever one's taste in brace placement, stable, predictable formatting rules reduce merge pain, a principle LLM codegen should respect, too. (more: https://www.phoronix.com/news/Linus-Torvalds-Rust-Formatting)
On the small-but-sharp tool front, zentrox is a tiny Go HTTP micro-framework that wraps net/http with a minimalist router, chainable middleware, route scopes, and pragmatic context handling. It includes built-ins (Logger, Recovery, CORS, GZIP, JWT), JSON/form/query binding with validation, context pooling, and a precompiled route tree. Sample RPS microbenches on an M1 Pro hover around ~0.9-1.0M rps for static/param routes with low allocations, precisely the kind of lean baseline that plays well with AI backends and webhooks. Benchmark your own setup, but the ergonomics and performance look solid. (more: https://github.com/aminofox/zentrox)
Open hardware, hacks, and tribes
Open Tools' "Open Printer" aims squarely at locked-in ink. It's a fully open-source inkjet (mechanicals, electronics, firmware, BoM under Creative Commons) designed for repairability and refillable cartridges: no DRM, no subscriptions, and no driver lock-in. It uses refillable HP 63/302 cartridges, prints up to 600 dpi B/W and 1,200 dpi color, supports sheets and rolls with an automatic cutter, and runs a small interface on a Raspberry Pi Zero W with Wi-Fi, BT, and USB. Pricing and launch timing are TBD on Crowd Supply, but if it delivers, the lifespan and mod-ability could outlast typical vendor lock-ins. (more: https://www.techspot.com/news/109674-open-printer-fully-open-source-inkjet-drm-free.html)
Over in console-land, a hacker proved an AI overview wrong by turning a Nintendo Wii into a server. With PPC Arch (via the Wii-Linux Continuation Project), he eventually ditched temperamental USB NICs for an internal Ethernet adapter and got a Python file server working. A Rust-based Minecraft server struggled, but static web-serving has already been shown viable on Wii hardware, especially with a TLS-terminating proxy in front. It's not practical hosting; that's the point. (more: https://hackaday.com/2025/10/02/yes-gemini-a-wii-server-is-possible/)
Privacy-minded users might appreciate a faster way to make Tor addresses memorable. The onion-vanity-address tool generates Tor v3 vanity hostnames using ed25519 curve tricks (batch inversion and y-coordinate symmetry) to check ~45 million keys/sec on a laptop, about 2x faster than mkp224o in the example. Expect ~1 minute for a 6-character prefix; each extra character is ~32x harder. There's even a Kubernetes manifest for distributed search without exposing private keys. (more: https://github.com/AlexanderYastrebov/onion-vanity-address)
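The scaling is easy to sanity-check: v3 onion hostnames draw from a 32-character base32 alphabet, so the expected search effort grows by 32x per fixed prefix character. A quick back-of-envelope using the reported ~45M keys/sec:

```python
# Back-of-envelope expected search times for a fixed onion-address prefix,
# using the ~45M keys/sec figure reported for the tool. Order-of-magnitude only.
RATE = 45_000_000          # keys checked per second (reported, laptop)

for prefix_len in range(4, 9):
    expected_keys = 32 ** prefix_len            # each extra char: 32x more work
    seconds = expected_keys / RATE
    print(f"{prefix_len} chars: ~{seconds:,.0f} s (~{seconds / 3600:.2f} h)")
```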
And if you'd rather build agents together than alone, the Agentic Tribe offers a structured community: biweekly two-hour live coding sessions plus 5-6 person cohorts, with a focus on shipping agentic applications and infrastructure. It's led by rUv, who brings enterprise agent experience from AWS, Microsoft, OpenAI, and more. The pitch is practical: code, learn, and deliver in a cadence that avoids solo stall-outs. (more: https://ruv.io/tribe)
Sources (22 articles)
- [Editorial] Agentic Tribe (ruv.io)
- We built this open-source LLM Inference project to boost context generation by up to 15x and now it is being implemented by NVIDIA Dynamo! (www.reddit.com)
- Running Qwen3-VL-235B (Thinking & Instruct) AWQ on vLLM (www.reddit.com)
- What happens if AI agents start trusting everything they read? (I ran a test.) (www.reddit.com)
- Dolphin - analyze-then-parse document image model (open-source, ByteDance) (www.reddit.com)
- Granite 4.0 Language Models - a ibm-granite Collection (www.reddit.com)
- Eclaire - Open-source, privacy-focused AI assistant for your data (www.reddit.com)
- How do I help Codex critique my ideas rather than just go along with it everytime? (www.reddit.com)
- Plan with Codex, code with Sonnet 4.5. What's your simple workflow here? (www.reddit.com)
- aminofox/zentrox (github.com)
- AlexanderYastrebov/onion-vanity-address (github.com)
- High-performance mice can be used as a microphone to spy on users (www.tomshardware.com)
- Linus Torvalds Vents over "Completely Crazy Rust Format Checking" (www.phoronix.com)
- Open Printer is an open-source inkjet with DRM-free ink and no subscriptions (www.techspot.com)
- NousResearch/Hermes-4-70B (huggingface.co)
- ibm-granite/granite-4.0-h-small (huggingface.co)
- Yes, Gemini, A Wii Server Is Possible (hackaday.com)
- KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities (arxiv.org)
- Granite 4 H Tiny Q8 in RTX 3090, It's a context king. (www.reddit.com)
- How can I test bad behavior in model APIs without getting banned? (www.reddit.com)
- Video2X 6.x - open-source upscaler + frame interpolation (Anime4K v4 / Real-ESRGAN / Real-CUGAN / RIFE) (www.reddit.com)
- ibm-granite/granite-4.0-h-tiny (huggingface.co)