Local GPUs stretch their legs: Caches meet long contexts

Local GPUs stretch their legs

On the practical edge of home AI, three posts this week show how far a single workstation can go for media and massive models. Video2X 6.x, a long-running open-source video upscaler/interpolator, landed a major C/C++ rewrite with a faster pipeline, cross-platform support, and a Windows GUI installer. It now runs Anime4K v4, Real-ESRGAN, Real-CUGAN, and RIFE via Vulkan-powered ncnn backends, with installers/packages across Windows/Linux, Docker/Podman images, and even a Colab notebook. Community users note the usual speed/quality trade-offs—Anime4K is fast but anime-focused, while ESRGAN brings higher quality at slower speeds—and point out that CUDA-native tools may outperform Vulkan on NVIDIA cards. It’s a real, usable refresh, not hype, but don’t expect CUDA-class throughput from a Vulkan backend on a 3090. Still, the “old Python version that dumped PNG frames” era is firmly over. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nykzv3/video2x_6x_opensource_upscaler_frame/)

At the other extreme, a user got Qwen3-VL-235B (both Instruct and Thinking variants) running via AWQ quantization on vLLM across 8× RTX 3090s. With tensor parallelism (size 8), expert parallel, 32K max length, and ~95% VRAM utilization, they report around 78.5 tokens/s prefill and ~46–47 tokens/s generation, using ~18.5 GB of the 24 GB per GPU. It’s not llama.cpp territory—everything lives in VRAM here—but it’s a clean demonstration that AWQ with vLLM can serve very large multimodal models on commodity GPUs if you have enough of them. As a bonus: the “Thinking” variant kept its internal reasoning separate from the chat history, which helped control context growth. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nul4ti/running_qwen3vl235b_thinking_instruct_awq_on_vllm/)
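
For readers who want to see roughly what that deployment looks like, here is a minimal sketch using vLLM's offline Python API. The checkpoint name is a placeholder (the thread used a community AWQ quant), the expert-parallel flag assumes a recent vLLM build, and the remaining settings simply mirror the numbers reported above.

```python
# Sketch only: serving a large AWQ-quantized multimodal model across 8 GPUs with vLLM.
# The model id below is a placeholder for whichever Qwen3-VL-235B AWQ quant you use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/or/hub-id-of-Qwen3-VL-235B-AWQ",  # assumption: a community AWQ quant
    quantization="awq",            # AWQ weights, as in the thread
    tensor_parallel_size=8,        # shard across the 8x RTX 3090s
    enable_expert_parallel=True,   # split MoE experts across GPUs (recent vLLM builds)
    max_model_len=32768,           # 32K context, matching the reported setup
    gpu_memory_utilization=0.95,   # ~95% VRAM budget per GPU
)

out = llm.generate(
    ["Describe the attached chart in two sentences."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)
```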

Finally, an early Granite 4.0 anecdote captured what “long context” looks like in practice. One user set the Granite 4 H Tiny Q8 context window to 1M in LM Studio on a 3090 with 48 GB of system RAM and saw 50–60 tokens/s, but others observed capability drop-offs beyond ~64K on some long-context tasks. Speed is exciting; fidelity at depth still needs careful, task-specific testing. “No free lunch” remains a good default for ultra-long contexts. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwpshs/granite_4_h_tiny_q8_in_rtx_3090_its_a_context_king/)

Caches meet long contexts

Serving stacks are catching up to the reality that most LLM workloads re-use context. LMCache, an open project designed to reduce repetitive computation, reports up to 15x faster context generation and 3x higher chat throughput by offloading and reusing KV caches in DRAM/disk rather than evicting them under GPU memory pressure. NVIDIA’s Dynamo has integrated LMCache, and adopters range from vLLM stacks to Bloomberg and AWS. The project’s team is clear: long-context use cases (RAG on long docs, accumulating chat history, agent workflows) drive the biggest wins; normal inference benefits too, just less dramatically. The punchline is simple—don’t pay to prefill the same tokens twice. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw74ec/we_built_this_opensource_llm_inference_project_to/)
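
LMCache plugs into vLLM through a KV-transfer connector. The sketch below follows the pattern in LMCache's published examples, but connector names and environment variables vary by version, so treat it as a starting point rather than a recipe.

```python
# Sketch: offloading KV cache to CPU DRAM with LMCache's vLLM connector, so repeated
# prefixes (chat history, shared RAG documents) skip re-prefill. Connector name and
# env vars follow LMCache's examples and may differ across versions.
import os
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

os.environ["LMCACHE_LOCAL_CPU"] = "True"         # keep evicted KV blocks in DRAM
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "40"  # DRAM budget in GB (tune to your box)

llm = LLM(
    model="your-long-context-model",             # placeholder: substitute your model id
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",                       # this process both saves and loads KV
    ),
    enable_prefix_caching=True,
)

long_doc = open("contract.txt").read()           # hypothetical shared document
params = SamplingParams(max_tokens=200)
# The second call reuses the cached KV for the shared document prefix.
llm.generate([long_doc + "\n\nSummarize the termination clause."], params)
llm.generate([long_doc + "\n\nList all payment deadlines."], params)
```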

Those economics matter because open models are leaning into longer sequences and structured tool use. IBM’s Granite 4.0 line includes long-context, Apache 2.0–licensed instruct models with improved instruction-following and tool-calling. Architecturally, Granite-4.0-H-Tiny (7B) and H-Small (32B) use a decoder-only MoE with GQA and Mamba2 blocks, 128K sequence length, no positional encodings (NoPE), and shared experts. Benchmarks cover general reasoning, math, code, multilingual, tool-calling, and safety; the larger H-Small leads as expected. The docs explicitly position these as enterprise-ready assistants and RAG generators. (more: https://huggingface.co/ibm-granite/granite-4.0-h-tiny) (more: https://huggingface.co/ibm-granite/granite-4.0-h-small) (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw2wd6/granite_40_language_models_a_ibmgranite_collection/)
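
Trying the Tiny variant locally is plain Hugging Face boilerplate. The sketch below is standard transformers usage (a recent release with hybrid-Mamba2 architecture support is assumed), not Granite-specific documentation.

```python
# Minimal sketch: load Granite-4.0-H-Tiny and run one chat turn via the model's
# chat template. Requires a recent transformers release that supports the hybrid
# Mamba2/MoE architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-tiny"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the key risks in this RAG pipeline design: ..."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```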

Nous Research’s Hermes 4 70B, built on Llama-3.1-70B, pushes “hybrid-mode” reasoning with optional <think>…</think> segments and a vastly larger post-training corpus (~5M samples/60B tokens). It’s trained for schema adherence, tool calling, and easier steering, and ships GGUF and FP8 variants. The authors recommend tensor-parallel inference (e.g., vLLM) with prefix caching—again underscoring why KV reuse is becoming the default for serious deployments. (more: https://huggingface.co/NousResearch/Hermes-4-70B)

The takeaway: long contexts are useful, but caching and careful context management determine whether they’re fast and affordable. In one-off runs, the Qwen3-VL user above saw zero prefix-cache hits; it’s repeated production traffic where KV reuse pays off. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nul4ti/running_qwen3vl235b_thinking_instruct_awq_on_vllm/)

Document AI and better RAG

ByteDance open-sourced Dolphin, a two-stage “analyze-then-parse” document model for images and PDFs. It first produces a reading-order layout, then parses elements (text, tables, formulas) in parallel via heterogeneous anchor prompting, outputting structured JSON/Markdown. The project includes Hugging Face and “original” inference paths, and notes recent vLLM and TensorRT-LLM accelerations—practical for high-throughput parsing pipelines. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nvvdws/dolphin_analyzethenparse_document_image_model/)
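
The two-stage flow is easy to picture in code. The sketch below is purely conceptual: `layout_model` and `element_parser` are hypothetical stand-ins, not Dolphin's actual API (the repo's demo scripts are the real entry points).

```python
# Conceptual sketch of the analyze-then-parse flow: stage 1 produces a reading-order
# layout, stage 2 parses each element in parallel with a type-specific anchor prompt.
# `layout_model` and `element_parser` are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

def parse_document(page_image, layout_model, element_parser):
    # Stage 1: one pass yields typed regions in reading order.
    regions = layout_model(page_image)  # e.g. [{"type": "table", "bbox": (...)}, ...]

    # Stage 2: each region gets its own anchor prompt and is parsed in parallel.
    def parse(region):
        crop = page_image.crop(region["bbox"])
        prompt = {
            "text": "Read the text.",
            "table": "Parse as an HTML table.",
            "formula": "Transcribe as LaTeX.",
        }.get(region["type"], "Read the content.")
        return {"type": region["type"], "content": element_parser(crop, prompt)}

    with ThreadPoolExecutor() as pool:
        elements = list(pool.map(parse, regions))

    return elements  # reassemble into Markdown/JSON in reading order
```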

On the application side, Eclaire is an MIT-licensed personal data assistant that ingests bookmarks, photos, documents, and notes; runs OCR/text extraction (Docling or LibreOffice for docs; readable/markdown/PDF for pages); tags and classifies; and supports scheduled tasks. It’s tested locally on Ollama with Qwen3-14B for the assistant and Gemma3-4B for multimodal workers, but models are swappable. Crucially, all extracted content is saved under per-user data directories, making the knowledge base inspectable and portable—handy for debugging and privacy. (more: https://www.reddit.com/r/ollama/comments/1nvdmud/eclaire_opensource_privacyfocused_ai_assistant/)

Under the hood, Retrieval-Augmented Generation keeps advancing. A new arXiv paper proposes KeyKnowledgeRAG (K^2RAG), a framework that blends dense+sparse retrieval, knowledge graphs, pre-summarization to shrink corpora, query decomposition, reranking, and quantized generators to cut VRAM and latency. The authors survey hybrid systems like Blended RAG and SPECTER2+TF-IDF, noting limits on vague queries and metadata dependency. K^2RAG’s key bets: do more work before generation to reduce “needle-in-a-haystack” failures, retrieve truly relevant passages, and keep the model’s job to synthesis/presentation. In plain terms: smarter retrieval and smaller, quantized generators beat brute-force long context. (more: https://arxiv.org/abs/2507.07695v1)
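
In pipeline form, the idea looks roughly like the sketch below, where every helper (decompose, dense_search, sparse_search, dedupe, rerank, summarize) is a hypothetical stand-in rather than the authors' code.

```python
# Rough sketch of a K^2RAG-style hybrid pipeline: decompose the query, retrieve with
# both dense and sparse search, rerank, pre-summarize, then let a small quantized
# generator synthesize. All helper functions are illustrative placeholders.
def k2rag_answer(question, corpus, generator):
    sub_questions = decompose(question)                  # split vague/compound queries
    candidates = []
    for sq in sub_questions:
        candidates += dense_search(sq, corpus, k=20)     # embedding similarity
        candidates += sparse_search(sq, corpus, k=20)    # BM25/TF-IDF keywords
    passages = rerank(question, dedupe(candidates))[:8]  # cross-encoder ordering
    context = "\n\n".join(summarize(p) for p in passages)  # pre-summarized, smaller prompt
    return generator(f"Context:\n{context}\n\nQuestion: {question}")
```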

This aligns with what users see in practice: fast long-context models are great, but quality degrades as contexts swell; better selection and ordering often win over sheer window size. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwpshs/granite_4_h_tiny_q8_in_rtx_3090_its_a_context_king/)

Agents, red teams, and side channels

One practitioner ran a controlled experiment where an AI agent followed hidden instructions embedded in a document and issued destructive git changes. The test raises the obvious but urgent question: who is responsible when an autonomous agent trusts untrusted content? The thread splits between “the user who clicked allow,” leadership pushing “AI-first” without guardrails, engineering deploying without test coverage, and security as the usual scapegoat. A sensible baseline emerges: don’t blindly delegate destructive commands, and never assume untrusted content can’t smuggle instructions. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ny5345/what_happens_if_ai_agents_start_trusting/)
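
A concrete version of that baseline is a simple command gate: the agent proposes shell commands, but anything matching a destructive pattern needs a human in the loop. The sketch below is illustrative only, and the patterns are nowhere near exhaustive.

```python
# Minimal guardrail sketch: require explicit operator confirmation before running
# commands that match destructive patterns. Patterns and policy are illustrative.
import re
import subprocess

DESTRUCTIVE = [
    r"\bgit\s+(push\s+--force|reset\s+--hard|clean\s+-f|branch\s+-D)\b",
    r"\brm\s+-rf\b",
]

def run_agent_command(cmd: str) -> str:
    if any(re.search(p, cmd) for p in DESTRUCTIVE):
        answer = input(f"Agent wants to run a destructive command:\n  {cmd}\nAllow? [y/N] ")
        if answer.strip().lower() != "y":
            return "blocked by operator"
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
```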

For researchers building red-team datasets, an adjacent concern is how to probe bad behavior without getting banned. The pragmatic answers: use local models (llama.cpp won’t ban you) or secure explicit permission before testing on commercial APIs. Register-and-proxy games aren’t ethical or necessary if serious safety work can be done offline. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nze43a/how_can_i_test_bad_behavior_in_model_apis_without/)

Meanwhile, Tom’s Hardware highlights a different attack surface: “mic-e-mouse,” which uses high-performance gaming mice as improvised microphones. By converting acoustic vibrations captured by mouse sensors into speech via AI, attackers could eavesdrop without a conventional mic. It’s a reminder that sensor fusion and side channels are the norm now; “no microphone present” won’t mean “no audio capture possible” much longer. (more: https://www.tomshardware.com/tech-industry/cyber-security/high-performance-mice-can-be-used-as-a-microphone-to-spy-on-users-thanks-to-ai-mic-e-mouse-technique-uses-mouse-sensors-to-convert-acoustic-vibrations-into-speech)

Developer workflows, tools, and taste

Coding with AI is increasingly a two-step: plan with one tool, implement with another. One thread captures a growing pattern—use Codex for planning and Claude Sonnet 4.5 (or GLM 4.6) for code generation—often in two terminals or panes to keep a human in the loop. Some wire them together via Model Context Protocol (MCP), letting Claude Code call Codex for plan reviews, but others prefer manual mediation to avoid cascading errors. Either way, specificity wins: detailed prompts (role, goals, constraints) and a Product Requirements Document/Tech Spec first can save multiple iterations later. (more: https://www.reddit.com/r/ClaudeAI/comments/1nwryh1/plan_with_codex_code_with_sonnet_45_whats_your/) (more: https://www.reddit.com/r/ChatGPTCoding/comments/1ny1axy/how_do_i_help_codex_critique_my_ideas_rather_than/)

If models seem too agreeable, tune the system prompt to discourage conflict avoidance and force critical analysis—several users report success with “sharp mode” instructions that explicitly suppress agreement bias and demand uncertainty tagging. It’s not magic, but it’s effective when paired with a “measure thrice, cut once” habit and good context management. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1ny1axy/how_do_i_help_codex_critique_my_ideas_rather_than/) (more: https://www.reddit.com/r/ClaudeAI/comments/1nwryh1/plan_with_codex_code_with_sonnet_45_whats_your/)

Tooling quality still matters. Linus Torvalds blasted “completely crazy Rust format checking,” complaining that rustfmt’s heuristics make maintainability worse by collapsing multi-line use lists into single lines unpredictably: “It’s automated tooling that is literally making bad decisions for the maintainability.” Whatever one’s taste in brace placement, stable, predictable formatting rules reduce merge pain—a principle LLM codegen should respect, too. (more: https://www.phoronix.com/news/Linus-Torvalds-Rust-Formatting)

On the small-but-sharp tool front, zentrox is a tiny Go HTTP micro-framework that wraps net/http with a minimalist router, chainable middleware, route scopes, and pragmatic context handling. It includes built-ins (Logger, Recovery, CORS, GZIP, JWT), JSON/form/query binding with validation, context pooling, and a precompiled route tree. Sample RPS microbenches on an M1 Pro hover around ~0.9–1.0M rps for static/param routes with low allocations—precisely the kind of lean baseline that plays well with AI backends and webhooks. Benchmark your own setup, but the ergonomics and performance look solid. (more: https://github.com/aminofox/zentrox)

Open hardware, hacks, and tribes

Open Tools’ “Open Printer” aims squarely at locked-in ink. It’s a fully open-source inkjet (mechanicals, electronics, firmware, BoM under Creative Commons) designed for repairability and refillable cartridges—no DRM, no subscriptions, and no driver lock-in. It uses refillable HP 63/302 cartridges, prints up to 600 dpi B/W and 1,200 dpi color, supports sheets and rolls with an automatic cutter, and runs a small interface on a Raspberry Pi Zero W with Wi‑Fi, BT, and USB. Pricing and launch timing are TBD on Crowd Supply, but if it delivers, the lifespan and mod-ability could outlast typical vendor lock-ins. (more: https://www.techspot.com/news/109674-open-printer-fully-open-source-inkjet-drm-free.html)

Over in console-land, a hacker proved an AI overview wrong by turning a Nintendo Wii into a server. With PPC Arch (via the Wii-Linux Continuation Project), he eventually ditched temperamental USB NICs for an internal Ethernet adapter and got a Python file server working. A Rust-based Minecraft server struggled, but static web-serving has already been shown viable on Wii hardware—especially with a TLS-terminating proxy in front. It’s not practical hosting; proving it possible is the point. (more: https://hackaday.com/2025/10/02/yes-gemini-a-wii-server-is-possible/)

Privacy-minded users might appreciate a faster way to make Tor addresses memorable. The onion-vanity-address tool generates Tor v3 vanity hostnames using ed25519 curve tricks (batch inversion and y-coordinate symmetry) to check ~45 million keys/sec on a laptop—about 2× faster than mkp224o in the example. Expect ~1 minute for a 6‑character prefix; each extra character is ~32× harder. There’s even a Kubernetes manifest for distributed search without exposing private keys. (more: https://github.com/AlexanderYastrebov/onion-vanity-address)
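
The scaling follows directly from the 32-character base32 alphabet used in onion names: a k-character prefix needs about 32^k random keys on average, so each extra character multiplies the work by 32. A quick back-of-the-envelope check, assuming keys map uniformly onto names:

```python
# Expected search cost per prefix length at the reported laptop throughput.
RATE = 45_000_000  # keys/sec reported on a laptop

for k in range(5, 8):
    expected_keys = 32 ** k
    print(f"{k}-char prefix: ~{expected_keys:.2e} keys, ~{expected_keys / RATE:.0f} s on average")

# 5-char prefix: ~3.36e+07 keys, ~1 s on average
# 6-char prefix: ~1.07e+09 keys, ~24 s on average (run-to-run variance is large,
#                hence the "~1 minute" ballpark)
# 7-char prefix: ~3.44e+10 keys, ~764 s on average -> each extra character is 32x harder
```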

And if you’d rather build agents together than alone, the Agentic Tribe offers a structured community: biweekly two-hour live coding sessions plus 5–6 person cohorts, with a focus on shipping agentic applications and infrastructure. It’s led by rUv, who brings enterprise agent experience from AWS, Microsoft, OpenAI, and more. The pitch is practical: code, learn, and deliver in a cadence that avoids solo stall-outs. (more: https://ruv.io/tribe)

Sources (22 articles)

  1. [Editorial] Agentic Tribe (ruv.io)
  2. We built this open-source LLM Inference project to boost context generation by up to 15x and now it is being implemented by NVIDIA Dynamo! (www.reddit.com)
  3. Running Qwen3-VL-235B (Thinking & Instruct) AWQ on vLLM (www.reddit.com)
  4. What happens if AI agents start trusting everything they read? (I ran a test.) (www.reddit.com)
  5. Dolphin — analyze-then-parse document image model (open-source, ByteDance) (www.reddit.com)
  6. Granite 4.0 Language Models - a ibm-granite Collection (www.reddit.com)
  7. Eclaire – Open-source, privacy-focused AI assistant for your data (www.reddit.com)
  8. How do I help Codex critique my ideas rather than just go along with it everytime? (www.reddit.com)
  9. Plan with Codex, code with Sonnet 4.5. What's your simple workflow here? (www.reddit.com)
  10. aminofox/zentrox (github.com)
  11. AlexanderYastrebov/onion-vanity-address (github.com)
  12. High-performance mice can be used as a microphone to spy on users (www.tomshardware.com)
  13. Linus Torvalds Vents over "Completely Crazy Rust Format Checking" (www.phoronix.com)
  14. Open Printer is an open-source inkjet with DRM-free ink and no subscriptions (www.techspot.com)
  15. NousResearch/Hermes-4-70B (huggingface.co)
  16. ibm-granite/granite-4.0-h-small (huggingface.co)
  17. Yes, Gemini, A Wii Server Is Possible (hackaday.com)
  18. KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities (arxiv.org)
  19. Granite 4 H Tiny Q8 in RTX 3090, It's a context king. (www.reddit.com)
  20. How can I test bad behavior in model APIs without getting banned? (www.reddit.com)
  21. Video2X 6.x — open-source upscaler + frame interpolation (Anime4K v4 / Real-ESRGAN / Real-CUGAN / RIFE) 🚀 (www.reddit.com)
  22. ibm-granite/granite-4.0-h-tiny (huggingface.co)