Local multimodal catches up: Throughput, MoE, and templating
Local multimodal catches up
Qwen3-VL is landing in local stacks—rough edges and all. A community patch now lets llama.cpp run Qwen3‑VL‑30B variants in GGUF with vision enabled by supplying the multimodal projector via --mmproj and a jinja template, plus a small source patch; prebuilt releases and iterative fixes are circulating, including a crucial OCR improvement from ggml PR #15474 that markedly boosts text reading quality in images (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/). Users report it “works like a charm” for text-and-image inputs but can still hallucinate or repeat under some backends and quantization levels, with better outcomes after applying the OCR patch and using clean llama.cpp builds (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/). Practical tips matter: rotating mis‑oriented images to the correct orientation before inference yields noticeably cleaner text extraction (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/).
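That orientation fix is cheap to automate. A minimal sketch, assuming Pillow and a patched llama.cpp build; the CLI name, flags, and file names follow recent llama.cpp multimodal tooling and are assumptions to check against your build:

```python
import subprocess
from PIL import Image, ImageOps  # pip install pillow

def normalize_orientation(src: str, dst: str) -> str:
    """Apply the EXIF orientation tag so text is upright before OCR-style prompts."""
    ImageOps.exif_transpose(Image.open(src)).save(dst)
    return dst

image = normalize_orientation("receipt.jpg", "receipt_upright.png")

# Hypothetical invocation; binary name, flags, and model files depend on your build and patch.
subprocess.run([
    "./llama-mtmd-cli",
    "-m", "Qwen3-VL-30B-A3B-Thinking-Q5_K_M.gguf",   # main GGUF weights
    "--mmproj", "mmproj-Qwen3-VL-30B-A3B.gguf",      # multimodal projector supplied separately
    "--image", image,
    "-p", "Transcribe all visible text in this image.",
], check=True)
```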
Expect a quality gap vs. Qwen’s official hosted model. The upstream Qwen3‑VL‑30B‑A3B‑Instruct promises 256K–1M context, stronger spatial and video reasoning, expanded OCR across 32 languages, and “visual agent” capabilities to operate GUIs—architecturally backed by Interleaved‑MRoPE, DeepStack fusion, and timestamp‑aligned temporal modeling (more: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct). That headroom sets expectations that heavily quantized runs on consumer GPUs may not meet. Community tests echo that vision models degrade more under quantization than pure text LLMs; 4‑bit runs can “lobotomize” perception, especially for fine-grained OCR and character description—consistent with comments that higher precision helps stabilize outputs (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/).
Quantization itself doesn’t “train” the model; it’s a numeric compression. Calibration using importance matrices (imatrix) can reduce error by prioritizing more influential weights, typically with small calibration sets, but language-specific imatrices yield only marginal differences in practice relative to baseline calibration—so an English-skewed imatrix isn’t likely to severely penalize non‑English performance (more: https://www.reddit.com/r/LocalLLaMA/comments/1o2bxq9/does_quantization_need_training_data_and_will_it/). Quality issues observed here correlate more with precision level, backend maturity (e.g., correct projector wiring and OCR fix), and prompt templating than with calibration set language (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/).
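To make the calibration step concrete, here is a hedged sketch of imatrix-assisted quantization with llama.cpp's tools; binary names and flags track recent upstream builds and should be verified against yours, and the file names are illustrative:

```python
import subprocess

# 1) Collect importance statistics from a small, general calibration text.
subprocess.run([
    "./llama-imatrix",
    "-m", "model-f16.gguf",      # full-precision source model
    "-f", "calibration.txt",     # small calibration corpus; language mix matters little in practice
    "-o", "model.imatrix",
], check=True)

# 2) Quantize, letting the importance matrix protect the most influential weights.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "model.imatrix",
    "model-f16.gguf",
    "model-Q4_K_M.gguf",
    "Q4_K_M",
], check=True)
```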
Backend differences matter. Some report OpenAI‑compatible APIs exacerbate repetition or hallucinations with the patched build; others flag Vulkan performance variability by GPU (e.g., MI50 sluggishness) and note that vLLM or SGLang usually add official support faster for large, multi‑GPU deployments, while llama.cpp remains the go‑to for mixed CPU‑GPU and personal rigs (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/).
Throughput, MoE, and templating
Performance on consumer GPUs is improving—sometimes dramatically—thanks to software. An AMD Radeon 7900XTX user jumped from sub‑70 to ~168 tokens/sec on gpt‑oss‑20B using Vulkan in LM Studio; commenters attribute this to llama.cpp Vulkan updates and the model’s MoE design that only activates a subset of experts per token, making it inherently faster at inference (more: https://www.reddit.com/r/LocalLLaMA/comments/1o01una/what_and_when_7900xtx_is_boosted/). Others note similar wins with Qwen3 30B A3B at Q4, with throughput ranging widely depending on system prompts and settings—reminding that the input pipeline and prompt scaffolding can be as important as the GPU (more: https://www.reddit.com/r/LocalLLaMA/comments/1o01una/what_and_when_7900xtx_is_boosted/).
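Measuring throughput yourself removes the guesswork. A rough probe against any OpenAI-compatible local server (LM Studio, llama-server, and similar); the URL, port, model name, and reliance on the server reporting token usage are assumptions for illustration:

```python
import time
import requests

URL = "http://localhost:1234/v1/chat/completions"   # LM Studio's default local endpoint
payload = {
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts routing in 300 words."}],
    "max_tokens": 512,
    "temperature": 0.7,
}

t0 = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - t0

completion_tokens = resp["usage"]["completion_tokens"]  # requires the server to report usage
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")
```

Running the same probe with and without a long system prompt makes the prompt-scaffolding effect visible directly.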
Template hygiene is also a quality lever. Users evaluating GLM‑4.5‑Air/GLM‑4.6‑Distill in llama.cpp report repeated outputs tied to templating or stop‑tokens rather than model quality per se—an easy trap when swapping formats (GGUF vs. safetensors) or mixing chat templates across frontends (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyopyc/did_anyone_try_out_glm45airglm46distill/). Multiple GGUF quantizations were benchmarked on an RTX PRO 6000 (Blackwell Max‑Q), with expected memory/perf trade‑offs; low‑VRAM systems hit constraints earlier, and users requested safetensors and vLLM compatibility for smoother deployment (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyopyc/did_anyone_try_out_glm45airglm46distill/).
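A quick sanity check is to compare the upstream chat template and stop token against whatever the frontend or GGUF metadata applies. A small sketch using transformers; the repo id is an assumption, so substitute the model actually being served:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("zai-org/GLM-4.5-Air")  # assumed repo id
print("eos token:", tok.eos_token)        # the stop token the server must be configured to respect
print((tok.chat_template or "")[:400])    # start of the jinja chat template the frontend should match
```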
Small configuration choices often explain “mystery” regressions. With the Qwen3‑VL patch, integrating ggml PR #15474 fixes basic OCR; leaving it out can make the same model look blind to text. In Ollama, you don’t need to repeat [INST] templating across every prompt block inside a Modelfile unless the format specifically requires it—one responder’s “no” to per‑prompt tags aligns with avoiding over‑templating that can cascade into repetition or mismatched stop criteria (more: https://www.reddit.com/r/ollama/comments/1nxptsq/modelfile_do_i_need_these_tags_per_prompt/). Backend maturity, correct templates, and the right precision typically pay bigger dividends than chasing yet another quant pack (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/).
Finally, the ecosystem tailwinds are favorable for AMD users. Multiple commenters point to ongoing Vulkan optimization in llama.cpp and a broader industry push—including investment in AMD GPU support—suggesting these throughput gains aren’t a fluke, but the result of steady, low‑level backend work (more: https://www.reddit.com/r/LocalLLaMA/comments/1o01una/what_and_when_7900xtx_is_boosted/).
Agent ops: run, resume, repair
Developer tooling for coding agents is getting more pragmatic. FleetCode provides a light control panel to run multiple CLI coding agents (Claude Code, Codex) in parallel, isolating each session in its own git worktree, persisting across restarts, and letting users add Model Context Protocol (MCP) servers for capabilities. It includes a fix for Claude Code path issues by disabling “Auto connect to IDE” so the agent operates inside the worktree, plus a macOS quarantine bypass for unsigned builds—useful polish for real‑world use (more: https://github.com/built-by-as/FleetCode).
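The worktree isolation pattern is easy to reproduce outside FleetCode too. A minimal sketch of the idea, with illustrative path and branch names:

```python
import subprocess

def new_agent_worktree(session: str, base: str = "main") -> str:
    """Give each agent session its own git worktree and branch so parallel edits never collide."""
    path = f"../worktrees/{session}"
    subprocess.run(
        ["git", "worktree", "add", "-b", f"agent/{session}", path, base],
        check=True,
    )
    return path  # launch Claude Code / Codex with cwd=path so it only touches this checkout

workspace = new_agent_worktree("session-01")
```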
Claude Code’s context compaction has a reproducible bug: compaction fails with “Conversation too long” even when reported usage sits around 65–74%. Workarounds include /resume on the same chat before compacting, manual pruning, exporting the session and seeding a new one with a custom recap, or simply waiting for fixes; multiple users confirm /resume → /compact works across the desktop and VS Code extensions (more: https://www.reddit.com/r/ClaudeAI/comments/1o0fljh/claude_code_compaction_fails_with_conversation/). Advice to update to latest releases also appears, but the pattern suggests a threshold mismatch between displayed context usage and internal limits (more: https://www.reddit.com/r/ClaudeAI/comments/1o0fljh/claude_code_compaction_fails_with_conversation/).
Frontends are converging on local‑first workflows. An open‑source “local LLM platform for developers” demoing a streamlined app experience drew interest from those avoiding cloud‑only stacks (more: https://www.reddit.com/r/LocalLLaMA/comments/1nx4xz9/demo_my_opensource_local_llm_platform_for/), while OpenWebUI users discuss adding local terminal access for more powerful, self‑contained agent loops (more: https://www.reddit.com/r/OpenWebUI/comments/1nyqscz/local_terminal_access/). A turnkey Windows installer script bundles AI/dev tools, with a focus on free/cheap model access (Qwen Code, Gemini CLI) and Cline integration—a pragmatic response to developer requests for a one‑click setup, alongside a candid OS debate favoring Windows as a primary desktop for reliability and cost (more: https://www.reddit.com/r/ChatGPTCoding/comments/1nyend4/script_to_install_a_bunch_of_ai_or_dev_tools/).
Small configuration questions still trip many up. In Ollama Modelfiles, replicating [INST] {{ .System }} {{ .Prompt }} [/INST] per prompt is not necessary unless a specific model/format requires it—one‑line answers sometimes solve multi‑hour debugging (more: https://www.reddit.com/r/ollama/comments/1nxptsq/modelfile_do_i_need_these_tags_per_prompt/). Combine the right template with sane stop tokens and many “model” problems disappear.
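In other words, the template belongs once in the Modelfile, and Ollama applies it to every request. A minimal sketch assuming a Mistral-style [INST] format; the model name and stop parameters are illustrative:

```python
import pathlib
import subprocess

# The [INST] wrapping lives once in TEMPLATE; individual prompts should not repeat the tags.
modelfile = '''FROM mistral
TEMPLATE """[INST] {{ if .System }}{{ .System }} {{ end }}{{ .Prompt }} [/INST]"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
'''
pathlib.Path("Modelfile").write_text(modelfile)
subprocess.run(["ollama", "create", "my-mistral", "-f", "Modelfile"], check=True)

# At run time, send only the raw prompt; the template is applied automatically.
subprocess.run(["ollama", "run", "my-mistral", "Summarize llama.cpp's Vulkan backend in one paragraph."], check=True)
```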
Live benchmarking beats vibes
A real‑time scoreboard now monitors how coding models behave in the wild. aistupidlevel.info runs three tracks—tooling (IDE‑like tasks via a sandbox that simulates Cline behavior), a coding/debugging “7‑axis” suite, and a reasoning track—on schedules across Claude, GPT, Gemini, and Grok, surfacing dips, spikes, and routing oddities. It’s open source, reproducible, and already popular enough to get national TV coverage in Romania; the backend uses a Docker sandbox to replicate file/command flows akin to an editor assistant, and recent fixes improved trend surfacing and cost/perf signals (more: https://www.reddit.com/r/Anthropic/comments/1nx20wc/thank_you_anthropic_this_community_our_little/).
Community benchmarks complement that telemetry with format‑specific detail. In the GLM‑4.5‑Air/4.6‑Distill thread, users shared llama.cpp runs across multiple GGUF quantizations on an RTX PRO 6000 (Blackwell Max‑Q), reported memory/performance constraints on lower VRAM, and flagged repeated responses tied to chat templates and stop tokens, not just model behavior (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyopyc/did_anyone_try_out_glm45airglm46distill/). Requests for vLLM/safetensors point to a standard pattern: experimentation on local GGUF, productionization on high‑throughput backends (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyopyc/did_anyone_try_out_glm45airglm46distill/).
Telemetry and reproducible setups are increasingly the antidote to “model mood.” When numbers show the same tasks, same cadence, and the same toolchains, sudden regressions are usually backend/config issues rather than the sky falling on a model family (more: https://www.reddit.com/r/Anthropic/comments/1nx20wc/thank_you_anthropic_this_community_our_little/). Stack hygiene still wins.
Agents, robotics, and speech
Robotics is getting a new critic. VLAC (Vision‑Language‑Action‑Critic) is a general‑purpose pair‑wise critic and manipulation model trained on 3,000+ hours of egocentric human data, 1,200+ hours of public robotic manipulation data, and 15+ hours of self‑collected demos. It tracks task progress, judges completion, answers VQA, and can even produce embodied actions, with in‑context generalization across robots and scenes. VLAC evaluates trajectories with dense rewards and filters low‑quality data, improving imitation learning; an online demo and code support evaluation and action generation (more: https://github.com/InternRobotics/VLAC).
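As an illustration only (not VLAC's actual API), the pair-wise critic idea reduces to scoring adjacent frames against the task description and keeping trajectories that accumulate progress; pairwise_critic below is a hypothetical stand-in for a VLAC-style model:

```python
from typing import Callable, List

def dense_reward(frames: List, task: str,
                 pairwise_critic: Callable[[object, object, str], float]) -> float:
    # the hypothetical critic returns a positive score when the second frame shows more task progress
    return sum(pairwise_critic(a, b, task) for a, b in zip(frames, frames[1:]))

def filter_trajectories(trajectories: List[dict], task: str, pairwise_critic,
                        min_reward: float = 0.5) -> List[dict]:
    # keep demonstrations whose accumulated progress reward clears a threshold
    return [t for t in trajectories
            if dense_reward(t["frames"], task, pairwise_critic) >= min_reward]
```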
On the opposite end—tiny, fast tool use—LiquidAI’s LFM2‑1.2B‑Tool targets precise tool calling without “thinking” steps, designed for edge devices that need instant API/database calls. It expects tool definitions as JSON between special tokens, outputs Pythonic calls between reserved tags, and recommends greedy decoding at temperature 0. The aim is to compete with thinking models on tool tasks while keeping latency low; it supports multi‑turn and multiple languages (more: https://huggingface.co/LiquidAI/LFM2-1.2B-Tool).
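A hedged usage sketch with a recent transformers build, letting the tokenizer's chat template handle the special tokens; the tool schema shape and argument details are assumptions to verify against the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-1.2B-Tool"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

# Tool definition passed as JSON-like schema; the chat template serializes it between special tokens.
tools = [{
    "name": "get_inventory",
    "description": "Look up stock for a product SKU.",
    "parameters": {"type": "object", "properties": {"sku": {"type": "string"}}, "required": ["sku"]},
}]
messages = [{"role": "user", "content": "How many units of SKU A-1138 are in stock?"}]

inputs = tok.apply_chat_template(messages, tools=tools, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128, do_sample=False)  # greedy decoding, as recommended
print(tok.decode(out[0][inputs.shape[-1]:]))  # expect a Pythonic call between reserved tags
```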
Audio also gets an open platform. SonicVale is an AI multi‑role, multi‑emotion TTS system built with Electron + Vue on the client and a FastAPI backend orchestrating TTS/LLM engines, subtitles, queues, and more; it’s compatible with OpenAI‑style APIs and can target cloud‑native build platforms with free H20 GPUs. Licensed AGPL‑3, the repo includes a strong disclaimer against unlawful voice cloning or rights infringement (more: https://github.com/xcLee001/SonicVale).
Long‑context “thinking” continues to proliferate. Meituan’s LongCat‑Flash‑Thinking appears as a new resource on Hugging Face, reflecting ongoing work on efficient long‑horizon reasoning models; details aside, availability itself signals steady investment in long‑context inference (more: https://huggingface.co/meituan-longcat/LongCat-Flash-Thinking).
Security, seeds, and scams
Prompt theft is getting more practical—and seed‑centric. A new arXiv paper argues that online prompt‑stealing attacks against diffusion models depend critically on knowing the seed: common optimization losses (LPIPS, CLIP, latent MSE) are more sensitive to the initial noise than to small prompt modifiers, so without the seed, reconstructions drift. The authors also highlight PRNG pitfalls: many pipelines cap seeds to 32 bits (e.g., PyTorch CPU PRNG), or otherwise constrain the space, making seed recovery tractable and enabling practical online prompt stealing that threatens prompt marketplaces and creative IP (more: https://arxiv.org/abs/2509.09488v1).
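The seed's dominance is easy to see in isolation: the initial latent is a pure function of the PRNG seed, and a 32-bit cap leaves roughly 4.3 billion candidates. A toy illustration (not the paper's attack), assuming the usual SD-style latent shape:

```python
import torch

def init_latent(seed: int, shape=(1, 4, 64, 64)) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)   # CPU PRNG; many pipelines cap the seed at 32 bits
    return torch.randn(shape, generator=gen)

a, b, c = init_latent(1234), init_latent(1234), init_latent(1235)
print(((a - b) ** 2).mean().item())   # 0.0  -> same seed reproduces the starting noise exactly
print(((a - c) ** 2).mean().item())   # ~2.0 -> different seed gives statistically unrelated noise
print(f"search space if capped at 32 bits: {2**32:,} seeds")
```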
Misuse risks are not confined to images. Another arXiv preprint explores how AI agents can simulate human‑level scam calls, highlighting the growing realism and orchestration capabilities of agent systems and the need for mitigations that go beyond simple content filters (more: https://arxiv.org/abs/2508.06457v1). As agent stacks gain tool and telephony access, safety must harden at the protocol and policy layers, not just the prompt.
Trust also depends on honest UX. A blog post on dark patterns in buying a Deutsche Bahn BahnCard underscores how interface choices can coerce or mislead, reminding that security is partly about respecting user agency, not only cryptography (more: https://www.ketzu.net/dark-patterns-buying-a-bahncard-at-deutsche-bahn/). In AI tooling, that translates to transparent defaults, reversible actions, and clear cost/perf signals—precisely what community dashboards and open repos aim to provide (more: https://www.reddit.com/r/Anthropic/comments/1nx20wc/thank_you_anthropic_this_community_our_little/).
Continual learning needs solid optimization, too. THUDM’s INFTY Engine packages CL‑friendly optimizers to tackle catastrophic forgetting and gradient interference across PTMs, PEFT, diffusion, and VLMs, with built‑in visualization tools for sharpness and curvature. It’s a plug‑in wrapper around base optimizers with examples for C‑Flat/ZeroFlow/UniGrad‑FS—useful infrastructure when models must adapt continually without wiping past competence (more: https://github.com/THUDM/INFTY).
Hardware hacks and power plays
Not every lab needs a kilobuck PSU. BenchVolt PD is a USB‑PD bench power supply with adjustable rails, SCPI control, and a Python app, delivering up to 100 W with a clear TFT display. It blurs the line between PSU and function generator by emitting waveforms; the clear acrylic case aims for visibility and heat dissipation (more: https://hackaday.com/2025/10/10/benchvolt-pd-usb-pd-meets-benchtop-precision/). Hackaday commenters, however, warn about safety when using cheap DC‑DC modules and knob‑heavy UIs under load—sensible caution for anyone powering prototypes on a desk (more: https://hackaday.com/2025/10/10/benchvolt-pd-usb-pd-meets-benchtop-precision/).
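SCPI scripting is the draw for automation. A generic pyvisa session sketch; the resource string and command names are assumptions, since BenchVolt PD's actual SCPI vocabulary lives in its documentation:

```python
import pyvisa  # pip install pyvisa

rm = pyvisa.ResourceManager()
psu = rm.open_resource("ASRL/dev/ttyACM0::INSTR")  # USB-CDC serial instrument; path varies by OS

print(psu.query("*IDN?"))        # standard identification query
psu.write("VOLT 5.0")            # set the rail to 5.00 V (assumed command)
psu.write("OUTP ON")             # enable the output (assumed command)
print(psu.query("MEAS:VOLT?"))   # read back the measured voltage (assumed command)
psu.close()
```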
The same “measure twice, cut once” applies to inference rigs. Users report that llama.cpp Vulkan updates can materially improve AMD GPU throughput, with MoE models like gpt‑oss‑20B delivering unexpectedly high tokens/sec by design—proof that software maturity and model architecture can beat raw FLOPs on the right workload (more: https://www.reddit.com/r/LocalLLaMA/comments/1o01una/what_and_when_7900xtx_is_boosted/). For multimodal workloads, patch correctness (e.g., projector wiring, OCR fix) and careful quantization are non‑negotiable to avoid hallucinations that look like “model issues” but are really pipeline problems (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyhjbc/qwen3vl30ba3bthinking_gguf_with_llamacpp_patch_to/).
Back to software, a simple rule emerges: start with the official model card, reproduce their minimal example, match templates, and only then chase optimizations. Qwen’s own instructions recommend recent Transformers builds and enabling flash‑attention 2 for multi‑image/video scenarios—guidance that translates well even when porting to different backends (more: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct). The fastest way to go slow is to skip the basics.
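A minimal sketch in that spirit, using Auto classes with flash-attention 2 enabled; exact class names, content keys, and arguments may differ from the card, so check it before porting:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2",   # recommended for multi-image/video inputs
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder image URL
    {"type": "text", "text": "Read all text in this image."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```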
And as stacks get more agentic, the operational details matter: MCP server configs, resume‑then‑compact workarounds, and one‑click environment bootstraps shave hours off setup and debugging. Together, they turn “vibes” into reproducible engineering (more: https://github.com/built-by-as/FleetCode), (more: https://www.reddit.com/r/ClaudeAI/comments/1o0fljh/claude_code_compaction_fails_with_conversation/), (more: https://www.reddit.com/r/ChatGPTCoding/comments/1nyend4/script_to_install_a_bunch_of_ai_or_dev_tools/).
Sources (21 articles)
- Qwen3-VL-30B-A3B-Thinking GGUF with llama.cpp patch to run it (www.reddit.com)
- Did anyone try out GLM-4.5-Air-GLM-4.6-Distill ? (www.reddit.com)
- demo: my open-source local LLM platform for developers (www.reddit.com)
- What and when 7900xtx is boosted? (www.reddit.com)
- Does quantization need training data and will it lower performance for task outside of training data? (www.reddit.com)
- Modelfile. Do I need these tags PER prompt? (www.reddit.com)
- Script to install a bunch of AI or Dev tools automatically.. what can I add to it or improve? (www.reddit.com)
- Claude Code compaction fails with “Conversation too long” even when context is below 75% (www.reddit.com)
- xcLee001/SonicVale (github.com)
- InternRobotics/VLAC (github.com)
- Show HN: FleetCode – Open-source UI for running multiple coding agents (github.com)
- Dark Patterns: Buying a Bahncard at Deutsche Bahn (www.ketzu.net)
- meituan-longcat/LongCat-Flash-Thinking (huggingface.co)
- Qwen/Qwen3-VL-30B-A3B-Instruct (huggingface.co)
- BenchVolt PD: USB PD Meets Benchtop Precision (hackaday.com)
- ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls (arxiv.org)
- Prompt Pirates Need a Map: Stealing Seeds helps Stealing Prompts (arxiv.org)
- Thank you Anthropic & this community! Our little side project just hit 1M visits and even made it on National TV! (www.reddit.com)
- LiquidAI/LFM2-1.2B-Tool (huggingface.co)
- THUDM/INFTY (github.com)
- Local Terminal Access (www.reddit.com)