Vision models: quirks and fixes

Vision remains the messiest part of local LLM stacks, and this week’s debugging threads show why. One user found image recognition working via llama-server but not through Open WebUI; the server logs never showed a base64 image when using the WebUI, implying the client wasn’t sending image content even though model “vision” was enabled. Others confirmed it only worked when the image was sent without text, and that formats like webp sometimes misbehave. The same setup worked fine via jan.ai and llama-server’s own UI, strengthening the case for an Open WebUI-side issue rather than llama.cpp or llama-swap. A separate point: patch-based image tokenization can misread isolated characters at patch boundaries (an “R” split across patches can come back as an “A”), so don’t confuse low-level vision artifacts with “hallucination.” In practice, a single tokenized image costs roughly 1k context tokens, and most use cases don’t need extreme context lengths to analyze a few images. If you don’t see base64 in your server logs, your client likely didn’t send an image. Keep it simple: verify the client sends images via a chat-completions path and try PNG/JPG first. (more: https://www.reddit.com/r/LocalLLaMA/comments/1omgi3f/why_does_image_recognition_work_in_llamaserver/)
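
If you want to rule out the client, hit the server directly. A minimal sketch, assuming llama-server is up on localhost:8080 with a vision-capable model and mmproj loaded; the file name and model field are placeholders. If the request is well-formed, the base64 payload should show up in the server log.

```python
import base64
import requests

# Read a PNG/JPG and embed it as a data URI, the way OpenAI-compatible
# chat-completions endpoints (including llama-server's) accept images.
with open("test.png", "rb") as f:
    data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed llama-server address
    json={
        "model": "local-vl-model",  # placeholder name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```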

If OCR is your end goal, use purpose-built pipelines. AllenAI’s olmOCR-2-7B-1025 is a Qwen2.5-VL-7B-Instruct derivative with extra GRPO RL for math, tables, and tricky layouts. It expects a rendered page (longest side 1288 px) with metadata in the prompt, which their toolkit auto-generates. Reported scores on olmOCR-bench with the toolkit are strong across arXiv, old scans, math, tables, and “long tiny text,” with FP8 being the recommended variant for practical inference. The takeaway: end-to-end OCR quality depends as much on robust rendering, retries, and rotation handling as on the base MLLM itself. (more: https://huggingface.co/allenai/olmOCR-2-7B-1025)
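
The olmOCR toolkit handles rendering and prompt metadata for you, but if you are wiring things up manually, the resize step it expects looks roughly like this (a sketch using Pillow; the prompt metadata itself is generated by the toolkit and not shown here).

```python
from PIL import Image

def render_for_olmocr(path: str, target_longest_side: int = 1288) -> Image.Image:
    """Resize a rendered page so its longest side is 1288 px, as olmOCR expects."""
    img = Image.open(path).convert("RGB")
    scale = target_longest_side / max(img.size)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)
```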

Meanwhile, the community continues to squeeze vision models down to home-lab sizes. A thread on quantizing Qwen3-VL-30B “Thinking-abliterated” produced GGUF releases from Q4 to Q8, with an MXFP4 Q4 variant that trades a bit of speed for better quality than vanilla Q4 on non-Blackwell hardware. Users noted behavioral differences (e.g., NSFW refusals) between “thinking” and non-thinking variants—another reminder that training choices, not just quantization, shape outputs. (more: https://www.reddit.com/r/LocalLLaMA/comments/1olnjiu/is_there_simple_way_like_bat_to_compress_to_q4q8/)
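
On the “simple script” part of that question, the usual recipe is llama.cpp’s converter followed by llama-quantize. A rough sketch, with illustrative paths and a standard set of quant types; note that VL models also need their mmproj file converted separately, which this skips.

```python
import subprocess

MODEL_DIR = "Qwen3-VL-30B-A3B-Thinking-abliterated"  # local HF checkout (illustrative path)
F16_GGUF = "qwen3-vl-30b-f16.gguf"

# 1) Convert the HF checkpoint to a full-precision GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", MODEL_DIR, "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# 2) Quantize to each target size (standard llama.cpp quant types).
for qtype in ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]:
    subprocess.run(
        ["./llama-quantize", F16_GGUF, f"qwen3-vl-30b-{qtype}.gguf", qtype],
        check=True,
    )
```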

Even at the extreme edge, there’s experimentation: someone asked about running Ollama on the Whisplay HAT attached to a Raspberry Pi Zero 2W. Yes, it’s tiny hardware—and no thread-long proof here—but it captures the broader trend: people push VL/LLM workflows onto anything with a CPU and a battery. Expect strict trade-offs and aggressive quantization if you try. (more: https://www.reddit.com/r/ollama/comments/1ooi2my/has_anyone_tested_ollama_on_whisplay_hat_with/)

DIY acceleration, docks, and odd RF

When your 2U chassis won’t take a modern GPU, an OCuLink dock can save the day. One builder moved from a P40 to an RTX 5080 via a MINISFORUM DEG1 OCuLink enclosure powered externally, sidestepping 12VHPWR cabling on the server board. They dropped in a 4-port OCuLink card for future bifurcation and reported 140+ tokens/s with Mistral, running on a Proxmox box with dual Xeon Scalable CPUs and 512 GiB RAM. Cooling/noise mods and reliable power delivery matter more than brand stickers; with external docks you trade neatness for flexibility. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ol5zx5/adding_a_rtx_5080_into_a_2u_server_with_oculink/)

If you’re chasing throughput, don’t forget the basics from the vision-thread logs: a single image costs about 1k tokens of context, and 8k context suffices for most single-image tasks. It’s wasteful to run maximal context lengths by default when you’re compute-bound, especially on consumer GPUs. (more: https://www.reddit.com/r/LocalLLaMA/comments/1omgi3f/why_does_image_recognition_work_in_llamaserver/)
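
A back-of-the-envelope helper, with illustrative defaults, makes the point:

```python
def context_budget(n_images: int, prompt_tokens: int = 500, output_tokens: int = 1000,
                   tokens_per_image: int = 1000) -> int:
    """Rough context sizing using the ~1k-tokens-per-image figure from the thread."""
    return n_images * tokens_per_image + prompt_tokens + output_tokens

# Two images plus a prompt and a generous reply still fit comfortably in 8k,
# so set -c 8192 instead of the model's maximum context.
print(context_budget(2))  # 3500
```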

And for a reminder of the hacker DNA powering this ecosystem, someone built a radio “transmitter” driven by the mechanical flex of a piezo disk, ringing a crystal like a bell. It’s barely practical RF, but it’s a perfect metaphor: extract signal wherever physics allows, then scale up. The same mindset fuels the OCuLink dock, Pi HAT experiments, and quantized VL stacks—bootstrapping capability with whatever’s at hand. (more: https://hackaday.com/2025/11/02/2025-component-abuse-challenge-a-piezo-disk-powers-a-transmitter/)

Agents: training, routing, and specs

On agent performance, one project scaled a coding-agent orchestrator to 32× H100s and claimed a 160% relative improvement on Stanford’s TerminalBench (from 7% to 18.25%) over the base Qwen3-14B. The orchestrator coordinates explorer and coder subagents (as tool calls), trained with a dead-simple reward: unit tests. Attempts at “smart” rewards reportedly caused policy collapse. The author is explicit about bias—the training environments are synthetic and similar to the benchmark—and about cost/benefit: in many cases, prompt engineering a SOTA model beats RL for the effort. Still, the jump puts a 14B orchestrator close to a 480B coder on that benchmark (19.7%). It’s a reminder: in agents, well-instrumented feedback (tests) beats cleverness more often than not. (more: https://www.reddit.com/r/LocalLLaMA/comments/1onaops/scaling_codingagent_rl_to_32x_h100s_achieving_160/)
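
The thread doesn’t publish its reward code, but “unit tests as reward” reduces to something like the following hedged sketch (assuming a pytest-style suite; the project’s own harness may differ).

```python
import subprocess

def unit_test_reward(repo_dir: str, timeout_s: int = 600) -> float:
    """Binary reward: 1.0 if the repo's test suite passes, else 0.0.

    'Smart' shaped rewards reportedly caused policy collapse in the thread,
    so the signal is kept as blunt as possible.
    """
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=repo_dir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hanging agents get no credit
    return 1.0 if result.returncode == 0 else 0.0
```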

Agent infra is becoming its own product category. Bifrost advertises itself as a high-performance LLM gateway: 11 microseconds mean overhead at 5K RPS with plugins, adaptive routing across providers/keys, cluster-mode resilience, OpenAI-compatible APIs, and Prometheus-level observability. The pitch targets multi-agent setups with many concurrent LLM calls and failure-prone toolchains. Community replies immediately benchmarked the benchmarkers—contrasting claimed overheads with alternatives—which is exactly the mindset you want when your agents depend on the control-plane. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1omr1dw/bifrost_a_highperformance_gateway_for_llmpowered/)
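
Because such gateways speak the OpenAI wire format, client code mostly just repoints its base URL and lets the gateway handle provider and key selection, retries, and failover. A generic sketch of that pattern, not Bifrost’s documented configuration; the URL and model mapping are illustrative.

```python
from openai import OpenAI

# Point the standard OpenAI client at the gateway instead of a provider directly.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="gateway-managed")  # illustrative URL

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # the gateway decides which provider/key serves this model name
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```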

Finally, project definition remains the human bottleneck even when the code writes itself. One builder of a “generate full projects from specs” CLI realized teams often don’t share a mental model of the deliverable. Their solution borrows Jeff Patton’s user story mapping to visualize scope and dependencies before generation. Critics called it “waterfall,” but the point stands: clarity first, code second. If agents are construction crews, you still need a blueprint both humans and machines can read the same way. (more: https://www.reddit.com/r/ClaudeAI/comments/1oolcba/stop_fighting_with_ai_to_build_your_project/)

Security: link triggers, MCP hygiene

A viral demonstration showed that a simple hyperlink on mobile can inject invisible prompts into ChatGPT via URL parameters and hidden markdown in user input, potentially triggering dangerous behaviors in the context of the user’s connected tools, files, or “memories.” The critique: directly piping user-visible content into a deeply integrated assistant without robust isolation invites jailbreak-style abuse that enterprise customers won’t tolerate. Risks cited include getting accounts flagged with hostile content, seeding persistent memory instructions, data exfiltration via code tools (or even steganography), and general social engineering—all from a link click. The iOS app reportedly surfaces hidden content more visibly than mobile web, but the larger challenge is product philosophy: consumer growth hacks (SEO-friendly query links) collide with enterprise threat models. (more: https://www.linkedin.com/posts/georgzoeller_click-a-link-on-the-web-leak-documents-ugcPost-7392112142075740160-So7b?)
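
There is no single fix, but basic hygiene for anything that pipes user-visible content into an assistant includes stripping invisible characters and flagging links whose label hides their destination. A minimal sketch of that idea, not a complete defense; the character list and heuristic are illustrative.

```python
import re

# Common zero-width / invisible code points used to hide instructions in text.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")

def sanitize_untrusted_text(text: str) -> tuple[str, list[str]]:
    """Strip invisible characters and collect links whose label doesn't show the target."""
    cleaned = text.translate(ZERO_WIDTH)
    warnings = []
    for label, url in MD_LINK.findall(cleaned):
        if url not in label:  # label hides where the link actually goes
            warnings.append(f"link label {label!r} hides target {url}")
    return cleaned, warnings
```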

The defense playbook is maturing. OWASP published “A Practical Guide for Securely Using Third-Party MCP Servers 1.0,” part of its GenAI security project with Agentic Security and the LLM Top 10. MCP here is Model Context Protocol, and OWASP’s guidance is vendor-agnostic: controls for data boundaries, tool governance, secrets handling, and auditing across agent workflows. If your app chains tool calls and pulls in external MCP servers, assume prompt injection and tool misuse are normal operating conditions—not edge cases—and design the blast radius accordingly. (more: https://genai.owasp.org/resource/cheatsheet-a-practical-guide-for-securely-using-third-party-mcp-servers-1-0/)
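
In practice that means mundane controls: an explicit tool allowlist, an audit trail, and a human gate on destructive actions. A hypothetical wrapper around an MCP-style tool call; the tool names and logging setup are illustrative, not taken from the OWASP guide.

```python
import json
import logging
import time

ALLOWED_TOOLS = {"search_docs", "read_file"}      # explicit allowlist per server
REQUIRES_APPROVAL = {"write_file", "send_email"}  # destructive tools gated on a human

audit = logging.getLogger("mcp.audit")

def governed_call(call_tool, name: str, arguments: dict):
    """Wrap an MCP tool invocation with allowlisting, approval, and audit logging."""
    if name not in ALLOWED_TOOLS | REQUIRES_APPROVAL:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    if name in REQUIRES_APPROVAL and input(f"allow {name}? [y/N] ").lower() != "y":
        raise PermissionError(f"tool {name!r} denied by operator")
    audit.info(json.dumps({"ts": time.time(), "tool": name, "args": arguments}))
    return call_tool(name, arguments)  # the underlying MCP client call, supplied by the caller
```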

Scale: trillion-parameter “thinking” and trillion-dollar capex

Moonshot’s Kimi K2 Thinking landed with heavy claims and heavier weights: an “open-source trillion-parameter reasoning model,” natively INT4 thanks to quantization-aware training on MoE components, with reported 2× generation speedups under INT4 and state-of-the-art results on Humanity’s Last Exam (44.9% with tools), BrowseComp (60.2%), and SWE-Bench Verified (71.3%). The tech blog emphasizes scaling both “thinking tokens” and tool-calling steps, with examples of 200–300 tool calls for long-horizon tasks. Community reactions range from admiration to a familiar “slop taxonomy” joke, and practical notes: the weights are roughly 600+ GB, you’ll still want serious VRAM and RAM to host, and yes, it’s INT4-native but you won’t run this on a single small card. People also asked about how parallel test-time compute factors into benchmark wins, reflecting a growing desire to separate architectural progress from inference-time tricks. Use the official docs and repo; it’s a significant release, but the usual “trust and verify” applies. (more: https://moonshotai.github.io/Kimi-K2/thinking.html) (more: https://www.reddit.com/r/LocalLLaMA/comments/1oq1arc/kimi_released_kimi_k2_thinking_an_opensource/)
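
The 600+ GB figure is roughly what a trillion parameters at 4 bits apiece works out to, before quantization scales and the non-quantized components; a back-of-the-envelope check:

```python
params = 1.0e12       # ~1T total parameters (MoE)
bits_per_param = 4    # INT4 from quantization-aware training on the MoE weights
weight_bytes = params * bits_per_param / 8
print(f"~{weight_bytes / 1e9:.0f} GB of raw weights")  # ~500 GB; scales, embeddings, and
                                                       # higher-precision parts push the
                                                       # release toward the reported 600+ GB
```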

At the same time, scale is about dollars as much as parameters. OpenAI is exploring U.S. federal loan guarantees to underwrite a potential trillion-dollar infrastructure buildout—positioning AI compute more like national energy or transport projects. The goal is lower borrowing costs and access to deeper credit markets, especially as the company pursues massive data center ventures and cloud deals. An IPO is “not on the cards,” according to the CFO; the focus is on financing the capex needed to keep pace. It’s an unusual ask for a Silicon Valley firm, but the economics of frontier AI are looking less like software and more like utilities. (more: https://investinglive.com/stock-market-update/icymi-openai-asks-us-for-loan-guarantees-to-fund-1-trillion-ai-expansion-20251105/)

Efficient training, pruning, and “learning from errors”

There’s real progress in making models leaner and smarter without brute-force scaling. One practitioner trained two 100M-parameter LMs: Model A on 700M tokens and Model B on 500M tokens selected via entropy-based filtering. Model B matched Model A using 30% less data, by removing high-entropy junk (e.g., repetitive n-gram samples) based on cross-perplexity and repetition thresholds. Reviewers suggested adding a random 70% control to confirm gains aren’t from “less vs enough” data; the author cites related work and is building a reusable filtering pipeline. Data quality beats data volume when your marginal tokens add no signal. (more: https://www.reddit.com/r/LocalLLaMA/comments/1om2nqy/p_training_better_llms_with_30_less_data/)
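
The author’s exact thresholds and scoring are their own, but the shape of such a filter is simple: score each sample for repetition and perplexity under a reference model, then drop the tails. A generic sketch with illustrative thresholds, not the author’s pipeline.

```python
import math
from collections import Counter

def repeated_ngram_ratio(tokens: list[str], n: int = 3) -> float:
    """Fraction of n-grams that are duplicates; high values flag repetitive junk."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return 1.0 - len(counts) / len(ngrams)

def keep_sample(tokens: list[str], logprobs: list[float],
                max_rep: float = 0.2, max_ppl: float = 200.0) -> bool:
    """Keep a sample only if it is neither too repetitive nor too surprising
    under a reference model (perplexity from that model's token logprobs)."""
    ppl = math.exp(-sum(logprobs) / max(len(logprobs), 1))
    return repeated_ngram_ratio(tokens) <= max_rep and ppl <= max_ppl
```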

On the compression front, Cerebras released GLM-4.5-Air-REAP-82B-A12B: the full 106B MoE, pruned by 25% using REAP (Router-weighted Expert Activation Pruning) to 82B while keeping the router’s control intact. No fine-tuning required post-prune, drop-in for vLLM, and near-identical performance on code, reasoning, and tool-calling benchmarks, with some cases even edging up. The insight: prune low-contribution experts based on gate activity and activation norms, preserve routing to avoid collapsing diverse expertise. One-shot pruning that actually generalizes to generative tasks is a big deal for deployability. (more: https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B)
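
In schematic terms, the scoring asks how much each expert actually contributes when the router selects it. A paraphrase of that idea in PyTorch, not Cerebras’ implementation; the released checkpoint is already pruned, so this is for intuition only.

```python
import torch

def expert_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """Router-weighted activation saliency per expert.

    gate_weights:   (tokens, experts) routing probabilities
    expert_outputs: (tokens, experts, hidden) per-expert outputs for those tokens
    """
    contrib = gate_weights.unsqueeze(-1) * expert_outputs  # weight each expert's output by its gate
    return contrib.norm(dim=-1).mean(dim=0)                # average contribution norm per expert

def experts_to_prune(saliency: torch.Tensor, frac: float = 0.25) -> torch.Tensor:
    """Indices of the lowest-saliency experts (e.g. the bottom 25%)."""
    k = int(frac * saliency.numel())
    return torch.argsort(saliency)[:k]
```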

A new arXiv paper proposes REFINE: Retrieval-Enhanced Feedback via an in-context Neural Error-book for multimodal LLMs. Instead of hoping a model “remembers” past mistakes, REFINE retrieves targeted error feedback from a learned memory of errors (teacher–student setup) and injects it in-context at inference to avoid repeating similar failures. It’s a structured approach to “learning from errors” without retraining, especially useful for the messy failure modes of MLLMs. (more: https://arxiv.org/abs/2508.16313v1)
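
Mechanically, it is retrieval-augmented prompting where the “documents” are past failures: embed the incoming query, pull the nearest stored error notes, and prepend them as feedback. A schematic sketch, not the paper’s implementation; the embed function is assumed to be supplied by the caller.

```python
import numpy as np

class ErrorBook:
    """Tiny in-memory store of (embedding, feedback) pairs for past failures."""
    def __init__(self):
        self.embs, self.notes = [], []

    def add(self, emb: np.ndarray, feedback: str):
        self.embs.append(emb / np.linalg.norm(emb))
        self.notes.append(feedback)

    def retrieve(self, query_emb: np.ndarray, k: int = 3) -> list[str]:
        if not self.embs:
            return []
        q = query_emb / np.linalg.norm(query_emb)
        sims = np.stack(self.embs) @ q  # cosine similarity against stored errors
        return [self.notes[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(question: str, book: ErrorBook, embed) -> str:
    """Inject retrieved error feedback in-context ahead of the actual question."""
    notes = "\n".join(f"- {n}" for n in book.retrieve(embed(question)))
    return f"Past mistakes to avoid:\n{notes}\n\nQuestion: {question}"
```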

Backfilling reasoning is another active area: one tinkerer fine-tuned a small model to synthesize reasoning traces for older, non-reasoning datasets, aiming to augment conversational data that never had explicit “thinking” in the first place. It’s not SOTA, but it’s practical: scaffolding datasets so larger models can learn explicit step-by-step reasoning during fine-tuning. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ol2odj/i_fine_tuned_a_small_model_to_help_with_reasoning/)
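
The backfill pattern itself is straightforward: for each prompt/answer pair in the old dataset, ask a model to write a reasoning trace that connects them, then store the augmented example. A sketch of the pattern; the poster’s prompts, model, and trace format are their own.

```python
def backfill_reasoning(examples, generate):
    """Add a synthetic <think> trace to (prompt, answer) pairs from a non-reasoning dataset.

    `generate` is any text-generation callable (e.g. a locally hosted small model).
    """
    augmented = []
    for ex in examples:
        trace = generate(
            "Write the step-by-step reasoning that leads from the question to the answer.\n"
            f"Question: {ex['prompt']}\nAnswer: {ex['answer']}\nReasoning:"
        )
        augmented.append({
            "prompt": ex["prompt"],
            "response": f"<think>{trace.strip()}</think>\n{ex['answer']}",
        })
    return augmented
```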

And for perspective on multi-agent learning, Pluribus—superhuman in six-player no-limit Texas hold’em—remains an important milestone. Unlike two-player zero-sum games with clean equilibrium properties, multiplayer poker complicates “safe” strategy and exploitation. Pluribus trained via self-play and beat elite pros across 10,000 hands without real-time opponent modeling during evaluation. It’s a useful reminder that agentic performance at scale often grows from massive self-play plus careful constraints, not just bigger networks. (more: https://www.science.org/doi/10.1126/science.aay2400)

Dev tooling: config and crypto utilities

Config is code now. Conform bills itself as “Pydantic for Go”—type-safe, declarative configuration via struct tags with built-in validation, hot reload, multi-source loading (env, YAML/JSON/TOML, custom), detailed error messages, and zero boilerplate. It leans on Go 1.21+ generics for compile-time safety and covers common validators out of the box. Roadmap items include secrets managers and distributed config. If you’ve hand-rolled config parsing and validation in Go, this aims to make that unnecessary. (more: https://github.com/alicanli1995/conform)
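
For readers who know the Python side of that comparison, the pattern Conform mirrors looks like this in Pydantic (an analogy only; Conform’s own API is Go struct tags, per its README).

```python
import os
from pydantic import BaseModel, Field, ValidationError

class ServerConfig(BaseModel):
    host: str = Field(default="0.0.0.0", min_length=1)
    port: int = Field(default=8080, ge=1, le=65535)
    debug: bool = False

try:
    cfg = ServerConfig(
        host=os.getenv("HOST", "0.0.0.0"),
        port=os.getenv("PORT", "8080"),    # strings from the environment are coerced and validated
        debug=os.getenv("DEBUG", "false"),
    )
except ValidationError as e:
    raise SystemExit(f"bad configuration: {e}")
```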

Also surfaced: schollz/e2ecp on GitHub. Link included for those tracking new tools in the end-to-end security space. (more: https://github.com/schollz/e2ecp)

Sources (20 articles)

  1. [Editorial] https://genai.owasp.org/resource/cheatsheet-a-practical-guide-for-securely-using-third-party-mcp-servers-1-0/ (genai.owasp.org)
  2. ⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench (www.reddit.com)
  3. Adding a RTX 5080 into a 2U server with OcuLink (www.reddit.com)
  4. [P] Training Better LLMs with 30% Less Data – Entropy-Based Data Distillation (www.reddit.com)
  5. Why does Image Recognition work in llama-server but not through Open WebUI? (www.reddit.com)
  6. I fine tuned a (small) model to help with reasoning backfill on old/non-reasoning datasets (www.reddit.com)
  7. Has anyone tested ollama on Whisplay HAT with Raspberry pi zero 2W? (www.reddit.com)
  8. Bifrost: A High-Performance Gateway for LLM-Powered AI Agents (50x Faster than LiteLLM) (www.reddit.com)
  9. Stop fighting with AI to build your project (www.reddit.com)
  10. schollz/e2ecp (github.com)
  11. alicanli1995/conform (github.com)
  12. Kimi release Kimi K2 Thinking, an open-source trillion-parameter reasoning model (moonshotai.github.io)
  13. Superhuman AI for Multiplayer Poker (www.science.org)
  14. OpenAI asks U.S. for loan guarantees to fund $1T AI expansion (investinglive.com)
  15. cerebras/GLM-4.5-Air-REAP-82B-A12B (huggingface.co)
  16. allenai/olmOCR-2-7B-1025 (huggingface.co)
  17. 2025 Component Abuse Challenge: A Piezo Disk Powers A Transmitter (hackaday.com)
  18. Retrieval Enhanced Feedback via In-context Neural Error-book (arxiv.org)
  19. Kimi released Kimi K2 Thinking, an open-source trillion-parameter reasoning model (www.reddit.com)
  20. is there simple way like .bat to compress to q4-q8 like Unsloth, Qwen3-VL-30B-A3B-Thinking-abliterated model (www.reddit.com)