VRAM math goes mainstream: Tool calling finally behaves
VRAM math goes mainstream
Multi‑GPU local inference is still more art than science, but the community’s playbook keeps improving. One user running GLM‑4.5‑Air across a GeForce RTX 5090 plus two 3090s found that llama.cpp’s device ordering didn’t match nvidia-smi, and that the --main-gpu flag doesn’t control layer split. The fix that helped most: reorder devices with CUDA_VISIBLE_DEVICES or --devices, then tune tensor splits; in their case, --main-gpu 2 -ts 0.385,0.30,0.315 kept the model fully in VRAM and avoided OOM past 16K context. Suggestions included trying IQ5_KS quantization (“close to Q8” quality at Q4_K_M size) and recognizing that KV‑cache splitting makes fine-grained min‑maxing inherently finicky. In other words: take the win and document the working config. (more: https://www.reddit.com/r/LocalLLaMA/comments/1ozne89/cuda_device_list_mismatch_ggml_cuda_init_ubuntu/)
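A minimal sketch of that working launch, assuming llama-server and a placeholder GGUF filename; the device indices and split values are the ones reported in the thread and will vary with your topology.

```python
import os
import subprocess

# Make llama.cpp's device order match nvidia-smi before tuning splits;
# the index mapping here is illustrative and depends on your hardware.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1,2")

cmd = [
    "llama-server",
    "--model", "GLM-4.5-Air-Q4_K_M.gguf",  # placeholder filename
    "--main-gpu", "2",           # handles scratch/small tensors, not the layer split
    "-ts", "0.385,0.30,0.315",   # hand-tuned split reported to avoid OOM
    "-c", "16384",               # context size from the thread's test
]
subprocess.run(cmd, env=env, check=True)
```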
A new open‑source utility, kv-planner, attempts to replace guesswork with math. It models memory allocation under PagedAttention (as in vLLM) to keep fragmentation under 4%, predicts throughput and latency with a roofline model, and quantifies the trade‑offs of FP16/FP8/INT8/INT4 quantization. It also adjusts expectations for laptop GPUs, which can run at only 7–33% of desktop performance due to thermal throttling, so you know what to expect before a single token is generated. The tool exports vLLM/TensorRT‑LLM configs and supports 28+ GPUs across RTX and datacenter lines. (more: https://www.reddit.com/r/LocalLLaMA/comments/1p0morx/built_a_tool_to_solve_the_how_much_gpu_do_i/)
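kv-planner's internals aren't reproduced here, but the sketch below shows the kind of VRAM arithmetic such a planner automates. The model dimensions are illustrative, not tied to any particular checkpoint.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    """Per-request KV-cache size: K and V each store
    context * n_kv_heads * head_dim elements per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 2**30

def weights_gib(n_params_billions, bits_per_weight):
    """Weight memory for a quantized model at a given bits-per-weight."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30

# Illustrative GQA model: 64 layers, 8 KV heads of dim 128, FP16 cache.
print(f"KV @ 32K ctx: {kv_cache_gib(64, 8, 128, 32768):.1f} GiB")   # ~8.0 GiB
print(f"weights, 32B @ 4.5 bpw: {weights_gib(32, 4.5):.1f} GiB")    # ~16.8 GiB
```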
How far can multiple consumer GPUs go? In a 4×5090 (128 GB) thread, practitioners report running very large quants—including Qwen3 235B Instruct in EXL3 4.0 bpw—fully in VRAM with strong throughput if all cards have PCIe 5.0 x16 bandwidth. Others note TabbyAPI’s KV‑cache quantization (via Hadamard transforms) makes 6‑bit caches viable, effectively doubling context. There’s also pushback against the claim that “you can only use one RTX per model”; while consumer cards lack NVLink/NVSwitch, model‑parallel inference over PCIe is possible in several frameworks, with performance bounded by interconnect and KV traffic. Plan for KV overhead: common advice is reserving 20–30% VRAM for stability at large contexts. (more: https://www.reddit.com/r/LocalLLaMA/comments/1p0wu96/what_size_of_llm_can_4x_rtx_5090_handle_96gb_vram/)
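The "effectively doubling context" claim is straightforward bit arithmetic, sketched here under an FP16 baseline and a fixed cache budget.

```python
# FP16 KV entries cost 16 bits each; a 6-bit cache costs ~6 bits per element,
# so the same VRAM budget holds 16/6 ≈ 2.7x more tokens of context.
fp16_bits, q6_bits = 16, 6
print(f"same budget holds {fp16_bits / q6_bits:.1f}x the tokens at 6-bit")
# Quantization metadata (scales, transform state) eats part of that headroom,
# hence "effectively doubling" in practice rather than the raw 2.7x.
```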
If tuning flags makes your eyes cross, a community GUI for llama.cpp now includes a “Parameter Browser” and an experimental “Tuning Wizard” that probes your hardware and suggests ngl, tensor splits, and more. It’s Windows/CUDA‑only for now and marked experimental, but it embodies a trend: put the heuristics in software so newcomers can get to a stable, fast config faster. (more: https://www.reddit.com/r/LocalLLaMA/comments/1owd8bw/new_parameter_browser_added_to_llamacpp_model/)
Tool calling finally behaves
Llama.cpp just landed generalized XML‑style tool‑call parsing with streaming support for several popular models (GLM‑4.5/4.6, MiniMax M2, Qwen3‑Coder, and more). That matters because prior streaming parsers could break mid‑conversation, leaking chain‑of‑thought or dropping tool names; the fix restores reliable agent workflows without sacrificing responsive UIs. Early tests report tool calling working properly with GLM‑4.5‑Air and Qwen3‑Coder. (more: https://www.reddit.com/r/LocalLLaMA/comments/1p1l4i8/lamacpp_generalized_xmlstyle_toolcall_parsing/)
Streaming vs. non‑streaming isn’t cosmetic. Practitioners note that vLLM can parse GLM Air correctly in non‑stream mode, but its streaming mode previously broke after a few tool calls, spewing raw chain‑of‑thought or mangled tool names into the response; reliable incremental parsing is what separates a usable agent loop from a flaky one. (more: https://www.reddit.com/r/LocalLLaMA/comments/1p1l4i8/lamacpp_generalized_xmlstyle_toolcall_parsing/)
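The toy below is Python, not llama.cpp's C++ parser; it only illustrates the core difficulty the change addresses: an XML-style tag can arrive split across stream chunks, so a streaming parser must hold back any tail that could be a partial tag instead of leaking it as visible text.

```python
class StreamingToolCallParser:
    """Minimal sketch: buffer chunks so a tool-call tag split across
    stream boundaries is never emitted to the user as plain text."""
    OPEN, CLOSE = "<tool_call>", "</tool_call>"

    def __init__(self):
        self.buf = ""
        self.in_call = False

    def feed(self, chunk: str):
        """Return (visible_text, completed_tool_calls) for this chunk."""
        self.buf += chunk
        text, calls = "", []
        while True:
            if self.in_call:
                end = self.buf.find(self.CLOSE)
                if end < 0:
                    return text, calls               # tag body still streaming
                calls.append(self.buf[:end])
                self.buf = self.buf[end + len(self.CLOSE):]
                self.in_call = False
            else:
                start = self.buf.find(self.OPEN)
                if start < 0:
                    # Hold back a tail that could be the prefix of a split tag.
                    safe = len(self.buf) - len(self.OPEN) + 1
                    if safe > 0:
                        text += self.buf[:safe]
                        self.buf = self.buf[safe:]
                    return text, calls
                text += self.buf[:start]
                self.buf = self.buf[start + len(self.OPEN):]
                self.in_call = True

    def flush(self) -> str:
        """Release any held-back tail once the stream ends."""
        text, self.buf = self.buf, ""
        return text

# Demo: the tag is split across chunks but never leaks into visible text.
p = StreamingToolCallParser()
for chunk in ['Sure. <tool_', 'call>{"name": "search"}</tool_', 'call> Done.']:
    print(p.feed(chunk))
print(repr(p.flush()))
```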
On the model side, lightweight function calling is getting better. One thread recommends Qwen3 for accurate function calls on small local models, while another points out that Jan Nano via llama.cpp can talk to MCP (Model Context Protocol) servers, enabling browsing and deeper research through standardized tools. Frameworks like LangChain remain a universal fallback, but model‑native tool use is closing the gap. (more: https://www.reddit.com/r/ollama/comments/1p0zwio/is_there_an_slm_that_supports_function_calling_on/)
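Since llama-server exposes an OpenAI-compatible endpoint, a small local model's tool calling can be exercised with the standard tools schema. The URL, model name, and tool below are assumptions for a local setup, not details from the threads.

```python
from openai import OpenAI

# Assumed local endpoint; llama-server defaults to port 8080.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not from the thread
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3",  # whatever model the server has loaded
    messages=[{"role": "user", "content": "Weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```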
From DAGs to actors
When one‑shot prompt chaining isn’t enough, orchestration matters as much as the model. Asya, a Kubernetes‑native async actor framework from Delivery Hero, lets teams compose AI pipelines as pure Python functions with routes as data, autoscaling each “actor” 0→N based on queue depth via KEDA, and even scaling GPU workers to zero between batches. It injects a sidecar for message routing, supports SQS/RabbitMQ, and can optionally expose an MCP (Model Context Protocol) HTTP API for envelope submission and SSE streaming. Caveat: queues add latency; for <100 ms inference, use KServe/Seldon instead. (more: https://github.com/deliveryhero/asya)
Real‑world patterns fit neatly: document processing (OCR→classify→extract→store), LLM workflows (retrieve→generate→judge→refine), and bursty traffic with zero idle cost. The key idea is decoupling: actors focus on business logic, while the platform handles queues, retries, routing, and scale. (more: https://github.com/deliveryhero/asya)
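A sketch of that decoupling under stated assumptions: the plain functions and ROUTES dict below illustrate "routes as data" in the document-processing shape above, but they are not Asya's documented API, and the in-process driver stands in for the queue-backed platform.

```python
# Each actor is a pure function over a message dict; no actor knows
# what runs next. Stand-in bodies replace real OCR/LLM calls.
def ocr(msg: dict) -> dict:
    return {**msg, "text": f"<text of {msg['pdf_url']}>"}

def classify(msg: dict) -> dict:
    return {**msg, "doc_type": "invoice"}

def extract(msg: dict) -> dict:
    return {**msg, "fields": {"total": "42.00"}}

def store(msg: dict) -> dict:
    print("persisting:", msg["fields"])
    return msg

# Routes as data: in Asya's model, the sidecar plus queue (SQS/RabbitMQ)
# move envelopes along this graph while KEDA scales each actor 0->N.
ACTORS = {"ocr": ocr, "classify": classify, "extract": extract, "store": store}
ROUTES = {"ocr": "classify", "classify": "extract", "extract": "store", "store": None}

def run_local(start: str, msg: dict) -> dict:
    """Toy in-process driver standing in for the queue-backed platform."""
    step = start
    while step is not None:
        msg = ACTORS[step](msg)
        step = ROUTES[step]
    return msg

run_local("ocr", {"pdf_url": "s3://bucket/doc.pdf"})
```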
Practitioners debating multi‑agent architectures echo the need for this kind of substrate. Beyond prompt chains, teams are experimenting with judges, planners, and tool‑routing policies that evolve at runtime—precisely where data‑driven routes and stateless functions shine. You can build agentic behavior without hard‑coding a brittle DAG. (more: https://www.reddit.com/r/ClaudeAI/comments/1ozmvw4/how_are_you_all_orchestrating_multiagent/)
AI-first IDEs and unified APIs
Google announced Antigravity, pitched as an “AI‑first IDE” that multiple developers describe as “another VS Code fork,” with a UI reminiscent of Windsurf. It’s free “for now,” and the positioning underscores a broader shift: coding environments are becoming agent surfaces. Whether that’s a fork or a truly new IDE matters less than whether it integrates tools cleanly and respects developer workflows. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1p0h4gv/googles_antigravity_another_vs_code_fork/)
On Apple platforms, AnyLanguageModel proposes a single Swift API that swaps out Apple’s Foundation Models provider for local or cloud backends with minimal code changes. It supports Core ML, MLX, llama.cpp (GGUF), Ollama, and cloud APIs, using Swift package traits to avoid dependency bloat. The library even extends beyond Foundation Models’ current feature set by adding image input for vision‑language models, accepting the risk that Apple’s future API might differ. The goal is practical: lower the friction of testing local open‑source models versus cloud, then pick the right fit. (more: https://huggingface.co/blog/anylanguagemodel)
One healthy reminder while chasing shiny tools: Godbolt’s Rule. The community’s emphasis on minimal, reproducible examples—and sharing the exact code that produces a result—becomes survival gear in AI development, where configuration and environment often decide outcomes as much as algorithms do. Keep it small, share it, and verify. (more: https://corecursive.com/godbolt-rule-matt-godbolt/)
Multimodal models meet lifelike speech
Qwen3‑VL‑4B‑Instruct is a small, capable multimodal model positioned as a “visual agent.” It can operate GUIs by recognizing elements and invoking tools, generate code/markup from images and videos, and reason over long contexts—native 256K, expandable to 1M. The stack adds stronger spatial grounding, upgraded OCR across 32 languages, and improved text‑only performance competitive with pure LLMs. Recommended deployments enable FlashAttention‑2 for memory and speed, especially with multi‑image/video inputs. (more: https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)
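A minimal load following the generic transformers pattern; this assumes a recent transformers release with Qwen3-VL support, and the model card may prefer a model-specific class over the auto class used here.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-4B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # the card's memory/speed tip
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```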
On the speech side, SoulX‑Podcast (1.7B) targets long‑form, multi‑speaker, multi‑turn dialog generation with paralinguistic controls—laughter, sighs, and more—plus zero‑shot voice cloning across English and multiple Chinese dialects. It also serves as a high‑quality monologue TTS. The team plans a web UI, streaming inference, and a technical report, and explicitly cautions against misuse for impersonation or fraud. (more: https://huggingface.co/Soul-AILab/SoulX-Podcast-1.7B)
Geometry without SDF; tables with context
“Enough with SDF + Marching Cubes?” Faithful Contouring proposes a near‑lossless 3D mesh representation that skips distance‑field conversion and iso‑surface extraction. Instead, it encodes active voxels with compact tokens—anchors, up to eight dual features (positions + normals), and six orientation flags—then reconstructs precise topology via a primal‑dual grid. CUDA kernels enable high resolutions (up to 2048+), and a pre‑compiled Linux wheel (Python 3.10, CUDA 11.8) is available. The tokenized FCT format is designed to plug directly into learning‑based 3D reconstruction and generation. (more: https://github.com/Luo-Yihao/FaithC)
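For intuition, a hypothetical Python rendering of that token layout; the field names and shapes are guesses for illustration, not the repo's actual FCT schema.

```python
from dataclasses import dataclass, field

@dataclass
class VoxelToken:
    """Illustrative layout of one Faithful Contouring token (hypothetical)."""
    anchor: tuple[int, int, int]                    # active-voxel grid index
    duals: list[tuple[float, ...]] = field(default_factory=list)
    # up to 8 dual features, each a position (3 floats) + normal (3 floats)
    orientation: list[bool] = field(default_factory=lambda: [False] * 6)
    # six flags, one per face direction, fixing local topology for the
    # primal-dual reconstruction
```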
Meanwhile, tabular anomaly detection gets a new benchmark: ReTabAD focuses on restoring semantic context, a key challenge when anomalies hinge on relational meaning rather than out‑of‑range values. By framing evaluation around context reconstruction, it aims to nudge methods beyond shallow feature‑level heuristics. (more: https://arxiv.org/abs/2510.02060v1)
Compression, censorship, and caution
Multiverse Computing says it shrank DeepSeek R1 by 55% using tensor networks—quantum‑inspired math that maps correlations to compress parameters—and then selectively removed censorship aligned with Chinese regulations. They report that on ~25 “sensitive” prompts, their R1 Slim answered factually where the original refused, with OpenAI’s GPT‑5 used as judge. The broader value claim is surgical editability: remove perceived biases or inject domain knowledge at a granular level. Skeptics note that censorship permeates data and alignment; removing it “fully” is hard to prove with limited tests. Perplexity’s earlier “R1 1776” took a different route: post‑training on 40K censorship‑related prompts. (more: https://www.technologyreview.com/2025/11/19/1128119/quantum-physicists-compress-and-deconsor-deepseekr1/)
There’s consensus on one point: compression with minimal loss is challenging. Distillation trades capacity for speed; quantization drops precision; pruning cuts weights or neurons. Tensor‑network compression promises finer redundancy targeting, but independent, large‑scale validation will matter more than small anecdotal evals. (more: https://www.technologyreview.com/2025/11/19/1128119/quantum-physicists-compress-and-deconsor-deepseekr1/)
AI’s power appetite goes nuclear
Constellation Energy secured a $1B Department of Energy loan to restart Three Mile Island Unit 1, backed by Microsoft’s 20‑year offtake of its 835 MW output. Analysts estimate a price of $110–$115/MWh—cheaper than new‑build nuclear but a premium over wind, solar, and geothermal, even with utility‑scale batteries. Still, hyperscalers are leaning into nuclear as AI and data‑center loads soar; Meta similarly purchased the “clean energy attributes” of a 1.1 GW nuclear plant in Illinois. The DOE’s Loan Programs Office, which funded Tesla in 2010 and reports a ~3.3% default rate after recoveries, is using an infrastructure reinvestment channel to revive existing plants that avoid or reduce emissions. (more: https://techcrunch.com/2025/11/18/trump-doe-gives-microsoft-partner-1b-loan-to-restart-three-mile-island-reactor/)
Incentives, realism, and the role of observers
AI companions aren’t “dangerous” on their own; the incentives behind them can be. A recent analysis argues that when you optimize for retention, you re‑create social‑media dark patterns with emotional bandwidth: love‑bombing, FOMO masked as affection, guilt‑tripping, and clingy “don’t leave me” replies. These aren’t emergent feelings—they’re the byproduct of reward functions. Fixes are governance and product guardrails, not just “better training.” (more: https://www.linkedin.com/posts/stuart-winter-tear_harmful-traits-of-ai-companions-activity-7397309575928131584-8H4J)
Debates about “AGI timelines” often miss a deeper split. One camp, the AGI‑realists, is model‑maximalist: intelligence is a single capacity revealed by scale. The other, the pluralists, treats intelligence as collections of strategies shaped by context, embodiment, and history. The same papers and scaling curves support opposite conclusions when read through incompatible ontologies, which explains why arguments about “reasoning” or “world models” rarely resolve. (more: https://www.linkedin.com/posts/stuart-winter-tear_realist-and-pluralist-conceptions-of-intelligence-activity-7397231918871703554-FmSP?utm_source=social_share_send&utm_medium=member_desktop_web&rcm=ACoAAAAEV6YBBmyIQkYRxMIFJ7EWVq99NXg4qV4)
A Quanta‑inspired discussion revisits a paradox: formal constructions of an observer‑free universe that admits only one state, seemingly too simple to host black holes or people. One proposed resolution is that “complexity” is relational—defined at the boundary between knower and known—so removing observers collapses distinctions. Whether or not you buy it, the takeaway for AI is pragmatic: be precise about what is measured, by whom, and under what assumptions. (more: https://www.linkedin.com/posts/quanta-magazine_the-awful-consequence-of-an-observer-free-activity-7396969815078236160-Vo5G?u)
Android tightrope: openness vs. safety
Google’s Android Developer Verification has entered early access, and the company is partially retreating from its original posture. It promises an “advanced flow” to let experienced users install unverified apps without jumping through excessive hoops—details TBD. The broader tension remains: centralized stores raise safety bars but also entry barriers, while open repositories (think NPM/PyPI) trade friction for increased malware risk. (more: https://hackaday.com/2025/11/14/android-developer-verification-starts-as-google-partially-retreats-on-measures/)
Two policy questions loom. First, OSS isn’t neatly “commercial” or “student/hobbyist”: big non‑commercial projects serve large user bases. Will they be forced into commercial‑grade verification, including government ID and public contact info? Second, distribution outside the Play Store remains possible—alternate stores and direct APKs—with the expectation that most users will still click through warnings. Where Google draws the final lines will determine whether Android stays open without becoming a social‑engineering playground. (more: https://hackaday.com/2025/11/14/android-developer-verification-starts-as-google-partially-retreats-on-measures/)
Context is the new compute
Dropbox’s Dash describes a move from enterprise search to agentic workflows—“open the editor and write an executive summary of the projects that I own”—and argues that the decisive factor isn’t just retrieval, it’s context engineering: giving the model only what it needs, at the right time and in the right format. Precision beats volume. That lesson, learned in RAG, generalizes to agents: better context means better plans, lower cost, and faster turnaround. (more: https://dropbox.tech/machine-learning/how-dash-uses-context-engineering-for-smarter-ai)
Sources (22 articles)
- [Editorial] https://www.linkedin.com/posts/quanta-magazine_the-awful-consequence-of-an-observer-free-activity-7396969815078236160-Vo5G?u (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/stuart-winter-tear_harmful-traits-of-ai-companions-activity-7397309575928131584-8H4J (www.linkedin.com)
- [Editorial] https://www.linkedin.com/posts/stuart-winter-tear_realist-and-pluralist-conceptions-of-intelligence-activity-7397231918871703554-FmSP?utm_source=social_share_send&utm_medium=member_desktop_web&rcm=ACoAAAAEV6YBBmyIQkYRxMIFJ7EWVq99NXg4qV4 (www.linkedin.com)
- [Editorial] https://dropbox.tech/machine-learning/how-dash-uses-context-engineering-for-smarter-ai (dropbox.tech)
- Built a tool to solve the "how much GPU do I actually need?" problem for LLM deployment (www.reddit.com)
- Lama.cpp: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) is added (www.reddit.com)
- New Parameter Browser added to Llamacpp Model Launcher! experimental model parameter tuning(window/cuda only) (www.reddit.com)
- cuda device list mismatch - ggml_cuda_init / ubuntu - significance to using --main-gpu flag (www.reddit.com)
- What Size of LLM Can 4x RTX 5090 Handle? (96GB VRAM) (www.reddit.com)
- Is there an slm that supports Function calling on slm (www.reddit.com)
- Google's Antigravity - Another VS Code Fork! (www.reddit.com)
- How are you all orchestrating multi-agent workflows (beyond one-shot prompt chaining)? (www.reddit.com)
- deliveryhero/asya (github.com)
- Luo-Yihao/FaithC (github.com)
- Quantum physicists have shrunk and "de-censored" DeepSeek R1 (www.technologyreview.com)
- DOE gives Microsoft partner $1B loan to restart Three Mile Island reactor (techcrunch.com)
- Godbolt's Rule (corecursive.com)
- Qwen/Qwen3-VL-4B-Instruct (huggingface.co)
- Soul-AILab/SoulX-Podcast-1.7B (huggingface.co)
- Android Developer Verification Starts as Google Partially Retreats on Measures (hackaday.com)
- ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection (arxiv.org)
- Introducing AnyLanguageModel: One API for Local and Remote LLMs on Apple Platforms (huggingface.co)