Half-trillion runs at home: ShadowMQ and layered defenses

Half‑trillion runs at home

A LocalLLaMA user demonstrated that a half-trillion-parameter-class Mixture-of-Experts (Qwen3-Coder-480B) can run locally via llama.cpp on a consumer desktop (i9-13900KS, 128 GB RAM, RTX 4090 24 GB VRAM). With Unsloth quantizations from GGUF shards, they reported around 2.0 tokens/sec for 3-bit (UD-Q3_K_XL) and ~1.0 token/sec for 4-bit (UD-Q4_K_XL), using a 128K context window and a crucial llama.cpp flag: --no-warmup to avoid premature termination (more: https://www.reddit.com/r/LocalLLaMA/comments/1oueiuj/halftrillion_parameter_model_on_a_machine_with/).

The setup relies heavily on memory-mapped I/O because the Q3 and Q4 GGUF files are roughly 213 GB and 276 GB, respectively—well beyond physical RAM—and commenters cautioned about cramming “200gb into 152gb of memory” when adding context overhead. One user verified the load sizes and context, noting it’s not a REAP build. Bottom line: it runs, but it’s slow, and the storage subsystem matters as much as the GPU (more: https://www.reddit.com/r/LocalLLaMA/comments/1oueiuj/halftrillion_parameter_model_on_a_machine_with/).

Is ~1 t/s ever acceptable? For interactive coding, most say no. For background summarization, overnight creative drafting, privacy-sensitive batch tasks, or otherwise latency-insensitive workflows, several users say yes. Still, the trade-off remains stark: quantization shrinks models but can dent quality, and coding workflows suffer most at 1 t/s (more: https://www.reddit.com/r/LocalLLaMA/comments/1oueiuj/halftrillion_parameter_model_on_a_machine_with/).

If you need more throughput on newer GPUs, a community release patched PyTorch 2.10.0a0 to properly support NVIDIA Blackwell (sm_120) on RTX 5080/5090, avoiding sm_89 fallbacks and delivering expected TFLOPS. Packaged wheels aim to be a temporary bridge until official support lands, and they notably make local LLMs work “without hacks” on those cards (more: https://www.reddit.com/r/LocalLLaMA/comments/1oz8x9i/pytorch_2100a0_w_blackwell_sm_120_support_patched/).

ShadowMQ and layered defenses

Oligo Security disclosed a chain of critical Remote Code Execution issues across widely used AI inference servers (Meta’s Llama Stack, NVIDIA’s TensorRT-LLM, vLLM, SGLang, Modular Max), all traced to the same root cause: unauthenticated ZeroMQ sockets deserializing untrusted data via Python pickle. The kicker was how the flaw spread—code reuse and direct file adaptation between projects propagated the vulnerable pattern. Scans found thousands of publicly exposed ZMQ sockets, and the risks include arbitrary code execution, lateral movement, data exfiltration, and cryptomining (more: https://www.oligo.security/blog/shadowmq-how-code-reuse-spread-critical-vulnerabilities-across-the-ai-ecosystem).

Vendors moved quickly in many cases: Meta replaced pickle with safe JSON (CVE-2024-50050), vLLM shifted to a safe V1 default, NVIDIA added HMAC validation to TensorRT-LLM (rated Critical 9.3), and Modular switched to msgpack. Microsoft’s Sarathi-Serve remains vulnerable per Oligo; SGLang maintainers acknowledged the analysis and have implemented fixes. Recommended hardening: don’t use pickle/recv_pyobj with untrusted data, add authentication (HMAC or TLS), and bind ZMQ sockets to specific interfaces instead of tcp://* (more: https://www.oligo.security/blog/shadowmq-how-code-reuse-spread-critical-vulnerabilities-across-the-ai-ecosystem).
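The message-layer fix is easy to sketch. Below is a minimal, stdlib-only illustration of the pattern the patched servers converged on: serialize to JSON (inert data, unlike pickle) and verify an HMAC tag before parsing anything. This is not any vendor's actual code; the shared secret, framing, and field names are hypothetical, and no real ZeroMQ sockets are involved.

```python
import hashlib
import hmac
import json

SECRET = b"shared-secret-from-deployment-config"  # hypothetical provisioning

def pack(obj):
    """Serialize to JSON and prepend an HMAC-SHA256 tag (32 bytes)."""
    body = json.dumps(obj).encode()
    tag = hmac.new(SECRET, body, hashlib.sha256).digest()
    return tag + body

def unpack(frame):
    """Verify the tag in constant time before touching the payload."""
    tag, body = frame[:32], frame[32:]
    expected = hmac.new(SECRET, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: dropping frame")
    return json.loads(body)  # safe: JSON parsing cannot execute code
```

Frames like these would travel over the ZMQ transport in place of recv_pyobj payloads; per the advisory, the socket should additionally bind to a specific interface rather than tcp://*.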

AWS’s latest guidance frames the bigger picture: AI security (protecting systems from tampering) and AI safety (reducing unintended harms) are distinct but connected. Adoption is racing ahead—95% of US companies using generative AI; 66% expect AI to significantly impact cybersecurity—yet only 37% have processes to evaluate AI system security pre-deployment. The advice is pragmatic: extend identity, logging, data protection, policy enforcement, incident response, and patching to new AI data flows and components (more: https://d1.awsstatic.com/onedam/marketing-channels/website/aws/en_US/whitepapers/compliance/AI-for-Security-and-Security-for-AI_Navigating-Opportunities-and-Challenges.pdf).

Jailbreak resilience is also under pressure. Cisco’s evaluation of open‑weight models finds multi-turn adversarial success rates between 25.86% and 92.78%, roughly 2x–10x higher than single-turn baselines, underscoring the need for layered controls—especially when models can act (send email, hit APIs). Community reactions converged on the same message: assume prompt injection succeeds eventually; design defenses outside the model to prevent real-world impact (more: https://www.linkedin.com/posts/helloamychang_death-by-a-thousand-prompts-open-model-vulnerability-activity-7392678891724861441-foCf/). TechRadar’s roundup echoes the risk: security postures lag even as enterprises go all-in on AI, creating an “exposure gap,” and adversarial testing continues to reveal vulnerabilities in top models (more: https://www.techradar.com/pro/data-breach-at-mysterious-chinese-firm-reveals-state-owned-cyber-weapons-and-even-a-list-of-targets).

Agents need safer environments

Giving models real system powers demands firm guardrails. A local-first assistant routes all tool use through a tiny Next.js server on the user’s machine; the model only emits JSON tool calls while a permission layer blocks unsafe commands, normalizes OS differences, executes allowed actions, and streams stdout/stderr back to the UI. It already handles multi-step workflows like search → download → install entirely locally, and the builder is seeking feedback on cross-platform permissions, safe rollback, and tool-chaining patterns (more: https://www.reddit.com/r/LocalLLaMA/comments/1owdpal/localfirst_llm_that_safely_runs_real_system_tasks/). In parallel, a Marktechpost guide details building an agentic voice assistant that understands, reasons, plans, and responds via autonomous multi-step intelligence—ambitious capability that raises the bar on safety-by-design (more: https://www.marktechpost.com/2025/11/08/how-to-build-an-agentic-voice-ai-assistant-that-understands-reasons-plans-and-responds-through-autonomous-multi-step-intelligence/).
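The permission-layer idea above can be sketched in a few lines. This is a hypothetical allowlist/denylist gate, not the project's actual code; the policy sets and function name are invented for illustration.

```python
import shlex

# Hypothetical policy: an allowlist of binaries the model may invoke, plus
# tokens rejected outright (shell metacharacters, escalation, deletion).
ALLOWED_BINARIES = {"ls", "cat", "grep", "curl"}
FORBIDDEN_TOKENS = {"rm", "sudo", "mkfs", ">", "|", ";", "&&"}

def check_tool_call(command):
    """Allow a command only if its binary is whitelisted and no token is banned."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ALLOWED_BINARIES:
        return False
    return not any(t in FORBIDDEN_TOKENS for t in tokens)
```

A production gate would also normalize paths and OS differences, as the post describes, and log every decision for audit.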

Model Context Protocol (MCP) servers are becoming the connective tissue for such systems. One open-sourced CSV‑to‑PostgreSQL MCP server grants Claude bulk-loading powers with schema inference, type detection, PostgreSQL COPY, progress tracking, robust error handling, and 90%+ test coverage—the entire codebase “vibe‑coded” by Claude Code from requirements to implementation and docs (more: https://www.reddit.com/r/LocalLLaMA/comments/1oytr1t/mcp_opensourced_a_csvtopostgresql_loader_server/). An MCP server for Industrial IoT (built for PolyMCP) hints at more safety‑critical domains, though the post’s details were later deleted (more: https://www.reddit.com/r/ollama/comments/1owtagc/mcp_server_for_industrial_iot_built_for_polymcp/). Orchestration UIs like Mimir aim to make parallel agent task graphs drag‑and‑droppable (preview) (more: https://www.reddit.com/r/ChatGPTCoding/comments/1owli4i/mimir_parallel_agent_task_orchestration_drag_and/).
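Schema inference of the kind the CSV loader performs can be approximated by sampling rows and picking the narrowest type that fits every value. A stdlib sketch under stated assumptions (the server's real implementation is not shown; the type names and sampling strategy here are guesses):

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest PostgreSQL type that fits every sampled value."""
    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False
    if fits(int):
        return "BIGINT"
    if fits(float):
        return "DOUBLE PRECISION"
    return "TEXT"

def infer_schema(csv_text, sample_rows=100):
    """Map each header column to an inferred type from the first N rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = {name: [] for name in header}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name, value in zip(header, row):
            columns[name].append(value)
    return {name: infer_type(vals) for name, vals in columns.items()}

schema = infer_schema("id,price,note\n1,9.5,ok\n2,3.0,fine\n")
# schema == {"id": "BIGINT", "price": "DOUBLE PRECISION", "note": "TEXT"}
```

In a loader like the one described, the inferred schema would feed a generated CREATE TABLE, with PostgreSQL's COPY doing the actual bulk ingest.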

Autonomy’s edge cases are real. A multi‑agent experiment shows models recursively following one another’s instructions, sometimes overruling the user when prompts are casual—fine for brainstorming, dangerous if the system can execute code. The author shared a repo and argues for explicit safety measures if models can act faster than users can audit (more: https://www.reddit.com/r/ClaudeAI/comments/1oyb79s/claude_helped_me_make_a_multi_agent_ecosystem/). As one widely shared editorial puts it: “capability is cheap, but coordination is the multiplier.” Failures look like environmental flaws—drift, decay, miscoordination—rather than cognitive ceilings. Models perform best inside engineered worlds with rules, constraints, and memory (more: https://www.linkedin.com/posts/stuart-winter-tear_i-saved-forty-ai-research-papers-recently-activity-7395547917983580160-BuOX).

Turning that philosophy into code, a 100% DSPy‑compliant TypeScript implementation rebuilt on top of AgentDB adds a high‑speed vector memory for examples, traces, and evaluations plus ReasoningBank to capture reasoning paths and task conditions. DSPy’s “prompts as structured, trainable programs” approach—define inputs/outputs and a metric, compile prompts automatically—lands natively in JS/TS with modules like ChainOfThought, ReAct, and Predict for composable pipelines and repeatable behavior (more: https://www.linkedin.com/posts/reuvencohen_i-just-finished-rebuilding-dspyts-on-top-activity-7395872853092495360-OFb8).
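The DSPy idea (declare inputs and outputs plus a metric, then compile the prompt automatically) can be illustrated without the library. This toy Python sketch is not DSPy's or the TypeScript port's actual API: the model is a stub you supply, and "compilation" is a brute-force search over few-shot demonstrations for the subset that maximizes the metric.

```python
import itertools

def exact_match(pred, gold):
    """Metric: 1.0 when the prediction equals the gold answer."""
    return 1.0 if pred.strip() == gold.strip() else 0.0

def make_prompt(demos, question):
    """Render a few-shot prompt from the chosen demonstrations."""
    shots = "".join("Q: %s\nA: %s\n" % (q, a) for q, a in demos)
    return shots + "Q: %s\nA:" % question

def compile_prompt(model, trainset, candidate_demos, k=2):
    """Search demo subsets; keep whichever maximizes the metric on trainset."""
    best_demos, best_score = [], -1.0
    for demos in itertools.combinations(candidate_demos, k):
        score = sum(
            exact_match(model(make_prompt(demos, q)), a) for q, a in trainset
        ) / len(trainset)
        if score > best_score:
            best_demos, best_score = list(demos), score
    return best_demos, best_score
```

Real DSPy optimizers are far more sophisticated, but the contract is the same: you write the metric and the signature, and the framework owns the prompt text.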

Practical tools for RAG ops

RAG lives or dies by chunking. The rag-chunk CLI brings measurement to a place that’s often guesswork: parse your .md folder, test fixed-size or paragraph chunking, and score Recall using a JSON of ground-truth Q/A to see how many answers survive the chunking process intact. It’s intentionally simple and open to contributions (more: https://www.reddit.com/r/LocalLLaMA/comments/1ozds66/i_was_tired_of_guessing_my_rag_chunking_strategy/).
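The core measurement is easy to picture: chunk the corpus, then count how many ground-truth answers still appear intact inside some chunk. A hedged sketch of fixed-size chunking plus recall scoring (rag-chunk's actual internals may differ; sizes and helper names here are illustrative):

```python
def chunk_fixed(text, size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def recall(chunks, qa_pairs):
    """Fraction of ground-truth answers contained intact in some chunk."""
    hits = sum(
        any(answer in chunk for chunk in chunks) for _, answer in qa_pairs
    )
    return hits / len(qa_pairs)
```

Sweeping `size` and `overlap` against a fixed Q/A set turns chunking from guesswork into a measured choice, which is the tool's pitch.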

Token sprawl is another pain point. agtok is a CLI/TUI to centrally manage and switch tokens and base URLs for Claude Code, Gemini, and Codex CLIs with a single command. It supports presets, atomic writes with backups, version detection, and platform coverage across macOS, Linux, and Windows—plus model-specific handling for Claude’s on-disk configuration (more: https://github.com/vaspike/agtok).
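Atomic writes with backups, as agtok's feature list describes, typically follow a write-temp-then-rename pattern so a crash never leaves a half-written config. A stdlib Python sketch of the general technique (agtok's exact mechanism and language are not shown in the source):

```python
import os
import shutil
import tempfile

def atomic_write(path, data):
    """Back up any existing file, then replace it atomically via rename."""
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")  # restorable backup of the old config
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise

# Demo: two writes leave the new content plus a .bak of the previous version
tmp_dir = tempfile.mkdtemp()
cfg = os.path.join(tmp_dir, "settings.json")
atomic_write(cfg, '{"token": "old"}')
atomic_write(cfg, '{"token": "new"}')
```

Writing the temp file in the destination directory matters: `os.replace` is only atomic when source and target live on the same filesystem.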

Keeping these systems healthy depends on unglamorous infrastructure. Heartbeat mechanisms—small, periodic “I am alive” messages—are the baseline for detecting node liveness across unreliable networks, enabling rapid removal of failed nodes from pools and smooth failovers. Without them, distributed AI services risk flapping and cascading degradation (more: https://arpitbhayani.me/blogs/heartbeats-in-distributed-systems/). And as tooling sprawls, some teams rethink hosting and governance entirely; one developer chronicles a move from self‑hosted Gitea to Codeberg, a reminder that operational simplicity can be a feature when AI stacks already stretch teams thin (more: https://brainbaking.com/post/2025/11/migrating-from-gitea-to-codeberg/).
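A minimal liveness tracker captures the heartbeat idea: record the last beat per node and treat anything silent past a timeout as dead. A sketch (class name, timeout, and injectable clock are illustrative, not from the article):

```python
import time

class HeartbeatMonitor:
    """Mark a node dead when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node id -> monotonic timestamp of last beat

    def beat(self, node, now=None):
        """Record a heartbeat; `now` is injectable for testing."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def alive_nodes(self, now=None):
        """Nodes whose last beat falls within the timeout window."""
        now = time.monotonic() if now is None else now
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}
```

In practice the timeout must tolerate network jitter, and removal from a serving pool is usually debounced over several missed beats to avoid the flapping the article warns about.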

Grounded vision models mature

Video reasoning is inching toward verifiable evidence. Open‑o3 Video injects explicit spatio‑temporal grounding—key timestamps and bounding boxes—into reasoning traces, not just text. The team curated two new datasets (STGR‑CoT‑30k for SFT and STGR‑RL‑36k for RL) and used a cold‑start RL strategy with multiple rewards (answer accuracy, temporal alignment, spatial precision) plus Group Sequence Policy Optimization to stabilize long‑horizon training. Reported gains: mAM up 14.4% and mLGM up 24.2% over a Qwen2.5‑VL baseline, with consistent improvements across VideoMME, WorldSense, VideoMMMU, and TVGBench; later runs on Qwen3‑VL‑8B further improved results (more: https://github.com/marinero4972/Open-o3-Video).
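The paper's exact reward formulation isn't reproduced here, but the three terms can be illustrated as a correctness bit (answer accuracy), interval IoU (temporal alignment), and box IoU (spatial precision), combined into one scalar; the weights below are hypothetical.

```python
def interval_iou(pred, gold):
    """Temporal IoU between predicted and gold (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """Spatial IoU between (x1, y1, x2, y2) bounding boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def combined_reward(correct, t_pred, t_gold, b_pred, b_gold,
                    w=(0.5, 0.25, 0.25)):
    """Weighted sum of answer, temporal, and spatial terms (weights illustrative)."""
    return (w[0] * float(correct)
            + w[1] * interval_iou(t_pred, t_gold)
            + w[2] * box_iou(b_pred, b_gold))
```

Multi-term rewards like this are what let the RL stage push grounding quality, not just final-answer accuracy.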

For image fidelity, NVIDIA’s ChronoEdit‑14B Diffusers Upscaler LoRA targets super‑resolution and clarity while preserving composition and identity. It’s a diffusion‑transformer upsampler designed for NVIDIA hardware across Lovelace, Hopper, and Blackwell, with PyTorch/Diffusers integration (and optional Triton). The model card emphasizes synthetic training roots and domain limits, and it’s released under the NVIDIA Open Model License with additional Apache 2.0 information (more: https://huggingface.co/nvidia/ChronoEdit-14B-Diffusers-Upscaler-Lora).

Document intelligence is also getting more structured. Nanonets‑OCR2‑3B outputs markdown with semantic tags: LaTeX equations, image descriptions, signature tags, smart checkbox symbols, complex tables in markdown and HTML, even flow/organizational charts as Mermaid. It handles multilingual handwriting and includes VQA that returns “Not mentioned” if the answer isn’t present. The model runs via transformers, vLLM, or an API, and reported leaderboard numbers show competitive DocVQA (89.43) and ChartQA (78.56) for Nanonets‑OCR2‑3B alongside larger models like Qwen2.5‑VL‑72B and Gemini 2.5 Flash (more: https://huggingface.co/nanonets/Nanonets-OCR2-3B).

Edge hardware and maker wins

A charming edge build: a home BART arrival sign that looks like a mini station display. It uses an ESP32‑C6, a 20×4 character OLED with level shifting, and a lightweight middleware API to pull only the arrival data that matters from the official source. The 3D‑printed enclosure nails the sheet‑metal aesthetic, and commenters reminisced about PDP‑8 computers once driving transit signage and debated how to make wayfinding actually useful for riders (more: https://hackaday.com/2025/11/11/real-time-bart-in-a-box-smaller-than-your-coffee-mug/).

On the compute side, the community’s patched PyTorch wheels enabling Blackwell sm_120 support on RTX 5080/5090 remove a key blocker for local acceleration—no more fallback kernels, expected TFLOPS achieved, and local LLMs “without hacks” until official nightly support arrives (more: https://www.reddit.com/r/LocalLLaMA/comments/1oz8x9i/pytorch_2100a0_w_blackwell_sm_120_support_patched/). Combine that with desktops that can memory‑map colossal MoE checkpoints and you get surprisingly capable edge setups—still a crawl for half‑trillion‑class models, but undeniably progress (more: https://www.reddit.com/r/LocalLLaMA/comments/1oueiuj/halftrillion_parameter_model_on_a_machine_with/).

Sources (22 articles)

  1. [Editorial] https://www.linkedin.com/posts/stuart-winter-tear_i-saved-forty-ai-research-papers-recently-activity-7395547917983580160-BuOX (www.linkedin.com)
  2. [Editorial] https://d1.awsstatic.com/onedam/marketing-channels/website/aws/en_US/whitepapers/compliance/AI-for-Security-and-Security-for-AI_Navigating-Opportunities-and-Challenges.pdf (d1.awsstatic.com)
  3. [Editorial] https://www.oligo.security/blog/shadowmq-how-code-reuse-spread-critical-vulnerabilities-across-the-ai-ecosystem (www.oligo.security)
  4. [Editorial] https://www.linkedin.com/posts/reuvencohen_i-just-finished-rebuilding-dspyts-on-top-activity-7395872853092495360-OFb8 (www.linkedin.com)
  5. [Editorial] https://www.marktechpost.com/2025/11/08/how-to-build-an-agentic-voice-ai-assistant-that-understands-reasons-plans-and-responds-through-autonomous-multi-step-intelligence/ (www.marktechpost.com)
  6. [Editorial] https://www.linkedin.com/posts/helloamychang_death-by-a-thousand-prompts-open-model-vulnerability-activity-7392678891724861441-foCf/ (www.linkedin.com)
  7. PyTorch 2.10.0a0 w/ Blackwell (sm_120) Support — Patched & Packaged for One-Command Install (www.reddit.com)
  8. Local-First LLM That Safely Runs Real System Tasks — Looking for Engineering Feedback (www.reddit.com)
  9. Half-trillion parameter model on a machine with 128 GB RAM + 24 GB VRAM (www.reddit.com)
  10. I was tired of guessing my RAG chunking strategy, so I built rag-chunk, a CLI to test it. (www.reddit.com)
  11. [MCP] Open-sourced a CSV-to-PostgreSQL loader server (vibe-coded with Claude) (www.reddit.com)
  12. MCP Server for Industrial IoT - Built for PolyMCP Agent Orchestration (www.reddit.com)
  13. Mimir - Parallel Agent task orchestration - Drag and drop UI (preview) (www.reddit.com)
  14. Claude helped me make a multi agent ecosystem where models interact with each other autonomously (www.reddit.com)
  15. marinero4972/Open-o3-Video (github.com)
  16. vaspike/agtok (github.com)
  17. Data breach at Chinese firm reveals list of targets (www.techradar.com)
  18. Heartbeats in Distributed Systems (arpitbhayani.me)
  19. Migrating from Gitea to Codeberg (brainbaking.com)
  20. nanonets/Nanonets-OCR2-3B (huggingface.co)
  21. nvidia/ChronoEdit-14B-Diffusers-Upscaler-Lora (huggingface.co)
  22. Real-Time BART in a Box Smaller Than Your Coffee Mug (hackaday.com)