Blackwell FP4 reality check: Local models, now mobile

Blackwell FP4 reality check

On NVIDIA’s new Blackwell (sm_120) cards, FP4 for mixture-of-experts remains more promise than product. Practitioners report NVFP4 or MXFP4 support for MoE is either missing or partial across popular stacks: vLLM’s MoE FP4 is “still unsupported/WIP,” TensorRT-LLM builds are finicky, and QuTLASS does native FP4 only for dense models, not MoE (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwh9z3/nvfp4_or_mxfp4_moe_on_sm120_rtx_5900_rtx_6000_pro/). MXFP4 can run via vLLM’s Marlin kernels on some models like gpt-oss-120b, but multiple reports suggest it’s not yet tapping native FP4 hardware paths on sm_120. One commenter flatly states: “Nvfp4 is only implemented for dense models on sm120. MXFP4 (gpt-oss) works on sm120, but I don’t think it’s using the native fp4 hardware yet.” For now, many are defaulting to FP8: one user cites GLM-4.5-Air-FP8 at 150–200 tok/s (with “eagle” speculative decoding enabled) outpacing gpt-oss-120b, while another notes GLM‑4.6‑FP8 feels stronger overall (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwh9z3/nvfp4_or_mxfp4_moe_on_sm120_rtx_5900_rtx_6000_pro/), (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw2ghd/glm_46_is_nice/).
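For those experimenting anyway, a minimal sketch of the kind of offline vLLM run people are benchmarking is below; the model ID (openai/gpt-oss-120b) and parallelism settings are illustrative, and whether the Marlin MXFP4 path actually hits native FP4 hardware on sm_120 depends on your build.

    from vllm import LLM, SamplingParams

    # Quantization (MXFP4 for gpt-oss-120b) is picked up from the checkpoint
    # config; per the thread, on sm_120 this currently routes through Marlin
    # kernels rather than a guaranteed native-FP4 path.
    llm = LLM(
        model="openai/gpt-oss-120b",
        tensor_parallel_size=2,      # split across GPUs if one card can't hold it
        max_model_len=32768,
    )
    params = SamplingParams(temperature=0.7, max_tokens=512)
    out = llm.generate(["Summarize the trade-offs of MXFP4 vs FP8 for MoE."], params)
    print(out[0].outputs[0].text)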

A few brave souls tried pushing huge MoEs through TRT-LLM with NVFP4 anyway—e.g., Qwen3‑235B‑A22B‑Instruct across 4× RTX Pro Blackwell 96 GB—hitting VRAM walls during quantization or half-working builds despite custom containers and multi-step conversion pipelines (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwh9z3/nvfp4_or_mxfp4_moe_on_sm120_rtx_5900_rtx_6000_pro/). On the other end of the spectrum, llama.cpp and GGUF paths can feel “blazing fast” anecdotally on gpt-oss‑120b, but precision and “native FP4” claims are contested—users warn GGUF 4-bit ≠ MXFP4 precision (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwh9z3/nvfp4_or_mxfp4_moe_on_sm120_rtx_5900_rtx_6000_pro/).

Speed at longer contexts remains a pain point for 120B-class models regardless of quant. Threads are trading tips to keep OSS‑120B usable as prompts grow, underscoring that the prefill phase dominates—and that splitting across multiple GPUs can help prefill scale for large prompts (nearly linearly, per one report) even if decode remains bounded (more: https://www.reddit.com/r/LocalLLaMA/comments/1nxfpqd/tips_for_getting_oss120b_to_run_faster_at_longer/), (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwh9z3/nvfp4_or_mxfp4_moe_on_sm120_rtx_5900_rtx_6000_pro/).
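To make the intuition concrete, here is a hedged back-of-envelope sketch (illustrative throughput numbers, not measurements) of why prefill dominates as prompts grow and why spreading prefill across GPUs helps even when per-token decode speed stays put.

    def request_time(prompt_tokens, output_tokens, gpus=1,
                     prefill_tok_per_s=3_000, decode_tok_per_s=30):
        # Prefill is compute-bound and parallelizes well across GPUs
        # (near-linearly, per one report); decode is per-token and stays
        # roughly fixed.
        prefill_s = prompt_tokens / (prefill_tok_per_s * gpus)
        decode_s = output_tokens / decode_tok_per_s
        return prefill_s, decode_s

    for ctx in (4_000, 32_000, 128_000):
        p1, d = request_time(ctx, 1_000, gpus=1)
        p4, _ = request_time(ctx, 1_000, gpus=4)
        print(f"{ctx:>7} prompt tokens: prefill {p1:6.1f}s on 1 GPU, "
              f"{p4:6.1f}s on 4 GPUs; decode {d:.1f}s")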

Local models, now mobile

On iOS, Noema is pushing “desktop-level” local LLM capabilities with llama.cpp, MLX, and ExecuTorch backends, RAG, and built-in web search. The latest updates add Remote Endpoints—plug directly into a local LM Studio or Ollama server, pull the model list via their native APIs, and use remote models from your phone. The developer responded quickly to feedback: the remote button is now easier to find, models can be loaded directly from chat, and in 1.4 web search moved to SearXNG with unlimited usage (the subscription was removed). It’s open source, too (more: https://www.reddit.com/r/LocalLLaMA/comments/1nuapoj/use_remote_models_on_ios_with_noema/).
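The remote-endpoint pattern is easy to reproduce outside the app: Ollama’s native API lists installed models at /api/tags and chats at /api/chat, so a phone client only needs your server’s address. A minimal sketch (the host and model tag are placeholders):

    import requests

    OLLAMA = "http://192.168.1.50:11434"   # machine on your LAN running `ollama serve`

    # Pull the model list, as a remote client would.
    models = requests.get(f"{OLLAMA}/api/tags", timeout=5).json().get("models", [])
    for m in models:
        print(m["name"])

    # Chat against one of them via the native chat endpoint.
    reply = requests.post(f"{OLLAMA}/api/chat", json={
        "model": "qwen3:4b",               # placeholder tag
        "messages": [{"role": "user", "content": "Hello from my phone!"}],
        "stream": False,
    }, timeout=120)
    print(reply.json()["message"]["content"])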

For tiny deployments like Home Assistant automations, community picks gravitate to small Qwen and Gemma variants for tool use: “Qwen 3 4B 2507 is very good for toolcalling,” with LFM 1.2B as a lightweight fallback; others highlight Gemma 3 1B or Qwen 3 1.7B, and note a “tiny” leaderboard for more options. One practical tip: if 3B-class models feel sluggish locally, consider cheap hosted inference via OpenRouter for reliability (more: https://www.reddit.com/r/LocalLLaMA/comments/1nxgdyx/best_small_model_3b_for_homeassistant/).
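For the tool-calling use case those threads describe, the usual pattern is an OpenAI-compatible chat call with a function schema. The sketch below is hedged: the local endpoint, model name, and the light_turn_on tool are all illustrative, not Home Assistant’s real API.

    import json, requests

    tools = [{
        "type": "function",
        "function": {
            "name": "light_turn_on",            # hypothetical tool
            "description": "Turn on a light entity, optionally dimmed",
            "parameters": {
                "type": "object",
                "properties": {
                    "entity_id": {"type": "string"},
                    "brightness_pct": {"type": "integer", "minimum": 1, "maximum": 100},
                },
                "required": ["entity_id"],
            },
        },
    }]

    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "qwen3-4b-instruct-2507",      # any small tool-capable model
        "messages": [{"role": "user", "content": "Dim the kitchen light to 30%."}],
        "tools": tools,
    }, timeout=60)

    msg = resp.json()["choices"][0]["message"]
    for call in msg.get("tool_calls", []):
        print(call["function"]["name"], json.loads(call["function"]["arguments"]))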

If you truly need micro, there’s even a 130M-parameter instruct model (“lille‑130m‑instruct”) floating on Hugging Face. It won’t reason like a 7B+ model, but for fixed-format tasks or deterministic tool pipelines, ultra-small models can still be useful in the loop (more: https://huggingface.co/Nikity/lille-130m-instruct).
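A hedged sketch for kicking the tires, assuming the checkpoint loads through the standard transformers causal-LM path (check the model card for its actual usage and prompt format):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "Nikity/lille-130m-instruct"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Fixed-format extraction is the kind of task a 130M model can plausibly handle.
    prompt = "Extract the city from: 'Ship the package to Zurich by Friday.'\nCity:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))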

Decentralized inference, scrappy tooling

A decentralized “gateway for AI inference” is emerging in ShareAI, which lets anyone contribute Ollama-compatible capacity and get paid: providers earn 70% of each completed request, with a Windows client available now and Linux/macOS/Docker targeted by end of November. The business model is prosumer-friendly: share when idle, burst to the network when busy, and even donate a slice of earnings to NGOs. It integrates directly with Ollama’s runtime to pull, quantize, and manage models, exposing fine-grained controls over what you share (more: https://www.reddit.com/r/ollama/comments/1nswzde/launch_ollama_compatible_shareai_open_beta/).

For agentic coding, “mini_claude_code” demonstrates that the magic is in the loop, not the GUI: ~400 lines of Python give a model controlled read/write/exec tools, with v2 adding a Todo chain and gentle “system reminder” nudges when the model loses the plan. A sibling project (“Kode”) layers in Bash, web fetch, Docker, and IDE integration for a production-grade experience (more: https://github.com/shareAI-lab/mini_claude_code). On the other side of the stack, there’s a no-frills, multi-session WhatsApp API you can self-host: scan a QR once, persist sessions, and send text or attachments via endpoint calls. Security uses a global master key plus per-session keys; it’s unofficial and warns about potential bans if abused (more: https://github.com/Codegres-com/Simple-Whatsapp-API).
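Back to mini_claude_code’s point that the magic is in the loop: below is a hedged structural sketch of that pattern (controlled tools, a todo list, periodic reminder nudges). It is not the repo’s actual code, and call_model is a scripted stand-in for whatever chat API you would use.

    import json, pathlib, subprocess

    def read_file(path): return pathlib.Path(path).read_text()
    def write_file(path, content): pathlib.Path(path).write_text(content); return "ok"
    def run_cmd(cmd): return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

    TOOLS = {"read_file": read_file, "write_file": write_file, "run_cmd": run_cmd}
    todos = ["reproduce the bug", "write failing test", "fix and re-run"]

    def call_model(messages, _script=iter([
            {"tool": "run_cmd", "args": {"cmd": "echo 'pytest: 1 failed'"}},
            {"content": "Diagnosed the failure; see notes above."}])):
        # Stand-in: a real loop would send `messages` to an LLM and parse its reply.
        return next(_script)

    messages = [{"role": "system", "content": "You are a coding agent with tools."},
                {"role": "user", "content": "Fix the failing test."}]

    for step in range(20):                      # hard cap so the loop always terminates
        reply = call_model(messages)
        if "tool" not in reply:                 # plain answer: the agent is done
            print(reply["content"]); break
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
        if step % 5 == 4:                       # gentle "system reminder" nudge
            messages.append({"role": "system", "content": f"Remaining todos: {todos}"})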

And yes, “agentic debugging” evangelists keep posting real-world demos—like spinning up a full-stack app on Cloudflare Workers with auth in “just a few prompts.” It’s a testbed of where tool-using LLMs shine today, even if claimed ease will vary with your use case and tolerance for refactors (more: https://www.reddit.com/r/ClaudeAI/comments/1ns4efb/creating_a_full_stack_app_wcloudflare_works_and/).

Security, policy, and sober ops

At the packet edge, a clever eBPF/XDP experiment filters TLS handshakes at line rate—hashing sorted cipher suites (a JA4-inspired fingerprint variant dubbed FST1) and blocking known-bad signatures inside the kernel. With XDP, a single box can drop tens of millions of packets per second, and eBPF maps shuttle hash lists between user and kernel space. It’s a tour of low-level pragmatism: tight bounds checks for verifier safety, heap-free hashing/sorting in a tiny stack budget, and an admission that non-crypto hashes suffice when attackers can spoof anyway (more: https://foxmoss.com/blog/packet-filtering/).
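The fingerprint idea itself is simple enough to illustrate in userspace: sort the ClientHello’s cipher suites, hash them with a cheap non-cryptographic hash, and check the result against a blocklist. The sketch below uses FNV-1a and made-up suite values purely as a stand-in for the post’s heap-free, in-kernel implementation; the real FST1 layout will differ in detail.

    def fst1_fingerprint(cipher_suites):
        # Sort so the fingerprint is order-insensitive, then hash the 2-byte suite IDs.
        data = b"".join(s.to_bytes(2, "big") for s in sorted(cipher_suites))
        h = 0xCBF29CE484222325                      # FNV-1a 64-bit offset basis
        for byte in data:
            h = ((h ^ byte) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
        return h

    # In the real system this set would live in an eBPF map shared with the XDP program.
    BLOCKLIST = {fst1_fingerprint([0x1301, 0x1302, 0x1303, 0x00FF])}

    def verdict(cipher_suites):
        return "DROP" if fst1_fingerprint(cipher_suites) in BLOCKLIST else "PASS"

    print(verdict([0x1303, 0x1301, 0x00FF, 0x1302]))   # DROP (same suites, different order)
    print(verdict([0x1301, 0x1302]))                    # PASS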

Higher up the stack, Chainloop introduced a preview WASM/WASI policy engine alongside its Rego support. The goal: write compliance policies in Go/Rust/JS, but evaluate them in a zero-trust WebAssembly sandbox with explicit capabilities (no implicit FS/network), near-native performance, and hot-swappable policies. It’s positioned to reduce supply-chain and policy-eval risk without locking teams into a single policy language (more: https://chainloop.dev/blog/introducing-webassembly-policy-engines/).

Meanwhile, the U.S. Army’s internal memo on Anduril’s NGC2 prototype (with Palantir and others) is a reminder that “move fast” is not a security strategy. The CTO’s assessment flagged “fundamental security” issues—no reliable access control or user activity visibility, and third-party apps with dozens to hundreds of vulnerabilities—recommending the system be treated as “very high risk.” Anduril called the report outdated; Palantir said “No vulnerabilities were found in the Palantir platform.” The Army CIO added that many issues were fixed “within weeks and days” and only one app still had vulnerabilities under work, but the early posture underscores the cost of speed without safeguards (more: https://www.cnbc.com/2025/10/03/anduril-palantir-ngc2-deep-flaws-army.html).

Overseas, Denmark’s airport “drone sightings” prompted closures and geopolitics-laced speculation. Hackaday urges evidence-based responses, citing the Gatwick fiasco where freedom-of-information digging found no tangible drone evidence, and a same-day Schiphol incident later chalked up to a balloon. The pattern: nervous observers can see “phantom drones everywhere,” so authorities should publish clear tracking and imagery to avoid runaway disruption (more: https://hackaday.com/2025/09/27/drones-at-danish-airports-a-plea-for-responsible-official-response/).

Typed workflows beat prompt twiddling

A paper proposes Type-Compliant Adaptation Cascades (TACs): treat a multi-step LM workflow as a single, typed, unnormalized probabilistic program. Each LM adaptor’s output is constrained to its schema (invalid outputs get zero mass), and the whole graph is trained end to end with a Monte Carlo EM-inspired procedure (tacSTaR) that drops ∇θ log Zθ; the authors bound the resulting bias by a constant times (1 − Zθ), which vanishes as the model becomes more type-compliant (Zθ → 1). Across MGSM/MGSM‑SymPy/FinQA/HotPotQA with 7B and 27B models, TACs outperform DSPy-style prompt optimization, especially on strict-structure tasks—for instance, MGSM‑SymPy (27B) at 75.9 vs 57.1 and FinQA (27B) at 34.0 vs 12.7 (more: https://arxiv.org/abs/2508.18244v1).
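In notation, a hedged sketch of the structure the paragraph describes (the exact norm and constant are the paper’s; this only restates the shape of the argument):

    % The typed cascade is an unnormalized program \tilde{p}_\theta, and
    % Z_\theta is the mass it places on schema-valid outputs, so
    \[
      \log p_\theta(y \mid x) \;=\; \log \tilde{p}_\theta(y \mid x) \;-\; \log Z_\theta(x).
    \]
    % tacSTaR optimizes only the first term, i.e. it drops
    % \nabla_\theta \log Z_\theta(x); per the paper, the induced gradient
    % bias is bounded by
    \[
      \lVert \text{bias} \rVert \;\le\; C \,\bigl(1 - Z_\theta(x)\bigr),
    \]
    % which goes to zero as the adaptors become fully type-compliant
    % (Z_\theta(x) \to 1).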

In the same spirit of programmatic improvement with LLM help, Sakana’s ShinkaEvolve combines evolutionary algorithms and LLMs to iteratively mutate and optimize scientific code. It supports local or Slurm runs, multi-island evolution with archives, a live WebUI, and includes examples from circle packing to ALE-Bench code optimization. LLMs serve as “intelligent mutation operators,” with verifiers keeping progress grounded in correctness and metrics (more: https://github.com/SakanaAI/ShinkaEvolve).
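The core loop is easy to sketch generically; nothing below is ShinkaEvolve’s actual API, just the “LLM as mutation operator, verifier as gate” shape, with propose_mutation and evaluate as placeholders for an LLM client and a task metric.

    import random

    def propose_mutation(parent_code: str) -> str:
        """Ask an LLM to rewrite parent_code toward a better score (stubbed as a no-op)."""
        return parent_code

    def evaluate(code: str) -> float | None:
        """Run/verify the candidate; return a score, or None if it fails checks."""
        return random.random()                      # stand-in metric

    archive = [("print('seed program')", 0.0)]      # (code, score) pairs

    for generation in range(10):
        sample = random.sample(archive, k=min(3, len(archive)))
        parent, _ = max(sample, key=lambda t: t[1]) # small tournament selection
        child = propose_mutation(parent)
        score = evaluate(child)
        if score is not None:                       # verifier gate keeps progress grounded
            archive.append((child, score))

    best_code, best_score = max(archive, key=lambda t: t[1])
    print(f"best score after 10 generations: {best_score:.3f}")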

Open 3D and image models

Tencent’s HunyuanImage‑3.0 is now open-sourced: a native multimodal, autoregressive image generator built on a Mixture‑of‑Experts (64 experts; 80B total params, 13B active per token). The team bills it as unifying understanding and generation in a single AR framework—a departure from diffusion-based DiT architectures—and promises performance competitive with closed systems. The release roadmap includes inference code, full checkpoints, “Instruct” (reasoning) variants, distilled models, vLLM integration, image-to-image, and multi-turn interaction (more: https://huggingface.co/tencent/HunyuanImage-3.0).

For 3D, Hunyuan3D‑Omni offers controllable generation with a unified control encoder—point clouds, voxels, skeletons, and bounding boxes—running in roughly 10 GB of VRAM. Flags allow EMA for stability and FlashVDM for speed; modes span point/voxel/bbox/pose to match the input control you have. It builds atop a lineage of Hunyuan3D releases with a production focus on high-fidelity assets and PBR materials (more: https://github.com/Tencent-Hunyuan/Hunyuan3D-Omni).

Open, multilingual, long-context LLMs

Apertus‑8B‑2509 aims to be fully open and compliant—open weights, open data, and full training recipes—while supporting 1,800+ languages and a 65K-token context window. It uses a decoder-only transformer with the new xIELU activation, trained on 15T tokens with the AdEMAMix optimizer and a staged web/code/math curriculum. The team provides Megatron-LM training details (4,096 GH200s), alignment via QRPO, long-context inference tips, and tool-use support via vLLM/SGLang. They also publish EU transparency docs and commit to a PII output filter that can be layered on top (more: https://huggingface.co/swiss-ai/Apertus-8B-2509).
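A hedged quickstart, assuming the checkpoint’s standard transformers integration and chat template (the model card documents vLLM/SGLang serving and long-context settings; the dtype and device choices below are illustrative):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "swiss-ai/Apertus-8B-2509"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.bfloat16, device_map="auto")

    messages = [{"role": "user", "content": "Greet me in Romansh, then translate it."}]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=128)
    print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))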

Code at speed, debt at scale

The “comprehension debt” critique is ringing loud: teams shipping LLM-generated code they haven’t fully read are accruing a backlog of future understanding cost. When tools fail to refactor or fix edge cases—cue the “doom loops”—humans must dive in, but the time to understand unfamiliar machine-produced code can match or exceed what the generator “saved” up front (more: https://codemanship.wordpress.com/2025/09/30/comprehension-debt-the-ticking-time-bomb-of-llm-generated-code/). This tracks with user reports: attempts to translate Figma designs to working mobile apps via Copilot yielded broken flows without clear ways to feed design context; others suggest trying different assistants but the core issue remains—faithful reproduction of complex app flows is still brittle (more: https://www.reddit.com/r/ChatGPTCoding/comments/1nvk31h/how_to_make_the_ai_bot_to_understand_the_exact/).

That said, some practitioners showcase smooth “agentic debugging” loops for scaffolding full-stack apps, arguing the experience is already production-friendly in narrow lanes. The truth is in the middle: the tooling is capable and improving, but reliability depends on constraints, tests, and a readiness to step in when the agent stalls (more: https://www.reddit.com/r/ClaudeAI/comments/1ns4efb/creating_a_full_stack_app_wcloudflare_works_and/).

Open-source streaming ASR

A new, fast, streaming speech recognizer—Kroko ASR—was open-sourced as an alternative to Whisper, with an explicit call for testers, feedback, and contributors. It’s early days, but the emphasis on streaming readiness and community input is exactly what open ASR needs to broaden deployment options beyond Whisper derivatives (more: https://www.reddit.com/r/LocalLLaMA/comments/1ntiua9/we_just_opensourced_kroko_asr_a_fast_streaming/).

Sources (21 articles)

  1. We just open-sourced Kroko ASR: a fast, streaming alternative to Whisper. It’s early days, we’d love testers, feedback, and contributors. (www.reddit.com)
  2. Use Remote Models on iOS with Noema (www.reddit.com)
  3. Best small model <3B for HomeAssistant (www.reddit.com)
  4. GLM 4.6 is nice (www.reddit.com)
  5. [Launch Ollama compatible] ShareAI (open beta) — decentralized AI gateway, Ollama-native (www.reddit.com)
  6. How to make the AI Bot to understand the exact design and App flow (www.reddit.com)
  7. Creating a Full Stack App W/Cloudflare Works and BetterAuth (www.reddit.com)
  8. SakanaAI/ShinkaEvolve (github.com)
  9. Tencent-Hunyuan/Hunyuan3D-Omni (github.com)
  10. How I Block All 26M of Your Curl Requests (foxmoss.com)
  11. Policy as code using your favorite programming language with WebAssembly (chainloop.dev)
  12. Comprehension debt: A ticking time bomb of LLM-generated code (codemanship.wordpress.com)
  13. tencent/HunyuanImage-3.0 (huggingface.co)
  14. swiss-ai/Apertus-8B-2509 (huggingface.co)
  15. Drones At Danish Airports, A Plea For Responsible Official Response (hackaday.com)
  16. Type-Compliant Adaptation Cascades: Adapting Programmatic LM Workflows to Data (arxiv.org)
  17. Nikity/lille-130m-instruct (huggingface.co)
  18. Anduril and Palantir battlefield comms system has deep flaws: Army (www.cnbc.com)
  19. shareAI-lab/mini_claude_code (github.com)
  20. NVFP4 or MXFP4 MOE on sm120 (RTX 5900 RTX 6000 PRO) (www.reddit.com)
  21. Show HN: Simple WhatsApp API (Open Source) (github.com)