Low‑precision training hits stride

NVIDIA’s NVFP4 pushes 4‑bit pretraining from an experiment to a credible default. The format isn’t a “flat FP4”; it uses shared FP8 E4M3 scaling at a micro‑block level (16 values share a scale), extending per‑block dynamic range to roughly 0.001–2700 while keeping the I/O and memory savings that make 4‑bit compelling. In a reported run, a 12B‑parameter Mamba Transformer trained on 10T tokens in NVFP4 tracked FP8 within ~1% validation loss for most of training, drifting to ~1.5% late in decay. Downstream deltas were small: MMLU Pro 62.58% (NVFP4) vs 62.62% (FP8), with a modest dip on coding (MBPP+ 55.91% vs 59.11%). The broader takeaway echoed by practitioners: structure and scale often matter more than per‑weight precision, and careful block‑wise handling and outlier control let aggressive quantization pay off. NVIDIA’s blog copy leans “16‑bit precision with 4‑bit speed,” but the evidence presented is “near‑FP8 accuracy at 4‑bit cost.” (more: https://www.reddit.com/r/LocalLLaMA/comments/1o61gzs/nvidia_breakthrough_gives_4bit_pretraining/)
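
For intuition, here is a toy numpy sketch of the micro‑block idea (not NVIDIA's kernels or exact rounding): 16 values share one scale sized so the block maximum lands on FP4's largest magnitude, and each value snaps to the nearest FP4 (E2M1) code.

```python
import numpy as np

# Positive FP4 (E2M1) magnitudes; signs are kept separately below.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4_like(block):
    """Toy NVFP4-style quantization: 16 values share one block scale,
    each value is snapped to the nearest FP4 (E2M1) code. Illustrative only;
    a real implementation would also round the scale itself to FP8 E4M3."""
    assert block.size == 16
    scale = max(np.abs(block).max() / FP4_E2M1[-1], 1e-8)  # map block max onto FP4 max (6.0)
    mags = np.abs(block) / scale
    idx = np.abs(mags[:, None] - FP4_E2M1[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_E2M1[idx] * scale           # dequantized values

x = np.random.randn(16).astype(np.float32)
print(np.abs(x - quantize_block_nvfp4_like(x)).mean())      # mean absolute quantization error
```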

Enterprise‑aimed open models are also leaning into efficient designs. IBM’s Granite‑4.0‑H‑Micro (3B) blends attention with Mamba2 for long context (128K), improved instruction following, tool‑calling, and multilingual ability, while maintaining a compact footprint. Notably, training ran on a GB200 NVL72 cluster with 400 Gb/s NDR InfiniBand—underscoring how hardware interconnects, not just GPU cores, now bottleneck or unlock training efficiency at scale. (more: https://huggingface.co/ibm-granite/granite-4.0-h-micro)

On the inference side, OpenVINO shows how much is left on the table without platform‑specific optimization. A small VLM (SmolVLM) converted to OpenVINO dropped time‑to‑first‑token from over 5 s (PyTorch) to 0.42 s and raised decoding throughput from 0.7 to 47 tokens/s on Intel CPUs; 8‑bit weight‑only quantization pushed it further, and static quant on the vision encoder added gains when answers are short or inputs are multi‑image. The pattern mirrors NVFP4’s logic: right‑sized precision in the right places yields most of the speedups with minimal accuracy loss. (more: https://huggingface.co/blog/openvino-vlm)
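
The blog's workflow boils down to a few lines with optimum‑intel; a rough sketch, assuming the OVModelForVisualCausalLM class and load_in_8bit flag behave as described there (verify against your optimum‑intel version):

```python
from transformers import AutoProcessor
from optimum.intel import OVModelForVisualCausalLM  # requires optimum-intel with OpenVINO extras

model_id = "HuggingFaceTB/SmolVLM-Instruct"          # SmolVLM family; exact checkpoint may differ
processor = AutoProcessor.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to OpenVINO IR on load;
# load_in_8bit=True applies the weight-only int8 quantization discussed above.
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
model.save_pretrained("smolvlm-ov-int8")             # keep the converted model for reuse
```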

For local deployments, it still pays to respect the VRAM ceiling. One developer testing the “n_gpu_layers” setting on a 16 GB RTX 5070 Ti found inference time jumped sharply once GPU memory was exceeded, a reminder that spilling to host RAM kills latency. If you want longer context windows, drop the GPU layer count until everything actually fits in VRAM, or pay the performance tax. (more: https://www.reddit.com/r/ollama/comments/1o6i8r1/ai_assisted_suite_doubt_about_n_gpu_layer_test/)
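
A back‑of‑the‑envelope check before picking a layer count can save the experiment; a rough sketch with illustrative numbers (real per‑layer sizes, KV‑cache growth, and runtime overhead vary by model and context length):

```python
def max_gpu_layers(gguf_size_gib, n_layers, vram_gib, kv_cache_gib=1.5, overhead_gib=1.0):
    """Rough estimate of how many transformer layers fit in VRAM.
    Assumes layer weights dominate the file and are roughly uniform in size."""
    per_layer = gguf_size_gib / n_layers
    budget = vram_gib - kv_cache_gib - overhead_gib
    return max(0, int(budget // per_layer))

# Example: an ~18 GiB quantized model with 48 layers on a 16 GiB card (illustrative numbers).
print(max_gpu_layers(gguf_size_gib=18.0, n_layers=48, vram_gib=16.0))  # -> 36 layers on GPU
```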

Big multimodal models keep scaling, but they also depend on all of the above. Qwen3‑VL’s latest 235B MoE “Instruct” model claims 256K context (expandable to 1M), stronger video temporal modeling, and improved spatial grounding, while recommending FlashAttention 2 for acceleration—another reminder that efficient kernels and memory‑aware formats are now first‑class design choices, not afterthoughts. (more: https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct)

Practical multimodal in the wild

OCR is finally practical for many day‑to‑day tasks—but not magical. Users working with Nanonets OCR2‑3B report big wins: financial reports convert to decent Markdown tables with cleanup; academic PDFs produce clean title/author/abstract with mostly structured references; clause hierarchies in contracts make search useful. The model still stumbles: “Subtotal” mislabeled as “Total,” “8.” read as “B.” in compressed scans, and skewed handwritten receipts remain hard. Pragmatic tips help: avoid over‑compression, keep the long edge ≥1280 px, specify tables in Markdown and math in $…$, and don’t stitch many receipts into a tall image—localization degrades. Benchmarks vary: one user’s bank‑statement pipeline saw Nanonets miss table structure that a small Mistral Q6 handled well; asking for Markdown instead of CSV may improve results. Reality check: prompts and input hygiene often matter as much as model choice. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o76pft/practical_ocr_with_nanonets_ocr23b/)
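
Much of that input hygiene is easy to automate; a small Pillow sketch of the thread's guidance (the threshold and prompt wording are illustrative):

```python
from PIL import Image

def prep_for_ocr(path, min_long_edge=1280):
    """Upscale only when the long edge is below the ~1280 px guideline from the thread;
    avoids re-compressing images that are already large enough."""
    img = Image.open(path).convert("RGB")
    long_edge = max(img.size)
    if long_edge < min_long_edge:
        scale = min_long_edge / long_edge
        img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    return img

# Be explicit about output formats, per the thread's advice.
PROMPT = ("Extract the document. Render tables as Markdown tables "
          "and mathematical expressions inside $...$.")
```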

Model claims are rising too. Qwen3‑VL touts expanded OCR in 32 languages, better handling of blur/tilt/low light, and improved long‑document structure parsing, alongside upgraded recognition across celebrities, products, flora/fauna, and more. Those capabilities look directly relevant to the failure modes users cite, but, as always, “marketing claims” need to be verified on your own data. (more: https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct)

Video agents are getting more end‑to‑end. Paper2Video (“PaperTalker”) chains multiple agents to turn a paper plus a reference headshot/audio into a full presentation with slides, subtitles, synthesized speech, cursor movements, and a talking head. It ships an MIT‑licensed pipeline and a benchmark across 101 paper–video pairs with metrics like Meta Similarity, PresentArena, PresentQuiz, and IP Memory. Today it runs best with GPT‑4.1 or Gemini 2.5 and a beefy GPU (A6000 48 GB) end‑to‑end; the community is already asking for a base_url config so local models can slot in. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o4szf0/paper2video_turn_a_research_paper_into_a_full/)

Audio is joining the on‑device movement. NeuTTS Air bills itself as a “super‑realistic” on‑device TTS with instant voice cloning from ~3 seconds of audio. Under the hood: a 0.5B Qwen‑based backbone, a 50 Hz neural audio codec, 2048‑token window (~30 seconds of audio with prompt), and GGML/GGUF/ONNX paths for real‑time generation on laptops and even Raspberry Pis. The repo stresses clean reference audio and ships examples; also: beware imposter websites. For researchers, Xiaomi’s MiMo‑Audio‑Eval brings a unified toolkit to evaluate pre‑trained or SFT audio language models across multiple datasets and tasks, including scripts to reproduce paper results and practical notes like prebuilt FlashAttention wheels. Together, these signal a push toward robust, reproducible audio agents that don’t require a cloud round‑trip. (more: https://github.com/neuphonic/neutts-air) (more: https://github.com/XiaomiMiMo/MiMo-Audio-Eval)

Agents wrestle with async tools

Realtime agents still struggle with long‑running tools. One developer found that when a tool takes seconds or minutes, sequential model behavior forces the assistant to “stop talking” or cancels the call if interrupted. An orchestration pattern like Pipecat’s async tool calls “kinda” worked with one model but broke others by retroactively injecting results into past turns. Practical design ideas from the thread mirror concurrency patterns: return “in progress” handles (Promises/Futures), make the conversation itself asynchronous so tools can post results later, or fork the conversation into “pre‑result” and “post‑result” branches and join them deliberately. None of this is free; it’s the cost of turning a turn‑based LLM into an event‑driven system. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o4gka5/how_to_handle_long_running_tools_in_realtime/)
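
In code, the "in progress handle" idea looks like ordinary asyncio plumbing; a minimal sketch (names are illustrative, not any particular framework's API):

```python
import asyncio
import uuid

pending: dict[str, asyncio.Task] = {}

async def slow_tool(query: str) -> str:
    await asyncio.sleep(30)                       # stand-in for a minutes-long call
    return f"result for {query!r}"

def start_tool(query: str) -> dict:
    """Invoked when the model calls the tool: start the work and return an
    'in progress' handle immediately so the conversation keeps flowing.
    Must be called from inside the running event loop."""
    call_id = uuid.uuid4().hex[:8]
    pending[call_id] = asyncio.create_task(slow_tool(query))
    return {"status": "in_progress", "call_id": call_id}

def drain_finished() -> list[dict]:
    """Run between turns: pop completed calls and append them as fresh
    tool-result messages rather than rewriting earlier turns."""
    done = [cid for cid, task in pending.items() if task.done()]
    return [{"role": "tool", "call_id": cid, "content": pending.pop(cid).result()}
            for cid in done]
```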

Reasoning parsers add their own wrinkles. A vLLM setup serving GLM‑4.6 FP8 with a “glm45” reasoning parser kept leaking think tokens; switching templates to “no‑think by default, enable with /think” helped in a related GLM‑4.5 case. The meta‑lesson: your chat template, parser, and “thinking” conventions must align, or you’ll ship hidden chain‑of‑thought to users. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o75s9p/anyone_else_having_reasoning_parser_issue_with/)
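
Until template and parser agree, a belt‑and‑braces filter on the serving side is cheap insurance; a minimal sketch assuming GLM/Qwen‑style <think> tags:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(text: str) -> str:
    """Guard for when the reasoning parser and chat template disagree: drop
    <think>...</think> spans (and any dangling open tag) before text reaches users."""
    text = THINK_RE.sub("", text)
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()

print(strip_think("<think>internal chain of thought</think>The answer is 42."))  # -> The answer is 42.
```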

Dev tooling is adopting “plan first, act later.” A “Plan mode” for OpenAI’s Codex CLI appeared in a public video and PR, prompting comparisons to IDE agents like Cline that draft a multi‑step plan before editing code. That pattern is more than UX sugar: teams describe it as “spec‑driven” development, a lightweight guardrail against vibecoding changes you don’t want. Note that open‑source development in public means “leaks” are often just visible work‑in‑progress. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1o6qefg/plan_mode_coming_to_codex_cli/)

Meanwhile, model variance keeps biting. isitnerfed.org flagged an elevated number of failed coding tests for Claude Sonnet 4.5 while Sonnet 4 looked normal. Practitioners reported “goofy mistakes,” including misreading or misinputting data pulled via Model Context Protocol (MCP), and some switched models mid‑workflow; others asked for methodology details, noting the chart’s variance. Regardless of cause, it reinforces the need for continuous, automated regression testing in agent stacks that combine tools, templates, and fast‑moving frontier models. (more: https://www.reddit.com/r/ClaudeAI/comments/1o3nk6b/something_is_wrong_with_sonnet_45/)
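
That testing does not have to be elaborate; a minimal pytest sketch, where call_agent is a hypothetical entry point into your own stack:

```python
import pytest

def call_agent(prompt: str) -> str:
    """Hypothetical placeholder: route the prompt through your model, template,
    and tools, then return the final user-visible text."""
    raise NotImplementedError

# Small, objective cases pinned in CI so drift shows up as failing tests, not anecdotes.
CASES = [
    ("Return the sum of 17 and 25 as a bare integer.", "42"),
    ("Reverse the string 'abc' and return only the result.", "cba"),
]

@pytest.mark.parametrize("prompt,expected", CASES)
def test_agent_regression(prompt, expected):
    assert expected in call_agent(prompt)
```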

Embodied AI plans ahead

Robots need to replan before they fail, not after. A new arXiv paper proposes scene‑graph‑guided proactive replanning: compare a current RGB‑D‑grounded scene graph to reference graphs from successful demos; if preconditions for the next subtask aren’t met, diagnose the likely cause and revise the plan before executing. This targets the real culprits in long‑horizon manipulation—subtle spatial/relational factors like occlusions, reachability, or whether an object is already being held—where post‑hoc recovery is inefficient or impossible. The work explicitly avoids rule‑based triggers and extensive human supervision, aiming for scalable robustness. Code is set to be released upon publication. (more: https://arxiv.org/abs/2508.11286v1)
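
The core check is easy to caricature in a few lines; a toy sketch of the precondition idea (the relations and names are invented, not the paper's representation):

```python
# Relations extracted from the current RGB-D-grounded scene graph (toy example).
current_graph = {("mug", "on", "table"), ("drawer", "state", "closed"),
                 ("gripper", "holding", "nothing")}

# Relations that held in successful demos right before the "pick mug" subtask.
required_before_pick = {("mug", "on", "table"), ("gripper", "holding", "nothing"),
                        ("mug", "reachable", "true")}

def check_preconditions(scene, required):
    """Return whether the next subtask's preconditions hold, and which are missing."""
    missing = required - scene
    return len(missing) == 0, missing

ok, missing = check_preconditions(current_graph, required_before_pick)
if not ok:
    print("replan before executing; unmet preconditions:", missing)
```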

3D content generation is also getting a “local‑to‑global” boost. MeshMosaic assembles artist‑grade meshes by generating parts and stitching them into coherent objects. The official repo is prepping a public release with instructions spanning PyTorch 2.5.1 CU124, FlashAttention, and custom kernels, pretrained weights on Hugging Face, and a workflow for segmenting input meshes into connected components before generation. It’s been tested on A100, A800, and H20 GPUs, reflecting the compute needed for high‑quality geometry. (more: https://github.com/Xrvitd/MeshMosaic)

The modeling stack is converging: perception, spatial reasoning, and control in one loop. Qwen3‑VL highlights “Advanced Spatial Perception” with 2D grounding and emerging 3D grounding, long‑context video understanding with timestamp alignment, and even a “Visual Agent” that can operate PC/mobile GUIs. Coupled with proactive replanning and better mesh priors, embodied agents can start behaving less like blind macro recorders and more like adaptable operators in a changing world. (more: https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct)

Security, specs, and surprises

“Proven correct” code can still be wrong in production, and not just because of bugs. A clear write‑up breaks down three failure modes for formally verified software: invalid proofs (rare with mechanized provers, but possible, especially with unsafe “assume true” shortcuts), wrong properties (like leftpad “string length” instead of “visual alignment” in Unicode), and wrong assumptions (like binary search proofs assuming unbounded integers, or algorithms silently relying on sorted inputs, sufficient memory, absence of concurrency, and stable external APIs). The bottom line: know exactly what your proofs guarantee, and communicate those limits to everyone depending on them. (more: https://buttondown.com/hillelwayne/archive/three-ways-formally-verified-code-can-go-wrong-in/)
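
The leftpad case is worth making concrete: the proved property (code‑point length) holds, while the property users actually care about (columns lining up) fails as soon as wide characters appear. A small sketch:

```python
import unicodedata

def leftpad(s: str, width: int, fill: str = " ") -> str:
    """Provably pads to `width` code points -- the property usually specified."""
    return fill * max(0, width - len(s)) + s

def display_width(s: str) -> int:
    """Rough terminal column count: wide CJK characters occupy two cells."""
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in s)

a, b = leftpad("cat", 6), leftpad("猫", 6)
print(len(a) == len(b) == 6)                # True: the proved property holds
print(display_width(a), display_width(b))   # 6 vs 7: the columns no longer line up
```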

On the firmware front, Secure Boot isn’t secure if the signed payload is a shell. Researchers showed that vendor‑signed UEFI diagnostic shells can act as “signed backdoors.” With powerful commands like mm (memory modify), an attacker with shell access can read/write arbitrary memory pre‑OS, achieving a full Secure Boot bypass while the system still “looks” like it booted securely. This isn’t about malicious keys—it’s about legitimate, powerful tools being accepted by the boot chain by design. Defense here is policy and configuration, not just cryptography. (more: https://eclypsium.com/blog/bombshell-the-signed-backdoor-hiding-in-plain-sight-on-framework-devices/)

And updates can still break fleets. A widely shared post claimed a software update bricked all 2024 Jeep Wrangler 4xe models—an extreme example, if accurate, of the blast radius when OTA governance fails. When coupled with the UEFI shell issue, it’s a reminder that the modern attack and failure surface spans from silicon to cloud, with “signed” and “automatic” not synonymous with “safe.” (more: https://twitter.com/StephenGutowski/status/1977055831720862101)

At the application layer, both prevention and analysis tooling are advancing. A modern guide to preventing CSRF in Go lays out current best practices for web developers, a good refresher as browser defaults and frameworks evolve. For defenders and reverse engineers, AGAR (Assisting Go Analysis and Reversing) plugs into IDA 9.2 and detects 5–20× more strings in Go binaries on Linux than IDA alone, and helps demystify interface‑type method calls—while openly documenting current failures on stripped binaries and some Windows builds. Secure software is a process, not a checkbox. (more: https://www.alexedwards.net/blog/preventing-csrf-in-go) (more: https://github.com/junron/agar)

Offline bidding gets smarter

Adtech continues to be a proving ground for offline decision‑making. Traditional offline RL for auto‑bidding is attractive in theory but brittle in practice: bootstrapped value estimation, off‑policy learning, and function approximation can create unstable training dynamics. Production constraints make it worse—there’s no highly accurate offline evaluator, and online A/Bs are expensive and risky—so regressions can slip through. (more: https://arxiv.org/abs/2509.15927v1)

A generative alternative, AI‑Generated Bidding (AIGB), sidestepped these pitfalls by treating policy learning as conditional sequence generation with diffusion. Because it avoids bootstrapping, training is more stable, and it empirically beat offline RL baselines. But imitation has a ceiling: without explicitly optimizing the real objective (maximize advertiser value subject to budget), performance is capped by the quality/diversity of logged trajectories, and diffusion models can overfit limited data. (more: https://arxiv.org/abs/2509.15927v1)

AIGB‑Pearl tackles that gap by adding offline reward evaluation and policy search on top of the stable generative backbone. The aim is to retain AIGB’s training robustness while directly pushing policies toward the true objective under constraints, all offline. If borne out, this hybrid—generative planning plus explicit policy optimization—could be a template for other safety‑critical sequential decision systems where online exploration is off the table. (more: https://arxiv.org/abs/2509.15927v1)
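
As a caricature of what "the true objective under constraints" means here (conceptual, not the paper's formulation): score a bidding trajectory by the advertiser value it realizes and penalize any spend beyond budget, then search for policies that score well rather than merely imitating logs.

```python
def trajectory_score(impressions, budget):
    """impressions: list of (value_if_won, cost, won) tuples from a logged or
    simulated trajectory. Pure imitation copies logged behavior; policy search
    instead pushes toward trajectories that score well under this objective."""
    value = sum(v for v, c, won in impressions if won)
    spend = sum(c for v, c, won in impressions if won)
    return value if spend <= budget else value - 1e6 * (spend - budget)

# Two toy trajectories: one within budget, one penalized for overspending.
print(trajectory_score([(1.0, 0.4, True), (2.0, 0.5, True)], budget=1.0))  # 3.0
print(trajectory_score([(1.0, 0.4, True), (2.0, 0.9, True)], budget=1.0))  # heavily penalized
```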

Open analog chips, end‑to‑end

Open silicon is no longer just digital. A hands‑on walkthrough shows the full process of designing an analog ASIC (an ADC) using the SkyWater 130 nm PDK in the Tiny Tapeout Analog Design VM. The toolchain—xschem for schematic capture, magic for physical layout—leans on Tcl for customization. And the constraints are very analog: FET choices dominate, capacitors are “expensive,” inductors are verboten, and absolute component values are squishy enough that designers aim for stable ratios instead. The community is even working on packaging options—chip‑on‑board from JLC and DIP/castellated boards—to make small runs practical beyond bare dies. (more: https://hackaday.com/2025/10/08/the-entire-process-of-building-an-open-source-analog-asic/)

Sources (22 articles)

  1. Nvidia breakthrough gives 4-bit pretraining technique the accuracy of FP8 (www.reddit.com)
  2. Paper2Video — turn a research paper into a full presentation video (slides, speech, talking head) (www.reddit.com)
  3. How to handle long running tools in realtime conversations. (www.reddit.com)
  4. Practical OCR with Nanonets OCR2‑3B (www.reddit.com)
  5. Anyone else having reasoning parser issue with Qwen-cli + GLM4.6 combo in vllm? (www.reddit.com)
  6. AI assisted suite - Doubt about n_gpu layer test (www.reddit.com)
  7. Plan mode coming to Codex CLI (www.reddit.com)
  8. Something is wrong with Sonnet 4.5 (www.reddit.com)
  9. neuphonic/neutts-air (github.com)
  10. Xrvitd/MeshMosaic (github.com)
  11. Signed Backdoor Hiding in Plain Sight on Framework Devices (eclypsium.com)
  12. Three ways formally verified code can go wrong in practice (buttondown.com)
  13. Jeep pushed software update that bricked all 2024 Wrangler 4xe models (twitter.com)
  14. ibm-granite/granite-4.0-h-micro (huggingface.co)
  15. Qwen/Qwen3-VL-235B-A22B-Instruct (huggingface.co)
  16. The Entire Process of Building an Open Source Analog ASIC (hackaday.com)
  17. Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search (arxiv.org)
  18. Get your VLM running in 3 simple steps on Intel CPUs (huggingface.co)
  19. junron/agar (github.com)
  20. A modern approach to preventing CSRF in Go (www.alexedwards.net)
  21. Scene Graph-Guided Proactive Replanning for Failure-Resilient Embodied Agent (arxiv.org)
  22. XiaomiMiMo/MiMo-Audio-Eval (github.com)