Legal LLMs, reasoning, and thinking

Legal LLMs, reasoning, and “thinking”

Across legal workflows, the models that matter most right now aren’t always the newest—or the largest—but the ones that follow instructions reliably and encode enough Western legal knowledge to draft and revise with nuance. Hands-on evaluations from the field put Llama 3.3 70B near a “sweet spot” for accuracy and drafting fidelity, with Gemma 3 27B showing good latent legal knowledge but weaker instruction following. Qwen3’s newer instruct variants are fast yet inconsistent on subtle drafting and rationale clarity, while GLM 4.5 Air and even GLM 4.6 can trail unless you explicitly trigger their chain-of-thought modes. GPT‑OSS models (20B/120B) stand out on knowledge and instruction following—if you get past their guardrails—producing “very on point” redrafts when the prompt affirms attorney oversight. Qwen3‑235B‑A22B can match or exceed GPT‑OSS‑120B quality but may run at ~5 tokens/s locally, making it a “reserve” model for deep work rather than everyday drafting. Practitioners emphasize pairing models with RAG for statutes and local documents; for small jurisdictions, in-context evidence tends to beat parameterized “knowledge.” The caveat everyone agrees on: persuasive tone isn’t proof—review by a human remains non‑negotiable. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nxyfwr/what_are_the_best_models_for_legal_work_in_oct/)

Reasoning benchmarks echo the theme. FamilyBench—a long-context test built around a 400‑person family tree—now shows Gemini 2.5 Pro on top (81.48%), with Claude Sonnet 4.5 surging into second (77.78%) and GLM 4.6 jumping dramatically once its “thinking” mode is enabled. Qwen 3 Next 80B A3B Thinking broke 70% but often spends many tokens reasoning, and Qwen3‑235B Thinking sometimes “thinks forever,” consuming its budget before answering. GPT‑OSS‑120B scored 50.26%, comparatively efficient in tokens but behind the leaders on these multi-step inferential questions. The author also flags provider variance via OpenRouter, a fair warning that operational settings—like explicitly prompting “thinking”—can swing outcomes. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nzgben/update_familybench_new_models_tested_claude/)

Research is racing to make “thinking” both better and cheaper. Princeton’s RLMT (“Language Models That Think, Chat Better”) packages code for SFT/DPO and RL (PPO/GRPO) to train models that generate and then evaluate their own intermediate reasoning, with benchmarks and datasets released for replication. The aim: capture the gains seen when models are allowed to reason, while keeping answers concise and on target. In parallel, a new web‑agent RL method—Tree‑Guided Preference Optimization (TGPO)—structures the agent’s exploration like a tree, aligning preferences toward robust behavior under the chaotic realities of the web. These strands line up with empirical results like GLM 4.6’s leap when “thinking” is prompted explicitly. (more: https://github.com/princeton-pli/RLMT) (more: https://arxiv.org/abs/2509.14172v1)
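
To make that trade-off concrete, here is a toy reward shape that keeps the quality signal from a judged answer while taxing reasoning tokens past a budget. It is an illustrative sketch, not RLMT's actual objective, and `judge_score` stands in for whatever preference or rubric signal the training loop uses.

```python
def shaped_reward(think: str, answer: str, judge_score: float,
                  max_think_tokens: int = 512, penalty: float = 0.001) -> float:
    """Illustrative reward: keep the judged-answer quality signal,
    but charge for reasoning tokens beyond a budget so replies stay concise.
    Not RLMT's objective -- just the shape of the trade-off."""
    think_tokens = len(think.split())              # crude token proxy
    overrun = max(0, think_tokens - max_think_tokens)
    return judge_score - penalty * overrun
```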

Architectures chasing longer context

Architectural bets continue. InclusionAI’s Ring Flash 2.0 (104B A6B) swapped standard attention for linear attention, advertised for up to 128k tokens and post‑trained on ~1T tokens. The promised trade‑off is clear: cheaper inference at high context and potentially better long‑document steering. Early users describe “sharp” reasoning and strong instruction adherence, though the provided chat template needs tool‑calling tweaks. The catch: novel architectures can lag on ecosystem support—Ring Flash 2.0 Linear isn’t yet runnable in llama.cpp; a separate “Bailing MoE V2” path is close to landing, and Qwen 3 Next’s Gated DeltaNet is different again. GGUFs may appear, but they don’t guarantee runtime support without kernels and loaders updated for the exact attention type. Expect months, not days, for broad compatibility. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw8jbn/ring_flash_20_104b_a6b_with_linear_attention/)
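
The mechanics behind that trade-off fit in a few lines. Below is a generic kernelized (linear) attention sketch in NumPy, not Ring Flash 2.0's specific formulation: because phi(Q)(phi(K)^T V) can be regrouped, per-token cost stays roughly flat as context grows instead of scaling with the square of sequence length.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the (n x n) score matrix makes cost quadratic in length.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernelized attention: associativity lets us form phi(K)^T V (d x d) once,
    # so cost grows linearly with sequence length instead of quadratically.
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                       # (d, d) summary of the whole context
    Z = Qp @ Kp.sum(axis=0)             # per-query normalizer
    return (Qp @ KV) / Z[:, None]

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
out = linear_attention(Q, K, V)         # O(n * d^2) rather than O(n^2 * d)
```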

Benchmarks and anecdotes reinforce why these attention experiments matter. Linear attention aims to keep inference predictable as context scales, and multiple users are probing speed/quality trade‑offs against Qwen and Llama baselines. One practitioner reports Ring’s “mini” and “flash” variants are easy to steer and non‑sycophantic—a prized behavior in production—though the included template needed reverse‑engineering for reliable tool use. Tooling friction remains real: an LM Studio load failed with “bailing_moe_linear not supported,” and a GGUF for a standard‑attention sibling predictably didn’t run without upstream architectural support. The message for builders: check the whole toolchain, not just the model card. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw8jbn/ring_flash_20_104b_a6b_with_linear_attention/)

Local stacks, serving choices, offline research

Infrastructure choices are getting sharper edges. Ollama “dropped” AMD MI50 (gfx906) support after crashes with rocBLAS builds, effectively disabling a budget‑friendly 32 GB card many still deploy. Community members point out that llama.cpp runs the MI50 fine—recent PRs even improved performance—and accuse Ollama of shipping binaries with the GFX906 data removed rather than fixing its builds. The practical workaround from operators: replace Ollama with llama.cpp or KoboldCpp as the backend (both expose OpenAI‑compatible servers), especially when pairing with Open WebUI. In MI50 land, prefer HIP/CLBlast backends, modest quantization (q4/q5), smaller contexts, and avoid vLLM/TGI paths with spotty ROCm support for older ASICs. The broader sentiment: if you need legacy AMD or flexible kernels, llama.cpp is still the safer bet. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nwnfcz/ollama_drops_mi50_support/)
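
Because llama.cpp's llama-server and KoboldCpp both speak the OpenAI-compatible API, the backend swap is mostly a base-URL change for front ends like Open WebUI. A minimal smoke test from Python, assuming a local server is already listening on port 8080 (the port and the "model" field are assumptions):

```python
import requests

# Assumes llama-server (or KoboldCpp) is already running locally and exposing
# the OpenAI-compatible chat endpoint, e.g. on http://localhost:8080.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # many local servers ignore or loosely match this field
        "messages": [{"role": "user", "content": "Say hello from the MI50."}],
        "max_tokens": 64,
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```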

On the orchestration side, turnkey stacks are improving. CoexistAI added a Docker setup with a one‑script deploy that bundles research (web, YouTube, Reddit, Git, local files), page/video summarization, newsletter generation, plus new text‑to‑podcast and TTS pipelines. It also integrates local LLMs/embedders to let small teams assemble Perplexity‑style research engines without bespoke devops. For document generation at scale, an MCP‑compatible streaming HTTP server just landed in the MCP File Generation Tool v0.6.0; the release consolidates its toolset down to two primitives (create_file, generate_archive), adds Pexels image support and professional Office templates, and works out of the box with OpenWebUI 0.6.31—no proprietary MCPO key needed. These are small wins that compound into fewer moving parts. (more: https://www.reddit.com/r/ollama/comments/1nytxfy/coexistai_now_supports_docker_setup_also_now_you/) (more: https://www.reddit.com/r/OpenWebUI/comments/1nxglk1/mcp_file_generation_tool_v060_update/)

One clever cost lever: push “deep research” offline. A community fork of LangGraph’s Open Deep Research routes historical/general lookups to a local Wikipedia index to slash web calls. Practitioners recommend a hybrid: BM25 via Elasticsearch for first‑pass recall, a local reranker (e.g., bge‑reranker) to sharpen results, and only then vectors in Qdrant for long‑tail queries. Index raw wikitext rather than ZIMs for better search, chunk by section headers, keep infobox/lead separate, and store Wikidata QIDs to prefetch related hops without touching the web. In k8s, run indices as separate pods with mmap to minimize memory churn. The pattern is straightforward RAG economics: a cheap lexical gate preserves quality while cutting bandwidth. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nx5w8m/local_open_deep_research_with_offline_wikipedia/)
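
A minimal sketch of that cascade is below; the index and collection names, the confidence threshold, and the specific client libraries are assumptions for illustration, not what the fork ships.

```python
from elasticsearch import Elasticsearch
from sentence_transformers import CrossEncoder
from qdrant_client import QdrantClient

es = Elasticsearch("http://localhost:9200")
reranker = CrossEncoder("BAAI/bge-reranker-base")      # local reranker
qdrant = QdrantClient(host="localhost", port=6333)

def retrieve(query: str, query_vector, k: int = 10):
    # 1) Cheap lexical gate: BM25 over raw wikitext chunks.
    hits = es.search(index="wikipedia", query={"match": {"text": query}}, size=50)
    docs = [h["_source"]["text"] for h in hits["hits"]["hits"]]

    # 2) Sharpen the candidate set with a local cross-encoder reranker.
    if docs:
        scores = reranker.predict([(query, d) for d in docs])
        ranked = [d for _, d in sorted(zip(scores, docs), reverse=True)][:k]
        if max(scores) > 0.3:          # confidence threshold is a tunable assumption
            return ranked

    # 3) Long-tail fallback: dense vectors in Qdrant.
    points = qdrant.search(collection_name="wiki", query_vector=query_vector, limit=k)
    return [p.payload["text"] for p in points]
```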

Voice: local ASR and ultra‑cheap TTS/agents

There’s a flurry of voice tooling that bends the cost curve. One open repo claims a “hypercheap” voice agent—over 30× cheaper than ElevenLabs/OpenAI Realtime—aimed squarely at real‑time conversational apps. It’s early, but the direction is clear: push latency down and prices toward zero for interactive speech. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1nuh90m/i_created_the_cheapest_possible_ai_voice_agent/)

On-device ASR is catching up fast. Maivi (My AI Voice Input) delivers real‑time transcription with a global hotkey, streaming overlays, and clipboard copy—all CPU‑only by default. It uses NVIDIA’s Parakeet TDT 0.6B (~6–9% WER), runs at a real‑time factor of ~0.36 on CPU (7 s of audio in ~2.5 s), and stitches overlapping chunks to avoid mid‑word cuts and duplicates. For privacy‑sensitive or air‑gapped setups, that’s a substantial usability upgrade over heavier Whisper stacks. (more: https://github.com/MaximeRivest/maivi)
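
The overlap-and-stitch idea is easy to sketch: cut the audio into windows that share a margin, transcribe each, then drop the repeated words at every seam. This is an illustrative version, not Maivi's implementation.

```python
def chunk_with_overlap(samples, sr=16_000, chunk_s=10.0, overlap_s=2.0):
    """Yield overlapping windows so no word is cut exactly at a boundary."""
    step = int((chunk_s - overlap_s) * sr)
    size = int(chunk_s * sr)
    for start in range(0, max(1, len(samples) - int(overlap_s * sr)), step):
        yield samples[start:start + size]

def stitch(transcripts, max_overlap_words=8):
    """Merge chunk transcripts by dropping the longest repeated word run at each seam."""
    merged = transcripts[0].split()
    for t in transcripts[1:]:
        words = t.split()
        cut = 0
        for n in range(min(max_overlap_words, len(words), len(merged)), 0, -1):
            if merged[-n:] == words[:n]:
                cut = n
                break
        merged.extend(words[cut:])
    return " ".join(merged)
```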

Meanwhile, a compact, multi‑lingual TTS model (KaniTTS, 370M params) targets sub‑second perceived latency by generating codec tokens with a small LLM and decoding via a neural audio codec. Benchmarks claim ~1 s to synthesize 15 s of 22 kHz audio on an RTX 5080, MOS 4.3/5, and WER <5%, with voices across English, German, Chinese, Korean, Arabic, and Spanish under Apache‑2.0. It’s optimized for NVIDIA Blackwell but small enough to consider edge servers—useful for assistants that need snappy speech without cloud bills. (more: https://huggingface.co/nineninesix/kani-tts-370m)
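
The pipeline itself is two stages: a small LM emits discrete codec tokens, and a neural codec decodes them into a waveform. The names below (codec_lm, codec) are hypothetical placeholders for the shape of that flow, not KaniTTS's actual API.

```python
# Hypothetical two-stage TTS flow: a small LM emits discrete codec tokens,
# then a neural audio codec decodes them to a waveform. Names are placeholders.
def synthesize(text: str, codec_lm, codec, sample_rate: int = 22_050):
    token_ids = codec_lm.generate(text)        # stage 1: text -> codec tokens
    waveform = codec.decode(token_ids)         # stage 2: tokens -> audio samples
    return waveform, sample_rate
```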

Apps meet context: MCP and ChatGPT SDKs

Distribution is consolidating around where users already work. One editorial argues the new ChatGPT App SDK doesn’t kill small AI startups—it filters them. Thin wrappers will suffocate; tools that deliver real workflow value (analytics, compliance, domain logic) gain an on‑ramp to massive distribution with built‑in auth, context, and discoverability. The advice is pragmatic: stop forcing users into new tabs and meet them in the interface they already trust. (more: https://www.linkedin.com/posts/reuvencohen_the-new-chatgpt-app-sdk-is-shaking-things-activity-7381313590705782784-7d1H)

Technically, that consolidation rides on the Model Context Protocol (MCP): an open, JSON‑RPC 2.0–based standard for connecting assistants to external tools, data, and services. A comprehensive tutorial for ChatGPT’s Developer Mode shows how to add custom MCP connectors, implement required endpoints (including “search” and “fetch”), add production OAuth 2.0 with PKCE, deploy the server (Fly.io recommended for root‑domain control), and debug edge cases. Taken together, Apps SDK + MCP lowers friction to bring back‑office systems to the chat surface with strong auth, structured tools, and shared context. (more: https://gist.github.com/ruvnet/7b6843c457822cbcf42fc4aa635eadbb#file-x-appendix-md)
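
A skeletal connector using the Python MCP SDK's FastMCP helper shows the shape of the contract; the tool bodies here are placeholders and the OAuth wiring is omitted (the tutorial covers the production details).

```python
from mcp.server.fastmcp import FastMCP

# Minimal MCP connector exposing the two tools ChatGPT's Developer Mode expects.
# Real deployments add OAuth 2.0 with PKCE and sit behind HTTPS.
mcp = FastMCP("docs-connector")

@mcp.tool()
def search(query: str) -> list[dict]:
    """Return lightweight result stubs (id, title, url) for a query."""
    # Placeholder: swap in a real index lookup.
    return [{"id": "doc-1", "title": "Example doc", "url": "https://example.com/doc-1"}]

@mcp.tool()
def fetch(id: str) -> dict:
    """Return the full content for a previously returned result id."""
    return {"id": id, "title": "Example doc", "text": "Full document text goes here."}

if __name__ == "__main__":
    mcp.run()   # stdio by default; use an HTTP transport for remote connectors
```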

MCP is also surfacing beyond OpenAI’s walls. The new file‑generation tool supports MCP Streamable HTTP out of the box, speaking the same contract to OpenWebUI and similar front ends. That’s what a standard is supposed to do: one way to declare tools, one way to call them, regardless of the chat client. (more: https://www.reddit.com/r/OpenWebUI/comments/1nxglk1/mcp_file_generation_tool_v060_update/)

Security: robots, messengers, networks

A sobering first for robotics: a wormable exploit chain across Unitree’s humanoid and quadruped lines. Researchers detail a BLE Wi‑Fi config service with hardcoded cryptographic keys, a “unitree” password handshake, and unsanitized input concatenated into shell commands—ultimately yielding root. Worse, compromised robots can infect others within wireless range. It follows the classic disclosure arc—notify vendor, wait, publish mitigations when fixes stall—but the implications are broader: connected robots are computers with legs, and their security debt is now everyone’s problem. (more: https://hackaday.com/2025/09/30/unitree-humanoid-robot-exploit-looks-like-a-bad-one/)

Messaging is bracing for the post‑quantum era. Signal outlined “Signal Protocol and Post‑Quantum Ratchets” (SPQR), advancing work to keep end‑to‑end encryption resilient against future quantum adversaries. Transitioning ratchets without breaking usability is nontrivial; shipping it safely in the wild is a milestone the broader ecosystem will track closely. (more: https://signal.org/blog/spqr/)

Networks are converging in ways that stretch threat models. A telecom security webinar spotlights how LTE/5G and satellite infrastructures are intersecting—think Starlink broadcasting LTE and emerging Non‑Terrestrial Networks—expanding the attack surface across space and ground segments. Lessons from breaches in both domains and the governance question of “who secures what” are front and center. (more: https://www.linkedin.com/posts/dmitry-kurbatov_5g-satellitetechnology-cybersecurity-activity-7381247864624148480-6_4n)

Research and systems: time series, vectors, silicon

Beyond text, Google’s TimesFM 2.5 (200M, PyTorch) is a decoder‑only foundation model for time‑series forecasting. The open checkpoint includes code to produce point forecasts and quantiles (continuous quantile head, crossing fixes), with configs for context/horizon lengths. It’s pretrained on Wikimedia pageviews, Google Trends, and synthetic/augmented data through late 2023. The example API is straightforward and suited for stacking into analytics pipelines or, increasingly, agentic planners that need uncertainty‑aware predictions. (more: https://huggingface.co/google/timesfm-2.5-200m-pytorch)
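
Quantile outputs are what make those forecasts useful downstream: a planner can provision against an upper quantile rather than the point estimate. The arrays below are stand-ins for a forecast result, not TimesFM's actual return types.

```python
import numpy as np

# Stand-in forecast output: a point forecast plus a quantile track
# (shapes and names are illustrative, not TimesFM's return types).
horizon = 24
point = np.random.default_rng(0).normal(100, 5, horizon)    # e.g. requests/hour
quantiles = {0.5: point, 0.9: point + 8.0}

# Uncertainty-aware decision: size capacity to the 90th percentile of demand,
# not the point estimate, so the planner absorbs forecast error.
capacity = np.ceil(quantiles[0.9].max())
print(f"provision for {capacity:.0f} requests/hour over the next {horizon}h")
```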

Under the hood, acceleration work continues. Flash‑KMeans implements exact, batched K‑Means in Triton, delivering large speedups over standard PyTorch on H100s in common regimes (e.g., 16k points × 128‑D × 1k clusters, batch 32, FP16). It comes from the Sparse VideoGen2 effort, but the primitive is broadly applicable: faster clustering can shave minutes off embedding deduplication, vector quantization, or retrieval preprocessing—every bit helps when you’re wrangling millions of chunks. (more: https://github.com/svg-project/flash-kmeans)
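
For orientation, one batched Lloyd iteration in plain PyTorch shows the computation the Triton kernels accelerate; this is a reference formulation, not flash-kmeans' API.

```python
import torch

def kmeans_step(points, centroids):
    """One batched Lloyd iteration: assign points to the nearest centroid, recompute means.
    points: (B, N, D), centroids: (B, K, D). Reference code, not the Triton kernel."""
    dists = torch.cdist(points, centroids)             # (B, N, K) pairwise distances
    assign = dists.argmin(dim=-1)                       # (B, N) nearest-centroid ids
    one_hot = torch.nn.functional.one_hot(assign, centroids.shape[1]).to(points.dtype)
    counts = one_hot.sum(dim=1).clamp(min=1)            # (B, K) cluster sizes
    new_centroids = one_hot.transpose(1, 2) @ points / counts.unsqueeze(-1)
    return new_centroids, assign

B, N, K, D = 2, 1024, 64, 32                            # small CPU-sized example
pts = torch.randn(B, N, D)
init = pts[:, torch.randperm(N)[:K], :]                 # seed centroids from the data
cent, assign = kmeans_step(pts, init)
```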

On silicon, XiangShan’s Vector FPU design notes are a solid reference for anyone mapping RVV 1.0 floating‑point. The VFPU spans fp16/fp32/fp64 with modules for add/alu (VFAlu), FMA (VFMA), div/sqrt (VFDivSqrt), and conversion (VFCvt), and tackles two hard problems: maximizing lane throughput by packing multiple narrower elements per lane (e.g., 4×fp16 or 2×fp32 in a 64‑bit lane) and supporting the mixed‑precision ops mandated by RVV without blowing timing. The microarchitectural details—dual‑path addition, vector FMA/div algorithms, sequential accumulation—show how to keep precision flexible while staying fast. (more: https://docs.xiangshan.cc/projects/design/en/latest/backend/VFPU/)
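
The lane-packing arithmetic is easy to see in miniature: a 64-bit lane holds four fp16 elements or two fp32 elements, so the same datapath width carries 4x or 2x elements depending on element width. A NumPy view shows the layout (this illustrates packing only, not the VFPU hardware).

```python
import numpy as np

lane = np.zeros(1, dtype=np.uint64)          # one 64-bit vector lane
as_fp16 = lane.view(np.float16)              # 4 half-precision elements per lane
as_fp32 = lane.view(np.float32)              # 2 single-precision elements per lane
print(as_fp16.size, as_fp32.size)            # -> 4 2
```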

Learning from coding sessions, automatically

Developer ergonomics are also improving. A small but useful tool, Learn and Vibe, mines Claude Code chat histories to surface where time really goes: recurring debugging themes (API assumptions, layout recalcs, missing error handlers), quick‑win patterns, and reusable snippets you’ve already solved. It runs locally (SQLite + Next.js PWA) and uses the Claude Agent SDK for analysis; next on the roadmap is auto‑generating a CLAUDE.md with rules tailored to your “gotchas,” so the assistant can preempt your favorite mistakes. It’s the kind of feedback loop junior and senior engineers alike appreciate—less vibe coding, more deliberate practice. (more: https://www.reddit.com/r/ClaudeAI/comments/1nzbvhq/built_a_tool_to_actually_learn_from_my_vibe/)

Sources (21 articles)

  1. [Editorial] Meet users where they are… (www.linkedin.com)
  2. [Editorial] https://gist.github.com/ruvnet/7b6843c457822cbcf42fc4aa635eadbb#file-x-appendix-md (gist.github.com)
  3. [Editorial] https://www.linkedin.com/posts/dmitry-kurbatov_5g-satellitetechnology-cybersecurity-activity-7381247864624148480-6_4n (www.linkedin.com)
  4. Local Open Deep Research with Offline Wikipedia Search Source (www.reddit.com)
  5. What are the best models for legal work in Oct 2025? (www.reddit.com)
  6. Ollama drops MI50 support (www.reddit.com)
  7. Ring Flash 2.0 104B A6B with Linear Attention released a few days ago (www.reddit.com)
  8. [Update] FamilyBench: New models tested - Claude Sonnet 4.5 takes 2nd place, Qwen 3 Next breaks 70%, new Kimi weirdly below the old version, same for GLM 4.6 (www.reddit.com)
  9. CoexistAI Now Supports Docker Setup, Also now you can turn any text into Podcasts and Speech Easily (www.reddit.com)
  10. I created the cheapest possible AI voice agent (over 30x less expensive than Elevenlabs and OpenAI Realtime). Check out the Github repo below if you want to try it for yourself! (www.reddit.com)
  11. Built a tool to actually learn from my vibe coding mistakes in Claude Code (www.reddit.com)
  12. MaximeRivest/maivi (github.com)
  13. princeton-pli/RLMT (github.com)
  14. Signal Protocol and Post-Quantum Ratchets (signal.org)
  15. XiangShan Vector Floating-Point Unit Design (docs.xiangshan.cc)
  16. nineninesix/kani-tts-370m (huggingface.co)
  17. google/timesfm-2.5-200m-pytorch (huggingface.co)
  18. Unitree Humanoid Robot Exploit Looks Like a Bad One (hackaday.com)
  19. TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning (arxiv.org)
  20. MCP_File_Generation_Tool - v0.6.0 Update! (www.reddit.com)
  21. svg-project/flash-kmeans (github.com)