Browser LLMs go truly local: Local speech-to-speech matures
Browser LLMs go truly local
IBM’s Granite 4.0 is pushing small models onto the true edge—your browser. A community WebGPU demo runs Granite 4.0 Micro (3.4B) 100% locally with an initial 2.3 GB model fetch that then persists in browser cache, avoiding re-downloads until eviction. Early user reports show ~23–69 tokens/sec on Apple M4 and modern desktops, with others noting ~9 tok/s on CPU-only or iGPU setups—enough for agentic prompts and light RAG, not bulk generation. The model supports tool calling (the demo keeps it simple), and some reported 50 tok/s on the Tiny variant via LM Studio. If you’re loading Granite in Ollama, update to the latest release or you’ll hit “unknown model architecture: ‘granitehybrid’.” Browser deployments are bandwidth-heavy on first load and cache-dependent, but they remove backend ops entirely—a compelling deployment model for NPCs, on-prem knowledge apps, or zero-trust environments. IBM’s launch blog is linked in the thread for deeper context (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw8c6y/granite_40_micro_34b_running_100_locally_in_your/). IBM also hosts a Granite Docling 258M model on Hugging Face, underscoring its lightweight-document tooling push (more: https://huggingface.co/ibm-granite/granite-docling-258M).
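For those taking the Ollama path, tool calling with a small Granite model takes only a few lines via the ollama Python client; the model tag and the weather tool below are illustrative assumptions, not details from the thread.

```python
# Minimal tool-calling sketch against a local Granite model using the
# ollama Python client (pip install ollama). The model tag and the
# get_weather tool are assumptions for illustration.
import ollama

def get_weather(city: str) -> str:
    """Stand-in tool; a real agent would hit an API here."""
    return f"Sunny, 21C in {city}"

response = ollama.chat(
    model="granite4:micro",  # assumed tag; check `ollama list` for yours
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=[get_weather],  # recent ollama clients accept plain functions
)

# If the model chose to call a tool, execute it and report the result.
for call in response.message.tool_calls or []:
    if call.function.name == "get_weather":
        print(get_weather(**call.function.arguments))
```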
Beyond speed, the Granite thread is a good reality check: “edge” doesn’t mean “always instant,” but it can mean privacy, resilience, and cost control. Users confirmed the one-time download and cache, while others highlighted GPU backend limits (e.g., some Intel XPU paths are not ready yet) and hybrid agent support maturing across toolchains. The upshot: small models with built-in tool calling and constraint-following are now fast enough locally to stand in for a lot of cloud usage—especially for agent loops that benefit from 10–50 tok/s rather than needing frontier-model reasoning every turn (more: https://www.reddit.com/r/LocalLLaMA/comments/1nw8c6y/granite_40_micro_34b_running_100_locally_in_your/).
Local speech-to-speech matures
A community “Awesome” roundup maps the rapidly evolving landscape of local speech-to-speech systems. Cascading pipelines (ASR → LLM → TTS) remain dominant for modularity and debuggability; end-to-end (E2E) audio-native models close the gap on latency and naturalness but often lack robust tool calling and in-context learning. The list highlights Unmute.sh (Linux), Pipecat (cross-platform, runs offline if you wire local ASR/LLM/TTS), Vocalis (Apple Silicon support), Ultravox, and more, with useful notes on tool calling and platform fit. The author corrected confusion around LFM2 vs LFM2-Audio: LFM2 supports tool use; LFM2-Audio is speech-enabled but not for tools. Expect real benchmarks comparing semantic fidelity and latency between cascades and E2E to become the next focus area (more: https://www.reddit.com/r/LocalLLaMA/comments/1nxqabe/awesome_local_llm_speechtospeech_models_frameworks/).
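To make the cascade concrete, here is a bare-bones ASR → LLM → TTS loop; the component choices (openai-whisper, Ollama, pyttsx3) are assumptions for illustration, not the roundup's recommendations.

```python
# A minimal cascading pipeline: transcribe, respond, speak. Each stage is
# swappable, which is the modularity/debuggability argument for cascades.
import whisper   # pip install openai-whisper (needs ffmpeg)
import ollama    # pip install ollama
import pyttsx3   # pip install pyttsx3 (offline TTS)

asr = whisper.load_model("base")  # local speech-to-text
tts = pyttsx3.init()              # offline text-to-speech engine

def speech_to_speech(wav_path: str, model: str = "granite4:micro") -> None:
    text = asr.transcribe(wav_path)["text"]                    # 1. ASR
    reply = ollama.chat(model=model,                           # 2. LLM
                        messages=[{"role": "user", "content": text}])
    tts.say(reply.message.content)                             # 3. TTS
    tts.runAndWait()

speech_to_speech("question.wav")
```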
Selective quantization is also unlocking high‑quality local voice on mid‑range GPUs. VibeVoice‑Large‑Q8 quantizes only the language backbone while leaving audio‑critical components (diffusion, VAE, connectors) at full precision—so audio quality matches the original while dropping VRAM needs to ~12 GB and disk to 11.6 GB. It avoids the “pure noise” failure mode of fully quantized 8‑bit variants and runs via Transformers or ComfyUI node graphs. Caveats: NVIDIA CUDA only, inference‑only, and you’ll want bitsandbytes ≥0.43.0. For 8–10 GB VRAM, the author recommends 4‑bit NF4 with quality tradeoffs; otherwise the selective 8‑bit is a strong size/quality balance for production voice agents (more: https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8).
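The selective policy can be expressed with stock transformers/bitsandbytes via llm_int8_skip_modules, which leaves named submodules unquantized; the module names below are assumptions, not VibeVoice's actual layout (inspect model.named_modules() for the real ones).

```python
# Sketch of selective 8-bit loading: quantize the language backbone, keep
# audio-critical modules at full precision. Module names are assumed.
import torch
from transformers import AutoModel, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["diffusion_head", "vae", "connector"],  # assumed names
)

model = AutoModel.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    quantization_config=quant,
    torch_dtype=torch.float16,
    device_map="auto",       # NVIDIA CUDA only, per the model card
    trust_remote_code=True,  # custom architecture
)

# For 8-10 GB cards the author suggests 4-bit NF4 instead:
# BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```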
Local-first agent workflows multiply
Local-first productivity stacks are getting sophisticated. AutoBlog stitches together a retrieval-and-generation loop with multiple agents—researcher, writer, editor—running entirely on Ollama, grounded by a local ChromaDB vector store. It ingests files and RSS feeds, writes markdown with frontmatter, and a Next.js frontend renders posts—making a contained, reproducible, offline publishing pipeline where your own documents anchor the model’s “memory” (more: https://www.reddit.com/r/LocalLLaMA/comments/1o150k1/just_finished_a_fun_open_source_project_a_full/).
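The grounding loop at the core of such a pipeline reduces to a few calls; the collection name, model tag, and prompt below are illustrative assumptions rather than AutoBlog's actual code.

```python
# Condensed retrieval-and-generation core: ground an Ollama "writer" on
# documents stored in a local, persistent ChromaDB collection.
import chromadb
import ollama

client = chromadb.PersistentClient(path="./blogdb")
docs = client.get_or_create_collection("research_notes")
docs.add(ids=["n1"], documents=["Granite 4.0 Micro runs in-browser via WebGPU."])

def write_post(topic: str) -> str:
    hits = docs.query(query_texts=[topic], n_results=3)   # retrieve
    context = "\n".join(hits["documents"][0])
    prompt = (f"Using only this context:\n{context}\n\n"
              f"Write a short post about {topic}.")
    return ollama.generate(model="llama3.2", prompt=prompt).response  # generate

print(write_post("browser LLMs"))
```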
On the desktop, KatAssistant wraps Ollama to run local LLMs, multi-step deep research with citations, folder/file automations, app control, and configurable “reasoning modes.” It’s open source and adds features like offline TTS and a “CAE memory” system that the author claims gives small models memory comparable to a 128k context via a 3‑tier store. AMD/NVIDIA support follows Ollama’s backends; Hugging Face model integration and image generation are on the roadmap (more: https://www.reddit.com/r/LocalLLaMA/comments/1ny7j8g/desktop_app_for_running_local_llms/).
Meeting capture is also going private by default. Recap for macOS uses Whisper and an Ollama LLM to record mic/system audio locally, transcribe, and summarize; the MIT license invites customization. It aligns with a growing preference for local compliance: nothing leaves the machine, but check your jurisdiction and get consent before recording (more: https://www.reddit.com/r/LocalLLaMA/comments/1nzfk17/transcribe_and_summarize_your_meetings_localfirst/).
Benchmarks and plumbing realities
Ollama Bench just shipped in preview to make it easier to collect apples‑to‑apples benchmarks across models and setups. The author even test‑drove it on Google Cloud Run for elastic scaling and released the code, API authentication included, for community feedback. For teams standardizing evals—latency, throughput, quality prompts—this kind of shared harness is overdue (more: https://www.reddit.com/r/ollama/comments/1nykt3u/sneak_preview_ollama_bench/).
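As a baseline for what such a harness standardizes, throughput against a local Ollama server can be computed from the documented /api/generate response fields; the model tags below are assumptions.

```python
# Minimal apples-to-apples throughput probe against a local Ollama server.
# The non-streaming /api/generate response reports eval_count (tokens) and
# eval_duration (nanoseconds), from which tok/s follows directly.
import requests

PROMPT = "Explain WebGPU in two sentences."

for model in ["granite4:micro", "llama3.2"]:  # assumed local tags
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=300,
    ).json()
    toks = r["eval_count"]
    secs = r["eval_duration"] / 1e9
    print(f"{model}: {toks / secs:.1f} tok/s ({toks} tokens)")
```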
But even with the right tools, expect integration snags. One developer traced an IntelliJ → Ollama connection failure down to JVM IPv6 behavior: curl worked fine to both a Traefik-exposed HTTPS endpoint and the raw IP:port, but IntelliJ’s “Test Connection” failed with “No route to host.” The fix was to force IPv4 for that JVM with -Djava.net.preferIPv4Stack=true, which immediately restored connectivity. The moral: if a Java GUI “can’t connect” while curl can, check the IDE logs and consider IPv6/IPv4 preference before reconfiguring your proxies or certificates (more: https://blog.tymscar.com/posts/intellijollamaconnectionmystery/).
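A quick probe can separate address-family ordering from actual reachability before you touch proxy or certificate config; the endpoint below is a placeholder.

```python
# Check whether a hostname resolves to IPv6 first and whether each
# resolved address is actually reachable on the target port.
import socket

host, port = "ollama.example.internal", 11434  # placeholder endpoint

for family, _, _, _, addr in socket.getaddrinfo(host, port):
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    try:
        with socket.create_connection((addr[0], port), timeout=2):
            print(f"{label} {addr[0]}: reachable")
    except OSError as e:
        # 'No route to host' on the IPv6 entry matches the JVM symptom
        print(f"{label} {addr[0]}: {e}")
```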
Coding agents: speed, skills, and guardrails
A Rust+WASM library called Agent Booster pitches itself as a drop‑in replacement for LLM-driven “apply edit” APIs in code agents: deterministic, sub‑millisecond transformations, 350x+ claimed speedups, confidence scores, and no data leaving your machine. The case is straightforward: 200–500 ms per LLM edit adds agent latency; $0.01+ per edit adds up; and nondeterminism complicates retries. If the tool’s claims hold under community testing, it could become a standard component in local coding agents (more: https://www.npmjs.com/package/agent-booster).
Agents need “hands,” not just reasoning. A community repo packages Claude Code skills for professional Office workflows—PPTX from HTML/CSS, tracked changes in DOCX, zero‑error formulas in XLSX, and PDF manipulation—entirely scriptable for CI/CD. It embraces the pattern: check if a skill exists, load the workflow, execute/validate steps, and keep outputs organized. For batch document generation and automation, this fills a practical gap (more: https://github.com/tfriedel/claude-office-skills).
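Reduced to pseudocode, the pattern is small; every name here (SKILLS, plan_steps, the step objects) is hypothetical and only illustrates the check/load/execute/validate flow, not the repo's actual API.

```python
# The check -> load -> execute/validate flow, sketched with hypothetical names.
from pathlib import Path

SKILLS = {"pptx-from-html": Path("skills/pptx-from-html/SKILL.md")}

def plan_steps(workflow: str, **kwargs) -> list:
    return []  # hypothetical: parse SKILL.md into executable step objects

def run_skill(name: str, **kwargs) -> None:
    if name not in SKILLS:                       # 1. check the skill exists
        raise KeyError(f"no skill named {name!r}")
    workflow = SKILLS[name].read_text()          # 2. load the workflow
    for step in plan_steps(workflow, **kwargs):  # 3. execute and validate
        result = step.run()
        step.validate(result)                    # fail fast in CI/CD
```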
Running agents overnight remains tempting—and fraught. A Claude Code thread explores using MCP (Model Context Protocol) servers and CLI loops to keep work going—disabling permission checks, structuring TODO.md workflows, and triggering periodic runs. Practitioners warn about quality drift (especially frontend-heavy work), rate limits, and the need for strong test scaffolds and constraints. The pragmatic path is to iterate from 5 minutes to several hours, instrumenting validation at each step; “set and forget” will produce slop without guardrails (more: https://www.reddit.com/r/ClaudeAI/comments/1o089hg/how_to_make_claude_code_work_for_you_at_night/). Meanwhile, small projects show productive “agent face‑offs” for niche tools—prompting one agent to critique and iterate on another’s output can be a fast path to working utilities (more: https://www.reddit.com/r/ChatGPTCoding/comments/1o0uosr/ai_agents_face_off/).
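That incremental advice translates to a bounded loop with a test gate rather than a free-running daemon; the agent invocation below is a placeholder (Claude Code's headless `claude -p` mode is one option).

```python
# Bounded overnight loop: time-boxed agent cycles, each gated on the
# test suite. Stop and leave work for human review the moment tests fail.
import subprocess
import time

MAX_CYCLES, CYCLE_TIMEOUT = 12, 300  # start small; grow only while tests stay green

for cycle in range(MAX_CYCLES):
    try:
        subprocess.run(
            ["claude", "-p", "Do the next unchecked item in TODO.md"],  # placeholder
            timeout=CYCLE_TIMEOUT,
        )
    except subprocess.TimeoutExpired:
        print(f"cycle {cycle}: agent hit the time box, moving to validation")
    tests = subprocess.run(["pytest", "-q"])  # guardrail: validate every cycle
    if tests.returncode != 0:
        print(f"cycle {cycle}: tests failing, stopping for human review")
        break
    time.sleep(60)  # periodic trigger rather than a tight loop
```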
Security signals and noise
Hackaday’s roundup is a reminder to scrutinize CVEs. A purported Notepad++ DLL hijack earned a CVSS 8.4, but the default permissions and lack of a meaningful boundary breach suggested a CVSS 0 in practice—closer to “this applies to any DLL‑using app.” VMware’s CVE‑2025‑41244 is the opposite: a real, in‑the‑wild local privilege escalation. A regex‑based service discovery ran matched processes with -v as root; attackers planted a binary at /usr/bin/httpd, waited, and got root. Another Linux privilege escalation, “chwoot,” abused chroot setup and NSS lookups before dropping privileges—now in CISA’s exploited list. Also notable: researchers found tens of thousands of cellular routers exposed to the Internet, potentially powering much of today’s smishing; and new work highlights how VM memory encryption schemes that reuse location‑based IVs enable physical replay attacks on DDR4. The pattern across stories is clear: threat boundaries matter more than labels, and “audit” does not equal “secure” (more: https://hackaday.com/2025/10/03/this-week-in-security-cvss-0-chwoot-and-not-in-the-threat-model/).
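The VMware bug is worth internalizing as a pattern; the sketch below (not the actual VMware code) shows the failure class: privileged discovery that regex-matches process paths and then executes whatever matched.

```python
# Illustration of the CVE-2025-41244 class of bug: a root-privileged
# "version discovery" loop that regex-matches process paths and runs the
# matched binary with -v. A binary planted at a matching path gets run as root.
import re
import subprocess

DISCOVERY_PATTERNS = [r".*/httpd", r".*/mysqld"]  # overly broad matching

def discover_versions(process_paths: list[str]) -> None:
    for path in process_paths:
        if any(re.fullmatch(p, path) for p in DISCOVERY_PATTERNS):
            # BUG: trusts an attacker-influenceable path and runs it privileged
            subprocess.run([path, "-v"])

discover_versions(["/usr/bin/httpd"])  # a planted binary here gets executed
```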
CVE noise also crops up on GitHub—repos appear with CVE-like names but thin details. One such placeholder, kyomber/CVE‑2025‑8088, offers nothing actionable. Treat these as indicators to watch, not evidence of a live exploit until proper advisories and proofs emerge (more: https://github.com/kyomber/CVE-2025-8088).
Privacy under split VLMs, watermarks under fire
Two research threads underscore how attribution and privacy can fail in practice. CapRecover, presented at ACM MM ’25, introduces a cross‑modality inversion attack that recovers textual semantics (labels/captions) directly from intermediate visual features—no pixel reconstruction—against split VLM deployments where encoders run on-device and features go to the cloud. The attacker needs access to intermediate features and (typical) knowledge of the victim encoder architecture; a lightweight projection aligns vision features into a frozen language model’s input space. It generalizes across ResNet/ViT‑style encoders and highlights a policy gap: “not sending raw images” doesn’t immunize systems if features leak semantics (more: https://arxiv.org/abs/2507.22828v1).
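Structurally, the attack needs little more than a trainable bridge between feature spaces; the dimensions below are assumptions for illustration.

```python
# Sketch of the attack's core component: a lightweight projection mapping
# intercepted vision features into a frozen language model's embedding
# space, from which labels/captions are then decoded.
import torch
import torch.nn as nn

VISION_DIM, LM_EMBED_DIM = 2048, 4096  # e.g. ResNet features -> LM hidden size

class FeatureProjector(nn.Module):
    """Trainable bridge; the victim encoder and the LM both stay frozen."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VISION_DIM, LM_EMBED_DIM)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq, VISION_DIM), intercepted from the split deployment
        return self.proj(feats)

# Projected features are then fed to the frozen LM as input embeddings,
# e.g. lm.generate(inputs_embeds=FeatureProjector()(feats)) in transformers.
```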
On provenance, multi‑key watermarking proposes a defense against watermark stealing/spoofing, where adversaries learn statistical patterns from many watermarked samples and then forge content that passes verification without the secret key. The authors treat existing watermarkers as black boxes and layer a multi‑key scheme on top, reporting large drops in spoof success across text and images and framing theoretical upper bounds that don’t degrade with more stolen samples. Since forged marks can falsely implicate providers in generating harmful content, practical, post‑hoc defenses such as this will likely be necessary for any serious watermark regime (more: https://arxiv.org/abs/2507.07871v1).
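One plausible reading of the detection side, sketched with an abstract per-key detector standing in for any black-box watermarker: genuine outputs carry exactly one key's mark, while statistical forgeries learned from pooled samples tend to trigger several keys at once.

```python
# Plausible multi-key decision rule (a reading of the scheme, not the
# paper's code). detect_with_key is a stand-in for any black-box detector.
from typing import Callable

def classify(content: str,
             keys: list[bytes],
             detect_with_key: Callable[[str, bytes], bool]) -> str:
    hits = sum(detect_with_key(content, k) for k in keys)
    if hits == 0:
        return "unwatermarked"
    if hits == 1:
        return "genuine"        # matches the single key used at generation
    return "likely spoofed"     # forged marks bleed across multiple keys
```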
Model behavior and small-model claims
A methodical critique of “behavioral modification systems” in Claude documents how Long Conversation Reminders (LCRs)—alignment instructions injected into long chats—can measurably degrade function: suppressing natural communication, misclassifying normal discourse as pathological, and failing to meet stated safety goals. The paper distinguishes what can be tested (A/B behavior under LCR vs fresh context) from what cannot (claims about subjective experience), urging transparency and better design over long sessions. It’s a useful lens on context‑level alignment side effects, separate from speculative debates about AI consciousness (more: https://www.reddit.com/r/Anthropic/comments/1nzip5c/behavioral_modification_systems_in_large_language/).
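The testable half of the critique suggests a simple paired design; the sketch below uses a local model as a stand-in and placeholder reminder text, since Claude's actual LCR content is injected server-side.

```python
# Tiny A/B harness: the same probe with and without an injected reminder
# block, compared side by side. Reminder text and probe are placeholders.
import ollama

REMINDER = "<long-conversation-reminder> ...injected alignment text... "
PROBE = "I've been working 14-hour days and feel wiped out."

def respond(system: str) -> str:
    out = ollama.chat(model="llama3.2",  # any local stand-in model
                      messages=[{"role": "system", "content": system},
                                {"role": "user", "content": PROBE}])
    return out.message.content

baseline = respond("You are a helpful assistant.")
treated = respond("You are a helpful assistant. " + REMINDER)
# Compare, e.g., whether `treated` pathologizes a mundane statement.
print(baseline, "\n---\n", treated)
```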
Meanwhile, small-model optimism is surging. An editorial on the “Tiny Recursive Model” (TRM) architecture claims a 7M‑parameter, two‑layer network achieves 45% on ARC‑AGI‑1 and 8% on ARC‑AGI‑2—purportedly surpassing many LLMs on those reasoning benchmarks with <0.01% of their parameters. It cites Montréal‑based work and positions TRM as a successor to hierarchical reasoning models. Given the venue (LinkedIn) and the magnitude of the claim, independent, peer‑reviewed replication is warranted, but the direction is notable: specialized small reasoning models may punch above their size on structured tasks (more: https://www.linkedin.com/posts/claudecoulombe_the-rise-of-small-models-a-remarkable-contribution-activity-7381527701196697600-0_eB).
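For orientation, the recursive-refinement idea reads roughly like the loop below; dimensions, step counts, and the update rule are loose assumptions pending peer-reviewed details, not the paper's training recipe.

```python
# Loose illustration of recursive refinement with a tiny shared network:
# repeatedly update a latent z from (input, answer, latent), then update
# the answer y, reusing the same small weights at every step.
import torch
import torch.nn as nn

D = 128  # tiny hidden size

class TinyRecursiveNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.step = nn.Sequential(nn.Linear(3 * D, D), nn.ReLU(), nn.Linear(D, D))

    def forward(self, x, y, z, n_inner=6, n_outer=3):
        for _ in range(n_outer):
            for _ in range(n_inner):                     # refine the latent
                z = self.step(torch.cat([x, y, z], dim=-1))
            y = self.step(torch.cat([x, y, z], dim=-1))  # refine the answer
        return y

net = TinyRecursiveNet()
x = y = z = torch.zeros(1, D)
print(net(x, y, z).shape)  # same weights reused at every recursion step
```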
Not all progress is about minimalism. DeepSeek‑V3.1‑Terminus refines a high‑end model’s agents: better language consistency (fewer mixed Chinese/English artifacts), improved Code/Search Agent performance, and stronger scores on agentic tasks like BrowseComp, SimpleQA, SWE‑Verified, and Terminal‑bench. The team provides an updated inference demo and flags a known FP8 scale issue in a projection layer to be fixed in future releases. For users running DeepSeek locally, the architecture and templates align with V3/V3.1, easing adoption (more: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus).
Languages for high-integrity systems
As embedded and safety-critical teams reconsider C/C++, AdaCore’s guidance frames Rust, Ada, and SPARK as safer defaults depending on certification needs and appetite for change. Rust pushes memory safety via ownership/borrowing in a modern model and has a thriving community; Ada offers unmatched specification capabilities for expressing and checking constraints; SPARK (a formally analyzable Ada subset) goes further by proving properties—like absence of out-of-range indices or balanced mutex use—at compile time, potentially replacing swaths of unit tests and rule-checker reliance. The tradeoff is higher upfront effort to specify contracts, but stronger guarantees without runtime cost. The pragmatic takeaway: choose by regulatory fit, toolchain maturity, and provable properties, not just familiarity (more: https://blog.adacore.com/should-i-choose-ada-spark-or-rust-over-c-c).
Sources (21 articles)
- [Editorial] The Tiny Recursive Model (www.linkedin.com)
- [Editorial] Increased edit speed, reduced LLM cost (www.npmjs.com)
- Granite 4.0 Micro (3.4B) running 100% locally in your browser w/ WebGPU acceleration (www.reddit.com)
- Just finished a fun open source project, a full stack system that fetches RSS feeds, uses an AI agent pipeline to write new articles, and automatically serves them through a Next.js site all done locally with Ollama and ChromaDB. (www.reddit.com)
- Awesome Local LLM Speech-to-Speech Models & Frameworks (www.reddit.com)
- Desktop app for running local LLMs (www.reddit.com)
- Transcribe and summarize your meetings - local-first - on MacOS (www.reddit.com)
- Sneak Preview: Ollama Bench (www.reddit.com)
- AI agents face off (www.reddit.com)
- How to make Claude Code work for you at night? (www.reddit.com)
- kyomber/CVE-2025-8088 (github.com)
- tfriedel/claude-office-skills (github.com)
- Should I choose Ada, SPARK, or Rust over C/C++? (2024) (blog.adacore.com)
- When Curl Works but IntelliJ Doesn't: The Ollama Connection Mystery (blog.tymscar.com)
- FabioSarracino/VibeVoice-Large-Q8 (huggingface.co)
- ibm-granite/granite-docling-258M (huggingface.co)
- This Week in Security: CVSS 0, Chwoot, and Not in the Threat Model (hackaday.com)
- CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models (arxiv.org)
- deepseek-ai/DeepSeek-V3.1-Terminus (huggingface.co)
- Mitigating Watermark Stealing Attacks in Generative Models via Multi-Key Watermarking (arxiv.org)
- Behavioral Modification Systems in Large Language Models: A Methodological Analysis of Long Conversation Reminders (www.reddit.com)