On-device models hit stride: Agentic tooling and MCP data

On-device models hit stride

The on-device race keeps compressing models into phones and consumer GPUs. A Nexa AI demo shows OpenAI’s GPT-OSS-20B running on an ASUS ROG 9 with Snapdragon Gen 5 at roughly 17 tokens/second and a sub-3-second time-to-first-token—numbers that would have sounded fanciful a year ago. Community questions focused on the 16 GB of unified RAM, potential NPU offload, quantization choices like mxfp4 (~12 GB), and API endpoints for local-network access—practicalities that matter if “local on mobile” is going to be more than a novelty. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nzvhth/run_open_ai_gptoss_on_a_mobile_phone_demo/)
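
Much of the practicality hinges on that last point: if the phone app exposes an OpenAI-compatible endpoint on the LAN, other devices can use it like any hosted model. A minimal sketch of what that could look like, assuming such an endpoint exists (the address, port, and model name below are placeholders, not details from the demo):

```python
# Hypothetical sketch: querying an OpenAI-compatible chat endpoint served from a phone
# on the local network. The IP, port, and model name are placeholders, not from the demo.
import requests

resp = requests.post(
    "http://192.168.1.42:8080/v1/chat/completions",  # phone's LAN address (placeholder)
    json={
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Summarize today's notes in two sentences."}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```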

AI21 is pushing the “tiny but capable” envelope with Jamba 3B. The company positions it as a fast, long-context generalist that keeps throughput stable at extended context lengths, claiming ~40 tokens/second on a Mac even past 32K tokens and ~33 tokens/second at 128K, while peers like Qwen 3 4B reportedly fall below 1 token/second and Llama 3.2 3B “goes down.” AI21 says Jamba 3B runs locally on iPhone, Android, Mac, and PC, pitching it as a Swiss Army knife for smart replies, assistants, and routing. As always, treat vendor benchmarks with healthy skepticism until community replications land. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o1ac09/ai21_releases_jamba_3b_the_tiny_model/)

At the other extreme, a user reported running IBM Granite4 Small-h 32b-A9b (Q4_K_M) at its full 1M-token context using about 73 GB of VRAM. That still demands substantial hardware, but it’s impressive context scaling in quantized form. If you’re chasing maximum window length rather than minimum latency, this is the kind of configuration that moves the goalposts. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nzozpg/granite4_smallh_32ba9b_q4_k_m_at_full_1m_context/)
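
Whatever the poster’s exact stack, a sketch of one way to request a context window that large locally, using llama-cpp-python (the filename and settings are placeholders; actual memory use depends heavily on KV-cache and offload choices):

```python
# Hedged sketch, not the poster's setup: loading a quantized GGUF with llama-cpp-python
# and asking for a ~1M-token context window.
from llama_cpp import Llama

llm = Llama(
    model_path="granite-4.0-h-small-Q4_K_M.gguf",  # placeholder filename
    n_ctx=1048576,      # ~1M tokens; the KV cache must fit in VRAM/RAM
    n_gpu_layers=-1,    # offload all layers to the GPU if they fit
)
out = llm("Summarize the following document:\n...", max_tokens=64)
print(out["choices"][0]["text"])
```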

Community momentum remains essential to separate signal from hype. A thread collecting hands-on reports for GLM-4.5-Air and GLM-4.6-Distill underscores the appetite for comparative testing under real workloads and hardware constraints. And new small models keep popping up: inclusionAI’s Ling-mini-2.0 appeared on Hugging Face, though the provided material offers no details beyond the listing itself. (more: https://www.reddit.com/r/LocalLLaMA/comments/1nyopyc/did_anyone_try_out_glm45airglm46_distill/) (more: https://huggingface.co/inclusionAI/Ling-mini-2.0)

Agentic tooling and MCP data

Agentic infrastructure is converging around the Model Context Protocol (MCP). An Agentics Foundation update highlights OpenAI’s AgentKit—a toolkit described as bundling Agent Builder (visual flows), a Connector Registry for data, and ChatKit for UI integration, plus evaluation and Reinforcement Fine Tuning (RFT) tools across GPT‑5 and beyond. Claims like “faster, safer, more collaborative” will need independent validation, but the integration push mirrors where many developers already are: stitching tools and data into reliable pipelines. (more: https://www.linkedin.com/pulse/agentics-foundation-weekly-update-oct-7-2025-agentics-org-p81ie)

The same update links a tutorial on ChatGPT’s MCP Developer Mode by Reuven (rUv) Cohen, walking through building an MCP server that ChatGPT can call, exposing live tools such as search, fetch, and analyze. That’s the right direction: fewer brittle “copy-paste” glue steps, more real interfaces that preserve provenance and reduce context loss. (more: https://www.linkedin.com/pulse/agentics-foundation-weekly-update-oct-7-2025-agentics-org-p81ie)
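
For a sense of scale, such a server is only a few dozen lines. A minimal sketch using the official Python SDK’s FastMCP helper (an assumption on my part; the tutorial’s own stack and tool set may differ, and the tool bodies here are stubs):

```python
# Minimal MCP server sketch using the official Python SDK (`pip install mcp`).
# Tool names and bodies are illustrative, not the tutorial's.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def search(query: str) -> str:
    """Return search results for a query (stubbed here)."""
    return f"results for: {query}"

@mcp.tool()
def fetch(url: str) -> str:
    """Fetch a URL and return its contents (stubbed here)."""
    return f"contents of {url}"

if __name__ == "__main__":
    # Runs over stdio by default; ChatGPT's Developer Mode needs a network-reachable
    # transport, which the SDK also supports.
    mcp.run()
```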

rUv also showcased “Claude Hosted Sandboxes,” where npx commands spin up agentic systems inside Claude’s environment: packages like agentic-flow, agent-nexus, and sublinear-time-solver promise everything from multi-agent swarms and 200+ MCP tools to “emergent intelligence” and “quantum-classical hybrid computing.” Community replies asked for plain-English explanations and cautioned that current quantum backends are limited and error-prone, so any claimed advantage should be scrutinized. If trying these, note you must enable the sandbox first. Ambitious demos are great; measurable, reproducible capabilities are better. (more: https://www.linkedin.com/posts/reuvencohen_many-people-ask-how-they-can-try-my-latest-activity-7381736891026608128-1J1M)

On the data side, Toucan-1.5M released what it calls the largest fully synthetic tool-agent dataset to date: 1.5 million trajectories synthesized from 495 real-world MCPs spanning 2,000+ tools, with multi-round sequential and parallel tool calls. The authors report that models fine-tuned on this data outperform much larger closed systems on BFCL V3 and extend the Pareto frontier on MCP‑Universe—encouraging if you’re training for tool use in authentic environments. (more: https://github.com/TheAgentArk/Toucan)
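
Inspecting the trajectories should follow the usual Hugging Face datasets workflow; the dataset identifier below is a guess based on the project name, so check the repo for the published path:

```python
# Hedged sketch: streaming the tool-agent trajectories with the `datasets` library.
# The dataset ID is a guess; see the Toucan repo for the real location and schema.
from datasets import load_dataset

ds = load_dataset("Agent-Ark/Toucan-1.5M", split="train", streaming=True)  # 1.5M rows, so stream
first = next(iter(ds))
print(first.keys())  # inspect the trajectory fields (messages, tool calls, etc.)
```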

Planning, memory, and refactoring

In personal AI assistants, Pardus AI open-sourced a local-first assistant that “memorizes what you have done” and lets you query your own information. It relies on Ollama for embeddings, and the maintainers say you can swap components to keep everything local, avoiding OpenRouter. Users asked for graph/wiki-style browsing of the auto-notes and Windows support; the team indicated those features are on the roadmap. (more: https://www.reddit.com/r/ollama/comments/1nxms96/pardus_ai_open_source_ai_assistant_thanks_for_the/)
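
For the curious, pulling local embeddings out of Ollama is a one-request affair; a sketch of the kind of call such an assistant might make (the embedding model name is a placeholder, not necessarily what Pardus AI uses):

```python
# Sketch of a local embedding call against Ollama's REST API (default port 11434).
# The embedding model name is a placeholder; Pardus AI's actual choice may differ.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Met with the team about the Q4 roadmap."},
)
vector = resp.json()["embedding"]
print(len(vector))  # dimensionality depends on the embedding model
```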

For project planning, OpenBacklog positions itself as an AI backlog manager for solo devs that syncs with Claude Code via MCP. The workflow: define requirements in OpenBacklog, then ask Claude Code to “pick up work,” which imports the tasks and pushes updates back automatically—no manual context handoff. Pricing is a flat $7/month including $7 of AI credit, with at-cost overage; a commenter noted the site wasn’t up yet even though the domain had been purchased. It’s fully open source on GitHub, which helps with transparency and longevity. (more: https://www.reddit.com/r/ClaudeAI/comments/1o23v9w/claude_code_finally_has_a_planning_partner_i/)

Refactoring remains a reminder that LLMs need environment context. A developer pasted a 400‑line C# method into multiple chat UIs and got non-compiling outputs; after upgrading to ChatGPT Plus, they reported “GPT-5 Thinking” produced compilable code. Others recommended using Codex through a CLI or VS Code extension so the AI can compile/iterate, and mentioned Qwen-coder-plus as effective for refactors. Tools like SonarQube were suggested to baseline code quality before changes. The shared lesson: chat windows are fine for sketches; serious refactors benefit from IDE-integrated agents that can build and test. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1o15etk/what_to_use_for_refactoring/)

Theory of learning and private ML

Two timely papers tackle fundamental and applied concerns. On learning dynamics, “Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking” proposes a three-stage framework—Lazy learning, Independent feature learning, Interactive feature learning—for 2‑layer nonlinear networks. The authors show how weight decay and backpropagated gradients induce hidden units to independently learn features by ascending an explicit energy function, with local maxima corresponding to emergent features; later in training, interacting features target “missing” concepts. They analyze representation power, generalization-versus-memorization trade-offs with sample size, and the role of hyperparameters like weight decay and learning rate, connecting the theory to why optimizers such as Muon can work well. While the results focus on two-layer setups, the authors argue the analysis can extend to deeper networks. (more: https://arxiv.org/abs/2509.21519)

On the applied side, “SecureV2X” addresses privacy in Vehicle-to‑Everything systems, where ML models increasingly run in infrastructure-to-cloud workflows. The paper points out that V2X can involve sensitive signals—from location and speed to license plates and even EEG—and argues for secure multi-party computation (MPC) to enable privacy-preserving inference when models can’t run locally or are proprietary. Examples like multivariate LSTMs predicting crash risk within 3 seconds highlight the promise; the authors situate their work within the U.S. DOT’s National Deployment Plan for V2X, which explicitly emphasizes privacy during testing and deployment. (more: https://arxiv.org/abs/2508.19115v1)

Security agendas and oddities

The Prompt||GTFO #9 agenda spans autonomous vulnerability discovery, patching systems, reverse-engineering agents, leak hunting with LLMs, TPRM vendor questionnaires via ChatGPT, and experiments in collaborative, persistent context. Also on deck: a “Prompt Injector” talk and a proposal where “Two AI Companions Make Repos and Attest Each Other,” exploring decentralized trust. It’s a snapshot of how security engineers are bending LLMs to fit real-world workflows—and where the rough edges still are. (more: https://www.linkedin.com/posts/promptorgtfo_agenda-for-our-upcoming-episode-this-thursday-activity-7380606405503782912-LgNu)

Finally, a post titled “Breaking ‘Provably Correct’ Leftpad” is a useful reminder that proofs depend on assumptions and contexts—particularly in security-critical code where edge cases and environment differences can invalidate seemingly airtight guarantees. (more: https://lukeplant.me.uk/blog/posts/breaking-provably-correct-leftpad/)
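
The post’s specifics aside, the general failure mode is easy to demonstrate: a pad function that is provably correct when length is counted in code points can still misalign output once code points and visible glyphs diverge. A small illustration (my example, not the post’s):

```python
# Illustration of "correct under its assumptions": leftpad measured in code points
# behaves exactly as specified, yet the padded strings stop lining up visually
# when a string contains combining characters.
def leftpad(s: str, n: int, fill: str = " ") -> str:
    return fill * max(n - len(s), 0) + s

print(leftpad("cafe", 8) + "|")        # 4 pad chars, 8 visible columns
print(leftpad("cafe\u0301", 8) + "|")  # 5 code points, so only 3 pad chars: 7 visible columns
```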

Diffusion training and tagging

If you train diffusion models and prefer GUIs, Diffusion-Pipe in ComfyUI adds end-to-end training and fine-tuning as custom nodes. It supports LoRA and full fine-tunes across 20+ models, DeepSpeed distributed training, TensorBoard monitoring, and WSL2. Utilities cover dataset configuration (e.g., image/text pairs, aspect-ratio bucketing), video frame counts, and numerous optimization knobs (activation checkpointing, bf16/fp16/fp8, 4‑bit quantization). A preconfigured workflow file aims to get users started quickly. (more: https://github.com/TianDongL/Diffusion_pipe_in_ComfyUI)

On the tagging side, PixAI Tagger v0.9 targets anime images with ~13.5k Danbooru-style tags, emphasizing recall to aid search, dataset curation, and captioning. It continues training the classification head of EVA02 (from WD v3) on a January 2025 snapshot, using embedding-space MixUp. Reported performance includes a character subset micro‑F1 of ~0.865 at a 0.75 threshold, with defaults tuned for coverage rather than precision. It’s available as an endpoint or local deployment, with clear notes on limitations like NSFW tags, hallucinations, and dataset bias. (more: https://huggingface.co/pixai-labs/pixai-tagger-v0.9)
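
The threshold is the main recall/precision dial in practice. A generic sketch of what that looks like downstream of the model (illustrative tags and probabilities, not the tagger’s actual inference code):

```python
# Generic multi-label tag selection from per-tag probabilities: a lower threshold
# favours recall (coverage), a higher one precision. Tags and scores are made up.
import numpy as np

tag_names = ["1girl", "outdoors", "smile", "school_uniform"]
probs = np.array([0.92, 0.81, 0.40, 0.12])  # model outputs after sigmoid

def select_tags(probs, names, threshold=0.75):
    return [n for n, p in zip(names, probs) if p >= threshold]

print(select_tags(probs, tag_names, threshold=0.75))  # ['1girl', 'outdoors']
print(select_tags(probs, tag_names, threshold=0.30))  # adds 'smile': more coverage, more noise
```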

Computer-use agents advance

H Company’s Holo1.5 models aim to be foundations for “computer use” agents—systems that perceive UIs, localize elements, and act on real applications. The family spans 3B, 7B (Apache 2.0), and 72B (research-only) variants fine-tuned from Qwen2.5‑VL‑7B‑Instruct, trained with supervised fine-tuning followed by GRPO reinforcement learning. They ingest native high-res screenshots up to 3840×2160, and report state-of-the-art UI localization across benchmarks like ScreenSpot‑V2/Pro, GroundUI‑Web, Showdown, and the new WebClick, with the 7B/72B variants outperforming prior models and the 3B competitive with older 7B baselines. (more: https://huggingface.co/Hcompany/Holo1.5-7B)

Beyond clicking, the models also tackle screen content understanding via QA on ScreenQA Short/Complex, VisualWebBench, and WebSRC, where they report notable gains over comparable open models. A demo and Hugging Face Space show prompting patterns in computer-use settings, relevant for agents like Surfer‑H that must both read and manipulate complex UIs. As with all benchmark claims, community reproductions will matter—but this is the kind of targeted capability that agent frameworks can immediately exploit. (more: https://huggingface.co/Hcompany/Holo1.5-7B)

Vision as a sensor, with caveats

A Hackaday project explores estimating air quality by pointing a camera at the sky and running a compact CNN on a Unihiker K10 (ESP32‑S3). Trained on ~12,000 sky images from India and Nepal labeled with the Air Quality Index, the model predicts AQI from a fresh image. The author notes it’s more an additional signal than a replacement for proper sensors, and commenters rightly warn that clear skies can still hide harmful particulates and that nighttime is a blind spot. As a low-cost supplement—especially leveraging existing cameras—it’s an intriguing idea; as a primary measurement, it’s not there. (more: https://hackaday.com/2025/10/05/divining-air-quality-with-a-cheap-computer-vision-device/)
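
For intuition about what “compact CNN” means here, a generic sketch of an image-to-AQI regressor (not the project’s actual architecture) of the sort that could be shrunk and quantized for a microcontroller-class board after training:

```python
# Generic sketch: a small CNN that regresses an AQI value from a downscaled sky image.
# Not the project's network; just an illustration of the overall shape of the task.
import torch
import torch.nn as nn

class SkyAQINet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)  # single scalar: predicted AQI

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = SkyAQINet()
dummy_sky = torch.rand(1, 3, 64, 64)  # one RGB sky crop
print(model(dummy_sky).item())        # untrained output; just a shape check
```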

Handcrafted data for fine-tuning

A community tool targets a persistent pain point: making high-quality, handwritten fine-tuning datasets. It supports multiple formats (ChatML/ChatGPT, Alpaca, ShareGPT/Vicuna), multi-turn datasets, system messages and custom IDs, token counting, autosave, and format-specific conveniences (e.g., default instructions for Alpaca). The author used it to create a 1,000‑interaction conversational dataset over a month, arguing that human-authored data remains valuable for customized, high-quality LLM behavior. Video demo and download links are provided. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o1gym0/sharing_my_free_tool_for_easy_handwritten/)
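
For readers unfamiliar with the formats involved, the same exchange looks quite different depending on the target. A quick sketch of Alpaca, ShareGPT, and ChatML records (field names follow the widely used conventions; the tool’s exact exports may differ):

```python
# The same single-turn exchange in three common fine-tuning formats.
alpaca_record = {
    "instruction": "Summarize the note in one sentence.",
    "input": "Met with the team about the Q4 roadmap; shipping slips to January.",
    "output": "The Q4 roadmap meeting concluded that shipping slips to January.",
}

sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "Summarize the note in one sentence: Met with the team about the Q4 roadmap; shipping slips to January."},
        {"from": "gpt", "value": "The Q4 roadmap meeting concluded that shipping slips to January."},
    ]
}

chatml_text = (
    "<|im_start|>system\nYou are a concise assistant.<|im_end|>\n"
    "<|im_start|>user\nSummarize the note in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\nThe Q4 roadmap meeting concluded that shipping slips to January.<|im_end|>\n"
)

# Same target text, different wrapping:
print(alpaca_record["output"] == sharegpt_record["conversations"][-1]["value"])
```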

Sources (19 articles)

  1. [Editorial] Agentics Newsletter (www.linkedin.com)
  2. [Editorial] Reminder that Prompt||GTFO #9 is today. (www.linkedin.com)
  3. [Editorial] Latest batch from rUv. (www.linkedin.com)
  4. Granite4 Small-h 32b-A9b (Q4_K_M) at FULL 1M context window is using only 73GB of VRAM - Life is good! (www.reddit.com)
  5. Run Open AI GPT-OSS on a mobile phone (Demo) (www.reddit.com)
  6. AI21 releases Jamba 3B, the tiny model outperforming Qwen 3 4B and IBM Granite 4 Micro! (www.reddit.com)
  7. Sharing my free tool for easy handwritten fine-tuning datasets! (www.reddit.com)
  8. Pardus AI: Open source AI Assistant thanks for the help with Ollama (www.reddit.com)
  9. What to use for refactoring (www.reddit.com)
  10. Claude Code finally has a planning partner — I built an AI backlog manager for solo devs (www.reddit.com)
  11. TheAgentArk/Toucan (github.com)
  12. Provable scaling laws of feature emergence from learning dynamics of grokking (arxiv.org)
  13. Breaking "Provably Correct" Leftpad (lukeplant.me.uk)
  14. inclusionAI/Ling-mini-2.0 (huggingface.co)
  15. Hcompany/Holo1.5-7B (huggingface.co)
  16. Divining Air Quality With A Cheap Computer Vision Device (hackaday.com)
  17. SecureV2X: An Efficient and Privacy-Preserving System for Vehicle-to-Everything (V2X) Applications (arxiv.org)
  18. pixai-labs/pixai-tagger-v0.9 (huggingface.co)
  19. TianDongL/Diffusion_pipe_in_ComfyUI (github.com)