Nanochat makes LLMs tangible: Routing across many models
Nanochat makes LLMs tangible
Andrej Karpathy’s new “nanochat” lands as a full-stack, minimal codebase that lets anyone pretrain, mid-train, fine-tune, optionally RL-tune, and serve a tiny ChatGPT-like model through a web UI—start to finish—in a single repo. The point is not to beat state-of-the-art but to demystify the stack: tokenizer, pretraining on sources like FineWeb, mid-training on dialogue (e.g., SmolTalk), SFT, and a lightweight chat front-end. Community reactions largely frame it as an educational on-ramp—akin to “how to make a website” in the 90s, only for LLMs—rather than a production substitute for leading small models. Setup is reported to work locally and in the cloud; some are experimenting on Macs and modest GPUs, acknowledging slower training. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o5qo0r/it_has_been_4_hrs_since_the_release_of_nanochat/)
Rapid tinkering has already begun. One user swapped out multi-head attention for compact per-token MLPs in the later layers of a small GPT, claiming a speedup with similar accuracy; others caution that the experiment ran at a high training loss (~5.09), a regime where long-range attention carries less value. The takeaway: it’s a promising direction for inference efficiency, but it needs validation at lower losses (closer to 3) before generalizing. That is exactly the kind of “see under the hood” discussion nanochat is meant to provoke. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o5qo0r/it_has_been_4_hrs_since_the_release_of_nanochat/)
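To make the trade-off concrete, here is a minimal PyTorch sketch (not the poster’s code) of a GPT-style stack whose later blocks swap attention for a per-token MLP. Because the MLP mixes channels only and never moves information between positions, the stack leans entirely on the early attention layers for long-range structure, which is exactly why results at high training loss may not transfer.

```python
# Hedged sketch, assuming a standard pre-norm GPT block; not the poster's code.
import torch
import torch.nn as nn


class PerTokenMLP(nn.Module):
    """Channel-mixing block applied independently at every position."""
    def __init__(self, d_model: int, hidden_mult: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden_mult * d_model),
            nn.GELU(),
            nn.Linear(hidden_mult * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); no token-to-token interaction happens here.
        return self.net(x)


class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int, use_attention: bool):
        super().__init__()
        self.use_attention = use_attention
        self.ln1 = nn.LayerNorm(d_model)
        self.mixer = (nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                      if use_attention else PerTokenMLP(d_model))
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = PerTokenMLP(d_model, hidden_mult=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        if self.use_attention:
            # Causal mask so each position attends only to earlier tokens.
            causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                           device=x.device), diagonal=1)
            h, _ = self.mixer(h, h, h, attn_mask=causal)
        else:
            h = self.mixer(h)
        x = x + h
        return x + self.ffn(self.ln2(x))


# Attention in the first half of the stack, per-token MLPs in the second half.
blocks = nn.Sequential(*[Block(256, 4, use_attention=(i < 4)) for i in range(8)])
print(blocks(torch.randn(2, 128, 256)).shape)  # torch.Size([2, 128, 256])
```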
Importantly, nanochat goes beyond Karpathy’s earlier nanoGPT by offering a fuller, end-to-end pipeline with minimal dependencies and a one-script path from GPU box to a runnable chat model. The ethos is autonomy, not benchmarking: build and run a model yourself so you can grasp the entire training and serving lifecycle. Community threads include pointers to getting started through one-command cloud setups and videos, plus reminders that the result is “like talking to a kindergartener”—that’s the point. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o5qo0r/it_has_been_4_hrs_since_the_release_of_nanochat/)
Routing across many models
A different strand of autonomy is emerging around multi-LLM control. One builder describes a “control center” that routes each task to the most suitable model (speed to one model, deep reasoning to another, vision elsewhere), supports self-hosted options for EU/GDPR compliance, and calls 500+ tools via the Model Context Protocol (MCP) and n8n workflows; “find companies that hired a CFO last month and add them to my CRM” becomes a single chat action. The pitch is anti-vendor lock-in, with a pay-as-you-go beta coming. The reality is messier: routing needs heuristics, cache-invalidation costs surface, and switching models mid-conversation risks semantic drift unless conversation state and “waypoint” decisions are carried across models. Some teams sidestep this by committing to one model per thread until better solutions land. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o82lqp/i_got_tired_of_openai_dependency_built_a_multillm/)
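A toy routing sketch (hypothetical model names and helpers, nothing from the builder’s product) shows the two pieces the thread keeps circling: a cheap heuristic router, and explicitly carried “waypoint” state so a mid-thread model switch inherits earlier decisions instead of drifting.

```python
# Minimal sketch, assuming placeholder model names; not a real routing product.
from dataclasses import dataclass, field


@dataclass
class Conversation:
    messages: list = field(default_factory=list)    # running transcript
    waypoints: list = field(default_factory=list)   # decisions/constraints to preserve
    pinned_model: str | None = None                 # optional: one model per thread


def route(conv: Conversation, user_msg: str) -> str:
    """Pick a model by crude heuristics; the names here are placeholders."""
    if conv.pinned_model:                           # sidestep drift: stick to one model
        return conv.pinned_model
    text = user_msg.lower()
    if "image" in text or "screenshot" in text:
        return "vision-model"
    if len(user_msg) > 2000 or "step by step" in text:
        return "deep-reasoning-model"
    return "fast-cheap-model"


def build_prompt(conv: Conversation, user_msg: str) -> str:
    # Re-inject waypoints so a newly routed model inherits prior decisions.
    preamble = "\n".join(f"[decision] {w}" for w in conv.waypoints)
    history = "\n".join(conv.messages[-10:])
    return f"{preamble}\n{history}\nuser: {user_msg}"


conv = Conversation(waypoints=["CRM is HubSpot", "EU data stays on the self-hosted model"])
task = "Find companies that hired a CFO last month and add them to my CRM"
print(route(conv, task))
print(build_prompt(conv, task))
```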
MCP continues to show its value as connective tissue. A developer used the new Apps SDK plus an MCP server to stream live meeting transcripts into ChatGPT for real-time summarization, action-item extraction, and even post-hoc analysis by reopening earlier meetings directly inside ChatGPT. It’s a useful demonstration: once the model sees live streams and typed tool contracts, the assistant stops being a chat toy and starts behaving like a system component. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1o85vl7/turn_chatgpt_into_a_realtime_meeting_assistant/)
On the coding front, users praise the Playwright MCP for test automation with agent assistance, but note heavy context footprints; one approach is to spawn background agents to interact with the browser and despawn them to preserve the main thread’s context. Separately, Coder’s “cmux” brings parallel agentic development to the desktop—workspaces for multiple repos, persistent long-running streams, Plan/Exec loops, Git divergence and diff views, cost tracking, and “opportunistic compaction” to reduce context while keeping agents productive. It’s very much inspired by Claude Code’s UX, oriented to hours-long sessions and multi-approach exploration. (more: https://www.reddit.com/r/ClaudeAI/comments/1o6hsot/claude_code_taking_a_coffee_break/) (more: https://github.com/coder/cmux)
From smarter brains to sturdier bodies
A growing consensus holds that competitive advantage is shifting from raw model cleverness to sustained, tool-using agency: memory over weeks, typed tool contracts, guardrails, recovery behaviors, and auditability. One editorial highlights this “brain vs body” shift, pointing to long context, multimodal tool use, and configurable “thinking budgets” that trade compute for accuracy as key ingredients for agents that can persist through messy enterprise workflows. The practical buyer’s questions: can it hold case context, call approved tools with audit trails, recover and explain failures, and maintain SLOs? (more: https://www.linkedin.com/posts/stuart-winter-tear_google-deepmind-gemini-25-activity-7385540527246725120-mRaN)
Operationalizing that mindset, a practitioner’s “agentic orchestration” guide insists on a bias to verification—agents can be confidently wrong—and on seeding persistent memory with ground truth (schemas, conventions, decisions) so swarms act coherently. It lays out concrete commands for memory storage, status checks, and multi-agent execution patterns—swarm versus hive-mind—within Claude-Flow, but the core lesson generalizes: orchestrate and verify, don’t merely prompt. (more: https://www.linkedin.com/pulse/conductors-guide-agentic-orchestration-marcus-patman-ann5e)
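The pattern travels beyond any one framework. A minimal sketch, assuming hypothetical agent and verifier callables rather than Claude-Flow’s actual commands, captures the shape: seed shared memory with ground truth, fan tasks out, and gate every result through verification before it touches shared state.

```python
# Orchestrate-and-verify sketch; the agent and verifier are stand-in callables.
from typing import Callable

memory = {  # seeded ground truth the swarm must respect
    "schema.users": ["id:int", "email:str", "created_at:datetime"],
    "convention.migrations": "one migration per change, reversible",
}


def run_swarm(tasks: list[str],
              agent: Callable[[str, dict], dict],
              verify: Callable[[dict, dict], bool]) -> list[dict]:
    accepted = []
    for task in tasks:
        result = agent(task, memory)       # agents read memory, never silently rewrite it
        if verify(result, memory):         # bias to verification: agents can be confidently wrong
            accepted.append(result)
            memory[f"decision.{task}"] = result["summary"]   # record the decision as ground truth
        else:
            print(f"rejected: {task} (failed verification)")
    return accepted


# Toy agent and verifier, just to show the control flow.
toy_agent = lambda task, mem: {"summary": f"did {task}", "touched": ["schema.users"]}
toy_verify = lambda result, mem: all(key in mem for key in result["touched"])
print(run_swarm(["add email index"], toy_agent, toy_verify))
```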
Tooling is catching up. One “meta-skill” for Anthropic’s new Claude Skills automates building skills themselves: given a plain-language workflow, it designs the architecture, picks APIs, and generates thousands of lines of Python plus extensive docs—in about 90 minutes—so repetitive, structured workflows can be turned into reusable, composable skills without manual scaffolding. It’s in the same vein as orchestration-first thinking: capture the process and let the agents do the lift. (more: https://www.linkedin.com/posts/promptcompletion_claudeskills-anthropic-agents-activity-7385415820991983616-ijSB)
The evaluation community is moving to match this reality. MCPVerse, a large-scale benchmark built entirely atop the Model Context Protocol, mounts 550+ real executable tools with schemas exceeding 140k tokens and scores models on outcomes against live services, not just the correctness of tool names or parameters. As the toolset grows, many models degrade due to context limits and mounting constraints; interestingly, agentic models can gain from a larger action space, but top accuracy remains modest—plenty of headroom. This is the right difficulty: real tools, real state changes, and real-time ground truth. (more: https://arxiv.org/abs/2508.16260v1)
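To illustrate the scoring distinction with a hypothetical harness (not MCPVerse’s code): static checking compares the predicted tool call to a reference, while outcome checking executes the call against a live service and compares the resulting state to ground truth.

```python
# Two scoring styles, sketched with stand-in dictionaries and an injected executor.
def score_static(predicted: dict, reference: dict) -> bool:
    # Passes only if the model named the right tool with the right parameters...
    return predicted["tool"] == reference["tool"] and predicted["args"] == reference["args"]


def score_outcome(predicted: dict, execute, expected_state: dict) -> bool:
    # ...whereas outcome scoring only cares whether the world ends up in the right state,
    # even if the model got there via a different but valid tool sequence.
    observed = execute(predicted)          # real side effects on a live service
    return all(observed.get(key) == value for key, value in expected_state.items())
```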
A complementary note: the human element still matters. An editorial on building quality engineering skills underscores that durable systems are as much about craft and discipline as they are about models or protocols. (more: https://www.linkedin.com/pulse/building-17-quality-engineering-skills-how-i-turned-29-spiridonov-newhf)
Multimodal progress and pitfalls
Vision-language tooling can falter for non-obvious reasons. LM Studio’s v0.3.6 adds auto-resizing for vision inputs, hard-coding width to 500 px to preserve context windows—but that can tank OCR and dense-text performance. Users report far better text recognition via llama.cpp with other UIs; LM Studio’s own OpenAI-compatible endpoint showed degraded OCR in tests. The underlying trade-off—context vs fidelity—is real; if a VL model underperforms on text-heavy images, check the preprocessor first. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o7l1io/lm_studio_and_vl_models/)
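A quick sanity check, sketched here with Pillow under the assumption of a fixed 500 px target width, is to reproduce the resize locally and look at what the model actually receives before blaming the weights.

```python
# Preprocessor check sketch; the input filename is a placeholder test image.
from PIL import Image


def resize_to_width(img: Image.Image, target_w: int = 500) -> Image.Image:
    scale = target_w / img.width
    return img.resize((target_w, max(1, round(img.height * scale))),
                      Image.Resampling.LANCZOS)


original = Image.open("dense_text_sample.png")   # any text-heavy image you care about
shrunk = resize_to_width(original)
print(f"original {original.size} -> resized {shrunk.size} "
      f"({shrunk.width / original.width:.2f}x)")
shrunk.save("what_the_model_sees.png")           # eyeball whether small text is still legible
```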
On the generation side, Qwen-Image-Edit-2509 brings multi-image editing and stronger consistency for faces, products, and text, plus native ControlNet conditioning (depth, edges, keypoints). The update emphasizes identity preservation in pose and style changes, poster-friendly product edits, and more faithful text editing including font and material—features that help with realistic composites and design workflows. (more: https://huggingface.co/Qwen/Qwen-Image-Edit-2509)
At the model-architecture frontier, Lumina-DiMOO proposes a unified discrete diffusion approach for both multimodal generation and understanding across text-to-image, image-to-image (editing, subject-driven, inpainting), and vision understanding. The team claims higher sampling efficiency than autoregressive or hybrid approaches, plus a 2x speedup via bespoke caching, while reporting state-of-the-art results across several public benchmarks. It’s another nudge toward converged “omni” models that flow between describing, editing, and generating. (more: https://huggingface.co/Alpha-VLLM/Lumina-DiMOO)
Training: evolution, RL, and hardware
Fine-tuning by evolution is back in the conversation. A new repo implements an “Accelerated Evolution” framework that full-rank fine-tunes a 7B model on a single 3090/4090 (no quantization), with vLLM for fast inference. The cited work suggests evolutionary methods can sometimes outperform RL on certain tasks, but they need more responses per sample (around 20 versus GRPO’s 8) and they aggregate reward-weighted noise across the population rather than applying a single perturbation. Skepticism remains warranted—the method feels “insane” at first glance—yet the premise is attractive: no gradients, potentially simpler scaling dynamics. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o5yvut/local_vllm_accelerated_evolution_framework/)
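For intuition, here is a toy evolution-strategies step in NumPy, a sketch of the general recipe rather than the repo’s implementation: sample a population of perturbations, score each with a reward function, and move the parameters along the reward-weighted average of the noise, with no gradients anywhere.

```python
# Gradient-free update sketch; the quadratic reward is a stand-in for a real evaluator.
import numpy as np

rng = np.random.default_rng(0)


def es_step(theta, reward_fn, pop_size=20, sigma=0.05, lr=0.05):
    noise = rng.standard_normal((pop_size, theta.size))            # one perturbation per response
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    centered = rewards - rewards.mean()                            # baseline to cut variance
    # Aggregate *all* perturbations weighted by reward, not just the single best one.
    update = (centered @ noise) / (pop_size * sigma)
    return theta + lr * update


target = np.ones(8)
theta = np.zeros(8)
for _ in range(200):
    theta = es_step(theta, lambda t: -np.sum((t - target) ** 2))
print(np.round(theta, 2))   # approaches the all-ones target without any gradient computation
```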
This dovetails with practical constraints: not everyone has access to large clusters or cutting-edge accelerators. The hardware calculus continues in community threads asking whether high-end turnkey systems like Nvidia’s DGX Spark are worth it; the fact that such debates recur signals a sustained appetite for local, controllable training and inference where budgets, power, and operational constraints vary wildly. (more: https://www.reddit.com/r/ollama/comments/1o6oo5y/nvidia_dgx_spark_is_it_worth/)
For smaller-scale experiments, modular toolchains and careful benchmarks matter. Fast inference backends (like vLLM) can shift the bottleneck elsewhere—data pipelines, evaluators, and feedback loops—and evolutionary methods, if they pan out beyond demos, could provide a gradient-free path to targeted improvements without the overhead of RL infrastructure. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o5yvut/local_vllm_accelerated_evolution_framework/)
Voice stacks that actually talk
Speech systems are getting both faster and more “agent-ready.” An open-source streaming STT server combining Parakeet (ASR) with Silero for end-of-turn signals and Pipecat-based turn handling shows low-latency transcriptions, decent EOT detection (probabilities drop during “uhh/umm”), and batch inference support—running locally on a 3090 or on L40s in deployment. Comments showcase practical pipelines: Parakeet → Granite 4 (LLM) → VoxCPM (TTS) for conversational systems, and longer-form workflows with Silero + pyannote speaker segmentation, API transcription, then LLM-based correction. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o4xkr6/open_source_streaming_stt_parakeet_silero_pipecat/)
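The turn-taking logic itself is easy to sketch with hypothetical interfaces (not the project’s actual API): buffer partial transcripts, query an end-of-turn probability after each update, and only commit the turn once that probability stays high for a short hold window, so fillers pull it back down and keep the floor open.

```python
# End-of-turn gating sketch; the EOT model below is a fake stand-in.
EOT_THRESHOLD = 0.8
HOLD_MS = 300            # EOT confidence must stay high this long before we commit


def handle_stream(asr_partials, eot_probability, on_turn_complete):
    """asr_partials: iterable of (timestamp_ms, partial_text);
    eot_probability: callable(text) -> float in [0, 1];
    on_turn_complete: called once per finished user turn (e.g. hand off to the LLM)."""
    buffer, high_since = "", None
    for ts, partial in asr_partials:
        buffer = partial
        if eot_probability(buffer) >= EOT_THRESHOLD:
            high_since = high_since if high_since is not None else ts
            if ts - high_since >= HOLD_MS:
                on_turn_complete(buffer)
                buffer, high_since = "", None
        else:
            high_since = None            # "umm..." drops confidence: keep listening


# Toy run: the fake EOT model dislikes trailing fillers.
fake_eot = lambda t: 0.2 if t.rstrip().endswith(("umm", "uhh", ",")) else 0.9
stream = [(0, "book a table"), (200, "book a table umm"),
          (600, "book a table for two"), (900, "book a table for two")]
handle_stream(stream, fake_eot, lambda text: print("TURN:", text))
```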
The pattern mirrors text agents: plug-and-play components, typed contracts, and end-to-end observability. Clean turn-taking and robust EOT matter as much as decoder quality in multi-turn voice, especially when agent actions ride on partial transcripts. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o4xkr6/open_source_streaming_stt_parakeet_silero_pipecat/)
As these systems mature, the boundary between “voice UI” and “operational interface” blurs. Streaming MCP-connected assistants that act in live meetings or call external tools on demand push voice systems from demo-grade to business-grade capabilities. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1o85vl7/turn_chatgpt_into_a_realtime_meeting_assistant/)
Isolation, observability, and signals
For running AI workloads with strong isolation, Volant treats microVMs as a first-class runtime: a control plane, CLI, and in-guest agent share a signed plugin manifest system; cloud-init bootstraps dev sandboxes; VFIO passthrough isolates GPU workloads; and a vsock-secured agent proxies to network-isolated processes. Operators can spin up replicated microVM clusters, boot from snapshots, or convert OCI images to bootable artifacts—all with a REST and MCP API surface for orchestration. It’s designed for “stealth, speed, and scale” with sensible defaults and deep configurability. (more: https://github.com/volantvm/volant)
Observability tools are catching up too. A blog note highlights that Wireshark 4.6.0 now supports macOS pktap metadata like PID and process name—small but meaningful quality-of-life improvements for tracing which process is talking to what in local network captures. (more: https://nuxx.net/blog/2025/10/14/wireshark-4-6-0-supports-macos-pktap-metadata-pid-process-name-etc/)
At the other end of the spectrum, opaque infrastructure can become newsworthy when it leaks into the open. NPR reports a “mysterious signal” from a classified SpaceX satellite network, underscoring how modern systems can broadcast real-world artifacts long before their purpose is explained. It’s a reminder that security, transparency, and spectrum hygiene remain shared concerns as private space infrastructure scales. (more: https://www.npr.org/2025/10/17/nx-s1-5575254/spacex-starshield-starlink-signal)
Watermarks versus modern inpainting
A new GitHub project “SoraWatermarkCleaner” trains a YOLOv11s detector to localize the Sora watermark and uses LaMa-based inpainting (via the iopaint stack) to remove it, with a one-click Windows build and labeled datasets available on Hugging Face. There’s even a FastAPI wrapper exposing upload, status polling, and download routes. It’s a technically clean pipeline—detect, mask, inpaint—that demonstrates how easily modern detectors plus diffusion-based inpainting can erase provenance markers in generated video. (more: https://github.com/linkedlist771/SoraWatermarkCleaner)
The project’s emphasis on portability and dataset openness makes it a strong reference implementation for watermark removal research or testing. For any ecosystem relying solely on visible watermarks for provenance, the lesson is familiar: adversarial removal is now off-the-shelf. (more: https://github.com/linkedlist771/SoraWatermarkCleaner)
As image generation and editing tools improve—multi-image conditioning, ControlNet cues, and stronger identity preservation—defeating visual provenance marks becomes a moving target. Defense likely requires layered strategies beyond simple overlays. (more: https://huggingface.co/Qwen/Qwen-Image-Edit-2509)
Pragmatism beats perfection in the field
Finally, a delightful hardware reminder: the “Chicken Squisher 3000” automates a coop door using an AVR16DD14 microcontroller, NSL-A6009 light sensor, 12 V geared DC motor, and a DRV8231 driver tuned to limit stall torque—strong enough to move a wooden door, gentle enough not to harm birds. Buttons allow manual override. The design deliberately avoids complexity (no rack-and-pinion) and favors a smooth-rod actuation—simple, serviceable mechanics. (more: https://hackaday.com/2025/10/16/chicken-squisher-3000-squish-proof-security/)
The comments are a microcosm of engineering trade-offs: BLDC vs brushed motors for load detection, current sensing via existing low-side resistors, debris management, and environmental sealing. Some argue for linear actuators and drop-doors with rubber seals for better winterization; others emphasize cleaning realities in dusty, feathery environments. The author’s pragmatic response—sealed mechanisms, simple current sensing, weekly wipe-down—captures the spirit. (more: https://hackaday.com/2025/10/16/chicken-squisher-3000-squish-proof-security/)
There’s an analogy here for AI systems: robust, auditable, and maintainable often beats theoretically elegant but fragile. Tool contracts, memory schemas, and clear recovery paths are the “dust covers” and current sensors of agentic software. (more: https://www.linkedin.com/pulse/conductors-guide-agentic-orchestration-marcus-patman-ann5e)
Sources (21 articles)
- [Editorial] Building quality engineering skills (www.linkedin.com)
- [Editorial] Brain vs body (www.linkedin.com)
- [Editorial] Claude skills (www.linkedin.com)
- [Editorial] Agentic Orchestration (www.linkedin.com)
- It has been 4 hrs since the release of nanochat from Karpathy and no sign of it here! A new full-stack implementation of an LLM like ChatGPT in a single, clean, minimal, hackable, dependency-lite codebase (www.reddit.com)
- Local VLLM Accelerated Evolution Framework (www.reddit.com)
- Open source streaming STT (Parakeet + Silero + Pipecat Smart Turn) (www.reddit.com)
- LM Studio and VL models (www.reddit.com)
- I got tired of OpenAI dependency. Built a multi-LLM control center instead. (www.reddit.com)
- Nvidia DGX Spark, is it worth ? (www.reddit.com)
- Turn ChatGPT into a real-time meeting assistant (via MCP + Apps SDK) (www.reddit.com)
- Claude Code taking a coffee break 🤔 (www.reddit.com)
- volantvm/volant (github.com)
- linkedlist771/SoraWatermarkCleaner (github.com)
- Show HN: Cmux – Coding Agent Multiplexer (github.com)
- Wireshark 4.6.0 Supports macOS Pktap Metadata (PID, Process Name, etc.) (nuxx.net)
- A classified network of SpaceX satellites is emitting a mysterious signal (www.npr.org)
- Qwen/Qwen-Image-Edit-2509 (huggingface.co)
- Alpha-VLLM/Lumina-DiMOO (huggingface.co)
- Chicken Squisher 3000: Squish-Proof Security (hackaday.com)
- MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use (arxiv.org)