Local LLM engineering gets sharper

A community continuation of Karpathy’s NanoGPT doubles as a living notebook of modern transformer tricks. Beyond the tutorial baseline, it adds FlexAttention and FlashAttention (via scaled dot-product attention), sliding-window attention with a ramp schedule, document masking, and several flavors of attention logit soft-capping. Optimizers are configurable (AdamW or Muon with momentum/Nesterov), as are multi-head variants (MHA/MQA/GQA), QK normalization, normalization layers (RMSNorm/LayerNorm), activation functions (GELU, ReLU, ReLU^2, SiLU, SwiGLU), positional encodings (RoPE/NoPE/absolute), tied embeddings, gradient clipping, and warmups (including “kernel warmup”). Everything is wired through GPTConfig and TrainingConfig so learners can mix and match components while reading explanatory comments. The author stresses it’s for configurability and education, not raw speed, and even points to a faster speedrun fork for performance-minded readers (more: https://www.reddit.com/r/LocalLLaMA/comments/1ov2qee/my_opensource_continuation_flexattention_rope/).
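The mix-and-match pattern is easy to picture as a dataclass config plus string-dispatched builders. This is a hypothetical sketch in that spirit; the field names and dispatch function are illustrative assumptions, not the repo's actual GPTConfig API.

```python
from dataclasses import dataclass

# Illustrative config in the spirit of the repo's GPTConfig; field names
# here are assumptions for the sketch, not the project's actual schema.
@dataclass
class GPTConfig:
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    attention: str = "flex"       # "flex" | "flash" | "sdpa"
    head_variant: str = "gqa"     # "mha" | "mqa" | "gqa"
    norm: str = "rmsnorm"         # "rmsnorm" | "layernorm"
    activation: str = "swiglu"    # "gelu" | "relu" | "relu2" | "silu" | "swiglu"
    pos_encoding: str = "rope"    # "rope" | "nope" | "absolute"
    qk_norm: bool = True
    tied_embeddings: bool = True

def build_norm(cfg: GPTConfig):
    """Dispatch on a config string -- the mix-and-match pattern."""
    if cfg.norm == "rmsnorm":
        return lambda x: x  # stand-in for an RMSNorm module
    if cfg.norm == "layernorm":
        return lambda x: x  # stand-in for a LayerNorm module
    raise ValueError(f"unknown norm: {cfg.norm}")

# Swapping components is a one-line config change, not a code edit.
cfg = GPTConfig(attention="flash", head_variant="mqa")
norm = build_norm(cfg)
```

The point of this design is that each axis (attention kernel, head variant, norm, activation, positions) varies independently, which is exactly what makes the repo useful as a teaching artifact.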

Inference engineers chasing tokens-per-second will appreciate early results on cross-GPU prefix KV reuse over GPU-to-GPU links. Today, most frameworks only reuse cached KV states locally, so multi-GPU deployments redo a lot of prefill work even when the prompt prefix is identical. A prototype that exports/imports prefix KV tensors between processes via RDMA/NVLink shows about a 15% latency reduction in optimistic conditions, with a vLLM fork available for experimentation. It’s early, but the direction—multi-tier KV caching plus faster transports and smarter schedulers—looks promising (more: https://www.reddit.com/r/LocalLLaMA/comments/1ovic54/crossgpu_prefix_kv_reuse_with_rdma_nvlink_early/).
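The core bookkeeping behind prefix KV reuse can be sketched as a block-aligned prefix index: hash every block-aligned token prefix, and let a second process look up the longest cached match before deciding how much prefill to skip. This is a hedged sketch of the idea only; the block size, hashing scheme, and "handle" indirection are assumptions, not the prototype's actual design.

```python
import hashlib

BLOCK = 16  # tokens per cached KV block (illustrative granularity)

def prefix_keys(token_ids):
    """Yield (prefix_length, digest) for every block-aligned prefix."""
    h = hashlib.sha256()
    for i, tok in enumerate(token_ids, start=1):
        h.update(tok.to_bytes(4, "little"))
        if i % BLOCK == 0:
            yield i, h.hexdigest()

class PrefixKVIndex:
    """Maps prefix digests to opaque handles for exported KV tensors."""

    def __init__(self):
        self.store = {}

    def publish(self, token_ids, handle_for):
        # handle_for(n) would export the KV state covering the first n tokens.
        for n, digest in prefix_keys(token_ids):
            self.store[digest] = handle_for(n)

    def longest_match(self, token_ids):
        """Return (matched_tokens, handle) for the longest reusable prefix."""
        best = (0, None)
        for n, digest in prefix_keys(token_ids):
            if digest in self.store:
                best = (n, self.store[digest])
        return best
```

In a real deployment the handle would name RDMA/NVLink-importable tensors; here it is just a tag. The prefill saved is `matched_tokens / prompt_length`, which is where the reported latency win would come from.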

Hardware choices remain a moving target. A popular thread asks when Nvidia’s RTX 6000 Pro makes sense over a GeForce 5090 for local LLM workloads, reflecting the practical calculus many builders face when trading gaming-class throughput for workstation features (more: https://www.reddit.com/r/LocalLLaMA/comments/1otmamz/when_does_rtx_6000_pro_make_sense_over_a_5090/). Meanwhile, a user trying to run a code model on an 8‑core/32 GB RAM CPU-only VPS reports poor results—another reminder that model class and quantization matter, and that “CPU-only” often means “be selective about models and expectations” (more: https://www.reddit.com/r/ollama/comments/1ow31je/anyone_running_code_model_in_cpu_only_vps/).

MCP agents need observability

Developers are still figuring out how to connect models to execution sandboxes with the Model Context Protocol (MCP). One Redditor, after reading Anthropic’s writeup on code execution with MCP, asks whether an MCP client calling tools like “get-folder” and “send-code” is the right architecture and how to actually wire a model to that environment. The post captures an early-stage reality: builders want off‑the‑shelf patterns for MCP clients and servers, but the article’s emphasis on protocol semantics leaves some readers hunting for concrete glue (more: https://www.reddit.com/r/LocalLLaMA/comments/1oti4or/how_to_link_an_ai_to_a_code_execution_environment/).

Visibility is improving. MCP Shark positions itself as “Wireshark-level observability” for MCP, aggregating multiple MCP servers (HTTP or stdio) into a live dashboard that captures every JSON‑RPC request/response, with SQLite-backed audit logging, latency metrics, correlation IDs, and advanced filtering/export. The pitch—observability as “the new compliance”—drew pushback: practitioners noted monitoring is post‑facto, has blind spots in non‑deterministic systems, and can cost 2–4× the price of inference itself. The counterpoint: observability is necessary but insufficient without runtime governance that can enforce policy and provide verifiable guarantees (more: https://www.linkedin.com/posts/ivandj_as-ai-agents-multiply-across-tools-and-protocols-activity-7394057385872556032-SlAQ).
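The mechanics of SQLite-backed JSON-RPC auditing are simple enough to sketch: log each message with a correlation ID, then join request and response rows to derive latency. The schema and field names below are assumptions for illustration, not MCP Shark's actual format.

```python
import json
import sqlite3
import time
import uuid

# Minimal JSON-RPC audit log: one row per message, correlated by ID.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE rpc_log (
    correlation_id TEXT, direction TEXT, ts REAL,
    method TEXT, payload TEXT)""")

def log_message(direction, message, correlation_id=None):
    """Record one JSON-RPC message; returns the correlation ID."""
    cid = correlation_id or str(uuid.uuid4())
    db.execute(
        "INSERT INTO rpc_log VALUES (?, ?, ?, ?, ?)",
        (cid, direction, time.time(),
         message.get("method", ""), json.dumps(message)),
    )
    return cid

# A request and its response share a correlation ID...
cid = log_message("request", {"jsonrpc": "2.0", "id": 1, "method": "tools/call"})
log_message("response", {"jsonrpc": "2.0", "id": 1, "result": {}}, correlation_id=cid)

# ...so latency falls out of a timestamp join.
ts = [row[0] for row in db.execute(
    "SELECT ts FROM rpc_log WHERE correlation_id = ? ORDER BY ts", (cid,))]
latency = ts[1] - ts[0]
```

Note how even this toy version illustrates the critics' point: the row is written after the call has already happened, so logging alone cannot block a bad tool invocation.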

The protocol itself is maturing. A recent MCP spec shift toward file-based discovery lets servers describe tools and resources as structured files rather than injecting large schemas into the model’s context. Proponents say this separates knowledge (execution layer) from reasoning (model), cuts token usage, enables versioning and testing of tool definitions, and allows agents to discover capabilities simply by scanning a directory—aligning neatly with lightweight, local-first agent stacks (more: https://www.linkedin.com/posts/reuvencohen_the-latest-mcp-spec-feels-like-the-moment-activity-7394373616471072768-okAg). That tracks with Anthropic’s still-relevant guidance: start with the simplest possible workflow; only add agentic complexity when it demonstrably improves outcomes; and keep a clean architectural line between code-orchestrated workflows and autonomous agents (more: https://www.linkedin.com/posts/henrikgothberg_anthropic-building-effective-ai-agents-ugcPost-7394348623796350977-tcq1).
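File-based discovery reduces to "scan a directory, parse small spec files, build a registry" — no schemas in the model's context. The file layout below is an assumption chosen to illustrate the pattern, not the MCP spec's actual on-disk format.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def discover_tools(tool_dir: Path) -> dict:
    """Build a capability registry by scanning a directory of tool specs."""
    tools = {}
    for path in sorted(tool_dir.glob("*.json")):
        spec = json.loads(path.read_text())
        tools[spec["name"]] = spec  # name, version, description, schema...
    return tools

# Simulate a tool directory: each tool is a versionable, testable file.
with TemporaryDirectory() as d:
    tool_dir = Path(d)
    (tool_dir / "search.json").write_text(json.dumps(
        {"name": "search", "version": "1.2.0", "description": "Full-text search"}))
    (tool_dir / "fetch.json").write_text(json.dumps(
        {"name": "fetch", "version": "0.9.1", "description": "Fetch a URL"}))
    registry = discover_tools(tool_dir)
```

Because the specs are ordinary files, they can be diffed, version-pinned, and unit-tested like any other artifact — the properties proponents cite — and the model only ever sees the registry entries it actually needs.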

Leveling up everyday workflows

A power user hit Claude’s session limits after a single, heavy financial review and asked how to “level up” beyond better prompts. The community’s advice was pragmatic: put files in a GitHub repo and work through VS Code so Claude can search, read, and write across the corpus without cramming everything into context; add a CLAUDE.md to supply durable task intent; and version all changes for easy rollbacks. Others recommended Claude Desktop Projects/Skills for persistent context, supplying sample outputs to focus analysis, and using Claude Code for tighter control over tools and automation. To stretch capacity, turn off extended thinking unless needed, default to the lighter model (Haiku) for simple tasks, and install utilities that show context window usage to avoid token burn (more: https://www.reddit.com/r/ClaudeAI/comments/1ov1ppj/how_do_i_level_up_from_normie_to_normie_pro_with/).

Not every workflow needs a web UI. One hacker found an open-source iMessage SDK in TypeScript that lets scripts read incoming messages and send replies, including files and images. Coupled to a local API (e.g., Oobabooga) or a lightweight RAG stack, the agent “lives” in iMessage and can summarize group chats or handle routine tasks. Others noted that Telegram bots are easier to wire up, but the iMessage path reduces cognitive friction: the agent shows up where users already are (more: https://www.reddit.com/r/LocalLLaMA/comments/1ovqlq2/a_proper_way_to_connect_a_local_llm_to_imessage/).

Diffusion MoE language model lands

A new open-source diffusion language model, LLaDA2.0‑mini‑preview, adopts a 16B Mixture‑of‑Experts architecture with only 1.4B parameters active per inference step. It is instruction‑tuned, uses RoPE positional embeddings, and has 20 layers, 16 attention heads, and a 4,096‑token context; the team reports pretraining on roughly 20T tokens. The project emphasizes efficient inference and “tool use” support for complex agent tasks (more: https://huggingface.co/inclusionAI/LLaDA2.0-mini-preview).

On author-reported benchmarks, the model posts an average of 58.71 across a broad suite, including mathematics (GSM8K 89.01; “math” 73.50), coding (MBPP 77.75; HumanEval 80.49), and agent/alignment tasks (BFCL_Live 74.11). Results also show competitive performance on knowledge tests like MMLU (72.49) and multilingual exams. The team plans to release an inference framework and new benchmarks like SyllogEval and IXRB in ABench (more: https://huggingface.co/inclusionAI/LLaDA2.0-mini-preview).

For usage, the authors recommend sampling with temperature 0.0, block_length=32, and steps=32, and allowing up to 2,048 tokens of output (or 4,096 for longer math/programming tasks). They position the model as outperforming similarly sized dense models while reducing compute costs, and distribute it under Apache 2.0 (more: https://huggingface.co/inclusionAI/LLaDA2.0-mini-preview).

The architectural debate isn’t quieting down. A widely shared analysis of Yann LeCun’s recent slides argues that “auto‑regressive LLMs are doomed” for human‑level AI due to error compounding over long generations, and points to objective‑driven “world models” (e.g., JEPA variants) that learn predictive structure from sensory streams—“physics, not poetry.” The reported takeaway for agent builders: autonomy isn’t the bottleneck; architecture is. Whether this marks the next bend of the S‑curve or just a complementary lane to language models remains to be seen, but the critique of token‑by‑token generation is sharpening (more: https://www.linkedin.com/posts/stuart-winter-tear_so-reportedly-yann-lecun-plans-to-leave-activity-7394396547276460032-gEE5).

Imaging: from benchmarks to relighting

A developer report compared frontier image generators by running more than 600 prompts, an increasingly common approach as teams weigh model swaps against cost, latency, and style constraints. Controlled bake‑offs like this are useful because qualitative impressions tend to diverge from aggregate performance across styles and edge cases (more: https://latenitesoft.com/blog/evaluating-frontier-ai-image-generation-models/).

On the customization front, a LoRA for Qwen‑Image‑Edit focuses on relighting. Triggered with the Chinese phrase “重新照明” (relighting), the model can apply instructions like “use the soft, diffuse light from curtains” to re‑illuminate an image. The author trained with ModelScope’s service, shares an online demo, and recommends pairing with Qwen‑Image‑Lightning for best results. For creators, this is a neatly scoped tool: small, fast, and targeted to a single visual affordance that generalist models often get almost right (more: https://huggingface.co/dx8152/Qwen-Image-Edit-2509-Relight).

Identity, auth, and spyware reality

“Attackers are logging in.” That line from a recent editorial captures the core failure of modern web authentication: static controls (passwords, MFA) are necessary but insufficient when adversaries operate inside identity protocols. The piece argues authentication must be treated as operational detection: look for anomalies in device/session behavior, not just factor checks. Token replay sits at the heart of many breaches—session cookies, JWTs, OAuth grants, even some passkeys can be harvested by info‑stealers (e.g., Raccoon, Redline, Vidar) and replayed without tripping front‑door challenges. The author cites rising account‑takeover attempts and shows how attackers optimize their OODA loop to mimic legitimate usage and avoid flags (more: https://defensiblesystems.substack.com/p/web-authentication-is-broken).
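Treating authentication as detection means binding a token to the context it was issued in and flagging drift when it is presented later. This is a hedged sketch of that shape only; the signals (device hash, ASN, user agent) and the flat dictionary store are illustrative assumptions, not any product's detection model.

```python
# token_id -> context captured when the session was established
ISSUED = {}

def bind_token(token_id, context):
    """Record the device/network context at issuance time."""
    ISSUED[token_id] = context

def replay_signals(token_id, context):
    """Return which bound attributes changed since issuance."""
    issued = ISSUED.get(token_id)
    if issued is None:
        return ["unknown_token"]
    return [k for k in ("device_hash", "asn", "user_agent")
            if issued.get(k) != context.get(k)]

bind_token("t1", {"device_hash": "abc", "asn": 64500, "user_agent": "Firefox"})

# The same cookie presented from a different device and network is the
# classic shape of an info-stealer replay: the factor check passes,
# but the presenting context has drifted.
alerts = replay_signals("t1", {"device_hash": "zzz", "asn": 64501,
                               "user_agent": "Firefox"})
```

The MFA prompt never fires in this scenario — the token is valid — which is exactly why the editorial argues for behavioral checks alongside factor checks.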

Government spyware makes the same point in a harsher register. Despite vendor claims of targeted, rare use, documented cases show broad targeting of journalists, activists, and low‑profile political figures. Systems like Pegasus or Graphite reduce attack “activation energy” to a console that takes a phone number and does the rest. Pricing often scales with concurrent targets, encouraging wide nets. The result: a “huge abuse temptation,” little transparency, and few consequences. There are glimmers of accountability—Paragon publicly cut ties with Italy over alleged abuse investigations; the U.S. sanctioned several vendors and affiliates—but it’s unclear if any measures are slowing a global, multi‑billion market (more: https://techcrunch.com/2025/11/10/why-a-lot-of-people-are-getting-hacked-with-government-spyware/).

Cars, buses, and low‑level stacks

A security thread making the rounds details a CAN‑bus injection theft method affecting several keyless Toyota and Lexus models. This isn’t a radio relay attack; “Faraday pouches are useless.” Thieves reportedly pry into the wheel well or headlight, access the CAN wiring, and plug in a device that floods the network with fake commands. Because classic CAN assumes a trusted network, ECUs can’t distinguish genuine start signals from spoofed ones. The fix is well‑known—cryptographic message authentication like AUTOSAR SecOC, plus network segmentation—but the episode underscores how physical access plus unauthenticated control planes remains a recipe for disaster (more: https://www.linkedin.com/posts/mreichstein_cybersecurity-carhacking-physicalsecurity-activity-7394425210877218816-NPyT).
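The well-known fix can be sketched concretely: authenticate each frame with a freshness counter plus a truncated MAC, so a receiver rejects injected and replayed frames even on a physically compromised bus. Real SecOC uses AES-CMAC with profile-specific truncation; HMAC-SHA256 stands in here as an assumption, and the frame layout is illustrative.

```python
import hashlib
import hmac
import struct

KEY = b"\x01" * 16  # shared symmetric key provisioned to sender and receiver
MAC_LEN = 4         # truncated tag, as on bandwidth-limited buses

def protect(can_id: int, payload: bytes, freshness: int) -> bytes:
    """Append low freshness bits and a truncated MAC over (ID, counter, data)."""
    msg = struct.pack(">IQ", can_id, freshness) + payload
    tag = hmac.new(KEY, msg, hashlib.sha256).digest()[:MAC_LEN]
    return payload + struct.pack(">Q", freshness)[-3:] + tag

def verify(can_id: int, frame: bytes, expected_freshness: int) -> bool:
    """Recompute the MAC with the receiver's own freshness expectation."""
    payload, tag = frame[:-(3 + MAC_LEN)], frame[-MAC_LEN:]
    msg = struct.pack(">IQ", can_id, expected_freshness) + payload
    good = hmac.new(KEY, msg, hashlib.sha256).digest()[:MAC_LEN]
    return hmac.compare_digest(tag, good)

frame = protect(0x123, b"\x01\x00", freshness=7)
ok = verify(0x123, frame, expected_freshness=7)                       # genuine
forged = verify(0x123, b"\xff\x00" + frame[2:], expected_freshness=7)  # injected data
replayed = verify(0x123, frame, expected_freshness=8)                  # stale counter
```

Without the key, a device spliced into the headlight harness can still flood the bus, but its fake start commands fail verification — which is precisely what classic, trust-the-wire CAN cannot offer.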

At the other end of the stack, a RISC‑V SoC project integrated USB HID keyboard/mouse support directly (no external USB PHY), added host control of keyboard LEDs via HID Set_Report, and built a regression‑testable simulation by emulating USB HID devices with a Forth-powered emulator. The system runs on Verilator and an FPGA board (Arty A7), with a Wishbone frontend and proper clock-domain crossing. It’s a great example of how careful hardware/software co‑design and device emulation can validate features that OS‑level stacks often hide (more: https://epsilon537.github.io/boxlambda/usb-hid/).

Email defenses and OSINT leaks

A lean, self‑hosted DMARC parser landed for teams that want visibility without a heavy stack. Written in Go with a Vue.js dashboard, it’s a single binary with IMAP integration, SQLite storage, and RFC 7489 compliance packed into a ~14 MB Docker image. Add a DMARC DNS record to receive aggregate reports, point the tool at the inbox, and you get dashboards showing sources, SPF/DKIM pass rates, and handling outcomes. Start with p=none, validate legitimate senders, then ratchet toward quarantine/reject—without wrestling Elasticsearch (more: https://github.com/meysam81/parse-dmarc).
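The aggregate reports such a tool ingests follow the RFC 7489 XML schema: each `<record>` carries a source IP, a message count, and the SPF/DKIM evaluation. A minimal reader, using only the standard library (the sample report below is fabricated for illustration):

```python
import xml.etree.ElementTree as ET

# A pared-down aggregate report in the RFC 7489 shape (fabricated sample).
SAMPLE = """<feedback>
  <policy_published><domain>example.com</domain><p>none</p></policy_published>
  <record>
    <row>
      <source_ip>192.0.2.10</source_ip><count>12</count>
      <policy_evaluated>
        <disposition>none</disposition><dkim>pass</dkim><spf>fail</spf>
      </policy_evaluated>
    </row>
  </record>
</feedback>"""

def summarize(report_xml: str):
    """Return (published policy, per-source rows) from an aggregate report."""
    root = ET.fromstring(report_xml)
    rows = []
    for rec in root.iter("record"):
        row = rec.find("row")
        pe = row.find("policy_evaluated")
        rows.append({
            "source_ip": row.findtext("source_ip"),
            "count": int(row.findtext("count")),
            "dkim": pe.findtext("dkim"),
            "spf": pe.findtext("spf"),
        })
    return root.findtext("policy_published/p"), rows

policy, rows = summarize(SAMPLE)
```

Rows like this one — DKIM passing while SPF fails from an unfamiliar IP — are what you review during the p=none phase before ratcheting the policy toward quarantine or reject.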

On the OSINT side, a developer suggests using AI face search as an internal audit to find data leakage in RAG corpora. In a test, a blurry, old photo linked to a pseudonymous profile picture on a personal GitLab, which held a legacy API key. The failure mode is familiar: reusing avatars across personal/pro accounts plus modern face search equals unintended identity linkage. It’s a stark reminder that audits should include public artifacts, not just private repos and logs (more: https://www.reddit.com/r/ChatGPTCoding/comments/1oq4p2s/project_idea_using_an_ai_face_search_to_find_data/).

Compact compute and tooling notes

Compact clusters are getting neater. A Hackaday build uses Raspberry Pi Compute Modules to pack a multi‑node cluster into a very small footprint—useful for CI shards, build farms, or distributed inference experiments that don’t require high‑end GPUs (more: https://hackaday.com/2025/11/12/pi-compute-modules-make-for-compact-cluster/). In the tooling stream, additional open‑source projects like the vsa repository continue to circulate for specialized analysis or automation tasks, reflecting the steady cadence of niche utilities that glue modern stacks together (more: https://github.com/etalazz/vsa).

Sources (22 articles)

  1. [Editorial] Web Authentication is Broken (defensiblesystems.substack.com)
  2. [Editorial] As AI agents multiply across tools and protocols… (www.linkedin.com)
  3. [Editorial] Cybersecurity, car hacking, physical security… (www.linkedin.com)
  4. [Editorial] Anthropic: Building effective AI agents… (www.linkedin.com)
  5. [Editorial] So reportedly Yann LeCun plans to leave… (www.linkedin.com)
  6. [Editorial] The latest MCP spec feels like the moment… (www.linkedin.com)
  7. Cross-GPU prefix KV reuse with RDMA / NVLink - early experimental results (www.reddit.com)
  8. When does RTX 6000 Pro make sense over a 5090? (www.reddit.com)
  9. A proper way to connect a local LLM to iMessage? (www.reddit.com)
  10. My (open-source) continuation (FlexAttention, RoPE, BlockMasks, Muon, etc.) to Karpathy's NanoGPT (www.reddit.com)
  11. How to link an AI to a code execution environment? (www.reddit.com)
  12. Anyone running code model in cpu only VPS? (www.reddit.com)
  13. Project Idea: Using an AI face search to find data leakage in RAG source repositories. (www.reddit.com)
  14. How do I level up from normie to normie pro with Claude (www.reddit.com)
  15. meysam81/parse-dmarc (github.com)
  16. etalazz/vsa (github.com)
  17. On USB HID, Keyboard LEDs, and device emulation (2024) (epsilon537.github.io)
  18. We ran over 600 image generations to compare AI image models (latenitesoft.com)
  19. Why a lot of people are getting hacked with government spyware (techcrunch.com)
  20. inclusionAI/LLaDA2.0-mini-preview (huggingface.co)
  21. dx8152/Qwen-Image-Edit-2509-Relight (huggingface.co)
  22. Pi Compute Modules Make for Compact Cluster (hackaday.com)