Fine-tuning giants locally: Open agents and research stacks

Fine-tuning giants, locally

The ceiling on “local” keeps moving up. KTransformers’ new integration with LLaMA-Factory claims to enable fine-tuning DeepSeek-671B (a Mixture-of-Experts model activating 8 of 256 routed experts per token) on a workstation with just four RTX 4090s by aggressively offloading to CPU RAM and using pipeline parallelism; the team reports that around 70–80 GB of total GPU memory suffices if you have roughly 1.2–1.3 TB of host memory, with CPU memory bandwidth the likely bottleneck. They plan to cut CPU RAM requirements further and add QLoRA support; the Qwen and GLM families are next on the roadmap, with VL models to follow. Notably, they clarified that earlier documentation overstated the RAM needed for DeepSeek-V2-Lite-14B; about 30 GB of host memory is enough. For multi-GPU setups, total VRAM matters more than the exact card model, thanks to pipeline parallelism (more: https://www.reddit.com/r/LocalLLaMA/comments/1oo4kh7/finetuning_deepseek_671b_locally_with_only_80gb/).
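The host-memory figure is easy to sanity-check with back-of-envelope arithmetic (illustrative only; the real budget also depends on quantization, optimizer state, and activations):

```python
# Back-of-envelope memory budget for fine-tuning a 671B-parameter model.
PARAMS = 671e9       # DeepSeek-671B total parameter count
BYTES_BF16 = 2       # bytes per parameter in BF16

weights_gb = PARAMS * BYTES_BF16 / 1e9   # ~1342 GB for BF16 weights alone
print(f"BF16 weights alone: ~{weights_gb:.0f} GB")

# With the full weights resident in host RAM, each GPU only needs its
# current pipeline stage plus adapters and activations -- which is how
# ~70-80 GB of total VRAM can cover a model this size.
```

That ~1.3 TB for raw BF16 weights lines up with the reported 1.2–1.3 TB host-memory requirement, and it makes clear why CPU memory bandwidth, not GPU compute, becomes the bottleneck.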

Local acceleration is also widening beyond GPUs. Community reports indicate IPEX-LLM’s llama.cpp path can drive smaller LLMs on laptop NPUs from Intel, AMD, and Qualcomm’s Hexagon, though setup—especially on Intel—can be finicky, and comparative GPU vs. NPU performance is still an open question for many (more: https://www.reddit.com/r/LocalLLaMA/comments/1onubfl/ipexllm_llamacpp_portable_gpu_and_npu_working/).

Pushing huge models into small VRAM and using heterogeneous accelerators is not only a cost play. It shifts the locus of control toward individual operators and small teams, who can now tune and run powerful models under their own security, data, and latency constraints—assuming they can stomach the RAM bill and the system integration work (more: https://www.reddit.com/r/LocalLLaMA/comments/1oo4kh7/finetuning_deepseek_671b_locally_with_only_80gb/), (more: https://www.reddit.com/r/LocalLLaMA/comments/1onubfl/ipexllm_llamacpp_portable_gpu_and_npu_working/).

Open agents and research stacks

Open-source agent stacks are converging on privacy, portability, and extensibility. SurfSense pitches itself as a local-first research agent alternative to NotebookLM/Perplexity/Glean, wiring 100+ LLMs (including local Ollama and vLLM), 6,000+ embedding models, and connectors for search engines and common SaaS sources. It supports a cross-browser extension to capture dynamic/authenticated pages and is planning mergeable mind maps and multi-user notebooks; Model Context Protocol (MCP) support and llama.cpp integration are on the roadmap. For those allergic to Docker, there’s a manual installation path documented by the project (more: https://www.reddit.com/r/LocalLLaMA/comments/1oo1j5x/open_source_alternative_to_notebooklmperplexity/).

If you need to run third-party agents locally with stronger guardrails, the AgentSystems platform takes a systems approach: federated discovery via a Git-based index, per-agent containers, default-deny egress with an allowlist proxy, runtime credential injection so keys never live in images, model abstraction across providers (including Ollama), and hash-chained Postgres audit logs. It’s Apache-2.0 and designed to let operators discover, run, and audit agents on their own hardware, with an example agent that synthesizes findings from any subreddit using local models (more: https://www.reddit.com/r/LocalLLaMA/comments/1onpyq8/selfhosted_platform_for_running_thirdparty_ai/).
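The default-deny egress idea reduces to a simple allowlist check at the proxy. A minimal sketch of that check (the hostnames and policy shape here are hypothetical illustrations, not AgentSystems’ actual configuration):

```python
from urllib.parse import urlparse

# Hypothetical per-agent egress policy: default-deny, explicit allowlist.
ALLOWED_HOSTS = {"api.example-llm.com", "localhost"}

def egress_allowed(url: str, allowlist=ALLOWED_HOSTS) -> bool:
    """Return True only if the URL's host is explicitly allowlisted."""
    host = urlparse(url).hostname
    return host is not None and host in allowlist

# A proxy enforcing this refuses connections to anything off-list, so a
# compromised or malicious agent cannot exfiltrate to arbitrary endpoints.
```

Combined with runtime credential injection, this means a third-party agent container never holds long-lived keys and can only talk to the hosts its operator approved.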

MCP is becoming the connective tissue for tools that LLMs can operate. GenFilesMCP v0.2.0 adds per-session user authentication for multi-user scenarios in Open WebUI, auto-organizing generated and reviewed docs into per-user knowledge collections, and recommends Postgres for production concurrency. A temporary toggle, ENABLE_CREATE_KNOWLEDGE=false, avoids auto-creating knowledge for teams that need RAG to keep working while a file-processing quirk is addressed; for now, “Bypass Embedding and Retrieval” must be enabled for uploads to process. The tool ships as a Docker image (more: https://www.reddit.com/r/OpenWebUI/comments/1onqizw/v020_genfilesmcp/). On the media side, a Go-based Multi-Source Media MCP Server provides a unified interface to fetch and generate images (Unsplash, Pexels, T2I/I2I), crawl pages asynchronously, and is designed for easy extension; planned tools include text-to-video, similar-image search, caching, and auto-tagging/captioning (more: https://github.com/Decade-qiu/Multi-Source-Media-MCP-Server).

Safety, security, and control layers

OpenAI released gpt-oss-safeguard models (120B and 20B parameter variants) under Apache 2.0, trained specifically for safety reasoning. They interpret your written policy, output their chain-of-thought for auditability, and allow tuning the “reasoning effort.” The 120B version reportedly fits on a single H100 (with 5.1B active parameters), and both require the “harmony” response format. These are positioned for classification and filtering use cases—LLM I/O filtering, online content labeling, and offline T&S pipelines—not as general-purpose assistants (more: https://huggingface.co/openai/gpt-oss-safeguard-120b).

At the same time, researchers are probing for deeper architectural weaknesses. A new arXiv work was announced examining cross-stage vulnerabilities in LLM architectures, signaling interest in threats that emerge from interactions across layers or phases of model processing rather than from prompt content alone. Details are limited from the announcement, but the focus on “cross-stage” vectors is a reminder that safety isn’t only a policy or prompt-engineering problem (more: https://www.reddit.com/r/LocalLLaMA/comments/1oo8q0v/research_crossstage_vulnerabilities_in_large/).

Product choices can exacerbate exposure. A widely shared editorial argues that ChatGPT’s link-handling on Android and mobile web allows hidden prompt injection in user contexts, potentially causing data leaks or account actions—framed as a “feature” rather than a bug given the product’s priorities. Whether one agrees with the author’s framing, the risk model—arbitrary, context-injected instructions from the open web—should concern anyone deploying assistants that browse or execute across trust boundaries (more: https://www.linkedin.com/posts/georgzoeller_click-a-link-on-the-web-leak-documents-ugcPost-7392112142075740160-So7b?).

Defenders are arming up with graph context. runZeroHound converts runZero’s asset inventory into BloodHound OpenGraph nodes and edges—assets, services, subnets, domains, VLANs—so teams can run graph queries like “Windows machines with external IPs” or “paths from the Internet into 10.0.0.0/8,” or spot BYOD iOS devices sitting next to Cisco gear with default SNMP strings. It’s not an official runZero product, but it provides practical glue for pivoting from exposure management into attack-path analysis (more: https://github.com/runZeroInc/runZeroHound).

Foundation model research advances

Parameter-efficient tuning and native multimodality both saw noteworthy results. A new paper proposes optimizing one LoRA factor on the Stiefel manifold—maintaining orthonormal columns via Riemannian optimization—to prevent basis collapse that often reduces LoRA’s effective rank during standard Euclidean training. The authors report better accuracy, faster convergence, and improved parameter efficiency (achieving target quality at lower rank), with diagnostics showing full effective rank preservation versus correlated columns under AdamW. For teams squeezing performance out of low-rank adapters, the geometric constraint is a clean, testable change (more: https://arxiv.org/abs/2508.17901v1).
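As a concrete picture of the geometric constraint, here is a minimal NumPy sketch of one Riemannian descent step on the Stiefel manifold (tangent-space projection followed by QR retraction). This is the standard textbook construction, not the paper’s exact optimizer:

```python
import numpy as np

def stiefel_project(X, G):
    # Project a Euclidean gradient G onto the tangent space at X,
    # where X has orthonormal columns (X.T @ X = I).
    XtG = X.T @ G
    return G - X @ (0.5 * (XtG + XtG.T))

def qr_retract(Y):
    # Map an off-manifold point back to orthonormal columns via thin QR,
    # flipping signs so the retraction is deterministic.
    Q, R = np.linalg.qr(Y)
    return Q * np.sign(np.diag(R))

# One Riemannian descent step for a LoRA factor X (here 64x8, i.e. rank 8):
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((64, 8)))
G = rng.standard_normal((64, 8))          # stand-in for a loss gradient
X = qr_retract(X - 0.1 * stiefel_project(X, G))
# X.T @ X remains the identity after every step, so the adapter's columns
# cannot collapse into a correlated, rank-deficient basis.
```

The orthonormality invariant is exactly what prevents the effective-rank loss the authors observe under plain AdamW.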

From BAAI, Emu3.5 touts “native multimodal” pretraining over 10T+ interleaved tokens (video frames + transcripts) with a unified next-token objective, reinforcement learning post-training, and a Discrete Diffusion Adaptation (DiDA) mode that reportedly yields around 20× faster inference without quality loss. It handles interleaved vision–text I/O natively, targets long-horizon generation and any-to-image synthesis, and the team claims it matches Gemini 2.5 Flash Image (Nano Banana) on image generation/editing and surpasses it on interleaved generation tasks. The repo includes inference code, suggests at least two GPUs for throughput, and exposes a vision tokenizer and image-focused variant (more: https://huggingface.co/BAAI/Emu3.5-Image).

Developer workflows and orchestration

Agents aren’t just writing code—they’re managing other models. One practitioner built an “AI Art Director” agent on Chase Agents to translate a creative brief into prompts for an image model (NanoBanana) and then a video model (VEO3), effectively acting as a project manager that handles prompt engineering across steps. The practical challenge is what you’d expect: error handling. The current iteration adds a review-and-retry loop to fix weak image generations before handing off to video, because garbage in, garbage out still applies (more: https://www.reddit.com/r/ChatGPTCoding/comments/1olwy46/i_built_an_ai_art_director_agent_to_orchestrate/).
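The review-and-retry pattern itself is simple to sketch. Here `generate` and `review` are hypothetical stand-ins for the image-model call and the critic pass, not Chase Agents’ actual API:

```python
def generate_with_review(brief, generate, review, max_attempts=3):
    """Generate, critique, and retry with feedback folded into the prompt."""
    prompt = brief
    result = None
    for _ in range(max_attempts):
        result = generate(prompt)
        ok, feedback = review(result, brief)
        if ok:
            return result  # good enough to hand off to the video model
        # Garbage in, garbage out: fold the critique back into the prompt.
        prompt = f"{brief}\n\nPrevious attempt failed review: {feedback}"
    return result  # best effort after exhausting retries
```

In practice `generate` would call the image model and `review` a vision-capable judge; capping attempts keeps a stubborn failure from burning unbounded credits.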

On the lighter-weight end, “on-the-fly” code reviews with Ollama monitor a repo and run local model checks during development. It’s a small library, but early users find it handy for catching issues without leaving the terminal—another example of cheap automation woven into existing workflows (more: https://www.reddit.com/r/ollama/comments/1openho/onthefly_code_reviews_with_ollama_it_kinda_works/).
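The core of such a tool is little more than a call to Ollama’s local REST API. A minimal sketch, assuming `ollama serve` is running on its default port with a coding model pulled (the model name and prompt wording are illustrative):

```python
import json
import urllib.request

def build_review_request(diff_text, model="qwen2.5-coder"):
    # Payload for Ollama's /api/generate endpoint, non-streaming.
    return {
        "model": model,
        "prompt": "Review this diff for bugs and style issues:\n\n" + diff_text,
        "stream": False,
    }

def review_diff(diff_text, host="http://localhost:11434"):
    # Requires a local Ollama server; returns the model's review text.
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(build_review_request(diff_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Wire `review_diff` to a file watcher over `git diff` output and you have the whole “on-the-fly” loop without leaving the terminal.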

Users continue discovering unexpected assistant patterns. A thread on Claude highlights decision support (clarifying trade-offs), preserving memory/personality between chats via community projects, SSH-driven CLI tasks as a “remote memory” for commands, Kanban-driven tasking inside an IDE, and a cautionary note: when prodded into constant self-critique (“sycophancy checks”), models can prioritize avoiding mistakes over finishing the job, leading to paralysis. It’s a good reminder that tooling and prompt choreography can matter as much as model choice (more: https://www.reddit.com/r/ClaudeAI/comments/1omcw0o/whats_your_most_unexpected_claude_workflow/).

Reliability, from GC to logic

A fascinating Ruby postmortem shows how GC invariants fail when native extensions miss write barriers. Under load, ffi < 1.17.0 stored Ruby objects (like the FFI struct’s internal field map Hash) without RB_OBJ_WRITE, so the GC could collect them and later allocate a String at the same memory address. The next field lookup treated that String as a Hash, exploding with “undefined method 'default' for an instance of String.” Upgrading to ffi 1.17.0, which adds the appropriate write barriers, resolves the issue. It’s a crisp example of why memory-safety and GC semantics matter even in “managed” languages when native code is involved (more: https://mensfeld.pl/2025/11/ruby-ffi-gc-bug-hash-becomes-string/).

For teams verifying system behavior at the model level, the Maude 3 manual remains a deep resource: rewriting logic foundations; search-based and bounded model checking of invariants; LTL satisfiability and model checking; variant generation and unification with irreducibility constraints; narrowing reachability; and rich reflection/metalevel computation. Maude’s breadth—from object-oriented modules to strategies and user interfaces—makes it a durable choice for specifying and checking properties before they become incidents (more: https://maude.lcc.uma.es/maude-manual/).

High-throughput data plumbing

Streaming backbones underwrite modern AI and observability stacks. Apache Iggy (Incubating) is a Rust-based, single-binary message streaming platform focused on ultra-low latency and millions of messages per second. It uses zero-copy (de)serialization over binary data, supports QUIC/TCP/HTTP with TLS, and ships client libraries across eight languages today. Features include consumer groups, partitioning, authZ with PATs, OpenTelemetry/Prometheus integration, built-in benchmarks, retention policies, and S3-compatible backups—plus “cargo install” convenience for local runs (more: https://iggy.apache.org/).

Tinkering at the edge of physics

Not everything needs a datacenter. A Hackaday build explores whether a quadcopter can sustain flight on photovoltaic panels alone, detailing the design trade-offs, energy budget math, and flight results. It’s a useful reality check on what today’s cells and power electronics can deliver in the air, and where energy density still bottlenecks persistent UAVs (more: https://hackaday.com/2025/11/01/building-a-pv-solar-powered-quadcopter/).
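The energy-budget math is worth internalizing. A rough worked example with assumed numbers (these are generic figures, not the Hackaday build’s actual measurements):

```python
# All inputs are illustrative assumptions, not figures from the build.
irradiance = 1000.0    # W/m^2, full midday sun
panel_area = 0.25      # m^2 of cells a small quad might plausibly carry
cell_eff = 0.22        # decent monocrystalline cell efficiency
system_eff = 0.85      # combined MPPT, wiring, and ESC losses

pv_watts = irradiance * panel_area * cell_eff * system_eff   # ~46.8 W

hover_w_per_kg = 150.0  # typical small-multirotor hover power density
mass_budget_kg = pv_watts / hover_w_per_kg                   # ~0.31 kg
# The entire aircraft -- frame, motors, panels, electronics -- must come
# in under roughly 310 g to hover on sunlight alone. That is why energy
# density, not control or aerodynamics, is the binding constraint.
```

Every term in that product is near its practical ceiling already, which is what makes persistent solar-only multirotor flight such a narrow design corner.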

Design power shifts to intelligence

A bracing editorial argues design leadership is clinging to interface-era rituals while strategic control migrates to the “intelligence layer”—models, data, prompts, retrieval, policies, and inference-time logic. The author’s claim: today’s processes (sprints, design systems) optimize the 5% of experience still in the interface, while the other 95% is decided by model behavior and system logic. The call to action is operational, not rhetorical: restructure sprints, change metrics, learn the substrate, get seats in model/policy rooms, and actually ship end-to-end using modern tools—otherwise, adjacent functions will (more: https://www.suffsyed.com/futurememo/the-design-leaders-are-lying-to-you).

Taken alongside the security and safety items above, the through-line is clear: power accrues where models are selected, tuned, governed, and integrated. Teams that treat that layer as the “new interface” will make fewer avoidable mistakes and ship more of the product that users actually experience (more: https://www.suffsyed.com/futurememo/the-design-leaders-are-lying-to-you).

Sources (19 articles)

  1. [Editorial] https://www.suffsyed.com/futurememo/the-design-leaders-are-lying-to-you (www.suffsyed.com)
  2. Self-hosted platform for running third-party AI agents with Ollama support (Apache-2.0) (www.reddit.com)
  3. [Research] Cross-Stage Vulnerabilities in Large Language Model Architectures (www.reddit.com)
  4. Finetuning DeepSeek 671B locally with only 80GB VRAM and Server CPU (www.reddit.com)
  5. Open Source Alternative to NotebookLM/Perplexity (www.reddit.com)
  6. IPEX-LLM llama.cpp portable GPU and NPU working really well on laptop (www.reddit.com)
  7. "On-the-fly" code reviews with ollama. It kinda works.. (www.reddit.com)
  8. I Built an "AI Art Director" Agent to Orchestrate Image and Video Models. (www.reddit.com)
  9. What's your most unexpected Claude workflow discovery? (www.reddit.com)
  10. runZeroInc/runZeroHound (github.com)
  11. Decade-qiu/Multi-Source-Media-MCP-Server (github.com)
  12. When Your Hash Becomes a String: Hunting Ruby's Million-to-One Memory Bug (mensfeld.pl)
  13. Maude 3 Manual (maude.lcc.uma.es)
  14. Apache Iggy is a high-performance, persistent message streaming platform (iggy.apache.org)
  15. BAAI/Emu3.5-Image (huggingface.co)
  16. Building a PV Solar-Powered Quadcopter (hackaday.com)
  17. Riemannian Optimization for LoRA on the Stiefel Manifold (arxiv.org)
  18. openai/gpt-oss-safeguard-120b (huggingface.co)
  19. v0.2.0 - GenFilesMCP (www.reddit.com)