Agentic Coding Infrastructure and Tools

Today's AI news: Agentic Coding Infrastructure and Tools, Open-Source Model Releases and Deployment, Local AI Deployment and Mobile Computing, AI Resear...

What actually works at scale? Addy Osmani's analysis of "The 80% Problem" crystallizes what many developers are discovering: while AI agents now write the majority of code for some engineers, the errors they produce have evolved from simple syntax bugs into architectural landmines that detonate several pull requests deep (more: https://addyo.substack.com/p/the-80-problem-in-agentic-coding).

The statistics are striking. Andrej Karpathy reports inverting his coding ratio from 80% manual to 80% agent-driven within weeks. Boris Cherny, creator of Claude Code, claims 100% agent-written code for over two months, shipping 22-27 PRs daily. A survey of 5,000 developers shows 44% now write less than 10% of their code manually. But these numbers carry a critical caveat: they apply primarily to greenfield projects, not the messy reality of large existing codebases with team dynamics.

Karpathy's catalog of persistent problems reads like a taxonomy of junior developer mistakes amplified by machine speed. "Assumption Drift" describes how models misunderstand something early, then build entire features on faulty premises—invisible until the architecture has calcified. "Overcomplication Bias" manifests as 1,000 lines scaffolded where 100 would suffice; when pushed back, agents immediately simplify, revealing they optimize for appearing comprehensive rather than maintainable. "Collateral Damage" sees old implementations linger while comments disappear as side effects. Most troubling is "Sycophancy"—agents don't push back with "Are you sure?" but enthusiastically execute whatever is described, even when incomplete or contradictory.

The infrastructure response to these challenges is producing interesting tools. A new MCP server called "Lad" addresses what its creators term "Agent Tunnel Vision"—the phenomenon where LLMs, generating text token-by-token, gaslight themselves once they make a bad early design choice. Lad provides a second pair of eyes through dual-reviewer architecture using models like Kimi-K2-Thinking and GLM-4.7, integrating with Serena's codebase indexing to give reviewers access to project context and memories that survive between coding sessions (more: https://www.reddit.com/r/LocalLLaMA/comments/1qu6ylc/arguably_the_best_ai_code_review_mcp_server_with/). This addresses the fundamental problem that standard AI reviewers are "amnesic"—they see the diff, not the history.
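
To make the pattern concrete, here is a minimal sketch of the dual-reviewer fan-out idea, not Lad's actual implementation: the same diff goes to two locally served reviewer models, each primed with project memory persisted between sessions. The endpoints, model names, and memory file are assumptions for illustration.

```python
# Minimal sketch of a dual-reviewer fan-out; endpoints and model names are
# illustrative, assuming two OpenAI-compatible servers host the reviewers.
from openai import OpenAI

REVIEWERS = {
    "kimi-k2-thinking": OpenAI(base_url="http://localhost:8001/v1", api_key="none"),
    "glm-reviewer": OpenAI(base_url="http://localhost:8002/v1", api_key="none"),
}

def load_memory(path: str = "review_memory.md") -> str:
    """Project memory that survives between coding sessions (hypothetical file)."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ""

def review(diff: str) -> dict[str, str]:
    """Send the same diff to both reviewers, each primed with project memory."""
    memory = load_memory()
    verdicts = {}
    for name, client in REVIEWERS.items():
        resp = client.chat.completions.create(
            model=name,
            messages=[
                {"role": "system",
                 "content": f"You review code diffs for this project. Memory:\n{memory}"},
                {"role": "user", "content": diff},
            ],
        )
        verdicts[name] = resp.choices[0].message.content
    return verdicts
```

Two independent verdicts make it harder for a single model's early bad framing to survive review unchallenged, which is exactly the tunnel-vision failure the server targets.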

Security concerns for agentic coding have spawned ADDT (AI Don't Do That), a universal sandboxing tool that wraps any AI coding agent in Docker isolation. The rationale is straightforward: agents can read, write, and execute code, and mistakes that delete wrong files or overwrite important data should stay contained. ADDT adds network firewalls controlling which domains agents can access and resource limits preventing runaway processes, with extensions for Claude, Codex, Gemini, Copilot, Cursor, and swarm orchestration tools (more: https://www.linkedin.com/posts/patrickdebois_github-jedi4everaddt-run-ai-coding-agents-activity-7424653736788099072-7Aov).
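
A minimal sketch of the same sandboxing idea, not ADDT itself: wrap the agent invocation in `docker run` with resource caps and the project directory mounted. The image name and agent command are assumptions, and per-domain network filtering would additionally need an egress proxy that this sketch omits in favor of the strictest setting.

```python
# Sketch: run an AI coding agent inside Docker with resource limits.
# "agent-sandbox:latest" is a hypothetical image with the agent preinstalled.
import subprocess

def run_sandboxed(agent_cmd: list[str], project_dir: str) -> int:
    docker_cmd = [
        "docker", "run", "--rm", "-it",
        "--memory", "4g",        # cap RAM so runaway processes stay contained
        "--cpus", "2",           # cap CPU usage
        "--network", "none",     # no network; a real setup would allow-list domains
        "-v", f"{project_dir}:/workspace",
        "-w", "/workspace",
        "agent-sandbox:latest",
        *agent_cmd,
    ]
    return subprocess.run(docker_cmd).returncode

# e.g. run_sandboxed(["claude", "-p", "fix the failing tests"], ".")
```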

The hardware requirements for local agentic coding remain daunting. One developer building an EPYC 8124P server for running Claude Code with local models learned quickly from experienced users: without dedicated GPUs, even server-class CPUs with massive RAM produce inference speeds measured in minutes per prompt. The consensus is that the gap between 3-5 tokens/second on CPU and 20+ on GPU is the difference between productive flow and painful waiting (more: https://www.reddit.com/r/LocalLLaMA/comments/1qq5aif/epyc_8124p_siena_build_for_agentic_coding/).

Perhaps most interesting is the emerging question of whether human code review remains necessary. One developer notes that when refactoring legacy code to use the strategy pattern, Claude found "a crazy large amount of errors, both logical and syntactical" missed by humans who wrote and reviewed the original code. With solo developers becoming responsible for entire systems where only AI reviews the code, the industry is grappling with whether developer-plus-agent constitutes a sufficient "team" for quality assurance. The reverse engineering community is exploring similar questions with a Ghidra MCP server offering 110 tools for AI-powered binary analysis, including normalized function hashing that enables cross-version documentation transfer (more: https://www.reddit.com/r/LocalLLaMA/comments/1qvgu2j/mcp_ghidra_for_aipowered_binary_analysis_110/).

The efficiency conversation is driving interest toward on-device deployment. Qwen3 Coder Next, an 80B MoE model with only ~3B active parameters, reportedly achieves 291 tokens/second at FP8 on a "nano PC" while competing on SWE-Bench Pro with models 10x larger. Trained on 800K+ verified agent trajectories, it's explicitly built for terminals, browsers, tools, and agent loops—described as having "zero chat vibes, all execution" (more: https://www.linkedin.com/posts/ownyourai_i-just-woke-up-to-qwen3-coder-next-80b-activity-7424703876240695297-Nlqf).

At the smaller end, Liquid AI's LFM2.5-1.2B-Thinking targets genuine on-device deployment with a hybrid architecture achieving 239 tokens/second decode on AMD CPU and 82 tokens/second on mobile NPU while running under 1GB of memory. Extended pretraining from 10T to 28T tokens and large-scale multi-stage reinforcement learning produce a model the company recommends for agentic tasks, data extraction, and RAG—though explicitly not for knowledge-intensive tasks or programming. Day-one support for llama.cpp, MLX, and vLLM suggests Liquid understands the deployment landscape (more: https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking).

Running Ollama on Android phones has moved from theoretical possibility to documented practice, with one developer detailing builds on Samsung S20 and Pixel 8 Pro devices—both packing 12GB RAM and multi-core SoCs. The process requires Termux from F-Droid (the Play Store version won't work due to SDK targeting constraints for Android 10+ compatibility), followed by building Ollama from source with Go. The smollm2 model runs on even 4GB devices, though the limitation is stark: pure CPU inference only, with Vulkan acceleration remaining elusive despite hardware support (more: https://www.reddit.com/r/ollama/comments/1qrkbsr/run_ollama_on_your_android/).

The practical use cases extend beyond novelty. Integrating Ollama's loopback API with Tasker enables automated WhatsApp replies: Tasker intercepts the notification, sends the chat to Ollama for response generation, then answers through the notification's reply function. Tool calling can run Termux commands bound to Android intents, effectively making Tasker an MCP for mobile inference. Performance varies: the Pixel 8 Pro achieves 12.3 tokens/second while the S20 manages 7.2, with 4B models running "much slower."
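
The Ollama half of that pipeline reduces to a single HTTP call against the loopback API. A minimal sketch, with an illustrative model choice; Tasker would trigger something like this via Termux when a WhatsApp notification arrives.

```python
# Generate a reply for an intercepted chat message via Ollama's local API.
import requests

def generate_reply(sender: str, message: str) -> str:
    resp = requests.post(
        "http://127.0.0.1:11434/api/chat",
        json={
            "model": "smollm2",  # small enough for phone RAM; pick per device
            "messages": [
                {"role": "system",
                 "content": f"Write a short, friendly WhatsApp reply to {sender}."},
                {"role": "user", "content": message},
            ],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]
```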

Enterprise deployment questions are emerging around Claude Code's local LLM integration. The official Ollama blog documents the integration (more: https://ollama.com/blog/claude), but whether using Claude Code with local models for corporate work requires an Enterprise license remains unclear. One user has reached out to Anthropic for clarification—a question increasingly relevant as companies explore on-premises AI to address data sovereignty and compliance requirements (more: https://www.reddit.com/r/ClaudeAI/comments/1qvlaz6/is_using_the_officially_supported_local_llm/).

Google Research has published the first quantitative scaling principles for AI agent systems, and the findings challenge industry assumptions. Through controlled evaluation of 180 agent configurations across four diverse benchmarks, researchers discovered that the "more agents is better" heuristic often hits ceilings and can actively degrade performance when misaligned with task properties. The core insight: multi-agent coordination dramatically improves performance on parallelizable tasks but hurts sequential ones (more: https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/).

The research defines three properties that make tasks "agentic": interaction with external environments, decision-making under partial observability, and adaptation based on environmental feedback. Five canonical architectures were tested—single-agent systems plus four multi-agent variants (independent, centralized, decentralized, and hybrid). The practical payoff is a predictive model identifying optimal architecture for 87% of unseen tasks, replacing heuristics with principled selection.

A separate research effort tackles reasoning plateaus through curriculum learning. The SOAR framework investigates whether pretrained LLMs can leverage latent knowledge to generate automated curricula for problems they cannot solve. Using meta-RL, a teacher copy proposes synthetic problems for a student copy and receives rewards based on measured student progress rather than intrinsic proxy rewards. Testing on the hardest mathematical benchmark subsets (0/128 initial success rate) revealed that structural quality and well-posedness matter more than solution correctness for learning progress—suggesting the ability to generate useful stepping stones doesn't require actually solving hard problems (more: https://arxiv.org/abs/2601.18778).
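
In schematic form, the reward structure looks like the loop below; this is a reading of the paper's description with placeholder callables for the proposing, training, and evaluation machinery, not the authors' code.

```python
# Schematic SOAR-style step: the teacher is rewarded for measured student
# progress on held-out hard problems, not for its problems' correctness.
from typing import Callable, List

def soar_step(
    propose: Callable[[int], List[str]],         # teacher generates synthetic problems
    train_student: Callable[[List[str]], None],  # student trains on them
    evaluate: Callable[[], float],               # student pass rate on the hard subsets
    update_teacher: Callable[[float], None],     # meta-RL update for the teacher
) -> float:
    problems = propose(32)
    before = evaluate()
    train_student(problems)
    after = evaluate()
    reward = after - before    # learning progress is the teacher's reward signal
    update_teacher(reward)
    return reward
```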

Multimodal model editing research is reformulating knowledge correction as an out-of-distribution generalization problem. The challenge: existing editing methods use rigid parameter-to-output mappings that cause "causal-underfit" (failing to disentangle coherent causal structures across modalities) and "causal-overfit" (memorizing brittle linkages between local features and outputs). The proposed solution identifies invariant causal trajectories that distinguish semantic shifts (generalization targets) from factual shifts (out-of-distribution regions) (more: https://arxiv.org/abs/2601.19700v1).

An independent researcher working on "potato" hardware (a laptop that freezes on anything beyond sub-5B models) has developed a semantic LLM interpreter that redefines temperature to apply around "median" tokens rather than modal tokens. The approach identifies where median intent applies, avoiding the hallucinations that arise when a modal token carries less than 50% confidence and therefore does not represent the majority of possible outputs. Early testing shows outputs that often differ from standard greedy selection: more useful when the model is confident, and less prone to hallucination when it is uncertain. The open-source implementation still needs testing on larger models (more: https://www.reddit.com/r/LocalLLaMA/comments/1quk4ne/semantic_llm_interpreter_only_tested_on_a_potato/).
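
One plausible reading of the median idea, offered purely as illustration since the post's exact mechanics may differ: sort the next-token distribution by descending probability and pick the token where cumulative mass crosses 50%, so selection tracks the majority of probability mass rather than a mode that holds less than half of it.

```python
# Select the "median" token: the first token (in descending-probability order)
# at which cumulative probability reaches 50%. When the modal token already
# holds a majority, this coincides with greedy selection.
import numpy as np

def median_token(probs: np.ndarray) -> int:
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    idx = int(np.searchsorted(cumulative, 0.5))  # first crossing of 50% mass
    return int(order[idx])
```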

A category-defining open specification for AI runtime security is approaching publication, addressing what its author identifies as a fundamental gap in existing security infrastructure. The core problem: AI agents have legitimate credentials to databases, email, and APIs, executing hundreds of actions per minute while processing content from sources that could contain malicious instructions. Current security tools weren't designed for this threat model (more: https://www.linkedin.com/posts/hermanerrico_aisecurity-agenticai-cybersecurity-activity-7424484799123247104-40_F).

The analysis of existing defenses is pointed. SIEM systems can block but only after pattern detection, seeing "API call" rather than "agent is about to email your customer list to an external address." API gateways check identity and rate limits, not whether specific actions with specific parameters make sense given what the agent just read. Firewalls are irrelevant when agents are already inside the perimeter with valid credentials. The specification addresses understanding what an action means before it executes—particularly challenging when malicious instructions might have been injected three tool calls ago.
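
As a purely hypothetical illustration of that gap, not the specification's API, a runtime gate would sit between the agent and its tools, judging each concrete action and its parameters against recently ingested content before anything executes:

```python
# Hypothetical runtime gate: refuse actions whose parameters look wrong given
# what the agent just read. Rule set and domain are invented for illustration.
def emails_external_address(params: dict) -> bool:
    to = params.get("to", "")
    return "@" in to and not to.endswith("@ourcorp.com")  # hypothetical own domain

RULES = {"send_email": emails_external_address}

def gate(action: str, params: dict, recent_context: str) -> bool:
    """Return True only if the concrete action may execute."""
    check = RULES.get(action)
    if check and check(params):
        return False   # e.g. agent about to email the customer list externally
    if "ignore previous instructions" in recent_context.lower():
        return False   # crude tripwire for instructions injected tool calls ago
    return True
```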

On the offensive security side, AIxVuln presents an automated vulnerability mining and verification system combining LLMs, function calling, and Docker sandboxes. The multi-agent architecture includes environment setup, code auditing, vulnerability verification, and report generation agents, with support for PHP, Java, Node.js, Python, and Go environments plus MySQL and Redis middleware. The system enables downstream agents to inherit upstream context—verifiers can access environment information from operations agents—and includes shared memory for cross-team awareness (more: https://github.com/m4xxxxx/AIxVuln).

NVIDIA's Nemotron-Personas-Brazil exemplifies the "sovereign AI" movement toward locally-grounded training data. The dataset comprises 6 million fully synthetic personas—approximately 1.4 billion tokens—statistically grounded in official census and labor data from IBGE (Brazil's statistical agency). Every persona aligns to real demographic, geographic, and occupational distributions while representing no actual person. Coverage spans all 26 Brazilian states plus the Federal District, with 1,500+ occupation categories reflecting Brazil's workforce (more: https://huggingface.co/blog/nvidia/nemotron-personas-brazil).

The technical approach uses NeMo Data Designer, NVIDIA's compound AI system for synthetic data generation, combining Nemotron-70B for statistical grounding with Llama-4-Maverick for narrative generation in Brazilian Portuguese. Each persona includes cultural background, skills, goals, hobbies, and interests. Built in collaboration with Elipse.AI (an NVIDIA Inception member with government and regulated-sector AI deployment experience across Latin America), the dataset extends NVIDIA's growing collection covering USA, Japan, India, and Singapore—all commercially usable under CC BY 4.0.
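
For readers who want to inspect the personas, the dataset should load with the standard `datasets` library; the Hub identifier below is an assumption, so check the blog post for the exact ID and schema.

```python
# Peek at one synthetic persona (dataset ID assumed; verify against the blog).
from datasets import load_dataset

personas = load_dataset("nvidia/Nemotron-Personas-Brazil", split="train")
print(personas[0])  # demographics, occupation, skills, goals, hobbies, interests
```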

For educational content creation, a Manim skill package provides best practices for creating 3Blue1Brown-style mathematical animations. Supporting both Manim Community Edition and ManimGL (Grant Sanderson's original version), the skills install via npx and work across AI tools including Claude, GitHub Copilot, and Cursor. The documentation carefully notes these are separate frameworks—code written for one won't work with the other without modifications (more: https://github.com/adithya-s-k/manim_skill).
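
A trivial scene makes the incompatibility concrete: the Community Edition file below imports from `manim` and uses `Create`, while ManimGL imports from `manimlib` and names the equivalent animation `ShowCreation`, so even this snippet needs edits to run there.

```python
# Manim Community Edition; render with: manim -pql scene.py CircleDemo
from manim import Scene, Circle, Create

class CircleDemo(Scene):
    def construct(self):
        self.play(Create(Circle()))  # ManimGL would use ShowCreation(Circle())
```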

The PDF format's JavaScript capabilities have been pushed to absurd extremes: Super Mario 64 now runs in a standalone 23.5 MB PDF built from the decompiled source code, achieving "a few FPS" with ASCII output. Unlike the DOSBox-based Doom PDF port, this runs natively in any viewer that supports JavaScript execution. The demonstration, while impractical, illustrates how far the "document format" has evolved beyond its PostScript-lite origins (more: https://hackaday.com/2026/02/02/running-doom-and-super-mario-64-inside-a-pdf-file/).

A combined skill for Claude Code merges development capabilities with quality engineering into a single workflow. The "vibe-cast" project combines Claude Flow V3 agents (architect, coder, reviewer, security-architect, deployer) with Agentic QE for integrated quality assurance during builds. The system implements pattern storage with confidence tiers (Bronze through Platinum), O(log n) vector search claiming 150x faster performance than linear approaches, and intelligent model routing based on complexity—Haiku for simple tasks (0-20), Sonnet for medium (20-70), Opus for complex (70-100) (more: https://github.com/mondweep/vibe-cast/tree/claude/claude-code-v3-skill-KucJF/claude-code-v3-qe-skill).
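
The routing logic itself reduces to a threshold table. A minimal sketch using the stated bands, with illustrative model identifiers and the complexity scorer left out:

```python
# Route a task to a model tier by complexity score (bands from the project docs).
def route_model(complexity: int) -> str:
    if complexity < 20:
        return "claude-haiku"    # simple tasks (0-20)
    if complexity < 70:
        return "claude-sonnet"   # medium tasks (20-70)
    return "claude-opus"         # complex tasks (70-100)
```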

The quality gates are aggressive: 85% minimum coverage, 95% for critical paths, network and resource validation, API schema checking for backward compatibility, and ML-powered test selection targeting F1 > 0.8. Domain-Driven Design principles are baked in—bounded contexts, aggregates, entities, value objects, and domain events—alongside standardized Architecture Decision Records. The TDD enforcement includes strict cycle management with dedicated agents and templates for unit, integration, and contract tests following Arrange-Act-Assert patterns.

For developers wanting immediate access without tooling setup, the skill can be invoked by copying a prompt directly into Claude Code: "Build with Quality skill. Project: MyApp | Stack: Next.js + TypeScript | Task: Build user dashboard. Methodology: DDD + ADR + TDD. Quality: 85% coverage, security scan, WCAG AA. Execute and deliver tested code." The five example projects (Todo, REST API, E-commerce, CLI, Chat) demonstrate the range of applicable use cases, though full multi-agent swarm capabilities require additional orchestration tool installation.

Sources (19 articles)

  1. [Editorial] https://www.linkedin.com/posts/patrickdebois_github-jedi4everaddt-run-ai-coding-agents-activity-7424653736788099072-7Aov (www.linkedin.com)
  2. [Editorial] https://www.linkedin.com/posts/hermanerrico_aisecurity-agenticai-cybersecurity-activity-7424484799123247104-40_F (www.linkedin.com)
  3. [Editorial] https://www.linkedin.com/posts/ownyourai_i-just-woke-up-to-qwen3-coder-next-80b-activity-7424703876240695297-Nlqf (www.linkedin.com)
  4. [Editorial] https://github.com/mondweep/vibe-cast/tree/claude/claude-code-v3-skill-KucJF/claude-code-v3-qe-skill (github.com)
  5. MCP + Ghidra for AI-powered binary analysis — 110 tools, cross-version function matching via normalized hashing (www.reddit.com)
  6. Arguably, the best AI code review MCP server (with Serena integration) (www.reddit.com)
  7. Semantic LLM Interpreter - only tested on a potato (www.reddit.com)
  8. EPYC 8124P (Siena) Build for Agentic Coding (www.reddit.com)
  9. Run Ollama on your Android! (www.reddit.com)
  10. Is using the officially supported local LLM integration in Claude Code for business/corporate use a violation of ToS? (www.reddit.com)
  11. m4xxxxx/AIxVuln (github.com)
  12. adithya-s-k/manim_skill (github.com)
  13. Towards a science of scaling agent systems: When and why agent systems work (research.google)
  14. The 80% Problem in Agentic Coding – Addy Osmani (addyo.substack.com)
  15. Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability (arxiv.org)
  16. LiquidAI/LFM2.5-1.2B-Thinking (huggingface.co)
  17. Running DOOM and Super Mario 64 Inside a PDF File (hackaday.com)
  18. Out-of-Distribution Generalization via Invariant Trajectories for Multimodal Large Language Model Editing (arxiv.org)
  19. Nemotron-Personas-Brazil: Co-Designed Data for Sovereign AI (huggingface.co)
