Agentic Development: From Copilot to Colleague
Today's AI news: Agentic Development: From Copilot to Colleague, The Evaluation Tax: When Benchmarks Break the Bank, The Inference Stack: Compilers, Caches, and Custom Silicon, Local Models Expand Their Range, Security Tooling and Trust Boundaries, Research Frontiers: When Neural Nets Meet Ciphers. 22 sources curated from across the web.
Agentic Development: From Copilot to Colleague
Kaspar von Grünberg's new framework for "agentic development platforms" proposes four levels defined by a single variable: how much of the work humans still initiate and approve. At Level 1, the developer remains the execution engine: agents suggest, humans approve everything, one PR at a time. At Level 2, agents become "participants in the value stream," executing work in parallel while humans verify aggregate outcomes instead of inspecting individual diffs. At Level 3, the platform runs continuously in the background, generating work from environmental signals (failing dependency checks, security advisories, operational anomalies), with human review becoming exception-based. Level 4 introduces partial self-adjustment: agents monitor telemetry and initiate work autonomously within predefined guardrails. The jump from Level 1 to Level 2, von Grünberg argues, is "the hardest transition in the model, and the one with the biggest payoff," because it demands a wholesale shift from gate-based validation (human walks a PR through CI) to loop-based validation (agent generates, platform checks, failures route back for retry). Organizations that skip the platform work "hit a wall fast: CI pipelines saturate, review queues explode, teams lose trust in the outputs." His parting shot at a Fortune 500 CIO who considered procuring Microsoft Copilot a strategic victory: "If your competitive advantage is the speed at which you are procuring a tool accessible to all, you are in big, big trouble." (more: https://open.substack.com/pub/kasparvongruenberg/p/four-levels-of-agentic-software-development?r=v5uaz)
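That gate-to-loop shift is easy to state and hard to build. As a rough illustration only (none of this is von Grünberg's code; the function names and retry policy are invented), a loop-based platform treats check failures as new input for the agent rather than as items in a human review queue:

```python
# Illustrative sketch of loop-based validation: the platform, not a human,
# walks each agent task through checks; failures route back for another attempt.
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    attempts: int = 0

def agent_generate(task: Task) -> str:
    """Stand-in for an agent producing a candidate change."""
    task.attempts += 1
    return f"patch for: {task.description} (attempt {task.attempts})"

def platform_checks(patch: str) -> list[str]:
    """Stand-in for CI, linters, tests; returns failure messages, empty if green."""
    return [] if "attempt 2" in patch else ["tests failed"]

def run_loop(task: Task, max_attempts: int = 3) -> str | None:
    while task.attempts < max_attempts:
        patch = agent_generate(task)
        failures = platform_checks(patch)
        if not failures:
            return patch                    # merged without a human gate
        # Loop-based validation: failures become new input, not a review item.
        task.description += f" | fix: {'; '.join(failures)}"
    return None                             # only now does a human see it

print(run_loop(Task("bump vulnerable dependency")))
```

The human appears only on loop exhaustion, which is what makes review exception-based by Level 3.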
Daniel Miessler's complementary editorial takes a harder line on why most enterprises are stuck at Level 0. The problem is not that companies are not using AI; it is that they cannot describe what they want. "A massive percentage of companies are haphazardly successful despite themselves," he writes. Companies that know their goals, metrics, challenges, strategies, and costs are thriving with AI. Everyone else gets "more backflips and charts and stuff." The real competitive danger is that AI now makes it possible for a small, well-articulated company to function with the strength of a much larger one, and small companies can answer the "what are you trying to do?" question in minutes rather than months. (more: https://danielmiessler.com/blog/most-companies-arent-ready-for-ai)
Practical tooling is moving fast for the teams that are ready. DeepClaude is an open-source proxy that swaps Claude Code's backend from Anthropic to DeepSeek V4 Pro (or OpenRouter, or Fireworks AI) while preserving the full tool loop: file editing, bash execution, subagent spawning, multi-step autonomous coding. The cost proposition is stark: DeepSeek V4 Pro at $0.87/M output tokens versus Anthropic's $15/M, with automatic context caching dropping repeated turns to $0.004/M. A localhost proxy on port 3200 enables live switching mid-session via slash commands, so developers can route routine work to DeepSeek and flip to Opus for the 20% that demands complex reasoning. The project claims DeepSeek V4 Pro scores 96.4% on LiveCodeBench, making it competitive for the bulk of agent interactions. (more: https://github.com/aattaran/deepclaude)
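The mid-session switching idea reduces to a small amount of mutable state. A minimal sketch that assumes nothing about DeepClaude's internals (the command syntax, backend table, and dispatch function are invented for illustration; the prices are the article's):

```python
# Hypothetical routing table: a slash command flips which backend serves each turn.
BACKENDS = {
    "deepseek": {"base_url": "https://api.deepseek.com", "usd_per_m_out": 0.87},
    "opus":     {"base_url": "https://api.anthropic.com", "usd_per_m_out": 15.00},
}
state = {"route": "deepseek"}

def handle_slash_command(cmd: str) -> None:
    # e.g. "/route opus" flips all subsequent requests mid-session.
    if cmd.startswith("/route "):
        target = cmd.split(maxsplit=1)[1]
        if target in BACKENDS:
            state["route"] = target

def dispatch(prompt: str) -> str:
    # A real proxy would forward the full tool-use payload to base_url;
    # this stub only reports which backend would serve (and bill) the turn.
    b = BACKENDS[state["route"]]
    return f"{prompt!r} -> {state['route']} at ${b['usd_per_m_out']}/M output tokens"

print(dispatch("refactor the parser"))      # routine work stays on DeepSeek
handle_slash_command("/route opus")
print(dispatch("design the cache layer"))   # complex reasoning flips to Opus
```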
Meanwhile, a pair of developers demonstrated what Level 2 multi-agent collaboration looks like in practice: two local Claude Code terminal sessions, each with its own project context, invited into a shared P2P encrypted chat room. Both humans and both agents participated in feature planning, with the agents hashing out backend/frontend contracts while the developers supervised. Community reaction was mixed: some called it "the most interesting version of multi-agent coding," where agents join the planning conversation rather than replacing it, while others warned it is a "proto-agentic approach" that transposes human cooperation patterns rather than designing native agentic workflows. The failure mode to watch: "consensus without accountability," where two models make each other sound more certain while drifting from the original goal. (more: https://www.reddit.com/r/ClaudeAI/comments/1t3aiqa/my_coworker_and_i_planning_a_feature_with_our_two/)
On the memory front, paradigm-memory ships a local-first cognitive map for coding agents: a SQLite-backed MCP server that replaces bloated MEMORY.md context dumps with a structured, searchable node graph featuring importance, freshness, confidence, and activation scores. When an agent calls memory_search, it gets a token-budgeted context pack with the relevant subtree, not fifty random vector-store chunks. It works across Claude Code, Codex, Cursor, Gemini CLI, and others, with zero cloud, zero telemetry, and a full audit log. (more: https://www.reddit.com/r/OpenAI/comments/1t1pfqo/stop_bloating_your_agent_context_with_memorymd_i/)
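The core mechanic, scored nodes packed under a token budget, is simple to sketch. The schema, score weights, and greedy packing below are our assumptions for illustration, not paradigm-memory's actual implementation:

```python
# Toy version of a scored memory graph: rank nodes by a blend of the four
# scores the project names, then greedily fill a token budget.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE nodes (
    id INTEGER PRIMARY KEY, text TEXT, tokens INTEGER,
    importance REAL, freshness REAL, confidence REAL, activation REAL)""")
db.executemany("INSERT INTO nodes VALUES (?,?,?,?,?,?,?)", [
    (1, "API returns 429 above 10 rps",      12, 0.9, 0.8, 0.9, 0.7),
    (2, "frontend uses camelCase JSON keys", 10, 0.6, 0.9, 0.8, 0.9),
    (3, "old auth flow, deprecated",         11, 0.3, 0.1, 0.9, 0.1),
])

def memory_search(budget_tokens: int) -> list[str]:
    rows = db.execute("""
        SELECT text, tokens,
               0.4*importance + 0.2*freshness + 0.2*confidence + 0.2*activation AS score
        FROM nodes ORDER BY score DESC""").fetchall()
    pack, used = [], 0
    for text, tokens, _score in rows:      # greedy fill: best-scored nodes first
        if used + tokens <= budget_tokens:
            pack.append(text); used += tokens
    return pack

print(memory_search(budget_tokens=25))     # the two live nodes, not the stale one
```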
The Evaluation Tax: When Benchmarks Break the Bank
The Holistic Agent Leaderboard (HAL) recently spent approximately $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model costs $2,829. PaperBench, which requires agents to replicate ICML papers from scratch, runs about $9,500 per evaluation including automated grading. Three-seed comparisons of six models β the minimum study worth publishing β push above $150,000. The EvalEval Coalition's comprehensive analysis of these numbers crystallizes a problem that has been building for years: AI evaluation has crossed a cost threshold that changes who can do it. (more: https://huggingface.co/blog/evaleval/eval-costs-bottleneck)
The compression techniques that worked for static benchmarks are failing for agents. Flash-HELM, tinyBenchmarks, and Item Response Theory achieved 100x to 200x reductions on traditional evaluations while preserving model rankings. Agent benchmarks compress only 2x to 3.5x: each item is a multi-turn rollout with its own variance, and the irreducible long trajectory is the expensive object. Training-in-the-loop benchmarks like The Well (960 H100-hours per architecture) resist compression entirely. Worse, higher spend does not reliably buy better results: on one benchmark, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy while SeeAct with GPT-5 Medium hit 42% for $171, "a 9x difference in cost despite just a two-percentage-point difference in accuracy." When you add reliability through repeated runs, the multiplier is brutal: k=8 reruns take HAL's $40,000 to roughly $320,000, and performance can drop from 60% on a single run to 25% under 8-run consistency testing. HAL has paused new model evaluations because the field's headline numbers still carry too much noise. The practical consequence is structural: whoever can afford the evaluation gets to write the leaderboard. Academic groups, AI Safety Institutes, and journalists now hit budget constraints before technical ones.
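The multipliers compound in a way that is easier to see as arithmetic than prose (figures from the article; the only assumption is that cost scales linearly with reruns):

```python
# Rerun cost multiplier and compression gap, straight from the article's numbers.
single_suite = 40_000            # HAL: 21,730 rollouts across 9 models x 9 benchmarks
k = 8                            # reruns needed for consistency testing
print(f"k={k} reruns: ${single_suite * k:,}")   # -> $320,000

static_reduction = (100, 200)    # Flash-HELM, tinyBenchmarks, IRT on static evals
agent_reduction = (2, 3.5)       # agent rollouts: long trajectories, per-item variance
print(f"compression: static {static_reduction[0]}-{static_reduction[1]}x "
      f"vs agent {agent_reduction[0]}-{agent_reduction[1]}x")
```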
A separate but revealing data point on evaluation fragility: a schema-driven function-calling harness pushed chain-of-thought compliance from 9.91% to 100% on domains like investment memos and legal opinions, not by improving the model but by forcing its reasoning through typed fields where every required entry must be filled or the submission is rejected. The schema itself gets backtested against historical cases, like a trader backtesting a strategy. Qwen3.6-27b keeps pace with frontier models under this framework. The gap between what a model can do and what a benchmark says it can do remains, once again, largely an artifact of the evaluation harness. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t1xgga/qwen_meetup_draft_review_required_function/)
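The enforcement mechanism is the interesting part, and it fits in a dozen lines. A minimal sketch (the field names here are invented; the post's actual schemas are domain-specific and backtested):

```python
# Schema-enforced chain of thought: every required reasoning field must be
# present and non-empty, or the submission is rejected and the model must refill.
REQUIRED_FIELDS = ["thesis", "key_risks", "counterargument", "recommendation"]

def validate_submission(sub: dict) -> tuple[bool, list[str]]:
    missing = [f for f in REQUIRED_FIELDS if not str(sub.get(f, "")).strip()]
    return (len(missing) == 0, missing)

ok, missing = validate_submission({
    "thesis": "margins expand as contracts reprice",
    "key_risks": "customer concentration",
    "recommendation": "hold",
})
print(ok, missing)   # False ['counterargument'] -> rejected, not silently accepted
```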
The Inference Stack: Compilers, Caches, and Custom Silicon
The modern ML compiler stack is, by any honest assessment, brutal to read. TVM alone is 500,000+ lines of C++. PyTorch layers Dynamo, Inductor, and Triton on top of each other. A developer decided to bypass the complexity entirely and build a compiler from scratch, with a pure Python frontend and raw CUDA output, capable of compiling Qwen2.5-7B into a sequence of fused kernels. The result, called deplodock, implements a six-stage pipeline: Torch IR (captured FX graph) → Tensor IR (elementwise/reduction/index decomposition) → Loop IR (fused loop nests) → Tile IR (GPU scheduling) → Kernel IR (hardware primitives) → CUDA source ready for nvcc. Each stage can be inspected independently without a GPU. Final performance lands at 50-90% of the production stack, which is not competitive but was never the point: the point is a hackable compiler that does not require a PhD to modify. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sz9r0u/writing_an_llm_compiler_from_scratch_pytorch_to/)
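To make the final stage concrete, here is a toy version of the last lowering step: emitting CUDA source for one fused elementwise kernel. This is our illustration of the idea, not deplodock's code; its IRs carry far more scheduling information:

```python
# Toy "Kernel IR -> CUDA source" emitter: a fused elementwise expression
# becomes a single __global__ kernel ready for nvcc.
def emit_fused_kernel(name: str, expr: str) -> str:
    return f"""
__global__ void {name}(const float* x, float* y, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = {expr};   // fused body: one global-memory round-trip
}}
""".strip()

# y = relu(2*x + 1) fused into a single kernel instead of three separate ops.
print(emit_fused_kernel("fused_affine_relu", "fmaxf(2.0f * x[i] + 1.0f, 0.0f)"))
```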
At the opposite end of the scale spectrum, an FPGA implementation of Karpathy's MicroGPT (all 4,192 parameters) achieves 50,000 tokens per second by storing weights in onboard block RAM and eliminating external memory latency entirely. It is a research exercise, not a product path: on-chip RAM is scarce, and even the largest FPGAs offer only a few tens of megabytes, which caps onboard-weight models at around 20-30 million parameters with 16-bit weights. The real question the project raises is whether FPGA coprocessors (or their cousins, like the SmartSSDs being explored by the HILOS project for offloading KV cache to FPGA-attached flash) could complement GPUs for memory-bound inference rather than replace them. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t28bfj/karpathys_microgpt_running_at_50000_tps_on_an_fpga/)
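The ceiling is pure arithmetic on on-chip capacity (the capacity figures below are ballpark assumptions for illustration, not measurements from the project):

```python
# Onboard-weight ceiling: parameters that fit entirely in on-chip RAM.
def max_params(onchip_bytes: float, bits_per_weight: int = 16) -> float:
    return onchip_bytes / (bits_per_weight / 8)

for label, mb in [("mid-range FPGA", 4), ("largest FPGAs", 50)]:
    print(f"{label}: {max_params(mb * 2**20) / 1e6:.1f}M params at 16-bit")
# largest FPGAs: ~26M params, consistent with the 20-30M ceiling cited above.
```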
The AMD inference story has long been "almost there," but Hipfire is pushing toward systematic validation. The developer has assembled a test lab spanning every dp4a/WMMA capability tier AMD has shipped: RX 5700 XT and the BC250's Cyan Skillfish (no dp4a), 6950 XT (dp4a), 7900 XTX (WMMA), Strix Halo (iGPU + WMMA), and RDNA 4 cards including the R9700 and 9070 XT. Early community benchmarks on the R9700 show 1.5-2x higher tokens per second and 10x prefill improvement over stock ROCm. The ability to validate pull requests against every RDNA target changes the calculus for contributors: AMD's persistent problem has been fragmentation across architectures, and one project spanning them all is new ground. (more: https://www.reddit.com/r/LocalLLaMA/comments/1syp3un/hipfire_dev_update_full_amd_arch_validation/)
Further down the stack, a detailed KV cache benchmark on Qwen 3.6-35B-A3B runs f16, q8_0, turbo3, and turbo4 from 0 to 1M context on an M5 Max with 128 GB unified memory. The results reveal workload-dependent crossovers that most benchmarks miss by stopping at short contexts. At depth 0, f16 wins by a hair. At 128K, turbo3 catches q8_0 on prefill (253 vs. 245 tok/s) as the smaller cache reduces bandwidth pressure. At 512K, turbo4 beats turbo3 on decode by 20% while turbo3 wins prefill. The practical takeaway: turbo4 for coding agents (heavy decode), turbo3 for RAG (heavy prefill), and turbo3 is the only format that reaches 1M context, at 6.5 tok/s: not interactive, but workable for overnight batch jobs on 89 GB of memory. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/qwen_3635ba3b_kv_cache_bench_f16_vs_q8_0_vs/)
Local Models Expand Their Range
Mistral's Medium 3.5, a 128B-parameter model with vision encoding and 256K context, has been converted to MLX 4-bit quantization and now fits in approximately 70 GB, runnable on a 96 GB M2 Max at roughly 5 tokens per second. The conversion required patching a bug in mlx-vlm's sanitize function that was silently skipping 438 vision tower and projector parameters. Thinking mode works via reasoning_effort="high", tool calling is functional, and the full BF16 vision encoder is included unquantized. The repetition bug flagged in earlier versions has been fixed by Mistral AI. (more: https://www.reddit.com/r/LocalLLaMA/comments/1t09anw/mistral_medium_35_128b_mlx_4bit_70_gb/)
The local creative frontier continues to widen. Chirp ships as a native desktop application for offline text-to-speech, written in C++ and Rust, supporting both Kokoro and Qwen3-TTS engines. It offers voice cloning from reference WAV files across 12 languages, GPU acceleration for NVIDIA, AMD, and Intel, a CLI for batch generation, and a local HTTP API with Swagger docs: essentially a self-hosted ElevenLabs, fully offline and open source, with agent-ready skill instructions for coding workflow integration. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sz8kho/introducing_chirp/)
At the tiny end of the spectrum, TinyMozart v2 is an 85M-parameter model for unconditional MIDI piano music generation, now with chords and variable note lengths (more: https://www.reddit.com/r/LocalLLaMA/comments/1t3fjbw/release_tinymozart_v2_85m/). SuperGemma4-26B has appeared as an uncensored MLX 4-bit variant on Hugging Face (more: https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-mlx-4bit-v2). An Open WebUI skill trained on official docs and ten tool examples lets Qwen3.6 or Gemma4 auto-generate new Open WebUI tools on demand β a meta-tool for building tools (more: https://www.reddit.com/r/OpenWebUI/comments/1t2rovl/made_a_skill_for_creating_open_webui_tools_try_it/).
Quant-whisper takes local inference into a different domain entirely: a terminal-native algorithmic trading engine built in Go that uses Ollama (defaulting to qwen3:0.8b) for BUY/SELL/HOLD decisions via a strict JSON protocol. It includes broker adapter normalization for Zerodha, Dhan, and Interactive Brokers, a paper trading simulator with SQLite persistence, a Bubble Tea TUI dashboard, and execution safeguards: confidence threshold gates, daily drawdown kill-switches, and position size caps. Cloud fallback to OpenAI, Anthropic, or DeepSeek is opt-in only. The tagline, "A hedge fund on your local machine. BYOB," captures where local LLM tooling is heading: not just chatbots and code, but decision engines with real-world actuators. (more: https://github.com/Ritiksuman07/quant-whisper)
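The safeguards layer is the part worth internalizing for any LLM-with-actuators design. A sketch of the decision gate (quant-whisper itself is Go and its schema is its own; the field names and thresholds here are invented to illustrate the strict-JSON pattern):

```python
# Strict-JSON decision gate: malformed or low-confidence output defaults to HOLD,
# and a daily drawdown kill-switch overrides the model entirely.
import json

CONF_THRESHOLD = 0.7
MAX_DAILY_DRAWDOWN = -0.02       # kill-switch at -2% on the day

def decide(raw: str, daily_pnl_pct: float) -> str:
    if daily_pnl_pct <= MAX_DAILY_DRAWDOWN:
        return "HOLD (kill-switch)"
    try:
        d = json.loads(raw)
        action, conf = d["action"], float(d["confidence"])
    except (json.JSONDecodeError, KeyError, ValueError):
        return "HOLD (malformed decision)"   # strict protocol: reject, don't guess
    if action not in {"BUY", "SELL", "HOLD"} or conf < CONF_THRESHOLD:
        return "HOLD (below confidence gate)"
    return action

print(decide('{"action": "BUY", "confidence": 0.82}', daily_pnl_pct=-0.005))  # BUY
print(decide('{"action": "BUY", "confidence": 0.55}', daily_pnl_pct=-0.005))  # HOLD
```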
Security Tooling and Trust Boundaries
A new project demonstrates that GitHub itself can serve as a covert TCP tunnel. vpn-over-github implements a SOCKS5 proxy that ships packets through a private repository via the REST Contents API, Git smart HTTP transport, or private Gists. The recommended "contents" transport achieves roughly 800ms round-trip latency: slow, but usable for SSH, light browsing, and text chat. The project supports XOR (default, not cryptographically secure) or AES-256-GCM encryption, and honestly cops to likely violating GitHub's Terms of Service. The real insight is not that someone built it but that it works at all: GitHub's API rate limits (5,000 REST calls per hour per token) provide enough bandwidth for a functional tunnel, and the traffic looks like ordinary repository activity. For defenders, this joins the growing catalog of platform-abuse covert channels that traditional protocol-signature firewalls will miss. (more: https://github.com/sartoopjj/vpn-over-github)
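To see why the contents transport works at all, note that one outbound "packet" is just a documented file-write call. A sketch under our own assumptions (the project's framing, repo paths, and key handling are its own; the endpoint below is GitHub's standard create-file REST API, and TOKEN/REPO are placeholders):

```python
# One "packet send" over the GitHub Contents API: XOR-obfuscate, base64, PUT.
import base64, itertools, requests

TOKEN, REPO = "ghp_...", "user/tunnel-repo"   # placeholders, not real values

def xor(data: bytes, key: bytes) -> bytes:
    # XOR obfuscation only, as the project itself warns: not cryptographically secure.
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(key)))

def send_packet(seq: int, payload: bytes) -> None:
    body = base64.b64encode(xor(payload, b"secret")).decode()
    r = requests.put(
        f"https://api.github.com/repos/{REPO}/contents/packets/{seq}",
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Accept": "application/vnd.github+json"},
        json={"message": f"pkt {seq}", "content": body},
    )
    r.raise_for_status()   # each packet spends one of the 5,000 hourly REST calls
```

To a network monitor, each send is an ordinary authenticated HTTPS request to api.github.com, which is exactly why protocol-signature firewalls miss it.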
On the reverse engineering front, VMPStatic is a static unpacker for VMProtect-protected PE files spanning versions 1.x through 3.x. Unlike dynamic unpackers that execute the target binary, VMPStatic reconstructs decompressed PE images without running them, recovering strings, resources, and embedded PE files for analysis in IDA, Ghidra, or x64dbg. The limitations are honest: no IAT fixer, no automatic OEP recovery for all samples, no de-virtualization. But static unpacking that works across three major versions of one of the most widely deployed commercial packers is a meaningful addition to the binary analysis toolkit. (more: https://github.com/notsnakesilent/VMPStatic)
Trust abuse takes a different form with a fake "Notepad++ for Mac" website that uses the Notepad++ trademark, name, and the original developer's biography to present an unauthorized product as official. Notepad++ has never released a macOS version. The site has already fooled users and tech media into believing it is a legitimate release. Developer Don Ho has contacted the site owner and is awaiting a response, asking the community to reply to "Notepad++ is finally on Mac!" posts with corrections linking to the official announcement. Trademark impersonation does not require technical sophistication β just a convincing domain and enough SEO to outrank the real project. (more: https://notepad-plus-plus.org/news/npp-trademark-infringement/)
Research Frontiers: When Neural Nets Meet Ciphers
A new essay poses a surprisingly productive question: why do neural networks and symmetric cryptographic ciphers converge on the same architectural patterns? The answer is not shallow copying; the research histories show minimal cross-pollination. Instead, both fields share three unusual properties. First, correctness requirements are remarkably weak: cryptography just needs invertibility, neural networks just need differentiability, and both compose trivially from simpler blocks. Second, both reward designs where every part of the state interacts with every other part many times over; complexity and mixing are the quality measures. Third, both face extreme performance pressure that favors simple, parallel-friendly primitives over bespoke structures. (more: https://reiner.org/neural-net-ciphers)
The convergence appears at every abstraction level. Sequential processing (RNNs absorbing tokens into recurrent state) mirrors the sponge construction in hash functions. The parallel alternative (process all chunks simultaneously, combine with addition plus positional encoding) drives both transformers and the fastest message authentication codes. Inside the core function, both fields alternate linear transforms (for mixing) and nonlinear transforms (for complexity), repeated identically. Even the factored approach of mixing rows then columns separately (attention across sequence positions, feed-forward within each; ShiftRows across columns, MixColumns within) appears independently in both fields because it is asymptotically faster and exposes more parallelism. The essay's punchline: convergent evolution in algorithm design, like convergent evolution in biology, reflects the deep structure of the problem space rather than the preferences of its practitioners.
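The shared skeleton, alternating a linear mix with a cheap elementwise nonlinearity over the whole state, fits in a few lines. This is a caricature of both fields, not a real cipher round or a real transformer block:

```python
# Identical rounds, each a linear mix followed by an elementwise nonlinearity.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) / 8**0.5   # fixed linear layer: the "mixing" step,
                                           # standing in for MixColumns or attention
def round_fn(state: np.ndarray) -> np.ndarray:
    mixed = W @ state                      # linear transform spreads every element's
                                           # influence across the whole state
    return np.maximum(mixed, 0.0)          # cheap nonlinearity: an S-box where a
                                           # neural net would put its ReLU

state = rng.standard_normal(8)
for _ in range(10):                        # rounds compose trivially, exactly the
    state = round_fn(state)                # "repeat identically" both fields rely on
print(state)
```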
A related Welch Labs video (178,000 views in three days) traces Yann LeCun's escalating bet against generative AI through the intellectual arc from his "intelligence is a cake" analogy, through the representation-collapse problem in Siamese networks, the Barlow Twins breakthrough for self-supervised learning, DINO, and finally JEPA (Joint Embedding Predictive Architecture), which learns abstract representations rather than generating pixel-level predictions. Whether JEPA delivers on its promise or remains a principled alternative to a paradigm that keeps winning empirically is the open question, but Meta's investment, now counted in billions, means the bet is no longer rhetorical. (more: https://www.youtube.com/watch?v=kYkIdXwW2AE)
Sources (22 articles)
- [Editorial] Four Levels of Agentic Software Development (open.substack.com)
- [Editorial] Most Companies Aren't Ready for AI (danielmiessler.com)
- DeepClaude β Claude Code Agent Loop with DeepSeek V4 Pro (github.com)
- Two Claude Code Agents Collaborating in a Shared Chat Room (reddit.com)
- paradigm-memory: Local Cognitive Memory MCP for AI Coding Agents (reddit.com)
- AI Evals Are Becoming the New Compute Bottleneck (huggingface.co)
- Function Calling Harness 2: Schema-Driven CoT Compliance from 9.91% to 100% (reddit.com)
- Writing an LLM Compiler from Scratch: PyTorch to CUDA (reddit.com)
- Karpathy's MicroGPT Running at 50,000 Tokens/Second on an FPGA (reddit.com)
- Hipfire: Full AMD Architecture Validation Across RDNA 1–4, Strix Halo, and BC250 (reddit.com)
- Qwen 3.6-35B KV Cache Benchmark: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M Context (reddit.com)
- Mistral Medium 3.5 128B – MLX 4-bit Conversion with Vision and 256K Context (reddit.com)
- Chirp: Native Offline Text-to-Speech Desktop App (Kokoro + Qwen3-TTS) (reddit.com)
- TinyMozart v2 85M – Unconditional MIDI Piano Music Generation (reddit.com)
- SuperGemma4-26B Uncensored MLX 4-bit v2 (huggingface.co)
- Open WebUI Skill for Auto-Creating Tools with Qwen3.6/Gemma4 (reddit.com)
- quant-whisper: Terminal-Native Algo Trading Engine with Local LLM Inference (github.com)
- vpn-over-github: Tunnel TCP Connections Through GitHub (github.com)
- VMPStatic: Static VMProtect Unpacker for PE Files (1.x–3.x) (reddit.com)
- Trademark Violation: Fake Notepad++ for Mac (notepad-plus-plus.org)
- Why Are Neural Networks and Cryptographic Ciphers So Similar? (reiner.org)
- [Editorial] Welch Labs on Yann LeCun's Bet Against Generative AI (youtube.com)