Self-Improving AI & The Talent War

Today's AI news: Self-Improving AI & The Talent War, The Crowded Reasoning & Coding Race, Big Models on Small Hardware, Harness Engineering & Agent Tooling, Security: One Bad Character, Training Smarter: Sparse Deltas & Private Fine-Tuning, Beyond the Cloud: Mesh Networks, Neuromorphic Chips, and When AI Breaks. 22 sources curated from across the web.

Self-Improving AI & The Talent War

Andrej Karpathy's move from OpenAI co-founder to Anthropic hire is now official, and the stated mission lands squarely on the third rail of AI safety discourse: teaching Claude to improve itself without human supervision. The community reaction oscillates between awe and alarm. One commenter captured it well: "Recursive self-improvement was a pipedream only a year ago. Now it's becoming a reality... What does it mean when for-profit organizations not only have the keys to superior knowledge and understanding, but the resources for learning are also locked away?" Others are more skeptical, noting that LLMs are well past the point where any single hire moves the needle technically, and questioning whether Anthropic's marketing department is doing more heavy lifting than the research. The practical concern worth tracking is behavioral continuity — if a model improves itself without human review, every training run introduces regression risk that supervised checkpoints would normally catch. (more: https://www.reddit.com/r/Anthropic/comments/1tjjkkg/openai_cofounder_karpathy_joins_anthropic_to/)

The talent war takes a different shape in Beijing. China is reportedly clamping down on overseas travel for AI researchers at Alibaba and DeepSeek, treating AI expertise as a national security asset subject to movement restrictions. The comparison to US security clearance constraints is not unreasonable — most countries restrict travel for individuals in sensitive fields — but the timing is instructive. The Qwen team lost its technical architect under murky circumstances earlier this year after Alibaba dismantled his vertically integrated R&D group and replaced it with horizontal modules. Meanwhile, the Stanford HAI report documented an 89% collapse in China-to-US researcher migration since 2017. If the talent is already not leaving voluntarily, travel restrictions look less like security policy and more like signaling. The deeper risk is self-inflicted: making China less attractive to its own domestic AI talent who might reasonably wonder whether building cutting-edge systems also means surrendering freedom of movement. (more: https://www.reddit.com/r/LocalLLaMA/comments/1to5fj5/china_clamps_down_on_overseas_travel_for_ai/)

The Crowded Reasoning & Coding Race

The SWE-rebench leaderboard is back after a three-month hiatus with 110 fresh Python tasks drawn from real GitHub PRs in March through May. The setup follows the standard SWE-bench format — models read PR issues, edit code, run tests, pass the full suite — but the batch is deliberately larger to smooth out the noise that plagues monthly updates. GPT-5.5, Opus 4.7, Cursor Composer 2.5, and Kimi K2.6 are all on the board, with Gemini Flash 3.5, DeepSeek v4 Pro, and Qwen3.5-397B promised for the next wave. The community response is telling: the loudest requests are not for frontier models but for local ones. "We already know Gemini, ChatGPT, Claude, and DeepSeek are good," one commenter notes. "We want to know how good local models do because many of them seem to be benchmaxxed." Another requests a fixed tool-call budget or wall-clock column, arguing that "pass rate without cost-to-fix is incomplete because a 14B model that needs 4 retries is a totally different workflow than a 70B that lands pass@1." (more: https://www.reddit.com/r/LocalLLaMA/comments/1tpawlm/swerebench_leaderboard_march_april_and_may_2026/)
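The commenter's point about cost-to-fix can be made concrete with a small sketch. It reports first-attempt pass rate alongside the average number of attempts spent on tasks that were eventually solved. The log format, function names, and sample data are all made up for illustration and are not SWE-rebench's actual schema.

```python
def pass_at_1(logs):
    """Fraction of tasks solved on the first attempt."""
    return sum(attempts[0] for attempts in logs.values()) / len(logs)


def resolved_rate(logs):
    """Fraction of tasks solved within the attempt budget."""
    return sum(any(attempts) for attempts in logs.values()) / len(logs)


def mean_attempts_when_solved(logs):
    """Average number of attempts spent on tasks that were eventually solved."""
    counts = [a.index(True) + 1 for a in logs.values() if any(a)]
    return sum(counts) / len(counts) if counts else float("inf")


# Hypothetical logs: task id -> outcome of each attempt, in order.
# True means the full test suite passed on that attempt.
retry_heavy = {
    "t1": [False, False, False, True],
    "t2": [False, False, False, True],
    "t3": [False, False, False, True],
    "t4": [False, False, False, False],
}
first_shot = {
    "t1": [True],
    "t2": [True],
    "t3": [True],
    "t4": [False],
}

for name, logs in (("retry_heavy", retry_heavy), ("first_shot", first_shot)):
    print(name, pass_at_1(logs), resolved_rate(logs), mean_attempts_when_solved(logs))
```

Both hypothetical models resolve the same share of tasks, but one needs four attempts per fix and the other needs one. A bare pass-rate column hides that difference.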

The broader reasoning race reinforces the fatigue. Tencent's Hy3 Preview scores 87.8 on the China High School Biology Olympiad, topping Gemini and GPT — and nobody quite knows what to do with that information. The benchmarks-as-GPU-TFLOP-flexing analogy is apt: Hy3 can hit that level on math and code in real testing but, as one practitioner reports, "falls on its face in weird edge cases that never show up in charts." The consensus is hardening that you need your own benchmarks — evals on your domain data, with your edge cases, measuring what actually matters for your production workflow. "Benchmarks are effectively meaningless," another commenter argues. "They are a metric that has become the goal." (more: https://www.reddit.com/r/LocalLLaMA/comments/1tpu5d3/the_frontier_reasoning_race_is_starting_to_look/)

MiMo-V2.5-coder appeared this week billed as "an excellent alternative to Qwen3.6 and DS4, especially for coding." The community was unconvinced. Multiple commenters pointed out it is not a fine-tune but a regular quantization with a slightly customized bits-per-layer profile and a coding-skewed imatrix, making the "-coder" suffix misleading. No benchmarks were provided to support the claims. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tn3455/mimov25coder/)

Big Models on Small Hardware

Krasis, an LLM runtime purpose-built for models that exceed VRAM, hit its v1.0 release with results that stretch what a single consumer GPU can do. The headline: Qwen3.6-35B-A3B at Q4 quantization running at reading speed (12.5 tokens/second decode) on a laptop RTX 3070 Mobile with 8GB VRAM. On a 5090, the same model hits 124.9 tok/s decode and over 10,000 tok/s prefill. The v1.0 update is substantial — the hot path is now 100% Rust (Python's GIL was causing frequent slowdowns), Ampere-generation cards are fully supported, memory requirements dropped from 2x to 1x the quantized model size plus overhead, and the old AWQ attention was replaced with sensitivity-aware HQQ attention at 4, 6, or 8 bits. HQQ assets are built by mathematically assessing the model rather than requiring a pre-built BF16 template, and Krasis can mix precision — 90% HQQ4 + 10% HQQ6 for sensitive layers — keeping memory low while improving accuracy. A llama.cpp user pushed back, claiming 3x better speeds with a ByteShape CPU-5 quant on a 6GB 3060 laptop. The comparison is not quite apples-to-apples (different quantization, different offload strategy), but it keeps the competitive pressure on. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tpyqng/krasis_update_qwen3635ba3b_q4_at_reading_speed_1x/)
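As a rough illustration of why a 90/10 precision mix stays cheap, the arithmetic can be sketched as below. It assumes storage cost is simply bits per weight and ignores quantization metadata such as scales and zero points. The parameter count is a placeholder, not a figure from Krasis.

```python
def mixed_precision_footprint(n_params, mix):
    """Return (average bits per weight, size in GiB) for a precision mix.

    mix maps bit width -> fraction of weights stored at that width.
    Scale and zero-point overhead is deliberately ignored in this sketch.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    avg_bits = sum(bits * fraction for bits, fraction in mix.items())
    gib = n_params * avg_bits / 8 / 2**30
    return avg_bits, gib


# Hypothetical count of weights covered by the quantizer (assumption).
QUANTIZED_PARAMS = 2_000_000_000

mixed_bits, mixed_gib = mixed_precision_footprint(QUANTIZED_PARAMS, {4: 0.9, 6: 0.1})
six_bits, six_gib = mixed_precision_footprint(QUANTIZED_PARAMS, {6: 1.0})

print(f"90% 4-bit + 10% 6-bit: {mixed_bits:.2f} bits/weight, {mixed_gib:.2f} GiB")
print(f"uniform 6-bit:         {six_bits:.2f} bits/weight, {six_gib:.2f} GiB")
```

Under these assumptions the mix averages about 4.2 bits per weight, so the sensitive layers get extra precision for roughly 5% more memory than uniform 4-bit.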

At the other extreme of the size spectrum, a head-to-head benchmark pitted Needle (26M parameters, distilled from Gemini 3.1 for function calls) against Qwen3-0.6B on CPU-only tool routing: 50 queries across five difficulty tiers. Needle won with 72% tool-match accuracy versus 56%, at 4.4x lower latency. The interesting finding is not the headline but the failure shapes: Needle fails by picking the wrong tool but gets arguments right 97% of the time; Qwen3 fails by not calling a tool at all, answering in prose instead. At tier 3 (implicit queries like "should I bring an umbrella in Amsterdam?"), Needle holds at 80% while Qwen3 collapses to 10%. For agent-as-operator workloads, a model that occasionally routes wrong but always acts is debuggable; a model that goes quiet under ambiguity is a production halt. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tljs5o/benchmarked_needle_26m_vs_qwen306b_on_cpu/)
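The "failure shapes" idea is easy to reproduce in your own evals: rather than one accuracy number, bucket each result by how it failed. The sketch below is a minimal version. The bucket names, the case format, and the sample data are assumptions for illustration, not the benchmark's actual harness.

```python
from collections import Counter


def classify(expected_tool, expected_args, call):
    """Bucket one eval result.

    call is None when the model answered in prose, otherwise a
    (tool_name, args_dict) tuple parsed from the model output.
    """
    if call is None:
        return "no_call"
    tool, args = call
    if tool != expected_tool:
        return "wrong_tool"
    return "match" if args == expected_args else "wrong_args"


def failure_profile(cases):
    """Count outcomes over (expected_tool, expected_args, call) triples."""
    return Counter(classify(*case) for case in cases)


# Made-up results for a single implicit query about the weather.
cases = [
    ("get_weather", {"city": "Amsterdam"}, ("get_weather", {"city": "Amsterdam"})),
    ("get_weather", {"city": "Amsterdam"}, ("web_search", {"q": "umbrella"})),
    ("get_weather", {"city": "Amsterdam"}, None),
]

print(failure_profile(cases))
```

Keeping "wrong_tool" and "no_call" as separate buckets is what separates a model that routes badly from one that goes quiet.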

On the hardware side, a detailed comparison of the RTX 5090 versus RTX 6000 PRO MaxQ on diffusion workloads found the MaxQ matches the 5090 at 400W while consuming only 325W, roughly 81% of the power for identical performance. At full 600W, the 5090 leads by 25%, but the efficiency curve matters for anyone building dense multi-GPU rigs where thermal headroom is the real constraint. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tolp1m/small_comparison_on_full_compute_performance/) Chinese memory manufacturer CXMT entering the Corsair DDR5 supply chain could eventually help the memory pricing picture, though modules are currently China-only and the real pressure point (HBM wafer allocation cannibalizing DDR5 and GDDR production) remains unresolved. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tnz6xz/cxmt_started_selling_ram_to_corsair/) Separately, a community-driven TTS benchmark covering all major engines through May 2026 finally provides centralized speed comparisons across platforms, though commenters rightly note that real-time factor, memory usage, and quality tradeoffs matter more than cherry-picked demo clips. (more: https://www.reddit.com/r/LocalLLaMA/comments/1tm0k2l/tts_benchmark_comparison_all_known_tts_up_until/)

Harness Engineering & Agent Tooling

"Harness engineering" is solidifying as the term-of-art for the discipline of wrapping AI coding agents in rules, tools, hooks, memory, and orchestration scaffolding. Cole Medin's editorial lays out the two-level framing: the first level is a single agent session (your CLAUDE.md rules, MCP servers, hooks, skills — the "AI layer"), and the second is orchestrating multiple sessions into a workflow where one plans, one implements, one validates, and handoffs are automated. The key mindset shift: "When the agent does something dumb, you don't just blame the model — you improve the harness. It missed a convention, so that becomes a rule. It ran something destructive, so a hook now blocks it." A skeptic in the comments offers the necessary counterweight: "nobody could tell you what 'context engineering' means, and they couldn't tell you what 'prompt engineering' meant before that... we're still very much in the Land of Feels and Vibes." Fair point — but the practitioner who grew their rules file from 10 lines to 400 and cut recurring task errors by half represents real signal. (more: https://www.linkedin.com/posts/cole-medin-727752184_harness-engineering-for-ai-coding-is-turning-activity-7465552438746374144-ykIp)

The pacphi Gist takes this further with a concrete, end-to-end autonomous workflow. The setup: read your ADRs and domain models, seed a GitHub issue backlog with dependency graphs and success criteria, then launch an autonomous loop where a fresh agent session branches, implements via the SPARC orchestrator skill, creates a PR, monitors CI with automatic flake retries, runs a multi-stage automated review gate (code review, issue-criteria matching, quality scoring), and auto-merges when zero blocking findings remain. The human role reduces to reviewing PRs that get flagged. Whether this works reliably at scale is an open question, but the machinery is impressively complete. (more: https://gist.github.com/pacphi/74d3995f0e175e7ea4fcf9f51a88eb88)
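The "it ran something destructive, so a hook now blocks it" pattern from the harness discussion can be sketched as a pre-execution check. This is a generic illustration only: the rule list and the check_command function are hypothetical, and it does not reproduce the hook interface of Claude Code or any other specific agent.

```python
import re

# Hypothetical deny rules; a real harness would grow this list from incidents.
DENY_RULES = [
    (re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"),
     "recursive force delete"),
    (re.compile(r"\bgit\s+push\b.*--force"), "force push"),
    (re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE), "destructive SQL"),
]


def check_command(command):
    """Return (allowed, reason). Blocks commands matching any deny rule."""
    for pattern, reason in DENY_RULES:
        if pattern.search(command):
            return False, reason
    return True, ""


for cmd in ("ls -la", "rm -rf build/", "git push --force origin main"):
    allowed, reason = check_command(cmd)
    print(f"{cmd!r}: {'allowed' if allowed else 'blocked (' + reason + ')'}")
```

Pattern matching on shell strings is easy to evade, so a check like this is a guardrail against accidents rather than a security boundary.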

Three new tools flesh out the ecosystem. piia-engram addresses a genuine pain point: every time you switch between Claude Code, Cursor, and Codex, your preferences, code standards, and hard-won lessons reset to zero. It stores your developer identity as local JSON files under ~/.engram/ and exposes them to every MCP-compatible tool, so new sessions start with full context. It explicitly positions itself as storing "who you are as a person" rather than agent session history — a different layer than Mem0 or Zep. (more: https://github.com/Patdolitse/piia-engram) Tactile inverts the standard computer-use approach: instead of screenshot-guess-click loops, agents first inspect accessibility semantics (element roles, names, states, hierarchy) exposed by the OS, fall back to OCR-grounded coordinates, and only then resort to visual reasoning. The insight is that agent-ready software should also be human-accessible software, and building agents that depend on accessibility APIs creates economic pressure to fix long-standing gaps that affect screen reader users too. (more: https://github.com/yliust/Tactile) Hadrian's OpenHack brings agent orchestration to whitebox security review — a file-based state machine where recon discovers attack surfaces, a router generates scoped vulnerability scenarios, expert agents prove or reject each one, and an independent triage agent evaluates reportability. Human operators approve every phase transition. Twelve expert agent families cover the OWASP Top 10:2025. (more: https://github.com/hadriansecurity/OpenHack)
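The ordering Tactile describes (accessibility semantics first, OCR second, visual reasoning last) is a plain fallback chain. The sketch below shows that shape with stub locators. None of these function names come from Tactile, and the stubs stand in for real OS accessibility, OCR, and vision calls.

```python
from typing import Callable, Optional

Point = tuple[int, int]
Locator = Callable[[str], Optional[Point]]


def locate(target: str, strategies: list[tuple[str, Locator]]):
    """Try each locator in order and return (strategy_name, point) for the first hit."""
    for name, strategy in strategies:
        point = strategy(target)
        if point is not None:
            return name, point
    return None


# Stub locators (assumptions): each returns screen coordinates or None.
def via_accessibility(target: str) -> Optional[Point]:
    labelled = {"Save": (120, 40)}  # elements exposing a role and name
    return labelled.get(target)


def via_ocr(target: str) -> Optional[Point]:
    visible_text = {"Export as PDF": (300, 220)}  # text found on screen
    return visible_text.get(target)


def via_vision(target: str) -> Optional[Point]:
    return (10, 10) if target == "gear icon" else None  # last-resort guess


CHAIN = [("accessibility", via_accessibility), ("ocr", via_ocr), ("vision", via_vision)]

for target in ("Save", "Export as PDF", "gear icon", "missing"):
    print(target, "->", locate(target, CHAIN))
```

Logging which strategy produced each hit also shows how often an app forces the agent off the accessibility path.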

Security: One Bad Character

A security researcher poking at a fintech's mobile API discovered that adding a trailing slash to an authenticated endpoint flipped the response from 401 to 200 with full account data. The target ran AWS HTTP API (the newer, cheaper alternative to REST API) with a Lambda authorizer checking JWTs against Cognito. The bug: HTTP API's greedy path matching rewrote the trailing-slash path, stripping the slash before forwarding to the integration, but the auth check ran on the original path. The rewrite dropped the auth context. The researcher confirmed the bypass extended to wire transfer initiation — a $0.01 test transfer went through without a valid JWT. The fix was straightforward: switch from HTTP API to REST API (stricter path matching) and add userId validation in every Lambda, not just the authorizer. The bounty: $12,000. (more: https://theguptalog.blogspot.com/2026/04/i-bypassed-aws-api-gateway-auth-with.html)

This fits a pattern that keeps repeating: a trivial input normalization mismatch defeats sophisticated auth middleware. A malformed Host header bypassing Starlette/FastAPI auth. Two missing regex characters exposing AWS CodeBuild. An unsanitized semicolon enabling GitHub RCE. The lesson is structural: when the routing layer and the auth layer disagree on what constitutes a "match," the gap between them becomes an exploit. No amount of JWT complexity or Cognito configuration can compensate for path-parsing inconsistency.
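The structural lesson can be shown in a toy model where the auth check and the router disagree about what a path is. This is a deliberately simplified sketch, not a reproduction of AWS API Gateway's behavior. The route table, the canonical helper, and both handlers are hypothetical.

```python
PROTECTED = {"/accounts"}
ROUTES = {"/accounts": lambda: {"balance": 100}}


def canonical(path):
    """Strip trailing slashes so '/accounts/' and '/accounts' compare equal."""
    return path.rstrip("/") or "/"


def handle_vulnerable(path, has_valid_jwt):
    # The auth layer compares the RAW path...
    if path in PROTECTED and not has_valid_jwt:
        return 401
    # ...while the router matches the NORMALIZED path.
    return 200 if canonical(path) in ROUTES else 404


def handle_fixed(path, has_valid_jwt):
    # Normalize once, before any decision, so both layers see the same string.
    path = canonical(path)
    if path in PROTECTED and not has_valid_jwt:
        return 401
    return 200 if path in ROUTES else 404


print(handle_vulnerable("/accounts", has_valid_jwt=False))   # 401
print(handle_vulnerable("/accounts/", has_valid_jwt=False))  # 200: bypass
print(handle_fixed("/accounts/", has_valid_jwt=False))       # 401
```

The fix is not a smarter auth check but a single canonicalization step that runs before both layers.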

On the network defense side, Mullvad is rolling out mitigations against exit IP fingerprinting across 13 WireGuard servers spanning Australia, Canada, Germany, Finland, France, Ireland, Norway, Sweden, and the US. The mitigation addresses research showing that VPN exit IP behavior could be used to correlate traffic across sessions. It is an incremental improvement, but the arms race extends well beyond IP addresses — browser fingerprinting via GPU, fonts, canvas, WebGL, and behavioral biometrics has been demonstrated to uniquely identify machines even through VPN tunnels. (more: https://mullvad.net/en/help/exit-ip-vpn-servers-mitigation-rollout)

Training Smarter: Sparse Deltas & Private Fine-Tuning

Every async reinforcement learning library trips over the same bottleneck: shipping the entire model from trainer to inference engine after every optimizer step. For a 7B model in bf16, that is 14GB. For a frontier 1T model at the same precision, it is roughly two terabytes. Per step. Hugging Face's new delta weight sync in TRL exploits an elegant property of bf16 arithmetic: at typical RL learning rates (~3e-6), Adam's per-weight update magnitude is smaller than the spacing between adjacent bf16 representable values. The optimizer is whispering below bf16's hearing threshold. The result: roughly 99% of weights are bit-identical between consecutive steps, with the worst case never dropping below 98%. Ship only the changed elements as a sparse safetensors file, upload it to a Hub Bucket, and the per-step payload drops from 1.2GB to 20-35MB on Qwen3-0.6B. The team demonstrated a fully disaggregated training where the trainer ran on one box, vLLM lived in a Hugging Face Space, and the Wordle environment lived in another Space: no shared cluster, no RDMA, no VPN. Weights flowed through a single bucket. For a 405B model, the delta would be roughly 6GB versus 810GB for a full sync. At 1GB/s of usable internet bandwidth, a cross-cloud full broadcast takes about 13 minutes; the delta path does it in 6 seconds. (more: https://huggingface.co/blog/delta-weight-sync)
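The bf16 property is simple to check without any ML libraries. The sketch below rounds simulated master weights to bf16 before and after one tiny update, then ships only the positions whose 16-bit pattern changed. It is a toy under stated assumptions: the weight distribution and the fixed-magnitude update are invented, and the delta is a list of (index, bits) pairs rather than TRL's actual sparse safetensors format.

```python
import random
import struct


def to_bf16_bits(x):
    """Round a float to bfloat16 (nearest-even) and return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def sparse_delta(old_bits, new_bits):
    """Keep only positions whose bf16 pattern changed."""
    return [(i, n) for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n]


def apply_delta(bits, delta):
    """Patch a copy of the old weights with the shipped changes."""
    patched = list(bits)
    for i, value in delta:
        patched[i] = value
    return patched


rng = random.Random(0)
STEP = 3e-6  # stand-in for a per-weight update at an RL-scale learning rate

# Assumed toy distribution of full-precision master weights.
master = [rng.uniform(0.01, 0.1) * rng.choice((-1, 1)) for _ in range(20_000)]
updated = [w + STEP * rng.choice((-1, 1)) for w in master]

old_bits = [to_bf16_bits(w) for w in master]
new_bits = [to_bf16_bits(w) for w in updated]
delta = sparse_delta(old_bits, new_bits)
unchanged = 1 - len(delta) / len(master)

print(f"bit-identical after one step: {unchanged:.1%}")
print("receiver matches trainer:", apply_delta(old_bits, delta) == new_bits)

# Transfer time at 1 GB/s, using the article's 405B figures.
print(f"full sync: {810 / 1 / 60:.1f} min, delta: {6 / 1:.0f} s")
```

Most weights sit far enough from a bf16 rounding boundary that a 3e-6 nudge does not flip any stored bit, which is why the delta stays sparse.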

On the privacy side, PACZero introduces a family of PAC-private zeroth-order mechanisms for fine-tuning LLMs that delivers usable accuracy at privacy levels no existing differential privacy method can match. The core insight: sign-quantizing subset-aggregated zeroth-order gradients creates frequent "unanimity" steps where every candidate subset agrees on the update direction, costing zero mutual information under PAC Privacy. On SST-2 with OPT-1.3B, the budgeted variant reaches 90.8% accuracy at a total MI budget of 0.01 nats — within 1.4 percentage points of the non-private baseline. The zero-privacy-loss variant (PACZero-ZPL) replaces noised releases with uniform coin flips on disagreement steps, achieving MI=0 for every step count while still maintaining competitive accuracy (89.1% on OPT-6.7B). By contrast, all DP-ZO baselines collapse to chance accuracy below epsilon=2. The unanimity branch fires on 34-45% of SST-2 training steps, providing directed optimization at literally zero privacy cost. Whether these rates persist on architectures beyond OPT and at larger scales is the open question. (more: https://arxiv.org/abs/2605.06505v1)
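The unanimity idea can be sketched in a few lines, going only by the description above. The paper's subset construction, zeroth-order estimator, and budgeted variant are not reproduced. The scalars stand in for per-subset gradient estimates along one random direction, and zpl_direction is a hypothetical name.

```python
import random


def sign(x):
    return 1 if x >= 0 else -1


def zpl_direction(subset_estimates, rng):
    """Return (direction, unanimous).

    If every candidate subset agrees on the sign, release it: the output
    would be the same whichever subset was used, so it reveals nothing
    about that choice. Otherwise release a fair coin flip, which is
    independent of the data.
    """
    signs = {sign(g) for g in subset_estimates}
    if len(signs) == 1:
        return signs.pop(), True
    return rng.choice((-1, 1)), False


rng = random.Random(0)
print(zpl_direction([0.3, 0.1, 0.7], rng))   # unanimous, direction +1
print(zpl_direction([-0.2, -0.5], rng))      # unanimous, direction -1
print(zpl_direction([0.4, -0.1, 0.2], rng))  # disagreement: coin flip
```

In this toy, only the unanimous steps carry signal. The reported 34-45% unanimity rate on SST-2 is what makes that enough to train on.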

Beyond the Cloud: Mesh Networks, Neuromorphic Chips, and When AI Breaks

The mesh networking space has three contenders, and a detailed comparison argues only one has a real future at scale. Meshtastic, the first-mover in consumer LoRa mesh, works well for small private groups but uses flood routing — broadcasting every message across the entire network — which collapses under load. MeshCore improves on this with actual path-based routing (up to 64 hops) and dramatically reduced radio congestion, but its official clients are proprietary with paywalled features. "The point of having an off-grid mesh network in the first place is total freedom and control," the author argues, "so in this particular case I simply cannot ever support a closed-source solution." Reticulum is the clear favorite: a full transport-agnostic networking stack that routes packets seamlessly across LoRa, Wi-Fi, Ethernet, the internet, Tor, I2P, or any medium accessible via TCP, UDP, or serial interface. Its address space is globally unique through cryptographic identity, requiring no central authority. Distinct local meshes merge automatically when any connection point is established. The drawback: no standalone LoRa firmware, meaning remote solar-powered repeater nodes need a Raspberry Pi in addition to the radio. A MicroReticulum port for ESP32 is in development and could be transformative for adoption. (more: https://www.jonaharagon.com/posts/im-getting-into-mesh-networks-meshtastic-meshcore-and-reticulum/)
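The scaling argument against flood routing comes down to counting radio transmissions per message. The sketch below compares naive flooding, where every node that hears a message rebroadcasts it once, with forwarding along a known shortest path on a small grid. It is a simplified model: hop limits, rebroadcast suppression, and the cost of discovering the path are all left out, and the grid topology is an assumption.

```python
from collections import deque


def grid(n):
    """n x n mesh where each node hears its 4 direct neighbors."""
    graph = {}
    for r in range(n):
        for c in range(n):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            graph[(r, c)] = [(a, b) for a, b in candidates if 0 <= a < n and 0 <= b < n]
    return graph


def flood_transmissions(graph, source):
    """Naive flooding: every node that receives the message transmits it once."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen)


def path_transmissions(graph, source, dest):
    """Path routing: one transmission per hop on a shortest path (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == dest:
            return dist[node]
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None


mesh = grid(5)
print("flood:", flood_transmissions(mesh, (0, 0)))
print("path: ", path_transmissions(mesh, (0, 0), (4, 4)))
```

Flooding cost grows with the number of nodes in the network, while path cost grows only with the distance between the two endpoints.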

In alternative computing, a multi-institution team published in Nature Communications a neuromorphic computer that combines quantum-tunnelling physics with brain-inspired architecture to tackle combinatorial optimization problems — protein folding, logistics networks, chip routing — where conventional AI stalls. The neuromorphic autoencoder with a Fowler-Nordheim annealer searches for solutions "the way natural processes navigate a complex energy landscape to settle into stability," with a guarantee of asymptotic convergence to the optimal solution. The team spans Washington University, IISc Bangalore, Heidelberg, Johns Hopkins, and UC Santa Cruz. (more: https://iisc.ac.in/a-eureka-machine-that-thinks-like-nature-and-explores-what-ai-cannot/) And in a reminder that production AI still fails in mundane ways, Starbucks is scrapping its AI-powered inventory management tool after it reportedly miscounted and mislabeled items — a high-profile case of the pattern where AI deployment breaks not on intelligence but on coordination with messy real-world data. (more: https://www.reddit.com/r/AINewsMinute/comments/1tln6vy/starbucks_is_scrapping_its_ai_inventory_tool/)

Sources (22 articles)

  1. OpenAI cofounder Karpathy joins Anthropic to teach Claude to improve itself without humans (reddit.com)
  2. China Clamps Down on Overseas Travel for AI Talent at Alibaba, DeepSeek (reddit.com)
  3. SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More (reddit.com)
  4. The frontier reasoning race is starting to look like a crowded subway station (reddit.com)
  5. MiMo-V2.5-coder (reddit.com)
  6. Krasis update: Qwen3.6-35B-A3B (Q4) at reading speed, 1x 8GB 3070 Mobile laptop (32GB RAM) (reddit.com)
  7. Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster. (reddit.com)
  8. Small comparison on full compute performance (Anima) of 5090 vs 6000 PRO MaxQ vs 6000 PRO WS/SE (reddit.com)
  9. CXMT started selling ram to corsair (reddit.com)
  10. TTS Benchmark Comparison (all known TTS up until May 2026) (reddit.com)
  11. [Editorial] Harness Engineering for AI Coding (linkedin.com)
  12. [Editorial] pacphi Gist (gist.github.com)
  13. Patdolitse/piia-engram (github.com)
  14. yliust/Tactile: accessibility-first operating layer for agents (github.com)
  15. hadriansecurity/OpenHack (github.com)
  16. I bypassed AWS API Gateway auth with a trailing slash. Got $12K bounty (theguptalog.blogspot.com)
  17. Exit IP VPN servers mitigation rollout (mullvad.net)
  18. Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL (huggingface.co)
  19. PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization (arxiv.org)
  20. I'm Getting into Mesh Networks (Meshtastic, MeshCore, and Reticulum) (jonaharagon.com)
  21. A Eureka machine that thinks like nature and explores what AI cannot (iisc.ac.in)
  22. Starbucks is scrapping its AI inventory tool after it reportedly miscounted and mislabeled items (reddit.com)