Opus 4.7, Qwen3.6, and the Agentic Coding Arms Race
Today's AI news: Opus 4.7, Qwen3.6, and the Agentic Coding Arms Race; Supply Chain Attacks: The Plugin Marketplace Has a Trust Problem; Agent Authentication Gets an IETF Draft, and the Agentic Web Gets a Reality Check; Local AI Pushes to the Extremes: 1-Bit in the Browser, 5090 on the Mac; Stanford HAI 2026: Scaling Faster Than We Can Measure; Applied AI: RF Covert Channels, Hardware MCP, and the Personal DA. 23 sources curated from across the web.
Opus 4.7, Qwen3.6, and the Agentic Coding Arms Race
Anthropic shipped Claude Opus 4.7 today, and the headline numbers are hard to ignore. On CursorBench, it jumps from 58% (Opus 4.6) to 70%. On Rakuten-SWE-Bench, it resolves three times more production tasks than its predecessor. XBOW's visual-acuity benchmark, critical for computer-use agents reading dense screenshots, leapt from 54.5% to 98.5%. Notion reports 14% higher task completion with fewer tokens and a third of the tool errors. The model also accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), more than triple prior Claude models, which unlocks multimodal use cases from patent diagram analysis to dense screenshot parsing. Anthropic is also releasing a new "extra high" effort level, task budgets in public beta, and an "ultra review" mode that produces a dedicated review session flagging bugs and design issues a careful human reviewer would catch. (more: https://www.anthropic.com/news/claude-opus-4-7)
What matters more than the benchmarks, though, is the long-context coherence story. Cole Medin's early testing highlights the practical shift: with Opus 4.6, quality would drift meaningfully around 250,000 tokens into a run; the agent would write confidently while the plan quietly derailed. Opus 4.7 maintains lucidity "hundreds of thousands of tokens" into an agentic session, still referencing earlier decisions correctly and making moves that fit the original plan. That is the difference between an agent you supervise and one you hand off. As Medin puts it, "SWE-bench will be today's headline. Long-context coherence is the story for the next six months." (more: https://www.linkedin.com/posts/cole-medin-727752184_opus-47-is-live-and-after-an-afternoon-activity-7450612114106343424-d1q3)
Anthropic is not alone in the agentic coding race. Alibaba released Qwen3.6-35B-A3B, a Mixture-of-Experts model that activates only 3 billion of its 35 billion parameters per forward pass, meaning it can run on hardware with 22GB of total memory while competing with far more computationally expensive models. It supports 256K context across 201 languages and targets agentic coding workloads specifically. (more: https://qwen.ai/blog?id=qwen3.6-35b-a3b) Unsloth has already published quantized GGUFs calibrated on real-world datasets, along with detailed inference guides for llama.cpp, MLX on Apple Silicon, and integration with Claude Code and OpenAI Codex via local OpenAI-compatible servers. Unsloth's documentation also introduces Unsloth Studio, a new open-source web UI for running and fine-tuning models locally. A 2-bit Qwen3.6 GGUF reportedly made 30+ tool calls, searched 20 sites, and executed Python code, all locally. (more: https://unsloth.ai/docs/models/qwen3.6)
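The arithmetic behind the 22GB figure is worth sketching. MoE routing reduces compute per token, not memory: all 35B expert weights must stay resident even though only ~3B are active per forward pass, so the footprint is set by total parameter count and quantization width. A rough back-of-envelope estimate (the bits-per-weight and overhead numbers below are illustrative assumptions, not Unsloth's published figures):

```python
def gguf_footprint_gib(total_params_b: float, bits_per_weight: float,
                       overhead_gib: float = 2.0) -> float:
    """Rough resident-memory estimate for a quantized MoE model.

    Every expert's weights must be loaded even if few are active per
    token. overhead_gib stands in for KV cache and runtime buffers
    (an assumed figure, not a measurement).
    """
    weight_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return weight_bytes / 2**30 + overhead_gib

# At an assumed ~4.5 bits/weight, 35B total parameters land under 22 GiB:
print(round(gguf_footprint_gib(35, 4.5), 1))  # → 20.3
```

The same formula makes clear why the 2-bit GGUF mentioned above is attractive: halving bits-per-weight roughly halves the weight term, trading quality for reach onto smaller machines.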
Meanwhile, a thread on r/LocalLLaMA claims a "major drop in intelligence across most major models" (Claude, Gemini, z.ai, Grok), all reportedly ignoring basic instructions, producing shallow output, and responding slowly. One user tested GLM 5 with identical prompts on a rented H100 versus z.ai's hosted version and got correct answers only from the self-hosted instance. The community's leading hypothesis: providers are aggressively quantizing hosted models to cut costs, potentially serving different quality tiers based on user profiles. One commenter suggests detecting silent quantization by tracking covariance between models on public benchmark question sets across time and peak hours. The irony is thick: the same week Anthropic ships a measurably better model, the broader cloud inference ecosystem may be quietly degrading. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sm08m6/major_drop_in_intelligence_across_most_major/)
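The commenter's detection idea can be sketched with nothing more than per-window accuracy logs. Assuming you re-run a fixed public question set against several providers at regular intervals, a silently quantized provider should show an accuracy drop that is uncorrelated with the rest of the fleet (provider names and scores below are hypothetical; the thresholds are arbitrary illustration values):

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation; assumes neither series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def flag_silent_degradation(scores: dict[str, list[float]],
                            corr_floor: float = 0.5,
                            drop_floor: float = 0.05) -> list[str]:
    """Flag providers whose accuracy on a fixed question set drops over
    time AND decorrelates from the fleet average (a provider-specific
    regression, not the benchmark getting harder for everyone)."""
    flagged = []
    for name, series in scores.items():
        others = [s for n, s in scores.items() if n != name]
        fleet = [mean(col) for col in zip(*others)]  # fleet avg per window
        drop = series[0] - series[-1]
        if drop > drop_floor and pearson(series, fleet) < corr_floor:
            flagged.append(name)
    return flagged

# Hypothetical weekly accuracy on the same 200 public questions:
scores = {
    "provider_a": [0.81, 0.80, 0.82, 0.80],  # stable
    "provider_b": [0.78, 0.80, 0.77, 0.79],  # stable
    "provider_c": [0.80, 0.74, 0.69, 0.66],  # quietly degrading
}
print(flag_silent_degradation(scores))  # → ['provider_c']
```

In practice the hard part is the one the thread identifies: sampling across peak hours and user profiles, since tiered serving would only show up for some accounts at some times.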
Supply Chain Attacks: The Plugin Marketplace Has a Trust Problem
A buyer using the alias "Kris" (background in SEO, crypto, and online gambling marketing) purchased 30+ WordPress plugins from the "Essential Plugin" portfolio on Flippa for six figures in late 2024. The portfolio had been built over eight years by an India-based team, accumulating hundreds of thousands of active installations. The buyer's very first SVN commit planted a PHP deserialization backdoor, a textbook unserialize() arbitrary function call hidden inside what the changelog described as a WordPress compatibility check. It sat dormant for eight months before activating on April 5–6, 2026. (more: https://anchor.host/someone-bought-30-wordpress-plugins-and-planted-a-backdoor-in-all-of-them/)
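For readers who have not met this bug class: PHP's unserialize() can be coaxed into invoking attacker-chosen callables via object "gadgets", which is why deserializing untrusted input is a classic backdoor vehicle. The same hazard exists in Python's pickle, which makes for a compact, safe-to-run illustration of the general technique (this is not the actual payload from the plugins; `log` stands in for something like os.system):

```python
import pickle

calls = []

def log(msg):
    """Benign stand-in for the dangerous call a real payload would make."""
    calls.append(msg)

class Gadget:
    def __reduce__(self):
        # pickle records: "to rebuild this object, call log(...)".
        # That is exactly the primitive a deserialization backdoor needs.
        return (log, ("arbitrary code ran at load time",))

payload = pickle.dumps(Gadget())   # what an attacker ships as 'data'
pickle.loads(payload)              # the victim merely deserializes
print(calls)                       # proof that a function call happened
```

The WordPress defense is the same as the Python one: never run unserialize()/pickle on input an attacker can influence, and prefer data-only formats like JSON at trust boundaries.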
The sophistication is worth noting. The injected payload resolved its command-and-control domain through an Ethereum smart contract, querying public blockchain RPC endpoints, which made traditional domain takedowns ineffective because the attacker could update the smart contract to point to a new domain at any time. The malware served SEO spam exclusively to Googlebot, rendering it invisible to site owners. WordPress.org responded by closing all 31 plugins in a single day and forcing an auto-update, but the forced update only neutralized the phone-home mechanism in the plugin itself; it did not clean wp-config.php, where the backdoor had already injected its payload. This is the same playbook as the 2017 Display Widgets attack (200,000 installs, payday loan spam), except at 15 times the scale. WordPress.org still has no mechanism to flag or review plugin ownership transfers, no change-of-control notification to users, and no additional code review triggered by a new committer.
The plugin marketplace trust problem connects to a broader theme about security friction. A Semgrep blog post argues that security controls which frustrate developers actively make systems less secure; people bypass what gets in their way. The piece makes a familiar but underappreciated point: most developers do not focus on security unless forced to, most organizations provide no security training, and tools that produce long lists of false positives erode trust until they are ignored entirely. The prescription is to surface findings immediately during code creation, use AI for triage and noise reduction (not as a source of truth), and reward teams for issues fixed rather than penalizing for issues found. Security friction is not a software problem; it is an organizational design problem. (more: https://semgrep.dev/blog/2026/security-should-be-the-path-of-least-resistance)
On the web platform side, Google is expanding its spam policies to explicitly target "back button hijacking", where sites insert deceptive pages into browser history to trap users. Enforcement begins June 15, 2026, with affected pages subject to manual actions or automated demotions. Google notes that some hijacking originates from included libraries or advertising platforms rather than intentional site-owner behavior, and encourages thorough code review. (more: https://developers.google.com/search/blog/2026/04/back-button-hijacking)
And for those tired of entrusting passwords to third-party vaults, HIPPO proposes computing passwords on the fly from two secrets, one user-supplied master password and one server-side key, using an Oblivious Pseudorandom Function (OPRF) protocol. The server never sees the raw password; the client never sees the server's key. The result is a deterministic, site-specific, high-entropy password with no vault to compromise. The obvious weakness is a single point of failure at the HIPPO service, and it does not handle password rotation or 2FA. Commenters quickly pointed out that Hugo Krawczyk's SPHINX protocol (the likely theoretical basis, given a shared co-author) has been in production for nearly a decade and already solves the single point of failure via threshold OPRF. (more: https://hackaday.com/2026/04/15/dont-trust-password-managers-hippo-may-be-the-answer/)
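The OPRF mechanics are simpler than they sound. In a Diffie-Hellman-style OPRF, the client hashes the master password into a group, blinds it with a fresh random exponent, the server raises the blinded value to its secret key, and the client unblinds; the result is deterministic even though every exchange looks random to the server. A toy sketch over an 11-bit safe-prime group (real deployments use ristretto255 or a 2048-bit MODP group, and HIPPO's exact construction is in the linked article; all names here are ours):

```python
import hashlib
import secrets

# Toy safe-prime group p = 2q + 1. ILLUSTRATION ONLY: far too small for
# real use, but the algebra is identical at cryptographic sizes.
P, Q = 2039, 1019

def hash_to_group(pw: str) -> int:
    """Map the master password into the order-Q subgroup
    (squaring lands the hash in the quadratic residues)."""
    x = int.from_bytes(hashlib.sha256(pw.encode()).digest(), "big") % P
    return pow(x, 2, P)

def client_blind(pw: str):
    r = secrets.randbelow(Q - 1) + 1          # fresh blinding factor
    return pow(hash_to_group(pw), r, P), r    # server sees only this

SERVER_KEY = 777                              # server-side secret

def server_evaluate(blinded: int) -> int:
    return pow(blinded, SERVER_KEY, P)        # computed without seeing pw

def client_unblind(evaluated: int, r: int, site: str) -> str:
    y = pow(evaluated, pow(r, -1, Q), P)      # = hash_to_group(pw) ** key
    return hashlib.sha256(f"{y}:{site}".encode()).hexdigest()[:20]

blinded, r = client_blind("correct horse")
pw1 = client_unblind(server_evaluate(blinded), r, "example.com")
blinded, r = client_blind("correct horse")    # different blinding...
pw2 = client_unblind(server_evaluate(blinded), r, "example.com")
print(pw1 == pw2)                             # ...same derived password
```

Because the output depends on both the master password and SERVER_KEY, neither a leaked server key nor a guessed master password alone recovers the site passwords; that is also why losing the HIPPO service means losing every derived password, the single point of failure commenters flagged.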
Agent Authentication Gets an IETF Draft, and the Agentic Web Gets a Reality Check
The identity problem for AI agents just got its second IETF draft in as many months. Dick Hardt (of Hellō) published draft-hardt-aauth-protocol-01, defining four resource access modes for agent-to-resource authorization: identity-based, resource-managed (two-party), PS-managed (three-party), and federated (four-party), with agent governance as an orthogonal layer. The protocol builds on HTTP Signature Keys and introduces concepts like "Person Servers" that mediate between agents and resources, "Missions" that scope agent actions with approval workflows, and "Clarification Chat" flows where a resource can ask an agent for more context before granting access. It also handles multi-hop resource access, critical for agents that chain tool calls across trust boundaries. (more: https://datatracker.ietf.org/doc/draft-hardt-aauth-protocol)
This is a different and more ambitious draft than the draft-klrc-aiagent-auth-00 covered previously, which proposed treating agents as workloads under the existing WIMSE architecture. Where that draft composed existing OAuth 2.0 and transaction token primitives, AAuth introduces new protocol machinery (Person Servers, Mission lifecycles, agent tokens with explicit governance) designed specifically for the agent use case rather than adapted from human authentication flows. Whether the IETF community consolidates around one approach or fragments remains to be seen, but the fact that two independent teams are racing to standardize agent identity signals how urgent the problem has become.
The urgency is practical. A video essay by a former Google engineer lays out the case that the entire web is built around human affordances (pagination, authentication flows, startup sequences, rate limits) and agents are increasingly bottlenecked by exactly these affordances. Jeff Dean noted at GTC that making a model infinitely fast would yield only a 2–3x productivity improvement, because the remaining 47-fold speedup gets absorbed by human-speed tool infrastructure: compilers, file systems, APIs, CRMs, ERPs. The essay identifies three layers of rebuild: making existing tools faster (TypeScript 7 rewritten in Go, Rust toolchains), replacing tool abstractions with agent-native primitives (persistent containers, copy-on-write filesystems, shared KV caches), and ultimately replacing human scaffolding entirely. (more: https://www.youtube.com/watch?v=XlfumXPPrLY)
On the agent security side, Prompt Security's open-source ClawSec project shipped a new skill for Hermes agents, hermes-attestation-guardian, that provides deterministic attestation with canonical digest binding, fail-closed verification, and authenticated baseline drift analysis. The design is narrow and intentional: it targets Hermes infrastructure posture only, does not attempt runtime defense, and requires proven baseline trust before any drift comparison is treated as meaningful. The philosophy, "security controls should be verifiable, reproducible, and practical enough that operators actually keep them on," echoes the Semgrep argument about reducing friction. (more: https://www.linkedin.com/pulse/hermes-security-upgraded-new-clawsec-skill-david-abutbul-babtf)
A GitHub Gist showcases what agent-driven quality engineering looks like in practice: a six-agent hierarchical swarm analyzed 170 source files and 31,819 lines of code across the D.U.H. universal harness, producing findings across code quality, security, performance, and test coverage. Among the critical findings: no symlink resolution in file operations (path traversal), WebFetch SSRF with no internal network filtering, JWT decoded without signature verification, and a quadratic token re-estimation bug scaling O(N*M) on every conversation turn. The analysis also identified 295 new test ideas across seven dimensions. (more: https://gist.github.com/proffesor-for-testing/9d5b77c8a80531ab7e4237ba939ab475)
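The JWT finding is the most common of these bug classes, and it reduces to one line: reading claims without recomputing the signature. A stdlib-only sketch of both the broken and the correct HS256 path (helper names are ours, not from the audited codebase):

```python
import base64, hashlib, hmac, json

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _unb64url(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign(payload: dict, key: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    mac = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url(mac)}"

def decode_unverified(token: str) -> dict:
    """The bug: trusts whatever the client sent."""
    return json.loads(_unb64url(token.split(".")[1]))

def decode_verified(token: str, key: bytes) -> dict:
    """The fix: recompute and compare the MAC before reading claims."""
    header, body, sig = token.split(".")
    mac = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(mac, _unb64url(sig)):
        raise ValueError("bad signature")
    return json.loads(_unb64url(body))

key = b"secret"
token = sign({"user": "alice", "admin": False}, key)
# An attacker swaps the payload segment but cannot forge the MAC:
parts = token.split(".")
parts[1] = _b64url(json.dumps({"user": "alice", "admin": True}).encode())
forged = ".".join(parts)
print(decode_unverified(forged)["admin"])   # True: the bug accepts it
```

With a real library the fix is the same shape: pass the key and an explicit algorithm list to the decode call instead of opting out of verification.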
Local AI Pushes to the Extremes: 1-Bit in the Browser, 5090 on the Mac
A 1.7 billion parameter model weighing 290 megabytes is now running inference in a browser tab via WebGPU. Bonsai is a 1-bit quantization-aware distillation experiment: weights are constrained to ternary values during training, not just compressed afterward. The community reaction is a mix of awe at the engineering ("two years ago we were arguing whether 7B could run on consumer GPUs, now we're doing it in a tab") and honest assessment that the 8B version hallucinates heavily and is "unusable for any task I can think of." The real question is whether purpose-built 1-bit models trained from scratch (rather than distilled) can close the quality gap, and whether optimized CPU kernels can push throughput to practical levels (a finished llama.cpp PR for optimized 1-bit CPU inference just landed). (more: https://www.reddit.com/r/LocalLLaMA/comments/1smb3wd/1bit_bonsai_17b_290mb_in_size_running_locally_in/)
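"Ternary" here means each weight is stored as one of {-1, 0, +1} times a per-tensor scale, about 1.58 bits of information per weight. A sketch of the absmean scheme popularized by BitNet b1.58 (Bonsai's exact training recipe may differ; this shows the storage format, not the training loop):

```python
def ternary_quantize(weights: list[float]) -> tuple[list[int], float]:
    """Absmean ternary quantization: scale by the mean absolute weight,
    then clip and round each weight to -1, 0, or +1."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [round(max(-1.0, min(1.0, w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

q, scale = ternary_quantize([0.9, -1.2, 0.05, 0.4])
print(q)   # every value is -1, 0, or +1
```

Quantization-aware training matters because the forward pass sees these ternary values during training, so the network learns to route around the precision loss instead of absorbing it after the fact.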
At the other end of the hardware spectrum, tinygrad's new drivers allowing NVIDIA and AMD eGPUs on macOS are prompting some creative system design. One user is planning to combine an RTX 5090's raw compute with a Mac Studio's unified memory, using the GPU for prefill and the Apple Silicon for decode. Another commenter reports already having a working llama.cpp build doing exactly this across separate machines: 450 tokens/second prefill on a 5090 for GLM5 Q4, then handing off to a Mac Studio for generation. The bottleneck is PCIe bandwidth: at PCIe 4.0 x4 through a Thunderbolt enclosure, streaming full model weights takes minutes for large models. The practical takeaway: separate machines with enough RAM to saturate a full x16 link beat eGPU adapters for big models, but the driver itself is the foundation for future integration. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sluj8b/pondering_on_improving_prompt_processing_on_mac/)
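The bandwidth arithmetic is easy to check. PCIe 4.0 carries roughly 2 GB/s per lane before protocol overhead, and Thunderbolt enclosures typically deliver less in practice; the effective throughputs and model size below are rough assumptions for illustration:

```python
def transfer_seconds(model_gib: float, effective_gb_per_s: float) -> float:
    """Time to stream a model's weights over a link at a given
    effective (post-overhead) throughput."""
    return model_gib * 2**30 / (effective_gb_per_s * 1e9)

# A ~200 GiB quantized model over an assumed ~3 GB/s Thunderbolt path
# versus an assumed ~25 GB/s PCIe 4.0 x16 link:
print(round(transfer_seconds(200, 3.0)))    # → 72 (over a minute)
print(round(transfer_seconds(200, 25.0)))   # → 9
```

This is why the thread's conclusion favors whole machines over eGPU enclosures for big models: the x16 link turns a per-swap wait of minutes into seconds, and enough local RAM avoids the streaming entirely.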
The privacy-conscious local AI crowd also has a new option: claude-private, a patched build of Claude Code CLI v2.1.88 with all 17+ telemetry mechanisms removed. The project catalogs the stock CLI's phone-home behavior in uncomfortable detail: Datadog event logging every 15 seconds, BigQuery metrics export every 5 minutes, session transcript ingress during conversations, MCP registry prefetch on startup, and more. The patch uses binary-level URL replacement (same-length dummy strings to preserve executable structure) plus environment overrides and source-level patches across 19 files. Core functionality (conversations, tools, file editing, bash, MCP) is unaffected. The project notes that Claude Code's source was exposed via npm registry source maps on March 31, 2026. (more: https://github.com/ultrmgns/claude-private)
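The same-length constraint is the interesting part: shortening a string inside a compiled or minified artifact shifts every subsequent byte offset, so the replacement must pad to the original length. A minimal sketch of the idea (claude-private's actual patching logic lives in the repo; the hostnames below are made up):

```python
def patch_endpoint(blob: bytes, old_url: bytes, new_url: bytes) -> bytes:
    """Swap a telemetry URL for a dead one without changing the file's
    size or the offsets of anything after the string."""
    if len(new_url) > len(old_url):
        raise ValueError("replacement longer than original; offsets would shift")
    padded = new_url.ljust(len(old_url), b"/")   # pad to identical length
    if old_url not in blob:
        raise ValueError("URL not present in binary")
    return blob.replace(old_url, padded)

blob = b"\x00preamble https://telemetry.example.com/v1/events postamble\x00"
patched = patch_endpoint(blob, b"https://telemetry.example.com",
                         b"http://localhost:9")
print(len(patched) == len(blob))   # file size is preserved
```

The choice of pad byte depends on context (a null works for C strings, a harmless URL character like "/" for JS bundles); either way the patched endpoint resolves nowhere, which is the point.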
Stanford HAI 2026: Scaling Faster Than We Can Measure
Stanford's ninth AI Index report lands with a data point that captures the current moment: generative AI reached 53% population-level adoption within three years, faster than the personal computer or the internet. But the report's real weight is in the uncomfortable juxtapositions. SWE-bench Verified performance rose from 60% to nearly 100% of the human baseline in a single year, yet documented AI incidents rose to 362 (up from 233 in 2024). Organizational adoption hit 88%, yet almost no leading model developer reports results on responsible AI benchmarks. The U.S.-China model performance gap has effectively closed (Anthropic leads by just 2.7%), while the number of AI researchers moving to the U.S. dropped 89% since 2017, with an 80% decline in the last year alone. (more: https://hai.stanford.edu/assets/files/ai_index_report_2026.pdf)
The labor market data deserves particular attention. The report finds productivity gains of 14–26% in customer support and software development, with weaker or negative effects in tasks requiring more judgment. But in software development specifically, where AI's measured productivity gains are clearest, U.S. developers aged 22–25 saw employment fall nearly 20% from 2024, even as headcount for older developers continued to grow. This K-shaped pattern matches prior analysis here: AI acts as a seniority-biased technological change, amplifying experienced engineers while displacing juniors. The environmental data is equally sobering: Grok 4's estimated training emissions reached 72,816 tons of CO2 equivalent, AI data center power capacity rose to 29.6 GW (comparable to New York state at peak demand), and annual GPT-4o inference water use alone may exceed the drinking water needs of 12 million people.
Steve Yegge's observations from inside Google add texture to the adoption picture. His source, a 20-year Google tech director, reports that Google's internal AI adoption footprint is roughly equivalent to John Deere's: 20% agentic power users, 20% outright refusers, 60% still using basic chat tools. The diagnosis: an 18-month industry-wide hiring freeze means nobody is moving companies, so nobody knows where they stand on the adoption curve. Google specifically cannot use Claude Code because it is "the enemy," and Gemini has never been good enough to capture agentic workflows. Some companies at the bottom have near-zero AI adoption and cannot even get budget. Others are cancelling IntelliJ licenses for a thousand engineers, an "incredibly bold move" toward full agentic adoption. (more: https://x.com/Steve_Yegge/status/2043747998740689171?s=20)
Jonathan McGuinness offers the emotional counterpoint to these structural shifts. His essay on the "AI resentment stage" describes the moment AI produced in ten minutes what used to take him a week, and his first reaction was not admiration but deflation, "a kind of grief" for a professional identity built around being smart. His prescription: sit with the discomfort rather than rushing to reframe it. "My job didn't shrink. It clarified. I'm not the one doing the processing anymore. I'm the one who knows what matters." (more: https://www.linkedin.com/pulse/ai-resentment-stage-jonathan-mcguinness-7t41f)
Applied AI: RF Covert Channels, Hardware MCP, and the Personal DA
A Sorbonne research team published a paper on AI-enabled detection of covert channels in RF receiver architectures, a genuinely novel intersection of AI security and hardware signal processing. Covert channels in wireless chips allow exfiltration of sensitive data (encryption keys, AI models, configuration states) through Hardware Trojans embedded in RF transmitters. The team's approach: compress a CNN to 80% fewer parameters, train it to detect covert channels directly from raw I/Q samples, and deploy it on an FPGA achieving 107 GOPs/W efficiency. At practical SNR levels above 20 dB, the model achieves over 97% accuracy for both CC detection and identifying the underlying Hardware Trojan technique. The paper benchmarks four prominent HT-CC attack types and demonstrates the first dedicated AI hardware accelerator specifically for covert channel detection. (more: https://arxiv.org/abs/2604.14987v1)
In a different kind of hardware integration, a developer has been connecting Claude Code to oscilloscopes and SPICE simulators via MCP servers, creating a feedback loop where the AI can design a circuit, simulate it, measure the physical output, and iterate. The workflow scales beyond trivial circuits: the key insight is that Claude Code excels when it gets immediate feedback, and connecting it to real measurement equipment provides exactly that. The setup includes MCP servers for LeCroy oscilloscopes and spicelib, with practical tips: give Claude explicit pinout maps, never let it guess physical connections, use Makefiles to expose standardized build/flash/ping commands rather than having the agent construct them on the fly. (more: https://lucasgerads.com/blog/lecroy-mcp-spice-demo/)
For those wanting to build ML intuition from the ground up, Pyre Code offers 68 problems covering the internals of Transformers, vLLM, TRL, and diffusion models. No GPU required β you implement attention variants, training tricks, inference kernels, and alignment algorithms against a local test suite. The problems range from fundamentals (ReLU, Softmax) through attention mechanisms (GQA, Flash Attention, MLA) to alignment (DPO, GRPO, PPO) and frontier architectures (MoE, Mamba SSM, Multi-Token Prediction). It runs entirely locally via a FastAPI grading service and Next.js frontend. (more: https://github.com/whwangovo/pyre-code)
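To give a flavor of the fundamentals tier, the kind of exercise such a platform grades locally looks like implementing a numerically stable softmax by hand (a generic illustration, not one of Pyre Code's actual test cases):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtracting the max before exp
    prevents overflow without changing the result."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow
print(round(sum(probs), 6))                # probabilities sum to 1
```

The later tiers apply the same implement-and-verify loop to progressively hairier targets, from grouped-query attention up through DPO and MoE routing.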
Daniel Miessler's video essay makes the case that all the agent and harness infrastructure being built today is converging on a single endpoint: a named, persistent digital assistant (DA) that knows everything about you β your work, relationships, health, finances, goals β and uses an army of agents as its back-end execution layer. His system, "Pi," currently has 51 public skills, 43 private skills, and 418 workflows, all oriented around moving him from "current state" to "ideal state." The argument is that what humans want from AI is predictable even if the technology is not: we want to feel seen, understood, and supported by a trusted entity that never gets tired. Whether this vision materializes through OpenAI's hardware play, Apple's ecosystem, or open-source harnesses, the direction is convergent. (more: https://www.youtube.com/watch?v=uUForkn00mk)
Sources (23 articles)
- [Editorial] Claude Opus 4.7 Launch (anthropic.com)
- [Editorial] Cole Medin on Opus 4.7 (linkedin.com)
- Qwen3.6-35B-A3B: Agentic coding power, now open to all (qwen.ai)
- [Editorial] Unsloth Qwen3.6 Model Docs (unsloth.ai)
- Major drop in intelligence across most major models (reddit.com)
- Someone bought 30 WordPress plugins and planted a backdoor in all of them (anchor.host)
- [Editorial] Security Should Be the Path of Least Resistance (semgrep.dev)
- Google's new spam policy for back button hijacking (developers.google.com)
- HIPPO: Password Manager Alternative (hackaday.com)
- [Editorial] IETF Agent Authentication Protocol Draft (datatracker.ietf.org)
- [Editorial] Video Submission (youtube.com)
- [Editorial] Hermes Security Upgraded with ClawSec Skill (linkedin.com)
- [Editorial] GitHub Gist Submission (gist.github.com)
- 1-bit Bonsai 1.7B (290MB) running locally in your browser on WebGPU (reddit.com)
- Mac Studio + eGPU (RTX 5090) with new Apple-NVIDIA drivers for local AI (reddit.com)
- claude-private: Claude Code without telemetry (github.com)
- [Editorial] Stanford HAI AI Index Report 2026 (hai.stanford.edu)
- [Editorial] Steve Yegge on AI (x.com)
- [Editorial] The AI Resentment Stage (linkedin.com)
- AI-Enabled Covert Channel Detection in RF Receiver Architectures (arxiv.org)
- SPICE simulation to oscilloscope verification with Claude Code MCP (lucasgerads.com)
- pyre-code: Self-hosted ML coding practice platform (68 problems) (github.com)
- [Editorial] Video Submission (youtube.com)