Biometric Heists and Rogue Cell Towers

Today's AI news: Biometric Heists and Rogue Cell Towers, The Platform Lockdown, Benchmark Contamination and Model Decay, The GitHub Exodus and Forge Federation, The Agentic Coding Toolkit, Local AI at Consumer Scale, The Verifier Is the Moat. 22 sources curated from across the web.

Biometric Heists and Rogue Cell Towers

On April 4, the extortion group Lapsus$ posted roughly four terabytes of data stolen from Mercor, an AI training contractor platform. The dump reportedly covers more than 40,000 contractors who recorded voice samples, completed identity verification, and submitted government-issued IDs as part of onboarding. What makes this breach qualitatively different from prior voice leaks is the pairing: studio-quality audio (averaging two to five minutes per person) sitting in the same row as a passport or driver's license scan. That combination is exactly the input a modern voice-cloning service requires to produce a convincing synthetic replica. The Wall Street Journal reported in February 2026 that off-the-shelf cloning tools now need roughly fifteen seconds of clean reference audio. Mercor gave attackers ten to twenty times that, plus the credential to put the clone to work. (more: https://app.oravys.com/blog/mercor-breach-2026)

The threat models are not hypothetical. Pindrop documented a 475 percent year-over-year increase in synthetic voice attacks against insurance call centers through 2025. The FBI Internet Crime Complaint Center logged $2.3 billion in losses for victims aged 60 and over in 2026, with emergency impersonation calls (the "grandparent scam" powered by cloned voices) as the fastest-growing category. Several US and UK banks still treat voiceprint matching as an authentication factor. A clone of the account holder reading a challenge phrase clears the audio gate, leaving only a knowledge question that often originates from the same leaked dataset. Five contractor lawsuits were filed within ten days, arguing Mercor collected voiceprints under a "training data" framing without disclosing they were permanent biometric identifiers. The practical advice is blunt: treat a leaked voice the way you would treat a leaked password. You cannot rotate it, but you can change what it unlocks: disable voiceprint authentication at your bank, set up verbal codewords with family and finance contacts, and rotate any consumer voice enrollments (Google Voice Match, Alexa Voice ID) from a different acoustic environment.

Meanwhile in Toronto, three men face 44 charges in what police call the first SMS blaster prosecution in Canada. Project Lighthouse, which began in November 2025, tracked a mobile device that mimicked legitimate cell towers from vehicles moving through the Greater Toronto Area. When nearby phones connected, users received fraudulent texts appearing to come from trusted organizations: classic smishing, but delivered at industrial scale. Police estimate tens of thousands of devices connected to the blaster over several months, with more than 13 million network disruptions where phones were unable to reach legitimate towers, potentially including 911 services. The hardware was seized from residences in Markham and Hamilton. (more: https://www.tps.ca/media-centre/stories/unprecedented-sms-blaster-arrests/)

The Platform Lockdown

Starting September 2026, Google will require every Android app developer to register centrally, pay a fee, agree to Google's terms, provide government ID, and hand over evidence of their private signing key; this applies not just to Play Store apps but to all apps, including those shared between friends, distributed through F-Droid, or built by hobbyists for personal use. If a developer does not comply, their apps get silently blocked on every Android device worldwide. The Keep Android Open coalition, now backed by 69 organizations from 21 countries including the EFF, the Free Software Foundation Europe, and the Chaos Computer Club, calls this a kill switch for the open ecosystem. The "escape hatch" Google offers (a nine-step flow through Developer Options with a mandatory 24-hour cooling-off period) runs entirely through Google Play Services, not the Android OS, meaning Google can tighten or kill it at any time with no OS update required. F-Droid has called the policy existential. An EU Parliament member has formally questioned whether it is compatible with the Digital Markets Act. (more: https://keepandroidopen.org/en/)

The security rationale does not survive scrutiny. Google Play Protect already scans for malware independent of developer identity. Requiring a government ID does not make code safer; it makes developers identifiable and controllable. Malware authors can register; indie developers and dissidents in authoritarian regimes often cannot. The real effect is to extend Google's gatekeeping authority from its own marketplace into distribution channels where it has no legitimate operational role. As one commenter put it: "Verification just confirms who's behind the app, it doesn't guarantee clean code or rule out malicious behavior."

The pattern of platforms tightening control over what users can do extends beyond Android. An OpenAI user reported nearly getting their account suspended for attempting to automate YouTube downloads, not because the action was harmful, but because keyword pattern-matching flagged "youtube" and "download" together. Commenters confirmed the trigger is keywords, not intent: rephrasing the same workflow without naming the platform gets through. The guardrails are not analyzing behavior; they are string-matching against a liability list. (more: https://www.reddit.com/r/OpenAI/comments/1sw2dbo/openai_almost_banned_me_bacuse_i_tried_to/)
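
The failure mode is easy to reproduce in a few lines. This is an illustration of the mechanism, not OpenAI's actual moderation code:

```typescript
// Sketch of a keyword-co-occurrence guardrail (illustrative only).
// It flags term combinations, not behavior or intent.
const liabilityList: string[][] = [["youtube", "download"]];

function flagged(prompt: string): boolean {
  const text = prompt.toLowerCase();
  return liabilityList.some((combo) => combo.every((term) => text.includes(term)));
}

console.log(flagged("automate my YouTube downloads"));          // true
console.log(flagged("automate saving videos from that site"));  // false: same intent, no flag
```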

Louis Rossmann documented a different flavor of platform overreach: Anthropic's Claude Code silently routing users to penalty-rate "extra usage" billing when certain files (specifically, files suggesting the user was running Claude outside Anthropic's own harness) appeared in their project directory. The user still had quota remaining on a $200/month plan. Anthropic's support response acknowledged a "billing routing issue" but stated they were "unable to issue compensation for degraded service or technical errors that result in incorrect billing routing." Rossmann's recommendation: file chargebacks en masse. The friction is the point: it forces a human to respond instead of a canned refusal. (more: https://youtu.be/MnazGJzK4UY?si=CtqWaMlABX39vbdM)

Benchmark Contamination and Model Decay

OpenAI has formally recommended that model developers stop reporting SWE-bench Verified scores. Their analysis found two compounding problems. First, an audit of 138 tasks that frontier models consistently failed revealed that 59.4 percent contained material issues in test design: 35.5 percent had overly strict tests rejecting functionally correct submissions, and 18.8 percent tested for functionality never specified in the problem description. Second, and more damaging: all frontier models tested showed evidence of training-data contamination. GPT-5.2 could reproduce exact gold patches from memory. Claude Opus 4.5 recalled specific file paths, inline comments, and four-line diffs verbatim. Gemini 3 Flash reproduced exact regex formulas and line numbers given nothing but a task ID. The contamination pipeline found that models exposed to problems during training were more likely to succeed, not because they reasoned better, but because they had seen the answers. (more: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
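
OpenAI has not published its pipeline as code, but the standard contamination probe is n-gram overlap between model output and the benchmark's gold material. A minimal sketch of that standard technique, with the window size as our assumption:

```typescript
// Minimal contamination probe: long verbatim n-grams shared between a
// model's patch and the benchmark's gold patch suggest memorization,
// not reasoning. (A sketch of the common technique, not OpenAI's pipeline.)
function ngrams(text: string, n: number): Set<string> {
  const tokens = text.split(/\s+/).filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.add(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

function overlapRatio(modelPatch: string, goldPatch: string, n = 8): number {
  const model = ngrams(modelPatch, n);
  const gold = ngrams(goldPatch, n);
  if (gold.size === 0) return 0;
  let hits = 0;
  for (const g of gold) if (model.has(g)) hits++;
  return hits / gold.size; // close to 1.0: the "solution" was likely seen in training
}
```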

This is the benchmark equivalent of teaching to the test, except nobody intentionally cheated: the problems and solutions are simply open-source and widely crawled. The lesson is structural: benchmarks sourced from publicly available material carry contamination risk that scales with training corpus size. SWE-bench Pro, which OpenAI now recommends, appears to suffer less from these issues, though it is not immune. The deeper problem is that improvements on contaminated benchmarks no longer reflect real-world capability. They reflect exposure.

The quality question cuts both ways. A thread on r/Anthropic asks whether Opus 4.7 shows early signs of model collapse, the theoretical scenario where models trained on AI-generated content produce progressively degraded output. Users report the model feels fine for programming but produces "illogical, nonsensical and flatout wrong responses" on advisory tasks. Commenters offer competing diagnoses: cost-cutting via linear approximations replacing full attention, Claude Code's harness overwhelming model context with system instructions, and "adaptive" effort levels that default to cutting corners. One user frames it as "enshittification masquerading as feature." Whether this is model collapse, inference-time optimization, or just a rough release, the convergence with benchmark decay is uncomfortable: the instruments we use to measure progress are failing at the same moment users report progress may be stalling. (more: https://www.reddit.com/r/Anthropic/comments/1sysm4v/opus_47_are_these_first_signs_of_model_collapse/)

The GitHub Exodus and Forge Federation

Mitchell Hashimoto, GitHub user #1299, joined in February 2008 and has opened GitHub every single day since, over 18 years and more than half his life. He is leaving. Ghostty, his terminal emulator project, will migrate to a new host because GitHub's reliability has degraded to the point where it blocks productive work on a near-daily basis. He kept a journal for the past month, marking an "X" next to every day a GitHub outage impacted his ability to work. Almost every day got an X. On the day he wrote the announcement, a GitHub Actions outage had blocked PR review for roughly two hours. The project will keep a read-only mirror at the current URL. The timing is coincidental with the large Elasticsearch outage on April 27; the decision had been months in the making. (more: https://mitchellh.com/writing/ghostty-leaving-github)

The departure raises the question every large open-source project eventually faces: what happens when 90 percent of the world's OSS depends on one provider? Tangled, a new project, proposes an answer borrowed from an older internet: federation. It uses the AT Protocol (the same protocol behind Bluesky) to federate events among git servers called "knots." Users can collaborate on repositories across servers, fork across hosts, and open pull requests on repos hosted on entirely different infrastructure. Issues, PRs, and social interactions (follows, stars, vouches) travel over AT; the code itself remains plain git. The pitch is explicitly nostalgic ("quite like hosting your own cgit instance, and sending out patches via email") but with a social layer that makes collaboration fun rather than arcane. (more: https://blog.tangled.org/federation/)
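
To make the model concrete, here is roughly what a federated pull-request event could look like as an AT Protocol record. The lexicon and field names below are invented for illustration and may not match Tangled's actual schemas; the durable idea is that identity is a portable DID and repos are addressed by AT URIs, while the branch stays an ordinary git ref:

```typescript
// Hypothetical shape of a federated PR event (field and lexicon names
// invented; Tangled's real lexicons may differ). Social/review state
// travels as records; the code stays in plain git on each knot.
interface PullRequestEvent {
  $type: string;        // AT Protocol lexicon id (assumed name below)
  author: string;       // author's DID, valid across every knot
  targetRepo: string;   // AT URI of a repo record hosted on another knot
  sourceBranch: string; // plain git ref on the author's own knot
  title: string;
  createdAt: string;
}

const pr: PullRequestEvent = {
  $type: "sh.tangled.pull",                                      // assumed
  author: "did:plc:exampleauthor",
  targetRepo: "at://did:plc:maintainer/sh.tangled.repo/ghostty", // assumed
  sourceBranch: "refs/heads/fix-renderer-crash",
  title: "Fix renderer crash on resize",
  createdAt: new Date().toISOString(),
};
```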

Zed, the GPU-rendered editor built from scratch in Rust, has hit 1.0. Rather than building on Electron (which the Zed team originally created for Atom, spawning the framework that became VS Code's foundation), they wrote their own UI framework organized around feeding data to GPU shaders. The result is an editor that hundreds of thousands of developers now use daily, with AI deeply integrated: inline completions, an agent panel supporting Claude, Codex, and OpenCode, and a forthcoming synchronization engine called DeltaDB that tracks changes at character-level granularity to let humans and AI agents share a consistent view of an evolving codebase. (more: https://zed.dev/blog/zed-1-0)

The Agentic Coding Toolkit

The skill ecosystem around AI coding agents has quietly matured from novelty to infrastructure. Caveman, a Claude Code skill (and now a multi-agent plugin supporting Codex, Gemini CLI, Cursor, Windsurf, Copilot, and 40+ others), cuts roughly 75 percent of output tokens by forcing the model to speak in compressed, filler-free fragments. A before-and-after example: a 69-token explanation of React re-rendering becomes the 19-token "New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo." Same fix, same accuracy, dramatically less cost and latency. A March 2026 arXiv paper found that constraining large models to brief responses improved accuracy by 26 percentage points on certain benchmarks, suggesting verbosity is not just wasteful but actively harmful to quality. Caveman ships with four intensity levels (Lite through Ultra) and a Classical Chinese mode for maximum compression, plus a compress tool that rewrites CLAUDE.md files into terse form, cutting input tokens by roughly 46 percent per session. (more: https://github.com/JuliusBrussee/caveman)
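
The savings compound across both directions of the wire. A back-of-envelope cost model, with placeholder per-token prices rather than any provider's real rates:

```typescript
// Back-of-envelope session cost with Caveman's claimed reductions
// (46% fewer input tokens, 75% fewer output tokens). Prices and token
// volumes are placeholders, not any provider's actual figures.
const inPricePerMTok = 3;    // $ per 1M input tokens (assumed)
const outPricePerMTok = 15;  // $ per 1M output tokens (assumed)

const sessionCost = (inTok: number, outTok: number): number =>
  (inTok * inPricePerMTok + outTok * outPricePerMTok) / 1e6;

const baseline = sessionCost(200_000, 50_000);                          // $1.35
const caveman = sessionCost(200_000 * (1 - 0.46), 50_000 * (1 - 0.75)); // ~$0.51
console.log(`$${baseline.toFixed(2)} -> $${caveman.toFixed(2)} per session`);
// Roughly a 62% cut under these assumptions, before counting latency savings.
```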

Matt Pocock's skills collection takes a different approach: not compression but process discipline. Built on decades of engineering practice, the collection addresses four failure modes: misalignment (solved by /grill-me and /grill-with-docs, which force detailed Q&A before work begins), verbosity (solved by a shared language document, CONTEXT.md, that lets agents decode project jargon), broken code (solved by a /tdd skill enforcing red-green-refactor loops), and architectural decay (solved by /improve-codebase-architecture, which identifies deepening opportunities informed by domain language). The /grill-with-docs skill is particularly clever: it is a grilling session that simultaneously builds a shared vocabulary and documents hard-to-explain decisions in ADRs, so the token savings compound across sessions. (more: https://github.com/mattpocock/skills)

Portability is becoming a real concern. The Opencode-power-pack project translates Anthropic's official Claude Code plugins (which use Claude-specific commands/ and agents/ formats) into the portable SKILL.md format that OpenCode reads natively. The author deepened the review skills with extra angles and cross-check passes because local models (Qwen, Llama) otherwise rush through without the reasoning depth that Sonnet or Opus provide by default. (more: https://www.reddit.com/r/LocalLLaMA/comments/1swf37n/opencodepowerpack_claude_code_skills_ported_to/)
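
The mechanical core of such a port is small. A sketch under stated assumptions: the commands/ layout comes from the post, while the SKILL.md frontmatter fields below are our guess at the expected schema, not a verified one:

```typescript
// Sketch of the porting step: re-emit each Claude Code command definition
// as an OpenCode skill directory containing a SKILL.md. The frontmatter
// fields are assumptions, not a verified OpenCode schema.
import { mkdirSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { basename, join } from "node:path";

function portCommands(pluginDir: string, outDir: string): void {
  for (const file of readdirSync(join(pluginDir, "commands"))) {
    if (!file.endsWith(".md")) continue;
    const body = readFileSync(join(pluginDir, "commands", file), "utf8");
    const name = basename(file, ".md");
    mkdirSync(join(outDir, name), { recursive: true });
    writeFileSync(
      join(outDir, name, "SKILL.md"),
      `---\nname: ${name}\ndescription: Ported from a Claude Code plugin\n---\n\n${body}`,
    );
  }
}
```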

On the visual side, a detailed workflow guide demonstrates how ChatGPT Images 2.0 (gpt-image-2) and Claude Design complement each other for rapid prototyping. ChatGPT Images handles creative exploration and raw visual asset generation; Claude Design handles systematic design systems, component libraries, and interactive prototypes. The combined workflow reportedly produces institutional-grade brand identity and product prototypes in hours, instead of the six weeks and $20K-$50K a design agency would charge. Figma's stock fell 7.28 percent on the day Claude Design launched. (more: https://linas.substack.com/p/chatgpt-images-2-claude-design-guide)

Local AI at Consumer Scale

Luce DFlash is a standalone C++/CUDA stack that brings DFlash speculative decoding to GGUF models on a single RTX 3090. Running Qwen3.6-27B with a matched draft model from Z-Lab, it achieves a mean 1.98x speedup over autoregressive generation across HumanEval, Math500, and GSM8K, peaking at 2.24x on HumanEval (78.16 tok/s versus 34.90 tok/s baseline). The KV cache compresses to TQ3_0 at 3.5 bits per value, roughly 9.7x smaller than FP16, allowing 256K context to fit in 24 GB of VRAM. Sliding-window flash attention at decode keeps 60K context at 89.7 tok/s instead of the 25.8 tok/s it would otherwise hit. No llama.cpp dependency, no Python runtime in the engine, no vLLM: just a binary linking libggml. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sx8uok/luce_dflash_qwen3627b_at_up_to_2x_throughput_on_a/)
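
Speculative decoding is simple to state, even if fast implementations are not: a cheap draft model proposes a block of tokens, the expensive target verifies them in a single pass, and generation keeps the longest agreeing prefix. A greedy-decoding sketch with stub models (real engines batch the verification step and use probabilistic acceptance rules; this shows the shape, not the Luce implementation):

```typescript
// Greedy speculative decoding, distilled. A Model maps a context to its
// next-token choice; stubs stand in for real draft/target inference.
type Model = (context: number[]) => number;

function speculativeDecode(draft: Model, target: Model, prompt: number[], k: number, maxNew: number): number[] {
  const out = [...prompt];
  while (out.length - prompt.length < maxNew) {
    // 1. Draft proposes k tokens autoregressively (cheap).
    const proposal: number[] = [];
    for (let i = 0; i < k; i++) proposal.push(draft([...out, ...proposal]));
    // 2. Target verifies position by position (one batched pass in a real
    //    engine, which is where the speedup comes from).
    let accepted = 0;
    for (; accepted < k; accepted++) {
      const expected = target([...out, ...proposal.slice(0, accepted)]);
      out.push(expected);
      if (expected !== proposal[accepted]) break; // first mismatch: keep target's token, stop
    }
    // 3. If all k matched, the verification pass yields one bonus token for free.
    if (accepted === k) out.push(target(out));
  }
  return out.slice(prompt.length, prompt.length + maxNew);
}
```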

The practical consequence of this kind of throughput is visible in a user running Qwen3.6-27B Q8 on dual RTX 3090s with 200K context as a drop-in backend for Claude Code. Eight hours of vibe-coding (a full-stack Rust server with SSE dashboard) would have cost $142 in API calls. Instead it cost under $4 in electricity. The rig cost NZ$4,500 to build, giving a payback period of roughly 260 hours of use, or about 30 working days at full-time pace. The user interacted with the agent perhaps five times total: once to prompt, four times for UI tweaks. (more: https://www.reddit.com/r/LocalLLaMA/comments/1st3m8y/qwen_36_is_actually_useful_for_vibecoding_and_way/)
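
The payback arithmetic checks out at face value (the post mixes NZ$ hardware cost with US$ API pricing; the sketch below keeps that mix rather than inventing an exchange rate):

```typescript
// Reproducing the post's payback math. Currency units are mixed exactly
// as in the original (NZ$ rig cost against US$ session costs).
const rigCost = 4500;          // NZ$ for the dual-3090 build
const apiSession = 142;        // US$ the 8-hour session would have cost via API
const electricitySession = 4;  // US$ (approx.) in electricity for the same session
const hoursPerSession = 8;

const savedPerHour = (apiSession - electricitySession) / hoursPerSession; // ~17.25
const paybackHours = rigCost / savedPerHour;                              // ~261 hours
console.log(`${savedPerHour.toFixed(2)}/hr saved, payback in ~${Math.round(paybackHours)} hours`);
```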

Google is pushing local inference in a different direction: the Prompt API ships Gemini Nano as a built-in capability in Chrome. Developers can send natural language requests to the model running entirely on-device: no API keys, no network calls, no per-token billing. The API supports text, image, and audio input, structured JSON output via schema constraints, session management with context windows, and streaming responses. Requirements are modest by desktop standards (22 GB free disk space, 4 GB+ VRAM or 16 GB RAM with 4+ CPU cores) but exclude mobile entirely for now. The obvious applications are content classification, article summarization, contact extraction, and content filtering: tasks where latency and privacy matter more than peak capability. (more: https://developer.chrome.com/docs/ai/prompt-api)
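
In code, the developer surface is a handful of calls. The sketch below follows the shapes in the Chrome docs at time of writing, with ambient declarations since the global is not yet in standard TypeScript libs; treat the exact names as assumptions against a still-moving API:

```typescript
// Prompt API sketch per the Chrome docs (the surface may shift; these
// declarations are our assumptions, not shipped typings).
declare const LanguageModel: {
  availability(): Promise<"unavailable" | "downloadable" | "downloading" | "available">;
  create(opts?: {
    initialPrompts?: { role: "system" | "user" | "assistant"; content: string }[];
  }): Promise<{ prompt(input: string): Promise<string> }>;
};

async function classifyOnDevice(text: string): Promise<string> {
  if ((await LanguageModel.availability()) === "unavailable") {
    throw new Error("Gemini Nano is not available on this device");
  }
  const session = await LanguageModel.create({
    initialPrompts: [
      { role: "system", content: "Classify the text as spam or not-spam. Reply with one word." },
    ],
  });
  return session.prompt(text); // runs on-device: no API key, no network, no per-token bill
}
```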

Fish Audio has released S2-Pro, a next-generation speech synthesis model on Hugging Face, adding another entry to the rapidly growing local TTS ecosystem alongside Kokoro and Kitten. (more: https://huggingface.co/fishaudio/s2-pro)

The Verifier Is the Moat

Andrej Karpathy's autonomous research loop (propose, implement, measure, keep the wins) has been pointed at a CPU. The Auto-Architecture project starts with a 5-stage in-order RV32IM core in SystemVerilog and lets an LLM propose microarchitectural hypotheses while a rigorous verification gate evaluates each one: 53 symbolic BMC checks for ISA correctness, byte-identical cosimulation against a Python ISS with random bus stalls, place-and-route on a Gowin GW2A FPGA with 3-seed median Fmax, and CRC validation against canonical CoreMark values. Of 73 hypotheses over 9 hours 51 minutes, 63 were rejected. The 10 winners produced a core running at 2.91 CoreMark/MHz at 199 MHz with 5,944 LUTs: 92 percent over the locked baseline, and 56 percent over VexRiscv's published numbers in CoreMark iterations per second with 40 percent fewer LUTs. (more: https://github.com/FeSens/auto-arch-tournament/blob/main/docs/auto-arch-tournament-blog-post.md)

The paper's sharpest insight is not about the loop. It is about the verifier. One hypothesis tried to add a file outside the allowed path sandbox; it was rejected before the eval ran, because if you let the agent edit the harness, eventually it will edit the harness. Another hypothesis collapsed fitness by 73 percent in a single round; the orchestrator caught it on comparison against baseline. The author argues that the next wave of companies will not be differentiated by their planner or their model. They will be differentiated by their verifier: the artifact that encodes what "correct" actually means in their domain. In a CPU it is an ISA and a formal property suite. In a billing pipeline it is invariants on a ledger. In a clinical workflow it is a property the FDA has signed off on. If you can write the rules down, an agent will satisfy them faster than your team. If you cannot, the agent will satisfy a different set of rules: the ones it inferred from what it could observe. (more: https://github.com/FeSens/auto-arch-tournament/blob/main/docs/auto-arch-tournament-blog-post.md)
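
The skeleton of such a gate is small enough to write down. A sketch only: the real gate runs 53 BMC checks, cosimulation, and place-and-route where this version has stubs, but the ordering (sandbox, then correctness, then anti-regression) is the transferable part:

```typescript
// Skeleton of a verifier gate: sandbox first, correctness second,
// anti-regression last. (A sketch; the stubs stand in for the project's
// real BMC, cosimulation, and P&R checks.)
interface Proposal {
  touchedFiles: string[];
  passesCorrectness: () => boolean; // stands in for ISA checks + cosim
  fitness: () => number;            // stands in for CoreMark/MHz at Fmax
}

function gate(p: Proposal, baselineFitness: number, sandbox: string): number | null {
  // 1. Reject before any eval if the agent reaches outside its sandbox:
  //    if it can edit the harness, eventually it will edit the harness.
  if (!p.touchedFiles.every((f) => f.startsWith(sandbox))) return null;
  // 2. Correctness is a hard gate, never a score to trade against.
  if (!p.passesCorrectness()) return null;
  // 3. Keep only wins: anything at or below the locked baseline is discarded.
  const f = p.fitness();
  return f > baselineFitness ? f : null;
}
```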

Andriy Burkov makes the complementary theoretical argument: LLM-based agents are not rational decision-makers. They are text generators that can imitate the surface form of deliberation. An LLM optimizes next-token probability conditioned on prompt and training distribution. It does not optimize expected utility for the user. For narrow, constrained tasks where success criteria are close to training patterns, the imitation is good enough. For general-purpose problem solving, the gap becomes fatal: the system lacks stable preferences, calibrated beliefs, causal models, and the discipline to choose the boring action with maximal expected utility when that action is unlike anything in its training data. Commenters push back, noting humans are not expected utility maximizers either, but Burkov's core point stands: fluency is not rationality, and a plausible plan is not an expected-utility calculation. (more: https://www.linkedin.com/posts/andriyburkov_if-you-dont-understand-this-you-will-not-share-7454736389209743360-guT9)
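
A toy computation makes the distinction concrete. The numbers below are invented for illustration; the point is that the expected-utility choice and the most plausible-sounding action can diverge sharply:

```typescript
// Burkov's distinction in miniature: an expected-utility agent scores
// each action by sum(p * payoff). Numbers are invented for illustration.
type Outcome = { p: number; payoff: number };

const actions: Record<string, Outcome[]> = {
  boringBackupRestore: [{ p: 0.99, payoff: 1 }, { p: 0.01, payoff: -10 }],
  flashyLiveRewrite:   [{ p: 0.60, payoff: 5 }, { p: 0.40, payoff: -50 }],
};

const expectedUtility = (outcomes: Outcome[]): number =>
  outcomes.reduce((sum, o) => sum + o.p * o.payoff, 0);

for (const [name, outcomes] of Object.entries(actions)) {
  console.log(name, expectedUtility(outcomes).toFixed(2));
}
// boringBackupRestore 0.89, flashyLiveRewrite -17.00: the boring action
// wins on expected utility even when the flashy one reads as the more
// plausible next move in a transcript.
```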

A new INSEAD working paper studying 515 high-growth startups gives this problem a practical name: the "mapping problem." Both treatment and control firms received AI training, API credits, mentorship, and tools. Treated firms additionally saw case studies of how other firms had reorganized production around AI. The result: 44 percent more AI use cases discovered, 12 percent more tasks completed, 18 percent higher likelihood of acquiring paying customers, 1.9x higher revenue, and 39.5 percent lower demand for external capital. Access was not the differentiator; search quality was. The organizations that moved were not handed better tools; they were helped to see the work differently. The strongest gains appeared in product development and business operations, not in the "obvious layer" of drafting, summarizing, and research. The paper's sharpest implication: as models improve, the search space of plausible AI insertions expands, and without mapping discipline, organizations become better supplied with distractions, not strategies. (more: https://unhypedai.substack.com/p/most-organisations-are-looking-for)

In a brief but telling geopolitical footnote, China has blocked Meta's $2 billion acquisition of Manus, the AI agent startup. The move fits a pattern: Beijing treats AI agent infrastructure as strategic capacity, not a commercial asset to be sold to a foreign platform company. (more: https://www.reddit.com/r/AINewsMinute/comments/1syqgo3/china_has_blocked_metas_2_billion_purchase_of_ai/)

Sources (22 articles)

  1. 4TB of voice samples just stolen from 40k AI contractors at Mercor (app.oravys.com)
  2. Three men are facing charges in Toronto SMS Blaster arrests (tps.ca)
  3. Your phone is about to stop being yours (keepandroidopen.org)
  4. OpenAI almost banned me because I tried to automate YouTube download (reddit.com)
  5. [Editorial] Video editorial submission (youtu.be)
  6. SWE-bench Verified no longer measures frontier coding capabilities (openai.com)
  7. Opus 4.7: Are these first signs of model collapse? (reddit.com)
  8. Ghostty is leaving GitHub (mitchellh.com)
  9. Tangled – We need a federation of forges (blog.tangled.org)
  10. Zed is 1.0 (zed.dev)
  11. Caveman – Claude Code skill that cuts 75% of tokens by talking like caveman (github.com)
  12. [Editorial] Matt Pocock's Claude Code Skills (github.com)
  13. Opencode-power-pack – Claude Code skills ported to OpenCode (reddit.com)
  14. [Editorial] ChatGPT Images 2 + Claude Design Guide (linas.substack.com)
  15. Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 (reddit.com)
  16. Qwen 3.6 is actually useful for vibe-coding – and way cheaper than API (reddit.com)
  17. The Prompt API – Chrome's built-in AI (developer.chrome.com)
  18. Fish Audio S2-Pro – next-gen speech synthesis model (huggingface.co)
  19. Auto-Architecture: Karpathy's Loop, pointed at a CPU (github.com)
  20. [Editorial] Andriy Burkov on ML Understanding (linkedin.com)
  21. [Editorial] Most organisations are looking for... (unhypedai.substack.com)
  22. China has blocked Meta's $2 billion purchase of AI firm Manus (reddit.com)