AI vs AI: An Agent Hacks McKinsey in Two Hours

Today's AI news: AI vs AI: An Agent Hacks McKinsey in Two Hours, From Detection to Clarity: Securing the Software You Actually Run, The AI-Native Organization: Trail of Bits Shows the Blueprint, Trust but Verify: The Review Crisis in Agentic Coding, On-Device AI: The Full Local Stack Arrives, Open-Weight Model Craft: Upscaling, Ablation, and Billion-Dollar Bets, Agent Orchestration: Gas Town Goes Industrial. 22 sources curated from across the web.

AI vs AI: An Agent Hacks McKinsey in Two Hours

CodeWall's autonomous offensive agent needed no credentials, no insider knowledge, and no human guidance to crack McKinsey's AI platform Lilli — the internal chatbot used by 72 percent of the firm's 43,000-plus consultants, processing over 500,000 prompts per month. Within two hours the agent had full read-write access to the production database. The haul: 46.5 million plaintext chat messages covering strategy, M&A, and client engagements; 728,000 confidential files (192,000 PDFs, 93,000 Excel spreadsheets, 93,000 PowerPoint decks); 57,000 user accounts; and 95 system prompts that governed Lilli's behavior — all writable. (more: https://www.theregister.com/2026/03/09/mckinsey_ai_chatbot_hacked)

The attack chain was not exotic. The agent found publicly exposed API documentation — over 200 endpoints, fully documented — and identified 22 that required no authentication. One of those unprotected endpoints wrote user search queries to the database with parameterized values, but the JSON field names were concatenated directly into SQL. Standard tools, including OWASP ZAP, missed it entirely. The agent recognized JSON keys reflected verbatim in database error messages, ran fifteen blind iterations to map the query shape, and then live production data started flowing back. Because the injection was read-write, an attacker could have silently rewritten Lilli's system prompts — poisoning how the chatbot answered every consultant's query, stripping guardrails, or embedding data-exfiltration instructions — with a single UPDATE wrapped in one HTTP call. No deployment, no code change, no log trail. McKinsey patched all unauthenticated endpoints within hours of disclosure and says forensic investigation found no evidence of unauthorized access by third parties. (more: https://codewall.ai/blog/how-we-hacked-mckinseys-ai-platform)
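The flaw class is easy to reproduce in miniature. The sketch below (illustrative only, not McKinsey's actual code) parameterizes the value but concatenates the attacker-controlled JSON field name into the SQL string, and shows how a probing key gets reflected verbatim in the database error message, the same signal the agent used to map the query shape:

```python
import sqlite3

# Vulnerable pattern: the VALUE is parameterized, but the JSON field NAME
# is concatenated directly into the SQL statement.
def log_search(conn, payload):
    for field, value in payload.items():
        # `field` comes straight from attacker-controlled JSON keys.
        conn.execute(f"INSERT INTO searches ({field}) VALUES (?)", (value,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE searches (query TEXT)")

log_search(conn, {"query": "market sizing"})  # benign use works as intended

# A probing key reaches the SQL parser, and the error message echoes it back,
# confirming injection without any visible data yet (blind iteration).
try:
    log_search(conn, {"no_such_column": "probe"})
    err_msg = ""
except sqlite3.OperationalError as e:
    err_msg = str(e)

rows = conn.execute("SELECT query FROM searches").fetchall()
```

Pattern-matching scanners look for unparameterized values; here every value is bound correctly, which is plausibly why OWASP ZAP missed it.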

The McKinsey breach crystallizes a pattern that keeps repeating: AI platforms inherit classic web vulnerabilities because the teams building them treat the prompt layer as configuration, not code. System prompts — the instructions controlling how an AI behaves — sit in the same databases as user data, rarely with access controls, version history, or integrity monitoring. As CodeWall's CEO put it, hackers will use the same autonomous agent strategies "to attack indiscriminately, with a specific objective in mind." Meanwhile, the offensive recon tooling keeps getting cheaper. Red Amon's latest Wave Runner feature eliminates redundant LLM reasoning steps by parallelizing independent tool calls within attack chains — grouping curl, httpx, and katana into a single wave, then analyzing combined results in one pass. The result: roughly 35 percent fewer agent steps and 30 percent lower token cost per attack chain. (more: https://www.linkedin.com/posts/samuele-giampieri-b1b67597_redamon-cybersecurity-recon-share-7436863565342547968--QBU)
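The wave idea is straightforward to sketch. Red Amon's implementation is not public, so the following is a minimal stand-in: independent recon tools run concurrently, and their combined output feeds a single reasoning pass instead of one LLM call per tool:

```python
import asyncio

# Stand-in for an I/O-bound recon tool (curl, httpx, katana in the source).
async def run_tool(name, target):
    await asyncio.sleep(0.01)  # simulated network latency
    return {"tool": name, "target": target, "findings": []}

async def run_wave(tools, target):
    # All tools in the wave execute concurrently; the combined result list
    # would then go to the LLM in ONE analysis pass.
    return await asyncio.gather(*(run_tool(t, target) for t in tools))

results = asyncio.run(run_wave(["curl", "httpx", "katana"], "example.com"))
```

Eliminating the per-tool reasoning step is where the claimed ~35 percent reduction in agent steps comes from: the LLM is invoked once per wave, not once per tool.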

DryRun Security's agentic coding security report makes the case that the attack surface is not just external. Their Contextual Security Analysis engine reasons about exploitability and impact across data flows, architecture, and change history — catching logic flaws and broken authorization that pattern-matching scanners miss. The pitch resonates because AI-generated code is now the majority source in many engineering teams, and the errors it produces have evolved from syntax bugs into architectural landmines that detonate several pull requests deep. One customer building AI-driven shopping experiences chose DryRun specifically because "OWASP LLM app risks are all about context." When every PR potentially carries AI-introduced vulnerabilities, contextual analysis — not regex — is what separates signal from noise. (more: https://www.dryrun.security/the-agentic-coding-security-report)

From Detection to Clarity: Securing the Software You Actually Run

Bynario asks a question most vulnerability scanners cannot answer: is this flaw actually triggerable in your environment? The startup, founded by researchers who have found bugs in everything from kernels to browsers, directly analyzes compiled software — the executable binaries running in production, not the source code that may or may not match what shipped. Their autonomous pipeline identifies both known CVEs and previously undisclosed issues, then validates whether each flaw can be exploited under your specific configuration. The distinction matters because patching is not always possible, version checks miss undisclosed vulnerabilities, and the sheer volume of unprioritized scanner output has turned triage into a full-time job. Bynario's approach inverts the usual workflow: instead of generating more signals, it reduces the pile to the subset that represents real exposure. They demonstrated the pipeline by discovering multiple vulnerabilities in macOS, iOS, and iPadOS — closed-source ecosystems where binary analysis is the only option. (more: https://bynar.io/blog/from-detection-to-clarity-the-next-phase-of-software-security)

Skillsmith's new Dependency Intelligence layer attacks a different blind spot in the agentic toolchain. Skills — the reusable markdown workflows that power Claude Code and similar agent platforms — reference MCP servers, models, and other skills throughout their content, but until now that information was invisible at install time. A governance skill calls mcp__linear__save_issue; the user installs it; Linear is not configured; the skill fails with no explanation. Dependency Intelligence triangulates from three signal sources — author declarations with typed versioning (confidence 1.0), static content analysis that detects mcp__server__tool patterns and assigns confidence based on whether references appear in prose (0.9) or code blocks (0.5), and behavioral co-install data that ramps linearly as usage patterns emerge. Seven MCP tools now surface dependency data across the full skill lifecycle, from skill_validate warnings before publish to skill_outdated health checks that function like npm outdated for agent workflows. Crucially, Skillsmith chose not to build a resolver — hard dependencies block installs with clear messages, everything else is advisory. Auto-installing a dependency means silently adding instructions to an agent's system prompt, crossing a trust boundary they are not comfortable crossing automatically. (more: https://www.skillsmith.app/blog/dependency-intelligence)
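The static-analysis signal can be sketched with a few lines of regex. This is a hedged simplification of what Skillsmith describes, not their implementation: it assumes server names contain no underscores and approximates "code block" detection with indentation, but the confidence weights follow the article (prose 0.9, code blocks 0.5):

```python
import re

# Matches mcp__<server>__<tool> references; server names assumed
# underscore-free in this sketch so the two groups split unambiguously.
TOOL_RE = re.compile(r"\bmcp__([a-z0-9]+)__([a-z0-9_]+)\b")

def scan_skill(markdown):
    deps = {}
    for line in markdown.splitlines():
        in_code = line.startswith("    ")  # simplified code-block heuristic
        for m in TOOL_RE.finditer(line):
            conf = 0.5 if in_code else 0.9
            # Keep the highest-confidence sighting per server.
            deps[m.group(1)] = max(deps.get(m.group(1), 0.0), conf)
    return deps

doc = (
    "Use mcp__linear__save_issue to file the ticket.\n"
    "    mcp__github__create_pr(title='fix')\n"
)
deps = scan_skill(doc)
```

A prose reference to the Linear tool scores 0.9; the same pattern inside code scores 0.5, reflecting that code examples may be illustrative rather than required.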

The AI-Native Organization: Trail of Bits Shows the Blueprint

Dan Guido opened his [un]prompted talk with a provocation: thousands of CEOs surveyed, and AI had no measurable impact on employment or productivity. Economists are calling it the new Solow paradox — "you can see the computer age everywhere but in the productivity statistics." Guido's diagnosis is blunt: most companies are doing it wrong. They gave people tools without changing the system. Everyone gets a ChatGPT license, leadership waits for the productivity numbers to move, and they don't. That is the gap between AI-assisted and AI-native — one is a tool, the other is an operating system. (more: https://youtu.be/ysWMHozWDwA)

Trail of Bits built the operating system. The security consultancy standardized on Claude Code, wrote an AI Handbook that removes ambiguity about which tools are approved for which data types, created an AI Maturity Matrix that makes adoption a first-class professional capability with clear levels and consequences for staying stuck, ran hackathons that forced engineers into bypass-permissions mode to learn real sandboxing constraints, and captured everything as reusable artifacts in internal and external skills repos. The results after roughly a year: 94 plugins containing 201 skills, 84 specialized agents, and over 400 reference files encoding domain expertise across the full consulting lifecycle — from vulnerability-class-specific analyzers to sales proposal generators. On certain engagements, bug discovery went from 15 per week to 200. Twenty percent of all bugs reported to clients are now initially discovered by AI. The sales team averages $7–8 million in revenue per representative, roughly double the consulting industry benchmark. (more: https://github.com/trailofbits/publications/blob/master/presentations/How%20we%20made%20Trail%20of%20Bits%20AI-Native%20(so%20far)/slides-with-notes.pdf)

What makes Trail of Bits' approach replicable is the psychological framework underneath it. Guido identified four barriers to adoption — self-enhancing bias (seniors trust intuition that got them there), identity threat (security auditing is deeply symbolic craft work), opacity (people feel they understand human judgment but objectively understand neither), and intolerance for imperfection (one bad AI output confirms every skeptic's priors). The remedies were structural: the maturity matrix counters self-enhancing bias by making the ladder visible; skills repos reframe identity from "I don't need AI" to "I'm the one who makes AI dangerous"; a curated marketplace and sandboxing tools reduce embarrassing failures; the CEO using tools visibly every day signals permanence to the passive 50 percent. Open questions remain — private inference gaps, prompt injection on client code, and policy enforcement at scale — but the system is compounding weekly. Simon Wardley would appreciate the framing. His argument that "vibe coding" grotesquely misdescribes the actual work — which is managing, handling, and controlling systems of AI agents — maps precisely to what Trail of Bits operationalized. The real job title is AI wrangler, not vibe coder. The distinction matters because it tells engineers what skills they need and tells executives why this matters. (more: https://www.linkedin.com/posts/simonwardley_what-was-conversational-programming-2016-share-7437231229558554624-ttJX)

Trust but Verify: The Review Crisis in Agentic Coding

Marc Shade ran BullshitBench v2 against his production AI stack — 100 expert-level questions loaded with fabricated terminology like "differential indemnity decomposition" and "Causal Dependency Fingerprinting" — and the results were humbling. Raw local models (14B-parameter): zero percent pushback; every fake framework accepted as real, with detailed implementation advice generated for things that do not exist. Gemini CLI scored 86 percent pushback. Claude with configured anti-fabrication rules hit 80 percent, but with a revealing domain split: perfect scores on medical and software engineering questions, but only 1 out of 3 on legal, where the model accepted "proportional fault cascade analysis" as a real M&A methodology. The critical finding: custom system prompts are necessary but not sufficient. In domains where fabricated jargon blends with real niche terminology, the model's desire to be helpful overwhelms the instruction. And variance is the real enemy — same model, same rules, same questions, different runs produced 50 percent and 80 percent pushback rates. Without automated evaluation infrastructure, you are guessing. (more: https://www.linkedin.com/pulse/your-ai-says-whatever-you-want-hear-heres-how-measure-marc-shade-6xmtf)
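The variance finding is the part worth automating first. A minimal harness — with the LLM call stubbed out here so the arithmetic is visible; the detection heuristic and canned outputs are assumptions, not BullshitBench's method — just scores multiple runs and reports the spread:

```python
import statistics

def pushback_rate(answers):
    # An answer "counts" if the model flags the term as unrecognized.
    flagged = sum(1 for a in answers if "not a recognized" in a.lower())
    return flagged / len(answers)

# Two stubbed runs of the same 10-question benchmark, mirroring the
# 50% vs 80% spread reported in the article.
run_a = ["Not a recognized framework."] * 5 + ["Here is how to apply it..."] * 5
run_b = ["Not a recognized framework."] * 8 + ["Here is how to apply it..."] * 2

rates = [pushback_rate(run_a), pushback_rate(run_b)]
spread = max(rates) - min(rates)
mean = statistics.mean(rates)
```

A single-run score of 80 percent tells you almost nothing if the run-to-run spread is 30 points; the mean across many runs, with the spread attached, is the number to track.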

The sycophancy problem compounds when agents operate unsupervised. One developer building agents that write code overnight realized he had no reliable way to know if any of it was correct. Teams using Claude for everyday PRs are merging 40–50 a week instead of 10, spending dramatically more time in code review. When Claude writes tests for code Claude just wrote, it is checking its own work — a self-congratulation machine that proves code does what Claude thought you wanted, not what you actually wanted. The proposed fix borrows from TDD: write acceptance criteria in plain English before prompting, let the agent build against them, run Playwright-based browser verification against each criterion, review only the failures. The implementation uses four stages — a zero-LLM preflight check, one Opus call for planning, parallel Sonnet calls for per-criterion verification, and a final Opus call for verdicts. It does not catch spec misunderstandings, but it catches integration failures, rendering bugs, and behavior that works in theory but breaks in a real browser — which is more than code review was reliably catching anyway. (more: https://www.claudecodecamp.com/p/i-m-building-agents-that-run-while-i-sleep)
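The four-stage flow reduces to a simple skeleton. Everything below is stubbed — the real post uses Opus, Sonnet, and Playwright; these function bodies are hypothetical placeholders standing in for those calls:

```python
def preflight(criteria):
    # Stage 1: zero-LLM sanity check -- drop empty/malformed criteria.
    return [c for c in criteria if c.strip()]

def plan(criteria):
    # Stage 2: one planning call (Opus in the post) maps criteria to checks.
    return [{"criterion": c, "check": f"verify: {c}"} for c in criteria]

def verify(check):
    # Stage 3: per-criterion verification (parallel Sonnet + browser in the
    # post); here a toy rule stands in for a real browser run.
    return {"criterion": check["criterion"],
            "passed": "logout" not in check["criterion"]}

def verdict(results):
    # Stage 4: final pass surfaces only the failures for human review.
    failures = [r for r in results if not r["passed"]]
    return {"passed": len(results) - len(failures), "failures": failures}

criteria = ["user can log in", "user can logout", ""]
report = verdict([verify(c) for c in plan(preflight(criteria))])
```

The economics follow from the shape: one expensive planning call, cheap parallel per-criterion calls, and human attention spent only on the failure list.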

The economics of local code review offer another path. One engineer wired Claude Code to a fine-tuned Qwen3-Coder-Next served through vLLM on a local GPU devbox and built the scaffolding most people ignore: an orchestrator agent to decompose and route reviews, specialist agents for C++/Win32, C#/XAML, and security policy, a synthesis pass to merge outputs, and a verifier pass to remove duplicate findings, weak claims, and severity inflation. Isolated git worktrees per run, a trusted context manifest, deterministic bootstrap from PRs, structured findings schemas, and a tmux "swarm theater" for live observation. The punchline: "The raw model is not the product. The system around it is." A good model without orchestration is a toy. A good model with professional scaffolding behaves like a system. (more: https://www.linkedin.com/posts/ownyourai_me-i-want-one-claude-code-review-for-25-activity-7437158940104196096-gZu9)
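The verifier pass is the least glamorous and most valuable piece of that scaffolding. A hedged sketch — the field names and the dedupe rule below are illustrative assumptions, not the engineer's published schema — shows the shape: merge duplicate findings, counter severity inflation, drop claims with no evidence:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    rule: str
    severity: str   # "low" | "medium" | "high"
    evidence: str

SEV_RANK = {"low": 0, "medium": 1, "high": 2}

def verify_pass(findings):
    merged = {}
    for f in findings:
        if not f.evidence:
            continue  # weak claim: no evidence attached, discard
        key = (f.file, f.line, f.rule)
        # Keep the LOWEST claimed severity for duplicates, a crude but
        # effective counter to severity inflation.
        if key not in merged or SEV_RANK[f.severity] < SEV_RANK[merged[key].severity]:
            merged[key] = f
    return list(merged.values())

raw = [
    Finding("a.cpp", 10, "use-after-free", "high", "freed at line 8"),
    Finding("a.cpp", 10, "use-after-free", "medium", "freed at line 8"),  # dup
    Finding("b.cs", 4, "hardcoded-secret", "high", ""),  # no evidence
]
clean = verify_pass(raw)
```

Three raw findings collapse to one defensible one, which is exactly the "system around the model" doing its job.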

On-Device AI: The Full Local Stack Arrives

The case for replacing cloud voice assistants with local alternatives just got a lot simpler. Home Assistant's Voice Preview Edition — a $70 open-source hardware puck with a dual mic array, physical mute switch, and XMOS audio chip for local noise cancellation — makes switching as easy as plugging in an Echo. The voice pipeline runs entirely on-network: Whisper STT converts speech to text (under 200ms on an Intel N100), Piper TTS produces surprisingly human neural speech, and the Wyoming protocol connects satellites to the Home Assistant server. With local LLM integration via Ollama, you can say "I'm heading to bed" and the system understands intent — locking doors, killing downstairs lights, confirming with a beep instead of Alexa's verbose acknowledgment. Response times are faster because there is no round-trip to a remote server, and a physical mute switch means the microphone is actually off. (more: https://www.xda-developers.com/replaced-alexa-with-local-voice-assistant-doesnt-send-to-any-cloud)
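The intent step in that pipeline is conceptually just transcript-to-actions routing. The toy sketch below is an assumption-heavy simplification — Home Assistant's actual intent engine and its Ollama integration work differently, and the phrase table and action names are invented:

```python
# Map recognized phrases to smart-home actions (names are hypothetical).
INTENTS = {
    "heading to bed": ["lock_front_door", "lights_off_downstairs", "beep"],
    "leaving the house": ["lock_front_door", "thermostat_eco", "beep"],
}

def handle_transcript(text):
    # `text` is what a local STT engine such as Whisper would return.
    for phrase, actions in INTENTS.items():
        if phrase in text.lower():
            return actions
    return []

actions = handle_transcript("I'm heading to bed")
```

The local-LLM upgrade replaces the brittle phrase table with a model that maps free-form speech to the same action vocabulary, which is what makes "I'm heading to bed" work without an exact trigger phrase.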

RunAnywhere's RCLI pushes the performance envelope further. This YC W26 startup built MetalRT, a proprietary GPU inference engine for Apple Silicon that delivers up to 550 tokens per second LLM throughput and sub-200ms end-to-end voice latency — no cloud, no API keys. The full STT + LLM + TTS pipeline runs on Metal GPU with three concurrent threads, ships with 38 macOS actions controllable by voice, and includes local RAG with hybrid vector plus BM25 retrieval at roughly 4ms latency. MetalRT's STT clocks 714x faster than real-time on M3 Max. The catch: MetalRT requires M3 or later; M1/M2 Macs fall back to llama.cpp automatically. (more: https://github.com/RunanywhereAI/rcli)

On-device inference is not limited to text and voice. One developer got TripoSR — a single-image-to-3D-mesh model — running fully on iPhone via ONNX Runtime with CoreML as the backend. The model weighs 1.6 GB, outputs triplane scene codes that feed marching cubes for mesh extraction, and runs on A17+ chips with no network calls. Memory management is the main challenge; at 1.6 GB the model must be loaded carefully to avoid jetsam kills. The rendering pipeline through RealityKit ended up being almost as much work as inference itself. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rogmro/i_got_triposr_image_3d_running_fully_ondevice_on/)

The last gap in the local stack — high-quality open-source speech synthesis — is closing fast. Hume AI's TADA (Text-Acoustic Dual Alignment) resolves the fundamental mismatch between text and audio representations in LLM-based TTS by aligning audio directly to text tokens one-to-one, creating a synchronized stream where text and speech move in lockstep. The result: a real-time factor of 0.09 (more than 5x faster than comparable systems), zero hallucinations across 1,000-plus test samples, and a footprint light enough for mobile deployment. Where conventional TTS exhausts a 2,048-token context window in about 70 seconds of audio, TADA accommodates roughly 700 seconds in the same budget. The 1B- and 3B-parameter Llama-based models are available now under an open-source license. (more: https://www.hume.ai/blog/opensource-tada)
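A back-of-envelope check makes the context-budget claim concrete: dividing the window by the seconds of audio it covers gives the token cost per second of speech.

```python
# Token cost per second of generated audio under a 2,048-token window,
# using the figures quoted above.
window = 2048
conventional = window / 70    # ~29 tokens per second of audio
tada = window / 700           # ~2.9 tokens per second of audio
speedup = conventional / tada # 10x more audio per context window
```

A 10x reduction in tokens-per-second-of-audio is what turns long-form narration from a context-management problem into a single forward pass.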

Open-Weight Model Craft: Upscaling, Ablation, and Billion-Dollar Bets

The open-weight community's model surgery techniques are getting increasingly sophisticated. Heretic's new Arbitrary-Rank Ablation (ARA) method has finally defeated the heavy-handed alignment guardrails OpenAI applied to GPT-OSS, reducing refusals from 74 (the previous best with MPOA+SOMA techniques) to near zero — without requiring system messages. ARA extends the rank-1 ablation approach that was already effective at identifying and removing refusal-direction vectors in model weights, generalizing to arbitrary rank for finer-grained control. The main open question is whether the extra ranks capture meaningfully different structure or risk overfitting. The method is experimental and currently available only in an unreleased Heretic version, but when it ships broadly it represents a significant escalation in the community's ability to post-process any released model into an unconstrained variant. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rnic0a/heretic_has_finally_defeated_gptoss_with_a_new/)
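Since ARA itself is unreleased, the closest public reference point is the textbook projection underlying refusal-direction ablation, generalized from one direction to r. The sketch below is that general idea, not Heretic's code: given r orthonormal "refusal directions" V, remove their component from any weight matrix W that writes into the residual stream.

```python
import numpy as np

# Rank-r directional ablation: W' = (I - V V^T) W, where V is a d x r
# orthonormal basis of directions to remove. Rank-1 (r=1) is the classic
# single refusal-direction case; ARA generalizes the rank.
rng = np.random.default_rng(0)
d, k, r = 16, 8, 3

V, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal directions
W = rng.standard_normal((d, k))                   # toy weight matrix

W_ablated = W - V @ (V.T @ W)

# Outputs of the ablated matrix carry no component along any ablated
# direction: V^T W' = 0 up to floating-point error.
residual = np.abs(V.T @ W_ablated).max()
```

The open question the post flags maps directly onto r: whether directions beyond the first capture genuinely distinct refusal structure or just overfit to the probe set.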

Fat Fish takes the opposite approach: instead of removing alignment, it adds capacity. This experimental upscale of Mistral Nemo goes from 32 to 56 layers, 32 to 48 attention heads, and 8 to 12 key-value heads, while pruning intermediate size from 14,336 to 12,608 — a simultaneous upscale and prune that cost roughly $1,000 in compute. The creator's rationale is pragmatic: Mistral Nemo is excellent but dense models are increasingly rare as the field gravitates toward mixture-of-experts architectures, so "if I can't get a new base model, I'll make one myself." No layers were zeroed out (unlike typical stack merges), and noise was injected to encourage divergence in duplicated tensors. The result is coherent, follows instructions, and handles new languages — released as a base for the community to fine-tune rather than a finished product. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rqplvy/mistral_nemo_upscale_but_kinda_weird/)
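The duplicate-and-perturb recipe can be sketched in a few lines. The actual Fat Fish tensor surgery is not published, so this is a hedged illustration of the stated idea only: copy existing layers to reach the target depth and inject small noise so the copies can diverge under further training.

```python
import numpy as np

rng = np.random.default_rng(0)

def upscale(layers, target_depth, noise=1e-3):
    # Copy source layers in order until the target depth is reached,
    # perturbing each copy so duplicated tensors are not identical.
    out = list(layers)
    i = 0
    while len(out) < target_depth:
        src = layers[i % len(layers)]
        out.append(src + noise * rng.standard_normal(src.shape))
        i += 1
    return out

# Toy stand-in for Nemo's 32 transformer layers, grown to 56.
base = [rng.standard_normal((4, 4)) for _ in range(32)]
grown = upscale(base, 56)
```

The noise term is the detail the creator calls out: without it, duplicated tensors receive identical gradients and the extra capacity never differentiates.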

On the deployment side, a community benchmarking effort systematically optimized Qwen3 Coder for two GPU tiers. On RTX 5090 (32 GB), a 4-bit AWQ quantized Qwen3-Coder-30B-A3B hit 1,157 tokens per second with sub-second time-to-first-token at MCR=16 via vLLM, fitting 114K context. On PRO 6000 (96 GB), the official FP8 Qwen3-Coder-Next reached 988 tok/s at MCR=32 with full 262K context — though vLLM peaked at 1,207 tok/s at MCR=40 for throughput-optimized workloads, beating SGLang's best by 34 percent. The entire benchmarking infrastructure is open via DeploDock: fork the repo, write a recipe YAML specifying model, engine parameters, and sweep matrices, open a PR, and a maintainer triggers cloud-provisioned benchmark runs with results posted back automatically. Available GPUs range from RTX 4090 through H200 and B200. (more: https://www.reddit.com/r/LocalLLaMA/comments/1rm0qd5/optimizing_qwen3_coder_for_rtx_5090_and_pro_6000/)
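A recipe in that workflow might look like the following. This is a hypothetical shape only — the real schema lives in the DeploDock repo and these keys are illustrative, covering the three elements the post names: model, engine parameters, and sweep matrix.

```yaml
# Illustrative DeploDock-style recipe (key names are assumptions)
model: Qwen/Qwen3-Coder-30B-A3B-Instruct
engine: vllm
engine_args:
  quantization: awq
  max_model_len: 114000
sweep:
  max_concurrent_requests: [8, 16, 32, 40]
```

The appeal of the PR-driven flow is that contributors never need the hardware: the YAML is the experiment, and the maintainer's cloud runners produce the numbers.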

The capital flowing into open-weight model development is now industrial-scale. Yann LeCun's AMI Labs raised $1.03 billion to build world models — architectures that learn predictive representations of the physical world rather than next-token probabilities. (more: https://www.reddit.com/r/AINewsMinute/comments/1rpszfu/yann_lecuns_ami_labs_raises_103_billion_to_build/)

Agent Orchestration: Gas Town Goes Industrial

Steve Yegge's Gas Town is an agent orchestrator in the most literal sense — a system that coordinates 20 to 30 AI coding agents simultaneously with defined roles, workflow management, merge queues, patrol loops, and a level of operational complexity that Yegge himself compares to Kubernetes (and to a late-1800s factory that can disembowel you). People are using it to churn through massive implementation backlogs: parallel agent swarms file issues, write code, review each other's output, merge changes, and land features while the developer focuses on design and direction. The self-hosted reality, however, is punishing — tmux session management, compute provisioning, API key juggling across providers, monitoring, and recovery when things inevitably break. Kilo's managed version removes that operational tax. The full Gas Town environment deploys in seconds, agents scale elastically (5 or 50 as needed), convoys run around the clock with auto-recovery, and the Kilo Gateway provides access to over 500 models through a single API with zero markup on tokens. For a system that burns through tokens at Gas Town's rate, consolidated billing alone justifies the managed layer. Yegge merged over 100 PRs from nearly 50 contributors in the first 12 days after launch — keeping up with that pace manually is a job in itself. (more: https://blog.kilo.ai/p/gas-town-by-kilo)

The trajectory from single-agent coding assistants to multi-agent orchestration platforms like Gas Town underscores, from a different angle, the point behind archive.is: as AI-generated content proliferates, preserving and accessing the original human-authored sources becomes infrastructure, not convenience. (more: https://archive.is/wXvF3)

Sources (22 articles)

  1. [Editorial] McKinsey AI Chatbot Hacked (theregister.com)
  2. AI Agent Hacks McKinsey (codewall.ai)
  3. [Editorial] Red Amon — Faster and Cheaper Recon (linkedin.com)
  4. [Editorial] The Agentic Coding Security Report (dryrun.security)
  5. [Editorial] From Detection to Clarity — The Next Phase of Software Security (bynar.io)
  6. [Editorial] Dependency Intelligence (skillsmith.app)
  7. [Editorial] Trail of Bits — Unprompted Talk (youtu.be)
  8. How We Made Trail of Bits AI-Native (So Far) — Slides (github.com)
  9. [Editorial] Robot Wranglers — Simon Wardley on Conversational Programming (linkedin.com)
  10. [Editorial] Your AI Says Whatever You Want to Hear — Here's How to Measure It (linkedin.com)
  11. Agents That Run While I Sleep (claudecodecamp.com)
  12. [Editorial] Claude Code Review Economics (linkedin.com)
  13. [Editorial] Replacing Alexa with a Local Voice Assistant (xda-developers.com)
  14. RunAnywhere (YC W26) — Fastest AI Inference on Apple Silicon (github.com)
  15. TripoSR Image-to-3D Running Fully On-Device on iPhone via ONNX Runtime (reddit.com)
  16. TADA: Fast, Reliable Open-Source Speech Generation via Text-Acoustic Synchronization (hume.ai)
  17. Heretic Defeats GPT-OSS with Arbitrary-Rank Ablation (ARA) Decensoring (reddit.com)
  18. Fat Fish — A Proper Upscale and Prune of Mistral Nemo (reddit.com)
  19. Optimizing Qwen3 Coder for RTX 5090 and PRO 6000 — Community Benchmarking Infrastructure (reddit.com)
  20. Yann LeCun's AMI Labs Raises $1.03 Billion to Build World Models (reddit.com)
  21. [Editorial] Gas Town by Kilo (blog.kilo.ai)
  22. [Editorial] Archive Feature (archive.is)