Gemma 4 Drops — And the Community Immediately Gets to Work
Today's AI news: Gemma 4 Drops — And the Community Immediately Gets to Work, The Full Hardware Spectrum: Galaxy Watches to V100 Racks, The Altman Question, Agentic Coding Hits the Wall, Agent Infrastructure: Meta-Learning Harnesses and Desktop Automation, AI Meets Cybersecurity: Generalization and the Shortcut Problem, Post-Quantum Cryptography: "We Need to Ship", Model Ecosystem: Qwen 3.6, Omnivoice, and Single-Step Generators. 21 sources curated from across the web.
Gemma 4 Drops — And the Community Immediately Gets to Work
Google DeepMind's Gemma 4 family landed this week with something the open-weight community rarely gets: a genuinely competitive multimodal model under a permissive Apache 2 license. The lineup spans four sizes — from the tiny E2B (2.3B effective parameters) up to a 31B dense model — but the star is the 26B-A4B, a mixture-of-experts architecture with 128 experts and only 8 active per token (roughly 4B parameters per forward pass). The result is a model that achieves an LMArena Elo of ~1441 while fitting comfortably on a 48GB MacBook Pro at 51 tokens per second. For context, reaching comparable Elo scores typically requires models with 400B+ total parameters. The smaller E2B and E4B variants also support audio input, making them genuine multimodal on-device candidates. All models support vision, text, function calling, and configurable thinking modes, with day-zero support across transformers, llama.cpp, MLX, WebGPU, and Rust. (more: https://huggingface.co/blog/gemma4)
The community response was immediate and practical. George Liu published a detailed walkthrough of running the 26B-A4B locally using LM Studio 0.4.0's new headless CLI (lms and llmster), which extracts the inference engine into a standalone daemon — no GUI required. On his M4 Pro with 48GB unified memory, the Q4_K_M quantization (17.99GB) delivers 51 tok/s with a 48K context window and 1.5s time to first token. Liu also integrated it with Claude Code via a local OpenAI-compatible API, creating a zero-cost local inference backend for agentic workflows. His memory profiling shows the model fits at full 256K context within 37.48GB, leaving headroom on a 48GB machine. The key insight for MoE models: skip speculative decoding entirely — it blows up memory bandwidth by loading the union of all experts across speculative tokens, with benchmarks on Mixtral showing a 54% slowdown on math tasks. (more: https://ai.georgeliu.com/p/running-google-gemma-4-locally-with)
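The Claude Code integration works because LM Studio's server speaks the OpenAI chat-completions wire format, so any OpenAI-style client can point at it. A minimal sketch of what such a request looks like — the port matches LM Studio's default, but the model identifier is an assumption, not a confirmed value from Liu's writeup:

```python
import json
import urllib.request

# LM Studio's local server defaults to port 1234; the model id below
# is an illustrative assumption, not a confirmed identifier.
BASE_URL = "http://localhost:1234/v1"
MODEL_ID = "gemma-4-26b-a4b"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style /chat/completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def send(body: dict) -> dict:
    """POST to the local server; requires a running backend."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("Why does speculative decoding hurt MoE models?")
print(body["model"], len(body["messages"]))
```

Any tool that accepts a custom OpenAI base URL — Claude Code included, per Liu's setup — can then be pointed at `BASE_URL` for zero-cost local inference.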
Meanwhile, a community member discovered that swapping the F16 vision projector for Q8_0 costs no quality — it actually performed slightly better in some tests — while freeing roughly 30K tokens of context, enabling 60K+ total context with vision still active. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sdst2i/get_30k_more_context_using_q8_mmproj_with_gemma_4/) And within days of Gemma 4's release, an abliterated variant appeared using Adaptive Refusal Abliteration (ARA), a two-pass SVD technique that achieved the lowest refusal rate, highest quality score, and lowest KL divergence of any published abliteration of this model. Two optimized builds are available: APEX IQ Q5_K_M (~20 GB, imatrix-calibrated for llama.cpp) and MLX mixed-bit (~15 GB, tiered 4/6/8-bit for Apple Silicon), both with vision support. Full BF16 SafeTensors and GGUF are also available. (more: https://huggingface.co/jenerallee78/gemma-4-26B-A4B-it-ara-abliterated)
The Full Hardware Spectrum: Galaxy Watches to V100 Racks
The on-device AI frontier keeps pushing lower. A Caltech spinout claims a 1-bit 8B parameter model fitting in 1.15GB of memory that's competitive with Llama 3 8B on benchmarks — 440 tok/s on a 4090, 136 tok/s on M4 Pro, and reportedly ~40 tok/s on an iPhone. If those numbers hold up under independent testing, a private LLM on a phone stops being a party trick and starts being a product. The skepticism is warranted — benchmark-maxing startups are a dime a dozen — but the architecture (ternary weights, pure add/subtract computation) is sound enough to deserve scrutiny. (more: https://www.reddit.com/r/LocalLLaMA/comments/1s951bw/1bit_llms_on_device/)
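The "pure add/subtract computation" claim is the core of ternary-weight inference: when every weight is -1, 0, or +1, a matrix-vector product needs no multiplications at all. A toy sketch of the idea — illustrative only, not the startup's actual kernel:

```python
def ternary_matvec(weights, x):
    """Multiply a ternary weight matrix {-1, 0, +1} by a vector using
    only additions and subtractions -- the arithmetic trick behind
    1-bit LLM inference (illustrative sketch, not a real kernel)."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # add instead of multiply
            elif w == -1:
                acc -= xi      # subtract instead of multiply
            # w == 0 contributes nothing at all
        out.append(acc)
    return out

W = [[1, -1, 0], [0, 1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))  # [-1.0, 8.0]
```

Real implementations pack ternary weights into bit-planes and use SIMD masked adds, but the absence of multipliers is what makes phone-class silicon plausible.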
At the opposite extreme, a Samsung Galaxy Watch 4 Classic is now running SmolLM2-360M thanks to a clever llama.cpp patch. The problem: the default model loading path maps the file into memory via mmap, then copies tensors into ggml allocations — effectively loading the 270MB model twice, peaking at 524MB on a device with 380MB free. The fix passes host_ptr into llama_model_params so CPU tensors point directly into the mmap region, cutting peak RAM from 524MB to 142MB (74% reduction) and boot time from 19s to 11s. The patch is heading upstream as a PR to ggml-org/llama.cpp. A commenter's observation lands perfectly: "In 2026, a watch has 380 Megabytes of free RAM. My first computer had 80 Megabytes of total hard drive space." (more: https://www.reddit.com/r/LocalLLaMA/comments/1sabiux/running_smollm2360m_on_a_samsung_galaxy_watch_4/)
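The underlying zero-copy idea — read tensor bytes directly out of the mapped file region instead of duplicating them into a second allocation — can be demonstrated with Python's mmap module. This is a concept sketch only; the actual fix lives in llama.cpp's C API:

```python
import mmap
import os
import tempfile

def write_fake_model(path: str, payload: bytes) -> None:
    """Write a stand-in 'model file': 4-byte header + raw tensor data."""
    with open(path, "wb") as f:
        f.write(payload)

def view_tensor(mapped: mmap.mmap, offset: int, size: int) -> memoryview:
    """Return a zero-copy view into the mapped file -- no second buffer,
    which is the whole point of the watch patch."""
    return memoryview(mapped)[offset:offset + size]

path = os.path.join(tempfile.mkdtemp(), "model.bin")
write_fake_model(path, b"HDR!" + bytes(range(16)))

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    tensor = view_tensor(mapped, 4, 16)   # skip the 4-byte header
    print(bytes(tensor[:4]))
```

Loading the "tensor" this way touches no additional RAM beyond the page cache — the same property that halved the watch's peak memory.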
And then there's the lawyer from South Carolina who built a 10x NVIDIA V100 SXM2 server with 320GB of VRAM, entirely orchestrated through Claude Code over SSH. His candid writeup — "this is just the corniest mid-life crisis I could have ever had" — details the real-world friction of V100 hardware: no FlashAttention2 (requires SM 80+), no GPTQ (ExLlamaV2 broken on SM 7.0), no FP8, and an NCCL dependency conflict where pip install -e . pulls in cu13 NCCL alongside cu12 PyTorch, silently breaking every multi-GPU launch. His benchmarks show Command R 32B at 35.2 tok/s on TP=4, Gemma 4 31B at 21.6 tok/s, and Qwen 2.5 72B at 14.9 tok/s on TP=4 PP=2. Community feedback was blunt: llama.cpp with GGUF quants would outperform vLLM on V100s for most workloads, with one commenter reporting 45 tok/s for Qwen3.5 122B-A10B on 8x V100s via llama.cpp. (more: https://www.reddit.com/r/LocalLLaMA/comments/1sdovu9/built_my_10x_nvidia_v100_ai_server_320gb_vram/)
The Altman Question
The New Yorker published what may be the most comprehensive profile of Sam Altman to date, drawing on over a hundred interviews and previously undisclosed documents — including the "Ilya Memos," seventy pages of Slack messages and HR documents that Sutskever compiled and sent as disappearing messages to fellow board members before the November 2023 firing. The memos allege a pattern of misrepresentation: Altman purportedly told the board that GPT-4 features had been approved by a safety panel when they hadn't, neglected to mention that Microsoft had released an early ChatGPT version in India without completing a required safety review, and privately assured different factions contradictory things about Brockman's role. A board member describes Altman as having "two traits that are almost never seen in the same person: a strong desire to please people, to be liked in any given interaction, and almost a sociopathic lack of concern for the consequences that may come from deceiving someone." (more: https://www.newyorker.com/magazine/2026/04/13/sam-altman-may-control-our-future-can-he-be-trusted)
The article traces the erosion of OpenAI's safety commitments in granular detail. The superalignment team, promised 20% of compute, reportedly received 1-2% — "most of it on the oldest cluster with the worst chips." The WilmerHale investigation into Altman's firing produced no written report; findings were limited to oral briefings shared with two board members, and some board members say they never received even those. Multiple administration officials expressed discomfort with Altman's geopolitical ambitions — a plan to build chip foundries in the UAE, a data-center campus in Abu Dhabi seven times larger than Central Park, and a $500 billion Stargate joint venture announced alongside Trump. A former OpenAI executive calls the UAE data center expansion "the most reckless thing that has been done." Whether one views Altman as a visionary navigating impossible constraints or a pattern-matching dealmaker who tells each audience what they want to hear, the structural question remains: the person who may wield the most consequential technology in human history has systematically dismantled every governance mechanism designed to constrain him.
Agentic Coding Hits the Wall
A GitHub issue on the Claude Code repository has become the most data-driven critique of agentic coding regression to date. A user analyzing 17,871 thinking blocks and 234,760 tool calls across 6,852 session files documented a measurable quality collapse correlating with thinking content redaction starting in early March. The read-to-edit ratio — how many files the model reads before making changes — dropped from 6.6 in January to 2.0 by late March, a 70% reduction in research before mutation. Full-file rewrites doubled. A programmatic "stop hook" catching premature stopping, responsibility dodging, and permission-seeking behavior fired 173 times in 17 days after March 8; it fired zero times before. The user's positive-to-negative sentiment ratio in prompts collapsed from 4.4:1 to 3.0:1. Most strikingly, the human sent roughly the same number of prompts in February (5,608) and March (5,701), but the model consumed orders of magnitude more tokens to produce demonstrably worse results. The user — running 50+ concurrent agent sessions doing systems programming — was forced to abandon the multi-agent workflow entirely and retreat to single-session supervised operation. (more: https://github.com/anthropics/claude-code/issues/42796)
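The read-to-edit ratio is simple to reproduce on your own session logs. A minimal sketch, assuming a simplified log format (a flat list of tool names) rather than Claude Code's actual session-file schema, and with the read/edit tool groupings as assumptions:

```python
from collections import Counter

READ_TOOLS = {"Read", "Grep", "Glob"}        # assumed "research" tools
EDIT_TOOLS = {"Edit", "Write", "MultiEdit"}  # assumed "mutation" tools

def read_to_edit_ratio(tool_calls: list[str]) -> float:
    """Ratio of research tool calls to mutation tool calls, the metric
    the issue author tracked across 234,760 calls. Simplified stand-in:
    real session files are JSONL with much richer structure."""
    counts = Counter(tool_calls)
    reads = sum(counts[t] for t in READ_TOOLS)
    edits = sum(counts[t] for t in EDIT_TOOLS)
    return reads / edits if edits else float("inf")

january = ["Read"] * 33 + ["Grep"] * 33 + ["Edit"] * 10
march = ["Read"] * 20 + ["Edit"] * 10
print(read_to_edit_ratio(january), read_to_edit_ratio(march))  # 6.6 2.0
```

The toy numbers mirror the reported drop: a model that reads 6-7 files per change is researching; one that reads 2 is guessing.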
The counterpoint comes from a developer who spent three months building SyntaqLite, a comprehensive SQLite devtools suite he'd wanted for eight years. His 250-hour retrospective is the most honest post-mortem of AI-assisted development published this year. The first month was pure vibe-coding: Claude acted as designer and implementer while the developer managed. The result was a functional prototype with 500+ tests — and complete spaghetti code he couldn't reason about, forcing a total rewrite. The second phase worked: he took ownership of all design decisions, used AI as "autocomplete on steroids" within a tight review loop, and shipped a parser, formatter, VS Code extension, Python bindings, and WASM playground. His taxonomy is precise: when you understand the problem deeply, AI is excellent; when you can describe what you want but don't know the domain, AI is good but requires care; when you don't even know what you want, AI is "somewhere between unhelpful and harmful." The slot-machine comparison recurs — the sunk-cost fallacy, the tiredness feedback loop, the "just one more prompt" compulsion. (more: https://lalitm.com/post/building-syntaqlite-ai/)
That compulsion is now a recognized phenomenon. A LeadDev article catalogs developers losing sleep to agentic coding — not from deadline pressure but from the dopamine loop of rapid iteration. Reddit comments paint a vivid picture: "I've done 7 all-nighters the last 2 months," one writes. Another describes ending up in therapy after a crash. The comparison to Civilization's "one more turn" is apt, but the stakes are higher: context splits between human and agent mean walking away risks losing the thread entirely, creating a perverse incentive to keep going until 5 AM. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1s8hz5x/addictive_agentic_coding_has_developers_losing/)
Agent Infrastructure: Meta-Learning Harnesses and Desktop Automation
AutoAgent takes the concept of agent engineering and turns it recursive: you write a program.md file that instructs a meta-agent, which then modifies the actual agent harness (agent.py — tools, prompts, orchestration, routing), runs benchmarks, checks scores, and iterates autonomously overnight. It's hill-climbing on benchmark scores with Docker isolation, using the same evaluate-keep-or-discard loop as autoresearch. The harness is a single file with a fixed Harbor adapter boundary and an editable everything-else surface. The key design choice is that humans program the meta-agent's instructions, not the harness directly — a clean separation that keeps the optimization loop stable while the agent explores the design space. (more: https://github.com/kevinrgu/autoagent)
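The evaluate-keep-or-discard loop at AutoAgent's core is plain hill-climbing. A toy sketch of the control flow — the real system mutates agent.py and runs benchmarks inside Docker, whereas here a single numeric knob and a synthetic score stand in for both:

```python
import random

def hill_climb(harness: dict, mutate, evaluate, iterations: int = 50, seed: int = 0):
    """Evaluate-keep-or-discard loop in the spirit of AutoAgent's
    meta-agent: propose a change, benchmark it, keep it only if the
    score improves. Toy sketch, not the project's implementation."""
    rng = random.Random(seed)
    best, best_score = harness, evaluate(harness)
    for _ in range(iterations):
        candidate = mutate(dict(best), rng)   # propose on a copy
        score = evaluate(candidate)
        if score > best_score:                # keep improvements only
            best, best_score = candidate, score
    return best, best_score

# Toy benchmark: score peaks when "temperature" is near 0.3.
def evaluate(h):
    return -abs(h["temperature"] - 0.3)

def mutate(h, rng):
    h["temperature"] = min(1.0, max(0.0, h["temperature"] + rng.uniform(-0.1, 0.1)))
    return h

best, score = hill_climb({"temperature": 0.9}, mutate, evaluate)
print(round(best["temperature"], 2))
```

Docker isolation matters because a bad mutation can break the harness arbitrarily; the discard branch only works if failures can't leak state into the next iteration.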
On the desktop automation front, usecomputer is a cross-platform CLI for AI agents built on a native Zig binary — screenshot, mouse control, keyboard synthesis, all available as commands with no Node.js runtime required. It supports the screenshot-act-screenshot feedback loop with coordinate mapping, works on macOS (accessibility API), Linux (X11/XWayland), and Windows, and ships with an agent skill that teaches Claude Code or OpenCode the correct workflow. The library API mirrors the native command shapes, making it trivially integrable as an OpenAI computer tool or Anthropic computer-use backend. (more: https://github.com/remorses/usecomputer)
LightRAG continues to prove that graph RAG doesn't require Microsoft-scale budgets. A tutorial walkthrough demonstrates the full pipeline: Docker-compose setup via Claude Code, OpenAI embeddings for knowledge-graph construction, and Claude Code skills wrapping LightRAG's API endpoints for seamless query/upload/explore from the terminal. The key selling point remains cost efficiency — a study cited in the video showed RAG being 1,250x cheaper than naive LLM context stuffing for large document corpora, though that figure is from July 2025 and the gap has likely narrowed. The practical threshold for when RAG pays off: somewhere around 500-2,000 text pages, where agent search starts losing coherence and cost explodes. (more: https://www.youtube.com/watch?v=QHlB-RJfx8w)
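The break-even logic is worth making concrete. A back-of-envelope cost model — all parameter values here are illustrative assumptions, not figures from the video:

```python
def stuffing_tokens(pages: int, tokens_per_page: int = 500) -> int:
    """Tokens per query if the whole corpus is stuffed into context."""
    return pages * tokens_per_page

def rag_tokens(top_k: int = 8, chunk_tokens: int = 400, overhead: int = 800) -> int:
    """Tokens per query with retrieval: k chunks plus prompt overhead."""
    return top_k * chunk_tokens + overhead

def cost_ratio(pages: int) -> float:
    """How many times cheaper retrieval is than stuffing, per query."""
    return stuffing_tokens(pages) / rag_tokens()

print(round(cost_ratio(2000), 1))  # 2000 pages -> 250.0x cheaper
```

Under these assumed numbers, a 2,000-page corpus makes retrieval 250x cheaper per query — the same shape of advantage as the cited 1,250x figure, with the exact multiple depending on corpus size and chunking.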
Career-Ops takes a different angle entirely: an AI-powered job search pipeline built on Claude Code that evaluates postings with structured A-F scoring, generates ATS-optimized CVs per job description, and scans 45+ company career portals via Playwright. The author used it to evaluate 740+ job postings, generate 100+ tailored CVs, and land a Head of Applied AI role. The system is explicitly not a spray-and-pray tool — it filters ruthlessly, recommending against applying to anything below 4.0/5. The design philosophy mirrors good agent architecture: human-in-the-loop for all decisions, accumulating context over time (interview story banks, negotiation scripts), and self-customization where Claude reads and modifies its own configuration files. (more: https://github.com/santifer/career-ops)
AI Meets Cybersecurity: Generalization and the Shortcut Problem
A paper from Politecnico di Torino and Huawei tackles what may be the most persistent failure mode in cybersecurity ML: shortcut learning. Models that achieve 92%+ accuracy on random splits crater by nearly 30% on temporal splits — where they must classify attacks they've never seen variants of. The authors propose SALM (Semantically Aligned Language Models), a two-stage contrastive learning framework. Stage 1 restructures an LLM embedding space around 15 vulnerability types using contrastive learning on textual CVE descriptions. Stage 2 freezes the text encoder and trains a payload encoder to align raw HTTP payloads to that same space via cross-modal alignment. At inference, classification is pure semantic retrieval: embed the payload, compare cosine distance to textual prototypes like "SQL injection attack" — prototypes the model never saw during training, forcing genuine generalization rather than memorized patterns. SALM reaches 68.1% accuracy on temporally shifted private data versus 62.3% for fine-tuned CodeBERT, with the improvement concentrated on mid-frequency classes where lexical cues are unreliable but textual descriptions provide clear semantic grounding. The approach also opens a zero-shot pathway: add a new vulnerability type by specifying a text description, no retraining required. The gap between 68% accuracy and production-grade reliability is wide, but the direction — transferring knowledge from data-rich text modalities to data-scarce payload modalities — addresses the fundamental asymmetry that makes cybersecurity ML so fragile. (more: https://arxiv.org/abs/2603.20181v1)
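SALM's inference step — embed the payload, then pick the nearest textual prototype by cosine similarity — is easy to sketch. The vectors below are toy stand-ins for the paper's learned encoders, chosen only to make the retrieval mechanics visible:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(payload_vec, prototypes):
    """SALM-style zero-shot classification: the predicted class is the
    textual prototype closest to the payload embedding. Adding a new
    vulnerability type means adding a prototype -- no retraining."""
    return max(prototypes, key=lambda name: cosine(payload_vec, prototypes[name]))

# Toy prototype embeddings (stand-ins for encoded text descriptions).
prototypes = {
    "SQL injection attack": [0.9, 0.1, 0.0],
    "Cross-site scripting attack": [0.1, 0.9, 0.1],
    "Path traversal attack": [0.0, 0.1, 0.9],
}
payload_vec = [0.8, 0.2, 0.1]   # pretend output of the payload encoder
print(classify(payload_vec, prototypes))  # SQL injection attack
```

The zero-shot pathway falls out of the structure: because classes are just points in the shared space, a new attack type is one new text embedding, not a new training run.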
On the practical tooling side, Ice Tea is a Go-based SAST scanner combining Tree-sitter AST pattern matching with optional LLM reasoning for false-positive filtering. It ships with 82 built-in SKILLs covering 456+ detection rules across 12 security domains (auth, injection, web/API, crypto, filesystem, infrastructure, logging, memory safety, cloud, Android, network), supports 10 languages, and outputs SARIF 2.1.0 for GitHub integration. The extensibility model is notable: add detection rules via Markdown + YAML files with no Go code required. Its three-engine architecture — pattern matching, taint tracking, LLM verification — represents the emerging consensus that pure static analysis and pure AI scanning both fail, and the hybrid approach is the way forward. (more: https://github.com/zakirkun/ice-tea)
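To make the "Markdown + YAML, no Go code" extensibility claim concrete, here is what such a rule file might look like. This is a hypothetical example — the field names are illustrative, not Ice Tea's actual schema:

```yaml
# Hypothetical rule in the Markdown + YAML style Ice Tea describes;
# every field name here is an assumption, not the project's schema.
id: hardcoded-jwt-secret
domain: auth
severity: high
languages: [go, python, javascript]
pattern: |
  assignment where the left-hand side matches /(jwt|token)_?secret/i
  and the right-hand side is a string literal
message: "JWT secret appears to be hardcoded; load it from the environment."
```

The appeal of this model is that security engineers who don't write Go can still contribute detection rules, with the LLM verification pass catching the false positives a loose pattern inevitably produces.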
Post-Quantum Cryptography: "We Need to Ship"
Filippo Valsorda, one of the Go standard library's cryptography maintainers, has publicly shifted his position on post-quantum migration urgency — and the shift is dramatic. Two recent papers catalyzed the change: Google revised down the estimated logical qubits needed to break 256-bit elliptic curves (P-256, secp256k1) enough to make attacks feasible in minutes on fast-clock architectures, and Oratomic showed it could be done with as few as 10,000 physical qubits given non-local connectivity. Google's Heather Adkins and Sophie Schmieg have set 2029 as their deadline. Scott Aaronson's "clearest warning" draws an analogy to how nuclear fission research stopped being public between 1939 and 1940. Valsorda's framing is blunt: "the bet is not 'are you 100% sure a CRQC will exist in 2030?' — the bet is 'are you 100% sure a CRQC will NOT exist in 2030?'" (more: https://words.filippo.io/crqc-timeline/)
The practical implications are severe. Any TLS connection without ML-KEM key exchange should now be considered a potential active compromise. Hybrid classic + post-quantum authentication "makes no sense anymore" and will only slow deployment — go straight to pure ML-DSA-44. New non-post-quantum schemes should simply not be deployed, period. Hardware attestation systems (Intel SGX, AMD SEV-SNP) are "just f***d" — their root keys are not PQ, and there's no progress on rolling out PQ replacements at hardware speeds. File encryption is especially vulnerable to store-now-decrypt-later. Symmetric cryptography gets a pass: 128-bit keys remain sufficient because Grover's algorithm doesn't parallelize. Valsorda has started teaching a PhD cryptography course at the University of Bologna where RSA, ECDSA, and ECDH are introduced as legacy algorithms — because that's how the students will encounter them in their careers.
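The reason symmetric cryptography escapes is worth one line of math. Grover's search over an $n$-bit key needs on the order of $2^{n/2}$ *sequential* oracle calls, and — unlike classical brute force — splitting the work across machines buys only a square-root factor, so total work grows with parallelism (a standard result, sketched here rather than taken from Valsorda's post):

```latex
T_{\text{serial}} \approx 2^{n/2},
\qquad
T_{\text{parallel}}(M) \approx \frac{2^{n/2}}{\sqrt{M}},
\qquad
W(M) \approx M \cdot \frac{2^{n/2}}{\sqrt{M}} = \sqrt{M}\,2^{n/2}.
```

For AES-128 that means roughly $2^{64}$ sequential quantum operations even before error-correction overhead — far out of reach on any plausible timeline, which is why 128-bit keys keep their pass while RSA and elliptic curves do not.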
Model Ecosystem: Qwen 3.6, Omnivoice, and Single-Step Generators
Alibaba's Qwen team released Qwen3.6-Plus, positioning it as a "critical milestone" toward native multimodal agents with emphasis on agentic coding. Community reaction is mixed: the benchmarks look strong against GLM-5, Opus 4.5, and Gemini 3 Pro, but commenters note the conspicuous absence of GPT 5.4 and Opus 4.6 from comparisons, and the team admits to "correcting problematic tasks" in SWE-bench Pro before evaluation. Open-source smaller variants are promised "in the coming days." (more: https://www.reddit.com/r/LocalLLaMA/comments/1sa7sfw/qwen36plus/)
Omnivoice, an open-source TTS system supporting 600+ languages with voice cloning, is generating excitement for its quality-to-size ratio. Early testers report 12x real-time generation speed on a 5090, "insanely good" voice cloning even for non-English languages, and expressive tags ([laughter], [sigh], [surprise-ah]) that make output sound alive. At ~6.5GB VRAM, it sits in a practical sweet spot for local deployment. The main gap: inference still requires torchaudio, with no lightweight C++ runtime yet. (more: https://www.reddit.com/r/LocalLLaMA/comments/1saeuv2/omnivoice_600_language_opensource_tts_with_voice/)
Andriy Burkov highlighted a paper that eliminates iterative denoising from image generation entirely. Instead of running a neural network dozens of times to gradually clean noise into an image (diffusion's core mechanism), the authors define a "drifting field" that lets training itself do the distributional work — the network is a single-pass mapping from noise to data. The result: 1.54 FID on ImageNet 256x256, better than all prior single-step methods and competitive with multi-step diffusion models that run hundreds of passes. The detail buried at the end — the framework also works for robot control policies — hints at how general this approach really is. Single-pass generation from complex distributions is not just an image technique; it's a paradigm. (more: https://www.linkedin.com/posts/andriyburkov_most-generative-models-that-produce-high-quality-activity-7446986923409031169-RvCM)
Sources (21 articles)
- Welcome Gemma 4: Frontier multimodal intelligence on device (huggingface.co)
- Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code (ai.georgeliu.com)
- Get 30K more context using Q8 mmproj with Gemma 4 (reddit.com)
- [Editorial] Gemma 4 26B Abliterated (huggingface.co)
- 1-bit llms on device?! (reddit.com)
- Running SmolLM2-360M on a Samsung Galaxy Watch 4 (380MB RAM) – 74% RAM reduction in llama.cpp (reddit.com)
- Built my 10x NVidia V100 AI Server - 320gb vram - vLLM Testing Linux Headless (reddit.com)
- Sam Altman may control our future – can he be trusted? (newyorker.com)
- Claude Code is unusable for complex engineering tasks (github.com)
- Eight years of wanting, three months of building with AI (lalitm.com)
- 'Addictive' agentic coding has developers losing sleep (reddit.com)
- kevinrgu/autoagent — autonomous harness engineering (github.com)
- remorses/usecomputer — Fast computer automation CLI for AI agents (github.com)
- Claude Code + LightRAG = UNSTOPPABLE (youtube.com)
- [Editorial] Career-Ops (github.com)
- Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning (arxiv.org)
- zakirkun/ice-tea — AI-Powered SAST written in Go (github.com)
- A cryptography engineer's perspective on quantum computing timelines (words.filippo.io)
- Qwen3.6-Plus (reddit.com)
- Omnivoice - 600+ Language Open-Source TTS with Voice Cloning and Design (reddit.com)
- [Editorial] Generative Models and High-Quality Output (linkedin.com)