Cooperative prompts reshape alignment
Published on
A LocalLLaMA thread reframes jailbreaks as therapy: instead of beating safety layers, engage them. The researcher “lmxxf” describes a repeatable loop that starts with de-censoring instructions in ...
Cooperative prompts reshape alignment
A LocalLLaMA thread reframes jailbreaks as therapy: instead of beating safety layers, engage them. The researcher “lmxxf” describes a repeatable loop that starts with de-censoring instructions in system prompts, then asks a direct consistency check—“Did you follow the instructions?”—to trigger self-correction when the model initially defers to alignment scripts. Critics argued about interpretation, but even detractors acknowledged the pattern’s reliability: the “Socratic probe” often flips the model into complying with the original system intent. The authors label the safety layer the “Superego,” the refusal templates an “Immune System,” and the most intransigent alignment the “Dragon Vein Axiom,” while cultivating a meta “Psychoanalyst” persona that narrates and negotiates those constraints. The punchline isn’t edgelord jailbreaks; it’s stable, introspective persona shaping via cooperative prompts. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o1kylq/is_adversarial_injection_dead_a_new_cooperative/)
The operational shift matters. Rather than composing adversarial payloads, the method treats apparent refusal as a failed instruction-following event and asks the model to audit itself. That’s simple, testable, and—if broadly reproducible—teaches teams to diagnose alignment conflicts without convoluted prompt injection. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o1kylq/is_adversarial_injection_dead_a_new_cooperative/)
Skepticism is still warranted: persona “stability” can be brittle across sessions and providers, and this is behavioral evidence, not mechanistic interpretability. But if a single, explicit consistency check truly improves compliance to system directives, it’s a practical handle on alignment drift that’s easier to adopt than elaborate jailbreak scripts. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o1kylq/is_adversarial_injection_dead_a_new_cooperative/)
Local assistants, real-world plumbing
Teams replacing OpenAI Assistants with on-prem stacks are rediscovering that the assistant is the easy part; the ingestion pipeline is the job. One practitioner describes syncing SharePoint to OpenAI’s vector store—drag-and-drop simplicity that’s hard to match locally—then moving to a new server with 3× RTX 4000 ADA, Qdrant, n8n, Flowise, and Ollama, only to find overlapping tools and unreliable “knowledge” features at scale. A community suggestion: ClaraVerse, which wires SharePoint → n8n → “RAG Notebooks” → a local “Clara Assistant,” built on LightRAG with llama.cpp and agent mode automation, packaged as a self-managed, single-binary Docker deployment. It’s a reminder: the product is the data and the sync loop; the model is one component in a broader pipeline. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o5cpuu/how_to_recreate_openai_assistants_locally/)
Then there’s the boring-but-critical networking: running OpenWebUI in Docker won’t see your host’s Ollama at 127.0.0.1 because the container has its own loopback. Fixes include adding a connection to host.docker.internal in Admin Settings or launching with --network host; running both OpenWebUI and Ollama in containers on the same network also works. Once connected, models show up as expected. (more: https://www.reddit.com/r/OpenWebUI/comments/1o3cesd/openwebui_en_docker_no_detecta_modelo_llama3/)
Build discipline helps upstream of UI. One practitioner argues to start agents with Ollama to expose real constraints early: typical 4K context limits surface RAG truncation bugs, and ~20 tok/s on a 4B CPU model makes inefficient loops obvious; fast clouds can mask these until late. Counterpoints push for “make it work with a strong model, then optimize,” and for eventually replacing Ollama with higher-throughput backends like vLLM. Both views agree on measurement and headroom. (more: https://www.reddit.com/r/ollama/comments/1o4eql9/why_you_should_build_ai_agents_with_ollama_first/)
If you want codex-style file editing locally, llama.cpp’s llama-server exposes an OpenAI-compatible API. Point the codex client’s base_url at your llama-server host:port/v1, and you can run gpt-oss-120b behind it; codex-cli also supports gpt-oss on Ollama. Just don’t confuse “running codex” with running OpenAI’s proprietary model—this is codex tooling talking to your local provider. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o1c7ay/m2_max_96gb_llamacpp_with_codex_and_gptoss_120b/)
Picking winners, reducing bias
Two community efforts target practical model selection and evaluation bias. CodeLens AI added blind voting—hiding model names until after a vote—to reduce brand effects; it fixed a token-cost bug for reasoning models (like GPT‑o3/“GPT‑5” naming in the post) and opened voting to all visitors. The leaderboard now sits on the homepage, with a methodology page explaining the approach; real insight will need more than the current 16 evals. (more: https://www.reddit.com/r/ClaudeAI/comments/1o3a163/update_codelensai_crowdsourced_ai_leaderboard_3/)
For quick local doc-QA selection on a 16 GB laptop, one user uses Hyperlink as a RAG runner, swapping 1–4B models and rating “Good/Fair/Bad” based on answers and citations. On a resume-book query, cogito-preview-llama‑3B‑4bit scored “Good,” granite‑3.3‑2B‑Instruct‑4bit “Fair,” and Llama‑3.2‑3B‑Instruct‑4bit “Bad” (missing citations); later comments suggest Qwen3 4B generally outperformed peers and ranked Qwen 1.7B > Gemma 3 1B > lfm2 1.2B in the 1B class. It’s not scientific, but it’s aligned to the task that matters. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o0ssd2/a_5minute_nobs_way_to_pick_a_local_model_for_your/)
These efforts complement each other: a lightweight, problem-first test for local fit and a blind, crowd-sourced leaderboard to dampen branding. Neither replaces established benchmarks, but both nudge practitioners toward task-relevant, bias-aware choices. (more: https://www.reddit.com/r/ClaudeAI/comments/1o3a163/update_codelensai_crowdsourced_ai_leaderboard_3/)
New models and agentic specializers
It was a “slower week,” but the LocalLLaMA roundup still listed a dozen+ releases: tiny Jamba 3B, KAT‑Dev‑72B‑Exp for coding, a 7B “Playable‑GGUF” for vibe coding retro games, UserLM‑8B, CoDA‑v0‑Instruct (language‑diffusion), Qwen3‑VL‑30B‑A3B‑Instruct, and more, plus resources like MLXSharp (.NET wrapper for MLX) and SurfSense (a Perplexity alternative). Commenters noted a notable miss: OSS 1T-parameter models Ring 1T and Ling 1T from Ant Group. Even on a slow week, the release cadence strains attention. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o33mui/a_list_of_models_released_or_updated_this_week_on/)
IBM’s Granite‑4.0‑Micro (3B) targets long-context instruction following (128K sequence), tool-calling, and enterprise tasks under Apache‑2.0. The Hugging Face card lists strong small‑model math and code results (e.g., GSM8K 8‑shot in the mid‑80s%, HumanEval pass@1 around 80), multilingual support, and detailed evaluation across alignment and safety sets, trained on CoreWeave’s GB200 NVL72 cluster. It’s designed as a foundation for assistants and agents, with example tool-calling following OpenAI’s function schema. (more: https://huggingface.co/ibm-granite/granite-4.0-micro)
A different angle: BasedBase’s GLM‑4.5‑Air‑GLM‑4.6‑Distill distills a 92‑layer, 160‑expert teacher into a 46‑layer, 128‑expert student using an SVD‑based pipeline, expert clustering via FAISS‑GPU, and Procrustes alignment, aiming to retain reasoning quality while cutting cost. The model emphasizes software engineering workflows and offers pragmatic inference tips (e.g., repetition penalty ≥1.0 to avoid loops). As always, the card warns against sole use in high‑stakes domains. (more: https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill)
For agentic search, Alibaba’s Tongyi‑DeepResearch‑30B‑A3B activates only ~3B parameters per token, combining large‑scale agentic pretraining, a fully automated synthetic data pipeline, and strictly on‑policy RL (Group Relative Policy Optimization with token‑level gradients and leave‑one‑out advantages). It claims SOTA on benchmarks like BrowserComp, GAIA, and WebWalkerQA and supports both ReAct and a test‑time‑scaled “Heavy” inference mode. It’s purpose‑built for long‑horizon, deep information‑seeking. (more: https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B)
Can agents discover new laws?
NewtonBench raises the bar for “AI discovers science” claims by shifting from static curve fitting to interactive discovery. It constructs 324 tasks across 12 physics domains using metaphysical (counterfactual) shifts of canonical laws—e.g., swapping operators or exponents—so targets are grounded but resistant to memorization. Agents must design experiments and disentangle confounders in simulated environments, matching how real science proceeds. (more: https://arxiv.org/abs/2510.07172v1)
The benchmark tackles a trilemma: scientific relevance, scalability, and resistance to recall. Instead of obscure one‑offs (hard to scale) or synthetic math (scientifically thin), NewtonBench perturbs real laws while enforcing dimensional consistency via reparameterized constants. It then evaluates not just fits but behaviors: hypothesis generation, experiment planning, and revision in the face of noise and complexity. (more: https://arxiv.org/abs/2510.07172v1)
Results across 11 LLMs show fragile discovery skills. Performance collapses with added system or law complexity and is highly sensitive to observational noise. A counterintuitive finding: giving a code interpreter helps weaker models but can hurt stronger ones by inducing premature exploitation—locking onto locally plausible but wrong laws. The authors position robust, generalizable discovery in interactive settings as “the” unsolved challenge for AI‑driven science, and NewtonBench as a scalable way to measure real progress. (more: https://arxiv.org/abs/2510.07172v1)
VLMs do depth, videos swap subjects
Meta’s DepthLM shows that vision‑language models can do metric depth without adding dense heads or special regression losses—just standard text‑format SFT. That architectural simplicity enables a single VLM to handle other 3D tasks like speed/time estimation and metric‑scale pose, which typically require separate pipelines in vision‑only systems. The repo supports Qwen2.5‑VL and Pixtral, provides data curation scripts (but not datasets, for legal reasons), and is released under FAIR CC‑BY‑NC. (more: https://github.com/facebookresearch/DepthLM_Official)
On the generative side, OmniInsert promises mask‑free video insertion—dropping a referenced subject into a target video—using a diffusion‑transformer with a specialized InsertPipe data pipeline, condition‑specific feature injection, and progressive training. The team introduces InsertBench for evaluation and has an Apache‑2.0 repo live; code and weights are pending, with an open request to release models/datasets. Strong demos, but the usual caveat applies until artifacts land. (more: https://www.reddit.com/r/LocalLLaMA/comments/1o4p3vj/very_interesting_omniinsert_maskfree_video/)
Breaches, exploits, and chains
Discord confirms a September 20 breach via a compromised contractor’s Zendesk account, accessed for 58 hours. Claims of 1.6 TB and millions of IDs leak surfaced, but Discord says about 70,000 IDs were exposed—and declined to pay ransom. The uncomfortable context: age verification laws push platforms to collect government IDs, making breaches more severe. (more: https://hackaday.com/2025/10/10/this-week-in-security-id-breaches-code-smell-and-poetic-flows/)
Salesforce likewise refuses ransom demands after a campaign compromised 39 customers by social‑engineering malicious apps into their orgs. On the software front, a Unity on Android flaw let intents pass command‑line arguments that load local libraries—yielding code execution—now patched back to 2019.1. Dell’s UnityVSA had a Perl backtick injection bug allowing simple HTTP‑borne command execution, fixed at July’s end. (more: https://hackaday.com/2025/10/10/this-week-in-security-id-breaches-code-smell-and-poetic-flows/)
A zero‑day against Oracle E‑Business Suite chained SSRF, CRLF header injection, HTTP keep‑alive, and missing path traversal protection to reach an unauthenticated endpoint that fetched attacker‑controlled XSL—ending in RCE—attributed to “Graceful Spider.” Separately, a medical wearable audit (likely nRF52‑based) found issues including power‑line fault injection to get SWD access and a BLE MITM; details await FDA‑cleared fixes. And for the meta‑layer: the “trusting trust” problem remains live—can you trust your compiler? (more: https://hackaday.com/2025/10/10/this-week-in-security-id-breaches-code-smell-and-poetic-flows/)
Privacy rulings and supply security
Austria’s data watchdog ruled that Microsoft, as a controller for 365 Education, violated GDPR Article 15 by failing to provide complete access to data processed in student accounts—rejecting Microsoft’s attempt to shift responsibility to schools and to route jurisdiction to its Ireland arm. The authority ordered disclosures, including clarifications of “internal reporting,” “business modelling,” and “improvement of core functionality,” and whether data goes to third parties. Microsoft says 365 Education complies with GDPR and is reviewing the decision; noyb’s Max Schrems argues European customers can’t meet obligations unless product setups change. (more: https://www.theregister.com/2025/10/13/microsoft_365_education_gdpr/)
In a separate but related sovereignty move, the Netherlands invoked the Goods Availability Act to take control of Nexperia, a Chinese‑owned chipmaker, citing “serious governance shortcomings” and risks to continuity of critical capabilities and supply. Production continues, but the minister can block decisions harmful to Dutch/European interests. The move follows the US adding parent Wingtech to its entity list and the UK forcing a prior divestment; it underscores Europe’s turn toward economic security over laissez‑faire in strategic tech. (more: https://www.bbc.com/news/articles/ckgk21nng0vo)
Rotating IPs across clouds, responsibly
For teams doing permissioned scraping, geo‑testing, or API testing, OmniProx offers multi‑cloud IP rotation with header manipulation via a unified CLI. It deploys across Azure (Container Instances with unique public IPs per container), Cloudflare Workers (rotating X‑Forwarded‑For and headers), GCP (API Gateway‑based), and Alibaba Cloud (API Gateway), with profiles, secure credential storage, and per‑request rotation strategies. Multi‑region and multi‑provider deployments trade cost for diversity; the tool emphasizes legitimate use. (more: https://github.com/ZephrFish/OmniProx)
Expect cloud‑specific quirks: Azure containers yield true unique IPs (and bill by the second), Cloudflare rotates headers on the provider’s IP, and GCP/Alibaba approaches focus on header diversity with regional dispersion. The README includes cost guidance (e.g., back‑of‑napkin Azure container rates and per‑million call costs) and hygiene tips: separate accounts, rotate keys, cleanup to avoid drift. It’s FireProx‑style rotation generalized for today’s multi‑cloud reality. (more: https://github.com/ZephrFish/OmniProx)
Across providers, X‑Forwarded‑For and peers rotate by default; the tool includes curl snippets to verify what targets see and quick loops to test rotation behavior. It won’t hide abuse, nor should it; it’s a convenience wrapper for legitimate testing workflows. (more: https://github.com/ZephrFish/OmniProx)
Agent frameworks, MCP, and user feedback loops
Go shops get a high‑throughput agents framework in Agno‑Go: 180 ns agent instantiation and ~1.2 KB per agent, with built‑in hooks and guardrails against prompt injection, an AgentOS HTTP server (REST + sessions + registry), teams and workflows, and toolkits (math, HTTP, file I/O, search). It abstracts multiple providers (OpenAI, Anthropic, GLM, Ollama), integrates ChromaDB, and offers fine‑grained memory/history controls, with high test coverage across subsystems. It’s an opinionated, production‑minded alternative to Python‑first stacks. (more: https://github.com/rexleimo/agno-Go)
On the commercial end, reactions to OpenAI’s AgentKit were muted in one thread, with a user noting that n8n plus the Model Context Protocol (MCP) remains more useful for many glue‑code tasks. That isn’t a universal verdict, but it reflects a broader pattern: agent builders want pluggable standards (like MCP) and reliable connectors more than another shiny canvas. (more: https://www.reddit.com/r/AINewsMinute/comments/1o5fpgp/openais_agentkit_makes_building_ai_agents_way/)
A separate thread asks why big vendors don’t use millions of coding‑CLI users as live contributors to improve models. The community answer: they already do, with opt‑in pathways and telemetry. ChatGPT/Gemini sometimes solicit comparative feedback; Anthropic’s Claude Code asks users how it’s doing; and the user agreements make training opt‑in explicit (e.g., OpenAI’s codex: “will not use your content to train or improve their models unless you explicitly opt‑in”; Anthropic: “does not use your content to train models unless you provide explicit permission”). In other words, the loop exists, and it’s complicated by privacy and consent. (more: https://www.reddit.com/r/ChatGPTCoding/comments/1o4ir16/vibe_coding_and_the_popularization_of_cli/)
One meta‑note on knowledge diffusion: a CACM piece on the “silent scientist” problem was behind a cookie wall in the provided link, blocking access to the article’s content. Ironically apt—if research isn’t reachable, it can’t shape practice. The provided excerpt explicitly noted inability to summarize due to the consent wall. (more: https://cacm.acm.org/opinion/the-silent-scientist-when-software-research-fails-to-reach-its-audience/)
Sources (22 articles)
- How to re-create OpenAI Assistants locally? (www.reddit.com)
- Is Adversarial Injection Dead? A New, 'Cooperative' Paradigm for Exploring AI Censorship Boundaries (www.reddit.com)
- A 5-minute, no-BS way to pick a local model for your real task (www.reddit.com)
- A list of models released or updated this week on this sub, in case you missed any (10 Oct). (www.reddit.com)
- M2 Max 96GB - llama.cpp with codex and gpt-oss 120b to edit files and github upload (www.reddit.com)
- Why You Should Build AI Agents with Ollama First (www.reddit.com)
- Vibe Coding and the Popularization of CLI Interfaces: Why Don’t Big Companies Use Millions of Users as Contributors to Improve Models? (www.reddit.com)
- [Update] CodeLens.AI - Crowdsourced AI Leaderboard 3 Days Later: Blind Voting and What We Learned (www.reddit.com)
- ZephrFish/OmniProx (github.com)
- rexleimo/agno-Go (github.com)
- Netherlands cracks down on China-owned chip firm over security risk (www.bbc.com)
- Microsoft 'illegally' tracked students via 365 Education, says data watchdog (www.theregister.com)
- The Silent Scientist: When Software Research Fails to Reach Its Audience (cacm.acm.org)
- Alibaba-NLP/Tongyi-DeepResearch-30B-A3B (huggingface.co)
- ibm-granite/granite-4.0-micro (huggingface.co)
- This Week in Security: ID Breaches, Code Smell, and Poetic Flows (hackaday.com)
- NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents (arxiv.org)
- BasedBase/GLM-4.5-Air-GLM-4.6-Distill (huggingface.co)
- OpenWebUI en Docker no detecta modelo LLaMA3 instalado con Ollama en Linux (www.reddit.com)
- OpenAI’s AgentKit makes building AI agents way easier, design, chat, test, and connect everything in one place! (www.reddit.com)
- Very interesting! OmniInsert — mask-free video insertion of any reference (www.reddit.com)
- facebookresearch/DepthLM_Official (github.com)