Multimodal memory and perception
DeepSeek’s latest round of community analysis argues that compressing vision tokens—especially for OCR—can dramatically extend effective context without quadratic memory blowups, at some accuracy cost. Commenters cite DeepSeek’s OCR showing roughly 10× compression at about 92% accuracy and point to work exploring “glyphs” that bundle characters into visual patches for 3–4× token savings; the tradeoff is lossy compression, with reported 0–2% drops on some benchmarks, but major VRAM savings and longer contexts. Others counter that LLMs already “remember” plenty and that the more interesting frontier is selective forgetting—prioritizing high‑value signal over “human slop.” The debate highlights a practical angle: better compression can enable larger windows, but “memory” still breaks on cross‑references humans consider trivial, so the goal is smarter context, not just bigger (more: https://www.reddit.com/r/LocalLLaMA/comments/1oje2cc/deepseek_may_have_found_a_new_way_to_improve_ais/).
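To make the arithmetic concrete, here is a back-of-the-envelope sketch (not DeepSeek's method) of how a fixed token window stretches when pages are rendered into compressed vision tokens instead of text tokens; the 10× ratio is the figure cited in the thread, and the per-page token count is an assumption for illustration.

```python
# Rough sizing sketch: pages that fit in a fixed window as raw text tokens vs.
# as compressed vision tokens. The 10x ratio comes from the thread; the
# 800-tokens-per-page figure is a hypothetical stand-in.

def effective_pages(window_tokens: int, text_tokens_per_page: int,
                    compression_ratio: float) -> tuple[int, int]:
    as_text = window_tokens // text_tokens_per_page
    as_vision = window_tokens // max(1, round(text_tokens_per_page / compression_ratio))
    return as_text, as_vision

text_pages, vision_pages = effective_pages(128_000, 800, 10.0)
print(f"~{text_pages} pages as text vs ~{vision_pages} pages as vision tokens")
# -> roughly 160 vs 1600 pages for a 128K window, before the reported 0-2% accuracy hit
```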
On the product side, Qwen3‑VL‑8B‑Thinking is explicit about scaling both perception and context: native 256K, expandable to 1M; stronger OCR (32 languages); visual agent abilities to operate GUIs; and architecture tweaks like Interleaved‑MRoPE and DeepStack to stabilize long‑horizon and fine‑grained alignment. The “Thinking” variant targets stepwise reasoning, with tool invocation and long video handling as first‑class citizens (more: https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking).
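For readers who want to try it, the sketch below loads the model through Transformers' generic image-text-to-text pipeline; the task name, message format, and generation arguments are assumptions based on recent Transformers releases, so treat the model card's own example as the authoritative path.

```python
# Hedged sketch: running Qwen3-VL-8B-Thinking via the generic image-text-to-text
# pipeline in a recent Transformers release. Verify pipeline support and the exact
# chat format against the model card before relying on this.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="Qwen/Qwen3-VL-8B-Thinking",
    device_map="auto",
    torch_dtype="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/receipt.png"},  # placeholder image
        {"type": "text", "text": "Read the totals and explain your reasoning step by step."},
    ],
}]

out = pipe(text=messages, max_new_tokens=512)
print(out[0]["generated_text"])  # full chat, including the assistant's "thinking" turn
```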
NVIDIA’s OmniVinci leans into “omni‑modal” understanding—see, read, listen, speak—with a Hugging Face drop, inference examples across video, audio, and images, and a noncommercial license. The pitch is straightforward: unified multimodal comprehension with practical code paths to try it quickly in Transformers, while signaling competitive performance on common audio/vision benchmarks (more: https://huggingface.co/nvidia/omnivinci).
Real-time vision super-resolution
A different path to efficiency shows up in diffusion‑based video super‑resolution. FlashVSR proposes a one‑step, streaming setup that reports roughly 17 FPS at 768×1408 on a single A100 via three ideas: a distillation pipeline for streaming SR, locality‑constrained sparse attention (LCSA) to cut redundant compute and bridge train/test resolution gaps, and a tiny conditional decoder. The authors claim up to ~12× speedups over prior one‑step diffusion VSR baselines and caution that some third‑party ports that omit LCSA fall back to dense attention and produce worse artifacts—use the official pipeline to get the advertised scaling (more: https://github.com/OpenImagingLab/FlashVSR).
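The locality idea is easy to picture without the custom kernel: each query attends only to keys inside a small neighborhood, so cost tracks the window size rather than the full token count. The sketch below shows the masking pattern with a dense boolean mask (a real sparse kernel skips the masked blocks entirely, which is where the savings come from); shapes and window size are arbitrary.

```python
# Illustration of the locality constraint behind LCSA, not the FlashVSR kernel:
# queries may only attend to keys within a fixed window of positions.
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window: int):
    # q, k, v: (batch, heads, seq, dim); seq indexes flattened spatial positions
    seq = q.shape[-2]
    idx = torch.arange(seq)
    mask = (idx[None, :] - idx[:, None]).abs() <= window  # True = allowed
    # Dense mask for clarity; a sparse implementation never computes masked scores.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = k = v = torch.randn(1, 8, 1024, 64)
print(local_window_attention(q, k, v, window=32).shape)  # torch.Size([1, 8, 1024, 64])
```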
As multimodal systems push toward embodied and real‑time experiences, makers continue to fill in the tactile side. A Hackaday build shows a DIY force‑feedback joystick using an Arduino Micro, TMC2208 stepper drivers, belt‑driven pulleys, and magnetic encoders, presenting itself to the host as a USB HID device. Community discussion notes that while commercial force feedback faded for a while, better visuals and sim demand are reviving it; steppers offer predictable positioning with straightforward control, though brushed motors remain plausible with the right closed‑loop design (more: https://hackaday.com/2025/10/30/build-your-own-force-feedback-joystick/).
Beyond attention for long contexts
Manifest AI’s Brumby‑14B markets a “completely attention‑free LLM” with “power retention” and promises of long‑context inference “hundreds of times faster” than attention. Community code reads find standard Q/K/V projections in the forward path and a gating layer feeding a custom kernel; even their paper suggests it’s a form of linear attention, with speed still degrading with context length (subquadratic but not constant). The model also appears initialized from Qwen3‑14B, with a $4K “retraining” budget leading some to view it as closer to fine‑tuning. Interesting exploration, but the presentation invites skepticism until independent benchmarks land (more: https://www.reddit.com/r/LocalLLaMA/comments/1ojvgsx/manifestai_releases_brumby14bbase_weights_claims/).
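The distinction commenters are drawing is mostly about state size. A linear-attention-style recurrence carries a fixed-size running state instead of a growing KV cache, so per-token cost stays flat as context grows; the generic sketch below illustrates that property and is not Manifest AI's "power retention" kernel (normalization and gating are omitted).

```python
# Generic linear-attention recurrence: the state is a fixed (d_k x d_v) matrix,
# so memory and per-token compute do not grow with context length.
import torch

def linear_attention_step(state, k, v, q):
    state = state + torch.outer(k, v)   # accumulate k_t v_t^T
    return state, q @ state             # read out for the current token

d_k, d_v = 64, 64
state = torch.zeros(d_k, d_v)
for _ in range(10_000):  # context length only changes how long we loop
    k, v, q = torch.randn(d_k), torch.randn(d_v), torch.randn(d_k)
    state, out = linear_attention_step(state, k, v, q)
print(state.shape, out.shape)  # torch.Size([64, 64]) torch.Size([64])
```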
Choosing kernels today remains contextual. Practitioners advise FlashAttention‑3 for maximum throughput on H100s with standard causal/bidirectional masks, while FlexAttention (especially via Unsloth) shines for research, custom masking, or pushing context length with nonstandard patterns. Reported speedups vary: Unsloth touts big gains over FlashAttention‑2, FA3 adds another 1.5–2× in some regimes, and others report smaller wins depending on sequence length and mask type. The practical advice: if you’re training a conventional model, FA3 is a safe default; if you’re experimenting with bespoke attention, Flex buys flexibility (more: https://www.reddit.com/r/LocalLLaMA/comments/1oi3w68/flex_attention_vs_flash_attention_3/).
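A feel for why Flex earns its keep with nonstandard patterns: a sliding-window causal mask is a few lines of score_mod, whereas a fixed kernel would need explicit support for it. The sketch below assumes PyTorch 2.5+ and a GPU, with arbitrary shapes; wrapping the call in torch.compile is what recovers fused-kernel speed.

```python
# Hedged FlexAttention sketch: a custom score_mod implementing sliding-window
# causal attention. Runs eagerly for clarity; compile flex_attention for speed.
import torch
from torch.nn.attention.flex_attention import flex_attention

WINDOW = 256

def sliding_window_causal(score, b, h, q_idx, kv_idx):
    keep = (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)
    return torch.where(keep, score, torch.full_like(score, float("-inf")))

# Needs a CUDA device for these dtypes; shapes are illustrative.
q = k = v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
out = flex_attention(q, k, v, score_mod=sliding_window_causal)
print(out.shape)  # torch.Size([1, 8, 2048, 64])
```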
Local constraints still matter. A Windows 11 user with 32 GB RAM and 8 GB VRAM asked for the best model they could run locally and how to strip content safeguards. The thread orients around the usual tradeoffs—quantization, context budgets, and responsiveness—underscoring that hardware ceilings and desired behavior (including attempts to bypass guardrails) strongly shape model choice and setup (more: https://www.reddit.com/r/ollama/comments/1oh9s66/whats_the_best_i_can_run_with_32gb_of_ram_and_8gb/).
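The sizing logic behind those tradeoffs is simple enough to sketch: quantized weights plus the KV cache have to fit under the 8 GB VRAM ceiling, or spill into slower system RAM. The dimensions below are hypothetical but typical of an 8B-class model with grouped-query attention.

```python
# Rough VRAM estimate: weights at a given bit-width plus KV cache. Real usage
# adds framework overhead, so leave headroom. All dimensions are illustrative.

def vram_gb(params_b: float, weight_bits: float, ctx: int, layers: int,
            kv_dim: int, kv_bits: int = 16) -> float:
    weights = params_b * 1e9 * weight_bits / 8            # bytes
    kv_cache = 2 * layers * ctx * kv_dim * kv_bits / 8    # K and V per layer
    return (weights + kv_cache) / 1e9

# Hypothetical 8B model, 4-bit weights, 8K context, 32 layers, GQA KV width 1024
print(f"{vram_gb(8, 4, 8192, 32, 1024):.1f} GB")  # ~5.1 GB: fits in 8 GB with headroom
# The same model in fp16 needs ~16 GB for weights alone and would not fit.
```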
Do LLMs truly reason?
A Towers of Hanoi study from CMU and Berkeley prompted spirited debate: failure rates rose as the puzzles grew larger; “reasoning” prompts helped, but agentic wrappers didn’t; and the authors conclude LLMs are largely following high‑probability modes rather than performing “genuine” reasoning. Critics push back that humans also degrade with task complexity, and question why “reasoning/thinking” model variants (e.g., DeepSeek R1, Claude Sonnet’s Thinking) weren’t evaluated. The behavioral failure mode—looping or doubling down rather than stepping back—is familiar to users and highlights the need for meta‑cognitive controls in the loop (more: https://www.reddit.com/r/LocalLLaMA/comments/1oi1f69/ai_agents_reasoning_collapse_imminent_cmu_berkeley/).
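For context on why the puzzle stresses long-horizon execution: the optimal solution takes 2^n - 1 moves, so flawless play gets exponentially harder as disks are added even though the recursive rule is trivial to state.

```python
# Classic recursive Towers of Hanoi solver, included only to show how fast the
# required move count grows with the number of disks.

def hanoi(n, src="A", dst="C", via="B", moves=None):
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
    else:
        hanoi(n - 1, src, via, dst, moves)   # park n-1 disks on the spare peg
        moves.append((src, dst))             # move the largest disk
        hanoi(n - 1, via, dst, src, moves)   # re-stack the n-1 disks on top
    return moves

for n in (3, 7, 10):
    print(n, "disks ->", len(hanoi(n)), "moves")  # 7, 127, 1023
```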
Practitioner experience suggests process matters. One engineer reports six months of daily Claude Code use to refactor a large internal app—from 100k to 300–400k LOC—emphasizing structured pairing, disciplined prompting, and frequent replans over “vibecoding.” Key tactics include skills that auto‑activate, documentation flows to keep context aligned, and knowing when to intervene manually rather than forcing the agent to do everything. The takeaway: “reasoning” improves when workflows give models sharper scaffolding and tighter feedback loops (more: https://www.reddit.com/r/ClaudeAI/comments/1oivjvm/claude_code_is_a_beast_tips_from_6_months_of/).
A complementary direction is to make natural language executable. Dao Studio’s “Natural Language Programming” runs instructions as scripts—plain words in, workflows out—pursuing reliability and cost‑effectiveness for automation via open source. It’s not “English replaces Python,” but “express intent in English, run with guardrails,” which meshes well with agent runtimes that need clarity, not cleverness (more: https://www.reddit.com/r/ChatGPTCoding/comments/1ogogk5/natural_language_programming_run_natural_language/).
Rails for accountable agents
Security teams are codifying simple rules that product developers can apply. Meta’s “Agents Rule of Two” says an agent should only have two of: read untrusted data, access sensitive data, or take actions. All three at once is a straight shot to compromise. Commenters note the idea reframes known risks (“lethal trifecta”) into a usable design pattern: break the chain, reduce blast radius (more: https://www.linkedin.com/posts/georgzoeller_ai-security-compliance-activity-7390608158307643392-eVkb).
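Read literally, the rule is a one-line configuration check, along the lines of the sketch below; the field names are invented for illustration, while the rule itself is Meta's.

```python
# Hypothetical "Rule of Two" gate: an agent may combine at most two of the three
# risky capabilities. Field names are illustrative, not from Meta's post.
from dataclasses import dataclass

@dataclass
class AgentCapabilities:
    reads_untrusted_input: bool
    accesses_sensitive_data: bool
    takes_external_actions: bool

def violates_rule_of_two(caps: AgentCapabilities) -> bool:
    enabled = [caps.reads_untrusted_input,
               caps.accesses_sensitive_data,
               caps.takes_external_actions]
    return sum(enabled) > 2

browser_agent = AgentCapabilities(True, True, True)
print(violates_rule_of_two(browser_agent))  # True -> split the workflow or add a human gate
```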
Identity and authorization are catching up. An OpenID Foundation–aligned editorial argues for “on‑behalf‑of by default,” binding human, agent, and intent to every action; CIBA for asynchronous approvals at the right risk thresholds; registries for agent capability discovery via the Model Context Protocol (MCP); and Web Bot Authentication so APIs can verify which agent is calling. Treat agents as first‑class identities with SCIM for rapid de‑provisioning, and move policy to the edge so governance travels with the call. It’s the boring-but‑vital plumbing that makes autonomy auditable (more: https://www.linkedin.com/posts/stuart-winter-tear_openid-identity-management-for-agentic-activity-7390727231326539776-UMhJ).
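What “on-behalf-of by default” might look like on the wire: every agent call carries the human principal, the agent identity, and the stated intent, loosely in the spirit of the RFC 8693 actor claim. The claim names below are illustrative, not taken from any OpenID specification.

```python
# Illustrative token payload binding human, agent, and intent to a single call.
# Claim names are assumptions for the sketch, not a published profile.
import json, time

obo_token_claims = {
    "sub": "user:alice@acme",                   # the human principal
    "act": {"sub": "agent:expense-bot@acme"},   # the agent acting on their behalf
    "intent": "file Q3 travel reimbursement",   # bound intent for this action
    "aud": "https://api.acme.example/payments",
    "exp": int(time.time()) + 300,              # short-lived by design
}
print(json.dumps(obo_token_claims, indent=2))
```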
Agent platforms are already integrating pieces of this. Hopper, a privacy‑focused WearOS assistant, supports OpenAI‑compatible endpoints (including self‑hosted LLMs); built‑in tools for notes, web search, and alarms; custom webhook tools; and, crucially, remote MCP servers for safe tool discovery and invocation. It demonstrates a phone‑free, watch‑native agent that chains tools while keeping user control front‑and‑center (more: https://www.reddit.com/r/LocalLLaMA/comments/1ol8zo5/i_built_a_privacy_focused_ai_assistant_for_wearos/).
Defensive tooling is also evolving. One AI security post introduces AIMDS, a neuro‑symbolic “AI Defence” layer claimed to run continuous self‑assessment loops at microsecond intervals, detect manipulation in‑flight, and block risky outputs with sub‑10ms latency at up to a reported 1M requests/second via WASM SIMD, vector caching (AgentDB), and multicore parallelism. Impressive claims, but as with all vendor‑authored benchmarks, independent validation will matter (more: https://www.linkedin.com/posts/reuvencohen_my-latest-ai-defence-system-is-more-than-activity-7389751718428930048-Vovs).
Security alerts and attack surface
BleepingComputer’s feed captures the breadth of today’s attack surface. Highlights include CISA and NSA guidance for securing Microsoft Exchange; a token leak at Open VSX enabling malicious extension publishing; China‑linked exploitation of a Lanscope endpoint manager zero‑day; a Windows zero‑day used to spy on European diplomats; Microsoft Edge’s new scareware sensor; and a surge of NFC relay malware in Eastern Europe with hundreds of malicious Android apps targeting payment cards. The same feed notes OpenAI exploring memory‑based ads in ChatGPT and Google confirming ads in AI‑powered Search results, signaling monetization pressure alongside security hardening (more: https://www.bleepingcomputer.com/news/security/cisa-and-nsa-share-tips-on-securing-microsoft-exchange-servers/).
Meanwhile, OSINT tooling remains a double‑edged sword. Spyder‑OSINT aggregates lookups across major social platforms and email intelligence, even bundling an “email bomber.” It’s a reminder that capability isn’t the bottleneck—governance is. Use responsibly, or better, design systems where misuse is harder than doing the right thing (more: https://github.com/mocred/spyder-osint).
Erasing harmful model knowledge
As regulators push for “right to be forgotten,” unlearning moves from demo to discipline. Metamorphosis Representation Projection (MRP) proposes to erase targeted knowledge by projecting hidden states onto subspaces, inserting small projection matrices after selected MLP layers and training only those (~0.1M parameters reported). The method is idempotent and aims to be irreversible: reapplying the projection doesn’t reintroduce content, and the model resists relearning and jailbreaks better than baselines in reported tests. On sequential unlearning, MRP reports a 0.905 score after four tasks, while maintaining low accuracy (0.383) on “forgotten” items even after relearning attempts (more: https://arxiv.org/abs/2508.15449v1).
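The geometric core is compact enough to sketch (this is the projection intuition, not MRP's training recipe): project hidden states onto the orthogonal complement of a “forget” subspace. Because the projector is idempotent, reapplying it cannot reintroduce the erased directions; the subspace and dimensions below are arbitrary.

```python
# Idempotent subspace projection: P = I - U U^T removes any component of the
# hidden state lying in the "forget" subspace spanned by U. Not MRP's method of
# learning U; just the linear-algebra property the paper relies on.
import torch

d, k = 768, 8                                   # hidden size, forget-subspace rank
U, _ = torch.linalg.qr(torch.randn(d, k))       # orthonormal basis for the subspace
P = torch.eye(d) - U @ U.T                      # projector onto the complement

h = torch.randn(4, d)                           # a batch of hidden states
h_clean = h @ P                                 # P is symmetric, so this projects each row
print(torch.allclose(P @ P, P, atol=1e-4))                          # idempotent: True
print(torch.allclose(h_clean @ U, torch.zeros(4, k), atol=1e-4))    # forget directions gone: True
```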
The framing echoes an argument from the DeepSeek thread: forgetting is not a bug, it’s a feature—prioritize high‑value signal, suppress low‑value or harmful data, and reduce interference. Compression, longer contexts, and retrieval help you remember what matters now; unlearning helps you reliably forget what should never have been there (more: https://www.reddit.com/r/LocalLLaMA/comments/1oje2cc/deepseek_may_have_found_a_new_way_to_improve_ais/).
Monetization, limits, and abuse
User chatter around Anthropic’s “Neptune V6” update centers less on model names and more on economics: speculation about “Opus 4.5” or “Sonnet 4.6/4.7” mixes with frustration over hard weekly limits for Opus on the Max plan, workarounds like timed session resets, and strategies like using a deepcontext MCP server to reduce token burn. Others report no issues depending on workflows, but the sentiment is clear: capability without predictable usage budgets frustrates non‑developer users (more: https://www.reddit.com/r/ClaudeAI/comments/1oi2727/latest_update_from_anthropics_new_model_neptune_v6/).
The monetization story broadens: OpenAI is reportedly considering memory‑based ads in ChatGPT, and Google says AI Search will include ads—likely with new formats. Expect more personalization pressure, and with it, more headaches for privacy and safety teams (more: https://www.bleepingcomputer.com/news/security/cisa-and-nsa-share-tips-on-securing-microsoft-exchange-servers/).
Finally, a removed post alleged a venture‑backed “AI phone farm” aimed at flooding social media with spam. With no details beyond the title, treat it as a cautionary note, not a verified event. Still, it underscores the stakes: the same agentic stack that boosts productivity can be weaponized for scale abuse, making the identity, policy, and security rails discussed above non‑optional (more: https://www.reddit.com/r/AINewsMinute/comments/1oj2fpi/ai_phone_farm_startup_gets_funding_from_marc/).
As a coda for practitioners: the playbook for training smaller, cheaper models continues to evolve, and resources like Hugging Face’s “Smol Training Playbook” hint at community efforts to share what works in the trenches—worth bookmarking as techniques stabilize (more: https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook).
Sources (21 articles)
- [Editorial] Agent limits (www.linkedin.com)
- [Editorial] Agent Identity (www.linkedin.com)
- [Editorial] AI Defense (www.linkedin.com)
- Flex Attention vs Flash Attention 3 (www.reddit.com)
- AI Agents Reasoning Collapse Imminent (CMU, Berkeley) (www.reddit.com)
- manifestai releases Brumby-14B-Base weights, claims "attention free" and inference "hundreds of time faster" for long context (www.reddit.com)
- I built a privacy focused AI assistant for WearOS that supports locally hosted LLMs (www.reddit.com)
- DeepSeek may have found a new way to improve AI’s ability to remember (www.reddit.com)
- What's the best, I can run with 32GB of RAM and 8GB of VRAM (www.reddit.com)
- Natural Language Programming: Run Natural Language as Script (www.reddit.com)
- Claude Code is a Beast – Tips from 6 Months of Hardcore Use (www.reddit.com)
- OpenImagingLab/FlashVSR (github.com)
- mocred/spyder-osint (github.com)
- CISA and NSA share tips on securing Microsoft Exchange servers (www.bleepingcomputer.com)
- The Smol Training Playbook: The Secrets to Building World-Class LLMs (huggingface.co)
- Qwen/Qwen3-VL-8B-Thinking (huggingface.co)
- nvidia/omnivinci (huggingface.co)
- Build Your Own Force-Feedback Joystick (hackaday.com)
- Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection (arxiv.org)
- Latest Update from Anthropic's new model - Neptune V6 (www.reddit.com)
- AI "Phone Farm" Startup Gets Funding from Marc Andreessen to Flood Social Media With Spam (www.reddit.com)