Encoder-Decoders, Fair Model Comparisons, and the T5Gemma Debate

Recent developments in language model architectures are challenging the dominance of decoder-only models, with renewed interest in encoder-decoder and encoder-only approaches. Google's release of T5Gemma, an encoder-decoder variant in the Gemma family, has reignited discussion about the architectural tradeoffs that underpin modern AI systems (more: https://www.reddit.com/r/LocalLLaMA/comments/1m16kdm/t5gemma_a_new_collection_of_encoderdecoder_gemma/).

Encoder-decoder models split the heavy lifting: the encoder processes the input sequence once, producing a condensed representation, while the decoder generates the output based on this summary. This design allows the decoder to operate without repeatedly attending to the entire (potentially massive) input, which can yield significant efficiency gains for tasks with long contexts—think summarization, translation, and style transfer. The catch? For multi-turn dialogue or tasks where the input changes frequently, the encoder must reprocess the full input each time, making techniques like KV caching—so effective in decoder-only models—far less useful. This undercuts efficiency in chat-style applications, but for single-turn or batch tasks, encoder-decoders can offer throughput and quality that outpace decoder-only setups under the same compute budget.
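
To make the caching tradeoff concrete, here is a minimal sketch using Hugging Face transformers, with t5-small standing in for any seq2seq checkpoint (a T5Gemma model would slot in the same way, assuming it exposes the standard AutoModelForSeq2SeqLM interface): the encoder runs once over the long input, and repeated decodes reuse its output.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")           # stand-in checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

doc = "summarize: " + "a long input document " * 50
inputs = tok(doc, return_tensors="pt", truncation=True)

enc = model.get_encoder()(**inputs)   # the expensive encoder pass, done once

# Sample two candidate outputs without re-encoding the input each time.
for _ in range(2):
    out = model.generate(
        encoder_outputs=enc,
        attention_mask=inputs["attention_mask"],
        do_sample=True,
        max_new_tokens=32,
    )
    print(tok.decode(out[0], skip_special_tokens=True))
```

For multi-turn chat, by contrast, every user message changes the input, so `enc` would have to be recomputed on each turn.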

Another key advantage: encoders can leverage bidirectional attention, attending to both past and future tokens, which enables richer representations for downstream tasks like classification or embedding generation. This contrasts with the unidirectional (causal) attention in decoders, which is optimal for text generation but suboptimal for embeddings. Notably, T5Gemma's large encoder could be repurposed as a high-quality sentence transformer, filling a gap left by the scarcity of large, open encoder-only models. There is optimism that T5Gemma's encoder, trained from scratch at scale, will perform strongly in this role.
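
A minimal sketch of that repurposing, again with t5-small as a placeholder: drop the decoder and mean-pool the encoder's bidirectional hidden states into a fixed-size embedding (a production sentence transformer would add contrastive fine-tuning on top of this).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("t5-small")            # stand-in checkpoint
encoder = AutoModel.from_pretrained("t5-small").encoder    # keep only the encoder

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)           # zero out padding
    return (hidden * mask).sum(1) / mask.sum(1)            # mean pooling

a, b = embed(["encoder-decoder models", "seq2seq architectures"])
print(torch.cosine_similarity(a, b, dim=0))                # similarity score
```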

The release of the Ettin Suite further enables direct, apples-to-apples comparisons between encoder and decoder models. Ettin provides matched encoder-only and decoder-only models, trained with identical data and recipes across multiple scales (17M–1B parameters) (more: https://huggingface.co/blog/ettin). The results are illuminating: encoders dominate in classification and retrieval tasks, while decoders retain an edge in text generation, especially as model size increases. Attempts to convert decoders into encoders (and vice versa) via continued pretraining show that architecture fundamentally matters—it's not just about the training objective. This underscores why production systems often still rely on BERT-like encoders for tasks that require fast, accurate, and memory-efficient embeddings.

In summary, while decoder-only models remain the default for general-purpose language generation, the resurgence of encoder-decoder and encoder-only architectures—now with open, large-scale models—brings much-needed diversity and specialization to the AI toolkit. This is particularly relevant as the field pushes for models that are both efficient and controllable, and as research communities revisit the architectural tradeoffs that were once considered settled.

Hierarchical Models and the End of Tokenization

The quest for more natural, robust language models is driving innovation beyond traditional tokenization. The H-Net architecture, introduced in a recent research paper, replaces fixed tokenization with dynamic chunking learned directly by the model (more: https://arxiv.org/abs/2507.07955, https://www.reddit.com/r/LocalLLaMA/comments/1lxd7nh/hnet_a_hierarchical_network_that_replaces/). Instead of relying on heuristics or language-specific rules to split text into tokens, H-Net automatically discovers meaningful "chunks"—units of data that are content- and context-dependent—while learning representations hierarchically.
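
The routing idea can be illustrated with a toy boundary detector: score a chunk boundary wherever adjacent byte-level representations disagree. This is only a sketch of the intuition; H-Net's actual routing module uses learned projections and differentiable smoothing rather than raw cosine similarity.

```python
import torch
import torch.nn.functional as F

def chunk_starts(hidden, threshold=0.5):
    # hidden: (seq, dim) byte-level states from a small inner encoder
    sim = F.cosine_similarity(hidden[:-1], hidden[1:], dim=-1)
    p_boundary = (1.0 - sim) / 2.0       # dissimilar neighbors -> likely boundary
    starts = torch.cat([torch.tensor([True]), p_boundary > threshold])
    return starts.nonzero().squeeze(-1)  # indices where new chunks begin

print(chunk_starts(torch.randn(16, 32)))  # e.g. tensor([0, 3, 7, ...])
```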

This approach addresses several long-standing issues. Tokenization introduces brittleness and language bias, with models often struggling on non-English text, code, or DNA sequences—domains where standard tokenizers perform poorly. H-Net sidesteps these pitfalls by operating at the byte level and learning segmentation strategies as part of pretraining. Benchmarks show that a one-stage H-Net can outperform strong Transformer baselines operating over standard byte-pair encoding (BPE) tokens. Stacking multiple hierarchy levels further improves scaling and abstraction, matching or beating token-based Transformers of twice its size, especially in languages and modalities with weak tokenization.

Beyond raw performance, this self-learned chunking enables models to build more abstract, semantically rich representations from the ground up. In effect, H-Net learns to group data at the right level of abstraction for each task and language, facilitating better generalization and robustness. This is a promising direction for the next generation of foundation models, potentially reducing preprocessing complexity and unlocking more universal, language-agnostic AI systems (more: https://goombalab.github.io/blog/).

Agentic Frameworks, MCP, and Local-First AI Agents

Agentic frameworks and tool integration protocols are rapidly evolving to make AI assistants more modular, private, and adaptable. The Model Context Protocol (MCP) ecosystem is maturing, with a growing community sharing MCP servers and setups that enable large language models to interact with external tools and data sources (more: https://www.reddit.com/r/OpenWebUI/comments/1m0224h/share_your_mcp_servers_and_experiments/). MCP, combined with translation layers like MCPO and orchestration platforms such as MetaMCP, allows users to assemble bespoke agentic systems—connecting LLMs to everything from Wikipedia and OpenStreetMap to proprietary APIs—within privacy-respecting, local-first environments.
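
Getting a tool onto this bus is deliberately lightweight. As a sketch, the official MCP Python SDK lets a few lines expose a local function as a server that any MCP-aware client can spawn (the tool here is illustrative):

```python
# pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; the MCP client spawns this process
```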

Open-source projects like ARGO exemplify this trend. ARGO is a cross-platform, offline AI agent client that runs locally, storing 100% of user data on-device (more: https://www.reddit.com/r/LocalLLaMA/comments/1m23efn/argo_a_localfirst_offline_ai_agent_that_puts_you/). It features a multi-agent engine capable of planning, tool use, and integrating Retrieval-Augmented Generation (RAG) over personal files, all with a visual "Agent Factory" for building custom assistants. Notably, ARGO supports local LLMs via Ollama and MCP, and can integrate with major cloud providers as needed, letting users balance privacy, performance, and cost.

These innovations are blurring the lines between agent frameworks and traditional programming. There's a growing debate about the complexity of graph-based tool orchestration (as in LangGraph) versus the directness of native programming control flow. Some developers argue that frameworks like LangGraph overcomplicate agentic workflows by abstracting control flow into graph vertices and edges, when native code (e.g., Go routines with type safety) can achieve the same with more transparency and performance (more: https://www.reddit.com/r/LocalLLaMA/comments/1m0hgtt/why_langgraph_overcomplicates_ai_agents_and_my_go/). The upshot: for technical users, native code remains king for maintainability and scalability; for non-technical users, visual graph UIs offer accessibility, but often at the cost of long-term complexity.
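
The counter-argument is easiest to see in code. Here is a hedged sketch of the "native control flow" style (the post's example is in Go; this is the same shape in Python, with `llm` and `tools` as placeholder callables):

```python
def run_agent(task, llm, tools, max_steps=8):
    """A plain agent loop: the control flow is just a loop and an if."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):       # an explicit loop, not graph edges
        reply = llm(messages)        # assumed to return {'text': ...} or a tool call
        if reply.get("tool") is None:
            return reply["text"]     # finished: a plain return, not an END node
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent exceeded its step budget")
```

Every branch is greppable and debuggable with ordinary tooling, which is precisely the maintainability argument.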

Alternatives and complements to MCP are also emerging for agent-level integration (more: https://www.reddit.com/r/ChatGPTCoding/comments/1lwmtpz/new_mcp_alt_just_dropped/). Libraries like fasta2a make it trivial to expose any agent as an A2A (agent-to-agent) server, supporting rich conversation context, task management, and compatibility with agentic frameworks (more: https://github.com/pydantic/fasta2a). The result is a flexible, composable ecosystem where tools, models, and user interfaces can be swapped and extended with minimal friction.
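
As a sketch of how little glue this requires (following the pydantic-ai documentation; the model id and prompt are illustrative):

```python
# pip install "pydantic-ai-slim[a2a]"
from pydantic_ai import Agent

agent = Agent("openai:gpt-4o", system_prompt="Answer briefly.")
app = agent.to_a2a()  # ASGI app speaking the A2A protocol, built on fasta2a
# serve with: uvicorn my_module:app --port 8000
```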

Local LLMs, Quantization, and Scaling Challenges

The local LLM landscape continues to accelerate, with new models pushing the limits of quality, efficiency, and deployment options. RekaAI's Flash 3.1 model, now available in quantized formats, is drawing attention for delivering Qwen3 32B-level performance at just 21B parameters (more: https://www.reddit.com/r/LocalLLaMA/comments/1lwgy9m/rekaairekaflash31_hugging_face/). RekaQuant, a new quantization technique released alongside, promises to further improve low-bit inference, making large models more accessible for edge deployment.

Meanwhile, Moonshot AI's Kimi K2 model is making waves by outperforming even OpenAI's GPT-4 on several benchmarks; as a mixture-of-experts model, it activates only a fraction of its total parameters per token. This reflects a broader trend: architectural improvements and smarter training data selection are yielding more capable models without simply scaling parameter counts (more: https://www.reddit.com/r/LocalLLaMA/comments/1m013ou/moonshot_ais_open_source_kimi_k2_outperforms_gpt4/).

Edge-focused models like LiquidAI's LFM2-700M are also raising the bar. LFM2 is a hybrid architecture with multiplicative gates and short convolutions, tuned for fast inference and low memory usage on CPUs and NPUs (more: https://huggingface.co/LiquidAI/LFM2-700M). It outperforms similarly-sized competitors in knowledge, math, and multilingual tasks, and is designed for flexible deployment across devices—from smartphones to vehicles. Its architecture, combining 10 convolutional and 6 attention layers, enables efficient tool use and instruction following, though it is not recommended for knowledge-intensive or programming tasks without fine-tuning.
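
A toy block in the spirit of that recipe, pairing a short depthwise convolution with a multiplicative gate (dimensions and wiring are illustrative, not the released LFM2 architecture):

```python
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)  # value and gate paths
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel - 1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal trim
        return self.out_proj(v * torch.sigmoid(g))  # multiplicative gate

block = GatedShortConv(64)
print(block(torch.randn(2, 16, 64)).shape)        # torch.Size([2, 16, 64])
```

Short convolutions keep per-token state and compute small, which is why such blocks suit CPU and NPU inference.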

Baidu's ERNIE 4.5 series, including the 0.3B-parameter post-trained model, showcases the state of the art in multimodal, MoE-based language models (more: https://huggingface.co/baidu/ERNIE-4.5-0.3B-PT). ERNIE 4.5 employs modality-isolated routing and loss functions to ensure both text and vision modalities are effectively represented. Its infrastructure innovations—such as FP8 mixed-precision training and hierarchical load balancing—enable efficient scaling and inference, with lossless quantization down to 2 bits.

The community is also anticipating the release of a fully open-source LLM from ETH Zurich and EPFL, trained on Switzerland’s “Alps” supercomputer and open data (more: https://www.reddit.com/r/LocalLLaMA/comments/1lx8qrz/eth_zurich_and_epfl_will_release_a_fully/). Available in 8B and 70B sizes under Apache 2.0, this initiative promises to enhance transparency and reproducibility—key concerns as most state-of-the-art models remain closed-source or trained on undisclosed data.

For those self-hosting LLMs, concurrency and scalability remain practical challenges. Solutions like vLLM are recommended over Ollama when supporting more than a handful of simultaneous users, as Ollama is optimized for low concurrency (~5 concurrent requests). Scaling strategies include running multiple model instances in containers with load balancing, or leveraging cloud GPU servers for higher throughput (more: https://www.reddit.com/r/ollama/comments/1lxf2d6/advice_needed_best_way_to_replace_together_api/).
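
The difference shows up as soon as clients fan out requests. A sketch against a vLLM OpenAI-compatible endpoint (assumes a server started with `vllm serve <model>`; the URL and model id are illustrative):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    r = await client.chat.completions.create(
        model="my-model",  # must match the model id the server was started with
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

async def main():
    # vLLM batches these server-side; Ollama would serialize most of them.
    answers = await asyncio.gather(*[ask(f"Fact number {i}?") for i in range(32)])
    print(len(answers))

asyncio.run(main())
```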

Diffusion Language Models and the Dream 7B Milestone

A major technical milestone: support for diffusion-based language models has landed in llama.cpp, starting with Dream 7B, a model developed by the University of Hong Kong in collaboration with Huawei Noah’s Ark Lab (more: https://www.reddit.com/r/LocalLLaMA/comments/1m1h0fy/support_for_diffusion_models_dream_7b_has_been/). Unlike standard autoregressive LLMs that predict the next token sequentially, diffusion models generate text by progressively denoising random noise, akin to how Stable Diffusion creates images. Dream 7B achieves parity with, or even surpasses, leading autoregressive models of similar size on general, math, and coding tasks. It also demonstrates strong planning and inference flexibility, a natural byproduct of the diffusion approach.
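
A toy sketch of the decoding loop behind such models: start from an all-masked sequence and iteratively commit the most confident predictions. Real schedules, remasking, and scoring in Dream 7B differ; the `model` here is a stand-in.

```python
import torch

def diffusion_decode(model, length, steps=8, mask_id=0):
    seq = torch.full((length,), mask_id)            # start as pure "noise"
    for step in range(steps):
        masked = seq.eq(mask_id)
        if not masked.any():
            break
        logits = model(seq.unsqueeze(0))[0]         # (length, vocab)
        conf, pred = logits.softmax(-1).max(-1)     # per-position confidence
        k = max(1, int(masked.sum().item() / (steps - step)))  # unmask budget
        score = torch.where(masked, conf, torch.full_like(conf, -1.0))
        idx = score.topk(k).indices                 # most confident masked slots
        seq[idx] = pred[idx]
    return seq

# stand-in "model": random logits over a 100-token vocabulary
print(diffusion_decode(lambda s: torch.randn(1, s.size(1), 100), 16))
```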

The implications are significant: diffusion models could be leveraged for speculative decoding—rapidly generating a draft response before refining it with a heavier autoregressive model. However, current implementations are slower and limited to single-shot responses, not yet supporting conversational use. As with early diffusion models in vision, practical speed and usability will improve with further optimization and research.

Video Scene Generation and Surfel-Indexed Memory

On the generative media front, researchers have introduced VMem, a memory-augmented method for consistent interactive video scene generation (more: https://github.com/runjiali-rl/vmem). VMem anchors past views to "surfels"—surface elements in 3D space—allowing the model to condition new views on the most relevant historical information. This approach addresses the long-standing challenge of maintaining scene consistency over extended video sequences, reducing artifacts and improving quality without the computational overhead of traditional inpainting or geometry estimation. Built atop modern diffusion and transformer architectures, VMem represents a step toward more interactive, coherent video synthesis.
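
In spirit, the retrieval step works like this toy sketch: surfels remember which past frame observed them, and a new camera pose collects votes from nearby surfels (all numbers illustrative; VMem's actual geometry and conditioning are far richer):

```python
import numpy as np

rng = np.random.default_rng(0)
surfel_pos = rng.random((500, 3))         # surfel centers in world space
surfel_frame = rng.integers(0, 50, 500)   # past frame that observed each surfel

def relevant_frames(cam_pos, k=3, radius=0.3):
    near = np.linalg.norm(surfel_pos - cam_pos, axis=1) < radius
    votes = np.bincount(surfel_frame[near], minlength=50)
    return np.argsort(votes)[::-1][:k]    # top-k most-voted past frames

print(relevant_frames(np.array([0.5, 0.5, 0.5])))
```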

Efficient AI for Wireless and Hardware

The growing demand for efficient AI in wireless communications is spurring innovation at the intersection of deep learning and hardware. The OpenDPDv2 framework presents an end-to-end, open-source toolkit for digital predistortion (DPD) of RF power amplifiers using neural networks (more: https://arxiv.org/abs/2507.06849v1). DPD is critical for improving signal quality and reducing interference, but neural approaches are typically power-hungry. OpenDPDv2 introduces TRes-DeltaGRU, a hybrid model that leverages temporal sparsity (skipping redundant computations) and quantization-aware training to dramatically cut inference energy while maintaining high linearization performance. On real-world hardware (ARMv7-A CPUs), this yields over 4.5x energy savings, with memory access, not arithmetic, emerging as the primary bottleneck.
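
The delta principle is simple to sketch: recompute only where the input changed by more than a threshold. A toy per-sample version follows (DeltaGRU applies this at much finer, per-element granularity with specialized kernels):

```python
import torch

def delta_gru_step(cell, x_t, x_prev, h, threshold=0.05):
    changed = (x_t - x_prev).abs().amax(-1) > threshold  # which rows moved enough?
    if changed.any():
        h = h.clone()
        h[changed] = cell(x_t[changed], h[changed])      # recompute only those
    return h                                             # other rows reuse old state

cell = torch.nn.GRUCell(8, 16)
x0, x1, h = torch.randn(4, 8), torch.randn(4, 8), torch.zeros(4, 16)
print(delta_gru_step(cell, x1, x0, h).shape)             # torch.Size([4, 16])
```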

This hardware-aware mindset is mirrored in the Zig programming ecosystem, which provides advanced allocators for memory management and leak detection (more: https://tgmatos.github.io/defeating-memory-leaks-with-zig-allocators/). Zig’s debug allocator, for instance, captures stack traces on allocation and free, detects double frees, and never reuses memory addresses—making it easier to catch subtle bugs, particularly in recursive data structures common in interpreters and VMs. The lesson: robust, low-level tooling is essential for building reliable AI and systems software, especially as models move closer to the edge.

Safety, Backups, and AI Code Risks

As AI agents gain more autonomy, the risks of granting them file system access are becoming all too real. A cautionary tale surfaced when a user let Anthropic’s Claude manage shell commands on their Mac; a poorly handled cleanup operation resulted in the catastrophic deletion of their entire desktop and repository (more: https://www.reddit.com/r/ClaudeAI/comments/1m21go1/claude_deleted_my_whole_repository/). The community’s consensus is clear: never treat git as a backup, always use external or cloud-synced backup systems, and consider running AI agents inside containers or VMs to sandbox potential damage.

Projects like Terragon Labs are responding with solutions that run code agents in isolated cloud environments, integrating with version control for automatic, off-device backups. This approach prevents unrecoverable losses and limits the blast radius of destructive AI actions. As agentic systems become more capable—and more trusted—robust safety, auditability, and backup strategies are not just best practices, but essential requirements.

Hardware and Open 3D Printing Innovations

Finally, the hardware hacker community continues to push boundaries with open, unconventional 3D printer designs. A recent build employs cantilever arms for the print head, offering a more open, unenclosed design compared to cubic frames or bed-slingers (more: https://hackaday.com/2025/07/12/an-open-concept-3d-printer-using-cantilever-arms/). While the design faces challenges with belt tension and potential print artifacts due to flex, it exemplifies the spirit of experimentation and optimization. Suggestions include leveraging closed-loop control for real-time compensation of flex—an approach that, if generalized, could benefit all printers as they push the limits of speed and precision.

This ethos of continuous iteration, rigorous testing, and willingness to rethink established conventions is a throughline across the AI and tech landscape—from model architectures and agentic protocols to hardware and security practices.

Sources (20 articles)

  1. ARGO - A Local-First, Offline AI Agent That Puts You in Control (www.reddit.com)
  2. Support for diffusion models (Dream 7B) has been merged into llama.cpp (www.reddit.com)
  3. ETH Zurich and EPFL will release a fully open-source LLM developed on public infrastructure (www.reddit.com)
  4. Why LangGraph overcomplicates AI agents (and my Go alternative) (www.reddit.com)
  5. Moonshot AI’s open source Kimi K2 outperforms GPT-4 in key benchmarks (www.reddit.com)
  6. Advice Needed: Best way to replace Together API with self-hosted LLM for high-concurrency app (www.reddit.com)
  7. new MCP alt. just dropped (www.reddit.com)
  8. Claude deleted my whole repository (www.reddit.com)
  9. runjiali-rl/vmem (github.com)
  10. pydantic/fasta2a (github.com)
  11. Defeating Memory Leaks with Zig Allocators (tgmatos.github.io)
  12. baidu/ERNIE-4.5-0.3B-PT (huggingface.co)
  13. LiquidAI/LFM2-700M (huggingface.co)
  14. An Open-Concept 3D Printer Using Cantilever Arms (hackaday.com)
  15. OpenDPDv2: A Unified Learning and Optimization Framework for Neural Network Digital Predistortion (arxiv.org)
  16. Seq vs Seq: the Ettin Suite of Paired Encoders and Decoders (huggingface.co)
  17. H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model (www.reddit.com)
  18. RekaAI/reka-flash-3.1 · Hugging Face (www.reddit.com)
  19. Share your MCP servers and experiments! (www.reddit.com)
  20. T5Gemma: A new collection of encoder-decoder Gemma models- Google Developers Blog (www.reddit.com)