Voice & Audio

Text-to-speech, speech recognition, voice cloning

85 articles across 39 editions

Articles

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA -- 2026-06-29
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence -- 2026-06-29
PP-OCRv6 on Hugging Face: 50-Language OCR from 1.5M to 34.5M Parameters -- 2026-06-29
MolmoMotion: Language-guided 3D motion forecasting -- 2026-06-29
k2-fsa/OmniVoice — High-Quality Voice Cloning TTS for 600+ Languages -- 2026-04-16
Show HN: Sub-500ms latency voice agent from scratch -- 2026-03-05
PKU-YuanGroup/Helios: Real Real-Time Long Video Generation Model -- 2026-03-04
StyleStream: Real-Time Zero-Shot Voice Style Conversion -- 2026-03-04
KokoClone: Kokoro TTS, but it clones voices now -- 2026-03-04
Speech to text via LLM -- 2026-01-16
kyutai-labs/pocket-tts -- 2026-01-16
zai-org/GLM-ASR-Nano-2512 -- 2025-12-12
zai-org/GLM-TTS -- 2025-12-11
openbmb/VoxCPM1.5 -- 2025-12-11
MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark -- 2025-12-04
nvidia/parakeet_realtime_eou_120m-v1 -- 2025-12-03
Qwen/Qwen3-VL-4B-Instruct -- 2025-11-20
Soul-AILab/SoulX-Podcast-1.7B -- 2025-11-20
Last week in Multimodal AI - Local Edition -- 2025-11-12
pnnbao97/VieNeu-TTS -- 2025-11-12
FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation -- 2025-11-12
GLaDOS TTS finetuning on MLX from the original game files -- 2025-11-04
zeusftk/FTK_CANVAS_AGENT_for_Comfyui -- 2025-11-04
guyyariv/DyPE -- 2025-11-04
Esonhugh/go-rex-java -- 2025-10-27
SuperSonic – SuperCollider's audio engine in a Web AudioWorklet -- 2025-10-27
3-way FTP: Pushing files around with silly and unusual methods -- 2025-10-27
HRV Gets Home Automation Upgrades -- 2025-10-27
Open source streaming STT (Parakeet + Silero + Pipecat Smart Turn) -- 2025-10-19
Turn ChatGPT into a real-time meeting assistant (via MCP + Apps SDK) -- 2025-10-19
BASICODE: A Bit Like Java, But From The 1980s -- 2025-10-18
Audio transcription with llama.cpp multimodal -- 2025-10-18
I built a fully automated AI podcast generator that connects to ollama -- 2025-10-18
Chinny (iOS/macOS): offline, on-device voice cloning with an optimized Chatterbox model -- 2025-10-12
herimor/voxtream -- 2025-10-12
microsoft/VibeVoice-Large -- 2025-10-12
chetwinlow1/Ovi -- 2025-10-12
Phr00t/Qwen-Image-Edit-Rapid-AIO -- 2025-10-12
kyomber/CVE-2025-8088 -- 2025-10-08
This Week in Security: CVSS 0, Chwoot, and Not in the Threat Model -- 2025-10-08
I created the cheapest possible AI voice agent (over 30x less expensive than Elevenlabs and OpenAI Realtime). Check out the Github repo below if you want to try it for yourself! -- 2025-10-07
MaximeRivest/maivi -- 2025-10-07
nineninesix/kani-tts-370m -- 2025-10-07
We just open-sourced Kroko ASR: a fast, streaming alternative to Whisper. It’s early days, we’d love testers, feedback, and contributors. -- 2025-10-04
Chaos96/NTPP -- 2025-09-27
We made a new AI interface that is compatible with Ollama -- 2025-09-24
if-ai/ComfyUI_HunyuanVideoFoley -- 2025-09-24
Show HN: Inferencer – Run and deeply control local AI models (macOS release) -- 2025-09-24
tencent/HunyuanWorld-Voyager -- 2025-09-24
FireRedTeam/FireRedTTS2 -- 2025-09-24
OpenBMB/VoxCPM -- 2025-09-22
voicepowered-ai/VibeVoice-finetuning -- 2025-09-22
Why is the name of a wireless mouse hard-coded into Windows Bluetooth drivers? -- 2025-09-17
Qwen3-Coder-480B Q2_K_XL same speed as Qwen3-235b-instruct Q3_K_XL WHY? -- 2025-09-09
Renting GPUs is hilariously cheap -- 2025-09-09
Ex-Miner Turned Local LLM Enthusiast, now I have a Dilemma -- 2025-09-09
Tencent-Hunyuan/HunyuanWorld-Voyager -- 2025-09-09
Smartphone Sensors Unlocked: Turn Your Phone into a Physics Lab -- 2025-09-08
UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets -- 2025-09-08
Voice cloning -- 2025-09-08
TencentARC/ToonComposer -- 2025-09-04
MeiGen-AI/InfiniteTalk -- 2025-09-04
RELEASED: ComfyUI Wrapper for Microsoft’s new VibeVoice TTS (voice cloning in seconds) -- 2025-09-03
High-Logic/Genie -- 2025-09-03
Has someone used OWebUi with Docling to talk to pdfs with visualizations? -- 2025-09-01
THU-BPM/Omni-SafetyBench -- 2025-09-01
AIDC-AI/Ovis2.5-9B -- 2025-09-01
TTS VibeVoice FastAPI -- 2025-08-30
Microsoft VibeVoice TTS : Open-Sourced, Supports 90 minutes speech, 4 distinct speakers at a time -- 2025-08-29
tencent/HunyuanVideo-Foley -- 2025-08-29
Made Chatterbox TTS a bit faster again on CUDA (155it/s on 3090) -- 2025-08-25
KittenML/KittenTTS -- 2025-08-25
Kitten TTS Web Demo -- 2025-08-09
Show HN: I built a tool to replace capcut audio transcription -- 2025-08-09
Whispers From The Void, Transcribed With AI -- 2025-08-09
kyutai/tts-voices -- 2025-08-08
Explore KittenTTS with Gradio: Easy Text-to-Speech model -- 2025-08-06
Introducing KokoroDoki, a Local, Open-Source and Real-Time TTS -- 2025-07-19
Voxtral – Frontier open source speech understanding models -- 2025-07-19
AI can now translate brain scans to text -- 2025-07-19
Suggestions to build local voice assistant -- 2025-07-03
google/gemma-3n-E4B -- 2025-07-03
openai/whisper-large-v3 -- 2025-06-23
Audio-Foundation-Models/ConversationTTS -- 2025-06-18
ResembleAI/chatterbox -- 2025-06-06