Speech and Audio

FinAudio: A Benchmark for Audio Large Language Models in Financial Applications

26 March 2025·370 words·2 mins

AI Generated 🤗 Daily Papers Speech and Audio Speech Recognition 🏢 Stevens Institute of Technology

FinAudio: the first benchmark for audio LLMs in financial applications, aimed at improving financial audio analysis and investment decisions.

Quantization for OpenAI's Whisper Models: A Comparative Analysis

12 March 2025·1308 words·7 mins

AI Generated 🤗 Daily Papers Speech and Audio Speech Recognition 🏢 Independent Researcher

Quantization optimizes OpenAI’s Whisper models, balancing model size, speed, and accuracy for diverse applications.

DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion

3 March 2025·1645 words·8 mins

AI Generated 🤗 Daily Papers Speech and Audio Music Generation 🏢 Northwestern Polytechnical University

DiffRhythm: Fast & Simple End-to-End Song Generation via Latent Diffusion, creating full songs (4+ mins) with vocal & accompaniment in seconds!

Slamming: Training a Speech Language Model on One GPU in a Day

19 February 2025·2787 words·14 mins

AI Generated 🤗 Daily Papers Speech and Audio Speech Synthesis 🏢 Hebrew University of Jerusalem

Slam: a recipe for training speech language models (SLMs) on one GPU in a day.

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

18 February 2025·2399 words·12 mins

AI Generated 🤗 Daily Papers Speech and Audio Music Generation 🏢 Beihang University

SongGen: Single-stage autoregressive transformer for controllable text-to-song generation, simplifying the process and improving control.

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

6 February 2025·3169 words·15 mins

AI Generated 🤗 Daily Papers Speech and Audio Speech Coding 🏢 Concordia University

FocalCodec: a single-codebook, low-bitrate speech codec based on focal modulation that achieves competitive performance in speech resynthesis and voice conversion.

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

27 January 2025·2407 words·12 mins

AI Generated 🤗 Daily Papers Speech and Audio Text-to-Speech 🏢 Chinese University of Hong Kong, Shenzhen

Emilia-Pipe and its resulting datasets, Emilia and Emilia-Large, offer the largest open-source, multilingual speech corpus, enabling more natural and spontaneous AI speech generation.

HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution

17 January 2025·1883 words·9 mins

AI Generated 🤗 Daily Papers Speech and Audio Audio Generation 🏢 Alibaba Group

HiFi-SR: A unified generative network achieves high-fidelity speech super-resolution, outperforming existing methods by seamlessly integrating transformer and convolutional components for end-to-end a…

XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework

15 January 2025·3087 words·15 mins

AI Generated 🤗 Daily Papers Speech and Audio Music Generation 🏢 Tencent AI Lab

XMusic: A new framework generates high-quality, emotionally controllable symbolic music from various prompts (images, videos, text, tags, humming).

Whisper-GPT: A Hybrid Representation Audio Large Language Model

16 December 2024·1640 words·8 mins

AI Generated 🤗 Daily Papers Speech and Audio Audio Generation 🏢 Stanford University

Whisper-GPT, a hybrid-representation audio LLM, improves music and speech generation by combining continuous audio representations with discrete tokens.