Speech and Audio
FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
·370 words·2 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Speech Recognition
🏢 Stevens Institute of Technology
FinAudio: The first benchmark for audio LLMs in financial applications, aimed at improving financial audio analysis and investment decisions.
Quantization for OpenAI's Whisper Models: A Comparative Analysis
·1308 words·7 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Speech Recognition
🏢 Independent Researcher
Quantization optimizes OpenAI’s Whisper models, balancing model size, speed, and accuracy for diverse applications.
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
·1645 words·8 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Music Generation
🏢 Northwestern Polytechnical University
DiffRhythm: Fast & Simple End-to-End Song Generation via Latent Diffusion, creating full songs (4+ mins) with vocal & accompaniment in seconds!
Slamming: Training a Speech Language Model on One GPU in a Day
·2787 words·14 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Speech Synthesis
🏢 Hebrew University of Jerusalem
Slam: Train SLMs on one GPU in a day!
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
·2399 words·12 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Music Generation
🏢 Beihang University
SongGen: Single-stage autoregressive transformer for controllable text-to-song generation, simplifying the process and improving control.
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
·3169 words·15 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Speech Coding
🏢 Concordia University
FocalCodec: A single-codebook, low-bitrate speech codec built on focal modulation that achieves competitive performance in speech resynthesis and voice conversion.
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
·2407 words·12 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Text-to-Speech
🏢 Chinese University of Hong Kong, Shenzhen
Emilia-Pipe and its resulting datasets, Emilia and Emilia-Large, offer the largest open-source, multilingual speech corpus, enabling more natural and spontaneous AI speech generation.
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
·1883 words·9 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Audio Generation
🏢 Alibaba Group
HiFi-SR: A unified generative network achieves high-fidelity speech super-resolution, outperforming existing methods by seamlessly integrating transformer and convolutional components for end-to-end a…
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
·3087 words·15 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Music Generation
🏢 Tencent AI Lab
XMusic: A new framework generates high-quality, emotionally controllable symbolic music from various prompts (images, videos, text, tags, humming).
Whisper-GPT: A Hybrid Representation Audio Large Language Model
·1640 words·8 mins·
AI Generated
🤗 Daily Papers
Speech and Audio
Audio Generation
🏢 Stanford University
Whisper-GPT, a hybrid-representation audio LLM, improves music and speech generation by combining continuous audio representations with discrete tokens in a single architecture.