TL;DR#
Audio Large Language Models (AudioLLMs) have improved audio tasks, but lack benchmarks in finance where audio data (earnings calls) is key for decisions. Financial evaluation suites exist for LLMs in NLP tasks, but there’s a gap: no audio-focused financial LLM or benchmark. Multimodal financial LLMs can’t handle audio data yet. General AudioLLMs have progressed, enabling tasks like ASR, but a financial audio benchmark is missing, limiting research community’s ability to evaluate and improve strategies.
To address this, the paper introduces FINAUDIO, the first AudioLLM benchmark for finance. It defines three tasks: ASR for short/long financial audio, and summarization of long audio. Four open-source datasets were collected, and a new dataset for financial audio summarization was created. Seven AudioLLMs were evaluated, revealing limitations and insights for improvement. The benchmark offers a low-cost, privacy-preserving ASR solution.
Key Takeaways#
Why does it matter?#
This paper introduces a novel benchmark to evaluate and advance audio LLMs, crucial for financial AI research. It offers valuable datasets and insights, paving the way for more effective and reliable financial audio analysis tools.
Visual Insights#
Dataset Name | Type | #Samples | # Hours | Task | Metrics |
---|---|---|---|---|---|
MDRM-test | Short Clips | 22,208 | 87 | short financial clip ASR | WER |
SPGISpeech-test | Short Clips | 39,341 | 130 | short financial clip ASR | WER |
Earning-21 | Long Audio | 44 | 39 | long financial audio ASR | WER |
Earning-22 | Long Audio | 125 | 120 | long financial audio ASR | WER |
FinAudioSum | Long Audio | 64 | 55 | long financial audio Summarization | Rouge-L & BertScore |
🔼 This table presents a summary of the datasets used in the FinAudio benchmark. It shows the name of each dataset, its type (short audio clips or long audio recordings, including a summarization dataset), the number of samples in the dataset, the total duration of audio in hours, the specific task the dataset is used for within the benchmark (ASR for short audio, ASR for long audio, or summarization), and the evaluation metrics used for each task.
read the caption
Table 1: Statistics of the datasets in the FinAudio benchmark.
Full paper#








