FinAudio: A Benchmark for Audio Large Language Models in Financial Applications

2503.20990

Yupeng Cao et al.

🤗 2025-03-28

TL;DR

Audio Large Language Models (AudioLLMs) have improved performance on audio tasks, but they lack benchmarks in finance, where audio data such as earnings calls is key for decision-making. Evaluation suites exist for financial LLMs on text-based NLP tasks, yet there is no audio-focused financial LLM or benchmark, and existing multimodal financial LLMs cannot yet handle audio data. General-purpose AudioLLMs now support tasks such as automatic speech recognition (ASR), but without a financial audio benchmark the research community has limited ability to evaluate these models and improve them.

To address this, the paper introduces FinAudio, the first AudioLLM benchmark for finance. It defines three tasks: ASR for short financial audio, ASR for long financial audio, and summarization of long financial audio. Four open-source datasets were collected (two of short clips and two of long recordings), and a new dataset for financial audio summarization was created. Seven AudioLLMs were evaluated, revealing limitations and insights for improvement. The benchmark offers a low-cost, privacy-preserving ASR solution.

Key Takeaways

Why does it matter?

This paper introduces a novel benchmark for evaluating and advancing AudioLLMs in finance, filling a gap in financial AI research. It provides datasets and evaluation insights that can support more effective and reliable financial audio analysis tools.

Visual Insights

Dataset Name	Type	#Samples	#Hours	Task	Metrics
MDRM-test	Short Clips	22,208	87	Short financial clip ASR	WER
SPGISpeech-test	Short Clips	39,341	130	Short financial clip ASR	WER
Earnings-21	Long Audio	44	39	Long financial audio ASR	WER
Earnings-22	Long Audio	125	120	Long financial audio ASR	WER
FinAudioSum	Long Audio	64	55	Long financial audio summarization	ROUGE-L & BERTScore

This table summarizes the datasets in the FinAudio benchmark. For each dataset it lists the name, the type (short audio clips or long audio recordings), the number of samples, the total audio duration in hours, the benchmark task it supports (short-clip ASR, long-audio ASR, or long-audio summarization), and the evaluation metrics for that task.
Table 1: Statistics of the datasets in the FinAudio benchmark.
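
The metrics column in Table 1 can be made concrete with a small sketch. The Python below is an illustrative, from-scratch version of word error rate (WER) and a ROUGE-L F1 score, assuming lowercased whitespace tokenization. The function names, the example sentences, and the normalization are assumptions for illustration; the summary above does not state the exact text normalization or metric implementation used in the paper, and BERTScore is omitted because it requires a pretrained model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 score based on the longest common subsequence (LCS) of words."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Standard dynamic program for LCS length, kept to two rows.
    prev = [0] * (len(cand) + 1)
    for ref_word in ref:
        curr = [0]
        for j, cand_word in enumerate(cand, start=1):
            if ref_word == cand_word:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical transcript pair: one substituted word out of four.
    print(word_error_rate("revenue grew ten percent",
                          "revenue grew two percent"))
    # Hypothetical summary pair: candidate is a subsequence of the reference.
    print(rouge_l_f1("revenue grew ten percent this quarter",
                     "revenue grew this quarter"))
```

Lower WER is better for the ASR tasks, while higher ROUGE-L is better for summarization. Production evaluations typically add text normalization (punctuation, number formatting) before scoring, which can change the reported numbers noticeably on financial transcripts.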
