FinMTEB: Finance Massive Text Embedding Benchmark

2502.10990

Yixuan Tang et el.

🤗 2025-02-19

TL;DR
#

Existing embedding model benchmarks often overlook the unique challenges of the financial domain. Models trained on general-purpose datasets may not effectively capture the nuances of financial language, which often involves specialized terminology, complex numerical relationships, and temporal sensitivity. This paper highlights the need for domain-specific evaluation and addresses the limitation by presenting a benchmark specialized for the financial domain.

To tackle these issues, the researchers introduce FinMTEB, a finance-specific benchmark encompassing diverse datasets and tasks. They also develop a domain-adapted model, Fin-E5. Their evaluation reveals that domain-adapted models consistently outperform general-purpose ones. Surprisingly, a basic Bag-of-Words (BoW) model outperforms sophisticated dense embeddings in specific tasks, indicating limitations of current techniques in handling financial text semantics. FinMTEB establishes a robust evaluation framework for financial NLP, offering valuable insights for developing effective financial embedding models.

Key Takeaways
#

Why does it matter?
#

This paper is crucial for financial NLP researchers because it introduces FinMTEB, the first comprehensive benchmark for evaluating embedding models in finance. It addresses the lack of domain-specific evaluation in the field and provides a standard for comparing models effectively. The development of Fin-E5, a domain-adapted model, and the surprising performance of BoW demonstrate limitations in current models and suggest new avenues for model development.

Visual Insights
#

🔼 This word cloud shows the most frequent terms present in the training data used to develop the Fin-E5 model. The size of each word reflects its frequency, illustrating the prevalence of various financial concepts within the training dataset. This visualization helps to highlight the domain-specific vocabulary learned by Fin-E5, illustrating its focus on finance-related terms and concepts.
read the caption
Figure 1: Word cloud visualization of Fin-E5’s training data, contain common financial terms.

Model	Size	Tasks							Avg.
Model	Size	STS	Retrieval	Class.	Cluster.	Rerank.	PairClass.	Summ.	Avg.
		2	10	8	6	3	3	3
BOW	-	0.4845	0.2084	0.4696	0.2547	0.7628	0.7143	0.0542	0.4212
Encoder based Models
BERT	110M	0.3789	0.0207	0.5496	0.1744	0.3930	0.7111	0.0452	0.3247
FinBERT	110M	0.4198	0.1102	0.5923	0.2833	0.6404	0.6967	0.0417	0.3978
instructor-base	110M	0.3732	0.5772	0.6208	0.5300	0.9734	0.6138	0.1465	0.5479
bge-large-en-v1.5	335M	0.3396	0.6463	0.6436	0.5725	0.9825	0.7400	0.2019	0.5895
AnglE-BERT	335M	0.3080	0.5730	0.6439	0.5774	0.9650	0.6891	0.5049	0.6088
LLM-based Models
gte-Qwen1.5-7B-instruct	7B	0.3758	0.6697	0.6438	0.5854	0.9890	0.6998	0.2350	0.5998
Echo	7B	0.4380	0.6443	0.6525	0.5776	0.9765	0.6261	0.4722	0.6267
bge-en-icl	7B	0.3233	0.6789	0.6569	0.5742	0.9898	0.6738	0.5197	0.6309
NV-Embed v2	7B	0.3739	0.7061	0.6393	0.6096	0.9822	0.6043	0.5103	0.6322
e5-mistral-7b-instruct	7B	0.3800	0.6749	0.6449	0.5783	0.9875	0.7394	0.5275	0.6475
Commercial Models
text-embedding-3-small	-	0.3254	0.6641	0.6387	0.5802	0.9825	0.5957	0.5085	0.6136
text-embedding-3-large	-	0.3615	0.7112	0.6596	0.6081	0.9910	0.7309	0.5671	0.6613
voyage-3-large	-	0.4145	0.7463	0.6861	0.5944	0.9938	0.6519	0.6484	0.6765
Finance Adapted LLM-based Models
Fin-E5	7B	0.4342	0.7105	0.7565	0.5650	0.9896	0.8014	0.4797	0.6767

🔼 Table 1 presents a performance comparison of various embedding models on the Finance Massive Text Embedding Benchmark (FinMTEB). It evaluates these models across seven distinct tasks: Semantic Textual Similarity (STS), Retrieval, Classification, Clustering, Reranking, Pair Classification, and Summarization. The table displays the performance of each model on each task, allowing for a comparison of their relative strengths and weaknesses. The best performing model for each task is highlighted in bold, and the second-best is underlined. Model sizes (in parameters) are also included to provide context for performance differences.
read the caption
Table 1: Performance comparison across different embedding models on FinMTEB benchmark. The evaluation metrics include semantic textual similarity (STS), retrieval, classification (Class.), clustering (Cluster.), reranking (Rerank.), pair classification (PairClass.), and summarization (Summ.). Best results are in bold. The underline represents the second-best performance.

In-depth insights
#

FinMTEB Benchmark
#

The FinMTEB Benchmark represents a substantial contribution to the field of financial natural language processing (NLP). Its core strength lies in its comprehensive nature, covering diverse financial text types in both English and Chinese across seven distinct tasks. This breadth ensures a more robust evaluation of embedding models, moving beyond the limitations of general-purpose benchmarks which often fail to capture the nuances of financial language. FinMTEB’s focus on domain-specific datasets, including annual reports, news articles, and regulatory filings, is particularly valuable. The inclusion of both Chinese and English datasets significantly expands the scope of applicability and allows for cross-lingual comparisons. Furthermore, the development and release of the Fin-E5 model, a finance-adapted embedding model, provides a valuable resource for researchers and practitioners. The findings regarding the surprising performance of simple Bag-of-Words models in certain tasks highlight the current limitations of sophisticated dense embeddings in the financial domain and suggest avenues for future research. Overall, FinMTEB offers a more realistic and challenging evaluation framework that will significantly advance the field of financial NLP.

Fin-E5 Model
#

The research paper introduces Fin-E5, a finance-adapted text embedding model, designed to overcome limitations of general-purpose embedding models in financial applications. Fin-E5’s development directly addresses the need for improved handling of domain-specific terminology, temporal sensitivities, and complex numerical relationships prevalent in financial text. The model’s creation is notable for its use of a persona-based data synthesis method, generating a diverse range of financial tasks and incorporating different perspectives. This approach enhances the model’s ability to capture nuanced financial semantics and adapt to various financial contexts. The paper emphasizes the importance of domain adaptation through the use of Fin-E5, highlighting its consistent outperformance over general-purpose counterparts across multiple financial tasks. The results demonstrate that Fin-E5 achieves state-of-the-art performance on the Finance Massive Text Embedding Benchmark (FinMTEB), a comprehensive benchmark specifically designed for evaluating financial embedding models. Overall, Fin-E5 represents a significant advance in finance-specific natural language processing, offering valuable insights for researchers and practitioners working within the financial domain.

Domain Adaptation
#

The concept of domain adaptation is central to the research paper, addressing the challenges of applying general-purpose embedding models to the specialized financial domain. The authors highlight the limited correlation between performance on general benchmarks and financial domain-specific tasks, emphasizing the necessity of adapting models to the unique characteristics of financial text. This adaptation is crucial due to factors like domain-specific terminology, temporal sensitivity, and complex numerical relationships. The paper explores domain adaptation strategies, specifically focusing on the development of a finance-adapted model (Fin-E5) using a persona-based data augmentation technique, and demonstrates the effectiveness of these techniques. Their findings strongly support that domain-adapted models significantly outperform their general-purpose counterparts, underscoring the importance of considering domain-specific needs when developing embedding models for financial natural language processing (NLP) applications.

BOW Outperforms
#

The unexpected finding that a Bag-of-Words (BoW) model outperforms sophisticated dense embedding models in specific financial semantic textual similarity (STS) tasks is a significant result. This challenges the prevailing assumption that complex, dense embeddings are always superior; instead, it suggests that the current dense embedding techniques struggle to capture the nuances of financial language effectively. The reasons might include: over-reliance on contextual information which fails to identify core semantic similarities obscured by boilerplate language and financial jargon prevalent in financial documents; inability to effectively handle numerical and temporal relationships key to financial understanding; and/or limitations in the training data itself which may not sufficiently represent the intricate semantic space inherent in financial language. This finding underscores the need for further research into embedding model design and training methods to address these weaknesses, including investigations into how to incorporate better financial domain expertise and potentially explore alternative embedding techniques beyond the dense vector representation paradigm.

Future Directions
#

Future research should focus on addressing the limitations of current embedding models in capturing nuanced financial semantics, particularly within the context of complex numerical data and temporal dependencies. Developing more robust and comprehensive evaluation frameworks for specialized financial domains is crucial, moving beyond single-task benchmarks to encompass diverse financial applications. This includes exploring new architectural designs that effectively handle the specific linguistic features and semantic complexities of financial text, including boilerplate language and domain-specific terminology. Investigating the potential of multimodal approaches that integrate textual and numerical data sources holds significant promise. Further research should also explore cross-lingual financial embedding models, expanding the scope beyond English and Chinese to support broader financial data analysis. Finally, exploring novel domain adaptation techniques specific to financial text is vital to optimize embedding model performance.

More visual insights
#

More on tables

Dataset Name	Language	Description
FINAL (Ju et al., 2023)	English	A dataset designed for discovering financial signals in narrative financial reports.
FinSTS (Liu et al., 2024a)	English	A dataset focused on detecting subtle semantic shifts in financial narratives.
AFQMC ⁶⁶6https://tianchi.aliyun.com/dataset/106411	Chinese	A Chinese dataset for customer service question matching in the financial domain.
BQ-Corpus (Chen et al., 2018)	Chinese	A large-scale Chinese corpus for sentence semantic equivalence identification (SSEI) in the banking domain.

🔼 This table lists the datasets used for the Semantic Textual Similarity (STS) task in the FinMTEB benchmark. It shows the dataset name, language (English or Chinese), and a brief description of each dataset’s content and purpose within the financial domain.
read the caption
Table 2: Summary of STS Datasets

Dataset Name	Language	Description
FiQA2018 (FiQA, 2018)	English	Financial opinion mining and question answering dataset.
FinanceBench (Islam et al., 2023)	English	Open book financial question answering dataset.
HC3(Finance) (Guo et al., 2023)	English	A human-ChatGPT comparison corpus in the finance domain.
Apple-10K-2022 ⁷⁷7https://lighthouz.ai/blog/rag-benchmark-finance-apple-10K-2022/	English	A retrieval-augmented generation (RAG) benchmark for finance applications.
FinQA (Chen et al., 2021)	English	Financial numerical reasoning dataset with structured and unstructured evidence.
TAT-QA (Zhu et al., 2021)	English	Question answering benchmark combining tabular and textual content in finance.
US Financial News ⁸⁸8https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles	English	Finance news articles paired with headlines and stock ticker symbols.
TradeTheEvent (Trading Benchmark) (Zhou et al., 2021)	English	Finance news articles paired with headlines and stock ticker symbols.
TradeTheEvent (Domain Adaption) (Zhou et al., 2021)	English	Financial terms and explanations dataset.
TheGoldman-en	English	English version of the Goldman Sachs Financial Dictionary.
FinTruthQA (Xu et al., 2024)	Chinese	Dataset for evaluating the quality of financial information disclosure.
Fin-Eva (Retrieval task) ⁹⁹9https://github.com/alipay/financial_evaluation_dataset/tree/main	Chinese	Financial scenario QA dataset focusing on retrieval tasks.
AlphaFin (Li et al., 2024)	Chinese	Comprehensive financial dataset including NLI, QA, and stock trend predictions.
DISC-FinLLM (Retrieval Part Data) (Chen et al., 2023)	Chinese	Financial scenario QA dataset.
FinQA (from DuEE-fin) (Lu et al., 2023)	Chinese	Financial news bulletin event quiz dataset.
DISC-FinLLM (Computing) (Chen et al., 2023)	Chinese	Financial scenario QA dataset focusing on numerical tasks.
SmoothNLP ¹⁰¹⁰10https://github.com/smoothnlp/SmoothNLP	Chinese	Chinese finance news dataset.
THUCNews (Sun et al., 2016)	Chinese	Chinese finance news dataset.
Fin-Eva (Terminology) ¹¹¹¹11https://github.com/alipay/financial_evaluation_dataset/tree/main	Chinese	Financial terminology dataset used in the industry.
TheGoldman-cn	Chinese	Chinese version of the Goldman Sachs Financial Dictionary.

🔼 This table lists 20 datasets used in the paper’s benchmark for the retrieval task. Each dataset is described with its name, language (English or Chinese), and a concise description of its content and purpose. For example, some datasets contain financial news articles, others focus on question answering about financial topics, and yet others include information from SEC filings.
read the caption
Table 3: Summary of Retrieval Datasets

Dataset Name	Language	Description
FinancialPhrasebank (Malo et al., 2014)	English	Polar sentiment dataset of sentences from financial news, categorized by sentiment into positive, negative, or neutral.
FinSent (Yang et al., 2023b)	English	Polar sentiment dataset of sentences from the financial domain, categorized by sentiment into positive, negative, or neutral.
FiQA_ABSA (FiQA, 2018)	English	Polar sentiment dataset of sentences from the financial domain, categorized by sentiment into positive, negative, or neutral.
SemEva2017_Headline (Cortis et al., 2017)	English	Polar sentiment dataset of sentences from the financial domain, categorized by sentiment into positive, negative, or neutral.
FLS (Yang et al., 2023b)	English	A finance dataset detects whether the sentence is a forward-looking statement.
ESG (Yang et al., 2023b)	English	A finance dataset performs sentence classification under the environmental, social, and corporate governance (ESG) framework.
FOMC (Shah et al., 2023)	English	A task of hawkish-dovish classification in finance domain.
Financial-Fraud ¹²¹²12https://github.com/amitkedia007/Financial-Fraud-Detection-Using-LLMs/tree/main	English	This dataset was used for research in detecting financial fraud.
FinNSP (Lu et al., 2023)	Chinese	Financial negative news and its subject determination dataset.
FinChina (Lan et al., 2023)	Chinese	Polar sentiment dataset of sentences from the financial domain, categorized by sentiment into positive, negative, or neutral.
FinFE (Lu et al., 2023)	Chinese	Financial social media text sentiment categorization dataset.
OpenFinData ¹³¹³13https://github.com/open-compass/OpenFinData?tab=readme-ov-file	Chinese	Financial scenario QA dataset including sentiment task.
MDFEND-Weibo2 (finance) (Nan et al., 2021)	Chinese	Fake news detection in the finance domain.

🔼 This table lists 16 classification datasets used in the FinMTEB benchmark. Each dataset is described with its language (English or Chinese), and a detailed description of its contents and intended use in financial natural language processing. The descriptions highlight the type of financial text included (e.g., news, social media, regulatory filings), the task the dataset supports (e.g., sentiment analysis, fraud detection, financial news categorization), and key characteristics such as the number of sentences and whether the data is labeled with positive, negative, or neutral sentiments.
read the caption
Table 4: Summary of Classification Datasets

Dataset Name	Language	Description
MInDS-14-en (Gerz et al., 2021b)	English	MINDS-14 is a dataset for intent detection in e-banking, covering 14 intents across 14 languages.
Consumer Complaints (CFPB, 2024)	English	The Consumer Complaint Database is a collection of complaints about consumer financial products and services that sent to companies for response.
Synthetic PII finance (Watson et al., 2024)	English	Synthetic financial documents containing Personally Identifiable Information (PII).
FinanceArxiv-s2s ¹⁴¹⁴14Collect from the Arixv	English	Clustering of titles from arxiv (q-fin).
FinanceArxiv-p2p	English	Clustering of abstract from arxiv (q-fin).
WikiCompany2Industry-en ¹⁵¹⁵15Collect from the Wikipedia	English	Clustering the related industry domain according to the company description.
MInDS-14-zh (Gerz et al., 2021b)	Chinese	MINDS-14 is a dataset for intent detection in e-banking, covering 14 intents across 14 languages.
FinNL (Lu et al., 2023)	Chinese	Financial news categorization dataset.
CCKS2022 (CCKS, 2022)	Chinese	Clustering of financial events.
CCKS2020	Chinese	Clustering of financial events.
CCKS2019	Chinese	Clustering of financial events.

🔼 This table lists the datasets used in the FinMTEB benchmark for the clustering task. Each dataset is described with its language (English or Chinese) and a brief explanation of its content and purpose, providing context to understand the nature of the financial data included in each dataset and its relevance to clustering tasks. This aids in interpreting the results obtained by using these datasets in the FinMTEB benchmark. The datasets show diversity in terms of their sources and the type of financial data represented.
read the caption
Table 5: Summary of Clustering Datasets

Dataset Name	Language	Description
Ectsum (Mukherjee et al., 2022)	English	A Dataset For Bullet Point Summarization of Long Earnings Call Transcripts.
FINDSum (Liu et al., 2022)	English	A Large-Scale Dataset for Long Text and Multi-Table Summarization.
FNS-2022 (El-Haj et al., 2022)	English	Financial Narrative Summarisation for 10K.
FiNNA (Lu et al., 2023)	Chinese	A financial news summarization dataset.
Fin-Eva (Headline) (Zhang et al., 2023)	Chinese	A financial summarization dataset.
Fin-Eva (Abstract) (Zhang et al., 2023)	Chinese	A financial summarization dataset.

🔼 This table lists summarization datasets used in the FinMTEB benchmark. It provides details for each dataset, including the dataset name, language (English or Chinese), and a description of the dataset’s content and purpose within the context of financial text summarization.
read the caption
Table 6: Summary of Summarization Datasets

Dataset Name	Language	Description
Fin-Fact (Rangapur et al., 2023)	English	A Benchmark Dataset for Financial Fact Checking and Explanation Generation.
FiQA2018 (FiQA, 2018)	English	Financial opinion mining and question answering.
HC3(Finance) (Guo et al., 2023)	English	A human-ChatGPT comparison finance corpus.
Fin-Eva (Retrieval task) (Zhang et al., 2023)	Chinese	Financial scenario QA dataset including retrieval task.
DISC-FinLLM (Retrieval Part Data) (Chen et al., 2023)	Chinese	Financial scenario QA dataset.

🔼 This table lists the datasets used for the reranking task in the FinMTEB benchmark. For each dataset, it provides the language (English or Chinese) and a description of its content and purpose. The descriptions highlight the specific focus of each dataset, such as financial fact-checking, question answering, or retrieval of relevant information in financial contexts.
read the caption
Table 7: Summary of Reranking Datasets

Dataset Name	Language	Description
HeadlineAC-PairClassification (Sinha and Khandait, 2021)	English	Financial text sentiment categorization dataset.
HeadlinePDD-PairClassification (Sinha and Khandait, 2021)	English	Financial text sentiment categorization dataset.
HeadlinePDU-PairClassification (Sinha and Khandait, 2021)	English	Financial text sentiment categorization dataset.
AFQMC	Chinese	Ant Financial Question Matching Corpus.

🔼 This table lists the datasets used for the pair classification task in the FinMTEB benchmark. For each dataset, it provides the language (English or Chinese) and a brief description of its content and purpose within the context of financial text analysis. The descriptions highlight the type of data and the specific aspect of financial text pairs being classified.
read the caption
Table 8: Summary of PairClassification Datasets

	STS	Class.	Ret.	Rerank.	Clust.	PairClass.	Summ.
Correlation	0.30	-0.80	0.30	-0.10	-0.70	-0.30	0.60
p-value	0.62	0.10	0.62	0.87	0.18	0.62	0.28

🔼 This table compares several linguistic features of the FinMTEB and MTEB datasets, which are used for evaluating text embedding models. The features compared are average sentence length, average token length, average number of syllables per token, and average dependency distance. The data reveals differences in the complexity of text between the financial domain (FinMTEB) and a general domain (MTEB), with FinMTEB showing longer and more complex sentences.
read the caption
Table 9: Comparison of Text Characteristics Between FinMTEB and MTEB. The numbers represent the average scores across all samples from all datasets.

Task	Factor	Sum of Squares	Degrees of Freedom	F-Statistic	p-value
Classification	Model Factor	4.17	6.00	25.55	$3.41\times 10^{-30}$
	Domain Factor	56.82	1.00	2086.30	$\approx 0$
	Residual	190.42	6992.00	NA	NA
Retrieval	Model Factor	104.25	6.00	9052.57	$\approx 0$
	Domain Factor	6.16	1.00	3207.72	$\approx 0$
	Residual	13.42	6992.00	NA	NA
STS	Model Factor	10.55	6.00	149.00	$1.64\times 10^{-178}$
	Domain Factor	304.09	1.00	25761.71	$\approx 0$
	Residual	82.53	6992.00	NA	NA
Clustering	Model Factor	0.29	6.00	47.60	$1.59\times 10^{-57}$
	Domain Factor	32.25	1.00	32161.37	$\approx 0$
	Residual	7.01	6992.00	NA	NA
Summarization	Model Factor	12.98	6.00	145.31	$2.90\times 10^{-174}$
	Domain Factor	14.49	1.00	973.32	$3.60\times 10^{-200}$
	Residual	104.07	6992.00	NA	NA
Reranking	Model Factor	5.38	6.00	489.05	$\approx 0$
	Domain Factor	0.64	1.00	346.78	$1.39\times 10^{-75}$
	Residual	12.84	7002.00	NA	NA
Pair Classification	Model Factor	0.25	6.00	1.97	0.07
	Domain Factor	249.19	1.00	11989.92	$\approx 0$
	Residual	145.31	6992.00	NA	NA
Average	Model Factor	0.00	6.00	1.34	0.37
	Domain Factor	0.08	1.00	253.87	$\approx 0$
	Residual	0.00	6.00	NA	NA

🔼 This table presents the Spearman’s rank correlation coefficients between the performance rankings of various embedding models on two benchmark datasets: the Massive Text Embedding Benchmark (MTEB) and the Finance Massive Text Embedding Benchmark (FinMTEB). The analysis considers seven different NLP tasks. The p-values associated with each correlation indicate whether the observed correlation is statistically significant. In this case, all p-values are above the standard significance threshold, suggesting that there is no statistically significant relationship between a model’s performance on MTEB and its performance on FinMTEB. This implies that performance on a general-purpose benchmark (MTEB) does not reliably predict performance on a domain-specific benchmark (FinMTEB), highlighting the importance of domain-specific evaluation.
read the caption
Table 10: Spearman’s correlation of embedding models’ performance on MTEB and FinMTEB across different tasks. The p-value indicates that all correlations are statistically insignificant, suggesting a lack of evidence for a relationship between embedding model performance on the two benchmarks.

TL;DR#

Key Takeaways#

Why does it matter?#

Visual Insights#

In-depth insights#

FinMTEB Benchmark#

Fin-E5 Model#

Domain Adaptation#

BOW Outperforms#

Future Directions#

More visual insights#

Full paper#