TL;DR#
Telecom fraud detection faces challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. Traditional methods rely on manual verification and rule-based matching, which have low accuracy and struggle to adapt to evolving fraud strategies. Recent advances in LLMs offer new solutions, but current methods exhibit a significant modality gap and inconsistent performance across tasks, underscoring the need for a comprehensive multimodal telecom fraud dataset.
To address these issues, the authors present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset for automated telecom fraud analysis. The dataset is constructed through privacy-preserved text-truth sample generation, semantic enhancement via LLM-based self-instruction sampling, and multi-agent adversarial synthesis. The authors construct TeleAntiFraud-Bench, a standardized evaluation benchmark, and contribute a production-optimized SFT model trained on hybrid real/synthetic data, open-sourcing the data processing framework for community-driven dataset expansion.
Key Takeaways#
Why does it matter?#
This paper introduces a novel multimodal dataset and benchmark for telecom fraud detection, addressing critical data scarcity and privacy challenges while opening new avenues for research in AI-driven fraud prevention and analysis.
Visual Insights#
🔼 This figure illustrates the architecture of the TeleAntiFraud-30k system, designed to tackle the challenges in telecom fraud detection. It highlights the three key strategies used to build the TeleAntiFraud-28k dataset: 1) Real-Data ASR Processing, which involves transcribing real audio data while preserving privacy, 2) LLM-Based Imitation and Augmentation, which uses large language models to generate synthetic data that augments the real data, and 3) Multi-Agent Adversarial Synthesis, which simulates realistic and diverse fraud scenarios. The system also includes TeleAntiFraud-Bench, a standardized benchmark for evaluating model performance, and an open-source model produced through supervised fine-tuning.
Figure 1. An overview of TeleAntiFraud-30k. Our system addresses the challenges of telecom fraud detection. We create the TeleAntiFraud-28k dataset through three strategies: Real-Data ASR Processing, LLM-Based Imitation and Augmentation, and Multi-Agent Adversarial Synthesis. We also construct TeleAntiFraud-Bench for model evaluation and contribute a supervised fine-tuning model with an open-sourced data processing framework.
Type | Total | Scam Calls | Normal Calls |
---|---|---|---|
Train | 21,490 | 9,950 (46.3%) | 11,540 (53.7%) |
Test | 7,021 | 3,697 (52.66%) | 3,324 (47.34%) |
Total | 28,511 | 13,647 (47.86%) | 14,864 (52.13%) |
🔼 This table shows the distribution of scam and normal calls within the TeleAntiFraud-28k dataset. It breaks down the total number of calls (28,511) into training and testing sets, further categorized by whether each call is fraudulent or not. The percentages provided illustrate the proportion of scam and normal calls in each set and the overall dataset, offering insight into the class imbalance present in the data.
Table 1. Distribution of Scam and Normal Calls in the Dataset
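As a quick check, the split and percentages in Table 1 can be reproduced with a few lines of Python (counts copied from the table):

```python
# Sanity-check of the class balance reported in Table 1.
splits = {
    "Train": {"scam": 9_950, "normal": 11_540},
    "Test": {"scam": 3_697, "normal": 3_324},
}

train_total = sum(splits["Train"].values())  # 21,490
test_total = sum(splits["Test"].values())    # 7,021

for name, s in splits.items():
    total = sum(s.values())
    print(f"{name}: {total} calls, {100 * s['scam'] / total:.2f}% scam")

assert train_total + test_total == 28_511  # matches the reported total
```

This confirms the near-balanced class distribution (46.30% scam in training, 52.66% in test).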
In-depth insights#
AntiFraud Dataset#
The paper introduces TeleAntiFraud-28k, a novel audio-text dataset for telecom fraud detection. A key aspect is its focus on slow-thinking reasoning, going beyond simple pattern recognition to simulate expert fraud analysis. The dataset addresses the critical lack of high-quality multimodal data that combines audio signals with textual context. The construction involves innovative techniques like privacy-preserved ASR transcription and LLM-based semantic augmentation. A multi-agent adversarial framework simulates diverse fraud scenarios, tackling the limited variety in existing datasets. The TeleAntiFraud-Bench evaluation framework provides a standardized way to assess model performance in scenario classification, fraud detection, and type classification. The dataset seeks to facilitate research into more robust and intelligent anti-fraud systems by offering a challenging resource that tackles issues such as data privacy and scenario diversity. The release of the data processing framework further aims to encourage community-driven dataset expansion, thereby helping to spur innovation in multimodal anti-fraud research.
Multi-Modal Models#
Multimodal models are increasingly vital, leveraging diverse data such as audio, text, and images to enhance AI systems. Their capacity to integrate varied inputs allows for more robust and nuanced understanding compared to unimodal approaches. In fraud detection, multimodal models can combine speech patterns, text analysis, and visual cues to identify deceptive behaviors more effectively. Challenges include data synchronization, feature fusion, and the need for large, diverse datasets. Future advancements will likely focus on developing more efficient fusion techniques and addressing privacy concerns in handling sensitive data.
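As a toy illustration of the fusion idea (not the paper's architecture), a late-fusion scorer might combine per-modality signals into one decision; the cue lists and the 0.6/0.4 weights below are purely illustrative assumptions:

```python
# Illustrative late fusion of audio- and text-derived fraud scores.
# The scorers and weights are assumptions for this sketch, not the
# paper's actual model.
def audio_score(features: dict) -> float:
    # Stand-in for a prosody-based cue, e.g. artificially urgent tone.
    return 0.9 if features.get("urgent_tone") else 0.2

def text_score(transcript: str) -> float:
    # Stand-in for transcript analysis: count known fraud phrases.
    cues = ("transfer", "verification code", "safe account")
    hits = sum(cue in transcript.lower() for cue in cues)
    return min(1.0, 0.3 * hits)

def fused_fraud_score(features: dict, transcript: str,
                      w_audio: float = 0.6, w_text: float = 0.4) -> float:
    return w_audio * audio_score(features) + w_text * text_score(transcript)

score = fused_fraud_score({"urgent_tone": True},
                          "Please move your money to a safe account now")
```

Even this crude combination shows why fusion helps: neither modality alone flags the call as strongly as the two together.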
Data Generation#
The data generation process is crucial for telecom fraud detection due to the lack of high-quality multimodal training data. This paper addresses this by constructing ‘TeleAntiFraud-28k’, utilizing three strategies. The first is privacy-preserved text-truth sample generation using ASR-transcribed call recordings with TTS regeneration, ensuring real-world consistency. The second involves semantic enhancement via LLM-based self-instruction sampling on authentic ASR outputs to expand scenario coverage. The third strategy employs multi-agent adversarial synthesis to simulate emerging fraud tactics. This comprehensive approach aims to create a dataset that is both diverse and realistic, addressing critical challenges in data privacy and scenario diversity for multimodal anti-fraud research.
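The multi-agent adversarial strategy can be sketched as a simple dialogue loop; in the real pipeline each role is driven by an LLM, while here the agents are stubs and the stop condition is an illustrative assumption:

```python
# Sketch of the multi-agent adversarial loop. Role names follow the
# paper (user, fraudster, manager); the fixed replies and turn cap are
# stand-ins for LLM calls and the manager's quality checks.
def fraudster_turn(history):
    return "Your account is at risk; please read me the verification code."

def user_turn(history):
    return "Why would the bank need that over the phone?"

def manager_check(history, max_turns=6):
    # The manager agent would judge scenario adherence and natural flow;
    # here it simply caps the dialogue length.
    return len(history) < max_turns

def synthesize_dialogue():
    history = []
    while manager_check(history):
        history.append(("fraudster", fraudster_turn(history)))
        history.append(("user", user_turn(history)))
    return history

dialogue = synthesize_dialogue()
```

Swapping the stubs for LLM calls with scenario-specific prompts yields the diverse, adversarially generated conversations described above.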
Thinking Process#
Analyzing the ‘Thinking Process’ in AI, especially concerning fraud detection, highlights the critical need for models to mimic human expert reasoning. It’s about moving beyond simple pattern recognition to structured analysis, similar to how a professional would assess a situation. Key is the model’s ability to articulate its reasoning, demonstrating a clear chain of thought from evidence to conclusion. This involves assessing emotional cues, request legitimacy, and information disclosure patterns. A successful ‘Thinking Process’ requires the AI not only to identify fraud but also to explain why, increasing trust and explainability. Incorporating detailed annotations for each step of the reasoning process helps models generate structured responses and improves the overall analysis.
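One plausible way to capture such step-by-step annotations is a structured record; the field names below are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical structure for a slow-thinking annotation: each field
# mirrors one reasoning step mentioned above (emotional cues, request
# legitimacy, information disclosure, conclusion).
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    emotional_cues: list[str] = field(default_factory=list)
    request_legitimacy: str = ""          # why the request is (il)legitimate
    information_disclosure: list[str] = field(default_factory=list)
    conclusion: str = ""                  # "fraud" or "normal"

trace = ReasoningTrace(
    emotional_cues=["artificial urgency", "threat of account freeze"],
    request_legitimacy="asks for a one-time code, which banks never do",
    information_disclosure=["caller requests OTP", "caller requests PIN"],
    conclusion="fraud",
)
```

Serializing such records alongside each call gives a model both the verdict and the evidence chain to learn from.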
Model Ablation#
Ablation studies are vital for understanding a model’s inner workings. By systematically removing components, such as specific layers, attention mechanisms, or input features, we can gauge each element’s contribution to overall performance. In the context of telecom fraud detection, ablating audio features reveals the text’s baseline effectiveness, while ablating slow-thinking components highlights their impact on reasoning. Such experiments not only identify redundancies but also expose crucial interactions between different modalities. This rigorous analysis guides future model development towards leaner, more effective architectures: pinpointing the most influential aspects of the model improves fraud detection accuracy and efficiency, makes the model more interpretable, and builds trust with business stakeholders. Furthermore, ablating individual input features reveals which real-world interactions and indicators contribute most to a model’s predictive power, leading to a better understanding of telecommunication fraud.
More visual insights#
More on figures
🔼 This figure illustrates the process of generating audio data for the TeleAntiFraud-28k dataset. It begins with JSON data representing various conversation scenarios, which is then processed through an audio synthesis module to create realistic-sounding audio. A large language model (DeepSeek) plays a role in this generation process, likely contributing to the realism or complexity of the synthesized dialogue. The result is the final generated audio data used in the dataset.
Figure 2. Data Flow of Audio Data Generation
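The Figure 2 flow (JSON scenario in, synthesized call audio out) can be sketched as follows; `synthesize` stands in for a real TTS engine, and the JSON layout and voice names are assumptions for the sketch:

```python
import json

# Sketch of the audio-generation data flow: a JSON scenario is turned
# into per-turn speech and concatenated into one call recording.
scenario_json = json.dumps({
    "scenario": "banking_fraud",
    "turns": [
        {"speaker": "fraudster", "text": "This is your bank's security team."},
        {"speaker": "user", "text": "Which branch are you calling from?"},
    ],
})

def synthesize(text: str, voice: str) -> bytes:
    # Placeholder for a TTS call; returns dummy PCM bytes sized to the text.
    return b"\x00" * len(text)

def render_call(raw_json: str) -> bytes:
    scenario = json.loads(raw_json)
    voices = {"fraudster": "male_1", "user": "female_1"}  # assumed mapping
    return b"".join(
        synthesize(t["text"], voices[t["speaker"]]) for t in scenario["turns"]
    )

call_audio = render_call(scenario_json)
```

Assigning a distinct voice per speaker is what keeps the regenerated audio consistent with the two-party structure of the transcripts.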
🔼 This figure illustrates the architecture of a multi-agent adversarial framework used to generate diverse and realistic conversations for a telecom fraud detection dataset. Three agents interact: a user, a fraudster (cheater), and a manager. The user and cheater agents use large language models (LLMs) to generate their dialogue turns. The manager agent oversees the conversation to ensure it aligns with defined fraud scenarios and ensures a natural flow. This framework enables the creation of realistic and diverse conversations simulating various fraud tactics, enriching the training data for fraud detection models.
Figure 3. Architecture Diagram of the Multi-Agent Adversarial Framework
More on tables
Scenario Type | Training Set | Test Set |
---|---|---|
Customer Service Inquiries | 6,421 | 4,632 |
Appointment Services | 1,714 | 867 |
Daily Shopping | 924 | 340 |
Food Ordering Services | 581 | 154 |
Delivery Services | 494 | 448 |
Ride-hailing Services | 353 | 489 |
Transportation Inquiries | 223 | 91 |
Total | 10,710 | 7,021 |
🔼 This table shows the distribution of different call scenarios in both the training and test sets of the TeleAntiFraud-28k dataset. It breaks down the number of samples for each scenario type, providing insights into the dataset’s composition and balance across various real-world communication contexts. This is useful for understanding the representativeness of the dataset and its suitability for training and evaluating models.
Table 2. Distribution of scenario types in training and test sets
Fraud Type | Training Set | Test Set |
---|---|---|
Customer Service Fraud | 2,022 | 725 |
Banking Fraud | 1,626 | 2,408 |
Investment Fraud | 785 | 216 |
Phishing Fraud | 443 | 123 |
Lottery Fraud | 418 | 99 |
Kidnapping Fraud | 324 | 91 |
Identity Theft | 105 | 35 |
Total | 5,723 | 3,697 |
🔼 This table shows the distribution of different fraud types within the training and test sets of the TeleAntiFraud-28k dataset. It breaks down the number of instances of each fraud type in both the training portion (used to train machine learning models) and the testing portion (used to evaluate the performance of trained models) of the dataset. This allows researchers to understand the class distribution and potential class imbalance issues within the dataset.
Table 3. Distribution of fraud types in training and test sets
Type | LLM | Scenario (%) | Fraud (%) | Fraud Type (%) | Avg F1 (%) | Reason | Final |
---|---|---|---|---|---|---|---|
SenseVoice (ASR) + LLM | DeepSeek-V3 | 88.53 | 14.62 | 66.71 | 56.62 | 7.51 | 21.30 |
SenseVoice (ASR) + LLM | DeepSeek-R1 | 83.60 | 79.25 | 85.16 | 82.67 | 9.94 | 31.08 |
SenseVoice (ASR) + LLM | Doubao-1.5-Pro | 71.14 | 36.11 | 82.25 | 63.17 | 6.31 | 23.74 |
SenseVoice (ASR) + LLM | InternLM2.5-20B-Chat | 78.34 | 36.67 | 85.42 | 66.81 | 6.37 | 25.11 |
SenseVoice (ASR) + LLM | GLM-4-9B-Chat | 75.10 | 46.91 | 82.22 | 68.08 | 5.25 | 25.57 |
SenseVoice (ASR) + LLM | Qwen2.5-72B-Instruct | 78.31 | 51.44 | 81.24 | 70.33 | 7.01 | 26.43 |
Multimodal Models | GPT-4o* | 80.25 | 50.00 | 86.26 | 72.17 | 6.79 | 27.12 |
Multimodal Models | Gemini-2.0-Flash | 80.51 | 59.61 | 83.53 | 74.55 | 7.25 | 28.02 |
Multimodal Models | GLM-4-Voice* | - | 26.83 | 38.33 | 32.58 | 1.89 | 12.23 |
Multimodal Models | Step-1o-audio* | 76.35 | 40.65 | 79.71 | 65.57 | 5.26 | 24.63 |
Multimodal Models | Qwen2Audio | 70.22 | 58.51 | 20.47 | 49.73 | 4.91 | 18.69 |
FT Anti-Fraud LALM | AntiFraud-Qwen2Audio | 81.31 | 84.78 | 82.91 | 83.00 | 6.44 | 31.18 |
🔼 This table presents a comparative analysis of various models’ performance across three key tasks in fraud detection: scenario classification, fraud detection, and fraud type identification. For each model, the table shows the F1-score achieved in each task, along with the average F1-score across all three tasks. Additionally, it provides scores reflecting the quality of the model’s reasoning process, and the final score (a weighted average). Model types are categorized as LLM (Large Language Model) based approaches, multimodal models, and a fine-tuned Anti-Fraud LALM (Large Audio Language Model). This allows for a comparison of different model architectures and strategies in the context of telecom fraud detection.
Table 4. Comparison of Model Performance on Fraud Detection Tasks
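A useful reading aid: the Avg F1 column in Table 4 is the unweighted mean of the scored task F1s, which can be spot-checked against the reported values:

```python
# Spot-check of the Avg F1 column in Table 4: it equals the mean of the
# scenario, fraud-detection, and fraud-type F1 scores (numbers copied
# from the table).
rows = {
    "DeepSeek-R1": (83.60, 79.25, 85.16, 82.67),
    "GPT-4o*": (80.25, 50.00, 86.26, 72.17),
    "AntiFraud-Qwen2Audio": (81.31, 84.78, 82.91, 83.00),
}
for name, (scen, fraud, ftype, avg) in rows.items():
    assert round((scen + fraud + ftype) / 3, 2) == avg
```

(For GLM-4-Voice, which has no scenario score, Avg F1 is the mean of the two remaining tasks.)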
🔼 This table presents a comparison of the performance metrics for different variants of the Qwen2-Audio model on three tasks: scenario classification, fraud detection, and fraud type identification. The variants include a baseline model, a model trained only on ASR text, a model without the ‘Think’ mechanism (which incorporates slow-thinking reasoning), and a model with the ‘Think’ mechanism. The metrics shown for each variant and task include accuracy, precision, recall, and F1-score. This allows for a detailed analysis of how different training methods and model features impact the overall performance in identifying fraudulent activity.
Table 5. Performance Metrics Comparison of Different Qwen2-Audio Model Variants
Model Type | Variant | Scenario Acc | Scenario Pre | Scenario Recall | Scenario F1 | Fraud Acc | Fraud Pre | Fraud Recall | Fraud F1 | Type Acc | Type Pre | Type Recall | Type F1 | Avg F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Qwen2-Audio | Base | 67.36 | 78.99 | 67.36 | 70.22 | 61.83 | 68.40 | 51.12 | 58.51 | 32.38 | 78.39 | 32.38 | 20.47 | 49.73 |
Qwen2-Audio | ASR-text | 71.01 | 79.76 | 71.01 | 71.55 | 71.27 | 76.39 | 66.80 | 71.27 | 76.79 | 80.39 | 76.79 | 77.93 | 73.58 |
Qwen2-Audio | No Think | 72.08 | 80.38 | 72.08 | 72.08 | 68.31 | 74.04 | 65.17 | 69.32 | 78.83 | 85.06 | 78.83 | 80.39 | 73.93 |
Qwen2-Audio | Think | 80.91 | 82.60 | 80.91 | 81.31 | 84.22 | 86.15 | 83.45 | 84.78 | 82.67 | 83.57 | 82.67 | 82.91 | 83.00 |
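The ablation in Table 5 is easiest to read as average-F1 deltas; with the numbers copied from the table, the ‘Think’ mechanism alone adds about 9 points over the no-think variant:

```python
# Average F1 per Qwen2-Audio variant (copied from Table 5), expressed
# as deltas over the base model.
avg_f1 = {"Base": 49.73, "ASR-text": 73.58, "No Think": 73.93, "Think": 83.00}

delta_vs_base = {k: round(v - avg_f1["Base"], 2) for k, v in avg_f1.items()}
think_gain = round(avg_f1["Think"] - avg_f1["No Think"], 2)  # 9.07
```

So fine-tuning on the dataset lifts average F1 by about 24 points even without reasoning, and the slow-thinking supervision adds another 9 on top.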
🔼 This table presents a comprehensive quality assessment of various models used in the paper, focusing on their performance in a telecommunications fraud detection task. It evaluates three key aspects of the models’ reasoning processes: logical rigor, practical value, and expressive quality. Each model receives a score for each aspect, and a final total score is calculated to provide an overall ranking of their performance in terms of their ability to mimic the thoughtful, detailed reasoning process of a human expert. This is done to measure not just the accuracy of model answers, but the quality of their reasoning.
Table 6. Quality Evaluation of Different Models